- Add CJK bold spacing fix: insert spaces around **bold** spans containing CJK characters for correct rendering (handles emoji adjacency, already-spaced) - Add JSON pretty-print: auto-format JSON code blocks with 2-space indent - Add 31 unit tests covering all post-processing functions - Fix pandoc simple table detection (1-space column gaps) - Fix image path double-nesting when --assets-dir ends with 'media' - Rename all markdown-tools references across 15 files (README, QUICKSTART, marketplace.json, CLAUDE.md, meeting-minutes-taker, GitHub templates) - Add 5-tool benchmark report (Docling/MarkItDown/Pandoc/Mammoth/ours) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3.8 KiB
3.8 KiB
Heavy Mode Guide
Detailed documentation for doc-to-markdown Heavy Mode conversion.
Overview
Heavy Mode runs multiple conversion tools in parallel and intelligently merges their outputs to produce the highest quality markdown possible.
When to Use Heavy Mode
Use Heavy Mode when:
- Document has complex tables that need precise formatting
- Images must be preserved with proper references
- Structure hierarchy (headings, lists) must be accurate
- Output quality is more important than conversion speed
- Document will be used for LLM processing
Use Quick Mode when:
- Speed is priority
- Document is simple (mostly text)
- Output is for draft/review purposes
Tool Capabilities
PyMuPDF4LLM (Best for PDFs)
Strengths:
- Native table detection with multiple strategies
- Image extraction with position metadata
- LLM-optimized output format
- Preserves reading order
Usage:
import pymupdf4llm
md_text = pymupdf4llm.to_markdown(
"document.pdf",
write_images=True,
table_strategy="lines_strict",
image_path="./assets",
dpi=150
)
markitdown (Universal Converter)
Strengths:
- Supports many formats (PDF, DOCX, PPTX, XLSX)
- Good text extraction
- Simple API
Limitations:
- May miss complex tables
- No native image extraction
pandoc (Best for Office Docs)
Strengths:
- Excellent DOCX/PPTX structure preservation
- Proper heading hierarchy
- List formatting
Limitations:
- Requires system installation
- PDF support limited
Merge Strategy
Segment-Level Selection
Heavy Mode doesn't just pick one tool's output. It:
- Parses each output into segments
- Scores each segment independently
- Selects the best version of each segment
Segment Types
| Type | Detection Pattern | Scoring Criteria |
|---|---|---|
| Table | |.*| rows |
Row count, column count, header separator |
| Heading | ^#{1-6} |
Proper level, reasonable length |
| Image | !\[.*\]\(.*\) |
Alt text present, local path |
| List | ^[-*+\d.] |
Item count, nesting depth |
| Code | Triple backticks | Line count, language specified |
| Paragraph | Default | Word count, completeness |
Scoring Example
Table from pymupdf4llm:
- 10 rows × 5 columns = 5.0 points
- Header separator present = 1.0 points
- Total: 6.0 points
Table from markitdown:
- 8 rows × 5 columns = 4.0 points
- No header separator = 0.0 points
- Total: 4.0 points
→ Select pymupdf4llm version
Advanced Usage
Force Specific Tool
# Use only pandoc
uv run scripts/convert.py document.docx -o output.md --tool pandoc
Custom Assets Directory
# Heavy mode with custom image output
uv run scripts/convert.py document.pdf -o output.md --heavy --assets-dir ./images
Validate After Conversion
# Convert then validate
uv run scripts/convert.py document.pdf -o output.md --heavy
uv run scripts/validate_output.py document.pdf output.md --report quality.html
Troubleshooting
Low Text Retention Score
Causes:
- PDF has scanned images (not searchable text)
- Encoding issues in source document
- Complex layouts confusing the parser
Solutions:
- Use OCR preprocessing for scanned PDFs
- Try different tool with
--toolflag - Manual cleanup may be needed
Missing Tables
Causes:
- Tables without visible borders
- Tables spanning multiple pages
- Merged cells
Solutions:
- Use Heavy Mode for better detection
- Try pymupdf4llm with different table_strategy
- Manual table reconstruction
Image References Broken
Causes:
- Assets directory not created
- Relative path issues
- Image extraction failed
Solutions:
- Ensure
--assets-dirpoints to correct location - Check
images_metadata.jsonfor extraction status - Use
extract_pdf_images.pyseparately