Add multi-tool orchestration for best-quality document conversion:
- Dual mode: Quick (fast) and Heavy (best quality, multi-tool merge)
- New convert.py - main orchestrator with tool selection matrix
- New merge_outputs.py - segment-level multi-tool output merger
- New validate_output.py - quality validation with HTML reports
- Enhanced extract_pdf_images.py - metadata (page, position, dimensions)
- PyMuPDF4LLM integration for LLM-optimized PDF conversion
- pandoc integration for DOCX/PPTX structure preservation
- Quality metrics: text/table/image retention with pass/warn/fail
- New references: heavy-mode-guide.md, tool-comparison.md
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Heavy Mode Guide
Detailed documentation for markdown-tools Heavy Mode conversion.
Overview
Heavy Mode runs multiple conversion tools in parallel, then merges their outputs segment by segment, keeping the highest-scoring version of each table, heading, image, list, code block, and paragraph.
When to Use Heavy Mode
Use Heavy Mode when:
- Document has complex tables that need precise formatting
- Images must be preserved with proper references
- Structure hierarchy (headings, lists) must be accurate
- Output quality is more important than conversion speed
- Document will be used for LLM processing
Use Quick Mode when:
- Speed is the priority
- Document is simple (mostly text)
- Output is for draft/review purposes
Tool Capabilities
PyMuPDF4LLM (Best for PDFs)
Strengths:
- Native table detection with multiple strategies
- Image extraction with position metadata
- LLM-optimized output format
- Preserves reading order
Usage:
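import pymupdf4llm

md_text = pymupdf4llm.to_markdown(
    "document.pdf",
    write_images=True,              # save extracted images as files
    table_strategy="lines_strict",  # table detection strategy
    image_path="./assets",          # folder that receives the image files
    dpi=150,                        # resolution of the saved images
)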
markitdown (Universal Converter)
Strengths:
- Supports many formats (PDF, DOCX, PPTX, XLSX)
- Good text extraction
- Simple API
Limitations:
- May miss complex tables
- No native image extraction
pandoc (Best for Office Docs)
Strengths:
- Excellent DOCX/PPTX structure preservation
- Proper heading hierarchy
- List formatting
Limitations:
- Requires system installation
- Limited PDF support (pandoc can write PDF but cannot read it as input)
Merge Strategy
Segment-Level Selection
Heavy Mode doesn't just pick one tool's output. It:
- Parses each output into segments
- Scores each segment independently
- Selects the best version of each segment
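The three steps above can be sketched roughly as follows. This is a minimal illustration, not the logic in merge_outputs.py: the blank-line segmentation, the word-count scorer, the name `merge_by_position`, and the assumption that segments from different tools line up by index are all hypothetical simplifications.

```python
from typing import Callable


def split_segments(markdown: str) -> list[str]:
    """Split markdown into segments on blank lines (a simplifying assumption)."""
    return [block.strip() for block in markdown.split("\n\n") if block.strip()]


def word_count(segment: str) -> float:
    """Placeholder scorer: more words scores higher."""
    return float(len(segment.split()))


def merge_by_position(
    outputs: dict[str, str],
    score: Callable[[str], float] = word_count,
) -> list[tuple[str, str]]:
    """For each segment position, keep the highest-scoring tool's version.

    Assumes every tool yields its segments in the same order; a real merger
    would first have to align segments across tools.
    """
    per_tool = {tool: split_segments(text) for tool, text in outputs.items()}
    length = min(len(segments) for segments in per_tool.values())
    merged = []
    for i in range(length):
        best_tool = max(per_tool, key=lambda tool: score(per_tool[tool][i]))
        merged.append((best_tool, per_tool[best_tool][i]))
    return merged
```

Each entry in the result records which tool supplied the segment, which makes it easy to see where the merged document came from.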
Segment Types
| Type | Detection Pattern | Scoring Criteria |
|---|---|---|
| Table | `\|.*\|` rows | Row count, column count, header separator |
| Heading | `^#{1,6}` | Proper level, reasonable length |
| Image | `!\[.*\]\(.*\)` | Alt text present, local path |
| List | `^[-*+\d.]` | Item count, nesting depth |
| Code | Triple backticks | Line count, language specified |
| Paragraph | Default | Word count, completeness |
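As an illustration of how these detection patterns might be applied, the sketch below classifies a segment by its first line. The patterns are slightly stricter variants of the ones in the table, and `classify_segment` and the match order are assumptions, not the code in merge_outputs.py.

```python
import re

FENCE = "`" * 3  # triple backticks, built this way to keep the example readable

# Checked in order; the first pattern that matches the first line wins.
SEGMENT_PATTERNS = [
    ("code", re.compile("^" + re.escape(FENCE))),
    ("table", re.compile(r"^\|.*\|\s*$")),
    ("heading", re.compile(r"^#{1,6}\s")),
    ("image", re.compile(r"!\[.*\]\(.*\)")),
    ("list", re.compile(r"^(?:[-*+]|\d+\.)\s")),
]


def classify_segment(segment: str) -> str:
    """Return the segment type, falling back to 'paragraph'."""
    stripped = segment.strip()
    first_line = stripped.splitlines()[0] if stripped else ""
    for name, pattern in SEGMENT_PATTERNS:
        if pattern.search(first_line):
            return name
    return "paragraph"
```

Code fences are tested first so that a fenced block whose content happens to start with `|` or `#` is not mistaken for a table or heading.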
Scoring Example
Table from pymupdf4llm:
- 10 rows × 5 columns = 5.0 points
- Header separator present = 1.0 points
- Total: 6.0 points
Table from markitdown:
- 8 rows × 5 columns = 4.0 points
- No header separator = 0.0 points
- Total: 4.0 points
→ Select pymupdf4llm version
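The example does not spell out the formula. One reading consistent with both totals is 0.1 points per cell plus 1.0 for a header separator (0.5 points per row would fit the numbers equally well). The sketch below encodes the per-cell reading; the weights and the name `score_table` are assumptions inferred from the example, not taken from merge_outputs.py.

```python
import re

# Matches a separator row such as |---|:---:|---| (at least three dashes per cell).
SEPARATOR_ROW = re.compile(r"^\|?\s*:?-{3,}:?\s*(\|\s*:?-{3,}:?\s*)*\|?$")


def score_table(table_md: str) -> float:
    """Score a markdown table: 0.1 per cell, plus 1.0 if a header separator exists."""
    lines = [line.strip() for line in table_md.strip().splitlines() if line.strip()]
    has_separator = any(SEPARATOR_ROW.match(line) for line in lines)
    rows = [line for line in lines if not SEPARATOR_ROW.match(line)]
    if not rows:
        return 0.0
    columns = max(len(row.strip("|").split("|")) for row in rows)
    bonus = 1.0 if has_separator else 0.0
    return round(len(rows) * columns * 0.1 + bonus, 2)
```

Under these assumed weights, a 10 × 5 table with a separator scores 6.0 and an 8 × 5 table without one scores 4.0, matching the totals above.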
Advanced Usage
Force Specific Tool
# Use only pandoc
uv run scripts/convert.py document.docx -o output.md --tool pandoc
Custom Assets Directory
# Heavy mode with custom image output
uv run scripts/convert.py document.pdf -o output.md --heavy --assets-dir ./images
Validate After Conversion
# Convert then validate
uv run scripts/convert.py document.pdf -o output.md --heavy
uv run scripts/validate_output.py document.pdf output.md --report quality.html
Troubleshooting
Low Text Retention Score
Causes:
- PDF consists of scanned page images with no searchable text layer
- Encoding issues in source document
- Complex layouts confusing the parser
Solutions:
- Use OCR preprocessing for scanned PDFs
- Try a different tool with the `--tool` flag
- Manual cleanup may be needed
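How validate_output.py computes the text retention score is not described in this guide. As a rough mental model, the sketch below measures the fraction of distinct source words that also appear in the converted markdown. The function name and the word-level definition are assumptions for illustration only.

```python
import re


def _words(text: str) -> set[str]:
    """Lowercased set of word tokens in the text."""
    return set(re.findall(r"\w+", text.lower()))


def text_retention(source_text: str, markdown: str) -> float:
    """Fraction of distinct source words that survive in the markdown output."""
    source_words = _words(source_text)
    if not source_words:
        return 1.0  # nothing to retain
    return len(source_words & _words(markdown)) / len(source_words)
```

Under this model, output from a scanned PDF converted without OCR contains almost none of the words a reader sees on the page, which is why OCR preprocessing is the first thing to try.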
Missing Tables
Causes:
- Tables without visible borders
- Tables spanning multiple pages
- Merged cells
Solutions:
- Use Heavy Mode for better detection
- Try pymupdf4llm with a different `table_strategy` value (for example `lines` or `text` instead of `lines_strict`)
- Manual table reconstruction
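One way to compare table strategies is to run the conversion once per strategy and count the table rows each run produces. The sketch below assumes the strategy names used by PyMuPDF's table finder (`lines_strict`, `lines`, `text`). `best_table_strategy` and the row-counting heuristic are hypothetical, and the converter is passed in as a callable so the comparison does not depend on any one library.

```python
from typing import Callable, Iterable


def count_table_rows(markdown: str) -> int:
    """Count lines that look like markdown table rows (start and end with '|')."""
    return sum(
        1
        for line in markdown.splitlines()
        if line.strip().startswith("|") and line.strip().endswith("|")
    )


def best_table_strategy(
    convert: Callable[[str], str],
    strategies: Iterable[str] = ("lines_strict", "lines", "text"),
) -> tuple[str, int]:
    """Run convert(strategy) for each strategy and return the one with most table rows."""
    results = {strategy: count_table_rows(convert(strategy)) for strategy in strategies}
    best = max(results, key=results.get)
    return best, results[best]
```

With pymupdf4llm installed, `convert` could be `lambda s: pymupdf4llm.to_markdown("document.pdf", table_strategy=s)`. More rows is only a heuristic, since a permissive strategy can also detect tables that are not really there, so inspect the winning output before trusting it.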
Image References Broken
Causes:
- Assets directory not created
- Relative path issues
- Image extraction failed
Solutions:
- Ensure `--assets-dir` points to the correct location
- Check `images_metadata.json` for extraction status
- Use `extract_pdf_images.py` separately
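To find which references are broken, a small check like the one below can list image targets in the output that do not resolve to a file. It is a sketch under stated assumptions: `find_broken_image_refs` is a hypothetical helper, not part of the scripts above, and it assumes image paths are relative to the markdown file.

```python
import re
from pathlib import Path

# Captures the target of ![alt](target), stopping at whitespace or ')'.
IMAGE_REF = re.compile(r"!\[[^\]]*\]\(([^)\s]+)")
URL_SCHEME = re.compile(r"^[a-z][a-z0-9+.-]*:", re.IGNORECASE)


def find_broken_image_refs(markdown_path: str) -> list[str]:
    """Return image targets in the markdown file that do not exist on disk."""
    md_file = Path(markdown_path)
    text = md_file.read_text(encoding="utf-8")
    broken = []
    for target in IMAGE_REF.findall(text):
        if URL_SCHEME.match(target):
            continue  # remote images and data URIs are not checked
        if not (md_file.parent / target).exists():
            broken.append(target)
    return broken
```

An empty list means every local reference resolves. Otherwise the returned paths show whether the problem is a missing assets directory or a wrong relative prefix.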