- Add CJK bold spacing fix: insert spaces around **bold** spans containing CJK characters for correct rendering (handles emoji adjacency, already-spaced) - Add JSON pretty-print: auto-format JSON code blocks with 2-space indent - Add 31 unit tests covering all post-processing functions - Fix pandoc simple table detection (1-space column gaps) - Fix image path double-nesting when --assets-dir ends with 'media' - Rename all markdown-tools references across 15 files (README, QUICKSTART, marketplace.json, CLAUDE.md, meeting-minutes-taker, GitHub templates) - Add 5-tool benchmark report (Docling/MarkItDown/Pandoc/Mammoth/ours) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
166 lines
3.8 KiB
Markdown
166 lines
3.8 KiB
Markdown
# Heavy Mode Guide
|
||
|
||
Detailed documentation for doc-to-markdown Heavy Mode conversion.
|
||
|
||
## Overview
|
||
|
||
Heavy Mode runs multiple conversion tools in parallel and intelligently merges their outputs to produce the highest quality markdown possible.
|
||
|
||
## When to Use Heavy Mode
|
||
|
||
Use Heavy Mode when:
|
||
- Document has complex tables that need precise formatting
|
||
- Images must be preserved with proper references
|
||
- Structure hierarchy (headings, lists) must be accurate
|
||
- Output quality is more important than conversion speed
|
||
- Document will be used for LLM processing
|
||
|
||
Use Quick Mode when:
|
||
- Speed is priority
|
||
- Document is simple (mostly text)
|
||
- Output is for draft/review purposes
|
||
|
||
## Tool Capabilities
|
||
|
||
### PyMuPDF4LLM (Best for PDFs)
|
||
|
||
**Strengths:**
|
||
- Native table detection with multiple strategies
|
||
- Image extraction with position metadata
|
||
- LLM-optimized output format
|
||
- Preserves reading order
|
||
|
||
**Usage:**
|
||
```python
|
||
import pymupdf4llm
|
||
|
||
md_text = pymupdf4llm.to_markdown(
|
||
"document.pdf",
|
||
write_images=True,
|
||
table_strategy="lines_strict",
|
||
image_path="./assets",
|
||
dpi=150
|
||
)
|
||
```
|
||
|
||
### markitdown (Universal Converter)
|
||
|
||
**Strengths:**
|
||
- Supports many formats (PDF, DOCX, PPTX, XLSX)
|
||
- Good text extraction
|
||
- Simple API
|
||
|
||
**Limitations:**
|
||
- May miss complex tables
|
||
- No native image extraction
|
||
|
||
### pandoc (Best for Office Docs)
|
||
|
||
**Strengths:**
|
||
- Excellent DOCX/PPTX structure preservation
|
||
- Proper heading hierarchy
|
||
- List formatting
|
||
|
||
**Limitations:**
|
||
- Requires system installation
|
||
- PDF support limited
|
||
|
||
## Merge Strategy
|
||
|
||
### Segment-Level Selection
|
||
|
||
Heavy Mode doesn't just pick one tool's output. It:
|
||
|
||
1. Parses each output into segments
|
||
2. Scores each segment independently
|
||
3. Selects the best version of each segment
|
||
|
||
### Segment Types
|
||
|
||
| Type | Detection Pattern | Scoring Criteria |
|
||
|------|-------------------|------------------|
|
||
| Table | `\|.*\|` rows | Row count, column count, header separator |
|
||
| Heading | `^#{1-6} ` | Proper level, reasonable length |
|
||
| Image | `!\[.*\]\(.*\)` | Alt text present, local path |
|
||
| List | `^[-*+\d.] ` | Item count, nesting depth |
|
||
| Code | Triple backticks | Line count, language specified |
|
||
| Paragraph | Default | Word count, completeness |
|
||
|
||
### Scoring Example
|
||
|
||
```
|
||
Table from pymupdf4llm:
|
||
- 10 rows × 5 columns = 5.0 points
|
||
- Header separator present = 1.0 points
|
||
- Total: 6.0 points
|
||
|
||
Table from markitdown:
|
||
- 8 rows × 5 columns = 4.0 points
|
||
- No header separator = 0.0 points
|
||
- Total: 4.0 points
|
||
|
||
→ Select pymupdf4llm version
|
||
```
|
||
|
||
## Advanced Usage
|
||
|
||
### Force Specific Tool
|
||
|
||
```bash
|
||
# Use only pandoc
|
||
uv run scripts/convert.py document.docx -o output.md --tool pandoc
|
||
```
|
||
|
||
### Custom Assets Directory
|
||
|
||
```bash
|
||
# Heavy mode with custom image output
|
||
uv run scripts/convert.py document.pdf -o output.md --heavy --assets-dir ./images
|
||
```
|
||
|
||
### Validate After Conversion
|
||
|
||
```bash
|
||
# Convert then validate
|
||
uv run scripts/convert.py document.pdf -o output.md --heavy
|
||
uv run scripts/validate_output.py document.pdf output.md --report quality.html
|
||
```
|
||
|
||
## Troubleshooting
|
||
|
||
### Low Text Retention Score
|
||
|
||
**Causes:**
|
||
- PDF has scanned images (not searchable text)
|
||
- Encoding issues in source document
|
||
- Complex layouts confusing the parser
|
||
|
||
**Solutions:**
|
||
- Use OCR preprocessing for scanned PDFs
|
||
- Try different tool with `--tool` flag
|
||
- Manual cleanup may be needed
|
||
|
||
### Missing Tables
|
||
|
||
**Causes:**
|
||
- Tables without visible borders
|
||
- Tables spanning multiple pages
|
||
- Merged cells
|
||
|
||
**Solutions:**
|
||
- Use Heavy Mode for better detection
|
||
- Try pymupdf4llm with different table_strategy
|
||
- Manual table reconstruction
|
||
|
||
### Image References Broken
|
||
|
||
**Causes:**
|
||
- Assets directory not created
|
||
- Relative path issues
|
||
- Image extraction failed
|
||
|
||
**Solutions:**
|
||
- Ensure `--assets-dir` points to correct location
|
||
- Check `images_metadata.json` for extraction status
|
||
- Use `extract_pdf_images.py` separately
|