Files
claude-code-skills-reference/markdown-tools/references/heavy-mode-guide.md
daymade 3f15b8942c Release v1.27.0: Enhance markdown-tools with Heavy Mode
Add multi-tool orchestration for best-quality document conversion:
- Dual mode: Quick (fast) and Heavy (best quality, multi-tool merge)
- New convert.py - main orchestrator with tool selection matrix
- New merge_outputs.py - segment-level multi-tool output merger
- New validate_output.py - quality validation with HTML reports
- Enhanced extract_pdf_images.py - metadata (page, position, dimensions)
- PyMuPDF4LLM integration for LLM-optimized PDF conversion
- pandoc integration for DOCX/PPTX structure preservation
- Quality metrics: text/table/image retention with pass/warn/fail
- New references: heavy-mode-guide.md, tool-comparison.md

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-25 21:36:08 +08:00

166 lines
3.8 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Heavy Mode Guide
Detailed documentation for markdown-tools Heavy Mode conversion.
## Overview
Heavy Mode runs multiple conversion tools in parallel and intelligently merges their outputs to produce the highest quality markdown possible.
## When to Use Heavy Mode
Use Heavy Mode when:
- Document has complex tables that need precise formatting
- Images must be preserved with proper references
- Structure hierarchy (headings, lists) must be accurate
- Output quality is more important than conversion speed
- Document will be used for LLM processing
Use Quick Mode when:
- Speed is priority
- Document is simple (mostly text)
- Output is for draft/review purposes
## Tool Capabilities
### PyMuPDF4LLM (Best for PDFs)
**Strengths:**
- Native table detection with multiple strategies
- Image extraction with position metadata
- LLM-optimized output format
- Preserves reading order
**Usage:**
```python
import pymupdf4llm
md_text = pymupdf4llm.to_markdown(
"document.pdf",
write_images=True,
table_strategy="lines_strict",
image_path="./assets",
dpi=150
)
```
### markitdown (Universal Converter)
**Strengths:**
- Supports many formats (PDF, DOCX, PPTX, XLSX)
- Good text extraction
- Simple API
**Limitations:**
- May miss complex tables
- No native image extraction
### pandoc (Best for Office Docs)
**Strengths:**
- Excellent DOCX/PPTX structure preservation
- Proper heading hierarchy
- List formatting
**Limitations:**
- Requires system installation
- PDF support limited
## Merge Strategy
### Segment-Level Selection
Heavy Mode doesn't just pick one tool's output. It:
1. Parses each output into segments
2. Scores each segment independently
3. Selects the best version of each segment
### Segment Types
| Type | Detection Pattern | Scoring Criteria |
|------|-------------------|------------------|
| Table | `\|.*\|` rows | Row count, column count, header separator |
| Heading | `^#{1-6} ` | Proper level, reasonable length |
| Image | `!\[.*\]\(.*\)` | Alt text present, local path |
| List | `^[-*+\d.] ` | Item count, nesting depth |
| Code | Triple backticks | Line count, language specified |
| Paragraph | Default | Word count, completeness |
### Scoring Example
```
Table from pymupdf4llm:
- 10 rows × 5 columns = 5.0 points
- Header separator present = 1.0 points
- Total: 6.0 points
Table from markitdown:
- 8 rows × 5 columns = 4.0 points
- No header separator = 0.0 points
- Total: 4.0 points
→ Select pymupdf4llm version
```
## Advanced Usage
### Force Specific Tool
```bash
# Use only pandoc
uv run scripts/convert.py document.docx -o output.md --tool pandoc
```
### Custom Assets Directory
```bash
# Heavy mode with custom image output
uv run scripts/convert.py document.pdf -o output.md --heavy --assets-dir ./images
```
### Validate After Conversion
```bash
# Convert then validate
uv run scripts/convert.py document.pdf -o output.md --heavy
uv run scripts/validate_output.py document.pdf output.md --report quality.html
```
## Troubleshooting
### Low Text Retention Score
**Causes:**
- PDF has scanned images (not searchable text)
- Encoding issues in source document
- Complex layouts confusing the parser
**Solutions:**
- Use OCR preprocessing for scanned PDFs
- Try different tool with `--tool` flag
- Manual cleanup may be needed
### Missing Tables
**Causes:**
- Tables without visible borders
- Tables spanning multiple pages
- Merged cells
**Solutions:**
- Use Heavy Mode for better detection
- Try pymupdf4llm with different table_strategy
- Manual table reconstruction
### Image References Broken
**Causes:**
- Assets directory not created
- Relative path issues
- Image extraction failed
**Solutions:**
- Ensure `--assets-dir` points to correct location
- Check `images_metadata.json` for extraction status
- Use `extract_pdf_images.py` separately