- Rename skill to better reflect its purpose (document-to-markdown conversion) - Update SKILL.md name, description, and trigger keywords - Add benchmark reference (2026-03-22) - Update marketplace.json entry (name, skills path, version 2.0.0) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tool Comparison
Comparison of three document-to-Markdown conversion tools: pymupdf4llm, markitdown, and pandoc.
Feature Matrix
| Feature | pymupdf4llm | markitdown | pandoc |
|---|---|---|---|
| PDF Support | ✅ Excellent | ✅ Good | ⚠️ Limited |
| DOCX Support | ❌ No | ✅ Good | ✅ Excellent |
| PPTX Support | ❌ No | ✅ Good | ✅ Good |
| XLSX Support | ❌ No | ✅ Good | ⚠️ Limited |
| Table Detection | ✅ Multiple strategies | ⚠️ Basic | ✅ Good |
| Image Extraction | ✅ With metadata | ❌ No | ✅ Yes |
| Heading Hierarchy | ✅ Good | ⚠️ Variable | ✅ Excellent |
| List Formatting | ✅ Good | ⚠️ Basic | ✅ Excellent |
| LLM Optimization | ✅ Built-in | ❌ No | ❌ No |
Installation
pymupdf4llm
pip install pymupdf4llm
# Or with uv
uv pip install pymupdf4llm
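Dependencies: PyMuPDF, installed automatically by pip. PyMuPDF is a compiled binding to MuPDF rather than pure Python, but it ships prebuilt wheels for common platforms, so no separate system install is normally needed.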
markitdown
# With PDF support
uv tool install "markitdown[pdf]"
# Or
pip install "markitdown[pdf]"
# For DOCX, PPTX, and XLSX as well, install every optional format
pip install "markitdown[all]"
Dependencies: Vary by format and are pulled in through extras (for example pdfminer.six for PDF, mammoth for DOCX, python-pptx for PPTX)
pandoc
# macOS
brew install pandoc
# Ubuntu/Debian
sudo apt-get install pandoc
# Windows
choco install pandoc
Dependencies: System installation required
Performance Benchmarks
PDF Conversion (100-page document)
| Tool | Time | Memory | Output Quality |
|---|---|---|---|
| pymupdf4llm | ~15s | 150MB | Excellent |
| markitdown | ~45s | 200MB | Good |
| pandoc | ~60s | 100MB | Variable |
DOCX Conversion (50-page document)
| Tool | Time | Memory | Output Quality |
|---|---|---|---|
| pandoc | ~5s | 50MB | Excellent |
| markitdown | ~10s | 80MB | Good |
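Figures like these depend on hardware and on the documents themselves, so it can be worth re-measuring on your own files. The following is a minimal sketch, assuming a hypothetical helper named `measure` that wraps any conversion callable; it is not how the numbers above were produced. Note that `tracemalloc` only counts Python-level allocations, so native memory used by MuPDF or by a pandoc subprocess will not appear and the result is not directly comparable to the Memory column.

```python
import time
import tracemalloc


def measure(convert, *args, **kwargs):
    """Run convert(*args, **kwargs) once; return (result, seconds, peak_bytes).

    Hypothetical helper, not part of any of the three tools.
    peak_bytes covers Python allocations only (a tracemalloc limitation).
    """
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = convert(*args, **kwargs)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
    finally:
        # Always stop tracing, even if the converter raised.
        tracemalloc.stop()
    return result, elapsed, peak


if __name__ == "__main__":
    # Stand-in workload so the sketch runs without any converter installed.
    text, seconds, peak = measure(lambda n: "x" * n, 1_000_000)
    print(f"{len(text)} chars in {seconds:.4f}s, peak {peak / 1e6:.1f} MB")
```

To time a real conversion, pass the converter and its arguments instead of the stand-in, for example `measure(pymupdf4llm.to_markdown, "doc.pdf")`.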
Best Practices
For PDFs
- First choice: pymupdf4llm
  - Best table detection
  - Image extraction with metadata
  - LLM-optimized output
- Fallback: markitdown
  - When pymupdf4llm fails
  - Simpler documents
For DOCX/DOC
- First choice: pandoc
  - Best structure preservation
  - Proper heading hierarchy
  - List formatting
  - Reads .docx only; convert legacy .doc files to .docx first
- Fallback: markitdown
  - When pandoc is unavailable
  - When a quick conversion is needed
For PPTX
- First choice: markitdown
  - Good slide content extraction
  - Handles speaker notes
- Fallback: pandoc
  - Better structure preservation
For XLSX
- First choice: markitdown (the feature matrix rates pandoc's XLSX support as limited)
  - Table-to-Markdown conversion
  - Sheet handling
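The per-format recommendations above can be captured as a small lookup table. This is a sketch only: `PREFERRED_TOOLS` and `tools_for` are hypothetical names introduced for illustration, and the mapping simply restates the preference order from this section.

```python
from pathlib import Path

# Preference order per extension, restating the recommendations above.
PREFERRED_TOOLS = {
    ".pdf": ["pymupdf4llm", "markitdown"],
    ".docx": ["pandoc", "markitdown"],
    ".pptx": ["markitdown", "pandoc"],
    ".xlsx": ["markitdown"],
}


def tools_for(path):
    """Return converter names to try, in order, for the given file path."""
    suffix = Path(path).suffix.lower()  # case-insensitive match on the extension
    if suffix not in PREFERRED_TOOLS:
        raise ValueError(f"no recommendation for {path!r}")
    # Return a copy so callers cannot mutate the shared table.
    return list(PREFERRED_TOOLS[suffix])


if __name__ == "__main__":
    print(tools_for("report.PDF"))  # ['pymupdf4llm', 'markitdown']
```

Returning the whole ordered list, rather than only the first entry, lets the caller fall back to the next tool when the preferred one is missing or fails.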
Common Issues by Tool
pymupdf4llm
| Issue | Solution |
|---|---|
| "Cannot import fitz" | pip install pymupdf |
| Tables not detected | Try a different table_strategy ("lines_strict", "lines", or "text") |
| Images not extracted | Set write_images=True |
markitdown
| Issue | Solution |
|---|---|
| PDF support missing | Install with [pdf] extra |
| Slow conversion | Expected for large files |
| Missing content | Try alternative tool |
pandoc
| Issue | Solution |
|---|---|
| Command not found | Install via package manager |
| PDF conversion fails | pandoc cannot read PDF as input; use pymupdf4llm instead |
| Images not extracted | Add --extract-media=DIR (for example --extract-media=./assets) |
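Several of the issues above come down to a tool not being installed. A quick way to check is sketched below, assuming a hypothetical helper named `available_tools`. It only confirms that the two Python packages can be located and that a `pandoc` executable is on `PATH`; it does not verify optional extras such as `markitdown[pdf]`.

```python
import importlib.util
import shutil


def available_tools():
    """Report which of the three converters can be found on this machine."""
    return {
        # Python packages: locate the module without importing it.
        "pymupdf4llm": importlib.util.find_spec("pymupdf4llm") is not None,
        "markitdown": importlib.util.find_spec("markitdown") is not None,
        # pandoc is an external executable, so look it up on PATH.
        "pandoc": shutil.which("pandoc") is not None,
    }


if __name__ == "__main__":
    for name, found in available_tools().items():
        print(f"{name}: {'found' if found else 'missing'}")
```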
API Comparison
pymupdf4llm
import pymupdf4llm
md = pymupdf4llm.to_markdown(
    "doc.pdf",
    write_images=True,              # save embedded images to disk
    table_strategy="lines_strict",  # table detection mode
    image_path="./assets"           # directory for the extracted images
)
markitdown
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)
pandoc
pandoc document.docx -t markdown --wrap=none --extract-media=./assets
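# Or from Python, via subprocess
import subprocess
result = subprocess.run(
    ["pandoc", "doc.docx", "-t", "markdown", "--wrap=none"],
    capture_output=True, text=True,
    check=True,  # raise CalledProcessError if pandoc exits with an error
)
print(result.stdout)  # the converted Markdown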
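Because the three APIs differ, a first-choice and fallback workflow is easier to manage behind one small wrapper. The sketch below assumes hypothetical names (`convert_with_fallback`, `via_pymupdf4llm`, `via_markitdown`) and uses only the library calls already shown in this section. Treating empty output as a failure is a design assumption, not documented behavior of either tool.

```python
import sys


def convert_with_fallback(path, converters):
    """Try each (name, function) pair in order; return (name, markdown).

    Each function takes a path and returns a Markdown string. A converter
    that raises or returns only whitespace counts as a failure.
    """
    errors = []
    for name, convert in converters:
        try:
            markdown = convert(path)
        except Exception as exc:  # broad on purpose: each tool fails differently
            errors.append(f"{name}: {exc!r}")
            continue
        if markdown and markdown.strip():
            return name, markdown
        errors.append(f"{name}: empty output")
    raise RuntimeError(f"all converters failed for {path}: " + "; ".join(errors))


def via_pymupdf4llm(path):
    import pymupdf4llm  # imported lazily so a missing package only fails this step
    return pymupdf4llm.to_markdown(path)


def via_markitdown(path):
    from markitdown import MarkItDown
    return MarkItDown().convert(path).text_content


if __name__ == "__main__" and len(sys.argv) > 1:
    used, text = convert_with_fallback(
        sys.argv[1],
        [("pymupdf4llm", via_pymupdf4llm), ("markitdown", via_markitdown)],
    )
    print(f"converted with {used}: {len(text)} characters")
```

The lazy imports mean a missing package surfaces as an ordinary failure of that one step, so the wrapper moves on to the next converter instead of crashing at import time.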