claude-code-skills-reference/markdown-tools/references/tool-comparison.md
daymade 3f15b8942c Release v1.27.0: Enhance markdown-tools with Heavy Mode
Add multi-tool orchestration for best-quality document conversion:
- Dual mode: Quick (fast) and Heavy (best quality, multi-tool merge)
- New convert.py - main orchestrator with tool selection matrix
- New merge_outputs.py - segment-level multi-tool output merger
- New validate_output.py - quality validation with HTML reports
- Enhanced extract_pdf_images.py - metadata (page, position, dimensions)
- PyMuPDF4LLM integration for LLM-optimized PDF conversion
- pandoc integration for DOCX/PPTX structure preservation
- Quality metrics: text/table/image retention with pass/warn/fail
- New references: heavy-mode-guide.md, tool-comparison.md

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-25 21:36:08 +08:00


Tool Comparison

Comparison of three document-to-Markdown conversion tools: pymupdf4llm, markitdown, and pandoc.

Feature Matrix

| Feature | pymupdf4llm | markitdown | pandoc |
|---|---|---|---|
| PDF Support | Excellent | Good | ⚠️ Limited |
| DOCX Support | No | Good | Excellent |
| PPTX Support | No | Good | Good |
| XLSX Support | No | Good | ⚠️ Limited |
| Table Detection | Multiple strategies | ⚠️ Basic | Good |
| Image Extraction | With metadata | No | Yes |
| Heading Hierarchy | Good | ⚠️ Variable | Excellent |
| List Formatting | Good | ⚠️ Basic | Excellent |
| LLM Optimization | Built-in | No | No |

Installation

pymupdf4llm

pip install pymupdf4llm

# Or with uv
uv pip install pymupdf4llm

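Dependencies: PyMuPDF, which pip installs automatically. PyMuPDF is a compiled binding to MuPDF rather than pure Python, but no separate system installation is needed.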

markitdown

# With PDF support
uv tool install "markitdown[pdf]"

# Or
pip install "markitdown[pdf]"

Dependencies: Vary by format and are pulled in through extras such as [pdf] (for example pdfminer.six for PDF, mammoth for DOCX, python-pptx for PPTX).

pandoc

# macOS
brew install pandoc

# Ubuntu/Debian
sudo apt-get install pandoc

# Windows
choco install pandoc

Dependencies: System installation required
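
Because the three tools install in different ways (two Python packages and one system binary), it can help to check what is actually present before choosing a converter. The following is a minimal sketch; `detect_tools` is a hypothetical helper name, not part of any of these tools.

```python
import importlib.util
import shutil


def detect_tools() -> dict[str, bool]:
    """Report which of the three converters are usable in this environment."""
    return {
        # Python packages: look for an importable module without importing it.
        "pymupdf4llm": importlib.util.find_spec("pymupdf4llm") is not None,
        "markitdown": importlib.util.find_spec("markitdown") is not None,
        # pandoc is a system binary, so search PATH instead.
        "pandoc": shutil.which("pandoc") is not None,
    }


if __name__ == "__main__":
    for name, available in detect_tools().items():
        print(f"{name}: {'available' if available else 'missing'}")
```

The check only confirms that a module or binary can be found. It does not verify optional extras such as markitdown's `[pdf]` support.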

Performance Benchmarks

PDF Conversion (100-page document)

| Tool | Time | Memory | Output Quality |
|---|---|---|---|
| pymupdf4llm | ~15 s | 150 MB | Excellent |
| markitdown | ~45 s | 200 MB | Good |
| pandoc | ~60 s | 100 MB | Variable |

DOCX Conversion (50-page document)

| Tool | Time | Memory | Output Quality |
|---|---|---|---|
| pandoc | ~5 s | 50 MB | Excellent |
| markitdown | ~10 s | 80 MB | Good |
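
The figures above are approximate and will differ by machine and document, so it is worth measuring on your own files. The sketch below times a single conversion and records peak memory. `measure` is a hypothetical helper, and `convert` stands for any callable that takes a file path. One caveat: `tracemalloc` only sees allocations made through Python's allocator, so memory used by native code (such as MuPDF inside PyMuPDF) or by a `pandoc` subprocess is not counted and the reported peak is a lower bound.

```python
import time
import tracemalloc
from typing import Any, Callable


def measure(convert: Callable[[str], Any], path: str) -> dict[str, float]:
    """Time one conversion and record peak Python-level memory."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        convert(path)
    finally:
        # Collect the numbers even if the conversion raised.
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return {"seconds": elapsed, "peak_mib": peak / (1024 * 1024)}
```

For example, `measure(pymupdf4llm.to_markdown, "doc.pdf")` would time the pymupdf4llm call used in the API Comparison section, assuming the package is installed and the file exists.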

Best Practices

For PDFs

  1. First choice: pymupdf4llm

    • Best table detection
    • Image extraction with metadata
    • LLM-optimized output
  2. Fallback: markitdown

    • When pymupdf4llm fails
    • Simpler documents

For DOCX/DOC

  1. First choice: pandoc

    • Best structure preservation
    • Proper heading hierarchy
    • List formatting
  2. Fallback: markitdown

    • When pandoc unavailable
    • Quick conversion needed

For PPTX

  1. First choice: markitdown

    • Good slide content extraction
    • Handles speaker notes
  2. Fallback: pandoc

    • Better structure preservation

For XLSX

  1. Recommended: markitdown (pymupdf4llm has no XLSX support and pandoc's is limited)
    • Table to markdown conversion
    • Sheet handling
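
The recommendations above amount to a lookup from file extension to an ordered list of tools. A minimal sketch of that lookup follows; `TOOL_PREFERENCE` and `tools_for` are hypothetical names used for illustration, and the sketch covers only `.pdf`, `.docx`, `.pptx`, and `.xlsx`.

```python
from pathlib import Path

# Preference order per extension, copied from the recommendations above.
TOOL_PREFERENCE: dict[str, list[str]] = {
    ".pdf": ["pymupdf4llm", "markitdown"],
    ".docx": ["pandoc", "markitdown"],
    ".pptx": ["markitdown", "pandoc"],
    ".xlsx": ["markitdown"],
}


def tools_for(path: str) -> list[str]:
    """Return tool names to try, first choice first, for the given file."""
    suffix = Path(path).suffix.lower()
    if suffix not in TOOL_PREFERENCE:
        raise ValueError(f"no recommended tool for {suffix or 'a file without an extension'}")
    # Return a copy so callers cannot mutate the shared table.
    return list(TOOL_PREFERENCE[suffix])
```

A caller would typically walk the returned list in order and stop at the first tool that is installed and succeeds.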

Common Issues by Tool

pymupdf4llm

| Issue | Solution |
|---|---|
| "Cannot import fitz" | `pip install pymupdf` |
| Tables not detected | Try a different `table_strategy` |
| Images not extracted | Enable `write_images=True` |
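
One way to act on the "try a different `table_strategy`" advice is to loop over strategies until the output contains table rows. The sketch below makes several assumptions: the strategy names `"lines"` and `"text"` are taken from PyMuPDF's table finder and should be checked against your installed version, "has a table" is approximated by counting lines that start with a pipe, and the helper names are hypothetical.

```python
from typing import Callable, Optional, Sequence

# Assumed strategy names; "lines_strict" is the one used in this document's
# API example, the other two should be verified against your PyMuPDF version.
STRATEGIES = ("lines_strict", "lines", "text")


def count_table_rows(markdown: str) -> int:
    """Count lines that look like Markdown table rows (start with a pipe)."""
    return sum(1 for line in markdown.splitlines() if line.lstrip().startswith("|"))


def _pymupdf4llm_convert(path: str, strategy: str) -> str:
    import pymupdf4llm  # imported lazily so the helpers work without it

    return pymupdf4llm.to_markdown(path, table_strategy=strategy)


def convert_with_table_fallback(
    path: str,
    convert: Callable[[str, str], str] = _pymupdf4llm_convert,
    strategies: Sequence[str] = STRATEGIES,
) -> Optional[tuple[str, str]]:
    """Return (strategy, markdown) for the first strategy that yields table rows.

    If no strategy produces a table, the result of the last attempt is returned.
    """
    last = None
    for strategy in strategies:
        markdown = convert(path, strategy)
        last = (strategy, markdown)
        if count_table_rows(markdown) > 0:
            break
    return last
```

The `convert` parameter is injectable so the loop can be exercised without a real PDF. Note that each attempt reconverts the whole document, which multiplies conversion time on large files.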

markitdown

| Issue | Solution |
|---|---|
| PDF support missing | Install with the `[pdf]` extra |
| Slow conversion | Expected for large files |
| Missing content | Try an alternative tool |

pandoc

| Issue | Solution |
|---|---|
| Command not found | Install via your package manager |
| PDF conversion fails | Use pymupdf4llm instead |
| Images not extracted | Add `--extract-media=<dir>` |
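
Two of the pandoc issues above (a missing binary and missing images) can be handled in a small wrapper. This is a sketch under stated assumptions: `build_pandoc_command` and `convert_with_pandoc` are hypothetical helper names, and the flags are the same ones used in this document's pandoc example.

```python
import shutil
import subprocess


def build_pandoc_command(source: str, media_dir: str = "./assets") -> list[str]:
    """Build a pandoc invocation that also extracts embedded images."""
    return [
        "pandoc", source,
        "-t", "markdown",
        "--wrap=none",                    # keep paragraphs on one line
        f"--extract-media={media_dir}",   # write images into media_dir
    ]


def convert_with_pandoc(source: str, media_dir: str = "./assets") -> str:
    """Run pandoc and return the Markdown it writes to stdout."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc not found on PATH; install it via your package manager")
    result = subprocess.run(
        build_pandoc_command(source, media_dir),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit
    )
    return result.stdout
```

Keeping the command builder separate from the runner makes the flags easy to inspect or log without invoking pandoc.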

API Comparison

pymupdf4llm

import pymupdf4llm

md = pymupdf4llm.to_markdown(
    "doc.pdf",
    write_images=True,
    table_strategy="lines_strict",
    image_path="./assets"
)

markitdown

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)

pandoc

# Command line
pandoc document.docx -t markdown --wrap=none --extract-media=./assets

# Python, via subprocess
import subprocess

result = subprocess.run(
    ["pandoc", "doc.docx", "-t", "markdown", "--wrap=none"],
    capture_output=True, text=True,
    check=True,  # raise CalledProcessError if pandoc exits non-zero
)
print(result.stdout)