refactor: rename markdown-tools → doc-to-markdown (v2.0.0)

- Rename skill to better reflect its purpose (document-to-markdown conversion)
- Update SKILL.md name, description, and trigger keywords
- Add benchmark reference (2026-03-22)
- Update marketplace.json entry (name, skills path, version 2.0.0)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
daymade
2026-03-23 00:06:30 +08:00
parent ee38ae41b8
commit 143995b213
12 changed files with 1346 additions and 451 deletions

doc-to-markdown/SKILL.md Normal file

@@ -0,0 +1,178 @@
---
name: doc-to-markdown
description: Converts DOCX/PDF/PPTX to high-quality Markdown with automatic post-processing. Fixes pandoc grid tables, image paths, attribute noise, and code blocks. Supports Quick Mode (fast, single tool) and Heavy Mode (best quality, multi-tool merge). Trigger on "convert document", "docx to markdown", "parse word", "doc to markdown", "extract images from document".
---
# Doc to Markdown
Convert documents to high-quality markdown with intelligent multi-tool orchestration and automatic DOCX post-processing.
## Dual Mode Architecture
| Mode | Speed | Quality | Use Case |
|------|-------|---------|----------|
| **Quick** (default) | Fast | Good | Drafts, simple documents |
| **Heavy** | Slower | Best | Final documents, complex layouts |
## Quick Start
### Installation
```bash
# Required: PDF/DOCX/PPTX support
uv tool install "markitdown[pdf]"
pip install pymupdf4llm
brew install pandoc
```
### Basic Conversion
```bash
# Quick Mode (default) - fast, single best tool
uv run --with pymupdf4llm --with markitdown scripts/convert.py document.pdf -o output.md
# Heavy Mode - multi-tool parallel execution with merge
uv run --with pymupdf4llm --with markitdown scripts/convert.py document.pdf -o output.md --heavy
# DOCX with deep python-docx parsing (experimental)
uv run --with pymupdf4llm --with markitdown --with python-docx scripts/convert.py document.docx -o output.md --docx-deep
# Check available tools
uv run scripts/convert.py --list-tools
```
## Tool Selection Matrix
| Format | Quick Mode Tool | Heavy Mode Tools |
|--------|----------------|------------------|
| PDF | pymupdf4llm | pymupdf4llm + markitdown |
| DOCX | pandoc + post-processing | pandoc + markitdown |
| PPTX | markitdown | markitdown + pandoc |
| XLSX | markitdown | markitdown |
### Tool Characteristics
- **pymupdf4llm**: LLM-optimized PDF conversion with native table detection and image extraction
- **markitdown**: Microsoft's universal converter, good for Office formats
- **pandoc**: Excellent structure preservation for DOCX/PPTX
## DOCX Post-Processing (automatic)
When converting DOCX files via pandoc, the following cleanups are applied automatically:
| Problem | Fix |
|---------|-----|
| Grid tables (`+:---+` syntax) | Single-column -> blockquote, multi-column -> split images |
| Image double path (`media/media/`) | Flatten to `media/` |
| Pandoc attributes (`{width="..." height="..."}`) | Removed |
| Inline classes (`{.underline}`, `{.mark}`) | Removed |
| Indented dashed code blocks | Converted to fenced code blocks (```) |
| Escaped brackets (`\[...\]`) | Unescaped to `[...]` |
| Double-bracket links (`[[text]{...}](url)`) | Simplified to `[text](url)` |
| Escaped quotes in code (`\"`) | Fixed to `"` |
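For instance, the attribute and escape cleanups are essentially regex passes over the pandoc output. A minimal sketch (the function name and patterns are illustrative, not the skill's actual internals):

```python
import re

def clean_pandoc_artifacts(md: str) -> str:
    """Strip pandoc-specific syntax that plain Markdown renderers choke on."""
    # Simplify double-bracket links first, while the {...} span is still present:
    # [[text]{.class}](url) -> [text](url)
    md = re.sub(r'\[\[([^\]]+)\]\{[^}]*\}\]\(([^)]+)\)', r'[\1](\2)', md)
    # Drop attribute blocks such as {width="5.2in" height="3.1in"}
    md = re.sub(r'\{width="[^"]*"(?:\s+height="[^"]*")?\}', '', md)
    # Drop inline classes such as {.underline} and {.mark}
    md = re.sub(r'\{\.[\w-]+\}', '', md)
    # Unescape bracket pairs: \[text\] -> [text]
    md = md.replace('\\[', '[').replace('\\]', ']')
    return md
```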
## Heavy Mode Workflow
Heavy Mode runs multiple tools in parallel and selects the best segments:
1. **Parallel Execution**: Run all applicable tools simultaneously
2. **Segment Analysis**: Parse each output into segments (tables, headings, images, paragraphs)
3. **Quality Scoring**: Score each segment based on completeness and structure
4. **Intelligent Merge**: Select best version of each segment across tools
### Merge Criteria
| Segment Type | Selection Criteria |
|--------------|-------------------|
| Tables | More rows/columns, proper header separator |
| Images | Alt text present, local paths preferred |
| Headings | Proper hierarchy, appropriate length |
| Lists | More items, nested structure preserved |
| Paragraphs | Content completeness |
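As an illustration, the table criterion can be expressed as a small scoring function (the weights here are illustrative; the bundled `merge_outputs.py` implements the real scorer):

```python
def score_table(segment: str) -> float:
    """Score a pipe-table segment: more rows/columns and a proper
    header separator win, mirroring the merge criteria above."""
    lines = [l for l in segment.splitlines() if l.strip().startswith("|")]
    if not lines:
        return 0.0
    cols = max(l.count("|") - 1 for l in lines)
    score = len(lines) * cols / 10.0  # size term: rows x columns
    # Reward a header separator row such as |---|---|
    if len(lines) > 1 and set(lines[1].replace("|", "").strip()) <= set("-: "):
        score += 1.0
    return score
```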
## Image Extraction
```bash
# Extract images with metadata
uv run --with pymupdf scripts/extract_pdf_images.py document.pdf -o ./assets
# Generate markdown references file
uv run --with pymupdf scripts/extract_pdf_images.py document.pdf --markdown refs.md
```
Output:
- Images: `assets/img_page1_1.png`, `assets/img_page2_1.jpg`
- Metadata: `assets/images_metadata.json` (page, position, dimensions)
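The metadata file can be consumed directly, for example to rebuild image references in reading order (a sketch assuming the JSON layout described above):

```python
import json
from pathlib import Path

def image_refs(metadata_path: str) -> list[str]:
    """Build markdown image references sorted by page, then vertical position."""
    meta = json.loads(Path(metadata_path).read_text())
    images = sorted(meta["images"], key=lambda m: (m["page"], m["y"]))
    return [f'![page {m["page"]}]({m["filename"]})' for m in images]
```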
## Quality Validation
```bash
# Validate conversion quality
uv run --with pymupdf scripts/validate_output.py document.pdf output.md
# Generate HTML report
uv run --with pymupdf scripts/validate_output.py document.pdf output.md --report report.html
```
### Quality Metrics
| Metric | Pass | Warn | Fail |
|--------|------|------|------|
| Text Retention | >95% | 85-95% | <85% |
| Table Retention | 100% | 90-99% | <90% |
| Image Retention | 100% | 80-99% | <80% |
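The text-retention check reduces to a character-count ratio classified against these thresholds (a simplified sketch; the bundled validator compares extracted source text against the markdown):

```python
def text_retention(source_chars: int, output_chars: int) -> str:
    """Classify text retention per the thresholds in the table above."""
    if source_chars == 0:
        return "pass"  # nothing to lose
    ratio = output_chars / source_chars
    if ratio > 0.95:
        return "pass"
    if ratio >= 0.85:
        return "warn"
    return "fail"
```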
## Merge Outputs Manually
```bash
# Merge multiple markdown files
python scripts/merge_outputs.py output1.md output2.md -o merged.md
# Show segment attribution
python scripts/merge_outputs.py output1.md output2.md -o merged.md --verbose
```
## Path Conversion (Windows/WSL)
```bash
# Windows to WSL conversion
python scripts/convert_path.py "C:\Users\name\Documents\file.pdf"
# Output: /mnt/c/Users/name/Documents/file.pdf
```
## Common Issues
**"No conversion tools available"**
```bash
# Install all tools
pip install pymupdf4llm
uv tool install "markitdown[pdf]"
brew install pandoc
```
**FontBBox warnings during PDF conversion**
- Harmless font parsing warnings; the output is still correct
**Images missing from output**
- Use Heavy Mode for better image preservation
- Or extract separately with `scripts/extract_pdf_images.py`
**Tables broken in output**
- Use Heavy Mode - it selects the most complete table version
- Or validate with `scripts/validate_output.py`
## Bundled Scripts
| Script | Purpose |
|--------|---------|
| `convert.py` | Main orchestrator with Quick/Heavy mode + DOCX post-processing |
| `merge_outputs.py` | Merge multiple markdown outputs |
| `validate_output.py` | Quality validation with HTML report |
| `extract_pdf_images.py` | PDF image extraction with metadata |
| `convert_path.py` | Windows to WSL path converter |
## References
- `references/heavy-mode-guide.md` - Detailed Heavy Mode documentation
- `references/tool-comparison.md` - Tool capabilities comparison
- `references/conversion-examples.md` - Batch operation examples


@@ -0,0 +1,163 @@
# DOCX→Markdown Conversion Benchmark
> **Test date:** 2026-03-22
>
> **Test file:** `助教-【腾讯云🦞】小白实践 OpenClaw 保姆级教程.docx` (19MB, 77 images; contains grid-table layouts, JSON code blocks, side-by-side image columns, and info boxes)
>
> **Method:** five solutions convert the same file, scored out of 10 on each of five dimensions
---
## Overall Scores
| Dimension | Docling (IBM) | MarkItDown (MS) | Pandoc | Mammoth | **doc-to-markdown (ours)** |
|------|:---:|:---:|:---:|:---:|:---:|
| Table quality | 5 | 3 | 5 | 1~3 | **6** |
| Image extraction | 4 | 2 | **10** | 5 | 7 |
| Text completeness | 8 | 7 | **9** | 7 | **9** |
| Format cleanliness | 5 | 5 | 5 | 3 | **7** |
| Code blocks | 2 | 1 | N/A | 1 | **9** |
| **Overall** | **4.8** | **3.6** | **7.3** | **3.4~3.6** | **7.6** |
---
## Detailed Analysis
### 1. IBM Docling (overall 4.8)
- **Version:** docling 2.x + Granite-Docling-258M
- **Architecture:** AI-driven (VLM, vision-language model), DocTags intermediate format → Markdown

**Fatal problems:**
- All image references are `<!-- image -->` placeholders (0 of 77 images displayable); the `ImageRefMode` API does not work for DOCX
- All heading levels lost (zero `#` characters); every heading degrades to bold text
- Zero code blocks: JSON and shell commands all come out as plain paragraphs
- `api_key` is wrongly escaped to `api\_key`

**Strengths:**
- Text content is complete; Chinese text, emoji, and links are preserved well
- No grid-table or HTML residue
- Table syntax is correct (pipe tables), but the cell contents are placeholders

**Verdict:** Docling's strength is PDFs (the scenario of its AAAI 2025 paper); its DOCX support is far from production-ready.
### 2. Microsoft MarkItDown (overall 3.6)
- **Version:** markitdown 0.1.5
- **Architecture:** wraps mammoth → HTML → markdownify → Markdown

**Fatal problems:**
- All 77 images are truncated base64 placeholders (`data:image/png;base64,...`); the default `keep_data_uris=False` deliberately discards image data
- All headings become bold text (mammoth cannot recognize WPS custom styles)
- Zero code blocks; JSON is stuffed into table cells
- Ordered-list numbering is entirely wrong (output is `1. 1. 1.`)

**Strengths:**
- No leftover HTML tags
- Text content is mostly complete

**Verdict:** MarkItDown's markdownify post-processing actually introduces destructive truncation. Usable for lightweight cases; unreliable for complex DOCX.
### 3. Pandoc (overall 7.3)
- **Version:** pandoc 3.9
- **Architecture:** native Haskell AST parsing; supports 60+ formats

**Three flag combinations tested:**
| Flags | Result |
|------|------|
| `-t gfm` | Worst: 24 nested HTML `<table>` elements, 74 HTML `<img>` tags |
| `-t markdown` | Best: grid tables (fixable in post-processing), no HTML |
| `-t markdown-raw_html-...` | Identical to `markdown`; the extra extensions had no effect |

**Problems:**
- Grid tables are unavoidable (the source docx has multi-line cells and nested tables that pipe tables cannot express)
- 68 occurrences of `{width="..." height="..."}`
- 6 occurrences of `{.underline}`
- 37 over-escaped backslashes

**Strengths:**
- Image extraction 10/10 (all 77 images correct, with a consistent path structure)
- Text completeness 9/10 (content, links, and emoji all preserved)
- The most mature and stable underlying engine

**Verdict:** Pandoc is the most reliable base engine and produces the highest-quality output, but its private syntax needs a post-processing pass to clean up.
### 4. Mammoth (overall 3.4~3.6)
- **Version:** mammoth 1.11.0
- **Architecture:** python-docx parsing → HTML/Markdown (Markdown support is deprecated)

**Two approaches tested:**
| Approach | Overall |
|------|------|
| A: direct Markdown | 3.4 (tables lost entirely) |
| B: HTML → markdownify | 3.6 (tables present, but nesting flattened) |

**Fatal problems:**
- All headings lost (the style definitions in the WPS `styles.xml` are empty, so mammoth cannot map them to Heading)
- Zero code blocks
- All images inlined as base64, producing a single 28MB file
- In approach B, markdownify drops 14 images (63/77 survive)

**Verdict:** Mammoth's Markdown support is deprecated and its compatibility with WPS-exported docx is poor. Not recommended.
### 5. doc-to-markdown / our approach (overall 7.6)
- **Version:** doc-to-markdown 1.0 (pandoc + 6 post-processing functions)
- **Architecture:** pandoc conversion → automatic post-processing (grid-table cleanup, image-path fixes, attribute cleanup, code-block repair, escape fixes)

**Measured post-processing effect:**
| Post-processing function | Fixes |
|-----------|---------|
| `_convert_grid_tables` | 11 grid tables → pipe tables / blockquotes |
| `_clean_pandoc_attributes` | 3437 characters of attributes removed |
| `_fix_code_blocks` | 22 indented dashed blocks → fenced code blocks |
| `_fix_escaped_brackets` | 10 fixes |
| `_fix_double_bracket_links` | 1 fix |
| `_fix_image_paths` | 77 image paths fixed |

**Known issues (to fix):**
- Doubly nested image paths: pandoc creates an extra `media/` level inside the directory passed to `--assets-dir`
- 2 residual grid tables (a side-by-side image group at the end of the document is not fully converted)

**Strengths:**
- Code-block recognition 9/10 (JSON gets language tags; shell commands are fenced correctly)
- Format cleanliness 7/10 (attributes, escapes, and grid tables mostly cleaned up)
- Text completeness 9/10 (all key content preserved)

**Verdict:** best overall; the core value is the post-processing layer on top of pandoc. The two remaining bugs are fixable.
---
## Architecture Decision
```
Final approach: Pandoc (base engine) + doc-to-markdown post-processing (value-add layer)
Rationale:
1. Pandoc has the most reliable image extraction (10/10) and the most complete text (9/10)
2. Pandoc's problems (grid tables, attributes, escaping) can all be solved in post-processing
3. The fatal problems of Docling/MarkItDown/Mammoth (lost images, lost headings) cannot be repaired after the fact
4. The post-processing layer is our core differentiator: cheap to build and easy to iterate on
```
---
## Test File Characteristics
What makes this test file hard:
| Characteristic | Description | Impact |
|------|------|------|
| WPS export | Non-standard Word styles (Style IDs 2/3 instead of Heading 1/2) | mammoth/markitdown/docling lose all headings |
| Multi-column image layout | 2x2 and 1x4 image grids laid out with tables | pandoc emits grid tables |
| Info/callout boxes | Single-column tables wrapping text | pandoc emits grid tables |
| Nested tables | Tables inside tables | Not expressible as pipe tables |
| JSON code blocks | Not styled as code; represented with text boxes/indentation | Most tools fail to recognize them as code |
| 19MB file | 77 embedded screenshots | base64 approaches yield 28MB output |

These traits are typical of real-world docx files exported from WPS or Feishu, which makes this a meaningful benchmark scenario.


@@ -0,0 +1,346 @@
# Document Conversion Examples
Comprehensive examples for converting various document formats to markdown.
## Basic Document Conversions
### PDF to Markdown
```bash
# Simple PDF conversion
markitdown "document.pdf" > output.md
# WSL path example
markitdown "/mnt/c/Users/username/Documents/report.pdf" > report.md
# With explicit output
markitdown "slides.pdf" > "slides.md"
```
### Word Documents to Markdown
```bash
# Modern Word document (.docx)
markitdown "document.docx" > output.md
# Legacy Word document (.doc)
markitdown "legacy-doc.doc" > output.md
# Preserve directory structure
markitdown "/path/to/docs/file.docx" > "/path/to/output/file.md"
```
### PowerPoint to Markdown
```bash
# Convert presentation
markitdown "presentation.pptx" > slides.md
# WSL path
markitdown "/mnt/c/Users/username/Desktop/slides.pptx" > slides.md
```
---
## Windows/WSL Path Conversion
### Basic Path Conversion Rules
```bash
# Windows path
C:\Users\username\Documents\file.doc
# WSL equivalent
/mnt/c/Users/username/Documents/file.doc
```
### Conversion Examples
```bash
# Single backslash to forward slash
C:\folder\file.txt
→ /mnt/c/folder/file.txt
# Path with spaces (must use quotes)
C:\Users\John Doe\Documents\report.pdf
→ "/mnt/c/Users/John Doe/Documents/report.pdf"
# OneDrive path
C:\Users\username\OneDrive\Documents\file.doc
→ "/mnt/c/Users/username/OneDrive/Documents/file.doc"
# Different drive letters
D:\Projects\document.docx
→ /mnt/d/Projects/document.docx
```
### Using convert_path.py Helper
```bash
# Automatic conversion
python scripts/convert_path.py "C:\Users\username\Downloads\document.doc"
# Output: /mnt/c/Users/username/Downloads/document.doc
# Use in conversion command
wsl_path=$(python scripts/convert_path.py "C:\Users\username\file.docx")
markitdown "$wsl_path" > output.md
```
---
## Batch Conversions
### Convert Multiple Files
```bash
# Convert all PDFs in a directory
for pdf in /path/to/pdfs/*.pdf; do
filename=$(basename "$pdf" .pdf)
markitdown "$pdf" > "/path/to/output/${filename}.md"
done
# Convert all Word documents
for doc in /path/to/docs/*.docx; do
filename=$(basename "$doc" .docx)
markitdown "$doc" > "/path/to/output/${filename}.md"
done
```
### Batch Conversion with Path Conversion
```bash
# Windows batch (PowerShell)
Get-ChildItem "C:\Documents\*.pdf" | ForEach-Object {
$wslPath = "/mnt/c/Documents/$($_.Name)"
$outFile = "/mnt/c/Output/$($_.BaseName).md"
wsl markitdown $wslPath > $outFile
}
```
---
## Confluence Export Handling
### Simple Confluence Export
```bash
# Direct conversion for exports without special characters
markitdown "confluence-export.doc" > output.md
```
### Export with Special Characters
For Confluence exports containing special characters:
1. Save the .doc file to an accessible location
2. Try direct conversion first:
```bash
markitdown "confluence-export.doc" > output.md
```
3. If special characters cause issues:
- Open in Word and save as .docx
- Or use LibreOffice to convert: `libreoffice --headless --convert-to docx export.doc`
- Then convert the .docx file
### Handling Encoding Issues
```bash
# Check file encoding
file -i "document.doc"
# Convert if needed (using iconv)
iconv -f ISO-8859-1 -t UTF-8 input.md > output.md
```
---
## Advanced Conversion Scenarios
### Preserving Directory Structure
```bash
# Mirror directory structure
src_dir="/mnt/c/Users/username/Documents"
out_dir="/path/to/output"
find "$src_dir" -name "*.docx" | while read file; do
# Get relative path
rel_path="${file#$src_dir/}"
out_file="$out_dir/${rel_path%.docx}.md"
# Create output directory
mkdir -p "$(dirname "$out_file")"
# Convert
markitdown "$file" > "$out_file"
done
```
### Conversion with Metadata
```bash
# Add frontmatter to converted file
{
echo "---"
echo "title: $(basename "$file" .pdf)"
echo "converted: $(date -I)"
echo "source: $file"
echo "---"
echo ""
markitdown "$file"
} > output.md
```
---
## Error Recovery
### Handling Failed Conversions
```bash
# Check if markitdown succeeded
if markitdown "document.pdf" > output.md 2> error.log; then
echo "Conversion successful"
else
echo "Conversion failed, check error.log"
fi
```
### Retry Logic
```bash
# Retry failed conversions
for file in *.pdf; do
output="${file%.pdf}.md"
if ! [ -f "$output" ]; then
echo "Converting $file..."
markitdown "$file" > "$output" || echo "Failed: $file" >> failed.txt
fi
done
```
---
## Quality Verification
### Check Conversion Quality
```bash
# Check output length against expectations for the source
wc -l output.md
# Check for common issues
grep "TODO\|ERROR\|MISSING" output.md
# Preview first/last lines
head -n 20 output.md
tail -n 20 output.md
```
### Validate Output
```bash
# Check for empty files
if [ ! -s output.md ]; then
echo "Warning: Output file is empty"
fi
# Verify markdown syntax
# Use a markdown linter if available
markdownlint output.md
```
---
## Best Practices
### 1. Path Handling
- Always quote paths with spaces
- Verify paths exist before conversion
- Use absolute paths for scripts
### 2. Batch Processing
- Log conversions for audit trail
- Handle errors gracefully
- Preserve original files
### 3. Output Organization
- Mirror source directory structure
- Use consistent naming conventions
- Separate by document type or date
### 4. Quality Assurance
- Spot-check random conversions
- Validate critical documents manually
- Keep conversion logs
### 5. Performance
- Use parallel processing for large batches
- Skip already converted files
- Clean up temporary files
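The parallel-processing and skip-existing advice combines into a few lines (a sketch; assumes `markitdown` is on PATH):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def convert_one(pdf: Path) -> str:
    """Convert one PDF with markitdown, skipping files already converted."""
    out = pdf.with_suffix(".md")
    if out.exists():  # skip already converted files
        return f"skip {pdf.name}"
    with out.open("w") as f:
        subprocess.run(["markitdown", str(pdf)], stdout=f, check=True)
    return f"done {pdf.name}"

def convert_all(src_dir: str, workers: int = 4) -> list[str]:
    """Convert every PDF in src_dir using a small thread pool."""
    pdfs = sorted(Path(src_dir).glob("*.pdf"))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(convert_one, pdfs))
```

Threads are enough here because the work happens in subprocesses; swap in `ProcessPoolExecutor` only if you add CPU-bound post-processing.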
---
## Common Patterns
### Pattern: Convert and Review
```bash
#!/bin/bash
file="$1"
output="${file%.*}.md"
# Convert
markitdown "$file" > "$output"
# Open in editor for review
${EDITOR:-vim} "$output"
```
### Pattern: Safe Conversion
```bash
#!/bin/bash
file="$1"
backup="${file}.backup"
output="${file%.*}.md"
# Backup original
cp "$file" "$backup"
# Convert with error handling
if markitdown "$file" > "$output" 2> conversion.log; then
echo "Success: $output"
rm "$backup"
else
echo "Failed: Check conversion.log"
mv "$backup" "$file"
fi
```
### Pattern: Metadata Preservation
```bash
#!/bin/bash
# Extract and preserve document metadata
file="$1"
output="${file%.*}.md"
# Get file metadata
created=$(stat -c %w "$file" 2>/dev/null || stat -f %SB "$file")
modified=$(stat -c %y "$file" 2>/dev/null || stat -f %Sm "$file")
# Convert with metadata
{
echo "---"
echo "original_file: $(basename "$file")"
echo "created: $created"
echo "modified: $modified"
echo "converted: $(date -I)"
echo "---"
echo ""
markitdown "$file"
} > "$output"
```


@@ -0,0 +1,165 @@
# Heavy Mode Guide
Detailed documentation for doc-to-markdown Heavy Mode conversion.
## Overview
Heavy Mode runs multiple conversion tools in parallel and intelligently merges their outputs to produce the highest quality markdown possible.
## When to Use Heavy Mode
Use Heavy Mode when:
- Document has complex tables that need precise formatting
- Images must be preserved with proper references
- Structure hierarchy (headings, lists) must be accurate
- Output quality is more important than conversion speed
- Document will be used for LLM processing
Use Quick Mode when:
- Speed is priority
- Document is simple (mostly text)
- Output is for draft/review purposes
## Tool Capabilities
### PyMuPDF4LLM (Best for PDFs)
**Strengths:**
- Native table detection with multiple strategies
- Image extraction with position metadata
- LLM-optimized output format
- Preserves reading order
**Usage:**
```python
import pymupdf4llm
md_text = pymupdf4llm.to_markdown(
"document.pdf",
write_images=True,
table_strategy="lines_strict",
image_path="./assets",
dpi=150
)
```
### markitdown (Universal Converter)
**Strengths:**
- Supports many formats (PDF, DOCX, PPTX, XLSX)
- Good text extraction
- Simple API
**Limitations:**
- May miss complex tables
- No native image extraction
### pandoc (Best for Office Docs)
**Strengths:**
- Excellent DOCX/PPTX structure preservation
- Proper heading hierarchy
- List formatting
**Limitations:**
- Requires system installation
- PDF support limited
## Merge Strategy
### Segment-Level Selection
Heavy Mode doesn't just pick one tool's output. It:
1. Parses each output into segments
2. Scores each segment independently
3. Selects the best version of each segment
### Segment Types
| Type | Detection Pattern | Scoring Criteria |
|------|-------------------|------------------|
| Table | `\|.*\|` rows | Row count, column count, header separator |
| Heading | `^#{1,6} ` | Proper level, reasonable length |
| Image | `!\[.*\]\(.*\)` | Alt text present, local path |
| List | `^[-*+\d.] ` | Item count, nesting depth |
| Code | Triple backticks | Line count, language specified |
| Paragraph | Default | Word count, completeness |
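These detection patterns translate directly into a per-line classifier (a sketch; the bundled `merge_outputs.py` implements the full stateful parser):

```python
import re

def classify_line(line: str) -> str:
    """Map one markdown line to a segment type using the patterns above."""
    if re.match(r'^#{1,6}\s', line):
        return "heading"
    if re.match(r'^\s*\|.*\|\s*$', line):
        return "table"
    if re.match(r'^!\[.*\]\(.*\)', line):
        return "image"
    if re.match(r'^\s*(?:[-*+]|\d+\.)\s', line):
        return "list"
    if line.startswith("```"):
        return "code"
    return "paragraph"  # default
```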
### Scoring Example
```
Table from pymupdf4llm:
- 10 rows × 5 columns = 5.0 points
- Header separator present = 1.0 points
- Total: 6.0 points
Table from markitdown:
- 8 rows × 5 columns = 4.0 points
- No header separator = 0.0 points
- Total: 4.0 points
→ Select pymupdf4llm version
```
## Advanced Usage
### Force Specific Tool
```bash
# Use only pandoc
uv run scripts/convert.py document.docx -o output.md --tool pandoc
```
### Custom Assets Directory
```bash
# Heavy mode with custom image output
uv run scripts/convert.py document.pdf -o output.md --heavy --assets-dir ./images
```
### Validate After Conversion
```bash
# Convert then validate
uv run scripts/convert.py document.pdf -o output.md --heavy
uv run scripts/validate_output.py document.pdf output.md --report quality.html
```
## Troubleshooting
### Low Text Retention Score
**Causes:**
- PDF has scanned images (not searchable text)
- Encoding issues in source document
- Complex layouts confusing the parser
**Solutions:**
- Use OCR preprocessing for scanned PDFs
- Try different tool with `--tool` flag
- Manual cleanup may be needed
### Missing Tables
**Causes:**
- Tables without visible borders
- Tables spanning multiple pages
- Merged cells
**Solutions:**
- Use Heavy Mode for better detection
- Try pymupdf4llm with different table_strategy
- Manual table reconstruction
### Image References Broken
**Causes:**
- Assets directory not created
- Relative path issues
- Image extraction failed
**Solutions:**
- Ensure `--assets-dir` points to correct location
- Check `images_metadata.json` for extraction status
- Use `extract_pdf_images.py` separately


@@ -0,0 +1,180 @@
# Tool Comparison
Comparison of document-to-markdown conversion tools.
## Feature Matrix
| Feature | pymupdf4llm | markitdown | pandoc |
|---------|-------------|------------|--------|
| **PDF Support** | ✅ Excellent | ✅ Good | ⚠️ Limited |
| **DOCX Support** | ❌ No | ✅ Good | ✅ Excellent |
| **PPTX Support** | ❌ No | ✅ Good | ✅ Good |
| **XLSX Support** | ❌ No | ✅ Good | ⚠️ Limited |
| **Table Detection** | ✅ Multiple strategies | ⚠️ Basic | ✅ Good |
| **Image Extraction** | ✅ With metadata | ❌ No | ✅ Yes |
| **Heading Hierarchy** | ✅ Good | ⚠️ Variable | ✅ Excellent |
| **List Formatting** | ✅ Good | ⚠️ Basic | ✅ Excellent |
| **LLM Optimization** | ✅ Built-in | ❌ No | ❌ No |
## Installation
### pymupdf4llm
```bash
pip install pymupdf4llm
# Or with uv
uv pip install pymupdf4llm
```
**Dependencies:** PyMuPDF (pure Python package, installed automatically)
### markitdown
```bash
# With PDF support
uv tool install "markitdown[pdf]"
# Or
pip install "markitdown[pdf]"
```
**Dependencies:** Various per format (pdfminer, python-docx, etc.)
### pandoc
```bash
# macOS
brew install pandoc
# Ubuntu/Debian
apt-get install pandoc
# Windows
choco install pandoc
```
**Dependencies:** System installation required
## Performance Benchmarks
### PDF Conversion (100-page document)
| Tool | Time | Memory | Output Quality |
|------|------|--------|----------------|
| pymupdf4llm | ~15s | 150MB | Excellent |
| markitdown | ~45s | 200MB | Good |
| pandoc | ~60s | 100MB | Variable |
### DOCX Conversion (50-page document)
| Tool | Time | Memory | Output Quality |
|------|------|--------|----------------|
| pandoc | ~5s | 50MB | Excellent |
| markitdown | ~10s | 80MB | Good |
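Timings like these can be reproduced with a small harness (a sketch; pass whatever conversion command you want to measure):

```python
import subprocess
import time

def time_tool(cmd: list[str]) -> float:
    """Run a conversion command and return wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, capture_output=True)
    return time.perf_counter() - start
```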
## Best Practices
### For PDFs
1. **First choice:** pymupdf4llm
- Best table detection
- Image extraction with metadata
- LLM-optimized output
2. **Fallback:** markitdown
- When pymupdf4llm fails
- Simpler documents
### For DOCX/DOC
1. **First choice:** pandoc
- Best structure preservation
- Proper heading hierarchy
- List formatting
2. **Fallback:** markitdown
- When pandoc unavailable
- Quick conversion needed
### For PPTX
1. **First choice:** markitdown
- Good slide content extraction
- Handles speaker notes
2. **Fallback:** pandoc
- Better structure preservation
### For XLSX
1. **Only option:** markitdown
- Table to markdown conversion
- Sheet handling
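The per-format preferences above amount to a small dispatch table (a sketch; tool names match the recommendations, the function itself is illustrative):

```python
from pathlib import Path
from typing import Optional

# (preferred tool, fallback) per extension, per the recommendations above
TOOL_PREFERENCES = {
    ".pdf":  ("pymupdf4llm", "markitdown"),
    ".docx": ("pandoc", "markitdown"),
    ".doc":  ("pandoc", "markitdown"),
    ".pptx": ("markitdown", "pandoc"),
    ".xlsx": ("markitdown", None),
}

def pick_tool(path: str, available: set) -> Optional[str]:
    """Return the best available tool for a document, or None."""
    prefs = TOOL_PREFERENCES.get(Path(path).suffix.lower(), ())
    return next((t for t in prefs if t and t in available), None)
```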
## Common Issues by Tool
### pymupdf4llm
| Issue | Solution |
|-------|----------|
| "Cannot import fitz" | `pip install pymupdf` |
| Tables not detected | Try different `table_strategy` |
| Images not extracted | Enable `write_images=True` |
### markitdown
| Issue | Solution |
|-------|----------|
| PDF support missing | Install with `[pdf]` extra |
| Slow conversion | Expected for large files |
| Missing content | Try alternative tool |
### pandoc
| Issue | Solution |
|-------|----------|
| Command not found | Install via package manager |
| PDF conversion fails | Use pymupdf4llm instead |
| Images not extracted | Add `--extract-media` flag |
## API Comparison
### pymupdf4llm
```python
import pymupdf4llm
md = pymupdf4llm.to_markdown(
"doc.pdf",
write_images=True,
table_strategy="lines_strict",
image_path="./assets"
)
```
### markitdown
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)
```
### pandoc
```bash
pandoc document.docx -t markdown --wrap=none --extract-media=./assets
```
```python
import subprocess
result = subprocess.run(
["pandoc", "doc.docx", "-t", "markdown", "--wrap=none"],
capture_output=True, text=True
)
print(result.stdout)
```

doc-to-markdown/scripts/convert.py Executable file

File diff suppressed because it is too large


@@ -0,0 +1,61 @@
#!/usr/bin/env python3
"""
Convert Windows paths to WSL format.
Usage:
python convert_path.py "C:\\Users\\username\\Downloads\\file.doc"
Output:
/mnt/c/Users/username/Downloads/file.doc
"""
import sys
import re
def convert_windows_to_wsl(windows_path: str) -> str:
"""
Convert a Windows path to WSL format.
Args:
windows_path: Windows path (e.g., "C:\\Users\\username\\file.doc")
Returns:
WSL path (e.g., "/mnt/c/Users/username/file.doc")
"""
# Remove quotes if present
path = windows_path.strip('"').strip("'")
# Handle drive letter (C:\ or C:/)
drive_pattern = r'^([A-Za-z]):[\\\/]'
match = re.match(drive_pattern, path)
if not match:
# Already a WSL path or relative path
return path
drive_letter = match.group(1).lower()
path_without_drive = path[3:] # Remove "C:\"
# Replace backslashes with forward slashes
path_without_drive = path_without_drive.replace('\\', '/')
# Construct WSL path
wsl_path = f"/mnt/{drive_letter}/{path_without_drive}"
return wsl_path
def main():
if len(sys.argv) < 2:
print("Usage: python convert_path.py <windows_path>")
print('Example: python convert_path.py "C:\\Users\\username\\Downloads\\file.doc"')
sys.exit(1)
windows_path = sys.argv[1]
wsl_path = convert_windows_to_wsl(windows_path)
print(wsl_path)
if __name__ == "__main__":
main()


@@ -0,0 +1,243 @@
#!/usr/bin/env python3
"""
Extract images from PDF files with metadata using PyMuPDF.
Features:
- Extracts all images with page and position metadata
- Generates JSON metadata file for each image
- Supports markdown reference generation
- Optional DPI control for quality
Usage:
uv run --with pymupdf scripts/extract_pdf_images.py document.pdf
uv run --with pymupdf scripts/extract_pdf_images.py document.pdf -o ./images
uv run --with pymupdf scripts/extract_pdf_images.py document.pdf --markdown refs.md
Examples:
# Basic extraction
uv run --with pymupdf scripts/extract_pdf_images.py document.pdf
# With custom output and markdown references
uv run --with pymupdf scripts/extract_pdf_images.py doc.pdf -o assets --markdown images.md
"""
import argparse
import json
import sys
from dataclasses import dataclass, asdict
from pathlib import Path
from typing import Optional
@dataclass
class ImageMetadata:
"""Metadata for an extracted image."""
filename: str
page: int # 1-indexed
index: int # Image index on page (1-indexed)
width: int # Original width in pixels
height: int # Original height in pixels
x: float # X position on page (points)
y: float # Y position on page (points)
bbox_width: float # Width on page (points)
bbox_height: float # Height on page (points)
size_bytes: int
format: str # png, jpg, etc.
colorspace: str # RGB, CMYK, Gray
bits_per_component: int
def extract_images(
pdf_path: Path,
output_dir: Path,
markdown_file: Optional[Path] = None
) -> list[ImageMetadata]:
"""
Extract all images from a PDF file with metadata.
Args:
pdf_path: Path to the PDF file
output_dir: Directory to save extracted images
markdown_file: Optional path to write markdown references
Returns:
List of ImageMetadata for each extracted image
"""
try:
import fitz # PyMuPDF
except ImportError:
print("Error: PyMuPDF not installed. Run with:")
print(' uv run --with pymupdf scripts/extract_pdf_images.py <pdf_path>')
sys.exit(1)
output_dir.mkdir(parents=True, exist_ok=True)
doc = fitz.open(str(pdf_path))
extracted: list[ImageMetadata] = []
markdown_refs: list[str] = []
for page_num in range(len(doc)):
page = doc[page_num]
image_list = page.get_images(full=True)
for img_index, img_info in enumerate(image_list):
xref = img_info[0]
try:
base_image = doc.extract_image(xref)
except Exception as e:
print(f" Warning: Could not extract image xref={xref}: {e}")
continue
image_bytes = base_image["image"]
image_ext = base_image["ext"]
width = base_image.get("width", 0)
height = base_image.get("height", 0)
colorspace = base_image.get("colorspace", 0)
bpc = base_image.get("bpc", 8)
# Map colorspace number to name
cs_names = {1: "Gray", 3: "RGB", 4: "CMYK"}
cs_name = cs_names.get(colorspace, f"Unknown({colorspace})")
# Get image position on page
# img_info: (xref, smask, width, height, bpc, colorspace, alt, name, filter, referencer)
# We need to find the image rect on page
bbox_x, bbox_y, bbox_w, bbox_h = 0.0, 0.0, 0.0, 0.0
# Search for image instances on page
for img_block in page.get_images():
if img_block[0] == xref:
# Found matching image, try to get its rect
rects = page.get_image_rects(img_block)
if rects:
rect = rects[0] # Use first occurrence
bbox_x = rect.x0
bbox_y = rect.y0
bbox_w = rect.width
bbox_h = rect.height
break
# Create descriptive filename
img_filename = f"img_page{page_num + 1}_{img_index + 1}.{image_ext}"
img_path = output_dir / img_filename
# Save image
with open(img_path, "wb") as f:
f.write(image_bytes)
# Create metadata
metadata = ImageMetadata(
filename=img_filename,
page=page_num + 1,
index=img_index + 1,
width=width,
height=height,
x=round(bbox_x, 2),
y=round(bbox_y, 2),
bbox_width=round(bbox_w, 2),
bbox_height=round(bbox_h, 2),
size_bytes=len(image_bytes),
format=image_ext,
colorspace=cs_name,
bits_per_component=bpc
)
extracted.append(metadata)
# Generate markdown reference
alt_text = f"Image from page {page_num + 1}"
md_ref = f"![{alt_text}]({img_path.name})"
markdown_refs.append(f"<!-- Page {page_num + 1}, Position: ({bbox_x:.0f}, {bbox_y:.0f}) -->\n{md_ref}")
print(f"{img_filename} ({width}x{height}, {len(image_bytes):,} bytes)")
doc.close()
# Write metadata JSON
metadata_path = output_dir / "images_metadata.json"
with open(metadata_path, "w") as f:
json.dump(
{
"source": str(pdf_path),
"image_count": len(extracted),
"images": [asdict(m) for m in extracted]
},
f,
indent=2
)
print(f"\n📋 Metadata: {metadata_path}")
# Write markdown references if requested
if markdown_file and markdown_refs:
markdown_content = f"# Images from {pdf_path.name}\n\n"
markdown_content += "\n\n".join(markdown_refs)
markdown_file.parent.mkdir(parents=True, exist_ok=True)
markdown_file.write_text(markdown_content)
print(f"📝 Markdown refs: {markdown_file}")
print(f"\n✅ Total: {len(extracted)} images extracted to {output_dir}/")
return extracted
def main():
parser = argparse.ArgumentParser(
description="Extract images from PDF files with metadata",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Basic extraction
uv run --with pymupdf scripts/extract_pdf_images.py document.pdf
# Custom output directory
uv run --with pymupdf scripts/extract_pdf_images.py doc.pdf -o ./images
# With markdown references
uv run --with pymupdf scripts/extract_pdf_images.py doc.pdf --markdown refs.md
Output:
Images are saved with descriptive names: img_page1_1.png, img_page2_1.jpg
Metadata is saved to: images_metadata.json
"""
)
parser.add_argument(
"pdf_path",
type=Path,
help="Path to the PDF file"
)
parser.add_argument(
"-o", "--output",
type=Path,
default=Path("assets"),
help="Directory to save images (default: ./assets)"
)
parser.add_argument(
"--markdown",
type=Path,
help="Generate markdown file with image references"
)
parser.add_argument(
"--json",
action="store_true",
help="Output metadata as JSON to stdout"
)
args = parser.parse_args()
if not args.pdf_path.exists():
print(f"Error: File not found: {args.pdf_path}", file=sys.stderr)
sys.exit(1)
print(f"📄 Extracting images from: {args.pdf_path}")
extracted = extract_images(
args.pdf_path,
args.output,
args.markdown
)
if args.json:
print(json.dumps([asdict(m) for m in extracted], indent=2))
if __name__ == "__main__":
main()


@@ -0,0 +1,439 @@
#!/usr/bin/env python3
"""
Multi-tool markdown output merger with segment-level comparison.
Merges markdown outputs from multiple conversion tools by selecting
the best version of each segment (tables, images, headings, paragraphs).
Usage:
python merge_outputs.py output1.md output2.md -o merged.md
python merge_outputs.py --from-json results.json -o merged.md
"""
import argparse
import json
import re
import sys
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional
@dataclass
class Segment:
"""A segment of markdown content."""
type: str # 'heading', 'table', 'image', 'list', 'paragraph', 'code'
content: str
level: int = 0 # For headings
score: float = 0.0
@dataclass
class MergeResult:
"""Result from merging multiple markdown files."""
markdown: str
sources: list[str] = field(default_factory=list)
segment_sources: dict = field(default_factory=dict) # segment_idx -> source
def parse_segments(markdown: str) -> list[Segment]:
"""Parse markdown into typed segments."""
segments = []
lines = markdown.split('\n')
current_segment = []
current_type = 'paragraph'
current_level = 0
in_code_block = False
in_table = False
def flush_segment():
nonlocal current_segment, current_type, current_level
if current_segment:
content = '\n'.join(current_segment).strip()
if content:
segments.append(Segment(
type=current_type,
content=content,
level=current_level
))
current_segment = []
current_type = 'paragraph'
current_level = 0
for line in lines:
# Code block detection
if line.startswith('```'):
if in_code_block:
current_segment.append(line)
flush_segment()
in_code_block = False
continue
else:
flush_segment()
in_code_block = True
current_type = 'code'
current_segment.append(line)
continue
if in_code_block:
current_segment.append(line)
continue
# Heading detection
heading_match = re.match(r'^(#{1,6})\s+(.+)$', line)
if heading_match:
flush_segment()
current_type = 'heading'
current_level = len(heading_match.group(1))
current_segment.append(line)
flush_segment()
continue
# Table detection
if '|' in line and re.match(r'^\s*\|.*\|\s*$', line):
if not in_table:
flush_segment()
in_table = True
current_type = 'table'
current_segment.append(line)
continue
elif in_table:
flush_segment()
in_table = False
# Image detection
if re.match(r'!\[.*\]\(.*\)', line):
flush_segment()
current_type = 'image'
current_segment.append(line)
flush_segment()
continue
# List detection
if re.match(r'^[\s]*[-*+]\s+', line) or re.match(r'^[\s]*\d+\.\s+', line):
if current_type != 'list':
flush_segment()
current_type = 'list'
current_segment.append(line)
continue
elif current_type == 'list' and line.strip() == '':
flush_segment()
continue
# Empty line - potential paragraph break
if line.strip() == '':
if current_type == 'paragraph' and current_segment:
flush_segment()
continue
# Default: paragraph
if current_type not in ['list']:
current_type = 'paragraph'
current_segment.append(line)
flush_segment()
return segments
def score_segment(segment: Segment) -> float:
"""Score a segment for quality comparison."""
score = 0.0
content = segment.content
if segment.type == 'table':
# Count rows and columns
rows = [l for l in content.split('\n') if '|' in l]
if rows:
cols = rows[0].count('|') - 1
score += len(rows) * 0.5 # More rows = better
score += cols * 0.3 # More columns = better
# Penalize separator-only tables
if all(re.match(r'^[\s|:-]+$', r) for r in rows):
score -= 5.0
# Bonus for proper header separator
if len(rows) > 1 and re.match(r'^[\s|:-]+$', rows[1]):
score += 1.0
elif segment.type == 'heading':
# Prefer proper heading hierarchy
score += 1.0
# Penalize very long headings
if len(content) > 100:
score -= 0.5
elif segment.type == 'image':
# Prefer images with alt text
if re.search(r'!\[.+\]', content):
score += 1.0
# Prefer local paths over base64
if 'data:image' not in content:
score += 0.5
elif segment.type == 'list':
items = re.findall(r'^[\s]*[-*+\d.]+\s+', content, re.MULTILINE)
score += len(items) * 0.3
# Bonus for nested lists
if re.search(r'^\s{2,}[-*+]', content, re.MULTILINE):
score += 0.5
elif segment.type == 'code':
lines = content.split('\n')
score += min(len(lines) * 0.2, 3.0)
# Bonus for language specification
if re.match(r'^```\w+', content):
score += 0.5
else: # paragraph
words = len(content.split())
score += min(words * 0.05, 2.0)
# Penalize very short paragraphs
if words < 5:
score -= 0.5
return score
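# Illustrative example (not from the original source): a 3-row, 2-column
# markdown table with a proper header separator scores, under the rules above,
#   3 rows * 0.5 + 2 cols * 0.3 + 1.0 header bonus = 3.1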
def find_matching_segment(
segment: Segment,
candidates: list[Segment],
used_indices: set
) -> Optional[int]:
"""Find a matching segment in candidates by type and similarity."""
best_match = None
best_similarity = 0.3 # Minimum threshold
for i, candidate in enumerate(candidates):
if i in used_indices:
continue
if candidate.type != segment.type:
continue
# Calculate similarity
if segment.type == 'heading':
# Compare heading text (ignore # symbols)
s1 = re.sub(r'^#+\s*', '', segment.content).lower()
s2 = re.sub(r'^#+\s*', '', candidate.content).lower()
similarity = _text_similarity(s1, s2)
elif segment.type == 'table':
# Compare first row (header)
h1 = segment.content.split('\n')[0] if segment.content else ''
h2 = candidate.content.split('\n')[0] if candidate.content else ''
similarity = _text_similarity(h1, h2)
else:
# Compare content directly
similarity = _text_similarity(segment.content, candidate.content)
if similarity > best_similarity:
best_similarity = similarity
best_match = i
return best_match
def _text_similarity(s1: str, s2: str) -> float:
    """Calculate simple text similarity (Jaccard on words).

    >>> _text_similarity("alpha beta gamma", "beta gamma delta")
    0.5
    """
if not s1 or not s2:
return 0.0
words1 = set(s1.lower().split())
words2 = set(s2.lower().split())
if not words1 or not words2:
return 0.0
intersection = len(words1 & words2)
union = len(words1 | words2)
return intersection / union if union > 0 else 0.0
def merge_markdown_files(
files: list[Path],
source_names: Optional[list[str]] = None
) -> MergeResult:
"""Merge multiple markdown files by selecting best segments."""
if not files:
return MergeResult(markdown="", sources=[])
if source_names is None:
source_names = [f.stem for f in files]
# Parse all files into segments
all_segments = []
for i, file_path in enumerate(files):
content = file_path.read_text()
segments = parse_segments(content)
# Score each segment
for seg in segments:
seg.score = score_segment(seg)
all_segments.append((source_names[i], segments))
if len(all_segments) == 1:
return MergeResult(
markdown=files[0].read_text(),
sources=[source_names[0]]
)
    # Use first file as base structure
    base_name, base_segments = all_segments[0]
    merged_segments = []
    segment_sources = {}
    # Track matched indices per secondary source so the same segment
    # cannot be selected for more than one base segment
    used_per_source = {name: set() for name, _ in all_segments[1:]}
    for i, base_seg in enumerate(base_segments):
        best_segment = base_seg
        best_source = base_name
        # Find matching segments in other files
        for other_name, other_segments in all_segments[1:]:
            match_idx = find_matching_segment(
                base_seg, other_segments, used_per_source[other_name]
            )
            if match_idx is not None:
                used_per_source[other_name].add(match_idx)
                other_seg = other_segments[match_idx]
                if other_seg.score > best_segment.score:
                    best_segment = other_seg
                    best_source = other_name
        merged_segments.append(best_segment)
        segment_sources[i] = best_source
    # Check for segments in other files that weren't matched
    # (content that only appears in secondary sources)
    for other_name, other_segments in all_segments[1:]:
        for other_seg in other_segments:
            match_idx = find_matching_segment(other_seg, base_segments, set())
            if match_idx is None and other_seg.score > 0.5:
                # This segment doesn't exist in base - append it at the end
                merged_segments.append(other_seg)
                segment_sources[len(merged_segments) - 1] = other_name
# Reconstruct markdown
merged_md = '\n\n'.join(seg.content for seg in merged_segments)
return MergeResult(
markdown=merged_md,
sources=source_names,
segment_sources=segment_sources
)
def merge_from_json(json_path: Path) -> MergeResult:
"""Merge from JSON results file (from convert.py)."""
with open(json_path) as f:
data = json.load(f)
results = data.get('results', [])
if not results:
return MergeResult(markdown="", sources=[])
# Filter successful results
successful = [r for r in results if r.get('success') and r.get('markdown')]
if not successful:
return MergeResult(markdown="", sources=[])
if len(successful) == 1:
return MergeResult(
markdown=successful[0]['markdown'],
sources=[successful[0]['tool']]
)
# Parse and merge
all_segments = []
for result in successful:
tool = result['tool']
segments = parse_segments(result['markdown'])
for seg in segments:
seg.score = score_segment(seg)
all_segments.append((tool, segments))
    # Same merge logic as merge_markdown_files
    base_name, base_segments = all_segments[0]
    merged_segments = []
    segment_sources = {}
    used_per_source = {name: set() for name, _ in all_segments[1:]}
    for i, base_seg in enumerate(base_segments):
        best_segment = base_seg
        best_source = base_name
        for other_name, other_segments in all_segments[1:]:
            match_idx = find_matching_segment(
                base_seg, other_segments, used_per_source[other_name]
            )
            if match_idx is not None:
                used_per_source[other_name].add(match_idx)
                other_seg = other_segments[match_idx]
                if other_seg.score > best_segment.score:
                    best_segment = other_seg
                    best_source = other_name
        merged_segments.append(best_segment)
        segment_sources[i] = best_source
merged_md = '\n\n'.join(seg.content for seg in merged_segments)
return MergeResult(
markdown=merged_md,
sources=[r['tool'] for r in successful],
segment_sources=segment_sources
)
def main():
parser = argparse.ArgumentParser(
description="Merge markdown outputs from multiple conversion tools"
)
parser.add_argument(
"inputs",
nargs="*",
type=Path,
help="Input markdown files to merge"
)
parser.add_argument(
"-o", "--output",
type=Path,
help="Output merged markdown file"
)
parser.add_argument(
"--from-json",
type=Path,
help="Merge from JSON results file (from convert.py)"
)
parser.add_argument(
"--verbose",
action="store_true",
help="Show segment source attribution"
)
args = parser.parse_args()
if args.from_json:
result = merge_from_json(args.from_json)
elif args.inputs:
# Validate inputs
for f in args.inputs:
if not f.exists():
print(f"Error: File not found: {f}", file=sys.stderr)
sys.exit(1)
result = merge_markdown_files(args.inputs)
else:
parser.error("Either input files or --from-json is required")
if not result.markdown:
print("Error: No content to merge", file=sys.stderr)
sys.exit(1)
# Output
if args.output:
args.output.parent.mkdir(parents=True, exist_ok=True)
args.output.write_text(result.markdown)
print(f"Merged output: {args.output}")
print(f"Sources: {', '.join(result.sources)}")
else:
print(result.markdown)
if args.verbose and result.segment_sources:
print("\n--- Segment Attribution ---", file=sys.stderr)
for idx, source in result.segment_sources.items():
print(f" Segment {idx}: {source}", file=sys.stderr)
if __name__ == "__main__":
main()

#!/usr/bin/env python3
"""
Quality validator for document-to-markdown conversion.
Compare original document with converted markdown to assess conversion quality.
Generates HTML quality report with detailed metrics.
Usage:
uv run --with pymupdf scripts/validate_output.py document.pdf output.md
uv run --with pymupdf scripts/validate_output.py document.pdf output.md --report report.html
"""
import argparse
import html
import re
import subprocess
import sys
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional
@dataclass
class ValidationMetrics:
"""Quality metrics for conversion validation."""
# Text metrics
source_char_count: int = 0
output_char_count: int = 0
text_retention: float = 0.0
# Table metrics
source_table_count: int = 0
output_table_count: int = 0
table_retention: float = 0.0
# Image metrics
source_image_count: int = 0
output_image_count: int = 0
image_retention: float = 0.0
# Structure metrics
heading_count: int = 0
list_count: int = 0
code_block_count: int = 0
# Quality scores
overall_score: float = 0.0
status: str = "unknown" # pass, warn, fail
# Details
warnings: list[str] = field(default_factory=list)
errors: list[str] = field(default_factory=list)
def extract_text_from_pdf(pdf_path: Path) -> tuple[str, int, int]:
"""Extract text, table count, and image count from PDF."""
try:
import fitz # PyMuPDF
doc = fitz.open(str(pdf_path))
text_parts = []
table_count = 0
image_count = 0
        for page in doc:
            page_text = page.get_text()
            text_parts.append(page_text)
            # Count images
            image_count += len(page.get_images())
            # Estimate tables: repeated tabs or box-drawing borders suggest
            # a grid (approximate - tables are hard to detect in PDFs)
            if re.search(r'(\t.*){2,}', page_text) or '│' in page_text:
                table_count += 1
doc.close()
return '\n'.join(text_parts), table_count, image_count
except ImportError:
# Fallback to pdftotext if available
try:
result = subprocess.run(
['pdftotext', '-layout', str(pdf_path), '-'],
capture_output=True,
text=True,
timeout=60
)
return result.stdout, 0, 0 # Can't count tables/images
except Exception:
return "", 0, 0
def extract_text_from_docx(docx_path: Path) -> tuple[str, int, int]:
"""Extract text, table count, and image count from DOCX."""
try:
import zipfile
from xml.etree import ElementTree as ET
with zipfile.ZipFile(docx_path, 'r') as z:
# Extract main document text
if 'word/document.xml' not in z.namelist():
return "", 0, 0
with z.open('word/document.xml') as f:
tree = ET.parse(f)
root = tree.getroot()
# Extract text
ns = {'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}
text_parts = []
for t in root.iter('{http://schemas.openxmlformats.org/wordprocessingml/2006/main}t'):
if t.text:
text_parts.append(t.text)
# Count tables
tables = root.findall('.//w:tbl', ns)
table_count = len(tables)
# Count images
image_count = sum(1 for name in z.namelist()
if name.startswith('word/media/'))
return ' '.join(text_parts), table_count, image_count
    except Exception:
        return "", 0, 0
def analyze_markdown(md_path: Path) -> dict:
"""Analyze markdown file structure and content."""
content = md_path.read_text()
    # Count tables by grouping consecutive markdown table lines (with |)
table_count = 0
in_table = False
for line in content.split('\n'):
if re.match(r'^\s*\|.*\|', line):
if not in_table:
table_count += 1
in_table = True
else:
in_table = False
# Count images
images = re.findall(r'!\[.*?\]\(.*?\)', content)
# Count headings
headings = re.findall(r'^#{1,6}\s+.+$', content, re.MULTILINE)
# Count lists
list_items = re.findall(r'^[\s]*[-*+]\s+', content, re.MULTILINE)
list_items += re.findall(r'^[\s]*\d+\.\s+', content, re.MULTILINE)
# Count code blocks
code_blocks = re.findall(r'```', content)
# Clean text for comparison
clean_text = re.sub(r'```.*?```', '', content, flags=re.DOTALL)
clean_text = re.sub(r'!\[.*?\]\(.*?\)', '', clean_text)
clean_text = re.sub(r'\[.*?\]\(.*?\)', '', clean_text)
clean_text = re.sub(r'[#*_`|>-]', '', clean_text)
clean_text = re.sub(r'\s+', ' ', clean_text).strip()
return {
'char_count': len(clean_text),
'table_count': table_count,
'image_count': len(images),
'heading_count': len(headings),
'list_count': len(list_items),
'code_block_count': len(code_blocks) // 2,
'raw_content': content,
'clean_text': clean_text
}
def validate_conversion(
source_path: Path,
output_path: Path
) -> ValidationMetrics:
"""Validate conversion quality by comparing source and output."""
metrics = ValidationMetrics()
# Analyze output markdown
md_analysis = analyze_markdown(output_path)
metrics.output_char_count = md_analysis['char_count']
metrics.output_table_count = md_analysis['table_count']
metrics.output_image_count = md_analysis['image_count']
metrics.heading_count = md_analysis['heading_count']
metrics.list_count = md_analysis['list_count']
metrics.code_block_count = md_analysis['code_block_count']
# Extract source content based on file type
ext = source_path.suffix.lower()
if ext == '.pdf':
source_text, source_tables, source_images = extract_text_from_pdf(source_path)
elif ext in ['.docx', '.doc']:
source_text, source_tables, source_images = extract_text_from_docx(source_path)
    else:
        # Unsupported source format - no reference metrics available
        source_text = ""
        source_tables = 0
        source_images = 0
        metrics.warnings.append(f"Cannot analyze source format: {ext}")
    # Normalize whitespace the same way analyze_markdown does, so the counts compare
    metrics.source_char_count = len(re.sub(r'\s+', ' ', source_text).strip())
metrics.source_table_count = source_tables
metrics.source_image_count = source_images
# Calculate retention rates
if metrics.source_char_count > 0:
# Use ratio of actual/expected, capped at 1.0
metrics.text_retention = min(
metrics.output_char_count / metrics.source_char_count,
1.0
)
else:
metrics.text_retention = 1.0 if metrics.output_char_count > 0 else 0.0
if metrics.source_table_count > 0:
metrics.table_retention = min(
metrics.output_table_count / metrics.source_table_count,
1.0
)
else:
metrics.table_retention = 1.0 # No tables expected
if metrics.source_image_count > 0:
metrics.image_retention = min(
metrics.output_image_count / metrics.source_image_count,
1.0
)
else:
metrics.image_retention = 1.0 # No images expected
# Determine status based on thresholds
if metrics.text_retention < 0.85:
metrics.errors.append(f"Low text retention: {metrics.text_retention:.1%}")
elif metrics.text_retention < 0.95:
metrics.warnings.append(f"Text retention below optimal: {metrics.text_retention:.1%}")
if metrics.source_table_count > 0 and metrics.table_retention < 0.9:
metrics.errors.append(f"Tables missing: {metrics.table_retention:.1%} retained")
elif metrics.source_table_count > 0 and metrics.table_retention < 1.0:
metrics.warnings.append(f"Some tables may be incomplete: {metrics.table_retention:.1%}")
if metrics.source_image_count > 0 and metrics.image_retention < 0.8:
metrics.errors.append(f"Images missing: {metrics.image_retention:.1%} retained")
elif metrics.source_image_count > 0 and metrics.image_retention < 1.0:
metrics.warnings.append(f"Some images missing: {metrics.image_retention:.1%}")
    # Calculate overall score (0-100); the weights already sum to 100
    metrics.overall_score = (
        metrics.text_retention * 50 +
        metrics.table_retention * 25 +
        metrics.image_retention * 25
    )
# Determine status
if metrics.errors:
metrics.status = "fail"
elif metrics.warnings:
metrics.status = "warn"
else:
metrics.status = "pass"
return metrics
def generate_html_report(
metrics: ValidationMetrics,
source_path: Path,
output_path: Path
) -> str:
"""Generate HTML quality report."""
status_colors = {
"pass": "#28a745",
"warn": "#ffc107",
"fail": "#dc3545"
}
status_color = status_colors.get(metrics.status, "#6c757d")
def metric_bar(value: float, thresholds: tuple) -> str:
"""Generate colored progress bar."""
pct = int(value * 100)
if value >= thresholds[0]:
color = "#28a745" # green
elif value >= thresholds[1]:
color = "#ffc107" # yellow
else:
color = "#dc3545" # red
return f'''
<div style="background: #e9ecef; border-radius: 4px; overflow: hidden; height: 20px;">
<div style="background: {color}; width: {pct}%; height: 100%; transition: width 0.3s;"></div>
</div>
<span style="font-size: 14px; color: #666;">{pct}%</span>
'''
report = f'''<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Conversion Quality Report</title>
<style>
body {{ font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; margin: 40px; background: #f5f5f5; }}
.container {{ max-width: 800px; margin: 0 auto; background: white; padding: 30px; border-radius: 8px; box-shadow: 0 2px 4px rgba(0,0,0,0.1); }}
h1 {{ color: #333; border-bottom: 2px solid #eee; padding-bottom: 15px; }}
.status {{ display: inline-block; padding: 8px 16px; border-radius: 4px; color: white; font-weight: bold; }}
.metric {{ margin: 20px 0; padding: 15px; background: #f8f9fa; border-radius: 4px; }}
.metric-label {{ font-weight: bold; color: #333; margin-bottom: 8px; }}
.metric-value {{ font-size: 24px; color: #333; }}
.issues {{ margin-top: 20px; }}
.error {{ background: #f8d7da; color: #721c24; padding: 10px; margin: 5px 0; border-radius: 4px; }}
.warning {{ background: #fff3cd; color: #856404; padding: 10px; margin: 5px 0; border-radius: 4px; }}
table {{ width: 100%; border-collapse: collapse; margin: 15px 0; }}
th, td {{ padding: 10px; text-align: left; border-bottom: 1px solid #eee; }}
th {{ background: #f8f9fa; }}
.score {{ font-size: 48px; font-weight: bold; color: {status_color}; }}
</style>
</head>
<body>
<div class="container">
<h1>📊 Conversion Quality Report</h1>
<div style="text-align: center; margin: 30px 0;">
<div class="score">{metrics.overall_score:.0f}</div>
<div style="color: #666;">Overall Score</div>
<div class="status" style="background: {status_color}; margin-top: 10px;">
{metrics.status.upper()}
</div>
</div>
<h2>📄 File Information</h2>
<table>
<tr><th>Source</th><td>{html.escape(str(source_path))}</td></tr>
<tr><th>Output</th><td>{html.escape(str(output_path))}</td></tr>
</table>
<h2>📏 Retention Metrics</h2>
<div class="metric">
<div class="metric-label">Text Retention (target: >95%)</div>
{metric_bar(metrics.text_retention, (0.95, 0.85))}
<div style="font-size: 12px; color: #666; margin-top: 5px;">
Source: ~{metrics.source_char_count:,} chars | Output: {metrics.output_char_count:,} chars
</div>
</div>
<div class="metric">
<div class="metric-label">Table Retention (target: 100%)</div>
{metric_bar(metrics.table_retention, (1.0, 0.9))}
<div style="font-size: 12px; color: #666; margin-top: 5px;">
Source: {metrics.source_table_count} tables | Output: {metrics.output_table_count} tables
</div>
</div>
<div class="metric">
<div class="metric-label">Image Retention (target: 100%)</div>
{metric_bar(metrics.image_retention, (1.0, 0.8))}
<div style="font-size: 12px; color: #666; margin-top: 5px;">
Source: {metrics.source_image_count} images | Output: {metrics.output_image_count} images
</div>
</div>
<h2>📊 Structure Analysis</h2>
<table>
<tr><th>Headings</th><td>{metrics.heading_count}</td></tr>
<tr><th>List Items</th><td>{metrics.list_count}</td></tr>
<tr><th>Code Blocks</th><td>{metrics.code_block_count}</td></tr>
</table>
{'<h2>⚠️ Issues</h2><div class="issues">' + ''.join(f'<div class="error">❌ {html.escape(e)}</div>' for e in metrics.errors) + ''.join(f'<div class="warning">⚠️ {html.escape(w)}</div>' for w in metrics.warnings) + '</div>' if metrics.errors or metrics.warnings else ''}
<div style="margin-top: 30px; padding-top: 20px; border-top: 1px solid #eee; color: #666; font-size: 12px;">
    Generated by doc-to-markdown validate_output.py
</div>
</div>
</body>
</html>
'''
return report
def main():
parser = argparse.ArgumentParser(
description="Validate document-to-markdown conversion quality"
)
parser.add_argument(
"source",
type=Path,
help="Original document (PDF, DOCX, etc.)"
)
parser.add_argument(
"output",
type=Path,
help="Converted markdown file"
)
parser.add_argument(
"--report",
type=Path,
help="Generate HTML report at this path"
)
parser.add_argument(
"--json",
action="store_true",
help="Output metrics as JSON"
)
args = parser.parse_args()
# Validate inputs
if not args.source.exists():
print(f"Error: Source file not found: {args.source}", file=sys.stderr)
sys.exit(1)
if not args.output.exists():
print(f"Error: Output file not found: {args.output}", file=sys.stderr)
sys.exit(1)
# Run validation
metrics = validate_conversion(args.source, args.output)
# Output results
if args.json:
import json
print(json.dumps({
'text_retention': metrics.text_retention,
'table_retention': metrics.table_retention,
'image_retention': metrics.image_retention,
'overall_score': metrics.overall_score,
'status': metrics.status,
'warnings': metrics.warnings,
'errors': metrics.errors
}, indent=2))
else:
# Console output
        status_emoji = {"pass": "✅", "warn": "⚠️", "fail": "❌"}.get(metrics.status, "")
print(f"\n{status_emoji} Conversion Quality: {metrics.status.upper()}")
print(f" Overall Score: {metrics.overall_score:.0f}/100")
print(f"\n Text Retention: {metrics.text_retention:.1%}")
print(f" Table Retention: {metrics.table_retention:.1%}")
print(f" Image Retention: {metrics.image_retention:.1%}")
if metrics.errors:
print("\n Errors:")
for e in metrics.errors:
                print(f"    ❌ {e}")
if metrics.warnings:
print("\n Warnings:")
for w in metrics.warnings:
print(f" ⚠️ {w}")
# Generate HTML report
if args.report:
report_html = generate_html_report(metrics, args.source, args.output)
args.report.parent.mkdir(parents=True, exist_ok=True)
args.report.write_text(report_html)
print(f"\n📊 HTML report: {args.report}")
# Exit with appropriate code
sys.exit(0 if metrics.status != "fail" else 1)
if __name__ == "__main__":
main()