refactor: rename markdown-tools → doc-to-markdown (v2.0.0)

- Rename skill to better reflect its purpose (document-to-markdown conversion)
- Update SKILL.md name, description, and trigger keywords
- Add benchmark reference (2026-03-22)
- Update marketplace.json entry (name, skills path, version 2.0.0)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 00:06:30 +08:00
parent ee38ae41b8
commit 143995b213
12 changed files with 1346 additions and 451 deletions
--- a/.claude-plugin/marketplace.json
+++ b/.claude-plugin/marketplace.json
@@ -50,25 +50,23 @@
      ]
    },
    {
-      "name": "markdown-tools",
-      "description": "Convert documents (PDFs, Word, PowerPoint) to high-quality markdown with multi-tool orchestration. Supports Quick Mode (fast, single tool) and Heavy Mode (best quality, multi-tool merge with segment-level selection). Features PyMuPDF4LLM for LLM-optimized PDF conversion, pandoc for DOCX/PPTX structure preservation, quality validation with HTML reports, and image extraction with metadata",
+      "name": "doc-to-markdown",
+      "description": "Converts DOCX/PDF/PPTX to high-quality Markdown with automatic post-processing. Fixes pandoc grid tables, image paths, attribute noise, and code blocks. Supports Quick Mode (fast, single tool) and Heavy Mode (best quality, multi-tool merge). Trigger on \"convert document\", \"docx to markdown\", \"parse word\", \"doc to markdown\", \"extract images from document\".",
      "source": "./",
      "strict": false,
-      "version": "1.2.0",
+      "version": "2.0.0",
      "category": "document-conversion",
      "keywords": [
        "markdown",
-        "pdf",
        "docx",
+        "pdf",
        "pptx",
-        "pymupdf4llm",
+        "converter",
        "pandoc",
-        "markitdown",
-        "heavy-mode",
-        "quality-validation"
+        "document"
      ],
      "skills": [
-        "./markdown-tools"
+        "./doc-to-markdown"
      ]
    },
    {
--- a/doc-to-markdown/SKILL.md
+++ b/doc-to-markdown/SKILL.md
@@ -1,11 +1,11 @@
 ---
-name: markdown-tools
-description: Converts documents to markdown with multi-tool orchestration for best quality. Supports Quick Mode (fast, single tool) and Heavy Mode (best quality, multi-tool merge). Use when converting PDF/DOCX/PPTX files to markdown, extracting images from documents, validating conversion quality, or needing LLM-optimized document output.
+name: doc-to-markdown
+description: Converts DOCX/PDF/PPTX to high-quality Markdown with automatic post-processing. Fixes pandoc grid tables, image paths, attribute noise, and code blocks. Supports Quick Mode (fast, single tool) and Heavy Mode (best quality, multi-tool merge). Trigger on "convert document", "docx to markdown", "parse word", "doc to markdown", "extract images from document".
 ---

-# Markdown Tools
+# Doc to Markdown

-Convert documents to high-quality markdown with intelligent multi-tool orchestration.
+Convert documents to high-quality markdown with intelligent multi-tool orchestration and automatic DOCX post-processing.

 ## Dual Mode Architecture

@@ -34,6 +34,9 @@ uv run --with pymupdf4llm --with markitdown scripts/convert.py document.pdf -o o
 # Heavy Mode - multi-tool parallel execution with merge
 uv run --with pymupdf4llm --with markitdown scripts/convert.py document.pdf -o output.md --heavy

+# DOCX with deep python-docx parsing (experimental)
+uv run --with pymupdf4llm --with markitdown --with python-docx scripts/convert.py document.docx -o output.md --docx-deep
+
 # Check available tools
 uv run scripts/convert.py --list-tools
 ```
@@ -43,7 +46,7 @@ uv run scripts/convert.py --list-tools
 | Format | Quick Mode Tool | Heavy Mode Tools |
 |--------|----------------|------------------|
 | PDF | pymupdf4llm | pymupdf4llm + markitdown |
-| DOCX | pandoc | pandoc + markitdown |
+| DOCX | pandoc + post-processing | pandoc + markitdown |
 | PPTX | markitdown | markitdown + pandoc |
 | XLSX | markitdown | markitdown |

@@ -53,6 +56,21 @@ uv run scripts/convert.py --list-tools
 - **markitdown**: Microsoft's universal converter, good for Office formats
 - **pandoc**: Excellent structure preservation for DOCX/PPTX

+## DOCX Post-Processing (automatic)
+
+When converting DOCX files via pandoc, the following cleanups are applied automatically:
+
+| Problem | Fix |
+|---------|-----|
+| Grid tables (`+:---+` syntax) | Single-column -> blockquote, multi-column -> split images |
+| Image double path (`media/media/`) | Flatten to `media/` |
+| Pandoc attributes (`{width="..." height="..."}`) | Removed |
+| Inline classes (`{.underline}`, `{.mark}`) | Removed |
+| Indented dashed code blocks | Converted to fenced code blocks (```) |
+| Escaped brackets (`\[...\]`) | Unescaped to `[...]` |
+| Double-bracket links (`[[text]{...}](url)`) | Simplified to `[text](url)` |
+| Escaped quotes in code (`\"`) | Fixed to `"` |
+
 ## Heavy Mode Workflow

 Heavy Mode runs multiple tools in parallel and selects the best segments:
@@ -117,7 +135,7 @@ python scripts/merge_outputs.py output1.md output2.md -o merged.md --verbose
 ## Path Conversion (Windows/WSL)

 ```bash
-# Windows → WSL conversion
+# Windows to WSL conversion
 python scripts/convert_path.py "C:\Users\name\Documents\file.pdf"
 # Output: /mnt/c/Users/name/Documents/file.pdf
 ```
@@ -147,7 +165,7 @@ brew install pandoc

 | Script | Purpose |
 |--------|---------|
-| `convert.py` | Main orchestrator with Quick/Heavy mode |
+| `convert.py` | Main orchestrator with Quick/Heavy mode + DOCX post-processing |
 | `merge_outputs.py` | Merge multiple markdown outputs |
 | `validate_output.py` | Quality validation with HTML report |
 | `extract_pdf_images.py` | PDF image extraction with metadata |
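
The "DOCX Post-Processing" table added to SKILL.md above lists several text-level cleanups. The updated `scripts/convert.py` body is not shown in this diff, so the following is only a minimal sketch of how a few of those rows could be implemented with regular expressions. The function name `clean_pandoc_markdown` and both patterns are hypothetical, and unwrapping `[text]{.underline}` to plain `text` is an assumption about what "Removed" means for inline classes.

```python
import re

# Hypothetical patterns, not taken from the commit's convert.py.
# Matches pandoc size attributes such as {width="5.0in" height="3.1in"}.
_ATTR_RE = re.compile(
    r'\{(?:width|height)="[^"]*"(?:\s+(?:width|height)="[^"]*")*\}'
)
# Matches bracketed spans carrying an inline class, e.g. [text]{.underline}.
_CLASS_RE = re.compile(r'\[([^\]]*)\]\{\.(?:underline|mark)\}')


def clean_pandoc_markdown(md: str) -> str:
    """Apply a few of the cleanups listed in the SKILL.md table (sketch)."""
    md = md.replace("media/media/", "media/")  # flatten doubled image path
    md = _ATTR_RE.sub("", md)                  # drop size attributes
    md = _CLASS_RE.sub(r"\1", md)              # assumed: unwrap the span
    md = re.sub(r"\\([\[\]])", r"\1", md)      # \[...\] -> [...]
    return md


sample = (
    '![shot](media/media/image1.png){width="5.0in" height="3.1in"} '
    'see \\[note\\] and [key]{.underline}'
)
print(clean_pandoc_markdown(sample))
```

The grid-table and code-block rows are left out on purpose: they need block-level parsing rather than single-line substitutions.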
--- a/doc-to-markdown/references/benchmark-2026-03-22.md
+++ b/doc-to-markdown/references/benchmark-2026-03-22.md
@@ -0,0 +1,163 @@
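+# DOCX→Markdown Conversion Benchmark
+
+> **Test date**: 2026-03-22
+>
+> **Test file**: `助教-【腾讯云🦞】小白实践 OpenClaw 保姆级教程.docx` (19MB, 77 images; includes grid-table layouts, JSON code blocks, side-by-side multi-column images, and info boxes)
+>
+> **Method**: five approaches converted the same file; each was scored out of 10 on five dimensions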
+
+---
+
+## Overall Scores
+
+| Dimension | Docling (IBM) | MarkItDown (MS) | Pandoc | Mammoth | **doc-to-markdown (ours)** |
+|------|:---:|:---:|:---:|:---:|:---:|
+| Table quality | 5 | 3 | 5 | 1~3 | **6** |
+| Image extraction | 4 | 2 | **10** | 5 | 7 |
+| Text completeness | 8 | 7 | **9** | 7 | **9** |
+| Formatting cleanliness | 5 | 5 | 5 | 3 | **7** |
+| Code blocks | 2 | 1 | N/A | 1 | **9** |
+| **Overall** | **4.8** | **3.6** | **7.3** | **3.4~3.6** | **7.6** |
+
+---
+
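+## Detailed Analysis by Approach
+
+### 1. IBM Docling (overall 4.8)
+
+- **Version**: docling 2.x + Granite-Docling-258M
+- **Architecture**: AI-driven (VLM, vision-language model), DocTags intermediate format → Markdown
+
+**Critical issues**:
+- Every image reference is an `<!-- image -->` placeholder (0 of 77 images render); the `ImageRefMode` API is not available for DOCX
+- All heading levels are lost (0 `#`); every heading degrades to bold text
+- Zero code blocks; JSON and commands are all emitted as plain paragraphs
+- `api_key` is wrongly escaped as `api\_key`
+
+**Strengths**:
+- Text content is complete; Chinese, emoji, and links are well preserved
+- No grid-table or HTML residue
+- Table syntax is correct (pipe tables), but the cell contents are placeholders
+
+**Conclusion**: Docling's strength is PDF (the AAAI 2025 paper scenario); its DOCX support is far from production-grade.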
+
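+### 2. Microsoft MarkItDown (overall 3.6)
+
+- **Version**: markitdown 0.1.5
+- **Architecture**: calls mammoth under the hood → HTML → markdownify → Markdown
+
+**Critical issues**:
+- All 77 images are truncated base64 placeholders (`data:image/png;base64,...`); the default `keep_data_uris=False` deliberately discards the image data
+- All headings become bold text (mammoth cannot recognize WPS custom styles)
+- Zero code blocks; JSON is stuffed into table cells
+- Ordered-list numbering is entirely wrong (output is `1. 1. 1.`)
+
+**Strengths**:
+- No leftover HTML tags
+- Text content is largely complete
+
+**Conclusion**: MarkItDown's markdownify post-processing actually introduces destructive truncation. Usable for lightweight cases, unreliable for complex DOCX.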
+
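+### 3. Pandoc (overall 7.3)
+
+- **Version**: pandoc 3.9
+- **Architecture**: native Haskell AST parsing, supports 60+ formats
+
+**Three option sets were tested**:
+
+| Options | Result |
+|------|------|
+| `-t gfm` | Worst: 24 nested HTML `<table>` elements, 74 HTML `<img>` tags |
+| `-t markdown` | Best: grid tables (post-processable), no HTML |
+| `-t markdown-raw_html-...` | Identical to `markdown`; the extra options have no effect |
+
+**Issues**:
+- Grid tables are unavoidable (the source docx has multi-line cells and nested tables, which pipe tables cannot express)
+- `{width="..." height="..."}`: 68 occurrences
+- `{.underline}`: 6 occurrences
+- Excessive backslash escaping: 37 occurrences
+
+**Strengths**:
+- Image extraction 10/10 (all 77 correct, consistent path structure)
+- Text completeness 9/10 (content, links, and emoji all preserved)
+- The most mature and stable underlying engine
+
+**Conclusion**: Pandoc is the most reliable underlying engine with the highest output quality, but it needs post-processing to clean up pandoc-specific syntax.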
+
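+### 4. Mammoth (overall 3.4~3.6)
+
+- **Version**: mammoth 1.11.0
+- **Architecture**: python-docx parsing → HTML/Markdown (Markdown support is deprecated)
+
+**Two methods were tested**:
+
+| Method | Overall |
+|------|------|
+| Method A: direct to Markdown | 3.4 (tables lost entirely) |
+| Method B: to HTML → markdownify | 3.6 (tables present, but nesting flattened) |
+
+**Critical issues**:
+- All headings are lost (style definitions in the WPS `styles.xml` are empty, so mammoth cannot map Heading styles)
+- Zero code blocks
+- All images are embedded as base64; the single output file is 28MB
+- In Method B, markdownify drops 14 images (63/77)
+
+**Conclusion**: Mammoth's Markdown support is deprecated, and its compatibility with WPS-exported docx is poor. Not recommended.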
+
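+### 5. doc-to-markdown / our approach (overall 7.6)
+
+- **Version**: doc-to-markdown 1.0 (pandoc plus 6 post-processing functions)
+- **Architecture**: Pandoc conversion → automatic post-processing (grid-table cleanup, image-path fixes, attribute cleanup, code-block fixes, escape fixes)
+
+**Measured post-processing effect**:
+
+| Post-processing function | Fixes applied |
+|-----------|---------|
+| `_convert_grid_tables` | 11 grid tables → pipe table / blockquote |
+| `_clean_pandoc_attributes` | 3437 characters of attributes removed |
+| `_fix_code_blocks` | 22 indented dashed blocks → ``` code blocks |
+| `_fix_escaped_brackets` | 10 occurrences |
+| `_fix_double_bracket_links` | 1 occurrence |
+| `_fix_image_paths` | 77 image paths fixed |
+
+**Known issues (to be fixed)**:
+- Doubly nested image path bug: pandoc creates an extra `media/` level inside the directory given by `--assets-dir`
+- 2 leftover grid tables (the side-by-side image group at the end of the document is not fully converted)
+
+**Strengths**:
+- Code-block detection 9/10 (JSON gets a language tag, command lines are wrapped correctly)
+- Formatting cleanliness 7/10 (attributes, escapes, and grid tables are mostly cleaned up)
+- Text completeness 9/10 (all key content preserved)
+
+**Conclusion**: Best overall; the core value lies in the pandoc post-processing layer. The remaining 2 bugs are fixable.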
+
+---
+
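+## Architecture Decision
+
+```
+Final approach: Pandoc (underlying engine) + doc-to-markdown post-processing (value-add layer)
+
+Rationale:
+1. Pandoc has the most reliable image extraction (10/10) and the most complete text (9/10)
+2. Pandoc's problems (grid tables, attributes, escapes) can all be solved by post-processing
+3. The critical problems of Docling/MarkItDown/Mammoth (lost images, lost headings) cannot be fixed by post-processing
+4. The post-processing layer is our core differentiator: low cost and easy to iterate on
+```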
+
+---
+
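+## Test File Characteristics
+
+What makes this test file difficult:
+
+| Characteristic | Description | Impact |
+|------|------|------|
+| WPS export | Non-standard Word styles (Style ID 2/3 instead of Heading 1/2) | mammoth/markitdown/docling lose all headings |
+| Multi-column image layout | 2x2 and 1x4 image grids laid out with tables | pandoc emits grid tables |
+| Info/tip boxes | Single-column table wrapping text | pandoc emits grid tables |
+| Nested tables | Tables inside tables | Cannot be expressed as pipe tables |
+| JSON code blocks | No code-block style; shown via text boxes/indentation | Most tools fail to recognize them as code |
+| 19MB file | 77 embedded screenshots | base64 approaches produce 28MB output |
+
+These characteristics represent the typical difficulties of docx files exported from WPS/Feishu documents in the real world, which makes this a valid benchmark scenario.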
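
The benchmark file above does not say how the overall row is derived from the five dimension scores. The figures are consistent with an unweighted mean that skips any N/A dimension, so the sketch below works through that arithmetic under that assumption. The helper name `composite` is hypothetical, and Mammoth is left out because its table score is given as a range.

```python
from typing import Optional


def composite(scores: list[Optional[float]]) -> float:
    """Unweighted mean over the dimensions that were actually scored.

    Assumption: N/A entries (None here) are excluded from the average.
    """
    rated = [s for s in scores if s is not None]
    return sum(rated) / len(rated)


# Dimension order: tables, images, text, cleanliness, code blocks.
benchmark = {
    "Docling": [5, 4, 8, 5, 2],
    "MarkItDown": [3, 2, 7, 5, 1],
    "Pandoc": [5, 10, 9, 5, None],  # code blocks scored N/A
    "doc-to-markdown": [6, 7, 9, 7, 9],
}

for name, scores in benchmark.items():
    print(f"{name}: {composite(scores):.2f}")
```

Under this reading Pandoc's mean is 7.25 over four dimensions, which the table reports as 7.3, while doc-to-markdown's 7.6 is averaged over all five.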
--- a/doc-to-markdown/references/conversion-examples.md
+++ b/doc-to-markdown/references/conversion-examples.md
--- a/doc-to-markdown/references/heavy-mode-guide.md
+++ b/doc-to-markdown/references/heavy-mode-guide.md
--- a/doc-to-markdown/references/tool-comparison.md
+++ b/doc-to-markdown/references/tool-comparison.md
--- a/doc-to-markdown/scripts/convert.py
+++ b/doc-to-markdown/scripts/convert.py
--- a/doc-to-markdown/scripts/convert_path.py
+++ b/doc-to-markdown/scripts/convert_path.py
--- a/doc-to-markdown/scripts/extract_pdf_images.py
+++ b/doc-to-markdown/scripts/extract_pdf_images.py
--- a/doc-to-markdown/scripts/merge_outputs.py
+++ b/doc-to-markdown/scripts/merge_outputs.py
--- a/doc-to-markdown/scripts/validate_output.py
+++ b/doc-to-markdown/scripts/validate_output.py
--- a/markdown-tools/scripts/convert.py
+++ b/markdown-tools/scripts/convert.py
@@ -1,434 +0,0 @@
-#!/usr/bin/env python3
-"""
-Multi-tool document to markdown converter with intelligent orchestration.
-
-Supports Quick Mode (fast, single tool) and Heavy Mode (best quality, multi-tool merge).
-
-Usage:
-    # Quick Mode (default) - fast, single best tool
-    uv run --with pymupdf4llm --with markitdown scripts/convert.py document.pdf -o output.md
-
-    # Heavy Mode - multi-tool parallel execution with merge
-    uv run --with pymupdf4llm --with markitdown scripts/convert.py document.pdf -o output.md --heavy
-
-    # With image extraction
-    uv run --with pymupdf4llm scripts/convert.py document.pdf -o output.md --assets-dir ./images
-
-Dependencies:
-    - pymupdf4llm: PDF conversion (LLM-optimized)
-    - markitdown: PDF/DOCX/PPTX conversion
-    - pandoc: DOCX/PPTX conversion (system install: brew install pandoc)
-"""
-
-import argparse
-import subprocess
-import sys
-import tempfile
-import shutil
-from dataclasses import dataclass, field
-from pathlib import Path
-from typing import Optional
-
-
-@dataclass
-class ConversionResult:
-    """Result from a single tool conversion."""
-    markdown: str
-    tool: str
-    images: list[str] = field(default_factory=list)
-    success: bool = True
-    error: str = ""
-
-
-def check_tool_available(tool: str) -> bool:
-    """Check if a conversion tool is available."""
-    if tool == "pymupdf4llm":
-        try:
-            import pymupdf4llm
-            return True
-        except ImportError:
-            return False
-    elif tool == "markitdown":
-        try:
-            import markitdown
-            return True
-        except ImportError:
-            return False
-    elif tool == "pandoc":
-        return shutil.which("pandoc") is not None
-    return False
-
-
-def select_tools(file_path: Path, mode: str) -> list[str]:
-    """Select conversion tools based on file type and mode."""
-    ext = file_path.suffix.lower()
-
-    # Tool preferences by format
-    tool_map = {
-        ".pdf": {
-            "quick": ["pymupdf4llm", "markitdown"],  # fallback order
-            "heavy": ["pymupdf4llm", "markitdown"],
-        },
-        ".docx": {
-            "quick": ["pandoc", "markitdown"],
-            "heavy": ["pandoc", "markitdown"],
-        },
-        ".doc": {
-            "quick": ["pandoc", "markitdown"],
-            "heavy": ["pandoc", "markitdown"],
-        },
-        ".pptx": {
-            "quick": ["markitdown", "pandoc"],
-            "heavy": ["markitdown", "pandoc"],
-        },
-        ".xlsx": {
-            "quick": ["markitdown"],
-            "heavy": ["markitdown"],
-        },
-    }
-
-    tools = tool_map.get(ext, {"quick": ["markitdown"], "heavy": ["markitdown"]})
-
-    if mode == "quick":
-        # Return first available tool
-        for tool in tools["quick"]:
-            if check_tool_available(tool):
-                return [tool]
-        return []
-    else:  # heavy
-        # Return all available tools
-        return [t for t in tools["heavy"] if check_tool_available(t)]
-
-
-def convert_with_pymupdf4llm(
-    file_path: Path, assets_dir: Optional[Path] = None
-) -> ConversionResult:
-    """Convert using PyMuPDF4LLM (best for PDFs)."""
-    try:
-        import pymupdf4llm
-
-        kwargs = {}
-        images = []
-
-        if assets_dir:
-            assets_dir.mkdir(parents=True, exist_ok=True)
-            kwargs["write_images"] = True
-            kwargs["image_path"] = str(assets_dir)
-            kwargs["dpi"] = 150
-
-        # Use best table detection strategy
-        kwargs["table_strategy"] = "lines_strict"
-
-        md_text = pymupdf4llm.to_markdown(str(file_path), **kwargs)
-
-        # Collect extracted images
-        if assets_dir and assets_dir.exists():
-            images = [str(p) for p in assets_dir.glob("*.png")]
-            images.extend([str(p) for p in assets_dir.glob("*.jpg")])
-
-        return ConversionResult(
-            markdown=md_text, tool="pymupdf4llm", images=images, success=True
-        )
-    except Exception as e:
-        return ConversionResult(
-            markdown="", tool="pymupdf4llm", success=False, error=str(e)
-        )
-
-
-def convert_with_markitdown(
-    file_path: Path, assets_dir: Optional[Path] = None
-) -> ConversionResult:
-    """Convert using markitdown."""
-    try:
-        # markitdown CLI approach
-        result = subprocess.run(
-            ["markitdown", str(file_path)],
-            capture_output=True,
-            text=True,
-            timeout=120,
-        )
-
-        if result.returncode != 0:
-            return ConversionResult(
-                markdown="",
-                tool="markitdown",
-                success=False,
-                error=result.stderr,
-            )
-
-        return ConversionResult(
-            markdown=result.stdout, tool="markitdown", success=True
-        )
-    except FileNotFoundError:
-        # Try Python API
-        try:
-            from markitdown import MarkItDown
-
-            md = MarkItDown()
-            result = md.convert(str(file_path))
-            return ConversionResult(
-                markdown=result.text_content, tool="markitdown", success=True
-            )
-        except Exception as e:
-            return ConversionResult(
-                markdown="", tool="markitdown", success=False, error=str(e)
-            )
-    except Exception as e:
-        return ConversionResult(
-            markdown="", tool="markitdown", success=False, error=str(e)
-        )
-
-
-def convert_with_pandoc(
-    file_path: Path, assets_dir: Optional[Path] = None
-) -> ConversionResult:
-    """Convert using pandoc."""
-    try:
-        cmd = ["pandoc", str(file_path), "-t", "markdown", "--wrap=none"]
-
-        if assets_dir:
-            assets_dir.mkdir(parents=True, exist_ok=True)
-            cmd.extend(["--extract-media", str(assets_dir)])
-
-        result = subprocess.run(
-            cmd, capture_output=True, text=True, timeout=120
-        )
-
-        if result.returncode != 0:
-            return ConversionResult(
-                markdown="", tool="pandoc", success=False, error=result.stderr
-            )
-
-        images = []
-        if assets_dir and assets_dir.exists():
-            images = [str(p) for p in assets_dir.rglob("*.png")]
-            images.extend([str(p) for p in assets_dir.rglob("*.jpg")])
-
-        return ConversionResult(
-            markdown=result.stdout, tool="pandoc", images=images, success=True
-        )
-    except Exception as e:
-        return ConversionResult(
-            markdown="", tool="pandoc", success=False, error=str(e)
-        )
-
-
-def convert_single(
-    file_path: Path, tool: str, assets_dir: Optional[Path] = None
-) -> ConversionResult:
-    """Run a single conversion tool."""
-    converters = {
-        "pymupdf4llm": convert_with_pymupdf4llm,
-        "markitdown": convert_with_markitdown,
-        "pandoc": convert_with_pandoc,
-    }
-
-    converter = converters.get(tool)
-    if not converter:
-        return ConversionResult(
-            markdown="", tool=tool, success=False, error=f"Unknown tool: {tool}"
-        )
-
-    return converter(file_path, assets_dir)
-
-
-def merge_results(results: list[ConversionResult]) -> ConversionResult:
-    """Merge results from multiple tools, selecting best segments."""
-    if not results:
-        return ConversionResult(markdown="", tool="none", success=False)
-
-    # Filter successful results
-    successful = [r for r in results if r.success and r.markdown.strip()]
-    if not successful:
-        # Return first error
-        return results[0] if results else ConversionResult(
-            markdown="", tool="none", success=False
-        )
-
-    if len(successful) == 1:
-        return successful[0]
-
-    # Multiple successful results - merge them
-    # Strategy: Compare key metrics and select best
-    best = successful[0]
-    best_score = score_markdown(best.markdown)
-
-    for result in successful[1:]:
-        score = score_markdown(result.markdown)
-        if score > best_score:
-            best = result
-            best_score = score
-
-    # Merge images from all results
-    all_images = []
-    seen = set()
-    for result in successful:
-        for img in result.images:
-            if img not in seen:
-                all_images.append(img)
-                seen.add(img)
-
-    best.images = all_images
-    best.tool = f"merged({','.join(r.tool for r in successful)})"
-
-    return best
-
-
-def score_markdown(md: str) -> float:
-    """Score markdown quality for comparison."""
-    score = 0.0
-
-    # Length (more content is generally better)
-    score += min(len(md) / 10000, 5.0)  # Cap at 5 points
-
-    # Tables (proper markdown tables)
-    table_count = md.count("|---|") + md.count("| ---")
-    score += min(table_count * 0.5, 3.0)
-
-    # Images (referenced images)
-    image_count = md.count("![")
-    score += min(image_count * 0.3, 2.0)
-
-    # Headings (proper hierarchy)
-    h1_count = md.count("\n# ")
-    h2_count = md.count("\n## ")
-    h3_count = md.count("\n### ")
-    if h1_count > 0 and h2_count >= h1_count:
-        score += 1.0  # Good hierarchy
-
-    # Lists (structured content)
-    list_count = md.count("\n- ") + md.count("\n* ") + md.count("\n1. ")
-    score += min(list_count * 0.1, 2.0)
-
-    return score
-
-
-def main():
-    parser = argparse.ArgumentParser(
-        description="Convert documents to markdown with multi-tool orchestration",
-        formatter_class=argparse.RawDescriptionHelpFormatter,
-        epilog="""
-Examples:
-    # Quick mode (default)
-    python convert.py document.pdf -o output.md
-
-    # Heavy mode (best quality)
-    python convert.py document.pdf -o output.md --heavy
-
-    # With custom assets directory
-    python convert.py document.pdf -o output.md --assets-dir ./images
-        """,
-    )
-    parser.add_argument("input", type=Path, help="Input document path")
-    parser.add_argument(
-        "-o", "--output", type=Path, help="Output markdown file"
-    )
-    parser.add_argument(
-        "--heavy",
-        action="store_true",
-        help="Enable Heavy Mode (multi-tool, best quality)",
-    )
-    parser.add_argument(
-        "--assets-dir",
-        type=Path,
-        default=None,
-        help="Directory for extracted images (default: <output>_assets/)",
-    )
-    parser.add_argument(
-        "--tool",
-        choices=["pymupdf4llm", "markitdown", "pandoc"],
-        help="Force specific tool (overrides auto-selection)",
-    )
-    parser.add_argument(
-        "--list-tools",
-        action="store_true",
-        help="List available tools and exit",
-    )
-
-    args = parser.parse_args()
-
-    # List tools mode
-    if args.list_tools:
-        tools = ["pymupdf4llm", "markitdown", "pandoc"]
-        print("Available conversion tools:")
-        for tool in tools:
-            status = "✓" if check_tool_available(tool) else "✗"
-            print(f"  {status} {tool}")
-        sys.exit(0)
-
-    # Validate input
-    if not args.input.exists():
-        print(f"Error: Input file not found: {args.input}", file=sys.stderr)
-        sys.exit(1)
-
-    # Determine output path
-    output_path = args.output or args.input.with_suffix(".md")
-
-    # Determine assets directory
-    assets_dir = args.assets_dir
-    if assets_dir is None and args.heavy:
-        assets_dir = output_path.parent / f"{output_path.stem}_assets"
-
-    # Select tools
-    mode = "heavy" if args.heavy else "quick"
-    if args.tool:
-        tools = [args.tool] if check_tool_available(args.tool) else []
-    else:
-        tools = select_tools(args.input, mode)
-
-    if not tools:
-        print("Error: No conversion tools available.", file=sys.stderr)
-        print("Install with:", file=sys.stderr)
-        print("  pip install pymupdf4llm", file=sys.stderr)
-        print("  uv tool install markitdown[pdf]", file=sys.stderr)
-        print("  brew install pandoc", file=sys.stderr)
-        sys.exit(1)
-
-    print(f"Converting: {args.input}")
-    print(f"Mode: {mode.upper()}")
-    print(f"Tools: {', '.join(tools)}")
-
-    # Run conversions
-    results = []
-    for tool in tools:
-        print(f"  Running {tool}...", end=" ", flush=True)
-
-        # Use separate assets dirs for each tool in heavy mode
-        tool_assets = None
-        if assets_dir and mode == "heavy" and len(tools) > 1:
-            tool_assets = assets_dir / tool
-        elif assets_dir:
-            tool_assets = assets_dir
-
-        result = convert_single(args.input, tool, tool_assets)
-        results.append(result)
-
-        if result.success:
-            print(f"✓ ({len(result.markdown):,} chars, {len(result.images)} images)")
-        else:
-            print(f"✗ ({result.error[:50]}...)")
-
-    # Merge results if heavy mode
-    if mode == "heavy" and len(results) > 1:
-        print("  Merging results...", end=" ", flush=True)
-        final = merge_results(results)
-        print(f"✓ (using {final.tool})")
-    else:
-        final = merge_results(results)
-
-    if not final.success:
-        print(f"Error: Conversion failed: {final.error}", file=sys.stderr)
-        sys.exit(1)
-
-    # Write output
-    output_path.parent.mkdir(parents=True, exist_ok=True)
-    output_path.write_text(final.markdown)
-
-    print(f"\nOutput: {output_path}")
-    print(f"  Size: {len(final.markdown):,} characters")
-    if final.images:
-        print(f"  Images: {len(final.images)} extracted")
-
-
-if __name__ == "__main__":
-    main()