refactor: rename markdown-tools → doc-to-markdown (v2.0.0)
- Rename skill to better reflect its purpose (document-to-markdown conversion)
- Update SKILL.md name, description, and trigger keywords
- Add benchmark reference (2026-03-22)
- Update marketplace.json entry (name, skills path, version 2.0.0)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -50,25 +50,23 @@
]
},
{
"name": "markdown-tools",
"description": "Convert documents (PDFs, Word, PowerPoint) to high-quality markdown with multi-tool orchestration. Supports Quick Mode (fast, single tool) and Heavy Mode (best quality, multi-tool merge with segment-level selection). Features PyMuPDF4LLM for LLM-optimized PDF conversion, pandoc for DOCX/PPTX structure preservation, quality validation with HTML reports, and image extraction with metadata",
"name": "doc-to-markdown",
"description": "Converts DOCX/PDF/PPTX to high-quality Markdown with automatic post-processing. Fixes pandoc grid tables, image paths, attribute noise, and code blocks. Supports Quick Mode (fast, single tool) and Heavy Mode (best quality, multi-tool merge). Trigger on \"convert document\", \"docx to markdown\", \"parse word\", \"doc to markdown\", \"extract images from document\".",
"source": "./",
"strict": false,
"version": "1.2.0",
"version": "2.0.0",
"category": "document-conversion",
"keywords": [
"markdown",
"pdf",
"docx",
"pdf",
"pptx",
"pymupdf4llm",
"converter",
"pandoc",
"markitdown",
"heavy-mode",
"quality-validation"
"document"
],
"skills": [
"./markdown-tools"
"./doc-to-markdown"
]
},
{
@@ -942,4 +940,4 @@
]
}
]
}
}
@@ -1,11 +1,11 @@
---
name: markdown-tools
description: Converts documents to markdown with multi-tool orchestration for best quality. Supports Quick Mode (fast, single tool) and Heavy Mode (best quality, multi-tool merge). Use when converting PDF/DOCX/PPTX files to markdown, extracting images from documents, validating conversion quality, or needing LLM-optimized document output.
name: doc-to-markdown
description: Converts DOCX/PDF/PPTX to high-quality Markdown with automatic post-processing. Fixes pandoc grid tables, image paths, attribute noise, and code blocks. Supports Quick Mode (fast, single tool) and Heavy Mode (best quality, multi-tool merge). Trigger on "convert document", "docx to markdown", "parse word", "doc to markdown", "extract images from document".
---

# Markdown Tools
# Doc to Markdown

Convert documents to high-quality markdown with intelligent multi-tool orchestration.
Convert documents to high-quality markdown with intelligent multi-tool orchestration and automatic DOCX post-processing.

## Dual Mode Architecture

@@ -34,6 +34,9 @@ uv run --with pymupdf4llm --with markitdown scripts/convert.py document.pdf -o o
# Heavy Mode - multi-tool parallel execution with merge
uv run --with pymupdf4llm --with markitdown scripts/convert.py document.pdf -o output.md --heavy

# DOCX with deep python-docx parsing (experimental)
uv run --with pymupdf4llm --with markitdown --with python-docx scripts/convert.py document.docx -o output.md --docx-deep

# Check available tools
uv run scripts/convert.py --list-tools
```
@@ -43,7 +46,7 @@ uv run scripts/convert.py --list-tools
| Format | Quick Mode Tool | Heavy Mode Tools |
|--------|----------------|------------------|
| PDF | pymupdf4llm | pymupdf4llm + markitdown |
| DOCX | pandoc | pandoc + markitdown |
| DOCX | pandoc + post-processing | pandoc + markitdown |
| PPTX | markitdown | markitdown + pandoc |
| XLSX | markitdown | markitdown |

@@ -53,6 +56,21 @@ uv run scripts/convert.py --list-tools
- **markitdown**: Microsoft's universal converter, good for Office formats
- **pandoc**: Excellent structure preservation for DOCX/PPTX

## DOCX Post-Processing (automatic)

When converting DOCX files via pandoc, the following cleanups are applied automatically:

| Problem | Fix |
|---------|-----|
| Grid tables (`+:---+` syntax) | Single-column -> blockquote, multi-column -> split images |
| Image double path (`media/media/`) | Flatten to `media/` |
| Pandoc attributes (`{width="..." height="..."}`) | Removed |
| Inline classes (`{.underline}`, `{.mark}`) | Removed |
| Indented dashed code blocks | Converted to fenced code blocks (```) |
| Escaped brackets (`\[...\]`) | Unescaped to `[...]` |
| Double-bracket links (`[[text]{...}](url)`) | Simplified to `[text](url)` |
| Escaped quotes in code (`\"`) | Fixed to `"` |
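
A few of the cleanups in the table above can be sketched with plain string and regex substitutions. This is a minimal illustration, not the code in `scripts/convert.py` (whose diff is suppressed on this page): the function name `clean_pandoc_markdown` and the exact patterns are assumptions, and "Removed" for inline classes is interpreted here as unwrapping `[text]{.underline}` to `text`.

```python
import re


def clean_pandoc_markdown(text: str) -> str:
    """Apply a handful of pandoc-output cleanups (illustrative sketch only)."""
    # Flatten the doubled image directory: media/media/ -> media/
    text = text.replace("media/media/", "media/")
    # Simplify double-bracket links: [[text]{.underline}](url) -> [text](url)
    text = re.sub(r"\[\[([^\]]+)\]\{[^}]*\}\]\(([^)]+)\)", r"[\1](\2)", text)
    # Drop size attributes such as {width="..." height="..."}
    text = re.sub(r'\{(?:width|height)="[^"]*"[^}]*\}', "", text)
    # Unwrap inline class spans (assumed behaviour): [text]{.mark} -> text
    text = re.sub(r"\[([^\]]+)\]\{\.(?:underline|mark)\}", r"\1", text)
    # Unescape brackets: \[ ... \] -> [ ... ]
    text = text.replace("\\[", "[").replace("\\]", "]")
    return text
```

The order matters in this sketch: double-bracket links are simplified before the generic attribute and class patterns run, and brackets are unescaped last so that escaped brackets are never mistaken for span syntax.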

## Heavy Mode Workflow

Heavy Mode runs multiple tools in parallel and selects the best segments:
@@ -117,7 +135,7 @@ python scripts/merge_outputs.py output1.md output2.md -o merged.md --verbose
## Path Conversion (Windows/WSL)

```bash
# Windows → WSL conversion
# Windows to WSL conversion
python scripts/convert_path.py "C:\Users\name\Documents\file.pdf"
# Output: /mnt/c/Users/name/Documents/file.pdf
```
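
The mapping shown in that example (drive letter to `/mnt/<drive>/`, backslashes to forward slashes) could be written roughly as follows. `scripts/convert_path.py` itself is not part of this diff, so the function name `windows_to_wsl` and its handling of non-drive paths are assumptions made for illustration.

```python
import re


def windows_to_wsl(path: str) -> str:
    """Map a drive-letter Windows path to its /mnt/<drive>/ WSL equivalent."""
    match = re.match(r"^([A-Za-z]):[\\/](.*)$", path)
    if match is None:
        # Not a drive-letter path; return it unchanged (assumed behaviour).
        return path
    drive, rest = match.groups()
    return f"/mnt/{drive.lower()}/" + rest.replace("\\", "/")


print(windows_to_wsl(r"C:\Users\name\Documents\file.pdf"))
```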
@@ -147,7 +165,7 @@ brew install pandoc

| Script | Purpose |
|--------|---------|
| `convert.py` | Main orchestrator with Quick/Heavy mode |
| `convert.py` | Main orchestrator with Quick/Heavy mode + DOCX post-processing |
| `merge_outputs.py` | Merge multiple markdown outputs |
| `validate_output.py` | Quality validation with HTML report |
| `extract_pdf_images.py` | PDF image extraction with metadata |
163
doc-to-markdown/references/benchmark-2026-03-22.md
Normal file
@@ -0,0 +1,163 @@
# DOCX→Markdown Conversion Benchmark

> **Test date**: 2026-03-22
>
> **Test file**: `助教-【腾讯云🦞】小白实践 OpenClaw 保姆级教程.docx` (19 MB, 77 images; contains grid-table layouts, JSON code blocks, side-by-side multi-column images, and info boxes)
>
> **Method**: five approaches converted the same file, and each was scored out of 10 on five dimensions

---

## Overall Scores

| Dimension | Docling (IBM) | MarkItDown (MS) | Pandoc | Mammoth | **doc-to-markdown (ours)** |
|------|:---:|:---:|:---:|:---:|:---:|
| Table quality | 5 | 3 | 5 | 1-3 | **6** |
| Image extraction | 4 | 2 | **10** | 5 | 7 |
| Text completeness | 8 | 7 | **9** | 7 | **9** |
| Formatting cleanliness | 5 | 5 | 5 | 3 | **7** |
| Code blocks | 2 | 1 | N/A | 1 | **9** |
| **Overall** | **4.8** | **3.6** | **7.3** | **3.4-3.6** | **7.6** |
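
The overall row appears to be the unweighted mean of the per-dimension scores, with the N/A entry left out for Pandoc; the benchmark does not state the formula, so that reading is an assumption. A quick check against the table (Mammoth is omitted because its table score is a range):

```python
from statistics import mean

# Per-dimension scores copied from the table above; None marks N/A.
scores = {
    "Docling": [5, 4, 8, 5, 2],
    "MarkItDown": [3, 2, 7, 5, 1],
    "Pandoc": [5, 10, 9, 5, None],
    "doc-to-markdown": [6, 7, 9, 7, 9],
}


def overall(values):
    """Unweighted mean over the dimensions that were actually scored."""
    rated = [v for v in values if v is not None]
    return mean(rated)


for tool, values in scores.items():
    print(f"{tool}: {overall(values):.2f}")
```

Under this assumption Pandoc averages 7.25 over four dimensions, which the table reports as 7.3; the other three columns match the table exactly.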

---

## Detailed Analysis

### 1. IBM Docling (overall 4.8)

- **Version**: docling 2.x + Granite-Docling-258M
- **Architecture**: AI-driven (VLM, vision-language model); DocTags intermediate format → Markdown

**Critical problems**:
- Every image reference is an `<!-- image -->` placeholder (0 of 77 images render); the `ImageRefMode` API is not available for DOCX
- All heading levels are lost (0 `#` headings); every heading degrades to bold text
- No code blocks at all; JSON and commands are emitted as ordinary paragraphs
- `api_key` is incorrectly escaped as `api\_key`

**Strengths**:
- Text content is complete; Chinese, emoji, and links are preserved well
- No grid-table or HTML residue
- Table syntax is correct (pipe tables), but the cell contents are placeholders

**Conclusion**: Docling's strength is PDF (the AAAI 2025 paper scenario); its DOCX support is far from production-ready.

### 2. Microsoft MarkItDown (overall 3.6)

- **Version**: markitdown 0.1.5
- **Architecture**: calls mammoth under the hood → HTML → markdownify → Markdown

**Critical problems**:
- All 77 images are truncated base64 placeholders (`data:image/png;base64,...`); the default `keep_data_uris=False` deliberately discards the image data
- All headings become bold text (mammoth cannot recognize WPS custom styles)
- No code blocks at all; JSON is stuffed into table cells
- Ordered-list numbering is entirely wrong (output is `1. 1. 1.`)

**Strengths**:
- No leftover HTML tags
- Text content is largely complete

**Conclusion**: MarkItDown's markdownify post-processing actually introduces destructive truncation. It is usable for lightweight cases but unreliable for complex DOCX.

### 3. Pandoc (overall 7.3)

- **Version**: pandoc 3.9
- **Architecture**: native Haskell AST parsing; supports 60+ formats

**Three option sets were tested**:

| Options | Result |
|------|------|
| `-t gfm` | Worst: 24 nested HTML `<table>` elements, 74 HTML `<img>` tags |
| `-t markdown` | Best: grid tables (post-processable), no HTML |
| `-t markdown-raw_html-...` | Identical to `markdown`; the extra options had no effect |

**Problems**:
- Grid tables are unavoidable (the source docx has multi-line cells and nested tables, which pipe tables cannot express)
- `{width="..." height="..."}`: 68 occurrences
- `{.underline}`: 6 occurrences
- Excessive backslash escaping: 37 occurrences

**Strengths**:
- Image extraction 10/10 (all 77 images correct, consistent path structure)
- Text completeness 9/10 (content, links, and emoji all preserved)
- The most mature and stable underlying engine

**Conclusion**: Pandoc is the most reliable underlying engine and gives the highest output quality, but its pandoc-specific syntax needs post-processing cleanup.

### 4. Mammoth (overall 3.4-3.6)

- **Version**: mammoth 1.11.0
- **Architecture**: its own DOCX (OOXML) parser → HTML/Markdown (Markdown support is deprecated)

**Two methods were tested**:

| Method | Overall |
|------|------|
| Method A: convert directly to Markdown | 3.4 (tables lost entirely) |
| Method B: convert to HTML → markdownify | 3.6 (tables present, but nesting is flattened) |

**Critical problems**:
- All headings are lost (the style definitions in the WPS `styles.xml` are empty, so mammoth cannot map them to Heading styles)
- No code blocks at all
- All images are embedded inline as base64; the single output file is 28 MB
- In Method B, markdownify drops 14 images (63 of 77 remain)

**Conclusion**: Mammoth's Markdown support is deprecated, and it handles WPS-exported docx files poorly. Not recommended.

### 5. doc-to-markdown / our approach (overall 7.6)

- **Version**: doc-to-markdown 1.0 (pandoc + 6 post-processing functions)
- **Architecture**: Pandoc conversion → automatic post-processing (grid-table cleanup, image-path fixes, attribute cleanup, code-block fixes, escape fixes)

**Measured effect of post-processing**:

| Post-processing function | Fixes applied |
|-----------|---------|
| `_convert_grid_tables` | 11 grid tables → pipe table / blockquote |
| `_clean_pandoc_attributes` | 3437 characters of attributes removed |
| `_fix_code_blocks` | 22 indented dashed blocks → ``` code blocks |
| `_fix_escaped_brackets` | 10 occurrences |
| `_fix_double_bracket_links` | 1 occurrence |
| `_fix_image_paths` | 77 image paths fixed |
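
As an illustration of the single-column case handled by `_convert_grid_tables`, a one-column pandoc grid table (the kind produced for info boxes) can be turned into a blockquote along these lines. The real function is in the suppressed `convert.py` diff, so everything below, including the name `single_column_grid_to_blockquote`, is a hypothetical sketch rather than the actual implementation.

```python
import re

# A single-column border has exactly two '+' characters: +------+
BORDER = re.compile(r"^\+[-=:]+\+$")
# A content row is wrapped in a leading and trailing pipe: | text |
ROW = re.compile(r"^\|(.*)\|$")


def single_column_grid_to_blockquote(block: str) -> str:
    """Convert a one-column grid table to a blockquote.

    Returns the block unchanged if it does not look like such a table.
    """
    lines = block.strip().splitlines()
    if len(lines) < 3 or not BORDER.match(lines[0]) or not BORDER.match(lines[-1]):
        return block
    quoted = []
    for line in lines[1:-1]:
        if BORDER.match(line):
            # An interior border separates rows; keep a blank quoted line.
            quoted.append(">")
            continue
        row = ROW.match(line)
        if row is None:
            return block
        cell = row.group(1).strip()
        quoted.append(f"> {cell}" if cell else ">")
    return "\n".join(quoted)
```

Multi-column tables fail the border check (their borders contain interior `+` characters) and are returned untouched, which leaves room for a separate multi-column strategy.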
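
**Known issues (to be fixed)**:
- Doubly nested image path bug: pandoc creates an extra `media/` level inside the directory given by `--assets-dir`
- 2 leftover grid tables (the side-by-side image group at the end of the document is not fully converted)

**Strengths**:
- Code-block detection 9/10 (JSON gets a language tag; command lines are wrapped correctly)
- Formatting cleanliness 7/10 (attributes, escapes, and grid tables are mostly cleaned up)
- Text completeness 9/10 (all key content preserved)

**Conclusion**: Best overall; the core value lies in the pandoc post-processing layer. The remaining 2 bugs are fixable.

---
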
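## Architecture Decision

```
Final approach: Pandoc (underlying engine) + doc-to-markdown post-processing (value-add layer)

Rationale:
1. Pandoc has the most reliable image extraction (10/10) and the most complete text (9/10)
2. Pandoc's problems (grid tables, attributes, escaping) can all be solved by post-processing
3. The critical problems of Docling/MarkItDown/Mammoth (lost images, lost headings) cannot be repaired by post-processing
4. The post-processing layer is our core advantage: low cost and easy to iterate on
```

---
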
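## Test File Characteristics

What makes this test file difficult:

| Characteristic | Description | Impact |
|------|------|------|
| WPS export | Non-standard Word styles (Style ID 2/3 instead of Heading 1/2) | mammoth/markitdown/docling lose all headings |
| Multi-column image layout | 2x2 and 1x4 image grids laid out with tables | pandoc emits grid tables |
| Info boxes / callouts | Text wrapped in a single-column table | pandoc emits grid tables |
| Nested tables | Tables inside tables | Pipe tables cannot express them |
| JSON code blocks | Not styled as code; represented with text boxes or indentation | Most tools fail to recognize them as code |
| 19 MB file | 77 embedded screenshots | The base64 approach yields a 28 MB output |

These characteristics represent the typical difficulties of real-world docx files exported from WPS/Feishu documents, which makes this a valid benchmark scenario.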
1150
doc-to-markdown/scripts/convert.py
Executable file
File diff suppressed because it is too large
@@ -1,434 +0,0 @@
#!/usr/bin/env python3
"""
Multi-tool document to markdown converter with intelligent orchestration.

Supports Quick Mode (fast, single tool) and Heavy Mode (best quality, multi-tool merge).

Usage:
    # Quick Mode (default) - fast, single best tool
    uv run --with pymupdf4llm --with markitdown scripts/convert.py document.pdf -o output.md

    # Heavy Mode - multi-tool parallel execution with merge
    uv run --with pymupdf4llm --with markitdown scripts/convert.py document.pdf -o output.md --heavy

    # With image extraction
    uv run --with pymupdf4llm scripts/convert.py document.pdf -o output.md --assets-dir ./images

Dependencies:
    - pymupdf4llm: PDF conversion (LLM-optimized)
    - markitdown: PDF/DOCX/PPTX conversion
    - pandoc: DOCX/PPTX conversion (system install: brew install pandoc)
"""

import argparse
import subprocess
import sys
import tempfile
import shutil
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional


@dataclass
class ConversionResult:
    """Result from a single tool conversion."""
    markdown: str
    tool: str
    images: list[str] = field(default_factory=list)
    success: bool = True
    error: str = ""


def check_tool_available(tool: str) -> bool:
    """Check if a conversion tool is available."""
    if tool == "pymupdf4llm":
        try:
            import pymupdf4llm
            return True
        except ImportError:
            return False
    elif tool == "markitdown":
        try:
            import markitdown
            return True
        except ImportError:
            return False
    elif tool == "pandoc":
        return shutil.which("pandoc") is not None
    return False


def select_tools(file_path: Path, mode: str) -> list[str]:
    """Select conversion tools based on file type and mode."""
    ext = file_path.suffix.lower()

    # Tool preferences by format
    tool_map = {
        ".pdf": {
            "quick": ["pymupdf4llm", "markitdown"],  # fallback order
            "heavy": ["pymupdf4llm", "markitdown"],
        },
        ".docx": {
            "quick": ["pandoc", "markitdown"],
            "heavy": ["pandoc", "markitdown"],
        },
        ".doc": {
            "quick": ["pandoc", "markitdown"],
            "heavy": ["pandoc", "markitdown"],
        },
        ".pptx": {
            "quick": ["markitdown", "pandoc"],
            "heavy": ["markitdown", "pandoc"],
        },
        ".xlsx": {
            "quick": ["markitdown"],
            "heavy": ["markitdown"],
        },
    }

    tools = tool_map.get(ext, {"quick": ["markitdown"], "heavy": ["markitdown"]})

    if mode == "quick":
        # Return first available tool
        for tool in tools["quick"]:
            if check_tool_available(tool):
                return [tool]
        return []
    else:  # heavy
        # Return all available tools
        return [t for t in tools["heavy"] if check_tool_available(t)]


def convert_with_pymupdf4llm(
    file_path: Path, assets_dir: Optional[Path] = None
) -> ConversionResult:
    """Convert using PyMuPDF4LLM (best for PDFs)."""
    try:
        import pymupdf4llm

        kwargs = {}
        images = []

        if assets_dir:
            assets_dir.mkdir(parents=True, exist_ok=True)
            kwargs["write_images"] = True
            kwargs["image_path"] = str(assets_dir)
            kwargs["dpi"] = 150

        # Use best table detection strategy
        kwargs["table_strategy"] = "lines_strict"

        md_text = pymupdf4llm.to_markdown(str(file_path), **kwargs)

        # Collect extracted images
        if assets_dir and assets_dir.exists():
            images = [str(p) for p in assets_dir.glob("*.png")]
            images.extend([str(p) for p in assets_dir.glob("*.jpg")])

        return ConversionResult(
            markdown=md_text, tool="pymupdf4llm", images=images, success=True
        )
    except Exception as e:
        return ConversionResult(
            markdown="", tool="pymupdf4llm", success=False, error=str(e)
        )


def convert_with_markitdown(
    file_path: Path, assets_dir: Optional[Path] = None
) -> ConversionResult:
    """Convert using markitdown."""
    try:
        # markitdown CLI approach
        result = subprocess.run(
            ["markitdown", str(file_path)],
            capture_output=True,
            text=True,
            timeout=120,
        )

        if result.returncode != 0:
            return ConversionResult(
                markdown="",
                tool="markitdown",
                success=False,
                error=result.stderr,
            )

        return ConversionResult(
            markdown=result.stdout, tool="markitdown", success=True
        )
    except FileNotFoundError:
        # Try Python API
        try:
            from markitdown import MarkItDown

            md = MarkItDown()
            result = md.convert(str(file_path))
            return ConversionResult(
                markdown=result.text_content, tool="markitdown", success=True
            )
        except Exception as e:
            return ConversionResult(
                markdown="", tool="markitdown", success=False, error=str(e)
            )
    except Exception as e:
        return ConversionResult(
            markdown="", tool="markitdown", success=False, error=str(e)
        )


def convert_with_pandoc(
    file_path: Path, assets_dir: Optional[Path] = None
) -> ConversionResult:
    """Convert using pandoc."""
    try:
        cmd = ["pandoc", str(file_path), "-t", "markdown", "--wrap=none"]

        if assets_dir:
            assets_dir.mkdir(parents=True, exist_ok=True)
            cmd.extend(["--extract-media", str(assets_dir)])

        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=120
        )

        if result.returncode != 0:
            return ConversionResult(
                markdown="", tool="pandoc", success=False, error=result.stderr
            )

        images = []
        if assets_dir and assets_dir.exists():
            images = [str(p) for p in assets_dir.rglob("*.png")]
            images.extend([str(p) for p in assets_dir.rglob("*.jpg")])

        return ConversionResult(
            markdown=result.stdout, tool="pandoc", images=images, success=True
        )
    except Exception as e:
        return ConversionResult(
            markdown="", tool="pandoc", success=False, error=str(e)
        )


def convert_single(
    file_path: Path, tool: str, assets_dir: Optional[Path] = None
) -> ConversionResult:
    """Run a single conversion tool."""
    converters = {
        "pymupdf4llm": convert_with_pymupdf4llm,
        "markitdown": convert_with_markitdown,
        "pandoc": convert_with_pandoc,
    }

    converter = converters.get(tool)
    if not converter:
        return ConversionResult(
            markdown="", tool=tool, success=False, error=f"Unknown tool: {tool}"
        )

    return converter(file_path, assets_dir)


def merge_results(results: list[ConversionResult]) -> ConversionResult:
    """Merge results from multiple tools, selecting best segments."""
    if not results:
        return ConversionResult(markdown="", tool="none", success=False)

    # Filter successful results
    successful = [r for r in results if r.success and r.markdown.strip()]
    if not successful:
        # Return first error
        return results[0] if results else ConversionResult(
            markdown="", tool="none", success=False
        )

    if len(successful) == 1:
        return successful[0]

    # Multiple successful results - merge them
    # Strategy: Compare key metrics and select best
    best = successful[0]
    best_score = score_markdown(best.markdown)

    for result in successful[1:]:
        score = score_markdown(result.markdown)
        if score > best_score:
            best = result
            best_score = score

    # Merge images from all results
    all_images = []
    seen = set()
    for result in successful:
        for img in result.images:
            if img not in seen:
                all_images.append(img)
                seen.add(img)

    best.images = all_images
    best.tool = f"merged({','.join(r.tool for r in successful)})"

    return best


def score_markdown(md: str) -> float:
    """Score markdown quality for comparison."""
    score = 0.0

    # Length (more content is generally better)
    score += min(len(md) / 10000, 5.0)  # Cap at 5 points

    # Tables (proper markdown tables)
    table_count = md.count("|---|") + md.count("| ---")
    score += min(table_count * 0.5, 3.0)

    # Images (referenced images)
    image_count = md.count("![")
    score += min(image_count * 0.3, 2.0)

    # Headings (proper hierarchy)
    h1_count = md.count("\n# ")
    h2_count = md.count("\n## ")
    h3_count = md.count("\n### ")
    if h1_count > 0 and h2_count >= h1_count:
        score += 1.0  # Good hierarchy

    # Lists (structured content)
    list_count = md.count("\n- ") + md.count("\n* ") + md.count("\n1. ")
    score += min(list_count * 0.1, 2.0)

    return score
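
To see how this heuristic drives the selection in `merge_results`, the following self-contained sketch scores two made-up tool outputs and picks the higher one. `score_markdown` is repeated in condensed form only so the snippet runs on its own; the sample strings and the names `tool_a` and `tool_b` are hypothetical.

```python
def score_markdown(md: str) -> float:
    """Condensed copy of the heuristic above, repeated so this sketch runs alone."""
    score = min(len(md) / 10000, 5.0)
    score += min((md.count("|---|") + md.count("| ---")) * 0.5, 3.0)
    score += min(md.count("![") * 0.3, 2.0)
    h1, h2 = md.count("\n# "), md.count("\n## ")
    if h1 > 0 and h2 >= h1:
        score += 1.0
    lists = md.count("\n- ") + md.count("\n* ") + md.count("\n1. ")
    score += min(lists * 0.1, 2.0)
    return score


# Two hypothetical outputs for the same document.
flat = "\n**Title**\n\nSome text without structure.\n"
structured = "\n# Title\n\n## Section\n\n- item one\n- item two\n\n![fig](media/a.png)\n"

candidates = {"tool_a": flat, "tool_b": structured}
best = max(candidates, key=lambda name: score_markdown(candidates[name]))
print(best)
```

Two properties of the heuristic follow from the code as written: headings and list items are counted by searching for a preceding newline, so a heading on the very first line of the output is not counted, and the winner is a whole document rather than individual segments.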


def main():
    parser = argparse.ArgumentParser(
        description="Convert documents to markdown with multi-tool orchestration",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
# Quick mode (default)
python convert.py document.pdf -o output.md

# Heavy mode (best quality)
python convert.py document.pdf -o output.md --heavy

# With custom assets directory
python convert.py document.pdf -o output.md --assets-dir ./images
""",
    )
    parser.add_argument("input", type=Path, help="Input document path")
    parser.add_argument(
        "-o", "--output", type=Path, help="Output markdown file"
    )
    parser.add_argument(
        "--heavy",
        action="store_true",
        help="Enable Heavy Mode (multi-tool, best quality)",
    )
    parser.add_argument(
        "--assets-dir",
        type=Path,
        default=None,
        help="Directory for extracted images (default: <output>_assets/)",
    )
    parser.add_argument(
        "--tool",
        choices=["pymupdf4llm", "markitdown", "pandoc"],
        help="Force specific tool (overrides auto-selection)",
    )
    parser.add_argument(
        "--list-tools",
        action="store_true",
        help="List available tools and exit",
    )

    args = parser.parse_args()

    # List tools mode
    if args.list_tools:
        tools = ["pymupdf4llm", "markitdown", "pandoc"]
        print("Available conversion tools:")
        for tool in tools:
            status = "✓" if check_tool_available(tool) else "✗"
            print(f" {status} {tool}")
        sys.exit(0)

    # Validate input
    if not args.input.exists():
        print(f"Error: Input file not found: {args.input}", file=sys.stderr)
        sys.exit(1)

    # Determine output path
    output_path = args.output or args.input.with_suffix(".md")

    # Determine assets directory
    assets_dir = args.assets_dir
    if assets_dir is None and args.heavy:
        assets_dir = output_path.parent / f"{output_path.stem}_assets"

    # Select tools
    mode = "heavy" if args.heavy else "quick"
    if args.tool:
        tools = [args.tool] if check_tool_available(args.tool) else []
    else:
        tools = select_tools(args.input, mode)

    if not tools:
        print("Error: No conversion tools available.", file=sys.stderr)
        print("Install with:", file=sys.stderr)
        print(" pip install pymupdf4llm", file=sys.stderr)
        print(" uv tool install markitdown[pdf]", file=sys.stderr)
        print(" brew install pandoc", file=sys.stderr)
        sys.exit(1)

    print(f"Converting: {args.input}")
    print(f"Mode: {mode.upper()}")
    print(f"Tools: {', '.join(tools)}")

    # Run conversions
    results = []
    for tool in tools:
        print(f" Running {tool}...", end=" ", flush=True)

        # Use separate assets dirs for each tool in heavy mode
        tool_assets = None
        if assets_dir and mode == "heavy" and len(tools) > 1:
            tool_assets = assets_dir / tool
        elif assets_dir:
            tool_assets = assets_dir

        result = convert_single(args.input, tool, tool_assets)
        results.append(result)

        if result.success:
            print(f"✓ ({len(result.markdown):,} chars, {len(result.images)} images)")
        else:
            print(f"✗ ({result.error[:50]}...)")

    # Merge results if heavy mode
    if mode == "heavy" and len(results) > 1:
        print(" Merging results...", end=" ", flush=True)
        final = merge_results(results)
        print(f"✓ (using {final.tool})")
    else:
        final = merge_results(results)

    if not final.success:
        print(f"Error: Conversion failed: {final.error}", file=sys.stderr)
        sys.exit(1)

    # Write output
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(final.markdown)

    print(f"\nOutput: {output_path}")
    print(f" Size: {len(final.markdown):,} characters")
    if final.images:
        print(f" Images: {len(final.images)} extracted")


if __name__ == "__main__":
    main()