refactor: rename markdown-tools → doc-to-markdown (v2.0.0)
- Rename skill to better reflect its purpose (document-to-markdown conversion)
- Update SKILL.md name, description, and trigger keywords
- Add benchmark reference (2026-03-22)
- Update marketplace.json entry (name, skills path, version 2.0.0)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
178
doc-to-markdown/SKILL.md
Normal file
@@ -0,0 +1,178 @@
---
name: doc-to-markdown
description: Converts DOCX/PDF/PPTX to high-quality Markdown with automatic post-processing. Fixes pandoc grid tables, image paths, attribute noise, and code blocks. Supports Quick Mode (fast, single tool) and Heavy Mode (best quality, multi-tool merge). Trigger on "convert document", "docx to markdown", "parse word", "doc to markdown", "extract images from document".
---

# Doc to Markdown

Convert documents to high-quality markdown with intelligent multi-tool orchestration and automatic DOCX post-processing.

## Dual Mode Architecture

| Mode | Speed | Quality | Use Case |
|------|-------|---------|----------|
| **Quick** (default) | Fast | Good | Drafts, simple documents |
| **Heavy** | Slower | Best | Final documents, complex layouts |

## Quick Start

### Installation

```bash
# Required: PDF/DOCX/PPTX support
uv tool install "markitdown[pdf]"
pip install pymupdf4llm
brew install pandoc
```

### Basic Conversion

```bash
# Quick Mode (default) - fast, single best tool
uv run --with pymupdf4llm --with markitdown scripts/convert.py document.pdf -o output.md

# Heavy Mode - multi-tool parallel execution with merge
uv run --with pymupdf4llm --with markitdown scripts/convert.py document.pdf -o output.md --heavy

# DOCX with deep python-docx parsing (experimental)
uv run --with pymupdf4llm --with markitdown --with python-docx scripts/convert.py document.docx -o output.md --docx-deep

# Check available tools
uv run scripts/convert.py --list-tools
```

## Tool Selection Matrix

| Format | Quick Mode Tool | Heavy Mode Tools |
|--------|----------------|------------------|
| PDF | pymupdf4llm | pymupdf4llm + markitdown |
| DOCX | pandoc + post-processing | pandoc + markitdown |
| PPTX | markitdown | markitdown + pandoc |
| XLSX | markitdown | markitdown |

### Tool Characteristics

- **pymupdf4llm**: LLM-optimized PDF conversion with native table detection and image extraction
- **markitdown**: Microsoft's universal converter, good for Office formats
- **pandoc**: Excellent structure preservation for DOCX/PPTX

## DOCX Post-Processing (automatic)

When converting DOCX files via pandoc, the following cleanups are applied automatically:

| Problem | Fix |
|---------|-----|
| Grid tables (`+:---+` syntax) | Single-column -> blockquote, multi-column -> split images |
| Image double path (`media/media/`) | Flatten to `media/` |
| Pandoc attributes (`{width="..." height="..."}`) | Removed |
| Inline classes (`{.underline}`, `{.mark}`) | Removed |
| Indented dashed code blocks | Converted to fenced code blocks (```) |
| Escaped brackets (`\[...\]`) | Unescaped to `[...]` |
| Double-bracket links (`[[text]{...}](url)`) | Simplified to `[text](url)` |
| Escaped quotes in code (`\"`) | Fixed to `"` |
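
Several of these fixes are simple regex passes. A minimal sketch of the attribute and escape cleanup, with illustrative patterns and function name (not the actual `convert.py` implementation):

```python
import re

def clean_pandoc_attributes(markdown: str) -> str:
    """Strip pandoc-specific attribute noise from converted markdown (sketch)."""
    # Remove attribute blocks like {width="5.2in" height="3.1in"}
    markdown = re.sub(r'\{\s*width="[^"]*"\s*height="[^"]*"\s*\}', '', markdown)
    # Unwrap inline class spans like [text]{.underline} or [text]{.mark}
    markdown = re.sub(r'\[([^\]]*)\]\{\.(?:underline|mark)\}', r'\1', markdown)
    # Unescape bracket characters: \[...\] -> [...]
    markdown = markdown.replace(r'\[', '[').replace(r'\]', ']')
    return markdown

print(clean_pandoc_attributes(r'![logo](media/logo.png){width="2in" height="1in"}'))
# ![logo](media/logo.png)
```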

## Heavy Mode Workflow

Heavy Mode runs multiple tools in parallel and selects the best segments:

1. **Parallel Execution**: Run all applicable tools simultaneously
2. **Segment Analysis**: Parse each output into segments (tables, headings, images, paragraphs)
3. **Quality Scoring**: Score each segment based on completeness and structure
4. **Intelligent Merge**: Select best version of each segment across tools

### Merge Criteria

| Segment Type | Selection Criteria |
|--------------|-------------------|
| Tables | More rows/columns, proper header separator |
| Images | Alt text present, local paths preferred |
| Headings | Proper hierarchy, appropriate length |
| Lists | More items, nested structure preserved |
| Paragraphs | Content completeness |
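
The table criterion can be sketched as a scoring function followed by a per-segment pick. The helper names and weights below are illustrative ("more cells plus a well-formed header wins"), not the actual `convert.py` internals:

```python
def score_table(rows: int, cols: int, has_header_separator: bool) -> float:
    """Score a candidate table segment: size plus a bonus for a proper header row."""
    score = rows * cols * 0.1          # more cells -> likely more complete
    if has_header_separator:
        score += 1.0                   # well-formed pipe-table header separator
    return score

def pick_best(candidates: dict[str, float]) -> str:
    """candidates maps tool name -> segment score; highest score wins."""
    return max(candidates, key=candidates.get)

scores = {
    "pymupdf4llm": score_table(10, 5, True),   # 6.0
    "markitdown":  score_table(8, 5, False),   # 4.0
}
print(pick_best(scores))  # pymupdf4llm
```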

## Image Extraction

```bash
# Extract images with metadata
uv run --with pymupdf scripts/extract_pdf_images.py document.pdf -o ./assets

# Generate markdown references file
uv run --with pymupdf scripts/extract_pdf_images.py document.pdf --markdown refs.md
```

Output:
- Images: `assets/img_page1_1.png`, `assets/img_page2_1.jpg`
- Metadata: `assets/images_metadata.json` (page, position, dimensions)

## Quality Validation

```bash
# Validate conversion quality
uv run --with pymupdf scripts/validate_output.py document.pdf output.md

# Generate HTML report
uv run --with pymupdf scripts/validate_output.py document.pdf output.md --report report.html
```

### Quality Metrics

| Metric | Pass | Warn | Fail |
|--------|------|------|------|
| Text Retention | >95% | 85-95% | <85% |
| Table Retention | 100% | 90-99% | <90% |
| Image Retention | 100% | 80-99% | <80% |

## Merge Outputs Manually

```bash
# Merge multiple markdown files
python scripts/merge_outputs.py output1.md output2.md -o merged.md

# Show segment attribution
python scripts/merge_outputs.py output1.md output2.md -o merged.md --verbose
```

## Path Conversion (Windows/WSL)

```bash
# Windows to WSL conversion
python scripts/convert_path.py "C:\Users\name\Documents\file.pdf"
# Output: /mnt/c/Users/name/Documents/file.pdf
```

## Common Issues

**"No conversion tools available"**
```bash
# Install all tools
pip install pymupdf4llm
uv tool install "markitdown[pdf]"
brew install pandoc
```

**FontBBox warnings during PDF conversion**
- Harmless font parsing warnings; the output is still correct

**Images missing from output**
- Use Heavy Mode for better image preservation
- Or extract separately with `scripts/extract_pdf_images.py`

**Tables broken in output**
- Use Heavy Mode - it selects the most complete table version
- Or validate with `scripts/validate_output.py`

## Bundled Scripts

| Script | Purpose |
|--------|---------|
| `convert.py` | Main orchestrator with Quick/Heavy mode + DOCX post-processing |
| `merge_outputs.py` | Merge multiple markdown outputs |
| `validate_output.py` | Quality validation with HTML report |
| `extract_pdf_images.py` | PDF image extraction with metadata |
| `convert_path.py` | Windows to WSL path converter |

## References

- `references/heavy-mode-guide.md` - Detailed Heavy Mode documentation
- `references/tool-comparison.md` - Tool capabilities comparison
- `references/conversion-examples.md` - Batch operation examples
163
doc-to-markdown/references/benchmark-2026-03-22.md
Normal file
@@ -0,0 +1,163 @@
# DOCX→Markdown Conversion Benchmark

> **Test date**: 2026-03-22
>
> **Test file**: `助教-【腾讯云🦞】小白实践 OpenClaw 保姆级教程.docx` (19 MB, 77 images; contains grid-table layouts, JSON code blocks, side-by-side image columns, and callout boxes)
>
> **Method**: 5 solutions converted the same file, scored on 5 dimensions, each out of 10

---

## Overall Scores

| Dimension | Docling (IBM) | MarkItDown (MS) | Pandoc | Mammoth | **doc-to-markdown (ours)** |
|------|:---:|:---:|:---:|:---:|:---:|
| Table quality | 5 | 3 | 5 | 1~3 | **6** |
| Image extraction | 4 | 2 | **10** | 5 | 7 |
| Text completeness | 8 | 7 | **9** | 7 | **9** |
| Format cleanliness | 5 | 5 | 5 | 3 | **7** |
| Code blocks | 2 | 1 | N/A | 1 | **9** |
| **Overall** | **4.8** | **3.6** | **7.3** | **3.4~3.6** | **7.6** |

---

## Detailed Analysis

### 1. IBM Docling (overall 4.8)

- **Version**: docling 2.x + Granite-Docling-258M
- **Architecture**: AI-driven (VLM vision-language model), DocTags intermediate format → Markdown

**Fatal issues**:
- All image references are `<!-- image -->` placeholders (0 of 77 images displayable); the `ImageRefMode` API is unavailable for DOCX
- All heading levels lost (0 `#` characters); every heading degrades to bold text
- Zero code blocks; JSON and commands are all emitted as plain paragraphs
- `api_key` is incorrectly escaped as `api\_key`

**Strengths**:
- Text content is complete; Chinese text, emoji, and links are preserved well
- No grid-table or HTML residue
- Table syntax is correct (pipe tables), but the cell contents are placeholders

**Verdict**: Docling's strength is PDF (the AAAI 2025 paper scenario); its DOCX support is far from production-ready.

### 2. Microsoft MarkItDown (overall 3.6)

- **Version**: markitdown 0.1.5
- **Architecture**: wraps mammoth → HTML → markdownify → Markdown

**Fatal issues**:
- All 77 images become truncated base64 placeholders (`data:image/png;base64,...`); the default `keep_data_uris=False` actively discards image data
- All headings become bold text (mammoth cannot recognize WPS custom styles)
- Zero code blocks; JSON is crammed into table cells
- Ordered-list numbering is entirely wrong (output as `1. 1. 1.`)

**Strengths**:
- No leftover HTML tags
- Text content is mostly complete

**Verdict**: MarkItDown's markdownify post-processing actually introduces destructive truncation. Usable for lightweight scenarios, unreliable for complex DOCX.

### 3. Pandoc (overall 7.3)

- **Version**: pandoc 3.9
- **Architecture**: native Haskell AST parsing, 60+ formats supported

**Three parameter sets tested**:

| Parameters | Result |
|------|------|
| `-t gfm` | Worst: 24 nested HTML `<table>` elements, 74 HTML `<img>` tags |
| `-t markdown` | Best: grid tables (post-processable), no HTML |
| `-t markdown-raw_html-...` | Identical to `markdown`; the extra flags had no effect |

**Issues**:
- Grid tables are unavoidable (the source docx has multi-line cells and nested tables, which pipe tables cannot express)
- 68 occurrences of `{width="..." height="..."}`
- 6 occurrences of `{.underline}`
- 37 occurrences of backslash over-escaping

**Strengths**:
- Image extraction 10/10 (all 77 images correct, consistent path structure)
- Text completeness 9/10 (content, links, and emoji all preserved)
- The most mature and stable underlying engine

**Verdict**: Pandoc is the most reliable underlying engine with the highest output quality, but its private syntax requires post-processing cleanup.

### 4. Mammoth (overall 3.4~3.6)

- **Version**: mammoth 1.11.0
- **Architecture**: python-docx parsing → HTML/Markdown (Markdown support is deprecated)

**Two approaches tested**:

| Approach | Overall |
|------|------|
| A: direct to Markdown | 3.4 (tables completely lost) |
| B: to HTML → markdownify | 3.6 (tables present but nesting flattened) |

**Fatal issues**:
- All headings lost (the style definitions in the WPS `styles.xml` are empty, so mammoth cannot map them to Heading)
- Zero code blocks
- All images inlined as base64, producing a single 28 MB file
- In approach B, markdownify drops 14 images (63/77 survive)

**Verdict**: Mammoth's Markdown support is deprecated and its compatibility with WPS-exported docx is poor. Not recommended.

### 5. doc-to-markdown / our solution (overall 7.6)

- **Version**: doc-to-markdown 1.0 (pandoc + 6 post-processing functions)
- **Architecture**: pandoc conversion → automatic post-processing (grid-table cleanup, image-path fixes, attribute cleanup, code-block repair, escape fixes)

**Measured post-processing effect**:

| Post-processing function | Fixes applied |
|-----------|---------|
| `_convert_grid_tables` | 11 grid tables → pipe table / blockquote |
| `_clean_pandoc_attributes` | 3437 characters of attribute noise removed |
| `_fix_code_blocks` | 22 indented dashed runs → fenced code blocks |
| `_fix_escaped_brackets` | 10 occurrences |
| `_fix_double_bracket_links` | 1 occurrence |
| `_fix_image_paths` | 77 image paths fixed |

**Known issues (to fix)**:
- Double-nested image path bug: pandoc creates an extra `media/` level inside the directory given by `--assets-dir`
- 2 residual grid tables (the side-by-side image group at the end of the document was not fully converted)

**Strengths**:
- Code-block recognition 9/10 (JSON gets language tags, command lines are wrapped correctly)
- Format cleanliness 7/10 (attributes, escapes, and grid tables mostly cleaned)
- Text completeness 9/10 (all key content preserved)

**Verdict**: Best overall; the core value is in the pandoc post-processing layer. The 2 remaining bugs are fixable.

---

## Architecture Decision

```
Final approach: Pandoc (underlying engine) + doc-to-markdown post-processing (value-add layer)

Rationale:
1. Pandoc has the most reliable image extraction (10/10) and the most complete text (9/10)
2. Pandoc's problems (grid tables, attributes, escaping) are all fixable in post-processing
3. The fatal problems of Docling/MarkItDown/Mammoth (lost images, lost headings) cannot be fixed after the fact
4. The post-processing layer is our core differentiator: cheap to build, easy to iterate
```

---

## Test File Characteristics

What makes this test file hard:

| Characteristic | Description | Impact |
|------|------|------|
| WPS export | Non-standard Word styles (Style ID 2/3 instead of Heading 1/2) | mammoth/markitdown/docling lose all headings |
| Multi-column image layout | 2x2 and 1x4 image grids laid out with tables | pandoc emits grid tables |
| Callout/info boxes | Single-column tables wrapping text | pandoc emits grid tables |
| Nested tables | Tables inside tables | inexpressible as pipe tables |
| JSON code blocks | Not styled as code; represented via text boxes/indentation | most tools fail to recognize them as code |
| 19 MB file | 77 embedded screenshots | base64 approaches yield 28 MB output |

These characteristics represent the typical difficulties of real-world docx files exported from WPS/Feishu, making this a valid benchmark scenario.
346
doc-to-markdown/references/conversion-examples.md
Normal file
@@ -0,0 +1,346 @@
# Document Conversion Examples

Comprehensive examples for converting various document formats to markdown.

## Basic Document Conversions

### PDF to Markdown

```bash
# Simple PDF conversion
markitdown "document.pdf" > output.md

# WSL path example
markitdown "/mnt/c/Users/username/Documents/report.pdf" > report.md

# With explicit output
markitdown "slides.pdf" > "slides.md"
```

### Word Documents to Markdown

```bash
# Modern Word document (.docx)
markitdown "document.docx" > output.md

# Legacy Word document (.doc)
markitdown "legacy-doc.doc" > output.md

# Preserve directory structure
markitdown "/path/to/docs/file.docx" > "/path/to/output/file.md"
```

### PowerPoint to Markdown

```bash
# Convert presentation
markitdown "presentation.pptx" > slides.md

# WSL path
markitdown "/mnt/c/Users/username/Desktop/slides.pptx" > slides.md
```

---

## Windows/WSL Path Conversion

### Basic Path Conversion Rules

```bash
# Windows path
C:\Users\username\Documents\file.doc

# WSL equivalent
/mnt/c/Users/username/Documents/file.doc
```

### Conversion Examples

```bash
# Single backslash to forward slash
C:\folder\file.txt
→ /mnt/c/folder/file.txt

# Path with spaces (must use quotes)
C:\Users\John Doe\Documents\report.pdf
→ "/mnt/c/Users/John Doe/Documents/report.pdf"

# OneDrive path
C:\Users\username\OneDrive\Documents\file.doc
→ "/mnt/c/Users/username/OneDrive/Documents/file.doc"

# Different drive letters
D:\Projects\document.docx
→ /mnt/d/Projects/document.docx
```

### Using convert_path.py Helper

```bash
# Automatic conversion
python scripts/convert_path.py "C:\Users\username\Downloads\document.doc"
# Output: /mnt/c/Users/username/Downloads/document.doc

# Use in conversion command
wsl_path=$(python scripts/convert_path.py "C:\Users\username\file.docx")
markitdown "$wsl_path" > output.md
```

---

## Batch Conversions

### Convert Multiple Files

```bash
# Convert all PDFs in a directory
for pdf in /path/to/pdfs/*.pdf; do
  filename=$(basename "$pdf" .pdf)
  markitdown "$pdf" > "/path/to/output/${filename}.md"
done

# Convert all Word documents
for doc in /path/to/docs/*.docx; do
  filename=$(basename "$doc" .docx)
  markitdown "$doc" > "/path/to/output/${filename}.md"
done
```

### Batch Conversion with Path Conversion

```powershell
# Windows batch (PowerShell)
Get-ChildItem "C:\Documents\*.pdf" | ForEach-Object {
  $wslPath = "/mnt/c/Documents/$($_.Name)"
  $outFile = "/mnt/c/Output/$($_.BaseName).md"
  wsl markitdown $wslPath > $outFile
}
```

---

## Confluence Export Handling

### Simple Confluence Export

```bash
# Direct conversion for exports without special characters
markitdown "confluence-export.doc" > output.md
```

### Export with Special Characters

For Confluence exports containing special characters:

1. Save the .doc file to an accessible location
2. Try direct conversion first:
   ```bash
   markitdown "confluence-export.doc" > output.md
   ```
3. If special characters cause issues:
   - Open in Word and save as .docx
   - Or use LibreOffice to convert: `libreoffice --headless --convert-to docx export.doc`
   - Then convert the .docx file

### Handling Encoding Issues

```bash
# Check file encoding
file -i "document.doc"

# Convert if needed (using iconv)
iconv -f ISO-8859-1 -t UTF-8 input.md > output.md
```

---

## Advanced Conversion Scenarios

### Preserving Directory Structure

```bash
# Mirror directory structure
src_dir="/mnt/c/Users/username/Documents"
out_dir="/path/to/output"

find "$src_dir" -name "*.docx" | while read file; do
  # Get relative path
  rel_path="${file#$src_dir/}"
  out_file="$out_dir/${rel_path%.docx}.md"

  # Create output directory
  mkdir -p "$(dirname "$out_file")"

  # Convert
  markitdown "$file" > "$out_file"
done
```

### Conversion with Metadata

```bash
# Add frontmatter to converted file
{
  echo "---"
  echo "title: $(basename "$file" .pdf)"
  echo "converted: $(date -I)"
  echo "source: $file"
  echo "---"
  echo ""
  markitdown "$file"
} > output.md
```

---

## Error Recovery

### Handling Failed Conversions

```bash
# Check if markitdown succeeded
if markitdown "document.pdf" > output.md 2> error.log; then
  echo "Conversion successful"
else
  echo "Conversion failed, check error.log"
fi
```

### Retry Logic

```bash
# Retry failed conversions
for file in *.pdf; do
  output="${file%.pdf}.md"
  if ! [ -f "$output" ]; then
    echo "Converting $file..."
    markitdown "$file" > "$output" || echo "Failed: $file" >> failed.txt
  fi
done
```

---

## Quality Verification

### Check Conversion Quality

```bash
# Compare line counts
wc -l document.pdf.md

# Check for common issues
grep "TODO\|ERROR\|MISSING" output.md

# Preview first/last lines
head -n 20 output.md
tail -n 20 output.md
```

### Validate Output

```bash
# Check for empty files
if [ ! -s output.md ]; then
  echo "Warning: Output file is empty"
fi

# Verify markdown syntax
# Use a markdown linter if available
markdownlint output.md
```

---

## Best Practices

### 1. Path Handling
- Always quote paths with spaces
- Verify paths exist before conversion
- Use absolute paths for scripts

### 2. Batch Processing
- Log conversions for audit trail
- Handle errors gracefully
- Preserve original files

### 3. Output Organization
- Mirror source directory structure
- Use consistent naming conventions
- Separate by document type or date

### 4. Quality Assurance
- Spot-check random conversions
- Validate critical documents manually
- Keep conversion logs

### 5. Performance
- Use parallel processing for large batches
- Skip already converted files
- Clean up temporary files
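
The parallel-processing and skip-if-done advice can be sketched as follows; `cat` stands in for `markitdown` so the example runs without any converter installed:

```shell
#!/bin/sh
# Sketch: convert files in parallel, skipping ones already converted.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'alpha\n' > one.pdf
printf 'beta\n'  > two.pdf

convert_one() {
  out="${1%.pdf}.md"
  [ -f "$out" ] && return 0          # skip already-converted files
  cat "$1" > "$out"                  # real pipeline: markitdown "$1" > "$out"
}

for f in *.pdf; do
  convert_one "$f" &                 # fan out across background jobs
done
wait                                 # block until every conversion finishes
ls *.md
```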

---

## Common Patterns

### Pattern: Convert and Review

```bash
#!/bin/bash
file="$1"
output="${file%.*}.md"

# Convert
markitdown "$file" > "$output"

# Open in editor for review
${EDITOR:-vim} "$output"
```

### Pattern: Safe Conversion

```bash
#!/bin/bash
file="$1"
backup="${file}.backup"
output="${file%.*}.md"

# Backup original
cp "$file" "$backup"

# Convert with error handling
if markitdown "$file" > "$output" 2> conversion.log; then
  echo "Success: $output"
  rm "$backup"
else
  echo "Failed: Check conversion.log"
  mv "$backup" "$file"
fi
```

### Pattern: Metadata Preservation

```bash
#!/bin/bash
# Extract and preserve document metadata

file="$1"
output="${file%.*}.md"

# Get file metadata
created=$(stat -c %w "$file" 2>/dev/null || stat -f %SB "$file")
modified=$(stat -c %y "$file" 2>/dev/null || stat -f %Sm "$file")

# Convert with metadata
{
  echo "---"
  echo "original_file: $(basename "$file")"
  echo "created: $created"
  echo "modified: $modified"
  echo "converted: $(date -I)"
  echo "---"
  echo ""
  markitdown "$file"
} > "$output"
```
165
doc-to-markdown/references/heavy-mode-guide.md
Normal file
@@ -0,0 +1,165 @@
# Heavy Mode Guide

Detailed documentation for doc-to-markdown Heavy Mode conversion.

## Overview

Heavy Mode runs multiple conversion tools in parallel and intelligently merges their outputs to produce the highest quality markdown possible.

## When to Use Heavy Mode

Use Heavy Mode when:
- Document has complex tables that need precise formatting
- Images must be preserved with proper references
- Structure hierarchy (headings, lists) must be accurate
- Output quality is more important than conversion speed
- Document will be used for LLM processing

Use Quick Mode when:
- Speed is priority
- Document is simple (mostly text)
- Output is for draft/review purposes

## Tool Capabilities

### PyMuPDF4LLM (Best for PDFs)

**Strengths:**
- Native table detection with multiple strategies
- Image extraction with position metadata
- LLM-optimized output format
- Preserves reading order

**Usage:**
```python
import pymupdf4llm

md_text = pymupdf4llm.to_markdown(
    "document.pdf",
    write_images=True,
    table_strategy="lines_strict",
    image_path="./assets",
    dpi=150
)
```

### markitdown (Universal Converter)

**Strengths:**
- Supports many formats (PDF, DOCX, PPTX, XLSX)
- Good text extraction
- Simple API

**Limitations:**
- May miss complex tables
- No native image extraction

### pandoc (Best for Office Docs)

**Strengths:**
- Excellent DOCX/PPTX structure preservation
- Proper heading hierarchy
- List formatting

**Limitations:**
- Requires system installation
- PDF support limited

## Merge Strategy

### Segment-Level Selection

Heavy Mode doesn't just pick one tool's output. It:

1. Parses each output into segments
2. Scores each segment independently
3. Selects the best version of each segment

### Segment Types

| Type | Detection Pattern | Scoring Criteria |
|------|-------------------|------------------|
| Table | `\|.*\|` rows | Row count, column count, header separator |
| Heading | `^#{1,6} ` | Proper level, reasonable length |
| Image | `!\[.*\]\(.*\)` | Alt text present, local path |
| List | `^[-*+\d.] ` | Item count, nesting depth |
| Code | Triple backticks | Line count, language specified |
| Paragraph | Default | Word count, completeness |

### Scoring Example

```
Table from pymupdf4llm:
- 10 rows × 5 columns = 5.0 points
- Header separator present = 1.0 points
- Total: 6.0 points

Table from markitdown:
- 8 rows × 5 columns = 4.0 points
- No header separator = 0.0 points
- Total: 4.0 points

→ Select pymupdf4llm version
```
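
The detection patterns above can be sketched as a simple line classifier; the names and exact regexes are illustrative, not the actual parser:

```python
import re

# Illustrative detection patterns for segment types (first match wins).
SEGMENT_PATTERNS = {
    "table":   re.compile(r'^\|.*\|$'),          # pipe-table row
    "heading": re.compile(r'^#{1,6} '),          # ATX heading
    "image":   re.compile(r'!\[.*\]\(.*\)'),     # markdown image
    "list":    re.compile(r'^(?:[-*+]|\d+\.) '), # bullet or ordered item
    "code":    re.compile(r'^```'),              # fence delimiter
}

def classify_line(line: str) -> str:
    for kind, pattern in SEGMENT_PATTERNS.items():
        if pattern.search(line):
            return kind
    return "paragraph"

print(classify_line("| a | b |"))   # table
print(classify_line("## Heading"))  # heading
print(classify_line("Some text."))  # paragraph
```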

## Advanced Usage

### Force Specific Tool

```bash
# Use only pandoc
uv run scripts/convert.py document.docx -o output.md --tool pandoc
```

### Custom Assets Directory

```bash
# Heavy mode with custom image output
uv run scripts/convert.py document.pdf -o output.md --heavy --assets-dir ./images
```

### Validate After Conversion

```bash
# Convert then validate
uv run scripts/convert.py document.pdf -o output.md --heavy
uv run scripts/validate_output.py document.pdf output.md --report quality.html
```

## Troubleshooting

### Low Text Retention Score

**Causes:**
- PDF has scanned images (not searchable text)
- Encoding issues in source document
- Complex layouts confusing the parser

**Solutions:**
- Use OCR preprocessing for scanned PDFs
- Try a different tool with the `--tool` flag
- Manual cleanup may be needed

### Missing Tables

**Causes:**
- Tables without visible borders
- Tables spanning multiple pages
- Merged cells

**Solutions:**
- Use Heavy Mode for better detection
- Try pymupdf4llm with a different `table_strategy`
- Manual table reconstruction

### Image References Broken

**Causes:**
- Assets directory not created
- Relative path issues
- Image extraction failed

**Solutions:**
- Ensure `--assets-dir` points to the correct location
- Check `images_metadata.json` for extraction status
- Use `extract_pdf_images.py` separately
180
doc-to-markdown/references/tool-comparison.md
Normal file
@@ -0,0 +1,180 @@
# Tool Comparison

Comparison of document-to-markdown conversion tools.

## Feature Matrix

| Feature | pymupdf4llm | markitdown | pandoc |
|---------|-------------|------------|--------|
| **PDF Support** | ✅ Excellent | ✅ Good | ⚠️ Limited |
| **DOCX Support** | ❌ No | ✅ Good | ✅ Excellent |
| **PPTX Support** | ❌ No | ✅ Good | ✅ Good |
| **XLSX Support** | ❌ No | ✅ Good | ⚠️ Limited |
| **Table Detection** | ✅ Multiple strategies | ⚠️ Basic | ✅ Good |
| **Image Extraction** | ✅ With metadata | ❌ No | ✅ Yes |
| **Heading Hierarchy** | ✅ Good | ⚠️ Variable | ✅ Excellent |
| **List Formatting** | ✅ Good | ⚠️ Basic | ✅ Excellent |
| **LLM Optimization** | ✅ Built-in | ❌ No | ❌ No |

## Installation

### pymupdf4llm

```bash
pip install pymupdf4llm

# Or with uv
uv pip install pymupdf4llm
```

**Dependencies:** None beyond PyMuPDF, which is installed automatically

### markitdown

```bash
# With PDF support
uv tool install "markitdown[pdf]"

# Or
pip install "markitdown[pdf]"
```

**Dependencies:** Various per format (pdfminer, python-docx, etc.)

### pandoc

```bash
# macOS
brew install pandoc

# Ubuntu/Debian
apt-get install pandoc

# Windows
choco install pandoc
```

**Dependencies:** System installation required

## Performance Benchmarks

### PDF Conversion (100-page document)

| Tool | Time | Memory | Output Quality |
|------|------|--------|----------------|
| pymupdf4llm | ~15s | 150MB | Excellent |
| markitdown | ~45s | 200MB | Good |
| pandoc | ~60s | 100MB | Variable |

### DOCX Conversion (50-page document)

| Tool | Time | Memory | Output Quality |
|------|------|--------|----------------|
| pandoc | ~5s | 50MB | Excellent |
| markitdown | ~10s | 80MB | Good |

## Best Practices

### For PDFs

1. **First choice:** pymupdf4llm
   - Best table detection
   - Image extraction with metadata
   - LLM-optimized output

2. **Fallback:** markitdown
   - When pymupdf4llm fails
   - Simpler documents

### For DOCX/DOC

1. **First choice:** pandoc
   - Best structure preservation
   - Proper heading hierarchy
   - List formatting

2. **Fallback:** markitdown
   - When pandoc unavailable
   - Quick conversion needed

### For PPTX

1. **First choice:** markitdown
   - Good slide content extraction
   - Handles speaker notes

2. **Fallback:** pandoc
   - Better structure preservation

### For XLSX

1. **Only option:** markitdown
   - Table to markdown conversion
   - Sheet handling
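
This first-choice/fallback guidance can be sketched as a small selection table; the function and mapping are illustrative, not the actual orchestrator in `convert.py`:

```python
# Illustrative first-choice/fallback preferences per format.
PREFERENCES = {
    ".pdf":  ["pymupdf4llm", "markitdown"],
    ".docx": ["pandoc", "markitdown"],
    ".pptx": ["markitdown", "pandoc"],
    ".xlsx": ["markitdown"],
}

def choose_tool(extension: str, available: set[str]) -> str:
    """Pick the first preferred tool that is actually installed."""
    for tool in PREFERENCES.get(extension.lower(), []):
        if tool in available:
            return tool
    raise RuntimeError(f"No conversion tool available for {extension}")

print(choose_tool(".docx", {"markitdown"}))  # markitdown (pandoc unavailable)
```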

## Common Issues by Tool

### pymupdf4llm

| Issue | Solution |
|-------|----------|
| "Cannot import fitz" | `pip install pymupdf` |
| Tables not detected | Try different `table_strategy` |
| Images not extracted | Enable `write_images=True` |

### markitdown

| Issue | Solution |
|-------|----------|
| PDF support missing | Install with `[pdf]` extra |
| Slow conversion | Expected for large files |
| Missing content | Try alternative tool |

### pandoc

| Issue | Solution |
|-------|----------|
| Command not found | Install via package manager |
| PDF conversion fails | Use pymupdf4llm instead |
| Images not extracted | Add `--extract-media` flag |

## API Comparison

### pymupdf4llm

```python
import pymupdf4llm

md = pymupdf4llm.to_markdown(
    "doc.pdf",
    write_images=True,
    table_strategy="lines_strict",
    image_path="./assets"
)
```

### markitdown

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)
```

### pandoc

```bash
pandoc document.docx -t markdown --wrap=none --extract-media=./assets
```

```python
import subprocess

result = subprocess.run(
    ["pandoc", "doc.docx", "-t", "markdown", "--wrap=none"],
    capture_output=True, text=True
)
print(result.stdout)
```
1150
doc-to-markdown/scripts/convert.py
Executable file
File diff suppressed because it is too large
Load Diff
61
doc-to-markdown/scripts/convert_path.py
Executable file
61
doc-to-markdown/scripts/convert_path.py
Executable file
@@ -0,0 +1,61 @@
#!/usr/bin/env python3
"""
Convert Windows paths to WSL format.

Usage:
    python convert_path.py "C:\\Users\\username\\Downloads\\file.doc"

Output:
    /mnt/c/Users/username/Downloads/file.doc
"""

import sys
import re


def convert_windows_to_wsl(windows_path: str) -> str:
    """
    Convert a Windows path to WSL format.

    Args:
        windows_path: Windows path (e.g., "C:\\Users\\username\\file.doc")

    Returns:
        WSL path (e.g., "/mnt/c/Users/username/file.doc")
    """
    # Remove quotes if present
    path = windows_path.strip('"').strip("'")

    # Handle drive letter (C:\ or C:/)
    drive_pattern = r'^([A-Za-z]):[\\\/]'
    match = re.match(drive_pattern, path)

    if not match:
        # Already a WSL path or relative path
        return path

    drive_letter = match.group(1).lower()
    path_without_drive = path[3:]  # Remove "C:\"

    # Replace backslashes with forward slashes
    path_without_drive = path_without_drive.replace('\\', '/')

    # Construct WSL path
    wsl_path = f"/mnt/{drive_letter}/{path_without_drive}"

    return wsl_path


def main():
    if len(sys.argv) < 2:
        print("Usage: python convert_path.py <windows_path>")
        print('Example: python convert_path.py "C:\\Users\\username\\Downloads\\file.doc"')
        sys.exit(1)

    windows_path = sys.argv[1]
    wsl_path = convert_windows_to_wsl(windows_path)
    print(wsl_path)


if __name__ == "__main__":
    main()
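A quick sanity check of the drive-letter logic (a standalone sketch that mirrors `convert_windows_to_wsl` above, not the script itself):

```python
import re

def to_wsl(windows_path: str) -> str:
    # Mirrors convert_windows_to_wsl: C:\Users\... -> /mnt/c/Users/...
    path = windows_path.strip('"').strip("'")
    match = re.match(r'^([A-Za-z]):[\\\/]', path)
    if not match:
        return path  # already WSL-style or relative
    return f"/mnt/{match.group(1).lower()}/" + path[3:].replace('\\', '/')

print(to_wsl(r"C:\Users\alice\Downloads\file.doc"))
# → /mnt/c/Users/alice/Downloads/file.doc
print(to_wsl("relative/path.txt"))
# → relative/path.txt
```

Forward-slash Windows paths (`D:/data/report.pdf`) take the same route, since the drive pattern accepts either separator.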
243 doc-to-markdown/scripts/extract_pdf_images.py (Executable file)
@@ -0,0 +1,243 @@
#!/usr/bin/env python3
"""
Extract images from PDF files with metadata using PyMuPDF.

Features:
- Extracts all images with page and position metadata
- Generates JSON metadata file for each image
- Supports markdown reference generation
- Optional DPI control for quality

Usage:
    uv run --with pymupdf scripts/extract_pdf_images.py document.pdf
    uv run --with pymupdf scripts/extract_pdf_images.py document.pdf -o ./images
    uv run --with pymupdf scripts/extract_pdf_images.py document.pdf --markdown refs.md

Examples:
    # Basic extraction
    uv run --with pymupdf scripts/extract_pdf_images.py document.pdf

    # With custom output and markdown references
    uv run --with pymupdf scripts/extract_pdf_images.py doc.pdf -o assets --markdown images.md
"""

import argparse
import json
import sys
from dataclasses import dataclass, asdict
from pathlib import Path
from typing import Optional


@dataclass
class ImageMetadata:
    """Metadata for an extracted image."""
    filename: str
    page: int  # 1-indexed
    index: int  # Image index on page (1-indexed)
    width: int  # Original width in pixels
    height: int  # Original height in pixels
    x: float  # X position on page (points)
    y: float  # Y position on page (points)
    bbox_width: float  # Width on page (points)
    bbox_height: float  # Height on page (points)
    size_bytes: int
    format: str  # png, jpg, etc.
    colorspace: str  # RGB, CMYK, Gray
    bits_per_component: int


def extract_images(
    pdf_path: Path,
    output_dir: Path,
    markdown_file: Optional[Path] = None
) -> list[ImageMetadata]:
    """
    Extract all images from a PDF file with metadata.

    Args:
        pdf_path: Path to the PDF file
        output_dir: Directory to save extracted images
        markdown_file: Optional path to write markdown references

    Returns:
        List of ImageMetadata for each extracted image
    """
    try:
        import fitz  # PyMuPDF
    except ImportError:
        print("Error: PyMuPDF not installed. Run with:")
        print('  uv run --with pymupdf scripts/extract_pdf_images.py <pdf_path>')
        sys.exit(1)

    output_dir.mkdir(parents=True, exist_ok=True)

    doc = fitz.open(str(pdf_path))
    extracted: list[ImageMetadata] = []
    markdown_refs: list[str] = []

    for page_num in range(len(doc)):
        page = doc[page_num]
        image_list = page.get_images(full=True)

        for img_index, img_info in enumerate(image_list):
            xref = img_info[0]

            try:
                base_image = doc.extract_image(xref)
            except Exception as e:
                print(f"  Warning: Could not extract image xref={xref}: {e}")
                continue

            image_bytes = base_image["image"]
            image_ext = base_image["ext"]
            width = base_image.get("width", 0)
            height = base_image.get("height", 0)
            colorspace = base_image.get("colorspace", 0)
            bpc = base_image.get("bpc", 8)

            # Map colorspace number to name
            cs_names = {1: "Gray", 3: "RGB", 4: "CMYK"}
            cs_name = cs_names.get(colorspace, f"Unknown({colorspace})")

            # Get image position on page
            # img_info: (xref, smask, width, height, bpc, colorspace, alt, name, filter, referencer)
            # We need to find the image rect on page
            bbox_x, bbox_y, bbox_w, bbox_h = 0.0, 0.0, 0.0, 0.0

            # Search for image instances on page
            for img_block in page.get_images():
                if img_block[0] == xref:
                    # Found matching image, try to get its rect
                    rects = page.get_image_rects(img_block)
                    if rects:
                        rect = rects[0]  # Use first occurrence
                        bbox_x = rect.x0
                        bbox_y = rect.y0
                        bbox_w = rect.width
                        bbox_h = rect.height
                    break

            # Create descriptive filename
            img_filename = f"img_page{page_num + 1}_{img_index + 1}.{image_ext}"
            img_path = output_dir / img_filename

            # Save image
            with open(img_path, "wb") as f:
                f.write(image_bytes)

            # Create metadata
            metadata = ImageMetadata(
                filename=img_filename,
                page=page_num + 1,
                index=img_index + 1,
                width=width,
                height=height,
                x=round(bbox_x, 2),
                y=round(bbox_y, 2),
                bbox_width=round(bbox_w, 2),
                bbox_height=round(bbox_h, 2),
                size_bytes=len(image_bytes),
                format=image_ext,
                colorspace=cs_name,
                bits_per_component=bpc
            )
            extracted.append(metadata)

            # Generate markdown reference
            alt_text = f"Image from page {page_num + 1}"
            md_ref = f"![{alt_text}]({img_filename})"
            markdown_refs.append(f"<!-- Page {page_num + 1}, Position: ({bbox_x:.0f}, {bbox_y:.0f}) -->\n{md_ref}")

            print(f"  ✓ {img_filename} ({width}x{height}, {len(image_bytes):,} bytes)")

    doc.close()

    # Write metadata JSON
    metadata_path = output_dir / "images_metadata.json"
    with open(metadata_path, "w") as f:
        json.dump(
            {
                "source": str(pdf_path),
                "image_count": len(extracted),
                "images": [asdict(m) for m in extracted]
            },
            f,
            indent=2
        )
    print(f"\n📋 Metadata: {metadata_path}")

    # Write markdown references if requested
    if markdown_file and markdown_refs:
        markdown_content = f"# Images from {pdf_path.name}\n\n"
        markdown_content += "\n\n".join(markdown_refs)
        markdown_file.parent.mkdir(parents=True, exist_ok=True)
        markdown_file.write_text(markdown_content)
        print(f"📝 Markdown refs: {markdown_file}")

    print(f"\n✅ Total: {len(extracted)} images extracted to {output_dir}/")
    return extracted


def main():
    parser = argparse.ArgumentParser(
        description="Extract images from PDF files with metadata",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Basic extraction
  uv run --with pymupdf scripts/extract_pdf_images.py document.pdf

  # Custom output directory
  uv run --with pymupdf scripts/extract_pdf_images.py doc.pdf -o ./images

  # With markdown references
  uv run --with pymupdf scripts/extract_pdf_images.py doc.pdf --markdown refs.md

Output:
  Images are saved with descriptive names: img_page1_1.png, img_page2_1.jpg
  Metadata is saved to: images_metadata.json
"""
    )
    parser.add_argument(
        "pdf_path",
        type=Path,
        help="Path to the PDF file"
    )
    parser.add_argument(
        "-o", "--output",
        type=Path,
        default=Path("assets"),
        help="Directory to save images (default: ./assets)"
    )
    parser.add_argument(
        "--markdown",
        type=Path,
        help="Generate markdown file with image references"
    )
    parser.add_argument(
        "--json",
        action="store_true",
        help="Output metadata as JSON to stdout"
    )

    args = parser.parse_args()

    if not args.pdf_path.exists():
        print(f"Error: File not found: {args.pdf_path}", file=sys.stderr)
        sys.exit(1)

    print(f"📄 Extracting images from: {args.pdf_path}")

    extracted = extract_images(
        args.pdf_path,
        args.output,
        args.markdown
    )

    if args.json:
        print(json.dumps([asdict(m) for m in extracted], indent=2))


if __name__ == "__main__":
    main()
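Downstream scripts can filter the resulting `images_metadata.json`; a minimal sketch using stand-in data in the same shape (the filenames and dimensions here are illustrative, not real output):

```python
import json

# Stand-in for assets/images_metadata.json (same shape the script writes)
metadata = {
    "source": "document.pdf",
    "image_count": 2,
    "images": [
        {"filename": "img_page1_1.png", "page": 1, "width": 1200, "height": 800},
        {"filename": "img_page3_1.jpg", "page": 3, "width": 80, "height": 80},
    ],
}

# Skip tiny images (bullets, icons) when rebuilding markdown references
large = [m["filename"] for m in metadata["images"]
         if m["width"] >= 200 and m["height"] >= 200]
print(json.dumps(large))
# → ["img_page1_1.png"]
```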
439 doc-to-markdown/scripts/merge_outputs.py (Executable file)
@@ -0,0 +1,439 @@
#!/usr/bin/env python3
"""
Multi-tool markdown output merger with segment-level comparison.

Merges markdown outputs from multiple conversion tools by selecting
the best version of each segment (tables, images, headings, paragraphs).

Usage:
    python merge_outputs.py output1.md output2.md -o merged.md
    python merge_outputs.py --from-json results.json -o merged.md
"""

import argparse
import json
import re
import sys
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional


@dataclass
class Segment:
    """A segment of markdown content."""
    type: str  # 'heading', 'table', 'image', 'list', 'paragraph', 'code'
    content: str
    level: int = 0  # For headings
    score: float = 0.0


@dataclass
class MergeResult:
    """Result from merging multiple markdown files."""
    markdown: str
    sources: list[str] = field(default_factory=list)
    segment_sources: dict = field(default_factory=dict)  # segment_idx -> source


def parse_segments(markdown: str) -> list[Segment]:
    """Parse markdown into typed segments."""
    segments = []
    lines = markdown.split('\n')
    current_segment = []
    current_type = 'paragraph'
    current_level = 0
    in_code_block = False
    in_table = False

    def flush_segment():
        nonlocal current_segment, current_type, current_level
        if current_segment:
            content = '\n'.join(current_segment).strip()
            if content:
                segments.append(Segment(
                    type=current_type,
                    content=content,
                    level=current_level
                ))
        current_segment = []
        current_type = 'paragraph'
        current_level = 0

    for line in lines:
        # Code block detection
        if line.startswith('```'):
            if in_code_block:
                current_segment.append(line)
                flush_segment()
                in_code_block = False
                continue
            else:
                flush_segment()
                in_code_block = True
                current_type = 'code'
                current_segment.append(line)
                continue

        if in_code_block:
            current_segment.append(line)
            continue

        # Heading detection
        heading_match = re.match(r'^(#{1,6})\s+(.+)$', line)
        if heading_match:
            flush_segment()
            current_type = 'heading'
            current_level = len(heading_match.group(1))
            current_segment.append(line)
            flush_segment()
            continue

        # Table detection
        if '|' in line and re.match(r'^\s*\|.*\|\s*$', line):
            if not in_table:
                flush_segment()
                in_table = True
                current_type = 'table'
            current_segment.append(line)
            continue
        elif in_table:
            flush_segment()
            in_table = False

        # Image detection
        if re.match(r'!\[.*\]\(.*\)', line):
            flush_segment()
            current_type = 'image'
            current_segment.append(line)
            flush_segment()
            continue

        # List detection
        if re.match(r'^[\s]*[-*+]\s+', line) or re.match(r'^[\s]*\d+\.\s+', line):
            if current_type != 'list':
                flush_segment()
                current_type = 'list'
            current_segment.append(line)
            continue
        elif current_type == 'list' and line.strip() == '':
            flush_segment()
            continue

        # Empty line - potential paragraph break
        if line.strip() == '':
            if current_type == 'paragraph' and current_segment:
                flush_segment()
            continue

        # Default: paragraph
        if current_type not in ['list']:
            current_type = 'paragraph'
        current_segment.append(line)

    flush_segment()
    return segments


def score_segment(segment: Segment) -> float:
    """Score a segment for quality comparison."""
    score = 0.0
    content = segment.content

    if segment.type == 'table':
        # Count rows and columns
        rows = [l for l in content.split('\n') if '|' in l]
        if rows:
            cols = rows[0].count('|') - 1
            score += len(rows) * 0.5  # More rows = better
            score += cols * 0.3  # More columns = better
            # Penalize separator-only tables
            if all(re.match(r'^[\s|:-]+$', r) for r in rows):
                score -= 5.0
            # Bonus for proper header separator
            if len(rows) > 1 and re.match(r'^[\s|:-]+$', rows[1]):
                score += 1.0

    elif segment.type == 'heading':
        # Prefer proper heading hierarchy
        score += 1.0
        # Penalize very long headings
        if len(content) > 100:
            score -= 0.5

    elif segment.type == 'image':
        # Prefer images with alt text
        if re.search(r'!\[.+\]', content):
            score += 1.0
        # Prefer local paths over base64
        if 'data:image' not in content:
            score += 0.5

    elif segment.type == 'list':
        items = re.findall(r'^[\s]*[-*+\d.]+\s+', content, re.MULTILINE)
        score += len(items) * 0.3
        # Bonus for nested lists
        if re.search(r'^\s{2,}[-*+]', content, re.MULTILINE):
            score += 0.5

    elif segment.type == 'code':
        lines = content.split('\n')
        score += min(len(lines) * 0.2, 3.0)
        # Bonus for language specification
        if re.match(r'^```\w+', content):
            score += 0.5

    else:  # paragraph
        words = len(content.split())
        score += min(words * 0.05, 2.0)
        # Penalize very short paragraphs
        if words < 5:
            score -= 0.5

    return score


def find_matching_segment(
    segment: Segment,
    candidates: list[Segment],
    used_indices: set
) -> Optional[int]:
    """Find a matching segment in candidates by type and similarity."""
    best_match = None
    best_similarity = 0.3  # Minimum threshold

    for i, candidate in enumerate(candidates):
        if i in used_indices:
            continue
        if candidate.type != segment.type:
            continue

        # Calculate similarity
        if segment.type == 'heading':
            # Compare heading text (ignore # symbols)
            s1 = re.sub(r'^#+\s*', '', segment.content).lower()
            s2 = re.sub(r'^#+\s*', '', candidate.content).lower()
            similarity = _text_similarity(s1, s2)
        elif segment.type == 'table':
            # Compare first row (header)
            h1 = segment.content.split('\n')[0] if segment.content else ''
            h2 = candidate.content.split('\n')[0] if candidate.content else ''
            similarity = _text_similarity(h1, h2)
        else:
            # Compare content directly
            similarity = _text_similarity(segment.content, candidate.content)

        if similarity > best_similarity:
            best_similarity = similarity
            best_match = i

    return best_match


def _text_similarity(s1: str, s2: str) -> float:
    """Calculate simple text similarity (Jaccard on words)."""
    if not s1 or not s2:
        return 0.0

    words1 = set(s1.lower().split())
    words2 = set(s2.lower().split())

    if not words1 or not words2:
        return 0.0

    intersection = len(words1 & words2)
    union = len(words1 | words2)

    return intersection / union if union > 0 else 0.0


def merge_markdown_files(
    files: list[Path],
    source_names: Optional[list[str]] = None
) -> MergeResult:
    """Merge multiple markdown files by selecting best segments."""
    if not files:
        return MergeResult(markdown="", sources=[])

    if source_names is None:
        source_names = [f.stem for f in files]

    # Parse all files into segments
    all_segments = []
    for i, file_path in enumerate(files):
        content = file_path.read_text()
        segments = parse_segments(content)
        # Score each segment
        for seg in segments:
            seg.score = score_segment(seg)
        all_segments.append((source_names[i], segments))

    if len(all_segments) == 1:
        return MergeResult(
            markdown=files[0].read_text(),
            sources=[source_names[0]]
        )

    # Use first file as base structure
    base_name, base_segments = all_segments[0]
    merged_segments = []
    segment_sources = {}

    for i, base_seg in enumerate(base_segments):
        best_segment = base_seg
        best_source = base_name

        # Find matching segments in other files
        for other_name, other_segments in all_segments[1:]:
            match_idx = find_matching_segment(base_seg, other_segments, set())

            if match_idx is not None:
                other_seg = other_segments[match_idx]
                if other_seg.score > best_segment.score:
                    best_segment = other_seg
                    best_source = other_name

        merged_segments.append(best_segment)
        segment_sources[i] = best_source

    # Check for segments in other files that weren't matched
    # (content that only appears in secondary sources)
    for other_name, other_segments in all_segments[1:]:
        for other_seg in other_segments:
            match_idx = find_matching_segment(other_seg, base_segments, set())
            if match_idx is None and other_seg.score > 0.5:
                # This segment doesn't exist in base - consider adding
                merged_segments.append(other_seg)
                segment_sources[len(merged_segments) - 1] = other_name

    # Reconstruct markdown
    merged_md = '\n\n'.join(seg.content for seg in merged_segments)

    return MergeResult(
        markdown=merged_md,
        sources=source_names,
        segment_sources=segment_sources
    )


def merge_from_json(json_path: Path) -> MergeResult:
    """Merge from JSON results file (from convert.py)."""
    with open(json_path) as f:
        data = json.load(f)

    results = data.get('results', [])
    if not results:
        return MergeResult(markdown="", sources=[])

    # Filter successful results
    successful = [r for r in results if r.get('success') and r.get('markdown')]
    if not successful:
        return MergeResult(markdown="", sources=[])

    if len(successful) == 1:
        return MergeResult(
            markdown=successful[0]['markdown'],
            sources=[successful[0]['tool']]
        )

    # Parse and merge
    all_segments = []
    for result in successful:
        tool = result['tool']
        segments = parse_segments(result['markdown'])
        for seg in segments:
            seg.score = score_segment(seg)
        all_segments.append((tool, segments))

    # Same merge logic as merge_markdown_files
    base_name, base_segments = all_segments[0]
    merged_segments = []
    segment_sources = {}

    for i, base_seg in enumerate(base_segments):
        best_segment = base_seg
        best_source = base_name

        for other_name, other_segments in all_segments[1:]:
            match_idx = find_matching_segment(base_seg, other_segments, set())
            if match_idx is not None:
                other_seg = other_segments[match_idx]
                if other_seg.score > best_segment.score:
                    best_segment = other_seg
                    best_source = other_name

        merged_segments.append(best_segment)
        segment_sources[i] = best_source

    merged_md = '\n\n'.join(seg.content for seg in merged_segments)

    return MergeResult(
        markdown=merged_md,
        sources=[r['tool'] for r in successful],
        segment_sources=segment_sources
    )


def main():
    parser = argparse.ArgumentParser(
        description="Merge markdown outputs from multiple conversion tools"
    )
    parser.add_argument(
        "inputs",
        nargs="*",
        type=Path,
        help="Input markdown files to merge"
    )
    parser.add_argument(
        "-o", "--output",
        type=Path,
        help="Output merged markdown file"
    )
    parser.add_argument(
        "--from-json",
        type=Path,
        help="Merge from JSON results file (from convert.py)"
    )
    parser.add_argument(
        "--verbose",
        action="store_true",
        help="Show segment source attribution"
    )

    args = parser.parse_args()

    if args.from_json:
        result = merge_from_json(args.from_json)
    elif args.inputs:
        # Validate inputs
        for f in args.inputs:
            if not f.exists():
                print(f"Error: File not found: {f}", file=sys.stderr)
                sys.exit(1)
        result = merge_markdown_files(args.inputs)
    else:
        parser.error("Either input files or --from-json is required")

    if not result.markdown:
        print("Error: No content to merge", file=sys.stderr)
        sys.exit(1)

    # Output
    if args.output:
        args.output.parent.mkdir(parents=True, exist_ok=True)
        args.output.write_text(result.markdown)
        print(f"Merged output: {args.output}")
        print(f"Sources: {', '.join(result.sources)}")
    else:
        print(result.markdown)

    if args.verbose and result.segment_sources:
        print("\n--- Segment Attribution ---", file=sys.stderr)
        for idx, source in result.segment_sources.items():
            print(f"  Segment {idx}: {source}", file=sys.stderr)


if __name__ == "__main__":
    main()
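The matching step above hinges on `_text_similarity`; a worked example of the Jaccard calculation it performs (a standalone re-implementation using the same word-set definition):

```python
def text_similarity(s1: str, s2: str) -> float:
    # Jaccard similarity on lowercase word sets, as in merge_outputs.py
    words1, words2 = set(s1.lower().split()), set(s2.lower().split())
    if not words1 or not words2:
        return 0.0
    return len(words1 & words2) / len(words1 | words2)

# "quarterly" and "revenue" are shared; the union has 4 distinct words
print(text_similarity("Quarterly Revenue Report", "quarterly revenue summary"))
# → 0.5
```

Two segments match when this value exceeds the 0.3 threshold in `find_matching_segment`, so headings that share roughly a third of their words are treated as the same segment.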
466 doc-to-markdown/scripts/validate_output.py (Executable file)
@@ -0,0 +1,466 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Quality validator for document-to-markdown conversion.
|
||||
|
||||
Compare original document with converted markdown to assess conversion quality.
|
||||
Generates HTML quality report with detailed metrics.
|
||||
|
||||
Usage:
|
||||
uv run --with pymupdf scripts/validate_output.py document.pdf output.md
|
||||
uv run --with pymupdf scripts/validate_output.py document.pdf output.md --report report.html
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import html
|
||||
import re
|
||||
import subprocess
|
||||
import sys
|
||||
from dataclasses import dataclass, field
|
||||
from pathlib import Path
|
||||
from typing import Optional
|
||||
|
||||
|
||||
@dataclass
|
||||
class ValidationMetrics:
|
||||
"""Quality metrics for conversion validation."""
|
||||
# Text metrics
|
||||
source_char_count: int = 0
|
||||
output_char_count: int = 0
|
||||
text_retention: float = 0.0
|
||||
|
||||
# Table metrics
|
||||
source_table_count: int = 0
|
||||
output_table_count: int = 0
|
||||
table_retention: float = 0.0
|
||||
|
||||
# Image metrics
|
||||
source_image_count: int = 0
|
||||
output_image_count: int = 0
|
||||
image_retention: float = 0.0
|
||||
|
||||
# Structure metrics
|
||||
heading_count: int = 0
|
||||
list_count: int = 0
|
||||
code_block_count: int = 0
|
||||
|
||||
# Quality scores
|
||||
overall_score: float = 0.0
|
||||
status: str = "unknown" # pass, warn, fail
|
||||
|
||||
# Details
|
||||
warnings: list[str] = field(default_factory=list)
|
||||
errors: list[str] = field(default_factory=list)
|
||||
|
||||
|
||||
def extract_text_from_pdf(pdf_path: Path) -> tuple[str, int, int]:
|
||||
"""Extract text, table count, and image count from PDF."""
|
||||
try:
|
||||
import fitz # PyMuPDF
|
||||
|
||||
doc = fitz.open(str(pdf_path))
|
||||
text_parts = []
|
||||
table_count = 0
|
||||
image_count = 0
|
||||
|
||||
for page in doc:
|
||||
text_parts.append(page.get_text())
|
||||
# Count images
|
||||
image_count += len(page.get_images())
|
||||
# Estimate tables (look for grid-like structures)
|
||||
# This is approximate - tables are hard to detect in PDFs
|
||||
page_text = page.get_text()
|
||||
if re.search(r'(\t.*){2,}', page_text) or '│' in page_text:
|
||||
table_count += 1
|
||||
|
||||
doc.close()
|
||||
return '\n'.join(text_parts), table_count, image_count
|
||||
|
||||
except ImportError:
|
||||
# Fallback to pdftotext if available
|
||||
try:
|
||||
result = subprocess.run(
|
||||
['pdftotext', '-layout', str(pdf_path), '-'],
|
||||
capture_output=True,
|
||||
text=True,
|
||||
timeout=60
|
||||
)
|
||||
return result.stdout, 0, 0 # Can't count tables/images
|
||||
except Exception:
|
||||
return "", 0, 0
|
||||
|
||||
|
||||
def extract_text_from_docx(docx_path: Path) -> tuple[str, int, int]:
|
||||
"""Extract text, table count, and image count from DOCX."""
|
||||
try:
|
||||
import zipfile
|
||||
from xml.etree import ElementTree as ET
|
||||
|
||||
with zipfile.ZipFile(docx_path, 'r') as z:
|
||||
# Extract main document text
|
||||
if 'word/document.xml' not in z.namelist():
|
||||
return "", 0, 0
|
||||
|
||||
with z.open('word/document.xml') as f:
|
||||
tree = ET.parse(f)
|
||||
root = tree.getroot()
|
||||
|
||||
# Extract text
|
||||
ns = {'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}
|
||||
text_parts = []
|
||||
for t in root.iter('{http://schemas.openxmlformats.org/wordprocessingml/2006/main}t'):
|
||||
if t.text:
|
||||
text_parts.append(t.text)
|
||||
|
||||
# Count tables
|
||||
tables = root.findall('.//w:tbl', ns)
|
||||
table_count = len(tables)
|
||||
|
||||
# Count images
|
||||
image_count = sum(1 for name in z.namelist()
|
||||
if name.startswith('word/media/'))
|
||||
|
||||
return ' '.join(text_parts), table_count, image_count
|
||||
|
||||
except Exception as e:
|
||||
return "", 0, 0
|
||||
|
||||
|
||||
def analyze_markdown(md_path: Path) -> dict:
|
||||
"""Analyze markdown file structure and content."""
|
||||
content = md_path.read_text()
|
||||
|
||||
# Count tables (markdown tables with |)
|
||||
table_lines = [l for l in content.split('\n')
|
||||
if re.match(r'^\s*\|.*\|', l)]
|
||||
# Group consecutive table lines
|
||||
table_count = 0
|
||||
in_table = False
|
||||
for line in content.split('\n'):
|
||||
if re.match(r'^\s*\|.*\|', line):
|
||||
if not in_table:
|
||||
table_count += 1
|
||||
in_table = True
|
||||
else:
|
||||
in_table = False
|
||||
|
||||
# Count images
|
||||
images = re.findall(r'!\[.*?\]\(.*?\)', content)
|
||||
|
||||
# Count headings
|
||||
headings = re.findall(r'^#{1,6}\s+.+$', content, re.MULTILINE)
|
||||
|
||||
# Count lists
|
||||
list_items = re.findall(r'^[\s]*[-*+]\s+', content, re.MULTILINE)
|
||||
list_items += re.findall(r'^[\s]*\d+\.\s+', content, re.MULTILINE)
|
||||
|
||||
# Count code blocks
|
||||
code_blocks = re.findall(r'```', content)
|
||||
|
||||
# Clean text for comparison
|
||||
clean_text = re.sub(r'```.*?```', '', content, flags=re.DOTALL)
|
||||
clean_text = re.sub(r'!\[.*?\]\(.*?\)', '', clean_text)
|
||||
clean_text = re.sub(r'\[.*?\]\(.*?\)', '', clean_text)
|
||||
clean_text = re.sub(r'[#*_`|>-]', '', clean_text)
|
||||
clean_text = re.sub(r'\s+', ' ', clean_text).strip()
|
||||
|
||||
return {
|
||||
'char_count': len(clean_text),
|
||||
'table_count': table_count,
|
||||
'image_count': len(images),
|
||||
'heading_count': len(headings),
|
||||
'list_count': len(list_items),
|
||||
'code_block_count': len(code_blocks) // 2,
|
||||
'raw_content': content,
|
||||
'clean_text': clean_text
|
||||
}
|
||||
|
||||
|
||||
def validate_conversion(
|
||||
source_path: Path,
|
||||
output_path: Path
|
||||
) -> ValidationMetrics:
|
||||
"""Validate conversion quality by comparing source and output."""
|
||||
metrics = ValidationMetrics()
|
||||
|
||||
# Analyze output markdown
|
||||
md_analysis = analyze_markdown(output_path)
|
||||
metrics.output_char_count = md_analysis['char_count']
|
||||
metrics.output_table_count = md_analysis['table_count']
|
||||
metrics.output_image_count = md_analysis['image_count']
|
||||
metrics.heading_count = md_analysis['heading_count']
|
||||
    metrics.list_count = md_analysis['list_count']
    metrics.code_block_count = md_analysis['code_block_count']

    # Extract source content based on file type
    ext = source_path.suffix.lower()
    if ext == '.pdf':
        source_text, source_tables, source_images = extract_text_from_pdf(source_path)
    elif ext in ['.docx', '.doc']:
        source_text, source_tables, source_images = extract_text_from_docx(source_path)
    else:
        # Other formats cannot be introspected here; record a warning
        source_text = ""
        source_tables = 0
        source_images = 0
        metrics.warnings.append(f"Cannot analyze source format: {ext}")

    metrics.source_char_count = len(source_text.replace(' ', '').replace('\n', ''))
    metrics.source_table_count = source_tables
    metrics.source_image_count = source_images

    # Calculate retention rates
    if metrics.source_char_count > 0:
        # Use ratio of actual/expected, capped at 1.0
        metrics.text_retention = min(
            metrics.output_char_count / metrics.source_char_count,
            1.0
        )
    else:
        metrics.text_retention = 1.0 if metrics.output_char_count > 0 else 0.0

    if metrics.source_table_count > 0:
        metrics.table_retention = min(
            metrics.output_table_count / metrics.source_table_count,
            1.0
        )
    else:
        metrics.table_retention = 1.0  # No tables expected

    if metrics.source_image_count > 0:
        metrics.image_retention = min(
            metrics.output_image_count / metrics.source_image_count,
            1.0
        )
    else:
        metrics.image_retention = 1.0  # No images expected

    # Determine status based on thresholds
    if metrics.text_retention < 0.85:
        metrics.errors.append(f"Low text retention: {metrics.text_retention:.1%}")
    elif metrics.text_retention < 0.95:
        metrics.warnings.append(f"Text retention below optimal: {metrics.text_retention:.1%}")

    if metrics.source_table_count > 0 and metrics.table_retention < 0.9:
        metrics.errors.append(f"Tables missing: {metrics.table_retention:.1%} retained")
    elif metrics.source_table_count > 0 and metrics.table_retention < 1.0:
        metrics.warnings.append(f"Some tables may be incomplete: {metrics.table_retention:.1%}")

    if metrics.source_image_count > 0 and metrics.image_retention < 0.8:
        metrics.errors.append(f"Images missing: {metrics.image_retention:.1%} retained")
    elif metrics.source_image_count > 0 and metrics.image_retention < 1.0:
        metrics.warnings.append(f"Some images missing: {metrics.image_retention:.1%}")
    # Calculate overall score (0-100): text weighted 50%, tables 25%, images 25%
    metrics.overall_score = (
        metrics.text_retention * 0.50 +
        metrics.table_retention * 0.25 +
        metrics.image_retention * 0.25
    ) * 100
    # Determine status
    if metrics.errors:
        metrics.status = "fail"
    elif metrics.warnings:
        metrics.status = "warn"
    else:
        metrics.status = "pass"

    return metrics


def generate_html_report(
    metrics: ValidationMetrics,
    source_path: Path,
    output_path: Path
) -> str:
    """Generate HTML quality report."""
    status_colors = {
        "pass": "#28a745",
        "warn": "#ffc107",
        "fail": "#dc3545"
    }
    status_color = status_colors.get(metrics.status, "#6c757d")

    def metric_bar(value: float, thresholds: tuple) -> str:
        """Generate colored progress bar."""
        pct = int(value * 100)
        if value >= thresholds[0]:
            color = "#28a745"  # green
        elif value >= thresholds[1]:
            color = "#ffc107"  # yellow
        else:
            color = "#dc3545"  # red
        return f'''
        <div style="background: #e9ecef; border-radius: 4px; overflow: hidden; height: 20px;">
            <div style="background: {color}; width: {pct}%; height: 100%; transition: width 0.3s;"></div>
        </div>
        <span style="font-size: 14px; color: #666;">{pct}%</span>
        '''

    report = f'''<!DOCTYPE html>
<html>
<head>
    <meta charset="UTF-8">
    <title>Conversion Quality Report</title>
    <style>
        body {{ font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; margin: 40px; background: #f5f5f5; }}
        .container {{ max-width: 800px; margin: 0 auto; background: white; padding: 30px; border-radius: 8px; box-shadow: 0 2px 4px rgba(0,0,0,0.1); }}
        h1 {{ color: #333; border-bottom: 2px solid #eee; padding-bottom: 15px; }}
        .status {{ display: inline-block; padding: 8px 16px; border-radius: 4px; color: white; font-weight: bold; }}
        .metric {{ margin: 20px 0; padding: 15px; background: #f8f9fa; border-radius: 4px; }}
        .metric-label {{ font-weight: bold; color: #333; margin-bottom: 8px; }}
        .metric-value {{ font-size: 24px; color: #333; }}
        .issues {{ margin-top: 20px; }}
        .error {{ background: #f8d7da; color: #721c24; padding: 10px; margin: 5px 0; border-radius: 4px; }}
        .warning {{ background: #fff3cd; color: #856404; padding: 10px; margin: 5px 0; border-radius: 4px; }}
        table {{ width: 100%; border-collapse: collapse; margin: 15px 0; }}
        th, td {{ padding: 10px; text-align: left; border-bottom: 1px solid #eee; }}
        th {{ background: #f8f9fa; }}
        .score {{ font-size: 48px; font-weight: bold; color: {status_color}; }}
    </style>
</head>
<body>
<div class="container">
    <h1>📊 Conversion Quality Report</h1>

    <div style="text-align: center; margin: 30px 0;">
        <div class="score">{metrics.overall_score:.0f}</div>
        <div style="color: #666;">Overall Score</div>
        <div class="status" style="background: {status_color}; margin-top: 10px;">
            {metrics.status.upper()}
        </div>
    </div>

    <h2>📄 File Information</h2>
    <table>
        <tr><th>Source</th><td>{html.escape(str(source_path))}</td></tr>
        <tr><th>Output</th><td>{html.escape(str(output_path))}</td></tr>
    </table>

    <h2>📏 Retention Metrics</h2>

    <div class="metric">
        <div class="metric-label">Text Retention (target: >95%)</div>
        {metric_bar(metrics.text_retention, (0.95, 0.85))}
        <div style="font-size: 12px; color: #666; margin-top: 5px;">
            Source: ~{metrics.source_char_count:,} chars | Output: {metrics.output_char_count:,} chars
        </div>
    </div>

    <div class="metric">
        <div class="metric-label">Table Retention (target: 100%)</div>
        {metric_bar(metrics.table_retention, (1.0, 0.9))}
        <div style="font-size: 12px; color: #666; margin-top: 5px;">
            Source: {metrics.source_table_count} tables | Output: {metrics.output_table_count} tables
        </div>
    </div>

    <div class="metric">
        <div class="metric-label">Image Retention (target: 100%)</div>
        {metric_bar(metrics.image_retention, (1.0, 0.8))}
        <div style="font-size: 12px; color: #666; margin-top: 5px;">
            Source: {metrics.source_image_count} images | Output: {metrics.output_image_count} images
        </div>
    </div>

    <h2>📊 Structure Analysis</h2>
    <table>
        <tr><th>Headings</th><td>{metrics.heading_count}</td></tr>
        <tr><th>List Items</th><td>{metrics.list_count}</td></tr>
        <tr><th>Code Blocks</th><td>{metrics.code_block_count}</td></tr>
    </table>

    {'<h2>⚠️ Issues</h2><div class="issues">' + ''.join(f'<div class="error">❌ {html.escape(e)}</div>' for e in metrics.errors) + ''.join(f'<div class="warning">⚠️ {html.escape(w)}</div>' for w in metrics.warnings) + '</div>' if metrics.errors or metrics.warnings else ''}

    <div style="margin-top: 30px; padding-top: 20px; border-top: 1px solid #eee; color: #666; font-size: 12px;">
        Generated by doc-to-markdown validate_output.py
    </div>
</div>
</body>
</html>
'''
    return report


def main():
    parser = argparse.ArgumentParser(
        description="Validate document-to-markdown conversion quality"
    )
    parser.add_argument(
        "source",
        type=Path,
        help="Original document (PDF, DOCX, etc.)"
    )
    parser.add_argument(
        "output",
        type=Path,
        help="Converted markdown file"
    )
    parser.add_argument(
        "--report",
        type=Path,
        help="Generate HTML report at this path"
    )
    parser.add_argument(
        "--json",
        action="store_true",
        help="Output metrics as JSON"
    )

    args = parser.parse_args()

    # Validate inputs
    if not args.source.exists():
        print(f"Error: Source file not found: {args.source}", file=sys.stderr)
        sys.exit(1)
    if not args.output.exists():
        print(f"Error: Output file not found: {args.output}", file=sys.stderr)
        sys.exit(1)

    # Run validation
    metrics = validate_conversion(args.source, args.output)

    # Output results
    if args.json:
        import json
        print(json.dumps({
            'text_retention': metrics.text_retention,
            'table_retention': metrics.table_retention,
            'image_retention': metrics.image_retention,
            'overall_score': metrics.overall_score,
            'status': metrics.status,
            'warnings': metrics.warnings,
            'errors': metrics.errors
        }, indent=2))
    else:
        # Console output
        status_emoji = {"pass": "✅", "warn": "⚠️", "fail": "❌"}.get(metrics.status, "❓")
        print(f"\n{status_emoji} Conversion Quality: {metrics.status.upper()}")
        print(f"   Overall Score: {metrics.overall_score:.0f}/100")
        print(f"\n   Text Retention:  {metrics.text_retention:.1%}")
        print(f"   Table Retention: {metrics.table_retention:.1%}")
        print(f"   Image Retention: {metrics.image_retention:.1%}")

        if metrics.errors:
            print("\n   Errors:")
            for e in metrics.errors:
                print(f"   ❌ {e}")

        if metrics.warnings:
            print("\n   Warnings:")
            for w in metrics.warnings:
                print(f"   ⚠️ {w}")

    # Generate HTML report
    if args.report:
        report_html = generate_html_report(metrics, args.source, args.output)
        args.report.parent.mkdir(parents=True, exist_ok=True)
        args.report.write_text(report_html)
        print(f"\n📊 HTML report: {args.report}")

    # Exit with appropriate code
    sys.exit(0 if metrics.status != "fail" else 1)


if __name__ == "__main__":
    main()