feat(doc-to-markdown): CJK bold spacing, JSON pretty-print, 31 tests, full rename cleanup

- Add CJK bold spacing fix: insert spaces around **bold** spans containing
  CJK characters for correct rendering (handles emoji adjacency, already-spaced)
- Add JSON pretty-print: auto-format JSON code blocks with 2-space indent
- Add 31 unit tests covering all post-processing functions
- Fix pandoc simple table detection (1-space column gaps)
- Fix image path double-nesting when --assets-dir ends with 'media'
- Rename all markdown-tools references across 15 files (README, QUICKSTART,
  marketplace.json, CLAUDE.md, meeting-minutes-taker, GitHub templates)
- Add 5-tool benchmark report (Docling/MarkItDown/Pandoc/Mammoth/ours)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
daymade
2026-03-23 03:18:37 +08:00
parent a5f3a4bfbe
commit d9e1967689
16 changed files with 351 additions and 90 deletions
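
The JSON pretty-print behaviour named in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code in `convert.py`: the helper name `pretty_print_json_lines` is hypothetical, and it assumes a code block arrives as a list of lines, as the diff further down suggests.

```python
import json


def pretty_print_json_lines(lines):
    """Re-indent `lines` with 2 spaces if they form valid JSON, else return them unchanged.

    Hypothetical helper for illustration; the name and signature are assumptions.
    """
    try:
        parsed = json.loads("\n".join(lines))
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return lines
    # ensure_ascii=False keeps CJK text readable instead of emitting \uXXXX escapes
    return json.dumps(parsed, indent=2, ensure_ascii=False).split("\n")


compact = ['{"name": "飞书", "tags": ["a", "b"]}']
print("\n".join(pretty_print_json_lines(compact)))
```

Falling back to the original lines on a parse error means a block that is only labelled as JSON, but is not valid, passes through untouched.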


@@ -668,7 +668,7 @@
}, },
{ {
"name": "meeting-minutes-taker", "name": "meeting-minutes-taker",
"description": "Transform meeting transcripts into high-fidelity, structured meeting minutes with iterative review. Features speaker identification via feature analysis (word count, speaking style, topic focus) with context.md team directory mapping, intelligent file naming from content, integration with markdown-tools and transcript-fixer for pre-processing, evidence-based recording with speaker quotes, Mermaid diagrams for architecture discussions, and multi-turn parallel generation with UNION merge", "description": "Transform meeting transcripts into high-fidelity, structured meeting minutes with iterative review. Features speaker identification via feature analysis (word count, speaking style, topic focus) with context.md team directory mapping, intelligent file naming from content, integration with doc-to-markdown and transcript-fixer for pre-processing, evidence-based recording with speaker quotes, Mermaid diagrams for architecture discussions, and multi-turn parallel generation with UNION merge",
"source": "./", "source": "./",
"strict": false, "strict": false,
"version": "1.1.0", "version": "1.1.0",


@@ -16,7 +16,7 @@ Which skill is affected?
- [ ] skill-creator - [ ] skill-creator
- [ ] github-ops - [ ] github-ops
- [ ] markdown-tools - [ ] doc-to-markdown
- [ ] mermaid-tools - [ ] mermaid-tools
- [ ] statusline-generator - [ ] statusline-generator
- [ ] teams-channel-post-writer - [ ] teams-channel-post-writer


@@ -20,7 +20,7 @@ Which skill would this enhance?
- [ ] skill-creator - [ ] skill-creator
- [ ] github-ops - [ ] github-ops
- [ ] markdown-tools - [ ] doc-to-markdown
- [ ] mermaid-tools - [ ] mermaid-tools
- [ ] statusline-generator - [ ] statusline-generator
- [ ] teams-channel-post-writer - [ ] teams-channel-post-writer


@@ -33,7 +33,7 @@ Which skills are affected by this PR?
- [ ] skill-creator - [ ] skill-creator
- [ ] github-ops - [ ] github-ops
- [ ] markdown-tools - [ ] doc-to-markdown
- [ ] mermaid-tools - [ ] mermaid-tools
- [ ] statusline-generator - [ ] statusline-generator
- [ ] teams-channel-post-writer - [ ] teams-channel-post-writer


@@ -7,8 +7,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
## [Unreleased] ## [Unreleased]
### Added ### Changed
- None - **Renamed**: `markdown-tools` → `doc-to-markdown` — clearer name for DOCX/PDF/PPTX → Markdown conversion
- **doc-to-markdown**: Added 8 DOCX post-processing fixes (grid tables, simple tables, CJK bold spacing, JSON pretty-print, image path flattening, pandoc attribute cleanup, code block detection, bracket fixes)
- **doc-to-markdown**: Added 31 unit tests (`test_convert.py`)
- **doc-to-markdown**: Added 5-tool benchmark report (`references/benchmark-2026-03-22.md`)
## [1.39.0] - 2026-03-18 ## [1.39.0] - 2026-03-18


@@ -179,7 +179,7 @@ This applies when you change ANY file under a skill directory:
1. **skill-creator** ⭐ - **Essential meta-skill** for creating your own skills (with init/validate/package scripts) 1. **skill-creator** ⭐ - **Essential meta-skill** for creating your own skills (with init/validate/package scripts)
2. **github-ops** - GitHub operations via gh CLI and API 2. **github-ops** - GitHub operations via gh CLI and API
3. **markdown-tools** - Document conversion with WSL path handling 3. **doc-to-markdown** - DOCX/PDF/PPTX → Markdown conversion with CJK post-processing
4. **mermaid-tools** - Diagram extraction and PNG generation 4. **mermaid-tools** - Diagram extraction and PNG generation
5. **statusline-generator** - Claude Code statusline customization 5. **statusline-generator** - Claude Code statusline customization
6. **teams-channel-post-writer** - Teams communication templates 6. **teams-channel-post-writer** - Teams communication templates


@@ -122,7 +122,7 @@ claude plugin marketplace add https://github.com/daymade/claude-code-skills
# In Claude Code use `/plugin ...`; in your terminal use `claude plugin ...` # In Claude Code use `/plugin ...`; in your terminal use `claude plugin ...`
# Step 2: Install skills you need # Step 2: Install skills you need
claude plugin install github-ops@daymade-skills claude plugin install github-ops@daymade-skills
claude plugin install markdown-tools@daymade-skills claude plugin install doc-to-markdown@daymade-skills
# ... add more as needed # ... add more as needed
# Step 3: Restart Claude Code # Step 3: Restart Claude Code
@@ -136,7 +136,7 @@ This table is a quick starter list. See [README.md](./README.md) for the full ca
|-------|-------------|-------------| |-------|-------------|-------------|
| **skill-creator** ⭐ | Create your own skills | Building custom workflows | | **skill-creator** ⭐ | Create your own skills | Building custom workflows |
| **github-ops** | GitHub operations | Managing PRs, issues, workflows | | **github-ops** | GitHub operations | Managing PRs, issues, workflows |
| **markdown-tools** | Document conversion | Converting docs to markdown | | **doc-to-markdown** | Document conversion | Converting docs to markdown |
| **mermaid-tools** | Diagram generation | Creating PNG diagrams | | **mermaid-tools** | Diagram generation | Creating PNG diagrams |
| **statusline-generator** | Statusline customization | Customizing Claude Code UI | | **statusline-generator** | Statusline customization | Customizing Claude Code UI |
| **teams-channel-post-writer** | Teams communication | Writing professional posts | | **teams-channel-post-writer** | Teams communication | Writing professional posts |


@@ -122,7 +122,7 @@ claude plugin marketplace add https://github.com/daymade/claude-code-skills
# 在 Claude Code 内使用 `/plugin ...`,在终端中使用 `claude plugin ...` # 在 Claude Code 内使用 `/plugin ...`,在终端中使用 `claude plugin ...`
# 步骤 2:安装你需要的技能 # 步骤 2:安装你需要的技能
claude plugin install github-ops@daymade-skills claude plugin install github-ops@daymade-skills
claude plugin install markdown-tools@daymade-skills claude plugin install doc-to-markdown@daymade-skills
# ... 根据需要添加更多 # ... 根据需要添加更多
# 步骤 3:重启 Claude Code # 步骤 3:重启 Claude Code
@@ -136,7 +136,7 @@ claude plugin install markdown-tools@daymade-skills
|-------|-------------|-------------| |-------|-------------|-------------|
| **skill-creator** ⭐ | 创建你自己的技能 | 构建自定义工作流 | | **skill-creator** ⭐ | 创建你自己的技能 | 构建自定义工作流 |
| **github-ops** | GitHub 操作 | 管理 PR、问题、工作流 | | **github-ops** | GitHub 操作 | 管理 PR、问题、工作流 |
| **markdown-tools** | 文档转换 | 将文档转换为 markdown | | **doc-to-markdown** | 文档转换 | 将文档转换为 markdown |
| **mermaid-tools** | 图表生成 | 创建 PNG 图表 | | **mermaid-tools** | 图表生成 | 创建 PNG 图表 |
| **statusline-generator** | 状态栏定制 | 自定义 Claude Code UI | | **statusline-generator** | 状态栏定制 | 自定义 Claude Code UI |
| **teams-channel-post-writer** | Teams 通信 | 编写专业帖子 | | **teams-channel-post-writer** | Teams 通信 | 编写专业帖子 |


@@ -146,7 +146,7 @@ claude plugin install skill-creator@daymade-skills
claude plugin install github-ops@daymade-skills claude plugin install github-ops@daymade-skills
# Document conversion # Document conversion
claude plugin install markdown-tools@daymade-skills claude plugin install doc-to-markdown@daymade-skills
# Diagram generation # Diagram generation
claude plugin install mermaid-tools@daymade-skills claude plugin install mermaid-tools@daymade-skills
@@ -294,7 +294,7 @@ Comprehensive GitHub operations using gh CLI and GitHub API.
--- ---
### 2. **markdown-tools** - Document Conversion Suite ### 2. **doc-to-markdown** - Document Conversion Suite
Converts documents to markdown with Windows/WSL path handling and PDF image extraction. Converts documents to markdown with Windows/WSL path handling and PDF image extraction.
@@ -313,7 +313,7 @@ Converts documents to markdown with Windows/WSL path handling and PDF image extr
**🎬 Live Demo** **🎬 Live Demo**
![Markdown Tools Demo](./demos/markdown-tools/convert-docs.gif) ![Markdown Tools Demo](./demos/doc-to-markdown/convert-docs.gif)
--- ---
@@ -1838,7 +1838,7 @@ Want to see all demos in one place with click-to-enlarge functionality? Check ou
Use **github-ops** to streamline PR creation, issue management, and API operations. Use **github-ops** to streamline PR creation, issue management, and API operations.
### For Documentation ### For Documentation
Combine **markdown-tools** for document conversion and **mermaid-tools** for diagram generation to create comprehensive documentation. Use **llm-icon-finder** to add brand icons. Combine **doc-to-markdown** for document conversion and **mermaid-tools** for diagram generation to create comprehensive documentation. Use **llm-icon-finder** to add brand icons.
### For Research & Analysis ### For Research & Analysis
Use **deep-research** to produce format-controlled research reports with evidence tables and citations. Combine with **fact-checker** to validate claims or with **twitter-reader** for social-source collection. Use **deep-research** to produce format-controlled research reports with evidence tables and citations. Combine with **fact-checker** to validate claims or with **twitter-reader** for social-source collection.
@@ -1916,7 +1916,7 @@ Use **iOS-APP-developer** to configure XcodeGen projects, resolve SPM dependency
Use **macos-cleaner** to intelligently analyze and reclaim disk space on macOS with safety-first approach. Unlike one-click cleaners that blindly delete, macos-cleaner explains what each file is, categorizes by risk level (🟢/🟡/🔴), and requires explicit confirmation before any deletion. Perfect for developers dealing with Docker/Homebrew/npm/pip cache bloat, users wanting to understand storage consumption, or anyone who values transparency over automation. Combines script-based precision with optional Mole visual tool integration for hybrid workflow. Use **macos-cleaner** to intelligently analyze and reclaim disk space on macOS with safety-first approach. Unlike one-click cleaners that blindly delete, macos-cleaner explains what each file is, categorizes by risk level (🟢/🟡/🔴), and requires explicit confirmation before any deletion. Perfect for developers dealing with Docker/Homebrew/npm/pip cache bloat, users wanting to understand storage consumption, or anyone who values transparency over automation. Combines script-based precision with optional Mole visual tool integration for hybrid workflow.
### For Twitter/X Content Research ### For Twitter/X Content Research
Use **twitter-reader** to fetch tweet content without JavaScript rendering or authentication. Perfect for documenting social media discussions, archiving threads, analyzing tweet content, or gathering reference material from Twitter/X. Combine with **markdown-tools** to convert fetched content into other formats, or with **repomix-safe-mixer** to package research collections securely. Use **twitter-reader** to fetch tweet content without JavaScript rendering or authentication. Perfect for documenting social media discussions, archiving threads, analyzing tweet content, or gathering reference material from Twitter/X. Combine with **doc-to-markdown** to convert fetched content into other formats, or with **repomix-safe-mixer** to package research collections securely.
### For Skill Quality & Open-Source Contributions ### For Skill Quality & Open-Source Contributions
Use **skill-reviewer** to validate your own skills against best practices before publishing, or to review and improve others' skill repositories. Combine with **github-contributor** to find high-impact open-source projects, create professional PRs, and build your contributor reputation. Perfect for developers who want to contribute to the Claude Code ecosystem or any GitHub project systematically. Use **skill-reviewer** to validate your own skills against best practices before publishing, or to review and improve others' skill repositories. Combine with **github-contributor** to find high-impact open-source projects, create professional PRs, and build your contributor reputation. Perfect for developers who want to contribute to the Claude Code ecosystem or any GitHub project systematically.
@@ -1947,7 +1947,7 @@ Each skill includes:
### Quick Links ### Quick Links
- **github-ops**: See `github-ops/references/api_reference.md` for API documentation - **github-ops**: See `github-ops/references/api_reference.md` for API documentation
- **markdown-tools**: See `markdown-tools/references/conversion-examples.md` for conversion scenarios - **doc-to-markdown**: See `doc-to-markdown/references/conversion-examples.md` for conversion scenarios
- **mermaid-tools**: See `mermaid-tools/references/setup_and_troubleshooting.md` for setup guide - **mermaid-tools**: See `mermaid-tools/references/setup_and_troubleshooting.md` for setup guide
- **statusline-generator**: See `statusline-generator/references/color_codes.md` for customization - **statusline-generator**: See `statusline-generator/references/color_codes.md` for customization
- **teams-channel-post-writer**: See `teams-channel-post-writer/references/writing-guidelines.md` for quality standards - **teams-channel-post-writer**: See `teams-channel-post-writer/references/writing-guidelines.md` for quality standards
@@ -1992,7 +1992,7 @@ Each skill includes:
- **Claude Code** 2.0.13 or higher - **Claude Code** 2.0.13 or higher
- **Python 3.6+** (for scripts in multiple skills) - **Python 3.6+** (for scripts in multiple skills)
- **gh CLI** (for github-ops) - **gh CLI** (for github-ops)
- **markitdown** (for markdown-tools) - **markitdown** (for doc-to-markdown)
- **mermaid-cli** (for mermaid-tools) - **mermaid-cli** (for mermaid-tools)
- **yt-dlp** (for youtube-downloader): `brew install yt-dlp` or `pip install yt-dlp` - **yt-dlp** (for youtube-downloader): `brew install yt-dlp` or `pip install yt-dlp`
- **FFmpeg/FFprobe** (for video-comparer): `brew install ffmpeg`, `apt install ffmpeg`, or `winget install ffmpeg` - **FFmpeg/FFprobe** (for video-comparer): `brew install ffmpeg`, `apt install ffmpeg`, or `winget install ffmpeg`


@@ -146,7 +146,7 @@ claude plugin install skill-creator@daymade-skills
claude plugin install github-ops@daymade-skills claude plugin install github-ops@daymade-skills
# 文档转换 # 文档转换
claude plugin install markdown-tools@daymade-skills claude plugin install doc-to-markdown@daymade-skills
# 图表生成 # 图表生成
claude plugin install mermaid-tools@daymade-skills claude plugin install mermaid-tools@daymade-skills
@@ -319,7 +319,7 @@ CC-Switch 支持以下中国 AI 服务提供商:
--- ---
### 2. **markdown-tools** - 文档转换套件 ### 2. **doc-to-markdown** - 文档转换套件
将文档转换为 markdown,支持 Windows/WSL 路径处理和 PDF 图片提取。 将文档转换为 markdown,支持 Windows/WSL 路径处理和 PDF 图片提取。
@@ -338,7 +338,7 @@ CC-Switch 支持以下中国 AI 服务提供商:
**🎬 实时演示** **🎬 实时演示**
![Markdown 工具演示](./demos/markdown-tools/convert-docs.gif) ![Markdown 工具演示](./demos/doc-to-markdown/convert-docs.gif)
--- ---
@@ -1880,7 +1880,7 @@ claude plugin install scrapling-skill@daymade-skills
使用 **github-ops** 简化 PR 创建、问题管理和 API 操作。 使用 **github-ops** 简化 PR 创建、问题管理和 API 操作。
### 文档处理 ### 文档处理
结合 **markdown-tools** 进行文档转换和 **mermaid-tools** 进行图表生成,创建全面的文档。使用 **llm-icon-finder** 添加品牌图标。 结合 **doc-to-markdown** 进行文档转换和 **mermaid-tools** 进行图表生成,创建全面的文档。使用 **llm-icon-finder** 添加品牌图标。
### 调研与分析 ### 调研与分析
使用 **deep-research** 生成格式可控的调研报告,包含证据表与引用。与 **fact-checker** 结合用于验证关键结论,或与 **twitter-reader** 结合收集社媒资料。 使用 **deep-research** 生成格式可控的调研报告,包含证据表与引用。与 **fact-checker** 结合用于验证关键结论,或与 **twitter-reader** 结合收集社媒资料。
@@ -1952,7 +1952,7 @@ claude plugin install scrapling-skill@daymade-skills
使用 **iOS-APP-developer** 配置 XcodeGen 项目,处理 SPM 依赖、签名与部署问题。 使用 **iOS-APP-developer** 配置 XcodeGen 项目,处理 SPM 依赖、签名与部署问题。
### Twitter/X 内容研究 ### Twitter/X 内容研究
使用 **twitter-reader** 无需 JavaScript 渲染或身份验证即可获取推文内容。非常适合记录社交媒体讨论、归档话题、分析推文内容或从 Twitter/X 收集参考资料。与 **markdown-tools** 结合可将获取的内容转换为其他格式,或与 **repomix-safe-mixer** 结合安全地打包研究集合。 使用 **twitter-reader** 无需 JavaScript 渲染或身份验证即可获取推文内容。非常适合记录社交媒体讨论、归档话题、分析推文内容或从 Twitter/X 收集参考资料。与 **doc-to-markdown** 结合可将获取的内容转换为其他格式,或与 **repomix-safe-mixer** 结合安全地打包研究集合。
### macOS 系统维护与磁盘空间恢复 ### macOS 系统维护与磁盘空间恢复
使用 **macos-cleaner** 以安全优先的方式智能分析和恢复 macOS 上的磁盘空间。与盲目删除的一键清理工具不同,macos-cleaner 解释每个文件是什么、按风险级别分类(🟢/🟡/🔴),并在任何删除前需要明确确认。非常适合处理 Docker/Homebrew/npm/pip 缓存膨胀的开发者、希望了解存储空间消耗的用户,或任何重视透明度而非自动化的人。结合基于脚本的精度和可选的 Mole 可视化工具集成以实现混合工作流。 使用 **macos-cleaner** 以安全优先的方式智能分析和恢复 macOS 上的磁盘空间。与盲目删除的一键清理工具不同,macos-cleaner 解释每个文件是什么、按风险级别分类(🟢/🟡/🔴),并在任何删除前需要明确确认。非常适合处理 Docker/Homebrew/npm/pip 缓存膨胀的开发者、希望了解存储空间消耗的用户,或任何重视透明度而非自动化的人。结合基于脚本的精度和可选的 Mole 可视化工具集成以实现混合工作流。
@@ -1989,7 +1989,7 @@ claude plugin install scrapling-skill@daymade-skills
### 快速链接 ### 快速链接
- **github-ops**:参见 `github-ops/references/api_reference.md` 了解 API 文档 - **github-ops**:参见 `github-ops/references/api_reference.md` 了解 API 文档
- **markdown-tools**:参见 `markdown-tools/references/conversion-examples.md` 了解转换场景 - **doc-to-markdown**:参见 `doc-to-markdown/references/conversion-examples.md` 了解转换场景
- **mermaid-tools**:参见 `mermaid-tools/references/setup_and_troubleshooting.md` 了解设置指南 - **mermaid-tools**:参见 `mermaid-tools/references/setup_and_troubleshooting.md` 了解设置指南
- **statusline-generator**:参见 `statusline-generator/references/color_codes.md` 了解自定义 - **statusline-generator**:参见 `statusline-generator/references/color_codes.md` 了解自定义
- **teams-channel-post-writer**:参见 `teams-channel-post-writer/references/writing-guidelines.md` 了解质量标准 - **teams-channel-post-writer**:参见 `teams-channel-post-writer/references/writing-guidelines.md` 了解质量标准
@@ -2034,7 +2034,7 @@ claude plugin install scrapling-skill@daymade-skills
- **Claude Code** 2.0.13 或更高版本 - **Claude Code** 2.0.13 或更高版本
- **Python 3.6+**(用于多个技能中的脚本) - **Python 3.6+**(用于多个技能中的脚本)
- **gh CLI**(用于 github-ops) - **gh CLI**(用于 github-ops)
- **markitdown**(用于 markdown-tools) - **markitdown**(用于 doc-to-markdown)
- **mermaid-cli**(用于 mermaid-tools) - **mermaid-cli**(用于 mermaid-tools)
- **VHS**(用于 cli-demo-generator):`brew install vhs` - **VHS**(用于 cli-demo-generator):`brew install vhs`
- **asciinema**(可选,用于 cli-demo-generator 交互式录制) - **asciinema**(可选,用于 cli-demo-generator 交互式录制)


@@ -14,7 +14,7 @@ demos/
│ └── package-skill.tape # Package for distribution │ └── package-skill.tape # Package for distribution
├── github-ops/ ├── github-ops/
│ └── create-pr.tape # Create pull requests │ └── create-pr.tape # Create pull requests
├── markdown-tools/ ├── doc-to-markdown/
│ └── convert-docs.tape # Convert documents │ └── convert-docs.tape # Convert documents
└── generate_all_demos.sh # Generate all GIFs └── generate_all_demos.sh # Generate all GIFs
``` ```


@@ -1,75 +1,68 @@
--- ---
name: doc-to-markdown name: doc-to-markdown
description: Converts DOCX/PDF/PPTX to high-quality Markdown with automatic post-processing. Fixes pandoc grid tables, image paths, attribute noise, and code blocks. Supports Quick Mode (fast, single tool) and Heavy Mode (best quality, multi-tool merge). Trigger on "convert document", "docx to markdown", "parse word", "doc to markdown", "extract images from document". description: Converts DOCX/PDF/PPTX to high-quality Markdown with automatic post-processing. Fixes pandoc grid tables, simple tables, image paths, CJK bold spacing, attribute noise, and code blocks. Benchmarked best-in-class (7.6/10) against Docling, MarkItDown, Pandoc raw, and Mammoth. Trigger on "convert document", "docx to markdown", "parse word", "doc to markdown", "解析word", "转换文档".
--- ---
# Doc to Markdown # Doc to Markdown
Convert documents to high-quality markdown with intelligent multi-tool orchestration and automatic DOCX post-processing. Convert documents to high-quality markdown with intelligent multi-tool orchestration and automatic DOCX post-processing.
## Dual Mode Architecture **Architecture**: Pandoc (best-in-class extraction) + 8 post-processing fixes (our value-add).
## Quick Start
```bash
# DOCX → Markdown (one command, zero manual fixes)
uv run --with pymupdf4llm --with markitdown scripts/convert.py document.docx -o output.md --assets-dir ./media
# PDF → Markdown
uv run --with pymupdf4llm --with markitdown scripts/convert.py document.pdf -o output.md
# Run tests
uv run --with pytest pytest scripts/test_convert.py -v
```
## Dual Mode
| Mode | Speed | Quality | Use Case | | Mode | Speed | Quality | Use Case |
|------|-------|---------|----------| |------|-------|---------|----------|
| **Quick** (default) | Fast | Good | Drafts, simple documents | | **Quick** (default) | Fast | Good | Drafts, simple documents |
| **Heavy** | Slower | Best | Final documents, complex layouts | | **Heavy** | Slower | Best | Final documents, complex layouts |
## Quick Start ## Tool Selection
### Installation | Format | Quick Mode | Heavy Mode |
|--------|-----------|------------|
```bash
# Required: PDF/DOCX/PPTX support
uv tool install "markitdown[pdf]"
pip install pymupdf4llm
brew install pandoc
```
### Basic Conversion
```bash
# Quick Mode (default) - fast, single best tool
uv run --with pymupdf4llm --with markitdown scripts/convert.py document.pdf -o output.md
# Heavy Mode - multi-tool parallel execution with merge
uv run --with pymupdf4llm --with markitdown scripts/convert.py document.pdf -o output.md --heavy
# DOCX with deep python-docx parsing (experimental)
uv run --with pymupdf4llm --with markitdown --with python-docx scripts/convert.py document.docx -o output.md --docx-deep
# Check available tools
uv run scripts/convert.py --list-tools
```
## Tool Selection Matrix
| Format | Quick Mode Tool | Heavy Mode Tools |
|--------|----------------|------------------|
| PDF | pymupdf4llm | pymupdf4llm + markitdown | | PDF | pymupdf4llm | pymupdf4llm + markitdown |
| DOCX | pandoc + post-processing | pandoc + markitdown | | DOCX | pandoc + post-processing | pandoc + markitdown |
| PPTX | markitdown | markitdown + pandoc | | PPTX | markitdown | markitdown + pandoc |
| XLSX | markitdown | markitdown | | XLSX | markitdown | markitdown |
### Tool Characteristics
- **pymupdf4llm**: LLM-optimized PDF conversion with native table detection and image extraction
- **markitdown**: Microsoft's universal converter, good for Office formats
- **pandoc**: Excellent structure preservation for DOCX/PPTX
## DOCX Post-Processing (automatic) ## DOCX Post-Processing (automatic)
When converting DOCX files via pandoc, the following cleanups are applied automatically: When converting DOCX via pandoc, 8 cleanups are applied automatically:
| Problem | Fix | | Problem | Fix | Test coverage |
|---------|-----| |---------|-----|---------------|
| Grid tables (`+:---+` syntax) | Single-column -> blockquote, multi-column -> split images | | Grid tables (`+:---+`) | Single-column → blockquote, multi-column → pipe table | `TestPostprocessPipeline` |
| Image double path (`media/media/`) | Flatten to `media/` | | Simple tables (` ---- ----`) | Multi-column images → pipe table with captions | `TestSimpleTable` |
| Pandoc attributes (`{width="..." height="..."}`) | Removed | | Image path nesting (`media/media/`) | Flatten to `media/`, absolute → relative | `test_stats_tracking` |
| Inline classes (`{.underline}`, `{.mark}`) | Removed | | Pandoc attributes (`{width="..."}`) | Removed | `test_pandoc_attributes_removed` |
| Indented dashed code blocks | Converted to fenced code blocks (```) | | CJK bold spacing (`**粗体**中文`) | Add space around `**` for CJK bold spans | `TestCjkBoldSpacing` (15 cases) |
| Escaped brackets (`\[...\]`) | Unescaped to `[...]` | | Indented dashed code blocks | → fenced ``` with language detection | `test_code_block_with_language` |
| Double-bracket links (`[[text]{...}](url)`) | Simplified to `[text](url)` | | Escaped brackets (`\[...\]`) | → `[...]` | `test_escaped_brackets_fixed` |
| Escaped quotes in code (`\"`) | Fixed to `"` | | Double-bracket links (`[[text]](url)`) | → `[text](url)` | `test_double_bracket_links_fixed` |
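
Several of the fixes in the table above are plain text substitutions. The sketch below shows how four of them could look as regular-expression passes over a single line. The patterns and the `clean_pandoc_line` name are assumptions made for illustration; the actual patterns in `convert.py` are not shown in this diff and may differ.

```python
import re

# Hypothetical patterns, written for illustration only.
_ATTR = re.compile(r'\{(?:width|height)="[^"}]*"(?:\s+(?:width|height)="[^"}]*")*\}')
_ESCAPED_BRACKET = re.compile(r"\\([\[\]])")
_DOUBLE_BRACKET_LINK = re.compile(r"\[\[([^\]]+)\]\]\(([^)]+)\)")


def clean_pandoc_line(line: str) -> str:
    """Apply four of the cleanups from the table to one line of pandoc output."""
    line = line.replace("media/media/", "media/")         # flatten one level of nesting
    line = _ATTR.sub("", line)                            # drop {width="..." height="..."}
    line = _ESCAPED_BRACKET.sub(r"\1", line)              # \[ and \] become [ and ]
    line = _DOUBLE_BRACKET_LINK.sub(r"[\1](\2)", line)    # [[text]](url) becomes [text](url)
    return line


print(clean_pandoc_line('![](media/media/image1.png){width="3.0in" height="2.0in"}'))
```

Keeping each fix as its own pattern makes it straightforward to pair every row of the table with one focused test.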
### CJK Bold Spacing — why and how
DOCX uses run-level styling (no spaces between bold/normal runs in CJK text). Markdown renderers need whitespace around `**` to recognize bold boundaries.
**Rule**: if a `**content**` span contains any CJK character, ensure both sides have a space — unless already spaced or at line boundary. This handles CJK punctuation, emoji adjacency, and mixed content.
```
Before: 打开**飞书**,就可以 → some renderers fail to bold
After: 打开 **飞书** ,就可以 → universally renders correctly
```
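
The rule above can be sketched in a few lines of Python. The two patterns here are simplified stand-ins: the real `_RE_BOLD_PAIR` and `_RE_CJK_PUNCT` definitions are not part of this diff, so the character ranges below are an assumption, and `space_cjk_bold` is a hypothetical name.

```python
import re

# Hypothetical stand-ins for _RE_BOLD_PAIR / _RE_CJK_PUNCT (not shown in the diff).
BOLD_PAIR = re.compile(r"\*\*(.+?)\*\*")
CJK = re.compile(r"[\u3000-\u303f\u4e00-\u9fff\uff00-\uffef]")


def space_cjk_bold(text: str) -> str:
    """Surround CJK-containing **bold** spans with spaces unless already spaced."""
    out = []
    last_end = 0
    for m in BOLD_PAIR.finditer(text):
        start, end = m.start(), m.end()
        out.append(text[last_end:start])
        if CJK.search(m.group(1)):
            # Space before the span unless at text start or already preceded by whitespace
            if start > 0 and not text[start - 1].isspace():
                out.append(" ")
            out.append(m.group(0))
            # Space after the span unless at text end or already followed by whitespace
            if end < len(text) and not text[end].isspace():
                out.append(" ")
        else:
            out.append(m.group(0))  # spans without CJK content are left untouched
        last_end = end
    out.append(text[last_end:])
    return "".join(out)


print(space_cjk_bold("打开**飞书**,就可以"))
```

Checking the span content rather than the neighbouring character is what lets the same rule cover emoji and punctuation neighbours without listing them.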
## Heavy Mode Workflow ## Heavy Mode Workflow
@@ -166,6 +159,7 @@ brew install pandoc
| Script | Purpose | | Script | Purpose |
|--------|---------| |--------|---------|
| `convert.py` | Main orchestrator with Quick/Heavy mode + DOCX post-processing | | `convert.py` | Main orchestrator with Quick/Heavy mode + DOCX post-processing |
| `test_convert.py` | 31 tests covering all post-processing functions |
| `merge_outputs.py` | Merge multiple markdown outputs | | `merge_outputs.py` | Merge multiple markdown outputs |
| `validate_output.py` | Quality validation with HTML report | | `validate_output.py` | Quality validation with HTML report |
| `extract_pdf_images.py` | PDF image extraction with metadata | | `extract_pdf_images.py` | PDF image extraction with metadata |
@@ -173,6 +167,7 @@ brew install pandoc
## References ## References
- `references/benchmark-2026-03-22.md` - 5-tool benchmark (Docling/MarkItDown/Pandoc/Mammoth/ours)
- `references/heavy-mode-guide.md` - Detailed Heavy Mode documentation - `references/heavy-mode-guide.md` - Detailed Heavy Mode documentation
- `references/tool-comparison.md` - Tool capabilities comparison - `references/tool-comparison.md` - Tool capabilities comparison
- `references/conversion-examples.md` - Batch operation examples - `references/conversion-examples.md` - Batch operation examples


@@ -1,6 +1,6 @@
# Heavy Mode Guide # Heavy Mode Guide
Detailed documentation for markdown-tools Heavy Mode conversion. Detailed documentation for doc-to-markdown Heavy Mode conversion.
## Overview ## Overview


@@ -26,6 +26,7 @@ Dependencies:
""" """
import argparse import argparse
import json
import re import re
import subprocess import subprocess
import sys import sys
@@ -478,10 +479,19 @@ def _fix_code_blocks(text: str, stats: PostProcessStats) -> str:
# Decide: code block vs blockquote # Decide: code block vs blockquote
if has_lang_hint or _is_code_content(cleaned): if has_lang_hint or _is_code_content(cleaned):
# Code block # Code block — try to pretty-print JSON
code_lines = cleaned
if lang_hint == "json":
try:
raw = "\n".join(cleaned)
parsed = json.loads(raw)
code_lines = json.dumps(parsed, indent=2, ensure_ascii=False).split("\n")
except (json.JSONDecodeError, ValueError):
pass # Keep original if not valid JSON
result.append("") result.append("")
result.append(f"```{lang_hint}") result.append(f"```{lang_hint}")
result.extend(cleaned) result.extend(code_lines)
result.append("```") result.append("```")
result.append("") result.append("")
else: else:
@@ -529,29 +539,40 @@ def _fix_double_bracket_links(text: str, stats: PostProcessStats) -> str:
def _fix_cjk_bold_spacing(text: str) -> str: def _fix_cjk_bold_spacing(text: str) -> str:
"""Add space between **bold** markers and adjacent CJK characters. """Add space around **bold** spans that contain CJK characters.
DOCX uses run-level styling for bold — no spaces between runs in CJK text. DOCX uses run-level styling for bold — no spaces between runs in CJK text.
Markdown renderers need whitespace around ** to recognize bold boundaries. Markdown renderers need whitespace around ** to recognize bold boundaries.
We find each **content** span, check the character before/after, and insert
a space only when the adjacent character is CJK (avoiding double spaces). Rule: if a **content** span contains any CJK character, ensure both sides
have a space (unless already spaced or at line boundary). This handles:
- CJK directly touching **: 打开**飞书** → 打开 **飞书**
- Emoji touching **: **密码】**➡️ → **密码】** ➡️
- Already spaced: 已有 **粗体** → unchanged
- English bold: English **bold** text → unchanged
""" """
result = [] result = []
last_end = 0 last_end = 0
for m in _RE_BOLD_PAIR.finditer(text): for m in _RE_BOLD_PAIR.finditer(text):
start, end = m.start(), m.end() start, end = m.start(), m.end()
content = m.group(1)
result.append(text[last_end:start]) result.append(text[last_end:start])
# Space before opening ** if preceded by CJK # Only add spaces for bold spans containing CJK
if start > 0 and _RE_CJK_PUNCT.match(text[start - 1]): if _RE_CJK_PUNCT.search(content):
result.append(' ') # Space before ** if previous char is not whitespace
if start > 0 and text[start - 1] not in (' ', '\t', '\n'):
result.append(' ')
result.append(m.group(0)) result.append(m.group(0))
# Space after closing ** if followed by CJK # Space after ** if next char is not whitespace
if end < len(text) and _RE_CJK_PUNCT.match(text[end]): if end < len(text) and text[end] not in (' ', '\t', '\n'):
result.append(' ') result.append(' ')
else:
result.append(m.group(0))
last_end = end last_end = end


@@ -0,0 +1,242 @@
"""Tests for doc-to-markdown convert.py post-processing functions.

Run: uv run pytest scripts/test_convert.py -v
"""

import pytest
import re
import sys
from pathlib import Path

# Import the module under test
sys.path.insert(0, str(Path(__file__).parent))

from convert import (
    _fix_cjk_bold_spacing,
    _build_pipe_table,
    _collect_images,
    PostProcessStats,
    postprocess_docx_markdown,
)


# ── CJK Bold Spacing ─────────────────────────────────────────────────────────


class TestCjkBoldSpacing:
    """Test _fix_cjk_bold_spacing: spaces between **bold** and CJK chars."""

    def test_bold_followed_by_cjk_punctuation(self):
        """**text** directly touching CJK colon → add space after **."""
        inp = "**打开阶跃开放平台链接**:https://platform.stepfun.com/"
        out = _fix_cjk_bold_spacing(inp)
        assert "**打开阶跃开放平台链接** " in out

    def test_cjk_before_bold(self):
        """CJK char directly before ** → add space before **."""
        assert _fix_cjk_bold_spacing("可用**手机号**进行") == "可用 **手机号** 进行"

    def test_bold_with_emoji_neighbor(self):
        """**text** touching emoji ➡️ → still add space (CJK content rule)."""
        inp = "点击**【接口密码】**➡️**【创建新的密钥**】"
        out = _fix_cjk_bold_spacing(inp)
        # Each CJK-containing bold span should have spaces on both sides
        assert "点击 **【接口密码】** ➡️" in out
        assert "➡️ **【创建新的密钥**" in out

    def test_full_emoji_line(self):
        """Complete line with emoji separators between bold spans."""
        inp = "点击**【接口密码】**➡️**【创建新的密钥**】➡️**【输入密钥名称】**(输入你想取的名称)生成API Key"
        out = _fix_cjk_bold_spacing(inp)
        assert "点击 **【接口密码】** ➡️" in out
        assert "**【输入密钥名称】** (输入" in out

    def test_bold_between_cjk(self):
        """CJK **text** CJK → spaces on both sides."""
        assert _fix_cjk_bold_spacing("打开**飞书**,就可以") == "打开 **飞书** ,就可以"

    def test_bold_with_chinese_quotes(self):
        """Bold containing Chinese quotes."""
        inp = '有个**"企鹅戴龙虾头套的机器人"**,开始'
        out = _fix_cjk_bold_spacing(inp)
        assert '**"企鹅戴龙虾头套的机器人"** ' in out

    def test_multiple_bold_spans(self):
        """Multiple bold spans in one line."""
        assert _fix_cjk_bold_spacing("这是**测试**和**验证**的效果") == "这是 **测试** 和 **验证** 的效果"

    def test_already_spaced(self):
        """Already has spaces → no double spaces."""
        inp = "已有空格 **粗体** 不需要再加"
        assert _fix_cjk_bold_spacing(inp) == inp

    def test_english_unchanged(self):
        """English bold text should not be modified."""
        inp = "English **bold** text should not change"
        assert _fix_cjk_bold_spacing(inp) == inp

    def test_line_start_bold(self):
        """Bold at line start followed by CJK."""
        assert _fix_cjk_bold_spacing("**重要**内容") == "**重要** 内容"

    def test_line_start_bold_standalone(self):
        """Bold at line start with no CJK neighbor → no change."""
        assert _fix_cjk_bold_spacing("**这是纯粗体不需要改**") == "**这是纯粗体不需要改**"

    def test_no_bold(self):
        """Text without bold markers → unchanged."""
        inp = "这是普通文本,没有粗体"
        assert _fix_cjk_bold_spacing(inp) == inp

    def test_empty_string(self):
        assert _fix_cjk_bold_spacing("") == ""

    def test_bold_at_line_end(self):
        """Bold at line end → no trailing space needed."""
        assert _fix_cjk_bold_spacing("内容是**粗体**") == "内容是 **粗体**"

    def test_mixed_cjk_and_english_bold(self):
        """English bold between CJK → no change (no CJK in content)."""
        inp = "请使用 **API Key** 进行认证"
        assert _fix_cjk_bold_spacing(inp) == inp


# ── Pipe Table Builder ────────────────────────────────────────────────────────


class TestBuildPipeTable:
    """Test _build_pipe_table: rows → markdown pipe table."""

    def test_basic_table(self):
        rows = [["a", "b"], ["c", "d"]]
        result = _build_pipe_table(rows)
        assert result == [
            "| | |",
            "| --- | --- |",
            "| a | b |",
            "| c | d |",
        ]

    def test_uneven_rows(self):
        """Rows with different column counts → padded."""
        rows = [["a", "b", "c"], ["d"]]
        result = _build_pipe_table(rows)
        assert "| d | | |" in result

    def test_single_cell(self):
        rows = [["only"]]
        result = _build_pipe_table(rows)
        assert len(result) == 3  # header + sep + 1 row

    def test_empty_rows(self):
        assert _build_pipe_table([]) == []

    def test_image_with_caption(self):
        """Images and captions should pair correctly in table."""
        rows = [
            ["![](img1.png)", "![](img2.png)"],
            ["Step 1", "Step 2"],
        ]
        result = _build_pipe_table(rows)
        assert "| ![](img1.png) | ![](img2.png) |" in result
        assert "| Step 1 | Step 2 |" in result


# ── Full Post-Processing Pipeline ─────────────────────────────────────────────


class TestPostprocessPipeline:
    """Integration tests for the full postprocess_docx_markdown pipeline."""

    def test_grid_table_single_column_to_blockquote(self):
        """Single-column grid table → blockquote."""
        inp = """+:---+
| 注意事项 |
+----+"""
        out, stats = postprocess_docx_markdown(inp)
        assert "> 注意事项" in out
        assert "+:---+" not in out

    def test_pandoc_attributes_removed(self):
        """Pandoc {width=...} and {.underline} removed."""
        inp = '![](img.png){width="5in" height="3in"} and [text]{.underline}'
        out, stats = postprocess_docx_markdown(inp)
        assert "{width=" not in out
        assert "{.underline}" not in out
        assert "![](img.png)" in out

    def test_escaped_brackets_fixed(self):
        r"""Pandoc \[ and \] → [ and ]."""
        inp = r"你 \[在飞书里\] 发消息"
        out, stats = postprocess_docx_markdown(inp)
        assert "你 [在飞书里] 发消息" in out

    def test_double_bracket_links_fixed(self):
        """[[text]](url) → [text](url)."""
        inp = "[[点击跳转]](https://example.com)"
        out, stats = postprocess_docx_markdown(inp)
        assert "[点击跳转](https://example.com)" in out

    def test_code_block_with_language(self):
        """Indented dashed block with JSON language hint → ```json."""
        inp = """ ------------------------------------------------------------------
JSON\\
{\\
"provider": "stepfun"\\
}
------------------------------------------------------------------"""
        out, stats = postprocess_docx_markdown(inp)
        assert "```json" in out
        assert '"provider": "stepfun"' in out
        assert "---" not in out

    def test_code_block_plain_text_to_blockquote(self):
        """Indented dashed block with plain text → blockquote."""
        inp = """ --------------------------
注意:这是一条重要提示
--------------------------"""
        out, stats = postprocess_docx_markdown(inp)
        assert "> 注意:这是一条重要提示" in out

    def test_cjk_bold_spacing_in_pipeline(self):
        """CJK bold spacing is applied in the full pipeline."""
        inp = "打开**飞书**,就可以看到"
        out, stats = postprocess_docx_markdown(inp)
        assert "打开 **飞书** ,就可以看到" in out

    def test_excessive_blank_lines_collapsed(self):
        """4+ blank lines → 2 blank lines."""
        inp = "line1\n\n\n\n\nline2"
        out, stats = postprocess_docx_markdown(inp)
        assert out.count("\n") < 5

    def test_stats_tracking(self):
        """Stats object correctly tracks fix counts."""
        inp = '![](media/media/img.png){width="5in"}'
        out, stats = postprocess_docx_markdown(inp)
        assert stats.attributes_removed > 0


# ── Simple Table (pandoc) ─────────────────────────────────────────────────────


class TestSimpleTable:
    """Test pandoc simple table (indented dashes with spaces) → pipe table."""

    def test_two_column_image_table(self):
        """Two images side by side in simple table → pipe table."""
        inp = """ ---- ----
![](img1.png) ![](img2.png)
---- ----"""
        out, stats = postprocess_docx_markdown(inp)
        assert "| ![](img1.png) | ![](img2.png) |" in out
        assert "----" not in out

    def test_four_column_image_table(self):
        """Four images in simple table → 4-column pipe table."""
        inp = """ ---------- ---------- ---------- ----------
![](a.png) ![](b.png) ![](c.png) ![](d.png)
---------- ---------- ---------- ----------"""
        out, stats = postprocess_docx_markdown(inp)
        assert "| ![](a.png) | ![](b.png) | ![](c.png) | ![](d.png) |" in out
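
The spacing rule exercised by the TestCjkBoldSpacing cases above can be illustrated with a small standalone sketch. This is not the implementation in convert.py, which is not shown here; the function name `space_cjk_bold_sketch`, both regexes, and the chosen CJK code point ranges are assumptions made for illustration only. It pads a `**bold**` span with a space on each side only when the span contains a CJK character and the neighboring character exists and is not already whitespace.

```python
import re

# Hypothetical helper names and patterns, assumed for this sketch only.
# A bold span: two asterisks, shortest possible content, two asterisks.
_BOLD_SPAN = re.compile(r"\*\*(.+?)\*\*")
# Assumed CJK ranges: CJK punctuation, kana, unified ideographs, full-width forms.
_CJK_CHAR = re.compile(r"[\u3000-\u30ff\u4e00-\u9fff\uff00-\uffef]")


def space_cjk_bold_sketch(text: str) -> str:
    """Add spaces around **bold** spans whose content contains CJK characters."""
    pieces = []
    pos = 0
    for match in _BOLD_SPAN.finditer(text):
        # Copy the text between the previous span and this one unchanged.
        pieces.append(text[pos:match.start()])
        span = match.group(0)
        if _CJK_CHAR.search(match.group(1)):
            before = text[match.start() - 1] if match.start() > 0 else ""
            after = text[match.end()] if match.end() < len(text) else ""
            # Pad only when a neighbor exists and is not already whitespace.
            if before and not before.isspace():
                span = " " + span
            if after and not after.isspace():
                span = span + " "
        pieces.append(span)
        pos = match.end()
    pieces.append(text[pos:])
    return "".join(pieces)


if __name__ == "__main__":
    print(space_cjk_bold_sketch("可用**手机号**进行"))  # 可用 **手机号** 进行
```

Checking the neighbor rather than always inserting a space is what keeps already-spaced input and spans at the start or end of a line unchanged. Two bold spans that touch each other directly would get a double space under this sketch, a case the tests above do not cover.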


@@ -11,7 +11,7 @@ Transform raw meeting transcripts into comprehensive, evidence-based meeting min
 ## Quick Start
 **Pre-processing (Optional but Recommended):**
-- **Document conversion**: Use `markdown-tools` skill to convert .docx/.pdf to Markdown first (preserves tables/images)
+- **Document conversion**: Use `doc-to-markdown` skill to convert .docx/.pdf to Markdown first (preserves tables/images)
 - **Transcript cleanup**: Use `transcript-fixer` skill to fix ASR/STT errors if transcript quality is poor
 - **Context file**: Prepare `context.md` with team directory for accurate speaker identification
@@ -457,7 +457,7 @@ If v3 has a flowchart for "Status Query Mechanism" but v1/v2 don't have it, that
 **Full pipeline for .docx transcripts:**
 ```
-Step 0: markdown-tools # Convert .docx → Markdown (preserves tables/images)
+Step 0: doc-to-markdown # Convert .docx → Markdown (preserves tables/images)
 Step 0.5: transcript-fixer # Fix ASR errors (optional, if quality is poor)