firefrost-gaming/claude-code-skills-reference

Files

daymade d9e1967689 feat(doc-to-markdown): CJK bold spacing, JSON pretty-print, 31 tests, full rename cleanup

- Add CJK bold spacing fix: insert spaces around **bold** spans containing
  CJK characters for correct rendering (handles emoji adjacency, already-spaced)
- Add JSON pretty-print: auto-format JSON code blocks with 2-space indent
- Add 31 unit tests covering all post-processing functions
- Fix pandoc simple table detection (1-space column gaps)
- Fix image path double-nesting when --assets-dir ends with 'media'
- Rename all markdown-tools references across 15 files (README, QUICKSTART,
  marketplace.json, CLAUDE.md, meeting-minutes-taker, GitHub templates)
- Add 5-tool benchmark report (Docling/MarkItDown/Pandoc/Mammoth/ours)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-23 03:18:37 +08:00

3.8 KiB

Raw Blame History

Heavy Mode Guide

Detailed documentation for doc-to-markdown Heavy Mode conversion.

Overview

Heavy Mode runs multiple conversion tools in parallel and intelligently merges their outputs to produce the highest quality markdown possible.

When to Use Heavy Mode

Use Heavy Mode when:

Document has complex tables that need precise formatting
Images must be preserved with proper references
Structure hierarchy (headings, lists) must be accurate
Output quality is more important than conversion speed
Document will be used for LLM processing

Use Quick Mode when:

Speed is priority
Document is simple (mostly text)
Output is for draft/review purposes

Tool Capabilities

PyMuPDF4LLM (Best for PDFs)

Strengths:

Native table detection with multiple strategies
Image extraction with position metadata
LLM-optimized output format
Preserves reading order

Usage:

import pymupdf4llm

md_text = pymupdf4llm.to_markdown(
    "document.pdf",
    write_images=True,
    table_strategy="lines_strict",
    image_path="./assets",
    dpi=150
)

markitdown (Universal Converter)

Strengths:

Supports many formats (PDF, DOCX, PPTX, XLSX)
Good text extraction
Simple API

Limitations:

May miss complex tables
No native image extraction

pandoc (Best for Office Docs)

Strengths:

Excellent DOCX/PPTX structure preservation
Proper heading hierarchy
List formatting

Limitations:

Requires system installation
PDF support limited

Merge Strategy

Segment-Level Selection

Heavy Mode doesn't just pick one tool's output. It:

Parses each output into segments
Scores each segment independently
Selects the best version of each segment

Segment Types

Type	Detection Pattern	Scoring Criteria
Table	`\|.*\|` rows	Row count, column count, header separator
Heading	`^#{1-6}`	Proper level, reasonable length
Image	`!\[.\]\(.\)`	Alt text present, local path
List	`^[-*+\d.]`	Item count, nesting depth
Code	Triple backticks	Line count, language specified
Paragraph	Default	Word count, completeness

Scoring Example

Table from pymupdf4llm:
  - 10 rows × 5 columns = 5.0 points
  - Header separator present = 1.0 points
  - Total: 6.0 points

Table from markitdown:
  - 8 rows × 5 columns = 4.0 points
  - No header separator = 0.0 points
  - Total: 4.0 points

→ Select pymupdf4llm version

Advanced Usage

Force Specific Tool

# Use only pandoc
uv run scripts/convert.py document.docx -o output.md --tool pandoc

Custom Assets Directory

# Heavy mode with custom image output
uv run scripts/convert.py document.pdf -o output.md --heavy --assets-dir ./images

Validate After Conversion

# Convert then validate
uv run scripts/convert.py document.pdf -o output.md --heavy
uv run scripts/validate_output.py document.pdf output.md --report quality.html

Troubleshooting

Low Text Retention Score

Causes:

PDF has scanned images (not searchable text)
Encoding issues in source document
Complex layouts confusing the parser

Solutions:

Use OCR preprocessing for scanned PDFs
Try different tool with --tool flag
Manual cleanup may be needed

Missing Tables

Causes:

Tables without visible borders
Tables spanning multiple pages
Merged cells

Solutions:

Use Heavy Mode for better detection
Try pymupdf4llm with different table_strategy
Manual table reconstruction

Image References Broken

Causes:

Assets directory not created
Relative path issues
Image extraction failed

Solutions:

Ensure --assets-dir points to correct location
Check images_metadata.json for extraction status
Use extract_pdf_images.py separately

3.8 KiB Raw Blame History Unescape Escape

Heavy Mode Guide

Overview

When to Use Heavy Mode

Tool Capabilities

PyMuPDF4LLM (Best for PDFs)

markitdown (Universal Converter)

pandoc (Best for Office Docs)

Merge Strategy

Segment-Level Selection

Segment Types

Scoring Example

Advanced Usage

Force Specific Tool

Custom Assets Directory

Validate After Conversion

Troubleshooting

Low Text Retention Score

Missing Tables

Image References Broken

3.8 KiB

Raw Blame History