# Heavy Mode Guide

Detailed documentation for doc-to-markdown Heavy Mode conversion.

## Overview

Heavy Mode runs multiple conversion tools in parallel, then merges their outputs segment by segment, keeping the highest-scoring version of each segment.
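
The parallel step can be pictured as follows. This is a minimal sketch, not the actual `convert.py` implementation: `run_converters` and the converter callables are hypothetical names, and skipping a tool that raises is an assumption about how failures might be handled.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_converters(path: str, converters: dict[str, Callable[[str], str]]) -> dict[str, str]:
    """Run every converter on the same file concurrently and collect the markdown outputs."""
    outputs: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=len(converters) or 1) as pool:
        # Submit all tools first so they run side by side.
        futures = {name: pool.submit(fn, path) for name, fn in converters.items()}
        for name, future in futures.items():
            try:
                outputs[name] = future.result()
            except Exception:
                # Assumption: a failing tool is dropped so the others can still be merged.
                continue
    return outputs
```

Each converter is any callable that takes a file path and returns markdown text, so the real tools can be wrapped behind the same interface.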

## When to Use Heavy Mode

Use Heavy Mode when:
- Document has complex tables that need precise formatting
- Images must be preserved with proper references
- Structure hierarchy (headings, lists) must be accurate
- Output quality is more important than conversion speed
- Document will be used for LLM processing

Use Quick Mode when:
- Speed is the priority
- Document is simple (mostly text)
- Output is for draft/review purposes

## Tool Capabilities

### PyMuPDF4LLM (Best for PDFs)

**Strengths:**
- Native table detection with multiple strategies
- Image extraction with position metadata
- LLM-optimized output format
- Preserves reading order

**Usage:**
```python
import pymupdf4llm

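md_text = pymupdf4llm.to_markdown(
    "document.pdf",
    write_images=True,              # save images to disk and reference them in the markdown
    table_strategy="lines_strict",  # detect tables from drawn ruling lines
    image_path="./assets",          # directory that receives the extracted images
    dpi=150                         # resolution of the saved images
)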
```

### markitdown (Universal Converter)

**Strengths:**
- Supports many formats (PDF, DOCX, PPTX, XLSX)
- Good text extraction
- Simple API

**Limitations:**
- May miss complex tables
- No native image extraction
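
**Usage:**

A minimal sketch of calling markitdown from Python, assuming the package is installed. The wrapper name `convert_with_markitdown` is hypothetical; `MarkItDown().convert(...)` and `text_content` come from the library.

```python
def convert_with_markitdown(path: str) -> str:
    """Convert a document to markdown text with markitdown."""
    # Imported lazily so this module still loads where markitdown is not installed.
    from markitdown import MarkItDown

    converter = MarkItDown()
    result = converter.convert(path)
    # The converted markdown is exposed as plain text on the result object.
    return result.text_content
```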

### pandoc (Best for Office Docs)

**Strengths:**
- Excellent DOCX/PPTX structure preservation
- Proper heading hierarchy
- List formatting

**Limitations:**
- Requires system installation
- No PDF input support (pandoc can write PDF but cannot read it)
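
**Usage:**

A minimal sketch of driving pandoc from Python, assuming `pandoc` is on the `PATH`. The helper names and the choice of `gfm` as the output flavor are assumptions for illustration, not necessarily what Heavy Mode passes.

```python
import subprocess


def pandoc_command(source: str, output: str, media_dir: str = "./assets") -> list[str]:
    """Build a pandoc invocation that writes GitHub-flavored markdown and extracts media."""
    return [
        "pandoc", source,
        "-t", "gfm",                        # output format (assumed choice)
        f"--extract-media={media_dir}",     # write embedded images under this directory
        "-o", output,
    ]


def convert_with_pandoc(source: str, output: str, media_dir: str = "./assets") -> None:
    """Run pandoc and raise CalledProcessError if it exits with an error."""
    subprocess.run(pandoc_command(source, output, media_dir), check=True)
```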

## Merge Strategy

### Segment-Level Selection

Heavy Mode doesn't just pick one tool's output. It:

1. Parses each output into segments
2. Scores each segment independently
3. Selects the best version of each segment
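
The three steps above can be sketched as below. This is an illustration under simplifying assumptions, not the actual merge code: segments are split on blank lines, every tool is assumed to produce the same segments in the same order, and `split_segments`, `select_best`, and the `score` callable are hypothetical names.

```python
from typing import Callable


def split_segments(markdown: str) -> list[str]:
    """Step 1: split markdown into blank-line-separated blocks (a simplification)."""
    blocks, current = [], []
    for line in markdown.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            blocks.append("\n".join(current))
            current = []
    if current:
        blocks.append("\n".join(current))
    return blocks


def select_best(outputs: dict[str, str], score: Callable[[str], float]) -> list[tuple[str, str]]:
    """Steps 2 and 3: score each tool's version of a segment and keep the best one."""
    per_tool = {tool: split_segments(text) for tool, text in outputs.items()}
    # Assumption: segments line up by position; only the common prefix is compared.
    count = min(len(segments) for segments in per_tool.values())
    chosen = []
    for i in range(count):
        best_tool = max(per_tool, key=lambda tool: score(per_tool[tool][i]))
        chosen.append((best_tool, per_tool[best_tool][i]))
    return chosen
```

A real merger also has to align segments that tools split differently, which this sketch leaves out.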

### Segment Types

| Type | Detection Pattern | Scoring Criteria |
|------|-------------------|------------------|
| Table | `\|.*\|` rows | Row count, column count, header separator |
| Heading | `^#{1,6} ` | Proper level, reasonable length |
| Image | `!\[.*\]\(.*\)` | Alt text present, local path |
| List | `^[-*+] ` or `^\d+\. ` | Item count, nesting depth |
| Code | Triple backticks | Line count, language specified |
| Paragraph | Default | Word count, completeness |
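
The detection patterns in the table could be applied roughly as follows. This is a sketch only: `classify_segment` is a hypothetical name, and the order in which the patterns are tried is an assumption.

```python
import re

FENCE = "`" * 3  # triple backticks, built this way to keep the example fence-safe
TABLE_ROW = re.compile(r"^\|.*\|\s*$")
HEADING = re.compile(r"^#{1,6} ")
IMAGE = re.compile(r"!\[.*\]\(.*\)")
LIST_ITEM = re.compile(r"^\s*(?:[-*+]|\d+\.) ")


def classify_segment(segment: str) -> str:
    """Return the segment type, falling back to 'paragraph' when nothing matches."""
    lines = segment.strip().splitlines()
    if not lines:
        return "paragraph"
    first = lines[0]
    if first.lstrip().startswith(FENCE):
        return "code"
    if all(TABLE_ROW.match(line.strip()) for line in lines):
        return "table"
    if HEADING.match(first):
        return "heading"
    if IMAGE.fullmatch(segment.strip()):
        return "image"
    if LIST_ITEM.match(first):
        return "list"
    return "paragraph"
```

Code fences are checked first so that table-like or list-like lines inside a code block are not misclassified.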

### Scoring Example

```
Table from pymupdf4llm:
  - 10 rows × 5 columns: 5.0 points
  - Header separator present: 1.0 point
  - Total: 6.0 points

Table from markitdown:
  - 8 rows × 5 columns: 4.0 points
  - No header separator: 0.0 points
  - Total: 4.0 points

→ Select pymupdf4llm version
```
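
A scorer that reproduces these numbers might look like the sketch below. The weights are inferred from the example (rows × columns ÷ 10, plus 1.0 for a header separator) and are an assumption, as is the name `score_table`; the real scoring code may weigh things differently.

```python
def is_separator(line: str) -> bool:
    """True for a markdown header separator row such as |---|:---:|."""
    return "-" in line and set(line) <= set("|:- ")


def score_table(table: str) -> float:
    """Score a markdown table by its size and by whether it has a header separator."""
    lines = [line.strip() for line in table.strip().splitlines() if line.strip()]
    has_separator = any(is_separator(line) for line in lines)
    rows = [line for line in lines if not is_separator(line)]
    # Column count is taken from the widest row.
    columns = max((len(row.strip("|").split("|")) for row in rows), default=0)
    size_points = len(rows) * columns / 10  # assumed weight, inferred from the example
    return size_points + (1.0 if has_separator else 0.0)
```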

## Advanced Usage

### Force Specific Tool

```bash
# Use only pandoc
uv run scripts/convert.py document.docx -o output.md --tool pandoc
```

### Custom Assets Directory

```bash
# Heavy mode with custom image output
uv run scripts/convert.py document.pdf -o output.md --heavy --assets-dir ./images
```

### Validate After Conversion

```bash
# Convert then validate
uv run scripts/convert.py document.pdf -o output.md --heavy
uv run scripts/validate_output.py document.pdf output.md --report quality.html
```

## Troubleshooting

### Low Text Retention Score

**Causes:**
- PDF has scanned images (not searchable text)
- Encoding issues in source document
- Complex layouts confusing the parser

**Solutions:**
- Use OCR preprocessing for scanned PDFs
- Try a different tool with the `--tool` flag
- Manual cleanup may be needed
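
To check whether a PDF is scanned before reaching for OCR, one option is to measure how much text each page yields. The sketch below assumes PyMuPDF is installed; the function names and the thresholds (20 characters per page, more than half of the pages) are arbitrary assumptions to tune.

```python
def looks_scanned(page_texts: list[str], min_chars_per_page: int = 20) -> bool:
    """True when most pages yield almost no extractable text."""
    if not page_texts:
        return True
    sparse = sum(1 for text in page_texts if len(text.strip()) < min_chars_per_page)
    return sparse / len(page_texts) > 0.5


def extract_page_texts(pdf_path: str) -> list[str]:
    """Return the plain text of every page using PyMuPDF."""
    # Imported lazily; older PyMuPDF releases expose the module as `fitz`.
    import pymupdf

    with pymupdf.open(pdf_path) as doc:
        return [page.get_text() for page in doc]
```

Calling `looks_scanned(extract_page_texts("document.pdf"))` then gives a quick yes/no signal for whether OCR preprocessing is worth trying.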

### Missing Tables

**Causes:**
- Tables without visible borders
- Tables spanning multiple pages
- Merged cells

**Solutions:**
- Use Heavy Mode for better detection
- Try pymupdf4llm with a different `table_strategy` (for example `lines` or `text` instead of `lines_strict`)
- Manual table reconstruction
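
One way to compare strategies is to convert the same PDF with each one and count the table rows that come out. This is a rough sketch: the helper names are hypothetical, and counting pipe-delimited lines is only a proxy for table quality.

```python
def count_table_rows(markdown: str) -> int:
    """Count lines that look like markdown table rows (start and end with a pipe)."""
    count = 0
    for line in markdown.splitlines():
        stripped = line.strip()
        if len(stripped) > 1 and stripped.startswith("|") and stripped.endswith("|"):
            count += 1
    return count


def compare_table_strategies(pdf_path: str,
                             strategies: tuple = ("lines_strict", "lines", "text")) -> dict:
    """Convert the PDF once per strategy and report how many table rows each produced."""
    # Imported lazily so count_table_rows stays usable without pymupdf4llm installed.
    import pymupdf4llm

    return {
        strategy: count_table_rows(pymupdf4llm.to_markdown(pdf_path, table_strategy=strategy))
        for strategy in strategies
    }
```

A strategy that yields more rows is not automatically better, so inspect the output before settling on one.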

### Image References Broken

**Causes:**
- Assets directory not created
- Relative path issues
- Image extraction failed

**Solutions:**
- Ensure `--assets-dir` points to the correct location
- Check `images_metadata.json` for extraction status
- Use `extract_pdf_images.py` separately
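
To locate broken references quickly, the markdown can be scanned for image links whose target files are missing. A minimal sketch using only the standard library; `find_broken_image_refs` is a hypothetical helper, and paths are resolved relative to the markdown file.

```python
import re
from pathlib import Path

# Captures the target of ![alt](target), stopping at whitespace or the closing parenthesis.
IMAGE_REF = re.compile(r"!\[[^\]]*\]\(([^)\s]+)")
# Matches targets with a scheme prefix such as https: or data:.
URL_SCHEME = re.compile(r"^[a-z][a-z0-9+.-]*:", re.IGNORECASE)


def find_broken_image_refs(markdown_path: str) -> list[str]:
    """Return local image targets referenced in the markdown that do not exist on disk."""
    md_file = Path(markdown_path)
    text = md_file.read_text(encoding="utf-8")
    broken = []
    for target in IMAGE_REF.findall(text):
        if URL_SCHEME.match(target):
            continue  # remote URLs and data URIs are not checked
        if not (md_file.parent / target).exists():
            broken.append(target)
    return broken
```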