# PDF Parsing Libraries Research (Task B1.1)

**Date:** October 21, 2025
**Task:** B1.1 - Research PDF parsing libraries
**Purpose:** Evaluate Python libraries for extracting text and code from PDF documentation

---

## Executive Summary

After comprehensive research, **PyMuPDF (fitz)** is recommended as the primary library for Skill Seeker's PDF parsing needs, with **pdfplumber** as a secondary option for complex table extraction.

### Quick Recommendation:

- **Primary choice:** PyMuPDF (fitz) - fast, comprehensive, well maintained
- **Secondary/fallback:** pdfplumber - better for tables; slower but more precise
- **Avoid:** PyPDF2 (deprecated, merged into pypdf)

---

## Library Comparison Matrix

| Library | Speed | Text Quality | Code Detection | Tables | Maintenance | License |
|---------|-------|--------------|----------------|--------|-------------|---------|
| **PyMuPDF** | ⚡⚡⚡⚡⚡ Fastest (42 ms) | High | Excellent | Good | Active | AGPL/Commercial |
| **pdfplumber** | ⚡⚡ Slower (2.5 s) | Very high | Excellent | Excellent | Active | MIT |
| **pypdf** | ⚡⚡⚡ Fast | Medium | Good | Basic | Active | BSD |
| **pdfminer.six** | ⚡ Slow | Very high | Good | Medium | Active | MIT |
| **pypdfium2** | ⚡⚡⚡⚡⚡ Very fast (3 ms) | Medium | Good | Basic | Active | Apache-2.0 |

---
## Detailed Analysis

### 1. PyMuPDF (fitz) ⭐ RECOMMENDED

**Performance:** 42 milliseconds per page (60x faster than pdfminer.six)

**Installation:**

```bash
pip install PyMuPDF
```

**Pros:**

- ✅ Extremely fast (C-based MuPDF backend)
- ✅ Comprehensive features (text, images, tables, metadata)
- ✅ Supports markdown output (via the pymupdf4llm companion package)
- ✅ Can extract images and diagrams
- ✅ Well documented and actively maintained
- ✅ Handles complex layouts well

**Cons:**

- ⚠️ AGPL license (requires a commercial license for proprietary projects)
- ⚠️ Requires the MuPDF binary (handled by pip)
- ⚠️ Slightly larger dependency footprint

**Code Example:**

```python
import fitz  # PyMuPDF

# Extract text from the entire PDF
def extract_pdf_text(pdf_path):
    doc = fitz.open(pdf_path)
    text = ''
    for page in doc:
        text += page.get_text()
    doc.close()
    return text

# Extract text from a single page
def extract_page_text(pdf_path, page_num):
    doc = fitz.open(pdf_path)
    page = doc.load_page(page_num)
    text = page.get_text()
    doc.close()
    return text

# Extract with markdown formatting; get_text() has no "markdown" mode,
# so this uses the companion pymupdf4llm package (pip install pymupdf4llm)
def extract_as_markdown(pdf_path):
    import pymupdf4llm
    return pymupdf4llm.to_markdown(pdf_path)
```


**Use Cases for Skill Seeker:**

- Fast extraction of code examples from PDF docs
- Preserving formatting for code blocks
- Extracting diagrams and screenshots
- High-volume documentation scraping

---

### 2. pdfplumber ⭐ RECOMMENDED (for tables)

**Performance:** ~2.5 seconds per page (slower but more precise)

**Installation:**

```bash
pip install pdfplumber
```

**Pros:**

- ✅ MIT license (fully open source)
- ✅ Exceptional table extraction
- ✅ Visual debugging tools
- ✅ Precise layout preservation
- ✅ Built on pdfminer.six (proven text extraction)
- ✅ No binary dependencies

**Cons:**

- ⚠️ Slower than PyMuPDF
- ⚠️ Higher memory usage for large PDFs
- ⚠️ Requires more configuration for optimal results

**Code Example:**

```python
import pdfplumber

# Extract text from a PDF
def extract_with_pdfplumber(pdf_path):
    with pdfplumber.open(pdf_path) as pdf:
        text = ''
        for page in pdf.pages:
            # extract_text() can return None for empty pages
            text += page.extract_text() or ''
        return text

# Extract tables
def extract_tables(pdf_path):
    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_tables = page.extract_tables()
            tables.extend(page_tables)
    return tables

# Extract a specific region (e.g. a code block's bounding box)
def extract_region(pdf_path, page_num, bbox):
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_num]
        cropped = page.crop(bbox)
        return cropped.extract_text()
```


**Use Cases for Skill Seeker:**

- Extracting API reference tables from PDFs
- Precise code block extraction with layout
- Documentation with complex table structures
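
`extract_tables()` returns each table as a list of row lists. A small helper (illustrative, not part of pdfplumber's API) can turn one into a markdown table for skill output:

```python
def table_to_markdown(table):
    """Render a pdfplumber-style table (a list of row lists) as a markdown table."""
    if not table:
        return ''

    def clean(cell):
        # pdfplumber uses None for empty cells; flatten embedded newlines
        return (cell or '').replace('\n', ' ')

    header, *rows = table
    lines = [
        '| ' + ' | '.join(clean(c) for c in header) + ' |',
        '|' + '---|' * len(header),
    ]
    for row in rows:
        lines.append('| ' + ' | '.join(clean(c) for c in row) + ' |')
    return '\n'.join(lines)
```

For example, `table_to_markdown([["Name", "Type"], ["timeout", "int"]])` yields a two-column markdown table ready to embed in a skill file.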

---

### 3. pypdf (formerly PyPDF2)

**Performance:** Fast (medium speed)

**Installation:**

```bash
pip install pypdf
```

**Pros:**

- ✅ BSD license
- ✅ Simple API
- ✅ Can modify PDFs (merge, split, encrypt)
- ✅ Actively maintained (PyPDF2 was merged back into pypdf)
- ✅ No external dependencies

**Cons:**

- ⚠️ Limited support for complex layouts
- ⚠️ Basic text extraction only
- ⚠️ Poor with scanned/image PDFs
- ⚠️ No table extraction

**Code Example:**

```python
from pypdf import PdfReader

# Extract text
def extract_with_pypdf(pdf_path):
    reader = PdfReader(pdf_path)
    text = ''
    for page in reader.pages:
        text += page.extract_text()
    return text
```


**Use Cases for Skill Seeker:**

- Simple text extraction
- Fallback when PyMuPDF licensing is an issue
- Basic PDF manipulation tasks

---

### 4. pdfminer.six

**Performance:** Slow (~2.5 seconds per page)

**Installation:**

```bash
pip install pdfminer.six
```

**Pros:**

- ✅ MIT license
- ✅ Excellent text quality (preserves formatting)
- ✅ Handles complex layouts
- ✅ Pure Python (no binaries)

**Cons:**

- ⚠️ Slowest option
- ⚠️ Complex API
- ⚠️ Sparse documentation
- ⚠️ Limited table support

**Use Cases for Skill Seeker:**

- Not recommended directly (pdfplumber is built on it and offers a better API)

---

### 5. pypdfium2

**Performance:** Very fast (3 ms - fastest tested)

**Installation:**

```bash
pip install pypdfium2
```

**Pros:**

- ✅ Extremely fast
- ✅ Apache-2.0 license
- ✅ Lightweight
- ✅ Clean output

**Cons:**

- ⚠️ Basic features only
- ⚠️ Limited documentation
- ⚠️ No table extraction
- ⚠️ Newer/less proven

**Use Cases for Skill Seeker:**

- High-speed basic extraction
- Potential future optimization

---

## Licensing Considerations

### Open Source Projects (Skill Seeker):

- **PyMuPDF:** ✅ AGPL license is fine for open-source projects
- **pdfplumber:** ✅ MIT license (most permissive)
- **pypdf:** ✅ BSD license (permissive)

### Important Note:

PyMuPDF requires AGPL compliance (source code must be shared) or a commercial license for proprietary use. Since Skill Seeker is open source on GitHub, AGPL is acceptable.

---

## Performance Benchmarks

Based on 2025 testing:

| Library | Time (single page) | Time (100 pages) |
|---------|-------------------|------------------|
| pypdfium2 | 0.003s | 0.3s |
| PyMuPDF | 0.042s | 4.2s |
| pypdf | 0.1s | 10s |
| pdfplumber | 2.5s | 250s |
| pdfminer.six | 2.5s | 250s |

**Winner:** pypdfium2 (raw speed) / PyMuPDF (best balance of speed and features)
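
These figures are easy to reproduce on a target corpus. A stdlib-only harness along these lines (the `benchmark` helper is illustrative, not part of any of the libraries) can time whichever extractor is plugged in:

```python
import statistics
import time

def benchmark(extract_fn, pdf_path, runs=5):
    """Return the median wall-clock seconds for extract_fn(pdf_path) over several runs."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        extract_fn(pdf_path)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)
```

For example, compare `benchmark(extract_pdf_text, 'manual.pdf')` against `benchmark(extract_with_pdfplumber, 'manual.pdf')` using the extractors shown earlier.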
---

## Recommendations for Skill Seeker

### Primary Approach: PyMuPDF (fitz)

**Why:**

1. **Speed** - 60x faster than pdfminer.six
2. **Features** - text, images, markdown output, metadata
3. **Quality** - high-quality text extraction
4. **Maintained** - active development, good docs
5. **License** - AGPL is fine for open source

**Implementation Strategy:**

```python
import fitz  # PyMuPDF
import pymupdf4llm  # markdown output lives in this companion package

def extract_pdf_documentation(pdf_path):
    """
    Extract documentation from a PDF with code block detection.
    """
    doc = fitz.open(pdf_path)
    pages = []

    for page_num, page in enumerate(doc):
        # Get plain text
        text = page.get_text("text")

        # Get markdown for this page (preserves code blocks)
        markdown = pymupdf4llm.to_markdown(doc, pages=[page_num])

        # Get images (for diagrams)
        images = page.get_images()

        pages.append({
            'page_number': page_num,
            'text': text,
            'markdown': markdown,
            'images': images,
        })

    doc.close()
    return pages
```


### Fallback Approach: pdfplumber

**When to use:**

- The PDF has complex tables that PyMuPDF misses
- You need visual debugging
- License concerns (MIT instead of AGPL)

**Implementation Strategy:**

```python
import pdfplumber

def extract_pdf_tables(pdf_path):
    """
    Extract tables from PDF documentation.
    """
    with pdfplumber.open(pdf_path) as pdf:
        tables = []
        for page in pdf.pages:
            page_tables = page.extract_tables()
            if page_tables:
                tables.extend(page_tables)
        return tables
```

---


## Code Block Detection Strategy

PDFs don't have semantic "code block" markers like HTML, so code must be inferred. Detection strategies:

### 1. Font-based Detection

```python
# PyMuPDF exposes font information per text span
def detect_code_by_font(page):
    blocks = page.get_text("dict")["blocks"]
    code_blocks = []

    for block in blocks:
        if 'lines' in block:
            for line in block['lines']:
                for span in line['spans']:
                    font = span['font']
                    # Monospace fonts usually indicate code
                    if 'Courier' in font or 'Mono' in font:
                        code_blocks.append(span['text'])

    return code_blocks
```


### 2. Indentation-based Detection

```python
def detect_code_by_indent(text):
    lines = text.split('\n')
    code_blocks = []
    current_block = []

    for line in lines:
        # Code often has consistent leading indentation
        if line.startswith('    ') or line.startswith('\t'):
            current_block.append(line)
        elif current_block:
            code_blocks.append('\n'.join(current_block))
            current_block = []

    # Flush a block that runs to the end of the text
    if current_block:
        code_blocks.append('\n'.join(current_block))

    return code_blocks
```


### 3. Pattern-based Detection

```python
import re

def detect_code_by_pattern(text):
    # Look for common code patterns
    patterns = [
        r'(def \w+\(.*?\):)',         # Python functions
        r'(function \w+\(.*?\) \{)',  # JavaScript functions
        r'(class \w+:)',              # Python classes
        r'(import \w+)',              # import statements
    ]

    code_snippets = []
    for pattern in patterns:
        matches = re.findall(pattern, text)
        code_snippets.extend(matches)

    return code_snippets
```
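
In practice the three signals can be combined into a rough confidence score. The sketch below is illustrative (the weights and the `score_code_confidence` helper are not the POC's actual values), marking a chunk as code when multiple signals agree:

```python
import re

def score_code_confidence(chunk, monospace=False):
    """Return a rough 0-1 confidence that a text chunk is code."""
    score = 0.0
    if monospace:  # font-based signal (caller flags monospace spans)
        score += 0.5
    lines = chunk.split('\n')
    indented = sum(1 for line in lines if line.startswith(('    ', '\t')))
    if lines and indented / len(lines) > 0.5:  # indentation signal
        score += 0.25
    patterns = [r'def \w+\(', r'function \w+\(', r'class \w+', r'import \w+']
    if any(re.search(p, chunk) for p in patterns):  # pattern signal
        score += 0.25
    return min(score, 1.0)
```

A chunk that hits several signals scores high enough to keep, while plain prose scores 0 and is discarded.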

---

## Next Steps (Task B1.2+)

### Immediate Next Task: B1.2 - Create Simple PDF Text Extractor

**Goal:** Proof of concept using PyMuPDF

**Implementation Plan:**

1. Create `cli/pdf_extractor_poc.py`
2. Extract text from a sample PDF
3. Detect code blocks using font/pattern matching
4. Output to JSON (similar to the web scraper)

**Dependencies:**

```bash
pip install PyMuPDF
```

**Expected Output:**

```json
{
  "pages": [
    {
      "page_number": 1,
      "text": "...",
      "code_blocks": ["def main():", "import sys"],
      "images": []
    }
  ]
}
```
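
A minimal sketch of serializing extracted pages into this shape (the `pages_to_json` helper is illustrative; the `pages` list is assumed to come from an extractor like `extract_pdf_documentation` above):

```python
import json

def pages_to_json(pages):
    """Serialize extracted pages to the POC's expected JSON shape."""
    payload = {
        "pages": [
            {
                "page_number": p["page_number"],
                "text": p["text"],
                "code_blocks": p.get("code_blocks", []),
                "images": p.get("images", []),
            }
            for p in pages
        ]
    }
    return json.dumps(payload, indent=2)
```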

### Future Tasks:

- **B1.3:** Add page chunking (split large PDFs)
- **B1.4:** Improve code block detection
- **B1.5:** Extract images/diagrams
- **B1.6:** Create full `pdf_scraper.py` CLI
- **B1.7:** Add MCP tool integration
- **B1.8:** Create PDF config format

---


## Additional Resources

### Documentation:

- PyMuPDF: https://pymupdf.readthedocs.io/
- pdfplumber: https://github.com/jsvine/pdfplumber
- pypdf: https://pypdf.readthedocs.io/

### Comparison Studies:

- 2025 Comparative Study: https://arxiv.org/html/2410.09871v1
- Performance Benchmarks: https://github.com/py-pdf/benchmarks

### Example Use Cases:

- Extracting API docs from PDF manuals
- Converting PDF guides to markdown
- Building skills from PDF-only documentation

---


## Conclusion

**For Skill Seeker's PDF documentation extraction:**

1. **Use PyMuPDF (fitz)** as the primary library
2. **Add pdfplumber** for complex table extraction
3. **Detect code blocks** using font + pattern matching
4. **Preserve formatting** with markdown output
5. **Extract images** for diagrams/screenshots

**Estimated Implementation Time:**

- B1.2 (POC): 2-3 hours
- B1.3-B1.5 (features): 5-8 hours
- B1.6 (CLI): 3-4 hours
- B1.7 (MCP): 2-3 hours
- B1.8 (config): 1-2 hours
- **Total: 13-20 hours** for complete PDF support

**License:** AGPL (PyMuPDF) is acceptable for Skill Seeker (open source)

---

**Research completed:** ✅ October 21, 2025
**Next task:** B1.2 - Create simple PDF text extractor (proof of concept)