Replace all occurrences of old ambiguous flag names with the new explicit ones: --chunk-size (tokens) → --chunk-tokens --chunk-overlap → --chunk-overlap-tokens --chunk → --chunk-for-rag --streaming-chunk-size → --streaming-chunk-chars --streaming-overlap → --streaming-overlap-chars --chunk-size (pages) → --pdf-pages-per-chunk Updated: CLI_REFERENCE (EN+ZH), user-guide (EN+ZH), integrations (Haystack, Chroma, Weaviate, FAISS, Qdrant), features/PDF_CHUNKING, examples/haystack-pipeline, strategy docs, archive docs, and CHANGELOG. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
522 lines
13 KiB
Markdown
522 lines
13 KiB
Markdown
# PDF Page Detection and Chunking (Task B1.3)
|
|
|
|
**Status:** ✅ Completed
|
|
**Date:** October 21, 2025
|
|
**Task:** B1.3 - Add PDF page detection and chunking
|
|
|
|
---
|
|
|
|
## Overview
|
|
|
|
Task B1.3 enhances the PDF extractor with intelligent page chunking and chapter detection capabilities. This allows large PDF documentation to be split into manageable, logical sections for better processing and organization.
|
|
|
|
## New Features
|
|
|
|
### ✅ 1. Page Chunking
|
|
|
|
Break large PDFs into smaller, manageable chunks:
|
|
- Configurable chunk size (default: 10 pages per chunk)
|
|
- Smart chunking that respects chapter boundaries
|
|
- Chunk metadata includes page ranges and chapter titles
|
|
|
|
**Usage:**
|
|
```bash
|
|
# Default chunking (10 pages per chunk)
|
|
python3 cli/pdf_extractor_poc.py input.pdf
|
|
|
|
# Custom chunk size (20 pages per chunk)
|
|
python3 cli/pdf_extractor_poc.py input.pdf --pdf-pages-per-chunk 20
|
|
|
|
# Disable chunking (single chunk with all pages)
|
|
python3 cli/pdf_extractor_poc.py input.pdf --pdf-pages-per-chunk 0
|
|
```
|
|
|
|
### ✅ 2. Chapter/Section Detection
|
|
|
|
Automatically detect chapter and section boundaries:
|
|
- Detects H1 and H2 headings as chapter markers
|
|
- Recognizes common chapter patterns:
|
|
- "Chapter 1", "Chapter 2", etc.
|
|
- "Part 1", "Part 2", etc.
|
|
- "Section 1", "Section 2", etc.
|
|
- Numbered sections like "1. Introduction"
|
|
|
|
**Chapter Detection Logic:**
|
|
1. Check for H1/H2 headings at page start
|
|
2. Pattern match against common chapter formats
|
|
3. Extract chapter title for metadata
|
|
|
|
### ✅ 3. Code Block Merging
|
|
|
|
Intelligently merge code blocks split across pages:
|
|
- Detects when code continues from one page to the next
|
|
- Checks language and detection method consistency
|
|
- Looks for continuation indicators:
|
|
- Doesn't end with `}`, `;`
|
|
- Ends with `,`, `\`
|
|
- Incomplete syntax structures
|
|
|
|
**Example:**
|
|
```
|
|
Page 5: def calculate_total(items):
|
|
total = 0
|
|
for item in items:
|
|
|
|
Page 6: total += item.price
|
|
return total
|
|
```
|
|
|
|
The merger will combine these into a single code block.
|
|
|
|
---
|
|
|
|
## Output Format
|
|
|
|
### Enhanced JSON Structure
|
|
|
|
The output now includes chunking and chapter information:
|
|
|
|
```json
|
|
{
|
|
"source_file": "manual.pdf",
|
|
"metadata": { ... },
|
|
"total_pages": 150,
|
|
"total_chunks": 15,
|
|
"chapters": [
|
|
{
|
|
"title": "Getting Started",
|
|
"start_page": 1,
|
|
"end_page": 12
|
|
},
|
|
{
|
|
"title": "API Reference",
|
|
"start_page": 13,
|
|
"end_page": 45
|
|
}
|
|
],
|
|
"chunks": [
|
|
{
|
|
"chunk_number": 1,
|
|
"start_page": 1,
|
|
"end_page": 12,
|
|
"chapter_title": "Getting Started",
|
|
"pages": [ ... ]
|
|
},
|
|
{
|
|
"chunk_number": 2,
|
|
"start_page": 13,
|
|
"end_page": 22,
|
|
"chapter_title": "API Reference",
|
|
"pages": [ ... ]
|
|
}
|
|
],
|
|
"pages": [ ... ]
|
|
}
|
|
```
|
|
|
|
### Chunk Object
|
|
|
|
Each chunk contains:
|
|
- `chunk_number` - Sequential chunk identifier (1-indexed)
|
|
- `start_page` - First page in chunk (1-indexed)
|
|
- `end_page` - Last page in chunk (1-indexed)
|
|
- `chapter_title` - Detected chapter title (if any)
|
|
- `pages` - Array of page objects in this chunk
|
|
|
|
### Merged Code Block Indicator
|
|
|
|
Code blocks merged from multiple pages include a flag:
|
|
```json
|
|
{
|
|
"code": "def example():\n ...",
|
|
"language": "python",
|
|
"detection_method": "font",
|
|
"merged_from_next_page": true
|
|
}
|
|
```
|
|
|
|
---
|
|
|
|
## Implementation Details
|
|
|
|
### Chapter Detection Algorithm
|
|
|
|
```python
|
|
def detect_chapter_start(self, page_data):
|
|
"""
|
|
Detect if a page starts a new chapter/section.
|
|
|
|
Returns (is_chapter_start, chapter_title) tuple.
|
|
"""
|
|
# Check H1/H2 headings first
|
|
headings = page_data.get('headings', [])
|
|
if headings:
|
|
first_heading = headings[0]
|
|
if first_heading['level'] in ['h1', 'h2']:
|
|
return True, first_heading['text']
|
|
|
|
# Pattern match against common chapter formats
|
|
text = page_data.get('text', '')
|
|
first_line = text.split('\n')[0] if text else ''
|
|
|
|
chapter_patterns = [
|
|
r'^Chapter\s+\d+',
|
|
r'^Part\s+\d+',
|
|
r'^Section\s+\d+',
|
|
r'^\d+\.\s+[A-Z]', # "1. Introduction"
|
|
]
|
|
|
|
for pattern in chapter_patterns:
|
|
if re.match(pattern, first_line, re.IGNORECASE):
|
|
return True, first_line.strip()
|
|
|
|
return False, None
|
|
```
|
|
|
|
### Code Block Merging Algorithm
|
|
|
|
```python
|
|
def merge_continued_code_blocks(self, pages):
|
|
"""
|
|
Merge code blocks that are split across pages.
|
|
"""
|
|
for i in range(len(pages) - 1):
|
|
current_page = pages[i]
|
|
next_page = pages[i + 1]
|
|
|
|
# Get last code block of current page
|
|
last_code = current_page['code_samples'][-1]
|
|
|
|
# Get first code block of next page
|
|
first_next_code = next_page['code_samples'][0]
|
|
|
|
# Check if they're likely the same code block
|
|
if (last_code['language'] == first_next_code['language'] and
|
|
last_code['detection_method'] == first_next_code['detection_method']):
|
|
|
|
# Check for continuation indicators
|
|
last_code_text = last_code['code'].rstrip()
|
|
continuation_indicators = [
|
|
not last_code_text.endswith('}'),
|
|
not last_code_text.endswith(';'),
|
|
last_code_text.endswith(','),
|
|
last_code_text.endswith('\\'),
|
|
]
|
|
|
|
if any(continuation_indicators):
|
|
# Merge the blocks
|
|
merged_code = last_code['code'] + '\n' + first_next_code['code']
|
|
last_code['code'] = merged_code
|
|
last_code['merged_from_next_page'] = True
|
|
|
|
# Remove duplicate from next page
|
|
next_page['code_samples'].pop(0)
|
|
|
|
return pages
|
|
```
|
|
|
|
### Chunking Algorithm
|
|
|
|
```python
|
|
def create_chunks(self, pages):
|
|
"""
|
|
Create chunks of pages respecting chapter boundaries.
|
|
"""
|
|
chunks = []
|
|
current_chunk = []
|
|
current_chapter = None
|
|
|
|
for i, page in enumerate(pages):
|
|
# Detect chapter start
|
|
is_chapter, chapter_title = self.detect_chapter_start(page)
|
|
|
|
if is_chapter and current_chunk:
|
|
# Save current chunk before starting new one
|
|
chunks.append({
|
|
'chunk_number': len(chunks) + 1,
|
|
'start_page': chunk_start + 1,
|
|
'end_page': i,
|
|
'pages': current_chunk,
|
|
'chapter_title': current_chapter
|
|
})
|
|
current_chunk = []
|
|
current_chapter = chapter_title
|
|
|
|
current_chunk.append(page)
|
|
|
|
# Check if chunk size reached (but don't break chapters)
|
|
if not is_chapter and len(current_chunk) >= self.chunk_size:
|
|
# Create chunk
|
|
chunks.append(...)
|
|
current_chunk = []
|
|
|
|
return chunks
|
|
```
|
|
|
|
---
|
|
|
|
## Usage Examples
|
|
|
|
### Basic Chunking
|
|
|
|
```bash
|
|
# Extract with default 10-page chunks
|
|
python3 cli/pdf_extractor_poc.py manual.pdf -o manual.json
|
|
|
|
# Output includes chunks
|
|
cat manual.json | jq '.total_chunks'
|
|
# Output: 15
|
|
```
|
|
|
|
### Large PDF Processing
|
|
|
|
```bash
|
|
# Large PDF with bigger chunks (50 pages each)
|
|
python3 cli/pdf_extractor_poc.py large_manual.pdf --pdf-pages-per-chunk 50 -o output.json -v
|
|
|
|
# Verbose output shows:
|
|
# 📦 Creating chunks (chunk_size=50)...
|
|
# 🔗 Merging code blocks across pages...
|
|
# ✅ Extraction complete:
|
|
# Chunks created: 8
|
|
# Chapters detected: 12
|
|
```
|
|
|
|
### No Chunking (Single Output)
|
|
|
|
```bash
|
|
# Process all pages as single chunk
|
|
python3 cli/pdf_extractor_poc.py small_doc.pdf --pdf-pages-per-chunk 0 -o output.json
|
|
```
|
|
|
|
---
|
|
|
|
## Performance
|
|
|
|
### Chunking Performance
|
|
|
|
- **Chapter Detection:** ~0.1ms per page (negligible overhead)
|
|
- **Code Merging:** ~0.5ms per page (fast)
|
|
- **Chunk Creation:** ~1ms total (very fast)
|
|
|
|
**Total overhead:** < 1% of extraction time
|
|
|
|
### Memory Benefits
|
|
|
|
Chunking large PDFs helps reduce memory usage:
|
|
- **Without chunking:** Entire PDF loaded in memory
|
|
- **With chunking:** Process chunk-by-chunk (future enhancement)
|
|
|
|
**Current implementation** still loads entire PDF but provides structured output for chunked processing downstream.
|
|
|
|
---
|
|
|
|
## Limitations
|
|
|
|
### Current Limitations
|
|
|
|
1. **Chapter Pattern Matching**
|
|
- Limited to common English chapter patterns
|
|
- May miss non-standard chapter formats
|
|
- No support for non-English chapters (e.g., "Capitulo", "Chapitre")
|
|
|
|
2. **Code Merging Heuristics**
|
|
- Based on simple continuation indicators
|
|
- May miss some edge cases
|
|
- No AST-based validation
|
|
|
|
3. **Chunk Size**
|
|
- Fixed page count (not by content size)
|
|
- Doesn't account for page content volume
|
|
- No auto-sizing based on memory constraints
|
|
|
|
### Known Issues
|
|
|
|
1. **Multi-Chapter Pages**
|
|
- If a single page has multiple chapters, only first is detected
|
|
- Workaround: Use smaller chunk sizes
|
|
|
|
2. **False Code Merges**
|
|
- Rare cases where separate code blocks are merged
|
|
- Detection: Look for `merged_from_next_page` flag
|
|
|
|
3. **Table of Contents**
|
|
- TOC pages may be detected as chapters
|
|
- Workaround: Manual filtering in downstream processing
|
|
|
|
---
|
|
|
|
## Comparison: Before vs After
|
|
|
|
| Feature | Before (B1.2) | After (B1.3) |
|
|
|---------|---------------|--------------|
|
|
| Page chunking | None | ✅ Configurable |
|
|
| Chapter detection | None | ✅ Auto-detect |
|
|
| Code spanning pages | Split | ✅ Merged |
|
|
| Large PDF handling | Difficult | ✅ Chunked |
|
|
| Memory efficiency | Poor | Better (structure for future) |
|
|
| Output organization | Flat | ✅ Hierarchical |
|
|
|
|
---
|
|
|
|
## Testing
|
|
|
|
### Test Chapter Detection
|
|
|
|
Create a test PDF with chapters:
|
|
1. Page 1: "Chapter 1: Introduction"
|
|
2. Page 15: "Chapter 2: Getting Started"
|
|
3. Page 30: "Chapter 3: API Reference"
|
|
|
|
```bash
|
|
python3 cli/pdf_extractor_poc.py test.pdf -o test.json --pdf-pages-per-chunk 20 -v
|
|
|
|
# Verify chapters detected
|
|
cat test.json | jq '.chapters'
|
|
```
|
|
|
|
Expected output:
|
|
```json
|
|
[
|
|
{
|
|
"title": "Chapter 1: Introduction",
|
|
"start_page": 1,
|
|
"end_page": 14
|
|
},
|
|
{
|
|
"title": "Chapter 2: Getting Started",
|
|
"start_page": 15,
|
|
"end_page": 29
|
|
},
|
|
{
|
|
"title": "Chapter 3: API Reference",
|
|
"start_page": 30,
|
|
"end_page": 50
|
|
}
|
|
]
|
|
```
|
|
|
|
### Test Code Merging
|
|
|
|
Create a test PDF with code spanning pages:
|
|
- Page 1 ends with: `def example():\n total = 0`
|
|
- Page 2 starts with: ` for i in range(10):\n total += i`
|
|
|
|
```bash
|
|
python3 cli/pdf_extractor_poc.py test.pdf -o test.json -v
|
|
|
|
# Check for merged code blocks
|
|
cat test.json | jq '.pages[0].code_samples[] | select(.merged_from_next_page == true)'
|
|
```
|
|
|
|
---
|
|
|
|
## Next Steps (Future Tasks)
|
|
|
|
### Task B1.4: Improve Code Block Detection
|
|
- Add syntax validation
|
|
- Use AST parsing for better language detection
|
|
- Improve continuation detection accuracy
|
|
|
|
### Task B1.5: Add Image Extraction
|
|
- Extract images from chunks
|
|
- OCR for code in images
|
|
- Diagram detection and extraction
|
|
|
|
### Task B1.6: Full PDF Scraper CLI
|
|
- Build on chunking foundation
|
|
- Category detection for chunks
|
|
- Multi-PDF support
|
|
|
|
---
|
|
|
|
## Integration with Skill Seeker
|
|
|
|
The chunking feature lays groundwork for:
|
|
1. **Memory-efficient processing** - Process PDFs chunk-by-chunk
|
|
2. **Better categorization** - Chapters become categories
|
|
3. **Improved SKILL.md** - Organize by detected chapters
|
|
4. **Large PDF support** - Handle 500+ page manuals
|
|
|
|
**Example workflow:**
|
|
```bash
|
|
# Extract large manual with chapters
|
|
python3 cli/pdf_extractor_poc.py large_manual.pdf --pdf-pages-per-chunk 25 -o manual.json
|
|
|
|
# Future: Build skill from chunks
|
|
python3 cli/build_skill_from_pdf.py manual.json
|
|
|
|
# Result: SKILL.md organized by detected chapters
|
|
```
|
|
|
|
---
|
|
|
|
## API Usage
|
|
|
|
### Using PDFExtractor with Chunking
|
|
|
|
```python
|
|
from cli.pdf_extractor_poc import PDFExtractor
|
|
|
|
# Create extractor with 15-page chunks
|
|
extractor = PDFExtractor('manual.pdf', verbose=True, chunk_size=15)
|
|
|
|
# Extract
|
|
result = extractor.extract_all()
|
|
|
|
# Access chunks
|
|
for chunk in result['chunks']:
|
|
print(f"Chunk {chunk['chunk_number']}: {chunk['chapter_title']}")
|
|
print(f" Pages: {chunk['start_page']}-{chunk['end_page']}")
|
|
print(f" Total pages: {len(chunk['pages'])}")
|
|
|
|
# Access chapters
|
|
for chapter in result['chapters']:
|
|
print(f"Chapter: {chapter['title']}")
|
|
print(f" Pages: {chapter['start_page']}-{chapter['end_page']}")
|
|
```
|
|
|
|
### Processing Chunks Independently
|
|
|
|
```python
|
|
# Extract
|
|
result = extractor.extract_all()
|
|
|
|
# Process each chunk separately
|
|
for chunk in result['chunks']:
|
|
# Get pages in chunk
|
|
pages = chunk['pages']
|
|
|
|
# Process pages
|
|
for page in pages:
|
|
# Extract code samples
|
|
for code in page['code_samples']:
|
|
print(f"Found {code['language']} code")
|
|
|
|
# Check if merged from next page
|
|
if code.get('merged_from_next_page'):
|
|
print(" (merged from next page)")
|
|
```
|
|
|
|
---
|
|
|
|
## Conclusion
|
|
|
|
Task B1.3 successfully implements:
|
|
- ✅ Page chunking with configurable size
|
|
- ✅ Automatic chapter/section detection
|
|
- ✅ Code block merging across pages
|
|
- ✅ Enhanced output format with structure
|
|
- ✅ Foundation for large PDF handling
|
|
|
|
**Performance:** Minimal overhead (<1%)
|
|
**Compatibility:** Backward compatible (pages array still included)
|
|
**Quality:** Significantly improved organization
|
|
|
|
**Ready for B1.4:** Code block detection improvements
|
|
|
|
---
|
|
|
|
**Task Completed:** October 21, 2025
|
|
**Next Task:** B1.4 - Improve code block extraction with syntax detection
|