Add PDF documentation support (Tasks B1.1-B1.8)
Complete PDF extraction and skill conversion functionality: - pdf_extractor_poc.py (1,004 lines): Extract text, code, images from PDFs - pdf_scraper.py (353 lines): Convert PDFs to Claude skills - MCP tool scrape_pdf: PDF scraping via Claude Code - 7 comprehensive documentation guides (4,705 lines) - Example PDF config format (configs/example_pdf.json) Features: - 3 code detection methods (font, indent, pattern) - 19+ programming languages detected with confidence scoring - Syntax validation and quality scoring (0-10 scale) - Image extraction with size filtering (--extract-images) - Chapter/section detection and page chunking - Quality-filtered code examples (--min-quality) - Three usage modes: config file, direct PDF, from extracted JSON Technical: - PyMuPDF (fitz) as primary library (60x faster than alternatives) - Language detection with confidence scoring - Code block merging across pages - Comprehensive metadata and statistics - Compatible with existing Skill Seeker workflow MCP Integration: - New scrape_pdf tool (10th MCP tool total) - Supports all three usage modes - 10-minute timeout for large PDFs - Real-time streaming output Documentation (4,705 lines): - B1_COMPLETE_SUMMARY.md: Overview of all 8 tasks - PDF_PARSING_RESEARCH.md: Library comparison and benchmarks - PDF_EXTRACTOR_POC.md: POC documentation - PDF_CHUNKING.md: Page chunking guide - PDF_SYNTAX_DETECTION.md: Syntax detection guide - PDF_IMAGE_EXTRACTION.md: Image extraction guide - PDF_SCRAPER.md: PDF scraper usage guide - PDF_MCP_TOOL.md: MCP integration guide Tasks completed: B1.1-B1.8 Addresses Issue #27 See docs/B1_COMPLETE_SUMMARY.md for complete details 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
17
configs/example_pdf.json
Normal file
17
configs/example_pdf.json
Normal file
@@ -0,0 +1,17 @@
|
||||
{
|
||||
"name": "example_manual",
|
||||
"description": "Example PDF documentation skill",
|
||||
"pdf_path": "docs/manual.pdf",
|
||||
"extract_options": {
|
||||
"chunk_size": 10,
|
||||
"min_quality": 5.0,
|
||||
"extract_images": true,
|
||||
"min_image_size": 100
|
||||
},
|
||||
"categories": {
|
||||
"getting_started": ["introduction", "getting started", "quick start", "setup"],
|
||||
"tutorial": ["tutorial", "guide", "walkthrough", "example"],
|
||||
"api": ["api", "reference", "function", "class", "method"],
|
||||
"advanced": ["advanced", "optimization", "performance", "best practices"]
|
||||
}
|
||||
}
|
||||
Reference in New Issue
Block a user