Complete PDF extraction and skill conversion functionality:
- pdf_extractor_poc.py (1,004 lines): Extract text, code, images from PDFs
- pdf_scraper.py (353 lines): Convert PDFs to Claude skills
- MCP tool scrape_pdf: PDF scraping via Claude Code
- 7 comprehensive documentation guides (4,705 lines)
- Example PDF config format (configs/example_pdf.json)

Features:
- 3 code detection methods (font, indent, pattern)
- 19+ programming languages detected with confidence scoring
- Syntax validation and quality scoring (0-10 scale)
- Image extraction with size filtering (--extract-images)
- Chapter/section detection and page chunking
- Quality-filtered code examples (--min-quality)
- Three usage modes: config file, direct PDF, from extracted JSON

Technical:
- PyMuPDF (fitz) as primary library (60x faster than alternatives)
- Language detection with confidence scoring
- Code block merging across pages
- Comprehensive metadata and statistics
- Compatible with existing Skill Seeker workflow

MCP Integration:
- New scrape_pdf tool (10th MCP tool total)
- Supports all three usage modes
- 10-minute timeout for large PDFs
- Real-time streaming output

Documentation (4,705 lines):
- B1_COMPLETE_SUMMARY.md: Overview of all 8 tasks
- PDF_PARSING_RESEARCH.md: Library comparison and benchmarks
- PDF_EXTRACTOR_POC.md: POC documentation
- PDF_CHUNKING.md: Page chunking guide
- PDF_SYNTAX_DETECTION.md: Syntax detection guide
- PDF_IMAGE_EXTRACTION.md: Image extraction guide
- PDF_SCRAPER.md: PDF scraper usage guide
- PDF_MCP_TOOL.md: MCP integration guide

Tasks completed: B1.1-B1.8
Addresses Issue #27

See docs/B1_COMPLETE_SUMMARY.md for complete details

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
40 lines
714 B
Plaintext
annotated-types==0.7.0
anyio==4.11.0
attrs==25.4.0
beautifulsoup4==4.14.2
certifi==2025.10.5
charset-normalizer==3.4.4
click==8.3.0
coverage==7.11.0
h11==0.16.0
httpcore==1.0.9
httpx==0.28.1
httpx-sse==0.4.3
idna==3.11
iniconfig==2.3.0
jsonschema==4.25.1
jsonschema-specifications==2025.9.1
mcp==1.18.0
packaging==25.0
pluggy==1.6.0
pydantic==2.12.3
pydantic-settings==2.11.0
pydantic_core==2.41.4
Pygments==2.19.2
PyMuPDF==1.24.14
pytest==8.4.2
pytest-cov==7.0.0
python-dotenv==1.1.1
python-multipart==0.0.20
referencing==0.37.0
requests==2.32.5
rpds-py==0.27.1
sniffio==1.3.1
soupsieve==2.8
sse-starlette==3.0.2
starlette==0.48.0
typing-inspection==0.4.2
typing_extensions==4.15.0
urllib3==2.5.0
uvicorn==0.38.0
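
Every entry above is an exact `name==version` pin. As a minimal sketch of how such a list could be checked in a script, the helper below splits each pin into a name and a version and rejects anything that is not an exact pin. The `parse_pins` function and the `PIN_RE` pattern are hypothetical names introduced here for illustration; they are not part of this repository, and the comment handling is deliberately simplified compared with pip's full requirements syntax.

```python
import re

# Hypothetical pattern for illustration: a package name, "==", then a version.
PIN_RE = re.compile(r"^([A-Za-z0-9._-]+)==([A-Za-z0-9.*+!_-]+)$")


def parse_pins(text):
    """Return {name: version} for every exact pin in a requirements-style string."""
    pins = {}
    for raw in text.splitlines():
        # Simplification: treat everything after "#" as a comment.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        match = PIN_RE.match(line)
        if match is None:
            raise ValueError(f"not an exact pin: {raw!r}")
        pins[match.group(1)] = match.group(2)
    return pins


# A few lines copied from the list above.
sample = """\
PyMuPDF==1.24.14
mcp==1.18.0
pytest==8.4.2
"""

pins = parse_pins(sample)
print(pins["PyMuPDF"])
```

For real tooling, the `packaging` library (itself pinned in the list above) provides a proper requirement parser; the regular expression here only covers the simple pinned form used in this file.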