Files
skill-seekers-reference/requirements.txt
yusyus 6936057820 Add PDF documentation support (Tasks B1.1-B1.8)
Complete PDF extraction and skill conversion functionality:
- pdf_extractor_poc.py (1,004 lines): Extract text, code, images from PDFs
- pdf_scraper.py (353 lines): Convert PDFs to Claude skills
- MCP tool scrape_pdf: PDF scraping via Claude Code
- 7 comprehensive documentation guides (4,705 lines)
- Example PDF config format (configs/example_pdf.json)

Features:
- 3 code detection methods (font, indent, pattern)
- 19+ programming languages detected with confidence scoring
- Syntax validation and quality scoring (0-10 scale)
- Image extraction with size filtering (--extract-images)
- Chapter/section detection and page chunking
- Quality-filtered code examples (--min-quality)
- Three usage modes: config file, direct PDF, from extracted JSON

Technical:
- PyMuPDF (fitz) as primary library (60x faster than alternatives)
- Language detection with confidence scoring
- Code block merging across pages
- Comprehensive metadata and statistics
- Compatible with existing Skill Seeker workflow

MCP Integration:
- New scrape_pdf tool (10th MCP tool total)
- Supports all three usage modes
- 10-minute timeout for large PDFs
- Real-time streaming output

Documentation (4,705 lines):
- B1_COMPLETE_SUMMARY.md: Overview of all 8 tasks
- PDF_PARSING_RESEARCH.md: Library comparison and benchmarks
- PDF_EXTRACTOR_POC.md: POC documentation
- PDF_CHUNKING.md: Page chunking guide
- PDF_SYNTAX_DETECTION.md: Syntax detection guide
- PDF_IMAGE_EXTRACTION.md: Image extraction guide
- PDF_SCRAPER.md: PDF scraper usage guide
- PDF_MCP_TOOL.md: MCP integration guide

Tasks completed: B1.1-B1.8
Addresses Issue #27
See docs/B1_COMPLETE_SUMMARY.md for complete details

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-23 00:23:16 +03:00

40 lines
714 B
Plaintext

annotated-types==0.7.0
anyio==4.11.0
attrs==25.4.0
beautifulsoup4==4.14.2
certifi==2025.10.5
charset-normalizer==3.4.4
click==8.3.0
coverage==7.11.0
h11==0.16.0
httpcore==1.0.9
httpx==0.28.1
httpx-sse==0.4.3
idna==3.11
iniconfig==2.3.0
jsonschema==4.25.1
jsonschema-specifications==2025.9.1
mcp==1.18.0
packaging==25.0
pluggy==1.6.0
pydantic==2.12.3
pydantic-settings==2.11.0
pydantic_core==2.41.4
Pygments==2.19.2
PyMuPDF==1.24.14
pytest==8.4.2
pytest-cov==7.0.0
python-dotenv==1.1.1
python-multipart==0.0.20
referencing==0.37.0
requests==2.32.5
rpds-py==0.27.1
sniffio==1.3.1
soupsieve==2.8
sse-starlette==3.0.2
starlette==0.48.0
typing-inspection==0.4.2
typing_extensions==4.15.0
urllib3==2.5.0
uvicorn==0.38.0