feat: add EPUB input support (#310)

Adds EPUB as a first-class input source for skill generation.

- EpubToSkillConverter (epub_scraper.py, ~1200 lines) following PDF scraper pattern
- Dublin Core metadata, spine items, code blocks, tables, images extraction
- DRM detection (Adobe ADEPT, Apple FairPlay, Readium LCP) with fail-fast
- EPUB 3 NCX TOC bug workaround (ignore_ncx=True)
- ebooklib as optional dep: pip install skill-seekers[epub]
- Wired into create command with .epub auto-detection
- 104 tests, all passing

Review fixes: removed 3 empty test stubs, fixed SVG double-counting in
_extract_images(), added logger.debug to bare except pass.

Based on PR #310 by @christianbaumann.
Co-authored-by: Christian Baumann <mail@chriss-baumann.de>
This commit is contained in:
yusyus
2026-03-15 02:34:41 +03:00
committed by GitHub
parent 83b9a695ba
commit 2e30970dfb
16 changed files with 4502 additions and 9 deletions

View File

@@ -4,7 +4,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
## 🎯 Project Overview
**Skill Seekers** is the **universal documentation preprocessor** for AI systems. It transforms documentation websites, GitHub repositories, and PDFs into production-ready formats for **16+ platforms**: RAG pipelines (LangChain, LlamaIndex, Haystack), vector databases (Pinecone, Chroma, Weaviate, FAISS, Qdrant), AI coding assistants (Cursor, Windsurf, Cline, Continue.dev), and LLM platforms (Claude, Gemini, OpenAI).
**Skill Seekers** is the **universal documentation preprocessor** for AI systems. It transforms documentation websites, GitHub repositories, PDFs, and EPUBs into production-ready formats for **16+ platforms**: RAG pipelines (LangChain, LlamaIndex, Haystack), vector databases (Pinecone, Chroma, Weaviate, FAISS, Qdrant), AI coding assistants (Cursor, Windsurf, Cline, Continue.dev), and LLM platforms (Claude, Gemini, OpenAI).
**Current Version:** v3.1.3
**Python Version:** 3.10+ required
@@ -222,6 +222,7 @@ src/skill_seekers/
│ ├── dependency_analyzer.py # Dependency graph analysis
│ ├── signal_flow_analyzer.py # C3.10 Signal flow analysis (Godot)
│ ├── pdf_scraper.py # PDF extraction
│ ├── epub_scraper.py # EPUB extraction
│ └── adaptors/ # ⭐ Platform adaptor pattern
│ ├── __init__.py # Factory: get_adaptor()
│ ├── base_adaptor.py # Abstract base
@@ -397,7 +398,7 @@ The unified CLI modifies `sys.argv` and calls existing `main()` functions to mai
# Transforms to: doc_scraper.main() with modified sys.argv
```
**Subcommands:** create, scrape, github, pdf, unified, codebase, enhance, enhance-status, package, upload, estimate, install, install-agent, patterns, how-to-guides
**Subcommands:** create, scrape, github, pdf, epub, unified, codebase, enhance, enhance-status, package, upload, estimate, install, install-agent, patterns, how-to-guides
### NEW: Unified `create` Command
@@ -409,6 +410,7 @@ skill-seekers create https://docs.react.dev/ # → Web scraping
skill-seekers create facebook/react # → GitHub analysis
skill-seekers create ./my-project # → Local codebase
skill-seekers create tutorial.pdf # → PDF extraction
skill-seekers create book.epub # → EPUB extraction
skill-seekers create configs/react.json # → Multi-source
# Progressive help system
@@ -417,6 +419,7 @@ skill-seekers create --help-web # Shows web-specific options
skill-seekers create --help-github # Shows GitHub-specific options
skill-seekers create --help-local # Shows local analysis options
skill-seekers create --help-pdf # Shows PDF extraction options
skill-seekers create --help-epub # Shows EPUB extraction options
skill-seekers create --help-advanced # Shows advanced/rare options
skill-seekers create --help-all # Shows all 120+ flags
@@ -685,6 +688,7 @@ pytest tests/ -v -m ""
- `test_unified.py` - Multi-source scraping
- `test_github_scraper.py` - GitHub analysis
- `test_pdf_scraper.py` - PDF extraction
- `test_epub_scraper.py` - EPUB extraction
- `test_install_multiplatform.py` - Multi-platform packaging
- `test_integration.py` - End-to-end workflows
- `test_install_skill.py` - One-command install
@@ -741,6 +745,7 @@ skill-seekers-resume = "skill_seekers.cli.resume_command:main" #
skill-seekers-scrape = "skill_seekers.cli.doc_scraper:main"
skill-seekers-github = "skill_seekers.cli.github_scraper:main"
skill-seekers-pdf = "skill_seekers.cli.pdf_scraper:main"
skill-seekers-epub = "skill_seekers.cli.epub_scraper:main"
skill-seekers-unified = "skill_seekers.cli.unified_scraper:main"
skill-seekers-codebase = "skill_seekers.cli.codebase_scraper:main" # C2.x Local codebase analysis
skill-seekers-enhance = "skill_seekers.cli.enhance_skill_local:main"
@@ -1754,6 +1759,7 @@ This section helps you quickly locate the right files when implementing common c
| GitHub scraping | `src/skill_seekers/cli/github_scraper.py` | ~56KB | Repo analysis + metadata |
| GitHub API | `src/skill_seekers/cli/github_fetcher.py` | ~17KB | Rate limit handling |
| PDF extraction | `src/skill_seekers/cli/pdf_scraper.py` | Medium | PyMuPDF + OCR |
| EPUB extraction | `src/skill_seekers/cli/epub_scraper.py` | Medium | ebooklib + BeautifulSoup |
| Code analysis | `src/skill_seekers/cli/code_analyzer.py` | ~65KB | Multi-language AST parsing |
| Pattern detection | `src/skill_seekers/cli/pattern_recognizer.py` | Medium | C3.1 - 10 GoF patterns |
| Test extraction | `src/skill_seekers/cli/test_example_extractor.py` | Medium | C3.2 - 5 categories |
@@ -1777,7 +1783,7 @@ This section helps you quickly locate the right files when implementing common c
2. **Arguments:** `src/skill_seekers/cli/arguments/create.py`
- Three tiers of arguments:
- `UNIVERSAL_ARGUMENTS` (13 flags) - Work for all sources
- Source-specific dicts (`WEB_ARGUMENTS`, `GITHUB_ARGUMENTS`, etc.)
- Source-specific dicts (`WEB_ARGUMENTS`, `GITHUB_ARGUMENTS`, `EPUB_ARGUMENTS`, etc.)
- `ADVANCED_ARGUMENTS` - Rare/advanced options
- `add_create_arguments(parser, mode)` - Multi-mode argument addition