feat: add EPUB input support (#310)

Adds EPUB as a first-class input source for skill generation.

- EpubToSkillConverter (epub_scraper.py, ~1200 lines) following PDF scraper pattern
- Dublin Core metadata, spine items, code blocks, tables, images extraction
- DRM detection (Adobe ADEPT, Apple FairPlay, Readium LCP) with fail-fast
- EPUB 3 NCX TOC bug workaround (ignore_ncx=True)
- ebooklib as optional dep: pip install skill-seekers[epub]
- Wired into create command with .epub auto-detection
- 104 tests, all passing

Review fixes: removed 3 empty test stubs, fixed SVG double-counting in _extract_images(), added logger.debug to bare except pass.

Based on PR #310 by @christianbaumann.

Co-authored-by: Christian Baumann <mail@chriss-baumann.de>
2026-03-15 02:34:41 +03:00
parent 83b9a695ba
commit 2e30970dfb
16 changed files with 4502 additions and 9 deletions
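
For readers unfamiliar with the fail-fast DRM check named in the commit message, here is a minimal sketch of how such a check might look. It is not taken from epub_scraper.py: the names `DRM_MARKERS`, `DrmProtectedError`, `detect_drm`, and `ensure_drm_free` are hypothetical, and the marker paths are the files these schemes commonly place in the EPUB container, which the real converter may or may not inspect.

```python
import io
import zipfile

# Assumption: container files commonly associated with each DRM scheme.
# The actual epub_scraper.py may rely on different or additional signals.
# META-INF/encryption.xml is deliberately not listed, because it also
# appears in DRM-free books that only obfuscate embedded fonts.
DRM_MARKERS = {
    "META-INF/rights.xml": "Adobe ADEPT",
    "META-INF/sinf.xml": "Apple FairPlay",
    "META-INF/license.lcpl": "Readium LCP",
}


class DrmProtectedError(RuntimeError):
    """Hypothetical exception for an EPUB that appears DRM-protected."""


def detect_drm(epub_file):
    """Return the name of the detected DRM scheme, or None if none is found.

    epub_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(epub_file) as archive:
        names = set(archive.namelist())
    for marker, scheme in DRM_MARKERS.items():
        if marker in names:
            return scheme
    return None


def ensure_drm_free(epub_file):
    """Fail fast: raise before any extraction work if DRM is detected."""
    scheme = detect_drm(epub_file)
    if scheme is not None:
        raise DrmProtectedError(f"EPUB appears to be protected by {scheme}")


if __name__ == "__main__":
    # Build a tiny in-memory archive that carries a Readium LCP license file.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", "application/epub+zip")
        archive.writestr("META-INF/license.lcpl", "{}")
    buffer.seek(0)
    print(detect_drm(buffer))
```

Checking the archive listing before parsing any content keeps the failure cheap and gives the user a clear message instead of a parser error deep inside extraction.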
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -4,7 +4,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co

 ## 🎯 Project Overview

-**Skill Seekers** is the **universal documentation preprocessor** for AI systems. It transforms documentation websites, GitHub repositories, and PDFs into production-ready formats for **16+ platforms**: RAG pipelines (LangChain, LlamaIndex, Haystack), vector databases (Pinecone, Chroma, Weaviate, FAISS, Qdrant), AI coding assistants (Cursor, Windsurf, Cline, Continue.dev), and LLM platforms (Claude, Gemini, OpenAI).
+**Skill Seekers** is the **universal documentation preprocessor** for AI systems. It transforms documentation websites, GitHub repositories, PDFs, and EPUBs into production-ready formats for **16+ platforms**: RAG pipelines (LangChain, LlamaIndex, Haystack), vector databases (Pinecone, Chroma, Weaviate, FAISS, Qdrant), AI coding assistants (Cursor, Windsurf, Cline, Continue.dev), and LLM platforms (Claude, Gemini, OpenAI).

 **Current Version:** v3.1.3
 **Python Version:** 3.10+ required
@@ -222,6 +222,7 @@ src/skill_seekers/
 │   ├── dependency_analyzer.py        # Dependency graph analysis
 │   ├── signal_flow_analyzer.py       # C3.10 Signal flow analysis (Godot)
 │   ├── pdf_scraper.py                # PDF extraction
+│   ├── epub_scraper.py               # EPUB extraction
 │   └── adaptors/                     # ⭐ Platform adaptor pattern
 │       ├── __init__.py               # Factory: get_adaptor()
 │       ├── base_adaptor.py           # Abstract base
@@ -397,7 +398,7 @@ The unified CLI modifies `sys.argv` and calls existing `main()` functions to mai
 # Transforms to: doc_scraper.main() with modified sys.argv
 ```

-**Subcommands:** create, scrape, github, pdf, unified, codebase, enhance, enhance-status, package, upload, estimate, install, install-agent, patterns, how-to-guides
+**Subcommands:** create, scrape, github, pdf, epub, unified, codebase, enhance, enhance-status, package, upload, estimate, install, install-agent, patterns, how-to-guides

 ### NEW: Unified `create` Command

@@ -409,6 +410,7 @@ skill-seekers create https://docs.react.dev/         # → Web scraping
 skill-seekers create facebook/react                  # → GitHub analysis
 skill-seekers create ./my-project                    # → Local codebase
 skill-seekers create tutorial.pdf                    # → PDF extraction
+skill-seekers create book.epub                       # → EPUB extraction
 skill-seekers create configs/react.json              # → Multi-source

 # Progressive help system
@@ -417,6 +419,7 @@ skill-seekers create --help-web       # Shows web-specific options
 skill-seekers create --help-github    # Shows GitHub-specific options
 skill-seekers create --help-local     # Shows local analysis options
 skill-seekers create --help-pdf       # Shows PDF extraction options
+skill-seekers create --help-epub      # Shows EPUB extraction options
 skill-seekers create --help-advanced  # Shows advanced/rare options
 skill-seekers create --help-all       # Shows all 120+ flags

@@ -685,6 +688,7 @@ pytest tests/ -v -m ""
 - `test_unified.py` - Multi-source scraping
 - `test_github_scraper.py` - GitHub analysis
 - `test_pdf_scraper.py` - PDF extraction
+- `test_epub_scraper.py` - EPUB extraction
 - `test_install_multiplatform.py` - Multi-platform packaging
 - `test_integration.py` - End-to-end workflows
 - `test_install_skill.py` - One-command install
@@ -741,6 +745,7 @@ skill-seekers-resume = "skill_seekers.cli.resume_command:main"                #
 skill-seekers-scrape = "skill_seekers.cli.doc_scraper:main"
 skill-seekers-github = "skill_seekers.cli.github_scraper:main"
 skill-seekers-pdf = "skill_seekers.cli.pdf_scraper:main"
+skill-seekers-epub = "skill_seekers.cli.epub_scraper:main"
 skill-seekers-unified = "skill_seekers.cli.unified_scraper:main"
 skill-seekers-codebase = "skill_seekers.cli.codebase_scraper:main"           # C2.x Local codebase analysis
 skill-seekers-enhance = "skill_seekers.cli.enhance_skill_local:main"
@@ -1754,6 +1759,7 @@ This section helps you quickly locate the right files when implementing common c
 | GitHub scraping | `src/skill_seekers/cli/github_scraper.py` | ~56KB | Repo analysis + metadata |
 | GitHub API | `src/skill_seekers/cli/github_fetcher.py` | ~17KB | Rate limit handling |
 | PDF extraction | `src/skill_seekers/cli/pdf_scraper.py` | Medium | PyMuPDF + OCR |
+| EPUB extraction | `src/skill_seekers/cli/epub_scraper.py` | Medium | ebooklib + BeautifulSoup |
 | Code analysis | `src/skill_seekers/cli/code_analyzer.py` | ~65KB | Multi-language AST parsing |
 | Pattern detection | `src/skill_seekers/cli/pattern_recognizer.py` | Medium | C3.1 - 10 GoF patterns |
 | Test extraction | `src/skill_seekers/cli/test_example_extractor.py` | Medium | C3.2 - 5 categories |
@@ -1777,7 +1783,7 @@ This section helps you quickly locate the right files when implementing common c
 2. **Arguments:** `src/skill_seekers/cli/arguments/create.py`
   - Three tiers of arguments:
     - `UNIVERSAL_ARGUMENTS` (13 flags) - Work for all sources
-     - Source-specific dicts (`WEB_ARGUMENTS`, `GITHUB_ARGUMENTS`, etc.)
+     - Source-specific dicts (`WEB_ARGUMENTS`, `GITHUB_ARGUMENTS`, `EPUB_ARGUMENTS`, etc.)
     - `ADVANCED_ARGUMENTS` - Rare/advanced options
   - `add_create_arguments(parser, mode)` - Multi-mode argument addition