feat: add EPUB input support (#310)
Adds EPUB as a first-class input source for skill generation.

- `EpubToSkillConverter` (`epub_scraper.py`, ~1200 lines) following the PDF scraper pattern
- Dublin Core metadata, spine items, code blocks, tables, and images extraction
- DRM detection (Adobe ADEPT, Apple FairPlay, Readium LCP) with fail-fast behavior
- EPUB 3 NCX TOC bug workaround (`ignore_ncx=True`)
- `ebooklib` as an optional dependency: `pip install skill-seekers[epub]`
- Wired into the `create` command with `.epub` auto-detection
- 104 tests, all passing

Review fixes: removed 3 empty test stubs, fixed SVG double-counting in `_extract_images()`, added `logger.debug` to a bare `except: pass`.

Based on PR #310 by @christianbaumann.

Co-authored-by: Christian Baumann <mail@chriss-baumann.de>
New file: docs/agents/plans/2026-03-14-epub-input-support.md

---
date: 2026-03-14T12:54:24.700367+00:00
git_commit: 7c90a4b9c9bccac8341b0769550d77aae3b4e524
branch: development
topic: "What files would be affected to add .epub support for input"
tags: [research, codebase, epub, input-format, scraper]
status: complete
---

# Research: What files would be affected to add .epub support for input

## Research Question

What files would be affected to add `.epub` support for input?

## Summary

Adding `.epub` input support follows an established pattern already used for the PDF and Word (`.docx`) formats. The codebase has a consistent multi-layer architecture for document input formats: source detection, argument definitions, parser registration, create-command routing, a standalone scraper module, and tests. Based on analysis of the existing PDF and Word implementations, **at least 8 existing files would need modification** and **4 new files would need to be created**.

## Detailed Findings

### New Files to Create (4 files)

| File | Purpose |
|------|---------|
| `src/skill_seekers/cli/epub_scraper.py` | Core EPUB extraction and skill-building logic (analog: `word_scraper.py` at ~750 lines) |
| `src/skill_seekers/cli/arguments/epub.py` | EPUB-specific argument definitions (analog: `arguments/word.py`) |
| `src/skill_seekers/cli/parsers/epub_parser.py` | Subcommand parser class (analog: `parsers/word_parser.py`) |
| `tests/test_epub_scraper.py` | Test suite (analog: `test_word_scraper.py` at ~750 lines, 130+ tests) |

### Existing Files to Modify (8+ files)

#### 1. Source Detection Layer

**`src/skill_seekers/cli/source_detector.py`** (4 locations)

- **`SourceDetector.detect()`** (line ~60): Add an `.epub` extension check, following the `.docx` pattern at lines 63-64:

```python
if source.endswith(".epub"):
    return cls._detect_epub(source)
```

- **New method `_detect_epub()`**: Add a detection method (following `_detect_word()` at lines 124-129):

```python
@classmethod
def _detect_epub(cls, source: str) -> SourceInfo:
    name = os.path.splitext(os.path.basename(source))[0]
    return SourceInfo(
        type="epub", parsed={"file_path": source}, suggested_name=name, raw_input=source
    )
```

- **`validate_source()`** (line ~250): Add an epub validation block (following the word block at lines 273-278)

- **Error message** (line ~94): Add an EPUB example to the `ValueError` help text
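
The `validate_source()` addition could look like the following standalone sketch. This is hypothetical (the real block would sit inside `validate_source()` next to the word checks and raise through the same error path); `validate_epub_source` is an illustrative name:

```python
import os


def validate_epub_source(file_path: str) -> None:
    """Hypothetical epub validation, mirroring the .docx checks:
    the file must carry the .epub extension and actually exist."""
    if not file_path.lower().endswith(".epub"):
        raise ValueError(f"Not an EPUB file: {file_path}")
    if not os.path.isfile(file_path):
        raise ValueError(f"EPUB file not found: {file_path}")
```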

#### 2. CLI Dispatcher

**`src/skill_seekers/cli/main.py`** (2 locations)

- **`COMMAND_MODULES` dict** (line ~46): Add an epub entry:

```python
"epub": "skill_seekers.cli.epub_scraper",
```

- **Module docstring** (line ~1): Add `epub` to the commands list
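
The lazy-import dispatch that `COMMAND_MODULES` feeds could be sketched as below. This is a minimal stand-in, not the project's actual `main.py`: `resolve_command` is a hypothetical name, and only the epub entry's module path is grounded in this plan:

```python
import importlib
from types import ModuleType

# Hypothetical mirror of main.py's mapping, with the new epub entry added
COMMAND_MODULES = {
    "epub": "skill_seekers.cli.epub_scraper",
}


def resolve_command(name: str, modules: dict[str, str]) -> ModuleType:
    """Import a subcommand's implementation module only when requested,
    so optional dependencies (ebooklib, etc.) stay optional."""
    if name not in modules:
        raise SystemExit(f"Unknown command: {name!r}")
    return importlib.import_module(modules[name])
```

Lazy importing here matters because `epub_scraper` imports `ebooklib` at module scope behind a guard; eager imports would force every optional dependency on every user.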

#### 3. Create Command Routing

**`src/skill_seekers/cli/create_command.py`** (4 locations)

- **`_route_to_scraper()`** (line ~121): Add an `elif self.source_info.type == "epub":` routing case

- **New `_route_epub()` method**: Following the `_route_word()` pattern at lines 331-352:

```python
def _route_epub(self) -> int:
    from skill_seekers.cli import epub_scraper

    argv = ["epub_scraper"]
    file_path = self.source_info.parsed["file_path"]
    argv.extend(["--epub", file_path])
    self._add_common_args(argv)
    # ... append any epub-specific args here, then delegate
    # (presumably mirroring _route_word's closing call):
    return epub_scraper.main(argv[1:])
```

- **`main()` epilog** (line ~537): Add an EPUB example and a source auto-detection entry

- **Progressive help** (line ~590): Add a `--help-epub` flag and handler block

#### 4. Argument Definitions

**`src/skill_seekers/cli/arguments/create.py`** (3 locations)

- **New `EPUB_ARGUMENTS` dict** (~line 401): Define epub-specific arguments (e.g., an `--epub` file-path flag), following the `WORD_ARGUMENTS` pattern at lines 402-411

- **`get_source_specific_arguments()`** (line 595): Add `"epub": EPUB_ARGUMENTS` to the `source_args` dict

- **`add_create_arguments()`** (line 676): Add an epub mode block:

```python
if mode in ["epub", "all"]:
    for arg_name, arg_def in EPUB_ARGUMENTS.items():
        parser.add_argument(*arg_def["flags"], **arg_def["kwargs"])
```
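
Given the `flags`/`kwargs` shape implied by that registration loop, `EPUB_ARGUMENTS` might look like the following sketch. Only `--epub` is grounded in this plan; `--chapter-range` is an illustrative guess taken from the open questions:

```python
# Sketch of EPUB_ARGUMENTS following the WORD_ARGUMENTS shape;
# entries beyond --epub are illustrative, not decided.
EPUB_ARGUMENTS = {
    "epub": {
        "flags": ["--epub"],
        "kwargs": {"metavar": "FILE", "help": "Path to the .epub file to convert"},
    },
    "chapter_range": {
        "flags": ["--chapter-range"],
        "kwargs": {"metavar": "N-M", "help": "Limit extraction to a chapter range"},
    },
}
```

Registering these then reduces to the `parser.add_argument(*arg_def["flags"], **arg_def["kwargs"])` loop shown above.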

#### 5. Parser Registration

**`src/skill_seekers/cli/parsers/__init__.py`** (2 locations)

- **Import** (line ~15): Add `from .epub_parser import EpubParser`

- **`PARSERS` list** (line ~46): Add an `EpubParser()` entry (near `WordParser()` and `PDFParser()`)
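
A minimal sketch of what `parsers/epub_parser.py` could contain. The real `SubcommandParser` interface lives in this codebase and is only approximated by the stub below (the `name`/`help_text`/`add_arguments` shape is an assumption, not the actual base class):

```python
import argparse


class SubcommandParser:
    """Stand-in for the project's real base class (interface assumed)."""

    name: str = ""
    help_text: str = ""

    def add_arguments(self, parser: argparse.ArgumentParser) -> None:
        raise NotImplementedError


class EpubParser(SubcommandParser):
    name = "epub"
    help_text = "Generate a skill from an EPUB file"

    def add_arguments(self, parser: argparse.ArgumentParser) -> None:
        parser.add_argument("--epub", metavar="FILE", help="Path to the .epub input")
```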

#### 6. Package Configuration

**`pyproject.toml`** (3 locations)

- **`[project.optional-dependencies]`** (line ~111): Add an `epub` optional dependency group:

```toml
epub = [
    "ebooklib>=0.18",
]
```

- **`all` optional dependency group** (line ~178): Add the epub dependency to the combined `all` group

- **`[project.scripts]`** (line ~224): Add a standalone entry point:

```toml
skill-seekers-epub = "skill_seekers.cli.epub_scraper:main"
```

#### 7. Argument Commons

**`src/skill_seekers/cli/arguments/common.py`**

- No changes required: `add_all_standard_arguments()` is called by the new `arguments/epub.py` but is used as-is

#### 8. Documentation / Configuration

**`CLAUDE.md`** (2 locations)

- **Commands section**: Add `epub` to the list of subcommands
- **Key source files table**: Add an `epub_scraper.py` entry

**`CONTRIBUTING.md`** — Potentially update with an epub format mention

**`CHANGELOG.md`** — New feature entry

### Files NOT Affected

These files do **not** need changes:

- **`unified_scraper.py`** — Multi-source configs could add epub support later, but it is not required for basic input support
- **Platform adaptors** (`adaptors/*.py`) — Adaptors work on the output side (packaging), not input
- **Enhancement system** (`enhance_skill.py`, `enhance_skill_local.py`) — Works generically on SKILL.md
- **MCP server** (`mcp/server_fastmcp.py`) — Operates on completed skills
- **`pdf_extractor_poc.py`** — PDF-specific extraction; epub needs its own extractor

## Code References

### Pattern to Follow (Word .docx implementation)

- `src/skill_seekers/cli/word_scraper.py:1-750` — Full scraper with the `WordToSkillConverter` class
- `src/skill_seekers/cli/arguments/word.py:1-75` — Argument definitions with `add_word_arguments()`
- `src/skill_seekers/cli/parsers/word_parser.py:1-33` — Parser class extending `SubcommandParser`
- `tests/test_word_scraper.py:1-750` — Comprehensive test suite with 130+ tests

### Key Integration Points

- `src/skill_seekers/cli/source_detector.py:57-65` — File extension detection order
- `src/skill_seekers/cli/source_detector.py:124-129` — `_detect_word()` method (template for `_detect_epub()`)
- `src/skill_seekers/cli/create_command.py:121-143` — `_route_to_scraper()` dispatch
- `src/skill_seekers/cli/create_command.py:331-352` — `_route_word()` (template for `_route_epub()`)
- `src/skill_seekers/cli/arguments/create.py:401-411` — `WORD_ARGUMENTS` dict (template)
- `src/skill_seekers/cli/arguments/create.py:595-604` — `get_source_specific_arguments()` mapping
- `src/skill_seekers/cli/arguments/create.py:676-678` — `add_create_arguments()` mode handling
- `src/skill_seekers/cli/parsers/__init__.py:35-59` — `PARSERS` registry list
- `src/skill_seekers/cli/main.py:46-70` — `COMMAND_MODULES` dict
- `pyproject.toml:111-115` — Optional dependency group pattern (docx)
- `pyproject.toml:213-246` — Script entry points

### Data Flow Architecture

The epub scraper would follow the same three-step pipeline as Word/PDF:

1. **Extract** — Parse the `.epub` file → sections with text, headings, code, images → save to `output/{name}_extracted.json`
2. **Categorize** — Group sections by chapters/keywords
3. **Build** — Generate `SKILL.md`, `references/*.md`, `references/index.md`, `assets/`

The intermediate JSON format uses the same structure as Word/PDF:

```python
{
    "source_file": str,
    "metadata": {"title", "author", "created", ...},
    "total_sections": int,
    "total_code_blocks": int,
    "total_images": int,
    "languages_detected": {str: int},
    "pages": [  # sections
        {
            "section_number": int,
            "heading": str,
            "text": str,
            "code_samples": [...],
            "images": [...],
            "headings": [...],
        }
    ],
}
```
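
For the Extract step, one way to turn a single spine item's XHTML into one `"pages"` entry is a small stdlib parser. This is a sketch under stated assumptions: a real implementation would read spine items via ebooklib and likely parse with BeautifulSoup, and `xhtml_to_section` is a hypothetical helper name:

```python
from html.parser import HTMLParser


class _SectionParser(HTMLParser):
    """Collect heading text, body text, and <pre>/<code> samples."""

    def __init__(self) -> None:
        super().__init__()
        self.headings: list[str] = []
        self.code_samples: list[str] = []
        self.text_parts: list[str] = []
        self._stack: list[str] = []  # open-tag stack to classify text nodes

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        if not data.strip():
            return
        tag = self._stack[-1] if self._stack else ""
        if tag in ("h1", "h2", "h3", "h4"):
            self.headings.append(data.strip())
        elif tag in ("pre", "code"):
            self.code_samples.append(data)
        else:
            self.text_parts.append(data.strip())


def xhtml_to_section(xhtml: str, number: int) -> dict:
    """Shape one XHTML document into the intermediate JSON's section dict."""
    p = _SectionParser()
    p.feed(xhtml)
    return {
        "section_number": number,
        "heading": p.headings[0] if p.headings else "",
        "text": " ".join(p.text_parts),
        "code_samples": p.code_samples,
        "images": [],  # image extraction omitted in this sketch
        "headings": p.headings,
    }
```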

## Architecture Documentation

### Document Input Format Pattern

Each input format follows a consistent architecture:

```
[source_detector.py] → detect type by extension
        ↓
[create_command.py] → route to scraper
        ↓
[{format}_scraper.py] → extract → categorize → build skill
        ↓
[output/{name}/] → SKILL.md + references/ + assets/
```

Supporting files per format:

- `arguments/{format}.py` — CLI argument definitions
- `parsers/{format}_parser.py` — Subcommand parser class
- `tests/test_{format}_scraper.py` — Test suite

### Dependency Guard Pattern

The Word scraper uses an optional dependency guard that epub should replicate:

```python
try:
    import ebooklib
    from ebooklib import epub

    EPUB_AVAILABLE = True
except ImportError:
    EPUB_AVAILABLE = False


def _check_epub_deps():
    if not EPUB_AVAILABLE:
        raise RuntimeError(
            "ebooklib is required for EPUB support.\n"
            'Install with: pip install "skill-seekers[epub]"\n'
            "Or: pip install ebooklib"
        )
```

## Summary Table

| Category | Files | Action |
|----------|-------|--------|
| New files | 4 | Create from scratch |
| Source detection | 1 | Add epub detection + validation |
| CLI dispatcher | 1 | Add command module mapping |
| Create command | 1 | Add routing + help + examples |
| Arguments | 1 | Add `EPUB_ARGUMENTS` + register in helpers |
| Parser registry | 1 | Import + register `EpubParser` |
| Package config | 1 | Add deps + entry point |
| Documentation | 2+ | Update CLAUDE.md, CHANGELOG |
| **Total** | **8+ modified, 4 new** | |

## Open Questions

- Should epub support reuse any of the existing HTML parsing from `word_scraper.py` (which uses mammoth to convert to HTML, then parses with BeautifulSoup)? EPUB packages internally contain XHTML files, so BeautifulSoup parsing would be directly applicable.
- Should the epub scraper support DRM-protected files, or only DRM-free epub files?
- Should epub-specific arguments include options like `--chapter-range` (similar to PDF's `--pages`)?