firefrost-gaming/skill-seekers-reference

Files

yusyus 2e30970dfb feat: add EPUB input support (#310 )

Adds EPUB as a first-class input source for skill generation.

- EpubToSkillConverter (epub_scraper.py, ~1200 lines) following PDF scraper pattern
- Dublin Core metadata, spine items, code blocks, tables, images extraction
- DRM detection (Adobe ADEPT, Apple FairPlay, Readium LCP) with fail-fast
- EPUB 3 NCX TOC bug workaround (ignore_ncx=True)
- ebooklib as optional dep: pip install skill-seekers[epub]
- Wired into create command with .epub auto-detection
- 104 tests, all passing

Review fixes: removed 3 empty test stubs, fixed SVG double-counting in
_extract_images(), added logger.debug to bare except pass.

Based on PR #310 by @christianbaumann.
Co-authored-by: Christian Baumann <mail@chriss-baumann.de>

2026-03-15 02:34:41 +03:00

9.9 KiB

Raw Blame History

date, git_commit, branch, topic, tags, status

date

git_commit

branch

topic

Research: What files would be affected to add .epub support for input

Research Question

What files would be affected to add .epub support for input.

Summary

Adding .epub input support follows an established pattern already used for PDF and Word (.docx) formats. The codebase has a consistent multi-layer architecture for document input formats: source detection, argument definitions, parser registration, create command routing, standalone scraper module, and tests. Based on analysis of the existing PDF and Word implementations, 16 existing files would need modification and 4 new files would need to be created.

Detailed Findings

New Files to Create (4 files)

File	Purpose
`src/skill_seekers/cli/epub_scraper.py`	Core EPUB extraction and skill building logic (analog: `word_scraper.py` at ~750 lines)
`src/skill_seekers/cli/arguments/epub.py`	EPUB-specific argument definitions (analog: `arguments/word.py`)
`src/skill_seekers/cli/parsers/epub_parser.py`	Subcommand parser class (analog: `parsers/word_parser.py`)
`tests/test_epub_scraper.py`	Test suite (analog: `test_word_scraper.py` at ~750 lines, 130+ tests)

Existing Files to Modify (16 files)

1. Source Detection Layer

src/skill_seekers/cli/source_detector.py (3 locations)

SourceDetector.detect() (line ~60): Add .epub extension check, following the .docx pattern at line 63-64:
```
if source.endswith(".epub"):
    return cls._detect_epub(source)
```

New method _detect_epub(): Add detection method (following _detect_word() at lines 124-129):

@classmethod
def _detect_epub(cls, source: str) -> SourceInfo:
    name = os.path.splitext(os.path.basename(source))[0]
    return SourceInfo(
        type="epub", parsed={"file_path": source}, suggested_name=name, raw_input=source
    )

validate_source() (line ~250): Add epub validation block (following the word block at lines 273-278)
Error message (line ~94): Add EPUB example to the ValueError help text

2. CLI Dispatcher

src/skill_seekers/cli/main.py (2 locations)

COMMAND_MODULES dict (line ~46): Add epub entry:
```
"epub": "skill_seekers.cli.epub_scraper",
```
Module docstring (line ~1): Add epub to the commands list

3. Create Command Routing

src/skill_seekers/cli/create_command.py (3 locations)

_route_to_scraper() (line ~121): Add elif self.source_info.type == "epub": routing case

New _route_epub() method: Following the _route_word() pattern at lines 331-352:

def _route_epub(self) -> int:
    from skill_seekers.cli import epub_scraper
    argv = ["epub_scraper"]
    file_path = self.source_info.parsed["file_path"]
    argv.extend(["--epub", file_path])
    self._add_common_args(argv)
    # epub-specific args here
    ...

main() epilog (line ~537): Add EPUB example and source auto-detection entry
Progressive help (line ~590): Add --help-epub flag and handler block

4. Argument Definitions

src/skill_seekers/cli/arguments/create.py (4 locations)

New EPUB_ARGUMENTS dict (~line 401): Define epub-specific arguments (e.g., --epub file path flag), following the WORD_ARGUMENTS pattern at lines 402-411
get_source_specific_arguments() (line 595): Add "epub": EPUB_ARGUMENTS to the source_args dict

add_create_arguments() (line 676): Add epub mode block:

if mode in ["epub", "all"]:
    for arg_name, arg_def in EPUB_ARGUMENTS.items():
        parser.add_argument(*arg_def["flags"], **arg_def["kwargs"])

5. Parser Registration

src/skill_seekers/cli/parsers/__init__.py (2 locations)

Import (line ~15): Add from .epub_parser import EpubParser
PARSERS list (line ~46): Add EpubParser() entry (near WordParser() and PDFParser())

6. Package Configuration

pyproject.toml (3 locations)

[project.optional-dependencies] (line ~111): Add epub optional dependency group:
```
epub = [
    "ebooklib>=0.18",
]
```
all optional dependency group (line ~178): Add epub dependency to the combined all group

[project.scripts] (line ~224): Add standalone entry point:

skill-seekers-epub = "skill_seekers.cli.epub_scraper:main"

7. Argument Commons

src/skill_seekers/cli/arguments/common.py

No changes strictly required, but add_all_standard_arguments() is called by the new arguments/epub.py (no modification needed — it's used as-is)

8. Documentation / Configuration

CLAUDE.md (2 locations)

Commands section: Add epub to the list of subcommands
Key source files table: Add epub_scraper.py entry

CONTRIBUTING.md — Potentially update with epub format mention

CHANGELOG.md — New feature entry

Files NOT Affected

These files do not need changes:

unified_scraper.py — Multi-source configs could add epub support later but it's not required for basic input support
Platform adaptors (adaptors/*.py) — Adaptors work on the output side (packaging), not input
Enhancement system (enhance_skill.py, enhance_skill_local.py) — Works generically on SKILL.md
MCP server (mcp/server_fastmcp.py) — Operates on completed skills
pdf_extractor_poc.py — PDF-specific extraction; epub needs its own extractor

Code References

Pattern to Follow (Word .docx implementation)

src/skill_seekers/cli/word_scraper.py:1-750 — Full scraper with WordToSkillConverter class
src/skill_seekers/cli/arguments/word.py:1-75 — Argument definitions with add_word_arguments()
src/skill_seekers/cli/parsers/word_parser.py:1-33 — Parser class extending SubcommandParser
tests/test_word_scraper.py:1-750 — Comprehensive test suite with 130+ tests

Key Integration Points

src/skill_seekers/cli/source_detector.py:57-65 — File extension detection order
src/skill_seekers/cli/source_detector.py:124-129 — _detect_word() method (template for _detect_epub())
src/skill_seekers/cli/create_command.py:121-143 — _route_to_scraper() dispatch
src/skill_seekers/cli/create_command.py:331-352 — _route_word() (template for _route_epub())
src/skill_seekers/cli/arguments/create.py:401-411 — WORD_ARGUMENTS dict (template)
src/skill_seekers/cli/arguments/create.py:595-604 — get_source_specific_arguments() mapping
src/skill_seekers/cli/arguments/create.py:676-678 — add_create_arguments() mode handling
src/skill_seekers/cli/parsers/__init__.py:35-59 — PARSERS registry list
src/skill_seekers/cli/main.py:46-70 — COMMAND_MODULES dict
pyproject.toml:111-115 — Optional dependency group pattern (docx)
pyproject.toml:213-246 — Script entry points

Data Flow Architecture

The epub scraper would follow the same three-step pipeline as Word/PDF:

Extract — Parse .epub file → sections with text, headings, code, images → save to output/{name}_extracted.json
Categorize — Group sections by chapters/keywords
Build — Generate SKILL.md, references/*.md, references/index.md, assets/

The intermediate JSON format uses the same structure as Word/PDF:

{
    "source_file": str,
    "metadata": {"title", "author", "created", ...},
    "total_sections": int,
    "total_code_blocks": int,
    "total_images": int,
    "languages_detected": {str: int},
    "pages": [  # sections
        {
            "section_number": int,
            "heading": str,
            "text": str,
            "code_samples": [...],
            "images": [...],
            "headings": [...]
        }
    ]
}

Architecture Documentation

Document Input Format Pattern

Each input format follows a consistent architecture:

[source_detector.py] → detect type by extension
        ↓
[create_command.py] → route to scraper
        ↓
[{format}_scraper.py] → extract → categorize → build skill
        ↓
[output/{name}/] → SKILL.md + references/ + assets/

Supporting files per format:

arguments/{format}.py — CLI argument definitions
parsers/{format}_parser.py — Subcommand parser class
tests/test_{format}_scraper.py — Test suite

Dependency Guard Pattern

The Word scraper uses an optional dependency guard that epub should replicate:

try:
    import ebooklib
    from ebooklib import epub
    EPUB_AVAILABLE = True
except ImportError:
    EPUB_AVAILABLE = False

def _check_epub_deps():
    if not EPUB_AVAILABLE:
        raise RuntimeError(
            "ebooklib is required for EPUB support.\n"
            'Install with: pip install "skill-seekers[epub]"\n'
            "Or: pip install ebooklib"
        )

Summary Table

Category	Files	Action
New files	4	Create from scratch
Source detection	1	Add epub detection + validation
CLI dispatcher	1	Add command module mapping
Create command	1	Add routing + help + examples
Arguments	1	Add EPUB_ARGUMENTS + register in helpers
Parser registry	1	Import + register EpubParser
Package config	1	Add deps + entry point
Documentation	2+	Update CLAUDE.md, CHANGELOG
Total	12+ modified, 4 new

Open Questions

Should epub support reuse any of the existing HTML parsing from word_scraper.py (which uses mammoth to convert to HTML then parses with BeautifulSoup)? EPUB internally contains XHTML files, so BeautifulSoup parsing would be directly applicable.
Should the epub scraper support DRM-protected files, or only DRM-free epub files?
Should epub-specific arguments include options like --chapter-range (similar to PDF's --pages)?

9.9 KiB Raw Blame History