skill-seekers-reference/docs/agents/research/2026-03-14-epub-input-support-affected-files.md
yusyus 2e30970dfb feat: add EPUB input support (#310)
Adds EPUB as a first-class input source for skill generation.

- EpubToSkillConverter (epub_scraper.py, ~1200 lines) following PDF scraper pattern
- Dublin Core metadata, spine items, code blocks, tables, images extraction
- DRM detection (Adobe ADEPT, Apple FairPlay, Readium LCP) with fail-fast
- EPUB 3 NCX TOC bug workaround (ignore_ncx=True)
- ebooklib as optional dep: pip install skill-seekers[epub]
- Wired into create command with .epub auto-detection
- 104 tests, all passing

Review fixes: removed 3 empty test stubs, fixed SVG double-counting in
_extract_images(), added logger.debug to bare except pass.

Based on PR #310 by @christianbaumann.
Co-authored-by: Christian Baumann <mail@chriss-baumann.de>
2026-03-15 02:34:41 +03:00


date: 2026-03-14T12:54:24.700367+00:00
git_commit: 7c90a4b9c9bccac8341b0769550d77aae3b4e524
branch: development
topic: What files would be affected to add .epub support for input
tags: research, codebase, epub, input-format, scraper
status: complete

Research: What files would be affected to add .epub support for input

Research Question

What files would be affected by adding .epub input support?

Summary

Adding .epub input support follows an established pattern already used for the PDF and Word (.docx) formats. The codebase has a consistent multi-layer architecture for document input formats: source detection, argument definitions, parser registration, create command routing, a standalone scraper module, and tests. Based on analysis of the existing PDF and Word implementations, 4 new files would need to be created and 9 existing files modified (6 source and configuration files plus 3 documentation files, of which CONTRIBUTING.md is optional).

Detailed Findings

New Files to Create (4 files)

| File | Purpose |
|---|---|
| src/skill_seekers/cli/epub_scraper.py | Core EPUB extraction and skill building logic (analog: word_scraper.py at ~750 lines) |
| src/skill_seekers/cli/arguments/epub.py | EPUB-specific argument definitions (analog: arguments/word.py) |
| src/skill_seekers/cli/parsers/epub_parser.py | Subcommand parser class (analog: parsers/word_parser.py) |
| tests/test_epub_scraper.py | Test suite (analog: test_word_scraper.py at ~750 lines, 130+ tests) |

Existing Files to Modify (9 files)

1. Source Detection Layer

src/skill_seekers/cli/source_detector.py (4 locations)

  • SourceDetector.detect() (line ~60): Add a .epub extension check, following the .docx pattern at lines 63-64:

    if source.endswith(".epub"):
        return cls._detect_epub(source)
    
  • New method _detect_epub(): Add detection method (following _detect_word() at lines 124-129):

    @classmethod
    def _detect_epub(cls, source: str) -> SourceInfo:
        name = os.path.splitext(os.path.basename(source))[0]
        return SourceInfo(
            type="epub", parsed={"file_path": source}, suggested_name=name, raw_input=source
        )
    
  • validate_source() (line ~250): Add epub validation block (following the word block at lines 273-278)

  • Error message (line ~94): Add EPUB example to the ValueError help text
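
The detection steps above can be sketched in isolation. This is a minimal, self-contained sketch rather than the project's code: `SourceInfo` is reduced to a stand-in dataclass with the four fields used in the snippet above, and `EXTENSION_TYPES` and `detect_by_extension` are hypothetical names. The `"word"` type string is assumed. The sketch also lowercases the suffix before comparing, which the `endswith(".epub")` check shown above does not do.

```python
import os
from dataclasses import dataclass, field


@dataclass
class SourceInfo:
    """Stand-in for the project's SourceInfo (assumed fields only)."""
    type: str
    parsed: dict = field(default_factory=dict)
    suggested_name: str = ""
    raw_input: str = ""


# Hypothetical lookup table; the real detector uses explicit if-checks.
# Only the "epub" type string appears in the research notes.
EXTENSION_TYPES = {".docx": "word", ".epub": "epub"}


def detect_by_extension(source: str) -> SourceInfo:
    """Map a file path to a SourceInfo based on its (case-insensitive) suffix."""
    stem, ext = os.path.splitext(os.path.basename(source))
    source_type = EXTENSION_TYPES.get(ext.lower())
    if source_type is None:
        raise ValueError(f"Cannot determine source type for: {source}")
    return SourceInfo(
        type=source_type,
        parsed={"file_path": source},
        suggested_name=stem,
        raw_input=source,
    )
```

A table-driven lookup keeps the "add one format" change to a single line, at the cost of diverging from the explicit per-format methods the existing detector uses.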

2. CLI Dispatcher

src/skill_seekers/cli/main.py (2 locations)

  • COMMAND_MODULES dict (line ~46): Add epub entry:

    "epub": "skill_seekers.cli.epub_scraper",
    
  • Module docstring (line ~1): Add epub to the commands list
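
A name-to-module mapping like `COMMAND_MODULES` is typically resolved with a lazy import. The sketch below illustrates that idea under assumptions: `resolve_command` and `load_command` are hypothetical helpers, and how main.py actually consumes the dict is not shown in this research.

```python
import importlib
from types import ModuleType

# Excerpt-style mapping; only the "epub" entry comes from the notes above.
COMMAND_MODULES = {
    "epub": "skill_seekers.cli.epub_scraper",
}


def resolve_command(command: str) -> str:
    """Return the dotted module path for a subcommand, or exit with a hint."""
    try:
        return COMMAND_MODULES[command]
    except KeyError:
        known = ", ".join(sorted(COMMAND_MODULES))
        raise SystemExit(
            f"Unknown command {command!r}; expected one of: {known}"
        ) from None


def load_command(command: str) -> ModuleType:
    """Import the module only when its command is actually invoked."""
    return importlib.import_module(resolve_command(command))
```

Storing module paths as strings means an optional dependency such as ebooklib is only needed when the matching command runs.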

3. Create Command Routing

src/skill_seekers/cli/create_command.py (4 locations)

  • _route_to_scraper() (line ~121): Add elif self.source_info.type == "epub": routing case

  • New _route_epub() method: Following the _route_word() pattern at lines 331-352:

    def _route_epub(self) -> int:
        from skill_seekers.cli import epub_scraper
        argv = ["epub_scraper"]
        file_path = self.source_info.parsed["file_path"]
        argv.extend(["--epub", file_path])
        self._add_common_args(argv)
        # epub-specific args here
        ...
    
  • main() epilog (line ~537): Add EPUB example and source auto-detection entry

  • Progressive help (line ~590): Add --help-epub flag and handler block
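
The routing step amounts to building an argv list for the scraper module. The following sketch shows one way to do that and to hand the list to a `main()` that reads `sys.argv`. Both `build_epub_argv` and `patched_argv` are hypothetical helpers, the `--name` flag in the usage note is an assumed example, and whether `_route_word()` swaps `sys.argv` this way is not confirmed by the notes above.

```python
import sys
from contextlib import contextmanager
from typing import Iterator, List, Optional, Sequence


def build_epub_argv(file_path: str, common_args: Optional[Sequence[str]] = None) -> List[str]:
    """Assemble the argument vector for the (assumed) epub_scraper entry point."""
    argv = ["epub_scraper", "--epub", file_path]
    argv.extend(common_args or [])
    return argv


@contextmanager
def patched_argv(argv: Sequence[str]) -> Iterator[None]:
    """Temporarily replace sys.argv, restoring it even if the body raises."""
    original = sys.argv
    sys.argv = list(argv)
    try:
        yield
    finally:
        sys.argv = original
```

Usage would look like `with patched_argv(build_epub_argv("book.epub", ["--name", "book"])): epub_scraper.main()`, assuming the scraper's `main()` parses `sys.argv`.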

4. Argument Definitions

src/skill_seekers/cli/arguments/create.py (3 locations)

  • New EPUB_ARGUMENTS dict (~line 401): Define epub-specific arguments (e.g., --epub file path flag), following the WORD_ARGUMENTS pattern at lines 402-411

  • get_source_specific_arguments() (line 595): Add "epub": EPUB_ARGUMENTS to the source_args dict

  • add_create_arguments() (line 676): Add epub mode block:

    if mode in ["epub", "all"]:
        for arg_name, arg_def in EPUB_ARGUMENTS.items():
            parser.add_argument(*arg_def["flags"], **arg_def["kwargs"])
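
To make the flags/kwargs shape concrete, here is a small runnable sketch of an argument table registered with argparse in the same loop style. The contents of `EPUB_ARGUMENTS` (a single `--epub` path flag with this help text) and the `build_parser` helper are assumptions for illustration, not the project's definitions.

```python
import argparse

# Hypothetical argument table using the flags/kwargs layout shown above.
EPUB_ARGUMENTS = {
    "epub": {
        "flags": ("--epub",),
        "kwargs": {"type": str, "metavar": "PATH", "help": "Path to the .epub file"},
    },
}


def build_parser(mode: str = "epub") -> argparse.ArgumentParser:
    """Create a parser and register EPUB arguments only for matching modes."""
    parser = argparse.ArgumentParser(prog="create-sketch")
    if mode in ["epub", "all"]:
        for arg_name, arg_def in EPUB_ARGUMENTS.items():
            parser.add_argument(*arg_def["flags"], **arg_def["kwargs"])
    return parser
```

With `mode="epub"` or `mode="all"`, `--epub book.epub` parses into `args.epub`; in any other mode the flag is left unregistered.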
    

5. Parser Registration

src/skill_seekers/cli/parsers/__init__.py (2 locations)

  • Import (line ~15): Add from .epub_parser import EpubParser

  • PARSERS list (line ~46): Add EpubParser() entry (near WordParser() and PDFParser())
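
The registry pattern described here can be illustrated with a stand-in base class. Everything in this sketch is hypothetical: the real `SubcommandParser` interface is not shown in this research, so the `name`, `help`, `add_arguments`, and `register` members below are assumptions chosen to make the example runnable.

```python
import argparse
from abc import ABC, abstractmethod


class SubcommandParser(ABC):
    """Hypothetical stand-in; the project's base class may differ."""

    name: str
    help: str

    @abstractmethod
    def add_arguments(self, parser: argparse.ArgumentParser) -> None:
        """Attach this subcommand's arguments to its parser."""

    def register(self, subparsers) -> argparse.ArgumentParser:
        parser = subparsers.add_parser(self.name, help=self.help)
        self.add_arguments(parser)
        return parser


class EpubParser(SubcommandParser):
    name = "epub"
    help = "Extract a skill from an EPUB file"

    def add_arguments(self, parser: argparse.ArgumentParser) -> None:
        parser.add_argument("--epub", metavar="PATH", help="Path to the .epub file")


# Registry list: adding a format means appending one instance.
PARSERS = [EpubParser()]


def build_cli() -> argparse.ArgumentParser:
    root = argparse.ArgumentParser(prog="skill-seekers-sketch")
    subparsers = root.add_subparsers(dest="command")
    for subcommand in PARSERS:
        subcommand.register(subparsers)
    return root
```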

6. Package Configuration

pyproject.toml (3 locations)

  • [project.optional-dependencies] (line ~111): Add epub optional dependency group:

    epub = [
        "ebooklib>=0.18",
    ]
    
  • all optional dependency group (line ~178): Add epub dependency to the combined all group

  • [project.scripts] (line ~224): Add standalone entry point:

    skill-seekers-epub = "skill_seekers.cli.epub_scraper:main"
    

7. Argument Commons

src/skill_seekers/cli/arguments/common.py

  • No changes required: the new arguments/epub.py calls add_all_standard_arguments() as-is.

8. Documentation / Configuration

CLAUDE.md (2 locations)

  • Commands section: Add epub to the list of subcommands
  • Key source files table: Add epub_scraper.py entry

CONTRIBUTING.md — Potentially update with epub format mention

CHANGELOG.md — New feature entry

Files NOT Affected

These files do not need changes:

  • unified_scraper.py — Multi-source configs could add epub support later but it's not required for basic input support
  • Platform adaptors (adaptors/*.py) — Adaptors work on the output side (packaging), not input
  • Enhancement system (enhance_skill.py, enhance_skill_local.py) — Works generically on SKILL.md
  • MCP server (mcp/server_fastmcp.py) — Operates on completed skills
  • pdf_extractor_poc.py — PDF-specific extraction; epub needs its own extractor

Code References

Pattern to Follow (Word .docx implementation)

  • src/skill_seekers/cli/word_scraper.py:1-750 — Full scraper with WordToSkillConverter class
  • src/skill_seekers/cli/arguments/word.py:1-75 — Argument definitions with add_word_arguments()
  • src/skill_seekers/cli/parsers/word_parser.py:1-33 — Parser class extending SubcommandParser
  • tests/test_word_scraper.py:1-750 — Comprehensive test suite with 130+ tests

Key Integration Points

  • src/skill_seekers/cli/source_detector.py:57-65 - File extension detection order
  • src/skill_seekers/cli/source_detector.py:124-129 - _detect_word() method (template for _detect_epub())
  • src/skill_seekers/cli/create_command.py:121-143 - _route_to_scraper() dispatch
  • src/skill_seekers/cli/create_command.py:331-352 - _route_word() (template for _route_epub())
  • src/skill_seekers/cli/arguments/create.py:401-411 - WORD_ARGUMENTS dict (template)
  • src/skill_seekers/cli/arguments/create.py:595-604 - get_source_specific_arguments() mapping
  • src/skill_seekers/cli/arguments/create.py:676-678 - add_create_arguments() mode handling
  • src/skill_seekers/cli/parsers/__init__.py:35-59 - PARSERS registry list
  • src/skill_seekers/cli/main.py:46-70 - COMMAND_MODULES dict
  • pyproject.toml:111-115 - Optional dependency group pattern (docx)
  • pyproject.toml:213-246 - Script entry points

Data Flow Architecture

The epub scraper would follow the same three-step pipeline as Word/PDF:

  1. Extract — Parse .epub file → sections with text, headings, code, images → save to output/{name}_extracted.json
  2. Categorize — Group sections by chapters/keywords
  3. Build — Generate SKILL.md, references/*.md, references/index.md, assets/
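
The three steps above can be sketched as plain functions over in-memory data. This is an illustrative toy, not the scraper's implementation: the function names, the keyword-matching rule, and the "other" fallback category are assumptions, and no files are written.

```python
from collections import defaultdict


def extract(raw_sections):
    """Step 1: normalize raw (heading, text) pairs into section dicts."""
    pages = [
        {"section_number": i, "heading": heading, "text": text}
        for i, (heading, text) in enumerate(raw_sections, start=1)
    ]
    return {"total_sections": len(pages), "pages": pages}


def categorize(extracted, keywords):
    """Step 2: assign each section to the first category with a matching keyword."""
    groups = defaultdict(list)
    for page in extracted["pages"]:
        haystack = f'{page["heading"]} {page["text"]}'.lower()
        category = next(
            (name for name, words in keywords.items()
             if any(word in haystack for word in words)),
            "other",  # assumed fallback for unmatched sections
        )
        groups[category].append(page)
    return dict(groups)


def build(categories):
    """Step 3: render one Markdown reference document per category."""
    docs = {}
    for name, pages in categories.items():
        body = "\n\n".join(f'## {p["heading"]}\n\n{p["text"]}' for p in pages)
        docs[f"references/{name}.md"] = f"# {name.title()}\n\n{body}\n"
    return docs
```

Chaining the calls as `build(categorize(extract(raw), keywords))` mirrors the extract, categorize, build order, with the intermediate dict playing the role of the extracted JSON.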

The intermediate JSON format uses the same structure as Word/PDF:

{
    "source_file": str,
    "metadata": {"title", "author", "created", ...},
    "total_sections": int,
    "total_code_blocks": int,
    "total_images": int,
    "languages_detected": {str: int},
    "pages": [  # sections
        {
            "section_number": int,
            "heading": str,
            "text": str,
            "code_samples": [...],
            "images": [...],
            "headings": [...]
        }
    ]
}
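
As a sketch of how the totals in this structure relate to the per-section data, the helper below derives them from a list of sections. The `Section` type mirrors the keys listed above; the `summarize` function and the `"language"` key on each code sample are assumptions, since the contents of `code_samples` are elided in the notes.

```python
from collections import Counter
from typing import List, TypedDict


class Section(TypedDict):
    section_number: int
    heading: str
    text: str
    code_samples: list
    images: list
    headings: list


def summarize(source_file: str, metadata: dict, sections: List[Section]) -> dict:
    """Build the top-level document, computing totals from the sections."""
    # Assumes each code sample is a dict that may carry a "language" key.
    languages = Counter(
        sample["language"]
        for section in sections
        for sample in section["code_samples"]
        if sample.get("language")
    )
    return {
        "source_file": source_file,
        "metadata": metadata,
        "total_sections": len(sections),
        "total_code_blocks": sum(len(s["code_samples"]) for s in sections),
        "total_images": sum(len(s["images"]) for s in sections),
        "languages_detected": dict(languages),
        "pages": sections,
    }
```

Deriving the totals in one place keeps them consistent with `pages`, which matters if downstream steps trust the counts without recomputing them.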

Architecture Documentation

Document Input Format Pattern

Each input format follows a consistent architecture:

[source_detector.py] → detect type by extension
        ↓
[create_command.py] → route to scraper
        ↓
[{format}_scraper.py] → extract → categorize → build skill
        ↓
[output/{name}/] → SKILL.md + references/ + assets/

Supporting files per format:

  • arguments/{format}.py — CLI argument definitions
  • parsers/{format}_parser.py — Subcommand parser class
  • tests/test_{format}_scraper.py — Test suite

Dependency Guard Pattern

The Word scraper uses an optional dependency guard that the epub scraper should replicate, shown here adapted for ebooklib:

try:
    import ebooklib
    from ebooklib import epub
    EPUB_AVAILABLE = True
except ImportError:
    EPUB_AVAILABLE = False

def _check_epub_deps():
    if not EPUB_AVAILABLE:
        raise RuntimeError(
            "ebooklib is required for EPUB support.\n"
            'Install with: pip install "skill-seekers[epub]"\n'
            "Or: pip install ebooklib"
        )
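
A generic variant of the same guard can be written without importing the package at all. The sketch below uses `importlib.util.find_spec` to test availability; `require_optional` is a hypothetical helper, not part of the codebase, and it only handles top-level module names.

```python
import importlib.util


def require_optional(module_name: str, extra: str) -> None:
    """Raise RuntimeError with install hints if a top-level module is missing."""
    if importlib.util.find_spec(module_name) is None:
        raise RuntimeError(
            f"{module_name} is required for this feature.\n"
            f'Install with: pip install "skill-seekers[{extra}]"\n'
            f"Or: pip install {module_name}"
        )
```

Calling `require_optional("ebooklib", "epub")` at the start of extraction would fail fast with the same message style as the guard above. The try/except form remains the better fit when the module is needed at import time anyway.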

Summary Table

| Category | Files | Action |
|---|---|---|
| New files | 4 | Create from scratch |
| Source detection | 1 | Add epub detection + validation |
| CLI dispatcher | 1 | Add command module mapping |
| Create command | 1 | Add routing + help + examples |
| Arguments | 1 | Add EPUB_ARGUMENTS + register in helpers |
| Parser registry | 1 | Import + register EpubParser |
| Package config | 1 | Add deps + entry point |
| Documentation | 2+ | Update CLAUDE.md, CHANGELOG.md |
| Total | 12+ | 8+ modified, 4 new |

Open Questions

  • Should epub support reuse any of the existing HTML parsing from word_scraper.py (which uses mammoth to convert to HTML then parses with BeautifulSoup)? EPUB internally contains XHTML files, so BeautifulSoup parsing would be directly applicable.
  • Should the epub scraper support DRM-protected files, or only DRM-free epub files?
  • Should epub-specific arguments include options like --chapter-range (similar to PDF's --pages)?
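
On the first question, the XHTML inside an EPUB can indeed be walked with an ordinary HTML parser. The sketch below uses the standard library's `html.parser` to stay dependency-free, as a stand-in for the BeautifulSoup approach mentioned above. `ChapterParser` and `parse_chapter` are hypothetical names, and the output keys only loosely follow the intermediate JSON fields.

```python
from html.parser import HTMLParser


class ChapterParser(HTMLParser):
    """Collect headings and <pre> code blocks from one XHTML chapter."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.headings = []
        self.code_samples = []
        self._capture = None  # tag whose text is currently being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # Start capturing at a heading or <pre>; nested tags are ignored.
        if self._capture is None and (tag in self.HEADING_TAGS or tag == "pre"):
            self._capture = tag
            self._buffer = []

    def handle_data(self, data):
        if self._capture is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._capture:
            return
        text = "".join(self._buffer)
        if tag == "pre":
            self.code_samples.append(text.strip("\n"))
        else:
            # Collapse whitespace so inline markup does not split the heading.
            self.headings.append({"level": int(tag[1]), "text": " ".join(text.split())})
        self._capture = None


def parse_chapter(xhtml: str) -> dict:
    parser = ChapterParser()
    parser.feed(xhtml)
    parser.close()
    return {"headings": parser.headings, "code_samples": parser.code_samples}
```

Because each spine item is already XHTML, no mammoth-style conversion step is needed before parsing, which is the main structural difference from the Word scraper.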