skill-seekers-reference/docs/agents/plans/2026-03-14-epub-input-support.md
yusyus 2e30970dfb feat: add EPUB input support (#310)
Adds EPUB as a first-class input source for skill generation.

- EpubToSkillConverter (epub_scraper.py, ~1200 lines) following PDF scraper pattern
- Dublin Core metadata, spine items, code blocks, tables, images extraction
- DRM detection (Adobe ADEPT, Apple FairPlay, Readium LCP) with fail-fast
- EPUB 3 NCX TOC bug workaround (ignore_ncx=True)
- ebooklib as optional dep: pip install skill-seekers[epub]
- Wired into create command with .epub auto-detection
- 104 tests, all passing

Review fixes: removed 3 empty test stubs, fixed SVG double-counting in
_extract_images(), added logger.debug to bare except pass.

Based on PR #310 by @christianbaumann.
Co-authored-by: Christian Baumann <mail@chriss-baumann.de>
2026-03-15 02:34:41 +03:00


---
date: 2026-03-14T19:30:35.172407+00:00
git_commit: 7c90a4b9c9bccac8341b0769550d77aae3b4e524
branch: development
topic: Add EPUB Input Support
tags: [plan, epub, scraper, input-format]
status: complete
---

Add EPUB Input Support — Implementation Plan

Overview

Add .epub as an input format for Skill Seekers, enabling skill-seekers create book.epub and skill-seekers epub --epub book.epub. Follows the established Word/PDF scraper pattern: source detection → routing → extraction → categorize → build skill.

Authoritative reference: W3C EPUB 3.3 Specification (also covers EPUB 2 backward compatibility).

Current State Analysis

The codebase has a consistent multi-layer architecture for document input formats. PDF and Word (.docx) serve as direct analogs. The Word scraper (word_scraper.py) is the closest pattern match since both Word and EPUB produce HTML/XHTML that is parsed with BeautifulSoup.

Key Discoveries:

  • Word scraper converts .docx → HTML (via mammoth) → BeautifulSoup parse → intermediate JSON → SKILL.md (word_scraper.py:96-235)
  • EPUB files contain XHTML natively (per W3C spec §5), so the mammoth conversion step is unnecessary — BeautifulSoup can parse EPUB XHTML content directly
  • Source detection uses file extension matching (source_detector.py:57-65)
  • Optional dependencies use a guard pattern with try/except ImportError and a _check_*_deps() function (word_scraper.py:21-40)
  • The ebooklib library (v0.18+) provides epub.read_epub() returning an EpubBook with spine iteration, metadata access via get_metadata('DC', key), and item content via get_content()/get_body_content()
  • ebooklib has a known bug: EPUB 3 files read TOC from NCX instead of NAV (issue #200); workaround: options={"ignore_ncx": True}
  • ebooklib loads entire EPUB into memory — acceptable for typical books but relevant for edge cases
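The guard pattern and the NCX workaround from the discoveries above can be sketched together. This is a minimal illustration, not the codebase's actual module: the function name load_epub is invented here, and only the ebooklib calls named above are used.

```python
# Optional-dependency guard plus the EPUB 3 NCX workaround (ebooklib issue #200).
try:
    from ebooklib import epub
    EPUB_AVAILABLE = True
except ImportError:
    EPUB_AVAILABLE = False


def load_epub(path: str):
    """Load an EPUB, failing fast with install instructions if ebooklib is missing."""
    if not EPUB_AVAILABLE:
        raise RuntimeError(
            "ebooklib is required for EPUB support.\n"
            'Install with: pip install "skill-seekers[epub]"'
        )
    # ignore_ncx=True makes ebooklib read the EPUB 3 NAV TOC instead of the
    # legacy NCX file, working around the known TOC bug.
    return epub.read_epub(path, options={"ignore_ncx": True})
```

With the dependency installed, spine iteration and Dublin Core access then go through book.spine and book.get_metadata('DC', key), as noted above.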

Desired End State

Running skill-seekers create book.epub produces:

output/book/
├── SKILL.md              # Main skill file with metadata, concepts, code examples
├── references/
│   ├── index.md          # Category index with statistics
│   └── book.md           # Chapter content (or multiple files if categorized)
├── scripts/
└── assets/
    └── *.png|*.jpg       # Extracted images

CLI Output Mockup

$ skill-seekers create programming-rust.epub

  Detected source type: epub
  Routing to epub scraper...

🔍 Extracting from EPUB: programming-rust.epub
   Title: Programming Rust, 2nd Edition
   Author: Jim Blandy, Jason Orendorff
   Language: en
   Chapters: 23 (spine items)

📄 Processing chapters...
   Chapter 1/23: Why Rust? (2 sections, 1 code block)
   Chapter 2/23: A Tour of Rust (5 sections, 12 code blocks)
   ...
   Chapter 23/23: Macros (4 sections, 8 code blocks)

📊 Extraction complete:
   Sections: 142
   Code blocks: 287 (Rust: 245, Shell: 28, TOML: 14)
   Images: 34
   Tables: 12

💾 Saved extracted data to: output/programming-rust_extracted.json

📋 Categorizing content...
✅ Created 1 category (single EPUB source)
   - programming-rust: 142 sections

📝 Generating reference files...
   Generated: output/programming-rust/references/programming-rust.md
   Generated: output/programming-rust/references/index.md

✅ Skill built successfully: output/programming-rust/

📦 Next step: Package with: skill-seekers package output/programming-rust/

Verification:

  • skill-seekers create book.epub produces valid output directory
  • skill-seekers epub --epub book.epub --name mybook works standalone
  • skill-seekers create book.epub --dry-run shows config without processing
  • Existing test suite (~2,540 tests) still passes; the verification run reported 982 passed with 1 pre-existing failure
  • New test suite covers happy path, errors, and edge cases (107 tests, 14 classes)

What We're NOT Doing

  • DRM decryption (detect and error gracefully with clear message)
  • EPUB writing/creation (read-only)
  • Media overlay / audio / video extraction (ignore gracefully)
  • Fixed-layout OCR (detect and warn; extract whatever text exists in XHTML)
  • --chapter-range flag (can be added later)
  • Unified scraper (unified_scraper.py) EPUB support (separate future task)
  • MCP tool for EPUB (separate future task)

Implementation Approach

Follow the Word scraper pattern exactly, with EPUB-specific extraction logic:

  1. Phase 1: Core epub_scraper.py — the EpubToSkillConverter class
  2. Phase 2: CLI integration — source detection, arguments, parser, routing, entry points
  3. Phase 3: Comprehensive test suite — 100+ tests across 11 test classes
  4. Phase 4: Documentation updates

Phase 1: Core EPUB Scraper

Overview

Create epub_scraper.py with EpubToSkillConverter class following the Word scraper pattern. This is the bulk of new code.

Changes Required:

[x] 1. Optional dependency in pyproject.toml

File: pyproject.toml Changes: Add epub optional dependency group and include in all group

# After the docx group (~line 115)
# EPUB (.epub) support
epub = [
    "ebooklib>=0.18",
]

Add "ebooklib>=0.18", to the all group (~line 178).
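Taken together, the two additions might look like this; the surrounding entries and exact line positions are illustrative, not copied from the real pyproject.toml:

```toml
[project.optional-dependencies]
# EPUB (.epub) support
epub = [
    "ebooklib>=0.18",
]
all = [
    # ...existing optional deps (docx, pdf, ...)...
    "ebooklib>=0.18",
]
```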

[x] 2. Create src/skill_seekers/cli/epub_scraper.py

File: src/skill_seekers/cli/epub_scraper.py (new) Changes: Full EPUB scraper module

Structure (following word_scraper.py pattern):

"""
EPUB Documentation to Skill Converter

Converts EPUB e-books into skills.
Uses ebooklib for EPUB parsing, BeautifulSoup for XHTML content extraction.

Usage:
    skill-seekers epub --epub book.epub --name myskill
    skill-seekers epub --from-json book_extracted.json
"""

import argparse
import json
import logging
import os
import re
import sys
from pathlib import Path

# Optional dependency guard
try:
    import ebooklib
    from ebooklib import epub
    EPUB_AVAILABLE = True
except ImportError:
    EPUB_AVAILABLE = False

# BeautifulSoup is a core dependency (always available)
from bs4 import BeautifulSoup, Comment

logger = logging.getLogger(__name__)


def _check_epub_deps():
    """Raise RuntimeError if ebooklib is not installed."""
    if not EPUB_AVAILABLE:
        raise RuntimeError(
            "ebooklib is required for EPUB support.\n"
            'Install with: pip install "skill-seekers[epub]"\n'
            "Or: pip install ebooklib"
        )


def infer_description_from_epub(metadata: dict | None = None, name: str = "") -> str:
    """Infer skill description from EPUB metadata."""
    if metadata:
        if metadata.get("description") and len(metadata["description"]) > 20:
            desc = metadata["description"].strip()
            if len(desc) > 150:
                desc = desc[:147] + "..."
            return f"Use when {desc.lower()}"
        if metadata.get("title") and len(metadata["title"]) > 10:
            return f"Use when working with {metadata['title'].lower()}"
    return (
        f"Use when referencing {name} documentation"
        if name
        else "Use when referencing this documentation"
    )

EpubToSkillConverter class methods:

class EpubToSkillConverter:
    def __init__(self, config: dict):
        self.config = config
        self.name = config["name"]
        self.epub_path = config.get("epub_path", "")
        self.description = config.get(
            "description", f"Use when referencing {self.name} documentation"
        )
        self.skill_dir = f"output/{self.name}"
        self.data_file = f"output/{self.name}_extracted.json"
        self.categories = config.get("categories", {})
        self.extracted_data = None

    def extract_epub(self) -> bool:
        """Extract content from EPUB file.

        Workflow:
        1. Check dependencies (ebooklib)
        2. Detect DRM via META-INF/encryption.xml (fail fast)
        3. Read EPUB via ebooklib with ignore_ncx=True (EPUB 3 TOC bug workaround)
        4. Extract Dublin Core metadata (title, creator, language, publisher, date, description, subject)
        5. Iterate spine items in reading order
        6. For each ITEM_DOCUMENT: parse XHTML with BeautifulSoup
        7. Split by h1/h2 heading boundaries into sections
        8. Extract code blocks from <pre>/<code> elements
        9. Extract images from EpubImage items
        10. Detect code languages via LanguageDetector
        11. Save intermediate JSON to {name}_extracted.json

        Returns True on success.
        Raises RuntimeError for DRM-protected files.
        Raises FileNotFoundError for missing files.
        Raises ValueError for invalid EPUB files.
        """

DRM detection (per W3C spec §4.2.6.3.2):

def _detect_drm(self, book) -> bool:
    """Detect DRM by checking for encryption.xml with non-font-obfuscation entries.

    Per W3C EPUB 3.3 spec: encryption.xml is present when resources are encrypted.
    Font obfuscation (IDPF algorithm http://www.idpf.org/2008/embedding or
    Adobe algorithm http://ns.adobe.com/pdf/enc#RC) is NOT DRM — only font mangling.

    Actual DRM uses algorithms like:
    - Adobe ADEPT: http://ns.adobe.com/adept namespace
    - Apple FairPlay: http://itunes.apple.com/dataenc
    - Readium LCP: http://readium.org/2014/01/lcp
    """

Metadata extraction (per W3C spec §5.2, Dublin Core):

def _extract_metadata(self, book) -> dict:
    """Extract Dublin Core metadata from EPUB.

    Per W3C EPUB 3.3 spec: required elements are dc:identifier, dc:title, dc:language.
    Optional: dc:creator, dc:contributor, dc:date, dc:description, dc:publisher,
    dc:subject, dc:rights, dc:type, dc:coverage, dc:source, dc:relation, dc:format.

    ebooklib API: book.get_metadata('DC', key) returns list of (value, attrs) tuples.
    """
    def _get_one(key):
        data = book.get_metadata('DC', key)
        return data[0][0] if data else None

    def _get_list(key):
        data = book.get_metadata('DC', key)
        return [x[0] for x in data] if data else []

    return {
        "title": _get_one('title') or "Untitled",
        "author": ", ".join(_get_list('creator')) or None,
        "language": _get_one('language') or "en",
        "publisher": _get_one('publisher'),
        "date": _get_one('date'),
        "description": _get_one('description'),
        "subject": ", ".join(_get_list('subject')) or None,
        "rights": _get_one('rights'),
        "identifier": _get_one('identifier'),
    }

Content extraction (per W3C spec §5 — XHTML Content Documents use XML serialization of HTML5):

def _extract_spine_content(self, book) -> list[dict]:
    """Extract content from spine items in reading order.

    Per W3C EPUB 3.3 spec §3.4.8: spine defines ordered list of content documents.
    Linear="yes" (default) items form the primary reading order.
    Linear="no" items are auxiliary (footnotes, glossary).

    Per spec §5: XHTML content documents use XML syntax of HTML5.
    Parse with BeautifulSoup, split by h1/h2 heading boundaries.
    """
    sections = []
    section_number = 0

    for item_id, linear in book.spine:
        item = book.get_item_with_id(item_id)
        if not item or item.get_type() != ebooklib.ITEM_DOCUMENT:
            continue

        soup = BeautifulSoup(item.get_content(), 'html.parser')

        # Remove scripts, styles, comments (not useful for text extraction)
        for tag in soup(['script', 'style']):
            tag.decompose()
        for comment in soup.find_all(string=lambda t: isinstance(t, Comment)):
            comment.extract()

        body = soup.find('body')
        if not body:
            continue

        # Split by h1/h2 heading boundaries (same as word_scraper)
        # Each heading starts a new section
        ...
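The elided splitting step can be sketched as a standalone helper. This simplifies deliberately: real sections also collect nested headings, code samples, tables, and image references, and text before the first h1/h2 is dropped here.

```python
try:
    from bs4 import BeautifulSoup
    BS4_AVAILABLE = True
except ImportError:
    BS4_AVAILABLE = False


def split_by_headings(xhtml: str) -> list[dict]:
    """Split one XHTML chapter into sections at h1/h2 boundaries."""
    soup = BeautifulSoup(xhtml, "html.parser")
    body = soup.find("body") or soup
    sections: list[dict] = []
    current = None
    for el in body.find_all(["h1", "h2", "p", "pre"]):
        if el.name in ("h1", "h2"):
            # Each h1/h2 starts a new section (same rule as word_scraper)
            current = {
                "heading": el.get_text(strip=True),
                "heading_level": el.name,
                "text": "",
            }
            sections.append(current)
        elif current is not None:
            current["text"] += el.get_text(" ", strip=True) + "\n"
    return sections
```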

Image extraction (per W3C spec §3.3 — core media types include JPEG, PNG, GIF, WebP, SVG):

def _extract_images(self, book) -> list[dict]:
    """Extract images from EPUB manifest.

    Per W3C EPUB 3.3 spec §3.3: core image media types are
    image/gif, image/jpeg, image/png, image/svg+xml, image/webp.

    ebooklib API: book.get_items_of_type(ebooklib.ITEM_IMAGE)
    returns EpubImage items with get_content() (bytes) and media_type.

    SVG images (ITEM_VECTOR) handled separately.
    """

The remaining methods (categorize_content, build_skill, _generate_reference_file, _generate_index, _generate_skill_md, _format_key_concepts, _format_patterns_from_content, _sanitize_filename) follow the Word scraper pattern exactly — they operate on the same intermediate JSON structure.

main() function (following word_scraper.py:923-1059):

def main():
    from .arguments.epub import add_epub_arguments

    parser = argparse.ArgumentParser(
        description="Convert EPUB e-book to skill",
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    add_epub_arguments(parser)
    args = parser.parse_args()

    # Logging setup
    if getattr(args, "quiet", False):
        logging.getLogger().setLevel(logging.WARNING)
    elif getattr(args, "verbose", False):
        logging.getLogger().setLevel(logging.DEBUG)

    # Dry run
    if getattr(args, "dry_run", False):
        source = args.epub or args.from_json or "(none)"
        print(f"\n{'=' * 60}")
        print("DRY RUN: EPUB Extraction")
        print(f"{'=' * 60}")
        print(f"Source:         {source}")
        print(f"Name:           {getattr(args, 'name', None) or '(auto-detect)'}")
        print(f"Enhance level:  {getattr(args, 'enhance_level', 0)}")
        print(f"\n✅ Dry run complete")
        return

    # Validate inputs
    if not (args.epub or args.from_json):
        parser.error("Must specify --epub or --from-json")

    # From-JSON workflow
    if args.from_json:
        name = Path(args.from_json).stem.replace("_extracted", "")
        config = {
            "name": name,
            "description": args.description or f"Use when referencing {name} documentation",
        }
        converter = EpubToSkillConverter(config)
        converter.load_extracted_data(args.from_json)
        converter.build_skill()
        return

    # Direct EPUB workflow
    name = args.name or Path(args.epub).stem
    config = {
        "name": name,
        "epub_path": args.epub,
        "description": args.description or f"Use when referencing {name} documentation",
    }

    try:
        converter = EpubToSkillConverter(config)
        if not converter.extract_epub():
            print("\n❌ EPUB extraction failed", file=sys.stderr)
            sys.exit(1)
        converter.build_skill()

        # Enhancement workflow integration
        from skill_seekers.cli.workflow_runner import run_workflows
        run_workflows(args)

        # Traditional enhancement
        if getattr(args, "enhance_level", 0) > 0:
            # Same pattern as word_scraper.py and pdf_scraper.py
            ...

    except RuntimeError as e:
        print(f"\n❌ Error: {e}", file=sys.stderr)
        sys.exit(1)

Success Criteria:

Automated Verification:

  • ruff check src/skill_seekers/cli/epub_scraper.py passes
  • ruff format --check src/skill_seekers/cli/epub_scraper.py passes
  • mypy src/skill_seekers/cli/epub_scraper.py passes (continue-on-error)
  • pip install -e ".[epub]" installs successfully

Manual Verification:

  • Verify import ebooklib works after install
  • Review epub_scraper.py structure matches word_scraper.py pattern

Implementation Note: After completing this phase and all automated verification passes, pause here for manual confirmation from the human before proceeding to the next phase.


Phase 2: CLI Integration

Overview

Wire the EPUB scraper into the CLI: source detection, argument definitions, parser registration, create command routing, and entry points.

Changes Required:

[x] 1. Source detection

File: src/skill_seekers/cli/source_detector.py Changes: Add .epub extension detection, _detect_epub() method, validation, and error message update

Add after the .docx check (line 64):

if source.endswith(".epub"):
    return cls._detect_epub(source)

Add _detect_epub() method (following _detect_word() at line 124):

@classmethod
def _detect_epub(cls, source: str) -> SourceInfo:
    """Detect EPUB file source."""
    name = os.path.splitext(os.path.basename(source))[0]
    return SourceInfo(
        type="epub", parsed={"file_path": source}, suggested_name=name, raw_input=source
    )

Add epub validation in validate_source() (after word block at line 278):

elif source_info.type == "epub":
    file_path = source_info.parsed["file_path"]
    if not os.path.exists(file_path):
        raise ValueError(f"EPUB file does not exist: {file_path}")
    if not os.path.isfile(file_path):
        raise ValueError(f"Path is not a file: {file_path}")

Add EPUB example to the ValueError message (line 94):

"  EPUB:  skill-seekers create ebook.epub\n"

[x] 2. Argument definitions

File: src/skill_seekers/cli/arguments/epub.py (new) Changes: EPUB-specific argument definitions

"""EPUB-specific CLI arguments."""

import argparse
from typing import Any

from .common import add_all_standard_arguments

EPUB_ARGUMENTS: dict[str, dict[str, Any]] = {
    "epub": {
        "flags": ("--epub",),
        "kwargs": {
            "type": str,
            "help": "Direct EPUB file path",
            "metavar": "PATH",
        },
    },
    "from_json": {
        "flags": ("--from-json",),
        "kwargs": {
            "type": str,
            "help": "Build skill from extracted JSON",
            "metavar": "FILE",
        },
    },
}


def add_epub_arguments(parser: argparse.ArgumentParser) -> None:
    """Add EPUB-specific arguments to parser."""
    add_all_standard_arguments(parser)

    # Override enhance-level default to 0 for EPUB
    for action in parser._actions:
        if hasattr(action, "dest") and action.dest == "enhance_level":
            action.default = 0
            action.help = (
                "AI enhancement level (auto-detects API vs LOCAL mode): "
                "0=disabled (default for EPUB), 1=SKILL.md only, "
                "2=+architecture/config, 3=full enhancement. "
                "Mode selection: uses API if ANTHROPIC_API_KEY is set, "
                "otherwise LOCAL (Claude Code)"
            )

    for arg_name, arg_def in EPUB_ARGUMENTS.items():
        parser.add_argument(*arg_def["flags"], **arg_def["kwargs"])

[x] 3. Create command argument integration

File: src/skill_seekers/cli/arguments/create.py Changes: Add EPUB_ARGUMENTS dict, register in helper functions, add mode handling

Add after WORD_ARGUMENTS (~line 411):

# EPUB specific (from epub.py)
EPUB_ARGUMENTS: dict[str, dict[str, Any]] = {
    "epub": {
        "flags": ("--epub",),
        "kwargs": {
            "type": str,
            "help": "EPUB file path",
            "metavar": "PATH",
        },
    },
}

Add to get_source_specific_arguments() (line 595):

"epub": EPUB_ARGUMENTS,

Add to add_create_arguments() (after word block at line 678):

if mode in ["epub", "all"]:
    for arg_name, arg_def in EPUB_ARGUMENTS.items():
        parser.add_argument(*arg_def["flags"], **arg_def["kwargs"])

[x] 4. Parser class

File: src/skill_seekers/cli/parsers/epub_parser.py (new) Changes: Subcommand parser for standalone skill-seekers epub command

"""Parser for epub subcommand."""

from .base import SubcommandParser
from skill_seekers.cli.arguments.epub import add_epub_arguments


class EpubParser(SubcommandParser):
    """Parser for EPUB extraction command."""

    @property
    def name(self) -> str:
        return "epub"

    @property
    def help(self) -> str:
        return "Extract from EPUB e-book (.epub)"

    @property
    def description(self) -> str:
        return "Extract content from EPUB e-book (.epub) and generate skill"

    def add_arguments(self, parser):
        add_epub_arguments(parser)

[x] 5. Parser registration

File: src/skill_seekers/cli/parsers/__init__.py Changes: Import and register EpubParser

Add import (after WordParser import, line 15):

from .epub_parser import EpubParser

Add to PARSERS list (after WordParser(), line 46):

EpubParser(),

[x] 6. CLI dispatcher

File: src/skill_seekers/cli/main.py Changes: Add epub to COMMAND_MODULES dict and module docstring

Add to COMMAND_MODULES (after "word" entry, line 52):

"epub": "skill_seekers.cli.epub_scraper",

Add to module docstring (after "word" line, line 15):

#    epub                 Extract from EPUB e-book (.epub)

[x] 7. Create command routing

File: src/skill_seekers/cli/create_command.py Changes: Add _route_epub() method, routing case, help flag, and epilog example

Add to _route_to_scraper() (after word case, line 136):

elif self.source_info.type == "epub":
    return self._route_epub()

Add _route_epub() method (after _route_word(), line 352):

def _route_epub(self) -> int:
    """Route to EPUB scraper (epub_scraper.py)."""
    from skill_seekers.cli import epub_scraper

    argv = ["epub_scraper"]
    file_path = self.source_info.parsed["file_path"]
    argv.extend(["--epub", file_path])
    self._add_common_args(argv)

    logger.debug(f"Calling epub_scraper with argv: {argv}")
    original_argv = sys.argv
    try:
        sys.argv = argv
        return epub_scraper.main()
    finally:
        sys.argv = original_argv

Add to epilog (line 543, after DOCX example):

  EPUB:     skill-seekers create ebook.epub

Add to Source Auto-Detection section:

   file.epub  EPUB extraction

Add --help-epub flag and handler (after --help-word at line 592):

parser.add_argument(
    "--help-epub", action="store_true", help=argparse.SUPPRESS, dest="_help_epub"
)

Add handler block (after _help_word block at line 654):

elif args._help_epub:
    parser_epub = argparse.ArgumentParser(
        prog="skill-seekers create",
        description="Create skill from EPUB e-book (.epub)",
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    add_create_arguments(parser_epub, mode="epub")
    parser_epub.print_help()
    return 0

[x] 8. Entry point

File: pyproject.toml Changes: Add standalone entry point

Add after skill-seekers-word (line 224):

skill-seekers-epub = "skill_seekers.cli.epub_scraper:main"

[x] 9. Positional argument handling in main.py

File: src/skill_seekers/cli/main.py Changes: None required — "input_file" is already in the positional list at line 153. Verify that _reconstruct_argv handles epub correctly through the standard delegation path.

Success Criteria:

Automated Verification:

  • ruff check src/skill_seekers/cli/source_detector.py src/skill_seekers/cli/arguments/epub.py src/skill_seekers/cli/parsers/epub_parser.py src/skill_seekers/cli/create_command.py passes
  • ruff format --check src/skill_seekers/cli/ passes
  • pip install -e ".[epub]" installs with all entry points
  • skill-seekers epub --help shows EPUB-specific help
  • skill-seekers create --help-epub shows EPUB arguments (via standalone entry point skill-seekers-create)
  • skill-seekers create nonexistent.epub gives clear error about missing file
  • Existing tests still pass: pytest tests/ -v -x -m "not slow and not integration" (875 passed, 1 pre-existing unrelated failure in test_git_sources_e2e)

Manual Verification:

  • skill-seekers --help lists epub command
  • skill-seekers create book.epub --dry-run shows dry run output

Implementation Note: After completing this phase and all automated verification passes, pause here for manual confirmation from the human before proceeding to the next phase.


Phase 3: Comprehensive Test Suite

Overview

Create tests/test_epub_scraper.py with 100+ tests across 11 test classes, covering happy path, negative cases, edge cases, and CLI integration.

Changes Required:

[x] 1. Create test file

File: tests/test_epub_scraper.py (new) Changes: Comprehensive test suite following test_word_scraper.py patterns

"""
Tests for EPUB scraper (epub_scraper.py).

Covers: initialization, extraction, categorization, skill building,
code blocks, tables, images, error handling, JSON workflow, CLI arguments,
helper functions, source detection, DRM detection, and edge cases.

Tests use mock data and do not require actual EPUB files or ebooklib installed.
"""

import json
import os
import shutil
import tempfile
import unittest
from pathlib import Path
from unittest.mock import MagicMock, patch, PropertyMock


# Conditional import (same pattern as test_word_scraper.py)
try:
    from skill_seekers.cli.epub_scraper import (
        EpubToSkillConverter,
        infer_description_from_epub,
        _score_code_quality,
        _check_epub_deps,
        EPUB_AVAILABLE,
    )
    IMPORT_OK = True
except ImportError:
    IMPORT_OK = False

Helper factory function:

def _make_sample_extracted_data(
    num_sections=2,
    include_code=False,
    include_tables=False,
    include_images=False,
) -> dict:
    """Create minimal extracted_data dict for testing."""
    sections = []
    total_code = 0
    total_images = 0
    languages = {}

    for i in range(1, num_sections + 1):
        section = {
            "section_number": i,
            "heading": f"Chapter {i}",
            "heading_level": "h1",
            "text": f"Content of chapter {i}. This is sample text.",
            "headings": [{"level": "h2", "text": f"Section {i}.1"}],
            "code_samples": [],
            "tables": [],
            "images": [],
        }

        if include_code:
            section["code_samples"] = [
                {"code": f"def func_{i}():\n    return {i}", "language": "python", "quality_score": 7.5},
                {"code": f"console.log({i})", "language": "javascript", "quality_score": 4.0},
            ]
            total_code += 2
            languages["python"] = languages.get("python", 0) + 1
            languages["javascript"] = languages.get("javascript", 0) + 1

        if include_tables:
            section["tables"] = [
                {"headers": ["Name", "Value"], "rows": [["key", "val"]]}
            ]

        if include_images:
            section["images"] = [
                {"index": 1, "data": b"\x89PNG\r\n\x1a\n", "width": 100, "height": 100}
            ]
            total_images += 1

        sections.append(section)

    return {
        "source_file": "test.epub",
        "metadata": {
            "title": "Test Book",
            "author": "Test Author",
            "language": "en",
            "publisher": "Test Publisher",
            "date": "2024-01-01",
            "description": "A test book for unit testing",
            "subject": "Testing, Unit Tests",
            "rights": "Copyright 2024",
            "identifier": "urn:uuid:12345",
        },
        "total_sections": num_sections,
        "total_code_blocks": total_code,
        "total_images": total_images,
        "languages_detected": languages,
        "pages": sections,
    }

Test Classes and Methods:

[x] Class 1: TestEpubToSkillConverterInit (8 tests)

Happy path:

  • test_init_with_name_and_epub_path — basic config with name + epub_path
  • test_init_with_full_config — config with all fields (name, epub_path, description, categories)
  • test_default_description_uses_name — description defaults to "Use when referencing {name} documentation"
  • test_skill_dir_uses_name — skill_dir is output/{name}
  • test_data_file_uses_name — data_file is output/{name}_extracted.json

Negative:

  • test_init_requires_name — missing "name" key raises KeyError
  • test_init_empty_name — empty string name still works (no crash)

Edge case:

  • test_init_with_special_characters_in_name — name with spaces/dashes sanitized for paths

[x] Class 2: TestEpubExtraction (12 tests)

Happy path:

  • test_extract_basic_epub — mock ebooklib, verify sections extracted in spine order
  • test_extract_metadata — verify Dublin Core metadata extraction (title, creator, language, etc.)
  • test_extract_multiple_chapters — multiple spine items produce multiple sections
  • test_extract_code_blocks — <pre>/<code> elements extracted with language detection
  • test_extract_images — ITEM_IMAGE items extracted with correct content
  • test_heading_boundary_splitting — h1/h2 boundaries create new sections

Negative:

  • test_extract_missing_file_raises_error — FileNotFoundError for nonexistent path
  • test_extract_invalid_epub_raises_error — ValueError for corrupted/non-EPUB file
  • test_extract_deps_not_installed — RuntimeError with install instructions when ebooklib missing

Edge cases:

  • test_extract_empty_spine — EPUB with no spine items produces empty sections list
  • test_extract_spine_item_no_body — XHTML without <body> tag skipped gracefully
  • test_extract_non_linear_spine_items — linear="no" items still extracted (included but flagged)

[x] Class 3: TestEpubDrmDetection (6 tests)

Happy path:

  • test_no_drm_detected — normal EPUB without encryption.xml returns False

Negative:

  • test_drm_detected_adobe_adept — encryption.xml with Adobe namespace raises RuntimeError
  • test_drm_detected_apple_fairplay — encryption.xml with Apple namespace raises RuntimeError
  • test_drm_detected_readium_lcp — encryption.xml with Readium namespace raises RuntimeError

Edge cases:

  • test_font_obfuscation_not_drm — encryption.xml with only IDPF font obfuscation algorithm (http://www.idpf.org/2008/embedding) is NOT DRM, extraction proceeds
  • test_drm_error_message_is_clear — error message mentions DRM and suggests removing protection

[x] Class 4: TestEpubCategorization (8 tests)

Happy path:

  • test_single_source_creates_one_category — single EPUB creates category named after file
  • test_keyword_categorization — sections matched to categories by keyword scoring
  • test_no_categories_uses_default — no category config creates single "content" category

Negative:

  • test_categorize_empty_sections — empty sections list produces empty categories
  • test_categorize_no_keyword_matches — unmatched sections go to "other" category

Edge cases:

  • test_categorize_single_section — one section creates one category
  • test_categorize_many_sections — 50+ sections categorized correctly
  • test_categorize_preserves_section_order — sections maintain original order within categories

[x] Class 5: TestEpubSkillBuilding (10 tests)

Happy path:

  • test_build_creates_directory_structure — output/{name}/, references/, scripts/, assets/ created
  • test_build_generates_skill_md — SKILL.md created with YAML frontmatter
  • test_build_generates_reference_files — reference markdown files created per category
  • test_build_generates_index — references/index.md created with category links
  • test_skill_md_contains_metadata — SKILL.md includes title, author, language from metadata
  • test_skill_md_yaml_frontmatter — frontmatter has name and description fields

Negative:

  • test_build_without_extracted_data_fails — calling build_skill() before extraction raises error

Edge cases:

  • test_build_overwrites_existing_output — re-running build overwrites existing files
  • test_build_with_long_name — name > 64 chars truncated in YAML frontmatter
  • test_build_with_unicode_content — Unicode text (CJK, Arabic, emoji) preserved correctly

[x] Class 6: TestEpubCodeBlocks (8 tests)

Happy path:

  • test_code_blocks_included_in_reference_files — code samples appear in reference markdown
  • test_code_blocks_in_skill_md_top_15 — SKILL.md shows top 15 code examples by quality
  • test_code_language_grouped — code examples grouped by language in SKILL.md

Edge cases:

  • test_empty_code_block — <pre><code></code></pre> with no content skipped
  • test_code_block_with_html_entities — &lt;, &gt;, &amp; decoded to <, >, &
  • test_code_block_with_syntax_highlighting_spans — <span class="keyword"> stripped, plain text preserved
  • test_code_block_language_from_class — class="language-python", class="code-rust" detected
  • test_code_quality_scoring — scoring heuristic produces expected ranges (0-10)
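
Language detection from class attributes and entity decoding, as tested above, might look like this. A minimal sketch using only the stdlib; the function names and the exact class-prefix list are assumptions:

```python
import html
import re

def detect_language(classes):
    """Infer a code language from class values such as
    'language-python' or 'code-rust'; fall back to 'text'."""
    for cls in classes:
        match = re.fullmatch(r"(?:language|lang|code)-([\w+-]+)", cls)
        if match:
            return match.group(1)
    return "text"

def clean_code_text(raw):
    """Decode HTML entities (&lt;, &gt;, &amp;, ...) left over after
    tags like <span class="keyword"> have been stripped."""
    return html.unescape(raw)
```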

[x] Class 7: TestEpubTables (5 tests)

Happy path:

  • test_tables_in_reference_files — tables rendered as markdown in reference files
  • test_table_with_headers — headers from <thead> used correctly

Edge cases:

  • test_table_no_thead — first row used as headers when no <thead>
  • test_empty_table — empty <table> element handled gracefully
  • test_table_with_colspan_rowspan — complex tables don't crash (data may be imperfect)
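
The header-fallback and empty-table behaviors above can be sketched once rows have already been pulled out of the HTML; `table_to_markdown` is illustrative, not the scraper's real helper:

```python
def table_to_markdown(header, rows):
    """Render header + rows as a markdown table.

    When header is None (no <thead>), the first row is promoted to the
    header; an empty table yields an empty string.
    """
    if header is None:
        if not rows:
            return ""
        header, rows = rows[0], rows[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)
```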

[x] Class 8: TestEpubImages (7 tests)

Happy path:

  • test_images_saved_to_assets — image bytes written to assets/ directory
  • test_image_references_in_markdown — markdown ![Image](../assets/...) references correct

Negative:

  • test_image_with_zero_bytes — empty image content skipped

Edge cases:

  • test_svg_images_handled — SVG items (ITEM_VECTOR) extracted or skipped gracefully
  • test_image_filename_conflicts — duplicate filenames disambiguated
  • test_cover_image_identified — cover image (ITEM_COVER) extracted
  • test_many_images — 100+ images extracted without error
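
Filename disambiguation, as `test_image_filename_conflicts` expects, might be handled like this; the helper name is an assumption:

```python
from pathlib import Path

def unique_asset_name(filename, used):
    """Disambiguate duplicate image filenames:
    cover.jpg, cover_1.jpg, cover_2.jpg, ..."""
    stem, suffix = Path(filename).stem, Path(filename).suffix
    candidate, counter = filename, 1
    while candidate in used:
        candidate = f"{stem}_{counter}{suffix}"
        counter += 1
    used.add(candidate)
    return candidate
```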

[x] Class 9: TestEpubErrorHandling (10 tests)

Negative / error cases:

  • test_missing_epub_file_raises_error — FileNotFoundError for nonexistent path
  • test_not_a_file_raises_error — ValueError when path is a directory
  • test_not_epub_extension_raises_error — ValueError for .txt, .pdf, .doc files
  • test_corrupted_zip_raises_error — ValueError or RuntimeError for corrupted ZIP
  • test_missing_container_xml — ValueError for ZIP without META-INF/container.xml
  • test_missing_opf_file — ValueError when container.xml points to nonexistent OPF
  • test_drm_protected_raises_error — RuntimeError with clear DRM message
  • test_empty_epub_raises_error — ValueError for EPUB with no content documents
  • test_ebooklib_not_installed_error — RuntimeError with install instructions
  • test_malformed_xhtml_handled_gracefully — unclosed tags, invalid entities don't crash (BeautifulSoup tolerant parsing)
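
The corrupt-ZIP and missing-container checks above can be done with the stdlib before any parsing starts. A sketch under the assumption that a nonexistent path should surface as FileNotFoundError (per `test_missing_epub_file_raises_error`) rather than be swallowed:

```python
import zipfile

def validate_epub_container(path):
    """Fail fast: reject corrupt ZIPs and archives without
    META-INF/container.xml before handing the file to the parser."""
    try:
        archive = zipfile.ZipFile(path)  # FileNotFoundError propagates
    except zipfile.BadZipFile as exc:
        raise ValueError(f"Not a valid EPUB (corrupt ZIP): {path}") from exc
    with archive:
        if "META-INF/container.xml" not in archive.namelist():
            raise ValueError(f"No META-INF/container.xml in {path}")
```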

[x] Class 10: TestEpubJSONWorkflow (6 tests)

Happy path:

  • test_load_extracted_json — load previously extracted JSON
  • test_build_from_json — full workflow: load JSON → categorize → build
  • test_json_round_trip — extract → save JSON → load JSON → build produces same output

Negative:

  • test_load_invalid_json — malformed JSON raises appropriate error
  • test_load_nonexistent_json — FileNotFoundError for missing file

Edge case:

  • test_json_with_missing_fields — partial JSON (missing optional fields) still works
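
Loading partial intermediate JSON, as `test_json_with_missing_fields` describes, could default the optional fields while keeping required ones strict. The field names here are assumptions about the intermediate format:

```python
import json

def load_extracted_json(path):
    """Load intermediate extraction JSON.

    Optional fields (metadata, code_blocks, images) get defaults;
    a missing 'sections' list is a hard error. FileNotFoundError and
    JSONDecodeError propagate for missing/invalid files.
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if "sections" not in data:
        raise ValueError("extracted JSON is missing required 'sections'")
    data.setdefault("metadata", {})
    data.setdefault("code_blocks", [])
    data.setdefault("images", [])
    return data
```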

[x] Class 11: TestEpubCLIArguments (8 tests)

Happy path:

  • test_epub_flag_accepted — --epub path.epub parsed correctly
  • test_from_json_flag_accepted — --from-json data.json parsed correctly
  • test_name_flag_accepted — --name mybook parsed correctly
  • test_enhance_level_default_zero — enhance-level defaults to 0 for EPUB
  • test_dry_run_flag — --dry-run flag parsed correctly

Negative:

  • test_no_args_shows_error — no --epub or --from-json shows error

Integration:

  • test_verbose_flag — --verbose accepted
  • test_quiet_flag — --quiet accepted
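
An argparse layout matching the flags above might look like the following sketch; the real CLI wires these through the project's argument tables, so treat this as illustrative only:

```python
import argparse

def build_epub_parser():
    """Argument parser mirroring the flags the tests exercise:
    exactly one of --epub / --from-json is required."""
    parser = argparse.ArgumentParser(prog="skill-seekers epub")
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--epub", help="path to the .epub file")
    source.add_argument("--from-json", help="previously extracted JSON")
    parser.add_argument("--name", help="skill name (defaults to file stem)")
    parser.add_argument("--enhance-level", type=int, default=0)
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("--verbose", action="store_true")
    parser.add_argument("--quiet", action="store_true")
    return parser
```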

[x] Class 12: TestEpubHelperFunctions (6 tests)

  • test_infer_description_from_metadata_description — uses description field
  • test_infer_description_from_metadata_title — falls back to title
  • test_infer_description_fallback — falls back to name-based template
  • test_infer_description_empty_metadata — empty dict returns fallback
  • test_score_code_quality_ranges — scoring returns 0-10
  • test_sanitize_filename — special characters cleaned
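
The description-inference fallback chain tested above could be sketched like this; the exact wording of the fallback templates is an assumption:

```python
def infer_description(metadata, name):
    """Fallback chain: metadata description -> metadata title -> name."""
    description = (metadata.get("description") or "").strip()
    if description:
        return description
    title = (metadata.get("title") or "").strip()
    if title:
        return f"Reference skill generated from '{title}'"
    return f"Reference skill generated from {name}"
```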

[x] Class 13: TestEpubSourceDetection (6 tests)

  • test_epub_detected_as_epub_type — .epub extension detected correctly
  • test_epub_suggested_name — filename stem used as suggested name
  • test_epub_validation_missing_file — validation raises ValueError for missing file
  • test_epub_validation_not_a_file — validation raises ValueError for directory
  • test_epub_with_path — ./books/test.epub detected with correct file_path
  • test_pdf_still_detected — regression test: .pdf still detected as pdf type
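
Extension-based detection of the kind these tests cover can be sketched as below. The dict of extensions and the returned keys are assumptions about the detection contract, not the project's actual code:

```python
from pathlib import Path

# Hypothetical extension map; the real detector covers more source types
DOCUMENT_TYPES = {".epub": "epub", ".pdf": "pdf", ".docx": "word"}

def detect_source(target):
    """Map a file path to a source type by extension; the filename stem
    doubles as the suggested skill name."""
    path = Path(target)
    source_type = DOCUMENT_TYPES.get(path.suffix.lower())
    if source_type is None:
        return None
    return {
        "type": source_type,
        "suggested_name": path.stem,
        "file_path": str(path),
    }
```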

[x] Class 14: TestEpubEdgeCases (8 tests)

Per W3C EPUB 3.3 spec edge cases:

  • test_epub2_vs_epub3 — both versions parse successfully (ebooklib handles both)
  • test_epub_no_toc — EPUB without table of contents extracts using spine order
  • test_epub_empty_chapters — chapters with no text content skipped gracefully
  • test_epub_single_chapter — book with one spine item produces valid output
  • test_epub_unicode_content — CJK, Arabic, Cyrillic, emoji text preserved
  • test_epub_large_section_count — 100+ sections processed without error
  • test_epub_nested_headings — h3/h4/h5/h6 become sub-headings within sections
  • test_fixed_layout_detected — fixed-layout EPUB produces warning but still extracts text
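
The nested-heading rule (h1/h2 start sections, h3+ become sub-headings) can be sketched with a stdlib parser. The real scraper uses BeautifulSoup; this `HTMLParser` version is a simplified stand-in that ignores everything but headings and text:

```python
from html.parser import HTMLParser

class SectionSplitter(HTMLParser):
    """Split a chapter's XHTML into sections at h1/h2; deeper headings
    (h3-h6) stay inside the current section as '###' sub-headings."""

    def __init__(self):
        super().__init__()
        self.sections = []
        self._current = None
        self._in_heading = None

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            self._in_heading = tag

    def handle_endtag(self, tag):
        if tag == self._in_heading:
            self._in_heading = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_heading in ("h1", "h2"):
            self._current = {"title": text, "body": []}
            self.sections.append(self._current)
        elif self._current is not None:
            prefix = "### " if self._in_heading else ""
            self._current["body"].append(prefix + text)
```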

Total: ~108 test methods across 14 classes

Success Criteria:

Automated Verification:

  • pytest tests/test_epub_scraper.py -v — all 107 tests pass
  • pytest tests/ -v -x -m "not slow and not integration" — 982 passed (1 pre-existing unrelated failure in test_git_sources_e2e)
  • ruff check tests/test_epub_scraper.py passes
  • ruff format --check tests/test_epub_scraper.py passes
  • Test count >= 100 methods (107 tests across 14 classes)

Manual Verification:

  • Review test coverage includes: happy path, negative, edge cases, CLI, source detection, DRM, JSON workflow
  • Verify no tests require actual EPUB files or ebooklib installed (all use mocks/skipTest guards)

Implementation Note: After completing this phase and all automated verification passes, pause here for manual confirmation from the human before proceeding to the next phase.


Phase 4: Documentation

Overview

Update CLAUDE.md and CHANGELOG.md to reflect the new EPUB support.

Changes Required:

[x] 1. Update CLAUDE.md

File: CLAUDE.md
Changes:

Add to Commands section (after pdf line):

skill-seekers epub --epub book.epub --name myskill

Add to "Unified create" examples:

skill-seekers create book.epub

Add to Key source files table:

| Core scraping | `cli/epub_scraper.py` |

Add to "Adding things → New create command flags" section:

- Source-specific → `EPUB_ARGUMENTS`

[x] 2. Update CHANGELOG.md

File: CHANGELOG.md
Changes: Add an entry for EPUB support under the next version:

### Added
- EPUB (.epub) input support via `skill-seekers create book.epub` or `skill-seekers epub --epub book.epub`
- Extracts chapters, metadata, code blocks, images, and tables from EPUB 2 and EPUB 3 files
- DRM detection with clear error messages
- Optional dependency: `pip install "skill-seekers[epub]"`

Success Criteria:

Automated Verification:

  • ruff check passes on any modified files
  • pytest tests/ -v -x -m "not slow and not integration" — all tests still pass (982 passed, 1 pre-existing failure)

Manual Verification:

  • CLAUDE.md accurately reflects new commands
  • CHANGELOG.md entry is clear and complete

Implementation Note: After completing this phase and all automated verification passes, pause here for manual confirmation from the human before proceeding.


Testing Strategy

Unit Tests (Phase 3 — ~108 tests):

By category:

| Category         | Count | What's tested                                                 |
|------------------|-------|---------------------------------------------------------------|
| Initialization   | 8     | Config parsing, defaults, edge cases                          |
| Extraction       | 12    | Spine iteration, metadata, headings, code, images             |
| DRM detection    | 6     | Adobe, Apple, Readium, font obfuscation (not DRM)             |
| Categorization   | 8     | Single/multi category, keywords, empty, ordering              |
| Skill building   | 10    | Directory structure, SKILL.md, references, index              |
| Code blocks      | 8     | Extraction, quality, language detection, HTML entities        |
| Tables           | 5     | Headers, no-thead fallback, empty, colspan                    |
| Images           | 7     | Save, references, SVG, conflicts, cover, many                 |
| Error handling   | 10    | Missing file, corrupt, DRM, no deps, malformed XHTML          |
| JSON workflow    | 6     | Load, build, round-trip, invalid, missing fields              |
| CLI arguments    | 8     | Flags, defaults, dry-run, verbose/quiet                       |
| Helper functions | 6     | Description inference, quality scoring, filename sanitization |
| Source detection | 6     | Detection, validation, regression                             |
| Edge cases       | 8     | EPUB 2/3, no TOC, empty chapters, Unicode, fixed-layout       |

Integration Tests:

  • Full extract → categorize → build workflow with mock ebooklib
  • JSON round-trip (extract → save → load → build)

Manual Testing Steps:

  1. pip install -e ".[epub]" — verify install
  2. skill-seekers create book.epub with a real EPUB file — verify output directory structure
  3. skill-seekers epub --epub book.epub --dry-run — verify dry run output
  4. skill-seekers create drm-book.epub — verify DRM error message
  5. skill-seekers create nonexistent.epub — verify file-not-found error
  6. Open generated SKILL.md — verify content quality and structure

Performance Considerations

  • ebooklib loads entire EPUB into memory. For typical books (<50MB), this is fine
  • For very large EPUBs (100MB+), memory usage may spike. No mitigation needed for v1 — document as known limitation
  • BeautifulSoup parsing of XHTML is fast. No performance concerns expected

Migration Notes

  • No migration needed — this is a new feature with no existing data to migrate
  • Optional dependency (ebooklib) means existing installs are unaffected
  • No breaking changes to any existing commands or APIs

References