skill-seekers-reference/docs/agents/plans/2026-03-14-epub-input-support.md
yusyus 2e30970dfb feat: add EPUB input support (#310)
Adds EPUB as a first-class input source for skill generation.

- EpubToSkillConverter (epub_scraper.py, ~1200 lines) following PDF scraper pattern
- Dublin Core metadata, spine items, code blocks, tables, images extraction
- DRM detection (Adobe ADEPT, Apple FairPlay, Readium LCP) with fail-fast
- EPUB 3 NCX TOC bug workaround (ignore_ncx=True)
- ebooklib as optional dep: pip install skill-seekers[epub]
- Wired into create command with .epub auto-detection
- 104 tests, all passing

Review fixes: removed 3 empty test stubs, fixed SVG double-counting in
_extract_images(), added logger.debug to bare except pass.

Based on PR #310 by @christianbaumann.
Co-authored-by: Christian Baumann <mail@chriss-baumann.de>
2026-03-15 02:34:41 +03:00

---
date: 2026-03-14T19:30:35.172407+00:00
git_commit: 7c90a4b9c9bccac8341b0769550d77aae3b4e524
branch: development
topic: "Add EPUB Input Support"
tags: [plan, epub, scraper, input-format]
status: complete
---
# Add EPUB Input Support — Implementation Plan
## Overview
Add `.epub` as an input format for Skill Seekers, enabling `skill-seekers create book.epub` and `skill-seekers epub --epub book.epub`. Follows the established Word/PDF scraper pattern: source detection → routing → extraction → categorize → build skill.
**Authoritative reference**: [W3C EPUB 3.3 Specification](https://www.w3.org/TR/epub-33/) (also covers EPUB 2 backward compatibility).
## Current State Analysis
The codebase has a consistent multi-layer architecture for document input formats. PDF and Word (.docx) serve as direct analogs. The Word scraper (`word_scraper.py`) is the closest pattern match since both Word and EPUB produce HTML/XHTML that is parsed with BeautifulSoup.
### Key Discoveries:
- Word scraper converts `.docx` → HTML (via mammoth) → BeautifulSoup parse → intermediate JSON → SKILL.md (`word_scraper.py:96-235`)
- EPUB files contain XHTML natively (per W3C spec §5), so the mammoth conversion step is unnecessary — BeautifulSoup can parse EPUB XHTML content directly
- Source detection uses file extension matching (`source_detector.py:57-65`)
- Optional dependencies use a guard pattern with `try/except ImportError` and a `_check_*_deps()` function (`word_scraper.py:21-40`)
- The `ebooklib` library (v0.18+) provides `epub.read_epub()` returning an `EpubBook` with spine iteration, metadata access via `get_metadata('DC', key)`, and item content via `get_content()`/`get_body_content()`
- ebooklib has a known bug: EPUB 3 files read TOC from NCX instead of NAV (issue #200); workaround: `options={"ignore_ncx": True}`
- ebooklib loads entire EPUB into memory — acceptable for typical books but relevant for edge cases
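The read path implied by the last three bullets can be sketched as follows. This is a sketch, not the shipped module: `read_book` is an illustrative name, and it assumes ebooklib >= 0.18, where `read_epub` accepts an `options` dict:

```python
# Optional-dependency guard plus the ignore_ncx workaround, combined.
try:
    from ebooklib import epub

    EPUB_AVAILABLE = True
except ImportError:
    EPUB_AVAILABLE = False


def read_book(path: str):
    """Open an EPUB for extraction, or fail fast with install instructions."""
    if not EPUB_AVAILABLE:
        raise RuntimeError('ebooklib is required: pip install "skill-seekers[epub]"')
    # ignore_ncx=True works around ebooklib issue #200, where EPUB 3 files
    # otherwise get their TOC from the legacy NCX instead of the NAV document.
    return epub.read_epub(path, options={"ignore_ncx": True})
```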
## Desired End State
Running `skill-seekers create book.epub` produces:
```
output/book/
├── SKILL.md              # Main skill file with metadata, concepts, code examples
├── references/
│   ├── index.md          # Category index with statistics
│   └── book.md           # Chapter content (or multiple files if categorized)
├── scripts/
└── assets/
    └── *.png|*.jpg       # Extracted images
```
### CLI Output Mockup
```
$ skill-seekers create programming-rust.epub
Detected source type: epub
Routing to epub scraper...
🔍 Extracting from EPUB: programming-rust.epub
   Title: Programming Rust, 2nd Edition
   Author: Jim Blandy, Jason Orendorff
   Language: en
   Chapters: 23 (spine items)
📄 Processing chapters...
   Chapter 1/23: Why Rust? (2 sections, 1 code block)
   Chapter 2/23: A Tour of Rust (5 sections, 12 code blocks)
   ...
   Chapter 23/23: Macros (4 sections, 8 code blocks)
📊 Extraction complete:
   Sections: 142
   Code blocks: 287 (Rust: 245, Shell: 28, TOML: 14)
   Images: 34
   Tables: 12
💾 Saved extracted data to: output/programming-rust_extracted.json
📋 Categorizing content...
✅ Created 1 category (single EPUB source)
   - programming-rust: 142 sections
📝 Generating reference files...
   Generated: output/programming-rust/references/programming-rust.md
   Generated: output/programming-rust/references/index.md
✅ Skill built successfully: output/programming-rust/
📦 Next step: Package with: skill-seekers package output/programming-rust/
```
### Verification:
- [x] `skill-seekers create book.epub` produces valid output directory
- [x] `skill-seekers epub --epub book.epub --name mybook` works standalone
- [x] `skill-seekers create book.epub --dry-run` shows config without processing
- [x] All ~2,540+ existing tests still pass (982 passed, 1 pre-existing failure)
- [x] New test suite has 100+ tests covering happy path, errors, and edge cases (107 tests, 14 classes)
## What We're NOT Doing
- DRM decryption (detect and error gracefully with clear message)
- EPUB writing/creation (read-only)
- Media overlay / audio / video extraction (ignore gracefully)
- Fixed-layout OCR (detect and warn; extract whatever text exists in XHTML)
- `--chapter-range` flag (can be added later)
- Unified scraper (`unified_scraper.py`) EPUB support (separate future task)
- MCP tool for EPUB (separate future task)
## Implementation Approach
Follow the Word scraper pattern exactly, with EPUB-specific extraction logic:
1. **Phase 1**: Core `epub_scraper.py` — the `EpubToSkillConverter` class
2. **Phase 2**: CLI integration — source detection, arguments, parser, routing, entry points
3. **Phase 3**: Comprehensive test suite — 100+ tests across 11 test classes
4. **Phase 4**: Documentation updates
---
## Phase 1: Core EPUB Scraper
### Overview
Create `epub_scraper.py` with `EpubToSkillConverter` class following the Word scraper pattern. This is the bulk of new code.
### Changes Required:
#### [x] 1. Optional dependency in pyproject.toml
**File**: `pyproject.toml`
**Changes**: Add `epub` optional dependency group and include in `all` group
```toml
# After the docx group (~line 115)
# EPUB (.epub) support
epub = [
    "ebooklib>=0.18",
]
```
Add `"ebooklib>=0.18",` to the `all` group (~line 178).
#### [x] 2. Create `src/skill_seekers/cli/epub_scraper.py`
**File**: `src/skill_seekers/cli/epub_scraper.py` (new)
**Changes**: Full EPUB scraper module
**Structure** (following `word_scraper.py` pattern):
```python
"""
EPUB Documentation to Skill Converter
Converts EPUB e-books into skills.
Uses ebooklib for EPUB parsing, BeautifulSoup for XHTML content extraction.
Usage:
skill-seekers epub --epub book.epub --name myskill
skill-seekers epub --from-json book_extracted.json
"""
import argparse
import json
import logging
import os
import re
import sys
from pathlib import Path
# Optional dependency guard
try:
import ebooklib
from ebooklib import epub
EPUB_AVAILABLE = True
except ImportError:
EPUB_AVAILABLE = False
# BeautifulSoup is a core dependency (always available)
from bs4 import BeautifulSoup, Comment
logger = logging.getLogger(__name__)
def _check_epub_deps():
"""Raise RuntimeError if ebooklib is not installed."""
if not EPUB_AVAILABLE:
raise RuntimeError(
"ebooklib is required for EPUB support.\n"
'Install with: pip install "skill-seekers[epub]"\n'
"Or: pip install ebooklib"
)
def infer_description_from_epub(metadata: dict | None = None, name: str = "") -> str:
"""Infer skill description from EPUB metadata."""
if metadata:
if metadata.get("description") and len(metadata["description"]) > 20:
desc = metadata["description"].strip()
if len(desc) > 150:
desc = desc[:147] + "..."
return f"Use when {desc.lower()}"
if metadata.get("title") and len(metadata["title"]) > 10:
return f"Use when working with {metadata['title'].lower()}"
return (
f"Use when referencing {name} documentation"
if name
else "Use when referencing this documentation"
)
```
**`EpubToSkillConverter` class methods:**
```python
class EpubToSkillConverter:
    def __init__(self, config: dict):
        self.config = config
        self.name = config["name"]
        self.epub_path = config.get("epub_path", "")
        self.description = config.get(
            "description", f"Use when referencing {self.name} documentation"
        )
        self.skill_dir = f"output/{self.name}"
        self.data_file = f"output/{self.name}_extracted.json"
        self.categories = config.get("categories", {})
        self.extracted_data = None

    def extract_epub(self) -> bool:
        """Extract content from EPUB file.

        Workflow:
        1. Check dependencies (ebooklib)
        2. Detect DRM via META-INF/encryption.xml (fail fast)
        3. Read EPUB via ebooklib with ignore_ncx=True (EPUB 3 TOC bug workaround)
        4. Extract Dublin Core metadata (title, creator, language, publisher, date, description, subject)
        5. Iterate spine items in reading order
        6. For each ITEM_DOCUMENT: parse XHTML with BeautifulSoup
        7. Split by h1/h2 heading boundaries into sections
        8. Extract code blocks from <pre>/<code> elements
        9. Extract images from EpubImage items
        10. Detect code languages via LanguageDetector
        11. Save intermediate JSON to {name}_extracted.json

        Returns True on success.
        Raises RuntimeError for DRM-protected files.
        Raises FileNotFoundError for missing files.
        Raises ValueError for invalid EPUB files.
        """
```
**DRM detection** (per W3C spec §4.2.6.3.2):
```python
def _detect_drm(self, book) -> bool:
    """Detect DRM by checking for encryption.xml with non-font-obfuscation entries.

    Per W3C EPUB 3.3 spec: encryption.xml is present when resources are encrypted.
    Font obfuscation (IDPF algorithm http://www.idpf.org/2008/embedding or
    Adobe algorithm http://ns.adobe.com/pdf/enc#RC) is NOT DRM — only font mangling.

    Actual DRM uses algorithms like:
    - Adobe ADEPT: http://ns.adobe.com/adept namespace
    - Apple FairPlay: http://itunes.apple.com/dataenc
    - Readium LCP: http://readium.org/2014/01/lcp
    """
```
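This check can be illustrated standalone with the standard library, operating on the container ZIP directly rather than the ebooklib book object. A sketch under stated assumptions: `has_drm` and `FONT_OBFUSCATION` are hypothetical names, and the rule implemented is exactly the one in the docstring above (any `EncryptionMethod` algorithm other than the two font-obfuscation URIs counts as DRM).

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Algorithms that only obfuscate embedded fonts -- explicitly NOT DRM.
FONT_OBFUSCATION = {
    "http://www.idpf.org/2008/embedding",
    "http://ns.adobe.com/pdf/enc#RC",
}

XMLENC_NS = "{http://www.w3.org/2001/04/xmlenc#}"


def has_drm(epub_file) -> bool:
    """Return True if META-INF/encryption.xml lists any non-font algorithm."""
    with zipfile.ZipFile(epub_file) as zf:
        if "META-INF/encryption.xml" not in zf.namelist():
            return False  # no encrypted resources at all
        root = ET.fromstring(zf.read("META-INF/encryption.xml"))
        for method in root.iter(f"{XMLENC_NS}EncryptionMethod"):
            if method.get("Algorithm") not in FONT_OBFUSCATION:
                return True  # real DRM (ADEPT, FairPlay, LCP, ...)
    return False
```

In the real converter this would run before ebooklib parsing, so a DRM-protected book fails fast with a clear message instead of producing garbled sections.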
**Metadata extraction** (per W3C spec §5.2, Dublin Core):
```python
def _extract_metadata(self, book) -> dict:
    """Extract Dublin Core metadata from EPUB.

    Per W3C EPUB 3.3 spec: required elements are dc:identifier, dc:title, dc:language.
    Optional: dc:creator, dc:contributor, dc:date, dc:description, dc:publisher,
    dc:subject, dc:rights, dc:type, dc:coverage, dc:source, dc:relation, dc:format.

    ebooklib API: book.get_metadata('DC', key) returns list of (value, attrs) tuples.
    """
    def _get_one(key):
        data = book.get_metadata('DC', key)
        return data[0][0] if data else None

    def _get_list(key):
        data = book.get_metadata('DC', key)
        return [x[0] for x in data] if data else []

    return {
        "title": _get_one('title') or "Untitled",
        "author": ", ".join(_get_list('creator')) or None,
        "language": _get_one('language') or "en",
        "publisher": _get_one('publisher'),
        "date": _get_one('date'),
        "description": _get_one('description'),
        "subject": ", ".join(_get_list('subject')) or None,
        "rights": _get_one('rights'),
        "identifier": _get_one('identifier'),
    }
```
**Content extraction** (per W3C spec §5 — XHTML Content Documents use XML serialization of HTML5):
```python
def _extract_spine_content(self, book) -> list[dict]:
    """Extract content from spine items in reading order.

    Per W3C EPUB 3.3 spec §3.4.8: spine defines ordered list of content documents.
    Linear="yes" (default) items form the primary reading order.
    Linear="no" items are auxiliary (footnotes, glossary).

    Per spec §5: XHTML content documents use XML syntax of HTML5.
    Parse with BeautifulSoup, split by h1/h2 heading boundaries.
    """
    sections = []
    section_number = 0
    for item_id, linear in book.spine:
        item = book.get_item_with_id(item_id)
        if not item or item.get_type() != ebooklib.ITEM_DOCUMENT:
            continue
        soup = BeautifulSoup(item.get_content(), 'html.parser')
        # Remove scripts, styles, comments (not useful for text extraction)
        for tag in soup(['script', 'style']):
            tag.decompose()
        for comment in soup.find_all(string=lambda t: isinstance(t, Comment)):
            comment.extract()
        body = soup.find('body')
        if not body:
            continue
        # Split by h1/h2 heading boundaries (same as word_scraper)
        # Each heading starts a new section
        ...
```
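The elided heading-boundary step can be sketched standalone. This illustration uses the stdlib `ElementTree` instead of BeautifulSoup, which works for well-formed EPUB XHTML precisely because (per spec §5) it is valid XML; `split_sections` is an illustrative name, not the planned method, and it walks only the direct children of `<body>` for brevity.

```python
import xml.etree.ElementTree as ET


def split_sections(xhtml: str) -> list[dict]:
    """Split one XHTML chapter into sections at h1/h2 boundaries."""
    def local(tag: str) -> str:
        # Strip the XHTML namespace: '{http://...}h1' -> 'h1'
        return tag.split("}")[-1]

    root = ET.fromstring(xhtml)
    body = next((el for el in root.iter() if local(el.tag) == "body"), None)
    if body is None:
        return []

    sections: list[dict] = []
    current = None
    for el in body:  # direct children only; a fuller version would recurse
        name = local(el.tag)
        if name in ("h1", "h2"):
            # Each heading opens a new section (same rule as word_scraper)
            current = {
                "heading": "".join(el.itertext()).strip(),
                "heading_level": name,
                "text": "",
            }
            sections.append(current)
        elif current is not None:
            current["text"] += "".join(el.itertext()).strip() + "\n"
    return sections
```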
**Image extraction** (per W3C spec §3.3 — core media types include JPEG, PNG, GIF, WebP, SVG):
```python
def _extract_images(self, book) -> list[dict]:
    """Extract images from EPUB manifest.

    Per W3C EPUB 3.3 spec §3.3: core image media types are
    image/gif, image/jpeg, image/png, image/svg+xml, image/webp.

    ebooklib API: book.get_items_of_type(ebooklib.ITEM_IMAGE)
    returns EpubImage items with get_content() (bytes) and media_type.
    SVG images (ITEM_VECTOR) handled separately.
    """
```
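One detail worth pinning down here: content documents in different EPUB directories may reference images with identical basenames, so writing them all into a flat `assets/` directory needs disambiguation (this is what `test_image_filename_conflicts` in Phase 3 exercises). A minimal sketch, with `unique_asset_name` as a hypothetical helper name:

```python
from pathlib import Path


def unique_asset_name(taken: set[str], name: str) -> str:
    """Disambiguate duplicates: cover.png, cover_1.png, cover_2.png, ..."""
    if name not in taken:
        taken.add(name)
        return name
    stem, suffix = Path(name).stem, Path(name).suffix
    n = 1
    while f"{stem}_{n}{suffix}" in taken:
        n += 1
    unique = f"{stem}_{n}{suffix}"
    taken.add(unique)
    return unique
```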
**The remaining methods** (`categorize_content`, `build_skill`, `_generate_reference_file`, `_generate_index`, `_generate_skill_md`, `_format_key_concepts`, `_format_patterns_from_content`, `_sanitize_filename`) follow the Word scraper pattern exactly — they operate on the same intermediate JSON structure.
**`main()` function** (following `word_scraper.py:923-1059`):
```python
def main():
    from .arguments.epub import add_epub_arguments

    parser = argparse.ArgumentParser(
        description="Convert EPUB e-book to skill",
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    add_epub_arguments(parser)
    args = parser.parse_args()

    # Logging setup
    if getattr(args, "quiet", False):
        logging.getLogger().setLevel(logging.WARNING)
    elif getattr(args, "verbose", False):
        logging.getLogger().setLevel(logging.DEBUG)

    # Dry run
    if getattr(args, "dry_run", False):
        source = args.epub or args.from_json or "(none)"
        print(f"\n{'=' * 60}")
        print("DRY RUN: EPUB Extraction")
        print(f"{'=' * 60}")
        print(f"Source: {source}")
        print(f"Name: {getattr(args, 'name', None) or '(auto-detect)'}")
        print(f"Enhance level: {getattr(args, 'enhance_level', 0)}")
        print("\n✅ Dry run complete")
        return

    # Validate inputs
    if not (args.epub or args.from_json):
        parser.error("Must specify --epub or --from-json")

    # From-JSON workflow
    if args.from_json:
        name = Path(args.from_json).stem.replace("_extracted", "")
        config = {
            "name": name,
            "description": args.description or f"Use when referencing {name} documentation",
        }
        converter = EpubToSkillConverter(config)
        converter.load_extracted_data(args.from_json)
        converter.build_skill()
        return

    # Direct EPUB workflow
    name = args.name or Path(args.epub).stem
    config = {
        "name": name,
        "epub_path": args.epub,
        "description": args.description or f"Use when referencing {name} documentation",
    }
    try:
        converter = EpubToSkillConverter(config)
        if not converter.extract_epub():
            print("\n❌ EPUB extraction failed", file=sys.stderr)
            sys.exit(1)
        converter.build_skill()

        # Enhancement workflow integration
        from skill_seekers.cli.workflow_runner import run_workflows
        run_workflows(args)

        # Traditional enhancement
        if getattr(args, "enhance_level", 0) > 0:
            # Same pattern as word_scraper.py and pdf_scraper.py
            ...
    except RuntimeError as e:
        print(f"\n❌ Error: {e}", file=sys.stderr)
        sys.exit(1)
```
### Success Criteria:
#### Automated Verification:
- [x] `ruff check src/skill_seekers/cli/epub_scraper.py` passes
- [x] `ruff format --check src/skill_seekers/cli/epub_scraper.py` passes
- [ ] `mypy src/skill_seekers/cli/epub_scraper.py` passes (continue-on-error)
- [x] `pip install -e ".[epub]"` installs successfully
#### Manual Verification:
- [x] Verify `import ebooklib` works after install
- [x] Review epub_scraper.py structure matches word_scraper.py pattern
**Implementation Note**: After completing this phase and all automated verification passes, pause here for manual confirmation from the human before proceeding to the next phase.
---
## Phase 2: CLI Integration
### Overview
Wire the EPUB scraper into the CLI: source detection, argument definitions, parser registration, create command routing, and entry points.
### Changes Required:
#### [x] 1. Source detection
**File**: `src/skill_seekers/cli/source_detector.py`
**Changes**: Add `.epub` extension detection, `_detect_epub()` method, validation, and error message update
Add after the `.docx` check (line 64):
```python
if source.endswith(".epub"):
    return cls._detect_epub(source)
```
Add `_detect_epub()` method (following `_detect_word()` at line 124):
```python
@classmethod
def _detect_epub(cls, source: str) -> SourceInfo:
    """Detect EPUB file source."""
    name = os.path.splitext(os.path.basename(source))[0]
    return SourceInfo(
        type="epub", parsed={"file_path": source}, suggested_name=name, raw_input=source
    )
```
Add epub validation in `validate_source()` (after word block at line 278):
```python
elif source_info.type == "epub":
    file_path = source_info.parsed["file_path"]
    if not os.path.exists(file_path):
        raise ValueError(f"EPUB file does not exist: {file_path}")
    if not os.path.isfile(file_path):
        raise ValueError(f"Path is not a file: {file_path}")
```
Add EPUB example to the ValueError message (line 94):
```python
" EPUB: skill-seekers create ebook.epub\n"
```
#### [x] 2. Argument definitions
**File**: `src/skill_seekers/cli/arguments/epub.py` (new)
**Changes**: EPUB-specific argument definitions
```python
"""EPUB-specific CLI arguments."""
import argparse
from typing import Any
from .common import add_all_standard_arguments
EPUB_ARGUMENTS: dict[str, dict[str, Any]] = {
"epub": {
"flags": ("--epub",),
"kwargs": {
"type": str,
"help": "Direct EPUB file path",
"metavar": "PATH",
},
},
"from_json": {
"flags": ("--from-json",),
"kwargs": {
"type": str,
"help": "Build skill from extracted JSON",
"metavar": "FILE",
},
},
}
def add_epub_arguments(parser: argparse.ArgumentParser) -> None:
"""Add EPUB-specific arguments to parser."""
add_all_standard_arguments(parser)
# Override enhance-level default to 0 for EPUB
for action in parser._actions:
if hasattr(action, "dest") and action.dest == "enhance_level":
action.default = 0
action.help = (
"AI enhancement level (auto-detects API vs LOCAL mode): "
"0=disabled (default for EPUB), 1=SKILL.md only, "
"2=+architecture/config, 3=full enhancement. "
"Mode selection: uses API if ANTHROPIC_API_KEY is set, "
"otherwise LOCAL (Claude Code)"
)
for arg_name, arg_def in EPUB_ARGUMENTS.items():
parser.add_argument(*arg_def["flags"], **arg_def["kwargs"])
```
#### [x] 3. Create command argument integration
**File**: `src/skill_seekers/cli/arguments/create.py`
**Changes**: Add EPUB_ARGUMENTS dict, register in helper functions, add mode handling
Add after WORD_ARGUMENTS (~line 411):
```python
# EPUB specific (from epub.py)
EPUB_ARGUMENTS: dict[str, dict[str, Any]] = {
    "epub": {
        "flags": ("--epub",),
        "kwargs": {
            "type": str,
            "help": "EPUB file path",
            "metavar": "PATH",
        },
    },
}
```
Add to `get_source_specific_arguments()` (line 595):
```python
"epub": EPUB_ARGUMENTS,
```
Add to `add_create_arguments()` (after word block at line 678):
```python
if mode in ["epub", "all"]:
    for arg_name, arg_def in EPUB_ARGUMENTS.items():
        parser.add_argument(*arg_def["flags"], **arg_def["kwargs"])
```
#### [x] 4. Parser class
**File**: `src/skill_seekers/cli/parsers/epub_parser.py` (new)
**Changes**: Subcommand parser for standalone `skill-seekers epub` command
```python
"""Parser for epub subcommand."""
from .base import SubcommandParser
from skill_seekers.cli.arguments.epub import add_epub_arguments
class EpubParser(SubcommandParser):
"""Parser for EPUB extraction command."""
@property
def name(self) -> str:
return "epub"
@property
def help(self) -> str:
return "Extract from EPUB e-book (.epub)"
@property
def description(self) -> str:
return "Extract content from EPUB e-book (.epub) and generate skill"
def add_arguments(self, parser):
add_epub_arguments(parser)
```
#### [x] 5. Parser registration
**File**: `src/skill_seekers/cli/parsers/__init__.py`
**Changes**: Import and register EpubParser
Add import (after WordParser import, line 15):
```python
from .epub_parser import EpubParser
```
Add to PARSERS list (after `WordParser()`, line 46):
```python
EpubParser(),
```
#### [x] 6. CLI dispatcher
**File**: `src/skill_seekers/cli/main.py`
**Changes**: Add epub to COMMAND_MODULES dict and module docstring
Add to COMMAND_MODULES (after "word" entry, line 52):
```python
"epub": "skill_seekers.cli.epub_scraper",
```
Add to module docstring (after "word" line, line 15):
```python
# epub Extract from EPUB e-book (.epub)
```
#### [x] 7. Create command routing
**File**: `src/skill_seekers/cli/create_command.py`
**Changes**: Add `_route_epub()` method, routing case, help flag, and epilog example
Add to `_route_to_scraper()` (after word case, line 136):
```python
elif self.source_info.type == "epub":
return self._route_epub()
```
Add `_route_epub()` method (after `_route_word()`, line 352):
```python
def _route_epub(self) -> int:
"""Route to EPUB scraper (epub_scraper.py)."""
from skill_seekers.cli import epub_scraper
argv = ["epub_scraper"]
file_path = self.source_info.parsed["file_path"]
argv.extend(["--epub", file_path])
self._add_common_args(argv)
logger.debug(f"Calling epub_scraper with argv: {argv}")
original_argv = sys.argv
try:
sys.argv = argv
return epub_scraper.main()
finally:
sys.argv = original_argv
```
Add to epilog (line 543, after DOCX example):
```python
EPUB: skill-seekers create ebook.epub
```
Add to Source Auto-Detection section:
```python
file.epub EPUB extraction
```
Add `--help-epub` flag and handler (after `--help-word` at line 592):
```python
parser.add_argument(
"--help-epub", action="store_true", help=argparse.SUPPRESS, dest="_help_epub"
)
```
Add handler block (after `_help_word` block at line 654):
```python
elif args._help_epub:
parser_epub = argparse.ArgumentParser(
prog="skill-seekers create",
description="Create skill from EPUB e-book (.epub)",
formatter_class=argparse.RawDescriptionHelpFormatter,
)
add_create_arguments(parser_epub, mode="epub")
parser_epub.print_help()
return 0
```
#### [x] 8. Entry point
**File**: `pyproject.toml`
**Changes**: Add standalone entry point
Add after `skill-seekers-word` (line 224):
```toml
skill-seekers-epub = "skill_seekers.cli.epub_scraper:main"
```
#### [x] 9. Positional argument handling in main.py
**File**: `src/skill_seekers/cli/main.py`
**Changes**: None needed; `input_file` is already in the positional list at line 153. Verify that `_reconstruct_argv` handles epub correctly through the standard delegation path.
### Success Criteria:
#### Automated Verification:
- [x] `ruff check src/skill_seekers/cli/source_detector.py src/skill_seekers/cli/arguments/epub.py src/skill_seekers/cli/parsers/epub_parser.py src/skill_seekers/cli/create_command.py` passes
- [x] `ruff format --check src/skill_seekers/cli/` passes
- [x] `pip install -e ".[epub]"` installs with all entry points
- [x] `skill-seekers epub --help` shows EPUB-specific help
- [x] `skill-seekers create --help-epub` shows EPUB arguments (via standalone entry point `skill-seekers-create`)
- [x] `skill-seekers create nonexistent.epub` gives clear error about missing file
- [x] Existing tests still pass: `pytest tests/ -v -x -m "not slow and not integration"` (875 passed, 1 pre-existing unrelated failure in test_git_sources_e2e)
#### Manual Verification:
- [x] `skill-seekers --help` lists `epub` command
- [x] `skill-seekers create book.epub --dry-run` shows dry run output
**Implementation Note**: After completing this phase and all automated verification passes, pause here for manual confirmation from the human before proceeding to the next phase.
---
## Phase 3: Comprehensive Test Suite
### Overview
Create `tests/test_epub_scraper.py` with 100+ tests across 11 test classes, covering happy path, negative cases, edge cases, and CLI integration.
### Changes Required:
#### [x] 1. Create test file
**File**: `tests/test_epub_scraper.py` (new)
**Changes**: Comprehensive test suite following `test_word_scraper.py` patterns
```python
"""
Tests for EPUB scraper (epub_scraper.py).
Covers: initialization, extraction, categorization, skill building,
code blocks, tables, images, error handling, JSON workflow, CLI arguments,
helper functions, source detection, DRM detection, and edge cases.
Tests use mock data and do not require actual EPUB files or ebooklib installed.
"""
import json
import os
import shutil
import tempfile
import unittest
from pathlib import Path
from unittest.mock import MagicMock, patch, PropertyMock
# Conditional import (same pattern as test_word_scraper.py)
try:
from skill_seekers.cli.epub_scraper import (
EpubToSkillConverter,
infer_description_from_epub,
_score_code_quality,
_check_epub_deps,
EPUB_AVAILABLE,
)
IMPORT_OK = True
except ImportError:
IMPORT_OK = False
```
**Helper factory function:**
```python
def _make_sample_extracted_data(
    num_sections=2,
    include_code=False,
    include_tables=False,
    include_images=False,
) -> dict:
    """Create minimal extracted_data dict for testing."""
    sections = []
    total_code = 0
    total_images = 0
    languages = {}
    for i in range(1, num_sections + 1):
        section = {
            "section_number": i,
            "heading": f"Chapter {i}",
            "heading_level": "h1",
            "text": f"Content of chapter {i}. This is sample text.",
            "headings": [{"level": "h2", "text": f"Section {i}.1"}],
            "code_samples": [],
            "tables": [],
            "images": [],
        }
        if include_code:
            section["code_samples"] = [
                {"code": f"def func_{i}():\n    return {i}", "language": "python", "quality_score": 7.5},
                {"code": f"console.log({i})", "language": "javascript", "quality_score": 4.0},
            ]
            total_code += 2
            languages["python"] = languages.get("python", 0) + 1
            languages["javascript"] = languages.get("javascript", 0) + 1
        if include_tables:
            section["tables"] = [
                {"headers": ["Name", "Value"], "rows": [["key", "val"]]}
            ]
        if include_images:
            section["images"] = [
                {"index": 1, "data": b"\x89PNG\r\n\x1a\n", "width": 100, "height": 100}
            ]
            total_images += 1
        sections.append(section)
    return {
        "source_file": "test.epub",
        "metadata": {
            "title": "Test Book",
            "author": "Test Author",
            "language": "en",
            "publisher": "Test Publisher",
            "date": "2024-01-01",
            "description": "A test book for unit testing",
            "subject": "Testing, Unit Tests",
            "rights": "Copyright 2024",
            "identifier": "urn:uuid:12345",
        },
        "total_sections": num_sections,
        "total_code_blocks": total_code,
        "total_images": total_images,
        "languages_detected": languages,
        "pages": sections,
    }
}
```
### Test Classes and Methods:
#### [x] Class 1: `TestEpubToSkillConverterInit` (8 tests)
**Happy path:**
- `test_init_with_name_and_epub_path` — basic config with name + epub_path
- `test_init_with_full_config` — config with all fields (name, epub_path, description, categories)
- `test_default_description_uses_name` — description defaults to "Use when referencing {name} documentation"
- `test_skill_dir_uses_name` — skill_dir is `output/{name}`
- `test_data_file_uses_name` — data_file is `output/{name}_extracted.json`
**Negative:**
- `test_init_requires_name` — missing "name" key raises KeyError
- `test_init_empty_name` — empty string name still works (no crash)
**Edge case:**
- `test_init_with_special_characters_in_name` — name with spaces/dashes sanitized for paths
#### [x] Class 2: `TestEpubExtraction` (12 tests)
**Happy path:**
- `test_extract_basic_epub` — mock ebooklib, verify sections extracted in spine order
- `test_extract_metadata` — verify Dublin Core metadata extraction (title, creator, language, etc.)
- `test_extract_multiple_chapters` — multiple spine items produce multiple sections
- `test_extract_code_blocks` — `<pre><code>` elements extracted with language detection
- `test_extract_images` — ITEM_IMAGE items extracted with correct content
- `test_heading_boundary_splitting` — h1/h2 boundaries create new sections
**Negative:**
- `test_extract_missing_file_raises_error` — FileNotFoundError for nonexistent path
- `test_extract_invalid_epub_raises_error` — ValueError for corrupted/non-EPUB file
- `test_extract_deps_not_installed` — RuntimeError with install instructions when ebooklib missing
**Edge cases:**
- `test_extract_empty_spine` — EPUB with no spine items produces empty sections list
- `test_extract_spine_item_no_body` — XHTML without `<body>` tag skipped gracefully
- `test_extract_non_linear_spine_items` — linear="no" items still extracted (included but flagged)
#### [x] Class 3: `TestEpubDrmDetection` (6 tests)
**Happy path:**
- `test_no_drm_detected` — normal EPUB without encryption.xml returns False
**Negative:**
- `test_drm_detected_adobe_adept` — encryption.xml with Adobe namespace raises RuntimeError
- `test_drm_detected_apple_fairplay` — encryption.xml with Apple namespace raises RuntimeError
- `test_drm_detected_readium_lcp` — encryption.xml with Readium namespace raises RuntimeError
**Edge cases:**
- `test_font_obfuscation_not_drm` — encryption.xml with only IDPF font obfuscation algorithm (`http://www.idpf.org/2008/embedding`) is NOT DRM, extraction proceeds
- `test_drm_error_message_is_clear` — error message mentions DRM and suggests removing protection
#### [x] Class 4: `TestEpubCategorization` (8 tests)
**Happy path:**
- `test_single_source_creates_one_category` — single EPUB creates category named after file
- `test_keyword_categorization` — sections matched to categories by keyword scoring
- `test_no_categories_uses_default` — no category config creates single "content" category
**Negative:**
- `test_categorize_empty_sections` — empty sections list produces empty categories
- `test_categorize_no_keyword_matches` — unmatched sections go to "other" category
**Edge cases:**
- `test_categorize_single_section` — one section creates one category
- `test_categorize_many_sections` — 50+ sections categorized correctly
- `test_categorize_preserves_section_order` — sections maintain original order within categories
#### [x] Class 5: `TestEpubSkillBuilding` (10 tests)
**Happy path:**
- `test_build_creates_directory_structure` — output/{name}/, references/, scripts/, assets/ created
- `test_build_generates_skill_md` — SKILL.md created with YAML frontmatter
- `test_build_generates_reference_files` — reference markdown files created per category
- `test_build_generates_index` — references/index.md created with category links
- `test_skill_md_contains_metadata` — SKILL.md includes title, author, language from metadata
- `test_skill_md_yaml_frontmatter` — frontmatter has name and description fields
**Negative:**
- `test_build_without_extracted_data_fails` — calling build_skill() before extraction raises error
**Edge cases:**
- `test_build_overwrites_existing_output` — re-running build overwrites existing files
- `test_build_with_long_name` — name > 64 chars truncated in YAML frontmatter
- `test_build_with_unicode_content` — Unicode text (CJK, Arabic, emoji) preserved correctly
#### [x] Class 6: `TestEpubCodeBlocks` (8 tests)
**Happy path:**
- `test_code_blocks_included_in_reference_files` — code samples appear in reference markdown
- `test_code_blocks_in_skill_md_top_15` — SKILL.md shows top 15 code examples by quality
- `test_code_language_grouped` — code examples grouped by language in SKILL.md
**Edge cases:**
- `test_empty_code_block` — `<pre><code></code></pre>` with no content skipped
- `test_code_block_with_html_entities` — `&lt;`, `&gt;`, `&amp;` decoded to `<`, `>`, `&`
- `test_code_block_with_syntax_highlighting_spans` — `<span class="keyword">` stripped, plain text preserved
- `test_code_block_language_from_class` — `class="language-python"`, `class="code-rust"` detected
- `test_code_quality_scoring` — scoring heuristic produces expected ranges (0-10)
#### [x] Class 7: `TestEpubTables` (5 tests)
**Happy path:**
- `test_tables_in_reference_files` — tables rendered as markdown in reference files
- `test_table_with_headers` — headers from `<thead>` used correctly
**Edge cases:**
- `test_table_no_thead` — first row used as headers when no `<thead>`
- `test_empty_table` — empty `<table>` element handled gracefully
- `test_table_with_colspan_rowspan` — complex tables don't crash (data may be imperfect)
#### [x] Class 8: `TestEpubImages` (7 tests)
**Happy path:**
- `test_images_saved_to_assets` — image bytes written to assets/ directory
- `test_image_references_in_markdown` — markdown `![Image](../assets/...)` references correct
**Negative:**
- `test_image_with_zero_bytes` — empty image content skipped
**Edge cases:**
- `test_svg_images_handled` — SVG items (ITEM_VECTOR) extracted or skipped gracefully
- `test_image_filename_conflicts` — duplicate filenames disambiguated
- `test_cover_image_identified` — cover image (ITEM_COVER) extracted
- `test_many_images` — 100+ images extracted without error
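The filename-conflict case can be sketched as a counter-suffix scheme (hypothetical helper; actual naming in `_extract_images()` may differ):

```python
def unique_asset_name(name, taken):
    """Disambiguate duplicate asset filenames: pic.png, pic_1.png, pic_2.png …"""
    if name not in taken:
        taken.add(name)
        return name
    stem, dot, ext = name.rpartition(".")
    n = 1
    while True:
        candidate = f"{stem}_{n}.{ext}" if dot else f"{name}_{n}"
        if candidate not in taken:
            taken.add(candidate)
            return candidate
        n += 1
```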
#### [x] Class 9: `TestEpubErrorHandling` (10 tests)
**Negative / error cases:**
- `test_missing_epub_file_raises_error` — FileNotFoundError for nonexistent path
- `test_not_a_file_raises_error` — ValueError when path is a directory
- `test_not_epub_extension_raises_error` — ValueError for .txt, .pdf, .doc files
- `test_corrupted_zip_raises_error` — ValueError or RuntimeError for corrupted ZIP
- `test_missing_container_xml` — ValueError for ZIP without META-INF/container.xml
- `test_missing_opf_file` — ValueError when container.xml points to nonexistent OPF
- `test_drm_protected_raises_error` — RuntimeError with clear DRM message
- `test_empty_epub_raises_error` — ValueError for EPUB with no content documents
- `test_ebooklib_not_installed_error` — RuntimeError with install instructions
- `test_malformed_xhtml_handled_gracefully` — unclosed tags, invalid entities don't crash (BeautifulSoup tolerant parsing)
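The fail-fast DRM check can be sketched as a ZIP-listing scan. The marker paths below are assumptions (only `META-INF/encryption.xml` is defined by the OCF spec; `rights.xml`, `license.lcpl`, and `sinf.xml` are conventional locations for ADEPT, LCP, and FairPlay respectively):

```python
import zipfile

# Assumed marker files, not the real epub_scraper.py constants:
DRM_MARKERS = {
    "META-INF/rights.xml": "Adobe ADEPT",
    "META-INF/license.lcpl": "Readium LCP",
    "META-INF/sinf.xml": "Apple FairPlay",
}

def detect_drm(epub_path):
    """Fail fast on well-known DRM marker files inside the EPUB ZIP.

    META-INF/encryption.xml alone is NOT treated as DRM here: EPUB font
    obfuscation also lists entries there, so it needs content inspection
    (hence the 'font obfuscation is not DRM' test above).
    """
    with zipfile.ZipFile(epub_path) as zf:
        names = set(zf.namelist())
    hits = [scheme for marker, scheme in DRM_MARKERS.items() if marker in names]
    if hits:
        raise RuntimeError(
            f"EPUB is DRM-protected ({', '.join(hits)}); cannot extract content."
        )
```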
#### [x] Class 10: `TestEpubJSONWorkflow` (6 tests)
**Happy path:**
- `test_load_extracted_json` — load previously extracted JSON
- `test_build_from_json` — full workflow: load JSON → categorize → build
- `test_json_round_trip` — extract → save JSON → load JSON → build produces same output
**Negative:**
- `test_load_invalid_json` — malformed JSON raises appropriate error
- `test_load_nonexistent_json` — FileNotFoundError for missing file
**Edge case:**
- `test_json_with_missing_fields` — partial JSON (missing optional fields) still works
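The load/round-trip behavior can be sketched like this (illustrative function names; `json.loads` naturally raises on malformed input, and `setdefault` covers the missing-optional-fields case):

```python
import json
from pathlib import Path

def save_extracted(data, path):
    """Persist extraction output as UTF-8 JSON."""
    Path(path).write_text(
        json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8"
    )

def load_extracted(path):
    """Load previously extracted JSON, tolerating missing optional fields."""
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(path)                 # test_load_nonexistent_json
    data = json.loads(p.read_text(encoding="utf-8"))  # raises on malformed JSON
    data.setdefault("metadata", {})                   # test_json_with_missing_fields
    data.setdefault("sections", [])
    return data
```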
#### [x] Class 11: `TestEpubCLIArguments` (8 tests)
**Happy path:**
- `test_epub_flag_accepted``--epub path.epub` parsed correctly
- `test_from_json_flag_accepted``--from-json data.json` parsed correctly
- `test_name_flag_accepted``--name mybook` parsed correctly
- `test_enhance_level_default_zero` — enhance-level defaults to 0 for EPUB
- `test_dry_run_flag``--dry-run` flag parsed correctly
**Negative:**
- `test_no_args_shows_error` — no `--epub` or `--from-json` shows error
**Integration:**
- `test_verbose_flag``--verbose` accepted
- `test_quiet_flag``--quiet` accepted
#### [x] Class 12: `TestEpubHelperFunctions` (6 tests)
- `test_infer_description_from_metadata_description` — uses description field
- `test_infer_description_from_metadata_title` — falls back to title
- `test_infer_description_fallback` — falls back to name-based template
- `test_infer_description_empty_metadata` — empty dict returns fallback
- `test_score_code_quality_ranges` — scoring returns 0-10
- `test_sanitize_filename` — special characters cleaned
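The fallback chain and sanitization these helpers cover can be sketched as (hypothetical signatures; the fallback wording is illustrative, not the real template):

```python
import re

def sanitize_filename(name):
    """Replace filesystem-unsafe characters and collapse runs into one underscore."""
    cleaned = re.sub(r"[^\w.-]+", "_", name).strip("_")
    return cleaned or "untitled"

def infer_description(metadata, name):
    """description → title → name-based fallback, per TestEpubHelperFunctions."""
    if metadata.get("description"):
        return metadata["description"]
    if metadata.get("title"):
        return f"Reference skill for {metadata['title']}"
    return f"Reference skill generated from {name}"
```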
#### [x] Class 13: `TestEpubSourceDetection` (6 tests)
- `test_epub_detected_as_epub_type``.epub` extension detected correctly
- `test_epub_suggested_name` — filename stem used as suggested name
- `test_epub_validation_missing_file` — validation raises ValueError for missing file
- `test_epub_validation_not_a_file` — validation raises ValueError for directory
- `test_epub_with_path``./books/test.epub` detected with correct file_path
- `test_pdf_still_detected` — regression test: `.pdf` still detected as pdf type
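Extension-based routing for the cases above can be sketched as follows (a minimal sketch of the detection contract, not the real source-detection module):

```python
from pathlib import Path

def detect_source_type(path_str):
    """Map file extension to a source type; stem becomes the suggested name."""
    p = Path(path_str)
    types = {".epub": "epub", ".pdf": "pdf", ".docx": "word"}
    source_type = types.get(p.suffix.lower())
    if source_type is None:
        raise ValueError(f"Unsupported input: {path_str}")
    return {"type": source_type, "suggested_name": p.stem, "file_path": str(p)}
```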
#### [x] Class 14: `TestEpubEdgeCases` (8 tests)
**Per W3C EPUB 3.3 spec edge cases:**
- `test_epub2_vs_epub3` — both versions parse successfully (ebooklib handles both)
- `test_epub_no_toc` — EPUB without table of contents extracts using spine order
- `test_epub_empty_chapters` — chapters with no text content skipped gracefully
- `test_epub_single_chapter` — book with one spine item produces valid output
- `test_epub_unicode_content` — CJK, Arabic, Cyrillic, emoji text preserved
- `test_epub_large_section_count` — 100+ sections processed without error
- `test_epub_nested_headings` — h3/h4/h5/h6 become sub-headings within sections
- `test_fixed_layout_detected` — fixed-layout EPUB produces warning but still extracts text
**Total: ~108 planned test methods across 14 classes (107 implemented)**
### Success Criteria:
#### Automated Verification:
- [x] `pytest tests/test_epub_scraper.py -v` — all 107 tests pass
- [x] `pytest tests/ -v -x -m "not slow and not integration"` — 982 passed (1 pre-existing unrelated failure in test_git_sources_e2e)
- [x] `ruff check tests/test_epub_scraper.py` passes
- [x] `ruff format --check tests/test_epub_scraper.py` passes
- [x] Test count >= 100 methods (107 tests across 14 classes)
#### Manual Verification:
- [x] Review test coverage includes: happy path, negative, edge cases, CLI, source detection, DRM, JSON workflow
- [x] Verify no tests require actual EPUB files or ebooklib installed (all use mocks/skipTest guards)
**Implementation Note**: After completing this phase and all automated verification passes, pause here for manual confirmation from the human before proceeding to the next phase.
---
## Phase 4: Documentation
### Overview
Update CLAUDE.md and CHANGELOG.md to reflect the new EPUB support.
### Changes Required:
#### [x] 1. Update CLAUDE.md
**File**: `CLAUDE.md`
**Changes**:
Add to Commands section (after pdf line):
```
skill-seekers epub --epub book.epub --name myskill
```
Add to "Unified create" examples:
```
skill-seekers create book.epub
```
Add to Key source files table:
```
| Core scraping | `cli/epub_scraper.py` |
```
Add to "Adding things → New create command flags" section:
```
- Source-specific → `EPUB_ARGUMENTS`
```
#### [x] 2. Update CHANGELOG.md
**File**: `CHANGELOG.md`
**Changes**: Add entry for EPUB support under next version
```markdown
### Added
- EPUB (.epub) input support via `skill-seekers create book.epub` or `skill-seekers epub --epub book.epub`
- Extracts chapters, metadata, code blocks, images, and tables from EPUB 2 and EPUB 3 files
- DRM detection with clear error messages
- Optional dependency: `pip install "skill-seekers[epub]"`
```
### Success Criteria:
#### Automated Verification:
- [x] `ruff check` passes on any modified files
- [x] `pytest tests/ -v -x -m "not slow and not integration"` — all tests still pass (982 passed, 1 pre-existing failure)
#### Manual Verification:
- [x] CLAUDE.md accurately reflects new commands
- [x] CHANGELOG.md entry is clear and complete
**Implementation Note**: After completing this phase and all automated verification passes, pause here for manual confirmation from the human before proceeding.
---
## Testing Strategy
### Unit Tests (Phase 3 — ~108 tests):
**By category:**
| Category | Count | What's tested |
|----------|-------|---------------|
| Initialization | 8 | Config parsing, defaults, edge cases |
| Extraction | 12 | Spine iteration, metadata, headings, code, images |
| DRM detection | 6 | Adobe, Apple, Readium, font obfuscation (not DRM) |
| Categorization | 8 | Single/multi category, keywords, empty, ordering |
| Skill building | 10 | Directory structure, SKILL.md, references, index |
| Code blocks | 8 | Extraction, quality, language detection, HTML entities |
| Tables | 5 | Headers, no-thead fallback, empty, colspan |
| Images | 7 | Save, references, SVG, conflicts, cover, many |
| Error handling | 10 | Missing file, corrupt, DRM, no deps, malformed XHTML |
| JSON workflow | 6 | Load, build, round-trip, invalid, missing fields |
| CLI arguments | 8 | Flags, defaults, dry-run, verbose/quiet |
| Helper functions | 6 | Description inference, quality scoring, filename sanitization |
| Source detection | 6 | Detection, validation, regression |
| Edge cases | 8 | EPUB 2/3, no TOC, empty chapters, Unicode, fixed-layout |
### Integration Tests:
- Full extract → categorize → build workflow with mock ebooklib
- JSON round-trip (extract → save → load → build)
### Manual Testing Steps:
1. `pip install -e ".[epub]"` — verify install
2. `skill-seekers create book.epub` with a real EPUB file — verify output directory structure
3. `skill-seekers epub --epub book.epub --dry-run` — verify dry run output
4. `skill-seekers create drm-book.epub` — verify DRM error message
5. `skill-seekers create nonexistent.epub` — verify file-not-found error
6. Open generated `SKILL.md` — verify content quality and structure
## Performance Considerations
- ebooklib loads entire EPUB into memory. For typical books (<50MB), this is fine
- For very large EPUBs (100MB+), memory usage may spike. No mitigation needed for v1 — document as known limitation
- BeautifulSoup parsing of XHTML is fast. No performance concerns expected
## Migration Notes
- No migration needed — this is a new feature with no existing data to migrate
- Optional dependency (`ebooklib`) means existing installs are unaffected
- No breaking changes to any existing commands or APIs
## References
- [W3C EPUB 3.3 Specification](https://www.w3.org/TR/epub-33/) — authoritative source of truth
- [W3C EPUB Reading Systems 3.3](https://www.w3.org/TR/epub-rs-33/) — reading system requirements
- [ebooklib GitHub](https://github.com/aerkalov/ebooklib) — Python EPUB library
- [ebooklib PyPI](https://pypi.org/project/EbookLib/) — v0.20, Python 3.9-3.13
- [Research document](../research/2026-03-14-epub-input-support-affected-files.md) — affected files analysis
- Similar implementation: `src/skill_seekers/cli/word_scraper.py` — closest analog
- Similar tests: `tests/test_word_scraper.py` — test pattern template