Adds EPUB as a first-class input source for skill generation.

- EpubToSkillConverter (epub_scraper.py, ~1200 lines) following the PDF scraper pattern
- Dublin Core metadata, spine items, code blocks, tables, images extraction
- DRM detection (Adobe ADEPT, Apple FairPlay, Readium LCP) with fail-fast
- EPUB 3 NCX TOC bug workaround (ignore_ncx=True)
- ebooklib as optional dep: pip install skill-seekers[epub]
- Wired into create command with .epub auto-detection
- 104 tests, all passing

Review fixes: removed 3 empty test stubs, fixed SVG double-counting in _extract_images(), added logger.debug to bare except/pass.

Based on PR #310 by @christianbaumann.
Co-authored-by: Christian Baumann <mail@chriss-baumann.de>
| date | git_commit | branch | topic | tags | status |
|---|---|---|---|---|---|
| 2026-03-14T19:30:35.172407+00:00 | 7c90a4b9c9bccac8341b0769550d77aae3b4e524 | development | Add EPUB Input Support | | complete |
Add EPUB Input Support — Implementation Plan
Overview
Add .epub as an input format for Skill Seekers, enabling skill-seekers create book.epub and skill-seekers epub --epub book.epub. Follows the established Word/PDF scraper pattern: source detection → routing → extraction → categorize → build skill.
Authoritative reference: W3C EPUB 3.3 Specification (also covers EPUB 2 backward compatibility).
Current State Analysis
The codebase has a consistent multi-layer architecture for document input formats. PDF and Word (.docx) serve as direct analogs. The Word scraper (word_scraper.py) is the closest pattern match since both Word and EPUB produce HTML/XHTML that is parsed with BeautifulSoup.
Key Discoveries:
- Word scraper converts `.docx` → HTML (via mammoth) → BeautifulSoup parse → intermediate JSON → SKILL.md (word_scraper.py:96-235)
- EPUB files contain XHTML natively (per W3C spec §5), so the mammoth conversion step is unnecessary — BeautifulSoup can parse EPUB XHTML content directly
- Source detection uses file extension matching (source_detector.py:57-65)
- Optional dependencies use a guard pattern with `try/except ImportError` and a `_check_*_deps()` function (word_scraper.py:21-40)
- The `ebooklib` library (v0.18+) provides `epub.read_epub()` returning an `EpubBook` with spine iteration, metadata access via `get_metadata('DC', key)`, and item content via `get_content()`/`get_body_content()`
- ebooklib has a known bug: EPUB 3 files read TOC from NCX instead of NAV (issue #200); workaround: `options={"ignore_ncx": True}`
- ebooklib loads the entire EPUB into memory — acceptable for typical books but relevant for edge cases
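The discoveries above can be sketched as a minimal spine reader. This assumes only the ebooklib calls named in the list (`read_epub`, `spine`, `get_item_with_id`, `get_content`) plus the same optional-import guard the Word scraper uses; the function name and error message are illustrative:

```python
import os

try:
    import ebooklib
    from ebooklib import epub
    EPUB_AVAILABLE = True
except ImportError:
    EPUB_AVAILABLE = False

def read_spine_documents(path):
    """Yield (item_id, xhtml_bytes) for each spine document, in reading order."""
    if not EPUB_AVAILABLE:
        raise RuntimeError('ebooklib is required: pip install "skill-seekers[epub]"')
    if not os.path.isfile(path):
        raise FileNotFoundError(path)
    # ignore_ncx=True works around ebooklib issue #200 (EPUB 3 TOC read from NCX)
    book = epub.read_epub(path, options={"ignore_ncx": True})
    for item_id, _linear in book.spine:
        item = book.get_item_with_id(item_id)
        if item is not None and item.get_type() == ebooklib.ITEM_DOCUMENT:
            yield item_id, item.get_content()
```

Either failure mode (missing dependency, missing file) surfaces before any parsing starts, matching the fail-fast approach used elsewhere in the plan.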
Desired End State
Running skill-seekers create book.epub produces:
output/book/
├── SKILL.md # Main skill file with metadata, concepts, code examples
├── references/
│ ├── index.md # Category index with statistics
│ └── book.md # Chapter content (or multiple files if categorized)
├── scripts/
└── assets/
└── *.png|*.jpg # Extracted images
CLI Output Mockup
$ skill-seekers create programming-rust.epub
ℹ️ Detected source type: epub
ℹ️ Routing to epub scraper...
🔍 Extracting from EPUB: programming-rust.epub
Title: Programming Rust, 2nd Edition
Author: Jim Blandy, Jason Orendorff
Language: en
Chapters: 23 (spine items)
📄 Processing chapters...
Chapter 1/23: Why Rust? (2 sections, 1 code block)
Chapter 2/23: A Tour of Rust (5 sections, 12 code blocks)
...
Chapter 23/23: Macros (4 sections, 8 code blocks)
📊 Extraction complete:
Sections: 142
Code blocks: 287 (Rust: 245, Shell: 28, TOML: 14)
Images: 34
Tables: 12
💾 Saved extracted data to: output/programming-rust_extracted.json
📋 Categorizing content...
✅ Created 1 category (single EPUB source)
- programming-rust: 142 sections
📝 Generating reference files...
Generated: output/programming-rust/references/programming-rust.md
Generated: output/programming-rust/references/index.md
✅ Skill built successfully: output/programming-rust/
📦 Next step: Package with: skill-seekers package output/programming-rust/
Verification:

- `skill-seekers create book.epub` produces valid output directory
- `skill-seekers epub --epub book.epub --name mybook` works standalone
- `skill-seekers create book.epub --dry-run` shows config without processing
- All ~2,540+ existing tests still pass (982 passed, 1 pre-existing failure)
- New test suite has 100+ tests covering happy path, errors, and edge cases (107 tests, 14 classes)
What We're NOT Doing
- DRM decryption (detect and error gracefully with clear message)
- EPUB writing/creation (read-only)
- Media overlay / audio / video extraction (ignore gracefully)
- Fixed-layout OCR (detect and warn; extract whatever text exists in XHTML)
- `--chapter-range` flag (can be added later)
- Unified scraper (`unified_scraper.py`) EPUB support (separate future task)
- MCP tool for EPUB (separate future task)
Implementation Approach
Follow the Word scraper pattern exactly, with EPUB-specific extraction logic:
- Phase 1: Core `epub_scraper.py` — the `EpubToSkillConverter` class
- Phase 2: CLI integration — source detection, arguments, parser, routing, entry points
- Phase 3: Comprehensive test suite — 100+ tests across 14 test classes
- Phase 4: Documentation updates
Phase 1: Core EPUB Scraper
Overview
Create epub_scraper.py with EpubToSkillConverter class following the Word scraper pattern. This is the bulk of new code.
Changes Required:
[x] 1. Optional dependency in pyproject.toml
File: pyproject.toml
Changes: Add epub optional dependency group and include in all group
# After the docx group (~line 115)
# EPUB (.epub) support
epub = [
"ebooklib>=0.18",
]
Add "ebooklib>=0.18", to the all group (~line 178).
[x] 2. Create src/skill_seekers/cli/epub_scraper.py
File: src/skill_seekers/cli/epub_scraper.py (new)
Changes: Full EPUB scraper module
Structure (following word_scraper.py pattern):
"""
EPUB Documentation to Skill Converter
Converts EPUB e-books into skills.
Uses ebooklib for EPUB parsing, BeautifulSoup for XHTML content extraction.
Usage:
skill-seekers epub --epub book.epub --name myskill
skill-seekers epub --from-json book_extracted.json
"""
import argparse
import json
import logging
import os
import re
import sys
from pathlib import Path
# Optional dependency guard
try:
import ebooklib
from ebooklib import epub
EPUB_AVAILABLE = True
except ImportError:
EPUB_AVAILABLE = False
# BeautifulSoup is a core dependency (always available)
from bs4 import BeautifulSoup, Comment
logger = logging.getLogger(__name__)
def _check_epub_deps():
"""Raise RuntimeError if ebooklib is not installed."""
if not EPUB_AVAILABLE:
raise RuntimeError(
"ebooklib is required for EPUB support.\n"
'Install with: pip install "skill-seekers[epub]"\n'
"Or: pip install ebooklib"
)
def infer_description_from_epub(metadata: dict | None = None, name: str = "") -> str:
"""Infer skill description from EPUB metadata."""
if metadata:
if metadata.get("description") and len(metadata["description"]) > 20:
desc = metadata["description"].strip()
if len(desc) > 150:
desc = desc[:147] + "..."
return f"Use when {desc.lower()}"
if metadata.get("title") and len(metadata["title"]) > 10:
return f"Use when working with {metadata['title'].lower()}"
return (
f"Use when referencing {name} documentation"
if name
else "Use when referencing this documentation"
)
EpubToSkillConverter class methods:
class EpubToSkillConverter:
def __init__(self, config: dict):
self.config = config
self.name = config["name"]
self.epub_path = config.get("epub_path", "")
self.description = config.get(
"description", f"Use when referencing {self.name} documentation"
)
self.skill_dir = f"output/{self.name}"
self.data_file = f"output/{self.name}_extracted.json"
self.categories = config.get("categories", {})
self.extracted_data = None
def extract_epub(self) -> bool:
"""Extract content from EPUB file.
Workflow:
1. Check dependencies (ebooklib)
2. Detect DRM via META-INF/encryption.xml (fail fast)
3. Read EPUB via ebooklib with ignore_ncx=True (EPUB 3 TOC bug workaround)
4. Extract Dublin Core metadata (title, creator, language, publisher, date, description, subject)
5. Iterate spine items in reading order
6. For each ITEM_DOCUMENT: parse XHTML with BeautifulSoup
7. Split by h1/h2 heading boundaries into sections
8. Extract code blocks from <pre>/<code> elements
9. Extract images from EpubImage items
10. Detect code languages via LanguageDetector
11. Save intermediate JSON to {name}_extracted.json
Returns True on success.
Raises RuntimeError for DRM-protected files.
Raises FileNotFoundError for missing files.
Raises ValueError for invalid EPUB files.
"""
DRM detection (per W3C spec §4.2.6.3.2):
def _detect_drm(self, book) -> bool:
"""Detect DRM by checking for encryption.xml with non-font-obfuscation entries.
Per W3C EPUB 3.3 spec: encryption.xml is present when resources are encrypted.
Font obfuscation (IDPF algorithm http://www.idpf.org/2008/embedding or
Adobe algorithm http://ns.adobe.com/pdf/enc#RC) is NOT DRM — only font mangling.
Actual DRM uses algorithms like:
- Adobe ADEPT: http://ns.adobe.com/adept namespace
- Apple FairPlay: http://itunes.apple.com/dataenc
- Readium LCP: http://readium.org/2014/01/lcp
"""
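The detection logic can be sketched as a whitelist check over `encryption.xml`: any `EncryptionMethod` whose `Algorithm` is not one of the two font-obfuscation URIs counts as DRM. This is a simplification of the docstring above (it flags any unknown algorithm rather than matching the vendor namespaces individually), and the function and constant names are hypothetical:

```python
from xml.etree import ElementTree as ET

# Algorithms that are only font obfuscation, not DRM (see docstring above)
FONT_OBFUSCATION_ALGORITHMS = {
    "http://www.idpf.org/2008/embedding",
    "http://ns.adobe.com/pdf/enc#RC",
}

def encryption_xml_indicates_drm(xml_bytes: bytes) -> bool:
    """True if encryption.xml declares any non-font-obfuscation encryption."""
    root = ET.fromstring(xml_bytes)
    for elem in root.iter():
        # Tags carry the xenc namespace, e.g. {http://www.w3.org/2001/04/xmlenc#}EncryptionMethod
        if elem.tag.endswith("}EncryptionMethod") or elem.tag == "EncryptionMethod":
            if elem.get("Algorithm", "") not in FONT_OBFUSCATION_ALGORITHMS:
                return True
    return False
```

An EPUB with no `encryption.xml` at all never reaches this check and is trivially DRM-free.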
Metadata extraction (per W3C spec §5.2, Dublin Core):
def _extract_metadata(self, book) -> dict:
"""Extract Dublin Core metadata from EPUB.
Per W3C EPUB 3.3 spec: required elements are dc:identifier, dc:title, dc:language.
Optional: dc:creator, dc:contributor, dc:date, dc:description, dc:publisher,
dc:subject, dc:rights, dc:type, dc:coverage, dc:source, dc:relation, dc:format.
ebooklib API: book.get_metadata('DC', key) returns list of (value, attrs) tuples.
"""
def _get_one(key):
data = book.get_metadata('DC', key)
return data[0][0] if data else None
def _get_list(key):
data = book.get_metadata('DC', key)
return [x[0] for x in data] if data else []
return {
"title": _get_one('title') or "Untitled",
"author": ", ".join(_get_list('creator')) or None,
"language": _get_one('language') or "en",
"publisher": _get_one('publisher'),
"date": _get_one('date'),
"description": _get_one('description'),
"subject": ", ".join(_get_list('subject')) or None,
"rights": _get_one('rights'),
"identifier": _get_one('identifier'),
}
Content extraction (per W3C spec §5 — XHTML Content Documents use XML serialization of HTML5):
def _extract_spine_content(self, book) -> list[dict]:
"""Extract content from spine items in reading order.
Per W3C EPUB 3.3 spec §3.4.8: spine defines ordered list of content documents.
Linear="yes" (default) items form the primary reading order.
Linear="no" items are auxiliary (footnotes, glossary).
Per spec §5: XHTML content documents use XML syntax of HTML5.
Parse with BeautifulSoup, split by h1/h2 heading boundaries.
"""
sections = []
section_number = 0
for item_id, linear in book.spine:
item = book.get_item_with_id(item_id)
if not item or item.get_type() != ebooklib.ITEM_DOCUMENT:
continue
soup = BeautifulSoup(item.get_content(), 'html.parser')
# Remove scripts, styles, comments (not useful for text extraction)
for tag in soup(['script', 'style']):
tag.decompose()
for comment in soup.find_all(string=lambda t: isinstance(t, Comment)):
comment.extract()
body = soup.find('body')
if not body:
continue
# Split by h1/h2 heading boundaries (same as word_scraper)
# Each heading starts a new section
...
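The elided split step reduces to a single pass over elements in document order: start a new section at each h1/h2, append everything else to the current section. A parser-agnostic sketch over `(tag, text)` pairs; the helper name and the synthetic "(untitled)" section for pre-heading content are assumptions:

```python
def split_by_headings(elements, boundary=("h1", "h2")):
    """elements: iterable of (tag, text) pairs in document order.
    Returns sections as dicts with heading, heading_level, and text."""
    sections = []
    current = None
    for tag, text in elements:
        if tag in boundary:
            # A boundary heading closes the previous section and opens a new one
            if current:
                sections.append(current)
            current = {"heading": text, "heading_level": tag, "text": ""}
        elif current is not None:
            current["text"] += text + "\n"
        else:
            # Content before the first heading gets a synthetic section
            current = {"heading": "(untitled)", "heading_level": "h1", "text": text + "\n"}
    if current:
        sections.append(current)
    return sections
```

Lower-level headings (h3 and below) fall into the `else` branch and stay inside the enclosing section, which matches the nested-headings behavior the test plan expects.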
Image extraction (per W3C spec §3.3 — core media types include JPEG, PNG, GIF, WebP, SVG):
def _extract_images(self, book) -> list[dict]:
"""Extract images from EPUB manifest.
Per W3C EPUB 3.3 spec §3.3: core image media types are
image/gif, image/jpeg, image/png, image/svg+xml, image/webp.
ebooklib API: book.get_items_of_type(ebooklib.ITEM_IMAGE)
returns EpubImage items with get_content() (bytes) and media_type.
SVG images (ITEM_VECTOR) handled separately.
"""
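Image filenames from different EPUB folders can collide when flattened into assets/ (the test plan covers this as test_image_filename_conflicts). A hypothetical disambiguation helper:

```python
import os

def unique_asset_path(assets_dir: str, filename: str, taken: set) -> str:
    """Return a path under assets_dir, appending _2, _3, ... on name collisions.
    `taken` accumulates the basenames already written."""
    base, ext = os.path.splitext(os.path.basename(filename))
    candidate, n = base + ext, 2
    while candidate in taken:
        candidate = f"{base}_{n}{ext}"
        n += 1
    taken.add(candidate)
    return os.path.join(assets_dir, candidate)
```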
The remaining methods (categorize_content, build_skill, _generate_reference_file, _generate_index, _generate_skill_md, _format_key_concepts, _format_patterns_from_content, _sanitize_filename) follow the Word scraper pattern exactly — they operate on the same intermediate JSON structure.
main() function (following word_scraper.py:923-1059):
def main():
from .arguments.epub import add_epub_arguments
parser = argparse.ArgumentParser(
description="Convert EPUB e-book to skill",
formatter_class=argparse.RawDescriptionHelpFormatter,
)
add_epub_arguments(parser)
args = parser.parse_args()
# Logging setup
if getattr(args, "quiet", False):
logging.getLogger().setLevel(logging.WARNING)
elif getattr(args, "verbose", False):
logging.getLogger().setLevel(logging.DEBUG)
# Dry run
if getattr(args, "dry_run", False):
source = args.epub or args.from_json or "(none)"
print(f"\n{'=' * 60}")
print("DRY RUN: EPUB Extraction")
print(f"{'=' * 60}")
print(f"Source: {source}")
print(f"Name: {getattr(args, 'name', None) or '(auto-detect)'}")
print(f"Enhance level: {getattr(args, 'enhance_level', 0)}")
print(f"\n✅ Dry run complete")
return
# Validate inputs
if not (args.epub or args.from_json):
parser.error("Must specify --epub or --from-json")
# From-JSON workflow
if args.from_json:
name = Path(args.from_json).stem.replace("_extracted", "")
config = {
"name": name,
"description": args.description or f"Use when referencing {name} documentation",
}
converter = EpubToSkillConverter(config)
converter.load_extracted_data(args.from_json)
converter.build_skill()
return
# Direct EPUB workflow
name = args.name or Path(args.epub).stem
config = {
"name": name,
"epub_path": args.epub,
"description": args.description or f"Use when referencing {name} documentation",
}
try:
converter = EpubToSkillConverter(config)
if not converter.extract_epub():
print("\n❌ EPUB extraction failed", file=sys.stderr)
sys.exit(1)
converter.build_skill()
# Enhancement workflow integration
from skill_seekers.cli.workflow_runner import run_workflows
run_workflows(args)
# Traditional enhancement
if getattr(args, "enhance_level", 0) > 0:
# Same pattern as word_scraper.py and pdf_scraper.py
...
except RuntimeError as e:
print(f"\n❌ Error: {e}", file=sys.stderr)
sys.exit(1)
Success Criteria:
Automated Verification:
- `ruff check src/skill_seekers/cli/epub_scraper.py` passes
- `ruff format --check src/skill_seekers/cli/epub_scraper.py` passes
- `mypy src/skill_seekers/cli/epub_scraper.py` passes (continue-on-error)
- `pip install -e ".[epub]"` installs successfully
Manual Verification:
- Verify `import ebooklib` works after install
- Review epub_scraper.py structure matches word_scraper.py pattern
Implementation Note: After completing this phase and all automated verification passes, pause here for manual confirmation from the human before proceeding to the next phase.
Phase 2: CLI Integration
Overview
Wire the EPUB scraper into the CLI: source detection, argument definitions, parser registration, create command routing, and entry points.
Changes Required:
[x] 1. Source detection
File: src/skill_seekers/cli/source_detector.py
Changes: Add .epub extension detection, _detect_epub() method, validation, and error message update
Add after the .docx check (line 64):
if source.endswith(".epub"):
return cls._detect_epub(source)
Add _detect_epub() method (following _detect_word() at line 124):
@classmethod
def _detect_epub(cls, source: str) -> SourceInfo:
"""Detect EPUB file source."""
name = os.path.splitext(os.path.basename(source))[0]
return SourceInfo(
type="epub", parsed={"file_path": source}, suggested_name=name, raw_input=source
)
Add epub validation in validate_source() (after word block at line 278):
elif source_info.type == "epub":
file_path = source_info.parsed["file_path"]
if not os.path.exists(file_path):
raise ValueError(f"EPUB file does not exist: {file_path}")
if not os.path.isfile(file_path):
raise ValueError(f"Path is not a file: {file_path}")
Add EPUB example to the ValueError message (line 94):
" EPUB: skill-seekers create ebook.epub\n"
[x] 2. Argument definitions
File: src/skill_seekers/cli/arguments/epub.py (new)
Changes: EPUB-specific argument definitions
"""EPUB-specific CLI arguments."""
import argparse
from typing import Any
from .common import add_all_standard_arguments
EPUB_ARGUMENTS: dict[str, dict[str, Any]] = {
"epub": {
"flags": ("--epub",),
"kwargs": {
"type": str,
"help": "Direct EPUB file path",
"metavar": "PATH",
},
},
"from_json": {
"flags": ("--from-json",),
"kwargs": {
"type": str,
"help": "Build skill from extracted JSON",
"metavar": "FILE",
},
},
}
def add_epub_arguments(parser: argparse.ArgumentParser) -> None:
"""Add EPUB-specific arguments to parser."""
add_all_standard_arguments(parser)
# Override enhance-level default to 0 for EPUB
for action in parser._actions:
if hasattr(action, "dest") and action.dest == "enhance_level":
action.default = 0
action.help = (
"AI enhancement level (auto-detects API vs LOCAL mode): "
"0=disabled (default for EPUB), 1=SKILL.md only, "
"2=+architecture/config, 3=full enhancement. "
"Mode selection: uses API if ANTHROPIC_API_KEY is set, "
"otherwise LOCAL (Claude Code)"
)
for arg_name, arg_def in EPUB_ARGUMENTS.items():
parser.add_argument(*arg_def["flags"], **arg_def["kwargs"])
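What the EPUB_ARGUMENTS dict expands to can be shown with a stripped-down parser; this sketch omits the shared flags that add_all_standard_arguments contributes:

```python
import argparse

# Equivalent of iterating EPUB_ARGUMENTS and calling add_argument per entry
parser = argparse.ArgumentParser(prog="skill-seekers epub")
parser.add_argument("--epub", type=str, metavar="PATH", help="Direct EPUB file path")
parser.add_argument("--from-json", type=str, metavar="FILE", help="Build skill from extracted JSON")

# argparse maps --from-json to args.from_json automatically
args = parser.parse_args(["--epub", "book.epub"])
```

Keeping the flag definitions in a dict (rather than inline `add_argument` calls) is what lets create.py reuse the same definitions for the unified `create` command.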
[x] 3. Create command argument integration
File: src/skill_seekers/cli/arguments/create.py
Changes: Add EPUB_ARGUMENTS dict, register in helper functions, add mode handling
Add after WORD_ARGUMENTS (~line 411):
# EPUB specific (from epub.py)
EPUB_ARGUMENTS: dict[str, dict[str, Any]] = {
"epub": {
"flags": ("--epub",),
"kwargs": {
"type": str,
"help": "EPUB file path",
"metavar": "PATH",
},
},
}
Add to get_source_specific_arguments() (line 595):
"epub": EPUB_ARGUMENTS,
Add to add_create_arguments() (after word block at line 678):
if mode in ["epub", "all"]:
for arg_name, arg_def in EPUB_ARGUMENTS.items():
parser.add_argument(*arg_def["flags"], **arg_def["kwargs"])
[x] 4. Parser class
File: src/skill_seekers/cli/parsers/epub_parser.py (new)
Changes: Subcommand parser for standalone skill-seekers epub command
"""Parser for epub subcommand."""
from .base import SubcommandParser
from skill_seekers.cli.arguments.epub import add_epub_arguments
class EpubParser(SubcommandParser):
"""Parser for EPUB extraction command."""
@property
def name(self) -> str:
return "epub"
@property
def help(self) -> str:
return "Extract from EPUB e-book (.epub)"
@property
def description(self) -> str:
return "Extract content from EPUB e-book (.epub) and generate skill"
def add_arguments(self, parser):
add_epub_arguments(parser)
[x] 5. Parser registration
File: src/skill_seekers/cli/parsers/__init__.py
Changes: Import and register EpubParser
Add import (after WordParser import, line 15):
from .epub_parser import EpubParser
Add to PARSERS list (after WordParser(), line 46):
EpubParser(),
[x] 6. CLI dispatcher
File: src/skill_seekers/cli/main.py
Changes: Add epub to COMMAND_MODULES dict and module docstring
Add to COMMAND_MODULES (after "word" entry, line 52):
"epub": "skill_seekers.cli.epub_scraper",
Add to module docstring (after "word" line, line 15):
# epub Extract from EPUB e-book (.epub)
[x] 7. Create command routing
File: src/skill_seekers/cli/create_command.py
Changes: Add _route_epub() method, routing case, help flag, and epilog example
Add to _route_to_scraper() (after word case, line 136):
elif self.source_info.type == "epub":
return self._route_epub()
Add _route_epub() method (after _route_word(), line 352):
def _route_epub(self) -> int:
"""Route to EPUB scraper (epub_scraper.py)."""
from skill_seekers.cli import epub_scraper
argv = ["epub_scraper"]
file_path = self.source_info.parsed["file_path"]
argv.extend(["--epub", file_path])
self._add_common_args(argv)
logger.debug(f"Calling epub_scraper with argv: {argv}")
original_argv = sys.argv
try:
sys.argv = argv
return epub_scraper.main()
finally:
sys.argv = original_argv
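The save/restore around `sys.argv` is a general delegation pattern; extracted as a standalone sketch (the helper name is invented):

```python
import sys

def call_with_argv(func, argv):
    """Run func() with sys.argv temporarily replaced, restoring it even on error."""
    saved = sys.argv
    try:
        sys.argv = argv
        return func()
    finally:
        sys.argv = saved
```

The `finally` block guarantees `sys.argv` is restored even when the delegated `main()` raises or calls `sys.exit()`, which is why `_route_epub()` uses the same shape.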
Add to epilog (line 543, after DOCX example):
EPUB: skill-seekers create ebook.epub
Add to Source Auto-Detection section:
• file.epub → EPUB extraction
Add --help-epub flag and handler (after --help-word at line 592):
parser.add_argument(
"--help-epub", action="store_true", help=argparse.SUPPRESS, dest="_help_epub"
)
Add handler block (after _help_word block at line 654):
elif args._help_epub:
parser_epub = argparse.ArgumentParser(
prog="skill-seekers create",
description="Create skill from EPUB e-book (.epub)",
formatter_class=argparse.RawDescriptionHelpFormatter,
)
add_create_arguments(parser_epub, mode="epub")
parser_epub.print_help()
return 0
[x] 8. Entry point
File: pyproject.toml
Changes: Add standalone entry point
Add after skill-seekers-word (line 224):
skill-seekers-epub = "skill_seekers.cli.epub_scraper:main"
[x] 9. Positional argument handling in main.py
File: src/skill_seekers/cli/main.py
Changes: none needed; "input_file" is already in the positional list at line 153. Verify _reconstruct_argv handles epub correctly through the standard delegation path.
Success Criteria:
Automated Verification:
- `ruff check src/skill_seekers/cli/source_detector.py src/skill_seekers/cli/arguments/epub.py src/skill_seekers/cli/parsers/epub_parser.py src/skill_seekers/cli/create_command.py` passes
- `ruff format --check src/skill_seekers/cli/` passes
- `pip install -e ".[epub]"` installs with all entry points
- `skill-seekers epub --help` shows EPUB-specific help
- `skill-seekers create --help-epub` shows EPUB arguments (via standalone entry point `skill-seekers-create`)
- `skill-seekers create nonexistent.epub` gives clear error about missing file
- Existing tests still pass: `pytest tests/ -v -x -m "not slow and not integration"` (875 passed, 1 pre-existing unrelated failure in test_git_sources_e2e)
Manual Verification:
- `skill-seekers --help` lists the `epub` command
- `skill-seekers create book.epub --dry-run` shows dry run output
Implementation Note: After completing this phase and all automated verification passes, pause here for manual confirmation from the human before proceeding to the next phase.
Phase 3: Comprehensive Test Suite
Overview
Create tests/test_epub_scraper.py with 100+ tests across 14 test classes, covering happy path, negative cases, edge cases, and CLI integration.
Changes Required:
[x] 1. Create test file
File: tests/test_epub_scraper.py (new)
Changes: Comprehensive test suite following test_word_scraper.py patterns
"""
Tests for EPUB scraper (epub_scraper.py).
Covers: initialization, extraction, categorization, skill building,
code blocks, tables, images, error handling, JSON workflow, CLI arguments,
helper functions, source detection, DRM detection, and edge cases.
Tests use mock data and do not require actual EPUB files or ebooklib installed.
"""
import json
import os
import shutil
import tempfile
import unittest
from pathlib import Path
from unittest.mock import MagicMock, patch, PropertyMock
# Conditional import (same pattern as test_word_scraper.py)
try:
from skill_seekers.cli.epub_scraper import (
EpubToSkillConverter,
infer_description_from_epub,
_score_code_quality,
_check_epub_deps,
EPUB_AVAILABLE,
)
IMPORT_OK = True
except ImportError:
IMPORT_OK = False
Helper factory function:
def _make_sample_extracted_data(
num_sections=2,
include_code=False,
include_tables=False,
include_images=False,
) -> dict:
"""Create minimal extracted_data dict for testing."""
sections = []
total_code = 0
total_images = 0
languages = {}
for i in range(1, num_sections + 1):
section = {
"section_number": i,
"heading": f"Chapter {i}",
"heading_level": "h1",
"text": f"Content of chapter {i}. This is sample text.",
"headings": [{"level": "h2", "text": f"Section {i}.1"}],
"code_samples": [],
"tables": [],
"images": [],
}
if include_code:
section["code_samples"] = [
{"code": f"def func_{i}():\n return {i}", "language": "python", "quality_score": 7.5},
{"code": f"console.log({i})", "language": "javascript", "quality_score": 4.0},
]
total_code += 2
languages["python"] = languages.get("python", 0) + 1
languages["javascript"] = languages.get("javascript", 0) + 1
if include_tables:
section["tables"] = [
{"headers": ["Name", "Value"], "rows": [["key", "val"]]}
]
if include_images:
section["images"] = [
{"index": 1, "data": b"\x89PNG\r\n\x1a\n", "width": 100, "height": 100}
]
total_images += 1
sections.append(section)
return {
"source_file": "test.epub",
"metadata": {
"title": "Test Book",
"author": "Test Author",
"language": "en",
"publisher": "Test Publisher",
"date": "2024-01-01",
"description": "A test book for unit testing",
"subject": "Testing, Unit Tests",
"rights": "Copyright 2024",
"identifier": "urn:uuid:12345",
},
"total_sections": num_sections,
"total_code_blocks": total_code,
"total_images": total_images,
"languages_detected": languages,
"pages": sections,
}
Test Classes and Methods:
[x] Class 1: TestEpubToSkillConverterInit (8 tests)
Happy path:
- `test_init_with_name_and_epub_path` — basic config with name + epub_path
- `test_init_with_full_config` — config with all fields (name, epub_path, description, categories)
- `test_default_description_uses_name` — description defaults to "Use when referencing {name} documentation"
- `test_skill_dir_uses_name` — skill_dir is `output/{name}`
- `test_data_file_uses_name` — data_file is `output/{name}_extracted.json`

Negative:

- `test_init_requires_name` — missing "name" key raises KeyError
- `test_init_empty_name` — empty string name still works (no crash)

Edge case:

- `test_init_with_special_characters_in_name` — name with spaces/dashes sanitized for paths
[x] Class 2: TestEpubExtraction (12 tests)
Happy path:
- `test_extract_basic_epub` — mock ebooklib, verify sections extracted in spine order
- `test_extract_metadata` — verify Dublin Core metadata extraction (title, creator, language, etc.)
- `test_extract_multiple_chapters` — multiple spine items produce multiple sections
- `test_extract_code_blocks` — `<pre><code>` elements extracted with language detection
- `test_extract_images` — ITEM_IMAGE items extracted with correct content
- `test_heading_boundary_splitting` — h1/h2 boundaries create new sections

Negative:

- `test_extract_missing_file_raises_error` — FileNotFoundError for nonexistent path
- `test_extract_invalid_epub_raises_error` — ValueError for corrupted/non-EPUB file
- `test_extract_deps_not_installed` — RuntimeError with install instructions when ebooklib missing

Edge cases:

- `test_extract_empty_spine` — EPUB with no spine items produces empty sections list
- `test_extract_spine_item_no_body` — XHTML without `<body>` tag skipped gracefully
- `test_extract_non_linear_spine_items` — linear="no" items still extracted (included but flagged)
[x] Class 3: TestEpubDrmDetection (6 tests)
Happy path:
- `test_no_drm_detected` — normal EPUB without encryption.xml returns False

Negative:

- `test_drm_detected_adobe_adept` — encryption.xml with Adobe namespace raises RuntimeError
- `test_drm_detected_apple_fairplay` — encryption.xml with Apple namespace raises RuntimeError
- `test_drm_detected_readium_lcp` — encryption.xml with Readium namespace raises RuntimeError

Edge cases:

- `test_font_obfuscation_not_drm` — encryption.xml with only the IDPF font obfuscation algorithm (http://www.idpf.org/2008/embedding) is NOT DRM, extraction proceeds
- `test_drm_error_message_is_clear` — error message mentions DRM and suggests removing protection
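The fixtures for these cases can be tiny inline encryption.xml documents. A font-obfuscation-only example (which must NOT be treated as DRM) might look like the following; the element layout follows the OCF container conventions, but the exact attributes and referenced font path are illustrative:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<encryption xmlns="urn:oasis:names:tc:opendocument:xmlns:container"
            xmlns:enc="http://www.w3.org/2001/04/xmlenc#">
  <enc:EncryptedData>
    <!-- IDPF font obfuscation only: not DRM, extraction should proceed -->
    <enc:EncryptionMethod Algorithm="http://www.idpf.org/2008/embedding"/>
    <enc:CipherData>
      <enc:CipherReference URI="Fonts/SomeFont.otf"/>
    </enc:CipherData>
  </enc:EncryptedData>
</encryption>
```

Swapping the Algorithm URI for an Adobe ADEPT, FairPlay, or LCP value turns the same fixture into a DRM-positive case.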
[x] Class 4: TestEpubCategorization (8 tests)
Happy path:
- `test_single_source_creates_one_category` — single EPUB creates category named after file
- `test_keyword_categorization` — sections matched to categories by keyword scoring
- `test_no_categories_uses_default` — no category config creates single "content" category

Negative:

- `test_categorize_empty_sections` — empty sections list produces empty categories
- `test_categorize_no_keyword_matches` — unmatched sections go to "other" category

Edge cases:

- `test_categorize_single_section` — one section creates one category
- `test_categorize_many_sections` — 50+ sections categorized correctly
- `test_categorize_preserves_section_order` — sections maintain original order within categories
[x] Class 5: TestEpubSkillBuilding (10 tests)
Happy path:
- `test_build_creates_directory_structure` — output/{name}/, references/, scripts/, assets/ created
- `test_build_generates_skill_md` — SKILL.md created with YAML frontmatter
- `test_build_generates_reference_files` — reference markdown files created per category
- `test_build_generates_index` — references/index.md created with category links
- `test_skill_md_contains_metadata` — SKILL.md includes title, author, language from metadata
- `test_skill_md_yaml_frontmatter` — frontmatter has name and description fields

Negative:

- `test_build_without_extracted_data_fails` — calling build_skill() before extraction raises error

Edge cases:

- `test_build_overwrites_existing_output` — re-running build overwrites existing files
- `test_build_with_long_name` — name > 64 chars truncated in YAML frontmatter
- `test_build_with_unicode_content` — Unicode text (CJK, Arabic, emoji) preserved correctly
[x] Class 6: TestEpubCodeBlocks (8 tests)
Happy path:
- `test_code_blocks_included_in_reference_files` — code samples appear in reference markdown
- `test_code_blocks_in_skill_md_top_15` — SKILL.md shows top 15 code examples by quality
- `test_code_language_grouped` — code examples grouped by language in SKILL.md

Edge cases:

- `test_empty_code_block` — `<pre><code></code></pre>` with no content skipped
- `test_code_block_with_html_entities` — `&lt;`, `&gt;`, `&amp;` decoded to `<`, `>`, `&`
- `test_code_block_with_syntax_highlighting_spans` — `<span class="keyword">` stripped, plain text preserved
- `test_code_block_language_from_class` — `class="language-python"`, `class="code-rust"` detected
- `test_code_quality_scoring` — scoring heuristic produces expected ranges (0-10)
[x] Class 7: TestEpubTables (5 tests)
Happy path:
- `test_tables_in_reference_files` — tables rendered as markdown in reference files
- `test_table_with_headers` — headers from `<thead>` used correctly

Edge cases:

- `test_table_no_thead` — first row used as headers when no `<thead>`
- `test_empty_table` — empty `<table>` element handled gracefully
- `test_table_with_colspan_rowspan` — complex tables don't crash (data may be imperfect)
[x] Class 8: TestEpubImages (7 tests)
Happy path:
- `test_images_saved_to_assets` — image bytes written to assets/ directory
- `test_image_references_in_markdown` — markdown image references correct

Negative:

- `test_image_with_zero_bytes` — empty image content skipped

Edge cases:

- `test_svg_images_handled` — SVG items (ITEM_VECTOR) extracted or skipped gracefully
- `test_image_filename_conflicts` — duplicate filenames disambiguated
- `test_cover_image_identified` — cover image (ITEM_COVER) extracted
- `test_many_images` — 100+ images extracted without error
[x] Class 9: TestEpubErrorHandling (10 tests)
Negative / error cases:
- `test_missing_epub_file_raises_error` — FileNotFoundError for nonexistent path
- `test_not_a_file_raises_error` — ValueError when path is a directory
- `test_not_epub_extension_raises_error` — ValueError for .txt, .pdf, .doc files
- `test_corrupted_zip_raises_error` — ValueError or RuntimeError for corrupted ZIP
- `test_missing_container_xml` — ValueError for ZIP without META-INF/container.xml
- `test_missing_opf_file` — ValueError when container.xml points to nonexistent OPF
- `test_drm_protected_raises_error` — RuntimeError with clear DRM message
- `test_empty_epub_raises_error` — ValueError for EPUB with no content documents
- `test_ebooklib_not_installed_error` — RuntimeError with install instructions
- `test_malformed_xhtml_handled_gracefully` — unclosed tags, invalid entities don't crash (BeautifulSoup tolerant parsing)
[x] Class 10: TestEpubJSONWorkflow (6 tests)
Happy path:
- `test_load_extracted_json` — load previously extracted JSON
- `test_build_from_json` — full workflow: load JSON → categorize → build
- `test_json_round_trip` — extract → save JSON → load JSON → build produces same output

Negative:

- `test_load_invalid_json` — malformed JSON raises appropriate error
- `test_load_nonexistent_json` — FileNotFoundError for missing file

Edge case:

- `test_json_with_missing_fields` — partial JSON (missing optional fields) still works
[x] Class 11: TestEpubCLIArguments (8 tests)
Happy path:
- `test_epub_flag_accepted` — `--epub path.epub` parsed correctly
- `test_from_json_flag_accepted` — `--from-json data.json` parsed correctly
- `test_name_flag_accepted` — `--name mybook` parsed correctly
- `test_enhance_level_default_zero` — enhance-level defaults to 0 for EPUB
- `test_dry_run_flag` — `--dry-run` flag parsed correctly

Negative:

- `test_no_args_shows_error` — no `--epub` or `--from-json` shows error

Integration:

- `test_verbose_flag` — `--verbose` accepted
- `test_quiet_flag` — `--quiet` accepted
[x] Class 12: TestEpubHelperFunctions (6 tests)
- `test_infer_description_from_metadata_description` — uses description field
- `test_infer_description_from_metadata_title` — falls back to title
- `test_infer_description_fallback` — falls back to name-based template
- `test_infer_description_empty_metadata` — empty dict returns fallback
- `test_score_code_quality_ranges` — scoring returns 0-10
- `test_sanitize_filename` — special characters cleaned
[x] Class 13: TestEpubSourceDetection (6 tests)
- `test_epub_detected_as_epub_type` — `.epub` extension detected correctly
- `test_epub_suggested_name` — filename stem used as suggested name
- `test_epub_validation_missing_file` — validation raises ValueError for missing file
- `test_epub_validation_not_a_file` — validation raises ValueError for directory
- `test_epub_with_path` — `./books/test.epub` detected with correct file_path
- `test_pdf_still_detected` — regression test: `.pdf` still detected as pdf type
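These cases reduce to extension matching plus path validation. A sketch of the detection layer — function and field names here are assumptions mirroring the plan's source-detection description, not the actual API:

```python
from pathlib import Path

# Extension → source type mapping (subset; the real table covers more formats).
EXTENSION_TYPES = {".epub": "epub", ".pdf": "pdf", ".docx": "word"}


def detect_source(path_str: str) -> dict:
    """Map a file path to a source type and a suggested skill name."""
    path = Path(path_str)
    source_type = EXTENSION_TYPES.get(path.suffix.lower())
    if source_type is None:
        raise ValueError(f"Unsupported input format: {path.suffix}")
    return {
        "type": source_type,
        "file_path": str(path),
        "suggested_name": path.stem,
    }


def validate_source(path_str: str) -> None:
    """Fail fast with ValueError for missing paths and non-files."""
    path = Path(path_str)
    if not path.exists():
        raise ValueError(f"File not found: {path}")
    if not path.is_file():
        raise ValueError(f"Not a file: {path}")
```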
[x] Class 14: TestEpubEdgeCases (8 tests)
Per W3C EPUB 3.3 spec edge cases:
- `test_epub2_vs_epub3` — both versions parse successfully (ebooklib handles both)
- `test_epub_no_toc` — EPUB without table of contents extracts using spine order
- `test_epub_empty_chapters` — chapters with no text content skipped gracefully
- `test_epub_single_chapter` — book with one spine item produces valid output
- `test_epub_unicode_content` — CJK, Arabic, Cyrillic, emoji text preserved
- `test_epub_large_section_count` — 100+ sections processed without error
- `test_epub_nested_headings` — h3/h4/h5/h6 become sub-headings within sections
- `test_fixed_layout_detected` — fixed-layout EPUB produces warning but still extracts text
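For the nested-headings case, the grouping can be demonstrated with the standard library alone — a sketch of the idea (the real scraper parses with BeautifulSoup, and the top-level cutoff of h2 is an assumption):

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect (level, text) pairs for h1-h6 from an XHTML content document."""

    def __init__(self):
        super().__init__()
        self._level = None
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1].isdigit():
            self._level = int(tag[1])

    def handle_data(self, data):
        if self._level is not None and data.strip():
            self.headings.append((self._level, data.strip()))

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            self._level = None


def split_sections(xhtml: str, top_level: int = 2) -> list:
    """Headings at or above top_level start sections; deeper ones nest inside."""
    collector = HeadingCollector()
    collector.feed(xhtml)
    sections = []
    for level, text in collector.headings:
        if level <= top_level or not sections:
            sections.append({"title": text, "subheadings": []})
        else:
            sections[-1]["subheadings"].append(text)
    return sections
```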
Total: ~108 test methods across 14 classes
Success Criteria:
Automated Verification:
- `pytest tests/test_epub_scraper.py -v` — all 107 tests pass
- `pytest tests/ -v -x -m "not slow and not integration"` — 982 passed (1 pre-existing unrelated failure in test_git_sources_e2e)
- `ruff check tests/test_epub_scraper.py` passes
- `ruff format --check tests/test_epub_scraper.py` passes
- Test count >= 100 methods (107 tests across 14 classes)
Manual Verification:
- Review test coverage includes: happy path, negative, edge cases, CLI, source detection, DRM, JSON workflow
- Verify no tests require actual EPUB files or ebooklib installed (all use mocks/skipTest guards)
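The no-ebooklib-required guarantee is typically enforced with a module-level guard; a sketch of one way to do it (the actual suite may use `pytest.importorskip` instead):

```python
import importlib.util
import unittest

# True only when the optional [epub] extra is installed.
HAS_EBOOKLIB = importlib.util.find_spec("ebooklib") is not None


class EpubTestCase(unittest.TestCase):
    """Base class: tests that need the real library skip rather than fail."""

    def require_ebooklib(self):
        if not HAS_EBOOKLIB:
            self.skipTest('ebooklib not installed: pip install "skill-seekers[epub]"')
```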
Implementation Note: After completing this phase and all automated verification passes, pause here for manual confirmation from the human before proceeding to the next phase.
Phase 4: Documentation
Overview
Update CLAUDE.md and CHANGELOG.md to reflect the new EPUB support.
Changes Required:
[x] 1. Update CLAUDE.md
File: CLAUDE.md
Changes:
Add to Commands section (after pdf line):
skill-seekers epub --epub book.epub --name myskill
Add to "Unified create" examples:
skill-seekers create book.epub
Add to Key source files table:
| Core scraping | `cli/epub_scraper.py` |
Add to "Adding things → New create command flags" section:
- Source-specific → `EPUB_ARGUMENTS`
[x] 2. Update CHANGELOG.md
File: CHANGELOG.md
Changes: Add entry for EPUB support under next version
### Added
- EPUB (.epub) input support via `skill-seekers create book.epub` or `skill-seekers epub --epub book.epub`
- Extracts chapters, metadata, code blocks, images, and tables from EPUB 2 and EPUB 3 files
- DRM detection with clear error messages
- Optional dependency: `pip install "skill-seekers[epub]"`
Success Criteria:
Automated Verification:
- `ruff check` passes on any modified files
- `pytest tests/ -v -x -m "not slow and not integration"` — all tests still pass (982 passed, 1 pre-existing failure)
Manual Verification:
- CLAUDE.md accurately reflects new commands
- CHANGELOG.md entry is clear and complete
Implementation Note: After completing this phase and all automated verification passes, pause here for manual confirmation from the human before proceeding.
Testing Strategy
Unit Tests (Phase 3 — ~108 tests):
By category:
| Category | Count | What's tested |
|---|---|---|
| Initialization | 8 | Config parsing, defaults, edge cases |
| Extraction | 12 | Spine iteration, metadata, headings, code, images |
| DRM detection | 6 | Adobe, Apple, Readium, font obfuscation (not DRM) |
| Categorization | 8 | Single/multi category, keywords, empty, ordering |
| Skill building | 10 | Directory structure, SKILL.md, references, index |
| Code blocks | 8 | Extraction, quality, language detection, HTML entities |
| Tables | 5 | Headers, no-thead fallback, empty, colspan |
| Images | 7 | Save, references, SVG, conflicts, cover, many |
| Error handling | 10 | Missing file, corrupt, DRM, no deps, malformed XHTML |
| JSON workflow | 6 | Load, build, round-trip, invalid, missing fields |
| CLI arguments | 8 | Flags, defaults, dry-run, verbose/quiet |
| Helper functions | 6 | Description inference, quality scoring, filename sanitization |
| Source detection | 6 | Detection, validation, regression |
| Edge cases | 8 | EPUB 2/3, no TOC, empty chapters, Unicode, fixed-layout |
Integration Tests:
- Full extract → categorize → build workflow with mock ebooklib
- JSON round-trip (extract → save → load → build)
Manual Testing Steps:
1. `pip install -e ".[epub]"` — verify install
2. `skill-seekers create book.epub` with a real EPUB file — verify output directory structure
3. `skill-seekers epub --epub book.epub --dry-run` — verify dry run output
4. `skill-seekers create drm-book.epub` — verify DRM error message
5. `skill-seekers create nonexistent.epub` — verify file-not-found error
6. Open generated `SKILL.md` — verify content quality and structure
Performance Considerations
- ebooklib loads entire EPUB into memory. For typical books (<50MB), this is fine
- For very large EPUBs (100MB+), memory usage may spike. No mitigation needed for v1 — document as known limitation
- BeautifulSoup parsing of XHTML is fast. No performance concerns expected
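If a size guard ever becomes necessary, the uncompressed payload can be measured cheaply from the ZIP central directory before ebooklib loads anything — a sketch of a possible future check, explicitly not part of v1:

```python
import zipfile


def uncompressed_size_mb(epub_path: str) -> float:
    """Sum declared uncompressed sizes without extracting any entries."""
    with zipfile.ZipFile(epub_path) as zf:
        return sum(info.file_size for info in zf.infolist()) / (1024 * 1024)
```

The `file_size` field comes from the archive's own directory, so this is O(entries) and reads no content bytes.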
Migration Notes
- No migration needed — this is a new feature with no existing data to migrate
- Optional dependency (`ebooklib`) means existing installs are unaffected
- No breaking changes to any existing commands or APIs
References
- W3C EPUB 3.3 Specification — authoritative source of truth
- W3C EPUB Reading Systems 3.3 — reading system requirements
- ebooklib GitHub — Python EPUB library
- ebooklib PyPI — v0.20, Python 3.9-3.13
- Research document — affected files analysis
- Similar implementation: `src/skill_seekers/cli/word_scraper.py` — closest analog
- Similar tests: `tests/test_word_scraper.py` — test pattern template