yusyus
|
2e30970dfb
|
feat: add EPUB input support (#310)
Adds EPUB as a first-class input source for skill generation.
- EpubToSkillConverter (epub_scraper.py, ~1200 lines) following PDF scraper pattern
- Dublin Core metadata, spine items, code blocks, tables, images extraction
- DRM detection (Adobe ADEPT, Apple FairPlay, Readium LCP) with fail-fast
- EPUB 3 NCX TOC bug workaround (ignore_ncx=True)
- ebooklib as optional dep: pip install skill-seekers[epub]
- Wired into create command with .epub auto-detection
- 104 tests, all passing
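The fail-fast DRM check above can be pictured roughly as follows. This is a minimal sketch, not the code in epub_scraper.py: it assumes detection relies on well-known marker files inside the archive's META-INF directory, and the names `DRM_MARKERS`, `detect_drm`, and `make_epub` are hypothetical.

```python
import io
import zipfile

# Hypothetical marker table: archive member -> DRM scheme it usually indicates.
DRM_MARKERS = {
    "META-INF/rights.xml": "Adobe ADEPT",
    "META-INF/sinf.xml": "Apple FairPlay",
    "META-INF/license.lcpl": "Readium LCP",
}


def detect_drm(epub_file):
    """Return the first DRM scheme whose marker file is present, else None."""
    with zipfile.ZipFile(epub_file) as archive:
        names = set(archive.namelist())
    for marker, scheme in DRM_MARKERS.items():
        if marker in names:
            return scheme
    return None


def make_epub(extra_files):
    """Build a tiny in-memory EPUB-like archive for demonstration purposes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", "application/epub+zip")
        for name in extra_files:
            archive.writestr(name, "")
    buffer.seek(0)
    return buffer
```

A caller would raise or exit as soon as `detect_drm()` returns a scheme name, before any content extraction starts. A real check may also need to inspect META-INF/encryption.xml, since that file alone can indicate harmless font obfuscation rather than DRM.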
Review fixes: removed 3 empty test stubs, fixed SVG double-counting in
_extract_images(), added logger.debug to bare except pass.
Based on PR #310 by @christianbaumann.
Co-authored-by: Christian Baumann <mail@chriss-baumann.de>
|
2026-03-15 02:34:41 +03:00 |
|
yusyus
|
4b19cf4836
|
style: ruff format 4 video pipeline files
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
2026-03-01 19:48:02 +03:00 |
|
YusufKaraaslanSpyke
|
62071c4aa9
|
feat: add video tutorial scraping pipeline with per-panel OCR and AI enhancement
Add complete video tutorial extraction system that converts YouTube videos
and local video files into AI-consumable skills. The pipeline extracts
transcripts, performs visual OCR on code editor panels independently,
tracks code evolution across frames, and generates structured SKILL.md output.
Key features:
- Video metadata extraction (YouTube, local files, playlists)
- Multi-source transcript extraction (YouTube API, yt-dlp, Whisper fallback)
- Chapter-based and time-window segmentation
- Visual extraction: keyframe detection, frame classification, panel detection
- Per-panel sub-section OCR (each IDE panel OCR'd independently)
- Parallel OCR with ThreadPoolExecutor for multi-panel frames
- Narrow panel filtering (300px min width) to skip UI chrome
- Text block tracking with spatial panel position matching
- Code timeline with edit tracking across frames
- Audio-visual alignment (code + narrator pairs)
- Video-specific AI enhancement prompt for OCR denoising and code reconstruction
- video-tutorial.yaml workflow with 4 stages (OCR cleanup, language detection,
tutorial synthesis, skill polish)
- CLI integration: skill-seekers video --url/--video-file/--playlist
- MCP tool: scrape_video for automation
- 161 tests passing
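The per-panel OCR step with narrow-panel filtering could look roughly like the sketch below. Only the 300px minimum width and the use of ThreadPoolExecutor come from the list above; the `Panel` class, `fake_ocr`, and `ocr_panels` are hypothetical stand-ins, and a real pipeline would call an actual OCR engine on the cropped frame region.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

MIN_PANEL_WIDTH = 300  # pixels; narrower panels are treated as UI chrome


@dataclass
class Panel:
    x: int
    y: int
    width: int
    height: int


def fake_ocr(panel):
    """Stand-in for a real OCR call on the cropped panel region."""
    return f"text@{panel.x},{panel.y}"


def ocr_panels(panels, ocr=fake_ocr, max_workers=4):
    """OCR every sufficiently wide panel in parallel, preserving panel order."""
    wide = [p for p in panels if p.width >= MIN_PANEL_WIDTH]
    if not wide:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as the inputs.
        texts = list(pool.map(ocr, wide))
    return list(zip(wide, texts))
```

Threads suit this step when the OCR call releases the GIL or waits on an external process; the ordering guarantee of `map` keeps each text paired with its panel.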
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
2026-02-27 23:10:19 +03:00 |
|
yusyus
|
b81d55fda0
|
feat(B2): add Microsoft Word (.docx) support
Implements ROADMAP task B2 — full .docx scraping support via mammoth +
python-docx, producing SKILL.md + references/ output identical to other
source types.
New files:
- src/skill_seekers/cli/word_scraper.py — WordToSkillConverter class +
main() entry point (~600 lines); mammoth → BeautifulSoup pipeline;
handles headings, code detection (incl. monospace <p><br> blocks),
tables, images, metadata extraction
- src/skill_seekers/cli/arguments/word.py — add_word_arguments() +
WORD_ARGUMENTS dict
- src/skill_seekers/cli/parsers/word_parser.py — WordParser for unified
CLI parser registry
- tests/test_word_scraper.py — comprehensive test suite (~300 lines)
Modified files:
- src/skill_seekers/cli/main.py — registered "word" command module
- src/skill_seekers/cli/source_detector.py — .docx auto-detection +
_detect_word() classmethod
- src/skill_seekers/cli/create_command.py — _route_word() + --help-word
- src/skill_seekers/cli/arguments/create.py — WORD_ARGUMENTS + routing
- src/skill_seekers/cli/arguments/__init__.py — export word args
- src/skill_seekers/cli/parsers/__init__.py — register WordParser
- src/skill_seekers/cli/unified_scraper.py — _scrape_word() integration
- src/skill_seekers/cli/pdf_scraper.py — fix: real enhancement instead
of stub; remove [:3] reference file limit; capture run_workflows return
- src/skill_seekers/cli/github_scraper.py — fix: remove arbitrary
open_issues[:20] / closed_issues[:10] reference file limits
- pyproject.toml — skill-seekers-word entry point + docx optional dep
- tests/test_cli_parsers.py — update parser count 21→22
Bug fixes applied during real-world testing:
- Code detection: detect monospace <p><br> blocks as code (mammoth
renders Courier paragraphs this way, not as <pre>/<code>)
- Language detector: fix wrong method name detect_from_text →
detect_from_code
- Description inference: pass None from main() so extract_docx() can
infer description from Word document subject/title metadata
- Bullet-point guard: exclude prose starting with •/-/* from code scoring
- Enhancement: implement real API/LOCAL enhancement (was stub)
- pip install message: add quotes around skill-seekers[docx]
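The bullet-point guard can be illustrated with a small sketch. The scoring heuristic here is invented for demonstration and is not the one in word_scraper.py; `is_bullet_prose` and `code_score` are hypothetical names.

```python
import re

# A line that starts with a bullet character followed by whitespace and text.
BULLET_PREFIX = re.compile(r"^\s*[•\-*]\s+\S")

# Hypothetical tokens that hint at source code.
CODE_HINTS = ("def ", "import ", "{", "}", ";", "=", "(", ")")


def is_bullet_prose(text):
    """Return True when the text looks like a bullet-list item."""
    return bool(BULLET_PREFIX.match(text))


def code_score(text):
    """Count code-like tokens, but score bullet-list prose as zero."""
    if is_bullet_prose(text):
        return 0
    return sum(1 for hint in CODE_HINTS if hint in text)
```

Requiring whitespace after the bullet character keeps tokens such as `*args` or `-1` eligible for scoring, while "• Install the package (once)" is excluded despite containing parentheses.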
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-02-25 21:47:30 +03:00 |
|
yusyus
|
57061b7daf
|
style: Auto-format 48 files with ruff format
- Fixed formatting to comply with ruff standards
- No functional changes, only formatting/style
- Completes CI/CD pipeline formatting requirements
|
2026-02-15 20:24:32 +03:00 |
|
yusyus
|
83b03d9f9f
|
fix: Resolve all linting errors from ruff
Fix 145 linting errors across CLI refactor code:
Type annotation modernization (Python 3.9+):
- Replace typing.Dict with dict
- Replace typing.List with list
- Replace typing.Set with set
- Replace Optional[X] with X | None
Code quality improvements:
- Remove trailing whitespace (W291)
- Remove whitespace from blank lines (W293)
- Remove unused imports (F401)
- Use dictionary lookup instead of if-elif chains (SIM116)
- Combine nested if statements (SIM102)
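The two simplification rules can be sketched as follows. The function and constant names are hypothetical and only show the shape of the rewrite: a dictionary lookup in place of an if-elif chain (SIM116), and a single combined condition in place of nested ifs (SIM102).

```python
# SIM116: a mapping replaces a chain of `if ext == ...: return ...` branches.
COMMAND_BY_EXTENSION = {
    ".pdf": "pdf",
    ".docx": "word",
    ".epub": "epub",
}


def command_for(extension):
    """Look up the command name for a file extension, defaulting to 'local'."""
    return COMMAND_BY_EXTENSION.get(extension.lower(), "local")


def describe(level, verbose):
    """Build a status message for the given enhancement level."""
    message = "enhancement off"
    # SIM102: previously `if level > 0:` with a nested `if verbose:`.
    if level > 0 and verbose:
        message = f"enhancement level {level}"
    return message
```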
Files fixed (45 files):
- src/skill_seekers/cli/arguments/*.py (10 files)
- src/skill_seekers/cli/parsers/*.py (24 files)
- src/skill_seekers/cli/presets/*.py (4 files)
- src/skill_seekers/cli/create_command.py
- src/skill_seekers/cli/source_detector.py
- src/skill_seekers/cli/github_scraper.py
- tests/test_*.py (5 test files)
All files now pass ruff linting checks.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
2026-02-15 20:20:55 +03:00 |
|
yusyus
|
ba1670a220
|
feat: Unified create command + consolidated enhancement flags
This commit includes two major improvements and one bug fix:
## 1. Unified Create Command (v3.0.0 feature)
- Auto-detects source type (web, GitHub, local, PDF, config)
- Three-tier argument organization (universal, source-specific, advanced)
- Routes to existing scrapers (100% backward compatible)
- Progressive disclosure: 15 universal flags in default help
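Source auto-detection might be sketched like this. The rules below are assumptions made for illustration (in particular that a `.json` path means a config file), and `detect_source_type` is a hypothetical name rather than the API of source_detector.py.

```python
from pathlib import Path

# Hypothetical mapping from file suffix to source type.
TYPE_BY_SUFFIX = {
    ".pdf": "pdf",
    ".json": "config",
}


def detect_source_type(source):
    """Classify a source string as web, github, pdf, config, or local."""
    if source.startswith(("http://", "https://")):
        return "github" if "github.com/" in source else "web"
    suffix = Path(source).suffix.lower()
    # Anything that is not a URL or a known file type is treated as a local path.
    return TYPE_BY_SUFFIX.get(suffix, "local")
```

Keeping detection in one small function lets the create command route to the existing scrapers without changing how they are invoked.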
**New files:**
- src/skill_seekers/cli/source_detector.py - Auto-detection logic
- src/skill_seekers/cli/arguments/create.py - Argument definitions
- src/skill_seekers/cli/create_command.py - Main orchestrator
- src/skill_seekers/cli/parsers/create_parser.py - Parser integration
**Tests:**
- tests/test_source_detector.py (35 tests)
- tests/test_create_arguments.py (30 tests)
- tests/test_create_integration_basic.py (10 tests)
## 2. Enhancement Flag Consolidation (Phase 1)
- Consolidated 3 flags (--enhance, --enhance-local, --enhance-level) → 1 flag
- --enhance-level 0-3 with auto-detection of API vs LOCAL mode
- Default: --enhance-level 2 (balanced enhancement)
**Modified files:**
- arguments/{common,create,scrape,github,analyze}.py - Added enhance_level
- {doc_scraper,github_scraper,config_extractor,main}.py - Updated logic
- create_command.py - Uses consolidated flag
**Auto-detection:**
- If ANTHROPIC_API_KEY set → API mode
- Else → LOCAL mode (Claude Code)
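The mode selection described above amounts to a single environment check. A minimal sketch, assuming the decision depends only on whether `ANTHROPIC_API_KEY` is set to a non-empty value; the function name `detect_enhance_mode` is hypothetical.

```python
import os


def detect_enhance_mode(environ=None):
    """Return 'API' when ANTHROPIC_API_KEY is set and non-empty, else 'LOCAL'."""
    environ = os.environ if environ is None else environ
    return "API" if environ.get("ANTHROPIC_API_KEY") else "LOCAL"
```

Passing the environment as an argument keeps the function easy to test without touching the real process environment.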
## 3. PresetManager Bug Fix
- Fixed module naming conflict (presets.py vs presets/ directory)
- Moved presets.py → presets/manager.py
- Updated __init__.py exports
**Test Results:**
- All 160+ tests passing
- Zero regressions
- 100% backward compatible
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
2026-02-15 14:29:19 +03:00 |
|