Video Source Support — Master Plan
**Date:** February 27, 2026
**Feature ID:** V1.0
**Status:** Planning
**Priority:** High
**Estimated Complexity:** Large (multi-sprint feature)
Table of Contents
- Executive Summary
- Motivation & Goals
- Scope
- Plan Documents Index
- High-Level Architecture
- Implementation Phases
- Dependencies
- Risk Assessment
- Success Criteria
Executive Summary
Add video as a first-class source type in Skill Seekers, alongside web documentation, GitHub repositories, PDF files, and Word documents. Videos contain a massive amount of knowledge — conference talks, official tutorials, live coding sessions, architecture walkthroughs — that is currently inaccessible to our pipeline.
The video source will use a 3-stream parallel extraction model:
| Stream | What | Tool |
|---|---|---|
| ASR (Automatic Speech Recognition) | Spoken words → timestamped text | youtube-transcript-api + faster-whisper |
| OCR (Optical Character Recognition) | On-screen code/slides/diagrams → text | PySceneDetect + OpenCV + easyocr |
| Metadata | Title, chapters, tags, description | yt-dlp Python API |
These three streams are aligned on a shared timeline and merged into structured VideoSegment objects — the fundamental output unit. Segments are then categorized, converted to reference markdown files, and integrated into SKILL.md just like any other source.
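As a rough sketch of that output unit — the field names here are assumptions; the authoritative definitions live in 02_VIDEO_DATA_MODELS.md:

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class VideoSegment:
    """One time-aligned unit of extracted video knowledge (illustrative sketch).

    Field names are assumptions; see 02_VIDEO_DATA_MODELS.md for the real model.
    """

    start: float                  # seconds from video start
    end: float                    # seconds from video start
    transcript: str               # ASR text falling inside this window
    ocr_text: list[str] = field(default_factory=list)  # OCR hits from keyframes
    chapter: str | None = None    # chapter title from metadata, if any
    category: str | None = None   # assigned during categorization

    @property
    def duration(self) -> float:
        return self.end - self.start
```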
Motivation & Goals
Why Video?
- Knowledge density — A 30-minute conference talk can contain the equivalent of a 5,000-word blog post, plus live code demos that never appear in written docs.
- Official tutorials — Many frameworks (React, Flutter, Unity, Godot) have official video tutorials that are the canonical learning resource.
- Code walkthroughs — Screen-recorded coding sessions show real patterns, debugging workflows, and architecture decisions that written docs miss.
- Conference talks — JSConf, PyCon, GopherCon, etc. contain deep technical insights from framework authors.
- Completeness — Skill Seekers aims to be the universal documentation preprocessor. Video is the last major content type we don't support.
Goals
- G1: Extract structured, time-aligned knowledge from YouTube videos, playlists, channels, and local video files.
- G2: Integrate video as a first-class source in the unified config system (multiple video sources per skill, alongside docs/github/pdf).
- G3: Auto-detect video sources in the `create` command (YouTube URLs, video file extensions).
- G4: Support two tiers: lightweight (transcript + metadata only) and full (+ visual extraction with OCR).
- G5: Produce output that is indistinguishable in quality from other source types — properly categorized reference files integrated into SKILL.md.
- G6: Make visual extraction (Whisper, OCR) available as optional add-on dependencies, keeping core install lightweight.
Non-Goals (explicitly out of scope for V1.0)
- Real-time / live stream processing
- Video generation or editing
- Speaker diarization (identifying who said what) — future enhancement
- Automatic video discovery (e.g., "find all React tutorials on YouTube") — future enhancement
- DRM-protected or paywalled video content (Udemy, Coursera, etc.)
- Audio-only podcasts (similar pipeline but separate feature)
Scope
Supported Video Sources
| Source | Input Format | Example |
|---|---|---|
| YouTube single video | URL | https://youtube.com/watch?v=abc123 |
| YouTube short URL | URL | https://youtu.be/abc123 |
| YouTube playlist | URL | https://youtube.com/playlist?list=PLxxx |
| YouTube channel | URL | https://youtube.com/@channelname |
| Vimeo video | URL | https://vimeo.com/123456 |
| Local video file | Path | ./tutorials/intro.mp4 |
| Local video directory | Path | ./recordings/ (batch) |
Supported Video Formats (local files)
| Format | Extension | Notes |
|---|---|---|
| MP4 | `.mp4` | Most common, universal |
| Matroska | `.mkv` | Common for screen recordings |
| WebM | `.webm` | Web-native, YouTube's format |
| AVI | `.avi` | Legacy but still used |
| QuickTime | `.mov` | macOS screen recordings |
| Flash Video | `.flv` | Legacy, rare |
| MPEG Transport | `.ts` | Streaming recordings |
| Windows Media | `.wmv` | Windows screen recordings |
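Putting the two tables above together, source detection can be sketched as a URL-pattern check plus an extension allowlist. This is illustrative only — the real logic belongs in source_detector.py:

```python
import re
from pathlib import Path

# Illustrative patterns; the authoritative detector lives in source_detector.py.
YOUTUBE_URL_RE = re.compile(
    r"(youtube\.com/(watch\?v=|playlist\?list=|@)|youtu\.be/)"
)
VIDEO_EXTENSIONS = {".mp4", ".mkv", ".webm", ".avi", ".mov", ".flv", ".ts", ".wmv"}


def is_video_source(source: str) -> bool:
    """Rough heuristic: YouTube/Vimeo URL, or a path with a known video extension."""
    if YOUTUBE_URL_RE.search(source) or "vimeo.com/" in source:
        return True
    return Path(source).suffix.lower() in VIDEO_EXTENSIONS
```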
Supported Languages (transcript)
All languages supported by:
- YouTube's caption system (100+ languages)
- faster-whisper / OpenAI Whisper (99 languages)
Plan Documents Index
| Document | Content |
|---|---|
| `01_VIDEO_RESEARCH.md` | Library research, benchmarks, industry standards |
| `02_VIDEO_DATA_MODELS.md` | All data classes, type definitions, JSON schemas |
| `03_VIDEO_PIPELINE.md` | Processing pipeline (6 phases), algorithms, edge cases |
| `04_VIDEO_INTEGRATION.md` | CLI, config, source detection, unified scraper integration |
| `05_VIDEO_OUTPUT.md` | Output structure, SKILL.md integration, reference file format |
| `06_VIDEO_TESTING.md` | Test strategy, mocking, fixtures, CI considerations |
| `07_VIDEO_DEPENDENCIES.md` | Dependency tiers, optional installs, system requirements — IMPLEMENTED (`video_setup.py`, GPU auto-detection, `--setup`) |
High-Level Architecture
```
                     ┌──────────────────────┐
                     │      User Input      │
                     │                      │
                     │  YouTube URL         │
                     │  Playlist URL        │
                     │  Local .mp4 file     │
                     │  Unified config JSON │
                     └──────────┬───────────┘
                                │
                     ┌──────────▼───────────┐
                     │   Source Detector    │
                     │ (source_detector.py) │
                     │     type="video"     │
                     └──────────┬───────────┘
                                │
                     ┌──────────▼───────────┐
                     │    Video Scraper     │
                     │  (video_scraper.py)  │
                     │  Main orchestrator   │
                     └──────────┬───────────┘
                                │
           ┌────────────────────┼────────────────────┐
           │                    │                    │
┌──────────▼──────┐  ┌──────────▼──────┐  ┌──────────▼──────┐
│  Stream 1: ASR  │  │  Stream 2: OCR  │  │ Stream 3: Meta  │
│                 │  │   (optional)    │  │                 │
│ youtube-trans-  │  │  PySceneDetect  │  │ yt-dlp          │
│ cript-api       │  │  OpenCV         │  │ extract_info()  │
│ faster-whisper  │  │  easyocr        │  │                 │
└────────┬────────┘  └────────┬────────┘  └────────┬────────┘
         │                    │                    │
         │ Timestamped        │ Keyframes +        │ Chapters,
         │ transcript         │ OCR text           │ tags, desc
         │                    │                    │
         └────────────────────┼────────────────────┘
                              │
                   ┌──────────▼───────────┐
                   │     Segmenter &      │
                   │       Aligner        │
                   │ (video_segmenter.py) │
                   │                      │
                   │   Align 3 streams    │
                   │  on shared timeline  │
                   └──────────┬───────────┘
                              │
                     list[VideoSegment]
                              │
                   ┌──────────▼───────────┐
                   │   Output Generator   │
                   │                      │
                   │  ├ references/*.md   │
                   │  ├ video_data/*.json │
                   │  └ SKILL.md section  │
                   └──────────────────────┘
```
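Because the three streams are independent until the alignment step, the orchestrator can fan them out concurrently. A minimal sketch, with placeholder stream functions standing in for the real ASR / OCR / metadata extractors:

```python
from concurrent.futures import ThreadPoolExecutor


# Placeholder stream functions — the real ones wrap youtube-transcript-api /
# faster-whisper, PySceneDetect + easyocr, and yt-dlp respectively.
def extract_transcript(url: str) -> list[dict]:
    return [{"start": 0.0, "text": "intro"}]


def extract_keyframes_ocr(url: str) -> list[dict]:
    return [{"time": 1.0, "text": "def main():"}]


def extract_metadata(url: str) -> dict:
    return {"title": "Demo", "chapters": []}


def run_streams(url: str, visual: bool = True) -> dict:
    """Fan the three extraction streams out in parallel and collect results.

    'visual=False' corresponds to the lightweight tier (transcript + metadata
    only); the OCR stream is skipped entirely.
    """
    with ThreadPoolExecutor(max_workers=3) as pool:
        asr = pool.submit(extract_transcript, url)
        ocr = pool.submit(extract_keyframes_ocr, url) if visual else None
        meta = pool.submit(extract_metadata, url)
        return {
            "transcript": asr.result(),
            "ocr": ocr.result() if ocr else [],
            "metadata": meta.result(),
        }
```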
Implementation Phases
Phase 1: Foundation (Core Pipeline)
- `video_models.py` — All data classes
- `video_scraper.py` — Main orchestrator
- `video_transcript.py` — YouTube captions + Whisper fallback
- Source detector update — YouTube URL patterns, video file extensions
- Basic metadata extraction via yt-dlp
- Output: timestamped transcript as reference markdown
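The Phase 1 output step can be sketched as a pure rendering function, assuming transcript entries carry `start`/`text` keys in the shape youtube-transcript-api returns:

```python
def transcript_to_markdown(title: str, entries: list[dict]) -> str:
    """Render timestamped transcript entries as a reference markdown file.

    Each entry is assumed to be a dict with 'start' (seconds) and 'text' —
    an illustrative shape, not the final data model.
    """
    lines = [f"# {title}", ""]
    for entry in entries:
        minutes, seconds = divmod(int(entry["start"]), 60)
        lines.append(f"**[{minutes:02d}:{seconds:02d}]** {entry['text']}")
    return "\n".join(lines)
```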
Phase 2: Segmentation & Structure
- `video_segmenter.py` — Chapter-aware segmentation
- Semantic segmentation fallback (when no chapters)
- Time-window fallback (configurable interval)
- Segment categorization (reuse smart_categorize patterns)
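The time-window fallback might look like this — illustrative only; the real segmenter in `video_segmenter.py` tries chapters and semantic boundaries first:

```python
def window_segments(entries: list[dict], window: float = 120.0) -> list[list[dict]]:
    """Group transcript entries into fixed time windows.

    Fallback for videos with no chapters when semantic segmentation is
    unavailable. 'window' is the configurable interval in seconds; entry
    shape ({'start': seconds, ...}) is an assumption for illustration.
    """
    buckets: dict[int, list[dict]] = {}
    for entry in entries:
        buckets.setdefault(int(entry["start"] // window), []).append(entry)
    # Return windows in chronological order.
    return [buckets[k] for k in sorted(buckets)]
```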
Phase 3: Visual Extraction
- `video_visual.py` — Frame extraction + scene detection
- Frame classification (code/slide/terminal/diagram/other)
- OCR on classified frames (easyocr)
- Timeline alignment with ASR transcript
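Timeline alignment can be done with a binary search over segment start times. A sketch assuming dict-shaped segments and OCR hits (field names are assumptions):

```python
import bisect


def attach_ocr(segments: list[dict], ocr_hits: list[dict]) -> None:
    """Attach each OCR hit to the transcript segment whose [start, end)
    window contains the hit's frame timestamp.

    Segments must be sorted by 'start'. Hits falling outside every
    segment are dropped.
    """
    starts = [seg["start"] for seg in segments]
    for hit in ocr_hits:
        # Index of the last segment starting at or before the hit's timestamp.
        i = bisect.bisect_right(starts, hit["time"]) - 1
        if i >= 0 and hit["time"] < segments[i]["end"]:
            segments[i].setdefault("ocr", []).append(hit["text"])
```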
Phase 4: Integration
- Unified config support (`"type": "video"`)
- `create` command routing
- CLI parser + arguments
- Unified scraper integration (video alongside docs/github/pdf)
- SKILL.md section generation
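A hypothetical unified config mixing a video source with a docs source — only `"type": "video"` is specified by this plan; the other key names are illustrative:

```json
{
  "name": "react",
  "sources": [
    { "type": "docs", "url": "https://react.dev" },
    { "type": "video", "url": "https://youtube.com/playlist?list=PLxxx", "tier": "full" }
  ]
}
```

The authoritative schema belongs in 04_VIDEO_INTEGRATION.md.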
Phase 5: Quality & Polish
- AI enhancement for video content (summarization, topic extraction)
- RAG-optimized chunking for video segments
- MCP tools (`scrape_video`, `export_video`)
- Comprehensive test suite
Dependencies
Core (always required for video)
```
yt-dlp>=2024.12.0
youtube-transcript-api>=1.2.0
```
Full (for visual extraction + local file transcription)
```
faster-whisper>=1.0.0
scenedetect[opencv]>=0.6.4
easyocr>=1.7.0
opencv-python-headless>=4.9.0
```
System Requirements (for full mode)
- FFmpeg (required by faster-whisper and yt-dlp for audio extraction)
- GPU (optional but recommended for Whisper and easyocr)
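A quick preflight check for the FFmpeg requirement (sketch; the real validation lives in the setup tooling):

```python
import shutil


def ffmpeg_available() -> bool:
    """Return True if an ffmpeg binary is on PATH.

    Full mode needs it: faster-whisper and yt-dlp both shell out to
    ffmpeg for audio extraction.
    """
    return shutil.which("ffmpeg") is not None
```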
Risk Assessment
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| YouTube API changes break scraping | Medium | High | yt-dlp actively maintained, abstract behind our API |
| Whisper models are large (~1.5GB) | Certain | Medium | Optional dependency, offer multiple model sizes |
| OCR accuracy on code is low | Medium | Medium | Combine OCR with transcript context, use confidence scoring |
| Video download is slow | High | Medium | Stream audio only, don't download full video for transcript |
| Auto-generated captions are noisy | High | Medium | Confidence filtering, AI cleanup in enhancement phase |
| Copyright / ToS concerns | Low | High | Document that user is responsible for content rights |
| CI tests can't download videos | Certain | Medium | Mock all network calls, use fixture transcripts |
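The CI mitigation in the last row — mocking all network calls — can be sketched with `unittest.mock.patch`. `fetch_transcript` here is a hypothetical stand-in for the real network-backed call; in the actual suite you would patch it where it is imported:

```python
from unittest.mock import patch


def fetch_transcript(video_id: str) -> list[dict]:
    """Hypothetical network-backed call; CI must never reach the network."""
    raise RuntimeError("network disabled in CI")


def test_pipeline_uses_fixture_transcript():
    fixture = [{"start": 0.0, "text": "hello world"}]
    # Patch the module-level name so downstream code sees the fixture instead
    # of hitting YouTube.
    with patch(f"{__name__}.fetch_transcript", return_value=fixture):
        assert fetch_transcript("abc123") == fixture
```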
Success Criteria
- Functional: `skill-seekers create https://youtube.com/watch?v=xxx` produces a skill with video content integrated into SKILL.md.
- Multi-source: Video sources work alongside docs/github/pdf in unified configs.
- Quality: Video-derived reference files are categorized and structured (not raw transcript dumps).
- Performance: Transcript-only mode processes a 30-minute video in < 30 seconds.
- Tests: Full test suite with mocked network calls, 100% of video pipeline covered.
- Tiered deps: `pip install skill-seekers[video]` works without pulling Whisper/OpenCV.