# Video Source Support — Master Plan

**Date:** February 27, 2026

**Feature ID:** V1.0

**Status:** Planning

**Priority:** High

**Estimated Complexity:** Large (multi-sprint feature)

---
## Table of Contents

1. [Executive Summary](#executive-summary)
2. [Motivation & Goals](#motivation--goals)
3. [Scope](#scope)
4. [Plan Documents Index](#plan-documents-index)
5. [High-Level Architecture](#high-level-architecture)
6. [Implementation Phases](#implementation-phases)
7. [Dependencies](#dependencies)
8. [Risk Assessment](#risk-assessment)
9. [Success Criteria](#success-criteria)

---
## Executive Summary

Add **video** as a first-class source type in Skill Seekers, alongside web documentation, GitHub repositories, PDF files, and Word documents. Videos contain a massive amount of knowledge — conference talks, official tutorials, live coding sessions, architecture walkthroughs — that is currently inaccessible to our pipeline.

The video source will use a **3-stream parallel extraction** model:

| Stream | What | Tool |
|--------|------|------|
| **ASR** (Automatic Speech Recognition) | Spoken words → timestamped text | youtube-transcript-api + faster-whisper |
| **OCR** (Optical Character Recognition) | On-screen code/slides/diagrams → text | PySceneDetect + OpenCV + easyocr |
| **Metadata** | Title, chapters, tags, description | yt-dlp Python API |

These three streams are **aligned on a shared timeline** and merged into structured `VideoSegment` objects — the fundamental output unit. Segments are then categorized, converted to reference markdown files, and integrated into SKILL.md just like any other source.

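As a rough sketch of that output unit (field names here are illustrative assumptions, not the final `02_VIDEO_DATA_MODELS.md` definitions):

```python
from dataclasses import dataclass, field


@dataclass
class VideoSegment:
    """One time-aligned unit of video knowledge (illustrative sketch)."""
    start: float                    # seconds from video start
    end: float                      # seconds from video start
    title: str = ""                 # chapter title, if any
    transcript: str = ""            # ASR stream: words spoken in this window
    ocr_text: list[str] = field(default_factory=list)  # OCR stream: on-screen text
    category: str = "uncategorized" # filled in by the categorization pass

    @property
    def duration(self) -> float:
        return self.end - self.start
```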
---
## Motivation & Goals

### Why Video?

1. **Knowledge density** — A 30-minute conference talk can contain the equivalent of a 5,000-word blog post, plus live code demos that never appear in written docs.
2. **Official tutorials** — Many frameworks (React, Flutter, Unity, Godot) have official video tutorials that are the canonical learning resource.
3. **Code walkthroughs** — Screen-recorded coding sessions show real patterns, debugging workflows, and architecture decisions that written docs miss.
4. **Conference talks** — JSConf, PyCon, GopherCon, etc. contain deep technical insights from framework authors.
5. **Completeness** — Skill Seekers aims to be the **universal** documentation preprocessor. Video is the last major content type we don't support.

### Goals

- **G1:** Extract structured, time-aligned knowledge from YouTube videos, playlists, channels, and local video files.
- **G2:** Integrate video as a first-class source in the unified config system (multiple video sources per skill, alongside docs/github/pdf).
- **G3:** Auto-detect video sources in the `create` command (YouTube URLs, video file extensions).
- **G4:** Support two tiers: lightweight (transcript + metadata only) and full (+ visual extraction with OCR).
- **G5:** Produce output that is indistinguishable in quality from other source types — properly categorized reference files integrated into SKILL.md.
- **G6:** Make visual extraction (Whisper, OCR) available as optional add-on dependencies, keeping the core install lightweight.

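For G2, a config entry for a mixed-source skill might look like the following; the exact key names (`type`, `url`, `mode`) are illustrative assumptions until the schema is fixed in `04_VIDEO_INTEGRATION.md`:

```json
{
  "name": "react",
  "sources": [
    { "type": "docs", "url": "https://react.dev/learn" },
    { "type": "video", "url": "https://youtube.com/playlist?list=PLxxx", "mode": "transcript" }
  ]
}
```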

### Non-Goals (explicitly out of scope for V1.0)

- Real-time / live stream processing
- Video generation or editing
- Speaker diarization (identifying who said what) — future enhancement
- Automatic video discovery (e.g., "find all React tutorials on YouTube") — future enhancement
- DRM-protected or paywalled video content (Udemy, Coursera, etc.)
- Audio-only podcasts (similar pipeline but a separate feature)

---
## Scope

### Supported Video Sources

| Source | Input Format | Example |
|--------|--------------|---------|
| YouTube single video | URL | `https://youtube.com/watch?v=abc123` |
| YouTube short URL | URL | `https://youtu.be/abc123` |
| YouTube playlist | URL | `https://youtube.com/playlist?list=PLxxx` |
| YouTube channel | URL | `https://youtube.com/@channelname` |
| Vimeo video | URL | `https://vimeo.com/123456` |
| Local video file | Path | `./tutorials/intro.mp4` |
| Local video directory | Path | `./recordings/` (batch) |

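Detection for the table above can be sketched as URL patterns plus an extension check; the regexes below are illustrative, not the actual `source_detector.py` rules, and the sketch deliberately ignores the directory/batch case:

```python
import re
from pathlib import Path

# Illustrative patterns only — the real source_detector.py may differ.
_YOUTUBE_RE = re.compile(
    r"^https?://(www\.)?(youtube\.com/(watch\?v=|playlist\?list=|@)|youtu\.be/)"
)
_VIMEO_RE = re.compile(r"^https?://(www\.)?vimeo\.com/\d+")
_VIDEO_EXTS = {".mp4", ".mkv", ".webm", ".avi", ".mov", ".flv", ".ts", ".wmv"}


def is_video_source(source: str) -> bool:
    """Return True when the input looks like a video URL or a local video file."""
    if _YOUTUBE_RE.match(source) or _VIMEO_RE.match(source):
        return True
    return Path(source).suffix.lower() in _VIDEO_EXTS
```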
### Supported Video Formats (local files)

| Format | Extension | Notes |
|--------|-----------|-------|
| MP4 | `.mp4` | Most common, universal |
| Matroska | `.mkv` | Common for screen recordings |
| WebM | `.webm` | Web-native, YouTube's format |
| AVI | `.avi` | Legacy but still used |
| QuickTime | `.mov` | macOS screen recordings |
| Flash Video | `.flv` | Legacy, rare |
| MPEG Transport | `.ts` | Streaming recordings |
| Windows Media | `.wmv` | Windows screen recordings |

### Supported Languages (transcript)

All languages supported by:

- YouTube's caption system (100+ languages)
- faster-whisper / OpenAI Whisper (99 languages)

---
## Plan Documents Index

| Document | Content |
|----------|---------|
| [`01_VIDEO_RESEARCH.md`](./01_VIDEO_RESEARCH.md) | Library research, benchmarks, industry standards |
| [`02_VIDEO_DATA_MODELS.md`](./02_VIDEO_DATA_MODELS.md) | All data classes, type definitions, JSON schemas |
| [`03_VIDEO_PIPELINE.md`](./03_VIDEO_PIPELINE.md) | Processing pipeline (6 phases), algorithms, edge cases |
| [`04_VIDEO_INTEGRATION.md`](./04_VIDEO_INTEGRATION.md) | CLI, config, source detection, unified scraper integration |
| [`05_VIDEO_OUTPUT.md`](./05_VIDEO_OUTPUT.md) | Output structure, SKILL.md integration, reference file format |
| [`06_VIDEO_TESTING.md`](./06_VIDEO_TESTING.md) | Test strategy, mocking, fixtures, CI considerations |
| [`07_VIDEO_DEPENDENCIES.md`](./07_VIDEO_DEPENDENCIES.md) | Dependency tiers, optional installs, system requirements — **IMPLEMENTED** (`video_setup.py`, GPU auto-detection, `--setup`) |

---
## High-Level Architecture

```
                     ┌──────────────────────┐
                     │ User Input           │
                     │                      │
                     │ YouTube URL          │
                     │ Playlist URL         │
                     │ Local .mp4 file      │
                     │ Unified config JSON  │
                     └──────────┬───────────┘
                                │
                     ┌──────────▼───────────┐
                     │ Source Detector      │
                     │ (source_detector.py) │
                     │ type="video"         │
                     └──────────┬───────────┘
                                │
                     ┌──────────▼───────────┐
                     │ Video Scraper        │
                     │ (video_scraper.py)   │
                     │ Main orchestrator    │
                     └──────────┬───────────┘
                                │
           ┌────────────────────┼────────────────────┐
           │                    │                    │
┌──────────▼──────┐  ┌──────────▼──────┐  ┌──────────▼──────┐
│ Stream 1: ASR   │  │ Stream 2: OCR   │  │ Stream 3: Meta  │
│                 │  │ (optional)      │  │                 │
│ youtube-trans-  │  │ PySceneDetect   │  │ yt-dlp          │
│ cript-api       │  │ OpenCV          │  │ extract_info()  │
│ faster-whisper  │  │ easyocr         │  │                 │
└──────────┬──────┘  └──────────┬──────┘  └──────────┬──────┘
           │                    │                    │
           │ Timestamped        │ Keyframes +        │ Chapters,
           │ transcript         │ OCR text           │ tags, desc
           │                    │                    │
           └────────────────────┼────────────────────┘
                                │
                     ┌──────────▼───────────┐
                     │ Segmenter &          │
                     │ Aligner              │
                     │ (video_segmenter.py) │
                     │                      │
                     │ Align 3 streams      │
                     │ on shared timeline   │
                     └──────────┬───────────┘
                                │
                       list[VideoSegment]
                                │
                     ┌──────────▼───────────┐
                     │ Output Generator     │
                     │                      │
                     │ ├ references/*.md    │
                     │ ├ video_data/*.json  │
                     │ └ SKILL.md section   │
                     └──────────────────────┘
```
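The "align 3 streams on shared timeline" step is, at its core, interval assignment: each OCR keyframe is attached to the transcript segment whose time window contains it. A minimal sketch, assuming segments and keyframes are plain dicts and tuples (the real `video_segmenter.py` structures may differ):

```python
def align_ocr_to_segments(
    segments: list[dict],                # [{"start": s, "end": e, "ocr": []}, ...]
    keyframes: list[tuple[float, str]],  # (timestamp_seconds, ocr_text)
) -> list[dict]:
    """Attach each OCR keyframe to the segment whose window contains it."""
    for ts, text in keyframes:
        for seg in segments:
            # Half-open window [start, end) so a frame lands in exactly one segment.
            if seg["start"] <= ts < seg["end"]:
                seg["ocr"].append(text)
                break
    return segments
```

Keyframes that fall outside every segment (e.g. past the last chapter boundary) are silently dropped in this sketch; the real pipeline would need a policy for them.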
---
## Implementation Phases

### Phase 1: Foundation (Core Pipeline)

- `video_models.py` — All data classes
- `video_scraper.py` — Main orchestrator
- `video_transcript.py` — YouTube captions + Whisper fallback
- Source detector update — YouTube URL patterns, video file extensions
- Basic metadata extraction via yt-dlp
- Output: timestamped transcript as reference markdown

### Phase 2: Segmentation & Structure

- `video_segmenter.py` — Chapter-aware segmentation
- Semantic segmentation fallback (when no chapters)
- Time-window fallback (configurable interval)
- Segment categorization (reuse smart_categorize patterns)

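The time-window fallback is simple enough to pin down now; a sketch, with the window length (here 300 s) as an illustrative default:

```python
def window_segments(duration: float, window: float = 300.0) -> list[tuple[float, float]]:
    """Split a video of `duration` seconds into fixed windows (last may be short)."""
    bounds: list[tuple[float, float]] = []
    start = 0.0
    while start < duration:
        bounds.append((start, min(start + window, duration)))
        start += window
    return bounds
```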
### Phase 3: Visual Extraction

- `video_visual.py` — Frame extraction + scene detection
- Frame classification (code/slide/terminal/diagram/other)
- OCR on classified frames (easyocr)
- Timeline alignment with the ASR transcript

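Frame classification could start as a cheap keyword heuristic over the OCR'd text before any model is considered; every marker and threshold below is an assumption for illustration:

```python
def classify_frame_text(ocr_text: str) -> str:
    """Guess a frame type from its OCR'd text (heuristic sketch only)."""
    text = ocr_text.strip()
    lowered = text.lower()
    # Shell prompts / package-manager commands suggest a terminal.
    if text.startswith("$") or lowered.startswith(("pip ", "npm ", "git ")):
        return "terminal"
    # Two or more code-ish markers suggest source code on screen.
    code_markers = ("def ", "class ", "import ", "function", "=>", "{", ";")
    if sum(marker in text for marker in code_markers) >= 2:
        return "code"
    # Short, sparse text blocks look like slides.
    if text and len(text.splitlines()) <= 5:
        return "slide"
    return "other"
```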
### Phase 4: Integration

- Unified config support (`"type": "video"`)
- `create` command routing
- CLI parser + arguments
- Unified scraper integration (video alongside docs/github/pdf)
- SKILL.md section generation

### Phase 5: Quality & Polish

- AI enhancement for video content (summarization, topic extraction)
- RAG-optimized chunking for video segments
- MCP tools (scrape_video, export_video)
- Comprehensive test suite

---
## Dependencies

### Core (always required for video)

```
yt-dlp>=2024.12.0
youtube-transcript-api>=1.2.0
```

### Full (for visual extraction + local file transcription)

```
faster-whisper>=1.0.0
scenedetect[opencv]>=0.6.4
easyocr>=1.7.0
opencv-python-headless>=4.9.0
```

### System Requirements (for full mode)

- FFmpeg (required by faster-whisper and yt-dlp for audio extraction)
- GPU (optional but recommended for Whisper and easyocr)
---
## Risk Assessment

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| YouTube API changes break scraping | Medium | High | yt-dlp is actively maintained; abstract it behind our own API |
| Whisper models are large (~1.5 GB) | Certain | Medium | Optional dependency; offer multiple model sizes |
| OCR accuracy on code is low | Medium | Medium | Combine OCR with transcript context; use confidence scoring |
| Video download is slow | High | Medium | Stream audio only; don't download the full video for transcripts |
| Auto-generated captions are noisy | High | Medium | Confidence filtering; AI cleanup in the enhancement phase |
| Copyright / ToS concerns | Low | High | Document that the user is responsible for content rights |
| CI tests can't download videos | Certain | Medium | Mock all network calls; use fixture transcripts |

---
## Success Criteria

1. **Functional:** `skill-seekers create https://youtube.com/watch?v=xxx` produces a skill with video content integrated into SKILL.md.
2. **Multi-source:** Video sources work alongside docs/github/pdf in unified configs.
3. **Quality:** Video-derived reference files are categorized and structured (not raw transcript dumps).
4. **Performance:** Transcript-only mode processes a 30-minute video in under 30 seconds.
5. **Tests:** Full test suite with mocked network calls; 100% of the video pipeline covered.
6. **Tiered deps:** `pip install skill-seekers[video]` works without pulling Whisper/OpenCV.