feat: add video tutorial scraping pipeline with per-panel OCR and AI enhancement
Add complete video tutorial extraction system that converts YouTube videos and local video files into AI-consumable skills. The pipeline extracts transcripts, performs visual OCR on code editor panels independently, tracks code evolution across frames, and generates structured SKILL.md output.

Key features:
- Video metadata extraction (YouTube, local files, playlists)
- Multi-source transcript extraction (YouTube API, yt-dlp, Whisper fallback)
- Chapter-based and time-window segmentation
- Visual extraction: keyframe detection, frame classification, panel detection
- Per-panel sub-section OCR (each IDE panel OCR'd independently)
- Parallel OCR with ThreadPoolExecutor for multi-panel frames
- Narrow panel filtering (300px min width) to skip UI chrome
- Text block tracking with spatial panel position matching
- Code timeline with edit tracking across frames
- Audio-visual alignment (code + narrator pairs)
- Video-specific AI enhancement prompt for OCR denoising and code reconstruction
- video-tutorial.yaml workflow with 4 stages (OCR cleanup, language detection, tutorial synthesis, skill polish)
- CLI integration: skill-seekers video --url/--video-file/--playlist
- MCP tool: scrape_video for automation
- 161 tests passing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
docs/plans/video/07_VIDEO_DEPENDENCIES.md (new file, 506 lines)
# Video Source — Dependencies & System Requirements

**Date:** February 27, 2026
**Document:** 07 of 07
**Status:** Planning

---

## Table of Contents

1. [Dependency Tiers](#dependency-tiers)
2. [pyproject.toml Changes](#pyprojecttoml-changes)
3. [System Requirements](#system-requirements)
4. [Import Guards](#import-guards)
5. [Dependency Check Command](#dependency-check-command)
6. [Model Management](#model-management)
7. [Docker Considerations](#docker-considerations)

---

## Dependency Tiers

Video processing has two tiers to keep the base install lightweight:

### Tier 1: `[video]` — Lightweight (YouTube transcripts + metadata)

**Use case:** YouTube videos with existing captions. No download, no GPU needed.

| Package | Version | Size | Purpose |
|---------|---------|------|---------|
| `yt-dlp` | `>=2024.12.0` | ~15MB | Metadata extraction, audio download |
| `youtube-transcript-api` | `>=1.2.0` | ~50KB | YouTube caption extraction |

**Capabilities:**
- YouTube metadata (title, chapters, tags, description, engagement)
- YouTube captions (manual and auto-generated)
- Vimeo metadata
- Playlist and channel resolution
- Subtitle file parsing (SRT, VTT)
- Segmentation and alignment
- Full output generation

**NOT included:**
- Speech-to-text (Whisper)
- Visual extraction (frames + OCR)
- Local video file transcription (without subtitles)
### Tier 2: `[video-full]` — Full (adds Whisper + visual extraction)

**Use case:** Local videos without subtitles, or when you want code/slide extraction from the screen.

| Package | Version | Size | Purpose |
|---------|---------|------|---------|
| `yt-dlp` | `>=2024.12.0` | ~15MB | Metadata + audio download |
| `youtube-transcript-api` | `>=1.2.0` | ~50KB | YouTube captions |
| `faster-whisper` | `>=1.0.0` | ~5MB (+ models: 75MB-3GB) | Speech-to-text |
| `scenedetect[opencv]` | `>=0.6.4` | ~50MB (includes OpenCV) | Scene boundary detection |
| `easyocr` | `>=1.7.0` | ~150MB (+ models: ~200MB) | Text recognition from frames |
| `opencv-python-headless` | `>=4.9.0` | ~50MB | Frame extraction, image processing |

**Additional capabilities over Tier 1:**
- Whisper speech-to-text (99 languages, word-level timestamps)
- Scene detection (find visual transitions)
- Keyframe extraction (save important frames)
- Frame classification (code/slide/terminal/diagram)
- OCR on frames (extract code and text from the screen)
- Code block detection from video

**Total install size:**
- Tier 1: ~15MB
- Tier 2: ~270MB + models (~300MB-3.2GB depending on Whisper model)

---

## pyproject.toml Changes

```toml
[project.optional-dependencies]
# Existing dependencies...
gemini = ["google-generativeai>=0.8.0"]
openai = ["openai>=1.0.0"]
all-llms = ["google-generativeai>=0.8.0", "openai>=1.0.0"]

# NEW: Video processing
video = [
    "yt-dlp>=2024.12.0",
    "youtube-transcript-api>=1.2.0",
]
video-full = [
    "yt-dlp>=2024.12.0",
    "youtube-transcript-api>=1.2.0",
    "faster-whisper>=1.0.0",
    "scenedetect[opencv]>=0.6.4",
    "easyocr>=1.7.0",
    "opencv-python-headless>=4.9.0",
]

# Update 'all' to include video
all = [
    # ... existing all dependencies ...
    "yt-dlp>=2024.12.0",
    "youtube-transcript-api>=1.2.0",
    "faster-whisper>=1.0.0",
    "scenedetect[opencv]>=0.6.4",
    "easyocr>=1.7.0",
    "opencv-python-headless>=4.9.0",
]

[project.scripts]
# ... existing entry points ...
skill-seekers-video = "skill_seekers.cli.video_scraper:main"  # NEW
```
### Installation Commands

The extras are quoted so the square brackets survive shells (such as zsh) that treat them as glob patterns:

```bash
# Lightweight video (YouTube transcripts + metadata)
pip install "skill-seekers[video]"

# Full video (+ Whisper + visual extraction)
pip install "skill-seekers[video-full]"

# Everything
pip install "skill-seekers[all]"

# Development (editable)
pip install -e ".[video]"
pip install -e ".[video-full]"
```

---

## System Requirements

### Tier 1 (Lightweight)

| Requirement | Needed For | How to Check |
|-------------|------------|--------------|
| Python 3.10+ | All | `python --version` |
| Internet connection | YouTube API calls | N/A |

No additional system dependencies. Pure Python.

### Tier 2 (Full)

| Requirement | Needed For | How to Check | Install |
|-------------|------------|--------------|---------|
| Python 3.10+ | All | `python --version` | — |
| FFmpeg | Audio extraction, video processing | `ffmpeg -version` | See below |
| GPU (optional) | Whisper + easyocr acceleration | `nvidia-smi` (NVIDIA) | CUDA toolkit |

### FFmpeg Installation

FFmpeg is required for:
- Extracting audio from video files (Whisper input)
- Downloading audio-only streams (yt-dlp post-processing)
- Converting between audio formats

```bash
# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt install ffmpeg

# Windows (winget)
winget install ffmpeg

# Windows (Chocolatey)
choco install ffmpeg

# Verify
ffmpeg -version
```
### GPU Support (Optional)

A GPU accelerates Whisper (~4x) and easyocr (~5x) but is not required.

**NVIDIA GPU (CUDA):**

```bash
# Check CUDA availability
python -c "import torch; print(torch.cuda.is_available())"
```

faster-whisper uses CTranslate2 and easyocr uses PyTorch; both auto-detect CUDA, so no additional setup is needed if PyTorch CUDA is working.

**Apple Silicon (MPS):** faster-whisper does not support MPS directly and falls back to CPU on Apple Silicon. easyocr has partial MPS support.

**CPU-only (no GPU):** Everything works on CPU, just slower: the Whisper base model is ~4x slower and easyocr ~5x slower than on GPU. For short videos (<10 min), CPU is fine.

---

## Import Guards

All video dependencies use try/except import guards to provide clear error messages:

### video_scraper.py

```python
"""Video scraper - main orchestrator."""

# Core dependencies (always available)
import json
import logging
import os
from pathlib import Path

# Tier 1: Video basics
try:
    from yt_dlp import YoutubeDL
    HAS_YTDLP = True
except ImportError:
    HAS_YTDLP = False

try:
    from youtube_transcript_api import YouTubeTranscriptApi
    HAS_YT_TRANSCRIPT = True
except ImportError:
    HAS_YT_TRANSCRIPT = False


# Feature availability check
def check_video_dependencies(require_full: bool = False) -> None:
    """Check that video dependencies are installed.

    Args:
        require_full: If True, check for full dependencies (Whisper, OCR)

    Raises:
        ImportError: With installation instructions
    """
    missing = []

    if not HAS_YTDLP:
        missing.append("yt-dlp")
    if not HAS_YT_TRANSCRIPT:
        missing.append("youtube-transcript-api")

    if missing:
        raise ImportError(
            f"Video processing requires: {', '.join(missing)}\n"
            f"Install with: pip install skill-seekers[video]"
        )

    if require_full:
        full_missing = []
        try:
            import faster_whisper
        except ImportError:
            full_missing.append("faster-whisper")
        try:
            import cv2
        except ImportError:
            full_missing.append("opencv-python-headless")
        try:
            import scenedetect
        except ImportError:
            full_missing.append("scenedetect[opencv]")
        try:
            import easyocr
        except ImportError:
            full_missing.append("easyocr")

        if full_missing:
            raise ImportError(
                f"Visual extraction requires: {', '.join(full_missing)}\n"
                f"Install with: pip install skill-seekers[video-full]"
            )
```
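The `HAS_*` flag pattern above can be exercised in isolation. Below is a minimal, self-contained demo using a module name chosen to be absent; the name is purely illustrative, not a real dependency:

```python
# Demo of the try/except import-guard pattern with a deliberately
# nonexistent module standing in for an optional dependency.
try:
    import _skill_seekers_absent_dep  # hypothetical optional dependency
    HAS_DEP = True
except ImportError:
    HAS_DEP = False


def require_dep() -> None:
    """Raise with install instructions when the optional dep is missing."""
    if not HAS_DEP:
        raise ImportError("Install with: pip install skill-seekers[video]")


assert HAS_DEP is False  # the import failed; the flag records that
try:
    require_dep()
except ImportError as exc:
    print(exc)  # → Install with: pip install skill-seekers[video]
```

The key property is that the module-level `try/except` runs exactly once at import time, so feature checks later are just boolean tests.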
### video_transcript.py

```python
"""Transcript extraction module."""

# YouTube transcript (Tier 1)
try:
    from youtube_transcript_api import YouTubeTranscriptApi
    HAS_YT_TRANSCRIPT = True
except ImportError:
    HAS_YT_TRANSCRIPT = False

# Whisper (Tier 2)
try:
    from faster_whisper import WhisperModel
    HAS_WHISPER = True
except ImportError:
    HAS_WHISPER = False


def get_transcript(video_info, config):
    """Get transcript using best available method."""

    # Try YouTube captions first (Tier 1)
    if HAS_YT_TRANSCRIPT and video_info.source_type == VideoSourceType.YOUTUBE:
        try:
            return extract_youtube_transcript(video_info.video_id, config.languages)
        except TranscriptNotAvailable:
            pass

    # Try Whisper fallback (Tier 2)
    if HAS_WHISPER:
        return transcribe_with_whisper(video_info, config)

    # No transcript possible
    logger.warning(
        f"No transcript for {video_info.video_id}. "
        "Install faster-whisper for speech-to-text: "
        "pip install skill-seekers[video-full]"
    )
    return [], TranscriptSource.NONE
```
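The priority order in `get_transcript` (captions first, Whisper second, nothing last) can be sketched with stubs. `extract_captions` below is a hypothetical stand-in for the real extractor, used only to demonstrate the fallback chain:

```python
# Stubbed fallback chain: captions first, Whisper second, empty last.
class TranscriptNotAvailable(Exception):
    pass


def get_transcript_source(has_captions: bool, has_whisper: bool) -> str:
    def extract_captions() -> str:  # hypothetical stand-in
        if not has_captions:
            raise TranscriptNotAvailable()
        return "youtube-captions"

    try:
        return extract_captions()  # Tier 1: existing captions
    except TranscriptNotAvailable:
        pass
    if has_whisper:
        return "whisper"           # Tier 2: speech-to-text fallback
    return "none"                  # no transcript possible


print(get_transcript_source(True, False))   # → youtube-captions
print(get_transcript_source(False, True))   # → whisper
print(get_transcript_source(False, False))  # → none
```

Note that captions win even when Whisper is installed: they are free, while transcription costs minutes of compute.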
### video_visual.py

```python
"""Visual extraction module."""

try:
    import cv2
    HAS_OPENCV = True
except ImportError:
    HAS_OPENCV = False

try:
    from scenedetect import detect, ContentDetector
    HAS_SCENEDETECT = True
except ImportError:
    HAS_SCENEDETECT = False

try:
    import easyocr
    HAS_EASYOCR = True
except ImportError:
    HAS_EASYOCR = False


def check_visual_dependencies() -> None:
    """Check visual extraction dependencies."""
    missing = []
    if not HAS_OPENCV:
        missing.append("opencv-python-headless")
    if not HAS_SCENEDETECT:
        missing.append("scenedetect[opencv]")
    if not HAS_EASYOCR:
        missing.append("easyocr")

    if missing:
        raise ImportError(
            f"Visual extraction requires: {', '.join(missing)}\n"
            f"Install with: pip install skill-seekers[video-full]"
        )


def check_ffmpeg() -> bool:
    """Check if FFmpeg is available."""
    import shutil
    return shutil.which('ffmpeg') is not None
```

---

## Dependency Check Command

Add a dependency check to the `config` command:

```bash
# Check all video dependencies
skill-seekers config --check-video

# Output:
# Video Dependencies:
#   yt-dlp                  ✅ 2025.01.15
#   youtube-transcript-api  ✅ 1.2.3
#   faster-whisper          ❌ Not installed (pip install skill-seekers[video-full])
#   opencv-python-headless  ❌ Not installed
#   scenedetect             ❌ Not installed
#   easyocr                 ❌ Not installed
#
# System Dependencies:
#   FFmpeg                  ✅ 6.1.1
#   GPU (CUDA)              ❌ Not available (CPU mode will be used)
#
# Available modes:
#   Transcript only         ✅ YouTube captions available
#   Whisper fallback        ❌ Install faster-whisper
#   Visual extraction       ❌ Install video-full dependencies
```
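One way such a report could be assembled is with stdlib `importlib.metadata`. This is a sketch, not the shipped implementation; it assumes only that the distribution names match the tables above:

```python
# Sketch of a --check-video style report using only the standard library.
import shutil
from importlib.metadata import PackageNotFoundError, version

VIDEO_DEPS = ["yt-dlp", "youtube-transcript-api"]
VIDEO_FULL_DEPS = ["faster-whisper", "opencv-python-headless",
                   "scenedetect", "easyocr"]


def check_video_report() -> str:
    lines = ["Video Dependencies:"]
    for dist in VIDEO_DEPS + VIDEO_FULL_DEPS:
        try:
            lines.append(f"  {dist:<24} ✅ {version(dist)}")
        except PackageNotFoundError:
            lines.append(f"  {dist:<24} ❌ Not installed")
    lines.append("System Dependencies:")
    status = "✅" if shutil.which("ffmpeg") else "❌ Not installed"
    lines.append(f"  {'FFmpeg':<24} {status}")
    return "\n".join(lines)


print(check_video_report())
```

Probing installed distributions via metadata avoids actually importing heavy packages (easyocr pulls in PyTorch), so the check stays fast.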
---

## Model Management

### Whisper Models

Whisper models are downloaded on first use and cached in the user's home directory.

| Model | Download Size | Disk Size | First-Use Download Time |
|-------|---------------|-----------|-------------------------|
| tiny | 75 MB | 75 MB | ~15s |
| base | 142 MB | 142 MB | ~25s |
| small | 466 MB | 466 MB | ~60s |
| medium | 1.5 GB | 1.5 GB | ~3 min |
| large-v3 | 3.1 GB | 3.1 GB | ~5 min |
| large-v3-turbo | 1.6 GB | 1.6 GB | ~3 min |

**Cache location:** `~/.cache/huggingface/hub/` (CTranslate2 models)

**Pre-download command:**

```bash
# Pre-download a model before using it
python -c "from faster_whisper import WhisperModel; WhisperModel('base')"
```
### easyocr Models

easyocr models are also downloaded on first use.

| Language Pack | Download Size | Disk Size |
|---------------|---------------|-----------|
| English | ~100 MB | ~100 MB |
| + Additional language | ~50-100 MB each | ~50-100 MB each |

**Cache location:** `~/.EasyOCR/model/`

**Pre-download command:**

```bash
# Pre-download English OCR model
python -c "import easyocr; easyocr.Reader(['en'])"
```

---

## Docker Considerations

### Dockerfile additions for video support

```dockerfile
# Tier 1 (lightweight)
RUN pip install skill-seekers[video]

# Tier 2 (full)
RUN apt-get update && apt-get install -y ffmpeg
RUN pip install skill-seekers[video-full]

# Pre-download Whisper model (avoids first-run download)
RUN python -c "from faster_whisper import WhisperModel; WhisperModel('base')"

# Pre-download easyocr model
RUN python -c "import easyocr; easyocr.Reader(['en'])"
```
### Docker image sizes

| Tier | Base Image Size | Additional Size | Total |
|------|-----------------|-----------------|-------|
| Tier 1 (video) | ~300 MB | ~20 MB | ~320 MB |
| Tier 2 (video-full, CPU) | ~300 MB | ~800 MB | ~1.1 GB |
| Tier 2 (video-full, GPU) | ~5 GB (CUDA base) | ~800 MB | ~5.8 GB |

### Kubernetes resource recommendations

```yaml
# Tier 1 (transcript only)
resources:
  requests:
    memory: "256Mi"
    cpu: "500m"
  limits:
    memory: "512Mi"
    cpu: "1000m"

# Tier 2 (full, CPU)
resources:
  requests:
    memory: "2Gi"
    cpu: "2000m"
  limits:
    memory: "4Gi"
    cpu: "4000m"

# Tier 2 (full, GPU)
resources:
  requests:
    memory: "4Gi"
    cpu: "2000m"
    nvidia.com/gpu: 1
  limits:
    memory: "8Gi"
    cpu: "4000m"
    nvidia.com/gpu: 1
```