
Video Source — Library Research & Industry Standards

Date: February 27, 2026
Document: 01 of 07
Status: Complete


Table of Contents

  1. Industry Standards & Approaches
  2. Library Comparison Matrix
  3. Detailed Library Analysis
  4. Architecture Patterns from Industry
  5. Benchmarks & Performance Data
  6. Recommendations

Industry Standards & Approaches

How the Industry Processes Video for AI/RAG

Based on research from NVIDIA, LlamaIndex, Ragie, and open-source projects, the industry has converged on a 3-stream parallel extraction model:

The 3-Stream Model

```
Video Input
    │
    ├──→ Stream 1: ASR (Audio Speech Recognition)
    │    Extract spoken words with timestamps
    │    Tools: Whisper, YouTube captions API
    │    Output: [{text, start, end, confidence}, ...]
    │
    ├──→ Stream 2: OCR (Optical Character Recognition)
    │    Extract visual text (code, slides, diagrams)
    │    Tools: OpenCV + scene detection + OCR engine
    │    Output: [{text, timestamp, frame_type, bbox}, ...]
    │
    └──→ Stream 3: Metadata
         Extract structural info (chapters, tags, description)
         Tools: yt-dlp, platform APIs
         Output: {title, chapters, tags, description, ...}
```

Key insight (from NVIDIA's multimodal RAG blog): Ground everything to text first. Align all streams on a shared timeline, then merge into unified text segments. This makes the output compatible with any text-based RAG pipeline without requiring multimodal embeddings.
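
A minimal sketch of that merge step, assuming both streams have already been normalized to timestamped text (the `TimedSegment` type is illustrative, not taken from any of the cited projects):

```python
from dataclasses import dataclass

@dataclass
class TimedSegment:
    start: float   # seconds from video start
    end: float
    source: str    # "asr", "ocr", or "meta"
    text: str

def merge_streams(*streams: list) -> list:
    """Merge independently extracted streams onto one shared timeline.

    Because every stream is already grounded to text, merging reduces
    to a sort by start time; downstream chunking and embedding then
    see a single unified sequence of text segments.
    """
    merged = [seg for stream in streams for seg in stream]
    merged.sort(key=lambda seg: (seg.start, seg.end))
    return merged
```

In practice the merge would also coalesce overlapping OCR and ASR segments into combined code-plus-narration pairs, which later sections of this plan describe.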

Reference Implementations

| Project | Approach | Strengths | Weaknesses |
| --- | --- | --- | --- |
| video-analyzer | Whisper + OpenCV + LLM analysis | Full pipeline, LLM summaries | No chapter support, no YouTube integration |
| LlamaIndex MultiModal RAG | Frame extraction + CLIP + LanceDB | Vector search over frames | Heavy (requires GPU), no ASR |
| VideoRAG | Graph-based reasoning + multimodal retrieval | Multi-hour video support | Research project, not production-ready |
| Ragie Multimodal RAG | faster-whisper large-v3-turbo + OCR + object detection | Production-grade, 3-stream | Proprietary, not open-source |

Industry Best Practices

  1. Audio-only download — Never download full video when you only need audio. Extract audio stream with FFmpeg (-vn flag). This is 10-50x smaller.
  2. Prefer existing captions — YouTube manual captions are higher quality than any ASR model. Only fall back to Whisper when captions unavailable.
  3. Chapter-based segmentation — YouTube chapters provide natural content boundaries. Use them as primary segmentation, fall back to time-window or semantic splitting.
  4. Confidence filtering — Auto-generated captions and OCR output include confidence scores. Filter low-confidence content rather than including everything.
  5. Parallel extraction — Run ASR and OCR in parallel (they're independent). Merge after both complete.
  6. Tiered processing — Offer fast/light mode (transcript only) and deep mode (+ visual). Let users choose based on their compute budget.
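
Practice 5 can be sketched with nothing more than a two-worker thread pool; `run_asr` and `run_ocr` here are placeholders for whatever extraction callables the pipeline wires in:

```python
from concurrent.futures import ThreadPoolExecutor

def extract_parallel(run_asr, run_ocr, audio_path: str, video_path: str):
    """Run ASR and OCR concurrently.

    The two streams share no state (one reads the audio file, the
    other the video file), so the only synchronization point is
    collecting both results before the merge step.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        asr_future = pool.submit(run_asr, audio_path)
        ocr_future = pool.submit(run_ocr, video_path)
        return asr_future.result(), ocr_future.result()
```

Threads (rather than processes) are usually enough here, since both Whisper inference and the OCR engines spend most of their time in native code.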

Library Comparison Matrix

Metadata & Download

| Library | Purpose | Install Size | Actively Maintained | Python API | License |
| --- | --- | --- | --- | --- | --- |
| yt-dlp | Metadata + subtitles + download | ~15MB | Yes (weekly releases) | Yes (YoutubeDL class) | Unlicense |
| pytube | YouTube download | ~1MB | Inconsistent | Yes | MIT |
| youtube-dl | Download (original) | ~10MB | Stale | Yes | Unlicense |
| pafy | YouTube metadata | ~50KB | Dead (2021) | Yes | LGPL |

Winner: yt-dlp — De-facto standard, actively maintained, comprehensive Python API, supports 1000+ sites (not just YouTube).

Transcript Extraction (YouTube)

| Library | Purpose | Requires Download | Speed | Accuracy | License |
| --- | --- | --- | --- | --- | --- |
| youtube-transcript-api | YouTube captions | No | Very fast (<1s) | Depends on caption source | MIT |
| yt-dlp subtitles | Download subtitle files | Yes (subtitle only) | Fast (~2s) | Same as above | Unlicense |

Winner: youtube-transcript-api — Fastest option, no download needed, and it returns structured JSON with timestamps directly. For non-YouTube platforms, we fall back to yt-dlp subtitle extraction.
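
The caption-first policy amounts to a simple priority chain; the fetcher callables below are hypothetical stand-ins for the per-library wrappers shown later in this document:

```python
def get_transcript(video_id: str, fetchers: list) -> list[dict]:
    """Return the first non-empty transcript from a prioritized list
    of sources (e.g. manual captions, then auto captions, then a
    Whisper fallback). Each fetcher takes a video ID and returns a
    list of {'text', 'start', 'duration'} dicts, or raises if its
    source is unavailable.
    """
    for fetch in fetchers:
        try:
            segments = fetch(video_id)
        except Exception:
            continue  # source unavailable; try the next one
        if segments:
            return segments
    return []
```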

Speech-to-Text (ASR)

| Library | Speed (30 min audio) | Word Timestamps | Model Sizes | GPU Required | Language Support | License |
| --- | --- | --- | --- | --- | --- | --- |
| faster-whisper | ~2-4 min (GPU), ~8-15 min (CPU) | Yes (word_timestamps=True) | tiny (39M) → large-v3 (1.5B) | No (but recommended) | 99 languages | MIT |
| openai-whisper | ~5-10 min (GPU), ~20-40 min (CPU) | Yes | Same models | Recommended | 99 languages | MIT |
| whisper-timestamped | Same as openai-whisper | Yes (more accurate) | Same models | Recommended | 99 languages | MIT |
| whisperx | ~2-3 min (GPU) | Yes (best accuracy via wav2vec2) | Same + wav2vec2 | Yes (required) | 99 languages | BSD |
| stable-ts | Same as openai-whisper | Yes (stabilized) | Same models | Recommended | 99 languages | MIT |
| Google Speech-to-Text | Real-time | Yes | Cloud | No | 125+ languages | Proprietary |
| AssemblyAI | Real-time | Yes | Cloud | No | 100+ languages | Proprietary |

Winner: faster-whisper — 4x faster than OpenAI Whisper via CTranslate2 optimization, MIT license, word-level timestamps, works without GPU (just slower), actively maintained. We may consider whisperx as a future upgrade for speaker diarization.

Scene Detection & Frame Extraction

| Library | Purpose | Algorithm | Speed | License |
| --- | --- | --- | --- | --- |
| PySceneDetect | Scene boundary detection | ContentDetector, ThresholdDetector, AdaptiveDetector | Fast | BSD |
| opencv-python-headless | Frame extraction, image processing | Manual (absdiff, histogram) | Fast | Apache 2.0 |
| Filmstrip | Keyframe extraction | Scene detection + selection | Medium | MIT |
| video-keyframe-detector | Keyframe extraction | Peak estimation from frame diff | Fast | MIT |
| decord | GPU-accelerated frame extraction | Direct frame access | Very fast | Apache 2.0 |

Winner: PySceneDetect + opencv-python-headless — PySceneDetect handles intelligent boundary detection, OpenCV handles frame extraction and image processing. Both are well-maintained and BSD/Apache licensed.

OCR (Optical Character Recognition)

| Library | Languages | GPU Support | Accuracy on Code | Speed | Install Size | License |
| --- | --- | --- | --- | --- | --- | --- |
| easyocr | 80+ | Yes (PyTorch) | Good | Medium | ~150MB + models | Apache 2.0 |
| pytesseract | 100+ | No | Medium | Fast | ~30MB + Tesseract | Apache 2.0 |
| PaddleOCR | 80+ | Yes (PaddlePaddle) | Very Good | Fast | ~200MB + models | Apache 2.0 |
| TrOCR (HuggingFace) | Multilingual | Yes | Good | Slow | ~500MB | MIT |
| docTR | 10+ | Yes (TF/PyTorch) | Good | Medium | ~100MB | Apache 2.0 |

Winner: easyocr — Best balance of accuracy (especially on code/terminal text), GPU support, language coverage, and ease of use. PaddleOCR is a close second but has heavier dependencies (PaddlePaddle framework).


Detailed Library Analysis

1. yt-dlp (Metadata & Download Engine)

What it provides:

  • Video metadata (title, description, duration, upload date, channel, tags, categories)
  • Chapter information (title, start_time, end_time for each chapter)
  • Subtitle/caption download (all available languages, all formats)
  • Thumbnail URLs
  • View/like counts
  • Playlist information (title, entries, ordering)
  • Audio-only extraction (no full video download needed)
  • Supports 1000+ video sites (YouTube, Vimeo, Dailymotion, etc.)

Python API usage:

```python
from yt_dlp import YoutubeDL

def extract_video_metadata(url: str) -> dict:
    """Extract metadata without downloading."""
    opts = {
        'quiet': True,
        'no_warnings': True,
        'extract_flat': False,  # Full extraction
    }
    with YoutubeDL(opts) as ydl:
        info = ydl.extract_info(url, download=False)
        return info
```

Key fields in info_dict:

```python
{
    'id': 'dQw4w9WgXcQ',              # Video ID
    'title': 'Video Title',            # Full title
    'description': '...',              # Full description text
    'duration': 1832,                  # Duration in seconds
    'upload_date': '20260115',         # YYYYMMDD format
    'uploader': 'Channel Name',        # Channel/uploader name
    'uploader_id': '@channelname',     # Channel ID
    'uploader_url': 'https://...',     # Channel URL
    'channel_follower_count': 150000,  # Subscriber count
    'view_count': 5000000,             # View count
    'like_count': 120000,              # Like count
    'comment_count': 8500,             # Comment count
    'tags': ['react', 'hooks', ...],   # Video tags
    'categories': ['Education'],       # YouTube categories
    'language': 'en',                  # Primary language
    'subtitles': {                     # Manual captions
        'en': [{'ext': 'vtt', 'url': '...'}],
    },
    'automatic_captions': {            # Auto-generated captions
        'en': [{'ext': 'vtt', 'url': '...'}],
    },
    'chapters': [                      # Chapter markers
        {'title': 'Intro', 'start_time': 0, 'end_time': 45},
        {'title': 'Setup', 'start_time': 45, 'end_time': 180},
        {'title': 'First Component', 'start_time': 180, 'end_time': 420},
    ],
    'thumbnail': 'https://...',        # Best thumbnail URL
    'thumbnails': [...],               # All thumbnail variants
    'webpage_url': 'https://...',      # Canonical URL
    'formats': [...],                  # Available formats
    'requested_formats': [...],        # Selected format info
}
```

Playlist extraction:

```python
def extract_playlist(url: str) -> list[dict]:
    """Extract all videos from a playlist."""
    opts = {
        'quiet': True,
        'extract_flat': True,  # Don't extract each video yet
    }
    with YoutubeDL(opts) as ydl:
        info = ydl.extract_info(url, download=False)
        # info['entries'] contains all video entries
        return info.get('entries', [])
```

Audio-only download (for Whisper):

```python
def download_audio(url: str, output_dir: str) -> str:
    """Download the audio stream only (no full video)."""
    opts = {
        'format': 'bestaudio/best',
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'wav',
        }],
        # Resample to 16 kHz mono, Whisper's native input format.
        # (preferredquality is a bitrate setting and does not control
        # the sample rate, so pass -ar to FFmpeg directly instead.)
        'postprocessor_args': ['-ar', '16000', '-ac', '1'],
        'outtmpl': f'{output_dir}/%(id)s.%(ext)s',
        'quiet': True,
    }
    with YoutubeDL(opts) as ydl:
        info = ydl.extract_info(url, download=True)
        return f"{output_dir}/{info['id']}.wav"
```

2. youtube-transcript-api (Caption Extraction)

What it provides:

  • Direct access to YouTube captions without downloading
  • Manual and auto-generated caption support
  • Translation support (translate captions to any language)
  • Structured output with timestamps

Python API usage:

```python
from youtube_transcript_api import YouTubeTranscriptApi, NoTranscriptFound

def get_youtube_transcript(video_id: str, languages: list[str] | None = None) -> list[dict]:
    """Get a transcript with timestamps, preferring manual captions."""
    languages = languages or ['en']

    transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)

    # Prefer manual captions over auto-generated ones
    try:
        transcript = transcript_list.find_manually_created_transcript(languages)
    except NoTranscriptFound:
        transcript = transcript_list.find_generated_transcript(languages)

    # Fetch the actual transcript data
    return transcript.fetch()
    # Returns: [{'text': 'Hello', 'start': 0.0, 'duration': 1.5}, ...]
```

Output format:

```python
[
    {
        'text': "Welcome to this React tutorial",
        'start': 0.0,        # Start time in seconds
        'duration': 2.5      # Duration in seconds
    },
    {
        'text': "Today we'll learn about hooks",
        'start': 2.5,
        'duration': 3.0
    },
    # ... continues for entire video
]
```

Key features:

  • Segments are typically 2-5 seconds each
  • Manual captions have punctuation and proper casing
  • Auto-generated captions may lack punctuation and have lower accuracy
  • Can detect available languages and caption types
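
Because segments arrive as 2-5 second fragments, a small grouping pass makes them far more readable downstream. A sketch that folds them into ~30-second blocks with a `[mm:ss]` prefix (the window size is our choice, not anything the library mandates):

```python
def segments_to_text(segments: list[dict], window: float = 30.0) -> str:
    """Group short caption segments into ~window-second blocks,
    each prefixed with its start time as [mm:ss]."""
    blocks: list[str] = []
    current: list[str] = []
    block_start = 0.0

    def flush():
        minutes, seconds = divmod(int(block_start), 60)
        blocks.append(f"[{minutes:02d}:{seconds:02d}] " + " ".join(current))

    for seg in segments:
        # Start a new block once the current one spans the full window
        if current and seg['start'] - block_start >= window:
            flush()
            current = []
        if not current:
            block_start = seg['start']
        current.append(seg['text'])
    if current:
        flush()
    return "\n".join(blocks)
```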

3. faster-whisper (Speech-to-Text)

What it provides:

  • OpenAI Whisper models with 4x speedup via CTranslate2
  • Word-level timestamps with confidence scores
  • Language detection
  • VAD (Voice Activity Detection) filtering
  • Multiple model sizes from tiny (39M) to large-v3 (1.5B)

Python API usage:

```python
from faster_whisper import WhisperModel

def transcribe_with_whisper(audio_path: str, model_size: str = "base") -> dict:
    """Transcribe audio file with word-level timestamps."""
    model = WhisperModel(
        model_size,
        device="auto",          # auto-detect GPU/CPU
        compute_type="auto",    # auto-select precision
    )

    segments, info = model.transcribe(
        audio_path,
        word_timestamps=True,
        vad_filter=True,         # Filter silence
        vad_parameters={
            "min_silence_duration_ms": 500,
        },
    )

    result = {
        'language': info.language,
        'language_probability': info.language_probability,
        'duration': info.duration,
        'segments': [],
    }

    for segment in segments:
        seg_data = {
            'start': segment.start,
            'end': segment.end,
            'text': segment.text.strip(),
            'avg_logprob': segment.avg_logprob,
            'no_speech_prob': segment.no_speech_prob,
            'words': [],
        }
        if segment.words:
            for word in segment.words:
                seg_data['words'].append({
                    'word': word.word,
                    'start': word.start,
                    'end': word.end,
                    'probability': word.probability,
                })
        result['segments'].append(seg_data)

    return result
```

Model size guide:

| Model | Parameters | English WER | Multilingual WER | VRAM (FP16) | Speed (30 min, GPU) |
| --- | --- | --- | --- | --- | --- |
| tiny | 39M | 14.8% | 23.2% | ~1GB | ~30s |
| base | 74M | 11.5% | 18.7% | ~1GB | ~45s |
| small | 244M | 9.5% | 14.6% | ~2GB | ~90s |
| medium | 769M | 8.0% | 12.4% | ~5GB | ~180s |
| large-v3 | 1.5B | 5.7% | 10.1% | ~10GB | ~240s |
| large-v3-turbo | 809M | 6.2% | 10.8% | ~6GB | ~120s |

Recommendation: Default to base (good balance), offer large-v3-turbo for best accuracy, tiny for speed.
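
That recommendation folds naturally into a small selection helper. The mode names below are our own labels (not faster-whisper options), and the compute types follow faster-whisper's documented choices:

```python
def pick_whisper_settings(mode: str = "balanced", has_gpu: bool = False) -> dict:
    """Map a user-facing speed/accuracy mode onto WhisperModel kwargs,
    using the trade-offs from the model size table above."""
    model = {
        "fast": "tiny",                # lowest accuracy, ~30s per 30 min on GPU
        "balanced": "base",            # the proposed default
        "accurate": "large-v3-turbo",  # near large-v3 WER at roughly half the cost
    }[mode]
    return {
        "model_size_or_path": model,
        "device": "cuda" if has_gpu else "cpu",
        "compute_type": "float16" if has_gpu else "int8",
    }
```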

4. PySceneDetect (Scene Boundary Detection)

What it provides:

  • Automatic scene/cut detection in video files
  • Multiple detection algorithms (content-based, threshold, adaptive)
  • Frame-accurate boundaries
  • Integration with OpenCV

Python API usage:

```python
from scenedetect import detect, ContentDetector

def detect_scene_changes(video_path: str) -> list[tuple[float, float]]:
    """Detect scene boundaries in a video.

    Returns a list of (start_time, end_time) tuples in seconds.
    """
    scene_list = detect(
        video_path,
        ContentDetector(
            threshold=27.0,      # Sensitivity (lower = more scenes)
            min_scene_len=15,    # Minimum 15 frames per scene
        ),
    )

    # Each entry is a (start, end) pair of FrameTimecode objects
    boundaries = []
    for start_tc, end_tc in scene_list:
        boundaries.append((start_tc.get_seconds(), end_tc.get_seconds()))

    return boundaries
```

Detection algorithms:

| Algorithm | Best For | Speed | Sensitivity |
| --- | --- | --- | --- |
| ContentDetector | General content changes | Fast | Medium |
| AdaptiveDetector | Gradual transitions | Medium | High |
| ThresholdDetector | Hard cuts (black frames) | Very fast | Low |

5. easyocr (Text Recognition)

What it provides:

  • Text detection and recognition from images
  • 80+ language support
  • GPU acceleration
  • Bounding box coordinates for each text region
  • Confidence scores

Python API usage:

```python
import easyocr

def extract_text_from_frame(image_path: str, languages: list[str] | None = None) -> list[dict]:
    """Extract text from a video frame image.

    Note: constructing a Reader loads the detection/recognition
    models, so in a real pipeline create it once and reuse it
    across frames.
    """
    languages = languages or ['en']
    reader = easyocr.Reader(languages, gpu=True)

    # Each result is a (bbox, text, confidence) tuple, where bbox is
    # the four [x, y] corner points of the detected text region.
    results = reader.readtext(image_path)

    extracted = []
    for bbox, text, confidence in results:
        extracted.append({
            'text': text,
            'confidence': confidence,
            'bbox': bbox,  # Corner coordinates
        })

    return extracted
```

Tips for code/terminal OCR:

  • Pre-process images: increase contrast, convert to grayscale
  • Use higher DPI/resolution frames
  • Filter by confidence threshold (>0.5 for code)
  • Detect monospace regions first, then OCR only those regions
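
A numpy-only sketch of the first three tips (grayscale, contrast stretch, upscale); a real pipeline would likely do this with OpenCV, but the operations are identical:

```python
import numpy as np

def preprocess_for_ocr(frame: np.ndarray, scale: int = 2) -> np.ndarray:
    """Prepare a BGR frame for OCR: grayscale via the standard
    luminance weights, contrast-stretch to the full 0-255 range, and
    integer nearest-neighbour upscaling so small code glyphs cover
    more pixels for the OCR engine."""
    gray = frame @ np.array([0.114, 0.587, 0.299])  # BGR channel weights
    lo, hi = gray.min(), gray.max()
    if hi > lo:
        gray = (gray - lo) / (hi - lo) * 255.0  # stretch contrast
    gray = np.repeat(np.repeat(gray, scale, axis=0), scale, axis=1)
    return gray.astype(np.uint8)
```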

6. OpenCV (Frame Extraction)

What it provides:

  • Video file reading and frame extraction
  • Image processing (resize, crop, color conversion)
  • Template matching (detect code editors, terminals)
  • Histogram analysis (detect slide vs code vs webcam)

Python API usage:

```python
import cv2
import numpy as np

def extract_frames_at_timestamps(
    video_path: str,
    timestamps: list[float],
    output_dir: str
) -> list[str]:
    """Extract frames at specific timestamps."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    frame_paths = []

    for ts in timestamps:
        frame_number = int(ts * fps)
        cap.set(cv2.CAP_PROP_POS_FRAMES, frame_number)
        ret, frame = cap.read()
        if ret:
            path = f"{output_dir}/frame_{ts:.2f}.png"
            cv2.imwrite(path, frame)
            frame_paths.append(path)

    cap.release()
    return frame_paths


def classify_frame(image_path: str) -> str:
    """Classify a frame as code/slide/diagram/other.

    Heuristics implemented below:
    - Dark background with structured content = code/terminal
    - Light background with low variance = slide
    - High edge density = diagram

    (Webcam detection via a face detector is a separate step and is
    omitted from this sketch.)
    """
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    h, w = gray.shape

    # Check brightness distribution
    mean_brightness = np.mean(gray)
    brightness_std = np.std(gray)

    # Dark background with structured content = code/terminal
    if mean_brightness < 80 and brightness_std > 40:
        return 'code'  # or 'terminal'

    # Light background with text blocks = slide
    if mean_brightness > 180 and brightness_std < 60:
        return 'slide'

    # High edge density = diagram
    edges = cv2.Canny(gray, 50, 150)
    edge_density = np.count_nonzero(edges) / (h * w)
    if edge_density > 0.15:
        return 'diagram'

    return 'other'
```

Benchmarks & Performance Data

Transcript Extraction Speed

| Method | 10 min video | 30 min video | 60 min video | Requires Download |
| --- | --- | --- | --- | --- |
| youtube-transcript-api | ~0.5s | ~0.5s | ~0.5s | No |
| yt-dlp subtitles | ~2s | ~2s | ~2s | Subtitle file only |
| faster-whisper (tiny, GPU) | ~10s | ~30s | ~60s | Audio only |
| faster-whisper (base, GPU) | ~15s | ~45s | ~90s | Audio only |
| faster-whisper (large-v3, GPU) | ~80s | ~240s | ~480s | Audio only |
| faster-whisper (base, CPU) | ~60s | ~180s | ~360s | Audio only |

Visual Extraction Speed

| Operation | Per Frame | Per 10 min video (50 keyframes) |
| --- | --- | --- |
| Frame extraction (OpenCV) | ~5ms | ~0.25s |
| Scene detection (PySceneDetect) | N/A | ~15s for full video |
| Frame classification (heuristic) | ~10ms | ~0.5s |
| OCR per frame (easyocr, GPU) | ~200ms | ~10s |
| OCR per frame (easyocr, CPU) | ~1-2s | ~50-100s |

Total Pipeline Time (estimated)

| Mode | 10 min video | 30 min video | 1 hour video |
| --- | --- | --- | --- |
| Transcript only (YouTube captions) | ~2s | ~2s | ~2s |
| Transcript only (Whisper base, GPU) | ~20s | ~50s | ~100s |
| Full (transcript + visual, GPU) | ~35s | ~80s | ~170s |
| Full (transcript + visual, CPU) | ~120s | ~350s | ~700s |

Recommendations

Primary Stack (Chosen)

| Component | Library | Why |
| --- | --- | --- |
| Metadata + download | yt-dlp | De-facto standard, 1000+ sites, comprehensive Python API |
| YouTube transcripts | youtube-transcript-api | Fastest, no download, structured output |
| Speech-to-text | faster-whisper | 4x faster than Whisper, MIT, word timestamps |
| Scene detection | PySceneDetect | Best algorithm options, OpenCV-based |
| Frame extraction | opencv-python-headless | Standard, headless (no GUI deps) |
| OCR | easyocr | Best code/terminal accuracy, 80+ languages, GPU support |

Future Considerations

| Component | Library | When to Add |
| --- | --- | --- |
| Speaker diarization | whisperx or pyannote | V2.0 — identify who said what |
| Object detection | YOLO | V2.0 — detect UI elements, diagrams |
| Multimodal embeddings | CLIP | V2.0 — embed frames for visual search |
| Slide detection | python-pptx + heuristics | V1.5 — detect and extract slide content |

Sources