
Video Source — Library Research & Industry Standards

Date: February 27, 2026
Document: 01 of 07
Status: Complete


Table of Contents

  1. Industry Standards & Approaches
  2. Library Comparison Matrix
  3. Detailed Library Analysis
  4. Architecture Patterns from Industry
  5. Benchmarks & Performance Data
  6. Recommendations

Industry Standards & Approaches

How the Industry Processes Video for AI/RAG

Based on research from NVIDIA, LlamaIndex, Ragie, and open-source projects, the industry has converged on a 3-stream parallel extraction model:

The 3-Stream Model

```
Video Input
    │
    ├──→ Stream 1: ASR (Audio Speech Recognition)
    │    Extract spoken words with timestamps
    │    Tools: Whisper, YouTube captions API
    │    Output: [{text, start, end, confidence}, ...]
    │
    ├──→ Stream 2: OCR (Optical Character Recognition)
    │    Extract visual text (code, slides, diagrams)
    │    Tools: OpenCV + scene detection + OCR engine
    │    Output: [{text, timestamp, frame_type, bbox}, ...]
    │
    └──→ Stream 3: Metadata
         Extract structural info (chapters, tags, description)
         Tools: yt-dlp, platform APIs
         Output: {title, chapters, tags, description, ...}
```

Key insight (from NVIDIA's multimodal RAG blog): Ground everything to text first. Align all streams on a shared timeline, then merge into unified text segments. This makes the output compatible with any text-based RAG pipeline without requiring multimodal embeddings.
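
A minimal sketch of that merge step, assuming both streams have already been normalized to timestamped text (the `TimedSegment` type is illustrative, not taken from any of the cited projects):

```python
from dataclasses import dataclass

@dataclass
class TimedSegment:
    start: float   # seconds from video start
    end: float
    source: str    # "asr", "ocr", or "meta"
    text: str

def merge_streams(*streams: list) -> list:
    """Merge independently extracted streams onto one shared timeline.

    Because every stream is already grounded to text, merging reduces
    to a sort by start time; downstream chunking and embedding then
    see a single unified sequence of text segments.
    """
    merged = [seg for stream in streams for seg in stream]
    merged.sort(key=lambda seg: (seg.start, seg.end))
    return merged
```

In practice the merge would also coalesce overlapping OCR and ASR segments into combined code-plus-narration pairs, which later sections of this plan describe.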

Reference Implementations

| Project | Approach | Strengths | Weaknesses |
| --- | --- | --- | --- |
| video-analyzer | Whisper + OpenCV + LLM analysis | Full pipeline, LLM summaries | No chapter support, no YouTube integration |
| LlamaIndex MultiModal RAG | Frame extraction + CLIP + LanceDB | Vector search over frames | Heavy (requires GPU), no ASR |
| VideoRAG | Graph-based reasoning + multimodal retrieval | Multi-hour video support | Research project, not production-ready |
| Ragie Multimodal RAG | faster-whisper large-v3-turbo + OCR + object detection | Production-grade, 3-stream | Proprietary, not open-source |

Industry Best Practices

  1. Audio-only download — Never download full video when you only need audio. Extract audio stream with FFmpeg (-vn flag). This is 10-50x smaller.
  2. Prefer existing captions — YouTube manual captions are higher quality than any ASR model. Only fall back to Whisper when captions unavailable.
  3. Chapter-based segmentation — YouTube chapters provide natural content boundaries. Use them as primary segmentation, fall back to time-window or semantic splitting.
  4. Confidence filtering — Auto-generated captions and OCR output include confidence scores. Filter low-confidence content rather than including everything.
  5. Parallel extraction — Run ASR and OCR in parallel (they're independent). Merge after both complete.
  6. Tiered processing — Offer fast/light mode (transcript only) and deep mode (+ visual). Let users choose based on their compute budget.
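
Practice 5 can be sketched with nothing more than a two-worker thread pool; `run_asr` and `run_ocr` here are placeholders for whatever extraction callables the pipeline wires in:

```python
from concurrent.futures import ThreadPoolExecutor

def extract_parallel(run_asr, run_ocr, audio_path: str, video_path: str):
    """Run ASR and OCR concurrently.

    The two streams share no state (one reads the audio file, the
    other the video file), so the only synchronization point is
    collecting both results before the merge step.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        asr_future = pool.submit(run_asr, audio_path)
        ocr_future = pool.submit(run_ocr, video_path)
        return asr_future.result(), ocr_future.result()
```

Threads (rather than processes) are usually enough here, since both Whisper inference and the OCR engines spend most of their time in native code.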

Library Comparison Matrix

Metadata & Download

| Library | Purpose | Install Size | Actively Maintained | Python API | License |
| --- | --- | --- | --- | --- | --- |
| yt-dlp | Metadata + subtitles + download | ~15MB | Yes (weekly releases) | Yes (YoutubeDL class) | Unlicense |
| pytube | YouTube download | ~1MB | Inconsistent | Yes | MIT |
| youtube-dl | Download (original) | ~10MB | Stale | Yes | Unlicense |
| pafy | YouTube metadata | ~50KB | Dead (2021) | Yes | LGPL |

Winner: yt-dlp — De-facto standard, actively maintained, comprehensive Python API, supports 1000+ sites (not just YouTube).

Transcript Extraction (YouTube)

| Library | Purpose | Requires Download | Speed | Accuracy | License |
| --- | --- | --- | --- | --- | --- |
| youtube-transcript-api | YouTube captions | No | Very fast (<1s) | Depends on caption source | MIT |
| yt-dlp subtitles | Download subtitle files | Yes (subtitle only) | Fast (~2s) | Same as above | Unlicense |

Winner: youtube-transcript-api — Fastest option, no download needed, and it returns structured JSON with timestamps directly. For non-YouTube platforms, we fall back to yt-dlp subtitle extraction.
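
The caption-first policy amounts to a simple priority chain; the fetcher callables below are hypothetical stand-ins for the per-library wrappers shown later in this document:

```python
def get_transcript(video_id: str, fetchers: list) -> list[dict]:
    """Return the first non-empty transcript from a prioritized list
    of sources (e.g. manual captions, then auto captions, then a
    Whisper fallback). Each fetcher takes a video ID and returns a
    list of {'text', 'start', 'duration'} dicts, or raises if its
    source is unavailable.
    """
    for fetch in fetchers:
        try:
            segments = fetch(video_id)
        except Exception:
            continue  # source unavailable; try the next one
        if segments:
            return segments
    return []
```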

Speech-to-Text (ASR)

| Library | Speed (30 min audio) | Word Timestamps | Model Sizes | GPU Required | Language Support | License |
| --- | --- | --- | --- | --- | --- | --- |
| faster-whisper | ~2-4 min (GPU), ~8-15 min (CPU) | Yes (word_timestamps=True) | tiny (39M) → large-v3 (1.5B) | No (but recommended) | 99 languages | MIT |
| openai-whisper | ~5-10 min (GPU), ~20-40 min (CPU) | Yes | Same models | Recommended | 99 languages | MIT |
| whisper-timestamped | Same as openai-whisper | Yes (more accurate) | Same models | Recommended | 99 languages | MIT |
| whisperx | ~2-3 min (GPU) | Yes (best accuracy via wav2vec2) | Same + wav2vec2 | Yes (required) | 99 languages | BSD |
| stable-ts | Same as openai-whisper | Yes (stabilized) | Same models | Recommended | 99 languages | MIT |
| Google Speech-to-Text | Real-time | Yes | Cloud | No | 125+ languages | Proprietary |
| AssemblyAI | Real-time | Yes | Cloud | No | 100+ languages | Proprietary |

Winner: faster-whisper — 4x faster than OpenAI Whisper via CTranslate2 optimization, MIT license, word-level timestamps, works without GPU (just slower), actively maintained. We may consider whisperx as a future upgrade for speaker diarization.

Scene Detection & Frame Extraction

| Library | Purpose | Algorithm | Speed | License |
| --- | --- | --- | --- | --- |
| PySceneDetect | Scene boundary detection | ContentDetector, ThresholdDetector, AdaptiveDetector | Fast | BSD |
| opencv-python-headless | Frame extraction, image processing | Manual (absdiff, histogram) | Fast | Apache 2.0 |
| Filmstrip | Keyframe extraction | Scene detection + selection | Medium | MIT |
| video-keyframe-detector | Keyframe extraction | Peak estimation from frame diff | Fast | MIT |
| decord | GPU-accelerated frame extraction | Direct frame access | Very fast | Apache 2.0 |

Winner: PySceneDetect + opencv-python-headless — PySceneDetect handles intelligent boundary detection, OpenCV handles frame extraction and image processing. Both are well-maintained and BSD/Apache licensed.

OCR (Optical Character Recognition)

| Library | Languages | GPU Support | Accuracy on Code | Speed | Install Size | License |
| --- | --- | --- | --- | --- | --- | --- |
| easyocr | 80+ | Yes (PyTorch) | Good | Medium | ~150MB + models | Apache 2.0 |
| pytesseract | 100+ | No | Medium | Fast | ~30MB + Tesseract | Apache 2.0 |
| PaddleOCR | 80+ | Yes (PaddlePaddle) | Very Good | Fast | ~200MB + models | Apache 2.0 |
| TrOCR (HuggingFace) | Multilingual | Yes | Good | Slow | ~500MB | MIT |
| docTR | 10+ | Yes (TF/PyTorch) | Good | Medium | ~100MB | Apache 2.0 |

Winner: easyocr — Best balance of accuracy (especially on code/terminal text), GPU support, language coverage, and ease of use. PaddleOCR is a close second but has heavier dependencies (PaddlePaddle framework).


Detailed Library Analysis

1. yt-dlp (Metadata & Download Engine)

What it provides:

  • Video metadata (title, description, duration, upload date, channel, tags, categories)
  • Chapter information (title, start_time, end_time for each chapter)
  • Subtitle/caption download (all available languages, all formats)
  • Thumbnail URLs
  • View/like counts
  • Playlist information (title, entries, ordering)
  • Audio-only extraction (no full video download needed)
  • Supports 1000+ video sites (YouTube, Vimeo, Dailymotion, etc.)

Python API usage:

```python
from yt_dlp import YoutubeDL

def extract_video_metadata(url: str) -> dict:
    """Extract metadata without downloading."""
    opts = {
        'quiet': True,
        'no_warnings': True,
        'extract_flat': False,  # Full extraction
    }
    with YoutubeDL(opts) as ydl:
        info = ydl.extract_info(url, download=False)
        return info
```

Key fields in info_dict:

```python
{
    'id': 'dQw4w9WgXcQ',              # Video ID
    'title': 'Video Title',            # Full title
    'description': '...',              # Full description text
    'duration': 1832,                  # Duration in seconds
    'upload_date': '20260115',         # YYYYMMDD format
    'uploader': 'Channel Name',        # Channel/uploader name
    'uploader_id': '@channelname',     # Channel ID
    'uploader_url': 'https://...',     # Channel URL
    'channel_follower_count': 150000,  # Subscriber count
    'view_count': 5000000,             # View count
    'like_count': 120000,              # Like count
    'comment_count': 8500,             # Comment count
    'tags': ['react', 'hooks', ...],   # Video tags
    'categories': ['Education'],       # YouTube categories
    'language': 'en',                  # Primary language
    'subtitles': {                     # Manual captions
        'en': [{'ext': 'vtt', 'url': '...'}],
    },
    'automatic_captions': {            # Auto-generated captions
        'en': [{'ext': 'vtt', 'url': '...'}],
    },
    'chapters': [                      # Chapter markers
        {'title': 'Intro', 'start_time': 0, 'end_time': 45},
        {'title': 'Setup', 'start_time': 45, 'end_time': 180},
        {'title': 'First Component', 'start_time': 180, 'end_time': 420},
    ],
    'thumbnail': 'https://...',        # Best thumbnail URL
    'thumbnails': [...],               # All thumbnail variants
    'webpage_url': 'https://...',      # Canonical URL
    'formats': [...],                  # Available formats
    'requested_formats': [...],        # Selected format info
}
```

Playlist extraction:

```python
def extract_playlist(url: str) -> list[dict]:
    """Extract all videos from a playlist."""
    opts = {
        'quiet': True,
        'extract_flat': True,  # Don't extract each video yet
    }
    with YoutubeDL(opts) as ydl:
        info = ydl.extract_info(url, download=False)
        # info['entries'] contains all video entries
        return info.get('entries', [])
```

Audio-only download (for Whisper):

```python
def download_audio(url: str, output_dir: str) -> str:
    """Download the audio stream only (no full video)."""
    opts = {
        'format': 'bestaudio/best',
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'wav',
        }],
        # Resample to 16 kHz mono, Whisper's native input format.
        # (preferredquality is a bitrate setting and does not control
        # the sample rate, so pass -ar to FFmpeg directly instead.)
        'postprocessor_args': ['-ar', '16000', '-ac', '1'],
        'outtmpl': f'{output_dir}/%(id)s.%(ext)s',
        'quiet': True,
    }
    with YoutubeDL(opts) as ydl:
        info = ydl.extract_info(url, download=True)
        return f"{output_dir}/{info['id']}.wav"
```

2. youtube-transcript-api (Caption Extraction)

What it provides:

  • Direct access to YouTube captions without downloading
  • Manual and auto-generated caption support
  • Translation support (translate captions to any language)
  • Structured output with timestamps

Python API usage:

```python
from youtube_transcript_api import YouTubeTranscriptApi, NoTranscriptFound

def get_youtube_transcript(video_id: str, languages: list[str] | None = None) -> list[dict]:
    """Get a transcript with timestamps, preferring manual captions."""
    languages = languages or ['en']

    transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)

    # Prefer manual captions over auto-generated ones
    try:
        transcript = transcript_list.find_manually_created_transcript(languages)
    except NoTranscriptFound:
        transcript = transcript_list.find_generated_transcript(languages)

    # Fetch the actual transcript data
    return transcript.fetch()
    # Returns: [{'text': 'Hello', 'start': 0.0, 'duration': 1.5}, ...]
```

Output format:

```python
[
    {
        'text': "Welcome to this React tutorial",
        'start': 0.0,        # Start time in seconds
        'duration': 2.5      # Duration in seconds
    },
    {
        'text': "Today we'll learn about hooks",
        'start': 2.5,
        'duration': 3.0
    },
    # ... continues for entire video
]
```

Key features:

  • Segments are typically 2-5 seconds each
  • Manual captions have punctuation and proper casing
  • Auto-generated captions may lack punctuation and have lower accuracy
  • Can detect available languages and caption types
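
Because segments arrive as 2-5 second fragments, a small grouping pass makes them far more readable downstream. A sketch that folds them into ~30-second blocks with a `[mm:ss]` prefix (the window size is our choice, not anything the library mandates):

```python
def segments_to_text(segments: list[dict], window: float = 30.0) -> str:
    """Group short caption segments into ~window-second blocks,
    each prefixed with its start time as [mm:ss]."""
    blocks: list[str] = []
    current: list[str] = []
    block_start = 0.0

    def flush():
        minutes, seconds = divmod(int(block_start), 60)
        blocks.append(f"[{minutes:02d}:{seconds:02d}] " + " ".join(current))

    for seg in segments:
        # Start a new block once the current one spans the full window
        if current and seg['start'] - block_start >= window:
            flush()
            current = []
        if not current:
            block_start = seg['start']
        current.append(seg['text'])
    if current:
        flush()
    return "\n".join(blocks)
```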

3. faster-whisper (Speech-to-Text)

What it provides:

  • OpenAI Whisper models with 4x speedup via CTranslate2
  • Word-level timestamps with confidence scores
  • Language detection
  • VAD (Voice Activity Detection) filtering
  • Multiple model sizes from tiny (39M) to large-v3 (1.5B)

Python API usage:

```python
from faster_whisper import WhisperModel

def transcribe_with_whisper(audio_path: str, model_size: str = "base") -> dict:
    """Transcribe audio file with word-level timestamps."""
    model = WhisperModel(
        model_size,
        device="auto",          # auto-detect GPU/CPU
        compute_type="auto",    # auto-select precision
    )

    segments, info = model.transcribe(
        audio_path,
        word_timestamps=True,
        vad_filter=True,         # Filter silence
        vad_parameters={
            "min_silence_duration_ms": 500,
        },
    )

    result = {
        'language': info.language,
        'language_probability': info.language_probability,
        'duration': info.duration,
        'segments': [],
    }

    for segment in segments:
        seg_data = {
            'start': segment.start,
            'end': segment.end,
            'text': segment.text.strip(),
            'avg_logprob': segment.avg_logprob,
            'no_speech_prob': segment.no_speech_prob,
            'words': [],
        }
        if segment.words:
            for word in segment.words:
                seg_data['words'].append({
                    'word': word.word,
                    'start': word.start,
                    'end': word.end,
                    'probability': word.probability,
                })
        result['segments'].append(seg_data)

    return result
```

Model size guide:

| Model | Parameters | English WER | Multilingual WER | VRAM (FP16) | Speed (30 min, GPU) |
| --- | --- | --- | --- | --- | --- |
| tiny | 39M | 14.8% | 23.2% | ~1GB | ~30s |
| base | 74M | 11.5% | 18.7% | ~1GB | ~45s |
| small | 244M | 9.5% | 14.6% | ~2GB | ~90s |
| medium | 769M | 8.0% | 12.4% | ~5GB | ~180s |
| large-v3 | 1.5B | 5.7% | 10.1% | ~10GB | ~240s |
| large-v3-turbo | 809M | 6.2% | 10.8% | ~6GB | ~120s |

Recommendation: Default to base (good balance), offer large-v3-turbo for best accuracy, tiny for speed.
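
That recommendation folds naturally into a small selection helper. The mode names below are our own labels (not faster-whisper options), and the compute types follow faster-whisper's documented choices:

```python
def pick_whisper_settings(mode: str = "balanced", has_gpu: bool = False) -> dict:
    """Map a user-facing speed/accuracy mode onto WhisperModel kwargs,
    using the trade-offs from the model size table above."""
    model = {
        "fast": "tiny",                # lowest accuracy, ~30s per 30 min on GPU
        "balanced": "base",            # the proposed default
        "accurate": "large-v3-turbo",  # near large-v3 WER at roughly half the cost
    }[mode]
    return {
        "model_size_or_path": model,
        "device": "cuda" if has_gpu else "cpu",
        "compute_type": "float16" if has_gpu else "int8",
    }
```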

4. PySceneDetect (Scene Boundary Detection)

What it provides:

  • Automatic scene/cut detection in video files
  • Multiple detection algorithms (content-based, threshold, adaptive)
  • Frame-accurate boundaries
  • Integration with OpenCV

Python API usage:

```python
from scenedetect import detect, ContentDetector

def detect_scene_changes(video_path: str) -> list[tuple[float, float]]:
    """Detect scene boundaries in a video.

    Returns a list of (start_time, end_time) tuples in seconds.
    """
    scene_list = detect(
        video_path,
        ContentDetector(
            threshold=27.0,      # Sensitivity (lower = more scenes)
            min_scene_len=15,    # Minimum 15 frames per scene
        ),
    )

    # Each entry is a (start, end) pair of FrameTimecode objects
    boundaries = []
    for start_tc, end_tc in scene_list:
        boundaries.append((start_tc.get_seconds(), end_tc.get_seconds()))

    return boundaries
```

Detection algorithms:

| Algorithm | Best For | Speed | Sensitivity |
| --- | --- | --- | --- |
| ContentDetector | General content changes | Fast | Medium |
| AdaptiveDetector | Gradual transitions | Medium | High |
| ThresholdDetector | Hard cuts (black frames) | Very fast | Low |

5. easyocr (Text Recognition)

What it provides:

  • Text detection and recognition from images
  • 80+ language support
  • GPU acceleration
  • Bounding box coordinates for each text region
  • Confidence scores

Python API usage:

```python
import easyocr

def extract_text_from_frame(image_path: str, languages: list[str] | None = None) -> list[dict]:
    """Extract text from a video frame image.

    Note: constructing a Reader loads the detection/recognition
    models, so in a real pipeline create it once and reuse it
    across frames.
    """
    languages = languages or ['en']
    reader = easyocr.Reader(languages, gpu=True)

    # Each result is a (bbox, text, confidence) tuple, where bbox is
    # the four [x, y] corner points of the detected text region.
    results = reader.readtext(image_path)

    extracted = []
    for bbox, text, confidence in results:
        extracted.append({
            'text': text,
            'confidence': confidence,
            'bbox': bbox,  # Corner coordinates
        })

    return extracted
```

Tips for code/terminal OCR:

  • Pre-process images: increase contrast, convert to grayscale
  • Use higher DPI/resolution frames
  • Filter by confidence threshold (>0.5 for code)
  • Detect monospace regions first, then OCR only those regions
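
A numpy-only sketch of the first three tips (grayscale, contrast stretch, upscale); a real pipeline would likely do this with OpenCV, but the operations are identical:

```python
import numpy as np

def preprocess_for_ocr(frame: np.ndarray, scale: int = 2) -> np.ndarray:
    """Prepare a BGR frame for OCR: grayscale via the standard
    luminance weights, contrast-stretch to the full 0-255 range, and
    integer nearest-neighbour upscaling so small code glyphs cover
    more pixels for the OCR engine."""
    gray = frame @ np.array([0.114, 0.587, 0.299])  # BGR channel weights
    lo, hi = gray.min(), gray.max()
    if hi > lo:
        gray = (gray - lo) / (hi - lo) * 255.0  # stretch contrast
    gray = np.repeat(np.repeat(gray, scale, axis=0), scale, axis=1)
    return gray.astype(np.uint8)
```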

6. OpenCV (Frame Extraction)

What it provides:

  • Video file reading and frame extraction
  • Image processing (resize, crop, color conversion)
  • Template matching (detect code editors, terminals)
  • Histogram analysis (detect slide vs code vs webcam)

Python API usage:

```python
import cv2
import numpy as np

def extract_frames_at_timestamps(
    video_path: str,
    timestamps: list[float],
    output_dir: str
) -> list[str]:
    """Extract frames at specific timestamps."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    frame_paths = []

    for ts in timestamps:
        frame_number = int(ts * fps)
        cap.set(cv2.CAP_PROP_POS_FRAMES, frame_number)
        ret, frame = cap.read()
        if ret:
            path = f"{output_dir}/frame_{ts:.2f}.png"
            cv2.imwrite(path, frame)
            frame_paths.append(path)

    cap.release()
    return frame_paths


def classify_frame(image_path: str) -> str:
    """Classify a frame as code/slide/diagram/other.

    Heuristics implemented below:
    - Dark background with structured content = code/terminal
    - Light background with low variance = slide
    - High edge density = diagram

    (Webcam detection via a face detector is a separate step and is
    omitted from this sketch.)
    """
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    h, w = gray.shape

    # Check brightness distribution
    mean_brightness = np.mean(gray)
    brightness_std = np.std(gray)

    # Dark background with structured content = code/terminal
    if mean_brightness < 80 and brightness_std > 40:
        return 'code'  # or 'terminal'

    # Light background with text blocks = slide
    if mean_brightness > 180 and brightness_std < 60:
        return 'slide'

    # High edge density = diagram
    edges = cv2.Canny(gray, 50, 150)
    edge_density = np.count_nonzero(edges) / (h * w)
    if edge_density > 0.15:
        return 'diagram'

    return 'other'
```

Benchmarks & Performance Data

Transcript Extraction Speed

| Method | 10 min video | 30 min video | 60 min video | Requires Download |
| --- | --- | --- | --- | --- |
| youtube-transcript-api | ~0.5s | ~0.5s | ~0.5s | No |
| yt-dlp subtitles | ~2s | ~2s | ~2s | Subtitle file only |
| faster-whisper (tiny, GPU) | ~10s | ~30s | ~60s | Audio only |
| faster-whisper (base, GPU) | ~15s | ~45s | ~90s | Audio only |
| faster-whisper (large-v3, GPU) | ~80s | ~240s | ~480s | Audio only |
| faster-whisper (base, CPU) | ~60s | ~180s | ~360s | Audio only |

Visual Extraction Speed

| Operation | Per Frame | Per 10 min video (50 keyframes) |
| --- | --- | --- |
| Frame extraction (OpenCV) | ~5ms | ~0.25s |
| Scene detection (PySceneDetect) | N/A | ~15s for full video |
| Frame classification (heuristic) | ~10ms | ~0.5s |
| OCR per frame (easyocr, GPU) | ~200ms | ~10s |
| OCR per frame (easyocr, CPU) | ~1-2s | ~50-100s |

Total Pipeline Time (estimated)

| Mode | 10 min video | 30 min video | 1 hour video |
| --- | --- | --- | --- |
| Transcript only (YouTube captions) | ~2s | ~2s | ~2s |
| Transcript only (Whisper base, GPU) | ~20s | ~50s | ~100s |
| Full (transcript + visual, GPU) | ~35s | ~80s | ~170s |
| Full (transcript + visual, CPU) | ~120s | ~350s | ~700s |

Recommendations

Primary Stack (Chosen)

| Component | Library | Why |
| --- | --- | --- |
| Metadata + download | yt-dlp | De-facto standard, 1000+ sites, comprehensive Python API |
| YouTube transcripts | youtube-transcript-api | Fastest, no download, structured output |
| Speech-to-text | faster-whisper | 4x faster than Whisper, MIT, word timestamps |
| Scene detection | PySceneDetect | Best algorithm options, OpenCV-based |
| Frame extraction | opencv-python-headless | Standard, headless (no GUI deps) |
| OCR | easyocr | Best code/terminal accuracy, 80+ languages, GPU support |

Future Considerations

| Component | Library | When to Add |
| --- | --- | --- |
| Speaker diarization | whisperx or pyannote | V2.0 — identify who said what |
| Object detection | YOLO | V2.0 — detect UI elements, diagrams |
| Multimodal embeddings | CLIP | V2.0 — embed frames for visual search |
| Slide detection | python-pptx + heuristics | V1.5 — detect and extract slide content |

Sources