Video Source — Library Research & Industry Standards
Date: February 27, 2026 Document: 01 of 07 Status: Complete
Table of Contents
- Industry Standards & Approaches
- Library Comparison Matrix
- Detailed Library Analysis
- Architecture Patterns from Industry
- Benchmarks & Performance Data
- Recommendations
Industry Standards & Approaches
How the Industry Processes Video for AI/RAG
Based on research from NVIDIA, LlamaIndex, Ragie, and open-source projects, the industry has converged on a 3-stream parallel extraction model:
The 3-Stream Model
Video Input
│
├──→ Stream 1: ASR (Audio Speech Recognition)
│ Extract spoken words with timestamps
│ Tools: Whisper, YouTube captions API
│ Output: [{text, start, end, confidence}, ...]
│
├──→ Stream 2: OCR (Optical Character Recognition)
│ Extract visual text (code, slides, diagrams)
│ Tools: OpenCV + scene detection + OCR engine
│ Output: [{text, timestamp, frame_type, bbox}, ...]
│
└──→ Stream 3: Metadata
Extract structural info (chapters, tags, description)
Tools: yt-dlp, platform APIs
Output: {title, chapters, tags, description, ...}
Key insight (from NVIDIA's multimodal RAG blog): Ground everything to text first. Align all streams on a shared timeline, then merge into unified text segments. This makes the output compatible with any text-based RAG pipeline without requiring multimodal embeddings.
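The shared-timeline merge can be sketched as a small pure-Python step. The names below (`TimedText`, `merge_streams`, `render_segment`) are illustrative, not from any library:

```python
from dataclasses import dataclass

@dataclass
class TimedText:
    start: float   # seconds
    end: float     # seconds
    source: str    # "asr", "ocr", or "meta"
    text: str

def merge_streams(asr: list[TimedText], ocr: list[TimedText]) -> list[TimedText]:
    """Merge independently extracted streams into one chronological timeline."""
    return sorted(asr + ocr, key=lambda t: t.start)

def render_segment(items: list[TimedText], window_start: float, window_end: float) -> str:
    """Render everything overlapping a time window as one grounded text segment."""
    lines = []
    for item in items:
        if item.start < window_end and item.end > window_start:
            lines.append(f"[{item.source} @ {item.start:.1f}s] {item.text}")
    return "\n".join(lines)
```

Because the output is plain text with source and timestamp annotations, any downstream text-RAG chunker can consume it unchanged.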
Reference Implementations
| Project | Approach | Strengths | Weaknesses |
|---|---|---|---|
| video-analyzer | Whisper + OpenCV + LLM analysis | Full pipeline, LLM summaries | No chapter support, no YouTube integration |
| LlamaIndex MultiModal RAG | Frame extraction + CLIP + LanceDB | Vector search over frames | Heavy (requires GPU), no ASR |
| VideoRAG | Graph-based reasoning + multimodal retrieval | Multi-hour video support | Research project, not production-ready |
| Ragie Multimodal RAG | faster-whisper large-v3-turbo + OCR + object detection | Production-grade, 3-stream | Proprietary, not open-source |
Industry Best Practices
- Audio-only download — Never download the full video when you only need audio. Extract the audio stream with FFmpeg (the `-vn` flag); it is 10-50x smaller.
- Prefer existing captions — YouTube manual captions are higher quality than any ASR model. Only fall back to Whisper when captions are unavailable.
- Chapter-based segmentation — YouTube chapters provide natural content boundaries. Use them as primary segmentation, fall back to time-window or semantic splitting.
- Confidence filtering — Auto-generated captions and OCR output include confidence scores. Filter low-confidence content rather than including everything.
- Parallel extraction — Run ASR and OCR in parallel (they're independent). Merge after both complete.
- Tiered processing — Offer fast/light mode (transcript only) and deep mode (+ visual). Let users choose based on their compute budget.
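The "parallel extraction" practice above is a natural two-task job for `concurrent.futures`. In this sketch, `run_asr` and `run_ocr` are hypothetical stand-ins for the real stream extractors:

```python
from concurrent.futures import ThreadPoolExecutor

def extract_parallel(audio_path, frame_paths, run_asr, run_ocr):
    """Run the independent ASR and OCR streams concurrently; merge after both finish."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        asr_future = pool.submit(run_asr, audio_path)
        ocr_future = pool.submit(run_ocr, frame_paths)
        # .result() blocks until each stream completes (and re-raises its errors)
        return asr_future.result(), ocr_future.result()
```

Threads suffice here because both faster-whisper and easyocr release the GIL during their heavy native work; a `ProcessPoolExecutor` would also work at the cost of extra serialization.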
Library Comparison Matrix
Metadata & Download
| Library | Purpose | Install Size | Actively Maintained | Python API | License |
|---|---|---|---|---|---|
| yt-dlp | Metadata + subtitles + download | ~15MB | Yes (weekly releases) | Yes (YoutubeDL class) | Unlicense |
| pytube | YouTube download | ~1MB | Inconsistent | Yes | MIT |
| youtube-dl | Download (original) | ~10MB | Stale | Yes | Unlicense |
| pafy | YouTube metadata | ~50KB | Dead (2021) | Yes | LGPL |
Winner: yt-dlp — De-facto standard, actively maintained, comprehensive Python API, supports 1000+ sites (not just YouTube).
Transcript Extraction (YouTube)
| Library | Purpose | Requires Download | Speed | Accuracy | License |
|---|---|---|---|---|---|
| youtube-transcript-api | YouTube captions | No | Very fast (<1s) | Depends on caption source | MIT |
| yt-dlp subtitles | Download subtitle files | Yes (subtitle only) | Fast (~2s) | Same as above | Unlicense |
Winner: youtube-transcript-api — Fastest, no download needed, returns structured JSON with timestamps directly. Fall back to yt-dlp subtitles for non-YouTube platforms.
Speech-to-Text (ASR)
| Library | Speed (30 min audio) | Word Timestamps | Model Sizes | GPU Required | Language Support | License |
|---|---|---|---|---|---|---|
| faster-whisper | ~2-4 min (GPU), ~8-15 min (CPU) | Yes (word_timestamps=True) | tiny (39M) → large-v3 (1.5B) | No (but recommended) | 99 languages | MIT |
| openai-whisper | ~5-10 min (GPU), ~20-40 min (CPU) | Yes | Same models | Recommended | 99 languages | MIT |
| whisper-timestamped | Same as openai-whisper | Yes (more accurate) | Same models | Recommended | 99 languages | MIT |
| whisperx | ~2-3 min (GPU) | Yes (best accuracy via wav2vec2) | Same + wav2vec2 | Yes (required) | 99 languages | BSD |
| stable-ts | Same as openai-whisper | Yes (stabilized) | Same models | Recommended | 99 languages | MIT |
| Google Speech-to-Text | Real-time | Yes | Cloud | No | 125+ languages | Proprietary |
| AssemblyAI | Real-time | Yes | Cloud | No | 100+ languages | Proprietary |
Winner: faster-whisper — 4x faster than OpenAI Whisper via CTranslate2 optimization, MIT license, word-level timestamps, works without GPU (just slower), actively maintained. We may consider whisperx as a future upgrade for speaker diarization.
Scene Detection & Frame Extraction
| Library | Purpose | Algorithm | Speed | License |
|---|---|---|---|---|
| PySceneDetect | Scene boundary detection | ContentDetector, ThresholdDetector, AdaptiveDetector | Fast | BSD |
| opencv-python-headless | Frame extraction, image processing | Manual (absdiff, histogram) | Fast | Apache 2.0 |
| Filmstrip | Keyframe extraction | Scene detection + selection | Medium | MIT |
| video-keyframe-detector | Keyframe extraction | Peak estimation from frame diff | Fast | MIT |
| decord | GPU-accelerated frame extraction | Direct frame access | Very fast | Apache 2.0 |
Winner: PySceneDetect + opencv-python-headless — PySceneDetect handles intelligent boundary detection, OpenCV handles frame extraction and image processing. Both are well-maintained and BSD/Apache licensed.
OCR (Optical Character Recognition)
| Library | Languages | GPU Support | Accuracy on Code | Speed | Install Size | License |
|---|---|---|---|---|---|---|
| easyocr | 80+ | Yes (PyTorch) | Good | Medium | ~150MB + models | Apache 2.0 |
| pytesseract | 100+ | No | Medium | Fast | ~30MB + Tesseract | Apache 2.0 |
| PaddleOCR | 80+ | Yes (PaddlePaddle) | Very Good | Fast | ~200MB + models | Apache 2.0 |
| TrOCR (HuggingFace) | Multilingual | Yes | Good | Slow | ~500MB | MIT |
| docTR | 10+ | Yes (TF/PyTorch) | Good | Medium | ~100MB | Apache 2.0 |
Winner: easyocr — Best balance of accuracy (especially on code/terminal text), GPU support, language coverage, and ease of use. PaddleOCR is a close second but has heavier dependencies (PaddlePaddle framework).
Detailed Library Analysis
1. yt-dlp (Metadata & Download Engine)
What it provides:
- Video metadata (title, description, duration, upload date, channel, tags, categories)
- Chapter information (title, start_time, end_time for each chapter)
- Subtitle/caption download (all available languages, all formats)
- Thumbnail URLs
- View/like counts
- Playlist information (title, entries, ordering)
- Audio-only extraction (no full video download needed)
- Supports 1000+ video sites (YouTube, Vimeo, Dailymotion, etc.)
Python API usage:
```python
from yt_dlp import YoutubeDL

def extract_video_metadata(url: str) -> dict:
    """Extract metadata without downloading."""
    opts = {
        'quiet': True,
        'no_warnings': True,
        'extract_flat': False,  # Full extraction
    }
    with YoutubeDL(opts) as ydl:
        info = ydl.extract_info(url, download=False)
    return info
```
Key fields in info_dict:
```python
{
    'id': 'dQw4w9WgXcQ',                   # Video ID
    'title': 'Video Title',                # Full title
    'description': '...',                  # Full description text
    'duration': 1832,                      # Duration in seconds
    'upload_date': '20260115',             # YYYYMMDD format
    'uploader': 'Channel Name',            # Channel/uploader name
    'uploader_id': '@channelname',         # Channel ID
    'uploader_url': 'https://...',         # Channel URL
    'channel_follower_count': 150000,      # Subscriber count
    'view_count': 5000000,                 # View count
    'like_count': 120000,                  # Like count
    'comment_count': 8500,                 # Comment count
    'tags': ['react', 'hooks', ...],       # Video tags
    'categories': ['Education'],           # YouTube categories
    'language': 'en',                      # Primary language
    'subtitles': {                         # Manual captions
        'en': [{'ext': 'vtt', 'url': '...'}],
    },
    'automatic_captions': {                # Auto-generated captions
        'en': [{'ext': 'vtt', 'url': '...'}],
    },
    'chapters': [                          # Chapter markers
        {'title': 'Intro', 'start_time': 0, 'end_time': 45},
        {'title': 'Setup', 'start_time': 45, 'end_time': 180},
        {'title': 'First Component', 'start_time': 180, 'end_time': 420},
    ],
    'thumbnail': 'https://...',            # Best thumbnail URL
    'thumbnails': [...],                   # All thumbnail variants
    'webpage_url': 'https://...',          # Canonical URL
    'formats': [...],                      # Available formats
    'requested_formats': [...],            # Selected format info
}
```
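Given an `info` dict like the one above, chapter-based segmentation with a time-window fallback might look like the following sketch (the 300-second default window is an assumption, not a library value):

```python
def segment_boundaries(info: dict, window: float = 300.0) -> list[tuple[str, float, float]]:
    """Return (title, start, end) segments from chapters, or fixed time windows."""
    chapters = info.get('chapters') or []
    if chapters:
        # Chapters are the author's own content boundaries — use them as-is
        return [(c['title'], c['start_time'], c['end_time']) for c in chapters]
    # Fallback: uniform windows over the full duration
    duration = float(info.get('duration', 0))
    segments = []
    start, i = 0.0, 1
    while start < duration:
        end = min(start + window, duration)
        segments.append((f"Segment {i}", start, end))
        start, i = end, i + 1
    return segments
```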
Playlist extraction:
```python
def extract_playlist(url: str) -> list[dict]:
    """Extract all videos from a playlist."""
    opts = {
        'quiet': True,
        'extract_flat': True,  # Don't extract each video yet
    }
    with YoutubeDL(opts) as ydl:
        info = ydl.extract_info(url, download=False)
    # info['entries'] contains all video entries
    return info.get('entries', [])
```
Audio-only download (for Whisper):
```python
def download_audio(url: str, output_dir: str) -> str:
    """Download the audio stream only (no video)."""
    opts = {
        'format': 'bestaudio/best',
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'wav',
            # Note: 'preferredquality' sets a bitrate for lossy codecs and is
            # ignored for wav; Whisper resamples input to 16 kHz internally.
        }],
        'outtmpl': f'{output_dir}/%(id)s.%(ext)s',
        'quiet': True,
    }
    with YoutubeDL(opts) as ydl:
        info = ydl.extract_info(url, download=True)
    return f"{output_dir}/{info['id']}.wav"
```
2. youtube-transcript-api (Caption Extraction)
What it provides:
- Direct access to YouTube captions without downloading
- Manual and auto-generated caption support
- Translation support (translate captions to any language)
- Structured output with timestamps
Python API usage:
```python
from youtube_transcript_api import YouTubeTranscriptApi

# Note: pre-1.0 API shown; youtube-transcript-api 1.x moved to an
# instance-based API (YouTubeTranscriptApi().list(video_id)) — check
# the installed version.
def get_youtube_transcript(video_id: str, languages: list[str] = None) -> list[dict]:
    """Get transcript with timestamps."""
    languages = languages or ['en']
    transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)
    # Prefer manual captions over auto-generated
    try:
        transcript = transcript_list.find_manually_created_transcript(languages)
    except Exception:
        transcript = transcript_list.find_generated_transcript(languages)
    # Fetch the actual transcript data
    data = transcript.fetch()
    return data
    # Returns: [{'text': 'Hello', 'start': 0.0, 'duration': 1.5}, ...]
```
Output format:
```python
[
    {
        'text': "Welcome to this React tutorial",
        'start': 0.0,     # Start time in seconds
        'duration': 2.5,  # Duration in seconds
    },
    {
        'text': "Today we'll learn about hooks",
        'start': 2.5,
        'duration': 3.0,
    },
    # ... continues for entire video
]
```
Key features:
- Segments are typically 2-5 seconds each
- Manual captions have punctuation and proper casing
- Auto-generated captions may lack punctuation and have lower accuracy
- Can detect available languages and caption types
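Combining this output with yt-dlp's chapter markers gives chapter-aligned transcript text. A minimal sketch (the function name is illustrative; it assumes the dict shapes shown above):

```python
def group_transcript_by_chapter(transcript: list[dict], chapters: list[dict]) -> dict[str, str]:
    """Assign each caption segment to the chapter whose window contains its start time."""
    grouped = {c['title']: [] for c in chapters}
    for seg in transcript:
        for ch in chapters:
            if ch['start_time'] <= seg['start'] < ch['end_time']:
                grouped[ch['title']].append(seg['text'])
                break
    # Join each chapter's captions into one block of prose
    return {title: " ".join(texts) for title, texts in grouped.items()}
```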
3. faster-whisper (Speech-to-Text)
What it provides:
- OpenAI Whisper models with 4x speedup via CTranslate2
- Word-level timestamps with confidence scores
- Language detection
- VAD (Voice Activity Detection) filtering
- Multiple model sizes from tiny (39M) to large-v3 (1.5B)
Python API usage:
```python
from faster_whisper import WhisperModel

def transcribe_with_whisper(audio_path: str, model_size: str = "base") -> dict:
    """Transcribe audio file with word-level timestamps."""
    model = WhisperModel(
        model_size,
        device="auto",        # auto-detect GPU/CPU
        compute_type="auto",  # auto-select precision
    )
    segments, info = model.transcribe(
        audio_path,
        word_timestamps=True,
        vad_filter=True,  # Filter silence
        vad_parameters={
            "min_silence_duration_ms": 500,
        },
    )
    result = {
        'language': info.language,
        'language_probability': info.language_probability,
        'duration': info.duration,
        'segments': [],
    }
    for segment in segments:
        seg_data = {
            'start': segment.start,
            'end': segment.end,
            'text': segment.text.strip(),
            'avg_logprob': segment.avg_logprob,
            'no_speech_prob': segment.no_speech_prob,
            'words': [],
        }
        if segment.words:
            for word in segment.words:
                seg_data['words'].append({
                    'word': word.word,
                    'start': word.start,
                    'end': word.end,
                    'probability': word.probability,
                })
        result['segments'].append(seg_data)
    return result
```
Model size guide:
| Model | Parameters | English WER | Multilingual WER | VRAM (FP16) | Speed (30 min, GPU) |
|---|---|---|---|---|---|
| tiny | 39M | 14.8% | 23.2% | ~1GB | ~30s |
| base | 74M | 11.5% | 18.7% | ~1GB | ~45s |
| small | 244M | 9.5% | 14.6% | ~2GB | ~90s |
| medium | 769M | 8.0% | 12.4% | ~5GB | ~180s |
| large-v3 | 1.5B | 5.7% | 10.1% | ~10GB | ~240s |
| large-v3-turbo | 809M | 6.2% | 10.8% | ~6GB | ~120s |
Recommendation: Default to base (good balance), offer large-v3-turbo for best accuracy, tiny for speed.
4. PySceneDetect (Scene Boundary Detection)
What it provides:
- Automatic scene/cut detection in video files
- Multiple detection algorithms (content-based, threshold, adaptive)
- Frame-accurate boundaries
- Integration with OpenCV
Python API usage:
```python
from scenedetect import detect, ContentDetector

def detect_scene_changes(video_path: str) -> list[tuple[float, float]]:
    """Detect scene boundaries in video.

    Returns list of (start_time, end_time) tuples.
    """
    scene_list = detect(
        video_path,
        ContentDetector(
            threshold=27.0,    # Sensitivity (lower = more scenes)
            min_scene_len=15,  # Minimum 15 frames per scene
        ),
    )
    boundaries = []
    for scene in scene_list:
        start = scene[0].get_seconds()
        end = scene[1].get_seconds()
        boundaries.append((start, end))
    return boundaries
```
Detection algorithms:
| Algorithm | Best For | Speed | Sensitivity |
|---|---|---|---|
| ContentDetector | General content changes | Fast | Medium |
| AdaptiveDetector | Gradual transitions | Medium | High |
| ThresholdDetector | Hard cuts (black frames) | Very fast | Low |
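A thin bridge between scene detection and frame extraction is to pick one representative timestamp per scene, skipping very short scenes that are usually transitions (the 2-second cutoff here is an assumption):

```python
def keyframe_timestamps(
    scene_boundaries: list[tuple[float, float]],
    min_scene_seconds: float = 2.0,
) -> list[float]:
    """Pick the midpoint of each sufficiently long scene as its OCR keyframe."""
    stamps = []
    for start, end in scene_boundaries:
        if end - start >= min_scene_seconds:
            # Midpoint avoids the blurry frames near cuts and fades
            stamps.append((start + end) / 2.0)
    return stamps
```

The resulting timestamps feed directly into `extract_frames_at_timestamps` in the OpenCV section below.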
5. easyocr (Text Recognition)
What it provides:
- Text detection and recognition from images
- 80+ language support
- GPU acceleration
- Bounding box coordinates for each text region
- Confidence scores
Python API usage:
```python
import easyocr

def extract_text_from_frame(image_path: str, languages: list[str] = None) -> list[dict]:
    """Extract text from a video frame image."""
    languages = languages or ['en']
    # Reader loads its models at construction — in a real pipeline, create it
    # once and reuse it across frames; gpu=True falls back to CPU if no GPU.
    reader = easyocr.Reader(languages, gpu=True)
    results = reader.readtext(image_path)
    # results: [(bbox, text, confidence), ...] where bbox is four [x, y] corners
    extracted = []
    for bbox, text, confidence in results:
        extracted.append({
            'text': text,
            'confidence': confidence,
            'bbox': bbox,  # Corner coordinates
        })
    return extracted
```
Tips for code/terminal OCR:
- Pre-process images: increase contrast, convert to grayscale
- Use higher DPI/resolution frames
- Filter by confidence threshold (>0.5 for code)
- Detect monospace regions first, then OCR only those regions
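The confidence-filtering tip above, as a pure-Python sketch over easyocr-style result dicts. The thresholds mirror the tip's suggested >0.5 cutoff; the minimum-length check is an added assumption to drop stray punctuation hits:

```python
def filter_ocr_results(results: list[dict], min_confidence: float = 0.5,
                       min_chars: int = 2) -> list[dict]:
    """Keep only OCR hits likely to be real code text."""
    kept = []
    for item in results:
        text = item['text'].strip()
        # Low-confidence hits and one-character fragments are usually noise
        if item['confidence'] >= min_confidence and len(text) >= min_chars:
            kept.append(item)
    return kept
```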
6. OpenCV (Frame Extraction)
What it provides:
- Video file reading and frame extraction
- Image processing (resize, crop, color conversion)
- Template matching (detect code editors, terminals)
- Histogram analysis (detect slide vs code vs webcam)
Python API usage:
```python
import cv2
import numpy as np

def extract_frames_at_timestamps(
    video_path: str,
    timestamps: list[float],
    output_dir: str,
) -> list[str]:
    """Extract frames at specific timestamps."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    frame_paths = []
    for ts in timestamps:
        frame_number = int(ts * fps)
        cap.set(cv2.CAP_PROP_POS_FRAMES, frame_number)
        ret, frame = cap.read()
        if ret:
            path = f"{output_dir}/frame_{ts:.2f}.png"
            cv2.imwrite(path, frame)
            frame_paths.append(path)
    cap.release()
    return frame_paths

def classify_frame(image_path: str) -> str:
    """Classify frame as code/slide/terminal/webcam/other.

    Uses heuristics:
    - Dark background + monospace text regions = code/terminal
    - Light background + large text blocks = slide
    - Face detection = webcam
    - High color variance = diagram
    """
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    h, w = gray.shape
    # Check brightness distribution
    mean_brightness = np.mean(gray)
    brightness_std = np.std(gray)
    # Dark background with structured content = code/terminal
    if mean_brightness < 80 and brightness_std > 40:
        return 'code'  # or 'terminal'
    # Light background with text blocks = slide
    if mean_brightness > 180 and brightness_std < 60:
        return 'slide'
    # High edge density = diagram
    edges = cv2.Canny(gray, 50, 150)
    edge_density = np.count_nonzero(edges) / (h * w)
    if edge_density > 0.15:
        return 'diagram'
    return 'other'
```
Benchmarks & Performance Data
Transcript Extraction Speed
| Method | 10 min video | 30 min video | 60 min video | Requires Download |
|---|---|---|---|---|
| youtube-transcript-api | ~0.5s | ~0.5s | ~0.5s | No |
| yt-dlp subtitles | ~2s | ~2s | ~2s | Subtitle file only |
| faster-whisper (tiny, GPU) | ~10s | ~30s | ~60s | Audio only |
| faster-whisper (base, GPU) | ~15s | ~45s | ~90s | Audio only |
| faster-whisper (large-v3, GPU) | ~80s | ~240s | ~480s | Audio only |
| faster-whisper (base, CPU) | ~60s | ~180s | ~360s | Audio only |
Visual Extraction Speed
| Operation | Per Frame | Per 10 min video (50 keyframes) |
|---|---|---|
| Frame extraction (OpenCV) | ~5ms | ~0.25s |
| Scene detection (PySceneDetect) | N/A | ~15s for full video |
| Frame classification (heuristic) | ~10ms | ~0.5s |
| OCR per frame (easyocr, GPU) | ~200ms | ~10s |
| OCR per frame (easyocr, CPU) | ~1-2s | ~50-100s |
Total Pipeline Time (estimated)
| Mode | 10 min video | 30 min video | 1 hour video |
|---|---|---|---|
| Transcript only (YouTube captions) | ~2s | ~2s | ~2s |
| Transcript only (Whisper base, GPU) | ~20s | ~50s | ~100s |
| Full (transcript + visual, GPU) | ~35s | ~80s | ~170s |
| Full (transcript + visual, CPU) | ~120s | ~350s | ~700s |
Recommendations
Primary Stack (Chosen)
| Component | Library | Why |
|---|---|---|
| Metadata + download | yt-dlp | De-facto standard, 1000+ sites, comprehensive Python API |
| YouTube transcripts | youtube-transcript-api | Fastest, no download, structured output |
| Speech-to-text | faster-whisper | 4x faster than Whisper, MIT, word timestamps |
| Scene detection | PySceneDetect | Best algorithm options, OpenCV-based |
| Frame extraction | opencv-python-headless | Standard, headless (no GUI deps) |
| OCR | easyocr | Best code/terminal accuracy, 80+ languages, GPU support |
Future Considerations
| Component | Library | When to Add |
|---|---|---|
| Speaker diarization | whisperx or pyannote | V2.0 — identify who said what |
| Object detection | YOLO | V2.0 — detect UI elements, diagrams |
| Multimodal embeddings | CLIP | V2.0 — embed frames for visual search |
| Slide detection | python-pptx + heuristics | V1.5 — detect and extract slide content |
Sources
- youtube-transcript-api (PyPI)
- yt-dlp GitHub
- yt-dlp Information Extraction Pipeline (DeepWiki)
- faster-whisper GitHub
- faster-whisper (PyPI)
- whisper-timestamped GitHub
- stable-ts (PyPI)
- PySceneDetect GitHub
- easyocr GitHub
- NVIDIA Multimodal RAG for Video and Audio
- LlamaIndex MultiModal RAG for Video
- Ragie: How We Built Multimodal RAG
- video-analyzer GitHub
- VideoRAG Project
- video-keyframe-detector GitHub
- Filmstrip GitHub