Add complete video tutorial extraction system that converts YouTube videos and local video files into AI-consumable skills. The pipeline extracts transcripts, performs visual OCR on code editor panels independently, tracks code evolution across frames, and generates structured SKILL.md output.

Key features:

- Video metadata extraction (YouTube, local files, playlists)
- Multi-source transcript extraction (YouTube API, yt-dlp, Whisper fallback)
- Chapter-based and time-window segmentation
- Visual extraction: keyframe detection, frame classification, panel detection
- Per-panel sub-section OCR (each IDE panel OCR'd independently)
- Parallel OCR with ThreadPoolExecutor for multi-panel frames
- Narrow panel filtering (300px min width) to skip UI chrome
- Text block tracking with spatial panel position matching
- Code timeline with edit tracking across frames
- Audio-visual alignment (code + narrator pairs)
- Video-specific AI enhancement prompt for OCR denoising and code reconstruction
- video-tutorial.yaml workflow with 4 stages (OCR cleanup, language detection, tutorial synthesis, skill polish)
- CLI integration: skill-seekers video --url/--video-file/--playlist
- MCP tool: scrape_video for automation
- 161 tests passing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

# Video Source: Library Research & Industry Standards

**Date:** February 27, 2026
**Document:** 01 of 07
**Status:** Complete

---

## Table of Contents

1. [Industry Standards & Approaches](#industry-standards--approaches)
2. [Library Comparison Matrix](#library-comparison-matrix)
3. [Detailed Library Analysis](#detailed-library-analysis)
4. [Architecture Patterns from Industry](#architecture-patterns-from-industry)
5. [Benchmarks & Performance Data](#benchmarks--performance-data)
6. [Recommendations](#recommendations)

---

## Industry Standards & Approaches

### How the Industry Processes Video for AI/RAG

Based on research from NVIDIA, LlamaIndex, Ragie, and open-source projects, the industry has converged on a **3-stream parallel extraction** model:

#### The 3-Stream Model

```
Video Input
    │
    ├──→ Stream 1: ASR (Automatic Speech Recognition)
    │    Extract spoken words with timestamps
    │    Tools: Whisper, YouTube captions API
    │    Output: [{text, start, end, confidence}, ...]
    │
    ├──→ Stream 2: OCR (Optical Character Recognition)
    │    Extract visual text (code, slides, diagrams)
    │    Tools: OpenCV + scene detection + OCR engine
    │    Output: [{text, timestamp, frame_type, bbox}, ...]
    │
    └──→ Stream 3: Metadata
         Extract structural info (chapters, tags, description)
         Tools: yt-dlp, platform APIs
         Output: {title, chapters, tags, description, ...}
```

**Key insight (from NVIDIA's multimodal RAG blog):** Ground everything to text first. Align all streams on a shared timeline, then merge into unified text segments. This makes the output compatible with any text-based RAG pipeline without requiring multimodal embeddings.

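To make the alignment step concrete, here is a minimal sketch of merging the three streams on a shared timeline. It assumes the output shapes shown in the diagram (`start`/`end` for ASR, `timestamp` for OCR) and yt-dlp style chapter dicts; the `Segment` class and the `merge_streams` and `to_text` helpers are hypothetical names introduced for illustration, not part of any library.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """A unified text segment covering [start, end) seconds."""
    start: float
    end: float
    speech: list = field(default_factory=list)
    on_screen: list = field(default_factory=list)


def merge_streams(chapters, asr, ocr):
    """Assign ASR and OCR items to the chapter whose time range contains them."""
    segments = [Segment(c["start_time"], c["end_time"]) for c in chapters]
    for item in asr:
        for seg in segments:
            if seg.start <= item["start"] < seg.end:
                seg.speech.append(item["text"])
                break
    for item in ocr:
        for seg in segments:
            if seg.start <= item["timestamp"] < seg.end:
                seg.on_screen.append(item["text"])
                break
    return segments


def to_text(seg):
    """Ground a segment to plain text for a text-only RAG pipeline."""
    parts = [" ".join(seg.speech)]
    if seg.on_screen:
        parts.append("On screen:\n" + "\n".join(seg.on_screen))
    return "\n\n".join(p for p in parts if p)


# Illustrative inputs (made-up values)
chapters = [
    {"title": "Intro", "start_time": 0, "end_time": 45},
    {"title": "Setup", "start_time": 45, "end_time": 180},
]
asr = [
    {"text": "Welcome to this tutorial.", "start": 0.0, "end": 2.5},
    {"text": "First, install the package.", "start": 46.0, "end": 49.0},
]
ocr = [{"text": "pip install example", "timestamp": 50.0}]

merged = merge_streams(chapters, asr, ocr)
print(to_text(merged[1]))
```

Because each merged segment is plain text, it can be chunked and embedded like any other document, which is the point of grounding everything to text first.
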
#### Reference Implementations

| Project | Approach | Strengths | Weaknesses |
|---------|----------|-----------|------------|
| [video-analyzer](https://github.com/byjlw/video-analyzer) | Whisper + OpenCV + LLM analysis | Full pipeline, LLM summaries | No chapter support, no YouTube integration |
| [LlamaIndex MultiModal RAG](https://www.llamaindex.ai/blog/multimodal-rag-for-advanced-video-processing-with-llamaindex-lancedb-33be4804822e) | Frame extraction + CLIP + LanceDB | Vector search over frames | Heavy (requires GPU), no ASR |
| [VideoRAG](https://video-rag.github.io/) | Graph-based reasoning + multimodal retrieval | Multi-hour video support | Research project, not production-ready |
| [Ragie Multimodal RAG](https://www.ragie.ai/blog/how-we-built-multimodal-rag-for-audio-and-video) | faster-whisper large-v3-turbo + OCR + object detection | Production-grade, 3-stream | Proprietary, not open-source |

#### Industry Best Practices

1. **Audio-only download**: Never download the full video when you only need audio. Extract the audio stream with FFmpeg (`-vn` flag); the result is typically 10-50x smaller.
2. **Prefer existing captions**: Manual YouTube captions are typically higher quality than ASR output. Only fall back to Whisper when captions are unavailable.
3. **Chapter-based segmentation**: YouTube chapters provide natural content boundaries. Use them as the primary segmentation and fall back to time-window or semantic splitting.
4. **Confidence filtering**: Auto-generated captions and OCR output include confidence scores. Filter low-confidence content rather than including everything.
5. **Parallel extraction**: Run ASR and OCR in parallel (they are independent). Merge after both complete.
6. **Tiered processing**: Offer a fast/light mode (transcript only) and a deep mode (+ visual). Let users choose based on their compute budget.

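As an illustration of practice 3, the sketch below uses chapters when they exist and falls back to fixed time windows otherwise. The function name `segment_boundaries` and the 120-second default window are assumptions chosen for the example, not values taken from any of the referenced projects.

```python
def segment_boundaries(duration, chapters=None, window=120.0):
    """Return (start, end, title) tuples covering the whole video."""
    if chapters:
        # Primary path: chapter markers in yt-dlp's info_dict format
        return [(c["start_time"], c["end_time"], c["title"]) for c in chapters]

    # Fallback: fixed-size windows; the last one is clipped to the duration
    bounds = []
    start = 0.0
    index = 1
    while start < duration:
        end = min(start + window, duration)
        bounds.append((start, end, f"Segment {index}"))
        start = end
        index += 1
    return bounds


with_chapters = segment_boundaries(
    180, [{"title": "Intro", "start_time": 0, "end_time": 45},
          {"title": "Setup", "start_time": 45, "end_time": 180}]
)
without_chapters = segment_boundaries(300)
```

Semantic splitting would slot in as a third branch; it is left out here because it depends on the transcript rather than on timing alone.
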
---

## Library Comparison Matrix

### Metadata & Download

| Library | Purpose | Install Size | Actively Maintained | Python API | License |
|---------|---------|-------------|-------------------|------------|---------|
| **yt-dlp** | Metadata + subtitles + download | ~15MB | Yes (weekly releases) | Yes (`YoutubeDL` class) | Unlicense |
| pytube | YouTube download | ~1MB | Inconsistent | Yes | MIT |
| youtube-dl | Download (original) | ~10MB | Stale | Yes | Unlicense |
| pafy | YouTube metadata | ~50KB | Dead (2021) | Yes | LGPL |

**Winner: yt-dlp.** De-facto standard, actively maintained, comprehensive Python API, supports 1000+ sites (not just YouTube).

### Transcript Extraction (YouTube)

| Library | Purpose | Requires Download | Speed | Accuracy | License |
|---------|---------|-------------------|-------|----------|---------|
| **youtube-transcript-api** | YouTube captions | No | Very fast (<1s) | Depends on caption source | MIT |
| yt-dlp subtitles | Download subtitle files | Yes (subtitle only) | Fast (~2s) | Same as above | Unlicense |

**Winner: youtube-transcript-api.** Fastest, no download needed, returns structured JSON with timestamps directly. Falls back to yt-dlp for non-YouTube platforms.

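The fallback order described above can be sketched as a simple chain that tries each source in turn. Everything in this block is hypothetical scaffolding: `first_available` is an illustrative helper, and the two stand-in extractors only simulate a missing caption track and a successful Whisper run.

```python
def first_available(video_id, extractors):
    """Try each (name, extract) pair in order; return the first non-empty result."""
    for name, extract in extractors:
        try:
            segments = extract(video_id)
        except Exception:  # a real pipeline would catch narrower errors
            continue
        if segments:
            return name, segments
    return None, []


def no_captions(video_id):
    """Stand-in for a caption lookup that finds nothing."""
    raise LookupError("no captions for " + video_id)


def fake_whisper(video_id):
    """Stand-in for an ASR fallback that always succeeds."""
    return [{"text": "hello", "start": 0.0, "end": 1.0}]


source, segments = first_available(
    "abc123", [("captions", no_captions), ("whisper", fake_whisper)]
)
```

Returning the source name alongside the segments lets later stages record where each transcript came from.
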
### Speech-to-Text (ASR)

| Library | Speed (30 min audio) | Word Timestamps | Model Sizes | GPU Required | Language Support | License |
|---------|---------------------|----------------|-------------|-------------|-----------------|---------|
| **faster-whisper** | ~2-4 min (GPU), ~8-15 min (CPU) | Yes (`word_timestamps=True`) | tiny (39M) → large-v3 (1.5B) | No (but recommended) | 99 languages | MIT |
| openai-whisper | ~5-10 min (GPU), ~20-40 min (CPU) | Yes | Same models | Recommended | 99 languages | MIT |
| whisper-timestamped | Same as openai-whisper | Yes (more accurate) | Same models | Recommended | 99 languages | MIT |
| whisperx | ~2-3 min (GPU) | Yes (best accuracy via wav2vec2) | Same + wav2vec2 | Yes (required) | 99 languages | BSD |
| stable-ts | Same as openai-whisper | Yes (stabilized) | Same models | Recommended | 99 languages | MIT |
| Google Speech-to-Text | Real-time | Yes | Cloud | No | 125+ languages | Proprietary |
| AssemblyAI | Real-time | Yes | Cloud | No | 100+ languages | Proprietary |

**Winner: faster-whisper.** Up to 4x faster than OpenAI Whisper via CTranslate2 optimization, MIT license, word-level timestamps, works without a GPU (just slower), actively maintained. We may consider whisperx as a future upgrade for speaker diarization.

### Scene Detection & Frame Extraction

| Library | Purpose | Algorithm | Speed | License |
|---------|---------|-----------|-------|---------|
| **PySceneDetect** | Scene boundary detection | ContentDetector, ThresholdDetector, AdaptiveDetector | Fast | BSD |
| opencv-python-headless | Frame extraction, image processing | Manual (absdiff, histogram) | Fast | Apache 2.0 |
| Filmstrip | Keyframe extraction | Scene detection + selection | Medium | MIT |
| video-keyframe-detector | Keyframe extraction | Peak estimation from frame diff | Fast | MIT |
| decord | GPU-accelerated frame extraction | Direct frame access | Very fast | Apache 2.0 |

**Winner: PySceneDetect + opencv-python-headless.** PySceneDetect handles intelligent boundary detection, OpenCV handles frame extraction and image processing. Both are well-maintained and BSD/Apache licensed.

### OCR (Optical Character Recognition)

| Library | Languages | GPU Support | Accuracy on Code | Speed | Install Size | License |
|---------|-----------|------------|-------------------|-------|-------------|---------|
| **easyocr** | 80+ | Yes (PyTorch) | Good | Medium | ~150MB + models | Apache 2.0 |
| pytesseract | 100+ | No | Medium | Fast | ~30MB + Tesseract | Apache 2.0 |
| PaddleOCR | 80+ | Yes (PaddlePaddle) | Very Good | Fast | ~200MB + models | Apache 2.0 |
| TrOCR (HuggingFace) | Multilingual | Yes | Good | Slow | ~500MB | MIT |
| docTR | 10+ | Yes (TF/PyTorch) | Good | Medium | ~100MB | Apache 2.0 |

**Winner: easyocr.** Best balance of accuracy (especially on code/terminal text), GPU support, language coverage, and ease of use. PaddleOCR is a close second but has heavier dependencies (PaddlePaddle framework).

---

## Detailed Library Analysis

### 1. yt-dlp (Metadata & Download Engine)

**What it provides:**
- Video metadata (title, description, duration, upload date, channel, tags, categories)
- Chapter information (title, start_time, end_time for each chapter)
- Subtitle/caption download (all available languages, all formats)
- Thumbnail URLs
- View/like counts
- Playlist information (title, entries, ordering)
- Audio-only extraction (no full video download needed)
- Supports 1000+ video sites (YouTube, Vimeo, Dailymotion, etc.)

**Python API usage:**

```python
from yt_dlp import YoutubeDL


def extract_video_metadata(url: str) -> dict:
    """Extract metadata without downloading."""
    opts = {
        'quiet': True,
        'no_warnings': True,
        'extract_flat': False,  # Full extraction
    }
    with YoutubeDL(opts) as ydl:
        info = ydl.extract_info(url, download=False)
    return info
```

**Key fields in `info_dict`:**

```python
{
    'id': 'dQw4w9WgXcQ',               # Video ID
    'title': 'Video Title',            # Full title
    'description': '...',              # Full description text
    'duration': 1832,                  # Duration in seconds
    'upload_date': '20260115',         # YYYYMMDD format
    'uploader': 'Channel Name',        # Channel/uploader name
    'uploader_id': '@channelname',     # Channel ID
    'uploader_url': 'https://...',     # Channel URL
    'channel_follower_count': 150000,  # Subscriber count
    'view_count': 5000000,             # View count
    'like_count': 120000,              # Like count
    'comment_count': 8500,             # Comment count
    'tags': ['react', 'hooks', ...],   # Video tags
    'categories': ['Education'],       # YouTube categories
    'language': 'en',                  # Primary language
    'subtitles': {                     # Manual captions
        'en': [{'ext': 'vtt', 'url': '...'}],
    },
    'automatic_captions': {            # Auto-generated captions
        'en': [{'ext': 'vtt', 'url': '...'}],
    },
    'chapters': [                      # Chapter markers
        {'title': 'Intro', 'start_time': 0, 'end_time': 45},
        {'title': 'Setup', 'start_time': 45, 'end_time': 180},
        {'title': 'First Component', 'start_time': 180, 'end_time': 420},
    ],
    'thumbnail': 'https://...',        # Best thumbnail URL
    'thumbnails': [...],               # All thumbnail variants
    'webpage_url': 'https://...',      # Canonical URL
    'formats': [...],                  # Available formats
    'requested_formats': [...],        # Selected format info
}
```

**Playlist extraction:**

```python
def extract_playlist(url: str) -> list[dict]:
    """Extract all videos from a playlist."""
    opts = {
        'quiet': True,
        'extract_flat': True,  # Don't extract each video yet
    }
    with YoutubeDL(opts) as ydl:
        info = ydl.extract_info(url, download=False)
    # info['entries'] contains all video entries
    return info.get('entries', [])
```

**Audio-only download (for Whisper):**

```python
import os

from yt_dlp import YoutubeDL


def download_audio(url: str, output_dir: str) -> str:
    """Download the audio stream only (no video) and convert it to WAV."""
    opts = {
        'format': 'bestaudio/best',
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'wav',
        }],
        # 'preferredquality' is a bitrate/quality setting, not a sample rate,
        # so it cannot select 16 kHz. Resample through FFmpeg arguments
        # instead: 16 kHz mono matches Whisper's native input format.
        'postprocessor_args': ['-ar', '16000', '-ac', '1'],
        'outtmpl': os.path.join(output_dir, '%(id)s.%(ext)s'),
        'quiet': True,
    }
    with YoutubeDL(opts) as ydl:
        info = ydl.extract_info(url, download=True)
    return os.path.join(output_dir, f"{info['id']}.wav")
```

### 2. youtube-transcript-api (Caption Extraction)

**What it provides:**
- Direct access to YouTube captions without downloading
- Manual and auto-generated caption support
- Translation support (translate captions to any language)
- Structured output with timestamps

**Python API usage:**

```python
from youtube_transcript_api import NoTranscriptFound, YouTubeTranscriptApi


def get_youtube_transcript(
    video_id: str, languages: list[str] | None = None
) -> list[dict]:
    """Get transcript with timestamps.

    Returns: [{'text': 'Hello', 'start': 0.0, 'duration': 1.5}, ...]
    """
    languages = languages or ['en']

    # Instance API of youtube-transcript-api >= 1.0. Older releases exposed
    # the static YouTubeTranscriptApi.list_transcripts(video_id) instead.
    transcript_list = YouTubeTranscriptApi().list(video_id)

    # Prefer manual captions over auto-generated
    try:
        transcript = transcript_list.find_manually_created_transcript(languages)
    except NoTranscriptFound:
        transcript = transcript_list.find_generated_transcript(languages)

    # Fetch the actual transcript data and convert it to plain dicts
    return transcript.fetch().to_raw_data()
```

**Output format:**

```python
[
    {
        'text': "Welcome to this React tutorial",
        'start': 0.0,     # Start time in seconds
        'duration': 2.5   # Duration in seconds
    },
    {
        'text': "Today we'll learn about hooks",
        'start': 2.5,
        'duration': 3.0
    },
    # ... continues for entire video
]
```

**Key features:**
- Segments are typically 2-5 seconds each
- Manual captions have punctuation and proper casing
- Auto-generated captions may lack punctuation and have lower accuracy
- Can detect available languages and caption types

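Captions arrive as `start`/`duration` pairs, while the ASR stream in the 3-stream model uses `start`/`end`. A small normalization step, sketched below, puts both on the same schema before merging. The helper name `normalize_captions` and the `source` field are assumptions added for illustration.

```python
def normalize_captions(raw, source="manual"):
    """Convert start/duration caption dicts into start/end dicts."""
    return [
        {
            "text": item["text"].strip(),
            "start": item["start"],
            "end": item["start"] + item["duration"],
            "source": source,  # e.g. "manual" or "auto-generated"
        }
        for item in raw
        if item["text"].strip()  # drop empty caption lines
    ]


raw = [
    {"text": "Welcome to this React tutorial", "start": 0.0, "duration": 2.5},
    {"text": "   ", "start": 2.5, "duration": 1.0},
]
captions = normalize_captions(raw)
```

Keeping the caption source on each segment makes it possible to apply stricter filtering to auto-generated captions later.
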
### 3. faster-whisper (Speech-to-Text)

**What it provides:**
- OpenAI Whisper models with up to 4x speedup via CTranslate2
- Word-level timestamps with confidence scores
- Language detection
- VAD (Voice Activity Detection) filtering
- Multiple model sizes from tiny (39M) to large-v3 (1.5B)

**Python API usage:**

```python
from faster_whisper import WhisperModel


def transcribe_with_whisper(audio_path: str, model_size: str = "base") -> dict:
    """Transcribe audio file with word-level timestamps."""
    model = WhisperModel(
        model_size,
        device="auto",        # auto-detect GPU/CPU
        compute_type="auto",  # auto-select precision
    )

    # `segments` is a lazy generator: transcription runs as it is iterated
    segments, info = model.transcribe(
        audio_path,
        word_timestamps=True,
        vad_filter=True,  # Filter silence
        vad_parameters={
            "min_silence_duration_ms": 500,
        },
    )

    result = {
        'language': info.language,
        'language_probability': info.language_probability,
        'duration': info.duration,
        'segments': [],
    }

    for segment in segments:
        seg_data = {
            'start': segment.start,
            'end': segment.end,
            'text': segment.text.strip(),
            'avg_logprob': segment.avg_logprob,
            'no_speech_prob': segment.no_speech_prob,
            'words': [],
        }
        if segment.words:
            for word in segment.words:
                seg_data['words'].append({
                    'word': word.word,
                    'start': word.start,
                    'end': word.end,
                    'probability': word.probability,
                })
        result['segments'].append(seg_data)

    return result
```

**Model size guide:**

| Model | Parameters | English WER | Multilingual WER | VRAM (FP16) | Speed (30 min, GPU) |
|-------|-----------|-------------|------------------|-------------|---------------------|
| tiny | 39M | 14.8% | 23.2% | ~1GB | ~30s |
| base | 74M | 11.5% | 18.7% | ~1GB | ~45s |
| small | 244M | 9.5% | 14.6% | ~2GB | ~90s |
| medium | 769M | 8.0% | 12.4% | ~5GB | ~180s |
| large-v3 | 1.5B | 5.7% | 10.1% | ~10GB | ~240s |
| large-v3-turbo | 809M | 6.2% | 10.8% | ~6GB | ~120s |

**Recommendation:** Default to `base` (good balance). Offer `large-v3-turbo` for near-best accuracy at about half the runtime of `large-v3`, and `tiny` for speed.

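The per-segment `avg_logprob` and `no_speech_prob` values returned by `transcribe_with_whisper` can drive the confidence filtering recommended earlier. The sketch below is one possible filter; the helper name `filter_segments` is hypothetical, and the default thresholds are assumptions that mirror Whisper's own decoding defaults (`-1.0` and `0.6`), so they should be tuned on real data.

```python
def filter_segments(segments, min_avg_logprob=-1.0, max_no_speech_prob=0.6):
    """Keep segments that look like confidently transcribed speech."""
    kept = []
    for seg in segments:
        if seg["no_speech_prob"] > max_no_speech_prob:
            continue  # probably silence, music, or noise
        if seg["avg_logprob"] < min_avg_logprob:
            continue  # the model was unsure about the wording
        kept.append(seg)
    return kept


# Illustrative segments (made-up scores)
sample = [
    {"text": "Let's create a component.", "avg_logprob": -0.21, "no_speech_prob": 0.02},
    {"text": "uh", "avg_logprob": -1.40, "no_speech_prob": 0.10},
    {"text": "...", "avg_logprob": -0.50, "no_speech_prob": 0.91},
]
confident = filter_segments(sample)
```
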
### 4. PySceneDetect (Scene Boundary Detection)

**What it provides:**
- Automatic scene/cut detection in video files
- Multiple detection algorithms (content-based, threshold, adaptive)
- Frame-accurate boundaries
- Integration with OpenCV

**Python API usage:**

```python
from scenedetect import detect, ContentDetector, AdaptiveDetector


def detect_scene_changes(video_path: str) -> list[tuple[float, float]]:
    """Detect scene boundaries in video.

    Returns list of (start_time, end_time) tuples.
    """
    scene_list = detect(
        video_path,
        ContentDetector(
            threshold=27.0,    # Sensitivity (lower = more scenes)
            min_scene_len=15,  # Minimum 15 frames per scene
        ),
    )

    boundaries = []
    for scene in scene_list:
        start = scene[0].get_seconds()
        end = scene[1].get_seconds()
        boundaries.append((start, end))

    return boundaries
```

**Detection algorithms:**

| Algorithm | Best For | Speed | Sensitivity |
|-----------|----------|-------|-------------|
| ContentDetector | Hard cuts and general content changes | Fast | Medium |
| AdaptiveDetector | Content changes in footage with fast camera motion (compares against a rolling average) | Medium | High |
| ThresholdDetector | Fades to and from black (average frame brightness) | Very fast | Low |

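Once scene boundaries are known, one frame per scene is usually enough for OCR. The sketch below picks the midpoint of each scene, which tends to avoid transition frames; the helper name `keyframe_timestamps` and the 1-second minimum scene length are assumptions made for illustration. Its input is the `(start, end)` list returned by `detect_scene_changes`, and its output can be passed to a frame extractor.

```python
def keyframe_timestamps(boundaries, min_duration=1.0):
    """Return one midpoint timestamp (seconds) per sufficiently long scene."""
    return [
        round((start + end) / 2, 2)
        for start, end in boundaries
        if end - start >= min_duration  # skip flashes and very short scenes
    ]


# Illustrative boundaries in seconds
boundaries = [(0.0, 10.0), (10.0, 10.4), (10.4, 30.4)]
timestamps = keyframe_timestamps(boundaries)
```
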
### 5. easyocr (Text Recognition)

**What it provides:**
- Text detection and recognition from images
- 80+ language support
- GPU acceleration
- Bounding box coordinates for each text region
- Confidence scores

**Python API usage:**

```python
import easyocr


def extract_text_from_frame(image_path: str, languages: list[str] = None) -> list[dict]:
    """Extract text from a video frame image."""
    languages = languages or ['en']
    # Creating a Reader loads the models; when processing many frames,
    # build it once and reuse it instead of constructing it per call.
    reader = easyocr.Reader(languages, gpu=True)

    results = reader.readtext(image_path)
    # Each result: ([[x1,y1],[x2,y2],[x3,y3],[x4,y4]], text, confidence)

    extracted = []
    for bbox, text, confidence in results:
        extracted.append({
            'text': text,
            'confidence': confidence,
            'bbox': bbox,  # Corner coordinates
        })

    return extracted
```

**Tips for code/terminal OCR:**
- Pre-process images: increase contrast, convert to grayscale
- Use higher DPI/resolution frames
- Filter by confidence threshold (>0.5 for code)
- Detect monospace regions first, then OCR only those regions

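The confidence tip above combines naturally with regrouping OCR boxes into lines of code. The sketch below works on the dict format returned by `extract_text_from_frame`; the helper name `group_into_lines` and the 10-pixel row tolerance are assumptions for illustration, and indentation is not reconstructed here.

```python
def group_into_lines(items, min_confidence=0.5, y_tolerance=10):
    """Drop low-confidence boxes, then join boxes on the same row into lines."""
    kept = [i for i in items if i["confidence"] > min_confidence]
    # bbox[0] is the top-left corner as [x, y]; sort top to bottom first
    kept.sort(key=lambda i: i["bbox"][0][1])

    rows = []
    for item in kept:
        top = item["bbox"][0][1]
        if rows and abs(top - rows[-1]["top"]) <= y_tolerance:
            rows[-1]["items"].append(item)  # same visual row
        else:
            rows.append({"top": top, "items": [item]})

    lines = []
    for row in rows:
        row["items"].sort(key=lambda i: i["bbox"][0][0])  # left to right
        lines.append(" ".join(i["text"] for i in row["items"]))
    return lines


# Illustrative OCR output (made-up boxes and scores)
items = [
    {"text": "return", "confidence": 0.93, "bbox": [[40, 52], [100, 52], [100, 70], [40, 70]]},
    {"text": "def main():", "confidence": 0.97, "bbox": [[10, 20], [120, 20], [120, 38], [10, 38]]},
    {"text": "0", "confidence": 0.88, "bbox": [[110, 50], [120, 50], [120, 68], [110, 68]]},
    {"text": "~~", "confidence": 0.21, "bbox": [[10, 90], [30, 90], [30, 100], [10, 100]]},
]
lines = group_into_lines(items)
```
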
### 6. OpenCV (Frame Extraction)

**What it provides:**
- Video file reading and frame extraction
- Image processing (resize, crop, color conversion)
- Template matching (detect code editors, terminals)
- Histogram analysis (detect slide vs code vs webcam)

**Python API usage:**

```python
import os

import cv2
import numpy as np


def extract_frames_at_timestamps(
    video_path: str,
    timestamps: list[float],
    output_dir: str
) -> list[str]:
    """Extract frames at specific timestamps."""
    os.makedirs(output_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    frame_paths = []

    for ts in timestamps:
        frame_number = int(ts * fps)
        cap.set(cv2.CAP_PROP_POS_FRAMES, frame_number)
        ret, frame = cap.read()
        if ret:  # Skip timestamps that cannot be read (e.g. past the end)
            path = f"{output_dir}/frame_{ts:.2f}.png"
            cv2.imwrite(path, frame)
            frame_paths.append(path)

    cap.release()
    return frame_paths


def classify_frame(image_path: str) -> str:
    """Classify frame as code/slide/diagram/other.

    Uses brightness and edge heuristics:
    - Dark background + high contrast = code/terminal
    - Light background + low variance = slide
    - High edge density = diagram

    Webcam detection (face detection) is not implemented in this sketch.
    """
    img = cv2.imread(image_path)
    if img is None:
        raise ValueError(f"Could not read image: {image_path}")
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    h, w = gray.shape

    # Check brightness distribution
    mean_brightness = np.mean(gray)
    brightness_std = np.std(gray)

    # Dark background with structured content = code/terminal
    if mean_brightness < 80 and brightness_std > 40:
        return 'code'  # or 'terminal'

    # Light background with text blocks = slide
    if mean_brightness > 180 and brightness_std < 60:
        return 'slide'

    # High edge density = diagram
    edges = cv2.Canny(gray, 50, 150)
    edge_density = np.count_nonzero(edges) / (h * w)
    if edge_density > 0.15:
        return 'diagram'

    return 'other'
```

---

## Benchmarks & Performance Data

### Transcript Extraction Speed

| Method | 10 min video | 30 min video | 60 min video | Requires Download |
|--------|-------------|-------------|-------------|-------------------|
| youtube-transcript-api | ~0.5s | ~0.5s | ~0.5s | No |
| yt-dlp subtitles | ~2s | ~2s | ~2s | Subtitle file only |
| faster-whisper (tiny, GPU) | ~10s | ~30s | ~60s | Audio only |
| faster-whisper (base, GPU) | ~15s | ~45s | ~90s | Audio only |
| faster-whisper (large-v3, GPU) | ~80s | ~240s | ~480s | Audio only |
| faster-whisper (base, CPU) | ~60s | ~180s | ~360s | Audio only |

### Visual Extraction Speed

| Operation | Per Frame | Per 10 min video (50 keyframes) |
|-----------|----------|-------------------------------|
| Frame extraction (OpenCV) | ~5ms | ~0.25s |
| Scene detection (PySceneDetect) | N/A | ~15s for full video |
| Frame classification (heuristic) | ~10ms | ~0.5s |
| OCR per frame (easyocr, GPU) | ~200ms | ~10s |
| OCR per frame (easyocr, CPU) | ~1-2s | ~50-100s |

### Total Pipeline Time (estimated)

| Mode | 10 min video | 30 min video | 1 hour video |
|------|-------------|-------------|-------------|
| Transcript only (YouTube captions) | ~2s | ~2s | ~2s |
| Transcript only (Whisper base, GPU) | ~20s | ~50s | ~100s |
| Full (transcript + visual, GPU) | ~35s | ~80s | ~170s |
| Full (transcript + visual, CPU) | ~120s | ~350s | ~700s |

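As a rough cross-check of these estimates, the per-operation figures from the visual extraction table can be combined in a few lines. This is only a back-of-the-envelope sketch: the helper names are hypothetical, and treating ASR and visual extraction as fully parallel is an assumption (download time and other overhead are ignored), so it is not necessarily how the table values were derived.

```python
# Per-frame costs in seconds, taken from the visual extraction table (GPU)
PER_FRAME_SECONDS = {"extract": 0.005, "classify": 0.010, "ocr": 0.200}
SCENE_DETECTION_SECONDS = 15.0  # full pass over a 10 min video
WHISPER_BASE_GPU_SECONDS = 15.0  # 10 min of audio, from the transcript table


def visual_seconds(keyframes):
    """Scene detection plus per-frame work for the given keyframe count."""
    return SCENE_DETECTION_SECONDS + keyframes * sum(PER_FRAME_SECONDS.values())


def pipeline_seconds(keyframes, parallel=True):
    """Combine ASR and visual time, either in parallel or sequentially."""
    visual = visual_seconds(keyframes)
    if parallel:
        return max(WHISPER_BASE_GPU_SECONDS, visual)  # slower stream dominates
    return WHISPER_BASE_GPU_SECONDS + visual


print(round(visual_seconds(50), 2))
print(round(pipeline_seconds(50, parallel=False), 2))
```

With 50 keyframes the visual stream is the slower of the two, which suggests that reducing keyframes or OCR cost helps more than picking a smaller Whisper model.
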
---

## Recommendations

### Primary Stack (Chosen)

| Component | Library | Why |
|-----------|---------|-----|
| Metadata + download | **yt-dlp** | De-facto standard, 1000+ sites, comprehensive Python API |
| YouTube transcripts | **youtube-transcript-api** | Fastest, no download, structured output |
| Speech-to-text | **faster-whisper** | Up to 4x faster than Whisper, MIT, word timestamps |
| Scene detection | **PySceneDetect** | Best algorithm options, OpenCV-based |
| Frame extraction | **opencv-python-headless** | Standard, headless (no GUI deps) |
| OCR | **easyocr** | Best code/terminal accuracy, 80+ languages, GPU support |

### Future Considerations

| Component | Library | When to Add |
|-----------|---------|-------------|
| Speaker diarization | **whisperx** or **pyannote** | V2.0: identify who said what |
| Object detection | **YOLO** | V2.0: detect UI elements, diagrams |
| Multimodal embeddings | **CLIP** | V2.0: embed frames for visual search |
| Slide detection | **python-pptx** + heuristics | V1.5: detect and extract slide content |

### Sources

- [youtube-transcript-api (PyPI)](https://pypi.org/project/youtube-transcript-api/)
- [yt-dlp GitHub](https://github.com/yt-dlp/yt-dlp)
- [yt-dlp Information Extraction Pipeline (DeepWiki)](https://deepwiki.com/yt-dlp/yt-dlp/2.2-information-extraction-pipeline)
- [faster-whisper GitHub](https://github.com/SYSTRAN/faster-whisper)
- [faster-whisper (PyPI)](https://pypi.org/project/faster-whisper/)
- [whisper-timestamped GitHub](https://github.com/linto-ai/whisper-timestamped)
- [stable-ts (PyPI)](https://pypi.org/project/stable-ts/)
- [PySceneDetect GitHub](https://github.com/Breakthrough/PySceneDetect)
- [easyocr (PyPI)](https://pypi.org/project/easyocr/)
- [NVIDIA Multimodal RAG for Video and Audio](https://developer.nvidia.com/blog/an-easy-introduction-to-multimodal-retrieval-augmented-generation-for-video-and-audio/)
- [LlamaIndex MultiModal RAG for Video](https://www.llamaindex.ai/blog/multimodal-rag-for-advanced-video-processing-with-llamaindex-lancedb-33be4804822e)
- [Ragie: How We Built Multimodal RAG](https://www.ragie.ai/blog/how-we-built-multimodal-rag-for-audio-and-video)
- [video-analyzer GitHub](https://github.com/byjlw/video-analyzer)
- [VideoRAG Project](https://video-rag.github.io/)
- [video-keyframe-detector GitHub](https://github.com/joelibaceta/video-keyframe-detector)
- [Filmstrip GitHub](https://github.com/tafsiri/filmstrip)