# Video Source: Library Research & Industry Standards

**Date:** February 27, 2026
**Document:** 01 of 07
**Status:** Complete

---

## Table of Contents

1. [Industry Standards & Approaches](#industry-standards--approaches)
2. [Library Comparison Matrix](#library-comparison-matrix)
3. [Detailed Library Analysis](#detailed-library-analysis)
4. [Architecture Patterns from Industry](#architecture-patterns-from-industry)
5. [Benchmarks & Performance Data](#benchmarks--performance-data)
6. [Recommendations](#recommendations)

---

## Industry Standards & Approaches

### How the Industry Processes Video for AI/RAG

Based on research from NVIDIA, LlamaIndex, Ragie, and open-source projects, the industry has converged on a **3-stream parallel extraction** model:

#### The 3-Stream Model

```
Video Input
    │
    ├──→ Stream 1: ASR (Automatic Speech Recognition)
    │    Extract spoken words with timestamps
    │    Tools: Whisper, YouTube captions API
    │    Output: [{text, start, end, confidence}, ...]
    │
    ├──→ Stream 2: OCR (Optical Character Recognition)
    │    Extract visual text (code, slides, diagrams)
    │    Tools: OpenCV + scene detection + OCR engine
    │    Output: [{text, timestamp, frame_type, bbox}, ...]
    │
    └──→ Stream 3: Metadata
         Extract structural info (chapters, tags, description)
         Tools: yt-dlp, platform APIs
         Output: {title, chapters, tags, description, ...}
```

**Key insight (from NVIDIA's multimodal RAG blog):** Ground everything to text first. Align all streams on a shared timeline, then merge them into unified text segments. This makes the output compatible with any text-based RAG pipeline without requiring multimodal embeddings.
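The align-and-merge step can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline's actual implementation: the `Segment` dataclass, the `merge_streams` function, and the 30-second window are hypothetical names and values chosen for the example. The inputs follow the output shapes shown in the diagram (ASR items carry `start`, OCR items carry `timestamp`).

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """One unified text segment covering [start, end) seconds (hypothetical type)."""
    start: float
    end: float
    speech: list[str] = field(default_factory=list)
    on_screen: list[str] = field(default_factory=list)

    def to_text(self) -> str:
        """Render the segment as plain text for a text-only RAG index."""
        parts = []
        if self.speech:
            parts.append("Speech: " + " ".join(self.speech))
        if self.on_screen:
            parts.append("On screen: " + " | ".join(self.on_screen))
        return "\n".join(parts)


def merge_streams(asr: list[dict], ocr: list[dict], window: float = 30.0) -> list[Segment]:
    """Bucket ASR and OCR items into fixed windows on a shared timeline."""
    buckets: dict[int, Segment] = {}

    def bucket_for(t: float) -> Segment:
        idx = int(t // window)
        if idx not in buckets:
            buckets[idx] = Segment(start=idx * window, end=(idx + 1) * window)
        return buckets[idx]

    # ASR items are placed by their start time, OCR items by their frame timestamp.
    for item in asr:
        bucket_for(item["start"]).speech.append(item["text"])
    for item in ocr:
        bucket_for(item["timestamp"]).on_screen.append(item["text"])
    return [buckets[i] for i in sorted(buckets)]


asr = [
    {"text": "Let's create a component.", "start": 2.0, "end": 4.5},
    {"text": "Now run the dev server.", "start": 31.0, "end": 33.0},
]
ocr = [{"text": "npm run dev", "timestamp": 32.0}]
segments = merge_streams(asr, ocr)
print(segments[1].to_text())
```

Fixed windows are only the simplest choice of shared timeline; chapter boundaries or scene boundaries could serve as bucket edges in the same way.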
#### Reference Implementations

| Project | Approach | Strengths | Weaknesses |
|---------|----------|-----------|------------|
| [video-analyzer](https://github.com/byjlw/video-analyzer) | Whisper + OpenCV + LLM analysis | Full pipeline, LLM summaries | No chapter support, no YouTube integration |
| [LlamaIndex MultiModal RAG](https://www.llamaindex.ai/blog/multimodal-rag-for-advanced-video-processing-with-llamaindex-lancedb-33be4804822e) | Frame extraction + CLIP + LanceDB | Vector search over frames | Heavy (requires GPU), no ASR |
| [VideoRAG](https://video-rag.github.io/) | Graph-based reasoning + multimodal retrieval | Multi-hour video support | Research project, not production-ready |
| [Ragie Multimodal RAG](https://www.ragie.ai/blog/how-we-built-multimodal-rag-for-audio-and-video) | faster-whisper large-v3-turbo + OCR + object detection | Production-grade, 3-stream | Proprietary, not open-source |

#### Industry Best Practices

1. **Audio-only download.** Never download the full video when you only need audio. Extract the audio stream with FFmpeg (`-vn` flag); the result is typically 10-50x smaller than the full video.
2. **Prefer existing captions.** Manual YouTube captions are generally higher quality than ASR output. Fall back to Whisper only when captions are unavailable.
3. **Chapter-based segmentation.** YouTube chapters provide natural content boundaries. Use them as the primary segmentation and fall back to time-window or semantic splitting.
4. **Confidence filtering.** Auto-generated captions and OCR output include confidence scores. Filter out low-confidence content rather than including everything.
5. **Parallel extraction.** Run ASR and OCR in parallel (they are independent). Merge after both complete.
6. **Tiered processing.** Offer a fast/light mode (transcript only) and a deep mode (transcript + visual). Let users choose based on their compute budget.
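Practice 3 (chapter-based segmentation with a time-window fallback) can be sketched as below. The function names `segment_boundaries` and `assign_to_segments` and the 120-second default window are assumptions made for illustration. The chapter dicts use the `title`/`start_time`/`end_time` keys that yt-dlp reports, and transcript entries use the `text`/`start` keys of a caption transcript. Entries whose start time falls outside every segment are dropped in this simplified version.

```python
from typing import Optional


def segment_boundaries(
    duration: float, chapters: Optional[list[dict]], window: float = 120.0
) -> list[dict]:
    """Return segment boundaries: chapters when present, else fixed time windows."""
    if chapters:
        return [
            {"title": c["title"], "start": float(c["start_time"]), "end": float(c["end_time"])}
            for c in chapters
        ]
    bounds = []
    start = 0.0
    while start < duration:
        end = min(start + window, duration)
        bounds.append({"title": f"Part {len(bounds) + 1}", "start": start, "end": end})
        start = end
    return bounds


def assign_to_segments(transcript: list[dict], bounds: list[dict]) -> list[dict]:
    """Attach each transcript entry to the segment that contains its start time."""
    out = [{**b, "text": []} for b in bounds]
    for entry in transcript:
        for seg in out:
            if seg["start"] <= entry["start"] < seg["end"]:
                seg["text"].append(entry["text"])
                break
    return out


chapters = [
    {"title": "Intro", "start_time": 0, "end_time": 45},
    {"title": "Setup", "start_time": 45, "end_time": 180},
]
transcript = [
    {"text": "Welcome", "start": 0.0, "duration": 2.5},
    {"text": "Install Node first", "start": 50.0, "duration": 3.0},
]
by_chapter = assign_to_segments(transcript, segment_boundaries(180.0, chapters))
fallback = segment_boundaries(300.0, None)  # no chapters: three fixed windows
```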
---

## Library Comparison Matrix

### Metadata & Download

| Library | Purpose | Install Size | Actively Maintained | Python API | License |
|---------|---------|-------------|-------------------|------------|---------|
| **yt-dlp** | Metadata + subtitles + download | ~15MB | Yes (weekly releases) | Yes (`YoutubeDL` class) | Unlicense |
| pytube | YouTube download | ~1MB | Inconsistent | Yes | MIT |
| youtube-dl | Download (original) | ~10MB | Stale | Yes | Unlicense |
| pafy | YouTube metadata | ~50KB | Dead (2021) | Yes | LGPL |

**Winner: yt-dlp.** De-facto standard, actively maintained, comprehensive Python API, supports 1000+ sites (not just YouTube).

### Transcript Extraction (YouTube)

| Library | Purpose | Requires Download | Speed | Accuracy | License |
|---------|---------|-------------------|-------|----------|---------|
| **youtube-transcript-api** | YouTube captions | No | Very fast (<1s) | Depends on caption source | MIT |
| yt-dlp subtitles | Download subtitle files | Yes (subtitle only) | Fast (~2s) | Same as above | Unlicense |

**Winner: youtube-transcript-api.** Fastest, no download needed, returns structured data with timestamps directly. For non-YouTube platforms, the pipeline falls back to yt-dlp subtitle download.
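The caption-first fallback order (manual captions, then auto-generated captions, then ASR) can be expressed as a small decision function over a yt-dlp info dict. This is a sketch only: `choose_transcript_source` and its return labels are hypothetical, and it matches language codes exactly, whereas real caption tracks may use regional variants such as `en-US`. The `subtitles` and `automatic_captions` keys are the ones yt-dlp exposes.

```python
from typing import Optional


def choose_transcript_source(
    info: dict, languages: tuple[str, ...] = ("en",)
) -> tuple[str, Optional[str]]:
    """Pick the best available transcript source for a video.

    Returns (source, language), where source is one of
    'manual_captions', 'auto_captions', or 'asr' (hypothetical labels).
    """
    manual = info.get("subtitles") or {}
    auto = info.get("automatic_captions") or {}
    # Manual captions in any preferred language beat auto captions.
    for lang in languages:
        if lang in manual:
            return ("manual_captions", lang)
    for lang in languages:
        if lang in auto:
            return ("auto_captions", lang)
    # Nothing usable: transcribe the audio with an ASR model instead.
    return ("asr", None)


with_manual = {"subtitles": {"en": [{"ext": "vtt"}]}, "automatic_captions": {"en": []}}
auto_only = {"subtitles": {}, "automatic_captions": {"en": [{"ext": "vtt"}]}}
no_captions = {"subtitles": {}, "automatic_captions": {}}

print(choose_transcript_source(with_manual))
print(choose_transcript_source(auto_only))
print(choose_transcript_source(no_captions))
```

Keeping the decision separate from the extraction calls makes it easy to log which source was used for each video.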
### Speech-to-Text (ASR)

| Library | Speed (30 min audio) | Word Timestamps | Model Sizes | GPU Required | Language Support | License |
|---------|---------------------|----------------|-------------|-------------|-----------------|---------|
| **faster-whisper** | ~2-4 min (GPU), ~8-15 min (CPU) | Yes (`word_timestamps=True`) | tiny (39M) → large-v3 (1.5B) | No (but recommended) | 99 languages | MIT |
| openai-whisper | ~5-10 min (GPU), ~20-40 min (CPU) | Yes | Same models | Recommended | 99 languages | MIT |
| whisper-timestamped | Same as openai-whisper | Yes (more accurate) | Same models | Recommended | 99 languages | MIT |
| whisperx | ~2-3 min (GPU) | Yes (best accuracy via wav2vec2) | Same + wav2vec2 | Yes (required) | 99 languages | BSD |
| stable-ts | Same as openai-whisper | Yes (stabilized) | Same models | Recommended | 99 languages | MIT |
| Google Speech-to-Text | Real-time | Yes | Cloud | No | 125+ languages | Proprietary |
| AssemblyAI | Real-time | Yes | Cloud | No | 100+ languages | Proprietary |

**Winner: faster-whisper.** Up to 4x faster than OpenAI Whisper via CTranslate2 optimization, MIT license, word-level timestamps, works without a GPU (just slower), actively maintained. We may consider whisperx as a future upgrade for speaker diarization.
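Word-level timestamps matter because they let the pipeline cut segments at natural pauses instead of at arbitrary time offsets. The sketch below shows one way to do that, assuming each word is a dict with `word`, `start`, and `end` keys; the `group_words` name and the `max_gap`/`max_len` thresholds are hypothetical values for illustration, not tuned settings.

```python
def group_words(words: list[dict], max_gap: float = 0.6, max_len: float = 10.0) -> list[dict]:
    """Group word-level timestamps into phrases.

    A new phrase starts when the pause before a word exceeds max_gap seconds,
    or when adding the word would make the phrase longer than max_len seconds.
    """
    phrases: list[list[dict]] = []
    current: list[dict] = []
    for w in words:
        if current:
            gap = w["start"] - current[-1]["end"]
            length = w["end"] - current[0]["start"]
            if gap > max_gap or length > max_len:
                phrases.append(current)
                current = []
        current.append(w)
    if current:
        phrases.append(current)
    return [
        {
            "start": p[0]["start"],
            "end": p[-1]["end"],
            # Whisper-style word strings often carry a leading space, so strip each one.
            "text": " ".join(w["word"].strip() for w in p),
        }
        for p in phrases
    ]


words = [
    {"word": " Install", "start": 0.0, "end": 0.4},
    {"word": " React", "start": 0.45, "end": 0.8},
    {"word": " Then", "start": 2.0, "end": 2.3},
    {"word": " run", "start": 2.35, "end": 2.5},
    {"word": " it", "start": 2.55, "end": 2.7},
]
phrases = group_words(words)
for p in phrases:
    print(p)
```

The 1.2-second pause between "React" and "Then" exceeds `max_gap`, so the example yields two phrases.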
### Scene Detection & Frame Extraction

| Library | Purpose | Algorithm | Speed | License |
|---------|---------|-----------|-------|---------|
| **PySceneDetect** | Scene boundary detection | ContentDetector, ThresholdDetector, AdaptiveDetector | Fast | BSD |
| opencv-python-headless | Frame extraction, image processing | Manual (absdiff, histogram) | Fast | Apache 2.0 |
| Filmstrip | Keyframe extraction | Scene detection + selection | Medium | MIT |
| video-keyframe-detector | Keyframe extraction | Peak estimation from frame diff | Fast | MIT |
| decord | GPU-accelerated frame extraction | Direct frame access | Very fast | Apache 2.0 |

**Winner: PySceneDetect + opencv-python-headless.** PySceneDetect handles intelligent boundary detection; OpenCV handles frame extraction and image processing. Both are well-maintained and BSD/Apache licensed.

### OCR (Optical Character Recognition)

| Library | Languages | GPU Support | Accuracy on Code | Speed | Install Size | License |
|---------|-----------|------------|-------------------|-------|-------------|---------|
| **easyocr** | 80+ | Yes (PyTorch) | Good | Medium | ~150MB + models | Apache 2.0 |
| pytesseract | 100+ | No | Medium | Fast | ~30MB + Tesseract | Apache 2.0 |
| PaddleOCR | 80+ | Yes (PaddlePaddle) | Very Good | Fast | ~200MB + models | Apache 2.0 |
| TrOCR (HuggingFace) | Multilingual | Yes | Good | Slow | ~500MB | MIT |
| docTR | 10+ | Yes (TF/PyTorch) | Good | Medium | ~100MB | Apache 2.0 |

**Winner: easyocr.** Best balance of accuracy (especially on code/terminal text), GPU support, language coverage, and ease of use. PaddleOCR is a close second but has heavier dependencies (PaddlePaddle framework).

---

## Detailed Library Analysis

### 1. yt-dlp (Metadata & Download Engine)

**What it provides:**
- Video metadata (title, description, duration, upload date, channel, tags, categories)
- Chapter information (title, start_time, end_time for each chapter)
- Subtitle/caption download (all available languages, all formats)
- Thumbnail URLs
- View/like counts
- Playlist information (title, entries, ordering)
- Audio-only extraction (no full video download needed)
- Supports 1000+ video sites (YouTube, Vimeo, Dailymotion, etc.)

**Python API usage:**

```python
from yt_dlp import YoutubeDL

def extract_video_metadata(url: str) -> dict:
    """Extract metadata without downloading."""
    opts = {
        'quiet': True,
        'no_warnings': True,
        'extract_flat': False,  # Full extraction
    }
    with YoutubeDL(opts) as ydl:
        info = ydl.extract_info(url, download=False)
        return info
```

**Key fields in `info_dict`:**

```python
{
    'id': 'dQw4w9WgXcQ',                    # Video ID
    'title': 'Video Title',                 # Full title
    'description': '...',                   # Full description text
    'duration': 1832,                       # Duration in seconds
    'upload_date': '20260115',              # YYYYMMDD format
    'uploader': 'Channel Name',             # Channel/uploader name
    'uploader_id': '@channelname',          # Channel ID
    'uploader_url': 'https://...',          # Channel URL
    'channel_follower_count': 150000,       # Subscriber count
    'view_count': 5000000,                  # View count
    'like_count': 120000,                   # Like count
    'comment_count': 8500,                  # Comment count
    'tags': ['react', 'hooks', ...],        # Video tags
    'categories': ['Education'],            # YouTube categories
    'language': 'en',                       # Primary language
    'subtitles': {                          # Manual captions
        'en': [{'ext': 'vtt', 'url': '...'}],
    },
    'automatic_captions': {                 # Auto-generated captions
        'en': [{'ext': 'vtt', 'url': '...'}],
    },
    'chapters': [                           # Chapter markers
        {'title': 'Intro', 'start_time': 0, 'end_time': 45},
        {'title': 'Setup', 'start_time': 45, 'end_time': 180},
        {'title': 'First Component', 'start_time': 180, 'end_time': 420},
    ],
    'thumbnail': 'https://...',             # Best thumbnail URL
    'thumbnails': [...],                    # All thumbnail variants
    'webpage_url': 'https://...',           # Canonical URL
    'formats': [...],                       # Available formats
    'requested_formats': [...],             # Selected format info
}
```

**Playlist extraction:**

```python
def extract_playlist(url: str) -> list[dict]:
    """Extract all videos from a playlist."""
    opts = {
        'quiet': True,
        'extract_flat': True,  # Don't extract each video yet
    }
    with YoutubeDL(opts) as ydl:
        info = ydl.extract_info(url, download=False)
        # info['entries'] contains all video entries
        return info.get('entries', [])
```

**Audio-only download (for Whisper):**

```python
def download_audio(url: str, output_dir: str) -> str:
    """Download audio stream only (no video)."""
    opts = {
        'format': 'bestaudio/best',
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'wav',
        }],
        # Resample to 16 kHz mono (Whisper's native rate). 'preferredquality'
        # controls encoder bitrate/quality, not the sample rate, so the rate
        # is passed to FFmpeg as output arguments instead.
        'postprocessor_args': {'extractaudio': ['-ar', '16000', '-ac', '1']},
        'outtmpl': f'{output_dir}/%(id)s.%(ext)s',
        'quiet': True,
    }
    with YoutubeDL(opts) as ydl:
        info = ydl.extract_info(url, download=True)
        return f"{output_dir}/{info['id']}.wav"
```

### 2. youtube-transcript-api (Caption Extraction)

**What it provides:**
- Direct access to YouTube captions without downloading
- Manual and auto-generated caption support
- Translation support (translate captions to any language)
- Structured output with timestamps

**Python API usage:**

```python
from youtube_transcript_api import YouTubeTranscriptApi, NoTranscriptFound

def get_youtube_transcript(video_id: str, languages: list[str] | None = None) -> list[dict]:
    """Get transcript with timestamps."""
    languages = languages or ['en']

    # Note: this uses the static pre-1.0 API. In youtube-transcript-api 1.x the
    # equivalent calls are YouTubeTranscriptApi().list(video_id) and
    # transcript.fetch().to_raw_data().
    transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)

    # Prefer manual captions over auto-generated
    try:
        transcript = transcript_list.find_manually_created_transcript(languages)
    except NoTranscriptFound:
        transcript = transcript_list.find_generated_transcript(languages)

    # Fetch the actual transcript data
    data = transcript.fetch()
    return data  # Returns: [{'text': 'Hello', 'start': 0.0, 'duration': 1.5}, ...]
```

**Output format:**

```python
[
    {
        'text': "Welcome to this React tutorial",
        'start': 0.0,       # Start time in seconds
        'duration': 2.5     # Duration in seconds
    },
    {
        'text': "Today we'll learn about hooks",
        'start': 2.5,
        'duration': 3.0
    },
    # ... continues for entire video
]
```

**Key features:**
- Segments are typically 2-5 seconds each
- Manual captions have punctuation and proper casing
- Auto-generated captions may lack punctuation and have lower accuracy
- Can detect available languages and caption types

### 3. faster-whisper (Speech-to-Text)

**What it provides:**
- OpenAI Whisper models with up to 4x speedup via CTranslate2
- Word-level timestamps with confidence scores
- Language detection
- VAD (Voice Activity Detection) filtering
- Multiple model sizes from tiny (39M) to large-v3 (1.5B)

**Python API usage:**

```python
from faster_whisper import WhisperModel

def transcribe_with_whisper(audio_path: str, model_size: str = "base") -> dict:
    """Transcribe audio file with word-level timestamps."""
    model = WhisperModel(
        model_size,
        device="auto",        # auto-detect GPU/CPU
        compute_type="auto",  # auto-select precision
    )

    segments, info = model.transcribe(
        audio_path,
        word_timestamps=True,
        vad_filter=True,      # Filter silence
        vad_parameters={
            "min_silence_duration_ms": 500,
        },
    )

    result = {
        'language': info.language,
        'language_probability': info.language_probability,
        'duration': info.duration,
        'segments': [],
    }

    # `segments` is a generator; transcription runs as it is iterated.
    for segment in segments:
        seg_data = {
            'start': segment.start,
            'end': segment.end,
            'text': segment.text.strip(),
            'avg_logprob': segment.avg_logprob,
            'no_speech_prob': segment.no_speech_prob,
            'words': [],
        }
        if segment.words:
            for word in segment.words:
                seg_data['words'].append({
                    'word': word.word,
                    'start': word.start,
                    'end': word.end,
                    'probability': word.probability,
                })
        result['segments'].append(seg_data)

    return result
```

**Model size guide:**

| Model | Parameters | English WER | Multilingual WER | VRAM (FP16) | Speed (30 min, GPU) |
|-------|-----------|-------------|------------------|-------------|---------------------|
| tiny | 39M | 14.8% | 23.2% | ~1GB | ~30s |
| base | 74M | 11.5% | 18.7% | ~1GB | ~45s |
| small | 244M | 9.5% | 14.6% | ~2GB | ~90s |
| medium | 769M | 8.0% | 12.4% | ~5GB | ~180s |
| large-v3 | 1.5B | 5.7% | 10.1% | ~10GB | ~240s |
| large-v3-turbo | 809M | 6.2% | 10.8% | ~6GB | ~120s |

**Recommendation:** Default to `base` (good balance); offer `large-v3-turbo` for higher accuracy and `tiny` for speed.

### 4. PySceneDetect (Scene Boundary Detection)

**What it provides:**
- Automatic scene/cut detection in video files
- Multiple detection algorithms (content-based, threshold, adaptive)
- Frame-accurate boundaries
- Integration with OpenCV

**Python API usage:**

```python
from scenedetect import detect, ContentDetector

def detect_scene_changes(video_path: str) -> list[tuple[float, float]]:
    """Detect scene boundaries in video.

    Returns list of (start_time, end_time) tuples.
    """
    scene_list = detect(
        video_path,
        ContentDetector(
            threshold=27.0,    # Sensitivity (lower = more scenes)
            min_scene_len=15,  # Minimum 15 frames per scene
        ),
    )

    boundaries = []
    for scene in scene_list:
        # Each scene is a (start, end) pair of FrameTimecode objects.
        start = scene[0].get_seconds()
        end = scene[1].get_seconds()
        boundaries.append((start, end))

    return boundaries
```

**Detection algorithms:**

| Algorithm | Best For | Speed | Sensitivity |
|-----------|----------|-------|-------------|
| ContentDetector | Fast cuts (abrupt content changes between frames) | Fast | Medium |
| AdaptiveDetector | Fast cuts in footage with camera motion (rolling average of content changes) | Medium | High |
| ThresholdDetector | Fades to/from black (average brightness threshold) | Very fast | Low |

### 5. easyocr (Text Recognition)

**What it provides:**
- Text detection and recognition from images
- 80+ language support
- GPU acceleration
- Bounding box coordinates for each text region
- Confidence scores

**Python API usage:**

```python
import easyocr

def extract_text_from_frame(image_path: str, languages: list[str] | None = None) -> list[dict]:
    """Extract text from a video frame image."""
    languages = languages or ['en']
    # Creating a Reader loads the models; reuse one Reader across frames in real use.
    reader = easyocr.Reader(languages, gpu=True)

    results = reader.readtext(image_path)
    # results: [([[x1,y1],[x2,y2],[x3,y3],[x4,y4]], text, confidence), ...]

    extracted = []
    for bbox, text, confidence in results:
        extracted.append({
            'text': text,
            'confidence': confidence,
            'bbox': bbox,  # Corner coordinates
        })

    return extracted
```

**Tips for code/terminal OCR:**
- Pre-process images: increase contrast, convert to grayscale
- Use higher DPI/resolution frames
- Filter by confidence threshold (>0.5 for code)
- Detect monospace regions first, then OCR only those regions

### 6. OpenCV (Frame Extraction)

**What it provides:**
- Video file reading and frame extraction
- Image processing (resize, crop, color conversion)
- Template matching (detect code editors, terminals)
- Histogram analysis (detect slide vs code vs webcam)

**Python API usage:**

```python
import cv2
import numpy as np

def extract_frames_at_timestamps(
    video_path: str,
    timestamps: list[float],
    output_dir: str
) -> list[str]:
    """Extract frames at specific timestamps."""
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        raise IOError(f"Cannot open video: {video_path}")
    fps = cap.get(cv2.CAP_PROP_FPS)
    frame_paths = []

    for ts in timestamps:
        # Seek to the frame nearest the requested timestamp.
        frame_number = int(ts * fps)
        cap.set(cv2.CAP_PROP_POS_FRAMES, frame_number)
        ret, frame = cap.read()
        if ret:
            path = f"{output_dir}/frame_{ts:.2f}.png"
            cv2.imwrite(path, frame)
            frame_paths.append(path)

    cap.release()
    return frame_paths


def classify_frame(image_path: str) -> str:
    """Classify frame as code/slide/terminal/webcam/other.
    Uses heuristics:
    - Dark background + high contrast = code/terminal
    - Light background + low contrast = slide
    - High edge density = diagram
    Face detection (webcam) and monospace-region checks are not
    implemented in this sketch.
    """
    img = cv2.imread(image_path)
    if img is None:
        raise FileNotFoundError(f"Cannot read image: {image_path}")
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    h, w = gray.shape

    # Check brightness distribution
    mean_brightness = np.mean(gray)
    brightness_std = np.std(gray)

    # Dark background with structured content = code/terminal
    if mean_brightness < 80 and brightness_std > 40:
        return 'code'  # or 'terminal'

    # Light background with text blocks = slide
    if mean_brightness > 180 and brightness_std < 60:
        return 'slide'

    # High edge density = diagram
    edges = cv2.Canny(gray, 50, 150)
    edge_density = np.count_nonzero(edges) / (h * w)
    if edge_density > 0.15:
        return 'diagram'

    return 'other'
```

---

## Benchmarks & Performance Data

### Transcript Extraction Speed

| Method | 10 min video | 30 min video | 60 min video | Requires Download |
|--------|-------------|-------------|-------------|-------------------|
| youtube-transcript-api | ~0.5s | ~0.5s | ~0.5s | No |
| yt-dlp subtitles | ~2s | ~2s | ~2s | Subtitle file only |
| faster-whisper (tiny, GPU) | ~10s | ~30s | ~60s | Audio only |
| faster-whisper (base, GPU) | ~15s | ~45s | ~90s | Audio only |
| faster-whisper (large-v3, GPU) | ~80s | ~240s | ~480s | Audio only |
| faster-whisper (base, CPU) | ~60s | ~180s | ~360s | Audio only |

### Visual Extraction Speed

| Operation | Per Frame | Per 10 min video (50 keyframes) |
|-----------|----------|-------------------------------|
| Frame extraction (OpenCV) | ~5ms | ~0.25s |
| Scene detection (PySceneDetect) | N/A | ~15s for full video |
| Frame classification (heuristic) | ~10ms | ~0.5s |
| OCR per frame (easyocr, GPU) | ~200ms | ~10s |
| OCR per frame (easyocr, CPU) | ~1-2s | ~50-100s |

### Total Pipeline Time (estimated)

| Mode | 10 min video | 30 min video | 1 hour video |
|------|-------------|-------------|-------------|
| Transcript only (YouTube captions) | ~2s | ~2s | ~2s |
| Transcript only (Whisper base, GPU) | ~20s | ~50s | ~100s |
| Full (transcript + visual, GPU) | ~35s | ~80s | ~170s |
| Full (transcript + visual, CPU) | ~120s | ~350s | ~700s |

---

## Recommendations

### Primary Stack (Chosen)

| Component | Library | Why |
|-----------|---------|-----|
| Metadata + download | **yt-dlp** | De-facto standard, 1000+ sites, comprehensive Python API |
| YouTube transcripts | **youtube-transcript-api** | Fastest, no download, structured output |
| Speech-to-text | **faster-whisper** | Up to 4x faster than openai-whisper, MIT, word timestamps |
| Scene detection | **PySceneDetect** | Best algorithm options, OpenCV-based |
| Frame extraction | **opencv-python-headless** | Standard, headless (no GUI deps) |
| OCR | **easyocr** | Good code/terminal accuracy with lighter dependencies than PaddleOCR, 80+ languages, GPU support |

### Future Considerations

| Component | Library | When to Add |
|-----------|---------|-------------|
| Speaker diarization | **whisperx** or **pyannote** | V2.0: identify who said what |
| Object detection | **YOLO** | V2.0: detect UI elements, diagrams |
| Multimodal embeddings | **CLIP** | V2.0: embed frames for visual search |
| Slide detection | **python-pptx** + heuristics | V1.5: detect and extract slide content |

### Sources

- [youtube-transcript-api (PyPI)](https://pypi.org/project/youtube-transcript-api/)
- [yt-dlp GitHub](https://github.com/yt-dlp/yt-dlp)
- [yt-dlp Information Extraction Pipeline (DeepWiki)](https://deepwiki.com/yt-dlp/yt-dlp/2.2-information-extraction-pipeline)
- [faster-whisper GitHub](https://github.com/SYSTRAN/faster-whisper)
- [faster-whisper (PyPI)](https://pypi.org/project/faster-whisper/)
- [whisper-timestamped GitHub](https://github.com/linto-ai/whisper-timestamped)
- [stable-ts (PyPI)](https://pypi.org/project/stable-ts/)
- [PySceneDetect GitHub](https://github.com/Breakthrough/PySceneDetect)
- [easyocr (PyPI)](https://pypi.org/project/easyocr/)
- [NVIDIA Multimodal RAG for Video and Audio](https://developer.nvidia.com/blog/an-easy-introduction-to-multimodal-retrieval-augmented-generation-for-video-and-audio/)
- [LlamaIndex MultiModal RAG for Video](https://www.llamaindex.ai/blog/multimodal-rag-for-advanced-video-processing-with-llamaindex-lancedb-33be4804822e)
- [Ragie: How We Built Multimodal RAG](https://www.ragie.ai/blog/how-we-built-multimodal-rag-for-audio-and-video)
- [video-analyzer GitHub](https://github.com/byjlw/video-analyzer)
- [VideoRAG Project](https://video-rag.github.io/)
- [video-keyframe-detector GitHub](https://github.com/joelibaceta/video-keyframe-detector)
- [Filmstrip GitHub](https://github.com/tafsiri/filmstrip)