Video Source — Dependencies & System Requirements
Date: February 27, 2026
Document: 07 of 07
Status: IMPLEMENTED as skill-seekers video --setup (see video_setup.py, 835 lines, 60 tests)
- GPU auto-detection: NVIDIA (nvidia-smi/CUDA), AMD (rocminfo/ROCm), CPU fallback
- Correct PyTorch index URL selection per GPU vendor
- EasyOCR removed from pip extras, installed at runtime via --setup
- ROCm configuration (MIOPEN_FIND_MODE, HSA_OVERRIDE_GFX_VERSION)
- Virtual environment detection with --force override
- System dependency checks (tesseract, ffmpeg)
- Non-interactive mode for MCP/CI usage
Table of Contents
- Dependency Tiers
- pyproject.toml Changes
- System Requirements
- Import Guards
- Dependency Check Command
- Model Management
- Docker Considerations
Dependency Tiers
Video processing has two tiers to keep the base install lightweight:
Tier 1: [video] — Lightweight (YouTube transcripts + metadata)
Use case: YouTube videos with existing captions. No download, no GPU needed.
| Package | Version | Size | Purpose |
|---|---|---|---|
| yt-dlp | >=2024.12.0 | ~15MB | Metadata extraction, audio download |
| youtube-transcript-api | >=1.2.0 | ~50KB | YouTube caption extraction |
Capabilities:
- YouTube metadata (title, chapters, tags, description, engagement)
- YouTube captions (manual and auto-generated)
- Vimeo metadata
- Playlist and channel resolution
- Subtitle file parsing (SRT, VTT)
- Segmentation and alignment
- Full output generation
NOT included:
- Speech-to-text (Whisper)
- Visual extraction (frame + OCR)
- Local video file transcription (without subtitles)
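The Tier 1 subtitle-parsing capability can be illustrated with a minimal SRT parser. This is a sketch of the general technique, not the project's actual parser; parse_srt_timestamp and parse_srt are hypothetical names.

```python
import re

def parse_srt_timestamp(ts: str) -> float:
    """Convert an SRT timestamp like '00:01:23,450' to seconds (hypothetical helper)."""
    m = re.fullmatch(r"(\d{2}):(\d{2}):(\d{2})[,.](\d{3})", ts)
    if m is None:
        raise ValueError(f"Bad SRT timestamp: {ts}")
    h, mi, s, ms = (int(g) for g in m.groups())
    return h * 3600 + mi * 60 + s + ms / 1000.0

def parse_srt(text: str) -> list[dict]:
    """Parse SRT blocks into {'start', 'end', 'text'} dicts."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            continue  # need index, timing line, and at least one text line
        start, _, end = lines[1].partition(" --> ")
        cues.append({
            "start": parse_srt_timestamp(start.strip()),
            "end": parse_srt_timestamp(end.strip()),
            "text": " ".join(lines[2:]),
        })
    return cues
```

VTT parsing follows the same shape, with a `WEBVTT` header and `.` instead of `,` before the milliseconds (which the regex above already tolerates).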
Tier 2: [video-full] — Full (adds Whisper + visual extraction)
Use case: Local videos without subtitles, or when you want code/slide extraction from screen recordings.
| Package | Version | Size | Purpose |
|---|---|---|---|
| yt-dlp | >=2024.12.0 | ~15MB | Metadata + audio download |
| youtube-transcript-api | >=1.2.0 | ~50KB | YouTube captions |
| faster-whisper | >=1.0.0 | ~5MB (+ models: 75MB-3GB) | Speech-to-text |
| scenedetect[opencv] | >=0.6.4 | ~50MB (includes OpenCV) | Scene boundary detection |
| easyocr | >=1.7.0 | ~150MB (+ models: ~200MB) | Text recognition from frames |
| opencv-python-headless | >=4.9.0 | ~50MB | Frame extraction, image processing |
Additional capabilities over Tier 1:
- Whisper speech-to-text (99 languages, word-level timestamps)
- Scene detection (find visual transitions)
- Keyframe extraction (save important frames)
- Frame classification (code/slide/terminal/diagram)
- OCR on frames (extract code and text from screen)
- Code block detection from video
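Frame classification can work from the OCR'd text alone. The heuristic below is a hypothetical sketch (the marker lists and thresholds are invented for illustration), not the classifier the project ships:

```python
def classify_frame_text(ocr_text: str) -> str:
    """Rough frame classification from OCR'd text.

    Hypothetical heuristic: count code-like and terminal-like markers,
    then fall back to word count to separate slides from diagrams.
    """
    code_markers = ("def ", "import ", "class ", "{", "};", "=>", "#include")
    terminal_markers = ("$ ", "traceback", "error:", "warning:")
    lowered = ocr_text.lower()
    code_score = sum(marker in ocr_text for marker in code_markers)
    term_score = sum(marker in lowered for marker in terminal_markers)
    if term_score >= 2:
        return "terminal"
    if code_score >= 2:
        return "code"
    if len(ocr_text.split()) > 5:
        return "slide"   # text-heavy frame with no code markers
    return "diagram"     # sparse text suggests a figure
```

A production classifier would also use visual features (layout, color histograms, monospace detection), but text markers alone already separate the common cases.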
Total install size:
- Tier 1: ~15MB
- Tier 2: ~270MB + models (~300MB-3.2GB depending on Whisper model)
pyproject.toml Changes
[project.optional-dependencies]
# Existing dependencies...
gemini = ["google-generativeai>=0.8.0"]
openai = ["openai>=1.0.0"]
all-llms = ["google-generativeai>=0.8.0", "openai>=1.0.0"]

# NEW: Video processing
video = [
    "yt-dlp>=2024.12.0",
    "youtube-transcript-api>=1.2.0",
]
video-full = [
    "yt-dlp>=2024.12.0",
    "youtube-transcript-api>=1.2.0",
    "faster-whisper>=1.0.0",
    "scenedetect[opencv]>=0.6.4",
    "easyocr>=1.7.0",
    "opencv-python-headless>=4.9.0",
]

# Update 'all' to include video
all = [
    # ... existing all dependencies ...
    "yt-dlp>=2024.12.0",
    "youtube-transcript-api>=1.2.0",
    "faster-whisper>=1.0.0",
    "scenedetect[opencv]>=0.6.4",
    "easyocr>=1.7.0",
    "opencv-python-headless>=4.9.0",
]
[project.scripts]
# ... existing entry points ...
skill-seekers-video = "skill_seekers.cli.video_scraper:main" # NEW
Installation Commands
# Lightweight video (YouTube transcripts + metadata)
pip install skill-seekers[video]
# Full video (+ Whisper + visual extraction)
pip install skill-seekers[video-full]
# Everything
pip install skill-seekers[all]
# Development (editable)
pip install -e ".[video]"
pip install -e ".[video-full]"
System Requirements
Tier 1 (Lightweight)
| Requirement | Needed For | How to Check |
|---|---|---|
| Python 3.10+ | All | python --version |
| Internet connection | YouTube API calls | N/A |
No additional system dependencies. Pure Python.
Tier 2 (Full)
| Requirement | Needed For | How to Check | Install |
|---|---|---|---|
| Python 3.10+ | All | python --version | — |
| FFmpeg | Audio extraction, video processing | ffmpeg -version | See below |
| GPU (optional) | Whisper + easyocr acceleration | nvidia-smi (NVIDIA), rocminfo (AMD) | CUDA toolkit / ROCm |
FFmpeg Installation
FFmpeg is required for:
- Extracting audio from video files (Whisper input)
- Downloading audio-only streams (yt-dlp post-processing)
- Converting between audio formats
# macOS
brew install ffmpeg
# Ubuntu/Debian
sudo apt install ffmpeg
# Windows (winget)
winget install ffmpeg
# Windows (choco)
choco install ffmpeg
# Verify
ffmpeg -version
GPU Support (Optional)
GPU accelerates Whisper (~4x) and easyocr (~5x) but is not required.
NVIDIA GPU (CUDA):
# Check CUDA availability
python -c "import torch; print(torch.cuda.is_available())"
# faster-whisper uses CTranslate2 which auto-detects CUDA
# easyocr uses PyTorch which auto-detects CUDA
# No additional setup needed if PyTorch CUDA is working
Apple Silicon (MPS):
# faster-whisper does not support MPS directly
# Falls back to CPU on Apple Silicon
# easyocr has partial MPS support
CPU-only (no GPU):
# Everything works on CPU, just slower
# Whisper base model: ~4x slower on CPU vs GPU
# easyocr: ~5x slower on CPU vs GPU
# For short videos (<10 min), CPU is fine
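The GPU auto-detection described in the status notes (nvidia-smi for NVIDIA, rocminfo for AMD, CPU fallback) can be sketched with stdlib PATH probing. This is a simplified illustration, not the actual video_setup.py code, and the exact PyTorch index URLs vary by release; the ones below are examples only:

```python
import shutil

def detect_gpu_vendor() -> str:
    """Pick a PyTorch install target by probing vendor CLIs on PATH.

    Simplified sketch of the detection video_setup.py performs.
    """
    if shutil.which("nvidia-smi"):
        return "cuda"  # NVIDIA driver tooling found
    if shutil.which("rocminfo"):
        return "rocm"  # AMD ROCm tooling found
    return "cpu"       # no GPU tooling: install CPU-only wheels

# Example index URLs (version suffixes change between PyTorch releases)
index_urls = {
    "cuda": "https://download.pytorch.org/whl/cu121",
    "rocm": "https://download.pytorch.org/whl/rocm6.0",
    "cpu": "https://download.pytorch.org/whl/cpu",
}
```

Passing the selected URL to `pip install torch --index-url <url>` is what avoids pulling the ~2GB default CUDA wheels on non-NVIDIA systems.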
Import Guards
All video dependencies use try/except import guards to provide clear error messages:
video_scraper.py
"""Video scraper - main orchestrator."""
# Core dependencies (always available)
import json
import logging
import os
from pathlib import Path
# Tier 1: Video basics
try:
from yt_dlp import YoutubeDL
HAS_YTDLP = True
except ImportError:
HAS_YTDLP = False
try:
from youtube_transcript_api import YouTubeTranscriptApi
HAS_YT_TRANSCRIPT = True
except ImportError:
HAS_YT_TRANSCRIPT = False
# Feature availability check
def check_video_dependencies(require_full: bool = False) -> None:
"""Check that video dependencies are installed.
Args:
require_full: If True, check for full dependencies (Whisper, OCR)
Raises:
ImportError: With installation instructions
"""
missing = []
if not HAS_YTDLP:
missing.append("yt-dlp")
if not HAS_YT_TRANSCRIPT:
missing.append("youtube-transcript-api")
if missing:
raise ImportError(
f"Video processing requires: {', '.join(missing)}\n"
f"Install with: pip install skill-seekers[video]"
)
if require_full:
full_missing = []
try:
import faster_whisper
except ImportError:
full_missing.append("faster-whisper")
try:
import cv2
except ImportError:
full_missing.append("opencv-python-headless")
try:
import scenedetect
except ImportError:
full_missing.append("scenedetect[opencv]")
try:
import easyocr
except ImportError:
full_missing.append("easyocr")
if full_missing:
raise ImportError(
f"Visual extraction requires: {', '.join(full_missing)}\n"
f"Install with: pip install skill-seekers[video-full]"
)
video_transcript.py
"""Transcript extraction module."""
# YouTube transcript (Tier 1)
try:
from youtube_transcript_api import YouTubeTranscriptApi
HAS_YT_TRANSCRIPT = True
except ImportError:
HAS_YT_TRANSCRIPT = False
# Whisper (Tier 2)
try:
from faster_whisper import WhisperModel
HAS_WHISPER = True
except ImportError:
HAS_WHISPER = False
def get_transcript(video_info, config):
"""Get transcript using best available method."""
# Try YouTube captions first (Tier 1)
if HAS_YT_TRANSCRIPT and video_info.source_type == VideoSourceType.YOUTUBE:
try:
return extract_youtube_transcript(video_info.video_id, config.languages)
except TranscriptNotAvailable:
pass
# Try Whisper fallback (Tier 2)
if HAS_WHISPER:
return transcribe_with_whisper(video_info, config)
# No transcript possible
if not HAS_WHISPER:
logger.warning(
f"No transcript for {video_info.video_id}. "
"Install faster-whisper for speech-to-text: "
"pip install skill-seekers[video-full]"
)
return [], TranscriptSource.NONE
video_visual.py
"""Visual extraction module."""
try:
    import cv2
    HAS_OPENCV = True
except ImportError:
    HAS_OPENCV = False

try:
    from scenedetect import detect, ContentDetector
    HAS_SCENEDETECT = True
except ImportError:
    HAS_SCENEDETECT = False

try:
    import easyocr
    HAS_EASYOCR = True
except ImportError:
    HAS_EASYOCR = False

def check_visual_dependencies() -> None:
    """Check visual extraction dependencies."""
    missing = []
    if not HAS_OPENCV:
        missing.append("opencv-python-headless")
    if not HAS_SCENEDETECT:
        missing.append("scenedetect[opencv]")
    if not HAS_EASYOCR:
        missing.append("easyocr")
    if missing:
        raise ImportError(
            f"Visual extraction requires: {', '.join(missing)}\n"
            f"Install with: pip install skill-seekers[video-full]"
        )

def check_ffmpeg() -> bool:
    """Check if FFmpeg is available."""
    import shutil
    return shutil.which('ffmpeg') is not None
Dependency Check Command
Add a dependency check to the config command:
# Check all video dependencies
skill-seekers config --check-video
# Output:
# Video Dependencies:
# yt-dlp ✅ 2025.01.15
# youtube-transcript-api ✅ 1.2.3
# faster-whisper ❌ Not installed (pip install skill-seekers[video-full])
# opencv-python-headless ❌ Not installed
# scenedetect ❌ Not installed
# easyocr ❌ Not installed
#
# System Dependencies:
# FFmpeg ✅ 6.1.1
# GPU (CUDA) ❌ Not available (CPU mode will be used)
#
# Available modes:
# Transcript only ✅ YouTube captions available
# Whisper fallback ❌ Install faster-whisper
# Visual extraction ❌ Install video-full dependencies
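One way such a check could gather versions is importlib.metadata from the standard library. This is a sketch of the idea, not the shipped implementation of config --check-video:

```python
from importlib import metadata

def report_video_dependencies() -> dict:
    """Map each video dependency to its installed version, or None if missing.

    Sketch only; package names come from the tier tables above.
    """
    packages = [
        "yt-dlp", "youtube-transcript-api", "faster-whisper",
        "opencv-python-headless", "scenedetect", "easyocr",
    ]
    report = {}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None  # not installed
    return report
```

Checking distribution metadata rather than importing each package keeps the check fast and avoids triggering heavy imports like torch at startup.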
Model Management
Whisper Models
Whisper models are downloaded on first use and cached in the user's home directory.
| Model | Download Size | Disk Size | First-Use Download Time |
|---|---|---|---|
| tiny | 75 MB | 75 MB | ~15s |
| base | 142 MB | 142 MB | ~25s |
| small | 466 MB | 466 MB | ~60s |
| medium | 1.5 GB | 1.5 GB | ~3 min |
| large-v3 | 3.1 GB | 3.1 GB | ~5 min |
| large-v3-turbo | 1.6 GB | 1.6 GB | ~3 min |
Cache location: ~/.cache/huggingface/hub/ (CTranslate2 models)
Pre-download command:
# Pre-download a model before using it
python -c "from faster_whisper import WhisperModel; WhisperModel('base')"
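A setup script can test for a cached model before triggering a download. The sketch below assumes the Hugging Face hub cache layout (models--&lt;org&gt;--&lt;name&gt; directories) and the Systran/faster-whisper-&lt;size&gt; repo naming used by recent faster-whisper releases; both are assumptions that may change between releases:

```python
from pathlib import Path

def whisper_model_cached(size, cache_root=None):
    """Return True if a faster-whisper model appears to be in the HF hub cache.

    Assumes the Systran/faster-whisper-<size> repo naming; cache_root
    defaults to the standard Hugging Face hub cache location.
    """
    root = cache_root or Path.home() / ".cache" / "huggingface" / "hub"
    return (root / f"models--Systran--faster-whisper-{size}").is_dir()
```

This lets a --setup flow print "already cached" instead of re-downloading, without importing faster_whisper at all.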
easyocr Models
easyocr models are also downloaded on first use.
| Language Pack | Download Size | Disk Size |
|---|---|---|
| English | ~100 MB | ~100 MB |
| + Additional language | ~50-100 MB each | ~50-100 MB each |
Cache location: ~/.EasyOCR/model/
Pre-download command:
# Pre-download English OCR model
python -c "import easyocr; easyocr.Reader(['en'])"
Docker Considerations
Dockerfile additions for video support
# Tier 1 (lightweight)
RUN pip install skill-seekers[video]
# Tier 2 (full)
RUN apt-get update && apt-get install -y ffmpeg
RUN pip install skill-seekers[video-full]
# Pre-download Whisper model (avoids first-run download)
RUN python -c "from faster_whisper import WhisperModel; WhisperModel('base')"
# Pre-download easyocr model
RUN python -c "import easyocr; easyocr.Reader(['en'])"
Docker image sizes
| Tier | Base Image Size | Additional Size | Total |
|---|---|---|---|
| Tier 1 (video) | ~300 MB | ~20 MB | ~320 MB |
| Tier 2 (video-full, CPU) | ~300 MB | ~800 MB | ~1.1 GB |
| Tier 2 (video-full, GPU) | ~5 GB (CUDA base) | ~800 MB | ~5.8 GB |
Kubernetes resource recommendations
# Tier 1 (transcript only)
resources:
  requests:
    memory: "256Mi"
    cpu: "500m"
  limits:
    memory: "512Mi"
    cpu: "1000m"

# Tier 2 (full, CPU)
resources:
  requests:
    memory: "2Gi"
    cpu: "2000m"
  limits:
    memory: "4Gi"
    cpu: "4000m"

# Tier 2 (full, GPU)
resources:
  requests:
    memory: "4Gi"
    cpu: "2000m"
    nvidia.com/gpu: 1
  limits:
    memory: "8Gi"
    cpu: "4000m"
    nvidia.com/gpu: 1