Files
skill-seekers-reference/docs/plans/video/07_VIDEO_DEPENDENCIES.md
yusyus cc9cc32417 feat: add skill-seekers video --setup for GPU auto-detection and dependency installation
Auto-detects NVIDIA (CUDA), AMD (ROCm), or CPU-only GPU and installs the
correct PyTorch variant + easyocr + all visual extraction dependencies.
Removes easyocr from video-full pip extras to avoid pulling ~2GB of wrong
CUDA packages on non-NVIDIA systems.

New files:
- video_setup.py (835 lines): GPU detection, PyTorch install, ROCm config,
  venv checks, system dep validation, module selection, verification
- test_video_setup.py (60 tests): Full coverage of detection, install, verify

Updated docs: CHANGELOG, AGENTS.md, CLAUDE.md, README.md, CLI_REFERENCE,
FAQ, TROUBLESHOOTING, installation guide, video dependency plan

All 2523 tests passing (15 skipped).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 18:39:16 +03:00

13 KiB

Video Source — Dependencies & System Requirements

Date: February 27, 2026 Document: 07 of 07 Status: Planning

Status: IMPLEMENTEDskill-seekers video --setup (see video_setup.py, 835 lines, 60 tests)

  • GPU auto-detection: NVIDIA (nvidia-smi/CUDA), AMD (rocminfo/ROCm), CPU fallback
  • Correct PyTorch index URL selection per GPU vendor
  • EasyOCR removed from pip extras, installed at runtime via --setup
  • ROCm configuration (MIOPEN_FIND_MODE, HSA_OVERRIDE_GFX_VERSION)
  • Virtual environment detection with --force override
  • System dependency checks (tesseract, ffmpeg)
  • Non-interactive mode for MCP/CI usage

Table of Contents

  1. Dependency Tiers
  2. pyproject.toml Changes
  3. System Requirements
  4. Import Guards
  5. Dependency Check Command
  6. Model Management
  7. Docker Considerations

Dependency Tiers

Video processing has two tiers to keep the base install lightweight:

Tier 1: [video] — Lightweight (YouTube transcripts + metadata)

Use case: YouTube videos with existing captions. No download, no GPU needed.

Package Version Size Purpose
yt-dlp >=2024.12.0 ~15MB Metadata extraction, audio download
youtube-transcript-api >=1.2.0 ~50KB YouTube caption extraction

Capabilities:

  • YouTube metadata (title, chapters, tags, description, engagement)
  • YouTube captions (manual and auto-generated)
  • Vimeo metadata
  • Playlist and channel resolution
  • Subtitle file parsing (SRT, VTT)
  • Segmentation and alignment
  • Full output generation

NOT included:

  • Speech-to-text (Whisper)
  • Visual extraction (frame + OCR)
  • Local video file transcription (without subtitles)

Tier 2: [video-full] — Full (adds Whisper + visual extraction)

Use case: Local videos without subtitles, or when you want code/slide extraction from screen.

Package Version Size Purpose
yt-dlp >=2024.12.0 ~15MB Metadata + audio download
youtube-transcript-api >=1.2.0 ~50KB YouTube captions
faster-whisper >=1.0.0 ~5MB (+ models: 75MB-3GB) Speech-to-text
scenedetect[opencv] >=0.6.4 ~50MB (includes OpenCV) Scene boundary detection
easyocr >=1.7.0 ~150MB (+ models: ~200MB) Text recognition from frames
opencv-python-headless >=4.9.0 ~50MB Frame extraction, image processing

Additional capabilities over Tier 1:

  • Whisper speech-to-text (99 languages, word-level timestamps)
  • Scene detection (find visual transitions)
  • Keyframe extraction (save important frames)
  • Frame classification (code/slide/terminal/diagram)
  • OCR on frames (extract code and text from screen)
  • Code block detection from video

Total install size:

  • Tier 1: ~15MB
  • Tier 2: ~270MB + models (~300MB-3.2GB depending on Whisper model)

pyproject.toml Changes

[project.optional-dependencies]
# Existing dependencies...
gemini = ["google-generativeai>=0.8.0"]
openai = ["openai>=1.0.0"]
all-llms = ["google-generativeai>=0.8.0", "openai>=1.0.0"]

# NEW: Video processing
video = [
    "yt-dlp>=2024.12.0",
    "youtube-transcript-api>=1.2.0",
]
video-full = [
    "yt-dlp>=2024.12.0",
    "youtube-transcript-api>=1.2.0",
    "faster-whisper>=1.0.0",
    "scenedetect[opencv]>=0.6.4",
    "easyocr>=1.7.0",
    "opencv-python-headless>=4.9.0",
]

# Update 'all' to include video
all = [
    # ... existing all dependencies ...
    "yt-dlp>=2024.12.0",
    "youtube-transcript-api>=1.2.0",
    "faster-whisper>=1.0.0",
    "scenedetect[opencv]>=0.6.4",
    "easyocr>=1.7.0",
    "opencv-python-headless>=4.9.0",
]

[project.scripts]
# ... existing entry points ...
skill-seekers-video = "skill_seekers.cli.video_scraper:main"      # NEW

Installation Commands

# Lightweight video (YouTube transcripts + metadata)
pip install skill-seekers[video]

# Full video (+ Whisper + visual extraction)
pip install skill-seekers[video-full]

# Everything
pip install skill-seekers[all]

# Development (editable)
pip install -e ".[video]"
pip install -e ".[video-full]"

System Requirements

Tier 1 (Lightweight)

Requirement Needed For How to Check
Python 3.10+ All python --version
Internet connection YouTube API calls N/A

No additional system dependencies. Pure Python.

Tier 2 (Full)

Requirement Needed For How to Check Install
Python 3.10+ All python --version
FFmpeg Audio extraction, video processing ffmpeg -version See below
GPU (optional) Whisper + easyocr acceleration nvidia-smi (NVIDIA) CUDA toolkit

FFmpeg Installation

FFmpeg is required for:

  • Extracting audio from video files (Whisper input)
  • Downloading audio-only streams (yt-dlp post-processing)
  • Converting between audio formats
# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt install ffmpeg

# Windows (winget)
winget install ffmpeg

# Windows (choco)
choco install ffmpeg

# Verify
ffmpeg -version

GPU Support (Optional)

GPU accelerates Whisper (~4x) and easyocr (~5x) but is not required.

NVIDIA GPU (CUDA):

# Check CUDA availability
python -c "import torch; print(torch.cuda.is_available())"

# faster-whisper uses CTranslate2 which auto-detects CUDA
# easyocr uses PyTorch which auto-detects CUDA
# No additional setup needed if PyTorch CUDA is working

Apple Silicon (MPS):

# faster-whisper does not support MPS directly
# Falls back to CPU on Apple Silicon
# easyocr has partial MPS support

CPU-only (no GPU):

# Everything works on CPU, just slower
# Whisper base model: ~4x slower on CPU vs GPU
# easyocr: ~5x slower on CPU vs GPU
# For short videos (<10 min), CPU is fine

Import Guards

All video dependencies use try/except import guards to provide clear error messages:

video_scraper.py

"""Video scraper - main orchestrator."""

# Core dependencies (always available)
import json
import logging
import os
from pathlib import Path

# Tier 1: Video basics
try:
    from yt_dlp import YoutubeDL
    HAS_YTDLP = True
except ImportError:
    HAS_YTDLP = False

try:
    from youtube_transcript_api import YouTubeTranscriptApi
    HAS_YT_TRANSCRIPT = True
except ImportError:
    HAS_YT_TRANSCRIPT = False

# Feature availability check
def check_video_dependencies(require_full: bool = False) -> None:
    """Check that video dependencies are installed.

    Args:
        require_full: If True, check for full dependencies (Whisper, OCR)

    Raises:
        ImportError: With installation instructions
    """
    missing = []

    if not HAS_YTDLP:
        missing.append("yt-dlp")
    if not HAS_YT_TRANSCRIPT:
        missing.append("youtube-transcript-api")

    if missing:
        raise ImportError(
            f"Video processing requires: {', '.join(missing)}\n"
            f"Install with: pip install skill-seekers[video]"
        )

    if require_full:
        full_missing = []
        try:
            import faster_whisper
        except ImportError:
            full_missing.append("faster-whisper")
        try:
            import cv2
        except ImportError:
            full_missing.append("opencv-python-headless")
        try:
            import scenedetect
        except ImportError:
            full_missing.append("scenedetect[opencv]")
        try:
            import easyocr
        except ImportError:
            full_missing.append("easyocr")

        if full_missing:
            raise ImportError(
                f"Visual extraction requires: {', '.join(full_missing)}\n"
                f"Install with: pip install skill-seekers[video-full]"
            )

video_transcript.py

"""Transcript extraction module."""

# YouTube transcript (Tier 1)
try:
    from youtube_transcript_api import YouTubeTranscriptApi
    HAS_YT_TRANSCRIPT = True
except ImportError:
    HAS_YT_TRANSCRIPT = False

# Whisper (Tier 2)
try:
    from faster_whisper import WhisperModel
    HAS_WHISPER = True
except ImportError:
    HAS_WHISPER = False


def get_transcript(video_info, config):
    """Get transcript using best available method."""

    # Try YouTube captions first (Tier 1)
    if HAS_YT_TRANSCRIPT and video_info.source_type == VideoSourceType.YOUTUBE:
        try:
            return extract_youtube_transcript(video_info.video_id, config.languages)
        except TranscriptNotAvailable:
            pass

    # Try Whisper fallback (Tier 2)
    if HAS_WHISPER:
        return transcribe_with_whisper(video_info, config)

    # No transcript possible
    if not HAS_WHISPER:
        logger.warning(
            f"No transcript for {video_info.video_id}. "
            "Install faster-whisper for speech-to-text: "
            "pip install skill-seekers[video-full]"
        )
    return [], TranscriptSource.NONE

video_visual.py

"""Visual extraction module."""

try:
    import cv2
    HAS_OPENCV = True
except ImportError:
    HAS_OPENCV = False

try:
    from scenedetect import detect, ContentDetector
    HAS_SCENEDETECT = True
except ImportError:
    HAS_SCENEDETECT = False

try:
    import easyocr
    HAS_EASYOCR = True
except ImportError:
    HAS_EASYOCR = False


def check_visual_dependencies() -> None:
    """Check visual extraction dependencies."""
    missing = []
    if not HAS_OPENCV:
        missing.append("opencv-python-headless")
    if not HAS_SCENEDETECT:
        missing.append("scenedetect[opencv]")
    if not HAS_EASYOCR:
        missing.append("easyocr")

    if missing:
        raise ImportError(
            f"Visual extraction requires: {', '.join(missing)}\n"
            f"Install with: pip install skill-seekers[video-full]"
        )


def check_ffmpeg() -> bool:
    """Check if FFmpeg is available."""
    import shutil
    return shutil.which('ffmpeg') is not None

Dependency Check Command

Add a dependency check to the config command:

# Check all video dependencies
skill-seekers config --check-video

# Output:
# Video Dependencies:
#   yt-dlp              ✅ 2025.01.15
#   youtube-transcript-api ✅ 1.2.3
#   faster-whisper      ❌ Not installed (pip install skill-seekers[video-full])
#   opencv-python-headless ❌ Not installed
#   scenedetect         ❌ Not installed
#   easyocr             ❌ Not installed
#
# System Dependencies:
#   FFmpeg              ✅ 6.1.1
#   GPU (CUDA)          ❌ Not available (CPU mode will be used)
#
# Available modes:
#   Transcript only     ✅ YouTube captions available
#   Whisper fallback    ❌ Install faster-whisper
#   Visual extraction   ❌ Install video-full dependencies

Model Management

Whisper Models

Whisper models are downloaded on first use and cached in the user's home directory.

Model Download Size Disk Size First-Use Download Time
tiny 75 MB 75 MB ~15s
base 142 MB 142 MB ~25s
small 466 MB 466 MB ~60s
medium 1.5 GB 1.5 GB ~3 min
large-v3 3.1 GB 3.1 GB ~5 min
large-v3-turbo 1.6 GB 1.6 GB ~3 min

Cache location: ~/.cache/huggingface/hub/ (CTranslate2 models)

Pre-download command:

# Pre-download a model before using it
python -c "from faster_whisper import WhisperModel; WhisperModel('base')"

easyocr Models

easyocr models are also downloaded on first use.

Language Pack Download Size Disk Size
English ~100 MB ~100 MB
+ Additional language ~50-100 MB each ~50-100 MB each

Cache location: ~/.EasyOCR/model/

Pre-download command:

# Pre-download English OCR model
python -c "import easyocr; easyocr.Reader(['en'])"

Docker Considerations

Dockerfile additions for video support

# Tier 1 (lightweight)
RUN pip install skill-seekers[video]

# Tier 2 (full)
RUN apt-get update && apt-get install -y ffmpeg
RUN pip install skill-seekers[video-full]

# Pre-download Whisper model (avoids first-run download)
RUN python -c "from faster_whisper import WhisperModel; WhisperModel('base')"

# Pre-download easyocr model
RUN python -c "import easyocr; easyocr.Reader(['en'])"

Docker image sizes

Tier Base Image Size Additional Size Total
Tier 1 (video) ~300 MB ~20 MB ~320 MB
Tier 2 (video-full, CPU) ~300 MB ~800 MB ~1.1 GB
Tier 2 (video-full, GPU) ~5 GB (CUDA base) ~800 MB ~5.8 GB

Kubernetes resource recommendations

# Tier 1 (transcript only)
resources:
  requests:
    memory: "256Mi"
    cpu: "500m"
  limits:
    memory: "512Mi"
    cpu: "1000m"

# Tier 2 (full, CPU)
resources:
  requests:
    memory: "2Gi"
    cpu: "2000m"
  limits:
    memory: "4Gi"
    cpu: "4000m"

# Tier 2 (full, GPU)
resources:
  requests:
    memory: "4Gi"
    cpu: "2000m"
    nvidia.com/gpu: 1
  limits:
    memory: "8Gi"
    cpu: "4000m"
    nvidia.com/gpu: 1