firefrost-gaming/skill-seekers-reference

Files

yusyus cc9cc32417 feat: add skill-seekers video --setup for GPU auto-detection and dependency installation

Auto-detects NVIDIA (CUDA), AMD (ROCm), or CPU-only GPU and installs the
correct PyTorch variant + easyocr + all visual extraction dependencies.
Removes easyocr from video-full pip extras to avoid pulling ~2GB of wrong
CUDA packages on non-NVIDIA systems.

New files:
- video_setup.py (835 lines): GPU detection, PyTorch install, ROCm config,
  venv checks, system dep validation, module selection, verification
- test_video_setup.py (60 tests): Full coverage of detection, install, verify

Updated docs: CHANGELOG, AGENTS.md, CLAUDE.md, README.md, CLI_REFERENCE,
FAQ, TROUBLESHOOTING, installation guide, video dependency plan

All 2523 tests passing (15 skipped).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-03-01 18:39:16 +03:00

13 KiB

Raw Blame History

Video Source — Dependencies & System Requirements

Date: February 27, 2026 Document: 07 of 07 Status: Planning

Status: IMPLEMENTED — skill-seekers video --setup (see video_setup.py, 835 lines, 60 tests)

GPU auto-detection: NVIDIA (nvidia-smi/CUDA), AMD (rocminfo/ROCm), CPU fallback

Correct PyTorch index URL selection per GPU vendor

EasyOCR removed from pip extras, installed at runtime via --setup

ROCm configuration (MIOPEN_FIND_MODE, HSA_OVERRIDE_GFX_VERSION)

Virtual environment detection with --force override

System dependency checks (tesseract, ffmpeg)

Non-interactive mode for MCP/CI usage

Dependency Tiers
pyproject.toml Changes
System Requirements
Import Guards
Dependency Check Command
Model Management
Docker Considerations

Dependency Tiers

Video processing has two tiers to keep the base install lightweight:

Tier 1: `[video]` — Lightweight (YouTube transcripts + metadata)

Use case: YouTube videos with existing captions. No download, no GPU needed.

Package	Version	Size	Purpose
`yt-dlp`	`>=2024.12.0`	~15MB	Metadata extraction, audio download
`youtube-transcript-api`	`>=1.2.0`	~50KB	YouTube caption extraction

Capabilities:

YouTube metadata (title, chapters, tags, description, engagement)
YouTube captions (manual and auto-generated)
Vimeo metadata
Playlist and channel resolution
Subtitle file parsing (SRT, VTT)
Segmentation and alignment
Full output generation

NOT included:

Speech-to-text (Whisper)
Visual extraction (frame + OCR)
Local video file transcription (without subtitles)

Tier 2: `[video-full]` — Full (adds Whisper + visual extraction)

Use case: Local videos without subtitles, or when you want code/slide extraction from screen.

Package	Version	Size	Purpose
`yt-dlp`	`>=2024.12.0`	~15MB	Metadata + audio download
`youtube-transcript-api`	`>=1.2.0`	~50KB	YouTube captions
`faster-whisper`	`>=1.0.0`	~5MB (+ models: 75MB-3GB)	Speech-to-text
`scenedetect[opencv]`	`>=0.6.4`	~50MB (includes OpenCV)	Scene boundary detection
`easyocr`	`>=1.7.0`	~150MB (+ models: ~200MB)	Text recognition from frames
`opencv-python-headless`	`>=4.9.0`	~50MB	Frame extraction, image processing

Additional capabilities over Tier 1:

Whisper speech-to-text (99 languages, word-level timestamps)
Scene detection (find visual transitions)
Keyframe extraction (save important frames)
Frame classification (code/slide/terminal/diagram)
OCR on frames (extract code and text from screen)
Code block detection from video

Total install size:

Tier 1: ~15MB
Tier 2: ~270MB + models (~300MB-3.2GB depending on Whisper model)

pyproject.toml Changes

[project.optional-dependencies]
# Existing dependencies...
gemini = ["google-generativeai>=0.8.0"]
openai = ["openai>=1.0.0"]
all-llms = ["google-generativeai>=0.8.0", "openai>=1.0.0"]

# NEW: Video processing
video = [
    "yt-dlp>=2024.12.0",
    "youtube-transcript-api>=1.2.0",
]
video-full = [
    "yt-dlp>=2024.12.0",
    "youtube-transcript-api>=1.2.0",
    "faster-whisper>=1.0.0",
    "scenedetect[opencv]>=0.6.4",
    "easyocr>=1.7.0",
    "opencv-python-headless>=4.9.0",
]

# Update 'all' to include video
all = [
    # ... existing all dependencies ...
    "yt-dlp>=2024.12.0",
    "youtube-transcript-api>=1.2.0",
    "faster-whisper>=1.0.0",
    "scenedetect[opencv]>=0.6.4",
    "easyocr>=1.7.0",
    "opencv-python-headless>=4.9.0",
]

[project.scripts]
# ... existing entry points ...
skill-seekers-video = "skill_seekers.cli.video_scraper:main"      # NEW

Installation Commands

# Lightweight video (YouTube transcripts + metadata)
pip install skill-seekers[video]

# Full video (+ Whisper + visual extraction)
pip install skill-seekers[video-full]

# Everything
pip install skill-seekers[all]

# Development (editable)
pip install -e ".[video]"
pip install -e ".[video-full]"

System Requirements

Tier 1 (Lightweight)

Requirement	Needed For	How to Check
Python 3.10+	All	`python --version`
Internet connection	YouTube API calls	N/A

No additional system dependencies. Pure Python.

Tier 2 (Full)

Requirement	Needed For	How to Check	Install
Python 3.10+	All	`python --version`	—
FFmpeg	Audio extraction, video processing	`ffmpeg -version`	See below
GPU (optional)	Whisper + easyocr acceleration	`nvidia-smi` (NVIDIA)	CUDA toolkit

FFmpeg Installation

FFmpeg is required for:

Extracting audio from video files (Whisper input)
Downloading audio-only streams (yt-dlp post-processing)
Converting between audio formats

# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt install ffmpeg

# Windows (winget)
winget install ffmpeg

# Windows (choco)
choco install ffmpeg

# Verify
ffmpeg -version

GPU Support (Optional)

GPU accelerates Whisper (~4x) and easyocr (~5x) but is not required.

NVIDIA GPU (CUDA):

# Check CUDA availability
python -c "import torch; print(torch.cuda.is_available())"

# faster-whisper uses CTranslate2 which auto-detects CUDA
# easyocr uses PyTorch which auto-detects CUDA
# No additional setup needed if PyTorch CUDA is working

Apple Silicon (MPS):

# faster-whisper does not support MPS directly
# Falls back to CPU on Apple Silicon
# easyocr has partial MPS support

CPU-only (no GPU):

# Everything works on CPU, just slower
# Whisper base model: ~4x slower on CPU vs GPU
# easyocr: ~5x slower on CPU vs GPU
# For short videos (<10 min), CPU is fine

Import Guards

All video dependencies use try/except import guards to provide clear error messages:

video_scraper.py

"""Video scraper - main orchestrator."""

# Core dependencies (always available)
import json
import logging
import os
from pathlib import Path

# Tier 1: Video basics
try:
    from yt_dlp import YoutubeDL
    HAS_YTDLP = True
except ImportError:
    HAS_YTDLP = False

try:
    from youtube_transcript_api import YouTubeTranscriptApi
    HAS_YT_TRANSCRIPT = True
except ImportError:
    HAS_YT_TRANSCRIPT = False

# Feature availability check
def check_video_dependencies(require_full: bool = False) -> None:
    """Check that video dependencies are installed.

    Args:
        require_full: If True, check for full dependencies (Whisper, OCR)

    Raises:
        ImportError: With installation instructions
    """
    missing = []

    if not HAS_YTDLP:
        missing.append("yt-dlp")
    if not HAS_YT_TRANSCRIPT:
        missing.append("youtube-transcript-api")

    if missing:
        raise ImportError(
            f"Video processing requires: {', '.join(missing)}\n"
            f"Install with: pip install skill-seekers[video]"
        )

    if require_full:
        full_missing = []
        try:
            import faster_whisper
        except ImportError:
            full_missing.append("faster-whisper")
        try:
            import cv2
        except ImportError:
            full_missing.append("opencv-python-headless")
        try:
            import scenedetect
        except ImportError:
            full_missing.append("scenedetect[opencv]")
        try:
            import easyocr
        except ImportError:
            full_missing.append("easyocr")

        if full_missing:
            raise ImportError(
                f"Visual extraction requires: {', '.join(full_missing)}\n"
                f"Install with: pip install skill-seekers[video-full]"
            )

video_transcript.py

"""Transcript extraction module."""

# YouTube transcript (Tier 1)
try:
    from youtube_transcript_api import YouTubeTranscriptApi
    HAS_YT_TRANSCRIPT = True
except ImportError:
    HAS_YT_TRANSCRIPT = False

# Whisper (Tier 2)
try:
    from faster_whisper import WhisperModel
    HAS_WHISPER = True
except ImportError:
    HAS_WHISPER = False


def get_transcript(video_info, config):
    """Get transcript using best available method."""

    # Try YouTube captions first (Tier 1)
    if HAS_YT_TRANSCRIPT and video_info.source_type == VideoSourceType.YOUTUBE:
        try:
            return extract_youtube_transcript(video_info.video_id, config.languages)
        except TranscriptNotAvailable:
            pass

    # Try Whisper fallback (Tier 2)
    if HAS_WHISPER:
        return transcribe_with_whisper(video_info, config)

    # No transcript possible
    if not HAS_WHISPER:
        logger.warning(
            f"No transcript for {video_info.video_id}. "
            "Install faster-whisper for speech-to-text: "
            "pip install skill-seekers[video-full]"
        )
    return [], TranscriptSource.NONE

video_visual.py

"""Visual extraction module."""

try:
    import cv2
    HAS_OPENCV = True
except ImportError:
    HAS_OPENCV = False

try:
    from scenedetect import detect, ContentDetector
    HAS_SCENEDETECT = True
except ImportError:
    HAS_SCENEDETECT = False

try:
    import easyocr
    HAS_EASYOCR = True
except ImportError:
    HAS_EASYOCR = False


def check_visual_dependencies() -> None:
    """Check visual extraction dependencies."""
    missing = []
    if not HAS_OPENCV:
        missing.append("opencv-python-headless")
    if not HAS_SCENEDETECT:
        missing.append("scenedetect[opencv]")
    if not HAS_EASYOCR:
        missing.append("easyocr")

    if missing:
        raise ImportError(
            f"Visual extraction requires: {', '.join(missing)}\n"
            f"Install with: pip install skill-seekers[video-full]"
        )


def check_ffmpeg() -> bool:
    """Check if FFmpeg is available."""
    import shutil
    return shutil.which('ffmpeg') is not None

Dependency Check Command

Add a dependency check to the config command:

# Check all video dependencies
skill-seekers config --check-video

# Output:
# Video Dependencies:
#   yt-dlp              ✅ 2025.01.15
#   youtube-transcript-api ✅ 1.2.3
#   faster-whisper      ❌ Not installed (pip install skill-seekers[video-full])
#   opencv-python-headless ❌ Not installed
#   scenedetect         ❌ Not installed
#   easyocr             ❌ Not installed
#
# System Dependencies:
#   FFmpeg              ✅ 6.1.1
#   GPU (CUDA)          ❌ Not available (CPU mode will be used)
#
# Available modes:
#   Transcript only     ✅ YouTube captions available
#   Whisper fallback    ❌ Install faster-whisper
#   Visual extraction   ❌ Install video-full dependencies

Model Management

Whisper Models

Whisper models are downloaded on first use and cached in the user's home directory.

Model	Download Size	Disk Size	First-Use Download Time
tiny	75 MB	75 MB	~15s
base	142 MB	142 MB	~25s
small	466 MB	466 MB	~60s
medium	1.5 GB	1.5 GB	~3 min
large-v3	3.1 GB	3.1 GB	~5 min
large-v3-turbo	1.6 GB	1.6 GB	~3 min

Cache location: ~/.cache/huggingface/hub/ (CTranslate2 models)

Pre-download command:

# Pre-download a model before using it
python -c "from faster_whisper import WhisperModel; WhisperModel('base')"

easyocr Models

easyocr models are also downloaded on first use.

Language Pack	Download Size	Disk Size
English	~100 MB	~100 MB
+ Additional language	~50-100 MB each	~50-100 MB each

Cache location: ~/.EasyOCR/model/