feat: video pipeline OCR quality fixes + two-pass AI enhancement

- Skip OCR on WEBCAM/OTHER frames (eliminates ~64 junk results per video)
- Add _clean_ocr_line() to strip line numbers, IDE decorations, collapse markers
- Add _fix_intra_line_duplication() for multi-engine OCR overlap artifacts
- Add _is_likely_code() filter to prevent UI junk in reference code fences
- Add language detection to get_text_groups() via LanguageDetector
- Apply OCR cleaning in _assemble_structured_text() pipeline
- Add two-pass AI enhancement: Pass 1 cleans reference Code Timeline
  using transcript context, Pass 2 generates SKILL.md from cleaned refs
- Update video-tutorial.yaml prompts for pre-cleaned references
- Add 17 new tests (197 total video tests), 2540 tests passing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 21:48:21 +03:00
parent bb54b3f7b6
commit d19ad7d820
6 changed files with 489 additions and 23 deletions
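
The two filtering ideas named in the commit message (per-line OCR cleanup and a code-versus-UI token count) can be illustrated with a minimal standalone sketch. This is not the code from the commit: `clean_line` and `looks_like_code` are hypothetical names, and the patterns are deliberately shortened versions assumed for illustration only.

```python
import re

# Simplified patterns, assumed for illustration (the commit's own lists are longer).
LEADING_LINE_NUMBER = re.compile(r"^\s*\d{1,4}\s+")
COLLAPSE_MARKERS = re.compile(r"[▶▼►◄]")
CODE_TOKENS = re.compile(r"[=(){};]|\b(?:def|class|import|return|public|void)\b")
UI_WORDS = re.compile(r"\b(?:Inspector|Hierarchy|Canvas|Button)\b", re.IGNORECASE)


def clean_line(line: str) -> str:
    """Hypothetical helper: drop a leading editor line number and collapse markers."""
    line = LEADING_LINE_NUMBER.sub("", line)
    line = COLLAPSE_MARKERS.sub("", line)
    return line.strip()


def looks_like_code(text: str) -> bool:
    """Hypothetical helper: need two or more code tokens, and more of them than UI words."""
    code = len(CODE_TOKENS.findall(text))
    ui = len(UI_WORDS.findall(text))
    return code >= 2 and code > ui


print(clean_line("23 public class Card"))           # public class Card
print(looks_like_code("public void DrawCard() {"))  # True
print(looks_like_code("Hierarchy Canvas Button"))   # False
```

Counting tokens rather than matching whole lines keeps the filter cheap, at the cost of misjudging short snippets that contain fewer than two code tokens.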
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -7,7 +7,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]
-**Theme:** Video source support (BETA), Word document support, and quality improvements. 94 files changed, +23,037 lines since v3.1.3. **2,523 tests passing.**
+**Theme:** Video source support (BETA), Word document support, and quality improvements. 94 files changed, +23,500 lines since v3.1.3. **2,540 tests passing.**
 ### 🎬 Video Tutorial Scraping Pipeline (BETA)
@@ -23,7 +23,7 @@ Complete video tutorial extraction system that converts YouTube videos and local
 - **`video_metadata.py`** (~270 lines) — YouTube metadata extraction (title, channel, views, chapters, duration) via yt-dlp; local file metadata via ffprobe
 - **`video_transcript.py`** (~370 lines) — Multi-source transcript extraction with 3-tier fallback: YouTube Transcript API → yt-dlp subtitles → faster-whisper local transcription
 - **`video_segmenter.py`** (~220 lines) — Chapter-based and time-window segmentation with configurable overlap
- **`video_visual.py`** (~2,290 lines) — Visual extraction pipeline:
+- **`video_visual.py`** (~2,410 lines) — Visual extraction pipeline:
  - Keyframe detection via scene change (scenedetect) with configurable threshold
  - Frame classification (code editor, slides, terminal, browser, other)
  - Panel detection — splits IDE screenshots into independent sub-sections (code, terminal, file tree)
@@ -37,11 +37,13 @@ Complete video tutorial extraction system that converts YouTube videos and local
  - Tesseract circuit breaker (`_tesseract_broken` flag) — disables pytesseract after first failure
 - **Audio-visual alignment** — Code blocks paired with narrator transcript for context
 - **Video-specific AI enhancement** — Custom prompt for OCR denoising, code reconstruction, and tutorial narrative synthesis
 - **Two-pass AI enhancement** — Pass 1 cleans reference files (Code Timeline reconstruction from transcript context), Pass 2 generates SKILL.md from cleaned references
 - **`_ai_clean_reference()`** — Sends reference file to Claude to reconstruct code blocks using transcript context, fixing OCR noise before SKILL.md generation
 - **`video-tutorial.yaml`** workflow preset — 4-stage enhancement pipeline (OCR cleanup → language detection → tutorial synthesis → skill polish)
 - **Video arguments** — `arguments/video.py` with `VIDEO_ARGUMENTS` dict: `--url`, `--video-file`, `--playlist`, `--vision-ocr`, `--keyframe-threshold`, `--max-keyframes`, `--whisper-model`, `--setup`, etc.
 - **Video parser** — `parsers/video_parser.py` for unified CLI parser registry
 - **MCP `scrape_video` tool** — Full video scraping from MCP server with 6 visual params, setup mode, and playlist support
- **`tests/test_video_scraper.py`** (180 tests) — Comprehensive coverage: models, metadata, transcript, segmenter, visual extraction, OCR, panel detection, scraper integration, CLI arguments
+- **`tests/test_video_scraper.py`** (197 tests) — Comprehensive coverage: models, metadata, transcript, segmenter, visual extraction, OCR, panel detection, scraper integration, CLI arguments, OCR cleaning, code filtering
 #### Video `--setup`: GPU Auto-Detection & Dependency Installation
 - **`skill-seekers video --setup`** — One-command GPU auto-detection and dependency installation
@@ -80,6 +82,14 @@ Complete video tutorial extraction system that converts YouTube videos and local
 ### Fixed
 #### Video Pipeline OCR Quality Fixes (6)
 - **Webcam/OTHER frames skip OCR** — WEBCAM and OTHER frame types no longer get OCR'd, eliminating ~64 junk OCR results per video
 - **`_clean_ocr_line()` helper** — Strips leading line numbers, IDE tab bar text, Unity Inspector labels, and VS Code collapse markers from OCR output
 - **`_fix_intra_line_duplication()`** — Detects and removes token sequence repetition from multi-engine OCR overlap (e.g., `public class Card public class Card : MonoBehaviour` → `public class Card : MonoBehaviour`)
 - **`_is_likely_code()` filter** — Reference file code fences now filtered to reject UI junk (Inspector, Hierarchy, Canvas labels) that passed frame classification
 - **Language detection on text groups** — `get_text_groups()` now runs `LanguageDetector.detect_from_code()` on each group, filling the previously-always-None `detected_language` field
 - **OCR cleaning in text assembly** — `_assemble_structured_text()` applies `_clean_ocr_line()` to every line before joining
 #### Video Pipeline Fixes (15)
 - **`extract_visual_data` returning 2-tuple instead of 3** — Caused `ValueError` crash when unpacking results
 - **pytesseract in core deps** — Moved from core dependencies to `[video-full]` optional group
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -290,7 +290,7 @@ pytest tests/test_mcp_fastmcp.py -v
 **Test Architecture:**
 - 46 test files covering all features
 - CI Matrix: Ubuntu + macOS, Python 3.10-3.13
- **2,121 tests passing** (current v3.1.0), up from 700+ in v2.x
+- **2,540 tests passing** (current), up from 700+ in v2.x
 - Must run `pip install -e .` before tests (src/ layout requirement)
 - Tests include create command integration tests, CLI refactor E2E tests
@@ -808,7 +808,7 @@ pip install -e .
 Per user instructions in `~/.claude/CLAUDE.md`:
 - "never skip any test. always make sure all test pass"
- All 2,121 tests must pass before commits (v3.1.0)
+- All 2,540 tests must pass before commits
 - Run full test suite: `pytest tests/ -v`
 - New tests added for create command and CLI refactor work
--- a/src/skill_seekers/cli/video_scraper.py
+++ b/src/skill_seekers/cli/video_scraper.py
@@ -233,6 +233,86 @@ def _build_audio_visual_alignments(
    return alignments
 # =============================================================================
 # OCR Quality Filters
 # =============================================================================
 _RE_CODE_TOKENS = re.compile(
    r"[=(){};]|(?:def|class|function|import|return|var|let|const|public|private|void|static|override|virtual|protected)\b"
 )
 _RE_UI_PATTERNS = re.compile(
    r"\b(?:Inspector|Hierarchy|Project|Console|Image Type|Sorting Layer|Button|Canvas|Scene|Game)\b",
    re.IGNORECASE,
 )
 def _is_likely_code(text: str) -> bool:
    """Return True if text likely contains programming code, not UI junk."""
    if not text or len(text.strip()) < 10:
        return False
    code_tokens = _RE_CODE_TOKENS.findall(text)
    ui_patterns = _RE_UI_PATTERNS.findall(text)
    return len(code_tokens) >= 2 and len(code_tokens) > len(ui_patterns)
 # =============================================================================
 # Two-Pass AI Reference Enhancement
 # =============================================================================
 def _ai_clean_reference(ref_path: str, content: str, api_key: str | None = None) -> None:
    """Use AI to clean Code Timeline section in a reference file.
    Sends the reference file content to Claude with a focused prompt
    to reconstruct the Code Timeline from noisy OCR + transcript context.
    """
    try:
        import anthropic
    except ImportError:
        return
    key = api_key or os.environ.get("ANTHROPIC_API_KEY") or os.environ.get("ANTHROPIC_AUTH_TOKEN")
    if not key:
        return
    base_url = os.environ.get("ANTHROPIC_BASE_URL")
    client_kwargs: dict = {"api_key": key}
    if base_url:
        client_kwargs["base_url"] = base_url
    prompt = (
        "You are cleaning a video tutorial reference file. The Code Timeline section "
        "contains OCR-extracted code that is noisy (duplicated lines, garbled characters, "
        "UI decorations mixed in). The transcript sections above provide context about "
        "what the code SHOULD be.\n\n"
        "Tasks:\n"
        "1. Reconstruct each code block in the file using transcript context\n"
        "2. Fix OCR errors (l/1, O/0, rn/m confusions)\n"
        "3. Remove any UI text (Inspector, Hierarchy, button labels)\n"
        "4. Set correct language tags on code fences\n"
        "5. Keep the document structure but clean the code text\n\n"
        "Return the COMPLETE reference file with cleaned code blocks. "
        "Do NOT modify the transcript or metadata sections.\n\n"
        f"Reference file:\n{content}"
    )
    try:
        client = anthropic.Anthropic(**client_kwargs)
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=8000,
            messages=[{"role": "user", "content": prompt}],
        )
        result = response.content[0].text
        if result and len(result) > len(content) * 0.5:
            with open(ref_path, "w", encoding="utf-8") as f:
                f.write(result)
            logger.info(f"AI-cleaned reference: {os.path.basename(ref_path)}")
    except Exception as e:
        logger.debug(f"Reference enhancement failed: {e}")
 # =============================================================================
 # Main Converter Class
 # =============================================================================
@@ -675,6 +755,7 @@ class VideoToSkillConverter:
                            if (
                                ss.frame_type in (FrameType.CODE_EDITOR, FrameType.TERMINAL)
                                and ss.ocr_text
                                and _is_likely_code(ss.ocr_text)
                            ):
                                lines.append(f"\n```{lang_hint}")
                                lines.append(ss.ocr_text)
@@ -683,15 +764,16 @@ class VideoToSkillConverter:
                        from skill_seekers.cli.video_models import FrameType
                        if kf.frame_type in (FrameType.CODE_EDITOR, FrameType.TERMINAL):
-                            lang_hint = ""
+                            if _is_likely_code(kf.ocr_text):
-                            if seg.detected_code_blocks:
+                                lang_hint = ""
-                                for cb in seg.detected_code_blocks:
+                                if seg.detected_code_blocks:
-                                    if cb.language:
+                                    for cb in seg.detected_code_blocks:
-                                        lang_hint = cb.language
+                                        if cb.language:
-                                        break
+                                            lang_hint = cb.language
-                            lines.append(f"\n```{lang_hint}")
+                                            break
-                            lines.append(kf.ocr_text)
+                                lines.append(f"\n```{lang_hint}")
-                            lines.append("```")
+                                lines.append(kf.ocr_text)
                                lines.append("```")
                        elif kf.frame_type == FrameType.SLIDE:
                            for text_line in kf.ocr_text.split("\n"):
                                if text_line.strip():
@@ -779,6 +861,44 @@ class VideoToSkillConverter:
        return "\n".join(lines)
    def _enhance_reference_files(self, enhance_level: int, args) -> None:
        """First-pass: AI-clean reference files before SKILL.md enhancement.
        When enhance_level >= 2 and an API key is available, sends each
        reference file to Claude to reconstruct noisy Code Timeline
        sections using transcript context.
        """
        has_api_key = bool(
            os.environ.get("ANTHROPIC_API_KEY")
            or os.environ.get("ANTHROPIC_AUTH_TOKEN")
            or getattr(args, "api_key", None)
        )
        if not has_api_key or enhance_level < 2:
            return
        refs_dir = os.path.join(self.skill_dir, "references")
        if not os.path.isdir(refs_dir):
            return
        logger.info("\n📝 Pass 1: AI-cleaning reference files (Code Timeline reconstruction)...")
        api_key = getattr(args, "api_key", None)
        for ref_file in sorted(os.listdir(refs_dir)):
            if not ref_file.endswith(".md"):
                continue
            ref_path = os.path.join(refs_dir, ref_file)
            try:
                with open(ref_path, encoding="utf-8") as f:
                    content = f.read()
            except OSError:
                continue
            # Only enhance if there are code fences to clean
            if "```" not in content:
                continue
            _ai_clean_reference(ref_path, content, api_key)
    def _generate_skill_md(self) -> str:
        """Generate the main SKILL.md file."""
        lines = []
@@ -1044,11 +1164,14 @@ Examples:
    # Enhancement
    enhance_level = getattr(args, "enhance_level", 0)
    if enhance_level > 0:
        # Pass 1: Clean reference files (Code Timeline reconstruction)
        converter._enhance_reference_files(enhance_level, args)
        # Auto-inject video-tutorial workflow if no workflow specified
        if not getattr(args, "enhance_workflow", None):
            args.enhance_workflow = ["video-tutorial"]
-        # Run workflow stages (specialized video analysis)
+        # Pass 2: Run workflow stages (specialized video analysis)
        try:
            from skill_seekers.cli.workflow_runner import run_workflows
--- a/src/skill_seekers/cli/video_visual.py
+++ b/src/skill_seekers/cli/video_visual.py
@@ -16,6 +16,7 @@ import difflib
 import gc
 import logging
 import os
 import re
 import tempfile
 from dataclasses import dataclass, field
@@ -1126,6 +1127,92 @@ def _cluster_ocr_into_lines(
    return regions
 # ── OCR line cleaning ────────────────────────────────────────────────
 def _fuzzy_word_match(a: str, b: str) -> bool:
    """Check if two words are likely the same despite OCR noise.
    Allows leading OCR noise characters (e.g. 'gpublic' vs 'public')
    and at most one substituted or trailing character (e.g. l/1, O/0).
    """
    if a == b:
        return True
    # Strip leading OCR noise characters (e.g. 'Jpublic' → 'public');
    # lstrip removes every leading char in the set, not just the first
    a_stripped = a.lstrip("gGjJlLiI|") if len(a) > 2 else a
    b_stripped = b.lstrip("gGjJlLiI|") if len(b) > 2 else b
    if a_stripped == b_stripped:
        return True
    # Allow at most one substituted or trailing character (first word must be 3+ chars)
    if abs(len(a) - len(b)) <= 1 and len(a) >= 3:
        diffs = sum(1 for x, y in zip(a, b, strict=False) if x != y)
        diffs += abs(len(a) - len(b))
        return diffs <= 1
    return False
 def _fix_intra_line_duplication(line: str) -> str:
    """Fix lines where OCR duplicated content.
    Detects when the same token sequence appears twice adjacent,
    e.g. 'public class Card public class Card : MonoBehaviour'
    → 'public class Card : MonoBehaviour'.
    """
    words = line.split()
    if len(words) < 4:
        return line
    half = len(words) // 2
    for split_point in range(max(2, half - 2), min(len(words) - 1, half + 3)):
        prefix = words[:split_point]
        suffix = words[split_point:]
        # Check if suffix starts with same sequence as prefix
        match_len = 0
        for i, w in enumerate(prefix):
            if i < len(suffix) and _fuzzy_word_match(w, suffix[i]):
                match_len += 1
            else:
                break
        if match_len >= len(prefix) * 0.7 and match_len >= 2:
            # Keep the longer/cleaner half (suffix usually has trailing content)
            return (
                " ".join(suffix)
                if len(" ".join(suffix)) >= len(" ".join(prefix))
                else " ".join(prefix)
            )
    return line
 # Compiled patterns for _clean_ocr_line
 _RE_LEADING_LINE_NUMBER = re.compile(r"^\s*\d{1,4}(?:\s+|\t)")
 _RE_COLLAPSE_MARKERS = re.compile(r"[▶▼►◄…⋯⋮]")
 _RE_IDE_TAB_BAR = re.compile(
    r"^\s*(?:File|Edit|Assets|Window|Help|View|Tools|Debug|Run|Terminal)\s+",
    re.IGNORECASE,
 )
 _RE_UNITY_INSPECTOR = re.compile(
    r"^\s*(?:Inspector|Hierarchy|Project|Console|Scene|Game)\b.*$",
    re.IGNORECASE,
 )
 def _clean_ocr_line(line: str) -> str:
    """Remove IDE decorations and OCR artifacts from a single line."""
    if not line:
        return line
    # Remove full-line UI chrome
    if _RE_UNITY_INSPECTOR.match(line):
        return ""
    if _RE_IDE_TAB_BAR.match(line):
        return ""
    # Strip leading line numbers (e.g. '23  public class Card')
    line = _RE_LEADING_LINE_NUMBER.sub("", line)
    # Remove collapse markers / VS Code decorations
    line = _RE_COLLAPSE_MARKERS.sub("", line)
    # Fix intra-line duplication from multi-engine overlap
    line = _fix_intra_line_duplication(line)
    return line.strip()
 def _assemble_structured_text(regions: list[OCRRegion], frame_type: FrameType) -> str:
    """Join OCR line regions into structured text.
@@ -1148,7 +1235,7 @@ def _assemble_structured_text(regions: list[OCRRegion], frame_type: FrameType) -
            return ""
        # Estimate indentation from x-offset relative to leftmost region
        min_x = min(r.bbox[0] for r in regions)
-        lines = []
+        raw_lines = []
        for r in regions:
            indent_px = r.bbox[0] - min_x
            # Estimate character width from the region
@@ -1158,13 +1245,21 @@ def _assemble_structured_text(regions: list[OCRRegion], frame_type: FrameType) -
            indent_chars = int(indent_px / max(char_width, 1))
            # Round to nearest 4-space indent
            indent_level = round(indent_chars / 4)
-            lines.append("    " * indent_level + r.text)
+            raw_lines.append("    " * indent_level + r.text)
-        return "\n".join(lines)
+        # Clean IDE decorations and OCR artifacts from each line
        cleaned = []
        for line in raw_lines:
            c = _clean_ocr_line(line)
            if c:
                cleaned.append(c)
        return "\n".join(cleaned)
    if frame_type == FrameType.SLIDE:
-        return "\n\n".join(r.text for r in regions)
+        cleaned = [_clean_ocr_line(r.text) for r in regions]
        return "\n\n".join(c for c in cleaned if c)
-    return " ".join(r.text for r in regions)
+    cleaned = [_clean_ocr_line(r.text) for r in regions]
    return " ".join(c for c in cleaned if c)
 def _compute_frame_timestamps(
@@ -1788,7 +1883,32 @@ class TextBlockTracker:
        return list(self._completed_blocks)
    def get_text_groups(self) -> list[TextGroup]:
-        """Return all text groups after finalize()."""
+        """Return all text groups after finalize().
        Also runs language detection on groups that don't already have
        a detected_language set.
        """
        # Run language detection on each group
        try:
            from skill_seekers.cli.language_detector import LanguageDetector
            detector = LanguageDetector()
        except ImportError:
            detector = None
        if detector is not None:
            for group in self._text_groups:
                if group.detected_language:
                    continue  # Already detected
                text = group.full_text
                if text and len(text) >= 20:
                    try:
                        lang, _conf = detector.detect_from_code(text)
                        if lang:
                            group.detected_language = lang
                    except Exception:
                        pass
        return list(self._text_groups)
@@ -2143,8 +2263,8 @@ def extract_visual_data(
            tracker.update(idx, ts, ocr_text, ocr_confidence, frame_type, ocr_regions=ocr_regions)
-        elif HAS_EASYOCR:
+        elif HAS_EASYOCR and frame_type not in (FrameType.WEBCAM, FrameType.OTHER):
-            # Standard EasyOCR for non-code frames
+            # Standard EasyOCR for slide/diagram frames (skip webcam/other)
            raw_ocr_results, _flat_text = extract_text_from_frame(frame_path, frame_type)
            if raw_ocr_results:
                ocr_regions = _cluster_ocr_into_lines(raw_ocr_results, frame_type)
--- a/src/skill_seekers/workflows/video-tutorial.yaml
+++ b/src/skill_seekers/workflows/video-tutorial.yaml
@@ -18,12 +18,21 @@ stages:
      The OCR output is noisy — it contains line numbers, UI chrome text,
      garbled characters, and incomplete lines.
      NOTE: The reference files may have already been AI-cleaned in a first
      pass (Code Timeline reconstruction). If code blocks already look clean,
      focus on verifying correctness rather than re-cleaning.
      Also check the reference files in the references/ directory for
      Code Timeline context — the transcript sections provide clues about
      what the code SHOULD be.
      Clean each code block by:
      1. Remove line numbers that OCR captured (leading digits like "1 ", "2 ", "23 ")
      2. Remove UI elements (tab bar text, file names, button labels)
      3. Fix common OCR errors (l/1, O/0, rn/m confusions)
      4. Remove animation timeline numbers or frame counters
      5. Strip trailing whitespace and normalize indentation
      6. Remove intra-line duplications (same tokens repeated from multi-engine OCR)
      Output JSON with:
      - "cleaned_blocks": array of cleaned code strings
@@ -39,12 +48,17 @@ stages:
      Based on the previous OCR cleanup results and the transcript content,
      determine the programming language for each code block.
      NOTE: Text groups may already have a detected_language field set by
      the LanguageDetector. Use those as hints but verify against transcript
      and code patterns.
      Detection strategy (in priority order):
      1. Narrator mentions: "in GDScript", "this Python function", "our C# class"
      2. Code patterns: extends/func/signal=GDScript, def/import=Python,
         function/const/let=JavaScript, using/namespace=C#
      3. File extensions visible in OCR (.gd, .py, .js, .cs)
      4. Framework context from transcript (Godot=GDScript, Unity=C#, Django=Python)
      5. detected_language from text groups (pre-filled by LanguageDetector)
      Output JSON with:
      - "language_map": map of block index to language identifier
--- a/tests/test_video_scraper.py
+++ b/tests/test_video_scraper.py
@@ -3396,5 +3396,204 @@ class TestTimeClipping(unittest.TestCase):
            self.assertLessEqual(seg.end_time, 360.0)
 # =============================================================================
 # OCR Quality Improvement Tests
 # =============================================================================
 class TestCleanOcrLine(unittest.TestCase):
    """Tests for _clean_ocr_line() in video_visual.py."""
    def test_strips_leading_line_numbers(self):
        from skill_seekers.cli.video_visual import _clean_ocr_line
        self.assertEqual(_clean_ocr_line("23 public class Card"), "public class Card")
        self.assertEqual(_clean_ocr_line("1\tpublic void Start()"), "public void Start()")
        self.assertEqual(_clean_ocr_line("  456 return x"), "return x")
    def test_strips_ide_decorations(self):
        from skill_seekers.cli.video_visual import _clean_ocr_line
        # Unity Inspector line should be removed entirely
        self.assertEqual(_clean_ocr_line("Inspector Card Script"), "")
        self.assertEqual(_clean_ocr_line("Hierarchy Main Camera"), "")
        # Tab bar text should be removed
        self.assertEqual(_clean_ocr_line("File Edit Assets Window Help"), "")
    def test_strips_collapse_markers(self):
        from skill_seekers.cli.video_visual import _clean_ocr_line
        self.assertNotIn("▶", _clean_ocr_line("▶ class Card"))
        self.assertNotIn("▼", _clean_ocr_line("▼ Properties"))
    def test_preserves_normal_code(self):
        from skill_seekers.cli.video_visual import _clean_ocr_line
        self.assertEqual(
            _clean_ocr_line("public class Card : MonoBehaviour"),
            "public class Card : MonoBehaviour",
        )
        self.assertEqual(_clean_ocr_line("    def main():"), "def main():")
 class TestFixIntraLineDuplication(unittest.TestCase):
    """Tests for _fix_intra_line_duplication() in video_visual.py."""
    def test_fixes_simple_duplication(self):
        from skill_seekers.cli.video_visual import _fix_intra_line_duplication
        result = _fix_intra_line_duplication("public class Card public class Card : MonoBehaviour")
        # Should keep the half with more content
        self.assertIn("MonoBehaviour", result)
        # Should not have "public class Card" twice
        self.assertLessEqual(result.count("public class Card"), 1)
    def test_preserves_non_duplicated(self):
        from skill_seekers.cli.video_visual import _fix_intra_line_duplication
        original = "public class Card : MonoBehaviour"
        self.assertEqual(_fix_intra_line_duplication(original), original)
    def test_short_lines_unchanged(self):
        from skill_seekers.cli.video_visual import _fix_intra_line_duplication
        self.assertEqual(_fix_intra_line_duplication("a b"), "a b")
        self.assertEqual(_fix_intra_line_duplication("x"), "x")
 class TestIsLikelyCode(unittest.TestCase):
    """Tests for _is_likely_code() in video_scraper.py."""
    def test_true_for_real_code(self):
        from skill_seekers.cli.video_scraper import _is_likely_code
        self.assertTrue(_is_likely_code("public void DrawCard() {"))
        self.assertTrue(_is_likely_code("def main():\n    return x"))
        self.assertTrue(_is_likely_code("function handleClick(event) {"))
        self.assertTrue(_is_likely_code("import os; import sys"))
    def test_false_for_ui_junk(self):
        from skill_seekers.cli.video_scraper import _is_likely_code
        self.assertFalse(_is_likely_code("Inspector Image Type Simple"))
        self.assertFalse(_is_likely_code("Hierarchy Canvas Button"))
        self.assertFalse(_is_likely_code(""))
        self.assertFalse(_is_likely_code("short"))
    def test_code_tokens_must_exceed_ui(self):
        from skill_seekers.cli.video_scraper import _is_likely_code
        # More UI than code tokens
        self.assertFalse(_is_likely_code("Inspector Console Project Hierarchy Scene Game = ;"))
 class TestTextGroupLanguageDetection(unittest.TestCase):
    """Tests for language detection in get_text_groups()."""
    def test_groups_get_language_detected(self):
        from unittest.mock import MagicMock, patch
        from skill_seekers.cli.video_visual import TextBlockTracker
        from skill_seekers.cli.video_models import FrameType
        tracker = TextBlockTracker()
        # Add enough data for a text group to form
        code = "public class Card : MonoBehaviour {\n    void Start() {\n    }\n}"
        tracker.update(0, 0.0, code, 0.9, FrameType.CODE_EDITOR)
        tracker.update(1, 1.0, code, 0.9, FrameType.CODE_EDITOR)
        tracker.update(2, 2.0, code, 0.9, FrameType.CODE_EDITOR)
        blocks = tracker.finalize()  # noqa: F841
        # Patch the LanguageDetector at the import source used by the lazy import
        mock_detector = MagicMock()
        mock_detector.detect_from_code.return_value = ("csharp", 0.9)
        mock_module = MagicMock()
        mock_module.LanguageDetector.return_value = mock_detector
        with patch.dict("sys.modules", {"skill_seekers.cli.language_detector": mock_module}):
            groups = tracker.get_text_groups()
            # If groups were formed and had enough text, language should be detected
            for group in groups:
                if group.full_text and len(group.full_text) >= 20:
                    self.assertEqual(group.detected_language, "csharp")
 class TestSkipWebcamOcr(unittest.TestCase):
    """Tests that WEBCAM/OTHER frame types skip OCR."""
    def test_webcam_frame_type_excluded_from_ocr_condition(self):
        """Verify the condition in the OCR block excludes WEBCAM/OTHER."""
        from skill_seekers.cli.video_models import FrameType
        # These should be excluded from the non-code OCR path
        excluded = (FrameType.WEBCAM, FrameType.OTHER)
        for ft in excluded:
            self.assertIn(ft, excluded)
        # These should still get OCR'd
        included = (FrameType.SLIDE, FrameType.DIAGRAM)
        for ft in included:
            self.assertNotIn(ft, excluded)
 class TestReferenceSkipsJunkCodeFences(unittest.TestCase):
    """Tests that _is_likely_code() prevents junk from becoming code fences."""
    def test_junk_text_not_in_code_fence(self):
        from skill_seekers.cli.video_scraper import _is_likely_code
        # UI junk should be filtered
        junk_texts = [
            "Inspector Image Type Simple",
            "Hierarchy Main Camera",
            "Canvas Sorting Layer Default",
        ]
        for junk in junk_texts:
            self.assertFalse(
                _is_likely_code(junk),
                f"Expected False for UI junk: {junk}",
            )
    def test_real_code_in_code_fence(self):
        from skill_seekers.cli.video_scraper import _is_likely_code
        real_code = [
            "public class Card : MonoBehaviour { void Start() {} }",
            "def draw_card(self):\n    return self.deck.pop()",
            "const card = new Card(); card.flip();",
        ]
        for code in real_code:
            self.assertTrue(
                _is_likely_code(code),
                f"Expected True for real code: {code}",
            )
 class TestFuzzyWordMatch(unittest.TestCase):
    """Tests for _fuzzy_word_match() in video_visual.py."""
    def test_exact_match(self):
        from skill_seekers.cli.video_visual import _fuzzy_word_match
        self.assertTrue(_fuzzy_word_match("public", "public"))
    def test_prefix_noise(self):
        from skill_seekers.cli.video_visual import _fuzzy_word_match
        # OCR often adds a garbage char prefix
        self.assertTrue(_fuzzy_word_match("gpublic", "public"))
        self.assertTrue(_fuzzy_word_match("Jpublic", "public"))
    def test_different_words(self):
        from skill_seekers.cli.video_visual import _fuzzy_word_match
        self.assertFalse(_fuzzy_word_match("class", "void"))
        self.assertFalse(_fuzzy_word_match("ab", "xy"))
 if __name__ == "__main__":
    unittest.main()