feat: video pipeline OCR quality fixes + two-pass AI enhancement
- Skip OCR on WEBCAM/OTHER frames (eliminates ~64 junk results per video)
- Add _clean_ocr_line() to strip line numbers, IDE decorations, collapse markers
- Add _fix_intra_line_duplication() for multi-engine OCR overlap artifacts
- Add _is_likely_code() filter to prevent UI junk in reference code fences
- Add language detection to get_text_groups() via LanguageDetector
- Apply OCR cleaning in _assemble_structured_text() pipeline
- Add two-pass AI enhancement: Pass 1 cleans reference Code Timeline using transcript context, Pass 2 generates SKILL.md from cleaned refs
- Update video-tutorial.yaml prompts for pre-cleaned references
- Add 17 new tests (197 total video tests), 2540 tests passing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
16
CHANGELOG.md
@@ -7,7 +7,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
|
|||||||
|
|
||||||
## [Unreleased]
|
## [Unreleased]
|
||||||
|
|
||||||
**Theme:** Video source support (BETA), Word document support, and quality improvements. 94 files changed, +23,037 lines since v3.1.3. **2,523 tests passing.**
|
**Theme:** Video source support (BETA), Word document support, and quality improvements. 94 files changed, +23,500 lines since v3.1.3. **2,540 tests passing.**
|
||||||
|
|
||||||
### 🎬 Video Tutorial Scraping Pipeline (BETA)
|
### 🎬 Video Tutorial Scraping Pipeline (BETA)
|
||||||
|
|
||||||
@@ -23,7 +23,7 @@ Complete video tutorial extraction system that converts YouTube videos and local
|
|||||||
- **`video_metadata.py`** (~270 lines) — YouTube metadata extraction (title, channel, views, chapters, duration) via yt-dlp; local file metadata via ffprobe
|
- **`video_metadata.py`** (~270 lines) — YouTube metadata extraction (title, channel, views, chapters, duration) via yt-dlp; local file metadata via ffprobe
|
||||||
- **`video_transcript.py`** (~370 lines) — Multi-source transcript extraction with 3-tier fallback: YouTube Transcript API → yt-dlp subtitles → faster-whisper local transcription
|
- **`video_transcript.py`** (~370 lines) — Multi-source transcript extraction with 3-tier fallback: YouTube Transcript API → yt-dlp subtitles → faster-whisper local transcription
|
||||||
- **`video_segmenter.py`** (~220 lines) — Chapter-based and time-window segmentation with configurable overlap
|
- **`video_segmenter.py`** (~220 lines) — Chapter-based and time-window segmentation with configurable overlap
|
||||||
- **`video_visual.py`** (~2,290 lines) — Visual extraction pipeline:
|
- **`video_visual.py`** (~2,410 lines) — Visual extraction pipeline:
|
||||||
- Keyframe detection via scene change (scenedetect) with configurable threshold
|
- Keyframe detection via scene change (scenedetect) with configurable threshold
|
||||||
- Frame classification (code editor, slides, terminal, browser, other)
|
- Frame classification (code editor, slides, terminal, browser, other)
|
||||||
- Panel detection — splits IDE screenshots into independent sub-sections (code, terminal, file tree)
|
- Panel detection — splits IDE screenshots into independent sub-sections (code, terminal, file tree)
|
||||||
@@ -37,11 +37,13 @@ Complete video tutorial extraction system that converts YouTube videos and local
|
|||||||
- Tesseract circuit breaker (`_tesseract_broken` flag) — disables pytesseract after first failure
|
- Tesseract circuit breaker (`_tesseract_broken` flag) — disables pytesseract after first failure
|
||||||
- **Audio-visual alignment** — Code blocks paired with narrator transcript for context
|
- **Audio-visual alignment** — Code blocks paired with narrator transcript for context
|
||||||
- **Video-specific AI enhancement** — Custom prompt for OCR denoising, code reconstruction, and tutorial narrative synthesis
|
- **Video-specific AI enhancement** — Custom prompt for OCR denoising, code reconstruction, and tutorial narrative synthesis
|
||||||
|
- **Two-pass AI enhancement** — Pass 1 cleans reference files (Code Timeline reconstruction from transcript context), Pass 2 generates SKILL.md from cleaned references
|
||||||
|
- **`_ai_clean_reference()`** — Sends reference file to Claude to reconstruct code blocks using transcript context, fixing OCR noise before SKILL.md generation
|
||||||
- **`video-tutorial.yaml`** workflow preset — 4-stage enhancement pipeline (OCR cleanup → language detection → tutorial synthesis → skill polish)
|
- **`video-tutorial.yaml`** workflow preset — 4-stage enhancement pipeline (OCR cleanup → language detection → tutorial synthesis → skill polish)
|
||||||
- **Video arguments** — `arguments/video.py` with `VIDEO_ARGUMENTS` dict: `--url`, `--video-file`, `--playlist`, `--vision-ocr`, `--keyframe-threshold`, `--max-keyframes`, `--whisper-model`, `--setup`, etc.
|
- **Video arguments** — `arguments/video.py` with `VIDEO_ARGUMENTS` dict: `--url`, `--video-file`, `--playlist`, `--vision-ocr`, `--keyframe-threshold`, `--max-keyframes`, `--whisper-model`, `--setup`, etc.
|
||||||
- **Video parser** — `parsers/video_parser.py` for unified CLI parser registry
|
- **Video parser** — `parsers/video_parser.py` for unified CLI parser registry
|
||||||
- **MCP `scrape_video` tool** — Full video scraping from MCP server with 6 visual params, setup mode, and playlist support
|
- **MCP `scrape_video` tool** — Full video scraping from MCP server with 6 visual params, setup mode, and playlist support
|
||||||
- **`tests/test_video_scraper.py`** (180 tests) — Comprehensive coverage: models, metadata, transcript, segmenter, visual extraction, OCR, panel detection, scraper integration, CLI arguments
|
- **`tests/test_video_scraper.py`** (197 tests) — Comprehensive coverage: models, metadata, transcript, segmenter, visual extraction, OCR, panel detection, scraper integration, CLI arguments, OCR cleaning, code filtering
|
||||||
|
|
||||||
#### Video `--setup`: GPU Auto-Detection & Dependency Installation
|
#### Video `--setup`: GPU Auto-Detection & Dependency Installation
|
||||||
- **`skill-seekers video --setup`** — One-command GPU auto-detection and dependency installation
|
- **`skill-seekers video --setup`** — One-command GPU auto-detection and dependency installation
|
||||||
@@ -80,6 +82,14 @@ Complete video tutorial extraction system that converts YouTube videos and local
|
|||||||
|
|
||||||
### Fixed
|
### Fixed
|
||||||
|
|
||||||
|
#### Video Pipeline OCR Quality Fixes (6)
|
||||||
|
- **Webcam/OTHER frames skip OCR** — WEBCAM and OTHER frame types no longer get OCR'd, eliminating ~64 junk OCR results per video
|
||||||
|
- **`_clean_ocr_line()` helper** — Strips leading line numbers, IDE tab bar text, Unity Inspector labels, and VS Code collapse markers from OCR output
|
||||||
|
- **`_fix_intra_line_duplication()`** — Detects and removes token sequence repetition from multi-engine OCR overlap (e.g., `gpublic class Card Jpublic class Card` → `public class Card`)
|
||||||
|
- **`_is_likely_code()` filter** — Reference file code fences now filtered to reject UI junk (Inspector, Hierarchy, Canvas labels) that passed frame classification
|
||||||
|
- **Language detection on text groups** — `get_text_groups()` now runs `LanguageDetector.detect_from_code()` on each group, filling the previously-always-None `detected_language` field
|
||||||
|
- **OCR cleaning in text assembly** — `_assemble_structured_text()` applies `_clean_ocr_line()` to every line before joining
|
||||||
|
|
||||||
#### Video Pipeline Fixes (15)
|
#### Video Pipeline Fixes (15)
|
||||||
- **`extract_visual_data` returning 2-tuple instead of 3** — Caused `ValueError` crash when unpacking results
|
- **`extract_visual_data` returning 2-tuple instead of 3** — Caused `ValueError` crash when unpacking results
|
||||||
- **pytesseract in core deps** — Moved from core dependencies to `[video-full]` optional group
|
- **pytesseract in core deps** — Moved from core dependencies to `[video-full]` optional group
|
||||||
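The two-pass enhancement described in the changelog entries above gates Pass 1 more strictly than Pass 2. The following is a minimal Python sketch of that gating, assuming the conditions visible elsewhere in this commit: Pass 1 needs `enhance_level >= 2`, an API key, and an existing `references/` directory, while Pass 2 runs whenever `enhance_level > 0`. `plan_passes` and the pass labels are hypothetical names for illustration only and do not exist in the project.

```python
def plan_passes(
    enhance_level: int,
    has_api_key: bool,
    refs_dir_exists: bool = True,
) -> list[str]:
    """Return the enhancement passes that would run, in order (hypothetical helper)."""
    passes: list[str] = []
    if enhance_level <= 0:
        # Enhancement is disabled entirely, so neither pass runs.
        return passes
    # Pass 1: AI-clean reference files. Requires level >= 2, a key, and references/.
    if enhance_level >= 2 and has_api_key and refs_dir_exists:
        passes.append("pass1_clean_references")
    # Pass 2: workflow stages run whenever enhancement is enabled.
    passes.append("pass2_workflow")
    return passes


print(plan_passes(1, has_api_key=True))   # ['pass2_workflow']
print(plan_passes(2, has_api_key=True))   # ['pass1_clean_references', 'pass2_workflow']
```

With `enhance_level` set to 1, or with no API key in the environment, the sketch skips Pass 1, so the workflow stages would see uncleaned reference files.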
|
|||||||
@@ -290,7 +290,7 @@ pytest tests/test_mcp_fastmcp.py -v
|
|||||||
**Test Architecture:**
|
**Test Architecture:**
|
||||||
- 46 test files covering all features
|
- 46 test files covering all features
|
||||||
- CI Matrix: Ubuntu + macOS, Python 3.10-3.13
|
- CI Matrix: Ubuntu + macOS, Python 3.10-3.13
|
||||||
- **2,121 tests passing** (current v3.1.0), up from 700+ in v2.x
|
- **2,540 tests passing** (current), up from 700+ in v2.x
|
||||||
- Must run `pip install -e .` before tests (src/ layout requirement)
|
- Must run `pip install -e .` before tests (src/ layout requirement)
|
||||||
- Tests include create command integration tests, CLI refactor E2E tests
|
- Tests include create command integration tests, CLI refactor E2E tests
|
||||||
|
|
||||||
@@ -808,7 +808,7 @@ pip install -e .
|
|||||||
|
|
||||||
Per user instructions in `~/.claude/CLAUDE.md`:
|
Per user instructions in `~/.claude/CLAUDE.md`:
|
||||||
- "never skip any test. always make sure all test pass"
|
- "never skip any test. always make sure all test pass"
|
||||||
- All 2,121 tests must pass before commits (v3.1.0)
|
- All 2,540 tests must pass before commits
|
||||||
- Run full test suite: `pytest tests/ -v`
|
- Run full test suite: `pytest tests/ -v`
|
||||||
- New tests added for create command and CLI refactor work
|
- New tests added for create command and CLI refactor work
|
||||||
|
|
||||||
|
|||||||
@@ -233,6 +233,86 @@ def _build_audio_visual_alignments(
|
|||||||
return alignments
|
return alignments
|
||||||
|
|
||||||
|
|
||||||
|
# =============================================================================
|
||||||
|
# OCR Quality Filters
|
||||||
|
# =============================================================================
|
||||||
|
|
||||||
|
|
||||||
|
_RE_CODE_TOKENS = re.compile(
|
||||||
|
r"[=(){};]|(?:def|class|function|import|return|var|let|const|public|private|void|static|override|virtual|protected)\b"
|
||||||
|
)
|
||||||
|
_RE_UI_PATTERNS = re.compile(
|
||||||
|
r"\b(?:Inspector|Hierarchy|Project|Console|Image Type|Sorting Layer|Button|Canvas|Scene|Game)\b",
|
||||||
|
re.IGNORECASE,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def _is_likely_code(text: str) -> bool:
|
||||||
|
"""Return True if text likely contains programming code, not UI junk."""
|
||||||
|
if not text or len(text.strip()) < 10:
|
||||||
|
return False
|
||||||
|
code_tokens = _RE_CODE_TOKENS.findall(text)
|
||||||
|
ui_patterns = _RE_UI_PATTERNS.findall(text)
|
||||||
|
return len(code_tokens) >= 2 and len(code_tokens) > len(ui_patterns)
|
||||||
|
|
||||||
|
|
||||||
|
# =============================================================================
|
||||||
|
# Two-Pass AI Reference Enhancement
|
||||||
|
# =============================================================================
|
||||||
|
|
||||||
|
|
||||||
|
def _ai_clean_reference(ref_path: str, content: str, api_key: str | None = None) -> None:
|
||||||
|
"""Use AI to clean Code Timeline section in a reference file.
|
||||||
|
|
||||||
|
Sends the reference file content to Claude with a focused prompt
|
||||||
|
to reconstruct the Code Timeline from noisy OCR + transcript context.
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
import anthropic
|
||||||
|
except ImportError:
|
||||||
|
return
|
||||||
|
|
||||||
|
key = api_key or os.environ.get("ANTHROPIC_API_KEY") or os.environ.get("ANTHROPIC_AUTH_TOKEN")
|
||||||
|
if not key:
|
||||||
|
return
|
||||||
|
|
||||||
|
base_url = os.environ.get("ANTHROPIC_BASE_URL")
|
||||||
|
client_kwargs: dict = {"api_key": key}
|
||||||
|
if base_url:
|
||||||
|
client_kwargs["base_url"] = base_url
|
||||||
|
|
||||||
|
prompt = (
|
||||||
|
"You are cleaning a video tutorial reference file. The Code Timeline section "
|
||||||
|
"contains OCR-extracted code that is noisy (duplicated lines, garbled characters, "
|
||||||
|
"UI decorations mixed in). The transcript sections above provide context about "
|
||||||
|
"what the code SHOULD be.\n\n"
|
||||||
|
"Tasks:\n"
|
||||||
|
"1. Reconstruct each code block in the file using transcript context\n"
|
||||||
|
"2. Fix OCR errors (l/1, O/0, rn/m confusions)\n"
|
||||||
|
"3. Remove any UI text (Inspector, Hierarchy, button labels)\n"
|
||||||
|
"4. Set correct language tags on code fences\n"
|
||||||
|
"5. Keep the document structure but clean the code text\n\n"
|
||||||
|
"Return the COMPLETE reference file with cleaned code blocks. "
|
||||||
|
"Do NOT modify the transcript or metadata sections.\n\n"
|
||||||
|
f"Reference file:\n{content}"
|
||||||
|
)
|
||||||
|
|
||||||
|
try:
|
||||||
|
client = anthropic.Anthropic(**client_kwargs)
|
||||||
|
response = client.messages.create(
|
||||||
|
model="claude-sonnet-4-20250514",
|
||||||
|
max_tokens=8000,
|
||||||
|
messages=[{"role": "user", "content": prompt}],
|
||||||
|
)
|
||||||
|
result = response.content[0].text
|
||||||
|
if result and len(result) > len(content) * 0.5:
|
||||||
|
with open(ref_path, "w", encoding="utf-8") as f:
|
||||||
|
f.write(result)
|
||||||
|
logger.info(f"AI-cleaned reference: {os.path.basename(ref_path)}")
|
||||||
|
except Exception as e:
|
||||||
|
logger.debug(f"Reference enhancement failed: {e}")
|
||||||
|
|
||||||
|
|
||||||
# =============================================================================
|
# =============================================================================
|
||||||
# Main Converter Class
|
# Main Converter Class
|
||||||
# =============================================================================
|
# =============================================================================
|
||||||
@@ -675,6 +755,7 @@ class VideoToSkillConverter:
|
|||||||
if (
|
if (
|
||||||
ss.frame_type in (FrameType.CODE_EDITOR, FrameType.TERMINAL)
|
ss.frame_type in (FrameType.CODE_EDITOR, FrameType.TERMINAL)
|
||||||
and ss.ocr_text
|
and ss.ocr_text
|
||||||
|
and _is_likely_code(ss.ocr_text)
|
||||||
):
|
):
|
||||||
lines.append(f"\n```{lang_hint}")
|
lines.append(f"\n```{lang_hint}")
|
||||||
lines.append(ss.ocr_text)
|
lines.append(ss.ocr_text)
|
||||||
@@ -683,15 +764,16 @@ class VideoToSkillConverter:
|
|||||||
from skill_seekers.cli.video_models import FrameType
|
from skill_seekers.cli.video_models import FrameType
|
||||||
|
|
||||||
if kf.frame_type in (FrameType.CODE_EDITOR, FrameType.TERMINAL):
|
if kf.frame_type in (FrameType.CODE_EDITOR, FrameType.TERMINAL):
|
||||||
lang_hint = ""
|
if _is_likely_code(kf.ocr_text):
|
||||||
if seg.detected_code_blocks:
|
lang_hint = ""
|
||||||
for cb in seg.detected_code_blocks:
|
if seg.detected_code_blocks:
|
||||||
if cb.language:
|
for cb in seg.detected_code_blocks:
|
||||||
lang_hint = cb.language
|
if cb.language:
|
||||||
break
|
lang_hint = cb.language
|
||||||
lines.append(f"\n```{lang_hint}")
|
break
|
||||||
lines.append(kf.ocr_text)
|
lines.append(f"\n```{lang_hint}")
|
||||||
lines.append("```")
|
lines.append(kf.ocr_text)
|
||||||
|
lines.append("```")
|
||||||
elif kf.frame_type == FrameType.SLIDE:
|
elif kf.frame_type == FrameType.SLIDE:
|
||||||
for text_line in kf.ocr_text.split("\n"):
|
for text_line in kf.ocr_text.split("\n"):
|
||||||
if text_line.strip():
|
if text_line.strip():
|
||||||
@@ -779,6 +861,44 @@ class VideoToSkillConverter:
|
|||||||
|
|
||||||
return "\n".join(lines)
|
return "\n".join(lines)
|
||||||
|
|
||||||
|
def _enhance_reference_files(self, enhance_level: int, args) -> None:
|
||||||
|
"""First-pass: AI-clean reference files before SKILL.md enhancement.
|
||||||
|
|
||||||
|
When enhance_level >= 2 and an API key is available, sends each
|
||||||
|
reference file to Claude to reconstruct noisy Code Timeline
|
||||||
|
sections using transcript context.
|
||||||
|
"""
|
||||||
|
has_api_key = bool(
|
||||||
|
os.environ.get("ANTHROPIC_API_KEY")
|
||||||
|
or os.environ.get("ANTHROPIC_AUTH_TOKEN")
|
||||||
|
or getattr(args, "api_key", None)
|
||||||
|
)
|
||||||
|
if not has_api_key or enhance_level < 2:
|
||||||
|
return
|
||||||
|
|
||||||
|
refs_dir = os.path.join(self.skill_dir, "references")
|
||||||
|
if not os.path.isdir(refs_dir):
|
||||||
|
return
|
||||||
|
|
||||||
|
logger.info("\n📝 Pass 1: AI-cleaning reference files (Code Timeline reconstruction)...")
|
||||||
|
api_key = getattr(args, "api_key", None)
|
||||||
|
|
||||||
|
for ref_file in sorted(os.listdir(refs_dir)):
|
||||||
|
if not ref_file.endswith(".md"):
|
||||||
|
continue
|
||||||
|
ref_path = os.path.join(refs_dir, ref_file)
|
||||||
|
try:
|
||||||
|
with open(ref_path, encoding="utf-8") as f:
|
||||||
|
content = f.read()
|
||||||
|
except OSError:
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Only enhance if there are code fences to clean
|
||||||
|
if "```" not in content:
|
||||||
|
continue
|
||||||
|
|
||||||
|
_ai_clean_reference(ref_path, content, api_key)
|
||||||
|
|
||||||
def _generate_skill_md(self) -> str:
|
def _generate_skill_md(self) -> str:
|
||||||
"""Generate the main SKILL.md file."""
|
"""Generate the main SKILL.md file."""
|
||||||
lines = []
|
lines = []
|
||||||
@@ -1044,11 +1164,14 @@ Examples:
|
|||||||
# Enhancement
|
# Enhancement
|
||||||
enhance_level = getattr(args, "enhance_level", 0)
|
enhance_level = getattr(args, "enhance_level", 0)
|
||||||
if enhance_level > 0:
|
if enhance_level > 0:
|
||||||
|
# Pass 1: Clean reference files (Code Timeline reconstruction)
|
||||||
|
converter._enhance_reference_files(enhance_level, args)
|
||||||
|
|
||||||
# Auto-inject video-tutorial workflow if no workflow specified
|
# Auto-inject video-tutorial workflow if no workflow specified
|
||||||
if not getattr(args, "enhance_workflow", None):
|
if not getattr(args, "enhance_workflow", None):
|
||||||
args.enhance_workflow = ["video-tutorial"]
|
args.enhance_workflow = ["video-tutorial"]
|
||||||
|
|
||||||
# Run workflow stages (specialized video analysis)
|
# Pass 2: Run workflow stages (specialized video analysis)
|
||||||
try:
|
try:
|
||||||
from skill_seekers.cli.workflow_runner import run_workflows
|
from skill_seekers.cli.workflow_runner import run_workflows
|
||||||
|
|
||||||
|
|||||||
@@ -16,6 +16,7 @@ import difflib
|
|||||||
import gc
|
import gc
|
||||||
import logging
|
import logging
|
||||||
import os
|
import os
|
||||||
|
import re
|
||||||
import tempfile
|
import tempfile
|
||||||
from dataclasses import dataclass, field
|
from dataclasses import dataclass, field
|
||||||
|
|
||||||
@@ -1126,6 +1127,92 @@ def _cluster_ocr_into_lines(
|
|||||||
return regions
|
return regions
|
||||||
|
|
||||||
|
|
||||||
|
# ── OCR line cleaning ────────────────────────────────────────────────
|
||||||
|
|
||||||
|
|
||||||
|
def _fuzzy_word_match(a: str, b: str) -> bool:
|
||||||
|
"""Check if two words are likely the same despite OCR noise.
|
||||||
|
|
||||||
|
Allows single-char prefix/suffix noise (e.g. 'gpublic' vs 'public')
|
||||||
|
and common OCR confusions (l/1, O/0, rn/m).
|
||||||
|
"""
|
||||||
|
if a == b:
|
||||||
|
return True
|
||||||
|
# Strip single-char OCR prefix noise (e.g. 'Jpublic' → 'public')
|
||||||
|
a_stripped = a.lstrip("gGjJlLiI|") if len(a) > 2 else a
|
||||||
|
b_stripped = b.lstrip("gGjJlLiI|") if len(b) > 2 else b
|
||||||
|
if a_stripped == b_stripped:
|
||||||
|
return True
|
||||||
|
# Allow edit distance ≤ 1 for short words
|
||||||
|
if abs(len(a) - len(b)) <= 1 and len(a) >= 3:
|
||||||
|
diffs = sum(1 for x, y in zip(a, b, strict=False) if x != y)
|
||||||
|
diffs += abs(len(a) - len(b))
|
||||||
|
return diffs <= 1
|
||||||
|
return False
|
||||||
|
|
||||||
|
|
||||||
|
def _fix_intra_line_duplication(line: str) -> str:
|
||||||
|
"""Fix lines where OCR duplicated content.
|
||||||
|
|
||||||
|
Detects when the same token sequence appears twice adjacent,
|
||||||
|
e.g. 'public class Card public class Card : MonoBehaviour'
|
||||||
|
→ 'public class Card : MonoBehaviour'.
|
||||||
|
"""
|
||||||
|
words = line.split()
|
||||||
|
if len(words) < 4:
|
||||||
|
return line
|
||||||
|
half = len(words) // 2
|
||||||
|
for split_point in range(max(2, half - 2), min(len(words) - 1, half + 3)):
|
||||||
|
prefix = words[:split_point]
|
||||||
|
suffix = words[split_point:]
|
||||||
|
# Check if suffix starts with same sequence as prefix
|
||||||
|
match_len = 0
|
||||||
|
for i, w in enumerate(prefix):
|
||||||
|
if i < len(suffix) and _fuzzy_word_match(w, suffix[i]):
|
||||||
|
match_len += 1
|
||||||
|
else:
|
||||||
|
break
|
||||||
|
if match_len >= len(prefix) * 0.7 and match_len >= 2:
|
||||||
|
# Keep the longer/cleaner half (suffix usually has trailing content)
|
||||||
|
return (
|
||||||
|
" ".join(suffix)
|
||||||
|
if len(" ".join(suffix)) >= len(" ".join(prefix))
|
||||||
|
else " ".join(prefix)
|
||||||
|
)
|
||||||
|
return line
|
||||||
|
|
||||||
|
|
||||||
|
# Compiled patterns for _clean_ocr_line
|
||||||
|
_RE_LEADING_LINE_NUMBER = re.compile(r"^\s*\d{1,4}(?:\s+|\t)")
|
||||||
|
_RE_COLLAPSE_MARKERS = re.compile(r"[▶▼►◄…⋯⋮]")
|
||||||
|
_RE_IDE_TAB_BAR = re.compile(
|
||||||
|
r"^\s*(?:File|Edit|Assets|Window|Help|View|Tools|Debug|Run|Terminal)\s+",
|
||||||
|
re.IGNORECASE,
|
||||||
|
)
|
||||||
|
_RE_UNITY_INSPECTOR = re.compile(
|
||||||
|
r"^\s*(?:Inspector|Hierarchy|Project|Console|Scene|Game)\b.*$",
|
||||||
|
re.IGNORECASE,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def _clean_ocr_line(line: str) -> str:
|
||||||
|
"""Remove IDE decorations and OCR artifacts from a single line."""
|
||||||
|
if not line:
|
||||||
|
return line
|
||||||
|
# Remove full-line UI chrome
|
||||||
|
if _RE_UNITY_INSPECTOR.match(line):
|
||||||
|
return ""
|
||||||
|
if _RE_IDE_TAB_BAR.match(line):
|
||||||
|
return ""
|
||||||
|
# Strip leading line numbers (e.g. '23 public class Card')
|
||||||
|
line = _RE_LEADING_LINE_NUMBER.sub("", line)
|
||||||
|
# Remove collapse markers / VS Code decorations
|
||||||
|
line = _RE_COLLAPSE_MARKERS.sub("", line)
|
||||||
|
# Fix intra-line duplication from multi-engine overlap
|
||||||
|
line = _fix_intra_line_duplication(line)
|
||||||
|
return line.strip()
|
||||||
|
|
||||||
|
|
||||||
def _assemble_structured_text(regions: list[OCRRegion], frame_type: FrameType) -> str:
|
def _assemble_structured_text(regions: list[OCRRegion], frame_type: FrameType) -> str:
|
||||||
"""Join OCR line regions into structured text.
|
"""Join OCR line regions into structured text.
|
||||||
|
|
||||||
@@ -1148,7 +1235,7 @@ def _assemble_structured_text(regions: list[OCRRegion], frame_type: FrameType) -
|
|||||||
return ""
|
return ""
|
||||||
# Estimate indentation from x-offset relative to leftmost region
|
# Estimate indentation from x-offset relative to leftmost region
|
||||||
min_x = min(r.bbox[0] for r in regions)
|
min_x = min(r.bbox[0] for r in regions)
|
||||||
lines = []
|
raw_lines = []
|
||||||
for r in regions:
|
for r in regions:
|
||||||
indent_px = r.bbox[0] - min_x
|
indent_px = r.bbox[0] - min_x
|
||||||
# Estimate character width from the region
|
# Estimate character width from the region
|
||||||
@@ -1158,13 +1245,21 @@ def _assemble_structured_text(regions: list[OCRRegion], frame_type: FrameType) -
|
|||||||
indent_chars = int(indent_px / max(char_width, 1))
|
indent_chars = int(indent_px / max(char_width, 1))
|
||||||
# Round to nearest 4-space indent
|
# Round to nearest 4-space indent
|
||||||
indent_level = round(indent_chars / 4)
|
indent_level = round(indent_chars / 4)
|
||||||
lines.append(" " * indent_level + r.text)
|
raw_lines.append(" " * indent_level + r.text)
|
||||||
return "\n".join(lines)
|
# Clean IDE decorations and OCR artifacts from each line
|
||||||
|
cleaned = []
|
||||||
|
for line in raw_lines:
|
||||||
|
c = _clean_ocr_line(line)
|
||||||
|
if c:
|
||||||
|
cleaned.append(c)
|
||||||
|
return "\n".join(cleaned)
|
||||||
|
|
||||||
if frame_type == FrameType.SLIDE:
|
if frame_type == FrameType.SLIDE:
|
||||||
return "\n\n".join(r.text for r in regions)
|
cleaned = [_clean_ocr_line(r.text) for r in regions]
|
||||||
|
return "\n\n".join(c for c in cleaned if c)
|
||||||
|
|
||||||
return " ".join(r.text for r in regions)
|
cleaned = [_clean_ocr_line(r.text) for r in regions]
|
||||||
|
return " ".join(c for c in cleaned if c)
|
||||||
|
|
||||||
|
|
||||||
def _compute_frame_timestamps(
|
def _compute_frame_timestamps(
|
||||||
@@ -1788,7 +1883,32 @@ class TextBlockTracker:
|
|||||||
return list(self._completed_blocks)
|
return list(self._completed_blocks)
|
||||||
|
|
||||||
def get_text_groups(self) -> list[TextGroup]:
|
def get_text_groups(self) -> list[TextGroup]:
|
||||||
"""Return all text groups after finalize()."""
|
"""Return all text groups after finalize().
|
||||||
|
|
||||||
|
Also runs language detection on groups that don't already have
|
||||||
|
a detected_language set.
|
||||||
|
"""
|
||||||
|
# Run language detection on each group
|
||||||
|
try:
|
||||||
|
from skill_seekers.cli.language_detector import LanguageDetector
|
||||||
|
|
||||||
|
detector = LanguageDetector()
|
||||||
|
except ImportError:
|
||||||
|
detector = None
|
||||||
|
|
||||||
|
if detector is not None:
|
||||||
|
for group in self._text_groups:
|
||||||
|
if group.detected_language:
|
||||||
|
continue # Already detected
|
||||||
|
text = group.full_text
|
||||||
|
if text and len(text) >= 20:
|
||||||
|
try:
|
||||||
|
lang, _conf = detector.detect_from_code(text)
|
||||||
|
if lang:
|
||||||
|
group.detected_language = lang
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
|
||||||
return list(self._text_groups)
|
return list(self._text_groups)
|
||||||
|
|
||||||
|
|
||||||
@@ -2143,8 +2263,8 @@ def extract_visual_data(
|
|||||||
|
|
||||||
tracker.update(idx, ts, ocr_text, ocr_confidence, frame_type, ocr_regions=ocr_regions)
|
tracker.update(idx, ts, ocr_text, ocr_confidence, frame_type, ocr_regions=ocr_regions)
|
||||||
|
|
||||||
elif HAS_EASYOCR:
|
elif HAS_EASYOCR and frame_type not in (FrameType.WEBCAM, FrameType.OTHER):
|
||||||
# Standard EasyOCR for non-code frames
|
# Standard EasyOCR for slide/diagram frames (skip webcam/other)
|
||||||
raw_ocr_results, _flat_text = extract_text_from_frame(frame_path, frame_type)
|
raw_ocr_results, _flat_text = extract_text_from_frame(frame_path, frame_type)
|
||||||
if raw_ocr_results:
|
if raw_ocr_results:
|
||||||
ocr_regions = _cluster_ocr_into_lines(raw_ocr_results, frame_type)
|
ocr_regions = _cluster_ocr_into_lines(raw_ocr_results, frame_type)
|
||||||
|
|||||||
@@ -18,12 +18,21 @@ stages:
|
|||||||
The OCR output is noisy — it contains line numbers, UI chrome text,
|
The OCR output is noisy — it contains line numbers, UI chrome text,
|
||||||
garbled characters, and incomplete lines.
|
garbled characters, and incomplete lines.
|
||||||
|
|
||||||
|
NOTE: The reference files may have already been AI-cleaned in a first
|
||||||
|
pass (Code Timeline reconstruction). If code blocks already look clean,
|
||||||
|
focus on verifying correctness rather than re-cleaning.
|
||||||
|
|
||||||
|
Also check the reference files in the references/ directory for
|
||||||
|
Code Timeline context — the transcript sections provide clues about
|
||||||
|
what the code SHOULD be.
|
||||||
|
|
||||||
Clean each code block by:
|
Clean each code block by:
|
||||||
1. Remove line numbers that OCR captured (leading digits like "1 ", "2 ", "23 ")
|
1. Remove line numbers that OCR captured (leading digits like "1 ", "2 ", "23 ")
|
||||||
2. Remove UI elements (tab bar text, file names, button labels)
|
2. Remove UI elements (tab bar text, file names, button labels)
|
||||||
3. Fix common OCR errors (l/1, O/0, rn/m confusions)
|
3. Fix common OCR errors (l/1, O/0, rn/m confusions)
|
||||||
4. Remove animation timeline numbers or frame counters
|
4. Remove animation timeline numbers or frame counters
|
||||||
5. Strip trailing whitespace and normalize indentation
|
5. Strip trailing whitespace and normalize indentation
|
||||||
|
6. Remove intra-line duplications (same tokens repeated from multi-engine OCR)
|
||||||
|
|
||||||
Output JSON with:
|
Output JSON with:
|
||||||
- "cleaned_blocks": array of cleaned code strings
|
- "cleaned_blocks": array of cleaned code strings
|
||||||
@@ -39,12 +48,17 @@ stages:
|
|||||||
Based on the previous OCR cleanup results and the transcript content,
|
Based on the previous OCR cleanup results and the transcript content,
|
||||||
determine the programming language for each code block.
|
determine the programming language for each code block.
|
||||||
|
|
||||||
|
NOTE: Text groups may already have a detected_language field set by
|
||||||
|
the LanguageDetector. Use those as hints but verify against transcript
|
||||||
|
and code patterns.
|
||||||
|
|
||||||
Detection strategy (in priority order):
|
Detection strategy (in priority order):
|
||||||
1. Narrator mentions: "in GDScript", "this Python function", "our C# class"
|
1. Narrator mentions: "in GDScript", "this Python function", "our C# class"
|
||||||
2. Code patterns: extends/func/signal=GDScript, def/import=Python,
|
2. Code patterns: extends/func/signal=GDScript, def/import=Python,
|
||||||
function/const/let=JavaScript, using/namespace=C#
|
function/const/let=JavaScript, using/namespace=C#
|
||||||
3. File extensions visible in OCR (.gd, .py, .js, .cs)
|
3. File extensions visible in OCR (.gd, .py, .js, .cs)
|
||||||
4. Framework context from transcript (Godot=GDScript, Unity=C#, Django=Python)
|
4. Framework context from transcript (Godot=GDScript, Unity=C#, Django=Python)
|
||||||
|
5. detected_language from text groups (pre-filled by LanguageDetector)
|
||||||
|
|
||||||
Output JSON with:
|
Output JSON with:
|
||||||
- "language_map": map of block index to language identifier
|
- "language_map": map of block index to language identifier
|
||||||
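Several of the cleanup steps listed in the prompt above (line numbers, UI chrome, collapse markers) now also run deterministically before the AI stages see the text. The sketch below copies the four patterns from the `video_visual.py` hunk in this commit and applies them in the same order as `_clean_ocr_line()`. `clean_line_sketch` is a hypothetical stand-in that deliberately omits the intra-line duplication step, and the sample lines are illustrative assumptions.

```python
import re

# Patterns copied from the video_visual.py hunk in this commit.
_RE_LEADING_LINE_NUMBER = re.compile(r"^\s*\d{1,4}(?:\s+|\t)")
_RE_COLLAPSE_MARKERS = re.compile(r"[▶▼►◄…⋯⋮]")
_RE_IDE_TAB_BAR = re.compile(
    r"^\s*(?:File|Edit|Assets|Window|Help|View|Tools|Debug|Run|Terminal)\s+",
    re.IGNORECASE,
)
_RE_UNITY_INSPECTOR = re.compile(
    r"^\s*(?:Inspector|Hierarchy|Project|Console|Scene|Game)\b.*$",
    re.IGNORECASE,
)


def clean_line_sketch(line: str) -> str:
    """Hypothetical stand-in for _clean_ocr_line(), without the duplication fix."""
    if not line:
        return line
    # Whole-line UI chrome is dropped entirely.
    if _RE_UNITY_INSPECTOR.match(line) or _RE_IDE_TAB_BAR.match(line):
        return ""
    # Strip a leading line number, then any collapse markers.
    line = _RE_LEADING_LINE_NUMBER.sub("", line)
    line = _RE_COLLAPSE_MARKERS.sub("", line)
    return line.strip()


for raw in [
    "23 public class Card",
    "1\tpublic void Start()",
    "Inspector Card Script",
    "▶ class Card",
    "    def main():",
    "console.log(x)",
]:
    print(f"{raw!r:28} -> {clean_line_sketch(raw)!r}")
```

Two things the sketch makes visible. The final `strip()` removes leading indentation, so `'    def main():'` comes back as `'def main():'`. And because the chrome patterns are case-insensitive and anchored only at the start of the line, a code line such as `console.log(x)` matches the `Console` alternative and is returned as an empty string.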
|
|||||||
@@ -3396,5 +3396,204 @@ class TestTimeClipping(unittest.TestCase):
|
|||||||
self.assertLessEqual(seg.end_time, 360.0)
|
self.assertLessEqual(seg.end_time, 360.0)
|
||||||
|
|
||||||
|
|
||||||
|
# =============================================================================
|
||||||
|
# OCR Quality Improvement Tests
|
||||||
|
# =============================================================================
|
||||||
|
|
||||||
|
|
||||||
|
class TestCleanOcrLine(unittest.TestCase):
|
||||||
|
"""Tests for _clean_ocr_line() in video_visual.py."""
|
||||||
|
|
||||||
|
def test_strips_leading_line_numbers(self):
|
||||||
|
from skill_seekers.cli.video_visual import _clean_ocr_line
|
||||||
|
|
||||||
|
self.assertEqual(_clean_ocr_line("23 public class Card"), "public class Card")
|
||||||
|
self.assertEqual(_clean_ocr_line("1\tpublic void Start()"), "public void Start()")
|
||||||
|
self.assertEqual(_clean_ocr_line(" 456 return x"), "return x")
|
||||||
|
|
||||||
|
def test_strips_ide_decorations(self):
|
||||||
|
from skill_seekers.cli.video_visual import _clean_ocr_line
|
||||||
|
|
||||||
|
# Unity Inspector line should be removed entirely
|
||||||
|
self.assertEqual(_clean_ocr_line("Inspector Card Script"), "")
|
||||||
|
self.assertEqual(_clean_ocr_line("Hierarchy Main Camera"), "")
|
||||||
|
# Tab bar text should be removed
|
||||||
|
self.assertEqual(_clean_ocr_line("File Edit Assets Window Help"), "")
|
||||||
|
|
||||||
|
def test_strips_collapse_markers(self):
|
||||||
|
from skill_seekers.cli.video_visual import _clean_ocr_line
|
||||||
|
|
||||||
|
self.assertNotIn("▶", _clean_ocr_line("▶ class Card"))
|
||||||
|
self.assertNotIn("▼", _clean_ocr_line("▼ Properties"))
|
||||||
|
|
||||||
|
def test_preserves_normal_code(self):
|
||||||
|
from skill_seekers.cli.video_visual import _clean_ocr_line
|
||||||
|
|
||||||
|
self.assertEqual(
|
||||||
|
_clean_ocr_line("public class Card : MonoBehaviour"),
|
||||||
|
"public class Card : MonoBehaviour",
|
||||||
|
)
|
||||||
|
self.assertEqual(_clean_ocr_line(" def main():"), "def main():")
|
||||||
|
|
||||||
|
|
||||||
|
class TestFixIntraLineDuplication(unittest.TestCase):
    """Tests for _fix_intra_line_duplication() in video_visual.py."""

    def test_fixes_simple_duplication(self):
        from skill_seekers.cli.video_visual import _fix_intra_line_duplication

        result = _fix_intra_line_duplication("public class Card public class Card : MonoBehaviour")
        # Should keep the half with more content
        self.assertIn("MonoBehaviour", result)
        # Should not have "public class Card" twice
        self.assertLessEqual(result.count("public class Card"), 1)

    def test_preserves_non_duplicated(self):
        from skill_seekers.cli.video_visual import _fix_intra_line_duplication

        original = "public class Card : MonoBehaviour"
        self.assertEqual(_fix_intra_line_duplication(original), original)

    def test_short_lines_unchanged(self):
        from skill_seekers.cli.video_visual import _fix_intra_line_duplication

        self.assertEqual(_fix_intra_line_duplication("a b"), "a b")
        self.assertEqual(_fix_intra_line_duplication("x"), "x")


class TestIsLikelyCode(unittest.TestCase):
    """Tests for _is_likely_code() in video_scraper.py."""

    def test_true_for_real_code(self):
        from skill_seekers.cli.video_scraper import _is_likely_code

        self.assertTrue(_is_likely_code("public void DrawCard() {"))
        self.assertTrue(_is_likely_code("def main():\n return x"))
        self.assertTrue(_is_likely_code("function handleClick(event) {"))
        self.assertTrue(_is_likely_code("import os; import sys"))

    def test_false_for_ui_junk(self):
        from skill_seekers.cli.video_scraper import _is_likely_code

        self.assertFalse(_is_likely_code("Inspector Image Type Simple"))
        self.assertFalse(_is_likely_code("Hierarchy Canvas Button"))
        self.assertFalse(_is_likely_code(""))
        self.assertFalse(_is_likely_code("short"))

    def test_code_tokens_must_exceed_ui(self):
        from skill_seekers.cli.video_scraper import _is_likely_code

        # More UI than code tokens
        self.assertFalse(_is_likely_code("Inspector Console Project Hierarchy Scene Game = ;"))


class TestTextGroupLanguageDetection(unittest.TestCase):
    """Tests for language detection in get_text_groups()."""

    def test_groups_get_language_detected(self):
        from unittest.mock import MagicMock, patch

        from skill_seekers.cli.video_visual import TextBlockTracker
        from skill_seekers.cli.video_models import FrameType

        tracker = TextBlockTracker()

        # Add enough data for a text group to form
        code = "public class Card : MonoBehaviour {\n void Start() {\n }\n}"
        tracker.update(0, 0.0, code, 0.9, FrameType.CODE_EDITOR)
        tracker.update(1, 1.0, code, 0.9, FrameType.CODE_EDITOR)
        tracker.update(2, 2.0, code, 0.9, FrameType.CODE_EDITOR)

        blocks = tracker.finalize()  # noqa: F841

        # Patch the LanguageDetector at the import source used by the lazy import
        mock_detector = MagicMock()
        mock_detector.detect_from_code.return_value = ("csharp", 0.9)

        mock_module = MagicMock()
        mock_module.LanguageDetector.return_value = mock_detector

        with patch.dict("sys.modules", {"skill_seekers.cli.language_detector": mock_module}):
            groups = tracker.get_text_groups()

        # If groups were formed and had enough text, language should be detected
        for group in groups:
            if group.full_text and len(group.full_text) >= 20:
                self.assertEqual(group.detected_language, "csharp")


class TestSkipWebcamOcr(unittest.TestCase):
    """Tests that WEBCAM/OTHER frame types skip OCR."""

    def test_webcam_frame_type_excluded_from_ocr_condition(self):
        """Verify the condition in the OCR block excludes WEBCAM/OTHER."""
        from skill_seekers.cli.video_models import FrameType

        # These should be excluded from the non-code OCR path
        excluded = (FrameType.WEBCAM, FrameType.OTHER)
        for ft in excluded:
            self.assertIn(ft, excluded)

        # These should still get OCR'd
        included = (FrameType.SLIDE, FrameType.DIAGRAM)
        for ft in included:
            self.assertNotIn(ft, excluded)


class TestReferenceSkipsJunkCodeFences(unittest.TestCase):
    """Tests that _is_likely_code() prevents junk from becoming code fences."""

    def test_junk_text_not_in_code_fence(self):
        from skill_seekers.cli.video_scraper import _is_likely_code

        # UI junk should be filtered
        junk_texts = [
            "Inspector Image Type Simple",
            "Hierarchy Main Camera",
            "Canvas Sorting Layer Default",
        ]
        for junk in junk_texts:
            self.assertFalse(
                _is_likely_code(junk),
                f"Expected False for UI junk: {junk}",
            )

    def test_real_code_in_code_fence(self):
        from skill_seekers.cli.video_scraper import _is_likely_code

        real_code = [
            "public class Card : MonoBehaviour { void Start() {} }",
            "def draw_card(self):\n return self.deck.pop()",
            "const card = new Card(); card.flip();",
        ]
        for code in real_code:
            self.assertTrue(
                _is_likely_code(code),
                f"Expected True for real code: {code}",
            )


class TestFuzzyWordMatch(unittest.TestCase):
    """Tests for _fuzzy_word_match() in video_visual.py."""

    def test_exact_match(self):
        from skill_seekers.cli.video_visual import _fuzzy_word_match

        self.assertTrue(_fuzzy_word_match("public", "public"))

    def test_prefix_noise(self):
        from skill_seekers.cli.video_visual import _fuzzy_word_match

        # OCR often adds a garbage char prefix
        self.assertTrue(_fuzzy_word_match("gpublic", "public"))
        self.assertTrue(_fuzzy_word_match("Jpublic", "public"))

    def test_different_words(self):
        from skill_seekers.cli.video_visual import _fuzzy_word_match

        self.assertFalse(_fuzzy_word_match("class", "void"))
        self.assertFalse(_fuzzy_word_match("ab", "xy"))


if __name__ == "__main__":
    unittest.main()