Workflow Guide

Detailed step-by-step workflows for transcript correction and management.

Pre-Flight Checklist

Before running corrections, verify these prerequisites:

Initial Setup

  • Initialized with uv run scripts/fix_transcription.py --init
  • Database exists at ~/.transcript-fixer/corrections.db
  • GLM_API_KEY environment variable set (run echo $GLM_API_KEY)
  • Configuration validated (run --validate)

File Preparation

  • Input file exists and is readable
  • File uses supported format (.md, .txt)
  • File encoding is UTF-8
  • File size is reasonable (<10MB for first runs)
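
The file checks above can be automated. A minimal pre-flight sketch (the `preflight` helper name and the 10 MB limit are illustrative, not part of the toolkit):

```python
from pathlib import Path

SUPPORTED = {".md", ".txt"}
MAX_BYTES = 10 * 1024 * 1024  # ~10 MB guideline for first runs

def preflight(path: str) -> list[str]:
    """Return a list of problems; an empty list means the file passes."""
    p = Path(path)
    if not p.is_file():
        return [f"{path}: file not found or not readable"]
    problems = []
    if p.suffix.lower() not in SUPPORTED:
        problems.append(f"{path}: unsupported format {p.suffix}")
    if p.stat().st_size > MAX_BYTES:
        problems.append(f"{path}: larger than 10 MB")
    try:
        p.read_text(encoding="utf-8")
    except UnicodeDecodeError:
        problems.append(f"{path}: not valid UTF-8")
    return problems
```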

Execution Parameters

  • Using --stage 3 for full pipeline (or specific stage if testing)
  • Domain specified with --domain if using specialized dictionaries
  • Using --merge flag when importing team corrections

Environment

  • Sufficient disk space for output files (~2x input size)
  • API quota available for Stage 2 corrections
  • Network connectivity for API calls

Quick validation:

uv run scripts/fix_transcription.py --validate && echo $GLM_API_KEY

Core Workflows

1. First-Time Correction

Goal: Correct a transcript for the first time.

Steps:

  1. Initialize (if not done):

    uv run scripts/fix_transcription.py --init
    export GLM_API_KEY="your-key"
    
  2. Add initial corrections (5-10 common errors):

    uv run scripts/fix_transcription.py --add "常见错误1" "正确词1" --domain general  # "common error 1" → "correct word 1"
    uv run scripts/fix_transcription.py --add "常见错误2" "正确词2" --domain general  # "common error 2" → "correct word 2"
    
  3. Test on small sample (Stage 1 only):

    uv run scripts/fix_transcription.py --input sample.md --stage 1
    less sample_stage1.md  # Review output
    
  4. Run full pipeline:

    uv run scripts/fix_transcription.py --input transcript.md --stage 3 --domain general
    
  5. Review outputs:

    # Stage 1: Dictionary corrections
    less transcript_stage1.md
    
    # Stage 2: Final corrected version
    less transcript_stage2.md
    
    # Generate diff report
    uv run scripts/diff_generator.py transcript.md transcript_stage1.md transcript_stage2.md
    

Expected duration:

  • Stage 1: Instant (dictionary lookup)
  • Stage 2: ~1-2 minutes per 1000 lines (API calls)

2. Iterative Improvement

Goal: Improve correction quality over time through learning.

Steps:

  1. Run corrections on 3-5 similar transcripts:

    uv run scripts/fix_transcription.py --input day1.md --stage 3 --domain embodied_ai
    uv run scripts/fix_transcription.py --input day2.md --stage 3 --domain embodied_ai
    uv run scripts/fix_transcription.py --input day3.md --stage 3 --domain embodied_ai
    
  2. Review learned suggestions:

    uv run scripts/fix_transcription.py --review-learned
    

    Output example:

    📚 Learned Suggestions (Pending Review)
    ========================================
    
    1. "巨升方向" → "具身方向"
       Frequency: 5  Confidence: 0.95
       Examples: day1.md (line 45), day2.md (line 23), ...
    
    2. "奇迹创坛" → "奇绩创坛"
       Frequency: 3  Confidence: 0.87
       Examples: day1.md (line 102), day3.md (line 67)
    
  3. Approve high-quality suggestions:

    uv run scripts/fix_transcription.py --approve "巨升方向" "具身方向"
    uv run scripts/fix_transcription.py --approve "奇迹创坛" "奇绩创坛"
    
  4. Verify approved corrections:

    uv run scripts/fix_transcription.py --list --domain embodied_ai | grep "learned"
    
  5. Run next batch (benefits from approved corrections):

    uv run scripts/fix_transcription.py --input day4.md --stage 3 --domain embodied_ai
    

Impact: Approved corrections move to Stage 1 (instant, free).

Cycle: Repeat every 3-5 transcripts for continuous improvement.
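The approval step can be pre-filtered by confidence and frequency. A sketch, assuming the `learned_suggestions` table uses `wrong`/`right` columns for the pair (the `status`, `confidence`, and `frequency` columns appear in the quarterly cleanup queries; `wrong`/`right` are assumed names):

```python
import sqlite3

def promising_suggestions(db_path: str, min_freq: int = 3, min_conf: float = 0.8):
    """List pending learned suggestions worth a manual --approve."""
    con = sqlite3.connect(db_path)
    try:
        return con.execute(
            "SELECT wrong, right, frequency, confidence FROM learned_suggestions "
            "WHERE status = 'pending' AND frequency >= ? AND confidence >= ? "
            "ORDER BY confidence DESC, frequency DESC",
            (min_freq, min_conf),
        ).fetchall()
    finally:
        con.close()
```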

3. Domain-Specific Corrections

Goal: Build specialized dictionaries for different fields.

Steps:

  1. Identify domain:

    • embodied_ai - Robotics, AI terminology
    • finance - Financial terminology
    • medical - Medical terminology
    • general - General-purpose
  2. Add domain-specific terms:

    # Embodied AI domain
    uv run scripts/fix_transcription.py --add "巨升智能" "具身智能" --domain embodied_ai
    uv run scripts/fix_transcription.py --add "机器学习" "机器学习" --domain embodied_ai  # Keep as-is
    
    # Finance domain
    uv run scripts/fix_transcription.py --add "股价" "股价" --domain finance  # Keep as-is
    uv run scripts/fix_transcription.py --add "PE比率" "市盈率" --domain finance
    
  3. Use appropriate domain when correcting:

    # AI meeting transcript
    uv run scripts/fix_transcription.py --input ai_meeting.md --stage 3 --domain embodied_ai
    
    # Financial report transcript
    uv run scripts/fix_transcription.py --input earnings_call.md --stage 3 --domain finance
    
  4. Review domain statistics:

    sqlite3 ~/.transcript-fixer/corrections.db "SELECT * FROM correction_statistics;"
    

Benefits:

  • Prevents cross-domain conflicts
  • Higher accuracy per domain
  • Targeted vocabulary building

4. Team Collaboration

Goal: Share corrections across team members.

Steps:

Setup (One-time per team)

  1. Create shared repository:

    mkdir transcript-corrections
    cd transcript-corrections
    git init
    
    # .gitignore
    printf '*.db\n*.db-journal\n*.bak\n' > .gitignore
    
  2. Export initial corrections:

    uv run scripts/fix_transcription.py --export general.json --domain general
    uv run scripts/fix_transcription.py --export embodied_ai.json --domain embodied_ai
    
    git add *.json
    git commit -m "Initial correction dictionaries"
    git push origin main
    

Daily Workflow

Team Member A (adds new corrections):

# 1. Run corrections
uv run scripts/fix_transcription.py --input transcript.md --stage 3 --domain embodied_ai

# 2. Review and approve learned suggestions
uv run scripts/fix_transcription.py --review-learned
uv run scripts/fix_transcription.py --approve "新错误" "正确词"

# 3. Export updated corrections
uv run scripts/fix_transcription.py --export embodied_ai_$(date +%Y%m%d).json --domain embodied_ai

# 4. Commit and push
git add embodied_ai_*.json
git commit -m "Add embodied AI corrections from today's transcripts"
git push origin main

Team Member B (imports team corrections):

# 1. Pull latest corrections
git pull origin main

# 2. Import with merge
uv run scripts/fix_transcription.py --import embodied_ai_20250128.json --merge

# 3. Verify
uv run scripts/fix_transcription.py --list --domain embodied_ai | tail -10

Conflict resolution: See team_collaboration.md for handling merge conflicts.
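When two members export overlapping dictionaries, a pre-merge conflict check can save a round of git back-and-forth. A sketch, assuming the export is a flat JSON object of wrong→right pairs (the actual export schema is documented in file_formats.md and may differ):

```python
import json

def merge_exports(base_path: str, incoming_path: str, out_path: str) -> dict:
    """Union two exported dictionaries; keep the base value on conflict and report it."""
    with open(base_path, encoding="utf-8") as f:
        base = json.load(f)
    with open(incoming_path, encoding="utf-8") as f:
        incoming = json.load(f)
    # Conflicting targets are reported rather than silently overwritten
    conflicts = {w: (base[w], r) for w, r in incoming.items()
                 if w in base and base[w] != r}
    merged = {**base, **{w: r for w, r in incoming.items() if w not in conflicts}}
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(merged, f, ensure_ascii=False, indent=2)
    return conflicts
```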

5. Stage-by-Stage Execution

Goal: Test dictionary changes without wasting API quota.

Stage 1 Only (Dictionary)

Use when: Testing new corrections, verifying domain setup.

uv run scripts/fix_transcription.py --input file.md --stage 1 --domain general

Output: file_stage1.md with dictionary corrections only.

Review: Check if dictionary corrections are sufficient.

Stage 2 Only (AI)

Use when: Running AI corrections on pre-processed file.

Prerequisites: Stage 1 output exists.

# Stage 1 first
uv run scripts/fix_transcription.py --input file.md --stage 1

# Then Stage 2
uv run scripts/fix_transcription.py --input file_stage1.md --stage 2

Output: file_stage1_stage2.md (confusing naming - use Stage 3 instead).

Stage 3 (Full Pipeline)

Use when: Production runs, full correction workflow.

uv run scripts/fix_transcription.py --input file.md --stage 3 --domain general

Output: Both file_stage1.md and file_stage2.md.

Recommended: Use Stage 3 for most workflows.
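
Conceptually, Stage 1 is a deterministic find-and-replace over the approved dictionary, which is why it is instant and free. A sketch, assuming longest-match-first ordering so that short entries do not shadow longer ones (the real script's matching strategy may differ):

```python
def apply_dictionary(text: str, corrections: dict[str, str]) -> str:
    """Apply wrong→right pairs, longest wrong-form first."""
    for wrong in sorted(corrections, key=len, reverse=True):
        text = text.replace(wrong, corrections[wrong])
    return text
```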

6. Context-Aware Rules

Goal: Handle edge cases with regex patterns.

Use cases:

  • Positional corrections (e.g., "的" vs "地")
  • Multi-word patterns
  • Conditional corrections

Steps:

  1. Identify a pattern that a simple dictionary entry can't handle:

    Problem: "近距离的去看" (wrong - "的" should be "地" before a verb)
    Problem: "近距离的搏杀" (correct - keep "的" before a noun)
    
  2. Add context rules:

    sqlite3 ~/.transcript-fixer/corrections.db
    
    -- Higher priority for specific context
    INSERT INTO context_rules (pattern, replacement, description, priority)
    VALUES ('近距离的去看', '近距离地去看', '的→地 before verb', 10);
    
    -- Lower priority identity rule protects the correct phrase
    INSERT INTO context_rules (pattern, replacement, description, priority)
    VALUES ('近距离的搏杀', '近距离的搏杀', 'Keep 的 for noun modifier', 5);
    
    .quit
    
  3. Test context rules:

    uv run scripts/fix_transcription.py --input test.md --stage 1
    
  4. Validate:

    uv run scripts/fix_transcription.py --validate
    

Priority: Higher numbers run first (use for exceptions/edge cases).
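
The priority semantics above can be sketched as ordered regex substitution: higher-priority rules run first, so a specific exception fires before (and is then untouched by) a general pattern. A sketch only; the real implementation reads rules from the `context_rules` table:

```python
import re

def apply_context_rules(text: str, rules: list[tuple[str, str, int]]) -> str:
    """rules: (pattern, replacement, priority); higher priority is applied first."""
    for pattern, replacement, _prio in sorted(rules, key=lambda r: r[2], reverse=True):
        text = re.sub(pattern, replacement, text)
    return text
```

For example, with a priority-10 rule for "catalog" and a priority-1 rule for "cat", "catalog" is rewritten first, so the general "cat" rule can no longer corrupt it.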

See file_formats.md for context_rules schema.

7. Diff Report Generation

Goal: Visualize all changes for review.

Use when:

  • Reviewing corrections before publishing
  • Training new team members
  • Documenting ASR error patterns

Steps:

  1. Run corrections:

    uv run scripts/fix_transcription.py --input transcript.md --stage 3
    
  2. Generate diff reports:

    uv run scripts/diff_generator.py \
      transcript.md \
      transcript_stage1.md \
      transcript_stage2.md
    
  3. Review outputs:

    # Markdown report (statistics + summary)
    less diff_report.md
    
    # Unified diff (git-style)
    less transcript_unified.diff
    
    # HTML side-by-side (visual review)
    open transcript_sidebyside.html
    
    # Inline markers (for editing)
    less transcript_inline.md
    

Report contents:

  • Total changes count
  • Stage 1 vs Stage 2 breakdown
  • Character/word count changes
  • Side-by-side comparison
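The change counts and character deltas in the report are the kind of statistics the standard-library difflib produces. A sketch of computing them for one stage pair (field names here are illustrative, not the report's actual schema):

```python
import difflib

def diff_stats(original: str, corrected: str) -> dict:
    """Count changed lines and character delta between two transcript versions."""
    sm = difflib.SequenceMatcher(a=original.splitlines(), b=corrected.splitlines())
    changed = sum(max(i2 - i1, j2 - j1)
                  for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != "equal")
    return {
        "changed_lines": changed,
        "char_delta": len(corrected) - len(original),
        "similarity": round(sm.ratio(), 3),
    }
```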

See script_parameters.md for advanced diff options.

8. Workshop Transcript Split + Timestamp Rebase

Goal: Split a long workshop transcript into sections such as pre-class chat, the class itself, and the debrief, then rebase each section's timestamps to start at 00:00:00.

Steps:

  1. Correct the transcript text first (dictionary + AI/manual review)
  2. Pick a marker phrase for each section boundary
  3. Split and rebase:

    uv run scripts/split_transcript_sections.py workshop.txt \
      --first-section-name "课前聊天" \
      --section "正式上课::好,无缝切换嘛。对。那个曹总连上了吗?那个网页。" \
      --section "课后复盘::我们复盘一下。" \
      --rebase-to-zero
    
  4. If the files are already split, rebase a single file directly:

    uv run scripts/fix_transcript_timestamps.py class.txt --in-place --rebase-to-zero
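
The rebase itself is simple arithmetic: subtract the section's earliest timestamp from every timestamp. A sketch covering only the rebase (not the malformed-timestamp repair), assuming HH:MM:SS stamps in the text; the real script may match a different pattern:

```python
import re

TS = re.compile(r"\b(\d{1,2}):([0-5]\d):([0-5]\d)\b")

def to_seconds(h, m, s):
    return int(h) * 3600 + int(m) * 60 + int(s)

def rebase_to_zero(text: str) -> str:
    """Shift all HH:MM:SS timestamps so the earliest becomes 00:00:00."""
    stamps = TS.findall(text)
    if not stamps:
        return text
    base = min(to_seconds(*t) for t in stamps)
    def shift(m):
        total = to_seconds(*m.groups()) - base
        return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"
    return TS.sub(shift, text)
```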

Batch Processing

Process Multiple Files

# Simple loop
for file in meeting_*.md; do
  uv run scripts/fix_transcription.py --input "$file" --stage 3 --domain embodied_ai
done

# With error handling
for file in meeting_*.md; do
  echo "Processing $file..."
  if uv run scripts/fix_transcription.py --input "$file" --stage 3 --domain embodied_ai; then
    echo "✅ $file completed"
  else
    echo "❌ $file failed"
  fi
done

Parallel Processing

# GNU parallel (install: brew install parallel)
ls meeting_*.md | parallel -j 4 \
  "uv run scripts/fix_transcription.py --input {} --stage 3 --domain embodied_ai"

Caution: Monitor API rate limits when processing in parallel.

Maintenance Workflows

Weekly: Review Learning

# Review suggestions
uv run scripts/fix_transcription.py --review-learned

# Approve high-confidence patterns
uv run scripts/fix_transcription.py --approve "错误1" "正确1"
uv run scripts/fix_transcription.py --approve "错误2" "正确2"

Monthly: Export and Backup

# Export all domains
uv run scripts/fix_transcription.py --export general_$(date +%Y%m%d).json --domain general
uv run scripts/fix_transcription.py --export embodied_ai_$(date +%Y%m%d).json --domain embodied_ai

# Backup database
cp ~/.transcript-fixer/corrections.db ~/backups/corrections_$(date +%Y%m%d).db

# Database maintenance
sqlite3 ~/.transcript-fixer/corrections.db "VACUUM; REINDEX; ANALYZE;"

Quarterly: Clean Up

# Archive old history (> 90 days)
sqlite3 ~/.transcript-fixer/corrections.db "
DELETE FROM correction_history
WHERE run_timestamp < datetime('now', '-90 days');
"

# Reject low-confidence suggestions
sqlite3 ~/.transcript-fixer/corrections.db "
UPDATE learned_suggestions
SET status = 'rejected'
WHERE confidence < 0.6 AND frequency < 3;
"

Next Steps

  • See best_practices.md for optimization tips
  • See troubleshooting.md for error resolution
  • See file_formats.md for database schema
  • See script_parameters.md for advanced CLI options