Workflow Guide

Detailed step-by-step workflows for transcript correction and management.

Pre-Flight Checklist

Before running corrections, verify these prerequisites:

Initial Setup

  • Initialized with uv run scripts/fix_transcription.py --init
  • Database exists at ~/.transcript-fixer/corrections.db
  • GLM_API_KEY environment variable set (run echo $GLM_API_KEY)
  • Configuration validated (run --validate)

File Preparation

  • Input file exists and is readable
  • File uses supported format (.md, .txt)
  • File encoding is UTF-8
  • File size is reasonable (<10MB for first runs)
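
The file checks above can be automated. A minimal pre-flight sketch (the `preflight` helper name and the 10 MB limit are illustrative, not part of the toolkit):

```python
from pathlib import Path

SUPPORTED = {".md", ".txt"}
MAX_BYTES = 10 * 1024 * 1024  # ~10 MB guideline for first runs

def preflight(path: str) -> list[str]:
    """Return a list of problems; an empty list means the file passes."""
    p = Path(path)
    if not p.is_file():
        return [f"{path}: file not found or not readable"]
    problems = []
    if p.suffix.lower() not in SUPPORTED:
        problems.append(f"{path}: unsupported format {p.suffix}")
    if p.stat().st_size > MAX_BYTES:
        problems.append(f"{path}: larger than 10 MB")
    try:
        p.read_text(encoding="utf-8")
    except UnicodeDecodeError:
        problems.append(f"{path}: not valid UTF-8")
    return problems
```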

Execution Parameters

  • Using --stage 3 for full pipeline (or specific stage if testing)
  • Domain specified with --domain if using specialized dictionaries
  • Using --merge flag when importing team corrections

Environment

  • Sufficient disk space for output files (~2x input size)
  • API quota available for Stage 2 corrections
  • Network connectivity for API calls

Quick validation:

uv run scripts/fix_transcription.py --validate && echo $GLM_API_KEY

Core Workflows

1. First-Time Correction

Goal: Correct a transcript for the first time.

Steps:

  1. Initialize (if not done):

    uv run scripts/fix_transcription.py --init
    export GLM_API_KEY="your-key"
    
  2. Add initial corrections (5-10 common errors):

    uv run scripts/fix_transcription.py --add "常见错误1" "正确词1" --domain general  # "common error 1" → "correct word 1"
    uv run scripts/fix_transcription.py --add "常见错误2" "正确词2" --domain general  # "common error 2" → "correct word 2"
    
  3. Test on small sample (Stage 1 only):

    uv run scripts/fix_transcription.py --input sample.md --stage 1
    less sample_stage1.md  # Review output
    
  4. Run full pipeline:

    uv run scripts/fix_transcription.py --input transcript.md --stage 3 --domain general
    
  5. Review outputs:

    # Stage 1: Dictionary corrections
    less transcript_stage1.md
    
    # Stage 2: Final corrected version
    less transcript_stage2.md
    
    # Generate diff report
    uv run scripts/diff_generator.py transcript.md transcript_stage1.md transcript_stage2.md
    

Expected duration:

  • Stage 1: Instant (dictionary lookup)
  • Stage 2: ~1-2 minutes per 1000 lines (API calls)

2. Iterative Improvement

Goal: Improve correction quality over time through learning.

Steps:

  1. Run corrections on 3-5 similar transcripts:

    uv run scripts/fix_transcription.py --input day1.md --stage 3 --domain embodied_ai
    uv run scripts/fix_transcription.py --input day2.md --stage 3 --domain embodied_ai
    uv run scripts/fix_transcription.py --input day3.md --stage 3 --domain embodied_ai
    
  2. Review learned suggestions:

    uv run scripts/fix_transcription.py --review-learned
    

    Output example:

    📚 Learned Suggestions (Pending Review)
    ========================================
    
    1. "巨升方向" → "具身方向"
       Frequency: 5  Confidence: 0.95
       Examples: day1.md (line 45), day2.md (line 23), ...
    
    2. "奇迹创坛" → "奇绩创坛"
       Frequency: 3  Confidence: 0.87
       Examples: day1.md (line 102), day3.md (line 67)
    
  3. Approve high-quality suggestions:

    uv run scripts/fix_transcription.py --approve "巨升方向" "具身方向"
    uv run scripts/fix_transcription.py --approve "奇迹创坛" "奇绩创坛"
    
  4. Verify approved corrections:

    uv run scripts/fix_transcription.py --list --domain embodied_ai | grep "learned"
    
  5. Run next batch (benefits from approved corrections):

    uv run scripts/fix_transcription.py --input day4.md --stage 3 --domain embodied_ai
    

Impact: Approved corrections move to Stage 1 (instant, free).

Cycle: Repeat every 3-5 transcripts for continuous improvement.
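The approval step can be pre-filtered by confidence and frequency. A sketch, assuming the `learned_suggestions` table uses `wrong`/`right` columns for the pair (the `status`, `confidence`, and `frequency` columns appear in the quarterly cleanup queries; `wrong`/`right` are assumed names):

```python
import sqlite3

def promising_suggestions(db_path: str, min_freq: int = 3, min_conf: float = 0.8):
    """List pending learned suggestions worth a manual --approve."""
    con = sqlite3.connect(db_path)
    try:
        return con.execute(
            "SELECT wrong, right, frequency, confidence FROM learned_suggestions "
            "WHERE status = 'pending' AND frequency >= ? AND confidence >= ? "
            "ORDER BY confidence DESC, frequency DESC",
            (min_freq, min_conf),
        ).fetchall()
    finally:
        con.close()
```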

3. Domain-Specific Corrections

Goal: Build specialized dictionaries for different fields.

Steps:

  1. Identify domain:

    • embodied_ai - Robotics, AI terminology
    • finance - Financial terminology
    • medical - Medical terminology
    • general - General-purpose
  2. Add domain-specific terms:

    # Embodied AI domain
    uv run scripts/fix_transcription.py --add "巨升智能" "具身智能" --domain embodied_ai
    uv run scripts/fix_transcription.py --add "机器学习" "机器学习" --domain embodied_ai  # Keep as-is
    
    # Finance domain
    uv run scripts/fix_transcription.py --add "股价" "股价" --domain finance  # Keep as-is
    uv run scripts/fix_transcription.py --add "PE比率" "市盈率" --domain finance
    
  3. Use appropriate domain when correcting:

    # AI meeting transcript
    uv run scripts/fix_transcription.py --input ai_meeting.md --stage 3 --domain embodied_ai
    
    # Financial report transcript
    uv run scripts/fix_transcription.py --input earnings_call.md --stage 3 --domain finance
    
  4. Review domain statistics:

    sqlite3 ~/.transcript-fixer/corrections.db "SELECT * FROM correction_statistics;"
    

Benefits:

  • Prevents cross-domain conflicts
  • Higher accuracy per domain
  • Targeted vocabulary building

4. Team Collaboration

Goal: Share corrections across team members.

Steps:

Setup (One-time per team)

  1. Create shared repository:

    mkdir transcript-corrections
    cd transcript-corrections
    git init
    
    # .gitignore
    printf '*.db\n*.db-journal\n*.bak\n' > .gitignore
    
  2. Export initial corrections:

    uv run scripts/fix_transcription.py --export general.json --domain general
    uv run scripts/fix_transcription.py --export embodied_ai.json --domain embodied_ai
    
    git add *.json
    git commit -m "Initial correction dictionaries"
    git push origin main
    

Daily Workflow

Team Member A (adds new corrections):

# 1. Run corrections
uv run scripts/fix_transcription.py --input transcript.md --stage 3 --domain embodied_ai

# 2. Review and approve learned suggestions
uv run scripts/fix_transcription.py --review-learned
uv run scripts/fix_transcription.py --approve "新错误" "正确词"

# 3. Export updated corrections
uv run scripts/fix_transcription.py --export embodied_ai_$(date +%Y%m%d).json --domain embodied_ai

# 4. Commit and push
git add embodied_ai_*.json
git commit -m "Add embodied AI corrections from today's transcripts"
git push origin main

Team Member B (imports team corrections):

# 1. Pull latest corrections
git pull origin main

# 2. Import with merge
uv run scripts/fix_transcription.py --import embodied_ai_20250128.json --merge

# 3. Verify
uv run scripts/fix_transcription.py --list --domain embodied_ai | tail -10

Conflict resolution: See team_collaboration.md for handling merge conflicts.
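When two members export overlapping dictionaries, a pre-merge conflict check can save a round of git back-and-forth. A sketch, assuming the export is a flat JSON object of wrong→right pairs (the actual export schema is documented in file_formats.md and may differ):

```python
import json

def merge_exports(base_path: str, incoming_path: str, out_path: str) -> dict:
    """Union two exported dictionaries; keep the base value on conflict and report it."""
    with open(base_path, encoding="utf-8") as f:
        base = json.load(f)
    with open(incoming_path, encoding="utf-8") as f:
        incoming = json.load(f)
    # Conflicting targets are reported rather than silently overwritten
    conflicts = {w: (base[w], r) for w, r in incoming.items()
                 if w in base and base[w] != r}
    merged = {**base, **{w: r for w, r in incoming.items() if w not in conflicts}}
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(merged, f, ensure_ascii=False, indent=2)
    return conflicts
```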

5. Stage-by-Stage Execution

Goal: Test dictionary changes without wasting API quota.

Stage 1 Only (Dictionary)

Use when: Testing new corrections, verifying domain setup.

uv run scripts/fix_transcription.py --input file.md --stage 1 --domain general

Output: file_stage1.md with dictionary corrections only.

Review: Check if dictionary corrections are sufficient.

Stage 2 Only (AI)

Use when: Running AI corrections on pre-processed file.

Prerequisites: Stage 1 output exists.

# Stage 1 first
uv run scripts/fix_transcription.py --input file.md --stage 1

# Then Stage 2
uv run scripts/fix_transcription.py --input file_stage1.md --stage 2

Output: file_stage1_stage2.md (confusing naming - use Stage 3 instead).

Stage 3 (Full Pipeline)

Use when: Production runs, full correction workflow.

uv run scripts/fix_transcription.py --input file.md --stage 3 --domain general

Output: Both file_stage1.md and file_stage2.md.

Recommended: Use Stage 3 for most workflows.
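
Conceptually, Stage 1 is a deterministic find-and-replace over the approved dictionary, which is why it is instant and free. A sketch, assuming longest-match-first ordering so that short entries do not shadow longer ones (the real script's matching strategy may differ):

```python
def apply_dictionary(text: str, corrections: dict[str, str]) -> str:
    """Apply wrong→right pairs, longest wrong-form first."""
    for wrong in sorted(corrections, key=len, reverse=True):
        text = text.replace(wrong, corrections[wrong])
    return text
```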

6. Context-Aware Rules

Goal: Handle edge cases with regex patterns.

Use cases:

  • Positional corrections (e.g., "的" vs "地")
  • Multi-word patterns
  • Conditional corrections

Steps:

  1. Identify a pattern that a simple dictionary entry can't handle:

    Problem: "近距离的去看" (wrong - "的" should be "地" before a verb)
    Problem: "近距离的搏杀" (correct - keep "的" before a noun)
    
  2. Add context rules:

    sqlite3 ~/.transcript-fixer/corrections.db
    
    -- Higher priority for specific context
    INSERT INTO context_rules (pattern, replacement, description, priority)
    VALUES ('近距离的去看', '近距离地去看', '的→地 before verb', 10);
    
    -- Lower priority identity rule protects the correct phrase
    INSERT INTO context_rules (pattern, replacement, description, priority)
    VALUES ('近距离的搏杀', '近距离的搏杀', 'Keep 的 for noun modifier', 5);
    
    .quit
    
  3. Test context rules:

    uv run scripts/fix_transcription.py --input test.md --stage 1
    
  4. Validate:

    uv run scripts/fix_transcription.py --validate
    

Priority: Higher numbers run first (use for exceptions/edge cases).
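
The priority semantics above can be sketched as ordered regex substitution: higher-priority rules run first, so a specific exception fires before (and is then untouched by) a general pattern. A sketch only; the real implementation reads rules from the `context_rules` table:

```python
import re

def apply_context_rules(text: str, rules: list[tuple[str, str, int]]) -> str:
    """rules: (pattern, replacement, priority); higher priority is applied first."""
    for pattern, replacement, _prio in sorted(rules, key=lambda r: r[2], reverse=True):
        text = re.sub(pattern, replacement, text)
    return text
```

For example, with a priority-10 rule for "catalog" and a priority-1 rule for "cat", "catalog" is rewritten first, so the general "cat" rule can no longer corrupt it.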

See file_formats.md for context_rules schema.

7. Diff Report Generation

Goal: Visualize all changes for review.

Use when:

  • Reviewing corrections before publishing
  • Training new team members
  • Documenting ASR error patterns

Steps:

  1. Run corrections:

    uv run scripts/fix_transcription.py --input transcript.md --stage 3
    
  2. Generate diff reports:

    uv run scripts/diff_generator.py \
      transcript.md \
      transcript_stage1.md \
      transcript_stage2.md
    
  3. Review outputs:

    # Markdown report (statistics + summary)
    less diff_report.md
    
    # Unified diff (git-style)
    less transcript_unified.diff
    
    # HTML side-by-side (visual review)
    open transcript_sidebyside.html
    
    # Inline markers (for editing)
    less transcript_inline.md
    

Report contents:

  • Total changes count
  • Stage 1 vs Stage 2 breakdown
  • Character/word count changes
  • Side-by-side comparison
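The change counts and character deltas in the report are the kind of statistics the standard-library difflib produces. A sketch of computing them for one stage pair (field names here are illustrative, not the report's actual schema):

```python
import difflib

def diff_stats(original: str, corrected: str) -> dict:
    """Count changed lines and character delta between two transcript versions."""
    sm = difflib.SequenceMatcher(a=original.splitlines(), b=corrected.splitlines())
    changed = sum(max(i2 - i1, j2 - j1)
                  for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != "equal")
    return {
        "changed_lines": changed,
        "char_delta": len(corrected) - len(original),
        "similarity": round(sm.ratio(), 3),
    }
```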

See script_parameters.md for advanced diff options.

8. Workshop Transcript Split + Timestamp Rebase

Goal: Split a long workshop transcript into sections such as pre-class chat, the class itself, and the debrief, then rebase each section's timestamps to start at 00:00:00.

Steps:

  1. Correct the transcript text first (dictionary + AI/manual review)
  2. Pick a marker phrase for each section boundary
  3. Split and rebase:

    uv run scripts/split_transcript_sections.py workshop.txt \
      --first-section-name "课前聊天" \
      --section "正式上课::好,无缝切换嘛。对。那个曹总连上了吗?那个网页。" \
      --section "课后复盘::我们复盘一下。" \
      --rebase-to-zero
    
  4. If the files are already split, rebase a single file directly:

    uv run scripts/fix_transcript_timestamps.py class.txt --in-place --rebase-to-zero
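
The rebase itself is simple arithmetic: subtract the section's earliest timestamp from every timestamp. A sketch covering only the rebase (not the malformed-timestamp repair), assuming HH:MM:SS stamps in the text; the real script may match a different pattern:

```python
import re

TS = re.compile(r"\b(\d{1,2}):([0-5]\d):([0-5]\d)\b")

def to_seconds(h, m, s):
    return int(h) * 3600 + int(m) * 60 + int(s)

def rebase_to_zero(text: str) -> str:
    """Shift all HH:MM:SS timestamps so the earliest becomes 00:00:00."""
    stamps = TS.findall(text)
    if not stamps:
        return text
    base = min(to_seconds(*t) for t in stamps)
    def shift(m):
        total = to_seconds(*m.groups()) - base
        return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"
    return TS.sub(shift, text)
```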

Batch Processing

Process Multiple Files

# Simple loop
for file in meeting_*.md; do
  uv run scripts/fix_transcription.py --input "$file" --stage 3 --domain embodied_ai
done

# With error handling
for file in meeting_*.md; do
  echo "Processing $file..."
  if uv run scripts/fix_transcription.py --input "$file" --stage 3 --domain embodied_ai; then
    echo "✅ $file completed"
  else
    echo "❌ $file failed"
  fi
done

Parallel Processing

# GNU parallel (install: brew install parallel)
ls meeting_*.md | parallel -j 4 \
  "uv run scripts/fix_transcription.py --input {} --stage 3 --domain embodied_ai"

Caution: Monitor API rate limits when processing in parallel.

Maintenance Workflows

Weekly: Review Learning

# Review suggestions
uv run scripts/fix_transcription.py --review-learned

# Approve high-confidence patterns
uv run scripts/fix_transcription.py --approve "错误1" "正确1"
uv run scripts/fix_transcription.py --approve "错误2" "正确2"

Monthly: Export and Backup

# Export all domains
uv run scripts/fix_transcription.py --export general_$(date +%Y%m%d).json --domain general
uv run scripts/fix_transcription.py --export embodied_ai_$(date +%Y%m%d).json --domain embodied_ai

# Backup database
cp ~/.transcript-fixer/corrections.db ~/backups/corrections_$(date +%Y%m%d).db

# Database maintenance
sqlite3 ~/.transcript-fixer/corrections.db "VACUUM; REINDEX; ANALYZE;"

Quarterly: Clean Up

# Archive old history (> 90 days)
sqlite3 ~/.transcript-fixer/corrections.db "
DELETE FROM correction_history
WHERE run_timestamp < datetime('now', '-90 days');
"

# Reject low-confidence suggestions
sqlite3 ~/.transcript-fixer/corrections.db "
UPDATE learned_suggestions
SET status = 'rejected'
WHERE confidence < 0.6 AND frequency < 3;
"

Next Steps

  • See best_practices.md for optimization tips
  • See troubleshooting.md for error resolution
  • See file_formats.md for database schema
  • See script_parameters.md for advanced CLI options