daymade 135a1873af feat(transcript-fixer): add timestamp repair and section splitting scripts
New scripts:
- fix_transcript_timestamps.py: Repair malformed timestamps (HH:MM:SS format)
- split_transcript_sections.py: Split transcript by keywords and rebase timestamps
- Automated tests for both scripts

Features:
- Timestamp validation and repair (handle missing colons, invalid ranges)
- Section splitting with custom names
- Rebase timestamps to 00:00:00 for each section
- Preserve speaker format and content integrity
- In-place editing with backup

Documentation updates:
- Add usage examples to SKILL.md
- Clarify dictionary iteration workflow (save stable patterns only)
- Update workflow guides with new script references
- Add script parameter documentation

Use cases:
- Fix ASR output with broken timestamps
- Split long meetings into focused sections
- Prepare sections for independent processing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-11 13:59:36 +08:00

Dictionary Iteration Workflow

The core value of transcript-fixer is building a personalized correction dictionary that improves over time.

The Core Loop

┌─────────────────────────────────────────────────┐
│  1. Fix transcript (manual or Stage 3)          │
│                    ↓                            │
│  2. Identify new ASR errors during fixing       │
│                    ↓                            │
│  3. IMMEDIATELY save to dictionary              │
│                    ↓                            │
│  4. Next time: Stage 1 auto-corrects these      │
└─────────────────────────────────────────────────┘

Key principle: Save every stable, reusable ASR correction to the dictionary as soon as you make it. This turns one-time work into permanent value, while one-off or context-dependent fixes stay out so they do not pollute the database.
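
The loop can be sketched in a few lines of Python. This is only an illustration, not the actual implementation of fix_transcription.py: the `apply_dictionary` helper and the in-memory `dictionary` are hypothetical stand-ins for Stage 1 and the corrections database, and the sample sentence is made up.

```python
def apply_dictionary(text: str, corrections: dict[str, str]) -> str:
    """Replace every known ASR error in `text` with its saved correction."""
    # Process longer phrases first so they are replaced before any shorter
    # phrase they happen to contain.
    for wrong in sorted(corrections, key=len, reverse=True):
        text = text.replace(wrong, corrections[wrong])
    return text


# Hypothetical in-memory dictionary (assumption: the real tool persists entries in SQLite).
dictionary: dict[str, str] = {}

# First transcript: nothing is saved yet, so the error passes through untouched.
first_pass = apply_dictionary("片片总说了几句", dictionary)

# Steps 2 and 3: the error is spotted during fixing and saved immediately.
dictionary["片片总"] = "翩翩总"

# Step 4: the next transcript containing the same error is corrected automatically.
second_pass = apply_dictionary("片片总说了几句", dictionary)

print(first_pass)
print(second_pass)
```

The second call is where the saved entry pays off: the same misrecognition no longer needs manual attention.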

Workflow Checklist

Copy this checklist when correcting transcripts:

Correction Progress:
- [ ] Run correction: uv run scripts/fix_transcription.py --input file.md --stage 3
- [ ] Review output file for remaining ASR errors
- [ ] Fix errors manually with Edit tool
- [ ] Save EACH stable, reusable correction to the dictionary with --add
- [ ] Verify with --list that corrections were saved
- [ ] Next time: Stage 1 handles these automatically

Save Corrections Immediately

After fixing any transcript, save stable corrections:

# Single correction (placeholders: 错误词 = misrecognized term, 正确词 = correct term)
uv run scripts/fix_transcription.py --add "错误词" "正确词" --domain general

# Multiple corrections: run the command once per pair
uv run scripts/fix_transcription.py --add "片片总" "翩翩总" --domain general
uv run scripts/fix_transcription.py --add "姐弟" "结业" --domain general
uv run scripts/fix_transcription.py --add "自杀性" "自嗨性" --domain general
uv run scripts/fix_transcription.py --add "被看" "被砍" --domain general
uv run scripts/fix_transcription.py --add "单反过" "单访过" --domain general
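
When several pairs come out of one editing session, the per-pair commands can be generated rather than typed. The sketch below only builds and prints the command lines; it assumes nothing beyond the `--add` and `--domain` flags shown above, and `build_add_commands` is a hypothetical helper, not part of the tool.

```python
import shlex

# Command prefix taken from the examples above.
SCRIPT = ["uv", "run", "scripts/fix_transcription.py"]


def build_add_commands(pairs, domain="general"):
    """Return one argv list per (wrong, correct) pair."""
    return [
        SCRIPT + ["--add", wrong, correct, "--domain", domain]
        for wrong, correct in pairs
    ]


pairs = [("片片总", "翩翩总"), ("姐弟", "结业"), ("被看", "被砍")]
commands = build_add_commands(pairs)

for argv in commands:
    # Print a shell-safe rendering of the command.
    # To actually execute it, pass argv to subprocess.run(argv, check=True).
    print(shlex.join(argv))
```

Building argument lists instead of shell strings avoids quoting problems with non-ASCII terms.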

Verify Dictionary

Always verify corrections were saved:

# List all corrections in current domain
uv run scripts/fix_transcription.py --list

# Direct database query
sqlite3 ~/.transcript-fixer/corrections.db \
  "SELECT from_text, to_text, domain FROM active_corrections ORDER BY added_at DESC LIMIT 10;"

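The same check can be scripted with Python's standard sqlite3 module. The sketch below runs against an in-memory database with a guessed schema: only the name active_corrections and its four columns come from the query above, while the column types and sample rows are assumptions (in the real database, active_corrections may well be a view). Connect to ~/.transcript-fixer/corrections.db instead to inspect real data.

```python
import sqlite3


def recent_corrections(conn: sqlite3.Connection, limit: int = 10):
    """Return the most recently added corrections, newest first."""
    cursor = conn.execute(
        "SELECT from_text, to_text, domain FROM active_corrections "
        "ORDER BY added_at DESC LIMIT ?",
        (limit,),
    )
    return cursor.fetchall()


# Assumed stand-in schema and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE active_corrections ("
    "from_text TEXT, to_text TEXT, domain TEXT, added_at TEXT)"
)
conn.executemany(
    "INSERT INTO active_corrections VALUES (?, ?, ?, ?)",
    [
        ("片片总", "翩翩总", "general", "2026-01-01 10:00:00"),
        ("被看", "被砍", "general", "2026-01-01 10:05:00"),
    ],
)

rows = recent_corrections(conn)
for row in rows:
    print(row)
```
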
Domain Selection

Choose the right domain for corrections:

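| Domain | Use case |
| --- | --- |
| general | Common ASR errors, names, general vocabulary |
| embodied_ai | Embodied AI, robotics, and AI terminology |
| finance | Accounting, investment, and finance terminology |
| medical | Medical and health terminology |
| 火星加速器 | Custom Chinese domain name (any valid name works) |
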
# Domain-specific correction
uv run scripts/fix_transcription.py --add "股价系统" "框架系统" --domain embodied_ai
uv run scripts/fix_transcription.py --add "片片总" "翩翩总" --domain 火星加速器
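
One way to picture domains is as separate namespaces of corrections. The sketch below assumes that general entries always apply and that a domain-specific entry overrides a general one with the same key; the document does not state how the tool actually combines domains, so treat `corrections_for` and the `store` layout as hypothetical.

```python
def corrections_for(domain: str, store: dict[str, dict[str, str]]) -> dict[str, str]:
    """Merge general corrections with those of `domain` (domain entries win)."""
    merged = dict(store.get("general", {}))
    merged.update(store.get(domain, {}))
    return merged


# Illustrative store keyed by domain name; entries reuse examples from this guide.
store = {
    "general": {"被看": "被砍"},
    "embodied_ai": {"股价系统": "框架系统"},
    "火星加速器": {"片片总": "翩翩总"},
}

ai_terms = corrections_for("embodied_ai", store)
print(ai_terms)
```

Under this model a technical substitution such as 股价系统 to 框架系统 never touches a finance transcript, where 股价 (stock price) is a legitimate word.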

Common ASR Error Patterns

Build your dictionary with these common patterns:

| Type | Examples |
| --- | --- |
| Homophones | 赢→营, 减→剪, 被看→被砍, 营业→营的 |
| Names | 片片→翩翩, 亮亮→亮哥 |
| Technical | 巨升智能→具身智能, 股价→框架 |
| English | log→vlog |
| Broken words | 姐弟→结业, 单反→单访 |

When GLM API Fails

If the output contains [CLAUDE_FALLBACK], the GLM API is unavailable.

Steps:

  1. Have Claude Code analyze the text directly for ASR errors
  2. Fix the errors with the Edit tool
  3. Save the stable corrections to the dictionary. This step is critical: dictionary corrections keep working even when no AI backend is available

Auto-Learning Feature

After running Stage 3 multiple times:

# Check learned patterns
uv run scripts/fix_transcription.py --review-learned

# Approve high-confidence patterns
uv run scripts/fix_transcription.py --approve "错误词" "正确词"

Patterns appearing ≥3 times at ≥80% confidence are suggested for review.
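
The review threshold amounts to a simple filter. The sketch below applies the two limits stated above; the `LearnedPattern` record, its field names, and the sample counts are hypothetical, since the document does not show how learned patterns are stored.

```python
from dataclasses import dataclass

MIN_OCCURRENCES = 3    # "≥3 times"
MIN_CONFIDENCE = 0.80  # "≥80% confidence"


@dataclass
class LearnedPattern:
    wrong: str
    correct: str
    occurrences: int
    confidence: float  # fraction between 0.0 and 1.0


def suggest_for_review(patterns):
    """Keep only patterns that meet both thresholds."""
    return [
        p for p in patterns
        if p.occurrences >= MIN_OCCURRENCES and p.confidence >= MIN_CONFIDENCE
    ]


# Made-up sample data for illustration.
candidates = [
    LearnedPattern("巨升智能", "具身智能", occurrences=5, confidence=0.92),
    LearnedPattern("log", "vlog", occurrences=2, confidence=0.95),  # seen too rarely
    LearnedPattern("股价", "框架", occurrences=4, confidence=0.60),  # confidence too low
]

suggested = suggest_for_review(candidates)
for p in suggested:
    print(p.wrong, "->", p.correct)
```

Both conditions must hold, so a frequent but uncertain pattern is held back just like a confident but rare one.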

Best Practices

  1. Save immediately: Don't batch corrections - save each one right after fixing
  2. Be specific: Use exact phrases, not partial words
  3. Use domains: Organize corrections by topic for better precision
  4. Verify: Always run --list to confirm saves
  5. Review suggestions: Periodically check --review-learned for auto-detected patterns

What NOT to Save to Dictionary

Do not save these as reusable dictionary entries:

  • Full-sentence deletions
  • One-off section headers or meeting-specific boilerplate
  • Context-dependent disambiguations such as cloud -> Claude, where cloud is also a legitimate word
  • File-local cleanup after section splitting or timestamp rebasing
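
Some of these exclusions can be screened mechanically before calling --add. The sketch below is a rough heuristic under stated assumptions, not something the tool provides: `is_dictionary_candidate`, the 12-character length limit, and the `ambiguous_terms` set are all hypothetical, and the check cannot recognize file-local cleanup, which still needs human judgment.

```python
def is_dictionary_candidate(
    wrong: str,
    correct: str,
    ambiguous: set[str],
    max_len: int = 12,  # arbitrary cutoff chosen for illustration
) -> bool:
    """Heuristic screen for entries that should not become dictionary rules."""
    if not wrong.strip() or not correct.strip():
        return False  # a deletion, not a substitution
    if len(wrong) > max_len:
        return False  # probably a full sentence or boilerplate
    if wrong.lower() in ambiguous:
        return False  # the "wrong" form is also a legitimate word
    return True


# Words that are sometimes correct as transcribed (assumed list).
ambiguous_terms = {"cloud"}

print(is_dictionary_candidate("巨升智能", "具身智能", ambiguous_terms))
print(is_dictionary_candidate("cloud", "Claude", ambiguous_terms))
print(is_dictionary_candidate("okay so that wraps up the meeting", "", ambiguous_terms))
```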