Dictionary Iteration Workflow
The core value of transcript-fixer is building a personalized correction dictionary that improves over time.
The Core Loop
┌─────────────────────────────────────────────────┐
│ 1. Fix transcript (manual or Stage 3) │
│ ↓ │
│ 2. Identify new ASR errors during fixing │
│ ↓ │
│ 3. IMMEDIATELY save to dictionary │
│ ↓ │
│ 4. Next time: Stage 1 auto-corrects these │
└─────────────────────────────────────────────────┘
Key principle: Every stable, reusable ASR correction you make should be saved to the dictionary. This transforms one-time work into permanent value without polluting the database.
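To make the loop concrete, here is a minimal sketch of what a dictionary pass such as Stage 1 might do once corrections are saved: plain string replacement driven by the stored pairs. The function name `apply_dictionary` and the longest-first ordering are assumptions for illustration, not the actual transcript-fixer implementation.

```python
def apply_dictionary(text: str, corrections: dict[str, str]) -> str:
    """Replace every known ASR error in `text` with its saved correction.

    Hypothetical helper: the real Stage 1 logic may differ.
    """
    # Apply longer error strings first so that a short entry cannot
    # rewrite part of a longer, more specific one.
    for wrong in sorted(corrections, key=len, reverse=True):
        text = text.replace(wrong, corrections[wrong])
    return text


# Pairs taken from the examples in this guide.
saved = {"片片总": "翩翩总", "被看": "被砍"}
fixed = apply_dictionary("片片总说预算被看了", saved)
print(fixed)
```

Every pair added to `saved` is applied to all later transcripts at no extra cost, which is why an unstable or ambiguous entry is more harmful than a missing one.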
Workflow Checklist
Copy this checklist when correcting transcripts:
Correction Progress:
- [ ] Run correction: uv run scripts/fix_transcription.py --input file.md --stage 3
- [ ] Review output file for remaining ASR errors
- [ ] Fix errors manually with Edit tool
- [ ] Save EACH stable, reusable correction to dictionary with --add
- [ ] Verify with --list that corrections were saved
- [ ] Next time: Stage 1 handles these automatically
Save Corrections Immediately
After fixing any transcript, save stable corrections:
# Single correction
uv run scripts/fix_transcription.py --add "错误词" "正确词" --domain general
# Multiple corrections - run command for each
uv run scripts/fix_transcription.py --add "片片总" "翩翩总" --domain general
uv run scripts/fix_transcription.py --add "姐弟" "结业" --domain general
uv run scripts/fix_transcription.py --add "自杀性" "自嗨性" --domain general
uv run scripts/fix_transcription.py --add "被看" "被砍" --domain general
uv run scripts/fix_transcription.py --add "单反过" "单访过" --domain general
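Because `--add` takes one pair per invocation, a small wrapper can generate the repeated commands. The sketch below only uses the `--add` and `--domain` flags shown above; the helper names `build_add_commands` and `save_all` are hypothetical, and it assumes `uv` is on the PATH and the script is run from the skill's root directory.

```python
import subprocess

SCRIPT = "scripts/fix_transcription.py"


def build_add_commands(pairs, domain="general"):
    """Return one `uv run ... --add` argument list per (wrong, right) pair."""
    return [
        ["uv", "run", SCRIPT, "--add", wrong, right, "--domain", domain]
        for wrong, right in pairs
    ]


def save_all(pairs, domain="general"):
    """Run each command in turn and stop at the first failure."""
    for argv in build_add_commands(pairs, domain):
        subprocess.run(argv, check=True)


pairs = [("片片总", "翩翩总"), ("姐弟", "结业")]
commands = build_add_commands(pairs)
for argv in commands:
    # Print the commands for review; nothing is executed here.
    print(" ".join(argv))
```

Calling `save_all(pairs)` would actually execute the commands. Printing them first keeps a review step in place before anything is written to the dictionary.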
Verify Dictionary
Always verify corrections were saved:
# List all corrections in current domain
uv run scripts/fix_transcription.py --list
# Direct database query
sqlite3 ~/.transcript-fixer/corrections.db \
"SELECT from_text, to_text, domain FROM active_corrections ORDER BY added_at DESC LIMIT 10;"
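The direct database query above can also be run from Python with the standard `sqlite3` module. This is a sketch only: the demo runs against an in-memory stand-in whose schema is assumed from the column names in the query, and the rows and dates are made up for illustration.

```python
import sqlite3
from pathlib import Path

# Location taken from the sqlite3 command above.
DB_PATH = Path.home() / ".transcript-fixer" / "corrections.db"

QUERY = (
    "SELECT from_text, to_text, domain FROM active_corrections "
    "ORDER BY added_at DESC LIMIT ?"
)


def recent_corrections(conn, limit=10):
    """Return the most recently added corrections, newest first."""
    return conn.execute(QUERY, (limit,)).fetchall()


# Stand-in table with an assumed schema; the real database may differ.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE active_corrections "
    "(from_text TEXT, to_text TEXT, domain TEXT, added_at TEXT)"
)
demo.executemany(
    "INSERT INTO active_corrections VALUES (?, ?, ?, ?)",
    [
        ("片片总", "翩翩总", "general", "2024-01-01 10:00:00"),
        ("股价系统", "框架系统", "embodied_ai", "2024-01-02 10:00:00"),
    ],
)
rows = recent_corrections(demo)
print(rows)
```

To inspect the real dictionary, pass `sqlite3.connect(DB_PATH)` to `recent_corrections` instead of the in-memory connection.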
Domain Selection
Choose the right domain for corrections:
| Domain | Use Case |
|---|---|
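| general | Common ASR errors, names, general vocabulary |
| embodied_ai | Embodied AI, robotics, and AI-related terminology |
| finance | Finance, investment, and financial terminology |
| medical | Medical and health-related terminology |
| 火星加速器 | Custom Chinese domain name (any valid name works) |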
# Domain-specific correction
uv run scripts/fix_transcription.py --add "股价系统" "框架系统" --domain embodied_ai
uv run scripts/fix_transcription.py --add "片片总" "翩翩总" --domain 火星加速器
Common ASR Error Patterns
Build your dictionary with these common patterns:
| Type | Examples |
|---|---|
| Homophones | 赢→营, 减→剪, 被看→被砍, 营业→营的 |
| Names | 片片→翩翩, 亮亮→亮哥 |
| Technical | 巨升智能→具身智能, 股价→框架 |
| English | log→vlog |
| Broken words | 姐弟→结业, 单反→单访 |
When GLM API Fails
If you see [CLAUDE_FALLBACK] output, the GLM API is unavailable.
Steps:
- Claude Code should analyze the text directly for ASR errors
- Fix using Edit tool
- MUST save corrections to dictionary - this is critical
- Dictionary corrections work even without AI
Auto-Learning Feature
After running Stage 3 multiple times:
# Check learned patterns
uv run scripts/fix_transcription.py --review-learned
# Approve high-confidence patterns
uv run scripts/fix_transcription.py --approve "错误词" "正确词"
Patterns appearing ≥3 times at ≥80% confidence are suggested for review.
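The suggestion threshold can be read as a simple two-condition filter. The sketch below illustrates that rule only; the `LearnedPattern` record, its field names, and the sample counts and confidences are assumptions, not the tool's internal data model.

```python
from dataclasses import dataclass

MIN_COUNT = 3          # pattern must appear at least 3 times
MIN_CONFIDENCE = 0.80  # and reach at least 80% confidence


@dataclass(frozen=True)
class LearnedPattern:
    """Hypothetical record for one auto-detected correction."""
    wrong: str
    right: str
    count: int         # how many times the pattern was observed
    confidence: float  # between 0.0 and 1.0


def is_suggested(pattern: LearnedPattern) -> bool:
    """True when the pattern meets both thresholds."""
    return pattern.count >= MIN_COUNT and pattern.confidence >= MIN_CONFIDENCE


# Made-up sample values for illustration.
patterns = [
    LearnedPattern("巨升智能", "具身智能", count=5, confidence=0.92),
    LearnedPattern("股价", "框架", count=2, confidence=0.95),  # too few occurrences
    LearnedPattern("log", "vlog", count=4, confidence=0.60),   # confidence too low
]
suggested = [p.wrong for p in patterns if is_suggested(p)]
print(suggested)
```

Both conditions must hold, so a frequent but low-confidence pattern stays out of the review list just like a confident one seen only twice.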
Best Practices
- Save immediately: Don't batch corrections - save each stable one right after fixing
- Be specific: Use exact phrases, not partial words
- Use domains: Organize corrections by topic for better precision
- Verify: Always run --list to confirm saves
- Review suggestions: Periodically check --review-learned for auto-detected patterns
What NOT to Save to Dictionary
Do not save these as reusable dictionary entries:
- Full-sentence deletions
- One-off section headers or meeting-specific boilerplate
- Context-only disambiguations such as "cloud" -> "Claude" when "cloud" can also be legitimate
- File-local cleanup after section splitting or timestamp rebasing
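Some of these exclusions can be approximated with a pre-save check. The heuristic below is only a sketch: the function name `rejection_reason`, the 20-character cutoff, and the user-maintained `ambiguous_words` set are all assumptions, and one-off headers or file-local cleanup still need human judgment.

```python
SENTENCE_PUNCTUATION = set("。！？!?")
MAX_ENTRY_LENGTH = 20  # assumed cutoff for "phrase, not sentence"


def rejection_reason(wrong, right, ambiguous_words=frozenset()):
    """Return why a pair should not be saved, or None if it looks reusable."""
    if not right.strip():
        # An empty replacement is a deletion, not a correction.
        return "deletion, not a correction"
    if len(wrong) > MAX_ENTRY_LENGTH or SENTENCE_PUNCTUATION & set(wrong):
        return "looks like a full sentence"
    if wrong.lower() in ambiguous_words:
        # The "wrong" text is a valid word in other contexts.
        return "source word is also legitimate"
    return None


print(rejection_reason("片片总", "翩翩总"))
print(rejection_reason("cloud", "Claude", ambiguous_words={"cloud"}))
```

A pair that returns `None` has only passed these mechanical checks; whether it is stable enough to reuse across transcripts remains a decision for the person saving it.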