Workflow Guide
Detailed step-by-step workflows for transcript correction and management.
Pre-Flight Checklist
Before running corrections, verify these prerequisites:
Initial Setup
- Initialized with `uv run scripts/fix_transcription.py --init`
- Database exists at `~/.transcript-fixer/corrections.db`
- `GLM_API_KEY` environment variable set (run `echo $GLM_API_KEY`)
- Configuration validated (run `--validate`)
File Preparation
- Input file exists and is readable
- File uses a supported format (`.md`, `.txt`)
- File encoding is UTF-8
- File size is reasonable (<10MB for first runs)
Execution Parameters
- Using `--stage 3` for the full pipeline (or a specific stage when testing)
- Domain specified with `--domain` when using specialized dictionaries
- Using the `--merge` flag when importing team corrections
Environment
- Sufficient disk space for output files (~2x input size)
- API quota available for Stage 2 corrections
- Network connectivity for API calls
Quick validation:

```bash
uv run scripts/fix_transcription.py --validate && echo $GLM_API_KEY
```
Core Workflows
1. First-Time Correction
Goal: Correct a transcript for the first time.
Steps:
- Initialize (if not done):

  ```bash
  uv run scripts/fix_transcription.py --init
  export GLM_API_KEY="your-key"
  ```

- Add initial corrections (5-10 common errors):

  ```bash
  uv run scripts/fix_transcription.py --add "常见错误1" "正确词1" --domain general
  uv run scripts/fix_transcription.py --add "常见错误2" "正确词2" --domain general
  ```

- Test on a small sample (Stage 1 only):

  ```bash
  uv run scripts/fix_transcription.py --input sample.md --stage 1
  less sample_stage1.md  # Review output
  ```

- Run the full pipeline:

  ```bash
  uv run scripts/fix_transcription.py --input transcript.md --stage 3 --domain general
  ```

- Review the outputs:

  ```bash
  # Stage 1: Dictionary corrections
  less transcript_stage1.md

  # Stage 2: Final corrected version
  less transcript_stage2.md

  # Generate diff report
  uv run scripts/diff_generator.py transcript.md transcript_stage1.md transcript_stage2.md
  ```
Expected duration:
- Stage 1: Instant (dictionary lookup)
- Stage 2: ~1-2 minutes per 1000 lines (API calls)
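Stage 1 is instant because it is a pure dictionary pass with no API calls. A minimal sketch of that idea, assuming plain exact-string replacement (the actual `fix_transcription.py` internals are not shown here; the entries are illustrative):

```python
# Sketch of a Stage 1 dictionary pass: exact replacements,
# longest keys first so longer patterns win over their substrings.
corrections = {
    "巨升智能": "具身智能",
    "奇迹创坛": "奇绩创坛",
}

def apply_dictionary(text: str, table: dict[str, str]) -> str:
    for wrong in sorted(table, key=len, reverse=True):
        text = text.replace(wrong, table[wrong])
    return text

print(apply_dictionary("欢迎来到巨升智能与奇迹创坛的分享", corrections))
# → 欢迎来到具身智能与奇绩创坛的分享
```

This is why moving approved corrections into the dictionary (see the iterative workflow below) makes later runs both faster and cheaper.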
2. Iterative Improvement
Goal: Improve correction quality over time through learning.
Steps:
- Run corrections on 3-5 similar transcripts:

  ```bash
  uv run scripts/fix_transcription.py --input day1.md --stage 3 --domain embodied_ai
  uv run scripts/fix_transcription.py --input day2.md --stage 3 --domain embodied_ai
  uv run scripts/fix_transcription.py --input day3.md --stage 3 --domain embodied_ai
  ```

- Review learned suggestions:

  ```bash
  uv run scripts/fix_transcription.py --review-learned
  ```

  Output example:

  ```
  📚 Learned Suggestions (Pending Review)
  ========================================
  1. "巨升方向" → "具身方向"
     Frequency: 5   Confidence: 0.95
     Examples: day1.md (line 45), day2.md (line 23), ...
  2. "奇迹创坛" → "奇绩创坛"
     Frequency: 3   Confidence: 0.87
     Examples: day1.md (line 102), day3.md (line 67)
  ```

- Approve high-quality suggestions:

  ```bash
  uv run scripts/fix_transcription.py --approve "巨升方向" "具身方向"
  uv run scripts/fix_transcription.py --approve "奇迹创坛" "奇绩创坛"
  ```

- Verify the approved corrections:

  ```bash
  uv run scripts/fix_transcription.py --list --domain embodied_ai | grep "learned"
  ```

- Run the next batch (it benefits from the approved corrections):

  ```bash
  uv run scripts/fix_transcription.py --input day4.md --stage 3 --domain embodied_ai
  ```
Impact: Approved corrections move to Stage 1 (instant, free).
Cycle: Repeat every 3-5 transcripts for continuous improvement.
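The approve/skip decision above is driven by frequency and confidence. A hedged sketch of such a filter; the thresholds and field names are illustrative, not the tool's actual defaults:

```python
# Surface only suggestions that recur often with high confidence.
# Values mirror the example output above; thresholds are made up.
suggestions = [
    {"wrong": "巨升方向", "right": "具身方向", "frequency": 5, "confidence": 0.95},
    {"wrong": "奇迹创坛", "right": "奇绩创坛", "frequency": 3, "confidence": 0.87},
    {"wrong": "偶发错误", "right": "偶发修正", "frequency": 1, "confidence": 0.40},
]

def worth_approving(s: dict, min_freq: int = 3, min_conf: float = 0.8) -> bool:
    return s["frequency"] >= min_freq and s["confidence"] >= min_conf

approved = [s["wrong"] for s in suggestions if worth_approving(s)]
print(approved)  # → ['巨升方向', '奇迹创坛']
```

One-off, low-confidence patterns fall below both thresholds, which matches the quarterly cleanup policy later in this guide.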
3. Domain-Specific Corrections
Goal: Build specialized dictionaries for different fields.
Steps:
- Identify the domain:
  - `embodied_ai` - Robotics, AI terminology
  - `finance` - Financial terminology
  - `medical` - Medical terminology
  - `general` - General-purpose

- Add domain-specific terms:

  ```bash
  # Embodied AI domain
  uv run scripts/fix_transcription.py --add "巨升智能" "具身智能" --domain embodied_ai
  uv run scripts/fix_transcription.py --add "机器学习" "机器学习" --domain embodied_ai

  # Finance domain
  uv run scripts/fix_transcription.py --add "股价" "股价" --domain finance  # Keep as-is
  uv run scripts/fix_transcription.py --add "PE比率" "市盈率" --domain finance
  ```

- Use the appropriate domain when correcting:

  ```bash
  # AI meeting transcript
  uv run scripts/fix_transcription.py --input ai_meeting.md --stage 3 --domain embodied_ai

  # Financial report transcript
  uv run scripts/fix_transcription.py --input earnings_call.md --stage 3 --domain finance
  ```

- Review domain statistics:

  ```bash
  sqlite3 ~/.transcript-fixer/corrections.db "SELECT * FROM correction_statistics;"
  ```
Benefits:
- Prevents cross-domain conflicts
- Higher accuracy per domain
- Targeted vocabulary building
4. Team Collaboration
Goal: Share corrections across team members.
Steps:
Setup (One-time per team)
- Create a shared repository:

  ```bash
  mkdir transcript-corrections
  cd transcript-corrections
  git init
  # Ignore local databases and backups
  printf '*.db\n*.db-journal\n*.bak\n' > .gitignore
  ```

- Export initial corrections:

  ```bash
  uv run scripts/fix_transcription.py --export general.json --domain general
  uv run scripts/fix_transcription.py --export embodied_ai.json --domain embodied_ai
  git add *.json
  git commit -m "Initial correction dictionaries"
  git push origin main
  ```
Daily Workflow
Team Member A (adds new corrections):
```bash
# 1. Run corrections
uv run scripts/fix_transcription.py --input transcript.md --stage 3 --domain embodied_ai

# 2. Review and approve learned suggestions
uv run scripts/fix_transcription.py --review-learned
uv run scripts/fix_transcription.py --approve "新错误" "正确词"

# 3. Export updated corrections
uv run scripts/fix_transcription.py --export embodied_ai_$(date +%Y%m%d).json --domain embodied_ai

# 4. Commit and push
git add embodied_ai_*.json
git commit -m "Add embodied AI corrections from today's transcripts"
git push origin main
```
Team Member B (imports team corrections):
```bash
# 1. Pull latest corrections
git pull origin main

# 2. Import with merge
uv run scripts/fix_transcription.py --import embodied_ai_20250128.json --merge

# 3. Verify
uv run scripts/fix_transcription.py --list --domain embodied_ai | tail -10
```
Conflict resolution: See team_collaboration.md for handling merge conflicts.
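The `--merge` semantics can be pictured as: keep local entries, add new team entries, and surface disagreements rather than silently overwriting. A sketch of that idea (not the script's actual merge code):

```python
# Merge an incoming correction table into a local one, collecting
# conflicts where both sides map the same source string differently.
def merge(local: dict[str, str], incoming: dict[str, str]):
    conflicts = []
    for wrong, right in incoming.items():
        if wrong in local and local[wrong] != right:
            conflicts.append((wrong, local[wrong], right))  # needs human review
        else:
            local[wrong] = right
    return local, conflicts

local = {"巨升方向": "具身方向"}
incoming = {"巨升方向": "具身的方向", "奇迹创坛": "奇绩创坛"}
merged, conflicts = merge(local, incoming)
print(len(merged), len(conflicts))  # → 2 1
```

The local value wins by default here; a real merge would route each conflict through the review queue instead.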
5. Stage-by-Stage Execution
Goal: Test dictionary changes without wasting API quota.
Stage 1 Only (Dictionary)
Use when: Testing new corrections, verifying domain setup.
```bash
uv run scripts/fix_transcription.py --input file.md --stage 1 --domain general
```

Output: `file_stage1.md`, with dictionary corrections only.
Review: Check if dictionary corrections are sufficient.
Stage 2 Only (AI)
Use when: Running AI corrections on pre-processed file.
Prerequisites: Stage 1 output exists.
```bash
# Stage 1 first
uv run scripts/fix_transcription.py --input file.md --stage 1

# Then Stage 2
uv run scripts/fix_transcription.py --input file_stage1.md --stage 2
```

Output: `file_stage1_stage2.md` (confusing naming; prefer Stage 3).
Stage 3 (Full Pipeline)
Use when: Production runs, full correction workflow.
```bash
uv run scripts/fix_transcription.py --input file.md --stage 3 --domain general
```

Output: both `file_stage1.md` and `file_stage2.md`.
Recommended: Use Stage 3 for most workflows.
6. Context-Aware Rules
Goal: Handle edge cases with regex patterns.
Use cases:
- Positional corrections (e.g., "的" vs "地")
- Multi-word patterns
- Conditional corrections
Steps:
- Identify a pattern that a plain dictionary entry can't handle:

  ```
  Problem: "近距离的去看" (wrong - should be "地")
  Problem: "近距离搏杀" (correct - should keep "的")
  ```

- Add context rules:

  ```
  sqlite3 ~/.transcript-fixer/corrections.db
  -- Higher priority for specific context
  INSERT INTO context_rules (pattern, replacement, description, priority)
  VALUES ('近距离的去看', '近距离地去看', '的→地 before verb', 10);
  -- Lower priority for general pattern
  INSERT INTO context_rules (pattern, replacement, description, priority)
  VALUES ('近距离搏杀', '近距离搏杀', 'Keep 的 for noun modifier', 5);
  .quit
  ```

- Test the context rules:

  ```bash
  uv run scripts/fix_transcription.py --input test.md --stage 1
  ```

- Validate:

  ```bash
  uv run scripts/fix_transcription.py --validate
  ```
Priority: Higher numbers run first (use for exceptions/edge cases).
See file_formats.md for context_rules schema.
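The priority ordering can be sketched as sorting rules descending by priority and applying each as a regex substitution; the rule below is illustrative, not one stored in the actual database:

```python
import re

# Apply context rules highest-priority first, so a specific
# exception fires before any general pattern would.
rules = [
    {"pattern": "近距离的(?=去看)", "replacement": "近距离地", "priority": 10},
    # a lower-priority, more general rule would sort after the above
]

def apply_rules(text: str, rules: list[dict]) -> str:
    for rule in sorted(rules, key=lambda r: r["priority"], reverse=True):
        text = re.sub(rule["pattern"], rule["replacement"], text)
    return text

print(apply_rules("我们近距离的去看这个问题", rules))
# → 我们近距离地去看这个问题
```

The lookahead `(?=去看)` limits the rewrite to the verb context, leaving phrases like "近距离搏杀" untouched.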
7. Diff Report Generation
Goal: Visualize all changes for review.
Use when:
- Reviewing corrections before publishing
- Training new team members
- Documenting ASR error patterns
Steps:
- Run corrections:

  ```bash
  uv run scripts/fix_transcription.py --input transcript.md --stage 3
  ```

- Generate diff reports:

  ```bash
  uv run scripts/diff_generator.py \
      transcript.md \
      transcript_stage1.md \
      transcript_stage2.md
  ```

- Review the outputs:

  ```bash
  # Markdown report (statistics + summary)
  less diff_report.md

  # Unified diff (git-style)
  less transcript_unified.diff

  # HTML side-by-side (visual review)
  open transcript_sidebyside.html

  # Inline markers (for editing)
  less transcript_inline.md
  ```
Report contents:
- Total changes count
- Stage 1 vs Stage 2 breakdown
- Character/word count changes
- Side-by-side comparison
See script_parameters.md for advanced diff options.
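The unified diff has the same shape Python's standard `difflib` produces, which is handy for quick spot checks when diff_generator.py is not at hand. A small self-contained example:

```python
import difflib

# Git-style unified diff between an original line and its
# corrected counterpart.
original = ["大家好,欢迎来到巨升智能分享。\n"]
corrected = ["大家好,欢迎来到具身智能分享。\n"]

diff = difflib.unified_diff(
    original, corrected,
    fromfile="transcript.md", tofile="transcript_stage2.md",
)
print("".join(diff))
```

Each changed line appears once with a `-` prefix (before) and once with `+` (after), under a `@@` hunk header.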
8. Workshop Transcript Split + Timestamp Rebase
Goal: Split a long workshop transcript into sections such as setup chat, class, and debrief, then make each section start from 00:00:00.
Steps:
- Correct the transcript text first (dictionary + AI/manual review)
- Pick marker phrases for each section boundary
- Split and rebase:

  ```bash
  uv run scripts/split_transcript_sections.py workshop.txt \
      --first-section-name "课前聊天" \
      --section "正式上课::好,无缝切换嘛。对。那个曹总连上了吗?那个网页。" \
      --section "课后复盘::我们复盘一下。" \
      --rebase-to-zero
  ```

- If the files are already split, rebase a single file directly:

  ```bash
  uv run scripts/fix_transcript_timestamps.py class.txt --in-place --rebase-to-zero
  ```
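The rebase step amounts to subtracting the section's first timestamp from every HH:MM:SS stamp. A minimal sketch of that arithmetic, assuming a stamp leads each line (the real scripts handle more formats, validation, and in-place backups):

```python
import re

TS = re.compile(r"\b(\d{2}):(\d{2}):(\d{2})\b")

def to_seconds(h: str, m: str, s: str) -> int:
    return int(h) * 3600 + int(m) * 60 + int(s)

def rebase(lines: list[str]) -> list[str]:
    # Shift every timestamp so the section's first one becomes 00:00:00.
    first, out = None, []
    for line in lines:
        m = TS.search(line)
        if m:
            if first is None:
                first = to_seconds(*m.groups())
            t = to_seconds(*m.groups()) - first
            stamp = f"{t // 3600:02d}:{t % 3600 // 60:02d}:{t % 60:02d}"
            line = TS.sub(stamp, line, count=1)
        out.append(line)
    return out

print(rebase(["00:41:10 讲师: 我们复盘一下。", "00:42:05 学员: 好的。"]))
# → ['00:00:00 讲师: 我们复盘一下。', '00:00:55 学员: 好的。']
```

Speaker labels and content pass through untouched, which is the "content integrity" guarantee the scripts aim for.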
Batch Processing
Process Multiple Files
```bash
# Simple loop
for file in meeting_*.md; do
    uv run scripts/fix_transcription.py --input "$file" --stage 3 --domain embodied_ai
done
```

```bash
# With error handling
for file in meeting_*.md; do
    echo "Processing $file..."
    if uv run scripts/fix_transcription.py --input "$file" --stage 3 --domain embodied_ai; then
        echo "✅ $file completed"
    else
        echo "❌ $file failed"
    fi
done
```
Parallel Processing
```bash
# GNU parallel (install: brew install parallel)
ls meeting_*.md | parallel -j 4 \
    "uv run scripts/fix_transcription.py --input {} --stage 3 --domain embodied_ai"
```
Caution: Monitor API rate limits when processing in parallel.
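To stay under a rate limit while keeping some parallelism, bounded workers plus a small delay between submissions is one simple pattern. A sketch only: `process` is a placeholder, not a real entry point; in practice it would invoke fix_transcription.py (e.g. via subprocess):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def process(path: str) -> str:
    # Placeholder for one fix_transcription.py run on `path`.
    return f"done: {path}"

files = ["meeting_1.md", "meeting_2.md", "meeting_3.md"]

# Cap in-flight work at 4 jobs, and pace submissions so API
# calls are not fired in a single burst.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = []
    for path in files:
        futures.append(pool.submit(process, path))
        time.sleep(0.01)  # throttle submissions; tune to your quota
    results = [f.result() for f in futures]

print(results)
```

Collecting futures in submission order keeps the results aligned with the input file list.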
Maintenance Workflows
Weekly: Review Learning
```bash
# Review suggestions
uv run scripts/fix_transcription.py --review-learned

# Approve high-confidence patterns
uv run scripts/fix_transcription.py --approve "错误1" "正确1"
uv run scripts/fix_transcription.py --approve "错误2" "正确2"
```
Monthly: Export and Backup
```bash
# Export all domains
uv run scripts/fix_transcription.py --export general_$(date +%Y%m%d).json --domain general
uv run scripts/fix_transcription.py --export embodied_ai_$(date +%Y%m%d).json --domain embodied_ai

# Backup database
cp ~/.transcript-fixer/corrections.db ~/backups/corrections_$(date +%Y%m%d).db

# Database maintenance
sqlite3 ~/.transcript-fixer/corrections.db "VACUUM; REINDEX; ANALYZE;"
```
Quarterly: Clean Up
```bash
# Archive old history (> 90 days)
sqlite3 ~/.transcript-fixer/corrections.db "
DELETE FROM correction_history
WHERE run_timestamp < datetime('now', '-90 days');
"

# Reject low-confidence suggestions
sqlite3 ~/.transcript-fixer/corrections.db "
UPDATE learned_suggestions
SET status = 'rejected'
WHERE confidence < 0.6 AND frequency < 3;
"
```
Next Steps
- See `best_practices.md` for optimization tips
- See `troubleshooting.md` for error resolution
- See `file_formats.md` for the database schema
- See `script_parameters.md` for advanced CLI options