claude-code-skills-reference/transcript-fixer/references/workflow_guide.md
daymade 135a1873af feat(transcript-fixer): add timestamp repair and section splitting scripts
New scripts:
- fix_transcript_timestamps.py: Repair malformed timestamps (HH:MM:SS format)
- split_transcript_sections.py: Split transcript by keywords and rebase timestamps
- Automated tests for both scripts

Features:
- Timestamp validation and repair (handle missing colons, invalid ranges)
- Section splitting with custom names
- Rebase timestamps to 00:00:00 for each section
- Preserve speaker format and content integrity
- In-place editing with backup

Documentation updates:
- Add usage examples to SKILL.md
- Clarify dictionary iteration workflow (save stable patterns only)
- Update workflow guides with new script references
- Add script parameter documentation

Use cases:
- Fix ASR output with broken timestamps
- Split long meetings into focused sections
- Prepare sections for independent processing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-11 13:59:36 +08:00


# Workflow Guide
Detailed step-by-step workflows for transcript correction and management.
## Table of Contents
- [Pre-Flight Checklist](#pre-flight-checklist)
- [Initial Setup](#initial-setup)
- [File Preparation](#file-preparation)
- [Execution Parameters](#execution-parameters)
- [Environment](#environment)
- [Core Workflows](#core-workflows)
- [1. First-Time Correction](#1-first-time-correction)
- [2. Iterative Improvement](#2-iterative-improvement)
- [3. Domain-Specific Corrections](#3-domain-specific-corrections)
- [4. Team Collaboration](#4-team-collaboration)
- [5. Stage-by-Stage Execution](#5-stage-by-stage-execution)
- [6. Context-Aware Rules](#6-context-aware-rules)
- [7. Diff Report Generation](#7-diff-report-generation)
- [8. Workshop Transcript Split + Timestamp Rebase](#8-workshop-transcript-split--timestamp-rebase)
- [Batch Processing](#batch-processing)
- [Process Multiple Files](#process-multiple-files)
- [Parallel Processing](#parallel-processing)
- [Maintenance Workflows](#maintenance-workflows)
- [Weekly: Review Learning](#weekly-review-learning)
- [Monthly: Export and Backup](#monthly-export-and-backup)
- [Quarterly: Clean Up](#quarterly-clean-up)
- [Next Steps](#next-steps)
## Pre-Flight Checklist
Before running corrections, verify these prerequisites:
### Initial Setup
- [ ] Initialized with `uv run scripts/fix_transcription.py --init`
- [ ] Database exists at `~/.transcript-fixer/corrections.db`
- [ ] `GLM_API_KEY` environment variable set (run `echo $GLM_API_KEY`)
- [ ] Configuration validated (run `--validate`)
### File Preparation
- [ ] Input file exists and is readable
- [ ] File uses supported format (`.md`, `.txt`)
- [ ] File encoding is UTF-8
- [ ] File size is reasonable (<10MB for first runs)
### Execution Parameters
- [ ] Using `--stage 3` for full pipeline (or specific stage if testing)
- [ ] Domain specified with `--domain` if using specialized dictionaries
- [ ] Using `--merge` flag when importing team corrections
### Environment
- [ ] Sufficient disk space for output files (~2x input size)
- [ ] API quota available for Stage 2 corrections
- [ ] Network connectivity for API calls
**Quick validation**:
```bash
uv run scripts/fix_transcription.py --validate && echo $GLM_API_KEY
```
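The file-preparation checks above can also be scripted. This is an illustrative sketch, not part of the tool: the `preflight` helper and its 10 MiB limit are assumptions mirroring the checklist.

```python
from pathlib import Path

MAX_BYTES = 10 * 1024 * 1024  # checklist suggests <10MB for first runs

def preflight(path: str) -> list[str]:
    """Return a list of problems; an empty list means ready to run."""
    p = Path(path)
    if not p.is_file():
        return [f"{path}: file not found"]
    problems = []
    if p.suffix not in {".md", ".txt"}:
        problems.append(f"{path}: unsupported extension {p.suffix}")
    if p.stat().st_size > MAX_BYTES:
        problems.append(f"{path}: larger than 10 MiB")
    try:
        p.read_bytes().decode("utf-8")  # verify UTF-8 encoding
    except UnicodeDecodeError:
        problems.append(f"{path}: not valid UTF-8")
    return problems
```

Run it before a batch job and skip any file that returns a non-empty list.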
## Core Workflows
### 1. First-Time Correction
**Goal**: Correct a transcript for the first time.
**Steps**:
1. **Initialize** (if not done):
```bash
uv run scripts/fix_transcription.py --init
export GLM_API_KEY="your-key"
```
2. **Add initial corrections** (5-10 common errors):
```bash
uv run scripts/fix_transcription.py --add "常见错误1" "正确词1" --domain general
uv run scripts/fix_transcription.py --add "常见错误2" "正确词2" --domain general
```
3. **Test on small sample** (Stage 1 only):
```bash
uv run scripts/fix_transcription.py --input sample.md --stage 1
less sample_stage1.md # Review output
```
4. **Run full pipeline**:
```bash
uv run scripts/fix_transcription.py --input transcript.md --stage 3 --domain general
```
5. **Review outputs**:
```bash
# Stage 1: Dictionary corrections
less transcript_stage1.md
# Stage 2: Final corrected version
less transcript_stage2.md
# Generate diff report
uv run scripts/diff_generator.py transcript.md transcript_stage1.md transcript_stage2.md
```
**Expected duration**:
- Stage 1: Instant (dictionary lookup)
- Stage 2: ~1-2 minutes per 1000 lines (API calls)
### 2. Iterative Improvement
**Goal**: Improve correction quality over time through learning.
**Steps**:
1. **Run corrections** on 3-5 similar transcripts:
```bash
uv run scripts/fix_transcription.py --input day1.md --stage 3 --domain embodied_ai
uv run scripts/fix_transcription.py --input day2.md --stage 3 --domain embodied_ai
uv run scripts/fix_transcription.py --input day3.md --stage 3 --domain embodied_ai
```
2. **Review learned suggestions**:
```bash
uv run scripts/fix_transcription.py --review-learned
```
**Output example**:
```
📚 Learned Suggestions (Pending Review)
========================================
1. "巨升方向" → "具身方向"
Frequency: 5 Confidence: 0.95
Examples: day1.md (line 45), day2.md (line 23), ...
2. "奇迹创坛" → "奇绩创坛"
Frequency: 3 Confidence: 0.87
Examples: day1.md (line 102), day3.md (line 67)
```
3. **Approve high-quality suggestions**:
```bash
uv run scripts/fix_transcription.py --approve "巨升方向" "具身方向"
uv run scripts/fix_transcription.py --approve "奇迹创坛" "奇绩创坛"
```
4. **Verify approved corrections**:
```bash
uv run scripts/fix_transcription.py --list --domain embodied_ai | grep "learned"
```
5. **Run next batch** (benefits from approved corrections):
```bash
uv run scripts/fix_transcription.py --input day4.md --stage 3 --domain embodied_ai
```
**Impact**: Approved corrections move to Stage 1 (instant, free).
**Cycle**: Repeat every 3-5 transcripts for continuous improvement.
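When reviewing in bulk, a simple threshold filter over the frequency/confidence columns shown above can pre-select candidates. A minimal sketch; the thresholds and the tuple shape are illustrative assumptions, not tool behavior:

```python
def auto_approvable(suggestions, min_confidence=0.9, min_frequency=3):
    """Pick learned suggestions worth approving in bulk.

    Each suggestion is a (wrong, right, frequency, confidence) tuple;
    the thresholds here are illustrative, not built into the tool.
    """
    return [
        (wrong, right)
        for wrong, right, freq, conf in suggestions
        if conf >= min_confidence and freq >= min_frequency
    ]

# Emit the corresponding --approve commands for review:
for wrong, right in auto_approvable([("巨升方向", "具身方向", 5, 0.95)]):
    print(f'uv run scripts/fix_transcription.py --approve "{wrong}" "{right}"')
```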
### 3. Domain-Specific Corrections
**Goal**: Build specialized dictionaries for different fields.
**Steps**:
1. **Identify domain**:
- `embodied_ai` - Robotics, AI terminology
- `finance` - Financial terminology
- `medical` - Medical terminology
- `general` - General-purpose
2. **Add domain-specific terms**:
```bash
# Embodied AI domain
uv run scripts/fix_transcription.py --add "巨升智能" "具身智能" --domain embodied_ai
uv run scripts/fix_transcription.py --add "机器学习" "机器学习" --domain embodied_ai # Keep as-is
# Finance domain
uv run scripts/fix_transcription.py --add "股价" "股价" --domain finance # Keep as-is
uv run scripts/fix_transcription.py --add "PE比率" "市盈率" --domain finance
```
3. **Use appropriate domain** when correcting:
```bash
# AI meeting transcript
uv run scripts/fix_transcription.py --input ai_meeting.md --stage 3 --domain embodied_ai
# Financial report transcript
uv run scripts/fix_transcription.py --input earnings_call.md --stage 3 --domain finance
```
4. **Review domain statistics**:
```bash
sqlite3 ~/.transcript-fixer/corrections.db "SELECT * FROM correction_statistics;"
```
**Benefits**:
- Prevents cross-domain conflicts
- Higher accuracy per domain
- Targeted vocabulary building
### 4. Team Collaboration
**Goal**: Share corrections across team members.
**Steps**:
#### Setup (One-time per team)
1. **Create shared repository**:
```bash
mkdir transcript-corrections
cd transcript-corrections
git init
# .gitignore
printf '*.db\n*.db-journal\n*.bak\n' > .gitignore
```
2. **Export initial corrections**:
```bash
uv run scripts/fix_transcription.py --export general.json --domain general
uv run scripts/fix_transcription.py --export embodied_ai.json --domain embodied_ai
git add *.json
git commit -m "Initial correction dictionaries"
git push origin main
```
#### Daily Workflow
**Team Member A** (adds new corrections):
```bash
# 1. Run corrections
uv run scripts/fix_transcription.py --input transcript.md --stage 3 --domain embodied_ai
# 2. Review and approve learned suggestions
uv run scripts/fix_transcription.py --review-learned
uv run scripts/fix_transcription.py --approve "新错误" "正确词"
# 3. Export updated corrections
uv run scripts/fix_transcription.py --export embodied_ai_$(date +%Y%m%d).json --domain embodied_ai
# 4. Commit and push
git add embodied_ai_*.json
git commit -m "Add embodied AI corrections from today's transcripts"
git push origin main
```
**Team Member B** (imports team corrections):
```bash
# 1. Pull latest corrections
git pull origin main
# 2. Import with merge
uv run scripts/fix_transcription.py --import embodied_ai_20250128.json --merge
# 3. Verify
uv run scripts/fix_transcription.py --list --domain embodied_ai | tail -10
```
**Conflict resolution**: See `team_collaboration.md` for handling merge conflicts.
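Conceptually, `--merge` reconciles an incoming export with your local dictionary and surfaces disagreements. A sketch under the assumption that each export maps wrong term to correction (see `file_formats.md` for the real schema):

```python
def merge_corrections(local: dict, incoming: dict) -> tuple[dict, dict]:
    """Merge incoming corrections into local, collecting conflicts.

    Assumes each dict maps wrong term -> correction (an illustrative
    simplification of the export format). Returns (merged, conflicts);
    conflicts maps a wrong term to its (local, incoming) pair, and the
    local value is kept pending manual resolution.
    """
    merged = dict(local)
    conflicts = {}
    for wrong, right in incoming.items():
        if wrong in merged and merged[wrong] != right:
            conflicts[wrong] = (merged[wrong], right)
        else:
            merged[wrong] = right
    return merged, conflicts
```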
### 5. Stage-by-Stage Execution
**Goal**: Test dictionary changes without wasting API quota.
#### Stage 1 Only (Dictionary)
**Use when**: Testing new corrections, verifying domain setup.
```bash
uv run scripts/fix_transcription.py --input file.md --stage 1 --domain general
```
**Output**: `file_stage1.md` with dictionary corrections only.
**Review**: Check if dictionary corrections are sufficient.
#### Stage 2 Only (AI)
**Use when**: Running AI corrections on pre-processed file.
**Prerequisites**: Stage 1 output exists.
```bash
# Stage 1 first
uv run scripts/fix_transcription.py --input file.md --stage 1
# Then Stage 2
uv run scripts/fix_transcription.py --input file_stage1.md --stage 2
```
**Output**: `file_stage1_stage2.md` (confusing stacked naming; prefer Stage 3 instead).
#### Stage 3 (Full Pipeline)
**Use when**: Production runs, full correction workflow.
```bash
uv run scripts/fix_transcription.py --input file.md --stage 3 --domain general
```
**Output**: Both `file_stage1.md` and `file_stage2.md`.
**Recommended**: Use Stage 3 for most workflows.
### 6. Context-Aware Rules
**Goal**: Handle edge cases with regex patterns.
**Use cases**:
- Positional corrections (e.g., "的" vs "地")
- Multi-word patterns
- Conditional corrections
**Steps**:
1. **Identify pattern** that simple dictionary can't handle:
```
Problem: "近距离的去看" (wrong - should be "地")
Problem: "近距离搏杀" (correct - should keep "的")
```
2. **Add context rules**:
```bash
sqlite3 ~/.transcript-fixer/corrections.db
-- Higher priority for specific context
INSERT INTO context_rules (pattern, replacement, description, priority)
VALUES ('近距离的去看', '近距离地去看', '的→地 before verb', 10);
-- Lower priority for general pattern
INSERT INTO context_rules (pattern, replacement, description, priority)
VALUES ('近距离搏杀', '近距离搏杀', 'Keep 的 for noun modifier', 5);
.quit
```
3. **Test context rules**:
```bash
uv run scripts/fix_transcription.py --input test.md --stage 1
```
4. **Validate**:
```bash
uv run scripts/fix_transcription.py --validate
```
**Priority**: Higher numbers run first (use for exceptions/edge cases).
See `file_formats.md` for context_rules schema.
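The priority semantics can be sketched as follows: a specific high-priority rule rewrites its context before a broader low-priority rule can touch it. This is an illustrative model of the behavior, not the tool's actual implementation:

```python
import re

def apply_context_rules(text: str, rules: list[tuple[str, str, int]]) -> str:
    """Apply (pattern, replacement, priority) rules, highest priority first."""
    for pattern, replacement, _prio in sorted(rules, key=lambda r: -r[2]):
        text = re.sub(pattern, replacement, text)
    return text
```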
### 7. Diff Report Generation
**Goal**: Visualize all changes for review.
**Use when**:
- Reviewing corrections before publishing
- Training new team members
- Documenting ASR error patterns
**Steps**:
1. **Run corrections**:
```bash
uv run scripts/fix_transcription.py --input transcript.md --stage 3
```
2. **Generate diff reports**:
```bash
uv run scripts/diff_generator.py \
transcript.md \
transcript_stage1.md \
transcript_stage2.md
```
3. **Review outputs**:
```bash
# Markdown report (statistics + summary)
less diff_report.md
# Unified diff (git-style)
less transcript_unified.diff
# HTML side-by-side (visual review)
open transcript_sidebyside.html
# Inline markers (for editing)
less transcript_inline.md
```
**Report contents**:
- Total changes count
- Stage 1 vs Stage 2 breakdown
- Character/word count changes
- Side-by-side comparison
See `script_parameters.md` for advanced diff options.
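For a quick sanity check without the full report suite, the change counts can be approximated with the standard library. A minimal stand-in for `diff_generator.py`'s summary statistics, not its actual implementation:

```python
import difflib

def diff_stats(original: str, corrected: str) -> dict:
    """Count added/removed lines between two transcript versions."""
    diff = list(difflib.unified_diff(
        original.splitlines(), corrected.splitlines(), lineterm=""))
    # Skip the "---"/"+++" file headers when counting changed lines.
    added = sum(1 for l in diff if l.startswith("+") and not l.startswith("+++"))
    removed = sum(1 for l in diff if l.startswith("-") and not l.startswith("---"))
    return {"added": added, "removed": removed}
```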
### 8. Workshop Transcript Split + Timestamp Rebase
**Goal**: Split a long workshop transcript into sections such as setup chat, class, and debrief, then make each section start from `00:00:00`.
**Steps**:
1. **Correct transcript text first** (dictionary + AI/manual review)
2. **Pick marker phrases** for each section boundary
3. **Split and rebase**:
```bash
uv run scripts/split_transcript_sections.py workshop.txt \
--first-section-name "课前聊天" \
--section "正式上课::好,无缝切换嘛。对。那个曹总连上了吗?那个网页。" \
--section "课后复盘::我们复盘一下。" \
--rebase-to-zero
```
4. **If you already split the files**, rebase a single file directly:
```bash
uv run scripts/fix_transcript_timestamps.py class.txt --in-place --rebase-to-zero
```
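The core of `--rebase-to-zero` can be sketched as shifting every timestamp so the first one becomes `00:00:00`. This sketch assumes `HH:MM:SS` stamps embedded in lines; the real `fix_transcript_timestamps.py` also repairs malformed stamps (missing colons, invalid ranges):

```python
import re

TS = re.compile(r"\b(\d{1,2}):(\d{2}):(\d{2})\b")

def rebase_to_zero(lines: list[str]) -> list[str]:
    """Shift every HH:MM:SS timestamp so the first one reads 00:00:00."""
    def to_seconds(m):
        h, mi, s = map(int, m.groups())
        return h * 3600 + mi * 60 + s

    # Find the section's first timestamp to use as the new origin.
    first = None
    for line in lines:
        m = TS.search(line)
        if m:
            first = to_seconds(m)
            break
    if first is None:
        return lines  # nothing to rebase

    def shift(m):
        t = to_seconds(m) - first
        return f"{t // 3600:02d}:{t % 3600 // 60:02d}:{t % 60:02d}"

    return [TS.sub(shift, line) for line in lines]
```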
## Batch Processing
### Process Multiple Files
```bash
# Simple loop
for file in meeting_*.md; do
uv run scripts/fix_transcription.py --input "$file" --stage 3 --domain embodied_ai
done
# With error handling
for file in meeting_*.md; do
echo "Processing $file..."
if uv run scripts/fix_transcription.py --input "$file" --stage 3 --domain embodied_ai; then
echo "✅ $file completed"
else
echo "❌ $file failed"
fi
done
```
### Parallel Processing
```bash
# GNU parallel (install: brew install parallel)
ls meeting_*.md | parallel -j 4 \
"uv run scripts/fix_transcription.py --input {} --stage 3 --domain embodied_ai"
```
**Caution**: Monitor API rate limits when processing in parallel.
## Maintenance Workflows
### Weekly: Review Learning
```bash
# Review suggestions
uv run scripts/fix_transcription.py --review-learned
# Approve high-confidence patterns
uv run scripts/fix_transcription.py --approve "错误1" "正确1"
uv run scripts/fix_transcription.py --approve "错误2" "正确2"
```
### Monthly: Export and Backup
```bash
# Export all domains
uv run scripts/fix_transcription.py --export general_$(date +%Y%m%d).json --domain general
uv run scripts/fix_transcription.py --export embodied_ai_$(date +%Y%m%d).json --domain embodied_ai
# Backup database
cp ~/.transcript-fixer/corrections.db ~/backups/corrections_$(date +%Y%m%d).db
# Database maintenance
sqlite3 ~/.transcript-fixer/corrections.db "VACUUM; REINDEX; ANALYZE;"
```
### Quarterly: Clean Up
```bash
# Archive old history (> 90 days)
sqlite3 ~/.transcript-fixer/corrections.db "
DELETE FROM correction_history
WHERE run_timestamp < datetime('now', '-90 days');
"
# Reject low-confidence suggestions
sqlite3 ~/.transcript-fixer/corrections.db "
UPDATE learned_suggestions
SET status = 'rejected'
WHERE confidence < 0.6 AND frequency < 3;
"
```
## Next Steps
- See `best_practices.md` for optimization tips
- See `troubleshooting.md` for error resolution
- See `file_formats.md` for database schema
- See `script_parameters.md` for advanced CLI options