# Workflow Guide

Detailed step-by-step workflows for transcript correction and management.

## Table of Contents

- [Pre-Flight Checklist](#pre-flight-checklist)
  - [Initial Setup](#initial-setup)
  - [File Preparation](#file-preparation)
  - [Execution Parameters](#execution-parameters)
  - [Environment](#environment)
- [Core Workflows](#core-workflows)
  - [1. First-Time Correction](#1-first-time-correction)
  - [2. Iterative Improvement](#2-iterative-improvement)
  - [3. Domain-Specific Corrections](#3-domain-specific-corrections)
  - [4. Team Collaboration](#4-team-collaboration)
  - [5. Stage-by-Stage Execution](#5-stage-by-stage-execution)
  - [6. Context-Aware Rules](#6-context-aware-rules)
  - [7. Diff Report Generation](#7-diff-report-generation)
  - [8. Workshop Transcript Split + Timestamp Rebase](#8-workshop-transcript-split--timestamp-rebase)
- [Batch Processing](#batch-processing)
  - [Process Multiple Files](#process-multiple-files)
  - [Parallel Processing](#parallel-processing)
- [Maintenance Workflows](#maintenance-workflows)
  - [Weekly: Review Learning](#weekly-review-learning)
  - [Monthly: Export and Backup](#monthly-export-and-backup)
  - [Quarterly: Clean Up](#quarterly-clean-up)
- [Next Steps](#next-steps)

## Pre-Flight Checklist

Before running corrections, verify these prerequisites:

### Initial Setup

- [ ] Initialized with `uv run scripts/fix_transcription.py --init`
- [ ] Database exists at `~/.transcript-fixer/corrections.db`
- [ ] `GLM_API_KEY` environment variable set (run `echo $GLM_API_KEY`)
- [ ] Configuration validated (run `--validate`)

### File Preparation

- [ ] Input file exists and is readable
- [ ] File uses supported format (`.md`, `.txt`)
- [ ] File encoding is UTF-8
- [ ] File size is reasonable (<10MB for first runs)

### Execution Parameters

- [ ] Using `--stage 3` for full pipeline (or specific stage if testing)
- [ ] Domain specified with `--domain` if using specialized dictionaries
- [ ] Using `--merge` flag when importing team corrections

### Environment

- [ ] Sufficient disk space for output files (~2x input size)
- [ ] API quota available for Stage 2 corrections
- [ ] Network connectivity for API calls

**Quick validation**:

```bash
uv run scripts/fix_transcription.py --validate && echo $GLM_API_KEY
```

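The File Preparation items can also be checked mechanically before a batch run. The following is a minimal sketch, not part of the project's scripts: `preflight_file`, `SUPPORTED_SUFFIXES`, and `MAX_BYTES` are hypothetical names, and the 10 MB limit simply mirrors the checklist guideline.

```python
from pathlib import Path

SUPPORTED_SUFFIXES = {".md", ".txt"}   # formats named in the checklist
MAX_BYTES = 10 * 1024 * 1024           # 10 MB guideline for first runs


def preflight_file(path: str) -> list[str]:
    """Return a list of problems found; an empty list means the file passes."""
    p = Path(path)
    if not p.is_file():
        return [f"{path}: file does not exist"]
    problems = []
    if p.suffix.lower() not in SUPPORTED_SUFFIXES:
        problems.append(f"{path}: unsupported extension {p.suffix!r}")
    if p.stat().st_size >= MAX_BYTES:
        problems.append(f"{path}: larger than 10 MB")
    try:
        # Decoding the whole file is the simplest reliable UTF-8 check.
        p.read_bytes().decode("utf-8")
    except UnicodeDecodeError:
        problems.append(f"{path}: not valid UTF-8")
    return problems
```

Calling `preflight_file("transcript.md")` and printing any returned messages before starting Stage 2 avoids spending API quota on a file that would fail anyway.
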
## Core Workflows

### 1. First-Time Correction

**Goal**: Correct a transcript for the first time.

**Steps**:

1. **Initialize** (if not done):
   ```bash
   uv run scripts/fix_transcription.py --init
   export GLM_API_KEY="your-key"
   ```

2. **Add initial corrections** (5-10 common errors):
   ```bash
   uv run scripts/fix_transcription.py --add "常见错误1" "正确词1" --domain general
   uv run scripts/fix_transcription.py --add "常见错误2" "正确词2" --domain general
   ```

3. **Test on small sample** (Stage 1 only):
   ```bash
   uv run scripts/fix_transcription.py --input sample.md --stage 1
   less sample_stage1.md  # Review output
   ```

4. **Run full pipeline**:
   ```bash
   uv run scripts/fix_transcription.py --input transcript.md --stage 3 --domain general
   ```

5. **Review outputs**:
   ```bash
   # Stage 1: Dictionary corrections
   less transcript_stage1.md

   # Stage 2: Final corrected version
   less transcript_stage2.md

   # Generate diff report
   uv run scripts/diff_generator.py transcript.md transcript_stage1.md transcript_stage2.md
   ```

**Expected duration**:
- Stage 1: Instant (dictionary lookup)
- Stage 2: ~1-2 minutes per 1000 lines (API calls)

### 2. Iterative Improvement

**Goal**: Improve correction quality over time through learning.

**Steps**:

1. **Run corrections** on 3-5 similar transcripts:
   ```bash
   uv run scripts/fix_transcription.py --input day1.md --stage 3 --domain embodied_ai
   uv run scripts/fix_transcription.py --input day2.md --stage 3 --domain embodied_ai
   uv run scripts/fix_transcription.py --input day3.md --stage 3 --domain embodied_ai
   ```

2. **Review learned suggestions**:
   ```bash
   uv run scripts/fix_transcription.py --review-learned
   ```

   **Output example**:
   ```
   📚 Learned Suggestions (Pending Review)
   ========================================

   1. "巨升方向" → "具身方向"
      Frequency: 5  Confidence: 0.95
      Examples: day1.md (line 45), day2.md (line 23), ...

   2. "奇迹创坛" → "奇绩创坛"
      Frequency: 3  Confidence: 0.87
      Examples: day1.md (line 102), day3.md (line 67)
   ```

3. **Approve high-quality suggestions**:
   ```bash
   uv run scripts/fix_transcription.py --approve "巨升方向" "具身方向"
   uv run scripts/fix_transcription.py --approve "奇迹创坛" "奇绩创坛"
   ```

4. **Verify approved corrections**:
   ```bash
   uv run scripts/fix_transcription.py --list --domain embodied_ai | grep "learned"
   ```

5. **Run next batch** (benefits from approved corrections):
   ```bash
   uv run scripts/fix_transcription.py --input day4.md --stage 3 --domain embodied_ai
   ```

**Impact**: Approved corrections move to Stage 1 (instant, free).

**Cycle**: Repeat every 3-5 transcripts for continuous improvement.

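Deciding which learned suggestions to approve comes down to frequency and confidence. The sketch below shows one way to shortlist candidates. It is illustrative only: the `Suggestion` type, `approval_candidates`, and both threshold defaults are assumptions, not the tool's actual review logic.

```python
from typing import NamedTuple


class Suggestion(NamedTuple):
    wrong: str
    correct: str
    frequency: int      # how many times the pattern was seen
    confidence: float   # score between 0 and 1


def approval_candidates(suggestions, min_frequency=3, min_confidence=0.8):
    """Keep suggestions that are both frequent and confident, best first."""
    kept = [
        s for s in suggestions
        if s.frequency >= min_frequency and s.confidence >= min_confidence
    ]
    return sorted(kept, key=lambda s: (s.confidence, s.frequency), reverse=True)
```

With these assumed thresholds, both entries in the output example above would be shortlisted, while a pattern seen once with low confidence would stay pending.
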
### 3. Domain-Specific Corrections

**Goal**: Build specialized dictionaries for different fields.

**Steps**:

1. **Identify domain**:
   - `embodied_ai` - Robotics, AI terminology
   - `finance` - Financial terminology
   - `medical` - Medical terminology
   - `general` - General-purpose

2. **Add domain-specific terms**:
   ```bash
   # Embodied AI domain
   uv run scripts/fix_transcription.py --add "巨升智能" "具身智能" --domain embodied_ai
   uv run scripts/fix_transcription.py --add "机器学习" "机器学习" --domain embodied_ai  # Keep as-is

   # Finance domain
   uv run scripts/fix_transcription.py --add "股价" "股价" --domain finance  # Keep as-is
   uv run scripts/fix_transcription.py --add "PE比率" "市盈率" --domain finance
   ```

3. **Use appropriate domain** when correcting:
   ```bash
   # AI meeting transcript
   uv run scripts/fix_transcription.py --input ai_meeting.md --stage 3 --domain embodied_ai

   # Financial report transcript
   uv run scripts/fix_transcription.py --input earnings_call.md --stage 3 --domain finance
   ```

4. **Review domain statistics**:
   ```bash
   sqlite3 ~/.transcript-fixer/corrections.db "SELECT * FROM correction_statistics;"
   ```

**Benefits**:
- Prevents cross-domain conflicts
- Higher accuracy per domain
- Targeted vocabulary building

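To see why separate domains prevent cross-domain conflicts, here is a minimal sketch of a domain-scoped dictionary pass. It is not the Stage 1 implementation: `apply_dictionary` and the nested-dict layout are assumptions made for illustration.

```python
def apply_dictionary(text: str, corrections: dict[str, dict[str, str]], domain: str) -> str:
    """Apply only the corrections registered for `domain`, longest keys first."""
    table = corrections.get(domain, {})
    # Longest keys first, so a short entry cannot clobber part of a longer one.
    for wrong in sorted(table, key=len, reverse=True):
        text = text.replace(wrong, table[wrong])
    return text


corrections = {
    "embodied_ai": {"巨升智能": "具身智能"},
    "finance": {"PE比率": "市盈率"},
}
```

Running a finance transcript with `domain="finance"` leaves any `embodied_ai` entries untouched, so a term that is an error in one field cannot overwrite a valid term in another.
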
### 4. Team Collaboration

**Goal**: Share corrections across team members.

**Steps**:

#### Setup (One-time per team)

1. **Create shared repository**:
   ```bash
   mkdir transcript-corrections
   cd transcript-corrections
   git init

   # .gitignore (printf writes one pattern per line; plain echo would not expand \n in bash)
   printf '%s\n' '*.db' '*.db-journal' '*.bak' > .gitignore
   ```

2. **Export initial corrections**:
   ```bash
   uv run scripts/fix_transcription.py --export general.json --domain general
   uv run scripts/fix_transcription.py --export embodied_ai.json --domain embodied_ai

   git add *.json
   git commit -m "Initial correction dictionaries"
   # Assumes a remote named "origin" has already been configured
   git push origin main
   ```

#### Daily Workflow

**Team Member A** (adds new corrections):

```bash
# 1. Run corrections
uv run scripts/fix_transcription.py --input transcript.md --stage 3 --domain embodied_ai

# 2. Review and approve learned suggestions
uv run scripts/fix_transcription.py --review-learned
uv run scripts/fix_transcription.py --approve "新错误" "正确词"

# 3. Export updated corrections
uv run scripts/fix_transcription.py --export embodied_ai_$(date +%Y%m%d).json --domain embodied_ai

# 4. Commit and push
git add embodied_ai_*.json
git commit -m "Add embodied AI corrections from today's transcripts"
git push origin main
```

**Team Member B** (imports team corrections):

```bash
# 1. Pull latest corrections
git pull origin main

# 2. Import with merge
uv run scripts/fix_transcription.py --import embodied_ai_20250128.json --merge

# 3. Verify
uv run scripts/fix_transcription.py --list --domain embodied_ai | tail -10
```

**Conflict resolution**: See `team_collaboration.md` for handling merge conflicts.

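The idea behind a merging import can be sketched as follows. This is an assumption-laden illustration, not the behavior of `--merge`: the real export format is defined in `file_formats.md`, and the flat `wrong -> correct` mapping and the `merge_corrections` helper here are hypothetical.

```python
def merge_corrections(local: dict[str, str], incoming: dict[str, str]):
    """Merge `incoming` into a copy of `local`.

    New keys are added. Keys that already exist locally with a different
    value are left untouched and reported as conflicts for manual review.
    """
    merged = dict(local)
    conflicts = {}
    for wrong, correct in incoming.items():
        if wrong not in merged:
            merged[wrong] = correct
        elif merged[wrong] != correct:
            conflicts[wrong] = (merged[wrong], correct)
    return merged, conflicts
```

Keeping the local value on conflict is a conservative choice for this sketch: nothing a team member has approved is silently overwritten, and the conflict list shows what to discuss.
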
### 5. Stage-by-Stage Execution

**Goal**: Test dictionary changes without wasting API quota.

#### Stage 1 Only (Dictionary)

**Use when**: Testing new corrections, verifying domain setup.

```bash
uv run scripts/fix_transcription.py --input file.md --stage 1 --domain general
```

**Output**: `file_stage1.md` with dictionary corrections only.

**Review**: Check if dictionary corrections are sufficient.

#### Stage 2 Only (AI)

**Use when**: Running AI corrections on pre-processed file.

**Prerequisites**: Stage 1 output exists.

```bash
# Stage 1 first
uv run scripts/fix_transcription.py --input file.md --stage 1

# Then Stage 2
uv run scripts/fix_transcription.py --input file_stage1.md --stage 2
```

**Output**: `file_stage1_stage2.md`. The doubled suffix is confusing, so prefer Stage 3 unless you need to run the stages separately.

#### Stage 3 (Full Pipeline)

**Use when**: Production runs, full correction workflow.

```bash
uv run scripts/fix_transcription.py --input file.md --stage 3 --domain general
```

**Output**: Both `file_stage1.md` and `file_stage2.md`.

**Recommended**: Use Stage 3 for most workflows.

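The doubled `_stage1_stage2` suffix follows from a simple naming rule: each stage appends its own suffix to whatever input name it receives. The sketch below reproduces that rule. `stage_output_path` is a hypothetical helper inferred from the output names listed in this section, not code taken from the script.

```python
from pathlib import Path


def stage_output_path(input_path: str, stage: int) -> str:
    """Append `_stage<N>` to the file stem, keeping directory and extension."""
    p = Path(input_path)
    return str(p.with_name(f"{p.stem}_stage{stage}{p.suffix}"))
```

Feeding `file_stage1.md` back in as the input for stage 2 therefore yields `file_stage1_stage2.md`, whereas the full pipeline names both outputs from the original `file.md`.
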
### 6. Context-Aware Rules

**Goal**: Handle edge cases with regex patterns.

**Use cases**:
- Positional corrections (e.g., "的" vs "地")
- Multi-word patterns
- Conditional corrections

**Steps**:

1. **Identify pattern** that simple dictionary can't handle:
   ```
   Problem: "近距离的去看" (wrong - should be "地")
   Problem: "近距离搏杀" (correct - should keep "的")
   ```

2. **Add context rules** (the SQL is passed to `sqlite3` through a here-document so the block can be pasted into a shell as-is):
   ```bash
   sqlite3 ~/.transcript-fixer/corrections.db <<'SQL'
   -- Higher priority for specific context
   INSERT INTO context_rules (pattern, replacement, description, priority)
   VALUES ('近距离的去看', '近距离地去看', '的→地 before verb', 10);

   -- Lower priority for general pattern
   INSERT INTO context_rules (pattern, replacement, description, priority)
   VALUES ('近距离搏杀', '近距离搏杀', 'Keep 的 for noun modifier', 5);
   SQL
   ```

3. **Test context rules**:
   ```bash
   uv run scripts/fix_transcription.py --input test.md --stage 1
   ```

4. **Validate**:
   ```bash
   uv run scripts/fix_transcription.py --validate
   ```

**Priority**: Higher numbers run first (use for exceptions/edge cases).

See `file_formats.md` for context_rules schema.

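Priority ordering matters because an earlier rule changes the text that later rules see. The following sketch applies rules highest priority first. It assumes rules are plain `(pattern, replacement, priority)` tuples and that patterns are Python regular expressions; the tool's real rule engine may differ.

```python
import re


def apply_context_rules(text: str, rules) -> str:
    """Apply (pattern, replacement, priority) rules, highest priority first."""
    ordered = sorted(rules, key=lambda rule: rule[2], reverse=True)
    for pattern, replacement, _priority in ordered:
        text = re.sub(pattern, replacement, text)
    return text


rules = [
    ("近距离的去看", "近距离地去看", 10),  # specific exception, runs first
]
```

A specific exception with a high priority gets to rewrite its match before any broader, lower-priority pattern can touch the same characters.
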
### 7. Diff Report Generation

**Goal**: Visualize all changes for review.

**Use when**:
- Reviewing corrections before publishing
- Training new team members
- Documenting ASR error patterns

**Steps**:

1. **Run corrections**:
   ```bash
   uv run scripts/fix_transcription.py --input transcript.md --stage 3
   ```

2. **Generate diff reports**:
   ```bash
   uv run scripts/diff_generator.py \
     transcript.md \
     transcript_stage1.md \
     transcript_stage2.md
   ```

3. **Review outputs**:
   ```bash
   # Markdown report (statistics + summary)
   less diff_report.md

   # Unified diff (git-style)
   less transcript_unified.diff

   # HTML side-by-side (visual review)
   open transcript_sidebyside.html  # macOS; use xdg-open on Linux

   # Inline markers (for editing)
   less transcript_inline.md
   ```

**Report contents**:
- Total changes count
- Stage 1 vs Stage 2 breakdown
- Character/word count changes
- Side-by-side comparison

See `script_parameters.md` for advanced diff options.

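For a quick look at what changed between two stages without the full report, Python's standard `difflib` module can produce a git-style unified diff. This is a standalone sketch and makes no claim about how `diff_generator.py` is implemented; `unified_diff_text` is a hypothetical helper name.

```python
import difflib


def unified_diff_text(before: str, after: str, before_name: str, after_name: str) -> str:
    """Return a git-style unified diff between two versions of a transcript."""
    lines = difflib.unified_diff(
        before.splitlines(),
        after.splitlines(),
        fromfile=before_name,
        tofile=after_name,
        lineterm="",  # inputs have no trailing newlines, so headers should not either
    )
    return "\n".join(lines)
```

Passing the contents of `transcript.md` and `transcript_stage1.md` shows only the dictionary corrections; identical inputs produce an empty string.
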
### 8. Workshop Transcript Split + Timestamp Rebase

**Goal**: Split a long workshop transcript into sections such as setup chat, class, and debrief, then make each section start from `00:00:00`.

**Steps**:

1. **Correct transcript text first** (dictionary + AI/manual review)
2. **Pick marker phrases** for each section boundary
3. **Split and rebase**:

   ```bash
   uv run scripts/split_transcript_sections.py workshop.txt \
     --first-section-name "课前聊天" \
     --section "正式上课::好,无缝切换嘛。对。那个曹总连上了吗?那个网页。" \
     --section "课后复盘::我们复盘一下。" \
     --rebase-to-zero
   ```

4. **If you already split the files**, rebase a single file directly:

   ```bash
   uv run scripts/fix_transcript_timestamps.py class.txt --in-place --rebase-to-zero
   ```

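Rebasing to zero means subtracting the section's first timestamp from every timestamp in that section. The sketch below illustrates the arithmetic under two assumptions: timestamps appear as well-formed `HH:MM:SS`, and the first one in the text is the earliest. It is not the code of `fix_transcript_timestamps.py`, and all helper names are hypothetical.

```python
import re

# Lookarounds instead of \b, so a timestamp glued to CJK text still matches.
TIMESTAMP = re.compile(r"(?<!\d)(\d{2}):(\d{2}):(\d{2})(?!\d)")


def to_seconds(hours: str, minutes: str, seconds: str) -> int:
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds)


def format_timestamp(total_seconds: int) -> str:
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def rebase_to_zero(text: str) -> str:
    """Shift every HH:MM:SS timestamp so the first one becomes 00:00:00."""
    first = TIMESTAMP.search(text)
    if first is None:
        return text
    offset = to_seconds(*first.groups())

    def shift(match: re.Match) -> str:
        # Clamp at zero in case a later timestamp is earlier than the first.
        return format_timestamp(max(0, to_seconds(*match.groups()) - offset))

    return TIMESTAMP.sub(shift, text)
```

For example, a section whose first two timestamps are `01:10:05` and `01:12:30` comes out with `00:00:00` and `00:02:25`, while speaker labels and text are left unchanged.
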
## Batch Processing

### Process Multiple Files

```bash
# Simple loop
for file in meeting_*.md; do
  uv run scripts/fix_transcription.py --input "$file" --stage 3 --domain embodied_ai
done

# With error handling
for file in meeting_*.md; do
  echo "Processing $file..."
  if uv run scripts/fix_transcription.py --input "$file" --stage 3 --domain embodied_ai; then
    echo "✅ $file completed"
  else
    echo "❌ $file failed"
  fi
done
```

### Parallel Processing

```bash
# GNU parallel (install: brew install parallel)
# Passing the glob after ::: avoids parsing ls output and handles odd filenames
parallel -j 4 \
  uv run scripts/fix_transcription.py --input {} --stage 3 --domain embodied_ai \
  ::: meeting_*.md
```

**Caution**: Monitor API rate limits when processing in parallel.

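Where GNU parallel is not available, the same bounded concurrency can be sketched in Python. The command line mirrors the one used in the loops above; `build_command`, `run_one`, and `run_batch` are hypothetical helpers, and `max_workers=4` is only an example value chosen to match `-j 4`.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(path: str, domain: str) -> list[str]:
    """Build the full-pipeline command for one transcript."""
    return ["uv", "run", "scripts/fix_transcription.py",
            "--input", path, "--stage", "3", "--domain", domain]


def run_one(path: str, domain: str) -> int:
    """Run the pipeline for one file and return its exit code."""
    return subprocess.run(build_command(path, domain)).returncode


def run_batch(paths, domain: str, max_workers: int = 4, runner=run_one) -> dict[str, int]:
    """Run `runner` on every path with at most `max_workers` in flight."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        codes = pool.map(lambda p: runner(p, domain), paths)
        return dict(zip(paths, codes))
```

The returned mapping of path to exit code makes it easy to retry only the failures. Lowering `max_workers` is the simplest lever when the API starts rate limiting.
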
## Maintenance Workflows

### Weekly: Review Learning

```bash
# Review suggestions
uv run scripts/fix_transcription.py --review-learned

# Approve high-confidence patterns
uv run scripts/fix_transcription.py --approve "错误1" "正确1"
uv run scripts/fix_transcription.py --approve "错误2" "正确2"
```

### Monthly: Export and Backup

```bash
# Export all domains
uv run scripts/fix_transcription.py --export general_$(date +%Y%m%d).json --domain general
uv run scripts/fix_transcription.py --export embodied_ai_$(date +%Y%m%d).json --domain embodied_ai

# Backup database (create the backup directory if it does not exist yet)
mkdir -p ~/backups
cp ~/.transcript-fixer/corrections.db ~/backups/corrections_$(date +%Y%m%d).db

# Database maintenance
sqlite3 ~/.transcript-fixer/corrections.db "VACUUM; REINDEX; ANALYZE;"
```

### Quarterly: Clean Up

```bash
# Delete old history (> 90 days); this is permanent, so back up the database first
sqlite3 ~/.transcript-fixer/corrections.db "
DELETE FROM correction_history
WHERE run_timestamp < datetime('now', '-90 days');
"

# Reject low-confidence suggestions
sqlite3 ~/.transcript-fixer/corrections.db "
UPDATE learned_suggestions
SET status = 'rejected'
WHERE confidence < 0.6 AND frequency < 3;
"
```

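Because the quarterly `DELETE` is irreversible, it can help to count the affected rows first. The sketch below wraps the same condition in a dry-run helper using Python's `sqlite3` module. It assumes `run_timestamp` is stored in SQLite's `YYYY-MM-DD HH:MM:SS` text format; the function name is hypothetical, and the real schema is described in `file_formats.md`.

```python
import sqlite3


def prune_old_history(conn: sqlite3.Connection, days: int = 90, dry_run: bool = True) -> int:
    """Count history rows older than `days`; delete them unless dry_run is set."""
    cutoff = f"-{int(days)} days"
    where = "run_timestamp < datetime('now', ?)"
    (count,) = conn.execute(
        f"SELECT COUNT(*) FROM correction_history WHERE {where}", (cutoff,)
    ).fetchone()
    if not dry_run:
        with conn:  # commits on success, rolls back on error
            conn.execute(f"DELETE FROM correction_history WHERE {where}", (cutoff,))
    return count
```

Calling it once with the default `dry_run=True` reports how many rows would go, and a second call with `dry_run=False` performs the deletion.
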
## Next Steps

- See `best_practices.md` for optimization tips
- See `troubleshooting.md` for error resolution
- See `file_formats.md` for database schema
- See `script_parameters.md` for advanced CLI options