# Workflow Guide

Detailed step-by-step workflows for transcript correction and management.

## Table of Contents

- [Pre-Flight Checklist](#pre-flight-checklist)
  - [Initial Setup](#initial-setup)
  - [File Preparation](#file-preparation)
  - [Execution Parameters](#execution-parameters)
  - [Environment](#environment)
- [Core Workflows](#core-workflows)
  - [1. First-Time Correction](#1-first-time-correction)
  - [2. Iterative Improvement](#2-iterative-improvement)
  - [3. Domain-Specific Corrections](#3-domain-specific-corrections)
  - [4. Team Collaboration](#4-team-collaboration)
  - [5. Stage-by-Stage Execution](#5-stage-by-stage-execution)
  - [6. Context-Aware Rules](#6-context-aware-rules)
  - [7. Diff Report Generation](#7-diff-report-generation)
  - [8. Workshop Transcript Split + Timestamp Rebase](#8-workshop-transcript-split--timestamp-rebase)
- [Batch Processing](#batch-processing)
  - [Process Multiple Files](#process-multiple-files)
  - [Parallel Processing](#parallel-processing)
- [Maintenance Workflows](#maintenance-workflows)
  - [Weekly: Review Learning](#weekly-review-learning)
  - [Monthly: Export and Backup](#monthly-export-and-backup)
  - [Quarterly: Clean Up](#quarterly-clean-up)
- [Next Steps](#next-steps)

## Pre-Flight Checklist

Before running corrections, verify these prerequisites:

### Initial Setup
- [ ] Initialized with `uv run scripts/fix_transcription.py --init`
- [ ] Database exists at `~/.transcript-fixer/corrections.db`
- [ ] `GLM_API_KEY` environment variable set (run `echo $GLM_API_KEY`)
- [ ] Configuration validated (run `uv run scripts/fix_transcription.py --validate`)

### File Preparation
- [ ] Input file exists and is readable
- [ ] File uses supported format (`.md`, `.txt`)
- [ ] File encoding is UTF-8
- [ ] File size is reasonable (<10MB for first runs)

### Execution Parameters
- [ ] Using `--stage 3` for full pipeline (or specific stage if testing)
- [ ] Domain specified with `--domain` if using specialized dictionaries
- [ ] Using `--merge` flag when importing team corrections

### Environment
- [ ] Sufficient disk space for output files (~2x input size)
- [ ] API quota available for Stage 2 corrections
- [ ] Network connectivity for API calls

**Quick validation**:

```bash
uv run scripts/fix_transcription.py --validate && echo $GLM_API_KEY
```
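
The File Preparation items above can also be checked mechanically. The following is a minimal sketch, not part of transcript-fixer: `preflight_check` is a hypothetical helper name, and the 10MB limit simply mirrors the checklist.

```bash
# Hypothetical helper (not shipped with transcript-fixer): checks one input
# file against the File Preparation checklist before any correction run.
preflight_check() {
  local f=$1 size
  # Must be an existing, readable regular file
  if [ ! -f "$f" ] || [ ! -r "$f" ]; then
    echo "not a readable file: $f" >&2; return 1
  fi
  # Supported formats per the checklist: .md and .txt
  case $f in
    *.md|*.txt) ;;
    *) echo "unsupported extension: $f" >&2; return 1 ;;
  esac
  # UTF-8 check: iconv fails on invalid byte sequences (skipped if iconv is absent)
  if command -v iconv >/dev/null 2>&1 &&
     ! iconv -f UTF-8 -t UTF-8 "$f" >/dev/null 2>&1; then
    echo "not valid UTF-8: $f" >&2; return 1
  fi
  # Size check: under 10MB (10 * 1024 * 1024 bytes)
  size=$(( $(wc -c < "$f") ))
  if [ "$size" -ge 10485760 ]; then
    echo "larger than 10MB: $f" >&2; return 1
  fi
}
```

A typical use would be `preflight_check transcript.md && uv run scripts/fix_transcription.py --input transcript.md --stage 3`.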

## Core Workflows

### 1. First-Time Correction

**Goal**: Correct a transcript for the first time.

**Steps**:

1. **Initialize** (if not done):
   ```bash
   uv run scripts/fix_transcription.py --init
   export GLM_API_KEY="your-key"
   ```

2. **Add initial corrections** (5-10 common errors):
   ```bash
   uv run scripts/fix_transcription.py --add "常见错误1" "正确词1" --domain general
   uv run scripts/fix_transcription.py --add "常见错误2" "正确词2" --domain general
   ```

3. **Test on small sample** (Stage 1 only):
   ```bash
   uv run scripts/fix_transcription.py --input sample.md --stage 1
   less sample_stage1.md  # Review output
   ```

4. **Run full pipeline**:
   ```bash
   uv run scripts/fix_transcription.py --input transcript.md --stage 3 --domain general
   ```

5. **Review outputs**:
   ```bash
   # Stage 1: Dictionary corrections
   less transcript_stage1.md

   # Stage 2: Final corrected version
   less transcript_stage2.md

   # Generate diff report
   uv run scripts/diff_generator.py transcript.md transcript_stage1.md transcript_stage2.md
   ```

**Expected duration**:
- Stage 1: Instant (dictionary lookup)
- Stage 2: ~1-2 minutes per 1000 lines (API calls)
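
To turn the Stage 2 figure into a rough estimate for a specific file, the arithmetic can be sketched as below. `estimate_stage2_minutes` is a hypothetical helper, and the result is only as good as the ~1-2 minutes per 1000 lines rule of thumb; actual time depends on the API.

```bash
# Hypothetical helper: prints a rough "low-high" Stage 2 estimate in minutes,
# using the ~1-2 minutes per 1000 lines figure above (rounded up).
estimate_stage2_minutes() {
  local lines low high
  lines=$(( $(wc -l < "$1") ))
  low=$(( (lines + 999) / 1000 ))        # 1 minute per 1000 lines
  high=$(( (2 * lines + 999) / 1000 ))   # 2 minutes per 1000 lines
  echo "${low}-${high}"
}
```

For a 2500-line transcript this prints `3-5`.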

### 2. Iterative Improvement

**Goal**: Improve correction quality over time through learning.

**Steps**:

1. **Run corrections** on 3-5 similar transcripts:
   ```bash
   uv run scripts/fix_transcription.py --input day1.md --stage 3 --domain embodied_ai
   uv run scripts/fix_transcription.py --input day2.md --stage 3 --domain embodied_ai
   uv run scripts/fix_transcription.py --input day3.md --stage 3 --domain embodied_ai
   ```

2. **Review learned suggestions**:
   ```bash
   uv run scripts/fix_transcription.py --review-learned
   ```

   **Output example**:
   ```
   📚 Learned Suggestions (Pending Review)
   ========================================

   1. "巨升方向" → "具身方向"
      Frequency: 5  Confidence: 0.95
      Examples: day1.md (line 45), day2.md (line 23), ...

   2. "奇迹创坛" → "奇绩创坛"
      Frequency: 3  Confidence: 0.87
      Examples: day1.md (line 102), day3.md (line 67)
   ```

3. **Approve high-quality suggestions**:
   ```bash
   uv run scripts/fix_transcription.py --approve "巨升方向" "具身方向"
   uv run scripts/fix_transcription.py --approve "奇迹创坛" "奇绩创坛"
   ```

4. **Verify approved corrections**:
   ```bash
   uv run scripts/fix_transcription.py --list --domain embodied_ai | grep "learned"
   ```

5. **Run next batch** (benefits from approved corrections):
   ```bash
   uv run scripts/fix_transcription.py --input day4.md --stage 3 --domain embodied_ai
   ```

**Impact**: Approved corrections are applied in Stage 1 (dictionary lookup), so they are instant and cost no API calls.

**Cycle**: Repeat every 3-5 transcripts for continuous improvement.
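
When the suggestion list grows long, it can help to shortlist high-confidence entries before approving anything. The sketch below is an assumption-heavy illustration: `high_confidence_suggestions` is a hypothetical filter, and it relies on `--review-learned` keeping the two-line layout shown in the output example above.

```bash
# Hypothetical filter: reads --review-learned output on stdin and prints
# "wrong<TAB>correct" for every suggestion at or above a confidence threshold.
# Assumes the two-line layout shown in the output example above.
high_confidence_suggestions() {
  awk -v min="$1" '
    # Suggestion line: a number, a dot, then two quoted strings
    /^[ \t]*[0-9]+\. "/ { split($0, q, "\""); wrong = q[2]; right = q[4]; next }
    # Detail line: find the value that follows "Confidence:"
    /Confidence:/ {
      for (i = 1; i < NF; i++)
        if ($i == "Confidence:" && $(i + 1) + 0 >= min + 0)
          printf "%s\t%s\n", wrong, right
    }
  '
}
```

For example, `uv run scripts/fix_transcription.py --review-learned | high_confidence_suggestions 0.9` would list only the candidates worth reading first. Each pair should still be reviewed by a person before running `--approve`.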

### 3. Domain-Specific Corrections

**Goal**: Build specialized dictionaries for different fields.

**Steps**:

1. **Identify domain**:
   - `embodied_ai` - Robotics, AI terminology
   - `finance` - Financial terminology
   - `medical` - Medical terminology
   - `general` - General-purpose

2. **Add domain-specific terms**:
   ```bash
   # Embodied AI domain
   uv run scripts/fix_transcription.py --add "巨升智能" "具身智能" --domain embodied_ai
   uv run scripts/fix_transcription.py --add "机器学习" "机器学习" --domain embodied_ai  # Keep as-is

   # Finance domain
   uv run scripts/fix_transcription.py --add "股价" "股价" --domain finance  # Keep as-is
   uv run scripts/fix_transcription.py --add "PE比率" "市盈率" --domain finance
   ```

3. **Use appropriate domain** when correcting:
   ```bash
   # AI meeting transcript
   uv run scripts/fix_transcription.py --input ai_meeting.md --stage 3 --domain embodied_ai

   # Financial report transcript
   uv run scripts/fix_transcription.py --input earnings_call.md --stage 3 --domain finance
   ```

4. **Review domain statistics**:
   ```bash
   sqlite3 ~/.transcript-fixer/corrections.db "SELECT * FROM correction_statistics;"
   ```

**Benefits**:
- Prevents cross-domain conflicts
- Higher accuracy per domain
- Targeted vocabulary building
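
For larger vocabularies, repeating `--add` by hand gets tedious. A minimal sketch of a bulk loader follows; `add_terms_from_tsv` and the tab-separated file layout are assumptions made for illustration, and only the `--add` and `--domain` flags shown above are used.

```bash
# Hypothetical helper: registers every "wrong<TAB>correct" pair from a TSV file
# under one domain. Set DRY_RUN=1 to print the commands instead of running them.
add_terms_from_tsv() {
  local tsv=$1 domain=$2 wrong correct
  # Read on file descriptor 3 so the command inside the loop cannot consume the list
  while IFS=$'\t' read -r -u 3 wrong correct || [ -n "$wrong" ]; do
    # Skip blank lines and "#" comments
    case $wrong in ''|\#*) continue ;; esac
    ${DRY_RUN:+echo} uv run scripts/fix_transcription.py \
      --add "$wrong" "$correct" --domain "$domain"
  done 3< "$tsv"
}
```

Running `DRY_RUN=1 add_terms_from_tsv finance_terms.tsv finance` first shows exactly which commands would be executed.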

### 4. Team Collaboration

**Goal**: Share corrections across team members.

**Steps**:

#### Setup (One-time per team)

1. **Create shared repository**:
   ```bash
   mkdir transcript-corrections
   cd transcript-corrections
   git init

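   # .gitignore: keep local databases and backups out of the repository
   # (printf is used because plain echo does not expand \n in bash)
   printf '%s\n' '*.db' '*.db-journal' '*.bak' > .gitignore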
   ```

2. **Export initial corrections**:
   ```bash
   uv run scripts/fix_transcription.py --export general.json --domain general
   uv run scripts/fix_transcription.py --export embodied_ai.json --domain embodied_ai

   git add *.json
   git commit -m "Initial correction dictionaries"
   # Assumes a remote named "origin" has already been added (git remote add origin ...)
   git push origin main
   ```

#### Daily Workflow

**Team Member A** (adds new corrections):

```bash
# 1. Run corrections
uv run scripts/fix_transcription.py --input transcript.md --stage 3 --domain embodied_ai

# 2. Review and approve learned suggestions
uv run scripts/fix_transcription.py --review-learned
uv run scripts/fix_transcription.py --approve "新错误" "正确词"

# 3. Export updated corrections
uv run scripts/fix_transcription.py --export embodied_ai_$(date +%Y%m%d).json --domain embodied_ai

# 4. Commit and push
git add embodied_ai_*.json
git commit -m "Add embodied AI corrections from today's transcripts"
git push origin main
```

**Team Member B** (imports team corrections):

```bash
# 1. Pull latest corrections
git pull origin main

# 2. Import with merge
uv run scripts/fix_transcription.py --import embodied_ai_20250128.json --merge

# 3. Verify
uv run scripts/fix_transcription.py --list --domain embodied_ai | tail -10
```

**Conflict resolution**: See `team_collaboration.md` for handling merge conflicts.
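
Because exports are named with a date suffix, the importing side needs to know which file is newest. A minimal sketch, assuming the `<domain>_YYYYMMDD.json` naming used in the export step; `latest_export` is a hypothetical helper.

```bash
# Hypothetical helper: prints the newest "<domain>_YYYYMMDD.json" export in a
# directory (default: current directory), or returns 1 if there is none.
latest_export() {
  local domain=$1 dir=${2:-.} f latest=
  for f in "$dir/$domain"_[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9].json; do
    [ -e "$f" ] || continue   # glob matched nothing
    latest=$f                 # globs expand in sorted order, so the last match is newest
  done
  [ -n "$latest" ] || return 1
  printf '%s\n' "$latest"
}
```

Team Member B could then run `uv run scripts/fix_transcription.py --import "$(latest_export embodied_ai)" --merge` instead of typing the date by hand.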

### 5. Stage-by-Stage Execution

**Goal**: Test dictionary changes without wasting API quota.

#### Stage 1 Only (Dictionary)

**Use when**: Testing new corrections, verifying domain setup.

```bash
uv run scripts/fix_transcription.py --input file.md --stage 1 --domain general
```

**Output**: `file_stage1.md` with dictionary corrections only.

**Review**: Check if dictionary corrections are sufficient.

#### Stage 2 Only (AI)

**Use when**: Running AI corrections on a pre-processed file.

**Prerequisites**: Stage 1 output exists.

```bash
# Stage 1 first
uv run scripts/fix_transcription.py --input file.md --stage 1

# Then Stage 2
uv run scripts/fix_transcription.py --input file_stage1.md --stage 2
```

**Output**: `file_stage1_stage2.md`. The doubled suffix is easy to misread, so prefer `--stage 3` unless you need to run the stages separately.

#### Stage 3 (Full Pipeline)

**Use when**: Production runs, full correction workflow.

```bash
uv run scripts/fix_transcription.py --input file.md --stage 3 --domain general
```

**Output**: Both `file_stage1.md` and `file_stage2.md`.

**Recommended**: Use Stage 3 for most workflows.
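
The output names in this section follow one pattern: `_stageN` is placed before the file extension. The sketch below reproduces that pattern for use in wrapper scripts; `stage_output_name` is a hypothetical helper inferred from the examples here, not a function of the tool.

```bash
# Hypothetical helper: predicts the output file name for a given input and stage,
# assuming "_stageN" is inserted before the extension as in the examples above.
stage_output_name() {
  local input=$1 stage=$2
  local base=${input%.*} ext=${input##*.}
  printf '%s_stage%s.%s\n' "$base" "$stage" "$ext"
}
```

`stage_output_name file.md 1` prints `file_stage1.md`, and feeding that result back in with stage 2 prints `file_stage1_stage2.md`, which is where the doubled suffix comes from.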

### 6. Context-Aware Rules

**Goal**: Handle edge cases with regex patterns.

**Use cases**:
- Positional corrections (e.g., "的" vs "地")
- Multi-word patterns
- Conditional corrections

**Steps**:

1. **Identify pattern** that a simple dictionary can't handle:
   ```
   Problem: "近距离的去看" (wrong: "的" should be "地" before a verb)
   Counter-example: "近距离的搏杀" (correct: keep "的" before a noun)
   ```

2. **Add context rules**:
   ```bash
   # Feed the SQL to sqlite3 through a here-document so the block runs as one command
   sqlite3 ~/.transcript-fixer/corrections.db <<'SQL'
   -- Higher priority for specific context
   INSERT INTO context_rules (pattern, replacement, description, priority)
   VALUES ('近距离的去看', '近距离地去看', '的→地 before verb', 10);

   -- Lower priority for general pattern
   INSERT INTO context_rules (pattern, replacement, description, priority)
   VALUES ('近距离的搏杀', '近距离的搏杀', 'Keep 的 for noun modifier', 5);
   SQL
   ```

3. **Test context rules**:
   ```bash
   uv run scripts/fix_transcription.py --input test.md --stage 1
   ```

4. **Validate**:
   ```bash
   uv run scripts/fix_transcription.py --validate
   ```

**Priority**: Higher numbers run first (use for exceptions/edge cases).

See `file_formats.md` for context_rules schema.
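
To see why priority order matters, the following sketch applies rules from highest to lowest priority. It is an illustration only and not the tool's implementation: `apply_rules` and the `priority|pattern|replacement` file layout are hypothetical, and patterns are treated as literal strings rather than regexes.

```bash
# Illustration only: applies "priority|pattern|replacement" rules to a string,
# highest priority first. Patterns are matched literally, not as regexes.
apply_rules() {
  local text=$1 rules=$2 priority pattern replacement
  # Sort numerically on the priority field, highest first
  while IFS='|' read -r priority pattern replacement; do
    [ -n "$pattern" ] || continue
    # Quoting makes both sides literal (no glob or "&" expansion)
    text=${text//"$pattern"/"$replacement"}
  done < <(sort -t '|' -k1,1nr "$rules")
  printf '%s\n' "$text"
}
```

With the two rules `5|ab|X` and `10|abc|Y`, the input `abc` becomes `Y` because the more specific rule runs first. In the opposite order the general rule would fire first and produce `Xc`.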

### 7. Diff Report Generation

**Goal**: Visualize all changes for review.

**Use when**:
- Reviewing corrections before publishing
- Training new team members
- Documenting ASR error patterns

**Steps**:

1. **Run corrections**:
   ```bash
   uv run scripts/fix_transcription.py --input transcript.md --stage 3
   ```

2. **Generate diff reports**:
   ```bash
   uv run scripts/diff_generator.py \
     transcript.md \
     transcript_stage1.md \
     transcript_stage2.md
   ```

3. **Review outputs**:
   ```bash
   # Markdown report (statistics + summary)
   less diff_report.md

   # Unified diff (git-style)
   less transcript_unified.diff

   # HTML side-by-side (visual review); `open` is macOS, use `xdg-open` on Linux
   open transcript_sidebyside.html

   # Inline markers (for editing)
   less transcript_inline.md
   ```

**Report contents**:
- Total changes count
- Stage 1 vs Stage 2 breakdown
- Character/word count changes
- Side-by-side comparison

See `script_parameters.md` for advanced diff options.
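
For a quick cross-check of the Stage 1 vs Stage 2 breakdown without opening the reports, the standard `diff` utility is enough. This sketch is independent of `diff_generator.py`; `changed_lines` is a hypothetical helper, and its line-based count will not necessarily match the report's own statistics.

```bash
# Hypothetical helper: counts lines of the first file that differ from or are
# missing in the second file, using plain diff output ("<" marks such lines).
changed_lines() {
  # diff exits 1 when files differ and grep exits 1 on zero matches;
  # neither is an error here
  { diff "$1" "$2" || true; } | { grep -c '^<' || true; }
}
```

`changed_lines transcript.md transcript_stage1.md` then approximates the Stage 1 changes, and `changed_lines transcript_stage1.md transcript_stage2.md` the Stage 2 changes.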

### 8. Workshop Transcript Split + Timestamp Rebase

**Goal**: Split a long workshop transcript into sections such as setup chat, class, and debrief, then make each section start from `00:00:00`.

**Steps**:

1. **Correct transcript text first** (dictionary + AI/manual review)
2. **Pick marker phrases** for each section boundary
3. **Split and rebase**:

```bash
uv run scripts/split_transcript_sections.py workshop.txt \
  --first-section-name "课前聊天" \
  --section "正式上课::好，无缝切换嘛。对。那个曹总连上了吗？那个网页。" \
  --section "课后复盘::我们复盘一下。" \
  --rebase-to-zero
```

4. **If you already split the files**, rebase a single file directly:

```bash
uv run scripts/fix_transcript_timestamps.py class.txt --in-place --rebase-to-zero
```
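
Rebasing to zero amounts to subtracting the section's first timestamp from every timestamp in that section. The arithmetic can be sketched as follows, assuming `HH:MM:SS` timestamps; `to_seconds` and `rebase_timestamp` are hypothetical names, and the actual scripts may handle other formats.

```bash
# Convert HH:MM:SS to seconds. The 10# prefix forces base 10 so that
# values like "08" are not rejected as invalid octal numbers.
to_seconds() {
  local h m s
  IFS=: read -r h m s <<< "$1"
  echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Shift a timestamp so that section_start becomes 00:00:00.
# Usage: rebase_timestamp <timestamp> <section_start>
rebase_timestamp() {
  local t=$(( $(to_seconds "$1") - $(to_seconds "$2") ))
  # A timestamp earlier than the section start cannot be rebased
  [ "$t" -ge 0 ] || return 1
  printf '%02d:%02d:%02d\n' $(( t / 3600 )) $(( t % 3600 / 60 )) $(( t % 60 ))
}
```

For a section that starts at `01:08:09`, `rebase_timestamp 01:23:45 01:08:09` prints `00:15:36`.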

## Batch Processing

### Process Multiple Files

```bash
# Simple loop
for file in meeting_*.md; do
  uv run scripts/fix_transcription.py --input "$file" --stage 3 --domain embodied_ai
done

# With error handling
for file in meeting_*.md; do
  echo "Processing $file..."
  if uv run scripts/fix_transcription.py --input "$file" --stage 3 --domain embodied_ai; then
    echo "✅ $file completed"
  else
    echo "❌ $file failed"
  fi
done
```
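
One thing to watch when re-running a batch: if outputs are written next to the inputs with the `_stage1`/`_stage2` suffixes shown earlier, the glob `meeting_*.md` also matches those outputs. A minimal sketch of a filter follows; `pending_inputs` is a hypothetical helper built on that naming assumption.

```bash
# Hypothetical helper: lists meeting_*.md inputs in a directory (default: current)
# that are not themselves stage outputs and have no _stage2 result yet.
pending_inputs() {
  local dir=${1:-.} f
  for f in "$dir"/meeting_*.md; do
    [ -e "$f" ] || continue                              # glob matched nothing
    case $f in *_stage1.md|*_stage2.md) continue ;; esac # skip earlier outputs
    [ -e "${f%.md}_stage2.md" ] && continue              # already processed
    printf '%s\n' "$f"
  done
  return 0
}
```

Its output can replace the bare glob in the loops above, so that interrupted batches resume where they stopped.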

### Parallel Processing

```bash
# GNU parallel (install on macOS: brew install parallel)
# Passing the glob after ::: avoids parsing ls output, which breaks on unusual file names
parallel -j 4 \
  uv run scripts/fix_transcription.py --input {} --stage 3 --domain embodied_ai \
  ::: meeting_*.md
```

**Caution**: Monitor API rate limits when processing in parallel.

## Maintenance Workflows

### Weekly: Review Learning

```bash
# Review suggestions
uv run scripts/fix_transcription.py --review-learned

# Approve high-confidence patterns
uv run scripts/fix_transcription.py --approve "错误1" "正确1"
uv run scripts/fix_transcription.py --approve "错误2" "正确2"
```

### Monthly: Export and Backup

```bash
# Export all domains
uv run scripts/fix_transcription.py --export general_$(date +%Y%m%d).json --domain general
uv run scripts/fix_transcription.py --export embodied_ai_$(date +%Y%m%d).json --domain embodied_ai

# Backup database (create the target directory first if it does not exist)
mkdir -p ~/backups
cp ~/.transcript-fixer/corrections.db ~/backups/corrections_$(date +%Y%m%d).db

# Database maintenance
sqlite3 ~/.transcript-fixer/corrections.db "VACUUM; REINDEX; ANALYZE;"
```
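
A backup is only useful if the copy is intact. The sketch below wraps the copy step with a byte-for-byte check; `backup_db` is a hypothetical helper, and it assumes no correction run is writing to the database while the copy is made.

```bash
# Hypothetical helper: copies a database file into a dated backup, verifies the
# copy byte for byte, and prints the backup path on success.
backup_db() {
  local src=$1 dest_dir=$2 dest
  mkdir -p "$dest_dir" || return 1
  dest="$dest_dir/corrections_$(date +%Y%m%d).db"
  cp "$src" "$dest" || return 1
  # cmp exits non-zero if the copy differs from the source
  cmp -s "$src" "$dest" || { echo "backup verification failed: $dest" >&2; return 1; }
  printf '%s\n' "$dest"
}
```

For example: `backup_db ~/.transcript-fixer/corrections.db ~/backups`.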

### Quarterly: Clean Up

```bash
# Delete history older than 90 days (irreversible; back up the database first)
sqlite3 ~/.transcript-fixer/corrections.db "
DELETE FROM correction_history
WHERE run_timestamp < datetime('now', '-90 days');
"

# Reject low-confidence suggestions
sqlite3 ~/.transcript-fixer/corrections.db "
UPDATE learned_suggestions
SET status = 'rejected'
WHERE confidence < 0.6 AND frequency < 3;
"
```
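
Since both statements above modify data permanently, it can be worth counting the affected rows first. The sketch below only generates read-only `SELECT` statements using the same table and column names as the cleanup commands; `cleanup_preview_sql` is a hypothetical helper.

```bash
# Hypothetical helper: prints read-only SQL that counts the rows the quarterly
# cleanup would touch. The optional argument is the history age in days.
cleanup_preview_sql() {
  local days=${1:-90}
  # Accept only a plain integer so the value is safe to embed in SQL
  case $days in ''|*[!0-9]*) echo "days must be an integer" >&2; return 1 ;; esac
  cat <<SQL
SELECT 'history rows to delete', COUNT(*) FROM correction_history
WHERE run_timestamp < datetime('now', '-${days} days');
SELECT 'suggestions to reject', COUNT(*) FROM learned_suggestions
WHERE confidence < 0.6 AND frequency < 3;
SQL
}
```

Pipe the result into the database to preview: `cleanup_preview_sql 90 | sqlite3 ~/.transcript-fixer/corrections.db`.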

## Next Steps

- See `best_practices.md` for optimization tips
- See `troubleshooting.md` for error resolution
- See `file_formats.md` for database schema
- See `script_parameters.md` for advanced CLI options