claude-code-skills-reference/transcript-fixer/references/best_practices.md

# Best Practices

Recommendations for effective use of transcript-fixer based on production experience.

## Table of Contents

- [Getting Started](#getting-started)
  - [Build Foundation Before Scaling](#build-foundation-before-scaling)
  - [Review Learned Suggestions Regularly](#review-learned-suggestions-regularly)
- [Domain Organization](#domain-organization)
  - [Use Domain Separation](#use-domain-separation)
  - [Domain Selection Strategy](#domain-selection-strategy)
- [Cost Optimization](#cost-optimization)
  - [Test Dictionary Changes Before AI Calls](#test-dictionary-changes-before-ai-calls)
  - [Approve High-Confidence Suggestions](#approve-high-confidence-suggestions)
- [Team Collaboration](#team-collaboration)
  - [Export Corrections for Version Control](#export-corrections-for-version-control)
  - [Share Corrections via Import/Merge](#share-corrections-via-importmerge)
- [Data Management](#data-management)
  - [Database Backup Strategy](#database-backup-strategy)
  - [Cleanup Strategy](#cleanup-strategy)
- [Workflow Efficiency](#workflow-efficiency)
  - [File Organization](#file-organization)
  - [Batch Processing](#batch-processing)
  - [Context Rules for Edge Cases](#context-rules-for-edge-cases)
- [Quality Assurance](#quality-assurance)
  - [Validate After Manual Changes](#validate-after-manual-changes)
  - [Monitor Learning Quality](#monitor-learning-quality)
- [Production Deployment](#production-deployment)
  - [Environment Variables](#environment-variables)
  - [Monitoring](#monitoring)
  - [Performance](#performance)
- [Summary](#summary)

## Getting Started

### Build Foundation Before Scaling

**Start small**: Begin with 5-10 manually-added corrections for the most common errors in your domain.

```bash
# Example: embodied AI domain
uv run scripts/fix_transcription.py --add "巨升智能" "具身智能" --domain embodied_ai
uv run scripts/fix_transcription.py --add "巨升" "具身" --domain embodied_ai
uv run scripts/fix_transcription.py --add "奇迹创坛" "奇绩创坛" --domain embodied_ai
```

**Let learning discover others**: After 3-5 correction runs, the learning system will suggest additional patterns automatically.

**Rationale**: Manual corrections provide high-quality training data. Learning amplifies your corrections exponentially.

### Review Learned Suggestions Regularly

**Frequency**: Every 3-5 correction runs

```bash
uv run scripts/fix_transcription.py --review-learned
```

**Why**: Learned corrections move from Stage 2 (AI, expensive) to Stage 1 (dictionary, cheap/instant).

**Impact**:
- 10x faster processing (no API calls)
- Zero cost for repeated patterns
- Builds domain-specific vocabulary automatically

## Domain Organization

### Use Domain Separation

**Prevent conflicts**: Same phonetic error might have different corrections in different domains.

**Example**:
- Finance domain: "股价" (stock price) is correct
- General domain: "股价" → "框架" (framework) ASR error

```bash
# Domain-specific corrections
uv run scripts/fix_transcription.py --add "股价" "框架" --domain general
# No correction needed in finance domain - "股价" is correct there
```

**Available domains**:
- `general` (default) - General-purpose corrections
- `embodied_ai` - Robotics and embodied AI terminology
- `finance` - Financial terminology
- `medical` - Medical terminology

**Custom domains**: Any string matching `^[a-z0-9_]+$` (lowercase, numbers, underscore).

### Domain Selection Strategy

1. **Default domain** for general corrections (dates, common words)
2. **Specialized domains** for technical terminology
3. **Project domains** for company/product-specific terms

```bash
# Project-specific domain
uv run scripts/fix_transcription.py --add "我司" "奇绩创坛" --domain yc_china
```

## Cost Optimization

### Test Dictionary Changes Before AI Calls

**Problem**: AI calls (Stage 2) consume API quota and time.

**Solution**: Test dictionary changes with Stage 1 first.

```bash
# 1. Add new corrections
uv run scripts/fix_transcription.py --add "新错误" "正确词" --domain general

# 2. Test on small sample (Stage 1 only)
uv run scripts/fix_transcription.py --input sample.md --stage 1

# 3. Review output
less sample_stage1.md

# 4. If satisfied, run full pipeline on large files
uv run scripts/fix_transcription.py --input large_file.md --stage 3
```

**Savings**: Avoid wasting API quota on files with dictionary-only corrections.

### Approve High-Confidence Suggestions

**Check suggestions regularly**:

```bash
uv run scripts/fix_transcription.py --review-learned
```

**Approve suggestions with**:
- Frequency ≥ 5
- Confidence ≥ 0.9
- Pattern makes semantic sense

**Impact**: Each approved suggestion saves future API calls.

## Team Collaboration

### Export Corrections for Version Control

**Don't commit** `.db` files to Git:
- Binary format causes merge conflicts
- Database grows over time (bloats repository)
- Not human-reviewable

**Do commit** JSON exports:

```bash
# Export domain dictionaries
uv run scripts/fix_transcription.py --export general_$(date +%Y%m%d).json --domain general
uv run scripts/fix_transcription.py --export embodied_ai_$(date +%Y%m%d).json --domain embodied_ai

# .gitignore
*.db
*.db-journal
*.bak

# Commit exports
git add *_corrections.json
git commit -m "Update correction dictionaries"
```

### Share Corrections via Import/Merge

**Always use `--merge` flag** to combine corrections:

```bash
# Pull latest from team
git pull origin main

# Import new corrections (merge mode)
uv run scripts/fix_transcription.py --import general_20250128.json --merge
uv run scripts/fix_transcription.py --import embodied_ai_20250128.json --merge
```

**Merge behavior**:
- New corrections: inserted
- Existing corrections with higher confidence: updated
- Existing corrections with lower confidence: skipped
- Preserves local customizations

See `team_collaboration.md` for Git workflows and conflict handling.

## Data Management

### Database Backup Strategy

**Automatic backups**: Database creates timestamped backups before migrations:

```
~/.transcript-fixer/
├── corrections.db
├── corrections.20250128_140532.bak
└── corrections.20250127_093021.bak
```

**Manual backups** before bulk changes:

```bash
cp ~/.transcript-fixer/corrections.db ~/backups/corrections_$(date +%Y%m%d).db
```

**Or use SQLite backup**:

```bash
sqlite3 ~/.transcript-fixer/corrections.db ".backup ~/backups/corrections.db"
```

### Cleanup Strategy

**History retention**: Keep recent history, archive old entries:

```bash
# Archive history older than 90 days
sqlite3 ~/.transcript-fixer/corrections.db "
DELETE FROM correction_history
WHERE run_timestamp < datetime('now', '-90 days');
"

# Reclaim space
sqlite3 ~/.transcript-fixer/corrections.db "VACUUM;"
```

**Suggestion cleanup**: Reject low-confidence suggestions periodically:

```bash
# Reject suggestions with frequency < 3
sqlite3 ~/.transcript-fixer/corrections.db "
UPDATE learned_suggestions
SET status = 'rejected'
WHERE frequency < 3 AND confidence < 0.7;
"
```

## Workflow Efficiency

### File Organization

**Use consistent naming**:
```
meeting_20250128.md           # Original transcript
meeting_20250128_stage1.md    # Dictionary corrections
meeting_20250128_stage2.md    # Final corrected version
```

**Generate diff reports** for review:

```bash
uv run scripts/diff_generator.py \
  meeting_20250128.md \
  meeting_20250128_stage1.md \
  meeting_20250128_stage2.md
```

**Output formats**:
- Markdown report (what changed, statistics)
- Unified diff (git-style)
- HTML side-by-side (visual review)
- Inline markers (for direct editing)

### Batch Processing

**Process similar files together** to amplify learning:

```bash
# Day 1: Process 5 similar meetings
for file in meeting_*.md; do
  uv run scripts/fix_transcription.py --input "$file" --stage 3 --domain embodied_ai
done

# Day 2: Review learned patterns
uv run scripts/fix_transcription.py --review-learned

# Approve good suggestions
uv run scripts/fix_transcription.py --approve "常见错误1" "正确词1"
uv run scripts/fix_transcription.py --approve "常见错误2" "正确词2"

# Day 3: Future files benefit from dictionary corrections
```

### Context Rules for Edge Cases

**Use regex context rules** for:
- Positional dependencies (e.g., "的" vs "地" before verbs)
- Multi-word patterns
- Traditional vs simplified Chinese

**Example**:

```bash
sqlite3 ~/.transcript-fixer/corrections.db

# "的" before verb → "地"
INSERT INTO context_rules (pattern, replacement, description, priority)
VALUES ('近距离的去看', '近距离地去看', '的→地 before verb', 10);

# Preserve correct usage
INSERT INTO context_rules (pattern, replacement, description, priority)
VALUES ('近距离搏杀', '近距离搏杀', '的 is correct here (noun modifier)', 5);
```

**Priority**: Higher numbers run first (use for exceptions).

## Quality Assurance

### Validate After Manual Changes

**After direct SQL edits**:

```bash
uv run scripts/fix_transcription.py --validate
```

**After imports**:

```bash
# Check statistics
uv run scripts/fix_transcription.py --list --domain general | head -20

# Verify specific corrections
sqlite3 ~/.transcript-fixer/corrections.db "
SELECT from_text, to_text, source, confidence
FROM active_corrections
WHERE domain = 'general'
ORDER BY added_at DESC
LIMIT 10;
"
```

### Monitor Learning Quality

**Check suggestion confidence distribution**:

```bash
sqlite3 ~/.transcript-fixer/corrections.db "
SELECT
  CASE
    WHEN confidence >= 0.9 THEN 'high (>=0.9)'
    WHEN confidence >= 0.8 THEN 'medium (0.8-0.9)'
    ELSE 'low (<0.8)'
  END as confidence_level,
  COUNT(*) as count
FROM learned_suggestions
WHERE status = 'pending'
GROUP BY confidence_level;
"
```

**Review examples** for low-confidence suggestions:

```bash
sqlite3 ~/.transcript-fixer/corrections.db "
SELECT s.from_text, s.to_text, s.confidence, e.context
FROM learned_suggestions s
JOIN suggestion_examples e ON s.id = e.suggestion_id
WHERE s.confidence < 0.8 AND s.status = 'pending';
"
```

## Production Deployment

### Environment Variables

**Set permanently** in production:

```bash
# Add to /etc/environment or systemd service
GLM_API_KEY=your-production-key
```

### Monitoring

**Track usage statistics**:

```bash
# Corrections by source
sqlite3 ~/.transcript-fixer/corrections.db "
SELECT source, COUNT(*) as count, SUM(usage_count) as total_usage
FROM corrections
WHERE is_active = 1
GROUP BY source;
"

# Success rate
sqlite3 ~/.transcript-fixer/corrections.db "
SELECT
  COUNT(*) as total_runs,
  SUM(CASE WHEN success = 1 THEN 1 ELSE 0 END) as successful,
  ROUND(100.0 * SUM(CASE WHEN success = 1 THEN 1 ELSE 0 END) / COUNT(*), 2) as success_rate
FROM correction_history;
"
```

### Performance

**Database optimization**:

```bash
# Rebuild indexes periodically
sqlite3 ~/.transcript-fixer/corrections.db "REINDEX;"

# Analyze query patterns
sqlite3 ~/.transcript-fixer/corrections.db "ANALYZE;"

# Vacuum to reclaim space
sqlite3 ~/.transcript-fixer/corrections.db "VACUUM;"
```

## Summary

**Key principles**:
1. Start small, let learning amplify
2. Use domain separation for quality
3. Test dictionary changes before AI calls
4. Export to JSON for version control
5. Review and approve learned suggestions
6. Validate after manual changes
7. Monitor learning quality
8. Backup before bulk operations

**ROI timeline**:
- Week 1: Build foundation (10-20 manual corrections)
- Week 2-3: Learning kicks in (20-50 suggestions)
- Month 2+: Mature vocabulary (80%+ dictionary coverage, minimal AI calls)