## New Skill: transcript-fixer v1.0.0 Correct speech-to-text (ASR/STT) transcription errors through dictionary-based rules and AI-powered corrections with automatic pattern learning. **Features:** - Two-stage correction pipeline (dictionary + AI) - Automatic pattern detection and learning - Domain-specific dictionaries (general, embodied_ai, finance, medical) - SQLite-based correction repository - Team collaboration with import/export - GLM API integration for AI corrections - Cost optimization through dictionary promotion **Use cases:** - Correcting meeting notes, lecture recordings, or interview transcripts - Fixing Chinese/English homophone errors and technical terminology - Building domain-specific correction dictionaries - Improving transcript accuracy through iterative learning **Documentation:** - Complete workflow guides in references/ - SQL query templates - Troubleshooting guide - Team collaboration patterns - API setup instructions **Marketplace updates:** - Updated marketplace to v1.8.0 - Added transcript-fixer plugin (category: productivity) - Updated README.md with skill description and use cases - Updated CLAUDE.md with skill listing and counts 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
429 lines
11 KiB
Markdown
429 lines
11 KiB
Markdown
# Best Practices
|
|
|
|
Recommendations for effective use of transcript-fixer based on production experience.
|
|
|
|
## Table of Contents
|
|
|
|
- [Getting Started](#getting-started)
|
|
- [Build Foundation Before Scaling](#build-foundation-before-scaling)
|
|
- [Review Learned Suggestions Regularly](#review-learned-suggestions-regularly)
|
|
- [Domain Organization](#domain-organization)
|
|
- [Use Domain Separation](#use-domain-separation)
|
|
- [Domain Selection Strategy](#domain-selection-strategy)
|
|
- [Cost Optimization](#cost-optimization)
|
|
- [Test Dictionary Changes Before AI Calls](#test-dictionary-changes-before-ai-calls)
|
|
- [Approve High-Confidence Suggestions](#approve-high-confidence-suggestions)
|
|
- [Team Collaboration](#team-collaboration)
|
|
- [Export Corrections for Version Control](#export-corrections-for-version-control)
|
|
- [Share Corrections via Import/Merge](#share-corrections-via-importmerge)
|
|
- [Data Management](#data-management)
|
|
- [Database Backup Strategy](#database-backup-strategy)
|
|
- [Cleanup Strategy](#cleanup-strategy)
|
|
- [Workflow Efficiency](#workflow-efficiency)
|
|
- [File Organization](#file-organization)
|
|
- [Batch Processing](#batch-processing)
|
|
- [Context Rules for Edge Cases](#context-rules-for-edge-cases)
|
|
- [Quality Assurance](#quality-assurance)
|
|
- [Validate After Manual Changes](#validate-after-manual-changes)
|
|
- [Monitor Learning Quality](#monitor-learning-quality)
|
|
- [Production Deployment](#production-deployment)
|
|
- [Environment Variables](#environment-variables)
|
|
- [Monitoring](#monitoring)
|
|
- [Performance](#performance)
|
|
- [Summary](#summary)
|
|
|
|
## Getting Started
|
|
|
|
### Build Foundation Before Scaling
|
|
|
|
**Start small**: Begin with 5-10 manually-added corrections for the most common errors in your domain.
|
|
|
|
```bash
|
|
# Example: embodied AI domain
|
|
uv run scripts/fix_transcription.py --add "巨升智能" "具身智能" --domain embodied_ai
|
|
uv run scripts/fix_transcription.py --add "巨升" "具身" --domain embodied_ai
|
|
uv run scripts/fix_transcription.py --add "奇迹创坛" "奇绩创坛" --domain embodied_ai
|
|
```
|
|
|
|
**Let learning discover others**: After 3-5 correction runs, the learning system will suggest additional patterns automatically.
|
|
|
|
**Rationale**: Manual corrections provide high-quality training data. Learning amplifies your corrections exponentially.
|
|
|
|
### Review Learned Suggestions Regularly
|
|
|
|
**Frequency**: Every 3-5 correction runs
|
|
|
|
```bash
|
|
uv run scripts/fix_transcription.py --review-learned
|
|
```
|
|
|
|
**Why**: Learned corrections move from Stage 2 (AI, expensive) to Stage 1 (dictionary, cheap/instant).
|
|
|
|
**Impact**:
|
|
- 10x faster processing (no API calls)
|
|
- Zero cost for repeated patterns
|
|
- Builds domain-specific vocabulary automatically
|
|
|
|
## Domain Organization
|
|
|
|
### Use Domain Separation
|
|
|
|
**Prevent conflicts**: Same phonetic error might have different corrections in different domains.
|
|
|
|
**Example**:
|
|
- Finance domain: "股价" (stock price) is correct
|
|
- General domain: "股价" → "框架" (framework) ASR error
|
|
|
|
```bash
|
|
# Domain-specific corrections
|
|
uv run scripts/fix_transcription.py --add "股价" "框架" --domain general
|
|
# No correction needed in finance domain - "股价" is correct there
|
|
```
|
|
|
|
**Available domains**:
|
|
- `general` (default) - General-purpose corrections
|
|
- `embodied_ai` - Robotics and embodied AI terminology
|
|
- `finance` - Financial terminology
|
|
- `medical` - Medical terminology
|
|
|
|
**Custom domains**: Any string matching `^[a-z0-9_]+$` (lowercase, numbers, underscore).
|
|
|
|
### Domain Selection Strategy
|
|
|
|
1. **Default domain** for general corrections (dates, common words)
|
|
2. **Specialized domains** for technical terminology
|
|
3. **Project domains** for company/product-specific terms
|
|
|
|
```bash
|
|
# Project-specific domain
|
|
uv run scripts/fix_transcription.py --add "我司" "奇绩创坛" --domain yc_china
|
|
```
|
|
|
|
## Cost Optimization
|
|
|
|
### Test Dictionary Changes Before AI Calls
|
|
|
|
**Problem**: AI calls (Stage 2) consume API quota and time.
|
|
|
|
**Solution**: Test dictionary changes with Stage 1 first.
|
|
|
|
```bash
|
|
# 1. Add new corrections
|
|
uv run scripts/fix_transcription.py --add "新错误" "正确词" --domain general
|
|
|
|
# 2. Test on small sample (Stage 1 only)
|
|
uv run scripts/fix_transcription.py --input sample.md --stage 1
|
|
|
|
# 3. Review output
|
|
less sample_stage1.md
|
|
|
|
# 4. If satisfied, run full pipeline on large files
|
|
uv run scripts/fix_transcription.py --input large_file.md --stage 3
|
|
```
|
|
|
|
**Savings**: Avoid wasting API quota on files with dictionary-only corrections.
|
|
|
|
### Approve High-Confidence Suggestions
|
|
|
|
**Check suggestions regularly**:
|
|
|
|
```bash
|
|
uv run scripts/fix_transcription.py --review-learned
|
|
```
|
|
|
|
**Approve suggestions with**:
|
|
- Frequency ≥ 5
|
|
- Confidence ≥ 0.9
|
|
- Pattern makes semantic sense
|
|
|
|
**Impact**: Each approved suggestion saves future API calls.
|
|
|
|
## Team Collaboration
|
|
|
|
### Export Corrections for Version Control
|
|
|
|
**Don't commit** `.db` files to Git:
|
|
- Binary format causes merge conflicts
|
|
- Database grows over time (bloats repository)
|
|
- Not human-reviewable
|
|
|
|
**Do commit** JSON exports:
|
|
|
|
```bash
|
|
# Export domain dictionaries
|
|
uv run scripts/fix_transcription.py --export general_$(date +%Y%m%d).json --domain general
|
|
uv run scripts/fix_transcription.py --export embodied_ai_$(date +%Y%m%d).json --domain embodied_ai
|
|
|
|
# .gitignore
|
|
*.db
|
|
*.db-journal
|
|
*.bak
|
|
|
|
# Commit exports
|
|
git add *_corrections.json
|
|
git commit -m "Update correction dictionaries"
|
|
```
|
|
|
|
### Share Corrections via Import/Merge
|
|
|
|
**Always use `--merge` flag** to combine corrections:
|
|
|
|
```bash
|
|
# Pull latest from team
|
|
git pull origin main
|
|
|
|
# Import new corrections (merge mode)
|
|
uv run scripts/fix_transcription.py --import general_20250128.json --merge
|
|
uv run scripts/fix_transcription.py --import embodied_ai_20250128.json --merge
|
|
```
|
|
|
|
**Merge behavior**:
|
|
- New corrections: inserted
|
|
- Existing corrections with higher confidence: updated
|
|
- Existing corrections with lower confidence: skipped
|
|
- Preserves local customizations
|
|
|
|
See `team_collaboration.md` for Git workflows and conflict handling.
|
|
|
|
## Data Management
|
|
|
|
### Database Backup Strategy
|
|
|
|
**Automatic backups**: Database creates timestamped backups before migrations:
|
|
|
|
```
|
|
~/.transcript-fixer/
|
|
├── corrections.db
|
|
├── corrections.20250128_140532.bak
|
|
└── corrections.20250127_093021.bak
|
|
```
|
|
|
|
**Manual backups** before bulk changes:
|
|
|
|
```bash
|
|
cp ~/.transcript-fixer/corrections.db ~/backups/corrections_$(date +%Y%m%d).db
|
|
```
|
|
|
|
**Or use SQLite backup**:
|
|
|
|
```bash
|
|
sqlite3 ~/.transcript-fixer/corrections.db ".backup ~/backups/corrections.db"
|
|
```
|
|
|
|
### Cleanup Strategy
|
|
|
|
**History retention**: Keep recent history, archive old entries:
|
|
|
|
```bash
|
|
# Archive history older than 90 days
|
|
sqlite3 ~/.transcript-fixer/corrections.db "
|
|
DELETE FROM correction_history
|
|
WHERE run_timestamp < datetime('now', '-90 days');
|
|
"
|
|
|
|
# Reclaim space
|
|
sqlite3 ~/.transcript-fixer/corrections.db "VACUUM;"
|
|
```
|
|
|
|
**Suggestion cleanup**: Reject low-confidence suggestions periodically:
|
|
|
|
```bash
|
|
# Reject suggestions with frequency < 3
|
|
sqlite3 ~/.transcript-fixer/corrections.db "
|
|
UPDATE learned_suggestions
|
|
SET status = 'rejected'
|
|
WHERE frequency < 3 AND confidence < 0.7;
|
|
"
|
|
```
|
|
|
|
## Workflow Efficiency
|
|
|
|
### File Organization
|
|
|
|
**Use consistent naming**:
|
|
```
|
|
meeting_20250128.md # Original transcript
|
|
meeting_20250128_stage1.md # Dictionary corrections
|
|
meeting_20250128_stage2.md # Final corrected version
|
|
```
|
|
|
|
**Generate diff reports** for review:
|
|
|
|
```bash
|
|
uv run scripts/diff_generator.py \
|
|
meeting_20250128.md \
|
|
meeting_20250128_stage1.md \
|
|
meeting_20250128_stage2.md
|
|
```
|
|
|
|
**Output formats**:
|
|
- Markdown report (what changed, statistics)
|
|
- Unified diff (git-style)
|
|
- HTML side-by-side (visual review)
|
|
- Inline markers (for direct editing)
|
|
|
|
### Batch Processing
|
|
|
|
**Process similar files together** to amplify learning:
|
|
|
|
```bash
|
|
# Day 1: Process 5 similar meetings
|
|
for file in meeting_*.md; do
|
|
uv run scripts/fix_transcription.py --input "$file" --stage 3 --domain embodied_ai
|
|
done
|
|
|
|
# Day 2: Review learned patterns
|
|
uv run scripts/fix_transcription.py --review-learned
|
|
|
|
# Approve good suggestions
|
|
uv run scripts/fix_transcription.py --approve "常见错误1" "正确词1"
|
|
uv run scripts/fix_transcription.py --approve "常见错误2" "正确词2"
|
|
|
|
# Day 3: Future files benefit from dictionary corrections
|
|
```
|
|
|
|
### Context Rules for Edge Cases
|
|
|
|
**Use regex context rules** for:
|
|
- Positional dependencies (e.g., "的" vs "地" before verbs)
|
|
- Multi-word patterns
|
|
- Traditional vs simplified Chinese
|
|
|
|
**Example**:
|
|
|
|
```bash
|
|
sqlite3 ~/.transcript-fixer/corrections.db
|
|
|
|
# "的" before verb → "地"
|
|
INSERT INTO context_rules (pattern, replacement, description, priority)
|
|
VALUES ('近距离的去看', '近距离地去看', '的→地 before verb', 10);
|
|
|
|
# Preserve correct usage
|
|
INSERT INTO context_rules (pattern, replacement, description, priority)
|
|
VALUES ('近距离搏杀', '近距离搏杀', '的 is correct here (noun modifier)', 5);
|
|
```
|
|
|
|
**Priority**: Higher numbers run first (use for exceptions).
|
|
|
|
## Quality Assurance
|
|
|
|
### Validate After Manual Changes
|
|
|
|
**After direct SQL edits**:
|
|
|
|
```bash
|
|
uv run scripts/fix_transcription.py --validate
|
|
```
|
|
|
|
**After imports**:
|
|
|
|
```bash
|
|
# Check statistics
|
|
uv run scripts/fix_transcription.py --list --domain general | head -20
|
|
|
|
# Verify specific corrections
|
|
sqlite3 ~/.transcript-fixer/corrections.db "
|
|
SELECT from_text, to_text, source, confidence
|
|
FROM active_corrections
|
|
WHERE domain = 'general'
|
|
ORDER BY added_at DESC
|
|
LIMIT 10;
|
|
"
|
|
```
|
|
|
|
### Monitor Learning Quality
|
|
|
|
**Check suggestion confidence distribution**:
|
|
|
|
```bash
|
|
sqlite3 ~/.transcript-fixer/corrections.db "
|
|
SELECT
|
|
CASE
|
|
WHEN confidence >= 0.9 THEN 'high (>=0.9)'
|
|
WHEN confidence >= 0.8 THEN 'medium (0.8-0.9)'
|
|
ELSE 'low (<0.8)'
|
|
END as confidence_level,
|
|
COUNT(*) as count
|
|
FROM learned_suggestions
|
|
WHERE status = 'pending'
|
|
GROUP BY confidence_level;
|
|
"
|
|
```
|
|
|
|
**Review examples** for low-confidence suggestions:
|
|
|
|
```bash
|
|
sqlite3 ~/.transcript-fixer/corrections.db "
|
|
SELECT s.from_text, s.to_text, s.confidence, e.context
|
|
FROM learned_suggestions s
|
|
JOIN suggestion_examples e ON s.id = e.suggestion_id
|
|
WHERE s.confidence < 0.8 AND s.status = 'pending';
|
|
"
|
|
```
|
|
|
|
## Production Deployment
|
|
|
|
### Environment Variables
|
|
|
|
**Set permanently** in production:
|
|
|
|
```bash
|
|
# Add to /etc/environment or systemd service
|
|
GLM_API_KEY=your-production-key
|
|
```
|
|
|
|
### Monitoring
|
|
|
|
**Track usage statistics**:
|
|
|
|
```bash
|
|
# Corrections by source
|
|
sqlite3 ~/.transcript-fixer/corrections.db "
|
|
SELECT source, COUNT(*) as count, SUM(usage_count) as total_usage
|
|
FROM corrections
|
|
WHERE is_active = 1
|
|
GROUP BY source;
|
|
"
|
|
|
|
# Success rate
|
|
sqlite3 ~/.transcript-fixer/corrections.db "
|
|
SELECT
|
|
COUNT(*) as total_runs,
|
|
SUM(CASE WHEN success = 1 THEN 1 ELSE 0 END) as successful,
|
|
ROUND(100.0 * SUM(CASE WHEN success = 1 THEN 1 ELSE 0 END) / COUNT(*), 2) as success_rate
|
|
FROM correction_history;
|
|
"
|
|
```
|
|
|
|
### Performance
|
|
|
|
**Database optimization**:
|
|
|
|
```bash
|
|
# Rebuild indexes periodically
|
|
sqlite3 ~/.transcript-fixer/corrections.db "REINDEX;"
|
|
|
|
# Analyze query patterns
|
|
sqlite3 ~/.transcript-fixer/corrections.db "ANALYZE;"
|
|
|
|
# Vacuum to reclaim space
|
|
sqlite3 ~/.transcript-fixer/corrections.db "VACUUM;"
|
|
```
|
|
|
|
## Summary
|
|
|
|
**Key principles**:
|
|
1. Start small, let learning amplify
|
|
2. Use domain separation for quality
|
|
3. Test dictionary changes before AI calls
|
|
4. Export to JSON for version control
|
|
5. Review and approve learned suggestions
|
|
6. Validate after manual changes
|
|
7. Monitor learning quality
|
|
8. Backup before bulk operations
|
|
|
|
**ROI timeline**:
|
|
- Week 1: Build foundation (10-20 manual corrections)
|
|
- Week 2-3: Learning kicks in (20-50 suggestions)
|
|
- Month 2+: Mature vocabulary (80%+ dictionary coverage, minimal AI calls)
|