Update transcript-fixer guidance and hook paths

This commit is contained in:
daymade
2026-04-06 17:50:17 +08:00
parent 000596dad6
commit 9242af5fcb
2 changed files with 71 additions and 21 deletions


@@ -17,15 +17,21 @@ All scripts use PEP 723 inline metadata — `uv run` auto-installs dependencies.
# First time: Initialize database
uv run scripts/fix_transcription.py --init
# Phase 1: Dictionary corrections (instant, free)
# Single file
uv run scripts/fix_transcription.py --input meeting.md --stage 1
# Batch: multiple files in parallel (use shell loop)
for f in /path/to/*.txt; do
uv run scripts/fix_transcription.py --input "$f" --stage 1
done
```
After Stage 1, Claude reads the output and fixes remaining ASR errors natively (no API key needed):
1. Read Stage 1 output in ~200-line chunks using the Read tool
2. Identify ASR errors — homophones, garbled terms, broken sentences
3. Present corrections in a table for user review before applying
4. Save stable patterns to dictionary for future reuse
1. Read all Stage 1 outputs — read **entire** transcript before proposing corrections (later context disambiguates earlier errors)
2. Identify ASR errors — compile all corrections across files
3. Apply fixes with sed in batch, verify each with diff
4. Finalize: rename `_stage1.md` → `.md`, delete original `.txt`
5. Save stable patterns to dictionary for future reuse
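Steps 3–4 above can be sketched as a small shell session. The filename and correction patterns here are illustrative stand-ins, not the skill's actual data (on macOS, use `sed -i ''` instead of `sed -i`):

```shell
# Stand-in Stage 1 output (illustrative text only)
printf '克劳锐很强大\ncloucode也不错\n' > meeting_stage1.md

# Apply compiled corrections to a copy, never the original
cp meeting_stage1.md meeting_fixed.md
sed -i -e 's/克劳锐/Claude/g' -e 's/cloucode/Claude Code/g' meeting_fixed.md

# Review exactly what changed before finalizing
diff meeting_stage1.md meeting_fixed.md || true
```

Working on a copy keeps the original available for the `diff` verification step and for recovery if a pattern over-matches.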
See `references/example_session.md` for a concrete input/output walkthrough.
@@ -51,6 +57,25 @@ Two-phase pipeline with persistent learning:
**After fixing, always save reusable corrections to dictionary.** This is the skill's core value — see `references/iteration_workflow.md` for the complete checklist.
### Dictionary Addition After Fixing
After native AI correction, review all applied fixes and decide which to save. Use this decision matrix:
| Pattern type | Example | Action |
|-------------|---------|--------|
| Non-word → correct term | 克劳锐→Claude, cloucode→Claude Code | ✅ Add (zero false positive risk) |
| Rare word → correct term | 潜彩→前采, 维星→韦青 | ✅ Add (verify it's not a real word first) |
| Person/company name ASR error | 宋天航→宋天生, 策马攀山→策马看山 | ✅ Add (stable, unique) |
| Common word → context word | 争→蒸, 钱财→前采, 报纸→标品 | ❌ Skip (high false positive risk) |
| Real brand → different brand | Xcode→Claude Code, Clover→Claude | ❌ Skip (real words in other contexts) |
Batch add multiple corrections in one session:
```bash
uv run scripts/fix_transcription.py --add "错误1" "正确1" --domain tech
uv run scripts/fix_transcription.py --add "错误2" "正确2" --domain business
# Chain with && for efficiency
```
## False Positive Prevention
Adding wrong dictionary rules silently corrupts future transcripts. **Read `references/false_positive_guide.md` before adding any correction rule**, especially for short words (≤2 chars) or common Chinese words that appear correctly in normal text.
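One cheap pre-check before adding a rule for a short or common word: count how often the "error" string appears legitimately in known-good text. This is a sketch with a hypothetical reference file; any nonzero count for a standalone common word is a signal to skip the rule or anchor it with a longer context phrase:

```shell
# Hypothetical corpus of known-correct text (illustrative lines)
printf '争论不休\n蒸馏模型效果好\n' > corpus.txt

# How many clean lines already contain the candidate "error" string?
grep -c '争' corpus.txt   # nonzero → high false positive risk for a bare 争→蒸 rule
```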
@@ -59,21 +84,46 @@ Adding wrong dictionary rules silently corrupts future transcripts. **Read `refe
When running inside Claude Code, use Claude's own language understanding for Phase 2:
1. Run Stage 1 (dictionary): `--input file.md --stage 1`
1. Run Stage 1 (dictionary) on all files (parallel if multiple)
2. Verify Stage 1 — diff original vs output. If dictionary introduced false positives, work from the **original** file
3. Read the full text in ~200-line chunks. Read the entire transcript before proposing corrections — later context often disambiguates earlier errors
4. Identify ASR errors:
- Product/tool names: "close code" → "Claude Code", "get Hub" → "GitHub"
- Technical terms: "Web coding" → "Vibe Coding", "happy pass" → "happy path"
- Homophone errors: "上海文" → "上下文", "分值" → "分支"
- English ASR garbling: "Pre top" → "prototype", "rapper" → "repo"
- Broken sentences: "很大程。路上" → "很大程度上"
5. Present corrections in high/medium confidence tables with line numbers
6. Apply with sed on a copy, verify with diff, replace original
7. Generate word diff: `uv run scripts/generate_word_diff.py original.md corrected.md diff.html`
8. Save stable patterns to dictionary
3. Read **all** Stage 1 outputs fully before proposing any corrections — later context often disambiguates earlier errors. For large files (>10k tokens), read in chunks but finish the entire file before identifying errors
4. Identify ASR errors per file — classify by confidence:
- **High confidence** (apply directly): non-words, obvious garbling, product name variants
- **Medium confidence** (present for review): context-dependent homophones, person names
5. Apply fixes efficiently:
- **Global replacements** (unique non-words like "克劳锐"→"Claude"): use `sed -i ''` with `-e` flags, multiple patterns in one command
- **Context-dependent** (common words like "争"→"蒸" only in distillation context): use sed with longer context phrases for uniqueness, or Edit tool
6. Verify with diff: `diff original.txt corrected_stage1.md`
7. Finalize files: rename `*_stage1.md` → `*.md`, delete original `.txt`
8. Save stable patterns to dictionary (see "Dictionary Addition" below)
9. Remove false positives if Stage 1 had any
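A minimal sketch of step 5's two fix styles, using made-up transcript text (on macOS, `sed -i ''`): the unique non-word gets a bare global replace, while the common word 争 is only rewritten inside the longer phrase 争馏, leaving standalone uses untouched:

```shell
# Stand-in Stage 1 output (illustrative)
printf '克劳锐发布了新模型。\n我们争馏一下数据。\n他争着发言。\n' > talk_stage1.md

sed -i \
  -e 's/克劳锐/Claude/g' \
  -e 's/争馏/蒸馏/g' \
  talk_stage1.md   # longer phrase anchors the fix; bare 争 on line 3 survives

cat talk_stage1.md
```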
### Common ASR Error Patterns
AI product names are frequently garbled. These patterns recur across transcripts:
| Correct term | Common ASR variants |
|-------------|-------------------|
| Claude | cloud, Clou, calloc, 克劳锐, Clover, color |
| Claude Code | cloud code, Xcode, call code, cloucode, cloudcode, color code |
| Claude Agent SDK | cloud agent SDK |
| Opus | Opaas |
| Vibe Coding | web coding, Web coding |
| GitHub | get Hub, Git Hub |
| prototype | Pre top |
Person names and company names also produce consistent ASR errors across sessions — always add confirmed name corrections to the dictionary.
### Efficient Batch Fix Strategy
When fixing multiple files (e.g., 5 transcripts from one day):
1. **Stage 1 in parallel**: run all files through dictionary at once
2. **Read all files first**: build a mental model of speakers, topics, and recurring terms before fixing anything
3. **Compile a global correction list**: many errors repeat across files from the same session (same speakers, same topics)
4. **Apply global corrections first** (sed with multiple `-e` flags), then per-file context-dependent fixes
5. **Verify all diffs**, finalize all files, then do one dictionary addition pass
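The global-corrections pass (step 4) can be sketched as a loop over a session directory. Directory layout, filenames, and the sample error are assumptions for illustration; on macOS, use `sed -i ''`:

```shell
# Stand-in session directory with two transcripts sharing a recurring error
mkdir -p session
printf 'get Hub repo\n' > session/a.txt
printf 'pushed to get Hub again\n' > session/b.txt

# Apply the compiled global correction list to every file in the session
for f in session/*.txt; do
  sed -i -e 's/get Hub/GitHub/g' "$f"
done

grep -h GitHub session/*.txt   # spot-check the fix landed in all files
```

Compiling one list and looping beats fixing each file independently, since the same speakers and topics produce the same errors across a session.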
### Enhanced Capabilities (Native Mode Only)
- **Intelligent paragraph breaks**: Add `\n\n` at logical topic transitions