fix: prevent dictionary false positives + add tunnel-doctor WSL/Go findings

transcript-fixer:
- Add common_words.py safety system (blocks common Chinese words from dictionary)
- Add --audit command to scan existing dictionary for risky rules
- Add --force flag to override safety checks explicitly
- Fix substring corruption (产线数据→产线束据, 现金流→现现金流)
- Unified position-aware replacement with _already_corrected() check
- 69 tests covering all production false positive scenarios

tunnel-doctor:
- Add Step 5A: Tailscale SSH proxy silent failure on WSL
- Add Step 5B: App Store vs Standalone Tailscale on macOS
- Add Go net/http NO_PROXY CIDR incompatibility warning
- Add utun interface identification (MTU 1280=Tailscale, 4064=Shadowrocket)
- Fix "Four→Five Conflict Layers" inconsistency in reference doc
- Add complete working Shadowrocket config reference

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 15:56:38 +08:00
parent d4634cb00b
commit a496c91cae
12 changed files with 1596 additions and 44 deletions
--- a/transcript-fixer/SKILL.md
+++ b/transcript-fixer/SKILL.md
@@ -142,6 +142,46 @@ Do **not** save one-off deletions, ambiguous context-only rewrites, or section-s

 See `references/iteration_workflow.md` for complete iteration guide with checklist.

+## FALSE POSITIVE RISKS -- READ BEFORE ADDING CORRECTIONS
+
+Dictionary-based corrections are powerful but dangerous: one wrong rule silently corrupts every future transcript. The `--add` command runs safety checks automatically, but you still need to understand the risks yourself.
+
+### What is safe to add
+
+- **ASR-specific gibberish**: "巨升智能" -> "具身智能" ("巨升智能" is not a real word, so it never appears in correct text)
+- **Long compound errors**: "语音是别" -> "语音识别" (4+ chars, unlikely to collide)
+- **English transliteration errors**: "japanese 3 pro" -> "Gemini 3 Pro"
+
+### What is NEVER safe to add
+
+- **Common Chinese words**: "仿佛", "正面", "犹豫", "传说", "增加", "教育" -- these appear correctly in normal text. Replacing them corrupts transcripts from better ASR models.
+- **Words <=2 characters**: Almost any 2-char Chinese string is a valid word or part of one. For example, a rule "线数" -> "线束" also matches inside "产线数据" and turns it into "产线束据".
+- **Both sides are real words**: "仿佛->反复", "犹豫->抑郁" -- both forms are valid Chinese. The "error" is only an error for one specific ASR model.
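
The three risk categories above can be sketched as a small pre-check. This is only an illustration: `COMMON_WORDS`, `risk_reasons`, and the reason labels are hypothetical names, and the word set is a tiny sample taken from the examples in this section, not the actual contents of `common_words.py`.

```python
# Hypothetical sketch of the three risk checks described above.
# COMMON_WORDS is a tiny illustrative sample, NOT the real common_words.py list.
COMMON_WORDS = {"仿佛", "正面", "犹豫", "传说", "增加", "教育", "反复", "抑郁"}


def risk_reasons(wrong: str, right: str) -> list[str]:
    """Return the reasons a proposed rule wrong -> right looks risky (empty if none)."""
    reasons = []
    if len(wrong) <= 2:
        # Short strings are very likely substrings of longer valid words.
        reasons.append("too_short")
    if wrong in COMMON_WORDS:
        # The "error" form is a normal word that appears correctly in real text.
        reasons.append("common_word")
    if wrong in COMMON_WORDS and right in COMMON_WORDS:
        # Both forms are valid, so the rule only fits one ASR model's mistakes.
        reasons.append("both_real_words")
    return reasons


print(risk_reasons("巨升智能", "具身智能"))  # long gibberish: no reasons
print(risk_reasons("线数", "线束"))          # flagged as too short
print(risk_reasons("仿佛", "反复"))          # flagged on all three checks
```

A rule with an empty result is not proven safe; it has only passed these three heuristics.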
+
+### When in doubt, use a context rule instead
+
+Context rules use regex patterns that match only in specific surroundings, avoiding false positives:
+```bash
+# Instead of: --add "线数" "线束"
+# Use a context rule in the database:
+sqlite3 ~/.transcript-fixer/corrections.db "INSERT INTO context_rules (pattern, replacement, description, priority) VALUES ('(?<!产)线数(?!据)', '线束', 'ASR: 线数->线束 (not inside 产线数据)', 10);"
+```
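
To see why the lookaround pattern is safer than a plain dictionary entry, the following minimal sketch compares the two on the example from this section. It assumes the pattern is evaluated with Python's `re` module (the tool's scripts are Python, but the engine it actually uses is not shown here); `naive_fix` and `context_fix` are hypothetical helper names.

```python
import re

# Pattern copied from the context rule above; assumed to be evaluated by Python's `re`.
CONTEXT_PATTERN = re.compile(r"(?<!产)线数(?!据)")


def naive_fix(text: str) -> str:
    """Plain dictionary-style replacement: rewrites every occurrence of 线数."""
    return text.replace("线数", "线束")


def context_fix(text: str) -> str:
    """Context-rule replacement: skips 线数 when preceded by 产 or followed by 据."""
    return CONTEXT_PATTERN.sub("线束", text)


already_correct = "产线数据"    # valid text that must not change
asr_error = "检查线数连接"      # 线数 here is the ASR error for 线束

print(naive_fix(already_correct))    # corrupted by the plain rule
print(context_fix(already_correct))  # left untouched by the context rule
print(context_fix(asr_error))        # the real error is still corrected
```

Either lookaround alone blocks the match, so the rule also leaves strings such as "线数据" untouched. This is a deliberate bias toward missing a correction rather than corrupting valid text.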
+
+### Auditing the dictionary
+
+Run `--audit` periodically to scan all rules for false positive risks:
+```bash
+uv run scripts/fix_transcription.py --audit
+uv run scripts/fix_transcription.py --audit --domain manufacturing
+```
+
+### Forcing a risky addition
+
+If you understand the risks and still want to add a flagged rule:
+```bash
+uv run scripts/fix_transcription.py --add "仿佛" "反复" --domain general --force
+```
+
 ## AI Fallback Strategy

 When GLM API is unavailable (503, network issues), the script outputs `[CLAUDE_FALLBACK]` marker.