feat: Update skill-creator and transcript-fixer

skill-creator v1.2.0 → v1.2.1:
- Add critical warning about not editing skills in cache directory
- Cache location (~/.claude/plugins/cache/) is read-only
- Changes there are lost on cache refresh

transcript-fixer v1.0.0 → v1.1.0:
- Add Chinese/Japanese/Korean domain name support (火星加速器, 具身智能)
- Add [CLAUDE_FALLBACK] signal for Claude Code to take over when GLM unavailable
- Add Prerequisites section requiring uv for Python execution
- Add Critical Workflow section for dictionary iteration
- Add AI Fallback Strategy and Database Operations sections
- Add Stages table (Dictionary → AI → Full pipeline)
- Add ensure_deps.py script for shared virtual environment
- Add database_schema.md and iteration_workflow.md references
- Update domain validation from whitelist to pattern matching
- Update tests for Chinese domains and security bypass attempts

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Author: daymade
Date: 2025-12-11 13:04:27 +08:00
Parent: 20cc442ec4
Commit: 1d237fc3be
12 changed files with 556 additions and 27 deletions

File: database_schema.md (new file, 190 lines)
# Database Schema Reference
**MUST read this before any database operations.**
Database location: `~/.transcript-fixer/corrections.db`
## Core Tables
### corrections
Main storage for correction mappings.
| Column | Type | Description |
|--------|------|-------------|
| id | INTEGER | Primary key |
| from_text | TEXT | Error text to match (NOT NULL) |
| to_text | TEXT | Correct replacement (NOT NULL) |
| domain | TEXT | Domain: general, embodied_ai, finance, medical |
| source | TEXT | 'manual', 'learned', 'imported' |
| confidence | REAL | 0.0-1.0 |
| added_by | TEXT | Username |
| added_at | TIMESTAMP | Creation time |
| usage_count | INTEGER | Times this correction was applied |
| last_used | TIMESTAMP | Last time used |
| notes | TEXT | Optional notes |
| is_active | BOOLEAN | Active flag (1=active, 0=disabled) |
**Constraint**: `UNIQUE(from_text, domain)`
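The `UNIQUE(from_text, domain)` constraint means the same error text can map to different replacements in different domains, but not twice within one domain. A minimal sketch with Python's stdlib `sqlite3` (table trimmed to the columns the constraint involves; the full definition lives in `scripts/core/schema.sql`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE corrections (
        id        INTEGER PRIMARY KEY,
        from_text TEXT NOT NULL,
        to_text   TEXT NOT NULL,
        domain    TEXT DEFAULT 'general',
        UNIQUE (from_text, domain)
    )
""")
conn.execute(
    "INSERT INTO corrections (from_text, to_text, domain) VALUES (?, ?, ?)",
    ("片片总", "翩翩总", "general"),
)
# Same from_text in a different domain is allowed...
conn.execute(
    "INSERT INTO corrections (from_text, to_text, domain) VALUES (?, ?, ?)",
    ("片片总", "翩翩总", "火星加速器"),
)
# ...but a second row with the same (from_text, domain) pair is rejected.
try:
    conn.execute(
        "INSERT INTO corrections (from_text, to_text, domain) VALUES (?, ?, ?)",
        ("片片总", "另一个词", "general"),
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

This is why `--add` for an existing pair updates rather than duplicates the entry.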
### context_rules
Regex-based context-aware correction rules.
| Column | Type | Description |
|--------|------|-------------|
| id | INTEGER | Primary key |
| pattern | TEXT | Regex pattern (UNIQUE) |
| replacement | TEXT | Replacement text |
| description | TEXT | Rule description |
| priority | INTEGER | Higher = processed first |
| is_active | BOOLEAN | Active flag |
### learned_suggestions
AI-learned patterns pending user review.
| Column | Type | Description |
|--------|------|-------------|
| id | INTEGER | Primary key |
| from_text | TEXT | Detected error |
| to_text | TEXT | Suggested correction |
| domain | TEXT | Domain |
| frequency | INTEGER | Occurrence count (≥1) |
| confidence | REAL | AI confidence (0.0-1.0) |
| first_seen | TIMESTAMP | First occurrence |
| last_seen | TIMESTAMP | Last occurrence |
| status | TEXT | 'pending', 'approved', 'rejected' |
| reviewed_at | TIMESTAMP | Review time |
| reviewed_by | TEXT | Reviewer |
**Constraint**: `UNIQUE(from_text, to_text, domain)`
### correction_history
Audit log for all correction runs.
| Column | Type | Description |
|--------|------|-------------|
| id | INTEGER | Primary key |
| filename | TEXT | Input file name |
| domain | TEXT | Domain used |
| run_timestamp | TIMESTAMP | When run |
| original_length | INTEGER | Original text length |
| stage1_changes | INTEGER | Dictionary changes count |
| stage2_changes | INTEGER | AI changes count |
| model | TEXT | AI model used |
| execution_time_ms | INTEGER | Processing time |
| success | BOOLEAN | Success flag |
| error_message | TEXT | Error if failed |
### correction_changes
Detailed changes made in each correction run.
| Column | Type | Description |
|--------|------|-------------|
| id | INTEGER | Primary key |
| history_id | INTEGER | FK → correction_history.id |
| line_number | INTEGER | Line where change occurred |
| from_text | TEXT | Original text |
| to_text | TEXT | Corrected text |
| rule_type | TEXT | 'context', 'dictionary', 'ai' |
| rule_id | INTEGER | Reference to rule used |
| context_before | TEXT | Text before change |
| context_after | TEXT | Text after change |
### system_config
Key-value configuration store.
| Column | Type | Description |
|--------|------|-------------|
| key | TEXT | Config key (PRIMARY KEY) |
| value | TEXT | Config value |
| value_type | TEXT | 'string', 'int', 'float', 'boolean', 'json' |
| description | TEXT | What this config does |
| updated_at | TIMESTAMP | Last update |
**Default configs**:
- `schema_version`: '2.0'
- `api_model`: 'GLM-4.6'
- `learning_frequency_threshold`: 3
- `learning_confidence_threshold`: 0.8
- `history_retention_days`: 90
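Because every `value` is stored as TEXT, readers must cast by `value_type` before use. A hypothetical helper sketching that decode step (the real CLI may implement it differently), demonstrated against an in-memory copy of the table seeded with the documented defaults:

```python
import json
import sqlite3

# Map each documented value_type to a decoder. The 'boolean' rule here
# (accepting '1'/'true'/'yes') is an assumption, not taken from the source.
CASTS = {
    "string": str,
    "int": int,
    "float": float,
    "boolean": lambda v: v.lower() in ("1", "true", "yes"),
    "json": json.loads,
}

def read_config(conn, key):
    row = conn.execute(
        "SELECT value, value_type FROM system_config WHERE key = ?", (key,)
    ).fetchone()
    if row is None:
        return None
    value, value_type = row
    return CASTS.get(value_type, str)(value)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE system_config (key TEXT PRIMARY KEY, value TEXT, value_type TEXT)"
)
conn.executemany(
    "INSERT INTO system_config VALUES (?, ?, ?)",
    [
        ("schema_version", "2.0", "string"),
        ("learning_frequency_threshold", "3", "int"),
        ("learning_confidence_threshold", "0.8", "float"),
    ],
)
threshold = read_config(conn, "learning_frequency_threshold")  # int 3, not '3'
```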
### audit_log
Comprehensive operations trail.
| Column | Type | Description |
|--------|------|-------------|
| id | INTEGER | Primary key |
| timestamp | TIMESTAMP | When occurred |
| action | TEXT | Action type |
| entity_type | TEXT | Table affected |
| entity_id | INTEGER | Row ID |
| user | TEXT | Who did it |
| details | TEXT | JSON details |
| success | BOOLEAN | Success flag |
| error_message | TEXT | Error if failed |
## Views
### active_corrections
Active corrections only, ordered by domain and from_text.
```sql
SELECT * FROM active_corrections;
```
### pending_suggestions
Suggestions awaiting review, with example count.
```sql
SELECT * FROM pending_suggestions WHERE confidence > 0.8;
```
### correction_statistics
Statistics per domain.
```sql
SELECT * FROM correction_statistics;
```
## Common Queries
```sql
-- List all active corrections
SELECT from_text, to_text, domain FROM active_corrections;
-- Check pending high-confidence suggestions
SELECT * FROM pending_suggestions WHERE confidence > 0.8 ORDER BY frequency DESC;
-- Domain statistics
SELECT domain, total_corrections, total_usage FROM correction_statistics;
-- Recent correction history
SELECT filename, stage1_changes, stage2_changes, run_timestamp
FROM correction_history
ORDER BY run_timestamp DESC LIMIT 10;
-- Add new correction (use CLI instead for safety)
INSERT INTO corrections (from_text, to_text, domain, source, confidence, added_by)
VALUES ('错误词', '正确词', 'general', 'manual', 1.0, 'user');
-- Disable a correction
UPDATE corrections SET is_active = 0 WHERE id = ?;
```
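When running these queries from code rather than the `sqlite3` shell, bind the `?` placeholder as a parameter instead of interpolating it into the SQL string. A minimal sketch of the "disable a correction" query above (table trimmed to the relevant columns):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE corrections ("
    "id INTEGER PRIMARY KEY, from_text TEXT, is_active BOOLEAN DEFAULT 1)"
)
conn.execute("INSERT INTO corrections (from_text) VALUES ('错误词')")

def disable_correction(conn, correction_id):
    # The id is bound as a parameter, never formatted into the string.
    cur = conn.execute(
        "UPDATE corrections SET is_active = 0 WHERE id = ?", (correction_id,)
    )
    return cur.rowcount  # rows affected; 0 means no such id

changed = disable_correction(conn, 1)
```

Returning `rowcount` lets callers detect a no-op (wrong id) instead of silently succeeding.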
## Schema Version
Check current version:
```sql
SELECT value FROM system_config WHERE key = 'schema_version';
```
For complete schema including indexes and constraints, see `scripts/core/schema.sql`.

File: iteration_workflow.md (new file, 124 lines)
# Dictionary Iteration Workflow
The core value of transcript-fixer is building a personalized correction dictionary that improves over time.
## The Core Loop
```
┌─────────────────────────────────────────────────┐
│ 1. Fix transcript (manual or Stage 3) │
│ ↓ │
│ 2. Identify new ASR errors during fixing │
│ ↓ │
│ 3. IMMEDIATELY save to dictionary │
│ ↓ │
│ 4. Next time: Stage 1 auto-corrects these │
└─────────────────────────────────────────────────┘
```
**Key principle**: Every correction you make should be saved to the dictionary. This transforms one-time work into permanent value.
## Workflow Checklist
Copy this checklist when correcting transcripts:
```
Correction Progress:
- [ ] Run correction: --input file.md --stage 3
- [ ] Review output file for remaining ASR errors
- [ ] Fix errors manually with Edit tool
- [ ] Save EACH correction to dictionary with --add
- [ ] Verify with --list that corrections were saved
- [ ] Next time: Stage 1 handles these automatically
```
## Save Corrections Immediately
After fixing any transcript, save corrections:
```bash
# Single correction
uv run scripts/fix_transcription.py --add "错误词" "正确词" --domain general
# Multiple corrections - run command for each
uv run scripts/fix_transcription.py --add "片片总" "翩翩总" --domain general
uv run scripts/fix_transcription.py --add "姐弟" "结业" --domain general
uv run scripts/fix_transcription.py --add "自杀性" "自嗨性" --domain general
uv run scripts/fix_transcription.py --add "被看" "被砍" --domain general
uv run scripts/fix_transcription.py --add "单反过" "单访过" --domain general
```
## Verify Dictionary
Always verify corrections were saved:
```bash
# List all corrections in current domain
uv run scripts/fix_transcription.py --list
# Direct database query
sqlite3 ~/.transcript-fixer/corrections.db \
"SELECT from_text, to_text, domain FROM active_corrections ORDER BY added_at DESC LIMIT 10;"
```
## Domain Selection
Choose the right domain for corrections:
| Domain | Use Case |
|--------|----------|
| `general` | Common ASR errors, names, general vocabulary |
| `embodied_ai` | Embodied AI, robotics, and related AI terminology |
| `finance` | Finance, investment, and financial terminology |
| `medical` | Medical and health-related terminology |
| `火星加速器` | Custom Chinese domain name (any valid name works) |
```bash
# Domain-specific correction
uv run scripts/fix_transcription.py --add "股价系统" "框架系统" --domain embodied_ai
uv run scripts/fix_transcription.py --add "片片总" "翩翩总" --domain 火星加速器
```
## Common ASR Error Patterns
Build your dictionary with these common patterns:
| Type | Examples |
|------|----------|
| **Homophones** | 赢→营, 减→剪, 被看→被砍, 营业→营的 |
| **Names** | 片片→翩翩, 亮亮→亮哥 |
| **Technical** | 巨升智能→具身智能, 股价→框架 |
| **English** | log→vlog |
| **Broken words** | 姐弟→结业, 单反→单访 |
## When GLM API Fails
If you see `[CLAUDE_FALLBACK]` output, the GLM API is unavailable.
Steps:
1. Claude Code should analyze the text directly for ASR errors
2. Fix using Edit tool
3. **MUST save corrections to dictionary** - this is critical
4. Dictionary corrections work even without AI
## Auto-Learning Feature
After running Stage 3 multiple times:
```bash
# Check learned patterns
uv run scripts/fix_transcription.py --review-learned
# Approve high-confidence patterns
uv run scripts/fix_transcription.py --approve "错误词" "正确词"
```
Patterns appearing ≥3 times at ≥80% confidence are suggested for review.
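Those two thresholds correspond to the `learning_frequency_threshold` and `learning_confidence_threshold` defaults in `system_config`. A sketch of the filter as a query over `learned_suggestions` (the actual `--review-learned` implementation may differ; sample rows are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE learned_suggestions (
        id INTEGER PRIMARY KEY, from_text TEXT, to_text TEXT,
        frequency INTEGER, confidence REAL, status TEXT DEFAULT 'pending'
    )
""")
conn.executemany(
    "INSERT INTO learned_suggestions "
    "(from_text, to_text, frequency, confidence) VALUES (?, ?, ?, ?)",
    [
        ("巨升智能", "具身智能", 5, 0.95),  # qualifies
        ("股价", "框架", 2, 0.90),          # too infrequent
        ("被看", "被砍", 4, 0.60),          # confidence too low
    ],
)
# Only pending patterns meeting both thresholds surface for review.
rows = conn.execute("""
    SELECT from_text, to_text FROM learned_suggestions
    WHERE status = 'pending' AND frequency >= 3 AND confidence >= 0.8
    ORDER BY frequency DESC
""").fetchall()
```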
## Best Practices
1. **Save immediately**: Don't batch corrections; save each one right after fixing
2. **Be specific**: Use exact phrases, not partial words
3. **Use domains**: Organize corrections by topic for better precision
4. **Verify**: Always run `--list` to confirm saves
5. **Review suggestions**: Periodically check `--review-learned` for auto-detected patterns