feat: Update skill-creator and transcript-fixer
skill-creator v1.2.0 → v1.2.1: - Add critical warning about not editing skills in cache directory - Cache location (~/.claude/plugins/cache/) is read-only - Changes there are lost on cache refresh transcript-fixer v1.0.0 → v1.1.0: - Add Chinese/Japanese/Korean domain name support (火星加速器, 具身智能) - Add [CLAUDE_FALLBACK] signal for Claude Code to take over when GLM unavailable - Add Prerequisites section requiring uv for Python execution - Add Critical Workflow section for dictionary iteration - Add AI Fallback Strategy and Database Operations sections - Add Stages table (Dictionary → AI → Full pipeline) - Add ensure_deps.py script for shared virtual environment - Add database_schema.md and iteration_workflow.md references - Update domain validation from whitelist to pattern matching - Update tests for Chinese domains and security bypass attempts 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
190
transcript-fixer/references/database_schema.md
Normal file
190
transcript-fixer/references/database_schema.md
Normal file
@@ -0,0 +1,190 @@
|
||||
# Database Schema Reference
|
||||
|
||||
**MUST read this before any database operations.**
|
||||
|
||||
Database location: `~/.transcript-fixer/corrections.db`
|
||||
|
||||
## Core Tables
|
||||
|
||||
### corrections
|
||||
|
||||
Main storage for correction mappings.
|
||||
|
||||
| Column | Type | Description |
|
||||
|--------|------|-------------|
|
||||
| id | INTEGER | Primary key |
|
||||
| from_text | TEXT | Error text to match (NOT NULL) |
|
||||
| to_text | TEXT | Correct replacement (NOT NULL) |
|
||||
| domain | TEXT | Domain: general, embodied_ai, finance, medical |
|
||||
| source | TEXT | 'manual', 'learned', 'imported' |
|
||||
| confidence | REAL | 0.0-1.0 |
|
||||
| added_by | TEXT | Username |
|
||||
| added_at | TIMESTAMP | Creation time |
|
||||
| usage_count | INTEGER | Times this correction was applied |
|
||||
| last_used | TIMESTAMP | Last time used |
|
||||
| notes | TEXT | Optional notes |
|
||||
| is_active | BOOLEAN | Active flag (1=active, 0=disabled) |
|
||||
|
||||
**Constraint**: `UNIQUE(from_text, domain)`
|
||||
|
||||
### context_rules
|
||||
|
||||
Regex-based context-aware correction rules.
|
||||
|
||||
| Column | Type | Description |
|
||||
|--------|------|-------------|
|
||||
| id | INTEGER | Primary key |
|
||||
| pattern | TEXT | Regex pattern (UNIQUE) |
|
||||
| replacement | TEXT | Replacement text |
|
||||
| description | TEXT | Rule description |
|
||||
| priority | INTEGER | Higher = processed first |
|
||||
| is_active | BOOLEAN | Active flag |
|
||||
|
||||
### learned_suggestions
|
||||
|
||||
AI-learned patterns pending user review.
|
||||
|
||||
| Column | Type | Description |
|
||||
|--------|------|-------------|
|
||||
| id | INTEGER | Primary key |
|
||||
| from_text | TEXT | Detected error |
|
||||
| to_text | TEXT | Suggested correction |
|
||||
| domain | TEXT | Domain |
|
||||
| frequency | INTEGER | Occurrence count (≥1) |
|
||||
| confidence | REAL | AI confidence (0.0-1.0) |
|
||||
| first_seen | TIMESTAMP | First occurrence |
|
||||
| last_seen | TIMESTAMP | Last occurrence |
|
||||
| status | TEXT | 'pending', 'approved', 'rejected' |
|
||||
| reviewed_at | TIMESTAMP | Review time |
|
||||
| reviewed_by | TEXT | Reviewer |
|
||||
|
||||
**Constraint**: `UNIQUE(from_text, to_text, domain)`
|
||||
|
||||
### correction_history
|
||||
|
||||
Audit log for all correction runs.
|
||||
|
||||
| Column | Type | Description |
|
||||
|--------|------|-------------|
|
||||
| id | INTEGER | Primary key |
|
||||
| filename | TEXT | Input file name |
|
||||
| domain | TEXT | Domain used |
|
||||
| run_timestamp | TIMESTAMP | When run |
|
||||
| original_length | INTEGER | Original text length |
|
||||
| stage1_changes | INTEGER | Dictionary changes count |
|
||||
| stage2_changes | INTEGER | AI changes count |
|
||||
| model | TEXT | AI model used |
|
||||
| execution_time_ms | INTEGER | Processing time |
|
||||
| success | BOOLEAN | Success flag |
|
||||
| error_message | TEXT | Error if failed |
|
||||
|
||||
### correction_changes
|
||||
|
||||
Detailed changes made in each correction run.
|
||||
|
||||
| Column | Type | Description |
|
||||
|--------|------|-------------|
|
||||
| id | INTEGER | Primary key |
|
||||
| history_id | INTEGER | FK → correction_history.id |
|
||||
| line_number | INTEGER | Line where change occurred |
|
||||
| from_text | TEXT | Original text |
|
||||
| to_text | TEXT | Corrected text |
|
||||
| rule_type | TEXT | 'context', 'dictionary', 'ai' |
|
||||
| rule_id | INTEGER | Reference to rule used |
|
||||
| context_before | TEXT | Text before change |
|
||||
| context_after | TEXT | Text after change |
|
||||
|
||||
### system_config
|
||||
|
||||
Key-value configuration store.
|
||||
|
||||
| Column | Type | Description |
|
||||
|--------|------|-------------|
|
||||
| key | TEXT | Config key (PRIMARY KEY) |
|
||||
| value | TEXT | Config value |
|
||||
| value_type | TEXT | 'string', 'int', 'float', 'boolean', 'json' |
|
||||
| description | TEXT | What this config does |
|
||||
| updated_at | TIMESTAMP | Last update |
|
||||
|
||||
**Default configs**:
|
||||
- `schema_version`: '2.0'
|
||||
- `api_model`: 'GLM-4.6'
|
||||
- `learning_frequency_threshold`: 3
|
||||
- `learning_confidence_threshold`: 0.8
|
||||
- `history_retention_days`: 90
|
||||
|
||||
### audit_log
|
||||
|
||||
Comprehensive operations trail.
|
||||
|
||||
| Column | Type | Description |
|
||||
|--------|------|-------------|
|
||||
| id | INTEGER | Primary key |
|
||||
| timestamp | TIMESTAMP | When occurred |
|
||||
| action | TEXT | Action type |
|
||||
| entity_type | TEXT | Table affected |
|
||||
| entity_id | INTEGER | Row ID |
|
||||
| user | TEXT | Who did it |
|
||||
| details | TEXT | JSON details |
|
||||
| success | BOOLEAN | Success flag |
|
||||
| error_message | TEXT | Error if failed |
|
||||
|
||||
## Views
|
||||
|
||||
### active_corrections
|
||||
|
||||
Active corrections only, ordered by domain and from_text.
|
||||
|
||||
```sql
|
||||
SELECT * FROM active_corrections;
|
||||
```
|
||||
|
||||
### pending_suggestions
|
||||
|
||||
Suggestions awaiting review, with example count.
|
||||
|
||||
```sql
|
||||
SELECT * FROM pending_suggestions WHERE confidence > 0.8;
|
||||
```
|
||||
|
||||
### correction_statistics
|
||||
|
||||
Statistics per domain.
|
||||
|
||||
```sql
|
||||
SELECT * FROM correction_statistics;
|
||||
```
|
||||
|
||||
## Common Queries
|
||||
|
||||
```sql
|
||||
-- List all active corrections
|
||||
SELECT from_text, to_text, domain FROM active_corrections;
|
||||
|
||||
-- Check pending high-confidence suggestions
|
||||
SELECT * FROM pending_suggestions WHERE confidence > 0.8 ORDER BY frequency DESC;
|
||||
|
||||
-- Domain statistics
|
||||
SELECT domain, total_corrections, total_usage FROM correction_statistics;
|
||||
|
||||
-- Recent correction history
|
||||
SELECT filename, stage1_changes, stage2_changes, run_timestamp
|
||||
FROM correction_history
|
||||
ORDER BY run_timestamp DESC LIMIT 10;
|
||||
|
||||
-- Add new correction (use CLI instead for safety)
|
||||
INSERT INTO corrections (from_text, to_text, domain, source, confidence, added_by)
|
||||
VALUES ('错误词', '正确词', 'general', 'manual', 1.0, 'user');
|
||||
|
||||
-- Disable a correction
|
||||
UPDATE corrections SET is_active = 0 WHERE id = ?;
|
||||
```
|
||||
|
||||
## Schema Version
|
||||
|
||||
Check current version:
|
||||
```sql
|
||||
SELECT value FROM system_config WHERE key = 'schema_version';
|
||||
```
|
||||
|
||||
For complete schema including indexes and constraints, see `scripts/core/schema.sql`.
|
||||
124
transcript-fixer/references/iteration_workflow.md
Normal file
124
transcript-fixer/references/iteration_workflow.md
Normal file
@@ -0,0 +1,124 @@
|
||||
# Dictionary Iteration Workflow
|
||||
|
||||
The core value of transcript-fixer is building a personalized correction dictionary that improves over time.
|
||||
|
||||
## The Core Loop
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────┐
|
||||
│ 1. Fix transcript (manual or Stage 3) │
|
||||
│ ↓ │
|
||||
│ 2. Identify new ASR errors during fixing │
|
||||
│ ↓ │
|
||||
│ 3. IMMEDIATELY save to dictionary │
|
||||
│ ↓ │
|
||||
│ 4. Next time: Stage 1 auto-corrects these │
|
||||
└─────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
**Key principle**: Every correction you make should be saved to the dictionary. This transforms one-time work into permanent value.
|
||||
|
||||
## Workflow Checklist
|
||||
|
||||
Copy this checklist when correcting transcripts:
|
||||
|
||||
```
|
||||
Correction Progress:
|
||||
- [ ] Run correction: --input file.md --stage 3
|
||||
- [ ] Review output file for remaining ASR errors
|
||||
- [ ] Fix errors manually with Edit tool
|
||||
- [ ] Save EACH correction to dictionary with --add
|
||||
- [ ] Verify with --list that corrections were saved
|
||||
- [ ] Next time: Stage 1 handles these automatically
|
||||
```
|
||||
|
||||
## Save Corrections Immediately
|
||||
|
||||
After fixing any transcript, save corrections:
|
||||
|
||||
```bash
|
||||
# Single correction
|
||||
uv run scripts/fix_transcription.py --add "错误词" "正确词" --domain general
|
||||
|
||||
# Multiple corrections - run command for each
|
||||
uv run scripts/fix_transcription.py --add "片片总" "翩翩总" --domain general
|
||||
uv run scripts/fix_transcription.py --add "姐弟" "结业" --domain general
|
||||
uv run scripts/fix_transcription.py --add "自杀性" "自嗨性" --domain general
|
||||
uv run scripts/fix_transcription.py --add "被看" "被砍" --domain general
|
||||
uv run scripts/fix_transcription.py --add "单反过" "单访过" --domain general
|
||||
```
|
||||
|
||||
## Verify Dictionary
|
||||
|
||||
Always verify corrections were saved:
|
||||
|
||||
```bash
|
||||
# List all corrections in current domain
|
||||
uv run scripts/fix_transcription.py --list
|
||||
|
||||
# Direct database query
|
||||
sqlite3 ~/.transcript-fixer/corrections.db \
|
||||
"SELECT from_text, to_text, domain FROM active_corrections ORDER BY added_at DESC LIMIT 10;"
|
||||
```
|
||||
|
||||
## Domain Selection
|
||||
|
||||
Choose the right domain for corrections:
|
||||
|
||||
| Domain | Use Case |
|
||||
|--------|----------|
|
||||
| `general` | Common ASR errors, names, general vocabulary |
|
||||
| `embodied_ai` | 具身智能、机器人、AI 相关术语 |
|
||||
| `finance` | 财务、投资、金融术语 |
|
||||
| `medical` | 医疗、健康相关术语 |
|
||||
| `火星加速器` | Custom Chinese domain name (any valid name works) |
|
||||
|
||||
```bash
|
||||
# Domain-specific correction
|
||||
uv run scripts/fix_transcription.py --add "股价系统" "框架系统" --domain embodied_ai
|
||||
uv run scripts/fix_transcription.py --add "片片总" "翩翩总" --domain 火星加速器
|
||||
```
|
||||
|
||||
## Common ASR Error Patterns
|
||||
|
||||
Build your dictionary with these common patterns:
|
||||
|
||||
| Type | Examples |
|
||||
|------|----------|
|
||||
| **Homophones** | 赢→营, 减→剪, 被看→被砍, 营业→营的 |
|
||||
| **Names** | 片片→翩翩, 亮亮→亮哥 |
|
||||
| **Technical** | 巨升智能→具身智能, 股价→框架 |
|
||||
| **English** | log→vlog |
|
||||
| **Broken words** | 姐弟→结业, 单反→单访 |
|
||||
|
||||
## When GLM API Fails
|
||||
|
||||
If you see `[CLAUDE_FALLBACK]` output, the GLM API is unavailable.
|
||||
|
||||
Steps:
|
||||
1. Claude Code should analyze the text directly for ASR errors
|
||||
2. Fix using Edit tool
|
||||
3. **MUST save corrections to dictionary** - this is critical
|
||||
4. Dictionary corrections work even without AI
|
||||
|
||||
## Auto-Learning Feature
|
||||
|
||||
After running Stage 3 multiple times:
|
||||
|
||||
```bash
|
||||
# Check learned patterns
|
||||
uv run scripts/fix_transcription.py --review-learned
|
||||
|
||||
# Approve high-confidence patterns
|
||||
uv run scripts/fix_transcription.py --approve "错误词" "正确词"
|
||||
```
|
||||
|
||||
Patterns appearing ≥3 times at ≥80% confidence are suggested for review.
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Save immediately**: Don't batch corrections - save each one right after fixing
|
||||
2. **Be specific**: Use exact phrases, not partial words
|
||||
3. **Use domains**: Organize corrections by topic for better precision
|
||||
4. **Verify**: Always run --list to confirm saves
|
||||
5. **Review suggestions**: Periodically check --review-learned for auto-detected patterns
|
||||
Reference in New Issue
Block a user