Release v1.8.0: Add transcript-fixer skill

## New Skill: transcript-fixer v1.0.0 Correct speech-to-text (ASR/STT) transcription errors through dictionary-based rules and AI-powered corrections with automatic pattern learning. **Features:** - Two-stage correction pipeline (dictionary + AI) - Automatic pattern detection and learning - Domain-specific dictionaries (general, embodied_ai, finance, medical) - SQLite-based correction repository - Team collaboration with import/export - GLM API integration for AI corrections - Cost optimization through dictionary promotion **Use cases:** - Correcting meeting notes, lecture recordings, or interview transcripts - Fixing Chinese/English homophone errors and technical terminology - Building domain-specific correction dictionaries - Improving transcript accuracy through iterative learning **Documentation:** - Complete workflow guides in references/ - SQL query templates - Troubleshooting guide - Team collaboration patterns - API setup instructions **Marketplace updates:** - Updated marketplace to v1.8.0 - Added transcript-fixer plugin (category: productivity) - Updated README.md with skill description and use cases - Updated CLAUDE.md with skill listing and counts 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-28 13:16:37 +08:00
parent d1041ac203
commit bd0aa12004
44 changed files with 7432 additions and 8 deletions
--- a/transcript-fixer/references/file_formats.md
+++ b/transcript-fixer/references/file_formats.md
@@ -0,0 +1,395 @@
+# Storage Format Reference
+
+This document describes the SQLite database format used by transcript-fixer v2.0.
+
+## Table of Contents
+
+- [Database Location](#database-location)
+- [Database Schema](#database-schema)
+  - [Core Tables](#core-tables)
+  - [Views](#views)
+- [Querying the Database](#querying-the-database)
+  - [Using Python API](#using-python-api)
+  - [Using SQLite CLI](#using-sqlite-cli)
+- [Import/Export](#importexport)
+  - [Export to JSON](#export-to-json)
+  - [Import from JSON](#import-from-json)
+- [Backup Strategy](#backup-strategy)
+  - [Automatic Backups](#automatic-backups)
+  - [Manual Backups](#manual-backups)
+  - [Version Control](#version-control)
+- [Best Practices](#best-practices)
+- [Troubleshooting](#troubleshooting)
+  - [Database Locked](#database-locked)
+  - [Corrupted Database](#corrupted-database)
+  - [Missing Tables](#missing-tables)
+
+## Database Location
+
+**Path**: `~/.transcript-fixer/corrections.db`
+
+**Type**: SQLite 3 database with ACID guarantees
+
+## Database Schema
+
+### Core Tables
+
+#### corrections
+
+Main correction dictionary storage.
+
+| Column | Type | Constraints | Description |
+|--------|------|-------------|-------------|
+| id | INTEGER | PRIMARY KEY | Auto-increment ID |
+| from_text | TEXT | NOT NULL | Original (incorrect) text |
+| to_text | TEXT | NOT NULL | Corrected text |
+| domain | TEXT | DEFAULT 'general' | Correction domain |
+| source | TEXT | CHECK IN ('manual', 'learned', 'imported') | Origin of correction |
+| confidence | REAL | CHECK 0.0-1.0 | Confidence score |
+| added_by | TEXT | | User who added |
+| added_at | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | When added |
+| usage_count | INTEGER | DEFAULT 0, CHECK >= 0 | Times used |
+| last_used | TIMESTAMP | | Last usage time |
+| notes | TEXT | | Optional notes |
+| is_active | BOOLEAN | DEFAULT 1 | Soft delete flag |
+
+**Unique Constraint**: `(from_text, domain)`
+
+**Indexes**: domain, source, added_at, is_active, from_text
+
+#### context_rules
+
+Regex-based context-aware correction rules.
+
+| Column | Type | Constraints | Description |
+|--------|------|-------------|-------------|
+| id | INTEGER | PRIMARY KEY | Auto-increment ID |
+| pattern | TEXT | NOT NULL, UNIQUE | Regex pattern |
+| replacement | TEXT | NOT NULL | Replacement text |
+| description | TEXT | | Rule explanation |
+| priority | INTEGER | DEFAULT 0 | Higher = applied first |
+| is_active | BOOLEAN | DEFAULT 1 | Enable/disable |
+| added_at | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | When added |
+| added_by | TEXT | | User who added |
+
+**Indexes**: priority (DESC), is_active
+
+#### correction_history
+
+Audit log for all correction runs.
+
+| Column | Type | Constraints | Description |
+|--------|------|-------------|-------------|
+| id | INTEGER | PRIMARY KEY | Auto-increment ID |
+| filename | TEXT | NOT NULL | File corrected |
+| domain | TEXT | NOT NULL | Domain used |
+| run_timestamp | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | When run |
+| original_length | INTEGER | CHECK >= 0 | Original file size |
+| stage1_changes | INTEGER | CHECK >= 0 | Dictionary changes |
+| stage2_changes | INTEGER | CHECK >= 0 | AI changes |
+| model | TEXT | | AI model used |
+| execution_time_ms | INTEGER | | Runtime in ms |
+| success | BOOLEAN | DEFAULT 1 | Success flag |
+| error_message | TEXT | | Error if failed |
+
+**Indexes**: run_timestamp (DESC), domain, success
+
+#### correction_changes
+
+Detailed changes made in each run.
+
+| Column | Type | Constraints | Description |
+|--------|------|-------------|-------------|
+| id | INTEGER | PRIMARY KEY | Auto-increment ID |
+| history_id | INTEGER | FOREIGN KEY → correction_history | Parent run |
+| line_number | INTEGER | | Line in file |
+| from_text | TEXT | NOT NULL | Original text |
+| to_text | TEXT | NOT NULL | Corrected text |
+| rule_type | TEXT | CHECK IN ('context', 'dictionary', 'ai') | Rule type |
+| rule_id | INTEGER | | Reference to rule |
+| context_before | TEXT | | Text before |
+| context_after | TEXT | | Text after |
+
+**Foreign Key**: history_id → correction_history.id (CASCADE DELETE)
+
+**Indexes**: history_id, rule_type
+
+#### learned_suggestions
+
+AI-detected patterns pending review.
+
+| Column | Type | Constraints | Description |
+|--------|------|-------------|-------------|
+| id | INTEGER | PRIMARY KEY | Auto-increment ID |
+| from_text | TEXT | NOT NULL | Pattern detected |
+| to_text | TEXT | NOT NULL | Suggested correction |
+| domain | TEXT | DEFAULT 'general' | Domain |
+| frequency | INTEGER | CHECK > 0 | Times seen |
+| confidence | REAL | CHECK 0.0-1.0 | Confidence score |
+| first_seen | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | First occurrence |
+| last_seen | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | Last occurrence |
+| status | TEXT | CHECK IN ('pending', 'approved', 'rejected') | Review status |
+| reviewed_at | TIMESTAMP | | When reviewed |
+| reviewed_by | TEXT | | Who reviewed |
+
+**Unique Constraint**: `(from_text, to_text, domain)`
+
+**Indexes**: status, domain, confidence (DESC), frequency (DESC)
+
+#### suggestion_examples
+
+Example occurrences of learned patterns.
+
+| Column | Type | Constraints | Description |
+|--------|------|-------------|-------------|
+| id | INTEGER | PRIMARY KEY | Auto-increment ID |
+| suggestion_id | INTEGER | FOREIGN KEY → learned_suggestions | Parent suggestion |
+| filename | TEXT | NOT NULL | File where found |
+| line_number | INTEGER | | Line number |
+| context | TEXT | NOT NULL | Surrounding text |
+| occurred_at | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | When found |
+
+**Foreign Key**: suggestion_id → learned_suggestions.id (CASCADE DELETE)
+
+**Index**: suggestion_id
+
+#### system_config
+
+System configuration key-value store.
+
+| Column | Type | Constraints | Description |
+|--------|------|-------------|-------------|
+| key | TEXT | PRIMARY KEY | Config key |
+| value | TEXT | NOT NULL | Config value |
+| value_type | TEXT | CHECK IN ('string', 'int', 'float', 'boolean', 'json') | Value type |
+| description | TEXT | | Config description |
+| updated_at | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | Last update |
+
+**Default Values**:
+- `schema_version`: "2.0"
+- `api_provider`: "GLM"
+- `api_model`: "GLM-4.6"
+- `default_domain`: "general"
+- `auto_learn_enabled`: "true"
+- `learning_frequency_threshold`: "3"
+- `learning_confidence_threshold`: "0.8"
+
+#### audit_log
+
+Comprehensive audit trail for all operations.
+
+| Column | Type | Constraints | Description |
+|--------|------|-------------|-------------|
+| id | INTEGER | PRIMARY KEY | Auto-increment ID |
+| timestamp | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | When occurred |
+| action | TEXT | NOT NULL | Action type |
+| entity_type | TEXT | NOT NULL | Entity affected |
+| entity_id | INTEGER | | Entity ID |
+| user | TEXT | | User who performed |
+| details | TEXT | | Action details |
+| success | BOOLEAN | DEFAULT 1 | Success flag |
+| error_message | TEXT | | Error if failed |
+
+**Indexes**: timestamp (DESC), action, entity_type, success
+
+### Views
+
+#### active_corrections
+
+Quick access to active corrections.
+
+```sql
+SELECT id, from_text, to_text, domain, source, confidence, usage_count, last_used, added_at
+FROM corrections
+WHERE is_active = 1
+ORDER BY domain, from_text;
+```
+
+#### pending_suggestions
+
+Suggestions pending review with example count.
+
+```sql
+SELECT s.id, s.from_text, s.to_text, s.domain, s.frequency, s.confidence,
+       s.first_seen, s.last_seen, COUNT(e.id) as example_count
+FROM learned_suggestions s
+LEFT JOIN suggestion_examples e ON s.id = e.suggestion_id
+WHERE s.status = 'pending'
+GROUP BY s.id
+ORDER BY s.confidence DESC, s.frequency DESC;
+```
+
+#### correction_statistics
+
+Statistics per domain.
+
+```sql
+SELECT domain,
+       COUNT(*) as total_corrections,
+       COUNT(CASE WHEN source = 'manual' THEN 1 END) as manual_count,
+       COUNT(CASE WHEN source = 'learned' THEN 1 END) as learned_count,
+       COUNT(CASE WHEN source = 'imported' THEN 1 END) as imported_count,
+       SUM(usage_count) as total_usage,
+       MAX(added_at) as last_updated
+FROM corrections
+WHERE is_active = 1
+GROUP BY domain;
+```
+
+## Querying the Database
+
+### Using Python API
+
+```python
+from pathlib import Path
+from core import CorrectionRepository, CorrectionService
+
+# Initialize
+db_path = Path.home() / ".transcript-fixer" / "corrections.db"
+repository = CorrectionRepository(db_path)
+service = CorrectionService(repository)
+
+# Add correction
+service.add_correction("错误", "正确", domain="general")
+
+# Get corrections
+corrections = service.get_corrections(domain="general")
+
+# Get statistics
+stats = service.get_statistics(domain="general")
+print(f"Total: {stats['total_corrections']}")
+
+# Close
+service.close()
+```
+
+### Using SQLite CLI
+
+```bash
+# Open database
+sqlite3 ~/.transcript-fixer/corrections.db
+
+# View active corrections
+SELECT from_text, to_text, domain FROM active_corrections;
+
+# View statistics
+SELECT * FROM correction_statistics;
+
+# View pending suggestions
+SELECT * FROM pending_suggestions;
+
+# Check schema version
+SELECT value FROM system_config WHERE key = 'schema_version';
+```
+
+## Import/Export
+
+### Export to JSON
+
+```python
+service = _get_service()
+corrections = service.export_corrections(domain="general")
+
+# Write to file
+import json
+with open("export.json", "w", encoding="utf-8") as f:
+    json.dump({
+        "version": "2.0",
+        "domain": "general",
+        "corrections": corrections
+    }, f, ensure_ascii=False, indent=2)
+```
+
+### Import from JSON
+
+```python
+import json
+
+with open("import.json", "r", encoding="utf-8") as f:
+    data = json.load(f)
+
+service = _get_service()
+inserted, updated, skipped = service.import_corrections(
+    corrections=data["corrections"],
+    domain=data.get("domain", "general"),
+    merge=True,
+    validate_all=True
+)
+
+print(f"Imported: {inserted} new, {updated} updated, {skipped} skipped")
+```
+
+## Backup Strategy
+
+### Automatic Backups
+
+The system maintains database integrity through SQLite's ACID guarantees and automatic journaling.
+
+### Manual Backups
+
+```bash
+# Backup database
+cp ~/.transcript-fixer/corrections.db ~/backups/corrections_$(date +%Y%m%d).db
+
+# Or use SQLite backup
+sqlite3 ~/.transcript-fixer/corrections.db ".backup ~/backups/corrections.db"
+```
+
+### Version Control
+
+**Recommended**: Use Git for configuration and export files, but NOT for the database:
+
+```bash
+# .gitignore
+*.db
+*.db-journal
+*.bak
+```
+
+Instead, export corrections periodically:
+
+```bash
+python scripts/fix_transcription.py --export-json corrections_backup.json
+git add corrections_backup.json
+git commit -m "Backup corrections"
+```
+
+## Best Practices
+
+1. **Regular Exports**: Export to JSON weekly for team sharing
+2. **Database Backups**: Backup `.db` file before major changes
+3. **Use Transactions**: All modifications use ACID transactions automatically
+4. **Soft Deletes**: Records are marked inactive, not deleted (preserves audit trail)
+5. **Validate**: Run `--validate` after manual database changes
+6. **Statistics**: Check usage patterns via `correction_statistics` view
+7. **Cleanup**: Old history can be archived (query by `run_timestamp`)
+
+## Troubleshooting
+
+### Database Locked
+
+```bash
+# Check for lingering connections
+lsof ~/.transcript-fixer/corrections.db
+
+# If needed, backup and recreate
+cp corrections.db corrections_backup.db
+sqlite3 corrections.db "VACUUM;"
+```
+
+### Corrupted Database
+
+```bash
+# Check integrity
+sqlite3 corrections.db "PRAGMA integrity_check;"
+
+# Recover if possible
+sqlite3 corrections.db ".recover" | sqlite3 corrections_new.db
+```
+
+### Missing Tables
+
+```bash
+# Reinitialize schema (safe, uses IF NOT EXISTS)
+python -c "from core import CorrectionRepository; from pathlib import Path; CorrectionRepository(Path.home() / '.transcript-fixer' / 'corrections.db')"
+```