Release v1.8.0: Add transcript-fixer skill
## New Skill: transcript-fixer v1.0.0 Correct speech-to-text (ASR/STT) transcription errors through dictionary-based rules and AI-powered corrections with automatic pattern learning. **Features:** - Two-stage correction pipeline (dictionary + AI) - Automatic pattern detection and learning - Domain-specific dictionaries (general, embodied_ai, finance, medical) - SQLite-based correction repository - Team collaboration with import/export - GLM API integration for AI corrections - Cost optimization through dictionary promotion **Use cases:** - Correcting meeting notes, lecture recordings, or interview transcripts - Fixing Chinese/English homophone errors and technical terminology - Building domain-specific correction dictionaries - Improving transcript accuracy through iterative learning **Documentation:** - Complete workflow guides in references/ - SQL query templates - Troubleshooting guide - Team collaboration patterns - API setup instructions **Marketplace updates:** - Updated marketplace to v1.8.0 - Added transcript-fixer plugin (category: productivity) - Updated README.md with skill description and use cases - Updated CLAUDE.md with skill listing and counts 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
395
transcript-fixer/references/file_formats.md
Normal file
395
transcript-fixer/references/file_formats.md
Normal file
@@ -0,0 +1,395 @@
|
||||
# Storage Format Reference
|
||||
|
||||
This document describes the SQLite database format used by transcript-fixer v2.0.
|
||||
|
||||
## Table of Contents
|
||||
|
||||
- [Database Location](#database-location)
|
||||
- [Database Schema](#database-schema)
|
||||
- [Core Tables](#core-tables)
|
||||
- [Views](#views)
|
||||
- [Querying the Database](#querying-the-database)
|
||||
- [Using Python API](#using-python-api)
|
||||
- [Using SQLite CLI](#using-sqlite-cli)
|
||||
- [Import/Export](#importexport)
|
||||
- [Export to JSON](#export-to-json)
|
||||
- [Import from JSON](#import-from-json)
|
||||
- [Backup Strategy](#backup-strategy)
|
||||
- [Automatic Backups](#automatic-backups)
|
||||
- [Manual Backups](#manual-backups)
|
||||
- [Version Control](#version-control)
|
||||
- [Best Practices](#best-practices)
|
||||
- [Troubleshooting](#troubleshooting)
|
||||
- [Database Locked](#database-locked)
|
||||
- [Corrupted Database](#corrupted-database)
|
||||
- [Missing Tables](#missing-tables)
|
||||
|
||||
## Database Location
|
||||
|
||||
**Path**: `~/.transcript-fixer/corrections.db`
|
||||
|
||||
**Type**: SQLite 3 database with ACID guarantees
|
||||
|
||||
## Database Schema
|
||||
|
||||
### Core Tables
|
||||
|
||||
#### corrections
|
||||
|
||||
Main correction dictionary storage.
|
||||
|
||||
| Column | Type | Constraints | Description |
|
||||
|--------|------|-------------|-------------|
|
||||
| id | INTEGER | PRIMARY KEY | Auto-increment ID |
|
||||
| from_text | TEXT | NOT NULL | Original (incorrect) text |
|
||||
| to_text | TEXT | NOT NULL | Corrected text |
|
||||
| domain | TEXT | DEFAULT 'general' | Correction domain |
|
||||
| source | TEXT | CHECK IN ('manual', 'learned', 'imported') | Origin of correction |
|
||||
| confidence | REAL | CHECK 0.0-1.0 | Confidence score |
|
||||
| added_by | TEXT | | User who added |
|
||||
| added_at | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | When added |
|
||||
| usage_count | INTEGER | DEFAULT 0, CHECK >= 0 | Times used |
|
||||
| last_used | TIMESTAMP | | Last usage time |
|
||||
| notes | TEXT | | Optional notes |
|
||||
| is_active | BOOLEAN | DEFAULT 1 | Soft delete flag |
|
||||
|
||||
**Unique Constraint**: `(from_text, domain)`
|
||||
|
||||
**Indexes**: domain, source, added_at, is_active, from_text
|
||||
|
||||
#### context_rules
|
||||
|
||||
Regex-based context-aware correction rules.
|
||||
|
||||
| Column | Type | Constraints | Description |
|
||||
|--------|------|-------------|-------------|
|
||||
| id | INTEGER | PRIMARY KEY | Auto-increment ID |
|
||||
| pattern | TEXT | NOT NULL, UNIQUE | Regex pattern |
|
||||
| replacement | TEXT | NOT NULL | Replacement text |
|
||||
| description | TEXT | | Rule explanation |
|
||||
| priority | INTEGER | DEFAULT 0 | Higher = applied first |
|
||||
| is_active | BOOLEAN | DEFAULT 1 | Enable/disable |
|
||||
| added_at | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | When added |
|
||||
| added_by | TEXT | | User who added |
|
||||
|
||||
**Indexes**: priority (DESC), is_active
|
||||
|
||||
#### correction_history
|
||||
|
||||
Audit log for all correction runs.
|
||||
|
||||
| Column | Type | Constraints | Description |
|
||||
|--------|------|-------------|-------------|
|
||||
| id | INTEGER | PRIMARY KEY | Auto-increment ID |
|
||||
| filename | TEXT | NOT NULL | File corrected |
|
||||
| domain | TEXT | NOT NULL | Domain used |
|
||||
| run_timestamp | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | When run |
|
||||
| original_length | INTEGER | CHECK >= 0 | Original file size |
|
||||
| stage1_changes | INTEGER | CHECK >= 0 | Dictionary changes |
|
||||
| stage2_changes | INTEGER | CHECK >= 0 | AI changes |
|
||||
| model | TEXT | | AI model used |
|
||||
| execution_time_ms | INTEGER | | Runtime in ms |
|
||||
| success | BOOLEAN | DEFAULT 1 | Success flag |
|
||||
| error_message | TEXT | | Error if failed |
|
||||
|
||||
**Indexes**: run_timestamp (DESC), domain, success
|
||||
|
||||
#### correction_changes
|
||||
|
||||
Detailed changes made in each run.
|
||||
|
||||
| Column | Type | Constraints | Description |
|
||||
|--------|------|-------------|-------------|
|
||||
| id | INTEGER | PRIMARY KEY | Auto-increment ID |
|
||||
| history_id | INTEGER | FOREIGN KEY → correction_history | Parent run |
|
||||
| line_number | INTEGER | | Line in file |
|
||||
| from_text | TEXT | NOT NULL | Original text |
|
||||
| to_text | TEXT | NOT NULL | Corrected text |
|
||||
| rule_type | TEXT | CHECK IN ('context', 'dictionary', 'ai') | Rule type |
|
||||
| rule_id | INTEGER | | Reference to rule |
|
||||
| context_before | TEXT | | Text before |
|
||||
| context_after | TEXT | | Text after |
|
||||
|
||||
**Foreign Key**: history_id → correction_history.id (CASCADE DELETE)
|
||||
|
||||
**Indexes**: history_id, rule_type
|
||||
|
||||
#### learned_suggestions
|
||||
|
||||
AI-detected patterns pending review.
|
||||
|
||||
| Column | Type | Constraints | Description |
|
||||
|--------|------|-------------|-------------|
|
||||
| id | INTEGER | PRIMARY KEY | Auto-increment ID |
|
||||
| from_text | TEXT | NOT NULL | Pattern detected |
|
||||
| to_text | TEXT | NOT NULL | Suggested correction |
|
||||
| domain | TEXT | DEFAULT 'general' | Domain |
|
||||
| frequency | INTEGER | CHECK > 0 | Times seen |
|
||||
| confidence | REAL | CHECK 0.0-1.0 | Confidence score |
|
||||
| first_seen | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | First occurrence |
|
||||
| last_seen | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | Last occurrence |
|
||||
| status | TEXT | CHECK IN ('pending', 'approved', 'rejected') | Review status |
|
||||
| reviewed_at | TIMESTAMP | | When reviewed |
|
||||
| reviewed_by | TEXT | | Who reviewed |
|
||||
|
||||
**Unique Constraint**: `(from_text, to_text, domain)`
|
||||
|
||||
**Indexes**: status, domain, confidence (DESC), frequency (DESC)
|
||||
|
||||
#### suggestion_examples
|
||||
|
||||
Example occurrences of learned patterns.
|
||||
|
||||
| Column | Type | Constraints | Description |
|
||||
|--------|------|-------------|-------------|
|
||||
| id | INTEGER | PRIMARY KEY | Auto-increment ID |
|
||||
| suggestion_id | INTEGER | FOREIGN KEY → learned_suggestions | Parent suggestion |
|
||||
| filename | TEXT | NOT NULL | File where found |
|
||||
| line_number | INTEGER | | Line number |
|
||||
| context | TEXT | NOT NULL | Surrounding text |
|
||||
| occurred_at | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | When found |
|
||||
|
||||
**Foreign Key**: suggestion_id → learned_suggestions.id (CASCADE DELETE)
|
||||
|
||||
**Index**: suggestion_id
|
||||
|
||||
#### system_config
|
||||
|
||||
System configuration key-value store.
|
||||
|
||||
| Column | Type | Constraints | Description |
|
||||
|--------|------|-------------|-------------|
|
||||
| key | TEXT | PRIMARY KEY | Config key |
|
||||
| value | TEXT | NOT NULL | Config value |
|
||||
| value_type | TEXT | CHECK IN ('string', 'int', 'float', 'boolean', 'json') | Value type |
|
||||
| description | TEXT | | Config description |
|
||||
| updated_at | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | Last update |
|
||||
|
||||
**Default Values**:
|
||||
- `schema_version`: "2.0"
|
||||
- `api_provider`: "GLM"
|
||||
- `api_model`: "GLM-4.6"
|
||||
- `default_domain`: "general"
|
||||
- `auto_learn_enabled`: "true"
|
||||
- `learning_frequency_threshold`: "3"
|
||||
- `learning_confidence_threshold`: "0.8"
|
||||
|
||||
#### audit_log
|
||||
|
||||
Comprehensive audit trail for all operations.
|
||||
|
||||
| Column | Type | Constraints | Description |
|
||||
|--------|------|-------------|-------------|
|
||||
| id | INTEGER | PRIMARY KEY | Auto-increment ID |
|
||||
| timestamp | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | When occurred |
|
||||
| action | TEXT | NOT NULL | Action type |
|
||||
| entity_type | TEXT | NOT NULL | Entity affected |
|
||||
| entity_id | INTEGER | | Entity ID |
|
||||
| user | TEXT | | User who performed |
|
||||
| details | TEXT | | Action details |
|
||||
| success | BOOLEAN | DEFAULT 1 | Success flag |
|
||||
| error_message | TEXT | | Error if failed |
|
||||
|
||||
**Indexes**: timestamp (DESC), action, entity_type, success
|
||||
|
||||
### Views
|
||||
|
||||
#### active_corrections
|
||||
|
||||
Quick access to active corrections.
|
||||
|
||||
```sql
|
||||
SELECT id, from_text, to_text, domain, source, confidence, usage_count, last_used, added_at
|
||||
FROM corrections
|
||||
WHERE is_active = 1
|
||||
ORDER BY domain, from_text;
|
||||
```
|
||||
|
||||
#### pending_suggestions
|
||||
|
||||
Suggestions pending review with example count.
|
||||
|
||||
```sql
|
||||
SELECT s.id, s.from_text, s.to_text, s.domain, s.frequency, s.confidence,
|
||||
s.first_seen, s.last_seen, COUNT(e.id) as example_count
|
||||
FROM learned_suggestions s
|
||||
LEFT JOIN suggestion_examples e ON s.id = e.suggestion_id
|
||||
WHERE s.status = 'pending'
|
||||
GROUP BY s.id
|
||||
ORDER BY s.confidence DESC, s.frequency DESC;
|
||||
```
|
||||
|
||||
#### correction_statistics
|
||||
|
||||
Statistics per domain.
|
||||
|
||||
```sql
|
||||
SELECT domain,
|
||||
COUNT(*) as total_corrections,
|
||||
COUNT(CASE WHEN source = 'manual' THEN 1 END) as manual_count,
|
||||
COUNT(CASE WHEN source = 'learned' THEN 1 END) as learned_count,
|
||||
COUNT(CASE WHEN source = 'imported' THEN 1 END) as imported_count,
|
||||
SUM(usage_count) as total_usage,
|
||||
MAX(added_at) as last_updated
|
||||
FROM corrections
|
||||
WHERE is_active = 1
|
||||
GROUP BY domain;
|
||||
```
|
||||
|
||||
## Querying the Database
|
||||
|
||||
### Using Python API
|
||||
|
||||
```python
|
||||
from pathlib import Path
|
||||
from core import CorrectionRepository, CorrectionService
|
||||
|
||||
# Initialize
|
||||
db_path = Path.home() / ".transcript-fixer" / "corrections.db"
|
||||
repository = CorrectionRepository(db_path)
|
||||
service = CorrectionService(repository)
|
||||
|
||||
# Add correction
|
||||
service.add_correction("错误", "正确", domain="general")
|
||||
|
||||
# Get corrections
|
||||
corrections = service.get_corrections(domain="general")
|
||||
|
||||
# Get statistics
|
||||
stats = service.get_statistics(domain="general")
|
||||
print(f"Total: {stats['total_corrections']}")
|
||||
|
||||
# Close
|
||||
service.close()
|
||||
```
|
||||
|
||||
### Using SQLite CLI
|
||||
|
||||
```bash
|
||||
# Open database
|
||||
sqlite3 ~/.transcript-fixer/corrections.db
|
||||
|
||||
# View active corrections
|
||||
SELECT from_text, to_text, domain FROM active_corrections;
|
||||
|
||||
# View statistics
|
||||
SELECT * FROM correction_statistics;
|
||||
|
||||
# View pending suggestions
|
||||
SELECT * FROM pending_suggestions;
|
||||
|
||||
# Check schema version
|
||||
SELECT value FROM system_config WHERE key = 'schema_version';
|
||||
```
|
||||
|
||||
## Import/Export
|
||||
|
||||
### Export to JSON
|
||||
|
||||
```python
|
||||
service = _get_service()
|
||||
corrections = service.export_corrections(domain="general")
|
||||
|
||||
# Write to file
|
||||
import json
|
||||
with open("export.json", "w", encoding="utf-8") as f:
|
||||
json.dump({
|
||||
"version": "2.0",
|
||||
"domain": "general",
|
||||
"corrections": corrections
|
||||
}, f, ensure_ascii=False, indent=2)
|
||||
```
|
||||
|
||||
### Import from JSON
|
||||
|
||||
```python
|
||||
import json
|
||||
|
||||
with open("import.json", "r", encoding="utf-8") as f:
|
||||
data = json.load(f)
|
||||
|
||||
service = _get_service()
|
||||
inserted, updated, skipped = service.import_corrections(
|
||||
corrections=data["corrections"],
|
||||
domain=data.get("domain", "general"),
|
||||
merge=True,
|
||||
validate_all=True
|
||||
)
|
||||
|
||||
print(f"Imported: {inserted} new, {updated} updated, {skipped} skipped")
|
||||
```
|
||||
|
||||
## Backup Strategy
|
||||
|
||||
### Automatic Backups
|
||||
|
||||
The system maintains database integrity through SQLite's ACID guarantees and automatic journaling.
|
||||
|
||||
### Manual Backups
|
||||
|
||||
```bash
|
||||
# Backup database
|
||||
cp ~/.transcript-fixer/corrections.db ~/backups/corrections_$(date +%Y%m%d).db
|
||||
|
||||
# Or use SQLite backup
|
||||
sqlite3 ~/.transcript-fixer/corrections.db ".backup ~/backups/corrections.db"
|
||||
```
|
||||
|
||||
### Version Control
|
||||
|
||||
**Recommended**: Use Git for configuration and export files, but NOT for the database:
|
||||
|
||||
```bash
|
||||
# .gitignore
|
||||
*.db
|
||||
*.db-journal
|
||||
*.bak
|
||||
```
|
||||
|
||||
Instead, export corrections periodically:
|
||||
|
||||
```bash
|
||||
python scripts/fix_transcription.py --export-json corrections_backup.json
|
||||
git add corrections_backup.json
|
||||
git commit -m "Backup corrections"
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Regular Exports**: Export to JSON weekly for team sharing
|
||||
2. **Database Backups**: Backup `.db` file before major changes
|
||||
3. **Use Transactions**: All modifications use ACID transactions automatically
|
||||
4. **Soft Deletes**: Records are marked inactive, not deleted (preserves audit trail)
|
||||
5. **Validate**: Run `--validate` after manual database changes
|
||||
6. **Statistics**: Check usage patterns via `correction_statistics` view
|
||||
7. **Cleanup**: Old history can be archived (query by `run_timestamp`)
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Database Locked
|
||||
|
||||
```bash
|
||||
# Check for lingering connections
|
||||
lsof ~/.transcript-fixer/corrections.db
|
||||
|
||||
# If needed, backup and recreate
|
||||
cp corrections.db corrections_backup.db
|
||||
sqlite3 corrections.db "VACUUM;"
|
||||
```
|
||||
|
||||
### Corrupted Database
|
||||
|
||||
```bash
|
||||
# Check integrity
|
||||
sqlite3 corrections.db "PRAGMA integrity_check;"
|
||||
|
||||
# Recover if possible
|
||||
sqlite3 corrections.db ".recover" | sqlite3 corrections_new.db
|
||||
```
|
||||
|
||||
### Missing Tables
|
||||
|
||||
```bash
|
||||
# Reinitialize schema (safe, uses IF NOT EXISTS)
|
||||
python -c "from core import CorrectionRepository; from pathlib import Path; CorrectionRepository(Path.home() / '.transcript-fixer' / 'corrections.db')"
|
||||
```
|
||||
Reference in New Issue
Block a user