## New Skill: transcript-fixer v1.0.0 Correct speech-to-text (ASR/STT) transcription errors through dictionary-based rules and AI-powered corrections with automatic pattern learning. **Features:** - Two-stage correction pipeline (dictionary + AI) - Automatic pattern detection and learning - Domain-specific dictionaries (general, embodied_ai, finance, medical) - SQLite-based correction repository - Team collaboration with import/export - GLM API integration for AI corrections - Cost optimization through dictionary promotion **Use cases:** - Correcting meeting notes, lecture recordings, or interview transcripts - Fixing Chinese/English homophone errors and technical terminology - Building domain-specific correction dictionaries - Improving transcript accuracy through iterative learning **Documentation:** - Complete workflow guides in references/ - SQL query templates - Troubleshooting guide - Team collaboration patterns - API setup instructions **Marketplace updates:** - Updated marketplace to v1.8.0 - Added transcript-fixer plugin (category: productivity) - Updated README.md with skill description and use cases - Updated CLAUDE.md with skill listing and counts 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
396 lines
11 KiB
Markdown
396 lines
11 KiB
Markdown
# Storage Format Reference
|
|
|
|
This document describes the SQLite database format used by transcript-fixer v2.0.
|
|
|
|
## Table of Contents
|
|
|
|
- [Database Location](#database-location)
|
|
- [Database Schema](#database-schema)
|
|
- [Core Tables](#core-tables)
|
|
- [Views](#views)
|
|
- [Querying the Database](#querying-the-database)
|
|
- [Using Python API](#using-python-api)
|
|
- [Using SQLite CLI](#using-sqlite-cli)
|
|
- [Import/Export](#importexport)
|
|
- [Export to JSON](#export-to-json)
|
|
- [Import from JSON](#import-from-json)
|
|
- [Backup Strategy](#backup-strategy)
|
|
- [Automatic Backups](#automatic-backups)
|
|
- [Manual Backups](#manual-backups)
|
|
- [Version Control](#version-control)
|
|
- [Best Practices](#best-practices)
|
|
- [Troubleshooting](#troubleshooting)
|
|
- [Database Locked](#database-locked)
|
|
- [Corrupted Database](#corrupted-database)
|
|
- [Missing Tables](#missing-tables)
|
|
|
|
## Database Location
|
|
|
|
**Path**: `~/.transcript-fixer/corrections.db`
|
|
|
|
**Type**: SQLite 3 database with ACID guarantees
|
|
|
|
## Database Schema
|
|
|
|
### Core Tables
|
|
|
|
#### corrections
|
|
|
|
Main correction dictionary storage.
|
|
|
|
| Column | Type | Constraints | Description |
|
|
|--------|------|-------------|-------------|
|
|
| id | INTEGER | PRIMARY KEY | Auto-increment ID |
|
|
| from_text | TEXT | NOT NULL | Original (incorrect) text |
|
|
| to_text | TEXT | NOT NULL | Corrected text |
|
|
| domain | TEXT | DEFAULT 'general' | Correction domain |
|
|
| source | TEXT | CHECK IN ('manual', 'learned', 'imported') | Origin of correction |
|
|
| confidence | REAL | CHECK 0.0-1.0 | Confidence score |
|
|
| added_by | TEXT | | User who added |
|
|
| added_at | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | When added |
|
|
| usage_count | INTEGER | DEFAULT 0, CHECK >= 0 | Times used |
|
|
| last_used | TIMESTAMP | | Last usage time |
|
|
| notes | TEXT | | Optional notes |
|
|
| is_active | BOOLEAN | DEFAULT 1 | Soft delete flag |
|
|
|
|
**Unique Constraint**: `(from_text, domain)`
|
|
|
|
**Indexes**: domain, source, added_at, is_active, from_text
|
|
|
|
#### context_rules
|
|
|
|
Regex-based context-aware correction rules.
|
|
|
|
| Column | Type | Constraints | Description |
|
|
|--------|------|-------------|-------------|
|
|
| id | INTEGER | PRIMARY KEY | Auto-increment ID |
|
|
| pattern | TEXT | NOT NULL, UNIQUE | Regex pattern |
|
|
| replacement | TEXT | NOT NULL | Replacement text |
|
|
| description | TEXT | | Rule explanation |
|
|
| priority | INTEGER | DEFAULT 0 | Higher = applied first |
|
|
| is_active | BOOLEAN | DEFAULT 1 | Enable/disable |
|
|
| added_at | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | When added |
|
|
| added_by | TEXT | | User who added |
|
|
|
|
**Indexes**: priority (DESC), is_active
|
|
|
|
#### correction_history
|
|
|
|
Audit log for all correction runs.
|
|
|
|
| Column | Type | Constraints | Description |
|
|
|--------|------|-------------|-------------|
|
|
| id | INTEGER | PRIMARY KEY | Auto-increment ID |
|
|
| filename | TEXT | NOT NULL | File corrected |
|
|
| domain | TEXT | NOT NULL | Domain used |
|
|
| run_timestamp | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | When run |
|
|
| original_length | INTEGER | CHECK >= 0 | Original file size |
|
|
| stage1_changes | INTEGER | CHECK >= 0 | Dictionary changes |
|
|
| stage2_changes | INTEGER | CHECK >= 0 | AI changes |
|
|
| model | TEXT | | AI model used |
|
|
| execution_time_ms | INTEGER | | Runtime in ms |
|
|
| success | BOOLEAN | DEFAULT 1 | Success flag |
|
|
| error_message | TEXT | | Error if failed |
|
|
|
|
**Indexes**: run_timestamp (DESC), domain, success
|
|
|
|
#### correction_changes
|
|
|
|
Detailed changes made in each run.
|
|
|
|
| Column | Type | Constraints | Description |
|
|
|--------|------|-------------|-------------|
|
|
| id | INTEGER | PRIMARY KEY | Auto-increment ID |
|
|
| history_id | INTEGER | FOREIGN KEY → correction_history | Parent run |
|
|
| line_number | INTEGER | | Line in file |
|
|
| from_text | TEXT | NOT NULL | Original text |
|
|
| to_text | TEXT | NOT NULL | Corrected text |
|
|
| rule_type | TEXT | CHECK IN ('context', 'dictionary', 'ai') | Rule type |
|
|
| rule_id | INTEGER | | Reference to rule |
|
|
| context_before | TEXT | | Text before |
|
|
| context_after | TEXT | | Text after |
|
|
|
|
**Foreign Key**: history_id → correction_history.id (CASCADE DELETE)
|
|
|
|
**Indexes**: history_id, rule_type
|
|
|
|
#### learned_suggestions
|
|
|
|
AI-detected patterns pending review.
|
|
|
|
| Column | Type | Constraints | Description |
|
|
|--------|------|-------------|-------------|
|
|
| id | INTEGER | PRIMARY KEY | Auto-increment ID |
|
|
| from_text | TEXT | NOT NULL | Pattern detected |
|
|
| to_text | TEXT | NOT NULL | Suggested correction |
|
|
| domain | TEXT | DEFAULT 'general' | Domain |
|
|
| frequency | INTEGER | CHECK > 0 | Times seen |
|
|
| confidence | REAL | CHECK 0.0-1.0 | Confidence score |
|
|
| first_seen | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | First occurrence |
|
|
| last_seen | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | Last occurrence |
|
|
| status | TEXT | CHECK IN ('pending', 'approved', 'rejected') | Review status |
|
|
| reviewed_at | TIMESTAMP | | When reviewed |
|
|
| reviewed_by | TEXT | | Who reviewed |
|
|
|
|
**Unique Constraint**: `(from_text, to_text, domain)`
|
|
|
|
**Indexes**: status, domain, confidence (DESC), frequency (DESC)
|
|
|
|
#### suggestion_examples
|
|
|
|
Example occurrences of learned patterns.
|
|
|
|
| Column | Type | Constraints | Description |
|
|
|--------|------|-------------|-------------|
|
|
| id | INTEGER | PRIMARY KEY | Auto-increment ID |
|
|
| suggestion_id | INTEGER | FOREIGN KEY → learned_suggestions | Parent suggestion |
|
|
| filename | TEXT | NOT NULL | File where found |
|
|
| line_number | INTEGER | | Line number |
|
|
| context | TEXT | NOT NULL | Surrounding text |
|
|
| occurred_at | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | When found |
|
|
|
|
**Foreign Key**: suggestion_id → learned_suggestions.id (CASCADE DELETE)
|
|
|
|
**Index**: suggestion_id
|
|
|
|
#### system_config
|
|
|
|
System configuration key-value store.
|
|
|
|
| Column | Type | Constraints | Description |
|
|
|--------|------|-------------|-------------|
|
|
| key | TEXT | PRIMARY KEY | Config key |
|
|
| value | TEXT | NOT NULL | Config value |
|
|
| value_type | TEXT | CHECK IN ('string', 'int', 'float', 'boolean', 'json') | Value type |
|
|
| description | TEXT | | Config description |
|
|
| updated_at | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | Last update |
|
|
|
|
**Default Values**:
|
|
- `schema_version`: "2.0"
|
|
- `api_provider`: "GLM"
|
|
- `api_model`: "GLM-4.6"
|
|
- `default_domain`: "general"
|
|
- `auto_learn_enabled`: "true"
|
|
- `learning_frequency_threshold`: "3"
|
|
- `learning_confidence_threshold`: "0.8"
|
|
|
|
#### audit_log
|
|
|
|
Comprehensive audit trail for all operations.
|
|
|
|
| Column | Type | Constraints | Description |
|
|
|--------|------|-------------|-------------|
|
|
| id | INTEGER | PRIMARY KEY | Auto-increment ID |
|
|
| timestamp | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | When occurred |
|
|
| action | TEXT | NOT NULL | Action type |
|
|
| entity_type | TEXT | NOT NULL | Entity affected |
|
|
| entity_id | INTEGER | | Entity ID |
|
|
| user | TEXT | | User who performed |
|
|
| details | TEXT | | Action details |
|
|
| success | BOOLEAN | DEFAULT 1 | Success flag |
|
|
| error_message | TEXT | | Error if failed |
|
|
|
|
**Indexes**: timestamp (DESC), action, entity_type, success
|
|
|
|
### Views
|
|
|
|
#### active_corrections
|
|
|
|
Quick access to active corrections.
|
|
|
|
```sql
|
|
SELECT id, from_text, to_text, domain, source, confidence, usage_count, last_used, added_at
|
|
FROM corrections
|
|
WHERE is_active = 1
|
|
ORDER BY domain, from_text;
|
|
```
|
|
|
|
#### pending_suggestions
|
|
|
|
Suggestions pending review with example count.
|
|
|
|
```sql
|
|
SELECT s.id, s.from_text, s.to_text, s.domain, s.frequency, s.confidence,
|
|
s.first_seen, s.last_seen, COUNT(e.id) as example_count
|
|
FROM learned_suggestions s
|
|
LEFT JOIN suggestion_examples e ON s.id = e.suggestion_id
|
|
WHERE s.status = 'pending'
|
|
GROUP BY s.id
|
|
ORDER BY s.confidence DESC, s.frequency DESC;
|
|
```
|
|
|
|
#### correction_statistics
|
|
|
|
Statistics per domain.
|
|
|
|
```sql
|
|
SELECT domain,
|
|
COUNT(*) as total_corrections,
|
|
COUNT(CASE WHEN source = 'manual' THEN 1 END) as manual_count,
|
|
COUNT(CASE WHEN source = 'learned' THEN 1 END) as learned_count,
|
|
COUNT(CASE WHEN source = 'imported' THEN 1 END) as imported_count,
|
|
SUM(usage_count) as total_usage,
|
|
MAX(added_at) as last_updated
|
|
FROM corrections
|
|
WHERE is_active = 1
|
|
GROUP BY domain;
|
|
```
|
|
|
|
## Querying the Database
|
|
|
|
### Using Python API
|
|
|
|
```python
|
|
from pathlib import Path
|
|
from core import CorrectionRepository, CorrectionService
|
|
|
|
# Initialize
|
|
db_path = Path.home() / ".transcript-fixer" / "corrections.db"
|
|
repository = CorrectionRepository(db_path)
|
|
service = CorrectionService(repository)
|
|
|
|
# Add correction
|
|
service.add_correction("错误", "正确", domain="general")
|
|
|
|
# Get corrections
|
|
corrections = service.get_corrections(domain="general")
|
|
|
|
# Get statistics
|
|
stats = service.get_statistics(domain="general")
|
|
print(f"Total: {stats['total_corrections']}")
|
|
|
|
# Close
|
|
service.close()
|
|
```
|
|
|
|
### Using SQLite CLI
|
|
|
|
```bash
|
|
# Open database
|
|
sqlite3 ~/.transcript-fixer/corrections.db
|
|
|
|
# View active corrections
|
|
SELECT from_text, to_text, domain FROM active_corrections;
|
|
|
|
# View statistics
|
|
SELECT * FROM correction_statistics;
|
|
|
|
# View pending suggestions
|
|
SELECT * FROM pending_suggestions;
|
|
|
|
# Check schema version
|
|
SELECT value FROM system_config WHERE key = 'schema_version';
|
|
```
|
|
|
|
## Import/Export
|
|
|
|
### Export to JSON
|
|
|
|
```python
|
|
service = _get_service()
|
|
corrections = service.export_corrections(domain="general")
|
|
|
|
# Write to file
|
|
import json
|
|
with open("export.json", "w", encoding="utf-8") as f:
|
|
json.dump({
|
|
"version": "2.0",
|
|
"domain": "general",
|
|
"corrections": corrections
|
|
}, f, ensure_ascii=False, indent=2)
|
|
```
|
|
|
|
### Import from JSON
|
|
|
|
```python
|
|
import json
|
|
|
|
with open("import.json", "r", encoding="utf-8") as f:
|
|
data = json.load(f)
|
|
|
|
service = _get_service()
|
|
inserted, updated, skipped = service.import_corrections(
|
|
corrections=data["corrections"],
|
|
domain=data.get("domain", "general"),
|
|
merge=True,
|
|
validate_all=True
|
|
)
|
|
|
|
print(f"Imported: {inserted} new, {updated} updated, {skipped} skipped")
|
|
```
|
|
|
|
## Backup Strategy
|
|
|
|
### Automatic Backups
|
|
|
|
The system maintains database integrity through SQLite's ACID guarantees and automatic journaling.
|
|
|
|
### Manual Backups
|
|
|
|
```bash
|
|
# Backup database
|
|
cp ~/.transcript-fixer/corrections.db ~/backups/corrections_$(date +%Y%m%d).db
|
|
|
|
# Or use SQLite backup
|
|
sqlite3 ~/.transcript-fixer/corrections.db ".backup ~/backups/corrections.db"
|
|
```
|
|
|
|
### Version Control
|
|
|
|
**Recommended**: Use Git for configuration and export files, but NOT for the database:
|
|
|
|
```bash
|
|
# .gitignore
|
|
*.db
|
|
*.db-journal
|
|
*.bak
|
|
```
|
|
|
|
Instead, export corrections periodically:
|
|
|
|
```bash
|
|
python scripts/fix_transcription.py --export-json corrections_backup.json
|
|
git add corrections_backup.json
|
|
git commit -m "Backup corrections"
|
|
```
|
|
|
|
## Best Practices
|
|
|
|
1. **Regular Exports**: Export to JSON weekly for team sharing
|
|
2. **Database Backups**: Backup `.db` file before major changes
|
|
3. **Use Transactions**: All modifications use ACID transactions automatically
|
|
4. **Soft Deletes**: Records are marked inactive, not deleted (preserves audit trail)
|
|
5. **Validate**: Run `--validate` after manual database changes
|
|
6. **Statistics**: Check usage patterns via `correction_statistics` view
|
|
7. **Cleanup**: Old history can be archived (query by `run_timestamp`)
|
|
|
|
## Troubleshooting
|
|
|
|
### Database Locked
|
|
|
|
```bash
|
|
# Check for lingering connections
|
|
lsof ~/.transcript-fixer/corrections.db
|
|
|
|
# If needed, backup and recreate
|
|
cp corrections.db corrections_backup.db
|
|
sqlite3 corrections.db "VACUUM;"
|
|
```
|
|
|
|
### Corrupted Database
|
|
|
|
```bash
|
|
# Check integrity
|
|
sqlite3 corrections.db "PRAGMA integrity_check;"
|
|
|
|
# Recover if possible
|
|
sqlite3 corrections.db ".recover" | sqlite3 corrections_new.db
|
|
```
|
|
|
|
### Missing Tables
|
|
|
|
```bash
|
|
# Reinitialize schema (safe, uses IF NOT EXISTS)
|
|
python -c "from core import CorrectionRepository; from pathlib import Path; CorrectionRepository(Path.home() / '.transcript-fixer' / 'corrections.db')"
|
|
```
|