daymade bd0aa12004 Release v1.8.0: Add transcript-fixer skill
## New Skill: transcript-fixer v1.0.0

Corrects speech-to-text (ASR/STT) transcription errors through a dictionary-based rules stage and an AI-powered correction stage, with automatic pattern learning.

**Features:**
- Two-stage correction pipeline (dictionary + AI)
- Automatic pattern detection and learning
- Domain-specific dictionaries (general, embodied_ai, finance, medical)
- SQLite-based correction repository
- Team collaboration with import/export
- GLM API integration for AI corrections
- Cost optimization through dictionary promotion

**Use cases:**
- Correcting meeting notes, lecture recordings, or interview transcripts
- Fixing Chinese/English homophone errors and technical terminology
- Building domain-specific correction dictionaries
- Improving transcript accuracy through iterative learning

**Documentation:**
- Complete workflow guides in references/
- SQL query templates
- Troubleshooting guide
- Team collaboration patterns
- API setup instructions

**Marketplace updates:**
- Updated marketplace to v1.8.0
- Added transcript-fixer plugin (category: productivity)
- Updated README.md with skill description and use cases
- Updated CLAUDE.md with skill listing and counts

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-28 13:16:37 +08:00


# Storage Format Reference
This document describes the SQLite database format used by transcript-fixer (schema version 2.0).
## Table of Contents
- [Database Location](#database-location)
- [Database Schema](#database-schema)
- [Core Tables](#core-tables)
- [Views](#views)
- [Querying the Database](#querying-the-database)
- [Using Python API](#using-python-api)
- [Using SQLite CLI](#using-sqlite-cli)
- [Import/Export](#importexport)
- [Export to JSON](#export-to-json)
- [Import from JSON](#import-from-json)
- [Backup Strategy](#backup-strategy)
- [Automatic Backups](#automatic-backups)
- [Manual Backups](#manual-backups)
- [Version Control](#version-control)
- [Best Practices](#best-practices)
- [Troubleshooting](#troubleshooting)
- [Database Locked](#database-locked)
- [Corrupted Database](#corrupted-database)
- [Missing Tables](#missing-tables)
## Database Location
**Path**: `~/.transcript-fixer/corrections.db`
**Type**: SQLite 3 database with ACID guarantees
## Database Schema
### Core Tables
#### corrections
Main correction dictionary storage.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| id | INTEGER | PRIMARY KEY | Auto-increment ID |
| from_text | TEXT | NOT NULL | Original (incorrect) text |
| to_text | TEXT | NOT NULL | Corrected text |
| domain | TEXT | DEFAULT 'general' | Correction domain |
| source | TEXT | CHECK IN ('manual', 'learned', 'imported') | Origin of correction |
| confidence | REAL | CHECK 0.0-1.0 | Confidence score |
| added_by | TEXT | | User who added |
| added_at | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | When added |
| usage_count | INTEGER | DEFAULT 0, CHECK >= 0 | Times used |
| last_used | TIMESTAMP | | Last usage time |
| notes | TEXT | | Optional notes |
| is_active | BOOLEAN | DEFAULT 1 | Soft delete flag |
**Unique Constraint**: `(from_text, domain)`
**Indexes**: domain, source, added_at, is_active, from_text
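The column table above implies DDL along the following lines. This is a sketch reconstructed from the listed constraints, not the skill's actual shipped schema; it also shows the `(from_text, domain)` unique constraint rejecting a duplicate entry:

```python
import sqlite3

# Reconstructed from the column table above; the real DDL may differ in detail.
DDL = """
CREATE TABLE IF NOT EXISTS corrections (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    from_text   TEXT NOT NULL,
    to_text     TEXT NOT NULL,
    domain      TEXT DEFAULT 'general',
    source      TEXT CHECK (source IN ('manual', 'learned', 'imported')),
    confidence  REAL CHECK (confidence >= 0.0 AND confidence <= 1.0),
    added_by    TEXT,
    added_at    TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    usage_count INTEGER DEFAULT 0 CHECK (usage_count >= 0),
    last_used   TIMESTAMP,
    notes       TEXT,
    is_active   BOOLEAN DEFAULT 1,
    UNIQUE (from_text, domain)
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(DDL)
conn.execute("INSERT INTO corrections (from_text, to_text) VALUES ('teh', 'the')")
try:
    # Same from_text in the same (default) domain violates UNIQUE(from_text, domain)
    conn.execute("INSERT INTO corrections (from_text, to_text) VALUES ('teh', 'they')")
except sqlite3.IntegrityError:
    print("duplicate (from_text, domain) rejected")
```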
#### context_rules
Regex-based context-aware correction rules.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| id | INTEGER | PRIMARY KEY | Auto-increment ID |
| pattern | TEXT | NOT NULL, UNIQUE | Regex pattern |
| replacement | TEXT | NOT NULL | Replacement text |
| description | TEXT | | Rule explanation |
| priority | INTEGER | DEFAULT 0 | Higher = applied first |
| is_active | BOOLEAN | DEFAULT 1 | Enable/disable |
| added_at | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | When added |
| added_by | TEXT | | User who added |
**Indexes**: priority (DESC), is_active
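Priority only matters when rules overlap; the semantics described above (higher priority applied first) can be sketched as follows. The rules shown are hypothetical examples, and the actual matching logic lives in the skill's own engine:

```python
import re

# Hypothetical (pattern, replacement, priority) rules mirroring the
# context_rules columns above.
rules = [
    (r"\bGLM 4\b", "GLM-4", 10),  # specific fix, runs first
    (r"\bglm\b", "GLM", 0),       # generic casing fix, runs last
]

def apply_context_rules(text: str, rules) -> str:
    # Higher priority = applied first, per the priority column's semantics.
    for pattern, replacement, _priority in sorted(rules, key=lambda r: -r[2]):
        text = re.sub(pattern, replacement, text)
    return text

print(apply_context_rules("the glm model, GLM 4", rules))
# -> the GLM model, GLM-4
```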
#### correction_history
Audit log for all correction runs.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| id | INTEGER | PRIMARY KEY | Auto-increment ID |
| filename | TEXT | NOT NULL | File corrected |
| domain | TEXT | NOT NULL | Domain used |
| run_timestamp | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | When run |
| original_length | INTEGER | CHECK >= 0 | Original file size |
| stage1_changes | INTEGER | CHECK >= 0 | Dictionary changes |
| stage2_changes | INTEGER | CHECK >= 0 | AI changes |
| model | TEXT | | AI model used |
| execution_time_ms | INTEGER | | Runtime in ms |
| success | BOOLEAN | DEFAULT 1 | Success flag |
| error_message | TEXT | | Error if failed |
**Indexes**: run_timestamp (DESC), domain, success
#### correction_changes
Detailed changes made in each run.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| id | INTEGER | PRIMARY KEY | Auto-increment ID |
| history_id | INTEGER | FOREIGN KEY → correction_history | Parent run |
| line_number | INTEGER | | Line in file |
| from_text | TEXT | NOT NULL | Original text |
| to_text | TEXT | NOT NULL | Corrected text |
| rule_type | TEXT | CHECK IN ('context', 'dictionary', 'ai') | Rule type |
| rule_id | INTEGER | | Reference to rule |
| context_before | TEXT | | Text before |
| context_after | TEXT | | Text after |
**Foreign Key**: history_id → correction_history.id (CASCADE DELETE)
**Indexes**: history_id, rule_type
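One subtlety with the cascade: SQLite does not enforce foreign keys unless `PRAGMA foreign_keys = ON` is set on each connection. A minimal sketch (tables simplified) of how deleting a run removes its change rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
conn.execute("CREATE TABLE correction_history (id INTEGER PRIMARY KEY, filename TEXT NOT NULL)")
conn.execute("""
CREATE TABLE correction_changes (
    id INTEGER PRIMARY KEY,
    history_id INTEGER REFERENCES correction_history(id) ON DELETE CASCADE,
    from_text TEXT NOT NULL,
    to_text TEXT NOT NULL
)
""")
conn.execute("INSERT INTO correction_history (id, filename) VALUES (1, 'meeting.md')")
conn.execute("INSERT INTO correction_changes (history_id, from_text, to_text) VALUES (1, 'teh', 'the')")

# Deleting the parent run cascades to its detail rows:
conn.execute("DELETE FROM correction_history WHERE id = 1")
print(conn.execute("SELECT COUNT(*) FROM correction_changes").fetchone()[0])  # -> 0
```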
#### learned_suggestions
AI-detected patterns pending review.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| id | INTEGER | PRIMARY KEY | Auto-increment ID |
| from_text | TEXT | NOT NULL | Pattern detected |
| to_text | TEXT | NOT NULL | Suggested correction |
| domain | TEXT | DEFAULT 'general' | Domain |
| frequency | INTEGER | CHECK > 0 | Times seen |
| confidence | REAL | CHECK 0.0-1.0 | Confidence score |
| first_seen | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | First occurrence |
| last_seen | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | Last occurrence |
| status | TEXT | CHECK IN ('pending', 'approved', 'rejected') | Review status |
| reviewed_at | TIMESTAMP | | When reviewed |
| reviewed_by | TEXT | | Who reviewed |
**Unique Constraint**: `(from_text, to_text, domain)`
**Indexes**: status, domain, confidence (DESC), frequency (DESC)
#### suggestion_examples
Example occurrences of learned patterns.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| id | INTEGER | PRIMARY KEY | Auto-increment ID |
| suggestion_id | INTEGER | FOREIGN KEY → learned_suggestions | Parent suggestion |
| filename | TEXT | NOT NULL | File where found |
| line_number | INTEGER | | Line number |
| context | TEXT | NOT NULL | Surrounding text |
| occurred_at | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | When found |
**Foreign Key**: suggestion_id → learned_suggestions.id (CASCADE DELETE)
**Index**: suggestion_id
#### system_config
System configuration key-value store.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| key | TEXT | PRIMARY KEY | Config key |
| value | TEXT | NOT NULL | Config value |
| value_type | TEXT | CHECK IN ('string', 'int', 'float', 'boolean', 'json') | Value type |
| description | TEXT | | Config description |
| updated_at | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | Last update |
**Default Values**:
- `schema_version`: "2.0"
- `api_provider`: "GLM"
- `api_model`: "GLM-4.6"
- `default_domain`: "general"
- `auto_learn_enabled`: "true"
- `learning_frequency_threshold`: "3"
- `learning_confidence_threshold`: "0.8"
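Since every value is stored as TEXT, readers should coerce by `value_type`. The helper below is a hypothetical sketch of such an accessor (the skill's own config API may differ), demonstrated against an in-memory copy of the table:

```python
import json
import sqlite3

def read_config(conn: sqlite3.Connection, key: str):
    """Coerce a system_config value according to its declared value_type."""
    row = conn.execute(
        "SELECT value, value_type FROM system_config WHERE key = ?", (key,)
    ).fetchone()
    if row is None:
        return None
    value, value_type = row
    casts = {
        "string": str,
        "int": int,
        "float": float,
        "boolean": lambda v: v.lower() in ("true", "1"),
        "json": json.loads,
    }
    return casts[value_type](value)

# Demo against an in-memory copy of the table:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE system_config (key TEXT PRIMARY KEY, value TEXT NOT NULL, value_type TEXT)")
conn.executemany(
    "INSERT INTO system_config VALUES (?, ?, ?)",
    [("learning_frequency_threshold", "3", "int"),
     ("auto_learn_enabled", "true", "boolean")],
)
print(read_config(conn, "learning_frequency_threshold"))  # -> 3
```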
#### audit_log
Comprehensive audit trail for all operations.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| id | INTEGER | PRIMARY KEY | Auto-increment ID |
| timestamp | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | When occurred |
| action | TEXT | NOT NULL | Action type |
| entity_type | TEXT | NOT NULL | Entity affected |
| entity_id | INTEGER | | Entity ID |
| user | TEXT | | User who performed |
| details | TEXT | | Action details |
| success | BOOLEAN | DEFAULT 1 | Success flag |
| error_message | TEXT | | Error if failed |
**Indexes**: timestamp (DESC), action, entity_type, success
### Views
#### active_corrections
Quick access to active corrections.
```sql
SELECT id, from_text, to_text, domain, source, confidence, usage_count, last_used, added_at
FROM corrections
WHERE is_active = 1
ORDER BY domain, from_text;
```
#### pending_suggestions
Suggestions pending review with example count.
```sql
SELECT s.id, s.from_text, s.to_text, s.domain, s.frequency, s.confidence,
       s.first_seen, s.last_seen, COUNT(e.id) as example_count
FROM learned_suggestions s
LEFT JOIN suggestion_examples e ON s.id = e.suggestion_id
WHERE s.status = 'pending'
GROUP BY s.id
ORDER BY s.confidence DESC, s.frequency DESC;
```
#### correction_statistics
Statistics per domain.
```sql
SELECT domain,
       COUNT(*) as total_corrections,
       COUNT(CASE WHEN source = 'manual' THEN 1 END) as manual_count,
       COUNT(CASE WHEN source = 'learned' THEN 1 END) as learned_count,
       COUNT(CASE WHEN source = 'imported' THEN 1 END) as imported_count,
       SUM(usage_count) as total_usage,
       MAX(added_at) as last_updated
FROM corrections
WHERE is_active = 1
GROUP BY domain;
```
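The views behave like read-only tables and already fold in the soft-delete filter. A quick in-memory check of `active_corrections` (the base table is reduced here to the columns the view needs):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE corrections (
    id INTEGER PRIMARY KEY, from_text TEXT, to_text TEXT,
    domain TEXT DEFAULT 'general', source TEXT, confidence REAL,
    usage_count INTEGER DEFAULT 0, last_used TIMESTAMP,
    added_at TIMESTAMP, is_active BOOLEAN DEFAULT 1)""")
conn.execute("""CREATE VIEW active_corrections AS
    SELECT id, from_text, to_text, domain, source, confidence,
           usage_count, last_used, added_at
    FROM corrections WHERE is_active = 1
    ORDER BY domain, from_text""")
conn.execute("INSERT INTO corrections (from_text, to_text) VALUES ('teh', 'the')")
# Soft-deleted rows stay in the table but drop out of the view:
conn.execute("INSERT INTO corrections (from_text, to_text, is_active) VALUES ('adn', 'and', 0)")
rows = conn.execute("SELECT from_text, to_text FROM active_corrections").fetchall()
print(rows)  # -> [('teh', 'the')]
```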
## Querying the Database
### Using Python API
```python
from pathlib import Path
from core import CorrectionRepository, CorrectionService

# Initialize
db_path = Path.home() / ".transcript-fixer" / "corrections.db"
repository = CorrectionRepository(db_path)
service = CorrectionService(repository)

# Add a correction ("错误" -> "正确", i.e. "wrong" -> "correct")
service.add_correction("错误", "正确", domain="general")

# Get corrections
corrections = service.get_corrections(domain="general")

# Get statistics
stats = service.get_statistics(domain="general")
print(f"Total: {stats['total_corrections']}")

# Close
service.close()
```
### Using SQLite CLI
```bash
# Open an interactive shell (the queries below are typed at the sqlite> prompt)
sqlite3 ~/.transcript-fixer/corrections.db

# Or run each query non-interactively:
sqlite3 ~/.transcript-fixer/corrections.db "SELECT from_text, to_text, domain FROM active_corrections;"

# View statistics
sqlite3 ~/.transcript-fixer/corrections.db "SELECT * FROM correction_statistics;"

# View pending suggestions
sqlite3 ~/.transcript-fixer/corrections.db "SELECT * FROM pending_suggestions;"

# Check schema version
sqlite3 ~/.transcript-fixer/corrections.db "SELECT value FROM system_config WHERE key = 'schema_version';"
```
## Import/Export
### Export to JSON
```python
service = _get_service()
corrections = service.export_corrections(domain="general")

# Write to file
import json

with open("export.json", "w", encoding="utf-8") as f:
    json.dump({
        "version": "2.0",
        "domain": "general",
        "corrections": corrections,
    }, f, ensure_ascii=False, indent=2)
```
### Import from JSON
```python
import json

with open("import.json", "r", encoding="utf-8") as f:
    data = json.load(f)

service = _get_service()
inserted, updated, skipped = service.import_corrections(
    corrections=data["corrections"],
    domain=data.get("domain", "general"),
    merge=True,
    validate_all=True,
)
print(f"Imported: {inserted} new, {updated} updated, {skipped} skipped")
```
## Backup Strategy
### Automatic Backups
SQLite's ACID transactions and automatic journaling protect the database against corruption from crashes and interrupted writes. Note that this guards integrity, not against data loss: keep separate backup copies as described below.
### Manual Backups
```bash
# Backup database
cp ~/.transcript-fixer/corrections.db ~/backups/corrections_$(date +%Y%m%d).db
# Or use SQLite's online backup command (use $HOME rather than ~, since the
# shell does not expand ~ inside the quoted dot-command)
sqlite3 ~/.transcript-fixer/corrections.db ".backup $HOME/backups/corrections.db"
```
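From Python, the same online copy is available through the standard library's `sqlite3.Connection.backup`, which produces a consistent snapshot even while the database is in use. The demo below uses in-memory databases; in practice, open `~/.transcript-fixer/corrections.db` as the source and a backup file path as the destination:

```python
import sqlite3

# Source is in-memory for the demo; in practice open the real database file.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE corrections (from_text TEXT, to_text TEXT)")
src.execute("INSERT INTO corrections VALUES ('teh', 'the')")
src.commit()

dst = sqlite3.connect(":memory:")  # in practice, the backup file path
src.backup(dst)  # consistent online copy, safe while the db is in use
print(dst.execute("SELECT COUNT(*) FROM corrections").fetchone()[0])  # -> 1
```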
### Version Control
**Recommended**: Use Git for configuration and export files, but NOT for the database:
```bash
# .gitignore
*.db
*.db-journal
*.bak
```
Instead, export corrections periodically:
```bash
python scripts/fix_transcription.py --export-json corrections_backup.json
git add corrections_backup.json
git commit -m "Backup corrections"
```
## Best Practices
1. **Regular Exports**: Export to JSON weekly for team sharing
2. **Database Backups**: Backup `.db` file before major changes
3. **Use Transactions**: All modifications use ACID transactions automatically
4. **Soft Deletes**: Records are marked inactive, not deleted (preserves audit trail)
5. **Validate**: Run `--validate` after manual database changes
6. **Statistics**: Check usage patterns via `correction_statistics` view
7. **Cleanup**: Old history can be archived (query by `run_timestamp`)
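Item 7 above can be sketched as a retention pass over `correction_history` (cutoff and table are reduced for the demo; with foreign keys enabled, the CASCADE on `correction_changes` removes detail rows too):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE correction_history (id INTEGER PRIMARY KEY, run_timestamp TIMESTAMP)")
conn.execute("INSERT INTO correction_history VALUES (1, '2023-01-01 00:00:00'), (2, '2025-10-01 00:00:00')")

cutoff = "2025-01-01 00:00:00"  # hypothetical retention boundary
old = conn.execute(
    "SELECT id FROM correction_history WHERE run_timestamp < ?", (cutoff,)
).fetchall()
# Archive `old` (e.g. export to JSON) before deleting, then:
conn.executemany("DELETE FROM correction_history WHERE id = ?", old)
print(conn.execute("SELECT COUNT(*) FROM correction_history").fetchone()[0])  # -> 1
```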
## Troubleshooting
### Database Locked
```bash
# Check for processes still holding the database open
lsof ~/.transcript-fixer/corrections.db

# If a stale lock persists, make a copy, then rebuild the file
cp ~/.transcript-fixer/corrections.db ~/.transcript-fixer/corrections_backup.db
sqlite3 ~/.transcript-fixer/corrections.db "VACUUM;"
```
### Corrupted Database
```bash
# Check integrity
sqlite3 corrections.db "PRAGMA integrity_check;"
# Recover if possible
sqlite3 corrections.db ".recover" | sqlite3 corrections_new.db
```
### Missing Tables
```bash
# Reinitialize schema (safe, uses IF NOT EXISTS)
python -c "from core import CorrectionRepository; from pathlib import Path; CorrectionRepository(Path.home() / '.transcript-fixer' / 'corrections.db')"
```