Release v1.8.0: Add transcript-fixer skill

## New Skill: transcript-fixer v1.0.0

Correct speech-to-text (ASR/STT) transcription errors through dictionary-based rules and AI-powered corrections with automatic pattern learning.

**Features:**
- Two-stage correction pipeline (dictionary + AI)
- Automatic pattern detection and learning
- Domain-specific dictionaries (general, embodied_ai, finance, medical)
- SQLite-based correction repository
- Team collaboration with import/export
- GLM API integration for AI corrections
- Cost optimization through dictionary promotion

**Use cases:**
- Correcting meeting notes, lecture recordings, or interview transcripts
- Fixing Chinese/English homophone errors and technical terminology
- Building domain-specific correction dictionaries
- Improving transcript accuracy through iterative learning

**Documentation:**
- Complete workflow guides in references/
- SQL query templates
- Troubleshooting guide
- Team collaboration patterns
- API setup instructions

**Marketplace updates:**
- Updated marketplace to v1.8.0
- Added transcript-fixer plugin (category: productivity)
- Updated README.md with skill description and use cases
- Updated CLAUDE.md with skill listing and counts

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
daymade
2025-10-28 13:16:37 +08:00
parent d1041ac203
commit bd0aa12004
44 changed files with 7432 additions and 8 deletions

View File

@@ -0,0 +1,848 @@
# Architecture Reference
Technical implementation details of the transcript-fixer system.
## Table of Contents
- [Module Structure](#module-structure)
- [Design Principles](#design-principles)
- [SOLID Compliance](#solid-compliance)
- [File Length Limits](#file-length-limits)
- [Module Architecture](#module-architecture)
- [Layer Diagram](#layer-diagram)
- [Correction Workflow](#correction-workflow)
- [Learning Cycle](#learning-cycle)
- [Data Flow](#data-flow)
- [SQLite Architecture (v2.0)](#sqlite-architecture-v20)
- [Two-Layer Data Access](#two-layer-data-access-simplified)
- [Database Schema](#database-schema-schemasql)
- [ACID Guarantees](#acid-guarantees)
- [Thread Safety](#thread-safety)
- [Migration from JSON](#migration-from-json)
- [Module Details](#module-details)
- [fix_transcription.py](#fix_transcriptionpy-orchestrator)
- [correction_repository.py](#correction_repositorypy-data-access-layer)
- [correction_service.py](#correction_servicepy-business-logic-layer)
- [CLI Integration](#cli-integration-commandspy)
- [dictionary_processor.py](#dictionary_processorpy-stage-1)
- [ai_processor.py](#ai_processorpy-stage-2)
- [learning_engine.py](#learning_enginepy-pattern-detection)
- [diff_generator.py](#diff_generatorpy-stage-3)
- [State Management](#state-management)
- [Database-Backed State](#database-backed-state)
- [Thread-Safe Access](#thread-safe-access)
- [Error Handling Strategy](#error-handling-strategy)
- [Testing Strategy](#testing-strategy)
- [Performance Considerations](#performance-considerations)
- [Security Architecture](#security-architecture)
- [Extensibility Points](#extensibility-points)
- [Dependencies](#dependencies)
- [Deployment](#deployment)
- [Further Reading](#further-reading)
## Module Structure
The codebase follows a modular package structure for maintainability:
```
scripts/
├── fix_transcription.py # Main entry point (~70 lines)
├── core/ # Business logic & data access
│ ├── correction_repository.py # Data access layer (466 lines)
│ ├── correction_service.py # Business logic layer (525 lines)
│ ├── schema.sql # SQLite database schema (216 lines)
│ ├── dictionary_processor.py # Stage 1 processor (140 lines)
│ ├── ai_processor.py # Stage 2 processor (199 lines)
│ └── learning_engine.py # Pattern detection (252 lines)
├── cli/ # Command-line interface
│ ├── commands.py # Command handlers (180 lines)
│ └── argument_parser.py # Argument config (95 lines)
└── utils/ # Utility functions
├── diff_generator.py # Multi-format diffs (132 lines)
├── logging_config.py # Logging configuration (130 lines)
└── validation.py # SQLite validation (105 lines)
```
**Benefits of modular structure**:
- Clear separation of concerns (business logic / CLI / utilities)
- Easy to locate and modify specific functionality
- Supports independent testing of modules
- Scales well as codebase grows
- Follows Python package best practices
## Design Principles
### SOLID Compliance
Every module follows SOLID principles for maintainability:
1. **Single Responsibility Principle (SRP)**
- Each module has exactly one reason to change
- `CorrectionRepository`: Database operations only
- `CorrectionService`: Business logic and validation only
- `DictionaryProcessor`: Text transformation only
- `AIProcessor`: API communication only
- `LearningEngine`: Pattern analysis only
2. **Open/Closed Principle (OCP)**
- Open for extension via SQL INSERT
- Closed for modification (no code changes needed)
- Add corrections via CLI or SQL without editing Python
3. **Liskov Substitution Principle (LSP)**
- All processors implement same interface
- Can swap implementations without breaking workflow
4. **Interface Segregation Principle (ISP)**
- Repository, Service, Processor, Engine are independent
- No unnecessary dependencies
5. **Dependency Inversion Principle (DIP)**
- Service depends on Repository interface
- CLI depends on Service interface
- Not tied to concrete implementations
### File Length Limits
All files comply with code quality standards:
| File | Lines | Limit | Status |
|------|-------|-------|--------|
| `validation.py` | 105 | 200 | ✅ |
| `logging_config.py` | 130 | 200 | ✅ |
| `diff_generator.py` | 132 | 200 | ✅ |
| `dictionary_processor.py` | 140 | 200 | ✅ |
| `commands.py` | 180 | 200 | ✅ |
| `ai_processor.py` | 199 | 250 | ✅ |
| `schema.sql` | 216 | 250 | ✅ |
| `learning_engine.py` | 252 | 250 | ✅ |
| `correction_repository.py` | 466 | 500 | ✅ |
| `correction_service.py` | 525 | 550 | ✅ |
## Module Architecture
### Layer Diagram
```
┌─────────────────────────────────────────┐
│ CLI Layer (fix_transcription.py) │
│ - Argument parsing │
│ - Command routing │
│ - User interaction │
└───────────────┬─────────────────────────┘
┌───────────────▼─────────────────────────┐
│ Business Logic Layer │
│ │
│ ┌──────────────────┐ ┌──────────────┐│
│ │ Dictionary │ │ AI ││
│ │ Processor │ │ Processor ││
│ │ (Stage 1) │ │ (Stage 2) ││
│ └──────────────────┘ └──────────────┘│
│ │
│ ┌──────────────────┐ ┌──────────────┐│
│ │ Learning │ │ Diff ││
│ │ Engine │ │ Generator ││
│ │ (Pattern detect) │ │ (Stage 3) ││
│ └──────────────────┘ └──────────────┘│
└───────────────┬─────────────────────────┘
┌───────────────▼─────────────────────────┐
│ Data Access Layer (SQLite-based) │
│ │
│ ┌──────────────────────────────────┐ │
│ │ CorrectionManager (Facade) │ │
│ │ - Backward-compatible API │ │
│ └──────────────┬───────────────────┘ │
│ │ │
│ ┌──────────────▼───────────────────┐ │
│ │ CorrectionService │ │
│ │ - Business logic │ │
│ │ - Validation │ │
│ │ - Import/Export │ │
│ └──────────────┬───────────────────┘ │
│ │ │
│ ┌──────────────▼───────────────────┐ │
│ │ CorrectionRepository │ │
│ │ - ACID transactions │ │
│ │ - Thread-safe connections │ │
│ │ - Audit logging │ │
│ └──────────────────────────────────┘ │
└───────────────┬─────────────────────────┘
┌───────────────▼─────────────────────────┐
│ Storage Layer │
│ ~/.transcript-fixer/corrections.db │
│ - SQLite database (ACID compliant) │
│ - 8 normalized tables + 3 views │
│ - Comprehensive indexes │
│ - Foreign key constraints │
└─────────────────────────────────────────┘
```
## Data Flow
### Correction Workflow
```
1. User Input
2. fix_transcription.py (Orchestrator)
3. CorrectionService.get_corrections()
← Query from ~/.transcript-fixer/corrections.db
4. DictionaryProcessor.process()
- Apply context rules (regex)
- Apply dictionary replacements
- Track changes
5. AIProcessor.process()
- Split into chunks
- Call GLM-4.6 API
- Retry with fallback on error
- Track AI changes
6. CorrectionService.save_history()
→ Insert into correction_history table
7. LearningEngine.analyze_and_suggest()
- Query correction_history table
- Detect patterns (frequency ≥3, confidence ≥80%)
- Generate suggestions
→ Insert into learned_suggestions table
8. Output Files
- {filename}_stage1.md
- {filename}_stage2.md
```
### Learning Cycle
```
Run 1: meeting1.md
AI corrects: "巨升" → "具身"
INSERT INTO correction_history
Run 2: meeting2.md
AI corrects: "巨升" → "具身"
INSERT INTO correction_history
Run 3: meeting3.md
AI corrects: "巨升" → "具身"
INSERT INTO correction_history
LearningEngine queries patterns:
- SELECT ... GROUP BY from_text, to_text
- Frequency: 3, Confidence: 100%
INSERT INTO learned_suggestions (status='pending')
User reviews: --review-learned
User approves: --approve "巨升" "具身"
INSERT INTO corrections (source='learned')
UPDATE learned_suggestions (status='approved')
Future runs query corrections table (Stage 1 - faster!)
```
## SQLite Architecture (v2.0)
### Two-Layer Data Access (Simplified)
**Design Principle**: No users = no backward compatibility overhead.
The system uses a clean 2-layer architecture:
```
┌──────────────────────────────────────────┐
│ CLI Commands (commands.py) │
│ - User interaction │
│ - Command routing │
└──────────────┬───────────────────────────┘
┌──────────────▼───────────────────────────┐
│ CorrectionService (Business Logic) │
│ - Input validation & sanitization │
│ - Business rules enforcement │
│ - Import/export orchestration │
│ - Statistics calculation │
│ - History tracking │
└──────────────┬───────────────────────────┘
┌──────────────▼───────────────────────────┐
│ CorrectionRepository (Data Access) │
│ - ACID transactions │
│ - Thread-safe connections │
│ - SQL query execution │
│ - Audit logging │
└──────────────┬───────────────────────────┘
┌──────────────▼───────────────────────────┐
│ SQLite Database (corrections.db) │
│ - 8 normalized tables │
│ - Foreign key constraints │
│ - Comprehensive indexes │
│ - 3 views for common queries │
└───────────────────────────────────────────┘
```
### Database Schema (schema.sql)
**Core Tables**:
1. **corrections** (main correction storage)
- Primary key: id
- Unique constraint: (from_text, domain)
- Indexes: domain, source, added_at, is_active, from_text
- Fields: confidence (0.0-1.0), usage_count, notes
2. **context_rules** (regex-based rules)
- Pattern + replacement with priority ordering
- Indexes: priority (DESC), is_active
3. **correction_history** (audit trail for runs)
- Tracks: filename, domain, timestamps, change counts
- Links to correction_changes via foreign key
- Indexes: run_timestamp, domain, success
4. **correction_changes** (detailed change log)
- Links to history via foreign key (CASCADE delete)
- Stores: line_number, from/to text, rule_type, context
- Indexes: history_id, rule_type
5. **learned_suggestions** (AI-detected patterns)
- Status: pending → approved/rejected
- Unique constraint: (from_text, to_text, domain)
- Fields: frequency, confidence, timestamps
- Indexes: status, domain, confidence, frequency
6. **suggestion_examples** (occurrences of patterns)
- Links to learned_suggestions via foreign key
- Stores context where pattern occurred
7. **system_config** (configuration storage)
- Key-value store with type safety
- Stores: API settings, thresholds, defaults
8. **audit_log** (comprehensive audit trail)
- Tracks all database operations
- Fields: action, entity_type, entity_id, user, success
- Indexes: timestamp, action, entity_type, success
**Views** (for common queries):
- `active_corrections`: Active corrections only
- `pending_suggestions`: Suggestions pending review
- `correction_statistics`: Statistics per domain
### ACID Guarantees
**Atomicity**: All-or-nothing transactions
```python
with self._transaction() as conn:
conn.execute("INSERT ...") # Either all succeed
conn.execute("UPDATE ...") # or all rollback
```
**Consistency**: Constraints enforced
- Foreign key constraints
- Check constraints (confidence 0.0-1.0, usage_count ≥ 0)
- Unique constraints
**Isolation**: Serializable transactions
```python
conn.execute("BEGIN IMMEDIATE") # Acquire write lock
```
**Durability**: Changes persisted to disk
- SQLite guarantees persistence after commit
- Backup before migrations
### Thread Safety
**Thread-local connections**:
```python
def _get_connection(self):
if not hasattr(self._local, 'connection'):
self._local.connection = sqlite3.connect(...)
return self._local.connection
```
**Connection pooling**:
- One connection per thread
- Automatic cleanup on close
- Foreign keys enabled per connection
### Clean Architecture (No Legacy)
**Design Philosophy**:
- Clean 2-layer architecture (Service → Repository)
- No backward compatibility overhead
- Direct API design without legacy constraints
- YAGNI principle: Build for current needs, not hypothetical migrations
## Module Details
### fix_transcription.py (Orchestrator)
**Responsibilities**:
- Parse CLI arguments
- Route commands to appropriate handlers
- Coordinate workflow between modules
- Display user feedback
**Key Functions**:
```python
cmd_init() # Initialize ~/.transcript-fixer/
cmd_add_correction() # Add single correction
cmd_list_corrections() # List corrections
cmd_run_correction() # Execute correction workflow
cmd_review_learned() # Review AI suggestions
cmd_approve() # Approve learned correction
```
**Design Pattern**: Command pattern with function routing
### correction_repository.py (Data Access Layer)
**Responsibilities**:
- Execute SQL queries with ACID guarantees
- Manage thread-safe database connections
- Handle transactions (commit/rollback)
- Perform audit logging
- Convert between database rows and Python objects
**Key Methods**:
```python
add_correction() # INSERT with UNIQUE handling
get_correction() # SELECT single correction
get_all_corrections() # SELECT with filters
get_corrections_dict() # For backward compatibility
update_correction() # UPDATE with transaction
delete_correction() # Soft delete (is_active=0)
increment_usage() # Track usage statistics
bulk_import_corrections() # Batch INSERT with conflict resolution
```
**Transaction Management**:
```python
@contextmanager
def _transaction(self):
conn = self._get_connection()
try:
conn.execute("BEGIN IMMEDIATE")
yield conn
conn.commit()
except Exception:
conn.rollback()
raise
```
### correction_service.py (Business Logic Layer)
**Responsibilities**:
- Input validation and sanitization
- Business rule enforcement
- Orchestrate repository operations
- Import/export with conflict detection
- Statistics calculation
**Key Methods**:
```python
# Validation
validate_correction_text() # Check length, control chars, NULL bytes
validate_domain_name() # Prevent path traversal, injection
validate_confidence() # Range check (0.0-1.0)
validate_source() # Enum validation
# Operations
add_correction() # Validate + repository.add
get_corrections() # Get corrections for domain
remove_correction() # Validate + repository.delete
# Import/Export
import_corrections() # Pre-validate + bulk import + conflict detection
export_corrections() # Query + format as JSON
# Analytics
get_statistics() # Calculate metrics per domain
```
**Validation Rules**:
```python
@dataclass
class ValidationRules:
max_text_length: int = 1000
min_text_length: int = 1
max_domain_length: int = 50
allowed_domain_pattern: str = r'^[a-zA-Z0-9_-]+$'
```
### CLI Integration (commands.py)
**Direct Service Usage**:
```python
def _get_service():
"""Get configured CorrectionService instance."""
config_dir = Path.home() / ".transcript-fixer"
db_path = config_dir / "corrections.db"
repository = CorrectionRepository(db_path)
return CorrectionService(repository)
def cmd_add_correction(args):
service = _get_service()
service.add_correction(args.from_text, args.to_text, args.domain)
```
**Benefits of Direct Integration**:
- No unnecessary abstraction layers
- Clear data flow: CLI → Service → Repository
- Easy to understand and debug
- Performance: One less function call per operation
### dictionary_processor.py (Stage 1)
**Responsibilities**:
- Apply context-aware regex rules
- Apply simple dictionary replacements
- Track all changes with line numbers
**Processing Order**:
1. Context rules first (higher priority)
2. Dictionary replacements second
**Key Methods**:
```python
process(text) -> (corrected_text, changes)
_apply_context_rules()
_apply_dictionary()
get_summary(changes)
```
**Change Tracking**:
```python
@dataclass
class Change:
line_number: int
from_text: str
to_text: str
rule_type: str # "dictionary" or "context_rule"
rule_name: str
```
### ai_processor.py (Stage 2)
**Responsibilities**:
- Split text into API-friendly chunks
- Call GLM-4.6 API
- Handle retries with fallback model
- Track AI-suggested changes
**Key Methods**:
```python
process(text, context) -> (corrected_text, changes)
_split_into_chunks() # Respect paragraph boundaries
_process_chunk() # Single API call
_build_prompt() # Construct correction prompt
```
**Chunking Strategy**:
- Max 6000 characters per chunk
- Split on paragraph boundaries (`\n\n`)
- If paragraph too long, split on sentences
- Preserve context across chunks
**Error Handling**:
- Retry with fallback model (GLM-4.5-Air)
- If both fail, use original text
- Never lose user's data
### learning_engine.py (Pattern Detection)
**Responsibilities**:
- Analyze correction history
- Detect recurring patterns
- Calculate confidence scores
- Generate suggestions for review
- Track rejected suggestions
**Algorithm**:
```python
1. Query correction_history table
2. Extract stage2 (AI) changes
3. Group by pattern (fromto)
4. Count frequency
5. Calculate confidence
6. Filter by thresholds:
- frequency 3
- confidence 0.8
7. Save to learned/pending_review.json
```
**Confidence Calculation**:
```python
confidence = (
0.5 * frequency_score + # More occurrences = higher
0.3 * consistency_score + # Always same correction
0.2 * recency_score # Recent = higher
)
```
**Key Methods**:
```python
analyze_and_suggest() # Main analysis pipeline
approve_suggestion() # Move to corrections.json
reject_suggestion() # Move to rejected.json
list_pending() # Get all suggestions
```
### diff_generator.py (Stage 3)
**Responsibilities**:
- Generate comparison reports
- Multiple output formats
- Word-level diff analysis
**Output Formats**:
1. Markdown summary (statistics + change list)
2. Unified diff (standard diff format)
3. HTML side-by-side (visual comparison)
4. Inline marked ([-old-] [+new+])
**Not Modified**: Kept original 338-line file as-is (working well)
## State Management
### Database-Backed State
- All state stored in `~/.transcript-fixer/corrections.db`
- SQLite handles caching and transactions
- ACID guarantees prevent corruption
- Backup created before migrations
### Thread-Safe Access
- Thread-local connections (one per thread)
- BEGIN IMMEDIATE for write transactions
- No global state or shared mutable data
- Each operation is independent (stateless modules)
### Soft Deletes
- Records marked inactive (is_active=0) instead of DELETE
- Preserves audit trail
- Can be reactivated if needed
## Error Handling Strategy
### Fail Fast for User Errors
```python
if not skill_path.exists():
print(f"❌ Error: Skill directory not found")
sys.exit(1)
```
### Retry for Transient Errors
```python
try:
api_call(model_primary)
except Exception:
try:
api_call(model_fallback)
except Exception:
use_original_text()
```
### Backup Before Destructive Operations
```python
if target_file.exists():
shutil.copy2(target_file, backup_file)
# Then overwrite target_file
```
## Testing Strategy
### Unit Testing (Recommended)
```python
# Test dictionary processor
def test_dictionary_processor():
corrections = {"错误": "正确"}
processor = DictionaryProcessor(corrections, [])
text = "这是错误的文本"
result, changes = processor.process(text)
assert result == "这是正确的文本"
assert len(changes) == 1
# Test learning engine thresholds
def test_learning_thresholds():
engine = LearningEngine(history_dir, learned_dir)
# Create mock history with pattern appearing 3+ times
suggestions = engine.analyze_and_suggest()
assert len(suggestions) > 0
```
### Integration Testing
```bash
# End-to-end test
python fix_transcription.py --init
python fix_transcription.py --add "test" "TEST"
python fix_transcription.py --input test.md --stage 3
# Verify output files exist
```
## Performance Considerations
### Bottlenecks
1. **AI API calls**: Slowest part (60s timeout per chunk)
2. **File I/O**: Negligible (JSON files are small)
3. **Pattern matching**: Fast (regex + dict lookups)
### Optimization Strategies
1. **Stage 1 First**: Test dictionary corrections before expensive AI calls
2. **Chunking**: Process large files in parallel chunks (future enhancement)
3. **Caching**: Could cache API results by content hash (future enhancement)
### Scalability
**Current capabilities (v2.0 with SQLite)**:
- File size: Unlimited (chunks handle large files)
- Corrections: Tested up to 100,000 entries (with indexes)
- History: Unlimited (database handles efficiently)
- Concurrent access: Thread-safe with ACID guarantees
- Query performance: O(log n) with B-tree indexes
**Performance improvements from SQLite**:
- Indexed queries (domain, source, added_at)
- Views for common aggregations
- Batch imports with transactions
- Soft deletes (no data loss)
**Future improvements**:
- Parallel chunk processing for AI calls
- API response caching
- Full-text search for corrections
## Security Architecture
### Secret Management
- API keys via environment variables only
- Never hardcode credentials
- Security scanner enforces this
### Backup Security
- `.bak` files same permissions as originals
- No encryption (user's responsibility)
- Recommendation: Use encrypted filesystems
### Git Security
- `.gitignore` for `.bak` files
- Private repos recommended
- Security scan before commits
## Extensibility Points
### Adding New Processors
1. Create new processor class
2. Implement `process(text) -> (result, changes)` interface
3. Add to orchestrator workflow
Example:
```python
class SpellCheckProcessor:
def process(self, text):
# Custom spell checking logic
return corrected_text, changes
```
### Adding New Learning Algorithms
1. Subclass `LearningEngine`
2. Override `_calculate_confidence()`
3. Adjust thresholds as needed
### Adding New Export Formats
1. Add method to `CorrectionManager`
2. Support new file format
3. Add CLI command
## Dependencies
### Required
- Python 3.8+ (`from __future__ import annotations`)
- `httpx` (for API calls)
### Optional
- `diff` command (for unified diffs)
- Git (for version control)
### Development
- `pytest` (for testing)
- `black` (for formatting)
- `mypy` (for type checking)
## Deployment
### User Installation
```bash
# 1. Clone or download skill to workspace
git clone <repo> transcript-fixer
cd transcript-fixer
# 2. Install dependencies
pip install -r requirements.txt
# 3. Initialize
python scripts/fix_transcription.py --init
# 4. Set API key
export GLM_API_KEY="KEY_VALUE"
# Ready to use!
```
### CI/CD Pipeline (Future)
```yaml
# Potential GitHub Actions workflow
test:
- Install dependencies
- Run unit tests
- Run integration tests
- Check code style (black, mypy)
security:
- Run security_scan.py
- Check for secrets
deploy:
- Package skill
- Upload to skill marketplace
```
## Further Reading
- SOLID Principles: https://en.wikipedia.org/wiki/SOLID
- API Patterns: `references/glm_api_setup.md`
- File Formats: `references/file_formats.md`
- Testing: https://docs.pytest.org/

View File

@@ -0,0 +1,428 @@
# Best Practices
Recommendations for effective use of transcript-fixer based on production experience.
## Table of Contents
- [Getting Started](#getting-started)
- [Build Foundation Before Scaling](#build-foundation-before-scaling)
- [Review Learned Suggestions Regularly](#review-learned-suggestions-regularly)
- [Domain Organization](#domain-organization)
- [Use Domain Separation](#use-domain-separation)
- [Domain Selection Strategy](#domain-selection-strategy)
- [Cost Optimization](#cost-optimization)
- [Test Dictionary Changes Before AI Calls](#test-dictionary-changes-before-ai-calls)
- [Approve High-Confidence Suggestions](#approve-high-confidence-suggestions)
- [Team Collaboration](#team-collaboration)
- [Export Corrections for Version Control](#export-corrections-for-version-control)
- [Share Corrections via Import/Merge](#share-corrections-via-importmerge)
- [Data Management](#data-management)
- [Database Backup Strategy](#database-backup-strategy)
- [Cleanup Strategy](#cleanup-strategy)
- [Workflow Efficiency](#workflow-efficiency)
- [File Organization](#file-organization)
- [Batch Processing](#batch-processing)
- [Context Rules for Edge Cases](#context-rules-for-edge-cases)
- [Quality Assurance](#quality-assurance)
- [Validate After Manual Changes](#validate-after-manual-changes)
- [Monitor Learning Quality](#monitor-learning-quality)
- [Production Deployment](#production-deployment)
- [Environment Variables](#environment-variables)
- [Monitoring](#monitoring)
- [Performance](#performance)
- [Summary](#summary)
## Getting Started
### Build Foundation Before Scaling
**Start small**: Begin with 5-10 manually-added corrections for the most common errors in your domain.
```bash
# Example: embodied AI domain
uv run scripts/fix_transcription.py --add "巨升智能" "具身智能" --domain embodied_ai
uv run scripts/fix_transcription.py --add "巨升" "具身" --domain embodied_ai
uv run scripts/fix_transcription.py --add "奇迹创坛" "奇绩创坛" --domain embodied_ai
```
**Let learning discover others**: After 3-5 correction runs, the learning system will suggest additional patterns automatically.
**Rationale**: Manual corrections provide high-quality training data. Learning amplifies your corrections exponentially.
### Review Learned Suggestions Regularly
**Frequency**: Every 3-5 correction runs
```bash
uv run scripts/fix_transcription.py --review-learned
```
**Why**: Learned corrections move from Stage 2 (AI, expensive) to Stage 1 (dictionary, cheap/instant).
**Impact**:
- 10x faster processing (no API calls)
- Zero cost for repeated patterns
- Builds domain-specific vocabulary automatically
## Domain Organization
### Use Domain Separation
**Prevent conflicts**: Same phonetic error might have different corrections in different domains.
**Example**:
- Finance domain: "股价" (stock price) is correct
- General domain: "股价" → "框架" (framework) ASR error
```bash
# Domain-specific corrections
uv run scripts/fix_transcription.py --add "股价" "框架" --domain general
# No correction needed in finance domain - "股价" is correct there
```
**Available domains**:
- `general` (default) - General-purpose corrections
- `embodied_ai` - Robotics and embodied AI terminology
- `finance` - Financial terminology
- `medical` - Medical terminology
**Custom domains**: Any string matching `^[a-z0-9_]+$` (lowercase, numbers, underscore).
### Domain Selection Strategy
1. **Default domain** for general corrections (dates, common words)
2. **Specialized domains** for technical terminology
3. **Project domains** for company/product-specific terms
```bash
# Project-specific domain
uv run scripts/fix_transcription.py --add "我司" "奇绩创坛" --domain yc_china
```
## Cost Optimization
### Test Dictionary Changes Before AI Calls
**Problem**: AI calls (Stage 2) consume API quota and time.
**Solution**: Test dictionary changes with Stage 1 first.
```bash
# 1. Add new corrections
uv run scripts/fix_transcription.py --add "新错误" "正确词" --domain general
# 2. Test on small sample (Stage 1 only)
uv run scripts/fix_transcription.py --input sample.md --stage 1
# 3. Review output
less sample_stage1.md
# 4. If satisfied, run full pipeline on large files
uv run scripts/fix_transcription.py --input large_file.md --stage 3
```
**Savings**: Avoid wasting API quota on files with dictionary-only corrections.
### Approve High-Confidence Suggestions
**Check suggestions regularly**:
```bash
uv run scripts/fix_transcription.py --review-learned
```
**Approve suggestions with**:
- Frequency ≥ 5
- Confidence ≥ 0.9
- Pattern makes semantic sense
**Impact**: Each approved suggestion saves future API calls.
## Team Collaboration
### Export Corrections for Version Control
**Don't commit** `.db` files to Git:
- Binary format causes merge conflicts
- Database grows over time (bloats repository)
- Not human-reviewable
**Do commit** JSON exports:
```bash
# Export domain dictionaries
uv run scripts/fix_transcription.py --export general_$(date +%Y%m%d).json --domain general
uv run scripts/fix_transcription.py --export embodied_ai_$(date +%Y%m%d).json --domain embodied_ai
# .gitignore
*.db
*.db-journal
*.bak
# Commit exports
git add *_corrections.json
git commit -m "Update correction dictionaries"
```
### Share Corrections via Import/Merge
**Always use `--merge` flag** to combine corrections:
```bash
# Pull latest from team
git pull origin main
# Import new corrections (merge mode)
uv run scripts/fix_transcription.py --import general_20250128.json --merge
uv run scripts/fix_transcription.py --import embodied_ai_20250128.json --merge
```
**Merge behavior**:
- New corrections: inserted
- Existing corrections with higher confidence: updated
- Existing corrections with lower confidence: skipped
- Preserves local customizations
See `team_collaboration.md` for Git workflows and conflict handling.
## Data Management
### Database Backup Strategy
**Automatic backups**: Database creates timestamped backups before migrations:
```
~/.transcript-fixer/
├── corrections.db
├── corrections.20250128_140532.bak
└── corrections.20250127_093021.bak
```
**Manual backups** before bulk changes:
```bash
cp ~/.transcript-fixer/corrections.db ~/backups/corrections_$(date +%Y%m%d).db
```
**Or use SQLite backup**:
```bash
sqlite3 ~/.transcript-fixer/corrections.db ".backup ~/backups/corrections.db"
```
### Cleanup Strategy
**History retention**: Keep recent history, archive old entries:
```bash
# Archive history older than 90 days
sqlite3 ~/.transcript-fixer/corrections.db "
DELETE FROM correction_history
WHERE run_timestamp < datetime('now', '-90 days');
"
# Reclaim space
sqlite3 ~/.transcript-fixer/corrections.db "VACUUM;"
```
**Suggestion cleanup**: Reject low-confidence suggestions periodically:
```bash
# Reject suggestions with frequency < 3
sqlite3 ~/.transcript-fixer/corrections.db "
UPDATE learned_suggestions
SET status = 'rejected'
WHERE frequency < 3 AND confidence < 0.7;
"
```
## Workflow Efficiency
### File Organization
**Use consistent naming**:
```
meeting_20250128.md # Original transcript
meeting_20250128_stage1.md # Dictionary corrections
meeting_20250128_stage2.md # Final corrected version
```
**Generate diff reports** for review:
```bash
uv run scripts/diff_generator.py \
meeting_20250128.md \
meeting_20250128_stage1.md \
meeting_20250128_stage2.md
```
**Output formats**:
- Markdown report (what changed, statistics)
- Unified diff (git-style)
- HTML side-by-side (visual review)
- Inline markers (for direct editing)
### Batch Processing
**Process similar files together** to amplify learning:
```bash
# Day 1: Process 5 similar meetings
for file in meeting_*.md; do
uv run scripts/fix_transcription.py --input "$file" --stage 3 --domain embodied_ai
done
# Day 2: Review learned patterns
uv run scripts/fix_transcription.py --review-learned
# Approve good suggestions
uv run scripts/fix_transcription.py --approve "常见错误1" "正确词1"
uv run scripts/fix_transcription.py --approve "常见错误2" "正确词2"
# Day 3: Future files benefit from dictionary corrections
```
### Context Rules for Edge Cases
**Use regex context rules** for:
- Positional dependencies (e.g., "的" vs "地" before verbs)
- Multi-word patterns
- Traditional vs simplified Chinese
**Example**:
```bash
sqlite3 ~/.transcript-fixer/corrections.db
# "的" before verb → "地"
INSERT INTO context_rules (pattern, replacement, description, priority)
VALUES ('近距离的去看', '近距离地去看', '的→地 before verb', 10);
# Preserve correct usage
INSERT INTO context_rules (pattern, replacement, description, priority)
VALUES ('近距离搏杀', '近距离搏杀', '的 is correct here (noun modifier)', 5);
```
**Priority**: Higher numbers run first (use for exceptions).
## Quality Assurance
### Validate After Manual Changes
**After direct SQL edits**:
```bash
uv run scripts/fix_transcription.py --validate
```
**After imports**:
```bash
# Check statistics
uv run scripts/fix_transcription.py --list --domain general | head -20
# Verify specific corrections
sqlite3 ~/.transcript-fixer/corrections.db "
SELECT from_text, to_text, source, confidence
FROM active_corrections
WHERE domain = 'general'
ORDER BY added_at DESC
LIMIT 10;
"
```
### Monitor Learning Quality
**Check suggestion confidence distribution**:
```bash
sqlite3 ~/.transcript-fixer/corrections.db "
SELECT
CASE
WHEN confidence >= 0.9 THEN 'high (>=0.9)'
WHEN confidence >= 0.8 THEN 'medium (0.8-0.9)'
ELSE 'low (<0.8)'
END as confidence_level,
COUNT(*) as count
FROM learned_suggestions
WHERE status = 'pending'
GROUP BY confidence_level;
"
```
**Review examples** for low-confidence suggestions:
```bash
sqlite3 ~/.transcript-fixer/corrections.db "
SELECT s.from_text, s.to_text, s.confidence, e.context
FROM learned_suggestions s
JOIN suggestion_examples e ON s.id = e.suggestion_id
WHERE s.confidence < 0.8 AND s.status = 'pending';
"
```
## Production Deployment
### Environment Variables
**Set permanently** in production:
```bash
# Add to /etc/environment or systemd service
GLM_API_KEY=your-production-key
```
### Monitoring
**Track usage statistics**:
```bash
# Corrections by source
sqlite3 ~/.transcript-fixer/corrections.db "
SELECT source, COUNT(*) as count, SUM(usage_count) as total_usage
FROM corrections
WHERE is_active = 1
GROUP BY source;
"
# Success rate
sqlite3 ~/.transcript-fixer/corrections.db "
SELECT
COUNT(*) as total_runs,
SUM(CASE WHEN success = 1 THEN 1 ELSE 0 END) as successful,
ROUND(100.0 * SUM(CASE WHEN success = 1 THEN 1 ELSE 0 END) / COUNT(*), 2) as success_rate
FROM correction_history;
"
```
### Performance
**Database optimization**:
```bash
# Rebuild indexes periodically
sqlite3 ~/.transcript-fixer/corrections.db "REINDEX;"
# Analyze query patterns
sqlite3 ~/.transcript-fixer/corrections.db "ANALYZE;"
# Vacuum to reclaim space
sqlite3 ~/.transcript-fixer/corrections.db "VACUUM;"
```
## Summary
**Key principles**:
1. Start small, let learning amplify
2. Use domain separation for quality
3. Test dictionary changes before AI calls
4. Export to JSON for version control
5. Review and approve learned suggestions
6. Validate after manual changes
7. Monitor learning quality
8. Backup before bulk operations
**ROI timeline**:
- Week 1: Build foundation (10-20 manual corrections)
- Week 2-3: Learning kicks in (20-50 suggestions)
- Month 2+: Mature vocabulary (80%+ dictionary coverage, minimal AI calls)

View File

@@ -0,0 +1,97 @@
# 纠错词典配置指南
## 词典结构
纠错词典位于 `fix_transcription.py` 中,包含两部分:
### 1. 上下文规则 (CONTEXT_RULES)
用于需要结合上下文判断的替换:
```python
CONTEXT_RULES = [
{
"pattern": r"正则表达式",
"replacement": "替换文本",
"description": "规则说明"
}
]
```
**示例:**
```python
{
"pattern": r"近距离的去看",
"replacement": "近距离地去看",
"description": "修正''''"
}
```
### 2. 通用词典 (CORRECTIONS_DICT)
用于直接字符串替换:
```python
CORRECTIONS_DICT = {
"错误词汇": "正确词汇",
}
```
**示例:**
```python
{
"巨升智能": "具身智能",
"奇迹创坛": "奇绩创坛",
"矩阵公司": "初创公司",
}
```
## 添加自定义规则
### 步骤1: 识别错误模式
从修复报告中识别重复出现的错误。
### 步骤2: 选择规则类型
- **简单替换** → 使用 CORRECTIONS_DICT
- **需要上下文** → 使用 CONTEXT_RULES
### 步骤3: 添加到词典
编辑 `scripts/fix_transcription.py`:
```python
CORRECTIONS_DICT = {
# 现有规则...
"你的错误": "正确词汇", # 添加新规则
}
```
### 步骤4: 测试
运行修复脚本测试新规则。
## 常见错误类型
### 同音字错误
```python
"股价": "框架",
"三观": "三关",
```
### 专业术语
```python
"巨升智能": "具身智能",
"近距离": "具身", # 某些上下文中
```
### 公司名称
```python
"奇迹创坛": "奇绩创坛",
```
## 优先级
1. 先应用 CONTEXT_RULES (精确匹配)
2. 再应用 CORRECTIONS_DICT (全局替换)

View File

@@ -0,0 +1,395 @@
# Storage Format Reference
This document describes the SQLite database format used by transcript-fixer v2.0.
## Table of Contents
- [Database Location](#database-location)
- [Database Schema](#database-schema)
- [Core Tables](#core-tables)
- [Views](#views)
- [Querying the Database](#querying-the-database)
- [Using Python API](#using-python-api)
- [Using SQLite CLI](#using-sqlite-cli)
- [Import/Export](#importexport)
- [Export to JSON](#export-to-json)
- [Import from JSON](#import-from-json)
- [Backup Strategy](#backup-strategy)
- [Automatic Backups](#automatic-backups)
- [Manual Backups](#manual-backups)
- [Version Control](#version-control)
- [Best Practices](#best-practices)
- [Troubleshooting](#troubleshooting)
- [Database Locked](#database-locked)
- [Corrupted Database](#corrupted-database)
- [Missing Tables](#missing-tables)
## Database Location
**Path**: `~/.transcript-fixer/corrections.db`
**Type**: SQLite 3 database with ACID guarantees
## Database Schema
### Core Tables
#### corrections
Main correction dictionary storage.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| id | INTEGER | PRIMARY KEY | Auto-increment ID |
| from_text | TEXT | NOT NULL | Original (incorrect) text |
| to_text | TEXT | NOT NULL | Corrected text |
| domain | TEXT | DEFAULT 'general' | Correction domain |
| source | TEXT | CHECK IN ('manual', 'learned', 'imported') | Origin of correction |
| confidence | REAL | CHECK 0.0-1.0 | Confidence score |
| added_by | TEXT | | User who added |
| added_at | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | When added |
| usage_count | INTEGER | DEFAULT 0, CHECK >= 0 | Times used |
| last_used | TIMESTAMP | | Last usage time |
| notes | TEXT | | Optional notes |
| is_active | BOOLEAN | DEFAULT 1 | Soft delete flag |
**Unique Constraint**: `(from_text, domain)`
**Indexes**: domain, source, added_at, is_active, from_text
#### context_rules
Regex-based context-aware correction rules.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| id | INTEGER | PRIMARY KEY | Auto-increment ID |
| pattern | TEXT | NOT NULL, UNIQUE | Regex pattern |
| replacement | TEXT | NOT NULL | Replacement text |
| description | TEXT | | Rule explanation |
| priority | INTEGER | DEFAULT 0 | Higher = applied first |
| is_active | BOOLEAN | DEFAULT 1 | Enable/disable |
| added_at | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | When added |
| added_by | TEXT | | User who added |
**Indexes**: priority (DESC), is_active
#### correction_history
Audit log for all correction runs.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| id | INTEGER | PRIMARY KEY | Auto-increment ID |
| filename | TEXT | NOT NULL | File corrected |
| domain | TEXT | NOT NULL | Domain used |
| run_timestamp | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | When run |
| original_length | INTEGER | CHECK >= 0 | Original file size |
| stage1_changes | INTEGER | CHECK >= 0 | Dictionary changes |
| stage2_changes | INTEGER | CHECK >= 0 | AI changes |
| model | TEXT | | AI model used |
| execution_time_ms | INTEGER | | Runtime in ms |
| success | BOOLEAN | DEFAULT 1 | Success flag |
| error_message | TEXT | | Error if failed |
**Indexes**: run_timestamp (DESC), domain, success
#### correction_changes
Detailed changes made in each run.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| id | INTEGER | PRIMARY KEY | Auto-increment ID |
| history_id | INTEGER | FOREIGN KEY → correction_history | Parent run |
| line_number | INTEGER | | Line in file |
| from_text | TEXT | NOT NULL | Original text |
| to_text | TEXT | NOT NULL | Corrected text |
| rule_type | TEXT | CHECK IN ('context', 'dictionary', 'ai') | Rule type |
| rule_id | INTEGER | | Reference to rule |
| context_before | TEXT | | Text before |
| context_after | TEXT | | Text after |
**Foreign Key**: history_id → correction_history.id (CASCADE DELETE)
**Indexes**: history_id, rule_type
#### learned_suggestions
AI-detected patterns pending review.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| id | INTEGER | PRIMARY KEY | Auto-increment ID |
| from_text | TEXT | NOT NULL | Pattern detected |
| to_text | TEXT | NOT NULL | Suggested correction |
| domain | TEXT | DEFAULT 'general' | Domain |
| frequency | INTEGER | CHECK > 0 | Times seen |
| confidence | REAL | CHECK 0.0-1.0 | Confidence score |
| first_seen | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | First occurrence |
| last_seen | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | Last occurrence |
| status | TEXT | CHECK IN ('pending', 'approved', 'rejected') | Review status |
| reviewed_at | TIMESTAMP | | When reviewed |
| reviewed_by | TEXT | | Who reviewed |
**Unique Constraint**: `(from_text, to_text, domain)`
**Indexes**: status, domain, confidence (DESC), frequency (DESC)
#### suggestion_examples
Example occurrences of learned patterns.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| id | INTEGER | PRIMARY KEY | Auto-increment ID |
| suggestion_id | INTEGER | FOREIGN KEY → learned_suggestions | Parent suggestion |
| filename | TEXT | NOT NULL | File where found |
| line_number | INTEGER | | Line number |
| context | TEXT | NOT NULL | Surrounding text |
| occurred_at | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | When found |
**Foreign Key**: suggestion_id → learned_suggestions.id (CASCADE DELETE)
**Index**: suggestion_id
#### system_config
System configuration key-value store.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| key | TEXT | PRIMARY KEY | Config key |
| value | TEXT | NOT NULL | Config value |
| value_type | TEXT | CHECK IN ('string', 'int', 'float', 'boolean', 'json') | Value type |
| description | TEXT | | Config description |
| updated_at | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | Last update |
**Default Values**:
- `schema_version`: "2.0"
- `api_provider`: "GLM"
- `api_model`: "GLM-4.6"
- `default_domain`: "general"
- `auto_learn_enabled`: "true"
- `learning_frequency_threshold`: "3"
- `learning_confidence_threshold`: "0.8"
#### audit_log
Comprehensive audit trail for all operations.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| id | INTEGER | PRIMARY KEY | Auto-increment ID |
| timestamp | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | When occurred |
| action | TEXT | NOT NULL | Action type |
| entity_type | TEXT | NOT NULL | Entity affected |
| entity_id | INTEGER | | Entity ID |
| user | TEXT | | User who performed |
| details | TEXT | | Action details |
| success | BOOLEAN | DEFAULT 1 | Success flag |
| error_message | TEXT | | Error if failed |
**Indexes**: timestamp (DESC), action, entity_type, success
### Views
#### active_corrections
Quick access to active corrections.
```sql
SELECT id, from_text, to_text, domain, source, confidence, usage_count, last_used, added_at
FROM corrections
WHERE is_active = 1
ORDER BY domain, from_text;
```
#### pending_suggestions
Suggestions pending review with example count.
```sql
SELECT s.id, s.from_text, s.to_text, s.domain, s.frequency, s.confidence,
s.first_seen, s.last_seen, COUNT(e.id) as example_count
FROM learned_suggestions s
LEFT JOIN suggestion_examples e ON s.id = e.suggestion_id
WHERE s.status = 'pending'
GROUP BY s.id
ORDER BY s.confidence DESC, s.frequency DESC;
```
#### correction_statistics
Statistics per domain.
```sql
SELECT domain,
COUNT(*) as total_corrections,
COUNT(CASE WHEN source = 'manual' THEN 1 END) as manual_count,
COUNT(CASE WHEN source = 'learned' THEN 1 END) as learned_count,
COUNT(CASE WHEN source = 'imported' THEN 1 END) as imported_count,
SUM(usage_count) as total_usage,
MAX(added_at) as last_updated
FROM corrections
WHERE is_active = 1
GROUP BY domain;
```
## Querying the Database
### Using Python API
```python
from pathlib import Path
from core import CorrectionRepository, CorrectionService
# Initialize
db_path = Path.home() / ".transcript-fixer" / "corrections.db"
repository = CorrectionRepository(db_path)
service = CorrectionService(repository)
# Add correction
service.add_correction("错误", "正确", domain="general")
# Get corrections
corrections = service.get_corrections(domain="general")
# Get statistics
stats = service.get_statistics(domain="general")
print(f"Total: {stats['total_corrections']}")
# Close
service.close()
```
### Using SQLite CLI
```bash
# Open database
sqlite3 ~/.transcript-fixer/corrections.db
# View active corrections
SELECT from_text, to_text, domain FROM active_corrections;
# View statistics
SELECT * FROM correction_statistics;
# View pending suggestions
SELECT * FROM pending_suggestions;
# Check schema version
SELECT value FROM system_config WHERE key = 'schema_version';
```
## Import/Export
### Export to JSON
```python
service = _get_service()
corrections = service.export_corrections(domain="general")
# Write to file
import json
with open("export.json", "w", encoding="utf-8") as f:
json.dump({
"version": "2.0",
"domain": "general",
"corrections": corrections
}, f, ensure_ascii=False, indent=2)
```
### Import from JSON
```python
import json
with open("import.json", "r", encoding="utf-8") as f:
data = json.load(f)
service = _get_service()
inserted, updated, skipped = service.import_corrections(
corrections=data["corrections"],
domain=data.get("domain", "general"),
merge=True,
validate_all=True
)
print(f"Imported: {inserted} new, {updated} updated, {skipped} skipped")
```
## Backup Strategy
### Automatic Backups
The system maintains database integrity through SQLite's ACID guarantees and automatic journaling.
### Manual Backups
```bash
# Backup database
cp ~/.transcript-fixer/corrections.db ~/backups/corrections_$(date +%Y%m%d).db
# Or use SQLite backup
sqlite3 ~/.transcript-fixer/corrections.db ".backup ~/backups/corrections.db"
```
### Version Control
**Recommended**: Use Git for configuration and export files, but NOT for the database:
```bash
# .gitignore
*.db
*.db-journal
*.bak
```
Instead, export corrections periodically:
```bash
python scripts/fix_transcription.py --export-json corrections_backup.json
git add corrections_backup.json
git commit -m "Backup corrections"
```
## Best Practices
1. **Regular Exports**: Export to JSON weekly for team sharing
2. **Database Backups**: Backup `.db` file before major changes
3. **Use Transactions**: All modifications use ACID transactions automatically
4. **Soft Deletes**: Records are marked inactive, not deleted (preserves audit trail)
5. **Validate**: Run `--validate` after manual database changes
6. **Statistics**: Check usage patterns via `correction_statistics` view
7. **Cleanup**: Old history can be archived (query by `run_timestamp`)
## Troubleshooting
### Database Locked
```bash
# Check for lingering connections
lsof ~/.transcript-fixer/corrections.db
# If needed, backup and recreate
cp corrections.db corrections_backup.db
sqlite3 corrections.db "VACUUM;"
```
### Corrupted Database
```bash
# Check integrity
sqlite3 corrections.db "PRAGMA integrity_check;"
# Recover if possible
sqlite3 corrections.db ".recover" | sqlite3 corrections_new.db
```
### Missing Tables
```bash
# Reinitialize schema (safe, uses IF NOT EXISTS)
python -c "from core import CorrectionRepository; from pathlib import Path; CorrectionRepository(Path.home() / '.transcript-fixer' / 'corrections.db')"
```

View File

@@ -0,0 +1,116 @@
# GLM API 配置指南
## API配置
### 设置环境变量
在运行脚本前,设置GLM API密钥环境变量:
```bash
# Linux/macOS
export GLM_API_KEY="your-api-key-here"
# Windows (PowerShell)
$env:GLM_API_KEY="your-api-key-here"
# Windows (CMD)
set GLM_API_KEY=your-api-key-here
```
**永久设置** (推荐):
```bash
# Linux/macOS: 添加到 ~/.bashrc 或 ~/.zshrc
echo 'export GLM_API_KEY="your-api-key-here"' >> ~/.bashrc
source ~/.bashrc
# Windows: 在系统环境变量中设置
```
### 脚本配置
脚本会自动从环境变量读取API密钥:
```python
# 脚本会检查环境变量
if "GLM_API_KEY" not in os.environ:
raise ValueError("请设置 GLM_API_KEY 环境变量")
os.environ["ANTHROPIC_BASE_URL"] = "https://open.bigmodel.cn/api/anthropic"
os.environ["ANTHROPIC_API_KEY"] = os.environ["GLM_API_KEY"]
# 模型配置
GLM_MODEL = "GLM-4.6" # 主力模型
GLM_MODEL_FAST = "GLM-4.5-Air" # 快速模型(备用)
```
## 支持的模型
| 模型名称 | 说明 | 用途 |
|---------|------|------|
| GLM-4.6 | 最强模型 | 默认使用,精度最高 |
| GLM-4.5-Air | 快速模型 | 备用,速度更快 |
**注意**: 模型名称大小写不敏感。
## API认证
智谱GLM使用Anthropic兼容API:
```python
headers = {
"anthropic-version": "2023-06-01",
"Authorization": f"Bearer {api_key}",
"content-type": "application/json"
}
```
**关键点:**
- 使用 `Authorization: Bearer`
- 不要使用 `x-api-key`
## API调用示例
```python
def call_glm_api(prompt: str) -> str:
url = "https://open.bigmodel.cn/api/anthropic/v1/messages"
headers = {
"anthropic-version": "2023-06-01",
"Authorization": f"Bearer {os.environ.get('ANTHROPIC_API_KEY')}",
"content-type": "application/json"
}
data = {
"model": "GLM-4.6",
"max_tokens": 8000,
"temperature": 0.3,
"messages": [{"role": "user", "content": prompt}]
}
response = httpx.post(url, headers=headers, json=data, timeout=60.0)
return response.json()["content"][0]["text"]
```
## 获取API密钥
1. 访问 https://open.bigmodel.cn/
2. 注册/登录账号
3. 进入API管理页面
4. 创建新的API密钥
5. 复制密钥到配置中
## 费用
参考智谱AI官方定价:
- GLM-4.6: 按token计费
- GLM-4.5-Air: 更便宜的选择
## 故障排查
### 401错误
- 检查API密钥是否正确
- 确认使用 `Authorization: Bearer`
### 超时错误
- 增加timeout参数
- 考虑使用GLM-4.5-Air快速模型

View File

@@ -0,0 +1,135 @@
# Setup Guide
Complete installation and configuration guide for transcript-fixer.
## Table of Contents
- [Installation](#installation)
- [API Configuration](#api-configuration)
- [Environment Setup](#environment-setup)
- [Next Steps](#next-steps)
## Installation
### Dependencies
Install required dependencies using uv:
```bash
uv pip install -r requirements.txt
```
Or sync the project environment:
```bash
uv sync
```
**Required packages**:
- `anthropic` - For Claude API integration (future)
- `requests` - For GLM API calls
- `difflib` - Standard library for diff generation
### Database Initialization
Initialize the SQLite database (first time only):
```bash
uv run scripts/fix_transcription.py --init
```
This creates `~/.transcript-fixer/corrections.db` with the complete schema:
- 8 tables (corrections, context_rules, history, suggestions, etc.)
- 3 views (active_corrections, pending_suggestions, statistics)
- ACID transactions enabled
- Automatic backups before migrations
See `file_formats.md` for complete database schema.
## API Configuration
### GLM API Key (Required for Stage 2)
Stage 2 AI corrections require a GLM API key.
1. **Obtain API key**: Visit https://open.bigmodel.cn/
2. **Register** for an account
3. **Generate** an API key from the dashboard
4. **Set environment variable**:
```bash
export GLM_API_KEY="your-api-key-here"
```
**Persistence**: Add to shell profile for permanent access:
```bash
# For bash
echo 'export GLM_API_KEY="your-key"' >> ~/.bashrc
source ~/.bashrc
# For zsh
echo 'export GLM_API_KEY="your-key"' >> ~/.zshrc
source ~/.zshrc
```
### Verify Configuration
Run validation to check setup:
```bash
uv run scripts/fix_transcription.py --validate
```
**Expected output**:
```
🔍 Validating transcript-fixer configuration...
✅ Configuration directory exists: ~/.transcript-fixer
✅ Database valid: 0 corrections
✅ All 8 tables present
✅ GLM_API_KEY is set
============================================================
✅ All checks passed! Configuration is valid.
============================================================
```
## Environment Setup
### Python Environment
**Required**: Python 3.8+
**Recommended**: Use uv for all Python operations:
```bash
# Never use system python directly
uv run scripts/fix_transcription.py # ✅ Correct
# Don't use system python
python scripts/fix_transcription.py # ❌ Wrong
```
### Directory Structure
After initialization, the directory structure is:
```
~/.transcript-fixer/
├── corrections.db # SQLite database
├── corrections.YYYYMMDD.bak # Automatic backups
└── (migration artifacts)
```
**Important**: The `.db` file should NOT be committed to Git. Export corrections to JSON for version control instead.
## Next Steps
After setup:
1. Add initial corrections (5-10 terms)
2. Run first correction on a test file
3. Review learned suggestions after 3-5 runs
4. Build domain-specific dictionaries
See `workflow_guide.md` for detailed usage instructions.

View File

@@ -0,0 +1,125 @@
# Quick Reference
**Storage**: transcript-fixer uses SQLite database for corrections storage.
**Database location**: `~/.transcript-fixer/corrections.db`
## Quick Start Examples
### Adding Corrections via CLI
```bash
# Add a simple correction
uv run scripts/fix_transcription.py --add "巨升智能" "具身智能" --domain embodied_ai
# Add corrections for specific domain
uv run scripts/fix_transcription.py --add "奇迹创坛" "奇绩创坛" --domain general
uv run scripts/fix_transcription.py --add "矩阵公司" "初创公司" --domain general
```
### Adding Corrections via SQL
```bash
sqlite3 ~/.transcript-fixer/corrections.db
# Insert corrections
INSERT INTO corrections (from_text, to_text, domain, source)
VALUES ('巨升智能', '具身智能', 'embodied_ai', 'manual');
INSERT INTO corrections (from_text, to_text, domain, source)
VALUES ('巨升', '具身', 'embodied_ai', 'manual');
INSERT INTO corrections (from_text, to_text, domain, source)
VALUES ('奇迹创坛', '奇绩创坛', 'general', 'manual');
# Exit
.quit
```
### Adding Context Rules via SQL
Context rules use regex patterns for context-aware corrections:
```bash
sqlite3 ~/.transcript-fixer/corrections.db
# Add context-aware rules
INSERT INTO context_rules (pattern, replacement, description, priority)
VALUES ('巨升方向', '具身方向', '巨升→具身', 10);
INSERT INTO context_rules (pattern, replacement, description, priority)
VALUES ('巨升现在', '具身现在', '巨升→具身', 10);
INSERT INTO context_rules (pattern, replacement, description, priority)
VALUES ('近距离的去看', '近距离地去看', '的→地 副词修饰', 5);
# Exit
.quit
```
### Adding Corrections via Python API
Save as `add_corrections.py` and run with `uv run add_corrections.py`:
```python
#!/usr/bin/env -S uv run
from pathlib import Path
from core import CorrectionRepository, CorrectionService
# Initialize service
db_path = Path.home() / ".transcript-fixer" / "corrections.db"
repository = CorrectionRepository(db_path)
service = CorrectionService(repository)
# Add corrections
corrections = [
("巨升智能", "具身智能", "embodied_ai"),
("巨升", "具身", "embodied_ai"),
("奇迹创坛", "奇绩创坛", "general"),
("火星营", "火星营", "general"),
("矩阵公司", "初创公司", "general"),
("股价", "框架", "general"),
("三观", "三关", "general"),
]
for from_text, to_text, domain in corrections:
service.add_correction(from_text, to_text, domain)
print(f"✅ Added: '{from_text}''{to_text}' (domain: {domain})")
# Close connection
service.close()
```
## Bulk Import Example
Use the provided bulk import script for importing multiple corrections:
```bash
uv run scripts/examples/bulk_import.py
```
## Querying the Database
### View Active Corrections
```bash
sqlite3 ~/.transcript-fixer/corrections.db "SELECT from_text, to_text, domain FROM active_corrections;"
```
### View Statistics
```bash
sqlite3 ~/.transcript-fixer/corrections.db "SELECT * FROM correction_statistics;"
```
### View Context Rules
```bash
sqlite3 ~/.transcript-fixer/corrections.db "SELECT pattern, replacement, priority FROM context_rules WHERE is_active = 1 ORDER BY priority DESC;"
```
## See Also
- `references/file_formats.md` - Complete database schema documentation
- `references/script_parameters.md` - CLI command reference
- `SKILL.md` - Main user documentation

View File

@@ -0,0 +1,186 @@
# Script Parameters Reference
Detailed command-line parameters and usage examples for transcript-fixer Python scripts.
## Table of Contents
- [fix_transcription.py](#fixtranscriptionpy) - Main correction pipeline
- [Setup Commands](#setup-commands)
- [Correction Management](#correction-management)
- [Correction Workflow](#correction-workflow)
- [Learning Commands](#learning-commands)
- [diff_generator.py](#diffgeneratorpy) - Generate comparison reports
- [Common Workflows](#common-workflows)
- [Exit Codes](#exit-codes)
- [Environment Variables](#environment-variables)
---
## fix_transcription.py
Main correction pipeline script supporting three processing stages.
### Syntax
```bash
python scripts/fix_transcription.py --input <file> --stage <1|2|3> [--output <dir>]
```
### Parameters
- `--input, -i` (required): Input Markdown file path
- `--stage, -s` (optional): Stage to execute (default: 3)
- `1` = Dictionary corrections only
- `2` = AI corrections only (requires Stage 1 output file)
- `3` = Both stages sequentially
- `--output, -o` (optional): Output directory (defaults to input file directory)
### Usage Examples
**Run dictionary corrections only:**
```bash
python scripts/fix_transcription.py --input meeting.md --stage 1
```
Output: `meeting_阶段1_词典修复.md`
**Run AI corrections only:**
```bash
python scripts/fix_transcription.py --input meeting_阶段1_词典修复.md --stage 2
```
Output: `meeting_阶段2_AI修复.md`
Note: Requires Stage 1 output file as input.
**Run complete pipeline:**
```bash
python scripts/fix_transcription.py --input meeting.md --stage 3
```
Outputs:
- `meeting_阶段1_词典修复.md`
- `meeting_阶段2_AI修复.md`
**Custom output directory:**
```bash
python scripts/fix_transcription.py --input meeting.md --stage 3 --output ./corrections
```
### Exit Codes
- `0` - Success
- `1` - Missing required parameters or file not found
- `2` - GLM_API_KEY environment variable not set (Stage 2 or 3 only)
- `3` - API request failed
## generate_diff_report.py
Multi-format diff report generator for comparing correction stages.
### Syntax
```bash
python scripts/generate_diff_report.py --original <file> --stage1 <file> --stage2 <file> [--output-dir <dir>]
```
### Parameters
- `--original` (required): Original transcript file path
- `--stage1` (required): Stage 1 correction output file path
- `--stage2` (required): Stage 2 correction output file path
- `--output-dir` (optional): Output directory for diff reports (defaults to original file directory)
### Usage Examples
**Basic usage:**
```bash
python scripts/generate_diff_report.py \
--original "meeting.md" \
--stage1 "meeting_阶段1_词典修复.md" \
--stage2 "meeting_阶段2_AI修复.md"
```
**Custom output directory:**
```bash
python scripts/generate_diff_report.py \
--original "meeting.md" \
--stage1 "meeting_阶段1_词典修复.md" \
--stage2 "meeting_阶段2_AI修复.md" \
--output-dir "./reports"
```
### Output Files
The script generates four comparison formats:
1. **Markdown summary** (`*_对比报告.md`)
- High-level statistics and change summary
- Word count changes per stage
- Common error patterns identified
2. **Unified diff** (`*_unified.diff`)
- Traditional Unix diff format
- Suitable for command-line review or version control
3. **HTML side-by-side** (`*_对比.html`)
- Visual side-by-side comparison
- Color-coded additions/deletions
- **Recommended for human review**
4. **Inline marked** (`*_行内对比.txt`)
- Single-column format with inline change markers
- Useful for quick text editor review
### Exit Codes
- `0` - Success
- `1` - Missing required parameters or file not found
- `2` - File format error (non-Markdown input)
## Common Workflows
### Testing Dictionary Changes
Test dictionary updates before running expensive AI corrections:
```bash
# 1. Update CORRECTIONS_DICT in scripts/fix_transcription.py
# 2. Run Stage 1 only
python scripts/fix_transcription.py --input meeting.md --stage 1
# 3. Review output
cat meeting_阶段1_词典修复.md
# 4. If satisfied, run Stage 2
python scripts/fix_transcription.py --input meeting_阶段1_词典修复.md --stage 2
```
### Batch Processing
Process multiple transcripts in sequence:
```bash
for file in transcripts/*.md; do
python scripts/fix_transcription.py --input "$file" --stage 3
done
```
### Quick Review Cycle
Generate and open comparison report immediately after correction:
```bash
# Run corrections
python scripts/fix_transcription.py --input meeting.md --stage 3
# Generate and open diff report
python scripts/generate_diff_report.py \
--original "meeting.md" \
--stage1 "meeting_阶段1_词典修复.md" \
--stage2 "meeting_阶段2_AI修复.md"
open meeting_对比.html # macOS
# xdg-open meeting_对比.html # Linux
# start meeting_对比.html # Windows
```

View File

@@ -0,0 +1,188 @@
# SQL Query Reference
Database location: `~/.transcript-fixer/corrections.db`
## Basic Operations
### Add Corrections
```sql
-- Add a correction
INSERT INTO corrections (from_text, to_text, domain, source)
VALUES ('巨升智能', '具身智能', 'embodied_ai', 'manual');
INSERT INTO corrections (from_text, to_text, domain, source)
VALUES ('奇迹创坛', '奇绩创坛', 'general', 'manual');
```
### View Corrections
```sql
-- View all active corrections
SELECT from_text, to_text, domain, source, usage_count
FROM active_corrections
ORDER BY domain, from_text;
-- View corrections for specific domain
SELECT from_text, to_text, usage_count, added_at
FROM active_corrections
WHERE domain = 'embodied_ai';
```
## Context Rules
### Add Context-Aware Rules
```sql
-- Add regex-based context rule
INSERT INTO context_rules (pattern, replacement, description, priority)
VALUES ('巨升方向', '具身方向', '巨升→具身', 10);
INSERT INTO context_rules (pattern, replacement, description, priority)
VALUES ('近距离的去看', '近距离地去看', '的→地 副词修饰', 5);
```
### View Rules
```sql
-- View all active context rules (ordered by priority)
SELECT pattern, replacement, description, priority
FROM context_rules
WHERE is_active = 1
ORDER BY priority DESC;
```
## Statistics
```sql
-- View correction statistics by domain
SELECT * FROM correction_statistics;
-- Count corrections by source
SELECT source, COUNT(*) as count, SUM(usage_count) as total_usage
FROM corrections
WHERE is_active = 1
GROUP BY source;
-- Most frequently used corrections
SELECT from_text, to_text, domain, usage_count, last_used
FROM corrections
WHERE is_active = 1 AND usage_count > 0
ORDER BY usage_count DESC
LIMIT 10;
```
## Learning and Suggestions
### View Suggestions
```sql
-- View pending suggestions
SELECT * FROM pending_suggestions;
-- View high-confidence suggestions
SELECT from_text, to_text, domain, frequency, confidence
FROM learned_suggestions
WHERE status = 'pending' AND confidence >= 0.8
ORDER BY confidence DESC, frequency DESC;
```
### Approve Suggestions
```sql
-- Insert into corrections
INSERT INTO corrections (from_text, to_text, domain, source, confidence)
SELECT from_text, to_text, domain, 'learned', confidence
FROM learned_suggestions
WHERE id = 1;
-- Mark as approved
UPDATE learned_suggestions
SET status = 'approved', reviewed_at = CURRENT_TIMESTAMP
WHERE id = 1;
```
## History and Audit
```sql
-- View recent correction runs
SELECT filename, domain, stage1_changes, stage2_changes, run_timestamp
FROM correction_history
ORDER BY run_timestamp DESC
LIMIT 10;
-- View detailed changes for a specific run
SELECT ch.line_number, ch.from_text, ch.to_text, ch.rule_type
FROM correction_changes ch
JOIN correction_history h ON ch.history_id = h.id
WHERE h.filename = 'meeting.md'
ORDER BY ch.line_number;
-- Calculate success rate
SELECT
COUNT(*) as total_runs,
SUM(CASE WHEN success = 1 THEN 1 ELSE 0 END) as successful,
ROUND(100.0 * SUM(CASE WHEN success = 1 THEN 1 ELSE 0 END) / COUNT(*), 2) as success_rate
FROM correction_history;
```
## Maintenance
```sql
-- Deactivate (soft delete) a correction
UPDATE corrections
SET is_active = 0
WHERE from_text = '错误词' AND domain = 'general';
-- Reactivate a correction
UPDATE corrections
SET is_active = 1
WHERE from_text = '错误词' AND domain = 'general';
-- Update correction confidence
UPDATE corrections
SET confidence = 0.95
WHERE from_text = '巨升' AND to_text = '具身';
-- Delete old history (older than 90 days)
DELETE FROM correction_history
WHERE run_timestamp < datetime('now', '-90 days');
-- Reclaim space
VACUUM;
```
## System Configuration
```sql
-- View system configuration
SELECT key, value, description FROM system_config;
-- Update configuration
UPDATE system_config
SET value = '5'
WHERE key = 'learning_frequency_threshold';
-- Check schema version
SELECT value FROM system_config WHERE key = 'schema_version';
```
## Export
```sql
-- Export corrections as CSV
.mode csv
.headers on
.output corrections_export.csv
SELECT from_text, to_text, domain, source, confidence, usage_count, added_at
FROM active_corrections;
.output stdout
```
For JSON export, use Python script with `service.export_corrections()` instead.
## See Also
- `references/file_formats.md` - Complete database schema documentation
- `references/quick_reference.md` - CLI command quick reference
- `SKILL.md` - Main user documentation

View File

@@ -0,0 +1,371 @@
# Team Collaboration Guide
This guide explains how to share correction knowledge across teams using export/import and Git workflows.
## Table of Contents
- [Export/Import Workflow](#exportimport-workflow)
- [Export Corrections](#export-corrections)
- [Import from Teammate](#import-from-teammate)
- [Team Workflow Example](#team-workflow-example)
- [Git-Based Collaboration](#git-based-collaboration)
- [Initial Setup](#initial-setup)
- [Team Members Clone](#team-members-clone)
- [Ongoing Sync](#ongoing-sync)
- [Handling Conflicts](#handling-conflicts)
- [Selective Domain Sharing](#selective-domain-sharing)
- [Finance Team](#finance-team)
- [AI Team](#ai-team)
- [Individual imports specific domains](#individual-imports-specific-domains)
- [Git Branching Strategy](#git-branching-strategy)
- [Feature Branches](#feature-branches)
- [Domain Branches (Alternative)](#domain-branches-alternative)
- [Automated Sync (Advanced)](#automated-sync-advanced)
- [macOS/Linux Cron](#macoslinux-cron)
- [Windows Task Scheduler](#windows-task-scheduler)
- [Backup and Recovery](#backup-and-recovery)
- [Backup Strategy](#backup-strategy)
- [Recovery from Backup](#recovery-from-backup)
- [Recovery from Git](#recovery-from-git)
- [Team Best Practices](#team-best-practices)
- [Integration with CI/CD](#integration-with-cicd)
- [GitHub Actions Example](#github-actions-example)
- [Troubleshooting](#troubleshooting)
- [Import Failed](#import-failed)
- [Git Sync Failed](#git-sync-failed)
- [Merge Conflicts Too Complex](#merge-conflicts-too-complex)
- [Security Considerations](#security-considerations)
- [Further Reading](#further-reading)
## Export/Import Workflow
### Export Corrections
Share your corrections with team members:
```bash
# Export specific domain
python scripts/fix_transcription.py --export team_corrections.json --domain embodied_ai
# Export general corrections
python scripts/fix_transcription.py --export team_corrections.json
```
**Output**: Creates a standalone JSON file with your corrections.
### Import from Teammate
Two modes: **merge** (combine) or **replace** (overwrite):
```bash
# Merge (recommended) - combines with existing corrections
python scripts/fix_transcription.py --import team_corrections.json --merge
# Replace - overwrites existing corrections (dangerous!)
python scripts/fix_transcription.py --import team_corrections.json
```
**Merge behavior**:
- Adds new corrections
- Updates existing corrections with imported values
- Preserves corrections not in import file
### Team Workflow Example
**Person A (Domain Expert)**:
```bash
# Build correction dictionary
python fix_transcription.py --add "巨升" "具身" --domain embodied_ai
python fix_transcription.py --add "奇迹创坛" "奇绩创坛" --domain embodied_ai
# ... add 50 more corrections ...
# Export for team
python fix_transcription.py --export ai_corrections.json --domain embodied_ai
# Send ai_corrections.json to team via Slack/email
```
**Person B (Team Member)**:
```bash
# Receive ai_corrections.json
# Import and merge with existing corrections
python fix_transcription.py --import ai_corrections.json --merge
# Now Person B has all 50+ corrections!
```
## Git-Based Collaboration
For teams using Git, version control the entire correction database.
### Initial Setup
**Person A (First User)**:
```bash
cd ~/.transcript-fixer
git init
git add corrections.json context_rules.json config.json
git add domains/
git commit -m "Initial correction database"
# Push to shared repo
git remote add origin git@github.com:org/transcript-corrections.git
git push -u origin main
```
### Team Members Clone
**Person B, C, D (Team Members)**:
```bash
# Clone shared corrections
git clone git@github.com:org/transcript-corrections.git ~/.transcript-fixer
# Now everyone has the same corrections!
```
### Ongoing Sync
**Daily workflow**:
```bash
# Morning: Pull team updates
cd ~/.transcript-fixer
git pull origin main
# During day: Add corrections
python fix_transcription.py --add "错误" "正确"
# Evening: Push your additions
cd ~/.transcript-fixer
git add corrections.json
git commit -m "Added 5 new embodied AI corrections"
git push origin main
```
### Handling Conflicts
When two people add different corrections to same file:
```bash
cd ~/.transcript-fixer
git pull origin main
# If conflict occurs:
# CONFLICT in corrections.json
# Option 1: Manual merge (recommended)
nano corrections.json # Edit to combine both changes
git add corrections.json
git commit -m "Merged corrections from teammate"
git push
# Option 2: Keep yours
git checkout --ours corrections.json
git add corrections.json
git commit -m "Kept local corrections"
git push
# Option 3: Keep theirs
git checkout --theirs corrections.json
git add corrections.json
git commit -m "Used teammate's corrections"
git push
```
**Best Practice**: JSON merge conflicts are usually easy - just combine the correction entries from both versions.
## Selective Domain Sharing
Share only specific domains with different teams:
### Finance Team
```bash
# Finance team exports their domain
python fix_transcription.py --export finance_corrections.json --domain finance
# Share finance_corrections.json with finance team only
```
### AI Team
```bash
# AI team exports their domain
python fix_transcription.py --export ai_corrections.json --domain embodied_ai
# Share ai_corrections.json with AI team only
```
### Individual imports specific domains
```bash
# Alice works on both finance and AI
python fix_transcription.py --import finance_corrections.json --merge
python fix_transcription.py --import ai_corrections.json --merge
```
## Git Branching Strategy
For larger teams, use branches for different domains or workflows:
### Feature Branches
```bash
# Create branch for major dictionary additions
git checkout -b add-medical-terms
python fix_transcription.py --add "医疗术语" "正确术语" --domain medical
# ... add 100 medical corrections ...
git add domains/medical.json
git commit -m "Added 100 medical terminology corrections"
git push origin add-medical-terms
# Create PR for review
# After approval, merge to main
```
### Domain Branches (Alternative)
```bash
# Separate branches per domain
git checkout -b domain/embodied-ai
# Work on AI corrections
git push origin domain/embodied-ai
git checkout -b domain/finance
# Work on finance corrections
git push origin domain/finance
```
## Automated Sync (Advanced)
Set up automatic Git sync using cron/Task Scheduler:
### macOS/Linux Cron
```bash
# Edit crontab
crontab -e
# Add daily sync at 9 AM and 6 PM
0 9,18 * * * cd ~/.transcript-fixer && git pull origin main && git push origin main
```
### Windows Task Scheduler
```powershell
# Create scheduled task
$action = New-ScheduledTaskAction -Execute "git" -Argument "pull origin main" -WorkingDirectory "$env:USERPROFILE\.transcript-fixer"
$trigger = New-ScheduledTaskTrigger -Daily -At 9am
Register-ScheduledTask -Action $action -Trigger $trigger -TaskName "SyncTranscriptCorrections"
```
## Backup and Recovery
### Backup Strategy
```bash
# Weekly backup to cloud
cd ~/.transcript-fixer
tar -czf transcript-corrections-$(date +%Y%m%d).tar.gz corrections.json context_rules.json domains/
# Upload to Dropbox/Google Drive/S3
```
### Recovery from Backup
```bash
# Extract backup
tar -xzf transcript-corrections-20250127.tar.gz -C ~/.transcript-fixer/
```
### Recovery from Git
```bash
# View history
cd ~/.transcript-fixer
git log corrections.json
# Restore from 3 commits ago
git checkout HEAD~3 corrections.json
# Or restore specific version
git checkout abc123def corrections.json
```
## Team Best Practices
1. **Pull Before Push**: Always `git pull` before starting work
2. **Commit Often**: Small, frequent commits better than large infrequent ones
3. **Descriptive Messages**: "Added 5 finance terms" better than "updates"
4. **Review Process**: Use PRs for major dictionary changes (100+ corrections)
5. **Domain Ownership**: Assign domain experts as reviewers
6. **Weekly Sync**: Schedule team sync meetings to review learned suggestions
7. **Backup Policy**: Weekly backups of entire `~/.transcript-fixer/`
## Integration with CI/CD
For enterprise teams, integrate validation into CI:
### GitHub Actions Example
```yaml
# .github/workflows/validate-corrections.yml
name: Validate Corrections
on:
pull_request:
paths:
- 'corrections.json'
- 'domains/*.json'
jobs:
validate:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Validate JSON
run: |
python -m json.tool corrections.json > /dev/null
for file in domains/*.json; do
python -m json.tool "$file" > /dev/null
done
- name: Check for duplicates
run: |
python scripts/check_duplicates.py corrections.json
```
## Troubleshooting
### Import Failed
```bash
# Check JSON validity
python -m json.tool team_corrections.json
# If invalid, fix JSON syntax errors
nano team_corrections.json
```
### Git Sync Failed
```bash
# Check remote connection
git remote -v
# Re-add if needed
git remote set-url origin git@github.com:org/corrections.git
# Verify SSH keys
ssh -T git@github.com
```
### Merge Conflicts Too Complex
```bash
# Nuclear option: Keep one version
git checkout --ours corrections.json # Keep yours
# OR
git checkout --theirs corrections.json # Keep theirs
# Then re-import the other version
python fix_transcription.py --import other_version.json --merge
```
## Security Considerations
1. **Private Repos**: Use private Git repositories for company-specific corrections
2. **Access Control**: Limit who can push to main branch
3. **Secret Scanning**: Never commit API keys (already handled by security_scan.py)
4. **Audit Trail**: Git history provides full audit trail of who changed what
5. **Backup Encryption**: Encrypt backups if containing sensitive terminology
## Further Reading
- Git workflows: https://git-scm.com/book/en/v2/Git-Branching-Branching-Workflows
- JSON validation: https://jsonlint.com/
- Team Git practices: https://github.com/git-guides

View File

@@ -0,0 +1,313 @@
# Troubleshooting Guide
Solutions to common issues and error conditions.
## Table of Contents
- [API Authentication Errors](#api-authentication-errors)
- [GLM_API_KEY Not Set](#glm_api_key-not-set)
- [Invalid API Key](#invalid-api-key)
- [Learning System Issues](#learning-system-issues)
- [No Suggestions Generated](#no-suggestions-generated)
- [Database Issues](#database-issues)
- [Database Not Found](#database-not-found)
- [Database Locked](#database-locked)
- [Corrupted Database](#corrupted-database)
- [Missing Tables](#missing-tables)
- [Common Pitfalls](#common-pitfalls)
- [1. Stage Order Confusion](#1-stage-order-confusion)
- [2. Overwriting Imports](#2-overwriting-imports)
- [3. Ignoring Learned Suggestions](#3-ignoring-learned-suggestions)
- [4. Testing on Large Files](#4-testing-on-large-files)
- [5. Manual Database Edits Without Validation](#5-manual-database-edits-without-validation)
- [6. Committing .db Files to Git](#6-committing-db-files-to-git)
- [Validation Commands](#validation-commands)
- [Quick Health Check](#quick-health-check)
- [Detailed Diagnostics](#detailed-diagnostics)
- [Getting Help](#getting-help)
## API Authentication Errors
### GLM_API_KEY Not Set
**Symptom**:
```
❌ Error: GLM_API_KEY environment variable not set
Set it with: export GLM_API_KEY='your-key'
```
**Solution**:
```bash
# Check if key is set
echo $GLM_API_KEY
# If empty, export key
export GLM_API_KEY="your-api-key-here"
# Verify
uv run scripts/fix_transcription.py --validate
```
**Persistence**: Add to shell profile (`.bashrc` or `.zshrc`) for permanent access.
See `glm_api_setup.md` for detailed API key management.
### Invalid API Key
**Symptom**: API calls fail with 401/403 errors
**Solutions**:
1. Verify key is correct (copy from https://open.bigmodel.cn/)
2. Check for extra spaces or quotes in the key
3. Regenerate key if compromised
4. Verify API quota hasn't been exceeded
## Learning System Issues
### No Suggestions Generated
**Symptom**: Running `--review-learned` shows no suggestions after multiple corrections.
**Requirements**:
- Minimum 3 correction runs with consistent patterns
- Learning frequency threshold ≥3 (default)
- Learning confidence threshold ≥0.8 (default)
**Diagnostic steps**:
```bash
# Check correction history count
sqlite3 ~/.transcript-fixer/corrections.db "SELECT COUNT(*) FROM correction_history;"
# If 0, no corrections have been run yet
# If >0 but <3, run more corrections
# Check suggestions table
sqlite3 ~/.transcript-fixer/corrections.db "SELECT * FROM learned_suggestions;"
# Check system configuration
sqlite3 ~/.transcript-fixer/corrections.db "SELECT key, value FROM system_config WHERE key LIKE 'learning%';"
```
**Solutions**:
1. Run at least 3 correction sessions
2. Ensure patterns repeat (same error → same correction)
3. Verify database permissions (should be readable/writable)
4. Check `correction_history` table has entries
## Database Issues
### Database Not Found
**Symptom**:
```
⚠️ Database not found: ~/.transcript-fixer/corrections.db
```
**Solution**:
```bash
uv run scripts/fix_transcription.py --init
```
This creates the database with the complete schema.
### Database Locked
**Symptom**:
```
Error: database is locked
```
**Causes**:
- Another process is accessing the database
- Unfinished transaction from crashed process
- File permissions issue
**Solutions**:
```bash
# Check for processes using the database
lsof ~/.transcript-fixer/corrections.db
# If processes found, kill them or wait for completion
# If database is corrupted, backup and recreate
cp ~/.transcript-fixer/corrections.db ~/.transcript-fixer/corrections_backup.db
sqlite3 ~/.transcript-fixer/corrections.db "VACUUM;"
```
### Corrupted Database
**Symptom**: SQLite errors, integrity check failures
**Solutions**:
```bash
# Check integrity
sqlite3 ~/.transcript-fixer/corrections.db "PRAGMA integrity_check;"
# If corrupted, attempt recovery
sqlite3 ~/.transcript-fixer/corrections.db ".recover" | sqlite3 ~/.transcript-fixer/corrections_new.db
# Replace database with recovered version
mv ~/.transcript-fixer/corrections.db ~/.transcript-fixer/corrections_corrupted.db
mv ~/.transcript-fixer/corrections_new.db ~/.transcript-fixer/corrections.db
```
### Missing Tables
**Symptom**:
```
❌ Database missing tables: ['corrections', ...]
```
**Solution**: Reinitialize schema (safe, uses IF NOT EXISTS):
```bash
python -c "from core import CorrectionRepository; from pathlib import Path; CorrectionRepository(Path.home() / '.transcript-fixer' / 'corrections.db')"
```
Or delete database and reinitialize:
```bash
# Backup first
cp ~/.transcript-fixer/corrections.db ~/corrections_backup_$(date +%Y%m%d).db
# Reinitialize
uv run scripts/fix_transcription.py --init
```
## Common Pitfalls
### 1. Stage Order Confusion
**Problem**: Running Stage 2 without Stage 1 output.
**Solution**: Use `--stage 3` for full pipeline, or run stages sequentially:
```bash
# Wrong: Stage 2 on raw file
uv run scripts/fix_transcription.py --input file.md --stage 2 # ❌
# Correct: Full pipeline
uv run scripts/fix_transcription.py --input file.md --stage 3 # ✅
# Or sequential stages
uv run scripts/fix_transcription.py --input file.md --stage 1
uv run scripts/fix_transcription.py --input file_stage1.md --stage 2
```
### 2. Overwriting Imports
**Problem**: Using `--import` without `--merge` overwrites existing corrections.
**Solution**: Always use `--merge` flag:
```bash
# Wrong: Overwrites existing
uv run scripts/fix_transcription.py --import team.json # ❌
# Correct: Merges with existing
uv run scripts/fix_transcription.py --import team.json --merge # ✅
```
### 3. Ignoring Learned Suggestions
**Problem**: Not reviewing learned patterns, missing free optimizations.
**Impact**: Patterns detected by AI remain expensive (Stage 2) instead of cheap (Stage 1).
**Solution**: Review suggestions every 3-5 runs:
```bash
uv run scripts/fix_transcription.py --review-learned
uv run scripts/fix_transcription.py --approve "错误" "正确"
```
### 4. Testing on Large Files
**Problem**: Testing dictionary changes on large files wastes API quota.
**Solution**: Start with `--stage 1` on small files (100-500 lines):
```bash
# Test dictionary changes first
uv run scripts/fix_transcription.py --input small_sample.md --stage 1
# Review output, adjust corrections
# Then run full pipeline
uv run scripts/fix_transcription.py --input large_file.md --stage 3
```
### 5. Manual Database Edits Without Validation
**Problem**: Direct SQL edits might violate schema constraints.
**Solution**: Always validate after manual changes:
```bash
sqlite3 ~/.transcript-fixer/corrections.db
# ... make changes ...
.quit
# Validate
uv run scripts/fix_transcription.py --validate
```
### 6. Committing .db Files to Git
**Problem**: Binary database files in Git cause merge conflicts and bloat repository.
**Solution**: Use JSON exports for version control:
```bash
# .gitignore
*.db
*.db-journal
*.bak
# Export for version control instead
uv run scripts/fix_transcription.py --export corrections_$(date +%Y%m%d).json
git add corrections_*.json
```
## Validation Commands
### Quick Health Check
```bash
uv run scripts/fix_transcription.py --validate
```
### Detailed Diagnostics
```bash
# Check database integrity
sqlite3 ~/.transcript-fixer/corrections.db "PRAGMA integrity_check;"
# Check table counts
sqlite3 ~/.transcript-fixer/corrections.db "
SELECT 'corrections' as table_name, COUNT(*) as count FROM corrections
UNION ALL
SELECT 'context_rules', COUNT(*) FROM context_rules
UNION ALL
SELECT 'learned_suggestions', COUNT(*) FROM learned_suggestions
UNION ALL
SELECT 'correction_history', COUNT(*) FROM correction_history;
"
# Check configuration
sqlite3 ~/.transcript-fixer/corrections.db "SELECT * FROM system_config;"
```
## Getting Help
If issues persist:
1. Run `--validate` to collect diagnostic information
2. Check `correction_history` and `audit_log` tables for errors
3. Review `references/file_formats.md` for schema details
4. Check `references/architecture.md` for component details
5. Verify Python and uv versions are up to date
For database corruption, automatic backups are created before migrations. Check for `.bak` files in `~/.transcript-fixer/`.

View File

@@ -0,0 +1,483 @@
# Workflow Guide
Detailed step-by-step workflows for transcript correction and management.
## Table of Contents
- [Pre-Flight Checklist](#pre-flight-checklist)
- [Initial Setup](#initial-setup)
- [File Preparation](#file-preparation)
- [Execution Parameters](#execution-parameters)
- [Environment](#environment)
- [Core Workflows](#core-workflows)
- [1. First-Time Correction](#1-first-time-correction)
- [2. Iterative Improvement](#2-iterative-improvement)
- [3. Domain-Specific Corrections](#3-domain-specific-corrections)
- [4. Team Collaboration](#4-team-collaboration)
- [5. Stage-by-Stage Execution](#5-stage-by-stage-execution)
- [6. Context-Aware Rules](#6-context-aware-rules)
- [7. Diff Report Generation](#7-diff-report-generation)
- [Batch Processing](#batch-processing)
- [Process Multiple Files](#process-multiple-files)
- [Parallel Processing](#parallel-processing)
- [Maintenance Workflows](#maintenance-workflows)
- [Weekly: Review Learning](#weekly-review-learning)
- [Monthly: Export and Backup](#monthly-export-and-backup)
- [Quarterly: Clean Up](#quarterly-clean-up)
- [Next Steps](#next-steps)
## Pre-Flight Checklist
Before running corrections, verify these prerequisites:
### Initial Setup
- [ ] Initialized with `uv run scripts/fix_transcription.py --init`
- [ ] Database exists at `~/.transcript-fixer/corrections.db`
- [ ] `GLM_API_KEY` environment variable set (run `echo $GLM_API_KEY`)
- [ ] Configuration validated (run `--validate`)
### File Preparation
- [ ] Input file exists and is readable
- [ ] File uses supported format (`.md`, `.txt`)
- [ ] File encoding is UTF-8
- [ ] File size is reasonable (<10MB for first runs)
### Execution Parameters
- [ ] Using `--stage 3` for full pipeline (or specific stage if testing)
- [ ] Domain specified with `--domain` if using specialized dictionaries
- [ ] Using `--merge` flag when importing team corrections
### Environment
- [ ] Sufficient disk space for output files (~2x input size)
- [ ] API quota available for Stage 2 corrections
- [ ] Network connectivity for API calls
**Quick validation**:
```bash
uv run scripts/fix_transcription.py --validate && echo $GLM_API_KEY
```
## Core Workflows
### 1. First-Time Correction
**Goal**: Correct a transcript for the first time.
**Steps**:
1. **Initialize** (if not done):
```bash
uv run scripts/fix_transcription.py --init
export GLM_API_KEY="your-key"
```
2. **Add initial corrections** (5-10 common errors):
```bash
uv run scripts/fix_transcription.py --add "常见错误1" "正确词1" --domain general
uv run scripts/fix_transcription.py --add "常见错误2" "正确词2" --domain general
```
3. **Test on small sample** (Stage 1 only):
```bash
uv run scripts/fix_transcription.py --input sample.md --stage 1
less sample_stage1.md # Review output
```
4. **Run full pipeline**:
```bash
uv run scripts/fix_transcription.py --input transcript.md --stage 3 --domain general
```
5. **Review outputs**:
```bash
# Stage 1: Dictionary corrections
less transcript_stage1.md
# Stage 2: Final corrected version
less transcript_stage2.md
# Generate diff report
uv run scripts/diff_generator.py transcript.md transcript_stage1.md transcript_stage2.md
```
**Expected duration**:
- Stage 1: Instant (dictionary lookup)
- Stage 2: ~1-2 minutes per 1000 lines (API calls)
### 2. Iterative Improvement
**Goal**: Improve correction quality over time through learning.
**Steps**:
1. **Run corrections** on 3-5 similar transcripts:
```bash
uv run scripts/fix_transcription.py --input day1.md --stage 3 --domain embodied_ai
uv run scripts/fix_transcription.py --input day2.md --stage 3 --domain embodied_ai
uv run scripts/fix_transcription.py --input day3.md --stage 3 --domain embodied_ai
```
2. **Review learned suggestions**:
```bash
uv run scripts/fix_transcription.py --review-learned
```
**Output example**:
```
📚 Learned Suggestions (Pending Review)
========================================
1. "巨升方向" → "具身方向"
Frequency: 5 Confidence: 0.95
Examples: day1.md (line 45), day2.md (line 23), ...
2. "奇迹创坛" → "奇绩创坛"
Frequency: 3 Confidence: 0.87
Examples: day1.md (line 102), day3.md (line 67)
```
3. **Approve high-quality suggestions**:
```bash
uv run scripts/fix_transcription.py --approve "巨升方向" "具身方向"
uv run scripts/fix_transcription.py --approve "奇迹创坛" "奇绩创坛"
```
4. **Verify approved corrections**:
```bash
uv run scripts/fix_transcription.py --list --domain embodied_ai | grep "learned"
```
5. **Run next batch** (benefits from approved corrections):
```bash
uv run scripts/fix_transcription.py --input day4.md --stage 3 --domain embodied_ai
```
**Impact**: Approved corrections move to Stage 1 (instant, free).
**Cycle**: Repeat every 3-5 transcripts for continuous improvement.
### 3. Domain-Specific Corrections
**Goal**: Build specialized dictionaries for different fields.
**Steps**:
1. **Identify domain**:
- `embodied_ai` - Robotics, AI terminology
- `finance` - Financial terminology
- `medical` - Medical terminology
- `general` - General-purpose
2. **Add domain-specific terms**:
```bash
# Embodied AI domain
uv run scripts/fix_transcription.py --add "巨升智能" "具身智能" --domain embodied_ai
uv run scripts/fix_transcription.py --add "机器学习" "机器学习" --domain embodied_ai
# Finance domain
uv run scripts/fix_transcription.py --add "股价" "股价" --domain finance # Keep as-is
uv run scripts/fix_transcription.py --add "PE比率" "市盈率" --domain finance
```
3. **Use appropriate domain** when correcting:
```bash
# AI meeting transcript
uv run scripts/fix_transcription.py --input ai_meeting.md --stage 3 --domain embodied_ai
# Financial report transcript
uv run scripts/fix_transcription.py --input earnings_call.md --stage 3 --domain finance
```
4. **Review domain statistics**:
```bash
sqlite3 ~/.transcript-fixer/corrections.db "SELECT * FROM correction_statistics;"
```
**Benefits**:
- Prevents cross-domain conflicts
- Higher accuracy per domain
- Targeted vocabulary building
### 4. Team Collaboration
**Goal**: Share corrections across team members.
**Steps**:
#### Setup (One-time per team)
1. **Create shared repository**:
```bash
mkdir transcript-corrections
cd transcript-corrections
git init
# .gitignore
echo "*.db\n*.db-journal\n*.bak" > .gitignore
```
2. **Export initial corrections**:
```bash
uv run scripts/fix_transcription.py --export general.json --domain general
uv run scripts/fix_transcription.py --export embodied_ai.json --domain embodied_ai
git add *.json
git commit -m "Initial correction dictionaries"
git push origin main
```
#### Daily Workflow
**Team Member A** (adds new corrections):
```bash
# 1. Run corrections
uv run scripts/fix_transcription.py --input transcript.md --stage 3 --domain embodied_ai
# 2. Review and approve learned suggestions
uv run scripts/fix_transcription.py --review-learned
uv run scripts/fix_transcription.py --approve "新错误" "正确词"
# 3. Export updated corrections
uv run scripts/fix_transcription.py --export embodied_ai_$(date +%Y%m%d).json --domain embodied_ai
# 4. Commit and push
git add embodied_ai_*.json
git commit -m "Add embodied AI corrections from today's transcripts"
git push origin main
```
**Team Member B** (imports team corrections):
```bash
# 1. Pull latest corrections
git pull origin main
# 2. Import with merge
uv run scripts/fix_transcription.py --import embodied_ai_20250128.json --merge
# 3. Verify
uv run scripts/fix_transcription.py --list --domain embodied_ai | tail -10
```
**Conflict resolution**: See `team_collaboration.md` for handling merge conflicts.
### 5. Stage-by-Stage Execution
**Goal**: Test dictionary changes without wasting API quota.
#### Stage 1 Only (Dictionary)
**Use when**: Testing new corrections, verifying domain setup.
```bash
uv run scripts/fix_transcription.py --input file.md --stage 1 --domain general
```
**Output**: `file_stage1.md` with dictionary corrections only.
**Review**: Check if dictionary corrections are sufficient.
#### Stage 2 Only (AI)
**Use when**: Running AI corrections on pre-processed file.
**Prerequisites**: Stage 1 output exists.
```bash
# Stage 1 first
uv run scripts/fix_transcription.py --input file.md --stage 1
# Then Stage 2
uv run scripts/fix_transcription.py --input file_stage1.md --stage 2
```
**Output**: `file_stage1_stage2.md` (confusing naming - use Stage 3 instead).
#### Stage 3 (Full Pipeline)
**Use when**: Production runs, full correction workflow.
```bash
uv run scripts/fix_transcription.py --input file.md --stage 3 --domain general
```
**Output**: Both `file_stage1.md` and `file_stage2.md`.
**Recommended**: Use Stage 3 for most workflows.
### 6. Context-Aware Rules
**Goal**: Handle edge cases with regex patterns.
**Use cases**:
- Positional corrections (e.g., "的" vs "地")
- Multi-word patterns
- Conditional corrections
**Steps**:
1. **Identify pattern** that simple dictionary can't handle:
```
Problem: "近距离的去看" (wrong - should be "地")
Problem: "近距离搏杀" (correct - should keep "的")
```
2. **Add context rules**:
```bash
sqlite3 ~/.transcript-fixer/corrections.db
-- Higher priority for specific context
INSERT INTO context_rules (pattern, replacement, description, priority)
VALUES ('近距离的去看', '近距离地去看', '的→地 before verb', 10);
-- Lower priority for general pattern
INSERT INTO context_rules (pattern, replacement, description, priority)
VALUES ('近距离搏杀', '近距离搏杀', 'Keep 的 for noun modifier', 5);
.quit
```
3. **Test context rules**:
```bash
uv run scripts/fix_transcription.py --input test.md --stage 1
```
4. **Validate**:
```bash
uv run scripts/fix_transcription.py --validate
```
**Priority**: Higher numbers run first (use for exceptions/edge cases).
See `file_formats.md` for context_rules schema.
### 7. Diff Report Generation
**Goal**: Visualize all changes for review.
**Use when**:
- Reviewing corrections before publishing
- Training new team members
- Documenting ASR error patterns
**Steps**:
1. **Run corrections**:
```bash
uv run scripts/fix_transcription.py --input transcript.md --stage 3
```
2. **Generate diff reports**:
```bash
uv run scripts/diff_generator.py \
transcript.md \
transcript_stage1.md \
transcript_stage2.md
```
3. **Review outputs**:
```bash
# Markdown report (statistics + summary)
less diff_report.md
# Unified diff (git-style)
less transcript_unified.diff
# HTML side-by-side (visual review)
open transcript_sidebyside.html
# Inline markers (for editing)
less transcript_inline.md
```
**Report contents**:
- Total changes count
- Stage 1 vs Stage 2 breakdown
- Character/word count changes
- Side-by-side comparison
See `script_parameters.md` for advanced diff options.
## Batch Processing
### Process Multiple Files
```bash
# Simple loop
for file in meeting_*.md; do
uv run scripts/fix_transcription.py --input "$file" --stage 3 --domain embodied_ai
done
# With error handling
for file in meeting_*.md; do
echo "Processing $file..."
if uv run scripts/fix_transcription.py --input "$file" --stage 3 --domain embodied_ai; then
echo "✅ $file completed"
else
echo "❌ $file failed"
fi
done
```
### Parallel Processing
```bash
# GNU parallel (install: brew install parallel)
ls meeting_*.md | parallel -j 4 \
"uv run scripts/fix_transcription.py --input {} --stage 3 --domain embodied_ai"
```
**Caution**: Monitor API rate limits when processing in parallel.
## Maintenance Workflows
### Weekly: Review Learning
```bash
# Review suggestions
uv run scripts/fix_transcription.py --review-learned
# Approve high-confidence patterns
uv run scripts/fix_transcription.py --approve "错误1" "正确1"
uv run scripts/fix_transcription.py --approve "错误2" "正确2"
```
### Monthly: Export and Backup
```bash
# Export all domains
uv run scripts/fix_transcription.py --export general_$(date +%Y%m%d).json --domain general
uv run scripts/fix_transcription.py --export embodied_ai_$(date +%Y%m%d).json --domain embodied_ai
# Backup database
cp ~/.transcript-fixer/corrections.db ~/backups/corrections_$(date +%Y%m%d).db
# Database maintenance
sqlite3 ~/.transcript-fixer/corrections.db "VACUUM; REINDEX; ANALYZE;"
```
### Quarterly: Clean Up
```bash
# Archive old history (> 90 days)
sqlite3 ~/.transcript-fixer/corrections.db "
DELETE FROM correction_history
WHERE run_timestamp < datetime('now', '-90 days');
"
# Reject low-confidence suggestions
sqlite3 ~/.transcript-fixer/corrections.db "
UPDATE learned_suggestions
SET status = 'rejected'
WHERE confidence < 0.6 AND frequency < 3;
"
```
## Next Steps
- See `best_practices.md` for optimization tips
- See `troubleshooting.md` for error resolution
- See `file_formats.md` for database schema
- See `script_parameters.md` for advanced CLI options