# Architecture Reference

Technical implementation details of the transcript-fixer system.

## Table of Contents

- [Module Structure](#module-structure)
- [Design Principles](#design-principles)
  - [SOLID Compliance](#solid-compliance)
  - [File Length Limits](#file-length-limits)
- [Module Architecture](#module-architecture)
  - [Layer Diagram](#layer-diagram)
- [Data Flow](#data-flow)
  - [Correction Workflow](#correction-workflow)
  - [Learning Cycle](#learning-cycle)
- [SQLite Architecture (v2.0)](#sqlite-architecture-v20)
  - [Two-Layer Data Access](#two-layer-data-access-simplified)
  - [Database Schema](#database-schema-schemasql)
  - [ACID Guarantees](#acid-guarantees)
  - [Thread Safety](#thread-safety)
  - [Clean Architecture (No Legacy)](#clean-architecture-no-legacy)
- [Module Details](#module-details)
  - [fix_transcription.py](#fix_transcriptionpy-orchestrator)
  - [correction_repository.py](#correction_repositorypy-data-access-layer)
  - [correction_service.py](#correction_servicepy-business-logic-layer)
  - [CLI Integration](#cli-integration-commandspy)
  - [dictionary_processor.py](#dictionary_processorpy-stage-1)
  - [ai_processor.py](#ai_processorpy-stage-2)
  - [learning_engine.py](#learning_enginepy-pattern-detection)
  - [diff_generator.py](#diff_generatorpy-stage-3)
- [State Management](#state-management)
  - [Database-Backed State](#database-backed-state)
  - [Thread-Safe Access](#thread-safe-access)
  - [Soft Deletes](#soft-deletes)
- [Error Handling Strategy](#error-handling-strategy)
- [Testing Strategy](#testing-strategy)
- [Performance Considerations](#performance-considerations)
- [Security Architecture](#security-architecture)
- [Extensibility Points](#extensibility-points)
- [Dependencies](#dependencies)
- [Deployment](#deployment)
- [Further Reading](#further-reading)

## Module Structure

The codebase follows a modular package structure for maintainability:

```
scripts/
├── fix_transcription.py          # Main entry point (~70 lines)
├── core/                         # Business logic & data access
│   ├── correction_repository.py  # Data access layer (466 lines)
│   ├── correction_service.py     # Business logic layer (525 lines)
│   ├── schema.sql                # SQLite database schema (216 lines)
│   ├── dictionary_processor.py   # Stage 1 processor (140 lines)
│   ├── ai_processor.py           # Stage 2 processor (199 lines)
│   └── learning_engine.py        # Pattern detection (252 lines)
├── cli/                          # Command-line interface
│   ├── commands.py               # Command handlers (180 lines)
│   └── argument_parser.py        # Argument config (95 lines)
└── utils/                        # Utility functions
    ├── diff_generator.py         # Multi-format diffs (132 lines)
    ├── logging_config.py         # Logging configuration (130 lines)
    └── validation.py             # SQLite validation (105 lines)
```

**Benefits of modular structure**:

- Clear separation of concerns (business logic / CLI / utilities)
- Easy to locate and modify specific functionality
- Supports independent testing of modules
- Scales well as the codebase grows
- Follows Python package best practices

## Design Principles

### SOLID Compliance

Every module follows SOLID principles for maintainability:

1. **Single Responsibility Principle (SRP)**
   - Each module has exactly one reason to change
   - `CorrectionRepository`: database operations only
   - `CorrectionService`: business logic and validation only
   - `DictionaryProcessor`: text transformation only
   - `AIProcessor`: API communication only
   - `LearningEngine`: pattern analysis only

2. **Open/Closed Principle (OCP)**
   - Open for extension via SQL INSERT
   - Closed for modification (no code changes needed)
   - Add corrections via CLI or SQL without editing Python

3. **Liskov Substitution Principle (LSP)**
   - All processors implement the same interface
   - Implementations can be swapped without breaking the workflow

4. **Interface Segregation Principle (ISP)**
   - Repository, Service, Processor, and Engine are independent
   - No unnecessary dependencies

5. **Dependency Inversion Principle (DIP)**
   - Service depends on the Repository interface
   - CLI depends on the Service interface
   - Not tied to concrete implementations
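
The shared processor interface behind points 3 and 5 can be sketched as a structural type. The `Processor` protocol and the `UppercaseProcessor` stand-in below are illustrative only, not the actual classes:

```python
from typing import Protocol


class Processor(Protocol):
    """Common interface every correction stage implements (LSP)."""

    def process(self, text: str) -> "tuple[str, list]":
        ...


class UppercaseProcessor:
    # Trivial stand-in: any class with a matching process() satisfies the Protocol.
    def process(self, text: str) -> "tuple[str, list]":
        return text.upper(), [("upper", text)]


def run_pipeline(text: str, stages: "list[Processor]") -> str:
    # The orchestrator depends only on the interface, so stages are interchangeable.
    for stage in stages:
        text, _changes = stage.process(text)
    return text


print(run_pipeline("hello", [UppercaseProcessor()]))  # HELLO
```

Because the dependency points at the protocol rather than a concrete class, a new stage can be dropped into the pipeline without touching the orchestrator.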

### File Length Limits

File lengths are tracked against code quality limits (one file currently runs 2 lines over):

| File | Lines | Limit | Status |
|------|-------|-------|--------|
| `validation.py` | 105 | 200 | ✅ |
| `logging_config.py` | 130 | 200 | ✅ |
| `diff_generator.py` | 132 | 200 | ✅ |
| `dictionary_processor.py` | 140 | 200 | ✅ |
| `commands.py` | 180 | 200 | ✅ |
| `ai_processor.py` | 199 | 250 | ✅ |
| `schema.sql` | 216 | 250 | ✅ |
| `learning_engine.py` | 252 | 250 | ⚠️ (2 over) |
| `correction_repository.py` | 466 | 500 | ✅ |
| `correction_service.py` | 525 | 550 | ✅ |

## Module Architecture

### Layer Diagram

```
┌─────────────────────────────────────────┐
│  CLI Layer (fix_transcription.py)       │
│  - Argument parsing                     │
│  - Command routing                      │
│  - User interaction                     │
└───────────────┬─────────────────────────┘
                │
┌───────────────▼─────────────────────────┐
│  Business Logic Layer                   │
│                                         │
│  ┌──────────────────┐  ┌──────────────┐ │
│  │ Dictionary       │  │ AI           │ │
│  │ Processor        │  │ Processor    │ │
│  │ (Stage 1)        │  │ (Stage 2)    │ │
│  └──────────────────┘  └──────────────┘ │
│                                         │
│  ┌──────────────────┐  ┌──────────────┐ │
│  │ Learning         │  │ Diff         │ │
│  │ Engine           │  │ Generator    │ │
│  │ (Pattern detect) │  │ (Stage 3)    │ │
│  └──────────────────┘  └──────────────┘ │
└───────────────┬─────────────────────────┘
                │
┌───────────────▼─────────────────────────┐
│  Data Access Layer (SQLite-based)       │
│                                         │
│  ┌──────────────────────────────────┐   │
│  │ CorrectionManager (Facade)       │   │
│  │ - Backward-compatible API        │   │
│  └──────────────┬───────────────────┘   │
│                 │                       │
│  ┌──────────────▼───────────────────┐   │
│  │ CorrectionService                │   │
│  │ - Business logic                 │   │
│  │ - Validation                     │   │
│  │ - Import/Export                  │   │
│  └──────────────┬───────────────────┘   │
│                 │                       │
│  ┌──────────────▼───────────────────┐   │
│  │ CorrectionRepository             │   │
│  │ - ACID transactions              │   │
│  │ - Thread-safe connections        │   │
│  │ - Audit logging                  │   │
│  └──────────────────────────────────┘   │
└───────────────┬─────────────────────────┘
                │
┌───────────────▼─────────────────────────┐
│  Storage Layer                          │
│  ~/.transcript-fixer/corrections.db     │
│  - SQLite database (ACID compliant)     │
│  - 8 normalized tables + 3 views        │
│  - Comprehensive indexes                │
│  - Foreign key constraints              │
└─────────────────────────────────────────┘
```

## Data Flow

### Correction Workflow

```
1. User Input
      ↓
2. fix_transcription.py (Orchestrator)
      ↓
3. CorrectionService.get_corrections()
      ← Query from ~/.transcript-fixer/corrections.db
      ↓
4. DictionaryProcessor.process()
      - Apply context rules (regex)
      - Apply dictionary replacements
      - Track changes
      ↓
5. AIProcessor.process()
      - Split into chunks
      - Call GLM-4.6 API
      - Retry with fallback on error
      - Track AI changes
      ↓
6. CorrectionService.save_history()
      → Insert into correction_history table
      ↓
7. LearningEngine.analyze_and_suggest()
      - Query correction_history table
      - Detect patterns (frequency ≥3, confidence ≥80%)
      - Generate suggestions
      → Insert into learned_suggestions table
      ↓
8. Output Files
      - (unknown)_stage1.md
      - (unknown)_stage2.md
```

### Learning Cycle

```
Run 1: meeting1.md
  AI corrects: "巨升" → "具身"
      ↓
  INSERT INTO correction_history

Run 2: meeting2.md
  AI corrects: "巨升" → "具身"
      ↓
  INSERT INTO correction_history

Run 3: meeting3.md
  AI corrects: "巨升" → "具身"
      ↓
  INSERT INTO correction_history
      ↓
LearningEngine queries patterns:
  - SELECT ... GROUP BY from_text, to_text
  - Frequency: 3, Confidence: 100%
      ↓
INSERT INTO learned_suggestions (status='pending')
      ↓
User reviews: --review-learned
      ↓
User approves: --approve "巨升" "具身"
      ↓
INSERT INTO corrections (source='learned')
UPDATE learned_suggestions (status='approved')
      ↓
Future runs query corrections table (Stage 1 - faster!)
```

## SQLite Architecture (v2.0)

### Two-Layer Data Access (Simplified)

**Design principle**: no existing users means no backward-compatibility overhead.

The system uses a clean two-layer architecture:

```
┌──────────────────────────────────────────┐
│  CLI Commands (commands.py)              │
│  - User interaction                      │
│  - Command routing                       │
└──────────────┬───────────────────────────┘
               │
┌──────────────▼───────────────────────────┐
│  CorrectionService (Business Logic)      │
│  - Input validation & sanitization       │
│  - Business rules enforcement            │
│  - Import/export orchestration           │
│  - Statistics calculation                │
│  - History tracking                      │
└──────────────┬───────────────────────────┘
               │
┌──────────────▼───────────────────────────┐
│  CorrectionRepository (Data Access)      │
│  - ACID transactions                     │
│  - Thread-safe connections               │
│  - SQL query execution                   │
│  - Audit logging                         │
└──────────────┬───────────────────────────┘
               │
┌──────────────▼───────────────────────────┐
│  SQLite Database (corrections.db)        │
│  - 8 normalized tables                   │
│  - Foreign key constraints               │
│  - Comprehensive indexes                 │
│  - 3 views for common queries            │
└──────────────────────────────────────────┘
```

### Database Schema (schema.sql)

**Core tables**:

1. **corrections** (main correction storage)
   - Primary key: id
   - Unique constraint: (from_text, domain)
   - Indexes: domain, source, added_at, is_active, from_text
   - Fields: confidence (0.0-1.0), usage_count, notes

2. **context_rules** (regex-based rules)
   - Pattern + replacement with priority ordering
   - Indexes: priority (DESC), is_active

3. **correction_history** (audit trail for runs)
   - Tracks: filename, domain, timestamps, change counts
   - Links to correction_changes via foreign key
   - Indexes: run_timestamp, domain, success

4. **correction_changes** (detailed change log)
   - Links to history via foreign key (CASCADE delete)
   - Stores: line_number, from/to text, rule_type, context
   - Indexes: history_id, rule_type

5. **learned_suggestions** (AI-detected patterns)
   - Status: pending → approved/rejected
   - Unique constraint: (from_text, to_text, domain)
   - Fields: frequency, confidence, timestamps
   - Indexes: status, domain, confidence, frequency

6. **suggestion_examples** (occurrences of patterns)
   - Links to learned_suggestions via foreign key
   - Stores the context where each pattern occurred

7. **system_config** (configuration storage)
   - Key-value store with type safety
   - Stores: API settings, thresholds, defaults

8. **audit_log** (comprehensive audit trail)
   - Tracks all database operations
   - Fields: action, entity_type, entity_id, user, success
   - Indexes: timestamp, action, entity_type, success

**Views** (for common queries):

- `active_corrections`: active corrections only
- `pending_suggestions`: suggestions pending review
- `correction_statistics`: statistics per domain
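
The shape described above can be sketched as DDL. This is an illustrative subset only: the real `schema.sql` has more fields, indexes, and views, and column names beyond those listed above are assumptions:

```python
import sqlite3

# Illustrative subset of schema.sql: corrections table, one index, one view.
DDL = """
CREATE TABLE corrections (
    id          INTEGER PRIMARY KEY,
    from_text   TEXT NOT NULL,
    to_text     TEXT NOT NULL,
    domain      TEXT NOT NULL DEFAULT 'general',
    source      TEXT NOT NULL DEFAULT 'manual',
    confidence  REAL NOT NULL DEFAULT 1.0
                CHECK (confidence >= 0.0 AND confidence <= 1.0),
    usage_count INTEGER NOT NULL DEFAULT 0 CHECK (usage_count >= 0),
    is_active   INTEGER NOT NULL DEFAULT 1,
    UNIQUE (from_text, domain)
);
CREATE INDEX idx_corrections_domain ON corrections(domain);

CREATE VIEW active_corrections AS
    SELECT * FROM corrections WHERE is_active = 1;
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
conn.execute("INSERT INTO corrections (from_text, to_text) VALUES ('巨升', '具身')")
rows = conn.execute("SELECT from_text, to_text FROM active_corrections").fetchall()
print(rows)  # [('巨升', '具身')]
```

The CHECK and UNIQUE constraints here correspond to the consistency guarantees in the next section.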

### ACID Guarantees

**Atomicity**: all-or-nothing transactions

```python
with self._transaction() as conn:
    conn.execute("INSERT ...")  # Either all succeed
    conn.execute("UPDATE ...")  # or all roll back
```

**Consistency**: constraints enforced

- Foreign key constraints
- Check constraints (confidence 0.0-1.0, usage_count ≥ 0)
- Unique constraints

**Isolation**: serializable transactions

```python
conn.execute("BEGIN IMMEDIATE")  # Acquire write lock
```

**Durability**: changes persisted to disk

- SQLite guarantees persistence after commit
- A backup is created before migrations

### Thread Safety

**Thread-local connections**:

```python
def _get_connection(self):
    if not hasattr(self._local, 'connection'):
        self._local.connection = sqlite3.connect(...)
    return self._local.connection
```

**Connection pooling**:

- One connection per thread
- Automatic cleanup on close
- Foreign keys enabled per connection
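
A runnable sketch of the thread-local pattern above (the `ConnectionPool` class here is illustrative, not the real repository):

```python
import sqlite3
import threading


class ConnectionPool:
    """Illustrative sketch of the thread-local connection pattern."""

    def __init__(self, db_path: str) -> None:
        self._db_path = db_path
        self._local = threading.local()

    def get_connection(self) -> sqlite3.Connection:
        # Each thread lazily creates, then reuses, its own connection.
        if not hasattr(self._local, "connection"):
            conn = sqlite3.connect(self._db_path)
            conn.execute("PRAGMA foreign_keys = ON")  # enabled per connection
            self._local.connection = conn
        return self._local.connection


pool = ConnectionPool(":memory:")
conns, lock = [], threading.Lock()


def worker() -> None:
    conn = pool.get_connection()
    assert conn is pool.get_connection()  # same thread -> same connection
    with lock:
        conns.append(conn)  # keep connections alive so identities stay distinct


threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len({id(c) for c in conns}))  # 3 -> one connection per thread
```

This avoids sharing a single `sqlite3.Connection` across threads, which the sqlite3 module disallows by default.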

### Clean Architecture (No Legacy)

**Design philosophy**:

- Clean two-layer architecture (Service → Repository)
- No backward-compatibility overhead
- Direct API design without legacy constraints
- YAGNI principle: build for current needs, not hypothetical migrations

## Module Details

### fix_transcription.py (Orchestrator)

**Responsibilities**:

- Parse CLI arguments
- Route commands to appropriate handlers
- Coordinate the workflow between modules
- Display user feedback

**Key functions**:

```python
cmd_init()              # Initialize ~/.transcript-fixer/
cmd_add_correction()    # Add single correction
cmd_list_corrections()  # List corrections
cmd_run_correction()    # Execute correction workflow
cmd_review_learned()    # Review AI suggestions
cmd_approve()           # Approve learned correction
```

**Design pattern**: Command pattern with function routing
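
Function routing of this kind is typically a dispatch table mapping command names to handlers. The table below is a sketch; the handler bodies and argument shapes are assumptions, only the handler names mirror the functions above:

```python
# Stub handlers standing in for the real cmd_* functions.
def cmd_init(args: dict) -> str:
    return "initialized"


def cmd_add_correction(args: dict) -> str:
    return f"added {args['from']} -> {args['to']}"


# Dispatch table: command name -> handler function.
COMMANDS = {
    "init": cmd_init,
    "add": cmd_add_correction,
}


def dispatch(command: str, args: dict) -> str:
    handler = COMMANDS.get(command)
    if handler is None:
        raise SystemExit(f"Unknown command: {command}")
    return handler(args)


print(dispatch("add", {"from": "巨升", "to": "具身"}))  # added 巨升 -> 具身
```

Adding a new command then means adding one handler and one table entry, with no changes to the routing logic.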

### correction_repository.py (Data Access Layer)

**Responsibilities**:

- Execute SQL queries with ACID guarantees
- Manage thread-safe database connections
- Handle transactions (commit/rollback)
- Perform audit logging
- Convert between database rows and Python objects

**Key methods**:

```python
add_correction()            # INSERT with UNIQUE handling
get_correction()            # SELECT single correction
get_all_corrections()       # SELECT with filters
get_corrections_dict()      # For backward compatibility
update_correction()         # UPDATE with transaction
delete_correction()         # Soft delete (is_active=0)
increment_usage()           # Track usage statistics
bulk_import_corrections()   # Batch INSERT with conflict resolution
```

**Transaction management**:

```python
@contextmanager
def _transaction(self):
    conn = self._get_connection()
    try:
        conn.execute("BEGIN IMMEDIATE")
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
```

### correction_service.py (Business Logic Layer)

**Responsibilities**:

- Input validation and sanitization
- Business rule enforcement
- Orchestrate repository operations
- Import/export with conflict detection
- Statistics calculation

**Key methods**:

```python
# Validation
validate_correction_text()  # Check length, control chars, NULL bytes
validate_domain_name()      # Prevent path traversal, injection
validate_confidence()       # Range check (0.0-1.0)
validate_source()           # Enum validation

# Operations
add_correction()            # Validate + repository.add
get_corrections()           # Get corrections for domain
remove_correction()         # Validate + repository.delete

# Import/Export
import_corrections()        # Pre-validate + bulk import + conflict detection
export_corrections()        # Query + format as JSON

# Analytics
get_statistics()            # Calculate metrics per domain
```

**Validation rules**:

```python
@dataclass
class ValidationRules:
    max_text_length: int = 1000
    min_text_length: int = 1
    max_domain_length: int = 50
    allowed_domain_pattern: str = r'^[a-zA-Z0-9_-]+$'
```
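
A sketch of how `validate_domain_name()` can be built from these rules. The error messages and the use of `ValueError` are assumptions; the real service may raise a custom exception type:

```python
import re
from dataclasses import dataclass


@dataclass
class ValidationRules:
    max_domain_length: int = 50
    allowed_domain_pattern: str = r'^[a-zA-Z0-9_-]+$'


def validate_domain_name(domain: str, rules: ValidationRules = ValidationRules()) -> str:
    if not domain or len(domain) > rules.max_domain_length:
        raise ValueError(f"Domain length must be 1-{rules.max_domain_length} characters")
    if not re.match(rules.allowed_domain_pattern, domain):
        # Rejects path traversal ("../etc") and injection attempts ("x; DROP TABLE")
        raise ValueError(f"Invalid domain name: {domain!r}")
    return domain


print(validate_domain_name("embodied_ai"))  # embodied_ai
```

Whitelisting `[a-zA-Z0-9_-]` is stricter than blacklisting dangerous characters, which is why one pattern covers both the traversal and injection cases.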

### CLI Integration (commands.py)

**Direct service usage**:

```python
def _get_service():
    """Get configured CorrectionService instance."""
    config_dir = Path.home() / ".transcript-fixer"
    db_path = config_dir / "corrections.db"
    repository = CorrectionRepository(db_path)
    return CorrectionService(repository)

def cmd_add_correction(args):
    service = _get_service()
    service.add_correction(args.from_text, args.to_text, args.domain)
```

**Benefits of direct integration**:

- No unnecessary abstraction layers
- Clear data flow: CLI → Service → Repository
- Easy to understand and debug
- Performance: one less function call per operation

### dictionary_processor.py (Stage 1)

**Responsibilities**:

- Apply context-aware regex rules
- Apply simple dictionary replacements
- Track all changes with line numbers

**Processing order**:

1. Context rules first (higher priority)
2. Dictionary replacements second

**Key methods**:

```python
process(text) -> (corrected_text, changes)
_apply_context_rules()
_apply_dictionary()
get_summary(changes)
```

**Change tracking**:

```python
@dataclass
class Change:
    line_number: int
    from_text: str
    to_text: str
    rule_type: str  # "dictionary" or "context_rule"
    rule_name: str
```
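
A minimal, runnable sketch of the Stage 1 replacement loop with line-number tracking. It covers only the dictionary pass; the real processor applies regex context rules first:

```python
from dataclasses import dataclass


@dataclass
class Change:
    line_number: int
    from_text: str
    to_text: str
    rule_type: str
    rule_name: str


def apply_dictionary(text: str, corrections: "dict[str, str]") -> "tuple[str, list[Change]]":
    changes: "list[Change]" = []
    lines = text.splitlines()
    for i, line in enumerate(lines, start=1):
        for wrong, right in corrections.items():
            if wrong in line:
                line = line.replace(wrong, right)
                changes.append(Change(i, wrong, right, "dictionary", wrong))
        lines[i - 1] = line  # write the corrected line back
    return "\n".join(lines), changes


fixed, changes = apply_dictionary("这是巨升智能的会议记录", {"巨升": "具身"})
print(fixed)         # 这是具身智能的会议记录
print(len(changes))  # 1
```

Recording one `Change` per (line, rule) pair is what later lets the learning engine and diff generator reconstruct exactly what Stage 1 did.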

### ai_processor.py (Stage 2)

**Responsibilities**:

- Split text into API-friendly chunks
- Call GLM-4.6 API
- Handle retries with a fallback model
- Track AI-suggested changes

**Key methods**:

```python
process(text, context) -> (corrected_text, changes)
_split_into_chunks()  # Respect paragraph boundaries
_process_chunk()      # Single API call
_build_prompt()       # Construct correction prompt
```

**Chunking strategy**:

- Max 6000 characters per chunk
- Split on paragraph boundaries (`\n\n`)
- If a paragraph is too long, split on sentences
- Preserve context across chunks
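
The paragraph-boundary rule can be sketched as a greedy accumulator. This is illustrative: the real `_split_into_chunks()` also falls back to sentence splits for oversized paragraphs, which this sketch omits:

```python
def split_into_chunks(text: str, max_chars: int = 6000) -> "list[str]":
    chunks: "list[str]" = []
    current = ""
    for paragraph in text.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate  # paragraph still fits in the current chunk
        else:
            if current:
                chunks.append(current)
            # Assumes the paragraph itself fits; the real code splits it further.
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


chunks = split_into_chunks("a" * 40 + "\n\n" + "b" * 40, max_chars=50)
print(len(chunks))  # 2
```

Splitting only at `\n\n` keeps each API call aligned with natural paragraph units, which helps the model keep its corrections locally consistent.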

**Error handling**:

- Retry with the fallback model (GLM-4.5-Air)
- If both fail, use the original text
- Never lose the user's data

### learning_engine.py (Pattern Detection)

**Responsibilities**:

- Analyze correction history
- Detect recurring patterns
- Calculate confidence scores
- Generate suggestions for review
- Track rejected suggestions

**Algorithm**:

```python
1. Query correction_history table
2. Extract stage2 (AI) changes
3. Group by pattern (from→to)
4. Count frequency
5. Calculate confidence
6. Filter by thresholds:
   - frequency ≥ 3
   - confidence ≥ 0.8
7. Insert into learned_suggestions table (status='pending')
```
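
Steps 3, 4, and 6 reduce to a single GROUP BY query. The sketch below runs it against a simplified change table; the real column names and the join to `correction_history` may differ:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE correction_changes (from_text TEXT, to_text TEXT)")
conn.executemany(
    "INSERT INTO correction_changes VALUES (?, ?)",
    [("巨升", "具身")] * 3 + [("只能", "智能")],  # one pattern recurs, one does not
)

rows = conn.execute(
    """
    SELECT from_text, to_text, COUNT(*) AS frequency
    FROM correction_changes
    GROUP BY from_text, to_text
    HAVING COUNT(*) >= 3        -- frequency threshold
    """
).fetchall()
print(rows)  # [('巨升', '具身', 3)]
```

Only the recurring pattern survives the HAVING clause, which is exactly the frequency filter in step 6.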

**Confidence calculation**:

```python
confidence = (
    0.5 * frequency_score +    # More occurrences = higher
    0.3 * consistency_score +  # Always the same correction
    0.2 * recency_score        # Recent = higher
)
```

**Key methods**:

```python
analyze_and_suggest()  # Main analysis pipeline
approve_suggestion()   # Insert into corrections table (source='learned')
reject_suggestion()    # Mark suggestion rejected
list_pending()         # Get all pending suggestions
```

### diff_generator.py (Stage 3)

**Responsibilities**:

- Generate comparison reports
- Support multiple output formats
- Perform word-level diff analysis

**Output formats**:

1. Markdown summary (statistics + change list)
2. Unified diff (standard diff format)
3. HTML side-by-side (visual comparison)
4. Inline marked (`[-old-] [+new+]`)

**Not modified**: the original 338-line file was kept as-is (working well)
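
The inline-marked format (format 4) can be sketched with `difflib.SequenceMatcher`. This sketch is character-level for brevity, whereas the real generator does word-level analysis:

```python
import difflib


def inline_diff(old: str, new: str) -> str:
    out = []
    matcher = difflib.SequenceMatcher(None, old, new)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            out.append(old[i1:i2])
        else:
            # A "replace" emits both a deletion marker and an insertion marker.
            if op in ("replace", "delete"):
                out.append(f"[-{old[i1:i2]}-]")
            if op in ("replace", "insert"):
                out.append(f"[+{new[j1:j2]}+]")
    return "".join(out)


print(inline_diff("fix the trnscript", "fix the transcript"))
```

Opcode handling is the whole trick: `equal` spans pass through unchanged, while `replace` contributes to both the `[-...-]` and `[+...+]` markers.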

## State Management

### Database-Backed State

- All state is stored in `~/.transcript-fixer/corrections.db`
- SQLite handles caching and transactions
- ACID guarantees prevent corruption
- A backup is created before migrations

### Thread-Safe Access

- Thread-local connections (one per thread)
- BEGIN IMMEDIATE for write transactions
- No global state or shared mutable data
- Each operation is independent (stateless modules)

### Soft Deletes

- Records are marked inactive (is_active=0) instead of DELETEd
- Preserves the audit trail
- Can be reactivated if needed
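
The soft-delete idea in one runnable snippet (table shape simplified from `schema.sql`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE corrections (from_text TEXT, to_text TEXT, is_active INTEGER DEFAULT 1)"
)
conn.execute("INSERT INTO corrections (from_text, to_text) VALUES ('巨升', '具身')")

# "Delete": flip the flag instead of issuing DELETE.
conn.execute("UPDATE corrections SET is_active = 0 WHERE from_text = '巨升'")

active = conn.execute("SELECT COUNT(*) FROM corrections WHERE is_active = 1").fetchone()[0]
total = conn.execute("SELECT COUNT(*) FROM corrections").fetchone()[0]
print(active, total)  # 0 1
```

The row disappears from active queries (and from the `active_corrections` view) but stays in the table, so history and audit queries still see it, and a later `UPDATE ... SET is_active = 1` reactivates it.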

## Error Handling Strategy

### Fail Fast for User Errors

```python
if not skill_path.exists():
    print(f"❌ Error: Skill directory not found: {skill_path}")
    sys.exit(1)
```

### Retry for Transient Errors

```python
try:
    api_call(model_primary)
except Exception:
    try:
        api_call(model_fallback)
    except Exception:
        use_original_text()
```

### Backup Before Destructive Operations

```python
if target_file.exists():
    shutil.copy2(target_file, backup_file)
# Then overwrite target_file
```

## Testing Strategy

### Unit Testing (Recommended)

```python
# Test dictionary processor
def test_dictionary_processor():
    corrections = {"错误": "正确"}
    processor = DictionaryProcessor(corrections, [])
    text = "这是错误的文本"
    result, changes = processor.process(text)
    assert result == "这是正确的文本"
    assert len(changes) == 1

# Test learning engine thresholds
def test_learning_thresholds():
    engine = LearningEngine(history_dir, learned_dir)
    # Create mock history with a pattern appearing 3+ times
    suggestions = engine.analyze_and_suggest()
    assert len(suggestions) > 0
```

### Integration Testing

```bash
# End-to-end test
python fix_transcription.py --init
python fix_transcription.py --add "test" "TEST"
python fix_transcription.py --input test.md --stage 3
# Verify output files exist
```

## Performance Considerations

### Bottlenecks

1. **AI API calls**: slowest part (60s timeout per chunk)
2. **Database I/O**: negligible (indexed SQLite queries)
3. **Pattern matching**: fast (regex + dict lookups)

### Optimization Strategies

1. **Stage 1 first**: test dictionary corrections before expensive AI calls
2. **Chunking**: process large files in parallel chunks (future enhancement)
3. **Caching**: could cache API results by content hash (future enhancement)
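
The content-hash caching idea (strategy 3) can be sketched in a few lines. This is in-memory only and the function names are hypothetical; a persistent version would likely store the hash-to-result mapping in a SQLite table:

```python
import hashlib

_cache: dict = {}


def corrected_with_cache(chunk: str, call_api) -> str:
    # Key on the chunk's content, so identical chunks reuse the prior AI result.
    key = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = call_api(chunk)
    return _cache[key]


calls = []


def fake_api(chunk: str) -> str:
    calls.append(chunk)  # count real API hits
    return chunk.replace("巨升", "具身")


print(corrected_with_cache("巨升智能", fake_api))  # 具身智能
print(corrected_with_cache("巨升智能", fake_api))  # 具身智能 (cache hit)
print(len(calls))  # 1
```

For repeated runs over similar transcripts this turns the slowest bottleneck (the API call) into a dictionary lookup whenever a chunk recurs verbatim.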

### Scalability

**Current capabilities (v2.0 with SQLite)**:

- File size: unlimited (chunking handles large files)
- Corrections: tested up to 100,000 entries (with indexes)
- History: unlimited (handled efficiently by the database)
- Concurrent access: thread-safe with ACID guarantees
- Query performance: O(log n) with B-tree indexes

**Performance improvements from SQLite**:

- Indexed queries (domain, source, added_at)
- Views for common aggregations
- Batch imports with transactions
- Soft deletes (no data loss)

**Future improvements**:

- Parallel chunk processing for AI calls
- API response caching
- Full-text search for corrections

## Security Architecture

### Secret Management

- API keys via environment variables only
- Never hardcode credentials
- A security scanner enforces this

### Backup Security

- `.bak` files get the same permissions as the originals
- No encryption (user's responsibility)
- Recommendation: use encrypted filesystems

### Git Security

- `.gitignore` covers `.bak` files
- Private repos recommended
- Security scan before commits

## Extensibility Points

### Adding New Processors

1. Create a new processor class
2. Implement the `process(text) -> (result, changes)` interface
3. Add it to the orchestrator workflow

Example:

```python
class SpellCheckProcessor:
    def process(self, text):
        # Custom spell-checking logic
        return corrected_text, changes
```

### Adding New Learning Algorithms

1. Subclass `LearningEngine`
2. Override `_calculate_confidence()`
3. Adjust thresholds as needed

### Adding New Export Formats

1. Add a method to `CorrectionService`
2. Support the new file format
3. Add a CLI command

## Dependencies

### Required

- Python 3.8+ (`from __future__ import annotations`)
- `httpx` (for API calls)

### Optional

- `diff` command (for unified diffs)
- Git (for version control)

### Development

- `pytest` (for testing)
- `black` (for formatting)
- `mypy` (for type checking)

## Deployment

### User Installation

```bash
# 1. Clone or download the skill to your workspace
git clone <repo> transcript-fixer
cd transcript-fixer

# 2. Install dependencies
pip install -r requirements.txt

# 3. Initialize
python scripts/fix_transcription.py --init

# 4. Set API key
export GLM_API_KEY="KEY_VALUE"

# Ready to use!
```

### CI/CD Pipeline (Future)

```yaml
# Potential GitHub Actions workflow
test:
  - Install dependencies
  - Run unit tests
  - Run integration tests
  - Check code style (black, mypy)

security:
  - Run security_scan.py
  - Check for secrets

deploy:
  - Package skill
  - Upload to skill marketplace
```

## Further Reading

- SOLID principles: https://en.wikipedia.org/wiki/SOLID
- API patterns: `references/glm_api_setup.md`
- File formats: `references/file_formats.md`
- Testing: https://docs.pytest.org/