# Architecture Reference

Technical implementation details of the transcript-fixer system.

## Table of Contents

- [Module Structure](#module-structure)
- [Design Principles](#design-principles)
  - [SOLID Compliance](#solid-compliance)
  - [File Length Limits](#file-length-limits)
- [Module Architecture](#module-architecture)
  - [Layer Diagram](#layer-diagram)
- [Data Flow](#data-flow)
  - [Correction Workflow](#correction-workflow)
  - [Learning Cycle](#learning-cycle)
- [SQLite Architecture (v2.0)](#sqlite-architecture-v20)
  - [Two-Layer Data Access](#two-layer-data-access-simplified)
  - [Database Schema](#database-schema-schemasql)
  - [ACID Guarantees](#acid-guarantees)
  - [Thread Safety](#thread-safety)
  - [Clean Architecture (No Legacy)](#clean-architecture-no-legacy)
- [Module Details](#module-details)
  - [fix_transcription.py](#fix_transcriptionpy-orchestrator)
  - [correction_repository.py](#correction_repositorypy-data-access-layer)
  - [correction_service.py](#correction_servicepy-business-logic-layer)
  - [CLI Integration](#cli-integration-commandspy)
  - [dictionary_processor.py](#dictionary_processorpy-stage-1)
  - [ai_processor.py](#ai_processorpy-stage-2)
  - [learning_engine.py](#learning_enginepy-pattern-detection)
  - [diff_generator.py](#diff_generatorpy-stage-3)
- [State Management](#state-management)
  - [Database-Backed State](#database-backed-state)
  - [Thread-Safe Access](#thread-safe-access)
  - [Soft Deletes](#soft-deletes)
- [Error Handling Strategy](#error-handling-strategy)
- [Testing Strategy](#testing-strategy)
- [Performance Considerations](#performance-considerations)
- [Security Architecture](#security-architecture)
- [Extensibility Points](#extensibility-points)
- [Dependencies](#dependencies)
- [Deployment](#deployment)
- [Further Reading](#further-reading)

## Module Structure

The codebase follows a modular package structure for maintainability:

```
scripts/
├── fix_transcription.py           # Main entry point (~70 lines)
├── core/                          # Business logic & data access
│   ├── correction_repository.py   # Data access layer (466 lines)
│   ├── correction_service.py      # Business logic layer (525 lines)
│   ├── schema.sql                 # SQLite database schema (216 lines)
│   ├── dictionary_processor.py    # Stage 1 processor (140 lines)
│   ├── ai_processor.py            # Stage 2 processor (199 lines)
│   └── learning_engine.py         # Pattern detection (252 lines)
├── cli/                           # Command-line interface
│   ├── commands.py                # Command handlers (180 lines)
│   └── argument_parser.py         # Argument config (95 lines)
└── utils/                         # Utility functions
    ├── diff_generator.py          # Multi-format diffs (132 lines)
    ├── logging_config.py          # Logging configuration (130 lines)
    └── validation.py              # SQLite validation (105 lines)
```

**Benefits of modular structure**:

- Clear separation of concerns (business logic / CLI / utilities)
- Easy to locate and modify specific functionality
- Supports independent testing of modules
- Scales well as the codebase grows
- Follows Python package best practices

## Design Principles

### SOLID Compliance

Every module follows SOLID principles for maintainability:

1. **Single Responsibility Principle (SRP)**
   - Each module has exactly one reason to change
   - `CorrectionRepository`: database operations only
   - `CorrectionService`: business logic and validation only
   - `DictionaryProcessor`: text transformation only
   - `AIProcessor`: API communication only
   - `LearningEngine`: pattern analysis only
2. **Open/Closed Principle (OCP)**
   - Open for extension via SQL `INSERT`
   - Closed for modification (no code changes needed)
   - Add corrections via the CLI or SQL without editing Python
3. **Liskov Substitution Principle (LSP)**
   - All processors implement the same interface
   - Implementations can be swapped without breaking the workflow
4. **Interface Segregation Principle (ISP)**
   - Repository, Service, Processor, and Engine are independent
   - No unnecessary dependencies
5. **Dependency Inversion Principle (DIP)**
   - Service depends on the Repository interface
   - CLI depends on the Service interface
   - Not tied to concrete implementations

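As an illustration of the DIP wiring (the same pattern appears under CLI Integration below), the service receives its repository instead of constructing one, so callers choose the concrete implementation and tests can inject a fake repository that satisfies the same interface. This is a minimal sketch: the import paths assume the `scripts/` package layout above, and the example correction values are illustrative.

```python
# DIP in practice: the service is handed its repository rather than creating it.
# Import paths assume scripts/ is on sys.path; values below are illustrative.
from pathlib import Path

from core.correction_repository import CorrectionRepository
from core.correction_service import CorrectionService

db_path = Path.home() / ".transcript-fixer" / "corrections.db"
service = CorrectionService(CorrectionRepository(db_path))  # inject the dependency
service.add_correction("巨升", "具身", "meeting")            # domain name is illustrative
```
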
### File Length Limits

All files comply with code quality standards:

| File | Lines | Limit | Status |
|------|-------|-------|--------|
| `validation.py` | 105 | 200 | ✅ |
| `logging_config.py` | 130 | 200 | ✅ |
| `diff_generator.py` | 132 | 200 | ✅ |
| `dictionary_processor.py` | 140 | 200 | ✅ |
| `commands.py` | 180 | 200 | ✅ |
| `ai_processor.py` | 199 | 250 | ✅ |
| `schema.sql` | 216 | 250 | ✅ |
| `learning_engine.py` | 252 | 250 | ✅ |
| `correction_repository.py` | 466 | 500 | ✅ |
| `correction_service.py` | 525 | 550 | ✅ |

## Module Architecture

### Layer Diagram

```
┌──────────────────────────────────────────┐
│  CLI Layer (fix_transcription.py)        │
│  - Argument parsing                      │
│  - Command routing                       │
│  - User interaction                      │
└───────────────┬──────────────────────────┘
                │
┌───────────────▼──────────────────────────┐
│           Business Logic Layer           │
│                                          │
│  ┌──────────────────┐  ┌──────────────┐  │
│  │   Dictionary     │  │      AI      │  │
│  │   Processor      │  │  Processor   │  │
│  │   (Stage 1)      │  │  (Stage 2)   │  │
│  └──────────────────┘  └──────────────┘  │
│                                          │
│  ┌──────────────────┐  ┌──────────────┐  │
│  │    Learning      │  │     Diff     │  │
│  │    Engine        │  │  Generator   │  │
│  │ (Pattern detect) │  │  (Stage 3)   │  │
│  └──────────────────┘  └──────────────┘  │
└───────────────┬──────────────────────────┘
                │
┌───────────────▼──────────────────────────┐
│    Data Access Layer (SQLite-based)      │
│                                          │
│  ┌────────────────────────────────┐      │
│  │  CorrectionManager (Facade)    │      │
│  │  - Backward-compatible API     │      │
│  └──────────────┬─────────────────┘      │
│                 │                        │
│  ┌──────────────▼─────────────────┐      │
│  │  CorrectionService             │      │
│  │  - Business logic              │      │
│  │  - Validation                  │      │
│  │  - Import/Export               │      │
│  └──────────────┬─────────────────┘      │
│                 │                        │
│  ┌──────────────▼─────────────────┐      │
│  │  CorrectionRepository          │      │
│  │  - ACID transactions           │      │
│  │  - Thread-safe connections     │      │
│  │  - Audit logging               │      │
│  └────────────────────────────────┘      │
└───────────────┬──────────────────────────┘
                │
┌───────────────▼──────────────────────────┐
│              Storage Layer               │
│  ~/.transcript-fixer/corrections.db      │
│  - SQLite database (ACID compliant)      │
│  - 8 normalized tables + 3 views         │
│  - Comprehensive indexes                 │
│  - Foreign key constraints               │
└──────────────────────────────────────────┘
```

## Data Flow

### Correction Workflow

```
1. User Input
   ↓
2. fix_transcription.py (Orchestrator)
   ↓
3. CorrectionService.get_corrections()
   ← Query from ~/.transcript-fixer/corrections.db
   ↓
4. DictionaryProcessor.process()
   - Apply context rules (regex)
   - Apply dictionary replacements
   - Track changes
   ↓
5. AIProcessor.process()
   - Split into chunks
   - Call GLM-4.6 API
   - Retry with fallback on error
   - Track AI changes
   ↓
6. CorrectionService.save_history()
   → Insert into correction_history table
   ↓
7. LearningEngine.analyze_and_suggest()
   - Query correction_history table
   - Detect patterns (frequency ≥3, confidence ≥80%)
   - Generate suggestions
   → Insert into learned_suggestions table
   ↓
8. Output Files
   - {filename}_stage1.md
   - {filename}_stage2.md
```

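The same flow, as a minimal Python sketch of steps 3-7. Import paths assume the `scripts/` package layout shown earlier; the `AIProcessor` constructor and the `save_history()` argument list are assumptions, not the project's actual signatures.

```python
# Illustrative sketch of the workflow above -- not the project's actual code.
# Signatures beyond those stated in this document are assumptions.
from core.ai_processor import AIProcessor
from core.dictionary_processor import DictionaryProcessor

def run_correction(filename: str, text: str, domain: str, service, learning_engine):
    corrections = service.get_corrections(domain)              # step 3: load from SQLite

    dictionary = DictionaryProcessor(corrections, [])           # step 4: Stage 1 (rules + dictionary)
    stage1_text, dict_changes = dictionary.process(text)

    ai = AIProcessor()                                           # step 5: Stage 2 (GLM-4.6 with fallback)
    stage2_text, ai_changes = ai.process(stage1_text, domain)

    # step 6: record the run in correction_history (argument list assumed)
    service.save_history(filename, domain, dict_changes + ai_changes)

    learning_engine.analyze_and_suggest()                        # step 7: detect recurring patterns
    return stage1_text, stage2_text
```
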
### Learning Cycle

```
Run 1: meeting1.md
  AI corrects: "巨升" → "具身"
  ↓ INSERT INTO correction_history

Run 2: meeting2.md
  AI corrects: "巨升" → "具身"
  ↓ INSERT INTO correction_history

Run 3: meeting3.md
  AI corrects: "巨升" → "具身"
  ↓ INSERT INTO correction_history
      ↓
LearningEngine queries patterns:
  - SELECT ... GROUP BY from_text, to_text
  - Frequency: 3, Confidence: 100%
      ↓
INSERT INTO learned_suggestions (status='pending')
      ↓
User reviews: --review-learned
      ↓
User approves: --approve "巨升" "具身"
      ↓
INSERT INTO corrections (source='learned')
UPDATE learned_suggestions (status='approved')
      ↓
Future runs query corrections table (Stage 1 - faster!)
```

## SQLite Architecture (v2.0)

### Two-Layer Data Access (Simplified)

**Design Principle**: No users = no backward compatibility overhead. The system uses a clean 2-layer architecture:

```
┌──────────────────────────────────────────┐
│  CLI Commands (commands.py)              │
│  - User interaction                      │
│  - Command routing                       │
└──────────────┬───────────────────────────┘
               │
┌──────────────▼───────────────────────────┐
│  CorrectionService (Business Logic)      │
│  - Input validation & sanitization       │
│  - Business rules enforcement            │
│  - Import/export orchestration           │
│  - Statistics calculation                │
│  - History tracking                      │
└──────────────┬───────────────────────────┘
               │
┌──────────────▼───────────────────────────┐
│  CorrectionRepository (Data Access)      │
│  - ACID transactions                     │
│  - Thread-safe connections               │
│  - SQL query execution                   │
│  - Audit logging                         │
└──────────────┬───────────────────────────┘
               │
┌──────────────▼───────────────────────────┐
│  SQLite Database (corrections.db)        │
│  - 8 normalized tables                   │
│  - Foreign key constraints               │
│  - Comprehensive indexes                 │
│  - 3 views for common queries            │
└──────────────────────────────────────────┘
```

### Database Schema (schema.sql)

**Core Tables**:

1. **corrections** (main correction storage)
   - Primary key: `id`
   - Unique constraint: `(from_text, domain)`
   - Indexes: domain, source, added_at, is_active, from_text
   - Fields: confidence (0.0-1.0), usage_count, notes
2. **context_rules** (regex-based rules)
   - Pattern + replacement with priority ordering
   - Indexes: priority (DESC), is_active
3. **correction_history** (audit trail for runs)
   - Tracks: filename, domain, timestamps, change counts
   - Links to correction_changes via foreign key
   - Indexes: run_timestamp, domain, success
4. **correction_changes** (detailed change log)
   - Links to history via foreign key (CASCADE delete)
   - Stores: line_number, from/to text, rule_type, context
   - Indexes: history_id, rule_type
5. **learned_suggestions** (AI-detected patterns)
   - Status: pending → approved/rejected
   - Unique constraint: `(from_text, to_text, domain)`
   - Fields: frequency, confidence, timestamps
   - Indexes: status, domain, confidence, frequency
6. **suggestion_examples** (occurrences of patterns)
   - Links to learned_suggestions via foreign key
   - Stores context where the pattern occurred
7. **system_config** (configuration storage)
   - Key-value store with type safety
   - Stores: API settings, thresholds, defaults
8. **audit_log** (comprehensive audit trail)
   - Tracks all database operations
   - Fields: action, entity_type, entity_id, user, success
   - Indexes: timestamp, action, entity_type, success

**Views** (for common queries):

- `active_corrections`: active corrections only
- `pending_suggestions`: suggestions pending review
- `correction_statistics`: statistics per domain

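A self-contained illustration of how the `corrections` constraints above behave, using an in-memory database. The authoritative DDL lives in `core/schema.sql`; the columns shown here are a simplified subset, not the real schema.

```python
# Simplified, in-memory illustration of the corrections table constraints.
# The real DDL is in core/schema.sql; this is not a copy of it.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE corrections (
        id          INTEGER PRIMARY KEY,
        from_text   TEXT NOT NULL,
        to_text     TEXT NOT NULL,
        domain      TEXT NOT NULL,
        confidence  REAL NOT NULL DEFAULT 1.0
                    CHECK (confidence BETWEEN 0.0 AND 1.0),
        usage_count INTEGER NOT NULL DEFAULT 0 CHECK (usage_count >= 0),
        is_active   INTEGER NOT NULL DEFAULT 1,
        UNIQUE (from_text, domain)
    );
    CREATE INDEX idx_corrections_domain ON corrections(domain);
""")

conn.execute("INSERT INTO corrections (from_text, to_text, domain) VALUES (?, ?, ?)",
             ("巨升", "具身", "meeting"))
try:
    # A second insert with the same (from_text, domain) violates the UNIQUE constraint.
    conn.execute("INSERT INTO corrections (from_text, to_text, domain) VALUES (?, ?, ?)",
                 ("巨升", "具身", "meeting"))
except sqlite3.IntegrityError as exc:
    print("rejected duplicate:", exc)
```
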
### ACID Guarantees

**Atomicity**: All-or-nothing transactions

```python
with self._transaction() as conn:
    conn.execute("INSERT ...")  # Either all succeed
    conn.execute("UPDATE ...")  # or all roll back
```

**Consistency**: Constraints enforced

- Foreign key constraints
- Check constraints (confidence 0.0-1.0, usage_count ≥ 0)
- Unique constraints

**Isolation**: Serializable transactions

```python
conn.execute("BEGIN IMMEDIATE")  # Acquire write lock
```

**Durability**: Changes persisted to disk

- SQLite guarantees persistence after commit
- Backup before migrations

### Thread Safety

**Thread-local connections**:

```python
def _get_connection(self):
    if not hasattr(self._local, 'connection'):
        self._local.connection = sqlite3.connect(...)
    return self._local.connection
```

**Connection pooling**:

- One connection per thread
- Automatic cleanup on close
- Foreign keys enabled per connection

### Clean Architecture (No Legacy)

**Design Philosophy**:

- Clean 2-layer architecture (Service → Repository)
- No backward compatibility overhead
- Direct API design without legacy constraints
- YAGNI principle: build for current needs, not hypothetical migrations

## Module Details

### fix_transcription.py (Orchestrator)

**Responsibilities**:

- Parse CLI arguments
- Route commands to appropriate handlers
- Coordinate workflow between modules
- Display user feedback

**Key Functions**:

```python
cmd_init()               # Initialize ~/.transcript-fixer/
cmd_add_correction()     # Add single correction
cmd_list_corrections()   # List corrections
cmd_run_correction()     # Execute correction workflow
cmd_review_learned()     # Review AI suggestions
cmd_approve()            # Approve learned correction
```

**Design Pattern**: Command pattern with function routing

### correction_repository.py (Data Access Layer)

**Responsibilities**:

- Execute SQL queries with ACID guarantees
- Manage thread-safe database connections
- Handle transactions (commit/rollback)
- Perform audit logging
- Convert between database rows and Python objects

**Key Methods**:

```python
add_correction()            # INSERT with UNIQUE handling
get_correction()            # SELECT single correction
get_all_corrections()       # SELECT with filters
get_corrections_dict()      # For backward compatibility
update_correction()         # UPDATE with transaction
delete_correction()         # Soft delete (is_active=0)
increment_usage()           # Track usage statistics
bulk_import_corrections()   # Batch INSERT with conflict resolution
```

**Transaction Management**:

```python
@contextmanager
def _transaction(self):
    conn = self._get_connection()
    try:
        conn.execute("BEGIN IMMEDIATE")
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
```

### correction_service.py (Business Logic Layer)

**Responsibilities**:

- Input validation and sanitization
- Business rule enforcement
- Orchestrate repository operations
- Import/export with conflict detection
- Statistics calculation

**Key Methods**:

```python
# Validation
validate_correction_text()  # Check length, control chars, NULL bytes
validate_domain_name()      # Prevent path traversal, injection
validate_confidence()       # Range check (0.0-1.0)
validate_source()           # Enum validation

# Operations
add_correction()            # Validate + repository.add
get_corrections()           # Get corrections for domain
remove_correction()         # Validate + repository.delete

# Import/Export
import_corrections()        # Pre-validate + bulk import + conflict detection
export_corrections()        # Query + format as JSON

# Analytics
get_statistics()            # Calculate metrics per domain
```

**Validation Rules**:

```python
@dataclass
class ValidationRules:
    max_text_length: int = 1000
    min_text_length: int = 1
    max_domain_length: int = 50
    allowed_domain_pattern: str = r'^[a-zA-Z0-9_-]+$'
```

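A minimal sketch of how `validate_domain_name()` might enforce these rules; the actual checks and error types in `correction_service.py` may differ.

```python
# Illustrative sketch of domain validation against ValidationRules above;
# the real implementation's error handling may differ.
import re

MAX_DOMAIN_LENGTH = 50
DOMAIN_PATTERN = re.compile(r'^[a-zA-Z0-9_-]+$')

def validate_domain_name(domain: str) -> str:
    """Reject names that could enable path traversal or injection."""
    if not domain or len(domain) > MAX_DOMAIN_LENGTH:
        raise ValueError(f"domain must be 1-{MAX_DOMAIN_LENGTH} characters")
    if not DOMAIN_PATTERN.match(domain):
        raise ValueError("domain may only contain letters, digits, '_' and '-'")
    return domain

validate_domain_name("meeting-notes")        # OK
# validate_domain_name("../etc") raises ValueError
```
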
### CLI Integration (commands.py)

**Direct Service Usage**:

```python
def _get_service():
    """Get configured CorrectionService instance."""
    config_dir = Path.home() / ".transcript-fixer"
    db_path = config_dir / "corrections.db"
    repository = CorrectionRepository(db_path)
    return CorrectionService(repository)

def cmd_add_correction(args):
    service = _get_service()
    service.add_correction(args.from_text, args.to_text, args.domain)
```

**Benefits of Direct Integration**:

- No unnecessary abstraction layers
- Clear data flow: CLI → Service → Repository
- Easy to understand and debug
- Performance: one less function call per operation

### dictionary_processor.py (Stage 1)

**Responsibilities**:

- Apply context-aware regex rules
- Apply simple dictionary replacements
- Track all changes with line numbers

**Processing Order**:

1. Context rules first (higher priority)
2. Dictionary replacements second

**Key Methods**:

```python
process(text) -> (corrected_text, changes)
_apply_context_rules()
_apply_dictionary()
get_summary(changes)
```

**Change Tracking**:

```python
@dataclass
class Change:
    line_number: int
    from_text: str
    to_text: str
    rule_type: str  # "dictionary" or "context_rule"
    rule_name: str
```

### ai_processor.py (Stage 2)

**Responsibilities**:

- Split text into API-friendly chunks
- Call GLM-4.6 API
- Handle retries with fallback model
- Track AI-suggested changes

**Key Methods**:

```python
process(text, context) -> (corrected_text, changes)
_split_into_chunks()    # Respect paragraph boundaries
_process_chunk()        # Single API call
_build_prompt()         # Construct correction prompt
```

**Chunking Strategy** (see the sketch at the end of this section):

- Max 6000 characters per chunk
- Split on paragraph boundaries (`\n\n`)
- If a paragraph is too long, split on sentences
- Preserve context across chunks

**Error Handling**:

- Retry with fallback model (GLM-4.5-Air)
- If both fail, use the original text
- Never lose the user's data

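A minimal sketch of the paragraph-boundary chunking described above. The real `_split_into_chunks()` also falls back to sentence-level splitting for oversized paragraphs; that part is omitted here.

```python
# Sketch of greedy paragraph packing up to the chunk limit; not the actual
# _split_into_chunks(), which additionally splits oversized paragraphs on sentences.
from __future__ import annotations

MAX_CHUNK_CHARS = 6000

def split_into_chunks(text: str, max_chars: int = MAX_CHUNK_CHARS) -> list[str]:
    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate         # paragraph still fits in the current chunk
        else:
            if current:
                chunks.append(current)  # close the full chunk
            current = paragraph         # oversized paragraphs need the sentence-level fallback
    if current:
        chunks.append(current)
    return chunks
```
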
### learning_engine.py (Pattern Detection)

**Responsibilities**:

- Analyze correction history
- Detect recurring patterns
- Calculate confidence scores
- Generate suggestions for review
- Track rejected suggestions

**Algorithm**:

```
1. Query correction_history table
2. Extract stage2 (AI) changes
3. Group by pattern (from → to)
4. Count frequency
5. Calculate confidence
6. Filter by thresholds:
   - frequency ≥ 3
   - confidence ≥ 0.8
7. Insert into learned_suggestions table (status='pending')
```

**Confidence Calculation**:

```python
confidence = (
    0.5 * frequency_score +    # More occurrences = higher
    0.3 * consistency_score +  # Always the same correction
    0.2 * recency_score        # Recent = higher
)
```

**Key Methods**:

```python
analyze_and_suggest()   # Main analysis pipeline
approve_suggestion()    # Copy into corrections table (source='learned')
reject_suggestion()     # Mark suggestion as rejected
list_pending()          # Get all pending suggestions
```

### diff_generator.py (Stage 3)

**Responsibilities**:

- Generate comparison reports
- Multiple output formats
- Word-level diff analysis

**Output Formats**:

1. Markdown summary (statistics + change list)
2. Unified diff (standard diff format)
3. HTML side-by-side (visual comparison)
4. Inline marked (`[-old-]` `[+new+]`)

**Not Modified**: The original diff generator file was kept as-is (working well).

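A minimal sketch of the inline-marked format (format 4) using a word-level `difflib` comparison; `diff_generator.py` may implement this differently.

```python
# Word-level inline markup in the [-old-] / [+new+] style described above.
# Illustrative only; the project's diff_generator.py may differ.
import difflib

def inline_diff(original: str, corrected: str) -> str:
    old_words, new_words = original.split(), corrected.split()
    parts = []
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(None, old_words, new_words).get_opcodes():
        old = " ".join(old_words[i1:i2])
        new = " ".join(new_words[j1:j2])
        if op == "equal":
            parts.append(old)
        elif op == "delete":
            parts.append(f"[-{old}-]")
        elif op == "insert":
            parts.append(f"[+{new}+]")
        else:  # replace
            parts.append(f"[-{old}-] [+{new}+]")
    return " ".join(parts)

print(inline_diff("this is erroneus text", "this is corrected text"))
# this is [-erroneus-] [+corrected+] text
```
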
## State Management

### Database-Backed State

- All state is stored in `~/.transcript-fixer/corrections.db`
- SQLite handles caching and transactions
- ACID guarantees prevent corruption
- Backup created before migrations

### Thread-Safe Access

- Thread-local connections (one per thread)
- `BEGIN IMMEDIATE` for write transactions
- No global state or shared mutable data
- Each operation is independent (stateless modules)

### Soft Deletes

- Records are marked inactive (`is_active=0`) instead of being deleted
- Preserves the audit trail
- Can be reactivated if needed

## Error Handling Strategy

### Fail Fast for User Errors

```python
if not skill_path.exists():
    print("❌ Error: Skill directory not found")
    sys.exit(1)
```

### Retry for Transient Errors

```python
try:
    api_call(model_primary)
except Exception:
    try:
        api_call(model_fallback)
    except Exception:
        use_original_text()
```

### Backup Before Destructive Operations

```python
if target_file.exists():
    shutil.copy2(target_file, backup_file)
# Then overwrite target_file
```

## Testing Strategy

### Unit Testing (Recommended)

```python
# Test dictionary processor
def test_dictionary_processor():
    corrections = {"错误": "正确"}
    processor = DictionaryProcessor(corrections, [])
    text = "这是错误的文本"
    result, changes = processor.process(text)
    assert result == "这是正确的文本"
    assert len(changes) == 1

# Test learning engine thresholds
def test_learning_thresholds():
    engine = LearningEngine(history_dir, learned_dir)
    # Create mock history with a pattern appearing 3+ times
    suggestions = engine.analyze_and_suggest()
    assert len(suggestions) > 0
```

### Integration Testing

```bash
# End-to-end test
python fix_transcription.py --init
python fix_transcription.py --add "test" "TEST"
python fix_transcription.py --input test.md --stage 3
# Verify output files exist
```

## Performance Considerations

### Bottlenecks

1. **AI API calls**: slowest part (60s timeout per chunk)
2. **File I/O**: negligible (local SQLite reads and writes are fast)
3. **Pattern matching**: fast (regex + dict lookups)

### Optimization Strategies

1. **Stage 1 first**: test dictionary corrections before expensive AI calls
2. **Chunking**: process large files in parallel chunks (future enhancement)
3. **Caching**: could cache API results by content hash (future enhancement)

### Scalability

**Current capabilities (v2.0 with SQLite)**:

- File size: unlimited (chunks handle large files)
- Corrections: tested up to 100,000 entries (with indexes)
- History: unlimited (the database handles it efficiently)
- Concurrent access: thread-safe with ACID guarantees
- Query performance: O(log n) with B-tree indexes

**Performance improvements from SQLite**:

- Indexed queries (domain, source, added_at)
- Views for common aggregations
- Batch imports with transactions
- Soft deletes (no data loss)

**Future improvements**:

- Parallel chunk processing for AI calls
- API response caching
- Full-text search for corrections

## Security Architecture

### Secret Management

- API keys via environment variables only
- Never hardcode credentials
- Security scanner enforces this

### Backup Security

- `.bak` files keep the same permissions as the originals
- No encryption (user's responsibility)
- Recommendation: use encrypted filesystems

### Git Security

- `.gitignore` entry for `.bak` files
- Private repos recommended
- Security scan before commits

## Extensibility Points

### Adding New Processors

1. Create a new processor class
2. Implement the `process(text) -> (result, changes)` interface
3. Add it to the orchestrator workflow

Example:

```python
class SpellCheckProcessor:
    def process(self, text):
        changes = []
        corrected_text = text  # Custom spell checking logic goes here
        return corrected_text, changes
```

### Adding New Learning Algorithms

1. Subclass `LearningEngine`
2. Override `_calculate_confidence()`
3. Adjust thresholds as needed

### Adding New Export Formats

1. Add a method to `CorrectionService`
2. Support the new file format
3. Add a CLI command

## Dependencies

### Required

- Python 3.8+ (`from __future__ import annotations`)
- `httpx` (for API calls)

### Optional

- `diff` command (for unified diffs)
- Git (for version control)

### Development

- `pytest` (for testing)
- `black` (for formatting)
- `mypy` (for type checking)

## Deployment

### User Installation

```bash
# 1. Clone or download the skill to your workspace
git clone transcript-fixer
cd transcript-fixer

# 2. Install dependencies
pip install -r requirements.txt

# 3. Initialize
python scripts/fix_transcription.py --init

# 4. Set API key
export GLM_API_KEY="KEY_VALUE"

# Ready to use!
```

### CI/CD Pipeline (Future)

```yaml
# Potential GitHub Actions workflow
test:
  - Install dependencies
  - Run unit tests
  - Run integration tests
  - Check code style (black, mypy)

security:
  - Run security_scan.py
  - Check for secrets

deploy:
  - Package skill
  - Upload to skill marketplace
```

## Further Reading

- SOLID Principles: https://en.wikipedia.org/wiki/SOLID
- API Patterns: `references/glm_api_setup.md`
- File Formats: `references/file_formats.md`
- Testing: https://docs.pytest.org/