Release v1.8.0: Add transcript-fixer skill
## New Skill: transcript-fixer v1.0.0

Correct speech-to-text (ASR/STT) transcription errors through dictionary-based rules and AI-powered corrections with automatic pattern learning.

**Features:**
- Two-stage correction pipeline (dictionary + AI)
- Automatic pattern detection and learning
- Domain-specific dictionaries (general, embodied_ai, finance, medical)
- SQLite-based correction repository
- Team collaboration with import/export
- GLM API integration for AI corrections
- Cost optimization through dictionary promotion

**Use cases:**
- Correcting meeting notes, lecture recordings, or interview transcripts
- Fixing Chinese/English homophone errors and technical terminology
- Building domain-specific correction dictionaries
- Improving transcript accuracy through iterative learning

**Documentation:**
- Complete workflow guides in references/
- SQL query templates
- Troubleshooting guide
- Team collaboration patterns
- API setup instructions

**Marketplace updates:**
- Updated marketplace to v1.8.0
- Added transcript-fixer plugin (category: productivity)
- Updated README.md with skill description and use cases
- Updated CLAUDE.md with skill listing and counts

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
848  transcript-fixer/references/architecture.md  Normal file

@@ -0,0 +1,848 @@
# Architecture Reference

Technical implementation details of the transcript-fixer system.

## Table of Contents

- [Module Structure](#module-structure)
- [Design Principles](#design-principles)
  - [SOLID Compliance](#solid-compliance)
  - [File Length Limits](#file-length-limits)
- [Module Architecture](#module-architecture)
  - [Layer Diagram](#layer-diagram)
- [Data Flow](#data-flow)
  - [Correction Workflow](#correction-workflow)
  - [Learning Cycle](#learning-cycle)
- [SQLite Architecture (v2.0)](#sqlite-architecture-v20)
  - [Two-Layer Data Access](#two-layer-data-access-simplified)
  - [Database Schema](#database-schema-schemasql)
  - [ACID Guarantees](#acid-guarantees)
  - [Thread Safety](#thread-safety)
  - [Clean Architecture (No Legacy)](#clean-architecture-no-legacy)
- [Module Details](#module-details)
  - [fix_transcription.py](#fix_transcriptionpy-orchestrator)
  - [correction_repository.py](#correction_repositorypy-data-access-layer)
  - [correction_service.py](#correction_servicepy-business-logic-layer)
  - [CLI Integration](#cli-integration-commandspy)
  - [dictionary_processor.py](#dictionary_processorpy-stage-1)
  - [ai_processor.py](#ai_processorpy-stage-2)
  - [learning_engine.py](#learning_enginepy-pattern-detection)
  - [diff_generator.py](#diff_generatorpy-stage-3)
- [State Management](#state-management)
  - [Database-Backed State](#database-backed-state)
  - [Thread-Safe Access](#thread-safe-access)
  - [Soft Deletes](#soft-deletes)
- [Error Handling Strategy](#error-handling-strategy)
- [Testing Strategy](#testing-strategy)
- [Performance Considerations](#performance-considerations)
- [Security Architecture](#security-architecture)
- [Extensibility Points](#extensibility-points)
- [Dependencies](#dependencies)
- [Deployment](#deployment)
- [Further Reading](#further-reading)

## Module Structure

The codebase follows a modular package structure for maintainability:

```
scripts/
├── fix_transcription.py          # Main entry point (~70 lines)
├── core/                         # Business logic & data access
│   ├── correction_repository.py  # Data access layer (466 lines)
│   ├── correction_service.py     # Business logic layer (525 lines)
│   ├── schema.sql                # SQLite database schema (216 lines)
│   ├── dictionary_processor.py   # Stage 1 processor (140 lines)
│   ├── ai_processor.py           # Stage 2 processor (199 lines)
│   └── learning_engine.py        # Pattern detection (252 lines)
├── cli/                          # Command-line interface
│   ├── commands.py               # Command handlers (180 lines)
│   └── argument_parser.py        # Argument config (95 lines)
└── utils/                        # Utility functions
    ├── diff_generator.py         # Multi-format diffs (132 lines)
    ├── logging_config.py         # Logging configuration (130 lines)
    └── validation.py             # SQLite validation (105 lines)
```

**Benefits of modular structure**:
- Clear separation of concerns (business logic / CLI / utilities)
- Easy to locate and modify specific functionality
- Supports independent testing of modules
- Scales well as codebase grows
- Follows Python package best practices

## Design Principles

### SOLID Compliance

Every module follows SOLID principles for maintainability:

1. **Single Responsibility Principle (SRP)**
   - Each module has exactly one reason to change
   - `CorrectionRepository`: Database operations only
   - `CorrectionService`: Business logic and validation only
   - `DictionaryProcessor`: Text transformation only
   - `AIProcessor`: API communication only
   - `LearningEngine`: Pattern analysis only

2. **Open/Closed Principle (OCP)**
   - Open for extension via SQL INSERT
   - Closed for modification (no code changes needed)
   - Add corrections via CLI or SQL without editing Python

3. **Liskov Substitution Principle (LSP)**
   - All processors implement the same interface
   - Implementations can be swapped without breaking the workflow

4. **Interface Segregation Principle (ISP)**
   - Repository, Service, Processor, and Engine are independent
   - No unnecessary dependencies

5. **Dependency Inversion Principle (DIP)**
   - Service depends on the Repository interface
   - CLI depends on the Service interface
   - Neither is tied to concrete implementations

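The LSP point above can be made concrete with a `typing.Protocol`. This is an illustrative sketch, not code from the repository; `UppercaseProcessor` is a hypothetical stand-in processor:

```python
from __future__ import annotations

from typing import Protocol


class Processor(Protocol):
    """The interface shared by processors such as DictionaryProcessor."""

    def process(self, text: str) -> tuple[str, list]: ...


class UppercaseProcessor:
    """Hypothetical drop-in processor: it exposes the same
    process() signature, so the workflow can run it unchanged."""

    def process(self, text: str) -> tuple[str, list]:
        corrected = text.upper()
        changes = [] if corrected == text else [(text, corrected)]
        return corrected, changes


def run_stage(processor: Processor, text: str) -> str:
    # The caller only sees the Processor interface (DIP in miniature).
    corrected, _changes = processor.process(text)
    return corrected
```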
### File Length Limits

All files stay within the code quality limits, except `learning_engine.py`, which is two lines over its target:

| File | Lines | Limit | Status |
|------|-------|-------|--------|
| `validation.py` | 105 | 200 | ✅ |
| `logging_config.py` | 130 | 200 | ✅ |
| `diff_generator.py` | 132 | 200 | ✅ |
| `dictionary_processor.py` | 140 | 200 | ✅ |
| `commands.py` | 180 | 200 | ✅ |
| `ai_processor.py` | 199 | 250 | ✅ |
| `schema.sql` | 216 | 250 | ✅ |
| `learning_engine.py` | 252 | 250 | ⚠️ |
| `correction_repository.py` | 466 | 500 | ✅ |
| `correction_service.py` | 525 | 550 | ✅ |

## Module Architecture

### Layer Diagram

```
┌─────────────────────────────────────────┐
│ CLI Layer (fix_transcription.py)        │
│ - Argument parsing                      │
│ - Command routing                       │
│ - User interaction                      │
└───────────────┬─────────────────────────┘
                │
┌───────────────▼─────────────────────────┐
│ Business Logic Layer                    │
│                                         │
│ ┌──────────────────┐  ┌──────────────┐  │
│ │ Dictionary       │  │ AI           │  │
│ │ Processor        │  │ Processor    │  │
│ │ (Stage 1)        │  │ (Stage 2)    │  │
│ └──────────────────┘  └──────────────┘  │
│                                         │
│ ┌──────────────────┐  ┌──────────────┐  │
│ │ Learning         │  │ Diff         │  │
│ │ Engine           │  │ Generator    │  │
│ │ (Pattern detect) │  │ (Stage 3)    │  │
│ └──────────────────┘  └──────────────┘  │
└───────────────┬─────────────────────────┘
                │
┌───────────────▼─────────────────────────┐
│ Data Access Layer (SQLite-based)        │
│                                         │
│ ┌──────────────────────────────────┐    │
│ │ CorrectionManager (Facade)       │    │
│ │ - Backward-compatible API        │    │
│ └──────────────┬───────────────────┘    │
│                │                        │
│ ┌──────────────▼───────────────────┐    │
│ │ CorrectionService                │    │
│ │ - Business logic                 │    │
│ │ - Validation                     │    │
│ │ - Import/Export                  │    │
│ └──────────────┬───────────────────┘    │
│                │                        │
│ ┌──────────────▼───────────────────┐    │
│ │ CorrectionRepository             │    │
│ │ - ACID transactions              │    │
│ │ - Thread-safe connections        │    │
│ │ - Audit logging                  │    │
│ └──────────────────────────────────┘    │
└───────────────┬─────────────────────────┘
                │
┌───────────────▼─────────────────────────┐
│ Storage Layer                           │
│ ~/.transcript-fixer/corrections.db      │
│ - SQLite database (ACID compliant)      │
│ - 8 normalized tables + 3 views         │
│ - Comprehensive indexes                 │
│ - Foreign key constraints               │
└─────────────────────────────────────────┘
```

## Data Flow

### Correction Workflow

```
1. User Input
   ↓
2. fix_transcription.py (Orchestrator)
   ↓
3. CorrectionService.get_corrections()
   ← Query from ~/.transcript-fixer/corrections.db
   ↓
4. DictionaryProcessor.process()
   - Apply context rules (regex)
   - Apply dictionary replacements
   - Track changes
   ↓
5. AIProcessor.process()
   - Split into chunks
   - Call GLM-4.6 API
   - Retry with fallback on error
   - Track AI changes
   ↓
6. CorrectionService.save_history()
   → Insert into correction_history table
   ↓
7. LearningEngine.analyze_and_suggest()
   - Query correction_history table
   - Detect patterns (frequency ≥3, confidence ≥80%)
   - Generate suggestions
   → Insert into learned_suggestions table
   ↓
8. Output Files
   - (unknown)_stage1.md
   - (unknown)_stage2.md
```

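The stage-1 → stage-2 hand-off in steps 4-5 can be sketched as a simple pipeline. All names here are illustrative (`ReplaceProcessor` stands in for the real processors), and history saving and learning are omitted:

```python
class ReplaceProcessor:
    """Stand-in for DictionaryProcessor: plain string replacement."""

    def __init__(self, mapping):
        self.mapping = mapping

    def process(self, text):
        changes = []
        for old, new in self.mapping.items():
            if old in text:
                changes.append((old, new))
                text = text.replace(old, new)
        return text, changes


def run_pipeline(text, stage1, stage2):
    # Each stage returns (corrected_text, changes); stage 2 sees the
    # output of stage 1, and the change lists are concatenated.
    text, changes1 = stage1.process(text)
    text, changes2 = stage2.process(text)
    return text, changes1 + changes2
```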
### Learning Cycle

```
Run 1: meeting1.md
   AI corrects: "巨升" → "具身"
   ↓
   INSERT INTO correction_history

Run 2: meeting2.md
   AI corrects: "巨升" → "具身"
   ↓
   INSERT INTO correction_history

Run 3: meeting3.md
   AI corrects: "巨升" → "具身"
   ↓
   INSERT INTO correction_history
   ↓
LearningEngine queries patterns:
   - SELECT ... GROUP BY from_text, to_text
   - Frequency: 3, Confidence: 100%
   ↓
INSERT INTO learned_suggestions (status='pending')
   ↓
User reviews: --review-learned
   ↓
User approves: --approve "巨升" "具身"
   ↓
INSERT INTO corrections (source='learned')
UPDATE learned_suggestions (status='approved')
   ↓
Future runs query corrections table (Stage 1 - faster!)
```

## SQLite Architecture (v2.0)

### Two-Layer Data Access (Simplified)

**Design Principle**: No users yet, so no backward-compatibility overhead.

The system uses a clean 2-layer architecture:

```
┌──────────────────────────────────────────┐
│ CLI Commands (commands.py)               │
│ - User interaction                       │
│ - Command routing                        │
└──────────────┬───────────────────────────┘
               │
┌──────────────▼───────────────────────────┐
│ CorrectionService (Business Logic)       │
│ - Input validation & sanitization        │
│ - Business rules enforcement             │
│ - Import/export orchestration            │
│ - Statistics calculation                 │
│ - History tracking                       │
└──────────────┬───────────────────────────┘
               │
┌──────────────▼───────────────────────────┐
│ CorrectionRepository (Data Access)       │
│ - ACID transactions                      │
│ - Thread-safe connections                │
│ - SQL query execution                    │
│ - Audit logging                          │
└──────────────┬───────────────────────────┘
               │
┌──────────────▼───────────────────────────┐
│ SQLite Database (corrections.db)         │
│ - 8 normalized tables                    │
│ - Foreign key constraints                │
│ - Comprehensive indexes                  │
│ - 3 views for common queries             │
└──────────────────────────────────────────┘
```

### Database Schema (schema.sql)

**Core Tables**:

1. **corrections** (main correction storage)
   - Primary key: id
   - Unique constraint: (from_text, domain)
   - Indexes: domain, source, added_at, is_active, from_text
   - Fields: confidence (0.0-1.0), usage_count, notes

2. **context_rules** (regex-based rules)
   - Pattern + replacement with priority ordering
   - Indexes: priority (DESC), is_active

3. **correction_history** (audit trail for runs)
   - Tracks: filename, domain, timestamps, change counts
   - Links to correction_changes via foreign key
   - Indexes: run_timestamp, domain, success

4. **correction_changes** (detailed change log)
   - Links to history via foreign key (CASCADE delete)
   - Stores: line_number, from/to text, rule_type, context
   - Indexes: history_id, rule_type

5. **learned_suggestions** (AI-detected patterns)
   - Status: pending → approved/rejected
   - Unique constraint: (from_text, to_text, domain)
   - Fields: frequency, confidence, timestamps
   - Indexes: status, domain, confidence, frequency

6. **suggestion_examples** (occurrences of patterns)
   - Links to learned_suggestions via foreign key
   - Stores the context where each pattern occurred

7. **system_config** (configuration storage)
   - Key-value store with type safety
   - Stores: API settings, thresholds, defaults

8. **audit_log** (comprehensive audit trail)
   - Tracks all database operations
   - Fields: action, entity_type, entity_id, user, success
   - Indexes: timestamp, action, entity_type, success

**Views** (for common queries):

- `active_corrections`: Active corrections only
- `pending_suggestions`: Suggestions pending review
- `correction_statistics`: Statistics per domain

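A runnable subset of the schema, showing the constraints described above for the corrections table. The column names, defaults, and index are assumptions based on this reference, not a copy of `schema.sql`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute(
    """CREATE TABLE corrections (
           id          INTEGER PRIMARY KEY,
           from_text   TEXT NOT NULL,
           to_text     TEXT NOT NULL,
           domain      TEXT NOT NULL DEFAULT 'general',
           source      TEXT NOT NULL DEFAULT 'manual',
           confidence  REAL NOT NULL DEFAULT 1.0
                       CHECK (confidence >= 0.0 AND confidence <= 1.0),
           usage_count INTEGER NOT NULL DEFAULT 0 CHECK (usage_count >= 0),
           is_active   INTEGER NOT NULL DEFAULT 1,
           UNIQUE (from_text, domain)
       )"""
)
conn.execute("CREATE INDEX idx_corrections_domain ON corrections(domain)")
conn.execute("INSERT INTO corrections (from_text, to_text) VALUES ('巨升', '具身')")
```

With the CHECK constraint in place, an out-of-range confidence is rejected at insert time rather than discovered later.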
### ACID Guarantees

**Atomicity**: All-or-nothing transactions

```python
with self._transaction() as conn:
    conn.execute("INSERT ...")  # Either all succeed
    conn.execute("UPDATE ...")  # or all roll back
```

**Consistency**: Constraints enforced

- Foreign key constraints
- Check constraints (confidence 0.0-1.0, usage_count ≥ 0)
- Unique constraints

**Isolation**: Serializable transactions

```python
conn.execute("BEGIN IMMEDIATE")  # Acquire write lock
```

**Durability**: Changes persisted to disk

- SQLite guarantees persistence after commit
- Backup created before migrations

### Thread Safety

**Thread-local connections**:

```python
def _get_connection(self):
    if not hasattr(self._local, 'connection'):
        self._local.connection = sqlite3.connect(...)
    return self._local.connection
```

**Connection pooling**:

- One connection per thread
- Automatic cleanup on close
- Foreign keys enabled per connection

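The pattern above as a self-contained sketch (class and attribute names are assumptions): each thread lazily opens its own connection, so no `sqlite3.Connection` is ever shared across threads.

```python
import sqlite3
import threading


class ConnectionPool:
    """One lazily created connection per thread via threading.local."""

    def __init__(self, db_path: str):
        self.db_path = db_path
        self._local = threading.local()

    def get(self) -> sqlite3.Connection:
        if not hasattr(self._local, "connection"):
            conn = sqlite3.connect(self.db_path)
            conn.execute("PRAGMA foreign_keys = ON")  # per-connection setting
            self._local.connection = conn
        return self._local.connection


pool = ConnectionPool(":memory:")
seen = []


def worker():
    seen.append(pool.get())  # hold a reference to the worker's connection


t = threading.Thread(target=worker)
t.start()
t.join()
```

Within one thread, `get()` always returns the same connection; a different thread gets its own.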
### Clean Architecture (No Legacy)

**Design Philosophy**:

- Clean 2-layer architecture (Service → Repository)
- No backward-compatibility overhead
- Direct API design without legacy constraints
- YAGNI principle: build for current needs, not hypothetical migrations

## Module Details

### fix_transcription.py (Orchestrator)

**Responsibilities**:

- Parse CLI arguments
- Route commands to appropriate handlers
- Coordinate workflow between modules
- Display user feedback

**Key Functions**:

```python
cmd_init()              # Initialize ~/.transcript-fixer/
cmd_add_correction()    # Add single correction
cmd_list_corrections()  # List corrections
cmd_run_correction()    # Execute correction workflow
cmd_review_learned()    # Review AI suggestions
cmd_approve()           # Approve learned correction
```

**Design Pattern**: Command pattern with function routing

### correction_repository.py (Data Access Layer)

**Responsibilities**:

- Execute SQL queries with ACID guarantees
- Manage thread-safe database connections
- Handle transactions (commit/rollback)
- Perform audit logging
- Convert between database rows and Python objects

**Key Methods**:

```python
add_correction()           # INSERT with UNIQUE handling
get_correction()           # SELECT single correction
get_all_corrections()      # SELECT with filters
get_corrections_dict()     # For backward compatibility
update_correction()        # UPDATE with transaction
delete_correction()        # Soft delete (is_active=0)
increment_usage()          # Track usage statistics
bulk_import_corrections()  # Batch INSERT with conflict resolution
```

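A hedged sketch of what `bulk_import_corrections()` could look like: one transaction, `executemany`, and `ON CONFLICT ... DO NOTHING` (SQLite ≥ 3.24) to skip rows that already exist for the same `(from_text, domain)` pair. The table layout is deliberately simplified:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE corrections (
           from_text TEXT NOT NULL,
           to_text   TEXT NOT NULL,
           domain    TEXT NOT NULL DEFAULT 'general',
           UNIQUE (from_text, domain)
       )"""
)


def bulk_import(conn, rows):
    """Insert rows in one transaction; return how many were inserted."""
    with conn:  # commits on success, rolls back on exception
        before = conn.total_changes
        conn.executemany(
            """INSERT INTO corrections (from_text, to_text, domain)
               VALUES (?, ?, ?)
               ON CONFLICT (from_text, domain) DO NOTHING""",
            rows,
        )
        inserted = conn.total_changes - before
    return inserted
```

Re-importing a shared dictionary is then idempotent: duplicates are silently skipped instead of aborting the whole batch.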
**Transaction Management**:

```python
@contextmanager
def _transaction(self):
    conn = self._get_connection()
    try:
        conn.execute("BEGIN IMMEDIATE")
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
```

### correction_service.py (Business Logic Layer)

**Responsibilities**:

- Input validation and sanitization
- Business rule enforcement
- Orchestrate repository operations
- Import/export with conflict detection
- Statistics calculation

**Key Methods**:

```python
# Validation
validate_correction_text()  # Check length, control chars, NULL bytes
validate_domain_name()      # Prevent path traversal, injection
validate_confidence()       # Range check (0.0-1.0)
validate_source()           # Enum validation

# Operations
add_correction()            # Validate + repository.add
get_corrections()           # Get corrections for domain
remove_correction()         # Validate + repository.delete

# Import/Export
import_corrections()        # Pre-validate + bulk import + conflict detection
export_corrections()        # Query + format as JSON

# Analytics
get_statistics()            # Calculate metrics per domain
```

**Validation Rules**:

```python
@dataclass
class ValidationRules:
    max_text_length: int = 1000
    min_text_length: int = 1
    max_domain_length: int = 50
    allowed_domain_pattern: str = r'^[a-zA-Z0-9_-]+$'
```

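Applied to a domain name, the rules above amount to a length check plus a whitelist pattern. This sketch assumes `re.fullmatch` semantics for the check described; the real `validate_domain_name()` may differ:

```python
import re
from dataclasses import dataclass


@dataclass
class ValidationRules:
    max_text_length: int = 1000
    min_text_length: int = 1
    max_domain_length: int = 50
    allowed_domain_pattern: str = r'^[a-zA-Z0-9_-]+$'


DEFAULT_RULES = ValidationRules()


def validate_domain_name(domain: str, rules: ValidationRules = DEFAULT_RULES) -> bool:
    if not domain or len(domain) > rules.max_domain_length:
        return False
    # The whitelist rejects '/', '.', and whitespace, which is what
    # blocks path traversal and injection attempts.
    return re.fullmatch(rules.allowed_domain_pattern, domain) is not None
```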
### CLI Integration (commands.py)

**Direct Service Usage**:

```python
def _get_service():
    """Get configured CorrectionService instance."""
    config_dir = Path.home() / ".transcript-fixer"
    db_path = config_dir / "corrections.db"
    repository = CorrectionRepository(db_path)
    return CorrectionService(repository)

def cmd_add_correction(args):
    service = _get_service()
    service.add_correction(args.from_text, args.to_text, args.domain)
```

**Benefits of Direct Integration**:

- No unnecessary abstraction layers
- Clear data flow: CLI → Service → Repository
- Easy to understand and debug
- Performance: one less function call per operation

### dictionary_processor.py (Stage 1)

**Responsibilities**:

- Apply context-aware regex rules
- Apply simple dictionary replacements
- Track all changes with line numbers

**Processing Order**:

1. Context rules first (higher priority)
2. Dictionary replacements second

**Key Methods**:

```python
process(text) -> (corrected_text, changes)
_apply_context_rules()
_apply_dictionary()
get_summary(changes)
```

**Change Tracking**:

```python
@dataclass
class Change:
    line_number: int
    from_text: str
    to_text: str
    rule_type: str  # "dictionary" or "context_rule"
    rule_name: str
```

### ai_processor.py (Stage 2)

**Responsibilities**:

- Split text into API-friendly chunks
- Call the GLM-4.6 API
- Handle retries with a fallback model
- Track AI-suggested changes

**Key Methods**:

```python
process(text, context) -> (corrected_text, changes)
_split_into_chunks()  # Respect paragraph boundaries
_process_chunk()      # Single API call
_build_prompt()       # Construct correction prompt
```

**Chunking Strategy**:

- Max 6000 characters per chunk
- Split on paragraph boundaries (`\n\n`)
- If a paragraph is too long, split on sentences
- Preserve context across chunks

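The chunking strategy above can be sketched as a greedy packer over paragraphs. This is an illustration of the described behavior, not the real `_split_into_chunks()`; the sentence-level fallback for oversized paragraphs is omitted:

```python
from __future__ import annotations


def split_into_chunks(text: str, max_chars: int = 6000) -> list[str]:
    """Greedily pack paragraphs (split on blank lines) into chunks
    of at most max_chars characters."""
    paragraphs = text.split("\n\n")
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = para if not current else current + "\n\n" + para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = para  # may still exceed max_chars; see fallback note
    if current:
        chunks.append(current)
    return chunks
```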
**Error Handling**:

- Retry with the fallback model (GLM-4.5-Air)
- If both fail, use the original text
- Never lose the user's data

### learning_engine.py (Pattern Detection)

**Responsibilities**:

- Analyze correction history
- Detect recurring patterns
- Calculate confidence scores
- Generate suggestions for review
- Track rejected suggestions

**Algorithm**:

```python
1. Query correction_history table
2. Extract stage2 (AI) changes
3. Group by pattern (from→to)
4. Count frequency
5. Calculate confidence
6. Filter by thresholds:
   - frequency ≥ 3
   - confidence ≥ 0.8
7. Insert into learned_suggestions table (status='pending')
```

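Steps 3-6 can be sketched with a `Counter`. The consistency-based confidence here is a simplification for illustration; the real scoring lives in `learning_engine.py`:

```python
from collections import Counter


def detect_patterns(changes, min_frequency=3, min_confidence=0.8):
    """changes: list of (from_text, to_text) pairs from stage-2 runs."""
    counts = Counter((frm, to) for frm, to in changes)
    from_totals = Counter(frm for frm, _ in changes)
    suggestions = []
    for (frm, to), freq in counts.items():
        # Consistency: how often this from_text maps to this to_text.
        confidence = freq / from_totals[frm]
        if freq >= min_frequency and confidence >= min_confidence:
            suggestions.append(
                {"from": frm, "to": to, "frequency": freq, "confidence": confidence}
            )
    return suggestions
```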
**Confidence Calculation**:

```python
confidence = (
    0.5 * frequency_score +    # More occurrences = higher
    0.3 * consistency_score +  # Always the same correction
    0.2 * recency_score        # Recent = higher
)
```

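The weighted sum above, with each component normalized to [0, 1]. The normalizations (frequency capped at 10, recency decaying over 30 days) are assumptions for illustration, not the engine's actual constants:

```python
def calculate_confidence(frequency: int, consistency: float, days_since_last: int) -> float:
    frequency_score = min(frequency, 10) / 10      # saturates at 10 occurrences
    consistency_score = consistency                # already 0.0-1.0
    recency_score = max(0.0, 1.0 - days_since_last / 30)
    return round(
        0.5 * frequency_score + 0.3 * consistency_score + 0.2 * recency_score,
        3,
    )
```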
**Key Methods**:

```python
analyze_and_suggest()  # Main analysis pipeline
approve_suggestion()   # Promote to the corrections table
reject_suggestion()    # Mark suggestion as rejected
list_pending()         # Get all pending suggestions
```

### diff_generator.py (Stage 3)

**Responsibilities**:

- Generate comparison reports
- Multiple output formats
- Word-level diff analysis

**Output Formats**:

1. Markdown summary (statistics + change list)
2. Unified diff (standard diff format)
3. HTML side-by-side (visual comparison)
4. Inline marked ([-old-] [+new+])

**Not Modified**: Kept as-is from the original implementation (working well)

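Format 4 can be sketched with `difflib.SequenceMatcher` over words; this is an illustrative implementation, not the module's code:

```python
import difflib


def inline_diff(original: str, corrected: str) -> str:
    """Word-level inline diff marked as [-old-] [+new+]."""
    a, b = original.split(), corrected.split()
    sm = difflib.SequenceMatcher(a=a, b=b)
    out = []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "equal":
            out.extend(a[i1:i2])
        else:
            if i1 < i2:  # words removed or replaced
                out.append("[-" + " ".join(a[i1:i2]) + "-]")
            if j1 < j2:  # words inserted or replacing
                out.append("[+" + " ".join(b[j1:j2]) + "+]")
    return " ".join(out)
```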
## State Management

### Database-Backed State

- All state stored in `~/.transcript-fixer/corrections.db`
- SQLite handles caching and transactions
- ACID guarantees prevent corruption
- Backup created before migrations

### Thread-Safe Access

- Thread-local connections (one per thread)
- BEGIN IMMEDIATE for write transactions
- No global state or shared mutable data
- Each operation is independent (stateless modules)

### Soft Deletes

- Records marked inactive (is_active=0) instead of DELETE
- Preserves the audit trail
- Can be reactivated if needed

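The soft-delete idea in one runnable sketch (simplified table, hypothetical helper names): the row is flipped inactive rather than removed, so it disappears from active queries but survives for auditing or reactivation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE corrections (from_text TEXT, to_text TEXT, is_active INTEGER DEFAULT 1)"
)
conn.execute("INSERT INTO corrections (from_text, to_text) VALUES ('巨升', '具身')")


def soft_delete(conn, from_text):
    # Flip the flag instead of DELETE - the row stays in the table.
    conn.execute(
        "UPDATE corrections SET is_active = 0 WHERE from_text = ?", (from_text,)
    )


def active_corrections(conn):
    return conn.execute(
        "SELECT from_text FROM corrections WHERE is_active = 1"
    ).fetchall()
```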
## Error Handling Strategy

### Fail Fast for User Errors

```python
if not skill_path.exists():
    print("❌ Error: Skill directory not found")
    sys.exit(1)
```

### Retry for Transient Errors

```python
try:
    api_call(model_primary)
except Exception:
    try:
        api_call(model_fallback)
    except Exception:
        use_original_text()
```

### Backup Before Destructive Operations

```python
if target_file.exists():
    shutil.copy2(target_file, backup_file)
# Then overwrite target_file
```

## Testing Strategy

### Unit Testing (Recommended)

```python
# Test dictionary processor
def test_dictionary_processor():
    corrections = {"错误": "正确"}
    processor = DictionaryProcessor(corrections, [])
    text = "这是错误的文本"
    result, changes = processor.process(text)
    assert result == "这是正确的文本"
    assert len(changes) == 1

# Test learning engine thresholds
def test_learning_thresholds():
    engine = LearningEngine(history_dir, learned_dir)
    # Create mock history with a pattern appearing 3+ times
    suggestions = engine.analyze_and_suggest()
    assert len(suggestions) > 0
```

### Integration Testing

```bash
# End-to-end test
python fix_transcription.py --init
python fix_transcription.py --add "test" "TEST"
python fix_transcription.py --input test.md --stage 3
# Verify output files exist
```

## Performance Considerations

### Bottlenecks

1. **AI API calls**: Slowest part (60s timeout per chunk)
2. **Database I/O**: Negligible (indexed SQLite queries are fast)
3. **Pattern matching**: Fast (regex + dict lookups)

### Optimization Strategies

1. **Stage 1 first**: Test dictionary corrections before expensive AI calls
2. **Chunking**: Process large files in parallel chunks (future enhancement)
3. **Caching**: Could cache API results by content hash (future enhancement)

### Scalability

**Current capabilities (v2.0 with SQLite)**:

- File size: unlimited (chunking handles large files)
- Corrections: tested up to 100,000 entries (with indexes)
- History: unlimited (handled efficiently by the database)
- Concurrent access: thread-safe with ACID guarantees
- Query performance: O(log n) with B-tree indexes

**Performance improvements from SQLite**:

- Indexed queries (domain, source, added_at)
- Views for common aggregations
- Batch imports with transactions
- Soft deletes (no data loss)

**Future improvements**:

- Parallel chunk processing for AI calls
- API response caching
- Full-text search for corrections

## Security Architecture

### Secret Management

- API keys via environment variables only
- Never hardcode credentials
- The security scanner enforces this

### Backup Security

- `.bak` files get the same permissions as the originals
- No encryption (user's responsibility)
- Recommendation: use encrypted filesystems

### Git Security

- `.gitignore` entries for `.bak` files
- Private repos recommended
- Security scan before commits

## Extensibility Points

### Adding New Processors

1. Create a new processor class
2. Implement the `process(text) -> (result, changes)` interface
3. Add it to the orchestrator workflow

Example:

```python
class SpellCheckProcessor:
    def process(self, text):
        # Custom spell-checking logic
        return corrected_text, changes
```

### Adding New Learning Algorithms

1. Subclass `LearningEngine`
2. Override `_calculate_confidence()`
3. Adjust thresholds as needed

### Adding New Export Formats

1. Add a method to `CorrectionService`
2. Support the new file format
3. Add a CLI command

## Dependencies

### Required

- Python 3.8+ (`from __future__ import annotations`)
- `httpx` (for API calls)

### Optional

- `diff` command (for unified diffs)
- Git (for version control)

### Development

- `pytest` (for testing)
- `black` (for formatting)
- `mypy` (for type checking)

## Deployment

### User Installation

```bash
# 1. Clone or download the skill to your workspace
git clone <repo> transcript-fixer
cd transcript-fixer

# 2. Install dependencies
pip install -r requirements.txt

# 3. Initialize
python scripts/fix_transcription.py --init

# 4. Set the API key
export GLM_API_KEY="KEY_VALUE"

# Ready to use!
```

### CI/CD Pipeline (Future)

```yaml
# Potential GitHub Actions workflow
test:
  - Install dependencies
  - Run unit tests
  - Run integration tests
  - Check code style (black, mypy)

security:
  - Run security_scan.py
  - Check for secrets

deploy:
  - Package skill
  - Upload to skill marketplace
```

## Further Reading

- SOLID principles: https://en.wikipedia.org/wiki/SOLID
- API patterns: `references/glm_api_setup.md`
- File formats: `references/file_formats.md`
- Testing: https://docs.pytest.org/

428  transcript-fixer/references/best_practices.md  Normal file

@@ -0,0 +1,428 @@
# Best Practices

Recommendations for effective use of transcript-fixer based on production experience.

## Table of Contents

- [Getting Started](#getting-started)
  - [Build Foundation Before Scaling](#build-foundation-before-scaling)
  - [Review Learned Suggestions Regularly](#review-learned-suggestions-regularly)
- [Domain Organization](#domain-organization)
  - [Use Domain Separation](#use-domain-separation)
  - [Domain Selection Strategy](#domain-selection-strategy)
- [Cost Optimization](#cost-optimization)
  - [Test Dictionary Changes Before AI Calls](#test-dictionary-changes-before-ai-calls)
  - [Approve High-Confidence Suggestions](#approve-high-confidence-suggestions)
- [Team Collaboration](#team-collaboration)
  - [Export Corrections for Version Control](#export-corrections-for-version-control)
  - [Share Corrections via Import/Merge](#share-corrections-via-importmerge)
- [Data Management](#data-management)
  - [Database Backup Strategy](#database-backup-strategy)
  - [Cleanup Strategy](#cleanup-strategy)
- [Workflow Efficiency](#workflow-efficiency)
  - [File Organization](#file-organization)
  - [Batch Processing](#batch-processing)
  - [Context Rules for Edge Cases](#context-rules-for-edge-cases)
- [Quality Assurance](#quality-assurance)
  - [Validate After Manual Changes](#validate-after-manual-changes)
  - [Monitor Learning Quality](#monitor-learning-quality)
- [Production Deployment](#production-deployment)
  - [Environment Variables](#environment-variables)
  - [Monitoring](#monitoring)
  - [Performance](#performance)
- [Summary](#summary)

## Getting Started

### Build Foundation Before Scaling

**Start small**: Begin with 5-10 manually added corrections for the most common errors in your domain.

```bash
# Example: embodied AI domain
uv run scripts/fix_transcription.py --add "巨升智能" "具身智能" --domain embodied_ai
uv run scripts/fix_transcription.py --add "巨升" "具身" --domain embodied_ai
uv run scripts/fix_transcription.py --add "奇迹创坛" "奇绩创坛" --domain embodied_ai
```

**Let learning discover the rest**: After 3-5 correction runs, the learning system will suggest additional patterns automatically.

**Rationale**: Manual corrections provide high-quality training data, and learning compounds their value with every run.

### Review Learned Suggestions Regularly

**Frequency**: Every 3-5 correction runs

```bash
uv run scripts/fix_transcription.py --review-learned
```

**Why**: Approved learned corrections move from Stage 2 (AI, expensive) to Stage 1 (dictionary, cheap and instant).

**Impact**:

- 10x faster processing (no API calls)
- Zero cost for repeated patterns
- Builds domain-specific vocabulary automatically

## Domain Organization

### Use Domain Separation

**Prevent conflicts**: The same phonetic error may need different corrections in different domains.

**Example**:
- Finance domain: "股价" (stock price) is a real term and must be left alone
- General domain: "股价" is usually an ASR mishearing of "框架" (framework)

```bash
# Domain-specific corrections
uv run scripts/fix_transcription.py --add "股价" "框架" --domain general
# No correction needed in the finance domain - "股价" is correct there
```

**Available domains**:
- `general` (default) - General-purpose corrections
- `embodied_ai` - Robotics and embodied AI terminology
- `finance` - Financial terminology
- `medical` - Medical terminology

**Custom domains**: Any string matching `^[a-z0-9_]+$` (lowercase letters, digits, underscores).

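A minimal validator for that pattern (a sketch; the CLI enforces its own rules and may apply additional checks):

```python
import re

# Allowed custom domain names: lowercase letters, digits, underscores
DOMAIN_RE = re.compile(r"[a-z0-9_]+")

def is_valid_domain(name: str) -> bool:
    """Return True if `name` is usable as a --domain value."""
    return DOMAIN_RE.fullmatch(name) is not None
```

For example, `is_valid_domain("yc_china")` is true, while `is_valid_domain("YC-China")` is false.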
### Domain Selection Strategy

1. **Default domain** for general corrections (dates, common words)
2. **Specialized domains** for technical terminology
3. **Project domains** for company/product-specific terms

```bash
# Project-specific domain
uv run scripts/fix_transcription.py --add "我司" "奇绩创坛" --domain yc_china
```

## Cost Optimization

### Test Dictionary Changes Before AI Calls

**Problem**: AI calls (Stage 2) consume API quota and time.

**Solution**: Test dictionary changes with Stage 1 first.

```bash
# 1. Add new corrections
uv run scripts/fix_transcription.py --add "新错误" "正确词" --domain general

# 2. Test on a small sample (Stage 1 only)
uv run scripts/fix_transcription.py --input sample.md --stage 1

# 3. Review output
less sample_stage1.md

# 4. If satisfied, run the full pipeline on large files
uv run scripts/fix_transcription.py --input large_file.md --stage 3
```

**Savings**: Avoids spending API quota on files that only need dictionary corrections.

### Approve High-Confidence Suggestions

**Check suggestions regularly**:

```bash
uv run scripts/fix_transcription.py --review-learned
```

**Approve suggestions with**:
- Frequency ≥ 5
- Confidence ≥ 0.9
- A pattern that makes semantic sense

**Impact**: Each approved suggestion saves future API calls.

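The first two criteria translate directly into a query over the `learned_suggestions` table documented in `file_formats.md` (a sketch using only the standard library; the semantic-sense check still needs a human):

```python
import sqlite3

def approval_candidates(db_path, min_freq=5, min_conf=0.9):
    """List pending suggestions that meet the approval bar."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(
            """
            SELECT from_text, to_text, frequency, confidence
            FROM learned_suggestions
            WHERE status = 'pending'
              AND frequency >= ?
              AND confidence >= ?
            ORDER BY confidence DESC, frequency DESC
            """,
            (min_freq, min_conf),
        ).fetchall()
    finally:
        conn.close()
```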
## Team Collaboration

### Export Corrections for Version Control

**Don't commit** `.db` files to Git:
- Binary format causes merge conflicts
- Database grows over time (bloats the repository)
- Not human-reviewable

**Do commit** JSON exports:

```bash
# Export domain dictionaries
uv run scripts/fix_transcription.py --export general_$(date +%Y%m%d).json --domain general
uv run scripts/fix_transcription.py --export embodied_ai_$(date +%Y%m%d).json --domain embodied_ai

# .gitignore
*.db
*.db-journal
*.bak

# Commit exports
git add *_corrections.json
git commit -m "Update correction dictionaries"
```

### Share Corrections via Import/Merge

**Always use the `--merge` flag** to combine corrections:

```bash
# Pull latest from team
git pull origin main

# Import new corrections (merge mode)
uv run scripts/fix_transcription.py --import general_20250128.json --merge
uv run scripts/fix_transcription.py --import embodied_ai_20250128.json --merge
```

**Merge behavior**:
- New corrections: inserted
- Existing corrections with higher incoming confidence: updated
- Existing corrections with lower incoming confidence: skipped
- Preserves local customizations

See `team_collaboration.md` for Git workflows and conflict handling.

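The merge rules above boil down to a three-way decision per incoming correction. A sketch of the documented behavior, not the actual implementation (the tie case is assumed to skip):

```python
def merge_decision(existing, incoming):
    """Decide what --merge does with one incoming correction.

    existing: dict with 'confidence', or None when the (from_text, domain)
    pair is absent from the local database. incoming: dict with 'confidence'.
    Returns 'insert', 'update', or 'skip'.
    """
    if existing is None:
        return "insert"          # new correction
    if incoming["confidence"] > existing["confidence"]:
        return "update"          # incoming is more trusted
    return "skip"                # keep the local customization
```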
## Data Management

### Database Backup Strategy

**Automatic backups**: The database creates timestamped backups before migrations:

```
~/.transcript-fixer/
├── corrections.db
├── corrections.20250128_140532.bak
└── corrections.20250127_093021.bak
```

**Manual backups** before bulk changes:

```bash
cp ~/.transcript-fixer/corrections.db ~/backups/corrections_$(date +%Y%m%d).db
```

**Or use SQLite's backup command**:

```bash
sqlite3 ~/.transcript-fixer/corrections.db ".backup ~/backups/corrections.db"
```

### Cleanup Strategy

**History retention**: Keep recent history, archive old entries:

```bash
# Archive history older than 90 days
sqlite3 ~/.transcript-fixer/corrections.db "
DELETE FROM correction_history
WHERE run_timestamp < datetime('now', '-90 days');
"

# Reclaim space
sqlite3 ~/.transcript-fixer/corrections.db "VACUUM;"
```

**Suggestion cleanup**: Reject low-confidence suggestions periodically:

```bash
# Reject rarely-seen (frequency < 3), low-confidence suggestions
sqlite3 ~/.transcript-fixer/corrections.db "
UPDATE learned_suggestions
SET status = 'rejected'
WHERE frequency < 3 AND confidence < 0.7;
"
```

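For scripted backups, Python's `sqlite3` module exposes the same online-backup mechanism as the `.backup` command (a sketch; the destination naming is illustrative):

```python
import sqlite3
from datetime import datetime
from pathlib import Path

def backup_db(src: Path, backup_dir: Path) -> Path:
    """Copy the corrections database with SQLite's online backup API,
    which yields a consistent copy even if another process has it open."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    dest = backup_dir / f"corrections_{datetime.now():%Y%m%d_%H%M%S}.db"
    source, target = sqlite3.connect(src), sqlite3.connect(dest)
    try:
        source.backup(target)  # page-by-page copy inside a read transaction
    finally:
        source.close()
        target.close()
    return dest
```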
## Workflow Efficiency

### File Organization

**Use consistent naming**:
```
meeting_20250128.md          # Original transcript
meeting_20250128_stage1.md   # Dictionary corrections
meeting_20250128_stage2.md   # Final corrected version
```

**Generate diff reports** for review:

```bash
uv run scripts/diff_generator.py \
    meeting_20250128.md \
    meeting_20250128_stage1.md \
    meeting_20250128_stage2.md
```

**Output formats**:
- Markdown report (what changed, statistics)
- Unified diff (git-style)
- HTML side-by-side (visual review)
- Inline markers (for direct editing)

### Batch Processing

**Process similar files together** to amplify learning:

```bash
# Day 1: Process 5 similar meetings
for file in meeting_*.md; do
    uv run scripts/fix_transcription.py --input "$file" --stage 3 --domain embodied_ai
done

# Day 2: Review learned patterns
uv run scripts/fix_transcription.py --review-learned

# Approve good suggestions
uv run scripts/fix_transcription.py --approve "常见错误1" "正确词1"
uv run scripts/fix_transcription.py --approve "常见错误2" "正确词2"

# Day 3: Future files benefit from dictionary corrections
```

### Context Rules for Edge Cases

**Use regex context rules** for:
- Positional dependencies (e.g., "的" vs "地" before verbs)
- Multi-word patterns
- Traditional vs simplified Chinese

**Example**:

```bash
sqlite3 ~/.transcript-fixer/corrections.db

# "的" before a verb → "地"
INSERT INTO context_rules (pattern, replacement, description, priority)
VALUES ('近距离的去看', '近距离地去看', '的→地 before verb', 10);

# Preserve correct usage
INSERT INTO context_rules (pattern, replacement, description, priority)
VALUES ('近距离搏杀', '近距离搏杀', '的 is correct here (noun modifier)', 5);
```

**Priority**: Higher numbers run first (use high priority for exceptions).

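The priority semantics can be sketched in a few lines (an illustration of the ordering, not the script's actual Stage 1 implementation):

```python
import re

def apply_context_rules(text, rules):
    """Apply (pattern, replacement, priority) rules, highest priority first."""
    for pattern, replacement, _priority in sorted(rules, key=lambda r: r[2], reverse=True):
        text = re.sub(pattern, replacement, text)
    return text
```

Because the priority-10 rule rewrites "近距离的去看" before any lower-priority rule can touch the substring "近距离", exceptions stay protected.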
## Quality Assurance

### Validate After Manual Changes

**After direct SQL edits**:

```bash
uv run scripts/fix_transcription.py --validate
```

**After imports**:

```bash
# Check statistics
uv run scripts/fix_transcription.py --list --domain general | head -20

# Verify specific corrections
sqlite3 ~/.transcript-fixer/corrections.db "
SELECT from_text, to_text, source, confidence
FROM active_corrections
WHERE domain = 'general'
ORDER BY added_at DESC
LIMIT 10;
"
```

### Monitor Learning Quality

**Check the suggestion confidence distribution**:

```bash
sqlite3 ~/.transcript-fixer/corrections.db "
SELECT
  CASE
    WHEN confidence >= 0.9 THEN 'high (>=0.9)'
    WHEN confidence >= 0.8 THEN 'medium (0.8-0.9)'
    ELSE 'low (<0.8)'
  END as confidence_level,
  COUNT(*) as count
FROM learned_suggestions
WHERE status = 'pending'
GROUP BY confidence_level;
"
```

**Review examples** for low-confidence suggestions:

```bash
sqlite3 ~/.transcript-fixer/corrections.db "
SELECT s.from_text, s.to_text, s.confidence, e.context
FROM learned_suggestions s
JOIN suggestion_examples e ON s.id = e.suggestion_id
WHERE s.confidence < 0.8 AND s.status = 'pending';
"
```

## Production Deployment

### Environment Variables

**Set permanently** in production:

```bash
# Add to /etc/environment or the systemd service unit
GLM_API_KEY=your-production-key
```

### Monitoring

**Track usage statistics**:

```bash
# Corrections by source
sqlite3 ~/.transcript-fixer/corrections.db "
SELECT source, COUNT(*) as count, SUM(usage_count) as total_usage
FROM corrections
WHERE is_active = 1
GROUP BY source;
"

# Success rate
sqlite3 ~/.transcript-fixer/corrections.db "
SELECT
  COUNT(*) as total_runs,
  SUM(CASE WHEN success = 1 THEN 1 ELSE 0 END) as successful,
  ROUND(100.0 * SUM(CASE WHEN success = 1 THEN 1 ELSE 0 END) / COUNT(*), 2) as success_rate
FROM correction_history;
"
```

### Performance

**Database optimization**:

```bash
# Rebuild indexes periodically
sqlite3 ~/.transcript-fixer/corrections.db "REINDEX;"

# Refresh the query planner's statistics
sqlite3 ~/.transcript-fixer/corrections.db "ANALYZE;"

# Vacuum to reclaim space
sqlite3 ~/.transcript-fixer/corrections.db "VACUUM;"
```

## Summary

**Key principles**:
1. Start small, let learning amplify
2. Use domain separation for quality
3. Test dictionary changes before AI calls
4. Export to JSON for version control
5. Review and approve learned suggestions
6. Validate after manual changes
7. Monitor learning quality
8. Backup before bulk operations

**ROI timeline**:
- Week 1: Build foundation (10-20 manual corrections)
- Week 2-3: Learning kicks in (20-50 suggestions)
- Month 2+: Mature vocabulary (80%+ dictionary coverage, minimal AI calls)

transcript-fixer/references/dictionary_guide.md (new file, 97 lines)

# Correction Dictionary Configuration Guide

## Dictionary Structure

The correction dictionary lives in `fix_transcription.py` and has two parts:

### 1. Context Rules (CONTEXT_RULES)

For replacements that depend on surrounding context:

```python
CONTEXT_RULES = [
    {
        "pattern": r"regex pattern",
        "replacement": "replacement text",
        "description": "what the rule does"
    }
]
```

**Example:**
```python
{
    "pattern": r"近距离的去看",
    "replacement": "近距离地去看",
    "description": "fix '的' to '地'"
}
```

### 2. General Dictionary (CORRECTIONS_DICT)

For direct string replacements:

```python
CORRECTIONS_DICT = {
    "wrong term": "correct term",
}
```

**Example:**
```python
{
    "巨升智能": "具身智能",
    "奇迹创坛": "奇绩创坛",
    "矩阵公司": "初创公司",
}
```

## Adding Custom Rules

### Step 1: Identify the error pattern

Look for recurring errors in the correction reports.

### Step 2: Choose the rule type

- **Simple replacement** → use CORRECTIONS_DICT
- **Context-dependent** → use CONTEXT_RULES

### Step 3: Add it to the dictionary

Edit `scripts/fix_transcription.py`:

```python
CORRECTIONS_DICT = {
    # existing rules...
    "your wrong term": "correct term",  # add the new rule
}
```

### Step 4: Test

Run the correction script to verify the new rule.

## Common Error Types

### Homophone errors
```python
"股价": "框架",
"三观": "三关",
```

### Technical terminology
```python
"巨升智能": "具身智能",
"近距离": "具身",  # in certain contexts
```

### Company names
```python
"奇迹创坛": "奇绩创坛",
```

## Priority

1. CONTEXT_RULES are applied first (exact matches)
2. CORRECTIONS_DICT is applied second (global replacement)

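The two-pass order can be sketched like this (an illustration of the priority described above, not the script's actual code):

```python
import re

def apply_corrections(text, context_rules, corrections_dict):
    """Pass 1: context rules (precise regex matches).
    Pass 2: global dictionary replacement."""
    for rule in context_rules:
        text = re.sub(rule["pattern"], rule["replacement"], text)
    for wrong, right in corrections_dict.items():
        text = text.replace(wrong, right)
    return text
```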
transcript-fixer/references/file_formats.md (new file, 395 lines)

# Storage Format Reference

This document describes the SQLite database format used by transcript-fixer v2.0.

## Table of Contents

- [Database Location](#database-location)
- [Database Schema](#database-schema)
- [Core Tables](#core-tables)
- [Views](#views)
- [Querying the Database](#querying-the-database)
- [Using Python API](#using-python-api)
- [Using SQLite CLI](#using-sqlite-cli)
- [Import/Export](#importexport)
- [Export to JSON](#export-to-json)
- [Import from JSON](#import-from-json)
- [Backup Strategy](#backup-strategy)
- [Automatic Backups](#automatic-backups)
- [Manual Backups](#manual-backups)
- [Version Control](#version-control)
- [Best Practices](#best-practices)
- [Troubleshooting](#troubleshooting)
- [Database Locked](#database-locked)
- [Corrupted Database](#corrupted-database)
- [Missing Tables](#missing-tables)

## Database Location

**Path**: `~/.transcript-fixer/corrections.db`

**Type**: SQLite 3 database with ACID guarantees

## Database Schema

### Core Tables

#### corrections

Main correction dictionary storage.

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| id | INTEGER | PRIMARY KEY | Auto-increment ID |
| from_text | TEXT | NOT NULL | Original (incorrect) text |
| to_text | TEXT | NOT NULL | Corrected text |
| domain | TEXT | DEFAULT 'general' | Correction domain |
| source | TEXT | CHECK IN ('manual', 'learned', 'imported') | Origin of correction |
| confidence | REAL | CHECK 0.0-1.0 | Confidence score |
| added_by | TEXT | | User who added |
| added_at | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | When added |
| usage_count | INTEGER | DEFAULT 0, CHECK >= 0 | Times used |
| last_used | TIMESTAMP | | Last usage time |
| notes | TEXT | | Optional notes |
| is_active | BOOLEAN | DEFAULT 1 | Soft delete flag |

**Unique Constraint**: `(from_text, domain)`

**Indexes**: domain, source, added_at, is_active, from_text

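The column table corresponds roughly to the following DDL (inferred from this document; the authoritative schema is the one `CorrectionRepository` creates, so treat this as documentation):

```python
import sqlite3

# DDL inferred from the column table above; constraint spellings are assumptions.
CORRECTIONS_DDL = """
CREATE TABLE IF NOT EXISTS corrections (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    from_text   TEXT NOT NULL,
    to_text     TEXT NOT NULL,
    domain      TEXT DEFAULT 'general',
    source      TEXT CHECK (source IN ('manual', 'learned', 'imported')),
    confidence  REAL CHECK (confidence >= 0.0 AND confidence <= 1.0),
    added_by    TEXT,
    added_at    TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    usage_count INTEGER DEFAULT 0 CHECK (usage_count >= 0),
    last_used   TIMESTAMP,
    notes       TEXT,
    is_active   BOOLEAN DEFAULT 1,
    UNIQUE (from_text, domain)
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(CORRECTIONS_DDL)
```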
#### context_rules

Regex-based context-aware correction rules.

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| id | INTEGER | PRIMARY KEY | Auto-increment ID |
| pattern | TEXT | NOT NULL, UNIQUE | Regex pattern |
| replacement | TEXT | NOT NULL | Replacement text |
| description | TEXT | | Rule explanation |
| priority | INTEGER | DEFAULT 0 | Higher = applied first |
| is_active | BOOLEAN | DEFAULT 1 | Enable/disable |
| added_at | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | When added |
| added_by | TEXT | | User who added |

**Indexes**: priority (DESC), is_active

#### correction_history

Audit log for all correction runs.

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| id | INTEGER | PRIMARY KEY | Auto-increment ID |
| filename | TEXT | NOT NULL | File corrected |
| domain | TEXT | NOT NULL | Domain used |
| run_timestamp | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | When run |
| original_length | INTEGER | CHECK >= 0 | Original file size |
| stage1_changes | INTEGER | CHECK >= 0 | Dictionary changes |
| stage2_changes | INTEGER | CHECK >= 0 | AI changes |
| model | TEXT | | AI model used |
| execution_time_ms | INTEGER | | Runtime in ms |
| success | BOOLEAN | DEFAULT 1 | Success flag |
| error_message | TEXT | | Error if failed |

**Indexes**: run_timestamp (DESC), domain, success

#### correction_changes

Detailed changes made in each run.

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| id | INTEGER | PRIMARY KEY | Auto-increment ID |
| history_id | INTEGER | FOREIGN KEY → correction_history | Parent run |
| line_number | INTEGER | | Line in file |
| from_text | TEXT | NOT NULL | Original text |
| to_text | TEXT | NOT NULL | Corrected text |
| rule_type | TEXT | CHECK IN ('context', 'dictionary', 'ai') | Rule type |
| rule_id | INTEGER | | Reference to rule |
| context_before | TEXT | | Text before |
| context_after | TEXT | | Text after |

**Foreign Key**: history_id → correction_history.id (CASCADE DELETE)

**Indexes**: history_id, rule_type

#### learned_suggestions

AI-detected patterns pending review.

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| id | INTEGER | PRIMARY KEY | Auto-increment ID |
| from_text | TEXT | NOT NULL | Pattern detected |
| to_text | TEXT | NOT NULL | Suggested correction |
| domain | TEXT | DEFAULT 'general' | Domain |
| frequency | INTEGER | CHECK > 0 | Times seen |
| confidence | REAL | CHECK 0.0-1.0 | Confidence score |
| first_seen | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | First occurrence |
| last_seen | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | Last occurrence |
| status | TEXT | CHECK IN ('pending', 'approved', 'rejected') | Review status |
| reviewed_at | TIMESTAMP | | When reviewed |
| reviewed_by | TEXT | | Who reviewed |

**Unique Constraint**: `(from_text, to_text, domain)`

**Indexes**: status, domain, confidence (DESC), frequency (DESC)

#### suggestion_examples

Example occurrences of learned patterns.

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| id | INTEGER | PRIMARY KEY | Auto-increment ID |
| suggestion_id | INTEGER | FOREIGN KEY → learned_suggestions | Parent suggestion |
| filename | TEXT | NOT NULL | File where found |
| line_number | INTEGER | | Line number |
| context | TEXT | NOT NULL | Surrounding text |
| occurred_at | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | When found |

**Foreign Key**: suggestion_id → learned_suggestions.id (CASCADE DELETE)

**Index**: suggestion_id

#### system_config

System configuration key-value store.

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| key | TEXT | PRIMARY KEY | Config key |
| value | TEXT | NOT NULL | Config value |
| value_type | TEXT | CHECK IN ('string', 'int', 'float', 'boolean', 'json') | Value type |
| description | TEXT | | Config description |
| updated_at | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | Last update |

**Default Values**:
- `schema_version`: "2.0"
- `api_provider`: "GLM"
- `api_model`: "GLM-4.6"
- `default_domain`: "general"
- `auto_learn_enabled`: "true"
- `learning_frequency_threshold`: "3"
- `learning_confidence_threshold`: "0.8"

#### audit_log

Comprehensive audit trail for all operations.

| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| id | INTEGER | PRIMARY KEY | Auto-increment ID |
| timestamp | TIMESTAMP | DEFAULT CURRENT_TIMESTAMP | When occurred |
| action | TEXT | NOT NULL | Action type |
| entity_type | TEXT | NOT NULL | Entity affected |
| entity_id | INTEGER | | Entity ID |
| user | TEXT | | User who performed it |
| details | TEXT | | Action details |
| success | BOOLEAN | DEFAULT 1 | Success flag |
| error_message | TEXT | | Error if failed |

**Indexes**: timestamp (DESC), action, entity_type, success

### Views

#### active_corrections

Quick access to active corrections.

```sql
SELECT id, from_text, to_text, domain, source, confidence, usage_count, last_used, added_at
FROM corrections
WHERE is_active = 1
ORDER BY domain, from_text;
```

#### pending_suggestions

Suggestions pending review, with an example count.

```sql
SELECT s.id, s.from_text, s.to_text, s.domain, s.frequency, s.confidence,
       s.first_seen, s.last_seen, COUNT(e.id) as example_count
FROM learned_suggestions s
LEFT JOIN suggestion_examples e ON s.id = e.suggestion_id
WHERE s.status = 'pending'
GROUP BY s.id
ORDER BY s.confidence DESC, s.frequency DESC;
```

#### correction_statistics

Per-domain statistics.

```sql
SELECT domain,
       COUNT(*) as total_corrections,
       COUNT(CASE WHEN source = 'manual' THEN 1 END) as manual_count,
       COUNT(CASE WHEN source = 'learned' THEN 1 END) as learned_count,
       COUNT(CASE WHEN source = 'imported' THEN 1 END) as imported_count,
       SUM(usage_count) as total_usage,
       MAX(added_at) as last_updated
FROM corrections
WHERE is_active = 1
GROUP BY domain;
```

## Querying the Database

### Using Python API

```python
from pathlib import Path
from core import CorrectionRepository, CorrectionService

# Initialize
db_path = Path.home() / ".transcript-fixer" / "corrections.db"
repository = CorrectionRepository(db_path)
service = CorrectionService(repository)

# Add a correction
service.add_correction("错误", "正确", domain="general")

# Get corrections
corrections = service.get_corrections(domain="general")

# Get statistics
stats = service.get_statistics(domain="general")
print(f"Total: {stats['total_corrections']}")

# Close the underlying connection
service.close()
```

### Using SQLite CLI

```bash
# Open database
sqlite3 ~/.transcript-fixer/corrections.db

# View active corrections
SELECT from_text, to_text, domain FROM active_corrections;

# View statistics
SELECT * FROM correction_statistics;

# View pending suggestions
SELECT * FROM pending_suggestions;

# Check schema version
SELECT value FROM system_config WHERE key = 'schema_version';
```

## Import/Export

### Export to JSON

```python
import json

# _get_service() is the script's internal helper; it wires up the same
# CorrectionRepository + CorrectionService pair shown in "Using Python API"
service = _get_service()
corrections = service.export_corrections(domain="general")

# Write to file
with open("export.json", "w", encoding="utf-8") as f:
    json.dump({
        "version": "2.0",
        "domain": "general",
        "corrections": corrections
    }, f, ensure_ascii=False, indent=2)
```

### Import from JSON

```python
import json

with open("import.json", "r", encoding="utf-8") as f:
    data = json.load(f)

service = _get_service()
inserted, updated, skipped = service.import_corrections(
    corrections=data["corrections"],
    domain=data.get("domain", "general"),
    merge=True,
    validate_all=True
)

print(f"Imported: {inserted} new, {updated} updated, {skipped} skipped")
```

## Backup Strategy

### Automatic Backups

The system maintains database integrity through SQLite's ACID guarantees and automatic journaling.

### Manual Backups

```bash
# Backup database
cp ~/.transcript-fixer/corrections.db ~/backups/corrections_$(date +%Y%m%d).db

# Or use SQLite's backup command
sqlite3 ~/.transcript-fixer/corrections.db ".backup ~/backups/corrections.db"
```

### Version Control

**Recommended**: Use Git for configuration and export files, but NOT for the database:

```bash
# .gitignore
*.db
*.db-journal
*.bak
```

Instead, export corrections periodically:

```bash
python scripts/fix_transcription.py --export-json corrections_backup.json
git add corrections_backup.json
git commit -m "Backup corrections"
```

## Best Practices

1. **Regular Exports**: Export to JSON weekly for team sharing
2. **Database Backups**: Back up the `.db` file before major changes
3. **Use Transactions**: All modifications use ACID transactions automatically
4. **Soft Deletes**: Records are marked inactive, not deleted (preserves the audit trail)
5. **Validate**: Run `--validate` after manual database changes
6. **Statistics**: Check usage patterns via the `correction_statistics` view
7. **Cleanup**: Old history can be archived (query by `run_timestamp`)

## Troubleshooting

### Database Locked

```bash
# Check for lingering connections
lsof ~/.transcript-fixer/corrections.db

# If needed, back up and rebuild
cp corrections.db corrections_backup.db
sqlite3 corrections.db "VACUUM;"
```

### Corrupted Database

```bash
# Check integrity
sqlite3 corrections.db "PRAGMA integrity_check;"

# Recover if possible
sqlite3 corrections.db ".recover" | sqlite3 corrections_new.db
```

### Missing Tables

```bash
# Reinitialize the schema (safe; uses IF NOT EXISTS)
python -c "from core import CorrectionRepository; from pathlib import Path; CorrectionRepository(Path.home() / '.transcript-fixer' / 'corrections.db')"
```

transcript-fixer/references/glm_api_setup.md (new file, 116 lines)

# GLM API Setup Guide

## API Configuration

### Set the Environment Variable

Before running the scripts, set the GLM API key environment variable:

```bash
# Linux/macOS
export GLM_API_KEY="your-api-key-here"

# Windows (PowerShell)
$env:GLM_API_KEY="your-api-key-here"

# Windows (CMD)
set GLM_API_KEY=your-api-key-here
```

**Permanent setup** (recommended):

```bash
# Linux/macOS: add to ~/.bashrc or ~/.zshrc
echo 'export GLM_API_KEY="your-api-key-here"' >> ~/.bashrc
source ~/.bashrc

# Windows: set it in the system environment variables
```

### Script Configuration

The script reads the API key from the environment automatically:

```python
import os

# The script checks the environment variable
if "GLM_API_KEY" not in os.environ:
    raise ValueError("Please set the GLM_API_KEY environment variable")

os.environ["ANTHROPIC_BASE_URL"] = "https://open.bigmodel.cn/api/anthropic"
os.environ["ANTHROPIC_API_KEY"] = os.environ["GLM_API_KEY"]

# Model configuration
GLM_MODEL = "GLM-4.6"           # primary model
GLM_MODEL_FAST = "GLM-4.5-Air"  # fast model (fallback)
```

## Supported Models

| Model | Description | Use |
|-------|-------------|-----|
| GLM-4.6 | Strongest model | Default; highest accuracy |
| GLM-4.5-Air | Fast model | Fallback; faster responses |

**Note**: Model names are case-insensitive.

## API Authentication

Zhipu GLM exposes an Anthropic-compatible API:

```python
headers = {
    "anthropic-version": "2023-06-01",
    "Authorization": f"Bearer {api_key}",
    "content-type": "application/json"
}
```

**Key points:**
- Use the `Authorization: Bearer` header
- Do not use the `x-api-key` header

## API Call Example

```python
import os

import httpx


def call_glm_api(prompt: str) -> str:
    url = "https://open.bigmodel.cn/api/anthropic/v1/messages"
    headers = {
        "anthropic-version": "2023-06-01",
        "Authorization": f"Bearer {os.environ.get('ANTHROPIC_API_KEY')}",
        "content-type": "application/json"
    }

    data = {
        "model": "GLM-4.6",
        "max_tokens": 8000,
        "temperature": 0.3,
        "messages": [{"role": "user", "content": prompt}]
    }

    response = httpx.post(url, headers=headers, json=data, timeout=60.0)
    response.raise_for_status()  # surface 401/timeout errors early
    return response.json()["content"][0]["text"]
```

## Getting an API Key

1. Visit https://open.bigmodel.cn/
2. Register or log in
3. Open the API management page
4. Create a new API key
5. Copy the key into your configuration

## Pricing

See Zhipu AI's official pricing:
- GLM-4.6: billed per token
- GLM-4.5-Air: the cheaper option

## Troubleshooting

### 401 errors
- Check that the API key is correct
- Confirm you are using the `Authorization: Bearer` header

### Timeout errors
- Increase the timeout parameter
- Consider the faster GLM-4.5-Air model

transcript-fixer/references/installation_setup.md (new file, 135 lines)

|
||||
# Setup Guide

Complete installation and configuration guide for transcript-fixer.

## Table of Contents

- [Installation](#installation)
- [API Configuration](#api-configuration)
- [Environment Setup](#environment-setup)
- [Next Steps](#next-steps)

## Installation

### Dependencies

Install required dependencies using uv:

```bash
uv pip install -r requirements.txt
```

Or sync the project environment:

```bash
uv sync
```

**Required packages**:
- `anthropic` - For Claude API integration (future)
- `requests` - For GLM API calls
- `difflib` - Standard library for diff generation

### Database Initialization

Initialize the SQLite database (first time only):

```bash
uv run scripts/fix_transcription.py --init
```

This creates `~/.transcript-fixer/corrections.db` with the complete schema:
- 8 tables (corrections, context_rules, history, suggestions, etc.)
- 3 views (active_corrections, pending_suggestions, statistics)
- ACID transactions enabled
- Automatic backups before migrations

See `file_formats.md` for complete database schema.
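The table layout can also be sanity-checked from Python after `--init`. A sketch: only `corrections` and `context_rules` are named explicitly above, so the other table names here are taken from the SQL reference in this skill and should be treated as assumptions.

```python
import sqlite3

# Expected tables; names beyond `corrections` are inferred from this guide.
EXPECTED = {"corrections", "context_rules", "correction_history", "learned_suggestions"}

def missing_tables(conn):
    """Return the expected tables that are absent from the database."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
    return EXPECTED - {name for (name,) in rows}

# Point this at ~/.transcript-fixer/corrections.db in real use; an
# in-memory database with one table stands in here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE corrections (id INTEGER PRIMARY KEY)")
print(sorted(missing_tables(conn)))
```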

## API Configuration

### GLM API Key (Required for Stage 2)

Stage 2 AI corrections require a GLM API key.

1. **Obtain API key**: Visit https://open.bigmodel.cn/
2. **Register** for an account
3. **Generate** an API key from the dashboard
4. **Set environment variable**:

```bash
export GLM_API_KEY="your-api-key-here"
```

**Persistence**: Add to shell profile for permanent access:

```bash
# For bash
echo 'export GLM_API_KEY="your-key"' >> ~/.bashrc
source ~/.bashrc

# For zsh
echo 'export GLM_API_KEY="your-key"' >> ~/.zshrc
source ~/.zshrc
```

### Verify Configuration

Run validation to check setup:

```bash
uv run scripts/fix_transcription.py --validate
```

**Expected output**:
```
🔍 Validating transcript-fixer configuration...

✅ Configuration directory exists: ~/.transcript-fixer
✅ Database valid: 0 corrections
✅ All 8 tables present
✅ GLM_API_KEY is set

============================================================
✅ All checks passed! Configuration is valid.
============================================================
```

## Environment Setup

### Python Environment

**Required**: Python 3.8+

**Recommended**: Use uv for all Python operations:

```bash
# Always run scripts through uv
uv run scripts/fix_transcription.py     # ✅ Correct

# Never use system python directly
python scripts/fix_transcription.py     # ❌ Wrong
```

### Directory Structure

After initialization, the directory structure is:

```
~/.transcript-fixer/
├── corrections.db            # SQLite database
├── corrections.YYYYMMDD.bak  # Automatic backups
└── (migration artifacts)
```

**Important**: The `.db` file should NOT be committed to Git. Export corrections to JSON for version control instead.

## Next Steps

After setup:
1. Add initial corrections (5-10 terms)
2. Run first correction on a test file
3. Review learned suggestions after 3-5 runs
4. Build domain-specific dictionaries

See `workflow_guide.md` for detailed usage instructions.
transcript-fixer/references/quick_reference.md (new file, 125 lines)

# Quick Reference

**Storage**: transcript-fixer uses a SQLite database for corrections storage.

**Database location**: `~/.transcript-fixer/corrections.db`

## Quick Start Examples

### Adding Corrections via CLI

```bash
# Add a simple correction
uv run scripts/fix_transcription.py --add "巨升智能" "具身智能" --domain embodied_ai

# Add corrections for a specific domain
uv run scripts/fix_transcription.py --add "奇迹创坛" "奇绩创坛" --domain general
uv run scripts/fix_transcription.py --add "矩阵公司" "初创公司" --domain general
```

### Adding Corrections via SQL

```bash
sqlite3 ~/.transcript-fixer/corrections.db

# Insert corrections
INSERT INTO corrections (from_text, to_text, domain, source)
VALUES ('巨升智能', '具身智能', 'embodied_ai', 'manual');

INSERT INTO corrections (from_text, to_text, domain, source)
VALUES ('巨升', '具身', 'embodied_ai', 'manual');

INSERT INTO corrections (from_text, to_text, domain, source)
VALUES ('奇迹创坛', '奇绩创坛', 'general', 'manual');

# Exit
.quit
```

### Adding Context Rules via SQL

Context rules use regex patterns for context-aware corrections:

```bash
sqlite3 ~/.transcript-fixer/corrections.db

# Add context-aware rules
INSERT INTO context_rules (pattern, replacement, description, priority)
VALUES ('巨升方向', '具身方向', '巨升→具身', 10);

INSERT INTO context_rules (pattern, replacement, description, priority)
VALUES ('巨升现在', '具身现在', '巨升→具身', 10);

INSERT INTO context_rules (pattern, replacement, description, priority)
VALUES ('近距离的去看', '近距离地去看', '的→地 副词修饰', 5);

# Exit
.quit
```

### Adding Corrections via Python API

Save as `add_corrections.py` and run with `uv run add_corrections.py`:

```python
#!/usr/bin/env -S uv run
from pathlib import Path
from core import CorrectionRepository, CorrectionService

# Initialize service
db_path = Path.home() / ".transcript-fixer" / "corrections.db"
repository = CorrectionRepository(db_path)
service = CorrectionService(repository)

# Add corrections
corrections = [
    ("巨升智能", "具身智能", "embodied_ai"),
    ("巨升", "具身", "embodied_ai"),
    ("奇迹创坛", "奇绩创坛", "general"),
    ("火星营", "火星营", "general"),
    ("矩阵公司", "初创公司", "general"),
    ("股价", "框架", "general"),
    ("三观", "三关", "general"),
]

for from_text, to_text, domain in corrections:
    service.add_correction(from_text, to_text, domain)
    print(f"✅ Added: '{from_text}' → '{to_text}' (domain: {domain})")

# Close connection
service.close()
```

## Bulk Import Example

Use the provided bulk import script for importing multiple corrections:

```bash
uv run scripts/examples/bulk_import.py
```

## Querying the Database

### View Active Corrections

```bash
sqlite3 ~/.transcript-fixer/corrections.db "SELECT from_text, to_text, domain FROM active_corrections;"
```

### View Statistics

```bash
sqlite3 ~/.transcript-fixer/corrections.db "SELECT * FROM correction_statistics;"
```

### View Context Rules

```bash
sqlite3 ~/.transcript-fixer/corrections.db "SELECT pattern, replacement, priority FROM context_rules WHERE is_active = 1 ORDER BY priority DESC;"
```
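The same queries can also be run from Python with the standard `sqlite3` module. A self-contained sketch; an in-memory database with a minimal stand-in schema replaces `~/.transcript-fixer/corrections.db` so the example runs anywhere:

```python
import sqlite3

# In real use: sqlite3.connect(Path.home() / ".transcript-fixer" / "corrections.db")
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # access columns by name

# Minimal stand-in for the real schema and the active_corrections view.
conn.executescript("""
CREATE TABLE corrections (
    from_text TEXT, to_text TEXT, domain TEXT, is_active INTEGER DEFAULT 1
);
CREATE VIEW active_corrections AS
    SELECT * FROM corrections WHERE is_active = 1;
INSERT INTO corrections (from_text, to_text, domain)
VALUES ('巨升', '具身', 'embodied_ai');
""")

for row in conn.execute("SELECT from_text, to_text, domain FROM active_corrections"):
    print(f"{row['from_text']} → {row['to_text']} ({row['domain']})")
```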

## See Also

- `references/file_formats.md` - Complete database schema documentation
- `references/script_parameters.md` - CLI command reference
- `SKILL.md` - Main user documentation
transcript-fixer/references/script_parameters.md (new file, 186 lines)

# Script Parameters Reference

Detailed command-line parameters and usage examples for transcript-fixer Python scripts.

## Table of Contents

- [fix_transcription.py](#fix_transcriptionpy) - Main correction pipeline
  - [Setup Commands](#setup-commands)
  - [Correction Management](#correction-management)
  - [Correction Workflow](#correction-workflow)
  - [Learning Commands](#learning-commands)
- [generate_diff_report.py](#generate_diff_reportpy) - Generate comparison reports
- [Common Workflows](#common-workflows)
- [Exit Codes](#exit-codes)
- [Environment Variables](#environment-variables)

---

## fix_transcription.py

Main correction pipeline script supporting three processing stages.

### Syntax

```bash
python scripts/fix_transcription.py --input <file> --stage <1|2|3> [--output <dir>]
```

### Parameters

- `--input, -i` (required): Input Markdown file path
- `--stage, -s` (optional): Stage to execute (default: 3)
  - `1` = Dictionary corrections only
  - `2` = AI corrections only (requires Stage 1 output file)
  - `3` = Both stages sequentially
- `--output, -o` (optional): Output directory (defaults to input file directory)

### Usage Examples

**Run dictionary corrections only:**
```bash
python scripts/fix_transcription.py --input meeting.md --stage 1
```

Output: `meeting_阶段1_词典修复.md`

**Run AI corrections only:**
```bash
python scripts/fix_transcription.py --input meeting_阶段1_词典修复.md --stage 2
```

Output: `meeting_阶段2_AI修复.md`

Note: Requires the Stage 1 output file as input.

**Run complete pipeline:**
```bash
python scripts/fix_transcription.py --input meeting.md --stage 3
```

Outputs:
- `meeting_阶段1_词典修复.md`
- `meeting_阶段2_AI修复.md`

**Custom output directory:**
```bash
python scripts/fix_transcription.py --input meeting.md --stage 3 --output ./corrections
```

### Exit Codes

- `0` - Success
- `1` - Missing required parameters or file not found
- `2` - GLM_API_KEY environment variable not set (Stage 2 or 3 only)
- `3` - API request failed
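These exit codes can be checked programmatically when wrapping the script. A sketch: the messages mirror the table above, and the demo substitutes a trivial interpreter command for the real script, whose behavior is not reproduced here.

```python
import subprocess
import sys

# Exit-code meanings, per the table above.
EXIT_MESSAGES = {
    0: "success",
    1: "missing parameters or file not found",
    2: "GLM_API_KEY not set",
    3: "API request failed",
}

def run_stage(argv):
    """Run a command and map its exit code to a human-readable message."""
    result = subprocess.run(argv)
    return result.returncode, EXIT_MESSAGES.get(result.returncode, "unknown error")

# Real call would be:
#   run_stage(["python", "scripts/fix_transcription.py",
#              "--input", "meeting.md", "--stage", "1"])
# Simulated here with an interpreter that exits with code 2:
code, message = run_stage([sys.executable, "-c", "raise SystemExit(2)"])
print(code, message)
```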

## generate_diff_report.py

Multi-format diff report generator for comparing correction stages.

### Syntax

```bash
python scripts/generate_diff_report.py --original <file> --stage1 <file> --stage2 <file> [--output-dir <dir>]
```

### Parameters

- `--original` (required): Original transcript file path
- `--stage1` (required): Stage 1 correction output file path
- `--stage2` (required): Stage 2 correction output file path
- `--output-dir` (optional): Output directory for diff reports (defaults to original file directory)

### Usage Examples

**Basic usage:**
```bash
python scripts/generate_diff_report.py \
    --original "meeting.md" \
    --stage1 "meeting_阶段1_词典修复.md" \
    --stage2 "meeting_阶段2_AI修复.md"
```

**Custom output directory:**
```bash
python scripts/generate_diff_report.py \
    --original "meeting.md" \
    --stage1 "meeting_阶段1_词典修复.md" \
    --stage2 "meeting_阶段2_AI修复.md" \
    --output-dir "./reports"
```

### Output Files

The script generates four comparison formats:

1. **Markdown summary** (`*_对比报告.md`)
   - High-level statistics and change summary
   - Word count changes per stage
   - Common error patterns identified

2. **Unified diff** (`*_unified.diff`)
   - Traditional Unix diff format
   - Suitable for command-line review or version control

3. **HTML side-by-side** (`*_对比.html`)
   - Visual side-by-side comparison
   - Color-coded additions/deletions
   - **Recommended for human review**

4. **Inline marked** (`*_行内对比.txt`)
   - Single-column format with inline change markers
   - Useful for quick text editor review

### Exit Codes

- `0` - Success
- `1` - Missing required parameters or file not found
- `2` - File format error (non-Markdown input)

## Common Workflows

### Testing Dictionary Changes

Test dictionary updates before running expensive AI corrections:

```bash
# 1. Update CORRECTIONS_DICT in scripts/fix_transcription.py
# 2. Run Stage 1 only
python scripts/fix_transcription.py --input meeting.md --stage 1

# 3. Review output
cat meeting_阶段1_词典修复.md

# 4. If satisfied, run Stage 2
python scripts/fix_transcription.py --input meeting_阶段1_词典修复.md --stage 2
```

### Batch Processing

Process multiple transcripts in sequence:

```bash
for file in transcripts/*.md; do
    python scripts/fix_transcription.py --input "$file" --stage 3
done
```

### Quick Review Cycle

Generate and open the comparison report immediately after correction:

```bash
# Run corrections
python scripts/fix_transcription.py --input meeting.md --stage 3

# Generate and open diff report
python scripts/generate_diff_report.py \
    --original "meeting.md" \
    --stage1 "meeting_阶段1_词典修复.md" \
    --stage2 "meeting_阶段2_AI修复.md"

open meeting_对比.html        # macOS
# xdg-open meeting_对比.html  # Linux
# start meeting_对比.html     # Windows
```
transcript-fixer/references/sql_queries.md (new file, 188 lines)

# SQL Query Reference

Database location: `~/.transcript-fixer/corrections.db`

## Basic Operations

### Add Corrections

```sql
-- Add a correction
INSERT INTO corrections (from_text, to_text, domain, source)
VALUES ('巨升智能', '具身智能', 'embodied_ai', 'manual');

INSERT INTO corrections (from_text, to_text, domain, source)
VALUES ('奇迹创坛', '奇绩创坛', 'general', 'manual');
```

### View Corrections

```sql
-- View all active corrections
SELECT from_text, to_text, domain, source, usage_count
FROM active_corrections
ORDER BY domain, from_text;

-- View corrections for a specific domain
SELECT from_text, to_text, usage_count, added_at
FROM active_corrections
WHERE domain = 'embodied_ai';
```

## Context Rules

### Add Context-Aware Rules

```sql
-- Add regex-based context rules
INSERT INTO context_rules (pattern, replacement, description, priority)
VALUES ('巨升方向', '具身方向', '巨升→具身', 10);

INSERT INTO context_rules (pattern, replacement, description, priority)
VALUES ('近距离的去看', '近距离地去看', '的→地 副词修饰', 5);
```

### View Rules

```sql
-- View all active context rules (ordered by priority)
SELECT pattern, replacement, description, priority
FROM context_rules
WHERE is_active = 1
ORDER BY priority DESC;
```

## Statistics

```sql
-- View correction statistics by domain
SELECT * FROM correction_statistics;

-- Count corrections by source
SELECT source, COUNT(*) as count, SUM(usage_count) as total_usage
FROM corrections
WHERE is_active = 1
GROUP BY source;

-- Most frequently used corrections
SELECT from_text, to_text, domain, usage_count, last_used
FROM corrections
WHERE is_active = 1 AND usage_count > 0
ORDER BY usage_count DESC
LIMIT 10;
```

## Learning and Suggestions

### View Suggestions

```sql
-- View pending suggestions
SELECT * FROM pending_suggestions;

-- View high-confidence suggestions
SELECT from_text, to_text, domain, frequency, confidence
FROM learned_suggestions
WHERE status = 'pending' AND confidence >= 0.8
ORDER BY confidence DESC, frequency DESC;
```

### Approve Suggestions

```sql
-- Insert into corrections
INSERT INTO corrections (from_text, to_text, domain, source, confidence)
SELECT from_text, to_text, domain, 'learned', confidence
FROM learned_suggestions
WHERE id = 1;

-- Mark as approved
UPDATE learned_suggestions
SET status = 'approved', reviewed_at = CURRENT_TIMESTAMP
WHERE id = 1;
```
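The two approval statements should land together or not at all. From Python, `sqlite3`'s connection context manager wraps them in a single transaction that commits on success and rolls back on error. A sketch with a minimal stand-in schema in place of the real database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # ~/.transcript-fixer/corrections.db in real use
conn.executescript("""
CREATE TABLE corrections (from_text TEXT, to_text TEXT, domain TEXT,
                          source TEXT, confidence REAL);
CREATE TABLE learned_suggestions (id INTEGER PRIMARY KEY, from_text TEXT,
    to_text TEXT, domain TEXT, confidence REAL,
    status TEXT DEFAULT 'pending', reviewed_at TEXT);
INSERT INTO learned_suggestions (from_text, to_text, domain, confidence)
VALUES ('巨升', '具身', 'embodied_ai', 0.9);
""")

def approve(conn, suggestion_id):
    # One transaction: both statements commit together, or roll back together.
    with conn:
        conn.execute("""
            INSERT INTO corrections (from_text, to_text, domain, source, confidence)
            SELECT from_text, to_text, domain, 'learned', confidence
            FROM learned_suggestions WHERE id = ?""", (suggestion_id,))
        conn.execute("""
            UPDATE learned_suggestions
            SET status = 'approved', reviewed_at = CURRENT_TIMESTAMP
            WHERE id = ?""", (suggestion_id,))

approve(conn, 1)
```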

## History and Audit

```sql
-- View recent correction runs
SELECT filename, domain, stage1_changes, stage2_changes, run_timestamp
FROM correction_history
ORDER BY run_timestamp DESC
LIMIT 10;

-- View detailed changes for a specific run
SELECT ch.line_number, ch.from_text, ch.to_text, ch.rule_type
FROM correction_changes ch
JOIN correction_history h ON ch.history_id = h.id
WHERE h.filename = 'meeting.md'
ORDER BY ch.line_number;

-- Calculate success rate
SELECT
    COUNT(*) as total_runs,
    SUM(CASE WHEN success = 1 THEN 1 ELSE 0 END) as successful,
    ROUND(100.0 * SUM(CASE WHEN success = 1 THEN 1 ELSE 0 END) / COUNT(*), 2) as success_rate
FROM correction_history;
```

## Maintenance

```sql
-- Deactivate (soft delete) a correction
UPDATE corrections
SET is_active = 0
WHERE from_text = '错误词' AND domain = 'general';

-- Reactivate a correction
UPDATE corrections
SET is_active = 1
WHERE from_text = '错误词' AND domain = 'general';

-- Update correction confidence
UPDATE corrections
SET confidence = 0.95
WHERE from_text = '巨升' AND to_text = '具身';

-- Delete old history (older than 90 days)
DELETE FROM correction_history
WHERE run_timestamp < datetime('now', '-90 days');

-- Reclaim space
VACUUM;
```

## System Configuration

```sql
-- View system configuration
SELECT key, value, description FROM system_config;

-- Update configuration
UPDATE system_config
SET value = '5'
WHERE key = 'learning_frequency_threshold';

-- Check schema version
SELECT value FROM system_config WHERE key = 'schema_version';
```

## Export

Run these inside the `sqlite3` shell (`.mode`, `.headers`, and `.output` are sqlite3 dot-commands, not SQL):

```sql
-- Export corrections as CSV
.mode csv
.headers on
.output corrections_export.csv
SELECT from_text, to_text, domain, source, confidence, usage_count, added_at
FROM active_corrections;
.output stdout
```

For JSON export, use a Python script with `service.export_corrections()` instead.
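If `service.export_corrections()` is unavailable, a plain-`sqlite3` JSON export is a few lines. A sketch; an in-memory stand-in replaces the real database, where `active_corrections` is a view rather than a table:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # ~/.transcript-fixer/corrections.db in real use
conn.row_factory = sqlite3.Row     # rows become name-addressable

# Stand-in data; in the real database, query the active_corrections view.
conn.executescript("""
CREATE TABLE active_corrections (from_text TEXT, to_text TEXT, domain TEXT);
INSERT INTO active_corrections VALUES ('奇迹创坛', '奇绩创坛', 'general');
""")

rows = conn.execute("SELECT from_text, to_text, domain FROM active_corrections")
payload = [dict(row) for row in rows]
# ensure_ascii=False keeps Chinese correction text readable in the file.
print(json.dumps(payload, ensure_ascii=False, indent=2))
```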

## See Also

- `references/file_formats.md` - Complete database schema documentation
- `references/quick_reference.md` - CLI command quick reference
- `SKILL.md` - Main user documentation
transcript-fixer/references/team_collaboration.md (new file, 371 lines)

# Team Collaboration Guide

This guide explains how to share correction knowledge across teams using export/import and Git workflows.

## Table of Contents

- [Export/Import Workflow](#exportimport-workflow)
  - [Export Corrections](#export-corrections)
  - [Import from Teammate](#import-from-teammate)
  - [Team Workflow Example](#team-workflow-example)
- [Git-Based Collaboration](#git-based-collaboration)
  - [Initial Setup](#initial-setup)
  - [Team Members Clone](#team-members-clone)
  - [Ongoing Sync](#ongoing-sync)
  - [Handling Conflicts](#handling-conflicts)
- [Selective Domain Sharing](#selective-domain-sharing)
  - [Finance Team](#finance-team)
  - [AI Team](#ai-team)
  - [Individual imports specific domains](#individual-imports-specific-domains)
- [Git Branching Strategy](#git-branching-strategy)
  - [Feature Branches](#feature-branches)
  - [Domain Branches (Alternative)](#domain-branches-alternative)
- [Automated Sync (Advanced)](#automated-sync-advanced)
  - [macOS/Linux Cron](#macoslinux-cron)
  - [Windows Task Scheduler](#windows-task-scheduler)
- [Backup and Recovery](#backup-and-recovery)
  - [Backup Strategy](#backup-strategy)
  - [Recovery from Backup](#recovery-from-backup)
  - [Recovery from Git](#recovery-from-git)
- [Team Best Practices](#team-best-practices)
- [Integration with CI/CD](#integration-with-cicd)
  - [GitHub Actions Example](#github-actions-example)
- [Troubleshooting](#troubleshooting)
  - [Import Failed](#import-failed)
  - [Git Sync Failed](#git-sync-failed)
  - [Merge Conflicts Too Complex](#merge-conflicts-too-complex)
- [Security Considerations](#security-considerations)
- [Further Reading](#further-reading)

## Export/Import Workflow

### Export Corrections

Share your corrections with team members:

```bash
# Export a specific domain
python scripts/fix_transcription.py --export team_corrections.json --domain embodied_ai

# Export general corrections
python scripts/fix_transcription.py --export team_corrections.json
```

**Output**: Creates a standalone JSON file with your corrections.

### Import from Teammate

Two modes: **merge** (combine) or **replace** (overwrite):

```bash
# Merge (recommended) - combines with existing corrections
python scripts/fix_transcription.py --import team_corrections.json --merge

# Replace - overwrites existing corrections (dangerous!)
python scripts/fix_transcription.py --import team_corrections.json
```

**Merge behavior**:
- Adds new corrections
- Updates existing corrections with imported values
- Preserves corrections not in the import file
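The merge behavior above amounts to a keyed dictionary update: imported entries win on conflict, and local-only entries survive. A sketch; the key choice and sample data are illustrative:

```python
# Corrections keyed by (from_text, domain), mapping to the replacement text.
def merge(local, imported):
    merged = dict(local)
    merged.update(imported)  # add new entries, overwrite conflicts with imported values
    return merged

local = {("巨升", "embodied_ai"): "具身", ("股价", "general"): "框架"}
imported = {("巨升", "embodied_ai"): "具身智能", ("奇迹创坛", "general"): "奇绩创坛"}
print(merge(local, imported))
```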

### Team Workflow Example

**Person A (Domain Expert)**:
```bash
# Build correction dictionary
python fix_transcription.py --add "巨升" "具身" --domain embodied_ai
python fix_transcription.py --add "奇迹创坛" "奇绩创坛" --domain embodied_ai
# ... add 50 more corrections ...

# Export for team
python fix_transcription.py --export ai_corrections.json --domain embodied_ai
# Send ai_corrections.json to the team via Slack/email
```

**Person B (Team Member)**:
```bash
# Receive ai_corrections.json
# Import and merge with existing corrections
python fix_transcription.py --import ai_corrections.json --merge

# Now Person B has all 50+ corrections!
```

## Git-Based Collaboration

For teams using Git, version control the entire correction database.

### Initial Setup

**Person A (First User)**:
```bash
cd ~/.transcript-fixer
git init
git add corrections.json context_rules.json config.json
git add domains/
git commit -m "Initial correction database"

# Push to shared repo
git remote add origin git@github.com:org/transcript-corrections.git
git push -u origin main
```

### Team Members Clone

**Person B, C, D (Team Members)**:
```bash
# Clone shared corrections
git clone git@github.com:org/transcript-corrections.git ~/.transcript-fixer

# Now everyone has the same corrections!
```

### Ongoing Sync

**Daily workflow**:
```bash
# Morning: Pull team updates
cd ~/.transcript-fixer
git pull origin main

# During day: Add corrections
python fix_transcription.py --add "错误" "正确"

# Evening: Push your additions
cd ~/.transcript-fixer
git add corrections.json
git commit -m "Added 5 new embodied AI corrections"
git push origin main
```

### Handling Conflicts

When two people add different corrections to the same file:

```bash
cd ~/.transcript-fixer
git pull origin main

# If a conflict occurs:
# CONFLICT in corrections.json

# Option 1: Manual merge (recommended)
nano corrections.json  # Edit to combine both changes
git add corrections.json
git commit -m "Merged corrections from teammate"
git push

# Option 2: Keep yours
git checkout --ours corrections.json
git add corrections.json
git commit -m "Kept local corrections"
git push

# Option 3: Keep theirs
git checkout --theirs corrections.json
git add corrections.json
git commit -m "Used teammate's corrections"
git push
```

**Best Practice**: JSON merge conflicts are usually easy to resolve - just combine the correction entries from both versions.
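Combining the entries can also be done mechanically: resolve the conflict by taking both sides and deduplicating by source text and domain. A sketch, assuming the export file is a list of correction objects; the exact JSON shape and field names are assumptions, not specified in this guide:

```python
import json

# "ours" and "theirs": the two conflicting versions of corrections.json.
ours = [{"from": "巨升", "to": "具身", "domain": "embodied_ai"}]
theirs = [{"from": "奇迹创坛", "to": "奇绩创坛", "domain": "general"},
          {"from": "巨升", "to": "具身", "domain": "embodied_ai"}]

seen, combined = set(), []
for entry in ours + theirs:  # ours listed first, so local entries win ties
    key = (entry["from"], entry["domain"])
    if key not in seen:
        seen.add(key)
        combined.append(entry)

print(json.dumps(combined, ensure_ascii=False))
```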

## Selective Domain Sharing

Share only specific domains with different teams:

### Finance Team
```bash
# Finance team exports their domain
python fix_transcription.py --export finance_corrections.json --domain finance

# Share finance_corrections.json with the finance team only
```

### AI Team
```bash
# AI team exports their domain
python fix_transcription.py --export ai_corrections.json --domain embodied_ai

# Share ai_corrections.json with the AI team only
```

### Individual imports specific domains
```bash
# Alice works on both finance and AI
python fix_transcription.py --import finance_corrections.json --merge
python fix_transcription.py --import ai_corrections.json --merge
```

## Git Branching Strategy

For larger teams, use branches for different domains or workflows:

### Feature Branches
```bash
# Create a branch for major dictionary additions
git checkout -b add-medical-terms
python fix_transcription.py --add "医疗术语" "正确术语" --domain medical
# ... add 100 medical corrections ...
git add domains/medical.json
git commit -m "Added 100 medical terminology corrections"
git push origin add-medical-terms

# Create a PR for review
# After approval, merge to main
```

### Domain Branches (Alternative)
```bash
# Separate branches per domain
git checkout -b domain/embodied-ai
# Work on AI corrections
git push origin domain/embodied-ai

git checkout -b domain/finance
# Work on finance corrections
git push origin domain/finance
```

## Automated Sync (Advanced)

Set up automatic Git sync using cron/Task Scheduler:

### macOS/Linux Cron
```bash
# Edit crontab
crontab -e

# Add daily sync at 9 AM and 6 PM
0 9,18 * * * cd ~/.transcript-fixer && git pull origin main && git push origin main
```

### Windows Task Scheduler
```powershell
# Create scheduled task
$action = New-ScheduledTaskAction -Execute "git" -Argument "pull origin main" -WorkingDirectory "$env:USERPROFILE\.transcript-fixer"
$trigger = New-ScheduledTaskTrigger -Daily -At 9am
Register-ScheduledTask -Action $action -Trigger $trigger -TaskName "SyncTranscriptCorrections"
```

## Backup and Recovery

### Backup Strategy
```bash
# Weekly backup to cloud
cd ~/.transcript-fixer
tar -czf transcript-corrections-$(date +%Y%m%d).tar.gz corrections.json context_rules.json domains/
# Upload to Dropbox/Google Drive/S3
```

### Recovery from Backup
```bash
# Extract backup
tar -xzf transcript-corrections-20250127.tar.gz -C ~/.transcript-fixer/
```

### Recovery from Git
```bash
# View history
cd ~/.transcript-fixer
git log corrections.json

# Restore from 3 commits ago
git checkout HEAD~3 corrections.json

# Or restore a specific version
git checkout abc123def corrections.json
```

## Team Best Practices

1. **Pull Before Push**: Always `git pull` before starting work
2. **Commit Often**: Small, frequent commits are better than large, infrequent ones
3. **Descriptive Messages**: "Added 5 finance terms" is better than "updates"
4. **Review Process**: Use PRs for major dictionary changes (100+ corrections)
5. **Domain Ownership**: Assign domain experts as reviewers
6. **Weekly Sync**: Schedule team sync meetings to review learned suggestions
7. **Backup Policy**: Take weekly backups of the entire `~/.transcript-fixer/` directory

## Integration with CI/CD

For enterprise teams, integrate validation into CI:

### GitHub Actions Example
```yaml
# .github/workflows/validate-corrections.yml
name: Validate Corrections

on:
  pull_request:
    paths:
      - 'corrections.json'
      - 'domains/*.json'

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2

      - name: Validate JSON
        run: |
          python -m json.tool corrections.json > /dev/null
          for file in domains/*.json; do
            python -m json.tool "$file" > /dev/null
          done

      - name: Check for duplicates
        run: |
          python scripts/check_duplicates.py corrections.json
```
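The real `scripts/check_duplicates.py` is not shown in this guide; a hypothetical sketch of what such a check might do (the field names and return shape are assumptions):

```python
from collections import Counter

def find_duplicates(entries):
    """Return (from_text, domain) keys that appear more than once."""
    counts = Counter((e["from"], e["domain"]) for e in entries)
    return [key for key, n in counts.items() if n > 1]

# Sample payload in the assumed export shape.
entries = [
    {"from": "巨升", "domain": "embodied_ai"},
    {"from": "巨升", "domain": "embodied_ai"},
    {"from": "奇迹创坛", "domain": "general"},
]
print(find_duplicates(entries))
```

A CI wrapper would load `corrections.json`, call this, and exit non-zero if the list is non-empty.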

## Troubleshooting

### Import Failed
```bash
# Check JSON validity
python -m json.tool team_corrections.json

# If invalid, fix JSON syntax errors
nano team_corrections.json
```

### Git Sync Failed
```bash
# Check remote connection
git remote -v

# Re-add if needed
git remote set-url origin git@github.com:org/corrections.git

# Verify SSH keys
ssh -T git@github.com
```

### Merge Conflicts Too Complex
```bash
# Nuclear option: Keep one version
git checkout --ours corrections.json    # Keep yours
# OR
git checkout --theirs corrections.json  # Keep theirs

# Then re-import the other version
python fix_transcription.py --import other_version.json --merge
```

## Security Considerations

1. **Private Repos**: Use private Git repositories for company-specific corrections
2. **Access Control**: Limit who can push to the main branch
3. **Secret Scanning**: Never commit API keys (already handled by security_scan.py)
4. **Audit Trail**: Git history provides a full audit trail of who changed what
5. **Backup Encryption**: Encrypt backups if they contain sensitive terminology

## Further Reading

- Git workflows: https://git-scm.com/book/en/v2/Git-Branching-Branching-Workflows
- JSON validation: https://jsonlint.com/
- Team Git practices: https://github.com/git-guides
transcript-fixer/references/troubleshooting.md (new file, 313 lines)

# Troubleshooting Guide
|
||||
|
||||
Solutions to common issues and error conditions.
|
||||
|
||||
## Table of Contents
|
||||
|
||||
- [API Authentication Errors](#api-authentication-errors)
|
||||
- [GLM_API_KEY Not Set](#glm_api_key-not-set)
|
||||
- [Invalid API Key](#invalid-api-key)
|
||||
- [Learning System Issues](#learning-system-issues)
|
||||
- [No Suggestions Generated](#no-suggestions-generated)
|
||||
- [Database Issues](#database-issues)
|
||||
- [Database Not Found](#database-not-found)
|
||||
- [Database Locked](#database-locked)
|
||||
- [Corrupted Database](#corrupted-database)
|
||||
- [Missing Tables](#missing-tables)
|
||||
- [Common Pitfalls](#common-pitfalls)
|
||||
- [1. Stage Order Confusion](#1-stage-order-confusion)
|
||||
- [2. Overwriting Imports](#2-overwriting-imports)
|
||||
- [3. Ignoring Learned Suggestions](#3-ignoring-learned-suggestions)
|
||||
- [4. Testing on Large Files](#4-testing-on-large-files)
|
||||
- [5. Manual Database Edits Without Validation](#5-manual-database-edits-without-validation)
|
||||
- [6. Committing .db Files to Git](#6-committing-db-files-to-git)
|
||||
- [Validation Commands](#validation-commands)
|
||||
- [Quick Health Check](#quick-health-check)
|
||||
- [Detailed Diagnostics](#detailed-diagnostics)
|
||||
- [Getting Help](#getting-help)

## API Authentication Errors

### GLM_API_KEY Not Set

**Symptom**:
```
❌ Error: GLM_API_KEY environment variable not set
Set it with: export GLM_API_KEY='your-key'
```

**Solution**:
```bash
# Check if key is set
echo $GLM_API_KEY

# If empty, export key
export GLM_API_KEY="your-api-key-here"

# Verify
uv run scripts/fix_transcription.py --validate
```

**Persistence**: Add to shell profile (`.bashrc` or `.zshrc`) for permanent access.
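
One way to make that persistent without duplicating entries (a sketch; assumes zsh — substitute `~/.bashrc` for bash):

```shell
# Append the export once; grep -qs guards against adding a duplicate line
PROFILE="${PROFILE:-$HOME/.zshrc}"   # assumption: zsh; use ~/.bashrc for bash
grep -qs 'GLM_API_KEY' "$PROFILE" || echo 'export GLM_API_KEY="your-api-key-here"' >> "$PROFILE"
```

Open a new shell (or `source` the profile) for the variable to take effect.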

See `glm_api_setup.md` for detailed API key management.

### Invalid API Key

**Symptom**: API calls fail with 401/403 errors

**Solutions**:
1. Verify key is correct (copy from https://open.bigmodel.cn/)
2. Check for extra spaces or quotes in the key
3. Regenerate key if compromised
4. Verify API quota hasn't been exceeded

## Learning System Issues

### No Suggestions Generated

**Symptom**: Running `--review-learned` shows no suggestions after multiple corrections.

**Requirements**:
- Minimum 3 correction runs with consistent patterns
- Learning frequency threshold ≥3 (default)
- Learning confidence threshold ≥0.8 (default)

**Diagnostic steps**:

```bash
# Check correction history count
sqlite3 ~/.transcript-fixer/corrections.db "SELECT COUNT(*) FROM correction_history;"
# If 0, no corrections have been run yet
# If >0 but <3, run more corrections

# Check suggestions table
sqlite3 ~/.transcript-fixer/corrections.db "SELECT * FROM learned_suggestions;"

# Check system configuration
sqlite3 ~/.transcript-fixer/corrections.db "SELECT key, value FROM system_config WHERE key LIKE 'learning%';"
```

**Solutions**:
1. Run at least 3 correction sessions
2. Ensure patterns repeat (same error → same correction)
3. Verify database permissions (should be readable/writable)
4. Check `correction_history` table has entries
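
If the thresholds themselves are too strict, they live in `system_config` and can be lowered. A hedged sketch run against a scratch database — the key name `learning_min_frequency` below is an assumption, so list the actual keys first with `SELECT key FROM system_config;` and point `db` at `~/.transcript-fixer/corrections.db` for real use:

```shell
# Demo on a scratch database; the config key name is an assumption
db=$(mktemp -u).db
sqlite3 "$db" "CREATE TABLE IF NOT EXISTS system_config(key TEXT PRIMARY KEY, value TEXT);"
sqlite3 "$db" "INSERT OR REPLACE INTO system_config VALUES('learning_min_frequency','3');"
# Lower the frequency threshold from 3 to 2
sqlite3 "$db" "UPDATE system_config SET value='2' WHERE key='learning_min_frequency';"
sqlite3 "$db" "SELECT value FROM system_config WHERE key='learning_min_frequency';"
```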

## Database Issues

### Database Not Found

**Symptom**:
```
⚠️ Database not found: ~/.transcript-fixer/corrections.db
```

**Solution**:
```bash
uv run scripts/fix_transcription.py --init
```

This creates the database with the complete schema.

### Database Locked

**Symptom**:
```
Error: database is locked
```

**Causes**:
- Another process is accessing the database
- Unfinished transaction from a crashed process
- File permissions issue

**Solutions**:

```bash
# Check for processes using the database
lsof ~/.transcript-fixer/corrections.db

# If processes are found, kill them or wait for completion

# If the lock persists, back up and compact the database
cp ~/.transcript-fixer/corrections.db ~/.transcript-fixer/corrections_backup.db
sqlite3 ~/.transcript-fixer/corrections.db "VACUUM;"
```
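
If locks keep recurring, enabling SQLite's WAL journal mode lets readers coexist with a writer. A sketch against a scratch database — point `db` at `~/.transcript-fixer/corrections.db` to apply it for real, assuming nothing in the tooling depends on the default rollback journal:

```shell
db=$(mktemp -u).db
sqlite3 "$db" "CREATE TABLE IF NOT EXISTS t(x);"
# WAL mode persists in the database file once set
sqlite3 "$db" "PRAGMA journal_mode=WAL;"
# A 5-second busy timeout makes writers wait instead of failing immediately
sqlite3 "$db" "PRAGMA busy_timeout=5000; INSERT INTO t VALUES(1);"
```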

### Corrupted Database

**Symptom**: SQLite errors, integrity check failures

**Solutions**:

```bash
# Check integrity
sqlite3 ~/.transcript-fixer/corrections.db "PRAGMA integrity_check;"

# If corrupted, attempt recovery
sqlite3 ~/.transcript-fixer/corrections.db ".recover" | sqlite3 ~/.transcript-fixer/corrections_new.db

# Replace database with recovered version
mv ~/.transcript-fixer/corrections.db ~/.transcript-fixer/corrections_corrupted.db
mv ~/.transcript-fixer/corrections_new.db ~/.transcript-fixer/corrections.db
```

### Missing Tables

**Symptom**:
```
❌ Database missing tables: ['corrections', ...]
```

**Solution**: Reinitialize the schema (safe; uses `IF NOT EXISTS`):

```bash
python -c "from core import CorrectionRepository; from pathlib import Path; CorrectionRepository(Path.home() / '.transcript-fixer' / 'corrections.db')"
```

Or back up the database and reinitialize:

```bash
# Backup first
cp ~/.transcript-fixer/corrections.db ~/corrections_backup_$(date +%Y%m%d).db

# Reinitialize
uv run scripts/fix_transcription.py --init
```

## Common Pitfalls

### 1. Stage Order Confusion

**Problem**: Running Stage 2 without Stage 1 output.

**Solution**: Use `--stage 3` for the full pipeline, or run stages sequentially:

```bash
# Wrong: Stage 2 on raw file
uv run scripts/fix_transcription.py --input file.md --stage 2  # ❌

# Correct: Full pipeline
uv run scripts/fix_transcription.py --input file.md --stage 3  # ✅

# Or sequential stages
uv run scripts/fix_transcription.py --input file.md --stage 1
uv run scripts/fix_transcription.py --input file_stage1.md --stage 2
```

### 2. Overwriting Imports

**Problem**: Using `--import` without `--merge` overwrites existing corrections.

**Solution**: Always use the `--merge` flag:

```bash
# Wrong: Overwrites existing
uv run scripts/fix_transcription.py --import team.json  # ❌

# Correct: Merges with existing
uv run scripts/fix_transcription.py --import team.json --merge  # ✅
```

### 3. Ignoring Learned Suggestions

**Problem**: Not reviewing learned patterns, missing free optimizations.

**Impact**: Patterns the AI keeps detecting stay in the expensive AI stage (Stage 2) instead of moving to the free dictionary stage (Stage 1).

**Solution**: Review suggestions every 3-5 runs:

```bash
uv run scripts/fix_transcription.py --review-learned
uv run scripts/fix_transcription.py --approve "错误" "正确"
```

### 4. Testing on Large Files

**Problem**: Testing dictionary changes on large files wastes API quota.

**Solution**: Start with `--stage 1` on small files (100-500 lines):

```bash
# Test dictionary changes first
uv run scripts/fix_transcription.py --input small_sample.md --stage 1

# Review output, adjust corrections
# Then run full pipeline
uv run scripts/fix_transcription.py --input large_file.md --stage 3
```

### 5. Manual Database Edits Without Validation

**Problem**: Direct SQL edits might violate schema constraints.

**Solution**: Always validate after manual changes:

```bash
sqlite3 ~/.transcript-fixer/corrections.db
# ... make changes ...
.quit

# Validate
uv run scripts/fix_transcription.py --validate
```

### 6. Committing .db Files to Git

**Problem**: Binary database files in Git cause merge conflicts and bloat the repository.

**Solution**: Use JSON exports for version control:

```bash
# .gitignore
*.db
*.db-journal
*.bak

# Export for version control instead
uv run scripts/fix_transcription.py --export corrections_$(date +%Y%m%d).json
git add corrections_*.json
```
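
A repository-level guard can enforce this. A sketch of a hypothetical pre-commit hook — written to the current directory here, installed by moving it to `.git/hooks/pre-commit`:

```shell
# Write the hook script; move it to .git/hooks/pre-commit to activate
cat > pre-commit <<'EOF'
#!/bin/sh
# Reject any staged .db file -- export JSON instead
if git diff --cached --name-only | grep -q '\.db$'; then
  echo "Refusing to commit .db files; use --export and commit the JSON." >&2
  exit 1
fi
EOF
chmod +x pre-commit
```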

## Validation Commands

### Quick Health Check

```bash
uv run scripts/fix_transcription.py --validate
```

### Detailed Diagnostics

```bash
# Check database integrity
sqlite3 ~/.transcript-fixer/corrections.db "PRAGMA integrity_check;"

# Check table counts
sqlite3 ~/.transcript-fixer/corrections.db "
SELECT 'corrections' as table_name, COUNT(*) as count FROM corrections
UNION ALL
SELECT 'context_rules', COUNT(*) FROM context_rules
UNION ALL
SELECT 'learned_suggestions', COUNT(*) FROM learned_suggestions
UNION ALL
SELECT 'correction_history', COUNT(*) FROM correction_history;
"

# Check configuration
sqlite3 ~/.transcript-fixer/corrections.db "SELECT * FROM system_config;"
```
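
The checks above can be bundled into one loop. A sketch run here against a scratch database with empty stand-in tables — set `db` to `~/.transcript-fixer/corrections.db` in practice:

```shell
db=$(mktemp -u).db   # scratch copy for illustration
sqlite3 "$db" "CREATE TABLE corrections(a); CREATE TABLE context_rules(b);
               CREATE TABLE learned_suggestions(c); CREATE TABLE correction_history(d);"
echo "integrity: $(sqlite3 "$db" 'PRAGMA integrity_check;')"
# Report a row count per table in one pass
for t in corrections context_rules learned_suggestions correction_history; do
  echo "$t: $(sqlite3 "$db" "SELECT COUNT(*) FROM $t;") rows"
done
```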

## Getting Help

If issues persist:

1. Run `--validate` to collect diagnostic information
2. Check `correction_history` and `audit_log` tables for errors
3. Review `references/file_formats.md` for schema details
4. Check `references/architecture.md` for component details
5. Verify Python and uv versions are up to date

For database corruption, automatic backups are created before migrations. Check for `.bak` files in `~/.transcript-fixer/`.
483
transcript-fixer/references/workflow_guide.md
Normal file

@@ -0,0 +1,483 @@
# Workflow Guide

Detailed step-by-step workflows for transcript correction and management.

## Table of Contents

- [Pre-Flight Checklist](#pre-flight-checklist)
  - [Initial Setup](#initial-setup)
  - [File Preparation](#file-preparation)
  - [Execution Parameters](#execution-parameters)
  - [Environment](#environment)
- [Core Workflows](#core-workflows)
  - [1. First-Time Correction](#1-first-time-correction)
  - [2. Iterative Improvement](#2-iterative-improvement)
  - [3. Domain-Specific Corrections](#3-domain-specific-corrections)
  - [4. Team Collaboration](#4-team-collaboration)
  - [5. Stage-by-Stage Execution](#5-stage-by-stage-execution)
  - [6. Context-Aware Rules](#6-context-aware-rules)
  - [7. Diff Report Generation](#7-diff-report-generation)
- [Batch Processing](#batch-processing)
  - [Process Multiple Files](#process-multiple-files)
  - [Parallel Processing](#parallel-processing)
- [Maintenance Workflows](#maintenance-workflows)
  - [Weekly: Review Learning](#weekly-review-learning)
  - [Monthly: Export and Backup](#monthly-export-and-backup)
  - [Quarterly: Clean Up](#quarterly-clean-up)
- [Next Steps](#next-steps)

## Pre-Flight Checklist

Before running corrections, verify these prerequisites:

### Initial Setup
- [ ] Initialized with `uv run scripts/fix_transcription.py --init`
- [ ] Database exists at `~/.transcript-fixer/corrections.db`
- [ ] `GLM_API_KEY` environment variable set (run `echo $GLM_API_KEY`)
- [ ] Configuration validated (run `--validate`)

### File Preparation
- [ ] Input file exists and is readable
- [ ] File uses supported format (`.md`, `.txt`)
- [ ] File encoding is UTF-8
- [ ] File size is reasonable (<10MB for first runs)

### Execution Parameters
- [ ] Using `--stage 3` for full pipeline (or specific stage if testing)
- [ ] Domain specified with `--domain` if using specialized dictionaries
- [ ] Using `--merge` flag when importing team corrections

### Environment
- [ ] Sufficient disk space for output files (~2x input size)
- [ ] API quota available for Stage 2 corrections
- [ ] Network connectivity for API calls

**Quick validation**:

```bash
uv run scripts/fix_transcription.py --validate && echo $GLM_API_KEY
```

## Core Workflows

### 1. First-Time Correction

**Goal**: Correct a transcript for the first time.

**Steps**:

1. **Initialize** (if not done):
   ```bash
   uv run scripts/fix_transcription.py --init
   export GLM_API_KEY="your-key"
   ```

2. **Add initial corrections** (5-10 common errors):
   ```bash
   uv run scripts/fix_transcription.py --add "常见错误1" "正确词1" --domain general
   uv run scripts/fix_transcription.py --add "常见错误2" "正确词2" --domain general
   ```

3. **Test on small sample** (Stage 1 only):
   ```bash
   uv run scripts/fix_transcription.py --input sample.md --stage 1
   less sample_stage1.md  # Review output
   ```

4. **Run full pipeline**:
   ```bash
   uv run scripts/fix_transcription.py --input transcript.md --stage 3 --domain general
   ```

5. **Review outputs**:
   ```bash
   # Stage 1: Dictionary corrections
   less transcript_stage1.md

   # Stage 2: Final corrected version
   less transcript_stage2.md

   # Generate diff report
   uv run scripts/diff_generator.py transcript.md transcript_stage1.md transcript_stage2.md
   ```

**Expected duration**:
- Stage 1: Instant (dictionary lookup)
- Stage 2: ~1-2 minutes per 1000 lines (API calls)

### 2. Iterative Improvement

**Goal**: Improve correction quality over time through learning.

**Steps**:

1. **Run corrections** on 3-5 similar transcripts:
   ```bash
   uv run scripts/fix_transcription.py --input day1.md --stage 3 --domain embodied_ai
   uv run scripts/fix_transcription.py --input day2.md --stage 3 --domain embodied_ai
   uv run scripts/fix_transcription.py --input day3.md --stage 3 --domain embodied_ai
   ```

2. **Review learned suggestions**:
   ```bash
   uv run scripts/fix_transcription.py --review-learned
   ```

   **Output example**:
   ```
   📚 Learned Suggestions (Pending Review)
   ========================================

   1. "巨升方向" → "具身方向"
      Frequency: 5  Confidence: 0.95
      Examples: day1.md (line 45), day2.md (line 23), ...

   2. "奇迹创坛" → "奇绩创坛"
      Frequency: 3  Confidence: 0.87
      Examples: day1.md (line 102), day3.md (line 67)
   ```

3. **Approve high-quality suggestions**:
   ```bash
   uv run scripts/fix_transcription.py --approve "巨升方向" "具身方向"
   uv run scripts/fix_transcription.py --approve "奇迹创坛" "奇绩创坛"
   ```

4. **Verify approved corrections**:
   ```bash
   uv run scripts/fix_transcription.py --list --domain embodied_ai | grep "learned"
   ```

5. **Run next batch** (benefits from approved corrections):
   ```bash
   uv run scripts/fix_transcription.py --input day4.md --stage 3 --domain embodied_ai
   ```

**Impact**: Approved corrections move to Stage 1 (instant, free).

**Cycle**: Repeat every 3-5 transcripts for continuous improvement.

### 3. Domain-Specific Corrections

**Goal**: Build specialized dictionaries for different fields.

**Steps**:

1. **Identify domain**:
   - `embodied_ai` - Robotics, AI terminology
   - `finance` - Financial terminology
   - `medical` - Medical terminology
   - `general` - General-purpose

2. **Add domain-specific terms**:
   ```bash
   # Embodied AI domain
   uv run scripts/fix_transcription.py --add "巨升智能" "具身智能" --domain embodied_ai
   uv run scripts/fix_transcription.py --add "机器学习" "机器学习" --domain embodied_ai  # Keep as-is

   # Finance domain
   uv run scripts/fix_transcription.py --add "股价" "股价" --domain finance  # Keep as-is
   uv run scripts/fix_transcription.py --add "PE比率" "市盈率" --domain finance
   ```

3. **Use appropriate domain** when correcting:
   ```bash
   # AI meeting transcript
   uv run scripts/fix_transcription.py --input ai_meeting.md --stage 3 --domain embodied_ai

   # Financial report transcript
   uv run scripts/fix_transcription.py --input earnings_call.md --stage 3 --domain finance
   ```

4. **Review domain statistics**:
   ```bash
   sqlite3 ~/.transcript-fixer/corrections.db "SELECT * FROM correction_statistics;"
   ```

**Benefits**:
- Prevents cross-domain conflicts
- Higher accuracy per domain
- Targeted vocabulary building

### 4. Team Collaboration

**Goal**: Share corrections across team members.

**Steps**:

#### Setup (One-time per team)

1. **Create shared repository**:
   ```bash
   mkdir transcript-corrections
   cd transcript-corrections
   git init

   # .gitignore
   printf '*.db\n*.db-journal\n*.bak\n' > .gitignore
   ```

2. **Export initial corrections**:
   ```bash
   uv run scripts/fix_transcription.py --export general.json --domain general
   uv run scripts/fix_transcription.py --export embodied_ai.json --domain embodied_ai

   git add *.json
   git commit -m "Initial correction dictionaries"
   git push origin main
   ```

#### Daily Workflow

**Team Member A** (adds new corrections):

```bash
# 1. Run corrections
uv run scripts/fix_transcription.py --input transcript.md --stage 3 --domain embodied_ai

# 2. Review and approve learned suggestions
uv run scripts/fix_transcription.py --review-learned
uv run scripts/fix_transcription.py --approve "新错误" "正确词"

# 3. Export updated corrections
uv run scripts/fix_transcription.py --export embodied_ai_$(date +%Y%m%d).json --domain embodied_ai

# 4. Commit and push
git add embodied_ai_*.json
git commit -m "Add embodied AI corrections from today's transcripts"
git push origin main
```

**Team Member B** (imports team corrections):

```bash
# 1. Pull latest corrections
git pull origin main

# 2. Import with merge
uv run scripts/fix_transcription.py --import embodied_ai_20250128.json --merge

# 3. Verify
uv run scripts/fix_transcription.py --list --domain embodied_ai | tail -10
```

**Conflict resolution**: See `team_collaboration.md` for handling merge conflicts.

### 5. Stage-by-Stage Execution

**Goal**: Test dictionary changes without wasting API quota.

#### Stage 1 Only (Dictionary)

**Use when**: Testing new corrections, verifying domain setup.

```bash
uv run scripts/fix_transcription.py --input file.md --stage 1 --domain general
```

**Output**: `file_stage1.md` with dictionary corrections only.

**Review**: Check if dictionary corrections are sufficient.

#### Stage 2 Only (AI)

**Use when**: Running AI corrections on a pre-processed file.

**Prerequisites**: Stage 1 output exists.

```bash
# Stage 1 first
uv run scripts/fix_transcription.py --input file.md --stage 1

# Then Stage 2
uv run scripts/fix_transcription.py --input file_stage1.md --stage 2
```

**Output**: `file_stage1_stage2.md` (the double suffix is confusing; prefer Stage 3).

#### Stage 3 (Full Pipeline)

**Use when**: Production runs, full correction workflow.

```bash
uv run scripts/fix_transcription.py --input file.md --stage 3 --domain general
```

**Output**: Both `file_stage1.md` and `file_stage2.md`.

**Recommended**: Use Stage 3 for most workflows.

### 6. Context-Aware Rules

**Goal**: Handle edge cases with regex patterns.

**Use cases**:
- Positional corrections (e.g., "的" vs "地")
- Multi-word patterns
- Conditional corrections

**Steps**:

1. **Identify a pattern** that a simple dictionary can't handle:
   ```
   Problem: "近距离的去看" (wrong - should be "地")
   Problem: "近距离的搏杀" (correct - should keep "的")
   ```

2. **Add context rules**:
   ```bash
   sqlite3 ~/.transcript-fixer/corrections.db

   -- Higher priority for specific context
   INSERT INTO context_rules (pattern, replacement, description, priority)
   VALUES ('近距离的去看', '近距离地去看', '的→地 before verb', 10);

   -- Lower priority for general pattern
   INSERT INTO context_rules (pattern, replacement, description, priority)
   VALUES ('近距离的搏杀', '近距离的搏杀', 'Keep 的 for noun modifier', 5);

   .quit
   ```

3. **Test context rules**:
   ```bash
   uv run scripts/fix_transcription.py --input test.md --stage 1
   ```

4. **Validate**:
   ```bash
   uv run scripts/fix_transcription.py --validate
   ```

**Priority**: Higher numbers run first (use for exceptions/edge cases).

See `file_formats.md` for the context_rules schema.
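
The priority ordering can be checked directly on a scratch database. A sketch — the table layout follows the INSERT statements in step 2, and the query mirrors the assumed descending-priority application order:

```shell
db=$(mktemp -u).db
sqlite3 "$db" "CREATE TABLE context_rules(pattern TEXT, replacement TEXT, description TEXT, priority INTEGER);"
sqlite3 "$db" "INSERT INTO context_rules VALUES
  ('近距离的去看','近距离地去看','的→地 before verb',10),
  ('近距离的搏杀','近距离的搏杀','Keep 的 for noun modifier',5);"
# Rules apply in descending priority: the specific exception wins
sqlite3 "$db" "SELECT pattern FROM context_rules ORDER BY priority DESC;"
```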

### 7. Diff Report Generation

**Goal**: Visualize all changes for review.

**Use when**:
- Reviewing corrections before publishing
- Training new team members
- Documenting ASR error patterns

**Steps**:

1. **Run corrections**:
   ```bash
   uv run scripts/fix_transcription.py --input transcript.md --stage 3
   ```

2. **Generate diff reports**:
   ```bash
   uv run scripts/diff_generator.py \
     transcript.md \
     transcript_stage1.md \
     transcript_stage2.md
   ```

3. **Review outputs**:
   ```bash
   # Markdown report (statistics + summary)
   less diff_report.md

   # Unified diff (git-style)
   less transcript_unified.diff

   # HTML side-by-side (visual review)
   open transcript_sidebyside.html

   # Inline markers (for editing)
   less transcript_inline.md
   ```

**Report contents**:
- Total changes count
- Stage 1 vs Stage 2 breakdown
- Character/word count changes
- Side-by-side comparison

See `script_parameters.md` for advanced diff options.

## Batch Processing

### Process Multiple Files

```bash
# Simple loop
for file in meeting_*.md; do
  uv run scripts/fix_transcription.py --input "$file" --stage 3 --domain embodied_ai
done

# With error handling
for file in meeting_*.md; do
  echo "Processing $file..."
  if uv run scripts/fix_transcription.py --input "$file" --stage 3 --domain embodied_ai; then
    echo "✅ $file completed"
  else
    echo "❌ $file failed"
  fi
done
```

### Parallel Processing

```bash
# GNU parallel (install: brew install parallel)
ls meeting_*.md | parallel -j 4 \
  "uv run scripts/fix_transcription.py --input {} --stage 3 --domain embodied_ai"
```

**Caution**: Monitor API rate limits when processing in parallel.
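
One simple way to stay under a quota is to space out job starts with a small wrapper. A hedged sketch — the 10-second default spacing is an assumption; tune it to your actual GLM quota:

```shell
# throttle: run a command, then pause before the caller launches the next one
throttle() {
  "$@"
  sleep "${THROTTLE_SECS:-10}"
}

THROTTLE_SECS=0   # no delay for this demo; set to e.g. 10 in production
out=$(throttle echo "processed meeting_1.md")
echo "$out"
```

In practice it would wrap the correction command, e.g. `for f in meeting_*.md; do throttle uv run scripts/fix_transcription.py --input "$f" --stage 3; done`.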

## Maintenance Workflows

### Weekly: Review Learning

```bash
# Review suggestions
uv run scripts/fix_transcription.py --review-learned

# Approve high-confidence patterns
uv run scripts/fix_transcription.py --approve "错误1" "正确1"
uv run scripts/fix_transcription.py --approve "错误2" "正确2"
```

### Monthly: Export and Backup

```bash
# Export all domains
uv run scripts/fix_transcription.py --export general_$(date +%Y%m%d).json --domain general
uv run scripts/fix_transcription.py --export embodied_ai_$(date +%Y%m%d).json --domain embodied_ai

# Backup database
cp ~/.transcript-fixer/corrections.db ~/backups/corrections_$(date +%Y%m%d).db

# Database maintenance
sqlite3 ~/.transcript-fixer/corrections.db "VACUUM; REINDEX; ANALYZE;"
```

### Quarterly: Clean Up

```bash
# Archive old history (> 90 days)
sqlite3 ~/.transcript-fixer/corrections.db "
DELETE FROM correction_history
WHERE run_timestamp < datetime('now', '-90 days');
"

# Reject low-confidence suggestions
sqlite3 ~/.transcript-fixer/corrections.db "
UPDATE learned_suggestions
SET status = 'rejected'
WHERE confidence < 0.6 AND frequency < 3;
"
```

## Next Steps

- See `best_practices.md` for optimization tips
- See `troubleshooting.md` for error resolution
- See `file_formats.md` for database schema
- See `script_parameters.md` for advanced CLI options