feat(transcript-fixer): add timestamp repair and section splitting scripts

New scripts:
- fix_transcript_timestamps.py: Repair malformed timestamps (HH:MM:SS format)
- split_transcript_sections.py: Split transcript by keywords and rebase timestamps
- Automated tests for both scripts

Features:
- Timestamp validation and repair (handle missing colons, invalid ranges)
- Section splitting with custom names
- Rebase timestamps to 00:00:00 for each section
- Preserve speaker format and content integrity
- In-place editing with backup

Documentation updates:
- Add usage examples to SKILL.md
- Clarify dictionary iteration workflow (save stable patterns only)
- Update workflow guides with new script references
- Add script parameter documentation

Use cases:
- Fix ASR output with broken timestamps
- Split long meetings into focused sections
- Prepare sections for independent processing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
daymade
2026-03-11 13:59:36 +08:00
parent 29f85d27c3
commit 135a1873af
8 changed files with 688 additions and 4 deletions

View File

@@ -44,6 +44,20 @@ The enhanced wrapper automatically:
- Moves output files to specified directory
- Opens HTML visual diff in browser for immediate feedback
**Timestamp repair**:
```bash
uv run scripts/fix_transcript_timestamps.py meeting.txt --in-place
```
**Split transcript into sections and rebase each section to `00:00:00`**:
```bash
uv run scripts/split_transcript_sections.py meeting.txt \
--first-section-name "课前聊天" \
--section "正式上课::好,无缝切换嘛。对。那个曹总连上了吗?那个网页。" \
--section "课后复盘::我们复盘一下。" \
--rebase-to-zero
```
**Alternative: Use Core Script Directly**:
```bash
@@ -117,13 +131,15 @@ See `references/workflow_guide.md` for detailed workflows, `references/script_pa
## Critical Workflow: Dictionary Iteration
**MUST save corrections after each fix.** This is the skill's core value.
**Save stable, reusable ASR patterns after each fix.** This is the skill's core value.
After fixing errors manually, immediately save to dictionary:
After fixing errors manually, immediately save stable corrections to dictionary:
```bash
uv run scripts/fix_transcription.py --add "错误词" "正确词" --domain general
```
Do **not** save one-off deletions, ambiguous context-only rewrites, or section-specific cleanup to the dictionary.
See `references/iteration_workflow.md` for complete iteration guide with checklist.
## AI Fallback Strategy
@@ -162,7 +178,9 @@ sqlite3 ~/.transcript-fixer/corrections.db "SELECT value FROM system_config WHER
- `ensure_deps.py` - Initialize shared virtual environment (run once, optional)
- `fix_transcript_enhanced.py` - Enhanced wrapper (recommended for interactive use)
- `fix_transcription.py` - Core CLI (for automation)
- `fix_transcript_timestamps.py` - Normalize/repair speaker timestamps and optionally rebase to zero
- `generate_word_diff.py` - Generate word-level diff HTML for reviewing corrections
- `split_transcript_sections.py` - Split a transcript by marker phrases and optionally rebase each section
- `examples/bulk_import.py` - Bulk import example
**References** (load as needed):