feat(transcript-fixer): add timestamp repair and section splitting scripts

New scripts:
- fix_transcript_timestamps.py: Repair malformed timestamps (HH:MM:SS format)
- split_transcript_sections.py: Split transcript by keywords and rebase timestamps
- Automated tests for both scripts

Features:
- Timestamp validation and repair (handle missing colons, invalid ranges)
- Section splitting with custom names
- Rebase timestamps to 00:00:00 for each section
- Preserve speaker format and content integrity
- In-place editing with backup

Documentation updates:
- Add usage examples to SKILL.md
- Clarify dictionary iteration workflow (save stable patterns only)
- Update workflow guides with new script references
- Add script parameter documentation

Use cases:
- Fix ASR output with broken timestamps
- Split long meetings into focused sections
- Prepare sections for independent processing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-11 13:59:36 +08:00
parent 29f85d27c3
commit 135a1873af
8 changed files with 688 additions and 4 deletions
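
Reviewer sketch (not part of the commit): the script sources are not shown in this view, so the following is only a minimal illustration of the behavior the commit message and docs describe: normalizing mixed MM:SS / HH:MM:SS speaker timestamps, treating a large backward jump such as `59:58 -> 00:05` as a new hour, and optionally rebasing to `00:00:00`. The names `SPEAKER_LINE`, `to_seconds`, `to_hhmmss`, and `normalize`, and the 900-second default threshold, are assumptions made for this example and may not match `fix_transcript_timestamps.py`.

```python
import re

# Hypothetical pattern: a speaker label followed by MM:SS or HH:MM:SS at end of line.
# Caveat: an ordinary sentence that ends in a time would also match.
SPEAKER_LINE = re.compile(
    r"^(?P<speaker>.+?)\s+(?P<ts>\d{1,2}:\d{2}(?::\d{2})?)\s*$"
)


def to_seconds(ts: str) -> int:
    """Convert 'MM:SS' or 'HH:MM:SS' to a number of seconds."""
    parts = [int(p) for p in ts.split(":")]
    if len(parts) == 2:
        parts = [0] + parts
    hours, minutes, seconds = parts
    return hours * 3600 + minutes * 60 + seconds


def to_hhmmss(total: int) -> str:
    """Format a number of seconds as zero-padded HH:MM:SS."""
    hours, rem = divmod(total, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def normalize(lines, rollover_backjump_seconds=900, rebase_to_zero=False):
    """Rewrite speaker timestamps as HH:MM:SS; other lines pass through unchanged."""
    out = []
    hour_offset = 0     # seconds contributed by hours inferred so far
    prev_total = None   # absolute seconds of the previous speaker line
    base = None         # first timestamp, used only when rebasing
    for line in lines:
        match = SPEAKER_LINE.match(line)
        if match is None:
            out.append(line)
            continue
        ts = match.group("ts")
        raw = to_seconds(ts)
        if ts.count(":") == 2:
            # Explicit hours: trust them and resynchronize the offset.
            total = raw
            hour_offset = (raw // 3600) * 3600
        else:
            # MM:SS only: a large backward jump is read as an hour rollover.
            if (prev_total is not None
                    and (prev_total - hour_offset) - raw >= rollover_backjump_seconds):
                hour_offset += 3600
            total = hour_offset + raw
        prev_total = total
        if rebase_to_zero:
            if base is None:
                base = total
            total = max(0, total - base)
        out.append(f"{match.group('speaker')} {to_hhmmss(total)}")
    return out
```

For example, `normalize(["Speaker 7 59:58", "Speaker 2 00:05"])` would yield `Speaker 7 00:59:58` followed by `Speaker 2 01:00:05` under these assumptions. The sketch does not cover the missing-colon repair, the jitter tolerance, or the backup behavior mentioned in the commit message.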
--- a/transcript-fixer/references/iteration_workflow.md
+++ b/transcript-fixer/references/iteration_workflow.md
@@ -16,7 +16,7 @@ The core value of transcript-fixer is building a personalized correction diction
 └─────────────────────────────────────────────────┘
 ```

-**Key principle**: Every correction you make should be saved to the dictionary. This transforms one-time work into permanent value.
+**Key principle**: Every stable, reusable ASR correction you make should be saved to the dictionary. This transforms one-time work into permanent value without polluting the database.

 ## Workflow Checklist

@@ -34,7 +34,7 @@ Correction Progress:

 ## Save Corrections Immediately

-After fixing any transcript, save corrections:
+After fixing any transcript, save stable corrections:

 ```bash
 # Single correction
@@ -122,3 +122,12 @@ Patterns appearing ≥3 times at ≥80% confidence are suggested for review.
 3. **Use domains**: Organize corrections by topic for better precision
 4. **Verify**: Always run --list to confirm saves
 5. **Review suggestions**: Periodically check --review-learned for auto-detected patterns
+
+## What NOT to Save to Dictionary
+
+Do **not** save these as reusable dictionary entries:
+
+- Full-sentence deletions
+- One-off section headers or meeting-specific boilerplate
+- Context-only disambiguations such as `cloud -> Claude` when `cloud` can also be legitimate
+- File-local cleanup after section splitting or timestamp rebasing
--- a/transcript-fixer/references/script_parameters.md
+++ b/transcript-fixer/references/script_parameters.md
@@ -9,6 +9,8 @@ Detailed command-line parameters and usage examples for transcript-fixer Python
  - [Correction Management](#correction-management)
  - [Correction Workflow](#correction-workflow)
  - [Learning Commands](#learning-commands)
+- [fix_transcript_timestamps.py](#fix_transcript_timestampspy) - Normalize/repair speaker timestamps
+- [split_transcript_sections.py](#split_transcript_sectionspy) - Split transcript into named sections
 - [diff_generator.py](#diffgeneratorpy) - Generate comparison reports
 - [Common Workflows](#common-workflows)
 - [Exit Codes](#exit-codes)
@@ -74,6 +76,59 @@ python scripts/fix_transcription.py --input meeting.md --stage 3 --output ./corr
 - `2` - GLM_API_KEY environment variable not set (Stage 2 or 3 only)
 - `3` - API request failed

+## fix_transcript_timestamps.py
+
+Normalize speaker timestamp lines such as `天生 00:21` or `Speaker 7 01:31:10`.
+
+### Syntax
+
+```bash
+python scripts/fix_transcript_timestamps.py <file> [--output FILE | --in-place | --check]
+```
+
+### Key Parameters
+
+- `--format {hhmmss,preserve}`: output timestamp style
+- `--rebase-to-zero`: reset the first detected speaker timestamp to `00:00:00`
+- `--rollover-backjump-seconds`: threshold for treating `59:58 -> 00:05` as a new hour
+- `--jitter-seconds`: tolerated small backward jitter before flagging anomaly
+
+### Usage Examples
+
+```bash
+# Normalize mixed MM:SS / HH:MM:SS
+python scripts/fix_transcript_timestamps.py meeting.txt --in-place
+
+# Rebase a split transcript so it starts at 00:00:00
+python scripts/fix_transcript_timestamps.py workshop-class.txt --in-place --rebase-to-zero
+
+# Only inspect anomalies, do not write
+python scripts/fix_transcript_timestamps.py meeting.txt --check
+```
+
+## split_transcript_sections.py
+
+Split a transcript into named sections using marker phrases. Useful for workshop transcripts that include setup chat, class, and debrief in one file.
+
+### Syntax
+
+```bash
+python scripts/split_transcript_sections.py <file> \
+  --first-section-name <name> \
+  --section "Name::Marker" \
+  --section "Name::Marker"
+```
+
+### Usage Example
+
+```bash
+python scripts/split_transcript_sections.py workshop.txt \
+  --first-section-name "课前聊天" \
+  --section "正式上课::好，无缝切换嘛。对。那个曹总连上了吗？那个网页。" \
+  --section "课后复盘::我们复盘一下。" \
+  --rebase-to-zero
+```
+
 ## generate_diff_report.py

 Multi-format diff report generator for comparing correction stages.
--- a/transcript-fixer/references/workflow_guide.md
+++ b/transcript-fixer/references/workflow_guide.md
@@ -17,6 +17,7 @@ Detailed step-by-step workflows for transcript correction and management.
  - [5. Stage-by-Stage Execution](#5-stage-by-stage-execution)
  - [6. Context-Aware Rules](#6-context-aware-rules)
  - [7. Diff Report Generation](#7-diff-report-generation)
+  - [8. Workshop Transcript Split + Timestamp Rebase](#8-workshop-transcript-split--timestamp-rebase)
 - [Batch Processing](#batch-processing)
  - [Process Multiple Files](#process-multiple-files)
  - [Parallel Processing](#parallel-processing)
@@ -400,6 +401,30 @@ See `file_formats.md` for context_rules schema.

 See `script_parameters.md` for advanced diff options.

+### 8. Workshop Transcript Split + Timestamp Rebase
+
+**Goal**: Split a long workshop transcript into sections such as setup chat, class, and debrief, then make each section start from `00:00:00`.
+
+**Steps**:
+
+1. **Correct transcript text first** (dictionary + AI/manual review)
+2. **Pick marker phrases** for each section boundary
+3. **Split and rebase**:
+
+```bash
+uv run scripts/split_transcript_sections.py workshop.txt \
+  --first-section-name "课前聊天" \
+  --section "正式上课::好，无缝切换嘛。对。那个曹总连上了吗？那个网页。" \
+  --section "课后复盘::我们复盘一下。" \
+  --rebase-to-zero
+```
+
+4. **If you already split the files**, rebase a single file directly:
+
+```bash
+uv run scripts/fix_transcript_timestamps.py class.txt --in-place --rebase-to-zero
+```
+
 ## Batch Processing

 ### Process Multiple Files