feat(transcript-fixer): add timestamp repair and section splitting scripts

New scripts:
- fix_transcript_timestamps.py: Repair malformed timestamps (HH:MM:SS format)
- split_transcript_sections.py: Split transcript by keywords and rebase timestamps
- Automated tests for both scripts

Features:
- Timestamp validation and repair (handle missing colons, invalid ranges)
- Section splitting with custom names
- Rebase timestamps to 00:00:00 for each section
- Preserve speaker format and content integrity
- In-place editing with backup

Documentation updates:
- Add usage examples to SKILL.md
- Clarify dictionary iteration workflow (save stable patterns only)
- Update workflow guides with new script references
- Add script parameter documentation

Use cases:
- Fix ASR output with broken timestamps
- Split long meetings into focused sections
- Prepare sections for independent processing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -9,6 +9,8 @@ Detailed command-line parameters and usage examples for transcript-fixer Python
- [Correction Management](#correction-management)
- [Correction Workflow](#correction-workflow)
- [Learning Commands](#learning-commands)
- [fix_transcript_timestamps.py](#fix_transcript_timestampspy) - Normalize/repair speaker timestamps
- [split_transcript_sections.py](#split_transcript_sectionspy) - Split transcript into named sections
- [diff_generator.py](#diffgeneratorpy) - Generate comparison reports
- [Common Workflows](#common-workflows)
- [Exit Codes](#exit-codes)
@@ -74,6 +76,59 @@ python scripts/fix_transcription.py --input meeting.md --stage 3 --output ./corr
- `2` - GLM_API_KEY environment variable not set (Stage 2 or 3 only)
- `3` - API request failed

## fix_transcript_timestamps.py

Normalize speaker timestamp lines such as `天生 00:21` or `Speaker 7 01:31:10`.

### Syntax

```bash
python scripts/fix_transcript_timestamps.py <file> [--output FILE | --in-place | --check]
```
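
To make the normalization concrete, here is a minimal sketch of how a speaker timestamp line could be detected and rewritten in `HH:MM:SS` form. It is not the script's actual code: the regular expression and the helper names (`parse_seconds`, `format_hhmmss`, `normalize_line`) are assumptions chosen for illustration.

```python
import re

# Hypothetical pattern: a speaker label, whitespace, then MM:SS or HH:MM:SS
# at the end of the line. The real script may use stricter rules; this naive
# pattern would also match a content line that happens to end in a time.
SPEAKER_TS = re.compile(r"^(?P<speaker>.+?)\s+(?P<ts>\d{1,2}:\d{2}(?::\d{2})?)\s*$")


def parse_seconds(ts: str) -> int:
    """Convert 'MM:SS' or 'HH:MM:SS' into a total number of seconds."""
    parts = [int(p) for p in ts.split(":")]
    if len(parts) == 2:
        parts = [0] + parts  # treat MM:SS as 00:MM:SS
    hours, minutes, seconds = parts
    return hours * 3600 + minutes * 60 + seconds


def format_hhmmss(total: int) -> str:
    """Render a number of seconds as zero-padded HH:MM:SS."""
    hours, rem = divmod(total, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def normalize_line(line: str) -> str:
    """Rewrite a speaker timestamp line; leave every other line unchanged."""
    match = SPEAKER_TS.match(line)
    if match is None:
        return line
    return f"{match['speaker']} {format_hhmmss(parse_seconds(match['ts']))}"
```

Under these assumptions, `normalize_line("天生 00:21")` yields `天生 00:00:21`, while a line that is already in `HH:MM:SS` form passes through unchanged.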

### Key Parameters

- `--format {hhmmss,preserve}`: output timestamp style
- `--rebase-to-zero`: shift timestamps so the first detected speaker timestamp becomes `00:00:00`
- `--rollover-backjump-seconds`: backward-jump threshold, in seconds, for treating a drop such as `59:58 -> 00:05` as a rollover into a new hour
- `--jitter-seconds`: backward jitter, in seconds, tolerated before a timestamp is flagged as an anomaly

### Usage Examples

```bash
# Normalize mixed MM:SS / HH:MM:SS
python scripts/fix_transcript_timestamps.py meeting.txt --in-place

# Rebase a split transcript so it starts at 00:00:00
python scripts/fix_transcript_timestamps.py workshop-class.txt --in-place --rebase-to-zero

# Only inspect anomalies, do not write
python scripts/fix_transcript_timestamps.py meeting.txt --check
```
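
The interplay of `--rollover-backjump-seconds` and `--jitter-seconds` can be pictured with the sketch below. It is an illustration under stated assumptions, not the script's implementation: the function name `repair_sequence` and both default values are hypothetical, and timestamps are assumed to be already parsed into seconds.

```python
def repair_sequence(raw_seconds, rollover_backjump=1800, jitter=5):
    """Return (repaired, anomalies) for a list of timestamps in seconds.

    rollover_backjump and jitter mirror the two CLI thresholds; the default
    values used here are illustrative assumptions only.
    """
    repaired = []
    anomalies = []  # indices of timestamps that jump backward suspiciously
    offset = 0      # accumulated whole hours, in seconds
    previous = None
    for index, value in enumerate(raw_seconds):
        current = value + offset
        if previous is not None and current < previous:
            drop = previous - current
            if drop >= rollover_backjump:
                # A large backward jump (e.g. 59:58 -> 00:05) is read as the
                # clock wrapping into the next hour.
                offset += 3600
                current += 3600
            elif drop > jitter:
                # Too big to be jitter, too small to be a rollover.
                anomalies.append(index)
        repaired.append(current)
        previous = current
    return repaired, anomalies
```

With these assumed defaults, the sequence `59:58`, `00:05`, `01:05` (3598, 5, 65 seconds) is repaired to 3598, 3605, 3665 seconds with no anomalies, while a 60-second backward step is left in place and reported by index.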

## split_transcript_sections.py

Split a transcript into named sections using marker phrases. Useful for workshop transcripts that include setup chat, class, and debrief in one file.

### Syntax

```bash
python scripts/split_transcript_sections.py <file> \
  --first-section-name <name> \
  --section "Name::Marker" \
  --section "Name::Marker"
```
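
As a rough illustration of the `Name::Marker` convention, the sketch below parses section specs and cuts a list of lines at the first line containing each marker, in order. The helper names (`parse_section_spec`, `split_sections`) and the in-order matching rule are assumptions; the actual script may, for example, cut at the speaker line that precedes the marker instead.

```python
from typing import List, Tuple


def parse_section_spec(spec: str) -> Tuple[str, str]:
    """Split a 'Name::Marker' string into (name, marker)."""
    name, sep, marker = spec.partition("::")
    if not sep or not name or not marker:
        raise ValueError(f"expected 'Name::Marker', got {spec!r}")
    return name, marker


def split_sections(
    lines: List[str], first_name: str, specs: List[str]
) -> List[Tuple[str, List[str]]]:
    """Assign lines to named sections; each marker opens the next section."""
    pending = [parse_section_spec(spec) for spec in specs]
    sections: List[Tuple[str, List[str]]] = [(first_name, [])]
    for line in lines:
        # Assumption: markers are consumed in the order they were given.
        if pending and pending[0][1] in line:
            name, _marker = pending.pop(0)
            sections.append((name, []))
        sections[-1][1].append(line)
    return sections
```

Everything before the first marker lands in the section named by `--first-section-name`, which is why that option takes only a name and no marker.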

### Usage Example

```bash
# Section names: 课前聊天 = pre-class chat, 正式上课 = main class, 课后复盘 = post-class debrief.
# Marker phrases are matched against the transcript text, so they stay in the original language.
python scripts/split_transcript_sections.py workshop.txt \
  --first-section-name "课前聊天" \
  --section "正式上课::好,无缝切换嘛。对。那个曹总连上了吗?那个网页。" \
  --section "课后复盘::我们复盘一下。" \
  --rebase-to-zero
```
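
The effect of `--rebase-to-zero` on a single section can be sketched as follows. This is a simplified illustration, not the script's code: the function name `rebase_to_zero` is hypothetical, timestamps are assumed to be in `HH:MM:SS` form at the end of the line already, and clamping negative results to zero is an assumption of this sketch.

```python
import re

# Assumed shape: HH:MM:SS at the end of a speaker line.
TS_AT_END = re.compile(r"(\d{2}):(\d{2}):(\d{2})\s*$")


def rebase_to_zero(lines):
    """Shift every trailing timestamp so the first one becomes 00:00:00."""
    base = None
    rebased = []
    for line in lines:
        match = TS_AT_END.search(line)
        if match is None:
            rebased.append(line)  # content lines pass through untouched
            continue
        hours, minutes, seconds = (int(group) for group in match.groups())
        total = hours * 3600 + minutes * 60 + seconds
        if base is None:
            base = total  # first detected timestamp defines the new origin
        shifted = max(total - base, 0)  # clamp: an assumption of this sketch
        hh, rem = divmod(shifted, 3600)
        mm, ss = divmod(rem, 60)
        rebased.append(f"{line[:match.start()]}{hh:02d}:{mm:02d}:{ss:02d}")
    return rebased
```

Applied to a section whose first speaker line is `Speaker 2 00:42:10`, a later line `Speaker 3 00:43:05` becomes `Speaker 3 00:00:55`, so each section can be processed as if it were its own recording.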

## generate_diff_report.py

Multi-format diff report generator for comparing correction stages.