New scripts: - fix_transcript_timestamps.py: Repair malformed timestamps (HH:MM:SS format) - split_transcript_sections.py: Split transcript by keywords and rebase timestamps - Automated tests for both scripts Features: - Timestamp validation and repair (handle missing colons, invalid ranges) - Section splitting with custom names - Rebase timestamps to 00:00:00 for each section - Preserve speaker format and content integrity - In-place editing with backup Documentation updates: - Add usage examples to SKILL.md - Clarify dictionary iteration workflow (save stable patterns only) - Update workflow guides with new script references - Add script parameter documentation Use cases: - Fix ASR output with broken timestamps - Split long meetings into focused sections - Prepare sections for independent processing Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
242 lines
6.6 KiB
Markdown
242 lines
6.6 KiB
Markdown
# Script Parameters Reference
|
|
|
|
Detailed command-line parameters and usage examples for transcript-fixer Python scripts.
|
|
|
|
## Table of Contents
|
|
|
|
- [fix_transcription.py](#fixtranscriptionpy) - Main correction pipeline
|
|
- [Setup Commands](#setup-commands)
|
|
- [Correction Management](#correction-management)
|
|
- [Correction Workflow](#correction-workflow)
|
|
- [Learning Commands](#learning-commands)
|
|
- [fix_transcript_timestamps.py](#fix_transcript_timestampspy) - Normalize/repair speaker timestamps
|
|
- [split_transcript_sections.py](#split_transcript_sectionspy) - Split transcript into named sections
|
|
- [diff_generator.py](#diffgeneratorpy) - Generate comparison reports
|
|
- [Common Workflows](#common-workflows)
|
|
- [Exit Codes](#exit-codes)
|
|
- [Environment Variables](#environment-variables)
|
|
|
|
---
|
|
|
|
## fix_transcription.py
|
|
|
|
Main correction pipeline script supporting three processing stages.
|
|
|
|
### Syntax
|
|
|
|
```bash
|
|
python scripts/fix_transcription.py --input <file> --stage <1|2|3> [--output <dir>]
|
|
```
|
|
|
|
### Parameters
|
|
|
|
- `--input, -i` (required): Input Markdown file path
|
|
- `--stage, -s` (optional): Stage to execute (default: 3)
|
|
- `1` = Dictionary corrections only
|
|
- `2` = AI corrections only (requires Stage 1 output file)
|
|
- `3` = Both stages sequentially
|
|
- `--output, -o` (optional): Output directory (defaults to input file directory)
|
|
|
|
### Usage Examples
|
|
|
|
**Run dictionary corrections only:**
|
|
```bash
|
|
python scripts/fix_transcription.py --input meeting.md --stage 1
|
|
```
|
|
|
|
Output: `meeting_阶段1_词典修复.md`
|
|
|
|
**Run AI corrections only:**
|
|
```bash
|
|
python scripts/fix_transcription.py --input meeting_阶段1_词典修复.md --stage 2
|
|
```
|
|
|
|
Output: `meeting_阶段2_AI修复.md`
|
|
|
|
Note: Requires Stage 1 output file as input.
|
|
|
|
**Run complete pipeline:**
|
|
```bash
|
|
python scripts/fix_transcription.py --input meeting.md --stage 3
|
|
```
|
|
|
|
Outputs:
|
|
- `meeting_阶段1_词典修复.md`
|
|
- `meeting_阶段2_AI修复.md`
|
|
|
|
**Custom output directory:**
|
|
```bash
|
|
python scripts/fix_transcription.py --input meeting.md --stage 3 --output ./corrections
|
|
```
|
|
|
|
### Exit Codes
|
|
|
|
- `0` - Success
|
|
- `1` - Missing required parameters or file not found
|
|
- `2` - GLM_API_KEY environment variable not set (Stage 2 or 3 only)
|
|
- `3` - API request failed
|
|
|
|
## fix_transcript_timestamps.py
|
|
|
|
Normalize speaker timestamp lines such as `天生 00:21` or `Speaker 7 01:31:10`.
|
|
|
|
### Syntax
|
|
|
|
```bash
|
|
python scripts/fix_transcript_timestamps.py <file> [--output FILE | --in-place | --check]
|
|
```
|
|
|
|
### Key Parameters
|
|
|
|
- `--format {hhmmss,preserve}`: output timestamp style
|
|
- `--rebase-to-zero`: reset the first detected speaker timestamp to `00:00:00`
|
|
- `--rollover-backjump-seconds`: threshold for treating `59:58 -> 00:05` as a new hour
|
|
- `--jitter-seconds`: tolerated small backward jitter before flagging anomaly
|
|
|
|
### Usage Examples
|
|
|
|
```bash
|
|
# Normalize mixed MM:SS / HH:MM:SS
|
|
python scripts/fix_transcript_timestamps.py meeting.txt --in-place
|
|
|
|
# Rebase a split transcript so it starts at 00:00:00
|
|
python scripts/fix_transcript_timestamps.py workshop-class.txt --in-place --rebase-to-zero
|
|
|
|
# Only inspect anomalies, do not write
|
|
python scripts/fix_transcript_timestamps.py meeting.txt --check
|
|
```
|
|
|
|
## split_transcript_sections.py
|
|
|
|
Split a transcript into named sections using marker phrases. Useful for workshop transcripts that include setup chat, class, and debrief in one file.
|
|
|
|
### Syntax
|
|
|
|
```bash
|
|
python scripts/split_transcript_sections.py <file> \
|
|
--first-section-name <name> \
|
|
--section "Name::Marker" \
|
|
--section "Name::Marker"
|
|
```
|
|
|
|
### Usage Example
|
|
|
|
```bash
|
|
python scripts/split_transcript_sections.py workshop.txt \
|
|
--first-section-name "课前聊天" \
|
|
--section "正式上课::好,无缝切换嘛。对。那个曹总连上了吗?那个网页。" \
|
|
--section "课后复盘::我们复盘一下。" \
|
|
--rebase-to-zero
|
|
```
|
|
|
|
## generate_diff_report.py
|
|
|
|
Multi-format diff report generator for comparing correction stages.
|
|
|
|
### Syntax
|
|
|
|
```bash
|
|
python scripts/generate_diff_report.py --original <file> --stage1 <file> --stage2 <file> [--output-dir <dir>]
|
|
```
|
|
|
|
### Parameters
|
|
|
|
- `--original` (required): Original transcript file path
|
|
- `--stage1` (required): Stage 1 correction output file path
|
|
- `--stage2` (required): Stage 2 correction output file path
|
|
- `--output-dir` (optional): Output directory for diff reports (defaults to original file directory)
|
|
|
|
### Usage Examples
|
|
|
|
**Basic usage:**
|
|
```bash
|
|
python scripts/generate_diff_report.py \
|
|
--original "meeting.md" \
|
|
--stage1 "meeting_阶段1_词典修复.md" \
|
|
--stage2 "meeting_阶段2_AI修复.md"
|
|
```
|
|
|
|
**Custom output directory:**
|
|
```bash
|
|
python scripts/generate_diff_report.py \
|
|
--original "meeting.md" \
|
|
--stage1 "meeting_阶段1_词典修复.md" \
|
|
--stage2 "meeting_阶段2_AI修复.md" \
|
|
--output-dir "./reports"
|
|
```
|
|
|
|
### Output Files
|
|
|
|
The script generates four comparison formats:
|
|
|
|
1. **Markdown summary** (`*_对比报告.md`)
|
|
- High-level statistics and change summary
|
|
- Word count changes per stage
|
|
- Common error patterns identified
|
|
|
|
2. **Unified diff** (`*_unified.diff`)
|
|
- Traditional Unix diff format
|
|
- Suitable for command-line review or version control
|
|
|
|
3. **HTML side-by-side** (`*_对比.html`)
|
|
- Visual side-by-side comparison
|
|
- Color-coded additions/deletions
|
|
- **Recommended for human review**
|
|
|
|
4. **Inline marked** (`*_行内对比.txt`)
|
|
- Single-column format with inline change markers
|
|
- Useful for quick text editor review
|
|
|
|
### Exit Codes
|
|
|
|
- `0` - Success
|
|
- `1` - Missing required parameters or file not found
|
|
- `2` - File format error (non-Markdown input)
|
|
|
|
## Common Workflows
|
|
|
|
### Testing Dictionary Changes
|
|
|
|
Test dictionary updates before running expensive AI corrections:
|
|
|
|
```bash
|
|
# 1. Update CORRECTIONS_DICT in scripts/fix_transcription.py
|
|
# 2. Run Stage 1 only
|
|
python scripts/fix_transcription.py --input meeting.md --stage 1
|
|
|
|
# 3. Review output
|
|
cat meeting_阶段1_词典修复.md
|
|
|
|
# 4. If satisfied, run Stage 2
|
|
python scripts/fix_transcription.py --input meeting_阶段1_词典修复.md --stage 2
|
|
```
|
|
|
|
### Batch Processing
|
|
|
|
Process multiple transcripts in sequence:
|
|
|
|
```bash
|
|
for file in transcripts/*.md; do
|
|
python scripts/fix_transcription.py --input "$file" --stage 3
|
|
done
|
|
```
|
|
|
|
### Quick Review Cycle
|
|
|
|
Generate and open comparison report immediately after correction:
|
|
|
|
```bash
|
|
# Run corrections
|
|
python scripts/fix_transcription.py --input meeting.md --stage 3
|
|
|
|
# Generate and open diff report
|
|
python scripts/generate_diff_report.py \
|
|
--original "meeting.md" \
|
|
--stage1 "meeting_阶段1_词典修复.md" \
|
|
--stage2 "meeting_阶段2_AI修复.md"
|
|
|
|
open meeting_对比.html # macOS
|
|
# xdg-open meeting_对比.html # Linux
|
|
# start meeting_对比.html # Windows
|
|
```
|