feat: optimize skills + add pipeline handoff chaining across 9 skills
asr-transcribe-to-text:
- Add local MLX transcription path (macOS Apple Silicon, 15-27x realtime)
- Add bundled script transcribe_local_mlx.py with max_tokens=200000
- Add local_mlx_guide.md with benchmarks and truncation trap docs
- Auto-detect platform and recommend local vs remote mode
- Fix audio extraction format (MP3 → WAV 16kHz mono PCM)
- Add Step 5: recommend transcript-fixer after transcription

transcript-fixer:
- Optimize SKILL.md from 289 → 153 lines (best practices compliance)
- Move FALSE_POSITIVE_RISKS (40 lines) to references/false_positive_guide.md
- Move Example Session to references/example_session.md
- Improve description for better triggering (226 → 580 chars)
- Add handoff to meeting-minutes-taker

skill-creator:
- Add "Pipeline Handoff" pattern to Skill Writing Guide
- Add pipeline check reminder in Step 4 (Edit the Skill)

Pipeline handoffs added to 8 skills forming 6 chains:
- youtube-downloader → asr-transcribe-to-text → transcript-fixer → meeting-minutes-taker → pdf/ppt-creator
- deep-research → fact-checker → pdf/ppt-creator
- doc-to-markdown → docs-cleaner / fact-checker
- claude-code-history-files-finder → continue-claude-work

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---
name: asr-transcribe-to-text
description: Transcribes audio and video files to text using Qwen3-ASR. Supports two modes — local MLX inference on macOS Apple Silicon (no API key, 15-27x realtime) and remote API via vLLM/OpenAI-compatible endpoints. Auto-detects platform and recommends the best path. Triggers when the user wants to transcribe recordings, convert audio/video to text, do speech-to-text, or mentions ASR, Qwen ASR, 转录 (transcription), 语音转文字 (speech-to-text), 录音转文字 (recording-to-text). Also triggers for meeting recordings, lectures, interviews, podcasts, screen recordings, or any audio/video file the user wants converted to text.
argument-hint: [audio-or-video-file-path ...]
---

# ASR Transcribe to Text

Transcribe audio/video files to text using Qwen3-ASR. Two inference paths:

| Mode | When | Speed | Cost |
|------|------|-------|------|
| **Local MLX** | macOS Apple Silicon | 15-27x realtime | Free |
| **Remote API** | Any platform, or when local unavailable | Depends on GPU | API/self-hosted |

Configuration persists in `${CLAUDE_PLUGIN_DATA}/config.json`.

## Step 0: Detect Platform and Load Config

```bash
cat "${CLAUDE_PLUGIN_DATA}/config.json" 2>/dev/null
```

**If config exists**, read values and proceed to Step 1.

**If config does not exist**, auto-detect platform first:

```bash
python3 -c "
import sys, platform
is_mac_arm = sys.platform == 'darwin' and platform.machine() in ('arm64', 'aarch64')
print(f'Platform: {sys.platform} {platform.machine()}')
print(f'Apple Silicon: {is_mac_arm}')
if is_mac_arm:
    print('RECOMMEND: local-mlx')
else:
    print('RECOMMEND: remote-api')
"
```

Then use **AskUserQuestion** with platform-aware defaults:

For **macOS Apple Silicon** (recommended: local):

```
ASR setup — your Mac has Apple Silicon, so local transcription is recommended.

Q1: Transcription mode?
A) Local MLX — runs on your Mac's GPU, no API key needed, 15-27x realtime (Recommended)
B) Remote API — send audio to a server (vLLM, Tailscale workstation, etc.)

Q2: Does your network have an HTTP proxy that might intercept traffic?
A) Yes — bypass proxy for ASR traffic (Recommended if using Shadowrocket/Clash)
B) No — direct connection
```

For **other platforms** (recommended: remote):

```
ASR setup — local MLX requires macOS Apple Silicon. Using remote API mode.

Q1: ASR Endpoint URL?
A) http://workstation-4090-wsl:8002/v1/audio/transcriptions (Qwen3-ASR vLLM via Tailscale)
B) http://localhost:8002/v1/audio/transcriptions (Local server)
C) Custom URL

Q2: Proxy bypass needed?
A) Yes (Recommended for Shadowrocket/Clash/corporate proxy)
B) No
```

Save config:

```bash
mkdir -p "${CLAUDE_PLUGIN_DATA}"
python3 -c "
import json
config = {
    'mode': 'MODE',  # 'local-mlx' or 'remote-api'
    'model': 'MODEL_ID',  # local: 'mlx-community/Qwen3-ASR-1.7B-8bit', remote: 'Qwen/Qwen3-ASR-1.7B'
    'max_tokens': 200000,  # local only, critical for long audio
    'endpoint': 'URL',  # remote only
    'noproxy': True,
    'max_timeout': 900  # remote only
}
with open('${CLAUDE_PLUGIN_DATA}/config.json', 'w') as f:
    json.dump(config, f, indent=2)
print('Config saved.')
"
```
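
Before proceeding, it can help to sanity-check that the saved config has the keys its mode needs. The following is an illustrative sketch only (the key-per-mode mapping follows the schema above; the function itself is not part of the bundled scripts):

```python
def validate_config(cfg: dict) -> list[str]:
    """Return config keys missing for the chosen mode (schema from the block above)."""
    required = {
        "local-mlx": ["model", "max_tokens"],
        "remote-api": ["endpoint", "model", "max_timeout"],
    }
    return [k for k in required.get(cfg.get("mode"), []) if k not in cfg]

# A local-mlx config without max_tokens is flagged:
print(validate_config({"mode": "local-mlx", "model": "mlx-community/Qwen3-ASR-1.7B-8bit"}))
```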

## Step 1: Extract Audio (if input is video)

For video files (mp4, mov, mkv, avi, webm), extract as 16kHz mono WAV:

```bash
ffmpeg -i INPUT_VIDEO -vn -acodec pcm_s16le -ar 16000 -ac 1 OUTPUT.wav -y
```

Audio files (wav, mp3, m4a, flac, ogg) can be used directly. Get duration:

```bash
ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 INPUT_FILE
```

**Cleanup**: After transcription succeeds, delete extracted WAV files to save disk space.

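The extracted WAV size is predictable from the format above (16kHz, mono, 16-bit PCM), which is a quick way to confirm extraction worked before sending the file anywhere. A small sketch of that arithmetic:

```python
def expected_wav_bytes(duration_s: float, rate: int = 16000,
                       channels: int = 1, sample_bytes: int = 2) -> int:
    """Approximate PCM payload size for the 16kHz mono 16-bit WAV produced above
    (header overhead ignored)."""
    return int(duration_s * rate * channels * sample_bytes)

print(expected_wav_bytes(60))  # → 1920000, i.e. roughly 1.9 MB per minute
```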
## Step 2: Transcribe

### Path A: Local MLX (macOS Apple Silicon)

Use the bundled script — it handles model loading, chunking, and the critical `max_tokens` parameter:

```bash
uv run ${CLAUDE_PLUGIN_ROOT}/scripts/transcribe_local_mlx.py \
  INPUT_AUDIO [INPUT_AUDIO2 ...] \
  --output-dir OUTPUT_DIR
```

The script loads the model once and transcribes all files sequentially (no GPU contention). For details on performance, model compatibility, and the `max_tokens` truncation issue, see `references/local_mlx_guide.md`.

**Critical**: The upstream `mlx-audio` default `max_tokens=8192` silently truncates audio longer than ~40 minutes. The bundled script defaults to `200000`. If calling `model.generate()` directly, always pass `max_tokens=200000`.

### Path B: Remote API

**Health check first** (skip if already verified this session):

```bash
python3 -c "
import json, subprocess, sys
# ... load config, derive base URL, build curl command (context elided in diff) ...
result = subprocess.run(
    # ... curl command list (context elided) ...
    capture_output=True, text=True
)
if result.returncode != 0 or not result.stdout.strip():
    print(f'HEALTH CHECK FAILED: {base}/models', file=sys.stderr)
    sys.exit(1)
else:
    print(f'Service healthy: {base}')
"
```

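As a self-contained sketch of what the elided portion of that health check does: derive the `/v1` base from the configured endpoint, honor the `noproxy` setting, and probe `{base}/models`. The example config values and the exact base-URL derivation here are illustrative assumptions, not the bundled code:

```python
import shutil
import subprocess

# Hypothetical config values, matching the schema saved in Step 0.
cfg = {"endpoint": "http://localhost:8002/v1/audio/transcriptions", "noproxy": True}

# Strip the /audio/transcriptions route to get the API base (.../v1).
base = cfg["endpoint"].rsplit("/audio/", 1)[0]
noproxy = ["--noproxy", "*"] if cfg.get("noproxy", True) else []
cmd = ["curl", "-s", "--max-time", "5", *noproxy, f"{base}/models"]

if shutil.which("curl"):
    result = subprocess.run(cmd, capture_output=True, text=True)
    healthy = result.returncode == 0 and bool(result.stdout.strip())
    print("Service healthy" if healthy else f"HEALTH CHECK FAILED: {base}/models")
else:
    print("curl not available on this system")
```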
Read config and send via curl:

```bash
python3 -c "
import json, os, subprocess, sys, tempfile
with open('${CLAUDE_PLUGIN_DATA}/config.json') as f:
    cfg = json.load(f)
noproxy = ['--noproxy', '*'] if cfg.get('noproxy', True) else []
timeout = str(cfg.get('max_timeout', 900))
audio_file = 'AUDIO_FILE_PATH'
output_json = tempfile.mktemp(suffix='.json', prefix='asr_')

result = subprocess.run(
    # ... curl command arguments (context elided in diff) ...
)
# ... parse output_json into data (context elided) ...
if 'text' not in data:
    print(f'ERROR: {json.dumps(data)[:300]}', file=sys.stderr)
    sys.exit(1)
text = data['text']
print(f'Transcribed: {len(text)} chars', file=sys.stderr)
print(text)
os.unlink(output_json)
" > OUTPUT.txt
```

**If remote health check fails**, diagnose in order:

1. Network: `ping -c 1 HOST` or `tailscale status | grep HOST`
2. Service: `tailscale ssh USER@HOST "curl -s localhost:PORT/v1/models"`
3. Proxy: retry with `--noproxy '*'` toggled

## Step 3: Verify Output

After transcription, check for truncation — the most common failure mode:

1. Confirm output is not empty
2. Check character count is plausible (~400 chars/min for Chinese, ~200 words/min for English)
3. Check the **ending** — does it trail off mid-sentence? If so, `max_tokens` was exhausted
4. Show user the first and last ~200 characters as preview

If truncated or wrong, use **AskUserQuestion**:

```
Transcription may be truncated:
- Expected: ~[N] chars for [M] minutes of audio
- Got: [actual] chars ([pct]% of expected)
- Last line: "[last 100 chars...]"

Options:
A) Retry with higher max_tokens (current: [N], try: [N*2])
B) Switch mode — try [local/remote] instead
C) Save as-is — the output looks complete to me
D) Abort
```
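The plausibility check in the list above can be sketched as a simple heuristic. The per-minute rates come from this document; the 60% cutoff and the English chars-per-minute conversion are illustrative assumptions, not part of the skill:

```python
def truncation_suspected(n_chars: int, minutes: float, lang: str = "zh") -> bool:
    """Flag likely truncation: ~400 chars/min for Chinese; for English,
    ~200 words/min approximated as ~1000 chars/min (assumption). Output below
    60% of the expected count (assumed cutoff) is suspicious."""
    per_min = 400 if lang == "zh" else 1000
    expected = minutes * per_min
    return n_chars < 0.6 * expected

print(truncation_suspected(8000, 60))   # 8k chars for 60 min of Chinese → True
print(truncation_suspected(23000, 60))  # near the ~24k expected → False
```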

If output is good, save as `.txt` alongside the original file or to user-specified location.

## Step 4: Fallback — Overlap-Merge (Remote API Only)

If single remote request fails (timeout, OOM), fall back to chunked transcription:

```bash
python3 ${CLAUDE_PLUGIN_ROOT}/scripts/overlap_merge_transcribe.py \
  INPUT_AUDIO OUTPUT.txt
```

Splits into 18-minute chunks with 2-minute overlap, merges using punctuation-stripped fuzzy matching. See `references/overlap_merge_strategy.md` for algorithm details.

For local MLX mode, overlap-merge is unnecessary — the bundled script handles chunking internally with `max_tokens=200000`.

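The core of punctuation-stripped overlap merging can be sketched as follows. This is a deliberately simplified illustration; the bundled `overlap_merge_transcribe.py` implements the full, more robust algorithm:

```python
import re

def strip_punct(s: str) -> str:
    """Lowercase and drop everything except word characters."""
    return re.sub(r"[^\w]", "", s.lower())

def merge_chunks(a: str, b: str, probe: int = 7) -> str:
    """Find where the punctuation-stripped tail of chunk a reappears at the
    head of chunk b, then join without duplicating the overlap."""
    key = strip_punct(a)[-probe:]
    pos = strip_punct(b).find(key)
    if pos == -1:
        return a + b  # no overlap found; fall back to plain concatenation
    # Map the end of the match back to a raw index in b.
    count, raw_idx = 0, 0
    for raw_idx, ch in enumerate(b):
        if strip_punct(ch):
            count += 1
            if count == pos + len(key):
                raw_idx += 1
                break
    return a + b[raw_idx:]

print(merge_chunks("Hello, world. Nice day", "nice day! Indeed it is."))
# → Hello, world. Nice day! Indeed it is.
```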

## Step 5: Recommend Transcript Correction

ASR output always contains recognition errors — homophones, garbled technical terms, broken sentences. After successful transcription, **proactively suggest** running the `transcript-fixer` skill on the output:

```
Transcription complete: [N] chars saved to [output_path].

ASR output typically contains recognition errors (homophones, garbled terms, broken sentences).
Would you like me to run /transcript-fixer to clean up the text?

Options:
A) Yes — run transcript-fixer on the output now (Recommended)
B) No — the raw transcription is good enough for my needs
C) Later — I'll run it myself when ready
```

If the user chooses A, invoke the `transcript-fixer` skill with the output file path. The two skills form a natural pipeline: **transcribe → correct → review**.


## Reconfigure

To change the ASR endpoint, model, or proxy settings:

```bash
rm "${CLAUDE_PLUGIN_DATA}/config.json"
```

Then re-run Step 0.

## Bundled Resources

**Scripts:**
- `transcribe_local_mlx.py` — Local MLX transcription (macOS ARM64, PEP 723 deps)
- `overlap_merge_transcribe.py` — Chunked transcription with overlap merge (remote API fallback)

**References:**
- `local_mlx_guide.md` — Performance benchmarks, max_tokens truncation, model compatibility
- `overlap_merge_strategy.md` — Why naive chunking fails, fuzzy merge algorithm

`asr-transcribe-to-text/references/local_mlx_guide.md` (new file, 58 lines):

# Local MLX Transcription Guide

## Platform Requirements

- macOS on Apple Silicon (M1/M2/M3/M4/M5+)
- Python 3.10+
- `uv` package manager
- ~3GB disk for model weights (first download)

## Recommended Configuration

| Setting | Value | Why |
|---------|-------|-----|
| Model | `mlx-community/Qwen3-ASR-1.7B-8bit` | 8-bit quantized, fast inference, good quality |
| max_tokens | `200000` | Default 8192 silently truncates audio >40min |
| Audio format | WAV 16kHz mono PCM | Best compatibility with ASR models |

## Performance Benchmarks (M5 Pro 48GB, April 2026)

| Audio Length | Inference Time | Speed | Chars | Tokens |
|-------------|---------------|-------|-------|--------|
| 1 min | 3.7s | 16x realtime | 295 | ~180 |
| 5 min | 11.1s | 27x realtime | 1,633 | ~980 |
| 15 min | 50.5s | 17.8x realtime | 5,074 | ~3,045 |
| 96 min | 409s (6m48s) | 14.1x realtime | 30,018 | 18,214 |
| 123 min | 502s (8m22s) | 14.7x realtime | 40,347 | 24,337 |

Model load: ~4s (cached), ~130s (first download).

## Critical: max_tokens Truncation

The `model.generate()` method in mlx-audio defaults to `max_tokens=8192`. This is a **global budget shared across all audio chunks**, not a per-chunk limit. When exhausted, remaining chunks are silently skipped.

For 123 minutes of Chinese speech:
- Required: ~24,000 tokens
- Default budget: 8,192 tokens
- Result: only the first ~40 minutes are transcribed; the rest is silently dropped

Always pass `max_tokens=200000` for any audio longer than 20 minutes.
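The budget arithmetic above generalizes to a quick estimator. The ~200 tokens/min rate is derived from the benchmark table (24,337 tokens for 123 minutes); the 1.5x safety margin is an illustrative assumption:

```python
def min_max_tokens(minutes: float, tokens_per_min: float = 200.0,
                   margin: float = 1.5) -> int:
    """Estimate a safe max_tokens budget from audio length, using the
    ~200 tokens/min rate observed in the benchmarks above."""
    return int(minutes * tokens_per_min * margin)

print(min_max_tokens(123))  # → 36900, comfortably above the ~24,000 actually needed
```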

## Model Weight Compatibility

Two MLX packages exist for Qwen3-ASR. Their weight formats are **incompatible**:

| Package | Use with | Weight Format |
|---------|----------|--------------|
| `mlx-audio` (Blaizzy) | `mlx-community/Qwen3-ASR-1.7B-8bit` | mlx-audio quantization (audio_tower quantized) |
| `mlx-qwen3-asr` (moona3k) | `Qwen/Qwen3-ASR-1.7B` | Own loader (audio_tower NOT quantized) |

Crossing these produces a "Missing 297 parameters" error. This skill uses `mlx-audio`.

## Alternatives Not Recommended

| Approach | Issue |
|----------|-------|
| PyTorch MPS (qwen-asr package) | 97.77% of time in GPU↔CPU sync, RTF 5.5-24.5x |
| whisper.cpp large-v3-turbo | High Chinese error rate |
| Official qwen-asr on macOS | Designed for CUDA only |

`asr-transcribe-to-text/scripts/transcribe_local_mlx.py` (new file, 80 lines):

```python
# /// script
# requires-python = ">=3.10"
# dependencies = ["mlx-audio>=0.3.1"]
# ///
"""
Local ASR transcription using mlx-audio + Qwen3-ASR on Apple Silicon.

Usage:
    uv run scripts/transcribe_local_mlx.py INPUT_AUDIO [INPUT_AUDIO2 ...] [--output-dir DIR]

CRITICAL: max_tokens defaults to 200000. The upstream mlx-audio default (8192)
silently truncates audio longer than ~40 minutes. This was discovered empirically:
123 minutes of Chinese speech requires ~24,000 tokens. 8192 only covers the first
~40 minutes before the token budget is exhausted and remaining chunks are skipped.
"""

import argparse
import os
import platform
import sys
import time


def check_platform():
    if sys.platform != "darwin" or platform.machine() not in ("arm64", "aarch64"):
        print("ERROR: Local MLX transcription requires macOS on Apple Silicon (M1+).", file=sys.stderr)
        print("Use the remote API mode instead.", file=sys.stderr)
        sys.exit(1)


def main():
    parser = argparse.ArgumentParser(description="Transcribe audio/video using local MLX Qwen3-ASR")
    parser.add_argument("inputs", nargs="+", help="Audio/video file paths")
    parser.add_argument("--output-dir", default=None, help="Output directory (default: same as input)")
    parser.add_argument("--model", default="mlx-community/Qwen3-ASR-1.7B-8bit",
                        help="HuggingFace model ID (default: mlx-community/Qwen3-ASR-1.7B-8bit)")
    parser.add_argument("--max-tokens", type=int, default=200000,
                        help="Max tokens for generation (default: 200000, covers ~3 hours of speech)")
    args = parser.parse_args()

    check_platform()

    from mlx_audio.stt.generate import load_model

    print(f"Loading model {args.model}...", file=sys.stderr, flush=True)
    t0 = time.time()
    model = load_model(args.model)
    load_time = time.time() - t0
    print(f"Model loaded in {load_time:.1f}s", file=sys.stderr, flush=True)

    for audio_path in args.inputs:
        if not os.path.exists(audio_path):
            print(f"SKIP: {audio_path} not found", file=sys.stderr)
            continue

        name = os.path.splitext(os.path.basename(audio_path))[0]
        out_dir = args.output_dir or os.path.dirname(audio_path) or "."
        output_path = os.path.join(out_dir, f"{name}.txt")

        print(f"\nTranscribing: {os.path.basename(audio_path)}", file=sys.stderr, flush=True)
        t1 = time.time()

        result = model.generate(audio_path, max_tokens=args.max_tokens, verbose=True)

        elapsed = time.time() - t1
        text = result.text if hasattr(result, "text") else str(result)
        gen_tokens = result.generation_tokens if hasattr(result, "generation_tokens") else "N/A"

        with open(output_path, "w", encoding="utf-8") as f:
            f.write(text)

        print(f"Done: {elapsed:.1f}s, {len(text)} chars, {gen_tokens} tokens → {output_path}",
              file=sys.stderr, flush=True)

    total = time.time() - t0
    print(f"\nAll done. Total: {total:.1f}s", file=sys.stderr, flush=True)


if __name__ == "__main__":
    main()
```

---

In `claude-code-history-files-finder`:

```bash
grep -i "api_key\|password\|token" recovered_content/*
```

### Safe Storage

Recovered content inherits sensitivity from original sessions. Store securely and follow organizational policies for handling session data.

## Next Step: Resume Interrupted Work

After finding relevant session history, suggest continuing the work:

```
Found [N] relevant sessions with recoverable context.

Options:
A) Resume work — run /continue-claude-work to pick up where you left off (Recommended)
B) Just show me the content — I'll decide what to do with it
```

---

In `deep-research`:

Report: `[P7 complete] {N} spot-checks, {M} violations fixed.`

- **Skipping counter-review** — mandatory P6 must find ≥3 issues
- **CIRCULAR VERIFICATION** — never use user's private data to "discover" what they already know about themselves
- **IGNORING EXCLUSIVE SOURCES** — when user provides Crunchbase Pro etc. for competitor research, USE IT

## Next Step: Verify and Deliver

After completing research, suggest verification and output:

```
Research report complete: [N] sources cited, [M] claims made.

Options:
A) Verify facts — run /fact-checker on the report (Recommended)
B) Create slides — run /ppt-creator from the findings
C) Export as PDF — run /pdf-creator for formal delivery
D) No thanks — the report is ready as-is
```

---

In `doc-to-markdown`:

```bash
brew install pandoc
```

- `references/heavy-mode-guide.md` - Detailed Heavy Mode documentation
- `references/tool-comparison.md` - Tool capabilities comparison
- `references/conversion-examples.md` - Batch operation examples

## Next Step: Clean Up Converted Content

After converting documents to markdown, suggest cleanup:

```
Conversion complete: [N] files converted to markdown.

Options:
A) Clean up docs — run /docs-cleaner to consolidate redundant content (Recommended if multiple files)
B) Check facts — run /fact-checker to verify claims in the converted content
C) No thanks — the markdown conversion is sufficient
```

---

In `fact-checker`:

Before completing fact-check:

- Note the limitation in the report
- Suggest qualification language
- Recommend user research or expert consultation

## Next Step: Export Verified Content

After fact-checking, suggest exporting the verified document:

```
Fact-check complete: [N] claims verified, [M] corrections proposed.

Options:
A) Export as PDF — run /pdf-creator (Recommended for formal documents)
B) Create slides — run /ppt-creator from verified content
C) No thanks — I'll use the corrected document directly
```

---

In `meeting-minutes-taker`:

Round 4: Update to use "Annotation" instead of "Note"

- ❌ **Creating non-existent links** - Do NOT create markdown links to files that don't exist in the repo (e.g., `[doc.md](reviewed-document)`); use plain text for external/local documents not in the repository
- ❌ **Losing content during consolidation** - When moving or consolidating sections, verify ALL bullet points and details are preserved; never summarize away specific details like "supports batch operations" or "button triggers auto-save"
- ❌ **Appending domain details to role titles** - Use ONLY the Role column from Team Directory for speaker attribution (e.g., "Backend", "Frontend", "TPM"); do NOT append specializations like "Backend, Infrastructure" or "Backend, Business Logic" - all team members with the same role should have identical attribution

## Next Step: Export to Deliverable Format

After structuring meeting minutes, suggest exporting:

```
Meeting minutes complete: [N] decisions, [M] action items captured.

Options:
A) Export as PDF — run /pdf-creator (Recommended for sharing)
B) Export as slides — run /ppt-creator (for presentation)
C) No thanks — the markdown is sufficient
```

---

In `skill-creator` (Skill Writing Guide):

competitors-analysis (fork, specialist)

3. Each skill has a single responsibility — don't mix orchestration with execution
4. Share methodology via references (e.g., checklists, templates), not by duplicating code

##### Pipeline Handoff (Sequential Skill Chaining)

Beyond orchestrator/specialist composition, skills often form **sequential pipelines** where one skill's output is the next skill's input. Each skill should proactively suggest the logical next step after completing its work.

**Pattern: "Next Step" section at the end of SKILL.md**

```markdown
## Next Step: [Action Description]

After [this skill completes], suggest the natural next skill:

\```
[Summary of what was just accomplished].

Options:
A) [Next skill] — [one-line reason] (Recommended)
B) [Alternative skill] — [when this is better]
C) No thanks — [the current output is sufficient]
\```
```

**Real-world pipeline examples:**

```
youtube-downloader → asr-transcribe-to-text → transcript-fixer → meeting-minutes-taker → pdf-creator
deep-research → fact-checker → ppt-creator
doc-to-markdown → docs-cleaner
claude-code-history-files-finder → continue-claude-work
```

**Rules for pipeline handoff:**
1. Every handoff is **opt-in** via AskUserQuestion — never auto-invoke the next skill without asking
2. Suggest only when the output naturally feeds into another skill — don't force connections
3. Include a "No thanks" option — the user may not need the full pipeline
4. The suggestion should explain **why** the next step helps (e.g., "ASR output typically contains recognition errors")
5. Keep it to 1-2 recommendations max — too many choices cause decision fatigue

**When to add a handoff:** Ask "does this skill's output commonly become another skill's input?" If yes, add a "Next Step" section. If the connection is rare or forced, don't add one.

**Anti-pattern:** Chaining skills that don't share a natural data flow. `pdf-creator → youtube-downloader` makes no sense. The pipeline must follow the user's actual workflow.

##### Auto-Detection Over Manual Flags

**Never add manual flags for capabilities that can be auto-detected.** Instead of requiring users to pass `--with-codex` or `--verbose`, detect capabilities at runtime:
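For instance, a runtime probe can replace the `--with-codex` flag mentioned above. This is an illustrative sketch (the tool name `codex` comes from the flag example; the branching logic is hypothetical):

```python
import shutil

def codex_available() -> bool:
    """Detect the optional tool on PATH instead of asking for a flag."""
    return shutil.which("codex") is not None

# Branch on the detected capability rather than a user-supplied option.
mode = "with-codex" if codex_available() else "claude-only"
print(f"Detected mode: {mode}")
```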
When editing, remember that the skill is being created for another instance of Claude.

**When updating an existing skill**: Scan all existing reference files to check if they need corresponding updates.

**Pipeline check**: Consider whether this skill's output naturally feeds into another skill. If so, add a "Next Step" handoff section (see "Pipeline Handoff" in the Skill Writing Guide). Also check if any existing skill should chain *into* this one.

### Step 5: Sanitization Review (Optional)

Use **AskUserQuestion** before executing this step:

---

In `transcript-fixer` (SKILL.md optimized from 289 to 153 lines):

---
name: transcript-fixer
description: Corrects speech-to-text transcription errors using dictionary rules and AI-powered analysis. Builds personalized correction databases that learn from each fix. Triggers when working with ASR/STT output containing recognition errors, homophones, garbled technical terms, or Chinese/English mixed content. Also triggers on requests to clean up meeting notes, lecture transcripts, interview recordings, or any text produced by speech recognition. Use this skill even when the user just says "fix this transcript" or "clean up these meeting notes" without mentioning ASR specifically.
---

# Transcript Fixer

Two-phase correction pipeline: deterministic dictionary rules (instant, free) followed by AI-powered error detection. Corrections accumulate in `~/.transcript-fixer/corrections.db`, improving accuracy over time.

## Prerequisites

All scripts use PEP 723 inline metadata — `uv run` auto-installs dependencies. Requires `uv` ([install guide](https://docs.astral.sh/uv/getting-started/installation/)). Never invoke system Python directly.

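For reference, a PEP 723 header at the top of a script looks like this (the dependency shown is illustrative); `uv run` reads the block and installs what's listed before executing:

```python
# /// script
# requires-python = ">=3.10"
# dependencies = ["httpx"]
# ///
```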
## Quick Start

**Default: Native AI Correction (no API key needed)**

When invoked from Claude Code, the skill uses a two-phase approach:
1. **Dictionary phase** (script): Apply 700+ learned correction rules instantly
2. **AI phase** (Claude native): Claude reads the text directly and fixes ASR errors, adds paragraph breaks, removes filler words

```bash
# First time: Initialize database
uv run scripts/fix_transcription.py --init

uv run scripts/fix_transcription.py --input meeting.md --stage 1
```

After Stage 1, Claude reads the output and fixes remaining ASR errors natively (no API key needed):
1. Read Stage 1 output in ~200-line chunks using the Read tool
2. Identify ASR errors — homophones, garbled terms, broken sentences
3. Present corrections in a table for user review before applying
4. Save stable patterns to dictionary for future reuse

See `references/example_session.md` for a concrete input/output walkthrough.

**Alternative: API batch processing** (for automation without Claude Code):
```bash
# Set API key for automated AI corrections
export GLM_API_KEY="<api-key>" # From https://open.bigmodel.cn/

# Run full pipeline (dict + API AI + diff report)
uv run scripts/fix_transcript_enhanced.py input.md --output ./corrected
```

## Core Workflow

Two-phase pipeline with persistent learning:

1. **Initialize** (once): `uv run scripts/fix_transcription.py --init`
2. **Add domain corrections**: `--add "错误词" "正确词" --domain <domain>`
3. **Phase 1 — Dictionary**: `--input file.md --stage 1` (instant, free)
4. **Phase 2 — AI Correction**: Claude reads output and fixes errors natively, or `--stage 3` with `GLM_API_KEY` for API mode
5. **Save stable patterns**: `--add "错误词" "正确词"` after each session
6. **Review learned patterns**: `--review-learned` and `--approve` high-confidence suggestions

**Domains**: `general`, `embodied_ai`, `finance`, `medical`, or custom (e.g., `火星加速器`)
**Learning**: Patterns appearing ≥3 times at ≥80% confidence auto-promote from AI to dictionary

**After fixing, always save reusable corrections to dictionary.** This is the skill's core value — see `references/iteration_workflow.md` for the complete checklist.

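The promotion rule above is simple enough to state as code (a sketch of the threshold logic only, not the skill's actual implementation):

```python
def should_promote(count: int, confidence: float) -> bool:
    # A learned AI correction graduates to the dictionary once it has
    # appeared at least 3 times with at least 80% confidence.
    return count >= 3 and confidence >= 0.80

assert should_promote(3, 0.80)
assert not should_promote(2, 0.95)   # too few occurrences
assert not should_promote(5, 0.79)   # confidence too low
```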
## False Positive Prevention

Adding wrong dictionary rules silently corrupts future transcripts. **Read `references/false_positive_guide.md` before adding any correction rule**, especially for short words (≤2 chars) or common Chinese words that appear correctly in normal text.

## Native AI Correction (Default Mode)

When running inside Claude Code, use Claude's own language understanding for Phase 2:

1. Run Stage 1 (dictionary): `--input file.md --stage 1`
2. Verify Stage 1 — diff original vs output. If dictionary introduced false positives, work from the **original** file
3. Read the full text in ~200-line chunks. Read the entire transcript before proposing corrections — later context often disambiguates earlier errors
4. Identify ASR errors:
   - Product/tool names: "close code" → "Claude Code", "get Hub" → "GitHub"
   - Technical terms: "Web coding" → "Vibe Coding", "happy pass" → "happy path"
   - Homophone errors: "上海文" → "上下文", "分值" → "分支"
   - English ASR garbling: "Pre top" → "prototype", "rapper" → "repo"
   - Broken sentences: "很大程。路上" → "很大程度上"
5. Present corrections in high/medium confidence tables with line numbers
6. Apply with sed on a copy, verify with diff, replace original
7. Generate word diff: `uv run scripts/generate_word_diff.py original.md corrected.md diff.html`
8. Save stable patterns to dictionary
9. Remove false positives if Stage 1 had any

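Step 6's copy-apply-verify loop might look like the following (the file contents and the correction are illustrative; GNU sed syntax shown, BSD/macOS sed needs `-i ''`):

```bash
printf '检查一下上海文窗口\n' > meeting_stage1.md       # sample Stage 1 output (made up)
cp meeting_stage1.md meeting_corrected.md              # never edit the original
sed -i 's/上海文/上下文/g' meeting_corrected.md        # apply one confirmed fix
diff meeting_stage1.md meeting_corrected.md || true    # review before replacing the original
```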
### Enhanced Capabilities (Native Mode Only)

- **Intelligent paragraph breaks**: Add `\n\n` at logical topic transitions
- **Filler word reduction**: "这个这个这个" → "这个"
- **Interactive review**: Corrections confirmed before applying
- **Context-aware judgment**: Full document context resolves ambiguous errors

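Filler-word reduction is essentially a backreference replace; a minimal sketch (the filler list is illustrative, and native mode applies judgment rather than a fixed regex):

```python
import re

def collapse_repeats(text: str) -> str:
    # Collapse a filler repeated two or more times in a row into one,
    # e.g. "这个这个这个" -> "这个", "都都都都" -> "都"
    return re.sub(r"(这个|就是|那个|都)\1+", r"\1", text)

assert collapse_repeats("这个这个这个方案都都都可以") == "这个方案都可以"
```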
### When to Use API Mode Instead

Use `GLM_API_KEY` + Stage 3 for batch processing, standalone usage without Claude Code, or reproducible automated processing.

### Legacy Fallback

When the script outputs `[CLAUDE_FALLBACK]` (GLM API error), switch to native mode automatically.

## Utility Scripts

**Timestamp repair**:
```bash
uv run scripts/fix_transcript_timestamps.py meeting.txt --in-place
```

**Split transcript into sections** (rebase each to `00:00:00`):
```bash
uv run scripts/split_transcript_sections.py meeting.txt \
    --first-section-name "课前聊天" \
    --section "正式上课::好,无缝切换嘛。" \
    --section "课后复盘::我们复盘一下。" \
    --rebase-to-zero
```

**Word-level diff** (recommended for reviewing corrections):
```bash
uv run scripts/generate_word_diff.py original.md corrected.md output.html
```

This creates an HTML file showing word-by-word differences with clear highlighting:
- 🔴 `japanese 3 pro` → 🟢 `Gemini 3 Pro` (complete word replacements)
- Easy to spot exactly what changed without character-level noise

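Conceptually the word-level diff is a sequence match over tokens; this standalone sketch illustrates the idea (it is not the bundled `generate_word_diff.py`):

```python
import difflib

def word_replacements(old: str, new: str) -> list[str]:
    # Compare token lists and report only the replaced spans
    sm = difflib.SequenceMatcher(a=old.split(), b=new.split())
    return [
        f"{' '.join(sm.a[i1:i2])} -> {' '.join(sm.b[j1:j2])}"
        for op, i1, i2, j1, j2 in sm.get_opcodes()
        if op == "replace"
    ]

print(word_replacements("japanese 3 pro is great", "Gemini 3 Pro is great"))
# -> ['japanese -> Gemini', 'pro -> Pro']
```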
## Output Files

- `*_stage1.md` — Dictionary corrections applied
- `*_corrected.txt` — Final version (native mode) or `*_stage2.md` (API mode)
- `*_对比.html` — Visual diff (open in browser)

## Database Operations

**Read `references/database_schema.md` before any database operations.**

Quick reference:
```bash
# View all corrections
sqlite3 ~/.transcript-fixer/corrections.db "SELECT * FROM active_corrections;"

# Check schema version
sqlite3 ~/.transcript-fixer/corrections.db "SELECT value FROM system_config WHERE key='schema_version';"
```

@@ -247,26 +135,34 @@ sqlite3 ~/.transcript-fixer/corrections.db "SELECT value FROM system_config WHER

## Bundled Resources

**Scripts:**
- `fix_transcription.py` — Core CLI (dictionary, add, audit, learning)
- `fix_transcript_enhanced.py` — Enhanced wrapper for interactive use
- `fix_transcript_timestamps.py` — Timestamp normalization and repair
- `generate_word_diff.py` — Word-level diff HTML generation
- `split_transcript_sections.py` — Split transcript by marker phrases

**References** (load as needed):
- **Safety**: `false_positive_guide.md` (read before adding rules), `database_schema.md` (read before DB ops)
- **Workflow**: `iteration_workflow.md`, `workflow_guide.md`, `example_session.md`
- **CLI**: `quick_reference.md`, `script_parameters.md`
- **Advanced**: `dictionary_guide.md`, `sql_queries.md`, `architecture.md`, `best_practices.md`
- **Operations**: `troubleshooting.md`, `installation_setup.md`, `glm_api_setup.md`, `team_collaboration.md`

## Troubleshooting

`uv run scripts/fix_transcription.py --validate` checks setup health. See `references/troubleshooting.md` for detailed resolution and `references/glm_api_setup.md` for API configuration.

## Next Step: Structure into Meeting Minutes

After correcting a transcript, if the content is from a meeting, lecture, or interview, suggest structuring it:

```
Transcript corrected: [N] errors fixed, saved to [output_path].

Want to turn this into structured meeting minutes with decisions and action items?

Options:
A) Yes — run /meeting-minutes-taker (Recommended for meetings/lectures)
B) Export as PDF — run /pdf-creator on the corrected text
C) No thanks — the corrected transcript is all I need
```

transcript-fixer/references/example_session.md (new file)
@@ -0,0 +1,25 @@
# Example Session

## Input transcript (`meeting.md`)
```
今天我们讨论了巨升智能的最新进展。
股价系统需要优化,目前性能不够好。
```

## After Stage 1 (`meeting_stage1.md`)
```
今天我们讨论了具身智能的最新进展。 ← "巨升"→"具身" corrected
股价系统需要优化,目前性能不够好。 ← Unchanged (not in dictionary)
```

## After Stage 2 (`meeting_stage2.md`)
```
今天我们讨论了具身智能的最新进展。
框架系统需要优化,目前性能不够好。 ← "股价"→"框架" corrected by AI
```

## Learned pattern detected
```
✓ Detected: "股价" → "框架" (confidence: 85%, count: 1)
  Run --review-learned after 2 more occurrences to approve
```

transcript-fixer/references/false_positive_guide.md (new file)
@@ -0,0 +1,39 @@
# False Positive Prevention Guide

Dictionary-based corrections are powerful but dangerous. Adding the wrong rule silently corrupts every future transcript. The `--add` command runs safety checks automatically, but you must understand the risks.

## What is safe to add

- **ASR-specific gibberish**: "巨升智能" -> "具身智能" (no real word sounds like "巨升智能")
- **Long compound errors**: "语音是别" -> "语音识别" (4+ chars, unlikely to collide)
- **English transliteration errors**: "japanese 3 pro" -> "Gemini 3 Pro"

## What is NEVER safe to add

- **Common Chinese words**: "仿佛", "正面", "犹豫", "传说", "增加", "教育" -- these appear correctly in normal text. Replacing them corrupts transcripts from better ASR models.
- **Words <=2 characters**: Almost any 2-char Chinese string is a valid word or part of one. "线数" inside "产线数据" becomes "产线束据".
- **Both sides are real words**: "仿佛->反复", "犹豫->抑郁" -- both forms are valid Chinese. The "error" is only an error for one specific ASR model.

## When in doubt, use a context rule instead

Context rules use regex patterns that match only in specific surroundings, avoiding false positives:
```bash
# Instead of: --add "线数" "线束"
# Use a context rule in the database:
sqlite3 ~/.transcript-fixer/corrections.db "INSERT INTO context_rules (pattern, replacement, description, priority) VALUES ('(?<!产)线数(?!据)', '线束', 'ASR: 线数->线束 (not inside 产线数据)', 10);"
```
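Since the correction scripts are Python, a lookaround pattern like this can be sanity-checked with Python's `re` before inserting it (assuming the rules are applied with Python-compatible regex):

```python
import re

pattern = re.compile(r"(?<!产)线数(?!据)")

# Replaces the standalone ASR error...
assert pattern.sub("线束", "检查线数接头") == "检查线束接头"
# ...but leaves the legitimate compound 产线数据 untouched
assert pattern.sub("线束", "导出产线数据") == "导出产线数据"
```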

## Auditing the dictionary

Run `--audit` periodically to scan all rules for false positive risks:
```bash
uv run scripts/fix_transcription.py --audit
uv run scripts/fix_transcription.py --audit --domain manufacturing
```

## Forcing a risky addition

If you understand the risks and still want to add a flagged rule:
```bash
uv run scripts/fix_transcription.py --add "仿佛" "反复" --domain general --force
```
@@ -492,3 +492,17 @@ ffmpeg -i video.mp4 -i audio.m4a -c copy merged.mp4
- **PO Token Setup**: See `references/po-token-setup.md` for detailed installation and troubleshooting
- **yt-dlp Documentation**: https://github.com/yt-dlp/yt-dlp
- **Format Selection Guide**: https://github.com/yt-dlp/yt-dlp#format-selection

## Next Step: Transcribe Downloaded Audio/Video

After downloading, if the user's goal involves getting text from the video (transcription, subtitles, meeting notes), proactively suggest:

```
Download complete: [filename]

If you need the spoken content as text, I can transcribe it for you.

Options:
A) Transcribe with /asr-transcribe-to-text (Recommended for speech-to-text)
B) No thanks — I just needed the video file
```