feat: add asr-transcribe-to-text skill + optimize skill-creator with AskUserQuestion

New skill: asr-transcribe-to-text (v1.0.0)
- Transcribe audio/video via configurable ASR endpoint (Qwen3-ASR default)
- Persistent config in CLAUDE_PLUGIN_DATA (endpoint, model, proxy bypass)
- Single-request-first strategy (empirically proven: 55min in one request)
- Fallback overlap-merge script for very long audio (18min chunks, 2min overlap)
- AskUserQuestion at config init, health check failure, and output verification

skill-creator optimization (v1.5.1 → v1.6.0)
- Add AskUserQuestion best practices section (Re-ground/Simplify/Recommend/Options)
- Inject structured decision points at 8 key workflow stages
- Inspired by gstack's atomic question pattern

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Author: daymade
Date: 2026-03-23 00:03:45 +08:00
Parent: 639a6d303e
Commit: ee38ae41b8
4 changed files with 655 additions and 5 deletions


@@ -0,0 +1,197 @@
---
name: asr-transcribe-to-text
description: Transcribe audio and video files to text using a remote ASR service (Qwen3-ASR or OpenAI-compatible endpoint). Extracts audio from video, sends to configurable ASR endpoint, outputs clean text. Use when the user wants to transcribe recordings, convert audio/video to text, do speech-to-text, or mentions ASR, Qwen ASR, 转录, 语音转文字, 录音转文字, or has a meeting recording, lecture, interview, or screen recording to transcribe.
argument-hint: [audio-or-video-file-path]
---
# ASR Transcribe to Text
Transcribe audio/video files to text using a configurable ASR endpoint (default: Qwen3-ASR-1.7B via vLLM). Configuration persists across sessions in `${CLAUDE_PLUGIN_DATA}/config.json`.
## Step 0: Load or Initialize Configuration
```bash
cat "${CLAUDE_PLUGIN_DATA}/config.json" 2>/dev/null
```
**If config exists**, read the values and proceed to Step 1.
**If config does not exist** (first run), use **AskUserQuestion**:
```
First-time setup for ASR transcription.
I need to know where your ASR service is running so I can send audio to it.
RECOMMENDATION: Use the defaults below if you have Qwen3-ASR on a 4090 via Tailscale.
Q1: ASR Endpoint URL?
A) http://workstation-4090-wsl:8002/v1/audio/transcriptions (Default — Qwen3-ASR vLLM via Tailscale)
B) http://localhost:8002/v1/audio/transcriptions (Local machine)
C) Let me enter a custom URL
Q2: Does your network have an HTTP proxy that might intercept LAN/Tailscale traffic?
A) Yes — add --noproxy to bypass it (Recommended if you use Shadowrocket/Clash/corporate proxy)
B) No — direct connection is fine
```
Save the config:
```bash
mkdir -p "${CLAUDE_PLUGIN_DATA}"
python3 -c "
import json
config = {
'endpoint': 'USER_PROVIDED_ENDPOINT',
'model': 'USER_PROVIDED_MODEL_OR_DEFAULT',
'noproxy': True, # or False based on user answer
'max_timeout': 900
}
with open('${CLAUDE_PLUGIN_DATA}/config.json', 'w') as f:
json.dump(config, f, indent=2)
print('Config saved.')
"
```
## Step 1: Validate Input and Check Service Health
Read config and health-check in a single command (shell variables don't persist across Bash calls):
```bash
python3 -c "
import json, subprocess, sys
with open('${CLAUDE_PLUGIN_DATA}/config.json') as f:
cfg = json.load(f)
base = cfg['endpoint'].rsplit('/audio/', 1)[0]
noproxy = ['--noproxy', '*'] if cfg.get('noproxy', True) else []
result = subprocess.run(
['curl', '-s', '--max-time', '10'] + noproxy + [f'{base}/models'],
capture_output=True, text=True
)
if result.returncode != 0 or not result.stdout.strip():
print(f'HEALTH CHECK FAILED', file=sys.stderr)
print(f'Endpoint: {base}/models', file=sys.stderr)
print(f'stdout: {result.stdout[:200]}', file=sys.stderr)
print(f'stderr: {result.stderr[:200]}', file=sys.stderr)
sys.exit(1)
else:
print(f'Service healthy: {base}')
print(f'Model: {cfg[\"model\"]}')
"
```
**If health check fails**, use **AskUserQuestion**:
```
ASR service at [endpoint] is not responding.
Options:
A) Diagnose — check network, Tailscale, and service status step by step
B) Reconfigure — the endpoint URL might be wrong, let me re-enter it
C) Try anyway — send the transcription request and see what happens
D) Abort — I'll fix the service manually and come back later
```
For option A, diagnose in order:
1. Network: `ping -c 1 HOST` or `tailscale status | grep HOST`
2. Service: `tailscale ssh USER@HOST "curl -s localhost:PORT/v1/models"`
3. Proxy: retry with `--noproxy '*'` toggled
## Step 2: Extract Audio (if input is video)
For video files (mp4, mov, mkv, avi, webm), extract audio as 16kHz mono MP3:
```bash
ffmpeg -i INPUT_VIDEO -vn -acodec libmp3lame -q:a 4 -ar 16000 -ac 1 OUTPUT.mp3 -y
```
For audio files (mp3, wav, m4a, flac, ogg), use directly — no conversion needed.
Get duration for progress estimation:
```bash
ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 INPUT_FILE
```
## Step 3: Transcribe — Single Request First
**Always try the full-length single request first.** Chunking truncates a sentence at every split boundary: the model forces the chunk's last sentence to close and loses words. A single request means zero truncation and the fastest turnaround.
The Qwen3-ASR paper's "20-minute limit" is a training benchmark, not a hard inference limit. Empirically verified: 55 minutes of audio transcribed in a single 76-second request on a 4090 (24 GB VRAM).
```bash
python3 -c "
import json, subprocess, sys, os, tempfile
with open('${CLAUDE_PLUGIN_DATA}/config.json') as f:
cfg = json.load(f)
noproxy = ['--noproxy', '*'] if cfg.get('noproxy', True) else []
timeout = str(cfg.get('max_timeout', 900))
audio_file = 'AUDIO_FILE_PATH' # replace with actual path
fd, output_json = tempfile.mkstemp(suffix='.json', prefix='asr_')
os.close(fd)  # curl rewrites this file; mkstemp avoids the race in deprecated mktemp
result = subprocess.run(
['curl', '-s', '--max-time', timeout] + noproxy + [
cfg['endpoint'],
'-F', f'file=@{audio_file}',
'-F', f'model={cfg[\"model\"]}',
'-o', output_json
], capture_output=True, text=True
)
with open(output_json) as f:
data = json.load(f)
if 'text' not in data:
print(f'ERROR: {json.dumps(data)[:300]}', file=sys.stderr)
sys.exit(1)
text = data['text']
duration = data.get('usage', {}).get('seconds', 0)
print(f'Transcribed: {len(text)} chars, {duration}s audio', file=sys.stderr)
print(text)
os.unlink(output_json)
" > OUTPUT.txt
```
**Performance reference**: ~400 characters per minute for Chinese speech; rates vary by language. Qwen3-ASR supports 52 languages including Chinese dialects, English, Japanese, Korean, and more.
## Step 4: Verify and Confirm Output
After transcription, verify quality:
1. Confirm the response contains a `text` field (not an error message)
2. Check character count is plausible for the audio duration (~400 chars/min for Chinese)
3. Show the user the first ~200 characters as a preview
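The plausibility check in step 2 can be sketched as a small helper (the 400 chars/min rate and the ±50% tolerance are assumptions, not measured bounds):

```python
def plausible(char_count: int, duration_sec: float,
              rate_per_min: float = 400, tolerance: float = 0.5) -> bool:
    """Rough sanity check: is the transcript length within +/-50% of the
    expected rate (~400 chars/min for Chinese speech)?"""
    expected = duration_sec / 60 * rate_per_min
    return expected * (1 - tolerance) <= char_count <= expected * (1 + tolerance)

print(plausible(22889, 55 * 60))  # 55-minute recording, ~22.9k chars -> True
print(plausible(1200, 55 * 60))   # far too short for 55 minutes -> False
```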
If the output looks wrong (empty, garbled, or error), use **AskUserQuestion**:
```
Transcription may have an issue:
- Expected: ~[N] chars for [M] minutes of audio
- Got: [actual chars] chars
- Preview: "[first 100 chars...]"
Options:
A) Save as-is — the output looks fine to me
B) Retry with fallback — split into chunks and merge (handles long audio / OOM)
C) Reconfigure — try a different model or endpoint
D) Abort — something is wrong with the service
```
If output is good, save as `.txt` alongside the original file or to user-specified location.
## Step 5: Fallback — Overlap-Merge for Very Long Audio
If single request fails (timeout, OOM, HTTP error), fall back to chunked transcription with overlap merging:
```bash
python3 ${CLAUDE_PLUGIN_ROOT}/scripts/overlap_merge_transcribe.py \
--config "${CLAUDE_PLUGIN_DATA}/config.json" \
INPUT_AUDIO OUTPUT.txt
```
This splits into 18-minute chunks with 2-minute overlap, then merges using punctuation-stripped fuzzy matching. See [references/overlap_merge_strategy.md](references/overlap_merge_strategy.md) for the algorithm details.
## Reconfigure
To change the ASR endpoint, model, or proxy settings:
```bash
rm "${CLAUDE_PLUGIN_DATA}/config.json"
```
Then re-run Step 0 to collect new values via AskUserQuestion.


@@ -0,0 +1,56 @@
# Overlap-Merge Strategy: Why and How
## The Problem with Naive Chunking
When ASR transcribes audio in chunks, each chunk's last sentence gets **forcibly truncated**. The model closes the sentence at the chunk boundary even if the speaker is mid-sentence.
**Real example from testing (5-minute chunks):**
| Version | Text at boundary |
|---------|-----------------|
| 5min chunk ending | "...靠的就。" (truncated) |
| Continuous 10min | "...靠的就是其中一次战略会。" (complete) |
| Next 5min chunk start | "如果不这么啃,这个业务..." (picks up but gap exists) |
Concatenating these chunks produces: "靠的就。如果不这么啃..." — losing "是其中一次战略会" entirely.
## Why Exact String Matching Fails
The same 2-minute audio segment transcribed in two different contexts (end of chunk A vs start of chunk B) produces **different punctuation**:
- Chunk A: "可能三年AI的进化"
- Chunk B: "可能。三年AI的进化"
The words are identical, but punctuation differs because the model's sentence boundary decisions depend on surrounding context. Exact string matching finds zero overlap.
## The Solution: Punctuation-Stripped Fuzzy Matching
1. Strip all punctuation from both the tail of chunk A and the head of chunk B
2. Find the longest common substring in the stripped versions
3. Use chunk B's version at the merge point (because chunk A's ending is truncated while chunk B has the complete sentence)
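Steps 1 and 2 can be sketched with `difflib` (illustrative only: the bundled script uses its own search loop, and `min_match=5` here is smaller than the production value of 15):

```python
import re
from difflib import SequenceMatcher

def strip_punct(text: str) -> str:
    # Keep word characters and CJK ideographs; drop punctuation and spaces.
    return re.sub(r'[^\w\u4e00-\u9fff]', '', text)

def find_overlap(tail_a: str, head_b: str, min_match: int = 5) -> str:
    """Longest common substring of the punctuation-stripped texts."""
    a, b = strip_punct(tail_a), strip_punct(head_b)
    m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size] if m.size >= min_match else ""

# The same audio transcribed in two contexts, with different punctuation:
# exact string matching finds nothing, stripped matching finds the full overlap.
print(find_overlap("可能三年AI的进化", "可能。三年AI的进化"))  # -> 可能三年AI的进化
```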
## Optimal Parameters (Empirically Determined)
| Parameter | Value | Rationale |
|-----------|-------|-----------|
| Chunk duration | 18 min (1080s) | Safe margin under 20min paper benchmark; 4090 24GB handles much more |
| Overlap duration | 2 min (120s) | ~800 chars overlap region; enough for reliable fuzzy matching |
| Min match length | 15 chars | Filters false positives while catching real overlaps |
| Search window | 600 chars | Covers the overlap region with margin |
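A simplified sketch of how these parameters carve up the 55-minute test recording (the bundled script's end-of-file handling differs slightly):

```python
def plan_chunks(total_sec: int, chunk_sec: int = 1080, overlap_sec: int = 120):
    """Plan (start, duration) pairs: each chunk starts chunk_sec - overlap_sec
    after the previous one, so consecutive chunks share overlap_sec of audio."""
    plan, start = [], 0
    while start < total_sec:
        dur = min(chunk_sec, total_sec - start)
        plan.append((start, dur))
        if start + dur >= total_sec:
            break
        start += chunk_sec - overlap_sec
    return plan

print(plan_chunks(55 * 60))  # -> [(0, 1080), (960, 1080), (1920, 1080), (2880, 420)]
```

Consecutive chunks overlap by 120 s (chunk 0 ends at 1080 s, chunk 1 starts at 960 s), which yields the roughly 800-character overlap region at ~400 chars/min.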
## When to Use This Fallback
Only use overlap-merge when the single full-length request fails. Reasons it might fail:
- Audio longer than ~2 hours (untested territory, may OOM on 24GB VRAM)
- GPU memory pressure from other processes
- Network timeout (curl max-time exceeded)
For audio under 1 hour, always try single request first — it's faster, simpler, and produces the best quality.
## Empirical Comparison (55-minute AI course recording)
| Strategy | Segments | Boundaries | Total chars | Quality |
|----------|----------|------------|-------------|---------|
| 12x5min direct concat | 12 | 11 cuts | 23,060 | 11 truncated sentences |
| 4x18min overlap merge | 4 | 3 merges | 22,781 | Zero truncation |
| 1x55min single request | 1 | 0 | 22,889 | Perfect (best) |


@@ -0,0 +1,200 @@
# /// script
# requires-python = ">=3.9"
# ///
"""
Overlap-merge transcription for long audio files.
Splits audio into 18-minute chunks with 2-minute overlap, transcribes each chunk
via a configurable ASR endpoint, then merges using punctuation-stripped fuzzy
matching to eliminate sentence truncation at boundaries.
Usage:
python3 scripts/overlap_merge_transcribe.py INPUT_AUDIO OUTPUT.txt --config CONFIG.json
python3 scripts/overlap_merge_transcribe.py INPUT_AUDIO OUTPUT.txt --endpoint URL --model MODEL
"""
import argparse
import json
import os
import re
import subprocess
import sys
import tempfile
def get_duration(audio_path: str) -> float:
result = subprocess.run(
["ffprobe", "-v", "error", "-show_entries", "format=duration",
"-of", "default=noprint_wrappers=1:nokey=1", audio_path],
capture_output=True, text=True
)
return float(result.stdout.strip())
def split_audio(audio_path: str, chunk_dir: str, chunk_duration: int, overlap: int) -> list[tuple[int, int, str]]:
"""Split audio into overlapping chunks. Returns list of (start_sec, duration_sec, chunk_path)."""
total = int(get_duration(audio_path))
chunks = []
start = 0
while start < total:
duration = min(chunk_duration, total - start)
chunk_path = os.path.join(chunk_dir, f"chunk_{len(chunks):02d}.mp3")
subprocess.run(
["ffmpeg", "-i", audio_path, "-ss", str(start), "-t", str(duration),
"-acodec", "copy", chunk_path, "-y"],
capture_output=True
)
chunks.append((start, duration, chunk_path))
print(f" Chunk {len(chunks)-1}: {start//60}:{start%60:02d} - {(start+duration)//60}:{(start+duration)%60:02d}", file=sys.stderr)
start += duration - overlap
if start + duration >= total and duration == chunk_duration:
start = total - duration # ensure last chunk covers the end
if start <= chunks[-1][0]:
break
return chunks
def transcribe(audio_path: str, endpoint: str, model: str, noproxy: bool = True) -> str:
"""Send audio to ASR endpoint and return text."""
noproxy_args = ["--noproxy", "*"] if noproxy else []
result = subprocess.run(
["curl", "-s", "--max-time", "600"] + noproxy_args + [
endpoint,
"-F", f"file=@{audio_path}",
"-F", f"model={model}"
],
capture_output=True, text=True
)
    if result.returncode != 0 or not result.stdout.strip():
        raise RuntimeError(f"curl failed for {audio_path}: {result.stderr[:200]}")
    data = json.loads(result.stdout)
    if "text" not in data:
        raise RuntimeError(f"ASR error response: {result.stdout[:300]}")
    return data["text"]
def strip_punct(text: str) -> str:
"""Remove all punctuation, keep only CJK chars, letters, and digits."""
return re.sub(r'[^\w\u4e00-\u9fff]', '', text)
def fuzzy_merge(text_a: str, text_b: str, search_chars: int = 600, min_match: int = 15) -> str:
"""
Merge two overlapping transcription segments using punctuation-stripped fuzzy matching.
The ASR model produces slightly different punctuation for the same audio segment
across different runs, so exact string matching fails. By stripping punctuation
before matching, we find the true overlap region reliably.
Uses text_b's version at the merge point because text_a truncates its final sentence
while text_b has the complete version.
"""
tail_a_clean = strip_punct(text_a[-search_chars:])
text_b_clean = strip_punct(text_b)
    best_match_len = 0
    best_b_clean_end = 0
    best_a_clean_start = 0  # initialized alongside the other trackers
# Search for longest matching substring (punctuation-stripped)
for start in range(len(tail_a_clean)):
substr = tail_a_clean[start:start + min_match]
if len(substr) < min_match:
break
pos = text_b_clean.find(substr)
if pos >= 0:
# Extend the match as far as possible
match_len = min_match
while (start + match_len < len(tail_a_clean)
and pos + match_len < len(text_b_clean)
and tail_a_clean[start + match_len] == text_b_clean[pos + match_len]):
match_len += 1
if match_len > best_match_len:
best_match_len = match_len
best_b_clean_end = pos + match_len
best_a_clean_start = start
if best_match_len >= min_match:
# Map clean positions back to raw text positions
# For text_a: find where the match starts in raw text
        a_offset = max(0, len(text_a) - search_chars)  # clamp when text_a is shorter than the window
clean_count = 0
a_cut_pos = len(text_a)
for idx, ch in enumerate(text_a[-search_chars:]):
if strip_punct(ch):
clean_count += 1
if clean_count > best_a_clean_start:
a_cut_pos = a_offset + idx
break
# For text_b: find where the match ends in raw text
clean_count = 0
b_start_pos = 0
for idx, ch in enumerate(text_b):
if strip_punct(ch):
clean_count += 1
if clean_count >= best_b_clean_end:
b_start_pos = idx + 1
break
print(f" Merged: {best_match_len} chars matched (punct-stripped)", file=sys.stderr)
return text_a[:a_cut_pos] + text_b[b_start_pos:]
else:
print(f" Warning: no overlap found ({best_match_len} chars), concatenating directly", file=sys.stderr)
return text_a + text_b
def main():
parser = argparse.ArgumentParser(description="Overlap-merge ASR transcription")
parser.add_argument("input", help="Input audio/video file")
parser.add_argument("output", help="Output text file")
parser.add_argument("--config", help="Path to config.json (from CLAUDE_PLUGIN_DATA)")
parser.add_argument("--endpoint", default="http://workstation-4090-wsl:8002/v1/audio/transcriptions", help="ASR endpoint URL")
parser.add_argument("--model", default="Qwen/Qwen3-ASR-1.7B", help="Model name")
    parser.add_argument("--noproxy", action=argparse.BooleanOptionalAction, default=True, help="Pass --noproxy '*' to curl (use --no-noproxy to disable)")
parser.add_argument("--chunk-duration", type=int, default=1080, help="Chunk duration in seconds (default: 1080 = 18min)")
parser.add_argument("--overlap", type=int, default=120, help="Overlap duration in seconds (default: 120 = 2min)")
args = parser.parse_args()
# Load config from file if provided, otherwise use CLI args
if args.config and os.path.exists(args.config):
with open(args.config) as f:
cfg = json.load(f)
args.endpoint = cfg.get("endpoint", args.endpoint)
args.model = cfg.get("model", args.model)
args.noproxy = cfg.get("noproxy", args.noproxy)
print(f"Input: {args.input}", file=sys.stderr)
total_duration = get_duration(args.input)
print(f"Duration: {total_duration:.0f}s ({total_duration/60:.1f}min)", file=sys.stderr)
with tempfile.TemporaryDirectory() as chunk_dir:
# Split
print(f"\nSplitting into {args.chunk_duration}s chunks with {args.overlap}s overlap...", file=sys.stderr)
chunks = split_audio(args.input, chunk_dir, args.chunk_duration, args.overlap)
print(f"Created {len(chunks)} chunks\n", file=sys.stderr)
# Transcribe each chunk
texts = []
for i, (start, dur, path) in enumerate(chunks):
print(f"Transcribing chunk {i} ({start//60}:{start%60:02d})...", end=" ", file=sys.stderr, flush=True)
text = transcribe(path, args.endpoint, args.model, args.noproxy)
texts.append(text)
print(f"{len(text)} chars", file=sys.stderr)
# Merge
print(f"\nMerging {len(texts)} segments...", file=sys.stderr)
merged = texts[0]
for i in range(1, len(texts)):
print(f" Merging chunk {i-1} + {i}:", file=sys.stderr)
merged = fuzzy_merge(merged, texts[i])
# Save
with open(args.output, "w", encoding="utf-8") as f:
f.write(merged)
print(f"\nDone! {len(merged)} chars saved to {args.output}", file=sys.stderr)
if __name__ == "__main__":
main()


@@ -41,6 +41,23 @@ So please pay attention to context cues to understand how to phrase your communi
It's OK to briefly explain terms if you're in doubt, and feel free to clarify terms with a short definition if you're unsure if the user will get it.
### Using AskUserQuestion (Critical — Read This)
**Use the AskUserQuestion tool aggressively at every decision point.** Do not ask open-ended text questions in conversation when structured choices exist. This is the single biggest UX improvement you can make — users juggle multiple windows and may not have looked at this conversation in 20 minutes.
**Every AskUserQuestion MUST follow this structure:**
1. **Re-ground**: State the skill name, current phase, and what just happened (1-2 sentences). The user may have context-switched away.
2. **Simplify**: Explain the decision in plain language. No function names or internal jargon. Say what it DOES, not what it's called.
3. **Recommend**: Lead with your recommendation and a one-line reason why. If options involve effort, show both scales: `(human: ~X min / Claude: ~Y min)`.
4. **Options**: Provide 2-4 concrete, lettered choices. Each option should be a clear action, not an abstract concept.
**Rules:**
- One decision per question — never batch unrelated choices
- Provide an escape hatch ("Other" is always implicit in AskUserQuestion)
- Accept the user's choice — nudge on tradeoffs but never refuse to proceed
- Skip the question if there's an obvious answer with no tradeoffs (just state what you'll do)
---
## Creating a skill
@@ -52,7 +69,83 @@ Start by understanding the user's intent. The current conversation might already
1. What should this skill enable Claude to do?
2. When should this skill trigger? (what user phrases/contexts)
3. What's the expected output format?
4. Should we set up test cases to verify the skill works?
After extracting answers from conversation history (or asking questions 1-3), use **AskUserQuestion** to confirm the skill type and testing strategy:
```
Creating skill "[name]" — here's what I understand so far:
- Purpose: [1-sentence summary]
- Triggers on: [key phrases]
- Output: [format]
RECOMMENDATION: [Objective/Subjective/Hybrid] skill → [suggested testing approach]
Options:
A) Objective output (files, code, data) — set up automated test cases (Recommended if output is verifiable)
B) Subjective output (writing, design) — qualitative human review only
C) Hybrid — automated checks for structure, human review for quality
D) Skip testing for now — just build the skill and iterate by feel
```
This upfront classification drives the entire evaluation strategy downstream. Get it right here to avoid wasted effort later.
### Prior Art Research (Do Not Skip)
The user's private methodology — their domain rules, workflow decisions, competitive edge — is what makes a skill valuable. No public repo can provide that. But the user shouldn't waste time reinventing infrastructure (API clients, auth flows, rate limiting) when mature tools exist. Prior art research finds building blocks for the infrastructure layer so the skill can focus on encoding the user's unique methodology.
**Search these channels in order** (use subagents for 4-8 in parallel):
| Priority | Channel | What to search | How |
|----------|---------|---------------|-----|
| 1 | **Conversation history** | User's proven workflows, verified API patterns, corrections made during debugging | Grep recent conversations for the service/API name |
| 2 | **Local documents & SOPs** | User's private methodology, runbooks, existing skills | Search project directory, `~/.claude/CLAUDE.md`, `~/.claude/references/` |
| 3 | **Installed plugins & MCPs** | Already-integrated tools | Check `~/.claude/plugins/`, parse `installed_plugins.json`; check `~/.claude.json` for configured MCP servers |
| 4 | **skills.sh** | Community skills | `WebFetch https://skills.sh/?q=<keyword>` |
| 5 | **Anthropic official plugins** | Official/partner plugins | `WebFetch https://github.com/anthropics/claude-plugins-official/tree/main/plugins` and `external_plugins` directory |
| 6 | **MCP servers on GitHub** | Existing MCP servers for the same API | `WebSearch "<service-name> MCP server site:github.com"` |
| 7 | **Official API docs** | The target service's own documentation | `WebSearch "<service-name> API documentation"` or `WebFetch` the docs URL |
| 8 | **npm / PyPI** | SDK or CLI packages | `npm search <keyword>` or `curl https://pypi.org/pypi/<name>/json` |
Channels 1-3 surface the user's own proven patterns and existing integrations. Channels 4-8 find public infrastructure. The user's private SOP always takes precedence — public tools are building blocks, not replacements. In competitive domains (finance, trading, proprietary operations), the valuable methodology will never be public.
**If a public MCP server or skill is found, clone it and verify — don't trust the README:**
1. **Read the actual source code** — many projects have polished READMEs on hollow codebases
2. **Verify auth method** — does it match how the API actually authenticates? (X-Api-Key headers vs Bearer vs OAuth — many get this wrong)
3. **Check test coverage** — zero tests = prototype, not production-grade
4. **Check maintenance** — last commit date, open issue count, response to bug reports
5. **Check environment compatibility** — proxy/network assumptions, hardcoded DNS/IPs, region locks
6. **Check license** — MIT/Apache is fine; GPL/SSPL may conflict with proprietary use
7. **Check dependency weight** — huge dependency trees create conflict and security surface
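Checks 3 and 4 can be partly automated once the repo is cloned (a heuristic sketch; the function name and glob patterns are assumptions, and it does not replace reading the source):

```python
import subprocess
from datetime import datetime, timezone

def repo_health(path: str) -> dict:
    """Heuristics for a cloned repo: age of the last commit and count of
    tracked files whose paths look like tests."""
    last = subprocess.run(
        ["git", "-C", path, "log", "-1", "--format=%cI"],
        capture_output=True, text=True,
    ).stdout.strip()
    age_days = None
    if last:
        age_days = (datetime.now(timezone.utc) - datetime.fromisoformat(last)).days
    tests = subprocess.run(
        ["git", "-C", path, "ls-files", "test*", "*test*"],
        capture_output=True, text=True,
    ).stdout.splitlines()
    return {"last_commit_days_ago": age_days, "test_files": len(tests)}
```

A repo with `last_commit_days_ago` in the hundreds and `test_files == 0` is a prototype, whatever its README says.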
**Decision matrix:**
| Finding | Action |
|---------|--------|
| Mature MCP/SDK handles the infrastructure | **Adopt it, build on top** — install the MCP, then build the skill as a workflow layer encoding the user's methodology |
| Partial MCP or SDK exists | **Extend** — use for infrastructure, fill gaps in the skill |
| Public skill covers the same domain | **Use for structural inspiration only** — public skills in competitive domains are generic by definition. The user's edge is their private SOP |
| Nothing public exists | **Build from scratch** — validate API access patterns work (auth, endpoints, proxy) before writing the full skill |
| Integration cost > build cost | **Build it** — a 2-hour custom implementation you own beats a "mature" tool with integration friction and upstream risk |
After research completes, present findings via **AskUserQuestion**:
```
Research complete for "[skill-name]". Here's what I found:
[1-2 sentence summary of what exists publicly]
RECOMMENDATION: [ADOPT / EXTEND / BUILD] because [one-line reason]
Options:
A) Adopt [tool/MCP X] for infrastructure, build methodology layer on top (Recommended)
B) Extend [partial tool Y] — use what works, fill gaps in the skill
C) Build from scratch — nothing found matches well enough
D) Show me the detailed findings before I decide
```
When in doubt, bias toward adopting mature infrastructure for the plumbing layer and building custom logic for the methodology layer — that's where the value lives.
### Interview and Research
@@ -342,7 +435,26 @@ Anthropic has wrote skill authoring best practices, you SHOULD retrieve it befor
### Test Cases
After writing the skill draft, come up with 2-3 realistic test prompts — the kind of thing a real user would actually say. Present them via **AskUserQuestion**:
```
Skill draft is ready. Here are [N] test cases I'd like to run:
1. "[test prompt 1]" — tests [what aspect]
2. "[test prompt 2]" — tests [what aspect]
3. "[test prompt 3]" — tests [what aspect]
Each test runs the skill + a baseline (no skill) for comparison.
Estimated time: ~[X] minutes total.
RECOMMENDATION: Run all [N] test cases now.
Options:
A) Run all test cases (Recommended)
B) Run test cases, but let me modify them first
C) Add more test cases before running
D) Skip testing — the skill looks good enough to ship
```
Save test cases to `evals/evals.json`. Don't write assertions yet — just the prompts. You'll draft assertions in the next step while the runs are in progress.
@@ -450,7 +562,22 @@ Put each with_skill version before its baseline counterpart.
Note: please use generate_review.py to create the viewer; there's no need to write custom HTML.
5. **Tell the user** via **AskUserQuestion**:
```
Results are ready! I've opened the eval viewer in your browser.
- "Outputs" tab: click through each test case, leave feedback in the textbox
- "Benchmark" tab: quantitative comparison (pass rates, timing, tokens)
Take your time reviewing. When you're done, come back here.
Options:
A) I've finished reviewing — read my feedback and improve the skill
B) I have questions about the results before giving feedback
C) Results look good enough — skip iteration, let's package the skill
D) Results need major rework — let's discuss before iterating
```
### What the user sees in the viewer
@@ -507,6 +634,24 @@ This is the heart of the loop. You've run the test cases, the user has reviewed
This task is pretty important (we are trying to create billions a year in economic value here!) and your thinking time is not the blocker; take your time and really mull things over. I'd suggest writing a draft revision and then looking at it anew and making improvements. Really do your best to get into the head of the user and understand what they want and need.
After analyzing feedback, present your improvement plan via **AskUserQuestion**:
```
I've read the feedback from [N] test cases. [X] had specific complaints, [Y] looked good.
Key issues:
- [Issue 1]: [plain-language summary]
- [Issue 2]: [plain-language summary]
RECOMMENDATION: [strategy] because [reason]
Options:
A) Iterative refinement — targeted fixes for the specific issues above (Recommended)
B) Structural redesign — the core approach needs rethinking
C) Bundle a script — I noticed all test runs independently wrote similar code for [X]
D) Expand test set first — add [N] more test cases to avoid overfitting to these examples
```
### The iteration loop
After improving the skill:
@@ -517,6 +662,18 @@ After improving the skill:
4. Wait for the user to review and tell you they're done
5. Read the new feedback, improve again, repeat
At the end of each iteration, use **AskUserQuestion** as a checkpoint:
```
Iteration [N] complete. Results: [pass_rate]% assertions passing, [delta vs previous].
Options:
A) Continue iterating — I see more room for improvement
B) Accept this version — it's good enough, let's move to packaging
C) Revert to previous iteration — this round made things worse
D) Run blind comparison — rigorously compare this version vs the previous one
```
Keep going until:
- The user says they're happy
- The feedback is all empty (everything looks good)
@@ -686,7 +843,20 @@ When editing, remember that the skill is being created for another instance of C
### Step 5: Sanitization Review (Optional)
Use **AskUserQuestion** before executing this step:
```
This skill appears to contain content from a real project.
Before distribution, I should check for business-specific details
(company names, internal paths, product names) that shouldn't be public.
RECOMMENDATION: Run selective sanitization — review each finding before removing.
Options:
A) Full sanitization — automatically remove all business-specific content
B) Selective sanitization — show me each finding and let me decide (Recommended)
C) Skip — this is for internal use only, no sanitization needed
```
Skip if: skill was created from scratch for public use, user declines, or skill is for internal use.
@@ -730,6 +900,21 @@ brew install gitleaks
- `3` - gitleaks not installed
- `4` - Scan error
**If issues are found**, present them via **AskUserQuestion**:
```
Security scan found [N] issues in "[skill-name]":
- [SEVERITY] [file]: [description]
- ...
RECOMMENDATION: Fix automatically — these look like [accidental leaks / false positives].
Options:
A) Fix all issues automatically (Recommended)
B) Review each finding — let me decide per-item (some may be intentional)
C) Override and proceed — I accept the risk for internal distribution
```
### Step 7: Packaging a Skill
Once the skill is ready, package it into a distributable file:
@@ -773,7 +958,19 @@ After packaging, update the marketplace registry to include the new or updated s
**For updated skills**, bump the version in `plugins[].version` following semver.
### Step 9: Ship or Iterate
After completing the skill, use **AskUserQuestion** to determine next steps:
```
Skill "[name]" is complete. Security scan passed, marketplace updated.
Options:
A) Package and export as .skill file for distribution
B) Run description optimization — improve auto-triggering accuracy (~5 min)
C) Expand test set and iterate more — add edge cases before shipping
D) Done for now — I'll test it manually and come back if needed
```
After testing the skill, users may request improvements. Often this happens right after using the skill, with fresh context of how the skill performed.