feat: add douban-skill + enhance skill-creator with development methodology

New skill: douban-skill - Full export of Douban (豆瓣) book/movie/music/game collections via Frodo API - RSS incremental sync for daily updates - Python stdlib only, zero dependencies, cross-platform (macOS/Windows/Linux) - Documented 7 failed approaches (PoW anti-scraping) and why Frodo API is the only working solution - Pre-flight user validation, KeyboardInterrupt handling, pagination bug fix skill-creator enhancements: - Add development methodology reference (8-phase process with prior art research, counter review, and real failure case studies) - Sync upstream changes: improve_description.py now uses `claude -p` instead of Anthropic SDK (no ANTHROPIC_API_KEY needed), remove stale "extended thinking" ref - Add "Updating an existing skill" guidance to Claude.ai and Cowork sections - Restore test case heuristic guidance for objective vs subjective skills README updates: - Document fork advantages vs upstream with quality comparison table (65 vs 42) - Bilingual (EN + ZH-CN) with consistent content Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 12:36:51 +08:00
parent cafabd753b
commit 28cd6bd813
11 changed files with 1186 additions and 73 deletions
--- a/skill-creator/SKILL.md
+++ b/skill-creator/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: skill-creator
-description: Create new skills, modify and improve existing skills, and measure skill performance. Use when users want to create a skill from scratch, update or optimize an existing skill, run evals to test a skill, benchmark skill performance with variance analysis, or optimize a skill's description for better triggering accuracy.
+description: Create new skills, modify and improve existing skills, and measure skill performance. Use when users want to create a skill from scratch, edit, or optimize an existing skill, run evals to test a skill, benchmark skill performance with variance analysis, or optimize a skill's description for better triggering accuracy.
 license: Complete terms in LICENSE.txt
 ---

@@ -69,7 +69,7 @@ Start by understanding the user's intent. The current conversation might already
 1. What should this skill enable Claude to do?
 2. When should this skill trigger? (what user phrases/contexts)
 3. What's the expected output format?
-4. Should we set up test cases to verify the skill works?
+4. Should we set up test cases to verify the skill works? Skills with objectively verifiable outputs (file transforms, data extraction, code generation, fixed workflow steps) benefit from test cases. Skills with subjective outputs (writing style, art) often don't need them. Suggest the appropriate default based on the skill type, but let the user decide.

 After extracting answers from conversation history (or asking questions 1-3), use **AskUserQuestion** to confirm the skill type and testing strategy:

@@ -433,6 +433,10 @@ Filenames must be self-explanatory without reading contents.

 Anthropic has wrote skill authoring best practices, you SHOULD retrieve it before you create or update any skills, the link is https://platform.claude.com/docs/en/agents-and-tools/agent-skills/best-practices.md

+#### Development Methodology Reference
+
+Also read [references/skill-development-methodology.md](references/skill-development-methodology.md) before starting — it covers the full 8-phase development process with prior art research, counter review, and real failure case studies. The two references are complementary: the Anthropic doc covers principles, the methodology covers process.
+
 ### Test Cases

 After writing the skill draft, come up with 2-3 realistic test prompts — the kind of thing a real user would actually say. Present them via **AskUserQuestion**:
@@ -750,7 +754,7 @@ Use the model ID from your system prompt (the one powering the current session)

 While it runs, periodically tail the output to give the user updates on which iteration it's on and what the scores look like.

-This handles the full optimization loop automatically. It splits the eval set into 60% train and 40% held-out test, evaluates the current description (running each query 3 times to get a reliable trigger rate), then calls Claude with extended thinking to propose improvements based on what failed. It re-evaluates each new description on both train and test, iterating up to 5 times. When it's done, it opens an HTML report in the browser showing the results per iteration and returns JSON with `best_description` — selected by test score rather than train score to avoid overfitting.
+This handles the full optimization loop automatically. It splits the eval set into 60% train and 40% held-out test, evaluates the current description (running each query 3 times to get a reliable trigger rate), then calls Claude to propose improvements based on what failed. It re-evaluates each new description on both train and test, iterating up to 5 times. When it's done, it opens an HTML report in the browser showing the results per iteration and returns JSON with `best_description` — selected by test score rather than train score to avoid overfitting.

 ### How skill triggering works

@@ -1008,6 +1012,11 @@ In Claude.ai, the core workflow is the same (draft -> test -> review -> improve

 **Packaging**: The `package_skill.py` script works anywhere with Python and a filesystem. On Claude.ai, you can run it and the user can download the resulting `.skill` file.

+- **Updating an existing skill**: The user might be asking you to update an existing skill, not create a new one. In this case:
+  - **Preserve the original name.** Note the skill's directory name and `name` frontmatter field — use them unchanged. E.g., if the installed skill is `research-helper`, output `research-helper.skill` (not `research-helper-v2`).
+  - **Copy to a writeable location before editing.** The installed skill path may be read-only. Copy to `/tmp/skill-name/`, edit there, and package from the copy.
+  - **If packaging manually, stage in `/tmp/` first**, then copy to the output directory — direct writes may fail due to permissions.
+
 ---

 ## Cowork-Specific Instructions
@@ -1020,6 +1029,7 @@ If you're in Cowork, the main things to know are:
 - Feedback works differently: since there's no running server, the viewer's "Submit All Reviews" button will download `feedback.json` as a file. You can then read it from there (you may have to request access first).
 - Packaging works — `package_skill.py` just needs Python and a filesystem.
 - Description optimization (`run_loop.py` / `run_eval.py`) should work in Cowork just fine since it uses `claude -p` via subprocess, not a browser, but please save it until you've fully finished making the skill and the user agrees it's in good shape.
+- **Updating an existing skill**: The user might be asking you to update an existing skill, not create a new one. Follow the update guidance in the claude.ai section above.

 ---

--- a/skill-creator/references/skill-development-methodology.md
+++ b/skill-creator/references/skill-development-methodology.md
@@ -0,0 +1,149 @@
+# Skill Development Methodology
+
+综合 Anthropic 官方最佳实践、skill-creator 工作流、社区经验和实战教训的完整方法论。
+
+本文档只包含 SKILL.md 中**没有覆盖**的内容。SKILL.md 已经详细描述的流程（Prior Art 8 渠道表、决策矩阵、Inline vs Fork、测试用例格式、描述优化循环等）不在此重复——请直接参考 SKILL.md 对应章节。
+
+## Phase 1: 先手动解决问题，不要上来就建 skill
+
+SKILL.md 的 "Capture Intent" 章节覆盖了意图收集的 4 个问题和 skill 类型分类。本节补充一个被忽略的前置步骤：
+
+**不要一开始就写 skill。** 先用 Claude Code 正常解决用户的问题，在过程中积累经验——哪些方案有效、哪些失败、最终的 working solution 是什么。如果你没有亲自失败过，你写不出能防止别人失败的 skill。
+
+很多 skill 都是从"把我们刚做的变成一个 skill"中诞生的。先从对话历史中提取已验证的模式（SKILL.md "Capture Intent" 第三段已提及），然后才开始规划 skill 结构。
+
+## Phase 2: 用 Agent Team 做并行调研
+
+SKILL.md 的 "Prior Art Research" 章节覆盖了 8 个搜索渠道、clone-and-verify 检查清单、和 Adopt/Extend/Build 决策矩阵。本节补充 SKILL.md 未提及的**并行调研模式**：
+
+遇到不确定的技术方案时，不要串行尝试（太慢），也不要凭经验猜（太危险）。同时启动 3+ 个研究 agent，每个负责一个调研方向：
+
+| Agent | 职责 | 搜索范围 |
+|-------|------|---------|
+| 工具调研 | 找已有成熟工具 | GitHub stars、npm/PyPI、社区 skill 注册表 |
+| API 调研 | 找可用 API 端点 | 官方文档、逆向工程、移动端 API |
+| 约束调研 | 理解技术限制 | 反爬机制、认证要求、平台限制 |
+
+每个 agent 必须独立验证（读源码、确认 API 可达、检查最近提交日期），不能只看 README。
+
+**案例**：开发一个数据导出 skill 时，3 个 agent 并行跑了 5-20 分钟，分别发现：一个关键工具当前版本 broken（605 stars 但 PR 待合并）、一个未公开的移动端 API（唯一可行方案）、目标平台升级了 PoW 反爬（所有 HTTP 抓取失效）。没有并行研究，这些信息需要串行试错 3+ 小时才能获得。
+
+## Phase 3: 用真实数据验证原型
+
+SKILL.md 的 Evaluation-Driven Development 流程覆盖了"先跑 baseline → 建 eval → 迭代"的过程。本节补充两个 SKILL.md 未强调的验证原则：
+
+### 3.1 数据完整性验证
+
+"it runs without errors" ≠ "it exported all items correctly"。必须：
+- 对比 API 报告的 total 和实际导出行数
+- 检查字段格式（评分、日期、编码是否符合预期）
+- 用不同规模的数据测试（0 条、100 条、1000+ 条）
+
+**常见静默 bug**：
+- 分页逻辑：某些页面返回的数据量少于请求值（如请求 50 条返回 48 条），被误判为最后一页导致提前终止。修复：检查 `total` 而非 `page_size`
+- 数据转换：API 返回 `{value: 2, max: 5}` 表示 2/5 星，但代码按 `max: 10` 处理后变成 1 星。修复：检查 `max` 字段确定 scale
+
+### 3.2 记录失败
+
+详细记录每个失败方案的方法、失败模式、根因。这些将成为 skill 中 "Do NOT attempt" 部分的内容——这是 skill 最独特的价值，防止未来的 agent 重走弯路。
+
+失败记录的结构：
+
+| 方案 | 结果 | 根因 |
+|------|------|------|
+| 方案名称 | 具体失败表现（HTTP 状态码、错误信息） | 架构层面的原因分析 |
+
+## Phase 4: Skill 写作补充原则
+
+SKILL.md 的 "Skill Writing Guide" 已覆盖 frontmatter、progressive disclosure、bundled resources、命名规范等。本节补充 SKILL.md 未提及的内容层面原则：
+
+### 4.1 写清楚 skill 不能做什么
+
+防止 agent 尝试不可能的操作。例如：
+- "Cannot export reviews (长评) — different API endpoint, not implemented"
+- "Cannot filter by single category — exports all 4 types together"
+
+### 4.2 写清楚失败过什么
+
+在 SKILL.md 或 references 中保留失败方案的摘要（详见 Phase 3.2），加上明确的"Do NOT attempt"警告。这比正面指令更有效——agent 看到 7 种方案的失败记录后，不会尝试第 8 种类似方案。
+
+### 4.3 安全说明
+
+如果脚本包含 API key、HMAC 密钥或其他凭据，必须解释来源和安全性。例如："These are the app's public credentials extracted from the APK, shared by all users. No personal credentials are used."
+
+### 4.4 Console output 示例
+
+展示一次成功运行的完整控制台输出。让 agent 知道"正确运行"长什么样，方便验证（SKILL.md Phase 5 的 self-verification）。
+
+### 4.5 脚本健壮性
+
+SKILL.md 的 "Solve, don't punt" 覆盖了基本错误处理。补充实战中发现的常见遗漏：
+- 只捕获 HTTPError，遗漏 URLError / socket.timeout / JSONDecodeError
+- 无限分页循环（API 异常时）——需要 max-page 安全阀
+- CSV 中的换行符/回车符——`csvEscape` 必须处理 `\r`
+- 用户输入是完整 URL 而非 ID——脚本应自动提取
+
+## Phase 5: 测试迭代补充
+
+SKILL.md 的测试流程非常详细（A/B 测试、断言、评分、viewer）。本节补充两个 SKILL.md 未覆盖的实操教训：
+
+### 5.1 删除竞争的旧 skill
+
+如果系统中存在旧版 skill（关键词冲突），eval agent 会被旧 skill 截胡，导致测试结果完全无效。必须在测试前删除旧 skill。
+
+**信号**：eval agent 使用了不同于预期的脚本或方法 → 检查是否有同名/同领域的旧 skill 被加载。
+
+### 5.2 量化迭代对比
+
+SKILL.md 提到 timing.json 和 benchmark，但未给出具体应跟踪哪些指标。推荐：
+
+| 指标 | 为什么重要 |
+|------|-----------|
+| 数据完整性（实际/预期） | 核心正确性 |
+| 执行时间 | 用户体验 |
+| Token 消耗 | 成本 |
+| 工具调用次数 | skill 引导效率——次数越少说明 skill 的指令越清晰 |
+| 错误数 | 必须为 0 |
+
+**案例对比**：某 skill 迭代后，工具调用从 31 次降到 8 次（74% 减少）、Token 从 72K 降到 41K（43% 减少），说明 skill 的指令让 agent 不再需要自己摸索。
+
+## Phase 6: Counter Review — 用 Agent Team 做对抗性审查
+
+这是 SKILL.md 未覆盖的独立环节。SKILL.md 的 "Improving the skill" 章节关注用户反馈驱动的迭代，但没有系统化的多视角审查流程。
+
+### 6.1 第一轮：3 个视角并行
+
+用 Task 工具同时启动 3 个 review agent：
+
+| Reviewer | 视角 | 关注点 |
+|----------|------|--------|
+| Skill 质量 | 对标 Anthropic 最佳实践 | 描述质量、简洁性、progressive disclosure、可操作性、错误预防、示例、术语一致性 |
+| 代码健壮性 | 高级工程师找 bug | 错误处理、安全性、跨平台、边界情况、依赖、幂等性 |
+| 用户视角 | 首次使用者体验 | 首次成功率、输入容错、输出预期、隐私顾虑、失败恢复 |
+
+### 6.2 修复后 Final Gate
+
+修复所有 Critical 和 HIGH 问题后，再启动 final gate reviewers 验证修复正确性。评分 >= 8 才放行。
+
+### 6.3 常见发现模式
+
+根据实战经验，reviewer 经常发现的问题类型：
+- **SKILL.md 和 references 内容重复**（每次都会犯，包括本文档自己）
+- **异常类型遗漏**（只捕获 HTTPError，漏掉 URLError/socket.timeout）
+- **substring 误匹配**（`content.includes(url)` 导致 `/1234/` 匹配 `/12345/`）
+- **docstring 与实际行为不一致**（写了 "4.5 → 5" 但实际行为是 "4.5 → 4"）
+- **误导性注释**（注释说"每个分类写入后立即保存"但代码在最后才写入）
+- **时间敏感数据**（特定日期的测试结果、版本号——下周就过时了）
+
+## Phase 7 & 8: Description Optimization + Packaging
+
+SKILL.md 已完整覆盖描述优化循环（20 个 eval query、60/40 train/test split、5 轮迭代）和打包流程（prerequisites、security scan、marketplace.json）。无补充。
+
+## 来源
+
+| 来源 | 本文档引用的独有贡献 |
+|------|-------------------|
+| Anthropic Official | Evaluation-driven development、conciseness imperative（已由 SKILL.md 覆盖，本文不重复） |
+| skill-creator SKILL.md | 完整工作流和工具链（本文引用但不复制，请直接参考 SKILL.md） |
+| 社区经验 | 激活率数据（20%→90%）、Encoded Preference > Capability Uplift |
+| 实战教训 | 并行研究 agent、失败记录的价值、竞争 skill 删除、量化迭代对比、Counter Review 流程 |
--- a/skill-creator/scripts/improve_description.py
+++ b/skill-creator/scripts/improve_description.py
@@ -2,22 +2,52 @@
 """Improve a skill description based on eval results.

 Takes eval results (from run_eval.py) and generates an improved description
-using Claude with extended thinking.
+by calling `claude -p` as a subprocess (same auth pattern as run_eval.py —
+uses the session's Claude Code auth, no separate ANTHROPIC_API_KEY needed).
 """

 import argparse
 import json
+import os
 import re
+import subprocess
 import sys
 from pathlib import Path

-import anthropic
-
 from scripts.utils import parse_skill_md


+def _call_claude(prompt: str, model: str | None, timeout: int = 300) -> str:
+    """Run `claude -p` with the prompt on stdin and return the text response.
+
+    Prompt goes over stdin (not argv) because it embeds the full SKILL.md
+    body and can easily exceed comfortable argv length.
+    """
+    cmd = ["claude", "-p", "--output-format", "text"]
+    if model:
+        cmd.extend(["--model", model])
+
+    # Remove CLAUDECODE env var to allow nesting claude -p inside a
+    # Claude Code session. The guard is for interactive terminal conflicts;
+    # programmatic subprocess usage is safe. Same pattern as run_eval.py.
+    env = {k: v for k, v in os.environ.items() if k != "CLAUDECODE"}
+
+    result = subprocess.run(
+        cmd,
+        input=prompt,
+        capture_output=True,
+        text=True,
+        env=env,
+        timeout=timeout,
+    )
+    if result.returncode != 0:
+        raise RuntimeError(
+            f"claude -p exited {result.returncode}\nstderr: {result.stderr}"
+        )
+    return result.stdout
+
+
 def improve_description(
-    client: anthropic.Anthropic,
    skill_name: str,
    skill_content: str,
    current_description: str,
@@ -99,7 +129,7 @@ Based on the failures, write a new and improved description that is more likely
 1. Avoid overfitting
 2. The list might get loooong and it's injected into ALL queries and there might be a lot of skills, so we don't want to blow too much space on any given description.

-Concretely, your description should not be more than about 100-200 words, even if that comes at the cost of accuracy.
+Concretely, your description should not be more than about 100-200 words, even if that comes at the cost of accuracy. There is a hard limit of 1024 characters — descriptions over that will be truncated, so stay comfortably under it.

 Here are some tips that we've found to work well in writing these descriptions:
 - The skill should be phrased in the imperative -- "Use this skill for" rather than "this skill does"
@@ -111,70 +141,41 @@ I'd encourage you to be creative and mix up the style in different iterations si

 Please respond with only the new description text in <new_description> tags, nothing else."""

-    response = client.messages.create(
-        model=model,
-        max_tokens=16000,
-        thinking={
-            "type": "enabled",
-            "budget_tokens": 10000,
-        },
-        messages=[{"role": "user", "content": prompt}],
-    )
+    text = _call_claude(prompt, model)

-    # Extract thinking and text from response
-    thinking_text = ""
-    text = ""
-    for block in response.content:
-        if block.type == "thinking":
-            thinking_text = block.thinking
-        elif block.type == "text":
-            text = block.text
-
-    # Parse out the <new_description> tags
    match = re.search(r"<new_description>(.*?)</new_description>", text, re.DOTALL)
    description = match.group(1).strip().strip('"') if match else text.strip().strip('"')

-    # Log the transcript
    transcript: dict = {
        "iteration": iteration,
        "prompt": prompt,
-        "thinking": thinking_text,
        "response": text,
        "parsed_description": description,
        "char_count": len(description),
        "over_limit": len(description) > 1024,
    }

-    # If over 1024 chars, ask the model to shorten it
+    # Safety net: the prompt already states the 1024-char hard limit, but if
+    # the model blew past it anyway, make one fresh single-turn call that
+    # quotes the too-long version and asks for a shorter rewrite. (The old
+    # SDK path did this as a true multi-turn; `claude -p` is one-shot, so we
+    # inline the prior output into the new prompt instead.)
    if len(description) > 1024:
-        shorten_prompt = f"Your description is {len(description)} characters, which exceeds the hard 1024 character limit. Please rewrite it to be under 1024 characters while preserving the most important trigger words and intent coverage. Respond with only the new description in <new_description> tags."
-        shorten_response = client.messages.create(
-            model=model,
-            max_tokens=16000,
-            thinking={
-                "type": "enabled",
-                "budget_tokens": 10000,
-            },
-            messages=[
-                {"role": "user", "content": prompt},
-                {"role": "assistant", "content": text},
-                {"role": "user", "content": shorten_prompt},
-            ],
+        shorten_prompt = (
+            f"{prompt}\n\n"
+            f"---\n\n"
+            f"A previous attempt produced this description, which at "
+            f"{len(description)} characters is over the 1024-character hard limit:\n\n"
+            f'"{description}"\n\n'
+            f"Rewrite it to be under 1024 characters while keeping the most "
+            f"important trigger words and intent coverage. Respond with only "
+            f"the new description in <new_description> tags."
        )
-
-        shorten_thinking = ""
-        shorten_text = ""
-        for block in shorten_response.content:
-            if block.type == "thinking":
-                shorten_thinking = block.thinking
-            elif block.type == "text":
-                shorten_text = block.text
-
+        shorten_text = _call_claude(shorten_prompt, model)
        match = re.search(r"<new_description>(.*?)</new_description>", shorten_text, re.DOTALL)
        shortened = match.group(1).strip().strip('"') if match else shorten_text.strip().strip('"')

        transcript["rewrite_prompt"] = shorten_prompt
-        transcript["rewrite_thinking"] = shorten_thinking
        transcript["rewrite_response"] = shorten_text
        transcript["rewrite_description"] = shortened
        transcript["rewrite_char_count"] = len(shortened)
@@ -216,9 +217,7 @@ def main():
        print(f"Current: {current_description}", file=sys.stderr)
        print(f"Score: {eval_results['summary']['passed']}/{eval_results['summary']['total']}", file=sys.stderr)

-    client = anthropic.Anthropic()
    new_description = improve_description(
-        client=client,
        skill_name=name,
        skill_content=content,
        current_description=current_description,
--- a/skill-creator/scripts/run_loop.py
+++ b/skill-creator/scripts/run_loop.py
@@ -15,8 +15,6 @@ import time
 import webbrowser
 from pathlib import Path

-import anthropic
-
 from scripts.generate_report import generate_html
 from scripts.improve_description import improve_description
 from scripts.run_eval import find_project_root, run_eval
@@ -75,7 +73,6 @@ def run_loop(
        train_set = eval_set
        test_set = []

-    client = anthropic.Anthropic()
    history = []
    exit_reason = "unknown"

@@ -200,7 +197,6 @@ def run_loop(
            for h in history
        ]
        new_description = improve_description(
-            client=client,
            skill_name=name,
            skill_content=content,
            current_description=current_description,