--- title: "Prompt Engineer Toolkit" description: "Prompt Engineer Toolkit - Claude Code skill from the Marketing domain." --- # Prompt Engineer Toolkit
:material-bullhorn-outline: Marketing :material-identifier: `prompt-engineer-toolkit` :material-github: Source
Install: claude /plugin install marketing-skills
## Overview Use this skill to move prompts from ad-hoc drafts to production assets with repeatable testing, versioning, and regression safety. It emphasizes measurable quality over intuition. Apply it when launching a new LLM feature that needs reliable outputs, when prompt quality degrades after model or instruction changes, when multiple team members edit prompts and need history/diffs, when you need evidence-based prompt choice for production rollout, or when you want consistent prompt governance across environments. ## Core Capabilities - A/B prompt evaluation against structured test cases - Quantitative scoring for adherence, relevance, and safety checks - Prompt version tracking with immutable history and changelog - Prompt diffs to review behavior-impacting edits - Reusable prompt templates and selection guidance - Regression-friendly workflows for model/prompt updates ## Key Workflows ### 1. Run Prompt A/B Test Prepare JSON test cases and run: ```bash python3 scripts/prompt_tester.py \ --prompt-a-file prompts/a.txt \ --prompt-b-file prompts/b.txt \ --cases-file testcases.json \ --runner-cmd 'my-llm-cli --prompt {prompt} --input {input}' \ --format text ``` Input can also come from stdin/`--input` JSON payload. ### 2. Choose Winner With Evidence The tester scores outputs per case and aggregates: - expected content coverage - forbidden content violations - regex/format compliance - output length sanity Use the higher-scoring prompt as candidate baseline, then run regression suite. ### 3. Version Prompts ```bash # Add version python3 scripts/prompt_versioner.py add \ --name support_classifier \ --prompt-file prompts/support_v3.txt \ --author alice # Diff versions python3 scripts/prompt_versioner.py diff --name support_classifier --from-version 2 --to-version 3 # Changelog python3 scripts/prompt_versioner.py changelog --name support_classifier ``` ### 4. Regression Loop 1. Store baseline version. 2. Propose prompt edits. 3. Re-run A/B test. 4. Promote only if score and safety constraints improve. ## Script Interfaces - `python3 scripts/prompt_tester.py --help` - Reads prompts/cases from stdin or `--input` - Optional external runner command - Emits text or JSON metrics - `python3 scripts/prompt_versioner.py --help` - Manages prompt history (`add`, `list`, `diff`, `changelog`) - Stores metadata and content snapshots locally ## Pitfalls, Best Practices & Review Checklist **Avoid these mistakes:** 1. Picking prompts from single-case outputs — use a realistic, edge-case-rich test suite. 2. Changing prompt and model simultaneously — always isolate variables. 3. Missing `must_not_contain` (forbidden-content) checks in evaluation criteria. 4. Editing prompts without version metadata, author, or change rationale. 5. Skipping semantic diffs before deploying a new prompt version. 6. Optimizing one benchmark while harming edge cases — track the full suite. 7. Model swap without rerunning the baseline A/B suite. **Before promoting any prompt, confirm:** - [ ] Task intent is explicit and unambiguous. - [ ] Output schema/format is explicit. - [ ] Safety and exclusion constraints are explicit. - [ ] No contradictory instructions. - [ ] No unnecessary verbosity tokens. - [ ] A/B score improves and violation count stays at zero. ## References - [references/prompt-templates.md](references/prompt-templates.md) - [references/technique-guide.md](references/technique-guide.md) - [references/evaluation-rubric.md](references/evaluation-rubric.md) - [README.md](README.md) ## Evaluation Design Each test case should define: - `input`: realistic production-like input - `expected_contains`: required markers/content - `forbidden_contains`: disallowed phrases or unsafe content - `expected_regex`: required structural patterns This enables deterministic grading across prompt variants. ## Versioning Policy - Use semantic prompt identifiers per feature (`support_classifier`, `ad_copy_shortform`). - Record author + change note for every revision. - Never overwrite historical versions. - Diff before promoting a new prompt to production. ## Rollout Strategy 1. Create baseline prompt version. 2. Propose candidate prompt. 3. Run A/B suite against same cases. 4. Promote only if winner improves average and keeps violation count at zero. 5. Track post-release feedback and feed new failure cases back into test suite.