Update SKILL.md to final version (#305)

* Update SKILL.md to final version

* fix: restore analyze-project frontmatter

---------

Co-authored-by: sck_0 <samujackson1337@gmail.com>
This commit is contained in:
Gizzant
2026-03-16 10:56:50 -04:00
committed by GitHub
parent de21ffa2c6
commit 1a5c4059bd


tags: [analysis, diagnostics, meta, root-cause, project-health, session-review]
# /analyze-project — Root Cause Analyst Workflow
Analyze AI-assisted coding sessions in `~/.gemini/antigravity/brain/` and produce a report that explains not just **what happened**, but **why it happened**, **who/what caused it**, and **what should change next time**.
This workflow is not a simple metrics dashboard.
It is a forensic analysis workflow for AI coding sessions.
---
## Goal
For each session, determine:
1. What changed from the initial ask to the final executed work
2. Whether the main cause was:
- user/spec
- agent
- repo/codebase
- validation/testing
- legitimate task complexity
3. Whether the opening prompt was sufficient
4. Which files/subsystems repeatedly correlate with struggle
5. What changes would most improve future sessions
---
## Core Principles
- Treat `.resolved.N` counts as **iteration signals**, not proof of failure
- Separate **human-added scope**, **necessary discovered scope**, and **agent-introduced scope**
- Separate **agent error** from **repo friction**
- Every diagnosis must include **evidence** and **confidence**
- Confidence levels:
- **High** = direct artifact/timestamp evidence
- **Medium** = multiple supporting signals
- **Low** = plausible inference, not directly proven
- Evidence precedence:
- artifact contents > timestamps > metadata summaries > inference
- If evidence is weak, say so
---
## Step 0.5: Session Intent Classification
Classify the primary session intent from objective + artifacts:
- `DELIVERY`
- `DEBUGGING`
- `REFACTOR`
- `RESEARCH`
- `EXPLORATION`
- `AUDIT_ANALYSIS`
Record:
- `session_intent`
- `session_intent_confidence`
Use intent to contextualize severity and rework shape.
Do not judge exploratory or research sessions by the same standards as narrow delivery sessions.
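A first pass at intent classification can be automated with a keyword heuristic and then corrected during artifact review. The sketch below is illustrative only: the keyword lists, the `classify_intent` name, and the confidence thresholds are assumptions, not part of this workflow.

```python
# Hypothetical keyword lists for a first-pass intent guess (assumed, not canonical).
INTENT_KEYWORDS = {
    "DEBUGGING": ["fix", "bug", "error", "crash", "traceback"],
    "REFACTOR": ["refactor", "clean up", "restructure", "rename"],
    "RESEARCH": ["investigate", "research", "compare", "evaluate"],
    "AUDIT_ANALYSIS": ["audit", "analyze", "review", "report"],
    "EXPLORATION": ["explore", "prototype", "spike", "experiment"],
}

def classify_intent(objective: str) -> tuple[str, str]:
    """Return (session_intent, session_intent_confidence) from the objective text."""
    text = objective.lower()
    scores = {intent: sum(kw in text for kw in kws)
              for intent, kws in INTENT_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        return "DELIVERY", "Low"  # default when nothing matches
    return best, ("High" if scores[best] >= 2 else "Medium")
```

Treat the output as a starting point; the artifacts themselves should override the keyword guess.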
---
## Step 1: Discover Conversations
1. Read available conversation summaries from system context
2. List conversation folders in the user's Antigravity `brain/` directory
3. Build a conversation index with:
- `conversation_id`
- `title`
- `objective`
- `created`
- `last_modified`
4. If the user supplied a keyword/path, filter to matching conversations; otherwise analyze all
Output: indexed list of conversations to analyze.
---
## Step 2: Extract Session Evidence
For each conversation, read if present:
### Core artifacts
- `task.md`
- `implementation_plan.md`
- `walkthrough.md`
### Metadata
- `*.metadata.json`
### Version snapshots
- `task.md.resolved.0 ... N`
- `implementation_plan.md.resolved.0 ... N`
- `walkthrough.md.resolved.0 ... N`
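Counting these snapshots yields the revision-volume signals used later. A minimal sketch, assuming snapshot files sit directly in the conversation folder:

```python
import re
from pathlib import Path

def count_versions(conv_dir: Path, artifact: str) -> int:
    """Count snapshots named like task.md.resolved.0 ... task.md.resolved.N."""
    pattern = re.compile(re.escape(artifact) + r"\.resolved\.(\d+)$")
    numbers = [int(m.group(1))
               for p in conv_dir.iterdir()
               if (m := pattern.match(p.name))]
    return max(numbers) + 1 if numbers else 0  # snapshots 0..N => N+1 versions
```

Remember: these counts are iteration signals, not proof of failure.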
### Additional signals
- other `.md` artifacts
- report/evaluation files
- timestamps across artifact updates
- file/folder/subsystem names mentioned in plans/walkthroughs
- validation/testing language
- explicit acceptance criteria, constraints, non-goals, and file targets
Record per conversation:
#### Lifecycle
- `has_task`
- `has_plan`
- `has_walkthrough`
- `is_completed`
- `is_abandoned_candidate` = task exists but no walkthrough
#### Revision / change volume
- `task_versions`
- `plan_versions`
- `walkthrough_versions`
- `completed_at`
- `duration_minutes`
#### Content / quality
- `objective_text`
- `initial_plan_summary`
- `final_plan_summary`
---
## Step 3: Prompt Sufficiency
Score the opening request on a 0–2 scale for:
- **Clarity**
- **Boundedness**
- **Testability**
- **Architectural specificity**
- **Constraint awareness**
- **Dependency awareness**
Create:
- `prompt_sufficiency_score`
- `prompt_sufficiency_band` = High / Medium / Low
Then note which missing prompt ingredients likely contributed to later friction.
Do not punish short prompts by default; a narrow, obvious task can still have high sufficiency.
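Aggregating the six 0–2 dimension scores can be sketched as below. The band cutoffs (≥9 High, ≥5 Medium) are illustrative assumptions; calibrate them against your own sessions.

```python
DIMENSIONS = ("clarity", "boundedness", "testability",
              "architectural_specificity", "constraint_awareness",
              "dependency_awareness")

def score_prompt(scores: dict[str, int]) -> tuple[int, str]:
    """Sum per-dimension 0-2 scores (max 12) and band the total."""
    total = sum(scores.get(d, 0) for d in DIMENSIONS)
    # Band cutoffs are assumed, not prescribed by the workflow.
    band = "High" if total >= 9 else "Medium" if total >= 5 else "Low"
    return total, band
```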
---
## Step 4: Scope Change Classification
For each conversation, classify scope delta into:
- **Human-added scope** — new asks beyond the original task
- **Necessary discovered scope** — work required to complete the original task correctly
- **Agent-introduced scope** — likely unnecessary work introduced by the agent
Record:
- `scope_change_type_primary`
- `scope_change_type_secondary` (optional)
- `scope_change_confidence`
- evidence
Keep one short example in mind for calibration:
- Human-added: “also refactor nearby code while you're here”
- Necessary discovered: hidden dependency must be fixed for original task to work
- Agent-introduced: extra cleanup or redesign not requested and not required
---
## Step 5: Rework Shape
Classify each conversation into one of these patterns:
- **Clean execution**
- **Early replan then stable finish**
- **Progressive scope expansion**
- **Reopen/reclose churn**
- **Late-stage verification churn**
- **Abandoned mid-flight**
- **Exploratory / research session**
Record:
- `rework_shape`
- `rework_shape_confidence`
- evidence
---
## Step 6: Root Cause Assignment
For every non-clean session, assign:
### Primary root cause
One of:
- `SPEC_AMBIGUITY`
- `HUMAN_SCOPE_CHANGE`
- `REPO_FRAGILITY`
- `VERIFICATION_CHURN`
- `LEGITIMATE_TASK_COMPLEXITY`
### Secondary root cause
Optional if materially relevant
### Root-cause guidance
- **SPEC_AMBIGUITY**: opening ask lacked boundaries, targets, criteria, or constraints
- **HUMAN_SCOPE_CHANGE**: scope expanded because the user broadened the task
- **REPO_FRAGILITY**: hidden coupling, brittle files, unclear architecture, or environment issues forced extra work
- **AGENT_ARCHITECTURAL_ERROR**: wrong files, wrong assumptions, wrong approach, hallucinated structure
- **VERIFICATION_CHURN**: implementation mostly worked, but testing/validation caused loops
- **LEGITIMATE_TASK_COMPLEXITY**: revisions were expected for the difficulty and not clearly avoidable
Every root-cause assignment must include:
- evidence
- why stronger alternative causes were rejected
- confidence
---
## Step 6.5: Session Severity Scoring (0–100)
Assign each session a severity score to prioritize attention.
Components (sum, clamp 0–100):
- **Completion failure**: 0–25 (`abandoned = 25`)
- **Replanning intensity**: 0–15
- **Scope instability**: 0–15
- **Rework shape severity**: 0–15
- **Prompt sufficiency deficit**: 0–10 (`low = 10`)
- **Root cause impact**: 0–10 (`REPO_FRAGILITY` / `AGENT_ARCHITECTURAL_ERROR` highest)
- **Hotspot recurrence**: 0–10
Bands:
- **0–19 Low**
- **20–39 Moderate**
- **40–59 Significant**
- **60–79 High**
- **80–100 Critical**
Record:
- `session_severity_score`
- `severity_band`
- `severity_drivers` = top 2–4 contributors
- `severity_confidence`
Use severity as a prioritization signal, not a verdict. Always explain the drivers.
Contextualize severity using session intent so research/exploration sessions are not over-penalized.
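The component caps and bands above combine mechanically. This sketch follows the caps and band boundaries listed in this step; the `severity` function and component key names are illustrative:

```python
# Per-component caps from the scoring rubric above.
CAPS = {"completion_failure": 25, "replanning_intensity": 15,
        "scope_instability": 15, "rework_shape_severity": 15,
        "prompt_sufficiency_deficit": 10, "root_cause_impact": 10,
        "hotspot_recurrence": 10}

# Band floors from the rubric: 0-19 Low ... 80-100 Critical.
BANDS = [(80, "Critical"), (60, "High"), (40, "Significant"),
         (20, "Moderate"), (0, "Low")]

def severity(components: dict[str, int]) -> tuple[int, str, list[str]]:
    """Return (session_severity_score, severity_band, severity_drivers)."""
    clamped = {k: max(0, min(v, CAPS[k])) for k, v in components.items()}
    score = max(0, min(sum(clamped.values()), 100))
    band = next(label for floor, label in BANDS if score >= floor)
    drivers = sorted(clamped, key=clamped.get, reverse=True)[:4]  # top contributors
    return score, band, drivers
```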
---
## Step 7: Subsystem / File Clustering
Across all conversations, cluster repeated struggle by file, folder, or subsystem.
For each cluster, calculate:
- number of conversations touching it
- completion rate
- abandonment rate
- common root causes
- average severity
Output the top recurring friction zones.
Goal: identify whether friction is mostly prompt-driven, agent-driven, or concentrated in specific repo areas.
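The per-cluster statistics can be sketched as an aggregation over per-session records. The record field names (`clusters`, `completed`, `abandoned`, `root_cause`, `severity`) are assumed for illustration:

```python
from collections import defaultdict

def cluster_friction(sessions: list[dict]) -> dict[str, dict]:
    """Aggregate completion, abandonment, root-cause, and severity stats per cluster."""
    raw = defaultdict(lambda: {"sessions": 0, "completed": 0, "abandoned": 0,
                               "causes": [], "severities": []})
    for s in sessions:
        for name in s["clusters"]:
            c = raw[name]
            c["sessions"] += 1
            c["completed"] += s["completed"]
            c["abandoned"] += s["abandoned"]
            c["causes"].append(s["root_cause"])
            c["severities"].append(s["severity"])
    return {name: {
        "sessions": c["sessions"],
        "completion_rate": c["completed"] / c["sessions"],
        "abandonment_rate": c["abandoned"] / c["sessions"],
        "common_root_cause": max(set(c["causes"]), key=c["causes"].count),
        "avg_severity": sum(c["severities"]) / c["sessions"],
    } for name, c in raw.items()}
```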
---
## Step 8: Comparative Cohorts
Compare:
- first-shot successes vs re-planned sessions
- completed vs abandoned
- high prompt sufficiency vs low prompt sufficiency
For each comparison, identify:
- which prompt traits correlate with smoother execution
- which repo traits correlate with repeated struggle
Do not just restate averages; extract cautious evidence-backed patterns.
---
## Step 9: Non-Obvious Findings
Generate 3–7 findings that are not simple metric restatements.
Each finding must include:
- observation
- why it matters
- evidence
- confidence
Examples of strong findings:
- replans cluster around weak file targeting rather than weak acceptance criteria
- scope growth often begins after initial success, suggesting post-success human expansion
- auth-related struggle is driven more by repo fragility than agent hallucination
---
## Step 10: Report Generation
Create `session_analysis_report.md` with this structure:
# 📊 Session Analysis Report — [Project Name]
**Generated**: [timestamp]
**Conversations Analyzed**: [N]
**Date Range**: [earliest] → [latest]
---
## Executive Summary
| Metric | Value | Rating |
|:---|:---|:---|
| Avg Scope Growth | X% | 🟢/🟡/🔴 |
| Replan Rate | X% | 🟢/🟡/🔴 |
| Median Duration | Xm | — |
| Avg Session Severity | X | 🟢/🟡/🔴 |
| High-Severity Sessions | X / N | 🟢/🟡/🔴 |
Thresholds:
- First-shot: 🟢 >70 / 🟡 40–70 / 🔴 <40
- Scope growth: 🟢 <15 / 🟡 15–40 / 🔴 >40
- Replan rate: 🟢 <20 / 🟡 20–50 / 🔴 >50
---
Avg severity guidance:
- 🟢 <25
- 🟡 25–50
- 🔴 >50
Note: avg severity is an aggregate health signal, not the same as per-session severity bands.
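The threshold tables above can be applied with a small helper. Metric names and the function are illustrative; `first_shot` is the only higher-is-better metric here:

```python
def rate(metric: str, value: float) -> str:
    """Map a metric value to 🟢/🟡/🔴 using the thresholds above."""
    higher_is_better = {"first_shot": (70, 40)}    # 🟢 above 70, 🔴 below 40
    lower_is_better = {"scope_growth": (15, 40),   # 🟢 below 15, 🔴 above 40
                       "replan_rate": (20, 50),
                       "avg_severity": (25, 50)}
    if metric in higher_is_better:
        good, bad = higher_is_better[metric]
        return "🟢" if value > good else "🔴" if value < bad else "🟡"
    good, bad = lower_is_better[metric]
    return "🟢" if value < good else "🔴" if value > bad else "🟡"
```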
Then add a short narrative summary of what is going well, what is breaking down, and whether the main issue is prompt quality, repo fragility, workflow discipline, or validation churn.
## Root Cause Breakdown
| Root Cause | Count | % | Notes |
|:---|:---|:---|:---|
| Spec Ambiguity | X | X% | ... |
| Human Scope Change | X | X% | ... |
| Repo Fragility | X | X% | ... |
| Agent Architectural Error | X | X% | ... |
| Verification Churn | X | X% | ... |
| Legitimate Task Complexity | X | X% | ... |
---
## Prompt Sufficiency Analysis
- common traits of high-sufficiency prompts
- common missing inputs in low-sufficiency prompts
- which missing prompt ingredients correlate most with replanning or abandonment
---
## Scope Change Analysis
Separate:
- Human-added scope
- Necessary discovered scope
- Agent-introduced scope
Show top offenders in each category.
---
## Rework Shape Analysis
Summarize the main failure patterns across sessions.
---
## Friction Hotspots
Show the files/folders/subsystems most associated with replanning, abandonment, verification churn, and high severity.
---
## First-Shot Successes
List the cleanest sessions and extract what made them work.
---
## Non-Obvious Findings
List 3–7 high-value findings with evidence and confidence.
---
## Severity Triage
List the highest-severity sessions and say whether the best intervention is:
- prompt improvement
- scope discipline
- targeted skill/workflow
- repo refactor / architecture cleanup
- validation/test harness improvement
## Recommendations
For each recommendation, use:
- **Observed pattern**
- **Likely cause**
- **Evidence**
- **Expected benefit**
- **Confidence**
Recommendations must be specific, not generic.
---
## Per-Conversation Breakdown
| # | Title | Intent | Duration | Scope Δ | Plan Revs | Task Revs | Root Cause | Rework Shape | Severity | Complete? |
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
Add short notes only where meaningful.
---
## Step 11: Optional Post-Analysis Improvements
If appropriate, also:
- update any local project-health or memory artifact (if present) with recurring failure modes and fragile subsystems
- generate `prompt_improvement_tips.md` from high-sufficiency / first-shot-success sessions
- suggest missing skills or workflows when the same subsystem or task sequence repeatedly causes struggle
Only recommend workflows/skills when the pattern appears repeatedly.
---
## Final Output Standard
The workflow must produce:
1. metrics summary
2. root-cause diagnosis
3. prompt-sufficiency assessment
4. subsystem/friction map
5. severity triage and prioritization
6. evidence-backed recommendations
7. non-obvious findings
If evidence is weak, say so.
Do not overclaim.
Prefer explicit uncertainty over fake precision.
**How to invoke this skill**
Just say any of these in a new conversation:
- “Run analyze-project on the workspace”
- “Do a full session analysis report”
- “Root cause my recent brain/ sessions”
- “Update project health state”
The agent will automatically discover and use the skill.