Add analyze-project — Root Cause Analyst skill (#297)

* Add analyze-project — Root Cause Analyst skill

New forensic analysis skill for Antigravity.

Reads brain/ artifacts (task.md, plans, walkthroughs + .resolved.N versions), classifies scope changes, rework shapes, root causes (SPEC_AMBIGUITY, REPO_FRAGILITY, etc.), clusters friction hotspots, and auto-updates project-health-state + prompt_improvement_tips.md.

Full workflow and SKILL.md included.

* Update SKILL.md

Shorten frontmatter description to pass validation (<200 chars)

* Update SKILL.md

Fix YAML syntax + shorten description to pass validation (<200 chars); improve portability

* Create sample_session_analysis_report.md

Add examples/ with trimmed sample output report to demonstrate skill results

* Update SKILL.md

Fix non-portable hardcoded path in Step 11a (project-health-state) per bot feedback
This commit is contained in:
Gizzant
2026-03-14 12:48:23 -04:00
committed by GitHub
parent b9ce8c9011
commit 80abf0f4d6
2 changed files with 594 additions and 0 deletions


@@ -0,0 +1,516 @@
---
name: analyze-project
description: Forensic root cause analyzer for Antigravity sessions. Classifies scope deltas, rework patterns, root causes, hotspots, and auto-improves prompts/health.
version: "1.0"
tags: [analysis, diagnostics, meta, root-cause, project-health, session-review]
---
# /analyze-project — Root Cause Analyst Workflow
Analyze AI-assisted coding sessions in `brain/` and produce a diagnostic report that explains not just **what happened**, but **why it happened**, **who/what caused it**, and **what should change next time**.
This is not a simple metrics dashboard; it is a forensic analysis workflow for AI coding sessions.
---
## Primary Objective
For each session, determine:
1. What changed from the initial ask to the final executed work
2. Whether the change was caused primarily by:
- the user/spec
- the agent
- the codebase/repo
- testing/verification
- legitimate task complexity
3. Whether the original prompt was sufficient for the actual job
4. Which subsystems or files repeatedly correlate with struggle
5. What concrete changes would most improve future sessions
---
## Core Principles
- Treat `.resolved.N` counts as **signals of iteration intensity**, not proof of failure
- Do not label struggle based on counts alone; classify the **shape** of rework
- Separate **human-added scope** from **necessary discovered scope**
- Separate **agent error** from **repo friction**
- Every diagnosis must include **evidence**
- Every recommendation must map to a specific observed pattern
- Use confidence levels:
- **High** = directly supported by artifact contents or timestamps
- **Medium** = supported by multiple indirect signals
- **Low** = plausible inference, not directly proven
---
## Step 1: Discovery — Find Relevant Conversations
1. Read the conversation summaries available in the system context.
2. List all subdirectories in:
`~/.gemini/antigravity/brain/`
3. Build a **Conversation Index** by cross-referencing summaries with UUID folders.
4. Record for each conversation:
- `conversation_id`
- `title`
- `objective`
- `created`
- `last_modified`
5. If the user supplied a keyword/path, filter on that. Otherwise analyze all workspace conversations.
> Output: indexed list of conversations to analyze.
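As a sketch of Step 1, the Conversation Index could be built by scanning the brain directory and reading any metadata files. The folder layout and the metadata field names (`title`, `objective`, `created`, `last_modified`) are assumptions for illustration, not a documented schema:

```python
import json
from pathlib import Path

def build_conversation_index(brain_dir):
    """Scan UUID-named conversation folders and collect index fields.

    Assumes each folder may contain *.metadata.json files; all field
    names read from them are hypothetical.
    """
    index = []
    for conv_dir in sorted(Path(brain_dir).iterdir()):
        if not conv_dir.is_dir():
            continue
        entry = {
            "conversation_id": conv_dir.name,
            "title": None,
            "objective": None,
            "created": None,
            "last_modified": None,
        }
        for meta_path in conv_dir.glob("*.metadata.json"):
            meta = json.loads(meta_path.read_text())
            # Later metadata files fill in fields earlier ones left empty.
            for key in ("title", "objective", "created", "last_modified"):
                entry[key] = entry[key] or meta.get(key)
        index.append(entry)
    return index
```

Keyword filtering (Step 1.5) would then be a simple comprehension over `index` matching against `title` and `objective`.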
---
## Step 2: Artifact Extraction — Build Session Evidence
For each conversation, read all structured artifacts that exist.
### 2a. Core Artifacts
- `task.md`
- `implementation_plan.md`
- `walkthrough.md`
### 2b. Metadata
- `*.metadata.json`
### 2c. Version Snapshots
- `task.md.resolved.0 ... N`
- `implementation_plan.md.resolved.0 ... N`
- `walkthrough.md.resolved.0 ... N`
### 2d. Additional Signals
- other `.md` artifacts
- report/evaluation files
- timestamps across artifact updates
- file/folder names mentioned in plans and walkthroughs
- repeated subsystem references
- explicit testing/validation language
- explicit non-goals or constraints, if present
### 2e. Record Per Conversation
#### Presence / Lifecycle
- `has_task`
- `has_plan`
- `has_walkthrough`
- `is_completed`
- `is_abandoned_candidate` = has task but no walkthrough
#### Revision / Change Volume
- `task_versions`
- `plan_versions`
- `walkthrough_versions`
- `extra_artifacts`
#### Scope
- `task_items_initial`
- `task_items_final`
- `task_completed_pct`
- `scope_delta_raw`
- `scope_creep_pct_raw`
#### Timing
- `created_at`
- `completed_at`
- `duration_minutes`
#### Content / Quality Signals
- `objective_text`
- `initial_plan_summary`
- `final_plan_summary`
- `initial_task_excerpt`
- `final_task_excerpt`
- `walkthrough_summary`
- `mentioned_files_or_subsystems`
- `validation_requirements_present`
- `acceptance_criteria_present`
- `non_goals_present`
- `scope_boundaries_present`
- `file_targets_present`
- `constraints_present`
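The revision and scope fields above can be derived mechanically. A minimal sketch, assuming `.resolved.N` snapshot naming as described in 2c and GitHub-style `- [ ]` / `- [x]` checklists in `task.md` (the checklist convention is an assumption about artifact contents):

```python
import re
from pathlib import Path

RESOLVED_RE = re.compile(r"\.resolved\.(\d+)$")
CHECKBOX_RE = re.compile(r"^\s*[-*] \[( |x|X)\]", re.MULTILINE)

def version_count(conv_dir, artifact):
    """Snapshots (artifact.resolved.0..N) plus the live file, if present."""
    conv_dir = Path(conv_dir)
    snapshots = [p for p in conv_dir.iterdir()
                 if p.name.startswith(artifact) and RESOLVED_RE.search(p.name)]
    live = 1 if (conv_dir / artifact).exists() else 0
    return len(snapshots) + live

def scope_metrics(initial_md, final_md):
    """Raw scope fields from checklist counts in two task.md versions."""
    initial = len(CHECKBOX_RE.findall(initial_md))
    final_boxes = CHECKBOX_RE.findall(final_md)
    final = len(final_boxes)
    done = sum(1 for m in final_boxes if m.lower() == "x")
    return {
        "task_items_initial": initial,
        "task_items_final": final,
        "task_completed_pct": round(100 * done / final, 1) if final else 0.0,
        "scope_delta_raw": final - initial,
        "scope_creep_pct_raw": round(100 * (final - initial) / initial, 1) if initial else 0.0,
    }
```

Note these are the *raw* numbers only; Steps 4-6 decide what they mean.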
---
## Step 3: Prompt Sufficiency Analysis
For each conversation, score the opening objective/request on a 0–2 scale for each dimension:
- **Clarity** — is the ask understandable?
- **Boundedness** — are scope limits defined?
- **Testability** — are success conditions or acceptance criteria defined?
- **Architectural specificity** — are files/modules/systems identified?
- **Constraint awareness** — are non-goals, constraints, or environment details included?
- **Dependency awareness** — does the prompt acknowledge affected systems or hidden coupling?
Create:
- `prompt_sufficiency_score`
- `prompt_sufficiency_band` = High / Medium / Low
Then note which missing ingredients likely contributed to later friction.
Important:
Do not assume a low-detail prompt is bad by default.
Short prompts can still be good if the task is narrow and the repo context is obvious.
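One way to aggregate the six 0–2 dimension scores into the score and band fields. The band cutoffs here are illustrative assumptions; the skill text does not prescribe them:

```python
def prompt_sufficiency(scores):
    """Sum six 0-2 dimension scores (0..12) and band the total.

    Dimension keys and band thresholds are this sketch's choices.
    """
    dims = ["clarity", "boundedness", "testability",
            "architectural_specificity", "constraint_awareness",
            "dependency_awareness"]
    total = sum(scores.get(d, 0) for d in dims)
    if total >= 9:
        band = "High"
    elif total >= 5:
        band = "Medium"
    else:
        band = "Low"
    return {"prompt_sufficiency_score": total,
            "prompt_sufficiency_band": band}
```

Per the caveat above, a "Low" band should trigger a closer look, not an automatic verdict: a terse prompt for a narrow task can still execute cleanly.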
---
## Step 4: Scope Change Classification
Do not treat all scope growth as the same.
For each conversation, classify scope delta into:
### 4a. Human-Added Scope
New items clearly introduced beyond the initial ask.
Examples:
- optional enhancements
- follow-on refactors
- “while we are here” additions
- cosmetic or adjacent work added later
### 4b. Necessary Discovered Scope
Work that was not in the opening ask but appears required to complete it correctly.
Examples:
- dependency fixes
- required validation work
- hidden integration tasks
- migration fallout
- coupled module updates
### 4c. Agent-Introduced Scope
Work that appears not requested and not necessary, likely introduced by agent overreach.
For each conversation record:
- `scope_change_type_primary`
- `scope_change_type_secondary` (optional)
- `scope_change_confidence`
- evidence for classification
---
## Step 5: Rework Shape Analysis
Do not just count revisions. Determine the **shape** of session rework.
Classify each conversation into one of these patterns:
- **Clean execution** — little change, smooth completion
- **Early replan then stable finish** — plan changed early, then execution converged
- **Progressive scope expansion** — work kept growing throughout the session
- **Reopen/reclose churn** — repeated task adjustments/backtracking
- **Late-stage verification churn** — implementation mostly done, but testing/validation caused loops
- **Abandoned mid-flight** — work started but did not reach walkthrough
- **Exploratory / research session** — iterations are high but expected due to problem discovery
Record:
- `rework_shape`
- `rework_shape_confidence`
- supporting evidence
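A coarse first-pass classifier for rework shape might look like the sketch below. All thresholds are invented for illustration; this only proposes a label, and reading the actual version diffs (per the evidence requirement above) should confirm or override it:

```python
def rework_shape(task_versions, plan_versions, has_walkthrough,
                 is_completed, late_replan=False, exploratory=False):
    """Heuristic first-guess at the rework shape from coarse signals.

    late_replan / exploratory are judgment flags set by reading artifacts;
    the numeric cutoffs are assumptions, not part of the skill spec.
    """
    if not has_walkthrough and not is_completed:
        return "Abandoned mid-flight"
    if exploratory:
        return "Exploratory / research session"
    total_revs = task_versions + plan_versions
    if total_revs <= 2:
        return "Clean execution"
    if late_replan:
        return "Late-stage verification churn"
    if plan_versions >= 3 and task_versions <= 2:
        return "Early replan then stable finish"
    if task_versions >= 4:
        return "Progressive scope expansion"
    return "Reopen/reclose churn"
```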
---
## Step 6: Root Cause Analysis
For every non-clean session, assign:
### 6a. Primary Root Cause
Choose one:
- `SPEC_AMBIGUITY`
- `HUMAN_SCOPE_CHANGE`
- `REPO_FRAGILITY`
- `AGENT_ARCHITECTURAL_ERROR`
- `VERIFICATION_CHURN`
- `LEGITIMATE_TASK_COMPLEXITY`
### 6b. Secondary Root Cause
Optional if a second factor materially contributed.
### 6c. Evidence Requirements
Every root cause assignment must include:
- evidence from artifacts or metadata
- why competing causes were rejected
- confidence level
### 6d. Root Cause Heuristics
#### SPEC_AMBIGUITY
Use when the opening ask lacked boundaries, targets, criteria, or constraints, and the plan had to invent them.
#### HUMAN_SCOPE_CHANGE
Use when the task set expanded due to new asks, broadened goals, or post-hoc additions.
#### REPO_FRAGILITY
Use when hidden coupling, unclear architecture, brittle files, or environmental issues forced extra work.
#### AGENT_ARCHITECTURAL_ERROR
Use when the agent chose the wrong approach, wrong files, wrong assumptions, or hallucinated structure.
#### VERIFICATION_CHURN
Use when implementation mostly succeeded but tests, validation, QA, or fixes created repeated loops.
#### LEGITIMATE_TASK_COMPLEXITY
Use when revisions were reasonable given the difficulty and do not strongly indicate avoidable failure.
---
## Step 7: Subsystem / File Clustering
Across all conversations, cluster repeated struggle by subsystem, folder, or file mentions.
Examples:
- `frontend/auth/*`
- `db.py`
- `ui.py`
- `video_pipeline/*`
For each cluster, calculate:
- number of conversations touching it
- average revisions
- completion rate
- abandonment rate
- common root causes
Output the top recurring friction zones.
Goal:
Identify whether struggle is prompt-driven, agent-driven, or concentrated in specific repo areas.
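The per-cluster statistics can be accumulated in one pass over the Step 2 records. The session dict keys (`mentioned_files_or_subsystems`, `revisions`, `is_completed`, `is_abandoned`, `root_cause`) mirror the fields defined earlier but are otherwise this sketch's assumptions:

```python
from collections import defaultdict

def cluster_friction(sessions):
    """Aggregate struggle signals by mentioned subsystem/file."""
    clusters = defaultdict(lambda: {"n": 0, "revs": 0, "done": 0,
                                    "abandoned": 0, "causes": []})
    for s in sessions:
        # set() so one session counts a subsystem at most once.
        for area in set(s["mentioned_files_or_subsystems"]):
            c = clusters[area]
            c["n"] += 1
            c["revs"] += s["revisions"]
            c["done"] += int(s["is_completed"])
            c["abandoned"] += int(s.get("is_abandoned", False))
            if s.get("root_cause"):
                c["causes"].append(s["root_cause"])
    return {
        area: {
            "conversations": c["n"],
            "avg_revisions": round(c["revs"] / c["n"], 1),
            "completion_rate": round(100 * c["done"] / c["n"], 1),
            "abandonment_rate": round(100 * c["abandoned"] / c["n"], 1),
            "common_root_causes": sorted(set(c["causes"])),
        }
        for area, c in clusters.items()
    }
```

Sorting the result by `conversations` then `avg_revisions` surfaces the top friction zones.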
---
## Step 8: Comparative Cohort Analysis
Compare these cohorts:
- first-shot successes vs re-planned sessions
- completed vs abandoned
- high prompt sufficiency vs low prompt sufficiency
- narrow-scope vs high-scope-growth
- short sessions vs long sessions
- low-friction subsystems vs high-friction subsystems
For each comparison, identify:
- what differs materially
- which prompt traits correlate with smoother execution
- which repo traits correlate with repeated struggle
Do not merely restate averages.
Extract causal-looking patterns cautiously and label them as inference where appropriate.
---
## Step 9: Non-Obvious Findings
Generate 3–7 findings that are not simple metric restatements.
Good examples:
- “Most replans happen in sessions with weak file targeting, not weak acceptance criteria.”
- “Scope growth usually begins after the first successful implementation, suggesting post-success human expansion.”
- “Auth-related sessions cluster around repo fragility rather than agent hallucination.”
- “Abandoned work is strongly associated with missing validation criteria.”
Bad examples:
- “Some sessions had many revisions.”
- “Some sessions were longer than others.”
Each finding must include:
- observation
- why it matters
- evidence
- confidence
---
## Step 10: Report Generation
Create `session_analysis_report.md` in the current conversation's brain folder.
Use this structure:
# 📊 Session Analysis Report — [Project Name]
**Generated**: [timestamp]
**Conversations Analyzed**: [N]
**Date Range**: [earliest] → [latest]
---
## Executive Summary
| Metric | Value | Rating |
|:---|:---|:---|
| First-Shot Success Rate | X% | 🟢/🟡/🔴 |
| Completion Rate | X% | 🟢/🟡/🔴 |
| Avg Scope Growth | X% | 🟢/🟡/🔴 |
| Replan Rate | X% | 🟢/🟡/🔴 |
| Median Duration | Xm | — |
| Avg Revision Intensity | X | 🟢/🟡/🔴 |
Then include a short narrative summary:
- what is going well
- what is breaking down
- whether the main issue is prompt quality, repo fragility, or workflow discipline
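One possible mapping from metric values to the 🟢/🟡/🔴 ratings in the summary table. The thresholds below are illustrative assumptions chosen to match the sample report, not part of the skill spec:

```python
def rating(metric, value):
    """Band a summary metric (percentage) into a traffic-light rating.

    All thresholds are illustrative; tune them per project.
    """
    # (green_cutoff, yellow_cutoff) where higher values are better...
    higher_better = {"first_shot_success": (70, 50),
                     "completion_rate": (70, 50)}
    # ...and where lower values are better.
    lower_better = {"scope_growth": (25, 60),
                    "replan_rate": (30, 50)}
    if metric in higher_better:
        green, yellow = higher_better[metric]
        return "🟢" if value >= green else "🟡" if value >= yellow else "🔴"
    green, yellow = lower_better[metric]
    return "🟢" if value <= green else "🟡" if value <= yellow else "🔴"
```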
---
## Root Cause Breakdown
| Root Cause | Count | % | Notes |
|:---|:---|:---|:---|
| Spec Ambiguity | X | X% | ... |
| Human Scope Change | X | X% | ... |
| Repo Fragility | X | X% | ... |
| Agent Architectural Error | X | X% | ... |
| Verification Churn | X | X% | ... |
| Legitimate Task Complexity | X | X% | ... |
---
## Prompt Sufficiency Analysis
- common traits of high-sufficiency prompts
- common missing inputs in low-sufficiency prompts
- which missing prompt ingredients correlate most with replanning or abandonment
---
## Scope Change Analysis
Separate:
- Human-added scope
- Necessary discovered scope
- Agent-introduced scope
Show top offenders in each category.
---
## Rework Shape Analysis
Summarize how sessions tend to fail:
- early replan then recover
- progressive scope expansion
- late verification churn
- abandonments
- reopen/reclose cycles
---
## Friction Hotspots
Cluster repeated struggle by subsystem/file/domain.
Show which areas correlate with:
- replanning
- abandonment
- verification churn
- agent architectural mistakes
---
## First-Shot Successes
List the cleanest sessions and extract what made them work:
- scope boundaries
- acceptance criteria
- file targeting
- validation clarity
- narrowness of change surface
---
## Non-Obvious Findings
List 3–7 high-value findings with evidence and confidence.
---
## Recommendations
Each recommendation must use this format:
### Recommendation [N]
- **Observed pattern**
- **Likely cause**
- **Evidence**
- **Change to make**
- **Expected benefit**
- **Confidence**
Recommendations must be specific, not generic.
---
## Per-Conversation Breakdown
| # | Title | Duration | Scope Δ | Plan Revs | Task Revs | Root Cause | Rework Shape | Complete? |
|:---|:---|:---|:---|:---|:---|:---|:---|:---|
Add short notes only where meaningful.
---
## Step 11: Auto-Optimize — Improve Future Sessions
### 11a. Update Project Health State
Example path (update to your actual location): `~/.gemini/antigravity/.agent/skills/project-health-state/SKILL.md`
Update:
- session analysis metrics
- recurring fragile files/subsystems
- recurring failure modes
- last updated timestamp
### 11b. Generate Prompt Improvement Guidance
Create `prompt_improvement_tips.md`
Do not give generic advice.
Instead extract:
- traits of high-sufficiency prompts
- examples of effective scope boundaries
- examples of good acceptance criteria
- examples of useful file targeting
- common missing details that led to replans
### 11c. Suggest Missing Skills / Workflows
If multiple struggle sessions cluster around the same subsystem or repeated sequence, recommend:
- a targeted skill
- a repeatable workflow
- a reusable prompt template
- a repo note / architecture map
Only recommend workflows when the pattern appears repeatedly.
---
## Final Output Standard
The workflow must produce:
1. A metrics summary
2. A root-cause diagnosis
3. A subsystem/friction map
4. A prompt-sufficiency assessment
5. Evidence-backed recommendations
6. Non-obvious findings
If evidence is weak, say so.
Do not overclaim.
Prefer explicit uncertainty over fake precision.
**How to invoke this skill**
Just say any of these in a new conversation:
- “Run analyze-project on the workspace”
- “Do a full session analysis report”
- “Root cause my recent brain/ sessions”
- “Update project health state”
The agent will automatically discover and use the skill.


@@ -0,0 +1,78 @@
> **Sample output** (`session_analysis_report.md`): generated by the /analyze-project skill on a ~3-week project with ~50 substantive sessions. Trimmed for demo; real reports include the full per-conversation breakdown and more cohorts.
# 📊 Session Analysis Report — Sample AI Video Studio
**Generated**: 2026-03-13
**Conversations Analyzed**: 54 substantive (with artifacts)
**Date Range**: Feb 18 – Mar 13, 2026
## Executive Summary
| Metric | Value | Rating |
|-------------------------|-------------|--------|
| First-Shot Success Rate | 52% | 🟡 |
| Completion Rate | 70% | 🟢 |
| Avg Scope Growth | +58% | 🟡 |
| Replan Rate | 30% | 🟢 |
| Median Duration | ~35 min | 🟢 |
| Avg Revision Intensity | 4.8 versions| 🟡 |
| Abandoned Rate | 22% | 🟡 |
**Narrative**: High velocity with strong completion on workflow-driven tasks. Main friction is **post-success human scope expansion** — users add "while we're here" features after initial work succeeds, turning narrow tasks into multi-phase epics. Not primarily prompt or agent issues — more workflow discipline.
## Root Cause Breakdown (non-clean sessions only)
| Root Cause | % | Notes |
|-----------------------------|-----|--------------------------------------------|
| Human Scope Change | 37% | New features/epics added mid-session after success |
| Legitimate Task Complexity | 26% | Multi-phase builds with expected iteration |
| Repo Fragility | 15% | Hidden coupling, pre-existing bugs |
| Verification Churn | 11% | Late test/build failures |
| Spec Ambiguity | 7% | Vague initial ask |
| Agent Architectural Error | 4% | Rare wrong approach |
Confidence: **High** for top two (direct evidence from version diffs).
## Scope Change Analysis Highlights
**Human-Added** (most common): Starts narrow → grows after Phase 1 succeeds (e.g., T2E QA → A/B testing + demos + editor tools).
**Necessary Discovered**: Hidden deps, missing packages, env issues (e.g., auth bcrypt blocking E2E).
**Agent-Introduced**: Very rare (1 case of over-creating components).
## Rework Shape Summary
- Clean execution: 52%
- Progressive expansion: 18% (dominant failure mode)
- Early replan → stable: 11%
- Late verification churn: 7%
- Exploratory/research: 7%
- Abandoned mid-flight: 4%
**Pattern**: Progressive expansion often follows successful implementation — user adds adjacent work in same session.
## Friction Hotspots (top areas)
| Subsystem | Sessions | Avg Revisions | Main Cause |
|------------------------|----------|---------------|---------------------|
| production.py + domain | 8 | 6.2 | Hidden coupling |
| fal.py (model adapter) | 7 | 5.0 | Legitimate complexity |
| billing.py + tests | 6 | 5.5 | Verification churn |
| frontend/ build | 5 | 7.0 | Missing deps/types |
| Auth/bcrypt | 3 | 4.7 | Blocks E2E testing |
## Non-Obvious Findings (top 3)
1. **Post-Success Expansion Dominates** — Most scope growth happens *after* initial completion succeeds, not from bad planning. (High confidence)
2. **File Targeting > Acceptance Criteria** — Missing specific files correlates more with replanning (44% vs 12%) than missing criteria. Anchors agent research early. (High)
3. **Frontend Build is Silent Killer** — Late TypeScript/import failures add 2–4 cycles repeatedly. No pre-flight check exists. (High)
## Recommendations (top 4)
1. **Split Sessions After Phases** — Start new conversation after successful completion to avoid context bloat and scope creep. Expected: +13% first-shot success. (High)
2. **Enforce File Targeting** — Add pre-check in prompt optimizer to flag missing file/module refs. Expected: halve replan rate. (High)
3. **Add Frontend Preflight** — Run `npm run build` early in frontend-touching sessions. Eliminates common late blockers. (High)
4. **Fix Auth Test Fixture** — Seed test users with plain passwords or bypass bcrypt for local E2E. Unblocks browser testing. (High)
This sample shows the forensic style: evidence-backed, confidence-rated, focused on actionable patterns rather than raw counts.