Add analyze-project — Root Cause Analyst skill (#297)

* Add analyze-project — Root Cause Analyst skill

New forensic analysis skill for Antigravity.

Reads brain/ artifacts (task.md, plans, walkthroughs + .resolved.N versions), classifies scope changes, rework shapes, root causes (SPEC_AMBIGUITY, REPO_FRAGILITY, etc.), clusters friction hotspots, and auto-updates project-health-state + prompt_improvement_tips.md.

Full workflow and SKILL.md included.

* Update SKILL.md

Shorten frontmatter description to pass validation (<200 chars)

* Update SKILL.md

Fix YAML syntax + shorten description to pass validation (<200 chars); improve portability

* Create sample_session_analysis_report.md

Add examples/ with trimmed sample output report to demonstrate skill results

* Update SKILL.md

Fix non-portable hardcoded path in Step 11a (project-health-state) per bot feedback
This commit is contained in:
Gizzant
2026-03-14 12:48:23 -04:00
committed by GitHub
parent b9ce8c9011
commit 80abf0f4d6
2 changed files with 594 additions and 0 deletions


@@ -0,0 +1,516 @@
---
name: analyze-project
description: Forensic root cause analyzer for Antigravity sessions. Classifies scope deltas, rework patterns, root causes, hotspots, and auto-improves prompts/health.
version: "1.0"
tags: [analysis, diagnostics, meta, root-cause, project-health, session-review]
---
# /analyze-project — Root Cause Analyst Workflow
Analyze AI-assisted coding sessions in `brain/` and produce a diagnostic report that explains not just **what happened**, but **why it happened**, **who/what caused it**, and **what should change next time**.
This is not a simple metrics dashboard; it is a forensic analysis workflow for AI coding sessions.
---
## Primary Objective
For each session, determine:
1. What changed from the initial ask to the final executed work
2. Whether the change was caused primarily by:
- the user/spec
- the agent
- the codebase/repo
- testing/verification
- legitimate task complexity
3. Whether the original prompt was sufficient for the actual job
4. Which subsystems or files repeatedly correlate with struggle
5. What concrete changes would most improve future sessions
---
## Core Principles
- Treat `.resolved.N` counts as **signals of iteration intensity**, not proof of failure
- Do not label struggle based on counts alone; classify the **shape** of rework
- Separate **human-added scope** from **necessary discovered scope**
- Separate **agent error** from **repo friction**
- Every diagnosis must include **evidence**
- Every recommendation must map to a specific observed pattern
- Use confidence levels:
- **High** = directly supported by artifact contents or timestamps
- **Medium** = supported by multiple indirect signals
- **Low** = plausible inference, not directly proven
---
## Step 1: Discovery — Find Relevant Conversations
1. Read the conversation summaries available in the system context.
2. List all subdirectories in:
`~/.gemini/antigravity/brain/`
3. Build a **Conversation Index** by cross-referencing summaries with UUID folders.
4. Record for each conversation:
- `conversation_id`
- `title`
- `objective`
- `created`
- `last_modified`
5. If the user supplied a keyword/path, filter on that. Otherwise analyze all workspace conversations.
> Output: indexed list of conversations to analyze.
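As a sketch of Step 1, the Conversation Index could be built by scanning the brain directory and reading any metadata files. The folder layout and the metadata field names (`title`, `objective`, `created`, `last_modified`) are assumptions for illustration, not a documented schema:

```python
import json
from pathlib import Path

def build_conversation_index(brain_dir):
    """Scan UUID-named conversation folders and collect index fields.

    Assumes each folder may contain *.metadata.json files; all field
    names read from them are hypothetical.
    """
    index = []
    for conv_dir in sorted(Path(brain_dir).iterdir()):
        if not conv_dir.is_dir():
            continue
        entry = {
            "conversation_id": conv_dir.name,
            "title": None,
            "objective": None,
            "created": None,
            "last_modified": None,
        }
        for meta_path in conv_dir.glob("*.metadata.json"):
            meta = json.loads(meta_path.read_text())
            # Later metadata files fill in fields earlier ones left empty.
            for key in ("title", "objective", "created", "last_modified"):
                entry[key] = entry[key] or meta.get(key)
        index.append(entry)
    return index
```

Keyword filtering (Step 1.5) would then be a simple comprehension over `index` matching against `title` and `objective`.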
---
## Step 2: Artifact Extraction — Build Session Evidence
For each conversation, read all structured artifacts that exist.
### 2a. Core Artifacts
- `task.md`
- `implementation_plan.md`
- `walkthrough.md`
### 2b. Metadata
- `*.metadata.json`
### 2c. Version Snapshots
- `task.md.resolved.0 ... N`
- `implementation_plan.md.resolved.0 ... N`
- `walkthrough.md.resolved.0 ... N`
### 2d. Additional Signals
- other `.md` artifacts
- report/evaluation files
- timestamps across artifact updates
- file/folder names mentioned in plans and walkthroughs
- repeated subsystem references
- explicit testing/validation language
- explicit non-goals or constraints, if present
### 2e. Record Per Conversation
#### Presence / Lifecycle
- `has_task`
- `has_plan`
- `has_walkthrough`
- `is_completed`
- `is_abandoned_candidate` = has task but no walkthrough
#### Revision / Change Volume
- `task_versions`
- `plan_versions`
- `walkthrough_versions`
- `extra_artifacts`
#### Scope
- `task_items_initial`
- `task_items_final`
- `task_completed_pct`
- `scope_delta_raw`
- `scope_creep_pct_raw`
#### Timing
- `created_at`
- `completed_at`
- `duration_minutes`
#### Content / Quality Signals
- `objective_text`
- `initial_plan_summary`
- `final_plan_summary`
- `initial_task_excerpt`
- `final_task_excerpt`
- `walkthrough_summary`
- `mentioned_files_or_subsystems`
- `validation_requirements_present`
- `acceptance_criteria_present`
- `non_goals_present`
- `scope_boundaries_present`
- `file_targets_present`
- `constraints_present`
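The revision and scope fields above can be derived mechanically. A minimal sketch, assuming `.resolved.N` snapshot naming as described in 2c and GitHub-style `- [ ]` / `- [x]` checklists in `task.md` (the checklist convention is an assumption about artifact contents):

```python
import re
from pathlib import Path

RESOLVED_RE = re.compile(r"\.resolved\.(\d+)$")
CHECKBOX_RE = re.compile(r"^\s*[-*] \[( |x|X)\]", re.MULTILINE)

def version_count(conv_dir, artifact):
    """Snapshots (artifact.resolved.0..N) plus the live file, if present."""
    conv_dir = Path(conv_dir)
    snapshots = [p for p in conv_dir.iterdir()
                 if p.name.startswith(artifact) and RESOLVED_RE.search(p.name)]
    live = 1 if (conv_dir / artifact).exists() else 0
    return len(snapshots) + live

def scope_metrics(initial_md, final_md):
    """Raw scope fields from checklist counts in two task.md versions."""
    initial = len(CHECKBOX_RE.findall(initial_md))
    final_boxes = CHECKBOX_RE.findall(final_md)
    final = len(final_boxes)
    done = sum(1 for m in final_boxes if m.lower() == "x")
    return {
        "task_items_initial": initial,
        "task_items_final": final,
        "task_completed_pct": round(100 * done / final, 1) if final else 0.0,
        "scope_delta_raw": final - initial,
        "scope_creep_pct_raw": round(100 * (final - initial) / initial, 1) if initial else 0.0,
    }
```

Note these are the *raw* numbers only; Steps 4-6 decide what they mean.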
---
## Step 3: Prompt Sufficiency Analysis
For each conversation, score the opening objective/request on a 0–2 scale for each dimension:
- **Clarity** — is the ask understandable?
- **Boundedness** — are scope limits defined?
- **Testability** — are success conditions or acceptance criteria defined?
- **Architectural specificity** — are files/modules/systems identified?
- **Constraint awareness** — are non-goals, constraints, or environment details included?
- **Dependency awareness** — does the prompt acknowledge affected systems or hidden coupling?
Create:
- `prompt_sufficiency_score`
- `prompt_sufficiency_band` = High / Medium / Low
Then note which missing ingredients likely contributed to later friction.
Important:
Do not assume a low-detail prompt is bad by default.
Short prompts can still be good if the task is narrow and the repo context is obvious.
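One way to aggregate the six 0–2 dimension scores into the score and band fields. The band cutoffs here are illustrative assumptions; the skill text does not prescribe them:

```python
def prompt_sufficiency(scores):
    """Sum six 0-2 dimension scores (0..12) and band the total.

    Dimension keys and band thresholds are this sketch's choices.
    """
    dims = ["clarity", "boundedness", "testability",
            "architectural_specificity", "constraint_awareness",
            "dependency_awareness"]
    total = sum(scores.get(d, 0) for d in dims)
    if total >= 9:
        band = "High"
    elif total >= 5:
        band = "Medium"
    else:
        band = "Low"
    return {"prompt_sufficiency_score": total,
            "prompt_sufficiency_band": band}
```

Per the caveat above, a "Low" band should trigger a closer look, not an automatic verdict: a terse prompt for a narrow task can still execute cleanly.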
---
## Step 4: Scope Change Classification
Do not treat all scope growth as the same.
For each conversation, classify scope delta into:
### 4a. Human-Added Scope
New items clearly introduced beyond the initial ask.
Examples:
- optional enhancements
- follow-on refactors
- “while we are here” additions
- cosmetic or adjacent work added later
### 4b. Necessary Discovered Scope
Work that was not in the opening ask but appears required to complete it correctly.
Examples:
- dependency fixes
- required validation work
- hidden integration tasks
- migration fallout
- coupled module updates
### 4c. Agent-Introduced Scope
Work that appears not requested and not necessary, likely introduced by agent overreach.
For each conversation record:
- `scope_change_type_primary`
- `scope_change_type_secondary` (optional)
- `scope_change_confidence`
- evidence for classification
---
## Step 5: Rework Shape Analysis
Do not just count revisions. Determine the **shape** of session rework.
Classify each conversation into one of these patterns:
- **Clean execution** — little change, smooth completion
- **Early replan then stable finish** — plan changed early, then execution converged
- **Progressive scope expansion** — work kept growing throughout the session
- **Reopen/reclose churn** — repeated task adjustments/backtracking
- **Late-stage verification churn** — implementation mostly done, but testing/validation caused loops
- **Abandoned mid-flight** — work started but did not reach walkthrough
- **Exploratory / research session** — iterations are high but expected due to problem discovery
Record:
- `rework_shape`
- `rework_shape_confidence`
- supporting evidence
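A coarse first-pass classifier for rework shape might look like the sketch below. All thresholds are invented for illustration; this only proposes a label, and reading the actual version diffs (per the evidence requirement above) should confirm or override it:

```python
def rework_shape(task_versions, plan_versions, has_walkthrough,
                 is_completed, late_replan=False, exploratory=False):
    """Heuristic first-guess at the rework shape from coarse signals.

    late_replan / exploratory are judgment flags set by reading artifacts;
    the numeric cutoffs are assumptions, not part of the skill spec.
    """
    if not has_walkthrough and not is_completed:
        return "Abandoned mid-flight"
    if exploratory:
        return "Exploratory / research session"
    total_revs = task_versions + plan_versions
    if total_revs <= 2:
        return "Clean execution"
    if late_replan:
        return "Late-stage verification churn"
    if plan_versions >= 3 and task_versions <= 2:
        return "Early replan then stable finish"
    if task_versions >= 4:
        return "Progressive scope expansion"
    return "Reopen/reclose churn"
```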
---
## Step 6: Root Cause Analysis
For every non-clean session, assign:
### 6a. Primary Root Cause
Choose one:
- `SPEC_AMBIGUITY`
- `HUMAN_SCOPE_CHANGE`
- `REPO_FRAGILITY`
- `AGENT_ARCHITECTURAL_ERROR`
- `VERIFICATION_CHURN`
- `LEGITIMATE_TASK_COMPLEXITY`
### 6b. Secondary Root Cause
Optional if a second factor materially contributed.
### 6c. Evidence Requirements
Every root cause assignment must include:
- evidence from artifacts or metadata
- why competing causes were rejected
- confidence level
### 6d. Root Cause Heuristics
#### SPEC_AMBIGUITY
Use when the opening ask lacked boundaries, targets, criteria, or constraints, and the plan had to invent them.
#### HUMAN_SCOPE_CHANGE
Use when the task set expanded due to new asks, broadened goals, or post-hoc additions.
#### REPO_FRAGILITY
Use when hidden coupling, unclear architecture, brittle files, or environmental issues forced extra work.
#### AGENT_ARCHITECTURAL_ERROR
Use when the agent chose the wrong approach, wrong files, wrong assumptions, or hallucinated structure.
#### VERIFICATION_CHURN
Use when implementation mostly succeeded but tests, validation, QA, or fixes created repeated loops.
#### LEGITIMATE_TASK_COMPLEXITY
Use when revisions were reasonable given the difficulty and do not strongly indicate avoidable failure.
---
## Step 7: Subsystem / File Clustering
Across all conversations, cluster repeated struggle by subsystem, folder, or file mentions.
Examples:
- `frontend/auth/*`
- `db.py`
- `ui.py`
- `video_pipeline/*`
For each cluster, calculate:
- number of conversations touching it
- average revisions
- completion rate
- abandonment rate
- common root causes
Output the top recurring friction zones.
Goal:
Identify whether struggle is prompt-driven, agent-driven, or concentrated in specific repo areas.
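The per-cluster statistics can be accumulated in one pass over the Step 2 records. The session dict keys (`mentioned_files_or_subsystems`, `revisions`, `is_completed`, `is_abandoned`, `root_cause`) mirror the fields defined earlier but are otherwise this sketch's assumptions:

```python
from collections import defaultdict

def cluster_friction(sessions):
    """Aggregate struggle signals by mentioned subsystem/file."""
    clusters = defaultdict(lambda: {"n": 0, "revs": 0, "done": 0,
                                    "abandoned": 0, "causes": []})
    for s in sessions:
        # set() so one session counts a subsystem at most once.
        for area in set(s["mentioned_files_or_subsystems"]):
            c = clusters[area]
            c["n"] += 1
            c["revs"] += s["revisions"]
            c["done"] += int(s["is_completed"])
            c["abandoned"] += int(s.get("is_abandoned", False))
            if s.get("root_cause"):
                c["causes"].append(s["root_cause"])
    return {
        area: {
            "conversations": c["n"],
            "avg_revisions": round(c["revs"] / c["n"], 1),
            "completion_rate": round(100 * c["done"] / c["n"], 1),
            "abandonment_rate": round(100 * c["abandoned"] / c["n"], 1),
            "common_root_causes": sorted(set(c["causes"])),
        }
        for area, c in clusters.items()
    }
```

Sorting the result by `conversations` then `avg_revisions` surfaces the top friction zones.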
---
## Step 8: Comparative Cohort Analysis
Compare these cohorts:
- first-shot successes vs re-planned sessions
- completed vs abandoned
- high prompt sufficiency vs low prompt sufficiency
- narrow-scope vs high-scope-growth
- short sessions vs long sessions
- low-friction subsystems vs high-friction subsystems
For each comparison, identify:
- what differs materially
- which prompt traits correlate with smoother execution
- which repo traits correlate with repeated struggle
Do not merely restate averages.
Extract causal-looking patterns cautiously and label them as inference where appropriate.
---
## Step 9: Non-Obvious Findings
Generate 3–7 findings that are not simple metric restatements.
Good examples:
- “Most replans happen in sessions with weak file targeting, not weak acceptance criteria.”
- “Scope growth usually begins after the first successful implementation, suggesting post-success human expansion.”
- “Auth-related sessions cluster around repo fragility rather than agent hallucination.”
- “Abandoned work is strongly associated with missing validation criteria.”
Bad examples:
- “Some sessions had many revisions.”
- “Some sessions were longer than others.”
Each finding must include:
- observation
- why it matters
- evidence
- confidence
---
## Step 10: Report Generation
Create `session_analysis_report.md` in the current conversation's brain folder.
Use this structure:
# 📊 Session Analysis Report — [Project Name]
**Generated**: [timestamp]
**Conversations Analyzed**: [N]
**Date Range**: [earliest] → [latest]
---
## Executive Summary
| Metric | Value | Rating |
|:---|:---|:---|
| First-Shot Success Rate | X% | 🟢/🟡/🔴 |
| Completion Rate | X% | 🟢/🟡/🔴 |
| Avg Scope Growth | X% | 🟢/🟡/🔴 |
| Replan Rate | X% | 🟢/🟡/🔴 |
| Median Duration | Xm | — |
| Avg Revision Intensity | X | 🟢/🟡/🔴 |
Then include a short narrative summary:
- what is going well
- what is breaking down
- whether the main issue is prompt quality, repo fragility, or workflow discipline
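One possible mapping from metric values to the 🟢/🟡/🔴 ratings in the summary table. The thresholds below are illustrative assumptions chosen to match the sample report, not part of the skill spec:

```python
def rating(metric, value):
    """Band a summary metric (percentage) into a traffic-light rating.

    All thresholds are illustrative; tune them per project.
    """
    # (green_cutoff, yellow_cutoff) where higher values are better...
    higher_better = {"first_shot_success": (70, 50),
                     "completion_rate": (70, 50)}
    # ...and where lower values are better.
    lower_better = {"scope_growth": (25, 60),
                    "replan_rate": (30, 50)}
    if metric in higher_better:
        green, yellow = higher_better[metric]
        return "🟢" if value >= green else "🟡" if value >= yellow else "🔴"
    green, yellow = lower_better[metric]
    return "🟢" if value <= green else "🟡" if value <= yellow else "🔴"
```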
---
## Root Cause Breakdown
| Root Cause | Count | % | Notes |
|:---|:---|:---|:---|
| Spec Ambiguity | X | X% | ... |
| Human Scope Change | X | X% | ... |
| Repo Fragility | X | X% | ... |
| Agent Architectural Error | X | X% | ... |
| Verification Churn | X | X% | ... |
| Legitimate Task Complexity | X | X% | ... |
---
## Prompt Sufficiency Analysis
- common traits of high-sufficiency prompts
- common missing inputs in low-sufficiency prompts
- which missing prompt ingredients correlate most with replanning or abandonment
---
## Scope Change Analysis
Separate:
- Human-added scope
- Necessary discovered scope
- Agent-introduced scope
Show top offenders in each category.
---
## Rework Shape Analysis
Summarize how sessions tend to fail:
- early replan then recover
- progressive scope expansion
- late verification churn
- abandonments
- reopen/reclose cycles
---
## Friction Hotspots
Cluster repeated struggle by subsystem/file/domain.
Show which areas correlate with:
- replanning
- abandonment
- verification churn
- agent architectural mistakes
---
## First-Shot Successes
List the cleanest sessions and extract what made them work:
- scope boundaries
- acceptance criteria
- file targeting
- validation clarity
- narrowness of change surface
---
## Non-Obvious Findings
List 3–7 high-value findings with evidence and confidence.
---
## Recommendations
Each recommendation must use this format:
### Recommendation [N]
- **Observed pattern**
- **Likely cause**
- **Evidence**
- **Change to make**
- **Expected benefit**
- **Confidence**
Recommendations must be specific, not generic.
---
## Per-Conversation Breakdown
| # | Title | Duration | Scope Δ | Plan Revs | Task Revs | Root Cause | Rework Shape | Complete? |
|:---|:---|:---|:---|:---|:---|:---|:---|:---|
Add short notes only where meaningful.
---
## Step 11: Auto-Optimize — Improve Future Sessions
### 11a. Update Project Health State
Example path (update to your actual location): `~/.gemini/antigravity/.agent/skills/project-health-state/SKILL.md`
Update:
- session analysis metrics
- recurring fragile files/subsystems
- recurring failure modes
- last updated timestamp
### 11b. Generate Prompt Improvement Guidance
Create `prompt_improvement_tips.md`
Do not give generic advice.
Instead extract:
- traits of high-sufficiency prompts
- examples of effective scope boundaries
- examples of good acceptance criteria
- examples of useful file targeting
- common missing details that led to replans
### 11c. Suggest Missing Skills / Workflows
If multiple struggle sessions cluster around the same subsystem or repeated sequence, recommend:
- a targeted skill
- a repeatable workflow
- a reusable prompt template
- a repo note / architecture map
Only recommend workflows when the pattern appears repeatedly.
---
## Final Output Standard
The workflow must produce:
1. A metrics summary
2. A root-cause diagnosis
3. A subsystem/friction map
4. A prompt-sufficiency assessment
5. Evidence-backed recommendations
6. Non-obvious findings
If evidence is weak, say so.
Do not overclaim.
Prefer explicit uncertainty over fake precision.
**How to invoke this skill**
Just say any of these in a new conversation:
- “Run analyze-project on the workspace”
- “Do a full session analysis report”
- “Root cause my recent brain/ sessions”
- “Update project health state”
The agent will automatically discover and use the skill.


@@ -0,0 +1,78 @@
> **Sample output** (`session_analysis_report.md`): generated by the /analyze-project skill on a ~3-week project with ~50 substantive sessions. Trimmed for demo; real reports include the full per-conversation breakdown and more cohorts.
# 📊 Session Analysis Report — Sample AI Video Studio
**Generated**: 2026-03-13
**Conversations Analyzed**: 54 substantive (with artifacts)
**Date Range**: Feb 18 – Mar 13, 2026
## Executive Summary
| Metric | Value | Rating |
|-------------------------|-------------|--------|
| First-Shot Success Rate | 52% | 🟡 |
| Completion Rate | 70% | 🟢 |
| Avg Scope Growth | +58% | 🟡 |
| Replan Rate | 30% | 🟢 |
| Median Duration | ~35 min | 🟢 |
| Avg Revision Intensity | 4.8 versions| 🟡 |
| Abandoned Rate | 22% | 🟡 |
**Narrative**: High velocity with strong completion on workflow-driven tasks. Main friction is **post-success human scope expansion** — users add "while we're here" features after initial work succeeds, turning narrow tasks into multi-phase epics. Not primarily prompt or agent issues — more workflow discipline.
## Root Cause Breakdown (non-clean sessions only)
| Root Cause | % | Notes |
|-----------------------------|-----|--------------------------------------------|
| Human Scope Change | 37% | New features/epics added mid-session after success |
| Legitimate Task Complexity | 26% | Multi-phase builds with expected iteration |
| Repo Fragility | 15% | Hidden coupling, pre-existing bugs |
| Verification Churn | 11% | Late test/build failures |
| Spec Ambiguity | 7% | Vague initial ask |
| Agent Architectural Error | 4% | Rare wrong approach |
Confidence: **High** for top two (direct evidence from version diffs).
## Scope Change Analysis Highlights
**Human-Added** (most common): Starts narrow → grows after Phase 1 succeeds (e.g., T2E QA → A/B testing + demos + editor tools).
**Necessary Discovered**: Hidden deps, missing packages, env issues (e.g., auth bcrypt blocking E2E).
**Agent-Introduced**: Very rare (1 case of over-creating components).
## Rework Shape Summary
- Clean execution: 52%
- Progressive expansion: 18% (dominant failure mode)
- Early replan → stable: 11%
- Late verification churn: 7%
- Exploratory/research: 7%
- Abandoned mid-flight: 4%
**Pattern**: Progressive expansion often follows successful implementation — user adds adjacent work in same session.
## Friction Hotspots (top areas)
| Subsystem | Sessions | Avg Revisions | Main Cause |
|------------------------|----------|---------------|---------------------|
| production.py + domain | 8 | 6.2 | Hidden coupling |
| fal.py (model adapter) | 7 | 5.0 | Legitimate complexity |
| billing.py + tests | 6 | 5.5 | Verification churn |
| frontend/ build | 5 | 7.0 | Missing deps/types |
| Auth/bcrypt | 3 | 4.7 | Blocks E2E testing |
## Non-Obvious Findings (top 3)
1. **Post-Success Expansion Dominates** — Most scope growth happens *after* initial completion succeeds, not from bad planning. (High confidence)
2. **File Targeting > Acceptance Criteria** — Missing specific files correlates more with replanning (44% vs 12%) than missing criteria. Anchors agent research early. (High)
3. **Frontend Build is Silent Killer** — Late TypeScript/import failures add 2–4 cycles repeatedly. No pre-flight check exists. (High)
## Recommendations (top 4)
1. **Split Sessions After Phases** — Start new conversation after successful completion to avoid context bloat and scope creep. Expected: +13% first-shot success. (High)
2. **Enforce File Targeting** — Add pre-check in prompt optimizer to flag missing file/module refs. Expected: halve replan rate. (High)
3. **Add Frontend Preflight** — Run `npm run build` early in frontend-touching sessions. Eliminates common late blockers. (High)
4. **Fix Auth Test Fixture** — Seed test users with plain passwords or bypass bcrypt for local E2E. Unblocks browser testing. (High)
This sample shows the forensic style: evidence-backed, confidence-rated, focused on actionable patterns rather than raw counts.