From 80abf0f4d66a3b2d9017111cfa0b225068529993 Mon Sep 17 00:00:00 2001 From: Gizzant Date: Sat, 14 Mar 2026 12:48:23 -0400 Subject: [PATCH] =?UTF-8?q?Add=20analyze-project=20=E2=80=94=20Root=20Caus?= =?UTF-8?q?e=20Analyst=20skill=20(#297)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * Add analyze-project — Root Cause Analyst skill New forensic analysis skill for Antigravity. Reads brain/ artifacts (task.md, plans, walkthroughs + .resolved.N versions), classifies scope changes, rework shapes, root causes (SPEC_AMBIGUITY, REPO_FRAGILITY, etc.), clusters friction hotspots, and auto-updates project-health-state + prompt_improvement_tips.md. Full workflow and SKILL.md included. * Update SKILL.md Shorten frontmatter description to pass validation (<200 chars) * Update SKILL.md Fix YAML syntax + shorten description to pass validation (<200 chars); improve portability * Create sample_session_analysis_report.md Add examples/ with trimmed sample output report to demonstrate skill results * Update SKILL.md Fix non-portable hardcoded path in Step 11a (project-health-state) per bot feedback --- skills/analyze-project/SKILL.md | 516 ++++++++++++++++++ .../sample_session_analysis_report.md | 78 +++ 2 files changed, 594 insertions(+) create mode 100644 skills/analyze-project/SKILL.md create mode 100644 skills/analyze-project/examples/sample_session_analysis_report.md diff --git a/skills/analyze-project/SKILL.md b/skills/analyze-project/SKILL.md new file mode 100644 index 00000000..dcf7031d --- /dev/null +++ b/skills/analyze-project/SKILL.md @@ -0,0 +1,516 @@ +--- +name: analyze-project +description: Forensic root cause analyzer for Antigravity sessions. Classifies scope deltas, rework patterns, root causes, hotspots, and auto-improves prompts/health. 
+version: "1.0" +tags: [analysis, diagnostics, meta, root-cause, project-health, session-review] +--- + +# /analyze-project — Root Cause Analyst Workflow + +Analyze AI-assisted coding sessions in `brain/` and produce a diagnostic report that explains not just **what happened**, but **why it happened**, **who/what caused it**, and **what should change next time**. + +This workflow is not a simple metrics dashboard. +It is a forensic analysis of AI coding sessions. + +--- + +## Primary Objective + +For each session, determine: + +1. What changed from the initial ask to the final executed work +2. Whether the change was caused primarily by: + - the user/spec + - the agent + - the codebase/repo + - testing/verification + - legitimate task complexity +3. Whether the original prompt was sufficient for the actual job +4. Which subsystems or files repeatedly correlate with struggle +5. What concrete changes would most improve future sessions + +--- + +## Core Principles + +- Treat `.resolved.N` counts as **signals of iteration intensity**, not proof of failure +- Do not label struggle based on counts alone; classify the **shape** of rework +- Separate **human-added scope** from **necessary discovered scope** +- Separate **agent error** from **repo friction** +- Every diagnosis must include **evidence** +- Every recommendation must map to a specific observed pattern +- Use confidence levels: + - **High** = directly supported by artifact contents or timestamps + - **Medium** = supported by multiple indirect signals + - **Low** = plausible inference, not directly proven + +--- + +## Step 1: Discovery — Find Relevant Conversations + +1. Read the conversation summaries available in the system context. +2. List all subdirectories in: + `~/.gemini/antigravity/brain/` +3. Build a **Conversation Index** by cross-referencing summaries with UUID folders. +4. Record for each conversation: + - `conversation_id` + - `title` + - `objective` + - `created` + - `last_modified` +5. 
If the user supplied a keyword/path, filter on that. Otherwise analyze all workspace conversations. + +> Output: indexed list of conversations to analyze. + +--- + +## Step 2: Artifact Extraction — Build Session Evidence + +For each conversation, read all structured artifacts that exist. + +### 2a. Core Artifacts +- `task.md` +- `implementation_plan.md` +- `walkthrough.md` + +### 2b. Metadata +- `*.metadata.json` + +### 2c. Version Snapshots +- `task.md.resolved.0 ... N` +- `implementation_plan.md.resolved.0 ... N` +- `walkthrough.md.resolved.0 ... N` + +### 2d. Additional Signals +- other `.md` artifacts +- report/evaluation files +- timestamps across artifact updates +- file/folder names mentioned in plans and walkthroughs +- repeated subsystem references +- explicit testing/validation language +- explicit non-goals or constraints, if present + +### 2e. Record Per Conversation + +#### Presence / Lifecycle +- `has_task` +- `has_plan` +- `has_walkthrough` +- `is_completed` +- `is_abandoned_candidate` = has task but no walkthrough + +#### Revision / Change Volume +- `task_versions` +- `plan_versions` +- `walkthrough_versions` +- `extra_artifacts` + +#### Scope +- `task_items_initial` +- `task_items_final` +- `task_completed_pct` +- `scope_delta_raw` +- `scope_creep_pct_raw` + +#### Timing +- `created_at` +- `completed_at` +- `duration_minutes` + +#### Content / Quality Signals +- `objective_text` +- `initial_plan_summary` +- `final_plan_summary` +- `initial_task_excerpt` +- `final_task_excerpt` +- `walkthrough_summary` +- `mentioned_files_or_subsystems` +- `validation_requirements_present` +- `acceptance_criteria_present` +- `non_goals_present` +- `scope_boundaries_present` +- `file_targets_present` +- `constraints_present` + +--- + +## Step 3: Prompt Sufficiency Analysis + +For each conversation, score the opening objective/request on a 0–2 scale for each dimension: + +- **Clarity** — is the ask understandable? +- **Boundedness** — are scope limits defined? 
+- **Testability** — are success conditions or acceptance criteria defined? +- **Architectural specificity** — are files/modules/systems identified? +- **Constraint awareness** — are non-goals, constraints, or environment details included? +- **Dependency awareness** — does the prompt acknowledge affected systems or hidden coupling? + +Create: +- `prompt_sufficiency_score` +- `prompt_sufficiency_band` = High / Medium / Low + +Then note which missing ingredients likely contributed to later friction. + +Important: +Do not assume a low-detail prompt is bad by default. +Short prompts can still be good if the task is narrow and the repo context is obvious. + +--- + +## Step 4: Scope Change Classification + +Do not treat all scope growth as the same. + +For each conversation, classify scope delta into: + +### 4a. Human-Added Scope +New items clearly introduced beyond the initial ask. +Examples: +- optional enhancements +- follow-on refactors +- “while we are here” additions +- cosmetic or adjacent work added later + +### 4b. Necessary Discovered Scope +Work that was not in the opening ask but appears required to complete it correctly. +Examples: +- dependency fixes +- required validation work +- hidden integration tasks +- migration fallout +- coupled module updates + +### 4c. Agent-Introduced Scope +Work that appears neither requested nor necessary, likely introduced by agent overreach. + +For each conversation record: +- `scope_change_type_primary` +- `scope_change_type_secondary` (optional) +- `scope_change_confidence` +- evidence for classification + +--- + +## Step 5: Rework Shape Analysis + +Do not just count revisions. Determine the **shape** of session rework. 
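The `.resolved.N` snapshots collected in Step 2c can be turned into a per-artifact revision timeline before shape is classified. A minimal Python sketch (a hypothetical helper, not part of the skill; all names and thresholds are illustrative):

```python
import re
from pathlib import Path

# Hypothetical helper: collect modification times of task.md /
# implementation_plan.md / walkthrough.md and their .resolved.N
# snapshots, so shape classification can reason about *when*
# rework happened, not just how often.
RESOLVED = re.compile(r"\.resolved\.(\d+)$")
TRACKED = ("task.md", "implementation_plan.md", "walkthrough.md")

def revision_timeline(convo_dir: str) -> dict:
    """Map artifact name -> sorted mtimes of the artifact and its snapshots."""
    timeline = {}
    for path in Path(convo_dir).iterdir():
        match = RESOLVED.search(path.name)
        base = path.name[: match.start()] if match else path.name
        if base in TRACKED:
            timeline.setdefault(base, []).append(path.stat().st_mtime)
    return {name: sorted(times) for name, times in timeline.items()}

def churn_phase(mtimes: list) -> str:
    """Crude signal: did most revisions land in the first or second half
    of the revision window? Feeds, but never replaces, the qualitative
    pattern classification."""
    if len(mtimes) < 3:
        return "low-churn"
    start, end = mtimes[0], mtimes[-1]
    midpoint = start + (end - start) / 2
    early = sum(1 for t in mtimes if t <= midpoint)
    return "early-replan" if early > len(mtimes) / 2 else "late-churn"
```

Plan revisions bunched early suggest an early replan that then stabilized; bunched late, they suggest verification churn. Per the Core Principles, treat this only as one signal to weigh alongside artifact contents.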
+ +Classify each conversation into one of these patterns: + +- **Clean execution** — little change, smooth completion +- **Early replan then stable finish** — plan changed early, then execution converged +- **Progressive scope expansion** — work kept growing throughout the session +- **Reopen/reclose churn** — repeated task adjustments/backtracking +- **Late-stage verification churn** — implementation mostly done, but testing/validation caused loops +- **Abandoned mid-flight** — work started but did not reach walkthrough +- **Exploratory / research session** — iterations are high but expected due to problem discovery + +Record: +- `rework_shape` +- `rework_shape_confidence` +- supporting evidence + +--- + +## Step 6: Root Cause Analysis + +For every non-clean session, assign: + +### 6a. Primary Root Cause +Choose one: +- `SPEC_AMBIGUITY` +- `HUMAN_SCOPE_CHANGE` +- `REPO_FRAGILITY` +- `AGENT_ARCHITECTURAL_ERROR` +- `VERIFICATION_CHURN` +- `LEGITIMATE_TASK_COMPLEXITY` + +### 6b. Secondary Root Cause +Optional if a second factor materially contributed. + +### 6c. Evidence Requirements +Every root cause assignment must include: +- evidence from artifacts or metadata +- why competing causes were rejected +- confidence level + +### 6d. Root Cause Heuristics + +#### SPEC_AMBIGUITY +Use when the opening ask lacked boundaries, targets, criteria, or constraints, and the plan had to invent them. + +#### HUMAN_SCOPE_CHANGE +Use when the task set expanded due to new asks, broadened goals, or post-hoc additions. + +#### REPO_FRAGILITY +Use when hidden coupling, unclear architecture, brittle files, or environmental issues forced extra work. + +#### AGENT_ARCHITECTURAL_ERROR +Use when the agent chose the wrong approach, wrong files, wrong assumptions, or hallucinated structure. + +#### VERIFICATION_CHURN +Use when implementation mostly succeeded but tests, validation, QA, or fixes created repeated loops. 
+ +#### LEGITIMATE_TASK_COMPLEXITY +Use when revisions were reasonable given the difficulty and do not strongly indicate avoidable failure. + +--- + +## Step 7: Subsystem / File Clustering + +Across all conversations, cluster repeated struggle by subsystem, folder, or file mentions. + +Examples: +- `frontend/auth/*` +- `db.py` +- `ui.py` +- `video_pipeline/*` + +For each cluster, calculate: +- number of conversations touching it +- average revisions +- completion rate +- abandonment rate +- common root causes + +Output the top recurring friction zones. + +Goal: +Identify whether struggle is prompt-driven, agent-driven, or concentrated in specific repo areas. + +--- + +## Step 8: Comparative Cohort Analysis + +Compare these cohorts: + +- first-shot successes vs re-planned sessions +- completed vs abandoned +- high prompt sufficiency vs low prompt sufficiency +- narrow-scope vs high-scope-growth +- short sessions vs long sessions +- low-friction subsystems vs high-friction subsystems + +For each comparison, identify: +- what differs materially +- which prompt traits correlate with smoother execution +- which repo traits correlate with repeated struggle + +Do not merely restate averages. +Extract causal-looking patterns cautiously and label them as inference where appropriate. + +--- + +## Step 9: Non-Obvious Findings + +Generate 3–7 findings that are not simple metric restatements. 
+ +Good examples: +- “Most replans happen in sessions with weak file targeting, not weak acceptance criteria.” +- “Scope growth usually begins after the first successful implementation, suggesting post-success human expansion.” +- “Auth-related sessions cluster around repo fragility rather than agent hallucination.” +- “Abandoned work is strongly associated with missing validation criteria.” + +Bad examples: +- “Some sessions had many revisions.” +- “Some sessions were longer than others.” + +Each finding must include: +- observation +- why it matters +- evidence +- confidence + +--- + +## Step 10: Report Generation + +Create `session_analysis_report.md` in the current conversation’s brain folder. + +Use this structure: + +# 📊 Session Analysis Report — [Project Name] + +**Generated**: [timestamp] +**Conversations Analyzed**: [N] +**Date Range**: [earliest] → [latest] + +--- + +## Executive Summary + +| Metric | Value | Rating | +|:---|:---|:---| +| First-Shot Success Rate | X% | 🟢/🟡/🔴 | +| Completion Rate | X% | 🟢/🟡/🔴 | +| Avg Scope Growth | X% | 🟢/🟡/🔴 | +| Replan Rate | X% | 🟢/🟡/🔴 | +| Median Duration | Xm | — | +| Avg Revision Intensity | X | 🟢/🟡/🔴 | + +Then include a short narrative summary: +- what is going well +- what is breaking down +- whether the main issue is prompt quality, repo fragility, or workflow discipline + +--- + +## Root Cause Breakdown + +| Root Cause | Count | % | Notes | +|:---|:---|:---|:---| +| Spec Ambiguity | X | X% | ... | +| Human Scope Change | X | X% | ... | +| Repo Fragility | X | X% | ... | +| Agent Architectural Error | X | X% | ... | +| Verification Churn | X | X% | ... | +| Legitimate Task Complexity | X | X% | ... 
| + +--- + +## Prompt Sufficiency Analysis + +- common traits of high-sufficiency prompts +- common missing inputs in low-sufficiency prompts +- which missing prompt ingredients correlate most with replanning or abandonment + +--- + +## Scope Change Analysis + +Separate: +- Human-added scope +- Necessary discovered scope +- Agent-introduced scope + +Show top offenders in each category. + +--- + +## Rework Shape Analysis + +Summarize how sessions tend to fail: +- early replan then recover +- progressive scope expansion +- late verification churn +- abandonments +- reopen/reclose cycles + +--- + +## Friction Hotspots + +Cluster repeated struggle by subsystem/file/domain. +Show which areas correlate with: +- replanning +- abandonment +- verification churn +- agent architectural mistakes + +--- + +## First-Shot Successes + +List the cleanest sessions and extract what made them work: +- scope boundaries +- acceptance criteria +- file targeting +- validation clarity +- narrowness of change surface + +--- + +## Non-Obvious Findings + +List 3–7 high-value findings with evidence and confidence. + +--- + +## Recommendations + +Each recommendation must use this format: + +### Recommendation [N] +- **Observed pattern** +- **Likely cause** +- **Evidence** +- **Change to make** +- **Expected benefit** +- **Confidence** + +Recommendations must be specific, not generic. + +--- + +## Per-Conversation Breakdown + +| # | Title | Duration | Scope Δ | Plan Revs | Task Revs | Root Cause | Rework Shape | Complete? | +|:---|:---|:---|:---|:---|:---|:---|:---|:---| + +Add short notes only where meaningful. + +--- + +## Step 11: Auto-Optimize — Improve Future Sessions + +### 11a. Update Project Health State +Example path (update to your actual location): +`~/.gemini/antigravity/.agent/skills/project-health-state/SKILL.md` + +Update: +- session analysis metrics +- recurring fragile files/subsystems +- recurring failure modes +- last updated timestamp + +### 11b. 
Generate Prompt Improvement Guidance +Create `prompt_improvement_tips.md` + +Do not give generic advice. +Instead extract: +- traits of high-sufficiency prompts +- examples of effective scope boundaries +- examples of good acceptance criteria +- examples of useful file targeting +- common missing details that led to replans + +### 11c. Suggest Missing Skills / Workflows +If multiple struggle sessions cluster around the same subsystem or repeated sequence, recommend: +- a targeted skill +- a repeatable workflow +- a reusable prompt template +- a repo note / architecture map + +Only recommend workflows when the pattern appears repeatedly. + +--- + +## Final Output Standard + +The workflow must produce: +1. A metrics summary +2. A root-cause diagnosis +3. A subsystem/friction map +4. A prompt-sufficiency assessment +5. Evidence-backed recommendations +6. Non-obvious findings + +If evidence is weak, say so. +Do not overclaim. +Prefer explicit uncertainty over fake precision. + + + + + + + + +**How to invoke this skill** +Just say any of these in a new conversation: +- “Run analyze-project on the workspace” +- “Do a full session analysis report” +- “Root cause my recent brain/ sessions” +- “Update project health state” + +The agent will automatically discover and use the skill. diff --git a/skills/analyze-project/examples/sample_session_analysis_report.md b/skills/analyze-project/examples/sample_session_analysis_report.md new file mode 100644 index 00000000..141a48e5 --- /dev/null +++ b/skills/analyze-project/examples/sample_session_analysis_report.md @@ -0,0 +1,78 @@ +# Sample Output: session_analysis_report.md +# Generated by /analyze-project skill on a ~3-week project with ~50 substantive sessions. +# (Trimmed for demo; real reports include full per-conversation breakdown and more cohorts.) 
+ +# 📊 Session Analysis Report — Sample AI Video Studio + +**Generated**: 2026-03-13 +**Conversations Analyzed**: 54 substantive (with artifacts) +**Date Range**: Feb 18 – Mar 13, 2026 + +## Executive Summary + +| Metric | Value | Rating | +|-------------------------|-------------|--------| +| First-Shot Success Rate | 52% | 🟡 | +| Completion Rate | 70% | 🟢 | +| Avg Scope Growth | +58% | 🟡 | +| Replan Rate | 30% | 🟢 | +| Median Duration | ~35 min | 🟢 | +| Avg Revision Intensity | 4.8 versions| 🟡 | +| Abandoned Rate | 22% | 🟡 | + +**Narrative**: High velocity with strong completion on workflow-driven tasks. Main friction is **post-success human scope expansion** — users add "while we're here" features after initial work succeeds, turning narrow tasks into multi-phase epics. Not primarily prompt or agent issues — more workflow discipline. + +## Root Cause Breakdown (non-clean sessions only) + +| Root Cause | % | Notes | +|-----------------------------|-----|--------------------------------------------| +| Human Scope Change | 37% | New features/epics added mid-session after success | +| Legitimate Task Complexity | 26% | Multi-phase builds with expected iteration | +| Repo Fragility | 15% | Hidden coupling, pre-existing bugs | +| Verification Churn | 11% | Late test/build failures | +| Spec Ambiguity | 7% | Vague initial ask | +| Agent Architectural Error | 4% | Rare wrong approach | + +Confidence: **High** for top two (direct evidence from version diffs). + +## Scope Change Analysis Highlights + +**Human-Added** (most common): Starts narrow → grows after Phase 1 succeeds (e.g., T2E QA → A/B testing + demos + editor tools). +**Necessary Discovered**: Hidden deps, missing packages, env issues (e.g., auth bcrypt blocking E2E). +**Agent-Introduced**: Very rare (1 case of over-creating components). 
+ +## Rework Shape Summary + +- Clean execution: 52% +- Progressive expansion: 18% (dominant failure mode) +- Early replan → stable: 11% +- Late verification churn: 7% +- Exploratory/research: 7% +- Abandoned mid-flight: 4% + +**Pattern**: Progressive expansion often follows successful implementation — user adds adjacent work in same session. + +## Friction Hotspots (top areas) + +| Subsystem | Sessions | Avg Revisions | Main Cause | +|------------------------|----------|---------------|---------------------| +| production.py + domain | 8 | 6.2 | Hidden coupling | +| fal.py (model adapter) | 7 | 5.0 | Legitimate complexity | +| billing.py + tests | 6 | 5.5 | Verification churn | +| frontend/ build | 5 | 7.0 | Missing deps/types | +| Auth/bcrypt | 3 | 4.7 | Blocks E2E testing | + +## Non-Obvious Findings (top 3) + +1. **Post-Success Expansion Dominates** — Most scope growth happens *after* initial completion succeeds, not from bad planning. (High confidence) +2. **File Targeting > Acceptance Criteria** — Missing specific files correlates more with replanning (44% vs 12%) than missing criteria. Anchors agent research early. (High) +3. **Frontend Build is Silent Killer** — Late TypeScript/import failures add 2–4 cycles repeatedly. No pre-flight check exists. (High) + +## Recommendations (top 4) + +1. **Split Sessions After Phases** — Start new conversation after successful completion to avoid context bloat and scope creep. Expected: +13% first-shot success. (High) +2. **Enforce File Targeting** — Add pre-check in prompt optimizer to flag missing file/module refs. Expected: halve replan rate. (High) +3. **Add Frontend Preflight** — Run `npm run build` early in frontend-touching sessions. Eliminates common late blockers. (High) +4. **Fix Auth Test Fixture** — Seed test users with plain passwords or bypass bcrypt for local E2E. Unblocks browser testing. 
(High) + +This sample shows the forensic style: evidence-backed, confidence-rated, focused on actionable patterns rather than raw counts.