claude-skills-reference/marketing-skill/site-architecture/scripts/sitemap_analyzer.py
Alireza Rezvani a68ae3a05e Dev (#305)
* chore: update gitignore for audit reports and playwright cache

* fix: add YAML frontmatter (name + description) to all SKILL.md files

- Added frontmatter to 34 skills that were missing it entirely (0% Tessl score)
- Fixed name field format to kebab-case across all 169 skills
- Resolves #284

* chore: sync codex skills symlinks [automated]

* fix: optimize 14 low-scoring skills via Tessl review (#290)

Tessl optimization: 14 skills improved from ≤69% to 85%+. Closes #285, #286.

* chore: sync codex skills symlinks [automated]

* fix: optimize 18 skills via Tessl review + compliance fix (closes #287) (#291)

Phase 1: 18 skills optimized via Tessl (avg 77% → 95%). Closes #287.

* feat: add scripts and references to 4 prompt-only skills + Tessl optimization (#292)

Phase 2: 3 new scripts + 2 reference files for prompt-only skills. Tessl 45-55% → 94-100%.

* feat: add 6 agents + 5 slash commands for full coverage (v2.7.0) (#293)

Phase 3: 6 new agents (all 9 categories covered) + 5 slash commands.

* fix: Phase 5 verification fixes + docs update (#294)

Phase 5 verification fixes

* chore: sync codex skills symlinks [automated]

* fix: marketplace audit — all 11 plugins validated by Claude Code (#295)

Marketplace audit: all 11 plugins validated + installed + tested in Claude Code

* fix: restore 7 removed plugins + revert playwright-pro name to pw

Reverts two overly aggressive audit changes:
- Restored content-creator, demand-gen, fullstack-engineer, aws-architect,
  product-manager, scrum-master, skill-security-auditor to marketplace
- Reverted playwright-pro plugin.json name back to 'pw' (intentional short name)

* refactor: split 21 over-500-line skills into SKILL.md + references (#296)

* chore: sync codex skills symlinks [automated]

* docs: update all documentation with accurate counts and regenerated skill pages

- Update skill count to 170, Python tools to 213, references to 314 across all docs
- Regenerate all 170 skill doc pages from latest SKILL.md sources
- Update CLAUDE.md with v2.1.1 highlights, accurate architecture tree, and roadmap
- Update README.md badges and overview table
- Update marketplace.json metadata description and version
- Update mkdocs.yml, index.md, getting-started.md with correct numbers

* fix: add root-level SKILL.md and .codex/instructions.md to all domains (#301)

Root cause: CLI tools (ai-agent-skills, agent-skills-cli) look for SKILL.md
at the specified install path. 7 of 9 domain directories were missing this
file, causing "Skill not found" errors for bundle installs like:
  npx ai-agent-skills install alirezarezvani/claude-skills/engineering-team

Fix:
- Add root-level SKILL.md with YAML frontmatter to 7 domains
- Add .codex/instructions.md to 8 domains (for Codex CLI discovery)
- Update INSTALLATION.md with accurate skill counts (53→170)
- Add troubleshooting entry for "Skill not found" error

All 9 domains now have: SKILL.md + .codex/instructions.md + plugin.json

Closes #301

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
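The discovery rule described in this commit, a CLI resolving an install path and requiring SKILL.md to exist there, can be sketched as follows (the function name and error text are illustrative assumptions, not the actual ai-agent-skills source):

```python
from pathlib import Path

def find_skill(install_path: str) -> Path:
    """Return the SKILL.md for an install path, mimicking the CLIs'
    'Skill not found' behavior when the file is missing.
    Hypothetical helper; the real resolver is more involved."""
    skill = Path(install_path) / "SKILL.md"
    if not skill.is_file():
        raise FileNotFoundError(f"Skill not found: {install_path}")
    return skill
```

Under this model, a domain directory without a root-level SKILL.md fails the check even if every skill beneath it has one, which matches the reported bundle-install errors.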

* feat: add Gemini CLI + OpenClaw support, fix Codex missing 25 skills

Gemini CLI:
- Add GEMINI.md with activation instructions
- Add scripts/gemini-install.sh setup script
- Add scripts/sync-gemini-skills.py (194 skills indexed)
- Add .gemini/skills/ with symlinks for all skills, agents, commands
- Remove phantom medium-content-pro entries from sync script
- Add top-level folder filter to prevent gitignored dirs from leaking

Codex CLI:
- Fix sync-codex-skills.py missing "engineering" domain (25 POWERFUL skills)
- Regenerate .codex/skills-index.json: 124 → 149 skills
- Add 25 new symlinks in .codex/skills/

OpenClaw:
- Add OpenClaw installation section to INSTALLATION.md
- Add ClawHub install + manual install + YAML frontmatter docs

Documentation:
- Update INSTALLATION.md with all 4 platforms + accurate counts
- Update README.md: "three platforms" → "four platforms" + Gemini quick start
- Update CLAUDE.md with Gemini CLI support in v2.1.1 highlights
- Update SKILL-AUTHORING-STANDARD.md + SKILL_PIPELINE.md with Gemini steps
- Add OpenClaw + Gemini to installation locations reference table

Marketplace: all 18 plugins validated — sources exist, SKILL.md present

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
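A minimal sketch of the index-sync pass described above, including the top-level folder filter that keeps gitignored directories from leaking into the generated index (the function name and filter set are assumptions, not the real sync-gemini-skills.py):

```python
from pathlib import Path

IGNORED_TOP_LEVEL = {".git", "node_modules", ".venv"}  # assumed ignore set

def build_skills_index(root: str) -> dict:
    """Map each skill directory to its SKILL.md path, skipping ignored
    top-level folders so they cannot appear in the index."""
    index = {}
    for skill_md in sorted(Path(root).rglob("SKILL.md")):
        rel = skill_md.relative_to(root)
        if rel.parts[0] in IGNORED_TOP_LEVEL:  # top-level folder filter
            continue
        index[rel.parent.as_posix()] = rel.as_posix()
    return index
```

The missing-domain bug above would correspond to a domain list that simply omitted "engineering"; a recursive scan like this one avoids hand-maintained lists entirely.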

* feat(product,pm): world-class product & PM skills audit — 6 scripts, 5 agents, 7 commands, 23 references/assets

Phase 1 — Agent & Command Foundation:
- Rewrite cs-project-manager agent (55→515 lines, 4 workflows, 6 skill integrations)
- Expand cs-product-manager agent (408→684 lines, orchestrates all 8 product skills)
- Add 7 slash commands: /rice, /okr, /persona, /user-story, /sprint-health, /project-health, /retro

Phase 2 — Script Gap Closure (2,779 lines):
- jira-expert: jql_query_builder.py (22 patterns), workflow_validator.py
- confluence-expert: space_structure_generator.py, content_audit_analyzer.py
- atlassian-admin: permission_audit_tool.py
- atlassian-templates: template_scaffolder.py (Confluence XHTML generation)

Phase 3 — Reference & Asset Enrichment:
- 9 product references (competitive-teardown, landing-page-generator, saas-scaffolder)
- 6 PM references (confluence-expert, atlassian-admin, atlassian-templates)
- 7 product assets (templates for PRD, RICE, sprint, stories, OKR, research, design system)
- 1 PM asset (permission_scheme_template.json)

Phase 4 — New Agents:
- cs-agile-product-owner, cs-product-strategist, cs-ux-researcher

Phase 5 — Integration & Polish:
- Related Skills cross-references in 8 SKILL.md files
- Updated product-team/CLAUDE.md (5→8 skills, 6→9 tools, 4 agents, 5 commands)
- Updated project-management/CLAUDE.md (0→12 scripts, 3 commands)
- Regenerated docs site (177 pages), updated homepage and getting-started

Quality audit: 31 files reviewed, 29 PASS, 2 fixed (copy-frameworks.md, governance-framework.md)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: audit and repair all plugins, agents, and commands

- Fix 12 command files: correct CLI arg syntax, script paths, and usage docs
- Fix 3 agents with broken script/reference paths (cs-content-creator,
  cs-demand-gen-specialist, cs-financial-analyst)
- Add complete YAML frontmatter to 5 agents (cs-growth-strategist,
  cs-engineering-lead, cs-senior-engineer, cs-financial-analyst,
  cs-quality-regulatory)
- Fix cs-ceo-advisor related agent path
- Update marketplace.json metadata counts (224 tools, 341 refs, 14 agents,
  12 commands)

Verified: all 19 scripts pass --help, all 14 agent paths resolve, mkdocs
builds clean.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: repair 25 Python scripts failing --help across all domains

- Fix Python 3.10+ syntax (float | None → Optional[float]) in 2 scripts
- Add argparse CLI handling to 9 marketing scripts using raw sys.argv
- Fix 10 scripts crashing at module level (wrap in __main__, add argparse)
- Make yaml/prefect/mcp imports conditional with stdlib fallbacks (4 scripts)
- Fix f-string backslash syntax in project_bootstrapper.py
- Fix -h flag conflict in pr_analyzer.py
- Fix tech-debt.md description (score → prioritize)

All 237 scripts now pass python3 --help verification.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
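The two most common repairs above can be sketched in one illustrative function (the actual scripts differ): `Optional[float]` keeps the annotation valid before Python 3.10, where `float | None` raises a TypeError when the function is defined, and argparse replaces raw `sys.argv` indexing so `--help` works.

```python
import argparse
from typing import Optional

def score(value: Optional[float] = None) -> float:
    """Writing `float | None` here would fail at definition time on 3.9."""
    return 0.0 if value is None else value

def main(argv=None) -> float:
    # argparse provides --help and input validation; raw sys.argv[1] did neither
    parser = argparse.ArgumentParser(description="compat-fix sketch")
    parser.add_argument("--value", type=float, default=None)
    args = parser.parse_args(argv)
    return score(args.value)
```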

* fix(product-team): close 3 verified gaps in product skills

- Fix competitive-teardown/SKILL.md: replace broken references
  DATA_COLLECTION.md → references/data-collection-guide.md and
  TEMPLATES.md → references/analysis-templates.md (workflow was broken
  at steps 2 and 4)

- Upgrade landing_page_scaffolder.py: add TSX + Tailwind output format
  (--format tsx) matching SKILL.md promise of Next.js/React components.
  4 design styles (dark-saas, clean-minimal, bold-startup, enterprise).
  TSX is now default; HTML preserved via --format html

- Rewrite README.md: fix stale counts (was 5 skills/15+ tools, now
  accurately shows 8 skills/9 tools), remove 7 ghost scripts that
  never existed (sprint_planner.py, velocity_tracker.py, etc.)

- Fix tech-debt.md description (score → prioritize)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
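The new output switch can be sketched as an argparse fragment. The format choices, the TSX default, and the four style names come from the commit text; the style default and everything else here are assumptions, and the real landing_page_scaffolder.py CLI has more options.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="landing page scaffolder (sketch)")
    # TSX + Tailwind is now the default; HTML output is preserved
    parser.add_argument("--format", choices=["tsx", "html"], default="tsx")
    # The four design styles named in the commit message
    parser.add_argument("--style", choices=["dark-saas", "clean-minimal",
                                            "bold-startup", "enterprise"],
                        default="dark-saas")
    return parser
```

Keeping `html` as an explicit choice rather than removing it is what makes the change backward compatible for existing callers.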

* release: v2.1.2 — landing page TSX output, brand voice integration, docs update

- Landing page generator defaults to Next.js TSX + Tailwind CSS (4 design styles)
- Brand voice analyzer integrated into landing page generation workflow
- CHANGELOG, CLAUDE.md, README.md updated for v2.1.2
- All 13 plugin.json + marketplace.json bumped to 2.1.2
- Gemini/Codex skill indexes re-synced
- Backward compatible: --format html preserved, no breaking changes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: alirezarezvani <5697919+alirezarezvani@users.noreply.github.com>
Co-authored-by: Leo <leo@openclaw.ai>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 09:48:49 +01:00

376 lines
15 KiB
Python

#!/usr/bin/env python3
"""
sitemap_analyzer.py — Analyzes sitemap.xml files for structure, depth, and potential issues.

Usage:
    python3 sitemap_analyzer.py [sitemap.xml]
    python3 sitemap_analyzer.py https://example.com/sitemap.xml   (fetches via urllib)
    cat sitemap.xml | python3 sitemap_analyzer.py -

If no file is provided, runs on the embedded sample sitemap for demonstration.

Output: structural analysis with depth distribution, URL patterns, orphan candidates,
duplicate path detection, and a JSON summary.

Stdlib only — no external dependencies.
"""
import json
import sys
import re
import urllib.request
import urllib.error
from collections import Counter, defaultdict
from urllib.parse import urlparse
import xml.etree.ElementTree as ET
# ─── Namespaces used in sitemaps ─────────────────────────────────────────────
SITEMAP_NAMESPACES = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "image": "http://www.google.com/schemas/sitemap-image/1.1",
    "video": "http://www.google.com/schemas/sitemap-video/1.1",
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
    "xhtml": "http://www.w3.org/1999/xhtml",
}
# ─── Sample sitemap (embedded) ────────────────────────────────────────────────
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Homepage -->
  <url>
    <loc>https://example.com/</loc>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <!-- Top-level pages -->
  <url><loc>https://example.com/pricing</loc></url>
  <url><loc>https://example.com/about</loc></url>
  <url><loc>https://example.com/contact</loc></url>
  <url><loc>https://example.com/blog</loc></url>
  <!-- Features section -->
  <url><loc>https://example.com/features</loc></url>
  <url><loc>https://example.com/features/email-automation</loc></url>
  <url><loc>https://example.com/features/crm-integration</loc></url>
  <url><loc>https://example.com/features/analytics</loc></url>
  <!-- Solutions section -->
  <url><loc>https://example.com/solutions/sales-teams</loc></url>
  <url><loc>https://example.com/solutions/marketing-teams</loc></url>
  <!-- Blog posts (various topics) -->
  <url><loc>https://example.com/blog/cold-email-guide</loc></url>
  <url><loc>https://example.com/blog/email-open-rates</loc></url>
  <url><loc>https://example.com/blog/crm-comparison</loc></url>
  <url><loc>https://example.com/blog/sales-process-optimization</loc></url>
  <!-- Deeply nested pages (potential issue) -->
  <url><loc>https://example.com/resources/guides/email/cold-outreach/advanced/templates</loc></url>
  <url><loc>https://example.com/resources/guides/email/cold-outreach/advanced/scripts</loc></url>
  <!-- Duplicate path patterns (potential issue) -->
  <url><loc>https://example.com/blog/email-tips</loc></url>
  <url><loc>https://example.com/resources/email-tips</loc></url>
  <!-- Dynamic-looking URL (potential issue) -->
  <url><loc>https://example.com/search?q=cold+email&amp;sort=recent</loc></url>
  <!-- Case studies -->
  <url><loc>https://example.com/customers/acme-corp</loc></url>
  <url><loc>https://example.com/customers/globex</loc></url>
  <!-- Legal pages (often over-linked) -->
  <url><loc>https://example.com/privacy</loc></url>
  <url><loc>https://example.com/terms</loc></url>
</urlset>
"""
# ─── URL Analysis ─────────────────────────────────────────────────────────────

def get_depth(path: str) -> int:
    """Return depth of a URL path. / = 0, /blog = 1, /blog/post = 2, etc."""
    parts = [p for p in path.strip("/").split("/") if p]
    return len(parts)


def get_path_pattern(path: str) -> str:
    """Replace variable segments with {slug} for pattern detection."""
    parts = path.strip("/").split("/")
    normalized = []
    for p in parts:
        if p:
            # Keep static segments (likely structure), replace dynamic-looking ones
            if re.match(r'^[a-z][-a-z]+$', p) and len(p) < 30:
                normalized.append(p)
            else:
                normalized.append("{slug}")
    return "/" + "/".join(normalized) if normalized else "/"


def has_query_params(url: str) -> bool:
    return "?" in url


def looks_like_dynamic_url(url: str) -> bool:
    parsed = urlparse(url)
    return bool(parsed.query)


def detect_path_siblings(urls: list) -> list:
    """Find URLs with the same slug in different parent directories (potential duplicates)."""
    slug_to_paths = defaultdict(list)
    for url in urls:
        path = urlparse(url).path.strip("/")
        slug = path.split("/")[-1] if path else ""
        if slug:
            slug_to_paths[slug].append(url)
    duplicates = []
    for slug, paths in slug_to_paths.items():
        if len(paths) > 1:
            # Only flag if they're in different directories
            parents = set("/".join(urlparse(p).path.strip("/").split("/")[:-1]) for p in paths)
            if len(parents) > 1:
                duplicates.append({"slug": slug, "urls": paths})
    return duplicates
# ─── Sitemap Parser ──────────────────────────────────────────────────────────

def parse_sitemap(content: str) -> list:
    """Parse sitemap XML and return a list of URL dicts."""
    urls = []
    # Strip namespace declarations for simpler parsing. Note: sitemaps using
    # prefixed extension tags (e.g. <image:image>) will fail to parse after this.
    content_clean = re.sub(r'xmlns[^=]*="[^"]*"', '', content)
    try:
        root = ET.fromstring(content_clean)
    except ET.ParseError as e:
        print(f"❌ XML parse error: {e}", file=sys.stderr)
        return []

    # Handle sitemap index (points to other sitemaps)
    if root.tag.endswith("sitemapindex"):
        print("  This is a sitemap index file — it points to child sitemaps.")
        print("  Child sitemaps:")
        for sitemap in root.findall(".//loc"):
            print(f"    - {sitemap.text}")
        print("  Run this tool on each child sitemap for full analysis.")
        return []

    # Regular urlset. Namespaces were stripped above, so plain tag names match;
    # this also avoids the deprecated `find(...) or find(...)` Element-truthiness idiom.
    for url_el in root.findall(".//url"):
        loc_el = url_el.find("loc")
        lastmod_el = url_el.find("lastmod")
        priority_el = url_el.find("priority")
        if loc_el is not None and loc_el.text:
            urls.append({
                "url": loc_el.text.strip(),
                "lastmod": lastmod_el.text.strip() if lastmod_el is not None and lastmod_el.text else None,
                "priority": float(priority_el.text.strip()) if priority_el is not None and priority_el.text else None,
            })
    return urls
# ─── Analysis Engine ─────────────────────────────────────────────────────────

def analyze_urls(urls: list) -> dict:
    raw_urls = [u["url"] for u in urls]
    paths = [urlparse(u).path for u in raw_urls]
    depths = [get_depth(p) for p in paths]
    depth_counter = Counter(depths)
    dynamic_urls = [u for u in raw_urls if looks_like_dynamic_url(u)]
    patterns = Counter(get_path_pattern(urlparse(u).path) for u in raw_urls)
    top_patterns = patterns.most_common(10)
    duplicate_slugs = detect_path_siblings(raw_urls)
    deep_urls = [(u, get_depth(urlparse(u).path)) for u in raw_urls if get_depth(urlparse(u).path) >= 4]

    # Extract top-level directories
    top_dirs = Counter()
    for p in paths:
        parts = p.strip("/").split("/")
        if parts and parts[0]:
            top_dirs[parts[0]] += 1

    return {
        "total_urls": len(urls),
        "depth_distribution": dict(sorted(depth_counter.items())),
        "top_directories": dict(top_dirs.most_common(15)),
        "dynamic_urls": dynamic_urls,
        "deep_pages": deep_urls,
        "duplicate_slug_candidates": duplicate_slugs,
        "top_url_patterns": [{"pattern": p, "count": c} for p, c in top_patterns],
    }
# ─── Report Printer ──────────────────────────────────────────────────────────

def grade_depth_distribution(dist: dict) -> str:
    deep = sum(v for k, v in dist.items() if k >= 4)
    total = sum(dist.values())
    if total == 0:
        return "N/A"
    pct = deep / total * 100
    if pct < 5:
        return "🟢 Excellent"
    if pct < 15:
        return "🟡 Acceptable"
    return "🔴 Too many deep pages"
def print_report(analysis: dict) -> None:
    print("\n" + "═" * 62)
    print("  SITEMAP STRUCTURE ANALYSIS")
    print("═" * 62)
    print(f"\n  Total URLs: {analysis['total_urls']}")

    print("\n── Depth Distribution ──")
    dist = analysis["depth_distribution"]
    total = analysis["total_urls"]
    for depth, count in sorted(dist.items()):
        pct = count / total * 100 if total else 0
        bar = "█" * int(pct / 2)
        label = "homepage" if depth == 0 else f"{' ' * min(depth, 3)}/{'…/' * (depth - 1)}page"
        print(f"  Depth {depth}: {count:4d} pages ({pct:5.1f}%)  {bar} {label}")
    print(f"\n  Rating: {grade_depth_distribution(dist)}")
    deep_pct = sum(v for k, v in dist.items() if k >= 4) / total * 100 if total else 0
    if deep_pct >= 5:
        print("  ⚠️  More than 5% of pages are 4+ levels deep.")
        print("      Consider flattening the structure or adding shortcut links.")

    print("\n── Top-Level Directories ──")
    for d, count in analysis["top_directories"].items():
        pct = count / total * 100 if total else 0
        print(f"  /{d:<30s} {count:4d} URLs ({pct:.1f}%)")

    print("\n── URL Pattern Analysis ──")
    for p in analysis["top_url_patterns"]:
        print(f"  {p['pattern']:<45s} {p['count']:4d} URLs")

    if analysis["dynamic_urls"]:
        print(f"\n── Dynamic URLs Detected ({len(analysis['dynamic_urls'])}) ──")
        print("  ⚠️  URLs with query parameters should usually be excluded from sitemaps.")
        print("      Use canonical tags or robots.txt to prevent duplicate content indexing.")
        for u in analysis["dynamic_urls"][:5]:
            print(f"    {u}")
        if len(analysis["dynamic_urls"]) > 5:
            print(f"    ... and {len(analysis['dynamic_urls']) - 5} more")

    if analysis["deep_pages"]:
        print(f"\n── Deep Pages (4+ Levels) ({len(analysis['deep_pages'])}) ──")
        print("  ⚠️  Pages this deep may have weak crawl equity. Add internal shortcuts.")
        for url, depth in analysis["deep_pages"][:5]:
            print(f"    Depth {depth}: {url}")
        if len(analysis["deep_pages"]) > 5:
            print(f"    ... and {len(analysis['deep_pages']) - 5} more")

    if analysis["duplicate_slug_candidates"]:
        print(f"\n── Potential Duplicate Path Issues ({len(analysis['duplicate_slug_candidates'])}) ──")
        print("  ⚠️  Same slug appears in multiple directories — possible duplicate content.")
        for item in analysis["duplicate_slug_candidates"][:5]:
            print(f"    Slug: '{item['slug']}'")
            for u in item["urls"]:
                print(f"      - {u}")
        if len(analysis["duplicate_slug_candidates"]) > 5:
            print(f"    ... and {len(analysis['duplicate_slug_candidates']) - 5} more")

    print("\n── Recommendations ──")
    # Numbered with a counter so recommendations stay sequential even when
    # earlier checks find no issues
    rec_num = 0
    if analysis["dynamic_urls"]:
        rec_num += 1
        print(f"  {rec_num}. Remove dynamic URLs (with ?) from the sitemap.")
    if analysis["deep_pages"]:
        rec_num += 1
        print(f"  {rec_num}. Flatten deep URL structures or add internal shortcut links.")
    if analysis["duplicate_slug_candidates"]:
        rec_num += 1
        print(f"  {rec_num}. Review duplicate slug paths — consolidate or add canonical tags.")
    if rec_num == 0:
        print("  ✅ No major structural issues detected in this sitemap.")
    print("\n" + "═" * 62)
# ─── Main ─────────────────────────────────────────────────────────────────────

def load_content(source: str) -> str:
    """Load sitemap from a file path or URL."""
    if source.startswith("http://") or source.startswith("https://"):
        try:
            with urllib.request.urlopen(source, timeout=10) as resp:
                return resp.read().decode("utf-8")
        except urllib.error.URLError as e:
            print(f"Error fetching URL: {e}", file=sys.stderr)
            sys.exit(1)
    else:
        try:
            with open(source, "r", encoding="utf-8") as f:
                return f.read()
        except FileNotFoundError:
            print(f"Error: File not found: {source}", file=sys.stderr)
            sys.exit(1)
def main():
    import argparse

    parser = argparse.ArgumentParser(
        description="Analyzes sitemap.xml files for structure, depth, and potential issues. "
                    "Reports depth distribution, URL patterns, orphan candidates, and duplicates."
    )
    parser.add_argument(
        "file", nargs="?", default=None,
        help="Path to a sitemap.xml file or URL (https://...). "
             "Use '-' to read from stdin. If omitted, runs embedded sample."
    )
    args = parser.parse_args()

    if args.file:
        if args.file == "-":
            content = sys.stdin.read()
        else:
            content = load_content(args.file)
    else:
        print("No file or URL provided — running on embedded sample sitemap.\n")
        content = SAMPLE_SITEMAP

    urls = parse_sitemap(content)
    if not urls:
        print("No URLs found in sitemap.", file=sys.stderr)
        sys.exit(1)

    analysis = analyze_urls(urls)
    print_report(analysis)

    # JSON output
    print("\n── JSON Summary ──")
    summary = {
        "total_urls": analysis["total_urls"],
        "depth_distribution": analysis["depth_distribution"],
        "dynamic_url_count": len(analysis["dynamic_urls"]),
        "deep_page_count": len(analysis["deep_pages"]),
        "duplicate_slug_count": len(analysis["duplicate_slug_candidates"]),
        "top_directories": analysis["top_directories"],
    }
    print(json.dumps(summary, indent=2))


if __name__ == "__main__":
    main()