Documentation restructure: - New docs/getting-started/ guide (4 files: install, quick-start, first-skill, next-steps) - New docs/user-guide/ section (6 files: core concepts through troubleshooting) - New docs/reference/ section (CLI_REFERENCE, CONFIG_FORMAT, ENVIRONMENT_VARIABLES, MCP_REFERENCE) - New docs/advanced/ section (custom-workflows, mcp-server, multi-source) - New docs/ARCHITECTURE.md - system architecture overview - Archived legacy files (QUICKSTART.md, QUICK_REFERENCE.md, docs/guides/USAGE.md) to docs/archive/legacy/ Chinese (zh-CN) translations: - Full zh-CN mirror of all user-facing docs (getting-started, user-guide, reference, advanced) - GitHub Actions workflow for translation sync (.github/workflows/translate-docs.yml) - Translation sync checker script (scripts/check_translation_sync.sh) - Translation helper script (scripts/translate_doc.py) Content updates: - CHANGELOG.md: [Unreleased] → [3.1.0] - 2026-02-22 - README.md: updated with new doc structure links - AGENTS.md: updated agent documentation - docs/features/UNIFIED_SCRAPING.md: updated for unified scraper workflow JSON config Analysis/planning artifacts (kept for reference): - DOCUMENTATION_OVERHAUL_PLAN.md, DOCUMENTATION_OVERHAUL_SUMMARY.md - FEATURE_GAP_ANALYSIS.md, IMPLEMENTATION_GAPS_ANALYSIS.md, CREATE_COMMAND_COVERAGE_ANALYSIS.md - CHINESE_TRANSLATION_IMPLEMENTATION_SUMMARY.md, ISSUE_260_UPDATE.md Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
70 KiB
C3.x Router Architecture - Ultra-Detailed Technical Specification
Created: 2026-01-08 Last Updated: 2026-01-08 (MAJOR REVISION - Three-Stream GitHub Architecture) Purpose: Complete architectural design for converting C3.x-analyzed codebases into router-based skill systems Status: Design phase - Ready for implementation
Executive Summary
Problem Statement
Current C3.x codebase analysis generates monolithic skills that are:
- Too large for optimal AI consumption (666 lines vs 150-300 ideal)
- Token inefficient (77-88% waste on topic-specific queries)
- Confusing to AI (8 OAuth providers presented when user wants 1)
- Hard to maintain (single giant file vs modular structure)
FastMCP E2E Test Results:
- Monolithic SKILL.md: 666 lines / 20KB
- Human quality: A+ (96/100) - Excellent documentation
- AI quality: B+ (87/100) - Too large, redundancy issues
- Token waste: 77% on OAuth-specific queries (load 666 lines, use 150)
Proposed Solution
Two-Part Architecture:
-
Three-Stream Source Integration (NEW!)
- GitHub as multi-source provider
- Split: Code → C3.x, Docs → Markdown, Issues → Insights
- C3.x as depth mode (basic/deep), not separate tool
-
Router-Based Skill Structure
- 1 main router + N focused sub-skills
- 45% token reduction
- 100% content relevance
GitHub Repository
↓
Three-Stream Fetcher
├─ Code Stream → C3.x Analysis (patterns, examples)
├─ Docs Stream → README/docs/*.md (official docs)
└─ Issues Stream → Common problems + solutions
↓
Router Generator
├─ fastmcp (router - 150 lines)
├─ fastmcp-oauth (250 lines)
├─ fastmcp-async (200 lines)
├─ fastmcp-testing (250 lines)
└─ fastmcp-api (400 lines)
Benefits:
- 45% token reduction (20KB → 11KB avg per query)
- 100% relevance (only load needed sub-skill)
- GitHub insights (real user problems from issues)
- Complete coverage (code + docs + community knowledge)
Impact Metrics
| Metric | Before (Monolithic) | After (Router + 3-Stream) | Improvement |
|---|---|---|---|
| Average tokens/query | 20KB | 11KB | 45% reduction |
| Relevant content % | 23% (OAuth query) | 100% | 4.3x increase |
| Main skill size | 20KB | 5KB | 4x smaller |
| Data sources | 1 (code only) | 3 (code+docs+issues) | 3x richer |
| Common problems coverage | 0% | 100% (from issues) | New capability |
Table of Contents
- Source Architecture (NEW)
- Current State Analysis
- Proposed Router Architecture
- Data Flow & Algorithms
- Technical Implementation
- File Structure
- Filtering Strategies
- Quality Metrics
- Edge Cases & Solutions
- Scalability Analysis
- Migration Path
- Testing Strategy
- Implementation Phases
1. Source Architecture (NEW)
1.1 Rethinking Source Types
OLD (Confusing) Model:
Source Types:
1. Documentation (HTML scraping)
2. GitHub (basic analysis)
3. C3.x Codebase Analysis (deep analysis)
4. PDF
Problem: GitHub and C3.x both analyze code at different depths!
NEW (Correct) Model:
Source Types:
1. Documentation (HTML scraping from docs sites)
2. Codebase (local OR GitHub, with depth: basic/c3x)
3. PDF (supplementary)
Insight: GitHub is a SOURCE PROVIDER, C3.x is an ANALYSIS DEPTH
1.2 Three-Stream GitHub Architecture
Core Principle: GitHub repositories contain THREE types of valuable data:
┌─────────────────────────────────────────────────────────┐
│ GitHub Repository │
│ https://github.com/facebook/react │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────┐
│ GitHub Fetcher │
│ (Gets EVERYTHING) │
└─────────────────────────┘
↓
┌─────────────────────────┐
│ Intelligent Splitter │
└─────────────────────────┘
↓
┌─────────────────┴─────────────────┐
│ │
↓ ↓
┌───────────────┐ ┌────────────────┐
│ STREAM 1: │ │ STREAM 2: │
│ CODE │ │ DOCUMENTATION │
├───────────────┤ ├────────────────┤
│ *.py, *.js │ │ README.md │
│ *.tsx, *.go │ │ CONTRIBUTING.md│
│ *.rs, etc. │ │ docs/*.md │
│ │ │ *.rst │
│ → C3.x │ │ │
│ Analysis │ │ → Doc Parser │
│ (20-60 min) │ │ (1-2 min) │
└───────────────┘ └────────────────┘
↓
┌───────────────┐
│ STREAM 3: │
│ METADATA │
├───────────────┤
│ Open issues │
│ Closed issues │
│ Labels │
│ Stars, forks │
│ │
│ → Issue │
│ Analyzer │
│ (1-2 min) │
└───────────────┘
↓
┌───────────────┐
│ MERGER │
│ Combines all │
│ 3 streams │
└───────────────┘
1.3 Source Type Definitions (Revised)
Source Type 1: Documentation (HTML)
{
"type": "documentation",
"base_url": "https://react.dev/",
"selectors": {...},
"max_pages": 200
}
What it does:
- Scrapes HTML documentation sites
- Extracts structured content
- Time: 20-40 minutes
Source Type 2: Codebase (Unified)
{
"type": "codebase",
"source": "https://github.com/facebook/react", // OR "/path/to/local"
"analysis_depth": "c3x", // or "basic"
"fetch_github_metadata": true, // Issues, README, etc.
"split_docs": true // Separate markdown files as doc source
}
What it does:
-
Acquire source:
- If GitHub URL: Clone to
/tmp/repo/ - If local path: Use directly
- If GitHub URL: Clone to
-
Split into streams:
- Code stream:
*.py,*.js, etc. → C3.x or basic analysis - Docs stream:
README.md,docs/*.md→ Documentation parser - Metadata stream: Issues, stats → Insights extractor
- Code stream:
-
Analysis depth modes:
- basic (1-2 min): File structure, imports, entry points
- c3x (20-60 min): Full C3.x suite (patterns, examples, architecture)
Source Type 3: PDF (Supplementary)
{
"type": "pdf",
"url": "https://example.com/guide.pdf"
}
What it does:
- Extracts text and code from PDFs
- Adds as supplementary references
1.4 C3.x as Analysis Depth (Not Source Type)
Key Insight: C3.x is NOT a source type, it's an analysis depth level.
# OLD (Wrong)
sources = [
{"type": "github", ...}, # Basic analysis
{"type": "c3x_codebase", ...} # Deep analysis - CONFUSING!
]
# NEW (Correct)
sources = [
{
"type": "codebase",
"source": "https://github.com/facebook/react",
"analysis_depth": "c3x" # ← Depth, not type
}
]
Analysis Depth Modes:
| Mode | Time | Components | Use Case |
|---|---|---|---|
| basic | 1-2 min | File structure, imports, entry points | Quick overview, testing |
| c3x | 20-60 min | C3.1-C3.7 (patterns, examples, guides, configs, architecture) | Production skills |
1.5 GitHub Three-Stream Output
When you specify a GitHub codebase source:
{
"type": "codebase",
"source": "https://github.com/jlowin/fastmcp",
"analysis_depth": "c3x",
"fetch_github_metadata": true
}
You get THREE data streams automatically:
{
# STREAM 1: Code Analysis (C3.x)
"code_analysis": {
"patterns": [...], # 905 design patterns
"examples": [...], # 723 test examples
"architecture": {...}, # Service Layer Pattern
"api_reference": [...], # 316 API files
"configs": [...] # 45 config files
},
# STREAM 2: Documentation (from repo)
"documentation": {
"readme": "FastMCP is a Python framework...",
"contributing": "To contribute...",
"docs_files": [
{"path": "docs/getting-started.md", "content": "..."},
{"path": "docs/oauth.md", "content": "..."},
]
},
# STREAM 3: GitHub Insights
"github_insights": {
"metadata": {
"stars": 1234,
"forks": 56,
"open_issues": 12,
"language": "Python"
},
"common_problems": [
{"title": "OAuth setup fails", "issue": 42, "comments": 15},
{"title": "Async tools not working", "issue": 38, "comments": 8}
],
"known_solutions": [
{"title": "Fixed OAuth redirect", "issue": 35, "closed": true}
],
"top_labels": [
{"label": "question", "count": 23},
{"label": "bug", "count": 15}
]
}
}
1.6 Multi-Source Merging Strategy
Scenario: User provides both documentation URL AND GitHub repo
{
"sources": [
{
"type": "documentation",
"base_url": "https://fastmcp.dev/"
},
{
"type": "codebase",
"source": "https://github.com/jlowin/fastmcp",
"analysis_depth": "c3x",
"fetch_github_metadata": true
}
]
}
Result: 4 data streams to merge:
- HTML documentation (scraped docs site)
- Code analysis (C3.x from GitHub)
- Repo documentation (README/docs from GitHub)
- GitHub insights (issues, stats)
Merge Priority:
Priority 1: C3.x code analysis (ground truth - what code DOES)
Priority 2: HTML documentation (official intent - what code SHOULD do)
Priority 3: Repo documentation (README/docs - quick reference)
Priority 4: GitHub insights (community knowledge - common problems)
Conflict Resolution:
- If HTML docs say
GoogleProvider(app_id=...) - But C3.x code shows
GoogleProvider(client_id=...) - → Create hybrid content showing BOTH with warning
2. Current State Analysis
2.1 FastMCP E2E Test Output
Input: /tmp/fastmcp repository (361 files)
C3.x Analysis Results:
output/fastmcp-e2e-test_unified_data/c3_analysis_temp/
├── patterns/
│ └── detected_patterns.json (470KB, 905 pattern instances)
├── test_examples/
│ └── test_examples.json (698KB, 723 examples)
├── config_patterns/
│ └── config_patterns.json (45 config files)
├── api_reference/
│ └── *.md (316 API documentation files)
└── architecture/
└── architectural_patterns.json (Service Layer Pattern detected)
Generated Monolithic Skill:
output/fastmcp-e2e-test/
├── SKILL.md (666 lines, 20KB)
└── references/
├── index.md (3.6KB)
├── getting_started.md (6.9KB)
├── architecture.md (9.1KB)
├── patterns.md (16KB)
├── examples.md (10KB)
└── api.md (6.5KB)
2.2 Content Distribution Analysis
SKILL.md breakdown (666 lines):
- OAuth/Authentication: ~150 lines (23%)
- Async patterns: ~80 lines (12%)
- Testing: ~60 lines (9%)
- Design patterns: ~80 lines (12%)
- Architecture: ~70 lines (11%)
- Examples: ~120 lines (18%)
- Other: ~106 lines (15%)
Problem: User asking "How to add Google OAuth?" must load ALL 666 lines, but only 150 are relevant (77% waste).
2.3 What We're Missing (Without GitHub Insights)
Current approach: Only analyzes code
Missing valuable data:
- ❌ Common user problems (from open issues)
- ❌ Known solutions (from closed issues)
- ❌ Popular questions (from issue labels)
- ❌ Official quick start (from README)
- ❌ Contribution guide (from CONTRIBUTING.md)
- ❌ Repository popularity (stars, forks)
With three-stream GitHub architecture:
- ✅ All of the above automatically included
- ✅ "Common Issues" section in SKILL.md
- ✅ README content as quick reference
- ✅ Real user problems addressed
2.4 Token Usage Scenarios
Scenario 1: OAuth-specific query
- User: "How do I add Google OAuth to my FastMCP server?"
- Current: Load 666 lines (77% waste)
- With router: Load 150 lines router + 250 lines OAuth = 400 lines (40% waste)
- With GitHub insights: Also get issue #42 "OAuth setup fails" solution
Scenario 2: "What are common FastMCP problems?"
- Current: No way to answer (code analysis doesn't know user problems)
- With GitHub insights: Top 10 issues with solutions immediately available
3. Proposed Router Architecture
3.1 Router + Sub-Skills Structure
fastmcp/ # Main router skill
├── SKILL.md (150 lines) # Overview + routing logic
└── references/
├── index.md
└── common_issues.md # NEW: From GitHub issues
fastmcp-oauth/ # OAuth sub-skill
├── SKILL.md (250 lines) # OAuth-focused content
└── references/
├── oauth_overview.md # From C3.x + docs
├── google_provider.md # From C3.x examples
├── azure_provider.md # From C3.x examples
├── oauth_patterns.md # From C3.x patterns
└── oauth_issues.md # NEW: From GitHub issues
fastmcp-async/ # Async sub-skill
├── SKILL.md (200 lines)
└── references/
├── async_basics.md
├── async_patterns.md
├── decorator_pattern.md
└── async_issues.md # NEW: From GitHub issues
fastmcp-testing/ # Testing sub-skill
├── SKILL.md (250 lines)
└── references/
├── unit_tests.md
├── integration_tests.md
├── pytest_examples.md
└── testing_issues.md # NEW: From GitHub issues
fastmcp-api/ # API reference sub-skill
├── SKILL.md (400 lines)
└── references/
└── api_modules/
└── *.md (316 files)
3.2 Enhanced Router SKILL.md Template (With GitHub Insights)
---
name: fastmcp
description: FastMCP framework for building MCP servers - use this skill to learn FastMCP basics and route to specialized topics
---
# FastMCP - Python Framework for MCP Servers
**Repository:** https://github.com/jlowin/fastmcp
**Stars:** ⭐ 1,234 | **Language:** Python | **Open Issues:** 12
[From GitHub metadata - shows popularity and activity]
## When to Use This Skill
Use this skill when:
- You want an overview of FastMCP
- You need quick installation/setup steps
- You're deciding which FastMCP feature to use
- **Route to specialized skills for deep dives:**
- `fastmcp-oauth` - OAuth authentication (Google, Azure, GitHub)
- `fastmcp-async` - Async/await patterns
- `fastmcp-testing` - Unit and integration testing
- `fastmcp-api` - Complete API reference
## Quick Start (from README.md)
[Content extracted from GitHub README - official quick start]
## Common Issues (from GitHub)
Based on analysis of 100+ GitHub issues, here are the most common problems:
1. **OAuth provider configuration** (Issue #42, 15 comments)
- See `fastmcp-oauth` skill for solution
2. **Async tools not working** (Issue #38, 8 comments)
- See `fastmcp-async` skill for solution
[From GitHub issue analysis - real user problems]
## Choose Your Path
**Need authentication?** → Use `fastmcp-oauth` skill
**Building async tools?** → Use `fastmcp-async` skill
**Writing tests?** → Use `fastmcp-testing` skill
**Looking up API details?** → Use `fastmcp-api` skill
## Architecture Overview
FastMCP uses a Service Layer Pattern with 206 Strategy pattern instances.
[From C3.7 architecture analysis]
## Next Steps
[Links to sub-skills with trigger keywords]
Size target: 150 lines / 5KB
Data sources used:
- ✅ GitHub metadata (stars, issues count)
- ✅ README.md (quick start)
- ✅ GitHub issues (common problems)
- ✅ C3.7 architecture (pattern info)
3.3 Enhanced Sub-Skill Template (OAuth Example)
---
name: fastmcp-oauth
description: OAuth authentication for FastMCP servers - Google, Azure, GitHub providers with Strategy pattern
triggers: ["oauth", "authentication", "google provider", "azure provider", "auth provider"]
---
# FastMCP OAuth Authentication
## When to Use This Skill
Use when implementing OAuth authentication in FastMCP servers.
## Quick Reference (from C3.x examples)
[5 OAuth examples from test files - real code]
## Common OAuth Issues (from GitHub)
**Issue #42: OAuth setup fails with Google provider**
- Problem: Redirect URI mismatch
- Solution: Use `http://localhost:8000/oauth/callback` in Google Console
- Status: Solved (12 comments)
**Issue #38: Azure provider 401 error**
- Problem: Wrong tenant_id
- Solution: Check Azure AD tenant ID matches config
- Status: Solved (8 comments)
[From GitHub closed issues - real solutions]
## Supported Providers (from C3.x + README)
### Google OAuth
**Official docs say:** (from README.md)
```python
GoogleProvider(app_id="...", app_secret="...")
Current implementation: (from C3.x analysis, confidence: 95%)
GoogleProvider(client_id="...", client_secret="...")
⚠️ Conflict detected: Parameter names changed. Use current implementation.
[Hybrid content showing both docs and code]
Azure OAuth (from C3.x analysis)
[Azure-specific example with real code from tests]
Design Patterns (from C3.x)
Strategy Pattern (206 instances in FastMCP)
[Strategy pattern explanation with OAuth context]
Factory Pattern (142 instances in FastMCP)
[Factory pattern for provider creation]
Testing OAuth (from C3.2 test examples)
[OAuth testing examples from test files]
See Also
- Main
fastmcpskill for overview fastmcp-testingskill for authentication testing patterns
**Size target:** 250 lines / 8KB
**Data sources used:**
- ✅ C3.x test examples (real code)
- ✅ README.md (official docs)
- ✅ GitHub issues (common problems + solutions)
- ✅ C3.x patterns (design patterns)
- ✅ Conflict detection (docs vs code)
---
## 4. Data Flow & Algorithms
### 4.1 Complete Pipeline (Enhanced with Three-Stream)
INPUT: User provides GitHub repo URL │ ▼ ACQUISITION PHASE (GitHub Fetcher) │ ├─ Clone repository to /tmp/repo/ ├─ Fetch GitHub API metadata (stars, issues, labels) ├─ Fetch open issues (common problems) └─ Fetch closed issues (known solutions) │ ▼ STREAM SPLITTING PHASE │ ├─ STREAM 1: Code Files │ ├─ Filter: *.py, *.js, *.ts, .go, .rs, etc. │ └─ Exclude: docs/, tests/, node_modules/, etc. │ ├─ STREAM 2: Documentation Files │ ├─ README.md │ ├─ CONTRIBUTING.md │ ├─ docs/.md │ └─ .rst │ └─ STREAM 3: GitHub Metadata ├─ Open issues (common problems) ├─ Closed issues (solutions) ├─ Issue labels (categories) └─ Repository stats (stars, forks, language) │ ▼ PARALLEL ANALYSIS PHASE │ ├─ Thread 1: C3.x Code Analysis (20-60 min) │ ├─ Input: Code files from Stream 1 │ ├─ C3.1: Detect design patterns (905 instances) │ ├─ C3.2: Extract test examples (723 examples) │ ├─ C3.3: Build how-to guides (if working) │ ├─ C3.4: Analyze config files (45 configs) │ └─ C3.7: Detect architecture (Service Layer) │ ├─ Thread 2: Documentation Processing (1-2 min) │ ├─ Input: Markdown files from Stream 2 │ ├─ Parse README.md → Quick start section │ ├─ Parse CONTRIBUTING.md → Contribution guide │ └─ Parse docs/.md → Additional references │ └─ Thread 3: Issue Analysis (1-2 min) ├─ Input: Issues from Stream 3 ├─ Categorize by label (bug, question, enhancement) ├─ Identify top 10 common problems (open issues) └─ Extract solutions (closed issues with comments) │ ▼ MERGE PHASE │ ├─ Combine all 3 streams ├─ Detect conflicts (docs vs code) ├─ Create hybrid content (show both versions) └─ Build cross-references │ ▼ ARCHITECTURE DECISION │ ├─ Should use router? │ └─ YES (estimated 666 lines > 200 threshold) │ ▼ TOPIC DEFINITION PHASE │ ├─ Analyze pattern distribution → OAuth, Async dominant ├─ Analyze example categories → Testing has 723 examples ├─ Analyze issue labels → "oauth", "async", "testing" top labels └─ Define 4 topics: OAuth, Async, Testing, API │ ▼ FILTERING PHASE (Multi-Stage) │ ├─ Stage 1: Keyword Matching (broad) ├─ Stage 2: Relevance Scoring (precision) ├─ Stage 3: Confidence Filtering (quality ≥ 0.8) └─ Stage 4: Diversity Selection (coverage) │ ▼ CROSS-REFERENCE RESOLUTION │ ├─ Identify items in multiple topics ├─ Assign primary topic (highest priority) └─ Create secondary mentions (links) │ ▼ SUB-SKILL GENERATION │ ├─ For each topic: │ ├─ Apply topic template │ ├─ Include filtered patterns/examples │ ├─ Add GitHub issues for this topic │ ├─ Add README content if relevant │ └─ Generate references/ │ ▼ ROUTER GENERATION │ ├─ Extract routing keywords ├─ Add README quick start ├─ Add top 5 common issues ├─ Create routing table └─ Generate scenarios │ ▼ ENHANCEMENT PHASE (Multi-Stage AI) │ ├─ Stage 1: Source Enrichment (Premium) │ └─ AI resolves conflicts, ranks examples │ ├─ Stage 2: Sub-Skill Enhancement (Standard) │ └─ AI enhances each SKILL.md │ └─ Stage 3: Router Enhancement (Required) └─ AI enhances router logic │ ▼ PACKAGING PHASE │ ├─ Validate quality (size, examples, cross-refs) ├─ Package router → fastmcp.zip ├─ Package sub-skills → fastmcp-.zip └─ Create upload manifest │ ▼ OUTPUT ├─ fastmcp.zip (router) ├─ fastmcp-oauth.zip ├─ fastmcp-async.zip ├─ fastmcp-testing.zip └─ fastmcp-api.zip
### 4.2 GitHub Three-Stream Fetcher Algorithm
```python
class GitHubThreeStreamFetcher:
"""
Fetch from GitHub and split into 3 streams.
Outputs:
- Stream 1: Code (for C3.x)
- Stream 2: Docs (for doc parser)
- Stream 3: Insights (for issue analyzer)
"""
def fetch(self, repo_url: str) -> ThreeStreamData:
"""
Main fetching algorithm.
Steps:
1. Clone repository
2. Fetch GitHub API data
3. Classify files into code vs docs
4. Analyze issues
5. Return 3 streams
"""
# STEP 1: Clone repository
print(f"📦 Cloning {repo_url}...")
local_path = self.clone_repo(repo_url)
# STEP 2: Fetch GitHub metadata
print(f"🔍 Fetching GitHub metadata...")
metadata = self.fetch_github_metadata(repo_url)
issues = self.fetch_issues(repo_url, max_issues=100)
# STEP 3: Classify files
print(f"📂 Classifying files...")
code_files, doc_files = self.classify_files(local_path)
print(f" - Code: {len(code_files)} files")
print(f" - Docs: {len(doc_files)} files")
# STEP 4: Analyze issues
print(f"🐛 Analyzing {len(issues)} issues...")
issue_insights = self.analyze_issues(issues)
# STEP 5: Return 3 streams
return ThreeStreamData(
code_stream=CodeStream(
directory=local_path,
files=code_files
),
docs_stream=DocsStream(
readme=self.read_file(local_path / 'README.md'),
contributing=self.read_file(local_path / 'CONTRIBUTING.md'),
docs_files=[self.read_file(f) for f in doc_files]
),
insights_stream=InsightsStream(
metadata=metadata,
common_problems=issue_insights['common_problems'],
known_solutions=issue_insights['known_solutions'],
top_labels=issue_insights['top_labels']
)
)
def classify_files(self, repo_path: Path) -> tuple[List[Path], List[Path]]:
"""
Split files into code vs documentation.
Code patterns:
- *.py, *.js, *.ts, *.go, *.rs, *.java, etc.
- In src/, lib/, pkg/, etc.
Doc patterns:
- README.md, CONTRIBUTING.md, CHANGELOG.md
- docs/**/*.md, doc/**/*.md
- *.rst (reStructuredText)
"""
code_files = []
doc_files = []
# Documentation patterns
doc_patterns = [
'**/README.md',
'**/CONTRIBUTING.md',
'**/CHANGELOG.md',
'**/LICENSE.md',
'docs/**/*.md',
'doc/**/*.md',
'documentation/**/*.md',
'**/*.rst',
]
# Code patterns (by extension)
code_extensions = [
'.py', '.js', '.ts', '.jsx', '.tsx',
'.go', '.rs', '.java', '.kt',
'.c', '.cpp', '.h', '.hpp',
'.rb', '.php', '.swift'
]
for file in repo_path.rglob('*'):
if not file.is_file():
continue
# Skip hidden files and common excludes
if any(part.startswith('.') for part in file.parts):
continue
if any(exclude in str(file) for exclude in ['node_modules', '__pycache__', 'venv']):
continue
# Check if documentation
is_doc = any(file.match(pattern) for pattern in doc_patterns)
if is_doc:
doc_files.append(file)
elif file.suffix in code_extensions:
code_files.append(file)
return code_files, doc_files
def analyze_issues(self, issues: List[Dict]) -> Dict:
"""
Analyze GitHub issues to extract insights.
Returns:
{
"common_problems": [
{
"title": "OAuth setup fails",
"number": 42,
"labels": ["question", "oauth"],
"comments": 15,
"state": "open"
},
...
],
"known_solutions": [
{
"title": "Fixed OAuth redirect",
"number": 35,
"labels": ["bug", "oauth"],
"solution": "Check redirect URI in Google Console",
"state": "closed"
},
...
],
"top_labels": [
{"label": "question", "count": 23},
{"label": "bug", "count": 15},
...
]
}
"""
common_problems = []
known_solutions = []
all_labels = []
for issue in issues:
labels = issue.get('labels', [])
all_labels.extend(labels)
# Open issues with many comments = common problems
if issue['state'] == 'open' and issue.get('comments', 0) > 5:
common_problems.append({
'title': issue['title'],
'number': issue['number'],
'labels': labels,
'comments': issue['comments'],
'state': 'open'
})
# Closed issues with comments = known solutions
elif issue['state'] == 'closed' and issue.get('comments', 0) > 0:
known_solutions.append({
'title': issue['title'],
'number': issue['number'],
'labels': labels,
'comments': issue['comments'],
'state': 'closed'
})
# Count label frequency
from collections import Counter
label_counts = Counter(all_labels)
return {
'common_problems': sorted(common_problems, key=lambda x: x['comments'], reverse=True)[:10],
'known_solutions': sorted(known_solutions, key=lambda x: x['comments'], reverse=True)[:10],
'top_labels': [
{'label': label, 'count': count}
for label, count in label_counts.most_common(10)
]
}
4.3 Multi-Source Merge Algorithm (Enhanced)
class EnhancedSourceMerger:
"""
Merge data from all sources with conflict detection.
Sources:
1. HTML documentation (if provided)
2. GitHub code stream (C3.x)
3. GitHub docs stream (README/docs)
4. GitHub insights stream (issues)
"""
def merge(
self,
html_docs: Optional[Dict],
github_three_streams: Optional[ThreeStreamData]
) -> MergedSkillData:
"""
Merge all sources with priority:
1. C3.x code (ground truth)
2. HTML docs (official intent)
3. GitHub docs (repo documentation)
4. GitHub insights (community knowledge)
"""
merged = MergedSkillData()
# LAYER 1: GitHub Code Stream (C3.x) - Ground Truth
if github_three_streams and github_three_streams.code_stream:
print("📊 Layer 1: C3.x code analysis")
c3x_data = self.run_c3x_analysis(github_three_streams.code_stream)
merged.patterns = c3x_data['patterns']
merged.examples = c3x_data['examples']
merged.architecture = c3x_data['architecture']
merged.api_reference = c3x_data['api_files']
merged.source_priority['c3x_code'] = 1 # Highest
# LAYER 2: HTML Documentation - Official Intent
if html_docs:
print("📚 Layer 2: HTML documentation")
for topic, content in html_docs.items():
if topic in merged.topics:
# Detect conflicts with C3.x
conflicts = self.detect_conflicts(
code_version=merged.topics[topic],
docs_version=content
)
if conflicts:
merged.conflicts.append(conflicts)
# Create hybrid (show both)
merged.topics[topic] = self.create_hybrid(
code=merged.topics[topic],
docs=content,
conflicts=conflicts
)
else:
# Enrich with docs
merged.topics[topic].add_documentation(content)
else:
merged.topics[topic] = content
merged.source_priority['html_docs'] = 2
# LAYER 3: GitHub Docs Stream - Repo Documentation
if github_three_streams and github_three_streams.docs_stream:
print("📄 Layer 3: GitHub documentation")
docs = github_three_streams.docs_stream
# Add README quick start
merged.quick_start = docs.readme
# Add contribution guide
merged.contributing = docs.contributing
# Add docs/ files as references
for doc_file in docs.docs_files:
merged.references.append({
'source': 'github_docs',
'content': doc_file,
'priority': 3
})
merged.source_priority['github_docs'] = 3
# LAYER 4: GitHub Insights Stream - Community Knowledge
if github_three_streams and github_three_streams.insights_stream:
print("🐛 Layer 4: GitHub insights")
insights = github_three_streams.insights_stream
# Add common problems
merged.common_problems = insights.common_problems
merged.known_solutions = insights.known_solutions
# Add metadata
merged.metadata = insights.metadata
# Categorize issues by topic
merged.issues_by_topic = self.categorize_issues_by_topic(
problems=insights.common_problems,
solutions=insights.known_solutions,
topics=merged.topics.keys()
)
merged.source_priority['github_insights'] = 4
return merged
def categorize_issues_by_topic(
self,
problems: List[Dict],
solutions: List[Dict],
topics: List[str]
) -> Dict[str, List[Dict]]:
"""
Categorize issues by topic using label/title matching.
Example:
- Issue "OAuth setup fails" → oauth topic
- Issue "Async tools error" → async topic
"""
categorized = {topic: [] for topic in topics}
all_issues = problems + solutions
for issue in all_issues:
title_lower = issue['title'].lower()
labels_lower = [l.lower() for l in issue.get('labels', [])]
# Match to topic by keywords
for topic in topics:
topic_keywords = self.get_topic_keywords(topic)
# Check title and labels
if any(kw in title_lower for kw in topic_keywords):
categorized[topic].append(issue)
continue
if any(kw in label for label in labels_lower for kw in topic_keywords):
categorized[topic].append(issue)
continue
return categorized
def get_topic_keywords(self, topic: str) -> List[str]:
"""Get keywords for each topic."""
keywords = {
'oauth': ['oauth', 'auth', 'provider', 'google', 'azure', 'token'],
'async': ['async', 'await', 'asynchronous', 'concurrent'],
'testing': ['test', 'pytest', 'mock', 'fixture'],
'api': ['api', 'reference', 'function', 'class']
}
return keywords.get(topic, [])
4.4 Topic Definition Algorithm (Enhanced with GitHub Insights)
def define_topics_enhanced(
base_name: str,
c3x_data: Dict,
github_insights: Optional[InsightsStream]
) -> Dict[str, TopicConfig]:
"""
Auto-detect topics using:
1. C3.x pattern distribution
2. C3.x example categories
3. GitHub issue labels (NEW!)
Example: If GitHub has 23 "oauth" labeled issues,
that's strong signal OAuth is important topic.
"""
topics = {}
# Analyze C3.x patterns
pattern_counts = count_patterns_by_keyword(c3x_data['patterns'])
# Analyze C3.x examples
example_categories = categorize_examples(c3x_data['examples'])
# Analyze GitHub issue labels (NEW!)
issue_label_counts = {}
if github_insights:
for label_info in github_insights.top_labels:
issue_label_counts[label_info['label']] = label_info['count']
# TOPIC 1: OAuth (if significant)
oauth_signals = (
pattern_counts.get('auth', 0) +
example_categories.get('auth', 0) +
issue_label_counts.get('oauth', 0) * 2 # Issues weighted 2x
)
if oauth_signals > 50:
topics['oauth'] = TopicConfig(
keywords=['auth', 'oauth', 'provider', 'token'],
patterns=['Strategy', 'Factory'],
target_length=250,
priority=1,
github_issue_count=issue_label_counts.get('oauth', 0) # NEW
)
# TOPIC 2: Async (if significant)
async_signals = (
pattern_counts.get('async', 0) +
example_categories.get('async', 0) +
issue_label_counts.get('async', 0) * 2
)
if async_signals > 30:
topics['async'] = TopicConfig(
keywords=['async', 'await'],
patterns=['Decorator'],
target_length=200,
priority=2,
github_issue_count=issue_label_counts.get('async', 0)
)
# TOPIC 3: Testing (if examples exist)
if example_categories.get('test', 0) > 50:
topics['testing'] = TopicConfig(
keywords=['test', 'mock', 'pytest'],
patterns=[],
target_length=250,
priority=3,
github_issue_count=issue_label_counts.get('testing', 0)
)
# TOPIC 4: API Reference (always)
topics['api'] = TopicConfig(
keywords=[],
patterns=[],
target_length=400,
priority=4,
github_issue_count=0
)
return topics
5. Technical Implementation
5.1 Core Classes (Enhanced)
# src/skill_seekers/cli/github_fetcher.py
from dataclasses import dataclass
from typing import List, Dict, Optional
from pathlib import Path
@dataclass
class CodeStream:
"""Code files for C3.x analysis."""
directory: Path
files: List[Path]
@dataclass
class DocsStream:
"""Documentation files from repository."""
readme: Optional[str]
contributing: Optional[str]
docs_files: List[Dict] # [{"path": "docs/oauth.md", "content": "..."}]
@dataclass
class InsightsStream:
"""GitHub metadata and issues."""
metadata: Dict # stars, forks, language, etc.
common_problems: List[Dict]
known_solutions: List[Dict]
top_labels: List[Dict]
@dataclass
class ThreeStreamData:
"""Complete output from GitHub fetcher."""
code_stream: CodeStream
docs_stream: DocsStream
insights_stream: InsightsStream
class GitHubThreeStreamFetcher:
"""
Fetch from GitHub and split into 3 streams.
Usage:
fetcher = GitHubThreeStreamFetcher(
repo_url="https://github.com/facebook/react",
github_token=os.getenv('GITHUB_TOKEN')
)
three_streams = fetcher.fetch()
# Now you have:
# - three_streams.code_stream (for C3.x)
# - three_streams.docs_stream (for doc parser)
# - three_streams.insights_stream (for issue analyzer)
"""
def __init__(self, repo_url: str, github_token: Optional[str] = None):
self.repo_url = repo_url
self.github_token = github_token
self.owner, self.repo = self.parse_repo_url(repo_url)
def fetch(self, output_dir: Path = Path('/tmp')) -> ThreeStreamData:
"""Fetch everything and split into 3 streams."""
# Implementation from section 4.2
pass
def clone_repo(self, output_dir: Path) -> Path:
"""Clone repository to local directory."""
# Implementation from section 4.2
pass
def fetch_github_metadata(self) -> Dict:
"""Fetch repo metadata via GitHub API."""
url = f"https://api.github.com/repos/{self.owner}/{self.repo}"
headers = {}
if self.github_token:
headers['Authorization'] = f'token {self.github_token}'
response = requests.get(url, headers=headers)
return response.json()
def fetch_issues(self, max_issues: int = 100) -> List[Dict]:
"""Fetch GitHub issues (open + closed)."""
# Implementation from section 4.2
pass
def classify_files(self, repo_path: Path) -> tuple[List[Path], List[Path]]:
"""Split files into code vs documentation."""
# Implementation from section 4.2
pass
def analyze_issues(self, issues: List[Dict]) -> Dict:
"""Analyze issues to extract insights."""
# Implementation from section 4.2
pass
# src/skill_seekers/cli/unified_codebase_analyzer.py
class UnifiedCodebaseAnalyzer:
"""
Unified analyzer for ANY codebase (local or GitHub).
Key insight: C3.x is a DEPTH MODE, not a source type.
Usage:
analyzer = UnifiedCodebaseAnalyzer()
# Analyze from GitHub
result = analyzer.analyze(
source="https://github.com/facebook/react",
depth="c3x",
fetch_github_metadata=True
)
# Analyze local directory
result = analyzer.analyze(
source="/path/to/project",
depth="c3x"
)
# Quick basic analysis
result = analyzer.analyze(
source="/path/to/project",
depth="basic"
)
"""
def analyze(
self,
source: str, # GitHub URL or local path
depth: str = 'c3x', # 'basic' or 'c3x'
fetch_github_metadata: bool = True
) -> Dict:
"""
Analyze codebase with specified depth.
Returns unified result with all available streams.
"""
# Step 1: Acquire source
if self.is_github_url(source):
# Use three-stream fetcher
fetcher = GitHubThreeStreamFetcher(source)
three_streams = fetcher.fetch()
code_directory = three_streams.code_stream.directory
github_data = {
'docs': three_streams.docs_stream,
'insights': three_streams.insights_stream
}
else:
# Local directory
code_directory = Path(source)
github_data = None
# Step 2: Analyze code with specified depth
if depth == 'basic':
code_analysis = self.basic_analysis(code_directory)
elif depth == 'c3x':
code_analysis = self.c3x_analysis(code_directory)
else:
raise ValueError(f"Unknown depth: {depth}")
# Step 3: Combine results
result = {
'code_analysis': code_analysis,
'github_docs': github_data['docs'] if github_data else None,
'github_insights': github_data['insights'] if github_data else None,
}
return result
def basic_analysis(self, directory: Path) -> Dict:
"""
Fast, shallow analysis (1-2 min).
Returns:
- File structure
- Imports
- Entry points
"""
return {
'files': self.list_files(directory),
'structure': self.get_directory_structure(directory),
'imports': self.extract_imports(directory),
'entry_points': self.find_entry_points(directory),
'analysis_time': '1-2 min',
'analysis_depth': 'basic'
}
def c3x_analysis(self, directory: Path) -> Dict:
"""
Deep C3.x analysis (20-60 min).
Returns:
- Everything from basic
- C3.1: Design patterns
- C3.2: Test examples
- C3.3: How-to guides
- C3.4: Config patterns
- C3.7: Architecture
"""
# Start with basic
basic = self.basic_analysis(directory)
# Add C3.x components
c3x = {
**basic,
'c3_1_patterns': self.detect_patterns(directory),
'c3_2_examples': self.extract_test_examples(directory),
'c3_3_guides': self.build_how_to_guides(directory),
'c3_4_configs': self.analyze_configs(directory),
'c3_7_architecture': self.detect_architecture(directory),
'analysis_time': '20-60 min',
'analysis_depth': 'c3x'
}
return c3x
def is_github_url(self, source: str) -> bool:
"""Check if source is a GitHub URL."""
return 'github.com' in source
# src/skill_seekers/cli/c3x_to_router.py (Enhanced)
class EnhancedC3xToRouterPipeline:
"""
Enhanced pipeline with three-stream GitHub support.
New capabilities:
- Integrates GitHub docs (README, CONTRIBUTING)
- Adds GitHub issues to "Common Problems" sections
- Shows repository stats in overview
- Categorizes issues by topic
"""
def __init__(
self,
analysis_dir: Path,
output_dir: Path,
github_data: Optional[ThreeStreamData] = None
):
self.analysis_dir = Path(analysis_dir)
self.output_dir = Path(output_dir)
self.github_data = github_data
self.c3x_data = self.load_c3x_data()
def run(self, base_name: str) -> Dict[str, Path]:
"""
Execute complete pipeline with GitHub integration.
Enhanced steps:
1. Define topics (using C3.x + GitHub issue labels)
2. Filter data for each topic
3. Categorize GitHub issues by topic
4. Resolve cross-references
5. Generate sub-skills (with GitHub issues)
6. Generate router (with README + top issues)
7. Validate quality
"""
print(f"🚀 Starting Enhanced C3.x to Router pipeline for {base_name}")
# Step 1: Define topics (enhanced with GitHub insights)
topics = self.define_topics_enhanced(
base_name,
github_insights=self.github_data.insights_stream if self.github_data else None
)
print(f"📋 Defined {len(topics)} topics: {list(topics.keys())}")
# Step 2: Filter data for each topic
filtered_data = {}
for topic_name, topic_config in topics.items():
print(f"🔍 Filtering data for topic: {topic_name}")
filtered_data[topic_name] = self.filter_for_topic(topic_config)
# Step 3: Categorize GitHub issues by topic (NEW!)
if self.github_data:
print(f"🐛 Categorizing GitHub issues by topic")
issues_by_topic = self.categorize_issues_by_topic(
insights=self.github_data.insights_stream,
topics=list(topics.keys())
)
# Add to filtered data
for topic_name, issues in issues_by_topic.items():
if topic_name in filtered_data:
filtered_data[topic_name].github_issues = issues
# Step 4: Resolve cross-references
print(f"🔗 Resolving cross-references")
filtered_data = self.resolve_cross_references(filtered_data, topics)
# Step 5: Generate sub-skills (with GitHub issues)
skill_paths = {}
for topic_name, data in filtered_data.items():
print(f"📝 Generating sub-skill: {base_name}-{topic_name}")
skill_path = self.generate_sub_skill_enhanced(
base_name, topic_name, data, topics[topic_name]
)
skill_paths[f"{base_name}-{topic_name}"] = skill_path
# Step 6: Generate router (with README + top issues)
print(f"🧭 Generating router skill: {base_name}")
router_path = self.generate_router_enhanced(
base_name,
list(skill_paths.keys()),
github_docs=self.github_data.docs_stream if self.github_data else None,
github_insights=self.github_data.insights_stream if self.github_data else None
)
skill_paths[base_name] = router_path
# Step 7: Quality validation
print(f"✅ Validating quality")
self.validate_quality(skill_paths)
print(f"🎉 Pipeline complete! Generated {len(skill_paths)} skills")
return skill_paths
def generate_sub_skill_enhanced(
self,
base_name: str,
topic_name: str,
data: FilteredData,
config: TopicConfig
) -> Path:
"""
Generate sub-skill with GitHub issues integrated.
Adds new section: "Common Issues (from GitHub)"
"""
output_dir = self.output_dir / f"{base_name}-{topic_name}"
output_dir.mkdir(parents=True, exist_ok=True)
# Use topic-specific template
template = self.get_topic_template(topic_name)
# Generate SKILL.md with GitHub issues
skill_md = template.render(
base_name=base_name,
topic_name=topic_name,
data=data,
config=config,
github_issues=data.github_issues if hasattr(data, 'github_issues') else [] # NEW
)
# Write SKILL.md
skill_file = output_dir / 'SKILL.md'
skill_file.write_text(skill_md)
# Generate reference files (including GitHub issues)
self.generate_references_enhanced(output_dir, data)
return output_dir
def generate_router_enhanced(
self,
base_name: str,
sub_skills: List[str],
github_docs: Optional[DocsStream],
github_insights: Optional[InsightsStream]
) -> Path:
"""
Generate router with:
- README quick start
- Top 5 GitHub issues
- Repository stats
"""
output_dir = self.output_dir / base_name
output_dir.mkdir(parents=True, exist_ok=True)
# Generate router SKILL.md
router_md = self.create_router_md_enhanced(
base_name,
sub_skills,
github_docs,
github_insights
)
# Write SKILL.md
skill_file = output_dir / 'SKILL.md'
skill_file.write_text(router_md)
# Generate reference files
refs_dir = output_dir / 'references'
refs_dir.mkdir(exist_ok=True)
# Add index
(refs_dir / 'index.md').write_text(self.create_router_index(sub_skills))
# Add common issues (NEW!)
if github_insights:
(refs_dir / 'common_issues.md').write_text(
self.create_common_issues_reference(github_insights)
)
return output_dir
def create_router_md_enhanced(
self,
base_name: str,
sub_skills: List[str],
github_docs: Optional[DocsStream],
github_insights: Optional[InsightsStream]
) -> str:
"""Create router SKILL.md with GitHub integration."""
# Extract repo URL from github_insights
repo_url = f"https://github.com/{base_name}" # Simplified
md = f"""---
name: {base_name}
description: {base_name.upper()} framework - use for overview and routing to specialized topics
---
# {base_name.upper()} - Overview
"""
# Add GitHub metadata (if available)
if github_insights:
metadata = github_insights.metadata
md += f"""**Repository:** {repo_url}
**Stars:** ⭐ {metadata.get('stars', 0)} | **Language:** {metadata.get('language', 'Unknown')} | **Open Issues:** {metadata.get('open_issues', 0)}
"""
md += """## When to Use This Skill
Use this skill when:
- You want an overview of """ + base_name.upper() + """
- You need quick installation/setup steps
- You're deciding which feature to use
- **Route to specialized skills for deep dives**
"""
# Add Quick Start from README (if available)
if github_docs and github_docs.readme:
md += f"""## Quick Start (from README)
{github_docs.readme[:500]}... <!-- Truncated -->
"""
# Add Common Issues (if available)
if github_insights and github_insights.common_problems:
md += """## Common Issues (from GitHub)
Based on analysis of GitHub issues:
"""
for i, problem in enumerate(github_insights.common_problems[:5], 1):
topic_hint = self.guess_topic_from_issue(problem, sub_skills)
md += f"""{i}. **{problem['title']}** (Issue #{problem['number']}, {problem['comments']} comments)
- See `{topic_hint}` skill for details
"""
# Add routing table
md += """## Choose Your Path
"""
for skill_name in sub_skills:
if skill_name == base_name:
continue
topic = skill_name.replace(f"{base_name}-", "")
md += f"""**{topic.title()}?** → Use `{skill_name}` skill
"""
# Add architecture overview
if self.c3x_data.get('architecture'):
arch = self.c3x_data['architecture']
md += f"""
## Architecture Overview
{base_name.upper()} uses a {arch.get('primary_pattern', 'layered')} architecture.
"""
return md
def guess_topic_from_issue(self, issue: Dict, sub_skills: List[str]) -> str:
"""Guess which sub-skill an issue belongs to."""
title_lower = issue['title'].lower()
labels_lower = [l.lower() for l in issue.get('labels', [])]
for skill_name in sub_skills:
topic = skill_name.split('-')[-1] # Extract topic from skill name
if topic in title_lower or topic in str(labels_lower):
return skill_name
# Default to main skill
return sub_skills[0] if sub_skills else 'main'
5.2 Enhanced Topic Templates (With GitHub Issues)
# src/skill_seekers/cli/topic_templates.py (Enhanced)
class EnhancedOAuthTemplate(TopicTemplate):
"""Enhanced OAuth template with GitHub issues."""
TEMPLATE = """---
name: {{ base_name }}-{{ topic_name }}
description: {{ base_name.upper() }} {{ topic_name }} - OAuth authentication with multiple providers
triggers: {{ triggers }}
---
# {{ base_name.upper() }} OAuth Authentication
## When to Use This Skill
Use this skill when implementing OAuth authentication in {{ base_name }} servers.
## Quick Reference (from C3.x examples)
{% for example in top_examples[:5] %}
### {{ example.title }}
```{{ example.language }}
{{ example.code }}
{{ example.description }}
{% endfor %}
Common OAuth Issues (from GitHub)
{% if github_issues %} Based on {{ github_issues|length }} GitHub issues related to OAuth:
{% for issue in github_issues[:5] %} Issue #{{ issue.number }}: {{ issue.title }}
- Status: {{ issue.state }}
- Comments: {{ issue.comments }} {% if issue.state == 'closed' %}
- ✅ Solution found (see issue for details) {% else %}
- ⚠️ Open issue - community discussion ongoing {% endif %}
{% endfor %}
{% endif %}
Supported Providers
{% for provider in providers %}
{{ provider.name }}
From C3.x analysis:
{{ provider.example_code }}
Key features: {% for feature in provider.features %}
- {{ feature }} {% endfor %}
{% endfor %}
Design Patterns
{% for pattern in patterns %}
{{ pattern.name }} ({{ pattern.count }} instances)
{{ pattern.description }}
Example:
{{ pattern.example }}
{% endfor %}
Testing OAuth
{% for test_example in test_examples[:10] %}
{{ test_example.name }}
{{ test_example.code }}
{% endfor %}
See Also
-
Main {{ base_name }} skill for overview
-
{{ base_name }}-testing for authentication testing patterns """
def render( self, base_name: str, topic_name: str, data: FilteredData, config: TopicConfig, github_issues: List[Dict] = [] # NEW parameter ) -> str: """Render template with GitHub issues.""" template = Template(self.TEMPLATE)
# Extract data (existing) top_examples = self.extract_top_examples(data.examples) providers = self.extract_providers(data.patterns, data.examples) patterns = self.extract_patterns(data.patterns) test_examples = self.extract_test_examples(data.examples) triggers = self.extract_triggers(topic_name) # Render with GitHub issues return template.render( base_name=base_name, topic_name=topic_name, top_examples=top_examples, providers=providers, patterns=patterns, test_examples=test_examples, triggers=triggers, github_issues=github_issues # NEW )
---
## 6. File Structure (Enhanced)
### 6.1 Input Structure (Three-Stream)
GitHub Repository (https://github.com/jlowin/fastmcp) ↓ (after fetching)
/tmp/fastmcp/ # Cloned repository ├── src/ # Code stream │ └── .py ├── tests/ # Code stream │ └── test_.py ├── README.md # Docs stream ├── CONTRIBUTING.md # Docs stream ├── docs/ # Docs stream │ ├── getting-started.md │ ├── oauth.md │ └── async.md └── .github/ └── ... (ignored)
Plus GitHub API data: # Insights stream ├── Repository metadata │ ├── stars: 1234 │ ├── forks: 56 │ ├── open_issues: 12 │ └── language: Python ├── Issues (100 fetched) │ ├── Open: 12 │ └── Closed: 88 └── Labels ├── oauth: 15 issues ├── async: 8 issues └── testing: 6 issues
After splitting:
STREAM 1: Code Analysis Input /tmp/fastmcp_code_stream/ ├── patterns/detected_patterns.json (from C3.x) ├── test_examples/test_examples.json (from C3.x) ├── config_patterns/config_patterns.json (from C3.x) ├── api_reference/*.md (from C3.x) └── architecture/architectural_patterns.json (from C3.x)
STREAM 2: Documentation Input /tmp/fastmcp_docs_stream/ ├── README.md ├── CONTRIBUTING.md └── docs/ ├── getting-started.md ├── oauth.md └── async.md
STREAM 3: Insights Input /tmp/fastmcp_insights_stream/ ├── metadata.json ├── common_problems.json ├── known_solutions.json └── top_labels.json
### 6.2 Output Structure (Enhanced)
output/ ├── fastmcp/ # Router skill (ENHANCED) │ ├── SKILL.md (150 lines) │ │ └── Includes: README quick start + top 5 GitHub issues │ └── references/ │ ├── index.md │ └── common_issues.md # NEW: From GitHub insights │ ├── fastmcp-oauth/ # OAuth sub-skill (ENHANCED) │ ├── SKILL.md (250 lines) │ │ └── Includes: C3.x + GitHub OAuth issues │ └── references/ │ ├── oauth_overview.md # From C3.x + README │ ├── google_provider.md # From C3.x examples │ ├── azure_provider.md # From C3.x examples │ ├── oauth_patterns.md # From C3.x patterns │ └── oauth_issues.md # NEW: From GitHub issues │ ├── fastmcp-async/ # Async sub-skill (ENHANCED) │ ├── SKILL.md (200 lines) │ └── references/ │ ├── async_basics.md │ ├── async_patterns.md │ ├── decorator_pattern.md │ └── async_issues.md # NEW: From GitHub issues │ ├── fastmcp-testing/ # Testing sub-skill (ENHANCED) │ ├── SKILL.md (250 lines) │ └── references/ │ ├── unit_tests.md │ ├── integration_tests.md │ ├── pytest_examples.md │ └── testing_issues.md # NEW: From GitHub issues │ └── fastmcp-api/ # API reference sub-skill ├── SKILL.md (400 lines) └── references/ └── api_modules/ └── *.md (316 files, from C3.x)
---
## 7. Filtering Strategies (Unchanged)
[Content from original document - no changes needed]
---
## 8. Quality Metrics (Enhanced)
### 8.1 Size Constraints (Unchanged)
**Targets:**
- Router: 150 lines (±20)
- OAuth sub-skill: 250 lines (±30)
- Async sub-skill: 200 lines (±30)
- Testing sub-skill: 250 lines (±30)
- API sub-skill: 400 lines (±50)
### 8.2 Content Quality (Enhanced)
**Requirements:**
- Minimum 3 code examples per sub-skill (from C3.x)
- Minimum 2 GitHub issues per sub-skill (if available)
- All code blocks must have language tags
- No placeholder content (TODO, [Add...])
- Cross-references must be valid
- GitHub issue links must be valid (#42, etc.)
**Validation:**
```python
def validate_content_quality_enhanced(skill_md: str, has_github: bool):
"""Check content quality including GitHub integration."""
# Existing checks
code_blocks = skill_md.count('```')
assert code_blocks >= 6, "Need at least 3 code examples"
assert '```python' in skill_md or '```javascript' in skill_md, \
"Code blocks must have language tags"
assert 'TODO' not in skill_md, "No TODO placeholders"
assert '[Add' not in skill_md, "No [Add...] placeholders"
# NEW: GitHub checks
if has_github:
# Check for GitHub metadata
assert '⭐' in skill_md or 'Repository:' in skill_md, \
"Missing GitHub metadata"
# Check for issue references
issue_refs = len(re.findall(r'Issue #\d+', skill_md))
assert issue_refs >= 2, f"Need at least 2 GitHub issue references, found {issue_refs}"
# Check for "Common Issues" section
assert 'Common Issues' in skill_md or 'Common Problems' in skill_md, \
"Missing Common Issues section from GitHub"
8.3 GitHub Integration Quality (NEW)
Requirements:
- Router must include repository stats (stars, forks, language)
- Router must include top 5 common issues
- Each sub-skill must include relevant issues (if any exist)
- Issue references must be properly formatted (#42)
- Closed issues should show "✅ Solution found"
Validation:
def validate_github_integration(skill_md: str, topic: str, github_insights: InsightsStream):
"""Validate GitHub integration quality."""
# Check metadata present
if topic == 'router':
assert '⭐' in skill_md, "Missing stars count"
assert 'Open Issues:' in skill_md, "Missing issue count"
# Check issue formatting
issue_matches = re.findall(r'Issue #(\d+)', skill_md)
for issue_num in issue_matches:
# Verify issue exists in insights
all_issues = github_insights.common_problems + github_insights.known_solutions
issue_exists = any(str(i['number']) == issue_num for i in all_issues)
assert issue_exists, f"Issue #{issue_num} referenced but not in GitHub data"
# Check solution indicators
closed_issue_matches = re.findall(r'Issue #(\d+).*closed', skill_md, re.IGNORECASE)
for match in closed_issue_matches:
assert '✅' in skill_md or 'Solution' in skill_md, \
f"Closed issue #{match} should indicate solution found"
8.4 Token Efficiency (Enhanced)
Requirement: Average 40%+ token reduction vs monolithic
NEW: GitHub overhead calculation
def measure_token_efficiency_with_github(scenarios: List[Dict]):
"""
Measure token usage with GitHub integration overhead.
GitHub adds ~50 lines per skill (metadata + issues).
Router architecture still wins due to selective loading.
"""
# Monolithic with GitHub
monolithic_size = 666 + 50 # SKILL.md + GitHub section
# Router with GitHub
router_size = 150 + 50 # Router + GitHub metadata
avg_subskill_size = (250 + 200 + 250 + 400) / 4 # ~275 lines
avg_subskill_with_github = avg_subskill_size + 30 # +30 for issue section
# Calculate average query
avg_router_query = router_size + avg_subskill_with_github # ~455 lines
reduction = (monolithic_size - avg_router_query) / monolithic_size
# (716 - 455) / 716 = 36% reduction
assert reduction >= 0.35, f"Token reduction {reduction:.1%} below 35% (with GitHub overhead)"
return reduction
Result: Even with GitHub integration, router achieves 35-40% token reduction.
9-13. [Remaining Sections]
[Edge Cases, Scalability, Migration, Testing, Implementation Phases sections remain largely the same as original document, with these enhancements:]
- Add GitHub fetcher tests
- Add issue categorization tests
- Add hybrid content generation tests
- Update implementation phases to include GitHub integration
- Add time estimates for GitHub API fetching (1-2 min)
Implementation Phases (Updated)
Phase 1: Three-Stream GitHub Fetcher (Day 1, 8 hours)
NEW PHASE - Highest Priority
Tasks:
-
Create
github_fetcher.py✅- Clone repository
- Fetch GitHub API metadata
- Fetch issues (open + closed)
- Classify files (code vs docs)
-
Create
GitHubThreeStreamFetcherclass ✅fetch()main methodclassify_files()splitteranalyze_issues()insights extractor
-
Integrate with
unified_codebase_analyzer.py✅- Detect GitHub URLs
- Call three-stream fetcher
- Return unified result
-
Write tests ✅
- Test file classification
- Test issue analysis
- Test real GitHub fetch (with token)
Deliverable: Working three-stream GitHub fetcher
Phase 2: Enhanced Source Merging (Day 2, 6 hours)
Tasks:
-
Update
source_merger.py✅- Add GitHub docs stream handling
- Add GitHub insights stream handling
- Categorize issues by topic
- Create hybrid content with issue links
-
Update topic definition ✅
- Use GitHub issue labels
- Weight issues in topic scoring
-
Write tests ✅
- Test issue categorization
- Test hybrid content generation
- Test conflict detection
Deliverable: Enhanced merge with GitHub integration
Phase 3: Router Generation with GitHub (Day 2-3, 6 hours)
Tasks:
-
Update router templates ✅
- Add README quick start section
- Add repository stats
- Add top 5 common issues
- Link issues to sub-skills
-
Update sub-skill templates ✅
- Add "Common Issues" section
- Format issue references
- Add solution indicators
-
Write tests ✅
- Test router with GitHub data
- Test sub-skills with issues
- Validate issue links
Deliverable: Complete router with GitHub integration
Phase 4: Testing & Refinement (Day 3, 4 hours)
Tasks:
-
Run full E2E test on FastMCP ✅
- With GitHub three-stream
- Validate all 3 streams present
- Check issue integration
- Measure token savings
-
Manual testing ✅
- Test 10 real queries
- Verify issue relevance
- Check GitHub links work
-
Performance optimization ✅
- GitHub API rate limiting
- Parallel stream processing
- Caching GitHub data
Deliverable: Production-ready pipeline
Phase 5: Documentation (Day 4, 2 hours)
Tasks:
-
Update documentation ✅
- This architecture document
- CLI help text
- README with GitHub example
-
Create examples ✅
- FastMCP with GitHub
- React with GitHub
- Add to official configs
Deliverable: Complete documentation
Total Timeline: 4 days (26 hours)
Day 1 (8 hours): GitHub three-stream fetcher Day 2 (8 hours): Enhanced merging + router generation Day 3 (8 hours): Testing, refinement, quality validation Day 4 (2 hours): Documentation and examples
Appendix A: Configuration Examples (Updated)
Example 1: GitHub with Three-Stream (NEW)
{
"name": "fastmcp",
"description": "FastMCP framework - complete analysis with GitHub insights",
"sources": [
{
"type": "codebase",
"source": "https://github.com/jlowin/fastmcp",
"analysis_depth": "c3x",
"fetch_github_metadata": true,
"split_docs": true,
"max_issues": 100
}
],
"router_mode": true
}
Result:
- ✅ Code analyzed with C3.x
- ✅ README/docs extracted
- ✅ 100 issues analyzed
- ✅ Router + 4 sub-skills generated
- ✅ All skills include GitHub insights
Example 2: Documentation + GitHub (Multi-Source)
{
"name": "react",
"description": "React framework - official docs + GitHub insights",
"sources": [
{
"type": "documentation",
"base_url": "https://react.dev/",
"max_pages": 200
},
{
"type": "codebase",
"source": "https://github.com/facebook/react",
"analysis_depth": "c3x",
"fetch_github_metadata": true,
"max_issues": 100
}
],
"merge_mode": "conflict_detection",
"router_mode": true
}
Result:
- ✅ HTML docs scraped (200 pages)
- ✅ Code analyzed with C3.x
- ✅ GitHub insights added
- ✅ Conflicts detected (docs vs code)
- ✅ Hybrid content generated
- ✅ Router + sub-skills with all sources
Example 3: Local Codebase (No GitHub)
{
"name": "internal-tool",
"description": "Internal tool - local analysis only",
"sources": [
{
"type": "codebase",
"source": "/path/to/internal-tool",
"analysis_depth": "c3x",
"fetch_github_metadata": false
}
],
"router_mode": true
}
Result:
- ✅ Code analyzed with C3.x
- ❌ No GitHub insights (not applicable)
- ✅ Router + sub-skills generated
- ✅ Works without GitHub data
End of Enhanced Architecture Document
Summary of Major Changes
What Changed:
-
Source Architecture Redesigned
- GitHub is now a "multi-source provider" (3 streams)
- C3.x is now an "analysis depth mode", not a source type
- Unified codebase analyzer handles local AND GitHub
-
Three-Stream GitHub Integration
- Stream 1: Code → C3.x analysis
- Stream 2: Docs → README/CONTRIBUTING/docs/*.md
- Stream 3: Insights → Issues, labels, stats
-
Enhanced Router Content
- Repository stats in overview
- README quick start
- Top 5 common issues from GitHub
- Issue-to-skill routing
-
Enhanced Sub-Skill Content
- "Common Issues" section per topic
- Real user problems from GitHub
- Known solutions from closed issues
- Issue references (#42, etc.)
-
Data Flow Updated
- Parallel stream processing
- Issue categorization by topic
- Hybrid content with GitHub data
-
Implementation Updated
- New classes:
GitHubThreeStreamFetcher,UnifiedCodebaseAnalyzer - Enhanced templates with GitHub support
- New quality metrics for GitHub integration
- New classes:
Key Benefits:
- Richer Skills: Code + Docs + Community Knowledge
- Real User Problems: From GitHub issues
- Official Quick Starts: From README
- Better Architecture: Clean separation of concerns
- Still Efficient: 35-40% token reduction (even with GitHub overhead)
This document now represents the complete, production-ready architecture for C3.x router skills with three-stream GitHub integration.