# 🔧 Skill Seekers - Comprehensive Refactoring Plan

**Generated:** October 23, 2025
**Updated:** October 25, 2025 (after recent merges)
**Current Version:** v1.2.0 (PDF & llms.txt support)
**Overall Health:** 6.8/10 ⬆️ (was 6.5/10)

---

## 📊 Executive Summary

### Current State (Updated Oct 25, 2025)
- ✅ **Functionality:** 8.5/10 ⬆️ - Works well, new features added
- ⚠️ **Code Quality:** 5.5/10 ⬆️ - Some modularization, still needs work
- ✅ **Documentation:** 8/10 ⬆️ - Excellent external docs, weak inline docs
- ✅ **Testing:** 8/10 ⬆️ - 93 tests (up from 69), excellent coverage
- ⚠️ **Structure:** 6/10 - Still missing Python package setup
- ✅ **GitHub/CI:** 8/10 - Well organized

### Recent Improvements ✅
- ✅ **llms.txt Support** - 3 new modular files (detector, downloader, parser)
- ✅ **PDF Advanced Features** - OCR, tables, parallel processing
- ✅ **Better Modularization** - llms.txt features properly separated
- ✅ **More Tests** - 93 tests (up 35% from 69)
- ✅ **Better Documentation** - 7+ new comprehensive docs

### Target State (After Phases 1-2)
- **Overall Quality:** 7.8/10 (adjusted up from 7.5)
- **Effort:** 10-14 days (reduced from 12-17; some work is already done)
- **Impact:** High maintainability improvement

---
## 🎉 Recent Wins (What Got Better)

### ✅ Good Modularization Examples
The recent llms.txt feature shows **EXCELLENT** code organization:

```
cli/llms_txt_detector.py   (66 lines) - Clean, focused
cli/llms_txt_downloader.py (94 lines) - Single responsibility
cli/llms_txt_parser.py     (74 lines) - Well-structured
```

**This is the pattern we want everywhere!** Each file:
- Has a clear single purpose
- Is small and maintainable (< 100 lines)
- Has proper docstrings
- Can be tested independently
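To make the pattern concrete, here is a hypothetical sketch of what such a small, single-purpose module looks like; this is illustrative only, and the real `LlmsTxtDetector` API may differ:

```python
from urllib.parse import urljoin


class TxtFileDetector:
    """Hypothetical example of a small, single-purpose module.

    Builds candidate llms.txt URLs for a documentation site. The actual
    LlmsTxtDetector in cli/ may expose a different interface.
    """

    CANDIDATES = ("llms.txt", "llms-full.txt")

    def __init__(self, base_url: str) -> None:
        # Normalize so urljoin treats the last path segment as a directory.
        self.base_url = base_url.rstrip("/") + "/"

    def candidate_urls(self) -> list[str]:
        """Return the URLs to probe, in priority order."""
        return [urljoin(self.base_url, name) for name in self.CANDIDATES]


detector = TxtFileDetector("https://example.com/docs")
print(detector.candidate_urls())
```

Because the class does one thing and has no hidden dependencies, it can be instantiated and asserted against in a two-line test, which is exactly what makes this file layout attractive.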
|
|
### ✅ Testing Improvements
|
|
- **93 tests** (up from 69) - 35% increase
|
|
- New test files for llms.txt features
|
|
- PDF advanced features fully tested
|
|
- 100% pass rate maintained
|
|
|
|
### ✅ Documentation Explosion
|
|
Added 7+ comprehensive new docs:
|
|
- `docs/LLMS_TXT_SUPPORT.md`
|
|
- `docs/PDF_ADVANCED_FEATURES.md`
|
|
- `docs/PDF_*.md` (multiple guides)
|
|
- `docs/plans/2025-10-24-active-skills-*.md`
|
|
|
|
### ✅ File Count Healthy
|
|
- **237 Python files** in cli/ and mcp/
|
|
- Shows active development
|
|
- Good separation starting to happen
|
|
|
|
### ⚠️ What Didn't Improve
|
|
- Still NO `__init__.py` files (critical!)
|
|
- `.gitignore` still incomplete
|
|
- `doc_scraper.py` grew larger (1,345 lines now)
|
|
- Still have code duplication
|
|
- Still have magic numbers
|
|
|
|
---
|
|
|
|
## 🚨 Critical Issues (Fix First)
|
|
|
|
### 1. Missing Python Package Structure ⚡⚡⚡
|
|
**Status:** ❌ STILL NOT FIXED (after all merges)
|
|
**Impact:** Cannot properly import modules, breaks IDE support
|
|
|
|
**Missing Files:**
|
|
```
|
|
cli/__init__.py ❌ STILL CRITICAL
|
|
mcp/__init__.py ❌ STILL CRITICAL
|
|
mcp/tools/__init__.py ❌ STILL CRITICAL
|
|
```
|
|
|
|
**Why This Matters:**
|
|
- New llms_txt_*.py files can't be imported as a package
|
|
- PDF modules scattered without package organization
|
|
- IDE autocomplete doesn't work properly
|
|
- Relative imports fail
|
|
|
|
**Fix:**
|
|
```bash
|
|
# Create missing __init__.py files
|
|
touch cli/__init__.py
|
|
touch mcp/__init__.py
|
|
touch mcp/tools/__init__.py
|
|
|
|
# Then in cli/__init__.py, add:
|
|
from .llms_txt_detector import LlmsTxtDetector
|
|
from .llms_txt_downloader import LlmsTxtDownloader
|
|
from .llms_txt_parser import LlmsTxtParser
|
|
from .utils import open_folder, read_reference_files
|
|
```
|
|
|
|
**Effort:** 15-30 minutes
|
|
**Priority:** P0 🔥
|
|
|
|
---
### 2. Code Duplication - Reference File Reading ⚡⚡⚡
**Impact:** Maintenance nightmare, inconsistent behavior

**Duplicated Code:**
- `cli/enhance_skill.py` lines 42-69 (100K limit)
- `cli/enhance_skill_local.py` lines 101-125 (50K limit)

**Fix:** Extract to `cli/utils.py`:
```python
from pathlib import Path


def read_reference_files(skill_dir: str, max_chars: int = 100000) -> str:
    """Read all reference files up to max_chars limit.

    Args:
        skill_dir: Path to skill directory
        max_chars: Maximum characters to read (default: 100K)

    Returns:
        Combined content from all reference files
    """
    references_dir = Path(skill_dir) / "references"
    content_parts = []
    total_chars = 0

    for ref_file in sorted(references_dir.glob("*.md")):
        if total_chars >= max_chars:
            break
        file_content = ref_file.read_text(encoding='utf-8')
        chars_to_add = min(len(file_content), max_chars - total_chars)
        content_parts.append(file_content[:chars_to_add])
        total_chars += chars_to_add

    return "\n\n".join(content_parts)
```
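A quick, self-contained demo of how both call sites would use the shared helper, with only the limit differing (the helper body is inlined here so the snippet runs on its own):

```python
import tempfile
from pathlib import Path


def read_reference_files(skill_dir: str, max_chars: int = 100000) -> str:
    """Inlined copy of the helper above, so this demo is self-contained."""
    references_dir = Path(skill_dir) / "references"
    content_parts = []
    total_chars = 0
    for ref_file in sorted(references_dir.glob("*.md")):
        if total_chars >= max_chars:
            break
        file_content = ref_file.read_text(encoding="utf-8")
        chars_to_add = min(len(file_content), max_chars - total_chars)
        content_parts.append(file_content[:chars_to_add])
        total_chars += chars_to_add
    return "\n\n".join(content_parts)


with tempfile.TemporaryDirectory() as tmp:
    refs = Path(tmp) / "references"
    refs.mkdir()
    (refs / "a.md").write_text("alpha", encoding="utf-8")
    (refs / "b.md").write_text("beta", encoding="utf-8")

    # One helper, two call sites: only the limit differs.
    print(read_reference_files(tmp))                     # "alpha\n\nbeta"
    print(len(read_reference_files(tmp, max_chars=3)))   # 3
```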
**Effort:** 1 hour
**Priority:** P0

---
### 3. Overly Large Functions ⚡⚡⚡
**Impact:** Hard to understand, test, and maintain

#### Problem 1: `main()` in doc_scraper.py
- **Lines:** 1000-1194 (193 lines)
- **Complexity:** Does everything in one function

**Fix:** Split into separate functions:
```python
def parse_arguments() -> argparse.Namespace:
    """Parse and return command line arguments."""
    pass

def validate_config(config: dict) -> None:
    """Validate configuration is complete and correct."""
    pass

def execute_scraping(converter, config, args) -> bool:
    """Execute scraping phase with error handling."""
    pass

def execute_building(converter, config) -> bool:
    """Execute skill building phase."""
    pass

def execute_enhancement(skill_dir, args) -> None:
    """Execute skill enhancement (local or API)."""
    pass

def main():
    """Main entry point - orchestrates the workflow."""
    args = parse_arguments()
    config = load_and_validate_config(args)

    converter = DocToSkillConverter(config)

    if not should_skip_scraping(args):
        if not execute_scraping(converter, config, args):
            sys.exit(1)

    if not execute_building(converter, config):
        sys.exit(1)

    if args.enhance or args.enhance_local:
        execute_enhancement(skill_dir, args)

    print_success_message(skill_dir)
```

**Effort:** 3-4 hours
**Priority:** P1

---

#### Problem 2: `DocToSkillConverter` class
- **Status:** ⚠️ PARTIALLY IMPROVED (llms.txt extracted, but still huge)
- **Current Lines:** ~1,345 lines (grew ~70% due to new features!)
- **Current Functions/Classes:** Only 6 (better than 25+ methods!)
- **Responsibility:** Still does too much

**What Improved:**
- ✅ llms.txt logic properly extracted to 3 separate files
- ✅ Better separation of concerns for new features

**Still Needs:**
- ❌ Main scraper logic still monolithic
- ❌ PDF extraction logic not extracted

**Fix:** Split into focused modules:

```python
# cli/scraper.py
class DocumentScraper:
    """Handles URL traversal and page downloading."""
    def scrape_all(self) -> List[dict]:
        pass
    def is_valid_url(self, url: str) -> bool:
        pass
    def scrape_page(self, url: str) -> Optional[dict]:
        pass

# cli/extractor.py
class ContentExtractor:
    """Extracts and parses HTML content."""
    def extract_content(self, soup) -> dict:
        pass
    def detect_language(self, code: str) -> str:
        pass
    def extract_patterns(self, content: str) -> List[dict]:
        pass

# cli/builder.py
class SkillBuilder:
    """Builds skill files from scraped data."""
    def build_skill(self, pages: List[dict]) -> None:
        pass
    def create_skill_md(self, pages: List[dict]) -> str:
        pass
    def categorize_pages(self, pages: List[dict]) -> dict:
        pass
    def generate_references(self, categories: dict) -> None:
        pass

# cli/validator.py
class SkillValidator:
    """Validates skill quality and completeness."""
    def validate_skill(self, skill_dir: str) -> bool:
        pass
    def check_references(self, skill_dir: str) -> List[str]:
        pass
```

**Effort:** 8-10 hours
**Priority:** P1

---
### 4. Bare Except Clause ⚡⚡
**Impact:** Catches system exceptions (KeyboardInterrupt, SystemExit)

**Problem:**
```python
# doc_scraper.py line ~650
try:
    scrape_page()
except:  # ❌ BAD - catches everything, including KeyboardInterrupt
    print("Error")
```

**Fix:**
```python
import logging

logger = logging.getLogger(__name__)

try:
    scrape_page()
except KeyboardInterrupt:  # ✅ Handle Ctrl-C separately, then re-raise
    logger.warning("Scraping interrupted by user")
    raise
except Exception as e:  # ✅ GOOD - ordinary errors only
    logger.error(f"Scraping failed: {e}")
```

**Effort:** 30 minutes
**Priority:** P1

---
## ⚠️ Important Issues (Phase 2)

### 5. Magic Numbers ⚡⚡
**Impact:** Hard to configure, unclear meaning

**Current Problems:**
```
# Scattered throughout the codebase
doc_scraper.py:      1000 (checkpoint interval)
                     10000 (threshold)
estimate_pages.py:   1000 (default max discovery)
                     0.5 (rate limit)
enhance_skill.py:    100000, 40000 (content limits)
enhance_skill_local: 50000, 20000 (different limits!)
```

**Fix:** Create `cli/constants.py`:
```python
"""Configuration constants for Skill Seekers."""

# Scraping Configuration
DEFAULT_RATE_LIMIT = 0.5    # seconds between requests
DEFAULT_MAX_PAGES = 500
CHECKPOINT_INTERVAL = 1000  # pages

# Enhancement Configuration
API_CONTENT_LIMIT = 100000   # chars for API enhancement
API_PREVIEW_LIMIT = 40000    # chars for preview
LOCAL_CONTENT_LIMIT = 50000  # chars for local enhancement
LOCAL_PREVIEW_LIMIT = 20000  # chars for preview

# Page Estimation
DEFAULT_MAX_DISCOVERY = 1000
DISCOVERY_THRESHOLD = 10000

# File Limits
MAX_REFERENCE_FILES = 100
MAX_CODE_BLOCKS_PER_PAGE = 5

# Categorization
CATEGORY_SCORE_THRESHOLD = 2
URL_MATCH_POINTS = 3
TITLE_MATCH_POINTS = 2
CONTENT_MATCH_POINTS = 1
```
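To show how the categorization constants might be consumed, here is a hedged sketch of a scorer that sums match points per keyword; the real `smart_categorize()` in doc_scraper.py may weigh things differently:

```python
# Hypothetical scorer using the categorization constants; the actual
# implementation in doc_scraper.py may differ.
URL_MATCH_POINTS = 3
TITLE_MATCH_POINTS = 2
CONTENT_MATCH_POINTS = 1
CATEGORY_SCORE_THRESHOLD = 2


def category_score(keyword: str, url: str, title: str, content: str) -> int:
    """Sum match points for one keyword against one page."""
    score = 0
    if keyword in url.lower():
        score += URL_MATCH_POINTS
    if keyword in title.lower():
        score += TITLE_MATCH_POINTS
    if keyword in content.lower():
        score += CONTENT_MATCH_POINTS
    return score


score = category_score(
    "hooks",
    "https://react.dev/reference/hooks",
    "Built-in React Hooks",
    "Hooks let you reuse logic.",
)
print(score, score >= CATEGORY_SCORE_THRESHOLD)  # 6 True
```

With named constants the threshold and weights become tunable in one place instead of being buried in conditionals.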
**Effort:** 2 hours
**Priority:** P2

---

### 6. Missing Docstrings ⚡⚡
**Impact:** Hard to understand code, poor IDE support

**Current Coverage:** ~55% (should be 95%+)

**Missing Docstrings:**
```python
# doc_scraper.py (8/16 functions documented)
scrape_all()                # ❌
smart_categorize()          # ❌
infer_categories()          # ❌
generate_quick_reference()  # ❌

# enhance_skill.py (3/4 documented)
class EnhancementEngine:    # ❌

# estimate_pages.py (6/10 documented)
discover_pages()            # ❌
calculate_estimate()        # ❌
```

**Fix Template:**
```python
def scrape_all(self, base_url: str, max_pages: int = 500) -> List[dict]:
    """Scrape all pages from documentation website.

    Performs breadth-first traversal starting from base_url, respecting
    include/exclude patterns and rate limits defined in config.

    Args:
        base_url: Starting URL for documentation
        max_pages: Maximum pages to scrape (default: 500)

    Returns:
        List of page dictionaries with url, title, content, code_blocks

    Raises:
        ValueError: If base_url is invalid
        ConnectionError: If unable to reach documentation site

    Example:
        >>> scraper = DocToSkillConverter(config)
        >>> pages = scraper.scrape_all("https://react.dev/", max_pages=100)
        >>> len(pages)
        100
    """
    pass
```
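Coverage numbers like "~55%" are easy to track mechanically. A minimal `ast`-based counter is sketched below; dedicated tools such as `interrogate` do this more thoroughly, so treat it as an illustration:

```python
import ast


def docstring_coverage(source: str) -> float:
    """Fraction of functions/classes in `source` that carry a docstring."""
    tree = ast.parse(source)
    nodes = [
        n for n in ast.walk(tree)
        if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]
    if not nodes:
        return 1.0  # nothing to document
    documented = sum(1 for n in nodes if ast.get_docstring(n) is not None)
    return documented / len(nodes)


sample = '''
def a():
    """Documented."""

def b():
    pass
'''
print(docstring_coverage(sample))  # 0.5
```

Running this over each file before and after Phase 2 gives an objective way to verify the 95%+ target.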
**Effort:** 5-6 hours
**Priority:** P2

---

### 7. Add Type Hints ⚡⚡
**Impact:** No IDE autocomplete, no type checking

**Current Coverage:** ~15% (only the llms.txt modules)

**Fix Examples:**
```python
from typing import Any, Dict, List, Optional, Tuple
from pathlib import Path

from bs4 import BeautifulSoup

def scrape_all(
    self,
    base_url: str,
    max_pages: int = 500
) -> List[Dict[str, Any]]:
    """Scrape all pages from documentation."""
    pass

def extract_content(
    self,
    soup: BeautifulSoup
) -> Dict[str, Any]:
    """Extract content from HTML page."""
    pass

def read_reference_files(
    skill_dir: Path | str,
    max_chars: int = 100000
) -> str:
    """Read reference files up to limit."""
    pass
```

**Effort:** 6-8 hours
**Priority:** P2

---
### 8. Inconsistent Import Patterns ⚡⚡
**Impact:** Confusing, breaks in different environments

**Current Problems:**
```python
# Pattern 1: sys.path manipulation
sys.path.insert(0, str(Path(__file__).parent.parent))

# Pattern 2: Try-except imports
try:
    from utils import open_folder
except ImportError:
    sys.path.insert(0, ...)

# Pattern 3: Direct relative imports
from utils import something
```

**Fix:** Use proper package structure:
```python
# After creating __init__.py files:

# In cli/__init__.py
from .utils import open_folder, read_reference_files
from .constants import *

# In scripts
from cli.utils import open_folder
from cli.constants import DEFAULT_RATE_LIMIT
```

**Effort:** 2-3 hours
**Priority:** P2

---
## 📝 Documentation Issues

### Missing README Files
```
cli/README.md       ❌ - How to use each CLI tool
configs/README.md   ❌ - How to create custom configs
tests/README.md     ❌ - How to run and write tests
mcp/tools/README.md ❌ - MCP tool documentation
```

**Fix - Create cli/README.md:**
````markdown
# CLI Tools

Command-line tools for Skill Seekers.

## Tools Overview

### doc_scraper.py
Main scraping and building tool.

**Usage:**
```bash
python3 cli/doc_scraper.py --config configs/react.json
```

**Options:**
- `--config PATH` - Config file path
- `--skip-scrape` - Use cached data
- `--enhance` - API enhancement
- `--enhance-local` - Local enhancement

### enhance_skill.py
AI-powered SKILL.md enhancement using the Anthropic API.

**Usage:**
```bash
export ANTHROPIC_API_KEY=sk-ant-...
python3 cli/enhance_skill.py output/react/
```

### enhance_skill_local.py
Local enhancement using Claude Code Max (no API key).

[... continue for all tools ...]
````

**Effort:** 4-5 hours
**Priority:** P3

---
## 🔧 Git & GitHub Improvements

### 1. Update .gitignore ⚡
**Status:** ❌ STILL NOT FIXED
**Current Problems:**
- `.pytest_cache/` exists (52KB) but is NOT in .gitignore
- `.coverage` exists (52KB) but is NOT in .gitignore
- No `htmlcov/` entry
- No `.tox/` entry

**Missing Entries:**
```gitignore
# Testing artifacts
.pytest_cache/
.coverage
htmlcov/
.tox/
*.cover
.hypothesis/

# Build artifacts
.build/
*.egg-info/
```

**Fix NOW:**
```bash
cat >> .gitignore << 'EOF'

# Testing artifacts
.pytest_cache/
.coverage
htmlcov/
.tox/
*.cover
.hypothesis/
EOF

git rm -r --cached .pytest_cache .coverage 2>/dev/null
git commit -m "chore: update .gitignore for test artifacts"
```

**Effort:** 2 minutes ⚡
**Priority:** P0 (these files are polluting the repo!)

---
### 2. Git Branching Strategy
**Current Branches:**
```
main                - Production (✓ good)
development         - Development (✓ good)
feature/*           - Feature branches (✓ good)
claude/*            - Claude Code branches (⚠️ should be cleaned)
remotes/ibrahim/*   - External contributor (⚠️ merge or close)
remotes/jjshanks/*  - External contributor (⚠️ merge or close)
```

**Recommendations:**
1. **Merge or close** old remote branches
2. **Clean up** claude/* branches after merging
3. **Document** the branch strategy in CONTRIBUTING.md

**Suggested Strategy:**
```markdown
# Branch Strategy

- `main` - Production releases only
- `development` - Active development, merge PRs here first
- `feature/*` - New features (e.g., feature/pdf-support)
- `fix/*` - Bug fixes
- `refactor/*` - Code refactoring
- `docs/*` - Documentation updates

**Workflow:**
1. Create a feature branch from `development`
2. Open a PR to `development`
3. After review, merge to `development`
4. Periodically merge `development` to `main` for releases
```

**Effort:** 1 hour
**Priority:** P3

---

### 3. GitHub Branch Protection Rules
**Current:** No documented protection rules

**Recommended Rules for the `main` branch:**
```yaml
Require pull request reviews: Yes (1 approver)
Dismiss stale reviews: Yes
Require status checks: Yes
  - tests (Ubuntu)
  - tests (macOS)
  - codecov/patch
  - codecov/project
Require branches to be up to date: Yes
Require conversation resolution: Yes
Restrict who can push: Yes (maintainers only)
```

**Setup:**
1. Go to: Settings → Branches → Add rule
2. Branch name pattern: `main`
3. Enable the protections above

**Effort:** 30 minutes
**Priority:** P3

---
### 4. Missing GitHub Workflows
**Current:** ✅ tests.yml, ✅ release.yml

**Recommended Additions:**

#### 4a. Windows Testing (`workflows/windows.yml`)
```yaml
name: Windows Tests

on: [push, pull_request]

jobs:
  test:
    runs-on: windows-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest pytest-cov
      - name: Run tests
        run: pytest tests/ -v
```

**Effort:** 30 minutes
**Priority:** P3

---

#### 4b. Code Quality Checks (`workflows/quality.yml`)
```yaml
name: Code Quality

on: [push, pull_request]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install tools
        run: |
          pip install flake8 black isort mypy
      - name: Run flake8
        run: flake8 cli/ mcp/ tests/ --max-line-length=120
      - name: Check formatting
        run: black --check cli/ mcp/ tests/
      - name: Check imports
        run: isort --check cli/ mcp/ tests/
      - name: Type check
        run: mypy cli/ mcp/ --ignore-missing-imports
```

**Effort:** 1 hour
**Priority:** P4

---
## 📦 Dependency Management

### Current Problem
**A single requirements.txt with 42 packages** - no separation of concerns

### Recommended Split

#### requirements-core.txt
```txt
# Core dependencies (always needed)
requests>=2.31.0
beautifulsoup4>=4.12.0
```

#### requirements-pdf.txt
```txt
# PDF support (optional)
PyMuPDF>=1.23.0
Pillow>=10.0.0
pytesseract>=0.3.10
```

#### requirements-dev.txt
```txt
# Development tools
pytest>=7.4.0
pytest-cov>=4.1.0
black>=23.7.0
flake8>=6.1.0
isort>=5.12.0
mypy>=1.5.0
```

#### requirements.txt
```txt
# Install everything (convenience)
-r requirements-core.txt
-r requirements-pdf.txt
-r requirements-dev.txt
```

**Usage:**
```bash
# Minimal install
pip install -r requirements-core.txt

# With PDF support
pip install -r requirements-core.txt -r requirements-pdf.txt

# Full install (development)
pip install -r requirements.txt
```

**Effort:** 1 hour
**Priority:** P3

---
## 🏗️ Project Structure Refactoring

### Current Structure Issues
```
Skill_Seekers/
├── cli/
│   ├── __init__.py ❌ MISSING
│   ├── doc_scraper.py (1,345 lines) ⚠️ TOO LARGE
│   ├── package_multi.py ❓ UNCLEAR PURPOSE
│   └── ... (13 files)
├── mcp/
│   ├── __init__.py ❌ MISSING
│   ├── server.py (29KB) ⚠️ MONOLITHIC
│   └── tools/ (empty) ❓ UNUSED
├── test_pr144_concerns.py ❌ WRONG LOCATION
└── .coverage ❌ NOT IN .gitignore
```

### Recommended Structure
```
Skill_Seekers/
├── cli/
│   ├── __init__.py ✅
│   ├── README.md ✅
│   ├── constants.py ✅ NEW
│   ├── utils.py ✅ ENHANCED
│   ├── scraper.py ✅ EXTRACTED
│   ├── extractor.py ✅ EXTRACTED
│   ├── builder.py ✅ EXTRACTED
│   ├── validator.py ✅ EXTRACTED
│   ├── doc_scraper.py ✅ REFACTORED (imports from above)
│   ├── enhance_skill.py ✅ REFACTORED
│   ├── enhance_skill_local.py ✅ REFACTORED
│   └── ... (other tools)
├── mcp/
│   ├── __init__.py ✅
│   ├── server.py ✅ SIMPLIFIED
│   ├── tools/
│   │   ├── __init__.py ✅
│   │   ├── scraping_tools.py ✅ NEW
│   │   ├── building_tools.py ✅ NEW
│   │   └── deployment_tools.py ✅ NEW
│   └── README.md
├── tests/
│   ├── __init__.py ✅
│   ├── README.md ✅ NEW
│   ├── test_pr144_concerns.py ✅ MOVED HERE
│   └── ... (15 test files)
├── configs/
│   ├── README.md ✅ NEW
│   └── ... (16 config files)
└── docs/
    └── ... (17 markdown files)
```
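One way the `tools/` split can keep `server.py` thin is a simple registry that each tool module populates on import. This is a hypothetical sketch in plain Python; the actual MCP server wires tools through the MCP SDK, and the names below (`tool`, `dispatch`, `scrape_docs`) are illustrative:

```python
from typing import Callable, Dict

# Hypothetical registry; a real MCP server registers tools via the SDK.
TOOL_REGISTRY: Dict[str, Callable[..., str]] = {}


def tool(name: str):
    """Decorator that registers a function under a tool name."""
    def wrap(fn: Callable[..., str]) -> Callable[..., str]:
        TOOL_REGISTRY[name] = fn
        return fn
    return wrap


# Would live in mcp/tools/scraping_tools.py (hypothetical):
@tool("scrape_docs")
def scrape_docs(url: str) -> str:
    return f"scraping {url}"


# server.py then only dispatches by name:
def dispatch(name: str, **kwargs) -> str:
    return TOOL_REGISTRY[name](**kwargs)


print(dispatch("scrape_docs", url="https://react.dev/"))
```

With this shape, adding a tool means adding one decorated function in a `tools/` module, and `server.py` never grows.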
**Effort:** Part of the Phase 1-2 work
**Priority:** P1

---

## 📊 Implementation Roadmap (Updated Oct 25, 2025)

### Phase 0: Immediate Fixes (< 1 hour) 🔥🔥🔥
**Do these RIGHT NOW before anything else:**

- [ ] **2 min:** Update `.gitignore` (add .pytest_cache/, .coverage)
- [ ] **5 min:** Remove tracked test artifacts (`git rm -r --cached`)
- [ ] **15 min:** Create `cli/__init__.py`, `mcp/__init__.py`, `mcp/tools/__init__.py`
- [ ] **10 min:** Add basic imports to `cli/__init__.py` for the llms_txt modules
- [ ] **10 min:** Test that imports work: `python3 -c "from cli import LlmsTxtDetector"`

**Why These First:**
- Currently breaking best practices
- Test artifacts are polluting the repo
- Can't properly import the new modular code
- Takes < 1 hour total
- Zero risk

---

### Phase 1: Critical Fixes (4-6 days) ⚡⚡⚡
**UPDATED: Reduced from 5-7 days (llms.txt already done!)**

**Week 1:**
- [ ] Day 1: Extract duplicate reference reading (1 hour)
- [ ] Day 1: Fix bare except clauses (30 min)
- [ ] Day 1-2: Create `constants.py` and move magic numbers (2 hours)
- [ ] Day 2-3: Split the `main()` function (3-4 hours)
- [ ] Day 3-5: Split `DocToSkillConverter` (focus on the scraper; llms.txt is done) (6-8 hours)
- [ ] Day 5-6: Test all changes, fix bugs (3-4 hours)

**Deliverables:**
- ✅ Proper Python package structure
- ✅ No code duplication
- ✅ Smaller, focused functions
- ✅ Centralized configuration

**Note:** The llms.txt extraction is already done! This saves ~2 days.

---
### Phase 2: Important Improvements (7-10 days) ⚡⚡

**Week 2:**
- [ ] Day 8-10: Add comprehensive docstrings (5-6 hours)
- [ ] Day 10-12: Add type hints to all public APIs (6-8 hours)
- [ ] Day 12-13: Standardize import patterns (2-3 hours)
- [ ] Day 13-14: Add README files (4-5 hours)
- [ ] Day 15-17: Update .gitignore, split requirements.txt (2 hours)

**Deliverables:**
- ✅ 95%+ docstring coverage
- ✅ Type hints on all public functions
- ✅ Consistent imports
- ✅ Better documentation

---

### Phase 3: Nice-to-Have (5-8 days) ⚡

**Week 3:**
- [ ] Day 18-19: Clean up Git branches (1 hour)
- [ ] Day 18-19: Set up branch protection (30 min)
- [ ] Day 19-20: Add Windows CI/CD (30 min)
- [ ] Day 20-21: Add a code quality workflow (1 hour)
- [ ] Day 21-23: Implement logging (4-5 hours)
- [ ] Day 23-25: Documentation polish (6-8 hours)

**Deliverables:**
- ✅ Better Git workflow
- ✅ Multi-platform testing
- ✅ Code quality checks
- ✅ Professional logging
---

### Phase 4: Future Refactoring (10-15 days) ⚪

**Future Work:**
- [ ] Modularize the MCP server (3-4 days)
- [ ] Create a plugin system (2-3 days)
- [ ] Configuration framework (2-3 days)
- [ ] Custom exceptions (1-2 days)
- [ ] Performance optimization (2-3 days)

**Note:** Phase 4 can be done incrementally; it is not urgent.
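For the custom-exceptions item, one possible hierarchy; the class names are hypothetical, not existing code:

```python
class SkillSeekersError(Exception):
    """Base class for all Skill Seekers errors (hypothetical)."""


class ScrapeError(SkillSeekersError):
    """Raised when a page cannot be fetched or parsed."""

    def __init__(self, url: str, reason: str) -> None:
        self.url = url
        self.reason = reason
        super().__init__(f"failed to scrape {url}: {reason}")


class ConfigError(SkillSeekersError):
    """Raised for invalid or incomplete config files."""


# Callers can then catch one family instead of bare `except`:
try:
    raise ScrapeError("https://example.com", "timeout")
except SkillSeekersError as e:
    print(e)  # failed to scrape https://example.com: timeout
```

A shared base class pairs well with the bare-except fix above: application code catches `SkillSeekersError`, while `KeyboardInterrupt` and friends propagate.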
---

## 📈 Success Metrics

### Before Refactoring (Oct 23, 2025)
- Code Quality: 5/10
- Docstring Coverage: ~55%
- Type Hint Coverage: 0%
- Import Issues: Yes
- Magic Numbers: 8+
- Code Duplication: Yes
- Tests: 69
- Line Count: doc_scraper.py ~790 lines

### Current State (Oct 25, 2025) - After Recent Merges
- Code Quality: 5.5/10 ⬆️ (+0.5)
- Docstring Coverage: ~60% ⬆️ (llms.txt modules are well-documented)
- Type Hint Coverage: 15% ⬆️ (llms.txt modules have hints!)
- Import Issues: Yes (no __init__.py yet)
- Magic Numbers: 8+
- Code Duplication: Yes
- Tests: 93 ⬆️ (+24 tests!)
- Line Count: doc_scraper.py 1,345 lines ⬆️ (grew, but the new code is more modular)
- New Modular Files: 3 (llms_txt_*.py) ✅

### After Phase 0 (< 1 hour)
- Code Quality: 6.0/10 ⬆️
- Import Issues: No ✅
- .gitignore: Fixed ✅
- Can use: `from cli import LlmsTxtDetector` ✅

### After Phase 1-2 (Target)
- Code Quality: 7.8/10 ⬆️ (adjusted from 7.5)
- Docstring Coverage: 95%+
- Type Hint Coverage: 85%+ (raised from 80%; some hints already exist)
- Import Issues: No
- Magic Numbers: 0 (all in constants.py)
- Code Duplication: No
- Modular Structure: Yes (following the llms_txt pattern)

### Benefits
- ✅ Easier onboarding for contributors
- ✅ Faster debugging
- ✅ Better IDE support (autocomplete, type checking)
- ✅ Reduced bugs from unclear code
- ✅ Professional codebase
- ✅ Can build on the llms_txt modular pattern

---
## 🎯 Quick Start (Updated)

### 🔥 RECOMMENDED: Phase 0 First (< 1 hour)
**DO THIS NOW before anything else:**
```bash
# 1. Fix .gitignore (2 min)
cat >> .gitignore << 'EOF'

# Testing artifacts
.pytest_cache/
.coverage
htmlcov/
.tox/
*.cover
.hypothesis/
EOF

# 2. Remove tracked test files (5 min)
git rm -r --cached .pytest_cache .coverage 2>/dev/null
git add .gitignore
git commit -m "chore: update .gitignore for test artifacts"

# 3. Create package structure (15 min)
touch cli/__init__.py
touch mcp/__init__.py
touch mcp/tools/__init__.py

# 4. Add imports to cli/__init__.py (10 min)
cat > cli/__init__.py << 'EOF'
"""Skill Seekers CLI tools package."""
from .llms_txt_detector import LlmsTxtDetector
from .llms_txt_downloader import LlmsTxtDownloader
from .llms_txt_parser import LlmsTxtParser
from .utils import open_folder

__all__ = [
    'LlmsTxtDetector',
    'LlmsTxtDownloader',
    'LlmsTxtParser',
    'open_folder',
]
EOF

# 5. Test it works (5 min)
python3 -c "from cli import LlmsTxtDetector; print('✅ Imports work!')"

# 6. Commit
git add cli/__init__.py mcp/__init__.py mcp/tools/__init__.py
git commit -m "feat: add Python package structure"
```

**Time:** 42 minutes
**Impact:** Immediate improvement; unlocks proper imports

---
### Option 1: Do Everything (Phases 0-2)
**Time:** 10-14 days (reduced from 12-17!)
**Impact:** Maximum improvement

### Option 2: Critical Only (Phases 0-1)
**Time:** 4-6 days (reduced from 5-7!)
**Impact:** Fix the major issues

### Option 3: Incremental (One task at a time)
**Time:** Ongoing
**Impact:** Steady improvement

### 🌟 NEW: Follow the llms_txt Pattern
**The llms_txt modules show the ideal pattern:**
- Small files (< 100 lines each)
- Clear single responsibility
- Good docstrings
- Type hints included
- Easy to test

**Apply this pattern to everything else!**
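"Easy to test" in practice: a small, single-purpose module can be exercised by a few-line pytest file. The function below is a toy stand-in, not the actual `LlmsTxtParser` API:

```python
# Hypothetical pytest file for a small, single-purpose module.
# `parse_sections` is illustrative only.

def parse_sections(text: str) -> list[str]:
    """Toy stand-in: return '## ' section titles from an llms.txt body."""
    return [line[3:].strip() for line in text.splitlines() if line.startswith("## ")]


def test_parse_sections():
    body = "# Site\n\n## Docs\n- /docs\n\n## API\n- /api\n"
    assert parse_sections(body) == ["Docs", "API"]


test_parse_sections()
print("ok")
```

A module that needs this little scaffolding to test is a good sign the split was done at the right boundary.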
---

## 📋 Checklist (Updated Oct 25, 2025)

### Phase 0 (Immediate - < 1 hour) 🔥
- [ ] Update `.gitignore` with test artifacts
- [ ] Remove `.pytest_cache/` and `.coverage` from git tracking
- [ ] Create `cli/__init__.py`
- [ ] Create `mcp/__init__.py`
- [ ] Create `mcp/tools/__init__.py`
- [ ] Add imports to `cli/__init__.py` for the llms_txt modules
- [ ] Test: `python3 -c "from cli import LlmsTxtDetector"`
- [ ] Commit changes

### Phase 1 (Critical - 4-6 days)
- [ ] Extract duplicate reference reading to `utils.py`
- [ ] Fix bare except clauses
- [ ] Create `cli/constants.py`
- [ ] Move all magic numbers to constants
- [ ] Split `main()` into separate functions
- [ ] Split `DocToSkillConverter` (HTML scraping part; llms_txt already done ✅)
- [ ] Test all changes

### Phase 2 (Important)
- [ ] Add docstrings to all public functions
- [ ] Add type hints to public APIs
- [ ] Standardize import patterns
- [ ] Create `cli/README.md`
- [ ] Create `tests/README.md`
- [ ] Create `configs/README.md`
- [ ] Update `.gitignore`
- [ ] Split `requirements.txt`

### Phase 3 (Nice-to-Have)
- [ ] Clean up old Git branches
- [ ] Set up branch protection rules
- [ ] Add a Windows CI/CD workflow
- [ ] Add a code quality workflow
- [ ] Implement a logging framework
- [ ] Document the Git strategy in CONTRIBUTING.md

---

## 💬 Questions?

See the full analysis reports in `/tmp/`:
- `skill_seekers_analysis.md` - Detailed 12,000+ word report
- `ANALYSIS_SUMMARY.txt` - Summary of findings
- `CODE_EXAMPLES.md` - Before/after code examples

---

**Generated:** October 23, 2025
**Status:** Ready for implementation
**Next Step:** Choose Phase 1, 2, or 3 and start with the checklist
|