feat: Make EXCLUDED_DIRS configurable for local repository analysis
Closes #203

Adds configuration options to customize directory exclusions during local repository analysis, while maintaining backward compatibility with smart defaults.

**New Config Options:**

1. `exclude_dirs_additional` - Extend defaults (most common)
   - Adds custom directories to default exclusions
   - Example: ["proprietary", "legacy", "third_party"]
   - Total exclusions = defaults + additional

2. `exclude_dirs` - Replace defaults (advanced users)
   - Completely overrides default exclusions
   - Example: ["node_modules", ".git", "custom_vendor"]
   - Gives full control over exclusions

**Implementation:**
- Modified GitHubScraper.__init__() to parse exclude_dirs config
- Changed should_exclude_dir() to use instance variable instead of global
- Added logging for custom exclusions (INFO for extend, WARNING for replace)
- Maintains backward compatibility (no config = use defaults)

**Testing:**
- Added 12 comprehensive tests in test_excluded_dirs_config.py
  - 3 tests for defaults (backward compatibility)
  - 3 tests for extend mode
  - 3 tests for replace mode
  - 1 test for precedence
  - 2 tests for edge cases
- All 12 new tests passing ✅
- All 22 existing github_scraper tests passing ✅

**Documentation:**
- Updated CLAUDE.md config parameters section
- Added detailed "Configurable Directory Exclusions" feature section
- Included examples for both modes
- Listed common use cases (monorepos, enterprise, legacy codebases)

**Use Cases:**
- Monorepos with custom directory structures
- Enterprise projects with non-standard naming conventions
- Including unusual directories for analysis
- Minimal exclusions for small/simple projects

**Backward Compatibility:**
✅ Fully backward compatible - existing configs work unchanged
✅ Smart defaults maintained when no config provided
✅ All existing tests pass

Co-authored-by: jimmy058910 <jimmy058910@users.noreply.github.com>
39
CLAUDE.md
@@ -437,12 +437,51 @@ Config files (`configs/*.json`) define scraping behavior:
- `rate_limit`: Delay between requests (seconds)
- `max_pages`: Maximum pages to scrape
- `skip_llms_txt`: Skip llms.txt detection, force HTML scraping (default: false)
- `exclude_dirs_additional`: Add custom directories to default exclusions (for local repo analysis)
- `exclude_dirs`: Replace default directory exclusions entirely (advanced, for local repo analysis)

## Key Features & Implementation

### Auto-Detect Existing Data
Tool checks for `output/{name}_data/` and prompts to reuse, avoiding re-scraping (check_existing_data() in doc_scraper.py:653-660).

### Configurable Directory Exclusions (Local Repository Analysis)

When using `local_repo_path` for unlimited local repository analysis, you can customize which directories to exclude from analysis.

**Smart Defaults:**
Automatically excludes common directories: `venv`, `node_modules`, `__pycache__`, `.git`, `build`, `dist`, `.pytest_cache`, `htmlcov`, `.tox`, `.mypy_cache`, etc.

**Extend Mode** (`exclude_dirs_additional`): Add custom exclusions to defaults
```json
{
  "sources": [{
    "type": "github",
    "local_repo_path": "/path/to/repo",
    "exclude_dirs_additional": ["proprietary", "legacy", "third_party"]
  }]
}
```

**Replace Mode** (`exclude_dirs`): Override defaults entirely (advanced)
```json
{
  "sources": [{
    "type": "github",
    "local_repo_path": "/path/to/repo",
    "exclude_dirs": ["node_modules", ".git", "custom_vendor"]
  }]
}
```
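
One way to picture how the two modes combine is a single resolution step over a source entry. The sketch below is illustrative only and is not the code in `github_scraper.py`: `DEFAULT_EXCLUDED_DIRS` is a partial list copied from the Smart Defaults above, `resolve_excluded_dirs` is a hypothetical helper, and the rule that `exclude_dirs` wins when both keys are present is an assumption, since this section does not state the precedence. The log levels are also illustrative.

```python
import logging

logger = logging.getLogger(__name__)

# Partial list copied from the Smart Defaults above; the real set may hold more.
DEFAULT_EXCLUDED_DIRS = frozenset({
    "venv", "node_modules", "__pycache__", ".git", "build", "dist",
    ".pytest_cache", "htmlcov", ".tox", ".mypy_cache",
})


def resolve_excluded_dirs(source_config: dict) -> set:
    """Return the effective exclusion set for one source entry (hypothetical helper)."""
    replace = source_config.get("exclude_dirs")
    additional = source_config.get("exclude_dirs_additional")

    if replace is not None:
        # Replace mode: the defaults are dropped entirely.
        # ASSUMPTION: replace mode takes precedence when both keys are set.
        logger.warning("Replacing default exclusions with: %s", sorted(replace))
        return set(replace)

    if additional:
        # Extend mode: defaults plus the custom directory names.
        logger.info("Adding custom exclusions: %s", sorted(additional))
        return set(DEFAULT_EXCLUDED_DIRS) | set(additional)

    # No config: fall back to the defaults unchanged.
    return set(DEFAULT_EXCLUDED_DIRS)


extended = resolve_excluded_dirs({"exclude_dirs_additional": ["proprietary", "legacy"]})
replaced = resolve_excluded_dirs({"exclude_dirs": ["node_modules", ".git", "custom_vendor"]})
```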

**Use Cases:**
- Monorepos with custom directory structures
- Enterprise projects with non-standard naming
- Including unusual directories (e.g., analyzing venv code)
- Minimal exclusions for small/simple projects

See: `should_exclude_dir()` in github_scraper.py:304-306
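
As a rough illustration of what a per-instance exclusion check can look like during a directory walk, here is a minimal sketch. It is not the method at the location referenced above: `LocalRepoWalker` and `iter_files` are hypothetical names, and matching on the exact directory base name is an assumption.

```python
import os
import tempfile


class LocalRepoWalker:
    """Hypothetical stand-in for the scraper; not the real GitHubScraper."""

    def __init__(self, excluded_dirs):
        # Per-instance set, so two walkers can use different exclusions.
        self.excluded_dirs = set(excluded_dirs)

    def should_exclude_dir(self, dir_name: str) -> bool:
        # ASSUMPTION: exact match on the directory's base name.
        return dir_name in self.excluded_dirs

    def iter_files(self, root: str):
        for current, dirs, files in os.walk(root):
            # Prune in place so os.walk never descends into excluded directories.
            dirs[:] = [d for d in dirs if not self.should_exclude_dir(d)]
            for name in files:
                yield os.path.relpath(os.path.join(current, name), root)


# Tiny demo tree: src/app.py is kept, legacy/old.py is skipped.
with tempfile.TemporaryDirectory() as repo:
    for rel in ("src/app.py", "legacy/old.py"):
        path = os.path.join(repo, rel)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()
    found = sorted(LocalRepoWalker({"legacy", ".git"}).iter_files(repo))
```

Pruning `dirs` in place avoids reading the contents of excluded directories at all, which matters for large trees such as `node_modules`.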

### Language Detection
Detects code languages from:
1. CSS class attributes (`language-*`, `lang-*`)