feat: Make EXCLUDED_DIRS configurable for local repository analysis

Closes #203

Adds configuration options to customize directory exclusions during local
repository analysis, while maintaining backward compatibility with smart
defaults.

**New Config Options:**

1. `exclude_dirs_additional` - Extend defaults (most common)
   - Adds custom directories to default exclusions
   - Example: ["proprietary", "legacy", "third_party"]
   - Total exclusions = defaults + additional

2. `exclude_dirs` - Replace defaults (advanced users)
   - Completely overrides default exclusions
   - Example: ["node_modules", ".git", "custom_vendor"]
   - Gives full control over exclusions

**Implementation:**

- Modified GitHubScraper.__init__() to parse exclude_dirs config
- Changed should_exclude_dir() to use instance variable instead of global
- Added logging for custom exclusions (INFO for extend, WARNING for replace)
- Maintains backward compatibility (no config = use defaults)
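
A minimal sketch of how the two options might be resolved into one exclusion set. This is not the code from `GitHubScraper.__init__()`: the helper name `resolve_excluded_dirs`, the shortened `DEFAULT_EXCLUDED_DIRS` set, and the choice that `exclude_dirs` wins when both keys are present are all assumptions made for illustration.

```python
import logging

logger = logging.getLogger(__name__)

# Illustrative subset only; the project's real default list is longer.
DEFAULT_EXCLUDED_DIRS = {"venv", "node_modules", "__pycache__", ".git", "build", "dist"}


def resolve_excluded_dirs(config: dict) -> set[str]:
    """Return the directory names to skip for one source config (hypothetical helper)."""
    if "exclude_dirs" in config:
        # Replace mode: defaults are dropped entirely.
        # Assumption: this key takes precedence if both keys are set.
        excluded = set(config["exclude_dirs"])
        logger.warning("Replacing default exclusions with: %s", sorted(excluded))
        return excluded

    if "exclude_dirs_additional" in config:
        # Extend mode: defaults plus the user's additions.
        additional = set(config["exclude_dirs_additional"])
        logger.info("Adding custom exclusions: %s", sorted(additional))
        return DEFAULT_EXCLUDED_DIRS | additional

    # No config: fall back to a copy of the defaults (backward compatible).
    return set(DEFAULT_EXCLUDED_DIRS)
```

Returning a fresh set per call mirrors the move from a module-level global to an instance variable: each scraper instance can hold its own exclusions without mutating shared state.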

**Testing:**

- Added 12 comprehensive tests in test_excluded_dirs_config.py
  - 3 tests for defaults (backward compatibility)
  - 3 tests for extend mode
  - 3 tests for replace mode
  - 1 test for precedence
  - 2 tests for edge cases
- All 12 new tests passing
- All 22 existing github_scraper tests passing

**Documentation:**

- Updated CLAUDE.md config parameters section
- Added detailed "Configurable Directory Exclusions" feature section
- Included examples for both modes
- Listed common use cases (monorepos, enterprise, legacy codebases)

**Use Cases:**

- Monorepos with custom directory structures
- Enterprise projects with non-standard naming conventions
- Including unusual directories for analysis
- Minimal exclusions for small/simple projects

**Backward Compatibility:**

- Fully backward compatible - existing configs work unchanged
- Smart defaults maintained when no config provided
- All existing tests pass

Co-authored-by: jimmy058910 <jimmy058910@users.noreply.github.com>
yusyus
2025-11-29 23:53:27 +03:00
parent bd20b32470
commit ea289cebe1
3 changed files with 308 additions and 1 deletions


@@ -437,12 +437,51 @@ Config files (`configs/*.json`) define scraping behavior:
- `rate_limit`: Delay between requests (seconds)
- `max_pages`: Maximum pages to scrape
- `skip_llms_txt`: Skip llms.txt detection, force HTML scraping (default: false)
- `exclude_dirs_additional`: Add custom directories to default exclusions (for local repo analysis)
- `exclude_dirs`: Replace default directory exclusions entirely (advanced, for local repo analysis)
## Key Features & Implementation
### Auto-Detect Existing Data
The tool checks for `output/{name}_data/` and prompts to reuse it, avoiding re-scraping (`check_existing_data()` in doc_scraper.py:653-660).
### Configurable Directory Exclusions (Local Repository Analysis)
When using `local_repo_path` for unlimited local repository analysis, you can customize which directories to exclude from analysis.
**Smart Defaults:**
Automatically excludes common directories: `venv`, `node_modules`, `__pycache__`, `.git`, `build`, `dist`, `.pytest_cache`, `htmlcov`, `.tox`, `.mypy_cache`, etc.
**Extend Mode** (`exclude_dirs_additional`): Add custom exclusions to defaults
```json
{
"sources": [{
"type": "github",
"local_repo_path": "/path/to/repo",
"exclude_dirs_additional": ["proprietary", "legacy", "third_party"]
}]
}
```
**Replace Mode** (`exclude_dirs`): Override defaults entirely (advanced)
```json
{
"sources": [{
"type": "github",
"local_repo_path": "/path/to/repo",
"exclude_dirs": ["node_modules", ".git", "custom_vendor"]
}]
}
```
**Use Cases:**
- Monorepos with custom directory structures
- Enterprise projects with non-standard naming
- Including unusual directories (e.g., analyzing venv code)
- Minimal exclusions for small/simple projects
See: `should_exclude_dir()` in github_scraper.py:304-306
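
As a rough illustration of how an exclusion set can be applied during a local walk, the sketch below prunes excluded directories with `os.walk`. The standalone `should_exclude_dir(dir_name, excluded_dirs)` signature and the `iter_repo_files` helper are hypothetical; the method in github_scraper.py reads the set from the scraper instance and may apply further rules.

```python
import os
from typing import Iterator


def should_exclude_dir(dir_name: str, excluded_dirs: set[str]) -> bool:
    """Return True if a directory name is in the exclusion set (simplified stand-in)."""
    return dir_name in excluded_dirs


def iter_repo_files(root: str, excluded_dirs: set[str]) -> Iterator[str]:
    """Yield file paths under root, skipping excluded directories (hypothetical helper)."""
    for current, dirs, files in os.walk(root):
        # Assigning to dirs[:] prunes in place, so os.walk never descends
        # into an excluded directory.
        dirs[:] = [d for d in dirs if not should_exclude_dir(d, excluded_dirs)]
        for name in files:
            yield os.path.join(current, name)
```

Matching is by directory name at any depth, so excluding `legacy` would skip both `legacy/` and `src/legacy/` in this sketch.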
### Language Detection
Detects code languages from:
1. CSS class attributes (`language-*`, `lang-*`)