docs: update all chunk flag names to match renamed CLI flags

Replace all occurrences of old ambiguous flag names with the new explicit ones:
  --chunk-size (tokens)  → --chunk-tokens
  --chunk-overlap        → --chunk-overlap-tokens
  --chunk                → --chunk-for-rag
  --streaming-chunk-size → --streaming-chunk-chars
  --streaming-overlap    → --streaming-overlap-chars
  --chunk-size (pages)   → --pdf-pages-per-chunk

Updated: CLI_REFERENCE (EN+ZH), user-guide (EN+ZH), integrations (Haystack,
Chroma, Weaviate, FAISS, Qdrant), features/PDF_CHUNKING, examples/haystack-pipeline,
strategy docs, archive docs, and CHANGELOG.
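The rename is mechanical enough to script. Below is a sketch of a one-shot migration (a hypothetical helper, not part of this commit): ordering matters so that longer old flags are rewritten before their prefixes, and `--chunk-size` meant pages in the PDF extractor, so those call sites still need a manual pass to `--pdf-pages-per-chunk`.

```shell
# Hypothetical sed pipeline mirroring the rename table above.
# --streaming-chunk-size must be rewritten before --chunk-size, and
# --chunk-* before bare --chunk; --chunk-size is mapped to the token
# variant, so PDF-extractor invocations (pages) still need review.
migrate_flags() {
  sed -E \
    -e 's/--streaming-chunk-size/--streaming-chunk-chars/g' \
    -e 's/--streaming-overlap/--streaming-overlap-chars/g' \
    -e 's/--chunk-overlap/--chunk-overlap-tokens/g' \
    -e 's/--chunk-size/--chunk-tokens/g' \
    -e 's/--chunk([^-]|$)/--chunk-for-rag\1/g'
}

echo 'skill-seekers scrape --chunk --chunk-size 512 --chunk-overlap 50' | migrate_flags
# → skill-seekers scrape --chunk-for-rag --chunk-tokens 512 --chunk-overlap-tokens 50
```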

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
yusyus
2026-02-24 22:15:14 +03:00
parent 7a2ffb286c
commit 73adda0b17
29 changed files with 488 additions and 214 deletions


@@ -1,143 +0,0 @@
name: Translate Documentation to Chinese

on:
  push:
    branches:
      - main
      - development
    paths:
      - 'docs/**/*.md'
      - '!docs/zh-CN/**'
      - '!docs/archive/**'
  workflow_dispatch:
    inputs:
      files:
        description: 'Specific files to translate (comma-separated, or "all")'
        required: false
        default: 'changed'

jobs:
  detect-changes:
    runs-on: ubuntu-latest
    outputs:
      changed-files: ${{ steps.detect.outputs.files }}
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          fetch-depth: 2

      - name: Detect changed files
        id: detect
        run: |
          if [ "${{ github.event.inputs.files }}" = "all" ]; then
            # Translate all docs
            FILES=$(find docs -name "*.md" -not -path "docs/zh-CN/*" -not -path "docs/archive/*" | tr '\n' ',')
          elif [ "${{ github.event.inputs.files }}" != "" ] && [ "${{ github.event.inputs.files }}" != "changed" ]; then
            # Use provided files
            FILES="${{ github.event.inputs.files }}"
          else
            # Detect changed files
            FILES=$(git diff --name-only HEAD~1 HEAD | grep "^docs/" | grep -v "^docs/zh-CN/" | grep -v "^docs/archive/" | grep "\.md$" | tr '\n' ',')
          fi
          # Remove trailing comma
          FILES=$(echo "$FILES" | sed 's/,$//')
          echo "files=$FILES" >> $GITHUB_OUTPUT
          echo "Detected files: $FILES"

  translate:
    runs-on: ubuntu-latest
    needs: detect-changes
    if: needs.detect-changes.outputs.changed-files != ''
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install anthropic

      - name: Translate documents
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          IFS=',' read -ra FILES <<< "${{ needs.detect-changes.outputs.changed-files }}"
          for file in "${FILES[@]}"; do
            if [ -f "$file" ]; then
              echo "Translating: $file"
              python scripts/translate_doc.py "$file" --target-lang zh-CN || echo "Failed: $file"
            fi
          done

      - name: Check for changes
        id: git-check
        run: |
          git add docs/zh-CN/
          if git diff --cached --quiet; then
            echo "changed=false" >> $GITHUB_OUTPUT
          else
            echo "changed=true" >> $GITHUB_OUTPUT
          fi

      - name: Create Pull Request
        if: steps.git-check.outputs.changed == 'true'
        id: create-pr  # required: a later step reads steps.create-pr.outputs.pull-request-number
        uses: peter-evans/create-pull-request@v6
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
          commit-message: "[Auto] Chinese translation update"
          title: "🌐 [Auto] Chinese Documentation Translation Update"
          body: |
            ## 🇨🇳 中文文档翻译更新 / Chinese Documentation Translation Update

            This PR contains automated translations of updated documentation.

            ### 变更内容 / Changes
            ${{ needs.detect-changes.outputs.changed-files }}

            ### 审阅指南 / Review Guide
            - [ ] 技术术语准确 / Technical terms accurate
            - [ ] 链接正确指向中文版本 / Links point to Chinese versions
            - [ ] 代码示例保持原样 / Code examples preserved
            - [ ] 格式正确 / Formatting correct

            ### 如何审阅 / How to Review
            1. 查看文件列表 / Check the file list
            2. 阅读中文翻译 / Read the Chinese translation
            3. 在 PR 中提出修改建议 / Suggest changes in PR
            4. 确认后批准 / Approve when satisfied

            ### 相关 Issue / Related Issue
            - #260 - Chinese Translation

            ---
            *This PR was auto-generated by GitHub Actions*
          branch: auto-translate-zh-cn-${{ github.run_number }}
          delete-branch: true
          labels: translation, zh-CN, needs-review, automated

      - name: Update Issue #260
        if: steps.git-check.outputs.changed == 'true'
        uses: actions/github-script@v7
        with:
          script: |
            github.rest.issues.createComment({
              issue_number: 260,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `🤖 **自动翻译更新 / Automated Translation Update**
            新的中文翻译已准备就绪，需要社区审阅 / New Chinese translations are ready and need community review:
            - PR: #${{ steps.create-pr.outputs.pull-request-number }}
            - 文件 / Files: ${{ needs.detect-changes.outputs.changed-files }}
            请志愿者帮忙审阅，谢谢！ / Community review needed, thanks!`
            })
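Outside Actions, the detect step's `all` branch can be exercised on its own. A standalone sketch (with a `sort` added here for stable output, which the workflow itself does not do):

```shell
# Collect Markdown files under docs/, excluding zh-CN translations and
# the archive, then comma-join and strip the trailing comma, mirroring
# the detect step above (sort added for deterministic output).
detect_all_docs() {
  find docs -name '*.md' -not -path 'docs/zh-CN/*' -not -path 'docs/archive/*' \
    | sort | tr '\n' ',' | sed 's/,$//'
}
```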

=0.24.0 Normal file (18 lines)

@@ -0,0 +1,18 @@
error: externally-managed-environment
× This environment is externally managed
╰─> To install Python packages system-wide, try 'pacman -S
python-xyz', where xyz is the package you are trying to
install.
If you wish to install a non-Arch-packaged Python package,
create a virtual environment using 'python -m venv path/to/venv'.
Then use path/to/venv/bin/python and path/to/venv/bin/pip.
If you wish to install a non-Arch packaged Python application,
it may be easiest to use 'pipx install xyz', which will manage a
virtual environment for you. Make sure you have python-pipx
installed via pacman.
note: If you believe this is a mistake, please contact your Python installation or OS distribution provider. You can override this, at the risk of breaking your Python installation or OS, by passing --break-system-packages.
hint: See PEP 668 for the detailed specification.


@@ -5,6 +5,18 @@ All notable changes to Skill Seeker will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

### Changed

- **Explicit chunk flag names** — All `--chunk-*` flags now include unit suffixes to eliminate ambiguity:
  - `--chunk-size` (RAG tokens) → `--chunk-tokens`
  - `--chunk-overlap` (RAG tokens) → `--chunk-overlap-tokens`
  - `--chunk` (enable RAG chunking) → `--chunk-for-rag`
  - `--streaming-chunk-size` (chars) → `--streaming-chunk-chars`
  - `--streaming-overlap` (chars) → `--streaming-overlap-chars`
  - `--chunk-size` in PDF extractor (pages) → `--pdf-pages-per-chunk`
- **`setup_logging()` centralized** — Removed duplicate `logging.basicConfig()` calls in `github_scraper.py`, `codebase_scraper.py`, `unified_scraper.py`; all now use shared `setup_logging()` from `utils.py`

## [3.1.2] - 2026-02-24

### 🔧 Fix `create` Command Argument Forwarding, Gemini Model, and Enhance Dispatcher
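The token flags describe a sliding window: each chunk carries up to `--chunk-tokens` tokens and repeats the last `--chunk-overlap-tokens` tokens of the previous chunk. A toy sketch with whitespace tokenization (illustrative only; the real chunker also preserves code blocks and paragraph boundaries):

```shell
# Sliding token window: each chunk holds up to $2 tokens and restarts
# at the previous chunk's last $3 tokens (whitespace "tokens" here).
chunk_by_tokens() {  # args: text chunk_tokens overlap_tokens
  local -a toks=($1)
  local size=$2 overlap=$3 i=0
  local step=$(( size - overlap ))
  while (( i < ${#toks[@]} )); do
    echo "${toks[@]:i:size}"
    (( i += step ))
  done
}

chunk_by_tokens "one two three four five six" 4 1
# → one two three four
#   four five six
```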

TESTING_GAP_REPORT.md Normal file (345 lines)

@@ -0,0 +1,345 @@
# Comprehensive Testing Gap Report
**Project:** Skill Seekers v3.1.0
**Date:** 2026-02-22
**Total Test Files:** 113
**Total Test Functions:** ~208+ (collected: 2173 tests)
---
## Executive Summary
### Overall Test Health: 🟡 GOOD with Gaps
| Category | Status | Coverage | Key Gaps |
|----------|--------|----------|----------|
| CLI Arguments | ✅ Good | 85% | Some edge cases |
| Workflow System | ✅ Excellent | 90% | Inline stage parsing edge cases |
| Scrapers | 🟡 Moderate | 70% | Missing real HTTP/PDF tests |
| Enhancement | 🟡 Partial | 60% | Core logic not tested |
| MCP Tools | 🟡 Good | 71% | 9 tools not covered |
| Integration/E2E | 🟡 Moderate | 65% | Heavy mocking |
| Adaptors | ✅ Good | 80% | Good coverage per platform |
---
## Detailed Findings by Category
### 1. CLI Argument Tests ✅ GOOD
**Files Reviewed:**
- `test_analyze_command.py` (269 lines, 26 tests)
- `test_unified.py` - TestUnifiedCLIArguments class (6 tests)
- `test_pdf_scraper.py` - TestPDFCLIArguments class (4 tests)
- `test_create_arguments.py` (399 lines)
- `test_create_integration_basic.py` (310 lines, 23 tests)
**Strengths:**
- All new workflow flags are tested (`--enhance-workflow`, `--enhance-stage`, `--var`, `--workflow-dry-run`)
- Argument parsing thoroughly tested
- Default values verified
- Complex command combinations tested
**Gaps:**
- `test_create_integration_basic.py`: 2 tests skipped (source auto-detection not fully tested)
- No tests for invalid argument combinations beyond basic parsing errors
---
### 2. Workflow Tests ✅ EXCELLENT
**Files Reviewed:**
- `test_workflow_runner.py` (445 lines, 30+ tests)
- `test_workflows_command.py` (571 lines, 40+ tests)
- `test_workflow_tools_mcp.py` (295 lines, 20+ tests)
**Strengths:**
- Comprehensive workflow execution tests
- Variable substitution thoroughly tested
- Dry-run mode tested
- Workflow chaining tested
- All 6 workflow subcommands tested (list, show, copy, add, remove, validate)
- MCP workflow tools tested
**Minor Gaps:**
- No tests for `_build_inline_engine` edge cases
- No tests for malformed stage specs (empty, invalid format)
---
### 3. Scraper Tests 🟡 MODERATE with Significant Gaps
**Files Reviewed:**
- `test_scraper_features.py` (524 lines) - Doc scraper features
- `test_codebase_scraper.py` (478 lines) - Codebase analysis
- `test_pdf_scraper.py` (558 lines) - PDF scraper
- `test_github_scraper.py` (1015 lines) - GitHub scraper
- `test_unified_analyzer.py` (428 lines) - Unified analyzer
**Critical Gaps:**
#### A. Missing Real External Resource Tests
| Resource | Test Type | Status |
|----------|-----------|--------|
| HTTP Requests (docs) | Mocked only | ❌ Gap |
| PDF Extraction | Mocked only | ❌ Gap |
| GitHub API | Mocked only | ❌ Gap (acceptable) |
| Local Files | Real tests | ✅ Good |
#### B. Missing Core Function Tests
| Function | Location | Priority |
|----------|----------|----------|
| `UnifiedScraper.run()` | unified_scraper.py | 🔴 High |
| `UnifiedScraper._scrape_documentation()` | unified_scraper.py | 🔴 High |
| `UnifiedScraper._scrape_github()` | unified_scraper.py | 🔴 High |
| `UnifiedScraper._scrape_pdf()` | unified_scraper.py | 🔴 High |
| `UnifiedScraper._scrape_local()` | unified_scraper.py | 🟡 Medium |
| `DocToSkillConverter.scrape()` | doc_scraper.py | 🔴 High |
| `PDFToSkillConverter.extract_pdf()` | pdf_scraper.py | 🔴 High |
#### C. PDF Scraper Limited Coverage
- No actual PDF parsing tests (only mocked)
- OCR functionality not tested
- Page range extraction not tested
---
### 4. Enhancement Tests 🟡 PARTIAL - MAJOR GAPS
**Files Reviewed:**
- `test_enhance_command.py` (367 lines, 25+ tests)
- `test_enhance_skill_local.py` (163 lines, 14 tests)
**Critical Gap in `test_enhance_skill_local.py`:**
| Function | Lines | Tested? | Priority |
|----------|-------|---------|----------|
| `summarize_reference()` | ~50 | ❌ No | 🔴 High |
| `create_enhancement_prompt()` | ~200 | ❌ No | 🔴 High |
| `run()` | ~100 | ❌ No | 🔴 High |
| `_run_headless()` | ~130 | ❌ No | 🔴 High |
| `_run_background()` | ~80 | ❌ No | 🟡 Medium |
| `_run_daemon()` | ~60 | ❌ No | 🟡 Medium |
| `write_status()` | ~30 | ❌ No | 🟡 Medium |
| `read_status()` | ~40 | ❌ No | 🟡 Medium |
| `detect_terminal_app()` | ~80 | ❌ No | 🟡 Medium |
**Current Tests Only Cover:**
- Agent presets configuration
- Command building
- Agent name normalization
- Environment variable handling
**Recommendation:** Add comprehensive tests for the core enhancement logic.
---
### 5. MCP Tool Tests 🟡 GOOD with Coverage Gaps
**Files Reviewed:**
- `test_mcp_fastmcp.py` (868 lines)
- `test_mcp_server.py` (715 lines)
- `test_mcp_vector_dbs.py` (259 lines)
- `test_real_world_fastmcp.py` (558 lines)
**Coverage Analysis:**
| Tool Category | Tools | Tested | Coverage |
|---------------|-------|--------|----------|
| Config Tools | 3 | 3 | ✅ 100% |
| Scraping Tools | 8 | 4 | 🟡 50% |
| Packaging Tools | 4 | 4 | ✅ 100% |
| Splitting Tools | 2 | 2 | ✅ 100% |
| Source Tools | 5 | 5 | ✅ 100% |
| Vector DB Tools | 4 | 4 | ✅ 100% |
| Workflow Tools | 5 | 0 | ❌ 0% |
| **Total** | **31** | **22** | **🟡 71%** |
**Untested Tools:**
1. `detect_patterns`
2. `extract_test_examples`
3. `build_how_to_guides`
4. `extract_config_patterns`
5. `list_workflows`
6. `get_workflow`
7. `create_workflow`
8. `update_workflow`
9. `delete_workflow`
**Note:** `test_mcp_server.py` tests legacy server, `test_mcp_fastmcp.py` tests modern server.
---
### 6. Integration/E2E Tests 🟡 MODERATE
**Files Reviewed:**
- `test_create_integration_basic.py` (310 lines)
- `test_e2e_three_stream_pipeline.py` (598 lines)
- `test_analyze_e2e.py` (344 lines)
- `test_install_skill_e2e.py` (533 lines)
- `test_c3_integration.py` (362 lines)
**Issues Found:**
1. **Skipped Tests:**
- `test_create_detects_web_url` - Source auto-detection incomplete
- `test_create_invalid_source_shows_error` - Error handling incomplete
- `test_cli_via_unified_command` - Asyncio issues
2. **Heavy Mocking:**
- Most GitHub API tests use mocking
- No real HTTP tests for doc scraping
- Integration tests don't test actual integration
3. **Limited Scope:**
- Only `--quick` preset tested (not `--comprehensive`)
- C3.x tests use mock data only
- Most E2E tests are unit tests with mocks
---
### 7. Adaptor Tests ✅ GOOD
**Files Reviewed:**
- `test_adaptors/test_adaptors_e2e.py` (893 lines)
- `test_adaptors/test_claude_adaptor.py` (314 lines)
- `test_adaptors/test_gemini_adaptor.py` (146 lines)
- `test_adaptors/test_openai_adaptor.py` (188 lines)
- Plus 8 more platform adaptors
**Strengths:**
- Each adaptor has dedicated tests
- Package format testing
- Upload success/failure scenarios
- Platform-specific features tested
**Minor Gaps:**
- Some adaptors only test 1-2 scenarios
- Error handling coverage varies by platform
---
### 8. Config/Validation Tests ✅ GOOD
**Files Reviewed:**
- `test_config_validation.py` (270 lines)
- `test_config_extractor.py` (629 lines)
- `test_config_fetcher.py` (340 lines)
**Strengths:**
- Unified vs legacy format detection
- Field validation comprehensive
- Error message quality tested
---
## Summary of Critical Testing Gaps
### 🔴 HIGH PRIORITY (Must Fix)
1. **Enhancement Core Logic**
- File: `test_enhance_skill_local.py`
- Missing: 9 major functions
- Impact: Core feature untested
2. **Unified Scraper Main Flow**
- File: New tests needed
- Missing: `_scrape_*()` methods, `run()` orchestration
- Impact: Multi-source scraping untested
3. **Actual HTTP/PDF/GitHub Integration**
- Missing: Real external resource tests
- Impact: Only mock tests exist
### 🟡 MEDIUM PRIORITY (Should Fix)
4. **MCP Workflow Tools**
- Missing: 5 workflow tools (0% coverage)
- Impact: MCP workflow features untested
5. **Skipped Integration Tests**
- 3 tests skipped
- Impact: Source auto-detection incomplete
6. **PDF Real Extraction**
- Missing: Actual PDF parsing
- Impact: PDF feature quality unknown
### 🟢 LOW PRIORITY (Nice to Have)
7. **Additional Scraping Tools**
- Missing: 4 scraping tool tests
- Impact: Low (core tools covered)
8. **Edge Case Coverage**
- Missing: Invalid argument combinations
- Impact: Low (happy path covered)
---
## Recommendations
### Immediate Actions (Next Sprint)
1. **Add Enhancement Logic Tests** (~400 lines)
- Test `summarize_reference()`
- Test `create_enhancement_prompt()`
- Test `run()` method
- Test status read/write
2. **Fix Skipped Tests** (~100 lines)
- Fix asyncio issues in `test_cli_via_unified_command`
- Complete source auto-detection tests
3. **Add MCP Workflow Tool Tests** (~200 lines)
- Test all 5 workflow tools
### Short Term (Next Month)
4. **Add Unified Scraper Integration Tests** (~300 lines)
- Test main orchestration flow
- Test individual source scraping
5. **Add Real PDF Tests** (~150 lines)
- Test with actual PDF files
- Test OCR if available
### Long Term (Next Quarter)
6. **HTTP Integration Tests** (~200 lines)
- Test with real websites (use test sites)
- Mock server approach
7. **Complete E2E Pipeline** (~300 lines)
- Full workflow from scrape to upload
- Real GitHub repo (fork test repo)
---
## Test Quality Metrics
| Metric | Score | Notes |
|--------|-------|-------|
| Test Count | 🟢 Good | 2173+ tests |
| Coverage | 🟡 Moderate | ~75% estimated |
| Real Tests | 🟡 Moderate | Many mocked |
| Documentation | 🟢 Good | Most tests documented |
| Maintenance | 🟢 Good | Tests recently updated |
---
## Conclusion
The Skill Seekers test suite is **comprehensive in quantity** (2173+ tests) but has **quality gaps** in critical areas:
1. **Core enhancement logic** is largely untested
2. **Multi-source scraping** orchestration lacks integration tests
3. **MCP workflow tools** have zero coverage
4. **Real external resource** testing is minimal
**Priority:** Fix the 🔴 HIGH priority gaps first, as they impact core functionality.
---
*Report generated: 2026-02-22*
*Reviewer: Systematic test review with parallel subagent analysis*


@@ -71,7 +71,7 @@ skill-seekers pdf manual.pdf --name product-manual
skill-seekers pdf scanned.pdf --enable-ocr

# Large PDF (chunked processing)
-skill-seekers pdf large.pdf --chunk-size 50
+skill-seekers pdf large.pdf --pdf-pages-per-chunk 50
```

### Multi-Source Scraping


@@ -122,7 +122,7 @@ python3 cli/pdf_extractor_poc.py documentation.pdf \
  --extract-images \
  --min-image-size 150 \
  --min-quality 6.0 \
-  --chunk-size 20 \
+  --pdf-pages-per-chunk 20 \
  --output documentation.json \
  --verbose \
  --pretty
@@ -477,7 +477,7 @@ python3 cli/pdf_extractor_poc.py manual.pdf \
  --image-dir assets/images/ \
  --min-image-size 200 \
  --min-quality 7.0 \
-  --chunk-size 15 \
+  --pdf-pages-per-chunk 15 \
  --output manual.json \
  --verbose \
  --pretty


@@ -25,10 +25,10 @@ Break large PDFs into smaller, manageable chunks:
python3 cli/pdf_extractor_poc.py input.pdf

# Custom chunk size (20 pages per chunk)
-python3 cli/pdf_extractor_poc.py input.pdf --chunk-size 20
+python3 cli/pdf_extractor_poc.py input.pdf --pdf-pages-per-chunk 20

# Disable chunking (single chunk with all pages)
-python3 cli/pdf_extractor_poc.py input.pdf --chunk-size 0
+python3 cli/pdf_extractor_poc.py input.pdf --pdf-pages-per-chunk 0
```

### ✅ 2. Chapter/Section Detection
@@ -272,7 +272,7 @@ cat manual.json | jq '.total_chunks'
```bash
# Large PDF with bigger chunks (50 pages each)
-python3 cli/pdf_extractor_poc.py large_manual.pdf --chunk-size 50 -o output.json -v
+python3 cli/pdf_extractor_poc.py large_manual.pdf --pdf-pages-per-chunk 50 -o output.json -v

# Verbose output shows:
# 📦 Creating chunks (chunk_size=50)...
@@ -286,7 +286,7 @@ python3 cli/pdf_extractor_poc.py large_manual.pdf --chunk-size 50 -o output.json
```bash
# Process all pages as single chunk
-python3 cli/pdf_extractor_poc.py small_doc.pdf --chunk-size 0 -o output.json
+python3 cli/pdf_extractor_poc.py small_doc.pdf --pdf-pages-per-chunk 0 -o output.json
```

---
@@ -369,7 +369,7 @@ Create a test PDF with chapters:
3. Page 30: "Chapter 3: API Reference"

```bash
-python3 cli/pdf_extractor_poc.py test.pdf -o test.json --chunk-size 20 -v
+python3 cli/pdf_extractor_poc.py test.pdf -o test.json --pdf-pages-per-chunk 20 -v

# Verify chapters detected
cat test.json | jq '.chapters'
@@ -441,7 +441,7 @@ The chunking feature lays groundwork for:
**Example workflow:**

```bash
# Extract large manual with chapters
-python3 cli/pdf_extractor_poc.py large_manual.pdf --chunk-size 25 -o manual.json
+python3 cli/pdf_extractor_poc.py large_manual.pdf --pdf-pages-per-chunk 25 -o manual.json

# Future: Build skill from chunks
python3 cli/build_skill_from_pdf.py manual.json


@@ -223,7 +223,7 @@ skill-seekers package output/codebase --target langchain
**Option D: RAG-Optimized Chunking**

```bash
-skill-seekers scrape --config configs/fastapi.json --chunk-for-rag --chunk-size 512
+skill-seekers scrape --config configs/fastapi.json --chunk-for-rag --chunk-tokens 512
skill-seekers package output/fastapi --target langchain
```
@@ -968,7 +968,7 @@ collection.add(
2. **Implement Semantic Chunking:**
```bash
-skill-seekers scrape --config configs/fastapi.json --chunk-for-rag --chunk-size 512
+skill-seekers scrape --config configs/fastapi.json --chunk-for-rag --chunk-tokens 512
```
3. **Set Up Multi-Collection Search:**


@@ -255,7 +255,7 @@ skill-seekers package output/codebase --target langchain
**Option D: RAG-Optimized Chunking**

```bash
-skill-seekers scrape --config configs/fastapi.json --chunk-for-rag --chunk-size 512
+skill-seekers scrape --config configs/fastapi.json --chunk-for-rag --chunk-tokens 512
skill-seekers package output/fastapi --target langchain
```


@@ -318,8 +318,8 @@ print(response["llm"]["replies"][0])
# Enable semantic chunking (preserves code blocks, respects paragraphs)
skill-seekers scrape --config configs/django.json \
  --chunk-for-rag \
-  --chunk-size 512 \
-  --chunk-overlap 50
+  --chunk-tokens 512 \
+  --chunk-overlap-tokens 50

# Package chunked output
skill-seekers package output/django --target haystack
@@ -439,8 +439,8 @@ python scripts/merge_documents.py \
# Enable chunking for frameworks with long pages
skill-seekers scrape --config configs/django.json \
  --chunk-for-rag \
-  --chunk-size 512 \
-  --chunk-overlap 50
+  --chunk-tokens 512 \
+  --chunk-overlap-tokens 50
```

### 2. Choose Right Document Store
@@ -506,8 +506,8 @@ Complete example of building a FastAPI documentation chatbot:
# Scrape FastAPI docs with chunking
skill-seekers scrape --config configs/fastapi.json \
  --chunk-for-rag \
-  --chunk-size 512 \
-  --chunk-overlap 50 \
+  --chunk-tokens 512 \
+  --chunk-overlap-tokens 50 \
  --max-pages 200

# Package for Haystack
@@ -698,8 +698,8 @@ skill-seekers scrape --config configs/fastapi.json --chunk-for-rag
# 2. Adjust chunk size
skill-seekers scrape --config configs/fastapi.json \
  --chunk-for-rag \
-  --chunk-size 768 \      # Larger chunks for more context
-  --chunk-overlap 100     # More overlap for continuity
+  --chunk-tokens 768 \    # Larger chunks for more context
+  --chunk-overlap-tokens 100  # More overlap for continuity

# 3. Use hybrid search (BM25 + embeddings)
# See Advanced Usage section


@@ -270,7 +270,7 @@ skill-seekers package output/codebase --target langchain
**Option D: RAG-Optimized Chunking**

```bash
-skill-seekers scrape --config configs/fastapi.json --chunk-for-rag --chunk-size 512
+skill-seekers scrape --config configs/fastapi.json --chunk-for-rag --chunk-tokens 512
skill-seekers package output/fastapi --target langchain
```


@@ -210,7 +210,7 @@ skill-seekers package output/codebase --target langchain
**Option D: RAG-Optimized Chunking**

```bash
-skill-seekers scrape --config configs/fastapi.json --chunk-for-rag --chunk-size 512
+skill-seekers scrape --config configs/fastapi.json --chunk-for-rag --chunk-tokens 512
skill-seekers package output/fastapi --target langchain
```
@@ -960,7 +960,7 @@ print(schema.get("multiTenancyConfig", {}).get("enabled"))  # Should be True
2. **Implement Semantic Chunking:**
```bash
-skill-seekers scrape --config configs/fastapi.json --chunk-for-rag --chunk-size 512
+skill-seekers scrape --config configs/fastapi.json --chunk-for-rag --chunk-tokens 512
```
3. **Set Up Multi-Tenancy:**


@@ -252,8 +252,8 @@ skill-seekers create [source] [options]
| | `--workflow-dry-run` | | Preview workflow without executing |
| | `--dry-run` | | Preview without creating |
| | `--chunk-for-rag` | | Enable RAG chunking |
-| | `--chunk-size` | 512 | Chunk size in tokens |
-| | `--chunk-overlap` | 50 | Chunk overlap in tokens |
+| | `--chunk-tokens` | 512 | Chunk size in tokens |
+| | `--chunk-overlap-tokens` | 50 | Chunk overlap in tokens |
| | `--help-web` | | Show web scraping options |
| | `--help-github` | | Show GitHub options |
| | `--help-local` | | Show local analysis options |
@@ -615,10 +615,10 @@ skill-seekers package SKILL_DIRECTORY [options]
| | `--skip-quality-check` | | Skip quality checks |
| | `--upload` | | Auto-upload after packaging |
| | `--streaming` | | Streaming mode for large docs |
-| | `--chunk-size` | 4000 | Max chars per chunk (streaming) |
-| | `--chunk-overlap` | 200 | Overlap between chunks |
+| | `--streaming-chunk-chars` | 4000 | Max chars per chunk (streaming) |
+| | `--streaming-overlap-chars` | 200 | Overlap between chunks (chars) |
| | `--batch-size` | 100 | Chunks per batch |
-| | `--chunk` | | Enable RAG chunking |
+| | `--chunk-for-rag` | | Enable RAG chunking |
| | `--chunk-tokens` | 512 | Max tokens per chunk |
| | `--no-preserve-code` | | Allow code block splitting |
@@ -877,7 +877,7 @@ skill-seekers stream --config CONFIG [options]
| Short | Long | Description |
|-------|------|-------------|
| `-c` | `--config` | Config JSON file |
-| | `--chunk-size` | Size of each chunk |
+| | `--streaming-chunk-chars` | Maximum characters per chunk (default: 4000) |
| | `--output` | Output directory |

**Examples:**
@@ -887,7 +887,7 @@ skill-seekers stream --config CONFIG [options]
skill-seekers stream --config configs/large-docs.json

# Custom chunk size
-skill-seekers stream --config configs/large-docs.json --chunk-size 1000
+skill-seekers stream --config configs/large-docs.json --streaming-chunk-chars 1000
```

---


@@ -365,8 +365,8 @@ Position Skill Seekers as **the universal documentation preprocessor** for the e
2. **Implement Chunking for RAG** (8-12 hours)

```bash
skill-seekers scrape --chunk-for-rag \
-  --chunk-size 512 \
-  --chunk-overlap 50 \
+  --chunk-tokens 512 \
+  --chunk-overlap-tokens 50 \
  --preserve-code-blocks
```


@@ -139,8 +139,8 @@ skill-seekers scrape --format confluence # Confluence storage format
```bash
# New flag for embedding-optimized chunking
skill-seekers scrape --chunk-for-rag \
-  --chunk-size 512 \
-  --chunk-overlap 50 \
+  --chunk-tokens 512 \
+  --chunk-overlap-tokens 50 \
  --add-metadata

# Output: chunks with metadata for embedding


@@ -385,7 +385,7 @@ skill-seekers create <url> --max-pages 100
skill-seekers create <url> --streaming

# Or smaller chunks
-skill-seekers create <url> --chunk-size 500
+skill-seekers create <url> --chunk-tokens 500
```

---


@@ -158,8 +158,8 @@ skill-seekers package output/large-skill/ --streaming
# Custom chunk size
skill-seekers package output/large-skill/ \
  --streaming \
-  --chunk-size 2000 \
-  --chunk-overlap 100
+  --streaming-chunk-chars 2000 \
+  --streaming-overlap-chars 100
```

**When to use:**
@@ -177,23 +177,23 @@ Optimize for Retrieval-Augmented Generation:
# Enable semantic chunking
skill-seekers package output/my-skill/ \
  --target langchain \
-  --chunk \
+  --chunk-for-rag \
  --chunk-tokens 512

# Custom chunk size
skill-seekers package output/my-skill/ \
  --target chroma \
  --chunk-tokens 256 \
-  --chunk-overlap 50
+  --chunk-overlap-tokens 50
```

**Chunking Options:**

| Option | Default | Description |
|--------|---------|-------------|
-| `--chunk` | auto | Enable chunking |
+| `--chunk-for-rag` | auto | Enable chunking |
| `--chunk-tokens` | 512 | Tokens per chunk |
-| `--chunk-overlap` | 50 | Overlap between chunks |
+| `--chunk-overlap-tokens` | 50 | Overlap between chunks (tokens) |
| `--no-preserve-code` | - | Allow splitting code blocks |

---
@@ -449,7 +449,7 @@ skill-seekers upload output/my-skill-claude.zip --target claude
skill-seekers package output/my-skill/ --streaming

# Smaller chunks
-skill-seekers package output/my-skill/ --streaming --chunk-size 1000
+skill-seekers package output/my-skill/ --streaming --streaming-chunk-chars 1000
```

---
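The streaming flags apply the same sliding-window arithmetic to characters rather than tokens. A toy sketch (illustrative; the shipped streaming packager may split on cleaner boundaries than a fixed offset):

```shell
# Character windows: defaults mirror the documented values,
# --streaming-chunk-chars 4000 and --streaming-overlap-chars 200.
chunk_by_chars() {  # args: text [chunk_chars] [overlap_chars]
  local text=$1 size=${2:-4000} overlap=${3:-200} i=0
  local step=$(( size - overlap ))
  while (( i < ${#text} )); do
    printf '%s\n' "${text:i:size}"
    (( i += step ))
  done
}

chunk_by_chars "abcdefghi" 4 1
# → abcd
#   defg
#   ghi
```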


@@ -295,7 +295,7 @@ skill-seekers package output/my-skill/ --streaming
# Reduce chunk size
skill-seekers package output/my-skill/ \
  --streaming \
-  --chunk-size 1000
+  --streaming-chunk-chars 1000
```

---


@@ -237,8 +237,8 @@ skill-seekers create [source] [options]
 | | `--workflow-dry-run` | | Preview workflow without executing |
 | | `--dry-run` | | Preview without creating |
 | | `--chunk-for-rag` | | Enable RAG chunking |
-| | `--chunk-size` | 512 | Chunk size in tokens |
-| | `--chunk-overlap` | 50 | Chunk overlap in tokens |
+| | `--chunk-tokens` | 512 | Chunk size in tokens |
+| | `--chunk-overlap-tokens` | 50 | Chunk overlap in tokens |
 | | `--help-web` | | Show web scraping options |
 | | `--help-github` | | Show GitHub options |
 | | `--help-local` | | Show local analysis options |
@@ -593,10 +593,10 @@ skill-seekers package SKILL_DIRECTORY [options]
 | | `--skip-quality-check` | | Skip quality checks |
 | | `--upload` | | Auto-upload after packaging |
 | | `--streaming` | | Streaming mode for large docs |
-| | `--chunk-size` | 4000 | Max chars per chunk (streaming) |
-| | `--chunk-overlap` | 200 | Overlap between chunks |
+| | `--streaming-chunk-chars` | 4000 | Max chars per chunk (streaming) |
+| | `--streaming-overlap-chars` | 200 | Overlap between chunks (chars) |
 | | `--batch-size` | 100 | Chunks per batch |
-| | `--chunk` | | Enable RAG chunking |
+| | `--chunk-for-rag` | | Enable RAG chunking |
 | | `--chunk-tokens` | 512 | Max tokens per chunk |
 | | `--no-preserve-code` | | Allow code block splitting |
@@ -847,7 +847,7 @@ skill-seekers stream --config CONFIG [options]
 | Short | Long | Description |
 |-------|------|-------------|
 | `-c` | `--config` | Config JSON file |
-| | `--chunk-size` | Size of each chunk |
+| | `--streaming-chunk-chars` | Maximum characters per chunk (default: 4000) |
 | | `--output` | Output directory |
 **Examples:**
@@ -857,7 +857,7 @@ skill-seekers stream --config CONFIG [options]
 skill-seekers stream --config configs/large-docs.json
 # Custom chunk size
-skill-seekers stream --config configs/large-docs.json --chunk-size 1000
+skill-seekers stream --config configs/large-docs.json --streaming-chunk-chars 1000
 ```
 ---

View File

@@ -385,7 +385,7 @@ skill-seekers create <url> --max-pages 100
 skill-seekers create <url> --streaming
 # Or smaller chunks
-skill-seekers create <url> --chunk-size 500
+skill-seekers create <url> --chunk-tokens 500
 ```
 ---

View File

@@ -158,8 +158,8 @@ skill-seekers package output/large-skill/ --streaming
 # Custom chunk size
 skill-seekers package output/large-skill/ \
   --streaming \
-  --chunk-size 2000 \
-  --chunk-overlap 100
+  --streaming-chunk-chars 2000 \
+  --streaming-overlap-chars 100
 ```
 **When to use:**
@@ -177,23 +177,23 @@ Optimize for Retrieval-Augmented Generation:
 # Enable semantic chunking
 skill-seekers package output/my-skill/ \
   --target langchain \
-  --chunk \
+  --chunk-for-rag \
   --chunk-tokens 512
 # Custom chunk size
 skill-seekers package output/my-skill/ \
   --target chroma \
   --chunk-tokens 256 \
-  --chunk-overlap 50
+  --chunk-overlap-tokens 50
 ```
 **Chunking Options:**
 | Option | Default | Description |
 |--------|---------|-------------|
-| `--chunk` | auto | Enable chunking |
+| `--chunk-for-rag` | auto | Enable chunking |
 | `--chunk-tokens` | 512 | Tokens per chunk |
-| `--chunk-overlap` | 50 | Overlap between chunks |
+| `--chunk-overlap-tokens` | 50 | Overlap between chunks (tokens) |
 | `--no-preserve-code` | - | Allow splitting code blocks |
 ---
@@ -449,7 +449,7 @@ skill-seekers upload output/my-skill-claude.zip --target claude
 skill-seekers package output/my-skill/ --streaming
 # Smaller chunks
-skill-seekers package output/my-skill/ --streaming --chunk-size 1000
+skill-seekers package output/my-skill/ --streaming --streaming-chunk-chars 1000
 ```
 ---

View File

@@ -295,7 +295,7 @@ skill-seekers package output/my-skill/ --streaming
 # Reduce chunk size
 skill-seekers package output/my-skill/ \
   --streaming \
-  --chunk-size 1000
+  --streaming-chunk-chars 1000
 ```
 ---

View File

@@ -132,7 +132,7 @@ For better retrieval quality, use semantic chunking:
 ```bash
 # Generate with chunking
-skill-seekers scrape --config configs/react.json --max-pages 100 --chunk-for-rag --chunk-size 512 --chunk-overlap 50
+skill-seekers scrape --config configs/react.json --max-pages 100 --chunk-for-rag --chunk-tokens 512 --chunk-overlap-tokens 50
 # Use chunked output
 python quickstart.py --chunked

View File

@@ -6,7 +6,6 @@ applies_to:
   - doc_scraping
 variables:
   depth: comprehensive
-  alternatives: []
 stages:
   - name: feature_comparison
     type: custom

View File

@@ -164,5 +164,5 @@ post_process:
   add_metadata:
     enhanced: true
     workflow: data-validation
-    domain: ml
+    domain: backend
     has_validation_docs: true

View File

@@ -17,6 +17,46 @@ stages:
     target: examples
     enabled: true
     uses_history: false
+  - name: architecture_overview
+    type: custom
+    target: architecture
+    uses_history: false
+    enabled: true
+    prompt: >
+      Provide a concise architectural overview of this codebase.
+      Cover:
+      1. Overall architecture style (MVC, microservices, layered, etc.)
+      2. Key components and their responsibilities
+      3. Data flow between components
+      4. External dependencies and integrations
+      5. Entry points (CLI, API, web, etc.)
+      Output JSON with:
+      - "architecture_style": main architectural pattern
+      - "components": array of {name, responsibility}
+      - "data_flow": how data moves through the system
+      - "external_deps": third-party services and libraries
+      - "entry_points": how users interact with the system
+  - name: skill_polish
+    type: custom
+    target: skill_md
+    uses_history: true
+    enabled: true
+    prompt: >
+      Review the SKILL.md content generated so far and improve it.
+      Fix:
+      1. Unclear or overly technical descriptions
+      2. Missing quick-start examples
+      3. Gaps in the overview section
+      4. Redundant or duplicate information
+      5. Formatting inconsistencies
+      Output JSON with:
+      - "improved_overview": rewritten overview section
+      - "quick_start": concise getting-started snippet
+      - "key_concepts": 3-5 essential concepts a developer needs to know
 post_process:
   reorder_sections: []
   add_metadata:

View File

@@ -14,12 +14,17 @@ stages:
     uses_history: false
     enabled: true
     prompt: >
-      Review the following SKILL.md content and make minimal improvements:
-      - Fix obvious formatting issues
-      - Ensure the overview section is clear and concise
-      - Remove duplicate or redundant information
-      Return the improved content as plain text without extra commentary.
+      Review the SKILL.md content and make minimal targeted improvements.
+      Fix only:
+      1. Obvious formatting issues (broken lists, inconsistent headers)
+      2. Unclear overview section (make it one clear paragraph)
+      3. Duplicate or redundant information (remove repeats)
+      Output JSON with:
+      - "improved_overview": rewritten overview paragraph (plain markdown)
+      - "removed_sections": list of section names that were removed as duplicates
+      - "formatting_fixes": list of specific formatting issues corrected
 post_process:
   reorder_sections: []
   add_metadata:

View File

@@ -3,9 +3,7 @@ description: "Security-focused review: vulnerabilities, auth, data handling"
 version: "1.0"
 applies_to:
   - codebase_analysis
-  - python
-  - javascript
-  - typescript
+  - github_analysis
 variables:
   depth: comprehensive
 stages:

uv.lock generated
View File

@@ -5204,7 +5204,7 @@ wheels = [
 [[package]]
 name = "skill-seekers"
-version = "3.1.1"
+version = "3.1.2"
 source = { editable = "." }
 dependencies = [
     { name = "anthropic" },