docs: update all chunk flag names to match renamed CLI flags
Replace all occurrences of old ambiguous flag names with the new explicit ones:

  --chunk-size (tokens)   → --chunk-tokens
  --chunk-overlap         → --chunk-overlap-tokens
  --chunk                 → --chunk-for-rag
  --streaming-chunk-size  → --streaming-chunk-chars
  --streaming-overlap     → --streaming-overlap-chars
  --chunk-size (pages)    → --pdf-pages-per-chunk

Updated: CLI_REFERENCE (EN+ZH), user-guide (EN+ZH), integrations (Haystack, Chroma, Weaviate, FAISS, Qdrant), features/PDF_CHUNKING, examples/haystack-pipeline, strategy docs, archive docs, and CHANGELOG.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
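The rename is context-dependent: in the hunks of this commit, the old `--chunk-size` becomes a different flag depending on whether the command does PDF extraction, streaming packaging, or RAG chunking. A minimal sketch of that mapping follows. The names `RENAMES`, `detect_context`, and `migrate` are hypothetical and not part of skill-seekers, the context detection is a guess based on the example commands in the docs, and only space-separated flags (not `--flag=value`) are handled.

```python
# Hypothetical helper, not part of skill-seekers: rewrites old chunk flags
# in a single-line command string to the names introduced by this commit.
RENAMES = {
    # PDF extraction: the value is a page count.
    "pdf": {"--chunk-size": "--pdf-pages-per-chunk"},
    # Streaming packaging: values are character counts.
    "streaming": {
        "--chunk-size": "--streaming-chunk-chars",
        "--chunk-overlap": "--streaming-overlap-chars",
    },
    # RAG chunking (assumed default context): values are token counts.
    "rag": {
        "--chunk-size": "--chunk-tokens",
        "--chunk-overlap": "--chunk-overlap-tokens",
        "--chunk": "--chunk-for-rag",
    },
}


def detect_context(tokens):
    """Guess which meaning the old flags had (a heuristic, an assumption)."""
    if "pdf" in tokens[:2] or any(t.endswith("pdf_extractor_poc.py") for t in tokens):
        return "pdf"
    if "--streaming" in tokens or "stream" in tokens[:2]:
        return "streaming"
    return "rag"


def migrate(command):
    """Return the command with old flag names replaced by the new ones."""
    tokens = command.split()
    table = RENAMES[detect_context(tokens)]
    return " ".join(table.get(token, token) for token in tokens)


print(migrate("skill-seekers pdf large.pdf --chunk-size 50"))
print(migrate("skill-seekers package output/my-skill/ --streaming --chunk-size 1000"))
print(migrate("skill-seekers scrape --config configs/fastapi.json --chunk-for-rag --chunk-size 512"))
```

A command that combines `--streaming` with RAG flags would need manual review, since the heuristic picks a single context per command.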
@@ -71,7 +71,7 @@ skill-seekers pdf manual.pdf --name product-manual
 skill-seekers pdf scanned.pdf --enable-ocr

 # Large PDF (chunked processing)
-skill-seekers pdf large.pdf --chunk-size 50
+skill-seekers pdf large.pdf --pdf-pages-per-chunk 50
 ```

 ### Multi-Source Scraping
@@ -122,7 +122,7 @@ python3 cli/pdf_extractor_poc.py documentation.pdf \
 --extract-images \
 --min-image-size 150 \
 --min-quality 6.0 \
---chunk-size 20 \
+--pdf-pages-per-chunk 20 \
 --output documentation.json \
 --verbose \
 --pretty
@@ -477,7 +477,7 @@ python3 cli/pdf_extractor_poc.py manual.pdf \
 --image-dir assets/images/ \
 --min-image-size 200 \
 --min-quality 7.0 \
---chunk-size 15 \
+--pdf-pages-per-chunk 15 \
 --output manual.json \
 --verbose \
 --pretty
@@ -25,10 +25,10 @@ Break large PDFs into smaller, manageable chunks:
 python3 cli/pdf_extractor_poc.py input.pdf

 # Custom chunk size (20 pages per chunk)
-python3 cli/pdf_extractor_poc.py input.pdf --chunk-size 20
+python3 cli/pdf_extractor_poc.py input.pdf --pdf-pages-per-chunk 20

 # Disable chunking (single chunk with all pages)
-python3 cli/pdf_extractor_poc.py input.pdf --chunk-size 0
+python3 cli/pdf_extractor_poc.py input.pdf --pdf-pages-per-chunk 0
 ```

 ### ✅ 2. Chapter/Section Detection
@@ -272,7 +272,7 @@ cat manual.json | jq '.total_chunks'

 ```bash
 # Large PDF with bigger chunks (50 pages each)
-python3 cli/pdf_extractor_poc.py large_manual.pdf --chunk-size 50 -o output.json -v
+python3 cli/pdf_extractor_poc.py large_manual.pdf --pdf-pages-per-chunk 50 -o output.json -v

 # Verbose output shows:
 # 📦 Creating chunks (chunk_size=50)...
@@ -286,7 +286,7 @@ python3 cli/pdf_extractor_poc.py large_manual.pdf --chunk-size 50 -o output.json

 ```bash
 # Process all pages as single chunk
-python3 cli/pdf_extractor_poc.py small_doc.pdf --chunk-size 0 -o output.json
+python3 cli/pdf_extractor_poc.py small_doc.pdf --pdf-pages-per-chunk 0 -o output.json
 ```

 ---
@@ -369,7 +369,7 @@ Create a test PDF with chapters:
 3. Page 30: "Chapter 3: API Reference"

 ```bash
-python3 cli/pdf_extractor_poc.py test.pdf -o test.json --chunk-size 20 -v
+python3 cli/pdf_extractor_poc.py test.pdf -o test.json --pdf-pages-per-chunk 20 -v

 # Verify chapters detected
 cat test.json | jq '.chapters'
@@ -441,7 +441,7 @@ The chunking feature lays groundwork for:
 **Example workflow:**
 ```bash
 # Extract large manual with chapters
-python3 cli/pdf_extractor_poc.py large_manual.pdf --chunk-size 25 -o manual.json
+python3 cli/pdf_extractor_poc.py large_manual.pdf --pdf-pages-per-chunk 25 -o manual.json

 # Future: Build skill from chunks
 python3 cli/build_skill_from_pdf.py manual.json
@@ -223,7 +223,7 @@ skill-seekers package output/codebase --target langchain

 **Option D: RAG-Optimized Chunking**
 ```bash
-skill-seekers scrape --config configs/fastapi.json --chunk-for-rag --chunk-size 512
+skill-seekers scrape --config configs/fastapi.json --chunk-for-rag --chunk-tokens 512
 skill-seekers package output/fastapi --target langchain
 ```

@@ -968,7 +968,7 @@ collection.add(

 2. **Implement Semantic Chunking:**
 ```bash
-skill-seekers scrape --config configs/fastapi.json --chunk-for-rag --chunk-size 512
+skill-seekers scrape --config configs/fastapi.json --chunk-for-rag --chunk-tokens 512
 ```

 3. **Set Up Multi-Collection Search:**
@@ -255,7 +255,7 @@ skill-seekers package output/codebase --target langchain

 **Option D: RAG-Optimized Chunking**
 ```bash
-skill-seekers scrape --config configs/fastapi.json --chunk-for-rag --chunk-size 512
+skill-seekers scrape --config configs/fastapi.json --chunk-for-rag --chunk-tokens 512
 skill-seekers package output/fastapi --target langchain
 ```

@@ -318,8 +318,8 @@ print(response["llm"]["replies"][0])
 # Enable semantic chunking (preserves code blocks, respects paragraphs)
 skill-seekers scrape --config configs/django.json \
 --chunk-for-rag \
---chunk-size 512 \
---chunk-overlap 50
+--chunk-tokens 512 \
+--chunk-overlap-tokens 50

 # Package chunked output
 skill-seekers package output/django --target haystack
@@ -439,8 +439,8 @@ python scripts/merge_documents.py \
 # Enable chunking for frameworks with long pages
 skill-seekers scrape --config configs/django.json \
 --chunk-for-rag \
---chunk-size 512 \
---chunk-overlap 50
+--chunk-tokens 512 \
+--chunk-overlap-tokens 50
 ```

 ### 2. Choose Right Document Store
@@ -506,8 +506,8 @@ Complete example of building a FastAPI documentation chatbot:
 # Scrape FastAPI docs with chunking
 skill-seekers scrape --config configs/fastapi.json \
 --chunk-for-rag \
---chunk-size 512 \
---chunk-overlap 50 \
+--chunk-tokens 512 \
+--chunk-overlap-tokens 50 \
 --max-pages 200

 # Package for Haystack
@@ -698,8 +698,8 @@ skill-seekers scrape --config configs/fastapi.json --chunk-for-rag
 # 2. Adjust chunk size
 skill-seekers scrape --config configs/fastapi.json \
 --chunk-for-rag \
---chunk-size 768 \ # Larger chunks for more context
---chunk-overlap 100 # More overlap for continuity
+--chunk-tokens 768 \ # Larger chunks for more context
+--chunk-overlap-tokens 100 # More overlap for continuity

 # 3. Use hybrid search (BM25 + embeddings)
 # See Advanced Usage section
@@ -270,7 +270,7 @@ skill-seekers package output/codebase --target langchain

 **Option D: RAG-Optimized Chunking**
 ```bash
-skill-seekers scrape --config configs/fastapi.json --chunk-for-rag --chunk-size 512
+skill-seekers scrape --config configs/fastapi.json --chunk-for-rag --chunk-tokens 512
 skill-seekers package output/fastapi --target langchain
 ```

@@ -210,7 +210,7 @@ skill-seekers package output/codebase --target langchain

 **Option D: RAG-Optimized Chunking**
 ```bash
-skill-seekers scrape --config configs/fastapi.json --chunk-for-rag --chunk-size 512
+skill-seekers scrape --config configs/fastapi.json --chunk-for-rag --chunk-tokens 512
 skill-seekers package output/fastapi --target langchain
 ```

@@ -960,7 +960,7 @@ print(schema.get("multiTenancyConfig", {}).get("enabled")) # Should be True

 2. **Implement Semantic Chunking:**
 ```bash
-skill-seekers scrape --config configs/fastapi.json --chunk-for-rag --chunk-size 512
+skill-seekers scrape --config configs/fastapi.json --chunk-for-rag --chunk-tokens 512
 ```

 3. **Set Up Multi-Tenancy:**
@@ -252,8 +252,8 @@ skill-seekers create [source] [options]
 | | `--workflow-dry-run` | | Preview workflow without executing |
 | | `--dry-run` | | Preview without creating |
 | | `--chunk-for-rag` | | Enable RAG chunking |
-| | `--chunk-size` | 512 | Chunk size in tokens |
-| | `--chunk-overlap` | 50 | Chunk overlap in tokens |
+| | `--chunk-tokens` | 512 | Chunk size in tokens |
+| | `--chunk-overlap-tokens` | 50 | Chunk overlap in tokens |
 | | `--help-web` | | Show web scraping options |
 | | `--help-github` | | Show GitHub options |
 | | `--help-local` | | Show local analysis options |
@@ -615,10 +615,10 @@ skill-seekers package SKILL_DIRECTORY [options]
 | | `--skip-quality-check` | | Skip quality checks |
 | | `--upload` | | Auto-upload after packaging |
 | | `--streaming` | | Streaming mode for large docs |
-| | `--chunk-size` | 4000 | Max chars per chunk (streaming) |
-| | `--chunk-overlap` | 200 | Overlap between chunks |
+| | `--streaming-chunk-chars` | 4000 | Max chars per chunk (streaming) |
+| | `--streaming-overlap-chars` | 200 | Overlap between chunks (chars) |
 | | `--batch-size` | 100 | Chunks per batch |
-| | `--chunk` | | Enable RAG chunking |
+| | `--chunk-for-rag` | | Enable RAG chunking |
 | | `--chunk-tokens` | 512 | Max tokens per chunk |
 | | `--no-preserve-code` | | Allow code block splitting |

@@ -877,7 +877,7 @@ skill-seekers stream --config CONFIG [options]
 | Short | Long | Description |
 |-------|------|-------------|
 | `-c` | `--config` | Config JSON file |
-| | `--chunk-size` | Size of each chunk |
+| | `--streaming-chunk-chars` | Maximum characters per chunk (default: 4000) |
 | | `--output` | Output directory |

 **Examples:**
@@ -887,7 +887,7 @@ skill-seekers stream --config CONFIG [options]
 skill-seekers stream --config configs/large-docs.json

 # Custom chunk size
-skill-seekers stream --config configs/large-docs.json --chunk-size 1000
+skill-seekers stream --config configs/large-docs.json --streaming-chunk-chars 1000
 ```

 ---
@@ -365,8 +365,8 @@ Position Skill Seekers as **the universal documentation preprocessor** for the e
 2. **Implement Chunking for RAG** (8-12 hours)
 ```bash
 skill-seekers scrape --chunk-for-rag \
---chunk-size 512 \
---chunk-overlap 50 \
+--chunk-tokens 512 \
+--chunk-overlap-tokens 50 \
 --preserve-code-blocks
 ```

@@ -139,8 +139,8 @@ skill-seekers scrape --format confluence # Confluence storage format
 ```bash
 # New flag for embedding-optimized chunking
 skill-seekers scrape --chunk-for-rag \
---chunk-size 512 \
---chunk-overlap 50 \
+--chunk-tokens 512 \
+--chunk-overlap-tokens 50 \
 --add-metadata

 # Output: chunks with metadata for embedding
@@ -385,7 +385,7 @@ skill-seekers create <url> --max-pages 100
 skill-seekers create <url> --streaming

 # Or smaller chunks
-skill-seekers create <url> --chunk-size 500
+skill-seekers create <url> --chunk-tokens 500
 ```

 ---
@@ -158,8 +158,8 @@ skill-seekers package output/large-skill/ --streaming
 # Custom chunk size
 skill-seekers package output/large-skill/ \
 --streaming \
---chunk-size 2000 \
---chunk-overlap 100
+--streaming-chunk-chars 2000 \
+--streaming-overlap-chars 100
 ```

 **When to use:**
@@ -177,23 +177,23 @@ Optimize for Retrieval-Augmented Generation:
 # Enable semantic chunking
 skill-seekers package output/my-skill/ \
 --target langchain \
---chunk \
+--chunk-for-rag \
 --chunk-tokens 512

 # Custom chunk size
 skill-seekers package output/my-skill/ \
 --target chroma \
 --chunk-tokens 256 \
---chunk-overlap 50
+--chunk-overlap-tokens 50
 ```

 **Chunking Options:**

 | Option | Default | Description |
 |--------|---------|-------------|
-| `--chunk` | auto | Enable chunking |
+| `--chunk-for-rag` | auto | Enable chunking |
 | `--chunk-tokens` | 512 | Tokens per chunk |
-| `--chunk-overlap` | 50 | Overlap between chunks |
+| `--chunk-overlap-tokens` | 50 | Overlap between chunks (tokens) |
 | `--no-preserve-code` | - | Allow splitting code blocks |

 ---
@@ -449,7 +449,7 @@ skill-seekers upload output/my-skill-claude.zip --target claude
 skill-seekers package output/my-skill/ --streaming

 # Smaller chunks
-skill-seekers package output/my-skill/ --streaming --chunk-size 1000
+skill-seekers package output/my-skill/ --streaming --streaming-chunk-chars 1000
 ```

 ---
@@ -295,7 +295,7 @@ skill-seekers package output/my-skill/ --streaming
 # Reduce chunk size
 skill-seekers package output/my-skill/ \
 --streaming \
---chunk-size 1000
+--streaming-chunk-chars 1000
 ```

 ---
@@ -237,8 +237,8 @@ skill-seekers create [source] [options]
 | | `--workflow-dry-run` | | Preview workflow without executing |
 | | `--dry-run` | | Preview without creating |
 | | `--chunk-for-rag` | | Enable RAG chunking |
-| | `--chunk-size` | 512 | Chunk size in tokens |
-| | `--chunk-overlap` | 50 | Chunk overlap in tokens |
+| | `--chunk-tokens` | 512 | Chunk size in tokens |
+| | `--chunk-overlap-tokens` | 50 | Chunk overlap in tokens |
 | | `--help-web` | | Show web scraping options |
 | | `--help-github` | | Show GitHub options |
 | | `--help-local` | | Show local analysis options |
@@ -593,10 +593,10 @@ skill-seekers package SKILL_DIRECTORY [options]
 | | `--skip-quality-check` | | Skip quality checks |
 | | `--upload` | | Auto-upload after packaging |
 | | `--streaming` | | Streaming mode for large docs |
-| | `--chunk-size` | 4000 | Max chars per chunk (streaming) |
-| | `--chunk-overlap` | 200 | Overlap between chunks |
+| | `--streaming-chunk-chars` | 4000 | Max chars per chunk (streaming) |
+| | `--streaming-overlap-chars` | 200 | Overlap between chunks (chars) |
 | | `--batch-size` | 100 | Chunks per batch |
-| | `--chunk` | | Enable RAG chunking |
+| | `--chunk-for-rag` | | Enable RAG chunking |
 | | `--chunk-tokens` | 512 | Max tokens per chunk |
 | | `--no-preserve-code` | | Allow code block splitting |

@@ -847,7 +847,7 @@ skill-seekers stream --config CONFIG [options]
 | Short | Long | Description |
 |-------|------|-------------|
 | `-c` | `--config` | Config JSON file |
-| | `--chunk-size` | Size of each chunk |
+| | `--streaming-chunk-chars` | Maximum characters per chunk (default: 4000) |
 | | `--output` | Output directory |

 **Examples:**
@@ -857,7 +857,7 @@ skill-seekers stream --config CONFIG [options]
 skill-seekers stream --config configs/large-docs.json

 # Custom chunk size
-skill-seekers stream --config configs/large-docs.json --chunk-size 1000
+skill-seekers stream --config configs/large-docs.json --streaming-chunk-chars 1000
 ```

 ---
@@ -385,7 +385,7 @@ skill-seekers create <url> --max-pages 100
 skill-seekers create <url> --streaming

 # Or smaller chunks
-skill-seekers create <url> --chunk-size 500
+skill-seekers create <url> --chunk-tokens 500
 ```

 ---
@@ -158,8 +158,8 @@ skill-seekers package output/large-skill/ --streaming
 # Custom chunk size
 skill-seekers package output/large-skill/ \
 --streaming \
---chunk-size 2000 \
---chunk-overlap 100
+--streaming-chunk-chars 2000 \
+--streaming-overlap-chars 100
 ```

 **When to use:**
@@ -177,23 +177,23 @@ Optimize for Retrieval-Augmented Generation:
 # Enable semantic chunking
 skill-seekers package output/my-skill/ \
 --target langchain \
---chunk \
+--chunk-for-rag \
 --chunk-tokens 512

 # Custom chunk size
 skill-seekers package output/my-skill/ \
 --target chroma \
 --chunk-tokens 256 \
---chunk-overlap 50
+--chunk-overlap-tokens 50
 ```

 **Chunking Options:**

 | Option | Default | Description |
 |--------|---------|-------------|
-| `--chunk` | auto | Enable chunking |
+| `--chunk-for-rag` | auto | Enable chunking |
 | `--chunk-tokens` | 512 | Tokens per chunk |
-| `--chunk-overlap` | 50 | Overlap between chunks |
+| `--chunk-overlap-tokens` | 50 | Overlap between chunks (tokens) |
 | `--no-preserve-code` | - | Allow splitting code blocks |

 ---
@@ -449,7 +449,7 @@ skill-seekers upload output/my-skill-claude.zip --target claude
 skill-seekers package output/my-skill/ --streaming

 # Smaller chunks
-skill-seekers package output/my-skill/ --streaming --chunk-size 1000
+skill-seekers package output/my-skill/ --streaming --streaming-chunk-chars 1000
 ```

 ---
@@ -295,7 +295,7 @@ skill-seekers package output/my-skill/ --streaming
 # Reduce chunk size
 skill-seekers package output/my-skill/ \
 --streaming \
---chunk-size 1000
+--streaming-chunk-chars 1000
 ```

 ---