
Haystack Pipeline Example

Complete example showing how to use Skill Seekers with Haystack 2.x for building RAG pipelines.

What This Example Does

  • Converts documentation into Haystack Documents
  • Creates an in-memory document store
  • Builds a BM25 retriever for keyword search
  • Shows a complete RAG pipeline workflow
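
BM25 ranks documents by query-term overlap, weighted by term rarity and document length. As a rough intuition for what the retriever below computes, here is a minimal pure-Python sketch of the BM25 scoring formula (an illustration only, not Haystack's implementation):

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with the BM25 ranking formula."""
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in tokenized) / len(tokenized)
    n = len(docs)
    terms = query.lower().split()
    # Document frequency per query term
    df = {t: sum(1 for doc in tokenized if t in doc) for t in set(terms)}
    scores = []
    for doc in tokenized:
        freqs = Counter(doc)
        s = 0.0
        for t in terms:
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            f = freqs[t]
            s += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(s)
    return scores

docs = [
    "useState adds state to function components",
    "react is a javascript library",
]
# Doc 0 scores higher: it contains both query terms
print(bm25_scores("useState state", docs))
```

Because scoring is purely lexical, a query term that appears nowhere in a document contributes nothing to its score, which is why chunking and embeddings (covered under Advanced Usage) can improve recall.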

Prerequisites

# Install Skill Seekers
pip install skill-seekers

# Install Haystack 2.x
pip install haystack-ai

Quick Start

1. Generate React Documentation Skill

# Scrape React documentation
skill-seekers scrape --config configs/react.json --max-pages 100

# Package for Haystack
skill-seekers package output/react --target haystack

This creates output/react-haystack.json containing Haystack-ready document records (content plus metadata).
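
It helps to know the file's shape before loading it. The sketch below is inferred from the loading code later in this example (a JSON array of objects with "content" and "meta" keys); the field values are hypothetical:

```python
import json

# Hypothetical record, matching what the quickstart's loader expects
sample = [
    {
        "content": "# React Hooks\n\nThe useState Hook lets you add state...",
        "meta": {"file": "hooks.md", "category": "hooks"},
    }
]

def validate(docs_data):
    """Check every record carries the keys the quickstart reads."""
    return all("content" in d and "meta" in d for d in docs_data)

# Round-trip through JSON to mimic loading from disk
print(validate(json.loads(json.dumps(sample))))  # True
```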

2. Run the Pipeline

# Run the example script
python quickstart.py

How It Works

Step 1: Load Documents

from haystack import Document
import json

# Load Haystack documents
with open("../../output/react-haystack.json") as f:
    docs_data = json.load(f)

documents = [
    Document(content=doc["content"], meta=doc["meta"])
    for doc in docs_data
]

print(f"📚 Loaded {len(documents)} documents")

Step 2: Create Document Store

from haystack.document_stores.in_memory import InMemoryDocumentStore

# Create in-memory store
document_store = InMemoryDocumentStore()
document_store.write_documents(documents)

print(f"💾 Indexed {document_store.count_documents()} documents")

Step 3: Build Retriever

from haystack.components.retrievers.in_memory import InMemoryBM25Retriever

# Create BM25 retriever
retriever = InMemoryBM25Retriever(document_store=document_store)

# Query
results = retriever.run(
    query="How do I use useState hook?",
    top_k=3
)

# Display results
for doc in results["documents"]:
    print(f"\n📖 Source: {doc.meta.get('file', 'unknown')}")
    print(f"   Category: {doc.meta.get('category', 'unknown')}")
    print(f"   Preview: {doc.content[:200]}...")

Expected Output

📚 Loaded 15 documents
💾 Indexed 15 documents

🔍 Query: How do I use useState hook?

📖 Source: hooks.md
   Category: hooks
   Preview: # React Hooks

React Hooks are functions that let you "hook into" React state and lifecycle features from function components.

## useState

The useState Hook lets you add React state to function components...

📖 Source: getting_started.md
   Category: getting started
   Preview: # Getting Started with React

React is a JavaScript library for building user interfaces...

📖 Source: best_practices.md
   Category: best practices
   Preview: # React Best Practices

When working with Hooks...

Advanced Usage

With RAG Chunking

For better retrieval quality, use semantic chunking:

# Generate with chunking
skill-seekers scrape --config configs/react.json --max-pages 100 --chunk-for-rag --chunk-tokens 512 --chunk-overlap-tokens 50

# Use chunked output
python quickstart.py --chunked
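
With chunking enabled, several retrieved chunks may come from the same source file. A common post-processing step is to keep only the top-ranked chunk per source. This pure-Python sketch uses the "file" meta key from the quickstart; the "chunk_index" field name is an assumption for illustration:

```python
from collections import OrderedDict

# Hypothetical retrieval results, ordered by rank (best first)
retrieved = [
    {"content": "useState lets you...", "meta": {"file": "hooks.md", "chunk_index": 2}},
    {"content": "## useState\n...", "meta": {"file": "hooks.md", "chunk_index": 3}},
    {"content": "React is a library", "meta": {"file": "getting_started.md", "chunk_index": 0}},
]

def dedupe_by_source(chunks):
    """Keep only the highest-ranked chunk per source file (input order = rank)."""
    seen = OrderedDict()
    for c in chunks:
        seen.setdefault(c["meta"]["file"], c)
    return list(seen.values())

print([c["meta"]["file"] for c in dedupe_by_source(retrieved)])
# ['hooks.md', 'getting_started.md']
```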

With Vector Embeddings

For semantic search instead of BM25:

from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever

# Create a fresh in-memory store for the embedded documents
document_store = InMemoryDocumentStore()

# Embed documents
embedder = SentenceTransformersDocumentEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2"
)
embedder.warm_up()

# Process documents
docs_with_embeddings = embedder.run(documents)
document_store.write_documents(docs_with_embeddings["documents"])

# Create embedding retriever
retriever = InMemoryEmbeddingRetriever(document_store=document_store)

# Query (requires query embedding)
from haystack.components.embedders import SentenceTransformersTextEmbedder

query_embedder = SentenceTransformersTextEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2"
)
query_embedder.warm_up()

query_embedding = query_embedder.run("How do I use useState?")

results = retriever.run(
    query_embedding=query_embedding["embedding"],
    top_k=3
)

Building Complete RAG Pipeline

For question answering with LLMs:

from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator

# Create RAG pipeline
rag_pipeline = Pipeline()

# Add components
rag_pipeline.add_component("retriever", retriever)
rag_pipeline.add_component("prompt_builder", PromptBuilder(
    template="""
    Based on the following context, answer the question.

    Context:
    {% for doc in documents %}
    {{ doc.content }}
    {% endfor %}

    Question: {{ question }}

    Answer:
    """
))
rag_pipeline.add_component("llm", OpenAIGenerator())  # reads OPENAI_API_KEY from the environment

# Connect components
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")

# Run pipeline
response = rag_pipeline.run({
    "retriever": {"query": "How do I use useState?"},
    "prompt_builder": {"question": "How do I use useState?"}
})

print(response["llm"]["replies"][0])

Files in This Example

  • README.md - This file
  • quickstart.py - Basic BM25 retrieval pipeline
  • requirements.txt - Python dependencies

Troubleshooting

Issue: ModuleNotFoundError: No module named 'haystack'

Solution: Install Haystack 2.x

pip install haystack-ai

Issue: Documents not found

Solution: Run scraping first

skill-seekers scrape --config configs/react.json
skill-seekers package output/react --target haystack

Issue: Poor retrieval quality

Solution: Use semantic chunking or vector embeddings

# Semantic chunking
skill-seekers scrape --config configs/react.json --chunk-for-rag

# Or use vector embeddings (see Advanced Usage)

Next Steps

  1. Try different documentation sources (Django, FastAPI, etc.)
  2. Experiment with vector embeddings for semantic search
  3. Build complete RAG pipeline with LLM generation
  4. Deploy to production with persistent document stores

Resources