Files
yusyus 73adda0b17 docs: update all chunk flag names to match renamed CLI flags
Replace all occurrences of old ambiguous flag names with the new explicit ones:
  --chunk-size (tokens)  → --chunk-tokens
  --chunk-overlap        → --chunk-overlap-tokens
  --chunk                → --chunk-for-rag
  --streaming-chunk-size → --streaming-chunk-chars
  --streaming-overlap    → --streaming-overlap-chars
  --chunk-size (pages)   → --pdf-pages-per-chunk

Updated: CLI_REFERENCE (EN+ZH), user-guide (EN+ZH), integrations (Haystack,
Chroma, Weaviate, FAISS, Qdrant), features/PDF_CHUNKING, examples/haystack-pipeline,
strategy docs, archive docs, and CHANGELOG.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-24 22:15:14 +03:00

21 KiB

Using Skill Seekers with Haystack

Last Updated: February 7, 2026 Status: Production Ready Difficulty: Easy


🎯 The Problem

Building RAG (Retrieval-Augmented Generation) applications with Haystack requires high-quality, structured documentation for your document stores and pipelines. Manually scraping and preparing documentation is:

  • Time-Consuming - Hours spent scraping docs, formatting, and structuring
  • Error-Prone - Inconsistent formatting, missing metadata, broken references
  • Not Scalable - Multi-language docs and large frameworks are overwhelming

Example:

"When building an enterprise RAG system for FastAPI documentation with Haystack, you need to scrape 300+ pages, structure them with proper metadata, and prepare for multi-language search. This typically takes 6-8 hours of manual work."


The Solution

Use Skill Seekers as essential preprocessing before Haystack:

  1. Generate Haystack Documents from any documentation source
  2. Pre-structured with metadata following Haystack 2.x format
  3. Ready for document stores (InMemoryDocumentStore, Elasticsearch, Weaviate)
  4. One command - scrape, structure, format in minutes

Result: Skill Seekers outputs JSON files with Haystack Document format (content + meta), ready to load directly into your Haystack pipelines.


🚀 Quick Start (5 Minutes)

Prerequisites

  • Python 3.10+
  • Haystack 2.x installed: pip install haystack-ai
  • Optional: Embeddings library (e.g., sentence-transformers)

Installation

# Install Skill Seekers
pip install skill-seekers

# Verify installation
skill-seekers --version

Generate Haystack Documents

# Example: Django framework documentation
skill-seekers scrape --config configs/django.json

# Package as Haystack Documents
skill-seekers package output/django --target haystack

# Output: output/django-haystack.json

Load into Haystack

from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
import json

# Load documents
with open("output/django-haystack.json") as f:
    docs_data = json.load(f)

# Convert to Haystack Documents
documents = [
    Document(content=doc["content"], meta=doc["meta"])
    for doc in docs_data
]

print(f"Loaded {len(documents)} documents")

# Create document store
document_store = InMemoryDocumentStore()
document_store.write_documents(documents)

# Create retriever
retriever = InMemoryBM25Retriever(document_store=document_store)

# Query
results = retriever.run(query="How do I create Django models?", top_k=3)
for doc in results["documents"]:
    print(f"\n{doc.meta['category']}: {doc.content[:200]}...")

📖 Detailed Setup Guide

Step 1: Choose Your Documentation Source

Skill Seekers supports multiple documentation sources:

# Official framework documentation
skill-seekers scrape --config configs/fastapi.json

# GitHub repository
skill-seekers github --repo tiangolo/fastapi

# PDF documentation
skill-seekers pdf --file docs/manual.pdf

# Combine multiple sources
skill-seekers unified \
  --docs https://fastapi.tiangolo.com/ \
  --github tiangolo/fastapi \
  --output output/fastapi-complete

Step 2: Configure Scraping (Optional)

Create a custom config for your documentation:

{
  "name": "my-framework",
  "base_url": "https://docs.example.com/",
  "selectors": {
    "main_content": "article.documentation",
    "title": "h1.page-title",
    "code_blocks": "pre code"
  },
  "categories": {
    "getting_started": ["intro", "quickstart", "installation"],
    "guides": ["tutorial", "guide", "howto"],
    "api": ["api", "reference"]
  },
  "max_pages": 500,
  "rate_limit": 0.5
}

Save as configs/my-framework.json and use:

skill-seekers scrape --config configs/my-framework.json

Step 3: Package for Haystack

# Generate Haystack Documents
skill-seekers package output/my-framework --target haystack

# With semantic chunking for better retrieval
skill-seekers scrape --config configs/my-framework.json --chunk-for-rag
skill-seekers package output/my-framework --target haystack

# Output files:
# - output/my-framework-haystack.json (Haystack Documents)
# - output/my-framework/rag_chunks.json (if chunking enabled)

Step 4: Load into Haystack Pipeline

Option A: InMemoryDocumentStore (Development)

from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
import json

# Load documents
with open("output/my-framework-haystack.json") as f:
    docs_data = json.load(f)

documents = [
    Document(content=doc["content"], meta=doc["meta"])
    for doc in docs_data
]

# Create in-memory store
document_store = InMemoryDocumentStore()
document_store.write_documents(documents)

# Create BM25 retriever
retriever = InMemoryBM25Retriever(document_store=document_store)

# Query
results = retriever.run(query="your question", top_k=5)

Option B: Elasticsearch (Production)

from haystack import Document
from haystack.document_stores.elasticsearch import ElasticsearchDocumentStore
from haystack.components.retrievers.elasticsearch import ElasticsearchBM25Retriever
import json

# Connect to Elasticsearch
document_store = ElasticsearchDocumentStore(
    hosts=["http://localhost:9200"],
    index="my-framework-docs"
)

# Load and write documents
with open("output/my-framework-haystack.json") as f:
    docs_data = json.load(f)

documents = [
    Document(content=doc["content"], meta=doc["meta"])
    for doc in docs_data
]

document_store.write_documents(documents)

# Create retriever
retriever = ElasticsearchBM25Retriever(document_store=document_store)

Option C: Weaviate (Hybrid Search)

from haystack import Document
from haystack.document_stores.weaviate import WeaviateDocumentStore
from haystack.components.retrievers.weaviate import WeaviateHybridRetriever
import json

# Connect to Weaviate
document_store = WeaviateDocumentStore(
    host="http://localhost:8080",
    index="MyFrameworkDocs"
)

# Load documents
with open("output/my-framework-haystack.json") as f:
    docs_data = json.load(f)

documents = [
    Document(content=doc["content"], meta=doc["meta"])
    for doc in docs_data
]

# Write with embeddings
from haystack.components.embedders import SentenceTransformersDocumentEmbedder

embedder = SentenceTransformersDocumentEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2"
)
embedder.warm_up()

docs_with_embeddings = embedder.run(documents)
document_store.write_documents(docs_with_embeddings["documents"])

# Create hybrid retriever (BM25 + vector)
retriever = WeaviateHybridRetriever(document_store=document_store)

Step 5: Build RAG Pipeline

from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator

# Create RAG pipeline
rag_pipeline = Pipeline()

# Add components
rag_pipeline.add_component("retriever", retriever)
rag_pipeline.add_component(
    "prompt_builder",
    PromptBuilder(
        template="""
        Based on the following documentation, answer the question.

        Documentation:
        {% for doc in documents %}
        {{ doc.content }}
        {% endfor %}

        Question: {{ question }}

        Answer:
        """
    )
)
rag_pipeline.add_component(
    "llm",
    OpenAIGenerator(api_key=os.getenv("OPENAI_API_KEY"))
)

# Connect components
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")

# Run pipeline
response = rag_pipeline.run({
    "retriever": {"query": "How do I deploy my app?"},
    "prompt_builder": {"question": "How do I deploy my app?"}
})

print(response["llm"]["replies"][0])

🔥 Advanced Usage

Semantic Chunking for Better Retrieval

# Enable semantic chunking (preserves code blocks, respects paragraphs)
skill-seekers scrape --config configs/django.json \
  --chunk-for-rag \
  --chunk-tokens 512 \
  --chunk-overlap-tokens 50

# Package chunked output
skill-seekers package output/django --target haystack

# Result: Smaller, more focused documents for better retrieval

Multi-Source RAG System

# Combine official docs + GitHub issues + PDF guides
skill-seekers unified \
  --docs https://docs.example.com/ \
  --github owner/repo \
  --pdf guides/*.pdf \
  --output output/complete-knowledge

skill-seekers package output/complete-knowledge --target haystack

# Detect conflicts between sources
skill-seekers detect-conflicts output/complete-knowledge

Custom Metadata for Filtering

Haystack Documents include rich metadata for filtering:

# Query with metadata filters
from haystack.dataclasses import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Filter by category
results = retriever.run(
    query="deployment",
    top_k=5,
    filters={"field": "category", "operator": "==", "value": "guides"}
)

# Filter by version
results = retriever.run(
    query="api reference",
    filters={"field": "version", "operator": "==", "value": "2.0"}
)

# Multiple filters
results = retriever.run(
    query="authentication",
    filters={
        "operator": "AND",
        "conditions": [
            {"field": "category", "operator": "==", "value": "api"},
            {"field": "type", "operator": "==", "value": "reference"}
        ]
    }
)

Embedding-Based Retrieval

from haystack.components.embedders import (
    SentenceTransformersDocumentEmbedder,
    SentenceTransformersTextEmbedder
)
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever

# Embed documents
doc_embedder = SentenceTransformersDocumentEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2"
)
doc_embedder.warm_up()

docs_with_embeddings = doc_embedder.run(documents)
document_store.write_documents(docs_with_embeddings["documents"])

# Create embedding retriever
text_embedder = SentenceTransformersTextEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2"
)
text_embedder.warm_up()

retriever = InMemoryEmbeddingRetriever(document_store=document_store)

# Query with embeddings
query_embedding = text_embedder.run("How do I deploy?")
results = retriever.run(
    query_embedding=query_embedding["embedding"],
    top_k=5
)

Incremental Updates

# Initial scrape
skill-seekers scrape --config configs/fastapi.json

# Later: Update only changed pages
skill-seekers scrape --config configs/fastapi.json --skip-existing

# Merge with existing documents
python scripts/merge_documents.py \
  output/fastapi-haystack.json \
  output/fastapi-haystack-new.json

Best Practices

1. Use Semantic Chunking for Large Docs

Why: Better retrieval quality, more focused results

# Enable chunking for frameworks with long pages
skill-seekers scrape --config configs/django.json \
  --chunk-for-rag \
  --chunk-tokens 512 \
  --chunk-overlap-tokens 50

2. Choose Right Document Store

Development:

  • InMemoryDocumentStore - Fast, no setup

Production:

  • Elasticsearch - Full-text search, scalable
  • Weaviate - Hybrid search (BM25 + vector), multi-modal
  • Qdrant - High-performance vector search
  • Opensearch - AWS-managed, cost-effective

3. Add Metadata Filters

# Always include category in queries for faster results
results = retriever.run(
    query="database models",
    filters={"field": "category", "operator": "==", "value": "guides"}
)

4. Monitor Retrieval Quality

# Test queries and verify relevance
test_queries = [
    "How do I create a model?",
    "What is the deployment process?",
    "How to handle authentication?"
]

for query in test_queries:
    results = retriever.run(query=query, top_k=3)
    print(f"\nQuery: {query}")
    for i, doc in enumerate(results["documents"], 1):
        print(f"{i}. {doc.meta['file']} - {doc.meta['category']}")

5. Version Your Documentation

# Include version in metadata
skill-seekers scrape --config configs/django.json --metadata version=4.2

# Query specific versions
results = retriever.run(
    query="middleware",
    filters={"field": "version", "operator": "==", "value": "4.2"}
)

💼 Real-World Example: FastAPI RAG Chatbot

Complete example of building a FastAPI documentation chatbot:

Step 1: Generate Documentation

# Scrape FastAPI docs with chunking
skill-seekers scrape --config configs/fastapi.json \
  --chunk-for-rag \
  --chunk-tokens 512 \
  --chunk-overlap-tokens 50 \
  --max-pages 200

# Package for Haystack
skill-seekers package output/fastapi --target haystack

Step 2: Setup Haystack Pipeline

from haystack import Pipeline, Document
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
import json
import os

# Load documents
with open("output/fastapi-haystack.json") as f:
    docs_data = json.load(f)

documents = [
    Document(content=doc["content"], meta=doc["meta"])
    for doc in docs_data
]

print(f"Loaded {len(documents)} FastAPI documentation chunks")

# Create document store
document_store = InMemoryDocumentStore()
document_store.write_documents(documents)
print(f"Indexed {document_store.count_documents()} documents")

# Build RAG pipeline
rag = Pipeline()

# Add components
rag.add_component(
    "retriever",
    InMemoryBM25Retriever(document_store=document_store)
)

rag.add_component(
    "prompt",
    PromptBuilder(
        template="""
        You are a FastAPI expert assistant. Answer the question based on the documentation below.

        Documentation:
        {% for doc in documents %}
        ---
        Source: {{ doc.meta.file }}
        Category: {{ doc.meta.category }}

        {{ doc.content }}
        {% endfor %}

        Question: {{ question }}

        Provide a clear, code-focused answer with examples when relevant.
        """
    )
)

rag.add_component(
    "llm",
    OpenAIGenerator(
        api_key=os.getenv("OPENAI_API_KEY"),
        model="gpt-4"
    )
)

# Connect pipeline
rag.connect("retriever.documents", "prompt.documents")
rag.connect("prompt.prompt", "llm.prompt")

print("Pipeline ready!")

Step 3: Interactive Chat

def ask_fastapi(question: str, top_k: int = 5):
    """Ask a question about FastAPI."""
    response = rag.run({
        "retriever": {"query": question, "top_k": top_k},
        "prompt": {"question": question}
    })

    answer = response["llm"]["replies"][0]
    print(f"\nQuestion: {question}\n")
    print(f"Answer: {answer}\n")

    # Show sources
    docs = response["retriever"]["documents"]
    print("Sources:")
    for doc in docs:
        print(f"  - {doc.meta['file']} ({doc.meta['category']})")

# Example usage
ask_fastapi("How do I create a REST API endpoint?")
ask_fastapi("What is dependency injection in FastAPI?")
ask_fastapi("How do I handle file uploads?")

Step 4: Deploy with FastAPI

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Question(BaseModel):
    text: str
    top_k: int = 5

@app.post("/ask")
async def ask_question(question: Question):
    """Ask a question about FastAPI documentation."""
    response = rag.run({
        "retriever": {"query": question.text, "top_k": question.top_k},
        "prompt": {"question": question.text}
    })

    return {
        "question": question.text,
        "answer": response["llm"]["replies"][0],
        "sources": [
            {
                "file": doc.meta["file"],
                "category": doc.meta["category"],
                "content_preview": doc.content[:200]
            }
            for doc in response["retriever"]["documents"]
        ]
    }

# Run: uvicorn chatbot:app --reload
# Test: curl -X POST http://localhost:8000/ask \
#   -H "Content-Type: application/json" \
#   -d '{"text": "How do I use async functions?"}'

Result:

  • 200 documentation pages → 450 optimized chunks
  • Sub-second retrieval with BM25
  • Context-aware answers from GPT-4
  • Source attribution for every answer
  • REST API for integration

🔧 Troubleshooting

Issue: Documents not loading correctly

Symptoms: Empty content, missing metadata

Solutions:

# Verify JSON structure
jq '.[0]' output/fastapi-haystack.json

# Should show:
# {
#   "content": "...",
#   "meta": {
#     "source": "fastapi",
#     "category": "...",
#     ...
#   }
# }

# Regenerate if malformed
skill-seekers package output/fastapi --target haystack --force

Issue: Poor retrieval quality

Symptoms: Irrelevant results, missed relevant docs

Solutions:

# 1. Enable semantic chunking
skill-seekers scrape --config configs/fastapi.json --chunk-for-rag

# 2. Adjust chunk size
skill-seekers scrape --config configs/fastapi.json \
  --chunk-for-rag \
  --chunk-tokens 768 \  # Larger chunks for more context
  --chunk-overlap-tokens 100  # More overlap for continuity

# 3. Use hybrid search (BM25 + embeddings)
# See Advanced Usage section

Issue: OutOfMemoryError with large docs

Symptoms: Crash when loading thousands of documents

Solutions:

# Load documents in batches
import json

def load_documents_batched(file_path, batch_size=100):
    with open(file_path) as f:
        docs_data = json.load(f)

    for i in range(0, len(docs_data), batch_size):
        batch = docs_data[i:i+batch_size]
        documents = [
            Document(content=doc["content"], meta=doc["meta"])
            for doc in batch
        ]
        document_store.write_documents(documents)
        print(f"Loaded batch {i//batch_size + 1}")

load_documents_batched("output/large-framework-haystack.json")

Issue: Haystack version compatibility

Symptoms: Import errors, method not found

Solutions:

# Check Haystack version
pip show haystack-ai

# Skill Seekers requires Haystack 2.x
pip install --upgrade "haystack-ai>=2.0.0"

# For Haystack 1.x (legacy), use markdown export instead:
skill-seekers package output/framework --target markdown

Issue: Slow query performance

Symptoms: Queries take >2 seconds

Solutions:

# 1. Reduce top_k
results = retriever.run(query="...", top_k=3)  # Instead of 10

# 2. Add metadata filters
results = retriever.run(
    query="...",
    filters={"field": "category", "operator": "==", "value": "api"}
)

# 3. Use InMemoryDocumentStore for development
# Switch to Elasticsearch for production scale

📊 Before vs After

Aspect Before Skill Seekers After Skill Seekers
Setup Time 6-8 hours manual scraping 5 minutes automated
Documentation Quality Inconsistent, missing metadata Structured with rich metadata
Chunking Manual, error-prone Semantic, code-preserving
Updates Re-scrape everything Incremental updates
Multi-source Complex custom scripts One unified command
Format Custom JSON hacking Native Haystack Documents
Retrieval Quality Poor (large chunks, no metadata) Excellent (optimized chunks, filters)
Maintenance High (scripts break) Low (one tool, well-tested)

🎓 Next Steps

Try These Examples

  1. Build a chatbot - Follow the FastAPI example above
  2. Multi-language search - Scrape docs in multiple languages
  3. Hybrid retrieval - Combine BM25 + embeddings (see Advanced Usage)
  4. Production deployment - Use Elasticsearch or Weaviate

Explore More Integrations

Learn More


🤝 Support


Ready to build production RAG with Haystack?

pip install skill-seekers haystack-ai
skill-seekers scrape --config configs/your-framework.json --chunk-for-rag
skill-seekers package output/your-framework --target haystack

Transform documentation into production-ready Haystack pipelines in minutes! 🚀