# Using Skill Seekers with Haystack
**Last Updated:** February 7, 2026
**Status:** Production Ready
**Difficulty:** Easy ⭐
---
## 🎯 The Problem
Building RAG (Retrieval-Augmented Generation) applications with Haystack requires high-quality, structured documentation for your document stores and pipelines. Manually scraping and preparing documentation is:
- **Time-Consuming** - Hours spent scraping docs, formatting, and structuring
- **Error-Prone** - Inconsistent formatting, missing metadata, broken references
- **Not Scalable** - Multi-language docs and large frameworks are overwhelming
**Example:**
> "When building an enterprise RAG system for FastAPI documentation with Haystack, you need to scrape 300+ pages, structure them with proper metadata, and prepare for multi-language search. This typically takes 6-8 hours of manual work."
---
## ✨ The Solution
Use Skill Seekers as **essential preprocessing** before Haystack:
1. **Generate Haystack Documents** from any documentation source
2. **Pre-structured with metadata** following Haystack 2.x format
3. **Ready for document stores** (InMemoryDocumentStore, Elasticsearch, Weaviate)
4. **One command** - scrape, structure, format in minutes
**Result:**
Skill Seekers outputs JSON files with Haystack Document format (`content` + `meta`), ready to load directly into your Haystack pipelines.
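For reference, each record in that file pairs a `content` string with a `meta` dictionary, along these lines (field values here are illustrative):

```json
[
  {
    "content": "Django models are Python classes that define your database schema...",
    "meta": {
      "source": "django",
      "category": "guides",
      "file": "topics/db/models.md"
    }
  }
]
```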
---
## 🚀 Quick Start (5 Minutes)
### Prerequisites
- Python 3.10+
- Haystack 2.x installed: `pip install haystack-ai`
- Optional: Embeddings library (e.g., `sentence-transformers`)
### Installation
```bash
# Install Skill Seekers
pip install skill-seekers
# Verify installation
skill-seekers --version
```
### Generate Haystack Documents
```bash
# Example: Django framework documentation
skill-seekers scrape --config configs/django.json
# Package as Haystack Documents
skill-seekers package output/django --target haystack
# Output: output/django-haystack.json
```
### Load into Haystack
```python
from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
import json
# Load documents
with open("output/django-haystack.json") as f:
    docs_data = json.load(f)

# Convert to Haystack Documents
documents = [
    Document(content=doc["content"], meta=doc["meta"])
    for doc in docs_data
]
print(f"Loaded {len(documents)} documents")

# Create document store
document_store = InMemoryDocumentStore()
document_store.write_documents(documents)

# Create retriever
retriever = InMemoryBM25Retriever(document_store=document_store)

# Query
results = retriever.run(query="How do I create Django models?", top_k=3)
for doc in results["documents"]:
    print(f"\n{doc.meta['category']}: {doc.content[:200]}...")
```
---
## 📖 Detailed Setup Guide
### Step 1: Choose Your Documentation Source
Skill Seekers supports multiple documentation sources:
```bash
# Official framework documentation
skill-seekers scrape --config configs/fastapi.json
# GitHub repository
skill-seekers github --repo tiangolo/fastapi
# PDF documentation
skill-seekers pdf --file docs/manual.pdf
# Combine multiple sources
skill-seekers unified \
  --docs https://fastapi.tiangolo.com/ \
  --github tiangolo/fastapi \
  --output output/fastapi-complete
```
### Step 2: Configure Scraping (Optional)
Create a custom config for your documentation:
```json
{
  "name": "my-framework",
  "base_url": "https://docs.example.com/",
  "selectors": {
    "main_content": "article.documentation",
    "title": "h1.page-title",
    "code_blocks": "pre code"
  },
  "categories": {
    "getting_started": ["intro", "quickstart", "installation"],
    "guides": ["tutorial", "guide", "howto"],
    "api": ["api", "reference"]
  },
  "max_pages": 500,
  "rate_limit": 0.5
}
```
Save as `configs/my-framework.json` and use:
```bash
skill-seekers scrape --config configs/my-framework.json
```
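Skill Seekers reports config problems itself; if you want to sanity-check a config before kicking off a long scrape, a quick stdlib check might look like the sketch below. The required-key set is an assumption based on the example config above, not a documented schema:

```python
import json

# Assumed minimal key set, based on the example config above (not a documented schema)
REQUIRED_KEYS = {"name", "base_url", "selectors"}

def check_config(path):
    """Raise ValueError if the scraper config is missing an expected top-level key."""
    with open(path) as f:
        cfg = json.load(f)
    missing = REQUIRED_KEYS - cfg.keys()
    if missing:
        raise ValueError(f"{path} is missing keys: {sorted(missing)}")
    return cfg
```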
### Step 3: Package for Haystack
```bash
# Generate Haystack Documents
skill-seekers package output/my-framework --target haystack
# With semantic chunking for better retrieval
skill-seekers scrape --config configs/my-framework.json --chunk-for-rag
skill-seekers package output/my-framework --target haystack
# Output files:
# - output/my-framework-haystack.json (Haystack Documents)
# - output/my-framework/rag_chunks.json (if chunking enabled)
```
### Step 4: Load into Haystack Pipeline
**Option A: InMemoryDocumentStore (Development)**
```python
from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
import json
# Load documents
with open("output/my-framework-haystack.json") as f:
    docs_data = json.load(f)

documents = [
    Document(content=doc["content"], meta=doc["meta"])
    for doc in docs_data
]

# Create in-memory store
document_store = InMemoryDocumentStore()
document_store.write_documents(documents)

# Create BM25 retriever
retriever = InMemoryBM25Retriever(document_store=document_store)

# Query
results = retriever.run(query="your question", top_k=5)
```
**Option B: Elasticsearch (Production)**
```python
from haystack import Document
from haystack_integrations.document_stores.elasticsearch import ElasticsearchDocumentStore
from haystack_integrations.components.retrievers.elasticsearch import ElasticsearchBM25Retriever
import json

# Connect to Elasticsearch (install the integration first: pip install elasticsearch-haystack)
document_store = ElasticsearchDocumentStore(
    hosts=["http://localhost:9200"],
    index="my-framework-docs"
)

# Load and write documents
with open("output/my-framework-haystack.json") as f:
    docs_data = json.load(f)

documents = [
    Document(content=doc["content"], meta=doc["meta"])
    for doc in docs_data
]
document_store.write_documents(documents)

# Create retriever
retriever = ElasticsearchBM25Retriever(document_store=document_store)
```
**Option C: Weaviate (Hybrid Search)**
```python
from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack_integrations.document_stores.weaviate import WeaviateDocumentStore
from haystack_integrations.components.retrievers.weaviate import WeaviateEmbeddingRetriever
import json

# Connect to Weaviate (install the integration first: pip install weaviate-haystack)
document_store = WeaviateDocumentStore(
    url="http://localhost:8080",
    collection_settings={"class": "MyFrameworkDocs"}
)

# Load documents
with open("output/my-framework-haystack.json") as f:
    docs_data = json.load(f)

documents = [
    Document(content=doc["content"], meta=doc["meta"])
    for doc in docs_data
]

# Write with embeddings
embedder = SentenceTransformersDocumentEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2"
)
embedder.warm_up()
docs_with_embeddings = embedder.run(documents)
document_store.write_documents(docs_with_embeddings["documents"])

# Create vector retriever (pair with WeaviateBM25Retriever for hybrid search)
retriever = WeaviateEmbeddingRetriever(document_store=document_store)
```
### Step 5: Build RAG Pipeline
```python
from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.utils import Secret

# Create RAG pipeline
rag_pipeline = Pipeline()

# Add components
rag_pipeline.add_component("retriever", retriever)
rag_pipeline.add_component(
    "prompt_builder",
    PromptBuilder(
        template="""
Based on the following documentation, answer the question.

Documentation:
{% for doc in documents %}
{{ doc.content }}
{% endfor %}

Question: {{ question }}
Answer:
"""
    )
)
rag_pipeline.add_component(
    "llm",
    OpenAIGenerator(api_key=Secret.from_env_var("OPENAI_API_KEY"))
)

# Connect components
rag_pipeline.connect("retriever.documents", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")

# Run pipeline
response = rag_pipeline.run({
    "retriever": {"query": "How do I deploy my app?"},
    "prompt_builder": {"question": "How do I deploy my app?"}
})
print(response["llm"]["replies"][0])
```
---
## 🔥 Advanced Usage
### Semantic Chunking for Better Retrieval
```bash
# Enable semantic chunking (preserves code blocks, respects paragraphs)
skill-seekers scrape --config configs/django.json \
  --chunk-for-rag \
  --chunk-tokens 512 \
  --chunk-overlap-tokens 50
# Package chunked output
skill-seekers package output/django --target haystack
# Result: Smaller, more focused documents for better retrieval
```
### Multi-Source RAG System
```bash
# Combine official docs + GitHub issues + PDF guides
skill-seekers unified \
  --docs https://docs.example.com/ \
  --github owner/repo \
  --pdf guides/*.pdf \
  --output output/complete-knowledge
skill-seekers package output/complete-knowledge --target haystack
# Detect conflicts between sources
skill-seekers detect-conflicts output/complete-knowledge
```
### Custom Metadata for Filtering
Haystack Documents include rich metadata for filtering:
```python
# Query with metadata filters

# Filter by category
results = retriever.run(
    query="deployment",
    top_k=5,
    filters={"field": "category", "operator": "==", "value": "guides"}
)

# Filter by version
results = retriever.run(
    query="api reference",
    filters={"field": "version", "operator": "==", "value": "2.0"}
)

# Multiple filters
results = retriever.run(
    query="authentication",
    filters={
        "operator": "AND",
        "conditions": [
            {"field": "category", "operator": "==", "value": "api"},
            {"field": "type", "operator": "==", "value": "reference"}
        ]
    }
)
```
### Embedding-Based Retrieval
```python
from haystack.components.embedders import (
    SentenceTransformersDocumentEmbedder,
    SentenceTransformersTextEmbedder
)
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever

# Embed documents
doc_embedder = SentenceTransformersDocumentEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2"
)
doc_embedder.warm_up()
docs_with_embeddings = doc_embedder.run(documents)
document_store.write_documents(docs_with_embeddings["documents"])

# Create embedding retriever
text_embedder = SentenceTransformersTextEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2"
)
text_embedder.warm_up()
retriever = InMemoryEmbeddingRetriever(document_store=document_store)

# Query with embeddings
query_embedding = text_embedder.run(text="How do I deploy?")
results = retriever.run(
    query_embedding=query_embedding["embedding"],
    top_k=5
)
```
### Incremental Updates
```bash
# Initial scrape
skill-seekers scrape --config configs/fastapi.json
# Later: Update only changed pages
skill-seekers scrape --config configs/fastapi.json --skip-existing
# Merge with existing documents
python scripts/merge_documents.py \
  output/fastapi-haystack.json \
  output/fastapi-haystack-new.json
```
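The merge script referenced above isn't shown here. A minimal sketch of what such a merge might do — concatenate the two exports and drop records whose `content` has already been seen — looks like this (a simplification, not the actual `scripts/merge_documents.py`):

```python
import json

def merge_haystack_docs(existing_path, new_path, output_path):
    """Merge two Haystack Document JSON exports, de-duplicating by content."""
    merged = []
    seen = set()
    for path in (existing_path, new_path):
        with open(path) as f:
            for doc in json.load(f):
                key = doc["content"]
                if key not in seen:  # keep the first copy of each chunk
                    seen.add(key)
                    merged.append(doc)
    with open(output_path, "w") as f:
        json.dump(merged, f, indent=2)
    return len(merged)
```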
---
## ✅ Best Practices
### 1. Use Semantic Chunking for Large Docs
**Why:** Better retrieval quality, more focused results
```bash
# Enable chunking for frameworks with long pages
skill-seekers scrape --config configs/django.json \
  --chunk-for-rag \
  --chunk-tokens 512 \
  --chunk-overlap-tokens 50
```
### 2. Choose Right Document Store
**Development:**
- InMemoryDocumentStore - Fast, no setup
**Production:**
- Elasticsearch - Full-text search, scalable
- Weaviate - Hybrid search (BM25 + vector), multi-modal
- Qdrant - High-performance vector search
- OpenSearch - AWS-managed, cost-effective
### 3. Add Metadata Filters
```python
# Always include category in queries for faster results
results = retriever.run(
    query="database models",
    filters={"field": "category", "operator": "==", "value": "guides"}
)
```
### 4. Monitor Retrieval Quality
```python
# Test queries and verify relevance
test_queries = [
    "How do I create a model?",
    "What is the deployment process?",
    "How to handle authentication?"
]

for query in test_queries:
    results = retriever.run(query=query, top_k=3)
    print(f"\nQuery: {query}")
    for i, doc in enumerate(results["documents"], 1):
        print(f"{i}. {doc.meta['file']} - {doc.meta['category']}")
```
### 5. Version Your Documentation
```bash
# Include version in metadata
skill-seekers scrape --config configs/django.json --metadata version=4.2
```
Then query a specific version with a metadata filter:
```python
results = retriever.run(
    query="middleware",
    filters={"field": "version", "operator": "==", "value": "4.2"}
)
```
---
## 💼 Real-World Example: FastAPI RAG Chatbot
Complete example of building a FastAPI documentation chatbot:
### Step 1: Generate Documentation
```bash
# Scrape FastAPI docs with chunking
skill-seekers scrape --config configs/fastapi.json \
  --chunk-for-rag \
  --chunk-tokens 512 \
  --chunk-overlap-tokens 50 \
  --max-pages 200
# Package for Haystack
skill-seekers package output/fastapi --target haystack
```
### Step 2: Setup Haystack Pipeline
```python
from haystack import Pipeline, Document
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.utils import Secret
import json

# Load documents
with open("output/fastapi-haystack.json") as f:
    docs_data = json.load(f)

documents = [
    Document(content=doc["content"], meta=doc["meta"])
    for doc in docs_data
]
print(f"Loaded {len(documents)} FastAPI documentation chunks")

# Create document store
document_store = InMemoryDocumentStore()
document_store.write_documents(documents)
print(f"Indexed {document_store.count_documents()} documents")

# Build RAG pipeline
rag = Pipeline()

# Add components
rag.add_component(
    "retriever",
    InMemoryBM25Retriever(document_store=document_store)
)
rag.add_component(
    "prompt",
    PromptBuilder(
        template="""
You are a FastAPI expert assistant. Answer the question based on the documentation below.

Documentation:
{% for doc in documents %}
---
Source: {{ doc.meta.file }}
Category: {{ doc.meta.category }}

{{ doc.content }}
{% endfor %}

Question: {{ question }}

Provide a clear, code-focused answer with examples when relevant.
"""
    )
)
rag.add_component(
    "llm",
    OpenAIGenerator(
        api_key=Secret.from_env_var("OPENAI_API_KEY"),
        model="gpt-4"
    )
)

# Connect pipeline
rag.connect("retriever.documents", "prompt.documents")
rag.connect("prompt.prompt", "llm.prompt")
print("Pipeline ready!")
```
### Step 3: Interactive Chat
```python
def ask_fastapi(question: str, top_k: int = 5):
    """Ask a question about FastAPI."""
    # include_outputs_from keeps the retriever's documents in the response
    # (by default, outputs consumed by downstream components are dropped)
    response = rag.run(
        {
            "retriever": {"query": question, "top_k": top_k},
            "prompt": {"question": question}
        },
        include_outputs_from={"retriever"}
    )
    answer = response["llm"]["replies"][0]
    print(f"\nQuestion: {question}\n")
    print(f"Answer: {answer}\n")
    # Show sources
    docs = response["retriever"]["documents"]
    print("Sources:")
    for doc in docs:
        print(f"  - {doc.meta['file']} ({doc.meta['category']})")

# Example usage
ask_fastapi("How do I create a REST API endpoint?")
ask_fastapi("What is dependency injection in FastAPI?")
ask_fastapi("How do I handle file uploads?")
```
### Step 4: Deploy with FastAPI
```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Question(BaseModel):
    text: str
    top_k: int = 5

@app.post("/ask")
async def ask_question(question: Question):
    """Ask a question about FastAPI documentation."""
    response = rag.run(
        {
            "retriever": {"query": question.text, "top_k": question.top_k},
            "prompt": {"question": question.text}
        },
        include_outputs_from={"retriever"}  # keep retriever output for source attribution
    )
    return {
        "question": question.text,
        "answer": response["llm"]["replies"][0],
        "sources": [
            {
                "file": doc.meta["file"],
                "category": doc.meta["category"],
                "content_preview": doc.content[:200]
            }
            for doc in response["retriever"]["documents"]
        ]
    }
# Run: uvicorn chatbot:app --reload
# Test: curl -X POST http://localhost:8000/ask \
# -H "Content-Type: application/json" \
# -d '{"text": "How do I use async functions?"}'
```
**Result:**
- ✅ 200 documentation pages → 450 optimized chunks
- ✅ Sub-second retrieval with BM25
- ✅ Context-aware answers from GPT-4
- ✅ Source attribution for every answer
- ✅ REST API for integration
---
## 🔧 Troubleshooting
### Issue: Documents not loading correctly
**Symptoms:** Empty content, missing metadata
**Solutions:**
```bash
# Verify JSON structure
jq '.[0]' output/fastapi-haystack.json
# Should show:
# {
# "content": "...",
# "meta": {
# "source": "fastapi",
# "category": "...",
# ...
# }
# }
# Regenerate if malformed
skill-seekers package output/fastapi --target haystack --force
```
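If `jq` isn't available, the same structural check can be done in a few lines of Python. This is a generic sanity check written for this guide, not a skill-seekers command:

```python
import json

def validate_haystack_export(path):
    """Return indices of records missing a non-empty 'content' or a 'meta' dict."""
    with open(path) as f:
        docs = json.load(f)
    return [
        i for i, d in enumerate(docs)
        if not d.get("content") or not isinstance(d.get("meta"), dict)
    ]
```

An empty return list means every record is well-formed; any indices it reports point at records worth regenerating.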
### Issue: Poor retrieval quality
**Symptoms:** Irrelevant results, missed relevant docs
**Solutions:**
```bash
# 1. Enable semantic chunking
skill-seekers scrape --config configs/fastapi.json --chunk-for-rag

# 2. Adjust chunk size: larger chunks give more context,
#    more overlap preserves continuity across chunk boundaries
skill-seekers scrape --config configs/fastapi.json \
  --chunk-for-rag \
  --chunk-tokens 768 \
  --chunk-overlap-tokens 100

# 3. Use hybrid search (BM25 + embeddings)
# See Advanced Usage section
```
### Issue: OutOfMemoryError with large docs
**Symptoms:** Crash when loading thousands of documents
**Solutions:**
```python
# Write documents to the store in batches
# (json.load still reads the whole file, but batching limits how many
# Document objects are alive at once)
import json

from haystack import Document

def load_documents_batched(file_path, batch_size=100):
    with open(file_path) as f:
        docs_data = json.load(f)
    for i in range(0, len(docs_data), batch_size):
        batch = docs_data[i:i+batch_size]
        documents = [
            Document(content=doc["content"], meta=doc["meta"])
            for doc in batch
        ]
        document_store.write_documents(documents)
        print(f"Loaded batch {i//batch_size + 1}")

load_documents_batched("output/large-framework-haystack.json")
```
### Issue: Haystack version compatibility
**Symptoms:** Import errors, method not found
**Solutions:**
```bash
# Check Haystack version
pip show haystack-ai
# Skill Seekers requires Haystack 2.x
pip install --upgrade "haystack-ai>=2.0.0"
# For Haystack 1.x (legacy), use markdown export instead:
skill-seekers package output/framework --target markdown
```
### Issue: Slow query performance
**Symptoms:** Queries take >2 seconds
**Solutions:**
```python
# 1. Reduce top_k
results = retriever.run(query="...", top_k=3)  # instead of 10

# 2. Add metadata filters
results = retriever.run(
    query="...",
    filters={"field": "category", "operator": "==", "value": "api"}
)
# 3. Use InMemoryDocumentStore for development
# Switch to Elasticsearch for production scale
```
---
## 📊 Before vs After
| Aspect | Before Skill Seekers | After Skill Seekers |
|--------|---------------------|-------------------|
| **Setup Time** | 6-8 hours manual scraping | 5 minutes automated |
| **Documentation Quality** | Inconsistent, missing metadata | Structured with rich metadata |
| **Chunking** | Manual, error-prone | Semantic, code-preserving |
| **Updates** | Re-scrape everything | Incremental updates |
| **Multi-source** | Complex custom scripts | One unified command |
| **Format** | Custom JSON hacking | Native Haystack Documents |
| **Retrieval Quality** | Poor (large chunks, no metadata) | Excellent (optimized chunks, filters) |
| **Maintenance** | High (scripts break) | Low (one tool, well-tested) |
---
## 🎓 Next Steps
### Try These Examples
1. **Build a chatbot** - Follow the FastAPI example above
2. **Multi-language search** - Scrape docs in multiple languages
3. **Hybrid retrieval** - Combine BM25 + embeddings (see Advanced Usage)
4. **Production deployment** - Use Elasticsearch or Weaviate
### Explore More Integrations
- [LangChain Integration](LANGCHAIN.md) - Alternative RAG framework
- [LlamaIndex Integration](LLAMA_INDEX.md) - Query engine approach
- [Pinecone Integration](PINECONE.md) - Cloud vector database
- [Cursor Integration](CURSOR.md) - AI coding assistant
### Learn More
- [RAG Pipelines Guide](RAG_PIPELINES.md) - Complete RAG overview
- [Chunking Guide](../features/CHUNKING.md) - Semantic chunking details
- [Haystack Documentation](https://docs.haystack.deepset.ai/)
- [Example Repository](../../examples/haystack-pipeline/)
---
## 🤝 Support
- **Questions:** [GitHub Discussions](https://github.com/yusufkaraaslan/Skill_Seekers/discussions)
- **Issues:** [GitHub Issues](https://github.com/yusufkaraaslan/Skill_Seekers/issues)
- **Haystack Help:** [Haystack Discord](https://discord.gg/haystack)
---
**Ready to build production RAG with Haystack?**
```bash
pip install skill-seekers haystack-ai
skill-seekers scrape --config configs/your-framework.json --chunk-for-rag
skill-seekers package output/your-framework --target haystack
```
Transform documentation into production-ready Haystack pipelines in minutes! 🚀