# Using Skill Seekers with Haystack

**Last Updated:** February 7, 2026
**Status:** Production Ready
**Difficulty:** Easy ⭐

---
## 🎯 The Problem

Building RAG (Retrieval-Augmented Generation) applications with Haystack requires high-quality, structured documentation for your document stores and pipelines. Manually scraping and preparing documentation is:

- **Time-Consuming** - Hours spent scraping docs, formatting, and structuring
- **Error-Prone** - Inconsistent formatting, missing metadata, broken references
- **Not Scalable** - Multi-language docs and large frameworks are overwhelming

**Example:**

> "When building an enterprise RAG system for FastAPI documentation with Haystack, you need to scrape 300+ pages, structure them with proper metadata, and prepare for multi-language search. This typically takes 6-8 hours of manual work."

---
## ✨ The Solution

Use Skill Seekers as **essential preprocessing** before Haystack:

1. **Generate Haystack Documents** from any documentation source
2. **Pre-structured with metadata** following the Haystack 2.x format
3. **Ready for document stores** (InMemoryDocumentStore, Elasticsearch, Weaviate)
4. **One command** - scrape, structure, and format in minutes

**Result:**
Skill Seekers outputs JSON files in Haystack Document format (`content` + `meta`), ready to load directly into your Haystack pipelines.
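As a quick sanity check on that format, a small stdlib-only validator (the file path below is illustrative) can confirm every entry carries the `content` + `meta` pair before you hand it to Haystack:

```python
import json

def validate_haystack_export(path: str) -> int:
    """Verify every entry has the `content` (str) + `meta` (dict) shape.

    Returns the number of documents so you can log what was loaded.
    """
    with open(path) as f:
        docs = json.load(f)
    for i, doc in enumerate(docs):
        if not isinstance(doc.get("content"), str) or not isinstance(doc.get("meta"), dict):
            raise ValueError(f"entry {i} is not in Haystack Document format")
    return len(docs)
```

Run it once after packaging to catch a malformed export early.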
---
## 🚀 Quick Start (5 Minutes)

### Prerequisites

- Python 3.10+
- Haystack 2.x installed: `pip install haystack-ai`
- Optional: an embeddings library (e.g., `sentence-transformers`)

### Installation

```bash
# Install Skill Seekers
pip install skill-seekers

# Verify installation
skill-seekers --version
```
### Generate Haystack Documents

```bash
# Example: Django framework documentation
skill-seekers scrape --config configs/django.json

# Package as Haystack Documents
skill-seekers package output/django --target haystack

# Output: output/django-haystack.json
```
### Load into Haystack

```python
import json

from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever

# Load documents
with open("output/django-haystack.json") as f:
    docs_data = json.load(f)

# Convert to Haystack Documents
documents = [
    Document(content=doc["content"], meta=doc["meta"])
    for doc in docs_data
]
print(f"Loaded {len(documents)} documents")

# Create document store
document_store = InMemoryDocumentStore()
document_store.write_documents(documents)

# Create retriever
retriever = InMemoryBM25Retriever(document_store=document_store)

# Query
results = retriever.run(query="How do I create Django models?", top_k=3)
for doc in results["documents"]:
    print(f"\n{doc.meta['category']}: {doc.content[:200]}...")
```

---
## 📖 Detailed Setup Guide

### Step 1: Choose Your Documentation Source

Skill Seekers supports multiple documentation sources:

```bash
# Official framework documentation
skill-seekers scrape --config configs/fastapi.json

# GitHub repository
skill-seekers github --repo tiangolo/fastapi

# PDF documentation
skill-seekers pdf --file docs/manual.pdf

# Combine multiple sources
skill-seekers unified \
  --docs https://fastapi.tiangolo.com/ \
  --github tiangolo/fastapi \
  --output output/fastapi-complete
```
### Step 2: Configure Scraping (Optional)

Create a custom config for your documentation:

```json
{
  "name": "my-framework",
  "base_url": "https://docs.example.com/",
  "selectors": {
    "main_content": "article.documentation",
    "title": "h1.page-title",
    "code_blocks": "pre code"
  },
  "categories": {
    "getting_started": ["intro", "quickstart", "installation"],
    "guides": ["tutorial", "guide", "howto"],
    "api": ["api", "reference"]
  },
  "max_pages": 500,
  "rate_limit": 0.5
}
```

Save as `configs/my-framework.json` and use:

```bash
skill-seekers scrape --config configs/my-framework.json
```
### Step 3: Package for Haystack

```bash
# Generate Haystack Documents
skill-seekers package output/my-framework --target haystack

# With semantic chunking for better retrieval
skill-seekers scrape --config configs/my-framework.json --chunk-for-rag
skill-seekers package output/my-framework --target haystack

# Output files:
# - output/my-framework-haystack.json (Haystack Documents)
# - output/my-framework/rag_chunks.json (if chunking enabled)
```
### Step 4: Load into Haystack Pipeline

**Option A: InMemoryDocumentStore (Development)**

```python
import json

from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever

# Load documents
with open("output/my-framework-haystack.json") as f:
    docs_data = json.load(f)

documents = [
    Document(content=doc["content"], meta=doc["meta"])
    for doc in docs_data
]

# Create in-memory store
document_store = InMemoryDocumentStore()
document_store.write_documents(documents)

# Create BM25 retriever
retriever = InMemoryBM25Retriever(document_store=document_store)

# Query
results = retriever.run(query="your question", top_k=5)
```
**Option B: Elasticsearch (Production)**

```python
# Requires the Elasticsearch integration: pip install elasticsearch-haystack
import json

from haystack import Document
from haystack_integrations.document_stores.elasticsearch import ElasticsearchDocumentStore
from haystack_integrations.components.retrievers.elasticsearch import ElasticsearchBM25Retriever

# Connect to Elasticsearch
document_store = ElasticsearchDocumentStore(
    hosts=["http://localhost:9200"],
    index="my-framework-docs"
)

# Load and write documents
with open("output/my-framework-haystack.json") as f:
    docs_data = json.load(f)

documents = [
    Document(content=doc["content"], meta=doc["meta"])
    for doc in docs_data
]
document_store.write_documents(documents)

# Create retriever
retriever = ElasticsearchBM25Retriever(document_store=document_store)
```
**Option C: Weaviate (Hybrid Search)**

```python
# Requires the Weaviate integration: pip install weaviate-haystack
import json

from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack_integrations.document_stores.weaviate import WeaviateDocumentStore
from haystack_integrations.components.retrievers.weaviate import WeaviateHybridRetriever

# Connect to Weaviate
document_store = WeaviateDocumentStore(
    url="http://localhost:8080",
    collection_settings={"class": "MyFrameworkDocs"}
)

# Load documents
with open("output/my-framework-haystack.json") as f:
    docs_data = json.load(f)

documents = [
    Document(content=doc["content"], meta=doc["meta"])
    for doc in docs_data
]

# Write with embeddings
embedder = SentenceTransformersDocumentEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2"
)
embedder.warm_up()

docs_with_embeddings = embedder.run(documents)
document_store.write_documents(docs_with_embeddings["documents"])

# Create hybrid retriever (BM25 + vector)
retriever = WeaviateHybridRetriever(document_store=document_store)
```
### Step 5: Build RAG Pipeline

```python
from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.utils import Secret

# Create RAG pipeline
rag_pipeline = Pipeline()

# Add components
rag_pipeline.add_component("retriever", retriever)
rag_pipeline.add_component(
    "prompt_builder",
    PromptBuilder(
        template="""
Based on the following documentation, answer the question.

Documentation:
{% for doc in documents %}
{{ doc.content }}
{% endfor %}

Question: {{ question }}

Answer:
"""
    )
)
rag_pipeline.add_component(
    "llm",
    OpenAIGenerator(api_key=Secret.from_env_var("OPENAI_API_KEY"))
)

# Connect components
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")

# Run pipeline
response = rag_pipeline.run({
    "retriever": {"query": "How do I deploy my app?"},
    "prompt_builder": {"question": "How do I deploy my app?"}
})

print(response["llm"]["replies"][0])
```

---
## 🔥 Advanced Usage

### Semantic Chunking for Better Retrieval

```bash
# Enable semantic chunking (preserves code blocks, respects paragraphs)
skill-seekers scrape --config configs/django.json \
  --chunk-for-rag \
  --chunk-tokens 512 \
  --chunk-overlap-tokens 50

# Package chunked output
skill-seekers package output/django --target haystack

# Result: smaller, more focused documents for better retrieval
```
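The `--chunk-tokens` / `--chunk-overlap-tokens` flags map onto a sliding-window idea. As an illustration only (Skill Seekers' actual chunker is semantic and code-aware; this sketch just splits on whitespace "tokens"):

```python
def chunk_tokens(text: str, chunk_tokens: int = 512, overlap_tokens: int = 50):
    """Split text into overlapping windows of whitespace-delimited tokens."""
    tokens = text.split()
    if not tokens:
        return []
    # Each window starts `chunk_tokens - overlap_tokens` after the last,
    # so consecutive chunks share `overlap_tokens` tokens of context.
    step = max(chunk_tokens - overlap_tokens, 1)
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_tokens]))
        if start + chunk_tokens >= len(tokens):
            break
    return chunks
```

The overlap is what keeps a sentence that straddles a chunk boundary retrievable from both sides.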
### Multi-Source RAG System

```bash
# Combine official docs + GitHub issues + PDF guides
skill-seekers unified \
  --docs https://docs.example.com/ \
  --github owner/repo \
  --pdf guides/*.pdf \
  --output output/complete-knowledge

skill-seekers package output/complete-knowledge --target haystack

# Detect conflicts between sources
skill-seekers detect-conflicts output/complete-knowledge
```
### Custom Metadata for Filtering

Haystack Documents include rich metadata for filtering. Note that metadata fields are addressed with a `meta.` prefix in Haystack 2.x filters:

```python
# Filter by category
results = retriever.run(
    query="deployment",
    top_k=5,
    filters={"field": "meta.category", "operator": "==", "value": "guides"}
)

# Filter by version
results = retriever.run(
    query="api reference",
    filters={"field": "meta.version", "operator": "==", "value": "2.0"}
)

# Multiple filters
results = retriever.run(
    query="authentication",
    filters={
        "operator": "AND",
        "conditions": [
            {"field": "meta.category", "operator": "==", "value": "api"},
            {"field": "meta.type", "operator": "==", "value": "reference"}
        ]
    }
)
```
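To see what those filter dictionaries mean, here is a stdlib-only matcher that mimics the semantics on plain dicts (an illustration of the filter shape, not Haystack's implementation, and covering only `==`/`!=` with `AND`/`OR` nesting):

```python
def matches(meta: dict, filt: dict) -> bool:
    """Evaluate a Haystack-style filter dict against one document's metadata."""
    if "conditions" in filt:
        # Logical node: combine the results of the nested conditions
        results = [matches(meta, cond) for cond in filt["conditions"]]
        return all(results) if filt["operator"] == "AND" else any(results)
    # Comparison node: strip the "meta." prefix to index into the dict
    value = meta.get(filt["field"].removeprefix("meta."))
    if filt["operator"] == "==":
        return value == filt["value"]
    if filt["operator"] == "!=":
        return value != filt["value"]
    raise ValueError(f"unsupported operator: {filt['operator']}")
```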
### Embedding-Based Retrieval

```python
from haystack.components.embedders import (
    SentenceTransformersDocumentEmbedder,
    SentenceTransformersTextEmbedder
)
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever

# Embed documents
doc_embedder = SentenceTransformersDocumentEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2"
)
doc_embedder.warm_up()

docs_with_embeddings = doc_embedder.run(documents)
document_store.write_documents(docs_with_embeddings["documents"])

# Create embedding retriever
text_embedder = SentenceTransformersTextEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2"
)
text_embedder.warm_up()

retriever = InMemoryEmbeddingRetriever(document_store=document_store)

# Query with embeddings
query_embedding = text_embedder.run("How do I deploy?")
results = retriever.run(
    query_embedding=query_embedding["embedding"],
    top_k=5
)
```
### Incremental Updates

```bash
# Initial scrape
skill-seekers scrape --config configs/fastapi.json

# Later: update only changed pages
skill-seekers scrape --config configs/fastapi.json --skip-existing

# Merge with existing documents
python scripts/merge_documents.py \
  output/fastapi-haystack.json \
  output/fastapi-haystack-new.json
```
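If your checkout doesn't include such a merge script, a minimal sketch of one might look like the following. It assumes (hypothetically) that each document's `meta` carries a `"file"` field to dedupe on; adjust the key to whatever your exports actually contain:

```python
import json

def merge_documents(existing_path: str, new_path: str) -> list:
    """Merge two Haystack-format JSON exports, preferring the newer copy.

    Documents are keyed on meta["file"], so a re-scraped page replaces
    its older version instead of being duplicated.
    """
    with open(existing_path) as f:
        existing = json.load(f)
    with open(new_path) as f:
        new = json.load(f)

    def key(doc):
        # Fall back to a content prefix if "file" is absent
        return doc["meta"].get("file", doc["content"][:80])

    merged = {key(doc): doc for doc in existing}
    merged.update({key(doc): doc for doc in new})
    return list(merged.values())

# Usage sketch:
# merged = merge_documents("output/fastapi-haystack.json",
#                          "output/fastapi-haystack-new.json")
# with open("output/fastapi-haystack.json", "w") as f:
#     json.dump(merged, f, indent=2)
```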
---
## ✅ Best Practices

### 1. Use Semantic Chunking for Large Docs

**Why:** Better retrieval quality, more focused results

```bash
# Enable chunking for frameworks with long pages
skill-seekers scrape --config configs/django.json \
  --chunk-for-rag \
  --chunk-tokens 512 \
  --chunk-overlap-tokens 50
```
### 2. Choose the Right Document Store

**Development:**
- InMemoryDocumentStore - Fast, no setup

**Production:**
- Elasticsearch - Full-text search, scalable
- Weaviate - Hybrid search (BM25 + vector), multi-modal
- Qdrant - High-performance vector search
- OpenSearch - AWS-managed, cost-effective
### 3. Add Metadata Filters

```python
# Scope queries by category for faster, more relevant results
results = retriever.run(
    query="database models",
    filters={"field": "meta.category", "operator": "==", "value": "guides"}
)
```
### 4. Monitor Retrieval Quality

```python
# Test queries and verify relevance
test_queries = [
    "How do I create a model?",
    "What is the deployment process?",
    "How to handle authentication?"
]

for query in test_queries:
    results = retriever.run(query=query, top_k=3)
    print(f"\nQuery: {query}")
    for i, doc in enumerate(results["documents"], 1):
        print(f"{i}. {doc.meta['file']} - {doc.meta['category']}")
```
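To turn those spot checks into a trackable number, a small stdlib-only hit-rate metric can run in CI. The expected-source labels are ones you curate yourself for your own corpus:

```python
def hit_rate(results_by_query: dict, expected: dict, k: int = 3) -> float:
    """Fraction of queries whose expected source appears in the top-k results.

    results_by_query maps query -> ordered list of retrieved file names;
    expected maps query -> the file that should be retrieved.
    """
    if not expected:
        return 0.0
    hits = sum(
        1 for query, want in expected.items()
        if want in results_by_query.get(query, [])[:k]
    )
    return hits / len(expected)
```

A drop in this number after re-scraping or re-chunking is an early warning that retrieval quality regressed.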
### 5. Version Your Documentation

```bash
# Include version in metadata
skill-seekers scrape --config configs/django.json --metadata version=4.2
```

```python
# Query a specific version
results = retriever.run(
    query="middleware",
    filters={"field": "meta.version", "operator": "==", "value": "4.2"}
)
```

---
## 💼 Real-World Example: FastAPI RAG Chatbot

Complete example of building a FastAPI documentation chatbot:

### Step 1: Generate Documentation

```bash
# Scrape FastAPI docs with chunking
skill-seekers scrape --config configs/fastapi.json \
  --chunk-for-rag \
  --chunk-tokens 512 \
  --chunk-overlap-tokens 50 \
  --max-pages 200

# Package for Haystack
skill-seekers package output/fastapi --target haystack
```
### Step 2: Setup Haystack Pipeline

```python
import json

from haystack import Pipeline, Document
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.utils import Secret

# Load documents
with open("output/fastapi-haystack.json") as f:
    docs_data = json.load(f)

documents = [
    Document(content=doc["content"], meta=doc["meta"])
    for doc in docs_data
]
print(f"Loaded {len(documents)} FastAPI documentation chunks")

# Create document store
document_store = InMemoryDocumentStore()
document_store.write_documents(documents)
print(f"Indexed {document_store.count_documents()} documents")

# Build RAG pipeline
rag = Pipeline()

# Add components
rag.add_component(
    "retriever",
    InMemoryBM25Retriever(document_store=document_store)
)

rag.add_component(
    "prompt",
    PromptBuilder(
        template="""
You are a FastAPI expert assistant. Answer the question based on the documentation below.

Documentation:
{% for doc in documents %}
---
Source: {{ doc.meta.file }}
Category: {{ doc.meta.category }}

{{ doc.content }}
{% endfor %}

Question: {{ question }}

Provide a clear, code-focused answer with examples when relevant.
"""
    )
)

rag.add_component(
    "llm",
    OpenAIGenerator(
        api_key=Secret.from_env_var("OPENAI_API_KEY"),
        model="gpt-4"
    )
)

# Connect pipeline
rag.connect("retriever.documents", "prompt.documents")
rag.connect("prompt.prompt", "llm.prompt")

print("Pipeline ready!")
```
### Step 3: Interactive Chat

```python
def ask_fastapi(question: str, top_k: int = 5):
    """Ask a question about FastAPI."""
    # include_outputs_from keeps the retriever's documents in the
    # response so we can show sources alongside the answer
    response = rag.run(
        {
            "retriever": {"query": question, "top_k": top_k},
            "prompt": {"question": question}
        },
        include_outputs_from={"retriever"}
    )

    answer = response["llm"]["replies"][0]
    print(f"\nQuestion: {question}\n")
    print(f"Answer: {answer}\n")

    # Show sources
    docs = response["retriever"]["documents"]
    print("Sources:")
    for doc in docs:
        print(f"  - {doc.meta['file']} ({doc.meta['category']})")

# Example usage
ask_fastapi("How do I create a REST API endpoint?")
ask_fastapi("What is dependency injection in FastAPI?")
ask_fastapi("How do I handle file uploads?")
```
### Step 4: Deploy with FastAPI

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Question(BaseModel):
    text: str
    top_k: int = 5

@app.post("/ask")
async def ask_question(question: Question):
    """Ask a question about FastAPI documentation."""
    response = rag.run(
        {
            "retriever": {"query": question.text, "top_k": question.top_k},
            "prompt": {"question": question.text}
        },
        include_outputs_from={"retriever"}
    )

    return {
        "question": question.text,
        "answer": response["llm"]["replies"][0],
        "sources": [
            {
                "file": doc.meta["file"],
                "category": doc.meta["category"],
                "content_preview": doc.content[:200]
            }
            for doc in response["retriever"]["documents"]
        ]
    }

# Run:  uvicorn chatbot:app --reload
# Test: curl -X POST http://localhost:8000/ask \
#         -H "Content-Type: application/json" \
#         -d '{"text": "How do I use async functions?"}'
```

**Result:**
- ✅ 200 documentation pages → 450 optimized chunks
- ✅ Sub-second retrieval with BM25
- ✅ Context-aware answers from GPT-4
- ✅ Source attribution for every answer
- ✅ REST API for integration

---
## 🔧 Troubleshooting

### Issue: Documents not loading correctly

**Symptoms:** Empty content, missing metadata

**Solutions:**

```bash
# Verify JSON structure
jq '.[0]' output/fastapi-haystack.json

# Should show:
# {
#   "content": "...",
#   "meta": {
#     "source": "fastapi",
#     "category": "...",
#     ...
#   }
# }

# Regenerate if malformed
skill-seekers package output/fastapi --target haystack --force
```
### Issue: Poor retrieval quality

**Symptoms:** Irrelevant results, missed relevant docs

**Solutions:**

```bash
# 1. Enable semantic chunking
skill-seekers scrape --config configs/fastapi.json --chunk-for-rag

# 2. Adjust chunk size: larger chunks give more context,
#    more overlap preserves continuity across boundaries
skill-seekers scrape --config configs/fastapi.json \
  --chunk-for-rag \
  --chunk-tokens 768 \
  --chunk-overlap-tokens 100

# 3. Use hybrid search (BM25 + embeddings)
# See Advanced Usage section
```
### Issue: OutOfMemoryError with large docs

**Symptoms:** Crash when loading thousands of documents

**Solutions:**

```python
# Load documents in batches
import json

from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

def load_documents_batched(file_path, batch_size=100):
    with open(file_path) as f:
        docs_data = json.load(f)

    for i in range(0, len(docs_data), batch_size):
        batch = docs_data[i:i+batch_size]
        documents = [
            Document(content=doc["content"], meta=doc["meta"])
            for doc in batch
        ]
        document_store.write_documents(documents)
        print(f"Loaded batch {i//batch_size + 1}")

load_documents_batched("output/large-framework-haystack.json")
```
### Issue: Haystack version compatibility

**Symptoms:** Import errors, method not found

**Solutions:**

```bash
# Check Haystack version
pip show haystack-ai

# Skill Seekers requires Haystack 2.x
pip install --upgrade "haystack-ai>=2.0.0"

# For Haystack 1.x (legacy), use markdown export instead:
skill-seekers package output/framework --target markdown
```
### Issue: Slow query performance

**Symptoms:** Queries take >2 seconds

**Solutions:**

```python
# 1. Reduce top_k
results = retriever.run(query="...", top_k=3)  # Instead of 10

# 2. Add metadata filters
results = retriever.run(
    query="...",
    filters={"field": "meta.category", "operator": "==", "value": "api"}
)

# 3. Use InMemoryDocumentStore for development;
#    switch to Elasticsearch for production scale
```

---
## 📊 Before vs After

| Aspect | Before Skill Seekers | After Skill Seekers |
|--------|----------------------|---------------------|
| **Setup Time** | 6-8 hours manual scraping | 5 minutes automated |
| **Documentation Quality** | Inconsistent, missing metadata | Structured with rich metadata |
| **Chunking** | Manual, error-prone | Semantic, code-preserving |
| **Updates** | Re-scrape everything | Incremental updates |
| **Multi-source** | Complex custom scripts | One unified command |
| **Format** | Custom JSON hacking | Native Haystack Documents |
| **Retrieval Quality** | Poor (large chunks, no metadata) | Excellent (optimized chunks, filters) |
| **Maintenance** | High (scripts break) | Low (one tool, well-tested) |

---
## 🎓 Next Steps

### Try These Examples

1. **Build a chatbot** - Follow the FastAPI example above
2. **Multi-language search** - Scrape docs in multiple languages
3. **Hybrid retrieval** - Combine BM25 + embeddings (see Advanced Usage)
4. **Production deployment** - Use Elasticsearch or Weaviate

### Explore More Integrations

- [LangChain Integration](LANGCHAIN.md) - Alternative RAG framework
- [LlamaIndex Integration](LLAMA_INDEX.md) - Query engine approach
- [Pinecone Integration](PINECONE.md) - Cloud vector database
- [Cursor Integration](CURSOR.md) - AI coding assistant

### Learn More

- [RAG Pipelines Guide](RAG_PIPELINES.md) - Complete RAG overview
- [Chunking Guide](../features/CHUNKING.md) - Semantic chunking details
- [Haystack Documentation](https://docs.haystack.deepset.ai/)
- [Example Repository](../../examples/haystack-pipeline/)

---
## 🤝 Support

- **Questions:** [GitHub Discussions](https://github.com/yusufkaraaslan/Skill_Seekers/discussions)
- **Issues:** [GitHub Issues](https://github.com/yusufkaraaslan/Skill_Seekers/issues)
- **Haystack Help:** [Haystack Discord](https://discord.gg/haystack)

---
**Ready to build production RAG with Haystack?**

```bash
pip install skill-seekers haystack-ai
skill-seekers scrape --config configs/your-framework.json --chunk-for-rag
skill-seekers package output/your-framework --target haystack
```

Transform documentation into production-ready Haystack pipelines in minutes! 🚀