feat: Add Haystack RAG framework adaptor (Task 2.2)

Implements complete Haystack 2.x integration for RAG pipelines:

**Haystack Adaptor (src/skill_seekers/cli/adaptors/haystack.py):**
- Document format: {content: str, meta: dict}
- JSON packaging for Haystack pipelines
- Compatible with InMemoryDocumentStore, BM25Retriever
- Registered in adaptor factory as 'haystack'
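The document format above can be pictured with a minimal sketch (the field values here are illustrative, not taken from actual adaptor output):

```python
import json

# Illustrative records in the adaptor's {content: str, meta: dict} format
sample = [
    {
        "content": "# React Hooks\nuseState adds state to function components.",
        "meta": {"file": "hooks.md", "category": "hooks"},
    },
]

# Round-trip through JSON, the way the packaged file would be consumed
docs = json.loads(json.dumps(sample))
assert all({"content", "meta"} <= set(rec) for rec in docs)
print(f"{len(docs)} record(s) in Haystack document format")
```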

**Example Pipeline (examples/haystack-pipeline/):**
- README.md with comprehensive guide and troubleshooting
- quickstart.py demonstrating BM25 retrieval
- requirements.txt (haystack-ai>=2.0.0)
- Shows document loading, indexing, and querying

**Tests (tests/test_adaptors/test_haystack_adaptor.py):**
- 11 tests covering all adaptor functionality
- Format validation, packaging, upload messages
- Edge cases: empty dirs, references-only skills
- All 93 adaptor tests passing (100% suite pass rate)

**Features:**
- No upload endpoint (local use only, like the LangChain/LlamaIndex adaptors)
- No AI enhancement (enhance before packaging)
- Same packaging pattern as other RAG frameworks
- InMemoryDocumentStore + BM25Retriever example

Test: pytest tests/test_adaptors/test_haystack_adaptor.py -v
Author: yusyus
Date: 2026-02-07 21:01:49 +03:00
Parent: 8b3f31409e
Commit: 1c888e7817
6 changed files with 910 additions and 0 deletions

---

**File: `examples/haystack-pipeline/README.md`** (new file, 278 lines)
# Haystack Pipeline Example
Complete example showing how to use Skill Seekers with Haystack 2.x for building RAG pipelines.
## What This Example Does
- ✅ Converts documentation into Haystack Documents
- ✅ Creates an in-memory document store
- ✅ Builds a BM25 retriever for keyword-based search
- ✅ Shows complete RAG pipeline workflow
## Prerequisites
```bash
# Install Skill Seekers
pip install skill-seekers
# Install Haystack 2.x
pip install haystack-ai
```
## Quick Start
### 1. Generate React Documentation Skill
```bash
# Scrape React documentation
skill-seekers scrape --config configs/react.json --max-pages 100
# Package for Haystack
skill-seekers package output/react --target haystack
```
This creates `output/react-haystack.json` with Haystack Documents.
### 2. Run the Pipeline
```bash
# Run the example script
python quickstart.py
```
## How It Works
### Step 1: Load Documents
```python
from haystack import Document
import json
# Load Haystack documents
with open("../../output/react-haystack.json") as f:
    docs_data = json.load(f)

documents = [
    Document(content=doc["content"], meta=doc["meta"])
    for doc in docs_data
]
print(f"📚 Loaded {len(documents)} documents")
```
### Step 2: Create Document Store
```python
from haystack.document_stores.in_memory import InMemoryDocumentStore
# Create in-memory store
document_store = InMemoryDocumentStore()
document_store.write_documents(documents)
print(f"💾 Indexed {document_store.count_documents()} documents")
```
### Step 3: Build Retriever
```python
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
# Create BM25 retriever
retriever = InMemoryBM25Retriever(document_store=document_store)
# Query
results = retriever.run(
    query="How do I use useState hook?",
    top_k=3,
)

# Display results
for doc in results["documents"]:
    print(f"\n📖 Source: {doc.meta.get('file', 'unknown')}")
    print(f"   Category: {doc.meta.get('category', 'unknown')}")
    print(f"   Preview: {doc.content[:200]}...")
```
## Expected Output
```
📚 Loaded 15 documents
💾 Indexed 15 documents
🔍 Query: How do I use useState hook?
📖 Source: hooks.md
Category: hooks
Preview: # React Hooks
React Hooks are functions that let you "hook into" React state and lifecycle features from function components.
## useState
The useState Hook lets you add React state to function components...
📖 Source: getting_started.md
Category: getting started
Preview: # Getting Started with React
React is a JavaScript library for building user interfaces...
📖 Source: best_practices.md
Category: best practices
Preview: # React Best Practices
When working with Hooks...
```
## Advanced Usage
### With RAG Chunking
For better retrieval quality, use semantic chunking:
```bash
# Generate with chunking
skill-seekers scrape --config configs/react.json --max-pages 100 --chunk-for-rag --chunk-size 512 --chunk-overlap 50
# Use chunked output
python quickstart.py --chunked
```
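The `--chunk-size`/`--chunk-overlap` parameters can be pictured with a naive sliding-window chunker (illustrative only; the CLI's `--chunk-for-rag` uses semantic chunking, not a fixed window):

```python
def chunk_text(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    """Naive fixed-window chunker: each chunk shares `overlap` chars with the next."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

# A 1000-char document with size=512/overlap=50 yields three overlapping chunks
chunks = chunk_text("x" * 1000, size=512, overlap=50)
print(len(chunks), [len(c) for c in chunks])  # → 3 [512, 512, 76]
```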
### With Vector Embeddings
For semantic search instead of BM25:
```python
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
# Create document store with embeddings
document_store = InMemoryDocumentStore()
# Embed documents
embedder = SentenceTransformersDocumentEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2"
)
embedder.warm_up()

# Process documents
docs_with_embeddings = embedder.run(documents)
document_store.write_documents(docs_with_embeddings["documents"])

# Create embedding retriever
retriever = InMemoryEmbeddingRetriever(document_store=document_store)

# Query (requires query embedding)
from haystack.components.embedders import SentenceTransformersTextEmbedder

query_embedder = SentenceTransformersTextEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2"
)
query_embedder.warm_up()
query_embedding = query_embedder.run("How do I use useState?")

results = retriever.run(
    query_embedding=query_embedding["embedding"],
    top_k=3,
)
```
### Building Complete RAG Pipeline
For question answering with LLMs:
```python
from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
# Create RAG pipeline
rag_pipeline = Pipeline()

# Add components (reusing the BM25 retriever from the Quick Start)
rag_pipeline.add_component("retriever", retriever)
rag_pipeline.add_component("prompt_builder", PromptBuilder(
    template="""
Based on the following context, answer the question.

Context:
{% for doc in documents %}
{{ doc.content }}
{% endfor %}

Question: {{ question }}

Answer:
"""
))
rag_pipeline.add_component("llm", OpenAIGenerator())  # reads OPENAI_API_KEY from the environment

# Connect components
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")

# Run pipeline
response = rag_pipeline.run({
    "retriever": {"query": "How do I use useState?"},
    "prompt_builder": {"question": "How do I use useState?"},
})
print(response["llm"]["replies"][0])
```
## Files in This Example
- `README.md` - This file
- `quickstart.py` - Basic BM25 retrieval pipeline
- `requirements.txt` - Python dependencies
## Troubleshooting
### Issue: ModuleNotFoundError: No module named 'haystack'
**Solution:** Install Haystack 2.x
```bash
pip install haystack-ai
```
### Issue: Documents not found
**Solution:** Run scraping first
```bash
skill-seekers scrape --config configs/react.json
skill-seekers package output/react --target haystack
```
### Issue: Poor retrieval quality
**Solution:** Use semantic chunking or vector embeddings
```bash
# Semantic chunking
skill-seekers scrape --config configs/react.json --chunk-for-rag
# Or use vector embeddings (see Advanced Usage)
```
## Next Steps
1. Try different documentation sources (Django, FastAPI, etc.)
2. Experiment with vector embeddings for semantic search
3. Build complete RAG pipeline with LLM generation
4. Deploy to production with persistent document stores
## Related Examples
- [LangChain RAG Pipeline](../langchain-rag-pipeline/)
- [LlamaIndex Query Engine](../llama-index-query-engine/)
- [Pinecone Vector Store](../pinecone-upsert/)
## Resources
- [Haystack Documentation](https://docs.haystack.deepset.ai/)
- [Skill Seekers Documentation](https://github.com/yusufkaraaslan/Skill_Seekers)
- [Haystack Tutorials](https://haystack.deepset.ai/tutorials)

---

**File: `examples/haystack-pipeline/quickstart.py`** (new file, 128 lines)
#!/usr/bin/env python3
"""
Haystack Pipeline Example
Demonstrates how to use Skill Seekers documentation with Haystack 2.x
for building RAG pipelines.
"""
import json
import sys
from pathlib import Path
def main():
    """Run Haystack pipeline example."""
    print("=" * 60)
    print("Haystack Pipeline Example")
    print("=" * 60)

    # Check if Haystack is installed
    try:
        from haystack import Document
        from haystack.document_stores.in_memory import InMemoryDocumentStore
        from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
    except ImportError:
        print("❌ Error: Haystack not installed")
        print("   Install with: pip install haystack-ai")
        sys.exit(1)

    # Find the Haystack documents file
    docs_path = Path("../../output/react-haystack.json")
    if not docs_path.exists():
        print(f"❌ Error: Documents not found at {docs_path}")
        print("\n📝 Generate documents first:")
        print("   skill-seekers scrape --config configs/react.json --max-pages 100")
        print("   skill-seekers package output/react --target haystack")
        sys.exit(1)

    # Step 1: Load documents
    print("\n📚 Step 1: Loading documents...")
    with open(docs_path) as f:
        docs_data = json.load(f)
    documents = [
        Document(content=doc["content"], meta=doc["meta"]) for doc in docs_data
    ]
    print(f"✅ Loaded {len(documents)} documents")

    # Show document breakdown
    categories = {}
    for doc in documents:
        cat = doc.meta.get("category", "unknown")
        categories[cat] = categories.get(cat, 0) + 1
    print("\n📁 Categories:")
    for cat, count in sorted(categories.items()):
        print(f"   - {cat}: {count}")

    # Step 2: Create document store
    print("\n💾 Step 2: Creating document store...")
    document_store = InMemoryDocumentStore()
    document_store.write_documents(documents)
    indexed_count = document_store.count_documents()
    print(f"✅ Indexed {indexed_count} documents")

    # Step 3: Create retriever
    print("\n🔍 Step 3: Creating BM25 retriever...")
    retriever = InMemoryBM25Retriever(document_store=document_store)
    print("✅ Retriever ready")

    # Step 4: Query examples
    print("\n🎯 Step 4: Running queries...\n")
    queries = [
        "How do I use useState hook?",
        "What are React components?",
        "How to handle events in React?",
    ]
    for i, query in enumerate(queries, 1):
        print(f"\n{'=' * 60}")
        print(f"Query {i}: {query}")
        print("=" * 60)

        # Run query
        results = retriever.run(query=query, top_k=3)
        if not results["documents"]:
            print("   No results found")
            continue

        # Display results
        for j, doc in enumerate(results["documents"], 1):
            print(f"\n📖 Result {j}:")
            print(f"   Source: {doc.meta.get('file', 'unknown')}")
            print(f"   Category: {doc.meta.get('category', 'unknown')}")
            # Show preview (first 200 chars)
            preview = doc.content[:200].replace("\n", " ")
            print(f"   Preview: {preview}...")

    # Summary
    print("\n" + "=" * 60)
    print("✅ Example complete!")
    print("=" * 60)
    print("\n📊 Summary:")
    print(f"   • Documents loaded: {len(documents)}")
    print(f"   • Documents indexed: {indexed_count}")
    print(f"   • Queries executed: {len(queries)}")
    print("\n💡 Next steps:")
    print("   • Try different queries")
    print("   • Experiment with top_k parameter")
    print("   • Build RAG pipeline with LLM generation")
    print("   • Use vector embeddings for semantic search")


if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt:
        print("\n\n⚠️  Interrupted by user")
        sys.exit(0)
    except Exception as e:
        print(f"\n❌ Error: {e}")
        sys.exit(1)

---

**File: `examples/haystack-pipeline/requirements.txt`** (new file, 11 lines)
# Haystack Pipeline Example Requirements
# Haystack 2.x - RAG framework
haystack-ai>=2.0.0
# Optional: For vector embeddings
# sentence-transformers>=2.2.0
# Optional: For LLM generation
# openai>=1.0.0
# anthropic>=0.7.0