feat: Add Haystack RAG framework adaptor (Task 2.2)

Implements complete Haystack 2.x integration for RAG pipelines:

**Haystack Adaptor (src/skill_seekers/cli/adaptors/haystack.py):**
- Document format: {content: str, meta: dict}
- JSON packaging for Haystack pipelines
- Compatible with InMemoryDocumentStore, BM25Retriever
- Registered in adaptor factory as 'haystack'
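The document format above can be pictured with a minimal sketch (the field values here are illustrative, not taken from actual adaptor output):

```python
import json

# Illustrative records in the adaptor's {content: str, meta: dict} format
sample = [
    {
        "content": "# React Hooks\nuseState adds state to function components.",
        "meta": {"file": "hooks.md", "category": "hooks"},
    },
]

# Round-trip through JSON, the way the packaged file would be consumed
docs = json.loads(json.dumps(sample))
assert all({"content", "meta"} <= set(rec) for rec in docs)
print(f"{len(docs)} record(s) in Haystack document format")
```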

**Example Pipeline (examples/haystack-pipeline/):**
- README.md with comprehensive guide and troubleshooting
- quickstart.py demonstrating BM25 retrieval
- requirements.txt (haystack-ai>=2.0.0)
- Shows document loading, indexing, and querying

**Tests (tests/test_adaptors/test_haystack_adaptor.py):**
- 11 tests covering all adaptor functionality
- Format validation, packaging, upload messages
- Edge cases: empty dirs, references-only skills
- All 93 adaptor tests passing (100% suite pass rate)

**Features:**
- No upload endpoint (local use only, like the LangChain/LlamaIndex adaptors)
- No AI enhancement (enhance before packaging)
- Same packaging pattern as other RAG frameworks
- InMemoryDocumentStore + BM25Retriever example

Test: pytest tests/test_adaptors/test_haystack_adaptor.py -v
Author: yusyus
Date: 2026-02-07 21:01:49 +03:00
Parent: 8b3f31409e
Commit: 1c888e7817
6 changed files with 910 additions and 0 deletions

---

**File: `examples/haystack-pipeline/README.md`** (new file, 278 lines)
# Haystack Pipeline Example
Complete example showing how to use Skill Seekers with Haystack 2.x for building RAG pipelines.
## What This Example Does
- ✅ Converts documentation into Haystack Documents
- ✅ Creates an in-memory document store
- ✅ Builds a BM25 retriever for keyword-based search
- ✅ Shows complete RAG pipeline workflow
## Prerequisites
```bash
# Install Skill Seekers
pip install skill-seekers
# Install Haystack 2.x
pip install haystack-ai
```
## Quick Start
### 1. Generate React Documentation Skill
```bash
# Scrape React documentation
skill-seekers scrape --config configs/react.json --max-pages 100
# Package for Haystack
skill-seekers package output/react --target haystack
```
This creates `output/react-haystack.json` with Haystack Documents.
### 2. Run the Pipeline
```bash
# Run the example script
python quickstart.py
```
## How It Works
### Step 1: Load Documents
```python
from haystack import Document
import json
# Load Haystack documents
with open("../../output/react-haystack.json") as f:
    docs_data = json.load(f)

documents = [
    Document(content=doc["content"], meta=doc["meta"])
    for doc in docs_data
]
print(f"📚 Loaded {len(documents)} documents")
```
### Step 2: Create Document Store
```python
from haystack.document_stores.in_memory import InMemoryDocumentStore
# Create in-memory store
document_store = InMemoryDocumentStore()
document_store.write_documents(documents)
print(f"💾 Indexed {document_store.count_documents()} documents")
```
### Step 3: Build Retriever
```python
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
# Create BM25 retriever
retriever = InMemoryBM25Retriever(document_store=document_store)
# Query
results = retriever.run(
    query="How do I use useState hook?",
    top_k=3,
)

# Display results
for doc in results["documents"]:
    print(f"\n📖 Source: {doc.meta.get('file', 'unknown')}")
    print(f"   Category: {doc.meta.get('category', 'unknown')}")
    print(f"   Preview: {doc.content[:200]}...")
```
## Expected Output
```
📚 Loaded 15 documents
💾 Indexed 15 documents
🔍 Query: How do I use useState hook?
📖 Source: hooks.md
Category: hooks
Preview: # React Hooks
React Hooks are functions that let you "hook into" React state and lifecycle features from function components.
## useState
The useState Hook lets you add React state to function components...
📖 Source: getting_started.md
Category: getting started
Preview: # Getting Started with React
React is a JavaScript library for building user interfaces...
📖 Source: best_practices.md
Category: best practices
Preview: # React Best Practices
When working with Hooks...
```
## Advanced Usage
### With RAG Chunking
For better retrieval quality, use semantic chunking:
```bash
# Generate with chunking
skill-seekers scrape --config configs/react.json --max-pages 100 --chunk-for-rag --chunk-size 512 --chunk-overlap 50
# Use chunked output
python quickstart.py --chunked
```
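The `--chunk-size`/`--chunk-overlap` parameters can be pictured with a naive sliding-window chunker (illustrative only; the CLI's `--chunk-for-rag` uses semantic chunking, not a fixed window):

```python
def chunk_text(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    """Naive fixed-window chunker: each chunk shares `overlap` chars with the next."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

# A 1000-char document with size=512/overlap=50 yields three overlapping chunks
chunks = chunk_text("x" * 1000, size=512, overlap=50)
print(len(chunks), [len(c) for c in chunks])  # → 3 [512, 512, 76]
```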
### With Vector Embeddings
For semantic search instead of BM25:
```python
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
# Create document store with embeddings
document_store = InMemoryDocumentStore()
# Embed documents
embedder = SentenceTransformersDocumentEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2"
)
embedder.warm_up()

# Process documents
docs_with_embeddings = embedder.run(documents)
document_store.write_documents(docs_with_embeddings["documents"])

# Create embedding retriever
retriever = InMemoryEmbeddingRetriever(document_store=document_store)

# Query (requires query embedding)
from haystack.components.embedders import SentenceTransformersTextEmbedder

query_embedder = SentenceTransformersTextEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2"
)
query_embedder.warm_up()
query_embedding = query_embedder.run("How do I use useState?")

results = retriever.run(
    query_embedding=query_embedding["embedding"],
    top_k=3,
)
```
### Building Complete RAG Pipeline
For question answering with LLMs:
```python
from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
# Create RAG pipeline
rag_pipeline = Pipeline()

# Add components (reusing the BM25 retriever from the Quick Start)
rag_pipeline.add_component("retriever", retriever)
rag_pipeline.add_component("prompt_builder", PromptBuilder(
    template="""
Based on the following context, answer the question.

Context:
{% for doc in documents %}
{{ doc.content }}
{% endfor %}

Question: {{ question }}

Answer:
"""
))
rag_pipeline.add_component("llm", OpenAIGenerator())  # reads OPENAI_API_KEY from the environment

# Connect components
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")

# Run pipeline
response = rag_pipeline.run({
    "retriever": {"query": "How do I use useState?"},
    "prompt_builder": {"question": "How do I use useState?"},
})
print(response["llm"]["replies"][0])
```
## Files in This Example
- `README.md` - This file
- `quickstart.py` - Basic BM25 retrieval pipeline
- `requirements.txt` - Python dependencies
## Troubleshooting
### Issue: ModuleNotFoundError: No module named 'haystack'
**Solution:** Install Haystack 2.x
```bash
pip install haystack-ai
```
### Issue: Documents not found
**Solution:** Run scraping first
```bash
skill-seekers scrape --config configs/react.json
skill-seekers package output/react --target haystack
```
### Issue: Poor retrieval quality
**Solution:** Use semantic chunking or vector embeddings
```bash
# Semantic chunking
skill-seekers scrape --config configs/react.json --chunk-for-rag
# Or use vector embeddings (see Advanced Usage)
```
## Next Steps
1. Try different documentation sources (Django, FastAPI, etc.)
2. Experiment with vector embeddings for semantic search
3. Build complete RAG pipeline with LLM generation
4. Deploy to production with persistent document stores
## Related Examples
- [LangChain RAG Pipeline](../langchain-rag-pipeline/)
- [LlamaIndex Query Engine](../llama-index-query-engine/)
- [Pinecone Vector Store](../pinecone-upsert/)
## Resources
- [Haystack Documentation](https://docs.haystack.deepset.ai/)
- [Skill Seekers Documentation](https://github.com/yusufkaraaslan/Skill_Seekers)
- [Haystack Tutorials](https://haystack.deepset.ai/tutorials)

---

**File: `examples/haystack-pipeline/quickstart.py`** (new file, 128 lines)
#!/usr/bin/env python3
"""
Haystack Pipeline Example
Demonstrates how to use Skill Seekers documentation with Haystack 2.x
for building RAG pipelines.
"""
import json
import sys
from pathlib import Path
def main():
    """Run Haystack pipeline example."""
    print("=" * 60)
    print("Haystack Pipeline Example")
    print("=" * 60)

    # Check if Haystack is installed
    try:
        from haystack import Document
        from haystack.document_stores.in_memory import InMemoryDocumentStore
        from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
    except ImportError:
        print("❌ Error: Haystack not installed")
        print("   Install with: pip install haystack-ai")
        sys.exit(1)

    # Find the Haystack documents file
    docs_path = Path("../../output/react-haystack.json")
    if not docs_path.exists():
        print(f"❌ Error: Documents not found at {docs_path}")
        print("\n📝 Generate documents first:")
        print("   skill-seekers scrape --config configs/react.json --max-pages 100")
        print("   skill-seekers package output/react --target haystack")
        sys.exit(1)

    # Step 1: Load documents
    print("\n📚 Step 1: Loading documents...")
    with open(docs_path) as f:
        docs_data = json.load(f)
    documents = [
        Document(content=doc["content"], meta=doc["meta"]) for doc in docs_data
    ]
    print(f"✅ Loaded {len(documents)} documents")

    # Show document breakdown
    categories = {}
    for doc in documents:
        cat = doc.meta.get("category", "unknown")
        categories[cat] = categories.get(cat, 0) + 1
    print("\n📁 Categories:")
    for cat, count in sorted(categories.items()):
        print(f"   - {cat}: {count}")

    # Step 2: Create document store
    print("\n💾 Step 2: Creating document store...")
    document_store = InMemoryDocumentStore()
    document_store.write_documents(documents)
    indexed_count = document_store.count_documents()
    print(f"✅ Indexed {indexed_count} documents")

    # Step 3: Create retriever
    print("\n🔍 Step 3: Creating BM25 retriever...")
    retriever = InMemoryBM25Retriever(document_store=document_store)
    print("✅ Retriever ready")

    # Step 4: Query examples
    print("\n🎯 Step 4: Running queries...\n")
    queries = [
        "How do I use useState hook?",
        "What are React components?",
        "How to handle events in React?",
    ]
    for i, query in enumerate(queries, 1):
        print(f"\n{'=' * 60}")
        print(f"Query {i}: {query}")
        print("=" * 60)

        # Run query
        results = retriever.run(query=query, top_k=3)
        if not results["documents"]:
            print("   No results found")
            continue

        # Display results
        for j, doc in enumerate(results["documents"], 1):
            print(f"\n📖 Result {j}:")
            print(f"   Source: {doc.meta.get('file', 'unknown')}")
            print(f"   Category: {doc.meta.get('category', 'unknown')}")
            # Show preview (first 200 chars)
            preview = doc.content[:200].replace("\n", " ")
            print(f"   Preview: {preview}...")

    # Summary
    print("\n" + "=" * 60)
    print("✅ Example complete!")
    print("=" * 60)
    print("\n📊 Summary:")
    print(f"   • Documents loaded: {len(documents)}")
    print(f"   • Documents indexed: {indexed_count}")
    print(f"   • Queries executed: {len(queries)}")
    print("\n💡 Next steps:")
    print("   • Try different queries")
    print("   • Experiment with top_k parameter")
    print("   • Build RAG pipeline with LLM generation")
    print("   • Use vector embeddings for semantic search")


if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt:
        print("\n\n⚠️  Interrupted by user")
        sys.exit(0)
    except Exception as e:
        print(f"\n❌ Error: {e}")
        sys.exit(1)

---

**File: `examples/haystack-pipeline/requirements.txt`** (new file, 11 lines)
# Haystack Pipeline Example Requirements
# Haystack 2.x - RAG framework
haystack-ai>=2.0.0
# Optional: For vector embeddings
# sentence-transformers>=2.2.0
# Optional: For LLM generation
# openai>=1.0.0
# anthropic>=0.7.0