
Chroma Integration with Skill Seekers

Status: Production Ready · Difficulty: Beginner · Last Updated: February 7, 2026


The Problem

Building RAG applications with Chroma involves several challenges:

  1. Embedding Model Setup - Need to choose and configure embedding models (local vs API) manually
  2. Collection Management - Creating and managing collections with metadata requires boilerplate code
  3. Local-First Complexity - Setting up persistent storage and dealing with file paths

Example Pain Point:

# Manual embedding + collection setup for each framework
import chromadb
from chromadb.utils import embedding_functions

# Choose embedding function
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="sk-...",
    model_name="text-embedding-ada-002"
)

# Create client + collection
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.create_collection(
    name="react_docs",
    embedding_function=openai_ef,
    metadata={"description": "React documentation"}
)

# Manually parse and add documents...

The Solution

Skill Seekers automates Chroma integration with structured, production-ready data:

Benefits:

  • Auto-formatted documents with embeddings included
  • Consistent collection structure across all frameworks
  • Works with local models (Sentence Transformers) or API embeddings (OpenAI, Cohere)
  • Persistent storage with automatic path management
  • Metadata-rich for precise filtering

Result: 5-minute setup, production-ready local vector search with zero external dependencies.


Quick Start (5 Minutes)

Prerequisites

# Install Chroma
pip install chromadb>=0.4.22

# For local embeddings (optional, free)
pip install sentence-transformers

# For OpenAI embeddings (optional)
pip install openai

# Or with Skill Seekers
pip install skill-seekers[all-llms]

What you need:

  • Python 3.10+
  • No external services required (fully local!)
  • Optional: OpenAI API key for better embeddings

Generate Chroma-Ready Documents

# Step 1: Scrape documentation
skill-seekers scrape --config configs/react.json

# Step 2: Package for Chroma (creates LangChain format)
skill-seekers package output/react --target langchain

# Output: output/react-langchain.json (Chroma-compatible)
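The packaged file is a JSON array of LangChain-style documents. A minimal sketch of the assumed shape (field names inferred from the upload snippets in this guide; verify against your actual packaged output):

```python
import json

# Assumed shape of output/react-langchain.json: a JSON array of
# LangChain-style documents, each with "page_content" and "metadata".
sample = [
    {
        "page_content": "useState is a Hook that lets you add state...",
        "metadata": {
            "category": "hooks",
            "source": "https://react.dev/reference/react/useState",
            "file": "hooks/useState.md",
        },
    }
]

# Round-trip through JSON to confirm the structure collection.add() expects
documents = json.loads(json.dumps(sample))
assert all("page_content" in d and "metadata" in d for d in documents)
print(f"{len(documents)} document(s), fields: {sorted(documents[0].keys())}")
```

If your output has different metadata keys, adjust the `where` filters in the query examples accordingly.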

Upload to Chroma (Local)

import chromadb
import json

# Create persistent client (data saved to disk)
client = chromadb.PersistentClient(path="./chroma_db")

# Create collection with local embeddings (free!)
collection = client.get_or_create_collection(
    name="react_docs",
    metadata={"description": "React documentation from Skill Seekers"}
)

# Load documents
with open("output/react-langchain.json") as f:
    documents = json.load(f)

# Add to collection (Chroma generates embeddings automatically)
collection.add(
    documents=[doc["page_content"] for doc in documents],
    metadatas=[doc["metadata"] for doc in documents],
    ids=[f"doc_{i}" for i in range(len(documents))]
)

print(f"✅ Added {len(documents)} documents to Chroma")
print(f"Total in collection: {collection.count()}")

Query with Filters

# Semantic search with metadata filter
results = collection.query(
    query_texts=["How do I use React hooks?"],
    n_results=3,
    where={"category": "hooks"}  # Filter by category
)

for i, (doc, metadata) in enumerate(zip(results["documents"][0], results["metadatas"][0])):
    print(f"\n{i+1}. Category: {metadata['category']}")
    print(f"   Source: {metadata['source']}")
    print(f"   Content: {doc[:200]}...")

That's it! Chroma is now running locally with your documentation.


📖 Detailed Setup Guide

Step 1: Choose Storage Mode

Option A: Persistent (Recommended for Production)

import chromadb

# Data persists to disk
client = chromadb.PersistentClient(
    path="./chroma_db"  # Specify database directory
)

# Database files saved to ./chroma_db/
# Survives script restarts

Option B: In-Memory (Fast, for Development)

# Data lost when script ends
client = chromadb.Client()

# Fast, but temporary
# Perfect for experimentation

Option C: HTTP Client (Remote Chroma Server)

# Start Chroma server
chroma run --path ./chroma_db --port 8000
# Connect to remote server
client = chromadb.HttpClient(host="localhost", port=8000)

# Great for microservices architecture

Option D: Docker (Production)

# docker-compose.yml
version: '3'
services:
  chroma:
    image: ghcr.io/chroma-core/chroma:latest
    volumes:
      - ./chroma-data:/chroma/chroma
    ports:
      - "8000:8000"
    environment:
      - ANONYMIZED_TELEMETRY=False

# Start Chroma
docker-compose up -d

Step 2: Generate Skill Seekers Documents

Option A: Documentation Website

skill-seekers scrape --config configs/django.json
skill-seekers package output/django --target langchain

Option B: GitHub Repository

skill-seekers github --repo django/django --name django
skill-seekers package output/django --target langchain

Option C: Local Codebase

skill-seekers analyze --directory /path/to/repo
skill-seekers package output/codebase --target langchain

Option D: RAG-Optimized Chunking

skill-seekers scrape --config configs/fastapi.json --chunk-for-rag --chunk-tokens 512
skill-seekers package output/fastapi --target langchain
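The --chunk-for-rag / --chunk-tokens flags split pages into retrieval-sized pieces before packaging. A rough sketch of the idea (whitespace tokens as an approximation; the CLI's actual tokenizer and overlap handling may differ):

```python
def chunk_tokens(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into overlapping windows of ~chunk_size whitespace tokens.

    Illustrative approximation of --chunk-tokens: the real CLI likely uses a
    proper tokenizer, but the sliding-window idea is the same.
    """
    tokens = text.split()
    if len(tokens) <= chunk_size:
        return [text] if tokens else []
    step = chunk_size - overlap  # advance by chunk minus overlap each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks

doc = " ".join(f"tok{i}" for i in range(1000))
chunks = chunk_tokens(doc, chunk_size=512, overlap=64)
print(len(chunks))  # 1000 tokens -> 3 overlapping chunks
```

The overlap keeps sentences that straddle a chunk boundary retrievable from both sides.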

Step 3: Choose Embedding Function

Option A: Default (Sentence Transformers - Free)

# Chroma uses all-MiniLM-L6-v2 by default
collection = client.get_or_create_collection(name="docs")

# Automatically downloads model on first use (~90MB)
# Dimensions: 384
# Speed: ~500 docs/sec on CPU
# Quality: Good for most use cases

Option B: OpenAI (Best Quality)

from chromadb.utils import embedding_functions

openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="sk-...",
    model_name="text-embedding-ada-002"
)

collection = client.get_or_create_collection(
    name="docs",
    embedding_function=openai_ef
)

# Cost: ~$0.0001 per 1K tokens
# Dimensions: 1536
# Quality: Excellent

Option C: Local Sentence Transformers (Customizable)

from chromadb.utils import embedding_functions

sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-mpnet-base-v2"  # Better quality than default
)

collection = client.get_or_create_collection(
    name="docs",
    embedding_function=sentence_transformer_ef
)

# Free, local, customizable
# Dimensions: 768 (all-mpnet-base-v2)
# Quality: Better than default

Option D: Cohere

cohere_ef = embedding_functions.CohereEmbeddingFunction(
    api_key="your-cohere-key",
    model_name="embed-english-v3.0"
)

collection = client.get_or_create_collection(
    name="docs",
    embedding_function=cohere_ef
)

Step 4: Add Documents with Metadata

import json

# Load Skill Seekers documents
with open("output/django-langchain.json") as f:
    documents = json.load(f)

# Prepare for Chroma
docs_content = []
docs_metadata = []
docs_ids = []

for i, doc in enumerate(documents):
    docs_content.append(doc["page_content"])
    docs_metadata.append(doc["metadata"])
    docs_ids.append(f"doc_{i}")

# Add to collection (batch operation)
collection.add(
    documents=docs_content,
    metadatas=docs_metadata,
    ids=docs_ids
)

print(f"✅ Added {len(documents)} documents")
print(f"Collection size: {collection.count()}")
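Index-based ids like doc_{i} are simple, but they shift if the source list is reordered or re-scraped, which makes later upserts non-idempotent. One option (an illustration, not part of the Skill Seekers output) is to derive a stable id from each document's source and content:

```python
import hashlib

def stable_id(doc: dict) -> str:
    """Deterministic id from source + content, so re-ingesting the same
    document upserts in place instead of duplicating under a new id."""
    key = f"{doc['metadata'].get('source', '')}\x00{doc['page_content']}"
    return "doc_" + hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]

doc = {"page_content": "Models define your data schema.",
       "metadata": {"source": "https://docs.djangoproject.com/models"}}

# Same input always yields the same id; changed content yields a new one
assert stable_id(doc) == stable_id(doc)
changed = {**doc, "page_content": "Models define your data schema. (updated)"}
assert stable_id(changed) != stable_id(doc)
print(stable_id(doc))
```

Pair this with collection.upsert() to make repeated ingestion runs idempotent.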

Step 5: Query with Advanced Filters

# Simple query
results = collection.query(
    query_texts=["How do I create models?"],
    n_results=5
)

# With metadata filter
results = collection.query(
    query_texts=["Django authentication"],
    n_results=3,
    where={"category": "authentication"}
)

# Multiple filters (AND logic)
results = collection.query(
    query_texts=["user registration"],
    n_results=3,
    where={
        "$and": [
            {"category": "authentication"},
            {"type": "tutorial"}
        ]
    }
)

# Filter with OR
results = collection.query(
    query_texts=["components"],
    n_results=5,
    where={
        "$or": [
            {"category": "components"},
            {"category": "hooks"}
        ]
    }
)

# Filter with IN
results = collection.query(
    query_texts=["data handling"],
    n_results=5,
    where={"category": {"$in": ["models", "views", "serializers"]}}
)

# Extract results
for doc, metadata, distance in zip(
    results["documents"][0],
    results["metadatas"][0],
    results["distances"][0]
):
    print(f"Distance: {distance:.3f}")
    print(f"Category: {metadata['category']}")
    print(f"Content: {doc[:200]}...")
    print()

🚀 Advanced Usage

1. Multiple Collections for Different Frameworks

# Create separate collections
frameworks = ["react", "vue", "angular", "svelte"]

for framework in frameworks:
    collection = client.get_or_create_collection(
        name=f"{framework}_docs",
        metadata={
            "framework": framework,
            "version": "latest",
            "last_updated": "2026-02-07"
        }
    )

    # Load framework-specific documents
    with open(f"output/{framework}-langchain.json") as f:
        docs = json.load(f)

    collection.add(
        documents=[d["page_content"] for d in docs],
        metadatas=[d["metadata"] for d in docs],
        ids=[f"doc_{i}" for i in range(len(docs))]
    )

# Query specific framework
react_collection = client.get_collection(name="react_docs")
results = react_collection.query(
    query_texts=["useState hook"],
    n_results=3
)

2. Update Documents Efficiently

# Update existing document (same ID)
collection.update(
    ids=["doc_42"],
    documents=["Updated content for React hooks..."],
    metadatas=[{"category": "hooks", "updated": "2026-02-07"}]
)

# Upsert (update or insert)
collection.upsert(
    ids=["doc_42"],
    documents=["New or updated content..."],
    metadatas=[{"category": "hooks"}]
)

# Delete specific documents
collection.delete(ids=["doc_42", "doc_99"])

# Delete by filter
collection.delete(where={"category": "deprecated"})

3. Pre-Compute Embeddings for Faster Ingestion

import openai

# Generate embeddings separately (the embeddings API accepts batched input)
openai_client = openai.OpenAI()
embeddings = []

batch_size = 100
for i in range(0, len(documents), batch_size):
    batch = documents[i:i + batch_size]
    response = openai_client.embeddings.create(
        model="text-embedding-ada-002",
        input=[d["page_content"] for d in batch]
    )
    embeddings.extend(item.embedding for item in response.data)

# Add with pre-computed embeddings (faster)
collection.add(
    documents=[d["page_content"] for d in documents],
    embeddings=embeddings,  # Skip embedding generation
    metadatas=[d["metadata"] for d in documents],
    ids=[f"doc_{i}" for i in range(len(documents))]
)

4. Hybrid Search (Vector + Keyword)

# Get all documents matching keyword filter
results = collection.query(
    query_texts=["state management"],
    n_results=100,  # Get many candidates
    where_document={"$contains": "useState"}  # Keyword filter
)

# Chroma re-ranks by semantic similarity
# Results contain "useState" AND are semantically similar to "state management"
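The where_document filter plus semantic ranking can be pictured as filter-then-rank. A toy sketch of that two-stage idea using cosine similarity (an illustration of the concept, not Chroma's actual internals):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy corpus: (text, pretend embedding)
corpus = [
    ("useState stores local state", [1.0, 0.1]),
    ("useEffect runs side effects", [0.1, 1.0]),
    ("useState with reducer patterns", [0.9, 0.3]),
    ("CSS styling tips", [0.0, 0.2]),
]
query_vec = [1.0, 0.0]  # pretend embedding of "state management"

# Stage 1: keyword filter (like where_document={"$contains": "useState"})
candidates = [(t, v) for t, v in corpus if "useState" in t]
# Stage 2: rank the survivors by semantic similarity to the query
ranked = sorted(candidates, key=lambda tv: cosine(query_vec, tv[1]), reverse=True)
print([t for t, _ in ranked])
```

Only documents passing the keyword gate are ranked, which is why the candidate pool (n_results) should be generous.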

5. Collection Management

# List all collections
collections = client.list_collections()
for collection in collections:
    print(f"{collection.name}: {collection.count()} documents")
    print(f"  Metadata: {collection.metadata}")

# Get collection info
collection = client.get_collection(name="react_docs")
print(f"Count: {collection.count()}")
print(f"Metadata: {collection.metadata}")

# Delete collection
client.delete_collection(name="old_docs")

# Rename collection (create new, copy data, delete old)
old = client.get_collection(name="react_docs")
new = client.create_collection(name="react_docs_v2")

# Copy all documents (embeddings are not returned by default — request them)
old_data = old.get(include=["documents", "metadatas", "embeddings"])
new.add(
    ids=old_data["ids"],
    documents=old_data["documents"],
    metadatas=old_data["metadatas"],
    embeddings=old_data["embeddings"]
)

client.delete_collection(name="react_docs")

📋 Best Practices

1. Use Persistent Storage for Production

# ✅ Good: Data persists
client = chromadb.PersistentClient(path="./chroma_db")

# ❌ Bad: Data lost on restart
client = chromadb.Client()

# Store DB in appropriate location
import os
db_path = os.path.expanduser("~/.local/share/my_app/chroma_db")
client = chromadb.PersistentClient(path=db_path)

2. Batch Operations for Large Datasets

# ✅ Good: Batch add (fast)
batch_size = 1000
for i in range(0, len(documents), batch_size):
    batch = documents[i:i + batch_size]
    collection.add(
        documents=[d["page_content"] for d in batch],
        metadatas=[d["metadata"] for d in batch],
        ids=[f"doc_{i+j}" for j in range(len(batch))]
    )
    print(f"Added {i + len(batch)}/{len(documents)}...")

# ❌ Bad: One at a time (slow)
for i, doc in enumerate(documents):
    collection.add(
        documents=[doc["page_content"]],
        metadatas=[doc["metadata"]],
        ids=[f"doc_{i}"]
    )

3. Choose Embedding Model Wisely

# For speed (local development):
# - Default Chroma (all-MiniLM-L6-v2): 384 dims, fast
collection = client.get_or_create_collection(name="docs")

# For quality (production):
# - OpenAI ada-002: 1536 dims, best quality
openai_ef = embedding_functions.OpenAIEmbeddingFunction(...)
collection = client.get_or_create_collection(name="docs", embedding_function=openai_ef)

# For balance (offline production):
# - all-mpnet-base-v2: 768 dims, good quality, free
mpnet_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-mpnet-base-v2"
)
collection = client.get_or_create_collection(name="docs", embedding_function=mpnet_ef)

4. Use Metadata Filters to Reduce Search Space

# ✅ Good: Filter then search (fast)
results = collection.query(
    query_texts=["authentication"],
    n_results=3,
    where={"category": "auth"}  # Only search auth docs
)

# ❌ Slow: Search everything, filter later
results = collection.query(
    query_texts=["authentication"],
    n_results=100
)
filtered = [
    (doc, meta)
    for doc, meta in zip(results["documents"][0], results["metadatas"][0])
    if meta["category"] == "auth"
]

5. Handle Updates with Upsert

# ✅ Good: Upsert (idempotent)
collection.upsert(
    ids=["doc_42"],
    documents=["Updated content..."],
    metadatas=[{"updated": "2026-02-07"}]
)

# ❌ Bad: Delete then add (race conditions)
try:
    collection.delete(ids=["doc_42"])
except:
    pass
collection.add(ids=["doc_42"], ...)

🔥 Real-World Example: Local RAG Chatbot

import chromadb
import json
from openai import OpenAI

class LocalRAGChatbot:
    def __init__(self, db_path: str = "./chroma_db"):
        """Initialize chatbot with local Chroma database."""
        self.client = chromadb.PersistentClient(path=db_path)
        self.openai = OpenAI()  # For chat completion only
        self.collection = None

    def ingest_framework(self, framework: str, docs_path: str):
        """Ingest documentation for a framework."""
        # Create or get collection
        self.collection = self.client.get_or_create_collection(
            name=f"{framework}_docs",
            metadata={"framework": framework}
        )

        # Load documents
        with open(docs_path) as f:
            documents = json.load(f)

        # Batch add (Chroma generates embeddings locally)
        batch_size = 1000
        for i in range(0, len(documents), batch_size):
            batch = documents[i:i + batch_size]

            self.collection.add(
                documents=[d["page_content"] for d in batch],
                metadatas=[d["metadata"] for d in batch],
                ids=[f"doc_{i+j}" for j in range(len(batch))]
            )

            if (i + batch_size) < len(documents):
                print(f"Ingested {i + batch_size}/{len(documents)}...")

        print(f"✅ Ingested {len(documents)} documents for {framework}")
        print(f"Collection size: {self.collection.count()}")

    def chat(self, question: str, category: str = None):
        """Answer question using RAG."""
        if not self.collection:
            raise ValueError("No framework ingested. Call ingest_framework() first.")

        # Retrieve relevant documents
        where_filter = {"category": category} if category else None

        results = self.collection.query(
            query_texts=[question],
            n_results=5,
            where=where_filter
        )

        # Build context from results
        context_parts = []
        for doc, metadata in zip(results["documents"][0], results["metadatas"][0]):
            context_parts.append(f"[{metadata['category']}] {doc}")

        context = "\n\n".join(context_parts)

        # Generate answer using GPT-4
        completion = self.openai.chat.completions.create(
            model="gpt-4",
            messages=[
                {
                    "role": "system",
                    "content": "You are a helpful assistant. Answer based on the provided documentation context."
                },
                {
                    "role": "user",
                    "content": f"Context:\n{context}\n\nQuestion: {question}"
                }
            ]
        )

        return {
            "answer": completion.choices[0].message.content,
            "sources": [
                {
                    "category": m["category"],
                    "source": m["source"],
                    "file": m["file"]
                }
                for m in results["metadatas"][0]
            ],
            "context_used": len(context)
        }

    def list_frameworks(self):
        """List all ingested frameworks."""
        collections = self.client.list_collections()
        return [
            {
                "name": c.name,
                "count": c.count(),
                "metadata": c.metadata
            }
            for c in collections
        ]

# Usage
chatbot = LocalRAGChatbot(db_path="./my_docs_db")

# Ingest multiple frameworks
chatbot.ingest_framework("react", "output/react-langchain.json")
chatbot.ingest_framework("django", "output/django-langchain.json")

# Interactive chat
frameworks = chatbot.list_frameworks()
print(f"Available frameworks: {[f['name'] for f in frameworks]}")

# Select framework
chatbot.collection = chatbot.client.get_collection("react_docs")

# Ask questions
questions = [
    "How do I use useState?",
    "What is useEffect for?",
    "How do I handle form input?"
]

for question in questions:
    print(f"\nQ: {question}")
    result = chatbot.chat(question, category="hooks")
    print(f"A: {result['answer']}")
    print(f"Sources: {[s['file'] for s in result['sources'][:2]]}")
    print(f"Context size: {result['context_used']} chars")

Output:

✅ Ingested 1247 documents for react
Collection size: 1247
✅ Ingested 892 documents for django
Collection size: 892

Available frameworks: ['react_docs', 'django_docs']

Q: How do I use useState?
A: useState is a React Hook that lets you add state to functional components.
   Call it at the top level: const [count, setCount] = useState(0)
Sources: ['hooks/useState.md', 'hooks/overview.md']
Context size: 2340 chars

Q: What is useEffect for?
A: useEffect performs side effects in functional components, like fetching data,
   subscriptions, or DOM manipulation. It runs after render.
Sources: ['hooks/useEffect.md', 'hooks/rules.md']
Context size: 2156 chars

🐛 Troubleshooting

Issue: Model Download Stuck

Problem: "Downloading model..." hangs indefinitely

Solutions:

  1. Check internet connection:
curl -I https://huggingface.co
  2. Manually download model:
from sentence_transformers import SentenceTransformer

# Force download
model = SentenceTransformer('all-MiniLM-L6-v2')
print("Model downloaded!")
  3. Use pre-downloaded model:
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="/path/to/local/model"
)

Issue: Dimension Mismatch

Problem: "Dimensionality mismatch: expected 384, got 1536"

Solution: Collections remember their embedding function

# Delete and recreate with correct embedding function
client.delete_collection(name="docs")

openai_ef = embedding_functions.OpenAIEmbeddingFunction(...)
collection = client.create_collection(
    name="docs",
    embedding_function=openai_ef  # 1536 dims
)

Issue: Slow Queries

Problem: Queries take >1 second on 10K documents

Solutions:

  1. Use smaller n_results:
# ✅ Fast: Get only what you need
results = collection.query(query_texts=["..."], n_results=5)

# ❌ Slow: Large result sets
results = collection.query(query_texts=["..."], n_results=100)
  2. Filter with metadata:
# ✅ Fast: Reduce search space
results = collection.query(
    query_texts=["..."],
    n_results=5,
    where={"category": "specific"}  # Only search subset
)
  3. Use HttpClient for parallelism:
# Start Chroma server
chroma run --path ./chroma_db
# Connect multiple clients
client = chromadb.HttpClient(host="localhost", port=8000)

Issue: Database Locked

Problem: "Database is locked" error

Solutions:

  1. Check for other processes:
lsof ./chroma_db/chroma.sqlite3
# Kill any hung processes
  2. Use HttpClient instead:
chroma run --path ./chroma_db --port 8000
client = chromadb.HttpClient(host="localhost", port=8000)
  3. Enable WAL mode (Write-Ahead Logging):
import sqlite3
conn = sqlite3.connect("./chroma_db/chroma.sqlite3")
conn.execute("PRAGMA journal_mode=WAL")
conn.close()

Issue: Collection Not Found

Problem: "Collection 'docs' does not exist"

Solutions:

  1. List existing collections:
collections = client.list_collections()
print([c.name for c in collections])
  2. Use get_or_create:
# ✅ Safe: Creates if missing
collection = client.get_or_create_collection(name="docs")

# ❌ Fails if missing
collection = client.get_collection(name="docs")

Issue: Out of Memory

Problem: Python crashes when adding large dataset

Solutions:

  1. Batch with smaller size:
batch_size = 500  # Reduce from 1000
for i in range(0, len(documents), batch_size):
    batch = documents[i:i + batch_size]
    collection.add(...)
  2. Use HttpClient + server:
# Server handles memory better
chroma run --path ./chroma_db
  3. Pre-compute embeddings externally:
# Generate embeddings in separate script
# Then add with embeddings parameter
collection.add(
    documents=[...],
    embeddings=precomputed_embeddings,
    ...
)

📊 Before vs. After

Aspect           | Without Skill Seekers             | With Skill Seekers
Data Preparation | Custom scraping + parsing logic   | One command: skill-seekers scrape
Embedding Setup  | Manual model selection and config | Auto-configured with sensible defaults
Metadata         | Manual extraction from docs       | Auto-extracted (category, source, file, type)
Storage          | Complex path management           | Simple: PersistentClient(path="...")
Local-First      | Requires external services        | Fully local with Sentence Transformers
Setup Time       | 2-4 hours                         | 5 minutes
Code Required    | 300+ lines scraping logic         | 20 lines upload script
External Deps    | OpenAI API required               | Optional (works offline!)

🎯 Next Steps

Enhance Your Chroma Integration

  1. Try Different Embedding Models:

    # Better quality (still local)
    ef = embedding_functions.SentenceTransformerEmbeddingFunction(
        model_name="all-mpnet-base-v2"
    )
    
  2. Implement Semantic Chunking:

    skill-seekers scrape --config configs/fastapi.json --chunk-for-rag --chunk-tokens 512
    
  3. Set Up Multi-Collection Search:

    # Search across multiple frameworks
    for name in ["react_docs", "vue_docs", "angular_docs"]:
        collection = client.get_collection(name)
        results = collection.query(...)
    
  4. Deploy with Docker:

    docker run -p 8000:8000 -v ./chroma-data:/chroma/chroma ghcr.io/chroma-core/chroma:latest
    

Resources


Questions? Open an issue: https://github.com/yusufkaraaslan/Skill_Seekers/issues
Website: https://skillseekersweb.com/
Last Updated: February 7, 2026