# Chroma Integration with Skill Seekers
**Status:** ✅ Production Ready
**Difficulty:** Beginner
**Last Updated:** February 7, 2026
---
## ❌ The Problem
Building RAG applications with Chroma involves several challenges:
1. **Embedding Model Setup** - Choosing and configuring embedding models (local vs. API) by hand
2. **Collection Management** - Writing boilerplate to create and manage collections with metadata
3. **Local-First Complexity** - Setting up persistent storage and managing file paths yourself
**Example Pain Point:**
```python
# Manual embedding + collection setup for each framework
import chromadb
from chromadb.utils import embedding_functions
# Choose embedding function
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="sk-...",
    model_name="text-embedding-ada-002"
)

# Create client + collection
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.create_collection(
    name="react_docs",
    embedding_function=openai_ef,
    metadata={"description": "React documentation"}
)
# Manually parse and add documents...
```
---
## ✅ The Solution
Skill Seekers automates Chroma integration with structured, production-ready data:
**Benefits:**
- ✅ Auto-formatted documents with embeddings included
- ✅ Consistent collection structure across all frameworks
- ✅ Works with local models (Sentence Transformers) or API embeddings (OpenAI, Cohere)
- ✅ Persistent storage with automatic path management
- ✅ Metadata-rich for precise filtering
**Result:** 5-minute setup, production-ready local vector search with zero external dependencies.
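Each entry in the packaged `*-langchain.json` file is a LangChain-style document: a `page_content` string plus a `metadata` dict. The exact metadata keys depend on what was scraped; `category`, `source`, `file`, and `type` are the keys the examples in this guide rely on, so a representative entry looks roughly like this:
```json
[
  {
    "page_content": "useState is a Hook that lets you add state to function components...",
    "metadata": {
      "category": "hooks",
      "source": "https://react.dev/reference/react/useState",
      "file": "hooks/useState.md",
      "type": "reference"
    }
  }
]
```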
---
## ⚡ Quick Start (5 Minutes)
### Prerequisites
```bash
# Install Chroma
pip install "chromadb>=0.4.22"  # Quote the spec so the shell doesn't treat >= as redirection
# For local embeddings (optional, free)
pip install sentence-transformers
# For OpenAI embeddings (optional)
pip install openai
# Or with Skill Seekers
pip install skill-seekers[all-llms]
```
**What you need:**
- Python 3.10+
- No external services required (fully local!)
- Optional: OpenAI API key for better embeddings
### Generate Chroma-Ready Documents
```bash
# Step 1: Scrape documentation
skill-seekers scrape --config configs/react.json
# Step 2: Package for Chroma (creates LangChain format)
skill-seekers package output/react --target langchain
# Output: output/react-langchain.json (Chroma-compatible)
```
### Upload to Chroma (Local)
```python
import chromadb
import json
# Create persistent client (data saved to disk)
client = chromadb.PersistentClient(path="./chroma_db")
# Create collection with local embeddings (free!)
collection = client.get_or_create_collection(
    name="react_docs",
    metadata={"description": "React documentation from Skill Seekers"}
)

# Load documents
with open("output/react-langchain.json") as f:
    documents = json.load(f)

# Add to collection (Chroma generates embeddings automatically)
collection.add(
    documents=[doc["page_content"] for doc in documents],
    metadatas=[doc["metadata"] for doc in documents],
    ids=[f"doc_{i}" for i in range(len(documents))]
)
print(f"✅ Added {len(documents)} documents to Chroma")
print(f"Total in collection: {collection.count()}")
```
### Query with Filters
```python
# Semantic search with metadata filter
results = collection.query(
    query_texts=["How do I use React hooks?"],
    n_results=3,
    where={"category": "hooks"}  # Filter by category
)

for i, (doc, metadata) in enumerate(zip(results["documents"][0], results["metadatas"][0])):
    print(f"\n{i+1}. Category: {metadata['category']}")
    print(f"   Source: {metadata['source']}")
    print(f"   Content: {doc[:200]}...")
```
**That's it!** Chroma is now running locally with your documentation.
---
## 📖 Detailed Setup Guide
### Step 1: Choose Storage Mode
**Option A: Persistent (Recommended for Production)**
```python
import chromadb
# Data persists to disk
client = chromadb.PersistentClient(
    path="./chroma_db"  # Specify database directory
)
# Database files saved to ./chroma_db/
# Survives script restarts
```
**Option B: In-Memory (Fast, for Development)**
```python
# Data lost when script ends
client = chromadb.Client()
# Fast, but temporary
# Perfect for experimentation
```
**Option C: HTTP Client (Remote Chroma Server)**
```bash
# Start Chroma server
chroma run --path ./chroma_db --port 8000
```
```python
# Connect to remote server
client = chromadb.HttpClient(host="localhost", port=8000)
# Great for microservices architecture
```
**Option D: Docker (Production)**
```yaml
# docker-compose.yml
version: '3'
services:
  chroma:
    image: ghcr.io/chroma-core/chroma:latest
    volumes:
      - ./chroma-data:/chroma/chroma
    ports:
      - "8000:8000"
    environment:
      - ANONYMIZED_TELEMETRY=False
```
```bash
# Start Chroma
docker-compose up -d
```
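Once the container is up, a quick connectivity check from Python confirms the server is reachable (the client's `heartbeat()` call returns a timestamp when the server is healthy):
```python
import chromadb

client = chromadb.HttpClient(host="localhost", port=8000)
print(client.heartbeat())  # Nanosecond timestamp; raises if the server is unreachable
```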
### Step 2: Generate Skill Seekers Documents
**Option A: Documentation Website**
```bash
skill-seekers scrape --config configs/django.json
skill-seekers package output/django --target langchain
```
**Option B: GitHub Repository**
```bash
skill-seekers github --repo django/django --name django
skill-seekers package output/django --target langchain
```
**Option C: Local Codebase**
```bash
skill-seekers analyze --directory /path/to/repo
skill-seekers package output/codebase --target langchain
```
**Option D: RAG-Optimized Chunking**
```bash
skill-seekers scrape --config configs/fastapi.json --chunk-for-rag --chunk-tokens 512
skill-seekers package output/fastapi --target langchain
```
### Step 3: Choose Embedding Function
**Option A: Default (Sentence Transformers - Free)**
```python
# Chroma uses all-MiniLM-L6-v2 by default
collection = client.get_or_create_collection(name="docs")
# Automatically downloads model on first use (~90MB)
# Dimensions: 384
# Speed: ~500 docs/sec on CPU
# Quality: Good for most use cases
```
**Option B: OpenAI (Best Quality)**
```python
from chromadb.utils import embedding_functions
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="sk-...",
    model_name="text-embedding-ada-002"
)
collection = client.get_or_create_collection(
    name="docs",
    embedding_function=openai_ef
)
# Cost: ~$0.0001 per 1K tokens
# Dimensions: 1536
# Quality: Excellent
```
**Option C: Local Sentence Transformers (Customizable)**
```python
from chromadb.utils import embedding_functions
sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-mpnet-base-v2"  # Better quality than default
)
collection = client.get_or_create_collection(
    name="docs",
    embedding_function=sentence_transformer_ef
)
# Free, local, customizable
# Dimensions: 768 (all-mpnet-base-v2)
# Quality: Better than default
```
**Option D: Cohere**
```python
from chromadb.utils import embedding_functions

cohere_ef = embedding_functions.CohereEmbeddingFunction(
    api_key="your-cohere-key",
    model_name="embed-english-v3.0"
)
collection = client.get_or_create_collection(
    name="docs",
    embedding_function=cohere_ef
)
```
### Step 4: Add Documents with Metadata
```python
import json
# Load Skill Seekers documents
with open("output/django-langchain.json") as f:
documents = json.load(f)
# Prepare for Chroma
docs_content = []
docs_metadata = []
docs_ids = []
for i, doc in enumerate(documents):
docs_content.append(doc["page_content"])
docs_metadata.append(doc["metadata"])
docs_ids.append(f"doc_{i}")
# Add to collection (batch operation)
collection.add(
documents=docs_content,
metadatas=docs_metadata,
ids=docs_ids
)
print(f"✅ Added {len(documents)} documents")
print(f"Collection size: {collection.count()}")
```
### Step 5: Query with Advanced Filters
```python
# Simple query
results = collection.query(
    query_texts=["How do I create models?"],
    n_results=5
)

# With metadata filter
results = collection.query(
    query_texts=["Django authentication"],
    n_results=3,
    where={"category": "authentication"}
)

# Multiple filters (AND logic)
results = collection.query(
    query_texts=["user registration"],
    n_results=3,
    where={
        "$and": [
            {"category": "authentication"},
            {"type": "tutorial"}
        ]
    }
)

# Filter with OR
results = collection.query(
    query_texts=["components"],
    n_results=5,
    where={
        "$or": [
            {"category": "components"},
            {"category": "hooks"}
        ]
    }
)

# Filter with IN
results = collection.query(
    query_texts=["data handling"],
    n_results=5,
    where={"category": {"$in": ["models", "views", "serializers"]}}
)

# Extract results
for doc, metadata, distance in zip(
    results["documents"][0],
    results["metadatas"][0],
    results["distances"][0]
):
    print(f"Distance: {distance:.3f}")
    print(f"Category: {metadata['category']}")
    print(f"Content: {doc[:200]}...")
    print()
```
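Chroma returns query results as parallel lists keyed by field (`ids`, `documents`, `metadatas`, `distances`) rather than one object per hit. If the repeated `zip(...)` gets awkward downstream, a small helper can fold them into per-hit records. This is a convenience sketch, not part of the Chroma API:
```python
def to_records(results):
    """Zip Chroma's parallel result lists into one dict per hit."""
    return [
        {"id": id_, "document": doc, "metadata": meta, "distance": dist}
        for id_, doc, meta, dist in zip(
            results["ids"][0],
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
        )
    ]

hits = to_records(results)
print(hits[0]["metadata"]["category"], f"{hits[0]['distance']:.3f}")
```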
---
## 🚀 Advanced Usage
### 1. Multiple Collections for Different Frameworks
```python
# Create separate collections
frameworks = ["react", "vue", "angular", "svelte"]
for framework in frameworks:
    collection = client.get_or_create_collection(
        name=f"{framework}_docs",
        metadata={
            "framework": framework,
            "version": "latest",
            "last_updated": "2026-02-07"
        }
    )
    # Load framework-specific documents
    with open(f"output/{framework}-langchain.json") as f:
        docs = json.load(f)
    collection.add(
        documents=[d["page_content"] for d in docs],
        metadatas=[d["metadata"] for d in docs],
        ids=[f"doc_{i}" for i in range(len(docs))]
    )

# Query specific framework
react_collection = client.get_collection(name="react_docs")
results = react_collection.query(
    query_texts=["useState hook"],
    n_results=3
)
```
### 2. Update Documents Efficiently
```python
# Update existing document (same ID)
collection.update(
    ids=["doc_42"],
    documents=["Updated content for React hooks..."],
    metadatas=[{"category": "hooks", "updated": "2026-02-07"}]
)

# Upsert (update or insert)
collection.upsert(
    ids=["doc_42"],
    documents=["New or updated content..."],
    metadatas=[{"category": "hooks"}]
)
# Delete specific documents
collection.delete(ids=["doc_42", "doc_99"])
# Delete by filter
collection.delete(where={"category": "deprecated"})
```
### 3. Pre-Compute Embeddings for Faster Ingestion
```python
import openai

# Generate embeddings separately
openai_client = openai.OpenAI()
embeddings = []
for doc in documents:
    response = openai_client.embeddings.create(
        model="text-embedding-ada-002",
        input=doc["page_content"]
    )
    embeddings.append(response.data[0].embedding)

# Add with pre-computed embeddings (faster)
collection.add(
    documents=[d["page_content"] for d in documents],
    embeddings=embeddings,  # Skip embedding generation
    metadatas=[d["metadata"] for d in documents],
    ids=[f"doc_{i}" for i in range(len(documents))]
)
```
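The loop above makes one API call per document. The OpenAI embeddings endpoint also accepts a list of inputs, so batching requests (sketched below with an assumed batch size of 100; check your account's per-request limits) cuts round trips considerably:
```python
# Batched variant: embed many documents per request
batch_size = 100
embeddings = []
for i in range(0, len(documents), batch_size):
    texts = [d["page_content"] for d in documents[i:i + batch_size]]
    response = openai_client.embeddings.create(
        model="text-embedding-ada-002",
        input=texts  # The endpoint accepts a list of strings
    )
    # Results come back in input order
    embeddings.extend(item.embedding for item in response.data)
```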
### 4. Hybrid Search (Vector + Keyword)
```python
# Get all documents matching keyword filter
results = collection.query(
    query_texts=["state management"],
    n_results=100,  # Get many candidates
    where_document={"$contains": "useState"}  # Keyword filter
)
# Chroma first restricts to documents containing "useState",
# then ranks those candidates by semantic similarity to "state management"
```
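Metadata (`where`) and full-text (`where_document`) filters can also be combined in a single query. A sketch assuming the same `category` metadata used in earlier examples:
```python
results = collection.query(
    query_texts=["state management"],
    n_results=5,
    where={"category": "hooks"},               # Metadata filter
    where_document={"$contains": "useState"}   # Keyword filter
)
```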
### 5. Collection Management
```python
# List all collections
collections = client.list_collections()
for collection in collections:
    print(f"{collection.name}: {collection.count()} documents")
    print(f"  Metadata: {collection.metadata}")
# Get collection info
collection = client.get_collection(name="react_docs")
print(f"Count: {collection.count()}")
print(f"Metadata: {collection.metadata}")
# Delete collection
client.delete_collection(name="old_docs")
# Rename collection (create new, copy data, delete old)
old = client.get_collection(name="react_docs")
new = client.create_collection(name="react_docs_v2")
# Copy all documents
old_data = old.get(include=["documents", "metadatas", "embeddings"])  # embeddings aren't returned by default
new.add(
    ids=old_data["ids"],
    documents=old_data["documents"],
    metadatas=old_data["metadatas"],
    embeddings=old_data["embeddings"]
)
client.delete_collection(name="react_docs")
```
---
## 📋 Best Practices
### 1. Use Persistent Storage for Production
```python
# ✅ Good: Data persists
client = chromadb.PersistentClient(path="./chroma_db")
# ❌ Bad: Data lost on restart
client = chromadb.Client()
# Store DB in appropriate location
import os
db_path = os.path.expanduser("~/.local/share/my_app/chroma_db")
client = chromadb.PersistentClient(path=db_path)
```
### 2. Batch Operations for Large Datasets
```python
# ✅ Good: Batch add (fast)
batch_size = 1000
for i in range(0, len(documents), batch_size):
    batch = documents[i:i + batch_size]
    collection.add(
        documents=[d["page_content"] for d in batch],
        metadatas=[d["metadata"] for d in batch],
        ids=[f"doc_{i+j}" for j in range(len(batch))]
    )
    print(f"Added {i + len(batch)}/{len(documents)}...")

# ❌ Bad: One at a time (slow)
for i, doc in enumerate(documents):
    collection.add(
        documents=[doc["page_content"]],
        metadatas=[doc["metadata"]],
        ids=[f"doc_{i}"]
    )
```
### 3. Choose Embedding Model Wisely
```python
# For speed (local development):
# - Default Chroma (all-MiniLM-L6-v2): 384 dims, fast
collection = client.get_or_create_collection(name="docs")
# For quality (production):
# - OpenAI ada-002: 1536 dims, best quality
openai_ef = embedding_functions.OpenAIEmbeddingFunction(...)
collection = client.get_or_create_collection(name="docs", embedding_function=openai_ef)
# For balance (offline production):
# - all-mpnet-base-v2: 768 dims, good quality, free
mpnet_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-mpnet-base-v2"
)
collection = client.get_or_create_collection(name="docs", embedding_function=mpnet_ef)
```
### 4. Use Metadata Filters to Reduce Search Space
```python
# ✅ Good: Filter then search (fast)
results = collection.query(
    query_texts=["authentication"],
    n_results=3,
    where={"category": "auth"}  # Only search auth docs
)

# ❌ Slow: Search everything, filter later
results = collection.query(
    query_texts=["authentication"],
    n_results=100
)
filtered = [
    (doc, meta)
    for doc, meta in zip(results["documents"][0], results["metadatas"][0])
    if meta["category"] == "auth"
]
```
### 5. Handle Updates with Upsert
```python
# ✅ Good: Upsert (idempotent)
collection.upsert(
    ids=["doc_42"],
    documents=["Updated content..."],
    metadatas=[{"updated": "2026-02-07"}]
)

# ❌ Bad: Delete then add (race conditions)
try:
    collection.delete(ids=["doc_42"])
except Exception:
    pass
collection.add(ids=["doc_42"], ...)
```
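Upserts are only idempotent when IDs stay stable between runs, and the positional `doc_{i}` IDs used throughout this guide change whenever document order changes. One option, an illustrative convention rather than anything Skill Seekers emits, is to derive each ID from the document's content:
```python
import hashlib

def stable_id(doc):
    """Deterministic ID from source path + content (hypothetical convention)."""
    key = f"{doc['metadata'].get('source', '')}:{doc['page_content']}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:32]

# Re-running ingestion now updates documents in place instead of duplicating them
collection.upsert(
    documents=[d["page_content"] for d in documents],
    metadatas=[d["metadata"] for d in documents],
    ids=[stable_id(d) for d in documents],
)
```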
---
## 🔥 Real-World Example: Local RAG Chatbot
```python
import chromadb
import json
from openai import OpenAI
class LocalRAGChatbot:
    def __init__(self, db_path: str = "./chroma_db"):
        """Initialize chatbot with local Chroma database."""
        self.client = chromadb.PersistentClient(path=db_path)
        self.openai = OpenAI()  # For chat completion only
        self.collection = None

    def ingest_framework(self, framework: str, docs_path: str):
        """Ingest documentation for a framework."""
        # Create or get collection
        self.collection = self.client.get_or_create_collection(
            name=f"{framework}_docs",
            metadata={"framework": framework}
        )
        # Load documents
        with open(docs_path) as f:
            documents = json.load(f)
        # Batch add (Chroma generates embeddings locally)
        batch_size = 1000
        for i in range(0, len(documents), batch_size):
            batch = documents[i:i + batch_size]
            self.collection.add(
                documents=[d["page_content"] for d in batch],
                metadatas=[d["metadata"] for d in batch],
                ids=[f"doc_{i+j}" for j in range(len(batch))]
            )
            if (i + batch_size) < len(documents):
                print(f"Ingested {i + batch_size}/{len(documents)}...")
        print(f"✅ Ingested {len(documents)} documents for {framework}")
        print(f"Collection size: {self.collection.count()}")

    def chat(self, question: str, category: str = None):
        """Answer question using RAG."""
        if not self.collection:
            raise ValueError("No framework ingested. Call ingest_framework() first.")
        # Retrieve relevant documents
        where_filter = {"category": category} if category else None
        results = self.collection.query(
            query_texts=[question],
            n_results=5,
            where=where_filter
        )
        # Build context from results
        context_parts = []
        for doc, metadata in zip(results["documents"][0], results["metadatas"][0]):
            context_parts.append(f"[{metadata['category']}] {doc}")
        context = "\n\n".join(context_parts)
        # Generate answer using GPT-4
        completion = self.openai.chat.completions.create(
            model="gpt-4",
            messages=[
                {
                    "role": "system",
                    "content": "You are a helpful assistant. Answer based on the provided documentation context."
                },
                {
                    "role": "user",
                    "content": f"Context:\n{context}\n\nQuestion: {question}"
                }
            ]
        )
        return {
            "answer": completion.choices[0].message.content,
            "sources": [
                {
                    "category": m["category"],
                    "source": m["source"],
                    "file": m["file"]
                }
                for m in results["metadatas"][0]
            ],
            "context_used": len(context)
        }

    def list_frameworks(self):
        """List all ingested frameworks."""
        collections = self.client.list_collections()
        return [
            {
                "name": c.name,
                "count": c.count(),
                "metadata": c.metadata
            }
            for c in collections
        ]


# Usage
chatbot = LocalRAGChatbot(db_path="./my_docs_db")

# Ingest multiple frameworks
chatbot.ingest_framework("react", "output/react-langchain.json")
chatbot.ingest_framework("django", "output/django-langchain.json")

# Interactive chat
frameworks = chatbot.list_frameworks()
print(f"Available frameworks: {[f['name'] for f in frameworks]}")

# Select framework
chatbot.collection = chatbot.client.get_collection("react_docs")

# Ask questions
questions = [
    "How do I use useState?",
    "What is useEffect for?",
    "How do I handle form input?"
]
for question in questions:
    print(f"\nQ: {question}")
    result = chatbot.chat(question, category="hooks")
    print(f"A: {result['answer']}")
    print(f"Sources: {[s['file'] for s in result['sources'][:2]]}")
    print(f"Context size: {result['context_used']} chars")
```
**Output:**
```
✅ Ingested 1247 documents for react
Collection size: 1247
✅ Ingested 892 documents for django
Collection size: 892
Available frameworks: ['react_docs', 'django_docs']
Q: How do I use useState?
A: useState is a React Hook that lets you add state to functional components.
Call it at the top level: const [count, setCount] = useState(0)
Sources: ['hooks/useState.md', 'hooks/overview.md']
Context size: 2340 chars
Q: What is useEffect for?
A: useEffect performs side effects in functional components, like fetching data,
subscriptions, or DOM manipulation. It runs after render.
Sources: ['hooks/useEffect.md', 'hooks/rules.md']
Context size: 2156 chars
```
---
## 🐛 Troubleshooting
### Issue: Model Download Stuck
**Problem:** "Downloading model..." hangs indefinitely
**Solutions:**
1. **Check internet connection:**
```bash
curl -I https://huggingface.co
```
2. **Manually download model:**
```python
from sentence_transformers import SentenceTransformer
# Force download
model = SentenceTransformer('all-MiniLM-L6-v2')
print("Model downloaded!")
```
3. **Use pre-downloaded model:**
```python
from chromadb.utils import embedding_functions

ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="/path/to/local/model"
)
```
### Issue: Dimension Mismatch
**Problem:** "Dimensionality mismatch: expected 384, got 1536"
**Solution:** Collections remember the embedding function they were created with, so to switch models you must recreate the collection:
```python
# Delete and recreate with correct embedding function
client.delete_collection(name="docs")
openai_ef = embedding_functions.OpenAIEmbeddingFunction(...)
collection = client.create_collection(
    name="docs",
    embedding_function=openai_ef  # 1536 dims
)
```
### Issue: Slow Queries
**Problem:** Queries take >1 second on 10K documents
**Solutions:**
1. **Use smaller n_results:**
```python
# ✅ Fast: Get only what you need
results = collection.query(query_texts=["..."], n_results=5)
# ❌ Slow: Large result sets
results = collection.query(query_texts=["..."], n_results=100)
```
2. **Filter with metadata:**
```python
# ✅ Fast: Reduce search space
results = collection.query(
    query_texts=["..."],
    n_results=5,
    where={"category": "specific"}  # Only search subset
)
```
3. **Use HttpClient for parallelism:**
```bash
# Start Chroma server
chroma run --path ./chroma_db
```
```python
# Connect multiple clients
client = chromadb.HttpClient(host="localhost", port=8000)
```
### Issue: Database Locked
**Problem:** "Database is locked" error
**Solutions:**
1. **Check for other processes:**
```bash
lsof ./chroma_db/chroma.sqlite3
# Kill any hung processes
```
2. **Use HttpClient instead:**
```bash
chroma run --path ./chroma_db --port 8000
```
```python
client = chromadb.HttpClient(host="localhost", port=8000)
```
3. **Enable WAL mode (Write-Ahead Logging):**
```python
import sqlite3
conn = sqlite3.connect("./chroma_db/chroma.sqlite3")
conn.execute("PRAGMA journal_mode=WAL")
conn.close()
```
### Issue: Collection Not Found
**Problem:** "Collection 'docs' does not exist"
**Solutions:**
1. **List existing collections:**
```python
collections = client.list_collections()
print([c.name for c in collections])
```
2. **Use get_or_create:**
```python
# ✅ Safe: Creates if missing
collection = client.get_or_create_collection(name="docs")
# ❌ Fails if missing
collection = client.get_collection(name="docs")
```
### Issue: Out of Memory
**Problem:** Python crashes when adding a large dataset
**Solutions:**
1. **Batch with smaller size:**
```python
batch_size = 500 # Reduce from 1000
for i in range(0, len(documents), batch_size):
    batch = documents[i:i + batch_size]
    collection.add(...)
```
2. **Use HttpClient + server:**
```bash
# Server handles memory better
chroma run --path ./chroma_db
```
3. **Pre-compute embeddings externally:**
```python
# Generate embeddings in a separate script,
# then add them via the embeddings parameter
collection.add(
    documents=[...],
    embeddings=precomputed_embeddings,
    ...
)
```
---
## 📊 Before vs. After
| Aspect | Without Skill Seekers | With Skill Seekers |
|--------|----------------------|-------------------|
| **Data Preparation** | Custom scraping + parsing logic | One command: `skill-seekers scrape` |
| **Embedding Setup** | Manual model selection and config | Auto-configured with sensible defaults |
| **Metadata** | Manual extraction from docs | Auto-extracted (category, source, file, type) |
| **Storage** | Complex path management | Simple: `PersistentClient(path="...")` |
| **Local-First** | Requires external services | Fully local with Sentence Transformers |
| **Setup Time** | 2-4 hours | 5 minutes |
| **Code Required** | 300+ lines scraping logic | 20 lines upload script |
| **External Deps** | OpenAI API required | Optional (works offline!) |
---
## 🎯 Next Steps
### Enhance Your Chroma Integration
1. **Try Different Embedding Models:**
```python
# Better quality (still local)
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-mpnet-base-v2"
)
```
2. **Implement Semantic Chunking:**
```bash
skill-seekers scrape --config configs/fastapi.json --chunk-for-rag --chunk-tokens 512
```
3. **Set Up Multi-Collection Search:**
```python
# Search across multiple frameworks
for name in ["react_docs", "vue_docs", "angular_docs"]:
    collection = client.get_collection(name)
    results = collection.query(...)
```
4. **Deploy with Docker:**
```bash
docker run -p 8000:8000 -v ./chroma-data:/chroma/chroma ghcr.io/chroma-core/chroma:latest
```
### Related Guides
- **[LangChain Integration](LANGCHAIN.md)** - Use Chroma as vector store in LangChain
- **[LlamaIndex Integration](LLAMA_INDEX.md)** - Use Chroma with LlamaIndex
- **[RAG Pipelines Guide](RAG_PIPELINES.md)** - Build complete RAG systems
- **[INTEGRATIONS.md](INTEGRATIONS.md)** - See all integration options
### Resources
- **Chroma Docs:** https://docs.trychroma.com/
- **Python Client:** https://docs.trychroma.com/reference/py-client
- **Support:** https://github.com/yusufkaraaslan/Skill_Seekers/discussions
---
**Questions?** Open an issue: https://github.com/yusufkaraaslan/Skill_Seekers/issues
**Website:** https://skillseekersweb.com/
**Last Updated:** February 7, 2026