# ChromaDB Vector Database Example

This example demonstrates how to use Skill Seekers with ChromaDB, the AI-native open-source embedding database. Chroma is designed to be simple, fast, and easy to use locally.

## What You'll Learn

- How to generate skills in ChromaDB format
- How to create local Chroma collections
- How to perform semantic searches
- How to filter by metadata categories

## Why ChromaDB?

- **No Server Required**: Works entirely in-process (perfect for development)
- **Simple API**: Clean Python interface, no complex setup
- **Fast**: Built for speed with smart indexing
- **Open Source**: MIT licensed, community-driven

## Prerequisites

### Python Dependencies

```bash
pip install -r requirements.txt
```

That's it! No Docker, no server setup. Chroma runs entirely in your Python process.

## Step-by-Step Guide

### Step 1: Generate Skill from Documentation

First, we'll scrape Vue documentation and package it for ChromaDB:

```bash
python 1_generate_skill.py
```

This script will:
1. Scrape Vue docs (limited to 20 pages for demo)
2. Package the skill in ChromaDB format (JSON with documents + metadata + IDs)
3. Save to `output/vue-chroma.json`

**Expected Output:**
```
✅ ChromaDB data packaged successfully!
📦 Output: output/vue-chroma.json
📊 Total documents: 21
📂 Categories: overview (1), guides (8), api (12)
```

**What's in the JSON?**
```json
{
  "documents": [
    "Vue is a progressive JavaScript framework...",
    "Components are the building blocks..."
  ],
  "metadatas": [
    {
      "source": "vue",
      "category": "overview",
      "file": "SKILL.md",
      "type": "documentation",
      "version": "1.0.0"
    }
  ],
  "ids": [
    "a1b2c3d4e5f6...",
    "b2c3d4e5f6g7..."
  ],
  "collection_name": "vue"
}
```

### Step 2: Create Collection and Upload

Now we'll create a ChromaDB collection and load all documents:

```bash
python 2_upload_to_chroma.py
```

This script will:
1. Create an in-memory Chroma client (or persistent with `--persist`)
2. Create a collection with the skill name
3. Add all documents with metadata and IDs
4. Verify the upload was successful

**Expected Output:**
```
📊 Creating ChromaDB client...
✅ Client created (in-memory)

📦 Creating collection: vue
✅ Collection created!

📤 Adding 21 documents to collection...
✅ Successfully added 21 documents to ChromaDB

🔍 Collection 'vue' now contains 21 documents
```

**Persistent Storage:**
```bash
# Save to disk for later use
python 2_upload_to_chroma.py --persist ./chroma_db
```

### Step 3: Query and Search

Now search your knowledge base!

```bash
python 3_query_example.py
```

**With persistent storage:**
```bash
python 3_query_example.py --persist ./chroma_db
```

This script demonstrates:
1. **Semantic Search**: Natural language queries
2. **Metadata Filtering**: Filter by category
3. **Top-K Results**: Get most relevant documents
4. **Distance Scoring**: See how relevant each result is

**Example Queries:**

**Query 1: Semantic Search**
```
Query: "How do I create a Vue component?"
Top 3 results:

1. [Distance: 0.234] guides/components.md
   Components are reusable Vue instances with a name. You can use them as custom
   elements inside a root Vue instance...

2. [Distance: 0.298] api/component_api.md
   The component API reference describes all available options for defining
   components using the Options API...

3. [Distance: 0.312] guides/single_file_components.md
   Single-File Components (SFCs) allow you to define templates, logic, and
   styling in a single .vue file...
```

**Query 2: Filtered Search**
```
Query: "reactivity"
Filter: category = "api"

Results:
1. ref() - Create reactive references
2. reactive() - Create reactive proxies
3. computed() - Create computed properties
```

## Understanding ChromaDB Features

### Semantic Search

Chroma automatically:
- Generates embeddings for your documents (using default model)
- Indexes them for fast similarity search
- Finds semantically similar content

**Distance Scores:**
- Lower = more similar
- `0.0` = identical
- `< 0.5` = very relevant
- `0.5-1.0` = somewhat relevant
- `> 1.0` = less relevant

### Metadata Filtering

Filter results before semantic search:
```python
collection.query(
    query_texts=["your query"],
    n_results=5,
    where={"category": "api"}
)
```

**Supported operators:**
- `$eq`: Equal to
- `$ne`: Not equal to
- `$gt`, `$gte`: Greater than (or equal)
- `$lt`, `$lte`: Less than (or equal)
- `$in`: In list
- `$nin`: Not in list

**Complex filters:**
```python
where={
    "$and": [
        {"category": {"$eq": "api"}},
        {"type": {"$eq": "reference"}}
    ]
}
```

### Collection Management

```python
# List all collections
client.list_collections()

# Get collection
collection = client.get_collection("vue")

# Get count
collection.count()

# Delete collection
client.delete_collection("vue")
```

## Customization

### Use Your Own Embeddings

Chroma supports custom embedding functions:

```python
from chromadb.utils import embedding_functions

# OpenAI embeddings
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="your-key",
    model_name="text-embedding-ada-002"
)

collection = client.create_collection(
    name="your_skill",
    embedding_function=openai_ef
)
```

**Supported embedding functions:**
- **OpenAI**: `text-embedding-ada-002` (best quality)
- **Cohere**: `embed-english-v2.0`
- **HuggingFace**: Various models (local, no API key)
- **Sentence Transformers**: Local models

### Generate Different Skills

```bash
# Change the config in 1_generate_skill.py
"--config", "configs/django.json",  # Your framework

# Or use CLI directly
skill-seekers scrape --config configs/flask.json
skill-seekers package output/flask --target chroma
```

### Adjust Query Parameters

In `3_query_example.py`:

```python
# Get more results
n_results=10  # Default is 5

# Include more metadata
include=["documents", "metadatas", "distances"]

# Different distance metrics
# (configure when creating collection)
metadata={"hnsw:space": "cosine"}  # or "l2", "ip"
```

## Performance Tips

1. **Batch Operations**: Add documents in batches for better performance
   ```python
   collection.add(
       documents=batch_docs,
       metadatas=batch_metadata,
       ids=batch_ids
   )
   ```

2. **Persistent Storage**: Use `--persist` for production
   ```bash
   python 2_upload_to_chroma.py --persist ./prod_db
   ```

3. **Custom Embeddings**: Use OpenAI for best quality (costs $)
4. **Index Tuning**: Adjust HNSW parameters for speed vs accuracy

## Troubleshooting

### Import Error
```
ModuleNotFoundError: No module named 'chromadb'
```

**Solution:**
```bash
pip install chromadb
```

### Collection Already Exists
```
Error: Collection 'vue' already exists
```

**Solution:**
```python
# Delete existing collection
client.delete_collection("vue")

# Or use --reset flag
python 2_upload_to_chroma.py --reset
```

### Empty Results
```
Query returned empty results
```

**Possible causes:**
1. Collection empty: Check `collection.count()`
2. Query too specific: Try broader queries
3. Wrong collection name: Verify collection exists

**Debug:**
```python
# Check collection contents
collection.get()  # Get all documents

# Check embedding function
collection._embedding_function  # Should not be None
```

### Performance Issues
```
Query is slow
```

**Solutions:**
1. Use persistent storage (faster than in-memory for large datasets)
2. Reduce `n_results` (fewer results = faster)
3. Add metadata filters to narrow search space
4. Consider using OpenAI embeddings (better quality = faster convergence)

## Next Steps

1. **Try other skills**: Package your favorite documentation
2. **Build a chatbot**: Integrate with LangChain or LlamaIndex
3. **Production deployment**: Use persistent storage + API wrapper
4. **Custom embeddings**: Experiment with different models

## Resources

- **ChromaDB Docs**: https://docs.trychroma.com/
- **GitHub**: https://github.com/chroma-core/chroma
- **Discord**: https://discord.gg/MMeYNTmh3x
- **Skill Seekers**: https://github.com/yourusername/skill-seekers

## File Structure

```
chroma-example/
├── README.md                      # This file
├── requirements.txt               # Python dependencies
├── 1_generate_skill.py            # Generate ChromaDB-format skill
├── 2_upload_to_chroma.py          # Create collection and upload
├── 3_query_example.py             # Query demonstrations
└── sample_output/                 # Example outputs
    ├── vue-chroma.json            # Generated skill (21 docs)
    └── query_results.txt          # Sample query results
```

## Comparison: Chroma vs Weaviate

| Feature | ChromaDB | Weaviate |
|---------|----------|----------|
| **Setup** | ✅ No server needed | ⚠️ Docker/Cloud required |
| **API** | ✅ Very simple | ⚠️ More complex |
| **Performance** | ✅ Fast for < 1M docs | ✅ Scales to billions |
| **Hybrid Search** | ❌ Semantic only | ✅ Keyword + semantic |
| **Production** | ✅ Good for small-medium | ✅ Built for scale |

**Use Chroma for:** Development, prototypes, small-medium datasets (< 1M docs)
**Use Weaviate for:** Production, large datasets (> 1M docs), hybrid search

---

**Last Updated:** February 2026
**Tested With:** ChromaDB v0.4.22, Python 3.10+, skill-seekers v2.10.0