Files
skill-seekers-reference/examples/chroma-example
yusyus 53d37e61dd docs: Add 4 comprehensive vector database examples (Weaviate, Chroma, FAISS, Qdrant)
Created complete working examples for all 4 vector databases with RAG adaptors:

Weaviate Example:
- Comprehensive README with hybrid search guide
- 3 Python scripts (generate, upload, query)
- Sample outputs and query results
- Covers hybrid search, filtering, schema design

Chroma Example:
- Simple, local-first approach
- In-memory and persistent storage options
- Semantic search and metadata filtering
- Comparison with Weaviate

FAISS Example:
- Facebook AI Similarity Search integration
- OpenAI embeddings generation
- Index building and persistence
- Performance-focused for scale

Qdrant Example:
- Advanced filtering capabilities
- Production-ready features
- Complex query patterns
- Rust-based performance

Each example includes:
- Detailed README with setup and troubleshooting
- requirements.txt with dependencies
- 3 working Python scripts
- Sample outputs directory

Total files: 20 (4 examples × 5 files each)
Documentation: 4 comprehensive READMEs (~800 lines total)

Phase 2 of optional enhancements complete.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-07 22:38:15 +03:00
..

ChromaDB Vector Database Example

This example demonstrates how to use Skill Seekers with ChromaDB, the AI-native open-source embedding database. Chroma is designed to be simple, fast, and easy to use locally.

What You'll Learn

  • How to generate skills in ChromaDB format
  • How to create local Chroma collections
  • How to perform semantic searches
  • How to filter by metadata categories

Why ChromaDB?

  • No Server Required: Works entirely in-process (perfect for development)
  • Simple API: Clean Python interface, no complex setup
  • Fast: Built for speed with smart indexing
  • Open Source: MIT licensed, community-driven

Prerequisites

Python Dependencies

pip install -r requirements.txt

That's it! No Docker, no server setup. Chroma runs entirely in your Python process.

Step-by-Step Guide

Step 1: Generate Skill from Documentation

First, we'll scrape Vue documentation and package it for ChromaDB:

python 1_generate_skill.py

This script will:

  1. Scrape Vue docs (limited to 20 pages for demo)
  2. Package the skill in ChromaDB format (JSON with documents + metadata + IDs)
  3. Save to output/vue-chroma.json

Expected Output:

✅ ChromaDB data packaged successfully!
📦 Output: output/vue-chroma.json
📊 Total documents: 21
📂 Categories: overview (1), guides (8), api (12)

What's in the JSON?

{
  "documents": [
    "Vue is a progressive JavaScript framework...",
    "Components are the building blocks..."
  ],
  "metadatas": [
    {
      "source": "vue",
      "category": "overview",
      "file": "SKILL.md",
      "type": "documentation",
      "version": "1.0.0"
    }
  ],
  "ids": [
    "a1b2c3d4e5f6...",
    "b2c3d4e5f6g7..."
  ],
  "collection_name": "vue"
}

Step 2: Create Collection and Upload

Now we'll create a ChromaDB collection and load all documents:

python 2_upload_to_chroma.py

This script will:

  1. Create an in-memory Chroma client (or persistent with --persist)
  2. Create a collection with the skill name
  3. Add all documents with metadata and IDs
  4. Verify the upload was successful

Expected Output:

📊 Creating ChromaDB client...
✅ Client created (in-memory)

📦 Creating collection: vue
✅ Collection created!

📤 Adding 21 documents to collection...
✅ Successfully added 21 documents to ChromaDB

🔍 Collection 'vue' now contains 21 documents

Persistent Storage:

# Save to disk for later use
python 2_upload_to_chroma.py --persist ./chroma_db

Now search your knowledge base!

python 3_query_example.py

With persistent storage:

python 3_query_example.py --persist ./chroma_db

This script demonstrates:

  1. Semantic Search: Natural language queries
  2. Metadata Filtering: Filter by category
  3. Top-K Results: Get most relevant documents
  4. Distance Scoring: See how relevant each result is

Example Queries:

Query 1: Semantic Search

Query: "How do I create a Vue component?"
Top 3 results:

1. [Distance: 0.234] guides/components.md
   Components are reusable Vue instances with a name. You can use them as custom
   elements inside a root Vue instance...

2. [Distance: 0.298] api/component_api.md
   The component API reference describes all available options for defining
   components using the Options API...

3. [Distance: 0.312] guides/single_file_components.md
   Single-File Components (SFCs) allow you to define templates, logic, and
   styling in a single .vue file...

Query 2: Filtered Search

Query: "reactivity"
Filter: category = "api"

Results:
1. ref() - Create reactive references
2. reactive() - Create reactive proxies
3. computed() - Create computed properties

Understanding ChromaDB Features

Chroma automatically:

  • Generates embeddings for your documents (using default model)
  • Indexes them for fast similarity search
  • Finds semantically similar content

Distance Scores:

  • Lower = more similar
  • 0.0 = identical
  • < 0.5 = very relevant
  • 0.5-1.0 = somewhat relevant
  • > 1.0 = less relevant

Metadata Filtering

Filter results before semantic search:

collection.query(
    query_texts=["your query"],
    n_results=5,
    where={"category": "api"}
)

Supported operators:

  • $eq: Equal to
  • $ne: Not equal to
  • $gt, $gte: Greater than (or equal)
  • $lt, $lte: Less than (or equal)
  • $in: In list
  • $nin: Not in list

Complex filters:

where={
    "$and": [
        {"category": {"$eq": "api"}},
        {"type": {"$eq": "reference"}}
    ]
}

Collection Management

# List all collections
client.list_collections()

# Get collection
collection = client.get_collection("vue")

# Get count
collection.count()

# Delete collection
client.delete_collection("vue")

Customization

Use Your Own Embeddings

Chroma supports custom embedding functions:

from chromadb.utils import embedding_functions

# OpenAI embeddings
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="your-key",
    model_name="text-embedding-ada-002"
)

collection = client.create_collection(
    name="your_skill",
    embedding_function=openai_ef
)

Supported embedding functions:

  • OpenAI: text-embedding-ada-002 (best quality)
  • Cohere: embed-english-v2.0
  • HuggingFace: Various models (local, no API key)
  • Sentence Transformers: Local models

Generate Different Skills

# Change the config in 1_generate_skill.py
"--config", "configs/django.json",  # Your framework

# Or use CLI directly
skill-seekers scrape --config configs/flask.json
skill-seekers package output/flask --target chroma

Adjust Query Parameters

In 3_query_example.py:

# Get more results
n_results=10  # Default is 5

# Include more metadata
include=["documents", "metadatas", "distances"]

# Different distance metrics
# (configure when creating collection)
metadata={"hnsw:space": "cosine"}  # or "l2", "ip"

Performance Tips

  1. Batch Operations: Add documents in batches for better performance

    collection.add(
        documents=batch_docs,
        metadatas=batch_metadata,
        ids=batch_ids
    )
    
  2. Persistent Storage: Use --persist for production

    python 2_upload_to_chroma.py --persist ./prod_db
    
  3. Custom Embeddings: Use OpenAI for best quality (costs $)

  4. Index Tuning: Adjust HNSW parameters for speed vs accuracy

Troubleshooting

Import Error

ModuleNotFoundError: No module named 'chromadb'

Solution:

pip install chromadb

Collection Already Exists

Error: Collection 'vue' already exists

Solution:

# Delete existing collection
client.delete_collection("vue")

# Or use --reset flag
python 2_upload_to_chroma.py --reset

Empty Results

Query returned empty results

Possible causes:

  1. Collection empty: Check collection.count()
  2. Query too specific: Try broader queries
  3. Wrong collection name: Verify collection exists

Debug:

# Check collection contents
collection.get()  # Get all documents

# Check embedding function
collection._embedding_function  # Should not be None

Performance Issues

Query is slow

Solutions:

  1. Use persistent storage (faster than in-memory for large datasets)
  2. Reduce n_results (fewer results = faster)
  3. Add metadata filters to narrow search space
  4. Consider using OpenAI embeddings (better quality = faster convergence)

Next Steps

  1. Try other skills: Package your favorite documentation
  2. Build a chatbot: Integrate with LangChain or LlamaIndex
  3. Production deployment: Use persistent storage + API wrapper
  4. Custom embeddings: Experiment with different models

Resources

File Structure

chroma-example/
├── README.md                      # This file
├── requirements.txt               # Python dependencies
├── 1_generate_skill.py            # Generate ChromaDB-format skill
├── 2_upload_to_chroma.py          # Create collection and upload
├── 3_query_example.py             # Query demonstrations
└── sample_output/                 # Example outputs
    ├── vue-chroma.json            # Generated skill (21 docs)
    └── query_results.txt          # Sample query results

Comparison: Chroma vs Weaviate

Feature ChromaDB Weaviate
Setup No server needed ⚠️ Docker/Cloud required
API Very simple ⚠️ More complex
Performance Fast for < 1M docs Scales to billions
Hybrid Search Semantic only Keyword + semantic
Production Good for small-medium Built for scale

Use Chroma for: Development, prototypes, small-medium datasets (< 1M docs) Use Weaviate for: Production, large datasets (> 1M docs), hybrid search


Last Updated: February 2026 Tested With: ChromaDB v0.4.22, Python 3.10+, skill-seekers v2.10.0