# FAISS Integration with Skill Seekers

**Status:** ✅ Production Ready

**Difficulty:** Intermediate

**Last Updated:** February 7, 2026

---

## ❌ The Problem

Building RAG applications with FAISS involves several challenges:

1. **Manual Index Configuration** - Choosing the right FAISS index type (Flat, IVF, HNSW, PQ) requires deep understanding
2. **Embedding Management** - Embeddings must be generated and stored separately, and document IDs tracked manually
3. **Billion-Scale Complexity** - Optimizing for large datasets (>1M vectors) requires index training and parameter tuning

**Example Pain Point:**

```python
# Manual FAISS setup for each framework
import faiss
import numpy as np
from openai import OpenAI

# Generate embeddings (one request per document, slow)
client = OpenAI()
embeddings = []
for doc in documents:
    response = client.embeddings.create(
        model="text-embedding-ada-002",
        input=doc
    )
    embeddings.append(response.data[0].embedding)

# Create index (FAISS requires float32 input)
dimension = 1536
index = faiss.IndexFlatL2(dimension)
index.add(np.array(embeddings, dtype="float32"))

# Save index + metadata separately (complex!)
faiss.write_index(index, "index.faiss")
# ... manually track which ID maps to which document
```

---

## ✅ The Solution

Skill Seekers automates FAISS integration with structured, production-ready data:

**Benefits:**

- ✅ Auto-formatted documents with consistent metadata
- ✅ Works with the LangChain FAISS wrapper for easy ID tracking
- ✅ Supports flat (small datasets) and IVF (large datasets) indexes
- ✅ GPU acceleration compatible (billion-scale search)
- ✅ Serialization-ready for production deployment

**Result:** 10-minute setup and production-ready similarity search that scales to billions of vectors.

---

## ⚡ Quick Start (10 Minutes)

### Prerequisites

```bash
# Install FAISS (CPU version); quote version specifiers so the shell
# does not treat ">" as a redirect
pip install "faiss-cpu>=1.7.4"

# For GPU support (official GPU builds ship via conda; the PyPI
# faiss-gpu wheels are community-maintained)
pip install "faiss-gpu>=1.7.4"

# Install LangChain for the FAISS wrapper
pip install "langchain>=0.1.0" "langchain-community>=0.0.20"

# OpenAI for embeddings
pip install "openai>=1.0.0"

# Or with Skill Seekers
pip install "skill-seekers[all-llms]"
```
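
To sanity-check the install before wiring anything up, a minimal smoke test:

```python
import numpy as np
import faiss

# Build a tiny index and query it; the nearest neighbor of a stored
# vector should be the vector itself
xb = np.random.rand(10, 8).astype("float32")
index = faiss.IndexFlatL2(8)
index.add(xb)
distances, ids = index.search(xb[:1], k=3)
print(ids[0])  # the first hit should be 0
```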

**What you need:**

- Python 3.10+
- OpenAI API key (for embeddings)
- Optional: CUDA GPU for billion-scale search

### Generate FAISS-Ready Documents

```bash
# Step 1: Scrape documentation
skill-seekers scrape --config configs/react.json

# Step 2: Package for LangChain (FAISS-compatible)
skill-seekers package output/react --target langchain

# Output: output/react-langchain.json (FAISS-ready)
```

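Before indexing, it can help to peek at what the export contains. The file is a JSON array of LangChain-style records; the exact metadata keys depend on your Skill Seekers version and config, so treat the printed fields below as illustrative:

```python
import json

with open("output/react-langchain.json") as f:
    docs = json.load(f)

print(len(docs))            # number of exported documents
print(docs[0].keys())       # expected: dict_keys(['page_content', 'metadata'])
print(docs[0]["metadata"])  # e.g. category/source fields used for filtering later
```
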
### Create FAISS Index with LangChain

```python
import json

from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings
from langchain.schema import Document

# Load documents
with open("output/react-langchain.json") as f:
    docs_data = json.load(f)

# Convert to LangChain Documents
documents = [
    Document(
        page_content=doc["page_content"],
        metadata=doc["metadata"]
    )
    for doc in docs_data
]

# Create FAISS index (embeddings are generated automatically)
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
vectorstore = FAISS.from_documents(documents, embeddings)

# Save index
vectorstore.save_local("faiss_index")

print(f"✅ Created FAISS index with {len(documents)} documents")
```

### Query FAISS Index

```python
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings

# Load index (note: only load indexes from trusted sources)
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
vectorstore = FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)

# Similarity search
results = vectorstore.similarity_search(
    query="How do I use React hooks?",
    k=3
)

for i, doc in enumerate(results):
    print(f"\n{i+1}. Category: {doc.metadata['category']}")
    print(f"   Source: {doc.metadata['source']}")
    print(f"   Content: {doc.page_content[:200]}...")
```

### Similarity Search with Scores

```python
# Get similarity scores (raw L2 distances by default: lower means closer)
results = vectorstore.similarity_search_with_score(
    query="React state management",
    k=5
)

for doc, score in results:
    print(f"Score: {score:.3f}")
    print(f"Category: {doc.metadata['category']}")
    print(f"Content: {doc.page_content[:150]}...")
    print()
```

---

## 📖 Detailed Setup Guide

### Step 1: Choose FAISS Index Type

**Option A: IndexFlatL2 (Exact Search, <100K vectors)**

```python
import faiss

# Flat index: exact nearest neighbors (brute force)
dimension = 1536  # OpenAI ada-002
index = faiss.IndexFlatL2(dimension)

# Pros: 100% accuracy, simple
# Cons: O(n) search time, slow for large datasets
# Use when: <100K vectors, need perfect recall
```

**Option B: IndexIVFFlat (Approximate Search, 100K-10M vectors)**

```python
# IVF index: cluster-based approximate search
quantizer = faiss.IndexFlatL2(dimension)
nlist = 100  # Number of clusters
index = faiss.IndexIVFFlat(quantizer, dimension, nlist)

# Train on sample data before adding vectors
index.train(training_vectors)  # Needs roughly 30*nlist training vectors
index.add(vectors)

# Pros: Faster than flat, good accuracy
# Cons: Requires training, 90-95% recall
# Use when: 100K-10M vectors
```

**Option C: IndexHNSWFlat (Graph-Based, High Recall)**

```python
# HNSW index: hierarchical navigable small world graph
index = faiss.IndexHNSWFlat(dimension, 32)  # 32 = M (graph connections per node)

# Pros: Fast, high recall (>95%), no training
# Cons: High memory usage (3-4x flat)
# Use when: Need speed + high recall, have memory
```

**Option D: IndexIVFPQ (Product Quantization, 10M-1B vectors)**

```python
# IVF + PQ: compressed vectors for massive scale
quantizer = faiss.IndexFlatL2(dimension)
nlist = 1000
m = 8      # Number of subvectors (dimension must be divisible by m)
nbits = 8  # Bits per subvector code
index = faiss.IndexIVFPQ(quantizer, dimension, nlist, m, nbits)

# Train, then add
index.train(training_vectors)
index.add(vectors)

# Pros: 16-32x memory reduction, billion-scale
# Cons: Lower recall (80-90%), complex
# Use when: >10M vectors, memory constrained
```

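Options B-D can also be built from a single `faiss.index_factory()` string, which is where shorthand like `"IVF100,Flat"` later in this guide comes from. A minimal sketch with random vectors (real code would use your embedding vectors):

```python
import numpy as np
import faiss

dimension = 1536
index = faiss.index_factory(dimension, "IVF100,Flat")  # quantizer + IndexIVFFlat in one call

# IVF indexes must be trained before vectors are added (FAISS expects float32)
vectors = np.random.rand(10_000, dimension).astype("float32")
index.train(vectors)
index.add(vectors)

# nprobe = clusters scanned per query: higher is slower but more accurate;
# extract_index_ivf reaches the IVF layer regardless of any wrapping
faiss.extract_index_ivf(index).nprobe = 10
distances, ids = index.search(vectors[:1], k=5)
```
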
### Step 2: Generate Skill Seekers Documents

**Option A: Documentation Website**

```bash
skill-seekers scrape --config configs/django.json
skill-seekers package output/django --target langchain
```

**Option B: GitHub Repository**

```bash
skill-seekers github --repo django/django --name django
skill-seekers package output/django --target langchain
```

**Option C: Local Codebase**

```bash
skill-seekers analyze --directory /path/to/repo
skill-seekers package output/codebase --target langchain
```

**Option D: RAG-Optimized Chunking**

```bash
skill-seekers scrape --config configs/fastapi.json --chunk-for-rag --chunk-tokens 512
skill-seekers package output/fastapi --target langchain
```

### Step 3: Create FAISS Index (LangChain Wrapper)

```python
import json

from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings
from langchain.schema import Document

# Load documents
with open("output/django-langchain.json") as f:
    docs_data = json.load(f)

documents = [
    Document(page_content=doc["page_content"], metadata=doc["metadata"])
    for doc in docs_data
]

# Create embeddings
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

# For small datasets (<100K vectors), the wrapper's default flat index is fine
vectorstore = FAISS.from_documents(documents, embeddings)

# For large datasets (>100K vectors): the LangChain wrapper always builds a
# flat index, so create an IVF index with raw FAISS and wrap it yourself
# (see the sketch below)

# Save index + docstore + metadata
vectorstore.save_local("faiss_index")

print(f"✅ Created FAISS index with {len(documents)} vectors")
```

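If you do need an IVF index behind the LangChain wrapper, one approach is to build and train the index with raw FAISS and hand it to the `FAISS` constructor. This is a sketch; check the constructor signature against your installed LangChain version (older releases expect a callable such as `embeddings.embed_query` for `embedding_function`):

```python
import numpy as np
import faiss
from langchain.docstore import InMemoryDocstore
from langchain.vectorstores import FAISS

# Build and train an IVF index with raw FAISS
dimension = 1536  # text-embedding-ada-002
index = faiss.index_factory(dimension, "IVF100,Flat")
training = np.array(
    embeddings.embed_documents([d.page_content for d in documents[:5000]]),
    dtype="float32",
)
index.train(training)

# Hand the trained index to the wrapper; add_documents() embeds and inserts
vectorstore = FAISS(
    embedding_function=embeddings,
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)
vectorstore.add_documents(documents)
vectorstore.save_local("faiss_ivf_index")
```
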
### Step 4: Query with Filtering

```python
# Load index (only from trusted sources!)
vectorstore = FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)

# Basic similarity search
results = vectorstore.similarity_search(
    query="Django models tutorial",
    k=5
)

# Similarity search with a relevance-score threshold
results = vectorstore.similarity_search_with_relevance_scores(
    query="Django authentication",
    k=5,
    score_threshold=0.8  # Only return results with relevance > 0.8
)

# Maximum marginal relevance (diverse results)
results = vectorstore.max_marginal_relevance_search(
    query="Django class-based views",
    k=5,
    fetch_k=20  # Fetch 20 candidates, return the 5 most diverse
)

# Custom filter function (post-search filtering)
def filter_by_category(docs, category):
    return [doc for doc in docs if doc.metadata.get("category") == category]

results = vectorstore.similarity_search("authentication", k=20)
filtered = filter_by_category(results, "security")
```

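Recent `langchain-community` releases also accept a metadata filter directly on `similarity_search`; FAISS applies it by over-fetching and post-filtering. Verify the kwargs against your installed version (the key/value pair here is illustrative):

```python
# Metadata filtering built into the wrapper
results = vectorstore.similarity_search(
    "Django authentication",
    k=5,
    filter={"category": "security"},  # keep only matching metadata
    fetch_k=50,                       # candidates retrieved before filtering
)
```
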
---
## 🚀 Advanced Usage

### 1. GPU Acceleration (Billion-Scale Search)

```python
import faiss

# Check GPU availability
ngpus = faiss.get_num_gpus()
print(f"GPUs available: {ngpus}")

# Create a CPU index
dimension = 1536
cpu_index = faiss.IndexFlatL2(dimension)

# Move to GPU (keep the resources object alive as long as the index is in use)
res = faiss.StandardGpuResources()
gpu_index = faiss.index_cpu_to_gpu(
    res,
    0,  # GPU ID
    cpu_index
)

# Add vectors (on GPU)
gpu_index.add(vectors)

# Search (on GPU, typically 10-100x faster)
distances, indices = gpu_index.search(query_vectors, k=10)

# Move back to CPU for saving
cpu_index = faiss.index_gpu_to_cpu(gpu_index)
faiss.write_index(cpu_index, "index.faiss")
```

### 2. Batch Processing for Large Datasets

```python
import json

from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings
from langchain.schema import Document

embeddings = OpenAIEmbeddings()

# Load documents
with open("output/large-dataset-langchain.json") as f:
    all_docs = json.load(f)

# Create index with the first batch
batch_size = 10000
first_batch = [
    Document(page_content=doc["page_content"], metadata=doc["metadata"])
    for doc in all_docs[:batch_size]
]

vectorstore = FAISS.from_documents(first_batch, embeddings)
print(f"Created index with {len(first_batch)} documents")

# Add remaining batches
for i in range(batch_size, len(all_docs), batch_size):
    batch = [
        Document(page_content=doc["page_content"], metadata=doc["metadata"])
        for doc in all_docs[i:i+batch_size]
    ]

    vectorstore.add_documents(batch)
    print(f"Added documents {i} to {i+len(batch)}")

# Save final index
vectorstore.save_local("large_faiss_index")
print(f"✅ Final index size: {len(all_docs)} documents")
```

### 3. Index Merging for Multi-Source

```python
# Create separate indexes for different sources
vectorstore1 = FAISS.from_documents(docs1, embeddings)
vectorstore2 = FAISS.from_documents(docs2, embeddings)
vectorstore3 = FAISS.from_documents(docs3, embeddings)

# Merge indexes (in place, into vectorstore1)
vectorstore1.merge_from(vectorstore2)
vectorstore1.merge_from(vectorstore3)

# Save merged index
vectorstore1.save_local("merged_index")

# Query combined index
results = vectorstore1.similarity_search("query", k=10)
```
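
`merge_from` also works on indexes loaded from disk, provided they were built with the same embedding model and dimension. A short sketch (the index names are illustrative):

```python
# Merge previously saved indexes (same embedding model required)
vs_react = FAISS.load_local("react_index", embeddings, allow_dangerous_deserialization=True)
vs_django = FAISS.load_local("django_index", embeddings, allow_dangerous_deserialization=True)

vs_react.merge_from(vs_django)
vs_react.save_local("combined_docs_index")
```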

---

## 📋 Best Practices

### 1. Choose Index Type by Dataset Size

```python
import faiss

# <100K vectors: flat (exact search), the LangChain wrapper's default
if num_vectors < 100_000:
    vectorstore = FAISS.from_documents(documents, embeddings)

# 100K-1M vectors: IVF (the wrapper has no IVF option, so build the index
# with the FAISS factory, train it, then wrap or search it with raw FAISS)
elif num_vectors < 1_000_000:
    index = faiss.index_factory(1536, "IVF100,Flat")

# 1M-10M vectors: IVF + product quantization
elif num_vectors < 10_000_000:
    index = faiss.index_factory(1536, "IVF1000,PQ8")

# >10M vectors: GPU + IVF + PQ (see "GPU Acceleration" above)
else:
    ...
```

### 2. Only Load Indexes from Trusted Sources

```python
# ⚠️ SECURITY: only load indexes you trust!
# The allow_dangerous_deserialization flag exists because LangChain's
# docstore is pickled, and unpickling can execute arbitrary code

# ✅ Safe: your own indexes
vectorstore = FAISS.load_local("my_index", embeddings, allow_dangerous_deserialization=True)

# ❌ Dangerous: unknown indexes from the internet
# vectorstore = FAISS.load_local("untrusted_index", ...)  # DON'T DO THIS
```

### 3. Use Batch Embedding Generation

```python
from openai import OpenAI

client = OpenAI()

# ✅ Good: batch API (up to 2048 inputs per call)
texts = [doc["page_content"] for doc in documents]

embeddings = []
batch_size = 2048

for i in range(0, len(texts), batch_size):
    batch = texts[i:i + batch_size]
    response = client.embeddings.create(
        model="text-embedding-ada-002",
        input=batch
    )
    embeddings.extend([e.embedding for e in response.data])

# ❌ Bad: one request per text (slow!)
for text in texts:
    response = client.embeddings.create(model="text-embedding-ada-002", input=text)
    embeddings.append(response.data[0].embedding)
```

---

## 🐛 Troubleshooting

### Issue: Index Too Large for Memory

**Problem:** `MemoryError` when loading an index with 10M+ vectors

**Solutions:**

1. **Use product quantization** (build with raw FAISS; the LangChain wrapper only creates flat indexes):

   ```python
   # Compress vectors roughly 16-32x with IVF + PQ
   index = faiss.index_factory(1536, "IVF1000,PQ8")
   index.train(training_vectors)
   index.add(vectors)
   ```

2. **Use GPU:**

   ```python
   # Move to GPU memory
   res = faiss.StandardGpuResources()
   gpu_index = faiss.index_cpu_to_gpu(res, 0, cpu_index)
   ```

### Issue: Slow Search on Large Index

**Problem:** Search takes >1 second on 1M+ vectors

**Solutions:**

1. **Use an IVF index and tune nprobe:**

   ```python
   # Build with raw FAISS (the LangChain wrapper only creates flat indexes)
   index = faiss.index_factory(1536, "IVF100,Flat")
   index.train(training_vectors)
   index.add(vectors)

   # Tune nprobe: more clusters scanned means slower but more accurate
   faiss.extract_index_ivf(index).nprobe = 10
   ```

2. **GPU acceleration:**

   ```python
   res = faiss.StandardGpuResources()
   gpu_index = faiss.index_cpu_to_gpu(res, 0, index)
   ```

---

## 📊 Before vs. After

| Aspect | Without Skill Seekers | With Skill Seekers |
|--------|-----------------------|--------------------|
| **Data Preparation** | Custom scraping + embedding generation | One command: `skill-seekers scrape` |
| **Index Creation** | Manual FAISS setup with numpy arrays | LangChain wrapper handles the complexity |
| **ID Tracking** | Manual mapping of IDs to documents | Automatic docstore integration |
| **Metadata** | Separate storage required | Built into LangChain Documents |
| **Scaling** | Complex index optimization required | FAISS factory strings: `"IVF100,PQ8"` |
| **Setup Time** | 4-6 hours | 10 minutes |
| **Code Required** | 500+ lines | ~30 lines with LangChain |

---

## 🎯 Next Steps

### Related Guides

- **[LangChain Integration](LANGCHAIN.md)** - Use FAISS as a vector store in LangChain
- **[LlamaIndex Integration](LLAMA_INDEX.md)** - Use FAISS with LlamaIndex
- **[RAG Pipelines Guide](RAG_PIPELINES.md)** - Build complete RAG systems
- **[INTEGRATIONS.md](INTEGRATIONS.md)** - See all integration options

### Resources

- **FAISS Wiki:** https://github.com/facebookresearch/faiss/wiki
- **LangChain FAISS:** https://python.langchain.com/docs/integrations/vectorstores/faiss
- **Support:** https://github.com/yusufkaraaslan/Skill_Seekers/discussions

---

**Questions?** Open an issue: https://github.com/yusufkaraaslan/Skill_Seekers/issues

**Website:** https://skillseekersweb.com/

**Last Updated:** February 7, 2026