FAISS Integration with Skill Seekers

Status: Production Ready
Difficulty: Intermediate
Last Updated: February 7, 2026


The Problem

Building RAG applications with FAISS involves several challenges:

  1. Manual Index Configuration - Choosing the right FAISS index type (Flat, IVF, HNSW, PQ) requires understanding each type's trade-offs between speed, recall, and memory
  2. Embedding Management - Embeddings must be generated and stored separately, and document IDs tracked manually
  3. Billion-Scale Complexity - Optimizing for large datasets (>1M vectors) requires index training and parameter tuning

Example Pain Point:

# Manual FAISS setup for each framework
import faiss
import numpy as np
from openai import OpenAI

# Generate embeddings
client = OpenAI()
embeddings = []
for doc in documents:
    response = client.embeddings.create(
        model="text-embedding-ada-002",
        input=doc
    )
    embeddings.append(response.data[0].embedding)

# Create index
dimension = 1536
index = faiss.IndexFlatL2(dimension)
index.add(np.array(embeddings, dtype="float32"))  # FAISS expects float32 vectors

# Save index + metadata separately (complex!)
faiss.write_index(index, "index.faiss")
# ... manually track which ID maps to which document

The Solution

Skill Seekers automates FAISS integration with structured, production-ready data:

Benefits:

  • Auto-formatted documents with consistent metadata
  • Works with LangChain FAISS wrapper for easy ID tracking
  • Supports flat (small datasets) and IVF (large datasets) indexes
  • GPU acceleration compatible (billion-scale search)
  • Serialization-ready for production deployment

Result: 10-minute setup, production-ready similarity search that scales to billions of vectors.


Quick Start (10 Minutes)

Prerequisites

# Quote version specifiers so the shell does not treat ">=" as a redirect

# Install FAISS (CPU version)
pip install "faiss-cpu>=1.7.4"

# For GPU support (if available)
pip install "faiss-gpu>=1.7.4"

# Install LangChain for easy FAISS wrapper
pip install "langchain>=0.1.0" "langchain-community>=0.0.20"

# OpenAI for embeddings
pip install "openai>=1.0.0"

# Or with Skill Seekers
pip install "skill-seekers[all-llms]"

What you need:

  • Python 3.10+
  • OpenAI API key (for embeddings)
  • Optional: CUDA GPU for billion-scale search

Generate FAISS-Ready Documents

# Step 1: Scrape documentation
skill-seekers scrape --config configs/react.json

# Step 2: Package for LangChain (FAISS-compatible)
skill-seekers package output/react --target langchain

# Output: output/react-langchain.json (FAISS-ready)
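
Before spending API calls on embeddings, it can help to confirm the packaged file has the shape the loading code in this guide expects: a JSON list of objects, each with a `page_content` string and a `metadata` object. The sketch below is a minimal check under that assumption; `validate_langchain_docs` is a hypothetical helper, not part of Skill Seekers, and the sample record is made up for illustration.

```python
import json


def validate_langchain_docs(raw: str) -> list[dict]:
    """Parse packaged JSON and check each entry has the expected keys.

    Hypothetical helper: the expected shape is inferred from the loading
    code in this guide, not from a published schema.
    """
    docs = json.loads(raw)
    if not isinstance(docs, list):
        raise ValueError("expected a JSON list of documents")
    for i, doc in enumerate(docs):
        if not isinstance(doc, dict):
            raise ValueError(f"document {i}: expected an object")
        if not isinstance(doc.get("page_content"), str):
            raise ValueError(f"document {i}: missing or non-string page_content")
        if not isinstance(doc.get("metadata"), dict):
            raise ValueError(f"document {i}: missing or non-dict metadata")
    return docs


# Made-up sample record standing in for the contents of the packaged file.
sample = json.dumps([
    {"page_content": "useState returns a state value and a setter.",
     "metadata": {"category": "hooks", "source": "react-docs"}},
])
docs = validate_langchain_docs(sample)
print(f"{len(docs)} document(s) look well-formed")
```

In practice the string would come from reading the packaged file, for example `validate_langchain_docs(open("output/react-langchain.json").read())`.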

Create FAISS Index with LangChain

import json
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_core.documents import Document

# Load documents
with open("output/react-langchain.json") as f:
    docs_data = json.load(f)

# Convert to LangChain Documents
documents = [
    Document(
        page_content=doc["page_content"],
        metadata=doc["metadata"]
    )
    for doc in docs_data
]

# Create FAISS index (embeddings generated automatically)
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
vectorstore = FAISS.from_documents(documents, embeddings)

# Save index
vectorstore.save_local("faiss_index")

print(f"✅ Created FAISS index with {len(documents)} documents")

Query FAISS Index

from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import OpenAIEmbeddings

# Load index (note: only load indexes from trusted sources)
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
vectorstore = FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)

# Similarity search
results = vectorstore.similarity_search(
    query="How do I use React hooks?",
    k=3
)

for i, doc in enumerate(results):
    print(f"\n{i+1}. Category: {doc.metadata['category']}")
    print(f"   Source: {doc.metadata['source']}")
    print(f"   Content: {doc.page_content[:200]}...")

Similarity Search with Scores

# Get scores (with the default flat L2 index these are distances: lower means more similar)
results = vectorstore.similarity_search_with_score(
    query="React state management",
    k=5
)

for doc, score in results:
    print(f"Score: {score:.3f}")
    print(f"Category: {doc.metadata['category']}")
    print(f"Content: {doc.page_content[:150]}...")
    print()
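
When reading these scores, it helps to know what the number means. A flat L2 index in FAISS reports squared Euclidean distance, so a smaller score is a closer match. For unit-length vectors (an assumption here; OpenAI embeddings are normalized), squared L2 distance and cosine similarity are tied by d² = 2 − 2·cos. The sketch below checks that relationship in plain Python with made-up three-dimensional vectors.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance, the quantity a flat L2 index reports."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.hypot(*v)
    return [x / norm for x in v]


# Made-up vectors for illustration only.
query = normalize([1.0, 2.0, 2.0])
near = normalize([1.0, 2.0, 1.5])
far = normalize([-2.0, 0.5, 0.0])

for name, vec in [("near", near), ("far", far)]:
    d2 = squared_l2(query, vec)
    cos = cosine(query, vec)
    # For unit vectors d2 equals 2 - 2 * cos, so smaller scores mean closer matches.
    print(f"{name}: squared L2 = {d2:.3f}, cosine = {cos:.3f}")
```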

📖 Detailed Setup Guide

Step 1: Choose FAISS Index Type

Option A: IndexFlatL2 (Exact Search, <100K vectors)

import faiss

# Flat index: exact nearest neighbors (brute force)
dimension = 1536  # OpenAI ada-002
index = faiss.IndexFlatL2(dimension)

# Pros: 100% accuracy, simple
# Cons: O(n) search time, slow for large datasets
# Use when: <100K vectors, need perfect recall

Option B: IndexIVFFlat (Approximate Search, 100K-10M vectors)

# IVF index: cluster-based approximate search
quantizer = faiss.IndexFlatL2(dimension)
nlist = 100  # Number of clusters
index = faiss.IndexIVFFlat(quantizer, dimension, nlist)

# Train on sample data
index.train(training_vectors)  # Needs ~30*nlist training vectors
index.add(vectors)

# Pros: Faster than flat, good accuracy
# Cons: Requires training; approximate (90-95% recall once nprobe is tuned)
# Use when: 100K-10M vectors
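
To make the cluster-based idea concrete, here is a toy inverted-file search in plain Python. It is a teaching sketch, not FAISS's implementation: the centroids are simply sampled from the data instead of trained with k-means, and all names are illustrative. It shows why recall depends on how many clusters are probed per query, which is what the `nprobe` parameter controls on a real IVF index.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_ivf(vectors, centroids):
    """Assign each vector id to the inverted list of its nearest centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for vid, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: sq_dist(vec, centroids[c]))
        lists[nearest].append(vid)
    return lists


def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the nprobe inverted lists whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))[:nprobe]
    candidates = [vid for c in probe for vid in lists[c]]
    return sorted(candidates, key=lambda vid: sq_dist(query, vectors[vid]))[:k]


rng = random.Random(0)
vectors = [[rng.random(), rng.random()] for _ in range(2000)]
centroids = rng.sample(vectors, 16)  # stand-in for k-means training
lists = build_ivf(vectors, centroids)

query = [0.5, 0.5]
exact = set(sorted(range(len(vectors)), key=lambda vid: sq_dist(query, vectors[vid]))[:10])
for nprobe in (1, 4, 16):
    found = set(ivf_search(query, vectors, centroids, lists, k=10, nprobe=nprobe))
    print(f"nprobe={nprobe:2d}  recall@10={len(found & exact) / 10:.1f}")
```

Probing every list is equivalent to a brute-force scan, so recall reaches 1.0 there; smaller `nprobe` values scan fewer candidates and may miss neighbors that fell into an unprobed cluster.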

Option C: IndexHNSWFlat (Graph-based, High Recall)

# HNSW index: hierarchical navigable small world
index = faiss.IndexHNSWFlat(dimension, 32)  # 32 = M (graph connections)

# Pros: Fast, high recall (>95%), no training
# Cons: Higher memory than flat (stores full vectors plus graph links), slower to build
# Use when: Need speed + high recall, have memory

Option D: IndexIVFPQ (Product Quantization, 10M-1B vectors)

# IVF + PQ: compressed vectors for massive scale
quantizer = faiss.IndexFlatL2(dimension)
nlist = 1000
m = 8  # Number of subvectors
nbits = 8  # Bits per subvector
index = faiss.IndexIVFPQ(quantizer, dimension, nlist, m, nbits)

# Train then add
index.train(training_vectors)
index.add(vectors)

# Pros: Large memory reduction (each vector is stored as an m-byte code when nbits = 8), billion-scale
# Cons: Lower recall (80-90%), complex
# Use when: >10M vectors, memory constrained
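
A rough back-of-the-envelope estimate can help when choosing between these options. The sketch below uses approximate per-vector sizes: 4 bytes per dimension for float32 vectors, about 2 × M four-byte links per vector for HNSW (the estimate given in the FAISS guidelines), and an m × nbits-bit code plus an 8-byte id for IVF+PQ. It ignores centroid tables and allocator overhead, so treat the output as an order-of-magnitude guide; the function names are illustrative.

```python
def flat_bytes(n: int, d: int) -> int:
    """Raw float32 vectors: 4 bytes per dimension."""
    return n * d * 4


def hnsw_bytes(n: int, d: int, M: int) -> int:
    """Full vectors plus roughly 2 * M four-byte links per vector."""
    return n * (d * 4 + M * 2 * 4)


def ivfpq_bytes(n: int, m: int, nbits: int, id_bytes: int = 8) -> int:
    """PQ code of m * nbits bits plus a stored id per vector."""
    return n * (m * nbits // 8 + id_bytes)


n, d = 10_000_000, 1536  # 10M vectors at the ada-002 dimension
for name, size in [
    ("IndexFlatL2", flat_bytes(n, d)),
    ("IndexHNSWFlat (M=32)", hnsw_bytes(n, d, M=32)),
    ("IndexIVFPQ (m=8, nbits=8)", ivfpq_bytes(n, m=8, nbits=8)),
]:
    print(f"{name:28s} {size / 1e9:8.2f} GB")
```

The ratios depend heavily on the embedding dimension: at 1536 dimensions the raw vectors dominate, so graph links add little on top of flat storage, while PQ codes shrink the footprint by orders of magnitude at the cost of recall.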

Step 2: Generate Skill Seekers Documents

Option A: Documentation Website

skill-seekers scrape --config configs/django.json
skill-seekers package output/django --target langchain

Option B: GitHub Repository

skill-seekers github --repo django/django --name django
skill-seekers package output/django --target langchain

Option C: Local Codebase

skill-seekers analyze --directory /path/to/repo
skill-seekers package output/codebase --target langchain

Option D: RAG-Optimized Chunking

skill-seekers scrape --config configs/fastapi.json --chunk-for-rag --chunk-tokens 512
skill-seekers package output/fastapi --target langchain

Step 3: Create FAISS Index (LangChain Wrapper)

import json
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_core.documents import Document

# Load documents
with open("output/django-langchain.json") as f:
    docs_data = json.load(f)

documents = [
    Document(page_content=doc["page_content"], metadata=doc["metadata"])
    for doc in docs_data
]

# Create embeddings
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

# For small datasets (<100K): Use default (Flat)
vectorstore = FAISS.from_documents(documents, embeddings)

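# For large datasets (>100K): from_documents always builds a flat index.
# To use IVF, create and train the index with
# faiss.index_factory(dimension, "IVF100,Flat") and pass it to the
# FAISS(...) constructor instead of calling from_documents.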

# Save index + docstore + metadata
vectorstore.save_local("faiss_index")

print(f"✅ Created FAISS index with {len(documents)} vectors")
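
LangChain's `FAISS.from_documents` creates a flat index internally, so a trained IVF index has to be built with FAISS directly and handed to the `FAISS(...)` constructor. The following is a minimal sketch of that route, assuming `langchain-community`, `faiss`, and `numpy` are installed. `build_ivf_vectorstore` and `suggest_nlist` are hypothetical helper names, and the 4·√N cluster count is a commonly cited rule of thumb from the FAISS guidelines, not a requirement.

```python
import math


def suggest_nlist(num_vectors: int) -> int:
    """Rule of thumb: roughly 4 * sqrt(N) clusters for N vectors."""
    return max(1, int(4 * math.sqrt(num_vectors)))


def build_ivf_vectorstore(documents, embeddings, nlist=None):
    """Build a LangChain FAISS store backed by a trained IVF index.

    Hypothetical helper. `documents` are LangChain Document objects and
    `embeddings` is any LangChain embeddings object.
    """
    # Imported lazily so suggest_nlist stays usable without these packages.
    import faiss
    import numpy as np
    from langchain_community.docstore.in_memory import InMemoryDocstore
    from langchain_community.vectorstores import FAISS

    texts = [doc.page_content for doc in documents]
    metadatas = [doc.metadata for doc in documents]
    vectors = embeddings.embed_documents(texts)

    matrix = np.asarray(vectors, dtype="float32")
    nlist = nlist or suggest_nlist(len(texts))
    index = faiss.index_factory(matrix.shape[1], f"IVF{nlist},Flat")
    index.train(matrix)  # IVF indexes must be trained before vectors are added

    store = FAISS(
        embedding_function=embeddings,
        index=index,
        docstore=InMemoryDocstore(),
        index_to_docstore_id={},
    )
    # Reuse the vectors already computed instead of embedding the texts twice.
    store.add_embeddings(text_embeddings=list(zip(texts, vectors)), metadatas=metadatas)
    return store


print(suggest_nlist(1_000_000))
```

The returned store supports the same `save_local`, `similarity_search`, and `add_documents` calls used elsewhere in this guide; `store.index.nprobe` can then be raised to trade speed for recall.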

Step 4: Query with Filtering

# Load index (only from trusted sources!)
vectorstore = FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)

# Basic similarity search
results = vectorstore.similarity_search(
    query="Django models tutorial",
    k=5
)

# Similarity search with score threshold
results = vectorstore.similarity_search_with_relevance_scores(
    query="Django authentication",
    k=5,
    score_threshold=0.8  # Only return if relevance > 0.8
)

# Maximum marginal relevance (diverse results)
results = vectorstore.max_marginal_relevance_search(
    query="React components",
    k=5,
    fetch_k=20  # Fetch 20, return top 5 diverse
)

# Custom filter function (post-search filtering)
def filter_by_category(docs, category):
    return [doc for doc in docs if doc.metadata.get("category") == category]

results = vectorstore.similarity_search("hooks", k=20)
filtered = filter_by_category(results, "state-management")
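
Post-search filtering can return fewer than the number of results wanted, because matches outside the fetched window are never seen. One way to handle this is to widen the fetch until enough matches appear. The sketch below does that against any search callable; `search_in_category`, `Doc`, and the fake corpus are illustrative stand-ins so the example runs without an index. LangChain's FAISS wrapper also accepts `filter` and `fetch_k` arguments on `similarity_search`, which may cover simple cases directly.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a LangChain Document, used only to make the sketch runnable."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def search_in_category(search, query, category, k=5, fetch_k=20, max_fetch=320):
    """Over-fetch, filter by metadata, and widen the fetch until k matches are found.

    `search(query, k)` is any callable returning ranked documents, for example
    `lambda q, k: vectorstore.similarity_search(q, k=k)`.
    """
    while True:
        hits = [d for d in search(query, fetch_k) if d.metadata.get("category") == category]
        if len(hits) >= k or fetch_k >= max_fetch:
            return hits[:k]
        fetch_k *= 2


# Fake corpus: every seventh document belongs to the target category.
corpus = [Doc(f"doc {i}", {"category": "state-management" if i % 7 == 0 else "other"})
          for i in range(100)]


def fake_search(query, k):
    """Pretend the corpus is already ranked for the query."""
    return corpus[:k]


results = search_in_category(fake_search, "hooks", "state-management", k=5)
print([d.page_content for d in results])
```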

🚀 Advanced Usage

1. GPU Acceleration

import faiss

# Check GPU availability
ngpus = faiss.get_num_gpus()
print(f"GPUs available: {ngpus}")

# Create GPU index
dimension = 1536
cpu_index = faiss.IndexFlatL2(dimension)

# Move to GPU
gpu_index = faiss.index_cpu_to_gpu(
    faiss.StandardGpuResources(),
    0,  # GPU ID
    cpu_index
)

# Add vectors (on GPU)
gpu_index.add(vectors)

# Search (on GPU, 10-100x faster)
distances, indices = gpu_index.search(query_vectors, k=10)

# Move back to CPU for saving
cpu_index = faiss.index_gpu_to_cpu(gpu_index)
faiss.write_index(cpu_index, "index.faiss")

2. Batch Processing for Large Datasets

import json
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_core.documents import Document

embeddings = OpenAIEmbeddings()

# Load documents
with open("output/large-dataset-langchain.json") as f:
    all_docs = json.load(f)

# Create index with first batch
batch_size = 10000
first_batch = [
    Document(page_content=doc["page_content"], metadata=doc["metadata"])
    for doc in all_docs[:batch_size]
]

vectorstore = FAISS.from_documents(first_batch, embeddings)
print(f"Created index with {len(first_batch)} documents")

# Add remaining batches
for i in range(batch_size, len(all_docs), batch_size):
    batch = [
        Document(page_content=doc["page_content"], metadata=doc["metadata"])
        for doc in all_docs[i:i+batch_size]
    ]

    vectorstore.add_documents(batch)
    print(f"Added documents {i} to {i+len(batch)}")

# Save final index
vectorstore.save_local("large_faiss_index")
print(f"✅ Final index size: {len(all_docs)} documents")

3. Index Merging for Multi-Source

# Create separate indexes for different sources
vectorstore1 = FAISS.from_documents(docs1, embeddings)
vectorstore2 = FAISS.from_documents(docs2, embeddings)
vectorstore3 = FAISS.from_documents(docs3, embeddings)

# Merge indexes
vectorstore1.merge_from(vectorstore2)
vectorstore1.merge_from(vectorstore3)

# Save merged index
vectorstore1.save_local("merged_index")

# Query combined index
results = vectorstore1.similarity_search("query", k=10)

📋 Best Practices

1. Choose Index Type by Dataset Size

# <100K vectors: Flat (exact search)
if num_vectors < 100_000:
    vectorstore = FAISS.from_documents(documents, embeddings)

# 100K-1M vectors: IVF
# (from_documents always builds a flat index, so for the larger tiers build the
# index with faiss.index_factory, train it, and pass it to the FAISS(...) constructor)
elif num_vectors < 1_000_000:
    index = faiss.index_factory(dimension, "IVF100,Flat")

# 1M-10M vectors: IVF + PQ
elif num_vectors < 10_000_000:
    index = faiss.index_factory(dimension, "IVF1000,PQ8")

# >10M vectors: GPU + IVF + PQ
else:
    # Use GPU acceleration
    pass

2. Only Load Indexes from Trusted Sources

# ⚠️ SECURITY: Only load indexes you trust!
# The allow_dangerous_deserialization flag exists because
# LangChain stores the docstore with Python's pickle, and unpickling can execute arbitrary code

# ✅ Safe: Your own indexes
vectorstore = FAISS.load_local("my_index", embeddings, allow_dangerous_deserialization=True)

# ❌ Dangerous: Unknown indexes from internet
# vectorstore = FAISS.load_local("untrusted_index", ...)  # DON'T DO THIS
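
One way to back up the "your own indexes" rule is to record checksums when an index is saved and verify them before loading. The sketch below is a minimal illustration using only the standard library. The helper names and `manifest.json` are hypothetical, and the demo uses placeholder files named after the `index.faiss` and `index.pkl` files that `save_local` writes by default.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def write_manifest(index_dir: Path) -> Path:
    """Record a SHA-256 digest for every file in the saved index directory."""
    digests = {p.name: sha256_of(p) for p in sorted(index_dir.iterdir())
               if p.is_file() and p.name != "manifest.json"}
    manifest = index_dir / "manifest.json"
    manifest.write_text(json.dumps(digests, indent=2))
    return manifest


def verify_manifest(index_dir: Path) -> bool:
    """Return True only if every recorded file still matches its digest."""
    expected = json.loads((index_dir / "manifest.json").read_text())
    return all((index_dir / name).is_file() and sha256_of(index_dir / name) == digest
               for name, digest in expected.items())


# Demo with placeholder files standing in for a saved index directory.
with tempfile.TemporaryDirectory() as tmp:
    index_dir = Path(tmp)
    (index_dir / "index.faiss").write_bytes(b"fake index bytes")
    (index_dir / "index.pkl").write_bytes(b"fake pickle bytes")
    write_manifest(index_dir)
    ok_before = verify_manifest(index_dir)
    (index_dir / "index.pkl").write_bytes(b"tampered")
    ok_after = verify_manifest(index_dir)

print(ok_before, ok_after)
```

A manifest stored next to the index can be altered along with it, so in practice the digests would need to live somewhere the index files' writer cannot reach. A checksum detects modification; it does not make an index from an unknown source safe to load.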

3. Use Batch Embedding Generation

from openai import OpenAI

client = OpenAI()

# ✅ Good: Batch API (2048 texts per call)
texts = [doc["page_content"] for doc in documents]

embeddings = []
batch_size = 2048

for i in range(0, len(texts), batch_size):
    batch = texts[i:i + batch_size]
    response = client.embeddings.create(
        model="text-embedding-ada-002",
        input=batch
    )
    embeddings.extend([e.embedding for e in response.data])

# ❌ Bad: One at a time (slow!)
# for text in texts:
#     response = client.embeddings.create(model="text-embedding-ada-002", input=text)
#     embeddings.append(response.data[0].embedding)

🐛 Troubleshooting

Issue: Index Too Large for Memory

Problem: "MemoryError" when loading index with 10M+ vectors

Solutions:

  1. Use Product Quantization:
# Compress each vector to an 8-byte PQ code.
# from_documents cannot build this index; create and train it with FAISS directly.
index = faiss.index_factory(dimension, "IVF1000,PQ8")
index.train(training_vectors)
index.add(vectors)
  2. Use GPU:
# Move to GPU memory
gpu_index = faiss.index_cpu_to_gpu(faiss.StandardGpuResources(), 0, cpu_index)

Issue: Slow Search on Large Index

Problem: Search takes >1 second on 1M+ vectors

Solutions:

  1. Use IVF index:
# from_documents builds a flat index, so create and train the IVF index directly
index = faiss.index_factory(dimension, "IVF100,Flat")
index.train(training_vectors)
index.add(vectors)

# Tune nprobe
index.nprobe = 10  # Balance speed/accuracy
  2. GPU acceleration:
gpu_index = faiss.index_cpu_to_gpu(faiss.StandardGpuResources(), 0, index)

📊 Before vs. After

| Aspect | Without Skill Seekers | With Skill Seekers |
|---|---|---|
| Data Preparation | Custom scraping + embedding generation | One command: skill-seekers scrape |
| Index Creation | Manual FAISS setup with numpy arrays | LangChain wrapper handles complexity |
| ID Tracking | Manual mapping of IDs to documents | Automatic docstore integration |
| Metadata | Separate storage required | Built into LangChain Documents |
| Scaling | Complex index optimization required | Factory strings: "IVF100,PQ8" |
| Setup Time | 4-6 hours | 10 minutes |
| Code Required | 500+ lines | 30 lines with LangChain |

🎯 Next Steps

Resources


Questions? Open an issue: https://github.com/yusufkaraaslan/Skill_Seekers/issues
Website: https://skillseekersweb.com/
Last Updated: February 7, 2026