Semantic search uses vector embeddings to retrieve information based on the meaning of queries and documents, rather than simple keyword matching. Tools like FAISS (Facebook AI Similarity Search) and Annoy (Approximate Nearest Neighbors Oh Yeah) are popular libraries for efficient similarity search in high-dimensional embedding spaces.
This guide explores the implementation of semantic search using both FAISS and Annoy.
Why Use Semantic Search?
- Improved Relevance:
  - Retrieves documents or items based on meaning, not just keywords.
- Scalability:
  - Handles millions of vectors with approximate nearest neighbor (ANN) algorithms.
- Versatility:
  - Applies to diverse use cases such as text retrieval, product recommendations, and image search.
Tools Overview
Tool | Description | Best For |
---|---|---|
FAISS | Library for exact and approximate similarity search over dense vectors, with optional GPU acceleration. | Large datasets and GPU-based acceleration. |
Annoy | Uses random projection trees for fast approximate nearest neighbor search on CPU. | Smaller datasets or scenarios requiring fast setup and lightweight indexing. |
Pipeline Overview
- Generate Embeddings:
  - Convert text or data into dense vector representations using models like Sentence Transformers or OpenAI Embeddings.
- Build an Index:
  - Use FAISS or Annoy to create an index for the embeddings.
- Perform Search:
  - Search the index to retrieve the nearest vectors to a query embedding.
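The three steps above can be sketched without any external library. This is only an illustration of the flow, not of how FAISS or Annoy work internally: the vectors are hand-written stand-ins for real model embeddings, and `BruteForceIndex` is a hypothetical class that compares the query against every stored vector.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

class BruteForceIndex:
    """Hypothetical exact index: stores vectors and scans all of them per query."""

    def __init__(self):
        self.vectors = []

    def add(self, vector):
        self.vectors.append(list(vector))

    def search(self, query, k):
        """Return the k (position, similarity) pairs with the highest similarity."""
        scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(self.vectors)]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]

# Step 1: embeddings (toy 3-d vectors standing in for model output)
toy_embeddings = {
    "doc about Paris": [0.9, 0.1, 0.0],
    "doc about Rome": [0.1, 0.9, 0.0],
    "doc about Beijing": [0.0, 0.1, 0.9],
}
labels = list(toy_embeddings)

# Step 2: build the index
index = BruteForceIndex()
for label in labels:
    index.add(toy_embeddings[label])

# Step 3: search with a query vector that points mostly along the first axis
results = index.search([0.8, 0.2, 0.1], k=2)
for position, score in results:
    print(f"{labels[position]} (similarity: {score:.3f})")
```

A full scan like this is exact but costs one comparison per stored vector, which is the cost that ANN libraries are designed to reduce.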
1. Semantic Search with FAISS
Installation
Install FAISS:
pip install faiss-cpu
# For GPU support:
# pip install faiss-gpu
Code Implementation
a. Generate Embeddings
Use a pre-trained model to generate embeddings (e.g., Sentence Transformers):
from sentence_transformers import SentenceTransformer
# Load pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')
# Sample data
documents = [
"The Eiffel Tower is located in Paris.",
"The Colosseum is in Rome.",
"The Great Wall of China is in Beijing."
]
# Generate embeddings
embeddings = model.encode(documents)
b. Build FAISS Index
Create and populate a FAISS index:
import faiss
import numpy as np
# Convert embeddings to a float32 NumPy array (the dtype FAISS expects)
embeddings = np.array(embeddings).astype('float32')
embedding_dim = embeddings.shape[1]
# Initialize FAISS index
index = faiss.IndexFlatL2(embedding_dim)  # Exact (brute-force) search using L2 (Euclidean) distance
# Add embeddings to the index
index.add(embeddings)
print(f"Number of vectors in the index: {index.ntotal}")
c. Perform Search
Query the index:
# Query text
query = "Where is the Eiffel Tower?"
query_embedding = model.encode([query]).astype('float32')
# Search for the nearest neighbors
k = 2 # Number of results to retrieve
distances, indices = index.search(query_embedding, k)
# Print results
print("Top results:")
for i, idx in enumerate(indices[0]):
    print(f"{i+1}: {documents[idx]} (Distance: {distances[0][i]:.2f})")
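One detail worth knowing when reading these distances: `IndexFlatL2` reports squared Euclidean distances, so take a square root if you need the true L2 distance. The sketch below mimics that behaviour in plain Python on toy vectors. The names `squared_l2` and `flat_l2_search` are illustrative only and are not FAISS functions.

```python
import math

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flat_l2_search(vectors, query, k):
    """Exhaustive search: (distances, indices) of the k nearest vectors, nearest first."""
    order = sorted(range(len(vectors)), key=lambda i: squared_l2(vectors[i], query))[:k]
    return [squared_l2(vectors[i], query) for i in order], order

vectors = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]]
distances, indices = flat_l2_search(vectors, [0.0, 0.0], k=2)

# Convert squared distances to true Euclidean distances
true_distances = [math.sqrt(d) for d in distances]
print(indices, distances, true_distances)
```

Squaring does not change the ranking, so the conversion only matters when the distance values themselves are shown or thresholded.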
2. Semantic Search with Annoy
Installation
Install Annoy:
pip install annoy
Code Implementation
a. Generate Embeddings
(Use the same embedding generation as above.)
b. Build Annoy Index
Create and populate an Annoy index:
from annoy import AnnoyIndex
# Initialize Annoy index
embedding_dim = embeddings.shape[1]
index = AnnoyIndex(embedding_dim, metric='angular')  # Angular distance: sqrt(2 * (1 - cosine similarity))
# Add embeddings to the index
for i, embedding in enumerate(embeddings):
    index.add_item(i, embedding)
# Build the index
n_trees = 10  # Number of trees (more trees = higher accuracy, but a larger index and slower build)
index.build(n_trees)
index.save('annoy_index.ann')
c. Perform Search
Query the Annoy index:
# Load the saved index into a fresh AnnoyIndex (e.g., in a later session)
index = AnnoyIndex(embedding_dim, metric='angular')
index.load('annoy_index.ann')
# Query text
query = "Where is the Eiffel Tower?"
query_embedding = model.encode([query])[0]
# Search for the nearest neighbors
k = 2 # Number of results to retrieve
indices, distances = index.get_nns_by_vector(query_embedding, k, include_distances=True)
# Print results
print("Top results:")
for i, idx in enumerate(indices):
    print(f"{i+1}: {documents[idx]} (Distance: {distances[i]:.2f})")
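Note that Annoy's `angular` metric is not cosine similarity itself. Its documentation describes it as the Euclidean distance between normalized vectors, which works out to sqrt(2(1 - cos)). If you want a cosine score for display or thresholding, the distance can be converted back. A small sketch follows; `angular_distance` and `angular_to_cosine` are hypothetical helpers, not part of Annoy.

```python
import math

def angular_distance(a, b):
    """Euclidean distance between the L2-normalized copies of a and b."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return math.sqrt(sum((x / norm_a - y / norm_b) ** 2 for x, y in zip(a, b)))

def angular_to_cosine(distance):
    """Invert d = sqrt(2 * (1 - cos)) to recover the cosine similarity."""
    return 1.0 - (distance ** 2) / 2.0

# Two vectors that are 45 degrees apart
d = angular_distance([1.0, 0.0], [1.0, 1.0])
print(f"angular distance: {d:.4f}, cosine similarity: {angular_to_cosine(d):.4f}")
```

Under this definition a distance of 0 corresponds to identical directions and a distance of 2 to opposite directions.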
Comparison of FAISS and Annoy
Aspect | FAISS | Annoy |
---|---|---|
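Accuracy | Exact with flat indices such as IndexFlatL2; approximate index types trade some recall for speed. | Approximate; recall is adjustable with n_trees. |
Speed | Fast on large datasets, particularly with GPU support. | Fast on small to medium datasets; runs on CPU only. |
Index Size | Depends on the index type: flat indices store every vector in full, while compressed types (e.g., product quantization) are more compact. | Grows with n_trees, since each tree adds to the index. |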
Ease of Use | Slightly steeper learning curve. | Easy to implement and deploy. |
Best Use Case | Large-scale, high-performance applications. | Lightweight, quick setup for smaller projects. |
Tips for Effective Semantic Search
- Choose the Right Embedding Model:
  - Use pre-trained models like Sentence Transformers (all-MiniLM-L6-v2) for general-purpose tasks.
  - Fine-tune models for domain-specific data.
- Optimize Index Parameters:
  - For FAISS: Experiment with clustering-based indices (e.g., IndexIVFFlat) for faster searches.
  - For Annoy: Increase n_trees to improve accuracy.
- Normalize Embeddings:
  - Normalize vectors to unit length when using cosine similarity. On unit vectors, the inner product equals cosine similarity, and L2 distance produces the same ranking.
- Handle Large Datasets:
  - For FAISS, use GPU support to scale to millions of vectors.
  - For Annoy, shard the index if memory becomes a bottleneck.
- Combine with Metadata:
  - Enhance search results by combining vector embeddings with metadata filters (e.g., tags, categories).
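Two of these tips, normalization and metadata filtering, can be illustrated together in plain Python. The sketch L2-normalizes the vectors, after which the inner product equals cosine similarity, and then filters the ranked hits by a metadata field (post-filtering). The records, the `category` field, and the helper names are all made up for illustration; neither FAISS nor Annoy is involved.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical records: toy vectors plus one metadata field
records = [
    {"text": "Eiffel Tower guide", "category": "travel", "vector": [2.0, 0.2, 0.0]},
    {"text": "Paris restaurant list", "category": "food", "vector": [1.8, 0.6, 0.0]},
    {"text": "Colosseum history", "category": "travel", "vector": [0.2, 3.0, 0.1]},
]
unit_vectors = [l2_normalize(r["vector"]) for r in records]

def search(query_vector, k, category=None):
    """Rank all records by cosine similarity, then keep only the matching category."""
    q = l2_normalize(query_vector)
    ranked = sorted(range(len(records)), key=lambda i: dot(q, unit_vectors[i]), reverse=True)
    if category is not None:
        ranked = [i for i in ranked if records[i]["category"] == category]
    return [(records[i]["text"], dot(q, unit_vectors[i])) for i in ranked[:k]]

query = [1.0, 0.3, 0.0]
unfiltered = search(query, k=2)
travel_only = search(query, k=2, category="travel")
print(unfiltered)
print(travel_only)
```

This sketch ranks every record before filtering. With an ANN index that returns only the top k, a common workaround is to request more than k candidates so that enough remain after the filter is applied.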
Conclusion
- FAISS is ideal for large-scale, high-accuracy applications, especially when GPU acceleration is available.
- Annoy is lightweight and well-suited for smaller datasets or scenarios requiring quick setup.