# Building a Semantic Search Engine with Transformers and FAISS
## What You'll Build
A semantic search engine that finds research papers by meaning, not keywords. Search for "attention mechanism in neural networks" and find relevant papers even if they don't contain those exact words.
By the end of this tutorial, you will:
- Generate 768-dimensional embeddings for 41,000 ML papers
- Build a GPU-accelerated similarity search index with FAISS
- Query papers in near real-time by semantic similarity
## Prerequisites

**Required:**
- Python 3.8+
- NVIDIA GPU with CUDA 12.1+ (for GPU acceleration)
- 8GB+ RAM
- Git LFS installed
**Knowledge assumed:**
- Basic Python and pandas
- Familiarity with machine learning concepts
## Tech Stack
| Component | Purpose |
|---|---|
| PyTorch + CUDA | GPU-accelerated deep learning |
| DistilBERT | Generates 768-dimensional text embeddings |
| FAISS-GPU | Fast similarity search across vectors |
| Pandas | Dataset loading and manipulation |
## Step 1: Clone and Set Up Environment
```bash
git lfs install
git clone https://github.com/sheygs/semantic-search.git
cd semantic-search
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
```
**Why Git LFS?** The dataset and pre-computed embeddings exceed GitHub's 100 MB file limit. Git LFS downloads these large files separately.

**Expected result:** the repository is cloned with a `data/` folder containing `research_papers.json`.
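If you're unsure whether LFS actually fetched the real files, a size check settles it: an un-fetched LFS pointer is a tiny text stub, not the dataset. A minimal sketch, assuming the repository layout described above:

```python
# Sketch: confirm Git LFS fetched the real dataset rather than a pointer stub.
# An LFS pointer file is ~130 bytes of text; the actual JSON is tens of megabytes.
from pathlib import Path

dataset = Path("data/research_papers.json")
size_mb = dataset.stat().st_size / 1e6
print(f"{dataset}: {size_mb:.1f} MB")
if size_mb < 0.001:
    print("Looks like an LFS pointer stub; run `git lfs pull` inside the repo.")
```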
## Step 2: Install Dependencies
```bash
pip install pandas scikit-learn
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install sentence-transformers faiss-gpu
pip install "numpy<2.0"
```
| Package | Purpose |
|---|---|
| `sentence-transformers` | Converts text to semantic embeddings |
| `faiss-gpu` | GPU-accelerated similarity search |
| `numpy<2.0` | Pinned to avoid FAISS compatibility issues |
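Before moving on, it's worth confirming the installs resolve together. A quick sketch (version strings shown are illustrative, not requirements):

```python
# Sketch: verify the key packages import cleanly and NumPy stayed below 2.0
import numpy as np
import torch
import faiss
import sentence_transformers

print(f"numpy {np.__version__}, torch {torch.__version__} (CUDA {torch.version.cuda})")
assert int(np.__version__.split(".")[0]) < 2, "NumPy 2.x can break this faiss-gpu build"
```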
## Step 3: Import Libraries and Verify GPU
```python
import pickle

import faiss
import numpy as np
import pandas as pd
import torch
from sentence_transformers import SentenceTransformer
from sklearn import preprocessing

# Verify GPU access
print(f"GPUs available: {faiss.get_num_gpus()}")
# Expected output: GPUs available: 1 (or more)
```
**If the GPU count is 0:** FAISS cannot move the index to the GPU. You can still complete the tutorial on CPU by skipping the `index_cpu_to_gpu` call in Step 8, but indexing and search will be significantly slower. Check your CUDA installation first.
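PyTorch and faiss-gpu each ship their own CUDA support, so it helps to check both sides. A small sketch:

```python
# Sketch: FAISS and PyTorch bind to CUDA independently, so verify both.
# If torch sees the GPU but faiss does not, the faiss-gpu install is the problem.
print(f"torch sees CUDA: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"device 0: {torch.cuda.get_device_name(0)}")
```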
## Step 4: Load the Dataset
```python
df = pd.read_json("../data/research_papers.json")
df = df.drop(["author", "link", "tag"], axis=1)

print(f"Papers loaded: {len(df)}")
# Expected output: Papers loaded: 41000

df.head()
```
The dataset contains 41,000 ML research papers with `id`, `title`, `summary`, `year`, `month`, and `day` columns. We drop the metadata columns (`author`, `link`, `tag`) that aren't needed for search.
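A couple of quick checks make the frame's shape concrete (column names taken from the description above):

```python
# Sketch: sanity-check the columns that remain after dropping metadata
print(df.columns.tolist())
# Expected: ['id', 'title', 'summary', 'year', 'month', 'day']

# Summary length matters later: longer abstracts take longer to encode
print(df['summary'].str.len().describe())
```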
## Step 5: Load the Embedding Model
```python
model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

print(f"Device: {device}")
# Expected output: Device: cuda
```
**Model breakdown:**

- `distilbert`: 40% smaller and 60% faster than BERT
- `nli-stsb`: fine-tuned for semantic similarity tasks
- `mean-tokens`: pools token embeddings into a single sentence vector

**Note:** this model is deprecated. For production, check the sentence-transformers documentation for current alternatives like `all-MiniLM-L6-v2`.
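If you do swap in a current model, remember that the embedding width changes with it, and the FAISS index dimension in Step 8 must match. A sketch, assuming `all-MiniLM-L6-v2` as the replacement:

```python
# Sketch: newer models often use a different embedding width.
# all-MiniLM-L6-v2 produces 384-dim vectors, so `dimension` in Step 8 changes too.
alt_model = SentenceTransformer('all-MiniLM-L6-v2', device=device)
print(alt_model.get_sentence_embedding_dimension())  # 384
```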
## Step 6: Generate Embeddings
**Option A: generate new embeddings (~43 seconds on GPU)**

```python
embeddings = model.encode(df.summary.to_list(), show_progress_bar=True)

# Save for future use
with open('../data/embeddings.pickle', 'wb') as f:
    pickle.dump(embeddings, f)

print(f"Shape: {embeddings.shape}")
# Expected output: Shape: (41000, 768)
```

**Option B: load pre-computed embeddings**

```python
with open('../data/embeddings.pickle', 'rb') as f:
    embeddings = pickle.load(f)

print(f"Loaded {len(embeddings)} embeddings of dimension {embeddings.shape[1]}")
# Expected output: Loaded 41000 embeddings of dimension 768
```
Each paper summary becomes a 768-dimensional vector. Similar concepts have similar vectors, even with different wording.
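You can verify that property directly with a pair of paraphrases (hypothetical sentences, not from the dataset):

```python
# Sketch: paraphrases share few keywords but land close together in embedding space
a, b = model.encode([
    "neural networks that learn to focus on relevant inputs",
    "attention mechanisms weight important tokens more heavily",
])
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"cosine similarity: {cosine:.3f}")  # well above a random pair of sentences
```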
## Step 7: Prepare Data for FAISS
```python
label_encoder = preprocessing.LabelEncoder()
df['encoded_id'] = label_encoder.fit_transform(df['id'])

print(f"ID range: 0 to {df['encoded_id'].max()}")
# Expected output: ID range: 0 to 40999
```
FAISS identifies vectors by integer IDs internally. `LabelEncoder` maps each string paper ID to an integer (0, 1, 2, ..., 40999).
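The flat index in Step 8 assigns sequential positions automatically, so the encoded IDs mainly serve as stable integer handles. If you want FAISS to return those IDs directly, `IndexIDMap` does it explicitly. A sketch (an alternative, not how Step 8 builds the index):

```python
# Sketch: IndexIDMap makes FAISS return your encoded IDs instead of row positions
demo_index = faiss.IndexIDMap(faiss.IndexFlatL2(embeddings.shape[1]))
demo_index.add_with_ids(embeddings, df['encoded_id'].to_numpy().astype('int64'))
_, returned_ids = demo_index.search(embeddings[:1], 1)
print(label_encoder.inverse_transform(returned_ids[0]))  # back to the string paper ID
```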
## Step 8: Build the FAISS Index
```python
# FAISS expects contiguous float32 arrays; convert once up front
embeddings = np.ascontiguousarray(embeddings, dtype='float32')

# Normalize in place so L2 distance behaves like cosine similarity
faiss.normalize_L2(embeddings)

# Create GPU index
dimension = embeddings.shape[1]  # 768
index = faiss.IndexFlatL2(dimension)
gpu_resource = faiss.StandardGpuResources()
gpu_index = faiss.index_cpu_to_gpu(gpu_resource, 0, index)

# Add all embeddings
gpu_index.add(embeddings)
print(f"Index size: {gpu_index.ntotal} vectors")
# Expected output: Index size: 41000 vectors
```
**What each step does:**

- **L2 normalization**: on unit vectors, squared Euclidean distance is a simple function of cosine similarity (d² = 2 - 2·cos), so an L2 index ranks exactly like a cosine index
- **`IndexFlatL2`**: exact brute-force search, no approximation
- **`index_cpu_to_gpu`**: moves the index onto GPU 0
- **`add()`**: inserts all embeddings into the searchable index
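An equivalent formulation worth knowing: once vectors are unit-normalized, inner product *is* cosine similarity, so `IndexFlatIP` returns similarity scores directly with no distance conversion. A sketch:

```python
# Sketch: with L2-normalized vectors, inner product equals cosine similarity,
# so IndexFlatIP skips the distance-to-similarity conversion used in Step 9
ip_index = faiss.IndexFlatIP(dimension)
ip_index.add(embeddings)                        # already normalized above
sims, ids = ip_index.search(embeddings[:1], 3)  # query with the first paper itself
print(sims[0])                                  # top hit is the paper itself, score ~1.0
```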
## Step 9: Implement Semantic Search
```python
def semantic_search(query, k=5):
    # Encode the query with the same model used for the corpus
    query_embedding = model.encode([query], convert_to_numpy=True)
    faiss.normalize_L2(query_embedding)

    # Find the k nearest neighbors
    distances, indices = gpu_index.search(query_embedding.astype('float32'), k)

    # FAISS returns squared L2 distances; for unit vectors d^2 = 2 - 2*cos,
    # so 1 - d^2/2 recovers the cosine similarity score
    results = df.iloc[indices[0]].copy()
    results['similarity'] = 1 - (distances[0] / 2)
    return results[['title', 'summary', 'similarity']]
```
## Step 10: Test the Search
```python
results = semantic_search("attention mechanism in neural networks", k=3)
print(results[['title', 'similarity']])
```
Expected output:

```text
                                                   title  similarity
12847                          Attention Is All You Need       0.847
8923    Self-Attention with Relative Position Repr...       0.812
15234   Effective Approaches to Attention-based Ne...       0.798
```
The search finds semantically related papers, not just keyword matches.
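One way to see this: phrase the query with none of the obvious keywords. Results are illustrative; actual titles and scores depend on the dataset snapshot.

```python
# Sketch: a query with no keyword overlap should still surface attention papers
results = semantic_search("how do models decide which input words matter most?", k=3)
print(results[['title', 'similarity']])
```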
## Verification Checklist

- [ ] `faiss.get_num_gpus()` returns ≥ 1
- [ ] Embeddings shape is `(41000, 768)`
- [ ] Index contains 41,000 vectors
- [ ] Search returns relevant papers with similarity scores
## Performance Summary
| Operation | Time |
|---|---|
| Embedding generation (41K papers) | ~43 seconds |
| Search query | ~50-100ms |
| Load pre-computed embeddings | ~2 seconds |
## Troubleshooting

**"No GPU available"** (`faiss.get_num_gpus()` returns 0)

- Verify the CUDA installation: `nvidia-smi`
- Reinstall faiss-gpu: `pip uninstall faiss-gpu && pip install faiss-gpu`
- Check PyTorch CUDA: `torch.cuda.is_available()`

**"NumPy version incompatibility"**

- Pin NumPy: `pip install "numpy<2.0"`

**"Out of memory" during embedding generation**

- Reduce the batch size: `model.encode(..., batch_size=32)`
- Fall back to CPU if GPU memory is limited (see the sketch below)
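A sketch of a memory-friendly encoding call (the batch size here is an assumption to tune for your GPU):

```python
# Sketch: smaller batches reduce peak GPU memory; device='cpu' avoids the GPU entirely
embeddings = model.encode(
    df.summary.to_list(),
    batch_size=8,            # default is 32; lower it until encoding fits in VRAM
    show_progress_bar=True,
    # device='cpu',          # uncomment to force the CPU fallback
)
```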
## Next Steps

- **Scale up**: use `IndexIVFFlat` for datasets with millions of vectors (sketched below)
- **Try different models**: `all-MiniLM-L6-v2` is faster with similar quality
- **Add filtering**: combine semantic search with metadata filters (year, author)
- **Deploy**: wrap the search function in FastAPI for a production-ready API
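For the scale-up path, an `IndexIVFFlat` sketch (cluster and probe counts are illustrative assumptions to tune against your recall needs):

```python
# Sketch: IVF partitions vectors into nlist clusters and searches only nprobe of them,
# trading a little recall for large speedups on million-scale corpora
nlist = 256                                  # number of clusters (tune for corpus size)
quantizer = faiss.IndexFlatL2(dimension)
ivf_index = faiss.IndexIVFFlat(quantizer, dimension, nlist)
ivf_index.train(embeddings)                  # IVF must be trained before adding vectors
ivf_index.add(embeddings)
ivf_index.nprobe = 16                        # clusters searched per query
```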
## Repository
Complete implementation: [github.com/sheygs/semantic-search](https://github.com/sheygs/semantic-search)
Questions about semantic search? Feel free to reach out.