
Building a Semantic Search Engine with Transformers and FAISS

AI/ML · NLP · Python · Semantic Search

What You'll Build

A semantic search engine that finds research papers by meaning, not keywords. Search for "attention mechanism in neural networks" and find relevant papers even if they don't contain those exact words.

By the end of this tutorial, you will:

  • Generate 768-dimensional embeddings for 41,000 ML papers
  • Build a GPU-accelerated similarity search index with FAISS
  • Query papers in near real-time by semantic similarity

Prerequisites

Required:

  • Python 3.8+
  • NVIDIA GPU with CUDA 12.1+ (for GPU acceleration)
  • 8GB+ RAM
  • Git LFS installed

Knowledge assumed:

  • Basic Python and pandas
  • Familiarity with machine learning concepts

Tech Stack

Component         Purpose
PyTorch + CUDA    GPU-accelerated deep learning
DistilBERT        Generates 768-dimensional text embeddings
FAISS-GPU         Fast similarity search across vectors
Pandas            Dataset loading and manipulation

Step 1: Clone and Set Up Environment

git lfs install
git clone https://github.com/sheygs/semantic-search.git
cd semantic-search

python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

Why Git LFS? The dataset and pre-computed embeddings exceed GitHub's 100MB file limit. Git LFS downloads these large files separately.

Expected result: Repository cloned with data/ folder containing research_papers.json.


Step 2: Install Dependencies

pip install pandas scikit-learn
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install sentence-transformers faiss-gpu
pip install "numpy<2.0"

Package                  Purpose
sentence-transformers    Converts text to semantic embeddings
faiss-gpu                GPU-accelerated similarity search
numpy<2.0                Pinned to avoid FAISS compatibility issues
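
To confirm the installs took effect before continuing (a quick check, not part of the original steps):

import numpy, torch
print(numpy.__version__)   # should be below 2.0, per the pin above
print(torch.version.cuda)  # should report 12.1 for the cu121 wheels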

Step 3: Import Libraries and Verify GPU

import pickle
import pandas as pd
import torch
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
from sklearn import preprocessing

# Verify GPU access
print(f"GPUs available: {faiss.get_num_gpus()}")
# Expected output: GPUs available: 1 (or more)

If GPU count is 0: the index will have to stay on the CPU, which is significantly slower, and the GPU transfer in Step 8 will not work. Check your CUDA installation.


Step 4: Load the Dataset

df = pd.read_json("../data/research_papers.json")
df = df.drop(["author", "link", "tag"], axis=1)

print(f"Papers loaded: {len(df)}")
# Expected output: Papers loaded: 41000

df.head()

The dataset contains 41,000 ML research papers with id, title, summary, year, month, and day columns. We drop metadata columns not needed for search.
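
Optionally, a quick sanity check before encoding (not in the original notebook) confirms the remaining columns and that no summaries are missing:

print(df.columns.tolist())
# Should list the id, title, summary, year, month, and day columns mentioned above

print(df['summary'].isna().sum())
# Should be 0; drop or fill any rows with missing summaries before encoding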


Step 5: Load the Embedding Model

model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

print(f"Device: {device}")
# Expected output: Device: cuda

Model breakdown:

  • distilbert: 40% smaller, 60% faster than BERT
  • nli-stsb: Fine-tuned for semantic similarity tasks
  • mean-tokens: Pools token embeddings into a single sentence vector

Note: This model is deprecated. For production, check sentence-transformers documentation for current alternatives like all-MiniLM-L6-v2.
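
If you do swap in a newer model, remember that the embedding size changes with it. For example, all-MiniLM-L6-v2 produces 384-dimensional vectors, so the shapes reported in later steps would differ:

model = SentenceTransformer('all-MiniLM-L6-v2')  # 384-dimensional embeddings instead of 768
model = model.to(device)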


Step 6: Generate Embeddings

Option A: Generate new embeddings (~43 seconds on GPU)

embeddings = model.encode(df.summary.to_list(), show_progress_bar=True)

# Save for future use
with open('../data/embeddings.pickle', 'wb') as f:
    pickle.dump(embeddings, f)

print(f"Shape: {embeddings.shape}")
# Expected output: Shape: (41000, 768)

Option B: Load pre-computed embeddings

with open('../data/embeddings.pickle', 'rb') as f:
    embeddings = pickle.load(f)

print(f"Loaded {len(embeddings)} embeddings of dimension {embeddings.shape[1]}")
# Expected output: Loaded 41000 embeddings of dimension 768

Each paper summary becomes a 768-dimensional vector. Similar concepts have similar vectors, even with different wording.
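
To see this on a toy example (the sentences below are made up for illustration), encode two related phrases and one unrelated one, then compare cosine similarities:

a, b, c = model.encode([
    "attention mechanism in neural networks",
    "transformers use self-attention to weigh input tokens",
    "recipes for baking sourdough bread",
])

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(a, b))  # noticeably higher: related topics
print(cosine(a, c))  # noticeably lower: unrelated topic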


Step 7: Prepare Data for FAISS

label_encoder = preprocessing.LabelEncoder()
df['encoded_id'] = label_encoder.fit_transform(df['id'])

print(f"ID range: 0 to {df['encoded_id'].max()}")
# Expected output: ID range: 0 to 40999

FAISS uses integer indices internally. LabelEncoder maps string paper IDs to integers (0, 1, 2, ..., 40999).
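
The mapping is reversible, which is useful when you want to go from an integer back to the original paper ID (a quick illustration, not part of the original flow):

print(label_encoder.inverse_transform([0, 1, 2]))
# Prints the string paper IDs behind the first three encoded values (they depend on the dataset)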


Step 8: Build the FAISS Index

# Normalize for cosine similarity
faiss.normalize_L2(embeddings)

# Create GPU index
dimension = embeddings.shape[1]  # 768
index = faiss.IndexFlatL2(dimension)
gpu_resource = faiss.StandardGpuResources()
gpu_index = faiss.index_cpu_to_gpu(gpu_resource, 0, index)

# Add all embeddings
gpu_index.add(embeddings.astype('float32'))

print(f"Index size: {gpu_index.ntotal} vectors")
# Expected output: Index size: 41000 vectors

What each step does:

  1. L2 normalization rescales every vector to unit length, so the squared L2 distances FAISS returns carry the same ranking as cosine similarity (see the quick check after this list)
  2. IndexFlatL2 — Exact search (no approximation)
  3. index_cpu_to_gpu — Moves index to GPU 0
  4. add() — Inserts all embeddings into the searchable index
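
The quick check below (illustrative only) confirms the relationship Step 9 relies on: for unit-length vectors, the squared L2 distance that IndexFlatL2 returns equals 2 - 2·cos, so cosine similarity can be recovered as 1 - distance / 2.

u, v = embeddings[0], embeddings[1]
squared_l2 = float(np.sum((u - v) ** 2))
cos = float(np.dot(u, v))         # vectors are already unit length after normalize_L2

print(squared_l2, 2 - 2 * cos)    # the two values should match up to float error
print(cos, 1 - squared_l2 / 2)    # and so should these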

Step 9: Implement Semantic Search

def semantic_search(query, k=5):
    # Encode query with same model
    query_embedding = model.encode([query], convert_to_numpy=True)
    faiss.normalize_L2(query_embedding)

    # Find k nearest neighbors
    distances, indices = gpu_index.search(query_embedding.astype('float32'), k)

    # Retrieve matching papers
    results = df.iloc[indices[0]].copy()
    results['similarity'] = 1 - (distances[0] / 2)  # squared L2 on unit vectors -> cosine similarity

    return results[['title', 'summary', 'similarity']]

Step 10: Test the Search

results = semantic_search("attention mechanism in neural networks", k=3)
print(results[['title', 'similarity']])

Expected output:

                                               title  similarity
12847  Attention Is All You Need                      0.847
8923   Self-Attention with Relative Position Repr...  0.812
15234  Effective Approaches to Attention-based Ne...  0.798

The search finds semantically related papers, not just keyword matches.
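
To see the difference from keyword matching, try a query that avoids the word "attention" entirely (results will depend on the dataset, so no expected output is shown here):

results = semantic_search("how models decide which input words to focus on", k=3)
print(results[['title', 'similarity']])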


Verification Checklist

  • faiss.get_num_gpus() returns ≥1
  • Embeddings shape is (41000, 768)
  • Index contains 41,000 vectors
  • Search returns relevant papers with similarity scores

Performance Summary

Operation                            Time
Embedding generation (41K papers)    ~43 seconds
Search query                         ~50-100 ms
Load pre-computed embeddings         ~2 seconds

Troubleshooting

"No GPU available" (faiss.get_num_gpus() returns 0)

  • Verify CUDA installation: nvidia-smi
  • Reinstall faiss-gpu: pip uninstall faiss-gpu && pip install faiss-gpu
  • Check PyTorch CUDA: torch.cuda.is_available()
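
If the GPU build refuses to cooperate, the same index also works on the CPU. A minimal sketch (the build_index helper is only an illustration, not part of the repository):

def build_index(embeddings):
    index = faiss.IndexFlatL2(embeddings.shape[1])
    if faiss.get_num_gpus() > 0:
        resources = faiss.StandardGpuResources()
        index = faiss.index_cpu_to_gpu(resources, 0, index)
    index.add(embeddings.astype('float32'))
    return index

# Steps 8-10 then work unchanged, just slower on CPU.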

"NumPy version incompatibility"

  • Pin NumPy: pip install "numpy<2.0"

"Out of memory" during embedding generation

  • Reduce the batch size: model.encode(..., batch_size=16) (the sentence-transformers default is 32; see the sketch below)
  • Use CPU fallback if GPU memory is limited
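
A minimal sketch combining both suggestions (the batch size and device are illustrative values, not tuned):

embeddings = model.encode(
    df.summary.to_list(),
    batch_size=16,            # lower this further if memory is still tight
    show_progress_bar=True,
    device='cpu',             # drop this line to keep using the GPU
)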

Next Steps

  • Scale up: Use IndexIVFFlat for datasets with millions of vectors (see the sketch after this list)
  • Try different models: all-MiniLM-L6-v2 is faster with similar quality
  • Add filtering: Combine semantic search with metadata filters (year, author)
  • Deploy: Wrap in FastAPI for a production-ready API
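
A rough sketch of the IVF variant (nlist and nprobe are illustrative values and need tuning against a real corpus):

nlist = 256                                    # number of clusters to partition the vectors into
quantizer = faiss.IndexFlatL2(dimension)
ivf_index = faiss.IndexIVFFlat(quantizer, dimension, nlist)

ivf_index.train(embeddings.astype('float32'))  # IVF indexes must be trained before adding vectors
ivf_index.add(embeddings.astype('float32'))
ivf_index.nprobe = 16                          # clusters searched per query; higher is slower but more accurate

query = model.encode(["attention mechanism in neural networks"], convert_to_numpy=True)
faiss.normalize_L2(query)
distances, indices = ivf_index.search(query.astype('float32'), 5)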

Repository

Complete implementation: github.com/sheygs/semantic-search


Questions about semantic search? Feel free to reach out.