Distance Metrics (Similarity Indexes) in Vector Stores
In modern machine learning applications—especially those dealing with embeddings from language, vision, or recommendation models—similarity and distance metrics are critical to determining how “close” two data points are in vector space.
When you're working with a vector store (such as Milvus, Pinecone, Weaviate, or Qdrant), these metrics define how retrieval is performed: what counts as "most similar"?
In this guide, we take a deep dive into the most commonly used distance metrics: cosine similarity, dot product, Euclidean (L2) distance, and Manhattan (L1) distance.
We'll explain how each one works, cover its pros and cons, and finish with code that shows them in action.
1️⃣ Cosine Similarity
Concept
Cosine similarity measures the cosine of the angle between two vectors. Think of two arrows pointing in space: it compares their direction, not their length. Two vectors with the same orientation (even if different lengths) will have high cosine similarity.
It answers the question:
"How similar is the direction of two vectors?"
Properties:
Scores range from -1 (opposite directions) through 0 (orthogonal) to 1 (same direction), and the result is unaffected by vector magnitude.
Use When:
Vector length carries no meaning and only orientation matters, which is the common default for text embeddings in vector stores.
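To make the idea concrete, here is a minimal NumPy sketch; the two example vectors are made up purely for illustration:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the length

# cosine similarity = (a . b) / (||a|| * ||b||)
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_sim)  # 1.0: identical direction, even though the lengths differ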
2️⃣ Dot Product
Concept
The dot product measures the algebraic overlap between two vectors: it multiplies the components pairwise and sums them up. It is cosine similarity without the normalization by vector length, so it is larger for vectors that are both well aligned and large in magnitude.
It answers:
"How strongly aligned are these two vectors, including their magnitudes?"
Properties:
Unbounded; it grows with both alignment and magnitude, and it equals cosine similarity when the vectors are normalized to unit length.
Use When:
Magnitude is meaningful, for example embeddings trained with a dot-product objective, or recommendation models where vector length encodes strength of a signal.
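A minimal NumPy sketch, again with made-up vectors, shows how magnitude changes the score even when the direction is identical:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the magnitude

# dot product = sum of element-wise products
print(np.dot(a, a))  # 14.0
print(np.dot(a, b))  # 28.0: same direction, but the longer vector scores higher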
3️⃣ Euclidean Distance
Concept
Also known as L2 distance, it measures the straight-line (“as the crow flies”) distance between two vectors in space. Think of it as the geometric length between two points.
It answers:
"How far apart are these vectors in physical space?"
Properties:
Always non-negative, zero only for identical vectors, and sensitive to both direction and magnitude.
Use When:
Absolute position in the embedding space matters, for example in clustering or with embeddings where coordinate-wise differences are meaningful.
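Here is a minimal NumPy sketch with made-up vectors; np.linalg.norm gives the same result as the explicit formula:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# Euclidean (L2) distance = sqrt(sum of squared differences)
d = np.sqrt(np.sum((a - b) ** 2))
print(d)                      # 5.0
print(np.linalg.norm(a - b))  # 5.0, same result via NumPy's norm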
4️⃣ Manhattan Distance
Concept
Also called L1 norm or Taxicab distance, this measures the sum of absolute differences between two vectors. Imagine walking in a city grid where you can only move up/down or left/right—not diagonally.
It answers:
"How far are the vectors if you can only move in axis-aligned steps?"
Properties:
Always non-negative; it sums the per-dimension differences, so a single large coordinate difference dominates less than it does with Euclidean distance.
Use When:
Features are axis-aligned or sparse, or you want a distance that is more robust to outliers in individual dimensions.
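A minimal NumPy sketch with made-up vectors (compare with the Euclidean example above, which gave 5.0 for the same pair):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# Manhattan (L1) distance = sum of absolute differences
d = np.sum(np.abs(a - b))
print(d)  # 7.0: three steps along the first axis plus four along the second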
Run the example code below to see how each of these metrics behaves on real sentence embeddings.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
# 1. Initialize the model
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
# 2. List of sentences to encode (the "database")
sentences = [
    "I love machine learning.",
    "Artificial intelligence is fascinating.",
    "Python is a great programming language.",
    "I enjoy building models with data.",
    "Natural language processing is a part of AI."
]
# 3. Convert sentences into embeddings (vectors)
sentence_embeddings = model.encode(sentences)
# 4. Function to find similar sentences using different similarity metrics
def find_similar_sentences(query, metric='cosine', top_n=3):
    # Encode the query sentence
    query_embedding = model.encode([query])

    # Select similarity metric
    if metric == 'cosine':
        similarities = cosine_similarity(query_embedding, sentence_embeddings)
    elif metric == 'dot_product':
        # Compute dot product similarity
        similarities = np.dot(query_embedding, sentence_embeddings.T)
    elif metric == 'euclidean':
        # Compute squared Euclidean distance and convert it to a similarity score
        # (the ranking is the same as for the true distance)
        similarities = [1 / (1 + np.sum((sentence_embeddings - query_embedding) ** 2, axis=1))]
    elif metric == 'manhattan':
        # Compute Manhattan (L1) distance and convert it to a similarity score
        similarities = [1 / (1 + np.sum(np.abs(sentence_embeddings - query_embedding), axis=1))]
    else:
        raise ValueError("Unsupported similarity metric. Choose from 'cosine', 'dot_product', 'euclidean', 'manhattan'")

    # Get the top_n most similar sentences
    similar_indices = np.argsort(similarities[0])[::-1][:top_n]
    similar_sentences = [(sentences[i], similarities[0][i]) for i in similar_indices]
    return similar_sentences
# Example query
query_sentence = "I like learning about AI."
similar_sentences_cosine = find_similar_sentences(query_sentence, metric='cosine', top_n=2)
similar_sentences_dot = find_similar_sentences(query_sentence, metric='dot_product', top_n=2)
similar_sentences_euclidean = find_similar_sentences(query_sentence, metric='euclidean', top_n=2)
similar_sentences_manhattan = find_similar_sentences(query_sentence, metric='manhattan', top_n=2)
# Output the results
print("Cosine Similarity:")
for sentence, score in similar_sentences_cosine:
print(f"Sentence: {sentence}, Similarity Score: {score:.4f}")
print("\nDot Product Similarity:")
for sentence, score in similar_sentences_dot:
print(f"Sentence: {sentence}, Similarity Score: {score:.4f}")
print("\nEuclidean Distance (converted to similarity):")
for sentence, score in similar_sentences_euclidean:
print(f"Sentence: {sentence}, Similarity Score: {score:.4f}")
print("\nManhattan Distance (converted to similarity):")
for sentence, score in similar_sentences_manhattan:
print(f"Sentence: {sentence}, Similarity Score: {score:.4f}")
>>>Cosine Similarity:
>>>Sentence: Artificial intelligence is fascinating., Similarity Score: 0.6267
>>>Sentence: I love machine learning., Similarity Score: 0.6036
>>>Dot Product Similarity:
>>>Sentence: I love machine learning., Similarity Score: 33.3234
>>>Sentence: Artificial intelligence is fascinating., Similarity Score: 32.4035
>>>Euclidean Distance (converted to similarity):
>>>Sentence: Artificial intelligence is fascinating., Similarity Score: 0.0250
>>>Sentence: I love machine learning., Similarity Score: 0.0223
>>>Manhattan Distance (converted to similarity):
>>>Sentence: Artificial intelligence is fascinating., Similarity Score: 0.0103
>>>Sentence: Natural language processing is a part of AI., Similarity Score: 0.0098
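One detail worth knowing when you configure a vector store: if the embeddings are L2-normalized first, dot product and cosine similarity produce identical scores, and therefore identical rankings, because the magnitude term drops out. A quick sketch, reusing the model, sentences, and query defined above:

# With unit-length vectors, the dot product equals cosine similarity
query_vec = model.encode([query_sentence])
norm_db = sentence_embeddings / np.linalg.norm(sentence_embeddings, axis=1, keepdims=True)
norm_q = query_vec / np.linalg.norm(query_vec, axis=1, keepdims=True)
print(np.dot(norm_q, norm_db.T))                          # dot product on normalized vectors
print(cosine_similarity(query_vec, sentence_embeddings))  # matches the line above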