Distance Metrics (Similarity Indexes) in Vector Stores

In modern machine learning applications—especially those dealing with embeddings from language, vision, or recommendation models—similarity and distance metrics are critical to determining how “close” two data points are in vector space.

When you're working with a vector store (like Milvus, Pinecone, Weaviate, or Qdrant), these metrics define how retrieval is performed: they determine which stored vectors count as “most similar” to a query.

In this guide, we dive deep into the most commonly used distance metrics:

  1. Cosine Similarity
  2. Dot Product
  3. Euclidean Distance
  4. Manhattan Distance

We’ll explain how each one works, cover its pros and cons, and finish with code to see them all in action.


1️⃣ Cosine Similarity

Concept

Cosine similarity measures the cosine of the angle between two vectors. Think of two arrows pointing in space: it compares their direction, not their length. Two vectors with the same orientation (even if different lengths) will have high cosine similarity.

It answers the question:

"How similar is the direction of two vectors?"

Properties:

  • Normalized: Ignores vector magnitude
  • Range: [−1, 1]

Use When:

  • Comparing text embeddings (e.g., BERT, LLaMA outputs)
  • Measuring semantic similarity where magnitude doesn’t matter
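
A minimal NumPy sketch of the calculation, using two made-up 3-dimensional vectors rather than real embeddings:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the magnitude

# cosine similarity = (a . b) / (||a|| * ||b||)
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_sim)  # prints 1.0: identical direction, so the difference in magnitude is ignored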


2️⃣ Dot Product

Concept

The dot product measures the algebraic overlap between two vectors: it multiplies corresponding components and sums the results. It is essentially an unnormalized cosine similarity, so scores are highest for vectors that are both well aligned and large in magnitude.

It answers:

"How strongly aligned are these two vectors, including their magnitudes?"

Properties:

  • Not normalized
  • Unbounded: Could be large positive or negative numbers
  • Sensitive to vector length and direction

Use When:

  • You want to favor longer vectors (e.g., stronger user interest or item importance)
  • In deep learning layers (e.g., attention scores)
  • When magnitude matters
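
To contrast with cosine similarity, here is the same toy pair of vectors scored with a plain dot product (again an illustrative sketch, not real embeddings):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

# dot product = sum of element-wise products: 1*2 + 2*4 + 3*6
dot = np.dot(a, b)
print(dot)  # prints 28.0: same direction as before, but the larger magnitude boosts the score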


3️⃣ Euclidean Distance

Concept

Also known as L2 distance, it measures the straight-line (“as the crow flies”) distance between two vectors in space. Think of it as the geometric length between two points.

It answers:

"How far apart are these vectors in physical space?"

Properties:

  • Sensitive to both magnitude and direction
  • Range: [0, ∞)
  • A true distance metric

Use When:

  • Measuring actual distance is meaningful (e.g., K-means, clustering)
  • Working with numerical or spatial features
  • You have normalized data
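
A small illustrative sketch with two toy vectors (not real embeddings):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# Euclidean (L2) distance = square root of the sum of squared differences
l2 = np.linalg.norm(a - b)
print(l2)  # prints 5.0: the straight-line distance between the two points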


4️⃣ Manhattan Distance

Concept

Also called L1 norm or Taxicab distance, this measures the sum of absolute differences between two vectors. Imagine walking in a city grid where you can only move up/down or left/right—not diagonally.

It answers:

"How far are the vectors if you can only move in axis-aligned steps?"

Properties:

  • More robust to outliers than Euclidean
  • Range: [0, ∞)
  • Additive along dimensions

Use When:

  • Working with high-dimensional, sparse data (e.g., recommender systems)
  • You want interpretable differences in individual features
  • Avoiding squared penalties (robust to spikes)
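
And the same toy pair measured with Manhattan distance (illustrative values only):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# Manhattan (L1) distance = sum of absolute differences: |1-4| + |2-6| + |3-3|
l1 = np.sum(np.abs(a - b))
print(l1)  # prints 7.0: total "city block" steps along the axes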


Run the example code below to see these metrics in action:

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# 1. Initialize the model
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# 2. List of sentences to encode (the "database")
sentences = [
    "I love machine learning.",
    "Artificial intelligence is fascinating.",
    "Python is a great programming language.",
    "I enjoy building models with data.",
    "Natural language processing is a part of AI."
]

# 3. Convert sentences into embeddings (vectors)
sentence_embeddings = model.encode(sentences)

# 4. Function to find similar sentences using different similarity metrics
def find_similar_sentences(query, metric='cosine', top_n=3):
    # Encode the query sentence
    query_embedding = model.encode([query])
    
    # Select similarity metric
    if metric == 'cosine':
        similarities = cosine_similarity(query_embedding, sentence_embeddings)
    elif metric == 'dot_product':
        # Compute dot product similarity
        similarities = np.dot(query_embedding, sentence_embeddings.T)
    elif metric == 'euclidean':
        # Convert squared Euclidean (L2) distance to a similarity score in (0, 1]
        similarities = [1 / (1 + np.sum((sentence_embeddings - query_embedding) ** 2, axis=1))]
    elif metric == 'manhattan':
        # Convert Manhattan (L1) distance to a similarity score in (0, 1]
        similarities = [1 / (1 + np.sum(np.abs(sentence_embeddings - query_embedding), axis=1))]
    else:
        raise ValueError("Unsupported similarity metric. Choose from 'cosine', 'dot_product', 'euclidean', 'manhattan'")

    # Get the top_n most similar sentences
    similar_indices = np.argsort(similarities[0])[::-1][:top_n]
    similar_sentences = [(sentences[i], similarities[0][i]) for i in similar_indices]
    
    return similar_sentences

# Example query
query_sentence = "I like learning about AI."
similar_sentences_cosine = find_similar_sentences(query_sentence, metric='cosine', top_n=2)
similar_sentences_dot = find_similar_sentences(query_sentence, metric='dot_product', top_n=2)
similar_sentences_euclidean = find_similar_sentences(query_sentence, metric='euclidean', top_n=2)
similar_sentences_manhattan = find_similar_sentences(query_sentence, metric='manhattan', top_n=2)

# Output the results
print("Cosine Similarity:")
for sentence, score in similar_sentences_cosine:
    print(f"Sentence: {sentence}, Similarity Score: {score:.4f}")

print("\nDot Product Similarity:")
for sentence, score in similar_sentences_dot:
    print(f"Sentence: {sentence}, Similarity Score: {score:.4f}")

print("\nEuclidean Distance (converted to similarity):")
for sentence, score in similar_sentences_euclidean:
    print(f"Sentence: {sentence}, Similarity Score: {score:.4f}")

print("\nManhattan Distance (converted to similarity):")
for sentence, score in similar_sentences_manhattan:
    print(f"Sentence: {sentence}, Similarity Score: {score:.4f}")        

Sample output:

Cosine Similarity:
Sentence: Artificial intelligence is fascinating., Similarity Score: 0.6267
Sentence: I love machine learning., Similarity Score: 0.6036

Dot Product Similarity:
Sentence: I love machine learning., Similarity Score: 33.3234
Sentence: Artificial intelligence is fascinating., Similarity Score: 32.4035

Euclidean Distance (converted to similarity):
Sentence: Artificial intelligence is fascinating., Similarity Score: 0.0250
Sentence: I love machine learning., Similarity Score: 0.0223

Manhattan Distance (converted to similarity):
Sentence: Artificial intelligence is fascinating., Similarity Score: 0.0103
Sentence: Natural language processing is a part of AI., Similarity Score: 0.0098
