Distance Functions¶
Distance functions are a critical component of the HNSW algorithm, as they determine how similarity between vectors is measured. This page explains the different distance metrics available in HNSW and how to choose the right one for your use case.
Overview¶
In HNSW, distance functions are used to:
- Determine the nearest neighbors during graph construction
- Find the closest vectors during search operations
- Calculate the distance between vectors for ranking search results
The choice of distance function can significantly impact the quality of search results and should be selected based on the properties of your vector data.
Built-in Distance Functions¶
HNSW provides several built-in distance functions:
Cosine Distance¶
Cosine distance measures the cosine of the angle between two vectors. It is defined as:

    d(a, b) = 1 - (a · b) / (||a|| ||b||)

Where:
- a · b is the dot product of vectors a and b
- ||a|| and ||b|| are the magnitudes (Euclidean norms) of vectors a and b
Properties:
- Range: [0, 2] (0 means identical, 2 means opposite directions)
- Invariant to vector scaling (only direction matters)
- Best for normalized vectors (e.g., embeddings from language models)
Use cases:
- Text embeddings (e.g., from OpenAI, BERT, etc.)
- Semantic search
- Recommendation systems
Implementation:
```go
func CosineDistance(a, b Vector) float32 {
	var (
		dot   float32
		normA float32
		normB float32
	)
	for i := range a {
		dot += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	// Guard against zero-magnitude vectors, whose direction is undefined;
	// treat them as dissimilar (distance 1) rather than dividing by zero.
	if normA == 0 || normB == 0 {
		return 1.0
	}
	return 1.0 - dot/float32(math.Sqrt(float64(normA)*float64(normB)))
}
```
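For example, because cosine distance depends only on direction, scaling a vector leaves the distance unchanged. A minimal, self-contained sketch (defining `Vector` as a `[]float32` alias for illustration):

```go
package main

import (
	"fmt"
	"math"
)

// Vector is assumed to be a []float32, matching the snippets above.
type Vector = []float32

func CosineDistance(a, b Vector) float32 {
	var dot, normA, normB float32
	for i := range a {
		dot += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	if normA == 0 || normB == 0 {
		return 1.0
	}
	return 1.0 - dot/float32(math.Sqrt(float64(normA)*float64(normB)))
}

func main() {
	a := Vector{1, 2, 3}
	b := Vector{2, 4, 6} // same direction, twice the magnitude

	// Parallel vectors have distance 0 regardless of scale.
	fmt.Println(CosineDistance(a, b)) // 0
}
```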
Euclidean Distance¶
Euclidean distance measures the straight-line distance between two points in Euclidean space. It is defined as:

    d(a, b) = sqrt( Σᵢ (aᵢ - bᵢ)² )
Properties:
- Range: [0, ∞) (0 means identical)
- Sensitive to vector scaling
- Preserves the geometric interpretation of distance
Use cases:
- Image feature vectors
- Geographical coordinates
- Physical measurements
- Any application where the absolute magnitude matters
Implementation:
```go
func EuclideanDistance(a, b Vector) float32 {
	var sum float32
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return float32(math.Sqrt(float64(sum)))
}
```
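As a quick sanity check, the classic 3-4-5 right triangle gives a distance of exactly 5 (`Vector` is again assumed to be a `[]float32` alias):

```go
package main

import (
	"fmt"
	"math"
)

type Vector = []float32 // assumed alias, matching the snippets above

func EuclideanDistance(a, b Vector) float32 {
	var sum float32
	for i := range a {
		d := a[i] - b[i]
		sum += d * d // accumulate squared per-coordinate differences
	}
	return float32(math.Sqrt(float64(sum)))
}

func main() {
	a := Vector{0, 0}
	b := Vector{3, 4}
	fmt.Println(EuclideanDistance(a, b)) // 5
}
```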
Dot Product Distance¶
Dot product distance is the negative of the dot product between two vectors. It is defined as:

    d(a, b) = -(a · b) = -Σᵢ aᵢ bᵢ
Properties:
- Range: (-∞, ∞)
- Lower values indicate higher similarity
- Sensitive to vector magnitude
Use cases:
- When vectors are already normalized
- Specialized applications where dot product is the natural similarity measure
Implementation:
```go
func DotProductDistance(a, b Vector) float32 {
	var dot float32
	for i := range a {
		dot += a[i] * b[i]
	}
	return -dot
}
```
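For unit-length vectors the negative dot product equals the negative cosine similarity, so it ranks neighbors the same way cosine distance does while skipping the square roots. A small sketch (with an assumed `Vector` alias):

```go
package main

import "fmt"

type Vector = []float32 // assumed alias

func DotProductDistance(a, b Vector) float32 {
	var dot float32
	for i := range a {
		dot += a[i] * b[i]
	}
	return -dot // more similar vectors yield lower (more negative) values
}

func main() {
	a := Vector{1, 0}     // unit vector
	b := Vector{0.6, 0.8} // unit vector at an angle to a

	fmt.Println(DotProductDistance(a, a)) // -1
	fmt.Println(DotProductDistance(a, b)) // -0.6
}
```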
Custom Distance Functions¶
You can define your own distance function to suit your specific needs:
```go
graph := hnsw.NewGraph[int]()

// Define a custom distance function
graph.Distance = func(a, b []float32) float32 {
	// Manhattan distance (L1 norm)
	var sum float32
	for i := range a {
		sum += float32(math.Abs(float64(a[i] - b[i])))
	}
	return sum
}
```
Common custom distance functions include:
Manhattan Distance (L1 Norm)¶
```go
func ManhattanDistance(a, b Vector) float32 {
	var sum float32
	for i := range a {
		sum += float32(math.Abs(float64(a[i] - b[i])))
	}
	return sum
}
```
Chebyshev Distance (L∞ Norm)¶
```go
func ChebyshevDistance(a, b Vector) float32 {
	var max float32
	for i := range a {
		d := float32(math.Abs(float64(a[i] - b[i])))
		if d > max {
			max = d
		}
	}
	return max
}
```
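To see how the norms differ, consider a = (1, 2) and b = (4, 6), whose per-coordinate differences are (3, 4): the L1 norm sums them (7), the L2 norm gives 5, and the L∞ norm keeps only the largest (4). A self-contained comparison (with an assumed `Vector` alias):

```go
package main

import (
	"fmt"
	"math"
)

type Vector = []float32 // assumed alias

// ManhattanDistance sums absolute per-coordinate differences (L1).
func ManhattanDistance(a, b Vector) float32 {
	var sum float32
	for i := range a {
		sum += float32(math.Abs(float64(a[i] - b[i])))
	}
	return sum
}

// ChebyshevDistance keeps the largest per-coordinate difference (L-infinity).
func ChebyshevDistance(a, b Vector) float32 {
	var max float32
	for i := range a {
		d := float32(math.Abs(float64(a[i] - b[i])))
		if d > max {
			max = d
		}
	}
	return max
}

func main() {
	a := Vector{1, 2}
	b := Vector{4, 6} // differences: (3, 4)
	fmt.Println(ManhattanDistance(a, b)) // 7
	fmt.Println(ChebyshevDistance(a, b)) // 4
}
```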
Weighted Euclidean Distance¶
```go
func WeightedEuclideanDistance(weights Vector) DistanceFunc {
	// weights must have the same length as the vectors passed in.
	return func(a, b Vector) float32 {
		var sum float32
		for i := range a {
			d := a[i] - b[i]
			sum += weights[i] * d * d
		}
		return float32(math.Sqrt(float64(sum)))
	}
}
```
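A sketch of how the closure might be used, where a weight of 0 effectively masks out a dimension (the `Vector` and `DistanceFunc` aliases are assumptions for illustration):

```go
package main

import (
	"fmt"
	"math"
)

type Vector = []float32
type DistanceFunc = func(a, b Vector) float32

func WeightedEuclideanDistance(weights Vector) DistanceFunc {
	// weights must have the same length as the vectors passed in.
	return func(a, b Vector) float32 {
		var sum float32
		for i := range a {
			d := a[i] - b[i]
			sum += weights[i] * d * d
		}
		return float32(math.Sqrt(float64(sum)))
	}
}

func main() {
	// Weight of 0 on the second dimension ignores it entirely.
	dist := WeightedEuclideanDistance(Vector{1, 0})
	fmt.Println(dist(Vector{3, 100}, Vector{0, -100})) // 3
}
```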
Choosing the Right Distance Function¶
The choice of distance function depends on the properties of your data and the requirements of your application:
- For text embeddings and semantic search:
  - Use Cosine Distance when vectors may vary in magnitude and only direction matters
  - Use Dot Product Distance if vectors are already normalized (it is cheaper and ranks neighbors identically to cosine in that case)
- For image features and physical measurements:
  - Use Euclidean Distance if the absolute magnitude matters
  - Use Manhattan Distance if you want to reduce the influence of outliers
- For geographical coordinates:
  - Use Euclidean Distance for small areas
  - Use a specialized geodesic distance for large areas
- For mixed data types:
  - Use a weighted combination of distance functions
  - Consider normalizing each feature before computing the distance
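For the geographical case, a haversine-based great-circle distance could be plugged in as a custom function. This is a sketch under the assumption that each vector stores {latitude, longitude} in degrees; the Earth-radius constant is approximate:

```go
package main

import (
	"fmt"
	"math"
)

const earthRadiusKm = 6371.0 // mean Earth radius, approximate

// HaversineDistance treats each vector as {latitude, longitude} in degrees
// and returns the great-circle distance in kilometers.
func HaversineDistance(a, b []float32) float32 {
	lat1 := float64(a[0]) * math.Pi / 180
	lat2 := float64(b[0]) * math.Pi / 180
	dLat := float64(b[0]-a[0]) * math.Pi / 180
	dLon := float64(b[1]-a[1]) * math.Pi / 180

	h := math.Sin(dLat/2)*math.Sin(dLat/2) +
		math.Cos(lat1)*math.Cos(lat2)*math.Sin(dLon/2)*math.Sin(dLon/2)
	return float32(earthRadiusKm * 2 * math.Asin(math.Sqrt(h)))
}

func main() {
	paris := []float32{48.8566, 2.3522}
	london := []float32{51.5074, -0.1278}
	fmt.Printf("%.0f km\n", HaversineDistance(paris, london)) // roughly 340 km
}
```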
Performance Considerations¶
Distance calculations are the most computationally intensive part of the HNSW algorithm, as they are performed many times during both construction and search. Here are some tips for optimizing distance calculations:
- Precompute norms for cosine distance:
  - If using cosine distance, consider precomputing and storing the norms of vectors
  - This can significantly reduce the computation time during search
- Use SIMD instructions:
  - Modern CPUs support SIMD (Single Instruction, Multiple Data) instructions
  - These can accelerate vector operations like dot products and Euclidean distance
  - The Go compiler may automatically use SIMD instructions for some operations
- Consider dimensionality reduction:
  - High-dimensional vectors require more computation for distance calculations
  - Consider dimensionality reduction techniques such as PCA or random projections (t-SNE is suited to visualization rather than indexing, since it provides no transform for new points)
  - This can improve both performance and memory usage
- Normalize vectors:
  - For cosine distance, normalizing vectors in advance can simplify calculations
  - This is especially useful for text embeddings from language models
Next Steps¶
Now that you understand distance functions in HNSW, you can explore:
- Graph Structure: Learn about the hierarchical structure of HNSW
- Search Algorithms: Dive deeper into the search algorithms used in HNSW
- Performance Tuning: Learn how to optimize HNSW for your specific use case