Similarity Metrics
Similarity is the primary query mechanism in HDC: given a noisy or transformed hypervector, similarity to each item in a codebook identifies the nearest match. All metrics in PyHDC return values in [-1, 1] (1 = identical, 0 = orthogonal, -1 = maximally dissimilar).
All similarity functions are in pyhdc.components.similarity.
CosineSimilarity
Used by: MAP_C, MAP_I, MAP_I_Bits, MAP_B, HRR family, VTB, MBAT
Output range: [-1, 1]
Cosine similarity is the dot product of the two vectors after each is scaled to unit length, \(\cos(a, b) = a \cdot b / (\lVert a \rVert \, \lVert b \rVert)\). It depends only on the angle between the vectors, not on their magnitudes, which makes it appropriate for both normalised (HRR) and unnormalised (MAP) vectors.
For two random unit vectors in \(\mathbb{R}^D\), the expected cosine similarity is 0 and the standard deviation is \(1/\sqrt{D}\).
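The definition and the \(1/\sqrt{D}\) spread can be checked with a few lines of NumPy. This is a minimal sketch, not PyHDC's implementation; the function name `cosine_similarity` is illustrative.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Dot product of a and b after scaling both to unit length."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
D = 1_000

# Rescaling either vector leaves the similarity unchanged.
x = rng.standard_normal(D)
y = rng.standard_normal(D)
same_angle = np.isclose(cosine_similarity(x, y), cosine_similarity(3.0 * x, 0.5 * y))

# Similarities between many independent random pairs cluster around 0
# with a spread close to 1 / sqrt(D).
samples = np.array([
    cosine_similarity(rng.standard_normal(D), rng.standard_normal(D))
    for _ in range(2_000)
])
print(same_angle, samples.mean(), samples.std(), 1 / np.sqrt(D))
```

The small spread is what makes codebook lookup robust: at large \(D\), unrelated vectors score very close to 0, so a noisy copy of a stored item still stands out.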
HammingDistance
Used by: BSC
Normalised and remapped Hamming distance, \(1 - 2\,\mathrm{popcount}(a \oplus b) / D\):
Output range: [-1, 1]
The formula maps:
0 bit flips (identical vectors) -> 1.0
D/2 bit flips (random/orthogonal) -> 0.0
D bit flips (all-different) -> -1.0
This is consistent with the [-1, 1] convention used by all other metrics.
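The three anchor points above follow from a linear remap of the bit-flip fraction. A minimal NumPy sketch, assuming bits are stored as 0/1 integers (the function name `hamming_similarity` is illustrative, not part of PyHDC):

```python
import numpy as np

def hamming_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Map the fraction of differing bits from [0, 1] onto [1, -1]."""
    flips = np.count_nonzero(a != b)   # same count as popcount(a XOR b)
    return 1.0 - 2.0 * flips / a.size

D = 8
a = np.array([0, 1, 1, 0, 1, 0, 0, 1], dtype=np.uint8)
half = a.copy()
half[: D // 2] ^= 1      # flip D/2 bits
inverse = a ^ 1          # flip all D bits

print(hamming_similarity(a, a))        # 1.0
print(hamming_similarity(a, half))     # 0.0
print(hamming_similarity(a, inverse))  # -1.0
```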
Note
Breaking change from v1.0.x: in v1.0.x, HammingDistance returned
popcount(a XOR b) / D, in [0, 1] with 0 = identical and 1 = all-different.
Code that compares against thresholds in [0, 1] must be updated.
Passing similarity_remap=remap_to_unit to the encoding constructor restores
the [0, 1] range, but with 1 = identical and 0 = all-different, i.e. one
minus the v1.0.x value.
Overlap
Used by: BSDC family
Normalised set intersection, remapped to [-1, 1]:
Output range: [-1, 1]
For sparse binary vectors, the dot product counts the number of positions where both vectors have a 1. Dividing by the smaller \(\ell_1\) norm (i.e., the smaller number of 1s) gives the overlap coefficient, a Jaccard-like measure: 0 = no overlap, 1 = the smaller vector is a subset of the larger. These two values describe the coefficient before it is remapped to [-1, 1].
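A minimal NumPy sketch of the coefficient described above. The function names are illustrative, and the linear remap \(2c - 1\) onto [-1, 1] is an assumption made here for illustration, not a confirmed detail of PyHDC. Both inputs are assumed to contain at least one 1.

```python
import numpy as np

def overlap_coefficient(a: np.ndarray, b: np.ndarray) -> float:
    """Shared 1s divided by the smaller count of 1s, in [0, 1]."""
    shared = int(np.dot(a, b))              # positions where both are 1
    smaller = int(min(a.sum(), b.sum()))    # smaller l1 norm
    return shared / smaller

def overlap_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """ASSUMPTION: linear remap of the coefficient from [0, 1] to [-1, 1]."""
    return 2.0 * overlap_coefficient(a, b) - 1.0

a = np.array([1, 1, 0, 0, 0, 0, 0, 0])
subset = np.array([1, 0, 0, 0, 0, 0, 0, 0])     # its single 1 is also set in a
disjoint = np.array([0, 0, 0, 0, 1, 1, 0, 0])   # shares no 1 with a

print(overlap_coefficient(a, subset), overlap_similarity(a, subset))      # 1.0 1.0
print(overlap_coefficient(a, disjoint), overlap_similarity(a, disjoint))  # 0.0 -1.0
```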
Note
Same v1.0.x breaking change as HammingDistance.
AngleDistance
Used by: FHRR
For angle-valued vectors, similarity is the cosine of the mean angular difference:
Output range: [-1, 1]
This is appropriate because FHRR binding uses modular angle arithmetic: two vectors are “similar” when their angles are close element-wise (small absolute angular difference per dimension).
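One way to read "cosine of the mean angular difference" is sketched below: wrap each element-wise difference into \([-\pi, \pi)\), average the absolute values, and take the cosine. That reading is an assumption chosen because it reproduces the [-1, 1] convention (identical -> 1, unrelated -> about 0, opposite -> -1); the function name `angle_similarity` is illustrative, not PyHDC's API.

```python
import numpy as np

def angle_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the mean absolute angular difference (angles in radians)."""
    # Wrap each difference into [-pi, pi) so 0.1 and 2*pi - 0.1 count as close.
    diff = (a - b + np.pi) % (2 * np.pi) - np.pi
    return float(np.cos(np.mean(np.abs(diff))))

rng = np.random.default_rng(0)
D = 10_000
theta = rng.uniform(-np.pi, np.pi, size=D)
other = rng.uniform(-np.pi, np.pi, size=D)

print(angle_similarity(theta, theta))          # identical angles -> 1.0
print(angle_similarity(theta, theta + np.pi))  # every angle opposite -> close to -1
print(angle_similarity(theta, other))          # unrelated angles -> close to 0
```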
remap_to_unit
A utility function that maps any [-1, 1] similarity value \(s\) linearly to [0, 1], \((s + 1) / 2\):
This maps: -1 -> 0, 0 -> 0.5 (orthogonal), +1 -> 1.
It works on scalars, NumPy arrays, and PyTorch tensors. Use it as the
similarity_remap= argument on any encoding to apply it automatically.
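The three anchor points above are consistent with a linear map. A minimal sketch, assuming that is the whole of the function (the name `remap_to_unit_sketch` is hypothetical, to avoid confusion with the library function):

```python
import numpy as np

def remap_to_unit_sketch(similarity):
    """Linear map from [-1, 1] to [0, 1]: -1 -> 0, 0 -> 0.5, +1 -> 1."""
    return (similarity + 1) / 2

print(remap_to_unit_sketch(-1.0))                                 # 0.0
print(remap_to_unit_sketch(np.array([-1.0, 0.0, 1.0])).tolist())  # [0.0, 0.5, 1.0]
```

Because the body uses only `+` and `/`, the same expression applies element-wise to NumPy arrays and to any tensor type that supports those operators.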
Batched calling conventions
Batches are dimension-first: a batch of N hypervectors has shape (D, N),
where each column batch[:, i] is one hypervector. Similarity operates
column-wise over axis 0. The supported input modes:
| Input shape | Output shape | Semantics |
|---|---|---|
|  | scalar | Single pair |
|  |  | Per-column pairs: |
|  |  | One vector vs. each column: |
|  |  | Column 0 vs. columns 1…N-1: |
|  | list of scalars | Pairwise: |
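The column-wise modes can be illustrated in plain NumPy with cosine similarity standing in for the metric. This is a sketch of the dimension-first convention only: `column_cosine` is a hypothetical helper, and the output shapes shown are those of the sketch, not a statement about PyHDC's return types.

```python
import numpy as np

def column_cosine(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity along axis 0; a (D,) operand is treated as one column."""
    a = a.reshape(a.shape[0], -1)   # (D,) -> (D, 1); (D, N) unchanged
    b = b.reshape(b.shape[0], -1)
    dot = np.sum(a * b, axis=0)
    return dot / (np.linalg.norm(a, axis=0) * np.linalg.norm(b, axis=0))

rng = np.random.default_rng(0)
D, N = 1_000, 5
batch = rng.standard_normal((D, N))   # each column batch[:, i] is one hypervector
other = rng.standard_normal((D, N))
probe = batch[:, 2]

per_column = column_cosine(batch, other)                   # column i vs. column i
one_vs_each = column_cosine(probe, batch)                  # probe vs. every column
first_vs_rest = column_cosine(batch[:, 0], batch[:, 1:])   # column 0 vs. columns 1..N-1

print(per_column.shape, one_vs_each.shape, first_vs_rest.shape)  # (5,) (5,) (4,)
```

In `one_vs_each`, index 2 is 1 up to rounding because the probe is column 2 of the batch; the other entries stay near 0.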
Axis-aware reduction and trailing-axis broadcasting
Every metric reduces over axis 0, the hypervector dimension \(D\).
The result shape is the broadcast of the two operands’ trailing axes (axes 1
and higher). The dimension axis disappears in the reduction. This is what
lets a higher-rank batch line up against a lower-rank one. A (D, N) input
compared against a (D, N, M) input pads the smaller operand to
(D, N, 1) and broadcasts over the last axis, yielding an (N, M) score
array. The axis= keyword on similarity() is keyword-only and,
for a single batched input, selects which batch axis splits index 0 from the
rest; the reduction itself stays on axis 0.
A Python float comes back only when both operands are 1D ((D,) against
(D,)). Every other case returns a NumPy array or PyTorch tensor whose
shape is the broadcast of the non-dimension axes. The similarity_remap
callback, when set, is applied to that result.
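The padding-and-broadcast rule can be sketched in plain NumPy. NumPy aligns shapes from the trailing axis, so a (D, N) array does not broadcast against (D, N, M) unless trailing singleton axes are appended first. The helper `cosine_axis0` below is hypothetical and uses cosine similarity as the metric; it is not PyHDC's implementation.

```python
import numpy as np

def cosine_axis0(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity reduced over axis 0, broadcasting trailing axes."""
    # Pad the lower-rank operand with trailing singleton axes:
    # (D, N) against (D, N, M) becomes (D, N, 1) against (D, N, M).
    rank = max(a.ndim, b.ndim)
    a = a.reshape(a.shape + (1,) * (rank - a.ndim))
    b = b.reshape(b.shape + (1,) * (rank - b.ndim))
    dot = np.sum(a * b, axis=0)   # the dimension axis disappears here
    norms = np.linalg.norm(a, axis=0) * np.linalg.norm(b, axis=0)
    return dot / norms

rng = np.random.default_rng(0)
D, N, M = 1_000, 4, 8
query = rng.standard_normal((D, N))        # (D, N)
codebook = rng.standard_normal((D, N, M))  # (D, N, M)

scores = cosine_axis0(query, codebook)
print(scores.shape)  # (4, 8)
```

Entry `scores[n, m]` compares `query[:, n]` with `codebook[:, n, m]`. Unlike the behaviour described above, this sketch returns a NumPy scalar rather than a Python float for two 1D inputs.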
| A shape | B shape | Result |
|---|---|---|
| (D,) | (D,) | Python float |
Single-input similarity on a (D, N, M, ...) batch (ndim >= 3)
requires an explicit axis. Passing a 1D single input or omitting the axis
on a 3D+ single input raises ValueError. The chosen axis must resolve to
exactly one batch axis; axis 0, the dimension axis, cannot be selected.
```python
import numpy as np
import pyhdc

enc = pyhdc.MAP_C(dimension=10_000)
query = enc.generate(size=(10_000, 4))        # (D, N) = (D, 4)
codebook = enc.generate(size=(10_000, 4, 8))  # (D, N, M) = (D, 4, 8)
scores = enc.similarity(query, codebook)      # (N, M) = (4, 8)
print(scores.shape)                           # (4, 8)
```
Choosing the right metric
The encoding selects the appropriate metric automatically, so you do not need to call these functions directly. The mapping is:
| Encoding | Metric |
|---|---|
| MAP_C, MAP_I, MAP_I_Bits, MAP_B | CosineSimilarity |
| HRR, HRR_NoNorm, HRR_ConstNorm | CosineSimilarity |
| FHRR | AngleDistance |
| VTB, MBAT | CosineSimilarity |
| BSC | HammingDistance |
| BSDC_CDT, BSDC_S, BSDC_SEG, BSDC_THIN | Overlap |