How to Compute Similarity

Similarity measures how related two hypervectors are. By default, PyHDC returns values in [-1, 1] (1 = identical, 0 = unrelated, -1 = maximally dissimilar).

Basic usage

Instance method (most common):

import pyhdc

enc = pyhdc.MAP_C(dimension=10_000)
a   = enc.generate()
b   = enc.generate()

sim = a.similarity(b)   # float

Encoding method (same result):

sim = enc.similarity(a, b)
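To make the [-1, 1] scale concrete, here is a minimal NumPy sketch of cosine similarity between two 1-D vectors. It is an illustration only: the helper name cosine and the uniform random vectors are assumptions standing in for hypervectors, and nothing here is PyHDC's own implementation.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity of two 1-D vectors, returned as a Python float."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative stand-ins for random hypervectors (not generated by PyHDC).
rng = np.random.default_rng(0)
D = 10_000
a = rng.uniform(-1.0, 1.0, size=D)
b = rng.uniform(-1.0, 1.0, size=D)

self_sim = cosine(a, a)     # identical vectors score 1 (up to rounding)
cross_sim = cosine(a, b)    # independent random vectors score close to 0
opp_sim = cosine(a, -a)     # a negated vector scores -1 (up to rounding)
```

With D = 10,000 the score of two independent random vectors has a standard deviation of roughly 0.01, which is why "unrelated" shows up as a value very near 0.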

Batched similarity: calling conventions

Hypervectors are dimension-first: a single vector has shape (D,) and a batch of N vectors has shape (D, N) (each column is a hypervector). Both Hypervector.similarity() and Encoding.similarity() reduce over axis 0 (the dimension) and support these shapes:

Convention 1: two 1-D vectors -> scalar

a = enc.generate()   # shape (10000,)
b = enc.generate()   # shape (10000,)
sim = enc.similarity(a, b)   # float

Convention 2: two (D, N) batches -> 1-D array (per-column pairs)

Element i of the result is similarity(A[:, i], B[:, i]):

batch_a = enc.generate(size=(10_000, 50))   # shape (10000, 50)
batch_b = enc.generate(size=(10_000, 50))   # shape (10000, 50)
sims    = enc.similarity(batch_a, batch_b)   # shape (50,)
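The per-column pairing can be sketched in plain NumPy as a reduction over axis 0. This assumes a cosine metric and uses a hypothetical helper, cosine_axis0, on ordinary arrays; it shows the shape behaviour described above, not how PyHDC computes it internally.

```python
import numpy as np

def cosine_axis0(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Cosine similarity reduced over axis 0 (the hypervector dimension)."""
    dots = np.sum(A * B, axis=0)
    norms = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=0)
    return dots / norms

# Illustrative stand-ins for two (D, N) batches (not generated by PyHDC).
rng = np.random.default_rng(0)
batch_a = rng.standard_normal((10_000, 50))
batch_b = rng.standard_normal((10_000, 50))

sims = cosine_axis0(batch_a, batch_b)                 # shape (50,)
# Element i matches the 1-D computation on column i.
check = cosine_axis0(batch_a[:, 7], batch_b[:, 7])
```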

Convention 3: single (D, N) batch -> 1-D array (column 0 vs the rest)

Column 0 is the query; columns 1+ are the candidates:

query_plus_codebook = enc.generate(size=(10_000, 101))   # shape (10000, 101)
sims = enc.similarity(query_plus_codebook)                # shape (100,)
# sims[i] = similarity(column 0, column i+1)
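The column-0-versus-rest split amounts to slicing the batch in two and letting the single query column broadcast. The NumPy sketch below assumes a cosine metric and plain arrays; it is an illustration of the indexing, not PyHDC code.

```python
import numpy as np

# Illustrative stand-in for a (D, 101) batch whose column 0 is the query.
rng = np.random.default_rng(0)
stacked = rng.standard_normal((10_000, 101))

query = stacked[:, :1]          # shape (10000, 1), kept 2-D so it broadcasts
candidates = stacked[:, 1:]     # shape (10000, 100)

dots = np.sum(query * candidates, axis=0)
norms = np.linalg.norm(query, axis=0) * np.linalg.norm(candidates, axis=0)
sims = dots / norms             # shape (100,); sims[i] pairs column 0 with column i + 1
```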

Convention 4: one vector vs a batch -> 1-D array (broadcast)

A (D,) vector compared against every column of a (D, N) batch:

query    = enc.generate()                    # shape (10000,)
codebook = enc.generate(size=(10_000, 100))   # shape (10000, 100)
sims     = enc.similarity(query, codebook)    # shape (100,)

Batched list form at the encoding level

You can also pass two equal-length lists of Hypervector objects:

hvs_a = [enc.generate() for _ in range(5)]
hvs_b = [enc.generate() for _ in range(5)]
sims  = enc.similarity(hvs_a, hvs_b)   # list of 5 floats
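Conceptually, the list form pairs element i of the first list with element i of the second. In plain NumPy the same pairing can be had by stacking each list column-wise into a (D, N) batch. The sketch below assumes a cosine metric and ordinary arrays in place of Hypervector objects; it is not PyHDC's implementation.

```python
import numpy as np

# Illustrative stand-ins for two lists of five hypervectors.
rng = np.random.default_rng(0)
hvs_a = [rng.standard_normal(10_000) for _ in range(5)]
hvs_b = [rng.standard_normal(10_000) for _ in range(5)]

# Stack each list column-wise to get two (D, 5) batches.
A = np.stack(hvs_a, axis=1)
B = np.stack(hvs_b, axis=1)

sims = np.sum(A * B, axis=0) / (np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=0))
sims_list = sims.tolist()       # list of 5 Python floats
```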

Similarity on (D, N, M) tensors

A tensor of hypervectors has shape (D, N, M). Axis 0 is the dimension D, and axes 1 and 2 are batch axes, so each slice tensor[:, i, j] is one hypervector. Similarity always reduces over axis 0; the batch axes pass through to the result shape.

Single 3-D input needs an explicit axis. With a (D, N) batch the column-0-versus-rest split (convention 3) is well defined because there is one batch axis. With a (D, N, M) tensor there are two batch axes, so “column 0” is ambiguous and PyHDC will not guess. The axis argument is keyword-only; pass it to name the batch axis along which index 0 is split from the rest:

tensor = enc.generate(size=(10_000, 4, 6))   # shape (10000, 4, 6)
sims   = enc.similarity(tensor, axis=1)      # split along axis 1

# Without axis, a 3-D single input raises:
#   ValueError: single-input similarity on a (D, N, M, ...) batch
#   requires an explicit axis

The chosen split axis is kept rather than dropped: the head has length 1 along it and the rest has length size - 1, so the head broadcasts against the rest across the remaining batch axes.
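One way to read the split is sketched below in plain NumPy. The helper split_similarity is hypothetical, the metric is assumed to be cosine, and the result shapes in the comments are what this reading produces; the shape PyHDC actually returns is not stated here.

```python
import numpy as np

def split_similarity(tensor: np.ndarray, axis: int) -> np.ndarray:
    """Cosine similarity of index 0 along `axis` against the remaining indices.

    Hypothetical helper for illustration only, not a PyHDC function.
    """
    if axis < 1:
        raise ValueError("axis must name a batch axis (1 or higher)")
    head = np.take(tensor, [0], axis=axis)                        # length 1 on `axis`
    rest = np.take(tensor, np.arange(1, tensor.shape[axis]), axis=axis)
    dots = np.sum(head * rest, axis=0)                            # reduce over D
    norms = np.linalg.norm(head, axis=0) * np.linalg.norm(rest, axis=0)
    return dots / norms

# Illustrative stand-in for a (D, 4, 6) tensor of hypervectors.
rng = np.random.default_rng(0)
tensor = rng.standard_normal((10_000, 4, 6))

sims_axis1 = split_similarity(tensor, axis=1)   # shape (3, 6) in this sketch
sims_axis2 = split_similarity(tensor, axis=2)   # shape (4, 5) in this sketch
```

Because the head keeps a length-1 entry on the split axis, ordinary broadcasting pairs it with every remaining index on that axis while the other batch axis lines up one to one.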

Two inputs: output shape by rank. With two inputs, the result shape is the broadcast of the two operands’ batch axes (axes 1 and up); axis 0 is reduced away. Two 1-D inputs return a Python float; every other combination returns a NumPy array or torch tensor.

A shape      B shape      Result
---------    ---------    ------------------------------------------------
(D,)         (D,)         Python float (scalar)
(D,)         (D, N)       (N,)
(D, N)       (D,)         (N,)
(D, N)       (D, N)       (N,)
(D,)         (D, N, M)    (N, M)
(D, N, M)    (D,)         (N, M)
(D, N)       (D, N, M)    (N, M) (A padded to (D, N, 1), broadcast over M)
(D, N, M)    (D, N)       (N, M)
(D, N, M)    (D, N, M)    (N, M)
(D, 1, M)    (D, N, M)    (N, M) (broadcast over axis 1)

Two tensors of matching shape reduce to one score per column pair:

tensor_a = enc.generate(size=(10_000, 4, 6))   # shape (10000, 4, 6)
tensor_b = enc.generate(size=(10_000, 4, 6))   # shape (10000, 4, 6)
sims     = enc.similarity(tensor_a, tensor_b)  # shape (4, 6)

A single vector compared against a whole tensor broadcasts over both batch axes:

query  = enc.generate()                       # shape (10000,)
tensor = enc.generate(size=(10_000, 4, 6))    # shape (10000, 4, 6)
sims   = enc.similarity(query, tensor)        # shape (4, 6)
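The padding rule for mixed ranks differs from NumPy's default broadcasting, which aligns shapes from the right and would reject (D, N) against (D, N, M). The sketch below appends trailing length-1 axes first so that batch axes align from the left. The helpers pad_trailing and cosine_axis0 are hypothetical and the metric is assumed to be cosine; this illustrates the shape rule rather than PyHDC's code.

```python
import numpy as np

def pad_trailing(x: np.ndarray, ndim: int) -> np.ndarray:
    """Append length-1 axes until x has `ndim` axes: (D, N) -> (D, N, 1)."""
    return x.reshape(x.shape + (1,) * (ndim - x.ndim))

def cosine_axis0(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Cosine over axis 0 after aligning batch axes from the left."""
    ndim = max(A.ndim, B.ndim)
    A, B = pad_trailing(A, ndim), pad_trailing(B, ndim)
    dots = np.sum(A * B, axis=0)
    norms = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=0)
    return dots / norms

# Illustrative stand-ins for a vector, a batch, and a tensor of hypervectors.
rng = np.random.default_rng(0)
D, N, M = 1_000, 4, 6
vec = rng.standard_normal(D)
batch = rng.standard_normal((D, N))
tensor = rng.standard_normal((D, N, M))

vec_vs_tensor = cosine_axis0(vec, tensor)       # shape (4, 6)
batch_vs_tensor = cosine_axis0(batch, tensor)   # shape (4, 6)
```

In the batch case, entry [i, j] compares batch column i with tensor[:, i, j], which is the "broadcast over M" behaviour.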

Output ranges by encoding

Encoding family                    Similarity metric     Output range
-------------------------------    ------------------    -------------------------------
MAP_C, MAP_I, MAP_I_Bits, MAP_B    Cosine                [-1, 1]
HRR, HRR_NoNorm, HRR_ConstNorm     Cosine                [-1, 1]
FHRR                               Angle distance        [-1, 1]
VTB, MBAT                          Cosine                [-1, 1]
BSC                                Hamming (remapped)    [-1, 1] (was [0, 1] in v1.0.x)
BSDC family                        Overlap (remapped)    [-1, 1] (was [0, 1] in v1.0.x)
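For binary vectors, one common way to put a Hamming-based score on the [-1, 1] scale is an affine map of the fraction of matching bits. The sketch below shows that map in NumPy; the helper name hamming_similarity is hypothetical, and the exact formula PyHDC uses for its remapped Hamming and overlap metrics is not given here.

```python
import numpy as np

def hamming_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Map the fraction of matching bits from [0, 1] onto [-1, 1]."""
    match_fraction = float(np.mean(a == b))   # 1.0 identical, about 0.5 random
    return 2.0 * match_fraction - 1.0

# Illustrative stand-ins for random binary hypervectors (not generated by PyHDC).
rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=10_000)
b = rng.integers(0, 2, size=10_000)

same = hamming_similarity(a, a)            # 1.0
opposite = hamming_similarity(a, 1 - a)    # -1.0
random_pair = hamming_similarity(a, b)     # close to 0
```

Under this map, two unrelated binary vectors that agree on about half their bits land near 0, matching the "0 = unrelated" reading used for the cosine-based encodings.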

Remapping to [0, 1]

If your downstream code expects [0, 1] (e.g., scikit-learn metrics), pass similarity_remap to the encoding constructor:

from pyhdc.components.similarity import remap_to_unit

enc = pyhdc.BSC(dimension=10_000, similarity_remap=remap_to_unit)
a   = enc.generate()
print(a.similarity(a))   # 1.0
print(a.similarity(enc.generate()))   # ~= 0.5

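Or construct the encoding without similarity_remap and apply remap_to_unit manually:

# a and b come from an encoding created without similarity_remap
raw      = a.similarity(b)          # in [-1, 1]
remapped = remap_to_unit(raw)        # in [0, 1]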
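A remap from [-1, 1] to [0, 1] that agrees with the values shown above (identical -> 1.0, unrelated -> about 0.5) is the affine map (s + 1) / 2. The sketch below uses a hypothetical stand-in named to_unit_interval; whether remap_to_unit is implemented exactly this way is an assumption.

```python
import numpy as np

def to_unit_interval(sim):
    """Affine map from [-1, 1] to [0, 1]; accepts floats and arrays.

    Hypothetical stand-in for illustration, not PyHDC's remap_to_unit.
    """
    return (np.asarray(sim, dtype=float) + 1.0) / 2.0

scores = np.array([-1.0, 0.0, 1.0])
remapped = to_unit_interval(scores)        # array([0. , 0.5, 1. ])
single = float(to_unit_interval(0.2))      # 0.6
```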

Nearest-neighbour lookup

Find the closest match to a query in a small codebook:

codebook = {name: enc.generate() for name in ['red','green','blue','yellow']}
query    = codebook['red'].bundle(enc.generate())   # noisy version of red

best = max(codebook, key=lambda k: query.similarity(codebook[k]))
print(best)   # red

For large codebooks (thousands of items), keep the codebook as one (D, N) batch and compare in a single vectorized call:

import numpy as np

enc      = pyhdc.MAP_C(dimension=10_000)
codebook = enc.generate(size=(10_000, 5_000))   # (D, N): 5000 items as columns
query    = enc.generate()                        # (D,)

sims     = enc.similarity(query, codebook)        # shape (5000,), broadcast as in convention 4
best_idx = int(np.argmax(sims))
print(best_idx)

# Or stack the query as column 0 and use convention 3:
stacked  = pyhdc.stack([query, codebook])          # (D, 5001)
sims     = enc.similarity(stacked)                 # shape (5000,)

# Pull specific candidates back out of the batch with select
# (argsort is ascending, so the best match is the last of the three):
top3     = codebook.select(np.argsort(sims)[-3:])  # (D, 3)
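For very large codebooks, a full argsort does more work than a top-k lookup needs. The NumPy sketch below uses np.argpartition to pick the k best scores and then orders only those, best match first. It operates on a plain array of scores; top_k_indices is a hypothetical helper, not part of PyHDC.

```python
import numpy as np

def top_k_indices(sims: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k largest scores, best match first."""
    if k < 1:
        raise ValueError("k must be at least 1")
    k = min(k, sims.size)
    candidates = np.argpartition(sims, -k)[-k:]          # unordered top k
    return candidates[np.argsort(sims[candidates])[::-1]]

# Illustrative scores, as a 1-D array of similarities might look.
sims = np.array([0.10, 0.85, -0.20, 0.40, 0.95, 0.05])
best3 = top_k_indices(sims, 3)     # array([4, 1, 3])
```

The partition step is linear in the codebook size, and the final sort touches only k entries.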