How to Compute Similarity

Similarity measures how related two hypervectors are. PyHDC returns values in [-1, 1] (1 = identical, 0 = unrelated, -1 = maximally dissimilar).
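
To make the three reference points concrete, here is a minimal pure-Python sketch of cosine similarity on random bipolar vectors. It does not call PyHDC; the `cosine` helper and the bipolar vectors are illustrative assumptions, chosen because cosine is the metric listed for the MAP family.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity of two equal-length sequences of numbers."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

rng = random.Random(0)
D = 10_000
a = [rng.choice((-1, 1)) for _ in range(D)]   # random bipolar vector
b = [rng.choice((-1, 1)) for _ in range(D)]   # independent random vector
neg_a = [-x for x in a]                        # every component flipped

sim_same = cosine(a, a)          # identical vectors: 1
sim_opposite = cosine(a, neg_a)  # maximally dissimilar: -1
sim_random = cosine(a, b)        # unrelated: close to 0 for large D
```

The random pair is only approximately 0; the spread shrinks as the dimension grows, which is why large D is typical.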

Basic usage

Instance method (most common):

import pyhdc

enc = pyhdc.MAP_C(dimension=10_000)
a   = enc.generate()
b   = enc.generate()

sim = a.similarity(b)   # float

Encoding method (same result):

sim = enc.similarity(a, b)

Batched similarity: calling conventions

Hypervectors are dimension-first: a single vector has shape (D,) and a batch of N vectors has shape (D, N) (each column is a hypervector). Both Hypervector.similarity() and Encoding.similarity() reduce over axis 0 (the dimension) and support these shapes:

Convention 1: two 1-D vectors -> scalar

a = enc.generate()   # shape (10000,)
b = enc.generate()   # shape (10000,)
sim = enc.similarity(a, b)   # float

Convention 2: two (D, N) batches -> 1-D array (per-column pairs)

Element i of the result is similarity(A[:, i], B[:, i]):

batch_a = enc.generate(size=(10_000, 50))   # shape (10000, 50)
batch_b = enc.generate(size=(10_000, 50))   # shape (10000, 50)
sims    = enc.similarity(batch_a, batch_b)   # shape (50,)
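
The dimension-first reduction can be sketched in plain NumPy. This is an illustration of the semantics, not PyHDC's implementation; `cosine_axis0` is a hypothetical helper and the random bipolar arrays stand in for generated hypervectors.

```python
import numpy as np

def cosine_axis0(a, b):
    """Cosine similarity reduced over axis 0 (the hypervector dimension)."""
    dot = np.sum(a * b, axis=0)
    norms = np.linalg.norm(a, axis=0) * np.linalg.norm(b, axis=0)
    return dot / norms

rng = np.random.default_rng(0)
D, N = 10_000, 50
batch_a = rng.choice([-1.0, 1.0], size=(D, N))   # each column is one vector
batch_b = rng.choice([-1.0, 1.0], size=(D, N))

sims = cosine_axis0(batch_a, batch_b)             # shape (50,)
# Element i pairs column i of batch_a with column i of batch_b.
col7 = cosine_axis0(batch_a[:, 7], batch_b[:, 7])
```

Here `sims[7]` and `col7` agree, which is the per-column pairing described above.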

Convention 3: single (D, N) batch -> 1-D array (column 0 vs the rest)

Column 0 is the query; columns 1+ are the candidates:

query_plus_codebook = enc.generate(size=(10_000, 101))   # shape (10000, 101)
sims = enc.similarity(query_plus_codebook)                # shape (100,)
# sims[i] = similarity(column 0, column i+1)

Convention 4: one vector vs a batch -> 1-D array (broadcast)

A (D,) vector compared against every column of a (D, N) batch:

query    = enc.generate()                    # shape (10000,)
codebook = enc.generate(size=(10_000, 100))   # shape (10000, 100)
sims     = enc.similarity(query, codebook)    # shape (100,)
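
Conventions 3 and 4 describe the same comparison with different packaging. The NumPy sketch below illustrates that under stated assumptions: `cosine_axis0` is a hypothetical stand-in for the library's metric, and `np.column_stack` plays the role of putting the query in column 0.

```python
import numpy as np

def cosine_axis0(a, b):
    """Cosine similarity over axis 0; batch axes follow NumPy broadcasting."""
    dot = np.sum(a * b, axis=0)
    return dot / (np.linalg.norm(a, axis=0) * np.linalg.norm(b, axis=0))

rng = np.random.default_rng(1)
D, N = 10_000, 100
query = rng.choice([-1.0, 1.0], size=D)            # shape (D,)
codebook = rng.choice([-1.0, 1.0], size=(D, N))    # shape (D, N)

# Convention 4: give the query a length-1 batch axis so it broadcasts.
sims_broadcast = cosine_axis0(query[:, None], codebook)    # shape (100,)

# Convention 3: query stored as column 0, candidates in columns 1+.
stacked = np.column_stack([query, codebook])               # shape (D, 101)
sims_split = cosine_axis0(stacked[:, :1], stacked[:, 1:])  # shape (100,)
```

Both results hold the same 100 scores, so the choice between the two conventions is about how your data is already laid out.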

Batched list form at the encoding level

You can also pass two equal-length lists of Hypervector objects:

hvs_a = [enc.generate() for _ in range(5)]
hvs_b = [enc.generate() for _ in range(5)]
sims  = enc.similarity(hvs_a, hvs_b)   # list of 5 floats

Similarity on (D, N, M) tensors

A tensor of hypervectors has shape (D, N, M). Axis 0 is the dimension D, and axes 1 and 2 are batch axes, so each tensor[:, i, j] column is one hypervector. Similarity always reduces over axis 0, and the batch axes pass through to the result shape.

Single 3-D input needs an explicit axis. With a (D, N) batch, the column-0-versus-rest split (convention 3) is well defined because there is only one batch axis. With a (D, N, M) tensor there are two batch axes, so “column 0” is ambiguous and PyHDC will not guess. The axis argument is keyword-only; pass it to name the batch axis along which index 0 is split from the rest:

tensor = enc.generate(size=(10_000, 4, 6))   # shape (10000, 4, 6)
sims   = enc.similarity(tensor, axis=1)      # split along axis 1

# Without axis, a 3-D single input raises:
#   ValueError: single-input similarity on a (D, N, M, ...) batch
#   requires an explicit axis

The chosen split axis is kept rather than squeezed out: the head has length 1 along it and the rest has length size - 1, so the head broadcasts against the rest and the remaining batch axes.
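
The head-versus-rest split can be pictured with NumPy slicing. This is a sketch of the idea only: `split_head_rest` and `cosine_axis0` are hypothetical helpers, and the (3, 6) result shape is what NumPy broadcasting yields here, not a documented PyHDC return shape.

```python
import numpy as np

def cosine_axis0(a, b):
    """Cosine similarity over axis 0; batch axes follow NumPy broadcasting."""
    dot = np.sum(a * b, axis=0)
    return dot / (np.linalg.norm(a, axis=0) * np.linalg.norm(b, axis=0))

def split_head_rest(tensor, axis):
    """Split index 0 from the rest along a batch axis, keeping that axis."""
    head = np.take(tensor, [0], axis=axis)   # length 1 along `axis`
    rest = np.take(tensor, np.arange(1, tensor.shape[axis]), axis=axis)
    return head, rest

rng = np.random.default_rng(2)
tensor = rng.choice([-1.0, 1.0], size=(1_000, 4, 6))

head, rest = split_head_rest(tensor, axis=1)   # (1000, 1, 6) and (1000, 3, 6)
sims = cosine_axis0(head, rest)                # (3, 6) in this sketch
```

In this sketch, `sims[i, j]` compares `tensor[:, 0, j]` with `tensor[:, i + 1, j]`, so each slice along the other batch axis gets its own query.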

Two inputs: output shape by rank. With two inputs, the result shape is the broadcast of the two operands’ batch axes (axes 1 and up). Axis 0 is reduced away. Two 1-D inputs return a Python float; every other combination returns a NumPy array or PyTorch tensor.

A shape	B shape	Result
`(D,)`	`(D,)`	Python `float` (scalar)
`(D,)`	`(D, N)`	`(N,)`
`(D, N)`	`(D,)`	`(N,)`
`(D, N)`	`(D, N)`	`(N,)`
`(D,)`	`(D, N, M)`	`(N, M)`
`(D, N, M)`	`(D,)`	`(N, M)`
`(D, N)`	`(D, N, M)`	`(N, M)` (A padded to `(D, N, 1)`, broadcast over M)
`(D, N, M)`	`(D, N)`	`(N, M)`
`(D, N, M)`	`(D, N, M)`	`(N, M)`
`(D, 1, M)`	`(D, N, M)`	`(N, M)` (broadcast over axis 1)
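
The rule behind the table can be written down as a small shape calculator. This is a hypothetical helper that mirrors the table, not a PyHDC function; it assumes the lower-rank operand is padded with trailing length-1 batch axes, as the (D, N) versus (D, N, M) row describes.

```python
def similarity_result_shape(shape_a, shape_b):
    """Predict the result shape of a two-input similarity call.

    Hypothetical helper: drops axis 0 (D), pads the lower-rank operand with
    trailing length-1 batch axes, then broadcasts the batch axes.
    """
    if shape_a[0] != shape_b[0]:
        raise ValueError("both operands must share the dimension D on axis 0")
    batch_a, batch_b = tuple(shape_a[1:]), tuple(shape_b[1:])
    rank = max(len(batch_a), len(batch_b))
    batch_a += (1,) * (rank - len(batch_a))   # e.g. (N,) -> (N, 1)
    batch_b += (1,) * (rank - len(batch_b))
    result = []
    for n_a, n_b in zip(batch_a, batch_b):
        if n_a != n_b and 1 not in (n_a, n_b):
            raise ValueError(f"batch axes {batch_a} and {batch_b} do not broadcast")
        result.append(n_b if n_a == 1 else n_a)
    return tuple(result)

D, N, M = 10_000, 4, 6
shape_scalar = similarity_result_shape((D,), (D,))                 # () means a plain float
shape_vec_batch = similarity_result_shape((D,), (D, N))            # (4,)
shape_batch_tensor = similarity_result_shape((D, N), (D, N, M))    # (4, 6)
shape_row_tensor = similarity_result_shape((D, 1, M), (D, N, M))   # (4, 6)
```

Note that the trailing padding differs from NumPy's default, which pads on the left; that is why a (D, N) batch lines up with axis 1 of a (D, N, M) tensor rather than axis 2.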

Two tensors of matching shape reduce to one score per column pair:

tensor_a = enc.generate(size=(10_000, 4, 6))   # shape (10000, 4, 6)
tensor_b = enc.generate(size=(10_000, 4, 6))   # shape (10000, 4, 6)
sims     = enc.similarity(tensor_a, tensor_b)  # shape (4, 6)

A single vector compared against a whole tensor broadcasts over both batch axes:

query  = enc.generate()                       # shape (10000,)
tensor = enc.generate(size=(10_000, 4, 6))    # shape (10000, 4, 6)
sims   = enc.similarity(query, tensor)        # shape (4, 6)

Output ranges by encoding

Encoding family	Similarity metric	Output range
MAP_C, MAP_I, MAP_I_Bits, MAP_B	Cosine	[-1, 1]
HRR, HRR_NoNorm, HRR_ConstNorm	Cosine	[-1, 1]
FHRR	Angle distance	[-1, 1]
VTB, MBAT	Cosine	[-1, 1]
BSC	Hamming (remapped)	[-1, 1] (was [0,1] in v1.0.x)
BSDC family	Overlap (remapped)	[-1, 1] (was [0,1] in v1.0.x)
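
For binary encodings, one natural way to land in [-1, 1] is a linear map of the normalized Hamming distance. The sketch below assumes that map (1 - 2h); the exact formula PyHDC uses is not stated here, and `hamming_similarity` is a hypothetical name.

```python
import random

def hamming_similarity(u, v):
    """Map the normalized Hamming distance h in [0, 1] to 1 - 2*h in [-1, 1]."""
    mismatches = sum(x != y for x, y in zip(u, v))
    return 1.0 - 2.0 * mismatches / len(u)

rng = random.Random(4)
D = 10_000
a = [rng.randint(0, 1) for _ in range(D)]    # random binary vector
b = [rng.randint(0, 1) for _ in range(D)]    # independent random vector
complement = [1 - bit for bit in a]          # every bit flipped

sim_same = hamming_similarity(a, a)                  # no mismatches: 1.0
sim_complement = hamming_similarity(a, complement)   # all mismatches: -1.0
sim_random = hamming_similarity(a, b)                # about half mismatch: near 0
```

Under this assumption, two random binary vectors score near 0 instead of near 0.5, which lines up with the cosine-based families.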

Remapping to [0, 1]

If your downstream code expects [0, 1] (e.g., scikit-learn metrics), pass similarity_remap to the encoding constructor:

from pyhdc.components.similarity import remap_to_unit

enc = pyhdc.BSC(dimension=10_000, similarity_remap=remap_to_unit)
a   = enc.generate()
print(a.similarity(a))   # 1.0
print(a.similarity(enc.generate()))   # ~= 0.5

Or apply remap_to_unit manually:

raw      = a.similarity(b)          # in [-1, 1]
remapped = remap_to_unit(raw)        # in [0, 1]
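
The outputs shown above (1.0 for identical vectors, about 0.5 for random ones) are consistent with a linear map from [-1, 1] to [0, 1]. The sketch below assumes that map; `to_unit` is a hypothetical stand-in, not the library's remap_to_unit.

```python
def to_unit(sim):
    """Linearly map a similarity in [-1, 1] onto [0, 1]."""
    return (sim + 1.0) / 2.0

raw = [-1.0, -0.2, 0.0, 0.4, 1.0]
remapped = [to_unit(s) for s in raw]   # approximately [0.0, 0.4, 0.5, 0.7, 1.0]

# A monotonic remap leaves rankings unchanged.
order_raw = sorted(range(len(raw)), key=lambda i: raw[i])
order_remapped = sorted(range(len(remapped)), key=lambda i: remapped[i])
```

Because such a map is monotonic, nearest-neighbour results are the same before and after remapping; only thresholds need to be adjusted.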

Nearest-neighbour lookup

Find the closest match to a query in a small codebook:

codebook = {name: enc.generate() for name in ['red', 'green', 'blue', 'yellow']}
query    = codebook['red'].bundle(enc.generate())   # noisy version of red

best = max(codebook, key=lambda k: query.similarity(codebook[k]))
print(best)   # red
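
To see why the noisy query still resolves to red, here is a pure-Python sketch that does not call PyHDC. It assumes bundling behaves like an elementwise sum of bipolar vectors, which is a simplification; `cosine` and `random_bipolar` are hypothetical helpers.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity of two equal-length sequences of numbers."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

rng = random.Random(3)
D = 10_000

def random_bipolar():
    return [rng.choice((-1, 1)) for _ in range(D)]

codebook = {name: random_bipolar() for name in ["red", "green", "blue", "yellow"]}
noise = random_bipolar()
# Stand-in for bundling: elementwise sum of the stored vector and the noise.
query = [r + n for r, n in zip(codebook["red"], noise)]

scores = {name: cosine(query, hv) for name, hv in codebook.items()}
best = max(scores, key=scores.get)
```

In this sketch the score for red sits near 0.7 while the other entries stay near 0, so the gap is wide even though half of the query is noise.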

For large codebooks (thousands of items), keep the codebook as one (D, N) batch and compare in a single vectorized call:

import numpy as np

enc      = pyhdc.MAP_C(dimension=10_000)
codebook = enc.generate(size=(10_000, 5_000))   # (D, N): 5000 items as columns
query    = enc.generate()                        # (D,)

sims     = enc.similarity(query, codebook)        # shape (5000,) broadcast
best_idx = int(np.argmax(sims))
print(best_idx)

# Or stack the query as column 0 and use convention 3:
stacked  = pyhdc.stack([query, codebook])          # (D, 5001)
sims     = enc.similarity(stacked)                 # shape (5000,)

# Pull the three best candidates back out of the batch with select.
# np.argsort sorts ascending, so the best match is the last of the three.
top3     = codebook.select(np.argsort(sims)[-3:])  # (D, 3)
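
The index arithmetic in the last step is plain NumPy and can be checked on a toy score array. The values below are made up for illustration; `np.argpartition` is shown as an option for very large codebooks when the order among the top hits does not matter.

```python
import numpy as np

sims = np.array([0.10, 0.80, -0.20, 0.55, 0.30])   # toy similarity scores

ascending_top3 = np.argsort(sims)[-3:]    # indices 4, 3, 1: best match last
best_first_top3 = ascending_top3[::-1]    # indices 1, 3, 4: best match first

# argpartition avoids a full sort; the three indices come back in no fixed order.
unordered_top3 = np.argpartition(sims, -3)[-3:]
```

Reverse the slice when you want the best match first, and prefer `np.argpartition` only when you need the set of top candidates rather than their ranking.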