How to Compute Similarity
==========================

Similarity measures how related two hypervectors are. PyHDC returns values in
**[-1, 1]** (1 = identical, 0 = unrelated, −1 = maximally dissimilar).

Basic usage
-----------

**Instance method** (most common):

.. code-block:: python

   import pyhdc

   enc = pyhdc.MAP_C(dimension=10_000)
   a = enc.generate()
   b = enc.generate()

   sim = a.similarity(b)  # float

**Encoding method** (same result):

.. code-block:: python

   sim = enc.similarity(a, b)

.. _similarity-batched:

Batched similarity: three calling conventions
----------------------------------------------

As of v1.1.0, both ``Hypervector.similarity()`` and ``Encoding.similarity()``
accept three input shapes:

**Convention 1: two 1-D vectors → scalar**

.. code-block:: python

   a = enc.generate()          # shape (10000,)
   b = enc.generate()          # shape (10000,)
   sim = enc.similarity(a, b)  # float

**Convention 2: two 2-D batches → 1-D array (per-row pairs)**

Element ``i`` of the result is ``similarity(a[i], b[i])``:

.. code-block:: python

   batch_a = enc.generate(size=50)          # shape (50, 10000)
   batch_b = enc.generate(size=50)          # shape (50, 10000)
   sims = enc.similarity(batch_a, batch_b)  # shape (50,)

**Convention 3: single 2-D batch → 1-D array (first row vs. rest)**

Row 0 is the query; rows 1 onward are the candidates:

.. code-block:: python

   query_plus_codebook = enc.generate(size=101)  # shape (101, 10000)
   sims = enc.similarity(query_plus_codebook)    # shape (100,)
   # sims[i] = similarity(query_plus_codebook[0], query_plus_codebook[i+1])

**Batched list form at the encoding level**

You can also pass lists of ``Hypervector`` objects:

.. code-block:: python

   hvs_a = [enc.generate() for _ in range(5)]
   hvs_b = [enc.generate() for _ in range(5)]
   sims = enc.similarity(hvs_a, hvs_b)  # list of 5 floats

Output ranges by encoding
--------------------------

.. list-table::
   :header-rows: 1
   :widths: 30 20 50

   * - Encoding family
     - Similarity metric
     - Output range
   * - MAP_C, MAP_I, MAP_I_Bits, MAP_B
     - Cosine
     - [-1, 1]
   * - HRR, HRR_NoNorm, HRR_ConstNorm
     - Cosine
     - [-1, 1]
   * - FHRR
     - Angle distance
     - [-1, 1]
   * - VTB, MBAT
     - Cosine
     - [-1, 1]
   * - BSC
     - Hamming (remapped)
     - [-1, 1] *(was [0, 1] in v1.0.x)*
   * - BSDC family
     - Overlap (remapped)
     - [-1, 1] *(was [0, 1] in v1.0.x)*

Remapping to [0, 1]
--------------------

If your downstream code expects values in [0, 1] (e.g., scikit-learn metrics),
pass ``similarity_remap`` to the encoding constructor:

.. code-block:: python

   import pyhdc
   from pyhdc.components.similarity import remap_to_unit

   enc = pyhdc.BSC(dimension=10_000, similarity_remap=remap_to_unit)
   a = enc.generate()
   print(a.similarity(a))               # 1.0
   print(a.similarity(enc.generate()))  # approximately 0.5

Or apply ``remap_to_unit`` manually:

.. code-block:: python

   raw = a.similarity(b)           # in [-1, 1]
   remapped = remap_to_unit(raw)   # in [0, 1]

Nearest-neighbour lookup
-------------------------

Find the closest match to a query in a codebook:

.. code-block:: python

   codebook = {name: enc.generate() for name in ['red', 'green', 'blue', 'yellow']}
   query = codebook['red'].bundle(enc.generate())  # noisy version of red

   best = max(codebook, key=lambda k: query.similarity(codebook[k]))
   print(best)  # red

For large codebooks (thousands of items), use batched convention 3 for speed:

.. code-block:: python

   import numpy as np

   names = list(codebook)
   hvs = [codebook[n] for n in names]

   # Stack query + codebook into one batch.
   enc_torch = pyhdc.MAP_C(dimension=10_000, backend="torch")
   # Placeholder batch of random hypervectors with the right shape.
   # In practice, assign the query to stacked[0] and the codebook to stacked[1:].
   stacked = enc_torch.generate(size=len(hvs) + 1)

   sims = enc_torch.similarity(stacked)  # shape (len(hvs),)
   best_idx = int(np.argmax(sims))
   print(names[best_idx])  # arbitrary here, because the placeholder batch is random
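
To make the "first row vs. rest" convention and the arg-max lookup concrete, the
following is a minimal plain-NumPy sketch. It is not PyHDC's implementation: the
helper ``first_vs_rest_cosine`` is a hypothetical name introduced here, and the
random bipolar vectors and the addition used to build a noisy query are
assumptions standing in for MAP-style hypervectors and bundling.

.. code-block:: python

   import numpy as np

   def first_vs_rest_cosine(stacked):
       """Cosine similarity of row 0 (the query) against every later row."""
       stacked = np.asarray(stacked, dtype=float)
       query, candidates = stacked[0], stacked[1:]
       # Product of each candidate's norm with the query's norm.
       norms = np.linalg.norm(candidates, axis=1) * np.linalg.norm(query)
       return candidates @ query / norms

   rng = np.random.default_rng(0)
   names = ["red", "green", "blue", "yellow"]

   # Assumed stand-in for a codebook: one random bipolar vector per name.
   codebook = rng.choice([-1.0, 1.0], size=(len(names), 10_000))

   # Noisy query: the "red" vector plus an unrelated random vector.
   query = codebook[0] + rng.choice([-1.0, 1.0], size=10_000)

   # Row 0 is the query, rows 1 onward are the candidates.
   stacked = np.vstack([query, codebook])
   sims = first_vs_rest_cosine(stacked)  # shape (4,)

   best_name = names[int(np.argmax(sims))]
   print(best_name)  # red

The similarity to ``red`` comes out near 0.7 while the unrelated entries stay
close to 0, which is why a single arg-max over the batched result is enough to
recover the match.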