How to Compute Similarity
==========================

Similarity measures how related two hypervectors are. By default, PyHDC returns
values in **[-1, 1]** (1 = identical, 0 = unrelated, -1 = maximally dissimilar).

Basic usage
-----------

**Instance method** (most common):

.. code-block:: python

   import pyhdc

   enc = pyhdc.MAP_C(dimension=10_000)
   a   = enc.generate()
   b   = enc.generate()

   sim = a.similarity(b)   # float

**Encoding method** (same result):

.. code-block:: python

   sim = enc.similarity(a, b)

.. _similarity-batched:

Batched similarity: calling conventions
---------------------------------------

Hypervectors are dimension-first: a single vector has shape ``(D,)`` and a batch
of ``N`` vectors has shape ``(D, N)`` (each **column** is a hypervector). Both
``Hypervector.similarity()`` and ``Encoding.similarity()`` reduce over axis 0
(the dimension) and support these shapes:

**Convention 1: two 1-D vectors -> scalar**

.. code-block:: python

   a = enc.generate()   # shape (10000,)
   b = enc.generate()   # shape (10000,)
   sim = enc.similarity(a, b)   # float

**Convention 2: two (D, N) batches -> 1-D array (per-column pairs)**

Element ``i`` of the result is ``similarity(A[:, i], B[:, i])``:

.. code-block:: python

   batch_a = enc.generate(size=(10_000, 50))   # shape (10000, 50)
   batch_b = enc.generate(size=(10_000, 50))   # shape (10000, 50)
   sims    = enc.similarity(batch_a, batch_b)   # shape (50,)

**Convention 3: single (D, N) batch -> 1-D array (column 0 vs the rest)**

Column 0 is the query; columns 1+ are the candidates:

.. code-block:: python

   query_plus_codebook = enc.generate(size=(10_000, 101))   # shape (10000, 101)
   sims = enc.similarity(query_plus_codebook)                # shape (100,)
   # sims[i] = similarity(column 0, column i+1)

**Convention 4: one vector vs a batch -> 1-D array (broadcast)**

A ``(D,)`` vector compared against every column of a ``(D, N)`` batch:

.. code-block:: python

   query    = enc.generate()                    # shape (10000,)
   codebook = enc.generate(size=(10_000, 100))   # shape (10000, 100)
   sims     = enc.similarity(query, codebook)    # shape (100,)

**Batched list form at the encoding level**

You can also pass two equal-length lists of ``Hypervector`` objects:

.. code-block:: python

   hvs_a = [enc.generate() for _ in range(5)]
   hvs_b = [enc.generate() for _ in range(5)]
   sims  = enc.similarity(hvs_a, hvs_b)   # list of 5 floats
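
The shape behaviour of these conventions can be mimicked in plain NumPy, which
may help when reasoning about batches. The sketch below is not PyHDC's
implementation: ``cosine_axis0`` is a hypothetical helper that assumes cosine
similarity and a trailing batch axis for 1-D inputs.

.. code-block:: python

   import numpy as np

   def cosine_axis0(a, b):
       """Cosine similarity reduced over axis 0 (hypothetical helper, not a PyHDC API)."""
       a = np.asarray(a, dtype=float)
       b = np.asarray(b, dtype=float)
       # Give a 1-D vector a trailing batch axis so it broadcasts against (D, N).
       if a.ndim == 1 and b.ndim > 1:
           a = a[:, None]
       if b.ndim == 1 and a.ndim > 1:
           b = b[:, None]
       dot = np.sum(a * b, axis=0)
       norms = np.linalg.norm(a, axis=0) * np.linalg.norm(b, axis=0)
       return dot / norms

   rng = np.random.default_rng(0)
   D, N = 1_000, 5
   A = rng.choice([-1.0, 1.0], size=(D, N))   # stand-in batch, one vector per column
   B = rng.choice([-1.0, 1.0], size=(D, N))

   pairwise     = cosine_axis0(A, B)               # like convention 2: shape (N,)
   one_vs_all   = cosine_axis0(A[:, 0], B)         # like convention 4: shape (N,)
   head_vs_rest = cosine_axis0(A[:, 0], A[:, 1:])  # like convention 3: shape (N - 1,)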

Similarity on (D, N, M) tensors
-------------------------------

A tensor of hypervectors has shape ``(D, N, M)``. Axis 0 is the dimension ``D``,
and axes 1 and 2 are batch axes, so each ``tensor[:, i, j]`` column is one
hypervector. Similarity always reduces over axis 0; the batch axes pass through
to the result shape.

**Single 3-D input needs an explicit axis.** With a ``(D, N)`` batch the
column-0-versus-rest split (convention 3) is well defined because there is one
batch axis. With a ``(D, N, M)`` tensor there are two batch axes, so "column 0"
is ambiguous and PyHDC will not guess. Pass the keyword-only ``axis`` argument to
name the batch axis that splits index 0 from the rest:

.. code-block:: python

   tensor = enc.generate(size=(10_000, 4, 6))   # shape (10000, 4, 6)
   sims   = enc.similarity(tensor, axis=1)      # split along axis 1

   # Without axis, a 3-D single input raises:
   #   ValueError: single-input similarity on a (D, N, M, ...) batch
   #   requires an explicit axis

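The split axis is kept on both sides: the head retains it with length 1 and the
rest retains it with length ``size - 1``, so the head broadcasts against the rest
across the remaining batch axes. In the example above, the head has shape
``(D, 1, 6)`` and the rest has shape ``(D, 3, 6)``.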
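
To make the split concrete, here is a NumPy-only sketch of the same idea. It
assumes cosine similarity and uses a random array in place of a PyHDC tensor, so
it illustrates the shapes involved and is not the library's own code path. Under
this reading, a ``(D, 4, 6)`` input split along axis 1 yields a ``(3, 6)`` result.

.. code-block:: python

   import numpy as np

   rng = np.random.default_rng(0)
   tensor = rng.standard_normal((256, 4, 6))   # stand-in for a (D, N, M) tensor

   axis = 1
   # Index with a list so the split axis survives with length 1.
   head = np.take(tensor, [0], axis=axis)                                # (256, 1, 6)
   rest = np.take(tensor, np.arange(1, tensor.shape[axis]), axis=axis)   # (256, 3, 6)

   # Cosine over axis 0; the length-1 head broadcasts against the rest.
   dot   = np.sum(head * rest, axis=0)
   norms = np.linalg.norm(head, axis=0) * np.linalg.norm(rest, axis=0)
   sims  = dot / norms
   print(sims.shape)   # (3, 6)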

**Two inputs: output shape by rank.** With two inputs, the result shape is the
broadcast of the two operands' batch axes (axes 1 and up). Axis 0 is reduced
away. Two 1-D inputs return a Python ``float``; every other combination returns
a NumPy array or torch tensor.

.. list-table::
   :header-rows: 1
   :widths: 30 30 40

   * - A shape
     - B shape
     - Result
   * - ``(D,)``
     - ``(D,)``
     - Python ``float`` (scalar)
   * - ``(D,)``
     - ``(D, N)``
     - ``(N,)``
   * - ``(D, N)``
     - ``(D,)``
     - ``(N,)``
   * - ``(D, N)``
     - ``(D, N)``
     - ``(N,)``
   * - ``(D,)``
     - ``(D, N, M)``
     - ``(N, M)``
   * - ``(D, N, M)``
     - ``(D,)``
     - ``(N, M)``
   * - ``(D, N)``
     - ``(D, N, M)``
     - ``(N, M)`` (A padded to ``(D, N, 1)``, broadcast over M)
   * - ``(D, N, M)``
     - ``(D, N)``
     - ``(N, M)``
   * - ``(D, N, M)``
     - ``(D, N, M)``
     - ``(N, M)``
   * - ``(D, 1, M)``
     - ``(D, N, M)``
     - ``(N, M)`` (broadcast over axis 1)

Two tensors of matching shape reduce to one score per column pair:

.. code-block:: python

   tensor_a = enc.generate(size=(10_000, 4, 6))   # shape (10000, 4, 6)
   tensor_b = enc.generate(size=(10_000, 4, 6))   # shape (10000, 4, 6)
   sims     = enc.similarity(tensor_a, tensor_b)  # shape (4, 6)

A single vector compared against a whole tensor broadcasts over both batch axes:

.. code-block:: python

   query  = enc.generate()                       # shape (10000,)
   tensor = enc.generate(size=(10_000, 4, 6))    # shape (10000, 4, 6)
   sims   = enc.similarity(query, tensor)        # shape (4, 6)
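
The shape table can be reproduced with a few lines of pure Python. The helper
below, ``result_shape``, is hypothetical and not part of PyHDC. It assumes the
rule implied by the table: drop axis 0, right-pad the shorter batch shape with
1s (so ``(N,)`` becomes ``(N, 1)``), then broadcast axis by axis.

.. code-block:: python

   def result_shape(a_shape, b_shape):
       """Predict the batch shape of similarity(A, B) (hypothetical helper)."""
       if a_shape[0] != b_shape[0]:
           raise ValueError("dimension D (axis 0) must match")
       a_batch, b_batch = list(a_shape[1:]), list(b_shape[1:])
       # Right-pad the shorter batch shape with 1s, e.g. (N,) -> (N, 1).
       rank = max(len(a_batch), len(b_batch))
       a_batch += [1] * (rank - len(a_batch))
       b_batch += [1] * (rank - len(b_batch))
       out = []
       for x, y in zip(a_batch, b_batch):
           if x != y and 1 not in (x, y):
               raise ValueError(f"batch axes {x} and {y} do not broadcast")
           out.append(max(x, y))
       return tuple(out)   # () stands for a scalar result

   D = 10_000
   print(result_shape((D,), (D,)))            # ()
   print(result_shape((D, 4), (D, 4, 6)))     # (4, 6)
   print(result_shape((D, 1, 6), (D, 4, 6)))  # (4, 6)

Note that the padding is on the right, which differs from NumPy's default of
padding missing axes on the left.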

Output ranges by encoding
--------------------------

.. list-table::
   :header-rows: 1
   :widths: 30 20 50

   * - Encoding family
     - Similarity metric
     - Output range
   * - MAP_C, MAP_I, MAP_I_Bits, MAP_B
     - Cosine
     - [-1, 1]
   * - HRR, HRR_NoNorm, HRR_ConstNorm
     - Cosine
     - [-1, 1]
   * - FHRR
     - Angle distance
     - [-1, 1]
   * - VTB, MBAT
     - Cosine
     - [-1, 1]
   * - BSC
     - Hamming (remapped)
     - [-1, 1]  *(was [0, 1] in v1.0.x)*
   * - BSDC family
     - Overlap (remapped)
     - [-1, 1]  *(was [0, 1] in v1.0.x)*
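
As an illustration of what "remapped" can mean for binary vectors, the sketch
below maps normalised Hamming distance linearly onto [-1, 1]. The formula
``1 - 2 * distance`` is an assumption made for this example; the exact
expression PyHDC uses is not stated here, and ``hamming_similarity`` is a
hypothetical name.

.. code-block:: python

   import numpy as np

   def hamming_similarity(a, b):
       """Hamming similarity mapped linearly onto [-1, 1] (assumed formula)."""
       a = np.asarray(a, dtype=bool)
       b = np.asarray(b, dtype=bool)
       mismatch = np.mean(a != b)   # normalised Hamming distance in [0, 1]
       return float(1.0 - 2.0 * mismatch)

   rng = np.random.default_rng(0)
   x = rng.integers(0, 2, size=10_000).astype(bool)
   y = rng.integers(0, 2, size=10_000).astype(bool)

   print(hamming_similarity(x, x))    # 1.0
   print(hamming_similarity(x, ~x))   # -1.0
   print(hamming_similarity(x, y))    # close to 0 for independent random vectors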

Remapping to [0, 1]
--------------------

If your downstream code expects [0, 1] (e.g., scikit-learn metrics), pass
``similarity_remap`` to the encoding constructor:

.. code-block:: python

   from pyhdc.components.similarity import remap_to_unit

   enc = pyhdc.BSC(dimension=10_000, similarity_remap=remap_to_unit)
   a   = enc.generate()
   print(a.similarity(a))   # 1.0
   print(a.similarity(enc.generate()))   # ~= 0.5

Or apply ``remap_to_unit`` manually to a raw score from an encoding created
without ``similarity_remap``:

.. code-block:: python

   raw      = a.similarity(b)          # in [-1, 1]
   remapped = remap_to_unit(raw)        # in [0, 1]
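
The two printed values in the constructor example (1.0 for identical vectors,
about 0.5 for unrelated ones) are consistent with a linear map from [-1, 1] to
[0, 1]. The sketch below shows that map under this assumption; ``to_unit`` is a
hypothetical stand-in, not the source of ``remap_to_unit``.

.. code-block:: python

   def to_unit(sim):
       """Map a similarity from [-1, 1] onto [0, 1] (assumed linear remap)."""
       return (sim + 1.0) / 2.0

   print(to_unit(1.0))    # 1.0
   print(to_unit(0.0))    # 0.5
   print(to_unit(-1.0))   # 0.0

Because the function only uses arithmetic, it also works element-wise on a
NumPy array of scores.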

Nearest-neighbour lookup
-------------------------

Find the closest match to a query in a small codebook:

.. code-block:: python

   codebook = {name: enc.generate() for name in ['red', 'green', 'blue', 'yellow']}
   query    = codebook['red'].bundle(enc.generate())   # noisy version of red

   best = max(codebook, key=lambda k: query.similarity(codebook[k]))
   print(best)   # red

For large codebooks (thousands of items), keep the codebook as one ``(D, N)``
batch and compare in a single vectorized call:

.. code-block:: python

   import numpy as np

   enc      = pyhdc.MAP_C(dimension=10_000)
   codebook = enc.generate(size=(10_000, 5_000))   # (D, N): 5000 items as columns
   query    = enc.generate()                        # (D,)

   sims     = enc.similarity(query, codebook)        # shape (5000,) broadcast
   best_idx = int(np.argmax(sims))
   print(best_idx)

   # Or stack the query as column 0 and use convention 3:
   stacked  = pyhdc.stack([query, codebook])          # (D, 5001)
   sims     = enc.similarity(stacked)                 # shape (5000,)

   # Pull specific candidates back out of the batch with select:
   top3     = codebook.select(np.argsort(sims)[-3:])  # (D, 3)
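
Note that ``np.argsort(sims)[-3:]`` returns the three best indices in ascending
order of similarity, and sorts the whole array to get them. For very large
codebooks, a partial sort may be cheaper. The following is a NumPy-only sketch
with made-up scores; ``top_k`` is a hypothetical helper, not a PyHDC function.

.. code-block:: python

   import numpy as np

   def top_k(sims, k):
       """Indices of the k largest scores, best first."""
       sims = np.asarray(sims)
       k = min(k, sims.size)
       if k <= 0:
           return np.array([], dtype=int)
       part = np.argpartition(sims, -k)[-k:]        # unordered top k
       return part[np.argsort(sims[part])[::-1]]    # sort only those k, descending

   sims = np.array([0.02, 0.71, -0.10, 0.35, 0.68])   # illustrative values
   print(top_k(sims, 3))   # [1 4 3]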