Tutorial 5: Implementing a Custom Encoding
==========================================

Every encoding in PyHDC is a thin subclass of :class:`~pyhdc.Encoding` that
returns one :class:`EncodingSpec`. The spec wires together the component
functions for generation, similarity, bundling, binding, unbinding, thinning,
and the four unary operations (permute, inverse, normalize, negative). This
tutorial builds a complete, working encoding from existing components, then
generates, bundles, binds, unbinds, compares, permutes, and inverts vectors of
your own encoding. The last sections go one level deeper: writing a component
function from scratch and wiring it into a new encoding.

**Prerequisites**: :doc:`tutorial_1_text_classification`

----

When to write a custom encoding
--------------------------------

The built-in encodings cover the standard HDC families. You write your own
subclass when you want to:

* **Swap one component** in an existing scheme. For example, pair bipolar
  multiply-binding with a different bundling rule or similarity metric.
* **Reuse the operation surface** without reimplementing it. The base class
  already handles batching, backends, broadcasting, and the dimension-first
  contract. You supply the per-element behavior and the base class does the
  rest.
* **Prototype a new encoding** by composing components before committing to a
  full implementation.

Only one method needs an override: ``_get_encoding_spec``. It is the single
abstract method on :class:`~pyhdc.Encoding`. The constructor (dimension,
backend, device, dtype, mask, generator, similarity_remap) is inherited, so a
custom encoding gets the same call surface as the built-ins.

----

The EncodingSpec fields
-----------------------

:class:`EncodingSpec` is a dataclass with seven required fields and six fields
that carry defaults. The required fields wire the core operations. The
defaulted fields cover the bit mask, the generator output contract, and the
four unary operations.

.. list-table::
   :header-rows: 1
   :widths: 28 14 58

   * - Field
     - Default
     - Meaning
   * - ``dtype``
     - required
     - Element data type (e.g. ``np.int32``, ``np.float32``).
   * - ``element_generator``
     - required
     - Callable that draws random element values for one vector.
   * - ``similarity_fn``
     - required
     - Similarity metric, reduced over axis 0.
   * - ``bundling_fn``
     - required
     - Bundling (superposition) rule.
   * - ``thinning_fn``
     - required
     - Thinning rule, or ``NoThin`` when the family does not thin.
   * - ``binding_fn``
     - required
     - Binding rule.
   * - ``unbinding_fn``
     - required
     - Unbinding rule (set to ``RaiseNotImplementedError`` to forbid it).
   * - ``mask``
     - ``None``
     - Integer bit mask; used by ``MAP_I_Bits``, ignored elsewhere.
   * - ``generator_output_type``
     - ``"floats"``
     - ``"floats"``, ``"bits"``, or ``"words"``. The output a custom
       generator must supply.
   * - ``permute_fn``
     - ``None``
     - Permutation. ``None`` falls back to the shared ``CyclicShift``.
   * - ``inverse_fn``
     - raises
     - Binding inverse. Left unset, ``inverse()`` raises ``NotImplementedError``.
   * - ``normalize_fn``
     - raises
     - Convert to entry distribution. Left unset, ``normalize()`` raises ``NotImplementedError``.
   * - ``negative_fn``
     - raises
     - Additive (bundling) inverse. Left unset, ``negative()`` raises
       ``NotImplementedError``.

The defaulted unary fields are the part most people miss. ``inverse_fn``,
``normalize_fn``, and ``negative_fn`` default to a function that raises
``NotImplementedError``. If you want ``inverse()``, ``normalize()``, or
``negative()`` to work on your encoding, you must wire them. Leaving a field
unset is how the built-in families mark an operation as unsupported. ``MAP_C``,
for instance, sets no ``inverse_fn``, so calling ``inverse()`` on a MAP_C vector
raises an exception. ``permute_fn`` is different, ``None`` is a working default, because the
shared ``CyclicShift`` is encoding-agnostic and every built-in uses it.

----

Building the encoding
---------------------

The example below is a minimal bipolar Multiply-Add-Permute scheme, assembled
from the same components the built-in ``MAP_I`` uses. Elements are drawn from
``{-1, +1}``, binding is element-wise multiplication (which is its own inverse
for bipolar values), bundling is element-wise addition, and similarity is
cosine. The four unary fields are wired so that permute, inverse, normalize,
and negative all work.

.. code-block:: python

   import numpy as np

   from pyhdc.encodings.base import Encoding
   from pyhdc.hypervector import EncodingSpec
   from pyhdc.components.elements import BernoulliBipolar
   from pyhdc.components.binding import ElementMultiplication
   from pyhdc.components.bundling import ElementAddition
   from pyhdc.components.similarity import CosineSimilarity
   from pyhdc.components.thinning import NoThin
   from pyhdc.components.unary import (
       CyclicShift,
       IdentityInverse,
       Negate,
       SignNormalize,
   )


   class MyMAP(Encoding):
       """A minimal bipolar Multiply-Add-Permute encoding.

       Elements are drawn from {-1, +1}. Binding is element-wise
       multiplication (its own inverse), bundling is element-wise addition,
       and similarity is cosine. The 2.1.0 unary fields wire permute,
       inverse, normalize, and negative.
       """

       def _get_encoding_spec(self) -> EncodingSpec:
           return EncodingSpec(
               dtype=np.int32,
               element_generator=BernoulliBipolar,
               similarity_fn=CosineSimilarity,
               bundling_fn=ElementAddition,
               thinning_fn=NoThin,
               binding_fn=ElementMultiplication,
               unbinding_fn=ElementMultiplication,
               generator_output_type="bits",
               permute_fn=CyclicShift,
               inverse_fn=IdentityInverse,
               normalize_fn=SignNormalize,
               negative_fn=Negate,
           )

A few notes on the choices:

* ``element_generator=BernoulliBipolar`` draws each element from ``{-1, +1}``
  with equal probability, so ``generator_output_type="bits"`` describes what a
  custom generator would have to supply.
* ``binding_fn`` and ``unbinding_fn`` are both ``ElementMultiplication``.
  Element-wise multiply by ``{-1, +1}`` is its own inverse, so unbinding is the
  same operation as binding.
* ``inverse_fn=IdentityInverse`` matches that self-inverse property: the
  binding inverse of a bipolar vector is the vector itself.
* ``normalize_fn=SignNormalize`` sends a bundled vector (which holds integer
  sums) back to bipolar ``{-1, 0, +1}`` by taking the sign.
* ``negative_fn=Negate`` is element-wise negation, the additive inverse used by
  bundling.
* ``permute_fn=CyclicShift`` is set here to show the field; passing ``None``
  would select the same shared ``CyclicShift`` automatically.

----

Generating and inspecting vectors
----------------------------------

Construct the encoding like any built-in and generate vectors. Single vectors
are ``(D,)``, a batch is dimension-first, so ``size=(D, N)`` returns ``(D, N)``
with each column one hypervector.

.. code-block:: python

   enc = MyMAP(dimension=10_000)

   a = enc.generate()
   b = enc.generate()
   print("single shape:", a.data.shape)         # (10000,)

   batch = enc.generate(size=(10_000, 5))
   print("batch shape: ", batch.data.shape)     # (10000, 5)

   # Each element is bipolar.
   print(set(np.unique(a.data)))                # {-1, 1}

----

Bundle, bind, and unbind
------------------------

Bundling superposes vectors, the result stays similar to every input. Binding
combines two vectors into one that is dissimilar to both, and unbinding
recovers a component. Because element-wise multiply is exactly self-inverse for
bipolar values, ``unbind`` returns the partner without approximation.

.. code-block:: python

   # Bundle: the superposition is similar to both inputs.
   ab = enc.bundle(a, b)
   print("sim(a, a+b):", round(float(a.similarity(ab)), 4))   # ~= 0.63
   print("sim(b, a+b):", round(float(b.similarity(ab)), 4))   # ~= 0.63

   # Bind then unbind recovers the partner exactly.
   bound     = a.bind(b)
   recovered = bound.unbind(b)
   print("exact recovery:", np.array_equal(recovered.data, a.data))   # True

   # Unrelated vectors are near-orthogonal under cosine.
   print("sim(a, b):", round(float(a.similarity(b)), 4))      # ~= 0.0

The bundle-similarity scores hover near 0.63 because each of the two inputs
contributes half the superposition. They are not fixed, since ``ElementAddition``
randomizes coordinates whose summed value is an exact tie. The recovery check is
exact, and unrelated vectors sit near zero cosine, as expected for random
bipolar vectors of dimension 10,000.

----

Permute and inverse
-------------------

``permute`` is a cyclic shift along axis 0, a negative shift undoes a positive
one. ``inverse`` returns the binding inverse, which for this self-inverse scheme
is the vector itself. ``normalize`` and ``negative`` round out the unary set.

.. code-block:: python

   # permute(k) shifts along axis 0; permute(-k) restores.
   shifted  = a.permute(3)
   restored = shifted.permute(-3)
   print("shift changed data:", not np.array_equal(shifted.data, a.data))  # True
   print("inverse shift restored:", np.array_equal(restored.data, a.data)) # True

   # inverse() of a self-inverse binding returns the vector unchanged.
   print("inverse is identity:", np.array_equal(a.inverse().data, a.data)) # True

   # normalize() sends a bundle back to bipolar {-1, 0, +1}.
   norm = ab.normalize()
   print("normalized values:", set(np.unique(norm.data)))   # subset of {-1, 0, 1}

   # negative() is the element-wise additive inverse.
   print("negate:", np.array_equal(a.negative().data, -a.data))   # True

----

Operators
---------

The dunder operators dispatch straight through the encoding, so they raise or
succeed per the components you wired. For this encoding ``*``, ``/``, ``~``,
``>>``, and ``<<`` are all deterministic and match their method forms:

.. code-block:: python

   assert np.array_equal((a * b).data, a.bind(b).data)        # bind
   assert np.array_equal((bound / b).data, bound.unbind(b).data)  # unbind
   assert np.array_equal((~a).data, a.inverse().data)         # inverse
   assert np.array_equal((a >> 3).data, a.permute(3).data)    # permute +3
   assert np.array_equal((a << 3).data, a.permute(-3).data)   # permute -3

   # a + b also routes to bundle, but ElementAddition randomizes tie
   # coordinates, so a fresh draw differs run to run while staying similar
   # to both inputs.
   plus = a + b
   assert a.similarity(plus) > 0.5 and b.similarity(plus) > 0.5

The bundling operator ``+`` is the one to watch. It routes to ``bundle``, and
``ElementAddition`` redraws coordinates that sum to an exact tie, so ``a + b``
and ``a.bundle(b)`` produce different (but equally valid) vectors on separate
calls. The bind, unbind, inverse, and permute paths have no such randomness, so
their operator and method forms are byte-for-byte identical.

----

Forbidding an operation
-----------------------

To mark an operation as unsupported, leave its field unset. The default for
``inverse_fn``, ``normalize_fn``, and ``negative_fn`` is a function that raises
``NotImplementedError`` with a clear message. For example, dropping
``inverse_fn`` from the spec above makes ``inverse()`` raise an exception:

.. code-block:: python

   # With inverse_fn removed from the EncodingSpec:
   try:
       a.inverse()
   except NotImplementedError as e:
       print(e)   # This operation is not implemented for this encoding scheme.

This is exactly how the built-ins draw their support lines. ``MAP_C`` omits
``inverse_fn``, ``FHRR`` omits ``negative_fn``, ``BSC`` omits both
``normalize_fn`` and ``negative_fn``, and the four BSDC variants omit all three.
See :doc:`../user_manual/encodings_overview` for the full per-family support
table.

----

Writing a custom component function
-----------------------------------

So far you have composed *existing* components. When no built-in function does what you
need, write the component function yourself and wire it into the spec the same
way. A component is a plain function, not a class, and the spec just holds a
reference to it.

This section writes a custom bundling function and uses it to build ``MAP_S``:
``MAP_C`` with addition bundling swapped for subtraction. Subtraction is not a
meaningful superposition (the result tracks the first input and rejects the
rest), so ``MAP_S`` is not an encoding you would actually use. It is, however, the smallest
change that forces you to write a real component, which is the point.

**The bundling contract.** A bundling function takes the operands as ``*args``,
accepts a keyword-only ``axis``, and returns the folded array. Its first line
calls ``_normalize_bundling``, which turns the mixed inputs (loose ``(D,)``
vectors, a ``(D, N)`` batch, or a ``(D, N, M)`` tensor) into one dimension-first
``batch`` plus the ``reduce_axes`` to fold. Axis 0 is always the dimension ``D``
and is never reduced.

.. code-block:: python

   import numpy as np
   from pyhdc.components.input_formatting import _normalize_bundling

   try:
       import torch
   except ImportError:
       torch = None


   def ElementSubtraction(*hypervectors, axis=None):
       """Toy bundling: the first vector minus the sum of the rest, clipped to
       [-1, 1]. This is MAP_C's addition bundling with the sum swapped for
       subtraction. It has no HDC meaning and is an example only.
       """
       batch, is_torch, _, reduce_axes = _normalize_bundling(
           *hypervectors, axis=axis
       )
       if len(reduce_axes) != 1:
           raise ValueError("ElementSubtraction reduces a single batch axis")
       ax = reduce_axes[0]
       n = batch.shape[ax]

       if is_torch:
           first = batch.select(ax, 0)
           rest = batch.index_select(
               ax, torch.arange(1, n, device=batch.device)
           ).sum(dim=ax)
           return torch.clamp(first - rest, -1.0, 1.0).to(batch.dtype)
       else:      
           first = np.take(batch, 0, axis=ax)
           rest = np.take(batch, np.arange(1, n), axis=ax).sum(axis=ax)
           return np.clip(first - rest, -1.0, 1.0).astype(batch.dtype)

Four things make this a correct component, and they are the same four for every
operation family:

* **Signature.** A bundling function takes ``*hypervectors`` and a keyword-only
  ``axis=None``. The base class calls ``bundling_fn(*arrays, axis=axis)``.
* **Normalize first.** ``_normalize_bundling`` returns
  ``(batch, is_torch, reference_hv, reduce_axes)``. Do not index the raw inputs
  yourself, the normalizer is what lets one function accept loose vectors, a
  batch, or a higher-rank tensor without special-casing each shape.
* **Reduce over ``reduce_axes``, keep axis 0.** Fold only the batch axes the
  normalizer handed you, so the output is still a hypervector of dimension ``D``.
  This toy reduces a single axis (the additive bundlers accept a tuple); the
  ``is_torch`` flag tells you which backend's operations to call.
* **Return type.** Return the folded array, shape ``(D, *survivors)``. You may
  instead return ``(array, metadata_dict)``. The base class unpacks both forms
  and attaches the dict to the result's metadata. ``ElementAddition`` uses the
  tuple form to report its tie-randomization count.

Now wire it into a spec. ``MAP_S`` is ``MAP_C`` field for field, with
``bundling_fn`` pointing at the new function:

.. code-block:: python

   from pyhdc.encodings.base import Encoding
   from pyhdc.hypervector import EncodingSpec
   from pyhdc.components.elements import UniformBipolar
   from pyhdc.components.binding import ElementMultiplication
   from pyhdc.components.similarity import CosineSimilarity
   from pyhdc.components.thinning import NoThin
   from pyhdc.components.unary import Negate, SignNormalize


   class MAP_S(Encoding):
       """MAP_C with subtraction bundling. A teaching example, not a usable
       encoding."""

       def _get_encoding_spec(self) -> EncodingSpec:
           return EncodingSpec(
               dtype=np.float32,
               element_generator=UniformBipolar,
               similarity_fn=CosineSimilarity,
               bundling_fn=ElementSubtraction,     # the one swapped field
               thinning_fn=NoThin,
               binding_fn=ElementMultiplication,
               unbinding_fn=ElementMultiplication,
               generator_output_type="floats",
               normalize_fn=SignNormalize,
               negative_fn=Negate,
           )


   enc = MAP_S(dimension=10_000)
   a, b = enc.generate(), enc.generate()

   bundled = enc.bundle(a, b)            # calls ElementSubtraction
   print("bundle shape:", bundled.data.shape)            # (10000,)
   print("is clip(a - b):",
         np.array_equal(bundled.data,
                        np.clip(a.data - b.data, -1, 1).astype(np.float32)))  # True

   batch = enc.generate(size=(10_000, 4))
   print("batch bundle:", enc.bundle(batch).data.shape)  # (10000,)

Everything except bundling comes from the MAP_C component set, so binding,
unbinding, similarity, normalize, and negative behave exactly as they do for
``MAP_C``. Only ``bundle`` runs your code. ``MAP_C`` sets no ``inverse_fn``, so
``MAP_S`` inherits that gap too and ``inverse()`` raises an exception.

----

The contract for every operation family
----------------------------------------

A custom function for any other operation follows the same shape: call the
family's normalizer, branch on ``is_torch``, transform or reduce the right axis,
and return an array (optionally with a metadata dict). The signature and the
normalizer are what change between families.

.. list-table::
   :header-rows: 1
   :widths: 16 26 32 26

   * - Family
     - Signature
     - Normalize with
     - Returns
   * - Bundling
     - ``f(*hvs, axis=None)``
     - ``_normalize_bundling`` to ``(batch, is_torch, ref, reduce_axes)``
     - ``(D, *survivors)``, reduce ``reduce_axes``, keep axis 0
   * - Binding / unbinding
     - ``f(*hvs)``
     - ``_normalize_binding`` to ``(operands, is_torch, ref)``
     - same-shaped array, broadcast or loop (see below)
   * - Similarity
     - ``f(*hvs, axis=None)``
     - ``_normalize_similarity`` to ``(a, b, is_torch, scalar)``
     - reduce axis 0, ``sims.item() if scalar else sims``
   * - Unary
     - ``f(data)`` (``permute`` is ``f(data, shift=1)``)
     - none, you receive the raw ``(D, *batch)`` array
     - transformed array of the same shape

All the normalizers live in ``pyhdc.components.input_formatting``. A few rules
that are easy to miss:

* **Binding takes no ``axis``.** Binding combines operands position by position,
  so there is no batch axis to fold. After ``_normalize_binding`` you usually call
  ``_broadcast_operands`` (also in ``input_formatting``) so a ``(D,)`` key binds
  against every column of a ``(D, N)`` batch. A binder that cannot act per
  coordinate (a convolution, a matrix transform) calls ``_require_single_vector``
  to reject batched input, the ``Encoding`` layer then loops it per column.
* **Similarity returns a Python ``float`` only when ``scalar`` is true**, which
  happens only when both inputs were a single ``(D,)`` vector. Every batched call
  returns an array, so end with ``return sims.item() if scalar else sims``.
* **Unary functions receive the raw array, not ``*args``.** They act
  dimension-first along axis 0 and broadcast over any trailing batch axes. Pick
  the backend with a tensor check (the built-ins use ``torch.is_tensor(data)``).
  ``permute`` also takes a ``shift`` while ``inverse``, ``negative``, and ``normalize``
  take only the array.
* **Return shape is preserved** for binding, the unary ops, and (minus axis 0)
  similarity. Only bundling collapses a batch axis.

Wire any of these into the matching ``EncodingSpec`` field exactly as you wired
``bundling_fn`` above. For the per-family math and which families define each
unary operation, see :doc:`../user_manual/unary_operations`.

----

What you built
--------------

You implemented a complete custom encoding by subclassing
:class:`~pyhdc.Encoding` and returning one :class:`EncodingSpec`:

* Wired the seven required fields (``dtype``, ``element_generator``,
  ``similarity_fn``, ``bundling_fn``, ``thinning_fn``, ``binding_fn``,
  ``unbinding_fn``) by composing existing components.
* Wired the four 2.1.0 unary fields (``permute_fn``, ``inverse_fn``,
  ``normalize_fn``, ``negative_fn``) to give your encoding a full operation
  surface.
* Generated single ``(D,)`` vectors and dimension-first ``(D, N)`` batches.
* Bundled, bound, and unbound vectors, recovering a component exactly under
  self-inverse multiply binding.
* Computed cosine similarity, confirming superposition stays near each input
  and random pairs stay near orthogonal.
* Ran permute, inverse, normalize, and negative, and saw operators dispatch
  through the encoding, including the tie-randomized behavior of ``+``.
* Wrote a component function from scratch (``ElementSubtraction``), wired it into
  a new ``MAP_S`` encoding, and learned the signature-and-return contract that
  every custom bundling, binding, similarity, and unary function follows.

----

What's next
-----------

* :doc:`../user_manual/encodings_overview` : full encoding family comparison and
  per-family operation support
* :doc:`../user_manual/components_overview` : the component catalog you compose
  from
* :doc:`../user_manual/unary_operations` : the four unary operations and which
  families define each
* :doc:`../how_to/choose_encoding` : picking the right built-in before rolling
  your own
* :doc:`tutorial_6_custom_generators` : custom generators and reproducible
  generation