Tutorial 5: Implementing a Custom Encoding

Every encoding in PyHDC is a thin subclass of Encoding that returns one EncodingSpec. The spec wires together the component functions for generation, similarity, bundling, binding, unbinding, thinning, and the four unary operations (permute, inverse, normalize, negative). This tutorial builds a complete, working encoding from existing components, then generates, bundles, binds, unbinds, compares, permutes, and inverts vectors of your own encoding. The last sections go one level deeper: writing a component function from scratch and wiring it into a new encoding.

Prerequisites: Tutorial 1: Encoding Text for Classification


When to write a custom encoding

The built-in encodings cover the standard HDC families. You write your own subclass when you want to:

  • Swap one component in an existing scheme. For example, pair bipolar multiply-binding with a different bundling rule or similarity metric.

  • Reuse the operation surface without reimplementing it. The base class already handles batching, backends, broadcasting, and the dimension-first contract. You supply the per-element behavior and the base class does the rest.

  • Prototype a new encoding by composing components before committing to a full implementation.

Only one method needs an override: _get_encoding_spec. It is the single abstract method on Encoding. The constructor (dimension, backend, device, dtype, mask, generator, similarity_remap) is inherited, so a custom encoding gets the same call surface as the built-ins.


The EncodingSpec fields

EncodingSpec is a dataclass with seven required fields and six fields that carry defaults. The required fields wire the core operations. The defaulted fields cover the bit mask, the generator output contract, and the four unary operations.

| Field | Default | Meaning |
|---|---|---|
| dtype | required | Element data type (e.g. np.int32, np.float32). |
| element_generator | required | Callable that draws random element values for one vector. |
| similarity_fn | required | Similarity metric, reduced over axis 0. |
| bundling_fn | required | Bundling (superposition) rule. |
| thinning_fn | required | Thinning rule, or NoThin when the family does not thin. |
| binding_fn | required | Binding rule. |
| unbinding_fn | required | Unbinding rule (set to RaiseNotImplementedError to forbid it). |
| mask | None | Integer bit mask; used by MAP_I_Bits, ignored elsewhere. |
| generator_output_type | "floats" | "floats", "bits", or "words". The output a custom generator must supply. |
| permute_fn | None | Permutation. None falls back to the shared CyclicShift. |
| inverse_fn | raises | Binding inverse. Left unset, inverse() raises NotImplementedError. |
| normalize_fn | raises | Convert to entry distribution. Left unset, normalize() raises NotImplementedError. |
| negative_fn | raises | Additive (bundling) inverse. Left unset, negative() raises NotImplementedError. |

The defaulted unary fields are the part most people miss. inverse_fn, normalize_fn, and negative_fn default to a function that raises NotImplementedError. If you want inverse(), normalize(), or negative() to work on your encoding, you must wire them. Leaving a field unset is how the built-in families mark an operation as unsupported. MAP_C, for instance, sets no inverse_fn, so calling inverse() on a MAP_C vector raises NotImplementedError. permute_fn is different: None is a working default, because the shared CyclicShift is encoding-agnostic and every built-in uses it.
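
To make the default behavior concrete, here is a minimal stand-alone sketch of a spec-like dataclass. SpecSketch, _unsupported, _cyclic_shift, and resolve_permute are hypothetical names invented for this illustration. This is not PyHDC's EncodingSpec source, only one way the defaults described above could be modeled.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

import numpy as np


def _unsupported(*args: Any, **kwargs: Any) -> Any:
    """Hypothetical default for an unwired unary operation."""
    raise NotImplementedError(
        "This operation is not implemented for this encoding scheme."
    )


def _cyclic_shift(data: np.ndarray, shift: int = 1) -> np.ndarray:
    """Stand-in for a shared cyclic shift along axis 0."""
    return np.roll(data, shift, axis=0)


@dataclass
class SpecSketch:
    # One required field, standing in for the seven required ones.
    binding_fn: Callable[..., Any]
    # Defaulted fields, modeled on the behavior described above.
    permute_fn: Optional[Callable[..., Any]] = None
    inverse_fn: Callable[..., Any] = _unsupported

    def resolve_permute(self) -> Callable[..., Any]:
        # None means "fall back to the shared cyclic shift".
        if self.permute_fn is not None:
            return self.permute_fn
        return _cyclic_shift


spec = SpecSketch(binding_fn=np.multiply)
print(spec.resolve_permute()(np.array([1, 2, 3]), 1))   # [3 1 2]

try:
    spec.inverse_fn(np.array([1, 2, 3]))
except NotImplementedError as exc:
    print("inverse:", exc)
```

The point of the design is that "unset" and "unsupported" are the same state for the three raising fields, while an unset permute_fn still resolves to a working function.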


Building the encoding

The example below is a minimal bipolar Multiply-Add-Permute scheme, assembled from the same components the built-in MAP_I uses. Elements are drawn from {-1, +1}, binding is element-wise multiplication (which is its own inverse for bipolar values), bundling is element-wise addition, and similarity is cosine. The four unary fields are wired so that permute, inverse, normalize, and negative all work.

import numpy as np

from pyhdc.encodings.base import Encoding
from pyhdc.hypervector import EncodingSpec
from pyhdc.components.elements import BernoulliBipolar
from pyhdc.components.binding import ElementMultiplication
from pyhdc.components.bundling import ElementAddition
from pyhdc.components.similarity import CosineSimilarity
from pyhdc.components.thinning import NoThin
from pyhdc.components.unary import (
    CyclicShift,
    IdentityInverse,
    Negate,
    SignNormalize,
)


class MyMAP(Encoding):
    """A minimal bipolar Multiply-Add-Permute encoding.

    Elements are drawn from {-1, +1}. Binding is element-wise
    multiplication (its own inverse), bundling is element-wise addition,
    and similarity is cosine. The 2.1.0 unary fields wire permute,
    inverse, normalize, and negative.
    """

    def _get_encoding_spec(self) -> EncodingSpec:
        return EncodingSpec(
            dtype=np.int32,
            element_generator=BernoulliBipolar,
            similarity_fn=CosineSimilarity,
            bundling_fn=ElementAddition,
            thinning_fn=NoThin,
            binding_fn=ElementMultiplication,
            unbinding_fn=ElementMultiplication,
            generator_output_type="bits",
            permute_fn=CyclicShift,
            inverse_fn=IdentityInverse,
            normalize_fn=SignNormalize,
            negative_fn=Negate,
        )

A few notes on the choices:

  • element_generator=BernoulliBipolar draws each element from {-1, +1} with equal probability, so generator_output_type="bits" describes what a custom generator would have to supply.

  • binding_fn and unbinding_fn are both ElementMultiplication. Element-wise multiply by {-1, +1} is its own inverse, so unbinding is the same operation as binding.

  • inverse_fn=IdentityInverse matches that self-inverse property: the binding inverse of a bipolar vector is the vector itself.

  • normalize_fn=SignNormalize sends a bundled vector (which holds integer sums) back into {-1, 0, +1} by taking the sign of each element.

  • negative_fn=Negate is element-wise negation, the additive inverse used by bundling.

  • permute_fn=CyclicShift is set here to show the field; passing None would select the same shared CyclicShift automatically.
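
The properties these notes rely on can be checked with plain NumPy, independent of PyHDC. The sketch below uses ordinary array operations as stand-ins for the named components (element-wise multiply for ElementMultiplication, np.sign for SignNormalize, unary minus for Negate). It illustrates the math only and makes no claim about how the library implements them.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000

# Bipolar vectors: each element is -1 or +1 with equal probability.
values = np.array([-1, 1], dtype=np.int32)
a = rng.choice(values, size=D)
b = rng.choice(values, size=D)

# Multiply-binding is self-inverse: (a * b) * b == a because b * b == 1.
bound = a * b
assert np.array_equal(bound * b, a)

# So the binding inverse of a bipolar vector is the vector itself.
assert np.array_equal(a * a, np.ones(D, dtype=np.int32))

# Plain addition yields integer sums in {-2, 0, +2}. Taking the sign maps
# them into {-1, 0, +1}, with 0 wherever the two inputs disagree.
summed = a + b
normalized = np.sign(summed)
print(np.unique(summed).tolist())       # [-2, 0, 2]
print(np.unique(normalized).tolist())   # [-1, 0, 1]

# Negation is the additive inverse: a + (-a) is all zeros.
assert not np.any(a + (-a))
```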


Generating and inspecting vectors

Construct the encoding like any built-in and generate vectors. A single vector has shape (D,). A batch is dimension-first, so size=(D, N) returns an array of shape (D, N) in which each column is one hypervector.

enc = MyMAP(dimension=10_000)

a = enc.generate()
b = enc.generate()
print("single shape:", a.data.shape)         # (10000,)

batch = enc.generate(size=(10_000, 5))
print("batch shape: ", batch.data.shape)     # (10000, 5)

# Each element is bipolar.
print(np.unique(a.data).tolist())            # [-1, 1]

Bundle, bind, and unbind

Bundling superposes vectors, and the result stays similar to every input. Binding combines two vectors into one that is dissimilar to both, and unbinding recovers a component. Because element-wise multiply is exactly self-inverse for bipolar values, unbind returns the partner without approximation.

# Bundle: the superposition is similar to both inputs.
ab = enc.bundle(a, b)
print("sim(a, a+b):", round(float(a.similarity(ab)), 4))   # ~= 0.63
print("sim(b, a+b):", round(float(b.similarity(ab)), 4))   # ~= 0.63

# Bind then unbind recovers the partner exactly.
bound     = a.bind(b)
recovered = bound.unbind(b)
print("exact recovery:", np.array_equal(recovered.data, a.data))   # True

# Unrelated vectors are near-orthogonal under cosine.
print("sim(a, b):", round(float(a.similarity(b)), 4))      # ~= 0.0

The bundle-similarity scores hover near 0.63. Two random bipolar inputs agree on about half their coordinates, and the other half sum to an exact tie. ElementAddition randomizes those tie coordinates, so the scores are not fixed and vary slightly from run to run. The recovery check is exact, and unrelated vectors sit near zero cosine, as expected for random bipolar vectors of dimension 10,000.
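
A back-of-envelope check of the 0.63 figure, as a sketch under one assumption that the text implies but does not spell out: the bundle keeps the integer sum where the inputs agree (+2 or -2) and replaces each exact tie (sum 0) with a random +1 or -1. Under that assumption about half the coordinates agree, so the dot product of a with the bundle is about D, the squared norm of the bundle is about 4(D/2) + 1(D/2) = 2.5D, and the cosine is about 1/sqrt(2.5), roughly 0.632. The NumPy simulation below mimics that assumed rule; it is not PyHDC's ElementAddition.

```python
import numpy as np

rng = np.random.default_rng(42)
D = 10_000

a = rng.choice([-1, 1], size=D)
b = rng.choice([-1, 1], size=D)


def cosine(x: np.ndarray, y: np.ndarray) -> float:
    """Cosine similarity of two 1-D vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))


# Assumed tie rule: keep the sum, but redraw exact ties (sum == 0) as +/-1.
summed = a + b
ties = summed == 0
bundle = summed.copy()
bundle[ties] = rng.choice([-1, 1], size=int(ties.sum()))

print("tie fraction:         ", round(float(ties.mean()), 2))
print("sim(a, bundle):       ", round(cosine(a, bundle), 2))
print("predicted 1/sqrt(2.5):", round(1 / np.sqrt(2.5), 3))   # 0.632

# For contrast, leaving ties at 0 would give 1/sqrt(2), about 0.707.
print("sim(a, plain sum):    ", round(cosine(a, summed), 2))
```

If the simulated value lands near 0.63 rather than 0.71, that is consistent with tie coordinates being redrawn rather than left at zero.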


Permute and inverse

permute is a cyclic shift along axis 0, and a negative shift undoes a positive one. inverse returns the binding inverse, which for this self-inverse scheme is the vector itself. normalize and negative round out the unary set.

# permute(k) shifts along axis 0; permute(-k) restores.
shifted  = a.permute(3)
restored = shifted.permute(-3)
print("shift changed data:", not np.array_equal(shifted.data, a.data))  # True
print("inverse shift restored:", np.array_equal(restored.data, a.data)) # True

# inverse() of a self-inverse binding returns the vector unchanged.
print("inverse is identity:", np.array_equal(a.inverse().data, a.data)) # True

# normalize() takes the sign, sending a bundle back into {-1, 0, +1}.
norm = ab.normalize()
print("normalized values:", np.unique(norm.data).tolist())   # subset of [-1, 0, 1]

# negative() is the element-wise additive inverse.
print("negate:", np.array_equal(a.negative().data, -a.data))   # True

Operators

The dunder operators dispatch straight through the encoding, so they raise or succeed according to the components you wired. For this encoding, *, /, ~, >>, and << are all deterministic and match their method forms:

assert np.array_equal((a * b).data, a.bind(b).data)        # bind
assert np.array_equal((bound / b).data, bound.unbind(b).data)  # unbind
assert np.array_equal((~a).data, a.inverse().data)         # inverse
assert np.array_equal((a >> 3).data, a.permute(3).data)    # permute +3
assert np.array_equal((a << 3).data, a.permute(-3).data)   # permute -3

# a + b also routes to bundle, but ElementAddition randomizes tie
# coordinates, so a fresh draw differs run to run while staying similar
# to both inputs.
plus = a + b
assert a.similarity(plus) > 0.5 and b.similarity(plus) > 0.5

The bundling operator + is the one to watch. It routes to bundle, and ElementAddition redraws coordinates that sum to an exact tie, so a + b and a.bundle(b) produce different (but equally valid) vectors on separate calls. The bind, unbind, inverse, and permute paths have no such randomness, so their operator and method forms are byte-for-byte identical.
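
How operator forms can mirror method forms is easiest to see in a toy wrapper. The HV class below is hypothetical and written only for this illustration. It hard-codes bipolar NumPy operations instead of dispatching through an encoding, and it is not PyHDC's hypervector class. It shows the general Python pattern: each dunder method delegates to the named method, so both spellings run the same code.

```python
import numpy as np


class HV:
    """Toy bipolar hypervector; a stand-in, not the PyHDC class."""

    def __init__(self, data: np.ndarray) -> None:
        self.data = data

    # Named operations (bipolar multiply-binding, cyclic shift).
    def bind(self, other: "HV") -> "HV":
        return HV(self.data * other.data)

    def unbind(self, other: "HV") -> "HV":
        return HV(self.data * other.data)   # multiply is self-inverse

    def inverse(self) -> "HV":
        return HV(self.data.copy())

    def permute(self, shift: int = 1) -> "HV":
        return HV(np.roll(self.data, shift, axis=0))

    # Operators delegate to the named operations.
    def __mul__(self, other: "HV") -> "HV":
        return self.bind(other)

    def __truediv__(self, other: "HV") -> "HV":
        return self.unbind(other)

    def __invert__(self) -> "HV":
        return self.inverse()

    def __rshift__(self, shift: int) -> "HV":
        return self.permute(shift)

    def __lshift__(self, shift: int) -> "HV":
        return self.permute(-shift)


rng = np.random.default_rng(1)
a = HV(rng.choice([-1, 1], size=1_000))
b = HV(rng.choice([-1, 1], size=1_000))

assert np.array_equal((a * b).data, a.bind(b).data)    # same code path
assert np.array_equal(((a * b) / b).data, a.data)      # unbind recovers a
assert np.array_equal(((a >> 3) << 3).data, a.data)    # shifts cancel
assert np.array_equal((~a).data, a.data)               # identity inverse
```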


Forbidding an operation

To mark an operation as unsupported, leave its field unset. The default for inverse_fn, normalize_fn, and negative_fn is a function that raises NotImplementedError with a clear message. For example, dropping inverse_fn from the spec above makes inverse() raise an exception:

# With inverse_fn removed from the EncodingSpec:
try:
    a.inverse()
except NotImplementedError as e:
    print(e)   # This operation is not implemented for this encoding scheme.

This is exactly how the built-ins draw their support lines. MAP_C omits inverse_fn, FHRR omits negative_fn, BSC omits both normalize_fn and negative_fn, and the four BSDC variants omit all three. See Encodings Overview for the full per-family support table.


Writing a custom component function

So far you have composed existing components. When no built-in function does what you need, write the component function yourself and wire it into the spec the same way. A component is a plain function, not a class, and the spec just holds a reference to it.

This section writes a custom bundling function and uses it to build MAP_S: MAP_C with addition bundling swapped for subtraction. Subtraction is not a meaningful superposition (the result tracks the first input and rejects the rest), so MAP_S is not an encoding you would actually use. It is, however, the smallest change that forces you to write a real component, which is the point.

The bundling contract. A bundling function takes the operands as *args, accepts a keyword-only axis, and returns the folded array. Its first line calls _normalize_bundling, which turns the mixed inputs (loose (D,) vectors, a (D, N) batch, or a (D, N, M) tensor) into one dimension-first batch plus the reduce_axes to fold. Axis 0 is always the dimension D and is never reduced.

import numpy as np
from pyhdc.components.input_formatting import _normalize_bundling

try:
    import torch
except ImportError:
    torch = None


def ElementSubtraction(*hypervectors, axis=None):
    """Toy bundling: the first vector minus the sum of the rest, clipped to
    [-1, 1]. This is MAP_C's addition bundling with the sum swapped for
    subtraction. It has no HDC meaning and is an example only.
    """
    batch, is_torch, _, reduce_axes = _normalize_bundling(
        *hypervectors, axis=axis
    )
    if len(reduce_axes) != 1:
        raise ValueError("ElementSubtraction reduces a single batch axis")
    ax = reduce_axes[0]
    n = batch.shape[ax]

    if is_torch:
        first = batch.select(ax, 0)
        rest = batch.index_select(
            ax, torch.arange(1, n, device=batch.device)
        ).sum(dim=ax)
        return torch.clamp(first - rest, -1.0, 1.0).to(batch.dtype)
    else:
        first = np.take(batch, 0, axis=ax)
        rest = np.take(batch, np.arange(1, n), axis=ax).sum(axis=ax)
        return np.clip(first - rest, -1.0, 1.0).astype(batch.dtype)

Four things make this a correct component, and they are the same four for every operation family:

  • Signature. A bundling function takes *hypervectors and a keyword-only axis=None. The base class calls bundling_fn(*arrays, axis=axis).

  • Normalize first. _normalize_bundling returns (batch, is_torch, reference_hv, reduce_axes). Do not index the raw inputs yourself. The normalizer is what lets one function accept loose vectors, a batch, or a higher-rank tensor without special-casing each shape.

  • Reduce over reduce_axes, keep axis 0. Fold only the batch axes the normalizer handed you, so the output is still a hypervector of dimension D. This toy reduces a single axis (the additive bundlers accept a tuple). The is_torch flag tells you which backend’s operations to call.

  • Return type. Return the folded array, shape (D, *survivors). You may instead return (array, metadata_dict). The base class unpacks both forms and attaches the dict to the result’s metadata. ElementAddition uses the tuple form to report its tie-randomization count.
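
The four points can be seen together in a NumPy-only sketch. Both functions below are hypothetical: toy_normalize is a much-simplified stand-in for _normalize_bundling (NumPy only, it returns fewer values, and its treatment of axis=None as "fold every batch axis" is an assumption), and SumWithTieCount is an illustrative bundler that is not part of PyHDC. The sketch shows the shape of the contract, including the optional (array, metadata_dict) return form.

```python
import numpy as np


def toy_normalize(*hypervectors, axis=None):
    """Hypothetical, simplified stand-in for a bundling normalizer.

    Returns (batch, reduce_axes), where batch is dimension-first.
    """
    if len(hypervectors) == 1 and hypervectors[0].ndim >= 2:
        batch = hypervectors[0]                     # already (D, N, ...)
    else:
        batch = np.stack(hypervectors, axis=1)      # loose (D,) -> (D, N)
    if axis is None:
        reduce_axes = tuple(range(1, batch.ndim))   # assumed: fold all batch axes
    else:
        reduce_axes = (axis,) if isinstance(axis, int) else tuple(axis)
    if 0 in reduce_axes:
        raise ValueError("axis 0 is the dimension D and is never reduced")
    return batch, reduce_axes


def SumWithTieCount(*hypervectors, axis=None):
    """Illustrative bundler: sum over the batch axes and count exact ties."""
    batch, reduce_axes = toy_normalize(*hypervectors, axis=axis)
    summed = batch.sum(axis=reduce_axes)            # axis 0 survives
    metadata = {"ties": int(np.count_nonzero(summed == 0))}
    return summed, metadata                         # tuple return form


rng = np.random.default_rng(0)
D = 1_000
a, b = rng.choice([-1, 1], size=(2, D))

loose, meta = SumWithTieCount(a, b)                     # two loose vectors
batched, _ = SumWithTieCount(np.stack([a, b], axis=1))  # one (D, 2) batch
print(loose.shape, batched.shape)                       # (1000,) (1000,)
assert np.array_equal(loose, batched)
assert np.array_equal(loose, a + b)
```

Because the normalizer hides the input shape, the bundler body is the same three lines whether it is called with loose vectors or with a batch.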

Now wire it into a spec. MAP_S is MAP_C field for field, with bundling_fn pointing at the new function:

import numpy as np

# ElementSubtraction is the function defined in the previous listing.
from pyhdc.encodings.base import Encoding
from pyhdc.hypervector import EncodingSpec
from pyhdc.components.elements import UniformBipolar
from pyhdc.components.binding import ElementMultiplication
from pyhdc.components.similarity import CosineSimilarity
from pyhdc.components.thinning import NoThin
from pyhdc.components.unary import Negate, SignNormalize


class MAP_S(Encoding):
    """MAP_C with subtraction bundling. A teaching example, not a usable
    encoding."""

    def _get_encoding_spec(self) -> EncodingSpec:
        return EncodingSpec(
            dtype=np.float32,
            element_generator=UniformBipolar,
            similarity_fn=CosineSimilarity,
            bundling_fn=ElementSubtraction,     # the one swapped field
            thinning_fn=NoThin,
            binding_fn=ElementMultiplication,
            unbinding_fn=ElementMultiplication,
            generator_output_type="floats",
            normalize_fn=SignNormalize,
            negative_fn=Negate,
        )


enc = MAP_S(dimension=10_000)
a, b = enc.generate(), enc.generate()

bundled = enc.bundle(a, b)            # calls ElementSubtraction
print("bundle shape:", bundled.data.shape)            # (10000,)
print("is clip(a - b):",
      np.array_equal(bundled.data,
                     np.clip(a.data - b.data, -1, 1).astype(np.float32)))  # True

batch = enc.generate(size=(10_000, 4))
print("batch bundle:", enc.bundle(batch).data.shape)  # (10000,)

Everything except bundling comes from the MAP_C component set, so binding, unbinding, similarity, normalize, and negative behave exactly as they do for MAP_C. Only bundle runs your code. MAP_C sets no inverse_fn, and this spec leaves it unset too, so inverse() on a MAP_S vector raises NotImplementedError.
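
The earlier claim that subtraction "tracks the first input and rejects the rest" can be checked without PyHDC. The sketch below assumes elements drawn uniformly from [-1, 1], which is an assumption about what UniformBipolar produces and not something this tutorial confirms. It applies the same clip(a - b) rule in plain NumPy.

```python
import numpy as np

rng = np.random.default_rng(7)
D = 10_000

# Assumed element distribution: uniform floats in [-1, 1].
a = rng.uniform(-1.0, 1.0, size=D).astype(np.float32)
b = rng.uniform(-1.0, 1.0, size=D).astype(np.float32)


def cosine(x: np.ndarray, y: np.ndarray) -> float:
    """Cosine similarity of two 1-D vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))


bundled = np.clip(a - b, -1.0, 1.0)

# A real superposition would be similar to both inputs. This one is
# positively correlated with a and negatively correlated with b.
print("sim(a, bundled):", round(cosine(a, bundled), 2))
print("sim(b, bundled):", round(cosine(b, bundled), 2))
assert cosine(a, bundled) > 0.5
assert cosine(b, bundled) < -0.5
```

The two scores are mirror images of each other, which is why the result cannot serve as a set-like memory of both inputs.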


The contract for every operation family

A custom function for any other operation follows the same shape: call the family’s normalizer, branch on is_torch, transform or reduce the right axis, and return an array (optionally with a metadata dict). The signature and the normalizer are what change between families.

| Family | Signature | Normalize with | Returns |
|---|---|---|---|
| Bundling | f(*hvs, axis=None) | _normalize_bundling, which returns (batch, is_torch, ref, reduce_axes) | (D, *survivors); reduce reduce_axes, keep axis 0 |
| Binding / unbinding | f(*hvs) | _normalize_binding, which returns (operands, is_torch, ref) | same-shaped array; broadcast or loop (see below) |
| Similarity | f(*hvs, axis=None) | _normalize_similarity, which returns (a, b, is_torch, scalar) | reduce axis 0; sims.item() if scalar else sims |
| Unary | f(data) (permute is f(data, shift=1)) | none; you receive the raw (D, *batch) array | transformed array of the same shape |

All the normalizers live in pyhdc.components.input_formatting. A few rules that are easy to miss:

  • Binding takes no axis. Binding combines operands position by position, so there is no batch axis to fold. After _normalize_binding you usually call _broadcast_operands (also in input_formatting) so a (D,) key binds against every column of a (D, N) batch. A binder that cannot act per coordinate (a convolution, a matrix transform) calls _require_single_vector to reject batched input, and the Encoding layer then loops it per column.

  • Similarity returns a Python float only when scalar is true, which happens only when both inputs are single (D,) vectors. Every batched call returns an array, so end with return sims.item() if scalar else sims.

  • Unary functions receive the raw array, not *args. They act dimension-first along axis 0 and broadcast over any trailing batch axes. Pick the backend with a tensor check (the built-ins use torch.is_tensor(data)). permute also takes a shift, while inverse, negative, and normalize take only the array.

  • Return shape is preserved for binding, the unary ops, and (minus axis 0) similarity. Only bundling collapses a batch axis.
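
The shape rules above can be exercised in plain NumPy. The three functions below are hypothetical stand-ins written for this sketch. They skip the input_formatting normalizers and handle NumPy arrays only, so read them as an illustration of the shape contract and not as PyHDC components.

```python
import numpy as np


def bind_multiply(key: np.ndarray, batch: np.ndarray) -> np.ndarray:
    """Bind a (D,) key against every column of a (D, N) batch."""
    if key.ndim == 1 and batch.ndim == 2:
        key = key[:, None]                  # (D,) -> (D, 1), broadcasts over N
    return key * batch                      # same shape as batch


def cosine_similarity(a: np.ndarray, b: np.ndarray):
    """Reduce over axis 0; float for two (D,) inputs, array otherwise."""
    scalar = a.ndim == 1 and b.ndim == 1
    if a.ndim == 1 and b.ndim > 1:
        a = a.reshape((-1,) + (1,) * (b.ndim - 1))
    if b.ndim == 1 and a.ndim > 1:
        b = b.reshape((-1,) + (1,) * (a.ndim - 1))
    dots = (a * b).sum(axis=0)
    norms = np.linalg.norm(a, axis=0) * np.linalg.norm(b, axis=0)
    sims = np.asarray(dots / norms)
    return sims.item() if scalar else sims


def cyclic_shift(data: np.ndarray, shift: int = 1) -> np.ndarray:
    """Unary op: shift along axis 0, independently for each batch column."""
    return np.roll(data, shift, axis=0)


rng = np.random.default_rng(3)
D, N = 2_000, 4
key = rng.choice([-1, 1], size=D)
batch = rng.choice([-1, 1], size=(D, N))

bound = bind_multiply(key, batch)
print(bound.shape)                                   # (2000, 4)
print(type(cosine_similarity(key, key)).__name__)    # float
print(cosine_similarity(key, batch).shape)           # (4,)
print(cyclic_shift(batch, 3).shape)                  # (2000, 4)

# Bipolar multiply is self-inverse, so binding again recovers the batch.
assert np.array_equal(bind_multiply(key, bound), batch)
```

Only the similarity call changes rank, by dropping axis 0. Binding and the unary shift return arrays with the shape they were given.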

Wire any of these into the matching EncodingSpec field exactly as you wired bundling_fn above. For the per-family math and which families define each unary operation, see Unary Operations.


What you built

You implemented a complete custom encoding by subclassing Encoding and returning one EncodingSpec:

  • Wired the seven required fields (dtype, element_generator, similarity_fn, bundling_fn, thinning_fn, binding_fn, unbinding_fn) by composing existing components.

  • Wired the four 2.1.0 unary fields (permute_fn, inverse_fn, normalize_fn, negative_fn) to give your encoding a full operation surface.

  • Generated single (D,) vectors and dimension-first (D, N) batches.

  • Bundled, bound, and unbound vectors, recovering a component exactly under self-inverse multiply binding.

  • Computed cosine similarity, confirming superposition stays near each input and random pairs stay near orthogonal.

  • Ran permute, inverse, normalize, and negative, and saw operators dispatch through the encoding, including the tie-randomized behavior of +.

  • Wrote a component function from scratch (ElementSubtraction), wired it into a new MAP_S encoding, and learned the signature-and-return contract that every custom bundling, binding, similarity, and unary function follows.


What’s next