How to Encode Data into Hypervectors

An encoding fixes the algebra (bundle, bind, similarity). A data encoder turns raw values into hypervectors. PyHDC provides two families:

  • Codebook encoders (Level, Circular, Thermometer, Empty, Identity, Random) hold a precomputed (D, L) basis. A value picks the nearest level and the encoder returns that column.

  • Functional encoders (Projection, Sinusoid, Density, FractionalPower) transform a feature vector.

Every encoder wraps one Encoding instance and is dimension-first: a scalar encodes to (D,), a batch of B values to (D, B). encoder.encode(value) and encoder(value) are the same call. For the concepts behind the two families, see Data Encoders.
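The codebook lookup and the dimension-first shape convention can be sketched in plain NumPy. This is an illustration under stated assumptions, not PyHDC's implementation: the random basis, the lookup helper, and the rounding rule are all hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
D, L = 10_000, 100            # hypervector dimension, number of levels
low, high = 0.0, 1.0

# Stand-in (D, L) codebook: one bipolar column per level (assumed layout).
basis = rng.choice(np.array([-1, 1], dtype=np.int8), size=(D, L))

def lookup(values):
    """Map each value to its nearest level and return the matching column(s)."""
    v = np.clip(np.asarray(values, dtype=float), low, high)   # clamp to [low, high]
    idx = np.rint((v - low) / (high - low) * (L - 1)).astype(int)
    return basis[:, idx]      # scalar index -> (D,), 1-D index -> (D, B)

print(lookup(0.5).shape)               # (10000,)
print(lookup([0.0, 0.5, 1.0]).shape)   # (10000, 3)
```

Indexing columns keeps the hypervector dimension on the first axis, so a batch falls out of the same expression as a scalar.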

Encode a scalar with Level

Level spaces its hypervectors, one per level, so that nearby values map to correlated codes and distant values to near-orthogonal ones:

import pyhdc

enc   = pyhdc.MAP_I(dimension=10_000)
level = pyhdc.Level(enc, levels=100, low=0.0, high=1.0)

hv = level.encode(0.5)   # one (10000,) hypervector
print(hv.shape)          # (10000,)

Similarity to a fixed reference hypervector (here the code for 0.0) falls monotonically as the encoded value moves away from it:

zero = level.encode(0.0)
print(enc.similarity(zero, level.encode(0.1)))   # ~= 0.90
print(enc.similarity(zero, level.encode(0.5)))   # ~= 0.51
print(enc.similarity(zero, level.encode(1.0)))   # ~= 0.02

Values outside [low, high] clamp to the nearest endpoint.
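One common way to get this correlated-to-orthogonal behavior is to start from a random bipolar vector and flip a growing share of its components at each level, reaching half of them at the top level. The NumPy sketch below assumes that construction; level_codebook and cosine are hypothetical helpers, and PyHDC may build its levels differently, so the similarities will not match the figures above exactly.

```python
import numpy as np

def level_codebook(dim, levels, rng):
    """Return a (dim, levels) bipolar codebook built by progressive flipping."""
    base = rng.choice(np.array([-1, 1], dtype=np.int8), size=dim)
    order = rng.permutation(dim)                      # fixed order in which components flip
    flips = np.rint(np.linspace(0, dim // 2, levels)).astype(int)
    book = np.tile(base[:, None], (1, levels))        # every column starts as the base vector
    for k in range(levels):
        book[order[:flips[k]], k] *= -1               # level k flips the first flips[k] components
    return book

def cosine(a, b):
    """Cosine similarity; cast to float first so int8 products cannot overflow."""
    a = a.astype(float)
    b = b.astype(float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

book = level_codebook(10_000, 100, np.random.default_rng(0))
# Similarity to level 0 drops in near-equal steps, from 1.0 down to 0.0 at the top level.
for k in (0, 10, 50, 99):
    print(k, round(cosine(book[:, 0], book[:, k]), 3))
```

Because each level flips a superset of the components flipped by the level before it, similarity to any fixed level can only fall as the index moves away.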

Encode many values at once

Pass a list (or any 1-D array) to encode a whole batch in one call. Each value becomes one column:

batch = level.encode([0.0, 0.25, 0.5, 0.75, 1.0])
print(batch.shape)   # (10000, 5)

Periodic values with Circular

Circular wraps the level index modulo levels, so the top of the range rejoins the bottom. Use it for angles, hours, days, or any cyclic quantity:

hours = pyhdc.Circular(enc, levels=24, low=0.0, high=24.0)

print(enc.similarity(hours.encode(23.0), hours.encode(0.0)))   # ~= 0.9  (adjacent across the wrap)
print(enc.similarity(hours.encode(0.0),  hours.encode(12.0)))  # ~= 0.0  (opposite on the ring)
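The wrap-around can be illustrated with a ring codebook in NumPy. The sketch assumes one possible construction: walk around the ring flipping a small disjoint set of components at each step for the first half, then apply the same sets again on the second half so the flips cancel and the ring closes. circular_codebook, encode_index, and sim are hypothetical names, levels is assumed even, and this is not necessarily how PyHDC builds Circular.

```python
import numpy as np

def circular_codebook(dim, levels, rng):
    """Return a (dim, levels) bipolar codebook whose similarity depends on ring distance."""
    half = levels // 2                        # assumes an even number of levels
    step = dim // levels                      # components flipped per ring step
    order = rng.permutation(dim)
    flip_sets = [order[k * step:(k + 1) * step] for k in range(half)]
    current = rng.choice(np.array([-1, 1], dtype=np.int8), size=dim)
    book = np.empty((dim, levels), dtype=np.int8)
    book[:, 0] = current
    for i in range(1, levels):
        current = current.copy()
        current[flip_sets[(i - 1) % half]] *= -1   # second half re-flips, undoing the first half
        book[:, i] = current
    return book

def encode_index(value, levels, low, high):
    """Level index wrapped modulo levels, so high lands back on low."""
    return int(np.floor((value - low) / (high - low) * levels)) % levels

def sim(a, b):
    return float(a.astype(float) @ b.astype(float)) / len(a)

ring = circular_codebook(10_000, 24, np.random.default_rng(0))
h = lambda x: ring[:, encode_index(x, 24, 0.0, 24.0)]
print(round(sim(h(23.0), h(0.0)), 3))    # adjacent across the wrap: high
print(round(sim(h(0.0), h(12.0)), 3))    # opposite on the ring: near zero
```

In this construction two codes differ by exactly as many flip sets as there are steps between them on the shorter arc, which is what makes hour 23 as close to hour 0 as hour 1 is.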

Discrete encoders: Thermometer and Density

Thermometer (a cumulative code) and Density (a population code) are defined for the discrete families (MAP_I, MAP_B, BSC, and the BSDC family). Build them on a discrete encoding:

binary = pyhdc.MAP_B(dimension=10_000)

therm = pyhdc.Thermometer(binary, levels=20, low=0.0, high=1.0)
dens  = pyhdc.Density(binary, low=0.0, high=1.0)
print(therm.encode(0.5).shape)   # (10000,)
print(dens.encode(0.5).shape)    # (10000,)

Constructing either on a continuous or phase family (MAP_C, the HRR family, VTB, MBAT, FHRR) raises NotImplementedError, because those domains have no two endpoint elements to interpolate between.
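A cumulative code interpolates between two endpoint elements, which is why it needs a discrete domain. A minimal sketch, assuming a {0, 1} alphabet and a hypothetical thermometer helper (PyHDC's own element values and level placement may differ):

```python
import numpy as np

def thermometer(value, dim=10_000, levels=20, low=0.0, high=1.0):
    """Cumulative code: the higher the value, the longer the leading run of ones."""
    v = min(max(value, low), high)                        # clamp to [low, high]
    k = round((v - low) / (high - low) * (levels - 1))    # nearest level index
    ones = round(k / (levels - 1) * dim)                  # how many leading components are set
    code = np.zeros(dim, dtype=np.uint8)                  # endpoint element 0 everywhere...
    code[:ones] = 1                                       # ...switched to endpoint element 1
    return code

lo_code, mid_code, hi_code = thermometer(0.0), thermometer(0.5), thermometer(1.0)
print(int(lo_code.sum()), int(hi_code.sum()))   # 0 10000
# Hamming distance grows with the gap between the encoded values.
print(int((lo_code != mid_code).sum()), int((lo_code != hi_code).sum()))
```

A real-valued or phase domain has no such pair of opposite elements to switch between, so the same construction does not carry over.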

Encode a feature vector with Projection

Projection applies a fixed random linear map to a length-F feature vector, then the encoding’s normalize step. It accepts a single vector or a batch:

import numpy as np

proj = pyhdc.Projection(enc, features=8)
print(proj.encode(np.random.rand(8)).shape)      # (10000,)
print(proj.encode(np.random.rand(8, 5)).shape)   # (10000, 5)

Projection needs a family with a normalize step (MAP, HRR, VTB, MBAT, FHRR) and raises NotImplementedError on BSC and BSDC. Sinusoid is the related random-Fourier-feature map for the cosine and HRR families.
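The random linear map followed by a normalize step can be sketched in NumPy. The Gaussian matrix and the sign function used as the normalize step are assumptions for a bipolar family; project is a hypothetical helper, not PyHDC's Projection.

```python
import numpy as np

rng = np.random.default_rng(0)
D, F = 10_000, 8
W = rng.standard_normal((D, F))     # fixed random map, drawn once and reused

def project(x):
    """Map a length-F vector (F,) or a batch (F, B) to bipolar hypervectors."""
    z = W @ np.asarray(x, dtype=float)              # (D,) or (D, B)
    return np.where(z >= 0, 1, -1).astype(np.int8)  # sign as the assumed normalize step

x = rng.random(F)
near = x + 0.01 * rng.standard_normal(F)   # small perturbation of x
far = rng.standard_normal(F)               # unrelated input

def sim(a, b):
    return float(a.astype(float) @ b.astype(float)) / len(a)

print(project(x).shape)                    # (10000,)
print(project(rng.random((F, 5))).shape)   # (10000, 5)
print(sim(project(x), project(near)) > sim(project(x), project(far)))   # True
```

Because the map is fixed, inputs that point in nearly the same direction land on nearly the same side of most random hyperplanes, so similar feature vectors yield similar hypervectors.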

Empty, Random, and Identity round out the set as structural codebooks: all-zero, independent-random, and the binding-identity element respectively. They take the same (encoding, levels, low, high) constructor as Level.

Putting it together: a value-feature classifier

Encoders compose with the core operations. The pattern below binds each feature value to a per-feature key, bundles the bound pairs into one record hypervector, bundles records into a class prototype, and classifies a whole test batch in a single cross-similarity call:

import numpy as np
import pyhdc

np.random.seed(0)
enc = pyhdc.MAP_I(dimension=10_000)

F     = 4                                    # features per record
keys  = [enc.generate() for _ in range(F)]   # one key hypervector per feature
value = pyhdc.Level(enc, levels=64, low=0.0, high=1.0)

def encode_record(row):
    pairs = [pyhdc.bind(keys[i], value.encode(float(row[i]))) for i in range(F)]
    return pyhdc.bundle(*pairs)

rng    = np.random.default_rng(0)
class0 = rng.normal(0.3, 0.05, size=(3, F))   # a low-valued class
class1 = rng.normal(0.7, 0.05, size=(3, F))   # a high-valued class
protos = pyhdc.stack([
    pyhdc.bundle(*[encode_record(r) for r in class0]),
    pyhdc.bundle(*[encode_record(r) for r in class1]),
])                                            # (10000, 2)

test   = np.vstack([rng.normal(0.3, 0.05, size=(3, F)),
                    rng.normal(0.7, 0.05, size=(3, F))])
testhv = pyhdc.stack([encode_record(r) for r in test])   # (10000, 6)

scores = protos.similarity(testhv, mode="cross")         # (2, 6)
print(np.asarray(scores).argmax(axis=0))                 # [0 0 0 1 1 1]

The first three test records score highest against prototype 0 and the last three against prototype 1, which is the correct split.
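The cross-similarity step compares every prototype column with every test column at once. As a rough NumPy picture, assuming cosine similarity and using a hypothetical cross_similarity helper with synthetic noisy copies of the prototypes standing in for encoded records:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000
protos = rng.choice(np.array([-1.0, 1.0]), size=(D, 2))   # two class prototypes, (D, 2)

# Six synthetic test hypervectors: copies of a prototype with 20% of components flipped.
labels = np.array([0, 0, 0, 1, 1, 1])
tests = protos[:, labels].copy()                          # (D, 6)
noise = rng.random(tests.shape) < 0.2
tests[noise] *= -1

def cross_similarity(a, b):
    """Cosine similarity of every column of a (D, P) with every column of b (D, B)."""
    a = a / np.linalg.norm(a, axis=0, keepdims=True)
    b = b / np.linalg.norm(b, axis=0, keepdims=True)
    return a.T @ b                                        # (P, B)

scores = cross_similarity(protos, tests)
print(scores.shape)            # (2, 6)
print(scores.argmax(axis=0))   # [0 0 0 1 1 1]
```

Taking argmax along axis 0 picks, for each test column, the prototype row with the highest score.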

See also