How to Encode Data into Hypervectors
An encoding fixes the algebra (bundle, bind, similarity). A data encoder turns raw values into hypervectors. PyHDC provides two families:
Codebook encoders (
Level,Circular,Thermometer,Empty,Identity,Random) hold a precomputed(D, L)basis. A value picks the nearest level and the encoder returns that column.Functional encoders (
Projection,Sinusoid,Density,FractionalPower) transform a feature vector.
Every encoder wraps one Encoding instance and is dimension-first:
a scalar encodes to (D,), a batch of B values to (D, B).
encoder.encode(value) and encoder(value) are the same call. For the
concepts behind the two families, see Data Encoders.
Encode a scalar with Level
Level spaces levels hypervectors so that nearby values map to
correlated codes and distant values to near-orthogonal ones:
import pyhdc
enc = pyhdc.MAP_I(dimension=10_000)
level = pyhdc.Level(enc, levels=100, low=0.0, high=1.0)
hv = level.encode(0.5) # one (10000,) hypervector
print(hv.shape) # (10000,)
Similarity to a fixed hypervector falls monotonically as the value moves away:
zero = level.encode(0.0)
print(enc.similarity(zero, level.encode(0.1))) # ~= 0.90
print(enc.similarity(zero, level.encode(0.5))) # ~= 0.51
print(enc.similarity(zero, level.encode(1.0))) # ~= 0.02
Values outside [low, high] clamp to the nearest endpoint.
Encode many values at once
Pass a list (or any 1-D array) to encode a whole batch in one call. Each value becomes one column:
batch = level.encode([0.0, 0.25, 0.5, 0.75, 1.0])
print(batch.shape) # (10000, 5)
Periodic values with Circular
Circular wraps the level index modulo levels, so the top of
the range rejoins the bottom. Use it for angles, hours, days, or any cyclic
quantity:
hours = pyhdc.Circular(enc, levels=24, low=0.0, high=24.0)
print(enc.similarity(hours.encode(23.0), hours.encode(0.0))) # ~= 0.9 (adjacent across the wrap)
print(enc.similarity(hours.encode(0.0), hours.encode(12.0))) # ~= 0.0 (opposite on the ring)
Discrete encoders: Thermometer and Density
Thermometer (a cumulative code) and Density (a
population code) are defined for the discrete families (MAP_I, MAP_B, BSC, and the
BSDC family). Build them on a discrete encoding:
binary = pyhdc.MAP_B(dimension=10_000)
therm = pyhdc.Thermometer(binary, levels=20, low=0.0, high=1.0)
dens = pyhdc.Density(binary, low=0.0, high=1.0)
print(therm.encode(0.5).shape) # (10000,)
print(dens.encode(0.5).shape) # (10000,)
Constructing either on a continuous or phase family (MAP_C, the HRR family, VTB,
MBAT, FHRR) raises NotImplementedError, because those domains have no two
endpoint elements to interpolate between.
Encode a feature vector with Projection
Projection applies a fixed random linear map to a length-F
feature vector, then the encoding’s normalize step. It accepts a single vector or
a batch:
import numpy as np
proj = pyhdc.Projection(enc, features=8)
print(proj.encode(np.random.rand(8)).shape) # (10000,)
print(proj.encode(np.random.rand(8, 5)).shape) # (10000, 5)
Sinusoid is the related random-Fourier-feature map for the cosine
and HRR families. Projection needs a family with a normalize step
(MAP, HRR, VTB, MBAT, FHRR) and raises NotImplementedError on BSC and BSDC.
Empty, Random, and Identity round out the set as structural codebooks:
all-zero, independent-random, and the binding-identity element respectively. They
take the same (encoding, levels, low, high) constructor as Level.
Putting it together: a value-feature classifier
Encoders compose with the core operations. The pattern below binds each feature value to a per-feature key, bundles the bound pairs into one record hypervector, bundles records into a class prototype, and classifies a whole test batch in a single cross-similarity call:
import numpy as np
import pyhdc
np.random.seed(0)
enc = pyhdc.MAP_I(dimension=10_000)
F = 4 # features per record
keys = [enc.generate() for _ in range(F)] # one key hypervector per feature
value = pyhdc.Level(enc, levels=64, low=0.0, high=1.0)
def encode_record(row):
pairs = [pyhdc.bind(keys[i], value.encode(float(row[i]))) for i in range(F)]
return pyhdc.bundle(*pairs)
rng = np.random.default_rng(0)
class0 = rng.normal(0.3, 0.05, size=(3, F)) # a low-valued class
class1 = rng.normal(0.7, 0.05, size=(3, F)) # a high-valued class
protos = pyhdc.stack([
pyhdc.bundle(*[encode_record(r) for r in class0]),
pyhdc.bundle(*[encode_record(r) for r in class1]),
]) # (10000, 2)
test = np.vstack([rng.normal(0.3, 0.05, size=(3, F)),
rng.normal(0.7, 0.05, size=(3, F))])
testhv = pyhdc.stack([encode_record(r) for r in test]) # (10000, 6)
scores = protos.similarity(testhv, mode="cross") # (2, 6)
print(np.asarray(scores).argmax(axis=0)) # [0 0 0 1 1 1]
The first three test records score highest against prototype 0 and the last three against prototype 1, which is the correct split.
See also
How to Choose the Right Encoding : which encoding family to build the encoder on, and the per-encoder family constraints.
How to Compute Similarity : pairwise and cross similarity, including the
mode="cross"matrix used above.Data Encoders : the encoder object model and the family-aware basis builders.