Tutorial 4: (Sparse) Binary Encodings (BSC and BSDC)
Binary encodings represent hypervectors as arrays of 0s and 1s (or -1s and +1s). Sparse variants keep most elements at 0. This tutorial covers the difference between dense binary (BSC) and sparse binary (BSDC), the density saturation problem with OR bundling and how BSDC_THIN solves it, and sequence encoding with circular shifts.
Prerequisites: Tutorial 1: Encoding Text for Classification
Dense binary: BSC
BSC (Binary Spatter Code) uses dense binary vectors where
each element is drawn from a Bernoulli distribution with p = 0.5: on average
half the elements are 1.
Binding is XOR (self-inverse), and similarity is Hamming distance remapped to [-1, 1]:
import pyhdc
import numpy as np
enc = pyhdc.BSC(dimension=10_000)
a = enc.generate()
b = enc.generate()
print(a.data.mean())    # ~= 0.5 (dense)
print(a.similarity(b))  # ~= 0.0 (unrelated)
print(a.similarity(a))  # 1.0 (identical)
# XOR binding is exactly self-inverse
bound = a.bind(b)
recovered = bound.unbind(b)  # identical to a, not approximate
print(np.array_equal(recovered.data, a.data))  # True (exact, element-wise equality)
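The two operations described above can also be reproduced in plain NumPy, which may help when checking results by hand. The following is a minimal sketch, not PyHDC's implementation: the helper names `random_dense` and `hamming_similarity` are made up for illustration, and the remapping is assumed to be the linear map `1 - 2 * distance` from normalized Hamming distance onto [-1, 1].

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000


def random_dense(rng, d):
    """Draw a dense binary vector with each bit ~ Bernoulli(0.5)."""
    return rng.integers(0, 2, size=d, dtype=np.uint8)


def hamming_similarity(x, y):
    """Map normalized Hamming distance in [0, 1] linearly onto [-1, 1]."""
    distance = np.count_nonzero(x != y) / x.size
    return 1.0 - 2.0 * distance


a = random_dense(rng, D)
b = random_dense(rng, D)

bound = np.bitwise_xor(a, b)          # bind
recovered = np.bitwise_xor(bound, b)  # unbind: XOR with the same key again

print(np.array_equal(recovered, a))   # exact recovery, bit for bit
print(hamming_similarity(a, a))       # identical vectors
print(hamming_similarity(a, 1 - a))   # every bit flipped
print(hamming_similarity(a, b))       # unrelated random pair, close to 0
```

Because `x ^ b ^ b == x` holds for every bit, unbinding recovers the original exactly, which is why an exact equality check is appropriate here rather than a tolerance-based one.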
Sparse binary: the BSDC family
The BSDC family uses sparse binary vectors where only a small fraction (typically 1-5%) of elements are 1. Sparsity improves orthogonality between random vectors and mirrors the sparse activity of biological neurons.
PyHDC provides four BSDC variants:
| Encoding | Distinguishing feature |
|---|---|
| | Context-dependent thinning during bundling; unbind not supported |
| BSDC_S | Binding via circular shift; unbind supported |
| | Per-segment circular shift; useful for positional encodings |
| BSDC_THIN | Random thinning after OR-bundle to maintain target density (v1.1.0+) |
enc_s = pyhdc.BSDC_S(dimension=10_000)
v = enc_s.generate()
print(v.data.mean())  # ~= 0.01–0.05 (sparse)
The density growth problem
BSDC encodings use bitwise OR for bundling. OR is a natural set-union operation for sparse binary vectors, but it has a fatal flaw: each OR operation can only turn bits on, never off. After many bundles, density creeps toward 1.0 and all vectors look the same.
enc_s = pyhdc.BSDC_S(dimension=10_000)
result = enc_s.generate()
print(f"Step 0: density = {result.data.mean():.4f}")
for step in range(1, 21):
    result = result.bundle(enc_s.generate())
    if step % 4 == 0:
        print(f"Step {step:2d}: density = {result.data.mean():.4f}")
Expected output:
Step 0: density = 0.0106
Step 4: density = 0.0507
Step 8: density = 0.0856
Step 12: density = 0.1205
Step 16: density = 0.1518
Step 20: density = 0.1847
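The growth follows from the fact that a bit stays 0 only if it is 0 in every bundled vector. With a per-vector density of 0.01, continuing to step 200 would give a density of roughly 1 - 0.99^201 ≈ 0.87, and it keeps climbing toward 1.0, at which point every hypervector is indistinguishable.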
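The density figures above can be checked against a closed-form estimate. The sketch below assumes that every bit of every vector is an independent Bernoulli(p) draw, so a bit of the bundle is 0 only when it is 0 in all n + 1 inputs. The function names are illustrative and do not come from PyHDC.

```python
import numpy as np


def expected_or_density(p, n_bundles):
    """Expected density after OR-ing n_bundles + 1 independent vectors of density p."""
    return 1.0 - (1.0 - p) ** (n_bundles + 1)


def simulated_or_density(p, n_bundles, d=10_000, seed=0):
    """Empirical density of the same OR-bundle, using boolean NumPy arrays."""
    rng = np.random.default_rng(seed)
    result = rng.random(d) < p
    for _ in range(n_bundles):
        result |= rng.random(d) < p  # OR can only switch bits on
    return float(result.mean())


for n in (0, 4, 20, 200):
    print(n, round(expected_or_density(0.01, n), 4), round(simulated_or_density(0.01, n), 4))
```

For p = 0.01 the formula gives about 0.049 after 4 bundles, 0.19 after 20, and 0.87 after 200, in line with the measured values in the expected output.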
Solving density growth: BSDC_THIN
BSDC_THIN applies random thinning after each OR-bundle step.
Thinning randomly clears bits until the vector reaches a target density,
keeping density bounded regardless of how many bundles you perform.
enc_thin = pyhdc.BSDC_THIN(dimension=10_000, density=0.01) # default density = 0.5
result = enc_thin.generate()
print(f"Step 0: density = {result.data.mean():.4f}")
for step in range(1, 21):
    result = result.bundle(enc_thin.generate())
    if step % 4 == 0:
        print(f"Step {step:2d}: density = {result.data.mean():.4f}")
Expected output:
Step 0: density = 0.0111
Step 4: density = 0.0100
Step 8: density = 0.0100
Step 12: density = 0.0100
Step 16: density = 0.0100
Step 20: density = 0.0100
Density stays near the initial target throughout.
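To make the idea of thinning concrete, here is a minimal NumPy sketch of one way it could work: OR the vectors, then clear randomly chosen set bits until the target count is reached. This is an illustration under stated assumptions, not BSDC_THIN's actual algorithm; `thin_to_density` is a hypothetical helper, and the real implementation may select bits differently.

```python
import numpy as np


def thin_to_density(bits, target_density, rng):
    """Randomly clear set bits until at most round(target_density * D) remain."""
    target_ones = int(round(target_density * bits.size))
    on = np.flatnonzero(bits)            # indices of bits that are currently 1
    excess = on.size - target_ones
    if excess <= 0:
        return bits.copy()               # already at or below the target
    thinned = bits.copy()
    thinned[rng.choice(on, size=excess, replace=False)] = False
    return thinned


rng = np.random.default_rng(0)
D, density = 10_000, 0.01

result = rng.random(D) < density
for step in range(1, 21):
    incoming = rng.random(D) < density
    result = thin_to_density(result | incoming, density, rng)  # OR, then thin

print(f"density after 20 bundles = {result.mean():.4f}")
```

Because each step ends by clearing the excess bits, the density is capped at the target no matter how many vectors are bundled. The trade-off is that thinning discards some information from every input.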
You can control the target density explicitly:
enc_dense = pyhdc.BSDC_THIN(dimension=10_000, density=0.01)
# The density of generated vectors is determined by the BernoulliSparse element generator.
# To change it, pass a custom element generator, or use the density parameter
# (as above) where the encoding accepts one.
Sequence encoding with BSDC_S
BSDC_S binds by circularly shifting the hypervector by one position.
Shifting the k-th element of a sequence by k positions therefore encodes
its position, which makes BSDC_S a natural fit for sequence encoding:
enc_s = pyhdc.BSDC_S(dimension=10_000)
# Character hypervectors
chars = {c: enc_s.generate() for c in 'abcdefghijklmnopqrstuvwxyz'}
def encode_sequence(seq):
    """Encode a sequence by binding each element to its shifted position."""
    hvs = []
    hv = chars[seq[0]]  # position 0: no shift
    hvs.append(hv)
    for ch in seq[1:]:
        hv = chars[ch].bind(hv)  # each bind shifts the previous result
        hvs.append(hv)
    return pyhdc.bundle(*hvs)
cat = encode_sequence('cat')
bat = encode_sequence('bat')
rat = encode_sequence('rat')
car = encode_sequence('car')
print(cat.similarity(cat))  # 1.0
print(cat.similarity(bat))  # low: different first character
print(cat.similarity(car))  # moderate: shares 'c' and 'a'
The circular shift means the same character at different positions maps to different hypervectors, preserving order information.
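The effect of a circular shift on a sparse vector can be seen directly with `np.roll`. The sketch below is a simplified, NumPy-only illustration and not how BSDC_S is implemented: it shifts each character independently by its position index and OR-bundles the results, which differs from the chained `bind` calls in the example above. The names `encode_positional` and `overlap` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D, DENSITY = 10_000, 0.01

# Stand-ins for character hypervectors: sparse boolean arrays.
chars = {ch: rng.random(D) < DENSITY for ch in "acrt"}


def overlap(x, y):
    """Number of positions where both vectors are 1."""
    return int(np.count_nonzero(x & y))


def encode_positional(seq):
    """OR together each character's vector, rolled by its position index."""
    out = np.zeros(D, dtype=bool)
    for k, ch in enumerate(seq):
        out |= np.roll(chars[ch], k)
    return out


c = chars["c"]
shifted = np.roll(c, 2)
print(overlap(c, c), overlap(c, shifted))        # full overlap vs. chance overlap
print(np.array_equal(np.roll(shifted, -2), c))   # shifting back undoes the shift

cat, car, tac = (encode_positional(s) for s in ("cat", "car", "tac"))
print(overlap(cat, car), overlap(cat, tac))      # two shared positions vs. one
```

A shifted copy of a sparse vector overlaps the original only by chance, so the same character at two positions looks like two unrelated vectors. That is why "cat" overlaps more with "car" (same characters in positions 0 and 1) than with "tac" (same characters, but only 'a' in the same position).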
Similarity range and remapping
As of v1.1.0, all similarity functions in PyHDC return values in [-1, 1]:
-1: maximally dissimilar (all bits different for Hamming; zero overlap)
0: unrelated (expected for random pairs)
+1: identical
In v1.0.x, HammingDistance and Overlap returned [0, 1]. If you are
migrating from v1.0.x, use similarity_remap to restore the old behaviour:
from pyhdc.components.similarity import remap_to_unit
# Remap [-1, 1] -> [0, 1]
enc_remap = pyhdc.BSC(dimension=10_000, similarity_remap=remap_to_unit)
a = enc_remap.generate()
print(a.similarity(a)) # 1.0 (was 1.0 in v1.0.x: unchanged)
print(a.similarity(enc_remap.generate())) # ~= 0.5 (was ~= 0.5 in v1.0.x)
You can also apply remap_to_unit manually to any similarity result:
from pyhdc.components.similarity import remap_to_unit
enc = pyhdc.BSC(dimension=10_000)
a, b = enc.generate(), enc.generate()
raw = a.similarity(b) # in [-1, 1]
remapped = remap_to_unit(raw) # in [0, 1]
print(raw, remapped)
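The values quoted above (1.0 stays 1.0, and roughly 0.0 becomes roughly 0.5) are consistent with a linear map from [-1, 1] onto [0, 1]. The sketch below assumes that linear form; `linear_remap_to_unit` and `linear_remap_to_signed` are hypothetical stand-ins written for illustration, not PyHDC functions, and the library's `remap_to_unit` may differ in detail.

```python
def linear_remap_to_unit(similarity):
    """Map a similarity in [-1, 1] linearly onto [0, 1]."""
    return (similarity + 1.0) / 2.0


def linear_remap_to_signed(unit_similarity):
    """Inverse map: take a similarity in [0, 1] back to [-1, 1]."""
    return 2.0 * unit_similarity - 1.0


for s in (-1.0, 0.0, 1.0):
    print(s, "->", linear_remap_to_unit(s))
# -1.0 -> 0.0
# 0.0 -> 0.5
# 1.0 -> 1.0
```

Under this assumption, thresholds tuned against the old [0, 1] range can be converted with the inverse map instead of remapping every similarity call.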
Choosing density
Lower density means greater orthogonality between random vectors, which means higher capacity. However, very low density (< 0.001) requires very large dimensions to give enough 1s per vector for stable operations.
Practical guidelines:
| Density | Dimension | Notes |
|---|---|---|
| 0.1-0.5 | Any | Dense binary (BSC territory); moderate orthogonality |
| 0.01-0.05 | ≥ 1,000 | Standard BSDC range; good balance of sparsity and stability |
| 0.001-0.01 | ≥ 10,000 | Very sparse; very high capacity; needs large D for stability |
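The trade-off behind these guidelines can be put into numbers. Assuming each bit is an independent Bernoulli draw with probability equal to the density, a vector has about D × density ones, and two unrelated vectors share about D × density² ones by chance. The helper names below are illustrative only.

```python
def expected_ones(dimension, density):
    """Expected number of 1s in a random sparse vector."""
    return dimension * density


def expected_chance_overlap(dimension, density):
    """Expected number of shared 1s between two independent random vectors."""
    return dimension * density ** 2


for dimension, density in [(1_000, 0.05), (10_000, 0.01), (10_000, 0.001)]:
    ones = expected_ones(dimension, density)
    shared = expected_chance_overlap(dimension, density)
    print(f"D={dimension:>6}, density={density}: ~{ones:.0f} ones, ~{shared:.2f} shared by chance")
```

At D = 10,000 and density 0.001 a vector carries only about 10 ones, so losing or gaining a handful of bits changes it substantially. This is the instability that the larger dimension requirement is meant to offset.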
Summary
In this tutorial you:
Compared dense binary (BSC) and sparse binary (BSDC) encodings
Demonstrated the density growth problem with repeated OR-bundling
Fixed density growth using BSDC_THIN
Encoded a character sequence using circular-shift binding with BSDC_S
Understood the v1.1.0 similarity range change and how to use remap_to_unit
What’s next
Tutorial 5: Implementing a Custom Encoding: build your own encoding from components
Tutorial 6: Custom Generators and Reproducibility: seeded, reproducible experiments
How to Control Density in Sparse Binary Encodings: practical density control recipes
Encodings Overview: full encoding family comparison