How to Control Density in Sparse Binary Encodings
Sparse binary encodings (BSDC family) work best when the fraction of 1-bits (density) stays well below 0.5. The sections below show how to measure density and keep it bounded during bundling.
Measuring density
Density is simply the mean of a binary hypervector’s data array:
import pyhdc
enc = pyhdc.BSDC_S(dimension=10_000)
hv = enc.generate()
print(f"density = {hv.data.mean():.4f}") # ~= 0.01
The density growth problem
BSDC uses bitwise OR for bundling. OR can only turn bits on, so repeated bundling drives density toward 1.0:
enc = pyhdc.BSDC_S(dimension=10_000)
result = enc.generate()
print(f"step 0: density = {result.data.mean():.4f}")
for i in range(1, 21):
    result = result.bundle(enc.generate())
    if i % 5 == 0:
        print(f"step {i:2d}: density = {result.data.mean():.4f}")
# density increases with each step
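The growth rate can be estimated without pyhdc. Assuming each generated vector sets its bits independently with probability p (an idealization; the actual BSDC_S generator may fix the number of 1s exactly), the OR of n such vectors has expected density 1 - (1 - p)^n. The NumPy sketch below compares that formula with a simulation. `random_sparse` and `expected_or_density` are hypothetical helpers written for this illustration, not part of pyhdc.

```python
import numpy as np

def random_sparse(dim, density, rng):
    """Hypothetical stand-in for enc.generate(): i.i.d. Bernoulli(density) bits."""
    return rng.random(dim) < density

def expected_or_density(p, n):
    """Expected density after OR-ing n independent vectors of density p."""
    return 1.0 - (1.0 - p) ** n

rng = np.random.default_rng(0)
dim, p = 10_000, 0.01

result = random_sparse(dim, p, rng)
for n in range(2, 22):  # 20 bundling steps, so 21 vectors in total
    result = result | random_sparse(dim, p, rng)
    step = n - 1
    if step % 5 == 0:
        print(f"step {step:2d}: simulated = {result.mean():.4f}, "
              f"expected = {expected_or_density(p, n):.4f}")
```

Under this assumption, a starting density of 0.01 grows to roughly 0.19 after 20 bundling steps, which is already outside the sparse range recommended in the guidelines below.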
Solving it with BSDC_THIN
BSDC_THIN applies random thinning after each OR step, clearing randomly chosen bits until the density returns to its initial level:
enc = pyhdc.BSDC_THIN(dimension=10_000)
result = enc.generate()
print(f"step 0: density = {result.data.mean():.4f}")
for i in range(1, 21):
    result = result.bundle(enc.generate())
    if i % 5 == 0:
        print(f"step {i:2d}: density = {result.data.mean():.4f}")
# density stays stable throughout
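To make the idea concrete, here is a minimal NumPy sketch of OR followed by random thinning. It is an illustration of the technique as described above, not pyhdc's implementation: `thinned_or` is a hypothetical function, and the choice to clear bits uniformly at random down to an exact target count is an assumption.

```python
import numpy as np

def thinned_or(a, b, target_density, rng):
    """OR two boolean vectors, then clear random 1-bits down to the target count.

    Illustrative only; not pyhdc's implementation.
    """
    out = a | b  # new array, so the inputs are left untouched
    target_ones = int(round(target_density * out.size))
    ones = np.flatnonzero(out)
    excess = ones.size - target_ones
    if excess > 0:
        # Pick `excess` distinct set bits uniformly at random and clear them.
        out[rng.choice(ones, size=excess, replace=False)] = False
    return out

rng = np.random.default_rng(0)
dim, density = 10_000, 0.01

result = rng.random(dim) < density
for _ in range(20):
    result = thinned_or(result, rng.random(dim) < density, density, rng)

print(f"density after 20 thinned bundles = {result.mean():.4f}")
```

Because thinning discards bits from both operands, each bundled vector contributes fewer surviving bits as more vectors are added. That loss is the price paid for keeping density bounded.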
Using DisjunctionThinned directly
If you are building a custom pipeline with the components submodule, you can access the thinned OR operation directly:
from pyhdc.components.bundling import DisjunctionThinned
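# `enc` is the encoder created in the previous example
a = enc.generate().data
b = enc.generate().data
# OR the two arrays, then thin the result to the requested density
result = DisjunctionThinned(a, b, target_density=0.05)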
Density guidelines
| Density | Dimension | Notes |
|---|---|---|
| ≥ 0.1 | Any | Dense binary; more like BSC; less sparse-binary advantage |
| 0.01–0.05 | ≥ 1,000 | Recommended BSDC range; good balance of capacity and stability |
| ≤ 0.01 | ≥ 10,000 | Very sparse; high capacity; need large D for enough 1s per vector |
Lower density means greater orthogonality between random vectors and higher theoretical capacity, but too-low density at small dimensions means each vector has very few 1s, making similarity estimates noisy.
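The trade-off can be put in numbers with a short calculation. The sketch below assumes bits are set independently with probability p, so a vector of dimension D has about D·p ones and two unrelated vectors share about D·p² ones by chance (a binomial count). `sparse_stats` is a hypothetical helper for this illustration.

```python
import math

def sparse_stats(dim, density):
    """Expected 1s per vector and chance overlap of two independent random vectors."""
    ones = dim * density
    overlap_mean = dim * density ** 2
    # Overlap count is Binomial(dim, density**2) under the independence assumption.
    overlap_std = math.sqrt(dim * density ** 2 * (1 - density ** 2))
    return ones, overlap_mean, overlap_std

for dim, density in [(1_000, 0.01), (1_000, 0.05), (10_000, 0.01), (10_000, 0.1)]:
    ones, mean, std = sparse_stats(dim, density)
    print(f"D={dim:>6,} p={density:.2f}: ones={ones:6.0f}, "
          f"chance overlap={mean:6.2f} +/- {std:.2f}")
```

At D = 1,000 and density 0.01 a vector has only about 10 set bits, so a single coincidentally shared bit moves an overlap-based similarity by about a tenth of its maximum. At D = 10,000 the same density gives about 100 set bits, and one stray bit matters far less.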