Tutorial 4: (Sparse) Binary Encodings (BSC and BSDC)

Binary encodings represent hypervectors as arrays of 0s and 1s (or -1s and +1s). Sparse variants keep most elements at 0. This tutorial covers the difference between dense binary (BSC) and sparse binary (BSDC), the density saturation problem with OR bundling and how BSDC_THIN solves it, and sequence encoding with circular shifts.

Prerequisites: Tutorial 1: Encoding Text for Classification

Dense binary: BSC

BSC (Binary Spatter Code) uses dense binary vectors where each element is drawn from a Bernoulli distribution with p = 0.5: on average half the elements are 1.

Binding is XOR (self-inverse), and similarity is Hamming distance remapped to [-1, 1]:

import pyhdc
import numpy as np

enc = pyhdc.BSC(dimension=10_000)

a = enc.generate()
b = enc.generate()

print(a.data.mean())          # ~= 0.5  # dense
print(a.similarity(b))        # ~= 0.0  # unrelated
print(a.similarity(a))        # 1.0     # identical

# XOR binding is exactly self-inverse
bound    = a.bind(b)
recovered = bound.unbind(b)   # identical to a, not approximate
print(np.allclose(recovered.data, a.data))   # True

Sparse binary: the BSDC family

The BSDC family uses sparse binary vectors where only a small fraction (typically 1-5%) of elements are 1. Sparsity improves orthogonality between random vectors and mirrors the sparse activity of biological neurons.

PyHDC provides four BSDC variants:

Encoding	Distinguishing feature
`BSDC_CDT`	Context-dependent thinning during bundling; unbind not supported
`BSDC_S`	Binding via circular shift; unbind supported
`BSDC_SEG`	Per-segment circular shift; useful for positional encodings
`BSDC_THIN`	Random thinning after OR-bundle to maintain target density (v1.1.0+)

enc_s = pyhdc.BSDC_S(dimension=10_000)
v     = enc_s.generate()
print(v.data.mean())   # ~= 0.01–0.05  # sparse

The density growth problem

BSDC encodings use bitwise OR for bundling. OR is a natural set-union operation for sparse binary vectors, but it has a fatal flaw: each OR operation can only turn bits on, never off. After many bundles, density creeps toward 1.0 and all vectors look the same.

enc_s  = pyhdc.BSDC_S(dimension=10_000)
result = enc_s.generate()
print(f"Step  0: density = {result.data.mean():.4f}")

for step in range(1, 21):
    result = result.bundle(enc_s.generate())
    if step % 4 == 0:
        print(f"Step {step:2d}: density = {result.data.mean():.4f}")

Expected output:

Step  0: density = 0.0106
Step  4: density = 0.0507
Step  8: density = 0.0856
Step 12: density = 0.1205
Step 16: density = 0.1518
Step 20: density = 0.1847

If you continued to step 200, density would approach 1.0 and every hypervector would be indistinguishable.

Solving density growth: BSDC_THIN

BSDC_THIN applies random thinning after each OR-bundle step. Thinning randomly clears bits until the vector reaches a target density, keeping density bounded regardless of how many bundles you perform.

enc_thin = pyhdc.BSDC_THIN(dimension=10_000, density=0.01)   # default density = 0.5
result   = enc_thin.generate()
print(f"Step  0: density = {result.data.mean():.4f}")

for step in range(1, 21):
    result = result.bundle(enc_thin.generate())
    if step % 4 == 0:
        print(f"Step {step:2d}: density = {result.data.mean():.4f}")

Expected output:

Step  0: density = 0.0111
Step  4: density = 0.0100
Step  8: density = 0.0100
Step 12: density = 0.0100
Step 16: density = 0.0100
Step 20: density = 0.0100

Density stays near the initial target throughout.

You can control the target density explicitly:

enc_dense  = pyhdc.BSDC_THIN(dimension=10_000, density=0.01)
# density is determined by BernoulliSparse element generator
# To change: pass a custom element generator, or use the density parameter if available

Sequence encoding with BSDC_S

BSDC_S binds by performing a circular shift of the hypervector by one position. Binding the k-th element with a shift of k positions encodes position. This makes it natural for sequence encoding:

enc_s = pyhdc.BSDC_S(dimension=10_000)

# Character hypervectors
chars = {c: enc_s.generate() for c in 'abcdefghijklmnopqrstuvwxyz'}

def encode_sequence(seq):
    """Encode a sequence by binding each element to its shifted position."""
    hvs = []
    hv  = chars[seq[0]]                # position 0: no shift
    hvs.append(hv)
    for ch in seq[1:]:
        hv = chars[ch].bind(hv)        # each bind shifts the previous result
        hvs.append(hv)
    return pyhdc.bundle(*hvs)

cat = encode_sequence('cat')
bat = encode_sequence('bat')
rat = encode_sequence('rat')
car = encode_sequence('car')

print(cat.similarity(cat))   # 1.0
print(cat.similarity(bat))   # ~= low  # different first character
print(cat.similarity(car))   # ~= moderate: share 'c' and 'a'

The circular shift means the same character at different positions maps to different hypervectors, preserving order information.

Similarity range and remapping

As of v1.1.0, all similarity functions in PyHDC return values in [-1, 1]:

-1 : maximally dissimilar (all bits different for Hamming; zero overlap)
0 : unrelated (expected for random pairs)
+1 : identical

In v1.0.x, HammingDistance and Overlap returned [0, 1]. If you are migrating from v1.0.x, use similarity_remap to restore the old behaviour:

from pyhdc.components.similarity import remap_to_unit

# Remap [-1, 1] -> [0, 1]
enc_remap = pyhdc.BSC(dimension=10_000, similarity_remap=remap_to_unit)

a = enc_remap.generate()
print(a.similarity(a))   # 1.0  (was 1.0 in v1.0.x: unchanged)
print(a.similarity(enc_remap.generate()))   # ~= 0.5  (was ~= 0.5 in v1.0.x)

You can also apply remap_to_unit manually to any similarity result:

from pyhdc.components.similarity import remap_to_unit, HammingDistance

enc  = pyhdc.BSC(dimension=10_000)
a, b = enc.generate(), enc.generate()

raw     = a.similarity(b)              # in [-1, 1]
remapped = remap_to_unit(raw)          # in  [0, 1]
print(raw, remapped)

Choosing density

Lower density means greater orthogonality between random vectors, which means higher capacity. However, very low density (< 0.001) requires very large dimensions to give enough 1s per vector for stable operations.

Practical guidelines:

Density	Dimension	Notes
0.1-0.5	Any	Dense binary (BSC territory); moderate orthogonality
0.01-0.05	≥ 1,000	Standard BSDC range; good balance of sparsity and stability
0.001-0.01	≥ 10,000	Very sparse; very high capacity; needs large D for stability

Summary

In this tutorial you:

Compared dense binary (BSC) and sparse binary (BSDC) encodings
Demonstrated the density growth problem with repeated OR-bundling
Fixed density growth using BSDC_THIN
Encoded a character sequence using circular-shift binding with BSDC_S
Understood the v1.1.0 similarity range change and how to use remap_to_unit

What’s next

Tutorial 5: Implementing a Custom Encoding : build your own encoding from components
Tutorial 6: Custom Generators and Reproducibility : seeded, reproducible experiments
How to Control Density in Sparse Binary Encodings : practical density control recipes
Encodings Overview : full encoding family comparison