Tutorial 4: (Sparse) Binary Encodings (BSC and BSDC)
====================================================

Binary encodings represent hypervectors as arrays of 0s and 1s (or -1s and +1s).
Sparse variants keep most elements at 0. This tutorial covers the difference
between dense binary (BSC) and sparse binary (BSDC), the density saturation
problem with OR bundling and how ``BSDC_THIN`` solves it, and sequence encoding
with circular shifts.

**Prerequisites**: :doc:`tutorial_1_text_classification`

----

Dense binary: BSC
-----------------

:class:`~pyhdc.BSC` (Binary Spatter Code) uses dense binary vectors where each
element is drawn from a Bernoulli distribution with p = 0.5: on average half the
elements are 1. Binding is XOR (self-inverse), and similarity is Hamming
distance remapped to [-1, 1]:

.. code-block:: python

    import pyhdc
    import numpy as np

    enc = pyhdc.BSC(dimension=10_000)
    a = enc.generate()
    b = enc.generate()

    print(a.data.mean())    # ~= 0.5 (dense)
    print(a.similarity(b))  # ~= 0.0 (unrelated)
    print(a.similarity(a))  # 1.0 (identical)

    # XOR binding is exactly self-inverse
    bound = a.bind(b)
    recovered = bound.unbind(b)  # identical to a, not approximate
    print(np.allclose(recovered.data, a.data))  # True

----

Sparse binary: the BSDC family
------------------------------

The BSDC family uses *sparse* binary vectors where only a small fraction
(typically 1-5%) of elements are 1. Sparsity improves orthogonality between
random vectors and mirrors the sparse activity of biological neurons.

PyHDC provides four BSDC variants:

.. list-table::
   :header-rows: 1
   :widths: 20 80

   * - Encoding
     - Distinguishing feature
   * - ``BSDC_CDT``
     - Context-dependent thinning during bundling; unbind **not** supported
   * - ``BSDC_S``
     - Binding via circular shift; unbind supported
   * - ``BSDC_SEG``
     - Per-segment circular shift; useful for positional encodings
   * - ``BSDC_THIN``
     - Random thinning after OR-bundle to maintain target density (v1.1.0+)
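The effect of sparsity on accidental overlap can be checked without PyHDC. The
following is a minimal standard-library sketch, not PyHDC's implementation:
``random_sparse`` is a hypothetical helper that represents a vector by the set
of its active (1) positions, and the density of 0.02 is chosen only for
illustration.

.. code-block:: python

    import random

    def random_sparse(dimension, density, rng):
        """Return the active positions of a random sparse binary vector."""
        n_active = round(dimension * density)
        return set(rng.sample(range(dimension), n_active))

    rng = random.Random(0)  # fixed seed so the run is repeatable
    dimension, density = 10_000, 0.02

    a = random_sparse(dimension, density, rng)
    b = random_sparse(dimension, density, rng)

    shared = len(a & b)                  # positions where both vectors are 1
    expected = dimension * density ** 2  # mean overlap of independent vectors

    print(len(a), len(b))  # 200 200
    print(shared, expected)

Two independent vectors with 200 active bits each share only about 4 positions
on average, so the shared fraction of active bits is roughly the density
itself. Lowering the density therefore lowers the chance overlap between
unrelated vectors.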
For example, a ``BSDC_S`` vector is generated the same way as a ``BSC`` one:

.. code-block:: python

    enc_s = pyhdc.BSDC_S(dimension=10_000)
    v = enc_s.generate()
    print(v.data.mean())  # ~= 0.01–0.05 (sparse)

----

The density growth problem
--------------------------

BSDC encodings use bitwise OR for bundling. OR is a natural set-union operation
for sparse binary vectors, but it has a fatal flaw: it can only turn bits *on*,
never off. Every bundle therefore raises the density, and after enough bundles
the density approaches 1.0 and all vectors look the same.

.. code-block:: python

    enc_s = pyhdc.BSDC_S(dimension=10_000)

    result = enc_s.generate()
    print(f"Step  0: density = {result.data.mean():.4f}")

    for step in range(1, 21):
        result = result.bundle(enc_s.generate())
        if step % 4 == 0:
            print(f"Step {step:2d}: density = {result.data.mean():.4f}")

Expected output:

.. code-block:: text

    Step  0: density = 0.0106
    Step  4: density = 0.0507
    Step  8: density = 0.0856
    Step 12: density = 0.1205
    Step 16: density = 0.1518
    Step 20: density = 0.1847

Density keeps climbing with every further bundle. For independent vectors of
density 0.01, the expected density after 200 bundles is about 0.87
(``1 - 0.99**201``), and it continues toward 1.0, where every hypervector is
indistinguishable.

----

Solving density growth: BSDC_THIN
---------------------------------

:class:`~pyhdc.BSDC_THIN` applies random thinning after each OR-bundle step.
Thinning randomly clears bits until the vector reaches a target density, keeping
density bounded regardless of how many bundles you perform. The example below
requests a target of 0.01 through the ``density`` argument.

.. code-block:: python

    enc_thin = pyhdc.BSDC_THIN(dimension=10_000, density=0.01)  # default density = 0.5

    result = enc_thin.generate()
    print(f"Step  0: density = {result.data.mean():.4f}")

    for step in range(1, 21):
        result = result.bundle(enc_thin.generate())
        if step % 4 == 0:
            print(f"Step {step:2d}: density = {result.data.mean():.4f}")

Expected output:

.. code-block:: text

    Step  0: density = 0.0111
    Step  4: density = 0.0100
    Step  8: density = 0.0100
    Step 12: density = 0.0100
    Step 16: density = 0.0100
    Step 20: density = 0.0100

Density stays near the initial target throughout.
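The contrast between plain OR-bundling and OR followed by random thinning can be
reproduced with the standard library alone. This is a sketch under stated
assumptions, not PyHDC's code: vectors are stored as sets of active positions,
and ``random_sparse`` and ``thin`` are hypothetical helpers introduced here for
illustration.

.. code-block:: python

    import random

    def random_sparse(dimension, density, rng):
        """Return the active positions of a random sparse binary vector."""
        n_active = round(dimension * density)
        return set(rng.sample(range(dimension), n_active))

    def thin(active, target_count, rng):
        """Randomly clear bits until at most target_count remain active."""
        if len(active) <= target_count:
            return set(active)
        # sorted() gives rng.sample a sequence with a stable order
        return set(rng.sample(sorted(active), target_count))

    rng = random.Random(0)
    dimension, density = 10_000, 0.01
    target = round(dimension * density)  # 100 active bits

    plain = random_sparse(dimension, density, rng)
    thinned = set(plain)
    for _ in range(20):
        new = random_sparse(dimension, density, rng)
        plain = plain | new                         # OR only: bits never turn off
        thinned = thin(thinned | new, target, rng)  # OR, then thin to the target

    predicted = 1 - (1 - density) ** 21  # expected density of 21 OR-ed vectors

    print(len(plain) / dimension, predicted)
    print(len(thinned) / dimension)  # 0.01

The un-thinned density lands close to the predicted value of about 0.19, while
the thinned vector is held at exactly the target count after every step.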
The target density is chosen when the encoding is constructed:

.. code-block:: python

    enc_dense = pyhdc.BSDC_THIN(dimension=10_000, density=0.01)
    # Density is determined by the BernoulliSparse element generator.
    # To change it, pass a custom element generator, or use the density
    # parameter if available.

----

Sequence encoding with BSDC_S
-----------------------------

:class:`~pyhdc.BSDC_S` binds by performing a *circular shift* of the hypervector
by one position. Binding the *k*-th element with a shift of *k* positions
encodes its position, which makes ``BSDC_S`` a natural fit for sequence encoding:

.. code-block:: python

    enc_s = pyhdc.BSDC_S(dimension=10_000)

    # Character hypervectors
    chars = {c: enc_s.generate() for c in 'abcdefghijklmnopqrstuvwxyz'}

    def encode_sequence(seq):
        """Encode a sequence by binding each element to its shifted position."""
        hv = chars[seq[0]]  # position 0: no shift
        hvs = [hv]
        for ch in seq[1:]:
            hv = chars[ch].bind(hv)  # each bind shifts the previous result
            hvs.append(hv)
        return pyhdc.bundle(*hvs)

    cat = encode_sequence('cat')
    bat = encode_sequence('bat')
    rat = encode_sequence('rat')
    car = encode_sequence('car')

    print(cat.similarity(cat))  # 1.0
    print(cat.similarity(bat))  # low: different first character
    print(cat.similarity(car))  # moderate: share 'c' and 'a'

The circular shift means the same character at different positions maps to
different hypervectors, preserving order information.

----

Similarity range and remapping
------------------------------

As of v1.1.0, all similarity functions in PyHDC return values in **[-1, 1]**:

* ``-1``: maximally dissimilar (all bits differ for Hamming; zero overlap for Overlap)
* ``0``: unrelated (expected for random pairs)
* ``+1``: identical

In v1.0.x, ``HammingDistance`` and ``Overlap`` returned values in [0, 1]. The
``similarity_remap`` argument restores that range for code migrating from
v1.0.x.
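The two ranges are related by a simple linear map. The sketch below is an
assumption-labelled illustration in plain Python, not PyHDC's source:
``hamming_similarity`` and ``to_unit`` are hypothetical stand-ins, and whether
the library's ``remap_to_unit`` uses exactly this formula is assumed rather
than confirmed. The formulas are chosen so that identical vectors score 1,
fully inverted vectors score -1, and half-differing vectors score 0.

.. code-block:: python

    def hamming_similarity(a, b):
        """Hamming distance of two bit lists, mapped linearly to [-1, 1]."""
        if len(a) != len(b):
            raise ValueError("vectors must have the same length")
        differing = sum(x != y for x, y in zip(a, b))
        return 1.0 - 2.0 * differing / len(a)

    def to_unit(similarity):
        """Map a similarity from [-1, 1] to [0, 1] (assumed linear remap)."""
        return (similarity + 1.0) / 2.0

    a = [1, 0, 1, 1, 0, 0, 1, 0]
    inverted = [1 - bit for bit in a]  # every bit flipped
    half = [0, 1, 0, 0, 0, 0, 1, 0]    # differs from a in the first 4 positions

    print(hamming_similarity(a, a))         # 1.0
    print(hamming_similarity(a, inverted))  # -1.0
    print(hamming_similarity(a, half))      # 0.0
    print(to_unit(0.0))                     # 0.5

Under this map an unrelated pair moves from about 0 to about 0.5, while
identical vectors stay at 1.0.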
Passing ``remap_to_unit`` as ``similarity_remap`` maps results from [-1, 1] back
to [0, 1]:

.. code-block:: python

    from pyhdc.components.similarity import remap_to_unit

    # Remap [-1, 1] -> [0, 1]
    enc_remap = pyhdc.BSC(dimension=10_000, similarity_remap=remap_to_unit)
    a = enc_remap.generate()
    print(a.similarity(a))                     # 1.0 (was 1.0 in v1.0.x: unchanged)
    print(a.similarity(enc_remap.generate()))  # ~= 0.5 (was ~= 0.5 in v1.0.x)

You can also apply ``remap_to_unit`` manually to any similarity result:

.. code-block:: python

    from pyhdc.components.similarity import remap_to_unit

    enc = pyhdc.BSC(dimension=10_000)
    a, b = enc.generate(), enc.generate()

    raw = a.similarity(b)          # in [-1, 1]
    remapped = remap_to_unit(raw)  # in [0, 1]
    print(raw, remapped)

----

Choosing density
----------------

Lower density means greater orthogonality between random vectors, which means
higher capacity. However, very low density (< 0.001) requires very large
dimensions to give enough 1s per vector for stable operations: at density 0.001
and dimension 10,000, a vector has only about 10 active bits.

Practical guidelines:

.. list-table::
   :header-rows: 1
   :widths: 20 20 60

   * - Density
     - Dimension
     - Notes
   * - 0.1-0.5
     - Any
     - Dense binary (BSC territory); moderate orthogonality
   * - 0.01-0.05
     - ≥ 1,000
     - Standard BSDC range; good balance of sparsity and stability
   * - 0.001-0.01
     - ≥ 10,000
     - Very sparse; very high capacity; needs large D for stability

----

Summary
-------

In this tutorial you:

* Compared dense binary (BSC) and sparse binary (BSDC) encodings
* Demonstrated the density growth problem with repeated OR-bundling
* Fixed density growth using ``BSDC_THIN``
* Encoded a character sequence using circular-shift binding with ``BSDC_S``
* Learned about the v1.1.0 similarity range change and how to use ``remap_to_unit``

----

What's next
-----------

* :doc:`tutorial_5_custom_generators`: seeded, reproducible experiments
* :doc:`../how_to/control_density`: practical density control recipes
* :doc:`../user_manual/encodings_overview`: full encoding family comparison