How to Make Experiments Reproducible
enc.generate() draws from NumPy’s global random state by default, which
changes between Python sessions. Setting a global seed or passing a seeded
HDCGenerator produces identical hypervectors on every run.
Setting a global seed
import pyhdc
import random
import numpy as np
random.seed(42)  # seed Python's built-in random module
np.random.seed(42)  # seed NumPy's global random state
if pyhdc.TORCH_AVAILABLE:
    import torch  # imported only when PyTorch is installed
    torch.manual_seed(42)  # seed PyTorch's global generator
    torch.cuda.manual_seed_all(42)  # seed all CUDA devices
enc = pyhdc.MAP_C(dimension=10_000)
hv = enc.generate() # always the same for seed=42
print(hv.data[:5])
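The seeding calls above can be wrapped in one helper so that every entry point seeds the same way. The following is a minimal sketch: seed_everything is a hypothetical name, not part of pyhdc, and it guards the NumPy and PyTorch imports itself instead of relying on pyhdc.TORCH_AVAILABLE.

```python
import random


def seed_everything(seed: int) -> None:
    """Seed every importable global random source (hypothetical helper)."""
    random.seed(seed)  # Python's built-in random module
    try:
        import numpy as np
        np.random.seed(seed)  # NumPy's legacy global random state
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)  # PyTorch's global generator
        torch.cuda.manual_seed_all(seed)  # silently ignored without CUDA
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


seed_everything(42)
first = [random.random() for _ in range(3)]
seed_everything(42)
second = [random.random() for _ in range(3)]
print(first == second)  # True
```

Call the helper once at the top of a script, before any encoding is constructed, so that no draw happens ahead of the seed.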
Basic reproducibility with seeded generators
Pass a seeded generator to the encoding constructor:
import pyhdc
from pyhdc.generation import CommonLCGGenerators
gen = CommonLCGGenerators.numerical_recipes(seed=42)
enc = pyhdc.MAP_C(dimension=10_000, generator=gen)
hv = enc.generate() # always the same for seed=42
print(hv.data[:5])
Re-run the same generation by calling reset() before each run:
gen.reset()
hv_run1 = enc.generate()
gen.reset()
hv_run2 = enc.generate()
import numpy as np
print(np.allclose(hv_run1.data, hv_run2.data)) # True
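What reset() buys you is easiest to see on a toy generator. The class below is an illustrative stand-in, not pyhdc's implementation: it is a plain linear congruential generator with the Numerical Recipes constants (a = 1664525, c = 1013904223, m = 2**32), and the names ToyLCG and next_bipolar are hypothetical.

```python
class ToyLCG:
    """Illustrative LCG: state <- (A * state + C) mod M."""

    A, C, M = 1664525, 1013904223, 2**32

    def __init__(self, seed: int) -> None:
        self._seed = seed % self.M
        self._state = self._seed

    def reset(self) -> None:
        """Rewind the stream to the seed."""
        self._state = self._seed

    def next_bipolar(self, n: int) -> list:
        """Return n values in {-1, +1}, one per step, from each state's top bit."""
        out = []
        for _ in range(n):
            self._state = (self.A * self._state + self.C) % self.M
            out.append(1 if self._state >> 31 else -1)
        return out


gen = ToyLCG(seed=42)
run1 = gen.next_bipolar(8)
gen.reset()
run2 = gen.next_bipolar(8)
print(run1 == run2)  # True: reset() replays the stream from the seed
```

Without the reset() call, the second request would continue from wherever the first one stopped, which is why the runs above rewind before each generation.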
Building a reproducible codebook
import pyhdc
from pyhdc.generation import CommonPCGGenerators
gen = CommonPCGGenerators.pcg32(seed=0)
enc = pyhdc.MAP_C(dimension=10_000, generator=gen)
items = ['apple', 'banana', 'cherry']
gen.reset()
codebook = {name: enc.generate() for name in items}
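Because the codebook is filled from one sequential stream, each item's vector depends on its position in items as well as on the seed. The sketch below shows this with Python's random.Random standing in for the pyhdc generator; make_codebook is a hypothetical helper, and the 16-element bipolar vectors are only for illustration.

```python
import random


def make_codebook(items, seed, dim=16):
    """Assign each item a random bipolar vector, drawn in iteration order."""
    rng = random.Random(seed)  # private stream, independent of global state
    return {name: [rng.choice((-1, 1)) for _ in range(dim)] for name in items}


items = ["apple", "banana", "cherry"]
book_a = make_codebook(items, seed=0)
book_b = make_codebook(items, seed=0)
print(book_a == book_b)  # True: same seed, same order

# Reordering the items hands the first vector of the stream to another name.
book_c = make_codebook(["banana", "apple", "cherry"], seed=0)
print(book_c["banana"] == book_a["apple"])  # True
```

Keeping the item list in a fixed order (for example, sorted) or storing it alongside the seed keeps the name-to-vector mapping stable across runs.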
Snapshotting and restoring state
If you need to resume generation mid-experiment from a known point, snapshot
the state with get_state(); the exact return type is
generator-specific:
gen.reset()
_ = enc.generate() # consume one vector
state = gen.get_state() # snapshot
hv_a = enc.generate()
# Restore and re-generate from snapshot
gen.set_seed(gen._seed) # or: recreate with same seed and advance manually
# Note: get_state / restore API is generator-dependent; reset() is the
# most portable option for full reproducibility
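The snapshot-and-resume pattern itself looks like this on Python's random.Random, whose getstate() and setstate() are standard-library calls. It is only an analogy: whether a given pyhdc generator exposes a matching restore method is not stated here, so treat any mapping to get_state() as an assumption.

```python
import random

rng = random.Random(42)
_ = [rng.random() for _ in range(5)]  # consume part of the stream

state = rng.getstate()  # snapshot mid-stream
after_snapshot = [rng.random() for _ in range(3)]

rng.setstate(state)  # rewind to the snapshot, not to the seed
replayed = [rng.random() for _ in range(3)]
print(after_snapshot == replayed)  # True
```

When no restore call is available, the fallback is to reset the generator and re-consume the same number of draws that preceded the snapshot.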
Bypassing the generator for a single call
Pass use_generator=False to generate one vector from NumPy’s default
random state without advancing the custom generator:
hv_np = enc.generate(use_generator=False) # uses NumPy, not the LCG
Reproducible batched generation
A tuple size produces a dimension-first batch: generate(size=(D, N))
returns a (D, N) tensor of N hypervectors, and
generate(size=(D, N, M)) returns a (D, N, M) tensor of N * M
hypervectors. Axis 0 is always the dimension D; the trailing axes are the
batch. Index column j as batch[:, j].
Batched generation reproduces itself for a fixed seed and shape. Calling
generate(size=(D, N)) twice under the same seed yields the same batch:
import numpy as np
import pyhdc
enc = pyhdc.MAP_C(dimension=10_000)
np.random.seed(42)
first = enc.generate(size=(10_000, 8))
np.random.seed(42)
second = enc.generate(size=(10_000, 8))
print(np.array_equal(first.data, second.data)) # True
The i.i.d. fast path. When use_generator is False and the encoding’s
element generator draws each coordinate independently, generate draws the
whole (D, *batch) array in one vectorized call. The fast path qualifies for
these six generators: BernoulliBipolar, BernoulliBinary,
UniformBipolar, UniformAngles, NormalReal, and BernoulliSparse.
Because it draws the batch as one block, the result is not value-identical to
N separate generate(size=D) calls: a block draw and a per-vector loop
walk the random stream in different orders.
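The ordering difference can be reproduced with NumPy alone. This sketch uses NumPy's legacy global state directly instead of pyhdc; random_sample stands in for whichever distribution the encoding draws from, and the small D and N are illustrative.

```python
import numpy as np

D, N = 4, 3

# Block draw: one call fills the (D, N) array row by row from the stream.
np.random.seed(42)
block = np.random.random_sample((D, N))

# Per-vector loop: each call takes the next D values for one column.
np.random.seed(42)
columns = [np.random.random_sample(D) for _ in range(N)]
loop = np.stack(columns, axis=-1)

print(np.array_equal(block, loop))  # False: same values, different positions
print(np.array_equal(block.ravel(), np.concatenate(columns)))  # True: same stream
```

Both approaches consume the same random values; they only assign them to different coordinates, so each is reproducible on its own but the two are not interchangeable.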
Ordered and custom generators match the per-vector loop. SparseSegmented
(the BSDC_SEG generator) is segment-structured rather than i.i.d., so it falls
back to the loop, as do any custom HDCGenerator and any call with
use_generator=True. For these, generate builds the batch one vector at a
time, so a seeded batch equals N successive single-vector draws:
import numpy as np
import pyhdc
from pyhdc.generation import CommonLCGGenerators
gen = CommonLCGGenerators.numerical_recipes(seed=7)
enc = pyhdc.MAP_C(dimension=10_000, generator=gen)
batch = enc.generate(size=(10_000, 8), use_generator=True)
gen.reset()
columns = [enc.generate(size=10_000, use_generator=True) for _ in range(8)]
loop = np.stack([c.data for c in columns], axis=-1)
print(np.array_equal(batch.data, loop)) # True
Use axis= for reproducible bundling. The deprecated batch_dim
bundling carries no fixed-seed guarantee, because tie-randomizing bundlers draw
fresh random values at tie coordinates. The axis= form reduces in place
without that extra draw, so it is the reproducible and preferred form.
See Batched similarity: calling conventions for the matching axis contract on the read side.
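To see why tie randomization costs reproducibility, consider a generic majority-vote bundler for bipolar vectors. This is a sketch of the general technique, not pyhdc's bundling code: bundle_random_ties and bundle_fixed_ties are hypothetical names, and resolving ties to +1 is just one possible deterministic rule, not necessarily what axis= does.

```python
import random


def bundle_random_ties(vectors, rng):
    """Majority vote per coordinate; ties are broken by a fresh random draw."""
    sums = [sum(col) for col in zip(*vectors)]
    return [(1 if s > 0 else -1) if s != 0 else rng.choice((-1, 1)) for s in sums]


def bundle_fixed_ties(vectors):
    """Majority vote per coordinate; ties always resolve to +1 (no RNG needed)."""
    sums = [sum(col) for col in zip(*vectors)]
    return [1 if s >= 0 else -1 for s in sums]


a = [1, -1, 1, -1]
b = [-1, -1, 1, 1]  # ties with `a` at coordinates 0 and 3

rng = random.Random(0)
state_before = rng.getstate()
bundle_random_ties([a, b], rng)
print(rng.getstate() != state_before)  # True: the tie-break advanced the stream

print(bundle_fixed_ties([a, b]))  # [1, -1, 1, 1]
```

The random variant's output depends on where the stream happens to be when bundling starts, and it shifts every later draw as well; the fixed rule depends only on its inputs.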
Choosing a generator for reproducibility
All built-in generator families accept a seed parameter. Recommended
choices:
PCG (CommonPCGGenerators.pcg32): best statistical quality, fully reproducible
LCG (CommonLCGGenerators.numerical_recipes): simplest, most portable
Xorshift (CommonXorshiftGenerators.xorshift64): very fast for large batches
See Random Number Generators for a full comparison.