How to Make Experiments Reproducible
=====================================

``enc.generate()`` draws from NumPy's global random state by default, which
changes between Python sessions. Setting a global seed or passing a seeded
``HDCGenerator`` produces identical hypervectors on every run.

Setting a global seed
----------------------

.. code-block:: python

    import random

    import numpy as np
    import pyhdc

    random.seed(42)     # seed Python's built-in random module
    np.random.seed(42)  # seed NumPy's global random state

    if pyhdc.TORCH_AVAILABLE:
        # Import torch only when pyhdc reports it as available
        import torch

        torch.manual_seed(42)           # seed PyTorch
        torch.cuda.manual_seed_all(42)  # seed all CUDA devices

    enc = pyhdc.MAP_C(dimension=10_000)
    hv = enc.generate()  # always the same for seed=42
    print(hv.data[:5])

Basic reproducibility with seeded generators
----------------------------------------------

Pass a seeded generator to the encoding constructor:

.. code-block:: python

    import pyhdc
    from pyhdc.generation import CommonLCGGenerators

    gen = CommonLCGGenerators.numerical_recipes(seed=42)
    enc = pyhdc.MAP_C(dimension=10_000, generator=gen)

    hv = enc.generate()  # always the same for seed=42
    print(hv.data[:5])

Re-run the same generation by calling ``reset()`` before each run:

.. code-block:: python

    import numpy as np

    gen.reset()
    hv_run1 = enc.generate()

    gen.reset()
    hv_run2 = enc.generate()

    print(np.allclose(hv_run1.data, hv_run2.data))  # True

Building a reproducible codebook
----------------------------------

.. code-block:: python

    import pyhdc
    from pyhdc.generation import CommonPCGGenerators

    gen = CommonPCGGenerators.pcg32(seed=0)
    enc = pyhdc.MAP_C(dimension=10_000, generator=gen)

    items = ['apple', 'banana', 'cherry']
    gen.reset()  # start from the seeded state so item order fixes each vector
    codebook = {name: enc.generate() for name in items}

Snapshotting and restoring state
----------------------------------

If you need to resume generation mid-experiment from a known point, snapshot
the state with ``get_state()``; the exact return type is generator-specific:

.. code-block:: python

    gen.reset()
    _ = enc.generate()       # consume one vector
    state = gen.get_state()  # snapshot
    hv_a = enc.generate()

    # Restore and re-generate from the snapshot
    gen.set_seed(gen._seed)  # or: recreate with the same seed and advance manually
    # Note: the get_state / restore API is generator-dependent; reset() is the
    # most portable option for full reproducibility

Bypassing the generator for a single call
------------------------------------------

Pass ``use_generator=False`` to generate one vector from NumPy's default
random state without advancing the custom generator:

.. code-block:: python

    hv_np = enc.generate(use_generator=False)  # uses NumPy, not the custom generator

Reproducible batched generation
-------------------------------

A tuple ``size`` produces a dimension-first batch: ``generate(size=(D, N))``
returns a ``(D, N)`` tensor of ``N`` hypervectors, and
``generate(size=(D, N, M))`` returns a ``(D, N, M)`` tensor of ``N * M``
hypervectors. Axis 0 is always the dimension ``D``; the trailing axes are the
batch. In a 2-D batch, hypervector ``j`` is the column ``batch[:, j]``.

**Batched generation reproduces itself for a fixed seed and shape.** Calling
``generate(size=(D, N))`` twice under the same seed yields the same batch:

.. code-block:: python

    import numpy as np
    import pyhdc

    enc = pyhdc.MAP_C(dimension=10_000)

    np.random.seed(42)
    first = enc.generate(size=(10_000, 8))

    np.random.seed(42)
    second = enc.generate(size=(10_000, 8))

    print(np.array_equal(first.data, second.data))  # True

**The i.i.d. fast path.** When ``use_generator`` is ``False`` and the
encoding's element generator draws each coordinate independently, ``generate``
draws the whole ``(D, *batch)`` array in one vectorized call. The fast path
qualifies for these six generators: ``BernoulliBipolar``, ``BernoulliBinary``,
``UniformBipolar``, ``UniformAngles``, ``NormalReal``, and ``BernoulliSparse``.
Because it draws the batch as one block, the result is **not** value-identical
to ``N`` separate ``generate(size=D)`` calls: a block draw and a per-vector
loop walk the random stream in different orders.

**Ordered and custom generators match the per-vector loop.**
``SparseSegmented`` (the ``BSDC_SEG`` generator) is segment-structured rather
than i.i.d., so it falls back to the loop, as do any custom ``HDCGenerator``
and any call with ``use_generator=True``. For these, ``generate`` builds the
batch one vector at a time, so a seeded batch equals ``N`` successive
single-vector draws:

.. code-block:: python

    import numpy as np
    import pyhdc
    from pyhdc.generation import CommonLCGGenerators

    gen = CommonLCGGenerators.numerical_recipes(seed=7)
    enc = pyhdc.MAP_C(dimension=10_000, generator=gen)

    # One batched call from the freshly seeded generator
    batch = enc.generate(size=(10_000, 8), use_generator=True)

    # Rewind, then draw the same 8 vectors one at a time
    gen.reset()
    columns = [enc.generate(size=10_000, use_generator=True) for _ in range(8)]
    loop = np.stack([c.data for c in columns], axis=-1)

    print(np.array_equal(batch.data, loop))  # True

**Use** ``axis=`` **for reproducible bundling.** The deprecated ``batch_dim``
bundling carries no fixed-seed guarantee, because tie-randomizing bundlers
draw fresh random values at tie coordinates. The ``axis=`` form reduces in
place without that extra draw, so it is the reproducible and preferred form.
See :ref:`similarity-batched` for the matching axis contract on the read side.

Choosing a generator for reproducibility
------------------------------------------

All built-in generator families accept a ``seed`` parameter. Recommended
choices:

* **PCG** (``CommonPCGGenerators.pcg32``): best statistical quality, fully
  reproducible
* **LCG** (``CommonLCGGenerators.numerical_recipes``): simplest, most portable
* **Xorshift** (``CommonXorshiftGenerators.xorshift64``): very fast for large
  batches

See :doc:`../user_manual/generators` for a full comparison.
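
As a footnote to the fast-path discussion under *Reproducible batched
generation*, the sketch below uses plain NumPy to show why a block draw and a
per-vector loop can disagree value-for-value under the same seed. It does not
call pyhdc and makes no claim about pyhdc's internals; the names ``D``, ``N``,
``block``, and ``loop`` are illustrative only, and ``RandomState`` stands in
for NumPy's seeded legacy random state.

```python
import numpy as np

D, N = 4, 3  # tiny dimension and batch size, chosen for readability only

# Block draw: one vectorized call fills a (D, N) array in row-major order,
# so consecutive stream values land along a row (across the batch axis).
rng = np.random.RandomState(42)
block = rng.random_sample((D, N))

# Per-vector loop: N successive length-D draws, stacked as columns,
# so consecutive stream values land down a column (along the dimension axis).
rng = np.random.RandomState(42)
loop = np.stack([rng.random_sample(D) for _ in range(N)], axis=-1)

# Same shape and same seed, but the values sit in different positions.
same_layout = np.array_equal(block, loop)

# Reading each array back in the order it was filled recovers one stream.
same_stream = np.array_equal(block.ravel(), loop.T.ravel())

print(same_layout, same_stream)  # False True
```

Both arrays consume the same twelve values from the stream; only the position
each value is written to differs. This is the sense in which the two
strategies walk the random stream in different orders.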