Tutorial 3: GPU-Accelerated HDC with PyTorch
=============================================

PyHDC has a PyTorch backend. Every encoding can produce
``torch.Tensor``-backed hypervectors, and all operations (bundle, bind,
similarity) work identically on CPU tensors or CUDA tensors. This tutorial
covers the backend model, creating and moving GPU hypervectors, batched
generation and similarity, and how to measure the speedup.

**Prerequisites**: :doc:`tutorial_1_text_classification` (the code will be
ported to GPU in this tutorial)

----

The backend model
-----------------

A :class:`~pyhdc.Hypervector` wraps either a ``numpy.ndarray`` or a ``torch.Tensor``. 
All HDC operations dispatch to the correct backend automatically, you never call 
NumPy or PyTorch functions directly. Although, the underlying array is accessible 
via the ``.data`` property, so can be extracted at any time for use with native 
``numpy`` or ``torch`` functions.

.. code-block:: python

   import pyhdc

   # NumPy backend (default)
   enc_np  = pyhdc.MAP_B(dimension=10_000)
   hv_np   = enc_np.generate()
   print(type(hv_np.data))   # <class 'numpy.ndarray'>

   if pyhdc.TORCH_AVAILABLE:
       import torch

       # PyTorch CPU backend
       enc_cpu = pyhdc.MAP_B(dimension=10_000, backend="torch")
       hv_cpu  = enc_cpu.generate()
       print(type(hv_cpu.data))   # <class 'torch.Tensor'>
       print(hv_cpu.device)        # cpu

       # PyTorch GPU backend
       if torch.cuda.is_available():
           enc_gpu = pyhdc.MAP_B(dimension=10_000, backend="torch", device="cuda")
           hv_gpu  = enc_gpu.generate()
           print(hv_gpu.device)    # cuda:0

----

Creating a GPU encoding (with CPU fallback)
-------------------------------------------

It is good practice to fall back to CPU if CUDA is not available:

.. code-block:: python

   import pyhdc
   import torch

   def make_enc(dimension=10_000):
       if pyhdc.TORCH_AVAILABLE and torch.cuda.is_available():
           device = "cuda"
       elif pyhdc.TORCH_AVAILABLE:
           device = "cpu"
       else:
           return pyhdc.MAP_B(dimension=dimension)   # NumPy
       return pyhdc.MAP_B(dimension=dimension, backend="torch", device=device)

   enc = make_enc()
   print(enc.backend, enc.device)

----

Moving hypervectors between backends and devices
-------------------------------------------------

You can convert an existing hypervector without recreating the encoding:

.. code-block:: python

   hv_numpy = pyhdc.MAP_B(dimension=10_000).generate()

   # NumPy -> PyTorch CPU
   hv_cpu = hv_numpy.to_torch()

   # CPU -> GPU
   hv_gpu = hv_cpu.cuda()         # or .to("cuda") or .to("cuda:0")

   # GPU -> CPU
   hv_back_cpu = hv_gpu.cpu()

   # PyTorch -> NumPy
   hv_back_np = hv_back_cpu.to_numpy()

These conversions copy the data, so the original hypervector is unchanged.

----

Batched generation
-------------------

Instead of generating one hypervector at a time, generate a whole batch at
once.  On GPU this is substantially faster because the operation is fully
vectorised across the entire batch.

.. code-block:: python

   enc = make_enc()

   # Generate 10,000 hypervectors of dimension 10,000 in one call.
   # Hypervectors are dimension-first: each column is one hypervector.
   batch = enc.generate(size=(10_000, 10_000))
   print(batch.shape)    # (10000, 10000)  # (D, N)
   print(batch.backend)  # torch
   print(batch.device)   # cuda:0  (if CUDA available)

   # Index a column to get a single hypervector
   hv0 = batch[:, 0]
   print(hv0.shape)      # (10000,)

----

Batched similarity: three calling conventions
----------------------------------------------

The ``Encoding.similarity()`` and ``Hypervector.similarity()`` methods
support three calling conventions. Hypervectors are dimension-first, so a
batch is ``(D, N)`` and comparisons run column-wise over axis 0.

**1. Two 1-D hypervectors -> scalar**

.. code-block:: python

   a = enc.generate()   # shape (10000,)
   b = enc.generate()   # shape (10000,)

   sim = a.similarity(b)   # float

**2. Two 2-D batches -> 1-D array (per-column pairs)**

.. code-block:: python

   batch_a = enc.generate(size=(10_000, 100))   # shape (10000, 100)  # (D, N)
   batch_b = enc.generate(size=(10_000, 100))   # shape (10000, 100)  # (D, N)

   sims = enc.similarity(batch_a, batch_b)   # shape (100,)
   # sims[i] = similarity(batch_a[:, i], batch_b[:, i])

**3. Single 2-D batch -> 1-D array (first column vs. rest)**

.. code-block:: python

   batch = enc.generate(size=(10_000, 101))   # shape (10000, 101)  # (D, N)

   sims = enc.similarity(batch)     # shape (100,)
   # sims[i] = similarity(batch[:, 0], batch[:, i+1])

Convention 3 is useful for nearest-neighbour search: put the query in
column 0 and the codebook in columns 1+.

.. code-block:: python

   query = enc.generate()                       # shape (10000,)
   codebook = enc.generate(size=(10_000, 50))   # shape (10000, 50)  # (D, N)

   batch = pyhdc.stack([query, codebook])    # shape (10000, 51): query is column 0
   sims = enc.similarity(batch)              # shape (50,)
   best_idx = sims.argmax().item()           # index of closest match in codebook

----

Porting Tutorial 1 to GPU
--------------------------

The Tutorial 1 text classifier requires only three changes to run on GPU:

.. code-block:: python

   import pyhdc, string, torch

   # Change 1: GPU encoding
   enc = pyhdc.MAP_B(dimension=10_000, backend="torch",
                     device="cuda" if torch.cuda.is_available() else "cpu")

   alphabet = string.ascii_lowercase + string.digits + '_'
   # Change 2: codebook generation, same API
   char_hv  = {ch: enc.generate() for ch in alphabet}

   def encode_word(word, enc, char_hv, n=3):
       word = word.lower().ljust(n, '_')
       trigram_hvs = []
       for i in range(len(word) - n + 1):
           t = word[i:i+n]
           hv = char_hv[t[0]].bind(char_hv[t[1]]).bind(char_hv[t[2]])
           trigram_hvs.append(hv)
       return pyhdc.bundle(*trigram_hvs)

   python_keywords = ['false', 'none', 'true', 'and', 'for', 'if', 'import',
                       'class', 'return', 'while', 'yield', 'lambda', 'def']
   english_nouns   = ['cat', 'dog', 'house', 'river', 'cloud', 'tree', 'book',
                       'chair', 'stone', 'light', 'water', 'music', 'road']

   kw_proto   = pyhdc.bundle(*[encode_word(w, enc, char_hv) for w in python_keywords])
   noun_proto = pyhdc.bundle(*[encode_word(w, enc, char_hv) for w in english_nouns])

   def classify(word):
       hv = encode_word(word, enc, char_hv)
       # Change 3: .to_numpy() before using sklearn / printing
       kw_sim   = float(hv.similarity(kw_proto))
       noun_sim = float(hv.similarity(noun_proto))
       return 'keyword' if kw_sim > noun_sim else 'noun'

   for w in ['import', 'lamp', 'yield', 'stone']:
       print(f"{w:10s} -> {classify(w)}")

The only meaningful change is ``backend="torch", device="cuda"`` on the
encoding constructor. All operations (``.bind()``, ``.bundle()``,
``.similarity()``) work identically on GPU.

----

Benchmarking
------------

GPU becomes worthwhile for large batches and high dimensions. Here is a
simple timing comparison:

.. code-block:: python

   import time, pyhdc, torch

   D = 10_000
   N = 50_000

   enc_np  = pyhdc.MAP_B(dimension=D)
   enc_gpu = pyhdc.MAP_B(dimension=D, backend="torch",
                          device="cuda" if torch.cuda.is_available() else "cpu")

   # NumPy baseline
   t0 = time.perf_counter()
   batch_np = enc_np.generate(size=(D, N))
   t1 = time.perf_counter()
   print(f"NumPy  generate {D}x{N}: {t1-t0:.3f}s")

   # PyTorch (CPU or GPU)
   t0 = time.perf_counter()
   batch_gpu = enc_gpu.generate(size=(D, N))
   if torch.cuda.is_available():
       torch.cuda.synchronize()
   t1 = time.perf_counter()
   print(f"Torch  generate {N}x{D}: {t1-t0:.3f}s")

   # Batched similarity
   q = enc_np.generate()

   t0 = time.perf_counter()
   _ = enc_np.similarity(q, batch_np)
   t1 = time.perf_counter()
   print(f"NumPy  similarity 1x{N}: {t1-t0:.3f}s")

   q_gpu = q.to_torch(enc_gpu.device)
   t0 = time.perf_counter()
   _ = enc_gpu.similarity(q_gpu, batch_gpu)
   if torch.cuda.is_available():
       torch.cuda.synchronize()
   t1 = time.perf_counter()
   print(f"Torch  similarity 1x{N}: {t1-t0:.3f}s")

Typical observations:

* For small batches (< 1,000 vectors), NumPy and PyTorch CPU are comparable, the
  GPU may even be *slower* due to launch overhead.
* For large batches (> 10,000 vectors), GPU similarity search is 10-100x
  faster depending on hardware.

----

Common pitfalls
----------------

**Backend mismatch**

Mixing a NumPy hypervector with a PyTorch hypervector raises ``ValueError``:

.. code-block:: python

   hv_np    = pyhdc.MAP_B(dimension=10_000).generate()
   hv_torch = pyhdc.MAP_B(dimension=10_000, backend="torch").generate()

   hv_np.similarity(hv_torch)   # ValueError: backend mismatch

Fix: convert one of them first with ``.to_torch()`` or ``.to_numpy()``.

**Extracting scalars for Python arithmetic**

Similarity on a GPU tensor returns a tensor, not a Python float.
Wrap with ``float()`` when you need a Python number:

.. code-block:: python

   sim = hv_gpu.similarity(hv_gpu2)   # torch.Tensor, shape ()
   if float(sim) > 0.8:               # convert explicitly
       ...

----

Summary
-------

In this tutorial you:

* Created GPU encodings with a CPU fallback guard
* Moved hypervectors between backends and devices
* Used all three batched similarity calling conventions
* Ported Tutorial 1 to GPU with three lines changed
* Timed the GPU speedup for large batch operations

----

What's next
-----------

* :doc:`tutorial_4_sparse_binary` : binary and sparse encodings
* :doc:`../how_to/switch_backends` : quick reference for all backend/device conversions
* :doc:`../user_manual/backends` : in-depth explanation of the dual backend architecture