Tutorial 1: Encoding Text for Classification
=============================================

This tutorial builds a simple word classifier using PyHDC. The classifier
distinguishes Python keywords from common English nouns by encoding each word
as a hypervector and comparing it to class prototypes. It covers building a
character codebook, n-gram encoding, prototype construction, and
similarity-based classification.

**Prerequisites**: :doc:`../getting_started/quickstart`

----

What we are building
--------------------

Given an unknown word, we want to predict whether it is a Python keyword
(``if``, ``for``, ``class``, …) or an ordinary English noun (``cat``,
``house``, ``river``, …).

The HDC approach:

1. Build a codebook: one random hypervector per character of the alphabet.
2. Encode each word as a trigram hypervector (bind adjacent characters,
   bundle across all windows).
3. Build a class prototype for each category by bundling all its word vectors.
4. Classify a new word by comparing it to each prototype.

This is a one-pass algorithm: no gradient descent, no epochs.

----

Step 1: Set up the encoding and character codebook
---------------------------------------------------

.. code-block:: python

   import pyhdc
   import string

   # Use MAP_B with 10,000 dimensions
   enc = pyhdc.MAP_B(dimension=10_000)

   # One random hypervector per character: lowercase letters, digits, and '_'
   alphabet = string.ascii_lowercase + string.digits + '_'
   char_hv  = {ch: enc.generate() for ch in alphabet}

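Every character gets its own random hypervector. Because the hypervectors are
drawn independently in a high-dimensional space, any two of them are nearly
orthogonal: their similarity is close to zero, with random fluctuations that
shrink as the dimension grows.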
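
To put a number on "nearly orthogonal", the following sketch draws two random
bipolar (+1/-1) vectors using only the standard library and measures their
cosine similarity. It does not use PyHDC: ``random_bipolar`` and ``cosine`` are
illustrative helpers, and bipolar components are an assumption about how a
MAP-style encoding represents hypervectors.

.. code-block:: python

   import math
   import random

   def random_bipolar(dim, rng):
       """Return a list of `dim` independent +1/-1 components."""
       return [rng.choice((-1, 1)) for _ in range(dim)]

   def cosine(u, v):
       """Cosine similarity of two equal-length vectors."""
       dot = sum(a * b for a, b in zip(u, v))
       return dot / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))

   rng = random.Random(0)
   dim = 10_000
   a = random_bipolar(dim, rng)
   b = random_bipolar(dim, rng)

   print(cosine(a, a))   # 1.0: a vector always matches itself
   print(cosine(a, b))   # close to 0; typical magnitude is about 1/sqrt(dim) = 0.01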

----

Step 2: Encode a word as an n-gram hypervector
-----------------------------------------------

A *trigram* is a window of three consecutive characters. We encode each
trigram by binding its three character vectors into a single vector, then
bundle all trigrams of the word into one word-level vector.

.. code-block:: python

   def encode_word(word, enc, char_hv, n=3):
       """Return a hypervector representing the n-gram profile of a word."""
       word = word.lower()
       if len(word) < n:
           # For short words, pad with a special placeholder
           word = word.ljust(n, '_')

       trigram_hvs = []
       for i in range(len(word) - n + 1):
           trigram = word[i:i + n]
           # Bind the n character vectors of this window into one vector
           hv = char_hv[trigram[0]]
           for ch in trigram[1:]:
               hv = hv.bind(char_hv[ch])
           trigram_hvs.append(hv)

       # Bundle all trigrams into one word-level hypervector
       return pyhdc.bundle(*trigram_hvs)

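Why binding? Binding combines the characters of a window into a single vector
that is dissimilar to each of its inputs, so a trigram is represented as a
unit rather than as three loose characters. Be aware that in MAP-style
encodings binding is typically elementwise multiplication, which is
commutative: ``bind(a, bind(b, c))`` equals ``bind(b, bind(a, c))``, so binding
alone does not distinguish ``cat`` from ``act``. Making the encoding sensitive
to order requires marking each position, usually with a permutation, before
binding. Bundling then aggregates all windows, so every word maps to a vector
of the same dimension regardless of how many trigrams it contains.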
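
Whether binding alone captures character order depends on the binding
operator. The sketch below uses plain Python lists rather than PyHDC and
assumes binding is elementwise multiplication of bipolar vectors, as in
MAP-style encodings. Under that assumption binding is commutative, and a
position-dependent cyclic shift is one common way to make an n-gram
order-sensitive. ``permute`` and ``encode_ngram`` are hypothetical helpers, not
PyHDC functions.

.. code-block:: python

   import random

   def random_bipolar(dim, rng):
       return [rng.choice((-1, 1)) for _ in range(dim)]

   def bind(u, v):
       """Elementwise multiplication (commutative and associative)."""
       return [a * b for a, b in zip(u, v)]

   def permute(u, shift):
       """Cyclically rotate the components right by `shift` positions."""
       shift %= len(u)
       return u[-shift:] + u[:-shift] if shift else list(u)

   def cosine(u, v):
       # All components are +1/-1, so both norms equal sqrt(len(u)).
       return sum(a * b for a, b in zip(u, v)) / len(u)

   def encode_ngram(chars, codebook, ordered):
       """Bind the character vectors of one n-gram, optionally marking position."""
       n = len(chars)
       result = None
       for pos, ch in enumerate(chars):
           hv = codebook[ch]
           if ordered:
               hv = permute(hv, n - 1 - pos)   # the first character is shifted the most
           result = hv if result is None else bind(result, hv)
       return result

   rng = random.Random(1)
   codebook = {ch: random_bipolar(2_000, rng) for ch in "act"}

   # Without position marking, "cat" and "act" collapse to the same vector.
   print(cosine(encode_ngram("cat", codebook, False),
                encode_ngram("act", codebook, False)))   # 1.0
   # With position marking, the two trigrams become nearly orthogonal.
   print(cosine(encode_ngram("cat", codebook, True),
                encode_ngram("act", codebook, True)))    # close to 0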

Let's verify that two different words produce dissimilar hypervectors, and
that the same word produces the same hypervector:

.. code-block:: python

   hv_cat  = encode_word('cat',  enc, char_hv)
   hv_car  = encode_word('car',  enc, char_hv)
   hv_cat2 = encode_word('cat',  enc, char_hv)

   print(hv_cat.similarity(hv_cat2))  # 1.0: encoding is deterministic for a fixed codebook
   print(hv_cat.similarity(hv_car))   # ~= 0.0: each word is a single trigram ("cat" vs "car"), and they differ

----

Step 3: Build class prototypes
-------------------------------

A *class prototype* is the bundle of all word hypervectors that belong to
that class. Because a bundle stays similar to each of its inputs, the
prototype acts as a summary of the class: words that were bundled into it, or
that share many trigrams with them, score noticeably higher similarity than
unrelated words.

.. code-block:: python

   python_keywords = [
       'false', 'none', 'true', 'and', 'as', 'assert', 'async', 'await',
       'break', 'class', 'continue', 'def', 'del', 'elif', 'else', 'except',
       'finally', 'for', 'from', 'global', 'if', 'import', 'in', 'is',
       'lambda', 'nonlocal', 'not', 'or', 'pass', 'raise', 'return', 'try',
       'while', 'with', 'yield',
   ]

   english_nouns = [
       'cat', 'dog', 'house', 'river', 'cloud', 'tree', 'book', 'chair',
       'table', 'stone', 'light', 'water', 'music', 'road', 'city', 'field',
       'door', 'window', 'flower', 'bridge', 'mountain', 'ocean', 'forest',
       'garden', 'street', 'candle', 'bottle', 'letter', 'mirror', 'paper',
   ]

   # Build one prototype per class
   kw_prototype   = pyhdc.bundle(*[encode_word(w, enc, char_hv) for w in python_keywords])
   noun_prototype = pyhdc.bundle(*[encode_word(w, enc, char_hv) for w in english_nouns])
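
As a rough intuition for why a bundle stays similar to its members, the sketch
below implements bundling as an elementwise majority vote over bipolar
vectors, which is one common choice. This is an assumption made for
illustration; the tutorial does not state how ``pyhdc.bundle`` is implemented,
and the helpers here are plain Python, not PyHDC calls.

.. code-block:: python

   import random

   def random_bipolar(dim, rng):
       return [rng.choice((-1, 1)) for _ in range(dim)]

   def bundle(vectors, rng):
       """Elementwise majority vote; ties are broken at random."""
       out = []
       for column in zip(*vectors):
           total = sum(column)
           out.append(1 if total > 0 else -1 if total < 0 else rng.choice((-1, 1)))
       return out

   def cosine(u, v):
       # All components are +1/-1, so both norms equal sqrt(len(u)).
       return sum(a * b for a, b in zip(u, v)) / len(u)

   rng = random.Random(2)
   dim = 10_000
   members = [random_bipolar(dim, rng) for _ in range(9)]
   outsider = random_bipolar(dim, rng)
   prototype = bundle(members, rng)

   print(cosine(prototype, members[0]))   # clearly positive, around 0.27 for nine members
   print(cosine(prototype, outsider))     # close to 0: not part of the bundle

Under this majority-vote assumption, the similarity to each member shrinks as
more vectors are bundled, which is why a word that belongs to a prototype
still scores well below 1.0 against it.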

----

Step 4: Classify a new word
----------------------------

To classify a query word, encode it and compare it to each prototype. The
class with the highest similarity wins.

.. code-block:: python

   def classify(word, enc, char_hv, prototypes):
       hv = encode_word(word, enc, char_hv)
       scores = {cls: hv.similarity(proto) for cls, proto in prototypes.items()}
       return max(scores, key=scores.get), scores

   prototypes = {'keyword': kw_prototype, 'noun': noun_prototype}

   for word in ['import', 'lamp', 'yield', 'stone', 'while', 'mirror']:
       pred, scores = classify(word, enc, char_hv, prototypes)
       print(f"{word:10s} -> {pred:10s}  "
             f"(keyword={scores['keyword']:+.3f}, noun={scores['noun']:+.3f})")

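Illustrative output (exact values depend on the random codebook). All of these
words except ``lamp`` appear in the training lists. ``lamp`` is unseen, and its
only trigram overlap with the training data is ``lam`` from the keyword
``lambda``, so its scores in particular may differ from those shown: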

.. code-block:: text

   import     -> keyword     (keyword=+0.312, noun=-0.021)
   lamp       -> noun        (keyword=-0.018, noun=+0.287)
   yield      -> keyword     (keyword=+0.289, noun=+0.003)
   stone      -> noun        (keyword=-0.014, noun=+0.301)
   while      -> keyword     (keyword=+0.315, noun=-0.009)
   mirror     -> noun        (keyword=-0.022, noun=+0.294)
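
A word that shares no trigrams with either class scores close to zero against
both prototypes, and picking the maximum then amounts to a coin flip. One
possible refinement, sketched here, is to report ``unknown`` when the gap
between the two best scores is small. ``classify_with_margin`` is a
hypothetical helper that works on a ``scores`` dictionary like the one
returned by ``classify()``, and the ``min_margin`` threshold of 0.05 is an
arbitrary illustrative value.

.. code-block:: python

   def classify_with_margin(scores, min_margin=0.05):
       """Return (label, margin); label is 'unknown' if the top two scores are too close.

       `scores` maps class name -> similarity and must contain at least two classes.
       """
       ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
       (best, best_score), (_, runner_up_score) = ranked[0], ranked[1]
       margin = best_score - runner_up_score
       label = best if margin >= min_margin else "unknown"
       return label, margin

   label, margin = classify_with_margin({"keyword": 0.312, "noun": -0.021})
   print(label)   # keyword

   label, margin = classify_with_margin({"keyword": 0.004, "noun": -0.003})
   print(label)   # unknown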

----

Step 5: Measure accuracy
-------------------------

Let's hold out 20% of the data and measure accuracy on it.

.. code-block:: python

   import random

   random.seed(0)
   all_data = (
       [(w, 'keyword') for w in python_keywords] +
       [(w, 'noun')    for w in english_nouns]
   )
   random.shuffle(all_data)

   split   = int(0.8 * len(all_data))
   train   = all_data[:split]
   test    = all_data[split:]

   # Rebuild prototypes from training set only
   kw_proto_train   = pyhdc.bundle(*[encode_word(w, enc, char_hv)
                                     for w, label in train if label == 'keyword'])
   noun_proto_train = pyhdc.bundle(*[encode_word(w, enc, char_hv)
                                     for w, label in train if label == 'noun'])
   protos_train = {'keyword': kw_proto_train, 'noun': noun_proto_train}

   correct = sum(
       classify(w, enc, char_hv, protos_train)[0] == label
       for w, label in test
   )
   print(f"Accuracy: {correct}/{len(test)} = {correct/len(test):.0%}")

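With ``dimension=10_000`` you should see accuracy of around 90-100% on this
small dataset. Treat the figure as a rough indication: the test set holds only
13 of the 65 words, so each misclassification shifts the result by about 8
percentage points, and the outcome depends on the shuffle seed and the random
codebook. Held-out words can only be recognized through trigrams they share
with the training words.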
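
A plain shuffle can leave the two classes unevenly represented in such a small
test set. The following sketch shows one way to hold out the same fraction of
each class. ``stratified_split`` is a hypothetical helper written for this
illustration, and the placeholder words stand in for the two lists above.

.. code-block:: python

   import random

   def stratified_split(items, test_fraction=0.2, seed=0):
       """Split (word, label) pairs so each label keeps roughly the same share."""
       rng = random.Random(seed)
       by_label = {}
       for word, label in items:
           by_label.setdefault(label, []).append((word, label))

       train, test = [], []
       for label in sorted(by_label):            # sorted for reproducibility
           group = by_label[label]
           rng.shuffle(group)
           n_test = max(1, round(test_fraction * len(group)))
           test.extend(group[:n_test])
           train.extend(group[n_test:])
       return train, test

   # Placeholder data with the same class sizes as the tutorial lists (35 and 30).
   data = ([(f"kw{i}", "keyword") for i in range(35)] +
           [(f"noun{i}", "noun") for i in range(30)])
   train, test = stratified_split(data)
   print(len(train), len(test))   # 52 13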

----

Experiment: effect of dimension
---------------------------------

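HDC accuracy generally improves with dimension, because the random similarity
between unrelated hypervectors shrinks as the dimension grows. Note that the
loop below scores the same words that were used to build the prototypes, so it
measures training accuracy rather than held-out accuracy. Try re-running with
smaller dimensions: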

.. code-block:: python

   for dim in [500, 1_000, 2_000, 5_000, 10_000]:
       enc_d    = pyhdc.MAP_B(dimension=dim)
       char_d   = {ch: enc_d.generate() for ch in alphabet}
       kw_p     = pyhdc.bundle(*[encode_word(w, enc_d, char_d) for w in python_keywords])
       noun_p   = pyhdc.bundle(*[encode_word(w, enc_d, char_d) for w in english_nouns])
       protos_d = {'keyword': kw_p, 'noun': noun_p}
       c = sum(classify(w, enc_d, char_d, protos_d)[0] == label for w, label in all_data)
       print(f"dim={dim:6d}: {c}/{len(all_data)} = {c/len(all_data):.0%}")

You should see accuracy rise with dimension and then level off, typically
somewhere around 5,000-10,000. Exact numbers vary from run to run because each
pass draws a fresh random codebook.
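
The reason dimension matters can be checked without PyHDC. The sketch below
assumes bipolar hypervectors and measures how large the similarity between
unrelated random vectors is at each dimension; it should track
``1/sqrt(dim)``. The helper names are illustrative.

.. code-block:: python

   import math
   import random

   def random_bipolar(dim, rng):
       return [rng.choice((-1, 1)) for _ in range(dim)]

   def cosine(u, v):
       # All components are +1/-1, so both norms equal sqrt(len(u)).
       return sum(a * b for a, b in zip(u, v)) / len(u)

   rng = random.Random(3)
   noise = {}
   for dim in [500, 1_000, 2_000, 5_000, 10_000]:
       # Root-mean-square similarity over 30 pairs of unrelated vectors
       sims = [cosine(random_bipolar(dim, rng), random_bipolar(dim, rng))
               for _ in range(30)]
       noise[dim] = math.sqrt(sum(s * s for s in sims) / len(sims))
       print(f"dim={dim:6d}: measured noise {noise[dim]:.4f}, "
             f"1/sqrt(dim) = {1 / math.sqrt(dim):.4f}")

Smaller noise means a genuine trigram match stands out more clearly from
chance similarity, which is the effect the accuracy experiment probes.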

----

Switching encoding families
-----------------------------

The encode/classify API is completely independent of the encoding family.
Replace ``MAP_B`` with ``HRR`` and nothing else changes:

.. code-block:: python

   enc_hrr  = pyhdc.HRR(dimension=10_000)
   char_hrr = {ch: enc_hrr.generate() for ch in alphabet}
   kw_hrr   = pyhdc.bundle(*[encode_word(w, enc_hrr, char_hrr) for w in python_keywords])
   noun_hrr = pyhdc.bundle(*[encode_word(w, enc_hrr, char_hrr) for w in english_nouns])
   # ... same classify() function works unchanged
    
HRR uses circular convolution and correlation internally, but ``.bind()`` and
``.similarity()`` keep the same interface, and ``pyhdc.bundle()`` is called in
the same way.
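
For readers unfamiliar with the operations mentioned above, here is a minimal
standard-library sketch of HRR-style binding by circular convolution and
unbinding by circular correlation. It is not PyHDC's implementation: the
function names are illustrative, the vectors use Gaussian components as an
assumption, and the direct O(n^2) loop is chosen for readability
(FFT-based versions are usual in practice).

.. code-block:: python

   import math
   import random

   def random_hrr(dim, rng):
       """Gaussian components with variance 1/dim, so the norm is about 1."""
       return [rng.gauss(0.0, 1.0 / math.sqrt(dim)) for _ in range(dim)]

   def circular_convolution(a, b):
       """Bind: c[k] = sum_j a[j] * b[(k - j) mod n]."""
       n = len(a)
       return [sum(a[j] * b[(k - j) % n] for j in range(n)) for k in range(n)]

   def involution(a):
       """Approximate inverse used for unbinding: result[j] = a[-j mod n]."""
       return [a[0]] + a[:0:-1]

   def cosine(u, v):
       dot = sum(x * y for x, y in zip(u, v))
       return dot / math.sqrt(sum(x * x for x in u) * sum(y * y for y in v))

   rng = random.Random(4)
   dim = 512
   key, value, other = (random_hrr(dim, rng) for _ in range(3))

   bound = circular_convolution(key, value)
   # Convolving with the involution of `key` is circular correlation.
   recovered = circular_convolution(involution(key), bound)

   print(cosine(bound, value))       # near 0: the bound vector resembles neither input
   print(cosine(recovered, value))   # well above 0: a noisy copy of `value`
   print(cosine(recovered, other))   # near 0

Circular convolution is commutative, so the remarks about position marking
for order-sensitive n-grams apply to this family as well.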

----

Where to go next
-----------------

* :doc:`tutorial_2_associative_memory` : store and retrieve key-value pairs
* :doc:`../how_to/choose_encoding` : when to use MAP_C vs. BSC vs. HRR
* :doc:`../user_manual/encodings_overview` : full comparison of all encoding families