Tutorial 1: Encoding Text for Classification

This tutorial builds a simple language classifier using PyHDC. The classifier distinguishes Python keywords from common English nouns by encoding each word as a hypervector and comparing it to class prototypes. It covers building a character codebook, n-gram encoding, prototype construction, and similarity classification.

Prerequisites: Five-Minute Quickstart


What we are building

Given an unknown word, we want to predict whether it is a Python keyword (if, for, class, …) or an ordinary English noun (cat, house, river, …).

The HDC approach:

  1. Build a codebook: one random hypervector per character of the alphabet.

  2. Encode each word as a trigram hypervector (bind adjacent characters, bundle across all windows).

  3. Build a class prototype for each category by bundling all its word vectors.

  4. Classify a new word by comparing it to each prototype.

This is a one-pass algorithm: no gradient descent, no epochs.


Step 1: Set up the encoding and character codebook

import pyhdc
import string

# Use MAP_B with 10,000 dimensions
enc = pyhdc.MAP_B(dimension=10_000)

# One random hypervector per lowercase letter, digit, and underscore
alphabet = string.ascii_lowercase + string.digits + '_'
char_hv  = {ch: enc.generate() for ch in alphabet}

Every character gets its own random hypervector. Because the hypervectors are drawn independently in a 10,000-dimensional space, any two of them are nearly orthogonal, so their similarity is close to zero.
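
The near-orthogonality of independent random vectors can be checked without PyHDC. The sketch below assumes bipolar (+1/-1) vectors compared by cosine similarity, a common choice for MAP-style encodings; PyHDC's internal representation may differ, and random_bipolar and cosine are illustrative helpers, not library functions.

```python
import math
import random

def random_bipolar(dim, rng):
    """Return a list of dim independent +1/-1 values."""
    return [rng.choice((-1, 1)) for _ in range(dim)]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

rng = random.Random(0)
a = random_bipolar(10_000, rng)
b = random_bipolar(10_000, rng)

sim_ab = cosine(a, b)   # close to 0: the standard deviation is 1/sqrt(10_000) = 0.01
sim_aa = cosine(a, a)   # a vector compared with itself
print(f"a vs b: {sim_ab:+.4f}, a vs a: {sim_aa:.4f}")
```

The similarity of two unrelated vectors is noise centered on zero, which is what lets distinct characters act as independent symbols.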


Step 2: Encode a word as an n-gram hypervector

A trigram is a window of three consecutive characters. We encode each trigram by binding its three character vectors into a single vector that is dissimilar to each of them, then bundle all of the word's trigrams into one vector.

def encode_word(word, enc, char_hv, n=3):
    """Return a hypervector representing the n-gram profile of a word."""
    word = word.lower()
    if len(word) < n:
        # Pad short words on the right with '_' so they yield one window
        word = word.ljust(n, '_')

    ngram_hvs = []
    for i in range(len(word) - n + 1):
        gram = word[i:i + n]
        # Bind the n character vectors of this window into one hypervector
        hv = char_hv[gram[0]]
        for ch in gram[1:]:
            hv = hv.bind(char_hv[ch])
        ngram_hvs.append(hv)

    # Bundle all n-grams into one word-level hypervector
    return pyhdc.bundle(*ngram_hvs)

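Why binding? Binding combines the characters of a window into one hypervector that is dissimilar to each input, so the trigram “cat” gets its own near-orthogonal representation instead of resembling c, a, or t. Whether the result also depends on character order is a property of the bind operation: elementwise multiplication (the usual MAP binding) and circular convolution (HRR) are both commutative, so with those operations bind(a, bind(b, c)) equals bind(b, bind(a, c)), and positional information has to come from permuting each character vector by its position before binding. Check how your encoding's bind behaves before relying on order. Bundling then aggregates all windows into a single vector of the same dimension, however many trigrams the word has.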
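
The effect of operand order can be checked directly. As a sketch, assuming bipolar vectors bound by elementwise multiplication (the usual MAP-style binding; PyHDC's bind may be implemented differently), the product is identical for every ordering of the operands. A common way to inject position is to rotate each vector by its index before binding. The helpers bind, rotate, and bind_ordered below are illustrative, not PyHDC functions.

```python
import random

def random_bipolar(dim, rng):
    return [rng.choice((-1, 1)) for _ in range(dim)]

def bind(u, v):
    """Elementwise multiplication: commutative and associative."""
    return [x * y for x, y in zip(u, v)]

def rotate(u, k):
    """Cyclically shift u to the right by k positions (a simple permutation)."""
    k %= len(u)
    return u[-k:] + u[:-k] if k else list(u)

def bind_ordered(vectors):
    """Bind vectors after rotating the i-th one by i positions."""
    out = rotate(vectors[0], 0)
    for i, v in enumerate(vectors[1:], start=1):
        out = bind(out, rotate(v, i))
    return out

def cosine(u, v):
    # Bipolar vectors all have norm sqrt(dim), so cosine = dot / dim.
    return sum(x * y for x, y in zip(u, v)) / len(u)

rng = random.Random(1)
c, a, t = (random_bipolar(10_000, rng) for _ in range(3))

plain_cat = bind(bind(c, a), t)
plain_act = bind(bind(a, c), t)
print(plain_cat == plain_act)  # True: plain multiplication ignores order

ordered_cat = bind_ordered([c, a, t])
ordered_act = bind_ordered([a, c, t])
order_sim = cosine(ordered_cat, ordered_act)
print(f"{order_sim:+.3f}")     # near 0: the two orderings are now distinguishable
```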

Let’s verify that two different words produce dissimilar hypervectors, and that the same word produces the same hypervector:

hv_cat  = encode_word('cat',  enc, char_hv)
hv_car  = encode_word('car',  enc, char_hv)
hv_cat2 = encode_word('cat',  enc, char_hv)

print(hv_cat.similarity(hv_cat2))  # 1.0: encoding is deterministic for a fixed codebook
print(hv_cat.similarity(hv_car))   # ~0.0: each word is a single trigram, and "cat" != "car"
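
The windows that encode_word iterates over can be listed on their own, which helps when predicting how similar two words will be. A minimal sketch follows; the helper name ngrams is illustrative and mirrors the padding rule used in encode_word.

```python
def ngrams(word, n=3, pad='_'):
    """List the n-character windows of word, padding short words on the right."""
    word = word.lower().ljust(n, pad)
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(ngrams('cat'))     # ['cat']
print(ngrams('car'))     # ['car']
print(ngrams('class'))   # ['cla', 'las', 'ass']
print(ngrams('if'))      # ['if_']

shared = set(ngrams('lambda')) & set(ngrams('lamp'))
print(shared)            # {'lam'}
```

A three-letter word has exactly one trigram, so two three-letter words share a trigram only if they are identical. Longer words overlap whenever they contain a common three-character run.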

Step 3: Build class prototypes

A class prototype is the bundle of all word hypervectors that belong to that class. It approximates the class as a whole: a query word that shares trigrams with the class members scores a higher similarity against the prototype than an unrelated word does.

python_keywords = [
    'false', 'none', 'true', 'and', 'as', 'assert', 'async', 'await',
    'break', 'class', 'continue', 'def', 'del', 'elif', 'else', 'except',
    'finally', 'for', 'from', 'global', 'if', 'import', 'in', 'is',
    'lambda', 'nonlocal', 'not', 'or', 'pass', 'raise', 'return', 'try',
    'while', 'with', 'yield',
]

english_nouns = [
    'cat', 'dog', 'house', 'river', 'cloud', 'tree', 'book', 'chair',
    'table', 'stone', 'light', 'water', 'music', 'road', 'city', 'field',
    'door', 'window', 'flower', 'bridge', 'mountain', 'ocean', 'forest',
    'garden', 'street', 'candle', 'bottle', 'letter', 'mirror', 'paper',
]

# Build one prototype per class
kw_prototype   = pyhdc.bundle(*[encode_word(w, enc, char_hv) for w in python_keywords])
noun_prototype = pyhdc.bundle(*[encode_word(w, enc, char_hv) for w in english_nouns])
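
To make the idea of a prototype concrete, here is a sketch that assumes bundling is an elementwise sum followed by a sign threshold (a majority vote), which is a common choice for bipolar vectors. PyHDC's bundle may normalize or break ties differently; bundle_majority is an illustrative stand-in, and the random vectors stand in for encoded words.

```python
import random

def random_bipolar(dim, rng):
    return [rng.choice((-1, 1)) for _ in range(dim)]

def bundle_majority(vectors, rng):
    """Elementwise majority vote; ties are broken at random."""
    out = []
    for column in zip(*vectors):
        total = sum(column)
        if total == 0:
            out.append(rng.choice((-1, 1)))
        else:
            out.append(1 if total > 0 else -1)
    return out

def cosine(u, v):
    # Valid for bipolar vectors, whose norm is sqrt(dim).
    return sum(x * y for x, y in zip(u, v)) / len(u)

rng = random.Random(2)
members = [random_bipolar(10_000, rng) for _ in range(5)]
outsider = random_bipolar(10_000, rng)
prototype = bundle_majority(members, rng)

member_sim = cosine(members[0], prototype)
outsider_sim = cosine(outsider, prototype)
print(f"member: {member_sim:+.3f}, outsider: {outsider_sim:+.3f}")
```

Each bundled vector stays clearly more similar to the prototype than an unrelated vector does. That gap narrows as more vectors are bundled, which is one reason larger classes benefit from higher dimensions.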

Step 4: Classify a new word

To classify a query word, encode it and compare it to each prototype. The class with the highest similarity wins.

def classify(word, enc, char_hv, prototypes):
    hv = encode_word(word, enc, char_hv)
    scores = {cls: hv.similarity(proto) for cls, proto in prototypes.items()}
    return max(scores, key=scores.get), scores

prototypes = {'keyword': kw_prototype, 'noun': noun_prototype}

for word in ['import', 'lamp', 'yield', 'stone', 'while', 'mirror']:
    pred, scores = classify(word, enc, char_hv, prototypes)
    print(f"{word:10s} -> {pred:10s}  "
          f"(keyword={scores['keyword']:+.3f}, noun={scores['noun']:+.3f})")

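Expected output (illustrative; exact values depend on the random codebook). Every word except lamp also appears in the prototype lists, so this is mainly a sanity check. lamp shares the trigram lam with lambda and no trigram with any listed noun, so its prediction may differ from the one shown: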

import     -> keyword     (keyword=+0.312, noun=-0.021)
lamp       -> noun        (keyword=-0.018, noun=+0.287)
yield      -> keyword     (keyword=+0.289, noun=+0.003)
stone      -> noun        (keyword=-0.014, noun=+0.301)
while      -> keyword     (keyword=+0.315, noun=-0.009)
mirror     -> noun        (keyword=-0.022, noun=+0.294)
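
The winning class alone hides how close the decision was. A small helper (the hypothetical name top_margin is not part of PyHDC) can report the gap between the best and second-best scores returned by classify, so that near-ties can be flagged for inspection:

```python
def top_margin(scores):
    """Return (best_class, margin between best and second-best score)."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best_cls, best), (_, runner_up) = ranked[0], ranked[1]
    return best_cls, best - runner_up

scores = {'keyword': 0.312, 'noun': -0.021}   # values from the sample output above
cls, margin = top_margin(scores)
print(f"{cls}: margin={margin:.3f}")          # keyword: margin=0.333
```

A margin close to zero means the word shares little with either prototype, and the prediction is essentially decided by noise.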

Step 5: Measure accuracy

Let’s hold out 20% of the data and measure accuracy on it.

import random

random.seed(0)
all_data = (
    [(w, 'keyword') for w in python_keywords] +
    [(w, 'noun')    for w in english_nouns]
)
random.shuffle(all_data)

split   = int(0.8 * len(all_data))
train   = all_data[:split]
test    = all_data[split:]

# Rebuild prototypes from training set only
kw_proto_train   = pyhdc.bundle(*[encode_word(w, enc, char_hv)
                                  for w, label in train if label == 'keyword'])
noun_proto_train = pyhdc.bundle(*[encode_word(w, enc, char_hv)
                                  for w, label in train if label == 'noun'])
protos_train = {'keyword': kw_proto_train, 'noun': noun_proto_train}

correct = sum(
    classify(w, enc, char_hv, protos_train)[0] == label
    for w, label in test
)
print(f"Accuracy: {correct}/{len(test)} = {correct/len(test):.0%}")

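With dimension=10_000 you should see accuracy of around 90-100% on this small dataset. The test set holds only 13 of the 65 words, so each misclassified word changes the result by about 8 percentage points.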
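
A plain shuffle can leave the two classes unevenly represented in a test set this small. One option, sketched here with the standard library only, is to split each class separately so that both sides keep roughly the same class ratio. stratified_split is an illustrative helper, and the short word lists are stand-ins for python_keywords and english_nouns.

```python
import random
from collections import defaultdict

def stratified_split(pairs, train_fraction=0.8, seed=0):
    """Split (item, label) pairs so each label keeps about the same ratio."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item, label in pairs:
        by_label[label].append(item)

    train, test = [], []
    for label, items in by_label.items():
        rng.shuffle(items)
        cut = int(train_fraction * len(items))
        train += [(w, label) for w in items[:cut]]
        test += [(w, label) for w in items[cut:]]
    return train, test

data = ([(w, 'keyword') for w in ['if', 'for', 'def', 'del', 'try']] +
        [(w, 'noun') for w in ['cat', 'dog', 'tree', 'book', 'road']])
train, test = stratified_split(data)
print(len(train), len(test))                    # 8 2
print(sorted(label for _, label in test))       # ['keyword', 'noun']
```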


Experiment: effect of dimension

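HDC accuracy generally improves with dimension, because the similarity noise between unrelated vectors shrinks as the dimension grows. Try re-running with smaller dimensions. For brevity, this loop builds the prototypes from all words and scores the same words, so it reports training accuracy: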

for dim in [500, 1_000, 2_000, 5_000, 10_000]:
    enc_d    = pyhdc.MAP_B(dimension=dim)
    char_d   = {ch: enc_d.generate() for ch in alphabet}
    kw_p     = pyhdc.bundle(*[encode_word(w, enc_d, char_d) for w in python_keywords])
    noun_p   = pyhdc.bundle(*[encode_word(w, enc_d, char_d) for w in english_nouns])
    protos_d = {'keyword': kw_p, 'noun': noun_p}
    c = sum(classify(w, enc_d, char_d, protos_d)[0] == label for w, label in all_data)
    print(f"dim={dim:6d}: {c}/{len(all_data)} = {c/len(all_data):.0%}")

You should see accuracy rise with dimension and level off somewhere around 5,000-10,000. Exact numbers vary from run to run because each codebook is random.
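
One reason dimension matters is that the similarity between unrelated vectors is noise whose spread shrinks roughly as 1/sqrt(dimension). The sketch below estimates that spread for random bipolar vectors, stored as the bits of a Python integer so that cosine similarity reduces to counting differing bits. It illustrates the general principle under that assumption and does not describe PyHDC's internals; random_pair_similarity is a hypothetical helper.

```python
import random
import statistics

def random_pair_similarity(dim, rng):
    """Cosine similarity of two random bipolar vectors packed as bit strings."""
    a = rng.getrandbits(dim)
    b = rng.getrandbits(dim)
    differing = bin(a ^ b).count('1')      # positions where the signs disagree
    return 1 - 2 * differing / dim         # (agreements - disagreements) / dim

rng = random.Random(3)
spread = {}
for dim in (500, 1_000, 2_000, 5_000, 10_000):
    sims = [random_pair_similarity(dim, rng) for _ in range(300)]
    spread[dim] = statistics.pstdev(sims)
    print(f"dim={dim:6d}: std of similarity = {spread[dim]:.4f} "
          f"(1/sqrt(dim) = {dim ** -0.5:.4f})")
```

When the gap between the two prototype scores is smaller than this noise level, the prediction becomes unreliable, which is why small dimensions hurt accuracy.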


Switching encoding families

The encode/classify API is completely independent of the encoding family. Replace MAP_B with HRR and nothing else changes:

enc_hrr  = pyhdc.HRR(dimension=10_000)
char_hrr = {ch: enc_hrr.generate() for ch in alphabet}
kw_hrr   = pyhdc.bundle(*[encode_word(w, enc_hrr, char_hrr) for w in python_keywords])
noun_hrr = pyhdc.bundle(*[encode_word(w, enc_hrr, char_hrr) for w in english_nouns])
# ... same classify() function works unchanged

HRR uses circular convolution and correlation internally, but its bind, bundle, and similarity operations expose the same interface as MAP_B's, so encode_word and classify run unchanged.
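
For readers curious about what differs under the hood, here is a small sketch of HRR-style binding as circular convolution and unbinding as circular correlation, following the usual textbook definitions. It uses plain Python lists at a small dimension for readability. It is not PyHDC's implementation, and all function names are illustrative.

```python
import math
import random

def random_hrr(dim, rng):
    """Elements drawn from N(0, 1/dim), so the expected squared norm is 1."""
    return [rng.gauss(0, 1 / math.sqrt(dim)) for _ in range(dim)]

def circular_convolve(u, v):
    """Binding: out[k] = sum_j u[j] * v[(k - j) mod dim]."""
    dim = len(u)
    return [sum(u[j] * v[(k - j) % dim] for j in range(dim)) for k in range(dim)]

def circular_correlate(u, w):
    """Approximate unbinding: out[k] = sum_j u[j] * w[(k + j) mod dim]."""
    dim = len(u)
    return [sum(u[j] * w[(k + j) % dim] for j in range(dim)) for k in range(dim)]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / math.sqrt(sum(x * x for x in u) * sum(y * y for y in v))

rng = random.Random(4)
dim = 512
a, b, unrelated = (random_hrr(dim, rng) for _ in range(3))

bound = circular_convolve(a, b)
recovered = circular_correlate(a, bound)   # noisy estimate of b

sim_recovered = cosine(recovered, b)
sim_unrelated = cosine(recovered, unrelated)
print(f"recovered vs b: {sim_recovered:+.3f}, "
      f"recovered vs unrelated: {sim_unrelated:+.3f}")
```

The recovered vector is only an approximation of b, but it is far closer to b than to an unrelated vector, which is enough for similarity-based lookup. Circular convolution is commutative, so swapping a and b produces the same bound vector.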



Where to go next