Learning ChatGPT Poorly: Tokenizers

(Originally posted on LinkedIn)

If you haven’t already tried, learning how ChatGPT, GPT-3.5, and GPT-4 work is not easy. There are some really good explanations and freely available papers that totally reveal the magic behind these Generative Pre-Trained Transformer (GPT) Large Language Models (LLMs). But! Even if you manage to grok all the new vocab, getting through the math is… rough. I thought it would be fun to talk about one aspect of GPT, poorly, to make the whole thing seem a little less magical. Let’s get dumb.

GPT is a machine that converts a list of numbers, like [10919 318 257 22746 30], into another list of numbers, like [32 22746 318 257 1402 46103 326 318 34850 5954 503 286 257 6877], in a way that somehow delights and terrifies the world.

“What?”

No, it is true: GPT doesn’t really “know” the lyrics to Taylor Swift songs or “know” how to program. It is a framework that takes integers as input and generates the sequence of integers it predicts is most likely to make you happy. To do this, the machine has “learned” the statistical relationships between a bunch of integer sequences during training, producing a model. The framework uses that model to generate the most likely sequence of numbers given the list of numbers provided as input.

“Really?”

Yes! That’s pretty much how it works.

Of course, I can also say Apollo 11 is a machine that takes 8 days to turn rocket fuel and sparks into a few dudes with backpacks full of moon rocks floating around in the Pacific Ocean. I mean, we’re definitely missing some details, but… kind of true?

“What’s up with these integers? I thought you asked ChatGPT things and it gives answers?”

It does! But machine learning and AI are very much math… like, that’s all it is, just a ton of ridiculous math. So, to get it to work, the first step is to convert text into numbers and then operate on those. This process is called “tokenization”: it takes blobs of text, breaks them up into small “tokens,” and assigns each token a number.
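
If you want to poke at this yourself, OpenAI publishes its tokenizers in an open-source library called tiktoken. Here’s a minimal sketch, assuming you have Python handy and have run pip install tiktoken; it uses the GPT-2 encoding, the same tokenizer as the examples below:

```python
import tiktoken  # OpenAI's open-source tokenizer library

# Load the GPT-2 byte-pair encoding
enc = tiktoken.get_encoding("gpt2")

ids = enc.encode("Tokenization turns blobs of text into numbers.")
print(ids)              # a list of integers, one per token
print(enc.decode(ids))  # ...and back to the original text
```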

Take the word “the,” for example. The tokenizer for GPT-2* assigns the number 262 to that sequence of characters. “Cat” is assigned 21979, but " cat" is 3797. Notice that leading space. That isn’t a typo: “cat” is 9246. The tokenizer algorithm is cleverly built to encode sentences efficiently. Sentences have a lot of spaces, so folding the leading space into the word token keeps the model from having to deal with tons of almost-meaningless space tokens. It also helps the model understand the position of the token: no preceding space suggests it is the first word in the sentence.

A single word can also be represented by multiple tokens. For example, “Don’t” is tokenized as “Don” and “’t,” which converts to [3987 470]. One word, two tokens. GPT don’t care; it doesn’t know English, it is just looking at tokens.
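
You don’t have to take my word for any of those numbers. The same tiktoken setup will print the token IDs for each string, which should match the ones quoted above (note the leading spaces, and the straight apostrophe in "Don't"):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")

# Print the token IDs for each string; leading spaces matter!
for text in ["the", "Cat", " cat", "cat", "Don't"]:
    print(repr(text), "->", enc.encode(text))
```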

So, if I ask ChatGPT, “what is a rabbit?”, GPT’s model will receive five tokens, [10919 318 257 22746 30], and will do math that generates a sequence of tokens, which might be [32 22746 318 257 1402 46103 326 318 34850 5954 503 286 257 6877]. That translates to “A rabbit is a small mammal that is magically pulled out of a hat.”
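
Here’s that rabbit round trip as a sketch: encode the question, then decode the answer token list from above back into text. (ChatGPT’s actual models use newer vocabularies, so these particular IDs are GPT-2’s.)

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")

# The question, as the model would receive it: five token IDs
print(enc.encode("what is a rabbit?"))

# The answer token sequence from above, decoded back into text
answer = [32, 22746, 318, 257, 1402, 46103, 326, 318,
          34850, 5954, 503, 286, 257, 6877]
print(enc.decode(answer))
```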

See? No magic here. Just numbers.

Oh, and you know what they say: the best way to get the right answer on the internet is to post the wrong answer… Looking forward to any comments.

  * Using GPT-2 as the example b/c it was the first tokenizer I found and I’m lazy…