
Learning ChatGPT Poorly: How Big is Your Model? Parameter Size and Token Count


(originally posted on LinkedIn)

When hearing about new versions of large language models, you may have noticed people talking about model size in terms of parameter count and token count. What do those numbers mean and why do they matter?

Token Count #

In my last article I poorly explained what tokens mean in Large Language Models (LLMs). In short, these machines don’t understand English; instead they convert words and partial words into integers (tokens) and do their work on those. How you “tokenize” text is super important. It helps the model handle different languages, computer code, whatever, and it affects the quality of the generated text because, really, the machine takes in tokens and spits out new ones. If the tokens aren’t great, the output won’t be, either.
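To make that concrete, here’s a minimal sketch using OpenAI’s tiktoken library (the sentence being tokenized is just my own example):

```python
# pip install tiktoken
import tiktoken

# cl100k_base is one of the encodings tiktoken ships with
enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode("Large language models read tokens, not words.")
print(tokens)              # a list of integer token IDs
print(enc.decode(tokens))  # decoding the integers gives back the original text
```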

If you check Wikipedia for “List of Dictionaries by Number of Words,” you’ll see Webster’s English dictionary has about 470k words. But if you read my last article, you’ll know that we don’t just convert individual words to specific tokens; we also include context information like spacing and capitalization, and we split longer words into multiple tokens, all in an effort to help the model produce better, more meaningful outputs.
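You can poke at this yourself; the strings below are just illustrative, but they show how spacing, capitalization, and word length change the tokens you get:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# The same letters produce different token IDs depending on a leading space or a capital.
for text in ["hello", " hello", "Hello", " Hello"]:
    print(repr(text), "->", enc.encode(text))

# Long or uncommon words are split into several sub-word tokens.
print(enc.encode("antidisestablishmentarianism"))
```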

OpenAI’s site says, “A helpful rule of thumb is that one token generally corresponds to ~4 characters of text for common English text.” So, when someone says a model has 100 billion tokens, that doesn’t mean there are 100 billion distinct tokens… No… There are probably a few hundred thousand distinct tokens, but the model was trained on a dataset that contained 100 billion tokens. Ah! OK, that makes sense. So, when you see a token count in a model’s spec, it’s really telling you how much data was used to train the model.
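Here’s a quick sketch of that distinction between distinct tokens (the vocabulary) and tokens of training data, using the ~4 characters rule of thumb; the corpus size is a made-up number, purely for illustration:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Vocabulary: how many *distinct* tokens this encoding defines.
print(enc.n_vocab)  # on the order of a hundred thousand distinct tokens

# Training data: how many tokens a corpus of text turns into.
# Hypothetical corpus of 400 billion characters, at roughly 4 characters per token.
corpus_characters = 400_000_000_000
approx_training_tokens = corpus_characters / 4
print(f"{approx_training_tokens:,.0f}")  # roughly 100 billion training tokens
```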

Parameter Count #

Ok, so what’s with the parameter count? GPT-1 had 117 million parameters while GPT-3.5 has 175 billion… 175 billion is a lot of anything, but what are parameters? At its core, a large language model is a neural network. A neural network has things called parameters. A parameter is a numerical value that represents either a weight or a bias in the neural network.

Weights are numerical values that define the strength of connections between neurons across different layers in the model. In large language models like GPT, weights start as random values and are adjusted during training. They are mostly used in the mechanisms that allow the model to generate relevant and coherent text.

Biases are numerical values that adjust how each neuron in the neural net reacts to the input passed to it. Just like weights, they start out as random values and are adjusted during training. They mostly help the model learn complex patterns and relationships within the data.
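If you want to see weights and biases as actual numbers, here’s a tiny sketch using PyTorch; a single layer is nothing like a 175-billion-parameter model, but the bookkeeping is the same:

```python
# pip install torch
import torch.nn as nn

# One layer connecting 4 inputs to 3 outputs.
layer = nn.Linear(4, 3)

print(layer.weight.shape)  # torch.Size([3, 4]) -> 12 weights (connection strengths)
print(layer.bias.shape)    # torch.Size([3])    ->  3 biases (one per output neuron)

# The "parameter count" is just the total number of these values.
total = sum(p.numel() for p in layer.parameters())
print(total)  # 15 parameters for this toy layer
```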

So, when someone is talking about the size of a large language model, the parameter count gives you an idea of how complex the neural network’s structure is, and the token count gives you an idea of how much data was used to train those parameters. That’s it. That’s all there is. Easy, eh?