Learning AI Poorly: LLM Sizes? What's that?

·4 mins

(originally posted to LinkedIn)

When companies release LLMs, like OpenAI’s GPT-3 and Meta’s LLaMA, they often release multiple “sizes” of the model, which tell you how many “parameters” the model contains. For example, GPT-3 came in the following sizes:

  • small - 125 million parameters

  • medium - 350 million parameters

  • large - 760 million parameters

  • XL - 1.3 billion parameters

  • 2.7B - 2.7 billion parameters

  • 6.7B - guess how many parameters

  • 13B - still guessing?

  • 175B - woah that’s a lot of what again?

So, what does that mean and why do we care?

Parameters #

We know that a Large Language Model (LLM) is a type of… machine… that, given a word or token, uses a neural network to predict the most likely next word or token. These neural networks consist of connected nodes, where a “weight” determines the strength of the connection between two nodes and a “bias” determines when a node activates. These values are adjusted during training until the output of the neural network is able to predict, or output, a valid thing… like the most likely next word or token.
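To make that a little more concrete, here’s a minimal sketch of a single node in plain Python. All the numbers are made up, and real networks use fancier activation functions, but the shape of the computation is the same: multiply inputs by weights, add the bias, then decide whether to “fire.”

```python
# One neural-network node: weighted sum of inputs, plus a bias,
# passed through an activation function. Numbers are invented.
inputs  = [0.2, 0.8, 0.5]       # values arriving from other nodes
weights = [0.9, -0.3, 0.4]      # strength of each connection (learned)
bias    = 0.1                   # shifts when the node activates (learned)

weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
output = max(0.0, weighted_sum) # ReLU activation: "fire" only if positive
print(output)                   # roughly 0.24
```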

A model’s parameter count is the total number of weights and biases in its neural network. It’s a handy way to gauge how “big” the model is.
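If you want to see the count for yourself, here’s a sketch using PyTorch (my choice of library, not something the post requires); the layer sizes are arbitrary, but the counting trick works on any model, including the big ones.

```python
import torch.nn as nn

# A toy two-layer network: 100 inputs -> 50 hidden nodes -> 10 outputs.
model = nn.Sequential(
    nn.Linear(100, 50),  # 100*50 weights + 50 biases = 5,050 parameters
    nn.ReLU(),
    nn.Linear(50, 10),   # 50*10 weights + 10 biases  = 510 parameters
)

# "Parameters" is literally a count of all the weights and biases.
total = sum(p.numel() for p in model.parameters())
print(total)  # 5560
```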

Why does size matter? #

A poor explanation is that a bigger network has more neurons that can group together and “fire” when the network detects some sort of feature… so a set of neurons might light up for “cat” and others might light up for “dog.” If you have a TON of neurons, you can get more specific groups, like “wolf” and “dachshund.” So, that’s good! Of course, the bigger the neural network, the more computation it takes. To run a large network we have to perform a lot of math. Smaller models, less math.

In addition to the “math” that has to happen, the computer has to store the model’s data in memory. And a lot of that data has to be in memory at the same time so we can run the math on all those items in parallel (at the same time). If the model is too big to fit in our computer’s memory, it ain’t going to work. So, you have to find a model that is the right size for your computer or GPU (graphics card).

You can loosely calculate the memory a model requires by thinking about how much memory each parameter takes up. Each parameter is usually stored as a 16-bit floating point number, which takes 2 bytes, and 1 billion bytes is 1 gigabyte (GB). So, for a 13B model: 13 billion parameters * 2 bytes per parameter = 26 billion bytes = 26 GB. That means we need a GPU with at least 26GB of memory on board just to load the model.
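That back-of-the-envelope math is easy to script. Here’s a minimal sketch, assuming 2-byte parameters and ignoring all overhead, using a few of the sizes mentioned above:

```python
# Rough memory estimate: parameters * bytes per parameter.
# Assumes 16-bit (2-byte) floats and ignores any runtime overhead.
BYTES_PER_PARAM = 2

def rough_memory_gb(num_params: float) -> float:
    return num_params * BYTES_PER_PARAM / 1e9  # 1 billion bytes = 1 GB

for name, params in [("125M", 125e6), ("7B", 7e9), ("13B", 13e9), ("175B", 175e9)]:
    print(f"{name}: ~{rough_memory_gb(params):.1f} GB")
# 125M: ~0.2 GB, 7B: ~14.0 GB, 13B: ~26.0 GB, 175B: ~350.0 GB
```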

Is that true? You just multiply model size by 2 and that’s how much RAM you need? #

No…

Of course, it is way more complicated than that. Reducing Activation Recomputation in Large Transformer Models is a paper about optimizing memory use during training, where the calculation depends more on the number of activations per layer of the neural network.

Quantization #

And then there’s the concept of “quantization,” which converts all those high-precision parameters (read: each one takes a lot of memory) to lower-precision data types to save space. For example, LLaMA 2 7B with 16-bit floating point parameters uses 13.5 GB of memory, while a version quantized to 4-bit integers uses only 3.9GB.
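Here’s a toy sketch of the idea, not the actual scheme any particular LLaMA quantizer uses: squeeze floating point weights into small integers plus a single scale factor, and accept a little rounding error in exchange for a fraction of the memory.

```python
# Toy symmetric quantization: map float weights onto 4-bit integers (-8..7)
# plus one scale factor. Real quantizers are fancier, but the memory
# savings come from the same idea. The weights below are invented.
weights = [0.42, -1.30, 0.07, 0.95, -0.61]       # pretend model weights

scale = max(abs(w) for w in weights) / 7         # 7 = largest 4-bit value
quantized = [round(w / scale) for w in weights]  # small ints, 4 bits each
restored  = [q * scale for q in quantized]       # approximate originals

print(quantized)  # [2, -7, 0, 5, -3]
print(restored)   # close to the originals, but not exact
```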

So, if you see that someone is using a 7B Q4 model, that’s 7 billion, 4-bit parameters.

What did we learn? #

Hopefully, if you read that someone used “Mistral 8x7B instruct Q4,” you can kind of guess that there’s a model called “Mistral,” that it has something like 8 groups of 7 billion parameters, and that this particular one has been quantized to 4-bit integers.