(posted to LinkedIn)

If you have been following along, you’ll know that language models generate their magic responses by:

1. taking some words as input (the prompt)
2. converting those words into a sequence of numbers in a process called “tokenization”
3. running that sequence through a neural network to infer some output, which is itself a sequence of numbers
4. converting those numbers back to words and presenting them to the user.
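
Here is a tiny sketch of that round trip. Everything in it is made up for illustration: the eight-word vocabulary, the `tokenize` and `detokenize` helpers, and especially `fake_model`, which stands in for the neural network and always gives the same answer. Real tokenizers usually work on pieces of words rather than whole words.

```python
# Toy word-level vocabulary (an assumption for this sketch).
# Real tokenizers typically split text into subword pieces.
VOCAB = ["<unk>", "i", "looked", "down", "and", "noticed", "my", "shoe"]
WORD_TO_ID = {word: i for i, word in enumerate(VOCAB)}


def tokenize(text):
    """Convert words to a list of integer token ids (0 means unknown word)."""
    return [WORD_TO_ID.get(word, 0) for word in text.lower().split()]


def detokenize(ids):
    """Convert token ids back to words."""
    return " ".join(VOCAB[i] for i in ids)


def fake_model(ids):
    """Stand-in for the neural network: always answers with the id for 'shoe'."""
    return [WORD_TO_ID["shoe"]]


prompt_ids = tokenize("I looked down and noticed my")
output_ids = fake_model(prompt_ids)
print(prompt_ids)              # [1, 2, 3, 4, 5, 6]
print(detokenize(output_ids))  # shoe
```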

One thing I haven’t touched on is… What is a language model?

oops…

So, what is a language model? It’s a neural network that can predict the next word of a sentence or knows how to fill in the missing words of a sentence. That’s it… That’s all it does.

How does it do that? Probabilities! The model generates sentences by predicting the most probable next word. For example, if you prompt with the words:

“I looked down and noticed my”

and feed that into a language model, the model will generate a probability for a variety of next tokens (converted here to words b/c we are humans.) Here’s a short list of what it might generate:

can: 20%, brain: 9%, and: 5%, shoe: 45%, … mouse: 15%, the: 2%

The model picks the most probable token and adds it to the end of the sentence:

“I looked down and noticed my shoe”

and will repeat several times until you have something like:

“I looked down and noticed my shoe was suddenly wet.”
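
The "pick the most probable token, add it, repeat" loop can be sketched in a few lines. The `NEXT_WORD_PROBS` table and its numbers are invented for this example, and it only looks at the previous word, whereas a real model looks at the whole prompt. Always taking the top choice like this is usually called greedy decoding.

```python
# Hypothetical next-word probabilities keyed by the previous word only.
# A real model conditions on the entire prompt, not just the last word.
NEXT_WORD_PROBS = {
    "my": {"shoe": 0.45, "can": 0.20, "mouse": 0.15},
    "shoe": {"was": 0.60, "and": 0.25},
    "was": {"suddenly": 0.40, "the": 0.30},
    "suddenly": {"wet.": 0.70, "the": 0.10},
}


def generate_greedy(prompt, max_new_words=10):
    words = prompt.split()
    for _ in range(max_new_words):
        candidates = NEXT_WORD_PROBS.get(words[-1])
        if not candidates:  # nothing known to follow this word: stop
            break
        # Greedy decoding: always take the single most probable next word.
        words.append(max(candidates, key=candidates.get))
    return " ".join(words)


print(generate_greedy("I looked down and noticed my"))
# I looked down and noticed my shoe was suddenly wet.
```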

It doesn’t always have to pick the most likely word. You can adjust settings alongside your prompts. “Temperature” lets you control the randomness of the next prediction: lower values favor the most likely tokens, and higher values give unlikely tokens a better chance. “Top-p” is “the cumulative probability cutoff for token selection. Lower values mean sampling from a smaller, more top-weighted nucleus.” In other words, lower values make the model choose from a smaller set of the most probable next tokens, so its output will be less diverse.

Because the next token is sampled at random from those probabilities rather than always being the top choice, you can get different results when you re-run the same prompt.
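
If you are curious what those two settings do to the numbers, here is a rough sketch. The function names (`apply_temperature`, `top_p_filter`, `sample_next`) and the probabilities are my own assumptions for illustration, not any particular library's API. Real implementations apply temperature to the model's raw scores before they become probabilities, which works out to the same thing as the rescaling below.

```python
import math
import random


def apply_temperature(probs, temperature):
    """Rescale each probability to p**(1/T), then renormalize.
    T < 1 sharpens the distribution, T > 1 flattens it.
    Assumes every probability is greater than zero."""
    scaled = {w: math.exp(math.log(p) / temperature) for w, p in probs.items()}
    total = sum(scaled.values())
    return {w: s / total for w, s in scaled.items()}


def top_p_filter(probs, top_p):
    """Keep the smallest set of most-probable words whose cumulative
    probability reaches top_p, then renormalize what is left."""
    kept, cumulative = {}, 0.0
    for word, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[word] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {w: p / total for w, p in kept.items()}


def sample_next(probs, temperature=1.0, top_p=1.0, rng=random):
    """Apply both settings, then draw one word at random."""
    nucleus = top_p_filter(apply_temperature(probs, temperature), top_p)
    words = list(nucleus)
    return rng.choices(words, weights=[nucleus[w] for w in words], k=1)[0]


probs = {"shoe": 0.5, "can": 0.2, "mouse": 0.15, "brain": 0.1, "and": 0.05}
print(top_p_filter(probs, 0.6))  # only "shoe" and "can" survive the cutoff
print(sample_next(probs, temperature=0.7, top_p=0.9))  # varies from run to run
```

With a very low top-p the nucleus shrinks to a single word, which behaves the same as always picking the most likely token.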

How does that work? How can a model possibly predict the most likely next word of a sentence and be good at it?

ULMFiT - Universal Language Model Fine-tuning

ULMFiT is a three-step method for fine-tuning a pre-trained language model so it gives more useful results.

The first step in ULMFiT is to pre-train a neural network on a LARGE dataset of text (think all of Wikipedia). Training involves giving the neural network most of a sentence and having the model “guess” the next word. If the guess matches the next word in the real sentence, the model is rewarded. If it guesses wrong, it is penalized. Training is effectively trying to maximize the rewards. To do a good job, the neural network has to learn a lot of stuff about the world. It learns about objects, time, people, etc. and the relationships between those things. These neural networks can have billions of parameters, so they are able to build a rich hierarchy of abstractions they can draw upon to make a good guess at the next word in a sentence.
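
One common way to turn "rewarded" and "penalized" into a number is the cross-entropy loss: the penalty is the negative log of the probability the model gave to the word that really came next. The sketch below uses invented probabilities and a hypothetical helper, `next_word_penalty`, just to show how the penalty grows as the guess gets worse. Training nudges the parameters to make this penalty smaller on average.

```python
import math


def next_word_penalty(predicted_probs, actual_next_word):
    """Cross-entropy for one guess: -log(probability given to the real next word)."""
    p = predicted_probs.get(actual_next_word, 1e-9)  # tiny floor avoids log(0)
    return -math.log(p)


# Three made-up predictions for the word after "I looked down and noticed my".
guesses = {
    "confident and right": {"shoe": 0.90, "mouse": 0.05, "can": 0.05},
    "unsure": {"shoe": 0.34, "mouse": 0.33, "can": 0.33},
    "confident and wrong": {"shoe": 0.02, "mouse": 0.90, "can": 0.08},
}

for name, probs in guesses.items():
    print(f"{name}: penalty = {next_word_penalty(probs, 'shoe'):.2f}")
# confident and right: penalty = 0.11
# unsure: penalty = 1.08
# confident and wrong: penalty = 3.91
```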

At this point, the model is huge and is capable of a TON of things, but that might not be good for our specific needs, so ULMFiT has a second step called “Language Model Fine-tuning.” In this step, the model is fed a set of documents that is a lot closer to the types of problems we want to solve. So, if we want the model to be really good at writing song lyrics, we would train it on lyrics from Genius, AZLyrics, and Lyrics.com.

The third step of ULMFiT is “Classifier fine-tuning,” where the model is trained on the task we actually care about. If that task is something like “solve problems,” we might train it on a dataset that consists of question -> response entries. For chat-style models, this kind of step is usually called “Instruction Tuning,” which is a sort of targeted language modeling: it doesn’t just predict the next word of a sentence, it predicts the next word of a sentence that will answer a question or do something useful.
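
To see why this is still "predict the next word," here is a small sketch that turns one question -> response entry into next-word training examples. The `TEMPLATE` layout and both helper functions are hypothetical. Real instruction datasets each use their own format and work on tokens rather than whole words.

```python
# Hypothetical template; real instruction datasets use their own formats.
TEMPLATE = "Question: {question}\nAnswer: {response}"


def to_training_text(question, response):
    """Glue a question and its response into one piece of training text."""
    return TEMPLATE.format(question=question, response=response)


def next_word_examples(text):
    """Split text into (context so far, next word) pairs,
    the same guessing game used in pre-training."""
    words = text.split()
    return [(" ".join(words[:i]), words[i]) for i in range(1, len(words))]


text = to_training_text("What is 2 + 2?", "4")
examples = next_word_examples(text)
for context, target in examples[-2:]:
    print(repr(context), "->", repr(target))
# 'Question: What is 2 + 2?' -> 'Answer:'
# 'Question: What is 2 + 2? Answer:' -> '4'
```

The last example is the interesting one: given the question, the "next word" the model has to guess is the answer.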

The crazy part about all of that? It works… and you can use it today in a variety of tools, like those from OpenAI and probably thousands of others. It is generally how AI knows what to say…

Related but Unrelated - If you want to learn more about AI and Machine Learning, there is a great, free course for beginners at fast.ai called Practical Deep Learning. It is “a free course designed for people with some coding experience, who want to learn how to apply deep learning and machine learning to practical problems.”

You should check it out.