
Learning AI Transformers Poorly: Recent Paper on Transformers Without Tokens


(originally posted on LinkedIn)

There is a ton of progress in the Large Language Model and Transformers space, and if you want to keep up, you have to seek out and read research papers. I found one, published May 19, that follows the whole tokenizer thread, and since I poorly explained what tokenizers are and how they enable LLMs, I thought maybe it would be fun to check out a paper that is trying to make them obsolete… Sort of.

You see, tokenizers are great for encoding text because it is easy to generate short integer sequences (tokens) from text and then use them to train transformer models. However, tokenizers don’t work so well for things like images and audio because there isn’t an awesome way to tokenize those formats. Hence, this paper:

https://arxiv.org/abs/2305.07185v2

Oooph, that link looks like an ad. I hate writing on LinkedIn… Have a look at that paper (to read it, click the link, read the abstract, then click the “Download PDF” link), and when your eyes glaze over, come back.
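Quick aside before we dig in: here is the bytes-versus-tokens difference in a few lines of Python. The token IDs below are made up purely to show the shape of a tokenizer's output (a real tokenizer has its own vocabulary), but the raw-bytes half is exactly what a byte-level model would consume.

```python
# Tokens vs. raw bytes, in miniature. The token IDs are invented for illustration;
# a real tokenizer would produce its own integers from its own vocabulary.

text = "Transformers without tokens"

raw_bytes = list(text.encode("utf-8"))       # what a byte-level model sees
fake_token_ids = [8291, 982, 16326]          # hypothetical tokenizer output

print(len(raw_bytes), raw_bytes[:8])         # 27 [84, 114, 97, 110, 115, 102, 111, 114]
print(len(fake_token_ids), fake_token_ids)   # 3 [8291, 982, 16326]
```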

First, it looks like the authors work at Meta. I don’t know if that matters much, but… who else is tired today because threads.net went live last night? Right? Anyway, this group has invented a new approach to using transformers without an official tokenizer. Instead, they take long sequences of bytes (think pixels in an image) and chop them up into patches. The secret sauce is that they have two different types of models within the machine: a local module that predicts bytes within a patch, and a global module that predicts patch representations.

Think of that quilt your grandma had in her guest bedroom. The larger patterns were made up of sets of patches, each cut from one of a small set of materials. This method predicts what material each patch is made of, then predicts which patches go together to create a pattern.
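If quilts aren’t your thing, here is a toy sketch in Python of what chopping bytes into patches looks like. To be clear, this is my simplification, not the authors’ code: the patch size is arbitrary, and the averaging stands in for the learned embeddings and transformer layers the paper actually uses.

```python
import numpy as np

# A toy version of the patch idea, not the authors' model. The patch size and the
# averaging are stand-ins for learned embeddings and transformer layers.

PATCH_SIZE = 8  # bytes per patch (an arbitrary choice for illustration)

# Pretend this is a long stream of raw bytes: pixels, audio samples, UTF-8 text...
byte_seq = np.frombuffer(b"a long stream of raw bytes, no tokenizer in sight anywhere!", dtype=np.uint8)

# Chop the stream into patches (real code would pad instead of trimming).
n_patches = len(byte_seq) // PATCH_SIZE
patches = byte_seq[: n_patches * PATCH_SIZE].reshape(n_patches, PATCH_SIZE)

# "Global module": operates on one representation per patch and models how
# patches relate to each other across the whole sequence.
patch_reprs = patches.mean(axis=1)            # stand-in for a learned patch embedding

# "Local module": works inside a single patch, predicting each byte from the
# bytes before it plus the global context handed down for that patch.
for i, patch in enumerate(patches[:2]):
    print(f"patch {i}: bytes={patch.tolist()}, global context ~ {patch_reprs[i]:.1f}")
```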

The real art in this paper, though, is the way they optimize the transformer for this long-sequence encoding. They do this (1) by splitting long sequences into two shorter sequences (not necessarily in half), which reduces the self-attention cost to something tractable, (2) by changing the structure of the neural network to use large feed-forward layers per patch (instead of normal-sized layers per token), and (3) by running the algorithm in parallel per patch instead of the typical serial runs per token.
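That first optimization is worth a quick back-of-the-envelope check. The numbers below are mine, not the paper’s, and the patch size is just a convenient round number, but they show why splitting one long sequence into patch-level and byte-level pieces matters:

```python
# Rough arithmetic for optimization (1): self-attention cost grows with the square
# of the sequence it attends over, so two short sequences beat one long one.

T = 1_000_000   # total sequence length in bytes (illustrative)
P = 1_000       # bytes per patch (illustrative)

full_attention   = T * T                     # one quadratic attention over every byte
global_attention = (T // P) ** 2             # patches attending to other patches
local_attention  = (T // P) * P * P          # attention inside each patch, summed up

print(f"full:           {full_attention:,}")                      # 1,000,000,000,000
print(f"global + local: {global_attention + local_attention:,}")  # 1,001,000,000
```

With these made-up numbers, that is roughly a thousand times fewer pairwise comparisons for the same million bytes.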

The bulk of the paper explains how they arrived at those 3 optimizations and shows what they did to prove those optimizations worked. Section 6, about image modeling, is interesting because it shows the difference between working with different-sized images. 64x64 pixels was easy, but 640x640 pixel images meant sequences of over 1.2M bytes (which they call tokens by this point in the paper, since every byte is its own token here), and sequences that long require a ton of computing power. However, they showed that their method outperformed all of the standard methods on imaging. Good job, kids!
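The arithmetic behind those image sizes is easy to check yourself: an RGB image is 3 bytes per pixel, which is where that ~1.2M figure comes from.

```python
# Resolution alone decides how long the byte sequence gets: 3 bytes per pixel for RGB.

for side in (64, 640):
    n_bytes = side * side * 3
    print(f"{side}x{side} RGB image -> {n_bytes:,} bytes")

# 64x64   ->    12,288 bytes
# 640x640 -> 1,228,800 bytes (the ~1.2M sequence length mentioned above)
```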

They even tried this method on audio files and it worked, but there aren’t any examples to hear so I don’t entirely understand what it predicts. I’m going to try to figure it out and get back to you, though.

The paper closes by explaining other work that also encodes long sequences of bytes instead of using tokens. I love this part of papers because it provides a path of more papers to read and understand… or things to google. In fact, getting good at reading papers is a great way to stay on top of the state of the art in this ridiculously fast-moving side of tech. And you can stalk the authors on LinkedIn and see where they work.

Talk next week!