Learning AI Poorly: Is ChatGPT just a fancy WinZip?

(originally posted to LinkedIn)

I spent entirely too long trying to remember the name of that zip program that had an infinite free trial back in the day… I don’t think it was actually WinZip, but I’m going with it for now. The reason I was thinking of it is that a discussion of “AI is just fancy compression” came up, and I thought it might be fun to talk about.

The Argument: AI is just glorified data compression… like zip. #

What? No, really, people are saying that. And there are papers (probably more than these):

White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?

Language Modeling is Compression

Someone even used gzip, a standard compression program, to classify handwritten digits, and it did surprisingly well:

78% MNIST accuracy using GZIP in under 10 lines of code
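
That result is less magic than it sounds. I haven’t seen the linked code, but the standard recipe behind gzip classifiers is normalized compression distance (NCD) plus a nearest-neighbor vote: gzip(a + b) is barely bigger than gzip(a) when a and b look alike, so compressed size works as a similarity measure. A minimal sketch (the function names are mine, not the linked author’s):

```python
import gzip

def gzip_size(data: bytes) -> int:
    """Size of the data after gzip compression, in bytes."""
    return len(gzip.compress(data))

def ncd(a: bytes, b: bytes) -> float:
    """Normalized Compression Distance: near 0 when a and b share structure,
    because compressing them together costs barely more than compressing
    the larger one alone."""
    ca, cb, cab = gzip_size(a), gzip_size(b), gzip_size(a + b)
    return (cab - min(ca, cb)) / max(ca, cb)

def classify(sample: bytes, train: list[tuple[bytes, int]]) -> int:
    """1-nearest-neighbor by NCD: return the label of the training example
    whose bytes compress best alongside the sample."""
    return min(train, key=lambda pair: ncd(sample, pair[0]))[1]
```

Run something like that over raw MNIST pixel bytes with a k-nearest-neighbor vote instead of my 1-NN, and you’re in the territory the link describes. No training step at all; the compressor is the model.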

What do you mean AI is just WinZip? #

The idea centers on information theory, the mathematical study of how information is stored, moved, and measured. In his book “Information Theory, Inference, and Learning Algorithms,” David MacKay argues that information theory and machine learning are inextricably linked. Some say they are “two sides of the same coin.”

Data compression goes beyond simply removing redundant information. Information theory says you achieve the best compression by using a probabilistic model: the better your model predicts what the data looks like, the fewer bits you need to encode it.
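
To make “best compression equals best prediction” concrete: Shannon’s source coding theorem says an event with probability p costs about -log2(p) bits under an optimal code, so a model that predicts the data confidently and correctly makes it cheap to encode. A toy illustration (the two models and their probabilities are made up for the example):

```python
import math

def bits(p: float) -> float:
    """Ideal code length for an event with probability p: -log2(p) bits."""
    return -math.log2(p)

# Two models guessing the character after "compressio".
# Both have to encode the actual next character, "n".
confident_model = {"n": 0.98, "e": 0.01, "x": 0.01}
clueless_model = {"n": 0.04, "e": 0.48, "x": 0.48}

print(bits(confident_model["n"]))  # ~0.03 bits: good prediction, tiny cost
print(bits(clueless_model["n"]))   # ~4.64 bits: bad prediction, big cost
```

Averaged over a whole file, that per-symbol cost is exactly what a compressor pays, which is why a better predictor is a better compressor.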

Remember, an LLM can also be thought of as a probabilistic model, one that predicts the most likely output for a given input.

By that definition, they are very much the same thing. In fact, you can flip the idea around and say something like:

  • In compression, gzip’s job is effectively prediction: given the sequence of characters seen so far, which characters are most likely to come next? The predictable parts get short codes (see the demo after this list).
  • In ML, a model has compressed its training data into the neural network’s structure and parameters, so that when you provide input, the most likely continuation comes out as output.
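
You can watch the gzip side of this directly: bytes with an obvious pattern compress to almost nothing, while random bytes, which nothing can predict, don’t compress at all. A quick demo (the sizes in the comments are approximate):

```python
import gzip
import os

predictable = b"the cat sat on the mat. " * 1000  # 24,000 bytes of pure pattern
unpredictable = os.urandom(24_000)                # 24,000 bytes of pure noise

print(len(gzip.compress(predictable)))    # roughly 100 bytes
print(len(gzip.compress(unpredictable)))  # roughly 24,000 bytes, no smaller
```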

That’s it, that’s the theory.

Is it true? #

I think so. I mean, there are papers and examples!