
Learning AI the Hard Way: Gritty, low-level GPU programming is a lot like making popcorn

5 mins

(originally posted to LinkedIn)

It has been a while! How’ve you been? How’s the family? What’s that? You just got back from a 6-week European vacation? Neat!

Me? Oh… I’ve been learning super low-level GPU programming. I didn’t even realize it’s already July…

I am weird in that I don’t trust I know something until I have a deep grasp of the gritty, low-level details. For example, I took a Java class in college; within 2 semesters I was signing up for EE classes to learn how to build CPUs. Fast forward to 2024: I’m just hanging out, happily training ML models in PyTorch and running inference on LLMs from Hugging Face. Next thing I know, I’m streaming floating point numbers to my kid’s GeForce RTX 3080 to see how fast I can add gigs of random numbers directly on the GPU, just so I could understand how it worked.

Good news! I think I’ve got it all figured out. The difference between running something on a GPU vs a CPU is a lot like making popcorn. If you heat up the whole batch at once and the kernels pop at roughly the same time, that’s GPU style. If you take one kernel, heat it till it pops, then pick up the next kernel, heat it till it pops, etc., that’s like running your code on a CPU.

In other words, a GPU will absolutely smoke a CPU for certain problems.

It starts with a Kernel #

Low-level GPU programming involves creating a small piece of code, literally called a kernel, that does some simple computation. Send that kernel and a ton of data to the GPU and let it rip. POP! The GPU runs that kernel of code on all the data all at once. Then you move the results of that computation back to your computer and use them.

If you’re running an Nvidia GPU, Nvidia provides an entire ecosystem to make it easy to run massively parallel algorithms incredibly quickly. The foundation of that ecosystem is called CUDA, and that’s what we’ll use to demonstrate how much faster doing something in parallel can be.

Have a look at this code in my GitHub repo, PopcornCuda. All it does is make an array of about one million ones, then use the GPU to run a kernel that multiplies each value in that array by 42 (42 is ASCII for “*”, which to me… looks like a piece of popcorn?). So you get about a million pieces of popcorn, all at once. Then we do the same thing on the CPU to compare how long it takes running on a GPU vs a CPU.

Kernels are simple. CUDA provides a compiler called nvcc that is basically a C++ compiler with the ability to handle these kernels and convert them into code that runs on Nvidia GPUs. A kernel looks like a simple piece of C code. For example, my kernel is just this:

// a kernel that will turn 1 into 42 (ASCII for *)
__global__
void pop_kernel(int n, float *x)
{
  // each thread computes the index of the single element it is responsible for
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] = 42.0 * x[i];
}
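That kernel never runs on its own; the host has to launch it and tell CUDA how many threads to create, which is basically one line of code. Here is a sketch of what that launch might look like, assuming d_x already points at the array in GPU memory (the pop_all wrapper and the block size are mine, for illustration only; the repo may do this differently):

// launching pop_kernel from the host (a sketch, not the repo's exact code)
void pop_all(int n, float *d_x)                     // d_x: the array, already in GPU memory
{
  int blockSize = 256;                              // threads per block
  int numBlocks = (n + blockSize - 1) / blockSize;  // round up so every element gets a thread
  pop_kernel<<<numBlocks, blockSize>>>(n, d_x);     // POP! one thread per kernel of corn
}

The round-up in numBlocks means the last block usually has a few spare threads, which is exactly what the if (i < n) check in the kernel is for.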

Sure, there’s some nonsense-looking stuff in the kernel, but it isn’t too wacky. The rest of the code in popcorn_cuda.cu is there to set up the data transfer and timers, and to run the same computation on the CPU in a simple loop. An example run looks like this:

$ ./PopcornCuda 
Time to generate on GPU:  0.1 ms 
Time to generate on CPU:  0.9 ms 

On average, popping about one million kernels of popcorn is nearly 10x faster on the GPU.
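For the curious, the plumbing behind those numbers is conceptually something like the sketch below: fill the batch on the CPU, ship it to the GPU, pop everything at once, ship the results back, then do the same multiply in a plain loop on the CPU. This is a simplified sketch that reuses pop_kernel and pop_all from above; it is not the actual code from the repo, and it leaves out the timers:

#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// pop_kernel and pop_all as defined above

int main()
{
  int n = 1 << 20;                   // about one million kernels of corn
  std::vector<float> x(n, 1.0f);     // the un-popped batch, sitting in CPU memory

  // GPU version: ship the batch over, pop it all at once, ship it back
  float *d_x;
  cudaMalloc(&d_x, n * sizeof(float));
  cudaMemcpy(d_x, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
  pop_all(n, d_x);
  cudaDeviceSynchronize();           // kernel launches are asynchronous; wait for the GPU to finish
  cudaMemcpy(x.data(), d_x, n * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(d_x);

  // CPU version: pop the kernels one at a time
  std::vector<float> y(n, 1.0f);
  for (int i = 0; i < n; i++) y[i] = 42.0f * y[i];

  printf("first piece of GPU popcorn: %.0f\n", x[0]);   // prints 42
  return 0;
}

That cudaDeviceSynchronize() call matters if you time this yourself: kernel launches return immediately, so without some kind of synchronization the GPU can look impossibly fast because the clock stops before the popping has actually finished.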

Is it really that easy? #

Kind of… If you want to get started programming directly on a GPU, Nvidia has excellent tooling and documentation. However, you have to take some time to read the docs. Getting everything installed and running properly takes some careful planning, because there are major compatibility issues between the CUDA Toolkit, the CUDA device driver, and the hardware you have installed in your machine. If you aren’t careful, you’ll find yourself re-installing things and getting really frustrated. That said, once your environment is set up, you can compile and run things quickly. There is even a VSCode extension that lets you debug code as it is running on your GPU. How convenient!
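One concrete tip on the compatibility front: before installing or upgrading anything, it is worth running the two commands that tell you what you are working with, because the driver has to be new enough for the toolkit you compile with.

$ nvidia-smi       # reports the driver version and the newest CUDA version that driver supports
$ nvcc --version   # reports the version of the CUDA Toolkit that nvcc compiles with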

Do you really need to know how to program GPUs? #

Nope… not at all. Most high-performance compute and machine learning libraries abstract away the need to know how to program at such a low level. In fact, a lot of libraries will dynamically switch between running kernels on GPUs and straight code on CPUs depending on what kind of hardware they happen to be on. The developer doesn’t even need to change any code to get massive performance gains when they move to a machine with a GPU in it. However, I think that knowing at least a little bit about the underlying hardware and how it runs your code can help you develop better algorithms. Also, after writing a few kernels, it kind of helps make sense of why a lot of ML models are programmed the way they are.
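To give a taste of what that abstraction looks like, here is the same popcorn multiply written with Thrust, a C++ library that ships with the CUDA Toolkit. This is a sketch of my own rather than code from the PopcornCuda repo, but notice that there is no hand-written kernel, no index math, and no cudaMemcpy calls:

#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/transform.h>

// the same "turn 1 into 42" operation, written as a functor Thrust can run on the GPU
struct pop_op
{
  __host__ __device__ float operator()(float x) const { return 42.0f * x; }
};

int main()
{
  int n = 1 << 20;                                                   // about one million ones
  thrust::device_vector<float> d_x(n, 1.0f);                         // allocated and filled on the GPU
  thrust::transform(d_x.begin(), d_x.end(), d_x.begin(), pop_op());  // runs as a GPU kernel under the hood
  thrust::host_vector<float> h_x = d_x;                              // copy the popcorn back to the CPU
  return h_x[0] == 42.0f ? 0 : 1;                                    // sanity check
}

All the popping still happens on the GPU; the library just writes the kernel, the launch configuration, and the memory management for you.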