
Learning AI Poorly: Convolutional Neural Networks helped The Beatles release one last song today


(originally posted on LinkedIn)

Today, The Beatles released the last new song they will ever record. It is called “Now and Then” and somehow it features John Lennon singing lead. John Lennon wrote the song around 1977 and left it as an unfinished home demo. It was nothing more than a cassette tape he made in his apartment and it sounds maybe a little better than what you might hear if someone sang you a song on your voicemail.

The song that was released today sounds fantastic… How in the world did Paul and Ringo manage to release a song with John Lennon singing when all they had was a crappy home recording?

Of course, it was all possible because of AI… Let’s talk about how it happened.

“The Beatles: Get Back” was a three-episode miniseries released on Disney+ in November 2021, built from archival footage shot for the 1970 documentary “Let It Be.” Producer/director Peter Jackson got access to 60 hours of film footage and over 150 hours of audio tape left over from that project and set out to show what it was like inside the recording studio while the Beatles recorded together.

The trouble was, the audio and film sounded like garbage. There was a lot of extraneous noise. Bands rehearsing in the background, people dropping things, and other random set noise. It was next to impossible to hear what the Beatles were saying to each other. To get a watchable documentary, something drastic had to be done.

Enter the post-production audio team: Martin Kwok, supervising sound editor, and Mike Hedges, re-recording mixer.

At the time, there wasn’t really anything commercially available that could single out voices from things like guitars and drums. Luckily, the team included not just sound editors but also people who understood code and software engineering, and they developed techniques, using machine learning, to isolate voices from the audio tapes. Within a few months they realized that pulling good, isolated voices out of the noise was going to be possible, but the quality was only at an 8 kHz sample rate - not good enough to use in the film directly.

Emile De la Rey (machine learning engineer / sound editor) and Andrew Moore (first assistant sound editor / machine learning developer) built a machine learning package called MAL, short for Machine Audio Learning (and also a nod to the Beatles’ tour manager, Mal Evans: https://en.wikipedia.org/wiki/Mal_Evans ). They were able to tweak the models enough to achieve isolated vocals and voices at a full 48 kHz sample rate.

At that point it was so good that the editors could hear far more in the audio recordings, which sent them back to re-cut the whole documentary and add a lot of new material. They also realized they could do the same for the background instruments, which let the sound engineers do a full mix that gives the movie the feeling of being inside a loud recording studio while still letting you understand what people are saying.

The crazy thing is, it totally worked. Here’s an example of the original audio, then again with just voices, then again with the background noise, and finally, the complete mix: https://youtu.be/HN7evYFUWts?si=YBEA5bjHQ8yIpQEm&t=539

The whole documentary is fantastic - you should go watch it.

How does this circle back to the song they released today? Well, Paul, George and Ringo tried to record the song in 1994 when they received the cassette tape from Yoko Ono. They wanted to use John’s vocals, but the quality of the recording was so bad and the piano was so loud there was no way to make it work. They shelved it.

Fast forward to 2021. Paul watched “The Beatles: Get Back” and was like… no way! If they could do that with those terrible films from the recording studio, they could for sure do something with that old tape of John singing “Now and Then.” So, they fired up MAL* and got a pretty much perfectly isolated vocal track at 48 kHz.

So yeah, AI did that. If you’re interested in learning how, look up “Audio Source Separation.” You’ll learn that it is a method for recovering or reconstructing one or more source signals that, through some linear or convolutive process, have been mixed with other signals.

It is a lot like unmixing a can of paint - not an easy task.
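
If you like code, here’s a tiny sketch in Python of what “mixed with other signals” means. The signals are made up (a wobbly tone standing in for a voice, a chord standing in for a piano) - it’s just an illustration of the mixture model, not anything the Get Back team actually ran:

```python
# A toy illustration of the mixing model behind source separation.
# Assumption: a simple linear (instantaneous) mix; real recordings also
# involve room echo and tape artifacts, i.e. convolutive mixing.
import numpy as np

sr = 16000                      # sample rate in Hz
t = np.arange(sr * 2) / sr      # two seconds of time

# Two made-up "sources": a wobbly 220 Hz tone standing in for a voice,
# and a three-note chord standing in for a piano.
voice = 0.6 * np.sin(2 * np.pi * 220 * t + 0.5 * np.sin(2 * np.pi * 3 * t))
piano = 0.3 * sum(np.sin(2 * np.pi * f * t) for f in (261.6, 329.6, 392.0))

# The cassette only gives us the mixture; separation tries to get the voice back.
mixture = voice + piano
print(mixture.shape)            # (32000,)
```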

To do this, it uses a Short-time Fourier transform https://en.wikipedia.org/wiki/Short-time_Fourier_transform to see how the frequency and phase of the signals in the audio change over time. It then identifies the fundamental frequency of a voice and the number of harmonics, and finds the “unvoiced speech” - that is, the snaps, clicks, and breaths found in normal speech.
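
Here’s roughly what a short-time Fourier transform looks like in code, using SciPy on a made-up noisy tone - you slice the audio into short overlapping windows and take an FFT of each one, which gives you a magnitude (how strong each frequency is) and a phase for every moment in time:

```python
# A minimal sketch of a Short-time Fourier transform on a toy signal.
import numpy as np
from scipy.signal import stft

sr = 16000
t = np.arange(sr * 2) / sr
audio = np.sin(2 * np.pi * 220 * t) + 0.3 * np.random.randn(t.size)  # tone + noise

freqs, times, Z = stft(audio, fs=sr, nperseg=1024)
magnitude, phase = np.abs(Z), np.angle(Z)
print(magnitude.shape)   # (513 frequency bins, number of time frames)
```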

To isolate a voice, it has to identify a vocal section, distinguish between voice and noise using the short-time Fourier transform to tease out all the different signals, estimate the fundamental frequency and harmonics, use those to create a “mask” that dulls out the content you don’t want, and finally apply that mask to the signal.
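
In code, the masking step might look something like this. One big cheat: here the mask is an “oracle” computed from known toy sources, because we made the mixture ourselves. A real system has to estimate the mask from the mixture alone - that’s where the neural network comes in:

```python
# A sketch of the masking step: build a time-frequency mask that favors the
# voice and multiply it into the mixture's STFT, then invert back to audio.
import numpy as np
from scipy.signal import stft, istft

sr = 16000
t = np.arange(sr * 2) / sr
voice = 0.6 * np.sin(2 * np.pi * 220 * t)
piano = 0.3 * sum(np.sin(2 * np.pi * f * t) for f in (261.6, 329.6, 392.0))
mixture = voice + piano

_, _, V = stft(voice, fs=sr, nperseg=1024)     # STFT of the (known) voice
_, _, X = stft(mixture, fs=sr, nperseg=1024)   # STFT of the mixture

# Soft mask: the fraction of each time-frequency bin's energy that belongs
# to the voice (an "oracle" mask, only possible because we know the sources).
mask = np.clip(np.abs(V) / (np.abs(X) + 1e-8), 0.0, 1.0)

# Apply the mask (this "dulls out" everything that isn't voice) and invert.
_, voice_estimate = istft(mask * X, fs=sr, nperseg=1024)
```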

This type of method can be accomplished with a convolutional neural network that is trained to solve that problem in a way that can compensate for, say, multiple vocals, echo effects, and whatever other random noise happens to be in the audio.
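
For the curious, here’s a minimal sketch (assuming PyTorch) of what a mask-predicting convolutional network could look like - purely illustrative, not the actual model behind the documentary or “Now and Then”:

```python
# A tiny convolutional network that looks at a magnitude spectrogram and
# predicts a 0-to-1 mask for the voice. Real separation models are far
# larger and are trained on many hours of paired clean/mixed audio.
import torch
import torch.nn as nn

class MaskNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),  # squeeze mask values between 0 and 1
        )

    def forward(self, spectrogram):        # (batch, 1, freq_bins, frames)
        return self.net(spectrogram)

model = MaskNet()
spec = torch.rand(1, 1, 513, 64)           # fake magnitude spectrogram
mask = model(spec)                          # predicted voice mask, same shape
# Training (not shown) would compare mask * mixture_spectrogram against the
# clean vocal spectrogram and minimize the difference.
```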

That’s a lot of hand waving… because… it is complicated. But maybe this is a good jumping-off point if you want to learn more.

Until next week.

* I couldn’t find anything that specifically said they used MAL… but it was fun to say.