Much as systems like DALL-E generate images from text prompts, Google researchers have created an AI that can produce minutes-long musical compositions from text prompts, and that can even transform a whistled or hummed tune into other instruments. You can't experiment with the MusicLM model yourself, but the company has published a number of samples it produced with the model.
The examples are impressive. There are 30-second snippets generated from paragraph-long descriptions that specify a genre, a mood, and even individual instruments, as well as pieces up to five minutes long generated from just a word or two, such as "melodic techno." One of the most interesting demos is "story mode," in which the model is essentially handed a screenplay to follow, shifting the music in response to timed prompts. This one, for instance:
electronic song played in a videogame (0:00-0:15)
meditation song played next to a river (0:15-0:30)
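The story-mode script above is just text prompts paired with time windows. As a toy illustration (this is not Google's actual input format or API, merely an assumed plain-text convention based on the sample shown), here is how such a script might be parsed into timed segments:

```python
import re

# Illustrative sketch: parse a MusicLM-style "story mode" script where
# each line pairs a text prompt with an (M:SS-M:SS) time window.
SCRIPT = """\
electronic song played in a videogame (0:00-0:15)
meditation song played next to a river (0:15-0:30)
"""

LINE_RE = re.compile(
    r"^(?P<prompt>.+?)\s*\((?P<start>\d+:\d{2})-(?P<end>\d+:\d{2})\)$"
)

def to_seconds(stamp: str) -> int:
    """Convert an M:SS timestamp into a number of seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + int(seconds)

def parse_script(text: str):
    """Return (prompt, start_seconds, end_seconds) tuples, one per line."""
    segments = []
    for line in text.strip().splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            segments.append(
                (m["prompt"], to_seconds(m["start"]), to_seconds(m["end"]))
            )
    return segments

print(parse_script(SCRIPT))
```

A segment list like this is how one might hand a downstream generator one prompt per time window.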
Though it’s not my cup of tea, I can genuinely hear something human behind this composition (I also listened to it on loop dozens of times while writing this article). The demo site also features samples of the model’s output when asked to produce eight-second clips of a given genre, music that would fit a prison escape, and even what a beginner piano player would sound like next to an accomplished one. It also shows the model’s interpretations of phrases like “futuristic club” and “accordion death metal.”
MusicLM can also simulate human vocals, and while it gets the pitch and overall tone about right, there is an uncanny quality to them that isn’t quite right. The closest comparison I can make is to the grainy or staticky sound of old recordings. The example above doesn’t demonstrate that trait especially well, but I think this one does:
That’s what you get when you ask it for music suitable for a gym. You may also have noticed that the lyrics are complete gibberish, but in a way you might not catch unless you were paying close attention, much like someone singing in Simlish or that one song that’s meant to sound like English but isn’t.
Generating music with artificial intelligence is not a new phenomenon: algorithms have been credited with composing pop songs, copying Bach better than a human could in the ’90s, and accompanying live performances for decades. One recent approach feeds a text prompt to the AI image generator Stable Diffusion, which outputs a spectrogram that is then converted into audio. The MusicLM paper claims the model outperforms competing systems both in “quality and adherence to the caption” and in its ability to take in audio and replicate a melody.
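To make the spectrogram route concrete, here is a deliberately tiny sketch of its final step: turning a time-frequency magnitude grid into audio. Everything here (the 3x4 grid, the bin frequencies, the sample rate) is invented for illustration, and real systems recover audio with Griffin-Lim or a neural vocoder rather than naive sinusoid summation:

```python
import math

# Toy sketch of spectrogram-to-audio conversion: each column of the
# grid is a time frame, each row a frequency bin, and audio is rebuilt
# by summing one sinusoid per bin, scaled by that frame's magnitudes.
SAMPLE_RATE = 8000
FRAME_LEN = 800                       # samples per frame (0.1 s)
BIN_FREQS = [220.0, 440.0, 880.0]     # what the three rows stand for (Hz)

# A made-up 3x4 "spectrogram": rows = frequency bins, columns = frames.
SPECTROGRAM = [
    [1.0, 0.0, 0.0, 1.0],             # 220 Hz active in frames 0 and 3
    [0.0, 1.0, 0.0, 0.0],             # 440 Hz active in frame 1
    [0.0, 0.0, 1.0, 0.0],             # 880 Hz active in frame 2
]

def spectrogram_to_audio(spec, bin_freqs, frame_len, sr):
    """Render each frame as a sum of sinusoids weighted by its magnitudes."""
    n_frames = len(spec[0])
    audio = []
    for frame in range(n_frames):
        for n in range(frame_len):
            t = (frame * frame_len + n) / sr
            sample = sum(spec[b][frame] * math.sin(2 * math.pi * f * t)
                         for b, f in enumerate(bin_freqs))
            audio.append(sample)
    return audio

samples = spectrogram_to_audio(SPECTROGRAM, BIN_FREQS, FRAME_LEN, SAMPLE_RATE)
print(len(samples))  # 3200 samples = 0.4 s of audio
```

The point is only that a spectrogram is a complete enough recipe to rebuild sound from, which is why an image generator can moonlight as a music generator.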
That last capability powers one of the most impressive demos the researchers have posted. You can listen to the input audio, in which someone hums or whistles a tune, and then hear how the model renders it as a different instrument, such as an electronic synth lead, a string quartet, a guitar solo, and so on. From what I heard, it handles the task very well.
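MusicLM does this with a learned model, but the underlying idea, keep the pitch contour and swap the timbre, can be sketched in a few lines. The contour, note length, and sample rate below are invented for illustration, and a square wave stands in for the “synth lead”:

```python
import math

# Toy "melody transfer": re-synthesize a hummed pitch contour with a
# different waveform. Same notes, different timbre.
SAMPLE_RATE = 8000
NOTE_LEN = 2000                                  # samples per note (0.25 s)
HUMMED_CONTOUR = [262.0, 294.0, 330.0, 262.0]    # made-up contour (C-D-E-C, Hz)

def square(phase: float) -> float:
    """Square wave: richer in harmonics than a hummed sine, i.e. 'synth-ier'."""
    return 1.0 if math.sin(phase) >= 0 else -1.0

def render(contour, wave, note_len, sr):
    """Re-render a pitch contour with the given waveform (timbre)."""
    out = []
    for freq in contour:
        for n in range(note_len):
            out.append(0.3 * wave(2 * math.pi * freq * n / sr))
        # A real system would also smooth the note boundaries.
    return out

synth_lead = render(HUMMED_CONTOUR, square, NOTE_LEN, SAMPLE_RATE)
print(len(synth_lead))  # 8000 samples = 1 s
```

Swapping `square` for any other waveform function changes the instrument while the melody stays fixed, which is the effect the demo shows off.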
As with its past forays into artificial intelligence, Google is being far more cautious with MusicLM than some of its rivals may be with comparable technology. “We have no intentions to distribute models at this stage,” the paper states, citing concerns about plagiarism and cultural appropriation as reasons for holding the models back.
The technology could someday show up in one of Google’s playful music experiments, but for now the research mainly benefits others building musical AI systems: Google says it plans to publicly release a dataset containing around 5,500 music-text pairs.