AI x Sound
In this Guide, ลำธาร หาญตระกูล (Lamtharn Hantrakul) aka Yaboi Hanoi tells us about the long-standing relationship between music and technology focusing on recent AI-powered twists and turns. Combining his AI research experience and artistic practice, he will be covering how AI is applied to the music making process, the differences between symbolic-based and audio-based AI, current state-of-the-art tools you can try yourself, and the need for transcultural AI.
This Guide is a part of our community program AI Playground. AI Playground is an event series and a collection of Guides, structured under four topics: Image, Text, Body and Sound.
[PUBLISHED]
Aug 2023
01_Introduction
สวัสดีครับ - Sawasdeekrub! That’s hello in Thai 😀 I’m ลำธาร หาญตระกูล or Lamtharn Hantrakul, better known by my artist name Yaboi Hanoi!
I’m the winner of the recent AI Song Contest 2022 and am excited to be sharing my technical and musical experience in this AI Sound Guide with you. This guide is intended for anyone interested in how AI is being applied to sound, specifically to generating music. We will cover key developments in research, dive into music AI tools you can try right now and look into what musicians are doing with this technology.
I’m one of the original co-inventors of the DDSP library and Tone Transfer experience from the Magenta team at Google Brain, and develop music AI tools like Mawf as an AI Research Scientist in the Speech Audio and Music Intelligence (SAMI) team at ByteDance/TikTok R&D. I have degrees in both Physics and Music from Yale University, a Music Technology MS from Georgia Tech and continue to compose music professionally using AI-powered audio synthesis techniques. My winning piece for the AI Song Contest 2022, entitled “อสุระเทวะชุมนุม - Enter Demons and Gods”, demonstrated how modern AI can be used to empower melodies and tuning systems from Southeast Asia like never before.
In this guide, I’ll be telling the story of music AI from a joint perspective of music, technology and culture - pointing out where these systems can empower and where they fail in terms of cultural bias, and what we can do as musicians, researchers and listeners of music to address these issues together.
*Note: this guide does not represent the views of my current or previous employers, collaborators nor team members. It is based solely on my personal opinions and perspectives as a practitioner in the field.
02_A Brief History of Music and Technology
Music and technology have always been intertwined. Since the earliest records of human history, we’ve been using tools to make musical instruments.
With stones, we drilled holes in bones to make flutes. With wood, we cut and carved lutes, violins, fiddles and guitars. With metal, we shaped and hammered trumpets, pipes and xylophones. In fact, each of these breakthroughs in technology directly led to an expansion of sonic possibilities and paradigm shifts in musical genres. Without the electric guitar, Rock and Roll would have been impossible to play. Without the turntable, Hip Hop would not have been born. Without the modern laptop, dubstep could never have been created.
From this perspective, we can think of music AI as the next evolution of these creative technologies. The question I am excited by and seek to answer is this: what kind of new music do AI tools enable us to create?
Aurignacian flute made from an animal bone, Geissenklösterle (Swabia). Image from Wikipedia.
Artists and technologists have long been fascinated by the use of algorithms to make music. Even before the application of today’s AI, composers and technologists such as Lejaren Hiller were programming early computers like the ILLIAC (1957) to generate melodies and rhythms in the style of a Western string quartet. The Yellow Magic Orchestra from Japan were using state-of-the-art analogue synthesizers in the 1980s to create never-before-heard sounds beyond the world of acoustic instruments.
In 2023, “making music with AI” can mean a lot of things: using AI to write lyrics, using AI to write melodies, using AI to write chords, using AI to make a drum beat and so forth. AI can also be used to generate an entire song, complete with melodies, harmonies, bass, a singer and rhyming lyrics. In this case each element is often generated in steps and co-composed with a human composer, or, in very recent cases, made completely from scratch. Keeping this distinction in mind will help you navigate the music AI space.
Today, the term AI is used almost interchangeably with “machine learning”, or ML. These are techniques that learn from data, rather than being encoded with explicit rules. Modern machine learning has found applications in music across its creation, analysis and consumption. This guide will be focused on AI systems that can generate music directly, or be used alongside a human to make music.
Symbolic AI vs Audio AI systems
To begin, it’s important to separate music AI generation into two broad camps: symbolic generation and audio generation.
Symbolic AI Systems
A symbolic AI system generates the notes making up music, e.g. a melody containing the notes C D and G, or the harmony of a song, e.g. the chord progression C major F major G major. However, like musical notes on a page, these are just instructions of what to play. The instrument playing these notes, the tone of that instrument and how the performance is expressed are not part of the model’s output. It requires a human to play the music notes, or additional music software (e.g. a Digital Audio Workstation or DAW) to transform the notes into actual sound.
Here is an example of a song where the melody and chords were generated by a symbolic AI system trained on the style of the Beatles. It was then performed and produced into a track by human musicians.
In this sense, a symbolic music AI model is exactly like a text generation model covered in the AI Guide to AI Generated Text.
In the same way a text model like GPT-3 or the newly viral ChatGPT generates words, a symbolic music model generates a note instead of a word (these are called “tokens” in Natural Language Processing or NLP research). And very much like AI-generated text, the expression of the text, the voice of the reader and the emotion behind the speech are not part of the model’s output. It requires a human to read the text aloud, or a second system like a Text-To-Speech model (TTS) to transform the words into expressive speech.
These examples assume Western notation and music theory as the building blocks of music. We could also use notes from Thai and Indonesian scales, but we run into challenges explored later in the guide. For now, think of symbolic systems as music notes.
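To make the “instructions, not sound” idea concrete, here is a minimal Python sketch (not tied to any particular model) that writes a short melody to a MIDI file using the open-source pretty_midi package. The hard-coded note list stands in for a model’s output; the resulting file contains no audio at all, and a DAW or software instrument still has to render it into sound.

```python
# A minimal sketch of what a *symbolic* music model outputs: note events, not sound.
# Requires the pretty_midi package (pip install pretty_midi). The note list is
# hard-coded for illustration; in practice it would come from a model's sampling step.
import pretty_midi

generated_notes = ["C4", "D4", "G4", "E4", "C4"]  # pretend a model produced these

pm = pretty_midi.PrettyMIDI()
piano = pretty_midi.Instrument(
    program=pretty_midi.instrument_name_to_program("Acoustic Grand Piano"))

time = 0.0
for name in generated_notes:
    pitch = pretty_midi.note_name_to_number(name)  # e.g. "C4" -> MIDI pitch 60
    piano.notes.append(
        pretty_midi.Note(velocity=90, pitch=pitch, start=time, end=time + 0.5))
    time += 0.5

pm.instruments.append(piano)
pm.write("melody.mid")  # still just instructions: a synth or DAW must turn this into audio
```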
Audio AI Systems
An audio generation model will synthesize the waveform of the music directly i.e. it will generate the audio file (e.g. an .mp3 or a .wav file) including aspects like instrument sounds, vocals, drums, percussion and other sonic elements.
Animated GIF from the WaveNet paper (2016)
Audio samples from the WaveNet paper (2016)
The earliest application of modern neural networks to waveform generation was DeepMind’s WaveNet. It could generate high quality speech and piano music and was unprecedented back in 2016.
Notice how WaveNet is producing the notes, melodies, harmonies, piano tone and expression of the piano performance all at once as an audio file. This is akin to an image generation model directly outputting a JPEG, like those covered in AIxDesign’s Guide to AI Generated Images. Except instead of rendering pixels in two dimensions as a JPEG, the model is rendering samples of a waveform as a WAV file.
However, notice how the piano performance quickly loses musical coherence towards the end of each clip. This is because rendering directly into the audio domain is extremely data intensive. To put this in perspective, here is an image that is 128x128 pixels.
💡 With 3 RGB color channels, this demands 128x128x3, or about 49,000, values to be generated. An English paragraph can be covered with about 70 words, or simply 70 “values” to be generated. One second of high-quality audio at 48 kHz requires 48,000 samples or “values”, which means even a short 30-second TikTok snippet already requires over 1 million samples. A full 3-minute song in stereo puts us at over 17 million samples. Keeping musical coherence from the first sample to the millionth sample is a difficult task.
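If you want to double-check these numbers yourself, here is the back-of-the-envelope arithmetic as a few lines of Python (the 70-word paragraph and the 48 kHz sample rate are the same assumptions as in the box above):

```python
# Rough sizes of different media, measured in how many values must be generated.
image_values = 128 * 128 * 3            # 128x128 RGB image: ~49,000 values
paragraph_values = 70                   # a short English paragraph: ~70 word "values"
sample_rate = 48_000                    # samples per second of high-quality audio
tiktok_values = 30 * sample_rate        # 30-second clip: ~1.4 million samples
song_values = 3 * 60 * sample_rate * 2  # 3-minute stereo song: ~17 million samples

print(image_values, paragraph_values, tiktok_values, song_values)
```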
For this reason, generating music is a very challenging (and interesting!) machine learning task, since it requires modeling dependencies across many levels: from the relationships between notes across time at the symbolic level, to the coherency of timbres across samples at the audio level. We will revisit this hierarchy of music generation multiple times throughout this guide. It is worth noting here that generating full pieces of music is closer to generating video than images. Imagine an AI system that must render every frame of a short movie while maintaining an interesting plot and dialogue from the first frame to the last.
A Fun Aside: It was actually WaveNet that inspired me to make a career change from physics and acoustics into the field of audio machine learning for music! I remember during my gap year after college in 2016 reading about WaveNet and thinking “I don’t know what this machine learning thing is, but it sure looks like it will revolutionize music technology!” The Google Magenta team was also founded the same year and I remember the amazement of using NSynth for the first time and making cat-flutes!
Music AI Glossary
Melody - the main tune that sticks in your head and you end up singing in the bathroom
Monophonic - a melody consisting of just one note at a time. Singing is monophonic, you can’t sing two notes at once (unless you are a Mongolian Throat Singer)
Polyphonic - music consisting of multiple notes playing together. For example, harps from across the world including the West, Southeast Asia, the Middle East and Africa are polyphonic instruments.
Harmony - the sound of multiple notes playing together and how they change over time. These are the chords you strum on a guitar or play on a piano while someone is singing.
Convolutional Neural Network - often abbreviated as CNNs. A type of ML system mostly used for analyzing images.
Recurrent Neural Network - often abbreviated as RNNs. A type of ML system often used for analyzing sequences such as language.
Transformer-based Neural Network - often abbreviated as Transformers. A type of ML system that excels at modeling long-term dependencies in media like language.
Diffusion-based Neural Network - often abbreviated as Diffusion. A way of training an ML system that has yielded state-of-the-art results for images and music in 2022.
03_A Timeline of Key Breakthroughs in Music AI
The following papers are a slice of research that chronologically maps out some of the key breakthroughs in both symbolic and audio generation from my perspective. It’s impossible to cover all the papers, so for more technical resources I highly suggest following (or attending!) the top machine learning conferences like ICML, NeurIPS and ICLR and their respective tracks on creative AI, as well as music technology focused conferences like ISMIR. This section references research, so interacting with these models requires a technical background. The next section will focus on products you can try right now as an artist or curious reader.
I want a ChatGPT and DALL-E for music, where is it?
In January 2023, incredible progress was made in this direction. MusicLM is a new model that can generate musical audio directly from text, similar to DALL-E but for music!
An avocado armchair generated by DALL-E. Image by OpenAI
To put this in perspective, here is what makes this task more difficult than images and language. Imagine you asked 10 people to describe an avocado; you’d get 10 different sentences roughly describing something “green”, “round” and “tasty”. An ML model can now objectively associate the concept of an avocado with these words.
Now imagine you asked 10 people to draw an avocado. You’d get 10 different images, but most will generally show something roundish, greenish and smoothish. An ML model can now objectively associate the concept of an avocado with these forms.
Now imagine you asked 10 people to write music about an avocado. Honestly, as a trained composer, I have no idea what this would sound like. Someone might sing a ballad about avocados. Another might compose an orchestral piece about avocados. Someone might write a Thai percussion piece about avocados. The pairing of music with concepts is more subjective, which in turn, makes it harder to train an objective ML model on this data. Food for thought.
04_DIY: Tools to Make Your Own Generative Music
Remember the point about hierarchies in music? Is the system generating notes or is it generating audio? Is it generating a complete piece of music, or just an aspect of it like the melody? Many products in the music AI space are similarly divided in this way. Some work in the symbolic domain, others work in the audio domain. In order for a tool to appear on this list as opposed to the previous section on research, it must be a product you can “try right now while reading this article”. No additional training or sophisticated setup required.
Audio Synthesis
systems that generate waveforms - some can run in realtime (they can be used live on stage) while others run offline (they require rendering time)
Mawf VST: Another one of my babies! Realtime ML-based synthesis that can transform singing and other sounds into musical instruments.
XStudio: generate expressive singing voices and vocals in a multitude of styles.
MIDI generation
systems that generate or manipulate musical notes, e.g. melodies, chords or instrument parts
Magenta Studio: A suite of plugins that can generate drum tracks, kick start melodies and humanize grooves.
Orb Composer: Generate melodies and chord progressions
Lyrics
systems that help write the words and rhymes of a song
These Lyrics Do Not Exist: Generate lyrics based on style and topic
ChatGPT: Of its many abilities, ChatGPT can also write rhyming lyrics on a variety of topics.
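As a small illustration of the lyric-writing use case, here is a minimal sketch of prompting ChatGPT programmatically, using the openai Python package as it existed in mid-2023 (version 0.x; the newer SDK uses a different interface, and the prompt below is just an example):

```python
# Minimal lyric-writing sketch with the 2023-era openai package (0.x API).
# pip install openai, and set OPENAI_API_KEY in your environment.
import openai

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{
        "role": "user",
        "content": "Write a four-line rhyming chorus about city rain at night."
    }],
)
print(response["choices"][0]["message"]["content"])
```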
Audio Effects
systems that affect the tone of audio - think of these like Instagram filters
Native Instruments Guitar Rig 6: uses audio machine learning to learn a data-driven model of guitar amps for use in production.
TAIP: uses ML to make digital recordings sound like they were made with vintage, expensive analogue equipment from historical high-end studios
Mixing and Mastering
this describes the stage after songwriting and production, when the track is balanced and given its final polish
Ozone Suite: AI-powered mixing to help music makers achieve a pleasing tonal balance in a track.
LANDR: AI-powered mastering to give a finished track the final polish before publishing
Sample Search
systems that help musicians find the right sounds for their compositions. Think of this like being able to search photos by a person’s face, instead of scrolling through a library
Splice CoSo: When searching for matching or similar sounds, Splice will use AI to find similar loops or samples that go together
XLN XO: Uses ML to put similar sounds on a 2D map, so specific snares and kicks are much easier to find.
Source Separation
systems that extract the vocal or drum parts from a song. Extremely useful for creative remixing and sampling of music
lalal.ai: separate the vocals, bass, instrumentals and drums from a recorded piece of music.
audioshake: same as above.
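The products above use neural source separation under the hood, which can’t be reproduced in a few lines. As a hedged, classic signal-processing cousin you can run locally, here is harmonic/percussive separation with the open-source librosa library (“song.wav” is a placeholder path):

```python
# Harmonic/percussive separation with librosa: a classic DSP technique, much simpler
# than the neural stem separation used by products like lalal.ai or AudioShake.
import librosa
import soundfile as sf

y, sr = librosa.load("song.wav", sr=None)       # load at the file's native sample rate
harmonic, percussive = librosa.effects.hpss(y)  # split sustained vs. transient content

sf.write("song_harmonic.wav", harmonic, sr)     # melodic/harmonic layer
sf.write("song_percussive.wav", percussive, sr) # drums and other transients
```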
Background music
generate entire tracks complete with bass, drums, melodies and chords
AIVA: Generate the melody, drums and harmonies of a finished song based on genre
Amper Music: Generate the melody, drums and harmonies of a finished song based on genre, duration and mood.
05_What are Musicians Doing with AI?
This guide has hopefully made clear that AI can be applied to any part of the music creation process. When an artist is using AI, they are most likely using it to accomplish a specific musical task, like writing lyrics or synthesizing a sound, alongside other musical elements created by humans. Like musical styles, every artist uses the technology with their own unique spin and flavor, as seen in the AI Song Contest.
EXAMPLES FROM AI SONG CONTEST
The AI Song Contest is an international competition started in 2019 to explore, encourage and push the boundaries of human and AI co-composition. The competition is also unique in how it requires all participants to document their music-making process and how AI was used. As a result, the competition offers a very clear picture of the state-of-the-art in music generation, and how it is realistically used by musicians and artists. It also attracts a wide range of contestants, from researchers and coders to professional singers, from many parts of the world.
Here are some of my favorites and notable uses of AI. For more details, click through to their artist pages to see the process document.
NOTABLE CREATORS
06_Building AI Tech That Empowers All Musical Traditions
Defining Bias
Technology is always created within the cultural context of its inventors. This is a central part of all the work I do as a researcher, musician and cultural technologist. From gender-biased machine translation models to racially biased computer vision models: cultural, gender and racial norms can be explicitly or implicitly encoded in hardware and software design. But what does “bias” actually mean? From a mathematical standpoint, ML systems are biased; they have a “viewpoint” that is essential in decision making. This type of mathematical bias can come from two places.
Model Bias
Model bias refers to choices a researcher makes which encode properties about the data. In the case of music, most AI models are rooted in Western definitions of melody, harmony, and tuning. Nearly all symbolic music models require the input and output notes to belong to the keys of a piano i.e. a 12-tone equal temperament tuning or 12-TET. Thai classical music does not follow these conventions. Neither does music from Indonesia, Vietnam and countries from Southeast Asia, the Middle East and Africa. Historically, these built-in assumptions (i.e. model bias) have made it near-impossible to apply music machine learning (ML) to these traditions.
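To see how restrictive the 12-TET assumption is, here is a small sketch comparing one octave of the piano grid with a seven-tone equal division of the octave, which is one common (and simplified) way Thai classical tuning is described. The A4 = 440 Hz reference is just an illustrative assumption:

```python
# One octave of 12-tone equal temperament (the piano grid most symbolic models assume)
# versus an approximate 7-tone equal division of the octave, a simplified description
# of Thai classical tuning. A reference pitch of 440 Hz is assumed for illustration.
base = 440.0
twelve_tet = [base * 2 ** (step / 12) for step in range(13)]
seven_tet = [base * 2 ** (step / 7) for step in range(8)]

for step, f in enumerate(seven_tet):
    # Most 7-TET degrees fall *between* piano keys, so a model that can only
    # output 12-TET note numbers simply cannot represent them.
    nearest = min(twelve_tet, key=lambda g: abs(g - f))
    print(f"7-TET step {step}: {f:7.2f} Hz (nearest 12-TET pitch: {nearest:7.2f} Hz)")
```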
Dataset Bias
Dataset bias, on the other hand, refers to the composition of data and how it was collected. Datasets of music from Thailand and around the world that are labelled, high quality and appropriately licensed for training are difficult to find (i.e. dataset bias). Notice how most of the symbolic music AI research presented above revolves around piano repertoire, but no equivalent research exists around the Indonesian Gamelan. Most open source datasets tend to focus on Western genres of music like classical and pop.
Cultural Bias
At this point, the problem becomes more than mathematical in nature, and tied instead to larger structural issues: notably the white, male, and colonial history of music and ML scholarship. With this in mind, I wanted my piece for the AI Song Contest to demonstrate the exciting possibilities when AI research, music composition and cultural empowerment are engaged simultaneously.
Differentiable Digital Signal Processing and Tone Transfer
When I was an AI Resident at Google, I worked with Magenta to co-develop a breakthrough technology called Differentiable Digital Signal Processing, or DDSP. This open source library enabled researchers and musicians to train lightweight audio ML models with two key properties.
Addressing model bias: DDSP systems model sound at the level of explicit frequencies instead of notes (see the sketch after the next point). Once a DDSP model was trained on a saxophone, you could transform a singing performance into a saxophone performance. It didn’t matter if the singing deviated from the notes of a piano; the system worked with the beautiful *meend* (melodic inflection) of Indian Carnatic melodies. Because the model bias was designed like this during the research phase, you could even transform sounds that cannot be notated, e.g. bird chirps and kitchen noises (try it yourself on Tone Transfer!).
Addressing dataset bias: DDSP models required much less data, say 20 minutes of a target instrument, in order to generate realistic sound. This was because DDSP models didn’t have to learn basic properties of sound, like vibration, from scratch (learn more from the original paper [here]). This made it possible for end-users to collect their own data and train DDSP models themselves. For example, for the Sounds of India Project, we collected recordings of the Indian shehnai in one sitting and quickly trained a model based on this instrument. It was exciting from the perspective of dataset bias, because musicians were no longer limited by expensive data collection or public datasets.
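For the curious, here is a rough sketch of what “modeling sound at the level of explicit frequencies” means, using plain NumPy additive synthesis rather than the actual DDSP library. In real DDSP the pitch and harmonic envelopes are learned from data and predicted by a neural network; here they are made up by hand, including a pitch glide that deliberately ignores the piano grid:

```python
# Hand-rolled harmonic (additive) synthesis: the kind of oscillator bank a DDSP-style
# decoder drives with learned f0 and loudness curves. Everything here is hand-set,
# nothing is learned -- it only illustrates the frequency-level representation.
import numpy as np
import soundfile as sf

sr = 48_000
duration = 2.0
t = np.linspace(0, duration, int(sr * duration), endpoint=False)

# A smooth pitch glide from 220 Hz to 310 Hz, deliberately not landing on piano notes.
f0 = 220.0 + (310.0 - 220.0) * (t / duration)
phase = 2 * np.pi * np.cumsum(f0) / sr           # integrate frequency to get phase

audio = np.zeros_like(t)
for harmonic, weight in enumerate([1.0, 0.5, 0.3, 0.2, 0.1], start=1):
    audio += weight * np.sin(harmonic * phase)   # stack harmonics of the moving f0

audio *= 0.2 / np.max(np.abs(audio))             # normalize to a safe listening level
sf.write("glide.wav", audio.astype(np.float32), sr)
```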
Mawf VST
Making DDSP run in realtime: When I joined TikTok’s Speech Audio and Music Intelligence (SAMI) team, I was one of the lead engineers behind the VST plugin Mawf, released in 2022. Our team developed a DDSP-based system that could run in realtime with low latency, meaning musicians could actually perform with the tool live like a real instrument. Mawf could synthesize sound at a full 48 kHz sample rate, on par with other professional-grade music production tools and a significant upgrade from my previous work at 16 kHz (i.e. telephone quality). This represented the final piece of the puzzle: putting DDSP in a form familiar to musicians so they can get creative with it.
Enter Demons and Gods. Imagine a demon blowing into a musical instrument. What would it sound like? In Thai mythology, a famous scene portrays hundreds of demons and gods descending upon the ocean to churn an elixir of immortality. My track depicts the earth-shattering arrival of these two forces. The melodies and sound design were inspired by the ปี่ “Phi”, a piercing reeded instrument that permeates Thai sonic culture. I used new audio AI tools I had developed over the course of my career to precisely synthesize music according to Thai tuning systems in a way that was never possible before. Definitely check out my video on how to make AI bass music from Thailand on TikTok.
Using Mawf and DDSP to reimagine Thai classical music: For the first time, I could use Mawf to take motifs from Thai classical music and transform them into new timbres unheard of in the Thai classical repertoire. During this sonic transformation, Mawf retains the underlying frequencies of Thai scales with unprecedented precision. I could create the sound of a realistic saxophone as though the instrument was built for Thai music, or create an out-of-this-world dubstep bassline that could be played in tune with a Thai ensemble. This would never have been possible using traditional music tools. Take a listen to this motif from the Thai Phi and its subsequent transformation into the lead line of the track.
This tool enabled me to craft an entire song in the style of bass music, in a way that doesn’t compromise the melodies and tunings that make Thai music and Southeast Asian music unique. It shows how AI can be used as a tool for cultural empowerment and reimagination across a range of traditions, something I call “Transcultural Machine Learning.”
Cultural Empowerment vs Cultural Appropriation
Sampling was a technique that changed the fabric of music. It was a double-edged sword, simultaneously birthing entire genres of music like hip hop but also being used as a tool for cultural appropriation. AI-powered music tools are no different. I would argue that models trained on a particular tradition demand the same respect as a cultural artefact from that country, akin to a national dress or national symbol.
Let’s use DDSP as a concrete example. What does it mean to train a DDSP model on the sound of an instrument that connotes the passing away of a loved one, but then use it to write an EDM track for a music festival? In projects I have helped lead such as Tone Transfer, our teams thought very carefully about how technologies represent culture. When a DDSP model of a Chinese Guqin failed to produce good results, I remember one of my colleagues scrunching his face at the sound. We were worried the models could misrepresent the culture if it was someone’s first encounter with that tradition, and made the hard decision of removing lesser known instruments. In contrast, Sounds of India proudly featured models of classical Indian instruments because it was part of India’s Independence Day Celebration and was to be used mostly by Indian locals. For Mawf, we released a Thai Khlui flute, because our team had tested the model rigorously and were confident it would always sound as intended.
07_Concluding Thoughts & Further Questions
What about music AI Copyright?
This is a fascinating conversation that is unfortunately beyond the scope of this article. A lot of copyright law hasn’t caught up with AI-generated music yet. Can a machine claim copyright if it is not a human? What does it mean to scrape data from artists who don’t want their work trained on? You can read more at these links:
Keynote by Sophie Goossens: AI, Creativity and Music a Legal Update
Art created by AI cannot be copyrighted
AI Music Outputs: Challengers to the Copyright Legal Framework
Is creative AI really “creative”?
Different scholars, researchers and musicians have vastly contrasting opinions about this. From my technical, artistic and cultural perspective, I think music requires two fundamental forces working together. About 80% consists of following the norms and conventions of what has come before. The last 20% is when an artist breaks the rules. Break too many rules and the music might be too avant-garde. Follow too many rules, and the music may not be interesting enough. I think music which has led to paradigm shifts has this 80:20 character.
From this perspective, I would argue there is nothing in the mathematics of ML that captures this concept of “breaking the rules”. The model is tasked with finding patterns and adhering to those patterns as best as possible, not breaking them. I believe we need new advances in the field in order to be able to incorporate this last 20% where the rules are broken.
What comes next?
It’s an exciting time to be in the music technology space. The field has matured and is becoming more mainstream in the public imagination. This means more interest from the institutions in academia and industry producing these AI tools, as well as more demand from consumers who want to experience music with AI. I’m most excited to see AI tools integrated as standard features in music software, the ability to render music from a short description, and music AI systems which can adapt to new tunings and musical cultures on the fly.
To an ML model, music is purely a set of statistical occurrences unfolding in time. To a human, music is so much more. It’s a powerful way of evoking emotion, expressing oneself and bringing communities together. It is my hope that AI tools, like our ancestors’ bone flutes from thousands of years ago, will always be about empowering this fundamental aspect of music shared by human cultures across the world.
AI PLAYGROUND S01 / SOUND
This Guide is a part of our community program AI Playground. AI Playground is an event series and a collection of Guides, structured under four topics: Image, Text, Body and Sound. As part of the program we hosted 2 events exploring AI in relation to music and sound.
Artist Talk: Building Soundscapes Through (Machine) Learned Histories with Felipe Sanchez Luna (Kling Klang Klong)
Workshop: Composing Chrom(AI)tic Resonances with Soyun Park