▲ Kick
▲ Snare
▲ Hi-hat
▲ Tom
▲ Bass Drum
▲ Drum samples generated by a conditional WGAN with gradient penalty (WGAN-GP) that directly outputs one-dimensional time-domain waveforms. iZotope's royalty-free BreakTweaker drum samples were used as the training dataset.
▲ Continuous morphing of the generated audio as the latent vector is varied continuously (by linear interpolation) in the latent space. For this demo, a WGAN-GP that generates two-dimensional spectrograms was used. The generated spectrograms were converted into waveforms using the Griffin-Lim algorithm. The NSynth dataset was used for training.
code: https://github.com/soohyun123/Drums-sample-generator-using-conditional-GAN
▲ The code is based on the Librosa and PyTorch libraries.
This research project was proposed in May 2018, and selected for the Undergraduate Research Program (URP) of Korea Advanced Institute of Science and Technology (KAIST) with financial support.
This was in 2018, when GANs were still quite new. After this project, I served in the Korean military until 2021.
This research proceeded through two phases:
First, from June 2018 to December 2018, I tested a method in which spectrograms were generated by a WGAN-GP and then converted into waveforms using the Griffin-Lim algorithm.
Inspired by the success of GANs in the image domain, I tried generating spectrograms with a GAN, since spectrograms can also be treated as two-dimensional images. I first tried a vanilla DCGAN architecture trained with binary cross-entropy (BCE) loss, but it did not work. I then found that using WGAN-GP was essential for successful training when the images were as large as 128x128. (As of 2021, however, there are generative models capable of generating much larger images with high fidelity.) The NSynth dataset was used for training.
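For reference, the gradient penalty that gives WGAN-GP its name can be sketched in PyTorch as below. This is a minimal illustration, not the project's actual code: the toy linear critic, the random 128x128 batches standing in for real and generated spectrograms, and the penalty weight of 10 (a value commonly used in the WGAN-GP literature) are all assumptions.

```python
import torch
import torch.nn as nn


def gradient_penalty(critic, real, fake):
    """Penalize the critic when its gradient norm at points
    between real and fake samples deviates from 1."""
    batch = real.size(0)
    # One random mixing coefficient per sample, broadcast over remaining dims.
    eps = torch.rand(batch, *([1] * (real.dim() - 1)), device=real.device)
    mixed = (eps * real + (1.0 - eps) * fake).detach().requires_grad_(True)
    scores = critic(mixed)
    # Gradient of the critic scores with respect to the mixed inputs.
    (grads,) = torch.autograd.grad(
        outputs=scores,
        inputs=mixed,
        grad_outputs=torch.ones_like(scores),
        create_graph=True,  # keep the graph so the penalty itself is differentiable
    )
    grad_norm = grads.reshape(batch, -1).norm(2, dim=1)
    return ((grad_norm - 1.0) ** 2).mean()


torch.manual_seed(0)
# Hypothetical toy critic; a real critic would be a convolutional network.
critic = nn.Sequential(nn.Flatten(), nn.Linear(128 * 128, 1))
real = torch.randn(4, 1, 128, 128)  # stand-in for real spectrograms
fake = torch.randn(4, 1, 128, 128)  # stand-in for generator output

lambda_gp = 10.0  # assumed penalty weight
gp = gradient_penalty(critic, real, fake)
critic_loss = critic(fake).mean() - critic(real).mean() + lambda_gp * gp
critic_loss.backward()
```

The penalty replaces the weight clipping of the original WGAN, which is one plausible reason it trains more stably than a BCE-based DCGAN at this image size.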
I also examined whether a continuous change in the latent vector results in a continuous morphing of the output sound. I could hear the output sound change smoothly as the latent vector moved continuously between two points.
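The interpolation itself is simple to sketch. The snippet below, using NumPy, builds a straight-line path between two latent vectors; the latent dimension of 100 and the number of steps are assumptions for illustration, and each row of the result would then be fed to the trained generator.

```python
import numpy as np


def interpolate_latents(z_start, z_end, num_steps):
    """Return num_steps latent vectors evenly spaced on the line
    from z_start to z_end (both endpoints included)."""
    alphas = np.linspace(0.0, 1.0, num_steps)
    return np.stack([(1.0 - a) * z_start + a * z_end for a in alphas])


rng = np.random.default_rng(0)
z_a = rng.standard_normal(100)  # assumed latent dimension of 100
z_b = rng.standard_normal(100)

path = interpolate_latents(z_a, z_b, num_steps=9)
# path[0] equals z_a, path[-1] equals z_b, and path[4] is their midpoint.
```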
▲ Training of the spectrogram WGAN-GP
During this period, Chris Donahue et al. (2019) released the preprint of their paper on audio generation with GANs, which covered the scope of my project. The authors called the method that I had tested 'SpecGAN'. They also tested 'WaveGAN', which directly generates one-dimensional time-domain waveforms. Their results showed that WaveGAN is better than SpecGAN in terms of sound quality.
Hence, from January 2019 to February 2019, I also tested the WaveGAN approach by modifying my SpecGAN-style implementation. I also added a conditional input using one-hot encoding. iZotope's BreakTweaker drum samples were used as the training dataset, and the conditional input indicated which part of a drum set (e.g. kick, snare, hi-hat) each audio sample corresponded to.
Later, Jesse Engel et al. (2019) improved the fidelity and frequency resolution of GAN-based audio generation by generating instantaneous frequency spectra together with log-magnitude spectrograms to provide coherent phase information.
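As an illustration of one-hot conditioning, the sketch below appends a one-hot class vector to the latent vector before it enters the generator. Concatenation is only one common way to inject a condition and is assumed here, as are the class list, the latent dimension of 100, and the helper names; the repository may do this differently.

```python
import numpy as np

# Hypothetical class list for illustration.
DRUM_CLASSES = ["kick", "snare", "hi-hat", "tom"]


def one_hot(label, classes=DRUM_CLASSES):
    """Encode a class label as a vector with a single 1 at the label's index."""
    vec = np.zeros(len(classes), dtype=np.float32)
    vec[classes.index(label)] = 1.0
    return vec


def conditioned_latent(z, label):
    """Concatenate the latent vector with the one-hot condition."""
    return np.concatenate([z, one_hot(label)])


rng = np.random.default_rng(0)
z = rng.standard_normal(100).astype(np.float32)  # assumed latent dimension
g_input = conditioned_latent(z, "snare")
# g_input has 104 entries; the last four are the one-hot code for "snare".
```

Keeping the latent vector fixed while changing only the label would then request a different drum part from the same noise sample.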