A Korean pop star recorded a song in six languages, thanks to deep learning.
What’s new: Midnatt (better known as Lee Hyun) sang his latest release, “Masquerade,” in English, Japanese, Mandarin, Spanish, and Vietnamese — none of which he speaks fluently — as well as his native Korean. The entertainment company Hybe used a deep learning system to improve his pronunciation, Reuters reported. You can listen to the results here.
How it works: Hybe used Neural Analysis and Synthesis (NANSY), a neural speech processor developed by the Seoul-based startup Supertone, which Hybe acquired in January for $36 million.
- Given a vocal recording, NANSY separates pronunciation, timbre, pitch, and volume information. It uses wav2vec to analyze pronunciation, a custom convolutional neural network (CNN) for timbre, and a custom algorithm for pitch. To analyze volume, it takes an average across a mel spectrogram (a visual representation of a sound’s frequency components over time). NANSY then recombines the four elements using a CNN-based subsystem (see the sketch after this list).
- Lee initially recorded “Masquerade” in each of the six languages. Then the producers recorded native speakers of the non-Korean tongues reading the lyrics in their respective languages. NANSY melded the sung and spoken recordings to adjust Lee’s pronunciation.
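To make the four-factor decomposition concrete, here is a minimal PyTorch sketch of a NANSY-style analysis/synthesis loop. It follows the description above, but the module names, dimensions, and stand-in encoders are our assumptions, not Supertone’s code: a single convolution substitutes for the pretrained wav2vec pronunciation encoder, and torchaudio’s autocorrelation pitch detector substitutes for the custom pitch algorithm.

```python
import torch
import torch.nn as nn
import torchaudio

N_MELS = 80

class TimbreEncoder(nn.Module):
    """Stand-in CNN that squeezes a mel spectrogram into one timbre vector."""
    def __init__(self, dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(N_MELS, dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=5, padding=2), nn.ReLU(),
        )

    def forward(self, mel):                # mel: (batch, N_MELS, frames)
        return self.conv(mel).mean(dim=-1) # time-averaged: (batch, dim)

class Synthesizer(nn.Module):
    """CNN decoder that recombines the four factors into a mel spectrogram."""
    def __init__(self, pron_dim=256, timbre_dim=128):
        super().__init__()
        in_dim = pron_dim + timbre_dim + 2  # +1 pitch, +1 volume channel
        self.net = nn.Sequential(
            nn.Conv1d(in_dim, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, N_MELS, kernel_size=5, padding=2),
        )

    def forward(self, pron, timbre, pitch, volume):
        frames = pron.shape[-1]
        timbre = timbre.unsqueeze(-1).expand(-1, -1, frames)  # repeat over time
        return self.net(torch.cat([pron, timbre, pitch, volume], dim=1))

sr = 16_000
wave = torch.randn(1, 2 * sr)  # dummy 2-second "vocal take"

# Mel spectrogram: one 80-bin frame every 10 ms.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr, n_fft=1024, hop_length=160, n_mels=N_MELS)(wave)

# 1) Pronunciation: wav2vec supplies these features in NANSY; a single conv
#    layer stands in here so the sketch runs without pretrained weights.
pron = nn.Conv1d(N_MELS, 256, kernel_size=5, padding=2)(mel)

# 2) Timbre: a CNN, averaged over time into one speaker embedding.
timbre = TimbreEncoder()(mel)

# 3) Pitch: torchaudio's autocorrelation tracker stands in for the custom
#    algorithm; interpolate its frame rate to match the mel frames.
pitch = torchaudio.functional.detect_pitch_frequency(wave, sr)
pitch = nn.functional.interpolate(pitch.unsqueeze(1), size=mel.shape[-1])

# 4) Volume: an average across the mel spectrogram, as described above.
volume = mel.mean(dim=1, keepdim=True)  # (1, 1, frames)

# Recombination: a CNN-based decoder maps the four factors back together.
reconstruction = Synthesizer()(pron, timbre, pitch, volume)
print(reconstruction.shape)  # torch.Size([1, 80, 201])
```

Under this framing, the pronunciation fix described in the second bullet amounts to a swap: extract `pron` from a native speaker’s spoken recording, take `timbre`, `pitch`, and `volume` from Lee’s sung take, and recombine.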
Behind the news: The music industry has been paying close attention to generative audio models lately, as fans have used deep learning systems to mimic the voices of established artists. Reactions from artists and music companies have been mixed.
- The musician Grimes released a tool that allows users to transform their own voices into hers. She invited people to make money using her cloned voice in exchange for half of any resulting royalties. More than 300 fans responded by uploading Grimes-like productions to streaming services.
- Universal Music Group has been less welcoming. The recording-industry giant demanded that streaming services remove fan-made tracks that feature cloned voices of Universal artists.
Why it matters: This application of generated audio suggests that the technology could have tremendous commercial value. K-pop artists frequently release songs in English and Japanese, and popular musicians have recorded their songs in multiple languages since at least the 1930s, when Marlene Dietrich recorded her hits in English as well as her native German. This approach could help singers all over the world reach listeners who may be more receptive to songs in a familiar language.
We’re thinking: Auto-Tune software began as a tool for correcting flaws in vocal performances, but musicians quickly exploited it as an effect in its own right. How long before adventurous artists use pronunciation correction to, say, sing in their own languages with foreign accents?