In spoken conversation, people naturally take turns amid interjections, overlaps, and other patterns that aren’t strictly verbal. A new approach generated natural-sounding — though not necessarily semantically coherent — audio dialogs without training on text transcriptions that mark when one party should stop speaking and the other should chime in.
What's new: Tu Anh Nguyen and colleagues at Meta, France’s National Institute for Research in Digital Science and Technology, and École des Hautes Études en Sciences Sociales introduced Dialogue Transformer Language Model (DLM), a system that learned to incorporate the interruptions, pauses, and inflections of conversational speech into audio dialogues. You can listen to examples here.
Key insight: Prior efforts to model dialogue were based on text, but text datasets omit information that’s unique to spoken interactions. Training directly on recordings of spoken dialogue can enable models to learn this additional mode of expression so they can mimic face-to-face conversation more naturally.
How it works: The system encoded two audio signals, the two sides of a spoken conversation, into tokens. It processed each token stream through a separate transformer and decoded the tokens back into audio signals. The transformers were trained on Fisher English Training Speech, a dataset of more than 10,000 telephone conversations, averaging 10 minutes each, recorded with a separate audio channel for each participant.
- HuBERT, a self-supervised model that produces speech representations, tokenized the audio signals using a convolutional neural network (CNN) followed by a transformer, reducing the audio from 16,000 samples per second to 50 tokens per second (the first sketch after this list illustrates the arithmetic). To adapt it to the Fisher dataset, the authors trained it to predict masked tokens.
- Given the tokens from HuBERT, HiFi-GAN, a generative adversarial network with a CNN architecture, learned to generate the audio waveform of a single speaker.
- Given the token streams, two transformers with shared weights learned to predict new tokens. The authors modified the transformers by adding, between the usual self-attention and fully connected layers, a cross-attention layer through which each transformer attended to the other channel’s tokens. The transformers also learned to estimate each token’s duration, which let the authors collapse runs of repeated tokens in the training data and thus avoid generating overly elongated sounds (such as a “hmm” that never ends). Both modifications are sketched after this list.
- At inference, the transformers repeatedly appended their next predicted tokens to two sequences, each of which started with a preset start token (the last sketch after this list illustrates the loop). HiFi-GAN then converted each sequence into audio.
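To make the tokenization rate concrete, here is a toy sketch (not the authors’ code) of how a HuBERT-style discretizer turns 16 kHz audio into roughly 50 units per second: a hop of 320 samples yields one frame per 20 milliseconds (16,000 / 320 = 50), and each frame’s feature vector is mapped to its nearest cluster centroid. The hop size, feature dimension, number of clusters, and the random-projection “features” are stand-ins chosen for illustration.

```python
import numpy as np

def discretize(audio_16khz: np.ndarray, centroids: np.ndarray, hop: int = 320) -> np.ndarray:
    """Map 16 kHz audio to one discrete unit per 20 ms frame (16,000 / 320 = 50 units/s).

    `centroids` stands in for a k-means codebook (shape: [n_units, feat_dim]); the
    random projection below is a toy stand-in for the CNN + transformer encoder.
    """
    n_frames = len(audio_16khz) // hop
    frames = audio_16khz[: n_frames * hop].reshape(n_frames, hop)
    feats = frames @ np.random.randn(hop, centroids.shape[1])  # placeholder features
    # Assign each frame to its nearest centroid, producing a stream of unit IDs.
    dists = ((feats[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

units = discretize(np.random.randn(16000 * 3), centroids=np.random.randn(500, 64))
print(units.shape)  # (150,) -> 50 units per second for 3 seconds of audio
```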
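The two-transformer design can be pictured with the minimal PyTorch sketch below: each channel runs self-attention over its own history, then cross-attention over the other channel’s hidden states, then the usual feed-forward sublayer, with the same weights serving both channels. The layer sizes, pre-norm placement, and omission of causal masking are our simplifications, not the paper’s exact architecture.

```python
import torch
import torch.nn as nn

class DualChannelLayer(nn.Module):
    """One transformer layer shared by both speakers: self-attention over a channel's
    own tokens, cross-attention over the other channel's tokens, then a feed-forward net."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x_self, x_other):
        # Self-attention within this speaker's own token stream (causal mask omitted for brevity).
        q = self.norm1(x_self)
        h = x_self + self.self_attn(q, q, q)[0]
        # Cross-attention into the other speaker's stream, so each channel "hears" the other.
        h = h + self.cross_attn(self.norm2(h), self.norm2(x_other), self.norm2(x_other))[0]
        # Standard position-wise feed-forward sublayer.
        return h + self.ff(self.norm3(h))

layer = DualChannelLayer()
a, b = torch.randn(1, 100, 512), torch.randn(1, 100, 512)
# The same weights process both channels; only the roles of "self" and "other" swap.
out_a, out_b = layer(a, b), layer(b, a)
print(out_a.shape, out_b.shape)  # torch.Size([1, 100, 512]) each
```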
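The duration trick amounts to run-length encoding the unit stream: collapse each run of identical tokens into one token plus a duration, and have the model learn both, so the timing can be restored at synthesis time. A minimal sketch, with a function name of our own choosing:

```python
from itertools import groupby

def dedup_with_durations(units):
    """Collapse runs of repeated units into separate token and duration sequences.

    Training on deduplicated tokens (plus durations) helps avoid generating
    unboundedly long repeats, such as an endless "hmm", at synthesis time.
    """
    runs = [(unit, sum(1 for _ in group)) for unit, group in groupby(units)]
    tokens = [unit for unit, _ in runs]
    durations = [length for _, length in runs]
    return tokens, durations

tokens, durations = dedup_with_durations([7, 7, 7, 12, 12, 3, 7, 7])
print(tokens)     # [7, 12, 3, 7]
print(durations)  # [3, 2, 1, 2]
```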
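Finally, generation can be viewed as two interleaved autoregressive loops, one per channel, followed by vocoding. The sketch below uses placeholder `model` and `vocoder` callables rather than a real API:

```python
import random

def generate_dialogue(model, vocoder, start_token=0, n_steps=500):
    """Grow two token sequences, one per speaker, in lockstep, then decode each to audio.

    `model(seq_a, seq_b)` is assumed to return the next token for each channel given
    both histories; `vocoder(seq)` stands in for HiFi-GAN synthesis of one channel.
    """
    seq_a, seq_b = [start_token], [start_token]
    for _ in range(n_steps):
        next_a, next_b = model(seq_a, seq_b)  # each channel conditions on both histories
        seq_a.append(next_a)
        seq_b.append(next_b)
    return vocoder(seq_a), vocoder(seq_b)

# Toy usage with stand-in components: random "predictions" and silence as "audio."
toy_model = lambda a, b: (random.randrange(500), random.randrange(500))
toy_vocoder = lambda seq: [0.0] * (320 * len(seq))  # pretend each unit spans 20 ms at 16 kHz
wave_a, wave_b = generate_dialogue(toy_model, toy_vocoder, n_steps=50)
print(len(wave_a), len(wave_b))
```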
Results: Crowdsourced evaluators compared DLM to a similar approach that used a single transformer to process both channels of the conversation. They rated naturalness of turn-taking and meaningfulness of the speech on a scale of 1 to 5. (Ground-truth dialogs scored around 4.25 on both criteria.) DLM performed relatively well on turn-taking but poorly on meaningfulness. For turn-taking, DLM achieved 3.86 while the single transformer achieved 3.46. For meaningfulness, DLM achieved 2.71 while the single transformer achieved 2.46.
Why it matters: Two transformers can model a pair of participants in a conversation (or other interaction) more effectively than one. Connecting them via cross-attention layers enables each to be aware of the other’s activity without needing to predict it. This simplifies the task of modeling their interactions while avoiding potentially confounding variables such as who said what.
We're thinking: The system’s ability to mimic the ebb and flow of conversation is impressive, but its verbal output is largely gibberish. To be fair, training on only 1,700 hours of audio conversation can’t be expected to impart much about semantics. We look forward to an update that produces more cogent spoken conversation.