Researchers at Amazon have trained the largest text-to-speech model to date, which they claim exhibits “emergent” properties that improve its ability to speak even complex sentences naturally. The breakthrough could be what the technology needs to escape the uncanny valley.
These models were always going to grow and improve, but the researchers specifically hoped to see the kind of leap in ability that language models made once they got past a certain size. For reasons we don’t yet understand, once LLMs pass that point they become much more robust and versatile, able to perform tasks they weren’t trained to do.
That doesn’t mean they gain sentience or anything, just that past a certain point their performance on certain conversational AI tasks hockey-sticks. The Amazon AGI team – it’s no secret what they’re aiming for – figured the same might happen as text-to-speech models grew, and their research suggests this is in fact the case.
The new model is called Big Adaptive Streamable TTS with Emergent abilities, which they have contorted into the abbreviation BASE TTS. The largest version of the model was trained on 100,000 hours of public domain speech, 90% of which is in English and the remainder in German, Dutch and Spanish.
At 980 million parameters, BASE-large appears to be the largest model in this class. For comparison, they also trained 400M- and 150M-parameter models on 10,000 and 1,000 hours of audio respectively; the idea is that if one of these models exhibits emergent behaviors and another doesn’t, you have a range for where those behaviors begin to emerge.
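For a concrete sense of that comparison grid, here is a quick Python sketch. The figures come from the article itself, but the variable names and labels are our own illustration, not anything from the paper:

```python
# Three models sharing one architecture, scaled on both parameter count and
# training audio, so emergent behavior can be bracketed between two sizes.
base_tts_variants = [
    {"name": "small",  "params": 150_000_000, "audio_hours": 1_000},
    {"name": "medium", "params": 400_000_000, "audio_hours": 10_000},
    {"name": "large",  "params": 980_000_000, "audio_hours": 100_000},
]

for v in base_tts_variants:
    print(f"BASE-{v['name']}: {v['params'] / 1e6:.0f}M params, "
          f"{v['audio_hours']:,} h of audio")
```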
As it turns out, the medium-sized model showed the jump in capability the team was looking for: not necessarily in ordinary speech quality (it was rated better, but only by a couple of points) but in the set of emergent abilities they observed and measured. Here are examples of tricky text mentioned in the paper:
- Compound nouns: The Beckhams decided to rent a charming, stone-built, quaint country cottage.
- Emotions: “Oh my God! Are we really going to the Maldives? It’s incredible!” Jenny squealed, bouncing on her tiptoes with boundless glee.
- Foreign words: “Mr. Henry, renowned for his wickedness, orchestrated a seven-course meal, each course a pièce de résistance.”
- Paralinguistics (i.e., readable non-words): “Shh, Lucy, shhh, we mustn’t wake your brother,” whispered Tom, as they passed the nursery.
- Punctuation: She received a strange message from her brother: “Emergency @ home? Call ASAP! Mom and Dad are worried… #familymatters.”
- Questions: But the Brexit question remains: After all the trials and tribulations, will ministers find the answers in time?
- Syntactic complexities: The film starring De Moya, who was recently honored with a Lifetime Achievement Award in 2022, was a big hit despite mixed reviews.
“These sentences are designed to contain challenging tasks – parsing garden-path sentences, placing phrasal stress on long compound nouns, producing emotional or whispered speech, or producing the correct phonemes for foreign words like ‘qi’ or punctuation like ‘@’ – none of which BASE TTS is explicitly trained to perform,” the authors write.
Features like these usually trip up text-to-speech engines, which mispronounce words, skip them, use odd intonation, or make some other blunder. BASE TTS still had problems, but it fared far better than contemporaries such as Tortoise and VALL-E.
There are a bunch of examples of these difficult texts being spoken quite naturally by the new model on the site the researchers set up for it. Of course, these were chosen by the researchers, so they’re necessarily cherry-picked, but it’s impressive regardless.
Because the three BASE TTS models share an architecture, the model’s size and the extent of its training data seem to be the cause of its ability to handle some of the complexities above. Bear in mind that this is still an experimental model and process, not a commercial release or anything. Further research will have to identify the tipping point for the emergent capability and how to train and deploy the resulting model efficiently.
Notably, the model is “streamable,” as the name says, meaning it doesn’t need to generate whole sentences at once but proceeds moment to moment at a relatively low bitrate. The team also tried packaging speech metadata such as emotionality and prosody into a separate, low-bandwidth stream that could accompany the vanilla audio.
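To make that streaming pattern concrete, here is a minimal Python sketch: audio emitted in small chunks, with a separate low-bandwidth metadata stream riding alongside. Everything here (`SpeechChunk`, `synthesize_streaming`, the metadata fields) is our own illustration of the idea, not the paper’s actual interface:

```python
import json
from dataclasses import dataclass
from typing import Iterator

@dataclass
class SpeechChunk:
    audio: bytes     # compressed speech codes for one short time slice
    metadata: dict   # low-bandwidth side channel: emotion, prosody, etc.

def synthesize_streaming(text: str) -> Iterator[SpeechChunk]:
    """Toy stand-in for a streamable TTS decoder.

    A real model would emit discrete speech codes autoregressively; here we
    fake one placeholder chunk per word just to show the interface shape.
    """
    for i, word in enumerate(text.split()):
        yield SpeechChunk(
            audio=b"\x00" * 128,  # placeholder for low-bitrate audio codes
            metadata={
                "chunk": i,
                "emotion": "neutral",       # illustrative side-channel values
                "emphasis": word.isupper(),
            },
        )

if __name__ == "__main__":
    # A consumer can begin playback as soon as the first chunk arrives,
    # rather than waiting for the whole sentence to be generated.
    for chunk in synthesize_streaming("Shh, Lucy, we mustn't wake your brother"):
        print(len(chunk.audio), json.dumps(chunk.metadata))
```

The appeal of the separate metadata stream is that values like emotion or prosody change far more slowly than the audio itself, so they cost almost nothing to transmit alongside it.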
It looks like text-to-speech models may have a breakout moment in 2024, just in time for the election! But there’s no doubting the utility of this technology, particularly for accessibility. The team notes that it has declined to publish the model’s source and other data due to the risk of bad actors taking advantage of it. The cat will escape that bag eventually, though.