Deepgram has made a name for itself as one of the most popular voice recognition startups. Today, the well-funded company announced the launch of Aura, its new real-time text-to-speech API. Aura combines highly realistic voice models with a low-latency API that allows developers to build real-time, conversational AI agents. Backed by large language models (LLMs), these agents can then stand in for customer service agents in call centers and other customer-facing situations.
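The agent loop described here is a three-stage pipeline: speech-to-text in, an LLM in the middle, and Aura's text-to-speech out. A minimal sketch of that architecture follows; every function here is a hypothetical stand-in, not Deepgram's actual SDK or any real LLM API:

```python
# Sketch of the conversational-agent pipeline described above.
# All three stage functions are hypothetical placeholders.

def transcribe(audio: bytes) -> str:
    """Stand-in for a streaming speech-to-text call."""
    return audio.decode("utf-8")  # pretend the audio is already text

def generate_reply(text: str) -> str:
    """Stand-in for an LLM completion call."""
    return f"You said: {text}"

def synthesize(text: str) -> bytes:
    """Stand-in for a text-to-speech call such as Aura."""
    return text.encode("utf-8")

def handle_turn(audio_in: bytes) -> bytes:
    """One conversational turn: hear, think, speak."""
    transcript = transcribe(audio_in)
    reply = generate_reply(transcript)
    return synthesize(reply)
```

In a real deployment each stage would stream rather than run sequentially on complete inputs; keeping per-stage latency low is what makes the whole loop feel conversational.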
As Deepgram co-founder and CEO Scott Stephenson told me, great voice models have long been available, but they were expensive and time-consuming to compute, while low-latency models tend to sound robotic. Aura delivers human-like voice models rendered extremely quickly (typically in well under half a second) and, as Stephenson noted repeatedly, at a low price.
“Everybody is now saying, ‘Hey, we need real-time voice AI bots that can pick up on what’s being said and that can understand and generate a response — and then respond,'” he said. In his view, it takes a combination of accuracy (which he described as table stakes for a service like this), low latency and acceptable cost to make a product like this worthwhile for businesses, especially when combined with the relatively high cost of accessing LLMs.
Deepgram claims Aura’s pricing beats virtually all of its competitors at $0.015 per 1,000 characters. That is not far off from Google’s WaveNet voices and Amazon’s Polly Neural voices, both priced at $0.016 per 1,000 characters, but it is, granted, slightly cheaper. Amazon’s top tier, however, is significantly more expensive.
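Taken at face value, the per-character prices quoted above work out as follows for an arbitrary example volume of one million synthesized characters (the vendor figures come from the comparison above; the volume is illustrative):

```python
# Cost comparison using the per-1,000-character prices quoted above.
PRICES_PER_1K_CHARS = {
    "Deepgram Aura": 0.015,
    "Google WaveNet": 0.016,
    "Amazon Polly Neural": 0.016,
}

def cost(chars: int, price_per_1k: float) -> float:
    """Total cost in dollars for a given character count."""
    return chars / 1_000 * price_per_1k

for vendor, price in PRICES_PER_1K_CHARS.items():
    print(f"{vendor}: ${cost(1_000_000, price):.2f} per 1M characters")
```

At that volume the gap is a dollar per million characters: $15.00 for Aura versus $16.00 for the two rivals. A real cost comparison would also have to account for free tiers, volume discounts and the premium tiers mentioned above.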
“You have to hit a really good price point on everything [segments], but then you also have to have amazing latencies, speed — and then amazing accuracy. So it’s very hard to beat,” Stephenson said of Deepgram’s general approach to building its product. “But that’s what we focused on from the beginning, and that’s why we were building for four years before we released anything, because we were building the underlying infrastructure to make it happen.”
Aura offers around a dozen voice models at this point, all of which were trained on a dataset Deepgram created together with voice actors. The Aura model, like all of the company’s other models, was trained in-house. Here’s how it sounds:
You can try a demo of Aura here. I played around with it for a while, and while you’ll occasionally encounter some odd pronunciations, the speed is really what stands out, in addition to Deepgram’s existing high-quality speech-to-text model. To highlight that speed, Deepgram displays the time it took the model to start speaking (generally less than 0.3 seconds) and how long it took the LLM to finish generating its response (usually just under a second).
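The "time to start speaking" figure Deepgram highlights is essentially time to first audio chunk. One minimal way to measure that against any streaming TTS response is sketched below; the generator is a stand-in for a real audio stream, not a Deepgram API:

```python
import time
from typing import Iterable, Iterator, Optional, Tuple

def time_to_first_chunk(chunks: Iterable[bytes]) -> Tuple[Optional[float], bytes]:
    """Return seconds elapsed until the first audio chunk arrives
    (None if the stream is empty), plus the fully assembled audio."""
    start = time.monotonic()
    first_latency = None
    audio = bytearray()
    for chunk in chunks:
        if first_latency is None:
            first_latency = time.monotonic() - start
        audio.extend(chunk)
    return first_latency, bytes(audio)

def fake_stream() -> Iterator[bytes]:
    """Stand-in for a streaming TTS response."""
    for part in (b"chunk1", b"chunk2"):
        yield part
```

Against a real endpoint you would start the clock when the request is sent and iterate over the HTTP response body; the first-chunk latency is the number users perceive as "it started talking."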