There have been many attempts at open-source AI voice assistants (see Rhasspy, Mycroft, and Jasper, to name a few), all with the goal of creating privacy-preserving offline experiences that don’t compromise on functionality. But progress has proven extremely slow. That’s because, aside from all the usual challenges that come with open source projects, programming an assistant is hard. Technologies like Google Assistant, Siri, and Alexa have years, if not decades, of R&D behind them, and massive infrastructure to boot.
But that’s not deterring the folks at the Large-scale Artificial Intelligence Open Network (LAION), the German non-profit organization responsible for maintaining some of the world’s most popular AI training datasets. This month, LAION announced a new initiative, BUD-E, that seeks to build a “fully open” voice assistant capable of running on consumer hardware.
Why start a whole new voice assistant project when there are countless others out there in various states of abandonment? Wieland Brendel, a fellow at the Ellis Institute and a contributor to BUD-E, believes that there is no open assistant with an architecture extensible enough to take full advantage of emerging GenAI technologies, particularly large language models (LLMs) along the lines of OpenAI’s ChatGPT.
“Most interactions with [assistants] rely on chat interfaces that are rather cumbersome to interact with, [and] dialogues with these systems feel distorted and unnatural,” Brendel told TechCrunch in an email interview. “These systems are fine for conveying commands to control your music or turn on the light, but they are not the basis for long and engaging conversations. The goal of BUD-E is to provide the basis for a voice assistant that feels much more natural to humans and that mimics the natural speech patterns of human dialogue and remembers past conversations.”
Brendel added that LAION also wants to ensure that every element of BUD-E can eventually be integrated into apps and services license-free, even commercial ones, which isn’t necessarily the case with other open assistant efforts.
A collaboration with the Ellis Institute in Tübingen, technology consultancy Collabora and the Tübingen AI Center, BUD-E (a backronym for “Buddy for Understanding and Digital Empathy”) has an ambitious roadmap. In its announcement, the LAION team lays out what it hopes to achieve in the coming months, notably building “emotional intelligence” into BUD-E and ensuring it can handle conversations involving multiple speakers at once.
“There is a great need for a flawless natural voice assistant,” said Brendel. “LAION has shown in the past that it is great at building communities, and the ELLIS Institute Tübingen and the Tübingen AI Center have committed to providing the resources to develop the assistant.”
BUD-E is up and running — you can download and install it today from GitHub on an Ubuntu or Windows PC (macOS is coming) — but it’s very much in the early stages.
LAION has combined several open models to assemble an MVP, including Microsoft’s Phi-2 LLM, Columbia’s StyleTTS2 for text-to-speech, and Nvidia’s FastConformer for speech-to-text. As such, the experience is a bit unoptimized. For BUD-E to respond to commands within about 500 milliseconds, in the range of commercial voice assistants like Google Assistant and Alexa, it requires a powerful GPU like Nvidia’s RTX 4090.
Collabora is working pro bono to adapt the open source speech recognition and text-to-speech models, WhisperLive and WhisperSpeech, for BUD-E.
“Building text-to-speech and speech-to-speech solutions ourselves means we can customize them to a degree not possible with closed models exposed via APIs,” Jakub Piotr Cłapa, an AI researcher at Collabora and a member of the BUD-E team, said in an email. “Collabora first started working on [open assistants] partly because we struggled to find a good text-to-speech solution for an LLM-based voice agent for one of our clients. We decided to join forces with the wider open source community to make our models more widely accessible and useful.”
In the near future, LAION says it will work to make BUD-E’s hardware requirements less burdensome and reduce the assistant’s latency. A larger undertaking is building a dialogue dataset to fine-tune BUD-E, as well as a memory mechanism that allows BUD-E to store information from past conversations and a speech processing pipeline that can keep track of multiple people speaking at once.
I asked the team if accessibility was a priority, given that speech recognition systems have historically not performed well with non-English languages and non-transatlantic accents. A Stanford study found that speech recognition systems from Amazon, IBM, Google, Microsoft and Apple were nearly twice as likely to mishear Black speakers as white speakers of the same age and gender.
Brendel said that LAION isn’t ignoring accessibility, but that it’s not a “direct focus” for BUD-E.
“The first focus is on really redefining the experience of how we interact with voice assistants before generalizing that experience to more different accents and languages,” Brendel said.
To that end, LAION has some ambitious ideas for BUD-E, ranging from an animated avatar that personifies the assistant to support for analyzing users’ faces via webcam to take their emotional state into account.
The ethics of that last bit, facial analysis, are a bit fraught, needless to say. However, Robert Kaczmarczyk, a co-founder of LAION, emphasized that LAION will remain committed to safety.
“[We] strictly adhere to the safety and ethics guidelines laid out by the EU AI Act,” he told TechCrunch via email, referring to the legal framework governing the sale and use of AI in the EU. The AI Act allows EU member states to adopt more restrictive rules and safeguards for “high-risk” AI, including emotion classifiers.
“This commitment to transparency not only facilitates the early detection and correction of potential biases, but also helps the cause of scientific integrity,” Kaczmarczyk added. “By making our datasets accessible, we enable the wider scientific community to participate in research that supports the highest standards of reproducibility.”
LAION’s previous work hasn’t been spotless in the ethical sense, and the organization is currently pursuing a somewhat controversial separate project on emotion detection. But maybe BUD-E will be different. We will have to wait and see.