One of the more unexpected products to come out of the Microsoft Ignite 2023 event is a tool that can create a photorealistic avatar of a person and animate that avatar saying things the person didn’t necessarily say.
Called Azure AI Speech text-to-speech avatars, the new feature, available in public preview starting today, lets users create videos of a speaking avatar by uploading images of a person they want the avatar to look like and writing a script. Microsoft’s tool trains a model to drive the animation, while a separate text-to-speech model — either pre-built or trained on the person’s voice — “reads” the script aloud.
“With the text-to-speech avatar, users can create more effective videos… to create video tutorials, product introductions, customer testimonials [and so on] simply by entering text,” Microsoft writes in a blog post. “You can use your avatar to create chatbots, virtual assistants, and more.”
Avatars can speak many languages. And, for chatbot scenarios, they can tap into AI models like OpenAI’s GPT-3.5 to answer customer questions off-script.
Now, there are countless ways such a tool could be abused — something Microsoft, to its credit, realizes. (Similar avatar-creation technology from AI startup Synthesia was used to produce propaganda in Venezuela and false news reports promoted by pro-China social media accounts.) Most Azure subscribers will only have access to pre-built — not custom — avatars at launch. Custom avatars are currently a “restricted access” feature, available only with registration and “only for certain use cases,” Microsoft says.
But the feature raises a number of uncomfortable ethical questions.
One of the major sticking points in the recent SAG-AFTRA strike was the use of artificial intelligence to create digital likenesses. The studios eventually agreed to pay the actors for the AI-generated likenesses. But what about Microsoft and its customers?
I emailed Microsoft to ask about its position on companies using actors’ likenesses without, in the actors’ view, proper compensation or even notice. The company did not respond by the time of publication — nor did it say whether it would require companies to label avatars as AI-generated, as YouTube and a growing number of other platforms do.
In a follow-up email, a spokesperson clarified that Microsoft requires custom avatar customers to obtain “express written permission” and consent statements from the avatar talent, and to “ensure that the customer’s agreement with each individual takes into account the duration, use and any content restrictions.” The company also requires customers to add disclaimers stating that the avatars are AI-generated.
Personal voice
Microsoft appears to have more guardrails around a related AI creation tool, personal voice, which is also being released at Ignite.
Personal voice, a new feature in Microsoft’s custom neural voice service, can reproduce a user’s voice in seconds, given a one-minute speech sample as an audio prompt. Microsoft is pitching it as a way to create personalized voice assistants, dub content into different languages, and generate personalized narration for stories, audiobooks, and podcasts.
To avoid potential legal headaches, Microsoft has banned the use of pre-recorded speech, requiring users to give “express consent” in the form of a recorded statement and verifying that that statement matches the training data before a customer can use personal voice to synthesize new speech. Access to the feature is currently gated behind a registration form, and customers must agree to use personal voice only in apps “where the voice does not read user-generated or open-source content.”
“Usage of the voice model must remain within an application, and the output must not be publishable or shared by the application,” Microsoft writes in a blog post. “[C]ustomers who meet limited access eligibility criteria retain sole control over the creation, access and use of their voice models and output [where it concerns] dubbing for film, television, video and audio for entertainment purposes only.”
Microsoft did not initially respond to TechCrunch’s questions about how actors might be compensated for their voice contributions — or whether it plans to implement any kind of watermarking technology so that AI-generated voices can be more easily recognized.
Later in the day, a spokesperson said via email that watermarks will be automatically added to personal voices, making it easier to determine whether speech is AI-synthesized — and by which voice. But there is a catch: building watermark detection into an app or platform requires Microsoft’s approval to use the watermark detection service — which is obviously not ideal.
This story was originally published at 8 a.m. PT on Nov. 15 and updated at 3:30 p.m. PT.