In the generative AI boom, data is the new oil. So why can’t you sell yours?
From big tech companies to startups, AI makers are licensing e-books, images, video, audio and more from data brokers, all in an effort to train more capable (and more legally defensible) AI products. Shutterstock has deals with Meta, Google, Amazon and Apple to provide millions of images for model training, while OpenAI has signed deals with various news organizations to train its models on news archives.
In many cases, the individual creators and owners of this data have not seen a penny of the cash changing hands. A startup called Vana wants to change that.
Anna Kazlauskas and Art Abal, who met in a class at the MIT Media Lab focused on building technology for emerging markets, co-founded Vana in 2021. Before Vana, Kazlauskas studied computer science and economics at MIT, eventually dropping out to launch a fintech automation startup, Iambiq, out of Y Combinator. Abal, a corporate attorney by training and education, was a partner at The Cadmus Group, a Boston-based consulting firm, before heading up impact sourcing at data annotation company Appen.
With Vana, Kazlauskas and Abal set out to build a platform that allows users to “aggregate” their data (including conversations, speech recordings and photos) into datasets that can then be used to train artificial intelligence models. They also want to create more personalized experiences, such as a daily motivational voicemail based on your wellness goals or an art-making app that understands your style preferences, by fine-tuning public models on that data.
“Vana’s infrastructure essentially creates a data vault owned by the users,” Kazlauskas told TechCrunch. “It does this by allowing users to aggregate their personal data in a non-custodial way … Vana allows users to own AI models and use their data in AI applications.”
Here’s how Vana introduces its platform and API to developers:
The Vana API connects a user’s personal data across multiple platforms … to allow you to personalize your app. Your app gets direct access to a user’s personalized AI model or underlying data, simplifying integration and eliminating concerns about computational cost… We believe users should be able to move their personal data out of walled gardens such as Instagram, Facebook and Google, into your app, so you can create an amazing personalized experience from the very first time a user interacts with your consumer AI app.
Creating an account with Vana is quite simple. After confirming your email, you can attach data to a digital avatar (such as selfies, a description of yourself, and voice recordings) and explore apps built using Vana’s platform and datasets. The app selection ranges from ChatGPT-style chatbots and interactive storybooks to a Hinge profile generator.
Now why, you might ask – in this age of heightened data privacy awareness and ransomware attacks – would anyone ever volunteer their personal information to an anonymous startup, much less a venture-backed one? (Vana has raised $20 million to date from Paradigm, Polychain Capital and other backers.) Can any for-profit company really be trusted not to misuse or mismanage any monetizable data in its hands?
In response to this question, Kazlauskas emphasized that the whole point of Vana is for users to “take back control of their data,” noting that Vana users have the option to self-host their data instead of storing it on Vana’s servers and to control how that data is shared with apps and developers. She also argued that because Vana makes money by charging users a monthly subscription (starting at $3.99) and charging developers a “data transaction” fee (e.g., to transfer datasets to train AI models), the company has no incentive to exploit users and the troves of personal data they bring with them.
“We want to create models that are owned and managed by users who all contribute their data,” Kazlauskas said, “and allow users to bring their data and models with them into any application.”
Now, while Vana isn’t selling user data to companies to train AI models (or so it claims), it wants to let users do it themselves if they choose, starting with their Reddit posts.
This month, Vana released what it calls the Reddit Data DAO (Digital Autonomous Organization), a program that aggregates multiple users’ Reddit data (including their karma and post history) and lets them decide together how that combined data is used. After signing up with a Reddit account, submitting a request to Reddit for their data and uploading that data to the DAO, users gain the right to vote with other DAO members on decisions such as licensing the combined data to AI companies for shared profit.
It’s a response to Reddit’s recent moves to commercialize data on its platform.
Reddit historically did not gate access to posts and communities for AI training purposes. But it changed course late last year, ahead of its IPO. Since the policy change, Reddit has collected more than $203 million in licensing fees from companies including Google.
“The broad idea [with the DAO is] to liberate user data from the big platforms that seek to hoard and monetize it,” Kazlauskas said. “This is a first and is part of our push to help people aggregate their data into user-owned datasets to train AI models.”
Unsurprisingly, Reddit, which doesn’t work with Vana in any official capacity, isn’t happy with the DAO.
Reddit banned Vana’s subreddit dedicated to discussion of the DAO. And a Reddit spokesperson accused Vana of “exploiting” its data extraction system, which is designed to comply with data privacy regulations such as GDPR and the California Consumer Privacy Act.
“Data settings allow us to put guardrails on such entities, even on public information,” the spokesperson told TechCrunch. “Reddit does not share non-public personal data with commercial businesses, and when Redditors request an extraction of their data from us, they receive non-public personal data from us in accordance with applicable law. Direct partnerships between Reddit and vetted organizations, with clear terms and accountability, matter, and these partnerships and agreements prevent the misuse and abuse of people’s data.”
But does Reddit have any real reason to worry?
Kazlauskas envisions the DAO growing to the point where it affects how much Reddit can charge customers for its data. That’s a long way off, assuming it ever happens. The DAO has just over 141,000 members, a tiny fraction of Reddit’s user base of 73 million. And some of these members may be bots or duplicate accounts.
Then there’s the issue of how to fairly split the payments the DAO might receive from data buyers.
The DAO currently awards “tokens” (cryptocurrency) to users corresponding to their Reddit karma. But karma may not be the best measure of quality contributions to the dataset, particularly in smaller Reddit communities with fewer opportunities to earn it.
Kazlauskas floats the idea that DAO members could choose to share their cross-platform and demographic data, potentially making the DAO more valuable and incentivizing sign-ups. But that would also require users to trust Vana even more to handle their sensitive data responsibly.
Personally, I don’t see Vana’s DAO reaching critical mass. The roadblocks standing in the way are too many. I do think, however, that it won’t be the last grassroots attempt to assert control over the data increasingly used to train artificial intelligence models.
Startups like Spawning are working on ways to let creators enforce rules that guide how their data is used for training, while vendors like Getty Images, Shutterstock and Adobe continue to experiment with compensation systems. But no one has cracked the code yet. Can it even be cracked? Given the cutthroat nature of the generative AI industry, it’s certainly a tall order. But maybe someone will find a way, or policymakers will force one.