Reddit’s prospects as it moves toward going public have a lot more to do with relationships with AI vendors like OpenAI than you might expect.
In its IPO prospectus filed today with the U.S. Securities and Exchange Commission, Reddit repeatedly emphasized how much it believes it stands to gain — and has earned — from data licensing deals with companies that train artificial intelligence models on its more than 1 billion posts and more than 16 billion comments.
“In January 2024, we entered into certain data licensing arrangements with an aggregate contract value of $203.0 million and terms ranging from two to three years,” the prospectus said. “We expect at least $66.4 million of revenue to be recognized during the year ending December 31, 2024 and the balance thereafter.”
For now, it’s unclear which AI vendors are licensing data from Reddit. Earlier this week, Bloomberg and Reuters reported that a “large anonymous artificial intelligence company” (probably Google) had a licensing deal worth about $60 million annually. But OpenAI wouldn’t be a surprising customer either, especially when you consider that OpenAI CEO Sam Altman holds an 8.7% stake in Reddit (making him the third-largest shareholder) and once served on the company’s board of directors.
Why is Reddit data valuable? As Reddit explains, AI models “learn” from examples to generate essays, code, emails, articles, and more, and vendors like OpenAI scour the web for millions to billions of these examples to add to their training sets. Some examples are public. Others are not or, in the case of Reddit content, are subject to restrictive licenses that require attribution or specific forms of compensation.
Reddit previously did not charge for access to its data for AI training purposes. But it reversed course last year, arguing that its data should not be, in the words of CEO Steve Huffman, “[given] to some of the biggest companies in the world for free.”
“[Our] Data APIs are able to provide real-time access to evolving and dynamic topics such as sports, movies, news, fashion and the latest trends,” the prospectus continues. “We believe that Reddit’s vast dataset and conversational knowledge will continue to play a role in training and improving large language models. As our content refreshes and grows daily, we expect that models will want to reflect these new insights and update their training using Reddit data.”
Content producers, from media libraries to news publishers, are increasingly turning to data licensing deals with AI vendors as chatbots like OpenAI’s ChatGPT and Google’s Gemini threaten to cut into their traffic. A recent model from The Atlantic found that if a search engine like Google were to incorporate AI into search, it would answer a user’s query 75% of the time without requiring a click-through to the publication’s website.
The vendors, in turn, have been motivated to pursue licensing deals as they face a deluge of lawsuits claiming they lack legal justification for training their models on data without permission or payment. Recently, The New York Times accused OpenAI of effectively building competitors to news publishers using the newspaper’s own work, hurting its business.
OpenAI, for one, has deals with stock media library Shutterstock as well as publishers like Axel Springer, the owner of Politico and Business Insider. The licenses are reported to be quite small, however, not exceeding $5 million annually.