OpenAI’s legal battle with The New York Times over the data used to train its AI models may still be ongoing. However, OpenAI is moving forward with deals with other publishers, including some of the largest news publishers in France and Spain.
OpenAI announced on Wednesday that it has signed deals with Le Monde and Prisa Media to bring French and Spanish news content to ChatGPT. In a blog post, OpenAI said the partnerships will put the organizations’ coverage of current events — from brands such as El País, Cinco Días, As and El Huffpost — in front of ChatGPT users where it makes sense, and will also contribute to OpenAI’s ever-expanding pool of training data.
OpenAI writes:
In the coming months, ChatGPT users will be able to interact with relevant news content from these publishers through featured, rendered summaries and enhanced links to the original articles, giving users the ability to access additional information or related articles from their news sites… We’re constantly making improvements to ChatGPT and supporting the news industry’s essential role in providing real-time, authoritative information to users.
So OpenAI has now disclosed licensing deals with a handful of content providers. Now seems like a good time to take stock:
- Shutterstock’s stock media library (for image, video and music training data)
- The Associated Press
- Axel Springer (owner of Politico and Business Insider, among others)
- Le Monde
- Prisa Media
How much is OpenAI paying each of them? Well, it’s not saying – at least not publicly. But we can make an estimate.
The Information reported in January that OpenAI was offering publishers between $1 million and $5 million annually for access to their archives to train its GenAI models. That doesn’t tell us much about the Shutterstock partnership. But on the article-licensing front—assuming The Information’s reporting is accurate and those figures haven’t changed since then—OpenAI is spending somewhere between $4 million and $20 million a year on news.
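As a rough back-of-the-envelope sketch (assuming The Information’s $1 million–$5 million per-publisher range applies to each of the four news partners listed above, and setting aside the Shutterstock deal, which covers stock media rather than articles), the math works out like this:

```python
# Illustrative back-of-the-envelope estimate (not a reported figure).
# Assumes the $1M-$5M per-publisher range from The Information applies to
# each of the four news partners (AP, Axel Springer, Le Monde, Prisa Media);
# the Shutterstock deal is excluded because it covers stock media, not news.
news_partners = 4
low_per_deal, high_per_deal = 1_000_000, 5_000_000

low_total = news_partners * low_per_deal    # $4,000,000
high_total = news_partners * high_per_deal  # $20,000,000

print(f"Estimated annual news-licensing spend: ${low_total:,} to ${high_total:,}")
```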
That might be pennies for OpenAI, whose war chest tops $11 billion and whose annualized revenue recently passed $2 billion (per the Financial Times). But as Hunter Walk, a partner at Homebrew and co-founder of Screendoor, recently opined, it’s significant enough to potentially edge out AI rivals also seeking licensing deals.
Walk writes on his blog:
[I]f experimentation is gated by nine-figure licensing deals, we’re doing innovation a disservice… The checks being cut to “owners” of training data create a huge barrier to entry for challengers. If Google, OpenAI and other big tech companies can establish a high enough cost, they implicitly prevent future competition.
Now, whether there is a barrier to entry today is debatable. Many — if not most — AI vendors have chosen to risk the wrath of IP holders rather than license the data they train their models on. There is evidence to suggest that Midjourney, the art-generation platform, for example, trained on clips from Disney movies — and Midjourney has no deal with Disney.
The more difficult question to wrestle with is: Should licensing simply be the cost of doing business and experimenting in the AI space?
Walk would argue not. He supports a regulation-enforced “safe harbor” that would protect any AI vendor — as well as small startups and researchers — from legal liability as long as they adhere to certain transparency and ethical standards.
Interestingly, the United Kingdom recently attempted to codify something along those lines, exempting text and data mining for AI training from copyright liability as long as it is done for research purposes. But those efforts ultimately failed.
Me, I’m not sure I’d go as far as Walk does with his “safe harbor” proposal, considering the impact AI threatens to have on an already destabilized news industry. A recent model from The Atlantic found that if a search engine like Google integrated AI into search, it would answer a user’s query 75% of the time without sending a click to the publisher’s website.
But perhaps there’s room for carve-outs.
Publishers need to be paid — and paid fairly. Is there no outcome, though, in which publishers are paid and challengers to the AI incumbents — as well as academics — get access to the same data as those incumbents? I’d like to believe there is. Grants are one way. Bigger VC checks are another.
I can’t say I have the solution, especially since the courts have yet to decide whether — and to what extent — fair use protects AI vendors from copyright claims. But it is vital that we tease these things out. Otherwise, the industry could well end up in a situation where the academic “brain drain” continues unabated and only a few powerful companies have access to vast pools of valuable training sets.