Data is at the heart of today’s advanced AI systems, but it’s increasingly expensive — putting it out of reach for all but the wealthiest tech companies.
Last year, James Betker, a researcher at OpenAI, wrote a post on his personal blog about the nature of generative AI models and the datasets they are trained on. In it, Betker claimed that the training data, not the design, architecture, or any other feature of a model, is the key to increasingly sophisticated, capable AI systems.
“Trained on the same data set for a long time, almost every model converges to the same point,” Betker wrote.
Is Betker right? Is training data the biggest determinant of what a model can do, whether it’s answering a question, drawing human hands, or creating a realistic cityscape?
It’s certainly plausible.
Statistical machines
Generative AI systems are basically probabilistic models — a huge pile of statistics. They guess, based on vast numbers of examples, which data makes the most “sense” to put where (e.g., the word “go” before “to the market” in the sentence “I go to the market”). It seems intuitive, then, that the more examples a model has to draw on, the better the performance of models trained on those examples.
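To make that concrete, here is a deliberately tiny sketch of the statistical guessing described above: a toy bigram model that picks the next word purely from counts in its training text. The corpus and function names are invented for illustration; no production model works at this scale or this simply.

```python
from collections import Counter, defaultdict

# Toy training corpus; real models see trillions of tokens.
corpus = "i go to the market . i go to the park . we go to the market ."
tokens = corpus.split()

# Count how often each word follows each other word.
bigrams = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    bigrams[prev][nxt] += 1

def predict_next(word: str) -> str:
    """Return the statistically most likely next word."""
    followers = bigrams[word]
    return followers.most_common(1)[0][0] if followers else "<unk>"

print(predict_next("go"))   # "to": every "go" in the corpus precedes "to"
print(predict_next("the"))  # "market": seen more often than "park"
```

More examples sharpen those counts, which is exactly why the sheer volume of data matters so much.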
“It seems that the performance gains come from data,” Kyle Lo, senior applied research scientist at the Allen Institute for AI (AI2), an artificial intelligence research nonprofit, told TechCrunch, “at least once you have a stable training setup.”
Lo gave the example of Meta’s Llama 3, a text generation model released earlier this year that outperforms AI2’s own OLMo model, despite being architecturally very similar. Llama 3 was trained on significantly more data than OLMo, which Lo believes explains its superiority in many popular AI benchmarks.
(I’ll point out here that the benchmarks widely used in the AI industry today aren’t necessarily the best gauge of a model’s performance but, outside of qualitative tests like our own, they’re one of the few measures we have to go on.)
This is not to say that training on exponentially larger data sets is a sure path to exponentially better models. The models operate on a “garbage in, garbage out” paradigm, Lo notes, and so curation and data quality matter a lot, perhaps more than sheer quantity.
“It is possible that a small model with carefully designed data will perform better than a large model,” he added. “For example, Falcon 180B, a large model, is ranked 63rd in the LMSYS benchmark, while Llama 2 13B, a much smaller model, is ranked 56th.”
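To illustrate the “garbage in, garbage out” point, here is a hedged, purely illustrative sketch of the kind of crude quality filtering a curation pipeline might start with. Real pipelines are far more elaborate, and the function name and thresholds below are arbitrary.

```python
def filter_corpus(docs: list[str], min_words: int = 20) -> list[str]:
    """Drop exact duplicates and near-empty documents before training."""
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        key = doc.strip().lower()
        if len(key.split()) < min_words:  # discard boilerplate-length snippets
            continue
        if key in seen:                   # discard exact duplicates
            continue
        seen.add(key)
        kept.append(doc)
    return kept
```

Filters like this shrink the dataset, which is one reason a smaller, carefully curated corpus can beat a larger, noisier one.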
In an interview with TechCrunch last October, OpenAI researcher Gabriel Goh said that higher-quality annotations contributed significantly to the improved image quality of DALL-E 3, OpenAI’s text-to-image model, over its predecessor DALL-E 2. “I think this is the main source of the improvements,” he said. “The text annotations are much better than they were [with DALL-E 2] — it’s not even comparable.”
Many AI models, including DALL-E 3 and DALL-E 2, are trained by having human annotators label data so that a model can learn to correlate those labels with other observed features of that data. For example, a model fed many cat images annotated for each breed will eventually “learn” to associate terms such as bobtail and shorthair with their distinctive visual traits.
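As a hedged illustration of that annotation-driven learning (with invented features and labels, since real image models learn from pixels rather than hand-picked numbers):

```python
from sklearn.linear_model import LogisticRegression

# Each example: [tail_length_cm, fur_length_cm], labeled by a human annotator.
X = [
    [3, 2], [4, 3], [2, 2],      # bobtails: short tails
    [25, 2], [28, 3], [30, 2],   # shorthairs: long tails, short fur
]
y = ["bobtail"] * 3 + ["shorthair"] * 3

# The model learns to correlate the annotations with the observed features.
model = LogisticRegression().fit(X, y)
print(model.predict([[5, 2], [27, 2]]))  # expected: ['bobtail' 'shorthair']
```

The more (and the cleaner) the annotations, the stronger those learned correlations.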
Bad behaviour
Experts like Lo worry that the growing emphasis on large, high-quality training data sets will concentrate AI development among the few players with billion-dollar budgets who can afford to acquire those sets. Significant innovation in synthetic data or fundamental architecture could disrupt the status quo, but neither seems to be on the near horizon.
“Overall, entities that govern content that is potentially useful for AI development have incentives to lock down their material,” Lo said. “And as access to data closes, we’re essentially blessing some early movers to get data and move up the ladder so that no one else has access to data to catch up.”
Indeed, where the race to collect more training data hasn’t led to unethical (and perhaps even illegal) behavior, such as surreptitiously hoarding copyrighted content, it has rewarded tech giants with deep pockets to spend on data licensing.
Generative AI models such as OpenAI’s are trained primarily on images, text, audio, video, and other data, some of it copyrighted, scraped from public web pages (including, problematically, pages generated by AI). The OpenAIs of the world claim that fair use shields them from legal reprisal. Many rights holders disagree, but, at least for now, there’s not much they can do to prevent the practice.
There are many, many examples of AI developers acquiring massive datasets through questionable means in order to train their models. OpenAI reportedly transcribed more than a million hours of YouTube videos without YouTube’s blessing, or that of the videos’ creators, to feed its flagship GPT-4 model. Google recently broadened its terms of service in part to allow it to tap public Google Docs, restaurant reviews on Google Maps, and other online material for its AI products. And Meta is said to have considered risking lawsuits to train its models on IP-protected content.
Meanwhile, companies large and small rely on workers in third-world countries paid only a few dollars an hour to create annotations for training sets. Some of these annotators, employed by mammoth startups like Scale AI, work literal days on end to complete tasks that expose them to graphic depictions of violence and gore, with no benefits or guarantees of future gigs.
Rising costs
In other words, even the more aboveboard data deals aren’t exactly fostering an open and fair AI ecosystem.
OpenAI has spent hundreds of millions of dollars licensing content from news publishers, media libraries, and more to train its AI models — a budget far larger than that of most academic research groups, nonprofits, and startups. Meta went so far as to weigh a takeover of publisher Simon & Schuster for the rights to e-book excerpts (eventually, Simon & Schuster sold to private equity firm KKR for $1.62 billion in 2023).
With the market for AI training data expected to grow from roughly $2.5 billion now to nearly $30 billion within a decade, data brokers and platforms are rushing to charge top dollar, in some cases over the objections of their user bases.
Shutterstock, the media library, has inked deals with AI vendors worth between $25 million and $50 million, while Reddit claims to have made hundreds of millions from licensing data to organizations such as Google and OpenAI. Few platforms with abundant data accumulated organically over the years haven’t signed deals with generative AI developers, it seems, from Photobucket to Tumblr to Q&A site Stack Overflow.
It’s the platforms’ data to sell, at least depending on which legal arguments you believe. But in most cases, users don’t see a penny of the earnings. And it hurts the wider AI research community.
“Smaller players will not be able to afford these data licenses and therefore will not be able to develop or study AI models,” Lo said. “I am concerned that this could lead to a lack of independent scrutiny of AI development practices.”
Independent efforts
If there is a ray of sunshine through the gloom, it’s the few independent, nonprofit efforts to create massive datasets that anyone can use to train a generative AI model.
EleutherAI, a grassroots nonprofit research group that began as a loose Discord collective in 2020, is working with the University of Toronto, AI2, and independent researchers to create The Pile v2, a set of billions of text snippets primarily sourced from the public domain.
In April, the AI startup Hugging Face released FineWeb, a filtered version of Common Crawl (the eponymous dataset maintained by the nonprofit Common Crawl, consisting of billions upon billions of web pages) that Hugging Face claims improves model performance on many benchmarks.
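For readers who want to poke at FineWeb themselves, here is a minimal sketch using Hugging Face’s `datasets` library. The dataset ID and sample config below match FineWeb’s Hub listing at the time of writing, but verify them before relying on this.

```python
from datasets import load_dataset

# Stream a small sample config rather than downloading the full
# multi-terabyte dataset up front.
fineweb = load_dataset(
    "HuggingFaceFW/fineweb",
    name="sample-10BT",
    split="train",
    streaming=True,
)

# Peek at the first few documents: each record carries the page text
# along with metadata such as its source URL.
for doc in fineweb.take(3):
    print(doc["url"], "->", doc["text"][:80])
```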
Some efforts to release open training datasets, like the LAION group’s image sets, have run up against copyright, data privacy, and other, equally serious ethical and legal challenges. But some of the more dedicated data curators have pledged to do better. The Pile v2, for example, removes problematic copyrighted material found in its progenitor dataset, The Pile.
The question is whether any of these open-source efforts can hope to keep pace with Big Tech. Since data collection and curation remains a matter of resources, the answer is probably no — at least not until some research breakthrough levels the playing field.