AI training data comes at a price only Big Tech can afford

By techtost.com · 1 June 2024 · 8 Mins Read

Data is at the heart of today’s advanced AI systems, but it’s increasingly expensive — putting it out of reach for all but the wealthiest tech companies.

Last year, James Betker, a researcher at OpenAI, wrote a post on his personal blog about the nature of generative AI models and the datasets they're trained on. In it, Betker argued that training data — not the design, architecture, or any other feature of a model — is the key to increasingly sophisticated, capable AI systems.

“Trained on the same data set for a long time, almost every model converges to the same point,” Betker wrote.

Is Betker right? Is training data the biggest determinant of what a model can do, whether it’s answering a question, drawing human hands, or creating a realistic cityscape?

It’s certainly plausible.

Statistical machines

Production AI systems are basically probabilistic models — a huge pile of statistics. Drawing on vast numbers of examples, they guess which data makes the most "sense" to put where (e.g., the word "go" before "to the market" in the sentence "I go to the market"). It seems intuitive, then, that the more examples a model has to learn from, the better its performance.
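The "pile of statistics" idea can be illustrated with the simplest possible statistical language model: a bigram model that counts which word follows which, then guesses the most likely continuation. This is a toy sketch for intuition only (real models learn from billions of examples and far richer context), with an invented three-sentence corpus:

```python
from collections import Counter, defaultdict

# Toy corpus; a real model trains on billions of such examples.
corpus = [
    "i go to the market",
    "i go to the park",
    "we go to the market",
]

# Count which word follows each word -- the crudest "statistical
# machine" that guesses what makes the most sense to put where.
follows = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1

def most_likely_next(word):
    """Return the statistically most likely next word."""
    return follows[word].most_common(1)[0][0]

print(most_likely_next("go"))   # "to"  (all 3 examples agree)
print(most_likely_next("the"))  # "market"  (2 of 3 examples)
```

The point of the sketch is that nothing here depends on clever architecture: the predictions fall entirely out of the example counts, which is exactly why adding more (good) examples tends to improve the guesses.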

“It seems that the performance gains come from data,” Kyle Lo, senior applied research scientist at the Allen Institute for AI (AI2), an artificial intelligence research nonprofit, told TechCrunch, “at least once you have a stable training setup.”

Lo gave the example of Meta’s Llama 3, a text generation model released earlier this year that outperforms AI2’s own OLMo model, despite being architecturally very similar. Llama 3 was trained on significantly more data than OLMo, which Lo believes explains its superiority in many popular AI benchmarks.

(I’ll note here that the benchmarks widely used in the AI industry today aren’t necessarily the best gauge of a model’s performance, but outside of qualitative tests like our own, they’re among the few measures we have to go by.)

This is not to say that training on exponentially larger datasets is a sure path to exponentially better models. Models operate on a “garbage in, garbage out” paradigm, Lo notes, so data curation and quality matter a great deal, perhaps more than sheer quantity.

“It is possible that a small model with carefully designed data will perform better than a large model,” he added. “For example, Falcon 180B, a large model, is ranked 63rd in the LMSYS benchmark, while Llama 2 13B, a much smaller model, is ranked 56th.”

In an interview with TechCrunch last October, OpenAI researcher Gabriel Goh said that higher-quality annotations contributed significantly to the improved image quality of DALL-E 3, OpenAI’s text-to-image model, over its predecessor DALL-E 2. “This is the main source of the improvements,” he said. “The text annotations are much better than they were [with DALL-E 2] — it’s not even comparable.”

Many AI models, including DALL-E 3 and DALL-E 2, are trained by having human annotators label data so that a model can learn to associate those labels with other, observed features of that data. For example, a model fed many cat images annotated with each cat’s breed will eventually “learn” to associate terms such as “short tail” and “short hair” with their distinctive visual characteristics.
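The label-to-feature association can be sketched in miniature. In this toy example (the dataset, breed names, and attribute strings are all invented, and real models learn attributes from raw pixels rather than pre-extracted strings), "training" is just counting how often each attribute co-occurs with each label:

```python
from collections import Counter, defaultdict

# Hypothetical annotated dataset: each image carries a breed label
# and, for this sketch, pre-extracted visual attributes.
annotated_images = [
    {"label": "manx",       "attributes": ["short tail", "short hair"]},
    {"label": "manx",       "attributes": ["short tail", "round body"]},
    {"label": "maine coon", "attributes": ["long tail", "long hair"]},
    {"label": "maine coon", "attributes": ["long hair", "large body"]},
]

# "Training": count attribute/label co-occurrences -- the statistical
# association a real model would learn from millions of annotations.
assoc = defaultdict(Counter)
for img in annotated_images:
    for attr in img["attributes"]:
        assoc[img["label"]][attr] += 1

def predict(attributes):
    """Pick the label whose learned attributes best match the input."""
    scores = {lbl: sum(cnt[a] for a in attributes)
              for lbl, cnt in assoc.items()}
    return max(scores, key=scores.get)

print(predict(["short tail"]))              # "manx"
print(predict(["long hair", "long tail"]))  # "maine coon"
```

Even in this caricature, the quality point is visible: a mislabeled or sloppily annotated image pollutes the counts directly, which is why Goh credits better annotations, not a bigger model, for DALL-E 3's gains.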

Bad behaviour

Experts like Lo worry that the growing emphasis on large, high-quality training data sets will concentrate AI development among the few players with billion-dollar budgets who can afford to acquire those sets. Significant innovation in synthetic data or fundamental architecture could disrupt the status quo, but neither seems to be on the near horizon.

“Overall, entities that govern content that is potentially useful for AI development have incentives to lock down their material,” Lo said. “And as access to data closes, we’re essentially blessing some early movers to get data and move up the ladder so that no one else has access to data to catch up.”

Indeed, where the race to collect more training data hasn’t led to unethical (and perhaps even illegal) behavior, such as surreptitiously hoarding copyrighted content, it has rewarded tech giants with deep pockets to spend on licensing data.

Generative AI models like OpenAI’s are trained mostly on images, text, audio, video, and other data — some of it copyrighted — scraped from public web pages (including, problematically, pages themselves generated by AI). The OpenAIs of the world claim that fair use shields them from legal retaliation. Many rights holders disagree — but, at least for now, there’s not much they can do to stop the practice.

There are many, many examples of AI developers acquiring massive datasets through questionable means in order to train their models. OpenAI reportedly transcribed more than a million hours of YouTube videos without YouTube’s blessing — or the blessing of the creators — to power its flagship GPT-4 model. Google recently broadened its terms of service in part to allow it to tap public Google Docs, restaurant reviews on Google Maps, and other online material for its AI products. And Meta is said to have considered risking lawsuits to train its models on IP-protected content.

Meanwhile, companies large and small rely on workers in developing countries paid only a few dollars an hour to create annotations for training sets. Some of these annotators — employed by mammoth startups like Scale AI — work literal days on end to complete tasks that expose them to graphic depictions of violence and gore, with no benefits or guarantees of future gigs.

Rising costs

In other words, even the more aboveboard data deals aren’t exactly fostering an open and fair AI ecosystem.

OpenAI has spent hundreds of millions of dollars licensing content from news publishers, media libraries, and more to train its AI models — a budget far larger than that of most academic research groups, nonprofits, and startups. Meta went so far as to weigh a takeover of publisher Simon & Schuster for the rights to e-book excerpts (eventually, Simon & Schuster sold to private equity firm KKR for $1.62 billion in 2023).

With the market for AI training data expected to grow from roughly $2.5 billion today to nearly $30 billion within a decade, data brokers and platforms are rushing to charge top dollar — in some cases over the objections of their user bases.

Media library Shutterstock has inked deals with AI vendors worth between $25 million and $50 million, while Reddit claims to have made hundreds of millions from licensing its data to organizations like Google and OpenAI. Few platforms with abundant data accumulated organically over the years haven’t signed deals with prolific AI developers, it seems — from Photobucket to Tumblr to Q&A site Stack Overflow.

It’s the platforms’ data to sell — at least depending on which legal arguments you believe. But in most cases, users don’t see a penny of the proceeds. And it hurts the wider AI research community.

“Smaller players will not be able to afford these data licenses and therefore will not be able to develop or study AI models,” Lo said. “I am concerned that this could lead to a lack of independent scrutiny of AI development practices.”

Independent efforts

If there’s a ray of sunshine through the gloom, it’s the few independent, nonprofit efforts to create massive datasets that anyone can use to train a generative AI model.

EleutherAI, a grassroots nonprofit research group that began as a loose Discord collective in 2020, is working with the University of Toronto, AI2, and independent researchers to create The Pile v2, a set of billions of text snippets mostly sourced from the public domain.

In April, the AI startup Hugging Face released FineWeb, a filtered version of Common Crawl — the eponymous dataset maintained by the nonprofit Common Crawl, consisting of billions upon billions of web pages — that Hugging Face claims improves model performance on many benchmarks.
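The core idea behind filtered web datasets can be sketched with stdlib tools. This is not FineWeb's actual pipeline (which uses far more elaborate heuristics, model-based scoring, and fuzzy deduplication); the thresholds and heuristics below are invented for illustration:

```python
import hashlib

def quality_ok(text: str) -> bool:
    """Crude quality heuristics loosely in the spirit of web-data
    filtering: drop very short pages and pages that are mostly
    non-alphabetic noise. (Thresholds invented for this sketch.)"""
    words = text.split()
    if len(words) < 5:
        return False
    alpha_ratio = sum(c.isalpha() for c in text) / max(len(text), 1)
    return alpha_ratio > 0.6

def dedup_and_filter(pages):
    """Exact deduplication by content hash, then quality filtering --
    the two basic steps of turning a raw crawl into training data."""
    seen, kept = set(), []
    for page in pages:
        digest = hashlib.sha256(page.encode("utf-8")).hexdigest()
        if digest in seen:      # already saw this exact page
            continue
        seen.add(digest)
        if quality_ok(page):
            kept.append(page)
    return kept

pages = [
    "A long, informative article about training data and model quality.",
    "A long, informative article about training data and model quality.",  # exact duplicate
    "click here!!!",  # too short / boilerplate
]
print(len(dedup_and_filter(pages)))  # 1
```

Curation at this scale is mostly a resources problem — hashing and scoring billions of pages is cheap per page but enormous in aggregate — which is one reason the playing field tilts toward well-funded labs.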

Some efforts to release open training datasets, like the LAION team’s image sets, have run up against copyright, data privacy, and other equally serious ethical and legal challenges. But some of the more dedicated data curators have pledged to do better. The Pile v2, for example, removes problematic copyrighted material found in its predecessor dataset, The Pile.

The question is whether any of these open-source efforts can hope to keep pace with Big Tech. Since data collection and curation remains a matter of resources, the answer is probably no — at least not until some research breakthrough levels the playing field.
