TechTost
AI training data comes at a price only Big Tech can afford

By techtost.com | 1 June 2024 | 8 Mins Read

Data is at the heart of today’s advanced AI systems, but it’s increasingly expensive — putting it out of reach for all but the wealthiest tech companies.

Last year, James Betker, a researcher at OpenAI, wrote a post on his personal blog about the nature of generative AI models and the datasets they are trained on. In it, Betker claimed that training data, not a model’s design, architecture, or any other feature, was the key to increasingly sophisticated, capable AI systems.

“Trained on the same data set for a long time, almost every model converges to the same point,” Betker wrote.

Is Betker right? Is training data the biggest determinant of what a model can do, whether it’s answering a question, drawing human hands, or creating a realistic cityscape?

It’s certainly plausible.

Statistical machines

Generative AI systems are basically probabilistic models: a huge pile of statistics. They guess, based on vast amounts of examples, which data makes the most “sense” to place where (e.g., the word “go” before “to the market” in the sentence “I go to the market”). It seems intuitive, then, that the more examples a model has to go on, the better it will perform.
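
To make the “pile of statistics” idea concrete, here is a minimal sketch of a next-word guesser built from nothing but counts. It is a toy bigram counter, not how production generative models work (those are neural networks trained on vastly more data). The tiny corpus and the function names are assumptions made up for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_counts(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        # Pair every word with the word that comes right after it.
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def most_likely_next(counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical toy corpus, invented for this example.
corpus = [
    "I go to the market",
    "I go to the park",
    "I walk to the market",
]
counts = train_bigram_counts(corpus)
print(most_likely_next(counts, "I"))    # go
print(most_likely_next(counts, "the"))  # market
```

The guess for a word is only as good as the examples behind it: a word the counter has never seen returns nothing at all, which is the intuition behind the claim that more (and better) examples tend to help.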

“It seems that the performance gains come from data,” Kyle Lo, senior applied research scientist at the Allen Institute for AI (AI2), an artificial intelligence research nonprofit, told TechCrunch, “at least once you have a stable training setup.”

Lo gave the example of Meta’s Llama 3, a text generation model released earlier this year that outperforms AI2’s own OLMo model, despite being architecturally very similar. Llama 3 was trained on significantly more data than OLMo, which Lo believes explains its superiority in many popular AI benchmarks.

(I’ll point out here that the benchmarks widely used in the AI industry today aren’t necessarily the best gauge of a model’s performance, but outside of qualitative tests like our own, they’re one of the few measures we have to go on.)

This is not to say that training on exponentially larger data sets is a sure path to exponentially better models. The models operate on a “garbage in, garbage out” paradigm, Lo notes, and so curation and data quality matter a lot, perhaps more than sheer quantity.

“It is possible that a small model with carefully designed data will perform better than a large model,” he added. “For example, Falcon 180B, a large model, is ranked 63rd in the LMSYS benchmark, while Llama 2 13B, a much smaller model, is ranked 56th.”

In an interview with TechCrunch last October, OpenAI researcher Gabriel Goh said that higher-quality annotations contributed significantly to improved image quality in DALL-E 3, OpenAI’s text-to-image model, over its predecessor, DALL-E 2. “This is the main source of improvements,” he said. “Text annotations are much better than they were [with DALL-E 2]. It’s not even comparable.”

Many AI models, including DALL-E 3 and DALL-E 2, are trained by having human annotators label data so that a model can learn to associate those labels with other, observed features of that data. For example, a model fed many cat images with annotations for each breed will eventually “learn” to associate terms such as bobtail and shorthair with their distinctive visual traits.

Bad behaviour

Experts like Lo worry that the growing emphasis on large, high-quality training data sets will concentrate AI development among the few players with billion-dollar budgets who can afford to acquire those sets. Significant innovation in synthetic data or fundamental architecture could disrupt the status quo, but neither seems to be on the near horizon.

“Overall, entities governing content that is potentially useful for AI development have incentives to lock down their material,” Lo said. “And as access to data closes, we’re essentially blessing a few early movers on data acquisition and pulling up the ladder so that no one else can get access to data to catch up.”

Indeed, where the race to collect more training data hasn’t led to unethical (and perhaps even illegal) behavior such as secretly aggregating copyrighted content, it has rewarded tech giants with deep pockets to spend on data licensing.

Generative AI models such as OpenAI’s are trained mostly on images, text, audio, video, and other data (some of it copyrighted) taken from public web pages (including, problematically, AI-generated ones). The OpenAIs of the world claim that fair use protects them from legal retaliation. Many rights holders disagree, but, at least for now, there’s not much they can do to prevent the practice.

There are many, many examples of generative AI vendors acquiring massive data sets through questionable means in order to train their models. OpenAI reportedly transcribed more than a million hours of YouTube videos without YouTube’s blessing, or the blessing of the creators, to feed its flagship model, GPT-4. Google recently broadened its terms of service in part so that it could tap public Google Docs, restaurant reviews on Google Maps, and other online material for its AI products. And Meta is said to have considered risking lawsuits to train its models on IP-protected content.

Meanwhile, companies large and small rely on workers in third-world countries paid only a few dollars an hour to create annotations for training sets. Some of these annotators, employed by mammoth startups like Scale AI, work literal days on end to complete tasks that expose them to graphic depictions of violence and gore, with no benefits or guarantees of future gigs.

Rising costs

In other words, even the more aboveboard data deals aren’t exactly conducive to an open and fair AI ecosystem.

OpenAI has spent hundreds of millions of dollars licensing content from news publishers, media libraries, and more to train its AI models — a budget far larger than that of most academic research groups, nonprofits, and startups. Meta went so far as to weigh a takeover of publisher Simon & Schuster for the rights to e-book excerpts (eventually, Simon & Schuster sold to private equity firm KKR for $1.62 billion in 2023).

With the market for AI training data expected to grow from about $2.5 billion now to nearly $30 billion within a decade, data brokers and platforms are rushing to charge top dollar, in some cases over the objections of their user bases.

Stock media library Shutterstock has inked deals with AI vendors ranging from $25 million to $50 million, while Reddit claims to have made hundreds of millions from licensing data to organizations like Google and OpenAI. Few platforms with abundant data accumulated organically over the years haven’t signed deals with generative AI developers, it seems, from Photobucket to Tumblr to Q&A site Stack Overflow.

It’s the platforms’ data to sell, at least depending on which legal arguments you believe. But in most cases, users don’t see a single penny of the earnings. And it hurts the wider AI research community.

“Smaller players will not be able to afford these data licenses and therefore will not be able to develop or study AI models,” Lo said. “I am concerned that this could lead to a lack of independent scrutiny of AI development practices.”

Independent efforts

If there is a ray of sunshine through the darkness, it is the few independent, nonprofit efforts to create massive data sets that anyone can use to train a generative AI model.

EleutherAI, a grassroots nonprofit research group that started as a loose Discord collective in 2020, is working with the University of Toronto, AI2, and independent researchers to create The Pile v2, a set of billions of text passages sourced mostly from the public domain.

In April, the startup Hugging Face released FineWeb, a filtered version of Common Crawl (the eponymous dataset maintained by the nonprofit Common Crawl, consisting of billions upon billions of web pages) that Hugging Face claims improves model performance on many benchmarks.

Some efforts to release open training datasets, such as the LAION group’s image sets, have run up against copyright, data privacy, and other equally serious ethical and legal challenges. But some of the most dedicated data curators have pledged to do better. The Pile v2, for example, removes problematic copyrighted material found in its original dataset, The Pile.

The question is whether any of these open-source efforts can hope to keep pace with Big Tech. Since data collection and curation remains a matter of resources, the answer is probably no — at least not until some research breakthrough levels the playing field.


© 2026 TechTost. All Rights Reserved