Close Menu
TechTost
  • AI
  • Apps
  • Crypto
  • Fintech
  • Hardware
  • Media & Entertainment
  • Security
  • Startups
  • Transportation
  • Venture
  • Recommended Essentials
What's Hot

Pixi’s new iOS app turns text messages into interactive AR experiences

Rivian owners file lawsuit alleging false promises about self-driving features

The 11 startups that stood out from YC’s demo day, according to VCs

Facebook X (Twitter) Instagram
  • About Us
  • Contact Us
  • Privacy Policy
  • Terms and Conditions
  • Disclaimer
Facebook X (Twitter) Instagram
TechTost
Subscribe Now
  • AI

    Source: Elastic agrees to buy CRV-powered DeductiveAI for up to $85 million

    19 June 2026

    General Intuition in talks to raise $300M at roughly $2B valuation

    18 June 2026

    How to turn off AI in your Google Docs

    18 June 2026

    SpaceX values ​​balloons at $2.6T, narrowly passes Amazon

    17 June 2026

    SpaceX Goes Public: Everything You Need to Know Post-IPO

    16 June 2026
  • Apps

    Telegram ban in India sparks a rush to VPN, rival apps

    19 June 2026

    MapTap, an everyday geography game, is my new Wordle

    18 June 2026

    FTC Lawsuit Reveals How Subscription Scam Networks Avoid App Store Enforcement

    18 June 2026

    Pinterest Launches Experimental AI Shopping App Called ‘Ask Pinterest’

    17 June 2026

    Android 17 rolls out with new multitasking tools as Google expands Gemini features

    17 June 2026
  • Crypto

    Startup Battlefield 200 applications close today

    27 May 2026

    5 days left: Save up to $410 on Disrupt 2026 passes

    25 May 2026

    As crypto cools, a16z crypto raises $2.2 billion in capital

    6 May 2026

    Coinbase to lay off 14% of staff as part of broader restructuring

    5 May 2026

    British cryptographer Adam Back denies NYT report that he is Bitcoin creator Satoshi Nakamoto

    9 April 2026
  • Fintech

    Robinhood’s note on 10% layoffs shows that blaming AI doesn’t cut it

    17 June 2026

    Anthropic’s latest spat with the Trump administration may actually help it, sales figures suggest

    17 June 2026

    Ramp raises $750M at $44B valuation as investors thirst for fintechs with AI history

    5 June 2026

    Last 24 hours to save up to $410 on your Disrupt 2026 ticket

    29 May 2026

    2 days left: Lock in up to $410 in ticket savings for Disrupt 2026

    28 May 2026
  • Hardware

    AI hurts Apple in more ways than one: It could force iPhone price hikes

    18 June 2026

    Snap is finally debuting its long-awaited AR glasses, the specs, and, ugh, they’re not cheap

    17 June 2026

    Qualcomm wants to be the chip in everything that replaces your smartphone, and it just announced two products to that end

    17 June 2026

    This slim speaker under the pillow helped me sleep without headphones

    14 June 2026

    Jeff Bezos’ Prometheus Raises $12 Billion to Build an ‘Artificial General Engineer’ for the Natural World

    12 June 2026
  • Media & Entertainment

    Spotify’s reserved ticket sales to music superfans are now live

    18 June 2026

    Google is betting on Gemini to reinvent the smart home speaker

    18 June 2026

    Mastodon is looking for newsletters to help revive the open social web

    17 June 2026

    60 percent of US consumers say ‘artificial intelligence’ in brand messaging is a turnoff, survey finds

    16 June 2026

    Fox to acquire Roku in $22 billion deal

    15 June 2026
  • Security

    Cybercriminals reportedly hacked tens of thousands of Fortinet firewalls used by major companies around the world

    17 June 2026

    Apple is planning to change the Hide My Email privacy feature that could make it less effective

    17 June 2026

    The US government’s ban on Anthropic models was never about an AI jailbreak

    16 June 2026

    As AI agents become employees, NewCore comes up with $66 million to give them identities

    15 June 2026

    The FBI built its own replica small town to simulate real-world cyberattacks

    13 June 2026
  • Startups

    Pixi’s new iOS app turns text messages into interactive AR experiences

    19 June 2026

    ‘Queer Eye’ life coach Karamo Brown launches Kē, a wellness app featuring his digital AI clone

    18 June 2026

    Pramaana Labs Raises $27M From Khosla Ventures To Bring Official Verification To Artificial Intelligence

    18 June 2026

    Collecting bot training data is dirty, unsavory work. Some AI labs already pay XDOF to do it.

    17 June 2026

    This startup’s super metals could soon be found in military drones, luxury watches and chef’s knives

    17 June 2026
  • Transportation

    Rivian owners file lawsuit alleging false promises about self-driving features

    19 June 2026

    Waymo recalls nearly 4,000 robotaxis to stop them from driving in highway construction zones

    18 June 2026

    Uber will bring its premium robotaxi service to Houston in 2027

    17 June 2026

    Mobileye’s robotaxi launch in the US will put it on both sides of the AV business

    17 June 2026

    SpaceX Goes Public: Everything You Need to Know Post-IPO

    16 June 2026
  • Venture

    The 11 startups that stood out from YC’s demo day, according to VCs

    19 June 2026

    Roelof Botha joins SpaceX board of directors

    18 June 2026

    Chi-Hua Chien saw Facebook coming – now he says the real AI winners won’t sell AI

    18 June 2026

    PayPal Ventures is shutting down as the company continues to restructure

    17 June 2026

    Orbio raises $21 million to automate hiring and onboarding of frontline workers

    15 June 2026
  • Recommended Essentials
TechTost
You are at:Home»AI»AI training data comes at a price only Big Tech can afford
AI

AI training data comes at a price only Big Tech can afford

techtost.comBy techtost.com1 June 202408 Mins Read
Share Facebook Twitter Pinterest LinkedIn Tumblr Email
Ai Training Data Comes At A Price Only Big Tech
Share
Facebook Twitter LinkedIn Pinterest Email

Data is at the heart of today’s advanced AI systems, but it’s increasingly expensive — putting it out of reach for all but the wealthiest tech companies.

Last year, James Betker, a researcher at OpenAI, wrote one post on his personal blog about the nature of generative AI models and the datasets they are trained on. In it, Betker claimed that the training data—not the design, architecture, or any other feature of a model—was the key to increasingly sophisticated, capable AI systems.

“Trained on the same data set for a long time, almost every model converges to the same point,” Betker wrote.

Is Betker right? Is training data the biggest determinant of what a model can do, whether it’s answering a question, drawing human hands, or creating a realistic cityscape?

It’s certainly plausible.

Statistical machines

AI production systems are basically probabilistic models — a huge pile of statistics. They guess based on huge amounts of examples which data makes the most “sense” to put where (eg the word “go” before “to the market” in the sentence “I go to the market”). It seems intuitive, then, that the more examples a model has to follow, the better the performance of models trained on those examples.

“It seems that the performance gains come from data,” Kyle Lo, senior applied research scientist at the Allen Institute for AI (AI2), an artificial intelligence research nonprofit, told TechCrunch, “at least when you have a solid training organization. .”

Lo gave the example of Meta’s Llama 3, a text generation model released earlier this year that outperforms AI2’s own OLMo model, despite being architecturally very similar. Llama 3 was trained on significantly more data than OLMo, which Lo believes explains its superiority in many popular AI benchmarks.

(I’ll point out here that the benchmarks widely used in the AI ​​industry today aren’t necessarily the best gauge of a model’s performance, but outside of quality tests like ours, it’s one of the few measures it has going on.)

This is not to say that training on exponentially larger data sets is a sure path to exponentially better models. The models operate on a “garbage in, garbage out” paradigm, Lo notes, and so curation and data quality matter a lot, perhaps more than sheer quantity.

“It is possible that a small model with carefully designed data will perform better than a large model,” he added. “For example, Falcon 180B, a large model, is ranked 63rd in the LMSYS benchmark, while Llama 2 13B, a much smaller model, is ranked 56th.”

In an interview with TechCrunch last October, OpenAI researcher Gabriel Goh said that higher-quality annotations contributed significantly to improved image quality in DALL-E 3, OpenAI’s text-to-image model, over its predecessor DALL-E 2. This is the main source of improvements,” he said. “Text annotations are much better than they were [with DALL-E 2] — it’s not even comparable.”

Many artificial intelligence models, including DALL-E 3 and DALL-E 2, are trained by having human annotators label data so that a model can learn to correlate those labels with other, observed features of that data. For example, a model fed many cat images with annotations for each breed will eventually “learn” to associate terms such as short tail and short hair with their special visual characteristics.

Bad behaviour

Experts like Lo worry that the growing emphasis on large, high-quality training data sets will concentrate AI development among the few players with billion-dollar budgets who can afford to acquire those sets. Significant innovation in synthetic data or fundamental architecture could disrupt the status quo, but neither seems to be on the near horizon.

“Overall, entities that govern content that is potentially useful for AI development have incentives to lock down their material,” Lo said. “And as access to data closes, we’re essentially blessing some early movers to get data and move up the ladder so that no one else has access to data to catch up.”

Indeed, where the race to collect more education data hasn’t led to unethical (and perhaps even illegal) behavior such as surreptitiously hoarding copyrighted content, it has rewarded tech giants with deep pockets to spend on licensing data.

Artificial intelligence generation models like OpenAI are primarily trained on images, text, audio, video, and other data — some copyrighted — taken from public web pages (including, problematically, those generated by AI). The OpenAIs of the world claim that fair use protects them from legal retaliation. Many rights holders disagree — but, at least for now, there’s not much they can do to prevent the practice.

There are many, many examples of artificial intelligence builders acquiring massive data sets through questionable means in order to train their models. OpenAI According to reports transcribed more than a million hours of YouTube video without YouTube’s blessing—or the blessing of the creators—to power the flagship GPT-4 model. Google recently expanded its terms of service in part to allow public use of Google Docs, restaurant reviews on Google Maps, and other online material for its AI products. And Meta is said to have considered risking lawsuits trains her models to IP-protected content.

Meanwhile, large and small companies rely workers in third world countries paid only a few dollars an hour to create annotations for training sets. Some of these commenters — employed by mammoth startups like Scale AI — work literally days to complete tasks that expose them to graphic depictions of violence and gore with no benefits or guarantees of future gigs.

Rising costs

In other words, even the above data offerings aren’t exactly conducive to an open and fair AI ecosystem.

OpenAI has spent hundreds of millions of dollars licensing content from news publishers, media libraries, and more to train its AI models — a budget far larger than that of most academic research groups, nonprofits, and startups. Meta went so far as to weigh a takeover of publisher Simon & Schuster for the rights to e-book excerpts (eventually, Simon & Schuster sold to private equity firm KKR for $1.62 billion in 2023).

With the purchase of AI training data to be expected cultivate from about $2.5 billion now to nearly $30 billion within a decade, data brokers and platforms are rushing to charge top dollar — in some cases over the objections of their user bases.

Media library provided by Shutterstock inked deals with AI vendors ranging from $25 million to $50 million, while Reddit claims to have made hundreds of millions from licensing data to organizations like Google and OpenAI. Few platforms with abundant data accumulated organically over the years they do not have He signed deals with prolific AI developers, it seems — from Photobucket to Tumblr to Q&A site Stack Overflow.

It’s the platforms’ data for sale — at least depending on the legal arguments you believe. But in most cases, users don’t see a single penny of the earnings. And it hurts the wider AI research community.

“Smaller players will not be able to afford these data licenses and therefore will not be able to develop or study AI models,” Lo said. “I am concerned that this could lead to a lack of independent scrutiny of AI development practices.”

Independent efforts

If there is a ray of sunshine through the darkness, it is the few independent, non-profit efforts to create massive data sets that anyone can use to train a productive AI model.

EleutherAI, a non-profit grassroots research group that started as a loose Discord collective in 2020, is working with the University of Toronto, AI2, and independent researchers to create The Pile v2, a set of billions of text snippets mostly sourced from the public domain sector .

In April, the startup Hugging Face released FineWeb, a filtered version of Common Crawl—the eponymous dataset maintained by the nonprofit organization Common Crawl, consisting of billions upon billions of web pages—that Hugging Face claims improves the model’s performance on many reference points.

Some efforts to release open training datasets, such as the LAION team’s image sets, have struggled with copyright, data privacy, and more. equally serious ethical and legal challenges. But some of the most dedicated data curators are committed to doing better. Pile v2, for example, removes problematic copyrighted material found in its original dataset, The Pile.

The question is whether any of these open-source efforts can hope to keep pace with Big Tech. Since data collection and curation remains a matter of resources, the answer is probably no — at least not until some research breakthrough levels the playing field.

afford All included big data data sets Education Generative AI price tech training
Share. Facebook Twitter Pinterest LinkedIn Tumblr Email
Previous ArticleInstagram is testing “test wheels” that aren’t shown to a creator’s followers
Next Article How (Re)vive grew 10x last year helping retailers recycle and sell returned items
bhanuprakash.cg
techtost.com
  • Website

Related Posts

Pixi’s new iOS app turns text messages into interactive AR experiences

19 June 2026

Source: Elastic agrees to buy CRV-powered DeductiveAI for up to $85 million

19 June 2026

General Intuition in talks to raise $300M at roughly $2B valuation

18 June 2026
Add A Comment

Leave A Reply Cancel Reply

Don't Miss

Pixi’s new iOS app turns text messages into interactive AR experiences

19 June 2026

Rivian owners file lawsuit alleging false promises about self-driving features

19 June 2026

The 11 startups that stood out from YC’s demo day, according to VCs

19 June 2026
Stay In Touch
  • Facebook
  • YouTube
  • TikTok
  • WhatsApp
  • Twitter
  • Instagram
Fintech

Robinhood’s note on 10% layoffs shows that blaming AI doesn’t cut it

17 June 2026

Anthropic’s latest spat with the Trump administration may actually help it, sales figures suggest

17 June 2026

Ramp raises $750M at $44B valuation as investors thirst for fintechs with AI history

5 June 2026
Startups

Pixi’s new iOS app turns text messages into interactive AR experiences

‘Queer Eye’ life coach Karamo Brown launches Kē, a wellness app featuring his digital AI clone

Pramaana Labs Raises $27M From Khosla Ventures To Bring Official Verification To Artificial Intelligence

© 2026 TechTost. All Rights Reserved
  • About Us
  • Contact Us
  • Privacy Policy
  • Terms and Conditions
  • Disclaimer

Type above and press Enter to search. Press Esc to cancel.