TechTost
AI

AI training data comes at a price only Big Tech can afford

By techtost.com | 1 June 2024 | 8 Mins Read

Data is at the heart of today’s advanced AI systems, but it’s increasingly expensive — putting it out of reach for all but the wealthiest tech companies.

Last year, James Betker, a researcher at OpenAI, wrote a post on his personal blog about the nature of generative AI models and the datasets they are trained on. In it, Betker claimed that the training data, not the design, architecture, or any other feature of a model, is the key to increasingly sophisticated, capable AI systems.

“Trained on the same data set for long enough, almost every model converges to the same point,” Betker wrote.

Is Betker right? Is training data the biggest determinant of what a model can do, whether it’s answering a question, drawing human hands, or creating a realistic cityscape?

It’s certainly plausible.

Statistical machines

Production AI systems are basically probabilistic models: a huge pile of statistics. Drawing on vast numbers of examples, they guess which data makes the most “sense” to place where (e.g., the word “go” before “to the market” in the sentence “I go to the market”). It seems intuitive, then, that the more examples a model has to go on, the better the performance of models trained on those examples.
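To make that “pile of statistics” idea concrete, here is a toy sketch of my own (not anyone's production code): a bigram model that simply counts which word follows which in a corpus, then predicts the most frequent continuation.

```python
from collections import Counter, defaultdict

# Toy bigram "language model": count which word follows which,
# then predict the statistically most likely continuation.
corpus = "i go to the market . i go to the park . i walk to the market .".split()

follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict(word):
    # Return the most frequent next word seen after `word`.
    return follows[word].most_common(1)[0][0]

print(predict("to"))   # "the" follows "to" in every example
print(predict("the"))  # "market" appears more often than "park"
```

More examples mean better-estimated counts, which is the intuition behind Betker's claim scaled up by many orders of magnitude.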

“It seems that the performance gains come from data,” Kyle Lo, senior applied research scientist at the Allen Institute for AI (AI2), an artificial intelligence research nonprofit, told TechCrunch, “at least once you have a stable training setup.”

Lo gave the example of Meta’s Llama 3, a text generation model released earlier this year that outperforms AI2’s own OLMo model, despite being architecturally very similar. Llama 3 was trained on significantly more data than OLMo, which Lo believes explains its superiority in many popular AI benchmarks.

(I’ll point out here that the benchmarks in wide use in the AI industry today aren’t necessarily the best gauge of a model’s performance, but outside of qualitative tests like ours, they’re among the few measures we have to go on.)

This is not to say that training on exponentially larger data sets is a sure path to exponentially better models. The models operate on a “garbage in, garbage out” paradigm, Lo notes, and so curation and data quality matter a lot, perhaps more than sheer quantity.

“It is possible that a small model with carefully designed data will perform better than a large model,” he added. “For example, Falcon 180B, a large model, is ranked 63rd in the LMSYS benchmark, while Llama 2 13B, a much smaller model, is ranked 56th.”

In an interview with TechCrunch last October, OpenAI researcher Gabriel Goh said that higher-quality annotations contributed significantly to the improved image quality in DALL-E 3, OpenAI’s text-to-image model, over its predecessor DALL-E 2. “This is the main source of the improvements,” he said. “The text annotations are far better than they were [with DALL-E 2]; it’s not even comparable.”

Many AI models, including DALL-E 3 and DALL-E 2, are trained by having human annotators label data so that a model can learn to associate those labels with other, observed characteristics of that data. For example, a model fed many cat images annotated with each breed will eventually “learn” to associate terms such as short tail and short hair with their distinctive visual characteristics.
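A minimal sketch of that label-to-feature association (the dataset, breed names, and feature tags here are invented purely for illustration): count how often each visual feature co-occurs with each human-provided label, then score unseen feature sets against those counts.

```python
from collections import Counter, defaultdict

# Hypothetical annotated dataset: each "image" is a set of visual
# feature tags plus a human-provided breed label.
annotated = [
    ({"short_tail", "short_hair", "pointed_ears"}, "bobtail"),
    ({"short_tail", "short_hair"}, "bobtail"),
    ({"long_hair", "flat_face"}, "persian"),
    ({"long_hair", "flat_face", "short_tail"}, "persian"),
]

# "Training": count how often each feature co-occurs with each label.
assoc = defaultdict(Counter)
for features, label in annotated:
    for f in features:
        assoc[f][label] += 1

def likely_label(features):
    # Score each label by summed feature/label co-occurrence counts.
    scores = Counter()
    for f in features:
        scores.update(assoc[f])
    return scores.most_common(1)[0][0]

print(likely_label({"short_tail", "short_hair"}))  # "bobtail"
```

Real models learn such associations in continuous feature spaces rather than by counting tags, but the dependence on annotation quality is the same: noisy labels corrupt the counts.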

Bad behaviour

Experts like Lo worry that the growing emphasis on large, high-quality training data sets will concentrate AI development among the few players with billion-dollar budgets who can afford to acquire those sets. Significant innovation in synthetic data or fundamental architecture could disrupt the status quo, but neither seems to be on the near horizon.

“Overall, entities that govern content that is potentially useful for AI development have incentives to lock down their material,” Lo said. “And as access to data closes, we’re essentially blessing some early movers to get data and move up the ladder so that no one else has access to data to catch up.”

Indeed, where the race to collect more training data hasn’t led to unethical (and perhaps even illegal) behavior, such as surreptitiously hoarding copyrighted content, it has rewarded tech giants with deep pockets to spend on licensing data.

Generative AI models like OpenAI’s are mostly trained on images, text, audio, video, and other data, some of it copyrighted, scraped from public web pages (including, problematically, AI-generated ones). The OpenAIs of the world claim that fair use protects them from legal retaliation. Many rights holders disagree, but, at least for now, there’s not much they can do to prevent the practice.

There are many, many examples of AI developers acquiring massive data sets through questionable means in order to train their models. OpenAI reportedly transcribed more than a million hours of YouTube videos without YouTube’s blessing, or the blessing of the creators, to feed its flagship GPT-4 model. Google recently broadened its terms of service in part so it could tap public Google Docs, restaurant reviews on Google Maps, and other online material for its AI products. And Meta is said to have considered risking lawsuits to train its models on IP-protected content.

Meanwhile, companies large and small are relying on workers in third-world countries paid only a few dollars per hour to create annotations for training sets. Some of these annotators, employed by mammoth startups like Scale AI, work literally days on end to complete tasks that expose them to graphic depictions of violence and gore, without any benefits or guarantees of future gigs.

Rising costs

In other words, even the more aboveboard data deals aren’t exactly conducive to an open and fair AI ecosystem.

OpenAI has spent hundreds of millions of dollars licensing content from news publishers, media libraries, and more to train its AI models — a budget far larger than that of most academic research groups, nonprofits, and startups. Meta went so far as to weigh a takeover of publisher Simon & Schuster for the rights to e-book excerpts (eventually, Simon & Schuster sold to private equity firm KKR for $1.62 billion in 2023).

With the market for AI training data expected to grow from roughly $2.5 billion now to close to $30 billion within a decade, data brokers and platforms are rushing to charge top dollar, in some cases over the objections of their user bases.
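For reference, the figures above imply annual growth of roughly 28 percent. A quick back-of-the-envelope check:

```python
# Implied compound annual growth rate if the AI training-data market
# grows from ~$2.5B to ~$30B over ten years (figures from the article).
start, end, years = 2.5e9, 30e9, 10
cagr = (end / start) ** (1 / years) - 1
print(f"{cagr:.1%}")  # roughly 28% per year
```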

Media library Shutterstock has inked deals with AI vendors ranging from $25 million to $50 million, while Reddit claims to have made hundreds of millions from licensing data to organizations such as Google and OpenAI. Few platforms with abundant data accumulated organically over the years have not, it seems, signed deals with generative AI developers, from Photobucket to Tumblr to Q&A site Stack Overflow.

It’s the platforms’ data to sell, at least depending on which legal arguments you believe. But in most cases, users don’t see a penny of the earnings. And that hurts the wider AI research community.

“Smaller players will not be able to afford these data licenses and therefore will not be able to develop or study AI models,” Lo said. “I am concerned that this could lead to a lack of independent scrutiny of AI development practices.”

Independent efforts

If there is a ray of sunshine through the darkness, it is the few independent, nonprofit efforts to create massive data sets that anyone can use to train a generative AI model.

EleutherAI, a nonprofit grassroots research group that started as a loose Discord collective in 2020, is working with the University of Toronto, AI2, and independent researchers to create The Pile v2, a set of billions of text snippets primarily sourced from the public domain.

In April, the startup Hugging Face released FineWeb, a filtered version of Common Crawl (the eponymous data set maintained by the nonprofit organization Common Crawl, made up of billions upon billions of web pages) that Hugging Face claims improves model performance on many benchmarks.
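Filtering pipelines of this kind decide page by page which scraped text is worth keeping. A toy sketch of the general idea (these particular heuristics and thresholds are invented for illustration and are not FineWeb's actual rules):

```python
# Toy web-text quality filter in the spirit of curated-crawl pipelines.
def keep_page(text: str) -> bool:
    words = text.split()
    if len(words) < 5:  # drop near-empty pages
        return False
    # Share of alphabetic characters: low ratios usually mean
    # markup debris, number dumps, or other non-prose content.
    alpha = sum(c.isalpha() for c in text) / max(len(text), 1)
    if alpha < 0.6:
        return False
    return True

pages = [
    "Welcome to our site! Read our in-depth article about model training.",
    "404 404 404 404 404",
    "buy",
]
print([keep_page(p) for p in pages])  # [True, False, False]
```

Production pipelines stack dozens of such filters plus deduplication, which is why curation, as Lo notes above, can matter more than raw quantity.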

Some efforts to release open training data sets, like the LAION team’s image sets, have run up against copyright, data privacy, and other, equally serious ethical and legal challenges. But some of the more dedicated data curators have committed to doing better. The Pile v2, for example, removes problematic copyrighted material found in its predecessor dataset, The Pile.

The question is whether any of these open-source efforts can hope to keep pace with Big Tech. Since data collection and curation remains a matter of resources, the answer is probably no — at least not until some research breakthrough levels the playing field.
