Why most AI benchmarks tell us so little

By techtost.com | 8 March 2024 | 5 Mins Read

On Tuesday, startup Anthropic released a family of AI models that it claims achieve best-in-class performance. Just days later, rival Inflection AI unveiled a model that it claims comes close to matching some of the most capable models out there, including OpenAI’s GPT-4, in quality.

Anthropic and Inflection are by no means the first AI companies to claim that their models have matched or beaten the competition by some objective measure. Google made the same claim for its Gemini models at launch, and OpenAI said the same of GPT-4 and its predecessors, GPT-3, GPT-2 and GPT-1. The list goes on.

But what metrics are they talking about? When a vendor says a model achieves top performance or quality, what exactly does that mean? Perhaps more to the point: Will a model that technically “performs” better than some other model actually feel improved in a tangible way?

On that last question, not likely.

The reason – or rather the problem – lies in the benchmarks that AI companies use to quantify a model’s strengths and weaknesses.

Esoteric measures

Today’s most commonly used benchmarks for AI models, specifically chatbot-powering models such as OpenAI’s ChatGPT and Anthropic’s Claude, do a poor job of capturing how the average person interacts with the models being tested. For example, one benchmark cited by Anthropic in its recent announcement, GPQA (“A Graduate-Level Google-Proof Q&A Benchmark”), contains hundreds of Ph.D.-level biology, physics and chemistry questions, yet most people use chatbots for tasks like responding to emails, writing cover letters and talking about their feelings.

Jesse Dodge, a scientist at the Allen Institute for AI, the nonprofit AI research organization, says the industry has reached a “crisis of evaluation.”

“Benchmarks are typically static and narrowly focused on evaluating a single capability, such as a model’s factuality in a single domain or its ability to solve multiple-choice mathematical reasoning questions,” Dodge told TechCrunch in an interview. “Many benchmarks used for evaluation are more than three years old, from when AI systems were mainly used for research and did not have many real users. In addition, people use generative AI in many ways; they are very creative.”

Wrong measurements

It’s not that the most-used benchmarks are completely useless. No doubt someone is asking ChatGPT Ph.D.-level math questions. However, as generative AI models are increasingly positioned as mass-market, do-it-all systems, the old benchmarks are becoming less applicable.

David Widder, a postdoctoral researcher at Cornell who studies artificial intelligence and ethics, notes that many of the skills common benchmarks test, from solving school-level math problems to determining whether a sentence contains an anachronism, will never be relevant to the majority of users.

“Earlier AI systems were often built to solve a specific problem in a context (e.g. medical AI expert systems), making a deep understanding of what constitutes good performance in that particular context more possible,” Widder told TechCrunch. “As systems are increasingly seen as ‘general purpose’, this is less possible, so we’re increasingly seeing a focus on testing models across a variety of benchmarks in different fields.”

Errors and other defects

In addition to misalignment with use cases, there are questions about whether some benchmarks are properly measuring what they are supposed to measure.

One analysis of HellaSwag, a test designed to assess commonsense reasoning in models, found that more than a third of the test questions contained typos and “nonsensical” writing. Elsewhere, MMLU (short for “Massive Multitask Language Understanding”), a benchmark highlighted by vendors such as Google, OpenAI and Anthropic as proof that their models can reason through logic problems, asks questions that can be solved through rote memorization.

Test questions from the HellaSwag benchmark.

“[Benchmarks like MMLU are] more about memorizing and associating two keywords together,” Widder said. “I can find [a relevant] article quickly enough and answer the question, but that doesn’t mean I understand the causal mechanism or that I could use my understanding of that causal mechanism to actually reason and solve new and complex problems in unpredictable contexts. A model can’t either.”

Fixing what’s broken

So benchmarks are broken. But can they be fixed?

Dodge believes so – with more human involvement.

“The right way forward, here, is a combination of evaluation benchmarks with human evaluation,” he said, “prompting a model with a real user question and then hiring a human to evaluate how good the response is.”
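As a rough illustration of what pairing the two kinds of evaluation could look like in practice, here is a minimal Python sketch. It is not Dodge's method or any lab's actual pipeline: the `HumanRating` type, the 1-to-5 rating scale and every function name are hypothetical, chosen only to make the idea concrete. The sketch reports a static benchmark accuracy alongside an average of human ratings on real user questions, rather than blending them into one number.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class HumanRating:
    """One human judgment of a model response (hypothetical structure)."""
    question: str   # a real user question shown to the model
    response: str   # the model's answer
    score: int      # assumed scale: 1 (poor) to 5 (excellent)


def benchmark_accuracy(predictions, answers):
    """Fraction of static benchmark items the model answered correctly."""
    if not answers or len(predictions) != len(answers):
        raise ValueError("predictions and answers must be non-empty and equal length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def mean_human_score(ratings, min_score=1, max_score=5):
    """Average human rating, rescaled to the range 0..1."""
    if not ratings:
        raise ValueError("at least one human rating is required")
    span = max_score - min_score
    return mean((r.score - min_score) / span for r in ratings)


def combined_report(predictions, answers, ratings):
    """Report both signals side by side instead of merging them."""
    return {
        "benchmark_accuracy": benchmark_accuracy(predictions, answers),
        "human_score": mean_human_score(ratings),
        "n_human_ratings": len(ratings),
    }


if __name__ == "__main__":
    ratings = [
        HumanRating("Help me reply to this email", "Sure, here is a draft", 5),
        HumanRating("Write a cover letter", "Dear hiring manager", 3),
    ]
    print(combined_report(["A", "B", "C", "D"], ["A", "B", "D", "D"], ratings))
```

Keeping the two numbers separate is a deliberate choice in this sketch: a model can score well on a static test and still rate poorly with the people actually using it, which is the gap the researchers quoted here describe.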

As for Widder, he’s less optimistic that benchmarks today — even with corrections for the most obvious mistakes, like typos — can be improved to the point where they would be informative to the vast majority of AI model users. Instead, he believes that tests of models should focus on the downstream effects of those models and whether the effects, good or bad, are seen as desirable by those affected.

“I would ask what specific goals we want AI models to be used for and assess whether they would be, or are, successful in such contexts,” he said. “And hopefully that process also includes evaluating whether we should be using AI in such contexts.”

bhanuprakash.cg
techtost.com