TechTost

Why most AI benchmarks tell us so little

By techtost.com | 8 March 2024 | 5 Mins Read

On Tuesday, startup Anthropic released a family of AI models that it claims achieve best-in-class performance. Just days later, rival Inflection AI unveiled a model that it claims comes close to matching some of the most capable models out there, including OpenAI’s GPT-4, in quality.

Anthropic and Inflection are by no means the first AI companies to claim that their models have matched or beaten the competition by some objective measure. Google made the same claim for its Gemini models at launch, and OpenAI said the same of GPT-4 and its predecessors, GPT-3, GPT-2, and GPT-1. The list goes on.

But what metrics are they talking about? When a vendor says a model achieves top performance or quality, what exactly does that mean? Perhaps more to the point: Will a model that technically “performs” better than some other model actually feel improved in a tangible way?

On that last question, not likely.

The reason – or rather the problem – lies in the benchmarks that AI companies use to quantify a model’s strengths and weaknesses.

Esoteric measures

Today’s most commonly used benchmarks for AI models, specifically chatbot-powering models such as OpenAI’s ChatGPT and Anthropic’s Claude, do a poor job of capturing how the average person interacts with the models being tested. For example, a benchmark cited by Anthropic in its recent announcement, GPQA (“A Graduate-Level Google-Proof Q&A Benchmark”), contains hundreds of PhD-level biology, physics, and chemistry questions, yet most people use chatbots for tasks like answering emails, writing cover letters and talking about their feelings.

Jesse Dodge, a scientist at the Allen Institute for AI, the nonprofit AI research organization, says the industry has reached a “crisis of evaluation.”

“Benchmarks are typically static and narrowly focused on evaluating a single capability, such as a model’s factuality in a single domain or its ability to solve multiple-choice mathematical reasoning questions,” Dodge told TechCrunch in an interview. “Many benchmarks used for evaluation are more than three years old, from when AI systems were mainly used for research and did not have many real users. In addition, people use generative AI in many ways; they are very creative.”

Wrong measurements

It’s not that the most-used benchmarks are completely useless. No doubt someone is asking ChatGPT Ph.D.-level math questions. However, as generative AI models are increasingly positioned as mass-market, do-it-all systems, the old benchmarks are becoming less applicable.

David Widder, a postdoctoral researcher at Cornell who studies artificial intelligence and ethics, notes that many of the skills that common benchmarks test, from solving school-level math problems to determining whether a sentence contains an anachronism, will never be relevant to the majority of users.

“Earlier AI systems were often built to solve a specific problem in a context (e.g. medical AI expert systems), making a deep understanding of what constitutes good performance in that particular context more possible,” Widder told TechCrunch. “As systems are increasingly seen as ‘general purpose’, this is less possible, so we’re increasingly seeing a focus on testing models across a variety of benchmarks in different fields.”

Errors and other defects

In addition to misalignment with use cases, there are questions about whether some benchmarks are properly measuring what they are supposed to measure.

One analysis of HellaSwag, a test designed to assess commonsense reasoning in models, found that over a third of the test questions contained typos and “nonsensical” writing. Elsewhere, MMLU (short for “Massive Multitask Language Understanding”), a benchmark highlighted by vendors such as Google, OpenAI and Anthropic as proof that their models can reason through logic problems, asks questions that can be solved through rote memorization.

Test questions from the HellaSwag benchmark.

“[Benchmarks like MMLU are] more about memorizing and associating two keywords together,” Widder said. “I can find [a relevant] article quickly enough and answer the question, but that doesn’t mean I understand the causal mechanism or that I could use my understanding of that causal mechanism to actually reason and solve new and complex problems in unpredictable contexts. A model can’t either.”

Fixing what’s broken

So benchmarks are broken. But can they be fixed?

Dodge believes so – with more human involvement.

“The right way forward, here, is a combination of evaluation benchmarks with human evaluation,” he said, “prompting a model with a real user question and then hiring a human to evaluate how good the response is.”

As for Widder, he’s less optimistic that benchmarks today — even with corrections for the most obvious mistakes, like typos — can be improved to the point where they would be informative to the vast majority of AI model users. Instead, he believes that tests of models should focus on the downstream effects of those models and whether the effects, good or bad, are seen as desirable by those affected.

“I would ask what specific goals we want AI models to be used for and assess whether they would be, or are, successful in such contexts,” he said. “And hopefully that process also includes evaluating whether we should be using AI in such contexts.”

bhanuprakash.cg
techtost.com
© 2026 TechTost. All Rights Reserved