TechTost
Why most AI benchmarks tell us so little

By techtost.com | 8 March 2024 | 5 Mins Read

On Tuesday, startup Anthropic released a family of AI models that it claims achieve best-in-class performance. Just days later, rival Inflection AI unveiled a model that it claims comes close to matching some of the most capable models out there, including OpenAI’s GPT-4, in quality.

Anthropic and Inflection are by no means the first AI companies to claim that their models have matched or beaten the competition by some objective measure. Google made the same claim for its Gemini models at launch, and OpenAI said the same of GPT-4 and its predecessors, GPT-3, GPT-2 and GPT-1. The list goes on.

But what metrics are they talking about? When a vendor says a model achieves top performance or quality, what exactly does that mean? Perhaps more to the point: Will a model that technically “performs” better than some other model actually feel improved in a tangible way?

On that last question, not likely.

The reason – or rather the problem – lies in the benchmarks that AI companies use to quantify a model’s strengths and weaknesses.

Esoteric measures

Today’s most commonly used benchmarks for AI models, specifically the models that power chatbots such as OpenAI’s ChatGPT and Anthropic’s Claude, do a poor job of capturing how the average person interacts with the models being tested. For example, one benchmark cited by Anthropic in its recent announcement, GPQA (“A Graduate-Level Google-Proof Q&A Benchmark”), contains hundreds of PhD-level biology, physics and chemistry questions, yet most people use chatbots for tasks like answering emails, writing cover letters and talking about their feelings.

Jesse Dodge, a scientist at the Allen Institute for AI, the nonprofit AI research organization, says the industry has reached a “crisis of evaluation.”

“Benchmarks are typically static and narrowly focused on evaluating a single capability, such as a model’s factuality in a single domain or its ability to solve multiple-choice mathematical reasoning questions,” Dodge told TechCrunch in an interview. “Many benchmarks used for evaluation are more than three years old, from when AI systems were mainly used for research and did not have many real users. In addition, people use generative AI in many ways; they are very creative.”
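
To make the idea of a static, single-capability benchmark concrete, here is a minimal Python sketch of how a multiple-choice benchmark might be scored. The question set, the `toy_model` stand-in and the `accuracy` helper are all hypothetical, invented for illustration; none of them comes from GPQA or any other real benchmark.

```python
from typing import Callable, Dict, List

# Hypothetical fixed question set: a static benchmark does not change between runs.
QUESTIONS: List[Dict] = [
    {"prompt": "2 + 2 = ?", "choices": ["3", "4", "5"], "answer": 1},
    {"prompt": "10 / 2 = ?", "choices": ["2", "5", "20"], "answer": 1},
    {"prompt": "3 * 3 = ?", "choices": ["6", "9", "12"], "answer": 1},
    {"prompt": "7 - 4 = ?", "choices": ["3", "4", "11"], "answer": 0},
]


def accuracy(model: Callable[[str, List[str]], int], questions: List[Dict]) -> float:
    """Return the fraction of questions where the model picks the keyed answer."""
    if not questions:
        raise ValueError("benchmark has no questions")
    correct = sum(
        1 for q in questions if model(q["prompt"], q["choices"]) == q["answer"]
    )
    return correct / len(questions)


def toy_model(prompt: str, choices: List[str]) -> int:
    """Stand-in 'model' (an assumption) that ignores the prompt and picks choice 1."""
    return 1


score = accuracy(toy_model, QUESTIONS)
print(f"accuracy: {score:.2f}")
```

In this toy setup, a "model" that never reads the question still gets three of the four questions right. The single number that comes out says nothing about how such a system would handle an email or a cover letter.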

Wrong measurements

It’s not that the most used benchmarks are completely useless. No doubt someone is asking ChatGPT PhD-level math questions. However, as generative AI models are increasingly positioned as mass-market, do-it-all systems, the old benchmarks are becoming less applicable.

David Widder, a postdoctoral researcher at Cornell who studies artificial intelligence and ethics, notes that many of the skills that common benchmarks test, from solving grade-school-level math problems to determining whether a sentence contains an anachronism, will never be relevant to the majority of users.

“Earlier AI systems were often built to solve a specific problem in a context (e.g. medical AI expert systems), making a deep understanding of what constitutes good performance in that particular context more possible,” Widder told TechCrunch. “As systems are increasingly seen as ‘general purpose’, this is less possible, so we’re increasingly seeing a focus on testing models across a variety of benchmarks in different fields.”

Errors and other defects

In addition to misalignment with use cases, there are questions about whether some benchmarks are properly measuring what they are supposed to measure.

One analysis of HellaSwag, a test designed to assess commonsense reasoning in models, found that more than a third of the test questions contained typos and “nonsensical” writing. Elsewhere, MMLU (short for “Massive Multitask Language Understanding”), a benchmark highlighted by vendors such as Google, OpenAI and Anthropic as proof that their models can reason through logic problems, asks questions that can be solved through rote memorization.

Test questions from the HellaSwag benchmark.

“[Benchmarks like MMLU are] more about memorizing and associating two keywords together,” Widder said. “I can find [a relevant] article quickly enough and answer the question, but that doesn’t mean I understand the causal mechanism or that I could use my understanding of that causal mechanism to actually reason and solve new and complex problems in unpredictable contexts. A model can’t either.”

Fixing what’s broken

So benchmarks are broken. But can they be fixed?

Dodge believes so – with more human involvement.

“The right way forward, here, is a combination of evaluation benchmarks with human evaluation,” he said, “prompting a model with a real user question and then hiring a human to evaluate how good the response is.”
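
One way to picture the combination Dodge describes is sketched below in Python: a benchmark accuracy is blended with ratings that human evaluators give to a model's answers to real user questions. The 1-to-5 rating scale, the equal default weighting and every name in the code are assumptions made for illustration; the article does not describe a specific scoring method.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class HumanRating:
    question: str  # a real user question that was put to the model
    score: int     # the rater's judgment on an assumed 1-5 scale


def human_score(ratings: List[HumanRating]) -> float:
    """Average the human ratings and rescale them from 1-5 to 0-1."""
    if not ratings:
        raise ValueError("no human ratings provided")
    for rating in ratings:
        if not 1 <= rating.score <= 5:
            raise ValueError(f"rating out of range: {rating.score}")
    return (mean(r.score for r in ratings) - 1) / 4


def combined_score(
    benchmark_accuracy: float,
    ratings: List[HumanRating],
    human_weight: float = 0.5,  # assumed weighting, not taken from the article
) -> float:
    """Blend benchmark accuracy (0-1) with the rescaled human score."""
    if not 0.0 <= benchmark_accuracy <= 1.0:
        raise ValueError("benchmark_accuracy must be between 0 and 1")
    if not 0.0 <= human_weight <= 1.0:
        raise ValueError("human_weight must be between 0 and 1")
    return (1 - human_weight) * benchmark_accuracy + human_weight * human_score(ratings)


# Hypothetical ratings for the kinds of everyday tasks the article mentions.
ratings = [
    HumanRating("Help me reply to this email from my landlord", 2),
    HumanRating("Draft a cover letter for a nursing job", 3),
    HumanRating("I feel anxious about a job interview tomorrow", 4),
]

result = combined_score(benchmark_accuracy=0.9, ratings=ratings)
print(f"combined score: {result:.2f}")
```

With these made-up numbers, a model that scores 90% on a benchmark but earns only middling marks from human raters ends up with a noticeably lower combined figure. How much weight the human side should carry is a design choice this sketch leaves open.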

As for Widder, he’s less optimistic that benchmarks today — even with corrections for the most obvious mistakes, like typos — can be improved to the point where they would be informative to the vast majority of AI model users. Instead, he believes that tests of models should focus on the downstream effects of those models and whether the effects, good or bad, are seen as desirable by those affected.

“I would ask what specific goals we want AI models to be used for and assess whether they would be, or are, successful in such contexts,” he said. “And hopefully that process also includes evaluating whether we should be using AI in such contexts.”
