Earlier this week, Meta landed in hot water for using an experimental, unreleased version of its Llama 4 Maverick model to achieve a high score on a crowdsourced benchmark, LM Arena. The incident prompted LM Arena's maintainers to apologize, change their policies, and score the unmodified, vanilla Maverick.
It turns out, it is not very competitive.
The unmodified Maverick, “Llama-4-Maverick-17B-128E-Instruct,” ranked below models including OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, and Gemini 1.5 Pro as of Friday. Many of these models are months old.
Llama 4 release version was added to LMArena after it was found out they cheated, but you probably didn’t see it because you have to scroll down to 32nd place, which is where it ranks pic.twitter.com/a0bxkdx4lx
– P: ɡesn (@pigeon__s) April 11, 2025
Why the poor performance? Meta’s experimental Maverick, “Llama-4-Maverick-03-26-Experimental,” was “optimized for conversationality,” the company explained in a chart published last Saturday. Those optimizations evidently played well on LM Arena, which has human raters compare the outputs of models and choose which they prefer.
As we’ve written before, for a variety of reasons, LM Arena has never been the most reliable measure of an AI model’s performance. Still, tailoring a model to a benchmark – besides being misleading – makes it difficult for developers to predict exactly how well the model will perform in different contexts.
In a statement, a Meta spokesperson told TechCrunch that the company experiments with “all types of custom variants.”
“‘Llama-4-Maverick-03-26-Experimental’ is a chat-optimized version we experimented with that also performs well on LM Arena,” the spokesperson said. “We have now released our open source version and will see how developers customize Llama 4 for their own use cases. We’re excited to see what they will build and look forward to their ongoing feedback.”