One of the new Meta AI models released on Saturday, Maverick, ranks second on LM Arena, a test in which human raters compare the outputs of models and choose which they prefer. But it appears that the version of Maverick Meta deployed to LM Arena differs from the version that is widely available to developers.
As several AI researchers pointed out on X, Meta noted in its announcement that the Maverick on LM Arena is an “experimental chat version.” A chart on the official Llama website, meanwhile, reveals that Meta’s LM Arena testing was conducted using “Llama 4 Maverick optimized for conversationality.”
As we have written before, for a variety of reasons, LM Arena has never been the most reliable measure of an AI model’s performance. But AI companies have generally not customized or otherwise tuned their models to score better on LM Arena, or at least have not admitted to doing so.
The problem with tailoring a model to a benchmark, withholding that version, and then releasing a “vanilla” variant of the same model is that it makes it difficult for developers to predict exactly how well the model will perform in particular contexts. It is also misleading. Ideally, benchmarks, inadequate as they are, provide a snapshot of a model’s strengths and weaknesses across a range of tasks.
Indeed, researchers on X have observed stark differences in the behavior of the publicly downloadable Maverick compared with the model hosted on LM Arena. The LM Arena version appears to use a lot of emojis and give incredibly long-winded answers.
Okay Llama 4 is def a little cooked lol, what is this yap city pic.twitter.com/Y3GVHBVZ65
– Nathan Lambert (@natolambert) April 6, 2025
For some reason, the Llama 4 model in the Arena uses a lot more emojis. On together.ai, it seems better: pic.twitter.com/f74odx4zt
– Tech Dev Notes (@Techdevnotes) April 6, 2025
We have reached out to Meta and Chatbot Arena, the organization that maintains LM Arena, for comment.
