As conventional AI benchmarking techniques prove inadequate, AI builders are turning to more creative ways to evaluate the capabilities of generative AI models. For one group of developers, that means Minecraft, Microsoft's sandbox building game.
The website Minecraft Benchmark (or MC-Bench) pits AI models against each other in head-to-head challenges to respond to prompts with Minecraft creations. Users can vote on which model did a better job, and only after voting can they see which AI made each build.
For Adi Singh, the 12th grader who started MC-Bench, the value of Minecraft is not so much the game itself as people's familiarity with it; it is, after all, the best-selling video game of all time. Even for people who have never played it, it is still possible to judge which blocky rendition of a pineapple is better executed.
“Minecraft lets people see the progress [of AI development] much more easily,” Singh told TechCrunch. “People are used to Minecraft, used to the look and the vibe.”
MC-Bench currently lists eight people as volunteer contributors. Anthropic, Google, OpenAI, and Alibaba have subsidized the project's use of their products to run benchmark prompts, per MC-Bench's website, but the companies are not otherwise involved.
“Right now we're just doing simple builds to reflect on how far we have come from the GPT-3 era, but [we] could see ourselves expanding to these longer-form plans and goal-oriented tasks,” Singh said. “Games may just be a medium for testing reasoning that is safer than real life and more controlled for testing purposes, making it more ideal in my eyes.”
Other games like Pokémon Red, Street Fighter, and Pictionary have been used as experimental benchmarks for AI, in part because the art of benchmarking AI is surprisingly difficult.
Researchers often test AI models on standardized evaluations, but many of these tests give AI a home-field advantage. Because of the way they are trained, models are naturally gifted at certain narrow kinds of problem-solving, particularly problems that reward memorization or basic extrapolation.
Simply put, it is hard to grasp what it means that OpenAI's GPT-4 can score in the 88th percentile on the LSAT yet cannot tell how many Rs are in the word “strawberry.” Anthropic's Claude 3.7 Sonnet achieved 62.3% accuracy on a standardized software engineering benchmark, but it is worse at playing Pokémon than most five-year-olds.
MC-Bench is technically a programming benchmark, since the models are asked to write code to create the prompted build, such as “Frosty the Snowman” or “a charming tropical beach hut on a pristine sandy shore.”
But it is easier for most MC-Bench users to judge whether a snowman looks right than to dig through the code, which gives the project a wider appeal, and therefore the ability to collect more data on which models consistently score well. A rough sketch of what that model-written build code might look like follows below.
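MC-Bench has not published its harness details in this article, so the following is only a hypothetical sketch of what a model's answer to a prompt like “Frosty the Snowman” could look like, assuming the model emits simple block-placement code; the function name, coordinates, and command format here are illustrative, not MC-Bench's actual interface.

```python
# Hypothetical sketch only, not MC-Bench's real harness: roughly what
# model-written "build code" for a snowman prompt might look like if the
# model emitted Minecraft /setblock commands.

def build_snowman(origin=(0, 64, 0)):
    """Return (x, y, z, block) placements for a simple snowman."""
    x, y, z = origin
    blocks = []
    # Three stacked layers of snow, widest at the bottom (flattened "balls").
    for dy, radius in [(0, 2), (2, 1), (4, 1)]:
        for dx in range(-radius, radius + 1):
            for dz in range(-radius, radius + 1):
                blocks.append((x + dx, y + dy, z + dz, "minecraft:snow_block"))
    # A carved pumpkin on top as the face (placeholder decoration).
    blocks.append((x, y + 5, z, "minecraft:carved_pumpkin"))
    return blocks


if __name__ == "__main__":
    # Print one /setblock command per block placement.
    for bx, by, bz, block in build_snowman():
        print(f"/setblock {bx} {by} {bz} {block}")
```

Voters never see this kind of code, only the rendered result, which is exactly the accessibility Singh is describing.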
Whether these scores say much about AI's real-world usefulness is up for debate, of course, but Singh argues they are a strong signal.
“The current leaderboard reflects quite closely my own experience of using these models, which is unlike a lot of pure text benchmarks,” Singh said. “Maybe [MC-Bench] could be useful to companies for knowing whether they're heading in the right direction.”