So-called reasoning AI models are getting easier — and cheaper — to develop.
On Friday, NovaSky, a team of researchers based at UC Berkeley's Sky Computing Lab, released Sky-T1-32B-Preview, a reasoning model that competes with an earlier version of OpenAI's o1 on a number of key benchmarks. Sky-T1 appears to be the first truly open source reasoning model in the sense that it can be replicated from scratch; the team released both the dataset used to train it and the necessary training code.
“Remarkably, Sky-T1-32B-Preview was trained for less than $450,” the team wrote in a blog post, “showing that it is possible to reproduce high-level reasoning abilities economically and efficiently.”
$450 might not sound that affordable. But it was not long ago that training a model with comparable performance often cost millions of dollars. Synthetic training data, that is, training data generated by other models, has helped drive costs down. Palmyra X 004, a model recently released by the AI company Writer and trained almost entirely on synthetic data, reportedly cost just $700,000 to develop.
Unlike most AI models, reasoning models effectively fact-check themselves, which helps them avoid some of the pitfalls that commonly trip up models. Reasoning models take slightly longer, typically seconds to minutes longer, to arrive at solutions than a typical non-reasoning model. The upside is that they tend to be more reliable in domains like physics, science, and math.
The NovaSky team says it used another reasoning model, Alibaba's QwQ-32B-Preview, to generate the initial training data for Sky-T1, then “curated” the data mixture and used OpenAI's GPT-4o-mini to reformat the data into a more workable form. Training the 32-billion-parameter Sky-T1 took about 19 hours on a rack of eight Nvidia H100 GPUs. (Parameters roughly correspond to a model's problem-solving abilities.)
According to the NovaSky team, Sky-T1 outperforms an early preview version of o1 on MATH500, a collection of “competition-level” math challenges. The model also beats the o1 preview on a set of difficult problems from LiveCodeBench, a coding benchmark.
However, Sky-T1 falls short of the o1 preview on GPQA-Diamond, which contains physics, biology, and chemistry questions that a PhD graduate would be expected to know.
It is also important to note that OpenAI's GA release of o1 is a more powerful model than the o1 preview, and that OpenAI is expected to release an even better-performing reasoning model, o3, in the coming weeks.
But the NovaSky team says Sky-T1 marks only the beginning of their journey to develop open source models with advanced reasoning capabilities.
“Moving forward, we will focus on developing more efficient models that maintain strong reasoning performance and exploring advanced techniques that further improve the efficiency and accuracy of the models at test time,” the team wrote in the post. “Stay tuned as we make progress on these exciting initiatives.”