Physical Intelligence, the two-year-old San Francisco-based robotics startup that has quietly become one of the Bay Area’s most closely watched AI companies, published new research Thursday showing that its latest model can direct robots to perform tasks they’ve never been explicitly trained to do, a skill the company’s own researchers say caught them by surprise.
The new model, called π0.7, represents what the company describes as an early but essential step toward the coveted goal of a general-purpose robot brain: a single model that can be pointed at an unfamiliar task, instructed in plain language, and expected to carry it out. If the findings hold up to scrutiny, they suggest that robotic AI may be approaching a tipping point similar to the one the field saw with large language models, where capabilities begin to combine in ways that go beyond what the underlying data seem to predict.
The key claim in the paper is compositional generalization: the ability to combine skills learned in different contexts to solve problems the model has never encountered. Until now, the standard approach to training robots has essentially been rote memorization: collect data for a specific task, train a specialized model on that data, then repeat for each new task. π0.7, Physical Intelligence says, breaks this pattern.
“Once you get past that threshold where you go from doing exactly the things you’re collecting the data for to mixing things up in new ways,” says Sergey Levine, co-founder of Physical Intelligence and a professor at UC Berkeley who focuses on artificial intelligence for robotics, “the capabilities grow much more than linearly with the amount of data, as we’ve seen in other domains, such as language and vision.”
The paper’s most striking demonstration involves an air fryer that the model had virtually never seen in training. When the research team investigated, they found only two relevant episodes in the entire training dataset: one in which a different robot simply pushed a fryer closed, and one from an open-source dataset in which another robot placed a plastic bottle inside a fryer following a person’s instructions. The model had somehow synthesized these fragments, along with broader web-based pre-training data, into a working understanding of how the device operates.
“It’s very difficult to pinpoint where knowledge comes from or where it will succeed or fail,” says Lucy Shi, a Physical Intelligence researcher and Ph.D. student. Still, with no task-specific training, the model made a credible attempt to use the device to cook a sweet potato. With step-by-step verbal instructions (essentially, a human walking the robot through the task the way you might explain something to a new employee), it completed the task successfully.
This ability to take guidance matters because it suggests that robots could be deployed in new environments and improved in real time, without additional data collection or model retraining.
So what does this all mean? The researchers are not shy about the model’s limitations and are careful not to get ahead of themselves. In at least one instance, they point the finger right at their own team.
“Sometimes the failure mode is not in the robot or the model,” says Shi. “It’s in us. We’re not good at prompt engineering.” She describes an early air fryer experiment that had a 5% success rate. After the team spent about half an hour refining how the task was explained to the model, the rate rose to 95%, she says.
The model also cannot yet autonomously perform complex multi-step tasks from a single high-level command. “You can’t tell it, ‘Hey, go make me some toast,’” says Levine. “But if you walk it through it, ‘for the toaster, open this part, press this button, do this,’ then it actually tends to work really well.”
The team also acknowledges that there are no real standardized benchmarks in robotics, making it difficult to externally validate the claims. Instead, the company measured π0.7 against its own previous specialized models, custom-built systems each trained on an individual task, and found that the general model matched their performance on a range of complex tasks, such as making coffee, folding clothes, and assembling boxes.
What may be most remarkable about the research, if you ask the researchers themselves, is not any single demonstration but the extent to which the results surprised them: people whose job it is to know exactly what’s in the training data, and therefore what the model should and shouldn’t be able to do.
“My experience has always been that when I know deeply what’s in the data, I can just guess what the model will be able to do,” says Ashwin Balakrishna, a researcher at Physical Intelligence. “I’m rarely surprised. But the last few months were the first time I’ve been really surprised. I just randomly bought a set of gears and asked the robot, ‘Hey, can you turn this tool?’ And it just worked.”
Levine recalls the moment researchers first encountered GPT-2 producing its famous story about unicorns in the Andes. “Where the hell did it learn about unicorns in Peru?” he says. “That’s such a strange combination. And I think to see that in robotics is really special.”
Of course, critics will point to an awkward asymmetry here: language models have had the entire Internet to learn from. Robots don’t, and no amount of clever prompting fully closes that gap. But when asked where he expects skepticism, Levine points elsewhere.
“The criticism that can always be leveled at any demonstration of robotic generalization is that the tasks are kind of boring,” he says. “The robot isn’t doing backflips.” He pushes back against this framing, arguing that the distinction between an impressive robot demonstration and a robotic system that actually generalizes is precisely the point. Generalization, he suggests, will always look less dramatic than a carefully choreographed stunt, but it’s far more useful.
The paper itself uses cautious, hedged language, describing π0.7 as showing “early signs” of generalization and “initial demonstrations” of new capabilities. These are research results, not a finished product.
When asked directly when a system based on these findings might be ready for real-world deployment, Levine declines to speculate. “I think there’s good reason to be optimistic, and it’s certainly moving faster than I expected a few years ago,” he says. “But it is very difficult for me to answer that question.”
Physical Intelligence has raised over $1 billion to date and was recently valued at $5.6 billion. A significant part of the investor excitement surrounding the company can be traced to Lachy Groom, a co-founder who spent years as one of Silicon Valley’s most notable angel investors, backing Figma, Notion, and Ramp, among others, before deciding that Physical Intelligence was the company he was looking for. That pedigree helped the startup attract serious institutional money, even though it declined to offer investors a commercialization timeline.
The company is now said to be in talks for a new round that would nearly double that valuation, to $11 billion. The company declined to comment.
