AI agents are becoming more sophisticated, evolving from answering questions to performing complex multi-step tasks autonomously.
But before model providers and the startups that build these agents can trust them to book travel or perform financial analysis on behalf of users, they want to ensure the agents perform reliably across a wide range of scenarios.
AI labs often use benchmarks to demonstrate their models’ prowess, but a high score, even on an agent-oriented benchmark, doesn’t prove that an AI can correctly complete a variety of complex tasks in the real world.
Patronus AI, a startup founded in 2023 by former Meta AI researchers Anand Kannappan and Rebecca Qian, helps model makers and companies improve their models to do just that by creating simulated digital environments in which to evaluate agent performance.
The San Francisco-based startup is addressing an important problem. Almost every frontier AI lab and many emerging startups are now customers, according to Glenn Solomon, managing director of Notable Capital, who describes demand for the company’s simulated environments as nearly insatiable.
Patronus’ revenue has grown 15x over the past year, fueling significant investor interest. On Thursday, the company announced a $50 million Series B round led by Greenfield Partners, with participation from Notable Capital, Lightspeed, Datadog and Samsung. The round brings the company’s total funding to $70 million.
Patronus uses what it calls “digital world models” to create copies of websites and internal systems. In these environments, agents undergo a post-training stress test using reinforcement learning, which repeatedly rewards successful task completion and punishes errors.
AI labs see great value in these digital simulations because they give agents a chance to try out different, sometimes unpredictable, scenarios. The company compares its approach to how Waymo trained self-driving cars by first building synthetic worlds to test vehicles against rare hazards, such as bad weather or a child running after a ball.
The difference with AI agents is that they tend to take shortcuts, which means they fail to get the job done right. “Patronus is very good at spotting hacks and making sure models are held accountable,” Solomon said.
Patronus currently provides its simulated digital worlds for software engineering and finance, but those are just the beginning, according to Kannappan.
“Today we are very focused on the problems that are verifiable, so the problems that you can check and verify immediately, but there are a lot of areas that are very unverifiable or very difficult to verify,” he said.
Just because these processes are verifiable does not mean they are simple. “We want to be able to actually create the environment where you can run an agent that can run for 10 hours or 10 days or 10 weeks,” Kannappan said.
As for competition, Patronus believes it competes primarily with the internal teams that AI labs have already built to evaluate agent behavior. While human data companies like Mercor and Surge help model makers with reinforcement learning, Patronus works differently by evaluating how agents behave without human input.
