Long before most of us were thinking about large language models, DataCebo co-founders Kalyan Veeramachaneni and Neha Patki were building an open source library called Synthetic Data Vault, or SDV for short. The company’s roots date back to 2018, when both were working at the MIT Data Lab. They had the idea that in addition to generating text, images and code, you could also generate data with generative artificial intelligence.
For companies that need to use quality business data in large language models (and for other purposes), but that can’t necessarily use PII to do so, this is an interesting idea. Today, after spending a few years building a commercial enterprise version of SDV, the company emerged with $8.5 million in seed funding.
This ability to generate synthetic data from relational and tabular databases is what sets the company apart from other generative AI tools, says CEO Veeramachaneni. “Our software allows our customers to build a custom AI production model on prem. And then they can use that synthetic data for different use cases,” he told TechCrunch. This could work in healthcare, financial services, or anywhere it is imperative to hide sensitive data for testing and model-building purposes.
He says companies have traditionally had to create synthetic data manually, an extremely tedious process that is difficult to scale and prone to errors. With generative AI handling the problem, you simply describe the kind of data you need, the software examines the characteristics of the real data set, and it then creates a high-quality fake set for testing purposes without exposing sensitive information.
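To make the idea of examining a real data set and then generating a fake one concrete, here is a minimal sketch in plain Python. It is not SDV's algorithm or API: the function names `fit_column_profiles` and `sample_synthetic_rows`, the sample columns, and the per-column modeling are all assumptions made for illustration. It treats every column independently, whereas real tools also model relationships between columns and tables.

```python
import random
import statistics


def fit_column_profiles(rows):
    """Learn a simple per-column profile from real rows (a list of dicts).

    Hypothetical helper for illustration only, not part of SDV.
    """
    profiles = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        if all(isinstance(v, (int, float)) for v in values):
            # Numeric column: keep only the mean and standard deviation.
            profiles[column] = ("numeric", statistics.mean(values), statistics.pstdev(values))
        else:
            # Categorical column: keep the observed values so that sampling
            # follows their frequencies. This reuses real values, so direct
            # identifiers (names, emails) should be removed beforehand.
            profiles[column] = ("categorical", values)
    return profiles


def sample_synthetic_rows(profiles, n, seed=0):
    """Draw n fake rows from the learned profiles, one column at a time."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n):
        row = {}
        for column, profile in profiles.items():
            if profile[0] == "numeric":
                _, mean, std = profile
                row[column] = rng.gauss(mean, std)
            else:
                row[column] = rng.choice(profile[1])
        synthetic.append(row)
    return synthetic


# Made-up "real" data containing no direct identifiers.
real_rows = [
    {"age": 34, "balance": 1200.50, "plan": "basic"},
    {"age": 51, "balance": 8300.00, "plan": "premium"},
    {"age": 27, "balance": 430.25, "plan": "basic"},
    {"age": 45, "balance": 5100.75, "plan": "premium"},
]

profiles = fit_column_profiles(real_rows)
synthetic_rows = sample_synthetic_rows(profiles, n=5)
```

The synthetic rows share the column names and rough statistical shape of the real ones, but none of them is a copy of a real record. Because the columns are sampled independently, any link between, say, age and balance is lost, which is the kind of structure a production tool would need to preserve.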
The founders started by creating an open source tool, a tool that proved extremely popular and helped them test the various key pieces of software. “We’ve had over a million downloads and a lot of people who are active in our community,” said VP of Product Patki. In fact, they have a Slack channel with over a thousand people participating.
“And through that, I think we first get a lot of validation of our core algorithms. We’re confident that it works, and if there’s a bug or anything, public open source users find it right away and we’re able to address any issues,” Patki said.
The big difference between the open source version and the commercial enterprise version is scale. The enterprise version can handle up to a hundred tables, while the open source version is designed to handle only a few. So far, customers have been building models based on 20 to 30 tables.
The company currently has 11 employees and plans to hire over the next year to reach around 20, depending on how the business grows.
The startup’s seed funding of $8.5 million was led by Link Ventures and Zetta Venture Partners with participation from Uncorrelated Ventures.
