Surveys have been used to gain insights about populations, products, and public opinion since time immemorial. And while methodologies may have changed over the millennia, one thing has remained constant: The need for people, lots of people.
But what if you can’t find enough people to create a large enough sample group to generate meaningful results? Or, what if you could potentially find enough people, but budget constraints limit the number of people you can find and interview?
This is where Fairgen wants to help. The Israeli startup today launched a platform that uses “statistical artificial intelligence” to create synthetic data that it says is just as good as the real thing. The company is also announcing a new round of $5.5 million in funding from Maverick Ventures Israel, The Creator Fund, Tal Ventures, Ignia and a handful of angel investors, bringing its total cash raised since inception to $8 million.
“Fake Data”
Data may be the lifeblood of AI, but it has also long been the cornerstone of market research. So when the two worlds collide, as they do in Fairgen’s world, the need for quality data becomes a little more pronounced.
Fairgen was founded in Tel Aviv, Israel in 2021 and previously focused on addressing bias in artificial intelligence. But in late 2022, the company turned to a new product, Fairboost, which is now in beta.
Fairboost promises to “boost” a smaller data set to up to three times its size, enabling more detailed insights into segments that might otherwise be too difficult or expensive to reach. Using this, companies can train a deep machine learning model on each data set they upload to the Fairgen platform, with the AI learning statistical patterns across the various survey segments.
The concept of “synthetic data” (data that is artificially generated rather than collected from real-world events) is not novel. Its roots go back to the early days of computing, when it was used to test software and algorithms and simulate processes. But synthetic data as we understand it today has taken on a life of its own, particularly with the advent of machine learning, where it is increasingly being used to train models. Artificially generated data that contains no sensitive information can address both data scarcity and data privacy issues.
Fairgen is the latest startup to put synthetic data to the test, with market research as its primary focus. It’s worth noting that Fairgen doesn’t generate data out of thin air or throw millions of historical surveys into an AI-powered melting pot. Market researchers still need to run a survey of a small sample of their target market, and from that, Fairgen identifies patterns to expand the sample. The company says it can guarantee at least a twofold boost on the original sample, and that on average it achieves a threefold boost.
In this way, Fairgen may be able to establish that someone of a certain age and/or income level is more likely to answer a question in a certain way. Or it may combine any number of data points to extrapolate from the original data set. Essentially, it is about generating what Fairgen co-founder and CEO Samuel Cohen says are “stronger, more robust data segments, with a lower margin of error.”
“The main realization was that people are becoming more and more diverse; brands have to adapt to that and they have to understand their customer segments,” Cohen explained to TechCrunch. “The segments are very different. Gen Zs think differently than older people. And to be able to have that understanding of the market at the segment level, it costs a lot of money, it takes a lot of time and operational resources. And that is where I knew the pain point was. We knew synthetic data had a role to play there.”
One obvious criticism, which the company admits it has grappled with, is that this all sounds like a huge shortcut to avoid going out into the field, interviewing real people, and gathering real opinions.
Surely any underrepresented group should be concerned that their real voices are being replaced by fake voices?
“Every customer we’ve talked to in the research space has huge blind spots: completely unreachable audiences,” Fairgen’s chief development officer, Fernando Zatz, told TechCrunch. “They’re not actually selling projects because there aren’t enough people available, especially in an increasingly diverse world where there’s a lot of market fragmentation. Sometimes they can’t go to certain countries; they can’t get into certain demographics, so they actually lose out on projects because they can’t reach their quotas. They have a minimum number [of respondents], and if they don’t reach that number, they don’t sell the information.”
Fairgen is not the only company applying generative AI to market research. Qualtrics last year said it was investing $500 million over four years to bring generative artificial intelligence to its platform, though with a substantive focus on qualitative research. However, it’s further proof that synthetic data is here, and here to stay.
But validation of the results will play an important role in convincing people that this is the real deal and not some cost-cutting measure that will produce suboptimal results. Fairgen does this by comparing a “real” sample boost to a “synthetic” sample boost — it takes a small sample of the data set, extrapolates it, and puts it side-by-side with the real thing.
“With every customer we sign up, we do this exact same kind of testing,” Cohen said.
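The side-by-side comparison described above can be illustrated with a minimal sketch. It is not Fairgen’s test procedure, which the article does not detail: the helper names (`answer_shares`, `max_share_gap`) are hypothetical, and the three samples are invented toy numbers that say nothing about how well any real product performs. The idea is only to show what “putting it next to the real thing” could mean for a single question in a single segment.

```python
from collections import Counter


def answer_shares(responses):
    """Return the share of each answer option in a list of responses."""
    counts = Counter(responses)
    total = len(responses)
    return {answer: count / total for answer, count in counts.items()}


def max_share_gap(reference, candidate):
    """Largest absolute difference in answer share between two samples."""
    ref = answer_shares(reference)
    cand = answer_shares(candidate)
    options = set(ref) | set(cand)
    return max(abs(ref.get(o, 0.0) - cand.get(o, 0.0)) for o in options)


# Invented toy data: one yes/no question, one segment.
real_full = ["yes"] * 90 + ["no"] * 60        # larger real sample (150 people)
real_small = ["yes"] * 33 + ["no"] * 17       # small real sample (50 people)
synthetic_boost = ["yes"] * 91 + ["no"] * 59  # hypothetical boosted output (150 rows)

# How far is each candidate from the larger real sample?
gap_small = max_share_gap(real_full, real_small)
gap_boosted = max_share_gap(real_full, synthetic_boost)
print(f"small real sample gap: {gap_small:.3f}")
print(f"boosted sample gap:    {gap_boosted:.3f}")
```

In practice such a check would be repeated across many questions and segments, since a single comparison can look good or bad by chance.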
Statistically speaking
Cohen has a master’s degree in statistical science from the University of Oxford and a PhD in machine learning from UCL London, part of which included a nine-month stint as a researcher at Meta.
Another of the company’s co-founders is chairman Benny Schnaider, who previously worked in the enterprise software space, with four exits to his name: Ravello to Oracle for $500 million in 2016; Qumranet to Red Hat for $107 million in 2008; P-Cube to Cisco for $200 million in 2004; and Pentacom to Cisco for $118 million in 2000.
And then there is Emmanuel Candès, professor of statistics and electrical engineering at Stanford University, who serves as Fairgen’s chief scientific advisor.
This business and mathematical backbone is a major selling point for a company trying to convince the world that fake data can be just as good as real data if implemented correctly. It is also how the company can clearly explain the thresholds and limitations of its technology, namely how large the samples need to be to achieve the optimal boosts.
According to Cohen, a survey ideally needs at least 300 real respondents, and from this Fairboost can boost a segment that makes up no more than 15% of the wider survey.
“Under 15%, we can guarantee an average 3x boost after validating it with hundreds of parallel tests,” Cohen said. “Statistically, gains are less dramatic above 15%. The data already shows good levels of confidence and our synthetic respondents can only potentially match or marginally increase it. Business wise, there is also no pain point above 15%; brands can already gain insights from these groups. They’re just stuck at the specialized level.”
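To make those thresholds concrete, here is a back-of-the-envelope sketch using the standard margin-of-error formula for a proportion at 95% confidence. It assumes the worst case of a 50/50 answer split and, as a strong simplification that is ours rather than Fairgen’s, treats every synthetic row as if it carried the same weight as a real respondent. The variable names are illustrative only.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Margin of error for a proportion p with sample size n (z=1.96 ~ 95%)."""
    return z * math.sqrt(p * (1 - p) / n)


survey_size = 300      # minimum survey size cited by Cohen
segment_share = 0.15   # largest segment share cited for the boost
boost_factor = 3       # average boost cited by Cohen

segment_n = round(survey_size * segment_share)  # real respondents in the segment
boosted_n = segment_n * boost_factor            # rows after a 3x boost

moe_before = margin_of_error(segment_n)
moe_after = margin_of_error(boosted_n)
print(f"{segment_n} respondents: +/-{moe_before:.1%}")
print(f"{boosted_n} respondents: +/-{moe_after:.1%}")
```

Under these assumptions, a 45-person segment has a margin of error of roughly 14.6 percentage points, and tripling it to 135 rows would bring that down to roughly 8.4. Tripling the sample narrows the margin by a factor of about 1.73 (the square root of 3), not 3, and only if the synthetic rows are in fact as informative as real ones.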
The no-LLM factor
It’s worth noting that Fairgen does not use large language models (LLMs), and its platform does not produce “plain English” responses à la ChatGPT. The reason for this is that an LLM will use lessons learned from a myriad of other data sources outside of the study setting, which increases the chances of introducing bias that is incompatible with quantitative research.
Fairgen deals with statistical models and tabular data, and its training is solely based on the data contained in the uploaded dataset. This effectively allows market researchers to create new and synthetic respondents by extrapolating from adjacent segments of the survey.
“We don’t use any LLMs for a very simple reason, which is that if we trained on a lot of [other] surveys, it would just be spreading misinformation,” Cohen said. “Because you would have cases where it learned something in another survey, and we don’t want that. It’s all about reliability.”
In terms of business model, Fairgen is sold as SaaS, with companies uploading their surveys in any structured format (.CSV or .SAV) to Fairgen’s cloud-based platform. According to Cohen, it takes up to 20 minutes to train the model on the survey data provided, depending on the number of questions. The user then selects a “segment” (a subset of respondents who share certain characteristics), e.g. “Gen Z working in industry x,” and Fairgen then delivers a new file structured the same way as the original training file, with exactly the same questions, just new rows.
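For readers who think in terms of files, the sketch below shows what “same structure, same questions, just new rows” might look like from the customer’s side. It does not call any Fairgen API, which the article does not describe. The column names, the toy CSV contents, and the helpers `read_rows`, `select_segment` and `check_boosted_file` are all hypothetical.

```python
import csv
import io


def read_rows(csv_text):
    """Parse CSV text into (column names, list of row dicts)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return reader.fieldnames, list(reader)


def select_segment(rows, **criteria):
    """Keep only rows whose columns match every given value."""
    return [r for r in rows if all(r[k] == v for k, v in criteria.items())]


def check_boosted_file(original_text, boosted_text):
    """Return (same_schema, number_of_rows_in_boosted_file)."""
    orig_cols, _ = read_rows(original_text)
    boost_cols, boost_rows = read_rows(boosted_text)
    return orig_cols == boost_cols, len(boost_rows)


# Invented toy survey: two profile columns and one question.
original = (
    "age_group,industry,q1\n"
    "gen_z,retail,yes\n"
    "gen_z,retail,no\n"
    "millennial,tech,yes\n"
)
# Invented stand-in for a delivered file: same columns, new rows.
boosted = (
    "age_group,industry,q1\n"
    "gen_z,retail,yes\n"
    "gen_z,retail,yes\n"
    "gen_z,retail,no\n"
)

_, rows = read_rows(original)
segment = select_segment(rows, age_group="gen_z", industry="retail")
same_schema, new_rows = check_boosted_file(original, boosted)
print(len(segment), same_schema, new_rows)
```

A check like this only confirms that the delivered file lines up with the original column by column. It says nothing about whether the new rows are statistically sound.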
Fairgen is being used by BVA and by French polling and market research company IFOP, which have already incorporated the startup’s technology into their services. IFOP, which is a bit like Gallup in the US, is using Fairgen for polling purposes in the European elections, although Cohen believes it may end up being used in the US elections later this year.
“IFOP is basically our stamp of approval, because they’ve been around for 100 years,” Cohen said. “They validated the technology and were our initial design partner. We’re also testing or already integrating with some of the biggest market research companies in the world, which I’m not allowed to talk about yet.”