Fairgen “Improves” Survey Results Using Synthetic Data and AI-Generated Responses | TechCrunch

Surveys have always been used to gain insights into populations, products and public opinion. And while methods may have changed over the millennia, one thing has remained the same: the need for people, and lots of people.

But what if you can’t find enough people to build a sample group large enough to produce meaningful results? Or what if enough people exist, but budget constraints limit how many you can find and interview?

Fairgen would like to help here. The Israeli startup is today launching a platform that uses “statistical AI” to generate synthetic data that it says is as good as the real thing. The company is also announcing a $5.5 million funding round from Maverick Ventures Israel, The Creator Fund, Tal Ventures, Ignia and a handful of angel investors, bringing its total raised since inception to $8 million.

“Fake data”

Data may be the lifeblood of AI, but it has also always been the cornerstone of market research. So when the two worlds collide, as they do in Fairgen’s case, the need for quality data becomes greater still.

Founded in Tel Aviv, Israel in 2021, Fairgen previously focused on combating bias in AI. But at the end of 2022, the company switched to a new product, Fairboost, which it is now releasing out of beta and onto the market.

Fairboost promises to “amplify” a smaller data set by up to three times, enabling more granular insights into niches that might otherwise be too difficult or expensive to reach. To do this, Fairgen trains a deep machine learning model on each data set customers upload to its platform, with statistical AI learning patterns across the different survey segments.

The concept of “synthetic data” – data that is artificially created and does not come from real events – is not new. Its roots go back to the beginnings of computer science, when it was used to test software and algorithms and to simulate processes. But synthetic data as we understand it today has taken on a life of its own, particularly with the advent of machine learning, where it is increasingly being used to train models. We can address both data scarcity issues and privacy concerns by using artificially generated data that does not contain sensitive information.

Fairgen is the latest startup to put synthetic data to the test, and its main focus is market research. It’s worth noting that Fairgen doesn’t produce data out of thin air or throw millions of historical surveys into an AI-powered crucible – market researchers need to conduct a survey on a small sample of their target market, and from that, Fairgen learns patterns to expand the sample. The company says it can guarantee at least a 2x boost over the original sample, and on average a 3x boost.

In this way, Fairgen may be able to determine that someone of a certain age group and/or income level is more likely to answer a question in a certain way. Or combine any number of data points to extrapolate from the original data set. Essentially, it’s about generating what Fairgen co-founder and CEO Samuel Cohen calls “stronger, more robust data segments with a lower margin of error.”
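To make the idea concrete, here is a deliberately simplified sketch of what “expanding a small sample” can look like for tabular survey data. This is our own toy illustration, not Fairgen’s actual model: it just fits the observed distribution of answer patterns within each demographic segment and resamples from it, whereas Fairgen describes a deep model that extrapolates across segments. All function and column names here are invented for the example.

```python
import random
from collections import Counter

def fit_segment_model(rows, segment_key):
    """Count the answer patterns observed within each segment of a small survey."""
    model = {}
    for row in rows:
        seg = row[segment_key]
        # Everything except the segment column is treated as the answer pattern.
        answers = tuple(sorted((k, v) for k, v in row.items() if k != segment_key))
        model.setdefault(seg, Counter())[answers] += 1
    return model

def boost_segment(model, segment, n_synthetic, rng=random):
    """Sample synthetic respondents in proportion to observed answer patterns."""
    counts = model[segment]
    patterns = list(counts.keys())
    weights = list(counts.values())
    return [dict(rng.choices(patterns, weights=weights)[0]) for _ in range(n_synthetic)]

# A tiny fictional survey: 3 real Gen Z respondents, 1 boomer.
survey = [
    {"segment": "gen_z", "q1": "yes"},
    {"segment": "gen_z", "q1": "no"},
    {"segment": "gen_z", "q1": "yes"},
    {"segment": "boomer", "q1": "no"},
]
model = fit_segment_model(survey, "segment")
extra = boost_segment(model, "gen_z", 6)  # a 3x boost of the 3 real Gen Z rows
```

A real system would model correlations between questions and borrow statistical strength from adjacent segments rather than resample within one, which is precisely the part Fairgen treats as its secret sauce.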

“The key takeaway was that people are becoming more diverse – brands need to adapt to that and understand their customer segments,” Cohen told TechCrunch. “Segments are very different – Generation Z thinks differently than older people. And to be able to have this market understanding at the segment level, it costs a lot of money, requires a lot of time and operational resources. And that’s when I realized that was the pain point. We knew that synthetic data had a role to play in this.”

One obvious criticism – one the company has grappled with – is that this all sounds like a massive shortcut around going into the field, interviewing real people and gathering real opinions.

Surely every underrepresented group should be concerned that their real voices are being replaced by, well, fake voices?

“Every single customer we’ve spoken to in research has huge blind spots – audiences that are absolutely difficult to reach,” Fernando Zatz, head of growth at Fairgen, told TechCrunch. “They don’t actually sell projects because there aren’t enough people available, especially in an increasingly diverse world with strong market segmentation. Sometimes they cannot enter certain countries; they can’t cater to certain populations, so they actually lose projects because they can’t meet their quotas. They have a minimum number [of respondents], and if they don’t reach that number, they don’t sell the insights.”

Fairgen is not the only company using generative AI in market research. Qualtrics announced last year that it was investing $500 million over four years to bring generative AI to its platform, but with a significant focus on qualitative research. However, it is further evidence that synthetic data is here to stay.

But validating the results will play an important role in convincing people that this is a reliable approach and not a cost-cutting measure that leads to suboptimal results. Fairgen does this by comparing a “real” sample boost with a “synthetic” one – it takes a small subsample of the data set, extrapolates it, and compares the result to reality.
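That validation procedure can be sketched in a few lines. This is an assumed reconstruction of the general idea, not Fairgen’s code: hold out the full real segment as ground truth, boost a small subsample (here with naive resampling as a stand-in for the real model), and measure how far the boosted estimate drifts from the real one.

```python
import random

def proportion(rows, question, answer):
    """Share of respondents giving a particular answer."""
    return sum(1 for r in rows if r[question] == answer) / len(rows)

rng = random.Random(0)  # fixed seed so the toy run is reproducible

# Ground truth: 1,000 real respondents, roughly 60% answering "yes".
real = [{"q1": "yes" if rng.random() < 0.6 else "no"} for _ in range(1000)]

small = rng.sample(real, 100)                # the small sample a client can afford
boosted = small + rng.choices(small, k=200)  # naive 3x "boost" by resampling

# How far is the boosted estimate from the full real sample's estimate?
gap = abs(proportion(boosted, "q1", "yes") - proportion(real, "q1", "yes"))
```

A vendor would run this comparison across many questions and segments; a small, stable `gap` is what gives buyers confidence that the synthetic respondents track reality.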

“We do the exact same test on every single customer we enroll,” Cohen said.

Statistically speaking

Cohen holds an MSc in statistical science from the University of Oxford and a PhD in machine learning from UCL in London, and spent nine months as a research scientist at Meta.

One of the company’s co-founders is Benny Schnaider, who previously worked in enterprise software and has four exits behind him: Ravello to Oracle in 2016 for a reported $500 million; Qumranet to Red Hat in 2008 for $107 million; P-Cube to Cisco in 2004 for $200 million; and Pentacom to Cisco in 2000 for $118 million.

And then there’s Emmanuel Candès, a professor of statistics and electrical engineering at Stanford University, who serves as Fairgen’s chief scientific advisor.

This business and mathematical backbone is a key selling point for a company that wants to convince the world that fake data can be just as good as real data when used correctly. This also allows them to clearly explain the thresholds and limitations of their technology – how large the samples need to be to achieve the optimal boosts.

According to Cohen, Fairboost ideally needs at least 300 real respondents per survey, and it can boost segments that make up no more than 15% of the broader survey.

“Below 15%, we can guarantee an average threefold increase after validating it with hundreds of parallel tests,” Cohen said. “Statistically speaking, increases above 15% are less dramatic. The data already has a good level of confidence, and our synthetic respondents can only match it or provide a small increase. From a business perspective, there is no pain point above 15% either – brands can already learn from these groups; they’re just stuck at the niche level.”
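The arithmetic behind these thresholds is standard survey statistics, which we can illustrate (this is our back-of-envelope calculation, not Fairgen’s published math): the 95% margin of error for a proportion p estimated from n respondents is roughly 1.96·√(p(1−p)/n), so an effective sample three times larger shrinks the margin of error by √3 ≈ 1.73 – assuming the synthetic respondents are as informative as real ones.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a proportion p from n respondents."""
    return z * math.sqrt(p * (1 - p) / n)

moe_small = margin_of_error(0.5, 100)    # a niche segment of 100 respondents
moe_boost = margin_of_error(0.5, 300)    # the same segment after a 3x boost
ratio = moe_small / moe_boost            # shrinks by sqrt(3) ≈ 1.73
```

At p = 0.5 and n = 100 the margin of error is about ±9.8 points; a genuine 3x boost would bring it down to about ±5.7. The open question a buyer should ask is how close synthetic respondents come to that ideal.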

The no-LLM factor

It’s worth noting that Fairgen doesn’t use large language models (LLMs), and its platform doesn’t generate “plain English” responses à la ChatGPT. This is because an LLM would draw on insights from countless other data sources outside the parameters of the study, increasing the likelihood of introducing biases inconsistent with quantitative research.

Fairgen is all about statistical models and tabular data, and training is based solely on the data contained in the uploaded dataset. This allows market researchers to effectively generate new and synthetic respondents by extrapolating from adjacent segments of the survey.

“We don’t use LLMs for a very simple reason: if we pre-trained on a lot of [other] polls, it would just carry over misinformation,” Cohen said. “Because there are cases where something was learned from another survey, and we don’t want that. It’s about reliability.”

In terms of business model, Fairgen is sold as SaaS: companies upload their surveys to Fairgen’s cloud-based platform in a structured format (.CSV or .SAV). According to Cohen, it takes up to 20 minutes to train the model on the survey data provided, depending on the number of questions. The user then selects a “segment” (a subset of respondents with certain characteristics) – e.g. “Gen Z working in industry.”
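The segment-selection step described above amounts to filtering rows of a structured file by respondent attributes. Here is a minimal sketch using Python’s standard `csv` module; the column names and the `select_segment` helper are our assumptions for illustration, not Fairgen’s schema or API.

```python
import csv
import io

# A tiny fictional survey export in the kind of structured format mentioned (.CSV).
SAMPLE_CSV = """respondent_id,generation,sector,q1
1,gen_z,industry,yes
2,gen_z,services,no
3,millennial,industry,yes
4,gen_z,industry,no
"""

def select_segment(csv_text, **criteria):
    """Return the rows matching every attribute in criteria, e.g. a niche segment."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [r for r in rows if all(r[k] == v for k, v in criteria.items())]

# e.g. the "Gen Z working in industry" segment from the article.
segment = select_segment(SAMPLE_CSV, generation="gen_z", sector="industry")
```

In a real workflow this filtered subset is what would then be handed to the boosting model, with the synthetic rows presumably flagged so researchers can tell them apart from real respondents.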

Fairgen is used by BVA and the French polling and market research company IFOP, which have already integrated the startup’s technology into their services. IFOP, which is a bit like Gallup in the US, is using Fairgen for polling purposes in the European elections, although Cohen believes it could also be used in the US elections later this year.

“IFOP is basically our stamp of approval, because they’ve been around for about 100 years,” Cohen said. “They validated the technology and were our original design partner. We are also testing or already working with some of the largest market research companies in the world, which I’m not allowed to talk about yet.”
