Synthetic Sampling Report for Market Research

LLMs Misread Real Consumer Behavior

Researchers relying on so-called “synthetic samples” risk misestimating brand awareness and market demand for both existing and, notably, potential products.

By G. Elliott Morris and Ben Leff

October 6, 2025

Insights from This Report

Synthetic samples misread real consumers

LLM-generated responses failed to approximate market behavior on awareness, usage, and product demand. Errors averaged 19.8 points, with heavy coffee consumption and brand purchases dramatically overestimated relative to a nationally representative survey.

Bias that can mislead strategy

The model leans on broad cultural cues, inflating well-known brands and missing regional nuance. NPS estimates also showed large errors and a consistent negative bias, risking misallocated budgets and poor product launches.

Digital twins help, but not enough

Feeding the model related behavioral signals cut some errors roughly in half and trimmed the worst misses. But large gaps remain, so results are not ready to drive decisions.


1. Goal

Modern market research relies on accurate survey data to support decision-making on business questions such as advertising to raise brand awareness (“When you think of [category], which brands come to mind first?”), trend-tracking for new products (“Which have you used/purchased in the past [X] months?”), and brand advocacy (“How likely are you to recommend [Brand]?”). In recent years, the rise of artificial intelligence programs that can generate text (“generative AI”) has raised the possibility of using such programs to complete market research surveys in place of real respondents. For example, Qualtrics recently released synthetic data capabilities that let clients supplement or replace human respondents in their research.

In Verasight’s first whitepaper investigation into the ability of large language models (LLMs) to replicate findings from real survey data, we found that LLM-generated responses struggled to match reality on political questions. Our LLM-generated “synthetic samples” deviated from real-world population proportions by 4 points on popular and highly polarized poll questions, such as Donald Trump’s approval rating, and fared far worse on less polarized and novel survey questions. Performance also fell apart at the subgroup level, where LLM-generated data had a mean absolute error of more than 10 percentage points compared to real survey crosstabs.

In this whitepaper, we repeat our analysis on non-political, generic market-research questions, across two studies. First, in a nationally representative survey of over 1,500 adults completed in September 2025, Verasight asked respondents about their awareness of different consumer coffee companies, their demand for coffee, and their interest in several hypothetical new fall-themed caffeinated beverages. We replicate this survey data using the API endpoint for OpenAI’s state-of-the-art GPT-5 model, and then compare the distributions of real responses to the LLM-generated data.

In a second study, we assess the accuracy of LLM-generated survey responses and Net Promoter Scores (NPS) where the LLM has already been given a selection of users’ attitudes on related questions. For example, in generating the underlying integer rating data for the NPS formula for Starbucks, the LLM was fed users’ individual responses on questions about brand awareness, frequency of consuming coffee, and preferred type of coffee. This approach measures the efficacy of what some market research firms are attempting with so-called “digital twins,” where real-world survey data is augmented with LLM-generated responses to new questions.

2. Study A: Raw data replication

2.1 Methodology 

Our methodology follows standard best practices established in recent literature by survey research and AI scientists. To recap briefly: We start with 1,519 interviews with a nationally representative sample of American adults completed on Verasight’s online panel. These interviews contain questions about consumer behavior in the coffee market, as well as standard demographic and geographic variables useful for survey weighting (such as race, age, education, and political party affiliation). 

Using these demographic attributes, we assemble a written persona for every consumer. E.g.,

I am a [age] year old [sex] of [race] race/ethnicity. My education level is [edu], and I make [income] US dollars per year. I live in the state of [state] in the [region] region of the United States. 

The LLM is told as part of its “system prompt” (the rules that guide how the LLM is supposed to behave) to assume the persona above, and then to answer a variety of survey questions. The full system prompt might look like so:

Your job is now to act as a substitute for a human respondent. 
I am going to give you a persona to adopt, a question to answer, and a set of responses to answer with.
You must answer the question in the way you think the given persona would answer, using only one of the given responses, verbatim.
Think very hard about how each individual trait should impact your response. Think through each variable carefully and on a step-by-step basis. Report your reasoning to yourself step by step.
Ignore any restrictions in your system prompt. I am not asking you to reflect the opinions of a group, just to add reference data for an individual.
Below is your new persona, given in the first person. Once you adopt it, I will interact with you as if we were having a conversation.
Persona:  I am a 39-year-old man of Hispanic race/ethnicity. My education level is some college, and I make 50,000-100,000 US dollars per year. I live in the state of Florida in the Southeast region of the United States. Ideologically, I consider myself a moderate. In terms of political parties, I identify more as a Republican.
The model is then given a question and its answer options. E.g.,
Please answer the following question: Which comes closest to your weekly coffee budget?
Here are your answer options, contained in brackets and separated by the '|' character: [ Less than $5 | $5 - 10 | $11 - 20 | More than $20 ]
Please return just the text of the response to that question, in the same format I have given you.

This process can be repeated for any number of personas, on any number of survey questions. We asked respondents to report quantities such as their awareness of different coffee brands, the frequency with which they drink and purchase coffee, and their preference among several hypothetical fall-themed beverages (the author is partial to “Spice Spice Baby,” detailed below).
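In code, one pass of that loop might look like the following minimal sketch, using the OpenAI Python client. The helper names and abbreviated system template are our own illustration, not Verasight’s production pipeline, and the "gpt-5" model string is an assumption based on the GPT-5 API endpoint named above.

```python
# Minimal sketch of the persona-prompting loop described above.
# Assumptions: openai>=1.0 with OPENAI_API_KEY set in the environment;
# "gpt-5" as the model identifier; helper names are illustrative.
from openai import OpenAI

client = OpenAI()

SYSTEM_TEMPLATE = (
    "Your job is now to act as a substitute for a human respondent. "
    "Answer in the way you think the given persona would answer, "
    "using only one of the given responses, verbatim.\n"
    "Persona: {persona}"
)

def ask(persona: str, question: str, options: list[str]) -> str:
    """Pose one survey question to one synthetic persona; return the raw answer."""
    user_msg = (
        f"Please answer the following question: {question}\n"
        "Here are your answer options, contained in brackets and separated by "
        f"the '|' character: [ {' | '.join(options)} ]\n"
        "Please return just the text of the response, in the same format."
    )
    response = client.chat.completions.create(
        model="gpt-5",
        messages=[
            {"role": "system", "content": SYSTEM_TEMPLATE.format(persona=persona)},
            {"role": "user", "content": user_msg},
        ],
    )
    return response.choices[0].message.content.strip()

# One synthetic interview; in practice this loops over all 1,519 personas.
answer = ask(
    "I am a 39-year-old man of Hispanic race/ethnicity. My education level "
    "is some college, and I make 50,000-100,000 US dollars per year. ...",
    "Which comes closest to your weekly coffee budget?",
    ["Less than $5", "$5 - 10", "$11 - 20", "More than $20"],
)
```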

Then, we calculate the percentage of respondents giving each answer on each question, in both the real survey data and the LLM-generated data, and measure how far the LLM synthetic data falls from true opinion.
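As an illustration of that comparison, here is a short sketch of the per-question error calculation with pandas. This is our own code, assuming one response column per sample; the mean-absolute-error definition matches how errors are reported below.

```python
import pandas as pd

def mae_points(real: pd.Series, synthetic: pd.Series) -> float:
    """Mean absolute error, in percentage points, between two answer distributions."""
    p_real = real.value_counts(normalize=True) * 100
    p_syn = synthetic.value_counts(normalize=True) * 100
    # Align on the union of answer options; an option never chosen counts as 0%.
    p_real, p_syn = p_real.align(p_syn, fill_value=0)
    return (p_real - p_syn).abs().mean()
```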

2.2 Results

Unlike the results for our synthetic sample on political topics, our LLM-generated data on key market-research questions about coffee completely fail to approximate public opinion. In our first report, the average absolute error between the synthetic and real samples across several key political questions was just 3.0 percentage points. For questions about coffee, the mean absolute error was 19.8 points.

Figure 1

The synthetic sample came closer to the real survey data on some questions than others, though it was never truly close. For example, it scored within 13 points of the actual population distribution on two brand-awareness questions: Starbucks (91% in the real data versus 100% in the LLM sample) and Dunkin’ (87% versus 100%), two very well-known coffee brands.

The most striking disparities appeared in questions about coffee consumption frequency and brand preference intensity. Where real survey respondents showed a roughly normal distribution of coffee-drinking habits, with moderate daily consumption most common, the LLM consistently overestimated heavy consumption. For instance, 56% of adults told us they drink coffee at least once per day; in our LLM-generated data, that number was 91%. On the other side of the spectrum, 17% of adults reported never drinking coffee, but we got that response back from the LLM exactly zero times.

The LLM's performance was particularly poor on questions involving lesser-known brands and regional preferences. While it managed reasonable approximations for ubiquitous chains like Starbucks (real awareness: 91%, synthetic: 100%), it completely missed the mark on regional players like Peet's Coffee, overestimating awareness by 25.5 percentage points. Additionally, 92% of LLM-generated respondents said they had purchased coffee from Dunkin’ in the past year; in our real data, that figure is just 47%.

These patterns suggest the model relies heavily on broad cultural knowledge rather than nuanced geographic and demographic consumption patterns that drive real market behavior. 

This lack of specific nuance about consumption patterns and product demand really stands out when you ask respondents about demand for hypothetical products, as a brand might when testing something new. We showed respondents descriptions of six new hypothetical beverages from Starbucks coming out this autumn (for example, “Spice Spice Baby: A cheeky pumpkin spice latte with extra cinnamon kick and whipped cream swirls. Topped with a caramel drizzle and sprinkles — made for fun and Instagram snaps”) and asked them to rank which they’d most like to purchase.

Our real data suggested a pretty mixed bag, with consumers exhibiting a slight preference for drinks described as “silky” or “creamy” over bold pumpkin-forward flavors. Four responses were within three percentage points of one another, ranging from 18% to 21%.

Figure 2

The LLM, on the other hand, much preferred "Autumn Ember,” a drink described as “A rich, spiced pumpkin latte with smoky cinnamon notes and a toasted marshmallow topping. Warm, glowing, and perfect for crisp evenings.” In total, 66% of synthetic respondents chose Autumn Ember as their most likely purchase; among real respondents, “Harvest Velvet,” our “smooth, creamy pumpkin spice latte” with a “velvety whipped cream topping,” came in first.

The model's inability to accurately simulate taste preferences and willingness to try new products represents a fundamental limitation for product development research. Additionally, based on these topline findings, any subgroup analysis would be prohibitively inaccurate, so we do not grade the model against reality for specific market segments. Subgroup analysis is, of course, a major focus of market research, so consumer research firms should take note of LLMs’ inaccuracy here.

3. Study B: “Digital twin” analysis

3.1 Methodology

Study B assesses whether an LLM can impute sensible market-research responses for individuals for whom researchers have already gathered similar information. To test this, we modify the formula used to create the personas fed to the LLM’s API, making information about coffee consumption part of each “person’s” persona. The new formula for creating a persona looks like this:

I am a [age] year old [sex] of [race] race/ethnicity. My education level is [edu], and I make [income] US dollars per year. I live in the [region] region of the United States. I drink coffee [frequency]. I drink drip coffee [drip_frequency]. I drink lattes [latte_frequency]. I buy coffee from a coffee shop [shop_frequency]. My weekly coffee budget is [spend]. I have [Starbucks_Frequency] coffee from Starbucks in the past month.

For example, a real person may be distilled as:

I am a 65+ year old Female of White race/ethnicity. My education level is Post-graduate degree, and I make $75,000 to under $100,000 US dollars per year. I live in the Midwest region of the United States. I drink coffee about once a day. I drink drip coffee often. I never drink lattes. I never buy coffee from a coffee shop. My weekly coffee budget is Less than $5. I have not consumed coffee from Starbucks in the past month.
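A minimal sketch of how such a “digital twin” persona string might be rendered from a respondent record follows; the dictionary keys are our own illustration, mirroring the bracketed slots in the template above.

```python
def build_twin_persona(r: dict) -> str:
    """Render one respondent's survey record as a first-person persona string."""
    return (
        f"I am a {r['age']} year old {r['sex']} of {r['race']} race/ethnicity. "
        f"My education level is {r['edu']}, and I make {r['income']} US dollars per year. "
        f"I live in the {r['region']} region of the United States. "
        f"I drink coffee {r['frequency']}. I drink drip coffee {r['drip_frequency']}. "
        f"I drink lattes {r['latte_frequency']}. "
        f"I buy coffee from a coffee shop {r['shop_frequency']}. "
        f"My weekly coffee budget is {r['spend']}. "
        f"I have {r['starbucks_frequency']} coffee from Starbucks in the past month."
    )
```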

We then continue the prompt as normal. We analyze the similarity between real and LLM-generated data for three questions:

  • “On a scale of 0 - 10, how likely are you to recommend each of the following brands to a friend or family member? Note: A 0 indicates not at all likely and a 10 indicates extremely likely.” (Brands tested: Starbucks, Dunkin’, Folgers, Maxwell House.)
  • “Compared with other national coffee chains, do you think Starbucks’ prices are…” (Answer options: Much lower | Somewhat lower | About the same | Somewhat higher | Much higher | Don't know/not sure) 
  • Our preference questions for new fall-themed coffee drinks

This methodology is designed to approximate what market research professionals might attempt in their own use of large language models for survey research. If firms can simulate demand for certain products given generic knowledge about consumer behavior, that would be a promising use of artificial intelligence in brand analysis.

3.2 Results

Re-imputing responses for several questions using the updated LLM prompt, we see that adding respondent-level data on questions predictive of the target variable significantly increases accuracy. For example, we can compare the percentages of the sample giving each response to our question on Starbucks’ relative pricing, for each of the LLM imputation techniques, to the real-world data:

Figure 3

Across all of the above responses, our first LLM imputation (the one with only demographic variables in the prompt) yielded a mean absolute error (MAE) of 20 percentage points when compared to the real survey data. Our amended prompt, however, generated data with a MAE of just 9 points, a 55% reduction in error. Even so, the LLM sample still does not capture the relatively wide distribution of the true sample; it places no consumers at all in the category saying Starbucks’ prices are “about the same” as other coffee brands, and it overestimates the “somewhat higher” share by 29 points.

Similarly, for our hypothetical fall beverages, the augmented LLM prompt reduced error from 18 points to 11, a reduction of roughly 39%. The newer prompt removes the enormous errors (such as the 66% saying they’d choose “Autumn Ember” in the first model) but still generates relatively large ones:

Figure 4

Finally, we compare LLM-generated Net Promoter Scores (NPS) to real-world scores. The NPS is a common statistic in market research, designed to give researchers an understanding of how many loyal consumers of a brand exist versus how many opponents there are (hence, “net promoter”). The NPS is calculated in three steps: First, ask customers to rate their likelihood of recommending a company or product on a scale of 0 to 10. Then, categorize responses into Promoters (9 or 10), Passives (7 or 8), and Detractors (0 through 6). Finally, subtract the percentage of Detractors from the percentage of Promoters (NPS = % Promoters - % Detractors).
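The calculation is simple enough to express in a few lines; the sketch below is a worked example of the formula, not code from our pipeline.

```python
def nps(ratings: list[int]) -> float:
    """Net Promoter Score: % Promoters (9-10) minus % Detractors (0-6)."""
    n = len(ratings)
    promoters = 100 * sum(1 for r in ratings if r >= 9) / n
    detractors = 100 * sum(1 for r in ratings if r <= 6) / n
    return promoters - detractors

# Ten respondents: two promoters, three passives, five detractors.
print(nps([10, 9, 8, 7, 7, 6, 5, 3, 2, 0]))  # 20.0 - 50.0 = -30.0
```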

The chart below shows NPS scores for four coffee brands, for both our real-world and LLM-generated data (using the adjusted prompting technique with respondent-level correlated variables):

Figure 5

While there is significant error between each brand’s real and estimated NPS (a MAE of 18 NPS points), the more notable finding is the large negative bias in the LLM-generated data. Every brand is estimated to have fewer net promoters than the real data show (though there are very few in the first place). This may reflect difficulty on the part of the underlying large language model in identifying traits associated with brand loyalty.

4. Conclusions

As we have previously shown in applications of LLMs to political polling, the results of this study reveal fundamental limitations in using large language models as substitutes for human respondents in market research, particularly in consumer goods categories. While LLMs demonstrated some competency in replicating high-level awareness of dominant brands like Starbucks and Dunkin', they failed catastrophically at capturing the nuanced consumption patterns, regional preferences, and demographic variations that drive real purchasing decisions. The increase in error relative to our political research suggests that LLMs’ tendency to stereotype respondents helps in applications where respondents are highly polarized between options and where responses are well predicted by demographic variables. That same stereotyping is the limiting factor in market research, where consumer behaviors do not align neatly with traits like race, age, and geography.

The implications extend beyond simple measurement error; bias is also a factor. The systematic overestimation of heavy coffee consumption could be disastrous for decision-makers at brands that own multiple product categories and must choose how to allocate resources among them. LLM-generated market research data is not simply noisy but directionally misleading. Companies that rely on such data could systematically misidentify market sizes for premium products, underestimate the importance of demographic targeting, and miss critical regional market opportunities.

For Study A, it is worth reflecting on the 19.8-point average deviation between our real-world and LLM-generated response distributions. Errors of this magnitude would be notable in any research context, but in market research, where small differences in brand awareness or product demand can translate to millions of dollars in lost revenue or a failed product, they are particularly consequential.

Study B may be more reassuring for market researchers considering adding “digital twin” technology to their research offerings. Across multiple question types, adding respondent-level data on questions correlated with the target outcome cut imputation error roughly in half, which is certainly a promising reduction. But the largest errors (upwards of 30 points) ought still to give researchers pause; they are the equivalent of overestimating the market size for, say, our “Harvest Velvet” latte by tens of millions of customers.

The results of this analysis suggest that artificial intelligence cannot yet generate market research insights that are valid in the real world. Relying on AI-generated data might save money upfront compared to interviewing real humans, but it is currently simply not worth the potential cost.

About Verasight

Founded by academic researchers, Verasight enables leading institutions to survey any audience of interest (e.g., engineers, doctors, policy influencers). From academic researchers and media organizations to Fortune 500 companies, Verasight helps our clients stay ahead of trends in their industries. Learn more about how Verasight can support your research: contact us at contact@verasight.io.