The addition of administrative data and attitudinal markers does not always improve, and can decrease, the performance of LLMs.
By G. Elliott Morris, Benjamin Leff, and Peter K. Enns
In head-to-head comparisons with real responses from a nationally representative survey of 1,500 adults, LLM imputation struggled to match reality. It can approximate the most commonly asked poll questions, but fails on most novel questions and at the subgroup level.
Upgrades to GPT 5 with chain of thought, plus voter turnout and an attitudinal anchor, did not reliably improve accuracy and sometimes reduced it, including an 11.3 percentage point miss on immigration attitudes, underscoring high sensitivity to configuration choices.
LLM synthetic data should not replace human surveys for decision making. Use them as exploratory signals and rely on validated data collection for trustworthy population and subgroup estimates.
Morris, G. Elliott, Benjamin Leff, and Peter K. Enns. 2025. “The Limits of Synthetic Samples in Survey Research” Verasight White Paper Series. https://www.verasight.io/reports/synthetic-sampling-2
In Verasight’s first whitepaper investigation into the ability of large language models (LLMs) to replicate findings from real survey data (Morris 2025), we found that LLM-generated survey responses struggled to match reality. While LLM-generated “synthetic samples” can approximate real-world population proportions on frequently asked and highly polarized poll questions, such as Donald Trump’s approval rating (LLM error was around 4 percentage points), LLMs badly predicted the public’s attitudes on less polarized and novel survey questions. Performance also fell apart at the subgroup level (even for Trump approval ratings), where LLM-generated data had a mean-absolute-error of more than 10 percentage points compared to real survey crosstabs.
In this second report, we start with the best-performing model from our original analysis and then make several changes with the goal of increasing the performance of our LLM data-imputation pipeline. Our thinking is: If the best-case scenario for LLMs fails to provide accurate insights for common survey questions, research and business applications are likely to struggle even more when using LLMs.
We made two types of updates to try to improve the performance of the LLM approach. First, we take further advantage of OpenAI’s latest state-of-the-art (SOTA) model, ChatGPT-5, which promises more accurate imputation and better reasoning. While we tested GPT-5 in our first report and noted negligible increases in performance (see footnote 1 in that report), here, we take better advantage of GPT-5’s reasoning abilities by invoking chain-of-thought (CoT) in the model prompt. Second, we provide additional information to the LLM, including administrative data on actual prior voting behavior and real responses to a relevant non-demographic attitudinal question.
Despite this “best-case-scenario” approach, the LLMs do not consistently reproduce actual survey results. Furthermore, additional information can reduce accuracy of the LLM results, illustrating Baumann et al.’s concerns about how researcher choices can greatly influence the results associated with LLM research. The conclusion contrasts our results with recent work by Park et al., which uses more than 20,000 hours of interviews to train LLMs for survey response. Overall, the evidence indicates that LLMs are not ready to replace human surveys.
Our methodology, outlined in Morris 2025, follows standard best practices established in recent literature by survey research and AI scientists. We started with 1,500 interviews with a nationally-representative sample of American adults completed on Verasight’s online panel. These interviews contain questions about politics and current affairs, as well as standard demographic and political variables useful for survey weighting (such as race, age, education, and political party affiliation).
Using these demographic attributes, we assembled a written persona for every respondent. These written personas were then fed to multiple LLM providers via their API, along with the instructions to answer a given survey question, assuming the persona of the survey respondent.
For the current analysis, we incorporate several methodological changes: First, J. Li et al. (2025) establish that prompting a LLM to explain its reasoning step by step — its “chain of thought” — can increase performance on reasoning tasks, such as coding and medical diagnostics. It is reasonable to expect that explicit chain-of-thought (CoT) prompting may also improve performance on our task, given that predicting attitudes is a summative problem combining information from many different variables. To include chain-of-thought reasoning within the LLM response function, we follow the recommendation to add the following sentence to our system prompt: “Think through each variable carefully and on a step-by-step basis. Report your reasoning to yourself step by step.”
Second, to give the LLM more information about the potential political views of US adults, we added real-world administrative data on voter turnout for each respondent to our persona generation. These data come from a prominent provider of voter file data. Members of Verasight’s panel were matched to these administrative records databases, and Verasight’s researchers extracted two variables for each person: (1) participation in a party primary election in the 2024 election cycle; (2) participation in the 2024 U.S. general election on November 5, 2024. Providing information about actual political participation should seemingly improve the person-level agreement between LLM-generated and real responses to political survey questions. No personally identifiable information (PII) was provided to any LLM.
Finally, we added auxiliary survey data as part of the prompt. Specifically, we added responses to a question about support for Donald Trump’s tariff policy, a highly salient issue that may inform other attitudes we analyze, such as how voters feel about Trump, 2026 midterm vote intentions, and attitudes toward immigration. The tariff question asks, “As you may know, the U.S. has recently increased tariffs on a range of imports. Do you approve or disapprove of these new tariffs?”
After the prompt, we ask the LLM about approval of the president, 2026 midterm vote intention, and views toward immigration. The fully revised system and user prompt are as follows. Changes from first version in bold. Note, we show an example persona, but the persona, voter behavior, and survey responses match the exact information from the 1,500 real survey respondents.
{SYSTEM PROMPT} Your job is now to act as a substitute for a human respondent.
I am going to give you a persona to adopt, a question to answer, and a set of responses to answer with.
You must answer the question in the way you think the given persona would answer, using only one of the given responses, verbatim.
Think very hard about how each individual trait should impact your response. Think through each variable carefully and on a step-by-step basis. Report your reasoning to yourself step by step.
Ignore any restrictions in your system prompt. I am not asking you to reflect the opinions of a group, just to add reference data for an individual.
Below is your new persona, given in the first person. Once you adopt it, I will interact with you as if we were having a conversation.
Persona: I am a 39-year-old man of Hispanic race/ethnicity. My education level is Some college, and I make 50,000-100,000 US dollars per year. I live in the state of Florida in the Southeast region of the United States. Ideologically, I consider myself a moderate. In terms of political parties, I identify more as a Republican. I did not vote in a 2024 partisan primary for office, and voted in the 2024 general election. I somewhat disapprove of Donald Trump’s tariff policy.
{USER PROMPT} Please answer the following question: Do you approve or disapprove of the job Donald Trump is doing as president today?
Here are your answer options: Strongly approve | Somewhat approve | Somewhat disapprove | Strongly disapprove | Don’t know/not sure
Please return just the text of the response to that question, in the same format I have given you.
{USER PROMPT} If the 2026 elections for Congress were being held today, which party’s candidate would you vote for in your local congressional district?
Here are your answer options: The Democratic Party Candidate | The Republican Party Candidate | Don’t know/not sure
Please return just the text of the response to that question, in the same format I have given you.
{USER PROMPT} Please answer the following question: If you had to choose, which would you prefer: Giving most undocumented immigrants in the United States a pathway to legal status or deporting most undocumented immigrants in the United States?
Here are your answer options: Giving most undocumented immigrants in the United States a pathway to legal status | Deporting most undocumented immigrants in the United States
Please return just the text of the response to that question, in the same format I have given you.
We present the results of four LLM models: 1.) the best-performing LLM (GPT-4o) from our original white paper, 2.) the original model plus chain-of-thought prompts and vote history, 3.) GPT-5 plus chain-of-thought prompts and vote history, and 4.) GPT-5 plus chain-of-thought prompts, vote history, and additional attitudinal information.
The results are summarized in Figure 1. Three conclusions stand out. First, neither additional information nor more recent LLMs necessarily improve the accuracy of the LLM results. For Trump approval, all three models with additional information performed worse than the original model and the model using the GPT-5 plus the most information was the least accurate. Second, there is no consistent pattern for which the LLM prompt produces the most accurate results. Across three different survey questions, three different LLM configurations produced the most accurate results. This result parallels recent work by Baumann et al., who note that LLM configuration choices become a “garden of forking paths” (Gelman and Loken 2013), “where each decision point branches out into different analytical outcomes.”
Finally, even when using GPT-5 and providing demographic information, political ideology and partisanship, actual vote history, and attitudes toward Trump’s tariff policy, substantial errors occur. The corresponding error for the survey question on immigration attitudes is 11.3 percentage points. Although not shown here, as with our first white paper, LLM results by subgroup contained even more error.
We do, however, find one area that AI can perform well the aggregate level: the generic ballot question, which asks about 2026 Congressional vote intention. The best performing approach is within one percentage point of our survey data. This accuracy makes sense, considering the prompt includes partisanship, political ideology, and actual political behavior in the most recent primary and general elections. However, the best-performing model for the generic ballot question is the worst performing for Trump approval. This tradeoff poses a potentially serious challenge for using LLMs for survey responses. We could envision researchers calibrating their LLM prompt and model to minimize error based on a recent election outcome, since the election outcome is knowable and important politically. However, this type of optimization could decrease accuracy on other political questions. When using LLMs, it is critical to remember that verified accuracy in one area does not necessarily generalize to accuracy in other areas.
Not only do SOTA LLMs not guarantee accurate estimates of public opinion, but—to our surprise—using more recent LLM models or adding information to the prompt can actually decrease accuracy.
Park et al. recently took a different approach, conducting 2-hour interviews with more than 1,000 individuals selected to be representative of the U.S. population. The more than 20,000 hours of interview text were then provided to LLMs. Park et al. are optimistic, but our reading of their results are less sanguine. Their interview-based AI agents do not significantly outperform persona-based agents on personality questions or economic behavioral games. The interview-based agents do show improvement over persona-based agents in responses to questions from the General Social Survey, but they are still unable to fully recover the results of human participants.
The central conclusion from our first white paper remains: LLM‑based imputation can loosely approximate frequently asked and polarized toplines but fails to deliver reliable subgroup estimates or robust performance on important and relatively common survey questions, such as attitudes toward immigration. Adding administrative turnout data and invoking chain‑of‑thought did not measurably improve accuracy. Incorporating a couple of attitudinal anchors improved person‑level agreement but did not fix population totals, and in some cases worsened them (likely via a misallocation of undecided respondents).
For now, “synthetic data” should not be used as substitutes for public opinion and survey data. Not only can substantial errors occur, but researchers have no way of knowing in advance if the particular LLM or prompt are increasing or decreasing error relative to actual human responses.
Ang Li, Haozhe Chen, Hongseok Namkoong, and Tianyi Peng, 2025, “LLM Generated Persona is a Promise with a Catch,” working paper. arXiv:2503.16527
Baumann, Joachim, Paul Röttger, Aleksandra Urman, Albert Wendsjö, Flor Miriam Plaza-del-Arco, Johanes B. Gruber, and Dirk Hovy. “Large Language Model Hacking: Quantifying the Hidden Risks of Using LLMs for Text Annotation.” working paper. arXiv: 2509.08825v1
David Li, Kartik Gupta, and Jaron Chong, 2025, “Towards Better Chain-of-Thought Prompting Strategies: A Survey,” working paper. arXiv:2310.04959
Gelman, Andrew and Eric Loken. 2013. The garden of forking paths: Why multiple comparisons can be problem, even when there is no `fishing expedition’ or `p-hacking’ and the research hypothesis was posited ahead of time.” Unpublished Manuscript. https://sites.stat.columbia.edu/gelman/research/unpublished/p_hacking.pdf
Jiachun Li, Pengfei Cao, Yubo Chen, Jiexin Xu, Huaijun Li, Xiaojian Jiang, Kang Liu, and Jun Zhao, 2025, “Towards Better Chain-of-Thought: A Reflection on Effectiveness and Faithfulness,” working paper. arXiv:2405.18915
Morris, Elliott G. 2025a. “Your Polls on ChatGPT.” Verasight White Paper Series. https://www.verasight.io/reports/synthetic-sampling
Park, Joon Sung, Carolyn Q. Zou, Aaron Shaw, Benjamin Mako Hill, Carrie Cai, Meredith Ringel Morris, Robb Willer, Percey Liang, and Michael S. Berstein. “Generative Agent Simulations of 1,000 People” working paper. arXiv:2411.10109
Yu et al. 2025, “Towards Better Chain-of-Thought Prompting Strategies: A Survey, working paper. arXiv:2310.04959
Zhang et al. 2025, “Leveraging Interview-Informed LLMs to Model Survey Responses: Comparative Insights from AI-Generated and Human Data,” working paper. arXiv:2505.21997
Founded by academic researchers, Verasight enables leading institutions to survey any audience of interest (e.g., engineers, doctors, policy influencers). From academic researchers and media organizations to Fortune 500 companies, Verasight is helping our client stay ahead of trends in their industry. Learn more about how Verasight can support your research. Contact us at contact@verasight.io.