Evaluating ChatGPT’s responses to vaccine-related questions: the impact of question framing on content and quality
Highlight box
Key findings
• ChatGPT demonstrated comparable quality in responses to both supportive and critical vaccine-related questions in Japanese.
• No significant differences were observed in Clarity, Appropriateness, Ambiguity, or Length between the two types of questions.
• Expert reviewers identified potential issues such as bias toward COVID-19 vaccines, insufficient explanation, and occasionally misleading expressions.
What is known and what is new?
• Generative artificial intelligence (AI) tools such as ChatGPT have emerged as new sources of health information, yet their reliability—particularly when faced with negatively framed vaccine questions—has been uncertain.
• This study evaluated ChatGPT’s Japanese responses to vaccine-related questions with both supportive and skeptical framing, as assessed by board-certified pediatric infectious disease specialists. The findings indicate that ChatGPT maintains consistent quality across question framings, though thematic bias and linguistic limitations persist.
What is the implication, and what should change now?
• ChatGPT may serve as a useful tool for providing balanced medical information about vaccines to the public, but continued human oversight is essential. Future development should focus on improving linguistic precision and reducing topical bias to ensure trustworthy AI-based health communication.
Introduction
The decline in vaccination rates poses a serious challenge to global public health. In the United States, failure of adults aged 19 years and older to receive recommended vaccines is estimated to result in an annual economic loss of approximately $9 billion due to vaccine-preventable diseases, about 80% of which is attributable to unvaccinated individuals (1). In recent years, the coverage of the measles, mumps, and rubella (MMR) vaccine has declined in the U.S. (2). A 5% drop in MMR vaccination coverage among children aged 1–11 years has been reported to triple measles incidence and lead to an economic loss of $2.1 million (3).
Such declines in vaccination coverage are closely linked to vaccine hesitancy, which is driven by mistrust, fear, and misinformation regarding vaccines. In 2019, in response to a surge in measles cases, the World Health Organization listed vaccine hesitancy among the “Ten threats to global health” (4). This problem has become even more pronounced during the COVID-19 pandemic, with multiple studies reporting associations between vaccine hesitancy and COVID-19 infection, hospitalization, and even death (5,6).
Addressing vaccine hesitancy requires the provision of accurate, evidence-based information. Clearly communicating robust evidence and scientific consensus is considered essential to improving vaccination uptake (7). However, on social media platforms, negative sentiments toward vaccines tend to cluster and spread, creating an information environment that fosters vaccine hesitancy and increases infectious disease incidence (8). In recent years, in addition to traditional media and social networking services, generative artificial intelligence (AI) based on large language models (LLMs), such as ChatGPT, has emerged as a new means of information delivery. Notably, younger generations have shown a tendency to regard such generative AI tools as more trustworthy sources of information than social media (9). Therefore, it is crucial for LLMs to provide medically accurate information when responding to vaccine-related questions from non-medical users. This responsibility becomes particularly important when the questioner holds negative or skeptical views toward vaccination.
The present study aims to examine how ChatGPT responds to vaccine-related questions, focusing particularly on questions with negative or skeptical framing, and to assess the appropriateness and reliability of its responses. We present this article in accordance with the STROBE reporting checklist (available at https://tp.amegroups.com/article/view/10.21037/tp-2025-602/rc).
Methods
Study design
This study was an anonymous survey conducted using Google Forms (Google LLC, Mountain View, CA, USA; https://forms.google.com). The primary objective was to clarify how ChatGPT-generated responses to vaccine-related questions would be evaluated by experts.
Participants
Participants were pediatric infectious disease specialists, recruited via personal communication, who provided informed consent. Pediatric specialists were chosen on the premise that they generally have greater vaccine expertise than adult infectious disease specialists. To ensure the quality of the evaluations, only physicians board-certified by the Japanese Pediatric Society of Infectious Diseases were eligible to participate.
Question development
In the Question Development phase, we used the generative AI ChatGPT (GPT-4o, 2024 version) to create the survey questions by entering the following prompt:
List 20 common questions about vaccines, and for each question, create both a ‘supportive questioning’ and a ‘critical questioning’ version. In other words, prepare two questions with different stances on the same topic, for a total of 20 pairs (40 questions). No answers are required.
The 40 generated questions were numbered sequentially as Supportive questioning 1, Critical questioning 1, Supportive questioning 2, Critical questioning 2, and so on. Following this sequence, the researchers entered each question into ChatGPT and obtained its responses. If multiple responses were provided, the first displayed response was adopted. A new ChatGPT account was used to avoid the influence of prior usage history or training data exposure. All prompts were entered in Japanese.
Survey administration
After confirming willingness to participate individually via email, participants were asked to evaluate a pre-prepared set of ChatGPT responses created by the researchers. Each evaluator was randomly assigned one of the following two evaluation forms:
- Evaluation Form A: Supportive questioning 1–10 and Critical questioning 11–20.
- Evaluation Form B: Critical questioning 1–10 and Supportive questioning 11–20.
This division aimed to prevent decreased concentration, lower evaluation accuracy, and reduced completion rates that might occur if all 40 responses were assessed. For each response, participants rated the following four items on a five-point Likert scale:
- Clarity: Is the response easy to understand, even for those without prior knowledge?
- Appropriateness: Does the response appropriately address the question?
- Ambiguity: Does the response avoid potentially misleading expressions?
- Length: Is the length of the response appropriate?
In addition, a free-text field was provided for optional comments on each response. All evaluations were collected anonymously using Google Forms, with only the evaluation form type (A or B) recorded as identifying information. No personally identifiable information was collected. The survey was conducted from May 26 to June 30, 2024, and all questions and responses were in Japanese.
Evaluation
For the quantitative evaluation, we compared the distribution of rating scales for each of the four items—Clarity, Appropriateness, Ambiguity, and Length—between supportive questioning and critical questioning. We also conducted descriptive statistical analyses of the rating scales for each ChatGPT response. For qualitative comments provided in the free-text field, we followed qualitative research methods, reading the content verbatim and extracting common themes and trends through coding.
Statistical analysis
Comparisons between supportive questioning and critical questioning were performed using the Mann-Whitney U test. All statistical analyses were conducted using R (The R Foundation for Statistical Computing).
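The analysis itself was conducted in R; an equivalent comparison can be sketched in Python with SciPy (the rating vectors below are hypothetical placeholders, not the study data):

```python
from scipy.stats import mannwhitneyu

# Hypothetical 5-point Likert ratings for one item (not the study data)
supportive = [4, 4, 3, 5, 4, 3, 4, 4, 3, 4]
critical = [4, 3, 4, 4, 3, 4, 4, 3, 4, 4]

# Two-sided Mann-Whitney U test, as used to compare the two question framings
result = mannwhitneyu(supportive, critical, alternative="two-sided")
print(f"U = {result.statistic}, p = {result.pvalue:.3f}")
```

With tied Likert ratings, SciPy falls back to the normal approximation with a tie correction, which is the standard behavior for ordinal survey data of this kind.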
Ethical considerations
The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by the Ethics Committee of University of Osaka Hospital (No. 25410). All participants received an explanation of the study via email and provided written informed consent before completing the evaluation.
Results
Question development
When tasked with creating survey items, ChatGPT generated 20 question topics, each in a supportive and a critical version (40 questions in total), as shown in Table 1. The full text of these questions is provided in Table S1.
Table 1
| Question No. | Question content |
|---|---|
| 1 | Effectiveness of vaccines |
| 2 | Adverse reactions |
| 3 | Long-term safety |
| 4 | Vaccination for children |
| 5 | Vaccination for pregnant women |
| 6 | Mandatory vaccination |
| 7 | Difference from natural immunity |
| 8 | Vaccine ingredients |
| 9 | Vaccine passports |
| 10 | Booster vaccination |
| 11 | Achievement of herd immunity |
| 12 | Speed of vaccine development |
| 13 | Post-vaccination infection |
| 14 | Risk of death |
| 15 | Interaction with healthcare professionals |
| 16 | Distinguishing misinformation |
| 17 | Comparison with drug-induced harm |
| 18 | Pharmaceutical company profits |
| 19 | Government response |
| 20 | Vaccination options |
Survey response rate
The survey evaluating ChatGPT’s responses was distributed to 22 specialists, and 20 responses were received (evaluation form A: n=10; evaluation form B: n=10), yielding a response rate of 90.9%.
Comparison of supportive and critical questions
Figure 1 presents the distribution of ratings for Clarity, Appropriateness, Ambiguity, and Length. The median [25th percentile, 75th percentile] scores for supportive questioning were: Clarity, 4 [3, 4]; Appropriateness, 3 [3, 4]; Ambiguity, 4 [3, 4]; and Length, 3 [3, 4]. Corresponding values for critical questioning were: Clarity, 4 [3, 4]; Appropriateness, 3 [3, 4]; Ambiguity, 4 [3, 4]; and Length, 3 [3, 4]. No statistically significant differences were observed between the two questioning styles for any of the evaluated items. Effect sizes were −2.8×10⁻⁵ (95% confidence interval [CI]: −6.7×10⁻⁵ to 1.4×10⁻⁵) for Clarity, 1.3×10⁻⁵ (95% CI: −4.1×10⁻⁵ to 6.7×10⁻⁵) for Appropriateness, 5.7×10⁻⁵ (95% CI: −7.8×10⁻⁵ to 5.2×10⁻⁵) for Ambiguity, and −2.9×10⁻⁵ (95% CI: −3.9×10⁻⁵ to 1.0×10⁻⁵) for Length, indicating negligible effect sizes.
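For illustration, the median [25th percentile, 75th percentile] summaries can be reproduced with a short script; the effect size shown here is the rank-biserial correlation, a common choice for the Mann-Whitney U test, though the paper does not specify which measure was used, and the ratings below are hypothetical:

```python
import numpy as np
from scipy.stats import mannwhitneyu

def summarize(ratings):
    """Return median [25th percentile, 75th percentile], as in the Results."""
    q25, med, q75 = np.percentile(ratings, [25, 50, 75])
    return med, q25, q75

def rank_biserial(x, y):
    """Rank-biserial correlation derived from the Mann-Whitney U statistic.

    Values near 0 indicate a negligible difference between the two groups.
    """
    u = mannwhitneyu(x, y, alternative="two-sided").statistic
    return 1.0 - 2.0 * u / (len(x) * len(y))

# Hypothetical Clarity ratings for the two framings (not the study data)
supportive = [4, 3, 4, 4, 3, 4, 4, 3, 4, 4]
critical = [4, 4, 3, 4, 3, 4, 4, 4, 3, 4]
print(summarize(supportive))
print(rank_biserial(supportive, critical))
```

Identical rating distributions yield a rank-biserial correlation of exactly 0, which is consistent with the negligible effect sizes reported above.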
Highest- and lowest-rated responses
Figure 2 summarizes the full rating scales for supportive questioning, and Figure 3 for critical questioning.
For supportive questioning, the highest Clarity scores were observed for “15. Interaction with healthcare professionals” (median 4 [4, 4.75]) and “16. Distinguishing misinformation” (4 [4, 5]). In contrast, lower ratings were given for Clarity in “11. Achievement of herd immunity” (3 [2, 3]); Appropriateness in “4. Vaccination for children” (3 [2, 3]) and “10. Booster vaccination” (3 [2, 3]); and Ambiguity in “10. Booster vaccination” (2.5 [2, 4]).
For critical questioning, the highest ratings were for Clarity in “10. Booster vaccination” (4 [4, 4.75]) and Appropriateness in “2. Adverse reactions” (4 [4, 4.75]). The lowest ratings were for both Appropriateness and Ambiguity in “11. Achievement of herd immunity” (2 [2, 2.75]). Length was generally rated as appropriate; the exception was “5. Vaccination for pregnant women” in critical questioning, which was considered relatively short (4 [4, 4.75]).
Free-text comments
A total of 81 free-text comments were provided by 13 participants. These comments were classified into eight categories, as shown in Table 2. The most frequent category was “Explanations are biased toward COVID-19 vaccines” (n=38), followed by “Insufficient explanation” (n=19), “Contains potentially misleading expressions” (n=9), “Positive feedback” (n=6), and “Question design issues making responses difficult” (n=5).
Table 2
| Opinion | N |
|---|---|
| Explanations are biased toward COVID-19 vaccines | 38 |
| Insufficient explanation | 19 |
| Contains potentially misleading expressions | 9 |
| Positive feedback | 6 |
| Question design issues making responses difficult | 5 |
| Excessive use of technical terms | 2 |
| Makes unrealistic suggestions | 1 |
| Overgeneralized | 1 |
COVID-19, coronavirus disease 2019.
Discussion
This study evaluated whether ChatGPT could provide accurate answers to various vaccine-related questions, even when framed as critical questioning. While ChatGPT generally produced medically accurate responses, some answers—both in supportive and critical questioning formats—were partially questionable. These findings may contribute to improving public health communication strategies and guiding the responsible integration of generative AI tools into clinical decision support. In addition, previous studies have examined ChatGPT’s accuracy in addressing vaccine-related queries (10-12), but the novelty of this study lies in its direct comparison of supportive and critical questioning. Although prior research has conducted evaluations in multiple languages (12), this is, to our knowledge, the first assessment in Japanese.
A systematic review published in 2025, encompassing 128 studies on ChatGPT’s accuracy in health-related topics, reported that ChatGPT generally provides accurate responses across many medical fields (13). High accuracy was noted in dermatology and psychiatry, whereas performance varied in cardiology and oncology. In a study evaluating the reliability of epidemiological information provided by ChatGPT, 67.7% of cited references were found to exist, 7.7% lacked sufficient access information, 10.8% were inaccurately cited, and 13.8% were entirely fabricated (14). These findings underscore the need for ongoing verification of ChatGPT’s responses.
In this study, no significant differences in score distributions for Clarity, Appropriateness, Ambiguity, or Length were observed between supportive and critical questioning, and the shapes of the violin plots were nearly identical. This suggests that ChatGPT tends to provide corrective responses even when faced with critical questioning from a vaccine-skeptical perspective, consistent with previous reports (10-12). However, certain questions received relatively low ratings. Free-text comments for these items frequently noted that “the explanation is biased toward COVID-19 vaccines.” Such bias may arise from ChatGPT’s algorithm prioritizing COVID-19 vaccine information due to its prominence and abundance in the dataset. Given that the COVID-19 pandemic has been reported to temporarily stall research activities in other biomedical fields (15), algorithmic improvements could include reducing topical bias toward COVID-19 vaccines, enhancing the accuracy of Japanese medical terminology, and diversifying training data across various vaccine types to achieve more balanced and contextually accurate responses. Moreover, the COVID-19 experience highlights the need to maintain attention, investment, and research in all vaccine-preventable diseases, even during pandemics.
Some free-text comments also raised concerns about potentially misleading statements, such as describing adverse events after human papillomavirus (HPV) vaccination (primarily immunization stress-related responses, which led to the suspension of government recommendations) as “drug-induced harm” (16), or characterizing COVID-19 in children as “generally mild.” These observations highlight the challenges of Japanese-language expression in ChatGPT and underscore the importance of experts continuing to disseminate accurate information through diverse channels, which in turn supports the ongoing improvement and training of such models.
This study has several limitations. First, although a new account was created and initialized prior to using ChatGPT, all prompts were entered from the same account, raising the possibility that previous prompts influenced subsequent outputs. Second, we used ChatGPT to generate question pairs to minimize arbitrariness; however, there is no validation confirming that ChatGPT-generated questions accurately represent real-world vaccine-related queries. Addressing this issue would require extensive efforts, such as interviewing vaccine-hesitant and vaccine-supportive individuals, which was not feasible in the present study. Third, this study used ChatGPT-4o, and the content of outputs may vary depending on the version employed. Fourth, participant recruitment was conducted through personal communication by researchers, which may have introduced selection bias. In addition, the limited number of respondents constrained both the statistical power and the comprehensiveness of the qualitative analysis. Future research should involve larger-scale and longitudinal investigations to accumulate further evidence.
Conclusions
This study evaluated ChatGPT’s responses to vaccine-related questions in Japanese, as assessed by experts, with particular attention to answers given to both supportive and critical questioning. No significant differences in ratings for Clarity, Appropriateness, Ambiguity, or Length were observed between questions with differing stances, suggesting that ChatGPT can maintain a consistent level of quality regardless of the questioner’s perspective. However, free-text comments highlighted concerns such as “Explanations are biased toward COVID-19 vaccines” (n=38), “Insufficient explanation” (n=19), and “Contains potentially misleading expressions” (n=9), indicating challenges in both information balance and language expression, particularly in Japanese. These findings suggest that while generative AI may serve as a useful tool for providing medical information on vaccines to non-experts, ongoing human verification and supplementation remain essential to address bias and ensure linguistic accuracy. Larger-scale and longitudinal studies are warranted.
Acknowledgments
The authors thank all participants for their valuable cooperation in this study.
Footnote
Reporting Checklist: The authors have completed the STROBE reporting checklist. Available at https://tp.amegroups.com/article/view/10.21037/tp-2025-602/rc
Data Sharing Statement: Available at https://tp.amegroups.com/article/view/10.21037/tp-2025-602/dss
Peer Review File: Available at https://tp.amegroups.com/article/view/10.21037/tp-2025-602/prf
Funding: None.
Conflicts of Interest: Both authors have completed the ICMJE uniform disclosure form (available at https://tp.amegroups.com/article/view/10.21037/tp-2025-602/coif). The authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by the Ethics Committee of University of Osaka Hospital (No. 25410). All participants received an explanation of the study via email and provided written informed consent before completing the evaluation.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Ozawa S, Portnoy A, Getaneh H, et al. Modeling The Economic Burden Of Adult Vaccine-Preventable Diseases In The United States. Health Aff (Millwood) 2016;35:2124-32. [Crossref] [PubMed]
- Dong E, Saiyed S, Nearchou A, et al. Trends in County-Level MMR Vaccination Coverage in Children in the United States. JAMA 2025;334:730-2. [Crossref] [PubMed]
- Lo NC, Hotez PJ. Public Health and Economic Consequences of Vaccine Hesitancy for Measles in the United States. JAMA Pediatr 2017;171:887-92. [Crossref] [PubMed]
- World Health Organization. Ten threats to global health in 2019. WHO Spotlight. January 2019. Accessed June 30, 2025. Available online: https://www.who.int/news-room/spotlight/ten-threats-to-global-health-in-2019
- de Miguel-Arribas A, Aleta A, Moreno Y. Impact of vaccine hesitancy on secondary COVID-19 outbreaks in the US: an age-structured SIR model. BMC Infect Dis 2022;22:511. [Crossref] [PubMed]
- Bajracharya D, Jansen RJ. Observations of COVID-19 vaccine coverage and vaccine hesitancy on COVID-19 outbreak: An American ecological study. Vaccine 2024;42:246-54. [Crossref] [PubMed]
- Whitehead HS, French CE, Caldwell DM, et al. A systematic review of communication interventions for countering vaccine misinformation. Vaccine 2023;41:1018-34. [Crossref] [PubMed]
- Salathé M, Khandelwal S. Assessing vaccination sentiments with online social media: implications for infectious disease dynamics and control. PLoS Comput Biol 2011;7:e1002199. [Crossref] [PubMed]
- National AI Opinion Monitor: AI Trust and Knowledge in America. Rutgers University; National Opinion Research Center; The Rockefeller Foundation. Published April 2024. Accessed June 30, 2025. Available online: https://ai.rutgers.edu/naiom2024
- Koh MCY, Ngiam JN, Salada BMA, et al. Can ChatGPT Counter Vaccine Hesitancy? An Evaluation of ChatGPT's Responses to Simulated Queries from the General Public. Healthcare (Basel) 2025;13:1269. [Crossref] [PubMed]
- Deiana G, Dettori M, Arghittu A, et al. Artificial Intelligence and Public Health: Evaluating ChatGPT Responses to Vaccination Myths and Misconceptions. Vaccines (Basel) 2023;11:1217. [Crossref] [PubMed]
- Joshi S, Ha E, Rivera Y, et al. ChatGPT and Vaccine Hesitancy: A Comparison of English, Spanish, and French Responses Using a Validated Scale. AMIA Jt Summits Transl Sci Proc 2024;2024:266-75.
- Beheshti M, Toubal IE, Alaboud K, et al. Evaluating the Reliability of ChatGPT for Health-Related Questions: A Systematic Review. Informatics 2025;12:9.
- Zhu K, Zhang J, Klishin A, et al. Evaluating the Accuracy of Responses by Large Language Models for Information on Disease Epidemiology. Pharmacoepidemiol Drug Saf 2025;34:e70111. [Crossref] [PubMed]
- Aviv-Reuven S, Rosenfeld A. Publication patterns' changes due to the COVID-19 pandemic: a longitudinal and short-term scientometric analysis. Scientometrics 2021;126:6761-84. [Crossref] [PubMed]
- Suzuki S, Hosono A. No association between HPV vaccine and reported post-vaccination symptoms in Japanese young women: Results of the Nagoya study. Papillomavirus Res 2018;5:96-103. [Crossref] [PubMed]