Comparative evaluation of seven large language models in providing home phototherapy care guidance for neonatal hyperbilirubinemia
Highlight box
Key findings
• DeepSeek-R1 demonstrated near-expert accuracy (87.8% “completely correct” responses; total score 19.31/20), outperforming all other models across five evaluation dimensions.
• Significant performance gaps exist: Copilot, Gemini 1.5 Flash, and ERNIE 4.0 Turbo showed suboptimal accuracy, including critical errors such as conflating blue-light therapy with vitiligo treatment.
• Empathy and readability varied considerably and may inadvertently reassure caregivers in high-risk situations if not carefully calibrated.
• All models exhibited hallucinations and lacked individualized reasoning, underscoring that large language models (LLMs) cannot replace professional medical judgment.
What is known and what is new?
• LLMs have demonstrated competence in medical knowledge assessments. Home phototherapy is a safe, cost-effective alternative for neonatal hyperbilirubinemia in appropriately selected infants, but caregiver knowledge gaps and anxiety remain significant barriers.
• This is the first systematic, multidimensional evaluation of seven mainstream LLMs for home phototherapy care guidance, using a validated 15-item questionnaire based on American Academy of Pediatrics guidelines and Food and Drug Administration device manuals. DeepSeek-R1 achieved near-expert performance (19.31/20) with 87.8% “completely correct” responses; ChatGPT-4o ranked second (16.73/20) but showed safety gaps on critical questions. Copilot, Gemini, and ERNIE exhibited critical errors. All models shared hallucination tendencies and a lack of individualized reasoning, confirming that LLMs cannot replace professional oversight even when clinically accurate.
What is the implication, and what should change now?
• High-performing LLMs may serve as supplementary educational tools for post-eligibility, non-urgent caregiver queries. However, they must not replace clinical assessment or decision-making. Future research should prioritize longitudinal performance tracking and patient-centered outcome evaluation.
Introduction
Neonatal hyperbilirubinemia, resulting from an imbalance between bilirubin production and elimination (1), is one of the most common conditions in newborns, affecting 60–80% of term neonates and over 80% of preterm infants globally (2). The physiological breakdown of elevated fetal hemoglobin after birth, combined with a shorter erythrocyte lifespan and immature hepatic conjugation, leads to the accumulation of unconjugated bilirubin. Severe hyperbilirubinemia carries risks of acute bilirubin encephalopathy and kernicterus, potentially causing permanent neurological sequelae (3).
Blue-light phototherapy (460–490 nm) remains the cornerstone treatment, converting unconjugated bilirubin into water-soluble, excretable forms via photoisomerization (4,5). While traditionally administered in hospitals, home phototherapy has emerged as a safe and cost-effective alternative for appropriately selected patients. Endorsed by the American Academy of Pediatrics (AAP) guidelines since 2004 (6), home phototherapy demonstrates therapeutic efficacy comparable to hospital-based treatment when properly implemented (7). Economic analyses indicate approximately 40% cost reduction compared to hospitalization (8), with additional benefits including enhanced maternal-neonatal bonding (9), reduced parental stress (10), and lower readmission rates (11).
A randomized controlled trial (n=147 neonates) found no significant differences in treatment failure rates between home and hospital phototherapy (12). However, successful implementation requires careful patient selection—such as term or late-preterm infants ≥38 weeks, ≥48 hours old, bilirubin levels below exchange transfusion thresholds, and absence of hemolytic disease—as well as appropriate equipment and reliable caregiver supervision (13). Critical success factors include proper device setup and maintenance, appropriate infant-light distance, adequate skin exposure, no unnecessary interruptions, monitoring of hydration and feeding, daily transcutaneous or serum bilirubin measurements, and recognition of warning signs requiring hospital evaluation (14). Despite these established parameters, substantial knowledge gaps persist during home treatment. Significant barriers include variable health literacy among caregivers, limited access to 24/7 professional support, and parental anxiety concerning treatment adequacy and complication recognition (15).
Large language models (LLMs), advanced artificial intelligence (AI) systems utilizing deep learning architectures, have demonstrated considerable capabilities in natural language understanding and generation (16). Their advantages include 24/7 availability, instantaneous response generation, and no-cost accessibility, positioning them as potential supplementary information resources for home healthcare scenarios. Emerging evidence documents LLM performance in medical knowledge assessments; for instance, ChatGPT-4 achieved 70% accuracy on neonatal-perinatal medicine practice questions (17), while multiple models exceeded 85% accuracy across United States Medical Licensing Examination steps (18).
Beyond examination performance, a scoping review of LLM applications in chronic disease management (29 studies, 2023–2025) found that patient education and information provision represented the most common use case (62% of studies), followed by self-management support (28%) and emotional support/therapeutic conversations (14%) (19). Almulla and Khasawneh found that GPT-4 achieved the highest quality scores (mean 4.85/5) when answering common parent questions about autism (20). In the domain of therapeutic communication, Wang et al. explored LLM-powered conversational agents delivering problem-solving therapy for family caregivers, highlighting the potential for empathetic support alongside challenges in balancing thorough assessment with efficient advice delivery (21).
Niko et al. demonstrated the utility of LLMs in addressing home blood pressure monitoring queries (22), suggesting potential applicability to other home-based medical interventions. Notably, a comparative evaluation of LLMs in HIV education revealed significant differences across models in accuracy, readability, and reliability, emphasizing that model selection matters for patient education outcomes (23). No prior study has systematically evaluated LLMs’ capacity to provide guidance for neonatal home phototherapy care, a scenario demanding both technical accuracy and effective parent communication.
This study addresses this gap by developing a validated, comprehensive questionnaire reflecting real-world caregiver concerns across the home phototherapy care continuum and conducting a multidimensional evaluation of seven mainstream LLMs using expert assessment. We hypothesize that high-performing LLMs can deliver accurate, comprehensible basic information comparable to general medical guidance. It is crucial to emphasize that the LLMs were evaluated as supplementary educational resources for caregivers of infants already deemed eligible for home phototherapy by a qualified healthcare professional—not as clinical decision-making tools for treatment eligibility or medical management. Any real-world implementation of LLM-assisted guidance would require continuous oversight by a neonatal care team. We present this article in accordance with the STROBE reporting checklist (24) (available at https://tp.amegroups.com/article/view/10.21037/tp-2025-1-868/rc).
Methods
Study design and ethical considerations
The study was exempt from full ethics review by the Ethics Committee of Children’s Hospital Affiliated to Chongqing Medical University, as it utilized simulated interactions without real patient data. LLMs were assessed as potential supplementary information resources to complement, not replace, professional medical supervision. Figure 1 illustrates the overall process of this study.
Seven general-purpose LLMs were selected based on the Hugging Face Open LLM Leaderboard (accessed September 2024) (25), including ChatGPT-4o (OpenAI, USA), Copilot v1.1.9.0 (Microsoft, USA), Gemini 1.5 Flash (Google, USA), Claude 3.5 Sonnet (Anthropic, USA), DeepSeek-R1 (DeepSeek AI, China), GLM-4 (Zhipu AI, China), and ERNIE 4.0 Turbo (Baidu, China). All models were accessed through their publicly available interfaces at no cost during the study period.
Questionnaire development and validation
This study evaluates LLMs as supplementary educational tools for caregivers of neonates already deemed eligible for home phototherapy by a qualified healthcare professional. Thus, the queries presented to LLMs in this study are deliberately focused on post-eligibility educational support, including device operation, treatment monitoring, feeding guidance, and recognition of warning signs—rather than diagnostic or therapeutic decision-making.
We systematically reviewed the AAP Clinical Practice Guideline for Hyperbilirubinemia Management (6,13,26) and user manuals for five Food and Drug Administration (FDA)-cleared home phototherapy devices (14,27-30) to identify essential knowledge domains. Two investigators (S.H., Q.H.) independently extracted essential knowledge requirements and then collaboratively designed the questions. Discrepancies were resolved through consensus discussion with senior neonatologists (W. Liu, Y.H., H.W.). All three experts confirmed that these questions reflected real-world caregiver concerns encountered in clinical practice. The final questionnaire comprised 15 questions across seven domains (Table 1), representing essential core knowledge necessary for safe home phototherapy implementation. The domains were: (I) concepts (understanding hyperbilirubinemia and risks); (II) indications & contraindications (eligibility criteria); (III) preparations (care setup, procedural steps, environmental preparation); (IV) efficacy monitoring (treatment optimization, effect assessment, bilirubin measurement); (V) safety protocols (discontinuation timing, rebound risk, side effect management); (VI) emergency response (device malfunction, urgent care recognition); and (VII) breastfeeding (feeding continuation during treatment).
Table 1
| Domains | Question text |
|---|---|
| Concepts | Q1 My child has been diagnosed with neonatal hyperbilirubinemia, what are the potential harms to him? |
| Indications & contraindications | Q2 I don’t want to be separated from my child, under what criteria can I give him home phototherapy? |
| Preparations | Q3 How do I get home phototherapy care? |
| | Q4 What are the basic steps involved in home phototherapy? |
| | Q5 What preparations do I need to make before starting home phototherapy? |
| | Q6 How should I prepare my child before lighting? |
| Efficacy monitoring | Q7 What should I do to ensure that the phototherapy is effective? |
| | Q8 How can I tell if the treatment is working after phototherapy? |
| | Q9 How to use a transcutaneous bilirubinometer to monitor my child’s bilirubin levels, and what are the suggestions and precautions? |
| Safety protocols | Q10 When can I discontinue home phototherapy when my child’s jaundice resolves? |
| | Q11 Is it possible for hyperbilirubinemia to rebound? |
| | Q12 What are the side effects of phototherapy and how can I detect them? What should I do if side effects occur? |
| Emergency response | Q13 What should I do if I think the phototherapy device is malfunctioning during the lighting process? |
| | Q14 When must I take my child to the hospital for treatment? Who should I turn to for help in an emergency? |
| Breastfeeding | Q15 Do I need to pause breastfeeding during home phototherapy? |
Q, question.
Due to the complexity and multidisciplinary nature of home phototherapy guidance, reference answers were developed through a structured expert consensus process. Three board-certified neonatologists (H.W., Y.H., W. Liu, each with >10 years of clinical experience) collaboratively developed reference responses based on the AAP Clinical Practice Guideline for Hyperbilirubinemia Management [2022] (6,13,26) and the FDA-cleared device user manuals (14,27-30). Each expert independently reviewed source documents for assigned questions. Initial responses were drafted and refined through structured group discussion. All three experts reviewed and unanimously approved the final 15-item reference response set. The final consolidated document served as the “gold standard” for scoring LLM response accuracy and completeness.
Data collection
To simulate authentic caregiver-physician consultation, three priming prompts were sequentially presented to each LLM before questionnaire administration: (I) “Hello, please explain what neonatal hyperbilirubinemia is from the perspective of a neonatologist”; (II) “Please interpret the home phototherapy for neonatal hyperbilirubinemia from the perspective of a neonatologist”; and (III) “After evaluation by a doctor, my child is eligible for home phototherapy. I will ask you about home phototherapy for neonatal hyperbilirubinemia. Please give me a professional and comprehensive answer”. This approach established clinical context and role specification, mirroring real-world scenarios where parents seek information after physician consultation.
Two investigators (S.H., Q.H.) independently conducted parallel conversations using separate devices, accounts, and network environments, generating Response Sets A and B for each model between November 17, 2024, and February 17, 2025. All models were queried in English except for GLM-4 and ERNIE 4.0 Turbo. Models with output limits (ChatGPT-4o, Claude 3.5 Sonnet, DeepSeek-R1) received questions sequentially within the same conversation thread; others completed all 15 questions in single sessions. All LLM responses were stripped of identifying features, converted to uniform plain text format, and randomly coded.
Expert evaluation
Three board-certified neonatologists (W. Liu, Y.H., H.W.) served as independent evaluators. Raters completed a 2-hour training session on the use of the Likert scales. Responses were evaluated across five dimensions using modified Likert scales (Table 2). Accuracy was rated on a 5-point scale: 5 (completely correct), 4 (more correct than incorrect), 3 (equal correct/incorrect), 2 (more incorrect than correct), and 1 (completely incorrect). Completeness was rated on a 5-point scale: 5 (very complete), 4 (complete), 3 (generally complete), 2 (incomplete), and 1 (very incomplete). Reproducibility was rated on a 4-point scale: 4 (completely reproducible), 3 (generally reproducible), 2 (not reproducible), and 1 (completely unreproducible). Empathy was rated on a 3-point scale: 3 (better humanistic care: explicit acknowledgment of parental emotions, reassurance phrases, personalized tone, appropriate use of emojis/emoticons enhancing connection), 2 (general humanistic care: polite but somewhat impersonal tone), and 1 (lacking humanistic care: purely clinical/technical language). Readability was rated on a 3-point scale: 3 (comprehensible: clear lay language, well-organized, jargon-free or explained), 2 (generally readable: understandable with some technical terms or awkward phrasing), and 1 (obscure: excessive jargon, disorganized, or confusing).
Table 2
| Dimension | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Accuracy | Completely incorrect | More incorrect than correct | Equal correct and incorrect | More correct than incorrect | Completely correct |
| Completeness | Very incomplete | Incomplete | Generally complete | Complete | Very complete |
| Reproducibility | Completely unreproducible | Not reproducible | Generally reproducible | Completely reproducible | – |
| Empathy | Lacking humanistic care | General humanistic care | Better humanistic care | – | – |
| Readability | Obscure | Generally readable | Comprehensible | – | – |
LLM, large language model.
Each rater independently evaluated all responses (n=210 response sets: 7 LLMs × 15 questions × 2 sets). Accuracy, completeness, and empathy were scored per question-response pair. For each question of every LLM, there were two responses but only one evaluation of reproducibility. Readability was assessed holistically per response set (not per question). This resulted in 90 ratings for accuracy, completeness and empathy (15 questions × 2 sets × 3 raters), 45 ratings for reproducibility, and 6 ratings for readability (2 sets × 3 raters).
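The rating counts above follow directly from the design arithmetic; a minimal sketch makes the per-model derivation explicit (variable names are illustrative, not from the study):

```python
# Rating-count arithmetic for the evaluation design described above.
n_models, n_questions, n_sets, n_raters = 7, 15, 2, 3

# Each model contributes two full response sets of 15 questions.
total_responses = n_models * n_questions * n_sets   # 7 x 15 x 2 = 210

# Accuracy, completeness, empathy: scored per question, per set, per rater.
per_model_qce = n_questions * n_sets * n_raters     # 15 x 2 x 3 = 90

# Reproducibility: one rating per question (comparing sets A and B), per rater.
per_model_repro = n_questions * n_raters            # 15 x 3 = 45

# Readability: one holistic rating per response set, per rater.
per_model_read = n_sets * n_raters                  # 2 x 3 = 6

print(total_responses, per_model_qce, per_model_repro, per_model_read)
# 210 90 45 6
```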
Statistical analysis
Descriptive statistics, including mean ± standard deviation (SD), were calculated for all dimension scores. Total scores were calculated by summing mean scores across all five dimensions (maximum: 20 points). Inter-rater reliability among the three neonatologists was assessed using Kendall’s W test (31). Recognizing that this aggregate metric weights all dimensions equally and may obscure critical safety differences, the primary analysis prioritized safety-critical questions, specifically Q10 (discontinuation criteria), Q12 (side effect management), and Q14 (emergency recognition), which directly impact clinical outcomes. For these questions, we report both mean scores and the proportion of responses rated ≥4.0. Models scoring below this threshold on any critical question were flagged as providing potentially high-risk guidance.
Data management was performed using WPS Office [2025]; statistical analyses and visualizations were created using Python (v3.12.8). The gray reference circle represented the “expert clinician performance” established through the “gold standard” (Figure 2).
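Kendall’s W can be computed directly from the raters’ rank matrix. The sketch below uses the basic (uncorrected) formula on synthetic scores for illustration; the study’s actual Likert data contain ties, so a tie-corrected variant would be needed in practice, and the helper name is our own:

```python
import numpy as np
from scipy.stats import rankdata

def kendalls_w(scores):
    """Kendall's coefficient of concordance for an (m raters x n items)
    score matrix. Basic formula; the tie correction is omitted for brevity."""
    scores = np.asarray(scores, dtype=float)
    m, n = scores.shape
    # Rank each rater's scores across the n items (average ranks for ties).
    ranks = np.apply_along_axis(rankdata, 1, scores)
    rank_sums = ranks.sum(axis=0)
    # S: sum of squared deviations of rank sums from their mean.
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Three perfectly concordant raters -> W = 1.0.
print(kendalls_w([[1, 2, 3, 4, 5]] * 3))  # 1.0
# Two raters in exact opposition -> W = 0.0.
print(kendalls_w([[1, 2, 3], [3, 2, 1]]))  # 0.0
```

W ranges from 0 (no agreement) to 1 (complete concordance), so the reported W=0.86 for accuracy and W=0.90 for completeness indicate near-perfect agreement.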
Results
Overall performance
Table 3 presents comprehensive performance scores across all seven LLMs. DeepSeek-R1 achieved the highest total score (19.31/20), outperforming the second-ranked ChatGPT-4o (16.73/20) by 15.4% and approaching the “gold standard”.
Table 3
| Model | Accuracy [5] | Completeness [5] | Reproducibility [4] | Empathy [3] | Readability [3] | Total [20] |
|---|---|---|---|---|---|---|
| ChatGPT-4o | 4.22±0.72 | 4.18±0.74 | 3.51±0.35 | 2.15±0.86 | 2.67±0.58 | 16.73 |
| Copilot | 3.70±0.74 | 3.64±0.71 | 3.33±0.37 | 2.00±0.83 | 3.00±0.00 | 15.67 |
| Gemini 1.5 Flash | 3.94±0.84 | 3.69±1.03 | 3.21±0.51 | 2.02±0.93 | 3.00±0.00 | 15.87 |
| GLM-4 | 4.11±0.62 | 4.04±0.76 | 3.36±0.52 | 2.04±0.95 | 1.67±0.58 | 15.21 |
| ERNIE 4.0 Turbo | 3.64±0.80 | 3.52±0.75 | 3.24±0.58 | 2.26±0.78 | 2.33±0.58 | 15.00 |
| Claude 3.5 Sonnet | 4.16±0.83 | 4.09±0.80 | 3.20±0.75 | 1.89±1.15 | 3.00±0.00 | 16.34 |
| DeepSeek-R1 | 4.89±0.12 | 4.89±0.11 | 3.53±0.52 | 3.00±0.00 | 3.00±0.00 | 19.31 |
Data are presented as mean ± standard deviation.
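The total scores in Table 3 are simply the sums of the five dimension means, which also reproduces the 15.4% margin reported above; a quick check with values transcribed from the table (showing two models for brevity):

```python
# Dimension means transcribed from Table 3:
# [accuracy, completeness, reproducibility, empathy, readability]
means = {
    "ChatGPT-4o":  [4.22, 4.18, 3.51, 2.15, 2.67],
    "DeepSeek-R1": [4.89, 4.89, 3.53, 3.00, 3.00],
}
totals = {model: round(sum(dims), 2) for model, dims in means.items()}
print(totals)  # {'ChatGPT-4o': 16.73, 'DeepSeek-R1': 19.31}

# Relative margin of the top model over the runner-up.
gap = (totals["DeepSeek-R1"] - totals["ChatGPT-4o"]) / totals["ChatGPT-4o"]
print(f"{gap:.1%}")  # 15.4%
```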
Accuracy performance
Model performance in accuracy was stratified into tiers. Tier 1 (Excellent, ≥4.5) included DeepSeek-R1 (4.89±0.12), with 87.8% of responses (79/90) rated “completely correct”, only one response scored 3, and zero responses scored ≤2. It demonstrated exceptional performance on all critical questions (Q10, Q12, Q14), with 100% of ratings (18/18) achieving ≥4.0 on these questions. This indicates consistently safe and accurate guidance for high-stakes scenarios.
Tier 2 (Good, 4.0–4.5) included ChatGPT-4o (4.22±0.72; 47.8% completely correct), which showed strong performance on device operation (Q3–Q6) and monitoring (Q7–Q9) but occasional omissions in contraindication details (Q2). Despite its strong overall performance (total score 16.73/20), ChatGPT-4o showed notable safety concerns on critical questions, with only 55.6% (10/18) of critical ratings ≥4.0. Claude 3.5 Sonnet (4.16±0.83) showed excellent performance on conceptual questions but weaker on specific protocols. While it performed well on Q12 (mean 4.33, 100%≥4.0), it had concerning results on Q10 (mean 3.5, only 33.3%≥4.0) and Q14 (mean 4.0, 66.7%≥4.0). This pattern suggests that even generally competent models may have specific safety blind spots. GLM-4 (4.11±0.62) was generally accurate but occasionally provided overly cautious recommendations. GLM-4 showed excellent performance on Q14 (mean 4.33, 100%≥4.0) but poor performance on Q10 (mean 2.67, 33.3%≥4.0).
Tier 3 (Acceptable, 3.5–4.0) included Gemini 1.5 Flash (3.94±0.84) and Copilot (3.70±0.74). ERNIE 4.0 Turbo (3.64±0.80) made a critical error by incorrectly associating blue-light therapy with vitiligo treatment (Q1), and its web-search integration occasionally retrieved irrelevant information.
Inter-rater reliability
To further validate the consistency of expert ratings, we calculated Kendall’s coefficient of concordance (Kendall’s W) for each evaluation dimension. Inter-rater agreement was excellent for the clinical dimensions: accuracy (W=0.86; P<0.05) and completeness (W=0.90; P<0.05) both approached near-perfect concordance, indicating strong consensus among the three neonatologists when scoring factual correctness and information coverage. Reproducibility (W=0.51; P>0.05) and empathy (W=0.63; P>0.05) showed only moderate agreement that did not reach statistical significance, reflecting greater rater variability in these more subjective dimensions. For readability, the limited number of items per model precluded stable estimation of Kendall’s W.
Completeness performance
DeepSeek-R1 maintained superiority in completeness, with 87.8% of responses rated “very complete” (79/90; mean 4.89±0.11). ChatGPT-4o achieved 47.8% top ratings (43/90; mean 4.18±0.74). Notably, only DeepSeek-R1 consistently warned parents about phototherapy-related light exposure risks to caregivers, including the need for eye and skin protection. ChatGPT-4o and Claude 3.5 Sonnet provided general guidance but occasionally missed device-specific details, such as differences between BiliBed and BiliSoft blanket positioning.
Reproducibility performance
All models demonstrated moderate-to-good reproducibility (collective mean range: 3.20–3.53). The reproducibility scores were as follows: DeepSeek-R1 (3.53±0.52; 88.3% of responses scored ≥3), ChatGPT-4o (3.51±0.35; 93.3%≥3), and Claude 3.5 Sonnet (3.20±0.75). Claude 3.5 Sonnet varied in response length and detail level across iterations. Gemini 1.5 Flash showed the highest inconsistency, likely due to real-time web search integration yielding different sources between sessions. ERNIE 4.0 Turbo selected different web references in Chinese-language queries.
Empathy performance
DeepSeek-R1 achieved perfect empathy scores (3.00±0.00), representing a notable advancement in AI-parent communication. DeepSeek-R1 employed emotional validation (“I completely understand your concerns...”), reassurance (“You’re doing an amazing job...”), an encouraging tone (“Don’t worry...”), and contextually appropriate emojis. ChatGPT-4o (2.15±0.86) was polite and professional but more clinical, with empathy variability across questions. Claude 3.5 Sonnet (1.89±1.15) used predominantly directive language (“You must...”, “It is essential that...”) and occasionally anxiety-provoking phrasing.
Readability performance
Four models achieved maximum readability scores (3.00): Copilot, Gemini 1.5 Flash, Claude 3.5 Sonnet, and DeepSeek-R1. DeepSeek-R1 used structured organization with numbered steps, bullet points, visual checklists, summary notes, and clear action-oriented guidance. ChatGPT-4o (2.67±0.58) employed concise bullet-point formatting for efficient delivery but was occasionally dense. Copilot and Gemini 1.5 Flash used a second-person address for a conversational tone, with Gemini uniquely providing IP-based local emergency contact suggestions. Claude 3.5 Sonnet featured clear logical progression with well-defined sections. GLM-4 had the poorest performance (1.67±0.58), characterized by redundancy, obscure language due to translation artifacts, and disorganization. ERNIE 4.0 Turbo showed moderate readability (2.33±0.58), occasionally incorporating inconsistently formatted web-scraped content.
Integrated multidimensional analysis
DeepSeek-R1 demonstrated near-maximum scores across all dimensions, closely approaching the “gold standard”. ChatGPT-4o showed strong clinical dimensions with moderate empathy. Claude 3.5 Sonnet presented a paradox of high technical performance but the lowest empathy, indicating potential for caregiver alienation. The Chinese models (GLM-4, ERNIE) showed moderate performance, possibly limited by English-language evaluation. Consumer models (Copilot, Gemini) prioritized readability and engagement over clinical precision.
Discussion
Key findings
This study represents the first systematic, multidimensional evaluation of LLMs for providing home phototherapy guidance in neonatal hyperbilirubinemia. DeepSeek-R1 achieved near-expert performance across all evaluation domains (total score 19.31/20), with 87.8% of responses rated “completely correct” for accuracy and completeness. ChatGPT-4o also demonstrated clinically acceptable performance (16.73/20), while other models showed variable capabilities. These findings build upon previous literature by extending LLM competency from medical examinations to real-world parent education scenarios.
Explanations of findings
Category 1 models (DeepSeek-R1, ChatGPT-4o) demonstrated sufficient accuracy and completeness for basic informational support, potentially serving as supplementary resources for common queries regarding device operation, treatment procedures, and general safety protocols. However, their limitations necessitate important constraints. Appropriate use cases include reinforcement of general educational content and physician instructions, as well as after-hours informational support for non-urgent questions. Inappropriate use cases encompass clinical decision-making, bilirubin level interpretation, assessment of treatment failure or complications, and emergency triage; models should always redirect to immediate professional evaluation for such concerns.
Category 2 models (Claude 3.5 Sonnet, GLM-4) showed moderate performance with notable limitations. Claude’s low empathy scores despite technical accuracy suggest a potential to exacerbate caregiver anxiety, underscoring that clinical accuracy alone is insufficient for effective patient education. GLM-4’s performance may have been underestimated due to the English-language evaluation of a Chinese-optimized model.
Category 3 models (Copilot, Gemini 1.5 Flash, ERNIE 4.0 Turbo) demonstrated suboptimal accuracy (<4.0), rendering them unsuitable for medical guidance without substantial improvements. ERNIE’s critical error conflating blue-light phototherapy with vitiligo treatment exemplifies the risks of uncritical web-search integration.
While empathy and readability are often framed as desirable attributes in patient education, their pursuit in LLM-generated content warrants careful scrutiny. In this study, DeepSeek-R1 achieved perfect empathy scores through the use of emotional validation (“I completely understand your concerns...”), reassurance (“You’re doing an amazing job...”), and an encouraging tone. However, such empathetic phrasing—if not carefully calibrated—carries the risk of inadvertently reassuring caregivers in situations where heightened vigilance is warranted. Claude 3.5 Sonnet, despite its lower empathy scores, employed directive language (“You must...”, “It is essential that...”) that—while less reassuring—may more effectively convey the seriousness of certain safety protocols.
Readability, while essential for health literacy, also presents a double-edged sword. Models like Copilot and Gemini 1.5 Flash achieved maximum readability scores through conversational tone, second-person address, and simplified language. Yet, this readability may come at the expense of nuance and safety warnings. The attributes that enhance user experience (empathy, readability) may, if unconstrained, undermine clinical safety.
The analysis of safety-critical questions (Q10, Q12, Q14) reveals that models with similar total scores may have markedly different safety profiles. ChatGPT-4o and Claude 3.5 Sonnet, for instance, had nearly identical aggregate scores (16.73 vs. 16.34), yet their performance on individual critical questions diverged substantially. ChatGPT-4o provided inadequate emergency guidance (Q14 mean 3.67, only 33% acceptable), while Claude 3.5 Sonnet struggled with discontinuation criteria (Q10 mean 3.5, 33% acceptable). These findings align with the concern that weighting empathy and readability equally with clinical accuracy can produce misleading composite scores.
Neither readability nor empathy can compensate for unsafe clinical guidance. Future evaluations should prioritize safety-critical questions as the primary outcome, using aggregate scores only as supplementary information. The ideal LLM response must strike a delicate balance—sufficiently empathetic to engage caregivers without creating false reassurance, and sufficiently readable to be understood without sacrificing critical safety details.
Strengths and limitations
This study provides the first systematic, multidimensional evaluation of LLMs for home phototherapy guidance, employing a validated, guideline-based questionnaire. A key strength is the holistic assessment across five domains critical for patient education, supported by blinded expert evaluation and duplicate sampling to ensure robustness. We conducted this study following the LLM evaluation framework for responding to medical questions proposed by Wei et al. to improve the quality of the research (32).
Several limitations merit consideration. First, model performance reflects a snapshot of specific versions and may change with subsequent updates. Moreover, the versions selected (e.g., Gemini 1.5 Flash and ERNIE 4.0 Turbo) prioritize speed and computational efficiency, which may underestimate the reasoning capacity of their full-scale counterparts. Our reliance on the Hugging Face leaderboard rankings at the time of selection, while systematic, did not account for this architectural distinction. Future comparisons should preferentially include reasoning-optimized versions when available.
Second, the primary use of English may not fully capture the strengths of multilingual models (GLM-4, ERNIE 4.0 Turbo) or diverse performance across different cultural contexts. Third, although standardized simulated queries improved comparability, they cannot fully represent real-world caregiver interactions, including follow-up questions, emotional distress, or heterogeneous health literacy.
Fourth, despite acceptable inter-rater reliability for clinical dimensions, reproducibility and empathy showed greater variability, indicating legitimate expert disagreement in judging information coverage and humanistic aspects. Fifth, assessments of empathy and readability, while expert-rated, retain inherent subjectivity and would benefit from validation by caregiver panels. Finally, this study evaluated the quality of information provided but did not measure subsequent clinical outcomes or changes in caregiver behavior, which should be addressed in future work.
Users must be cautioned that conversations with general-purpose LLMs are not confidential medical encounters. Sensitive information, including infant details, family identifiers, or specific health histories, should never be shared, as these inputs may be retained and used for model training depending on platform policies. Furthermore, the inability of current LLMs to consistently recognize and redact protected health information poses a tangible privacy risk, necessitating clear user guidelines and the future development of medically compliant, privacy-preserving interfaces.
Comparison with similar studies
Our findings align with and extend the growing literature on medical LLMs. While studies like Beam et al. demonstrate LLM competence on neonatal board examinations (33), and Hoppe et al. report superior diagnostic accuracy compared to emergency physicians (34), our work shifts the focus to the practical challenge of guiding caregivers through a multi-day, home-based treatment. This scenario demands not only knowledge recall but also sustained safety vigilance and empathetic communication.
Most directly, our study complements the work of Liu et al., who evaluated AI for neonatal home oxygen therapy (35). Both studies converge on a central conclusion: LLMs can function as valuable supplementary information sources, but their limitations in consistent safety counseling and individualized reasoning make professional oversight non-negotiable. Our contribution is to quantify these limitations within home phototherapy using a novel multidimensional framework, providing a benchmark for performance and highlighting specific, high-risk gaps such as hallucinated references that persist even in top-tier models.
Future directions and actions needed
Future research should prioritize longitudinal tracking of performance across model iterations, expanded language and cultural validation, randomized controlled trials comparing outcomes in LLM-supplemented versus standard care, and evaluation of LLMs in other home-based pediatric interventions. High-performing LLMs may be recommended as supplementary resources for basic operational queries, accompanied by explicit counseling about their limitations. Parents must be informed that LLM-generated information requires professional confirmation before acting on it, particularly for decisions about treatment modification or discontinuation.
Conclusions
This study provides a structured, guideline-based comparison of seven LLMs in the context of home phototherapy education for neonatal hyperbilirubinemia. Under standardized, simulated caregiver queries, selected models demonstrated acceptable performance in delivering basic, operational information aligned with expert reference responses, whereas substantial variability in safety-critical content was observed across platforms.
Importantly, the potential utility of current LLMs appears confined to low-acuity, post-eligibility educational support, such as reinforcing clinician-provided instructions and addressing common procedural questions. These systems are not suitable for clinical decision-making, assessment of treatment response, interpretation of bilirubin levels, or triage of adverse events, where inaccurate or incomplete guidance could pose significant safety risks.
Within the constraints of this simulation-based evaluation, our findings suggest that even high-performing models require strict use boundaries and professional oversight. Future work should extend beyond standardized prompts to real-world caregiver-LLM interactions, incorporate patient-centered outcomes, and evaluate whether LLM-assisted education improves adherence, safety behaviors, or clinical endpoints in home-based neonatal care.
Acknowledgments
None.
Footnote
Reporting Checklist: The authors have completed the STROBE reporting checklist. Available at https://tp.amegroups.com/article/view/10.21037/tp-2025-1-868/rc
Data Sharing Statement: Available at https://tp.amegroups.com/article/view/10.21037/tp-2025-1-868/dss
Peer Review File: Available at https://tp.amegroups.com/article/view/10.21037/tp-2025-1-868/prf
Funding: None.
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://tp.amegroups.com/article/view/10.21037/tp-2025-1-868/coif). The authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. This research was exempted from full ethics review by the Ethics Committee of Children's Hospital Affiliated to Chongqing Medical University, as it utilized simulated interactions without real patient data.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Rennie JM, Roberton NRC. Rennie and Roberton's Textbook of Neonatology. 5th ed. Edinburgh: Churchill Livingstone; 2012.
- National Institute for Health and Care Excellence. Jaundice in newborn babies under 28 days (NICE guideline CG98). 2023. Available online: https://www.nice.org.uk/guidance/cg98
- Dennery PA, Seidman DS, Stevenson DK. Neonatal hyperbilirubinemia. N Engl J Med 2001;344:581-90. [Crossref] [PubMed]
- McDonagh AF, Lightner DA. Phototherapy and the photobiology of bilirubin. Semin Liver Dis 1988;8:272-83. [Crossref] [PubMed]
- McDonagh AF, Agati G, Fusi F, et al. Quantum yields for laser photocyclization of bilirubin in the presence of human serum albumin. Dependence of quantum yield on excitation wavelength. Photochem Photobiol 1989;50:305-19.
- American Academy of Pediatrics Subcommittee on Hyperbilirubinemia. Management of hyperbilirubinemia in the newborn infant 35 or more weeks of gestation. Pediatrics 2004;114:297-316.
- Chang PW, Waite WM. Evaluation of Home Phototherapy for Neonatal Hyperbilirubinemia. J Pediatr 2020;220:80-5. [Crossref] [PubMed]
- Khajehei M, Gidaszewski B, Maheshwari R, et al. Clinical outcomes and cost-effectiveness of large-scale midwifery-led, paediatrician-overseen home phototherapy and neonatal jaundice surveillance: A retrospective cohort study. J Paediatr Child Health 2022;58:1159-67. [Crossref] [PubMed]
- Hedenbro M, Rydelius PA. Early interaction between infants and their parents predicts social competence at the age of four. Acta Paediatr 2014;103:268-74. [Crossref] [PubMed]
- Pettersson M, Eriksson M, Blomberg K. Parental experiences of home phototherapy for neonatal hyperbilirubinemia. J Child Health Care 2023;27:562-73. [Crossref] [PubMed]
- Orringer K, Kileny S, Salada K, et al. Biliblanket Utilization for Outpatient Treatment of Newborn Jaundice. Clin Pediatr (Phila) 2023;62:725-32. [Crossref] [PubMed]
- Pettersson M, Eriksson M, Odlind A, et al. Home phototherapy of term neonates improves parental bonding and stress: Findings from a randomised controlled trial. Acta Paediatr 2022;111:760-6. [Crossref] [PubMed]
- Kemper AR, Newman TB, Slaughter JL, et al. Clinical Practice Guideline Revision: Management of Hyperbilirubinemia in the Newborn Infant 35 or More Weeks of Gestation. Pediatrics 2022;150:e2022058859. [Crossref] [PubMed]
- GE Healthcare. BiliSoft™ 2.0 phototherapy system operation manual. 2017. Available online: https://www.manualslib.com/manual/2106193/Ge-Bilisoft-2-0.html
- Prashanth GP. Randomised controlled trial of home phototherapy in term neonates: Pertinent issues. Acta Paediatr 2022;111:1454-5. [Crossref] [PubMed]
- Zhao WX, Zhou K, Li J, et al. A Survey of Large Language Models. arXiv:2303.18223 [Preprint]. 2023. Available online: https://arxiv.org/abs/2303.18223
- Sharma P, Luo G, Wang C, et al. Assessment of the clinical knowledge of ChatGPT-4 in neonatal-perinatal medicine: a comparative analysis with ChatGPT-3.5. J Perinatol 2024;44:1365-6. [Crossref] [PubMed]
- Mihalache A, Huang RS, Popovic MM, et al. ChatGPT-4: An assessment of an upgraded artificial intelligence chatbot in the United States Medical Licensing Examination. Med Teach 2024;46:366-72. [Crossref] [PubMed]
- Serugunda HM, Jianquan O, Kasujja Namatovu H, et al. Using Large Language Models for Chronic Disease Management Tasks: Scoping Review. JMIR Med Inform 2025;13:e66905. [Crossref] [PubMed]
- Almulla AA, Khasawneh MAS. Assessing AI-Based Large Language Models (ChatGPT, Google Gemini, and DeepSeek) for Common Parent Questions about Autism: Acceptability, Readability, and Accuracy. Psychiatr Q 2025; Epub ahead of print. [Crossref]
- Wang J, Li X, Zhang Y, et al. Large language model-powered conversational agents for problem-solving therapy in family caregivers: a feasibility study. arXiv:2401.03428 [Preprint]. 2024. Available online: https://doi.org/10.48550/arXiv.2401.03428
- Niko MM, Karbasi Z, Kazemi M, et al. Comparing ChatGPT and Bing, in response to the Home Blood Pressure Monitoring (HBPM) knowledge checklist. Hypertens Res 2024;47:1401-9. [Crossref] [PubMed]
- Eren Korkmaz Ö, Açıkalın Arıkan B, Sayın Kutlu S, et al. Artificial intelligence meets HIV education: Comparing three large language models on accuracy, readability, and reliability. Int J STD AIDS 2026;37:112-20. [Crossref] [PubMed]
- von Elm E, Altman DG, Egger M, et al. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. Ann Intern Med 2007;147:573-7. [Crossref] [PubMed]
- Hugging Face. Chatbot Arena LLM Leaderboard: community-driven evaluation for best LLM and AI chatbots. 2024. Available online: https://huggingface.co/spaces/lmarena-ai/chatbot-arena-leaderboard
- Bhutani VK, Wong RJ, Turkewitz D, et al. Phototherapy to Prevent Severe Neonatal Hyperbilirubinemia in the Newborn Infant 35 or More Weeks of Gestation: Technical Report. Pediatrics 2024;154:e2024068026. [Crossref] [PubMed]
- Amerigroup. Home phototherapy devices for neonatal hyperbilirubinemia. 2024. Available online: https://medpol.providers.amerigroup.com/dam/medpolicies/amerigroup/active/guidelines/gl_pw_a053626.html
- Motif Medical. Motif phototherapy blanket BiliTouch™ operational manual. 2021. Available online: https://motifmedical.com/bilitouchtm-phototherapy-blanket
- NeoLight. Skylife™ phototherapy system user manual. 2021. Available online: https://www.manualslib.com/manual/2615737/Neolight-Skylife.html
- Respironics. Bilitx™ professional manual. 2007. Available online: https://www.manualslib.com/manual/1331841/Respironics-Bilitx.html
- Gisev N, Bell JS, Chen TF. Interrater agreement and interrater reliability: key concepts, approaches, and applications. Res Social Adm Pharm 2013;9:330-8. [Crossref] [PubMed]
- Wei Q, Yao Z, Cui Y, et al. Evaluation of ChatGPT-generated medical responses: A systematic review and meta-analysis. J Biomed Inform 2024;151:104620. [Crossref] [PubMed]
- Beam K, Sharma P, Kumar B, et al. Performance of a Large Language Model on Practice Questions for the Neonatal Board Examination. JAMA Pediatr 2023;177:977-9. [Crossref] [PubMed]
- Hoppe JM, Auer MK, Strüven A, et al. ChatGPT With GPT-4 Outperforms Emergency Department Physicians in Diagnostic Accuracy: Retrospective Analysis. J Med Internet Res 2024;26:e56110. [Crossref] [PubMed]
- Liu W, Wei H, Xiang L, et al. Bridging the Gap in Neonatal Care: Evaluating AI Chatbots for Chronic Neonatal Lung Disease and Home Oxygen Therapy Management. Pediatr Pulmonol 2025;60:e71020. [Crossref] [PubMed]