An integrated and interpretable machine learning framework for Kawasaki disease diagnosis and risk prediction
Original Article

Dandan Wang1, Fei Li2, Tingting Xie1, Xiaodong Zang1,3, Mingwu Chen1

1Department of Pediatrics, The First Affiliated Hospital of University of Science and Technology of China, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, China; 2Department of Pediatrics, The First People’s Hospital of Wuhu, Wuhu, China; 3Institute of Public Health Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, China

Contributions: (I) Conception and design: D Wang, M Chen; (II) Administrative support: M Chen; (III) Provision of study materials or patients: D Wang, F Li; (IV) Collection and assembly of data: T Xie, X Zang; (V) Data analysis and interpretation: D Wang, F Li; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: Mingwu Chen, PhD. Department of Pediatrics, The First Affiliated Hospital of University of Science and Technology of China, Division of Life Sciences and Medicine, University of Science and Technology of China, No. 17 Lujiang Road, Luyang District, Hefei 230001, China. Email: 17705699290@163.com.

Background: Early identification of Kawasaki disease (KD) and accurate prediction of its associated complications are critical for optimizing treatment strategies and improving clinical outcomes. While machine learning has shown promise in KD-related studies, most existing models are limited to single tasks with varying feature sets, lacking an integrated framework. This fragmentation hinders clinical applicability and constrains generalizability. Therefore, this study aimed to develop a unified and interpretable machine learning framework for KD diagnosis and risk prediction, with the goal of enhancing clinical relevance and real-world applicability.

Methods: We retrospectively collected data from 2,133 febrile pediatric patients treated at The First Affiliated Hospital of University of Science and Technology of China between January 1, 2018, and December 31, 2022. After excluding patients older than 5 years or with incomplete records, a total of 919 cases—including both typical and atypical KD—were included. Using 29 common clinical features, we developed a unified light gradient boosting machine (LightGBM)-based model for KD diagnosis, intravenous immunoglobulin (IVIG) resistance prediction, and coronary artery lesion (CAL) risk assessment. The dataset was split into training and validation sets at an 8:2 ratio, and five-fold cross-validation was performed to ensure robustness. Model performance was evaluated using accuracy, area under the receiver operating characteristic (ROC) curve (AUC), sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). Feature importance and model interpretability were assessed using SHapley Additive exPlanations (SHAP). To further assess its clinical utility, we compared the model’s diagnostic performance with that of pediatric clinicians and an advanced large language model (ChatGPT).

Results: The KD diagnostic task achieved an AUC of 0.999, with a sensitivity of 0.984 and specificity of 0.974. The IVIG resistance prediction task yielded an AUC of 0.888, sensitivity of 0.600, and specificity of 0.979. For CAL risk prediction, the AUC was 0.783, with a sensitivity of 0.529 and specificity of 0.984. SHAP analysis identified distinct sets of top-ranking features for each task, reflecting the underlying clinical heterogeneity. Notably, variables such as inflammatory markers, immune-related indicators, and characteristic clinical signs of KD consistently contributed to model predictions. In a comparative study, our model achieved accuracies of 0.900, 0.800, and 0.757 for KD diagnosis, IVIG resistance, and CAL prediction, respectively, consistently outperforming pediatricians with over 5 years of experience and ChatGPT, highlighting its potential as a clinical decision support tool.

Conclusions: This study presents a unified machine learning framework that accurately supports KD diagnosis, IVIG resistance prediction, and CAL risk assessment. By leveraging a common set of clinical features, the model enhances clinical applicability and lays the groundwork for task-specific management and precision intervention in KD.

Keywords: Machine learning; Kawasaki disease (KD); intravenous immunoglobulin resistance (IVIG resistance); coronary artery lesions (CALs)


Submitted Jun 17, 2025. Accepted for publication Aug 05, 2025. Published online Sep 26, 2025.

doi: 10.21037/tp-2025-399


Highlight box

Key findings

• We developed a machine learning model based on 29 routinely available clinical features to simultaneously support Kawasaki disease (KD) diagnosis, intravenous immunoglobulin resistance prediction, and coronary artery lesion risk assessment, achieving robust performance across all tasks.

• Our model outperformed pediatricians with over 5 years of clinical experience and the large language model ChatGPT-4o in all three tasks, highlighting its potential as a reliable decision-support tool.

What is known and what is new?

• Prior studies have used machine learning for KD diagnosis or risk prediction separately, with inconsistent feature sets, limiting clinical integration.

• This study developed a unified model using a retrospective cohort and a consistent feature set, with SHapley Additive exPlanations analysis revealing distinct predictors across tasks.

What is the implication, and what should change now?

• By leveraging a shared set of clinical features, this integrated model enhances clinical applicability, offers an efficient tool to reduce physician workload, and provides a novel unified approach for both KD diagnosis and risk stratification.


Introduction

Kawasaki disease (KD) is one of the most common vasculitides in children and a leading cause of acquired heart disease in the pediatric population (1,2). The majority of KD-related mortality is associated with the development of giant coronary artery aneurysms (3). Timely administration of intravenous immunoglobulin (IVIG) within the first 10 days of illness has been shown to reduce the incidence of coronary artery lesions (CALs) from approximately 25% to 4% (2). Moreover, recent studies have demonstrated that, in children at high risk of IVIG resistance, initial combination therapy with corticosteroids and standard-dose IVIG can significantly lower the incidence of coronary abnormalities, thereby mitigating KD-associated complications (4,5). These findings underscore the critical importance of early KD recognition and the implementation of personalized treatment strategies to improve clinical outcomes.

Currently, the diagnosis of KD, prediction of IVIG resistance, and assessment of CAL risk are often addressed as separate tasks, each using distinct models and feature sets (5-9). Tsai et al. developed a machine learning diagnostic model to distinguish KD from other febrile illnesses using a set of 22 clinical features, including age, sex, and laboratory parameters (10). Their findings demonstrated that objective laboratory results can effectively predict the occurrence of KD. Similarly, Lam et al. introduced the KIDMATCH model, which incorporates demographic information, clinical symptoms, and 18 laboratory variables to differentiate multisystem inflammatory syndrome in children (MIS-C), KD, and other febrile conditions (11). The model’s performance was further validated across multiple external centers. Xia et al. constructed a machine learning model for predicting IVIG resistance based on demographic and laboratory data (12). Comparative analyses showed that their model outperformed traditional scoring systems such as the Egami (13), Kobayashi (14), and Sano (15) scores. To date, the prediction of CALs in children with KD has largely relied on conventional statistical methods (16,17). Xu et al. evaluated several predictive models for CAL in KD and found that the eXtreme Gradient Boosting (XGBoost)-based machine learning model achieved the best overall performance (18). Although these approaches have demonstrated promising performance, they rely on disjointed feature sets and task-specific models that are not integrated across diagnostic and prognostic tasks. This separation may limit the translational potential and clinical adoption of such predictive tools. Considering the close interrelation of these tasks within the disease course, developing a unified predictive framework based on a shared clinical feature set may reduce data collection costs, streamline model deployment, and better support integrated clinical decision-making. Such a strategy holds substantial clinical relevance and translational potential.

To effectively implement this approach, a modeling technique capable of handling diverse data types and capturing complex interdependencies is essential. Machine learning offers strong capabilities in modeling nonlinear relationships among heterogeneous clinical variables (demographic information, clinical symptoms, and laboratory biomarkers), making it highly suitable for multi-dimensional risk prediction (19). Its ability to integrate diverse feature types, manage missing values, and identify latent patterns in large-scale datasets facilitates the development of cohesive, end-to-end predictive frameworks, for example with gradient boosting models. Building on these strengths, this study proposes a unified, feature-consistent, multi-task prediction model using light gradient boosting machine (LightGBM) to simultaneously address KD diagnosis, IVIG resistance, and CAL risk stratification, thereby enhancing clinical applicability and decision support. We present this article in accordance with the TRIPOD reporting checklist (available at https://tp.amegroups.com/article/view/10.21037/tp-2025-399/rc).


Methods

Study population

This retrospective study was conducted at The First Affiliated Hospital of University of Science and Technology of China and included pediatric patients diagnosed with either typical or atypical KD between January 1, 2018, and December 31, 2022. The inclusion criteria were as follows:

KD group: all patients met the diagnostic criteria for KD as defined by the American Heart Association (AHA) in 2017, including both typical and atypical forms (2). Only patients experiencing their first episode were included. Those with coexisting autoimmune diseases, congenital heart disease, or incomplete clinical records were excluded. For patients who visited the pediatric department multiple times due to persistent fever or symptoms, the most recent laboratory test results prior to confirmed diagnosis were used. Clinical symptoms and fever duration were also determined based on the same time point to ensure consistency in both the timing of feature selection and the disease stage represented. This approach also ensured that all clinical features were collected before the administration of IVIG, thereby avoiding the influence of treatment on patient characteristics.

IVIG-resistant group: defined as KD patients who exhibited persistent or recurrent fever (≥38 ℃) between 36 hours and 2 weeks following standard-dose IVIG treatment (2 g/kg), along with one or more principal clinical manifestations.

CAL group: CAL classification was based on standardized Z-scores. The internal diameters of the left main coronary artery (LMCA), left anterior descending artery (LAD), left circumflex artery (LCX), and right coronary artery (RCA) were measured by echocardiography and converted to Z-scores according to the criteria of the Kobayashi Z-score, adjusted for body surface area.

To establish a control group, we enrolled febrile children under the age of five from the same hospital during the study period who exhibited at least one clinical feature of KD, forming the fever control (FC) group. Children with incomplete laboratory data were excluded. All study procedures were approved by the Ethics Committee of The First Affiliated Hospital of University of Science and Technology of China (No. 2024-RE-487). The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. As anonymized data were used, the requirement for informed consent was waived by the committee.

Data preprocessing

Based on recommendations from clinical experts and the availability of routine laboratory tests for most patients, this study selected a total of 29 clinical features as input variables for the machine learning models. These included age, sex, duration of fever, five principal clinical signs of KD, and 21 routine laboratory indicators. The five principal clinical signs of KD, along with the duration of fever, were manually extracted by experienced pediatric clinicians from admission notes and clinical documentation. Laboratory test results were obtained directly from the structured fields of the electronic health records (EHRs). The five clinical signs consisted of rash, changes in lips or oropharyngeal mucosa, conjunctival injection, cervical lymphadenopathy, and peripheral extremity changes. The 21 laboratory tests included: white blood cell count, lymphocyte count, glutathione reductase, platelet count, platelet distribution width, mean platelet volume, red cell distribution width, hematocrit, hemoglobin, red blood cell count, C-reactive protein, erythrocyte sedimentation rate, fibrin degradation products, D-dimer, procalcitonin, alanine aminotransferase, aspartate aminotransferase, albumin, sodium, creatinine, and total bilirubin. Missing laboratory data were imputed using the K-nearest neighbors (KNN) algorithm with K=10, based on the average values of the 10 most similar samples in the training set. The distribution of missing values across variables is summarized in Table S1. For normalization, all laboratory features were standardized using z-score transformation (i.e., subtracting the mean and dividing by the standard deviation).
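The imputation and standardization steps described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' code: the distance metric (root mean squared difference over features observed in both rows), the use of the population standard deviation, and the function names `knn_impute`, `fit_zscore`, and `apply_zscore` are assumptions made for illustration. In practice, library implementations such as scikit-learn's `KNNImputer` and `StandardScaler` would typically be used.

```python
import math
from statistics import mean, pstdev


def knn_impute(train, rows, k=10):
    """Fill None entries in `rows` with the mean of that feature over the
    k nearest training rows. Donors always come from `train`."""

    def distance(a, b):
        # Compare only features observed in both rows (assumed metric).
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = []
    for row in rows:
        new_row = list(row)
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: training rows where feature j is observed.
            donors = [t for t in train if t[j] is not None]
            donors.sort(key=lambda t: distance(row, t))
            new_row[j] = mean(t[j] for t in donors[:k])
        filled.append(new_row)
    return filled


def fit_zscore(train):
    """Per-feature (mean, standard deviation) estimated on the training rows."""
    return [(mean(col), pstdev(col)) for col in zip(*train)]


def apply_zscore(rows, params):
    """Subtract the training mean and divide by the training standard deviation."""
    return [
        [(x - m) / s if s > 0 else 0.0 for x, (m, s) in zip(row, params)]
        for row in rows
    ]
```

Fitting both the neighbor pool and the scaling parameters on the training rows only, and then applying them to the validation rows, keeps information from the validation set out of the preprocessing.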

Model design

All modeling procedures in this study were implemented using Python 3.10, with LightGBM as the core algorithm. LightGBM, based on the gradient boosting framework, is known for its efficient training speed, low memory usage, and ability to naturally handle missing values and categorical variables. These properties make it particularly well-suited for modeling nonlinear relationships and interpreting feature importance in medical data.

To establish a comprehensive predictive pipeline covering initial diagnosis through downstream risk stratification, we adopted a staged multi-task learning strategy. The overall task was divided into three independent yet feature-consistent binary classification sub-tasks, each trained using the same set of input features, including demographic variables, duration of fever, five principal clinical signs, and 21 laboratory indicators. The model design was as follows:

  • Task 1: KD classification model. The first binary classifier was trained using the full cohort of patients with KD (KD group) and febrile controls (FC group), with the objective of distinguishing KD cases from other febrile illnesses (KD vs. FC). This model serves as an initial screening module in the intelligent diagnostic system, assisting clinicians in the early identification of suspected KD during triage.
  • Task 2: IVIG resistance prediction model. Among confirmed KD patients, a second LightGBM model was developed to predict resistance to standard IVIG therapy, defined as persistent or recurrent fever (≥38 ℃) between 36 hours and 2 weeks after treatment. This model aims to facilitate early assessment of treatment responsiveness and support personalized therapeutic decision-making.
  • Task 3: CAL prediction model. To address the risk of cardiovascular complications in KD patients, a third model was trained to predict the likelihood of developing CALs. This model enables early identification of patients at higher risk for long-term cardiac complications, thereby supporting proactive clinical management.
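The staged design above amounts to building three label vectors over one shared feature matrix layout. The following sketch shows one way to organize this; the record keys `is_kd`, `ivig_resistant`, and `cal` and the function name `build_task_datasets` are hypothetical and do not come from the study.

```python
from typing import Dict, List, Tuple

Record = Dict[str, object]
Dataset = Tuple[List[list], List[int]]


def build_task_datasets(records: List[Record], feature_names: List[str]) -> Dict[str, Dataset]:
    """Return {task: (X, y)}; every task uses the same feature columns."""

    def features(record: Record) -> list:
        return [record[name] for name in feature_names]

    # Tasks 2 and 3 are restricted to confirmed KD cases.
    kd_only = [r for r in records if r["is_kd"]]

    return {
        # Task 1: KD vs. febrile controls on the full cohort.
        "kd_diagnosis": (
            [features(r) for r in records],
            [int(bool(r["is_kd"])) for r in records],
        ),
        # Task 2: IVIG resistance among KD patients.
        "ivig_resistance": (
            [features(r) for r in kd_only],
            [int(bool(r["ivig_resistant"])) for r in kd_only],
        ),
        # Task 3: coronary artery lesions among KD patients.
        "cal": (
            [features(r) for r in kd_only],
            [int(bool(r["cal"])) for r in kd_only],
        ),
    }
```

Each `(X, y)` pair could then be passed to a separate binary classifier, so that the three models differ only in their labels and training subset.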

Model training and evaluation

The full dataset was randomly split into a training set and an internal validation set at a ratio of 8:2. To assess the model’s generalizability and stability across different data subsets, five-fold cross-validation was performed on the training set. Stratified sampling was used during data partitioning to preserve the original class distribution within each fold. In the first task (KD diagnosis), all enrolled subjects were included in the training pipeline. For the second and third tasks—IVIG resistance prediction and CAL risk prediction—only confirmed KD cases were retained, while patients with other febrile illnesses were excluded.
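A stratified 8:2 split followed by stratified five-fold cross-validation on the training portion can be sketched as follows. This is an illustrative, standard-library implementation under assumed details (per-class rounding of the validation size, a fixed random seed, and the hypothetical function names `stratified_split` and `stratified_kfold`); the study does not report these specifics.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split indices into (train, validation), preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    train_idx, val_idx = [], []
    for members in by_class.values():
        rng.shuffle(members)
        n_val = round(len(members) * test_fraction)
        val_idx.extend(members[:n_val])
        train_idx.extend(members[n_val:])
    return sorted(train_idx), sorted(val_idx)


def stratified_kfold(labels, indices, n_splits=5, seed=0):
    """Yield (fit_indices, held_out_indices) for each of n_splits folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in indices:
        by_class[labels[idx]].append(idx)
    folds = [[] for _ in range(n_splits)]
    for members in by_class.values():
        rng.shuffle(members)
        # Deal each class round-robin so every fold keeps the class ratio.
        for position, idx in enumerate(members):
            folds[position % n_splits].append(idx)
    for k in range(n_splits):
        held_out = sorted(folds[k])
        fitting = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield fitting, held_out
```

With 571 KD cases and 348 febrile controls, this split holds out 184 cases for validation, which is consistent with the 68 + 116 cases in the age-stratified analysis reported in the Results.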

To enhance interpretability of the model predictions, SHapley Additive exPlanations (SHAP) analysis was employed to evaluate feature importance on the held-out validation set. SHAP values were computed using the SHAP Python library (20). As a unified framework for interpreting machine learning model outputs, SHAP assigns each feature an importance score based on its contribution to individual predictions. Features with higher SHAP values are considered to exert a greater influence on the model’s decision-making process.
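To make concrete what a SHAP value measures, the sketch below computes exact Shapley attributions for a toy scoring function by enumerating feature orderings. It is only a didactic illustration: the SHAP library uses a tree-specific algorithm for models such as LightGBM rather than enumeration, and the toy model `toy_risk`, its coefficients, and the baseline values are invented for this example.

```python
from itertools import permutations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley attribution of predict(x) - predict(baseline).

    Features not yet added to a coalition keep their baseline value."""
    n = len(x)
    phi = [0.0] * n
    for order in permutations(range(n)):
        current = list(baseline)
        previous = predict(current)
        for j in order:
            current[j] = x[j]
            value = predict(current)
            phi[j] += value - previous  # marginal contribution of feature j
            previous = value
    return [total / factorial(n) for total in phi]


def toy_risk(features):
    """Hypothetical score with one interaction term (not the study's model)."""
    crp, lymphocyte_pct, lip_changes = features
    score = 0.01 * crp - 0.02 * lymphocyte_pct
    if lip_changes and crp > 50:
        score += 0.5
    return score


patient = [100.0, 20.0, 1]
reference = [10.0, 45.0, 0]
contributions = shapley_values(toy_risk, patient, reference)
```

The attributions sum to the difference between the patient's score and the reference score, which is the additivity property that allows per-feature contributions to be ranked and compared.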

Statistical analysis

The performance of the predictive model was assessed using standard evaluation metrics, including accuracy, area under the receiver operating characteristic (ROC) curve (AUC), sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). All metrics were calculated based on the results from the internal validation set. To assess model stability and robustness, five-fold cross-validation was conducted on the training dataset. The dataset was randomly partitioned into five equal subsets. In each iteration, one subset was used as the validation set while the remaining four subsets were used for training. This process was repeated five times, ensuring that each subset served as the validation set once. Final results were reported as the mean values across all folds along with their corresponding 95% confidence intervals (CIs), providing a comprehensive evaluation of model consistency.
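As an illustration of the fold-level summary, the sketch below computes the AUC as a rank statistic and aggregates five fold values into a mean with a 95% CI. The interval method is an assumption (a Student's t interval with 4 degrees of freedom, clipped to the range 0 to 1); the manuscript does not state how its CIs were derived, and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def roc_auc(labels, scores):
    """AUC as the probability that a positive case outranks a negative one
    (ties count as one half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Two-sided 95% critical value of Student's t with 4 degrees of freedom.
T_CRIT_DF4 = 2.776


def mean_ci_5fold(values):
    """Mean and assumed t-based 95% CI across exactly five fold values."""
    if len(values) != 5:
        raise ValueError("expected one value per fold (5 folds)")
    m = mean(values)
    half_width = T_CRIT_DF4 * stdev(values) / sqrt(len(values))
    return m, max(0.0, m - half_width), min(1.0, m + half_width)
```

Because each fold in Tasks 2 and 3 contains few positive cases, fold-level sensitivity can vary considerably, and an interval computed this way would widen accordingly.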


Results

Patient characteristics

A total of 571 children diagnosed with KD were included in the study [median age (interquartile range), 18.00 (11.00–29.00) months; 367 males, accounting for 64.3%] (Figure 1). Additionally, 348 febrile control children under the age of 5 years were enrolled during the same period [median age (interquartile range), 37.00 (25.00–50.00) months; 212 males, 60.9%]. Demographic characteristics, duration of fever, and laboratory findings for both groups were extracted from electronic medical records and systematically compared (Table 1). P values for between-group comparisons were calculated using the Chi-squared test for categorical variables and the Mann-Whitney U test for continuous variables. Reference ranges for laboratory indicators were also included. Differences in demographic and clinical characteristics between patients with and without IVIG resistance or CALs are provided in Tables S2,S3.
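For a categorical variable, the between-group comparison can be reproduced from the counts alone. The sketch below applies a Pearson Chi-squared test to the sex distribution in Table 1; whether the authors applied a continuity correction is not stated, so the uncorrected form used here is an assumption, and the function name `chi2_2x2` is hypothetical.

```python
from math import erfc, sqrt


def chi2_2x2(a, b, c, d):
    """Pearson Chi-squared test (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    statistic = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value


# Table 1: 367 of 571 children with KD and 212 of 348 febrile controls were male.
statistic, p_value = chi2_2x2(367, 571 - 367, 212, 348 - 212)
```

The resulting P value is approximately 0.31, in line with the non-significant difference in sex distribution reported in Table 1 (P=0.32).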

Figure 1 Flowchart of cohort selection, including KD diagnosis, IVIG resistance, and CALs prediction. Task 1: differential diagnosis between KD and other febrile illnesses. Task 2: prediction of IVIG resistance among patients with KD. Task 3: prediction of CAL development in KD patients. CAL, coronary artery lesion; IVIG, intravenous immunoglobulin; KD, Kawasaki disease.

Table 1

Basic demographic and laboratory characteristics of children with KD and febrile controls

| Characteristics | Children with KD (n=571) | Febrile controls (n=348) | Reference values | P value |
|---|---|---|---|---|
| Male | 367 (64.3) | 212 (60.9) | | 0.32 |
| Rash | 415 (72.7) | 138 (39.7) | | 0.01 |
| Changes in lips or oropharyngeal mucosa | 439 (76.9) | 70 (20.1) | | 0.001 |
| Conjunctival injection | 464 (81.3) | 52 (14.9) | | 0.001 |
| Cervical lymphadenopathy | 312 (54.6) | 101 (29.0) | | 0.03 |
| Peripheral extremity changes | 281 (49.2) | 14 (4.0) | | 0.001 |
| Age (months) | 18.00 (11.00–29.00) | 37.00 (25.00–50.00) | | 0.001 |
| Duration of fever (days) | 6.00 (6.00–7.00) | 4.00 (3.00–5.00) | | 0.008 |
| White blood cell count (×10^9/L) | 14.17 (11.07–17.65) | 14.46 (11.16–18.70) | 4.4–11.9 | 0.29 |
| Lymphocytes (%) | 25.50 (18.00–35.95) | 48.00 (40.00–55.62) | 20–50 | 0.001 |
| Glutathione reductase (IU/L) | 64.10 (52.65–74.25) | 22.55 (16.50–29.30) | 15–73 | 0.001 |
| Platelet count (×10^9/L) | 359.00 (291.00–442.00) | 206.00 (167.75–259.00) | 167–453 | 0.006 |
| Platelet distribution width (fL) | 15.40 (13.90–15.70) | 15.90 (15.60–16.10) | 15.1–18.1 | 0.08 |
| Mean platelet volume (fL) | 9.00 (8.30–9.80) | 9.50 (8.80–10.30) | 9.4–12.4 | 0.08 |
| Red cell distribution width (%) | 13.00 (12.50–13.70) | 13.20 (12.67–13.72) | 11–16 | 0.007 |
| Hematocrit (%) | 0.34 (0.31–0.36) | 0.37 (0.34–0.39) | 34–43 | 0.01 |
| Hemoglobin (g/L) | 111.00 (103.00–118.00) | 121.00 (113.00–128.00) | 112–149 | 0.02 |
| Red blood cell count (×10^12/L) | 4.12 (3.87–4.43) | 4.49 (4.25–4.74) | 4.0–5.5 | 0.02 |
| C-reactive protein (mg/L) | 67.87 (37.13–110.50) | 10.38 (6.58–19.85) | 0–5 | 0.001 |
| Erythrocyte sedimentation rate (mm/h) | 62.00 (46.00–75.00) | 20.00 (11.25–31.75) | 0–21 | 0.001 |
| Fibrin degradation products (μg/mL) | 5.42 (4.45–6.37) | 3.69 (2.50–5.75) | 0–5 | 0.03 |
| D-dimer (μg/mL) | 1.36 (0.76–2.31) | 0.86 (0.62–1.29) | 0.01–0.55 | 0.001 |
| Procalcitonin (ng/mL) | 0.35 (0.17–1.09) | 0.19 (0.11–0.40) | 0–0.5 | 0.005 |
| Alanine aminotransferase (U/L) | 25.00 (15.00–58.10) | 33.85 (20.00–85.05) | 0–35 | 0.06 |
| Aspartate aminotransferase (IU/L) | 29.80 (24.00–39.10) | 47.00 (37.00–70.80) | 14–36 | 0.001 |
| Albumin (g/L) | 38.30 (34.90–41.10) | 40.30 (37.80–42.60) | 35–50 | 0.009 |
| Sodium (mmol/L) | 136.06 (134.88–138.00) | 138.00 (136.43–139.18) | 136–145 | 0.006 |
| Creatinine (μmol/L) | 22.00 (18.00–26.00) | 26.00 (23.00–31.00) | 46–92 | 0.047 |
| Total bilirubin (μmol/L) | 6.40 (4.60–9.20) | 5.50 (4.50–7.43) | 3–22 | 0.002 |

Data are presented as n (%) or median (IQR). IQR, interquartile range; KD, Kawasaki disease.

Model performance

In this study, we evaluated the performance of machine learning models across three sequential prediction tasks: KD diagnosis (Task 1), IVIG resistance prediction (Task 2), and CAL prediction (Task 3). Five-fold cross-validation was applied to assess multiple performance metrics for each task (Table 2). The results revealed distinct differences in model performance across tasks, reflecting varying complexities in data characteristics and predictive challenges associated with each clinical objective.

Table 2

Fivefold cross-validation and test set performance metrics for KD diagnosis, IVIG resistance, and CAL prediction (Tasks 1–3)

| Task category | Accuracy (95% CI) | AUC (95% CI) | Sensitivity (95% CI) | Specificity (95% CI) | PPV (95% CI) | NPV (95% CI) |
|---|---|---|---|---|---|---|
| Task 1 | 0.978 (0.977–0.984) | 0.999 (0.995–1.000) | 0.984 (0.972–0.996) | 0.974 (0.955–0.993) | 0.974 (0.957–0.991) | 0.981 (0.970–0.992) |
| Task 2 | 0.952 (0.932–0.972) | 0.888 (0.780–0.996) | 0.600 (0.575–0.635) | 0.979 (0.945–1.000) | 0.533 (0.476–0.613) | 0.971 (0.958–0.984) |
| Task 3 | 0.897 (0.870–0.925) | 0.783 (0.751–0.816) | 0.529 (0.511–0.547) | 0.984 (0.977–0.992) | 0.704 (0.689–0.722) | 0.906 (0.881–0.932) |

Task 1: differential diagnosis between KD and other febrile illnesses. Task 2: prediction of IVIG resistance among patients with KD. Task 3: prediction of CAL development in KD patients. AUC, area under the receiver operating characteristic curve; CAL, coronary artery lesion; CI, confidence interval; IVIG, intravenous immunoglobulin; KD, Kawasaki disease; NPV, negative predictive value; PPV, positive predictive value.

For the initial KD diagnostic task (Task 1), the model demonstrated robust and consistent performance, achieving an accuracy of 0.978 (95% CI: 0.977–0.984) and an AUC of 0.999 (95% CI: 0.995–1.000). High sensitivity, specificity, PPV, and NPV further support the model’s utility as a reliable tool for early screening in clinical settings. Notably, the median age of children diagnosed with KD was significantly lower, approximately half that of the febrile control group (18 vs. 37 months). To assess the potential influence of age on model performance, we conducted a stratified analysis in which participants were divided into two subgroups: ≤18 months (n=68) and >18 months (n=116). The model maintained high diagnostic performance across both subgroups. For the ≤18-month group, the accuracy was 0.985 (95% CI: 0.956–1.000) with an AUC of 1.000 (95% CI: 1.000–1.000); for the >18-month group, the accuracy reached 0.974 (95% CI: 0.940–1.000) with an AUC of 0.998 (95% CI: 0.992–1.000). These findings indicate that the model demonstrates consistently high diagnostic accuracy across different pediatric age ranges (Table S4).

In Task 2, which focused on predicting IVIG resistance among KD patients, the model maintained acceptable performance, with an accuracy of 0.952 (95% CI: 0.932–0.972) and an AUC of 0.888 (95% CI: 0.780–0.996). The sensitivity and specificity were 0.600 (95% CI: 0.575–0.635) and 0.979 (95% CI: 0.945–1.000), respectively (Table 2).

Task 3, aimed at predicting the development of CALs, yielded moderate performance, with an accuracy of 0.897 (95% CI: 0.870–0.925) and an AUC of 0.783 (95% CI: 0.751–0.816). However, the model’s sensitivity was relatively low at 0.529 (95% CI: 0.511–0.547), which may be attributable to the low incidence of CALs, class imbalance, or insufficient feature representation. Future work may consider data rebalancing techniques or feature augmentation strategies to enhance model performance in this domain.

Figure 2 presents the overall ROC curves in the test set, with AUCs of 0.998, 0.988, and 0.852 for KD diagnosis, IVIG resistance, and CAL prediction, respectively. The corresponding confusion matrices for each model in the test set are shown in Figure S1.

Figure 2 ROC curves for different predictive tasks on the test set. (A) ROC curve of the KD classification model. (B) ROC curve of the IVIG resistance prediction model. (C) ROC curve of the CAL prediction model. AUC, area under the receiver operating characteristic curve; CAL, coronary artery lesion; IVIG, intravenous immunoglobulin; KD, Kawasaki disease; ROC, receiver operating characteristic.

Feature importance

Next, we quantified and ranked feature importance using SHAP, where higher SHAP values indicate a greater impact of a feature on the model’s predictions. Based on the LightGBM model, the top 20 most important features are shown in Figure 3A-3C.

Figure 3 The 20 most important features for each prediction task evaluated using the SHAP method. (A) SHAP feature importance ranking for KD diagnosis. (B) SHAP feature importance ranking for prediction of resistance to IVIG. (C) SHAP feature importance ranking for prediction of CALs. ALB, albumin; ALT, alanine aminotransferase; AST, aspartate aminotransferase; CAL, coronary artery lesion; CervicalLN, cervical lymphadenopathy; Cr, creatinine; CRP, C-reactive protein; ConjunctivalInjection, conjunctival injection; ESR, erythrocyte sedimentation rate; ExtremityChanges, peripheral extremity changes; FDP, fibrin degradation products; FeverDuration, duration of fever; GR, glutathione reductase; HCT, hematocrit; HGB, hemoglobin; IVIG, intravenous immunoglobulin; KD, Kawasaki disease; LipOralChanges, changes in lips or oropharyngeal mucosa; MPV, mean platelet volume; Na, sodium; PCT, procalcitonin; PDW, platelet distribution width; PLT, platelet count; RBC, red blood cell count; RDW, red cell distribution width; SHAP, SHapley Additive exPlanations; TBil, total bilirubin; WBC, white blood cell count.

In Task 1, which focused on KD diagnosis, the most important features included lymphocytes, glutathione reductase, changes in lips or oropharyngeal mucosa, C-reactive protein, and peripheral extremity changes. Among the top 10 features, four are clinical symptoms commonly used by clinicians to diagnose KD, further confirming that the model’s predictions align with the clinical understanding of KD diagnosis. In Task 2, which aimed at predicting IVIG resistance, the most important features were rash, mean platelet volume, changes in lips or oropharyngeal mucosa, alanine aminotransferase, and procalcitonin. In Task 3, predicting the occurrence of CALs, the top features were glutathione reductase, changes in lips or oropharyngeal mucosa, albumin, sex, and lymphocytes. These features are closely associated with oxidative stress, immune response, vascular injury, and the patient’s physiological state.

The key features for the different prediction tasks reflect distinct biological mechanisms at various stages of disease progression, and changes in these features may offer insight into how KD progresses.

Comparative study

To further evaluate the diagnostic performance of our model, we conducted a comparative analysis against pediatricians and ChatGPT (gpt-4o-mini-2024-07-18) using a standardized diagnostic task, with the prompt provided in Table S5. A total of 50 cases were randomly selected from an independent test cohort at our center, which was entirely distinct from the model’s training dataset. These cases included 25 EHRs of children diagnosed with KD and 25 with other febrile illnesses. The control group consisted of febrile children under the age of 5 years who met at least one of the principal clinical criteria for KD but were ultimately diagnosed with other conditions. For comparison, diagnoses were provided by five board-certified pediatricians, each with over 5 years of clinical experience. The results demonstrated that our model outperformed clinical physicians in terms of diagnostic accuracy, with an accuracy of 0.900, sensitivity of 0.920, and specificity of 0.880. In comparison, the average performance of the clinical physicians was 0.860 for accuracy, 0.880 for sensitivity, and 0.840 for specificity. We also evaluated the diagnostic performance of ChatGPT, which yielded an accuracy of 0.820, sensitivity of 0.800, and specificity of 0.840, all of which were lower than the performance of our model (Table 3).

Table 3

Comparative diagnostic performance metrics of physicians, ChatGPT, and the machine learning model in KD identification

| Method | Accuracy | Sensitivity | Specificity | PPV | NPV |
|---|---|---|---|---|---|
| Physicians | 0.860 | 0.880 | 0.840 | 0.846 | 0.875 |
| ChatGPT | 0.820 | 0.800 | 0.840 | 0.833 | 0.808 |
| Our model | 0.900 | 0.920 | 0.880 | 0.885 | 0.917 |

KD, Kawasaki disease; NPV, negative predictive value; PPV, positive predictive value.
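The five metrics in Table 3 follow directly from a 2×2 confusion matrix. The sketch below recomputes the "Our model" row; the counts (23 true positives, 2 false negatives, 22 true negatives, 3 false positives) are inferred here from the reported rates and the 25 KD and 25 control cases, and are not reported by the authors. The function name `diagnostic_metrics` is hypothetical.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Standard diagnostic metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }


# Inferred counts for the model on 25 KD cases and 25 febrile controls.
model_row = diagnostic_metrics(tp=23, fn=2, tn=22, fp=3)
rounded = {name: round(value, 3) for name, value in model_row.items()}
```

Because PPV and NPV depend on the proportion of KD cases in the sample, the values obtained on this balanced 50-case set would not transfer directly to a setting with a different KD prevalence.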

Additionally, we compared our trained LightGBM model with the Kobayashi, Egami, Formosa, and Kawamura scores for the prediction of IVIG resistance, using the aforementioned independent test set (10 cases in the resistance group and 15 cases in the responsive group). The variables considered in the models and the results of these comparisons are presented in Table 4. Chi-squared tests indicated that the performance differences between our LightGBM model and the four scoring systems were statistically significant.

Table 4

Comparative prediction performance of different methods for IVIG resistance in KD

| Methods | Accuracy | Sensitivity | Specificity | PPV | NPV |
|---|---|---|---|---|---|
| Physicians | | | | | |
|    Kobayashi | 0.560 | 0.200 | 0.800 | 0.400 | 0.600 |
|    Egami | 0.560 | 0.300 | 0.733 | 0.429 | 0.611 |
|    Formosa | 0.720 | 0.500 | 0.867 | 0.714 | 0.722 |
|    Kawamura | 0.600 | 0.300 | 0.800 | 0.500 | 0.632 |
| ChatGPT | 0.440 | 0.200 | 0.600 | 0.250 | 0.529 |
| Our model | 0.800 | 0.700 | 0.867 | 0.778 | 0.813 |

IVIG, intravenous immunoglobulin; KD, Kawasaki disease; NPV, negative predictive value; PPV, positive predictive value.

For the prediction of coronary artery aneurysms, the test set comprised 17 patients with KD who developed coronary artery aneurysms and 20 patients with KD who did not. As shown in Table 5, our model outperformed both ChatGPT and the clinical physicians, with an accuracy of 0.757 compared with 0.405 and 0.622, respectively.

Table 5

Comparative diagnostic performance metrics of physicians, ChatGPT, and the machine learning model in CAL prediction in KD

Methods Accuracy Sensitivity Specificity PPV NPV
Physicians 0.622 0.471 0.750 0.615 0.625
ChatGPT 0.405 0.294 0.500 0.333 0.455
Our model 0.757 0.647 0.850 0.786 0.739

CAL, coronary artery lesion; KD, Kawasaki disease; NPV, negative predictive value; PPV, positive predictive value.


Discussion

This study proposes a machine learning modeling approach based on routine clinical indicators for the simultaneous prediction of KD diagnosis and its progression risk. The constructed model demonstrates strong performance across multiple evaluation metrics, exhibiting high clinical application potential. Furthermore, we employed the SHAP method to perform an explainability analysis of the model’s feature contributions, quantifying the relative importance of various clinical indicators in the model’s predictions. Through comparative experiments with several existing methods, the advantages of the proposed model in terms of performance and robustness were further validated. This research provides a foundational approach and strategy for the development of predictive models for KD-related complications or treatment responses based on the same clinical indicator system.

KD-associated CALs have gradually replaced rheumatic fever as one of the leading causes of acquired heart disease in children (21). Previous studies have shown that early identification and intervention in KD can significantly reduce the risk of CALs (2-5). Therefore, developing early diagnostic and disease progression prediction models is of great clinical significance. This study systematically identifies key clinical features involved in the diagnosis and progression of KD through multi-task machine learning modeling. In the KD diagnostic model, lymphocytes, glutathione reductase, changes in the lips or oropharyngeal mucosa, C-reactive protein, and changes in the peripheral extremities were identified as the most predictive variables. These features reflect the immune activation, redox imbalance, and typical mucocutaneous clinical manifestations present early in the disease (22-24). For the prediction of IVIG resistance, the key features identified by the model included rash, mean platelet volume, changes in the lips or oropharyngeal mucosa, alanine aminotransferase, and procalcitonin. These indicators suggest that IVIG-resistant patients may experience more pronounced inflammatory responses, liver dysfunction, and platelet function changes (6,25-27). For predicting the risk of CALs, the primary features identified by the model were glutathione reductase, changes in the lips or oropharyngeal mucosa, albumin levels, gender, and lymphocyte count. This group of features not only involves oxidative stress and immune system status but may also reflect endothelial dysfunction and individual susceptibility to inflammatory responses (28-30). These findings provide insight into the potential mechanisms underlying KD diagnosis and the different aspects of its progression, highlighting the critical roles of inflammation, oxidative stress, and vascular damage in disease evolution. Moreover, they offer important clues for future pathological research and the development of precise therapeutic strategies centered around these mechanisms.

The diagnosis, treatment response prediction (e.g., IVIG resistance), and complication risk assessment (e.g., CALs) of KD have been key areas of clinical research. However, existing studies often model and analyze these three aspects as independent tasks, lacking a systematic modeling approach within a unified variable framework. One significant innovation of this study is the first attempt to construct a multi-task machine learning model based on the same set of routine clinical indicators, which can simultaneously predict KD diagnosis, IVIG resistance, and CAL risk. This integrated model enables both disease diagnosis and progression risk assessment. Methodologically, this study introduces machine learning models to address the high-dimensional, nonlinear nature of clinical data and the complex interactions between variables. Compared to traditional statistical modeling approaches, machine learning algorithms offer inherent advantages in handling multicollinearity, variable selection, and nonlinear relationships between features. Furthermore, the use of model interpretability methods, such as SHAP, to assist in feature importance analysis enhances the clinical interpretability and practical value of the model’s results (31). This strategy not only improves predictive performance but also provides more systematic and quantitative support for disease mechanism exploration and clinical decision-making. This research framework offers a novel methodological pathway for future modeling studies focused on the full-course management of KD and lays the technological foundation for developing multi-task predictive tools based on a unified indicator system.
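SHAP attributions are built on Shapley values, which average a feature's marginal contribution over all subsets of the other features. The brute-force sketch below illustrates that definition on a toy scoring function. It is not the study's code or model: the function `toy_score`, its weights, and the three-feature set are hypothetical, and practical SHAP implementations rely on much faster algorithms than this exhaustive enumeration.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict, FrozenSet, List


def shapley_values(
    features: List[str],
    value: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Exact Shapley values by enumerating every subset of the other features."""
    n = len(features)
    phi: Dict[str, float] = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):  # k = size of the subset that excludes f
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of f when added to subset s
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


def toy_score(present: FrozenSet[str]) -> float:
    """Hypothetical risk score with one interaction term (illustration only)."""
    score = 0.0
    if "crp" in present:
        score += 2.0
    if "lymphocytes" in present:
        score += 1.0
    if "crp" in present and "albumin" in present:
        score += 1.0  # interaction: credit is shared between the two features
    return score


features = ["crp", "lymphocytes", "albumin"]
phi = shapley_values(features, toy_score)
for name in features:
    print(f"{name}: {phi[name]:.3f}")
```

In this toy case the interaction term is split evenly between the two interacting features, and the attributions sum to the difference between the full score and the empty-set score, which is the additivity property that makes such values convenient for ranking clinical indicators.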

With the rapid advancement of large language models in natural language processing, models such as ChatGPT have increasingly been explored for applications in the medical domain, particularly in clinical decision support and medical text comprehension, where they have demonstrated promising capabilities (32-34). To evaluate their applicability and performance in KD-related tasks, this study incorporated large language models into a comparative analysis. The results indicate that, for the diagnosis of KD, the machine learning model slightly outperformed physicians and ChatGPT (accuracy of 0.900, 0.860, and 0.820, respectively), although the difference is relatively modest. This may reflect the ability of large language models to express and reason with structured or semi-structured features based on clinical manifestations (e.g., fever, rash, and mucosal changes), suggesting a certain diagnostic assistance potential. However, for the prediction of IVIG resistance and CAL development, the machine learning models developed in this study performed markedly better. For IVIG resistance, accuracy was 0.800 for our model, 0.720 for the best-performing comparator score (Formosa), and 0.440 for ChatGPT; for CAL development, accuracy was 0.757 for our model, 0.622 for physicians, and 0.405 for ChatGPT. These tasks often rely on precise laboratory test indicators (e.g., glutathione reductase, mean platelet volume, and alanine aminotransferase), while current large language models still face limitations in interpreting numerical features. Specifically, when they are unable to accurately identify whether indicators are in an abnormal state, their reasoning can be affected, leading to a decline in prediction accuracy (35,36). Future research could further explore fusion strategies between large language models and structured data models, such as through knowledge augmentation, feature embedding, or multimodal modeling, to enhance their understanding and modeling capability for clinical laboratory indicators. This would expand their application in personalized disease prediction and risk stratification. Additionally, more large-scale validation studies based on real-world clinical scenarios should be conducted to assess the practical clinical utility and limitations of large language models in diagnostic assistance and treatment decision-making (37).
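One simple form of the knowledge augmentation mentioned above would be to label each laboratory value as low, normal, or high before it is passed to a language model as text, so that the model does not have to judge abnormality from raw numbers. The sketch below is only an illustration under stated assumptions: the marker names and ranges in `REFERENCE_RANGES` are placeholders, not clinical reference values, and `flag` and `describe` are hypothetical helpers that were not part of this study.

```python
from typing import Dict, Tuple

# Placeholder ranges for illustration only; these are NOT clinical reference
# values. A real system would use age-appropriate laboratory ranges.
REFERENCE_RANGES: Dict[str, Tuple[float, float]] = {
    "marker_a": (0.0, 10.0),
    "marker_b": (35.0, 55.0),
}


def flag(name: str, value: float) -> str:
    """Classify a value against its (assumed) reference range."""
    low, high = REFERENCE_RANGES[name]
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "normal"


def describe(labs: Dict[str, float]) -> str:
    """Render lab values with explicit abnormality labels as plain text."""
    return "; ".join(
        f"{name} = {value} ({flag(name, value)})" for name, value in labs.items()
    )


print(describe({"marker_a": 12.5, "marker_b": 40.0}))
```

Whether such pre-labeling actually improves language model predictions for IVIG resistance or CALs would need to be tested; the sketch only shows the preprocessing step.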

Limitations

This study has several limitations. First, the sample size is relatively limited, especially for subgroups such as patients with IVIG resistance and CALs, which may affect the stability and robustness of the model in these specific predictive tasks. Second, the model has not yet undergone prospective validation or external evaluation using multi-center datasets, which limits its generalizability and applicability in broader clinical settings. Third, the current model was developed based on a limited set of routine clinical and laboratory features, and it remains to be determined whether incorporating additional indicators—such as inflammatory biomarkers or genetic factors—could further enhance predictive performance, especially for early identification of high-risk patients. Moreover, as KD exhibits variations across different racial and ethnic groups, the absence of diverse population data may introduce bias and limit the model’s performance in non-represented cohorts. Future research should focus on expanding the sample size, incorporating multi-center and multi-ethnic cohorts, and integrating additional clinical features to establish a more robust and generalizable prediction model for KD diagnosis and prognostic risk assessment.


Conclusions

This study developed a multi-task machine learning model based on a unified set of clinical indicators, capable of simultaneously supporting KD diagnosis, predicting IVIG resistance, and assessing the risk of CALs. The model demonstrated strong performance across all tasks. This approach not only improves the efficiency of model development and enhances clinical applicability but also provides a technical foundation for task-specific management and precision intervention in KD. With the continued accumulation of clinical data and further model optimization, this framework holds promise as a valuable decision-support tool in real-world clinical practice.


Acknowledgments

None.


Footnote

Reporting Checklist: The authors have completed the TRIPOD reporting checklist. Available at https://tp.amegroups.com/article/view/10.21037/tp-2025-399/rc

Data Sharing Statement: Available at https://tp.amegroups.com/article/view/10.21037/tp-2025-399/dss

Peer Review File: Available at https://tp.amegroups.com/article/view/10.21037/tp-2025-399/prf

Funding: This work was supported by the National Natural Science Foundation of China (No. 82201903) and the China Postdoctoral Science Foundation (No. 2023M733392).

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://tp.amegroups.com/article/view/10.21037/tp-2025-399/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. All study procedures were approved by the Ethics Committee of The First Affiliated Hospital of University of Science and Technology of China (No. 2024-RE-487). As anonymized data were used, the requirement for informed consent was waived by the committee.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.


References

  1. Day-Lewis M, Son MBF, Lo MS. Kawasaki disease: contemporary perspectives. Lancet Child Adolesc Health 2024;8:781-92.
  2. McCrindle BW, Rowley AH, Newburger JW, et al. Diagnosis, Treatment, and Long-Term Management of Kawasaki Disease: A Scientific Statement for Health Professionals From the American Heart Association. Circulation 2017;135:e927-99.
  3. Zhao L, Wu J, Liu X, et al. Risk factors for predicting medium-giant coronary artery aneurysms in Kawasaki disease. Immunol Res 2025;73:52.
  4. Kuo HC, Lin MC, Kao CC, et al. Intravenous Immunoglobulin Alone for Coronary Artery Lesion Treatment of Kawasaki Disease: A Randomized Clinical Trial. JAMA Netw Open 2025;8:e253063.
  5. Sunaga Y, Watanabe A, Katsumata N, et al. A simple scoring model based on machine learning predicts intravenous immunoglobulin resistance in Kawasaki disease. Clin Rheumatol 2023;42:1351-61.
  6. Wang T, Liu G, Lin H. A machine learning approach to predict intravenous immunoglobulin resistance in Kawasaki disease patients: A study based on a Southeast China population. PLoS One 2020;15:e0237321.
  7. Song Z, Ming H, Liu B, et al. Development and validation of an explainable machine learning-based prediction model for primary Kawasaki disease complicated with coronary artery aneurysms. Transl Pediatr 2025;14:208-21.
  8. Liang K, Su D, Pang Y. Value of cystatin C for Kawasaki disease with coronary artery aneurysm. Transl Pediatr 2025;14:545-58.
  9. Lo J, Gauvreau K, Baker AL, et al. Multiple Emergency Department Visits for a Diagnosis of Kawasaki Disease: An Examination of Risk Factors and Outcomes. J Pediatr 2021;232:127-132.e3.
  10. Tsai CM, Lin CR, Kuo HC, et al. Use of Machine Learning to Differentiate Children With Kawasaki Disease From Other Febrile Children in a Pediatric Emergency Department. JAMA Netw Open 2023;6:e237489.
  11. Lam JY, Shimizu C, Tremoulet AH, et al. A machine-learning algorithm for diagnosis of multisystem inflammatory syndrome in children and Kawasaki disease in the USA: a retrospective model development and validation study. Lancet Digit Health 2022;4:e717-26.
  12. Xia Y, Huang Y, Gong M, et al. A machine learning-based model to predict intravenous immunoglobulin resistance in Kawasaki disease. iScience 2025;28:112004.
  13. Kobayashi T, Inoue Y, Takeuchi K, et al. Prediction of intravenous immunoglobulin unresponsiveness in patients with Kawasaki disease. Circulation 2006;113:2606-12.
  14. Sano T, Kurotobi S, Matsuzaki K, et al. Prediction of non-responsiveness to standard high-dose gamma-globulin therapy in patients with acute Kawasaki disease before starting initial treatment. Eur J Pediatr 2007;166:131-7.
  15. Lin MT, Chang CH, Sun LC, et al. Risk factors and derived formosa score for intravenous immunoglobulin unresponsiveness in Taiwanese children with Kawasaki disease. J Formos Med Assoc 2016;115:350-5.
  16. Gong X, Tang L, Wu M, et al. Development of a nomogram prediction model for early identification of persistent coronary artery aneurysms in kawasaki disease. BMC Pediatr 2023;23:79.
  17. Tang Y, Ding C, Xu Q, et al. Prediction nomogram for coronary artery aneurysms at one month in Kawasaki disease. Ital J Pediatr 2023;49:146.
  18. Xu D, Chen YS, Feng CH, et al. Development of a prediction model for progression of coronary artery lesions in Kawasaki disease. Pediatr Res 2024;95:1041-50.
  19. Wang Z, Gu Y, Huang L, et al. Construction of machine learning diagnostic models for cardiovascular pan-disease based on blood routine and biochemical detection data. Cardiovasc Diabetol 2024;23:351.
  20. Lundberg SM, Lee SI. A unified approach to interpreting model predictions. In: Advances in Neural Information Processing Systems 30 (NIPS 2017). 2017.
  21. Liu J, Zhang J, Huang H, et al. A Machine Learning Model to Predict Intravenous Immunoglobulin-Resistant Kawasaki Disease Patients: A Retrospective Study Based on the Chongqing Population. Front Pediatr 2021;9:756095.
  22. Lee YR, Bae EY, Kil HR, et al. Elevated Plasma Apurinic/Apyrimidinic Endonuclease 1/Redox Effector Factor-1 Levels in Refractory Kawasaki Disease. Biomedicines 2022;10:190.
  23. Kumrah R, Goyal T, Rawat A, et al. Markers of Endothelial Dysfunction in Kawasaki Disease: An Update. Clin Rev Allergy Immunol 2024;66:99-111.
  24. Wang X, Zhang L. Integrative machine learning identifies robust inflammation-related diagnostic biomarkers and stratifies immune-heterogeneous subtypes in Kawasaki disease. Pediatr Rheumatol Online J 2025;23:61.
  25. Sekine K, Mochizuki H, Inoue Y, et al. Regulation of oxidative stress in patients with Kawasaki disease. Inflammation 2012;35:952-8.
  26. Sharma C, Ganigara M, Galeotti C, et al. Multisystem inflammatory syndrome in children and Kawasaki disease: a critical comparison. Nat Rev Rheumatol 2021;17:731-48.
  27. Zhu YP, Shamie I, Lee JC, et al. Immune response to intravenous immunoglobulin in patients with Kawasaki disease and MIS-C. J Clin Invest 2021;131:e147076.
  28. Seki M, Minami T. Kawasaki Disease: Pathology, Risks, and Management. Vasc Health Risk Manag 2022;18:407-16.
  29. Patra PK, Banday AZ, Das RR, et al. Long-term vascular dysfunction in Kawasaki disease: systematic review and meta-analyses. Cardiol Young 2023;33:1614-26.
  30. Huang H, Dong J, Jiang J, et al. The role of FOXO4/NFAT2 signaling pathway in dysfunction of human coronary endothelial cells and inflammatory infiltration of vasculitis in Kawasaki disease. Front Immunol 2022;13:1090056.
  31. Rodríguez-Pérez R, Bajorath J. Interpretation of machine learning models using shapley values: application to compound potency and multi-target activity predictions. J Comput Aided Mol Des 2020;34:1013-26.
  32. Bedi S, Liu Y, Orr-Ewing L, et al. Testing and Evaluation of Health Care Applications of Large Language Models: A Systematic Review. JAMA 2025;333:319-28.
  33. Tan S, Xin X, Wu D. ChatGPT in medicine: prospects and challenges: a review article. Int J Surg 2024;110:3701-6.
  34. Yan C, Li Z, Liang Y, et al. Assessing large language models as assistive tools in medical consultations for Kawasaki disease. Front Artif Intell 2025;8:1571503.
  35. He Z, Bhasuran B, Jin Q, et al. Quality of Answers of Generative Large Language Models Versus Peer Users for Interpreting Laboratory Test Results for Lay Patients: Evaluation Study. J Med Internet Res 2024;26:e56655.
  36. Hager P, Jungmann F, Holland R, et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat Med 2024;30:2613-22.
  37. Miao BY, Williams CYK, Chinedu-Eneh E, et al. Understanding contraceptive switching rationales from real world clinical notes using large language models. NPJ Digit Med 2025;8:221.
Cite this article as: Wang D, Li F, Xie T, Zang X, Chen M. An integrated and interpretable machine learning framework for Kawasaki disease diagnosis and risk prediction. Transl Pediatr 2025;14(9):2145-2157. doi: 10.21037/tp-2025-399
