Early postnatal risk stratification for severe adverse outcomes in twin neonates admitted to the neonatal intensive care unit: development and temporal validation of an interpretable machine learning model
Highlight box
Key findings
• We developed and temporally validated an interpretable gradient boosting-based machine learning (ML) model using 10 routinely available clinical predictors to estimate early risk of severe adverse outcomes in twin neonates admitted to the neonatal intensive care unit (NICU). The model showed good discrimination in temporal validation (area under the curve, 0.844) and was implemented as a web-based risk calculator with Shapley additive explanations-based individual-level interpretability.
What is known and what is new?
• Existing neonatal risk scores were developed predominantly in singleton populations and may not adequately capture twin-specific risk factors such as chorionicity and within-pair outcome correlation. While ML has been applied to predict neonatal outcomes, few models have been developed or validated specifically for twin neonates.
• This study provides a twin-specific, temporally validated ML prediction model for composite severe adverse outcomes in twin neonates, integrating individualized risk estimation with clinician-facing interpretability.
What is the implication, and what should change now?
• The model may support early bedside risk stratification and risk communication for twin neonates within 24 hours of NICU admission, potentially enabling more targeted monitoring and timely preventive interventions. Multicenter external validation and prospective impact evaluation are needed before broader clinical implementation.
Introduction
As assisted reproductive technology has become more widely used and maternal age at childbirth has risen, the incidence of twin pregnancies has increased significantly (1-3). The perinatal mortality rate (26.1 per 1,000 total births) and neonatal death rate (15.7 per 1,000 live births) among twins have shown a downward trend but remain high (4,5). Compared with singleton pregnancies, twin pregnancies are associated with a higher probability of adverse neonatal outcomes, such as intraventricular hemorrhage (IVH), necrotizing enterocolitis (NEC), neonatal respiratory distress syndrome (RDS), and bronchopulmonary dysplasia (BPD) (6-10). These complications can result in multisystem morbidity and adversely affect long-term health and neurodevelopmental outcomes. Early risk stratification is therefore essential to inform monitoring intensity and timely clinical intervention. However, the factors influencing outcomes in twin pregnancies are complex, and validated prediction tools specifically designed for this population remain scarce. Existing neonatal risk scores, such as the Clinical Risk Index for Babies II (CRIB-II) and the Score for Neonatal Acute Physiology, Perinatal Extension II (SNAPPE-II) (11,12), were developed and validated predominantly in singleton populations and may not adequately capture twin-specific risk factors, including chorionicity, placental sharing, and the interdependence of co-twin outcomes. Consequently, prediction models trained in singleton populations may have limited transportability to twins due to shifts in baseline risk, differences in predictor-outcome relationships, and within-pair outcome correlation that can impair calibration. Machine learning (ML) can identify correlations and patterns in large volumes of patient data (13-15). ML-based clinical prediction models may improve risk prediction by leveraging nonlinear relationships and interactions in high-dimensional data.
ML is now widely used to develop clinical prediction models, many of which have demonstrated robust predictive value; however, relatively few models have been developed or validated specifically in twin neonates (16-19).
The objective of this study was to develop and validate a prediction model for severe adverse outcomes in twin neonates by systematically comparing multiple ML algorithms. We aimed to assess the robustness of the model in a temporal validation cohort and to interpret its predictions using Shapley additive explanations (SHAP), including global and patient-level explanations of risk estimates. This approach is intended to support clinician-facing risk stratification and risk communication, and to inform prioritization of monitoring intensity and timely preventive interventions in the neonatal intensive care unit (NICU). We present this article in accordance with the TRIPOD reporting checklist (available at https://tp.amegroups.com/article/view/10.21037/tp-2026-1-0007/rc).
Methods
Study population and data collection
The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. This retrospective study was approved by the Medical Ethics Committee of Shanxi Children’s Hospital (No. IRB-KYYN-2026-G004), and informed consent was waived because the data were retrospective and anonymized.
We consecutively enrolled twin neonates admitted to Shanxi Children’s Hospital between July 1, 2022 and June 30, 2023 (derivation cohort, n=912). The derivation cohort was randomly split into training (70%) and testing (30%) sets, with twin pairs from the same pregnancy kept together to prevent data leakage. Twins admitted between July 1, 2023 and December 31, 2023, formed the temporal validation cohort (n=592). Predictors with >20% missingness were excluded; remaining missing data were imputed using multivariate imputation by chained equations (MICE) with five imputed datasets under a missing-at-random assumption. Outcome status was not imputed. All steps in the modeling pipeline—including the imputation model, feature set, hyperparameters, and decision threshold—were fixed in the derivation cohort before application to the temporal validation cohort.
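The pair-preserving split described above can be illustrated with a minimal sketch. It assumes that each neonate carries a pregnancy identifier; the names `pregnancy_ids` and `split_by_pregnancy` are hypothetical and are not taken from the study code. Whole pregnancies, rather than individual neonates, are assigned to the training or testing set, so the 30% fraction applies to pregnancies in this sketch.

```python
import random
from collections import defaultdict


def split_by_pregnancy(pregnancy_ids, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx) so that co-twins always share a set."""
    # Collect the row indices that belong to each pregnancy.
    groups = defaultdict(list)
    for idx, pid in enumerate(pregnancy_ids):
        groups[pid].append(idx)

    # Shuffle pregnancies (not neonates) reproducibly.
    unique_ids = sorted(groups)
    random.Random(seed).shuffle(unique_ids)

    # The first share of shuffled pregnancies forms the testing set.
    n_test = round(len(unique_ids) * test_fraction)
    test_ids = unique_ids[:n_test]
    train_ids = unique_ids[n_test:]

    train_idx = sorted(i for pid in train_ids for i in groups[pid])
    test_idx = sorted(i for pid in test_ids for i in groups[pid])
    return train_idx, test_idx


# Hypothetical example: 10 pregnancies with two neonates each.
pregnancy_ids = [p for p in range(10) for _ in range(2)]
train_idx, test_idx = split_by_pregnancy(pregnancy_ids)
```

Splitting at the pregnancy level is what prevents one twin from informing predictions for its co-twin in the held-out data.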
Study size was determined by the availability of all eligible twin NICU admissions during the study period; no formal a priori sample size calculation was performed. The events per variable (EPV) ratio exceeded the commonly recommended minimum of 10, suggesting an adequate number of events per predictor for model development. The inclusion criteria were: (I) twin pregnancy; (II) gestational age >24 weeks; (III) birth weight ≥500 g; and (IV) availability of essential clinical data. The exclusion criteria were: (I) pregnancy termination via abortion, induced labor, or fetal reduction; (II) severe maternal complications involving internal or surgical diseases (e.g., chronic hypertension, pre-gestational diabetes, antiphospholipid antibody syndrome); and (III) incomplete medical records. These criteria were selected to minimize heterogeneity arising from extreme non-viability and major maternal comorbidities that could dominate neonatal risk and reduce model generalizability.
Relevant demographic and clinical data were extracted from the integrated electronic medical record system (Table 1). The primary outcome was a composite of severe adverse events: clinically significant neonatal anemia (requiring transfusion), RDS, early-onset sepsis (EOS), grade III/IV IVH, BPD, hemodynamically significant patent ductus arteriosus (hsPDA; defined as echocardiographically confirmed PDA requiring pharmacologic closure and/or surgical ligation), NEC ≥ stage IIA, periventricular leukomalacia (PVL), and pulmonary hypertension. A composite endpoint was chosen to increase the event rate and improve statistical precision while capturing the spectrum of clinically meaningful morbidities in NICU care; components generally require clinical intervention. Although these conditions span a range of severity, all represent clinically actionable morbidities that typically require escalation of NICU care; moreover, “moderate” components such as transfusion-requiring anemia and hsPDA frequently co-occur with or precede more severe complications (e.g., IVH, BPD), supporting their inclusion in a unified risk-stratification framework. Outcomes were ascertained using prespecified operational definitions based on electronic medical record (EMR)-documented attending neonatologist diagnoses, supplemented by objective treatment and procedure records where applicable (table available at https://cdn.amegroups.cn/static/public/tp-2026-1-0007-5.xlsx). Outcome abstractors were trained NICU staff blinded to the study hypothesis and not involved in model development.
Table 1
| Variables | Overall (n=912) | Without severe (n=597) | Severe (n=315) | P (severe vs. non-severe) | Training set (n=638) | Test set (n=274) | P (train vs. test) | SMD (train vs. test) |
|---|---|---|---|---|---|---|---|---|
| Maternal age (years) | 31.08±3.95 | 31.17±3.98 | 30.91±3.89 | 0.24 | 30.96±3.83 | 31.36±4.22 | 0.24 | 0.102 |
| Median (IQR) | 31.00 [28.00–34.00] | 31.00 [28.00–34.00] | 31.00 [28.00–33.00] | | 31.00 [28.00–33.00] | 31.00 [28.00–34.00] | | |
| Pre-pregnancy BMI (kg/m²) | 23.30±3.71 | 23.21±3.64 | 23.45±3.84 | 0.41 | 23.32±3.68 | 23.24±3.78 | 0.58 | −0.023 |
| Median (IQR) | 22.66 [20.70–25.39] | 22.65 [20.70–25.34] | 22.86 [20.70–25.64] | | 22.86 [20.70–25.39] | 22.60 [20.45–25.61] | | |
| Gestational age (weeks) | 35.45±2.45 | 36.28±1.85 | 33.87±2.67 | <0.001 | 35.37±2.64 | 35.63±1.93 | 0.59 | 0.105 |
| Median (IQR) | 36.29 [34.67–37.00] | 36.86 [36.00–37.29] | 34.57 [32.36–35.64] | | 36.29 [34.43–37.14] | 36.14 [35.00–37.00] | | |
| Birth weight (g) | 2,317.53±535.71 | 2,479.53±458.63 | 2,010.49±537.49 | <0.001 | 2,304.33±566.40 | 2,348.27±455.78 | 0.70 | 0.082 |
| Median (IQR) | 2,400.00 [2,060.00–2,670.00] | 2,530.00 [2,270.00–2,750.00] | 2,070.00 [1,639.50–2,380.00] | | 2,400.00 [2,030.00–2,680.00] | 2,400.00 [2,105.00–2,650.00] | | |
| Conception method | | | | 0.17 | | | 0.07 | 0.163 |
| Natural conception | 275 (30.2) | 172 (28.8) | 103 (32.7) | | 178 (27.9) | 97 (35.4) | | |
| Assisted reproductive technology | 544 (59.6) | 369 (61.8) | 175 (55.6) | | 391 (61.3) | 153 (55.8) | | |
| Ovarian stimulation | 93 (10.2) | 56 (9.4) | 37 (11.7) | | 69 (10.8) | 24 (8.8) | | |
| Parity | | | | 0.51 | | | 0.29 | 0.081 |
| No | 480 (52.6) | 309 (51.8) | 171 (54.3) | | 328 (51.4) | 152 (55.5) | | |
| Yes | 432 (47.4) | 288 (48.2) | 144 (45.7) | | 310 (48.6) | 122 (44.5) | | |
| Delivery mode | | | | 0.005 | | | 0.20 | 0.103 |
| Cesarean delivery | 48 (5.3) | 22 (3.7) | 26 (8.3) | | 38 (6.0) | 10 (3.6) | | |
| Vaginal delivery | 864 (94.7) | 575 (96.3) | 289 (91.7) | | 600 (94.0) | 264 (96.4) | | |
| Chorionicity | | | | <0.001 | | | 0.57 | 0.048 |
| Monochorionic | 142 (15.6) | 70 (11.7) | 72 (22.9) | | 96 (15.0) | 46 (16.8) | | |
| Dichorionic | 770 (84.4) | 527 (88.3) | 243 (77.1) | | 542 (85.0) | 228 (83.2) | | |
| Gestational diabetes | | | | 0.12 | | | 0.27 | 0.084 |
| No | 672 (73.7) | 450 (75.4) | 222 (70.5) | | 463 (72.6) | 209 (76.3) | | |
| Yes | 240 (26.3) | 147 (24.6) | 93 (29.5) | | 175 (27.4) | 65 (23.7) | | |
| Gestational hypertension | | | | 0.004 | | | 0.047 | 0.150 |
| No | 727 (79.7) | 493 (82.6) | 234 (74.3) | | 497 (77.9) | 230 (83.9) | | |
| Yes | 185 (20.3) | 104 (17.4) | 81 (25.7) | | 141 (22.1) | 44 (16.1) | | |
| Intrahepatic cholestasis | | | | 0.01 | | | 0.79 | 0.030 |
| No | 861 (94.4) | 572 (95.8) | 289 (91.7) | | 601 (94.2) | 260 (94.9) | | |
| Yes | 51 (5.6) | 25 (4.2) | 26 (8.3) | | 37 (5.8) | 14 (5.1) | | |
| Gestational anemia | | | | <0.001 | | | 0.03 | 0.161 |
| No | 733 (80.4) | 501 (83.9) | 232 (73.7) | | 525 (82.3) | 208 (75.9) | | |
| Yes | 179 (19.6) | 96 (16.1) | 83 (26.3) | | 113 (17.7) | 66 (24.1) | | |
| Placenta previa | | | | 0.043 | | | 0.85 | 0.030 |
| No | 889 (97.5) | 587 (98.3) | 302 (95.9) | | 621 (97.3) | 268 (97.8) | | |
| Yes | 23 (2.5) | 10 (1.7) | 13 (4.1) | | 17 (2.7) | 6 (2.2) | | |
| PROM | | | | <0.001 | | | 0.06 | 0.140 |
| No | 765 (83.9) | 531 (88.9) | 234 (74.3) | | 545 (85.4) | 220 (80.3) | | |
| Yes | 147 (16.1) | 66 (11.1) | 81 (25.7) | | 93 (14.6) | 54 (19.7) | | |
| Placental abruption | | | | 0.49 | | | >0.99 | 0.003 |
| No | 889 (97.5) | 584 (97.8) | 305 (96.8) | | 622 (97.5) | 267 (97.4) | | |
| Yes | 23 (2.5) | 13 (2.2) | 10 (3.2) | | 16 (2.5) | 7 (2.6) | | |
| Meconium staining III | | | | 0.52 | | | 0.56 | 0.057 |
| No | 886 (97.1) | 582 (97.5) | 304 (96.5) | | 618 (96.9) | 268 (97.8) | | |
| Yes | 26 (2.9) | 15 (2.5) | 11 (3.5) | | 20 (3.1) | 6 (2.2) | | |
| Antepartum hemorrhage | | | | 0.19 | | | 0.18 | 0.116 |
| No | 906 (99.3) | 595 (99.7) | 311 (98.7) | | 632 (99.1) | 274 (100.0) | | |
| Yes | 6 (0.7) | 2 (0.3) | 4 (1.3) | | 6 (0.9) | 0 (0.0) | | |
| Postpartum hemorrhage | | | | 0.66 | | | 0.55 | 0.053 |
| No | 860 (94.3) | 561 (94.0) | 299 (94.9) | | 604 (94.7) | 256 (93.4) | | |
| Yes | 52 (5.7) | 36 (6.0) | 16 (5.1) | | 34 (5.3) | 18 (6.6) | | |
| Chorioamnionitis | | | | 0.11 | | | 0.09 | 0.156 |
| No | 910 (99.8) | 597 (100.0) | 313 (99.4) | | 638 (100.0) | 272 (99.3) | | |
| Yes | 2 (0.2) | 0 (0.0) | 2 (0.6) | | 0 (0.0) | 2 (0.7) | | |
| Placental adherence | | | | 0.90 | | | 0.35 | 0.077 |
| No | 851 (93.3) | 558 (93.5) | 293 (93.0) | | 599 (93.9) | 252 (92.0) | | |
| Yes | 61 (6.7) | 39 (6.5) | 22 (7.0) | | 39 (6.1) | 22 (8.0) | | |
| TTTS | | | | 0.01 | | | 0.46 | 0.068 |
| No | 903 (99.0) | 595 (99.7) | 308 (97.8) | | 633 (99.2) | 270 (98.5) | | |
| Yes | 9 (1.0) | 2 (0.3) | 7 (2.2) | | 5 (0.8) | 4 (1.5) | | |
| Placenta accreta | | | | >0.99 | | | 0.01 | 0.203 |
| No | 894 (98.0) | 585 (98.0) | 309 (98.1) | | 620 (97.2) | 274 (100.0) | | |
| Yes | 18 (2.0) | 12 (2.0) | 6 (1.9) | | 18 (2.8) | 0 (0.0) | | |
| Fetal sex | | | | 0.49 | | | 0.55 | 0.048 |
| Male | 488 (53.5) | 314 (52.6) | 174 (55.2) | | 346 (54.2) | 142 (51.8) | | |
| Female | 424 (46.5) | 283 (47.4) | 141 (44.8) | | 292 (45.8) | 132 (48.2) | | |
| FGR | | | | 0.85 | | | 0.28 | 0.085 |
| No | 801 (87.8) | 523 (87.6) | 278 (88.3) | | 555 (87.0) | 246 (89.8) | | |
| Yes | 111 (12.2) | 74 (12.4) | 37 (11.7) | | 83 (13.0) | 28 (10.2) | | |
| Neonatal asphyxia | | | | 0.03 | | | 0.73 | 0.050 |
| No | 902 (98.9) | 594 (99.5) | 308 (97.8) | | 630 (98.7) | 272 (99.3) | | |
| Yes | 10 (1.1) | 3 (0.5) | 7 (2.2) | | 8 (1.3) | 2 (0.7) | | |
| Neonatal hypoglycemia | | | | <0.001 | | | 0.65 | 0.041 |
| No | 821 (90.0) | 561 (94.0) | 260 (82.5) | | 572 (89.7) | 249 (90.9) | | |
| Yes | 91 (10.0) | 36 (6.0) | 55 (17.5) | | 66 (10.3) | 25 (9.1) | | |
| Congenital malformation | | | | <0.001 | | | >0.99 | 0.007 |
| No | 840 (92.1) | 580 (97.2) | 260 (82.5) | | 588 (92.2) | 252 (92.0) | | |
| Yes | 72 (7.9) | 17 (2.8) | 55 (17.5) | | 50 (7.8) | 22 (8.0) | | |
Data are presented as mean ± standard deviation, median [IQR] or n (%). The “Overall” column refers to all twin neonates in the derivation cohort (n=912). The “Without severe” and “Severe” columns refer to neonates without and with composite severe adverse outcomes in the derivation cohort, respectively. The “Training set” and “Test set” columns refer to the randomly assigned internal training (n=638) and testing (n=274) subsets derived from the same cohort. P (train vs. test) denotes the P value for the comparison between the training and testing sets, calculated using the independent-samples t-test or Mann-Whitney U test for continuous variables and the χ2 test or Fisher’s exact test for categorical variables, as appropriate. SMD (train vs. test) denotes the absolute standardized mean difference between the training and testing sets; |SMD| <0.10 was considered to indicate good balance between groups. BMI, body mass index; FGR, fetal growth restriction; IQR, interquartile range; PROM, premature rupture of membranes; SMD, standardized mean difference; TTTS, twin-to-twin transfusion syndrome.
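For readers unfamiliar with the SMD reported in Table 1, the sketch below shows one common definition, in which the difference in means (or proportions) is divided by the square root of the average of the two group variances. The exact estimator used for Table 1 is not stated in the text, so this formula is an assumption, and the function names are hypothetical. Because the inputs below are rounded values read from Table 1, the results land close to, but not exactly on, the tabulated SMDs.

```python
import math


def smd_continuous(mean_1, sd_1, mean_2, sd_2):
    """Standardized mean difference for a continuous variable."""
    # Assumed pooling: square root of the average of the two variances.
    pooled_sd = math.sqrt((sd_1 ** 2 + sd_2 ** 2) / 2)
    return (mean_1 - mean_2) / pooled_sd


def smd_binary(p_1, p_2):
    """Standardized difference for a binary variable given two proportions."""
    pooled_sd = math.sqrt((p_1 * (1 - p_1) + p_2 * (1 - p_2)) / 2)
    return (p_1 - p_2) / pooled_sd


# Inputs read from Table 1 (training set vs. test set).
maternal_age_smd = smd_continuous(30.96, 3.83, 31.36, 4.22)
hypertension_smd = smd_binary(141 / 638, 44 / 274)

# The table footnote treats |SMD| < 0.10 as good balance.
hypertension_balanced = abs(hypertension_smd) < 0.10
```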
Feature selection
Multicollinearity was assessed using the variance inflation factor (VIF); predictors with VIF >5 were excluded. Four feature selection strategies were compared: (I) light gradient boosting machine (LightGBM)-based feature importance; (II) least absolute shrinkage and selection operator (LASSO) regression; (III) K-Best [analysis of variance (ANOVA) F-values]; and (IV) an intersection approach retaining predictors selected by at least two methods (20). The maximum number of features was pre-specified at 10 to ensure clinical feasibility and mitigate overfitting. Feature selection was fitted in training folds and applied to validation folds to prevent information leakage. Cross-validated area under the curve (AUC) values, estimated by repeated 5-fold cross-validation across feature counts (k =5, 8, 10, 12, 15), plateaued at k =10, supporting the final 10-predictor subset.
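The intersection strategy (IV) can be read as a simple voting rule. The sketch below is one possible implementation under stated assumptions: the helper name `vote_features`, the tie-breaking order (more votes first, then alphabetical), and the example feature lists are all hypothetical and do not reproduce the study's actual selections.

```python
from collections import Counter


def vote_features(selections, min_votes=2, max_features=10):
    """Keep features chosen by at least `min_votes` selection methods."""
    # Count each feature once per method, even if a method lists it twice.
    counts = Counter(
        feature for chosen in selections.values() for feature in set(chosen)
    )
    kept = [feature for feature, votes in counts.items() if votes >= min_votes]
    # Assumed tie-break: most votes first, then alphabetical order.
    kept.sort(key=lambda feature: (-counts[feature], feature))
    return kept[:max_features]


# Hypothetical selections from the three base methods.
selections = {
    "lightgbm": ["gestational_age", "birth_weight", "chorionicity"],
    "lasso": ["gestational_age", "birth_weight", "gestational_anemia"],
    "kbest": ["gestational_age", "chorionicity", "prom"],
}
consensus = vote_features(selections)
```

To avoid information leakage, a rule like this would be applied inside each training fold rather than once on the full dataset, consistent with the fold-wise fitting described above.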
Model development and validation
Ten ML algorithms were trained on each feature subset: logistic regression, artificial neural network, decision tree, extremely randomized trees, gradient boosting (GB; scikit-learn), k-nearest neighbors, LightGBM, random forest (RF), support vector machine, and extreme gradient boosting (XGBoost). Combining four feature selection methods with ten algorithms yielded 40 candidate models. To address class imbalance, Synthetic Minority Over-sampling Technique (SMOTE) was applied to the training data within each cross-validation fold, without resampling validation folds. Hyperparameters were tuned via grid search with five-fold cross-validation; search spaces and optimal values are detailed in table available at https://cdn.amegroups.cn/static/public/tp-2026-1-0007-3.xlsx. Model performance was evaluated using discrimination (AUC), calibration (Brier score, calibration plots), and classification metrics (sensitivity, specificity, positive and negative predictive values, F1 score). To enable holistic comparison across performance domains, we calculated a composite score as previously described (21), by averaging four scaled components: normalized mean AUC, inverted normalized mean Brier score, inverted normalized standard deviation (SD) (AUC), and inverted normalized SD (Brier) across repeated cross-validation folds, with min–max scaling to [0,1] across candidate models (details in Appendix 1). Decision curve analysis (DCA) was used to assess clinical utility. Performance across candidate models is summarized in table available at https://cdn.amegroups.cn/static/public/tp-2026-1-0007-1.xlsx. Once optimal hyperparameters were determined, the selected final model was frozen; no recalibration was performed during temporal validation to preserve a strict assessment of transportability.
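The composite score can be sketched as follows. This is a minimal illustration of the four-component average described above, not the authors' code (the full specification is in Appendix 1). The dictionary keys, the function names, the handling of a zero range, and all numeric values are hypothetical.

```python
def min_max(values):
    """Scale values to [0, 1] across candidate models."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: if all candidates tie, give each a neutral 0.5.
        return [0.5] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def composite_scores(models):
    """Average of four scaled components; higher is better."""
    names = list(models)

    def scaled(key, invert):
        s = min_max([models[name][key] for name in names])
        # Invert components where a lower raw value is better.
        return [1 - v for v in s] if invert else s

    auc = scaled("auc_mean", invert=False)
    brier = scaled("brier_mean", invert=True)
    auc_sd = scaled("auc_sd", invert=True)
    brier_sd = scaled("brier_sd", invert=True)
    return {
        name: (a + b + c + d) / 4
        for name, a, b, c, d in zip(names, auc, brier, auc_sd, brier_sd)
    }


# Hypothetical cross-validation summaries for three candidate models.
candidates = {
    "model_a": {"auc_mean": 0.85, "brier_mean": 0.16, "auc_sd": 0.02, "brier_sd": 0.010},
    "model_b": {"auc_mean": 0.84, "brier_mean": 0.17, "auc_sd": 0.03, "brier_sd": 0.012},
    "model_c": {"auc_mean": 0.75, "brier_mean": 0.20, "auc_sd": 0.05, "brier_sd": 0.020},
}
scores = composite_scores(candidates)
```

Because min-max scaling is relative to the candidate set, a composite score is only comparable among models scored together.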
Model interpretations and clinical application
Model interpretability was assessed using SHAP (22). A SHAP summary plot was used to present global feature importance, while individual waterfall plots provided patient-level explanations of risk estimates. A web-based risk calculator was developed for clinical use, stratifying patients into high- and low-risk groups using a Youden index-derived probability cutoff from the training set (cutoff ≈0.50; sensitivity 87.6%, specificity 88.2%).
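The Youden index selects the probability cutoff that maximizes J = sensitivity + specificity − 1. The sketch below scans every observed predicted probability as a candidate cutoff. The function name, the rule of classifying `p >= cutoff` as high risk, and the toy data are assumptions for illustration only.

```python
def youden_cutoff(y_true, y_prob):
    """Return (cutoff, sensitivity, specificity) that maximizes Youden's J."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best = (None, 0.0, 0.0)
    best_j = -1.0
    for cutoff in sorted(set(y_prob)):
        # Neonates with predicted risk at or above the cutoff are "high risk".
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= cutoff)
        tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < cutoff)
        sensitivity = tp / n_pos
        specificity = tn / n_neg
        j = sensitivity + specificity - 1
        if j > best_j:  # on ties, the lowest cutoff is kept
            best_j = j
            best = (cutoff, sensitivity, specificity)
    return best


# Toy training-set predictions (hypothetical values).
y_true = [0, 0, 0, 1, 1, 0, 1, 1, 1]
y_prob = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
cutoff, sensitivity, specificity = youden_cutoff(y_true, y_prob)
```

As in the study, a cutoff derived this way would be fixed on the training set and then applied unchanged to later cohorts.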
Statistical analysis
Analyses were performed in Python 3.12.3. Continuous variables are presented as mean ± SD or median [interquartile range (IQR)] and compared using t-tests or Mann-Whitney U tests; categorical variables are presented as n (%) and compared using χ2 or Fisher’s exact tests. Two-sided P<0.05 was considered significant. Because twin births introduce within-pair correlation that may affect model performance estimates, twin pairs from the same pregnancy were kept within the same data split to prevent data leakage. To quantify the impact of this correlation on discrimination estimates, we compared standard bootstrap (resampling individuals) with cluster-robust bootstrap (resampling at the pregnancy level) in the temporal validation cohort (1,000 iterations); results are reported in table available at https://cdn.amegroups.cn/static/public/tp-2026-1-0007-4.xlsx and Figure S1.
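The difference between the two bootstrap schemes is what gets resampled: individuals in the standard bootstrap, whole pregnancies in the cluster bootstrap. The sketch below shows the cluster version with a percentile interval. It is an illustration under stated assumptions rather than the study's implementation; the function names, the synthetic data generator, and the approximate percentile indexing are all hypothetical.

```python
import math
import random


def auc(y_true, y_prob):
    """AUC as the probability that a random event outranks a random non-event."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def cluster_bootstrap_auc(y_true, y_prob, cluster_ids, n_boot=1000, seed=0):
    """Approximate 95% percentile CI for AUC, resampling whole clusters.

    Assumes the data contain both outcome classes; otherwise no valid
    resample can be drawn.
    """
    rng = random.Random(seed)
    clusters = {}
    for i, cid in enumerate(cluster_ids):
        clusters.setdefault(cid, []).append(i)
    keys = list(clusters)
    stats = []
    while len(stats) < n_boot:
        # Draw pregnancies with replacement; co-twins always travel together.
        drawn = rng.choices(keys, k=len(keys))
        idx = [i for cid in drawn for i in clusters[cid]]
        ys = [y_true[i] for i in idx]
        if 0 < sum(ys) < len(ys):  # AUC is undefined without both classes
            stats.append(auc(ys, [y_prob[i] for i in idx]))
    stats.sort()
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot) - 1]


# Synthetic twin data: a shared pregnancy-level term induces within-pair correlation.
rng = random.Random(1)
cluster_ids, y_true, y_prob = [], [], []
for pregnancy in range(60):
    shared = rng.gauss(0, 1)
    for _ in range(2):
        risk = shared + rng.gauss(0, 1)
        cluster_ids.append(pregnancy)
        y_true.append(1 if risk + rng.gauss(0, 1) > 0.5 else 0)
        y_prob.append(1 / (1 + math.exp(-risk)))

ci_low, ci_high = cluster_bootstrap_auc(y_true, y_prob, cluster_ids, n_boot=200)
```

A standard bootstrap would instead draw the 120 neonates individually, which ignores that co-twins are not independent observations.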
Results
Patient characteristics
The study initially comprised 1,756 twin infants, with 1,156 in the derivation cohort and 600 in the temporal validation cohort. After applying exclusion criteria, the final derivation cohort included 912 neonates (315 events, 34.5%), split into training (n=638) and testing (n=274) sets at a 7:3 ratio with twin pairs kept together. The temporal validation cohort comprised 592 neonates (223 events, 37.7%). Baseline characteristics are detailed in Table 1, and the study flow is illustrated in Figure 1.
Feature selection
No severe multicollinearity was detected (all VIF <5). Among the four feature selection strategies compared, LASSO yielded the optimal predictor subset based on cross-validated performance stability at k =10 features. The selected predictors are visualized in Figure 2A, with the correlation heatmap (Figure 2B) demonstrating low redundancy. Alternative rankings (LightGBM, K-Best) are shown in Figure S2.
Model development and validation
Forty candidate models (4 feature selection strategies × 10 algorithms) were developed using grid search with five-fold cross-validation. Model performance was evaluated using discrimination (AUC), calibration (Brier score), classification metrics, and a composite performance score. Results for all 40 configurations are summarized in table available at https://cdn.amegroups.cn/static/public/tp-2026-1-0007-1.xlsx. ROC curves comparing all algorithms are provided in Figure S3. In the internal testing set, GB and RF using the LASSO-selected predictors showed similar discrimination (AUC 0.848 vs. 0.841) and calibration (Brier 0.158 vs. 0.162), with comparable precision-recall performance [average precision (AP) 0.749 vs. 0.745; baseline prevalence 0.35]. DCA showed positive net benefit for both models across a range of threshold probabilities (Figure 3A-3D).
Based on composite performance in the derivation cohort, GB and RF with LASSO-selected predictors emerged as the two top-performing models and were advanced to temporal validation for final selection.
Temporal validation and model selection
Baseline characteristics of the temporal validation cohort were compared with the derivation cohort (table available at https://cdn.amegroups.cn/static/public/tp-2026-1-0007-2.xlsx). Compared with the derivation cohort, the temporal validation cohort had a higher event rate (37.7% vs. 34.5%), lower mean gestational age (34.96 vs. 35.45 weeks), lower birth weight (2,178 vs. 2,318 g), and a higher proportion of monochorionic twins (26.2% vs. 15.6%). Most variables remained well balanced [|standardized mean difference (SMD)| <0.10 for 18 of 27 variables].
In temporal validation, discrimination was similar between models: RF achieved an AUC of 0.851 [95% confidence interval (CI): 0.820–0.881] and GB an AUC of 0.844 (95% CI: 0.812–0.875) (Figure 4A). Precision-recall performance was comparable (AP 0.751 for RF vs. 0.734 for GB; baseline prevalence 0.38) (Figure 4B). Calibration was acceptable for both models (GB Brier 0.162; RF 0.162) with satisfactory agreement between observed and predicted risks (Figure 4C). DCA demonstrated positive net benefit for both models across a wide range of threshold probabilities (Figure 4D).
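For readers less familiar with DCA, net benefit at a threshold probability p_t is TP/n − (FP/n) × p_t/(1 − p_t), and it is compared against the "treat all" and "treat none" (net benefit 0) strategies. The following sketch uses the standard formula with hypothetical function names and toy data; it does not reproduce the study's curves.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating neonates with predicted risk >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - fp / n * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Net benefit of the reference strategy that treats every neonate."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Toy predictions (hypothetical values).
y_true = [0, 0, 0, 1, 1, 0, 1, 1, 1, 0]
y_prob = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.05]

model_nb = net_benefit(y_true, y_prob, threshold=0.5)
treat_all_nb = net_benefit_treat_all(y_true, threshold=0.5)
```

A decision curve repeats this calculation over a range of thresholds; a model is useful at a given threshold when its net benefit exceeds both reference strategies.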
To inform final model selection, we compared classification metrics at the training-derived Youden cutoff (≈0.50) and composite performance scores in the temporal cohort (Figure S4). GB achieved higher accuracy (0.771 vs. 0.760), specificity (0.853 vs. 0.828), and F1 score (0.647 vs. 0.628), whereas RF had slightly higher sensitivity (0.695 vs. 0.659). Overall, GB attained a marginally higher composite score. Given comparable discrimination and calibration, with higher specificity and F1 score at the cost of slightly lower sensitivity, GB was selected as the final model for deployment.
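The trade-off described above follows from how each metric is computed from the confusion matrix at a fixed cutoff. The sketch below uses the standard definitions with hypothetical counts; it is not intended to reproduce the reported values.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion-matrix counts at a fixed cutoff."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        # F1 is the harmonic mean of PPV (precision) and sensitivity (recall).
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


# Hypothetical counts for 200 neonates, 100 of whom had a severe outcome.
metrics = classification_metrics(tp=60, fp=20, tn=80, fn=40)
```

Raising the cutoff moves neonates from the predicted-positive to the predicted-negative group, which tends to raise specificity and lower sensitivity, the pattern seen when comparing GB with RF.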
Model interpretation and clinical application
The interpretability of the final GB model was assessed using SHAP. The SHAP summary plot (Figure 5A) ranks predictors by their overall contribution to the predicted risk of severe adverse outcomes. Gestational age and birth weight were the most influential features, with lower values generally contributing to higher predicted risk, consistent with established neonatal risk factors. Neonatal hypoglycemia and congenital malformation also contributed positively to risk estimates. Importantly, the model’s clinical value lies not in identifying novel predictors but in synthesizing multiple variables into individualized, quantitative risk estimates for each twin neonate.
SHAP-based explanations are also provided at the individual level. For a representative high-risk neonate (predicted probability ≈0.87, true outcome: severe event), the waterfall plot in Figure 5B illustrates that lower gestational age and birth weight made the largest positive contributions to the predicted risk. In contrast, for a representative low-risk neonate in the temporal validation cohort (predicted probability ≈0.08, true outcome: no severe event), the waterfall plot in Figure 5C shows that relatively higher gestational age and birth weight, together with absence of congenital malformation, pushed the prediction toward lower risk. Such case-level explanations may help NICU clinicians understand why the model flags a given twin neonate as high or low risk and may support clinician-facing risk communication and prioritization of monitoring intensity.
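The additive decomposition behind a waterfall plot can be illustrated with exact Shapley values on a toy risk function. This is not the SHAP library's algorithm and not the study model: the risk function, its coefficients, the single reference neonate used as the baseline, and all names are hypothetical. Enumerating every coalition is only feasible for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values; features outside a coalition take baseline values."""
    names = list(x)
    n = len(names)

    def value(coalition):
        z = {k: (x[k] if k in coalition else baseline[k]) for k in names}
        return predict(z)

    phi = {}
    for name in names:
        others = [m for m in names if m != name]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of `name` when added to coalition s.
                total += weight * (value(s | {name}) - value(s))
        phi[name] = total
    return phi


def toy_risk(z):
    """Hypothetical risk score with an interaction term (not the GB model)."""
    preterm = z["gestational_age_weeks"] < 34
    small = z["birth_weight_g"] < 1500
    return (0.05 + 0.30 * preterm + 0.20 * small
            + 0.15 * z["congenital_malformation"] + 0.10 * (preterm and small))


patient = {"gestational_age_weeks": 31, "birth_weight_g": 1400, "congenital_malformation": 0}
reference = {"gestational_age_weeks": 36, "birth_weight_g": 2400, "congenital_malformation": 0}
contributions = shapley_values(toy_risk, patient, reference)
```

The contributions sum to the difference between the patient's predicted risk and the reference prediction, which is the property a waterfall plot displays bar by bar.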
To facilitate clinical use, the final GB model was implemented as an online risk calculator. After entering the 10 clinical predictors through a simple web interface, the calculator returns the predicted probability of severe adverse outcomes together with a risk category and clinician-facing explanations to support risk communication and monitoring prioritization (Figure 5D).
Sensitivity analysis: within-pair correlation
Twin outcomes showed moderate within-pair dependence, with 70.5% concordance across both cohorts (tetrachoric correlation 0.19; Cohen’s kappa 0.36). The concordance rate was higher in the temporal validation cohort (77.4%) than in the derivation cohort (66.0%). Accounting for this dependence using pregnancy-level cluster bootstrap in the temporal validation cohort (1,000 iterations) yielded similar point estimates for discrimination compared with standard bootstrap: AUC 0.846 (95% CI: 0.806–0.884) vs. 0.847 (95% CI: 0.814–0.877), with a CI width ratio of 1.25. The modest widening of the confidence interval under cluster resampling is expected given the within-pair dependence structure and suggests that, while within-pair correlation modestly increases uncertainty, the model’s discriminative performance remains robust (Figure S5 and table available at https://cdn.amegroups.cn/static/public/tp-2026-1-0007-4.xlsx).
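Concordance and Cohen's kappa for twin pairs can be computed from paired binary outcomes, as sketched below. How co-twins were ordered within each pair in the study is not stated, so the (first twin, second twin) ordering here is an assumption, as are the function name and the pair counts. The tetrachoric correlation is not computed in this sketch.

```python
def pair_agreement(pairs):
    """Return (concordance, kappa) for a list of (twin_a, twin_b) 0/1 outcomes."""
    n = len(pairs)
    # Observed agreement: both twins affected or both unaffected.
    observed = sum(a == b for a, b in pairs) / n
    p_a = sum(a for a, _ in pairs) / n
    p_b = sum(b for _, b in pairs) / n
    # Agreement expected by chance if co-twin outcomes were independent.
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical pair counts: 40 both unaffected, 20 both affected, 20 discordant.
pairs = [(0, 0)] * 40 + [(1, 1)] * 20 + [(1, 0)] * 10 + [(0, 1)] * 10
concordance, kappa = pair_agreement(pairs)
```

Kappa is lower than raw concordance because it discounts the agreement that would arise by chance from the event rate alone.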
Discussion
In this retrospective single-center cohort of twin neonates admitted to the NICU, we developed and temporally validated an interpretable ML model to estimate early risk of severe adverse outcomes in both spontaneously conceived and assisted reproductive technology-conceived twin pregnancies. The final GB model, based on 10 routinely available clinical predictors within the first 24 hours of admission, demonstrated good discrimination in temporal validation (AUC 0.844) and was implemented as a web-based risk calculator. Rather than identifying novel risk factors—most of which overlap with established neonatal risk determinants—the model integrates multiple early clinical variables into individualized, quantitative risk estimates that can support risk stratification and planning of monitoring and preventive care for each twin neonate.
The 10 LASSO-selected predictors—four neonatal factors (gestational age, birth weight, neonatal hypoglycemia, and congenital malformation) and six maternal-perinatal factors (chorionicity, gestational hypertension, gestational hypothyroidism, intrahepatic cholestasis of pregnancy, gestational anemia, and grade III meconium staining)—are consistent with established clinical evidence. Low gestational age and low birth weight reflect organ immaturity and growth restriction and are well-recognized determinants of severe neonatal morbidity, including IVH and BPD (23). Neonatal hypoglycemia has been associated with adverse neurodevelopmental outcomes, including impairments in executive and visual-motor function in early childhood, even after mild or transient episodes (24-26). Chorionicity is a key determinant in twin pregnancy management, and monochorionic placentation is associated with higher perinatal mortality and morbidity than dichorionic placentation (27,28). The remaining maternal-perinatal complications in the model—gestational hypertension, hypothyroidism, intrahepatic cholestasis, anemia, and meconium staining—may impair placental perfusion and fetal oxygenation or serve as markers of fetal compromise, consistent with the Developmental Origins of Health and Disease framework (29-34). None of these predictors are novel; rather, the model’s contribution lies in integrating them into individualized, quantitative risk estimates that go beyond what any single marker or conventional scoring system can provide. All 10 variables are obtainable within 24 hours of NICU admission, supporting early clinician-facing risk stratification.
Twin pregnancies present greater challenges than singleton pregnancies and are associated with substantially higher perinatal risk. Stillbirth rates have been reported to be approximately 13-fold higher in monochorionic and 5-fold higher in dichorionic twin pregnancies compared with singleton pregnancies (35-37). Recent meta-analyses have similarly reported higher stillbirth prevalence in monochorionic compared with dichorionic twin pregnancies (38). Currently, dedicated tools for predicting adverse outcomes in twin neonates are lacking. Previous studies have either focused on adverse perinatal outcomes of twin pregnancies (e.g., obstetric complications, postpartum hemorrhage, and preterm birth) (37,39,40) or targeted specific clinical conditions of twin pregnancies, such as small-for-gestational-age infants, increased nuchal translucency, selective fetal growth restriction in monochorionic twins, and preeclampsia screening in twin pregnancies (41-44). Most of these studies have considered early post-birth outcomes and relied on single markers or influencing factors. In contrast, our study focused on a broader range of postnatal adverse outcomes and used a composite outcome to increase the number of endpoint events. This choice was intended to improve model robustness, enhance clinical relevance and practicality, and align the prediction target with real-world clinical decision-making. ML offers a flexible framework for integrating these complex multidimensional risk factors into a unified predictive tool. While ML models have been applied to predict neonatal outcomes in preterm or growth-restricted populations, they have often focused on short-term or single-disease endpoints and were not developed specifically for twin neonates (16-19). To our knowledge, few studies have developed and temporally validated ML models for composite postnatal severe outcomes specifically in twin neonates admitted to the NICU.
Compared with previous research, we integrated more comprehensive perinatal maternal data and early neonatal clinical information, which may contribute to the model’s discriminative ability in this specific population.
To facilitate clinical translation, we provided clinician-facing interpretability using SHAP, including both global and patient-level explanations. Global explanations showed that gestational age and birth weight contributed most to risk estimates, consistent with established neonatal determinants. More importantly, individual waterfall plots decomposed each prediction into feature contributions, supporting transparent risk communication and prioritization of monitoring intensity. Because all 10 predictors are routinely available within the first 24 hours of NICU admission, the tool is suitable for early bedside risk stratification.
This study has several limitations. First, this was a single-center retrospective study, and selection bias and limited generalizability are possible; external validation in independent, multicenter cohorts is warranted. Although no a priori sample size calculation was performed, the derivation cohort had an events-per-predictor ratio of approximately 31.5 (315/10), exceeding the commonly cited minimum of 10. This supports model stability and reduces the risk of overfitting, but does not replace external validation. Second, missing data were handled using MICE under a missing-at-random assumption; bias may remain if this assumption is violated. Third, potentially relevant factors (e.g., prenatal steroid use) were not included due to data unavailability. Fourth, although co-twins were kept within the same data split and pregnancy-level cluster bootstrap in temporal validation yielded similar AUC point estimates with modestly wider CIs (width ratio 1.25), model training did not explicitly incorporate within-pair dependence [e.g., via generalized estimating equations (GEE) or mixed-effects frameworks]. Given the moderate within-pair concordance (70.5%; tetrachoric correlation 0.19; κ =0.36), future work should evaluate cluster-aware modeling strategies to improve calibration and standard error estimation, although point discrimination remained stable in our sensitivity analysis. Fifth, although we applied treatment-based thresholds to enhance clinical significance (e.g., transfusion-requiring anemia; treated hsPDA; NEC ≥ stage IIA), the composite endpoint still includes conditions with heterogeneous prognostic weight. Because the most extreme endpoints (e.g., IVH grade III/IV and PVL) were relatively infrequent, developing and validating a separate model for a severe-only endpoint in the current dataset would risk unstable estimation. Future larger, multicenter cohorts could evaluate endpoint-specific models focusing on the most extreme morbidities.
Conclusions
In conclusion, we developed and temporally validated a clinician-interpretable ML model to estimate early risk of severe adverse outcomes in twin neonates. The final GB model showed good discrimination in temporal validation and was implemented as a web-based risk calculator. Multicenter external validation and prospective impact evaluation are needed before broader clinical implementation. Given that all predictors are routinely available within 24 hours of admission, the tool may support clinician-facing risk communication and prioritization of monitoring intensity.
Acknowledgments
None.
Footnote
Reporting Checklist: The authors have completed the TRIPOD reporting checklist. Available at https://tp.amegroups.com/article/view/10.21037/tp-2026-1-0007/rc
Data Sharing Statement: Available at https://tp.amegroups.com/article/view/10.21037/tp-2026-1-0007/dss
Peer Review File: Available at https://tp.amegroups.com/article/view/10.21037/tp-2026-1-0007/prf
Funding: This work was supported by
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://tp.amegroups.com/article/view/10.21037/tp-2026-1-0007/coif). The authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by the Medical Ethics Committee of Shanxi Children’s Hospital (No. IRB-KYYN-2026-G004), and informed consent was waived because the data were retrospective and anonymized.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- National Collaborating Centre for Women's and Children's Health (UK). Multiple Pregnancy: The Management of Twin and Triplet Pregnancies in the Antenatal Period. London: RCOG Press; 2011.
- Esposito G, Parazzini F, Viganò P, et al. Multiple births from medically assisted reproduction: contribution of different types of procedures and trends over time. Eur J Obstet Gynecol Reprod Biol 2024;300:63-8. [Crossref] [PubMed]
- Lee DS, Barclay KJ. More twins expected in low-income countries with later maternal ages at birth and population growth. Hum Reprod 2025;40:372-81. [Crossref] [PubMed]
- Deng C, Dai L, Yi L, et al. Temporal trends in the birth rates and perinatal mortality of twins: A population-based study in China. PLoS One 2019;14:e0209962. [Crossref] [PubMed]
- Chen P, Li M, Mu Y, et al. Temporal trends and adverse perinatal outcomes of twin pregnancies at differing gestational ages: an observational study from China between 2012-2020. BMC Pregnancy Childbirth 2022;22:467. [Crossref] [PubMed]
- Yang M, Fang L, Wang Y, et al. Perinatal characteristics and neonatal outcomes of singletons and twins in Chinese very preterm infants: a cohort study. BMC Pregnancy Childbirth 2023;23:89. [Crossref] [PubMed]
- Ward C, Caughey AB. Late preterm births: neonatal mortality and morbidity in twins vs. singletons. J Matern Fetal Neonatal Med 2022;35:7962-7. [Crossref] [PubMed]
- Hadžimuratović E, Selimović A, Hadžimuratović A, et al. Prevalence of respiratory distress syndrome in premature twins compared to premature singletons. Med Glas (Zenica) 2024. [Epub ahead of print]. doi: 10.17392/1638-21-02.
- Burjonrappa SC, Shea B, Goorah D. NEC in Twin Pregnancies: Incidence and Outcomes. J Neonatal Surg 2014;3:45. [Crossref] [PubMed]
- Wieczorek AI, Krasomski G. Twin pregnancy as the risk factor for neonatal intraventricular hemorrhage. Ginekol Pol 2015;86:137-42. [Crossref] [PubMed]
- Parry G, Tucker J, Tarnow-Mordi W, et al. CRIB II: an update of the clinical risk index for babies score. Lancet 2003;361:1789-91. [Crossref] [PubMed]
- Richardson DK, Corcoran JD, Escobar GJ, et al. SNAP-II and SNAPPE-II: Simplified newborn illness severity and mortality risk scores. J Pediatr 2001;138:92-100. [Crossref] [PubMed]
- Hashimoto DA, Witkowski E, Gao L, et al. Artificial Intelligence in Anesthesiology: Current Techniques, Clinical Applications, and Limitations. Anesthesiology 2020;132:379-94. [Crossref] [PubMed]
- Choi RY, Coyner AS, Kalpathy-Cramer J, et al. Introduction to Machine Learning, Neural Networks, and Deep Learning. Transl Vis Sci Technol 2020;9:14. [PubMed]
- He L, Gao T, Tang Y, et al. A machine learning-based model for predicting the postoperative risk of acute kidney injury in neonates. Transl Pediatr 2025;14:3349-60. [Crossref] [PubMed]
- Bowe AK, Lightbody G, Staines A, et al. Prediction of 2-Year Cognitive Outcomes in Very Preterm Infants Using Machine Learning Methods. JAMA Netw Open 2023;6:e2349111. [Crossref] [PubMed]
- Wu TY, Lin WT, Chen YJ, et al. Machine learning to predict late respiratory support in preterm infants: a retrospective cohort study. Sci Rep 2023;13:2839. [Crossref] [PubMed]
- Cho KH, Kim ES, Kim JW, et al. Comparative effectiveness of explainable machine learning approaches for extrauterine growth restriction classification in preterm infants using longitudinal data. Front Med (Lausanne) 2023;10:1166743. [Crossref] [PubMed]
- Zheng D, Hao X, Khan M, et al. Comparison of machine learning and logistic regression as predictive models for adverse maternal and neonatal outcomes of preeclampsia: A retrospective study. Front Cardiovasc Med 2022;9:959649. [Crossref] [PubMed]
- Pudjihartono N, Fadason T, Kempa-Liehr AW, et al. A Review of Feature Selection Methods for Machine Learning-Based Disease Risk Prediction. Front Bioinform 2022;2:927312. [Crossref] [PubMed]
- Dong J, Jin Z, Li C, et al. Machine Learning Models With Prognostic Implications for Predicting Gastrointestinal Bleeding After Coronary Artery Bypass Grafting and Guiding Personalized Medicine: Multicenter Cohort Study. J Med Internet Res 2025;27:e68509. [Crossref] [PubMed]
- Lundberg SM, Erion G, Chen H, et al. From Local Explanations to Global Understanding with Explainable AI for Trees. Nat Mach Intell 2020;2:56-67. [Crossref] [PubMed]
- Bell EF, Hintz SR, Hansen NI, et al. Mortality, In-Hospital Morbidity, Care Practices, and 2-Year Outcomes for Extremely Preterm Infants in the US, 2013-2018. JAMA 2022;327:248-63. [Crossref] [PubMed]
- McKinlay CJD, Alsweiler JM, Anstice NS, et al. Association of Neonatal Glycemia With Neurodevelopmental Outcomes at 4.5 Years. JAMA Pediatr 2017;171:972-83. [Crossref] [PubMed]
- Nivins S, Kennedy E, Thompson B, et al. Associations between neonatal hypoglycaemia and brain volumes, cortical thickness and white matter microstructure in mid-childhood: An MRI study. Neuroimage Clin 2022;33:102943. [Crossref] [PubMed]
- Garg M, Devaskar SU. Exploring the long-term impacts of neonatal hypoglycemia to determine a safe threshold for glucose concentrations. Eur J Pediatr 2025;184:263. [Crossref] [PubMed]
- Oliver E, Navaratnam K, Gent J, et al. Comparison of international guidelines on the management of twin pregnancy. Eur J Obstet Gynecol Reprod Biol 2023;285:97-104. [Crossref] [PubMed]
- Hack KE, Derks JB, Elias SG, et al. Increased perinatal mortality and morbidity in monochorionic versus dichorionic twin pregnancies: clinical implications of a large Dutch cohort study. BJOG 2008;115:58-67. [Crossref] [PubMed]
- Sinkey RG, Battarbee AN, Bello NA, et al. Prevention, Diagnosis, and Management of Hypertensive Disorders of Pregnancy: a Comparison of International Guidelines. Curr Hypertens Rep 2020;22:66. [Crossref] [PubMed]
- Maraka S, Ospina NM, O'Keeffe DT, et al. Subclinical Hypothyroidism in Pregnancy: A Systematic Review and Meta-Analysis. Thyroid 2016;26:580-90. [Crossref] [PubMed]
- Saad AF, Pacheco LD, Chappell L, et al. Intrahepatic Cholestasis of Pregnancy: Toward Improving Perinatal Outcome. Reprod Sci 2022;29:3100-5. [Crossref] [PubMed]
- Shi H, Chen L, Wang Y, et al. Severity of Anemia During Pregnancy and Adverse Maternal and Fetal Outcomes. JAMA Netw Open 2022;5:e2147046. [Crossref] [PubMed]
- Parween S, Prasad D, Poonam P, et al. Impact of Meconium-Stained Amniotic Fluid on Neonatal Outcome in a Tertiary Hospital. Cureus 2022;14:e24464. [Crossref] [PubMed]
- Luo S, Mao J, Wen L, et al. A retrospective cohort study on perinatal outcomes of monochorionic and dichorionic twin pregnancies complicated by intrahepatic cholestasis of pregnancy. Sci Rep 2025;15:25984. [Crossref] [PubMed]
- Cheong-See F, Schuit E, Arroyo-Manzano D, et al. Prospective risk of stillbirth and neonatal complications in twin pregnancies: systematic review and meta-analysis. BMJ 2016;354:i4353. [Crossref] [PubMed]
- Khalil A. Unprecedented fall in stillbirth and neonatal death in twins: lessons from the UK. Ultrasound Obstet Gynecol 2019;53:153-7. [Crossref] [PubMed]
- Giorgione V, Trapani M, Lopian M, et al. Predicting Adverse Perinatal Outcomes in Dichorionic Twin Pregnancies: A Multicentre Cohort Study. BJOG 2025;132:983-90. [Crossref] [PubMed]
- Salari N, Beiromvand M, Abdollahi R, et al. Global prevalence of stillbirth among fetuses from twin pregnancies: a systematic review and meta-analysis. Arch Gynecol Obstet 2025;312:9-16. [Crossref] [PubMed]
- Blitz MJ, Yukhayev A, Pachtman SL, et al. Twin pregnancy and risk of postpartum hemorrhage. J Matern Fetal Neonatal Med 2020;33:3740-5. [Crossref] [PubMed]
- Weitzner O, Yagur Y, Biron-Shental T, et al. Twin pregnancies: can sonographic measurements and changes in cervical length during pregnancy predict preterm birth? J Matern Fetal Neonatal Med 2022;35:1783-6. [Crossref] [PubMed]
- Monaghan C, Kalafat E, Binder J, et al. Prediction of adverse pregnancy outcome in monochorionic diamniotic twin pregnancy complicated by selective fetal growth restriction. Ultrasound Obstet Gynecol 2019;53:200-7. [Crossref] [PubMed]
- Briffa C, Di Fabrizio C, Kalafat E, et al. Adverse neonatal outcome in twin pregnancy complicated by small-for-gestational age: twin vs singleton reference charts. Ultrasound Obstet Gynecol 2022;59:377-84. [Crossref] [PubMed]
- Cimpoca B, Syngelaki A, Litwinska E, et al. Increased nuchal translucency at 11-13 weeks' gestation and outcome in twin pregnancy. Ultrasound Obstet Gynecol 2020;55:318-25. [Crossref] [PubMed]
- Maymon R, Trahtenherts A, Svirsky R, et al. Developing a new algorithm for first and second trimester preeclampsia screening in twin pregnancies. Hypertens Pregnancy 2017;36:108-15. [Crossref] [PubMed]

