Development and external validation of machine learning models to predict insulin resistance among iron-deficient children and adolescents
Highlight box
Key findings
• This study developed and externally validated machine learning (ML) models to predict insulin resistance (IR) among iron-deficient children and adolescents. Extreme gradient boosting (XGBoost) achieved optimal external validation performance with an area under the receiver operating characteristic curve (AUC) of 0.940 [95% confidence interval (CI): 0.889–0.991]. Fasting glucose and triglycerides emerged as dominant predictors, while albumin demonstrated a protective association (odds ratio 0.86, 95% CI: 0.78–0.95). A sensitivity analysis excluding fasting glucose confirmed robust performance (AUC 0.925), substantiating genuine metabolic pattern recognition rather than mathematical circularity.
What is known and what is new?
• Existing IR prediction models primarily target overweight or obese pediatric populations and do not account for iron deficiency as a distinct risk modifier.
• This study establishes the first externally validated ML framework specifically designed for iron-deficient youth, demonstrating that ensemble learning approaches capture non-linear metabolic interactions in this undernourished subgroup.
What is the implication, and what should change now?
• These findings support continued development of accessible screening tools for resource-limited settings where fasting insulin measurements are unavailable. However, the modest sample size and single-province validation preclude immediate clinical deployment. Prospective multi-site validation is required before any consideration of implementation as a developmental screening framework.
Introduction
Insulin resistance (IR) is increasingly prevalent among children and adolescents, representing a critical metabolic disturbance that disrupts cellular energy homeostasis and predisposes affected individuals to type 2 diabetes and cardiovascular morbidity (1,2). Compounding this clinical burden, iron deficiency independently exacerbates IR through established mechanisms involving impaired insulin signaling and glucose transport (3), creating a distinct high-risk pediatric phenotype with amplified long-term health consequences. Consequently, a universal IR prediction model incorporating iron status as one variable among many may not capture these phenotype-specific interactions, motivating a specialized framework calibrated for this subgroup. Early identification of metabolic dysfunction within iron-deficient youth is therefore an important clinical priority to mitigate downstream complications.
Despite growing recognition of IR as a critical pediatric health concern, its systematic detection in primary care remains severely hindered by methodological and practical constraints. The hyperinsulinemic euglycemic clamp, despite serving as the reference standard, demands specialized equipment, substantial cost, and invasive procedures that render it impractical for routine screening (4). Surrogate indices such as the Homeostatic Model Assessment for Insulin Resistance (HOMA-IR), while computationally simpler, still require fasting insulin measurements that are costly and poorly standardized across laboratories (5). Consequently, iron-deficient youth constitute a particularly vulnerable subgroup with heightened metabolic risk, yet they face a critical diagnostic gap because early detection remains inaccessible in resource-limited settings (6).
Machine learning (ML) algorithms offer a paradigm shift in clinical risk prediction by transcending the restrictive assumptions of conventional statistical models. Traditional regression models assume linear relationships and limited variable interactions, whereas the metabolic pathophysiology of IR in iron-deficient youth likely involves complex non-linearities. We therefore compared traditional regression-based approaches with ML algorithms to determine whether non-linear methods yield clinically meaningful gains in this context. ML algorithms have demonstrated remarkable effectiveness in predicting IR across diverse populations. Across several large-scale studies, ML models achieved area under the curve (AUC) values ranging from 0.86 to 0.93 (7-9). These models typically used 9–48 predictive features, with algorithms like extreme gradient boosting (XGBoost), random forest (RF), and neural networks performing best. Consistently, key predictors included body-mass index (BMI), fasting glucose, high-density lipoprotein (HDL) cholesterol, and triglycerides (7,8). The models demonstrated transferability across different ethnic groups and settings, suggesting ML’s robust potential for IR prediction. Furthermore, these algorithms scale efficiently to high-dimensional datasets and automatically detect subtle variable interactions that elude conventional biostatistical methods (10,11).
We compared four ML architectures with distinct theoretical properties to systematically evaluate their suitability for this metabolic context. Logistic regression (LR) serves as a baseline interpretable linear classifier assuming additive predictor effects. RF employs bootstrap aggregating with randomized feature selection, reducing variance through ensemble averaging but exhibiting sensitivity to sample size. K-nearest neighbor (KNN) classifies instances based on local feature space similarity without functional form assumptions, yet suffers from instability in high-dimensional settings. XGBoost sequentially builds weak learners with gradient-based optimization and explicit regularization, managing bias-variance tradeoffs through shrinkage and tree pruning. In iron-deficient children, metabolic pathways governing insulin sensitivity involve non-linear threshold effects (e.g., glucose toxicity) and synergistic interactions (e.g., triglycerides amplifying glucose-induced dysfunction) that linear models may miss. We therefore selected these architectures to test whether methods capturing such complex patterns outperform traditional regression in this specific pediatric population.
Beyond metabolic applications, ML has demonstrated prognostic utility in diverse pediatric and perioperative contexts, including prediction of neonatal intensive care unit admission and postoperative complication risk stratification (12,13). Such capabilities position ML as particularly suitable for developing accessible screening tools that leverage routinely collected clinical data to identify metabolic abnormalities in specific pediatric subpopulations.
Several promising predictive models exist for IR in pediatric populations, but they have been validated primarily in overweight or obese cohorts and do not account for iron deficiency as a distinct risk modifier. Existing models incorporate general metabolic indicators without considering iron status, omit a validated iron biomarker from their predictor sets, or exclude undernourished participants entirely. Consequently, to our knowledge, no externally validated tool specifically targets IR prediction among iron-deficient youth using routinely available parameters (11,14,15). Therefore, this study aimed to develop and externally validate ML models for predicting IR among iron-deficient children and adolescents using routinely available clinical variables. To enhance clinical interpretability and facilitate deployment, SHapley Additive exPlanations (SHAP) analysis was employed to quantify individual predictor contributions and elucidate decision mechanisms (16). This approach is intended to provide an accessible, noninvasive screening framework specifically tailored for iron-deficient pediatric populations in resource-limited settings. We present this article in accordance with the TRIPOD reporting checklist (17) (available at https://tp.amegroups.com/article/view/10.21037/tp-2026-1-0146/rc).
Methods
Study participants
In this research, we focused specifically on iron-deficient children and adolescents aged 6–17 years from the China Health and Nutrition Survey (CHNS) database, employing this subset as our training sample. The CHNS is a collaborative project between the Carolina Population Center at the University of North Carolina and the National Institute for Nutrition and Health at the Chinese Center for Disease Control and Prevention (18,19). The survey was established to track nutritional and health conditions across 15 Chinese provinces from 1989 to 2015, and its 2009 dataset is particularly valuable because it includes comprehensive biomarker measurements for pediatric populations. Given our specific aim to predict IR within the context of iron deficiency, we applied stringent inclusion criteria: participants were required to meet established criteria for iron deficiency based on soluble transferrin receptor (sTfR) concentrations (>1.8 mg/L for ages 6–11 years; >1.75 mg/L for females and >1.95 mg/L for males aged 12–17 years), consistent with age- and sex-specific thresholds validated in pediatric populations (20). To minimize confounding by systemic inflammation on iron status assessment and IR, participants with high-sensitivity C-reactive protein (hs-CRP) ≥5 mg/L were excluded from the analysis, as elevated hs-CRP indicates active inflammation that falsely elevates serum ferritin (acute-phase reactant effect) and independently contributes to IR through inflammatory pathways. From the CHNS dataset, we first identified 827 children and adolescents aged 6–17 years with complete age and sex records. We then applied sequential screening: 814 participants had available sTfR measurements, of whom 256 met iron-deficiency criteria. After excluding 18 individuals with hs-CRP ≥5 mg/L and 16 participants with missing critical variables (fasting glucose or fasting insulin), 222 iron-deficient participants aged 6–17 years from 9 provinces were included in the training dataset.
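The inclusion rules above can be expressed compactly. The following is a minimal Python sketch, not the authors' screening code (the source reports using R); the function names are hypothetical, and it assumes age is recorded in years with the 6–11 band meaning "younger than 12".

```python
def is_iron_deficient(age_years, sex, stfr_mg_l):
    """Apply the age- and sex-specific sTfR cut-offs quoted in the text."""
    if not 6 <= age_years <= 17:
        raise ValueError("cut-offs are only stated for ages 6-17 years")
    if age_years < 12:            # assumption: the 6-11 band means age < 12
        cutoff = 1.8
    elif sex == "female":
        cutoff = 1.75
    elif sex == "male":
        cutoff = 1.95
    else:
        raise ValueError("sex must be 'female' or 'male'")
    return stfr_mg_l > cutoff


def is_eligible(age_years, sex, stfr_mg_l, hs_crp_mg_l):
    """Iron deficient and without marked inflammation (hs-CRP below 5 mg/L)."""
    return is_iron_deficient(age_years, sex, stfr_mg_l) and hs_crp_mg_l < 5.0


# Cohort-flow arithmetic using the counts reported in the text.
training_n = 256 - 18 - 16      # iron deficient, minus hs-CRP and missing-data exclusions
print(training_n)               # 222
```

The same two functions would apply unchanged to the external cohort, since the text states that identical sTfR criteria were used there.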
To mitigate the limitations of cross-sectional data and assess generalizability, we validated the model using an external dataset from two hospitals. We collected data on iron-deficient children and adolescents aged 6–17 years from the physical examination and pediatric departments of Nanchong Central Hospital and its Jialing Branch (a tertiary and a secondary hospital, respectively) from January 2022 to October 2024. The collected data included fundamental demographic details, lifestyle practices, and blood sample analyses. A total of 541 participants were screened, of whom 517 had available sTfR measurements. Of these, 156 met iron-deficiency criteria, determined using the same sTfR thresholds applied to the training set. After excluding 18 individuals with hs-CRP ≥5 mg/L and 13 with missing critical variables, 125 iron-deficient participants were retained for the external validation dataset. The study was conducted in accordance with local legislation, institutional requirements, and the Declaration of Helsinki and its subsequent amendments. The study was approved by the Institutional Review Boards of Nanchong Central Hospital [No. 2024(033)] and Nanchong City Jialing District People’s Hospital (Jialing Branch of Nanchong Central Hospital) [No. 2024(002)]. Individual consent for this retrospective analysis was waived.
The study utilized all available iron-deficient pediatric participants from the CHNS 2009 dataset (n=222) and two hospitals (n=125) meeting inclusion criteria. A post-hoc power analysis using PASS 2021 (v21.0.3) confirmed that the training sample of 222 participants (67 IR events, 155 non-events) achieved >99% power to detect an AUC of 0.80 against a null hypothesis of 0.50 at alpha =0.05 (Figure S1).
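As a rough plausibility check on the reported power, the calculation can be approximated with the Hanley–McNeil standard error for an AUC. This is a sketch under stated assumptions: it is not the procedure implemented in PASS (which may use a different variance formula), and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def hanley_mcneil_se(auc, n_pos, n_neg):
    """Hanley-McNeil (1982) approximation to the standard error of an AUC."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    var = (auc * (1 - auc)
           + (n_pos - 1) * (q1 - auc ** 2)
           + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    return sqrt(var)


def auc_power(auc_alt, auc_null, n_pos, n_neg, alpha=0.05):
    """Approximate power of a two-sided z-test of H0: AUC = auc_null."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    se_null = hanley_mcneil_se(auc_null, n_pos, n_neg)
    se_alt = hanley_mcneil_se(auc_alt, n_pos, n_neg)
    return z.cdf((abs(auc_alt - auc_null) - z_crit * se_null) / se_alt)


# Training cohort from the text: 67 IR events and 155 non-events.
power = auc_power(0.80, 0.50, n_pos=67, n_neg=155)
print(power > 0.99)
```

Under this approximation the power also exceeds 0.99, in line with the figure quoted from PASS.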
It is noteworthy that, because the CHNS did not collect data on children and adolescents in Sichuan Province and all external validation cases were derived from two hospitals in Sichuan Province, regional differences may exist between the training and validation populations. However, this geographic separation also provides a more stringent test of how well the model transports to a region not represented in the training data.
Outcome
The HOMA-IR is widely utilized in clinical practice due to its computational simplicity and robust correlation with the hyperinsulinemic euglycemic clamp technique, the reference standard for assessing insulin sensitivity (21). In the present study, IR was defined as a HOMA-IR value exceeding 3.0, calculated as (fasting insulin mU/L) × (fasting glucose mmol/L)/22.5. Despite its practical utility, the HOMA-IR index necessitates the measurement of fasting insulin concentrations, which may be unavailable or cost-prohibitive in resource-constrained settings and primary care facilities. Therefore, this study aimed to develop and validate ML-based predictive models that leverage readily available clinical variables and routine biochemical parameters to identify iron-deficient children and adolescents at elevated risk for IR.
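The outcome definition above translates directly into code. The sketch below is illustrative only; the function names and the example insulin and glucose values are hypothetical, while the formula and the 3.0 threshold come from the text.

```python
def homa_ir(fasting_insulin_mu_l, fasting_glucose_mmol_l):
    """HOMA-IR = fasting insulin (mU/L) x fasting glucose (mmol/L) / 22.5."""
    return fasting_insulin_mu_l * fasting_glucose_mmol_l / 22.5


def has_insulin_resistance(fasting_insulin_mu_l, fasting_glucose_mmol_l, threshold=3.0):
    """IR is defined in the text as a HOMA-IR value exceeding 3.0 (strictly greater)."""
    return homa_ir(fasting_insulin_mu_l, fasting_glucose_mmol_l) > threshold


# Hypothetical example values, chosen only to show the boundary behaviour.
print(homa_ir(15.0, 4.5))                  # exactly at the threshold
print(has_insulin_resistance(15.0, 4.5))   # not classified as IR (not > 3.0)
print(has_insulin_resistance(16.0, 5.0))   # classified as IR
```

Because fasting glucose enters this formula directly, using it as a predictor creates the dependency that the sensitivity analysis later removes.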
Data preprocessing and feature selection
A total of 27 candidate predictors routinely available in standard health examinations were extracted from the CHNS database and categorized into eight clinical domains: demographic characteristics including age, sex, and urbanization status; blood pressure measurements comprising systolic and diastolic pressure; anthropometric indices encompassing height, weight, BMI, waist circumference, waist-to-height ratio, and body roundness index; hematological parameters including hemoglobin, white blood cell count, red blood cell count, and platelet count; glucose metabolism indicators comprising fasting glucose and glycated hemoglobin; lipid profile encompassing total cholesterol, triglycerides, HDL cholesterol, and low-density lipoprotein (LDL) cholesterol; hepatic function markers including alanine aminotransferase, total protein, and albumin; and renal function indicators comprising blood urea nitrogen, uric acid, and creatinine.
Unweighted analyses were performed because the objective was model development rather than population prevalence estimation. Variables with a missing rate exceeding 30% were to be excluded; in practice, the missing rate of every variable included in this study was below 5% (Table S1). Extreme values indicative of potential data entry errors (e.g., biologically implausible values) were identified and treated as missing. Multiple imputation using the Multivariate Imputation by Chained Equations (MICE) algorithm was performed, with predictive mean matching (PMM) for continuous variables and LR for categorical variables (20 iterations, ridge penalty =0.01 to address multicollinearity). Convergence was verified via trace plots. Five independent imputed datasets were generated, and Rubin’s rules were applied to pool estimates.
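Rubin's rules combine the per-imputation estimates into one pooled estimate whose variance reflects both within- and between-imputation uncertainty. The sketch below shows the standard formulas for a single scalar parameter; the five estimates and variances are invented purely for illustration and are not values from this study.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m per-imputation estimates and their squared standard errors."""
    m = len(estimates)
    pooled_estimate = mean(estimates)
    within = mean(variances)                   # average within-imputation variance
    between = variance(estimates)              # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between     # Rubin's total variance
    return pooled_estimate, total


# Hypothetical coefficient estimates from five imputed datasets (illustration only).
estimates = [0.50, 0.52, 0.48, 0.51, 0.49]
variances = [0.04, 0.04, 0.04, 0.04, 0.04]
pooled, total_var = pool_rubin(estimates, variances)
print(round(pooled, 3), round(total_var, 4))
```

When the imputations agree closely, as in this toy example, the between-imputation term adds little to the total variance.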
To address multicollinearity among metabolic, anthropometric, and renal function variables while avoiding the overfitting and selection instability associated with conventional stepwise regression, we employed Least Absolute Shrinkage and Selection Operator (LASSO) LR for feature selection. The LASSO approach applies an L1 penalty that shrinks less informative coefficients toward zero while retaining important predictors, thereby enabling automated variable selection and coefficient regularization simultaneously. All continuous variables were standardized (z-score transformation) prior to LASSO analysis to ensure comparable penalization across different measurement scales. To account for class imbalance in the training set, inverse-frequency weighting was incorporated into the penalized likelihood function, assigning the minority class a weight calculated as the ratio of majority to minority class sample sizes, while the majority class retained unit weight. Ten-fold cross-validation with stratification by IR status was performed to determine the optimal regularization parameter (λ). The λ min value of 0.0291, corresponding to the minimum cross-validation error, was selected and applied to the final model. Model assumptions for the final multivariable model, including linearity of continuous variables and absence of multicollinearity (variance inflation factor <5), were verified. Statistical significance was assessed using two-tailed Wald tests with false discovery rate (FDR) adjustment for multiple comparisons where applicable. Following LASSO selection, the identified predictors were used to fit a final weighted LR model to obtain unbiased odds ratios (ORs) and 95% confidence intervals (CIs) for clinical interpretability.
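The inverse-frequency weighting described above can be checked with a few lines of arithmetic. This sketch uses the training-set class counts stated in the power analysis (67 IR, 155 non-IR); the function name is hypothetical.

```python
def inverse_frequency_weights(n_majority, n_minority):
    """Minority weight = majority count / minority count; majority keeps unit weight."""
    return {"majority": 1.0, "minority": n_majority / n_minority}


n_non_ir, n_ir = 155, 67
weights = inverse_frequency_weights(n_non_ir, n_ir)

# Sum of observation weights, i.e. the weighted ("effective") sample size.
effective_n = n_non_ir * weights["majority"] + n_ir * weights["minority"]

print(round(weights["minority"], 2))   # 2.31
print(round(effective_n))              # 310
```

With these weights both classes contribute an equal total weight of 155 to the penalized likelihood.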
Model development
Four ML algorithms were implemented to develop prediction models for IR: LR as a baseline interpretable model, RF, KNN, and XGBoost. Prior to model training, categorical variables were encoded as binary numeric indicators, and continuous variables were standardized (z-score transformation) for the scale-sensitive algorithms (LR and KNN). Hyperparameter tuning was performed exclusively on the training set via five-fold repeated cross-validation (three repeats), stratified by IR status to preserve the class distribution across folds, in order to optimize model performance and limit overfitting. For LR, no hyperparameter tuning was required. The RF was regularized by limiting the number of variables randomly sampled at each split (mtry =2), setting the minimum node size to 10 (increased from default to prevent overfitting), and restricting the maximum number of nodes to 20. The KNN algorithm was optimized for the number of neighbors (k =15) using standardized input features. For XGBoost, conservative regularization parameters were employed to balance predictive performance with generalizability: maximum tree depth of 3, learning rate (eta) of 0.05, minimum child weight of 5, gamma regularization of 0.1, subsample and column sample ratios of 0.8, and 100 boosting rounds.
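For readers who want to reproduce the configuration, the stated hyperparameters can be collected as plain settings. This is a sketch, not the authors' script (the source reports using R): the XGBoost keys follow the library's native parameter names, and two entries are assumptions flagged in the comments because the text does not specify them.

```python
# XGBoost settings as stated in the text, keyed by XGBoost's native parameter names.
xgb_params = {
    "objective": "binary:logistic",   # assumption: binary outcome, not stated explicitly
    "max_depth": 3,
    "eta": 0.05,                      # learning rate
    "min_child_weight": 5,
    "gamma": 0.1,
    "subsample": 0.8,
    "colsample_bytree": 0.8,          # assumption: "column sample ratio" taken as per-tree
}
num_boost_round = 100

# Random forest and KNN settings as stated in the text (names mirror the prose).
rf_settings = {"mtry": 2, "min_node_size": 10, "max_nodes": 20}
knn_settings = {"k": 15}

# With the xgboost package installed and a DMatrix named dtrain (hypothetical),
# training would look like:
#   booster = xgboost.train(xgb_params, dtrain, num_boost_round=num_boost_round)
```

The shallow trees, small learning rate, and row and column subsampling all act as regularizers, which matters for a training set of only 222 participants.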
A sensitivity analysis was conducted excluding fasting glucose from the predictor set to eliminate mathematical dependency on the HOMA-IR outcome. All models were retrained using the identical pipeline applied to the remaining variables, with performance evaluated through the same discrimination and calibration metrics used in the primary analysis.
Statistical analysis
The reliability of the models was assessed using several commonly employed evaluation metrics, including the area under the receiver operating characteristic curve (AUC), sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy, F1 score, and Brier score. AUC values were interpreted according to established benchmarks: 0.5 indicates no discrimination, 0.7 to 0.8 acceptable, 0.8 to 0.9 excellent, and greater than 0.9 outstanding discrimination (22). The final model was selected based on a composite consideration of AUC, calibration, and clinical interpretability. To enhance model transparency and facilitate clinical interpretation, SHAP analysis was performed on the best-performing algorithm to quantify the contribution of individual predictors.
To preserve training data integrity given the modest sample size, no internal random split was performed within the CHNS dataset. Training-set performance metrics, including AUC, were derived from five-fold repeated cross-validation (three repeats), wherein the data were partitioned into five folds and each fold served as validation once per repeat. The reported training AUC represents the mean across all 15 folds. The external validation cohort served as the independent hold-out test set, with all validation metrics derived from this set. No data overlap or information leakage occurred between training and validation.
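The resampling scheme described above (five folds, three repeats, stratified by IR status) can be sketched in plain Python. This is an illustrative implementation under stated assumptions, not the authors' R code; the function name and the random seed are hypothetical.

```python
import random
from collections import defaultdict


def stratified_repeated_kfold(labels, k=5, repeats=3, seed=0):
    """Yield (train_indices, validation_indices) for each fold of each repeat."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    for _ in range(repeats):
        folds = [[] for _ in range(k)]
        # Deal each class out round-robin so every fold keeps the class ratio.
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)
            for position, index in enumerate(shuffled):
                folds[position % k].append(index)
        for held_out in range(k):
            validation = sorted(folds[held_out])
            train = sorted(i for f in range(k) if f != held_out for i in folds[f])
            yield train, validation


# Training cohort from the text: 67 IR (label 1) and 155 non-IR (label 0).
labels = [1] * 67 + [0] * 155
splits = list(stratified_repeated_kfold(labels))
print(len(splits))   # 15 train/validation splits in total
```

Averaging a metric over these 15 held-out folds gives the cross-validated training estimate described in the text.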
Baseline characteristics were summarized separately for the training set and validation cohort. Continuous variables were presented as mean ± standard deviation and compared using Student’s t-test for normally distributed data or Wilcoxon rank-sum test for non-normally distributed variables. Categorical variables were expressed as counts (percentages) and analyzed using Pearson’s chi-square test or Fisher’s exact test when expected cell counts were <5. All statistical tests were two-tailed, with P<0.05 considered statistically significant. Between-group comparisons (training vs. validation) were performed for all baseline characteristics. Additional subgroup analyses compared participants with and without IR using the same statistical approaches. All analyses were performed using R version 4.4.2.
Results
Subject characteristics
The study population comprised 222 participants in the training cohort and 125 in the validation cohort. The prevalence of IR was 30.2% (n=67) in the training set and 28.0% (n=35) in the validation set (Figure 1). Table 1 compares baseline characteristics between the training and validation sets, and between participants with (n=102) and without (n=245) IR. The training and validation cohorts showed comparable demographic, anthropometric, and biochemical profiles (all P>0.05), indicating that the two independently sampled cohorts were well balanced at baseline.
Table 1
| Characteristics | Training set | Validation set | P (training vs. validation) | IR | Non-IR | P (IR vs. non-IR) |
|---|---|---|---|---|---|---|
| Age, years | 11.10±2.58 | 11.23±2.35 | 0.60 | 12.01±2.17 | 10.79±2.54 | <0.001 |
| Urban residence | 42 (18.92) | 19 (15.20) | 0.47 | 29 (28.43) | 32 (13.06) | 0.001 |
| Gender, male | 119 (53.60) | 68 (54.40) | >0.99 | 53 (51.96) | 134 (54.69) | >0.99 |
| Systolic blood pressure, mmHg | 101.03±11.92 | 100.17±11.87 | 0.47 | 103.00±10.50 | 99.74±12.33 | 0.03 |
| Diastolic blood pressure, mmHg | 67.60±9.63 | 66.72±9.72 | 0.44 | 68.21±8.81 | 66.88±9.99 | 0.47 |
| Height, cm | 143.91±13.95 | 144.24±13.27 | 0.92 | 148.26±11.52 | 142.30±14.14 | <0.001 |
| Weight, kg | 37.14±11.19 | 36.53±10.45 | 0.72 | 41.18±10.48 | 35.16±10.62 | <0.001 |
| BMI, kg/m² | 17.51±2.83 | 17.19±2.67 | 0.31 | 18.50±3.15 | 16.94±2.46 | <0.001 |
| Waist circumference, cm | 61.96±8.82 | 60.90±8.37 | 0.35 | 64.92±9.53 | 60.16±7.87 | <0.001 |
| Blood urea nitrogen, mmol/L | 4.37±1.11 | 4.22±1.06 | 0.22 | 4.33±0.94 | 4.31±1.15 | 0.92 |
| Uric acid, μmol/L | 305.68±82.86 | 310.21±88.22 | 0.89 | 330.20±101.76 | 297.78±74.73 | 0.005 |
| Creatinine, μmol/L | 69.75±10.54 | 70.11±10.81 | 0.84 | 73.31±12.95 | 68.45±9.14 | <0.001 |
| HDL cholesterol, mmol/L | 1.42±0.32 | 1.44±0.32 | 0.53 | 1.37±0.28 | 1.45±0.34 | 0.02 |
| LDL cholesterol, mmol/L | 2.26±0.67 | 2.18±0.70 | 0.20 | 2.33±0.86 | 2.19±0.59 | 0.31 |
| Hemoglobin, g/L | 134.93±16.05 | 133.73±14.95 | 0.61 | 135.05±18.33 | 134.27±14.42 | 0.57 |
| White blood cell count, ×10⁹/L | 6.68±1.89 | 6.53±1.79 | 0.46 | 6.87±1.94 | 6.52±1.81 | 0.06 |
| Red blood cell count, ×10¹²/L | 4.94±0.66 | 4.87±0.60 | 0.25 | 4.92±0.63 | 4.91±0.64 | 0.35 |
| Platelet count, ×10⁹/L | 280.67±81.18 | 277.64±73.94 | 0.97 | 277.89±76.04 | 280.28±79.72 | 0.96 |
| Glycated hemoglobin, % | 5.22±0.52 | 5.25±0.59 | 0.77 | 5.33±0.81 | 5.19±0.38 | 0.35 |
| Total protein, g/L | 77.59±4.45 | 77.33±4.50 | 0.60 | 77.53±4.26 | 77.49±4.55 | 0.94 |
| Albumin, g/L | 48.92±3.18 | 48.69±3.16 | 0.44 | 48.72±3.37 | 48.89±3.09 | 0.40 |
| Glucose, mmol/L | 4.81±0.84 | 4.88±0.98 | 0.36 | 5.25±1.34 | 4.66±0.53 | <0.001 |
| Triglycerides, mmol/L | 0.94±0.51 | 1.01±0.60 | 0.48 | 1.23±0.69 | 0.86±0.44 | <0.001 |
| Total cholesterol, mmol/L | 3.96±0.70 | 3.93±0.74 | 0.65 | 4.07±0.87 | 3.89±0.64 | 0.11 |
| Alanine aminotransferase, U/L | 16.56±35.12 | 18.74±45.68 | 0.48 | 17.16±21.64 | 17.42±44.56 | 0.06 |
| Waist to height ratio | 0.43±0.05 | 0.42±0.04 | 0.28 | 0.44±0.05 | 0.42±0.05 | 0.006 |
| Body roundness index | 2.18±0.86 | 2.05±0.74 | 0.28 | 2.31±0.89 | 2.06±0.77 | 0.006 |
Data are presented as number (%) or mean ± standard deviation. BMI, body mass index; HDL, high-density lipoprotein; IR, insulin resistance; LDL, low-density lipoprotein.
Participants with IR were significantly older (12.01 vs. 10.79 years, P<0.001) and more likely to be urban residents (28.4% vs. 13.1%, P=0.001) than those without IR. All anthropometric measures were elevated in the IR group, including BMI (18.50 vs. 16.94 kg/m²), waist circumference (64.92 vs. 60.16 cm), and body roundness index (2.31 vs. 2.06) (all P<0.01). Metabolically, IR participants exhibited higher levels of fasting glucose, triglycerides, uric acid, and creatinine, alongside lower HDL cholesterol and elevated systolic blood pressure (all P<0.05). No significant differences were observed in sex distribution, diastolic blood pressure, complete blood counts, glycated hemoglobin, or liver enzymes between groups.
Feature selection and predictor identification
The training set comprised 222 participants (155 normal, 67 with IR; IR prevalence 30.2%). Inverse-frequency weighting assigned the IR class a weight of 2.31, yielding an effective sample size of 310. To preserve clinically relevant predictors while controlling multicollinearity, we applied LASSO regression with 10-fold cross-validation. The optimal λ min value of 0.0291 identified nine predictors with non-zero coefficients: age (β=0.116), urban residence (β=0.081), systolic blood pressure (β=0.002), BMI (β=0.083), waist circumference (β=0.019), blood urea nitrogen (β=0.032), albumin (β=−0.037), fasting glucose (β=1.116), and triglycerides (β=0.545) (Table 2).
Table 2
| Variable | LASSO coefficient | LASSO OR | Multivariable coefficient | Multivariable OR (95% CI) | P |
|---|---|---|---|---|---|
| Glucose, mmol/L | 1.12 | 3.05 | 1.93 | 6.87 (3.85–12.93) | <0.001 |
| Triglycerides, mmol/L | 0.55 | 1.72 | 0.93 | 2.52 (1.34–4.96) | 0.005 |
| Age, years | 0.12 | 1.12 | 0.22 | 1.24 (1.10–1.41) | <0.001 |
| BMI, kg/m² | 0.08 | 1.09 | 0.15 | 1.16 (1.02–1.33) | 0.02 |
| Urban residence | 0.08 | 1.08 | 0.25 | 1.28 (0.62–2.65) | 0.50 |
| Albumin, g/L | −0.04 | 0.96 | −0.15 | 0.86 (0.78–0.95) | 0.005 |
| Blood urea nitrogen, mmol/L | 0.03 | 1.03 | 0.36 | 1.43 (1.08–1.92) | 0.01 |
| Waist circumference, cm | 0.02 | 1.02 | 0.03 | 1.03 (0.98–1.07) | 0.25 |
| Systolic blood pressure, mmHg | 0.00 | 1.00 | 0.01 | 1.01 (0.99–1.04) | 0.37 |
BMI, body mass index; CI, confidence interval; LASSO, Least Absolute Shrinkage and Selection Operator; OR, odds ratio.
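The "LASSO OR" column is simply the exponential of each LASSO coefficient. The short sketch below recomputes it from the three-decimal coefficients quoted in the feature selection paragraph; the variable names are hypothetical.

```python
from math import exp

# LASSO coefficients as quoted in the text (three decimal places).
lasso_coefficients = {
    "fasting glucose": 1.116,
    "triglycerides": 0.545,
    "age": 0.116,
    "BMI": 0.083,
    "urban residence": 0.081,
    "albumin": -0.037,
    "blood urea nitrogen": 0.032,
    "waist circumference": 0.019,
    "systolic blood pressure": 0.002,
}

# Odds ratio per one-unit increase of the predictor on the scale used for fitting.
odds_ratios = {name: round(exp(beta), 2) for name, beta in lasso_coefficients.items()}

for name, value in odds_ratios.items():
    print(f"{name}: {value:.2f}")
```

The recomputed values match the tabulated LASSO ORs. The multivariable ORs differ because that model is refit without the L1 penalty.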
The final weighted multivariable model demonstrated that fasting glucose (OR 6.87, 95% CI: 3.85–12.93, P<0.001) and triglycerides (OR 2.52, 95% CI: 1.34–4.96, P=0.005) were the strongest independent predictors. Notably, albumin exhibited a significant inverse association with IR (OR 0.86, 95% CI: 0.78–0.95, P=0.005), whereas blood urea nitrogen showed a positive association (OR 1.43, 95% CI: 1.08–1.92, P=0.01). All variance inflation factors were below 3.0, indicating acceptable levels of multicollinearity.
Model performance comparison
Table 3 and Figure 2 present the predictive performance and confusion matrices of the four ML algorithms. XGBoost achieved an external validation AUC of 0.940 (95% CI: 0.889–0.991), with sensitivity 0.829 (95% CI: 0.664–0.934) and specificity 0.967 (95% CI: 0.906–0.993). RF showed numerically higher validation discrimination (AUC 0.990, 95% CI: 0.980–1.000), while LR and KNN performed less favorably (AUC 0.832 and 0.823, respectively). XGBoost was nevertheless selected as the final model based on two principal considerations. First, RF exhibited near-perfect training metrics (AUC 0.993, sensitivity 0.985) with minimal training-validation divergence, suggesting constrained residual capacity for generalization to broader populations. Second, XGBoost achieved superior specificity (0.967 vs. 0.933) with a narrower CI (0.906–0.993 vs. 0.861–0.975), indicating fewer false-positive classifications and a more precise specificity estimate. These characteristics position XGBoost as the preferred algorithm for clinical deployment. Confusion matrices for the two top-performing algorithms are presented in Figure 3.
Table 3
| Model | Dataset | AUC (95% CI) | Sensitivity (95% CI) | Specificity (95% CI) | Accuracy | PPV | NPV | F1 | Brier |
|---|---|---|---|---|---|---|---|---|---|
| LR | Training | 0.824 (0.765–0.884) | 0.836 (0.725–0.915) | 0.723 (0.645–0.791) | 0.757 | 0.566 | 0.911 | 0.675 | 0.148 |
| RF | Training | 0.993 (0.987–1.000) | 0.985 (0.920–1.000) | 0.929 (0.877–0.964) | 0.946 | 0.857 | 0.993 | 0.917 | 0.060 |
| KNN | Training | 0.772 (0.706–0.838) | 0.716 (0.593–0.820) | 0.703 (0.625–0.774) | 0.707 | 0.511 | 0.852 | 0.596 | 0.169 |
| XGBoost | Training | 0.962 (0.934–0.989) | 0.955 (0.875–0.991) | 0.865 (0.800–0.914) | 0.892 | 0.753 | 0.978 | 0.842 | 0.087 |
| LR | Validation | 0.832 (0.743–0.922) | 0.686 (0.507–0.831) | 0.933 (0.861–0.975) | 0.864 | 0.800 | 0.884 | 0.738 | 0.130 |
| RF | Validation | 0.990 (0.980–1.000) | 0.971 (0.851–0.999) | 0.933 (0.861–0.975) | 0.944 | 0.850 | 0.988 | 0.907 | 0.059 |
| KNN | Validation | 0.823 (0.740–0.905) | 0.829 (0.664–0.934) | 0.700 (0.594–0.792) | 0.736 | 0.518 | 0.913 | 0.637 | 0.148 |
| XGBoost | Validation | 0.940 (0.889–0.991) | 0.829 (0.664–0.934) | 0.967 (0.906–0.993) | 0.928 | 0.906 | 0.935 | 0.866 | 0.094 |
AUC, area under the receiver operating characteristic curve; CI, confidence interval; KNN, k-nearest neighbor; LR, logistic regression; NPV, negative predictive value; PPV, positive predictive value; RF, random forest; XGBoost, eXtreme Gradient Boosting.
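The threshold-based metrics in Table 3 all follow from a single confusion matrix. The sketch below uses counts back-calculated from the reported XGBoost validation sensitivity and specificity and the 35 IR / 90 non-IR split; these counts are an inference made for illustration, not values read from the confusion matrix figures.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics derived from a 2x2 confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


# Inferred counts for XGBoost on the validation cohort (assumption, see lead-in):
# 29 of 35 IR cases detected, 87 of 90 non-IR cases correctly ruled out.
xgb_validation = classification_metrics(tp=29, fp=3, tn=87, fn=6)

for name, value in xgb_validation.items():
    print(f"{name}: {value:.3f}")
```

Under these inferred counts, every metric agrees with the XGBoost validation row to three decimals, so the tabulated values are internally consistent.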
The final trained XGBoost model and R prediction script are available at https://cdn.amegroups.cn/static/public/tp-2026-1-0146-1.pdf and https://cdn.amegroups.cn/static/public/tp-2026-1-0146-2.pdf. To facilitate clinical implementation, the prediction model operates as follows: clinicians input nine routinely available parameters into the provided R script. Continuous variables are automatically standardized using the training set parameters provided in https://cdn.amegroups.cn/static/public/tp-2026-1-0146-1.pdf. The model outputs a probability score between 0 and 1, representing the predicted likelihood of IR. For example, a probability >0.5 suggests high risk, warranting further metabolic evaluation.
Sensitivity analysis
To exclude mathematical circularity arising from the inclusion of fasting glucose, a direct constituent of the HOMA-IR formula, we retrained all models on the remaining eight variables using the identical preprocessing, hyperparameter tuning, and external validation pipeline (Table S2). The XGBoost model maintained robust discriminative performance with an external validation AUC of 0.925 (95% CI: 0.870–0.979), representing a marginal decrease of 0.015 from the original nine-variable model (AUC 0.940). Sensitivity increased from 0.829 to 0.857, while specificity declined from 0.967 to 0.867. RF exhibited near-identical performance (AUC 0.987 vs. 0.990). LR and KNN showed modest reductions in discriminative capacity (ΔAUC 0.052 and 0.060, respectively), with corresponding decreases in sensitivity.
Model interpretability using SHAP analysis
To elucidate the contribution of individual predictors to the XGBoost model’s decisions, SHAP values were calculated for the validation cohort (Figure 4). The analysis revealed that fasting glucose emerged as the dominant predictor, with a mean absolute SHAP value of 0.707, indicating the largest average impact on model output magnitude. Higher glucose concentrations were consistently associated with increased predicted probabilities of IR (positive SHAP values), underscoring the central role of glycemic dysregulation in this iron-deficient pediatric population.
Triglycerides ranked second in importance (mean |SHAP| =0.383), contributing substantially to risk stratification, followed by age (0.290), systolic blood pressure (0.220), and BMI (0.194). Notably, these top five predictors collectively accounted for approximately 79% of the total mean absolute SHAP attribution, with waist circumference (0.163), albumin (0.153), and blood urea nitrogen (0.138) demonstrating moderate contributions. Urban residence exhibited minimal predictive importance (mean |SHAP| =0.015), suggesting that socioeconomic geographic factors played a negligible role compared to metabolic and anthropometric indicators in this cohort. The SHAP dependence plots further revealed non-linear relationships between several predictors and outcome risk, with glucose and triglycerides showing monotonic positive associations, whereas albumin demonstrated an inverse relationship consistent with its protective association (OR 0.86 in multivariable analysis). These findings align with the clinical understanding that IR in iron-deficient children is primarily driven by metabolic dysfunction, with renal function markers providing secondary predictive value.
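The relative weight of each predictor can be read off by normalizing the reported mean |SHAP| values. The sketch below only re-expresses the numbers quoted above as shares of their sum; it does not recompute SHAP values, and a share of attribution magnitude is not the same quantity as a share of discriminative performance.

```python
# Mean absolute SHAP values as reported in the text (validation cohort).
mean_abs_shap = {
    "fasting glucose": 0.707,
    "triglycerides": 0.383,
    "age": 0.290,
    "systolic blood pressure": 0.220,
    "BMI": 0.194,
    "waist circumference": 0.163,
    "albumin": 0.153,
    "blood urea nitrogen": 0.138,
    "urban residence": 0.015,
}

total = sum(mean_abs_shap.values())
shares = {name: value / total for name, value in mean_abs_shap.items()}

# Rank predictors by attribution and sum the shares of the five largest.
ranked = sorted(mean_abs_shap, key=mean_abs_shap.get, reverse=True)
top5_share = sum(shares[name] for name in ranked[:5])

print(ranked[:5])
print(f"top-5 share of total mean |SHAP|: {top5_share:.1%}")
```

On these figures fasting glucose alone carries roughly 31% of the total attribution, and the top five predictors together carry about 79%.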
Discussion
This study developed and externally validated ML models to predict IR among iron-deficient children and adolescents using routinely available clinical variables. Our findings demonstrate that ensemble learning approaches, particularly XGBoost, achieve robust discriminative performance with an AUC of 0.940 in external validation. The integration of SHAP analysis revealed that fasting glucose and triglycerides constitute the primary drivers of IR prediction in this population, while albumin emerged as a protective factor, offering novel insights into the metabolic phenotyping of iron-deficient youth.
The performance of our XGBoost model compares favorably with existing ML applications for IR prediction. Previous pediatric studies utilizing ML algorithms have reported AUC values ranging from 0.86 to 0.93, predominantly focusing on overweight or obese populations rather than iron-deficient subgroups (7-9). Our model achieves comparable discriminative capacity despite targeting a narrower, metabolically distinct phenotype characterized by iron-deficiency-induced perturbations in glucose homeostasis. The superior generalizability of XGBoost relative to RF likely reflects distinct algorithmic behaviors in small-sample contexts. RF exhibited near-perfect training metrics (AUC 0.993, sensitivity 0.985), leaving almost no residual training error and therefore little margin for maintaining performance in independent cohorts. In contrast, XGBoost maintained higher specificity (0.967 vs. 0.933) with tighter CIs on external validation, indicating more stable negative prediction. These findings support selecting algorithms based on consistency across training and validation frameworks, not solely on single-cohort point estimates.
The identification of fasting glucose as the dominant predictive factor aligns with established pathophysiological understanding of IR (23), yet its prominence in iron-deficient populations warrants specific consideration. Iron-deficiency independently impairs insulin signaling by disrupting mitochondrial function and reducing iron-dependent tyrosine kinase activity, potentially exacerbating glycemic dysfunction (24,25). The secondary importance of triglycerides further supports the view that iron-deficiency disrupts lipid metabolism, with triglycerides serving as a key indicator of metabolic dysregulation. Kidman et al. demonstrated that iron is fundamental to hepatic lipid metabolism, with approximately one-third of non-alcoholic fatty liver disease patients showing elevated hepatic iron concentrations (26). Wlazlo et al. further confirmed this link through epidemiological studies showing significant correlations between serum ferritin and triglyceride levels across different populations (27). Notably, the inverse association between albumin and IR risk suggests that nutritional status and hepatic synthetic function may confer protective effects, a finding consistent with the protein-energy malnutrition frequently observed in pediatric iron-deficiency. Chen et al. further noted that hypoalbuminemia is a marker of poor outcomes in critically ill children (28). Okuyan et al. demonstrated that prealbumin levels strongly correlate with nutritional indices (29), while Calcaterra et al. highlighted iron metabolism’s critical role in insulin signaling (3). The evidence suggests a complex interplay between nutritional status, hepatic function, and metabolic health, particularly in pediatric populations.
The clinical implications of this work extend beyond mere risk stratification. Current consensus statements explicitly discourage universal IR screening in children due to methodological limitations and resource constraints associated with fasting insulin measurements. Our model represents a proof-of-concept demonstrating the feasibility of IR risk stratification using routine clinical data. While these findings support continued research toward accessible screening tools, the current version requires threshold optimization and prospective validation before any substitution for specialized metabolic testing. The geographic external validation, in which a model trained on national survey data was tested in a separate provincial clinical cohort, enhances confidence in model transportability and addresses a critical limitation of previous ML studies that relied solely on random train-test splits within single datasets. By validating the model in Sichuan Province using data from both tertiary and secondary hospitals, we demonstrated robustness against regional variations in clinical practice and population characteristics.
Several design features strengthen internal validity. The application of LASSO regression for feature selection addressed multicollinearity while yielding a parsimonious nine-predictor model, reducing the overfitting risk inherent to small-sample ML studies. Ten-fold cross-validation with inverse-frequency weighting further guarded against overfitting and class imbalance. While the initial candidate pool included 27 variables [events per variable (EPV) = 2.48], LASSO regularization reduced the final model to 9 predictors (EPV = 7.44), approaching the conventional threshold of 10. The stability of LASSO selections was supported by three considerations. First, the dominant predictors in the weighted multivariable model demonstrated biologically plausible effect directions, with fasting glucose (OR 6.87) and triglycerides (OR 2.52) as strong positive predictors and albumin (OR 0.86) as a protective factor, consistent with established IR pathophysiology. Second, the sensitivity analysis excluding fasting glucose redistributed importance toward remaining metabolic indicators without marked performance degradation (ΔAUC = 0.015), suggesting the model captured genuine multivariable patterns rather than single-predictor dependence. Third, conservative regularization parameters in XGBoost (max depth = 3, eta = 0.05, min child weight = 5, gamma = 0.1) provided additional safeguards against overfitting during ensemble construction. SHAP analysis provided transparent, instance-level interpretability, and exclusion of participants with elevated hs-CRP minimized inflammatory confounding, although residual confounding from unmeasured variables may persist.
External validity is supported by deliberate geographic and temporal separation between training and validation cohorts. The training data derive from a community-based national survey (CHNS), whereas validation was performed in hospital-based clinical settings across Sichuan Province. This design tests model transportability across both population spectrum (community vs. clinical) and secular trends in pediatric metabolic health. The comparable IR prevalence between cohorts supports generalizability.
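The events-per-variable (EPV) figures and the XGBoost regularization settings quoted in the discussion of design features can be written out as a short sketch. The event count of 67 is an assumption inferred from the two reported EPV values (2.48 × 27 ≈ 7.44 × 9 ≈ 67); it is not stated in this section. The parameter dictionary uses the standard XGBoost parameter names for the reported settings and is shown only as a configuration fragment, not as the authors' training code.

```python
def events_per_variable(n_events, n_predictors):
    """EPV = number of outcome events divided by number of candidate predictors."""
    if n_predictors <= 0:
        raise ValueError("n_predictors must be positive")
    return n_events / n_predictors


# ASSUMPTION: 67 events, back-calculated from the reported EPV values;
# the training-cohort event count is not given in this section.
N_EVENTS = 67

epv_candidates = events_per_variable(N_EVENTS, 27)  # before LASSO selection
epv_final = events_per_variable(N_EVENTS, 9)        # after LASSO selection

print(f"EPV with 27 candidates: {epv_candidates:.2f}")
print(f"EPV with 9 predictors:  {epv_final:.2f}")

# Regularization settings as reported in the text, keyed by XGBoost
# parameter names. Any settings not reported are deliberately left out.
xgb_params = {
    "max_depth": 3,         # shallow trees limit interaction order
    "eta": 0.05,            # small learning rate
    "min_child_weight": 5,  # minimum summed instance weight per leaf
    "gamma": 0.1,           # minimum loss reduction required to split
}
```

With these inputs the final model remains below the conventional EPV threshold of 10, which is why the text describes the value as approaching rather than meeting it.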
To address potential mathematical circularity arising from the inclusion of fasting glucose in the predictor set, we performed a sensitivity analysis retraining all models on the remaining eight variables. The XGBoost model retained near-identical external validation AUC, with sensitivity preserved and specificity declining from 0.967 to 0.867. This finding substantiates that the model’s discriminative capacity derives from genuine metabolic patterns rather than artifactual correlation with the outcome definition. The negligible AUC decrement for XGBoost contrasted with more pronounced declines in LR and KNN, underscoring the superior capacity of gradient boosting to exploit non-linear interactions among remaining variables. These results support the potential utility of the eight-variable restricted model for resource-limited settings where fasting glucose measurements may be unavailable, while the original nine-variable formulation offers enhanced specificity when glycemic data are accessible.
Study limitations require acknowledgment. The modest sample sizes in both training and validation cohorts may constrain the detection of rare predictor interactions and limit subgroup analysis capabilities. The cross-sectional design precludes causal inference regarding the temporal relationship between iron-deficiency and IR development. Additionally, the reliance on sTfR measurements at a single timepoint may misclassify transient iron-deficiency states, potentially introducing noise into the training data. The CHNS dataset dates to 2009, raising questions about secular trends in pediatric metabolic health, although the external validation using contemporary hospital data partially mitigates this concern. The 13- to 15-year interval between the CHNS 2009 training data and the 2022–2024 validation data encompasses substantial shifts in pediatric metabolic profiles, including rising obesity prevalence and dietary transitions in China (30). While these secular trends may affect the distribution of individual predictors, the core pathophysiological relationships between iron deficiency, glycemic dysregulation, and dyslipidemia are biologically conserved and unlikely to exhibit marked temporal variation. The comparable IR prevalence between training and validation cohorts supports this assumption, suggesting that the model captures stable biological mechanisms rather than transient epidemiological patterns. Nevertheless, prospective validation in contemporary population-based samples would further clarify model transportability across secular trends.
Furthermore, several technical constraints limit immediate clinical deployment. The model requires nine standardized input parameters. The optimal probability threshold for clinical action remains undefined: resource-limited settings may prioritize sensitivity to minimize missed cases, whereas tertiary centers might accept lower sensitivity to reduce unnecessary referrals. The multiple imputation pipeline adds computational complexity, which may exceed rural health station capacity. Translating this model into practice requires a phased pathway. Phase one involves prospective validation across diverse settings. Phase two requires threshold optimization through clinician engagement. Phase three includes cost-effectiveness evaluation against direct insulin measurement. Collectively, these limitations preclude any recommendation for immediate clinical deployment. The model should be regarded as a developmental research tool requiring substantial further validation before consideration for real-world application.
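The threshold trade-off described among the technical constraints can be made concrete with a small sketch. It selects the highest probability cut-off that still meets a target sensitivity, so that specificity is as high as possible under that constraint. The labels, predicted probabilities, targets, and function names are all hypothetical and serve only to illustrate how a screening-oriented and a referral-oriented setting could arrive at different cut-offs from the same model output.

```python
def sens_spec(labels, probs, threshold):
    """Sensitivity and specificity when predicting positive for prob >= threshold."""
    tp = sum(1 for y, p in zip(labels, probs) if y == 1 and p >= threshold)
    fn = sum(1 for y, p in zip(labels, probs) if y == 1 and p < threshold)
    tn = sum(1 for y, p in zip(labels, probs) if y == 0 and p < threshold)
    fp = sum(1 for y, p in zip(labels, probs) if y == 0 and p >= threshold)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative label")
    return tp / (tp + fn), tn / (tn + fp)


def highest_threshold_for_sensitivity(labels, probs, target):
    """Largest observed cut-off whose sensitivity is at least `target`.

    Sensitivity cannot decrease as the cut-off is lowered, so the first
    cut-off (scanning from high to low) that meets the target is the largest.
    """
    for t in sorted(set(probs), reverse=True):
        sens, spec = sens_spec(labels, probs, t)
        if sens >= target:
            return t, sens, spec
    raise ValueError("no cut-off reaches the target sensitivity")


# Hypothetical toy predictions (NOT study data).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
probs = [0.95, 0.80, 0.55, 0.30, 0.60, 0.40, 0.20, 0.10]

screening = highest_threshold_for_sensitivity(labels, probs, target=1.0)
referral = highest_threshold_for_sensitivity(labels, probs, target=0.75)
print("screening (threshold, sens, spec):", screening)
print("referral  (threshold, sens, spec):", referral)
```

In this toy example the screening-oriented target forces a lower cut-off and sacrifices specificity, whereas the referral-oriented target tolerates one missed case in exchange for fewer false positives. Any real threshold would need to be set on prospective data with clinician input, as the phased pathway above indicates.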
Future research should prioritize prospective validation of the XGBoost model in diverse clinical settings, particularly resource-limited environments where fasting insulin measurements are unavailable. Investigation of longitudinal changes in model predicted risk scores following iron supplementation would clarify whether improved iron status ameliorates IR, informing potential intervention thresholds. Integration of genetic polymorphisms related to iron metabolism and insulin signaling may further refine risk stratification in this vulnerable population.
Conclusions
In conclusion, this study establishes an externally validated, interpretable ML framework for identifying IR among iron-deficient children and adolescents using routine clinical parameters. By demonstrating strong discriminative performance that was maintained in a geographically separate validation cohort, these findings support continued development of ML-based screening approaches, with this framework serving as a foundation for future prospective studies.
Acknowledgments
None.
Footnote
Reporting Checklist: The authors have completed the TRIPOD reporting checklist. Available at https://tp.amegroups.com/article/view/10.21037/tp-2026-1-0146/rc
Data Sharing Statement: Available at https://tp.amegroups.com/article/view/10.21037/tp-2026-1-0146/dss
Peer Review File: Available at https://tp.amegroups.com/article/view/10.21037/tp-2026-1-0146/prf
Funding: None.
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://tp.amegroups.com/article/view/10.21037/tp-2026-1-0146/coif). The authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by Institutional Review Boards of Nanchong Central Hospital [No. 2024(033)] and Nanchong City Jialing District People’s Hospital (Jialing Branch of Nanchong Central Hospital) [No. 2024(002)]. Individual consent for this retrospective analysis was waived.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Al-Beltagi M, Bediwy AS, Saeed NK. Insulin-resistance in paediatric age: Its magnitude and implications. World J Diabetes 2022;13:282-307.
- Astudillo M, Tosur M, Castillo B, et al. Type 2 diabetes in prepubertal children. Pediatr Diabetes 2021;22:946-50.
- Calcaterra V, Cena H, Bolpagni F, et al. The Interplay Between Iron Metabolism and Insulin Resistance: A Key Factor in Optimizing Obesity Management in Children and Adolescents. Nutrients 2025;17:1211.
- Mainieri F, Tagi VM, Chiarelli F. Insulin resistance in children. Curr Opin Pediatr 2022;34:400-6.
- Sendrea AM, Iorga D, Dascalu M, et al. HOMA-IR Index and Pediatric Psoriasis Severity-A Retrospective Observational Study. Life (Basel) 2024;14:700.
- Tagi VM, Samvelyan S, Chiarelli F. An update of the consensus statement on insulin resistance in children 2010. Front Endocrinol (Lausanne) 2022;13:1061524.
- Xing Z, Chen H, Alman AC. Discriminating insulin resistance in middle-aged nondiabetic women using machine learning approaches. AIMS Public Health 2024;11:667-87.
- Zhang H, Zeng T, Zhang J, et al. Development and validation of machine learning-augmented algorithm for insulin sensitivity assessment in the community and primary care settings: a population-based study in China. Front Endocrinol (Lausanne) 2024;15:1292346.
- Tsai SF, Liu WJ, Lee CL. IDF2022-0124 A trans-ethnic machine learning approach for prediction of insulin resistance in non-diabetic population. Diabetes Res Clin Pract 2023;197:110536.
- Wang Y, Aivalioti E, Stamatelopoulos K, et al. Machine learning in cardiovascular risk assessment: Towards a precision medicine approach. Eur J Clin Invest 2025;55:e70017.
- Huang X, Yi K, Jia L, et al. Development and validation of an insulin resistance prediction model in children and adolescents using machine learning algorithms. Transl Pediatr 2025;14:452-62.
- Malakooti N, Mehrnoush V, Abdi F, et al. Development of a machine learning model to identify the predictors of the neonatal intensive care unit admission. Sci Rep 2025;15:20914.
- Vatankhah Tarbebar M, Mohammadi M, Mehrnoush V, et al. Prognostic machine learning models for predicting postoperative complications following general surgery in Bandar Abbas, Iran: a study protocol. BMJ Open 2025;15:e108019.
- Rodríguez-Gutiérrez N, Villareal-Calderón JR, Castillo EC, et al. Prediction of Insulin Resistance Based on Anthropometric and Clinical Variables in Children with Overweight or Obesity at a Tertiary Center in Northeast Mexico. Metab Syndr Relat Disord 2022;20:174-81.
- Araújo D, Morgado C, Correia-Pinto J, et al. Predicting Insulin Resistance in a Pediatric Population With Obesity. J Pediatr Gastroenterol Nutr 2023;77:779-87.
- Dong W, Jiang H, Li Y, et al. Interpretable machine learning analysis of immunoinflammatory biomarkers for predicting CHD among NAFLD patients. Cardiovasc Diabetol 2025;24:263.
- Collins GS, Reitsma JB, Altman DG, et al. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. BMJ 2015;350:g7594.
- Zhang YT, Mashevskaya OV, Wang XZ. The impact of export shocks on child health: evidence from China. Front Public Health 2025;13:1593356.
- Liang Y, Qiao T, Ni X, et al. Association between hyperuricemia and dietary retinol intake in Southwest China: a cross-sectional study based on CHNS database. Front Nutr 2025;12:1508774.
- Vázquez-López MA, López-Ruzafa E, Ibáñez-Alcalde M, et al. The usefulness of reticulocyte haemoglobin content, serum transferrin receptor and the sTfR-ferritin index to identify iron deficiency in healthy children aged 1-16 years. Eur J Pediatr 2019;178:41-9.
- Khalili D, Khayamzadeh M, Kohansal K, et al. Are HOMA-IR and HOMA-B good predictors for diabetes and pre-diabetes subtypes? BMC Endocr Disord 2023;23:39.
- Mandrekar JN. Receiver operating characteristic curve in diagnostic test assessment. J Thorac Oncol 2010;5:1315-6.
- Tian X, Chen S, Wang P, et al. Insulin resistance mediates obesity-related risk of cardiovascular disease: a prospective cohort study. Cardiovasc Diabetol 2022;21:289.
- Fernández-Real JM, McClain D, Manco M. Mechanisms Linking Glucose Homeostasis and Iron Metabolism Toward the Onset and Progression of Type 2 Diabetes. Diabetes Care 2015;38:2169-76.
- Ramesh J, Shaik MI, Srivalli J. Impact of Iron Indices, Mitochondrial Oxidative Capacity, Oxidative Stress and Inflammatory Markers on Insulin Resistance and Secretion: A Pathophysiologic Perspective. J Diabetes Metab 2012;3:9.
- Kidman CJ, Mamotte CDS, Inder-Smith KR, et al. The Interplay of Iron and Lipid Homeostasis in Non-Alcoholic Fatty Liver Disease. Journal of Renal & Hepatic Disorders 2025;8:1-16.
- Wlazlo N, Greevenbroek MMJV. Lipid metabolism: a role for iron? Curr Opin Lipidol 2012;23:258-9.
- Chen CB, Hammo B, Barry J, et al. Overview of Albumin Physiology and its Role in Pediatric Diseases. Curr Gastroenterol Rep 2021;23:11.
- Okuyan Ö, Durmus S, Uzun H. The relationship between nutritional status and prealbumin levels in children with loss of appetite and iron deficiency: a prospective cross-sectional study. Front Nutr 2025;12:1647870.
- Yuan C, Dong Y, Chen H, et al. Determinants of childhood obesity in China. Lancet Public Health 2024;9:e1105-14.


