Machine learning-based genome-wide association analysis to construct a clinical decision model for severe neonatal jaundice

Haiyan Ma; Xianhong Chen; Peng Zhang; Xinran Dong; Bingbing Wu; Guoqiang Chen; Wenhao Zhou; Mingbang Wang

doi:10.21037/tp-2026-1-0082

Original Article

Haiyan Ma¹,²#, Xianhong Chen³,⁴#, Peng Zhang⁵, Xinran Dong¹, Bingbing Wu¹, Guoqiang Chen⁵, Wenhao Zhou¹,⁶, Mingbang Wang³,⁵

¹Center for Molecular Medicine, Children’s Hospital of Fudan University, National Center for Children’s Health, Shanghai, China; ²Department of Neonatology, Zhuhai Women and Children’s Hospital, Zhuhai, China; ³Department of Neonatology, Affiliated Shenzhen Women and Children’s Hospital (Longgang) of Shantou University Medical College (Longgang District Maternity & Child Healthcare Hospital of Shenzhen City), Shenzhen, China; ⁴Division of Neonatology, Longgang Central Hospital of Shenzhen, Shenzhen, China; ⁵Division of Neonatology, Children’s Hospital of Fudan University, National Center for Children’s Health, Shanghai, China; ⁶Department of Neonatology, Guangzhou Women and Children’s Medical Center, Guangzhou Medical University, Guangzhou, China

Contributions: (I) Conception and design: ; (II) Administrative support: ; (III) Provision of study materials or patients: ; (IV) Collection and assembly of data: ; (V) Data analysis and interpretation: ; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

#These authors contributed equally to this work.

Correspondence to: Dr. Mingbang Wang. Department of Neonatology, Affiliated Shenzhen Women and Children’s Hospital (Longgang) of Shantou University Medical College (Longgang District Maternity & Child Healthcare Hospital of Shenzhen City), Shenzhen 518172, China; Division of Neonatology, Children’s Hospital of Fudan University, National Center for Children’s Health, Shanghai 201102, China. Email: mingbang_wang@fudan.edu.cn; Dr. Wenhao Zhou. Center for Molecular Medicine, Children’s Hospital of Fudan University, National Center for Children’s Health, Shanghai 201102, China; Department of Neonatology, Guangzhou Women and Children’s Medical Center, Guangzhou Medical University, Guangzhou 510623, China. Email: zhouwenhao@fudan.edu.cn.

Importance: Early identification of severe unconjugated hyperbilirubinemia is essential to avoid bilirubin encephalopathy and neurological sequelae. Genetic factors play an important role, and genome-wide association studies in adults have revealed important biological insights into jaundice.

Objective: The etiology of neonatal jaundice is complex. This study aims to identify genetic variants associated with severe neonatal jaundice (SNJ) through a genome-wide association approach and to evaluate their potential for risk prediction.

Design, setting, and participants: We performed a genome-wide association analysis based on whole-exome sequencing (WES) of 155 patients with SNJ and 160 patients without SNJ. The importance of genetic factors and clinical phenotypes for the classification of SNJ was evaluated using the LASSO machine learning method.

Exposures: Interventions such as phototherapy were performed according to the severity of neonatal jaundice.

Main outcomes and measures: A clinical prediction model for SNJ based on genetic factors and clinical phenotypes was constructed, and the causal association between single-nucleotide polymorphisms (SNPs) and SNJ was assessed by causal inference analysis.

Results: In our cohort, SNJ was associated with increased erythrocyte count and hemoglobin (Hb) concentration. A positive correlation between erythrocyte count and Hb was also observed. Seventeen SNPs were found to be significantly associated with total blood erythrocyte count. A missense mutation in the gene haptoglobin-related protein (HPR), rs144648182, was enriched in SNJ. This mutation may affect the ability of HPR to bind free Hb, and a machine learning causal inference approach supported a potential causal effect of rs144648182 on serum total bilirubin. Nine genotypes and clinical phenotypes associated with SNJ were identified by the LASSO method and used to construct a clinical prediction model for SNJ, which can predict high-risk individuals among neonates with jaundice and aid clinical decisions.

Conclusions and relevance: We applied machine learning causal inference to GWAS data and identified potential erythroid-related genetic factors, including the HPR variant rs144648182, that may contribute to SNJ. This finding represents a testable hypothesis requiring experimental validation. A prediction model based on genetic and clinical variables demonstrated potential for risk stratification among jaundiced neonates, though external validation is needed before clinical application.

Keywords: Neonatal jaundice; whole-exome sequencing (WES); genome-wide association study (GWAS); Least Absolute Shrinkage and Selection Operator (LASSO); causal inference

Submitted Jan 22, 2026. Accepted for publication Mar 19, 2026. Published online Apr 26, 2026.

Introduction

Neonatal jaundice is a common clinical condition in newborns characterized by an increase in total serum bilirubin, which manifests as yellowing of the skin and sclerae (1). Most cases resolve spontaneously, but a small number of neonates may develop severe hyperbilirubinemia or even bilirubin encephalopathy, which can lead to death or brain damage if not diagnosed and treated promptly (2,3). Early identification of severe hyperbilirubinemia is essential to effectively prevent bilirubin encephalopathy and its neurological sequelae (4).

The etiology of neonatal jaundice is complex and a comprehensive assessment of the role of genetic factors by genome-wide association study (GWAS) seems necessary. After birth, excess red blood cells are destroyed in large quantities, leading to excessive bilirubin production; at the same time, the metabolic function of newborns is immature, and bilirubin metabolism is slower and less efficient. Common clinical factors associated with severe neonatal jaundice (SNJ) include isoimmune hemolytic disease, inadequate feeding, and infection (5,6). In addition to clinical factors, genetic factors play an important role in SNJ. A recent large multicenter study demonstrated that neonates with genetic variants had significantly higher rates of severe hyperbilirubinemia (16.9% vs. 9.7%, P=0.001) compared to those without genetic variants (7). Specifically, UGT1A1 211G>A homozygous mutation confers a 2.35-fold increased risk for severe unconjugated hyperbilirubinemia in the Chinese population (6). We previously initiated the Chinese Neonatal Genome Project, which aims to comprehensively resolve the genetic factors of neonatal diseases through whole-exome sequencing (WES) or whole-genome sequencing, based on which we constructed a large genome-wide database and identified causative genes for neonatal diseases (8,9). The Chinese Neonatal Genome Project also offers the possibility of conducting large-scale WES genome-wide association studies.

Accurate disease risk prediction models are essential for stratifying individuals with SNJ: high-risk individuals can be offered targeted screening and interventions, while low-risk individuals can avoid unnecessary screening and interventions. Machine learning methods have been applied in genome-wide association studies and used to build disease risk prediction models. Thomas et al. found that LDpred, a machine learning-based risk prediction model using a Bayesian approach for genome-wide risk prediction, was able to identify 30% of individuals with no family history as being at high risk for colorectal cancer (CRC), whereas the traditional polygenic risk score model identified only 10% of individuals without family history as high risk (10).

The observed associations between suspected risk factors and outcomes do not always indicate that interventions at the risk factor level will have a causal effect on outcomes (correlation is not causation) (11). Causal inference methods have been used to identify potential causal effects of genotype on disease phenotypes using GWAS data (12). Chen et al. identified a potential causal association between genome-wide significant single-nucleotide polymorphism (SNP) loci and gallstone disease by causal inference methods, i.e., univariate and multivariate Mendelian randomization (13). McCormick et al. evaluated the causal relationship between insulin resistance, hyperuricemia, and gout using genome-wide association data and bidirectional Mendelian randomization, and found that hyperinsulinemia leads to hyperuricemia and not vice versa (14). Sealock et al. investigated the causal link between depression polygenic scores and white blood cell counts using PsycheMERGE Network data and causal inference methods based on mediation analysis and Mendelian randomization, and found that increased depression polygenic scores were associated with increased white blood cell counts, suggesting that the association may be bidirectional (15).

In brief, considering the complexity of neonatal jaundice etiology and the important role of genetic factors, it is necessary to conduct a GWAS of neonatal jaundice in a Chinese population to identify genetic factors that differentiate severe from non-severe cases and comprehensively understand the role of genetic factors in neonatal jaundice through a machine learning-based causal inference approach. We present this article in accordance with the TRIPOD reporting checklist (available at https://tp.amegroups.com/article/view/10.21037/tp-2026-1-0082/rc).

Methods

Participants

The participants were patients diagnosed with neonatal unconjugated hyperbilirubinemia who were also participating in the Chinese Neonatal Genome Project (16), and a detailed clinical description of the patients is provided in Appendix 1.

WES

WES was performed as in our previous studies of the Chinese Neonatal Genome Project (16) and is detailed in Appendix 1.

SNP calling

Variants were called as in our previous studies (17-19), and the procedure is detailed in Appendix 1.

Genome-wide association analysis

Exome-level SNPs were used for the genome-wide association analysis, which was performed with PLINK software; the detailed process is shown in Appendix 1.

Machine learning

The Least Absolute Shrinkage and Selection Operator (LASSO), a machine learning method, was used to discover SNPs that contribute significantly to clinical variables. The analysis was performed according to our previous studies (20,21) and is detailed in Appendix 1.

Clinical prediction models

A clinical prediction model for severe jaundice based on SNPs and clinical indicators was developed. We also performed a calibration assessment of the model and predicted the risk of severe jaundice based on the model. The detailed procedure is shown in Appendix 1.

Survival analysis

Kaplan-Meier survival curves were generated with the KaplanMeierFitter() class of the lifelines package (version 0.26.4) in Python (version 3.7.6). For univariate survival analysis, we first set the time variable and the event variable, and then compared the univariate survival curves. Statistical significance was calculated with the logrank_test() function of the statistics module of the lifelines package, and a P value less than 0.05 was considered significant. Multivariate survival analysis was performed with the CoxPHFitter() class of the lifelines package. The independent effects of predictive variables on time-to-event outcomes were evaluated using a Cox proportional hazards model, with important features selected by LASSO included as covariates. Erythrocyte-related indicators and time to peak bilirubin level were defined as time variables. The proportional hazards assumption was tested using Schoenfeld residuals, and results were presented as hazard ratios with 95% confidence intervals.
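To make concrete what a Kaplan-Meier curve computes, the following is a minimal pure-Python sketch of the product-limit estimator. It is an illustration only and not the lifelines implementation used in the study; the function name kaplan_meier and the toy data are hypothetical.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Product-limit estimate of the survival function.

    durations: observed time for each subject.
    events: 1 if the event occurred at that time, 0 if the subject was censored.
    Returns a list of (time, survival_probability) at each distinct event time.
    """
    deaths = Counter(t for t, e in zip(durations, events) if e)
    exits = Counter(durations)  # every subject leaves the risk set at its own time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Multiply by the conditional probability of surviving past time t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]  # remove events and censored subjects alike
    return curve


# Toy data (hypothetical, not from the study): five subjects, two censored.
times = [2, 3, 3, 5, 8]
observed = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, observed)
for t, s in curve:
    print(f"t={t}: S(t)={s:.2f}")
```

Censored subjects contribute to the risk set until their last observed time but never trigger a drop in the curve, which is why the estimate changes only at event times.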

Machine learning-based causal inference

A machine learning-based causal inference approach was used to assess the causal association of SNPs and clinical indicators with severe jaundice, with reference to a previous article (22). The analysis was performed with Microsoft’s DoWhy library (https://github.com/microsoft/dowhy) and the EconML library (https://github.com/econml/), following the software manuals and our previous study (submitted). First, our domain knowledge was encoded into a causal model and represented as a graph; then the backdoor.linear_regression method of DoWhy was used to check whether the target quantity could be estimated from the given observed variables. Next, the estimator was constructed using EconML’s machine learning method, which uses gradient boosting trees to learn the relationship between the outcome and the confounding factors, as well as the relationship between the intervention and the confounding factors, and then compares the residuals of the outcome and the intervention. Finally, the robustness of the causal model was assessed by the placebo_treatment_refuter and data_subset_refuter tests.
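The residual-comparison step described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: ordinary least squares on a single confounder is swapped in for the gradient boosting trees, there is no cross-fitting, and none of the function names below belong to DoWhy or EconML. All names and data are hypothetical.

```python
def ols_fit(x, y):
    """Least-squares line y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def residuals(x, y):
    """Part of y that is not explained by a linear fit on x."""
    a, b = ols_fit(x, y)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


def partialled_out_effect(treatment, outcome, confounder):
    """Regress outcome residuals on treatment residuals (both have mean zero)."""
    t_res = residuals(confounder, treatment)
    y_res = residuals(confounder, outcome)
    return sum(t * y for t, y in zip(t_res, y_res)) / sum(t * t for t in t_res)


# Toy data (hypothetical): the outcome depends on the treatment with a true
# effect of 2.0 and on a confounder that is also correlated with the treatment.
confounder = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
treatment = [0.0, 0.0, 1.0, 0.0, 1.0, 1.0]
outcome = [2.0 * t + 3.0 * w for t, w in zip(treatment, confounder)]

naive_effect = ols_fit(treatment, outcome)[1]  # ignores the confounder
adjusted_effect = partialled_out_effect(treatment, outcome, confounder)
print(naive_effect, adjusted_effect)
```

In this toy setting the naive slope is inflated by the confounder, while the residual-on-residual estimate recovers the effect that was built into the data. The refutation tests named in the text then probe whether such an estimate is stable, for example when the treatment is replaced by a placebo or when only a subset of the data is used.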

Ethical consideration

The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by the Ethics Committee of Children’s Hospital of Fudan University (No. CHFudanU_NNICU11), and written informed consent was obtained from the parents of the neonates.

Results

The overall design of this study is shown in Figure 1. A total of 315 neonates with neonatal jaundice were included, including 160 cases of non-SNJ and 155 cases of SNJ. The two groups were statistically comparable at baseline. In the non-SNJ group, the gestational age was 38.66±1.25 weeks, the birth weight was 3,293.68±425.53 g, the age at admission was 9.71±10.31 days, onset time was 2.74±1.44 days and there were 99 males (61.9%). In the SNJ group, the gestational age was 38.63±1.25 weeks, the birth weight was 3,257.26±400.20 g, the age at admission was 8.25±6.38 days, onset time was 2.96±1.59 days and there were 81 males (52.3%). Notably, compared with the non-SNJ group, the SNJ group showed significantly higher red blood cell counts (5.02±0.87 vs. 4.75±0.87, P=0.006) and hemoglobin (Hb) levels (170.17±32.97 vs. 159.14±35.06, P=0.004). Detailed clinical characteristics are shown in Table 1.

Figure 1 Flow chart of this study. Samples were collected from 155 neonates with SNJ (severe NJ⁺) and 160 patients without SNJ (severe NJ⁻). First, WES was performed to obtain data on gene SNPs at the whole-exome level; then, the quality-controlled SNPs were used for a GWAS with clinical indicators of neonatal jaundice to obtain SNPs with significant genome-wide associations. Finally, the effectiveness of the SNPs and clinical phenotypes for classification of SNJ was further evaluated based on the LASSO machine learning approach to construct a clinical prediction model of SNJ based on SNPs and clinical phenotypes and to assess the causal association between SNPs and SNJ using causal inference analysis. GWAS, genome-wide association study; LASSO, Least Absolute Shrinkage and Selection Operator; SNJ, severe neonatal jaundice; SNP, single-nucleotide polymorphism; WES, whole-exome sequencing.

Table 1

Clinical information statistics

	Severe NJ⁻	Severe NJ⁺	P
n	160	155
AgeA (days)	9.71 [10.31]	8.25 [6.38]	0.13
AWL = Y	12 (7.5)	23 (14.8)	0.06
BF = Y	141 (88.1)	151 (97.4)	0.003
BL = Y	29 (18.1)	53 (34.2)	0.002
BW (g)	3,293.68 [425.53]	3,257.26 [400.20]	0.44
CRP = Y	21 (13.1)	14 (9.0)	0.33
EXB = Y	3 (1.9)	10 (6.5)	0.08
GA (weeks)	38.66 [1.25]	38.63 [1.25]	0.80
Gender = M	99 (61.9)	81 (52.3)	0.11
Hb	159.14 [35.06]	170.17 [32.97]	0.004
HEMO = Y	21 (13.1)	18 (11.6)	0.81
Hypothyroidism = Y	7 (4.4)	7 (4.5)	>0.99
Hospital_stay	6.84 [2.82]	6.95 [2.36]	0.69
LCR	44.73 [15.21]	45.81 [13.53]	0.51
NER	39.12 [14.92]	38.15 [13.04]	0.54
Onset_time (days)	2.74 [1.44]	2.96 [1.59]	0.19
Other	0.09 [0.28]	0.06 [0.23]	0.32
Peak_TSB_time	10.38 [10.63]	8.72 [6.05]	0.09
PT = Y	153 (95.6)	155 (100.0)	0.02
RBC	4.75 [0.87]	5.02 [0.87]	0.006
RET	2.05 [1.84]	1.54 [1.48]	0.008
TSB	267.11 [49.07]	398.47 [60.84]	<0.001
WBC	11.44 [4.23]	11.45 [3.53]	0.98

Data are presented as number (%) or mean [standard deviation]. AgeA, age at admission; AWL, abnormal weight loss; BF, breastfeeding; BL, extravascular hemorrhage (including cephalhematoma and intracranial hemorrhage); BW, birth weight; CRP, C-reactive protein; EXB, exchange transfusion; GA, gestational age; Hb, hemoglobin; HEMO, hemolysis (ABO/Rh incompatibility, positive Coombs test); Hospital_stay, hospital stay; LCR, lymphocyte ratio; M, male; NER, neutrophil ratio; NJ, neonatal jaundice; Onset_time, onset time of jaundice; Other, other complications; Peak_TSB_time, time to peak total serum bilirubin; PT, phototherapy; RBC, red blood cell count; RET, reticulocyte count; TSB, total serum bilirubin; WBC, white blood cell count; Y, yes.

Genome-wide association analysis identified SNPs associated with total red blood cell count in neonatal jaundice

To gain a comprehensive understanding of the role of genetic factors in neonatal jaundice, we performed a genome-wide association analysis based on exome sequencing data. First, WES was performed on whole blood samples to obtain information on SNPs at the whole-exome level, and then quality control was performed to remove low-frequency SNPs and SNPs with genotyping errors, resulting in a total of 46,854 SNPs for the GWAS. The distribution of these SNPs on the chromosomes is shown in Figure 2A, which shows that the chromosomes are well covered by the SNPs. A GWAS of these SNPs against clinical indices, performed with PLINK, found SNP loci that correlated with erythroid-related indicators, including the total red blood cell count, reticulocyte count, altered red blood cell count, and altered reticulocyte count, at a threshold P value of 1.0×10⁻⁵ (Figure 2B). A total of 17 SNPs were found to be significantly correlated with red blood cell count (Figure 2C). Specific information on all the SNPs correlated with red blood cell metrics is provided in Table S1.
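As an illustration of a single-SNP test against a quantitative trait such as red blood cell count, the sketch below fits an additive model (phenotype regressed on allele dosage 0/1/2) and compares the result with the 1.0×10⁻⁵ threshold. It is an assumption that an additive linear model is a reasonable stand-in here; the exact PLINK procedure is described in Appendix 1. The P value uses a normal approximation that is only sensible for samples of a few hundred, and the function name and toy data are hypothetical.

```python
import math


def additive_assoc(dosages, phenotype):
    """Regress a quantitative phenotype on allele dosage (0, 1, or 2).

    Returns (beta, t_statistic, approximate two-sided P value).
    The P value uses a normal approximation to the t distribution.
    """
    n = len(dosages)
    mg = sum(dosages) / n
    mp = sum(phenotype) / n
    sxx = sum((g - mg) ** 2 for g in dosages)
    sxy = sum((g - mg) * (y - mp) for g, y in zip(dosages, phenotype))
    beta = sxy / sxx
    alpha = mp - beta * mg
    rss = sum((y - alpha - beta * g) ** 2 for g, y in zip(dosages, phenotype))
    se = math.sqrt(rss / (n - 2) / sxx)
    t_stat = beta / se
    p_value = math.erfc(abs(t_stat) / math.sqrt(2))
    return beta, t_stat, p_value


# Toy data (hypothetical, far smaller than the cohort; mechanics only).
dosages = [0, 0, 1, 1, 2, 2, 0, 1]
rbc = [4.5, 4.7, 5.0, 5.1, 5.6, 5.4, 4.6, 4.9]

THRESHOLD = 1.0e-5  # threshold P value stated in the text
beta, t_stat, p_value = additive_assoc(dosages, rbc)
passes = p_value < THRESHOLD
print(f"beta={beta:.3f}, passes threshold: {passes}")
```

In a genome-wide scan the same test is repeated for each of the quality-controlled SNPs, and only loci below the threshold are carried forward.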

Figure 2 Results of GWAS. (A) Density distribution of SNPs on chromosomes, which shows good coverage of SNPs on chromosomes. (B) QQ diagram of important clinical indicators. (C) Manhattan diagram of total RBC-associated SNPs. GWAS, genome-wide association study; Hb, hemoglobin; QQ, quantile-quantile; RBC, red blood cell; RET, reticulocyte count; SNP, single-nucleotide polymorphism; TSB, total serum bilirubin.

Machine learning identified genetic and clinical features associated with SNJ

To evaluate the potential of genetic factors and clinical features for classification of SNJ, a LASSO-based approach was used for machine learning analysis. First, the optimal lambda value was screened, defined as the point at which the cross-validation error is minimized, balancing model complexity and predictive accuracy; a value of 0.0477 was selected by 10-fold cross-validation based on the minimum binomial deviance criterion (Figure 3A). Then, based on the optimal lambda value, nine important variables that could be used for the classification of SNJ were identified (Figure 3A), of which rs144648182, Hb concentration, breastfeeding, altered reticulocyte count, and phototherapy (as a treatment indicator) had a positive effect on serum total bilirubin level (Figure 3B). Of note, phototherapy was included as a marker of clinical intervention, not as a causal factor for bilirubin elevation. A simplified LASSO regression model was constructed based on these nine important variables, and the model formulation was evaluated. The results of the evaluation are shown in Table S2, which shows that rs144648182, chr3_75715118, altered reticulocyte count, breastfeeding, gender, and bleeding contributed significantly to the model. The results of the correlation analysis of these nine important variables are shown in Figure 3C, which shows that altered reticulocyte count was significantly associated with breastfeeding and Hb concentration.
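The way LASSO selects variables can be seen in a small coordinate descent sketch. For brevity it uses a squared-error loss rather than the binomial deviance used in the study, so it illustrates the L1 shrinkage mechanism only and is not the authors' model. The function names and toy data are hypothetical.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=100):
    """Minimize (1/2n)*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent.

    Assumes centered columns and a centered response, so no intercept is fitted.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            zj = 0.0   # squared norm of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                zj += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (zj / n)
    return beta


# Toy data (hypothetical): feature 0 has a strong effect, feature 1 a weak one.
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [2.0 * a + 0.1 * b for a, b in X]

beta = lasso_cd(X, y, lam=0.5)
print(beta)  # the weak feature is dropped; the strong one is shrunk
```

Larger lambda values zero out more coefficients, which is why cross-validation over lambda determines how many variables remain in the model.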

Figure 3 Machine learning screening of genetic and clinical characteristics associated with SNJ. (A) LASSO machine learning method to screen the optimal classification model for SNJ. The blue dashed line in the x-axis direction corresponds to the minimum log(lambda) value of 3.09, the maximum lambda value is 0.0477, and the top corresponds to the number of variables in the model. (B) Bar plot of significant variables; the size of each bar is the regression coefficient obtained from the LASSO analysis. (C) Heatmap of correlations of important variables; asterisks represent significant correlations. Phototherapy was included as a treatment indicator reflecting clinical intervention, not as a causal risk factor for bilirubin elevation. BF, breastfeeding; BL, extravascular hemorrhage; Hb, hemoglobin; LASSO, Least Absolute Shrinkage and Selection Operator; PT, phototherapy; RET, reticulocyte count; SNJ, severe neonatal jaundice.

Construction of a clinical prediction model for risk assessment of SNJ and to aid clinical decisions

Based on the ranking of important variables obtained from the LASSO analysis, we constructed clinical prediction models using the top three important variables (top3), the top six important variables (top6), and all nine important variables (all9). To evaluate whether the model-predicted risk agreed with the actual risk, we calibrated the clinical prediction models and found that the predicted risk of all models was in good agreement with the actual risk; the Brier score was reported for the model based on all nine important variables (Figure 4A). When a biomarker is used to predict whether a patient has a disease, false positives and false negatives are possible whichever threshold is chosen; in some settings it is preferable to avoid false positives and in others to avoid false negatives. Since neither can be avoided entirely, we used decision curve analysis (DCA) to find the model with the greatest net benefit. The results (Figure 4B) show that a clinical decision model based on all important clinical variables has a certain clinical effect, or net benefit. Furthermore, we evaluated the clinical effects of the model based on all variables using the clinical impact curve and found that interventions at a threshold of ≤0.4 could reduce harm and increase benefit (Figure 4C). Finally, we constructed a logistic regression model for risk prediction of severe jaundice based on all variables and visualized it as a nomogram, which showed that the model provided better risk prediction of severe jaundice (Figure 4D). The time to peak serum total bilirubin and the hospital stay were also predicted based on the clinical prediction model (Figure S1).
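The two quantities behind the calibration and decision curve analyses can be written out directly. The sketch below uses the standard definitions of the Brier score and of net benefit at a threshold probability; the predicted probabilities and labels are hypothetical toy values, not study data, and the function names are illustrative.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted risk and observed outcome (0 or 1)."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def net_benefit(probs, labels, threshold):
    """Net benefit of treating everyone whose predicted risk is >= threshold.

    True positives count as benefit; false positives are penalized by the
    odds of the threshold probability, threshold / (1 - threshold).
    """
    n = len(labels)
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


# Toy predictions (hypothetical): six neonates, three with the outcome.
probs = [0.9, 0.7, 0.45, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0]

brier = brier_score(probs, labels)
nb_model = net_benefit(probs, labels, threshold=0.4)
# "Treat all" reference strategy: everyone is predicted positive.
nb_treat_all = net_benefit([1.0] * len(labels), labels, threshold=0.4)
print(f"Brier={brier:.3f}, model NB={nb_model:.3f}, treat-all NB={nb_treat_all:.3f}")
```

A decision curve repeats the net benefit calculation over a range of thresholds and plots the model against the treat-all and treat-none (net benefit of zero) strategies; a model is useful at a given threshold when its curve lies above both.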

Figure 4 Clinical prediction model. (A) Calibration curve showing that the predicted risk and the actual risk match. (B) Decision curve showing that the clinical decision model composed of important variables has certain clinical effects. The horizontal coordinate is the threshold probability and the vertical coordinate is the net benefit after subtracting the disadvantage. (C) Clinical impact curve showing that interventions at a threshold ≤0.4 can reduce harm and increase benefit. (D) The risk of occurrence of severe jaundice can be predicted based on the clinical prediction model; the red points in the figure show the corresponding scores of each clinical index for a positive patient with a total score of 0.84. AUC, area under the curve; BF, breastfeeding; BL, extravascular hemorrhage; Hb, hemoglobin; PT, phototherapy; RET, reticulocyte count; TSB, total serum bilirubin.

Causal inference reveals that rs144648182, associated with total red blood cells, may contribute to elevated serum total bilirubin

The number of red blood cells and Hb concentration are important factors influencing the elevation of total serum bilirubin. In the present study, we found that the total red blood cell count and Hb concentration were significantly elevated in neonates with SNJ compared with those without SNJ (Figure 5A,5B), and the increase in red blood cell count was positively correlated with the increase in Hb concentration (Figure 5C). Univariate Kaplan-Meier survival curve analysis showed a significant effect of rs144648182 on Hb concentration (Figure 5D). Survival analysis based on a multivariate Cox proportional hazards model also showed that rs144648182 was a significant risk factor for Hb concentration and time to peak serum total bilirubin (Figure 5E). To further assess whether total red blood cells and total red blood cell-related SNPs directly or indirectly contributed to the elevated serum total bilirubin concentration, we evaluated the causal effect of total red blood cell-related SNPs on serum total bilirubin concentration using a machine learning causal inference approach. The results showed that rs144648182 had a potential causal association with bilirubin, that Hb was a possible mediator, and that other SNPs, breastfeeding, and gender were confounders (Figure 5F).

Figure 5 Causal inference reveals a potential causal effect of erythrocyte-associated SNPs on total serum bilirubin. Relative to neonates with non-SNJ, neonates with SNJ had significantly higher RBC count (A) and Hb concentration (B). (C) The rise in altered blood RBC count (delta RBC) was positively correlated with the rise in Hb concentration. (D) Kaplan-Meier survival curve, where the time variable was set to total RBC and the event variable to Hb, showing that rs144648182 significantly affects the outcome of Hb. (E) Cox proportional hazards model survival analysis, showing that rs144648182 is a risk factor for Hb concentration and time to peak serum total bilirubin (peak TSB time). (F) The machine learning causal inference approach assessed the RBC-related SNPs and clinical phenotypes against TSB and obtained potential causal graphs, which centered on the potential causal effect of rs144648182 on TSB, with Hb as a mediator and other SNPs, BF, and gender as confounders. This model represents a computational hypothesis requiring experimental validation. BF, breastfeeding; BL, extravascular hemorrhage; CI, confidence interval; Hb, hemoglobin; HR, hazard ratio; PT, phototherapy; RBC, red blood cell; RET, reticulocyte count; SNJ, severe neonatal jaundice; SNP, single-nucleotide polymorphism; TSB, total serum bilirubin.

Discussion

Jaundice is a common symptom in the neonatal period, and SNJ can lead to kernicterus or even death. It is essential to accurately identify individuals at high risk from neonatal jaundice and to intervene effectively. Genetic factors play a very important role in jaundice, and genome-wide association studies conducted in adults have revealed important biological insights into jaundice. SNPs in the genes UGT1A1, SLCO1B3, and SEMA3C were found to be associated with jaundice and clinical comorbidities (23-29). However, studies have suggested ethnic differences in genetic associations for bilirubin levels between populations. For example, Kang et al. conducted a large GWAS using 8,841 Koreans to identify genetic variants affecting serum bilirubin levels, and significant associations were observed at the previously identified loci UGT1A1 (rs11891311) and SLCO1B3 (rs2417940). However, the two SLCO1B3 variants (rs17680137 and rs2117032) most significantly associated with total serum bilirubin in a European population were not found to reach genome-wide significance levels in the Korean population (25).

The first genome-wide association analysis in neonatal jaundice identified 17 SNPs significantly associated with total red blood cell count

The development of neonatal jaundice is closely related to the formation and senescence of erythrocytes. Reticulocytes generated in the bone marrow can enter the blood and generate mature erythrocytes under the action of erythropoietin (EPO), and mature erythrocytes senesce and release Hb, which is the main source of bilirubin. We found that both total red blood cells and Hb were significantly increased in the SNJ group relative to the non-SNJ group, while the number of altered erythrocytes was positively correlated with the Hb concentration. Notably, we identified 17 SNP loci that correlated with total red blood cell count.

rs144648182 is a missense mutation in the gene HPR (NP_066275.3:p.Arg219His). HPR (haptoglobin-related protein) can bind free Hb, and the complex of HPR bound to free Hb in the blood can be captured by macrophages. Heme is then removed from free Hb and metabolized to bilirubin. Through causal inference analysis, we found that rs144648182 may increase bilirubin levels by affecting Hb metabolism. The potential mechanism is that this missense mutation may alter the HPR protein, reducing its ability to bind free Hb. This could impair Hb clearance, leading to increased heme substrate availability and consequently enhanced bilirubin production. Although this association is biologically plausible given the gene’s role in Hb metabolism, our causal inference relies on statistical assumptions and lacks direct functional evidence. Therefore, this finding should be considered preliminary, and the proposed mechanism requires further experimental validation. Future studies should use techniques such as surface plasmon resonance to determine whether this mutation affects Hb binding, combined with gene-edited cell or animal models to verify the causal pathway. If validated, this mechanism may inform novel therapeutic strategies targeting Hb clearance or heme metabolism in high-risk infants.

rs183378943 is a mutation in the non-coding region of the PLEKHA5 gene. Previous studies have shown that PLEKHA5 exerts its biological functions through the PI3K-AKT signaling pathway (30,31). Heme oxygenase-1 (HO-1) is the key enzyme for bilirubin formation: it oxidizes heme to biliverdin, releasing iron and carbon monoxide (CO), and biliverdin is then reduced to bilirubin by biliverdin reductase. The process of bilirubin formation is closely related to the PI3K-AKT pathway, and biliverdin reductase was found to directly activate protein kinase B (AKT) phosphorylation (32), while HO-1-induced hypoxia/reoxygenation protection depends on biliverdin reductase and its interaction with the PI3K-AKT pathway (33). Bilirubin was found to regulate brain-derived neurotrophic factor (BDNF) and glial cell-derived neurotrophic factor (GDNF) expression in neurons and astrocytes through the PI3K-AKT pathway (34). We speculate that rs183378943 may affect bilirubin production through the PI3K-AKT pathway. However, this proposed mechanism remains speculative and requires functional validation, representing an important direction for future mechanistic investigations.

Clinical prediction model for SNJ based on machine learning can achieve accurate prediction of individuals at high risk of neonatal jaundice

Genome-wide association studies usually identify multiple disease- or phenotype-related SNPs, and screening important genetic factors from these SNPs is not easy. LASSO has some advantages for screening disease-related important clinical variables. We previously identified metabolic markers associated with sepsis in neonates with meningoencephalitis based on the LASSO approach and combined these with serum and cerebrospinal fluid metabolomic analysis (21). In this present study, based on the LASSO machine learning approach, we identified nine variables that contributed significantly to SNJ, five of which were closely associated with red blood cells. We constructed a clinical prediction model based on the nine important clinical variables, which had some clinical effects and could achieve accurate prediction of high-risk individuals in neonatal jaundice.

Based on the threshold of 0.4 determined by DCA (Figure 4C), the model can distinguish newborns at high risk for severe jaundice. The model is designed for application at birth, enabling risk stratification before the onset of hyperbilirubinemia. For high-risk infants (predicted probability >0.4), such as carriers of rs144648182 with elevated Hb, intensified monitoring and early phototherapy could be adopted to prevent rapid bilirubin elevation and exchange transfusion. For low-risk infants, monitoring intervals could be extended to 12–24 hours, which may help reduce unnecessary blood sampling and interventions. This model does not replace current clinical guidelines but complements them by incorporating genetic information. The proposed mechanism remains a hypothesis that requires experimental verification, and further external validation is warranted prior to clinical application.
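To illustrate how a DCA threshold translates into a decision rule, the sketch below computes net benefit at a threshold probability using the standard definition (true positives per patient minus false positives per patient, weighted by the threshold odds) and then stratifies infants at the 0.4 cut-off. The outcomes and predicted probabilities are invented toy values, not data from this cohort, and the function names are hypothetical.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of intervening on everyone with predicted risk > threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p > threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p > threshold and y == 0)
    odds = threshold / (1.0 - threshold)  # harm-to-benefit weight
    return tp / n - (fp / n) * odds


def net_benefit_treat_all(y_true, threshold):
    """Net benefit of intervening on every infant regardless of predicted risk."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


def stratify(prob, threshold=0.4):
    """Label an infant as high-risk if the predicted probability exceeds the threshold."""
    return "high-risk" if prob > threshold else "low-risk"


# Hypothetical toy data: 1 = severe jaundice, 0 = non-severe.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
y_prob = [0.90, 0.70, 0.50, 0.45, 0.30, 0.20, 0.10, 0.35, 0.25, 0.05]

nb_model = net_benefit(y_true, y_prob, threshold=0.4)
nb_all = net_benefit_treat_all(y_true, threshold=0.4)
labels = [stratify(p) for p in y_prob]
print("net benefit (model):", round(nb_model, 4))
print("net benefit (treat all):", round(nb_all, 4))
print("risk labels:", labels)
```

A model is considered useful at a given threshold when its net benefit exceeds both the treat-all strategy and the treat-none strategy (net benefit of zero). Repeating the calculation over a range of thresholds yields the decision curve.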

The machine learning-based causal inference approach identified potential causal effects of genetic factors and total serum bilirubin

Correlation does not imply causality. Identifying potential causal associations from correlation results is difficult and usually requires animal models or randomized controlled trials, which are time-consuming and laborious. Newly developed machine learning methods can be used to identify potential causal relationships from correlation results. We previously used machine learning causal inference methods to identify oral bacteria potentially causally linked to autism from oral microbiome data (35). In the present study, using a machine learning causal inference approach, we identified potential causal effects of erythroid-associated SNPs, such as rs144648182, on total serum bilirubin. The mechanisms by which we hypothesize that erythroid-associated SNPs are involved in bilirubin biosynthesis are shown in Figure 6.

Figure 6 Possible mechanisms for the involvement of erythrocyte-associated SNPs in bilirubin biosynthesis. Reticulocytes generated in the bone marrow enter the blood and mature into RBCs under the action of erythropoietin, and the Hb released from aged RBCs is the main source of bilirubin. We found that total RBC count was significantly increased in the SNJ group relative to the non-SNJ group, and the alteration in RBCs (ΔRBC) was positively correlated with Hb level, suggesting that excess RBCs are an important factor in elevated serum total bilirubin. We also found by GWAS that SNP locus rs144648182 was significantly associated with total RBC count. rs144648182 is a missense variant of the HPR gene (NP_066275.3:p.Arg219His). HPR encodes haptoglobin-related protein, which can bind free Hb to form a complex that is captured by macrophages in the blood, which remove the haptoglobin and isolate Hb (heme) for further metabolism to bilirubin. rs144648182 is hypothesized to affect bilirubin synthesis by altering the ability of HPR to bind free Hb, though this mechanism awaits experimental confirmation. Finally, the UGT1A1 gene encodes UDP-glucuronosyltransferase, which catalyzes the glucuronidation of unconjugated bilirubin, and the UGT1A1*6 and UGT1A1*28 variants result in abnormal conversion of unconjugated bilirubin to conjugated bilirubin. BVR, biliverdin reductase; GWAS, genome-wide association study; Hb, hemoglobin; HO-1, heme oxygenase-1; HPR, haptoglobin-related protein; RBC, red blood cell; SNJ, severe neonatal jaundice; SNP, single-nucleotide polymorphism.

Strengths and limitations

The China Neonatal Genome Project that we initiated makes comprehensive genetic elucidation of diseases in at-risk neonates possible, and we are now able to complete genome sequencing of 20,000 neonates per year. The Chinese Neonatal Genome Database, constructed from these data, provided strong support for assessing the causal associations in this study.

Artificial intelligence technology represented by machine learning is playing an important role in clinical medical research. Applying LASSO machine learning methods to genome-wide association studies can facilitate the discovery of genetic factors associated with diseases or clinical phenotypes. With causal inference methods of machine learning, the causal effects between genetic factors and clinical outcomes can be inferred under conditions where animal models or clinical intervention experiments are not available. These reflect the innovation of this study.

The present study has some limitations. First, it lacks a healthy control group without neonatal jaundice. Because neonatal jaundice is common in newborns, it is difficult to recruit enough (300+) non-jaundiced neonates for a genome-wide association study. Consequently, our findings pertain specifically to severity stratification among jaundiced neonates (differentiating severe from non-severe cases). The identified SNPs and the prediction model should not be interpreted as risk factors for the initial onset of neonatal jaundice in the general population. Second, the modest sample size (n=315) increases the risk of false-positive associations and reduces power to detect variants with small effect sizes, despite our use of complementary machine learning and causal inference approaches. Third, the findings have not yet been replicated in an independent cohort or tested in functional studies, which will be the focus of our next study.

Conclusions

We applied machine learning causal inference to GWAS data and identified potential erythroid-related genetic factors, including the HPR variant rs144648182, that may contribute to SNJ. This finding represents a testable hypothesis requiring experimental validation. A prediction model based on genetic and clinical variables demonstrated potential for risk stratification among jaundiced neonates, though external validation is needed before clinical application.

Acknowledgments

We would like to thank Catherine Perfect, MA (Cantab), from Liwen Bianji (Edanz) (www.liwenbianji.cn/), for editing the English text of this manuscript.

Footnote

Reporting Checklist: The authors have completed the TRIPOD reporting checklist. Available at https://tp.amegroups.com/article/view/10.21037/tp-2026-1-0082/rc

Data Sharing Statement: Available at https://tp.amegroups.com/article/view/10.21037/tp-2026-1-0082/dss

Peer Review File: Available at https://tp.amegroups.com/article/view/10.21037/tp-2026-1-0082/prf

Funding: This work was supported by the National Natural Science Foundation of China (No. 82071733), Shanghai Talent Development Funding (No. 2020115), and Zhuhai Science and Technology Program Project in the Field of Social Development (No. 2420004000216).

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://tp.amegroups.com/article/view/10.21037/tp-2026-1-0082/coif). W.Z. serves as an Editor-in-Chief of Translational Pediatrics from July 2025 to June 2026. The other authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by the Ethics Committee of Children’s Hospital of Fudan University (No. CHFudanU_NNICU11), and written informed consent was obtained from the parents of the neonates.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.

References

1. Falke M. The Basics of Neonatal Hyperbilirubinemia. Neonatal Netw 2025;44:61-7.
2. Okolie F, South-Paul JE, Watchko JF. Combating the Hidden Health Disparity of Kernicterus in Black Infants: A Review. JAMA Pediatr 2020;174:1199-205.
3. Lai NM, Gerard JP, Ngim CF, et al. The Association between Serum Bilirubin and Kernicterus Spectrum Disorder: A Systematic Review and Meta-Analysis. Neonatology 2021;118:654-64.
4. Olusanya BO, Teeple S, Kassebaum NJ. The Contribution of Neonatal Jaundice to Global Child Mortality: Findings From the GBD 2016 Study. Pediatrics 2018;141:e20171471.
5. Wickremasinghe AC, Kuzniewicz MW. Neonatal Hyperbilirubinemia. Pediatr Clin North Am 2025;72:605-22.
6. Wang X, Xiao T, Wang J, et al. Clinical and genetic risk factors associated with neonatal severe hyperbilirubinemia: a case-control study based on the China Neonatal Genomes Project. Front Genet 2023;14:1292921.
7. Huang D, Gu X, Li W, et al. Genomic sequencing as a key primary recommendation for neonatal hyperbilirubinemia: a population-based multicenter study. J Genet Genomics 2026;53:467-75.
8. Yang L, Wei Z, Chen X, et al. Use of medical exome sequencing for identification of underlying genetic defects in NICU: Experience in a cohort of 2303 neonates in China. Clin Genet 2022;101:101-9.
9. Wang H, Qian Y, Lu Y, et al. Clinical utility of 24-h rapid trio-exome sequencing for critically ill infants. NPJ Genom Med 2020;5:20.
10. Thomas M, Sakoda LC, Hoffmeister M, et al. Genome-wide Modeling of Polygenic Risk Score in Colorectal Cancer Risk. Am J Hum Genet 2020;107:432-44.
11. Burgess S, Foley CN, Zuber V. Inferring Causal Relationships Between Risk Factors and Outcomes from Genome-Wide Association Study Data. Annu Rev Genomics Hum Genet 2018;19:303-27.
12. Reay WR, Cairns MJ. Advancing the use of genome-wide association studies for drug repurposing. Nat Rev Genet 2021;22:658-71.
13. Chen L, Yang H, Li H, et al. Insights into modifiable risk factors of cholelithiasis: A Mendelian randomization study. Hepatology 2022;75:785-96.
14. McCormick N, O’Connor MJ, Yokose C, et al. Assessing the Causal Relationships Between Insulin Resistance and Hyperuricemia and Gout Using Bidirectional Mendelian Randomization. Arthritis Rheumatol 2021;73:2096-104.
15. Sealock JM, Lee YH, Moscati A, et al. Use of the PsycheMERGE Network to Investigate the Association Between Depression Polygenic Scores and White Blood Cell Count. JAMA Psychiatry 2021;78:1365-74.
16. Wang M, Zhuang D, Mei M, et al. Frequent mutation of hypoxia-related genes in persistent pulmonary hypertension of the newborn. Respir Res 2020;21:53.
17. Shi X, Chen J, Lu Q, et al. Whole-Exome Sequencing Revealing De Novo Heterozygous Variant of KCNT1 in a Twin Discordant for Benign Epilepsy with Centrotemporal Spikes. J Paediatr Child Health 2018;54:709-10.
18. Chen C, Wang M, Zhu Z, et al. Multiple gene mutations identified in patients infected with influenza A (H7N9) virus. Sci Rep 2016;6:25614.
19. Wang M, Zhou J, He F, et al. Alteration of gut microbiota-associated epitopes in children with autism spectrum disorders. Brain Behav Immun 2019;75:192-9.
20. Wang M, Doenyas C, Wan J, et al. Virulence factor-related gut microbiota genes and immunoglobulin A levels as novel markers for machine learning-based classification of autism spectrum disorder. Comput Struct Biotechnol J 2021;19:545-54.
21. Zhang P, Wang Z, Qiu H, et al. Machine learning applied to serum and cerebrospinal fluid metabolomes revealed altered arginine metabolism in neonatal sepsis with meningoencephalitis. Comput Struct Biotechnol J 2021;19:3284-92.
22. Zeng S, Wang Z, Zhang P, et al. Machine learning approach identifies meconium metabolites as potential biomarkers of neonatal hyperbilirubinemia. Comput Struct Biotechnol J 2022;20:1778-84.
23. Johnson AD, Kavousi M, Smith AV, et al. Genome-wide association meta-analysis for total serum bilirubin levels. Hum Mol Genet 2009;18:2700-10.
24. Sanna S, Busonero F, Maschio A, et al. Common variants in the SLCO1B3 locus are associated with bilirubin levels and unconjugated hyperbilirubinemia. Hum Mol Genet 2009;18:2711-8.
25. Kang TW, Kim HJ, Ju H, et al. Genome-wide association of serum bilirubin levels in Korean population. Hum Mol Genet 2010;19:3672-8.
26. Chen G, Ramos E, Adeyemo A, et al. UGT1A1 is a major locus influencing bilirubin levels in African Americans. Eur J Hum Genet 2012;20:463-8.
27. Oussalah A, Bosco P, Anello G, et al. Exome-Wide Association Study Identifies New Low-Frequency and Rare UGT1A1 Coding Variants and UGT1A6 Coding Variants Influencing Serum Bilirubin in Elderly Subjects: A Strobe Compliant Article. Medicine (Baltimore) 2015;94:e925.
28. Coltell O, Asensio EM, Sorlí JV, et al. Genome-Wide Association Study (GWAS) on Bilirubin Concentrations in Subjects with Metabolic Syndrome: Sex-Specific GWAS Analysis and Gene-Diet Interactions in a Mediterranean Population. Nutrients 2019;11:90.
29. Chen G, Adeyemo A, Zhou J, et al. A UGT1A1 variant is associated with serum total bilirubin levels, which are causal for hypertension in African-ancestry individuals. NPJ Genom Med 2021;6:44.
30. Chen G, Chakravarti N, Aardalen K, et al. Molecular profiling of patient-matched brain and extracranial melanoma metastases implicates the PI3K pathway as a therapeutic target. Clin Cancer Res 2014;20:5537-46.
31. Jilaveanu LB, Parisi F, Barr ML, et al. PLEKHA5 as a Biomarker and Potential Mediator of Melanoma Brain Metastasis. Clin Cancer Res 2015;21:2138-47.
32. Gibbs PE, Miralem T, Maines MD. Biliverdin reductase: a target for cancer therapy? Front Pharmacol 2015;6:119.
33. Pachori AS, Smith A, McDonald P, et al. Heme-oxygenase-1-induced protection against hypoxia/reoxygenation is dependent on biliverdin reductase and its interaction with PI3K/Akt pathway. J Mol Cell Cardiol 2007;43:580-92.
34. Hung SY, Liou HC, Fu WM. The mechanism of heme oxygenase-1 action involved in the enhancement of neurotrophic factor expression. Neuropharmacology 2010;58:321-9.
35. Qiao Y, Gong W, Li B, et al. Oral Microbiota Changes Contribute to Autism Spectrum Disorder in Mice. J Dent Res 2022;101:821-31.

Cite this article as: Ma H, Chen X, Zhang P, Dong X, Wu B, Chen G, Zhou W, Wang M. Machine learning-based genome-wide association analysis to construct a clinical decision model for severe neonatal jaundice. Transl Pediatr 2026;15(5):181. doi: 10.21037/tp-2026-1-0082
