Improved Polygenic Prediction with Multi-Ancestry and Multi-Trait GWAS Data


We use new methods and diverse sources of GWAS data to improve the accuracy of GRS when applied to African, East Asian, and South Asian populations.

Overview

In this whitepaper, we describe a method to improve the performance of genetic risk scores (GRS) across multiple ancestries, including European, East Asian, South Asian, and African populations. The method leverages predictions from genetically correlated traits and genome-wide association studies (GWAS) of non-European ancestry. We replicate previous research[1][2] that used pleiotropy to improve prediction accuracy, and make further gains by incorporating newly developed methods[3][4] that use ancestrally diverse GWAS data and capitalize on the diversity of linkage disequilibrium across discovery samples. By integrating these two sources of data, we improved the predictive performance of GRS for Europeans, with an inverse-variance weighted average relative increase in effect size (log odds ratio per standard deviation of GRS) of 23.7% across n = 8 diseases. Consistent with other work[1], predictive performance also improved in non-Europeans, with average relative increases in effect size of 24.8% for South Asians and 29.6% for Africans. Because we included a large East Asian biobank, the improvement was largest among East Asians, where the gain was 53.6%. This approach promises to advance the use of genetic risk prediction in preimplantation genetic testing by providing more accurate and inclusive scores for diverse populations.

Data and Validation Cohorts

The models for a set of 8 diseases were improved by adding in new data from several sources: non-European data from the Biobank of Japan[5], Finnish data from FinnGen[6], and genetically correlated traits from within the UK Biobank (discovered by examining the UK Biobank Genetic Correlation browser from the Neale Lab). These were compared to the original models developed by using a standard approach of taking a large GWAS and creating a polygenic score using the PRScs[7] software or pruning and thresholding[8].

For quality control, we removed samples with missing genotypes, genetic sex that did not match self-reported sex, or genetic ancestry that did not match self-reported ancestry. For the African/Caribbean, South Asian, and East Asian cohorts, we removed any samples that were genetically related (up to third degree) to other samples in the UK Biobank. For the White British samples, we split the dataset into two cohorts:

  • Cohort 1: Samples with no relatives (up to third degree) within the UK Biobank (n=276,471)
  • Cohort 2: Samples with one or more relatives in the UK Biobank, where one relative from each family was selected (n=58,808)

For type 2 diabetes and coronary artery disease, we trained GRS models on data that include Cohort 1 and used Cohort 2 for validation. For all other conditions, we tested on Cohort 1, since their models did not use any UK Biobank GWAS data.
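The cohort split described above can be sketched as follows. This is a minimal illustration, not the pipeline used for the whitepaper: the sample IDs and kinship pairs are hypothetical, families are taken to be connected components of the relatedness graph, and the rule for choosing which relative to keep (smallest ID) is an assumption, since the text does not say how that relative was selected.

```python
from collections import defaultdict

def split_by_relatedness(sample_ids, related_pairs):
    """Split samples into (unrelated, one_per_family).

    related_pairs: iterable of (id_a, id_b) for pairs related up to the
    chosen degree. Families are the connected components of that graph.
    """
    parent = {s: s for s in sample_ids}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in related_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    families = defaultdict(list)
    for s in sample_ids:
        families[find(s)].append(s)

    unrelated, one_per_family = [], []
    for members in families.values():
        if len(members) == 1:
            unrelated.append(members[0])
        else:
            # Assumption: keep the smallest ID in each family.
            one_per_family.append(min(members))
    return sorted(unrelated), sorted(one_per_family)


# Hypothetical toy input: s2, s3 and s5 form one family.
samples = ["s1", "s2", "s3", "s4", "s5", "s6"]
pairs = [("s2", "s3"), ("s3", "s5")]
cohort_1, cohort_2 = split_by_relatedness(samples, pairs)
```

Here `cohort_1` plays the role of the unrelated samples and `cohort_2` holds one representative per family, so no two individuals across the training and validation sets are close relatives of each other within a family.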

| Cohort | Number of samples |
| --- | --- |
| White British, unrelated (cohort 1) | 276,471 |
| White British, relatives in UK Biobank (cohort 2) | 58,808 |
| East Asian | 1,350 |
| South Asian | 6,433 |
| African and Caribbean | 6,415 |

Incorporating Genetically Correlated Traits

Several studies have shown that predictive performance improves when polygenic score models are combined linearly (multi-PGS), with weights determined by elastic net regularization. For example, including the PGS for schizophrenia as a feature in predicting depression improves the prediction of the latter[1]. For two of the eight diseases, breast cancer and atrial fibrillation, there were no strongly genetically correlated traits in the UK Biobank or elsewhere.

For the three psychiatric conditions (depression, bipolar disorder, schizophrenia), because each is significantly genetically correlated with the other two[9], we built a multi-PGS on all predictors for the three psychiatric disorders.

For type 2 diabetes, class III obesity, and coronary artery disease, we scanned the UK Biobank for the top 25 traits ranked by genetic covariance with each disease and trained GRS models for those traits using the PRScs software.
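The multi-PGS idea in this section can be illustrated with a small elastic-net logistic regression. The sketch below is a stand-in written with only the Python standard library, not the software used for the whitepaper. The simulated data, the effect sizes, and the hyperparameters (`lam`, `l1_ratio`, `lr`, `n_iter`) are all assumptions chosen for illustration.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_threshold(x, t):
    # Proximal operator of the L1 (lasso) penalty.
    return math.copysign(max(abs(x) - t, 0.0), x)

def fit_multi_pgs(X, y, lam=0.02, l1_ratio=0.5, lr=0.5, n_iter=300):
    """Elastic-net logistic regression by proximal gradient descent.

    X: rows of standardized component PGS values; y: 0/1 case status.
    Returns (intercept, weights); the intercept is not penalized.
    """
    n, p = len(X), len(X[0])
    b, w = 0.0, [0.0] * p
    for _ in range(n_iter):
        grad_b, grad_w = 0.0, [0.0] * p
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err / n
            for j in range(p):
                grad_w[j] += err * xi[j] / n
        b -= lr * grad_b
        for j in range(p):
            # Gradient step on the log-loss plus the ridge part of the penalty,
            step = w[j] - lr * (grad_w[j] + lam * (1 - l1_ratio) * w[j])
            # then soft-thresholding for the lasso part.
            w[j] = soft_threshold(step, lr * lam * l1_ratio)
    return b, w


# Simulated (hypothetical) data: a PGS for the target disease, a PGS for a
# genetically correlated trait, and an uninformative PGS.
rng = random.Random(0)
X, y = [], []
for _ in range(600):
    target, correlated, noise = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
    p_case = sigmoid(-1.5 + 1.0 * target + 0.5 * correlated)
    X.append([target, correlated, noise])
    y.append(1 if rng.random() < p_case else 0)

intercept, weights = fit_multi_pgs(X, y)
```

The fitted `weights` define the linear combination of component scores. The L1 part of the penalty tends to shrink uninformative scores toward zero, which matters when many candidate traits are included.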

Incorporating East Asian and Finnish GWAS

For all diseases we included data from FinnGen using models trained with PRScs. Additionally, we employed a multi-ancestry approach, PRScsx, to jointly analyze European and East Asian data from the Biobank of Japan. This technique has demonstrated enhanced predictive performance among East Asians and, in some circumstances, improved relative performance for Africans and South Asians.

| Disease | Benchmark PGS / Training Method | Additional Sources of Data |
| --- | --- | --- |
| Atrial Fibrillation | Christophersen et al. (2017) [10] / PRScs | Biobank of Japan (BBJ); FinnGen |
| Schizophrenia | PGC Schizophrenia Wave 3 [11] / Pruning + Thresholding | Genetically correlated traits; BBJ, FinnGen |
| Depression | Wray et al. (2018) [12] / PRScs | Genetically correlated traits; BBJ, FinnGen |
| Coronary Artery Disease | Nikpay et al. (2015) [13] / PRScs | Genetically correlated traits; BBJ, FinnGen |
| Breast Cancer | Michailidou et al. (2015) [14] / PRScs | BBJ, FinnGen |
| Type 2 Diabetes | Scott RA, et al. (2017) [15] / PRScs | Genetically correlated traits; BBJ, FinnGen |
| Bipolar Disorder | Stahl et al. (2019) [16] / Pruning + Thresholding | Genetically correlated traits; BBJ, FinnGen |
| Class III Obesity | Khera et al. (2019) / PRScs | BBJ, FinnGen |

Table 2: Description of benchmark PGS and additional data used to train improved models.

Results for Europeans

For each disease, the collection of trained models was combined into a multi-PGS with a logistic regression using elastic net regularization. Performance was evaluated on Cohort 1, except for coronary artery disease (CAD) and type 2 diabetes, which were evaluated on Cohort 2 because their multi-PGS incorporated models trained on Cohort 1. Improvements were strong, with a mean improvement of 28.2% in effect size (log odds ratio per standard deviation) across the 8 diseases. This estimate weights each disease equally, but the error bars are wider for rarer diseases, so we also report the inverse-variance weighted average improvement, which is 23.7% for Europeans. Relative performance increases were strongest in type 2 diabetes and schizophrenia, which can be explained by the large numbers of cases in the additional data and the high SNP heritability of these diseases.
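To make the difference between the two averages concrete, here is a minimal sketch of an unweighted mean and an inverse-variance weighted mean of relative gains. The effect sizes and standard errors are hypothetical and are not taken from the whitepaper. The example assumes a standard error is available for each relative gain.

```python
def relative_gain(beta_old, beta_new):
    # Relative increase in effect size (log odds ratio per SD of GRS).
    return (beta_new - beta_old) / beta_old

def inverse_variance_mean(values, std_errors):
    # Weight each estimate by 1 / SE^2, so precise estimates count more.
    weights = [1.0 / se ** 2 for se in std_errors]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Hypothetical numbers for illustration only:
# (baseline log OR per SD, improved log OR per SD, SE of the relative gain)
examples = {
    "common disease": (0.40, 0.48, 0.05),
    "rare disease": (0.50, 0.80, 0.20),
}
gains = [relative_gain(old, new) for old, new, _ in examples.values()]
ses = [se for _, _, se in examples.values()]

unweighted = sum(gains) / len(gains)          # about 0.40
weighted = inverse_variance_mean(gains, ses)  # about 0.22
```

In this toy case the rare disease has a large but imprecise gain, so the weighted average sits closer to the gain of the common disease, the same direction as the gap between the two averages reported above.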

Figure 1: Improved results of PGS models on White British population in the UK Biobank.

Performance Improvements Generalized Across Ancestries

Genetic risk score performance improved across all ancestries. The gains were largest in the East Asian population, which is consistent with most of the non-European GWAS data originating from the Biobank of Japan.

| Population | Relative gain (log odds ratio per standard deviation, inverse-variance weighted) |
| --- | --- |
| East Asians (n = 1,350) | +53.6% |
| South Asians (n = 6,433) | +24.8% |
| African / Caribbean (n = 6,414) | +29.6% |
| Europeans (n = 58,808 or 276,471) | +23.7% |

Table 3: Improvements in performance across different ancestries, averaged with inverse-variance weighting. An inverse-variance weighted average weights each quantity by the inverse of its variance, i.e. the precision of the estimate, which assigns more weight to diseases with larger numbers of cases.

For the two most common diseases in the UK Biobank (coronary artery disease and type 2 diabetes), we depict here the odds ratios of the top 10% of GRS versus the bottom 90% in the improved and original models.

Figure 2: Type 2 Diabetes odds ratios comparing top 10% GRS to the rest of the cohort in the improved versus old models.

Figure 3: Coronary Artery Disease odds ratios comparing top 10% GRS to the rest of the cohort in the improved versus old models.
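A top-percentile odds ratio of the kind shown in these figures can be computed from a 2x2 table of cases and controls above and below the score cutoff. The sketch below is an illustration under stated assumptions: it uses a Wald confidence interval on the log odds ratio, which the whitepaper does not specify, and the scores and case labels are toy data.

```python
import math

def top_fraction_odds_ratio(scores, cases, fraction=0.10, z=1.96):
    """Odds ratio of being a case in the top `fraction` of scores vs the rest.

    Returns (odds_ratio, ci_low, ci_high, case_prevalence_in_top_group),
    using a Wald interval on the log odds ratio.
    """
    ranked = sorted(zip(scores, cases), key=lambda t: t[0], reverse=True)
    n_top = max(1, int(round(len(ranked) * fraction)))
    top, rest = ranked[:n_top], ranked[n_top:]
    a = sum(is_case for _, is_case in top)    # cases in the top group
    b = len(top) - a                          # controls in the top group
    c = sum(is_case for _, is_case in rest)   # cases in the rest
    d = len(rest) - c                         # controls in the rest
    if min(a, b, c, d) == 0:
        raise ValueError("empty cell; odds ratio undefined")
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return (math.exp(log_or), math.exp(log_or - z * se),
            math.exp(log_or + z * se), a / len(top))


# Toy data: 100 individuals with scores 0..99; the top decile holds 5 cases
# out of 10, the remaining 90 individuals hold 9 cases.
scores = list(range(100))
cases = [1 if (i >= 96 or i % 10 == 0) else 0 for i in range(100)]
odds_ratio, ci_low, ci_high, prevalence_top = top_fraction_odds_ratio(scores, cases)
```

For the toy data the 2x2 table is 5 cases and 5 controls in the top decile against 9 cases and 81 controls in the rest, giving an odds ratio of 9 and a case prevalence of 50% above the cutoff.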

Discussion

We have evaluated the performance of predictors that incorporate non-European GWAS and genetically correlated traits, showing that performance improves across all ancestries. Relative performance increases were especially high in East Asians because of the joint inference on multi-ancestry data that included the Biobank of Japan. The results demonstrate that the performance of genetic risk scores can be improved by diverse data, and they replicate the finding that summary statistics from large non-European biobanks can help improve equity in genomic medicine.

Citations

  1. Albiñana, C., Zhu, Z., Schork, A. J., Ingason, A., Aschard, H., Brikell, I., ... Vilhjálmsson, B. J. (2022). Multi-PGS enhances polygenic prediction: weighting 937 polygenic scores. medRxiv, 2022.09.14.22279940. https://doi.org/10.1101/2022.09.14.22279940
  2. Truong, B., Hull, L. E., Ruan, Y., Huang, Q. Q., Hornsby, W., Martin, H. C., ... Natarajan, P. (2023, March 23). Integrative polygenic risk score improves the prediction accuracy of complex traits and diseases. medRxiv [Preprint]. 2023.02.21.23286110. https://doi.org/10.1101/2023.02.21.23286110
  3. Ruan, Y., Lin, Y. F., Feng, Y. A., Chen, C. Y., Lam, M., Guo, Z., ... Ge, T. (2022, May). Improving polygenic prediction in ancestrally diverse populations. Nat Genet, 54(5), 573-580. https://doi.org/10.1038/s41588-022-01054-7
  4. Zheng, Z., Liu, S., Sidorenko, J., Yengo, L., Turley, P., Ani, A., ... Zeng, J. (2022). Leveraging functional genomic annotations and genome coverage to improve polygenic prediction of complex traits within and between ancestries. bioRxiv, 2022.10.12.510418. https://doi.org/10.1101/2022.10.12.510418
  5. Nagai, A., Hirata, M., Kamatani, Y., Muto, K., Matsuda, K., Kiyohara, Y., ... Nakamura, Y. (2017, March). Overview of the BioBank Japan Project: Study design and profile. J Epidemiol, 27(3S), S2-S8. https://doi.org/10.1016/j.je.2016.12.005
  6. Kurki, M. I., Karjalainen, J., Palta, P., et al. (2023). FinnGen provides genetic insights from a well-phenotyped isolated population. Nature, 613, 508-518. https://doi.org/10.1038/s41586-022-05473-8
  7. Ge, T., Chen, C. Y., Ni, Y., et al. (2019). Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nat Commun, 10, 1776. https://doi.org/10.1038/s41467-019-09718-5
  8. Privé, F., Vilhjálmsson, B. J., Aschard, H., & Blum, M. G. B. (2019). Making the Most of Clumping and Thresholding for Polygenic Scores. American Journal of Human Genetics, 105(6), 1213-1221. https://doi.org/10.1016/j.ajhg.2019.11.001
  9. Abdellaoui, A., Smit, D. J. A., van den Brink, W., Denys, D., & Verweij, K. J. H. (2021, March 1). Genomic relationships across psychiatric disorders including substance use disorders. Drug and Alcohol Dependence, 220, 108535. https://doi.org/10.1016/j.drugalcdep.2021.108535
  10. Christophersen, I. E., Rienstra, M., Roselli, C., Yin, X., Geelhoed, B., Barnard, J., ... Guo, X.; METASTROKE Consortium of the ISGC; Neurology Working Group of the CHARGE Consortium; Dichgans, M., Ingelsson, E., Kooperberg, C., Melander, O., Loos, R. J. F., Laurikka, J., ... Ellinor, P. T.; AFGen Consortium. (2017, June). Large-scale analyses of common and rare variants identify 12 new loci associated with atrial fibrillation. Nat Genet, 49(6), 946-952. https://doi.org/10.1038/ng.3843
  11. Trubetskoy, V., Pardiñas, A. F., Qi, T., Panagiotaropoulou, G., Awasthi, S., Bigdeli, T. B., et al. (2022). Mapping genomic loci implicates genes and synaptic biology in schizophrenia. Nature, 604(7906), 502-508. https://doi.org/10.1038/s41586-022-04434-5
  12. Wray, N. R., Ripke, S., Mattheisen, M., et al. (2018). Genome-wide association analyses identify 44 risk variants and refine the genetic architecture of major depression. Nat Genet, 50, 668-681. https://doi.org/10.1038/s41588-018-0090-3
  13. Nikpay, M., Goel, A., Won, H. H., Hall, L. M., Willenborg, C., Kanoni, S., ... Farrall, M. (2015, October). A comprehensive 1,000 Genomes-based genome-wide association meta-analysis of coronary artery disease. Nat Genet, 47(10), 1121-1130. https://doi.org/10.1038/ng.3396
  14. Michailidou, K., Lindström, S., Dennis, J., Beesley, J., Hui, S., Kar, S., ... Easton, D. F. (2017). Association analysis identifies 65 new breast cancer risk loci. Nature, 551(7678), 92-94. https://doi.org/10.1038/nature24284
  15. Scott, R. A., Scott, L. J., Mägi, R., Marullo, L., Gaulton, K. J., Kaakinen, M., ... McCarthy, M. I.; DIAbetes Genetics Replication And Meta-analysis (DIAGRAM) Consortium. (2017, November). An Expanded Genome-Wide Association Study of Type 2 Diabetes in Europeans. Diabetes, 66(11), 2888-2902. https://doi.org/10.2337/db16-1253
  16. Stahl, E. A., Breen, G., Forstner, A. J., et al. (2019). Genome-wide association study identifies 30 loci associated with bipolar disorder. Nat Genet, 51, 793-803. https://doi.org/10.1038/s41588-019-0397-8

Orchid Health supports open research data initiatives while abiding by the terms of use on all genetic risk models and datasets. PGC data was used in this study only in a research context, to evaluate the potential of the multi-PGS model training technique.

Supplementary Tables

Supplementary Table A: How each disease case is defined in evaluating genetic risk scores in the UK Biobank

| Phenotype | ICD-10 Codes | Self-Report Codes | Cases in UK Biobank (White British) |
| --- | --- | --- | --- |
| Prostate cancer | C61, D075 | 1044 | 13,806 |
| Type 2 diabetes | E11.1-9 | 1223 | 30,507 |
| Coronary artery disease | I210-4, I219, I220, I221, I228, I232, I233, I235, I236, I238, I249, I252 | 1075 | 22,451 |
| Breast cancer | C5.0-9, D05.0, D059 | 1002 | 18,588 |
| Atrial fibrillation | I48.0-4, I48.9 | 1471, 1483 | 22,472 |
| Schizophrenia | F20.0-9, F21, F23.0-3, F23.8 | 1289 | 1,376 |
| Class III Obesity* | - | - | |
| Depression** | - | - | |
| Bipolar disorder | F31 | 1291 | 1,855 |

* Class III Obesity was defined as having a BMI (UK Biobank Field 21001) of 40 kg/m² or above.
** The depression phenotype was defined for Mental Health Survey participants with researcher-derived “probable recurrent depression (severe)”; controls excluded participants with any depression or bipolar disorder.

Supplementary Table B1-B10

Number of coronary artery disease cases in test set: 1765 (prevalence of 5.38% in Cohort 1 overall)

| Coronary Artery Disease | Odds Ratio (Improved Model) | Case Prevalence at Cutoff (Improved Model) | Odds Ratio (Baseline) |
| --- | --- | --- | --- |
| Top 2% | 3.86 (3.12, 4.76) | 19.0% | 3.04 (2.43, 3.82) |
| Top 5% | 2.89 (2.48, 3.38) | 14.5% | 2.41 (2.05, 2.84) |
| Top 10% | 2.68 (2.37, 3.02) | 12.9% | 2.28 (2.01, 2.58) |

Number of Breast Cancer cases in test set: 6061 (prevalence of 7.45% in Cohort 1 females overall)

| Breast Cancer | Odds Ratio (Improved Model) | Case Prevalence at Cutoff (Improved Model) | Odds Ratio (Baseline) |
| --- | --- | --- | --- |
| Top 2% | 4.24 (3.76, 4.77) | 26.5% | 3.95 (3.50, 4.45) |
| Top 5% | 3.34 (3.07, 3.63) | 21.4% | 3.23 (2.97, 3.52) |
| Top 10% | 3.05 (2.86, 3.26) | 18.7% | 2.83 (2.65, 3.02) |

Number of Schizophrenia cases in test set: 476 (prevalence of 0.27% in Cohort 1 overall)

| Schizophrenia | Odds Ratio (Improved Model) | Case Prevalence at Cutoff (Improved Model) | Odds Ratio (Baseline) |
| --- | --- | --- | --- |
| Top 2% | 4.39 (3.14, 6.13) | 1.37% | 3.15 (2.14, 4.62) |
| Top 5% | 3.61 (2.81, 4.63) | 1.07% | 2.29 (1.70, 3.07) |
| Top 10% | 2.85 (2.31, 3.53) | 0.81% | 1.95 (1.53, 2.47) |

Number of Type 2 Diabetes cases in test set: 2363 (prevalence of 6.9% in Cohort 2 overall)

| Type 2 Diabetes | Odds Ratio (Improved Model) | Case Prevalence at Cutoff (Improved Model) | Odds Ratio (Baseline) |
| --- | --- | --- | --- |
| Top 2% | 4.07 (3.36, 4.92) | 25.3% | 2.90 (2.36, 3.57) |
| Top 5% | 3.48 (3.05, 3.97) | 21.5% | 2.38 (2.06, 2.76) |
| Top 10% | 3.06 (2.76, 3.40) | 18.5% | 2.21 (1.98, 2.48) |

Number of bipolar cases in test set: 640 (prevalence of 0.41% in Cohort 1 overall)

| Bipolar | Odds Ratio (Improved Model) | Case Prevalence at Cutoff (Improved Model) | Odds Ratio (Baseline) |
| --- | --- | --- | --- |
| Top 2% | 3.57 (2.61, 4.87) | 1.6% | 3.75 (2.76, 5.09) |
| Top 5% | 2.69 (2.13, 3.41) | 1.1% | 2.62 (2.06, 3.32) |
| Top 10% | 2.61 (2.29, 3.31) | 1.06% | 2.46 (2.04, 2.98) |

Number of atrial fibrillation cases in test set: 7502

| Atrial Fibrillation | Odds Ratio (Improved Model) | Case Prevalence at Cutoff (Improved Model) | Odds Ratio (Baseline) |
| --- | --- | --- | --- |
| Top 2% | 3.62 (3.26, 4.02) | 16.6% | 2.91 (2.60, 3.25) |
| Top 5% | 3.02 (2.80, 3.25) | 11.6% | 2.45 (2.27, 2.65) |
| Top 10% | 2.67 (2.53, 2.84) | 10.3% | 2.23 (2.10, 2.38) |

Number of depression cases in test set: 2415

| Depression | Odds Ratio (Improved Model) | Case Prevalence at Cutoff (Improved Model) | Odds Ratio (Baseline) |
| --- | --- | --- | --- |
| Top 2% | 2.28 (1.822, 2.85) | 18.1% | 2.02 (1.60, 2.55) |
| Top 5% | 2.06 (1.77, 2.39) | 7.63% | 1.92 (1.71, 2.16) |
| Top 10% | 1.92 (1.171, 2.16) | 6.15% | 1.54 (1.37, 1.74) |

Number of class III obesity cases in test set: 569 (prevalence of 1.39% in Cohort 2 overall)

| Class III Obesity | Odds Ratio (Improved Model) | Case Prevalence at Cutoff (Improved Model) | Odds Ratio (Baseline) |
| --- | --- | --- | --- |
| Top 2% | 7.51 (5.76, 9.80) | 11.7% | 6.05 (4.55, 8.04) |
| Top 5% | 5.75 (4.68, 7.06) | 8.5% | 5.25 (4.26, 6.48) |
| Top 10% | 5.24 (4.40, 6.25) | 6.9% | 4.22 (3.52, 5.07) |

Acknowledgements

This research has been conducted using the UK Biobank Resource under Application Number 80545.
