A user-friendly, cloud-based computing solution to build state-of-the-art, publication-ready polygenic risk scores
The DISCOVER module allows users to build their own polygenic risk scores (PRS) from genetic data and summary statistics from a genome-wide association study (GWAS).
In this piece, we’ll run through what the DISCOVER module does and the outputs that it can provide to help users understand the strengths and weaknesses of their PRS.
The main aim of the DISCOVER module is to run a suite of different PRS methodologies on a dataset to find the PRS with the best predictive performance. Different diseases and phenotypes have different genetic architectures, so finding the method with the best predictive performance requires testing several PRS methodologies.
Current best practice for PRS research and development is to trial several PRS methods so that the one most appropriate for a dataset can be chosen. However, this is rarely done in practice because it requires very advanced computational skills and powerful digital infrastructure.
Our DISCOVER module allows users to generate PRSs using four different state-of-the-art algorithms in parallel in order to identify the best predictive model for the disease under investigation. Running PRS methods in parallel decreases total computational time by an order of magnitude.
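To make the idea of running methods side by side concrete, here is a minimal Python sketch. It is not Allelica's implementation: the function names, the placeholder methods and the metric values are all hypothetical, and each "method" simply stands in for a full PRS pipeline that returns one performance metric. Threads are used only to keep the example short; real CPU-heavy PRS methods would more likely run as separate processes or cloud jobs.

```python
from concurrent.futures import ThreadPoolExecutor


def run_methods(methods, dataset):
    """Run every named method on the same dataset concurrently.

    Returns a dict mapping method name to the metric that method reported.
    """
    with ThreadPoolExecutor(max_workers=len(methods)) as pool:
        futures = {name: pool.submit(fn, dataset) for name, fn in methods.items()}
        return {name: future.result() for name, future in futures.items()}


def best_method(metrics):
    """Return the name of the method with the highest metric."""
    return max(metrics, key=metrics.get)


def make_placeholder(metric):
    """Build a stand-in for a PRS method (hypothetical, for illustration only)."""
    def method(dataset):
        return metric  # a real method would build and evaluate a PRS here
    return method


# Placeholder numbers, not real results.
methods = {
    "method_a": make_placeholder(0.61),
    "method_b": make_placeholder(0.64),
    "method_c": make_placeholder(0.67),
    "method_d": make_placeholder(0.63),
}

metrics = run_methods(methods, dataset=None)
best = best_method(metrics)
```

Because every method sees the same dataset and reports the same kind of metric, choosing between them reduces to a single comparison at the end.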
PRS needs genotype data with matched phenotype measurements together with GWAS summary statistics
Currently, Allelica’s DISCOVER module includes genomic imputation. We’ve outlined how our IMPUTE module works here. Before using the DISCOVER module, genomic data needs to be imputed. To use the DISCOVER module, you need the following three pieces of information:
- A set of genomic data from a group of individuals, from either genotyping arrays or low-coverage whole genome sequencing, that can be imputed to high quality. We’ll call this the Discovery dataset.
- Measurements of the phenotype of interest in the Discovery dataset.
- Summary statistics from a GWAS on the trait or disease of interest.
A key requirement for building a robust PRS is that the Discovery dataset is different from the dataset on which the GWAS was performed. This avoids producing PRS that are good at explaining the association between genetic variation and phenotype in a specific population but less good at doing so in different populations. (The statistical term for this is overfitting.) It is also essential that a separate dataset be used to test the predictive performance of the new PRS in an independent set of genotypes. In practice, users can either split their own dataset in two, or, if appropriate, we offer the UK Biobank resource as a validation and testing dataset.
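As an illustration of splitting one dataset in two, the following Python sketch randomly partitions sample IDs into non-overlapping discovery and test sets. The function name, the 30% test fraction and the sample IDs are assumptions made for the example, not settings used by the DISCOVER module.

```python
import random


def split_discovery_test(sample_ids, test_fraction=0.3, seed=42):
    """Randomly partition sample IDs into disjoint discovery and test lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    ids = sorted(set(sample_ids))  # de-duplicate and fix the starting order
    rng = random.Random(seed)      # seeded so the split is reproducible
    rng.shuffle(ids)
    n_test = max(1, round(len(ids) * test_fraction))
    return ids[n_test:], ids[:n_test]


# Hypothetical sample IDs.
samples = [f"sample_{i:03d}" for i in range(10)]
discovery_ids, test_ids = split_discovery_test(samples, test_fraction=0.3)
```

The important property is that no individual appears in both sets, so performance measured on the test set is not inflated by samples the PRS was tuned on.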
We need to have a measurement of the phenotype in the Discovery dataset so that we can assess how well the PRS explains what we know about the population. The aim of this module is to discover the best methodology for building a PRS for a specific trait, so we need to have a dataset where we know the phenotypes. Our eventual goal is to use this PRS to predict the phenotype across a new population where this is unknown.
Recall that summary statistics provide a measurement of the effect that specific alleles have on a trait or disease, known as the effect size, or beta, and a measurement of the statistical support for the association between that allele and the trait or disease, provided as a P-value. These summary statistics can be found in the GWAS Catalog for a range of phenotypes. For users interested in individuals with Western European genetic ancestry, the UK Biobank is driving forward the discovery of associations between genetic variants and over 2,000 phenotypes.
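In its simplest form, a PRS combines these effect sizes with an individual's genotype as a weighted sum: the number of copies of each effect allele multiplied by that allele's beta. The Python sketch below shows this basic calculation only; the variant IDs, betas and genotype are made up for illustration, and the methods in the DISCOVER module adjust or select the weights in more sophisticated ways.

```python
def polygenic_score(dosages, betas):
    """Sum effect-allele dosage times effect size over variants present in both inputs."""
    return sum(betas[variant] * dosage
               for variant, dosage in dosages.items()
               if variant in betas)


# Hypothetical GWAS effect sizes (beta) per variant.
betas = {"rs1": 0.12, "rs2": -0.05, "rs3": 0.30}

# Hypothetical individual: copies (0, 1 or 2) of the effect allele at each variant.
# rs4 has no summary statistic, so it is ignored.
individual = {"rs1": 2, "rs2": 1, "rs3": 0, "rs4": 1}

score = polygenic_score(individual, betas)  # 2*0.12 + 1*(-0.05) + 0*0.30
```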
Running the DISCOVER module: fast and flexible
With the three datasets above, we can start building a PRS for our disease of interest. There are many different ways to develop a PRS depending on how you use summary statistics. It’s important to remember that when a genetic variant is found to be associated with a phenotype in a GWAS, this is a statistical association. This means that an allele is not necessarily (and in fact is very rarely) causal to a given phenotype.
The DISCOVER module currently lets users run four different algorithms that work in different ways to combine information across different genetic variants. These are LDpred, Clumping and Thresholding, Stacked Clumping and Thresholding, and lassosum. (You can read more about these in our white paper, and we’re always interested in implementing new methods, so if you want to implement a new one then get in touch.)
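To give a flavour of how one of these methods works, here is a simplified Python sketch of Clumping and Thresholding. It keeps variants whose P-value passes a threshold, then walks through them from most to least significant and drops any variant that is strongly correlated (in linkage disequilibrium) with one already kept. The function name, thresholds and toy data are assumptions for illustration; real implementations also restrict clumping to a physical window around each variant and usually try many thresholds.

```python
def clump_and_threshold(p_values, r2, p_threshold=5e-8, r2_threshold=0.1):
    """Return the variants kept after P-value thresholding and greedy LD clumping.

    p_values: dict mapping variant ID to GWAS P-value
    r2: dict mapping frozenset({id_a, id_b}) to squared correlation;
        pairs that are missing are treated as uncorrelated (r2 = 0)
    """
    candidates = sorted(
        (v for v, p in p_values.items() if p <= p_threshold),
        key=lambda v: p_values[v],  # most significant first
    )
    kept = []
    for variant in candidates:
        in_ld = any(r2.get(frozenset((variant, k)), 0.0) >= r2_threshold for k in kept)
        if not in_ld:
            kept.append(variant)
    return kept


# Toy data: rs2 is in strong LD with rs1; rs4 fails the P-value threshold.
p_values = {"rs1": 1e-10, "rs2": 1e-9, "rs3": 1e-8, "rs4": 0.2}
r2 = {frozenset(("rs1", "rs2")): 0.8, frozenset(("rs2", "rs3")): 0.5}

selected = clump_and_threshold(p_values, r2)
```

Here rs3 is kept even though it is correlated with rs2, because rs2 was itself removed and rs3 is not in LD with the retained rs1.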
We provide several metrics for users to decide on the best PRS to use moving forward. These are the Area Under the Receiver Operating Characteristic Curve (AUC), which measures how well the model classifies cases and controls; the Odds Ratio per standard deviation, which measures how the model captures the gradient of risk of the disease in question; and the percentiles of the dataset that are at 3-fold or greater increased risk relative to the remainder of the population. These are industry-standard metrics for model comparison and provide the user with all the relevant information to choose the best performing PRS.
The outputs of the DISCOVER module provide the user with a set of PRSs and a quantification of their predictive performance. Having the ability to choose the best PRS from four different methods allows users to align their research with best practice reporting standards. With the PRS in hand, users can now move on to either further validate their PRS in a population with a different ancestry using the VALIDATE module or alternatively move on to PRS prediction using the PREDICT module. These steps in Allelica’s full PRS pipeline will be covered in the next articles in this series describing our Software as a Service.
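The first two metrics can be sketched in a few lines of Python. This is an illustrative calculation under simple assumptions, not the module's own code: the AUC is computed as the fraction of case-control pairs in which the case has the higher score, and the Odds Ratio per standard deviation comes from a one-predictor logistic regression on the standardised score, fitted with a few Newton steps. The scores and labels are made up, and no covariates (such as age, sex or ancestry components) are included.

```python
import math
from statistics import mean, pstdev


def auc(scores, labels):
    """Fraction of (case, control) pairs where the case scores higher; ties count 0.5."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if c > k else 0.5 if c == k else 0.0
               for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


def odds_ratio_per_sd(scores, labels, iterations=25):
    """exp(slope) of a logistic regression of case status on the standardised score."""
    mu, sd = mean(scores), pstdev(scores)
    z = [(s - mu) / sd for s in scores]
    b0 = b1 = 0.0
    for _ in range(iterations):  # Newton-Raphson updates for intercept and slope
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * x))) for x in z]
        g0 = sum(y - pi for y, pi in zip(labels, p))
        g1 = sum((y - pi) * x for y, pi, x in zip(labels, p, z))
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * x for wi, x in zip(w, z))
        h11 = sum(wi * x * x for wi, x in zip(w, z))
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return math.exp(b1)


# Made-up PRS values and case (1) / control (0) labels.
scores = [0.1, 0.4, 0.35, 0.8, 0.5, 0.9, 0.2, 0.7]
labels = [0, 0, 1, 1, 0, 1, 0, 1]

model_auc = auc(scores, labels)            # 14 of 16 pairs ranked correctly: 0.875
model_or = odds_ratio_per_sd(scores, labels)
```

An AUC of 0.5 means the score ranks cases and controls no better than chance, and an Odds Ratio per standard deviation above 1 means risk rises as the score increases.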
Originally published at https://www.allelica.com on July 6, 2020.