George Busby, Allelica CSO & Co-Founder
We’ve previously written about how you can use Allelica’s DISCOVER software to build new polygenic risk scores (PRS) and to deploy them at scale with PREDICT. While a PRS provides an accurate assessment of someone’s genetic liability to disease, this number on its own provides little value. We need to provide additional context for the score to have clinical relevance. In this article, we’re going to cover the approach that we’ve taken at Allelica to turn a raw polygenic risk score into an estimate of disease risk.
To illustrate how we go from PRS to risk assessment and reporting, we’re going to follow the steps involved in generating a breast cancer risk report for a hypothetical woman called Maria.
Predicting disease risk with PRS
A PRS is a single number: a statistic calculated by adding up the effect sizes of all the variants in a PRS panel at which an individual carries a risk allele, typically weighting each variant by the number of copies carried. Although there are multiple breast cancer PRS, and multiple ways of developing a new breast cancer PRS, we’re going to assume that we have a score (such as ours) that has been validated and tested on independent datasets, and that has a robust association with breast cancer risk.
PRS panels can contain variants with positive effect sizes which increase risk, and variants with negative effect sizes that decrease risk. This means that the higher Maria’s PRS, the greater the amount of disease risk that is conferred by her genes. It also means that the actual values of an individual PRS can be positive or negative, depending on both the PRS panel used and the individual’s genetic background.
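To make the calculation concrete, here is a minimal sketch of a PRS as a weighted sum. The five-variant panel, its effect sizes and the genotype are invented for illustration, and weighting each variant by the number of effect alleles carried (0, 1 or 2) is a common convention that we assume here rather than a description of any particular panel.

```python
import numpy as np

def polygenic_risk_score(dosages, effect_sizes):
    """Sum of per-variant effect sizes weighted by effect-allele count (0, 1 or 2)."""
    dosages = np.asarray(dosages, dtype=float)
    effect_sizes = np.asarray(effect_sizes, dtype=float)
    if dosages.shape != effect_sizes.shape:
        raise ValueError("one effect size is needed per variant")
    return float(np.dot(dosages, effect_sizes))

# Hypothetical panel: positive weights raise risk, negative weights lower it.
effect_sizes = [0.12, -0.05, 0.08, -0.10, 0.03]
# Hypothetical genotype: number of effect alleles carried at each variant.
dosages = [1, 2, 0, 1, 2]

score = polygenic_risk_score(dosages, effect_sizes)
print(round(score, 2))
```

With these made-up values the protective variants slightly outweigh the risk-increasing ones, so the raw score comes out just below zero.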
When we ran Allelica’s breast cancer PRS on Maria, we got a score of 0.11. As it’s positive, this number suggests that Maria has more alleles contributing to her disease risk than reducing it, but this can’t be used on its own to estimate her actual genetic disease risk. To get that, we need to compare this score to a distribution of scores from a reference population.
PRS are normally distributed in a population
PRS are generated by adding up the effects of hundreds to millions of alleles at common variants spread across the genome. When applied to a population and then plotted, the result is a normal distribution. Most people will have some risk alleles with a positive effect, resulting in a score somewhere in the middle of the distribution. There will be a smaller number of individuals with either lots of risk alleles and therefore very high scores, or fewer than average risk alleles leading to lower scores.
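A small simulation can illustrate why summing many small effects gives a bell-shaped distribution. This is a sketch under simplifying assumptions: the allele frequencies and effect sizes are drawn at random, and variants are treated as independent, which ignores the correlation between nearby variants in real genomes.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_people, n_variants = 2000, 500

# Hypothetical panel: random allele frequencies and small effect sizes.
freqs = rng.uniform(0.05, 0.95, size=n_variants)
effects = rng.normal(0.0, 0.05, size=n_variants)

# Each simulated person carries 0, 1 or 2 effect alleles at each variant.
dosages = rng.binomial(2, freqs, size=(n_people, n_variants))
scores = dosages @ effects

# Standardize and check a hallmark of normality: roughly 68% of values
# should lie within one standard deviation of the mean.
z = (scores - scores.mean()) / scores.std()
within_one_sd = float(np.mean(np.abs(z) < 1))
print(f"share within one SD: {within_one_sd:.2f}")
```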
So what does a reference population look like? In a perfect world, we’d have a population of ‘Marias’, each of whom would live in the same place and share the same environment as ‘our’ Maria. We’d also want their genetic ancestry to be the same (more on this later). We’d want to control these things as much as possible so that the only thing that differs between each of these ‘Marias’ is their PRS and their disease state, which we’d also know. Using this matched population of ‘Marias’ we could build a distribution of breast cancer PRS and compare Maria’s score to estimate where in this distribution her score lands.
We can now place Maria’s PRS within this distribution to assess how her value of 0.11 compares to others. We first standardize the distribution of PRS to have a mean of 0 and standard deviation of 1. This helps with interpretation and allows us to split this reference distribution into 100 equal-sized bins, or percentiles, and see which percentile Maria’s PRS falls into. Her standardized PRS value of 1.41 places her in the 91st percentile.
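The standardization and percentile steps can be sketched as follows. The function names and the tiny reference array are hypothetical and exist only to show the arithmetic; a real reference distribution would contain thousands of scores.

```python
import numpy as np

def standardize(raw_score, reference_scores):
    """Express a raw PRS in standard deviations from the reference mean."""
    reference_scores = np.asarray(reference_scores, dtype=float)
    return (raw_score - reference_scores.mean()) / reference_scores.std()

def percentile_of(raw_score, reference_scores):
    """Percentage of reference scores at or below the given score."""
    reference_scores = np.asarray(reference_scores, dtype=float)
    return 100.0 * float(np.mean(reference_scores <= raw_score))

# Toy reference distribution of raw PRS values (illustrative only).
reference = np.array([-0.3, -0.1, 0.0, 0.1, 0.3])

z_score = standardize(0.1, reference)
pct = percentile_of(0.1, reference)
print(round(z_score, 2), pct)
```

Using the empirical share of reference scores at or below the individual's score, rather than a formula for the normal curve, means the percentile reflects the reference population actually used.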
So far, so good. But before we continue we need to backtrack a bit. Remember that we said that ideally we’d generate a distribution of scores from a population of ‘Marias’? In reality, this doesn’t exist. People’s environments differ; their diet and exercise levels differ; their physiology differs; and of course their DNA differs. What’s more, we don’t have access to genetic and other clinical information for most of the world’s population, so we don’t have the luxury of being able to pick from a large pool of useful data.
So instead of building a reference population of ‘Marias’, we need to use available datasets that have both the appropriate genetic data available to apply a PRS, and the clinical data required to assess disease outcome. In practice, this means that we need to be flexible and use a variety of different approaches to overcome the challenges of defining an individual’s disease risk by comparing their PRS and other clinical risk factors to those from an imperfect reference population.
Developing ancestry-specific reference populations
One of the main factors to consider when building a reference population is to ensure that the genetic ancestry of the population matches that of the individual(s) for whom you are planning to predict risk. A reference population needs to have the necessary genetic and clinical data needed to build a prediction model that can be used on new individuals like Maria. It’s also helpful — although not essential — that the PRS we’re using was also developed in an ancestry group that matches Maria’s genetic ancestry.
Why do we need to match for ancestry? There are two main reasons. The first is that the genetic associations that feed into the PRS are related to the ancestry of the population(s) used in the PRS Training dataset from the original genome-wide association study (GWAS). Most of these associations are not actual disease-causing variants, but ‘tag’ them, meaning that they are located close to the causal variant in the genome, but aren’t the actual causal variant. The ability of these associations to tag the causal variant depends on the underlying genetic background around the causal variant, which differs in different ancestries. So a variant in strong association with disease in the Training dataset population(s) may be less well associated in the reference population used to develop our prediction model. The ability of a PRS to predict disease in a population is dependent on that population’s ancestry.
Secondly, the frequency of a variant in a population is important. Although in general the frequency of variants across populations is correlated, this is never perfect. Some variants are more common in one population than another. If these variants are in a PRS, then the resulting score predicted on an individual from one population may be inflated, not because of higher disease risk, but because they happen to have more effect alleles because of their ancestry. We can account for this by adjusting PRS based on an assessment of genetic ancestry using principal components analysis.
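One common way to make this adjustment, which we assume here purely for illustration, is to regress the raw PRS on the leading principal components in the reference population and keep the standardized residuals. The function names, the choice of four components and the simulated data are all hypothetical.

```python
import numpy as np

def fit_ancestry_adjustment(prs, pcs):
    """Fit PRS ~ intercept + PCs by least squares in a reference population."""
    prs = np.asarray(prs, dtype=float)
    design = np.column_stack([np.ones(len(prs)), np.asarray(pcs, dtype=float)])
    coefs, *_ = np.linalg.lstsq(design, prs, rcond=None)
    residual_sd = (prs - design @ coefs).std()
    return coefs, residual_sd

def apply_ancestry_adjustment(prs, pcs, coefs, residual_sd):
    """Remove the PC-predicted part of the PRS and rescale to unit variance."""
    pcs = np.atleast_2d(np.asarray(pcs, dtype=float))
    expected = coefs[0] + pcs @ coefs[1:]
    return (np.atleast_1d(prs) - expected) / residual_sd

rng = np.random.default_rng(seed=1)
pcs = rng.normal(size=(1000, 4))  # simulated first four PCs
# Simulated raw PRS that drifts with PC1, mimicking an ancestry-driven shift.
raw_prs = 0.5 * pcs[:, 0] + rng.normal(size=1000)

coefs, residual_sd = fit_ancestry_adjustment(raw_prs, pcs)
adjusted = apply_ancestry_adjustment(raw_prs, pcs, coefs, residual_sd)

# The same fitted coefficients can then be applied to a new individual.
new_person = apply_ancestry_adjustment(0.11, pcs[0], coefs, residual_sd)
```

After adjustment, the score no longer tracks PC1 in the reference population, so differences in adjusted PRS are less likely to reflect ancestry alone.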
In general, reference populations containing individuals from continental-level ancestry groups (European, American, Asian, African) are used because the tradeoff between size (usually in the thousands to tens of thousands) and ancestry similarity is acceptable. In some cases there is enough data and diversity to split these groups further (e.g. East Asian and South Asian) and even to country level in situations where national biobanks have made large quantities of data available (e.g. Japan, Finland, and the UK). This is possible because there is a strong correlation between geography and genetics, with people who live nearby tending to have more similar DNA and thus more similar genetic ancestry.
PRS percentile to relative and absolute risk
If Maria’s genetic ancestry is European, then we can define the reference population based on many thousands of women of European ancestry for whom we have data. A good example of such a dataset is the UK Biobank. Because we know about disease in this reference population, we can identify Maria’s relative risk by comparing disease outcomes in women with her PRS to those with either average or lower scores.
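As a rough sketch of that comparison, relative risk can be computed as the disease rate among reference women in a given percentile range divided by the rate in a baseline range. The function, the choice of the 40th to 60th percentiles as the "average" group, and the twelve-person toy dataset are all assumptions made for illustration, not real data or our production model.

```python
import numpy as np

def relative_risk(percentiles, has_disease, group, baseline):
    """Disease rate in one percentile range divided by the rate in a baseline range."""
    percentiles = np.asarray(percentiles, dtype=float)
    has_disease = np.asarray(has_disease, dtype=bool)
    in_group = (percentiles >= group[0]) & (percentiles <= group[1])
    in_base = (percentiles >= baseline[0]) & (percentiles <= baseline[1])
    return float(has_disease[in_group].mean() / has_disease[in_base].mean())

# Toy reference population: PRS percentile and disease status per woman.
percentiles = [95, 92, 91, 97, 50, 45, 55, 60, 48, 52, 58, 42]
has_disease = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]

# Compare women at or above the 91st percentile with those around the median.
rr = relative_risk(percentiles, has_disease, group=(91, 100), baseline=(40, 60))
print(rr)
```

In this toy example half of the high-PRS group and one in eight of the baseline group have the disease, giving a relative risk of 4; a real estimate would need thousands of individuals and confidence intervals.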
We can also build absolute risk models based on PRS and other relevant risk factors and use this to predict the risk conferred by different PRS percentiles of the PRS distribution in combination with these other risk factors. Doing so means that we can reassess how Maria’s PRS percentile equates to risk in this population by calculating the risk of disease for women in the reference population with the same PRS.
If Maria is not European, we can adjust the PRS for ancestry, although this is predicated on having enough genetic data from an ancestry-specific reference population to account for the effect of ancestry on the PRS. Principal components can be used to adjust both the ancestry-specific population and Maria’s PRS for ancestry, which can then be used to build a risk model.
However, we don’t know the risk conferred by this PRS in this ancestry, so we can’t use the same clinical model that has been developed on European ancestry individuals. We need to reassess the association between the adjusted PRS and clinical risk in the non-European ancestry population. This requires both clinical and genetic data on the same individuals, which can for example be accessed from prospective studies such as eMERGE and the UK Biobank, where data from tens of thousands of non-European ancestry individuals is available.
In the absence of an ancestry-specific reference population with clinical data it is not possible to translate the adjusted PRS into an estimate of risk.
Reporting risk
So, now we know Maria’s PRS percentile, having potentially adjusted it for her ancestry (in the case that she is not European), and have assessed her relative and absolute risk of disease by comparing her PRS to an ancestry-specific reference distribution. The final step of the process is to feed back Maria’s breast cancer risk. There are a variety of ways that this can be done. While it’s possible to simply provide someone their PRS percentile, which implicitly provides some information about how their score relates to others, as we mentioned at the top, from a clinical point of view this provides little value. We can do much better by providing an assessment of absolute risk that gives individuals actionable information to actively mitigate their risk.
When reporting PRS-integrated risk scores, we need to bear the following questions in mind:
- Is the report physician or patient-focused? This will affect the language and tone of any resulting communication and define the level of educational content that might be required.
- Can absolute risk, incorporating clinical factors, PRS and potentially rare pathogenic variation, be communicated?
- Are there clear risk thresholds from disease guidelines that can be used to align genomics-integrated predictions of risk with current standard of care?
- Are there recommendations that can be applied to help the individual mitigate any extra risk?
A guidelines-first approach to defining risk thresholds
At Allelica, our risk reports take a guidelines-first approach to defining high risk. According to the American Cancer Society, a lifetime risk of breast cancer between 20% and 25% is defined as high risk. Given a population prevalence of around 10–12% for breast cancer, these rates are roughly twice the ‘average’ risk of the population (20% / 12% ≈ 1.7 at the low end and 25% / 10% = 2.5 at the high end). Other authors have suggested that, as a rule of thumb, a factor that increases your risk of disease by at least 2 times might be described as high risk. While this is a relative measure, made by comparing a group against either an average individual or the remainder of the population, it provides a benchmark that we can use with PRS, and with more sophisticated PRS-integrated models involving other risk factors, to provide estimates of risk.
Importantly, in the case of breast cancer, as well as other common diseases, there are already well-established recommendations for appropriate action given a certain level of absolute risk. For example, the National Comprehensive Cancer Network’s latest cancer screening guidelines (v 1.2021) suggest that women with rare, moderate-impact pathogenic mutations that can increase absolute lifetime risk to between 20% and 50% should start annual mammograms up to 20 years earlier than the general population. There is no reason why the same advice couldn’t be used to inform mitigation options based on a risk assessment involving PRS that arrives at an estimate of more than 20%.
Below we can see how Maria’s PRS in the 91st percentile of the distribution translates to an absolute risk of around 23%, which is above the high risk threshold. In this case, we would recommend she speak with an oncologist about potentially increasing surveillance by performing more frequent mammograms.
We can also produce informative risk reports that provide information about how Maria’s risk compares to that of the average population, and which risk factors contribute to her elevated risk.
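The comparison against a guideline threshold and against average risk can be sketched in a few lines. The 20% threshold and the 23% estimate come from the text above; the function name, the category labels and the use of 12% as the population average are assumptions made for this example.

```python
HIGH_RISK_THRESHOLD = 0.20   # lifetime risk threshold taken from the guidelines above
POPULATION_AVERAGE = 0.12    # assumed upper end of the 10-12% population figure

def risk_category(absolute_lifetime_risk, threshold=HIGH_RISK_THRESHOLD):
    """Label an absolute lifetime risk relative to a guideline threshold."""
    if not 0.0 <= absolute_lifetime_risk <= 1.0:
        raise ValueError("risk must be a probability between 0 and 1")
    return "high" if absolute_lifetime_risk >= threshold else "not high"

maria_risk = 0.23
category = risk_category(maria_risk)
ratio_to_average = maria_risk / POPULATION_AVERAGE

print(category, f"{ratio_to_average:.1f}x average")
```

Keeping the threshold as a parameter makes it straightforward to align the same logic with a different guideline or a different disease.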
Conclusion
Genetic information from PRS has the potential to inform risk management approaches across populations and for individuals. We have known for a long time that the genetic risk of complex disease is the result of highly penetrant rare pathogenic mutations as well as polygenic variants. Until now the assessment of this polygenic risk has not been possible. PRS provide one tool for assessing that genetic component, and are becoming increasingly powerful in their predictive performance.
Nevertheless, the best disease risk assessments will combine genetic, clinical and other biomarkers to arrive at a more personalized and preventive approach to medicine. By aligning assessments of absolute disease risk with current guidelines, it is possible to provide more precise estimates of risk without invoking either genetic exceptionalism or genetic determinism.
If you’re interested in finding out more about Allelica’s clinical grade PRS reporting, or how you can use our software to develop your own PRS, then drop us a line at info@allelica.com.