PRS-powered disease risk reporting

George Busby, Allelica CSO & Co-Founder

We’ve previously written about how you can use Allelica’s DISCOVER software to build new polygenic risk scores (PRS) and to deploy them at scale with PREDICT. While a PRS provides an accurate assessment of someone’s genetic liability to disease, the number on its own provides little value. We need to provide additional context for the score to have clinical relevance. In this article, we’re going to cover the approach that we’ve taken at Allelica to derive disease risk estimates from a raw polygenic risk score.

To illustrate how we go from PRS to risk assessment and reporting, we’re going to follow the steps involved in generating a breast cancer risk report for a hypothetical woman called Maria.

Predicting disease risk with PRS

PRS panels can contain variants with positive effect sizes which increase risk, and variants with negative effect sizes that decrease risk. This means that the higher Maria’s PRS, the greater the amount of disease risk that is conferred by her genes. It also means that the actual values of an individual PRS can be positive or negative, depending on both the PRS panel used and the individual’s genetic background.
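To make the idea of positive and negative effect sizes concrete, here is a minimal sketch of a raw score computed as a weighted sum of effect-allele counts. The variant names, effect sizes and genotype are invented for illustration, and treating a missing variant as zero effect alleles is a simplifying assumption; this is not a description of how Allelica's software computes a PRS.

```python
# Hypothetical panel: variant ID -> per-allele effect size.
# All IDs and numbers are made up for illustration.
effect_sizes = {
    "variant_A": 0.08,   # positive: increases risk
    "variant_B": -0.05,  # negative: decreases risk
    "variant_C": 0.12,
    "variant_D": -0.03,
}

# Number of effect alleles (0, 1 or 2) carried at each variant.
genotype = {"variant_A": 2, "variant_B": 1, "variant_C": 0, "variant_D": 2}


def polygenic_score(effect_sizes, genotype):
    """Sum of (effect size x effect-allele count) over the panel.

    Variants missing from the genotype are counted as 0 effect alleles,
    which is a simplification chosen for this sketch.
    """
    return sum(beta * genotype.get(variant, 0)
               for variant, beta in effect_sizes.items())


score = polygenic_score(effect_sizes, genotype)
print(f"raw PRS: {score:.2f}")
```

Because the weights can have either sign, the same function returns a negative score for someone who mostly carries risk-decreasing alleles.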

When we ran Allelica’s breast cancer PRS on Maria, we got a score of 0.11. As it’s positive, this number suggests that Maria has more alleles contributing to her disease risk than reducing it, but this can’t be used on its own to estimate her actual genetic disease risk. To get that, we need to compare this score to a distribution of scores from a reference population.

PRS values in a population are approximately normally distributed

So what does a reference population look like? In a perfect world, we’d have a population of ‘Marias’, each of whom would live in the same place and share the same environment as ‘our’ Maria. We’d also want their genetic ancestry to be the same (more on this later). We’d want to control these things as much as possible so that the only thing that differs between each of these ‘Marias’ is their PRS and their disease state, which we’d also know. Using this matched population of ‘Marias’ we could build a distribution of breast cancer PRS and compare Maria’s score to estimate where in this distribution her score lands.

We calculated Maria’s raw PRS for breast cancer and compared it to a reference distribution (left). To enable interpretation and to define quantiles, the PRS distribution is standardized to have a mean of 0 and a standard deviation of 1 (right). From her standardized score, we can see that Maria’s PRS is 1.4 standard deviations above the average score.

We can now place Maria’s PRS within this distribution to assess how her value of 0.11 compares to others. We first standardize the distribution of PRS to have a mean of 0 and a standard deviation of 1. This helps with interpretation and allows us to split the reference distribution into 100 equal-sized bins, or percentiles, and see which one Maria’s PRS falls into. Here we can see that her standardized PRS value of 1.41 places her in the 91st percentile.
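The standardization and percentile lookup can be sketched in a few lines. The reference scores below are simulated from a normal distribution with made-up parameters, so the output is not expected to reproduce the figures quoted in this article; the function names are our own.

```python
import random
import statistics

random.seed(0)

# Simulated stand-in for the raw scores of a reference population.
# The mean and spread are invented; this is not real data.
reference_scores = [random.gauss(-0.03, 0.1) for _ in range(10_000)]

ref_mean = statistics.fmean(reference_scores)
ref_sd = statistics.stdev(reference_scores)


def standardize(raw_score):
    """Express a raw score in standard deviations from the reference mean."""
    return (raw_score - ref_mean) / ref_sd


def percentile(raw_score):
    """Percentile bin (1 to 100) of a raw score in the reference distribution."""
    below = sum(s < raw_score for s in reference_scores)
    fraction_below = below / len(reference_scores)
    return min(100, int(fraction_below * 100) + 1)


z = standardize(0.11)
print(f"standardized PRS = {z:.2f}, percentile = {percentile(0.11)}")
```

The percentile here is read from the empirical distribution rather than from a normal approximation, so it remains valid even if the reference scores are not perfectly normal.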

The reference population is divided into percentiles to identify where Maria’s PRS lands. (In this chart we show only where the deciles fall.) Her PRS, marked by the red line, is in the 91st percentile of the distribution.

So far, so good. But before we continue we need to backtrack a bit. Remember that we said that ideally we’d generate a distribution of scores from a population of ‘Marias’? In reality, this doesn’t exist. People’s environments differ; their diet and exercise levels differ; their physiology differs; and of course their DNA differs. What’s more, we don’t have access to genetic and other clinical information for most of the world’s population, so we don’t have the luxury of being able to pick from a large pool of useful data.

So instead of building a reference population of ‘Marias’, we need to use available datasets that have both the genetic data needed to apply a PRS and the clinical data required to assess disease outcome. In practice, this means that we need to be flexible and use a variety of approaches to overcome the challenges of defining an individual’s disease risk by comparing their PRS and other clinical risk factors to those from an imperfect reference population.

Developing ancestry-specific reference populations

Why do we need to match for ancestry? There are two main reasons. The first is that the genetic associations that feed into the PRS are tied to the ancestry of the population(s) used in the PRS Training dataset from the original genome-wide association study (GWAS). Most of the associated variants are not themselves disease-causing, but ‘tag’ a causal variant, meaning that they sit close to it in the genome. How well an associated variant tags the causal variant depends on the genetic background around the causal variant, which differs between ancestries. So a variant in strong association with disease in the Training dataset population(s) may be less well associated in the reference population used to develop our prediction model. The ability of a PRS to predict disease in a population therefore depends on that population’s ancestry.

Secondly, the frequency of a variant in a population is important. Although variant frequencies are in general correlated across populations, the correlation is never perfect. Some variants are more common in one population than another. If these variants are in a PRS, then the score calculated for an individual from one population may be inflated, not because of higher disease risk, but because they happen to carry more effect alleles as a result of their ancestry. We can account for this by adjusting the PRS based on an assessment of genetic ancestry using principal components analysis.
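One common way to make this kind of adjustment is to regress the raw PRS on ancestry principal components in the reference population and keep the residual. The sketch below does this with a single simulated component to stay short; in practice several components would typically be included, and the data, coefficients and function names here are all invented for illustration rather than taken from Allelica's pipeline.

```python
import random
import statistics

random.seed(1)

# Simulated reference individuals: one ancestry principal component (PC1)
# and a raw PRS that drifts with PC1. All values are invented.
pc1 = [random.gauss(0, 1) for _ in range(5_000)]
raw_prs = [0.05 * pc + random.gauss(0, 0.1) for pc in pc1]


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


intercept, slope = fit_line(pc1, raw_prs)


def adjust(prs, pc):
    """Residual PRS after removing the part predicted by ancestry (PC1)."""
    return prs - (intercept + slope * pc)


adjusted = [adjust(p, pc) for p, pc in zip(raw_prs, pc1)]
residual_sd = statistics.stdev(adjusted)

# A new individual is adjusted with the same fitted line, then standardized
# against the adjusted reference distribution (which has mean 0).
new_z = adjust(0.11, 0.8) / residual_sd
print(f"ancestry-adjusted standardized PRS: {new_z:.2f}")
```

The key design point is that the regression is fitted once on the reference population and then applied unchanged to each new individual, so that everyone is adjusted on the same scale.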

In general, reference populations containing individuals from continental-level ancestry groups (European, American, Asian, African) are used because the tradeoff between size (usually in the thousands to tens of thousands) and ancestry similarity is acceptable. In some cases there is enough data and diversity to split these groups further (e.g. East Asian and South Asian) and even to country level in situations where national biobanks have made large quantities of data available (e.g. Japan, Finland, and the UK). This is possible because there is a strong correlation between geography and genetics, with people who live near one another tending to have more similar DNA and thus more similar genetic ancestry.

PRS percentile to relative and absolute risk

We can build absolute risk models based on PRS and other relevant risk factors, and use them to predict the risk conferred by different percentiles of the PRS distribution in combination with these other risk factors. Doing so means that we can assess how Maria’s PRS percentile equates to risk in this population by calculating the risk of disease for women in the reference population with the same PRS.

PRS distributions can be used to translate specific scores into risk. On the left, the risk relative to the remainder of the population is shown for the tail of the distribution. On the right, Maria’s PRS percentile translates to an almost 23% absolute lifetime risk of breast cancer, based on a risk model from a reference population.
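As a rough illustration of how a PRS bin maps to risk, the sketch below simulates a reference cohort with known disease status, splits it into deciles of standardized PRS, and reports the observed proportion of cases in each decile along with the risk in the top decile relative to the rest. The baseline risk and effect per standard deviation are invented, and the observed case proportion is a deliberate simplification: a real lifetime absolute risk model would also need age-specific incidence and other clinical risk factors.

```python
import math
import random

random.seed(2)


def simulate_person():
    """Return (standardized PRS, developed disease?) for one simulated person.

    The baseline risk (12%) and log-odds effect per SD (0.5) are invented.
    """
    z = random.gauss(0, 1)
    log_odds = math.log(0.12 / 0.88) + 0.5 * z
    risk = 1 / (1 + math.exp(-log_odds))
    return z, random.random() < risk


# Reference cohort sorted from lowest to highest PRS.
cohort = sorted((simulate_person() for _ in range(50_000)), key=lambda p: p[0])


def case_fraction(people):
    """Observed proportion of cases in a group of (z, is_case) pairs."""
    return sum(is_case for _, is_case in people) / len(people)


# Split the sorted cohort into 10 equal-sized bins (deciles).
bin_size = len(cohort) // 10
deciles = [cohort[i * bin_size:(i + 1) * bin_size] for i in range(10)]
risk_by_decile = [case_fraction(d) for d in deciles]

# Risk in the top decile relative to the remainder of the population.
relative_risk_top = risk_by_decile[-1] / case_fraction(cohort[:-bin_size])

for i, risk in enumerate(risk_by_decile, start=1):
    print(f"decile {i:2d}: observed risk {risk:.1%}")
print(f"top decile vs rest: {relative_risk_top:.2f}x")
```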

If Maria is not European, we can adjust the PRS for ancestry, although this is predicated on having enough genetic data from an ancestry-specific reference population to account for the effect of ancestry on the PRS. Principal components can be used to adjust both the ancestry-specific population and Maria’s PRS for ancestry, which can then be used to build a risk model.

However, we don’t know the risk conferred by this PRS in this ancestry, so we can’t use the same clinical model that has been developed on European ancestry individuals. We need to reassess the association between the adjusted PRS and clinical risk in the non-European ancestry population. This requires both clinical and genetic data on the same individuals, which can for example be accessed from prospective studies such as eMERGE and the UK Biobank, where data from tens of thousands of non-European ancestry individuals is available.

In the absence of an ancestry-specific reference population with clinical data, it is not possible to translate the adjusted PRS into an estimate of risk.

Reporting risk

When reporting PRS-integrated risk scores, we need to bear the following questions in mind:

  • Is the report physician- or patient-focused? This will affect the language and tone of any resulting communication and define the level of educational content that might be required.
  • Can absolute risk, incorporating clinical factors, PRS and potentially rare pathogenic variation, be communicated?
  • Are there clear risk thresholds from disease guidelines that can be used to align genomics-integrated predictions of risk with current standard of care?
  • Are there recommendations that can be applied to help the individual mitigate any extra risk?

A guidelines-first approach to defining risk thresholds

Importantly, in the case of breast cancer, as well as other common diseases, there are already well-established recommendations for appropriate action given a certain level of absolute risk. For example, the National Comprehensive Cancer Network’s latest cancer screening guidelines (v 1.2021) suggest that women with rare, moderate-impact pathogenic mutations that can increase absolute lifetime risk to between 20% and 50% should start annual mammograms up to 20 years earlier than the general population. There is no reason why the same advice couldn’t be used to inform mitigation options based on a risk assessment involving PRS that arrives at an estimate of more than 20%.

Below we can see how Maria’s PRS in the 91st percentile of the distribution translates to an absolute risk of around 23%, which is above the high-risk threshold. In this case, we would recommend she speak with an oncologist about potentially increasing surveillance through more frequent mammograms.

Maria’s PRS puts her above the guideline high-risk threshold (20%).
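The comparison against a guideline threshold is simple enough to sketch directly. The 20% threshold is the one quoted above; the function name and category labels are hypothetical and chosen only for this example.

```python
HIGH_RISK_THRESHOLD = 0.20  # lifetime absolute risk threshold quoted in the text


def risk_category(absolute_lifetime_risk, threshold=HIGH_RISK_THRESHOLD):
    """Label a risk estimate (as a proportion) relative to a threshold."""
    if not 0.0 <= absolute_lifetime_risk <= 1.0:
        raise ValueError("risk must be a proportion between 0 and 1")
    if absolute_lifetime_risk > threshold:
        return "above threshold"
    return "below threshold"


print(risk_category(0.23))  # Maria's estimated risk of around 23%
```

Keeping the threshold as a parameter means the same check can be reused for other diseases or guideline versions without changing the logic.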

We can also produce informative risk reports that provide information about how Maria’s risk compares to that of the average population, and which risk factors contribute to her elevated risk.

An example of Allelica’s Breast Cancer risk report incorporating PRS, rare pathogenic mutations and family history of disease.


The best disease risk assessments will combine genetic, clinical and other biomarker data to arrive at a more personalized and preventive approach to medicine. By aligning assessments of absolute disease risk with current guidelines, it is possible to provide more precise estimates of risk without invoking either genetic exceptionalism or genetic determinism.

If you’re interested in finding out more about Allelica’s clinical grade PRS reporting, or how you can use our software to develop your own PRS, then drop us a line at

Allelica is a software genomics company developing algorithms and digital tools to accelerate the integration of polygenic risk scores into clinical practice.