## Simulating Genetic Associations

For type I evaluations simulation of genotypes is independent to phenotypes. Phenotype values or disease status can be randomly assigned to sample genotypes. To evaluate power, genotype-phenotype associations have to be simulated, by either generating genotypes conditional on phenotypes, or assign phenotypes based on given genotype sequence. This document briefly discusses general considerations for simulations of genetic associations as well as models for quantitative and disease phenotypes.

### Effects of Variants

#### Fixed effect vs. variable effects

Phenotypes, either disease status or quantitative trait values, reflect contribution of mutations observed in a genomic region. A “joint effect size” is calculated for a region under consideration as a result of the collective contribution of all variant sites. Contribution of causal variants within a locus can be uniform, or vary by the underlying MAF or functionality.

The Fixed effect model assumes the effect size for all loci to be the same, i.e., the odds ratio $\gamma$ for disease traits or the mean shift in QT due to a mutations in quantitative traits $\beta$ is constant. Variable effects model the magnitude of effect of a variant is determined by its attribute, for example its MAF such that $\gamma \propto \frac{1}{q}$. Although it is also possible to base the variable effects on other attributes, say, selection coefficient, functionality, annotation scores, etc., they are mostly correlated with the MAF for a variant. In SEQPower variable effects model are solely based on MAF.

For population attributable risk model (PAR), constant PAR across all loci will still lead to variable odds ratios since odds ratio is a function of PAR and MAF. Therefore for both fixed PAR or variable PAR input, the effect size will be determined further by MAF.

#### Protective, deleterious and neutral variants

Effects of variants on disease phenotypes is not uni-directional. In our DNA sequence simulation we have simulated synonymous variants $S=0$, protective variants $S\leftarrow10^{-4}$, deleterious variants $S>10^{-4}$ and neutral variants $-10^{-4}\le S < 0, 0 < S \le 10^{-4}$. Effect size for deleterious variants can be assigned based on literature references {citation needed}. Effect size for neutral variants are simply “neutral” (1 for odds ratio, 0 for mean shift or PAR). Although knowledge on the effect sizes of protective variants is lacking, it is natural to model them with reference to deleterious variants, e.g. simply add a negative sign to the effect size for QT mean shift or PAR difference, or the inverse of odds ratio $\gamma_p=\frac{1}{\gamma_d} \le 1.0$ for the logit model.

Note that the inverse of odds ratio does not result in symmetric effects w.r.t. deleterious variants. Even if there are 50% protective and 50% deleterious variants in a region, their effects do not cancel out in this situation. Consider this setup: $A= \text{Affected}$; $U = \text{Unaffected}$; $M = \text{with mutation}$; $\Pr{\left\{A|M\right\}}= \text{penetrance}$; $\Pr{\left\{M|A\right\}}= \text{MAF in cases}$; $\Pr{\left\{M|U\right\}}= \text{MAF in ctrls}$; $\Pr{\left\{A\right\}}= \text{prevalence}$. Then by formula $\frac{\Pr{\left\{A|M\right\}}\Pr{\left\{M\right\}}}{\Pr{\left\{A\right\}}} = \Pr{\left\{M|A\right\}}$ then for protective variants the formula may look like for example $\frac{0.005\times10^{-6}}{0.01} = \Pr{\left\{M|A\right\}} = 5\times10^{-7}$ $\frac{0.995\times10^{-6}}{0.99} = \Pr{\left\{M|U\right\}} = 10^{-6}$ for deleterious variants the formula may look like for example $\frac{0.02\times10^{-6}}{0.01} = \Pr{\left\{M|A\right\}} = 2\times10^{-6}$ $\frac{0.98\times10^{-6}}{0.99} = \Pr{\left\{M|U\right\}} = 10^{-6}$ It seems in such settings the MAF in ctrls would be about the same while in cases there are underlying MAF changes. It is different from when you use PAR models the MAF in case and ctrls would just flip when there is protective, making it symmetric.

### Quantitative Traits

#### Linear mean-shift model

This model implements the simulation procedures in Kyrukov et al1), Ladouceur et al2) and others. Assuming QT is distributed normally and individuals carrying at least one functional nonsynonymous allele have QT values distributed with a shifted mean, but with the same variance, and population with presence of a detrimental mutation will raise the mean QT, whereas population with presence of a protective mutation will lower the mean QT. For a genetic region with multiple locus the effects are additive, following a simple linear model $Y=\mu+\sum\beta_gXg+\sigma$

According to Kyrukov et al, “De novo mutations associated with dominant syndromes caused by haploinsufficiency are known to cause shifts in QT means exceeding one standard deviation $\sigma$. For example, loss-of-function mutations in NSD1 causing Sotos syndrome increase mean occipitofrontal head circumference by >2$\sigma$; mutations in JAG1 causing Alagille syndrome decrease mean height by 2$\sigma$; dominant-negative mutations in FBN1 producing Marfan syndrome reduce mean bone mineral density by 1.5$\sigma$; and LDL receptor mutations causing hypercholesterolemia increase mean LDL-bound cholesterol level by >4$\sigma$ … A segregating nonsynonymous SNP A390P in CETP is associated with 0.4$\sigma$ decrease in plasma HDL-C levels. We therefore concentrated on values of 0.25$\sigma$ and 0.5$\sigma$ for new missense mutations … In the context of a trait such as human height, 0.25$\sigma$ would correspond to a change in mean height of 0.5 inch.”

#### Extreme quantitative trait model

In SEQPower extreme QT samples can be obtained in two ways:

• Simulate a random population with QT and take the upper / lower $p^{th}$ percentile as extreme QT samples.
• Specify upper / lower QT value cutoff, and repeatedly generating samples with QT values yet only keep extreme samples defined by these cutoffs until specified sample size is fulfilled.

Extreme QT thus simulated can be analyzed as quantitative traits with caution (normality no longer holds and conventional QT methods are not solid), or be converted to qualitative labels as will be discussed below.

### Disease Status

#### logit model of odds ratios

A logit equation is used to model the joint penetrance for given genotype. Consider the logistic regression framework $logit(p) = \sum\beta_gX_g$ In this setup, the regression coefficient $\beta$ correspond to the log odds ratio of the genetic risk factor (i.e., a variant). The joint odds ratio $\Gamma$ can then be computed and joint penetrance is calculated as $f=\frac{\omega}{1+\omega}$ where $\omega=\frac{f_0}{1-f_0}\Gamma$, given the baseline penetrance $f_0$.

To simulate disease status under this model, first an odds ratio $\gamma$ is assigned to a variant. For the fixed effect model $\gamma$ is constant for all rare variants; for variable effects model the magnitude of $\gamma$ is determined by locus MAF. For each locus the assigned $\gamma$ has to be updated by taking into consideration the genotype at both haplotypes on the locus under given mode of inheritance assumption (dominant, recessive, multiplicative, additive, compound dominant, etc.). Given locus specific odds ratio, disease status specific locus MAF can be calculated (e.g., $q'_{Aa}=\frac{\gamma_{Aa}q_{Aa}}{1+(1-\gamma_{Aa})q_{Aa}}$). If the locus is detrimental, then the population MAF is used for control group and updated MAF is used for case group. If the locus is protective then the population MAF is used for case group and updated MAF is for control group. Haplotype samples for both groups can be simulated based on the updated underlying MAF.

#### Population attributable risk model

For a population attributable risk (PAR) model, a group PAR $\alpha$ is specified for a genetics region. The marginal PAR for a single rare variant $\alpha_i$ is then calculated by distributing the group PAR to non-synonymous rare variants either uniformly (fixed effect) or relative to MAF. The sum of the marginal PAR should equal the total PAR. Odds ratio of a detrimental heterzygous genotype, for example, is approximated by $\gamma_{Aa} = \frac{\alpha}{(1-\alpha)q_{Aa}}+1$ where the group specific genotype frequency (in this setting the genotype frequency in controls) is approximated by the genotype frequency in population.

In population attributable risk model the effect of protective and detrimental mutations can be symmetric because prevalence of the disease in the population is not taken into consideration.

#### Linear model with qualitative outcome

Ladouceur et al based their simulation for dichotomous traits on the simulation for quantitative traits, by translating QT to qualitative outcome. In their simulation they used extreme $25^{th}$ percentiles from continuous trait distributions to define case/control groups. This model represents a class of disease related to extreme physiological variables. It has been implemented in SEQPower taking user specified percentile cutoff.

### Phenotype Simulation for Sequences Drawn from Haplotype Pools

Drawing haplotypes directly from given haplotype pools (not based on given MAF) takes advantage of real world sequence data and maintain important genomic features (e.g. phase, LD patterns). However for power calculation purposes using such data set is more computationally demanding, since it is no longer possible to compute for disease traits the MAF distribution in cases/controls. Different procedure has to be applied to designate disease status conditioning on observed (sampled) genotypes.

Under the logit model of odds ratios, once the locus specific odds ratios are determined, the joint odds ratio can be computed and with baseline penetrance, the updated penetrance for the genotype is calculated, which, by definition, is the probability of a sample being a case given its genotype. A random number is generated and compared against the penetrance value to determine the disease status. The process is repeated until required sample size for cases and controls are fulfilled. Simulation procedures for quantitative traits remain the same as with simulations based on MAF spectrum.

1) G. V. Kryukov, A. Shpunt, J. A. Stamatoyannopoulos and S. R. Sunyaev (2009). Power of deep, all-exon resequencing for discovery of human trait genes. Proceedings of the National Academy of Sciences
2) Martin Ladouceur, Zari Dastani, Yurii S. Aulchenko, Celia M. T. Greenwood and J. Brent Richards (2012). The Empirical Power of Rare Variant Association Methods: Results from Sanger Sequencing in 1,998 Individuals. PLoS Genetics