## Rare Variant Association Methods

Implementation of association analysis methods in SEQPower is taken from the VAT ensemble module, which is part of ''variant tools / VAT''. A number of published methods and their variations are available in VAT for rare variants association analysis of quantitative and qualitative traits. SEQPower incorporates these methods for power and sample size analysis purposes. The documentation below is also borrowed from VAT, which briefly describes the implementation of these previously published methods with an emphasize on our own optimization and extensions to the methods. More details of each association tests can be found in their original publications.

### Testing for Association with Single Genes

Most statistical methods for rare variants were originally proposed for association analysis between single gene and qualitative (case/control) phenotype, although many can be readily generalized into testing for association adjusting for phenotype covariates. Implementation of single gene analysis include:

• CMC method
• $C(\alpha)$ method
• KBAC method
• RBT method
• RareCover method
• VT method
• WSS method
• ASUM method

#### Combined and Multivariate Collapsing method for rare variants

The Combined and Multivariate Collapsing (CMC, Li and Leal, 20081)) method considers all variants in a test unit (e.g., a gene). It “collapses” all rare variants in the gene region such that the region is coded “0” if all loci are wildtype, and “1” if any one locus has a minor allele. Then it “combines” this coding with the rest of common variants in the gene region into a multivariate problem that tests for the null hypothesis that the gene region is not associated with a disease or quantitative trait. The statistic for CMC method can be $\chi^2$ test for collapsed rare variants, Hotelling's $T^2$ or multivariate regression analysis for joint analysis of common and rare variants. CMC method is implemented for rare variants with Fisher's exact test for evaluating association between rare variants and disease phenotypes. The use of Fisher's test results in exact p-value, avoiding the computationally intensive permutation procedure.

Using mid-$p$ values for exact test

The use of exact test guarantees to control for type I error yet can be overly conservative. Mid-$p$ values are reasonable compromise between the conservativeness of the ordinary exact test and the uncertain adequacy of large-sample methods. An additional switch --midp gives adjusted $p$ values for CMC exact test.

#### Kernel based adaptive clustering method

The KBAC statistic in Liu and Leal 20102) carries out case-control association testing for rare variants for whole exome association studies. Briefly, consider a gene of length $K$ which harbors $M$ rare variants. Multi-locus genotype (genotype pattern) on the $M$ variant sites and the disease status (case/control) are known for each individual. The algorithm contrasts the genotype pattern between cases and controls, weighted by the probability of the genotype pattern distribution under various statistical distribution models (kernels). Permutation has to be used to obtain valid p-values.

We also provide an R package for KBAC with standalone data set in PLINK-like text format.

Note a few differences between this implementation and the original KBAC paper:

• The original paper provides 3 kernel options: hypergeometric, binomial and Gaussian kernels. The hypergeometric kernel generally performs best and is recommended by the authors as the only KBAC implementation.
• The two-sided KBAC test implements the spirit of the RBT test3) by performing two KBAC tests under both protective and deleterious assumptions and use the larger of the two statistics thus calculated as the final KBAC statistic.

#### A "Covering algorithm" for rare and low frequency variants

The RareCover test in Bhatia et al 20104) is an efficient heuristic greedy algorithm to find an optimized combination of variants in a loci with the strongest association signal. It uses the same collapsing strategy and test statistic as in Li and Leal, 20085) but scans over the entire genomic region of interest, adding at each iteration the variants that contributes most to the statistic.

RareCover is related to the Variable Threshold test yet differs in the order by which rare variants are incorporated into the test. Variable threshold test assumes a fixed yet unknown MAF boundary of rare causal variants, which is not assumed by RareCover. Due to computational complexity, RareCover does not perform exhaustive search for all combinations of variants in a loci region. The “coverage” of RareCover method depends on the tuning parameter $Q$. RareCover method is implemented here as a two-sided test with $Q=0.5$, as recommanded by the original paper.

#### Replication based test for protective variants

The replication base test (RBT) in Ionita-Laza et al 20116) use the idea of replication to take into consideration of impact of protective variants in association analysis. The two-sided RBT evaluates statistical evidences for two hypothesis separately:

• Deleterious rare variants are enriched in cases
• Protective rare variants are enriched in controls

The final statistic is based on the stronger evidence of the two, adjusted for multiple testing. To increase the power of this approach a weighting theme is applied to variant counts in case or control group using a transformation of the probability of observing such counts under a Poisson model.

Implementation of RBT in this program has both one-sided and two-sided versions. The one-sided testing strategy tests for the presence of variants conferring risk to disease by focusing on variants that have higher observed frequency in cases compared with controls. Permutation procedure is used in both one-sided and two-sided tests to obtain valid $p$ value.

#### Variable threshold test for case control data analysis

The variable threshold method (VT, Price et al 20107)) tests for association between phenotypic values (case control or quantitative traits) with individuals' genotype “score” subject to a variable MAF threshold. It assumes that there exists a fixed yet unknown MAF threshold on a given genetic region which is the cutoff for the causality of variants on that loci. In testing the association for the genetic region, for each possible MAF threshold a genotype score is computed based on given rare variant analysis scheme, and is evaluated for association between the phenotype of interest; the final MAF threshold is chosen such that the association signal is strongest. Permutation procedure has to be used to control for type I error due to multiple testing.

The VT strategy creates a very flexible framework that can be applied to many association tests, including the use of external information such as weight theme by annotation scores, as the VT paper suggested. The VTtest method here implements the test for case control phenotype using the CMC coding theme and Fisher's exact test, in addition to the original VT statistic. Permutation procedure is optimized due to the use of Fisher's exact test: the minimum $p$ value resulted from Fisher's exact test on data before permutation has to exceed the expected significance level in order to enter the permutation test procedure; otherwise it will be reported as is. This computational trick reduces the complexity the original VT test bears.

#### Weighted Sum Statistic via Rank Test

The method proposed by Madsen and Browning 20098) first introduced the idea of assigning “weights” to rare variants within a genetic region before they are aggregated into some generic coding. Variants having higher weights will have more substantial contribution to the collapsed variant score. In the Madsen & Browning paper the “weights” are defined as $\sqrt{n_iq_i(1-q_i)}$ with the assumption that the “rarer” the variant, the larger the risk effect it is to a phenotype. The $q_i$ in the original paper was based on observed control sample, which might result in inflated type I error9). Implementation of the WSS statistic here uses the same definition for $q_i$ but the Mann-Whitney U test for rank (definition and C++ implementation for this program) now relies on a full permutation procedure rather than normal approximation, such that the bias is correctly accounted for.

As with the Varible Thresholds strategy, the idea of weighting can be applied to many other rare variant methods. The multivariate version (WeightedBurdenBt and WeightedBurdenQt methods) implements the Madsen & Browning weighting based on controls (or samples with low quantitative phenotypic values) or the entire population, and tests for association for both case control and quantitative traits with/without presence of phenotype co-variates.

#### Data-adaptive sum test for protective and deleterious variants

The data-adaptive sum test (ASUM) by Han and Pan 201010) is the first method that took into consideration the difference in direction of effects (protective or deleterious) of rare variants in the same genetic region analyzed by a rare variant association test. It takes a two-stage approach. In the first stage the effect size of each rare variant is evaluated in a multivariate regression analysis, identifying the variants having significant “protective” effects, i.e., variants with a negative log odds ratio associated with a $p$ value smaller than $0.1$. In the second stage, variants are collapsed across the genetic region similar to Morris and Zeggini 201011) but with the coding for protective variants flipped. The test statistic is a score test for logistic regression for case control data.

The implementation of stage 1 in this program differs from the original paper. Instead of evaluating the effect size for each variant, it evaluates the difference in MAF of each variant between case and controls via an exact test to determine which variants are to be re-coded in stage 2. The same $p<0.1$ criteria is used for stage 1, but in effect is more stringent than the original criteria for a multivariate logistic regression analysis.

#### $C(\alpha)$ test for protective variants

The $C(\alpha)$ test (Neale et al 201112)) for disease traits tests for the hypothesis of rare variants disease association under the particular assumption that rare variants observed in cases and controls is a mixture of phenotypically deleterious, protective and neutral variants. Instead of using a cumulative dosage (or “burden”) based summary statistic over a gene region, it directly contrasts the observed and expected distribution of minor alleles in cases and controls at each locus as an evidence of “unusual distribution”, and combine evidences from multiple loci (whether it be an evidence of protective or deleterious) to formulate the $C(\alpha)$ statistic: $T=\sum_{i=1}^m[(y_i-n_ip_0)^2-n_ip_0(1-p_0)]$. $C(\alpha)$ is a special case of a more general class of variance based analysis, i.e., the SKAT test.

The original paper evaluates p-value of the test under large sample normal assumption, which usually would not hold for the real world data. Implementation here also allows permutation based $C(\alpha)$ test, via the -p/--permutations switch.

### Multivariate Association Analysis

The single gene analysis methods are implemented in the form as they were originally published with slight modifications or optimizations. They are good for analyzing gene and case control phenotype associations without adjusting for covariates. For conditional association analysis the multivariate methods needs to be performed. The multivariate methods generalizes the scope of rare variant association tests to being able to handle both quantitative and case control data with flexible weighting and grouping approaches.

Multivariate based association methods are mostly in either a logistic regression or linear regression framework. In SEQPower the simulated data does not have additional covariates to be added to a regression model. However these multivariate methods can be used without covariates, in the same fashion as single gene methods. Additionally asymptotic p-value is available. For power and sample size evaluations at very small significant levels ($\alpha=2.5\times10^{-6}$) such asymptotic p-values have to be used instead of permutation based solutions, although for small sample size the tests can be inflated.

#### Aggregation methods for disease and quantitative traits

This is implementation of the fixed threshold aggregation methods for disease and quantitative traits. Originally described in Morris and Zeggni 201013) and known as Gene- or Region-based Analysis of Variants of Intermediate and Low frequency (GRANVIL), the Aggregation method for rare variants codes observed genotype of a genetic region the count of minor alleles weighted by the total number of variants in the loci: $X = \frac{\sum_i^N X_i}{N}$

We implement the aggregation methods without the weight $N$, because this denominator has undesirable properties when there are missing data. The coding is incorporated in a logistic regression framework for disease traits analysis (the BurdenBt method), and a linear regression framework for quantitative traits analysis (the BurdenQt method). $p$ value for aggregation method is based on asymptotic normal distribution of the Wald statistic in generalized linear models. One could incorporate a number of phenotype covariates in collapsing tests and evaluate the significance of the genetics component.

#### Collapsing methods for disease and quantitative traits

This is implementation of the fixed threshold collapsing methods for both disease and quantitative traits. Collapsing method for rare variants treats a genetic region as a test unit; based on observed genotype it assigns a numeric coding to the region $X$:$X = I_{(0,N)}(\sum_i^N X_i)$ i.e., the observed genotype will be coded as $1$ if there exists at least one mutation, and $0$ otherwise. This coding theme has been used in Li and Leal 200814) and Bhatia et al 201015).

Advantages in using collapsing methods instead of aggregation methods is in its robustness to LD of multiple rare variants in the region under investigation, which would potentially inflate type I error. However under additive assumptions of genetic effects, collapsing methods can be less powerful than aggregation methods.

We implement the collapsing coding in a logistic regression framework for disease traits analysis (case control data) (the CollapseBt method), and a linear regression framework for quantitative traits analysis (the CollapseQt method). $p$ value for collapsing method is based on asymptotic normal distribution of the Wald statistic in generalized linear models. One could incorporate a number of phenotype covariates in collapsing tests and evaluate the significance of the genetics component.

#### Multivariate Variable Threshold methods for disease and quantitative traits

This implements the variable threshold version of aggregation methods. Similar to the VT method, the VariableThresholdsBt and VariableThresholdsQt methods use a variable threshold definition for the rare variants being considered such that multiple test statistics are calculated for an aggregation unit. The final statistic is taken as the one that gives the best result. Type I error is controlled due to the use of permutation testing.

Results from variable threshold method have one additional column in the output of the program, i.e., an MAF column reporting the statistic of VT at which it is derived.

#### Weighted burden test for disease and quantitative traits

This implements a collection of weighted aggregation tests. Different from plain aggregation methods which assumes equal contribution of each locus from the genetic region under investigation, the weighted methods assigns a “weight” to each variant site such that each site differs from another by the weight they are assigned, and these weights will contribute to the aggregated “burden”, e.g., $X=\sum_i^N\omega_iX_i$ where $\omega_i$'s are the weights. The weights often reflect the relative importance of a variant in terms of its contribution to phenotype.

The weighting approach was first proposed by Madsen and Browning 201016) with the assumption that “rarer” variants tend to be more important (the WSS statistic). This weighting theme is by far the most popular weights and has been adapted into a number of methods emerged later, such as Lin and Tang 201117) and Wu et al 201118). Other weighting themes such as KBAC and RBT weightings have different assumptions but they are also based solely on internal information from data. Price et al 201019) proposed the use of “external” weights, i.e., using functional annotation sources to calculate weight for rare variants. This weighting theme can also be naturally integrated into many rare variants methods.

Implementation of WeightedBurdenBt and WeightedBurdenQt are similar to aggregation methods but allows the use of the following weighting themes:

• WSS weight, based on entire sample
• WSS weight, based on controls or sample with above/below average phenotype values
• RBT weight
• KBAC weight
• External weights from annotation

Permutation methods have to be used to obtain $p$ value for WSS (control based), KBAC and RBT weighting themes.

#### SKAT and SKAT-O methods

The SNP-set (Sequence) Kernel Association Test (SKAT, Wu et al 201120)) and optimal SKAT (Lee et al 2012a21), Lee et al 2012b22)) represents a popular class of second-moment statistics of gene based rare variant tests. SKAT test is more powerful when a large fraction of the variants in a region are noncausal or the effects of causal variants are in different directions. The optimal SKAT test combines information from first-moment statistic (e.g., burden tests) that derive an optimal test within this class that maximizes power, for both large sample and small sample applications. A number of kernel options are available in SKAT for applications under various model assumptions.

We integrate the R implementation of SKAT/SKAT-O and renders a command interface of SKAT method to configure all parameters in its R implementation. Multiple R instances can be executed in parallel to efficiently perform thousands of tests in a typical whole exome association scan.

#### Estimated Regression Coefficients (EREC) test

The EREC method (Tang and Lin 201123)) is a general framework of association analysis using score statistics. Much like the VAT ensemble (Wang et al 2013, to appear) the generic coding of the rare variants is represented by a weighted linear combination of genotypes and analyzed through appropriate regression models, yet the selection of weighting themes are different from the VAT ensemble and the association testing is based on score statistics.

#### Testing for Rare Variant Associations in the Presence of Missing Data

We have shown in our recent paper (Auer, Wang and Leal 201324)) that when variant calls are missing, naive implementation of rare variant association (RVA) methods may lead to inﬂated type I error rates as well as a reduction in power. Extensions for four commonly used RVA tests were developed to overcome these problems: the CMC-M, BRV-M, WSS-M and VT-M. We implement these adjustment for missing data as a switch --NA_adjust for the CMC, BRV, WSS and VT methods. We recommend the use of these extended RVA tests whenever applicable, due to their desirable property in properly control type I error in the presence of differential missing data without sacriﬁcing power compared to their original counterparts.

Note

In SEQPower simulation for differential missingness in cases/ctrls is not yet available. However one could use this method to evaluate the impact of missing data and the effect of the statistical remedy proposed in Auer 2013.
1) , 5) , 14) Bingshan Li and Suzanne M. Leal (2008). Methods for Detecting Associations with Rare Variants for Common Diseases: Application to Analysis of Sequence Data. The American Journal of Human Genetics
2) Dajiang J. Liu and Suzanne M. Leal (2010). A Novel Adaptive Method for the Analysis of Next-Generation Sequencing Data to Detect Complex Trait Associations with Rare Variants Due to Gene Main Effects and Interactions. PLoS Genetics
3) , 6) Iuliana Ionita-Laza, Joseph D. Buxbaum, Nan M. Laird and Christoph Lange (2011). A New Testing Strategy to Identify Rare Variants with Either Risk or Protective Effect on Disease. PLoS Genetics
4) , 15) Gaurav Bhatia, Vikas Bansal, Olivier Harismendy, Nicholas J. Schork, Eric J. Topol, Kelly Frazer and Vineet Bafna (2010). A Covering Method for Detecting Genetic Associations between Rare Variants and Common Phenotypes. PLoS Comput Biol
7) , 19) Alkes L. Price, Gregory V. Kryukov, Paul I.W. de Bakker, Shaun M. Purcell, Jeff Staples, Lee-Jen Wei and Shamil R. Sunyaev (2010). Pooled Association Tests for Rare Variants in Exon-Resequencing Studies. The American Journal of Human Genetics
8) , 16) Bo Eskerod Madsen and Sharon R. Browning (2009). A Groupwise Association Test for Rare Mutations Using a Weighted Sum Statistic. PLoS Genetics
9) Mathieu Lemire (2011). Defining rare variants by their frequencies in controls may increase type I error. Nature Genetics
10) Fang Han and Wei Pan (2010). A Data-Adaptive Sum Test for Disease Association with Multiple Common or Rare Variants. Human Heredity
11) , 13) Andrew P. Morris and Eleftheria Zeggini (2009). An evaluation of statistical approaches to rare variant analysis in genetic association studies. Genetic Epidemiology
12) Benjamin M. Neale, Manuel A. Rivas, Benjamin F. Voight, David Altshuler, Bernie Devlin, Marju Orho-Melander, Sekar Kathiresan, Shaun M. Purcell, Kathryn Roeder and Mark J. Daly (2011). Testing for an Unusual Distribution of Rare Variants. PLoS Genetics
17) , 23) Dan-Yu Lin and Zheng-Zheng Tang (2011). A General Framework for Detecting Disease Associations with Rare Variants in Sequencing Studies. The American Journal of Human Genetics
18) , 20) Michael C. Wu, Seunggeun Lee, Tianxi Cai, Yun Li, Michael Boehnke and Xihong Lin (2011). Rare-Variant Association Testing for Sequencing Data with the Sequence Kernel Association Test. The American Journal of Human Genetics
21) S. Lee, M. C. Wu and X. Lin (2012). Optimal tests for rare variant effects in sequencing association studies. Biostatistics
22) Seunggeun Lee, Mary J. Emond, Michael J. Bamshad, Kathleen C. Barnes, Mark J. Rieder, Deborah A. Nickerson, David C. Christiani, Mark M. Wurfel and Xihong Lin (2012). Optimal Unified Approach for Rare-Variant Association Testing with Application to Small-Sample Case-Control Whole-Exome Sequencing Studies. The American Journal of Human Genetics
24) Paul L. Auer, Gao Wang and Suzanne M. Leal (2013). Testing for Rare Variant Associations in the Presence of Missing Data. Genetic Epidemiology