Evidences suggest that rare variants in the human genome might have strong impact on the risk of complex diseases. An essential first step in designing genetic association mapping studies is to assess the sample size needed to achieve sufficient statistical power and to choose appropriate statistical methods for association testing. SEQPower employs sophisticated modeling of human genome sequences and complex diseases, and rapidly conducts customized power analysis using recently developed rare variants association methods. This tutorial demonstrates some basic examples using SEQPower to carry out power and sample size evaluations for the gene-based rare variant association study design.
SEQPower provides a number of study designs and association analysis methods. The scope of this tutorial limits to evaluation of power for gene-based association tests under three association study designs: case-control, quantitative traits, and extreme quantitative traits. For case-control data, we generate phenotypes based on simulated variants in genes for the European population by modeling the odds ratio for the locus and disease prevalence. For quantitative traits data we generated a normally distributed phenotype using a linear model assuming that the genetics effects are additive. We will demonstrate power analysis using several popular association analysis methods including the Combined Multivariate Collapsing (CMC)[1], Weighted Sum Statistic (WSS)[2], Kernel Based Adaptive Cluster (KBAC)[3], Variable Threshold (VT)[4] and Sequence Kernel Assocation Test (SKAT)[5].
Simulation is involved in the calculation of power. This can be computationally intensive. In this tutorial we use a relatively small sample size of 1,000 for demonstration purposes. You may not need to run all the examples in the tutorial, but rather to read the input and output to get an idea how power evaluation is carried out for rare variants association methods. Once understood, you may compose specific commands that describe the study design in your research and obtain power/sample size estimates. For more examples please visit http://bioinformatics.org/spower/tutorial.
In this tutorial we use genetic sequence data simulated using spower simulate
command (in the “Full” version of SEQPower), available as http://bioinformatics.org/spower/download/data/SRV/sfs.tar.gz. The data bundle contains file having information on minor allele frequency (MAF) spectrum, purifying selection coefficients and variant positions, which are calculated from simulated data from the following genetic demographic models:
Genes are simulated for different lengths. As an example We will use Kyrukov's European demographic model to simulate genes of length 1,800bp. Genes using other models and different lengths can be conveniently simulated using spower simulate
or other software such as SFSCODE
. This tutorial will not cover the use of spower simulate
command.
Type spower -h
to view available models and options:
spower -h usage: spower [-h] [--version] {LOGIT,PAR,BLNR,LNR,ELNR,show,execute} ...
There are currently 5 model options and two administrative options. To view information for specific model or option, type spower <name> -h
, for example:
spower LOGIT -h
Input of a typical SEQPower command is composed of:
The command line options for each model in SEQPower consist of one required positional argument, i.e., the input data, as well as a number of optional arguments. Optional arguments may have short and long syntax which are equivalent. Tables of optional arguments for each model are provided at the end of the tutorial. For clarity, in this tutorial we will use long version syntax for important optional arguments.
Below is the command running the LOGIT model using CMC method, with sample size 1000, and the screen output:
spower LOGIT Kryukov2009European1800.sfs --sample_size 1000 --OR_rare_detrimental 1.5 --method CFisher \ -r 1000 -j 4 -l 1 -o exercise
The lines starting with INFO are information lines printed on screen. The output data are saved to plain text files *.csv
as well as a database file *.SEQPowerDB
. You can either browse the text file with a text editor, or more conveniently, use spower show options
. To show all available fields in text output,
spower show exercise.csv
To extract particular field, for example, the power estimate:
spower show exercise.csv power*
Optional argument -v
controls verbosity levels. The default verbosity level of the program is 2, which will output all INFO and WARNING messages on the screen, as shown above, for each test unit. -v 1
will only show an overall progress bar for all test units. -v 0
will suppress all screen output.
In addition to the result of power analysis in text file and database, two additional files are generated. The *.log
file records the command line history for SEQPower and all INFO, WARNING and DEBUG messages during run-time. The *.loci.csv
contains summary information of each locus of the unit being analyzed. To name a few examples:
spower show
command can be applied to the summary text file, for example:
spower show exercise.loci.csv spower show exercise.loci.csv maf
Use spower show tests
to list all available tests, and spower show test <name>
to list options for a specific test.
spower show tests
spower show test CFisher
In this example we will model phenotype-genotype associations by locus specific odds ratios and prevalence of disease. This model is coded as LOGIT
in SEQPower. To start with, we revisit the example in the previous section, but this time we use KBAC method, and type the options in a more complete manner.
spower LOGIT Kryukov2009European1800.sfs \ --def_rare 0.01 --def_neutral -0.0001 0.0001 --moi A \ --proportion_detrimental 1 --proportion_protective 0 \ --OR_rare_detrimental 1.5 --OR_common_detrimental 1 --baseline_effect 0.01 \ --sample_size 1000 --p1 0.5 --limit 1 \ --alpha 0.05 \ --method \ "KBAC --name K1 --mafupper 0.01 --maflower 0 --alternative 1 --moi additive --permutations 1000 --adaptive 0.1" \ --replicates 1000 \ --jobs 4 -o exercise
Here we explain a few key parameters:
--def_rare
The definition of a “rare” variant. We usually define a rare variant as a variant having MAF below 0.01. You may prefer an alternative definition 5%.--def_neutral
The definition of a “neutral” variant by its functional annotation score. In this simulated dataset we use selection coefficient to annotate functionality. We set it [-0.0001, 0.0001] such that variants having selection coefficient in between these values will be considered neutral. The values will be different depending on the nature of the input functional annotation scores.--moi
Mode of inheritance. Most aggregated analysis methods implicitly assumes an “additive” effect of rare variants, thus simulating/analyzing data under additive model represents a best case scenario for many methods. However this does not always have to be the case, and there is also a --moi
option for each association method within the --method
option (see below) that controls for the assumption on MOI when the data are being analyzed.--proportion_detrimental
This parameter allows us to model situations where not all deleterious variants are causal. Here we assume all deleterious variants are causal for the trait of interest. We can use a value smaller than 1.0 if we want to model the impact of non-causal variants.--proportion_protective
This is the proportion of causal protective variants.--OR_rare_detrimental
The odds ratio per rare detrimental variant.--OR_common_detrimental
The odds ratio per common detrimental variant.--baseline_effect
The base-line odds ratio in population. For common complex traits involving rare variants, this is approximately the same as prevalence of disease in population.--sample_size
Total sample size.--p1
Proportion of cases. In case/ctrl study design, 50% cases and 50% controls will yield maximum power.--limit
In the input data Kryukov2009European1800.sfs
there are many replicates from the same forward time simulation setting. Here for demonstration purpose we limit the calculation to only one replicate, although in practice the power analysis result should be based on multiple input data under the same simulation model.--alpha
Significance level for which power will be evaluated.--method
Name of statistical method to apply, see spower show tests
--name
A unique name assigned to a specific method. For the same method you might have different analysis parameters, in which case you may want to specify this name to distinguish between different settings. This is particularly useful when you have multiple configurations of the same method in the same command, for example --method 'KBAC --name K1 … --mafupper 0.01' 'KBAC --name K5 … --mafupper 0.05
'--mafupper 0/--maflower 0.01
We define rare variants within a population haplotype pool using --def_rare
option. However for many of the association test to work, we need to confine the analysis to alleles within certain frequency spectrum in our observed data. These options set the variants we want to focus on in analysis, i.e., we will only analyze variants having observed MAF between 0 and 0.01.--alternative
1 for one-sided test, 2 for two-sided test. For a test of deleterious variants only, it is more powerful to apply one-sided test.--permutations/--adaptive
In the previous section we used Fisher's method for CMC test implementation, which does not require permutations. For permutation based tests the number of permutations needed depends on \(\alpha\) level. For \(\alpha=0.05\), it is sufficient to use 1,000 permutations. For exome studies with \(\alpha=2.5\times 10^{-6}\) (Bonferroni corrected p-value for testing 20,000 genes). Although “adaptive” permutation method is available, it would still require a large number of tests to finish their full permutation cycle in order to complete the power analysis. Therefore in SEQPower permutation based test is not feasible to evaluate power at very small \(\alpha\) level.--replicates
Number of replicates for power / sample size estimate.--jobs
Number of CPUs to be used to run the command. The input of this parameter depends on your computational environment.Result of the above analysis is as follows
The previous example uses fixed odds ratio \(\gamma=1.5\) for detrimental variants. To use variable odds ratio, e.g. in range [1.2, 3.0], we input the following:
spower LOGIT Kryukov2009European1800.sfs --sample_size 1000 \ --OR_rare_detrimental 1.2 --ORmax_rare_detrimental 3.0 \ --method CFisher -r 1000 -j 4 -l 1 -o exercise
Instead of using fixed odds ratio, now the odds ratio will be generated on the fly based on the underlying minor allele frequencies. You will see the actual odds ratios in the resulting *.loci.csv
file:
spower show exercise.loci.csv effect*
The effect size for detrimental variants will range from 1.2 to 3.0; effect size of neutral variants is 1.0.
Assuming only 80% of the deleterious sites are causal, we have:
spower LOGIT Kryukov2009European1800.sfs --sample_size 1000 \ --OR_rare_detrimental 1.2 --proportion_detrimental 0.8 \ --method CFisher -r 1000 -j 4 -l 1 -o exercise
With this option the program will still simulate all the deleterious variant sites, regardless of them being causal or not. However they do not contribute to the etiology of the disease. As a result they will present as noise in data. Result from previous command results in reduced power.
Different from presence of non-causal (inclusion of false positive signals), this option models situation when variants are presented in the population and may or may not have contributed to the phenotype, but were removed from sample dataset for association analysis, due to sequencing failure, low mapping quality, etc. (potential exclusion of true positive signals), for example we set 70% sites missing due to artifacts:
spower LOGIT Kryukov2009European1800.sfs --sample_size 1000 \ --OR_rare_detrimental 1.2 --missing_sites 0.7 \ --method CFisher -r 1000 -j 4 -l 1 -o exercise
You may also model the missing proportion of sites for different variant types separately, using --missing_sites_*
options for protective variants, non-causal and deleterious variants. It is also possible to model missingness by MAF, using the --missing_low_maf
option to set variants below certain population minor allele frequency as missing, irrespective to its functionality.
It is possible to apply multiple association tests in one command, for comparison purposes. In previous sections we have introduced CMC method and KBAC method. Here we add 3 more tests:
--permutation
option. The full version gives a more accurate power estimate but computationally intensive.--maflower/--mafupper
, and will correct for multiple testing naturally in its permutation framework.Command below runs 5 tests simultaneously (is computationally intensive!).
spower LOGIT Kryukov2009European1800.sfs --sample_size 1000 \ --OR_rare_detrimental 1.5 \ --method "CFisher --alternative 1" "KBAC --permutations 1000 --alternative 1" \ "WSSRankTest --alternative 1" "VTtest --alternative 1 --permutations 1000" "SKAT disease" \ -r 1000 -j 4 -l 1 -o exercise
spower show exercise.csv method power
Note
Simulation and analysis of quantitative traits shares some options with the previous example, and we will focus on its unique options. Take the following command for example:
spower LNR Kryukov2009European1800.sfs --sample_size 1000 \ --meanshift_rare_detrimental 0.2 \ --method "CollapseQt --alternative 2" \ -r 1000 -j 4 -l 1 -o exercise
Notice the new options
--meanshift_rare_detrimental
The genetic effect is now modeled by mean-shift in phenotype value due to the genetic factor. The unit for mean-shift is the standard deviation of the simulated trait.--method CollapseQt
The test is now the quantitative trait version of the CMC method, a linear regression Wald's statistic using collapsed genotype scores as regressors. We use a two-sided test here, a fair assumption to quantitative traits, although in the simulation only deleterious variants are modeled.
Variable effect sizes model can also be applied to the generation of QT data, for example:
spower LNR Kryukov2009European1800.sfs --sample_size 1000 \ --meanshift_rare_detrimental 0.2 --meanshiftmax_rare_detrimental 0.5 \ --method "CollapseQt --alternative 2" \ -r 1000 -j 4 -l 1 -o exercise
Samples with extreme quantitative traits can be obtained in two ways:
The default ELNR model implements the first theme:
spower ELNR Kryukov2009European1800.sfs --sample_size 1000 \ --meanshift_rare_detrimental 0.2 --QT_thresholds 0.4 0.6 \ --method "CollapseQt --alternative 2" \ -r 1000 -j 4 -l 1 -o exercise
Notice the new options
--QT_thresholds
Lower and upper cutoff for extreme traits. For this simulation it takes samples having QT values ranking below 40% percentile and samples above 60% percentile from the total 1,000 samples simulated, i.e., we will end up having \(1000 \times 0.4 = 400\) samples of low QT values and \(1000 \times (1-0.6) = 400\) samples having high QT values.spower show exercise.csv sample* power
The ELNR with --p1
option implements the second theme:
spower ELNR Kryukov2009European1800.sfs --sample_size 1000 --p1 0.5 \ --meanshift_rare_detrimental 0.5 --QT_thresholds 0.4 0.6 \ --method "CollapseQt --alternative 2" \ -r 1000 -j 4 -l 1 -o exercise
--p1
With this option, samples will be taken from an infinite population having simulated trait values smaller than \(\Phi^{-1}(0.4)\) or larger than \(\Phi^{-1}(0.6)\) (\(\Phi\) is the cumulative density function for standard normal distribution), until required sample size is achieved, which is \(1000 \times p1\) and \(1000 \times (1-p1)\) for the two groups respectively.
Sometimes we may want to simulate data-sets for purposes other than power calculation. The GroupWrite
method in SEQPower will output simulated data into bundles containing the following files:
We simulate phenotype data for all units in the input data (remove -l
option and use default -r
option for 1 replicate), under basic LOGIT model. Simulated data will be generated to folder ExerciseSimulation
as specified in GroupWrite
method.
spower LOGIT Kryukov2009European1800.sfs --sample_size 1000 --OR_rare_detrimental 1.5 \ --method "GroupWrite ExerciseSimulation" -j 4 -o exercise -v1
So far we have covered a number of SEQPower examples. With -o exercise
option in action for all examples, we have saved all results from every command in this tutorial to a database named exercise.SEQPowerDB
. We browse the database by spower show exercise.SEQPowerDB
which lists all models in the database, spower show exercise.SEQPowerDB <model name>
which lists all column names in one model table.
spower show exercise.SEQPowerDB spower show exercise.SEQPowerDB LOGIT
To select power analysis result of interest, for example
spower show exercise.SEQPowerDB LOGIT method power title --condition "where power between 0.25 and 0.95"
Although power analysis results of interest are selected from SEQPower database, the output is confusing since we cannot tell from the output under which scenarios the power estimates are obtained.
Next we use simple shell commands to run power calculation in batch mode, assigning unique identifiers to each calculation (the --title
option), and extract output from the result database. We now save everything to a new database called exercise2.SEQPowerDB
.
A range of fixed odds ratios from 1 to 4 are evaluated. For a quick demonstration we only use 100 replicates for each calculation. To view the output:
spower show exercise2.SEQPowerDB LOGIT method power title
[1] Li B, Leal SM. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet 2008 83:311-21
[2] Madsen BE, Browning SR. A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet 2009 5: e1000384
[3] Liu DJ, Leal SM. A novel adaptive method for the analysis of next-generation sequencing data to detect complex trait associations with rare variants due to gene main effects and interactions. PLoS Genet 2010 6:e1001156
[4] Price AL, Kryukov GV, de Bakker PI, Purcell SM, Staples J, Wei LJ, Sunyaev SR. Pooled association tests for rare variants in exon-resequencing studies. Am J Hum Genet 20010 86:832-8.
[5] Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet 2011 89:82-93.
[6] Williamson SH, Hernandez R, Fledel-Alon A, Zhu L, Nielsen R, Bustamante CD. Simultaneous inference of selection and population growth from patterns of variation in the human genome. Proc Natl Acad Sci U S A. 2005 102:7882-7
[7] Boyko AR, Williamson SH, Indap AR, Degenhardt JD, Hernandez RD, Lohmueller KE, Adams MD, Schmidt S, Sninsky JJ, Sunyaev SR, White TJ, Nielsen R, Clark AG, Bustamante CD. Assessing the evolutionary impact of amino acid mutations in the human genome. PLoS Genet 2008 4:e1000083
[8] Kryukov GV, Shpunt A, Stamatoyannopoulos JA, Sunyaev SR. Power of deep, all-exon resequencing for discovery of human trait genes. Proc Natl Acad Sci U S A. 2009 106:3871-6