Table of Contents

A Quick Start Tutorial

Purpose

Evidences suggest that rare variants in the human genome might have strong impact on the risk of complex diseases. An essential first step in designing genetic association mapping studies is to assess the sample size needed to achieve sufficient statistical power and to choose appropriate statistical methods for association testing. SEQPower employs sophisticated modeling of human genome sequences and complex diseases, and rapidly conducts customized power analysis using recently developed rare variants association methods. This tutorial demonstrates some basic examples using SEQPower to carry out power and sample size evaluations for the gene-based rare variant association study design.

Methodology

SEQPower provides a number of study designs and association analysis methods. The scope of this tutorial limits to evaluation of power for gene-based association tests under three association study designs: case-control, quantitative traits, and extreme quantitative traits. For case-control data, we generate phenotypes based on simulated variants in genes for the European population by modeling the odds ratio for the locus and disease prevalence. For quantitative traits data we generated a normally distributed phenotype using a linear model assuming that the genetics effects are additive. We will demonstrate power analysis using several popular association analysis methods including the Combined Multivariate Collapsing (CMC)[1], Weighted Sum Statistic (WSS)[2], Kernel Based Adaptive Cluster (KBAC)[3], Variable Threshold (VT)[4] and Sequence Kernel Assocation Test (SKAT)[5].

Simulation is involved in the calculation of power. This can be computationally intensive. In this tutorial we use a relatively small sample size of 1,000 for demonstration purposes. You may not need to run all the examples in the tutorial, but rather to read the input and output to get an idea how power evaluation is carried out for rare variants association methods. Once understood, you may compose specific commands that describe the study design in your research and obtain power/sample size estimates. For more examples please visit http://bioinformatics.org/spower/tutorial.

Data Resource

In this tutorial we use genetic sequence data simulated using spower simulate command (in the “Full” version of SEQPower), available as http://bioinformatics.org/spower/download/data/SRV/sfs.tar.gz. The data bundle contains file having information on minor allele frequency (MAF) spectrum, purifying selection coefficients and variant positions, which are calculated from simulated data from the following genetic demographic models:

Genes are simulated for different lengths. As an example We will use Kyrukov's European demographic model to simulate genes of length 1,800bp. Genes using other models and different lengths can be conveniently simulated using spower simulate or other software such as SFSCODE. This tutorial will not cover the use of spower simulate command.

Getting started

Type spower -h to view available models and options:

spower -h
usage: spower [-h] [--version] {LOGIT,PAR,BLNR,LNR,ELNR,show,execute} ...


There are currently 5 model options and two administrative options. To view information for specific model or option, type spower <name> -h, for example:

spower LOGIT -h


Input of a typical SEQPower command is composed of:

The command line options for each model in SEQPower consist of one required positional argument, i.e., the input data, as well as a number of optional arguments. Optional arguments may have short and long syntax which are equivalent. Tables of optional arguments for each model are provided at the end of the tutorial. For clarity, in this tutorial we will use long version syntax for important optional arguments.

Below is the command running the LOGIT model using CMC method, with sample size 1000, and the screen output:

spower LOGIT Kryukov2009European1800.sfs --sample_size 1000 --OR_rare_detrimental 1.5 --method CFisher \
       -r 1000 -j 4 -l 1 -o exercise


Click to display ⇲

Click to hide ⇱

INFO: Loading data from [Kryukov2009European1800.sfs] ...
INFO: 200 units found
INFO: 1 units will be analyzed
R1 : 100% |=======================================================================================| Time: 0:01:13
INFO: Tuning [exercise.SEQPowerDB] ...
INFO: Result saved to [exercise.csv/.loci.csv/.SEQPowerDB]


Organizing the output

The lines starting with INFO are information lines printed on screen. The output data are saved to plain text files *.csv as well as a database file *.SEQPowerDB. You can either browse the text file with a text editor, or more conveniently, use spower show options. To show all available fields in text output,

spower show exercise.csv


Click to display ⇲

Click to hide ⇱

exercise.csv columns
title
name
method
power
power_median
power_std
case_cmaf
case_cmaf_median
case_cmaf_std
cmaf
cmaf_detrimental
cmaf_neutral
cmaf_protective
ctrl_cmaf
ctrl_cmaf_median
ctrl_cmaf_std
...


To extract particular field, for example, the power estimate:

spower show exercise.csv power*


Click to display ⇲

Click to hide ⇱

power_estimates
+-------+--------------+---------------+
| power | power_median |   power_std   |
+-------+--------------+---------------+
| 0.289 |     0.0      | 0.01433453870 |
+-------+--------------+---------------+


Logging and Summary

Optional argument -v controls verbosity levels. The default verbosity level of the program is 2, which will output all INFO and WARNING messages on the screen, as shown above, for each test unit. -v 1 will only show an overall progress bar for all test units. -v 0 will suppress all screen output.

In addition to the result of power analysis in text file and database, two additional files are generated. The *.log file records the command line history for SEQPower and all INFO, WARNING and DEBUG messages during run-time. The *.loci.csv contains summary information of each locus of the unit being analyzed. To name a few examples:

spower show command can be applied to the summary text file, for example:

spower show exercise.loci.csv
spower show exercise.loci.csv maf


Association tests

Use spower show tests to list all available tests, and spower show test <name> to list options for a specific test.

spower show tests
spower show test CFisher


Example 1: Case-control design, LOGIT model for disease status

In this example we will model phenotype-genotype associations by locus specific odds ratios and prevalence of disease. This model is coded as LOGIT in SEQPower. To start with, we revisit the example in the previous section, but this time we use KBAC method, and type the options in a more complete manner.

spower LOGIT Kryukov2009European1800.sfs \
--def_rare 0.01 --def_neutral -0.0001 0.0001 --moi A \
--proportion_detrimental 1 --proportion_protective 0 \
--OR_rare_detrimental 1.5 --OR_common_detrimental 1 --baseline_effect 0.01 \
--sample_size 1000 --p1 0.5 --limit 1 \
--alpha 0.05 \
--method \
"KBAC --name K1 --mafupper 0.01 --maflower 0 --alternative 1 --moi additive --permutations 1000 --adaptive 0.1" \
--replicates 1000 \
--jobs 4 -o exercise


Here we explain a few key parameters:

Result of the above analysis is as follows

Click to display ⇲

Click to hide ⇱

exercise.csv] power for KBAC method
+-------+--------------+---------------+
| power | power_median |   power_std   |
+-------+--------------+---------------+
| 0.347 |     0.0      | 0.01505293991 |
+-------+--------------+---------------+


Variable effect sizes of rare variants

The previous example uses fixed odds ratio \(\gamma=1.5\) for detrimental variants. To use variable odds ratio, e.g. in range [1.2, 3.0], we input the following:

spower LOGIT Kryukov2009European1800.sfs --sample_size 1000 \
       --OR_rare_detrimental 1.2 --ORmax_rare_detrimental 3.0 \
       --method CFisher -r 1000 -j 4 -l 1 -o exercise


Instead of using fixed odds ratio, now the odds ratio will be generated on the fly based on the underlying minor allele frequencies. You will see the actual odds ratios in the resulting *.loci.csv file:

spower show exercise.loci.csv effect*


Click to display ⇲

Click to hide ⇱

variable_effect_sizes
+---------------+
|  effect_size  |
+---------------+
...
|      1.0      |
...
| 1.42859361773 |
...
|      3.0      |
| 2.99121719826 |
| 2.99820319113 |
| 2.99940345946 |
|      1.2      |
...


The effect size for detrimental variants will range from 1.2 to 3.0; effect size of neutral variants is 1.0.

Presence of non-causal variants

Assuming only 80% of the deleterious sites are causal, we have:

spower LOGIT Kryukov2009European1800.sfs --sample_size 1000 \
       --OR_rare_detrimental 1.2 --proportion_detrimental 0.8 \
       --method CFisher -r 1000 -j 4 -l 1 -o exercise


With this option the program will still simulate all the deleterious variant sites, regardless of them being causal or not. However they do not contribute to the etiology of the disease. As a result they will present as noise in data. Result from previous command results in reduced power.

Click to display ⇲

Click to hide ⇱

presence_of_non_causal_variants
+-------+--------------+------------------+
| power | power_median |    power_std     |
+-------+--------------+------------------+
| 0.084 |     0.0      | 0.00877177291088 |
+-------+--------------+------------------+


Exclusion of causal variants

Different from presence of non-causal (inclusion of false positive signals), this option models situation when variants are presented in the population and may or may not have contributed to the phenotype, but were removed from sample dataset for association analysis, due to sequencing failure, low mapping quality, etc. (potential exclusion of true positive signals), for example we set 70% sites missing due to artifacts:

spower LOGIT Kryukov2009European1800.sfs --sample_size 1000 \
       --OR_rare_detrimental 1.2 --missing_sites 0.7 \
       --method CFisher -r 1000 -j 4 -l 1 -o exercise


Click to display ⇲

Click to hide ⇱

impact_of_missing_data
+-------+--------------+------------------+
| power | power_median |    power_std     |
+-------+--------------+------------------+
| 0.062 |     0.0      | 0.00767050971515 |
+-------+--------------+------------------+


You may also model the missing proportion of sites for different variant types separately, using --missing_sites_* options for protective variants, non-causal and deleterious variants. It is also possible to model missingness by MAF, using the --missing_low_maf option to set variants below certain population minor allele frequency as missing, irrespective to its functionality.

Using multiple association methods

It is possible to apply multiple association tests in one command, for comparison purposes. In previous sections we have introduced CMC method and KBAC method. Here we add 3 more tests:

Command below runs 5 tests simultaneously (is computationally intensive!).

spower LOGIT Kryukov2009European1800.sfs --sample_size 1000 \
       --OR_rare_detrimental 1.5  \
       --method "CFisher --alternative 1" "KBAC --permutations 1000 --alternative 1" \
       "WSSRankTest --alternative 1" "VTtest --alternative 1 --permutations 1000" "SKAT disease" \
       -r 1000 -j 4 -l 1 -o exercise


Click to display ⇲

Click to hide ⇱

INFO: Loading data from [Kryukov2009European1800.sfs] ...
INFO: 200 units found
INFO: 1 units will be analyzed
R1 : 100% |=======================================================================================| Time: 0:22:17
INFO: Tuning [exercise.SEQPowerDB] ...
INFO: Result saved to [exercise.csv/.loci.csv/.SEQPowerDB]


spower show exercise.csv method power


Click to display ⇲

Click to hide ⇱

power_comparison_of_different_methods
+-------------+-------+
|    method   | power |
+-------------+-------+
| WSSRankTest | 0.374 |
|    VTtest   | 0.318 |
|   CFisher   | 0.316 |
|     KBAC    | 0.352 |
|     SKAT    | 0.094 |
+-------------+-------+


Note

Variable threshold method might appear under-powered in our simulation studies, because the simulation model clearly favors the other methods (MAF for defining rare variants in simulation is the same as for the variants analyzed). SKAT method will also appear under-powered, since in this simulation only deleterious variants present in data.

Example 2: Quantitative traits design, linear model for QT data

Simulation and analysis of quantitative traits shares some options with the previous example, and we will focus on its unique options. Take the following command for example:

spower LNR Kryukov2009European1800.sfs --sample_size 1000 \
       --meanshift_rare_detrimental 0.2 \
       --method "CollapseQt --alternative 2" \
       -r 1000 -j 4 -l 1 -o exercise


Notice the new options

Click to display ⇲

Click to hide ⇱

QT_analysis
+-------+--------------+-----------------+
| power | power_median |    power_std    |
+-------+--------------+-----------------+
| 0.228 |     0.0      | 0.0132671021704 |
+-------+--------------+-----------------+


Variable effect sizes model

Variable effect sizes model can also be applied to the generation of QT data, for example:

spower LNR Kryukov2009European1800.sfs --sample_size 1000 \
       --meanshift_rare_detrimental 0.2 --meanshiftmax_rare_detrimental 0.5 \
       --method "CollapseQt --alternative 2" \
       -r 1000 -j 4 -l 1 -o exercise


Click to display ⇲

Click to hide ⇱

QT_analysis_variable_effects_model
+-------+--------------+-----------------+
| power | power_median |    power_std    |
+-------+--------------+-----------------+
| 0.541 |     1.0      | 0.0157581407533 |
+-------+--------------+-----------------+


Example 3: Extreme quantitative traits design, the ELNR model

Samples with extreme quantitative traits can be obtained in two ways:

The default ELNR model implements the first theme:

spower ELNR Kryukov2009European1800.sfs --sample_size 1000 \
       --meanshift_rare_detrimental 0.2 --QT_thresholds 0.4 0.6 \
       --method "CollapseQt --alternative 2" \
       -r 1000 -j 4 -l 1 -o exercise


Notice the new options

spower show exercise.csv sample* power


Click to display ⇲

Click to hide ⇱

extreme_quantitative_samples_from_finite_population
+-------+-----------------+----------------------+
| power |    power_std    | sample_size_analyzed |
+-------+-----------------+----------------------+
| 0.207 | 0.0128121426779 |        800.0         |
+-------+-----------------+----------------------+


The ELNR with --p1 option implements the second theme:

spower ELNR Kryukov2009European1800.sfs --sample_size 1000 --p1 0.5 \
       --meanshift_rare_detrimental 0.5 --QT_thresholds 0.4 0.6 \
       --method "CollapseQt --alternative 2" \
       -r 1000 -j 4 -l 1 -o exercise


Click to display ⇲

Click to hide ⇱

extreme_quantitative_samples_from_finite_population
+-------+-----------------+----------------------+
| power |    power_std    | sample_size_analyzed |
+-------+-----------------+----------------------+
| 0.877 | 0.0103860964756 |       1000.0         |
+-------+-----------------+----------------------+


Example 4: Simulation-only mode

Sometimes we may want to simulate data-sets for purposes other than power calculation. The GroupWrite method in SEQPower will output simulated data into bundles containing the following files:

We simulate phenotype data for all units in the input data (remove -l option and use default -r option for 1 replicate), under basic LOGIT model. Simulated data will be generated to folder ExerciseSimulation as specified in GroupWrite method.

spower LOGIT Kryukov2009European1800.sfs --sample_size 1000 --OR_rare_detrimental 1.5 \
--method "GroupWrite ExerciseSimulation" -j 4 -o exercise -v1


Click to display ⇲

Click to hide ⇱

INFO: Loading data from [Kryukov2009European1800.sfs] ...
INFO: 200 units found
scanning: unit 200 - 100% |>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>| Time: 0:00:53
$ ls ExerciseSimulation
R100_geno.txt     R130_mapping.txt  R160_pheno.txt    R191_geno.txt     R3_mapping.txt   R6_pheno.txt
R100_mapping.txt  R130_pheno.txt    R161_geno.txt     R191_mapping.txt  R3_pheno.txt     R70_geno.txt
R100_pheno.txt    R131_geno.txt     R161_mapping.txt  R191_pheno.txt    R40_geno.txt     R70_mapping.txt
R101_geno.txt     R131_mapping.txt  R161_pheno.txt    R192_geno.txt     R40_mapping.txt  R70_pheno.txt
R101_mapping.txt  R131_pheno.txt    R162_geno.txt     R192_mapping.txt  R40_pheno.txt    R71_geno.txt
R101_pheno.txt    R132_geno.txt     R162_mapping.txt  R192_pheno.txt    R41_geno.txt     R71_mapping.txt
R102_geno.txt     R132_mapping.txt  R162_pheno.txt    R193_geno.txt     R41_mapping.txt  R71_pheno.txt
...


Example 5: Power calculation for a range of input parameters

Browse result output from multiple SEQPower commands

So far we have covered a number of SEQPower examples. With -o exercise option in action for all examples, we have saved all results from every command in this tutorial to a database named exercise.SEQPowerDB. We browse the database by spower show exercise.SEQPowerDB which lists all models in the database, spower show exercise.SEQPowerDB <model name> which lists all column names in one model table.

spower show exercise.SEQPowerDB
spower show exercise.SEQPowerDB LOGIT


To select power analysis result of interest, for example

spower show exercise.SEQPowerDB LOGIT method power title --condition "where power between 0.25 and 0.95"


Click to display ⇲

Click to hide ⇱

results_from_exercise.SEQPowerDB
+---------+-------+-------------------------+
|  method | power |          title          |
+---------+-------+-------------------------+
| CFisher | 0.289 | Kryukov2009European1800 |
| CFisher | 0.31  | Kryukov2009European1800 |
|    K1   | 0.347 | Kryukov2009European1800 |
| CFisher | 0.772 | Kryukov2009European1800 |
| CFisher | 0.795 | Kryukov2009European1800 |
+---------+-------+-------------------------+


Although power analysis results of interest are selected from SEQPower database, the output is confusing since we cannot tell from the output under which scenarios the power estimates are obtained.

Run SEQPower commands in shell with unique IDs

Next we use simple shell commands to run power calculation in batch mode, assigning unique identifiers to each calculation (the --title option), and extract output from the result database. We now save everything to a new database called exercise2.SEQPowerDB.

Click to display ⇲

Click to hide ⇱

SEQPower_in_batch_mode.sh
for i in 1 1.5 2 2.5 3 3.5 4; do
spower LOGIT Kryukov2009European1800.sfs --sample_size 1000 --OR_rare_detrimental $i \
       --method "CFisher --name CMC$i" --title FixedOR$i \
       -r 100 -j 4 -l 1 -o exercise2
done


A range of fixed odds ratios from 1 to 4 are evaluated. For a quick demonstration we only use 100 replicates for each calculation. To view the output:

spower show exercise2.SEQPowerDB LOGIT method power title


Click to display ⇲

Click to hide ⇱

results_from_exercise2.SEQPowerDB
+--------+-------+------------+
| method | power |   title    |
+--------+-------+------------+
|  CMC1  |   0   |  FixedOR1  |
| CMC1.5 |  0.28 | FixedOR1.5 |
|  CMC2  |  0.74 |  FixedOR2  |
| CMC2.5 |  0.84 | FixedOR2.5 |
|  CMC3  |  0.9  |  FixedOR3  |
| CMC3.5 |   1   | FixedOR3.5 |
|  CMC4  |   1   |  FixedOR4  |
+--------+-------+------------+


Reference

[1] Li B, Leal SM. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet 2008 83:311-21

[2] Madsen BE, Browning SR. A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet 2009 5: e1000384

[3] Liu DJ, Leal SM. A novel adaptive method for the analysis of next-generation sequencing data to detect complex trait associations with rare variants due to gene main effects and interactions. PLoS Genet 2010 6:e1001156

[4] Price AL, Kryukov GV, de Bakker PI, Purcell SM, Staples J, Wei LJ, Sunyaev SR. Pooled association tests for rare variants in exon-resequencing studies. Am J Hum Genet 20010 86:832-8.

[5] Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet 2011 89:82-93.

[6] Williamson SH, Hernandez R, Fledel-Alon A, Zhu L, Nielsen R, Bustamante CD. Simultaneous inference of selection and population growth from patterns of variation in the human genome. Proc Natl Acad Sci U S A. 2005 102:7882-7

[7] Boyko AR, Williamson SH, Indap AR, Degenhardt JD, Hernandez RD, Lohmueller KE, Adams MD, Schmidt S, Sninsky JJ, Sunyaev SR, White TJ, Nielsen R, Clark AG, Bustamante CD. Assessing the evolutionary impact of amino acid mutations in the human genome. PLoS Genet 2008 4:e1000083

[8] Kryukov GV, Shpunt A, Stamatoyannopoulos JA, Sunyaev SR. Power of deep, all-exon resequencing for discovery of human trait genes. Proc Natl Acad Sci U S A. 2009 106:3871-6