## SEQPower Input

We provide simulated site frequency spectrum (SFS) data as well as real-world data:

• Folder SRV contains simulated SFS and haplotype pool data generated with `spower simulate` using:
  • Boyko et al. 1) African and European population demographic models with purifying selection
  • Kryukov et al. 2) European population demographic models with purifying selection
• Files AfricanAmericanEVS6500.sfs.gz and EuropeanAmericanEVS6500.sfs.gz contain real-world SFS data extracted from the Exome Variant Server. The fourth column is the SIFT score.
• File KIT.gdat contains haplotype pool data for the KIT gene from the 1000 Genomes Project.

### Site Frequency Spectrum Data

The site frequency spectrum input data for SEQPower has three required columns and an optional fourth:

• Column 1: Gene / group name. This column defines an association test unit. Variants with the same group name are aggregated in rare-variant association testing. For single-variant analysis (e.g., tests of common variants), each variant should have a different group name so that it is analyzed in a separate test.
• Column 2: MAF. Minor allele frequency of each variant site.
• Column 3: Variant ID. Usually it can simply be the chromosomal position of the variant.
• Column 4 (optional): Annotation score. This defines the functionality of a variant. In simulated data it can be a quantity such as a selection coefficient; in real sequence data it can be an annotation score such as a SIFT or PolyPhen-2 value. The annotation score is meaningful when cut-offs exist such that neutral, protective and deleterious variants can be defined by comparing their scores with the cut-offs.

In the input text, lines starting with “#” are ignored. This allows additional notes or comments to be included in the input SFS data.
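The column layout and comment rule described above can be illustrated with a short reader. This is only a sketch and not part of SEQPower: the function name `parse_sfs`, the `Variant` record, and the example gene names and values are hypothetical, and the sketch assumes the columns are separated by whitespace, which the description above does not specify.

```python
import io
from collections import defaultdict
from typing import NamedTuple, Optional


class Variant(NamedTuple):
    """One SFS row (hypothetical record type, for illustration only)."""
    group: str               # column 1: gene / group name
    maf: float               # column 2: minor allele frequency
    variant_id: str          # column 3: variant ID, e.g. chromosomal position
    score: Optional[float]   # column 4 (optional): annotation score


def parse_sfs(lines):
    """Group SFS rows by column 1, skipping comment lines.

    Assumes whitespace-delimited columns (an assumption of this sketch).
    """
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        # Lines starting with "#" are notes or comments; blank lines are skipped too.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) not in (3, 4):
            raise ValueError(f"expected 3 or 4 columns, got {len(fields)}: {line!r}")
        score = float(fields[3]) if len(fields) == 4 else None
        groups[fields[0]].append(Variant(fields[0], float(fields[1]), fields[2], score))
    return dict(groups)


# Made-up input: two variants grouped under GENE1, one under GENE2 without a score.
example = io.StringIO("""# hypothetical example data, not from the SEQPower distribution
GENE1 0.0012 1001 0.03
GENE1 0.0004 1002 0.91
GENE2 0.0500 2001
""")
sfs = parse_sfs(example)
for group, variants in sfs.items():
    print(group, len(variants))
```

Because variants sharing a group name end up in the same list, each dictionary entry corresponds to one association test unit as defined by column 1.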

### Haplotype Pool Data

Using haplotype pool data preserves the LD structure and the distribution of singletons, doubletons, etc. found in real-world human haplotypes, and can therefore result in a more realistic power analysis. Haplotype pool data can be generated with the `spower simulate` module, and we also provide pre-generated haplotype pools. However, as of August 2013 there are no publicly available exome-wide haplotype pools with a reasonably large sample size for a single population group that are suitable for power analysis.

To illustrate the feature we provide KIT.gdat, which contains the variants and haplotypes for the KIT gene from the 1000 Genomes Project. We do not recommend using this data set for power analysis, because the sample size is limited and the haplotypes come from more than one population in the 1000 Genomes Project. Please contact the developers for assistance if you find a publicly available real-world haplotype pool that you would like to convert to SEQPower input format.

1) Adam R. Boyko, Scott H. Williamson, Amit R. Indap, Jeremiah D. Degenhardt, Ryan D. Hernandez, Kirk E. Lohmueller, Mark D. Adams, Steffen Schmidt, John J. Sninsky, Shamil R. Sunyaev, Thomas J. White, Rasmus Nielsen, Andrew G. Clark and Carlos D. Bustamante (2008). Assessing the Evolutionary Impact of Amino Acid Mutations in the Human Genome. PLoS Genetics
2) G. V. Kryukov, A. Shpunt, J. A. Stamatoyannopoulos and S. R. Sunyaev (2009). Power of deep, all-exon resequencing for discovery of human trait genes. Proceedings of the National Academy of Sciences