input [2016/04/13 18:02]
input [2016/04/13 18:02] (current)
 +===== SEQPower Input =====
 +==== Data Download ====
 +[[http://bioinformatics.org/spower/download/data|{{icons:download.png?64}}]]
 +
 +We provide simulated site frequency spectrum (SFS) data as well as real-world data:
 +
 + *  Folder ''SRV'' contains simulated SFS and haplotype pool data generated with [[http://bioinformatics.org/spower/srvbatch|''spower simulate'']] using:
 +   *  Boyko et al.((Adam R. Boyko, Scott H. Williamson, Amit R. Indap, Jeremiah D. Degenhardt, Ryan D. Hernandez, Kirk E. Lohmueller, Mark D. Adams, Steffen Schmidt, John J. Sninsky, Shamil R. Sunyaev, Thomas J. White, Rasmus Nielsen, Andrew G. Clark and Carlos D. Bustamante (2008). **Assessing the Evolutionary Impact of Amino Acid Mutations in the Human Genome**. //PLoS Genetics//)) African and European population demographic models with purifying selection
 +   *  Kryukov et al.((G. V. Kryukov, A. Shpunt, J. A. Stamatoyannopoulos and S. R. Sunyaev (2009). **Power of deep, all-exon resequencing for discovery of human trait genes**. //Proceedings of the National Academy of Sciences//)) European population demographic models with purifying selection
 + *  Files ''AfricanAmericanEVS6500.sfs.gz'' and ''EuropeanAmericanEVS6500.sfs.gz'' contain real-world SFS data extracted from the [[http://evs.gs.washington.edu|Exome Variant Server]]. The fourth column is the SIFT score.
 + *  File ''KIT.gdat'' contains haplotype pool data for the //KIT// gene from the 1000 Genomes Project.
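
To get a quick look at one of the gzipped SFS files after downloading it, something like the following Python sketch may help. It assumes the file has been saved locally and that columns are separated by whitespace (the delimiter is not stated on this page); ''head_sfs'' is a hypothetical helper name, not part of SEQPower.

```python
import gzip
from itertools import islice


def head_sfs(path, n=5):
    """Return the first n data rows of a gzipped SFS file as lists of fields.

    Assumption: fields are whitespace-delimited. Blank lines and lines
    starting with '#' are skipped.
    """
    with gzip.open(path, "rt") as handle:
        rows = (
            line.split()
            for line in handle
            if line.strip() and not line.startswith("#")
        )
        return list(islice(rows, n))
```

For example, ''head_sfs("EuropeanAmericanEVS6500.sfs.gz", n=3)'' would return the first three data rows, provided the file sits in the working directory.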
 +
 +==== Site Frequency Spectrum Data ====
 +The site frequency spectrum input data for SEQPower should have four columns, the last of which is optional:
 +
 + *  **Column 1**: Gene / group name. This column defines an association test unit. Variants with the same group name are aggregated in association testing for rare variants. For single-variant analysis (e.g., tests of common variants), each variant should have a distinct group name so that it is analyzed in a separate test.
 + *  **Column 2**: MAF. Minor allele frequency of each variant site.
 + *  **Column 3**: Variant ID. Usually it can simply be the chromosomal position of the variant.
 + *  **Column 4 (optional)**: Annotation score. This defines the functionality of a variant. In simulated data it can be a quantity such as a selection coefficient; in real sequence data it can be an annotation score such as a SIFT or Polyphen2 value. The annotation score is meaningful when there exist cut-offs against which the scores can be compared to classify variants as neutral, protective or deleterious.
 +
 +Lines starting with "#" in the input text are ignored. This allows additional notes or comments to be included in the input SFS data.
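
The column layout above can be sketched as a small Python parser. This is only an illustration, not SEQPower's own reader: it assumes whitespace-separated columns (the delimiter is not stated here), and the names ''Variant'', ''parse_sfs'' and ''group_by_unit'' as well as the example rows are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Variant(NamedTuple):
    group: str              # column 1: gene / group name (test unit)
    maf: float              # column 2: minor allele frequency
    variant_id: str         # column 3: variant ID, e.g. a position
    score: Optional[float]  # column 4: annotation score, if present


def parse_sfs(lines):
    """Parse SFS text lines, skipping blank lines and '#' comments.

    Assumption: columns are whitespace-delimited.
    """
    variants = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.split()
        if len(fields) < 3:
            raise ValueError(f"expected at least 3 columns, got: {line!r}")
        score = float(fields[3]) if len(fields) > 3 else None
        variants.append(Variant(fields[0], float(fields[1]), fields[2], score))
    return variants


def group_by_unit(variants):
    """Collect variants that share a group name into one test unit."""
    groups = defaultdict(list)
    for v in variants:
        groups[v.group].append(v)
    return dict(groups)


# Hypothetical example rows, not taken from the distributed data.
example = """\
# comment lines like this one are ignored
GENE1 0.0005 1001 0.02
GENE1 0.0100 1002 0.85
GENE2 0.2500 2001
""".splitlines()

units = group_by_unit(parse_sfs(example))
for name, members in units.items():
    print(name, len(members))
```

Here the two ''GENE1'' rows end up in one test unit, while the ''GENE2'' row, which has no annotation score, forms its own unit with ''score'' set to ''None''.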
 +
 +==== Haplotype Pool Data ====
 +Haplotype pool data preserves the LD structure and the distribution of singletons, doubletons, etc. found in real-world human haplotypes, and can therefore yield a more realistic power analysis. Haplotype pool data can be generated with the ''spower simulate'' module, and we also provide pre-generated haplotype pools. However, as of August 2013 there are no publicly available exome-wide haplotype pools with a reasonably large sample size for a single population group that are suitable for power analysis. To illustrate the feature we provide ''KIT.gdat'', which contains the variants and haplotypes for the //KIT// gene from the 1000 Genomes Project. We do not recommend using this data set for power analysis, because the sample size is limited and the haplotypes come from more than one population in the 1000 Genomes Project. Please [[http://bioinformatics.org/spower/support|contact the developers]] for assistance if you find a publicly available real-world haplotype pool that you would like to convert to SEQPower input format.