User Tools


Differences

This shows you the differences between two versions of the page.

Link to this comparison view

vat-r [2016/04/13 18:02] (current)
Line 1: Line 1:
 +===== SEQPower R Interface =====
 +In addition to making sample size and power calculations,​ SEQPower can be used to validate and evaluate novel association methods proposed and implemented by researchers. We incorporate the mechanism to load R script from ''​variant association tools (VAT)''​. ​ Under this mechanism, users write up an ''​R''​ function that analyzes data of an association testing unit (e.g., gene) to perform analysis on simulated association data from SEQPower. A growing collection of examples of ''​RTest''​ via VAT are being published online, including example incorporation of the ''​RAssociation''​ package. The documentation below is adapted from the VAT manual.
 +
 +==== Format ====
 +You should have one main function in the R program named the same as the R script file name. This is the interface function that interacts with the ''​RTest''​ command, taking input parameters from command line and return output in specified format (see below for details) that can be recognized by ''​RTest''​ and be stored in databases. This main function can call any other R objects as long as they are available from R or implemented elsewhere in your R program.
 +
 +== Input parameters ==
 +The main function should be defined in the following format
 +
 +<hidden initialState="​visible"​ -noprint>​
 +<code rsplus download-source.R>​
 +ScriptName <- function(dataname,​ args, **kwargs) {
 +...
 +}
 +</​code>​
 +</​hidden>​\\
 +where the first argument has to be the data object variable name (e.g. ''​dat''​ in the ''​regression.R''​ example, or any other valid names you specify), followed by a few required positional arguments (e.g. ''​phenotype.name''​ in ''​regression.R''​ example which has to be passed from commandline every time the program is executed), and other keyword arguments that have default values (e.g. ''​family''​ in the ''​regression.R''​ example. If not specified from the command line it will use default value "​gaussian"​). The required and optional arguments can be assigned from the commandline (e.g., the ''​%%--%%phenotype.name '​stroke'''​ argument of ''​vtools associate''​).
 +
 +== Output configuration ==
 +The return object of the main R function should be a list with the properties of each element in the list been pre-specified as comment strings at the beginning of the scripts taking the following format
 +
 +<hidden initialState="​visible"​ -noprint>​
 +  # BEGINCONF
 +  # [attribute1]
 +  # name=
 +  # type=
 +  # comment=
 +  # [attribute2]
 +  # n=N
 +  # columns=
 +  # name=
 +  # column_name=
 +  # type=
 +  # comment=
 +  # ENDCONF
 +</​hidden>​\\
 +and the return R list object is
 +
 +<hidden initialState="​visible"​ -noprint>​
 +  list(attribute1=...,​ attribute2=...,​ ...)
 +</​hidden>​\\
 + *  The configuration area are R comments, starts with **BEGINCONF** and ends with **ENDCONF**.
 + *  Each section name corresponds to an attribute in the return R list of the main function. Values of these attributes can be R **numeric, string, vector, matrix or data.frame. Other R data type are not allowed**.
 + *  For attributes that are single values, e.g., numeric or string, two properties can be specified: a **name** property (default to the same name as the R attribute'​s name), a**type** property (default to "​float"​) and a **comment** property (default to empty string). You can set them to different values than default, or use the default value by leaving the attribute empty (e.g. the ''​[sample.size]''​ attribute in the ''​regression.R''​ example)
 + *  For attributes that are R vectors or two dimensional (2D) data.frame or matrices, the property **n**, indicating the number of elements in the vector or number of rows in 2D objects, have to be specified. **If n is not specified the attribute will be treated as a single numeric value**, leading to truncated result output. Besides **n**, all other properties have default values same as with single value attributes which can be re-set by specifying them in the configuration. For vector objects the **name** property, if set, should be comma separated and have number of elements equals **n**. The **type** of all elements in a vector should be the same, so there is only one **type** value to be set (e.g. ''​type=float''​ for all the **n** elements). For 2D objects, an additional **required** property **p** has to be set to specify how many columns are there in the attribute. The **type** property now would have to have **p** elements separated by comma, specifying the type of each column (e.g., ''​type=float,​ string''​ for the 1st column being float and 2nd column being character string). The **name** property is now the name of the rows, and an additional optional **column_name** property can be specified for column names. The **comment** property, if set, should be a sentence that briefly describes the entire section, not specific to certain row or column.
 +
 +It is important that the return R object matches the descriptions in the configuration area. **If the configuration area is not found in the R script, no output will be written to result databases**. This is allowed because there are usage cases that does not need any output, e.g., you write an R program to plot some graphical summary of the association testing unit rather than performing association analysis and calculate p-values.
 +
 +== Data structure ==
 +Data from ''​vtools associate''​ are passed into the main R function taking a variable name defined by the first argument of the function. For example if the first argument name is ''​dat''​ then you should manipulate the R variable ''​dat''​ in your R program. The ''​dat''​ object contains 3 default attributes and two optional attributes.
 +
 +Default attributes
 +
 + *  R variable ''​dat@name''​ is a **single character string** which is the association testing group name of the data set. For example if the command is ''​vtools associate ... -m ' ... ' %%--%%group_by refGene.name2'',​ then the group name will be refseq gene names. If no ''​%%--%%group_by''​ option is used the group name will be ''​chromosome:​position''​ of a variant.
 + *  R variable ''​dat@X''​ is a **data.frame** with rows being samples (row names are sample names) and columns being variants (column names are ''​chromosome:​position''​ of variants). The samples match the ''​vtools associate ... %%--%%samples''​ specification,​ and the variants are the ones in the ''​variant_table''​ specified by the association command ''​vtools associate variant_table ...''​
 + *  R variable ''​dat@Y''​ is a **data.frame** with rows being samples (row names are sample names) and columns being phenotypes and covariates (column names are phenotype name and phenotype covariates names). The phenotype and covariates correspond to the trait name and covariate names in command ''​vtools associate variant_table trait %%--%%covariates ...''​. The trait will be the last column of the ''​dat@Y''​ object, although you can also pass the names of these phenotype/​covariates to the R function and refer to the columns by their names (e.g., the ''​phenotype.name''​ specification in the ''​regression.R''​ example)
 +
 +Optional attributes
 +
 + *  R variable ''​dat@V''​ is a **data.frame** of variant information corresponding to ''​%%--%%var_info''​ option in ''​vtools associate''​ command. The rows are variants (row names are ''​chromosome:​position''​) and columns are variant information / annotation field names.
 + *  R variable ''​dat@G''​ is a **list of data.frame** of genotype information corresponding to ''​%%--%%geno_info''​ option in ''​vtools associate''​ command. Each attribute in the list is a data.frame with rows being samples and columns being the genotype information of each genotype call in a sample, e.g., the genotype quality, imputation scores, etc.
 +
 +=== Cautions ===
 +This R interfacing mechanism is flexible, and fragile at the same time, because the ''​RTest''​ method of ''​vtools associate''​ will have no control over what are implemented inside the R program. It assumes the R program the users provide is flawless and can result in exactly the same output as specified in the configuration area. If any errors occurs in the program, ''​RTest''​ will not attempt to fix it. It instead will simply flag an association test as "​failed"​. If you run an genome-wide association scan with your R program via ''​RTest''​ method and noticed all tests failed, its most likely your R program is problematic.
 +
 +== Debug suggestions ==
 +<box 80% round green|**__Tip__**>​
 + *  All error messages will be written to the project log file for you to look into. Take a look at the log file to figure out why failures occur and fix your R code.
 + *  A more advance option is to write out one or two groups of data with the R code to test interactively in R to see what's going wrong. The ''​%%--%%data_cache N''​ option will output N R scripts per thread with data-set coded in it, to ''​cache''​ folder. The file name will be ''​[Association Group Name].dat.R''​. For debug purpose you can add ''​%%--%%data_cache 1''​ to the association command and run the command on a small variant table, find the R data file in ''​cache'',​ load it in R and play with the data-set to make sure your R function works, then remove the ''​%%--%%data_cache''​ option to actually perform association scans.
 +
 +</​box>​
 +==== Example ====
 +[To be updated]