Pomelo II is a new and much improved incarnation of Pomelo, a tool for finding differentially expressed genes, and genes that are of potential interest because they are related to an outcome of interest (e.g., type of cancer, survival).
The main features that differentiate Pomelo II from its ancestor, Pomelo, are:
So as to control multiple testing effects, we provide FDR-adjusted p-values, which we calculate using the approach of Benjamini & Hochberg (1995; J. Royal Statistical Society B, 57:289-300). FWER-correction is no longer provided. We think that control of the FDR (False Discovery Rate) is probably more relevant for most genomic and proteomics research than control of the FWER (Family-Wise Error Rate); in addition, use of the maxT FWER control procedure requires that the subset pivotality assumption holds, which is not the case with some of the tests available in Pomelo.
Finally, even if the assumptions of the Benjamini and Hochberg's FDR procedure are not satistified, the error is often small (Reiner, A., D. Yekutieli, and Y. Benjamini, 2003, "Identifying differentially expressed genes using false discovery rate controlling procedures", Bioinformatics 19, 368-375) and, overall, this method is competitive to other alternatives such as Benjamini & Yekutieli's or permutation-based approaches.
More extensive details on hypothesis testing and multiple testing correction are provided in the help for Pomelo I.
The all-famous t-test. Used to compare an interval variable (one where distances between levels make sense) between two groups. For example, we could compare gene expression data between two types of patients. We use here the test statistic for the case where we do not assume equal variances in the two groups. We compute the usual t-statistic, but p-values are obtained by permutation testing.
Analysis of variance. We compare between more than two groups the value of an interval data, such as the gene expression among five types of cancer. The statistic is your usual ANOVA F-ratio and p-values are obtained using a permutation test.
Linear regression. We try to predict the (interval scaled) values of a dependent variable based on the values of an independent, interval scaled variable. A typical example is predicting the expression levels of a protein using gene expression data. (Thus, note here that the "independent" variable is gene expression). We compute the usual ordinary least-squares regression coefficient and obtain the p-value using a permutation test.
Non permutation method for contingency tables. It obtains the unadjusted p-value using Fisher's exact test. This test would be the one to use if we have, say, different classes of patients (e.g., six types of cancers) and for a set of 1000 markers we can have either presence/absence of each marker in each patient (this would yield 1000 contingency tables of dimensions 6x2 ---each marker by each cancer type), or we can have several levels of (unordered) expression, say four types of expression (which would yield 1000 6x4 tables).
If the markers have more than two levels of expression and these are ordered (say, from less to more) other tests could be more powerful, such as the Cochran-Armitage test (which we have not implemented). Another alternative with categorical response data are logistic and multinomial models (which we have not implemented).
Please note that FisherIxJ is a test that is rarely used, almost never with gene-expression arrays. We implemented FisherIxJ for some people who were using tissue arrays (that yielded pressence/absence values).
A widely used model for survival data. With survival data we often have censored observations (e.g., a patient that is not yet dead, and all we know is that it lived for at least 100 days after initiation of the treatment).
Currently, the Cox model included here works with interval-scaled covariates; if you pass it a categorical covariate, it can ONLY have two possible values; otherwise, you will get meaningless results (it will be treated as an interval scaled covariate).
The Cox model implemented only works with right-censored observations, not left-censored or doubly-censored ones. As well, it is of course your responsibility to make sure that assumptions about censoring (e.g., lifetimes and censoring times are independent, etc) are reasonable. Statistics and p-values are computed in the "usual way"; the statistic we return is the Wald statistic and the p-value is from a Wald test.
The limma t-test is a non-permutation method used to compare gene expression data between two groups. When performing a test on one particular gene, this test borrows information from all the other genes to obtain better estimates of the variance, using an empirical bayes method.
Please note that all three limma-based tests make several assumptions; most of them are fairly reasonable and seem to hold in practice. However, notice that, for the empirical bayes procedure to work, it is assumed that the error variance of all genes comes from the same distribution; this assumption is clearly violated if the expression levels for different genes are the result of merging (averaging) over different number of spots. For instance, if you plan to use the limma-based tests, you do not want to use merge in preP if the number of replicates per gene (or clone) is different among genes (clones).
This test is the same test as the limma t-test, but it is to be used when samples are paired. If samples are paired, you will want to use a paired t-test (or a linear model, as explained below). Suppose we want to perform a t-test to compare tumor vs non-tumor samples from the same subjects. Each of the samples belonging to the same person should be similar, as they are from the same subject and, of course, the samples are naturally paired (each pair tumor/non-tumor comes from the same subject). By using a paired t-test, we take pairing into account decreasing the variance of our estimates (generally, taking pairing into account leads to smaller estimates of the error variance, and thus larger test statistics).
Pairing comes in many forms and flavors. In general, you have pairing whenever there is a feature of your design or sampling scheme that leads to pairs of your two conditions to share some (generally random) characteristic. This shared characteristic will lead to the pairs showing correlation in expression. For instance, in the two samples from the same subject example, each pair of samples shares that they come from the same subject, where the subject characteristics themselves are not something that interest us per se, but will induce a correlation between the two observations in the same subject. Similarly, if we sample families, and we choose pairs of siblings, one with cancer and one without, we have again pairing (brothers or sisters are likely to resemble each other by virtue of them being, well, brothers/sisters). Another example of pairing would arise if we take samples from two cell lines every hour. If there are any circadian patterns in gene expression, then the two samples (one from each cell line) at each hour will show correlated gene expression; thus, the two samples from every hour are paired, and the data ought to be analyzed with a paired t-test.
Please, see caveat about merging.
Non permutation method used to compare gene expression data between two or more groups. Linear models are a more general type of models that include, as particular cases, ANOVA and multiple regression. We use here the functionality provided by limma, and thus return empirical bayes adjusted estimates (i.e., we use information from all genes to obtain better estimates of variances).
Once the test results have been calculated (i.e., after obtaining the F-statistic and associated p-value for the test of global differences between groups), this method gives us the option to individually compare classes (t-test) and draw venn diagrams with the class comparisons.
Please, see caveat about merging.
With permutation tests we obtain the p-value by comparing the observed statistic (e.g., the observed t-statistic) from our data against a null distribution obtained from the actual data themselves (e.g., we randomly permute the class labels, and compute the t-statistic for the permuted data, and we repeat this, for instance, 50000 times, so that we can obtain the null distribution of the statistic under the hypothesis that the class labels were assigned randomly, regardless of gene expression).
In contrast, when using limma we use the standard distributions for computing the p-values (i.e., we compare the observed t-statistic agains the expected t-statistic under the null hypothesis by comparing with the t-distribution with a given degrees of freedom). As explained above, limma does not directly compute the t-statistic with the usual formula, but rather uses a "moderated statistic", where the estimate of the variance for every gene is "moderated" using information from all other genes. This is done using a specific model for the variances with an empirical bayes procedure as explained in Smyth, G. K. (2004). Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology 3, No. 1, Article 3. What you will notice is that extremely small variances are pulled towards the much more frequent larger values and extremely larger variances are pulled towards smaller values. Moderated statistics have been shown repeatedly to work better (in several senses) with microarray data. The differences are most noticeable with small sample sizes.
What Pomelo tries to do is "detect" genes that are different using as a criteria gene expression data difference among two or more classes. Since most likely our data is not homogeneous (homogeneous meaning all subjects of the same age, country, ... ), we may find that the differences due to different classes are "blurred" by other differences. By using additional covariables, we can add this information to our study and thus enable a more complete analysis. (If there is a factor, say age, that accounts for some relevant portion of the variation in the gene expression data, but you don't use it in your analyses, then what you are doing is adding that variation, that is explained by age, to the error term. And, among other consequences, the larger the error term, the smaller the statistic).
Although additional covariables can be very useful, if the covariable you try to add is the same as the class file, pomelo can do nothing with this. Eg. If you are comparing two classes and you add a covariable indicating that all subjects of one class are male and all subjects of the other are female, pomelo has no way of knowing which differences belong to the covariable and which to the class difference. In this case, sex and class are completely confounded. Any differences in expression could be explained either by differences in the class variable you are interested, or by differences in sex.
Additional covariables can only be used with "Anova, linear models (limma)".
The files "Gene expression data" and "Class labels or dependent variable" are required for all tests. Files which are only necessary for certain tests are: "Censored indicator file" (Cox test), "Paired indicator file" (paired t-test) and "Additional covariables file" (optional for anova-limma).
The file for the gene expression should be formated as:
gene1 | 23.4 | 45.6 | NA | 76 | 85.6 |
genW@ | NA | 34 | 23 | NA | 13 |
geneX# | 23 | 25.6 | 29.4 | 13.2 | NA |
gene1 | 23.4 | 45.6 | 76 | 85.6 | |
genW@ | 34 | 23 | 13 | ||
geneX# | 23 | 25.6 | 29.4 | 13.2 |
#name | s1 | s2 | s3 | s4 | s5 |
gene1 | 23.4 | 45.6 | 76 | 85.6 | |
genW@ | 34 | 23 | 13 | ||
geneX# | 23 | 25.6 | 29.4 | 13.2 |
Separate values by tab (\t), and finish the file with a carriage return or newline. No missing values are allowed here. Class labels can be anything you wish; they can be integers, they can be words, whatever. Of course, if you are using regression or Cox model, your dependent variable must be a number. And if you are using a t-test there can only be two classes.
There should be no empty spaces or trailing tabs at the end of the file. And we read only a single line, so place all labels in a single line.
This is a simple example of class labels file:
CL1 | CL2 | CL1 | CL4 | CL2 | CL2 | CL1 | CL4 |
The class labels, as above, can be any arbitrary coding. The values in the gene expression data file should be consecutive integers that start at 0 (i.e., do NOT use negative intergers, or you'll get errors). This is an example file:
#gene | c1 | c1 | c2 | c2 | c3 | c3 |
gene1 | 0 | 0 | 1 | 1 | 1 | 0 |
gene2 | NA | 2 | NA | 1 | 0 | 0 |
gene3 | NA | 1 | 2 | NA | 0 | 0 |
gene4 | 1 | 1 | 2 | 2 | 2 | 0 |
gene5 | 1 | 0 | 0 | 2 | 1 | 2 |
gene6 | 2 | 1 | 1 | 2 | 0 | 0 |
gene7 | 2 | 1 | 1 | 0 | 2 | 0 |
gene8 | 2 | 2 | 0 | 0 | NA | NA |
As you can see, most of these rows would yield (if we used the first row as the class labels) 3x2 tables or 3x3 tables (e.g., 3rd or 5th rows).
The format should be the same as for class labels, except that we will place the same integer at the places where the two paired subjects are. We can choose any integer we want.
This is a simple example of a paired indicator file:
4 | 32 | 4 | 8 | 32 | 8 | 22 | 22 |
In the example file, the first and the third subject are paired (integer used 4, although we could have used any other), the second subject is associated to the fifth (integer used 32) and so on.
Each row of the file represents a subject and each column an additional covariable. The format must be as follows: tab (\t) separated fields, a first row with the names of the columns (covariables) and the following rows with each subject's covariable value.
This is a simple example of a additional covariables file:
Age | Country | Weight |
20 | Spain | 76 |
46 | Germany | 100 |
33 | England | 88 |
59 | Spain | 79 |
61 | England | 65 |
24 | Spain | 58 |
80 | Germany | 90 |
59 | England | 72 |
The order of the subjects must be exactly the same as the order in the expression data file. That means, the covariable values (age, country,...) of the first subject (i.e. row), must belong to the subject in the first column of the expression, etc.
The data can be of two types: numeric (age, weight, ...) or non-numeric (country, ...). When the data is read, you will be taken to a page where you will be able to see if it has been read correctly. Be careful, if your data does not have the correct format, it can be incorrectly read, meaning a non-numeric variable can be read as numeric or vice versa. If this happens all further analyses will be wrong!
Once the program has finished running, you will get a table with the output from the run and a heatmap.
The results table contains a header indicating the test you have used, number of permutations and which covariables where used (if any).
The table shows an index corresponding to the original ordering in the data file, gene names, p-values (undajusted), FDR-adjusted p-values, and statistics (and the absolute value of the statistic). For Fisher's IxJ tests the columns names statistic and abs(statistic) really have no meaning.
The figure is a heatmap where you can filter how and which genes to plot. For now the color scale goes from yellow to blue (green is the mix of yellow and blue); missing values are shown as white. Both tables and heatmap are clickable and will take you to a page with additional information (our IDConverter Light, based on IDConverter).
If you have run an "Anova, linear models (limma)" test, the output will also contain a Class compare section containing a button. By clicking on the button we will be taken to a class compare page, where we will be able to compare classes individually.
The observed test statistics are:
The code underlying these tests is written in C++ and R. The C++ code was originally written by Ramón Díaz-Uriarte and has been parallelized with LAM/MPI by Edward Morrissey. All of the multiple testing functions have been written from scratch, although some algorithms have been based on Westfall & Young (1993) or the documentation for the "multtest" package for R, some inspiration has been obtained from the above package (and a lot of testing has been done using multtest as a benchmark). Moreover, we have taken C code from R for Fisher's exact tests (fexact.c; the latter is based on Mehta and Patel's algorithm). The code for Cox model is from R, library survival.
All three limma tests have been written using limma R package [Smyth, G. K. (2005). Limma: linear models for microarray data. In: 'Bioinformatics and Computational Biology Solutions using R and Bioconductor'. R. Gentleman, V. Carey, S. Dudoit, R. Irizarry, W. Huber (eds), Springer, New York, pages 397--420].
We have also used the GNU scientific library, GSL.
The CGI has been written by Edward Morrissey and Ramón Díaz-Uriarte using Python. The heatmaps use R and the R packages GDD by Simon Urbanek and imagemap by Barry Rowlingson.
The code from R and the survival, limma, GDD, and imagemap packages, and the GSL are all free software released under GNU's General Public License. Our C++ is thus also released under the GPL. The CGI code is released under the Affero GPL.
All the code (including revision history) is available from The Launchpad
.This application is running on a cluster of machines using Debian GNU/Linux as operating system, Apache as web server, Linux Virtual Server for web server load-balancing, with heartbeat and drbd for high-availability of both services and storage, and LAM/MPI for parallelization.
Funding partially provided by Project TIC2003-09331-C02-02 of the Spanish Ministry of Education and Science. This application is running on a cluster of machines purchased with funds from the RTICCC.
If you use Pomelo II for any publication, we would appreciate if you could let us know and that you cite our program (you know, "credit where credit is due"). For now, you can give the main web site: http://pomelo2.bioinfo.cnio.es).
We appreciate if you give us feedback concerning bugs, errors or misconfigurations. Complaints or suggestions are welcome.
Uploaded data set are saved in temporary directories in the server. These data are accessible from the internet until they are erased (five days now). The same as you do, anybody can access those directories, if they known their names, which are made of 20 random digits, so they should not be easy to guess.
In any case, you should keep in mind that communications between the client (your computer) and the server are not encripted at all, thus it is also possible for somebody else to look at your data while you are uploading or dowloading them.
Again, all responsibility for access to your data is yours. We do not offer any warranties that your data will be kept confidential or that nobody will break into our system. We suggest you rename the rows, columns, and other identifiers, so that they are meaningless to anybody but you.
(The manner we specify for attribution is that you write out our names, that you link to our web pages, and that you link to this page itself. Please, give credit where credit is due).
Send comments to the webmaster. Last rev. 22-Dec-2008 |