QTL Clustering - MetaQTL - Reference Manual

3.3 QTL Clustering

Here we address the following question: how many “real” QTL do the QTL detected in the different mapping experiments represent: one, two, three, four, ... or as many as were detected across the studies? The meta-analysis of QTL can be viewed as a clustering procedure, and MetaQTL implements two kinds of clustering algorithm for it. Whichever procedure is used, the QTL locations are assumed to be normally distributed around their true locations, with variances that can be derived from the reported CI or r-square values. This unbiased Gaussian approximation follows from the classical asymptotic normality of maximum-likelihood estimators.
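As an illustration of the Gaussian assumption, the sketch below converts a reported confidence interval into a variance. It assumes the CI is a symmetric 95% interval; the helper name `variance_from_ci` is hypothetical, and the exact conversion used by MetaQTL (including the r-square route, which is not shown) is not specified in this section.

```python
# Hypothetical helper, not part of MetaQTL.
# Assumption: the reported CI is a symmetric 95% interval of a Gaussian.
Z_95 = 1.959964  # two-sided 95% quantile of the standard normal distribution


def variance_from_ci(ci_start, ci_end, z=Z_95):
    """Return the variance of a Gaussian whose central interval is [ci_start, ci_end]."""
    if ci_end < ci_start:
        raise ValueError("ci_end must not be smaller than ci_start")
    half_width = (ci_end - ci_start) / 2.0
    sd = half_width / z  # half-width = z * standard deviation
    return sd * sd


if __name__ == "__main__":
    # A QTL reported with a CI from 90 cM to 110 cM.
    print(variance_from_ci(90.0, 110.0))
```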

3.3.1 ClustQTL

3.3.1.1 Method

ClustQTL implements a clustering procedure based on a Gaussian mixture model whose parameters are estimated with an EM algorithm.
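The following is a minimal sketch of such a procedure, not the ClustQTL implementation. It assumes that each QTL keeps its own known standard deviation and that only the mixing proportions and the cluster locations are estimated. The function name and its parameters are hypothetical; `n_starts` and `eps` merely play the roles that the options --emrs and --emeps describe.

```python
import math
import random


def log_normal_pdf(x, mean, sd):
    """Log density of a Gaussian N(mean, sd^2) evaluated at x."""
    return -0.5 * ((x - mean) / sd) ** 2 - math.log(sd) - 0.5 * math.log(2.0 * math.pi)


def em_fixed_variances(positions, sds, k, n_starts=10, eps=1e-8, max_iter=1000, seed=0):
    """EM for a k-component mixture; returns (loglik, weights, means, memberships).

    Assumption: the per-QTL standard deviations are known and held fixed.
    """
    rng = random.Random(seed)
    n = len(positions)
    best = None
    for _start in range(n_starts):
        mus = rng.sample(positions, k)  # random start: k observed positions
        pis = [1.0 / k] * k
        previous = -math.inf
        for _iteration in range(max_iter):
            # E-step: membership probabilities, computed in log space for stability.
            resp = []
            loglik = 0.0
            for x, s in zip(positions, sds):
                logs = [math.log(p) + log_normal_pdf(x, m, s) if p > 0.0 else -math.inf
                        for p, m in zip(pis, mus)]
                top = max(logs)
                norm = top + math.log(sum(math.exp(v - top) for v in logs))
                resp.append([math.exp(v - norm) for v in logs])
                loglik += norm
            # M-step: mixing proportions and inverse-variance weighted means.
            for j in range(k):
                pis[j] = sum(r[j] for r in resp) / n
                weight = sum(r[j] / s ** 2 for r, s in zip(resp, sds))
                if weight > 0.0:
                    mus[j] = sum(r[j] * x / s ** 2
                                 for r, x, s in zip(resp, positions, sds)) / weight
            if abs(loglik - previous) < eps:
                break
            previous = loglik
        # Keep the start that reached the highest log-likelihood.
        if best is None or loglik > best[0]:
            best = (loglik, list(pis), list(mus), resp)
    return best


if __name__ == "__main__":
    loglik, pis, mus, resp = em_fixed_variances(
        [10.0, 11.0, 12.0, 50.0, 51.0, 52.0], [2.0] * 6, k=2)
    print(sorted(mus), pis)
```

Several random starts are used because EM only finds a local maximum of the likelihood; the sketch keeps the run with the highest log-likelihood.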

3.3.1.2 Command Line Options

Option	Usage	Type	Explanation
`-q,--qtlmap`	required	string	The map with the QTL to clusterize (XML format).
`-o,--output`	required	string	The output file stem.
`-t,--tonto`	optional	string	The trait ontology.
`-k,--kmax`	optional	integer	The maximal number of clusters.
`-c,--chr`	optional	string	The name of the chromosome on which to perform the meta-analysis.
`--cimode`	optional	integer	The CI computation mode.
`--cimiss`	optional	integer	The imputation mode for missing CI.
`--emrs`	optional	integer	The number of random starting points for the EM algorithm.
`--emeps`	optional	double	The convergence threshold for the EM algorithm.

The option --cimode controls the mode of computation of the variances of the QTL. There are four modes:

1 : the variances are computed according to the available information: from the CI if defined, otherwise from the r-square value.
2 : the variances are only computed for the QTL locations for which a CI is reported.
3 : the variances are computed using the r-square values.
4 : the variances are obtained by taking the larger of the variance derived from the CI and the variance derived from the r-square value (or whichever of the two is available).

The option --cimiss defines how to deal with QTL for which no variance can be computed. There are two possibilities:

1 : the mean of the estimated variances is attributed to QTL with no variance defined.
2 : the QTL with no variance defined are ignored.
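The rules for --cimode and --cimiss can be sketched as two small functions. This is only an illustration of the documented behaviour, not MetaQTL code: the function names are hypothetical, a missing variance is represented by `None`, and mode 4 is read as "the larger of the available values".

```python
def select_variance(var_ci, var_r2, cimode):
    """Pick a QTL variance following the documented --cimode rules.

    var_ci / var_r2 are the variances derived from the CI and from the
    r-square value, or None when the corresponding information is missing.
    """
    if cimode == 1:  # CI if defined, otherwise r-square
        return var_ci if var_ci is not None else var_r2
    if cimode == 2:  # CI only
        return var_ci
    if cimode == 3:  # r-square only
        return var_r2
    if cimode == 4:  # maximum of what is available (assumed reading of "and/or")
        available = [v for v in (var_ci, var_r2) if v is not None]
        return max(available) if available else None
    raise ValueError("cimode must be 1, 2, 3 or 4")


def apply_cimiss(variances, cimiss):
    """Return (index, variance) pairs after handling missing variances."""
    known = [v for v in variances if v is not None]
    if cimiss == 1:  # impute the mean of the estimated variances
        if not known:
            raise ValueError("no variance available to compute a mean")
        mean = sum(known) / len(known)
        return [(i, mean if v is None else v) for i, v in enumerate(variances)]
    if cimiss == 2:  # ignore QTL with no variance
        return [(i, v) for i, v in enumerate(variances) if v is not None]
    raise ValueError("cimiss must be 1 or 2")


if __name__ == "__main__":
    chosen = [select_variance(4.0, 9.0, 4), select_variance(None, None, 4)]
    print(apply_cimiss(chosen, 1))
```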

3.3.1.3 Output

The output of ClustQTL is divided into 3 plain text files:

<output_stem>_res.txt : this file contains a summary of the clustering results for each linkage group. The file is organized as follows:

Identifier	Value
CR	The name of the linkage group.
TR	The trait name followed by the number of related QTL on the chromosome.
QT	A QTL with its identifier, its name, its position on the chromosome and its estimated standard deviation.
CL	Indicates the beginning of a clustering result. It is followed by the number of QTL involved in the clustering, the number of clusters, the log-likelihood and the complete log-likelihood of this clustering.
CC	The name of a model choice criterion followed by its value.
CP	This tag covers four kinds of entry:

PI : the weights of each cluster (i.e. the mixing proportions in the mixture model).
MU : the QTL location estimates (i.e. the centroids of each cluster).
CI : the 95% confidence intervals of the QTL location estimates.
Z : the QTL cluster membership probabilities: first comes the identifier of the QTL and then the probabilities.

For example,

CR 3
TR FloweringTime 10
QT 0 Lubberstedt_1997_HT_7 106.35 7.9
QT 1 Cardinal_2001_HT_5 90.76 7.91
QT 2 qplht107 150.02 5.28
QT 3 Cardinal_2001_HT_6 51.03 7.26
QT 4 qplht106 107.46 1.3
QT 5 Groh_1998_HT_2 61.03 17.15
QT 6 Bohn_1996_HT_2 66.81 4.26
QT 7 Lubberstedt_1997_HT_6 80.67 3.04
QT 8 qplht105 75.45 4.61
QT 9 Blanc_SDflofch3 148.15 15.05
QT 10 Blanc_FXflofch3 135.27 21.68
CL 10 2 -462.46 -445.55
CC AIC 930.91
CC BIC 935.11
CP MU 88.87 148.91
CP PI 0.73 0.27
CP CI 3.82 3.76
CP Z 0 1 0
CP Z 1 1 0
CP Z 2 0 1
CP Z 3 1 0
CP Z 4 1 0
CP Z 5 1 0
CP Z 6 1 0
CP Z 7 1 0
CP Z 8 1 0
CP Z 9 0 1
CP Z 10 0.1 0.9
...
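A file in this tagged format can be read with a few lines of code. The sketch below is an assumption-laden reader, not a tool shipped with MetaQTL: it assumes one record per line, whitespace-separated fields and names without spaces, and the function name `parse_res` and the dictionary keys are hypothetical.

```python
def parse_res(text):
    """Parse the tagged records of a <output_stem>_res.txt file into nested dicts."""
    groups = []
    trait = None
    clustering = None
    for raw in text.splitlines():
        fields = raw.split()
        if not fields:
            continue
        tag, rest = fields[0], fields[1:]
        if tag == "CR":  # new linkage group
            groups.append({"name": rest[0], "traits": []})
        elif tag == "TR":  # new trait within the current linkage group
            trait = {"name": rest[0], "n_qtl": int(rest[1]), "qtl": [], "clusterings": []}
            groups[-1]["traits"].append(trait)
        elif tag == "QT":  # identifier, name, position, standard deviation
            trait["qtl"].append({"id": int(rest[0]), "name": rest[1],
                                 "position": float(rest[2]), "sd": float(rest[3])})
        elif tag == "CL":  # start of a clustering result
            clustering = {"n_qtl": int(rest[0]), "k": int(rest[1]),
                          "loglik": float(rest[2]), "complete_loglik": float(rest[3]),
                          "criteria": {}, "pi": [], "mu": [], "ci": [], "z": {}}
            trait["clusterings"].append(clustering)
        elif tag == "CC":  # criterion name and value
            clustering["criteria"][rest[0]] = float(rest[1])
        elif tag == "CP":  # PI, MU, CI or Z entry
            kind, values = rest[0], rest[1:]
            if kind == "Z":
                clustering["z"][int(values[0])] = [float(v) for v in values[1:]]
            else:
                clustering[kind.lower()] = [float(v) for v in values]
        # Any other line (for instance an elision marker) is skipped.
    return groups


if __name__ == "__main__":
    sample = "CR 3\nTR FloweringTime 10\nQT 4 qplht106 107.46 1.3\n" \
             "CL 10 2 -462.46 -445.55\nCC AIC 930.91\nCP MU 88.87 148.91\n"
    print(parse_res(sample)[0]["traits"][0]["clusterings"][0]["mu"])
```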

<output_stem>_crit.txt : this file summarizes the values of the model choice criteria. For example,

Chromosome	Trait	K	Criterion	Value	Delta	Weight
3	FT	1	AIC	1969.57	1654.71	0
3	FT	2	AIC	930.91	616.05	0
3	FT	3	AIC	445.55	130.69	0
3	FT	4	AIC	364.46	49.6	0
3	FT	5	AIC	314.86	0	0.51
3	FT	6	AIC	315.54	0.68	0.36
3	FT	7	AIC	317.92	3.06	0.11
3	FT	8	AIC	322.44	7.58	0.01
3	FT	9	AIC	326.44	11.58	0
3	FT	10	AIC	330.44	15.58	0
3	FT	30	AIC	361.77	46.92	0
3	FT	1	BIC	1970.97	1643.5	0
3	FT	2	BIC	935.11	607.64	0
3	FT	3	BIC	452.55	125.08	0
3	FT	4	BIC	374.27	46.8	0
3	FT	5	BIC	327.47	0	0.84
3	FT	6	BIC	330.95	3.48	0.15
3	FT	7	BIC	336.14	8.67	0.01
3	FT	8	BIC	343.46	15.99	0
3	FT	9	BIC	350.26	22.79	0
3	FT	10	BIC	357.06	29.59	0
3	FT	30	BIC	403.81	76.34	0

The first column gives the name of the chromosome, the second the name of the trait, the third the number of clusters and the fourth the name of the criterion. The last three columns give, respectively, the criterion value, its rescaled value (the difference from the smallest value of that criterion) and the “weight of evidence”.
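The manual does not state how the weight of evidence is computed. The Delta and Weight columns of the example are, however, consistent with the usual Akaike-type weights, exp(-Delta/2) normalized over all candidate models. The sketch below computes them under that assumption; the function name is hypothetical.

```python
import math


def criterion_weights(values):
    """Return (deltas, weights) for a list of criterion values (one per model).

    Assumption: weights are Akaike-type weights, exp(-delta / 2) normalized
    to sum to one. This is inferred from the example, not stated in the manual.
    """
    best = min(values)
    deltas = [v - best for v in values]
    raw = [math.exp(-0.5 * d) for d in deltas]
    total = sum(raw)
    return deltas, [r / total for r in raw]


if __name__ == "__main__":
    # AIC values of the example above, for K = 1, 2, ..., 10 and 30.
    aic = [1969.57, 930.91, 445.55, 364.46, 314.86, 315.54,
           317.92, 322.44, 326.44, 330.44, 361.77]
    deltas, weights = criterion_weights(aic)
    for d, w in zip(deltas, weights):
        print(round(d, 2), round(w, 2))
```

Applied to the AIC values of the example, this gives weights of 0.51, 0.36, 0.11 and 0.01 for K = 5, 6, 7 and 8 after rounding, matching the table.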

<output_stem>_model.txt : this file gives the optimal number of QTL locations according to the model choice criteria. The file is organized as a table with 4 columns: the name of the criterion, the name of the chromosome, the name of the trait and the optimal number of QTL. For example,

Criterion	Chromosome	Trait	Model
AIC	3	FT	2
AIC	10	FT	4
AIC	5	FT	4
AIC	7	FT	5
AIC	2	FT	4
AIC	9	FT	3
AIC	4	FT	3
AIC	8	FT	5
AIC	6	FT	3
AIC	1	FT	5
BIC	3	FT	2
BIC	10	FT	3
BIC	5	FT	4
BIC	7	FT	5
BIC	2	FT	4
BIC	9	FT	3
BIC	4	FT	3
BIC	8	FT	5
BIC	6	FT	3
BIC	1	FT	5

3.3.2 QTLTree

3.3.2.1 Method

Another way to cluster the observed QTL is to use standard hierarchical clustering procedures. QTLTree implements two kinds of hierarchical clustering algorithm:

Average group linkage : once clusters of QTL are formed, they are represented by their mean values, that is, their mean locations, and the inter-cluster distance is defined as the distance between two mean values. In the average group linkage method, two clusters Q1 and Q2 are merged such that, after merging, the average pairwise distance within the newly formed cluster is minimal. Let Q3 denote the cluster formed by merging Q1 and Q2. Then D(Q1,Q2), the distance between clusters Q1 and Q2, is computed as D(Q1,Q2) = Average {d(QTLi,QTLj) : QTLi and QTLj are in cluster Q3}. At each stage of the hierarchical clustering, the clusters Q1 and Q2 for which D(Q1,Q2) is minimal are merged. The distance d used here is the Mahalanobis distance.
Ward's method : Ward (1963) proposed a clustering procedure that seeks to form the partitions Qn, Qn-1, ..., Q1 in a manner that minimizes the loss of information associated with each grouping, and to quantify that loss in a form that is readily interpretable. At each step of the analysis, the union of every possible pair of clusters is considered, and the two clusters whose fusion results in the minimum increase in “information loss” are combined. Usually, information loss is defined in terms of an error sum-of-squares criterion, called the target function. Here the target function is defined as the log-likelihood of a single “actual” QTL underlying the distribution of the observed QTL locations within the cluster.
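The Ward-type procedure can be sketched as follows. This is an illustrative reading of the description above, not the QTLTree implementation, and it is not claimed to reproduce QTLTree's branch lengths. It assumes that each QTL has a known standard deviation, that the single location of a cluster is estimated by the inverse-variance weighted mean, and that the "information loss" of a merge is the resulting drop in log-likelihood. All function names are hypothetical.

```python
import math


def cluster_loglik(members, positions, sds):
    """Log-likelihood of one underlying location for the QTL indexed by members."""
    weights = [1.0 / sds[i] ** 2 for i in members]
    # Maximum-likelihood location: inverse-variance weighted mean.
    mu = sum(w * positions[i] for w, i in zip(weights, members)) / sum(weights)
    return sum(-0.5 * ((positions[i] - mu) / sds[i]) ** 2
               - math.log(sds[i]) - 0.5 * math.log(2.0 * math.pi)
               for i in members)


def ward_like_merges(positions, sds):
    """Agglomerate QTL, merging at each step the pair with the smallest loss.

    Returns the list of merges as (cluster_a, cluster_b, loss) tuples.
    """
    clusters = [[i] for i in range(len(positions))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                separate = (cluster_loglik(clusters[a], positions, sds)
                            + cluster_loglik(clusters[b], positions, sds))
                joint = cluster_loglik(clusters[a] + clusters[b], positions, sds)
                loss = separate - joint  # drop in log-likelihood caused by merging
                if best is None or loss < best[0]:
                    best = (loss, a, b)
        loss, a, b = best
        merges.append((clusters[a], clusters[b], loss))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)] + [merged]
    return merges


if __name__ == "__main__":
    for left, right, loss in ward_like_merges([10.0, 10.5, 50.0], [2.0, 2.0, 2.0]):
        print(left, right, round(loss, 4))
```

The exhaustive pair search is quadratic at every step, which is acceptable for the handful of QTL usually found on one chromosome.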

3.3.2.2 Command Line Options

Option	Usage	Type	Explanation
`-q,--qtlmap`	required	string	The map with the QTL to clusterize (XML format).
`-o,--output`	required	string	The output file.
`-m,--mode`	optional	integer	The clustering mode (default is 2).
`-t,--tonto`	optional	string	The trait ontology.
`--cimode`	optional	integer	The variance computation mode.
`--cimiss`	optional	integer	The imputation mode for missing variances.

The option -m (or --mode) lets the user switch between the two possible clustering algorithms:

1 : Average group linkage.
2 : Ward's method.

The options --cimode and --cimiss work as for ClustQTL.

3.3.2.3 Output

The output of QTLTree consists of a single plain text file. The file is organized as follows:

Identifier	Value
CR	The name of the linkage group.
TR	The name of the trait followed by the number of related QTL on the chromosome.
QT	A QTL involved in the clustering with its identifier, its name, its most probable position on the chromosome and its estimated standard deviation.
HC	The tree obtained by the clustering algorithm in Newick format.

For example,

CR 10
TR FT 16
QT 0 Ribaut_1996_DPS_6 8.02 7
QT 1 Bohn_2000_DPS_12 51.68 4.87
QT 2 Poupard_2001_DPS_13 40.02 3.65
QT 3 Mechin_2001_HT_5 71 4.26
QT 4 Lubberstedt_1997_HT_20 59.14 3.65
QT 5 Groh_1998_HT_7 100.01 12.2
QT 6 qplht127 52.39 5.25
QT 7 Rebai_1997_SD_5 66.51 5.17
QT 8 Blanc_DFflofch10 61.57 2.55
QT 9 Rebai_1997_SD_25 54.5 12.46
QT 10 Rebai_1997_SD_19 62.14 10.64
QT 11 Blanc_FXflofch10 58.17 3.57
QT 12 Ribaut_1996_SD_6 6.78 10.83
QT 13 Rebai_1997_SD_33 59.96 9.73
QT 14 Rebai_1997_SD_12 49.04 14.59
QT 15 Blanc_SFflofch10 53.13 3.32
HC ((0:0.16,12:0.16):87.85,((((((1:0.24,((6:0.06,15:0.06):0.11,9:0.11):0.24):0.4,14:0.4)
HC :7.43,(((4:0.04,13:0.04):0.16,11:0.16):1.11,(8:0.01,10:0.01):1.11):7.43):15.64,
HC (3:2.43,7:2.43):15.64):24.56,5:24.56):40.9,2:40.9):87.85);
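To use the tree in other tools, the HC records first have to be joined into one Newick string. The sketch below assumes that each HC record sits on its own line and that consecutive HC lines are continuations of the same tree, as the repeated tag in the example suggests. Both function names are hypothetical, and the leaf extraction relies on leaf labels being the integer QTL identifiers.

```python
import re


def newick_from_hc_lines(text):
    """Concatenate the payload of all lines tagged HC into one Newick string."""
    parts = []
    for line in text.splitlines():
        fields = line.split(None, 1)
        if len(fields) == 2 and fields[0] == "HC":
            parts.append(fields[1].strip())
    return "".join(parts)


def leaf_ids(newick):
    """Return the QTL identifiers at the leaves, in order of appearance.

    A leaf label directly follows "(" or ","; branch lengths of internal
    nodes follow ")" and are therefore not matched.
    """
    return [int(label) for label in re.findall(r"[(,](\d+):", newick)]


if __name__ == "__main__":
    example = (
        "HC ((0:0.16,12:0.16):87.85,((((((1:0.24,((6:0.06,15:0.06):0.11,9:0.11):0.24):0.4,14:0.4)\n"
        "HC :7.43,(((4:0.04,13:0.04):0.16,11:0.16):1.11,(8:0.01,10:0.01):1.11):7.43):15.64,\n"
        "HC (3:2.43,7:2.43):15.64):24.56,5:24.56):40.9,2:40.9):87.85);\n"
    )
    tree = newick_from_hc_lines(example)
    print(leaf_ids(tree))
```

The resulting string can then be handed to any Newick-aware library or viewer.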