BranchClust

BranchClust Algorithm Home

BranchClust is a clustering algorithm that parses trees in order to delineate families of orthologs within a superfamily containing several paralogous gene families. The underlying idea is that closely related genes are placed on one branch emerging from one node on a tree, so the task of detecting families for n different taxa is simply a task to detect branches containing groups of genes from all, or almost all, species.

1. Selection of superfamilies

Figure 1. Scheme of assembling superfamilies.

First, we start with selection of the so-called superfamilies, i.e. sets of genes containing mixtures of orthologs and paralogs. Assume that we have n complete genomes of different taxa. We combine all genomes from n taxa in one database. From this set of n genomes, we arbitrarily choose one and perform BLAST searches of all genes in the selected genome against a combined genome database. Then we parse the BLAST output in such a way that not only the best hit for each species is selected, but all of the significant hits obtained for a given query.

As a result, each species can contribute to a superfamily through both orthologs and paralogs of the original query. Superfamilies constructed from the paralogs of the query species are combined in one so that no overlapping sets of genes could occur.

The choice of a starting genome could affect the ultimate number and composition of families of orthologous genes, especially, if we relax the criteria and require that at least 80% of different taxa be present in each family. Different species might collect different sets of significant hits. To select all possible orthologs we need to perform BLAST of all n genomes against the database composed of the same n genomes, and then assemble all significant hits for each gene from every genome in superfamilies..

The sequences from obtained superfamilies are aligned and a phylogenetic tree is reconstructed by any of the preferred methods of tree reconstruction.

2. Tree parsing

DEFINITION 1. We will call a branch in a phylogenetic tree complete if it contains representatives from all species, and incomplete otherwise.

DEFINITION 2. We will call a node in a phylogenetic tree complete if a complete branch originates from that node, and incomplete otherwise.

We start the selection from the most distant leaf in a tree (see Choice of the Root), and then descend node by node, adding branches and calculating the total number of different species on the branches, until a node becomes complete. A schematic representation of BranchClust algorithm is depicted in Figure 2.

Figure 2. Schematic representation of BranchClust algorithm.

We start from Node 1 which contains an incomplete Branch 1. Then we descend to Node 2 and add Branch 2 containing some number of different taxa. Here we introduce the notion of “FEW” and “MANY”. Depending on the number of considered species, they signify how many different taxa would be sufficient for a branch to be reported as a separate cluster, or, in other words, which branch could serve as a “stopper” while descending from node to node. The boundary between FEW and MANY can be adjusted (for example in cases where many reduced genomes are included); usually a boundary between “FEW” and “MANY” around 70-80% of the different taxa in question was found to work satisfactorily. For example, in case of 4 taxa, “MANY” would be more than 2 or 3; in case of 13 taxa – 8-10, and in case of 30 – 20-25. “FEW” is just the total number of different taxa minus “MANY”. If Branch 2 contains “FEW” species we continue selection until Node 2 becomes complete. If Branch 2 contains “MANY” species, then we must check how many species are already on Branch 1. If Branch 1 contains “MANY” species, and we also added “MANY” species on Branch 2, then it is most likely that Branch 1 is a separate, though incomplete, cluster. If Branch 1 contains only “FEW” different species, then we add it to the Branch 2, and continue with the selection.

3. Choice of the Root

When a tree containing several clusters (i.e., families of orthologs) is submitted to the BranchClust algorithm it is arbitrarily rooted: it can be rooted inside any cluster or anywhere in between the clusters (see Figure 3). If the tree is rooted inside some cluster, the results of the clustering will be affected by the root position; moreover, the undesirable split inside a cluster containing the root could occur. For example, if the root divides a cluster so that one of the resulting branches will contain more than MANY species, this branch will be wrongly reported as a cluster. However, if a tree is rooted somewhere in between the clusters, the results will not be affected by the root position because a branch with a cluster acts as a “stopper” in the algorithm, and no split inside a cluster will occur.

The process of selection with tree re-rooting is depicted on Figure 3.

Figure 3. BranchClust selection steps for superfamily of five taxa: A, B, C, D and E.

Figure 3A shows the original hypothetical unrooted tree for a set of 5 different taxa A, B, C, D and E. Let us set the parameter MANY to 4 (i.e. the branch containing 4 different taxa will serve as a “stopper”) and let us assume that the tree initially is arbitrarily rooted somewhere inside a cluster – see green point in Figure 3A. Figure 3B shows the original tree rooted at the green point inside a cluster.

We will run the algorithm twice with two different roots, which are chosen as the two nodes most distant from each other in a tree. The process of root selection for two independent runs is shown on Figures 3B-D. The first root, root 1, is the ancestor of the most distant leaf in the initially rooted tree (Figure 3B). The second root, root 2, is the ancestor of the most distant leaf when the tree is rooted with the root 1 (Figure 3C). We run the selection first for the tree rooted with root 1 (Figure 3C) and then with root 2 (Figure 3D).

Figures 3C, E-H show how BranchClust works for the tree rooted with the first root. According to the algorithm, we will start selection from the most distant leaf which is either A1, or B1, or D3, or E3 (Figure 3C). In case of more than two equally distant leaves we choose the first leaf. Then we will descend node by node until we encounter a “stopper” - either a branch containing MANY different species on condition that current branch already contains MANY species, or a single leaf “R” signifying the removed cluster. In our example selection stops at the red point because the next node contains a branch with MANY (here 4) taxa (Figure 3C). The selected cluster is removed from a tree (Figure 6E), the node is marked with leaf “R”, the tree is re-rooted at the point of the cut, and then we repeat the cycle for the re-rooted tree depicted on Figure 3F. Figure 3G shows how the second cluster is selected and removed. There is no need to re-root the tree if the node’s ancestor is already a root itself. Finally we have a tree containing an only cluster (Figure 6H) which is selected at the third run of the cycle.

We repeat selection for the tree rooted with the root 2 (Figure 6D) and compare the results by calculating the number of paralogs resulting in two different runs. The clustering that contains the least number of paralogs is selected. Using two differently rooted trees helps to solve the problem that arises in case of two incomplete clusters. This approach is illustrated by the clustering of penicillin binding proteins’ superfamily for a set of 13 gamma proteobacteria (See Starting Point).

4. Starting Point

The superfamily containing the penicillin-binding proteins consists of 25 members that form two distinct clusters in the tree: one is a branch with 15 leaves and 13 different taxa, making a complete family, and the other cluster is incomplete containing only 12 different species and forming an incomplete family of 12 members. The results of BranchClust algorithm in this case depend on the starting point, or the root of the tree. If we start selection inside Cluster 1, we will select the complete Cluster 1, remove it from the tree and the remaining tree will be the incomplete Cluster 2.

Figure 4. Superfamily of penicillin-binding proteins for 13 gamma proteobacteria.

However, if we start selection inside Cluster 2, we will skip the node containing Cluster 2 and continue selection to make a branch complete. This will result in the following clustering: 23:13, 5:5 meaning that one branch contains 23 leaves with 13 different taxa and the second – 5 leaves with 5 different taxa. The number of paralogs is given by the difference between the number of leaves on a branch and number of different taxa. In the latter case this number would be 23-13 + 5-5=10, and in the first run we have the difference of 15-13+13-12=3. We select that clustering run which yields the minimum number of paralogs.

5. Family and In- and Out-paralogs

Once a cluster is isolated, a family containing one representative from each taxon is selected with identification of inparalogs as duplicated genes inside that cluster. For example, in the case of penicillin-binding proteins’ superfamily (Figure 4), cluster 1 contains two inparalogs, one of Salmonella typhimurium, gi 16765177, and the other is of Pseudomonas aeruginosa, gi 15597468. Selection starts from the most distant leaf on a tree, and the gene copy which is closest to the top of the branch is included into the family, while other copies are reported as inparalogs. This approach assumes separate roots for each cluster. Thus for Cluster 1 the root is at the node containing the branch with Cluster 1, and for Cluster 2 the root is at the node containing the branch with Cluster 2.

We call genes "out-of-cluster paralogs", if they are located inside a superfamily, but not on the branch containing a selected family. Note that for a given cluster all other genes from the same superfamily are “out-paralogs”. We do not include out-paralogs in our clustering reports, because in that case we would just have to report all genes in the superfamily not included in that cluster. Instead we report the so-called out-of-cluster paralogs as it occurs in the DNA-binding proteins’s superfamily. The second copy of gene from Pseudomonas aeruginosa, gi 15600541, is reported as an out-of-cluster paralog (Figure 5) because Pseudomonas contains one copy inside each cluster.

Figure 5. Superfamily of DNA-binding proteins for 13 gamma proteobacteria.

back

Links

Gogarten Lab Home Page: http://gogarten.uconn.edu/

Email to: Maria.Poptsova@gmail.com

Page last updated: November 21, 2006