Perl Script

 

 Perl-script only:

      the latest version:

              version 1.01 November, 2006:  BranchClust_1-01.pl

      previous versions:

              version 1.00 June, 2006:  BranchClust_1-00.pl       

 

 Perl-script and examples:

 

BranchClust_all.tgz -

              contains the perl-script together with the examples of trees and various taxa recognition files.

 

 BranchClust Tutorial:

 

A step-by-step guide on how to download complete genomes, assemble superfamilies, reconstruct trees, select orthologous families with BranchClust, and analyze and depict results with TreeDyn.

 

BranchClust Tutorial.pdf

 

BranchClust_Tutorial.tgz -

            contains all the perl scripts described in the Tutorial (updated January 19, 2008)

 

 Program Usage

 

Required:

1. BioPerl module for parsing trees, Bio::TreeIO. For instructions on how to install BioPerl, go here.

2. Taxa recognition file gi_numbers.out, which must be present in the current directory. To learn how to create this file, read the Taxa recognition file section.

 

At the command line type:

# perl BranchClust.pl <tree-file> <MANY>

 

where

<tree-file> is a superfamily tree in PHYLIP format, and

<MANY> is a parameter designating the minimum number of different taxa that a branch must contain to be treated as a separate cluster (see Algorithm). It can be any number less than or equal to the number of different taxa considered. The BranchClust algorithm usually works well with a value of 80% of the total number of different taxa. See Examples below.
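
As a rough illustration of the 80% guideline, the following Python sketch derives a starting value for <MANY> from the number of taxa. The function name suggest_many and the choice to round down are assumptions made for this illustration and are not part of BranchClust. The examples further down this page use values well below 80% for the larger data sets, so the result is best treated as a first guess only.

```python
import math


def suggest_many(n_taxa, fraction=0.8):
    """Return a starting value for <MANY>: `fraction` of the taxa, rounded down.

    Rounding down, and clamping the result to the range 1..n_taxa, are
    assumptions of this sketch, not documented BranchClust behaviour.
    """
    if n_taxa < 1:
        raise ValueError("n_taxa must be at least 1")
    return max(1, min(n_taxa, math.floor(n_taxa * fraction)))


if __name__ == "__main__":
    for n in (4, 13, 30):
        print(n, "taxa ->", suggest_many(n))
```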

 

 Output

 

BranchClust creates three output files: clusters.out, families.list and cluster.log. The file clusters.out reports clusters with the selected families and IN- and OUT-OF-CLUSTER paralogs, if any. The file families.list contains only the list of selected families. The log file cluster.log traces the selection algorithm as it parses the tree node by node, and can be useful for analyzing cases of wrong clustering.

 

 

 Taxa recognition file

 

The structure of the taxa recognition file gi_numbers.out is defined as follows:

 

species1 | match_pattern1.1 match_pattern1.2 ... match_pattern1.m

species2 | match_pattern2.1 match_pattern2.2 ... match_pattern2.m

species3 | match_pattern3.1 match_pattern3.2 ... match_pattern3.m

...

speciesN | match_patternN.1 match_patternN.2 ... match_patternN.m

 

where species1, species2, species3, ... are different taxa and the match patterns are ordinary Perl pattern-matching expressions that serve as unique identifiers of the taxa. Species names are for the user's convenience only; everything preceding the | sign is ignored. Match patterns can be either names or gi numbers.
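
The file format described above can be read with a few lines of code. The sketch below is written in Python for illustration and is not taken from BranchClust. The function names parse_taxa_file and identify_taxon are hypothetical, the sample input is made up, and matching each pattern from the start of the name is an assumption. Simple patterns such as bacteria1.* or 1612.... behave the same way in Python and Perl regular expressions.

```python
import re


def parse_taxa_file(text):
    """Parse taxa recognition lines into a list of (label, [compiled patterns])."""
    taxa = []
    for line in text.splitlines():
        if "|" not in line:
            continue  # skip blank or malformed lines
        label, patterns = line.split("|", 1)
        compiled = [re.compile(p) for p in patterns.split()]
        if compiled:
            # The label is kept for readability only; the taxon is identified
            # by its position in the file.
            taxa.append((label.strip(), compiled))
    return taxa


def identify_taxon(name, taxa):
    """Return the index of the first taxon with a pattern matching `name`, else None.

    Matching from the start of the name (re.match) is an assumption of this sketch.
    """
    for index, (_label, patterns) in enumerate(taxa):
        if any(p.match(name) for p in patterns):
            return index
    return None


if __name__ == "__main__":
    # Made-up sample input mixing a name pattern and a gi-number pattern.
    example = "species1 | bacteria1.*\nspecies2 | bacteria2.* 1612....\n"
    taxa = parse_taxa_file(example)
    for name in ("bacteria2_01", "16129888", "unknown"):
        print(name, "->", identify_taxon(name, taxa))
```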

 

1. Example of how to use names as taxa identifiers:

 

Suppose we have 4 different taxa, and we need to analyze a tree reconstructed from a mixture of paralogs and orthologs:

 

The taxa recognition file could be as follows:

 

species1| bacteria1.*

species2| bacteria2.*

species3| bacteria3.*

species4| bacteria4.*

 

The paralogs of the same taxa can be designated as bacteria1.1, bacteria1.2, bacteria2.1, bacteria2.2, etc.

 

Here is an example of a tree containing paralogs and orthologs of 4 different taxa:

 

(((bacteria1_01,bacteria2_01),(bacteria3_01,bacteria4_01)),(bacteria3_03,

(bacteria3_02,(bacteria2_02,(bacteria1_02,bacteria1_03)))));

 

To see how BranchClust parses this tree, see Examples below.
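
To make the role of <MANY> concrete for this tree, the Python sketch below counts the different taxa found under each branch attached to the root. It is not the BranchClust algorithm (see Algorithm for that); it only computes the quantity that <MANY> is compared against. The function names and the simplified Newick handling are assumptions of this illustration, and the patterns are matched from the start of each leaf name.

```python
import re

TREE = ("(((bacteria1_01,bacteria2_01),(bacteria3_01,bacteria4_01)),(bacteria3_03,"
        "(bacteria3_02,(bacteria2_02,(bacteria1_02,bacteria1_03)))));")

# Patterns from the taxa recognition file above, one list per taxon.
TAXA = [["bacteria1.*"], ["bacteria2.*"], ["bacteria3.*"], ["bacteria4.*"]]


def top_level_branches(newick):
    """Split a Newick string into the subtrees attached to its root."""
    body = newick.strip().rstrip(";").strip()
    body = body[1:-1]  # drop the outermost parentheses
    branches, depth, start = [], 0, 0
    for i, ch in enumerate(body):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            branches.append(body[start:i])
            start = i + 1
    branches.append(body[start:])
    return branches


def leaf_names(subtree):
    """Return leaf labels; a label follows '(' or ',' or starts the string."""
    return re.findall(r"(?:^|[(,])\s*([^(),:;\s]+)", subtree)


def count_taxa(names):
    """Count how many different taxa are represented among the leaf names."""
    found = set()
    for name in names:
        for index, patterns in enumerate(TAXA):
            if any(re.match(p, name) for p in patterns):
                found.add(index)
                break
    return len(found)


if __name__ == "__main__":
    for branch in top_level_branches(TREE):
        names = leaf_names(branch)
        print(len(names), "leaves,", count_taxa(names), "different taxa")
```

For this tree the first root branch holds 4 leaves from 4 different taxa and the second holds 5 leaves from 3 different taxa.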

 

2. Example of how to use gi-numbers as taxa identifiers:

 

Suppose we have 4 species of bacteria and archaea. The species could be distinguished by their gi-numbers:

 

Bacillus subtilis subsp. subtilis str. 168 | 1607.... 5081.... 1608.... 1867....

Escherichia coli K12 | 1612.... 4917.... 1613.... 3334....

Methanosarcina mazei Go1 | 2122....

Sulfolobus solfataricus P2 | 1589....

 

Here is the tree for the superfamily of ATPases (see Algorithm and Tutorial to learn how to assemble superfamilies), with gi numbers identifying the genes.

 

((((16078687:0.38341,16129888:0.51097):0.17454,(16080761:0.29985,

16131639:0.32318):1.63425):0.29299,((16080736:0.26753,16131602:0.39301):0.70160,

(15897485:0.22830,21226881:0.36977):0.81741):0.26223):0.20986,(16131600:0.17847,

16080734:0.27503):0.66717,21226882:1.19107);

 

To see how BranchClust parses this tree, see Examples below.
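
As a quick check that the gi-number patterns above cover every gene in this tree, the following Python sketch tallies the leaves per species. It is only an illustration: the function names are hypothetical, the leaf extraction is a simplified reading of the Newick text rather than a full parser such as Bio::TreeIO, and patterns are assumed to match from the start of the gi number.

```python
import re
from collections import Counter

TAXA_FILE = """\
Bacillus subtilis subsp. subtilis str. 168 | 1607.... 5081.... 1608.... 1867....
Escherichia coli K12 | 1612.... 4917.... 1613.... 3334....
Methanosarcina mazei Go1 | 2122....
Sulfolobus solfataricus P2 | 1589....
"""

TREE = (
    "((((16078687:0.38341,16129888:0.51097):0.17454,(16080761:0.29985,"
    "16131639:0.32318):1.63425):0.29299,((16080736:0.26753,16131602:0.39301):0.70160,"
    "(15897485:0.22830,21226881:0.36977):0.81741):0.26223):0.20986,(16131600:0.17847,"
    "16080734:0.27503):0.66717,21226882:1.19107);"
)


def load_taxa(text):
    """Return a list of (species label, [compiled patterns]) from the file text."""
    taxa = []
    for line in text.splitlines():
        if "|" in line:
            label, patterns = line.split("|", 1)
            taxa.append((label.strip(), [re.compile(p) for p in patterns.split()]))
    return taxa


def genes_per_taxon(newick, taxa):
    """Count leaves (gi numbers) per species label; unmatched leaves go under None."""
    # A leaf label directly follows '(' or ','; branch lengths follow ':'.
    leaves = re.findall(r"[(,]\s*([^(),:;\s]+)", newick)
    counts = Counter()
    for gi in leaves:
        label = next((name for name, pats in taxa
                      if any(p.match(gi) for p in pats)), None)
        counts[label] += 1
    return counts


if __name__ == "__main__":
    for label, n in genes_per_taxon(TREE, load_taxa(TAXA_FILE)).items():
        print(n, label)
```

Species that appear more than once in the tally contribute several genes to the superfamily, which is the situation in which paralogs have to be separated from orthologs.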

 

3.  File with gi-number identifiers for 319 bacterial and archaeal species:

 

gi_numbers.out  (created from genomes downloaded from ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria on April 6, 2006)

 

 Examples of trees with clustering

 

 

 

Sample superfamily for 4 bacteria

 

#  perl branch_clust.pl tree.tre 3

gi_numbers.out

tree.tre

results: clusters.out, families.list

 

 

 

Superfamily of ATP synthases for 13 gamma proteobacteria

 

#  perl branch_clust.pl tree.tre 8

gi_numbers.out

tree.tre

results: clusters.out, families.list

go to Clustering

 

 

Superfamily of ATP synthases for 30 taxa (16 bacteria and 14 archaea)

 

#  perl branch_clust.pl tree.tre 7

gi_numbers.out

tree.tre

results: clusters.out, families.list

go to Clustering

 

 

     Superfamily of ATP synthases for 317 taxa (bacteria and archaea)

 

#  perl branch_clust.pl tree.tre 150

gi_numbers.out

tree.tre

results: clusters.out, families.list

go to Clustering

 

 

 

 

 Superfamily of penicillin-binding proteins for 13 gamma proteobacteria

 

#  perl branch_clust.pl tree.tre 5

gi_numbers.out

tree.tre

results: clusters.out, families.list

go to Clustering

 

 How to do batch processing

 

An example of a wrapper script can be found here: branch_clust_wrapper.pl
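
For readers who prefer to see the idea spelled out, here is a minimal batch-processing sketch in Python. It is not a port of branch_clust_wrapper.pl. It assumes that the script is invoked as shown in Program Usage, that the three output files are written to the current directory under the fixed names listed in Output, and that gi_numbers.out is already present there. The function names and the renaming scheme are hypothetical.

```python
import shutil
import subprocess
import sys
from pathlib import Path

# Output file names as listed in the Output section.
OUTPUTS = ("clusters.out", "families.list", "cluster.log")


def build_command(tree_file, many, script="BranchClust.pl"):
    """Return the command line from Program Usage as an argument list."""
    return ["perl", script, str(tree_file), str(many)]


def run_batch(tree_files, many, script="BranchClust.pl", runner=subprocess.run):
    """Run BranchClust on each tree and keep a per-tree copy of its outputs."""
    for tree in map(Path, tree_files):
        runner(build_command(tree, many, script), check=True)
        for name in OUTPUTS:
            if Path(name).exists():
                # Hypothetical naming scheme: clusters.out -> <tree stem>.clusters.out
                shutil.move(name, f"{tree.stem}.{name}")


if __name__ == "__main__":
    # usage: python batch.py <MANY> tree1.tre tree2.tre ...
    if len(sys.argv) > 2:
        run_batch(sys.argv[2:], int(sys.argv[1]))
```

Renaming the outputs after each run keeps a later tree from overwriting the results of an earlier one, which would happen if every run writes to the same file names.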

 

 Coloring with TreeDyn

 

Go to the TreeDyn website to learn how to create annotation and script files.

 

Example for the superfamily of ATP synthases for 30 taxa (16 bacteria and 14 archaea):

 

 Annotation file: labelfile.tlf

 Script file: script.tds

 Resulting tree: tree.png

 

 


 

 Links

 

Gogarten Lab Home Page: http://gogarten.uconn.edu/

 

Email to: Maria.Poptsova@uconn.edu


Page last updated: January 19, 2008