CD-HI, Program algorithm and implementation =========================================== This is the detailed algorithm of cd-hi. The reference is: Clustering of highly homologous sequences to reduce the size of large protein databases by Weizhong Li, Lukasz Jaroszewski & Adam Godzik Bioinformatics (2000) accepted Please first read above paper for background of this program. 1. Greedy incremental algorithm to cluster sequence databases ============================================================= This algorithm has been used to create the PDB_select and nrdb90 by Holm et al. First, sequences are sorted in order of decreasing length. The longest one becomes the representative of the first cluster. Then each remaining sequence is compared to the existing representatives. If the sequence similarity with any representative is above a given threshold, it is included into that cluster. Otherwise a new cluster starts with it as representative. 2. short words filter and index table ============================== Here, we use pentapeptides as example of short words. 2.1 decomposition of sequences The sequences can be decomposed into short words, for example, ACDEFGHIKLMN has these ACDEF CDEFG DEFGH EFGHI FGHIK GHIKL HIKLM and IKLMN 9 words 2.2 short word filter If two proteins share certain sequence identity, they should have at least a certain number of identical pentapeptide. For example, two sequences having 85% identical residues over a 100-residue window will have at least 25 pentapeptides. So for a sequence of certain length, a threshold of sequence identity is related with a threshold of the number of identical pentapeptide. Before two sequences are aligned, they are first decomposed into pentapeptides, if their identical pentapeptide is lower than required number, they don't need to be aligned. 2.3 index table For pentapeptide, since each residue can have 21 possibilities, there are will be 21x21x21x21x21 (about 4 millon) of them. An index table with each entry representing a unique pentapeptide is generated, Each entry in the index table holds numbers of the representative sequences containing this pentapeptide. So an index table is like pentapeptide content index ------------------------------------------------------------------------- AAAAA | sequence #2 has 1, sequence #20 has 2, and so on. AAAAR | sequence #15 has 2, sequence #22 has 1, and so on. ... | ... ... | ... ... | ... WHCKL | no sequence contains this pentapeptide. ... | ... ... | ... ------------------------------------------------------------------------- 3. detailed implementation of this algorithm ============================================ Let's say, we have a input database below: (but note the input database for the program should be in fasta format) 1 seq1 MSHHWGYGKHNGPEHWHKDFPIAKGERQSPVDID 2 seq2 MSHHWGYGKHNGPEHWHKDFPIAKGERQMD 3 seq3 DGDSRKVNLGVGAYRTDEGQPWVLPVVRKVEQLIAGNGSLNHEPPPPPP 4 seq4 DGDSRKVNLGVGAYRTDEGQPWVLPVVRKVEQLLLGG 5 seq5 DGDSRKVNLGVGAYRTDEGQPWVH 6 seq6 PTGTDPTPDEWKQIAAVMKRRC 7 seq7 SGNVPVLNAGDGSNQHPTQTLLDLFTIQQTEGRLDNLHVAM 8 seq8 LPRVDEIATDVDKTPHAWYFQQAGN 3.1 sort The sequences are sorted by length, so database is arranged as following: 1 seq3 DGDSRKVNLGVGAYRTDEGQPWVLPVVRKVEQLIAGNGSLNHEPPPPPP 2 seq7 SGNVPVLNAGDGSNQHPTQTLLDLFTIQQTEGRLDNLHVAM 3 seq4 DGDSRKVNLGVGAYRTDEGQPWVLPVVRKVEQLLLGG 4 seq1 MSHHWGYGKHNGPEHWHKDFPIAKGERQSPVDID 5 seq2 MSHHWGYGKHNGPEHWHKDFPIAKGERQMD 6 seq8 LPRVDEIATDVDKTPHAWYFQQAGN 7 seq5 DGDSRKVNLGVGAYRTDEGQPWVH 8 seq6 PTGTDPTPDEWKQIAAVMKRRC 3.2 Adding longest sequence, The longest sequence, seq3, is selected as the representative of cluster 1. Also it is decomposed into pentapeptides. For example, this sequence has 2 pentapeptides of PPPPP. These short words are added into the index table. so in the entry of PPPPP, the program will record that seq3 has 2 PPPPPs. 3.3 Processing new sequences For every new sequence, the program handles it in following steps. 3.3.1 according to the threshold and length of the new sequence, calculate the required number of identical pentapeptides. 3.3.2 decompose it into pentapeptides. 3.3.3 scan the index table, check the entries of pentapeptides the new sequence has. For every old representative, calculate the number of identical pentapeptides between it and the new sequence. 3.3.4 If the number of identical pentapeptides between an old representative and the new sequence is larger than the required, they are aligned to calculated their real sequence identity. Otherwise their identity should below the threshold. 3.3.5 If the sequence identity with any representative is above the threshold, the new sequence is included into that cluster. 3.3.6 IF no conditions in 3.3.5 is true, that means the new sequence is not similar to any old cluster, it becames the representative of a new cluster. And the short words of it are added into the index table. 3.4 New database. This precess from 3.3.1 to 3.3.6 is repeated until all the sequences are processed. All the representatives form the new clustered database. If the threshold is 90%, the output database will like cluster1 seq3 DGDSRKVNLGVGAYRTDEGQPWVLPVVRKVEQLIAGNGSLNHEPPPPPP cluster2 seq7 SGNVPVLNAGDGSNQHPTQTLLDLFTIQQTEGRLDNLHVAM cluster3 seq1 MSHHWGYGKHNGPEHWHKDFPIAKGERQSPVDID cluster4 seq8 LPRVDEIATDVDKTPHAWYFQQAGN cluster5 seq6 PTGTDPTPDEWKQIAAVMKRRC The following sequences are deleted because seq2 MSHHWGYGKHNGPEHWHKDFPIAKGERQMD similar to seq1, cluster3 seq4 DGDSRKVNLGVGAYRTDEGQPWVLPVVRKVEQLLLGG similar to seq3, cluster1 seq5 DGDSRKVNLGVGAYRTDEGQPWVH similar to seq3, cluster1 ========================================================================= Comments to the author Weizhong Li at liwz@sdsc.edu =========================================================================