CD-HIT
Cluster Database at High Identity with Tolerance
http://bioinformatics.burnham-inst.org/cd-hi

================================================================================
This program is modified from CD-HI; you may want to read algorithm.cd-hi first.
================================================================================

The basic filter of CD-HI states:

    "If two proteins share a certain sequence identity, they should have at
    least a certain number of identical pentapeptides. For example, two
    sequences having 85% identical residues over a 100-residue window will
    have at least 25 identical pentapeptides."

In theory, two sequences with 80% identity need not share a single identical
pentapeptide, because they can differ at every fifth residue:

    MSHHWGYGKHNGPEMWHKDFPIAKGERQS....
    MSHH GYGK NGPE WHKD PIAK ERQS....
    MSHHcGYGKdNGPEhWHKDiPIAKtERQS....

In real alignments, however, this is very rare. Even at 60% identity, an
alignment generally still contains some identical pentapeptides. This is the
basis of CD-HIT.

CD-HIT is based on a statistical analysis of a large number of alignments.
It speeds up the program while losing little of the clustering quality.
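
The filter and the counterexample above can be checked with a short Python
sketch. It is an illustration only, not the CD-HIT implementation: the
function names (kmers, shared_kmers, identity, kmer_floor) are hypothetical,
the alignment is assumed to be ungapped, and kmer_floor assumes the quoted
figure of 25 comes from the simple count "window length minus five per
mismatch", which ignores window-edge effects.

```python
from collections import Counter


def kmers(seq, k=5):
    """Count every overlapping k-mer in seq (pentapeptides when k=5)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_kmers(a, b, k=5):
    """Number of k-mers present in both sequences, with multiplicity."""
    return sum((kmers(a, k) & kmers(b, k)).values())


def identity(a, b):
    """Fraction of identical positions in an ungapped alignment."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def kmer_floor(window, mismatches, k=5):
    """Assumed simple count: each mismatch removes at most k k-mers."""
    return max(0, window - k * mismatches)


# The two sequences from the example; lowercase marks the substitutions.
SEQ_A = "MSHHWGYGKHNGPEMWHKDFPIAKGERQS"
SEQ_B = "MSHHcGYGKdNGPEhWHKDiPIAKtERQS"

print("identity:            %.3f" % identity(SEQ_A, SEQ_B))
print("shared pentapeptides:", shared_kmers(SEQ_A, SEQ_B))
print("floor at 85%, 100 aa:", kmer_floor(100, 15))
print("floor at 80%, 100 aa:", kmer_floor(100, 20))
```

Under this count the floor reaches zero at 80% identity, which is why the
regularly spaced mismatches in the example can remove every shared
pentapeptide even though the two sequences are more than 80% identical.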