CD-HIT

               Cluster Database at High Identity with Tolerance
                  http://bioinformatics.burnham-inst.org/cd-hi

================================================================================
    This program is modified from CD-HI; you may want to read algorithm.cd-hi first.
================================================================================

The basic filter system of CD-HI states:

  "If two proteins share certain sequence identity, they should have
at least a certain number of identical pentapeptides.  For example,
two sequences having 85% identical residues over a 100-residue
window will have at least 25 identical pentapeptides."
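
  The arithmetic behind the quoted figure can be sketched as follows. This
is only an illustration, not the CD-HI source: the function name and its
parameters are hypothetical, and it assumes that each mismatch can destroy
at most k words and that a window holds roughly as many words as residues
(strictly, a window of L residues holds L - k + 1 words).

```python
def min_shared_words(window, identity_pct, k):
    """Rough lower bound on identical k-residue words in an aligned window.

    Hypothetical helper: assumes every mismatch destroys at most k words
    and counts about `window` word positions, as in the quoted example.
    """
    mismatches = window * (100 - identity_pct) // 100
    return max(0, window - k * mismatches)


print(min_shared_words(100, 85, 5))  # -> 25
print(min_shared_words(100, 80, 5))  # -> 0
```

  Under these assumptions the bound is 25 at 85% identity and falls to 0 at
80% identity, so a pentapeptide filter guarantees nothing at that level.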

  Theoretically, two sequences with 80% identity do not need to share a single
identical pentapeptide. They can differ at every fifth amino acid, for example:

MSHHWGYGKHNGPEMWHKDFPIAKGERQS....
MSHH GYGK NGPE WHKD PIAK ERQS....
MSHHcGYGKdNGPEhWHKDiPIAKtERQS....
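
  The example above can be checked with a short script. This is a minimal
sketch, not CD-HIT code: the helper names are hypothetical, only the residues
shown above are used (the elided tail is ignored), and the lowercase
mismatches are upper-cased before comparison.

```python
def kmers(seq, k):
    """Return the set of all length-k words in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmers(a, b, k):
    """Count distinct length-k words that occur in both sequences."""
    return len(kmers(a, k) & kmers(b, k))


def identity(a, b):
    """Fraction of identical positions in a gap-free, equal-length alignment."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


seq_a = "MSHHWGYGKHNGPEMWHKDFPIAKGERQS"
seq_b = "MSHHcGYGKdNGPEhWHKDiPIAKtERQS".upper()

print(round(identity(seq_a, seq_b), 3))   # -> 0.828
print(shared_kmers(seq_a, seq_b, 5))      # -> 0
```

  The two fragments are identical at 24 of 29 positions, yet they have no
pentapeptide in common.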

  But this is very rare in real alignments. Even when the alignment identity
is only 60%, there are generally still some identical pentapeptides. This is
the basis of CD-HIT.

CD-HIT is based on the statistical analysis of a large number of alignments.
It speeds up the program without losing much clustering quality.