[BiO BB] Re: All-against-all protein sequence comparison (Iddo Friedberg)

Iddo Friedberg idoerg at burnham.org
Fri Dec 17 14:24:21 EST 2004

The problem I see with e-values is that an e-value depends on the size
of the search database. The e-value gives you the number of false
positives expected by chance, given the database you are searching. If
your database is only the queried genome(s), you may get skewed values,
because a hit that would have a high e-value (low significance, more
false positives expected by chance) when searched against nr would have
a low e-value (high significance) when searched against the genome(s).
Similarities may be mistaken for significant simply because the
predicted number of false positives will always be small when the
database is small.
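
To put rough numbers on this, here is a minimal sketch using the
standard Karlin-Altschul relation between bit score and e-value,
E = m * n * 2^(-S'), where m is the query length and n is the total
database length. The lengths and the bit score below are made-up
illustrative values, not measurements, and the sketch ignores the
edge-effect corrections that BLAST applies to m and n.

```python
def expected_hits(bit_score, query_len, db_len):
    """E-value from a bit score: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)

# All values below are assumptions chosen for illustration only.
query_len = 300      # residues in the query protein
bit_score = 40.0     # the same alignment, so the same bit score
genome_db = 1.5e6    # total residues in a single small proteome
large_db = 1.0e9     # total residues in a large nr-like database

e_small = expected_hits(bit_score, query_len, genome_db)
e_large = expected_hits(bit_score, query_len, large_db)

print(f"E against the genome only: {e_small:.2e}")
print(f"E against the large db:    {e_large:.2e}")
print(f"ratio (large / small):     {e_large / e_small:.0f}")
```

The same alignment gets an e-value several hundred times smaller
against the single genome, purely because n is smaller.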


Hongyu Zhang wrote:

>>I wouldn't go with the strategy of having one
>>genome as a database, and
>>another as a query pool, because that would skew
>>your BLAST statistics
>>to give you false-positive hits. I would go with the
>>all-vs-all pairwise
>The problem with all-vs-all pairwise comparison is that it will be
>slower than the strategy of using one genome as a database and the
>other as the query. The statistics issue, I think, only comes up when
>you do reciprocal BLASTs, i.e., blast genome A against B and then
>genome B against A; then you will probably get two slightly different
>E-values for the same pair of sequences. The problem, however, can be
>mostly circumvented by setting the database size the same in both BLAST
>directions (parameter "-z" in NCBI-BLAST and "Z=" in WU-BLAST).
>Hongyu Zhang, Ph.D.
>Computational biologist
>Ceres Inc.
>BiO_Bulletin_Board maillist  -  BiO_Bulletin_Board at bioinformatics.org
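
As a rough illustration of the reciprocal case quoted above, the sketch
below uses the same simplified relation E = m * n * 2^(-S') to compare
the two search directions with and without a fixed database size. The
protein lengths, genome sizes, bit score, and the fixed size `z` are all
hypothetical values, and the model ignores BLAST's edge-effect
corrections. In this model, fixing the database size removes the part
of the discrepancy that comes from the two genomes differing in size;
a difference due to the two query lengths remains.

```python
def evalue(bit_score, query_len, db_len):
    """Simplified E-value: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)

# Hypothetical values for illustration only.
bit_score = 60.0
len_a, len_b = 250, 400          # lengths of the two aligned proteins
size_a, size_b = 1.2e6, 9.0e6    # total residues in genomes A and B

# Default behaviour: each direction uses the real database size.
e_ab = evalue(bit_score, len_a, size_b)   # protein from A vs. genome B
e_ba = evalue(bit_score, len_b, size_a)   # protein from B vs. genome A

# Same database size forced in both directions (the "-z" / "Z=" idea).
z = 1.0e7
e_ab_fixed = evalue(bit_score, len_a, z)
e_ba_fixed = evalue(bit_score, len_b, z)

print(f"default: A->B {e_ab:.2e}, B->A {e_ba:.2e}")
print(f"fixed z: A->B {e_ab_fixed:.2e}, B->A {e_ba_fixed:.2e}")
```

Note that a fixed size chosen this way still says nothing about how the
hit would fare against a larger database such as nr.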


Iddo Friedberg, Ph.D.
The Burnham Institute
10901 N. Torrey Pines Rd.
La Jolla, CA 92037
Tel: (858) 646 3100 x3516
Fax: (858) 713 9930
