[BiO BB] Re: All-against-all protein sequence comparison (Iddo Friedberg)

Fri Dec 17 14:24:21 EST 2004

The problem I see with e-values is that an e-value depends on the size
of the search database. The e-value gives you the number of false
positives expected by chance, given the database you are searching. If
your database is the queried genome(s) only, you may get skewed values,
because a hit that would have a high e-value (low significance, more
false positives expected by chance) when searched against nr would have
a low e-value (high significance) when searched against the genome(s).
Similarities may be mistaken for significant simply because the
predicted number of false positives will always be small when the
database is small.
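
To put rough numbers on this, here is a minimal sketch in Python. It
assumes the standard Karlin-Altschul relation between a bit score S'
and the e-value, E = m * n * 2^(-S'), where m is the query length and n
is the total database length. The bit score, query length and database
sizes are hypothetical, picked only for illustration, and the function
name is my own. Real BLAST also applies effective-length corrections,
so actual reported values will differ somewhat.

```python
def expected_chance_hits(bit_score, query_len, db_len):
    """Expected number of chance hits: E = m * n * 2**(-S').

    bit_score: normalized (bit) score S' of the alignment
    query_len: query length m, in residues
    db_len:    total database length n, in residues
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical values, for illustration only.
BIT_SCORE = 40.0
QUERY_LEN = 300          # residues in the query protein
GENOME_DB_LEN = 1.5e6    # residues in a single-genome database
LARGE_DB_LEN = 1.0e9     # residues in a large nr-like database

e_genome = expected_chance_hits(BIT_SCORE, QUERY_LEN, GENOME_DB_LEN)
e_large = expected_chance_hits(BIT_SCORE, QUERY_LEN, LARGE_DB_LEN)

# Same alignment, same score: only the database size differs.
print(f"genome-only database: E = {e_genome:.2e}")
print(f"large database:       E = {e_large:.2e}")
print(f"ratio (large/genome): {e_large / e_genome:.0f}")
```

Under this approximation the e-value scales linearly with database
length, so the same alignment looks several hundred times more
significant against the small database than against the large one.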

./I

Hongyu Zhang wrote:

>>I wouldn't go with the strategy of having one genome as a database,
>>and another as a query pool, because that would skew your BLAST
>>statistics to give you false-positive hits. I would go with the
>>all-vs-all pairwise BLAST.
>>
>
>The problem with all-vs-all pairwise comparison is that it will be
>slower than the strategy of using one genome as a database and the
>other as the query. The statistics issue, I think, only comes up when
>you do reciprocal BLASTs, i.e., BLAST genome A against B and then
>genome B against A; you will then probably get two slightly different
>E-values for the same pair of sequences. The problem, however, can be
>mostly circumvented by setting the database size the same in both
>BLAST directions (parameter "-z" in NCBI-BLAST and "Z=" in WU-BLAST).
>
>--
>Hongyu Zhang, Ph.D.
>Computational biologist
>Ceres Inc.

-- 

Iddo Friedberg, Ph.D.
The Burnham Institute
10901 N. Torrey Pines Rd.
La Jolla, CA 92037
Tel: (858) 646 3100 x3516
Fax: (858) 713 9930
http://ffas.ljcrf.edu/~iddo