<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=us-ascii">
<META content="MSHTML 6.00.2900.2802" name=GENERATOR></HEAD>
<BODY>
<DIV dir=ltr align=left>
<DIV dir=ltr align=left><SPAN class=187493302-10012006><FONT face=Arial
color=#0000ff size=2><SPAN class=911191116-10012006>Looks like this is due, as
Alan wrote, to the way blast BLAST sorts the database hits during the three
blastn extension phases (ungapped, gapped, and gapped with traceback). Any
given alignment may improve during during each extension step, so if you throw
one out because of a -b limit you lose it. (blast seems to look at -b at
each extension/evaluation step, though I have not found this documented
anyplace) with a larger -b you keep more alignments for each extension phase.
So, an ungapped alignment that does not make the -b1 cutoff, might have been
improved in the gapped or gapped w/traceback extension phases thus giving a
higher score. Gapped extensions are only triggered if the ungapped
thresholds are satisfied (Thanks to Wayne Matten @ NCBI for explaining this
to me).</SPAN></FONT></SPAN></DIV>
<DIV dir=ltr align=left><SPAN class=187493302-10012006><FONT face=Arial
color=#0000ff size=2><SPAN
class=911191116-10012006></SPAN></FONT></SPAN> </DIV>
<DIV dir=ltr align=left><SPAN class=187493302-10012006><FONT face=Arial
color=#0000ff size=2><SPAN class=911191116-10012006>The question I have is why
use BLAST here at all? It looks like the user expects to align a
relatively long strech of the query to the DB sequence (mapping), so something
like BLAT would work better here. I know there are high memory
requirements with BLAT, but I would imagine this search with BLAST will use up
quite a bit of memory, and also BLASTN with lots of HSPs is computationally
quite expensive (computation is quadratic in the number of HSPs). If you
wanted to use BLAST, it seems like you could do a few things to improve
performance (rather than lower -b/-v) like use megablast or splitting up
the query or DB files or just increasing the word size. BLAST
(formatdb) does have that cool database alias function you can use to
virtually split DBs (but fasta sequences have to have GI numbers for it to
work).</SPAN></FONT></SPAN></DIV>
<DIV dir=ltr align=left><SPAN class=187493302-10012006><FONT face=Arial
color=#0000ff size=2><SPAN
class=911191116-10012006></SPAN></FONT></SPAN> </DIV>
<DIV dir=ltr align=left><SPAN class=187493302-10012006><FONT face=Arial
color=#0000ff size=2><SPAN
class=911191116-10012006>Cheers,</SPAN></FONT></SPAN></DIV>
<DIV dir=ltr align=left><SPAN class=187493302-10012006><FONT face=Arial
color=#0000ff size=2><SPAN
class=911191116-10012006></SPAN></FONT></SPAN> </DIV>
<DIV dir=ltr align=left><SPAN class=187493302-10012006><FONT face=Arial
color=#0000ff size=2><SPAN
class=911191116-10012006>Chris</SPAN></FONT></SPAN></DIV>
<DIV dir=ltr align=left><SPAN class=187493302-10012006><FONT face=Arial
color=#0000ff size=2></FONT></SPAN> </DIV>
<DIV dir=ltr align=left><SPAN class=187493302-10012006><FONT face=Arial
color=#0000ff size=2></FONT></SPAN> </DIV></DIV>
<BLOCKQUOTE
style="PADDING-LEFT: 5px; MARGIN-LEFT: 5px; BORDER-LEFT: #0000ff 2px solid; MARGIN-RIGHT: 0px">
<DIV class=OutlookMessageHeader lang=en-us dir=ltr align=left>
<HR tabIndex=-1>
<FONT face=Tahoma size=2><B>From:</B>
bioclusters-bounces+christopher.botka=joslin.harvard.edu@bioinformatics.org
[mailto:bioclusters-bounces+christopher.botka=joslin.harvard.edu@bioinformatics.org]
<B>On Behalf Of </B>Andrew.Mather@dpi.vic.gov.au<BR><B>Sent:</B> Thursday,
January 05, 2006 2:16 AM<BR><B>To:</B>
Bioclusters@bioinformatics.org<BR><B>Subject:</B> [Bioclusters] Re: Top hits,
not so top<BR></FONT><BR></DIV>
<DIV></DIV><BR><FONT face=sans-serif size=2>Hi All,</FONT> <BR><BR><FONT
face=sans-serif size=2>One of my users has encountered some odd behaviour when
trying to blast a 100MB query sequence against a human genomic sequence
database.</FONT> <BR><BR><FONT face=sans-serif size=2>His message is below,
but basically, he's finding he gets different results depending on how many
alignments andone-line descriptors he asks to see. The input sequence,
database, e-values etc remain constant, it's only the -v and -b options that
change.<BR></FONT><BR><FONT face=sans-serif size=2>We're using Blastall
v2.2.11 (newer one is in testing) on Intel machines running RHEL3.</FONT>
<BR><BR><FONT face=sans-serif size=2>Can anyone point me in an appropriate
direction for things to look at please ?</FONT> <BR><BR><FONT face=sans-serif
size=2>Thanks,</FONT> <BR><FONT face=sans-serif
size=2>Andrew<BR><BR>Bioinformatics Advanced Scientific Computing, <BR>Animal
Genetics and Genomics, PIRVic Attwood<BR>475 Mickleham Road, Attwood,
3049<BR>ph +61 3 92174342<BR>mob 0413 009
761<BR><BR><BR>----------------<BR>There are 10 kinds of people...those who
understand binary and those who don't.<BR></FONT><BR><FONT face=sans-serif
color=#800080 size=1>----- Forwarded by Andrew Mather/NRE on 05/01/06 06:08 PM
-----</FONT> <BR><BR><FONT face=sans-serif size=2><BR>The following problem
was discovered with BLAST jobs where you ask for</FONT><FONT
face="Times New Roman" size=3> </FONT><FONT face=sans-serif
size=2><BR>different numbers of hits to be presented. </FONT><FONT
face="Times New Roman" size=3><BR></FONT><FONT face=sans-serif size=2><BR>A
100 MB segement of bovine Chromosome fasta was</FONT><FONT
face="Times New Roman" size=3> </FONT><FONT face=sans-serif size=2><BR>BLASTED
onto Human with different -v and -b options to print the "Top"</FONT><FONT
face="Times New Roman" size=3> </FONT><FONT face=sans-serif
size=2><BR>hits.</FONT><FONT face="Times New Roman" size=3> <BR></FONT><FONT
face=sans-serif size=2><BR>The following were all run on -intel
ques</FONT><FONT face="Times New Roman" size=3> <BR></FONT><FONT
face=sans-serif size=2><BR>Consider this where we ask for the top hit
only</FONT><FONT face="Times New Roman" size=3> <BR></FONT><FONT
face=sans-serif size=2><BR>/usr/local/bin/blastall -p blastn -i test.out -d
"/blastdb/human_genomic" -e 1e-25 -v 1 -b 1 -a4 -o one.b</FONT><FONT
face="Times New Roman" size=3> </FONT><FONT face=sans-serif
size=2><BR>lastn</FONT><FONT face="Times New Roman" size=3> <BR></FONT><FONT
face=sans-serif size=2><BR>This is what I get</FONT><FONT
face="Times New Roman" size=3> <BR>
</FONT><FONT face="Times New Roman" size=2>Score
E</FONT><FONT face="Times New Roman" size=3> </FONT><FONT
face=sans-serif size=2><BR>
Sequences producing significant alignments:
(bits) Value</FONT><FONT
face="Times New Roman" size=3> <BR></FONT><FONT face=sans-serif
size=2><BR>ref|NC_000001.8|NC_000001 Homo sapiens chromosome 1, complete se...
224 8e-54</FONT><FONT face="Times New Roman" size=3>
<BR></FONT><FONT face=sans-serif size=2><BR>Now if I ask for 5 top hits
.....</FONT><FONT face="Times New Roman" size=3> <BR><BR></FONT><FONT
face=sans-serif size=2><BR>/usr/local/bin/blastall -p blastn -i test.out -d
"/blastdb/human_genomic" -e 1e-25 -v 5 -b 5 -a4 -o five.blastn</FONT><FONT
face="Times New Roman" size=3> <BR></FONT><FONT face=sans-serif
size=2><BR>This is what I get</FONT><FONT face="Times New Roman" size=3>
<BR></FONT><FONT face=sans-serif size=2><BR>
Score E</FONT><FONT face="Times New Roman"
size=3> </FONT><FONT face=sans-serif size=2><BR>
Sequences producing significant alignments:
(bits)
Value</FONT><FONT face="Times New Roman" size=3> <BR></FONT><FONT
face=sans-serif size=2><BR>ref|NC_000004.9|NC_000004 Homo sapiens chromosome
4, complete se... 281 4e-71</FONT><FONT face="Times New Roman"
size=3> </FONT><FONT face=sans-serif size=2><BR>ref|NC_000001.8|NC_000001 Homo
sapiens chromosome 1, complete se... 224 8e-54</FONT><FONT
face="Times New Roman" size=3> </FONT><FONT face=sans-serif
size=2><BR>ref|NC_000013.9|NC_000013 Homo sapiens chromosome 13, complete s...
222 3e-53</FONT><FONT face="Times New Roman" size=3>
</FONT><FONT face=sans-serif size=2><BR>ref|NC_000005.8|NC_000005 Homo sapiens
chromosome 5, complete se... 214 8e-51</FONT><FONT
face="Times New Roman" size=3> </FONT><FONT face=sans-serif
size=2><BR>ref|NC_000009.9|NC_000009 Homo sapiens chromosome 9, complete se...
206 2e-48</FONT><FONT face="Times New Roman" size=3>
<BR></FONT><FONT face=sans-serif size=2><BR>Note, there is now a hit on
Chromosome 4 that is above the top hit on</FONT><FONT face="Times New Roman"
size=3> </FONT><FONT face=sans-serif size=2><BR>Chromosome 1 that was seen
previously. The hit on chromosome 1 is the</FONT><FONT
face="Times New Roman" size=3> </FONT><FONT face=sans-serif size=2><BR>same as
was seen previously though.</FONT><FONT face="Times New Roman" size=3>
<BR></FONT><FONT face=sans-serif size=2><BR>It gets worse ... what if we
ask for the top 7 hits?</FONT><FONT face="Times New Roman" size=3>
<BR></FONT><FONT face=sans-serif size=2><BR>/usr/local/bin/blastall -p blastn
-i test.out -d</FONT><FONT face="Times New Roman" size=3> </FONT><BR><FONT
face=sans-serif size=2>"/blastdb/human_genomic" -e 1e-25 -v 7 -b 7 -a4 -o
seven.blastn</FONT><FONT face="Times New Roman" size=3> <BR></FONT><FONT
face=sans-serif size=2><BR>This is what we get</FONT><FONT
face="Times New Roman" size=3> <BR></FONT><FONT face=sans-serif
size=2><BR>
Score E</FONT><FONT face="Times New Roman" size=3>
</FONT><FONT face=sans-serif size=2><BR>Sequences producing significant
alignments:
(bits) Value</FONT><FONT face="Times New Roman" size=3>
<BR></FONT><FONT face=sans-serif size=2><BR>ref|NC_000004.9|NC_000004 Homo
sapiens chromosome 4, complete se... 281 4e-71</FONT><FONT
face="Times New Roman" size=3> </FONT><FONT face=sans-serif
size=2><BR>ref|NC_000011.8|NC_000011 Homo sapiens chromosome 11, complete s...
230 1e-55</FONT><FONT face="Times New Roman" size=3>
</FONT><FONT face=sans-serif size=2><BR>ref|NC_000001.8|NC_000001 Homo sapiens
chromosome 1, complete se... 224 8e-54</FONT><FONT
face="Times New Roman" size=3> </FONT><FONT face=sans-serif
size=2><BR>ref|NC_000013.9|NC_000013 Homo sapiens chromosome 13, complete s...
222 3e-53</FONT><FONT face="Times New Roman" size=3>
</FONT><FONT face=sans-serif size=2><BR>ref|NC_000005.8|NC_000005 Homo sapiens
chromosome 5, complete se... 214 8e-51</FONT><FONT
face="Times New Roman" size=3> </FONT><FONT face=sans-serif
size=2><BR>ref|NC_000009.9|NC_000009 Homo sapiens chromosome 9, complete se...
206 2e-48</FONT><FONT face="Times New Roman" size=3>
</FONT><FONT face=sans-serif size=2><BR>ref|NC_000023.8|NC_000023 Homo sapiens
chromosome X, complete se... 198 5e-46</FONT><FONT
face="Times New Roman" size=3> <BR><BR></FONT><FONT face=sans-serif
size=2><BR>Now we have a hit on Chromosome 11 that pops up between the top hit
on</FONT><FONT face="Times New Roman" size=3> </FONT><FONT face=sans-serif
size=2><BR>Chromosome 4 and our original hit on Chromosome 1.</FONT><FONT
face="Times New Roman" size=3> <BR></FONT><FONT face=sans-serif
size=2><BR>Does it get any worse? I tried the top 10 hits ...</FONT><FONT
face="Times New Roman" size=3> <BR></FONT><FONT face=sans-serif
size=2><BR>/usr/local/bin/blastall -p blastn -i test.out -d
"/blastdb/human_genomic" -e 1e-25 -v 10 -b 10 -a4 -o ten</FONT><FONT
face="Times New Roman" size=3> </FONT><FONT face=sans-serif
size=2><BR>.blastn</FONT><FONT face="Times New Roman" size=3> <BR></FONT><FONT
face=sans-serif size=2><BR>Here are the results</FONT><FONT
face="Times New Roman" size=3> <BR></FONT><FONT face=sans-serif
size=2><BR>
Score E</FONT><FONT face="Times New Roman" size=3>
</FONT><FONT face=sans-serif size=2><BR>Sequences producing significant
alignments:
(bits) Value</FONT><FONT face="Times New Roman" size=3>
<BR></FONT><FONT face=sans-serif size=2><BR>ref|NC_000004.9|NC_000004 Homo
sapiens chromosome 4, complete se... 281 4e-71</FONT><FONT
face="Times New Roman" size=3> </FONT><FONT face=sans-serif
size=2><BR>ref|NC_000011.8|NC_000011 Homo sapiens chromosome 11, complete s...
230 1e-55</FONT><FONT face="Times New Roman" size=3>
</FONT><FONT face=sans-serif size=2><BR>ref|NC_000001.8|NC_000001 Homo sapiens
chromosome 1, complete se... 224 8e-54</FONT><FONT
face="Times New Roman" size=3> </FONT><FONT face=sans-serif
size=2><BR>ref|NC_000013.9|NC_000013 Homo sapiens chromosome 13, complete s...
222 3e-53</FONT><FONT face="Times New Roman" size=3>
</FONT><FONT face=sans-serif size=2><BR>ref|NC_000005.8|NC_000005 Homo sapiens
chromosome 5, complete se... 214 8e-51</FONT><FONT
face="Times New Roman" size=3> </FONT><FONT face=sans-serif
size=2><BR>ref|NC_000009.9|NC_000009 Homo sapiens chromosome 9, complete se...
206 2e-48</FONT><FONT face="Times New Roman" size=3>
</FONT><FONT face=sans-serif size=2><BR>ref|NC_000008.9|NC_000008 Homo sapiens
chromosome 8, complete se... 202 3e-47</FONT><FONT
face="Times New Roman" size=3> </FONT><FONT face=sans-serif
size=2><BR>ref|NC_000023.8|NC_000023 Homo sapiens chromosome X, complete se...
198 5e-46</FONT><FONT face="Times New Roman" size=3>
</FONT><FONT face=sans-serif size=2><BR>ref|NC_000002.9|NC_000002 Homo sapiens
chromosome 2, complete se... 198 5e-46</FONT><FONT
face="Times New Roman" size=3> </FONT><FONT face=sans-serif
size=2><BR>ref|NC_000014.7|NC_000014 Homo sapiens chromosome 14, complete s...
194 7e-45</FONT><FONT face="Times New Roman" size=3>
<BR><BR></FONT><FONT face=sans-serif size=2><BR>No more hits jumping in near
the top... but we do have a hit that pops</FONT><FONT face="Times New Roman"
size=3> </FONT><FONT face=sans-serif size=2><BR>in between 9 and X
(8).</FONT><FONT face="Times New Roman" size=3> <BR></FONT><FONT
face=sans-serif size=2><BR>So what is going on? The same BLAST job on
the same machine ... just</FONT><FONT face="Times New Roman" size=3>
</FONT><FONT face=sans-serif size=2><BR>changing the -v and -b option to pick
the number of hits to display.</FONT><FONT face="Times New Roman" size=3>
</FONT><BR><BR></BLOCKQUOTE></BODY></HTML>