[Bioclusters] BLAST performance for 70mer testing

Thu Apr 7 07:41:41 EDT 2005

Hi,

 You could try BLAT, or switch to one-step BLAST, perhaps with a longer
word length.

  BTW, I found that when BLAST runs out of hits to report, it's usually a
repetitive sequence. Since sequences like that should never make it into a
microarray anyhow, have you considered screening your 70mers for repeats?
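
  A rough sketch of one way to do such a screen: the Python below flags an
oligo when it contains a long single-base run or when its dinucleotide
composition has low Shannon entropy. The function names and both cutoffs
(max_run, min_entropy) are illustrative assumptions, not values from this
thread, and the check is much cruder than a dedicated masking tool: it will
miss longer-period tandem repeats and interspersed genomic repeats.

```python
from collections import Counter
from math import log2


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of a single repeated base."""
    best = run = 0
    prev = ""
    for base in seq.upper():
        run = run + 1 if base == prev else 1
        prev = base
        best = max(best, run)
    return best


def dinucleotide_entropy(seq: str) -> float:
    """Shannon entropy (bits) of overlapping dinucleotides; at most 4.0 for DNA."""
    seq = seq.upper()
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(pairs.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * log2(c / total) for c in pairs.values())


def looks_repetitive(seq: str, max_run: int = 8, min_entropy: float = 3.0) -> bool:
    """Flag an oligo as repetitive. Both cutoffs are arbitrary placeholders."""
    return longest_homopolymer(seq) > max_run or dinucleotide_entropy(seq) < min_entropy
```

  A 70mer made of "AC" repeated 35 times has only two distinct dinucleotides,
so its entropy is about 1 bit and it would be flagged.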

  Eitan

--------------------
Eitan Rubin, PhD
Head of Bioinformatics
The Bauer Center for Genomics Research
Harvard University
Tel: 617-496-5649 Fax: 617-495-2196

-----Original Message-----
From: bioclusters-request at bioinformatics.org
[mailto:bioclusters-request at bioinformatics.org] 
Sent: Thursday, April 07, 2005 3:44 AM
To: bioclusters at bioinformatics.org
Subject: Bioclusters Digest, Vol 6, Issue 3

Today's Topics:

   1. sensitivity & blast (L. Mui)
   2. Re: sensitivity & blast (Chris Dwan)
   3. RE: sensitivity & blast (Pamela Culpepper)
   4. RE: sensitivity & blast (Pamela Culpepper)
   5. RE: sensitivity & blast (Pamela Culpepper)
   6. Re: sensitivity & blast (Chris Dwan)
   7. Re: sensitivity & blast (Pamela Culpepper)
   8. Re: sensitivity & blast (L. Mui)

----------------------------------------------------------------------

Message: 1
Date: Wed,  6 Apr 2005 10:47:19 -0700
From: "L. Mui" <lmui at stanford.edu>
Subject: [Bioclusters] sensitivity & blast
To: bioclusters at bioinformatics.org
Cc: lmui at stanford.edu
Message-ID: <1112809639.425420a78a5a7 at webmail.stanford.edu>
Content-Type: text/plain; charset=ISO-8859-1

Hello,

We ran into an issue involving blastall, which I suspect folks on this list
might know the answer to.  (I am fairly new to using blastall.)  Blastall
seems to be sensitive to the input sequence size in detecting HSPs. In other
words, depending on the length of the input, it sometimes does not report all
HSPs (even with very large -b and -v).

We want to standardize blastall across all input sizes.  I am trying out the
following 2 methods, both of which seem to elicit the "right" results:

(1) modifying the "-e" e-value threshold by the input size
    e.g., if m = input sequence size, run blastall with
          "-e 10m"
    rationale: the E-value is a function of (mn)

(2) fixing the search space (-Y): which seems to fix some statistical
parameters for blastall's calculations
    e.g., "-Y 168000000000" for a human genome target
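
As a sketch of how the two approaches translate into command lines, the
Python helper below builds a blastall argument list either with -e scaled by
the query length (reading "-e 10m" as 10 times m) or with a fixed -Y. The
helper name, the file name, and the database name are hypothetical; -p, -d,
-i, -e and -Y are options of the legacy blastall program.

```python
def blastall_args(query_fasta, database, query_len, fixed_space=None, base_e=10.0):
    """Return an argv list for a legacy NCBI blastall (blastn) run.

    With fixed_space set, use approach (2): pin the search space with -Y and
    keep -e constant. Otherwise use approach (1): scale -e by the query length.
    """
    args = ["blastall", "-p", "blastn", "-d", database, "-i", query_fasta]
    if fixed_space is not None:
        # Approach (2): same search space for every query, constant threshold.
        args += ["-e", str(base_e), "-Y", str(int(fixed_space))]
    else:
        # Approach (1): threshold grows linearly with the query length m.
        args += ["-e", str(base_e * query_len)]
    return args


# Hypothetical file and database names, for illustration only.
scaled = blastall_args("oligos.fa", "human_genome", query_len=70)
pinned = blastall_args("oligos.fa", "human_genome", query_len=70,
                       fixed_space=168000000000)
```

The lists can be passed to subprocess.run, or joined with spaces for a
shell script.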

Could you suggest whether we are on the right track?  What is the right
approach to set a uniform sensitivity for all inputs?

Many thanks for your help in advance.

           Lik Mui

------------------------------

Message: 2
Date: Wed, 6 Apr 2005 14:24:54 -0400
From: Chris Dwan <cdwan at bioteam.net>
Subject: Re: [Bioclusters] sensitivity & blast
To: "Clustering,	compute farming & distributed computing in life
	science informatics"	<bioclusters at bioinformatics.org>
Cc: lmui at stanford.edu
Message-ID: <bb0671a5b3067cf443adaf2801c2ca57 at bioteam.net>
Content-Type: text/plain; charset=US-ASCII; format=flowed

> Could you suggest whether we are on the right track?  What is the right
> approach to set a uniform sensitivity for all inputs?

E-values already incorporate statistics to eliminate (normalize for) a 
number of factors, including query size.  Getting rid of that 
normalization is possible, but not necessarily a good idea unless you 
know exactly what you're doing.

E values for identical HSPs grow with the product of the sizes of the 
query and the target set.  The rationale is that the same hit will be 
more and more likely to occur by random chance in a larger sample of 
sequence.  Said HSPs will be less and less statistically interesting as 
the query and the target set grow.

This leads to your observation that you must increase the E-value 
threshold to keep getting the same hits.

The question you seem to be asking is "find me all of the HSPs that fit 
some criterion, regardless of their statistical significance."  The 
question that BLAST is designed to answer is "find me most of the 
statistically significant HSPs for some particular search, and extend 
them to build up gapped local alignments."

If you're willing to share your goal in running these searches, the 
list might be able to suggest alternative tools better suited to your 
problem.

-Chris Dwan
  The BioTeam

------------------------------

Message: 3
Date: Wed, 06 Apr 2005 19:23:52 +0000
From: "Pamela Culpepper" <pculpep at hotmail.com>
Subject: RE: [Bioclusters] sensitivity & blast
To: bioclusters at bioinformatics.org
Message-ID: <BAY20-F55F0C8A83DA71AF477E57A03D0 at phx.gbl>
Content-Type: text/plain; format=flowed

Lik Mui,

Set the Y value somewhat larger than the product of your two sequence 
lengths.

Pam

------------------------------

Message: 4
Date: Wed, 06 Apr 2005 20:19:21 +0000
From: "Pamela Culpepper" <pculpep at hotmail.com>
Subject: RE: [Bioclusters] sensitivity & blast
To: bioclusters at bioinformatics.org
Message-ID: <BAY20-F171ED04CDE685F1EBE3C2FA03D0 at phx.gbl>
Content-Type: text/plain; format=flowed

Lik Mui,

We built a test case of the Y value.

The Y value is the database represented as one big sequence.  This, then, 
directly affects the -e, or expectation value.

You are on the right path.

Pam

------------------------------

Message: 5
Date: Wed, 06 Apr 2005 20:39:12 +0000
From: "Pamela Culpepper" <pculpep at hotmail.com>
Subject: RE: [Bioclusters] sensitivity & blast
To: bioclusters at bioinformatics.org
Message-ID: <BAY20-F29CE11038D76FB0D97EB1AA03D0 at phx.gbl>
Content-Type: text/plain; format=flowed

Lik Mui,

More testing reveals that the -Y option works as follows.

In the absence of -Y, the "effective search space" is the product of the
query sequence length and the total database length.  It affects the
calculation of the expectation value but not the score.  It will thus vary
with the query sequence length.

Using "-Y 12345" sets the above "effective search space" to 12345, constant
for each query sequence.  To make the expectation roughly reflect match
length, one could use a -Y value that is the product of the database size
and the longest query sequence size.
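
A minimal sketch of that last calculation, assuming the queries sit in a
FASTA file whose text has already been read and that the total database
length in bases is known (both function names are made up for this example):

```python
def fasta_lengths(text):
    """Map each FASTA record ID to its sequence length."""
    lengths = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def fixed_search_space(query_fasta_text, database_length):
    """Value for -Y: database length times the longest query length."""
    longest = max(fasta_lengths(query_fasta_text).values())
    return database_length * longest
```

For a longest query of 70 bases and a database length of 2,400,000,000
this gives 168,000,000,000, which coincides with the -Y figure quoted
earlier in the thread.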

Pam and Bill

------------------------------

Message: 6
Date: Wed, 6 Apr 2005 16:58:36 -0400
From: Chris Dwan <cdwan at bioteam.net>
Subject: Re: [Bioclusters] sensitivity & blast
To: "Clustering,	compute farming & distributed computing in life
	science informatics"	<bioclusters at bioinformatics.org>
Message-ID: <8d2973588d8a01fc857bb48087e970d3 at bioteam.net>
Content-Type: text/plain; charset=US-ASCII; format=flowed

BLAST is not a black box, and its function need not be determined by 
experiment:

- An excellent reference on the algorithm:  
http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
- The source code:  ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools/ncbi.tar.Z
- O'Reilly published an entire book on BLAST, whose author is active on 
this list.

Yes, the search space defaults to the product of the query length (m) 
and the target set length (n).  The -Y option overrides that search 
space.

Alignment Score depends only on the alignments and the substitution 
matrix.
Bit score normalizes for values specific to the substitution matrix.
Expect value normalizes out query and target set size.
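
The relationship among these three quantities can be sketched with the
Karlin-Altschul formulas. This is an illustration, not the BLAST source: it
ignores the length correction BLAST applies to m and n, and lam and k stand
for the lambda and K parameters that BLAST prints at the end of each report.

```python
from math import exp, log


def bit_score(raw_score, lam, k):
    """Normalize a raw alignment score: S' = (lambda*S - ln K) / ln 2."""
    return (lam * raw_score - log(k)) / log(2)


def expect_value(raw_score, search_space, lam, k):
    """Karlin-Altschul expectation: E = K * (m*n) * exp(-lambda*S)."""
    return k * search_space * exp(-lam * raw_score)


def expect_from_bits(bits, search_space):
    """The same E-value written in terms of the bit score: E = (m*n) * 2**(-S')."""
    return search_space * 2.0 ** (-bits)
```

The raw score and the bit score of an HSP do not change when the search
space changes; only E does, in direct proportion to m*n.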

Keep in mind as well:  BLAST is a heuristic algorithm with no 
knowledge of any structure beyond primary sequence.  If increased 
sensitivity is the goal, you will get much greater mileage by using an 
algorithm which takes structure into account, or one which utilizes 
more than pairwise alignments.

However, taken very literally, your answer is correct.  If the goal is 
to remove query length as a factor in E value, the "-Y" option is the 
way to go.

-Chris Dwan
  The BioTeam

------------------------------

Message: 7
Date: Wed, 06 Apr 2005 21:27:49 +0000
From: "Pamela Culpepper" <pculpep at hotmail.com>
Subject: Re: [Bioclusters] sensitivity & blast
To: bioclusters at bioinformatics.org
Message-ID: <BAY20-F14ED9E31AAA265B1069B38A03D0 at phx.gbl>
Content-Type: text/plain; format=flowed

Chris,

You might be interested in what we are working on --

http://www.lifeformulae.com

Pam

------------------------------

Message: 8
Date: Thu,  7 Apr 2005 00:35:33 -0700
From: "L. Mui" <lmui at stanford.edu>
Subject: Re: [Bioclusters] sensitivity & blast
To: Chris Dwan <cdwan at bioteam.net>, pculpep at hotmail.com
Cc: "Clustering,	compute farming & distributed computing in life
	science informatics"	<bioclusters at bioinformatics.org>
Message-ID: <1112859333.4254e2c520184 at webmail.stanford.edu>
Content-Type: text/plain; charset=ISO-8859-1

Chris and Pam,

Thanks for your insights in the emails.

About what we are trying to do: we are trying to select 70mer DNA oligos for
microarrays.  We try to select the "best" oligo set which (1) minimizes
cross-hybridization with non-self sequences in the genome while (2)
maximizing target binding.

The troubling point which led to my earlier question is:

(1) from results based on feeding query sequences of varying length to
blastall, we select 70mers based on the 2 goals above

(2) when we feed the 70mers into blastall again, we get different HSPs when
the e-value is fixed at the default 10.

From your feedback, setting the "-Y" value seems to be a sensible approach to
removing the dependence on the input size.  Won't this restriction of the
search space reduce the probability of finding the best HSPs?

Also: because we know the E-value depends on K*m*n*exp(-lambda*S), why not
find a base E for a given query length, and then vary the (-e) value by mE ?
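
A sketch of that idea, under the assumption that "base E" means the E-value
contributed per query base at a chosen raw-score cutoff (the function names
and the cutoff are hypothetical, and lam and k stand for the lambda and K
parameters of the scoring system):

```python
from math import exp


def base_expect(raw_cutoff, database_length, lam, k):
    """E per query base at the score cutoff: K * n * exp(-lambda*S)."""
    return k * database_length * exp(-lam * raw_cutoff)


def e_threshold(query_length, raw_cutoff, database_length, lam, k):
    """-e value that admits the same raw-score cutoff for a query of length m."""
    return query_length * base_expect(raw_cutoff, database_length, lam, k)
```

With this scaling a 140-base query gets twice the -e value of a 70-base
query, so the same raw score passes the threshold in both searches.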

Chris, you mentioned that there are other tools we should look at.  Please
advise on this.

                  Lik

------------------------------

_______________________________________________
Bioclusters maillist  -  Bioclusters at bioinformatics.org
https://bioinformatics.org/mailman/listinfo/bioclusters

End of Bioclusters Digest, Vol 6, Issue 3
*****************************************