Hi,

You can try looking at BLAT, or you can switch to one-step BLAST, perhaps
with a longer word length.

BTW, I found that when BLAST runs out of hits to report, it's usually
because of a repetitive sequence. Since sequences like that should never
make it into a microarray anyhow, have you considered screening your
70-mers for repeats?

Eitan

--------------------
Eitan Rubin, PhD
Head of Bioinformatics
The Bauer Center for Genomics Research
Harvard University
Tel: 617-496-5649
Fax: 617-495-2196

-----Original Message-----
From: bioclusters-request at bioinformatics.org
[mailto:bioclusters-request at bioinformatics.org]
Sent: Thursday, April 07, 2005 3:44 AM
To: bioclusters at bioinformatics.org
Subject: Bioclusters Digest, Vol 6, Issue 3

Send Bioclusters mailing list submissions to
    bioclusters at bioinformatics.org

To subscribe or unsubscribe via the World Wide Web, visit
    https://bioinformatics.org/mailman/listinfo/bioclusters
or, via email, send a message with subject or body 'help' to
    bioclusters-request at bioinformatics.org

You can reach the person managing the list at
    bioclusters-owner at bioinformatics.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Bioclusters digest..."

Today's Topics:

   1. sensitivity & blast (L. Mui)
   2. Re: sensitivity & blast (Chris Dwan)
   3. RE: sensitivity & blast (Pamela Culpepper)
   4. RE: sensitivity & blast (Pamela Culpepper)
   5. RE: sensitivity & blast (Pamela Culpepper)
   6. Re: sensitivity & blast (Chris Dwan)
   7. Re: sensitivity & blast (Pamela Culpepper)
   8. Re: sensitivity & blast (L. Mui)

----------------------------------------------------------------------

Message: 1
Date: Wed, 6 Apr 2005 10:47:19 -0700
From: "L. Mui" <lmui at stanford.edu>
Subject: [Bioclusters] sensitivity & blast
To: bioclusters at bioinformatics.org
Cc: lmui at stanford.edu
Message-ID: <1112809639.425420a78a5a7 at webmail.stanford.edu>
Content-Type: text/plain; charset=ISO-8859-1

Hello,

We ran into an issue involving blastall, which I suspect folks on this list
might know the answer to. (I am fairly new to using blastall.) Blastall
seems to be sensitive to the input sequence size in detecting HSPs. In other
words, depending on the length of the input, it sometimes does not report
all HSPs (even with very large -b and -v).

We want to standardize blastall across all input sizes. I am trying out the
following 2 methods, both of which seem to elicit the "right" results:

(1) modifying the "-e" e-value threshold by the input size
    e.g., if m = input sequence size, run blastall with "-e 10m"
    rationale: the E-value is a function of (mn)

(2) fixing the search space (-Y), which seems to fix some statistical
    parameters for blastall's calculations
    e.g., "-Y 168000000000" for a human genome target

Could you suggest whether we are on the right track? What is the right
approach to set a uniform sensitivity for all inputs?

Many thanks for your help in advance.

Lik Mui

------------------------------

Message: 2
Date: Wed, 6 Apr 2005 14:24:54 -0400
From: Chris Dwan <cdwan at bioteam.net>
Subject: Re: [Bioclusters] sensitivity & blast
To: "Clustering, compute farming & distributed computing in life
    science informatics" <bioclusters at bioinformatics.org>
Cc: lmui at stanford.edu
Message-ID: <bb0671a5b3067cf443adaf2801c2ca57 at bioteam.net>
Content-Type: text/plain; charset=US-ASCII; format=flowed

> Could you suggest whether we are on the right track? What is the right
> approach to set a uniform sensitivity for all inputs?

E-values already incorporate statistics to eliminate (normalize for) a
number of factors, including query size.
Getting rid of that normalization is possible, but not necessarily a good
idea unless you know exactly what you're doing.

E values for identical HSPs grow with the product of the sizes of the
query and the target set. The rationale is that the same hit will be
more and more likely to occur by random chance in a larger sample of
sequence. Said HSPs will be less and less statistically interesting as
the query and the target set grow.

This leads to your observation that you must increase the E-value
threshold to keep getting the same hits.

The question you seem to be asking is "find me all of the HSPs that fit
some criterion, regardless of their statistical significance." The
question that BLAST is designed to answer is "find me most of the
statistically significant HSPs for some particular search, and extend
them to build up gapped local alignments."

If you're willing to share your goal in running these searches, the
list might be able to suggest alternative tools better suited to your
problem.

-Chris Dwan
 The BioTeam

------------------------------

Message: 3
Date: Wed, 06 Apr 2005 19:23:52 +0000
From: "Pamela Culpepper" <pculpep at hotmail.com>
Subject: RE: [Bioclusters] sensitivity & blast
To: bioclusters at bioinformatics.org
Message-ID: <BAY20-F55F0C8A83DA71AF477E57A03D0 at phx.gbl>
Content-Type: text/plain; format=flowed

Lik Mui,

Set the Y value somewhat larger than the product of your two sequence
lengths.

Pam

------------------------------

Message: 4
Date: Wed, 06 Apr 2005 20:19:21 +0000
From: "Pamela Culpepper" <pculpep at hotmail.com>
Subject: RE: [Bioclusters] sensitivity & blast
To: bioclusters at bioinformatics.org
Message-ID: <BAY20-F171ED04CDE685F1EBE3C2FA03D0 at phx.gbl>
Content-Type: text/plain; format=flowed

Lik Mui,

We built a test case of the Y value. The Y value is the database
represented as one big sequence. This, then, directly affects the -e, or
expectation value. You are on the right path.

Pam

------------------------------

Message: 5
Date: Wed, 06 Apr 2005 20:39:12 +0000
From: "Pamela Culpepper" <pculpep at hotmail.com>
Subject: RE: [Bioclusters] sensitivity & blast
To: bioclusters at bioinformatics.org
Message-ID: <BAY20-F29CE11038D76FB0D97EB1AA03D0 at phx.gbl>
Content-Type: text/plain; format=flowed

Lik Mui,

More testing reveals that the -Y option works as follows.

In the absence of -Y, the "effective search space" is the product of the
query sequence length and the total database length. It affects the
calculation of the expectation value but not the score. It will thus vary
with the query sequence length.

Using "-Y 12345" sets the above "effective search space" to 12345,
constant for each query sequence. To make the expectation roughly reflect
match length, one could use a -Y value that is the product of the database
size and the longest query sequence size.

Pam and Bill

------------------------------

Message: 6
Date: Wed, 6 Apr 2005 16:58:36 -0400
From: Chris Dwan <cdwan at bioteam.net>
Subject: Re: [Bioclusters] sensitivity & blast
To: "Clustering, compute farming & distributed computing in life
    science informatics" <bioclusters at bioinformatics.org>
Message-ID: <8d2973588d8a01fc857bb48087e970d3 at bioteam.net>
Content-Type: text/plain; charset=US-ASCII; format=flowed

BLAST is not a black box, and its function need not be determined by
experiment:

- An excellent reference on the algorithm:
  http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
- The source code: ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools/ncbi.tar.Z
- O'Reilly published an entire book on BLAST, whose author is active on
  this list.
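As an illustration of the "effective search space" behaviour described in
Message 5, here is a minimal Python sketch. It assumes the simplified
relation E = K * (search space) * exp(-lambda * S) and ignores the length
corrections that real BLAST applies. The values of K, lambda, the database
size and the score are made up for demonstration, and the function names
are hypothetical; none of this is taken from the blastall source.

```python
import math

# Assumed, illustrative statistical parameters. Real values depend on the
# scoring system and are printed at the end of a BLAST report.
K = 0.711
LAMBDA = 1.37


def expect_value(raw_score, search_space, k=K, lam=LAMBDA):
    """Simplified expect value: E = K * search_space * exp(-lambda * S)."""
    return k * search_space * math.exp(-lam * raw_score)


def default_search_space(query_len, db_len):
    """Without -Y, the search space is roughly query length * database length."""
    return query_len * db_len


DB_LEN = 3_000_000_000   # assumed database size in bases
SCORE = 30               # the same HSP raw score in every search
# Fixed search space, as with -Y: database size * longest query length.
FIXED_SPACE = default_search_space(7000, DB_LEN)

for query_len in (70, 700, 7000):
    e_default = expect_value(SCORE, default_search_space(query_len, DB_LEN))
    e_fixed = expect_value(SCORE, FIXED_SPACE)
    print(f"query={query_len:5d}  E(default)={e_default:.3g}  E(fixed)={e_fixed:.3g}")
```

Under these assumptions, the same HSP gets an E value that grows linearly
with query length by default, while a fixed search space gives it the same
E value for every query.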
Yes, the search space defaults to the product of the query length (m) and
the target set length (n). The -Y option overrides that search space.

Alignment score depends only on the alignments and the substitution matrix.
Bit score normalizes for values specific to the substitution matrix.
Expect value normalizes out query and target set size.

Keep in mind as well: BLAST is a heuristic algorithm with no knowledge of
any structure beyond primary sequence. If increased sensitivity is the
goal, you will get much greater mileage by using an algorithm which takes
structure into account, or one which utilizes more than pairwise
alignments.

However, taken very literally, your answer is correct. If the goal is to
remove query length as a factor in E value, the "-Y" option is the way to
go.

-Chris Dwan
 The BioTeam

------------------------------

Message: 7
Date: Wed, 06 Apr 2005 21:27:49 +0000
From: "Pamela Culpepper" <pculpep at hotmail.com>
Subject: Re: [Bioclusters] sensitivity & blast
To: bioclusters at bioinformatics.org
Message-ID: <BAY20-F14ED9E31AAA265B1069B38A03D0 at phx.gbl>
Content-Type: text/plain; format=flowed

Chris,

You might be interested in what we are working on --
http://www.lifeformulae.com

Pam

------------------------------

Message: 8
Date: Thu, 7 Apr 2005 00:35:33 -0700
From: "L. Mui" <lmui at stanford.edu>
Subject: Re: [Bioclusters] sensitivity & blast
To: Chris Dwan <cdwan at bioteam.net>, pculpep at hotmail.com
Cc: "Clustering, compute farming & distributed computing in life
    science informatics" <bioclusters at bioinformatics.org>
Message-ID: <1112859333.4254e2c520184 at webmail.stanford.edu>
Content-Type: text/plain; charset=ISO-8859-1

Chris and Pam,

Thanks for your insights in the emails.

About what we are trying to do: we are trying to select 70mer DNA oligos
for microarrays. We try to select the "best" oligo set which (1) minimizes
cross-hybridization with non-self sequence in the genome while
(2) maximizing target binding.

The troubling point which led to my earlier question is:

(1) from results based on feeding query sequences of varying length to
    blastall, we select 70mers based on the 2 goals above
(2) when we feed the 70mers into blastall again, we get different HSPs
    when the e-value is fixed at the default 10.

From your feedback, to remove the dependence on the input size, setting
the "-Y" value seems to be a sensible approach. Won't this restriction of
search space reduce the probability of finding the best HSPs?
Also: because we know the expect (E) value depends on
(Kmn)(exp(-lambda*S)), why not find a base E for a given query length, and
then vary the -e value by mE?

Chris, you mentioned that there are other tools we should look at. Please
advise on this.

Lik

------------------------------

_______________________________________________
Bioclusters maillist - Bioclusters at bioinformatics.org
https://bioinformatics.org/mailman/listinfo/bioclusters

End of Bioclusters Digest, Vol 6, Issue 3
*****************************************
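A rough sketch of the alternative raised in Message 8, scaling the -e
cutoff in proportion to query length, is given below in Python. It relies
on the same simplified model, E = K * m * n * exp(-lambda * S), with
made-up values for K, lambda and the database size, and hypothetical
function names. Under that model a threshold of base_e * m / base_m admits
exactly the scores that a fixed search space of base_m * n would admit at
base_e, so the two approaches should select the same HSPs, at least as far
as this approximation holds.

```python
import math

# Assumed, illustrative statistical parameters (not taken from a real run).
K = 0.711
LAMBDA = 1.37


def scaled_threshold(base_e, base_query_len, query_len):
    """Scale the -e cutoff linearly with query length."""
    return base_e * query_len / base_query_len


def min_reported_score(threshold, query_len, db_len, k=K, lam=LAMBDA):
    """Smallest raw score S for which K*m*n*exp(-lambda*S) <= threshold."""
    return math.log(k * query_len * db_len / threshold) / lam


DB_LEN = 3_000_000_000   # assumed database size in bases
BASE_E = 10.0            # cutoff chosen for the base query length
BASE_LEN = 70            # base query length (a 70mer)

for query_len in (70, 700, 7000):
    fixed_cutoff = min_reported_score(BASE_E, query_len, DB_LEN)
    scaled_cutoff = min_reported_score(
        scaled_threshold(BASE_E, BASE_LEN, query_len), query_len, DB_LEN
    )
    print(f"query={query_len:5d}  min score at -e 10: {fixed_cutoff:.2f}  "
          f"min score at scaled -e: {scaled_cutoff:.2f}")
```

With a fixed -e the minimum reportable score rises as the query gets
longer, whereas with the scaled cutoff it stays at the value obtained for
the 70mer.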