[Bioclusters] sensitivity & blast

Ian Korf iankorf at mac.com
Thu Apr 7 13:01:03 EDT 2005


I'd run WU-BLAST instead:

blastn genome 70mers W=15 M=1 N=-1 Q=3 R=1 S=50 kap V=0 B=10000000

W=15: Makes things a bit faster. 70-mers are expected to bind 
reasonably well, and requiring 15 bp of matching doesn't seem extreme. 
But if you think so, lower it.
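
To get a feel for what W=15 can cost in sensitivity, here is a rough Python sketch. It assumes that seeding needs one exact run of W matching bases and that the alignment is ungapped and spans the whole oligo; both are simplifications, and the function names are made up for illustration. The idea is pigeonhole: k mismatches split the alignment into at most k+1 runs of matches, so at least one run has ceil((length - k) / (k + 1)) bases.

```python
import math


def longest_guaranteed_run(length, mismatches):
    """Lower bound on the longest exact-match run in an ungapped alignment.

    `mismatches` mismatches cut the alignment into at most mismatches + 1
    runs of matches, so at least one run is this long (pigeonhole).
    """
    matches = length - mismatches
    return math.ceil(matches / (mismatches + 1))


def max_mismatches_always_seeded(length, word):
    """Largest mismatch count for which a run of `word` matches is guaranteed.

    Returns -1 if even a perfect match is shorter than the word size.
    """
    if length < word:
        return -1
    k = 0
    # The bound never increases as mismatches are added, so stop at the
    # first mismatch count that no longer guarantees a full word.
    while k + 1 <= length and longest_guaranteed_run(length, k + 1) >= word:
        k += 1
    return k


if __name__ == "__main__":
    print(max_mismatches_always_seeded(70, 15))
    print(longest_guaranteed_run(70, 10))
```

Under those assumptions a 70-mer hit is only certain to contain a 15 bp word with up to 3 mismatches. Hits with more mismatches can still be seeded when the mismatches are clustered, but it is no longer guaranteed, which is the argument for lowering W if that worries you.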

M=1 N=-1: Simple match/mismatch parameters that happen to work very 
well in a variety of applications. Makes calculation of S easier too.

Q=3 R=1: These are the gap penalties: 3 for the first gap character, 1 
for each that follows. WU-BLAST and NCBI-BLAST differ in the way they 
specify gaps. 3/1 in WU-BLAST is 2/1 in NCBI-BLAST, because NCBI-BLAST 
charges opening plus extension for the first gap character. I'm not 
sure how well these gap parameters simulate oligo annealing. But this 
is the goal, and there may be better values for Q, R, M, and N.
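
The conversion between the two gap conventions can be sketched in a few lines of Python. The function names are hypothetical, and the NCBI-BLAST open/extend pair is written here as (G, E); the only thing taken from the above is that Q is the cost of the first gap character and R the cost of each additional one.

```python
def wu_to_ncbi_gap(q, r):
    """Convert WU-BLAST (Q, R) to NCBI-BLAST style (open, extend).

    Q already includes the first gap character, so the NCBI-style
    opening cost is Q minus one extension.
    """
    return q - r, r


def gap_cost_wu(q, r, gap_len):
    """Cost of a gap of gap_len characters under the WU-BLAST convention."""
    return q + r * (gap_len - 1)


def gap_cost_ncbi(g, e, gap_len):
    """Cost of the same gap under the open-plus-extension convention."""
    return g + e * gap_len


if __name__ == "__main__":
    g, e = wu_to_ncbi_gap(3, 1)
    print(g, e)
    for n in range(1, 6):
        print(n, gap_cost_wu(3, 1, n), gap_cost_ncbi(g, e, n))
```

Both conventions charge the same total for any gap length, so the choice only affects how you type the parameters.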

S=50: Minimum score of an alignment is 50. Forget about E. You're doing 
a generic alignment and not a database search. E will be calculated for 
you. A score of 50 will allow 60 matches and 10 mismatches in a 70 bp 
alignment. Again, I'm not an expert on oligo hybridization. Use a value 
of S that makes sense. You can set it low so you don't miss anything 
but beware of runaway output. Try a variety of values on a sample of 
the data to see what cutoff of S works for you.
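
The arithmetic behind S is easy to check with a small Python sketch. It only covers ungapped alignments that span the full oligo, and the function names are invented for the example; M and N default to the +1/-1 values above.

```python
def ungapped_score(matches, mismatches, m=1, n=-1):
    """Score of an ungapped alignment with match reward m and mismatch penalty n."""
    return matches * m + mismatches * n


def max_mismatches_for_score(length, s, m=1, n=-1):
    """Most mismatches a full-length ungapped alignment can carry and still score >= s.

    Returns None if even a perfect match scores below s.
    """
    best = None
    for k in range(length + 1):
        if ungapped_score(length - k, k, m, n) >= s:
            best = k
    return best


if __name__ == "__main__":
    print(ungapped_score(60, 10))
    for s in (40, 50, 60):
        print(s, max_mismatches_for_score(70, s))
```

With M=1 N=-1 each mismatch costs two points relative to a perfect match (one lost match, one penalty), so a 70 bp alignment tolerates (70 - S) / 2 mismatches, rounded down.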

kap: Turns off combined statistical significance. You can't do this in 
NCBI-BLAST and it is critical in this situation where you don't want to 
evaluate multiple HSPs. Saves time too.

V=0 B=10000000: No one-line summaries but lots of alignments. You could 
do V=1000000 B=0 if you only want the score, but it's good to look at 
the alignments, especially while you're experimenting with parameters.
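
If you want to script the experimenting, something like the following Python sketch would build the command lines for a sweep over S. It only assembles the arguments and prints them; it does not run anything. The database name "genome" and query file "70mers" are the ones from the command above, while the function name and the S values in the loop are just placeholders.

```python
import shlex


def wu_blastn_argv(database, query, s, summaries=0, alignments=10000000):
    """Build a WU-BLAST blastn argument list using the options discussed above."""
    return [
        "blastn", database, query,
        "W=15", "M=1", "N=-1", "Q=3", "R=1",
        f"S={s}", "kap",
        f"V={summaries}", f"B={alignments}",
    ]


if __name__ == "__main__":
    # Placeholder cutoffs; pick values that make sense for your oligos.
    for s in (40, 50, 60):
        print(shlex.join(wu_blastn_argv("genome", "70mers", s)))
```

Run each printed command on a sample of the oligos and compare how much output you get at each cutoff before committing to one for the full set.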

BTW, I don't think your question was appropriate for this forum. Get 
the BLAST book (shameless plug I know). Your question and more are 
already answered there.

-Ian

On Apr 7, 2005, at 12:35 AM, L. Mui wrote:

> Chris and Pam,
>
> Thanks for your insights in the emails.
>
> About what we are trying to do: we are trying to select 70mer DNA oligos for
> microarrays.  We try to select the "best" oligo set which (1) minimizes
> cross-hybridization with non-self seq in genome while (2) maximizing target
> binding.
>
> The troubling point which led to my earlier question is:
>
> (1) from results based on feeding query sequences of varying length to
> blastall, we select 70mers based on the 2 goals above
>
> (2) when we feed the 70mers into blastall again, we get different HSPs when
> the e-value is fixed at the default 10.
>
> From your feedback, to remove the dependence on the input size, setting the
> "-Y" value seems to be a sensible approach.  Won't this restriction of
> search space reduce the probability of finding the best HSPs?
>
> Also: because we know the expected E value depends on (Kmn)(exp(-lambda*S)),
> why not find a base E for a given query length, and then vary the (-e) value
> by mE ?
>
> Chris, you mentioned that there are other tools we should look at.  Please
> advise on this.
>
>                   Lik
>
>
> Quoting Chris Dwan <cdwan at bioteam.net>:
>>> Could you suggest whether we are on the right track?  What is the right
>>> approach to set a uniform sensitivity for all inputs?
>>
>> E-values already incorporate statistics to eliminate (normalize for) a
>> number of factors, including query size.  Getting rid of that
>> normalization is possible, but not necessarily a good idea unless you
>> know exactly what you're doing.
>>
>> E values for identical HSPs grow with the product of the sizes of the
>> query and the target set.  The rationale is that the same hit will be
>> more and more likely to occur by random chance in a larger sample of
>> sequence.  Said HSPs will be less and less statistically interesting as
>> the query and the target set grow.
>>
>> This leads to your observation that you must increase the E-value
>> threshold to keep getting the same hits.
>>
>> The question you seem to be asking is "find me all of the HSPs that fit
>> some criterion, regardless of their statistical significance."  The
>> question that BLAST is designed to answer is "find me most of the
>> statistically significant HSPs for some particular search, and extend
>> them to build up gapped local alignments."
>>
>> If you're willing to share your goal in running these searches, the
>> list might be able to suggest alternative tools better suited to your
>> problem.
>>
>> -Chris Dwan
>>   The BioTeam
>>
>>


