[Bioclusters] Apple/Genentech BLAST

Ivo Grosse bioclusters@bioinformatics.org
Fri, 24 May 2002 10:02:19 -0400


Hi Chris,

thanks for your detailed comments.


"Chris Dwan (CCGB)" <cdwan@mail.ahc.umn.edu> wrote on Thu, 23 May 2002:

> Ivo,
> 
> How is this benchmark useful?

See below.


> The number of available sequenced chromosomes grows very
> slowly. Unless I'm missing something really fundamental, 

See below.


> that means
> that the number of possible searches of this type must also grow
> fairly slowly (slow squared, in fact).

Slow squared can be fast.  :-)

For example, if slow = linear, then slow squared = quadratic.

But that is not my main point, see below.


> Anyway, assuming that this *is* the sort of job we want to run.  

We need to define what we mean by "the sort of job."

I mean that 90% of our cluster load comes from Blast.  That doesn't mean
that 90% of our jobs are Blast jobs, but it does mean that Blast jobs are
among the most time-consuming in our case.  Hence, it makes sense (for us)
to use "some sort" of Blast job as a benchmark.

Instead of chromosomes 21 and 22 and the pufferfish I could have picked
(almost) any other triple of sequences, but I picked chromosomes 21 and
22 versus the pufferfish for the following simple reasons:

- human and pufferfish have an evolutionary distance that is typical 
for our applications.  Right now we focus on human-mouse comparisons, 
but in the near future we will move to more distant (from human) 
organisms, and pufferfish is a good example in this respect.

- those sequences have lengths that seem well suited for a benchmark: 
the jobs run for a few hours, rather than just a few minutes or 
several days.

The first point indicates why the benchmark Blast comparison of human 
with chimpanzee (with word sizes up to 40) done by Apple is not very 
relevant for us, and why we would like to see a benchmark with two more 
distantly related organisms.


> Can we be smarter than just dumping in two sequences (one formatdb'd) and
> letting it run?  I would rather benchmark the smart way of doing it,
> at the very least.

Well, I didn't say we should do this comparison in a non-smart way.  
If you read my earlier posts, you will find that we always run Blast 
jobs in the mode you call the smart way.  I don't actually know whether 
it is smart or not, but we always cut the query sequence(s) into 
fragments of, say, 1001 kb, overlapping by 1 kb, and then merge the 
output at the end.
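For concreteness, the chunk coordinates we use can be computed as in the
following sketch (the function name and exact edge handling are mine, not
part of any of our scripts; sizes are the ones stated above):

```python
def chunk_coordinates(seq_len, chunk=1_001_000, overlap=1_000):
    """Yield (start, end) windows covering a sequence of length seq_len,
    each window 'chunk' bp long and overlapping its neighbor by 'overlap' bp.
    The last window is simply truncated at the end of the sequence."""
    step = chunk - overlap
    start = 0
    while True:
        end = min(start + chunk, seq_len)
        yield (start, end)
        if end == seq_len:
            break
        start += step
```

Each window is then written out as its own FASTA query, and the per-window
Blast reports are concatenated (with hit coordinates shifted by the window
start) to recover the full-chromosome result.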

Sorry for not having repeated this in my previous email, and sorry for 
all the confusion that this may have caused.


> The run you describe will return the top 500 (by default) local
> alignments between the two chromosomes.  

Oh, again I am sorry that I didn't specify all the details.  We always 
want "all" local alignments below the specified E-value, so we 
typically set the -b and -v flags to 50,000.  That is typically enough 
in our examples: if we found that the number of local alignments had 
reached 50,000, we would increase that number and rerun that particular 
Blast job, but so far that has never happened in our analyses.
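The "raise the cap and rerun" rule is trivial, but here it is spelled out
as a sketch.  `run_blast` is a hypothetical stand-in for whatever launches
the actual job (e.g. `blastall -p blastn -b <limit> -v <limit> -e <E> ...`)
and returns the number of reported local alignments:

```python
def blast_until_complete(run_blast, limit=50_000):
    """Run a Blast job with -b/-v set to `limit`; if the number of
    reported local alignments reaches the cap (so the output may be
    truncated), double the limit and rerun.  Returns (hits, final limit)."""
    while True:
        n_hits = run_blast(limit)
        if n_hits < limit:   # cap not reached, output is complete
            return n_hits, limit
        limit *= 2           # cap reached, raise it and rerun
```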


> represents a tiny fraction of the similarities that are interesting to
> the genomic scientists with whom I work.

Of course, you are right, sorry for the confusion, see above.


> An example:  one of my users really likes to make chromosome /
> chromosome similarity maps (sparse matrixes) for what he calls "genome
> archeology." This is THE classic reason for doing genome on genome
> BLASTs.  We do it by breaking each chromosome into overlapping chunks
> of arbitrary size (say, 10,000bp with 1,000bp overlap) and then doing
> an "all vs. all" set of BLASTs.

That is exactly what we do, see above and my previous posts, except 
that we choose larger chunk sizes, at least 101 kb and often 1001 kb.  
We also choose both the chunk size and the overlap size depending on 
the problem.
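Enumerating the jobs for such an all-vs-all comparison is straightforward;
here is a self-contained sketch using the chunk/overlap sizes from your
example (the function names are mine, purely for illustration):

```python
from itertools import product

def chunks(seq_len, size=10_000, overlap=1_000):
    """Overlapping (start, end) windows over a sequence of length seq_len."""
    step = size - overlap
    return [(s, min(s + size, seq_len))
            for s in range(0, max(seq_len - overlap, 1), step)]

def all_vs_all_jobs(len_a, len_b, size=10_000, overlap=1_000):
    """Every pair of windows, one from each sequence, is one Blast job;
    the resulting hit counts fill one cell of the sparse similarity matrix."""
    return list(product(chunks(len_a, size, overlap),
                        chunks(len_b, size, overlap)))
```

The number of jobs grows as the product of the two sequence lengths, which
is exactly why this kind of workload keeps a cluster busy.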


> This way of setting up the problem has three benefits:

Thanks for spelling this out in great detail.


> As a side benefit, benchmarking on this scenario maps *precisely*
> to a bioinformatic problem that *will* continue to grow without bound:

Okay, great, then we fully agree that "that sort of Blast job" could be 
a useful benchmark, so let's do it.  Didn't Jeff Bizzaro recently 
acquire a dual-G4 machine?


> An interesting (and more thorough) system benchmark would look for an
> optimal chunk size for the system in question.

Great point!  But of course that is problem-dependent!

Also, I forgot to mention that we usually Blast only repeat-masked 
sequences, which reduces the running time (and also the memory 
requirement) substantially.

Again, thanks for your great comments.


Best regards, 

Ivo