[Bioclusters] help with benchmarks etc.

Mon, 23 Sep 2002 10:24:33 +0000

I just joined the mailing list after reading the archives. Great 
reading. I have a favor to ask the group. I'm writing a book 
about BLAST and one of the sections is titled "Industrial 
Strength BLAST", which covers high throughput considerations 
rather than optimal search parameters (things like hardware 
configurations and clustering - the kinds of things discussed on 
this list). For those interested, there are a couple of 
experiments I could use some help with.

(1) Benchmarking is always controversial. This is probably 
especially true for BLAST because people have different needs. 
That said, I think a few real world examples with actual numbers 
would help people make sound decisions when purchasing hardware. 
I don't have convenient access to that many different machines, 
so I'm asking (maybe begging) for a little help. I'd like to 
propose a couple of tests, but before I do, I think it is only 
reasonable that (a) these experiments are "owned" by this group, 
and the book will credit it appropriately, and (b) you don't 
participate in the tests if doing so would violate some kind of 
"no benchmark" contract you may have with a vendor.

(1.1) The first test is to search the Pfam globin family against 
itself using default parameters. There are 1203 sequences in the 
family. You can find the file at 
http://dna.cs.wustl.edu/globins.gz. I'm using WU-BLAST with the 
following command line.

time blastp globins globins V=1203 B=1203 cpus=1 filter=seg+xnu > /dev/null

Notes: I'm setting the CPU count to 1. Also, although I'm using 
WU-BLAST here, if more people are using NCBI-BLAST, I'd like to 
report that instead. This is not a bake-off of NCBI-BLAST vs. 
WU-BLAST. People have their preferences, and I'm only going to 
include one or the other in the book.

This test isn't realistic in the sense that most of the 
sequences are going to match each other, but the data set is 
small enough that the burden shouldn't be too great for anyone. 
It will probably take somewhere between 5 and 25 minutes, 
depending on your hardware.
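
To make reported times easier to compare, a small wrapper along 
these lines could be used. This is only a sketch: the function 
name bench_report and the tab-separated output format are made 
up for illustration and are not part of either BLAST package, 
and it assumes a date command that supports +%s.

```shell
# bench_report LABEL COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND once, discards its output, and prints
# LABEL, the command's exit status, and the elapsed wall-clock seconds,
# separated by tabs.
bench_report() {
    label=$1
    shift
    start=$(date +%s)          # assumes date supports +%s (GNU and BSD do)
    status=0
    "$@" > /dev/null 2>&1 || status=$?
    end=$(date +%s)
    printf '%s\t%s\t%s\n' "$label" "$status" "$((end - start))"
}
```

For the test above it might be called as 
"bench_report globins-1cpu blastp globins globins V=1203 B=1203 
cpus=1 filter=seg+xnu", assuming the globins file has already 
been unpacked and formatted as a protein database. Unlike time, 
it reports wall-clock seconds only, not user or system time.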

(1.2) I'd like the second test to be a BLASTN search of some 
kind. This will require a larger database, and I think I'll 
keep the same all-vs-all approach. If the response to the first 
experiment is good, I'll post another database. If not, I'll go 
sulk and do whatever experiments I can.

(1.3) Is there another test anyone can think of that would be 
simple enough for lots of people to run? If we could come up 
with a suite of reasonable tests, it might be nice to have a 
"spec-BLAST" benchmark. One could also try tests on more than 
one CPU to show how an entire system performs. It sounds like a 
fun paper to write and a great resource, but it's beyond the 
scope of the book. Any takers?
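
As a rough illustration of the multi-CPU idea, the sketch below 
generates one command line per CPU count, reusing the options 
from test (1.1) and changing only cpus=. The function name 
cpu_sweep_commands is hypothetical, and the CPU counts to try 
are whatever your machine supports.

```shell
# cpu_sweep_commands N...
# Hypothetical helper: prints one WU-BLAST command line per CPU count
# given as an argument, varying only the cpus= option from test (1.1).
cpu_sweep_commands() {
    for n in "$@"; do
        printf 'blastp globins globins V=1203 B=1203 cpus=%s filter=seg+xnu\n' "$n"
    done
}
```

For example, "cpu_sweep_commands 1 2 4" prints three command 
lines, each of which can then be timed separately.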

(2) There are differences in operating systems and compilers 
too. If the tests above could be run on identical hardware but 
with different operating systems or compilers, the results would 
be a valuable resource.
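
For comparisons like this, each timing would need to be reported 
with a description of the platform. A minimal sketch, assuming 
uname is available and that the compiler, if any, is installed 
as cc and accepts --version (true of gcc, but not of every 
compiler); the function name platform_line is hypothetical.

```shell
# platform_line
# Hypothetical helper: prints OS name, release, and machine type on one
# line, then the first line of the compiler's version banner if a
# compiler named cc is found on the PATH.
platform_line() {
    printf '%s %s %s\n' "$(uname -s)" "$(uname -r)" "$(uname -m)"
    if command -v cc > /dev/null 2>&1; then
        # Assumption: cc understands --version; errors are silenced.
        cc --version 2> /dev/null | head -n 1
    fi
}
```

The compiler line is only meaningful if BLAST was built locally 
with that compiler; for a prebuilt binary, the OS line is the 
part that matters.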

(3) This isn't an experiment. Is there a favorite BLAST-based 
question you'd like to see answered in a book? Perhaps something 
you already know about that is often puzzling to the 
inexperienced?

Thanks,

Ian Korf