[Bioclusters] help with benchmarks etc.
Ian Korf
bioclusters@bioinformatics.org
Mon, 23 Sep 2002 10:24:33 +0000
I just joined the mailing list after reading the archives. Great
reading. I have a favor to ask the group. I'm writing a book
about BLAST and one of the sections is titled "Industrial
Strength BLAST", which covers high-throughput considerations
rather than optimal search parameters (things like hardware
configurations and clustering - the kinds of things discussed on
this list). There are a couple of experiments I could use some
help with, for those who are interested.
(1) Benchmarking is always controversial. This is probably
especially true for BLAST because people have different needs.
That said, I think a few real-world examples with actual numbers
would help people make sound decisions when purchasing hardware.
I don't have convenient access to that many different machines,
so I'm asking (maybe begging) for a little help. I'd like to
propose a couple of tests, but before I do, I think it only
reasonable that (a) these experiments are "owned" by this
group, with the book making appropriate reference, and (b) you
don't participate in the tests if doing so would violate some
kind of "no benchmark" contract you may have with a vendor.
(1.1) The first test is to search the Pfam globin family against
itself using default parameters. There are 1203 sequences in the
family. You can find the file at
http://dna.cs.wustl.edu/globins.gz. I'm using WU-BLAST with the
following command line.
time blastp globins globins V=1203 B=1203 cpus=1 filter=seg+xnu > /dev/null
Notes: I'm setting the CPU count to 1. Also, although I'm using
WU-BLAST here, if more people are using NCBI-BLAST, I'd like to
report that instead. This is not a bake-off of NCBI-BLAST vs.
WU-BLAST. People have their preferences, and I'm only going to
include one or the other in the book.
This isn't a realistic real-world test, since most of the
sequences will match each other, but the data set is small
enough that the burden shouldn't be too great for anyone. It
will probably take somewhere between 5 and 25 minutes,
depending on your hardware.
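For anyone who would rather script the run than read numbers off "time", here is a minimal sketch in Python. It assumes only that the command line above works as-is in the current directory (that is, blastp is on the PATH and globins has already been prepared as a searchable database). The names time_command, summarize and BLAST_ARGV are made up for illustration and are not part of WU-BLAST.

```python
import statistics
import subprocess
import time

# The command line from test (1.1), split into arguments.
# Assumption: blastp is on PATH and "globins" is a ready-to-search database here.
BLAST_ARGV = ["blastp", "globins", "globins",
              "V=1203", "B=1203", "cpus=1", "filter=seg+xnu"]


def time_command(argv, repeats=3):
    """Run argv `repeats` times with stdout discarded; return seconds per run."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        # check=True raises CalledProcessError on a non-zero exit status,
        # so a failed search is never reported as a fast one.
        subprocess.run(argv, stdout=subprocess.DEVNULL, check=True)
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Return the fastest and the median run time, in seconds."""
    return {"min": min(timings), "median": statistics.median(timings)}
```

Calling summarize(time_command(BLAST_ARGV)) from the directory holding the database would repeat the search three times. This measures wall-clock time only, whereas "time" also reports user and system CPU time. Reporting the minimum alongside the median is just one way to damp noise from other jobs on the machine.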
(1.2) I'd like the second test to be a BLASTN search of some
kind. This will require a larger database, and I think it will
keep the same all-vs-all approach. If the response to the first
experiment is good, I'll post another database. If not, I'll go
sulk and do whatever experiments I can.
(1.3) Is there another test anyone can think of that would be
simple enough for lots of people to run? If we could come up
with a suite of reasonable tests, it might be nice to have a
"spec-BLAST" benchmark. One could also try tests on more than
one CPU to show how an entire system performs. It sounds like a
fun paper to write and a great resource, but it's beyond the
scope of the book. Any takers?
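One possible reading of the multi-CPU idea is to rerun test (1.1) with different cpus= values and compare each run against the single-CPU time. The sketch below, in Python, shows that arithmetic. The function names blast_argv and scaling_table are hypothetical, and the timings have to come from your own runs; nothing here is a measured result.

```python
def blast_argv(cpus):
    """Build the test (1.1) command line with a different cpus= value."""
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    return ["blastp", "globins", "globins",
            "V=1203", "B=1203", f"cpus={cpus}", "filter=seg+xnu"]


def scaling_table(seconds_by_cpus):
    """Map CPU count -> (speedup, efficiency) relative to the 1-CPU run.

    seconds_by_cpus is {cpu_count: measured wall-clock seconds} and must
    include an entry for 1 CPU, which serves as the baseline.
    """
    baseline = seconds_by_cpus[1]
    table = {}
    for cpus in sorted(seconds_by_cpus):
        speedup = baseline / seconds_by_cpus[cpus]
        # Efficiency of 1.0 means perfect scaling; lower means diminishing returns.
        table[cpus] = (speedup, speedup / cpus)
    return table
```

Speedup is the 1-CPU time divided by the N-CPU time, and efficiency is speedup divided by N. This only covers threading within a single search; running several independent searches at once to load a whole machine would need a different harness.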
(2) There are differences in operating systems and compilers
too. If the same tests above could be run on identical hardware
but with different operating systems, this would provide a
valuable resource.
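If results from different operating systems are going to be compared, it helps to send the platform details along with each timing. The following is a rough sketch in Python using the standard platform module. The record layout and the names describe_host and result_record are assumptions for illustration, not an agreed format for this list.

```python
import json
import os
import platform


def describe_host():
    """Collect the operating system and hardware details Python can see."""
    return {
        "system": platform.system(),        # e.g. the OS name
        "release": platform.release(),      # OS or kernel release string
        "machine": platform.machine(),      # hardware architecture
        "processor": platform.processor(),  # may be empty on some systems
        "logical_cpus": os.cpu_count(),     # may be None if undetermined
    }


def result_record(test_name, seconds, blast_build=""):
    """Bundle one timing with host details as a JSON string.

    blast_build is free text supplied by the submitter (which binary,
    which compiler), since a script cannot detect how BLAST was compiled.
    """
    record = {"test": test_name, "seconds": seconds,
              "blast_build": blast_build, "host": describe_host()}
    return json.dumps(record, sort_keys=True)
```

The compiler question is left to a free-text field on purpose: the script can see the operating system, but not how the BLAST binary was built.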
(3) This isn't an experiment. Is there a favorite BLAST-based
question you'd like to see answered in a book? Perhaps something
you already know about that is often puzzling to the
inexperienced?
Thanks,
Ian Korf