[Bioclusters] design considerations for a BLAST cluster

Joe Landman bioclusters@bioinformatics.org
06 Mar 2002 08:52:45 -0500

(In the interest of full disclosure: I work for a company that sells
clusters, software, etc., of the kind I am writing about here, for
bioinformatics and other fields.)

Design is as much about compromise as it is about solving problems. 
There are several balancing acts you need to engage in to make things
work well.  Points of serialization (i.e. where things are forced to go
one item at a time) are bad for cluster design.

NCBI BLAST (and pretty much any library/database searching tool) has
modes of operation that are I/O bound, and others that are bound by
other factors (memory, CPU).

BLAST itself tries to mmap as much of the database indices into memory
as possible.  If you do not have sufficient memory for the indices, you
are going to be hitting your I/O channel hard.  The first design point
is sufficient memory.  Look at your particular calculations, and make
sure you can have most/all of the indices in RAM.  The reason is that
virtual memory (disk) access latency and speed are on the order of 10
ms and 20 MB/s respectively, while RAM latency and speed are on the
order of 150 ns and 800 MB/s.  You do not want to use swap if you can
avoid it, nor stream pages from disk for a mmap.

What you want to do is use the memory size of your calculation as a
lower bound on the RAM size.  My rule of thumb is never less than 1
GB/CPU.  Any "extra" memory can be used by the system as a disk cache,
which can have a dramatic positive impact upon performance.  The
converse is also true: if you have 128 MB of RAM per compute node, and
your calculation wants 768 MB of RAM, you are going to spend all your
time getting pages off the disk, and very little time calculating.  Add
to this that RAM is (still) cheap.  This may not hold (cost issues) for
the huge clusters, but hopefully those were designed correctly from the
beginning.  You can increase your run times by orders of magnitude by
having insufficient RAM.
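A quick back-of-envelope sketch makes the point concrete.  Using the
rough latency/bandwidth figures above (disk: ~10 ms and ~20 MB/s; RAM:
~150 ns and ~800 MB/s) and a hypothetical 768 MB working set that does
not fit in RAM:

```python
# Back-of-envelope sketch of the RAM vs. disk penalty.  All figures are
# the rough order-of-magnitude numbers quoted above, not measurements.

def transfer_time_s(size_mb, latency_s, bandwidth_mb_s):
    """Seconds to pull size_mb of index data through one channel."""
    return latency_s + size_mb / bandwidth_mb_s

index_mb = 768  # hypothetical working set larger than a 128 MB node

from_disk = transfer_time_s(index_mb, 10e-3, 20.0)    # ~38 s per full pass
from_ram  = transfer_time_s(index_mb, 150e-9, 800.0)  # ~1 s per full pass

print(f"disk: {from_disk:.1f} s, RAM: {from_ram:.2f} s, "
      f"ratio: {from_disk / from_ram:.0f}x")
```

Even this crude model shows a ~40x slowdown per pass over the indices,
before you account for seek-heavy paging patterns that are far worse
than pure sequential streaming.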

The next issue is I/O (non-swap).  BLAST and friends do large block
sequential reads, especially when streaming indices/databases from the
disk.  This means that the per-spindle speed will be the limiting
factor.  Alternatively, if you are mounting the database indices from a
central point rather than distributing them, then you are sharing the
network pipe bandwidth.  A side effect of this is that for N
simultaneous requestors of data from the pipe, each requestor will
typically get about 1/N of the bandwidth (this is the 1/N problem).  As
N (the number of nodes) increases, each node's share of the file
server's bandwidth drops accordingly.  If you are doing searches on a
system configured like this, you run out of network bandwidth, and you
get a point of serialization.  This means your cluster cannot scale
well.  The folks at Oak Ridge ran into this with their MPP BLAST on
their SP.  I cannot find the pages on the web anymore though.
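The 1/N problem can be sketched with a toy model.  The server bandwidth
and database size below are hypothetical round numbers, not from any
particular system:

```python
# Toy model of the 1/N problem: N nodes streaming the database from one
# central file server split its network pipe roughly equally.

def per_node_bandwidth(server_mb_s, n_nodes):
    """Approximate MB/s each node sees from a shared file server."""
    return server_mb_s / n_nodes

server_mb_s = 100.0  # hypothetical ~gigabit-class file server
db_mb = 768.0        # hypothetical database size

for n in (1, 8, 32, 128):
    mb_s = per_node_bandwidth(server_mb_s, n)
    print(f"{n:4d} nodes: {mb_s:7.2f} MB/s/node, "
          f"{db_mb / mb_s:8.1f} s to stream the database")
```

Note that per-node streaming time grows linearly with cluster size:
doubling the node count doubles the wall clock time each node spends
waiting on the pipe, which is exactly the serialization point described
above.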

 I normally recommend a striped local file system.  IDE disks are fast
and cheap.  SCSI disks are more expensive and as fast or faster.  There
are some calculations you can do on the speed of the I/O channel, the
width of the stripe, and the amount of data you are pulling over.
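A minimal sketch of that striping arithmetic, under the simplifying
assumption that large-block sequential reads scale with stripe width
until the I/O channel saturates (the per-disk and channel rates below
are hypothetical):

```python
# Sketch of the striping calculation: aggregate sequential read rate is
# roughly min(per-spindle speed * stripe width, I/O channel ceiling).

def striped_read_mb_s(spindle_mb_s, n_spindles, channel_mb_s):
    """Effective large-block sequential read rate off a local stripe."""
    return min(spindle_mb_s * n_spindles, channel_mb_s)

spindle_mb_s = 30.0   # hypothetical per-IDE-disk streaming rate
channel_mb_s = 100.0  # hypothetical controller/bus ceiling
db_mb = 768.0         # hypothetical database size

for width in (1, 2, 4, 8):
    rate = striped_read_mb_s(spindle_mb_s, width, channel_mb_s)
    print(f"{width} disks: {rate:5.0f} MB/s -> "
          f"{db_mb / rate:5.1f} s to stream the database")
```

The useful takeaway is the ceiling: past the point where the stripe
saturates the channel (4 disks in this sketch), adding spindles buys
you nothing for sequential reads.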

  File system choice is critical to compute node performance.  Some file
systems are poorly suited to the demands of computational loads; others
are ideally suited to them.  We have seen factors of 1.5 in wall clock
time for various long-running calculations using one file system versus
another on the same machine (i.e. format the disk with one file system,
run the test, format it with the other, run the test).

 Since you are striping the local file system, it is generally a good
idea to stripe the swap space as well.  Linux VMs still want at least a
1:1 ratio of swap space to physical memory; the mid 2.4.x series
required more.  Swap is usually a point of serialization.

  The factors indicated above can have a substantial impact upon the
performance of the system.  They are not the only factors, but they can
be quite significant.  As always, you want to run your own tests (i.e.
your own code over several different input cases: think small, medium,
large, and wherever you really want to explore) on any system to gauge
its performance.  You will gradually find that, marketing claims
notwithstanding, system performance is a complex beast, and it requires
careful attention to many details to understand what is going on and
where the control/tuning knobs are.

  Enjoy BIO-IT World, folks.  I won't be attending (prior commitments).
I think ISMB is more likely.  Anyone interested in meeting then?

Joseph Landman, Ph.D.
Senior Scientist,
MSC Software High Performance Computing
email	: joe.landman@mscsoftware.com
office	: +1 248 208 3312
fax	: +1 714 784 3774