[Bioclusters] http://www.sistina.com/products_gfs.htm

Joe Landman bioclusters@bioinformatics.org
13 May 2002 14:35:42 -0400

On Mon, 2002-05-13 at 13:32, Ivo Grosse wrote:

> 0. we often use Blast, and we often blast two large sets against each 
> other, e.g. the human against the mouse genome.  In that example, one 
> genome (e.g. mouse) will be the database, and we will chop up the human 
> genome into, say, 101-kb pieces  overlapping by 1 kb, and then throw 
> those 30,000 101-kb pieces against the mouse database using SGE.  We 
> (in our group) do NOT need or want Mosix.

> 1. the (mouse) database will live in RAM (of each slave node), and the 
> way in which we feed the database to the RAM for each of the 30,000 
> jobs is as follows:
> - cp the database to /tmp/ of ALL of the slave nodes.
> - start the 30,000 jobs through SGE, where the database is READ from 
> /tmp/ (on the local node) and the output is WRITTEN to the central file 
> server.
> This is, of course, much faster than reading a GB-size database from 
> the central file server 30,000 times.

Of course, the limitation here is the write speed.  The writes are
usually done as 

	read the affected sector(s) (or allocate a new area on disk)
	modify the sector as needed
	write the results back

You replace GB sized reads from remote DBs with small local cached
reads.  This is good.

You write a fair chunk of data to one directory (a guess) on the
server.  So then, your GFS question is in part, will GFS help this
situation... ?  Or is this working fine, and not generally in need of
"tweaking" ...?  If you are doing NFS off the linux box as head node and
file server, look for messages like NFS slot exceeded, or NFS server
unavailable, during runs.  That and run

	vmstat 1

on the server before starting a heavy run, and you can see the effects
of the load.

> 2. another group here at CSHL is currently in the process of preparing 
> the installation of a new cluster, and they have some good reasons for 
> choosing Mosix.  But once in a wile they also need to run Blast jobs, 
> of similar sizes as ours.  The question is: can Mosix + GFS + DFSA 
> support a protocol similar to 1.?

Ok... you need a "runon" type mechanism, to lock a process to a node.  I
think Mosix supports that (it is suggested in the documentation).  From
a brief read of the documentation, I would guess that the answer is a
qualified yes, with some uncertainty as to how well job schedulers etc
will interact with MOSIX.  You are going to lose IO bandwidth, but from
the vmstat measurement, you can get a feel for whether or not this is an

I would honestly recommend using local writes, and have an occasional
housekeeper run inserted into the mix to move the data back per node. 
This will avoid write spikes which can occasionally be seen in server
NFS hiccups or crashes.  Sometimes the application gets very unhappy if
it cannot contact the server due to a spike, so you will see unrelated
processes fail during such a thing.

On distributed computing, it is rarely a win to use anything but local
IO.  There are some times where this is not true, but IO can be (and
often is) a significant (and as Chris will tell you, overlooked)

> Best regards, Ivo
> P.S.
> Instead of writing N identical replicas of the database to the N slave 
> nodes, one could keep just one copy of the database on /pvfs/, which is 
> accessible through all of the slave nodes.  Then, however, the GB-size 
> database would need to be read through the network 30,000 times.  Is 
> this correct?

Yes.  This is a massive loss.  You either want the data local, or you
want to do reads from a memory based cache.  The cache is going to be
fighting with the process memory (BLAST mmaps its DB using free space,
low free space due to large buffer cache tends to limit the amount kept
in RAM, and worse, Linux's algorithms for reducing the size of buffer
cache in the face of an allocation request are not that great).


A) try to keep all the reads local.  Unless you have Gigabit to each
computing node, the best effort of 11 MB/s over 100 Base T is painful
for reading hundreds of megs (30000 times).  Your ATA/100 can do better
than 2.5x that for a single disk, and better than 5x that for RAID0.  If
you have gigabit to each node, look at tweaking the file server with
huge amounts of ram for buffer cache.

b) try to keep the writes local.  Lots of transients are spikes that you
occasionally see, and sometimes really upset the NFS server, and an
clients trying to write to the server.  They appear to be transient, and
very hard to replicate, yet they can be difficult for a normal NFS
server to deal with.  You can do local writes, and create a local
flushing script that does a large sized copy less frequently (averaging
out the spikes for you).

> P.P.S.
> Do you know a smarter (than 1.) way of running the Blast jobs?

:)  Hopefully, but that remains for others to determine...

First, have a look at ccp
(http://www.msclinux.com/software/projects/index.html) which is an open
source version of a clusterwide copy utility.  Gives you order 1 time
(e.g. same time for single copy as it is for N copies) data distribution
of any size.  Move your GB files to be local on the compute nodes.  This
is a clean room implementation of a idea I had built a proof of concept
for previously.  Works nicely for moving data.

Second, set your writes to be local as well.

third, write a "reaper" process, to pull the local written data back to
the file server in a more controlled, less spikey manner.  You can
schedule transfers then, and queue them if you do this right.


Joe Landman,
email: landman@scientificappliance.com
web  : http://scientificappliance.com