On Mon, 2002-05-13 at 13:32, Ivo Grosse wrote:

> 0. we often use Blast, and we often blast two large sets against each
> other, e.g. the human against the mouse genome. In that example, one
> genome (e.g. mouse) will be the database, and we will chop up the human
> genome into, say, 101-kb pieces overlapping by 1 kb, and then throw
> those 30,000 101-kb pieces against the mouse database using SGE. We
> (in our group) do NOT need or want Mosix.

ok...

> 1. the (mouse) database will live in RAM (of each slave node), and the
> way in which we feed the database to the RAM for each of the 30,000
> jobs is as follows:
>
> - cp the database to /tmp/ of ALL of the slave nodes.
>
> - start the 30,000 jobs through SGE, where the database is READ from
> /tmp/ (on the local node) and the output is WRITTEN to the central file
> server.
>
> This is, of course, much faster than reading a GB-size database from
> the central file server 30,000 times.

Of course, the limitation here is the write speed. The writes are
usually done as:

  - read the affected sector(s) (or allocate a new area on disk)
  - modify the sector as needed
  - write the results back

You replace GB-sized reads from remote DBs with small local cached
reads. This is good. You write a fair chunk of data to one directory
(a guess) on the server.

So then, your GFS question is in part: will GFS help this situation?
Or is this working fine, and not generally in need of "tweaking"?

If you are doing NFS off the Linux box acting as head node and file
server, look for messages like "NFS slot exceeded" or "NFS server
unavailable" during runs. That, and run "vmstat 1" on the server
before starting a heavy run, and you can see the effects of the load.

> 2. another group here at CSHL is currently in the process of preparing
> the installation of a new cluster, and they have some good reasons for
> choosing Mosix. But once in a while they also need to run Blast jobs
> of similar sizes as ours.
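(As an aside, the staging step in 1. is simple enough to sketch. All
paths, node names, and the blastall flags here are placeholders of my
own invention, not anything from your actual setup:)

```shell
#!/bin/sh
# A minimal sketch of the staging step in 1. above -- paths and the
# blastall invocation are made-up examples.

# Copy the database from the central server to local /tmp, but only
# if the local copy is missing or stale, so 30,000 jobs cause one
# network read per node rather than 30,000.
stage_db () {
    # $1 = master copy (NFS path on the server), $2 = local copy
    if [ ! -f "$2" ] || [ "$1" -nt "$2" ]; then
        cp "$1" "$2"
    fi
}

# Something like this would then run at the top of each SGE job script:
#   stage_db /server/db/mouse.fa /tmp/mouse.fa
#   blastall -p blastn -d /tmp/mouse.fa \
#            -i /server/queries/chunk.$SGE_TASK_ID.fa \
#            -o /server/results/chunk.$SGE_TASK_ID.out
```

The freshness test means you can leave the staging call in every job
script; only the first job to land on a node pays for the copy.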
> The question is: can Mosix + GFS + DFSA
> support a protocol similar to 1.?

Ok... you need a "runon" type mechanism to lock a process to a node.
I think Mosix supports that (it is suggested in the documentation).
From a brief read of the documentation, I would guess that the answer
is a qualified yes, with some uncertainty as to how well job
schedulers etc. will interact with MOSIX.

You are going to lose IO bandwidth, but from the vmstat measurement
you can get a feel for whether or not this is an issue. I would
honestly recommend using local writes, and having an occasional
housekeeper run inserted into the mix to move the data back per node.
This will avoid the write spikes which can occasionally be seen in
server NFS hiccups or crashes. Sometimes the application gets very
unhappy if it cannot contact the server during a spike, so you will
see unrelated processes fail during such an event.

In distributed computing, it is rarely a win to use anything but
local IO. There are some times where this is not true, but IO can be
(and often is) a significant (and, as Chris will tell you, overlooked)
bottleneck.

> Best regards, Ivo
>
> P.S.
>
> Instead of writing N identical replicas of the database to the N slave
> nodes, one could keep just one copy of the database on /pvfs/, which is
> accessible through all of the slave nodes. Then, however, the GB-size
> database would need to be read through the network 30,000 times. Is
> this correct?

Yes. This is a massive loss. You either want the data local, or you
want to do reads from a memory-based cache. The cache is going to be
fighting with the process memory: BLAST mmaps its DB using free
space, low free space due to a large buffer cache tends to limit the
amount kept in RAM, and, worse, Linux's algorithms for shrinking the
buffer cache in the face of an allocation request are not that great.

So...

A) Try to keep all the reads local.
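A quick back-of-the-envelope makes the point. The figures are
assumptions (a 1 GB = 1024 MB database, ~11 MB/s over 100BaseT,
30,000 jobs), but the ratio is what matters:

```shell
#!/bin/sh
# Back-of-the-envelope for keeping reads local.  Assumed figures:
# 1 GB (1024 MB) database, ~11 MB/s on 100BaseT, 30,000 jobs.

MB=1024       # database size in MB
RATE=11       # MB/s, about the best effort on 100BaseT
JOBS=30000

# every job pulls the database over the wire (the /pvfs/ scheme):
echo "remote reads: $((JOBS * MB / RATE)) s of aggregate network time"

# stage once per node, read locally afterwards (the /tmp scheme):
echo "local staging: $((MB / RATE)) s per node, once"
```

That is roughly 2.8 million seconds, about a month of aggregate wire
time, versus a minute and a half per node.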
Unless you have gigabit to each computing node, the best effort of
11 MB/s over 100BaseT is painful for reading hundreds of megs (30,000
times). Your ATA/100 disk can do better than 2.5x that as a single
disk, and better than 5x that as RAID0. If you do have gigabit to
each node, look at tweaking the file server with huge amounts of RAM
for buffer cache.

B) Try to keep the writes local.

Lots of the transients are spikes that you only occasionally see, and
they sometimes really upset the NFS server and any clients trying to
write to it. They appear to be transient and are very hard to
replicate, yet they can be difficult for a normal NFS server to deal
with. You can do local writes, and create a local flushing script
that does a large-sized copy less frequently (averaging out the
spikes for you).

> P.P.S.
>
> Do you know a smarter (than 1.) way of running the Blast jobs? :)

Hopefully, but that remains for others to determine...

First, have a look at ccp
(http://www.msclinux.com/software/projects/index.html), an open
source version of a clusterwide copy utility. It gives you order-1
time (i.e. N copies take about the same time as a single copy) data
distribution of any size. Move your GB files to be local on the
compute nodes. This is a clean-room implementation of an idea I had
built a proof of concept for previously. Works nicely for moving
data.

Second, set your writes to be local as well.

Third, write a "reaper" process to pull the locally written data back
to the file server in a more controlled, less spiky manner. You can
then schedule transfers, and queue them if you do this right.

Joe

--
Joe Landman, email: landman@scientificappliance.com
web : http://scientificappliance.com
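P.S. For what it's worth, a skeleton of the "reaper" in the third
point. Node names, paths, and the choice of rcp are all placeholders
for whatever your setup uses:

```shell
#!/bin/sh
# Skeleton of a server-side reaper: pull results back from each
# compute node in turn.  Node names, paths, and rcp are placeholders.

RCP=${RCP:-rcp}   # or rsync; or plain cp against a local dir, to try it out

# Pull finished .out files from one source into the results area.
# An unmatched glob passes through sh untouched, so rcp can expand
# "node:/dir/*.out" on the remote end.
reap_one () {
    src=$1; dest=$2
    $RCP $src/*.out "$dest"/
    # after a verified pull you would clear the originals, e.g.
    #   rsh $node "rm -f /tmp/blast-out/*.out"
}

# One node at a time -- serializing the transfers is what queues
# them and keeps the write load on the server smooth:
#   for node in node001 node002 node003; do
#       reap_one "$node:/tmp/blast-out" /server/results
#   done
```

Run it from cron, or kick it off between job waves, and the server
sees a few large sequential writes instead of a spike from every node
at once.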