[Bioclusters] info needed about network filesystem to use in cluster programming

Thu Jun 9 17:58:07 EDT 2005

On 9 Jun 2005, at 08:25, Michael James wrote:

> Too true, we find that when sending many blast jobs
>  to the cluster 10 nodes are enough to flatten an NFS server.
>
> At present we copy the blast databases out
>  into a 30Gig partition on each node.
>
> With the databases growing in size and number,
>  we have reached the limits of this approach.
>
> An NFS client implementation with caching
>  would give us good performance,
>  virtually infinite database space (30 Gig is a lot of cache)
>  and save me the bother of copying
>  the updated DBs out each week.

Caching works if you have tight control over the code, and can ensure
it's written to reuse each database as much as possible; i.e.:

foreach database {
   foreach sequence {
     blast
   }
}

Unfortunately my experience is that most bioinformaticians think of  
their analyses in terms of the query sequences, rather than the  
databases, and do:

foreach sequence {
   foreach database {
     blast
   }
}

If *all* the databases in the analysis can fit in your cache, this is
OK; otherwise it's a disaster: each database is likely to be evicted
before the next sequence needs it again, and your cache is of no
benefit.
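
To put rough numbers on the two loop orders above, here is a toy
simulation in Python. It is only a sketch under stated assumptions: the
cache holds whole databases, every database is the same size, and
eviction is least-recently-used. A real NFS client cache works on
blocks rather than whole files, and the function name count_db_loads
and its parameters are made up for illustration.

```python
from collections import OrderedDict


def count_db_loads(order, n_sequences, n_databases, cache_slots):
    """Count how often a database must be fetched into an LRU cache.

    Assumptions (not from any real cache implementation): the cache holds
    `cache_slots` whole databases, all databases are the same size, and
    the least recently used database is evicted first.
    """
    if order == "database-major":
        # foreach database { foreach sequence { blast } }
        jobs = ((db, seq) for db in range(n_databases)
                for seq in range(n_sequences))
    elif order == "sequence-major":
        # foreach sequence { foreach database { blast } }
        jobs = ((db, seq) for seq in range(n_sequences)
                for db in range(n_databases))
    else:
        raise ValueError("unknown order: %r" % (order,))

    cache = OrderedDict()  # keys are database ids, oldest entry first
    loads = 0
    for db, _seq in jobs:
        if db in cache:
            cache.move_to_end(db)  # cache hit: mark as most recently used
        else:
            loads += 1  # cache miss: fetch from the file server
            cache[db] = True
            if len(cache) > cache_slots:
                cache.popitem(last=False)  # evict least recently used
    return loads


if __name__ == "__main__":
    # 100 query sequences, 5 databases, room for only 3 in the cache
    for order in ("database-major", "sequence-major"):
        print(order, count_db_loads(order, 100, 5, 3))
```

With 100 sequences, 5 databases and room for 3 in the cache, the
database-major order fetches each database once (5 loads), while the
sequence-major order misses on every single job (500 loads). Raise
cache_slots to 5 and both orders need only 5 loads.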

> I haven't found the answer yet.
> I want to write a reiser4 plug-in to implement it
>  but can't get the reiser4 that SuSE supply to accept files yet.
>
> Any suggestions?

GPFS is good, but proprietary, and we use it extensively here.  Each  
chassis of 14 blades (28 CPUs) spreads a GPFS filesystem over all 14  
machines, providing about 600 GB of space for blastable databases.   
Performance is actually slightly better than local disk, because
there are three copies of the data within the filesystem and GPFS
uses all three paths to the data simultaneously.

Lustre is open source, and shows much promise, but we haven't  
deployed it in production yet.

Tim