The December 2001 version of "formatdb" will split your target databases into chunks of arbitrary size for you, via the "-v <max_size_of_a_chunk>" flag. I believe it was intended to get around file-size limits on older OSs with larger datasets, but it also works nicely for my group to keep things under the RAM/CPU performance transition point.

-Chris Dwan
 cdwan-at-ccgb.umn.edu
 CCGB - University of Minnesota

Eric Engelhard writes:

> I agree with jfreeman that this howto is a good place to start, but you
> may not want to bother with RedHat 5.2. I set a personal speed record
> building out a small (8-node) cluster last week using RedHat 7.2. I used
> the kickstart GUI to configure boot disks for the slave nodes. This is an
> embarrassingly parallel blast cluster (NFS, postgres, NCBI blastall,
> rexec/rsh, and perl).
>
> Performance hint: what you really want with this kind of cluster is a
> high enough local-RAM-to-refdb ratio to prevent disk I/O churning. If
> you can run a whole batch with only an initial read, then the next
> bottleneck will be the CPU/bus speed, which is a fairly high bar. I
> haven't stress-tested this little cluster, but my work cluster (18
> nodes, 2GB RAM/node) cuts through >1500 queries/minute against nr.
>
> In addition to BLAST, this type of system is also ideal for standalone
> InterPro.
>
> I split the reference databases with this (babyperl freebee :-) ):
>
> #!/usr/bin/perl
> #
> # refdb_splitter.pl - Splits a ref fasta db into $N gzipped chunks
> #                     for distribution to cluster
> #
> # Usage: zcat ref_fasta_db(.Z or .gz) | CMGD_splitter.pl
> #
>
> $N = 8;   # your number of nodes here (or the node number itself, if you
>           # want to run an iteration of this script on each node and
>           # parallelize the splitter)
> $fasta = "";
> $i = 0;
> $split = 1;
> while ($line = <STDIN>) {
>     if ($line =~ /^>/) { $i++; }        # count FASTA headers seen
>     if ($i == 2) {                      # one complete record is buffered
>         if ($split > $N) { $split = 1; }   # or "if ($split % $N == 0) {" for running at each node
>         open(PIPE, "| gzip >> ref_fasta_db_$split.gz");
>         print PIPE $fasta;
>         close PIPE;
>         $fasta = "";
>         $split++;
>         $i = 1;
>         # }   # decomment for parallel version
>     }
>     $fasta = $fasta . $line;
> }
> # Flush the final record, which the loop above never writes out.
> if ($fasta ne "") {
>     if ($split > $N) { $split = 1; }
>     open(PIPE, "| gzip >> ref_fasta_db_$split.gz");
>     print PIPE $fasta;
>     close PIPE;
> }
>
> --
> Eric Engelhard - www.cvbig.org - www.sagresdiscovery.com
>
> jfreeman wrote:
> >
> > Start here:
> > http://www.beowulf-underground.org/doc_project/BIAA-HOWTO/Beowulf-Installation-and-Administration-HOWTO-5.html
> >
> > Once you have a small two-node master/slave cluster running, with the
> > slave node booting through tftpboot, you are ready for the next level
> > of complexity...
> >
> > Danny Navarro wrote:
> > >
> > > Hi all,
> > >
> > > I would like to set up a Linux cluster with some PCs to run BLAST
> > > searches against the human EST database. First I will try to run
> > > BLAST locally on the master node, but I would also like to make a
> > > BLAST server available to the intranet.
> > >
> > > I have a lot to learn about Linux clusters, and right now I don't
> > > know exactly how to start. Should I use Beowulf or Mosix, or are
> > > there other, better alternatives? What do you think is the best
> > > system for this task?
> > >
> > > Thanks
> > >
> > > _______________________________________________
> > > Bioclusters maillist - Bioclusters@bioinformatics.org
> > > http://bioinformatics.org/mailman/listinfo/bioclusters
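
For readers who don't speak Perl, the round-robin record distribution done by Eric's splitter can be sketched in Python roughly as follows. This is a minimal illustration, not code from the thread: the function name, sample data, and in-memory chunk strings are all made up for the example, and unlike the quoted script it also keeps the final FASTA record instead of dropping it.

```python
def split_fasta(lines, n):
    """Distribute FASTA records round-robin across n chunks.

    `lines` is an iterable of text lines (as from a FASTA stream);
    chunk i collects every (i mod n)-th record as one string.
    """
    chunks = [""] * n
    record = []        # lines of the record currently being buffered
    idx = 0            # number of complete records emitted so far
    for line in lines:
        # A new ">" header means the buffered record is complete.
        if line.startswith(">") and record:
            chunks[idx % n] += "".join(record)
            idx += 1
            record = []
        record.append(line)
    if record:         # flush the last record (the Perl loop alone misses it)
        chunks[idx % n] += "".join(record)
    return chunks
```

A real splitter for nr-sized databases would stream each record to a per-chunk gzip file rather than hold chunks in memory, as the Perl version does with its `| gzip >> ...` pipes.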