[Bioclusters] blast server on OpenMosix cluster
Chris Dagdigian
bioclusters@bioinformatics.org
Mon, 05 Jan 2004 13:02:26 -0500
{hmmm. I think this is still on-topic for the list at large...}
Hi Hong Zhang,
One of our clusters should be very close to you at Dana-Farber Cancer
Institute, depending on what building you are in. I forgot to add that
one to the list of Harvard-affiliated systems that I knew about.
On to your questions about Grid Engine (SGE) and blast:
1. Nothing about SGE will force you to have a single-application cluster
unless you choose to use it that way. Some people desire 'appliance'
type systems that are designed and specially tuned to run a single
application really, really well. Other groups of researchers want a
general purpose system that can run all sorts of applications.
Configuring a cluster to run BLAST really well is a nice target for
informatics researchers since BLAST tends to beat heavily on memory and
storage subsystems. Optimizing for blast tends to mean that the cluster
will stand up well to other sorts of informatics workloads.
Making applications run on clusters -- as a general rule, if you can run
a program on the Unix command line it is very easy to set things up so
that the same program can be run across a cluster or compute farm while
under the control of Grid Engine. It gets more complicated if the
program requires a specialized environment, a license server or if the
application requires a parallel MPI or PVM environment.
I can't tell you how easy it would be to make your other WWW tools 'SGE
aware' but in general the process is similar to what you would have to
do to cluster-enable the www-blast CGI code. In most cases a simple
wrapper script will do the job.
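As a rough illustration of the kind of wrapper meant here, the sketch below builds a Grid Engine qsub command line around a single blastall call. The function names, the job name and the file paths are made up for the example, and the '-b y' and '-sync y' options assume a Grid Engine release that supports binary submission and synchronous waits, so check your local qsub man page before relying on them.

```python
import shlex
import subprocess


def build_sge_blast_command(program, database, query_file, output_file,
                            job_name="wwwblast"):
    """Return a qsub argument list that runs one blastall search.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    blast_cmd = [
        "blastall",
        "-p", program,       # search program, e.g. blastn or blastp
        "-d", database,      # formatted database name or path
        "-i", query_file,    # query sequence file
        "-o", output_file,   # file that receives the blast report
    ]
    qsub_cmd = [
        "qsub",
        "-N", job_name,      # job name shown by qstat
        "-cwd",              # run the job in the submission directory
        "-b", "y",           # treat the command as a binary, not a script
        "-sync", "y",        # wait for the job to finish (assumed available)
    ]
    return qsub_cmd + blast_cmd


def run_blast_job(*args, **kwargs):
    """Submit the job through qsub and return qsub's exit status."""
    cmd = build_sge_blast_command(*args, **kwargs)
    print("submitting:", " ".join(shlex.quote(part) for part in cmd))
    return subprocess.call(cmd)
```

A CGI script would call something like run_blast_job("blastn", "nt", "query.fa", "query.out") and then read the report file once the call returns. Waiting synchronously only suits a lightly loaded system; on a busy cluster the web front end would have to submit the job and poll for results instead.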
Space for blast data is another hard-to-answer question. There are
bioclusters in the Boston area that have terabyte-scale storage arrays
serving up hundreds of gigabytes of blastable databases and there are
others that just need a few gigs of disk space to store the particular
NCBI datasets that they care about.
To figure out how much space you need, list the databases you'd like to
have available and note their (uncompressed) file sizes. Then
take that total and triple it (or more) because you'll need space to
build, uncompress and curate your datasets as well as handle normal growth.
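The arithmetic is trivial, but a small sketch makes the rule of thumb explicit. The database names and sizes below are invented placeholders, not real NCBI figures, and the factor of three is just the starting point suggested above.

```python
def estimate_blast_storage_gb(db_sizes_gb, headroom_factor=3.0):
    """Estimate disk space to provision for a set of blast databases.

    db_sizes_gb maps a database name to its uncompressed size in GB.
    headroom_factor covers building, uncompressing and curating the
    data plus normal growth; 3.0 is a rule of thumb, not a hard limit.
    """
    total_uncompressed = sum(db_sizes_gb.values())
    return total_uncompressed * headroom_factor


# Placeholder sizes for illustration only.
wanted_databases = {"nt": 10.0, "nr": 2.5, "est_human": 1.5}
needed_gb = estimate_blast_storage_gb(wanted_databases)
print("provision at least %.1f GB" % needed_gb)
```

Raise headroom_factor if you keep several dated copies of each database around while you rebuild them.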
Most people can easily store their favorite blast databases within a
single IDE or SATA disk drive these days. Because blast is rate limited
by disk performance these drives are often mirrored with hardware or
software RAID in sets of two or more. Searching BLAST datasets across
an NFS share can be a big performance bottleneck so many people will
install the mirrored-disk pairs in each of their blast compute nodes so
that all blast databases are replicated on local storage. This removes a
ton of NFS traffic from the cluster network although the extra work of
making sure that all your big blast files are _correctly_ replicated
across many nodes can be time consuming.
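One way to check that the replicas really are correct is to compare checksums of the database files on each node against the master copy. The following is a minimal sketch under that assumption; the function names are hypothetical, and how the manifests get collected from each node (a Grid Engine job, ssh, a cron script) is left to the site.

```python
import hashlib
import os


def checksum_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(db_dir):
    """Map each regular file name in db_dir to its checksum."""
    manifest = {}
    for name in sorted(os.listdir(db_dir)):
        path = os.path.join(db_dir, name)
        if os.path.isfile(path):
            manifest[name] = checksum_file(path)
    return manifest


def find_mismatches(master, replica):
    """List (file name, problem) pairs where a replica differs from master."""
    problems = []
    for name, digest in master.items():
        if name not in replica:
            problems.append((name, "missing"))
        elif replica[name] != digest:
            problems.append((name, "checksum differs"))
    return problems
```

Run build_manifest() on the master copy and on each node's local database directory, then pass the two results to find_mismatches(); an empty list means that node holds every master file with identical contents.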
If I was building a Linux blast cluster node from scratch today I'd use
pairs of the 160gb Seagate SATA drives mirrored with software RAID. The
big computer vendors may not be as flexible with IDE storage offerings
but they'd at least have products using disks in the 80-120gb range
which should be fine for your needs. You'd want hardware RAID and more
redundancy in your cluster 'master' or head node but in general the
compute nodes are disposable so a simple software RAID mirror on
inexpensive disks is all you need for the worker nodes.
If I was building an Apple G4 or G5 cluster I'd wait until the end of
the day tomorrow to see what product announcements come out at MacWorld!
-Chris
hong.zhang@research.dfci.harvard.edu wrote:
> Hi Chris,
> Thanks for your message. It is really encouraging. My further question is:
> we have other www tools besides wwwblast installed on the cluster, so
> does SGE make all tools migratable, or is it just a single-job cluster (I
> mean only for blast, such as mpiblast)?
>
> And also how much space is needed to host blast data?
>
>>Hong Zhang,
>>
>>There are several clusters doing Blast and Blast over WWW at Harvard.
>>Contact me in private if you want contact information for the people
>>running them.
>>
>>The Bauer Center for Genomics Research has a big cluster system running
>>Platform LSF. (http://cgr.harvard.edu)
>>
>>The Harvard Stats department over in the Science Center is running Grid
>>Engine on a small Linux cluster.
>>
>>The Flybase project people are using Grid Engine on Mac OS X (apple
>>Xserves) for some lightweight web bioinformatics portal stuff
>>(http://inquiry.flybase.harvard.edu)
>>
>>There are several more systems I've heard about or visited over at the
>>Medical school etc.
>>
>>Regarding your questions:
>>
>>1. wwwblast servers are easy to set up on clusters. For a lightweight
>>system you can just take the LSF 'lsrun' or Grid Engine 'qrsh' commands
>>and use them to wrap the call to the blastall executable. This will not
>>work in a large setting as qrsh/lsrun will fail silently if there are no
>>resources available; in that case you need to go asynchronous and get
>>used to the batch system.
>>
>>2. SGE easily runs on Debian linux
>>
>>Regards,
>>Chris
>>
>>
>>
>>
>>Hong Zhang wrote:
>>
>>
>>>Thanks for your information. I read the article before.
>>>I'd like to know
>>>1. whether it is possible to set up a wwwblast server on
>>>cluster. Our goal is allow users to access blast database through web
>>>page instead of command line. I am not sure whether query from web
>>>page can be migrated.
>>>
>>>2. whether SGE can be used in Debian.
>>>
>>>
>>> On Fri, 2 Jan 2004, Ron Chen wrote:
>>>
>>>
>>>
>>>>It takes time to let openmosix to migrate your jobs.
>>>>SGE is more suitable in the compute farm environment.
>>>>
>>>>"Integrating BLAST with Sun ONE Grid Engine Software"
>>>>available at:
>>>>http://developers.sun.com/solaris/articles/integrating_blast.html
>>>>
>>>>-Ron
>>>>
>>>>--- Hong Zhang <hzhang@research.dfci.harvard.edu> wrote:
>>>>
>>>>
>>>>>But I have trouble make blast command line execute
>>>>>in every node.
>>>>>
>>>>>And don't you think openmosix is suitable for blast
>>>>>cluster? You suggested
>>>>>SGE?
>>>>>
>>>>>
>>>>>
>>>>>On Thu, 11 Dec 2003, Farul Mohd. Ghazali wrote:
>>>>>
>>>>>>On Wed, 10 Dec 2003 hong.zhang@research.dfci.harvard.edu wrote:
>>>>>>
>>>>>>>I am working on set up a blast server on Debian/OpenMosix cluster with 4
>>>>>>>nodes. Actually it is totally new to me. So is there anyone can give me
>>>>>>>some advice? Thanks.
>>>>>>
>>>>>>I've used OpenMosix in the form of ClusterKnoppix some months back to test
>>>>>>it out. The setup was very easy, boot off the CD, configure some settings
>>>>>>and the rest of the nodes boot off the network. Applications are
>>>>>>automatically load balanced across nodes.
>>>>>>
>>>>>>While configuration and actual use was very easy, performance wasn't too
>>>>>>great. I think the main reason was that OpenMosix dynamically migrates
>>>>>>applications to the different nodes to automatically load balance the
>>>>>>system thus the overhead of migration for long running jobs suddenly
>>>>>>became apparent.
>>>>>>
>>>>>>To be honest, we didn't try to optimize it much and went to implement our
>>>>>>blast cluster with SGE and hopefully soon mpiblast.
>>>>>>
>>>>>>_______________________________________________
>>>>>>Bioclusters maillist - Bioclusters@bioinformatics.org
>>>>>>https://bioinformatics.org/mailman/listinfo/bioclusters
>>>>
>>>>>--
>>>>>Hong Zhang, MIS
>>>>>Bioinformatics Analyst
>>>>>Dana Farber Cancer Institute
>>>>>Harvard Medical School
>>>>>44 Binney St, D1510A
>>>>>Boston MA 02115
>>>>>Email: hong.zhang@research.dfci.harvard.edu
>>>>>Phone: 617-632-3824
>>>>>Fax: 617-632-3351