Chris, thank you for the info. I suspected you would say that I will need
local drives (the "first pass" solution). Ahhh, more money down the drain? :)

Thanks!
Ognen

>-----Original Message-----
>From: Chris Dagdigian [mailto:dag@sonsorol.org]
>Sent: Monday, April 21, 2003 2:18 PM
>To: bioclusters@bioinformatics.org
>Subject: Re: [Bioclusters] blast and nfs
>
>Duzlevski, Ognen wrote:
>> Hi all,
>>
>> we have a 40 node cluster (2 cpus each) and a cluster master that has
>> attached storage over fibre, pretty much a standard thingie.
>>
>> All of the nodes get their shared space from the cluster master over
>> nfs. I have a user who has set up an experiment that fragmented a
>> database into 200,000 files, which are then being blasted against the
>> standard NCBI databases, which reside on the same shared space on the
>> cluster master and are visible on the nodes (he basically rsh-s into
>> all the nodes in a loop and starts jobs). He could probably go about
>> his business in a better way, but for the sake of optimizing the
>> setup, I am actually glad that testing is being done the way it is.
>>
>> I noticed that the cluster master itself is under heavy load (it is a
>> 2 CPU machine), and most of the load comes from the nfsd threads
>> (kernel-space nfs is used).
>>
>> Are there any usual tricks or setup models utilized in setting up
>> clusters? For example, all of my nodes mount the shared space with the
>> rw,async,rsize=8192,wsize=8192 options. How many nfsd threads usually
>> run on a master node? Any advice as to the locations of NCBI databases
>> vs. shared space? How would one go about measuring/observing for the
>> bottlenecks?
>
>Hi Ognen,
>
>There are many people on this list who have similar setups and have
>worked around NFS-related bottlenecks in various ways, depending on the
>complexity of their needs.
>
>One easy way to avoid NFS bottlenecks is to realize that BLAST is
>_always_ going to be performance-bound by IO speeds, and that generally
>your IO access to local disk is going to be far faster than your NFS
>connection. Done right, local IDE drives in a software RAID
>configuration can get you better speeds than a direct GigE connection
>to a NetApp filer or a fibrechannel SAN.
>
>Another way to put this: you will NEVER (well, not without exotic
>storage hardware) be able to build an NFS fileserver that cannot be
>swamped by lots of cheap compute nodes doing long sequential reads
>against network-mounted BLAST databases. You need to engineer around
>the NFS bottleneck that is slowing you down.
>
>All you need to do is have enough local disk in each of your compute
>nodes to hold all (or some) of your BLAST datasets. The idea is that
>you use the NFS-mounted blast databases only as a 'staging area' for
>rsync'ing or copying your files to scratch or temp space on your
>compute nodes. Given the cheap cost of 40-80gb IDE disk drives, this is
>a quick and easy way to get around NFS-related bottlenecks.
>
>Each search can then be done against local disk on each compute node
>rather than all nodes hitting the NFS fileserver and beating it to
>death...
>
>This is generally what most BLAST farm operators will do as a "first
>pass" approach. It works very well and is pretty much standard practice
>these days.
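>
>The staging step itself is nothing fancy. Here is a rough sketch in
>Python of what runs on each compute node (all of the hostnames and
>paths are made up, and I'm assuming rsync and the NCBI blastall binary
>are on each node; treat it as an illustration, not a finished script):
>
>  #!/usr/bin/env python
>  # stage_and_blast.py: run on each compute node (via rsh or a batch
>  # system). Mirrors the BLAST databases from the NFS staging area to
>  # local scratch, then points BLAST at the local copy.
>  import os
>  import subprocess
>
>  NFS_DB_DIR = "/shared/blastdb"     # NFS-mounted staging area (made up)
>  LOCAL_DB_DIR = "/scratch/blastdb"  # local IDE disk on the node (made up)
>
>  os.makedirs(LOCAL_DB_DIR, exist_ok=True)
>
>  # rsync only transfers what has changed, so running this at the
>  # start of every job is nearly free once the first copy is in place.
>  subprocess.check_call(["rsync", "-a", NFS_DB_DIR + "/", LOCAL_DB_DIR])
>
>  # The search itself now reads from local disk instead of NFS.
>  subprocess.check_call([
>      "blastall", "-p", "blastn",
>      "-d", os.path.join(LOCAL_DB_DIR, "nt"),
>      "-i", "query.fa",
>      "-o", "query.out",
>  ])
>
>Once the databases are local, the only NFS traffic left is the initial
>copy and collecting the results.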
>
>The "second pass" approach is more complicated and involves splitting
>your blast datasets up into RAM-sized chunks, distributing them across
>the nodes in your cluster, and then multiplexing your query across all
>the nodes to get faster throughput. This is harder to implement and is
>useful only for long queries against big databases, as there is a
>certain amount of overhead required to merge your multiplexed query
>results back into one human- or machine-parsable file.
>
>People only implement the 'second pass' approach when they really need
>to, usually in places where pipelines are constantly repeating the
>same big searches over and over again.
>
>My $.02 of course
>
>-Chris
>www.bioteam.net
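>
>P.S. If you ever do need the 'second pass' setup, the skeleton is just
>"run the same query against every database chunk in parallel, then
>merge the per-chunk reports". A toy sketch in Python, in the same
>spirit as the one above (node names and chunk layout are hypothetical,
>and the merge shown is naive; a real merge has to re-rank the hits,
>and you would use blastall's -z option to set the effective database
>size so the E-values come out right):
>
>  #!/usr/bin/env python
>  # multiplex_blast.py: fan one query out over pre-split db chunks.
>  import subprocess
>
>  NODES = ["node01", "node02", "node03"]  # made-up node names
>  CHUNKS = ["nt.00", "nt.01", "nt.02"]    # RAM-sized pieces of one db
>
>  # Start one search per node; each node already holds its own chunk
>  # on local disk, so only the query and output touch shared space.
>  procs = []
>  for node, chunk in zip(NODES, CHUNKS):
>      cmd = ("blastall -p blastn -d /scratch/blastdb/%s "
>             "-i /shared/query.fa -o /shared/query.%s.out"
>             % (chunk, node))
>      procs.append(subprocess.Popen(["rsh", node, cmd]))
>
>  for p in procs:
>      p.wait()
>
>  # Naive merge: concatenate the per-chunk reports. Good enough for
>  # eyeballing; real pipelines re-rank hits across chunks instead.
>  with open("/shared/query.out", "w") as out:
>      for node in NODES:
>          out.write(open("/shared/query.%s.out" % node).read())
>
>That merge step is exactly the overhead I mentioned above, which is
>why nobody bothers with this until they really have to.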