Chris, you gave me enough information to digest for the next hour :). Thank you very much, also Jeff and Joe!

Ognen

On Thu, 16 Jan 2003, Chris Dagdigian wrote:

> Duzlevski, Ognen wrote:
> > Hi Joe,
> >
> > thanks for sharing your knowledge with me.
> >
> >> Often overlooked in clusters until too late is Disk and IO in general.
> >> Chris Dagdigian at BioTeam.net is a good person to speak to about this.
> >
> > When you say "Disk and IO", do you mean storage over fiber, local node
> > hard-drives...? What would be good choices for your typical bioinformatics
> > shop - I have seen options between local nodes having the latest SCSI
> > drives and nodes having the regular 5400 rpm ide drives. Does it pay to go
> > with compute node SCSI 10000 rpm or is a 7200 rpm ide good enough?
>
> The biggest performance bottleneck in 'bioclusters' is usually disk I/O
> throughput. Bio people tend to do lots of things that involve streaming
> massive text and binary files through the CPU and RAM (think running a
> blast search). The speed of your storage becomes the rate-limiting
> performance bottleneck. Often there will be terabytes of this sort of
> data lying around, so the "/data" volume is usually an NFS mount.
>
> If disk I/O is not your bottleneck, then memory speed and size will
> likely be the next bottleneck. Some applications like blast and sequence
> clustering algorithms will always be better off with as much physical
> RAM as you can cram into a box. Other applications are rate-limited by
> memory access speeds, which is why Joe recommends fast DDR memory for
> users who need high memory performance.
>
> Our problem in the life sciences is that we have terabyte volumes of
> data that need to be kept around for recurring analysis. Far too
> much data to keep on local storage within a cluster node unless you have
> some sort of HSM mechanism. What people often end up doing is storing
> all that data on a NAS or SAN infrastructure.
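[Since sequential-read throughput is the number that matters here, it is worth measuring it directly on both local disk and the NFS-mounted /data before buying anything. A minimal sketch using dd -- the file path and block size are illustrative, and for a fair result the test file should be larger than RAM so the page cache cannot inflate the number:]

```shell
# seqread: time one sequential pass over a file and print dd's summary
# line, which (on GNU dd) includes the effective throughput.
# A rough sketch, not a real benchmark harness.
seqread() {
    dd if="$1" of=/dev/null bs=1M 2>&1 | tail -n 1
}
```

[Running e.g. `seqread /data/blastdb/nr.00.psq` once against a local copy and once against the NFS mount gives a quick feel for how much the network path costs you.]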
> Searching blast databases across an NFS mount is not a high-throughput
> solution :) Even a $280,000 Network Appliance high-end NAS filer with
> multiple trunked gigabit NICs doing the NFS file services will get
> swamped when enough cheap linux boxes are doing large sequential reads
> across the link.
>
> General rule of thumb: no matter how big, fast and expensive the storage
> solution is, you will always be able to bring it to its knees with enough
> cheap cluster nodes hitting it. This is something that has to be lived
> with.
>
> There are faster solutions out there, but they get expensive (think
> having to run switched fibre channel to every cluster node) and are
> often very proprietary. People in the 'super fast' storage space
> include: DataDirect, BlueArc, Panasas, etc.
>
> This is generally how I would approach storage for a medium to
> large-scale cluster:
>
> (1) Purchase 500 gigs or a terabyte or so of cheap ATA RAID configured
> as a NAS appliance. It should have a gigE connection to your cluster and
> possibly a 100-TX connection to the organizational LAN. Treat this as
> "scratch" space where you store temporary files and do things like
> download and build your blast/hmmer databases. There are lots of things
> that biologists and informatics types do that require lots of space but
> don't really need high availability or high performance. One of my
> personal pet peeves is using really expensive high-end storage arrays to
> store (wastefully) lots of low- or no-value data and temporary files.
>
> You would not believe the crap that some companies store on
> multi-million dollar EMC storage arrays or really expensive high-end SAN
> infrastructures :)
>
> (2) Purchase the 'real' NAS storage solution for your cluster. I prefer
> NAS to SAN for my clusters because I really need shared concurrent
> read/write access to the same volume of data, and getting this done via
> SANs is really expensive and painful.
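[Either way, the cluster-facing side of this kind of storage is usually just a plain NFS export from a linux fileserver. A configuration sketch -- the hostnames, subnet, paths and mount options below are all illustrative, not a recommendation for any particular site:]

```shell
# /etc/exports on the fileserver (illustrative subnet and paths):
#
#   /data     10.1.1.0/24(ro,async)   # shared blast/hmmer databases
#   /scratch  10.1.1.0/24(rw,async)   # low-value scratch space
#
# (run "exportfs -ra" after editing to reload the export table)
#
# Matching /etc/fstab entries on each compute node:
#
#   fileserver:/data     /data     nfs  ro,hard,intr,rsize=8192,wsize=8192  0 0
#   fileserver:/scratch  /scratch  nfs  rw,hard,intr,rsize=8192,wsize=8192  0 0
```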
I'd rather engineer around NAS/NFS
> performance issues than deal with 60+ machines needing shared read/write
> access to the same SAN volume across a switched FC fabric. Nasty.
> Particularly in a heterogeneous OS environment.
>
> Your NAS system can be totally configured to fit your scale and your
> budget. For small clusters this could just be a few hundred gigs of Dell
> Powervault SCSI or fibrechannel RAID, or perhaps a terabyte NexSan
> ATA-RAID chassis SCSI-attached to a linux box that acts as an NFS
> fileserver. NexSan makes high quality ATA RAID shelves with nice
> features (hot-swap, SCSI or FC interface, etc.) for about $12-15K
> per terabyte. On the high end you can pay six figures or more depending
> on speed, size, scaling potential and backup abilities.
>
> A linux box with some SCSI or ATA drives serving as an NFS fileserver
> will cost just a few thousand dollars.
>
> A linux box with a terabyte of NexSan ATA RAID attached via fibrechannel
> or SCSI will cost about $12-$15,000.
>
> A high-end storage system like a Network Appliance F840 filer will be a
> six-figure investment.
>
> I've had clients whose needs and budgets ran the full spectrum, so there
> is really no one-size-fits-all answer.
>
> I've built a cluster where the compute nodes were about $60,000 in cost
> and the storage, network core and backup systems were $250,000+. You will
> find that storage and backup are more expensive and more complicated than
> the actual cluster hardware.
>
> ##
>
> Ok. Moving away from the shared storage and onto the storage for your
> compute nodes:
>
> SCSI, Fibrechannel and hardware RAID are only required on cluster head
> nodes, cluster database machines and the cluster fileservices layer.
>
> SCSI disks in a compute node are a waste of money and are usually just a
> source of additional profit margin for the cluster vendor. Avoid them if
> possible. For the compute nodes themselves you can save lots of $$ by
> just sticking with cheap ATA devices.
(Usual disclaimer applies; your
> personal situation may vary.)
>
> In a linux environment I'd recommend 2x IDE drives striped as RAID0
> using Linux software RAID. The performance is amazing -- we've seen I/O
> speeds that beat a direct GigE connection to a NetApp _and_ a direct FC
> connection to a high-end Hitachi SAN. Really cool to see such
> performance come from $300 worth of IDE disk drives and some software
> RAID.
>
> People are doing this with blade servers using 4200 and 5400 RPM IDE
> drives as well -- there are some people on this list in the UK who are
> using RLX blades with 2x drives striped with software RAID0. According
> to them the performance gain was significant and measurable.
>
> ##
>
> With that storage stuff laid out, this is how I approach cluster usage:
>
> (1) If simplicity and ease of use is more important than performance,
> then I just leave all my data on the central NAS fileserver. 90% of the
> cluster users I see just end up doing this most of the time. It's easy,
> it works and it does not require data staging or extensive changes to
> scripts that the users develop on their own.
>
> (2) If my workflow demands more performance, then I start to deploy
> methodologies that involve (a) staging data from the central NAS onto
> local disks within my cluster and (b) directing the cluster to send jobs
> to the nodes where the target data is already cached in memory (best) or
> stored on the local disk. This is a quick and dirty way to engineer
> around the NFS performance bottlenecks.
>
> There are lots of ways to approach this, but getting the process
> integrated into your workflow will involve scripting and some slightly
> more advanced usage of your PBS, GridEngine or Platform LSF cluster load
> management layer.
>
> >> Also very much overlooked is the issue of cluster management. This
> >> tends to guide the choice of Linux distribution.
> >> Management gets to be painful after the 8th compute node; the old
> >> models don't work well on multiple-system-image machines.
>
> I use and love systemimager (www.systemimager.org) for automating the
> process of managing my full-on 'install an OS on that node from scratch'
> as well as my 'update this file or directory across all cluster nodes'
> needs. It's a great product.
>
> BioTeam has ported systemimager to Mac OS X, but there are still some
> sticky issues involving the performance of HFS+-aware rsync. We've
> recently been using RADMIND from UMichigan on some Xserve cluster
> projects.
>
> I'm a compute farm purist I guess :) I stay away from 'cluster in a box'
> products, particularly those that come with OpenPBS installed, which I
> would never recommend for use in a biocluster. My usual approach is to
> just install pure Redhat or Debian Linux and a load management layer
> like GridEngine or LSF, and then use Systemimager to handle the OS
> management and software update process.
>
> For cluster monitoring and reporting I like: BigBrother, MRTG, Larrd,
> SAR, NTOP, Ganglia, etc. There are lots of great software monitoring
> tools out there.
>
> For cluster management you need to treat your compute nodes as anonymous
> and disposable. You cannot afford to be messing with them on an
> individual basis because your admin burden will scale linearly with
> cluster size.
>
> You need to:
>
> o anonymize the cluster nodes on a private network. Your users should
> log into a cluster head node to submit their jobs
>
> o compute nodes are in 1 of 4 possible states:
>
> (1) running / online
> (2) online / reimaging or updating
> (3) rebooting
> (4) failed / offline / marked for replacement when convenient
>
> If you find yourself logging in as the root user to a compute node to
> fix or repair something, then you are doing something wrong. It may seem
> OK at the time, but it is a bad habit that will bite back when the
> cluster has to scale over time.
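[The data-staging approach described earlier ultimately comes down to bookkeeping: knowing which node already holds a local copy of which database, and pinning jobs there. A toy sketch of that lookup -- the flat "db node" map file format and every path here are invented for illustration; the map would be appended to by whatever script does the actual staging:]

```shell
# node_for_db: given a database name and a map file of "db node" lines,
# print the node(s) that already hold a locally staged copy of that
# database. Purely illustrative bookkeeping, not a real scheduler hook.
node_for_db() {
    db="$1"; map="$2"
    awk -v d="$db" '$1 == d { print $2 }' "$map"
}
```

[A submit wrapper could then do something like `qsub -l hostname="$(node_for_db nr /cluster/etc/stagemap | head -n 1)" blastjob.sh` to request a node that already has the data (GridEngine's `-l hostname=` resource request; PBS and LSF have their own equivalents).]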
> > What are the usual choices here? Any recommendations, web-sites, companies
> > I should be looking at in particular? 40 nodes would be our starting point
> > but we could easily move up to more in time. Many vendors offer various
> > alternatives - I think personally I would like to have something that is
> > easy to understand and is not a complete black box.
> >
> >> There are other issues as well, specifically networking, backup of
> >> system, extra IO capabilities, etc.
> >
> > Joe, I know this could easily turn into a thick book :), but how does one
> > get more educated about these things?
>
> I'm probably biased, but I've heard lots of people speak positively about
> reading the archived threads of this mailing list. Many of the questions
> you are asking about have been debated and discussed in the past on this
> very mailing list. The list archives are online at
> http://bioinformatics.org/pipermail/bioclusters/
>
> I'm also going to be rehashing a lot of this stuff in a talk at the
> upcoming O'Reilly Bioinformatics Technology conference. May be of
> interest to some people and really obvious and boring to others.
>
> > Another question that I am curious about is Itanium 2 and using them in a
> > cluster - any experiences with these? How about bioinformatics software -
> > any benefits in your regular programs like blast, clustalw... when running
> > on an itanium system?
> >
> > Thank you,
> > Ognen
>
> Just my $.02 of course,
> chris
>
> --
> Chris Dagdigian, <dag@sonsorol.org> - The BioTeam Inc.
> Independent life science IT & informatics consulting
> Office: 617-666-6454, Mobile: 617-877-5498, Fax: 425-699-0193
> PGP KeyID: 83D4310E Yahoo IM: craffi Web: http://bioteam.net
>
> _______________________________________________
> Bioclusters maillist - Bioclusters@bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/bioclusters