Duzlevski, Ognen wrote:
> Hi Joe,
>
> thanks for sharing your knowledge with me.
>
>> Often overlooked in clusters until too late is Disk and IO in general.
>> Chris Dagdigian at BioTeam.net is a good person to speak to about this.
>
> When you say "Disk and IO", do you mean storage over fiber, local node
> hard-drives...? What would be good choices for your typical bioinformatics
> shop - I have seen options between local nodes having the latest SCSI
> drives and nodes having the regular 5400 rpm ide drives. Does it pay to go
> with compute node SCSI 10000 rpm or is a 7200 rpm ide good enough?

The biggest performance bottleneck in 'bioclusters' is usually disk I/O
throughput. Bio people tend to do lots of things that involve streaming
massive text and binary files through the CPU and RAM (think running a blast
search), so the speed of your storage becomes the rate-limiting bottleneck.
Often there will be terabytes of this sort of data laying around, so the
"/data" volume is usually an NFS mount.

If disk I/O is not your bottleneck, then memory speed and size will likely be
the next one. Some applications like blast and sequence clustering
algorithms will always be better off with as much physical RAM as you can
cram into a box. Other applications are rate-limited by memory access
speeds, which is why Joe recommends fast DDR memory for users who need high
memory performance.

Our problem in the life sciences is that we have terabyte volumes of data
that need to be kept around for recurring analysis -- far too much data to
keep on local storage within a cluster node unless you have some sort of HSM
mechanism. What people often end up doing is storing all that data on a NAS
or SAN infrastructure.
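As a concrete sketch of what that looks like on a compute node, the shared
volume is typically picked up with a single fstab entry; the hostname and
export path below are made up for illustration:

```
# /etc/fstab on a compute node -- mount the shared data volume over NFS
# ("nas01" and /export/data are hypothetical names; rsize/wsize bumped up
# for large sequential reads):
nas01:/export/data  /data  nfs  rw,hard,intr,rsize=8192,wsize=8192  0 0
```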
Searching blast databases across an NFS mount is not a high-throughput
solution :) Even a $280,000 Network Appliance high-end NAS filer with
multiple trunked gigabit NICs doing the NFS fileservices will get swamped
when enough cheap Linux boxes are doing large sequential reads across the
link.

General rule of thumb: no matter how big, fast and expensive the storage
solution is, you will always be able to drive it to its knees with enough
cheap cluster nodes hitting it. This is something that has to be lived with.
There are faster solutions out there, but they get expensive (think having
to run switched fibre channel to every cluster node) and are often very
proprietary. People in the 'super fast' storage space include DataDirect,
BlueArc, Panasas, etc. etc.

This is generally how I would approach storage for a medium to large-scale
cluster:

(1) Purchase 500 gigs or a terabyte or so of cheap ATA RAID configured as a
NAS appliance. It should have a gigE connection to your cluster and possibly
a 100-TX connection to the organizational LAN. Treat this as "scratch" space
where you store temporary files and do things like download and build your
blast/hmmer databases. There are lots of things that biologists and
informatics types do that require lots of space but don't really need
high-availability or high performance. One of my personal pet peeves is
using really expensive high-end storage arrays to (wastefully) store lots of
low- or no-value data and temporary files. You would not believe the crap
that some companies store on multi-million dollar EMC storage arrays or
really expensive high-end SAN infrastructures :)

(2) Purchase the 'real' NAS storage solution for your cluster. I prefer NAS
to SAN for my clusters because I really need shared concurrent read/write
access to the same volume of data, and getting this done via SANs is really
expensive and painful.
I'd rather engineer around NAS/NFS performance issues than deal with 60+
machines needing shared read/write access to the same SAN volume across a
switched FC fabric. Nasty -- particularly in a heterogeneous OS environment.

Your NAS system can be totally configured to fit your scale and your budget.
For small clusters this could just be a few hundred gigs of Dell PowerVault
SCSI or fibrechannel RAID, or perhaps a terabyte of NexSan ATA RAID chassis
SCSI-attached to a Linux box that acts as the NFS fileserver. NexSan makes
high-quality ATA RAID shelves with nice features (hot-swap, SCSI or FC
interface, etc. etc.) for about $12-15K per terabyte. On the high end you
can pay six figures or more depending on speed, size, scaling potential and
backup abilities:

o A Linux box with some SCSI or ATA drives serving as an NFS fileserver
  will cost just a few thousand dollars.

o A Linux box with a terabyte of NexSan ATA RAID attached via fibrechannel
  or SCSI will cost about $12-$15,000.

o A high-end storage system like a Network Appliance F840 filer will be a
  six-figure investment.

I've had clients whose needs and budgets ran the full spectrum, so there is
really no one-size-fits-all answer. I've built a cluster where the compute
nodes were about $60,000 in cost and the storage, network core and backup
systems were $250,000+. You will find that storage and backup are more
expensive and more complicated than the actual cluster hardware.

##

Ok. Moving away from the shared storage and onto the storage for your
compute nodes:

SCSI, fibrechannel and hardware RAID are only required on cluster head
nodes, cluster database machines and the cluster fileservices layer. SCSI
disks in a compute node are a waste of money and are usually just a source
of additional profit margin for the cluster vendor. Avoid them if possible.
For the compute nodes themselves you can save lots of $$ by just sticking
with cheap ATA devices.
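For the do-it-yourself route mentioned above -- a cheap Linux box acting as
the NFS fileserver -- the server side is not much more than an export of the
data volume. A minimal sketch (the subnet and path are hypothetical):

```
# /etc/exports on the Linux fileserver -- export the data volume
# read/write to the private cluster network:
/export/data  192.168.1.0/255.255.255.0(rw,sync)
```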
(usual disclaimer applies; your personal situation may vary)

In a Linux environment I'd recommend 2x IDE drives striped at RAID0 using
Linux software RAID. The performance is amazing -- we've seen I/O speeds
that beat a direct gigE connection to a NetApp _and_ a direct FC connection
to a high-end Hitachi SAN. Really cool to see such performance come from
$300 worth of IDE disk drives and some software RAID.

People are doing this with blade servers using 4200 and 5400 RPM IDE drives
as well -- there are some people on this list in the UK who are using RLX
blades with 2x drives striped with software RAID0. According to them the
performance gain was significant and measurable.

##

With that storage stuff laid out, this is how I approach cluster usage:

(1) If simplicity and ease of use are more important than performance, then
I just leave all my data on the central NAS fileserver. 90% of the cluster
users I see just end up doing this most of the time. It's easy, it works,
and it does not require data staging or extensive changes to scripts that
the users develop on their own.

(2) If my workflow demands more performance, then I start to deploy
methodologies that involve (a) staging data from the central NAS onto local
disks within my cluster and (b) directing the cluster to send jobs to the
nodes where the target data is already cached in memory (best) or stored on
the local disk. This is a quick and dirty way to engineer around the NFS
performance bottlenecks. There are lots of ways to approach this, but
getting the process integrated into your workflow will involve scripting
and some slightly more advanced usage of your PBS, GridEngine or Platform
LSF cluster load management layer.

>> Also very much overlooked is the issue of cluster management. This
>> tends to guide the choice of Linux distribution. Management gets to be
>> painful after the 8th compute node, the old models don't work well on
>> multiple system image machines.
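The staging half of approach (2) above can be sketched in a few lines of
shell. Everything here is hypothetical -- the "stage_db" helper, the paths
and the database names are made up for illustration:

```shell
#!/bin/sh
# Sketch of approach (2a): copy a database from the NFS-mounted central
# store onto a compute node's local disk before running a job against it.
stage_db() {
    nas="$1"       # e.g. /data/blastdb (NFS mount from the central NAS)
    scratch="$2"   # e.g. /scratch/blastdb (local disk on the compute node)
    db="$3"        # database basename, e.g. "nr"
    mkdir -p "$scratch"
    for f in "$nas/$db".*; do       # every file belonging to the database
        [ -f "$f" ] || continue     # no matches: glob stays literal, skip
        cp -p "$f" "$scratch/"      # -p preserves mtimes for later checks
    done
}

# A job script would then stage first and search the local copy, e.g.:
#   stage_db /data/blastdb /scratch/blastdb nr
#   blastall -p blastp -d /scratch/blastdb/nr -i query.fa -o out.txt
```

For half (b), steering follow-up jobs to the node that already holds the
data, Grid Engine can take a host resource request along the lines of
`qsub -l hostname=node01`; PBS and LSF have equivalent host-selection
options.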
I use and love SystemImager (www.systemimager.org) for automating my
full-on 'install an OS on that node from scratch' process as well as my
'update this file or directory across all cluster nodes' needs. It's a
great product. BioTeam has ported SystemImager to Mac OS X, but there are
still some sticky issues involving the performance of HFS+-aware rsync.
We've recently been using Radmind from UMichigan on some Xserve cluster
projects.

I'm a compute farm purist, I guess :) I stay away from 'cluster in a box'
products, particularly those that come with OpenPBS installed, which I
would never recommend for use in a biocluster. My usual approach is to just
install pure Red Hat or Debian Linux and a load management layer like
GridEngine or LSF, and then use SystemImager to handle the OS management
and software update process.

For cluster monitoring and reporting I like Big Brother, MRTG, Larrd, SAR,
NTOP, Ganglia, etc. etc. There are lots of great software monitoring tools
out there.

For cluster management you need to treat your compute nodes as anonymous
and disposable. You cannot afford to be messing with them on an individual
basis because your admin burden will scale linearly with cluster size. You
need to:

o anonymize the cluster nodes on a private network. Your users should log
  into a cluster head node to submit their jobs

o treat compute nodes as being in 1 of 4 possible states:
  (1) running / online
  (2) online / reimaging or updating
  (3) rebooting
  (4) failed / offline / marked for replacement when convenient

If you find yourself logging in as the root user to a compute node to fix
or repair something, then you are doing something wrong. It may seem OK at
the time, but it is a bad habit that will bite back when the cluster has to
scale over time.

> What are the usual choices here? Any recommendations, web-sites, companies
> I should be looking at in particular? 40 nodes would be our starting point
> but we could easily move up to more in time.
> Many vendors offer various alternatives - I think personally I would like
> to have something that is easy to understand and is not a complete black
> box.
>
>> There are other issues as well, specifically networking, backup of
>> system, extra IO capabilities, etc.
>
> Joe, I know this could easily turn into a thick book :), but how does one
> get more educated about these things?

I'm probably biased, but I've heard lots of people speak positively about
reading the archived threads of this mailing list. Many of the questions
you are asking have been debated and discussed in the past on this very
mailing list. The list archives are online at
http://bioinformatics.org/pipermail/bioclusters/

I'm also going to be rehashing a lot of this stuff in a talk at the
upcoming O'Reilly Bioinformatics Technology Conference. It may be of
interest to some people and really obvious and boring to others.

> Another question that I am curious about is Itanium 2 and using them in a
> cluster - any experiences with these? How about bioinformatics software -
> any benefits in your regular programs like blast, clustalw... when running
> on an itanium system?
>
> Thank you,
> Ognen

Just my $.02 of course,

chris

--
Chris Dagdigian, <dag@sonsorol.org> - The BioTeam Inc.
Independent life science IT & informatics consulting
Office: 617-666-6454, Mobile: 617-877-5498, Fax: 425-699-0193
PGP KeyID: 83D4310E  Yahoo IM: craffi  Web: http://bioteam.net