Chris, you gave me enough information to digest for the next hour :). Thank you very much, also Jeff and Joe!

Ognen

On Thu, 16 Jan 2003, Chris Dagdigian wrote:

> Duzlevski, Ognen wrote:
> > Hi Joe,
> >
> > thanks for sharing your knowledge with me.
> >
> >> Often overlooked in clusters until too late is Disk and IO in general.
> >> Chris Dagdigian at BioTeam.net is a good person to speak to about this.
> >
> > When you say "Disk and IO", do you mean storage over fiber, local node
> > hard-drives...? What would be good choices for your typical bioinformatics
> > shop - I have seen options between local nodes having the latest SCSI
> > drives and nodes having the regular 5400 rpm ide drives. Does it pay to go
> > with compute node SCSI 10000 rpm or is a 7200 rpm ide good enough?
>
> The biggest performance bottleneck in 'bioclusters' is usually disk I/O
> throughput. Bio people tend to do lots of things that involve streaming
> massive text and binary files through the CPU and RAM (think running a
> blast search). The speed of your storage becomes the rate-limiting
> performance bottleneck. Often there will be terabytes of this sort of
> data lying around, so the "/data" volume is usually an NFS mount.
>
> If disk I/O is not your bottleneck, then memory speed and size will
> likely be the next bottleneck. Some applications like blast and sequence
> clustering algorithms will always be better off with as much physical
> RAM as you can cram into a box. Other applications are rate-limited by
> memory access speeds, which is why Joe recommends fast DDR memory for
> users who need high memory performance.
>
> Our problem in the life sciences is that we have terabyte volumes of
> data that need to be kept around for recurring analysis. Far too
> much data to keep on local storage within a cluster node unless you have
> some sort of HSM mechanism. What people often end up doing is storing
> all that data on a NAS or SAN infrastructure.
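[Since sequential-read throughput is the number that matters here, it is worth measuring it directly on both local disk and the NFS-mounted /data before buying anything. A minimal sketch using dd -- the file path and block size are illustrative, and for a fair result the test file should be larger than RAM so the page cache cannot inflate the number:]

```shell
# seqread: time one sequential pass over a file and print dd's summary
# line, which (on GNU dd) includes the effective throughput.
# A rough sketch, not a real benchmark harness.
seqread() {
    dd if="$1" of=/dev/null bs=1M 2>&1 | tail -n 1
}
```

[Running e.g. `seqread /data/blastdb/nr.00.psq` once against a local copy and once against the NFS mount gives a quick feel for how much the network path costs you.]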
> Searching blast databases across an NFS mount is not a high-throughput
> solution :) Even a $280,000 Network Appliance high-end NAS filer with
> multiple trunked gigabit NICs doing the NFS file services will get
> swamped when enough cheap linux boxes are doing large sequential reads
> across the link.
>
> General rule of thumb: no matter how big, fast and expensive the storage
> solution is, you will always be able to bring it to its knees with enough
> cheap cluster nodes hitting it. This is something that has to be lived
> with.
>
> There are faster solutions out there, but they get expensive (think
> having to run switched fibre channel to every cluster node) and are
> often very proprietary. People in the 'super fast' storage space
> include: DataDirect, BlueArc, Panasas, etc.
>
> This is generally how I would approach storage for a medium to
> large-scale cluster:
>
> (1) Purchase 500 gigs or a terabyte or so of cheap ATA RAID configured
> as a NAS appliance. It should have a gigE connection to your cluster and
> possibly a 100-TX connection to the organizational LAN. Treat this as
> "scratch" space where you store temporary files and do things like
> download and build your blast/hmmer databases. There are lots of things
> that biologists and informatics types do that require lots of space but
> don't really need high availability or high performance. One of my
> personal pet peeves is using really expensive high-end storage arrays to
> store (wastefully) lots of low- or no-value data and temporary files.
>
> You would not believe the crap that some companies store on
> multi-million dollar EMC storage arrays or really expensive high-end SAN
> infrastructures :)
>
> (2) Purchase the 'real' NAS storage solution for your cluster. I prefer
> NAS to SAN for my clusters because I really need shared concurrent
> read/write access to the same volume of data, and getting this done via
> SANs is really expensive and painful.
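[Either way, the cluster-facing side of this kind of storage is usually just a plain NFS export from a linux fileserver. A configuration sketch -- the hostnames, subnet, paths and mount options below are all illustrative, not a recommendation for any particular site:]

```shell
# /etc/exports on the fileserver (illustrative subnet and paths):
#
#   /data     10.1.1.0/24(ro,async)   # shared blast/hmmer databases
#   /scratch  10.1.1.0/24(rw,async)   # low-value scratch space
#
# (run "exportfs -ra" after editing to reload the export table)
#
# Matching /etc/fstab entries on each compute node:
#
#   fileserver:/data     /data     nfs  ro,hard,intr,rsize=8192,wsize=8192  0 0
#   fileserver:/scratch  /scratch  nfs  rw,hard,intr,rsize=8192,wsize=8192  0 0
```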
I'd rather engineer around NAS/NFS
> performance issues than deal with 60+ machines needing shared read/write
> access to the same SAN volume across a switched FC fabric. Nasty.
> Particularly in a heterogeneous OS environment.
>
> Your NAS system can be totally configured to fit your scale and your
> budget. For small clusters this could just be a few hundred gigs of Dell
> Powervault SCSI or fibrechannel RAID, or perhaps a terabyte NexSan
> ATA-RAID chassis SCSI-attached to a linux box that acts as an NFS
> fileserver. NexSan makes high quality ATA RAID shelves with nice
> features (hot-swap, SCSI or FC interface, etc.) for about $12-15K
> per terabyte. On the high end you can pay six figures or more depending
> on speed, size, scaling potential and backup abilities.
>
> A linux box with some SCSI or ATA drives serving as an NFS fileserver
> will cost just a few thousand dollars.
>
> A linux box with a terabyte of NexSan ATA RAID attached via fibrechannel
> or SCSI will cost about $12-$15,000.
>
> A high-end storage system like a Network Appliance F840 filer will be a
> six-figure investment.
>
> I've had clients whose needs and budgets ran the full spectrum, so there
> is really no one-size-fits-all answer.
>
> I've built a cluster where the compute nodes were about $60,000 in cost
> and the storage, network core and backup systems were $250,000+. You will
> find that storage and backup are more expensive and more complicated than
> the actual cluster hardware.
>
> ##
>
> Ok. Moving away from the shared storage and onto the storage for your
> compute nodes:
>
> SCSI, Fibrechannel and hardware RAID are only required on cluster head
> nodes, cluster database machines and the cluster fileservices layer.
>
> SCSI disks in a compute node are a waste of money and are usually just a
> source of additional profit margin for the cluster vendor. Avoid them if
> possible. For the compute nodes themselves you can save lots of $$ by
> just sticking with cheap ATA devices.
(Usual disclaimer applies; your
> personal situation may vary.)
>
> In a linux environment I'd recommend 2x IDE drives striped as RAID0
> using Linux software RAID. The performance is amazing -- we've seen I/O
> speeds that beat a direct GigE connection to a NetApp _and_ a direct FC
> connection to a high-end Hitachi SAN. Really cool to see such
> performance come from $300 worth of IDE disk drives and some software
> RAID.
>
> People are doing this with blade servers using 4200 and 5400 RPM IDE
> drives as well -- there are some people on this list in the UK who are
> using RLX blades with 2x drives striped with software RAID0. According
> to them the performance gain was significant and measurable.
>
> ##
>
> With that storage stuff laid out, this is how I approach cluster usage:
>
> (1) If simplicity and ease of use is more important than performance,
> then I just leave all my data on the central NAS fileserver. 90% of the
> cluster users I see just end up doing this most of the time. It's easy,
> it works and it does not require data staging or extensive changes to
> scripts that the users develop on their own.
>
> (2) If my workflow demands more performance, then I start to deploy
> methodologies that involve (a) staging data from the central NAS onto
> local disks within my cluster and (b) directing the cluster to send jobs
> to the nodes where the target data is already cached in memory (best) or
> stored on the local disk. This is a quick and dirty way to engineer
> around the NFS performance bottlenecks.
>
> There are lots of ways to approach this, but getting the process
> integrated into your workflow will involve scripting and some slightly
> more advanced usage of your PBS, GridEngine or Platform LSF cluster load
> management layer.
>
> >> Also very much overlooked is the issue of cluster management. This
> >> tends to guide the choice of Linux distribution.
> >> Management gets to be painful after the 8th compute node; the old
> >> models don't work well on multiple-system-image machines.
>
> I use and love systemimager (www.systemimager.org) for automating the
> process of managing my full-on 'install an OS on that node from scratch'
> as well as my 'update this file or directory across all cluster nodes'
> needs. It's a great product.
>
> BioTeam has ported systemimager to Mac OS X, but there are still some
> sticky issues involving the performance of HFS+-aware rsync. We've
> recently been using RADMIND from UMichigan on some Xserve cluster
> projects.
>
> I'm a compute farm purist I guess :) I stay away from 'cluster in a box'
> products, particularly those that come with OpenPBS installed, which I
> would never recommend for use in a biocluster. My usual approach is to
> just install pure Redhat or Debian Linux and a load management layer
> like GridEngine or LSF, and then use Systemimager to handle the OS
> management and software update process.
>
> For cluster monitoring and reporting I like: BigBrother, MRTG, Larrd,
> SAR, NTOP, Ganglia, etc. There are lots of great software monitoring
> tools out there.
>
> For cluster management you need to treat your compute nodes as anonymous
> and disposable. You cannot afford to be messing with them on an
> individual basis because your admin burden will scale linearly with
> cluster size.
>
> You need to:
>
> o anonymize the cluster nodes on a private network. Your users should
> log into a cluster head node to submit their jobs
>
> o compute nodes are in 1 of 4 possible states:
>
> (1) running / online
> (2) online / reimaging or updating
> (3) rebooting
> (4) failed / offline / marked for replacement when convenient
>
> If you find yourself logging in as the root user to a compute node to
> fix or repair something, then you are doing something wrong. It may seem
> OK at the time, but it is a bad habit that will bite back when the
> cluster has to scale over time.
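[The data-staging approach described earlier ultimately comes down to bookkeeping: knowing which node already holds a local copy of which database, and pinning jobs there. A toy sketch of that lookup -- the flat "db node" map file format and every path here are invented for illustration; the map would be appended to by whatever script does the actual staging:]

```shell
# node_for_db: given a database name and a map file of "db node" lines,
# print the node(s) that already hold a locally staged copy of that
# database. Purely illustrative bookkeeping, not a real scheduler hook.
node_for_db() {
    db="$1"; map="$2"
    awk -v d="$db" '$1 == d { print $2 }' "$map"
}
```

[A submit wrapper could then do something like `qsub -l hostname="$(node_for_db nr /cluster/etc/stagemap | head -n 1)" blastjob.sh` to request a node that already has the data (GridEngine's `-l hostname=` resource request; PBS and LSF have their own equivalents).]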
> > What are the usual choices here? Any recommendations, web-sites, companies
> > I should be looking at in particular? 40 nodes would be our starting point
> > but we could easily move up to more in time. Many vendors offer various
> > alternatives - I think personally I would like to have something that is
> > easy to understand and is not a complete black box.
> >
> >> There are other issues as well, specifically networking, backup of
> >> system, extra IO capabilities, etc.
> >
> > Joe, I know this could easily turn into a thick book :), but how does one
> > get more educated about these things?
>
> I'm probably biased, but I've heard lots of people speak positively about
> reading the archived threads of this mailing list. Many of the questions
> you are asking about have been debated and discussed in the past on this
> very mailing list. The list archives are online at
> http://bioinformatics.org/pipermail/bioclusters/
>
> I'm also going to be rehashing a lot of this stuff in a talk at the
> upcoming O'Reilly Bioinformatics Technology conference. May be of
> interest to some people and really obvious and boring to others.
>
> > Another question that I am curious about is Itanium 2 and using them in a
> > cluster - any experiences with these? How about bioinformatics software -
> > any benefits in your regular programs like blast, clustalw... when running
> > on an itanium system?
> >
> > Thank you,
> > Ognen
>
> Just my $.02 of course,
> chris
>
> --
> Chris Dagdigian, <dag@sonsorol.org> - The BioTeam Inc.
> Independent life science IT & informatics consulting
> Office: 617-666-6454, Mobile: 617-877-5498, Fax: 425-699-0193
> PGP KeyID: 83D4310E Yahoo IM: craffi Web: http://bioteam.net
>
> _______________________________________________
> Bioclusters maillist - Bioclusters@bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/bioclusters