[Bioclusters] cluster hardware question
Chris Dagdigian
bioclusters@bioinformatics.org
Thu, 16 Jan 2003 23:26:04 -0500
Duzlevski, Ognen wrote:
> Hi Joe,
>
> thanks for sharing your knowledge with me.
>
>
>>Often overlooked in clusters until too late is Disk and IO in general.
>>Chris Dagdigian at BioTeam.net is a good person to speak to about this.
>
>
> When you say "Disk and IO", do you mean storage over fiber, local node
> hard-drives...? What would be good choices for your typical bioinformatics
> shop - I have seen options between local nodes having the latest SCSI
> drives and nodes having the regular 5400 rpm ide drives. Does it pay to go
> with compute node SCSI 10000 rpm or is a 7200 rpm ide good enough?
>
The biggest performance bottleneck in 'bioclusters' is usually disk I/O
throughput. Bio people tend to do lots of things that involve streaming
massive text and binary files through the CPU and RAM (think running a
blast search). The speed of your storage becomes the rate-limiting
performance bottleneck. Often there will be terabytes of this sort of
data lying around, so the "/data" volume is usually an NFS mount.
If disk I/O is not your bottleneck, then memory size and speed will
likely be the next one. Some applications like blast and sequence
clustering algorithms will always be better off with as much physical
RAM as you can cram into a box. Other applications are rate-limited by
memory access speeds, which is why Joe recommends fast DDR memory for
users who need high memory performance.
Our problem in the life sciences is that we have terabyte volumes of
data that need to be kept around for recurring analysis. That is far too
much data to keep on local storage within a cluster node unless you have
some sort of HSM mechanism. What people often end up doing is storing
all that data on a NAS or SAN infrastructure.
Searching blast databases across an NFS mount is not a high-throughput
solution :) Even a $280,000 Network Appliance high-end NAS filer with
multiple trunked gigabit NICs doing the NFS file services will get
swamped when enough cheap Linux boxes are doing large sequential reads
across the link.
General rule of thumb: no matter how big, fast and expensive the storage
solution is, you will always be able to drive it to its knees with
enough cheap cluster nodes hitting it. This is something that has to be
lived with.
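To make that rule of thumb concrete, here is a back-of-the-envelope sketch. All the figures are illustrative assumptions (four trunked GigE links on the filer, ~30 MB/s of sequential NFS reads per node), not measurements from any particular site:

```python
# Back-of-the-envelope: how many cheap nodes saturate the filer's links?
# Every number below is an illustrative assumption, not a benchmark.

FILER_LINK_MBPS = 4 * 125   # four trunked GigE NICs, ~125 MB/s each
NODE_READ_MBPS = 30         # one node doing large sequential NFS reads

nodes_to_saturate = FILER_LINK_MBPS / NODE_READ_MBPS
print(f"~{nodes_to_saturate:.0f} nodes saturate the filer's network links")
```

With those assumptions, well under two racks of commodity boxes is enough to flatten even a very expensive filer, which is exactly the "drive it to its knees" effect described above.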
There are faster solutions out there but they get expensive (think
having to run switched fibre channel to every cluster node) and are
often very proprietary. People in the 'super fast' storage space
include: DataDirect, BlueArc, Panasas, etc. etc.
This is generally how I would approach storage for a medium to
large-scale cluster:
(1) Purchase 500 gigs or a terabyte or so of cheap ATA RAID configured
as a NAS appliance. It should have a gigE connection to your cluster and
possibly a 100-TX connection to the organizational LAN. Treat this as
"scratch" space where you store temporary files and do things like
download and build your blast/hmmer databases. There are lots of things
that biologists and informatics types do that require lots of space but
don't really need high-availability or high-performance. One of my
personal pet peeves is using really expensive high-end storage arrays to
(wastefully) store lots of low- or no-value data and temporary files.
You would not believe the crap that some companies store on
multi-million-dollar EMC storage arrays or really expensive high-end SAN
infrastructures :)
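As a sketch of the "build your blast databases on scratch" step, here is a small Python helper that assembles the classic NCBI formatdb command line (formatdb's -i flag names the input FASTA and -p takes T for protein or F for nucleotide; the scratch paths are made-up examples, not a layout I'm prescribing):

```python
def formatdb_command(fasta_path, is_protein):
    """Build the NCBI formatdb command line for a FASTA file.

    -i names the input file; -p is T (protein) or F (nucleotide).
    The scratch-volume path used in the example below is made up.
    """
    return ["formatdb",
            "-i", fasta_path,
            "-p", "T" if is_protein else "F"]

# Example: format a nucleotide database sitting on the scratch NAS
cmd = formatdb_command("/scratch/blastdb/est_human.fasta", is_protein=False)
print(" ".join(cmd))
```

You would hand the resulting list to something like subprocess or a batch job script; the point is just that scratch space is where this kind of bulky, regenerable output belongs.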
(2) Purchase the 'real' NAS storage solution for your cluster. I prefer
NAS to SAN for my clusters because I really need shared concurrent
read/write access to the same volume of data, and getting this done via
SANs is really expensive and painful. I'd rather engineer around NAS/NFS
performance issues than deal with 60+ machines needing shared read/write
access to the same SAN volume across a switched FC fabric. Nasty.
Particularly in a heterogeneous OS environment.
Your NAS system can be configured to fit your scale and your budget. For
small clusters this could just be a few hundred gigs of Dell Powervault
SCSI or fibrechannel RAID, or perhaps a terabyte NexSan ATA-RAID chassis
attached via SCSI to a Linux box that acts as the NFS fileserver. NexSan
makes high-quality ATA RAID shelves with nice features (hot-swap, SCSI
or FC interface, etc. etc.) for about $12-15K per terabyte. On the high
end you can pay six figures or more depending on speed, size, scaling
potential and backup abilities.
A Linux box with some SCSI or ATA drives serving as an NFS fileserver
will cost just a few thousand dollars.
A Linux box with a terabyte of NexSan ATA RAID attached via fibrechannel
or SCSI will cost about $12,000-$15,000.
A high-end storage system like a Network Appliance F840 filer will be a
six-figure investment.
I've had clients whose needs and budgets ran the full spectrum, so there
is really no one-size-fits-all answer.
I've built a cluster where the compute nodes were about $60,000 in cost
and the storage, network core and backup systems were $250,000+. You
will find that storage and backup are more expensive and more
complicated than the actual cluster hardware.
##
Ok. Moving away from the shared storage and onto the storage for your
compute nodes:
SCSI, fibrechannel and hardware RAID are only required on cluster head
nodes, cluster database machines and the cluster fileservices layer.
SCSI disks in a compute node are a waste of money and are usually just a
source of additional profit margin for the cluster vendor. Avoid them if
possible. For the compute nodes themselves you can save lots of $$ by
just sticking with cheap ATA devices. (Usual disclaimer applies; your
personal situation may vary.)
In a Linux environment I'd recommend 2x IDE drives striped as RAID0
using Linux software RAID. The performance is amazing -- we've seen I/O
speeds that beat a direct GigE connection to a NetApp _and_ a direct FC
connection to a high-end Hitachi SAN. Really cool to see such
performance come from $300 worth of IDE disk drives and some software RAID.
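If you want to sanity-check that kind of claim on your own nodes, a crude sequential-read timing is enough for a first look. This is a rough sketch, not a rigorous benchmark (tools like bonnie or iozone do it properly), and the file has to be much larger than RAM or the page cache will inflate the number:

```python
import time

def sequential_read_mbps(path, block_size=1 << 20):
    """Stream a file start to finish and report rough MB/s.

    Crude by design: for a realistic figure the file must be much
    larger than physical RAM, otherwise you are timing the OS page
    cache rather than the disks or the network.
    """
    total = 0
    start = time.time()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.time() - start
    return (total / (1024 * 1024)) / elapsed if elapsed > 0 else float("inf")

# Usage sketch: compare the local RAID0 stripe against the NFS mount
# (both paths are made-up examples):
# print(sequential_read_mbps("/scratch/bigfile"))  # local software RAID0
# print(sequential_read_mbps("/data/bigfile"))     # NFS-mounted volume
```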
People are doing this with blade servers using 4200 and 5400 RPM IDE
drives as well-- there are some people on this list in the UK who are
using RLX blades with 2x drives striped with software RAID0. According
to them the performance gain was significant and measurable.
##
With that storage stuff laid out this is how I approach cluster usage:
(1) If simplicity and ease of use are more important than performance,
then I just leave all my data on the central NAS fileserver. 90% of the
cluster users I see just end up doing this most of the time. It's easy,
it works, and it does not require data staging or extensive changes to
scripts that the users develop on their own.
(2) If my workflow demands more performance then I start to deploy
methodologies that involve (a) staging data from the central NAS onto
local disks within my cluster and (b) directing the cluster to send jobs
to the nodes where the target data is already cached in memory (best) or
stored on the local disk. This is a quick and dirty way to engineer
around the NFS performance bottlenecks.
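A minimal sketch of the staging step (a) might look like the Python helper below. The directory layout and database name are illustrative assumptions, not a prescription:

```python
import os
import shutil

def stage_database(name, nfs_dir="/data/blastdb", local_dir="/scratch/blastdb"):
    """Copy a database file from the central NAS onto node-local disk,
    skipping the copy when an up-to-date local copy already exists.

    The /data and /scratch paths are made-up examples; the point is
    one sequential read off NFS, after which all I/O stays local.
    """
    src = os.path.join(nfs_dir, name)
    dst = os.path.join(local_dir, name)
    os.makedirs(local_dir, exist_ok=True)
    if (not os.path.exists(dst)
            or os.path.getmtime(dst) < os.path.getmtime(src)):
        shutil.copy2(src, dst)  # copy2 preserves mtime for the freshness check
    return dst
```

Run at the top of each job script, this turns repeated NFS reads of the same database into a single staged copy per node.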
There are lots of ways to approach this, but getting the process
integrated into your workflow will involve scripting and some slightly
more advanced usage of your PBS, GridEngine or Platform LSF cluster load
management layer.
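For the job-placement step (b), one hedged sketch with GridEngine is a wrapper that builds a qsub invocation requesting the host where the data was already staged (GridEngine accepts a `-l hostname=...` resource request; the job script and node names here are made up):

```python
def qsub_command(job_script, target_host=None):
    """Assemble a GridEngine qsub command line, optionally pinning the
    job to the host where the needed database has already been staged.

    The job script and node names used below are illustrative examples.
    """
    cmd = ["qsub"]
    if target_host is not None:
        # GridEngine resource request: run only on this host
        cmd += ["-l", f"hostname={target_host}"]
    cmd.append(job_script)
    return cmd

# Usage sketch: route the blast job to the node holding the staged data
print(" ".join(qsub_command("run_blast.sh", target_host="node07")))
```

A real workflow would keep a small map of which databases live on which nodes and pick the target host from that; this only shows the submission half of the trick.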
>
>>>Also very much overlooked is the issue of cluster management. This
>>
>>tends to guide the choice of Linux distribution. Management gets to be
>>painful after the 8th compute node, the old models don't work well on
>>multiple system image machines.
>
I use and love systemimager (www.systemimager.org) for automating the
process of managing my full-on 'install an OS on that node from scratch'
as well as my 'update this file or directory across all cluster nodes'
needs. It's a great product.
BioTeam has ported systemimager to Mac OS X but there are still some
sticky issues involving the performance of HFS+ aware rsync. We've
recently been using RADMIND from UMichigan on some Xserve cluster projects.
I'm a compute farm purist I guess :) I stay away from 'cluster in a box'
products, particularly those that come with OpenPBS installed which I
would never recommend for use in a biocluster. My usual approach is to
just install pure Redhat or Debian Linux and a load management layer
like GridEngine or LSF and then use Systemimager to handle the OS
management and software update process.
For cluster monitoring and reporting I like: BigBrother, MRTG, Larrd,
SAR, NTOP, Ganglia etc. etc. There are lots of great software monitoring
tools out there.
For cluster management you need to treat your compute nodes as anonymous
and disposable. You cannot afford to be messing with them on an
individual basis because your admin burden will scale linearly with
cluster size.
You need to:
o anonymize the cluster nodes on a private network. Your users should
log into a cluster head node to submit their jobs
o recognize that compute nodes are in 1 of 4 possible states:
(1) running / online
(2) online / reimaging or updating
(3) rebooting
(4) failed / offline / marked for replacement when convenient
If you find yourself logging in as the root user to a compute node to
fix or repair something then you are doing something wrong. It may seem
OK at the time but it is a bad habit that will bite back when the
cluster has to scale over time.
>
> What are the usual choices here? Any recommendations, web-sites, companies
> I should be looking at in particular? 40 nodes would be our starting point
> but we could easily move up to more in time. Many vendors offer various
> alternatives - I think personally I would like to have something that is
> easy to understand and is not a complete black box.
>
>
>>There are other issues as well, specifically networking, backup of
>>system, extra IO capabilities, etc.
>
>
> Joe, I know this could easily turn into a thick book :), but how does one
> get more educated about these things?
>
I'm probably biased but I've heard lots of people speak positively about
reading the archived threads of this mailing list. Many of the questions
you are asking about have been debated and discussed in the past on this
very mailing list. The list archives are online at
http://bioinformatics.org/pipermail/bioclusters/
I'm also going to be rehashing a lot of this stuff in a talk at the
upcoming O'Reilly Bioinformatics Technology conference. It may be of
interest to some people and really obvious and boring to others.
> Another question that I am curious about is Itanium 2 and using them in a
> cluster - any experiences with these? How about bioinformatics software -
> any benefits in your regular programs like blast, clustalw... when running
> on an itanium system?
>
> Thank you,
> Ognen
Just my $.02 of course,
chris
--
Chris Dagdigian, <dag@sonsorol.org> - The BioTeam Inc.
Independent life science IT & informatics consulting
Office: 617-666-6454, Mobile: 617-877-5498, Fax: 425-699-0193
PGP KeyID: 83D4310E Yahoo IM: craffi Web: http://bioteam.net