[Bioclusters] cluster hardware question

Thu, 16 Jan 2003 23:26:04 -0500

Duzlevski, Ognen wrote:
> Hi Joe,
> 
> thanks for sharing your knowledge with me.
> 
> 
>>Often overlooked in clusters until too late is Disk and IO in general.
>>Chris Dagdigian at BioTeam.net is a good person to speak to about this.
> 
> 
> When you say "Disk and IO", do you mean storage over fiber, local node
> hard-drives...? What would be good choices for your typical bioinformatics
> shop - I have seen options between local nodes having the latest SCSI
> drives and nodes having the regular 5400 rpm ide drives. Does it pay to go
> with compute node SCSI 10000 rpm or is a 7200 rpm ide good enough?
> 

The biggest performance bottleneck in 'bioclusters' is usually disk I/O 
throughput. Bio people tend to do lots of things that involve streaming 
massive text and binary files through the CPU and RAM (think running a 
blast search). The speed of your storage becomes the rate-limiting 
factor. Often there will be terabytes of this sort of data lying 
around, so the "/data" volume is usually an NFS mount.

If disk I/O is not your bottleneck then memory speed and size will 
likely be the next bottleneck. Some applications like blast and sequence 
clustering algorithms will always be better off with as much physical 
RAM as you can cram into a box. Other applications are rate-limited by 
memory access speeds which is why Joe recommends fast DDR memory for 
users who need high mem performance.

Our problem in the life sciences is that we have terabyte volumes of 
data that need to be kept lying around for recurring analysis. Far too 
much data to keep on local storage within a cluster node unless you have 
some sort of HSM mechanism. What people often end up doing is storing 
all that data on a NAS or SAN infrastructure.

Searching blast databases across an NFS mount is not a high throughput 
solution :) Even a $280,000 Network Appliance high-end NAS filer with 
multiple trunked gigabit NICs doing the NFS fileservices will get 
swamped when enough cheap linux boxes are doing large sequential reads 
across the link.

General rule of thumb: no matter how big, fast and expensive the storage 
solution is you will always be able to drive it to its knees with enough 
cheap cluster nodes hitting it. This is something that has to be lived 
with.
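
To put rough numbers on that rule of thumb, here is a back-of-the-envelope 
sketch in Python. The bandwidth figures are made-up illustrative values, 
not measurements from any real filer or node, and the even-sharing model 
is an assumption:

```python
import math


def nodes_to_saturate(server_mb_per_s, per_node_mb_per_s):
    """Smallest node count whose combined streaming reads reach the
    bandwidth the storage server can deliver."""
    if server_mb_per_s <= 0 or per_node_mb_per_s <= 0:
        raise ValueError("rates must be positive")
    return math.ceil(server_mb_per_s / per_node_mb_per_s)


def per_node_share(server_mb_per_s, node_count, per_node_mb_per_s):
    """Throughput each node sees if the server's bandwidth is split
    evenly (assumed) and no node can exceed its own link rate."""
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    return min(per_node_mb_per_s, server_mb_per_s / node_count)


# Hypothetical figures: a filer delivering 400 MB/s in total and
# compute nodes that can each pull 40 MB/s of sequential reads.
FILER_MB_PER_S = 400.0
NODE_MB_PER_S = 40.0

print(nodes_to_saturate(FILER_MB_PER_S, NODE_MB_PER_S))   # 10
print(per_node_share(FILER_MB_PER_S, 40, NODE_MB_PER_S))  # 10.0
```

With these assumed numbers ten nodes are enough to saturate the filer, and 
at 40 nodes each one gets a quarter of what it could otherwise read.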

There are faster solutions out there but they get expensive (think 
having to run switched fibre channel to every cluster node) and are 
often very proprietary. People in the 'super fast' storage space 
include: DataDirect, BlueArc, Panasas, etc. etc.

This is generally how I would approach storage for a medium to 
large-scale cluster:

(1) purchase 500-gigs or a terabyte or so of cheap ATA RAID configured 
as a NAS appliance. It should have a gigE connection to your cluster and 
possibly a 100-TX connection to the organizational LAN. Treat this as 
"scratch" space where you store temporary files and do things like 
download and build your blast/hmmer databases. There are lots of things 
that biologists and informatics types do that require lots of space but 
don't really need high-availability or high-performance.  One of my 
personal pet peeves is using really expensive high-end storage arrays to 
store (wastefully) lots of low or no-value data and temporary files.

You would not believe the crap that some companies store on 
multi-million dollar EMC storage arrays or really expensive high-end SAN 
infrastructures :)

(2) purchase the 'real' NAS storage solution for your cluster. I prefer 
NAS to SAN for my clusters because I really need shared concurrent 
read/write access to the same volume of data and getting this done via 
SANs is really expensive and painful. I'd rather engineer around NAS/NFS 
performance issues than deal with 60+ machines needing shared read/write 
access to the same SAN volume across a switched FC fabric. Nasty. 
Particularly in a heterogeneous OS environment.

Your NAS system can be totally configured to fit your scale and your 
budget. For small clusters this could just be a few hundred gigs of Dell 
Powervault SCSI or fibrechannel RAID or perhaps a terabyte of NexSan 
ATA-RAID chassis SCSI attached to a linux box that acts as NFS 
fileserver.  NexSan makes high quality ATA RAID shelves with nice 
features (hot-swap, SCSI or FC interface, etc. etc.) for about $12-15K 
per terabyte. On the high end you can pay six figures or more depending 
on speed, size, scaling potential and backup abilities.

A linux box with some SCSI or ATA drives serving as an NFS fileserver 
will cost just a few thousand dollars

A linux box with a terabyte of NexSan ATA RAID attached via fibrechannel 
or SCSI will cost about $12-$15,000

A high-end storage system like a Network Appliance F840 filer will be a 
six-figure investment

I've had clients whose needs and budgets ran the full spectrum so there 
is really no one-size-fits-all answer.

I've built a cluster where the compute nodes were about $60,000 in cost 
and the storage, network core and backup systems were $250,000+. You 
will find that storage and backup are more expensive and more 
complicated than the actual cluster hardware.

##
Ok. Moving away from the shared storage and onto the storage for your 
compute nodes:

SCSI, Fibrechannel and hardware RAID are only required on cluster head 
nodes, cluster database machines and the cluster fileservices layer.

SCSI disks in a compute node are a waste of money and are usually just a 
source of additional profit margin for the cluster vendor. Avoid them if 
possible. For the compute nodes themselves you can save lots of $$ by 
just sticking with cheap ATA devices. (usual disclaimer applies; your 
personal situation may vary)

In a linux environment I'd recommend 2x IDE drives striped at RAID0 
using Linux software RAID. The performance is amazing -- we've seen IO 
speeds that beat a direct GigE connection to a NetApp _and_ a direct FC 
connection to a high end Hitachi SAN. Really cool to see such 
performance come from $300 worth of IDE disk drives and some software RAID.

People are doing this with blade servers using 4200 and 5400 RPM IDE 
drives as well -- there are some people on this list in the UK who are 
using RLX blades with 2x drives striped with software RAID0. According 
to them the performance gain was significant and measurable.
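
If you want to check this sort of claim on your own hardware, a crude 
sequential-read timer is enough to compare a local striped volume 
against an NFS mount. This is only a sketch: the function names are 
mine, and the result is meaningless if the file is already sitting in 
the OS page cache, so use a file larger than RAM or a cold cache.

```python
import time


def sequential_read_mb_per_s(path, block_size=1024 * 1024):
    """Read a file from start to finish and return the observed
    throughput in MB/s (1 MB = 1024 * 1024 bytes)."""
    total_bytes = 0
    start = time.perf_counter()
    # buffering=0 gives an unbuffered raw file, so each read() call
    # goes to the OS without a Python-level buffer in between.
    with open(path, "rb", buffering=0) as handle:
        while True:
            chunk = handle.read(block_size)
            if not chunk:
                break
            total_bytes += len(chunk)
    elapsed = max(time.perf_counter() - start, 1e-9)
    return (total_bytes / (1024 * 1024)) / elapsed


def compare_paths(paths):
    """Return {path: MB/s} for each path, read one after another."""
    return {path: sequential_read_mb_per_s(path) for path in paths}
```

Calling compare_paths() with the same large database file on the NFS 
volume and on the local RAID0 volume (paths of your choosing) gives two 
numbers you can put side by side.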

##

With that storage stuff laid out this is how I approach cluster usage:

(1) If simplicity and ease of use are more important than performance 
then I just leave all my data on the central NAS fileserver. 90% of the 
cluster users I see just end up doing this most of the time. It's easy, 
it works and it does not require data staging or extensive changes to 
scripts that the users develop on their own.

(2) If my workflow demands more performance then I start to deploy 
methodologies that involve (a) staging data from the central NAS onto 
local disks within my cluster and (b) directing the cluster to send jobs 
to the nodes where the target data is already cached in memory (best) or 
stored on the local disk. This is a quick and dirty way to engineer 
around the NFS performance bottlenecks.
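
The staging half of (a) can be sketched in a few lines of Python. The 
function name and the freshness test (same size, local copy not older 
than the source) are assumptions for illustration; a real setup might 
use checksums or rsync instead.

```python
import os
import shutil


def stage_to_local(source_path, local_dir):
    """Copy source_path into local_dir unless an up-to-date copy is
    already there. Returns the path of the local copy."""
    os.makedirs(local_dir, exist_ok=True)
    local_path = os.path.join(local_dir, os.path.basename(source_path))
    src = os.stat(source_path)

    if os.path.exists(local_path):
        dst = os.stat(local_path)
        # Assumed freshness rule: same size and not older than source
        # (compared at whole-second resolution).
        if (dst.st_size == src.st_size
                and int(dst.st_mtime) >= int(src.st_mtime)):
            return local_path

    # Copy to a temporary name first so a job never sees a half-written
    # file, then rename into place.
    tmp_path = local_path + ".partial"
    shutil.copy2(source_path, tmp_path)  # copy2 keeps the source mtime
    os.replace(tmp_path, local_path)
    return local_path
```

A job wrapper script would call this once at the start, then point the 
search at the returned local path instead of the NFS path.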

There are lots of ways to approach this but getting the process 
integrated into your workflow will involve scripting and some slightly 
more advanced usage of your PBS, GridEngine or Platform LSF cluster load 
management layer.
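
The job-placement half of (b) boils down to "prefer a node that already 
has the data". Here is a scheduler-agnostic sketch of that decision in 
Python. The data structures, the load threshold and the function name 
are all hypothetical; in practice you would express the same preference 
through whatever resource or host-selection features your load 
management layer provides.

```python
def pick_node(dataset, staged, load, max_load=1.0):
    """Choose a node for a job that reads `dataset`.

    staged: dict mapping node name -> set of dataset names on its disk
    load:   dict mapping node name -> current load (missing means 0.0)

    Nodes at or above max_load are skipped. Among the rest, nodes that
    already hold the dataset win, then the least loaded, then the name
    (only to make the choice deterministic). Returns None if every
    node is too busy.
    """
    candidates = [n for n in staged if load.get(n, 0.0) < max_load]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda n: (dataset not in staged[n], load.get(n, 0.0), n),
    )


# Hypothetical cluster snapshot.
staged = {"node01": {"nt"}, "node02": {"nr"}, "node03": set()}
load = {"node01": 0.9, "node02": 0.1, "node03": 0.0}

print(pick_node("nt", staged, load))   # node01
print(pick_node("est", staged, load))  # node03
```

In the first call node01 wins despite being busier because it already 
holds "nt"; in the second no node has the data, so the idle node is 
picked and the job falls back to staging or reading over NFS.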

> 
>>Also very much overlooked is the issue of cluster management.  This
>>tends to guide the choice of Linux distribution.  Management gets to be
>>painful after the 8th compute node, the old models don't work well on
>>multiple system image machines.
> 

I use and love systemimager (www.systemimager.org) for automating the 
process of managing my full-on 'install an OS on that node from scratch' 
as well as my 'update this file or directory across all cluster nodes' 
needs. It's a great product.

BioTeam has ported systemimager to Mac OS X but there are still some 
sticky issues involving the performance of HFS+ aware rsync. We've 
recently been using RADMIND from UMichigan on some Xserve cluster projects.

I'm a compute farm purist I guess :) I stay away from 'cluster in a box' 
products, particularly those that come with OpenPBS installed which I 
would never recommend for use in a biocluster. My usual approach is to 
just install pure Redhat or Debian Linux and a load management layer 
like GridEngine or LSF and then use Systemimager to handle the OS 
management and software update process.

For cluster monitoring and reporting I like: BigBrother, MRTG, Larrd, 
SAR, NTOP, Ganglia etc. etc. There are lots of great software monitoring 
tools out there.

For cluster management you need to treat your compute nodes as anonymous 
and disposable. You cannot afford to be messing with them on an 
individual basis because your admin burden will scale linearly with 
cluster size.

You need to:

o anonymize the cluster nodes on a private network. Your users should 
log into a cluster head node to submit their jobs

o treat compute nodes as being in 1 of 4 possible states:

  (1) running / online
  (2) online / reimaging or updating
  (3) rebooting
  (4) failed / offline / marked for replacement when convenient
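
One way to keep yourself honest about those four states is to write 
them down as data. The sketch below does that in Python. The state 
names follow the list above, but the table of allowed transitions is my 
own assumption about a sensible policy, not something any particular 
tool enforces.

```python
from enum import Enum


class NodeState(Enum):
    RUNNING = "running / online"
    UPDATING = "online / reimaging or updating"
    REBOOTING = "rebooting"
    FAILED = "failed / offline / marked for replacement"


# Assumed policy: any live node may fail; a failed node only comes
# back by being reimaged, never by hand repair.
ALLOWED = {
    NodeState.RUNNING: {NodeState.UPDATING, NodeState.REBOOTING,
                        NodeState.FAILED},
    NodeState.UPDATING: {NodeState.RUNNING, NodeState.REBOOTING,
                         NodeState.FAILED},
    NodeState.REBOOTING: {NodeState.RUNNING, NodeState.UPDATING,
                          NodeState.FAILED},
    NodeState.FAILED: {NodeState.UPDATING},
}


def transition(current, target):
    """Return target if the move is allowed, else raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"{current.name} -> {target.name} not allowed")
    return target


def schedulable(state):
    """Only nodes that are running should receive jobs."""
    return state is NodeState.RUNNING
```

Note there is no "being fixed by hand" state, and under this assumed 
policy transition(NodeState.FAILED, NodeState.RUNNING) raises an error.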

If you find yourself logging in as the root user to a compute node to 
fix or repair something then you are doing something wrong. It may seem 
OK at the time but it is a bad habit that will bite back when the 
cluster has to scale over time.

> 
> What are the usual choices here? Any recommendations, web-sites, companies
> I should be looking at in particular? 40 nodes would be our starting point
> but we could easily move up to more in time. Many vendors offer various
> alternatives - I think personally I would like to have something that is
> easy to understand and is not a complete black box.
> 
> 
>>There are other issues as well, specifically networking, backup of
>>system, extra IO capabilities, etc.
> 
> 
> Joe, I know this could easily turn into a thick book :), but how does one
> get more educated about these things?
>

I'm probably biased but I've heard lots of people speak positively about 
reading the archived threads of this mailing list. Many of the questions 
you are asking about have been debated and discussed in the past on this 
very mailing list. The list archives are online at 
http://bioinformatics.org/pipermail/bioclusters/

I'm also going to be rehashing a lot of this stuff in a talk at the 
upcoming O'Reilly Bioinformatics Technology conference. May be of 
interest to some people and really obvious and boring to others.

> Another question that I am curious about is Itanium 2 and using them in a
> cluster - any experiences with these? How about bioinformatics software -
> any benefits in your regular programs like blast, clustalw... when running
> on an itanium system?
> 
> Thank you,
> Ognen

Just my $.02 of course,
chris

-- 
Chris Dagdigian, <dag@sonsorol.org> - The BioTeam Inc.
Independent life science IT & informatics consulting
Office: 617-666-6454, Mobile: 617-877-5498, Fax: 425-699-0193
PGP KeyID: 83D4310E Yahoo IM: craffi Web: http://bioteam.net