[Bioclusters] file server for cluster

Joe Landman bioclusters@bioinformatics.org
19 Apr 2002 03:56:43 -0400

I'll put 2 responses in one here... (the one on GigE and the second this
one) ... as they both address putting file systems on a wire for sharing
them.  Chris D did a great job talking about the differences between
servers; I am going to talk about the scaling aspects of this.

The name of this game is not to run out of bandwidth, either on the
network (first portion) or on the disks (second portion).

You need some numbers, and some back-of-the-envelope calculations.
Numbers can be found on the drive makers' web sites.  Terminology:  MB/s
is megabytes per second, Mb/s is megabits per second.  With 8 bits per
byte plus protocol overhead, there is (almost) a factor of 10 relating
the two.

Suppose you decide you want to hang some drives off of a GigE connected
server.  Each connected computer on a single 100 Base T link is capable
of sinking or sourcing about 11 MB/s (about 90 Mb/s) running flat out.
That isn't the theoretical max; it is achievable or realizable bandwidth.
If you can get this same 90% of peak on your GigE (1000 Mb/s) link, then
you should be able to get about 900 Mb/s achievable.  Quick division
yields about 10 of the 100 Base T links per GigE link, before you fill
up.  If you are using single connections per machine, this is 1 GigE
feeding 10 machines.  If you channel bond your 100 Base T's, then you
are looking at 1 GigE per 5 machines.
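The division above can be sketched as a trivial calculation (the
bandwidth figures are the realizable numbers from this post, not
theoretical peaks):

```python
# Back-of-the-envelope fan-out: how many 100 Base T clients one GigE
# uplink can feed if every client saturates its links.
GIGE_REALIZABLE_MBPS = 900      # 90% of 1000 Mb/s
FAST_ETH_REALIZABLE_MBPS = 90   # 90% of 100 Mb/s

def clients_per_gige(links_per_client=1):
    """Clients one GigE uplink can service when each client runs
    links_per_client 100 Base T interfaces flat out."""
    return GIGE_REALIZABLE_MBPS // (FAST_ETH_REALIZABLE_MBPS * links_per_client)

print(clients_per_gige())   # single link per machine -> 10
print(clients_per_gige(2))  # channel-bonded pair     -> 5
```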

This of course assumes that you are 1) fully saturating the 100 Base T
pipes all the time, and 2) that you are doing large block sequential
accesses (reads or writes).  

Reality is never so simple.

You can measure your process I/O utilization at a coarse level by

	vmstat 1 > /tmp/log.IO

and then looking at the columns labeled bi and bo (blocks in and out
respectively).  It isn't that hard to calibrate a maximal load against a
quiescent disk or file server: just copy a very large file there after
launching the vmstat in the background.  This will give you a max value
for reading and writing (if you choose to do both).  Now run your
process.  More often than not, you will see spikes to near the maximum,
and then large periods of low usage.  But it is possible that you will
see long periods of intense IO.  This is application dependent.  From
this you can estimate a "duty cycle", or an average utilization.  You can
eyeball this if you don't want to measure it; just make sure you err on
the side of larger utilization (e.g. round up).
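If you'd rather not eyeball it, a small script can turn the vmstat log
into a duty-cycle estimate.  This is a sketch: the duty_cycle helper is
hypothetical, and it assumes the Linux vmstat column layout where bi and
bo appear in a header line; adjust the parsing if yours differs.

```python
def duty_cycle(lines, max_blocks_per_sec):
    """Average (bi + bo) per sample, divided by the calibrated maximum
    measured while copying a large file to the quiescent server."""
    bi_idx = bo_idx = None
    total = samples = 0
    for line in lines:
        fields = line.split()
        if "bi" in fields and "bo" in fields:
            # vmstat's column-name header tells us where bi and bo live
            bi_idx, bo_idx = fields.index("bi"), fields.index("bo")
            continue
        if bi_idx is not None and fields and fields[0].isdigit():
            total += int(fields[bi_idx]) + int(fields[bo_idx])
            samples += 1
    return total / (samples * max_blocks_per_sec)

# Tiny fabricated log in vmstat-like shape, for illustration only:
log = [
    "procs  memory  swap  io  system  cpu",
    " r b   swpd free   si so   bi    bo   in cs us sy id",
    " 1 0      0 1000    0  0 9000  1000  100 50 10  5 85",
    " 0 0      0 1000    0  0    0     0  100 50  1  1 98",
    " 0 0      0 1000    0  0    0     0  100 50  1  1 98",
    " 0 0      0 1000    0  0 2000     0  100 50  5  2 93",
]
print(duty_cycle(log, max_blocks_per_sec=10000))  # -> 0.3
```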

So your 10 x 100 Base T interfaces will use, on average, this utilization
percentage of the total bandwidth.  What you want to do is scale the
number of machines hanging off this GigE interface so that the
utilization multiplied by the number of 100 Base T interfaces is close
to something like 80% of the realizable bandwidth.

This tells you approximately how many machines you can service from this
GigE running this application, with this type of data.

So, for your case of 40 machines with 1 interface per machine, you would
need an IO utilization below 20% per node for them to be really
serviceable from the single GigE interconnect.
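The scaling rule above can be written down directly.  This is a sketch
using the realizable figures from this post and the ~80% headroom cap;
max_machines is a made-up name for the purpose of illustration:

```python
def max_machines(utilization, link_mbps=90, gige_mbps=900, headroom=0.8):
    """How many 100 Base T clients fit under the headroom cap, given
    each client's average IO duty cycle (0.0 - 1.0)."""
    return int((headroom * gige_mbps) / (utilization * link_mbps))

print(max_machines(0.20))  # 20% duty cycle -> 40 machines
print(max_machines(1.00))  # streaming flat out -> only 8 with headroom
```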

If you are effectively streaming database indices off the disk to each
node, this will be problematic.  You will likely run out of network
bandwidth at the 10 node mark.

Now onto the disks themselves.  Suppose that you have budget for a nice
set of Seagate ST336752LW (15k RPM, 36.7 GB).  These disks can (see
http://www.seagate.com/cda/products/discsales/enterprise/tech/0,1084,379,00.html)
sustain in excess of 50 MB/s (from 508 to 706 Mb/s).  Remember that GigE
is 1000 Mb/s.  Two of these disks could keep a GigE full if they are
running flat out.  If you put these into a Linux server versus a NAS
box, you are going to run into a few issues.
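Checking that "2 disks fill a GigE" claim with the same ~10 bits per
byte rule of thumb used above:

```python
import math

# ~50 MB/s sustained per drive, ~10 wire bits per byte (8 data bits
# plus overhead), GigE at 1000 Mb/s.
disk_mb_per_s = 50
disk_mbit_per_s = disk_mb_per_s * 10          # ~500 Mb/s per drive
gige_mbit_per_s = 1000

drives_to_fill_gige = math.ceil(gige_mbit_per_s / disk_mbit_per_s)
print(drives_to_fill_gige)  # -> 2
```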

First:  if your IDE/SCSI controller is on the same PCI (PCI 133 MB/s,
100 MB/s realizable) as your network (GigE), you are going to cause that
poor machine to whimper.  You need either a PCI 266 MB/s bus (200 MB/s
realizable), or multiple PCI busses.  Either way, this immediately takes
most of the common non-server oriented machines off the block as
potential candidates.  You can easily fill up the PCI bus on these
things with enough IO traffic.

Second: ATA100 would require 2 channels to attach to this type of drive
(you want one dedicated ATA100 channel per drive).  Ultra160 is great,
but you need a 266 MB/s bus to plug it into.  3 of the Seagate drives on
that controller will pretty much max out the controller if the disks are
running flat out.  You cannot put 2 of these U160s on a single PCI 266
system; you wouldn't have room left for the GigE.
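The bus-budget arithmetic behind both points is just addition and a
comparison.  A sketch, with illustrative peak figures (a loaded U160
with three of the Seagate drives at ~150 MB/s, GigE at ~110 MB/s):

```python
def bus_fits(device_demands_mb_s, bus_realizable_mb_s):
    """True if the devices' combined peak demand fits on the bus."""
    return sum(device_demands_mb_s) <= bus_realizable_mb_s

gige = 110           # ~ GigE demand in MB/s
loaded_u160 = 150    # three 50 MB/s drives on one Ultra160

# GigE sharing a plain PCI 133 bus (100 MB/s realizable) with the
# controller: the poor machine whimpers.
print(bus_fits([gige, loaded_u160], 100))  # -> False

# Two loaded U160s alone already exceed a PCI 266 bus (200 MB/s
# realizable), before the GigE even asks for its share.
print(bus_fits([loaded_u160, loaded_u160], 200))  # -> False
```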

Basically, building these things into a Linux box is somewhat hard.
There are many little gotchas.

Look at the boxes Chris indicated, and possibly the 3ware boxes as
well.  Some people I know swear by them.


On Thu, 2002-04-18 at 18:22, Ivo Grosse wrote:
> Hi all,
> we want to buy a new fileserver (for our cluster) with about 1 TB, and 
> we are thinking of a Linux machine.  My question is: which kind of 
> fileserver do YOU use (and why)?
> (a) NAS (network-attached storage)?
> (b) regular Linux machine with internal RAID?
> (c) regular Linux machine with external RAID?
> Thanks!!!
> Ivo
> _______________________________________________
> Bioclusters maillist  -  Bioclusters@bioinformatics.org
> http://bioinformatics.org/mailman/listinfo/bioclusters
Joseph Landman, Ph.D.
Senior Scientist,
MSC Software High Performance Computing
email		: joe.landman@mscsoftware.com
messaging	: page_joe@mschpc.dtw.macsch.com
Main office	: +1 248 208 3312
Cell phone	: +1 734 612 4615
Fax		: +1 714 784 3774