[Bioclusters] Nomenclature (was Re: Call for information.)

Joe Landman bioclusters@bioinformatics.org
17 Apr 2002 15:25:45 -0400


Rather than calling things "real parallel" and, by implication, "not
real parallel", it might be better and more descriptive to discuss these
things in terms of more common nomenclature.  The relevance to BLAST et
al. is covered towards the end.

High Performance Computing (HPC) is one of those things where you ask
10 people for its definition and you get 11 answers.  Within HPC are
the concepts of maximizing speedup (i.e. throwing as much resource as
you can at minimizing the run time of a single job) and maximizing
throughput (i.e. maximizing the number of independent jobs completed in
a given time).
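
To make the two goals concrete, here is a minimal Python sketch.  The
function names and all of the timings are made up for illustration; they
are not measurements from any real cluster.

```python
def speedup(serial_seconds, parallel_seconds):
    """How many times faster a single job finishes when given more resources."""
    return serial_seconds / parallel_seconds


def throughput(jobs_completed, elapsed_seconds):
    """Independent jobs finished per second of wall clock time."""
    return jobs_completed / elapsed_seconds


# Hypothetical: one job takes 1000 s on 1 CPU and 125 s on 16 CPUs.
print(speedup(1000.0, 125.0))     # 8.0

# Hypothetical: 16 CPUs each run one 1000 s serial job at a time,
# so 16 jobs complete every 1000 s.
print(throughput(16, 1000.0))     # 0.016
```

In the first case a single answer arrives sooner; in the second no
individual job runs any faster, but more of them finish per unit time.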

You build a large cluster with low latency switches specifically to
maximize speedup.  Some codes work quite well in this regard, but
are tremendously sensitive to the latency of the network (how long a
message takes to get from point A to point B on the network).
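
A rough way to see why latency matters is the common first-order model
"time = latency + size / bandwidth".  The sketch below uses that model
in Python.  The 100 microsecond latency is an assumed, illustrative
figure, and 12.5e6 bytes/s is simply the nominal 100 Mbit/s rate of fast
Ethernet; neither is a measurement.

```python
def transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s):
    """First-order estimate of the time to deliver one message."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


def latency_fraction(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Fraction of the transfer time that is spent purely on latency."""
    total = transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s)
    return latency_s / total


LATENCY_S = 100e-6      # assumed, for illustration only
BANDWIDTH = 12.5e6      # nominal 100 Mbit/s expressed in bytes per second

for size in (64, 64 * 1024, 1024 * 1024):
    frac = latency_fraction(size, LATENCY_S, BANDWIDTH)
    print(f"{size:>8} bytes: {frac:.1%} of the transfer time is latency")
```

Under these assumptions, a code that exchanges many small messages
spends nearly all of its communication time waiting on latency, while a
code that moves a few large messages is limited by bandwidth instead.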

These MPI codes have lots of tunable items, and experienced workers
usually have very fine control over the communication and calculation
phases.  MPI (like any other message passing protocol) requires
careful coordination among computing nodes to obtain good performance.

As pointed out, you can build these systems for bioinformatics
computing.  MPI/PVM based codes are the exception in bioinformatics,
though; most codes are designed and built to run on workstations with
one or a few processors.  These codes are not what you would call
classical parallel codes.  This does not mean they are not high
performance codes.  They can be made into high throughput computing
systems, as these codes usually process one chunk of information at a
time.  If the (N+1)th chunk does not depend upon the values
used/computed in the Nth chunk, you have the opportunity to pursue a
high throughput computing implementation.
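
The independence condition can be sketched in a few lines of Python.
Here `gc_fraction` is a toy stand-in for whatever a real code does to
one chunk, and `process_all` is a hypothetical helper name; the only
point is that no chunk needs the result of any other, so the chunks can
be handed to separate worker processes.

```python
from multiprocessing import Pool


def gc_fraction(sequence):
    """Toy per-chunk computation: fraction of G and C bases in one sequence."""
    if not sequence:
        return 0.0
    return sum(base in "GC" for base in sequence.upper()) / len(sequence)


def process_all(chunks, workers=4):
    """Process independent chunks in separate worker processes."""
    with Pool(processes=workers) as pool:
        # map() returns results in the same order as the input chunks.
        return pool.map(gc_fraction, chunks)


if __name__ == "__main__":
    chunks = ["ACGT", "GGCC", "ATAT", "GCGCAT"]
    print(process_all(chunks))
```

If chunk N+1 did need a value computed from chunk N, this approach would
not apply and the work would have to be done in order.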

BLAST, as it turns out, is nicely suited for this, as are some other
codes.  These calculations are still of the high performance variety,
but you attempt to maximize throughput rather than the speedup of an
individual job.  The end goal is still the same: reduce the wall clock
time so you can do more, or so your long running jobs don't take as
long.
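
One common way to apply this to a search like BLAST is to split the
query sequences into batches and run each batch as its own serial job.
The Python sketch below shows only the splitting step, on a made-up
three-sequence FASTA input.  The helper names are hypothetical, and the
BLAST invocation and the batch scheduler submission are deliberately
left out.

```python
def read_fasta_records(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        elif header is not None:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def split_round_robin(records, n_batches):
    """Deal records into n_batches lists, one record at a time."""
    batches = [[] for _ in range(n_batches)]
    for i, record in enumerate(records):
        batches[i % n_batches].append(record)
    return batches


# Made-up query set, for illustration only.
fasta_text = """>q1
ACGT
>q2
GGCC
TTAA
>q3
ATAT
"""

batches = split_round_robin(read_fasta_records(fasta_text.splitlines()), 2)
for number, batch in enumerate(batches):
    print(number, [header for header, _ in batch])
```

Each batch could then be written to its own file and searched
independently, with the per-batch outputs concatenated afterwards.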



On Wed, 2002-04-17 at 01:05, Goran Ceric wrote: 
> Let me interfere for a second. Real parallel applications ( using MPI, 
> PVM etc.) are cool and they can definitely impress people, especially those 
> people who write checks. The real power of Linux clusters, however, lies in 
> their ability to run a huge number of serial jobs. There are several 
> reasons why I'm saying this. First, most bioinformatics applications are 
> not parallelized (again, using MPI, PVM etc.), and second if you want to 
> use real parallel applications, you can't do that over fast Ethernet. I 
> mean, you certainly can, but you'd take a huge performance hit. In order to 
> maximize the efficiency, you'd have to invest in Myrinet or something 
> similar and networking would cost you more than your cluster. 
> 
> For $256K, you can get yourself ~128-cpu (64 dual PIII, 1U) cluster + 
> networking + rack + everything else, but don't buy it if you're not going 
> to use it often. We've got 128-cpu (PIII,1GHz,PBSPro/Maui, yada yada) 
> cluster here and the reason why we went with Intel and not AMD was that (at 
> that time) there wasn't a company I could trust and rely on (hardware 
> support) that sold SMP Athlon systems. Also, keep in mind that most of the 
> stuff you'd run is integer and not FP intensive. That's my $0.02.
> 
> Goran Ceric
> System Administrator
> Washington University St. Louis
> Department of Genetics, Eddy Lab
> goran@genetics.wustl.edu
> http://www.genetics.wustl.edu/eddy
> 
>   
> 
> > 
> > Hi All,
> > 
> > Well thanks for your replies (Ivo, Josh, William). Unfortunately I
> > think we've got off the topic a bit. Obviously the Pentium vs Athlon
> > debate is a popular discussion.  The only reason we're not considering
> > the Athlon solution is that our University has a ridiculous contract
> > with a Hardware provider that doesn't supply Athlon chips.
> > 
> > What I'm really interested in is what applications are others using
> > BioClusters for. Such as parallel implementations of BLAST, hmmsearch,
> > protein folding, etc. I want my Biologist colleagues to know the
> > amazing capabilities and benefits they will receive from using a
> > cluster.
> > 
> > Also, thanks a lot for your email Chris. What sort of experiments were
> > your pharma company able to implement with the aid of your blast farm?
> > And what are these urls you promised? ;-)
> > 
> > I personally use our cluster to fold lots of RNA sequences (using
> > Vienna) with an embarrassingly parallel implementation. I love the fact
> > that my simulations that would normally take months to complete are
> > finished in just a few days.
> > 
> > Cheers,
> > Paul.
> > 
> > On Tue, 16 Apr 2002, William Park wrote:
> > 
> >>On Wed, Apr 17, 2002 at 02:13:14AM +1200, Paul Gardner wrote:
> >>>
> >>> Hi William,
> >>>
> >>> I already have several problems that I could extend beyond our
> >>> current computing solution (16 node Beowulf: sisters.massey.ac.nz).
> >>> The problem is convincing a few senior scientists (hope that answers
> >>> your question Chris, thanks heaps) who've recently received a large
> >>> government grant that spending a bit of money upgrading our current
> >>> system is worthwhile. Basically I want my Biologist friends to know
> >>> that they'll also benefit from the power of a cluster.
> >>>
> >>> I agree that Athlons are probably a better way to go, but
> >>> frustratingly there are some other issues that have made us consider
> >>> PentiumIVs as a better solution.
> >>
> >>Well, what do they want?  More nodes for current clusters, or new
> >>computers on their desks?
> >>
> >>A cluster is not all that it's cracked up to be.  After spending all that
> >>money building dual-P3 cluster a year ago, I'm a bit pissed to find
> >>that current 1 dual-MP will practically replace it.  Your situation is
> >>probably the same.
> >>
> >>--
> >>William Park, Open Geometry Consulting, <opengeometry@yahoo.ca>
> >>8 CPU cluster, NAS, (Slackware) Linux, Python, LaTeX, Vim, Mutt, Tin
> >>_______________________________________________
> >>Bioclusters maillist  -  Bioclusters@bioinformatics.org
> >>http://bioinformatics.org/mailman/listinfo/bioclusters
> >>
> >>
-- 
Joseph Landman, Ph.D.
Senior Scientist,
MSC Software High Performance Computing
email		: joe.landman@mscsoftware.com
messaging	: page_joe@mschpc.dtw.macsch.com
Main office	: +1 248 208 3312
Cell phone	: +1 734 612 4615
Fax		: +1 714 784 3774