[Bioclusters] topbiocluster.org

Wed Jun 29 10:33:29 EDT 2005

On Tue, 2005-06-28 at 18:30 -0700, Glen Otero wrote:

> I really support the topbiocluster and BBS efforts. The results of my  
> mini poll last week (mini in that only 7 people responded) solidified  
> in my mind the need for a metric like the ones you're now developing. 

Magic.  Glen, I'm sure you don't mind, I'm bouncing this to the list,
because there are more folk out there that know more than me about a lot
of this.

> Specifically, I thought about challenging people to see how fast they  
> could get 10k, 50k, and 100k sequences BLASTed through their cluster.  
> No holds barred, full-contact, real scenario performance, with only  
> one caveat--share with everybody your configuration and how you did  
> it when asked. The concept of pragmatic pipelines takes this idea a  
> step further, which is great.

I like the concept of 'full-contact'.  This is totally what this is all
about.  I've already been contacted off list regarding one of my
approaches being suboptimal, and you know, it more than likely is.  The
cool part here is that by keeping it open ("supreme power is retained and
directly exercised"), we will learn vast amounts.

I'm totally convinced that there are labs out there with moderate
compute who have some really smart ideas, and labs with huge compute
that have others.  The idea of ranking with respect to problem size, and
using that as a metric, keeps everyone in the loop.  I know I've managed
to do some really cool things on a dual processor box, and saved time by
not thinking about massive parallelism.  Sometimes, though, you need the
grunt.

> I'd like to ask whether my manner of contributing would be welcome,  
> or if I should find another. I'm working on a version of BioBrew that  
> has the apps built with Intel compilers, tools, and libraries. I'd be  
> happy to make the SRPMs available for use in benchmarks. However,  
> it's not clear to me whether or not many folks use the Intel tools in  
> this space, and therefore whether this is a good idea. Your thoughts,  
> suggestions?

I think if people had already-optimized code, and it was easy for them
to get the cluster going, it would make rapid testing easy.

So I'm getting close to having decent data sets that we can work on.  I'm
making them in a fairly funky way so we can get the old md5 sums going,
and also to make sure that we get reproducible answers.  It's no good if
code whizzes through the pipeline but finds nothing.  That could be
considered cheating :-)
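
To make the md5 idea concrete, here is a minimal sketch of the check a
benchmark run could do before and after the pipeline.  It is in Python
rather than Perl purely for illustration, and the manifest layout (a
mapping of file path to expected digest) and the function names are
assumptions, not anything that exists yet.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(manifest):
    """Return the paths whose digest differs from the expected one.

    `manifest` is a hypothetical dict of {path: expected_hex_digest}.
    An empty list means every file matched.
    """
    return [
        path
        for path, expected in manifest.items()
        if md5sum(path) != expected.lower()
    ]
```

The same check would work on the output side: publish the digest of the
expected results alongside the input set, and a run only counts if
verify() comes back empty for both.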

As for contribution, one of the pigs is to try and capture the
environment in a sensible manner: you know, the CPU, OS, memory, disk,
etc.  I don't want to burden folk too much to collect this data, so maybe
there could be some 'automagic' way of collecting this info?  Or at
least maybe a $term = new Term::ReadLine 'collector' widget to capture
and present the simple stuff.

> collector.pl
What is your Distribution (Found BioBrew) (Y/n/other):
Current host memory (Found 1024MB) (Y/n/other):
DRM in place (Found SGE) (Y/n/other):
Number of nodes (Found 2000) (Y/n/other):

etc.  You get the picture.  Some of this stuff could be fairly esoteric,
so would need some smarts to run the right 'ps, df, psrinfo stuff'
depending on the platform.  IBT does a whole bunch of this already, but
given that we also want to capture code running on different platforms,
it gets tricky fast.
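
Here is a rough sketch of the 'found it, please confirm' loop, again in
Python rather than Perl and only using the standard library.  The
function names, the labels and the set of facts collected are all
assumptions for illustration.  Distribution, DRM and node count are left
out because detecting those is exactly the platform-specific part.

```python
import os
import platform
import shutil


def detect():
    """Best-effort guesses about the host; a value is None if not found."""
    facts = {
        "Operating system": platform.system() + " " + platform.release(),
        "CPU architecture": platform.machine(),
        "Logical CPUs": os.cpu_count(),
        "Free disk in current directory (MB)":
            shutil.disk_usage(".").free // (1024 * 1024),
    }
    # Physical memory via sysconf only works on some Unix-like systems,
    # so fall back to None and let the person type it in.
    try:
        pages = os.sysconf("SC_PHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
        facts["Host memory (MB)"] = pages * page_size // (1024 * 1024)
    except (AttributeError, ValueError, OSError):
        facts["Host memory (MB)"] = None
    return facts


def confirm(label, found, ask=input):
    """Show a guess; Enter or 'y' accepts, 'n' blanks it, other text overrides."""
    answer = ask(f"{label} (Found {found}) (Y/n/other): ").strip()
    if answer == "" or answer.lower() == "y":
        return found
    if answer.lower() == "n":
        return None
    return answer


def collect(ask=input):
    """Run every detected fact past the person and return the confirmed set."""
    return {label: confirm(label, found, ask)
            for label, found in detect().items()}
```

Calling collect() with no arguments prompts on the terminal.  Passing a
different `ask` function makes it scriptable, which would matter on
clusters where nobody is sitting at the head node.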

Anyway, more later, I have a misbehaving Tenrec I have to go talk to...

J.