[Bioclusters] General question on time consuming problems

Fri Apr 22 05:34:32 EDT 2005

On 20 Apr 2005, at 7:22 pm, George White wrote:

> The other problem is that many of the real-world clusters are lucky to
> get 50% uptime.  The one down the hall was fried when the A/C died.
> They fixed all that, took a couple weeks to get a new A/C installed,
> and then a cable to the RAID stopped working, so now they have to get
> the cable and hope the files weren't damaged.  You hear the success
> stories from people who have been lucky with A/C hardware, etc., but
> there are also lots of cluster owners who are swamped by the upkeep
> and/or poorly maintained physical plant (power problems, A/C, etc.).

But then, as you say, if your problem is really embarrassingly 
parallel, and you code it right, losing a few nodes here and there 
isn't a problem.  One of the nice things about embarrassingly parallel 
problems is that they tend to allow for gradual loss of capacity.  It's 
quite useful for us; it allows us to wait for a number of nodes to fail 
before we batch them up and send them back for repair.  This saves a 
lot of money in support costs, as well as effort on our part.
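
As a rough illustration of what "coding it right" might look like, here 
is a minimal Python sketch.  It is a simulation only, not how any 
particular scheduler or cluster does it: run_tasks, fail_prob and the 
node names are all hypothetical, and the "work" is just squaring a 
number.  The idea is that tasks are independent, so a task that was on 
a failed node simply goes back on the queue and runs elsewhere.

```python
import random
from collections import deque


def run_tasks(tasks, nodes, fail_prob=0.2, seed=0):
    """Simulate independent tasks on a pool of nodes that shrinks over time.

    Hypothetical sketch: node failure is a coin flip, and the "work" is
    squaring the task number. Returns (results, dead_nodes).
    """
    rng = random.Random(seed)
    pending = deque(tasks)   # tasks still waiting to run
    alive = list(nodes)      # nodes still accepting work
    dead = []                # failed nodes, batched up for repair later
    results = {}

    while pending:
        task = pending.popleft()
        node = rng.choice(alive)

        # Assumption for the simulation: the last node never fails, so the
        # run always finishes. A real cluster has no such guarantee.
        if len(alive) > 1 and rng.random() < fail_prob:
            alive.remove(node)
            dead.append(node)
            pending.append(task)  # requeue: no other task depends on it
            continue

        results[task] = task * task

    return results, dead


if __name__ == "__main__":
    results, dead = run_tasks(range(20), [f"node{i}" for i in range(8)])
    print(f"{len(results)} tasks done, {len(dead)} nodes awaiting repair")
```

Every task still completes; the run just takes longer as nodes drop 
out, and the dead list is the batch you eventually send off for repair.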

Tim

-- 
Dr Tim Cutts
Informatics Systems Group, Wellcome Trust Sanger Institute
GPG: 1024D/E3134233 FE3D 6C73 BBD6 726A A3F5  860B 3CDD 3F56 E313 4233