On 20 Apr 2005, at 7:22 pm, George White wrote:

> The other problem is that many of the real-world clusters are lucky to
> get 50% uptime. The one down the hall was fried when the A/C died. They
> fixed all that, took a couple weeks to get a new A/C installed, and then
> a cable to the RAID stopped working, so now they have to get the cable
> and hope the files weren't damaged. You hear the success stories from
> people who have been lucky with A/C hardware, etc., but there are also
> lots of cluster owners who are swamped by the upkeep and/or poorly
> maintained physical plant (power problems, A/C, etc.).

But then, as you say, if your problem is really embarrassingly parallel,
and you code it right, losing a few nodes here and there isn't a problem.
One of the nice things about embarrassingly parallel problems is that they
tend to tolerate a gradual loss of capacity. That is quite useful for us:
it lets us wait for a number of nodes to fail before we batch them up and
send them back for repair. This saves a lot of money in support costs, as
well as effort on our part.

Tim

-- 
Dr Tim Cutts
Informatics Systems Group, Wellcome Trust Sanger Institute
GPG: 1024D/E3134233 FE3D 6C73 BBD6 726A A3F5 860B 3CDD 3F56 E313 4233