Hi Claris,

I haven't seen any replies to this come across the list, so I'll try to chime in on your questions, though I'm not sure how much help I can be. I'd love to hear what others have to say as well.

In general I have yet to see any really comprehensive (public) statistics or even actual case studies of cluster use in bioinformatics applications. In many cases people have been building cluster systems to solve a very specific computational need, and their primary interest is the data generated, not the system or its optimized architecture/benchmark characteristics. This is starting to change as people apply their experience with testbed or single-application clusters to building clustered systems that support general life science informatics research.

For popular algorithms like BLAST, HMMER, etc. that fall into the "embarrassingly parallel" category, the benefits of clusters are immediate and measurable, as many people on this list can no doubt confirm. With a dozen or so dual-CPU Intel Linux boxes and a good fileserver one can easily build a high-throughput BLAST farm that will blow away "enterprise" Unix systems that are sold to companies for many hundreds of thousands of dollars. This is generally how many life science people get into clustering: they need to free up resources on the expensive Sun/Alpha/SGI servers, so they build a cheap cluster to soak up whatever workload they can throw at it.

There may be people on this list who are willing to share with you some concrete performance metrics from the systems they have built. The groups that build the big/expensive clusters generally have to produce some fairly good benchmarks and case studies to justify themselves to the budget people, so I'm sure such documents exist.

I would be very cautious about putting too much weight on any public bioinformatics cluster benchmarks that you may run across (unless you intend to copy the architecture exactly).
There are way too many variables in such systems, and it often turns out that things like RAM, network bandwidth and disk I/O are the rate-limiting bottlenecks. There just is no "standard" way to do this type of work, so benchmarks as a means of comparison are going to be fairly meaningless because the hardware and architecture approaches are likely to differ wildly.

Back to your other question: there are few (if any) parallel applications that are widely used for sequence analysis and basic bioinformatics. About the only program I can think of is FASTA, which can run in a PVM environment. There may be some assembly programs that run in true parallel mode as well (I'm not sure). {Anyone have a good journal reference for an article that reviews this area?}

Instead of running a single parallel application to handle thousands or hundreds of thousands of sequence analysis operations, what you end up doing is invoking (and potentially distributing) many separate instances of the non-parallel algorithm, each with different command line arguments (input sequence, target database, threshold cutoffs, etc). This is the type of workflow that runs beautifully on a commodity cluster architecture, as there are lots of software suites available that handle batch scheduling & distributed load management.

There are many more parallel applications in use as you get away from sequence analysis and start doing things like molecular modelling, virtual screening, protein structure prediction etc. These are also the areas where you will actually see commercial companies like Accelrys/MSI selling parallel software products into the research community. Many people developing their own in-house code and proprietary algorithms are thinking about parallel processing on clusters. The level of interest I've seen in high-speed system interconnects like Myrinet and Dolphin SCI has risen significantly over the past year or so.
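
To make the "many separate instances" workflow above a bit more concrete, here is a rough Python sketch, not a recipe. It splits a multi-sequence FASTA file into one query file per sequence, builds one command line per query, and runs the independent jobs side by side. The function names (split_fasta, build_jobs, run_jobs), the file naming scheme and the {query}/{output} placeholders are all made up for illustration, and a local thread pool stands in for whatever batch scheduler a real cluster would use.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def split_fasta(text):
    """Split multi-FASTA text into a list of single-record strings."""
    records, current = [], []
    for line in text.splitlines():
        # A new header line closes the record collected so far.
        if line.startswith(">") and current:
            records.append("\n".join(current) + "\n")
            current = []
        if line.strip():
            current.append(line)
    if current:
        records.append("\n".join(current) + "\n")
    return records


def build_jobs(records, workdir, command_template):
    """Write one query file per record and return one argv list per job.

    command_template is a list of arguments; the hypothetical placeholders
    {query} and {output} are filled in with per-job file paths.
    """
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    jobs = []
    for i, record in enumerate(records):
        query = workdir / f"query_{i:05d}.fa"
        output = workdir / f"query_{i:05d}.out"
        query.write_text(record)
        jobs.append(
            [arg.format(query=query, output=output) for arg in command_template]
        )
    return jobs


def run_jobs(jobs, max_workers=4):
    """Run the independent jobs concurrently; return exit codes in job order.

    On a real cluster each argv would be handed to a batch scheduler
    instead of being run on the local machine.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda argv: subprocess.run(argv).returncode, jobs))
```

With a BLAST install, the template might look something like ["blastall", "-p", "blastn", "-d", "nt", "-i", "{query}", "-o", "{output}"], but treat that as an assumption and check the flags and database name against your own setup. The point is only that every job is independent of the others, so nothing in the analysis program itself needs to be parallel.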

In summary:

o benchmarks and case studies are very hard to find or are treated as confidential
o life science clusters tend to be used to bulk process many totally independent "embarrassingly parallel" jobs
o little if any use of parallel applications for sequence analysis
o some areas of genomics/proteomics/{insert buzzword here} are using parallel code (commercial and non-commercial)

My $.02

-Chris

At 11:16 PM 8/27/01 -0500, you wrote:
>I am working on a project called Scientific Computing Strategy for the
>Smithsonian Tropical Research Institute and I was wondering if I can get
>any statitics about using cluster in bioinformatics. I mean information
>like how much the time factor has improved in some applications like
>sequence alignment (multiple) using clusters, do we have software for
>parallel computing in the field (BLAST cluster version?), etc. Where can I
>get this information? Any idea?
>Thanks in advance,
>Claris

--
Chris Dagdigian (Home:Work)
Blackstone Computing
dag@sonsorol.org : dagdigian@blackstonecomputing.com
www.open-bio.org : www.blackstonecomputing.com