Lance Davidow wrote:

> At Massachusetts General Hospital, we are looking to set up a local
> blast server, preferably with the NCBI algorithm, on a computing
> cluster. We have installed the OSCAR clustering package
> http://sourceforge.net/projects/oscar/ on a small test cluster of
> computers running Red Hat Linux 7.2 on Intel. Does anyone know of an
> open source blast-for-cluster package that would run under this
> cluster management suite, or is there a commercial package you are
> already running that you think is worthwhile? We are also considering
> Apple's Xserve with the G4-accelerated blast to compare against the
> above system.

Hi Lance,

We build such systems all the time. The closest one to you is probably over at the Harvard Bauer Center for Genomics Research, across the river from your institution. The setup is pretty vanilla -- a 60-CPU compute farm with a GigE network core, customized Red Hat 7.2, Platform LSF and about 3 terabytes of fast network-attached storage. Drop me a line directly if you'd like me to arrange a tour or meet up somewhere for an informal whiteboard/technical chat about how and why it was built the way it was. George Church has given it his blessing :)

You have lots of options, and the people on this mailing list have implemented or are actively using almost all of them, so this is a good place to bounce ideas and experiences around. What you should deploy depends mostly on what you want to do with your blast service and what your budget is :)

If you are primarily going to use it for a high-throughput pipeline -- where you know you are going to be doing many thousands of searches against the same database before moving on to the next large query set -- then you may get the best overall deal by purchasing one of the specialized hardware acceleration packages from companies like TimeLogic and Paracel.
Hardware accelerators have their drawbacks, but they can be very good for high-throughput pipelines where you just need a fast resource that acts like a "black box" and carries a very trivial administrative burden -- which can be very, very important in groups with limited IT staff.

If you are not interested in hardware-based solutions then you have two other basic options, both of which run on top of your basic cluster and in some cases can expand past your cluster to harness CPU cycles from other systems and/or desktop boxes:

(1) Add a general-purpose resource management layer ('DRM') like Platform LSF or Sun GridEngine to your cluster, so you have a 'compute farm' capable of running and distributing load from many different informatics applications across inexpensive commodity hardware.

(2) Purchase application-specific software from companies like Blackstone Computing and TurboGenomics, both of whom have blast-specific product offerings as well as larger frameworks that can handle other distributed computational requirements.

RLX falls into the (1) category -- they sell Transmeta- or PIII-based blade systems, and they struck a deal with Platform so they can bundle LSF into their product offerings. This is primarily what allows them to offer turnkey blast farms, although I bet there is some extra software and scripting behind the scenes that goes into the whole package.

I personally don't think the current generation of blade servers from any vendor is suitable for blast farms, because far too many of them rely on cheesy 4200 RPM laptop disk drives for their local disk IO, which is far too slow for life science where many of our apps are IO-bound to begin with. Many blades also have significant max-memory limits. I won't seriously consider blades until the form factors allow for full-size ATA or SCSI drives (Compaq just announced this in their new blade chassis).
This is just my opinion of course, and to their credit RLX has successfully sold very large blade packages into places like the Sanger Centre, so they must be doing something really good (although Sanger was already an LSF-based shop to begin with).

In option (1), a DRM == "Distributed Resource Management" layer. It is basically a software layer that allows you to link together many loosely coupled (potentially heterogeneous) servers. The DRM layer is what handles the process of scheduling your job, executing it on the best available resource and returning the results back to you.

In a normal compute-farm based blast system, the DRM is used mainly to schedule and farm out your many standalone ncbi-blast or wu-blast jobs across N different compute servers. What you end up with is a system that, depending on how you tune it, is optimized for either (a) very fast turnaround on individual jobs or (b) very high aggregate throughput.

The nice thing about a DRM-based compute farm is that you are not limited to only running Blast queries -- any program that can execute on any of your servers can be used in this way, so you have a very flexible and scalable research computing / informatics platform that can cope with changing times and scientific advances. This is why so many people are building them.

Platform LSF is a really nice best-of-breed commercial DRM package for clusters and compute farms. It is a very good product, and it is certainly not cheap, although I believe that LSF's price is worth it for medium-to-large clusters when you factor in the savings you get in increased uptime, better robustness and lower administrative burden.
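To make the job-farming idea concrete, here is a minimal sketch of the usual pattern: split a multi-FASTA query set into chunks, then queue one standalone blast job per chunk through the DRM. The queue name, database and blast options are all made up for illustration, and the submit command is echoed rather than executed so you can see what would be dispatched:

```shell
# Tiny sample query file for illustration; on a real system this would
# be your large multi-FASTA query set.
printf '>seq1\nMKTAYIAKQR\n>seq2\nMVLSPADKTN\n' > queries.fa

# Split into one chunk per sequence so the DRM can farm them out.
awk '/^>/{f=sprintf("chunk_%03d.fa", ++n)} {print > f}' queries.fa

# Queue one standalone blast job per chunk. The 'echo' just prints the
# command; on a real LSF farm you would drop it and let bsub submit.
# Queue name ('normal') and database ('nr') are hypothetical.
for f in chunk_*.fa; do
  echo bsub -q normal blastall -p blastp -d nr -i "$f" -o "$f.out"
done
```

Tuning toward fast individual turnaround versus high aggregate throughput then mostly comes down to how finely you chunk the queries and how the DRM queues are configured.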
(The cost of LSF licensing is trivial if it allows you to avoid having to hire an additional cluster admin -- people don't think about human and administrative costs enough.)

Recently, though, I think Platform has gotten a bit of an attitude -- they used to be very responsive and eager to deal with life science customers, but now the sales force is getting complacent and arrogant. If they don't shape up and rethink their pricing and sales practices, they are going to get their asses kicked by GridEngine (which is free and improving at an incredible rate) over the next few years.

If the DRM / "compute farm" approach is attractive to you and you have a limited budget, then consider evaluating Sun GridEngine (http://gridengine.sunsource.net). It is a very solid product that is making huge gains in the life sciences and is maturing at an amazing rate. A year ago there were almost no GridEngine-based 'bioclusters', and now they are popping up all over the place. There is a very active GridEngine mailing list with a great signal-to-noise ratio that you can turn to for support and tuning issues.

Ok, that covers option (1); now on to the commercial software offerings:

(1) TurboGenomics has a whole framework for distributed informatics computation -- TurboBlast is just one of their targeted offerings. I can't speak much about their stuff as I'm not that familiar with it.

(2) Blackstone Computing has a 'big' product called PowerCloud that can pretty much encapsulate and expand upon what a cluster DRM software layer can do. They also have a targeted suite of modular tools, including 'PowerBlast', that just sticks to doing one thing fast. I used to be a Blackstone employee so I really should know this stuff, but I've been gone for almost a year and they have changed lots of things. Regardless, there are a few Blackstone people on this list who can correct/expand if necessary.

There are probably others that I've missed. You can probably find them in the archives of this list.
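On the GridEngine side, the same farming pattern is usually expressed as an array job: one script submitted N times, each task picking its chunk from the task ID. A minimal sketch, with hypothetical script name, chunk naming and blast options -- the task ID is defaulted to 1 so the script can be sanity-checked off the cluster, and the blast command is echoed rather than run:

```shell
#!/bin/sh
# Hypothetical SGE array-job script; submit with something like:
#   qsub -t 1-100 blast_chunk.sh
# SGE sets $SGE_TASK_ID to this task's index; default to 1 so the
# script can be dry-run outside the cluster.
CHUNK=$(printf 'chunk_%03d.fa' "${SGE_TASK_ID:-1}")

# Each task blasts its own chunk. On the real farm, replace the echo
# with the actual invocation.
CMD="blastall -p blastp -d nr -i $CHUNK -o $CHUNK.out"
echo "$CMD"
```

Because each task is an independent standalone blast run, scaling up is just a matter of raising the task range on the qsub line.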
Joe Landman, I think, recently posted a survey of what was out and available.

Apple Xserves
=============

Coincidentally enough, we (bioteam.net) are building an Xserve blast farm with Platform LSF as part of a day-long seminar / hands-on lab that will be occurring on September 10th at Apple's Market Center in Boston. The event is free, registration is still open, and the full details and invite can be found online at http://bioteam.net/MacOSX/Xserve-Clustering-Event.html

My usual disclaimer: all of this is just my opinion; don't take anything as gospel!

Regards,
Chris

--
Chris Dagdigian, <dag@sonsorol.org>
Bioteam.net - Independent Bio-IT & Informatics consulting
Office: 617-666-6454, Mobile: 617-877-5498, Fax: 425-699-0193
PGP KeyID: 83D4310E  Yahoo IM: craffi  Web: http://bioteam.net