On Wed, 6 Nov 2002, Simon Twigger wrote:
> Hi there,
>
> I stumbled across this mailing list today searching through some
> bioperl archives and I'm hoping that someone out there can point me in

Make sure you read through the archives of this list. I think you will be able to find some helpful suggestions there to make your upcoming decisions easier.

From reading your post it seems that you might be under some pressure to employ existing, albeit heterogeneous, machines to solve some problems. There are solutions which, for example, use Java as a layer to distribute work across a variety of architectures. There are also pipelines available which might use things like SOAP, XML, etc. to abstract analysis from the researcher (but not from the system administrator who has to set it all up and make it work). Check www.bioinformatics.org for a list of budding projects. These will give you an idea of how people are designing software.

Just be careful to distinguish between solutions whose underlying technology exists to borrow or steal cycles off dormant or lightly loaded machines and solutions which are intended to be used with a known number of machines. In the cycle-stealing/borrowing scenario it may never be clear just how many systems will be working on a given job, making it difficult to present good estimates of completion times. This is a problem if you are attempting to accommodate the interests of several researchers, all of whom need data by a grant deadline.

As for some practical details: I use a modestly sized cluster of Appro computers with Red Hat 7.3 and Platform LSF 5.0 to perform Blasting of the human genome and to provide a web-based Blast service for my users. Right now we backend it with some Perl and use LSF to distribute jobs and collect results. All in all a good setup, though I'm starting to run Grid Engine in parallel (no pun intended) to test its suitability as a total replacement.
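To give a flavor of the glue involved, here is a rough sketch (in Python rather than our actual Perl, and with made-up paths, queue name, and chunk layout) of how a wrapper can fan one query out over pre-split database chunks via LSF's bsub, noting where each per-chunk result will land:

```python
# Sketch only: build the bsub command lines a wrapper would shell out.
# Queue name, output directory, and chunk paths are hypothetical.

def blast_job_commands(query, db_chunks, queue="normal", outdir="/scratch/blast_out"):
    """Return one LSF submission command per database chunk."""
    cmds = []
    for i, chunk in enumerate(db_chunks):
        out = f"{outdir}/part{i}.out"
        cmds.append(
            f"bsub -q {queue} -o {out} "
            f"blastall -p blastn -d {chunk} -i {query}"
        )
    return cmds

# Example: four chunks of a human genome database on local scratch disks.
chunks = [f"/scratch/db/hg.{i:02d}" for i in range(4)]
for cmd in blast_job_commands("query.fa", chunks):
    print(cmd)
```

Collecting results is then just a matter of waiting for the jobs and concatenating the partN.out files, which is where most of our Perl actually lives.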
My interest in replacing LSF with SGE is based on the price of LSF licenses. However, I notice that attractive discounts are available when purchasing through particular resellers. LSF is good, flexible software, no doubt about it. But we are always under pressure to do more with less money, hence my interest in SGE.

I've also been using TurboGenomics' TurboBlast on the cluster, which uses Java to distribute the work. (There is also a Python dependency.) It comes with some wrappers and suggestions for use with PBS. I'm not restricted to just my cluster nodes: I could run a "blastworker" on a Windows 2000 box (after installing Python and Java, of course) and establish some load thresholds which keep the system usable for interactive purposes. I do spend a lot of time adjusting parameters on each client to make sure that the optimal number of machines is engaged. But I also spend time tuning the LSF software to make job distribution easier. Refinement and tuning are ongoing.

As far as organizing the cluster, I argue that the more assumptions you can make about the architecture, operating system, and filesystem layout, the less work you have waiting for you after things are set up. There are some fine tools which let you provision and image disks so you can manage various architectures and toss their images onto servers for quick reinstalls and restores in the event of hardware failure. If you don't have much IT support staff or free student labor then check out the vendors who will sell you a rack-in-a-box. RLX (their Control Tower software is very nice), Microway, etc. all have appealing solutions that are prepackaged in a way to minimize startup costs and labor.

I tend to toss data out onto the nodes, which means I might use software RAID to improve performance. I don't do much over NFS except to serve the LSF tree, but even that's not a requirement. I like to cache data locally, which means quicker Blasting in my case.
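Splitting the databases for local caching takes surprisingly little code. A minimal sketch with a toy FASTA parser and a round-robin split by record count (illustrative, not my production script; real chunks would still need formatdb run on each, and you would balance by residue count rather than record count for uneven databases):

```python
# Sketch: split a FASTA database into N roughly equal chunks so each
# node can cache one chunk locally. Round-robin by record for balance.

def parse_fasta(text):
    """Tiny FASTA parser: returns a list of (header, sequence) tuples."""
    records, header, seq = [], None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(seq)))
            header, seq = line[1:].strip(), []
        else:
            seq.append(line.strip())
    if header is not None:
        records.append((header, "".join(seq)))
    return records

def split_fasta(records, n_chunks):
    """Deal records round-robin into n_chunks lists."""
    chunks = [[] for _ in range(n_chunks)]
    for i, rec in enumerate(records):
        chunks[i % n_chunks].append(rec)
    return chunks

fasta = ">seq1\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n"
chunks = split_fasta(parse_fasta(fasta), 2)
# chunk 0 holds seq1 and seq3; chunk 1 holds seq2
```

Adding a node just means rerunning the split with a larger n_chunks and pushing the new chunk files out (rsync or the like), which is what I mean by resplitting programmatically.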
I split up my databases, so when I add new nodes I resplit and redistribute (programmatically, of course).

Depending on the nature of your development you might want to check out SSI models (MOSIX, Scyld) which, in a grossly simplified manner, let you treat a group of systems as a single process space.

As for Xserves, I've yet to get my hands on an eval unit so I can't say much about them. If their advertised price is the actual street price then I can buy two of what I'm already using (dual Athlon 1800+, 2x40 GB HD, and 2 GB RAM) for the same money. The Xserves are worth checking out though.

Good Luck

On Wed, 6 Nov 2002, Simon Twigger wrote:
> Hi there,
>
> I stumbled across this mailing list today searching through some
> bioperl archives and I'm hoping that someone out there can point me in
> the right direction to get myself up to speed on bioclusters, both on
> the hardware and software side. Im more from the bio-side of
> bioinformatics and Im trying to understand more of the nitty gritty
> informatics/computer part!
>
> We've been writing bioinformatics software in perl/java for a while and
> we've got Oracle and MySQL databases and we run all the usual
> genome/sequence analysis packages (blast, blat, etc) plus some of our
> own annotation pipelines. Historically we've been running these on
> multiple machines but not really in a cluster with robust load
> management software or any significant modifications to how we write
> our code to enable it to scale in a multiprocessor environment. I'm
> trying to find out better ways to use our Sun, Compaq and (probably)
> MacOS machines, how to get them all working together to handle both
> genomic and proteomic analyses and how to modify our existing and new
> code to work in this environment.
>
> I'd love to find some sort of 'bioclustering for dummies' that outlines
> the usual solutions and approaches, also on the software side something
> that describes the fundamentals of writing perl and java to exploit
> clusters and even some simple examples/test packages that I could play
> with to get my feet wet.
>
> A few specific things that Im thinking about, perhaps people can
> comment on my rationale
> We have a variety of platforms and it would be great to make them all
> play together - LSF appears to be a good solution to handle load
> balancing on a heterogeneous set of servers (we have Sun, Compaq and
> will probably add Xserves into the mix), from my reading the downside
> is the price ($400 per server was a price I saw quoted on the list).
> ease of administration seems to be another pro for LSF which is a big
> thing as we just want it to work, we dont really want to babysit this
> stuff - what sort of sysadmin commitment is needed to make this work?
>
> Im personally interested in trying the Xserve, the storage capacity,
> speed, price, etc. all make it attractive as an alternative to our
> traditional options. Oracle is coming out for OS X (and the developer
> release is running on my Powerbook as we speak) so that's another good
> thing. Im doing all my development on a G4 with 10.2 and its great, any
> thoughts/experiences with using Xserve in the mix with other platforms
> and Xserve vs intel solutions?
>
> Many thanks for any help anyone can give a newbie in the field!
>
> Simon.
>
> --------------------------------------------------------------------------------------------------
> Simon Twigger, Ph.D.
> Assistant Professor, Bioinformatics Research Center
>
> Medical College of Wisconsin
> 8701 Watertown Plank Road,
> Milwaukee, WI, 53226
> tel. 414-456-8802, fax 414-456-6595
>
> _______________________________________________
> Bioclusters maillist - Bioclusters@bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/bioclusters
>