Hello,

Enclosed is the current workshop schedule and program for our March 30th
gathering at the Hynes Convention Center in Boston. The workshop this year
is happening as part of the much larger BioITWorld Expo+Conference. The
website for the full Expo seems to be having technical difficulties, but
you can find a link through http://www.bioitworld.com

The workshop committee can be reached at bioclusters04@open-bio.org

Regards,
Chris

[Attachment: biocluster04-program-v1.txt]

============================================================================
                 2004 Bioclusters Workshop -- Schedule v1.0
============================================================================

PROGRAM

10:00 - 10:15  Welcome & Introductions
10:15 - 11:00  Next generation filesystems for Bioclusters: Overcoming I/O
               bottlenecks
11:00 - 11:30  TBA (known but not yet confirmed officially)
11:30 - 12:00  Production Bioclusters: The Good, the Difficult, and the
               just plain Ugly
12:00 - 12:30  Federated clusters in an academic environment
12:30 -  1:30  LUNCH BREAK
 1:30 -  2:00  Biopendium: Large scale cluster computing for the real world
               of drug discovery
 2:00 -  2:30  Building Cluster Workflows: Incogen VIBE & Sun Grid Engine
 2:30 -  3:00  Grid Computing Standards: Separating the vision from the
               reality
 3:00 -  3:30  AFTERNOON BREAK
 3:30 -  4:00  BioBrew Linux
 4:00 -  4:30  Stupid Cluster Tricks: FlexLM License Juggling
 4:30 -  5:00  Hardware Accelerators & Building Hybrid Bioclusters

PRESENTATION INFORMATION

(1) "Next generation filesystems for Bioclusters: Overcoming I/O bottlenecks"
Dr Guy Coates, Group Leader, Informatics Systems Group, Wellcome Trust Sanger
Institute, Hinxton, Cambridge

This talk will explain cluster filesystems and how they can ease I/O
bottlenecks and data management issues. Details will be given of how we
have deployed cluster filesystems on the Sanger clusters to significantly
improve their manageability and workflow. Performance figures and
benchmarks for typical bioinformatics algorithms will also be discussed.
Finally, we will look at future filesystem technologies.

(2) TBA -- not yet confirmed

(3) "Production Bioclusters: The Good, the Difficult, and the just plain Ugly"
Dr. James Cuff, Broad Institute

No matter what the hype, installing and running production clusters is
hard, non-trivial, and sometimes just plain frustrating. This talk will
discuss the issues involved in the setup, configuration, and day-to-day
running of production bioclusters. The goal of the talk is to show the
pitfalls and give positive suggestions for how they can be avoided. The
talk will focus on the following key areas; real-life examples will be
given throughout.

The Good...
- The friendly pixies.
- Firewires and clones
- Vendor preloads, control towers, xcats and CSMs
- Trunks, bonds, and data localisation

The Difficult...
- Network design decisions and NFS
- Database and SQL bottlenecks
- Change and version control, automatic updates
- How to keep the whole thing running - rdists, rsyncs and dollies
- Why distributed resource management is essential

And the Ugly...
- '[syshelp] URGENT! HELP! The cluster is down!!!'
- Detecting failure modes - hardware, code and process
- Mixed architecture and desktop clusters

(4) "Federated clusters in an academic environment"
Chris Dwan

Large, dedicated clusters provide tremendous computing power with great
performance for the price, but many researchers do not have access to such
a resource. This talk describes a multi-year effort to make use of
computing resources distributed across the University of Minnesota.
A variety of small Linux clusters and labs of workstations have been
loosely integrated, providing a substantial infrastructure without large
central investment. A modified version of the EnsEMBL genome annotation
pipeline is the primary task of this computational grid, but other
applications are on the horizon.

Topics to be covered include:
- Convincing department and lab administrators that this is a good idea.
- Cycle stealing: when is it worthwhile?
- Authentication, authorization, security, and trust.
- Data synchronization
- Error detection and correction
- Schedulers, metaschedulers, and reservation agents.
- Coding techniques and applications: some applications are simply not
  suited to a grid.
- Grid software: hype vs. reality.

(5) "Biopendium: Large scale cluster computing for the real world of drug
discovery"
Dr Mark Swindells, Chief Scientific Officer, Inpharmatica

Biopendium is one of the largest commercially produced bioinformatics
resources in the world. Designed to provide high-quality, proteome-scale
annotation to match the high-throughput data generation of modern research
systems (such as microarray experiments), Biopendium concentrates on
precalculating results from high-value annotation algorithms for the
research scientist. As many of the advanced methods employed are CPU
intensive (such as 3D protein structure analysis and homology searches
such as PSI-Blast and Threading), precalculation is the only way to
provide large-scale results to researchers in a timely manner. Every
protein sequence available at the time of calculation (currently around
1.5 million from all available organisms) has data available in
Biopendium. This talk will focus on the challenges of calculating such a
large resource in a timely manner using Linux-based compute clusters, and
will discuss the ways Inpharmatica are evolving their system to cope with
the ever increasing data volumes from the public domain.
(6) "Building Cluster Workflows: Incogen VIBE & Sun Grid Engine"
Krista Miller, Director of Software Development, Incogen

INCOGEN is a life science software company that takes advantage of
advanced computing technologies to enhance the performance of our
applications. The VIBE project is an integration and workflow platform
that aims to give end users access to a variety of tools on a variety of
compute resources without their needing to know the nuts and bolts of
what's beneath. Using Sun Grid Engine (SGE) to pool Solaris/SPARC and
Linux/x86 resources, VIBE leverages the resource management of these
subsystems and shares compute control with them. The choice to adopt a
third-party distribution layer rather than implement one internally was a
decision with many pros and cons, and one that had to be tackled
incrementally. In this presentation we will outline many of the
considerations that were addressed (and how) and describe our adaptation
process. We hope to shed some light for other projects on aspects to weigh
when considering a similar choice, and to share our experience on do's and
don'ts during the integration process.

(7) "Grid Computing Standards: Separating the vision from the reality"
Chris Smith, Platform Computing

Grid computing standards currently being defined within the Global Grid
Forum (GGF) are essential for the interoperability of systems deployed on
large-scale Grids which span multiple organizations. In this talk, the
current activities within the GGF will be described, with an eye to
identifying what is vision for the future and what specifications are more
immediately relevant. Covered areas will include the architecture work of
the OGSA working group and the various specifications such as
WS-Agreement, OGSA-DAI, and OGSI. This talk will also describe Platform
Computing's Community Scheduler Framework (CSF), an open source
implementation of a number of Grid Services built on the Globus Toolkit 3,
which together provide a platform for implementing metaschedulers.
CSF represents the current state of the art for Grid metaschedulers built
on current standards such as OGSI and WS-Agreement. CSF's Grid services,
the scheduler plug-in API, and some of the latest activities will be
discussed.

(8) "BioBrew Linux"
Glen Otero

Pharmaceutical companies, biotechnology firms, and research institutions
are actively involved in developing and exploiting massive datasets that
will provide better insight into human diseases. From a deeper knowledge
of cell biology and genetic influences, a new class of personal medicines
will emerge. This explosion in genomic and proteomic information and the
emergence of data-intensive molecular diagnostic techniques is being
fueled by scalable Linux cluster computing. While commodity-component
Beowulf clusters have been in use in a number of scientific computing
application areas for as long as ten years, they continue to pose
significant challenges to research IT organizations in their provisioning,
management, and use.

BioBrew is a supported cluster distribution for Red Hat Linux that
includes the CallidentRX implementation of the NPACI ROCKS cluster
distribution, plus many of the most commonly used bioinformatics
applications typically deployed in cluster environments. In addition to
supporting many of the high-performance, low-latency interconnects, such
as Myrinet and InfiniBand, CallidentRX includes support for Panasas'
ActiveScale Storage Cluster, a highly scalable shared storage system
developed to exploit high-bandwidth, high-concurrency applications such as
those commonly found in genomics and proteomics research. The shared
storage architecture simplifies provisioning, management, and use of the
cluster and ensures that cluster resources can be effectively utilized.
The result is a new approach to deploying turnkey application clusters for
bioinformatics.

(9) "Stupid Cluster Tricks: FlexLM License Juggling"
Chris Dagdigian, BioTeam Inc.
An increasingly challenging cluster integration problem is the emerging
class of commercial software sold with built-in rights-management
restrictions. These packages are typically sold to customers locked for
use only on a certain machine or (more commonly) with 'floating licenses'
checked out from a networked license server that strictly enforces a
certain number of concurrent users. The most commonly encountered license
management system on Unix platforms is FlexLM from Macrovision Corp.
Methods for integrating license status information into cluster schedulers
will be discussed, with particular focus on techniques used and lessons
learned during a recent challenging integration project involving
multi-site license servers and Sun Grid Engine Enterprise Edition at a
global pharmaceutical research facility.

(10) "Hardware Accelerators & Building Hybrid Bioclusters"
Michael Curtin, Paracel

{ Abstract being revised; talk centers on real-world lessons learned from
trying to build a large hybrid cluster containing compute nodes and
specialized accelerator hardware }
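P.S. For those curious what the scheduler integration in talk (9) looks like
in practice: it usually starts by parsing the output of FlexLM's lmstat tool
into a number a scheduler can consume, for example via a Grid Engine load
sensor. The sketch below is illustrative only, not the speaker's method; the
feature name (BLASTX) and the exact lmstat line format are assumptions, and
real lmstat output varies by FlexLM version.

```shell
# Minimal sketch: extract the count of free floating licenses from an
# lmstat-style status line. A scheduler load sensor would report this
# number so jobs needing a license are only dispatched when one is free.

parse_free_licenses() {
    # Expects input like (format assumed for illustration):
    #   Users of BLASTX:  (Total of 10 licenses issued;  Total of 3 licenses in use)
    # Prints the number of licenses still available.
    awk '/Users of/ {
        for (i = 1; i <= NF; i++) {
            if ($(i+1) == "licenses" && $(i+2) == "issued;") total = $i
            if ($(i+1) == "licenses" && $(i+2) == "in")      used = $i
        }
        # strip any punctuation clinging to the numbers
        gsub(/[^0-9]/, "", total); gsub(/[^0-9]/, "", used)
        print total - used
    }'
}

# In a real deployment this would be fed from the license server, e.g.:
#   lmstat -a -c "$LM_LICENSE_FILE" | parse_free_licenses
echo 'Users of BLASTX:  (Total of 10 licenses issued;  Total of 3 licenses in use)' \
    | parse_free_licenses    # prints: 7
```

The number printed here is what a load sensor would attach to a consumable
resource in the scheduler, so license checkout failures surface as queued
jobs rather than runtime errors.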