[Bioclusters] Bioclusters'04 Workshop Program -- Boston, MA March 30th
Chris Dagdigian
bioclusters@bioinformatics.org
Wed, 03 Mar 2004 10:07:19 -0500
Hello,
Enclosed is the current workshop schedule and program for our March 30th
gathering at the Hynes Convention Center in Boston. The workshop this
year is happening as part of the much larger BioITWorld Expo+Conference.
The website for the full Expo seems to be having technical
difficulties, but you can find a link through http://www.bioitworld.com
The workshop committee can be reached at bioclusters04@open-bio.org
Regards,
Chris
============================================================================
2004 Bioclusters Workshop -- Schedule v1.0
============================================================================
PROGRAM
10:00 - 10:15 Welcome & Introductions
10:15 - 11:00 Next generation filesystems for Bioclusters:
Overcoming I/O bottlenecks
11:00 - 11:30 TBA (known but not yet confirmed officially)
11:30 - 12:00 Production Bioclusters: The Good, the Difficult,
and the just plain Ugly
12:00 - 12:30 Federated clusters in an academic environment
12:30 - 1:30 LUNCH BREAK
1:30 - 2:00 Biopendium: Large scale cluster computing for the
real world of drug discovery
2:00 - 2:30 Building Cluster Workflows: Incogen VIBE & Sun
Grid Engine
2:30 - 3:00 Grid Computing Standards: Separating the vision
from the reality
3:00 - 3:30 AFTERNOON BREAK
3:30 - 4:00 Biobrew Linux
4:00 - 4:30  Stupid Cluster Tricks: FlexLM License Juggling
4:30 - 5:00 Hardware Accelerators & Building Hybrid Bioclusters
PRESENTATION INFORMATION
(1) "Next generation filesystems for Bioclusters: Overcoming I/O bottlenecks"
Dr Guy Coates, Group Leader, Informatics Systems Group
Wellcome Trust Sanger Institute
Hinxton, Cambridge
This talk will explain cluster filesystems and how they can ease I/O
bottlenecks and data management issues. Details will be given of how we
have deployed cluster filesystems on the Sanger clusters to
significantly improve their manageability and workflow. Performance
figures and benchmarks for typical bioinformatics algorithms will also
be discussed. Finally, we will look at future filesystem technologies.
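As a rough illustration of the kind of I/O measurement such benchmarks
involve, the sketch below times sequential writes and reports throughput.
It is not from the talk; the function name, file sizes, and block sizes
are arbitrary assumptions.

```python
import os
import time

def measure_write_throughput(path, total_mb=64, block_kb=1024):
    """Write total_mb of zeroes in block_kb chunks and return MB/s."""
    block = b"\0" * (block_kb * 1024)
    blocks = (total_mb * 1024) // block_kb
    start = time.time()
    with open(path, "wb") as fh:
        for _ in range(blocks):
            fh.write(block)
        fh.flush()
        os.fsync(fh.fileno())  # push data to disk so the timing is honest
    elapsed = time.time() - start
    # guard against a zero-length interval on very fast writes
    return total_mb / max(elapsed, 1e-9)
```

Running the same measurement against local disk, NFS, and a cluster
filesystem mount is one crude way to see where the bottleneck sits.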
(2) TBA -- not yet confirmed
(3) "Production Bioclusters: The Good, the Difficult, and the just
plain Ugly"
Dr. James Cuff, Broad Institute
No matter what the hype, installing and running production clusters is
hard, non-trivial, and sometimes just plain frustrating. This talk
will discuss the issues involved in the setup, configuration and day
to day running of production bioclusters. The goal of the talk is to
show the pitfalls, and give positive suggestions for how they can be
avoided.
The talk will focus on the following key areas; real life examples
will be given throughout.
The Good...
- The friendly pixies.
- Firewires and clones
- Vendor preloads, control towers, xcats and CSMs
- Trunks, bonds, and data localisation
The Difficult...
- Network design decisions and NFS
- Database and SQL bottlenecks
- Change and version control, automatic updates
- How to keep the whole thing running - rdists, rsyncs and dollies
- Why distributed resource management is essential
And the Ugly...
- '[syshelp] URGENT! HELP! The cluster is down!!!'
- Detecting failure modes - hardware, code and process
- Mixed architecture and desktop clusters
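The rsync-based synchronisation mentioned under "The Difficult" can be
sketched roughly as follows. The node names, paths, and rsync flags are
illustrative assumptions, not details from the talk.

```python
import subprocess

# Illustrative node names and paths -- replace with your own.
NODES = ["node001", "node002", "node003"]
PATHS = ["/etc/hosts", "/usr/local/cluster/etc/"]

def build_rsync_cmd(node, path):
    """Archive-mode, compressed push; --delete keeps each node from
    drifting away from the master copy."""
    return ["rsync", "-az", "--delete", path, "%s:%s" % (node, path)]

def sync_all(dry_run=True):
    """Push every managed path to every node (print only when dry_run)."""
    for node in NODES:
        for path in PATHS:
            cmd = build_rsync_cmd(node, path)
            if dry_run:
                print(" ".join(cmd))
            else:
                subprocess.check_call(cmd)
```

In practice the same loop would run from cron or a change-control hook,
which is where the version control and automatic updates above come in.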
(4) "Federated clusters in an academic environment"
Chris Dwan
Large, dedicated clusters provide tremendous computing power with great
performance for the price. Many researchers do not have access to
such a resource. This talk describes a multi-year effort to make use
of computing resources distributed across the University of Minnesota.
A variety of small Linux clusters and labs of workstations have been
loosely integrated, providing a substantial infrastructure without
large central investment. A modified version of the EnsEMBL genome
annotation pipeline is the primary task of this computational grid,
but other applications are on the horizon.
Topics to be covered include:
- Convincing department and lab administrators that this is a good
idea.
- Cycle stealing: when is it worthwhile?
- Authentication, authorization, security, and trust.
- Data synchronization:
- Error detection and correction
- Schedulers, metaschedulers, and reservation agents.
- Coding techniques and applications: Some applications are
simply not suited to a grid.
- Grid software: Hype vs. reality.
(5) "Biopendium: Large scale cluster computing for the real world of
drug discovery"
Dr Mark Swindells
Chief Scientific Officer, Inpharmatica
Biopendium is one of the largest commercially produced bioinformatics
resources in the world. Designed to provide high quality proteome-scale
annotation to match the high-throughput data generation of modern
research systems (such as microarray experiments), Biopendium
concentrates on precalculating results from high value annotation
algorithms for the research scientist. As many of the advanced methods
employed are CPU intensive (such as 3D protein structure analysis and
homology searches such as PSI-Blast and Threading), precalculation is
the only way to provide large scale results to researchers in a timely
manner. Every single protein sequence available at the time of
calculation (currently around 1.5 million from all available organisms)
has data available in Biopendium. This talk will focus on the
challenges of calculating such a large resource in a timely manner
using Linux based compute clusters, and discuss the ways Inpharmatica
are evolving their system to cope with the ever increasing data
volumes from the public domain.
(6) Cluster workflow: VIBE and SGE
Krista Miller, Director of Software Development, Incogen
INCOGEN is a life science software company that takes advantage of
advanced computing technologies to enhance the performance of our
applications. Primarily, the VIBE project is an integration and
workflow platform that aims to give end users access to a variety of
tools on a variety of compute resources without them needing to know
the nuts and bolts of what's beneath. Using Sun Grid Engine (SGE) to
pool Solaris/Sparc and Linux/x86 resources, VIBE relies on the resource
management of these subsystems and shares compute control with them.
The choice to adopt a third-party distribution layer rather than
implement one internally had many pros and cons and had to be tackled
incrementally. In this presentation we will outline many of the
considerations that were addressed (and how) and describe our
adaptation process. We hope to shed some light for other projects on
aspects to weigh when considering a similar choice, and share our
experience on do's and don'ts during the integration process.
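A workflow step dependency of the kind a platform like VIBE hands off
to SGE can be expressed with qsub's hold-job flag. The helper and job
names below are made up for illustration; only the qsub flags
themselves (-N, -cwd, -hold_jid) come from SGE.

```python
def build_qsub_cmd(job_name, script, wait_for=None):
    """Build an SGE qsub command line.  -N names the job, -cwd runs it
    in the submission directory, and -hold_jid makes it wait for the
    named predecessor -- enough to chain a linear workflow."""
    cmd = ["qsub", "-N", job_name, "-cwd"]
    if wait_for:
        cmd += ["-hold_jid", wait_for]
    cmd.append(script)
    return cmd

# Two-step workflow: format the database, then run the search.
step1 = build_qsub_cmd("formatdb", "format.sh")
step2 = build_qsub_cmd("blast_search", "blast.sh", wait_for="formatdb")
```

A workflow engine essentially emits chains like this (or the equivalent
API calls) so end users never see the scheduler underneath.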
(7) Grid Computing Standards: Separating the vision from the reality.
Chris Smith, Platform Computing
Grid computing standards currently being defined within the Global
Grid Forum are essential for the interoperability of systems deployed
on large scale Grid systems which span multiple organizations. In this
talk, the current activities within the GGF will be described, with an
eye to identifying what is vision for the future, and what
specifications are more immediately relevant. Covered areas will
include the architecture work of the OGSA working group, and the
various specifications such as WS-Agreement, OGSA-DAI, and OGSI.
This talk will also describe Platform Computing's Community Scheduler
Framework (CSF), an open source implementation of a number of Grid
Services built on the Globus Toolkit 3, which together provide a
platform for implementing metaschedulers. CSF represents the current
state of the art for Grid metaschedulers built on current standards
such as OGSI and WS-Agreement. CSF's Grid services, the scheduler
plug-in API, and some of the latest activities will be discussed.
(8) BioBrew Linux
Glen Otero
Pharmaceutical companies, biotechnology firms, and research
institutions are actively involved in developing and exploiting
massive datasets that will provide better insight into human diseases.
From a deeper knowledge of cell biology and genetic influences, a new
class of personal medicines will emerge. This explosion in genomic and
proteomic information and emergence of data-intensive molecular
diagnostic techniques is being fueled by scalable Linux cluster
computing. While commodity component Beowulf clusters have been in
use in a number of scientific computing application areas for as long
as ten years, they continue to pose significant challenges to research
IT organizations - in their provisioning, management, and use.
BioBrew is a supported cluster distribution for RedHat Linux that
includes the CallidentRX implementation of the NPACI ROCKS cluster
distribution, plus many of the most commonly used bioinformatics
applications typically deployed in cluster environments. In addition
to supporting many of the high performance, low latency interconnects,
such as Myrinet and Infiniband, CallidentRX includes support for
Panasas' ActiveScale Storage Cluster. This is a highly scalable
shared storage system developed to exploit high bandwidth, high
concurrency applications such as those commonly found in genomics and
proteomics research. The shared storage architecture simplifies
provisioning, management, and use of the cluster and ensures that
cluster resources can be effectively utilized.
The result is a new approach to deploying turnkey application clusters
for bioinformatics applications.
(9) Stupid Cluster Tricks: FlexLM License Juggling
Chris Dagdigian, BioTeam Inc.
An increasingly challenging cluster integration problem is the
emerging class of commercial software sold with built-in
rights-management restrictions. These packages are typically sold to
customers locked for use only on a certain machine or (more commonly)
with 'floating licenses' checked out from a networked license server
that strictly enforces a certain number of concurrent users. The most
commonly encountered license management system on Unix-based platforms
is FlexLM from Macrovision Corp.
Methods for integrating license status information into cluster
schedulers will be discussed with particular focus on techniques
used and lessons learned during a recent challenging integration
project involving multi-site license servers and Sun Grid Engine
Enterprise Edition at a global pharmaceutical research facility.
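One common way to feed license availability into a scheduler is to
parse lmstat output and publish the free count as a consumable
resource. The sketch below assumes the usual per-feature summary line
that `lmstat -a` prints; the feature name shown is invented.

```python
import re

# Matches the per-feature summary line printed by `lmstat -a`, e.g.:
#   Users of blast_pro:  (Total of 10 licenses issued;  Total of 3 licenses in use)
SUMMARY = re.compile(
    r"Users of (\S+):\s+\(Total of (\d+) licenses? issued;"
    r"\s+Total of (\d+) licenses? in use\)"
)

def free_licenses(lmstat_output):
    """Return a {feature: free-count} mapping parsed from lmstat output."""
    free = {}
    for m in SUMMARY.finditer(lmstat_output):
        free[m.group(1)] = int(m.group(2)) - int(m.group(3))
    return free
```

Counts like these could then be reported to the scheduler, for example
through a Sun Grid Engine load sensor backing a consumable resource, so
that jobs only start when a license is actually free.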
(10) Hardware Accelerators & Building Hybrid Bioclusters
Michael Curtin, Paracel
{ Abstract being revised; talk centers on real world lessons learned
from trying to build a large hybrid cluster containing compute nodes
and specialized accelerator hardware }