[Bioclusters] Bioclusters'04 Workshop Program -- Boston, MA March 30th
Chris Dagdigian
bioclusters@bioinformatics.org
Wed, 03 Mar 2004 10:07:19 -0500
Hello,
Enclosed is the current workshop schedule and program for our March 30th
gathering at the Hynes Convention Center in Boston. The workshop this
year is happening as part of the much larger BioITWorld Expo+Conference.
The website for the full Expo seems to be having technical
difficulties, but you can find a link through http://www.bioitworld.com
The workshop committee can be reached at bioclusters04@open-bio.org
Regards,
Chris
============================================================================
2004 Bioclusters Workshop -- Schedule v1.0
============================================================================
PROGRAM
10:00 - 10:15 Welcome & Introductions
10:15 - 11:00 Next generation filesystems for Bioclusters:
Overcoming I/O bottlenecks
11:00 - 11:30 TBA (known but not yet confirmed officially)
11:30 - 12:00 Production Bioclusters: The Good, the Difficult,
and the just plain Ugly
12:00 - 12:30 Federated clusters in an academic environment
12:30 - 1:30 LUNCH BREAK
1:30 - 2:00 Biopendium: Large scale cluster computing for the
real world of drug discovery
2:00 - 2:30 Building Cluster Workflows: Incogen VIBE & Sun
Grid Engine
2:30 - 3:00 Grid Computing Standards: Separating the vision
from the reality
3:00 - 3:30 AFTERNOON BREAK
3:30 - 4:00 Biobrew Linux
4:00 - 4:30  Stupid Cluster Tricks: FlexLM License Juggling
4:30 - 5:00 Hardware Accelerators & Building Hybrid Bioclusters
PRESENTATION INFORMATION
(1) "Next generation filesystems for Bioclusters: Overcoming I/O bottlenecks"
Dr Guy Coates, Group Leader, Informatics Systems Group
Wellcome Trust Sanger Institute
Hinxton, Cambridge
This talk will explain cluster filesystems and how they can ease I/O
bottlenecks and data management issues. Details will be given of how we
have deployed cluster filesystems on the Sanger clusters to
significantly improve their manageability and workflow. Performance
figures and benchmarks for typical bioinformatics algorithms will also
be discussed. Finally, we will look at future filesystem technologies.
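As a rough illustration of the kind of I/O measurement such benchmarks
involve, the sketch below times sequential writes and reports throughput.
It is not from the talk; the function name, file sizes, and block sizes
are arbitrary assumptions.

```python
import os
import time

def measure_write_throughput(path, total_mb=64, block_kb=1024):
    """Write total_mb of zeroes in block_kb chunks and return MB/s."""
    block = b"\0" * (block_kb * 1024)
    blocks = (total_mb * 1024) // block_kb
    start = time.time()
    with open(path, "wb") as fh:
        for _ in range(blocks):
            fh.write(block)
        fh.flush()
        os.fsync(fh.fileno())  # push data to disk so the timing is honest
    elapsed = time.time() - start
    # guard against a zero-length interval on very fast writes
    return total_mb / max(elapsed, 1e-9)
```

Running the same measurement against local disk, NFS, and a cluster
filesystem mount is one crude way to see where the bottleneck sits.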
(2) TBA -- not yet confirmed
(3) "Production Bioclusters: The Good, the Difficult, and the just
plain Ugly"
Dr. James Cuff, Broad Institute
No matter what the hype, installing and running production clusters is
hard, non-trivial, and sometimes just plain frustrating. This talk
will discuss the issues involved in the setup, configuration and day
to day running of production bioclusters. The goal of the talk is to
show the pitfalls, and give positive suggestions for how they can be
avoided.
The talk will focus on the following key areas; real life examples
will be given throughout.
The Good...
- The friendly pixies.
- Firewires and clones
- Vendor preloads, control towers, xcats and CSMs
- Trunks, bonds, and data localisation
The Difficult...
- Network design decisions and NFS
- Database and SQL bottlenecks
- Change and version control, automatic updates
- How to keep the whole thing running - rdists, rsyncs and dollies
- Why distributed resource management is essential
And the Ugly...
- '[syshelp] URGENT! HELP! The cluster is down!!!'
- Detecting failure modes - hardware, code and process
- Mixed architecture and desktop clusters
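The rsync-based synchronisation mentioned under "The Difficult" can be
sketched roughly as follows. The node names, paths, and rsync flags are
illustrative assumptions, not details from the talk.

```python
import subprocess

# Illustrative node names and paths -- replace with your own.
NODES = ["node001", "node002", "node003"]
PATHS = ["/etc/hosts", "/usr/local/cluster/etc/"]

def build_rsync_cmd(node, path):
    """Archive-mode, compressed push; --delete keeps each node from
    drifting away from the master copy."""
    return ["rsync", "-az", "--delete", path, "%s:%s" % (node, path)]

def sync_all(dry_run=True):
    """Push every managed path to every node (print only when dry_run)."""
    for node in NODES:
        for path in PATHS:
            cmd = build_rsync_cmd(node, path)
            if dry_run:
                print(" ".join(cmd))
            else:
                subprocess.check_call(cmd)
```

In practice the same loop would run from cron or a change-control hook,
which is where the version control and automatic updates above come in.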
(4) "Federated clusters in an academic environment"
Chris Dwan
Large, dedicated clusters provide tremendous computing power with great
performance for the price. Many researchers do not have access to
such a resource. This talk describes a multi-year effort to make use
of computing resources distributed across the University of Minnesota.
A variety of small Linux clusters and labs of workstations have been
loosely integrated, providing a substantial infrastructure without
large central investment. A modified version of the EnsEMBL genome
annotation pipeline is the primary task of this computational grid,
but other applications are on the horizon.
Topics to be covered include:
- Convincing department and lab administrators that this is a good
idea.
- Cycle stealing: when is it worthwhile?
- Authentication, authorization, security, and trust.
- Data synchronization:
- Error detection and correction
- Schedulers, metaschedulers, and reservation agents.
- Coding techniques and applications: Some applications are
simply not suited to a grid.
- Grid software: Hype vs. reality.
(5) "Biopendium: Large scale cluster computing for the real world of
drug discovery"
Dr Mark Swindells
Chief Scientific Officer, Inpharmatica
Biopendium is one of the largest commercially produced bioinformatics
resources in the world. Designed to provide high quality proteome-scale
annotation to match the high-throughput data generation of modern
research systems (such as microarray experiments), Biopendium
concentrates on precalculating results from high value annotation
algorithms for the research scientist. As many of the advanced methods
employed are CPU intensive (such as 3D protein structure analysis and
homology searches such as PSI-Blast and Threading), precalculation is
the only way to provide large scale results to researchers in a timely
manner. Every single protein sequence available at the time of
calculation (currently around 1.5 million from all available organisms)
has data available in Biopendium. This talk will focus on the
challenges of calculating such a large resource in a timely manner
using Linux based compute clusters, and discuss the ways Inpharmatica
are evolving their system to cope with the ever increasing data
volumes from the public domain.
(6) Cluster workflow: VIBE and SGE
Krista Miller, Director of Software Development, Incogen
INCOGEN is a life science software company that takes advantage of
advanced computing technologies to enhance the performance of our
applications. Primarily, the VIBE project is an integration and
workflow platform that aims to give end users access to a variety of
tools on a variety of compute resources without them needing to know
the nuts and bolts of what's beneath. Using Sun Grid Engine (SGE) to
pool Solaris/Sparc and Linux/x86 resources, VIBE relies on the resource
management of these subsystems and shares compute control with them.
The choice to adopt a third-party distribution layer rather than
implement one internally had many pros and cons and had to be tackled
incrementally. In this presentation we will outline many of the
considerations that were addressed (and how) and describe our
adaptation process. We hope to shed some light for other projects on
aspects to weigh when considering a similar choice, and share our
experience on do's and don'ts during the integration process.
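A workflow step dependency of the kind a platform like VIBE hands off
to SGE can be expressed with qsub's hold-job flag. The helper and job
names below are made up for illustration; only the qsub flags
themselves (-N, -cwd, -hold_jid) come from SGE.

```python
def build_qsub_cmd(job_name, script, wait_for=None):
    """Build an SGE qsub command line.  -N names the job, -cwd runs it
    in the submission directory, and -hold_jid makes it wait for the
    named predecessor -- enough to chain a linear workflow."""
    cmd = ["qsub", "-N", job_name, "-cwd"]
    if wait_for:
        cmd += ["-hold_jid", wait_for]
    cmd.append(script)
    return cmd

# Two-step workflow: format the database, then run the search.
step1 = build_qsub_cmd("formatdb", "format.sh")
step2 = build_qsub_cmd("blast_search", "blast.sh", wait_for="formatdb")
```

A workflow engine essentially emits chains like this (or the equivalent
API calls) so end users never see the scheduler underneath.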
(7) Grid Computing Standards: Separating the vision from the reality.
Chris Smith, Platform Computing
Grid computing standards currently being defined within the Global
Grid Forum are essential for the interoperability of systems deployed
on large scale Grid systems which span multiple organizations. In this
talk, the current activities within the GGF will be described, with an
eye to identifying what is vision for the future, and what
specifications are more immediately relevant. Covered areas will
include the architecture work of the OGSA working group, and the
various specifications such as WS-Agreement, OGSA-DAI, and OGSI.
This talk will also describe Platform Computing's Community Scheduler
Framework (CSF), an open source implementation of a number of Grid
Services built on the Globus Toolkit 3, which together provide a
platform for implementing metaschedulers. CSF represents the current
state of the art for Grid metaschedulers built on current standards
such as OGSI and WS-Agreement. CSF's Grid services, the scheduler
plug-in API, and some of the latest activities will be discussed.
(8) BioBrew Linux
Glen Otero
Pharmaceutical companies, biotechnology firms, and research
institutions are actively involved in developing and exploiting
massive datasets that will provide better insight into human diseases.
From a deeper knowledge of cell biology and genetic influences, a new
class of personal medicines will emerge. This explosion in genomic and
proteomic information and emergence of data-intensive molecular
diagnostic techniques is being fueled by scalable Linux cluster
computing. While commodity component Beowulf clusters have been in
use in a number of scientific computing application areas for as long
as ten years, they continue to pose significant challenges to research
IT organizations - in their provisioning, management, and use.
BioBrew is a supported cluster distribution for RedHat Linux that
includes the CallidentRX implementation of the NPACI ROCKS cluster
distribution, plus many of the most commonly used bioinformatics
applications typically deployed in cluster environments. In addition
to supporting many of the high performance, low latency interconnects,
such as Myrinet and Infiniband, CallidentRX includes support for
Panasas' ActiveScale Storage Cluster. This is a highly scalable
shared storage system developed to exploit high bandwidth, high
concurrency applications such as those commonly found in genomics and
proteomics research. The shared storage architecture simplifies
provisioning, management, and use of the cluster and ensures that
cluster resources can be effectively utilized.
The result is a new approach to deploying turnkey application clusters
for bioinformatics applications.
(9) Stupid Cluster Tricks: FlexLM License Juggling
Chris Dagdigian, BioTeam Inc.
An increasingly challenging cluster integration problem is the
emerging class of commercial software sold with built-in
rights-management restrictions. These packages are typically sold to
customers locked for use only on a certain machine or (more commonly)
with 'floating licenses' checked out from a networked license server
that strictly enforces a certain number of concurrent users. The most
commonly encountered license management system on Unix-based platforms
is FlexLM from Macrovision Corp.
Methods for integrating license status information into cluster
schedulers will be discussed with particular focus on techniques
used and lessons learned during a recent challenging integration
project involving multi-site license servers and Sun Grid Engine
Enterprise Edition at a global pharmaceutical research facility.
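One common way to feed license availability into a scheduler is to
parse lmstat output and publish the free count as a consumable
resource. The sketch below assumes the usual per-feature summary line
that `lmstat -a` prints; the feature name shown is invented.

```python
import re

# Matches the per-feature summary line printed by `lmstat -a`, e.g.:
#   Users of blast_pro:  (Total of 10 licenses issued;  Total of 3 licenses in use)
SUMMARY = re.compile(
    r"Users of (\S+):\s+\(Total of (\d+) licenses? issued;"
    r"\s+Total of (\d+) licenses? in use\)"
)

def free_licenses(lmstat_output):
    """Return a {feature: free-count} mapping parsed from lmstat output."""
    free = {}
    for m in SUMMARY.finditer(lmstat_output):
        free[m.group(1)] = int(m.group(2)) - int(m.group(3))
    return free
```

Counts like these could then be reported to the scheduler, for example
through a Sun Grid Engine load sensor backing a consumable resource, so
that jobs only start when a license is actually free.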
(10) Hardware Accelerators & Building Hybrid Bioclusters
Michael Curtin, Paracel
{ Abstract being revised; talk centers on real world lessons learned
from trying to build a large hybrid cluster containing compute nodes
and specialized accelerator hardware }