[Bioclusters] Sequest on Linux

Tue Aug 16 12:29:10 EDT 2005

Hi Christopher,

I worked in a proteomics lab at the University of Toronto for a couple of
years as the resident research systems admin, and our group used sequest-pvm
as part of a multi-genome research project. I might not be able to
answer your questions directly, but perhaps some of the experiences we
had over those two years might be useful to you. I left the lab about
two years ago, so I don't know if the software has changed substantially
since then.

We were running a 16-node IBM cluster, based on dual P3-1.13 GHz nodes,
with a larger head node, all running Red Hat. The individual nodes were
hooked up to a 100Base-T switch, and the head node had a gigabit uplink to
the same switch. For the most part, this worked quite well, although now
I'd rather make sure that we use Gig-E everywhere.

The head node had a single U160 drive array hooked to it. I found that
the head node's loadavg didn't rise too high during a run; in my
experience, machines that are I/O bound tend to have higher loadavgs and
lots of idle CPU, so I doubt the array was the bottleneck for us.

We also found that a bug in sequest-pvm would cause it to slowly exhaust
all of the available file handles on the head node; the easy fix was to
halt pvm and restart the user's pvmd.
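
If you want some warning before the head node runs out of handles, something
along these lines could be run from cron. This is only a rough sketch, not
what we used: it assumes a Linux kernel that exposes /proc/sys/fs/file-nr
(three numbers: allocated handles, allocated-but-unused handles, system-wide
maximum), and the function names and the 0.9 threshold are made up for
illustration.

```python
from pathlib import Path

# Assumption: Linux exposes system-wide handle counts here as three integers.
FILE_NR = "/proc/sys/fs/file-nr"


def parse_file_nr(text):
    """Return (allocated, unused, maximum) from the contents of file-nr."""
    allocated, unused, maximum = (int(field) for field in text.split())
    return allocated, unused, maximum


def handle_usage(text):
    """Fraction of the system-wide file handle limit currently in use."""
    allocated, unused, maximum = parse_file_nr(text)
    return (allocated - unused) / maximum


def handles_nearly_exhausted(path=FILE_NR, threshold=0.9):
    """True when usage is at or above the (hypothetical) warning threshold."""
    return handle_usage(Path(path).read_text()) >= threshold
```

When handles_nearly_exhausted() returns True, that would be the moment to
halt pvm and restart pvmd, or at least to mail the admin.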

As for queuing, I can't really suggest much, as we ended up writing
something in-house to manage most of the sequest runs, which for the
most part works reasonably well. You would likely not want to go that
route though, I suspect, given how many better job management systems
have come up in the last four years.

Longer-term storage and backup is a dicier matter. My feeling now, from
working with hardware-based SATA RAID solutions (Promise and 3Ware) is
that I would only use these for non-active storage, because they can't
yet keep up with a good SCSI RAID card and good drives under that sort
of load. I'm *very* happy running 3Ware arrays under Linux in most other
contexts, though. My experiences running one of the larger Promise cards
under Linux aren't worth repeating (read: don't go there!). 

I'd really go SCSI RAID for the data currently being searched, and then
move this to a SATA RAID array afterwards. SCSI is overkill for a lot of
the longer-term storage requirements, and I've found that on a cluster
having few users, it's *REALLY* important to encourage 'good
housekeeping' practices - either through scripts or through human
intervention.

You may have to do something with the sequest data directories, however.
At the time, sequest would create literally thousands of files for any
given MS file, and these would chew through inodes; we actually had
situations in which df showed free space on the array, but there were no
free inodes in which to put data! Ultimately, we used a wrapper script to
tarball and compress the MS/MS files and interim results once the analyses
were completed. We also wrapped our cgi scripts in such a way that they
could look into the bzipped tarballs; this worked well. I haven't used
sequest since, though I hope (and suspect) this has been resolved.
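
To give a flavour of the approach, here is a rough Python sketch. It is not
our original wrapper (which I no longer have); the function names and the
directory layout are invented for illustration. It checks inode usage on a
filesystem, packs a finished run directory into a bzip2-compressed tarball,
and reads a single file back out of the tarball without unpacking the rest.

```python
import os
import tarfile


def inode_usage(path):
    """Fraction of inodes in use on the filesystem holding `path`."""
    st = os.statvfs(path)
    if st.f_files == 0:
        # Some filesystems do not report a fixed inode count.
        return 0.0
    return 1.0 - st.f_ffree / st.f_files


def archive_run(run_dir, archive_path):
    """Pack a completed run directory into a .tar.bz2 archive."""
    top = os.path.basename(os.path.normpath(run_dir))
    with tarfile.open(archive_path, "w:bz2") as tar:
        tar.add(run_dir, arcname=top)
    return archive_path


def read_member(archive_path, member_name):
    """Return the bytes of one file stored inside the archive."""
    with tarfile.open(archive_path, "r:bz2") as tar:
        member = tar.extractfile(member_name)
        if member is None:
            raise KeyError(member_name + " is not a regular file")
        return member.read()
```

Note that the inodes only come back once the original files are removed, so
a real wrapper would verify the archive and then delete the run directory.
A cgi script could call read_member() to fetch one result file on demand.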

The backup approach used at the time was to keep a copy of the MS files
from the mass spec machine; these would be written to DVD and stored off
site. The DVD image was also stored on a central server in case the data
were ever in doubt (md5 signatures were also stored with the info).
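
For the md5 signatures, something as simple as the following would do. This
is a minimal sketch using Python's standard hashlib rather than whatever we
actually ran at the time; the function names are hypothetical. It reads the
file in chunks so that large DVD images don't have to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_stored_md5(path, stored_hex):
    """Compare a file against a previously recorded signature."""
    return md5_of_file(path) == stored_hex.strip().lower()
```

The stored signature can then be checked against both the DVD copy and the
image on the central server whenever the data are in question.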

In retrospect, I would have also done one or two other things. The first
would have been to have the analysis scripts set and clear the immutable
attribute (chattr +i / -i) as required when setting up and moving files.
The second would have been to set up a root-owned folder, somewhere on the
array drive, holding hard links to critical data files.
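
The hard-link idea could be sketched as below. This is an illustration only,
with made-up function and directory names; we never actually implemented it.
It assumes the protected folder lives on the same filesystem as the data,
since hard links cannot cross filesystems.

```python
import os


def protect_with_hard_link(data_file, protected_dir):
    """Create a hard link to data_file inside protected_dir.

    If a user later deletes their own copy, the data stay reachable
    through the link. Overwriting the file in place is not prevented.
    """
    os.makedirs(protected_dir, exist_ok=True)
    link_path = os.path.join(protected_dir, os.path.basename(data_file))
    if not os.path.exists(link_path):
        os.link(data_file, link_path)
    return link_path
```

In practice the protected folder would be owned by root and not writable by
ordinary users, so that an accidental rm in a user's directory only removes
one name for the file rather than the data themselves.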

Hope this helps,

Pete

On Aug 15, 2005, at 11:46 PM, Botka, Christopher wrote:

>
>Is anyone out there running Sequest on Linux for MS analysis?  We  
>are in the process of setting up a modest sized cluster to run  
>Sequest and would be interested in sharing info and experiences  
>with anyone out there who might be doing the same.
>
>Some issues:
>
>   1. I/O requirements - what's the minimum throughput needed to run  
>Sequest?  We are going to test both SATA and FC drives with multiple  
>types of interconnects, as well as local SCSI drives.
>   2. Integration of the Thermo queuing system with other job  
>management systems (LSF/SGE etc) - Can Sequest be integrated into a  
>general purpose cluster?
>   3. Middle to long term storage requirements and back up strategies.
>
>Thanks,
>
>Chris
>
>botka at joslin.harvard.edu

_______________________________________________
Bioclusters maillist  -  Bioclusters at bioinformatics.org
https://bioinformatics.org/mailman/listinfo/bioclusters