[Bioclusters] Assembly programs

Fri Jul 7 17:45:00 EDT 2006

Joe,

    We have worked on the WGS assembly of Galdieria sulphuraria, a
unicellular red alga with an estimated genome size of ~12 Mb.  We did it on
a 4-way Opteron box with 16 GB of RAM using Arachne from the Broad Institute
(formerly Whitehead). See http://www.broad.mit.edu/wga

    As genome assembly projects go it is not that large (bigger than a
bacterium but way, way smaller than a mammal).  The 16 GB of RAM was plenty
for an assembly of this size.  The original papers on Arachne cited memory
efficiency as one of the design goals and IIRC they were doing fruit fly on
12 GB machines.  Another nice thing about Arachne was that it was reasonably
straightforward to get up and running.  I did muck around with the TIGR
Assembler and EULER a bit but was never able to get them working properly.
I should point out that 3 of the 4 CPUs on our box sat idle, since most of
Arachne is not "SMP aware"; only the initial parsing of the read, quality &
info files can be spread out (you can launch multiple processes if there are
multiple input files).  Arachne is not "cluster capable" either, but a decent
Opteron box with oodles of RAM can be had for a pretty good price these days.
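
    To illustrate the "one process per input file" idea in general terms,
here is a minimal Python sketch.  It is not Arachne's code or command line;
the function names are hypothetical, and it assumes plain FASTA input where
each record starts with a ">" line.  Each file is handed to its own worker
process, which is the only kind of parallelism described above.

```python
from concurrent.futures import ProcessPoolExecutor


def count_fasta_records(path):
    """Count FASTA records (lines starting with '>') in one file."""
    with open(path, "r") as handle:
        return sum(1 for line in handle if line.startswith(">"))


def count_all(paths, workers=4):
    """Parse each input file in a separate worker process.

    Returns a dict mapping each path to its record count.  With a single
    input file only one worker has anything to do, so the other CPUs idle.
    """
    paths = [str(p) for p in paths]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        counts = list(pool.map(count_fasta_records, paths))
    return dict(zip(paths, counts))
```

Something like count_all(["reads1.fasta", "reads2.fasta"]) would use two
workers (the file names are placeholders); call it from under an
'if __name__ == "__main__":' guard when running it as a script.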

Kevin M. Carr

**************************
Bioinformatics Specialist
Research Technology
  Support Facility
202-D Biochemistry Bldg.
Michigan State University
East Lansing, MI  48824

Ph: (517) 353-6794
Fax:(517) 353-8638
**************************

> From: Joe Landman <landman at scalableinformatics.com>
> Reply-To: HPC for bioinformatics <bioclusters at bioinformatics.org>
> Date: Thu, 06 Jul 2006 19:26:34 -0400
> To: HPC for bioinformatics <bioclusters at bioinformatics.org>
> Subject: [Bioclusters] Assembly programs
> 
> Hi folks:
> 
>    Was asked recently about genome assembly, and I gave the answer that
> Chris gave below.  What bugs me is that I haven't followed the assembly
> work for a while, and all I remember are the TIGR tools.
> 
>    Basically what I am asking is whether or not people have built
> assembly algorithms to run on smaller memory machines, or do we still
> need large memory SMPs to do the job?  64 GB and up, or can we run some
> set of tools in under 16 GB on lots of cluster nodes?
> 
>    Thanks!
> 
> Joe
> 
> Chris Dagdigian wrote:
>> 
>> Hi François,
>> 
>> First off, what assembly program are you trying to run on your cluster?
>> Are you sure it is even capable of running in parallel across many
>> machines? Most people I know doing assembly are doing it within a single
>> large SMP system because shared memory is easier/faster and (I think...)
>> there is a relative lack of "true parallel" assembly algorithms.
>> 
>> Here are some helpful official Grid Engine URLs:
>> 
>> - http://gridengine.sunsource.net (main site for the codebase)
>> 
>> - http://docs.sun.com/app/docs/coll/1017.3  (official documentation site)
>> 
>> I also run a site at http://gridengine.info but that may not be helpful
>> until you are at least up and running.
>> 
>> Some specific suggestions for you and your current setup:
>> 
>> (1) Ignore the 'qmon' GUI. You won't be using it anyway with your
>> assembler and it just gets in the way of the more flexible command line
>> programs. Stick with the unix binaries like "qstat", "qrsh" and
>> "qsub".   You won't be able to use SGE to its fullest unless you are
>> comfortable with the command line programs.
>> 
>> (2) Send us (or me) the output of the command "qstat -f" when run on
>> your system. It may explain why you could not run the simple.sh example
>> job.
>> 
>> (3) Learn where your spool logs are; they will be invaluable in
>> debugging failures. The default location is something along the lines of
>> $SGE_ROOT/<cell>/spool/ -- in particular you want to look at the last
>> few lines of "qmaster/messages", "qmaster/schedd/messages" and any
>> messages files belonging to exec hosts that are not behaving.
>> 
>> Regards,
>> Chris
>> 
>> On Jul 6, 2006, at 4:42 PM, francois.fauteux2 at mail.mcgill.ca wrote:
>> 
>>> Hi;
>>> 
>>> I am totally new to grid computing. I recently tried to run a
>>> sequence assembly process on a G5 (8 GB RAM) but the process
>>> required more memory.
>>> 
>>> I installed N1SGE6 on 3 Mac G5s under 10.4.7 (connected through a
>>> router) (altogether 13 GB RAM) and I would like to run the assembly
>>> process in parallel through the cluster, hoping that memory resources
>>> would be sufficient for the process to complete.
>>> 
>>> I would appreciate "for-dummies-fast-how-to" hints on configuring the
>>> cluster / submitting the job properly.
>>> 
>>> I installed master and hosts with default settings. First try with
>>> examples/simple.sh returns (w. qmon):
>>> No free slots for interactive job!
>>> while 5 CPUs are available.
>>> 
>>> Any hint as to how to properly configure the
>>> cluster/project/queues/parallel environments, or to use qsub with
>>> useful options for getting started fast, would be greatly
>>> appreciated; thanks.
>>> 
>>> François
>>> 
>>> _______________________________________________
>>> Bioclusters maillist  -  Bioclusters at bioinformatics.org
>>> https://bioinformatics.org/mailman/listinfo/bioclusters
>> 
> 
> -- 
> Joseph Landman, Ph.D
> Founder and CEO
> Scalable Informatics LLC,
> email: landman at scalableinformatics.com
> web  : http://www.scalableinformatics.com
> phone: +1 734 786 8423
> fax  : +1 734 786 8452
> cell : +1 734 612 4615
>
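
The log check in Chris's suggestion (3), quoted above, could be scripted
roughly as follows.  This is only a sketch: the function names are made up
for illustration, the cell name "default" is an assumption (check your own
$SGE_CELL), and the paths follow the default spool layout
$SGE_ROOT/<cell>/spool/ mentioned in that message.  A site with a different
spool directory would need to adjust them.

```python
import os
from collections import deque


def sge_message_files(sge_root, cell="default"):
    """Return the qmaster log paths under the default spool layout.

    Assumes logs live in <sge_root>/<cell>/spool/; this is the default
    described in the quoted message, not a guarantee for every install.
    """
    spool = os.path.join(sge_root, cell, "spool")
    return [
        os.path.join(spool, "qmaster", "messages"),
        os.path.join(spool, "qmaster", "schedd", "messages"),
    ]


def last_lines(path, n=20):
    """Return the last n lines of a text file, similar to `tail -n`."""
    with open(path, "r", errors="replace") as handle:
        return [line.rstrip("\n") for line in deque(handle, maxlen=n)]


def show_recent_messages(sge_root, cell="default", n=20):
    """Print the tail of each qmaster messages file that exists."""
    for path in sge_message_files(sge_root, cell):
        print(f"==> {path} <==")
        if os.path.exists(path):
            for line in last_lines(path, n):
                print(line)
        else:
            print("(not found; the spool directory may be elsewhere)")
```

For example, show_recent_messages(os.environ["SGE_ROOT"]) prints the last 20
lines of each file.  Messages files for individual exec hosts are not
covered here; their location depends on how the hosts were installed.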