[Bioclusters] Re: Bioclusters Digest, Vol 17, Issue 4

Tue Mar 7 15:05:31 EST 2006

Hi, Shane here from the JGI. I wanted to post back and attempt to answer
some of these questions about our "disappearing" array job tasks.
I don't know the answer to all of them, but the question about NIS errors
stands out.  We have been having NIS and NFS problems quite a bit,
so I suspect that could be the cause.

We will soon be moving our cluster onto a better network switch, and we
have also increased a cache size on our LDAP server.  We've been working
on our NFS problems too.  That seems to be helping: lately the problems
seem to have gone away.  I've also implemented a "cleanup" step in our
workflow system which resubmits missing tasks one at a time, just in case.
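
A cleanup step of this kind could be sketched roughly as follows. This is
a minimal Python illustration under stated assumptions, not the actual
workflow code: the log naming (task_<N>.log), the "TASK_DONE" sentinel
line, and the script name run_task.sh are all hypothetical. Because the
failed tasks report exit status 0, the sketch trusts only a marker that
each task prints as its very last action. The one Grid Engine feature it
relies on is qsub's -t option for selecting an array task index.

```python
from pathlib import Path

SENTINEL = "TASK_DONE"  # hypothetical marker each task prints as its last line


def missing_tasks(log_dir, first, last, sentinel=SENTINEL):
    """Return task IDs in [first, last] whose log is absent or lacks the sentinel."""
    missing = []
    for task_id in range(first, last + 1):
        log = Path(log_dir) / f"task_{task_id}.log"  # hypothetical naming scheme
        if not log.is_file() or sentinel not in log.read_text(errors="replace"):
            missing.append(task_id)
    return missing


def resubmit_commands(task_ids, script="run_task.sh"):
    """Build one qsub command per missing task; -t N selects a single array index."""
    return [["qsub", "-t", str(task_id), script] for task_id in task_ids]


if __name__ == "__main__":
    for cmd in resubmit_commands(missing_tasks("logs", 1, 100)):
        # Print only; pass cmd to subprocess.run(cmd) to actually submit.
        print(" ".join(cmd))
```

Submitting one task per qsub call keeps each retry independent, so a
single bad node cannot silently drop a whole batch of retries.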

Did I isolate it down to just an issue with the job array?
         Yes

Does this only happen with this program or all programs I execute?
         Various programs

What is the code doing?
         Usually trying to run some Perl code.  For instance, in one case
         it was a Perl program which logs something to a database, but it
         failed between two print statements at the top, before it could
         get very far.

Are there "core" files in my output directory?
         No

Are the binaries on an NFS server?  If so is it having issues?  Check the 
logs for NFS timeouts.
         Yes, Yes, and Yes

Is a directory filling up /tmp /scratch what ever?
         I don't think we are using /tmp/scratch

What do the syslogs on the remote machine say?
         Did not see any unusual messages

Is there a network issue that I've caused by running too much stuff at the
same time, broken NIS/NFS?
         Yes

Is the OOM killer running on the remote node, have I filled up all the
memory?
         I think so, quite possibly.

Is it only happening on one node, some nodes or a subset?
         Can happen on various nodes.

Am I writing to a database and not catching an error?
         It was not getting to this stage.

Does it happen with a really simple example?
         Yes

Does it only happen on a Tuesday evening (system maint for example)
         I think so, it seems to happen sporadically in spikes.
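
Since several of the answers above point at NFS timeouts and possibly the
OOM killer, one way to check the node syslogs systematically is to count
matching kernel messages per host. The sketch below is only an
illustration: the two message patterns are assumptions about how Linux
kernels typically word these events (the wording varies between kernel
versions), the classic "Mon DD HH:MM:SS host message" syslog layout is
assumed, and the sample lines and host names are made up.

```python
import re
from collections import Counter

# Assumed message shapes; adjust these patterns to match your own syslog.
PATTERNS = {
    "nfs_timeout": re.compile(r"nfs: server \S+ not responding", re.IGNORECASE),
    "oom_kill": re.compile(r"out of memory|oom-killer", re.IGNORECASE),
}


def classify(line):
    """Return the name of the first pattern matching the line, or None."""
    for name, pattern in PATTERNS.items():
        if pattern.search(line):
            return name
    return None


def summarize(lines):
    """Count matching events per (host, kind).

    Assumes the host name is the fourth whitespace-separated field,
    as in the classic syslog layout.
    """
    counts = Counter()
    for line in lines:
        kind = classify(line)
        fields = line.split()
        if kind and len(fields) > 3:
            counts[(fields[3], kind)] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample lines for illustration only.
    sample = [
        "Mar  2 14:12:01 node17 kernel: nfs: server fs1 not responding, still trying",
        "Mar  2 14:12:09 node17 kernel: Out of Memory: Killed process 4242 (perl).",
        "Mar  2 14:13:00 node22 sshd[99]: session opened for user alice",
    ]
    for (host, kind), n in sorted(summarize(sample).items()):
        print(host, kind, n)
```

Comparing the timestamps of any hits against the start and end times of
the truncated tasks would show whether the spikes line up.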

Thanks again for your help!

At 04:04 PM 3/2/2006, you wrote:
>
>Today's Topics:
>
>    1. Announcement: Sun Discovery Cluster for the Life Sciences
>       (Stefan Unger)
>    2. RE: Announcement: Sun Discovery Cluster for the LifeSciences
>       (Kathleen)
>    3. SGE Array Job tasks mysteriously disappear (Shane Brubaker)
>    4. RE: quick look see at fractal computing. (James Cuff)
>    5. Re: SGE Array Job tasks mysteriously disappear (James Cuff)
>    6. Re: SGE Array Job tasks mysteriously disappear (Chris Dagdigian)
>
>
>----------------------------------------------------------------------
>
>Message: 1
>Date: Thu, 02 Mar 2006 13:59:59 -0800
>From: Stefan Unger <Stefan.Unger at Sun.COM>
>Subject: [Bioclusters] Announcement: Sun Discovery Cluster for the
>         Life Sciences
>To: bioclusters at bioinformatics.org
>Message-ID: <44076ADF.3070102 at sun.com>
>Content-Type: text/plain; charset=windows-1252; format=flowed
>
>I'm not sure if this is ok, or not. Please let me know:
>
>************
>
>Sun Microsystems™ Announces the Discovery Cluster for the Life Sciences
>
>Exceptional Price/Performance in a Pre-Assembled Rack
>
>
>
>
>Sun Microsystems announces the "Discovery Cluster for the Life
>Sciences". The Discovery Cluster is a pre-assembled, base-level
>configuration of a Sun Grid Rack System (SGRS) with components selected
>especially for the Life Science HPC market.
>
>
>The Discovery Cluster is Sun's solution approach to the compute needs
>for the drug discovery process. It is based on the Sun Fire™ X2100
>64-bit x64 server, powered by the AMD Opteron™ dual core processor.
>The X2100 delivers up to one-and-a-half times the performance, and uses
>about one-third of the power of competing systems, yet costs a fraction
>of their price. Bioinformatics and molecular modeling benchmarks confirm
>the exceptional price/performance advantages of the Sun Fire X2100 over
>Intel Xeon based clusters. These highly reliable and energy efficient
>X2100 servers are also the fastest enterprise x64 servers in their class.
>
>
>At under $94,000 (US list price) per fully populated, pre-assembled
>rack, the Discovery Cluster provides 1 TeraFlop of theoretical peak
>performance in three racks for under $282,000. In addition, the power,
>cooling and management requirements are substantially less than Intel
>Xeon based clusters.
>
>
>The Discovery Cluster comes pre-assembled, with hardware, cabling,
>Solaris™ 10 and Sun Grid Engine. Multiple operating systems (Solaris
>10 x64, Linux (Red Hat, Suse), and Windows) are supported. Many
>alternative configurations are available, and Sun's solution partners
>provide a range of software options.
>
>
>For more information, listen to a NetTalk webinar on the Sun Discovery
>Cluster for Life Sciences, featuring the designer of the Sun Fire
>"Galaxy" series servers, Andy Bechtolsheim, Sun Chief Architect and
>Senior Vice President, Network Systems. For more information visit
>www.sun.com/nettalk, www.sun.com/discoverycluster, or email
>discoverycluster at sun.com.
>
>
>Media contacts:
>
>
>Stefan Unger, PhD
>
>stefan.unger at sun.com
>
>Business Development Manager
>
>Life Sciences
>
>
>
>Ulrich Meier, PhD
>
>ulrich.meier at sun.com
>
>Industry Marketing Manager
>
>Life Sciences
>
>
>Sun, Sun Microsystems, the Sun logo, Sun Fire, Solaris are trademarks or
>registered trademarks of Sun Microsystems, Inc. in the United States and
>other countries. AMD and Opteron are trademarks or registered trademarks
>of Advanced Micro Devices.
>
>
>--
>*!*
>Stefan Unger, PhD
>Business Development Manager Life Sciences
>949-682-4388 (x41821) AccessLine
>http://www.sun.com/edu/commofinterest/compbio
>http://www.sun.com/lifesciences
>http://www.sun.com/discoverycluster
>CB-SIG: to JOIN/DROP/POST email compbio-sig-info at sun.com
>
>* BioIT World, Boston, April 3-5, 2006
>* CB-SIG and HPC Consortium, GridAsia, May 14-15, 2006
>
>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>NOTICE:  This email message is for the sole use of the intended
>recipient(s) and may contain confidential and privileged
>information.  Any unauthorized review, use, disclosure or
>distribution is prohibited.  If you are not the intended
>recipient, please contact the sender by reply email and destroy
>all copies of the original message.
>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>*!*
>
>
>
>
>------------------------------
>
>Message: 2
>Date: Thu, 2 Mar 2006 15:08:54 -0700
>From: "Kathleen" <kathleen at massivelyparallel.com>
>Subject: RE: [Bioclusters] Announcement: Sun Discovery Cluster for the
>         LifeSciences
>To: "'Clustering, compute farming & distributed computing in life
>         science informatics'" <bioclusters at bioinformatics.org>
>Message-ID: <005c01c63e45$e682b8b0$0300a8c0 at KMElaptop>
>Content-Type: text/plain;       charset="us-ascii"
>
>Does it come pre-loaded with applications?  If so, which ones? -K
>
>
>_______________________________________________
>Bioclusters maillist  -  Bioclusters at bioinformatics.org
>https://bioinformatics.org/mailman/listinfo/bioclusters
>
>
>
>
>
>------------------------------
>
>Message: 3
>Date: Thu, 02 Mar 2006 14:12:49 -0800
>From: Shane Brubaker <brubaker2 at llnl.gov>
>Subject: [Bioclusters] SGE Array Job tasks mysteriously disappear
>To: bioclusters at bioinformatics.org
>Message-ID: <6.0.0.22.2.20060302141100.037184e0 at mail.llnl.gov>
>Content-Type: text/plain; charset="us-ascii"; format=flowed
>
>Hi, Shane from the JGI here.
>
>We are finding some strange behavior in which a few tasks of an array job
>never seem to complete.
>
>The tasks do not go into an Error state, and they are listed as finished
>with an exit status of 0, and they
>have a valid start and end time for the task.
>
>However, in the output log, the output clearly stops in between two print
>statements near the top of the script.
>
>
>Has anyone seen this?  Any ideas?
>
>
>Thanks,
>Shane
>
>
>
>------------------------------
>
>Message: 4
>Date: Thu, 2 Mar 2006 18:17:50 -0500 (EST)
>From: James Cuff <jcuff at broad.mit.edu>
>Subject: RE: [Bioclusters] quick look see at fractal computing.
>To: Nick Robertson <nick at massivelyparallel.com>
>Cc: "'Clustering, compute farming & distributed computing in life
>         science informatics'" <bioclusters at bioinformatics.org>,
>         'Kevin Howard' <kevin at massivelyparallel.com>
>Message-ID:
>         <Pine.OSF.4.64.0603021718060.91263 at phosphorus.broad.mit.edu>
>Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
>
>On Thu, 2 Mar 2006, Nick Robertson wrote:
>
> > It is still unclear to me why your results are markedly different
> > from NCBI and MPT, but it's probably related to search parameters or some
> > other difference.
>
>Ahem, that could be my bad, I guess I should have explained, I thought it
>was clear from the example command line I supplied.
>
>-nT is the answer you are looking for here.
>
>I used it quickly here to show the missing sub optimals.  My reasoning
>being that if MegaBlast with its large word size and greedy algorithm
>approach could find the suboptimals, the standard version ought to nail
>it.
>
>I tend to use it automatically for near exact DNA/DNA searching, which is
>what this example test was set to do.  So that clears up changes in the
>ordering.
>
>However, you are _still_ not reporting the sub optimal alignments in your
>report.
>
>This is clear just from the sizes of the two files you provided me
>with via your website.  I guess it's just a printing error, you must be
>calculating them.  Probably a simple tweak for you to fix.
>
>node221 /2ndrun/ du -sh ncbi_results.txt
>3.4M    ncbi_results.txt
>
>node221 /2ndrun/ du -sh qid1597_results_1.txt
>516K    qid1597_results_1.txt
>
>
>The example gi|27657458|emb|AL844150.6| on that web link I sent before
>shows this.
>
>MegaBlast (jcuff_results_1.txt) finds two such sub alignments, and regular
>blast (jcuff2.blastn, ncbi_results.txt) finds a whopping 16.
>
>However qid1597_results_1.txt only shows the first alignment from bases
>682 to 1330, with _no_ sub optimals being reported.
>
>Thanks for the update.  We probably ought to kill this thread and take it
>off line if you want to discuss it further.  I doubt it is very
>interesting for folk.
>
>Best,
>
>J.
>
>
>------------------------------
>
>Message: 5
>Date: Thu, 2 Mar 2006 18:34:40 -0500 (EST)
>From: James Cuff <jcuff at broad.mit.edu>
>Subject: Re: [Bioclusters] SGE Array Job tasks mysteriously disappear
>To: "Clustering, compute farming & distributed computing in life
>         science informatics" <bioclusters at bioinformatics.org>
>Message-ID:
>         <Pine.OSF.4.64.0603021824410.91263 at phosphorus.broad.mit.edu>
>Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
>
>
>Hi Shane,
>
>So you might want to give us a bit more information.
>
>As to seeing weird stuff on clusters, yeah we see a lot of it, *way* too
>much of it sometimes :)
>
>Here come a bunch of questions I would ask myself if it happened to me:
>
>Did I isolate it down to just an issue with the job array?
>Does this only happen with this program or all programs I execute?
>What is the code doing?
>Are there "core" files in my output directory?
>Are the binaries on an NFS server?  If so is it having issues?  Check the
>logs for NFS timeouts.
>Is a directory filling up /tmp /scratch what ever?
>What do the syslogs on the remote machine say?
>Is there a network issue that I've caused by running too much stuff at the
>same time, broken NIS/NFS?
>Is the OOM killer running on the remote node, have I filled up all the
>memory?
>Is it only happening on one node, some nodes or a subset?
>Am I writing to a database and not catching an error?
>Does it happen with a really simple example?
>Does it only happen on a Tuesday evening (system maint for example)
>
>etc. etc.  It is a pain to debug things like this on a cluster, I feel
>your pain.
>
>Maybe have another look at what is going wrong and post back with some
>more information.  There are lots of people who can probably help, at the
>moment there is not really enough for us to go on, as you see it could be
>lots of things.
>
>Best,
>
>J.
>
>
>
>
>------------------------------
>
>Message: 6
>Date: Thu, 2 Mar 2006 19:03:00 -0500
>From: Chris Dagdigian <dag at sonsorol.org>
>Subject: Re: [Bioclusters] SGE Array Job tasks mysteriously disappear
>To: "Clustering, compute farming & distributed computing in life
>         science informatics" <bioclusters at bioinformatics.org>
>Message-ID: <15B721DA-E4FA-44C7-BEB1-F99919DD39A1 at sonsorol.org>
>Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed
>
>
>Debugging odd failures on clusters can really be hard.
>
>For SGE clusters the best source of debug/failure info is always going
>to be the STDOUT/STDERR files produced by the jobs themselves.
>
>Nine times out of ten this is where you'll find the most useful info.
>
>Since it seems that you are not getting anything useful from those
>files, the next place to look is the sge_execd logs from the machines
>where the array tasks ran. The execd spool files will either be local
>to the compute node or under your "$SGE_ROOT/<cell>/spool/<machineName>"
>directory if you are running everything off of a shared filesystem.
>
>After the execd spool logs, the qmaster and schedd messages files may
>also be of use although they rarely give good info on job level issues.
>
>A third place to look is "/tmp" on the compute nodes -- when all else
>fails and grid engine is in a panic situation and unable to spool
>normally it will log to /tmp/ on the host.
>
>Something you should also try:
>
>   - Alter the value for "loglevel" in your grid engine configuration
>-- you may want to temporarily set "loglevel=log_info"
>
>This was discussed recently on the SGE users mailing list. The thread is
>here:
>http://gridengine.sunsource.net/servlets/BrowseList?list=users&by=thread&from=8137
>
>The sge_conf man page has this to say about loglevel:
>
> > loglevel
> >        This parameter specifies the level of detail that Grid Engine
> >        components such as sge_qmaster(8) or sge_execd(8) use to produce
> >        informative, warning or error messages which are logged to the
> >        messages files in the master and execution daemon spool
> >        directories (see the description of the execd_spool_dir
> >        parameter above). The following message levels are available:
> >
> >        log_err
> >               All error events being recognized are logged.
> >
> >        log_warning
> >               All error events being recognized and all detected signs
> >               of potentially erroneous behavior are logged.
> >
> >        log_info
> >               All error events being recognized, all detected signs of
> >               potentially erroneous behavior and a variety of
> >               informative messages are logged.
>
>
>The final troubleshooting step is to look into the Grid Engine
>"KEEP_ACTIVE"  execd parameter setting -- this will temporarily
>disable deletion of the active_jobs/ directories that Grid Engine
>uses to stage info while the job is active. Normally these
>directories are deleted when the job drains from the system. Quite a
>bit of useful environment, pid, trace and other information can be
>found in these directories.  This is one you'll have to watch out for
>though -- disabling the cleanup function could consume disk space
>rapidly.
>
>Regards,
>Chris
>
>
>
>
>
>
>------------------------------
>
>
>End of Bioclusters Digest, Vol 17, Issue 4
>******************************************