[Bioclusters] ncbi blast

Justin Powell bioclusters@bioinformatics.org
Wed, 23 Jun 2004 15:02:37 +0100


Hi Joe,

Thanks for the info.  I've tested with -a 1, and it does indeed only go
wrong with -a 2, so I've kludged it for the time being.  However, as to
your theory about Red Hat 9 NPTL being involved, I get exactly the same
behaviour on a Red Hat 7.1 system running ncbi blast 2.2.6 (i.e. it goes
wrong on the nt database but not the est database, and only with -a 2,
not with -a 1).

So I guess if the -a switch changes things, it's not likely to be bad RAM?

In reply to your other questions, the output from swapon -s is

Filename			Type		Size	Used	Priority
/dev/sda2                       partition	1807304	15036	-1

for the Red Hat 7.1 system

Filename			Type		Size	Used	Priority
/dev/sda3                       partition	1020116	10496	-1

for the Red Hat 9 system.

Adding a name line to the query makes no difference.

Neither system is overclocked.  I've not run the memory checker yet, but I
have two identical Red Hat 9 boxes and they both do it.  So that makes 3
systems, and I can test a 4th shortly too.

I've not had time to run the graphical debugger - I'm pretty snowed under
till Monday.

Justin

On Fri, 18 Jun 2004, Joe Landman wrote:

> Hi Chris and Justin:
>
> On Thu, 2004-06-17 at 12:38, Chris Dwan wrote:
> > Justin,
> >
> > I've poked around a bit, and run your queries on a variety of machines
> > (P-III and Athlon...as well as a few others) which I have sitting
> > around the shop here.  I was unable to replicate your observed
> > behavior.
>
> Hmmm.  I have had crashes when the accession lines were somehow
> mangled.  But this occurred regardless of memory size.
>
> [...]
>
> > On Jun 16, 2004, at 10:46 AM, Justin Powell wrote:
> >
> > >
> > > Hi Chris
> > >
> > > A short query which goes wrong is
> > >
> > > actacgactagcatcagctacgctagatgactacgatcagctacgactagcatcgactacg
> > >
> > > I just have this in a text file on its own with no name line. The nt
> > > database I'm using is from the ncbi ftp site blast/db directory and the
> > > unzipped database files have the date June 11 2004.
>
> So you do not have
>
> 	>accession data
> 	actacgactagcatcagctacgctagatgactacgatcagctacgactagcatcgactacg
>
> in the test file, just
>
> 	actacgactagcatcagctacgctagatgactacgatcagctacgactagcatcgactacg
>
> ?
>
> If this is the case, try making a simple accession line such as
>
> 	>abc123|my random label
> 	actacgactagcatcagctacgctagatgactacgatcagctacgactagcatcgactacg
>
> and see if it still crashes.
>
> > > I've found the intermittency varies. Sometimes it seems it can be
> > > provoked by running a blast against est first, and sometimes it
> > > seems to work correctly time after time.
>
> Oh... If it is not repeatable (e.g. repeatable == same input file always
> generates the same error at the same place), then it is likely to be
> unrelated to the program itself.  That is, the program happens to be
> hitting the case in the system which triggers the error.  This usually
> comes about when you hit a bad physical memory location somewhere, or
> you have an OS bug or driver bug of some sort.
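
A minimal sketch of the repeatability check described above, assuming a
POSIX system.  It is an illustration only, not something used in this
thread, and the helper names run_once and count_signal_deaths are
hypothetical.  It reruns one command several times and counts how many
runs were killed by a given signal such as SIGSEGV:

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run argv[0] once.  Returns the number of the signal
 * that killed it, 0 if it exited on its own (with any exit code, including
 * 127 when exec fails), or -1 if fork/waitpid failed. */
static int run_once(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);              /* only reached if exec failed */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}

/* Hypothetical helper: run the same command `runs` times and count the
 * runs that were killed by signal `sig`. */
static int count_signal_deaths(char *const argv[], int runs, int sig)
{
    assert(argv != NULL && argv[0] != NULL && runs >= 0);
    int deaths = 0;
    for (int i = 0; i < runs; i++)
        if (run_once(argv) == sig)
            deaths++;
    return deaths;
}
```

Called with the argument vector from the gdb session (./blastall -p
blastn -a 2 -d /usr/blasttest/nt -i /usr/blasttest/tempdna) and SIGSEGV,
a count equal to the number of runs would point to a repeatable failure,
while a count somewhere in between would match the intermittent
behaviour reported here.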
>
> SEGVs usually come about when a process touches memory it does not own
> (for example, another process's memory), so there could be other
> explanations.  If you are swapping to a partition with some bad bytes,
> this could be a problem.
>
> First:  Do you have swap enabled?  What is the output of
>
> 	swapon -s
>
> Second: What other programs are running?  Is this an overclocked system?
>
> Third:  have you run memtest86 on the unit for an extended period of
> time?  You can pull the memtest86 3.1 iso from
> http://downloads.scalableinformatics.com
>
> > > A second longer sequence I've had go wrong is
> > >
> > > TCCCCCGAATTTAAACGCGTTGAAAGGGTCATCCTTACTAGAAAAGAGAGTTG
> > > ATTCTCTCCGACAGCTTAACACTACCACGGTTAACCAGCTGCTGGGGTTGCCGGGGATGACCTCTACATT
> > > CACGGCTCCGCAACTGTTGCAGTTAAGAATAATAGCTATAACTGCGTCTGCCGTGTCCCTTATTGCCGGT
> > > TGCCTCGGAATGTTCTTCCTTTCTAAAATGGATAAGAGACGAAAAGTCTTCAGACATGATCTCATCGCAT
> > > TTTTGATAATTTGCGACTTTCTTAAAGCTTTTATTCTGATGATTTATCCCATGATTATCCTTATTAATAA
> > > TAGTGTGTATGCAACACCTGCATTTTTTAATACCTTGGGTTGGTTTACGGCCTTTGCCATCGAAGGTGCA
> > > GACATGGCCATAATGATATTCGCCATACATTTTGCTATTTTGATCTTCAAGCCTAATTGGAAATGGCGAA
> > > ATAAAAGATCGGGAAATATGGAGGGTGGCTTGTACAAAAAAAGGTCATATATCTGGCCAATTACTGCATT
> > > AGTACCTGCCATTTTAGCAAGCTTAGCCTTCATTAATTATAATAAACTCAATGACGATTCTGACACCACT
> > > ATTATACTGGATAATAATAACTACAACTTTCCCGATTCTCCCAGGCAAGGTGGCTACAAACCTTGGAGTG
> > > CATGGTGCTATTTACCACCCAAGCCGTACTGGTATAAAATTGTTTTAAGCTGGGGTCCCAGATATTTCAT
> > > TATTATTTTCATATTTGCAGTCTACCTCAGTATTTATATTTTCATTACCAGTGAAAGTAAAAGAATTAAA
> > > GCGCAAATTGGAGACTTTAACC
> > >
> > >
> > > I've tried recompiling with the -g flag on (and the -O3 flag off) and
> > > run gdb on the coredump. However, I'm not a C programmer (though I did
> > > once read a book on it) and am not at all familiar with C, gdb, or
> > > even the details of the call stack, so I'm not sure I've done all
> > > this correctly. An example backtrace is like this, though others
> > > I've had looked different:
> > >
> > > [root@prada bin]# gdb blastall core.9520
> > > GNU gdb Red Hat Linux (5.3post-0.20021129.18rh)
> > > Copyright 2003 Free Software Foundation, Inc.
> > > GDB is free software, covered by the GNU General Public License, and you are
> > > welcome to change it and/or distribute copies of it under certain conditions.
> > > Type "show copying" to see the conditions.
> > > There is absolutely no warranty for GDB.  Type "show warranty" for details.
> > > This GDB was configured as "i386-redhat-linux-gnu"...
> > > Core was generated by `./blastall -p blastn -a 2 -d /usr/blasttest/nt -i /usr/blasttest/tempdna'.
> > > Program terminated with signal 11, Segmentation fault.
> > > Reading symbols from /lib/tls/libm.so.6...done.
> > > Loaded symbols for /lib/tls/libm.so.6
> > > Reading symbols from /lib/tls/libpthread.so.0...done.
> > > Loaded symbols for /lib/tls/libpthread.so.0
> > > Reading symbols from /lib/tls/libc.so.6...done.
> > > Loaded symbols for /lib/tls/libc.so.6
> > > Reading symbols from /lib/ld-linux.so.2...done.
> > > Loaded symbols for /lib/ld-linux.so.2
> > > Reading symbols from /lib/libnss_files.so.2...done.
> > > Loaded symbols for /lib/libnss_files.so.2
> > > #0  0x0805ea52 in BlastNtWordFinder (search=0x84363e8, lookup=0x842e6b8)
> > >     at blast.c:9265
> > > 9265			 next_lindex = (((lookup_index) & mask)<<char_size) + *(s+1);
>
> Ok.  This is part of the word search section of BLAST.  Basically it
> walks along the linear array looking for a match.  This should not fail,
> though if it does, then the likely problem is in *(s+1).  You could
> translate *(s+1) as "the contents of the location one element (one
> sizeof the data type) past the one pointer s points to".  If s points to
> a valid location, but s+1 does not, it is possible that the memory
> allocation somehow failed to allocate sufficient memory for the array
> (unlikely, you would have seen this elsewhere).  It is also possible
> that there is some OS-imposed boundary between the values of s and s+1
> (the pointers that is, not their contents), and by accessing the
> contents of (dereferencing) the pointer as BLAST was doing, you happened
> to trigger the protection fault (which is what SEGV is).
>
> For some reason, the OS thinks that *(s+1) is owned by someone else.
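
To make the *(s+1) point concrete, here is a minimal C sketch of the same
index arithmetic with an explicit check that s+1 still lies inside the
array.  It is an illustration only, not the NCBI source: roll_index, buf,
len, pos and ok are hypothetical names, and whether the code around
blast.c:9265 performs an equivalent check is not visible from the
backtrace.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same arithmetic as the line gdb printed,
 *     next_lindex = ((lookup_index & mask) << char_size) + *(s+1);
 * but refusing to read past the end of the buffer.  On success *ok is set
 * to 1 and the new index is returned; otherwise *ok is set to 0 and the
 * old index is returned unchanged. */
static uint32_t roll_index(const uint8_t *buf, size_t len, size_t pos,
                           uint32_t lookup_index, uint32_t mask,
                           unsigned char_size, int *ok)
{
    assert(ok != NULL);
    /* With s = buf + pos, *(s+1) is only valid if pos + 1 < len. */
    if (buf == NULL || len == 0 || pos >= len - 1) {
        *ok = 0;
        return lookup_index;
    }
    const uint8_t *s = buf + pos;
    *ok = 1;
    return ((lookup_index & mask) << char_size) + *(s + 1);
}
```

A check like this only helps if s and the length are themselves intact.
If the pointer has been corrupted (the backtrace flags subject_seq as out
of bounds), the dereference can fault even though the arithmetic is
correct.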
>
> > > (gdb) backtrace
> > > #0  0x0805ea52 in BlastNtWordFinder (search=0x84363e8, lookup=0x842e6b8)
> > >     at blast.c:9265
> > > #1  0x0805a473 in BlastWordFinder (search=0x84363e8) at blast.c:6847
> > > #2  0x0805a336 in BlastExtendWordSearch (search=0x84363e8,
> > >     multiple_hits=0 '\0') at blast.c:6803
> > > #3  0x08059d7c in BLASTPerformFinalSearch (search=0x84363e8,
> > >     subject_length=117793,
> > >     subject_seq=0x7e12b129 <Address 0x7e12b129 out of bounds>) at blast.c:6612
>
> Yup.  Looks like memory somehow got mangled. You might have a look at
> using ddd (graphical frontend to gdb), and do the run.  Then we can look
> through the process a bit easier.  Basically run the system completely
> from the debugger, and see where it crashes, and then poke at it as to
> why.
>
> Note:  The location of the crash should not change by running it in the
> debugger.  If it does, we might start to think more of a hardware
> problem (bad swap, bad memory chip, etc) than of a program/OS bug.
>
> > > #4  0x080596c8 in BLASTPerformSearch (search=0x84363e8, subject_length=117793,
> > >     subject_seq=0x7e12b129 <Address 0x7e12b129 out of bounds>) at blast.c:6365
> > > #5  0x0805967b in BLASTPerformSearchWithReadDb (search=0x84363e8,
> > >     sequence_number=1629625) at blast.c:6344
> > > #6  0x0805066f in do_blast_search (ptr=0x84363e8) at blast.c:3335
> > > #7  0x0804d600 in NlmThreadWrapper (wrapper_arg=0x8439c80) at ncbithr.c:647
> > > #8  0x400522b6 in start_thread () from /lib/tls/libpthread.so.0
> > > (gdb) quit
>
> One more thought.  Do you get a crash with -a 1 (or with no -a option)?  If
> not, has your code been compiled on an NPTL box?  This has been a common
> problem in using NPTL (in RH9) versus linuxthreads, and caused some
> interesting crashes (though I seem to remember that they were not
> SEGVs).
>
> Would you try some of my compiled 2.2.9 binaries or the ones from NCBI
> and let us know if you still get the crash?  I am thinking this is a
> problem in the OS interacting with the program, and not a program bug
> per se.  If the problem persists across versions, and is repeatable, I
> would like to get a copy of the input file which causes it.
>
> Joe
>
> --
> Joseph Landman, Ph.D
> Scalable Informatics LLC,
> email: landman@scalableinformatics.com
> web  : http://scalableinformatics.com
> phone: +1 734 612 4615