[Bioclusters] blast output (-m 7) in XML, and the XML spec

Sat Apr 23 02:14:29 EDT 2005

Hi Folks:

   Working on a quick parser project, and I just spent too much time 
chasing down a bug.

   Short version:  I need mpiBLAST (based upon NCBI BLAST) to produce 
identical output for the same input across multiple machines with the 
same data sets and databases.  In theory this is not too difficult, and 
it is something we solved a while ago for a different case.

   Ok, I had suggested using XML, and the -m 7 output, and then simply 
parsing the document and returning it in a specific order.  Works well 
... sort of.
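
   To make the "specific order" idea concrete, here is a minimal sketch 
in Python.  It assumes the hits sit in <Hit> elements under 
<Iteration_hits>, as in the NCBI BlastOutput DTD, and that sorting on 
<Hit_id> is an acceptable ordering.  The sort key, the sample document 
and the function name are my own illustration, not what our parser 
actually does.

```python
import xml.etree.ElementTree as ET

# Tiny hand-made sample, not real BLAST output.
SAMPLE = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit><Hit_num>1</Hit_num><Hit_id>gi|2</Hit_id></Hit>
        <Hit><Hit_num>2</Hit_num><Hit_id>gi|1</Hit_id></Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""


def sort_hits(root):
    """Reorder the <Hit> children of each <Iteration_hits> by Hit_id, in place."""
    # Materialize the list first so the tree is not modified while iterating.
    for hits in list(root.iter("Iteration_hits")):
        found = hits.findall("Hit")
        for hit in found:
            hits.remove(hit)
        # Hypothetical ordering: plain string sort on the hit identifier.
        hits.extend(sorted(found, key=lambda h: h.findtext("Hit_id", default="")))
    return root


root = sort_hits(ET.fromstring(SAMPLE))
hit_ids = [h.findtext("Hit_id") for h in root.iter("Hit")]
```

Any key that is stable across machines would do in place of Hit_id.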

   The resulting XML from a BLAST run starts out with

<?xml version="1.0"?>
   <BlastOutput_reference>~Reference: Altschul, Stephen F., Thomas L. 
Madden, Alejandro A. Schaffer, ~Jinghui Zhang, Zheng Zhang, Webb Miller, 
and David J. Lipman (1997), ~&quot;Gapped BLAST and PSI-BLAST: a new 
generation of protein database search~programs&quot;,  Nucleic Acids 
Res. 25:3389-3402.</BlastOutput_reference>

and then it gives the rest of the hits ...

and then it gives ...

<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" 
"NCBI_BlastOutput.dtd">
<BlastOutput>
   <BlastOutput_program>blastx</BlastOutput_program>
   <BlastOutput_version>blastx 2.2.10 [Oct-19-2004]</BlastOutput_version>

Is this well formed?  See 
http://www.w3.org/TR/2004/REC-xml-20040204/#sec-well-formed .  I am 
trying to track down a bug in the XML parser, and I ran into that second 
XML declaration in the middle of the file.  Basically, xmllint complains:

[landman at crunch:~] 124 >xmllint /big/tomato_test1.1
/big/tomato_test1.1:7365: parser error : XML declaration allowed only at 
the start of the document
<?xml version="1.0"?>
      ^
/big/tomato_test1.1:7366: parser error : Extra content at the end of the 
document
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" 
"NCBI_BlastOutput.dt
^

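This makes me think that the output is not well-formed XML: the spec 
allows an XML declaration only at the very start of a document, and a 
document has exactly one root element.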
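
The same check can be reproduced without xmllint.  A minimal sketch, 
assuming Python's standard xml.etree.ElementTree (an expat-based parser, 
so the wording of the message differs from xmllint); the helper name and 
the one-element sample document are made up for illustration:

```python
import xml.etree.ElementTree as ET


def check_well_formed(text):
    """Return (True, None) if text parses, else (False, parser message)."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        return False, str(err)
    return True, None


# Hypothetical stand-in for one complete BLAST XML report.
one_doc = '<?xml version="1.0"?>\n<BlastOutput/>\n'

ok_single, _ = check_well_formed(one_doc)
# Two reports back to back, as in the file above: a second declaration
# appears after the first root element has closed.
ok_double, message = check_well_formed(one_doc + one_doc)
```

A single report parses; the concatenation of two is rejected.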

I do have a few options here.  They are hacks, but they are options.  Is 
the -m 7 output generally considered to be well-formed XML by people who 
consume it, or do you need to run it through parsers which have been 
made less strict?  Any thoughts?

I am sure others have solved issues like this in the past.  I am ok with 
being forgiving on what I read, but it is breaking the parser, so I need 
to either fix the parser, or be more careful about what I feed it.
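
One of the hacks, sketched in Python: cut the file at every XML 
declaration and parse each piece as its own document.  This assumes the 
string "<?xml " never occurs inside the data itself (for example in a 
CDATA section or a comment), and that each piece is a complete document 
on its own; pieces that are not get reported rather than parsed.  The 
function names and sample text are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

XML_DECL = re.compile(r"<\?xml\s")


def split_documents(text):
    """Split text into chunks, each starting at an XML declaration."""
    starts = [m.start() for m in XML_DECL.finditer(text)]
    if not starts:
        return [text] if text.strip() else []
    # Anything before the first declaration is dropped.
    bounds = starts + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:])]


def parse_all(text):
    """Parse every chunk; return (list of roots, list of (index, error))."""
    roots, errors = [], []
    for index, chunk in enumerate(split_documents(text)):
        try:
            roots.append(ET.fromstring(chunk))
        except ET.ParseError as err:
            errors.append((index, str(err)))
    return roots, errors


# Hypothetical miniature of two reports written back to back.
doc = (
    '<?xml version="1.0"?>\n'
    '<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" '
    '"NCBI_BlastOutput.dtd">\n'
    "<BlastOutput><BlastOutput_program>blastx</BlastOutput_program>"
    "</BlastOutput>\n"
)
roots, errors = parse_all(doc + doc)
programs = [r.findtext("BlastOutput_program") for r in roots]
```

The split is purely textual, so it is forgiving about the input without 
touching the parser itself.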

Thanks!

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615