[Bioclusters] blast output (-m 7) in XML, and the XML spec
Joe Landman
landman at scalableinformatics.com
Sat Apr 23 02:14:29 EDT 2005
Hi Folks:
Working on a quick parser project, and I just spent too much time
chasing down a bug.
Short version: I need to make mpiBLAST (based upon NCBI BLAST) appear
to produce identical output for the same input across multiple machines
with the same data sets and databases. In theory this is not too
difficult, and it is something we had solved a while ago for a
different case.
OK, I had suggested using the XML output (-m 7), and then simply
parsing the document and returning its contents in a specific order.
Works well ... sort of.
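A minimal sketch of that reordering idea, for illustration only. It assumes the element names from NCBI_BlastOutput.dtd (Iteration_hits, Hit, Hit_id, Hit_def); the function name sort_hits and the choice of sort key are hypothetical, not what our parser actually does.

```python
import xml.etree.ElementTree as ET


def sort_hits(root):
    """Reorder the <Hit> children of every <Iteration_hits> in place.

    Assumption: sorting on (Hit_id, Hit_def) is enough to make the order
    deterministic; a real tool may need a different or longer key.
    """
    for hits in root.iter("Iteration_hits"):
        ordered = sorted(
            hits.findall("Hit"),
            key=lambda h: (h.findtext("Hit_id", ""), h.findtext("Hit_def", "")),
        )
        # Moving each hit to the end, in sorted order, leaves any
        # non-Hit children in front and the hits sorted after them.
        for hit in ordered:
            hits.remove(hit)
            hits.append(hit)
    return root
```

Running the same function over the output from each machine should then give the hits in the same order, provided the hits themselves are the same.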
The resulting XML from a BLAST run starts out with
<?xml version="1.0"?>
<BlastOutput_reference>~Reference: Altschul, Stephen F., Thomas L.
Madden, Alejandro A. Schaffer, ~Jinghui Zhang, Zheng Zhang, Webb Miller,
and David J. Lipman (1997), ~"Gapped BLAST and PSI-BLAST: a new
generation of protein database search~programs", Nucleic Acids
Res. 25:3389-3402.</BlastOutput_reference>
and then it gives the rest of the hits ...
and then, further down in the same file, it gives ...
<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN"
"NCBI_BlastOutput.dtd">
<BlastOutput>
<BlastOutput_program>blastx</BlastOutput_program>
<BlastOutput_version>blastx 2.2.10 [Oct-19-2004]</BlastOutput_version>
Is this well-formed? See
http://www.w3.org/TR/2004/REC-xml-20040204/#sec-well-formed . I am
trying to track down a bug in the XML parser, and I ran into that second
XML declaration. Basically, xmllint complains:
[landman@crunch:~]
124 >xmllint /big/tomato_test1.1
/big/tomato_test1.1:7365: parser error : XML declaration allowed only at
the start of the document
<?xml version="1.0"?>
^
/big/tomato_test1.1:7366: parser error : Extra content at the end of the
document
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN"
"NCBI_BlastOutput.dt
^
Which makes me think that this is not well-formed XML.
I do have a few options here; they are hacks, but they are options. Is
the -m 7 output generally considered to be valid XML by people who
consume it, or do you need to run it through parsers which have been
made less strict? Any thoughts?
I am sure others have solved issues like this in the past. I am ok with
being forgiving on what I read, but it is breaking the parser, so I need
to either fix the parser, or be more sensitive to what I am parsing.
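One of the hacks, sketched roughly: treat the file as several XML documents glued together, split it at each XML declaration, and parse the pieces separately. This assumes the string "<?xml" never occurs inside element text, and the names split_concatenated_xml and parse_all are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Start of an XML declaration. Assumption: this never appears inside
# element content in the BLAST output.
XML_DECL = re.compile(r"<\?xml\s")


def split_concatenated_xml(text):
    """Split text into chunks that each begin with an XML declaration."""
    starts = [m.start() for m in XML_DECL.finditer(text)]
    if not starts:
        return [text] if text.strip() else []
    # Anything before the first declaration is dropped.
    bounds = starts + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:])]


def parse_all(text):
    """Parse each chunk; a chunk that is still not well-formed yields None."""
    roots = []
    for chunk in split_concatenated_xml(text):
        try:
            roots.append(ET.fromstring(chunk))
        except ET.ParseError:
            roots.append(None)
    return roots
```

A strict parser then sees one document at a time, so nothing has to be made more forgiving; the cost is reading the whole file into memory first.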
Thanks!
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax : +1 734 786 8452
cell : +1 734 612 4615