[Bioclusters] FYI: minimal XML output fixer for NCBI BLAST
Jason Stajich
jason.stajich at duke.edu
Wed May 11 23:27:44 EDT 2005
I hear you!
I have a workaround which does some munging as the data goes into
Bio::SearchIO::blastxml. Because the parser processes each report one
at a time, the middle code layer has to strip out these lines before
the lower-level XML library is allowed to handle the stream.
Not ideal, but it works.
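For anyone curious, the general idea can be sketched in a few lines of
plain Perl. This is only an illustration, not the actual
Bio::SearchIO::blastxml code: the each_blast_report name and the
callback interface are made up for this example, and it assumes each
<?xml ...?> declaration sits on a line of its own, as it does in the
NCBI output quoted below.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): read a concatenated BLAST
# XML stream from $fh and call $handler once per report, so the whole
# file never has to sit in memory at once.
sub each_blast_report {
    my ($fh, $handler) = @_;
    my $count  = 0;
    my $report = '';
    while (my $line = <$fh>) {
        # An XML declaration marks the start of the next report.
        if ($line =~ /^\s*<\?xml\b/) {
            if ($report =~ /\S/) {
                $handler->($report);
                $count++;
            }
            $report = '';
            next;    # drop the declaration itself; it is optional
        }
        $report .= $line;
    }
    # Flush the last report, which has no declaration following it.
    if ($report =~ /\S/) {
        $handler->($report);
        $count++;
    }
    return $count;
}
```

Called as each_blast_report($fh, sub { my ($xml) = @_; ... }), the
callback sees one report at a time and can hand it to whatever XML
parser you like. Dropping the declaration is safe here only because
it carries nothing beyond version="1.0"; a declaration that named an
encoding would need to be kept.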
I think Warren added XML output to WU-BLAST, but unfortunately his
implementation has the same problem!
http://blast.wustl.edu/blast/parameters.html#mformat
-jason
On May 11, 2005, at 11:10 PM, Joe Landman wrote:
> Simple problem: take NCBI BLAST XML output and parse it. It is an
> XML document after all, so it should be easy ... right?
>
> Sort of ...
>
> The NCBI XML output file is really a container of XML documents.
> You cannot hand the container to be parsed to an XML Parser, as it
> (the container) is not a well-formed XML document (a well-formed
> document may carry at most one <?xml version="..."?> declaration,
> and only at its very start, according to the W3C XML specification).
>
> So here is my (perl based) "solution" (read as hack).
>
> # assume entire document in $all, though this is Bad(TM)
> # for huge documents, very wasteful of memory resources.
> #
> my @sub_documents = split(/<\?xml version="1\.0"\?>/, $all);
> shift @sub_documents;   # drop the empty text before the first declaration
>
> Now, each sub_document is in fact a well-formed XML document (minus
> its declaration, which is optional) that you can happily and easily
> parse.
>
> foreach (@sub_documents)
> {
> # do stuff with $_ which is now a valid XML document
> }
>
> If there are any NCBI folks lurking here, is there a nice way to
> make the -m 7 output generate a single large valid XML document so
> we can use the huge document parsers, rather than using hacks like
> the above?
>
> As XML documents can be containers themselves, it seems to make
> sense to make the entire output parseable without giving xmllint
> (and other XML parsers) fits:
>
> [landman at crunch-r.scalableinformatics.com:/big] 137 >xmllint tomato_test1.1
> tomato_test1.1:7365: parser error : XML declaration allowed only at the start of the document
> <?xml version="1.0"?>
> ^
> tomato_test1.1:7366: parser error : Extra content at the end of the document
> <!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN"
> "NCBI_BlastOutput.dt
> ^
>
> Thanks.
>
> Joe
>
>
> --
> Joseph Landman, Ph.D
> Founder and CEO
> Scalable Informatics LLC,
> email: landman at scalableinformatics.com
> web : http://www.scalableinformatics.com
> phone: +1 734 786 8423
> fax : +1 734 786 8452
> cell : +1 734 612 4615
>
> _______________________________________________
> Bioclusters maillist - Bioclusters at bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/bioclusters
>