[Bioclusters] FYI: minimal XML output fixer for NCBI BLAST

Jason Stajich jason.stajich at duke.edu
Wed May 11 23:27:44 EDT 2005


I hear you!

I have a workaround in Bio::SearchIO::blastxml that munges the stream  
as it goes.  Because the parser processes the reports one at a time,  
the middle code layer has to strip out these lines before the  
lower-level XML lib is allowed to handle the stream.  
Not ideal, but it works.
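
For anyone who wants the same effect outside BioPerl, here is a minimal sketch of the idea. It is not the actual Bio::SearchIO::blastxml code. It reads the stream line by line and starts a new report whenever another XML declaration shows up, so only one report is held in memory at a time. The sub name for_each_report and the callback interface are made up for illustration, and the sketch assumes every XML declaration begins a line, as it does in the xmllint output quoted below.

```perl
use strict;
use warnings;

# Hypothetical helper: call $callback once per report read from $fh.
# Assumption: every XML declaration starts at the beginning of a line.
sub for_each_report {
    my ($fh, $callback) = @_;
    my $report = '';
    my $count  = 0;
    while (my $line = <$fh>) {
        if ($line =~ /^<\?xml\s/ && $report =~ /\S/) {
            # A new declaration means the previous report is complete.
            $callback->($report);
            $count++;
            $report = '';
        }
        $report .= $line;
    }
    # Flush the last report, if there is one.
    if ($report =~ /\S/) {
        $callback->($report);
        $count++;
    }
    return $count;    # number of reports handed to the callback
}
```

Each string passed to the callback starts with its own declaration, so it can go straight to whatever XML parser you prefer.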

I think Warren added XML output to WU-BLAST, but unfortunately it has  
the same problem!
http://blast.wustl.edu/blast/parameters.html#mformat

-jason

On May 11, 2005, at 11:10 PM, Joe Landman wrote:

> Simple problem:  take NCBI BLAST XML output and parse it.  It is an  
> XML document after all, so it should be easy ... right?
>
> Sort of ...
>
> The NCBI XML output file is really a container of XML documents.   
> You cannot hand the container to an XML parser, as it (the  
> container) is not a well-formed XML document (a well-formed XML  
> document has at most one <?xml version="..."?> declaration, and  
> only at the very start, according to the W3C XML specification).
>
> So here is my (perl based) "solution" (read as hack).
>
>     # assume entire document in $all, though this is Bad(TM)
>     # for huge documents, very wasteful of memory resources.
>     #
>     @sub_documents  = split(/\<\?xml version=\"1.0\"\?>/,$all);
>     shift @sub_documents;
>
> Now, each sub_document is in fact a valid XML document, that you  
> can happily and easily parse.
>
>     foreach (@sub_documents)
>      {
>       # do stuff with $_ which is now a valid XML document
>      }
>
> If there are any NCBI folks lurking here, is there a nice way to  
> make the -m 7 output generate a single large valid XML document so  
> we can use the  huge document parsers, rather than using hacks like  
> the above?
>
> As XML documents can be containers themselves, it seems to make  
> sense to make the entire output parseable without giving xmllint  
> (and other XML parsers) fits.
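
Until the output changes, one way to get a single document that xmllint accepts is to strip the per-report XML declaration and DOCTYPE and wrap what is left in one root element. A rough sketch, building on the split shown above. The sub name wrap_reports and the <BlastOutputSet> wrapper element are invented here and are not part of any NCBI DTD, so the result is well-formed but cannot be validated against the BLAST DTD. It also assumes the input contains at least one XML declaration and that the DOCTYPE has no internal subset.

```perl
use strict;
use warnings;

# Hypothetical helper: turn concatenated BLAST XML reports into one
# well-formed document. <BlastOutputSet> is a made-up wrapper element.
sub wrap_reports {
    my ($all) = @_;
    # Split on every XML declaration; the first piece is whatever came
    # before the first declaration (normally nothing), so discard it.
    my @reports = split /<\?xml[^>]*\?>/, $all;
    shift @reports;
    for my $report (@reports) {
        # Drop the DOCTYPE: a document may carry only one, and the
        # wrapper element would not match it anyway.
        $report =~ s/<!DOCTYPE[^>]*>//;
        # Trim leading and trailing whitespace.
        $report =~ s/^\s+|\s+$//g;
    }
    return join "\n",
        '<?xml version="1.0"?>',
        '<BlastOutputSet>',
        @reports,
        '</BlastOutputSet>', '';
}
```

Like the split, this holds everything in memory, so it only suits outputs of modest size.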
>
> [landman at crunch-r.scalableinformatics.com:/big] 137 >xmllint tomato_test1.1
> tomato_test1.1:7365: parser error : XML declaration allowed only at  
> the start of the document
> <?xml version="1.0"?>
>      ^
> tomato_test1.1:7366: parser error : Extra content at the end of the  
> document
> <!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN"  
> "NCBI_BlastOutput.dt
> ^
>
> Thanks.
>
> Joe
>
>
> -- 
> Joseph Landman, Ph.D
> Founder and CEO
> Scalable Informatics LLC,
> email: landman at scalableinformatics.com
> web  : http://www.scalableinformatics.com
> phone: +1 734 786 8423
> fax  : +1 734 786 8452
> cell : +1 734 612 4615
>
> _______________________________________________
> Bioclusters maillist  -  Bioclusters at bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/bioclusters
>


