Heh... I'd be willing (if sufficiently prodded) to build a converter
from NCBI's container of XML documents to a single container XML
document. It's most of the way done; it just needs the higher-level
container. I could do it for the one-at-a-time streaming report bits as
well, or as a filter.

Let me know if anyone wants one.

Joe

Jason Stajich wrote:
> I hear you!
>
> I have a workaround which does some munging as it goes in
> Bio::SearchIO::blastxml. Because we process each report one at a time
> in the parser, I have to have the middle code layer strip out these
> lines before allowing the lower-level XML lib to handle the stream.
> Not ideal, but it works.
>
> I think Warren added XML to WU-BLAST, but unfortunately he implemented
> the same problems too!
> http://blast.wustl.edu/blast/parameters.html#mformat
>
> -jason
>
> On May 11, 2005, at 11:10 PM, Joe Landman wrote:
>
>> Simple problem: take NCBI BLAST XML output and parse it. It is an
>> XML document after all, so it should be easy ... right?
>>
>> Sort of ...
>>
>> The NCBI XML output file is really a container of XML documents. You
>> cannot hand the container to an XML parser, as it (the container) is
>> not a well-formed XML document (a well-formed XML document has at
>> most one <?xml version=""?> declaration, and only at the very start,
>> according to the standards on w3c.org).
>>
>> So here is my (perl based) "solution" (read as hack).
>>
>>     # assume entire document in $all, though this is Bad(TM)
>>     # for huge documents, very wasteful of memory resources.
>>     #
>>     @sub_documents = split(/\<\?xml version=\"1.0\"\?>/,$all);
>>     shift @sub_documents;
>>
>> Now, each sub_document is in fact a valid XML document, that you can
>> happily and easily parse.
>>
>>     foreach (@sub_documents)
>>     {
>>         # do stuff with $_ which is now a valid XML document
>>     }
>>
>> If there are any NCBI folks lurking here, is there a nice way to make
>> the -m 7 output generate a single large valid XML document so we can
>> use the huge document parsers, rather than using hacks like the above?
>>
>> As XML documents can be containers themselves, it seems to make sense
>> to make the entire output parseable without giving xmllint (and
>> other XML parsers) fits:
>>
>> [landman at crunch-r.scalableinformatics.com:/big] 137 >xmllint tomato_test1.1
>> tomato_test1.1:7365: parser error : XML declaration allowed only at the start of the document
>> <?xml version="1.0"?>
>>  ^
>> tomato_test1.1:7366: parser error : Extra content at the end of the document
>> <!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "NCBI_BlastOutput.dt
>>  ^
>>
>> Thanks.
>>
>> Joe
>>
>>
>> --
>> Joseph Landman, Ph.D
>> Founder and CEO
>> Scalable Informatics LLC,
>> email: landman at scalableinformatics.com
>> web  : http://www.scalableinformatics.com
>> phone: +1 734 786 8423
>> fax  : +1 734 786 8452
>> cell : +1 734 612 4615
>>
>> _______________________________________________
>> Bioclusters maillist  -  Bioclusters at bioinformatics.org
>> https://bioinformatics.org/mailman/listinfo/bioclusters

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615
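
For anyone curious what the filter idea from this thread might look like, here is a rough sketch only, not the converter mentioned above. It reads line by line, so the whole file never has to sit in memory. Assumptions: every repeated <?xml ...?> declaration and <!DOCTYPE ...> line sits on a line of its own, and the wrapper element name BlastOutputSet is made up for illustration (it is not an NCBI name). Dropping the DOCTYPE lines also means the result can no longer be validated against the NCBI DTD.

```perl
use strict;
use warnings;

# Copy concatenated BLAST XML reports from $in to $out as ONE document.
# 'BlastOutputSet' is a hypothetical wrapper element, not an NCBI name.
sub wrap_reports {
    my ($in, $out, $root) = @_;
    $root = 'BlastOutputSet' unless defined $root;

    # Exactly one declaration, then open the wrapper element.
    print {$out} qq{<?xml version="1.0"?>\n<$root>\n};

    while (my $line = <$in>) {
        # Assumption: each declaration and DOCTYPE is on its own line.
        next if $line =~ /^\s*<\?xml\b[^>]*\?>\s*$/;
        next if $line =~ /^\s*<!DOCTYPE\b[^>]*>\s*$/;
        print {$out} $line;
    }

    # Close the wrapper so the output has a single root element.
    print {$out} "</$root>\n";
    return;
}
```

To use it as a filter, call wrap_reports(\*STDIN, \*STDOUT) at the bottom of the script and pipe the -m 7 output through it. The result should then pass xmllint as a single document.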