[Bioclusters] FYI: minimal XML output fixer for NCBI BLAST
Joe Landman
landman at scalableinformatics.com
Wed May 11 23:59:15 EDT 2005
Heh...
I'd be willing (if sufficiently prodded) to build a converter from NCBI's
container of XML documents to a single container XML document. It's most
of the way done; it just needs the higher-level container. I could do it
for the one-at-a-time streaming report bits as well, and as a filter.
Let me know if anyone wants one.
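
A minimal sketch of what such a filter might look like, written in Python
rather than Perl and not the converter mentioned above. It assumes each XML
declaration and each DOCTYPE sits on a single line (as in the xmllint
excerpt quoted below), and the wrapper element name BlastOutputSet is a
made-up placeholder, not anything NCBI defines.

```python
import re

# Matches an XML declaration such as <?xml version="1.0"?>
XML_DECL = re.compile(r'<\?xml\s[^?]*\?>')
# Matches a single-line DOCTYPE with no internal subset (assumption).
DOCTYPE = re.compile(r'<!DOCTYPE[^>]*>')


def merge_reports(lines, root="BlastOutputSet"):
    """Yield the lines of one well-formed XML document that wraps every
    report found in `lines` inside a single root element.

    `root` is a hypothetical wrapper name chosen for this sketch.
    """
    yield '<?xml version="1.0"?>\n'
    yield "<%s>\n" % root
    for line in lines:
        # Drop the per-report declarations; they are only legal at the
        # very start of a document, not inside the wrapper element.
        line = DOCTYPE.sub("", XML_DECL.sub("", line))
        if line.strip():
            yield line
    yield "</%s>\n" % root


# Usage as a stdin-to-stdout filter (one line in memory at a time):
#   import sys
#   sys.stdout.writelines(merge_reports(sys.stdin))
```

Dropping the DOCTYPE means the merged document can no longer be validated
against the NCBI DTD; it is only meant to be well-formed enough for a
single-document parser.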
Joe
Jason Stajich wrote:
> I hear you!
>
> I have a workaround which does some munging as it goes in
> Bio::SearchIO::blastxml. Because we process each report one at a time
> in the parser, the middle code layer has to strip out these
> lines before allowing the lower-level XML lib to handle the stream.
> Not ideal, but it works.
>
> I think Warren added XML to WU-BLAST but unfortunately he implemented
> the same problems too!
> http://blast.wustl.edu/blast/parameters.html#mformat
>
> -jason
>
> On May 11, 2005, at 11:10 PM, Joe Landman wrote:
>
>> Simple problem: take NCBI BLAST XML output and parse it. It is an
>> XML document after all, so it should be easy ... right?
>>
>> Sort of ...
>>
>> The NCBI XML output file is really a container of XML documents. You
>> cannot hand the container to an XML parser, as it (the container) is
>> not a well-formed XML document (a well-formed XML document has at
>> most one <?xml version="..."?> declaration, and only at its very
>> start, according to the XML standard at w3.org).
>>
>> So here is my (perl based) "solution" (read as hack).
>>
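>> # Assume the entire document is in $all, though this is Bad(TM)
>> # for huge documents: very wasteful of memory resources.
>> #
>> # Split on the XML declaration that starts each report.
>> my @sub_documents = split(/<\?xml version="1\.0"\?>/, $all);
>> # Drop the empty chunk before the first declaration.
>> shift @sub_documents;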
>>
>> Now, each sub_document is in fact a valid XML document, that you can
>> happily and easily parse.
>>
>> foreach (@sub_documents)
>> {
>> # do stuff with $_ which is now a valid XML document
>> }
>>
>> If there are any NCBI folks lurking here, is there a nice way to make
>> the -m 7 output generate a single large valid XML document so we can
>> use the huge document parsers, rather than using hacks like the above?
>>
>> As XML documents can be containers themselves, it seems to make sense
>> to make the entire output parseable without giving xmllint (and
>> other XML parsers) fits:
>>
>> [landman at crunch-r.scalableinformatics.com:/big] 137 >xmllint tomato_test1.1
>> tomato_test1.1:7365: parser error : XML declaration allowed only at the start of the document
>> <?xml version="1.0"?>
>> ^
>> tomato_test1.1:7366: parser error : Extra content at the end of the document
>> <!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "NCBI_BlastOutput.dt
>> ^
>>
>> Thanks.
>>
>> Joe
>>
>>
>> --
>> Joseph Landman, Ph.D
>> Founder and CEO
>> Scalable Informatics LLC,
>> email: landman at scalableinformatics.com
>> web : http://www.scalableinformatics.com
>> phone: +1 734 786 8423
>> fax : +1 734 786 8452
>> cell : +1 734 612 4615
>>
>> _______________________________________________
>> Bioclusters maillist - Bioclusters at bioinformatics.org
>> https://bioinformatics.org/mailman/listinfo/bioclusters
>>
>