[Bioclusters] FYI: minimal XML output fixer for NCBI BLAST

Joe Landman landman at scalableinformatics.com
Wed May 11 23:59:15 EDT 2005


Heh...

I'd be willing (if sufficiently prodded) to build a converter from the NCBI 
container of XML documents to a single XML container document.  It's most of 
the way done; it just needs the higher-level container.  I could do it for the 
one-at-a-time streaming report bits as well, or as a filter. 
Let me know if anyone wants one.
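
For illustration only, a converter along those lines might look roughly like
the sketch below.  This is not the code mentioned above.  The function name
merge_reports and the wrapper element BlastOutputSet are made up for the
example and are not part of any NCBI DTD.  It assumes the whole input fits in
memory and that the DOCTYPE lines carry no internal subset.

```perl
use strict;
use warnings;

# Merge concatenated XML reports into one document under a single root.
# The root element name is a placeholder (hypothetical), not an NCBI element.
sub merge_reports {
    my ($text, $root) = @_;
    $root = 'BlastOutputSet' unless defined $root;

    # Drop every XML declaration; one fresh declaration is emitted below.
    $text =~ s/<\?xml\b[^?]*\?>\s*//g;

    # Drop DOCTYPE declarations, which may not appear inside an element.
    # Assumes they have no internal subset (no '>' before the closing '>').
    $text =~ s/<!DOCTYPE\b[^>]*>\s*//g;

    return qq{<?xml version="1.0"?>\n<$root>\n$text</$root>\n};
}

# Possible use as a filter (reads everything from STDIN):
#   print merge_reports(do { local $/; <STDIN> });
```

Because the DOCTYPE lines are removed, the merged document can be checked for
well-formedness but no longer validates against the original DTD.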

Joe

Jason Stajich wrote:
> I hear you!
> 
> I have a workaround which does some munging as it goes in  
> Bio::SearchIO::blastxml.   Because we process each report one at a time  
> in the parser, I have to have the middle code layer strip out these  
> lines before allowing the lower-level XML lib to handle the stream.   
> Not ideal, but it works.
> 
> I think Warren added XML to WU-BLAST but unfortunately he implemented  
> the same problems too!
> http://blast.wustl.edu/blast/parameters.html#mformat
> 
> -jason
> 
> On May 11, 2005, at 11:10 PM, Joe Landman wrote:
> 
>> Simple problem:  take NCBI BLAST XML output and parse it.  It is an  
>> XML document after all, so it should be easy ... right?
>>
>> Sort of ...
>>
>> The NCBI XML output file is really a container of XML documents.  You 
>> cannot hand the container to an XML parser, as it (the container) is 
>> not a well-formed XML document: the <?xml version="1.0"?> declaration 
>> may appear at most once, and only at the very start of a document, 
>> according to the XML specification at w3.org.
>>
>> So here is my (perl based) "solution" (read as hack).
>>
>>     # assume entire document in $all, though this is Bad(TM)
>>     # for huge documents: very wasteful of memory resources.
>>     #
>>     my @sub_documents = split(/<\?xml version="1\.0"\?>/, $all);
>>     shift @sub_documents;   # drop the empty piece before the first declaration
>>
>> Now, each sub_document is in fact a valid XML document, that you  can 
>> happily and easily parse.
>>
>>     foreach (@sub_documents)
>>      {
>>       # do stuff with $_ which is now a valid XML document
>>      }
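
For huge files, the same split can be done without holding everything in
$all.  The following is a minimal sketch, not the method from the original
post: it sets Perl's input record separator to the XML declaration so that
each read returns one sub-document.  The name report_iterator is made up for
the example, and it assumes every declaration is spelled exactly
<?xml version="1.0"?> as in the split above.

```perl
use strict;
use warnings;

# Return an iterator that yields one sub-document per call from a filehandle.
sub report_iterator {
    my ($fh) = @_;
    my $decl = '<?xml version="1.0"?>';
    return sub {
        local $/ = $decl;    # read up to and including the next declaration
        while (defined(my $chunk = <$fh>)) {
            chomp $chunk;    # strip the trailing declaration, if present
            next unless $chunk =~ /\S/;   # skip the empty piece before the first one
            return $decl . $chunk;        # put the declaration back in front
        }
        return;              # no more sub-documents
    };
}

# Example use:
#   my $next = report_iterator(\*STDIN);
#   while (defined(my $doc = $next->())) { ... parse $doc ... }
```

Only one sub-document is in memory at a time, at the cost of depending on the
exact text of the declaration.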
>>
>> If there are any NCBI folks lurking here, is there a nice way to  make 
>> the -m 7 output generate a single large valid XML document so  we can 
>> use the  huge document parsers, rather than using hacks like  the above?
>>
>> As XML documents can be containers themselves, it seems to make sense 
>> to make the entire output parseable without giving xmllint (and 
>> other XML parsers) fits.
>>
>> [landman at crunch-r.scalableinformatics.com:/big] 137 >xmllint tomato_test1.1
>> tomato_test1.1:7365: parser error : XML declaration allowed only at the start of the document
>> <?xml version="1.0"?>
>>      ^
>> tomato_test1.1:7366: parser error : Extra content at the end of the document
>> <!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "NCBI_BlastOutput.dt
>> ^
>>
>> Thanks.
>>
>> Joe
>>
>>
>> -- 
>> Joseph Landman, Ph.D
>> Founder and CEO
>> Scalable Informatics LLC,
>> email: landman at scalableinformatics.com
>> web  : http://www.scalableinformatics.com
>> phone: +1 734 786 8423
>> fax  : +1 734 786 8452
>> cell : +1 734 612 4615
>>
>> _______________________________________________
>> Bioclusters maillist  -  Bioclusters at bioinformatics.org
>> https://bioinformatics.org/mailman/listinfo/bioclusters
>>
> 


