[BiO BB] Parsing GenBank XML?

Dan Bolser dan.bolser at gmail.com
Fri Sep 5 06:24:55 EDT 2008


2008/9/4 Mike Marchywka <marchywka at hotmail.com>
>
>
> You are probably looking for a SAX parser,
>
> http://en.wikipedia.org/wiki/Simple_API_for_XML

Hi Mike, thanks for the link. The relevant lines seem to be:

"Additionally, some kinds of XML processing simply require having
access to the entire document. XSLT and XPath, for example, need to be
able to access any node at any time in the parsed XML tree. While a
SAX parser could be used to construct such a tree, the DOM already
does so by design."

So it looks like I can't combine SAX with XSLT unless I somehow
trigger an XSLT parse of one 'chunk' of XML per SAX 'new-record'
event...

I'll try!
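
For what it's worth, here is a rough sketch of the "one chunk per new-record event" idea in Python, using the standard library's incremental parser rather than a true SAX handler. The record tag Seq-entry is from my data; the child field names (accession, length) and the sample document are made up for illustration, and the sketch assumes records are not nested inside one another. Each finished record could equally be handed to an XSLT processor (lxml has one, for example) instead of being read with findtext.

```python
import io
import xml.etree.ElementTree as ET

def iter_rows(source, record_tag="Seq-entry", fields=("accession", "length")):
    """Yield one tuple per record without keeping the whole document in memory.

    The field names are hypothetical placeholders, not real GenBank tags.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            # The record is complete here, so its children can be read safely.
            yield tuple(elem.findtext(f".//{name}", default="") for name in fields)
            # Drop finished records from the tree so memory use stays flat.
            root.clear()

# Made-up sample standing in for the real efetch output.
SAMPLE = b"""<Bioseq-set>
  <Seq-entry><accession>AB000001</accession><length>120</length></Seq-entry>
  <Seq-entry><accession>AB000002</accession><length>340</length></Seq-entry>
</Bioseq-set>"""

for row in iter_rows(io.BytesIO(SAMPLE)):
    print("\t".join(row))
```

For a real file, passing the file name (or an open binary file) to iter_rows in place of the BytesIO object should work the same way, since iterparse accepts either.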


> I've got my own hard-coded C++ that I use for my string processing rules source
> code, FDA AERA SGML parsing, SOAP utilities, etc., which outputs all the fields
> in a simple "label value" per line format, but there are SAX libraries in just about
> every language. Personally, I finally gave up on Perl because its speed, at least under
> Cygwin, was unpredictable and degraded quickly once physical memory ran out.
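
As I understand the "label value" per line idea, a SAX handler for it might look something like the sketch below. This is my own guess in Python using the standard xml.sax module, not Mike's C++ code; the class name LabelValueHandler and the sample element names are invented for illustration. It only prints elements that directly contain text, and it never builds a tree.

```python
import io
import xml.sax

class LabelValueHandler(xml.sax.ContentHandler):
    """Write 'label value' lines for every element that directly holds text."""

    def __init__(self, out):
        super().__init__()
        self.out = out
        self.text = []

    def startElement(self, name, attrs):
        # A new element starts, so forget any whitespace seen since the last tag.
        self.text = []

    def characters(self, content):
        # SAX may deliver one text node in several pieces; collect them all.
        self.text.append(content)

    def endElement(self, name):
        value = "".join(self.text).strip()
        if value:
            self.out.write(f"{name} {value}\n")
        self.text = []

# Made-up sample; the real element names will differ.
SAMPLE = b"""<Seq-entry>
  <accession>AB000001</accession>
  <length>120</length>
</Seq-entry>"""

out = io.StringIO()
xml.sax.parseString(SAMPLE, LabelValueHandler(out))
print(out.getvalue(), end="")
```

For a large file, xml.sax.parse(filename, handler) with sys.stdout as the output stream would emit lines as the parser goes, which is the point of using SAX here.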
>
>
>
> Mike Marchywka
>
>
>
> > Date: Thu, 4 Sep 2008 15:29:20 +0100
> > From: dan.bolser at gmail.com
> > To: BBB at bioinformatics.org
> > Subject: [BiO BB] Parsing GenBank XML?
> >
> > Hi,
> >
> > Dumb / noob question I am sure but... I am parsing the results of a
> > GenBank query obtained using esearch / efetch:
> >
> > http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html
> >
> >
> > The XML looks like this...
> >
> > http://pastebin.com/f3ef02d85
> >
> > the only difference being that the real document has (possibly)
> > millions of Seq-entry elements.
> >
> > I decided to try to use XSLT to turn the XML into tabular output. This
> > is working fine on a sample of the data. I get one row of data per
> > Seq-entry, which is exactly what I want. For reference, my XSLT style
> > sheet is here:
> >
> > http://pastebin.com/f3a512411
> >
> >
> > I am not sure how efficient that XSLT is (I never used XSLT before),
> > however, that isn't the real problem. The real problem is that the
> > XSLT parsers that I have tried (xsltproc and XML::XSLT) both need to
> > slurp up the whole XML document before they output any rows of text.
> > This is way too memory intensive, especially as the data may well grow.
> >
> > I figure that I can't be the first person to parse GenBank, so I was
> > wondering what is 'out there' in terms of community consensus on how
> > to do it...
> >
> > I had a quick go with XML::Simple, but I rapidly get lost in the
> > resulting data structure, which I find leads to very messy (hard to
> > read / write) and generally unmaintainable code.
> >
> > Are the various 'BioX' modules any good? i.e. do they simplify the
> > resulting data to make it easy to get tab delimited dumps of the data?
> >
> >
> > Cheers,
> >
> > Dan.
> >
> >
> > --
> > http://network.nature.com/profile/dan
> >
> > _______________________________________________
> > BBB mailing list
> > BBB at bioinformatics.org
> > http://www.bioinformatics.org/mailman/listinfo/bbb
>



--
http://network.nature.com/profile/dan



