[BiO BB] Parsing GenBank XML?

Mike Marchywka marchywka at hotmail.com
Thu Sep 4 11:18:18 EDT 2008



You are probably looking for a SAX parser, which streams the document as a series of events instead of loading it all into memory:

http://en.wikipedia.org/wiki/Simple_API_for_XML

I have my own hard-coded C++ that I use for my string-processing rules source
code, FDA AERS SGML parsing, SOAP utilities, etc. It outputs all the fields
in a simple format of one "label value" pair per line. There are SAX libraries for just about
every language, though. Personally, I finally gave up on Perl because its speed, at least under Cygwin,
was unpredictable and degraded quickly once you ran out of physical memory.
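
For illustration only, here is a rough sketch of the same "label value" idea using Python's standard xml.sax module. This is not my C++ code. The names LabelValueHandler and dump_fields are made up for this example, and the element names in the sample are assumptions about what the GenBank XML contains. The point is just that a SAX handler sees one element at a time, so memory use does not grow with the number of entries.

```python
import io
import sys
import xml.sax


class LabelValueHandler(xml.sax.ContentHandler):
    """Write one "label value" line for every element that carries text."""

    def __init__(self, out):
        super().__init__()
        self.out = out
        self.chunks = []  # text pieces collected since the last tag event

    def startElement(self, name, attrs):
        # A new element starts: forget any text collected so far.
        self.chunks = []

    def characters(self, content):
        # The parser may deliver a single text node in several pieces.
        self.chunks.append(content)

    def endElement(self, name):
        text = "".join(self.chunks).strip()
        if text:
            self.out.write("%s %s\n" % (name, text))
        self.chunks = []


def dump_fields(source, out):
    """Stream `source` (a file name or binary file object) through the handler."""
    xml.sax.parse(source, LabelValueHandler(out))


if __name__ == "__main__":
    # Hypothetical two-field record; real GenBank XML is far larger.
    sample = (
        b"<Bioseq-set><Seq-entry>"
        b"<Textseq-id_accession>X00001</Textseq-id_accession>"
        b"<Seqdesc_title>example one</Seqdesc_title>"
        b"</Seq-entry></Bioseq-set>"
    )
    dump_fields(io.BytesIO(sample), sys.stdout)
```

To get one tab-delimited row per Seq-entry, you could collect the fields you care about in a dict and write the row out in endElement when the Seq-entry element closes.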



Mike Marchywka



> Date: Thu, 4 Sep 2008 15:29:20 +0100
> From: dan.bolser at gmail.com
> To: BBB at bioinformatics.org
> Subject: [BiO BB] Parsing GenBank XML?
>
> Hi,
>
> Dumb / noob question I am sure but... I am parsing the results of a
> GenBank query obtained using esearch / efetch:
>
> http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html
>
>
> The XML looks like this...
>
> http://pastebin.com/f3ef02d85
>
> the only difference being that the real document has (possibly)
> millions of 's.
>
> I decided to try to use XSLT to turn the XML into tabular output. This
> is working fine on a sample of the data. I get one row of data per
> Seq-entry, which is exactly what I want. For reference, my XSLT style
> sheet is here:
>
> http://pastebin.com/f3a512411
>
>
> I am not sure how efficient that XSLT is (I never used XSLT before),
> however, that isn't the real problem. The real problem is that the
> XSLT parsers that I have tried (xsltproc and XML::XSLT) both need to
> slurp up the whole XML document before they output any rows of text.
> This is way too memory intensive, especially as the data may well grow.
>
> I figure that I can't be the first person to parse GenBank, so I was
> wondering what is 'out there' in terms of community consensus on how
> to do it...
>
> I had a quick go with XML::Simple, but I rapidly get lost in the
> resulting data structure, which I find leads to very messy (hard to
> read / write) and generally unmaintainable code.
>
> Are the various 'BioX' modules any good? i.e. do they simplify the
> resulting data to make it easy to get tab delimited dumps of the data?
>
>
> Cheers,
>
> Dan.
>
>
> --
> http://network.nature.com/profile/dan
>



