[BiO BB] Parsing GenBank XML?

Dan Bolser dan.bolser at gmail.com
Thu Sep 4 10:29:20 EDT 2008


Hi,

Dumb / noob question I am sure but... I am parsing the results of a
GenBank query obtained using esearch / efetch:

http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html


The XML looks like this...

http://pastebin.com/f3ef02d85

the only difference being that the real document has (possibly)
millions of <Seq-entry> elements.

I decided to try to use XSLT to turn the XML into tabular output. This
is working fine on a sample of the data. I get one row of data per
Seq-entry, which is exactly what I want. For reference, my XSLT
stylesheet is here:

http://pastebin.com/f3a512411


I am not sure how efficient that XSLT is (I have never used XSLT
before); however, that isn't the real problem. The real problem is that
the XSLT processors that I have tried (xsltproc and XML::XSLT) both
need to slurp up the whole XML document before they output any rows of
text. This is way too memory intensive, especially as the data may well
grow.
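
To make the requirement concrete, here is a minimal sketch of a
streaming parse that emits one row per Seq-entry and keeps only one
record in memory at a time. It is written in Python using only the
standard library, and it is not a translation of the stylesheet above.
The column tags in FIELDS (Textseq-id_accession, Seq-inst_length,
Seqdesc_title) and the function names rows and write_tsv are
assumptions chosen for illustration; the real document may need
different tags.

```python
import sys
import xml.etree.ElementTree as ET

# Hypothetical column list: these tag names are assumed for illustration
# and should be replaced with whatever the real XML actually contains.
FIELDS = ("Textseq-id_accession", "Seq-inst_length", "Seqdesc_title")


def rows(source, record_tag="Seq-entry", fields=FIELDS):
    """Yield one tuple of field values per outermost record element.

    `source` is a filename or a binary file object. The tree is discarded
    after each record, so memory use does not grow with document size.
    """
    root = None
    depth = 0  # how many record_tag elements are currently open
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if root is None:
            root = elem  # the first "start" event is always the root
        if event == "start":
            if elem.tag == record_tag:
                depth += 1
            continue
        if elem.tag != record_tag:
            continue
        depth -= 1
        if depth == 0:  # ignore records nested inside another record
            # First matching descendant per field; empty string if absent.
            yield tuple((elem.findtext(".//" + f) or "").strip()
                        for f in fields)
            root.clear()  # drop everything parsed so far


def write_tsv(source, out=sys.stdout):
    """Write one tab-delimited line per record to `out`."""
    for row in rows(source):
        out.write("\t".join(row) + "\n")
```

Calling write_tsv on the efetch output file would then print rows as
they are parsed, rather than after the whole document has been read.
Only the first match for each tag within a record is reported, which
may or may not be what is wanted for records that contain several
sequences.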

I figure that I can't be the first person to parse GenBank, so I was
wondering what is 'out there' in terms of community consensus on how
to do it...

I had a quick go with XML::Simple, but I rapidly get lost in the
resulting data structure, which I find leads to very messy (hard to
read / write) and generally unmaintainable code.

Are the various 'BioX' modules any good? i.e. do they simplify the
resulting data to make it easy to get tab delimited dumps of the data?


Cheers,

Dan.


-- 
http://network.nature.com/profile/dan



