[BiO BB] Re: pdb-l: Parsing taxonomy from blast output

Kevin Karplus karplus at soe.ucsc.edu
Fri Apr 1 12:23:40 EST 2005


Here is a piece of a Perl module that we have used for retrieving
taxonomy information:

#
# get_docsum_from_ncbi($query_gi_list)
#
# Queries the NCBI database to retrieve the document summary for each
# gi ID in a comma-delimited list ($query_gi_list).  Each document
# summary is parsed and its values are placed into the global hash
# %xml_docsum.  This hash is keyed on the accession number in
# each sequence ID (this is not the same as the gi ID used to retrieve
# the summary).  Each entry in the hash contains a sub-hash with
# name/value pairs from the document summary.  The main value we're
# interested in is 'TaxId', which is the taxonomy ID of the organism
# the sequence comes from.
#
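#
# A hypothetical usage sketch (not part of the original module; the gi
# IDs are the same ones used in the example ESummary URL further down):
#
#     get_docsum_from_ncbi("29837495,114475");
#     foreach my $acc (keys %xml_docsum) {
#         print "$acc => taxid ", $xml_docsum{$acc}{'TaxId'}, "\n";
#     }
#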
sub get_docsum_from_ncbi($)
{
    my($query_gi_list) = @_;

    # NCBI will return its document summary in XML format.  Here, we
    # set up an XML::Parser object with callback functions that will
    # be used to parse the XML.  See the callback functions themselves
    # for more information (a sketch of what such handlers can look
    # like appears after this subroutine).
    # Note that if you call new XML::Parser(Style => 'Debug'), you
    # can see the full structure of the XML document printed to
    # STDOUT.  If you do this, you will need to comment out the
    # setHandlers call below, since only the default handlers will
    # output the debug info.
    my $xml_parser = new XML::Parser();
    $xml_parser->setHandlers(Start => \&xml_start_handler,
			     Char  => \&xml_char_handler,
			     End => \&xml_end_handler);

    # NCBI provides tools intended for scripted, batch access to
    # their databases.  They call them "Entrez Programming Utilities", or
    # EUtils for short, and you can find their documentation here:
    # http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html
    #
    # One EUtil is called ESearch, and the following code can use
    # ESearch to query NCBI with a list of accession numbers (in the
    # form 'term' => '<accession>+OR+<accession>') and return a list
    # of gi IDs to be used with other EUtils that only take gi IDs as
    # input.  Unfortunately, the search term is limited in length, and
    # you can only fit about 20 accession numbers per query.  NCBI's
    # usage rules say you can only perform one query every 3 seconds,
    # so it would take 25 mins for 10,000 records.  Therefore, I had
    # to drop this idea as a solution, but I leave the code here in
    # comments for future reference.
    #
    #$query_url = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi";
    #$response = $browser->post($query_url,
    #    [ 'db' => 'protein',
    #      'term' => $$query_accession_nums_ref, #'NP_828860.1+OR+P05040',
    #      'usehistory' => 'n',
    #      'tool' => NCBI_TOOL,
    #      'email' => NCBI_EMAIL,
    #    ]);
    #if(!$response->is_success) {
    #	print STDERR "Error '" . $response->status_line . 
    #	    "' retrieving url '"
    #	    . $query_url . "'.\n";
    #	return "";
    #}
    #
    #$xml_parser->parse($response->content);
    #chop($query_gi_list); # Cut off trailing comma


    # Another way to convert accession numbers to gi ids is by using
    # "batch entrez".  This part of NCBI's web site is made to be used
    # by humans, so scripting it and parsing the output is a lot
    # harder.  However, the advantage of batch entrez over ESearch is
    # that it can return up to 10,000 results at a time (at least
    # according to my interpretation of the documentation - I never
    # tested with that many).  Nevertheless, to avoid straining NCBI's
    # systems and parsing result data from a human interface that
    # might change format, we decided it was better to just use the gi
    # ids stored in each a2m file, even though they can become out of
    # date over time.
    #
    # The following is code that can be adapted to fetch gi ids using batch
    # Entrez.  The code was originally written as an attempt to find taxonomy
    # ids, before I knew about EUtils.  Note that $$batch_accession_nums_ref
    # needs to be set to a newline-delimited list of accession numbers.
    # One warning: if you give batch entrez an accession number it can't find,
    # it may return an error and no results.  I didn't get around to handling that.
    # Also note that if you need to display a next page of results, or
    # results in a different format, extract "WebEnv" and "query_num" to
    # pass on to the next page.  Each new page seems to include new values
    # for these two things.  In other words, one WebEnv can't be used to
    # page through one set of results - each page of the results gives you
    # a new WebEnv.  The EUtils docs say WebEnv is a key into a server side
    # cache that can hold up to 10,000 records.
    #
    # Use LWP's multipart/form-data POST method to upload a batch
    # of accession numbers to look up.
    # Documentation for this is found in question 10 of the LWP FAQ
    # found here: http://groups.google.com/groups?q=perl+lwp+upload+file&hl=en&lr=lang_en&ie=UTF-8&oe=UTF-8&safe=off&selm=8ek1n2%24ej1%241%40cherry.rt.ru&rnum=2
    # Better documentation here: http://www.perldoc.com/perl5.8.0/lib/HTTP/Request/Common.html
    # but even those docs are not clear how to upload file content without a
    # file.  The key was found in this post:
    # http://groups.google.com/groups?q=lwp+form-data+undef+content&hl=en&lr=lang_en&ie=UTF-8&oe=UTF-8&safe=off&selm=u6594vsb9c2l1bpp6c1rg6tpf1tcs8o8r9%404ax.com&rnum=2
    #my $request = POST (
    #  'http://www.ncbi.nlm.nih.gov:80/entrez/batchentrez.cgi',
    #  Content_Type => 'form-data',
    #  Content      =>
    #  [
    #    cmd_current => '',
    #    cmd => 'Retrieve',
    #    db => 'protein',
    #    orig_db => 'nucleotide', # This is probably not needed
    #    dispmax => '10000', # Untested, but this should make it return up to 10,000 results
    #
        # Faking the upload of a file without writing one out requires
        # simulating a bunch of extra headers.  Uploading a real file is
        # much simpler, requiring just the one value ['./batch.entrez'].
    #    file => [undef, 'batch.entrez',
    #             'Content-Length' => length($$batch_accession_nums_ref),
    #             'Content-Type' => 'text/plain',
    #             'Content' => $$batch_accession_nums_ref],
    #
    #  ],
    #);
    #$response = $browser->request($request);



    # To find the taxonomy ID for each gi ID, we use the NCBI ESummary utility.
    # This utility takes a comma delimited list of gi IDs and returns summary
    # data about each one in XML format.  This summary information
    # includes the taxonomy ID, the gi ID, the sequence ID, and some other
    # things we don't care to know.  To see what ESummary returns, try visiting
    # this URL:
    # http://www.ncbi.nih.gov/entrez/eutils/esummary.fcgi?db=protein&id=29837495,114475
    #
    # Documentation for ESummary is here:
    # http://eutils.ncbi.nlm.nih.gov/entrez/query/static/esummary_help.html
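    #
    # For reference, the shape of the XML that comes back looks roughly
    # like this (element and Item names follow NCBI's eSummaryResult DTD;
    # the values are just placeholders):
    #
    #   <eSummaryResult>
    #     <DocSum>
    #       <Id>...gi id...</Id>
    #       <Item Name="Caption" Type="String">...accession...</Item>
    #       <Item Name="TaxId" Type="Integer">...taxonomy id...</Item>
    #       ...more Item elements...
    #     </DocSum>
    #     ...one DocSum per gi id...
    #   </eSummaryResult>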
    my $query_url =
	"http://www.ncbi.nih.gov/entrez/eutils/esummary.fcgi";
    # Older versions of LWP (like the one on the Condor cluster) don't
    # support the post() method, so use the older request(...POST())
    # method instead.
    #my $response = $browser->post($query_url,
    my $response = $browser->request(HTTP::Request::Common::POST($query_url,
	[ 'db' => 'protein',        # Interestingly, "protein" and "taxonomy" databases both return the same results
	  'id' => $query_gi_list,
	  'tool' => NCBI_TOOL,      # Let NCBI know what script is accessing ESummary
	  'email' => NCBI_EMAIL,    # Let NCBI know whom to e-mail if the script is causing problems on their servers.
	]));
    if(!$response->is_success) {
	print STDERR "get_docsum_from_ncbi: Error '" .
	    $response->status_line . 
	    "' retrieving url '" . $query_url . "'.\n";
	return;
    }

    # Parse the XML response to our ESummary web query.  The callback
    # functions like xml_start_handler will take care of putting all
    # the data into the global %xml_docsum hash.
    $xml_parser->parse($response->content);


    # ESummary is not the only way to get the taxonomy IDs we're looking
    # for.  You can also use EFetch through a URL like this:
    # http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=11036339,1078661&rettype=native&retmode=xml
    # where "11036339,1078661" is the comma-separated list of gi ids
    # (although it seems you can also use accession numbers).  The
    # problem with this method is that the amount of information it
    # returns is enormous, and the time it takes to process and return
    # it to you in XML is quite noticeable even on just two records.
    # It goes faster using the default non-xml format like this:
    # http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=11036339,1078661
    # but it is still long.  However, if you ever need more data than the
    # taxonomy ID, this may be the only way to get it.
    #
    # I also discovered that you can request multiple ids to be displayed
    # at the same time through the human "query.fcgi" interface using a
    # url like this:
    # http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=protein&list_uids=11036339,1078661&dopt=GenPept
    # Again, I don't trust that this output will remain stable like the EUtils
    # output, and I don't know how many ids you can request at once, and
    # this output is probably based on the EFetch backend data anyway
    # (so you're probably not saving their servers any work), but it's worth
    # looking into.  E-mail info at ncbi.nlm.nih.gov if you really want to be
    # sure what is most efficient on their end.
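    #
    # A hedged sketch (not from the original module) of what an EFetch
    # request could look like, reusing the same $browser and gi list, if
    # you ever do need the fuller records:
    #
    #my $efetch_url = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi";
    #my $efetch_response = $browser->request(HTTP::Request::Common::POST($efetch_url,
    #    [ 'db' => 'protein',
    #      'id' => $query_gi_list,
    #      'rettype' => 'native',
    #      'retmode' => 'xml',
    #      'tool' => NCBI_TOOL,
    #      'email' => NCBI_EMAIL,
    #    ]));
    #if(!$efetch_response->is_success) {
    #    print STDERR "Error '" . $efetch_response->status_line .
    #        "' retrieving url '" . $efetch_url . "'.\n";
    #}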
}
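
The XML handler callbacks (xml_start_handler, xml_char_handler,
xml_end_handler) are not included in the piece above.  Here is a rough
sketch of what they can look like; using the "Caption" item as the hash
key is my assumption about how the accession gets extracted, so adjust
as needed:

# Global hash filled in by the handlers: accession => { name => value, ... }
our %xml_docsum;

my $current_item;      # Name attribute of the <Item> element we are inside, if any
my %current_docsum;    # name/value pairs collected so far for the current <DocSum>

sub xml_start_handler
{
    my ($expat, $element, %attrs) = @_;
    if ($element eq 'DocSum') {
        %current_docsum = ();
    } elsif ($element eq 'Item') {
        $current_item = $attrs{'Name'};    # e.g. 'Caption', 'TaxId'
    }
}

sub xml_char_handler
{
    my ($expat, $string) = @_;
    # Character data may arrive in several chunks, so append.
    $current_docsum{$current_item} .= $string if defined $current_item;
}

sub xml_end_handler
{
    my ($expat, $element) = @_;
    if ($element eq 'Item') {
        undef $current_item;
    } elsif ($element eq 'DocSum') {
        # Key the finished summary on its accession ("Caption" item).
        my $accession = $current_docsum{'Caption'};
        $xml_docsum{$accession} = { %current_docsum } if defined $accession;
    }
}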



------------------------------------------------------------
This code is not complete---you may need some of the following packages:
------------------------------------------------------------


# LWP (Library for WWW in Perl) is used to retrieve documents
# from web servers.  See
# http://www.perl.com/pub/a/2002/08/20/perlandlwp.html for documentation.
use HTTP::Request::Common;
use LWP::UserAgent;

# There are a bewildering number of modules written to parse XML,
# but I think the following advice found on a newsgroup made the
# most sense:
#
# Go to http://search.cpan.org/ and search for "XML".
# Have a look at XML::DOM, XML::Grove, XML::Twig, and maybe
# XML::Simple, if you don't mind the uglyish data structures it
# exposes to the user.  XML::QL might be of interest, even though
# it is still immature.
# If you are not interested in a tree, but want to just pull out
# elements while parsing the file, try XML::Node, or XML::Parser
# directly.
#
# Based on this advice, I have chosen XML::Parser as a standard
# module which works efficiently for what we need here.  We
# don't need to convert the whole document into an in-memory
# tree structure; we just need to pull out a few values.  If
# you do need a tree structure, I would lean towards XML::Grove
# from the list above, though I have little to base that opinion
# on.
# Note that I'm not too happy with the hoops I had to jump through to
# use Parser's event-driven methodology.  The code is not so easy to
# understand.
use XML::Parser;
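
The excerpt above also uses NCBI_TOOL, NCBI_EMAIL, and $browser without
defining them.  A minimal sketch of what those definitions might look like,
placed near the top of the module (the tool name and e-mail address are
placeholders; substitute your own):

use constant NCBI_TOOL  => 'my_taxonomy_script';
use constant NCBI_EMAIL => 'you@example.org';

# The LWP user agent object used for all of the HTTP requests above.
my $browser = LWP::UserAgent->new();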



