[BiO BB] Re: pdb-l: Parsing taxonomy from blast output
Kevin Karplus
karplus at soe.ucsc.edu
Fri Apr 1 12:23:40 EST 2005
Here is a piece of a Perl module that we have used for retrieving
taxonomy information:
#
# get_docsum_from_ncbi($query_gi_list)
#
# Queries the NCBI database to retrieve the document summary for each
# gi ID in a comma-delimited list ($query_gi_list).  This document
# summary will be parsed and its values placed into the global hash
# variable, %xml_docsum.  This hash is keyed on the accession number in
# each sequence ID (this is not the same as the gi ID used to retrieve
# the summary).  Each entry in the hash contains a sub-hash with
# name/value pairs from the document summary.  The main value we're
# interested in is 'TaxId', which is the taxonomy ID of the organism
# the sequence comes from.
#
sub get_docsum_from_ncbi($)
{
my($query_gi_list) = @_;
# NCBI will return its document summary in XML format.  Here, we
# set up an XML::Parser object with callback functions that will
# be used to parse the XML.  See the callback functions themselves
# for more information (a sketch of them appears after the "use"
# lines at the end of this message).
# Note that if you call "new XML::Parser(Style => 'Debug')", you
# can see the full structure of the XML document printed to
# STDOUT.  If you do this, you will need to comment out the
# setHandlers method below, since only the default handlers will
# output the debug info.
my $xml_parser = new XML::Parser();
$xml_parser->setHandlers(Start => \&xml_start_handler,
                         Char  => \&xml_char_handler,
                         End   => \&xml_end_handler);
# NCBI provides tools designed for batch programs to interface with
# their databases.  They call them "Entrez Programming Utilities", or
# EUtils for short, and you can find their documentation here:
# http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html
#
# One EUtil is called ESearch, and the following code can use
# ESearch to query NCBI with a list of accession numbers (in the
# form 'term' => '<accession>+OR+<accession>') and return a list
# of gi IDs to be used with other EUtils that only take gi IDs as
# input.  Unfortunately, the search term is limited in length, and
# you can only fit about 20 accession numbers per query.  NCBI's
# usage rules say you can only perform one query every 3 seconds,
# so 10,000 records would take about 25 minutes (500 queries of
# 20 accessions each).  Therefore, I had to drop this idea as a
# solution, but I leave the code here in comments for future
# reference (a chunked-query sketch follows it).
#
#$query_url = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi";
#$response = $browser->post($query_url,
# [ 'db' => 'protein',
# 'term' => $$query_accession_nums_ref, #'NP_828860.1+OR+P05040',
# 'usehistory' => 'n',
# 'tool' => NCBI_TOOL,
# 'email' => NCBI_EMAIL,
# ]);
#if(!$response->is_success) {
# print STDERR "Error '" . $response->status_line .
# "' retrieving url '"
# . $query_url . "'.\n";
# return "";
#}
#
#$xml_parser->parse($response->content);
#chop($query_gi_list); # Cut off trailing comma
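#
# A chunked version of that ESearch query might have looked like the
# following (untested sketch; @accessions and the other names here are
# illustrative, not part of the original module):
#
#for (my $i = 0; $i <= $#accessions; $i += 20) {
#    my $last = $i + 19 <= $#accessions ? $i + 19 : $#accessions;
#    my $term = join('+OR+', @accessions[$i .. $last]);
#    my $chunk_response = $browser->post($query_url,
#                                        [ 'db'    => 'protein',
#                                          'term'  => $term,
#                                          'tool'  => NCBI_TOOL,
#                                          'email' => NCBI_EMAIL,
#                                        ]);
#    # ... pull the gi IDs out of $chunk_response->content here ...
#    sleep(3);   # obey NCBI's one-query-every-3-seconds rule
#}
#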
# Another way to convert accession numbers to gi ids is by using
# "batch entrez". This part of NCBI's web site is made to be used
# by humans, so scripting it and parsing the output is a lot
# harder. However, the advantage of batch entrez over ESearch is
# that it can return up to 10,000 results at a time (at least
# according to my interpretation of the documentation - I never
# tested with that many). Nevertheless, to avoid straining NCBI's
# systems and parsing result data from a human interface that
# might change format, we decided it was better to just use the gi
# ids stored in each a2m file, even though they can become out of
# date over time.
#
# The following code can be adapted to get gi ids using batch
# Entrez.  It was originally written as an attempt to find taxonomy
# ids, before I knew about EUtils.  Note that $$batch_accession_nums_ref
# needs to be set to a newline-delimited list of accession numbers.
# One warning: if you give batch entrez an accession number it can't find,
# it may return an error and no results.  I never got around to handling that.
# Also note that if you need to display the next page of results, or the
# results in a different format, you must extract "WebEnv" and "query_num"
# from the current page and pass them on to the next one.  Each new page
# seems to include new values for these two things.  In other words, one
# WebEnv can't be used to page through a whole set of results; each page
# of results gives you a new WebEnv (see the extraction sketch below).
# The EUtils docs say WebEnv is a key into a server-side cache that can
# hold up to 10,000 records.
#
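# For example, a WebEnv value could be pulled out of each page with a
# regex along these lines (untested sketch; the exact markup of the
# batch entrez pages may differ):
#
#my ($web_env) = $response->content =~ m/WebEnv=([^&"'\s]+)/;
#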
# Use LWP's multipart/form-data POST method to upload a batch
# of accession numbers to look up.
# Documentation for this is in question 10 of the LWP FAQ, here:
# http://groups.google.com/groups?q=perl+lwp+upload+file&hl=en&lr=lang_en&ie=UTF-8&oe=UTF-8&safe=off&selm=8ek1n2%24ej1%241%40cherry.rt.ru&rnum=2
# Better documentation is here: http://www.perldoc.com/perl5.8.0/lib/HTTP/Request/Common.html
# but even those docs are not clear about how to upload file content
# without writing out a file.  The key was found in this post:
# http://groups.google.com/groups?q=lwp+form-data+undef+content&hl=en&lr=lang_en&ie=UTF-8&oe=UTF-8&safe=off&selm=u6594vsb9c2l1bpp6c1rg6tpf1tcs8o8r9%404ax.com&rnum=2
#my $request = POST(
#    'http://www.ncbi.nlm.nih.gov:80/entrez/batchentrez.cgi',
#    Content_Type => 'form-data',
#    Content =>
#    [
#        cmd_current => '',
#        cmd         => 'Retrieve',
#        db          => 'protein',
#        orig_db     => 'nucleotide', # This is probably not needed
#        dispmax     => '10000',      # Untested, but this should make it
#                                     # return up to 10,000 results
#
#        # To fake the upload of a file without writing one out requires
#        # simulating a bunch of extra headers.  To upload a real file is
#        # much simpler, requiring just the one value ['./batch.entrez'].
#        file => [undef, 'batch.entrez',
#                 'Content-Length' => length($$batch_accession_nums_ref),
#                 'Content-Type'   => 'text/plain',
#                 'Content'        => $$batch_accession_nums_ref],
#    ],
#);
#$response = $browser->request($request);
# To find the taxonomy ID for each gi ID, we use the NCBI ESummary utility.
# This utility takes a comma-delimited list of gi IDs and returns summary
# data about each one in XML format.  This summary information
# includes the taxonomy ID, the gi ID, the sequence ID, and some other
# things we don't care to know.  To see what ESummary returns, try visiting
# this URL:
# http://www.ncbi.nih.gov/entrez/eutils/esummary.fcgi?db=protein&id=29837495,114475
#
# Documentation for ESummary is here:
# http://eutils.ncbi.nlm.nih.gov/entrez/query/static/esummary_help.html
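#
# For reference, the returned XML is structured roughly like this
# (abbreviated; field values omitted):
#
# <eSummaryResult>
#   <DocSum>
#     <Id>29837495</Id>
#     <Item Name="Caption" Type="String">...</Item>
#     <Item Name="TaxId" Type="Integer">...</Item>
#     ...
#   </DocSum>
#   ...
# </eSummaryResult>
#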
my $query_url =
    "http://www.ncbi.nih.gov/entrez/eutils/esummary.fcgi";
# Older versions of LWP (like the one on the Condor cluster) don't
# support the post() method, so use the older request(...POST())
# method instead.
#my $response = $browser->post($query_url,
my $response = $browser->request(HTTP::Request::Common::POST($query_url,
    [ 'db'    => 'protein',  # Interestingly, the "protein" and "taxonomy" databases both return the same results
      'id'    => $query_gi_list,
      'tool'  => NCBI_TOOL,  # Let NCBI know what script is accessing ESummary
      'email' => NCBI_EMAIL, # Let NCBI know whom to e-mail if the script is causing problems on their servers
    ]));
if (!$response->is_success) {
    print STDERR "get_docsum_from_ncbi: Error '" .
                 $response->status_line .
                 "' retrieving url '" . $query_url . "'.\n";
    return;
}
# Parse the XML response to our ESummary web query.  The callback
# functions like xml_start_handler will take care of putting all
# the data into the global %xml_docsum hash.
$xml_parser->parse($response->content);
# ESummary is not the only way to get the taxonomy IDs we're looking
# for.  You can also use EFetch through a URL like this:
# http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=11036339,1078661&rettype=native&retmode=xml
# where "11036339,1078661" is the comma-separated list of gi ids
# (although it seems you can also use accession numbers).  The
# problem with this method is that the amount of information it
# returns is enormous, and the time it takes to process and return
# it to you in XML is quite noticeable even on just two records.
# It goes faster using the default non-XML format like this:
# http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=11036339,1078661
# but it is still slow.  However, if you ever need more data than the
# taxonomy ID, this may be the only way to get it.
#
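# An EFetch call from this script might look like the following
# (untested sketch, using the older request() style for the same LWP
# compatibility reason as above):
#
#my $efetch_url = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
#                 . "?db=protein&id=$query_gi_list";
#my $efetch_response =
#    $browser->request(HTTP::Request->new(GET => $efetch_url));
#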
# I also discovered that you can request multiple ids to be displayed
# at the same time through the human "query.fcgi" interface using a
# url like this:
# http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=protein&list_uids=11036339,1078661&dopt=GenPept
# Again, I don't trust that this output will remain stable like the EUtils
# output, and I don't know how many ids you can request at once, and
# this output is probably based on the EFetch backend data anyway
# (so you're probably not saving their servers any work), but it's worth
# looking into. E-mail info at ncbi.nlm.nih.gov if you really want to be
# sure what is most efficient on their end.
}
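For reference, a typical call might look like this (a minimal sketch,
assuming the %xml_docsum structure described in the header comment; the
gi IDs are the example IDs from the ESummary URL above):

get_docsum_from_ncbi("29837495,114475");
foreach my $accession (keys %xml_docsum) {
    print "$accession\t$xml_docsum{$accession}{'TaxId'}\n";
}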
------------------------------------------------------------
This code is not complete---you may need some of the following packages:
------------------------------------------------------------
# LWP (Library for WWW in Perl) is used to retrieve documents
# from web servers. See
# http://www.perl.com/pub/a/2002/08/20/perlandlwp.html for documentation.
use HTTP::Request::Common;
use LWP::UserAgent;
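# The code above also assumes an LWP::UserAgent object in $browser and
# the NCBI_TOOL/NCBI_EMAIL constants.  A minimal setup might look like
# this (the constant values are placeholders, not the originals):
use constant NCBI_TOOL  => 'taxonomy_fetcher';   # placeholder tool name
use constant NCBI_EMAIL => 'you@example.com';    # placeholder address
my $browser = LWP::UserAgent->new();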
# There are a bewildering number of modules written to parse XML,
# but I think the following advice found on a newsgroup made the
# most sense:
#
# Go to http://search.cpan.org/ and search for "XML".
# Have a look at XML::DOM, XML::Grove, XML::Twig, and maybe
# XML::Simple, if you don't mind the uglyish data structures it
# exposes to the user. XML::QL might be of interest, even though
# it is still immature.
# If you are not interested in a tree, but want to just pull out
# elements while parsing the file, try XML::Node, or XML::Parser
# directly.
#
# Based on this advice, I have chosen XML::Parser, a standard
# module which works efficiently for what we need here.  We
# don't need to convert the whole document into a memory-based
# tree structure; we just need to pull out a few values.  If
# you do need a tree structure, I would lean towards XML::Grove
# from the list above, though I have little to base that opinion
# on.
# Note that I'm not too happy with the hoops I had to jump through to
# use XML::Parser's event-driven methodology.  The code is not so easy
# to understand.
use XML::Parser;
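# The callback functions referenced in the module (xml_start_handler,
# xml_char_handler, xml_end_handler) are not included in this excerpt.
# A minimal sketch of what they might look like, assuming ESummary's
# usual <DocSum>/<Item Name="..."> layout and keying each docsum on its
# "Caption" item (which holds the accession number), is:

our %xml_docsum;        # global docsum hash described above
my $current_item;       # name of the <Item> currently being parsed
my %current_docsum;     # name/value pairs for the current <DocSum>

sub xml_start_handler
{
    my ($expat, $element, %attrs) = @_;
    if ($element eq 'DocSum') {
        %current_docsum = ();
    } elsif ($element eq 'Item') {
        $current_item = $attrs{'Name'};
    }
}

sub xml_char_handler
{
    my ($expat, $string) = @_;
    $current_docsum{$current_item} .= $string if defined $current_item;
}

sub xml_end_handler
{
    my ($expat, $element) = @_;
    if ($element eq 'Item') {
        $current_item = undef;
    } elsif ($element eq 'DocSum') {
        # Key the finished document summary on its accession number.
        $xml_docsum{$current_docsum{'Caption'}} = { %current_docsum };
    }
}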