[Biophp-dev] Looking for comments - BioPHP "URLs"?

S Clark biophp-dev@bioinformatics.org
Wed, 2 Jul 2003 19:53:16 -0600


Okay, I've just gotten done committing some updates to CVS - both at 
sourceforge (where the web interface to see things is still un-updated, but
the files are there in the "real" CVS) and at bioinformatics.org.

One of these updates is that seqIOimport now attempts to guess the filetype
based on the filename...or conceivably a URL
 ("ftp://somebioinformaticsdataserver.net/pub/genbankdata/somedata.gb" will
now assume the data needs to be sent to the genbank parser without having
to open and read the file to detect the type, then reconnect to download
it...).

(If the filename doesn't suggest an obvious type, it still falls back to the
file-content-based detection as before.)
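
To make the order of checks concrete, here is a minimal sketch in Python (not
the actual BioPHP/seqIOimport code). The extension table and the names
guess_format, detect_format and sniff_content are made up for illustration;
the only thing taken from the above is "try the filename first, then fall back
to looking at the content".

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical extension-to-format table; the real seqIOimport mapping may differ.
EXTENSION_FORMATS = {
    ".gb": "genbank",
    ".fasta": "fasta",
    ".fa": "fasta",
    ".phy": "phylip",
}


def guess_format(name):
    """Return a format guessed from the extension, or None if unknown."""
    path = urlsplit(name).path  # drops any scheme, host, and query string
    extension = posixpath.splitext(path)[1].lower()
    return EXTENSION_FORMATS.get(extension)


def detect_format(name, sniff_content):
    """Try the filename first; fall back to content-based detection."""
    return guess_format(name) or sniff_content(name)
```

Here sniff_content stands in for whatever content-based detector already
exists; it is only called when the name gives no hint, so a remote file with a
recognizable extension never has to be opened just to find out its type.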

In addition to making it possible to autodetect some data without necessarily
having to open a network connection twice when the file's on FTP or HTTP, it
also opens up the possibility of collecting some of the less obvious imports 
and exports into the same interface as the "obvious" ones.

Some possible examples (making these up off the top of my head):

efetch://www.ncbi.nlm.nih.gov/?db=nucleotide&id=24653663,1234567

biosql://biosql-user:biosql-pass@localhost:3306/?table=bioentry&biodatabase_id=1234&name=some%20sequence&accession=654321

Using a scheme sort of like this, it ought to be easy and relatively
intuitive to incorporate quite a lot of different types of data sources into
import and export, transparently for people using the code.

One might then, for example, import:

n_blast://www.ncbi.nlm.nih.gov/?descriptions=10&query=AGCTTCGAAACCTTGG&database=nr&program=blastn

to import the (up to) 10 best matches in NCBI's non-redundant database as
determined by their online blastn search, then export those matches to

ftp://some_server/pub/data_for_phylogenetic_analysis/somequery.phy

Looking for comments and suggestions...my basic idea (currently based on a 
very cursory glance at RFC 1738 [ http://www.w3.org/Addressing/rfc1738.txt ])
is that the URLs would all be constructed something like:
(interface)://(user)@(host):(port)/(filename)?(parameters)

"file:/" would be understood - if the interface part isn't included, the
name just means a local file.

"parameters" could be used to set optional parameters in regular files where
applicable (e.g. phylip-format files with fastDNAml options).

"host" and "filename" would sometimes have a default (as in the case of the
web interface to NCBI's blast server).
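
A rough sketch of how such a string could be pulled apart, again in Python
rather than BioPHP code. The function name parse_source and the DEFAULT_HOSTS
table are hypothetical. One thing to be aware of: RFC 1738 only allows
letters, digits, "+", "." and "-" in a scheme name, so "n_blast" is not
strictly a valid scheme, and Python's urlsplit will not recognise it as one.
The sketch therefore splits off the interface by hand and lets urlsplit handle
only the rest.

```python
from urllib.parse import parse_qsl, urlsplit

# Hypothetical per-interface default host, used when the URL leaves it out.
DEFAULT_HOSTS = {"n_blast": "www.ncbi.nlm.nih.gov"}


def parse_source(url):
    """Split a '(interface)://...' string into its named parts."""
    interface, sep, rest = url.partition("://")
    if not sep:
        # No "(interface)://" prefix: treat the string as a local file.
        filename, _, query = url.partition("?")
        return {"interface": "file", "user": None, "password": None,
                "host": None, "port": None, "filename": filename,
                "parameters": dict(parse_qsl(query))}
    interface = interface.lower()
    # Parse the remainder as a scheme-less "//user@host:port/path?query".
    parts = urlsplit("//" + rest)
    return {
        "interface": interface,
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname or DEFAULT_HOSTS.get(interface),
        "port": parts.port,
        "filename": parts.path.lstrip("/"),
        "parameters": dict(parse_qsl(parts.query)),
    }
```

With this, the biosql example above would come back with user "biosql-user",
host "localhost", port 3306 and a parameters dict in which "some%20sequence"
has been decoded to "some sequence". Repeated parameter names would collapse
to the last value here; whether that is acceptable is one of the things a
'standard' would have to settle.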

Some possible "interfaces" in addition to the obvious http and ftp:

efetch: - for EFetch of sequences
n_blast: - sequences from blast search via ncbi web interface
l_blast: - sequences from blast search via frontend to local blast executable
biosql: - once we've got an interface to the BioSQL schema written
clustal: - sequences returned from a run of clustalw through a frontend
file: - "default" interface for local files (really would only need to be
specified if you actually wanted a file NAMED "http://something.txt" or
something similarly confusing...)
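
One way the list of interfaces could hang together is a simple lookup table
from interface name to reader, with "file" as the fallback. The following is
only a Python sketch of that idea: the READERS registry, the register
decorator and the placeholder readers are all invented for illustration and do
not correspond to existing BioPHP code.

```python
# Hypothetical registry: interface name -> function that fetches sequences.
READERS = {}


def register(interface):
    """Decorator that files a reader function under an interface name."""
    def decorator(func):
        READERS[interface] = func
        return func
    return decorator


@register("file")
def read_file(source):
    return "local file: " + source  # placeholder for a real file reader


@register("efetch")
def read_efetch(source):
    return "EFetch request: " + source  # placeholder for a real EFetch client


def import_sequences(source):
    """Pick a reader from the '(interface)://' prefix, defaulting to 'file'."""
    interface, sep, _ = source.partition("://")
    if not sep:
        interface = "file"
    try:
        reader = READERS[interface]
    except KeyError:
        raise ValueError("no reader registered for interface: " + interface)
    return reader(source)
```

Adding a new source (say l_blast or clustal) would then only mean registering
one more reader, without touching the import call that users see.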

Thoughts?  If anyone besides me thinks this might be useful, and we can come
up with a 'standard' way of constructing the URLs, it might make using BioPHP
even more intuitive and easy.