Okay, I've just finished committing some updates to CVS, both at SourceForge (where the web interface still hasn't caught up, but the files are there in the "real" CVS) and at bioinformatics.org.

One of these updates is that seqIOimport now attempts to guess the filetype from the filename, or conceivably from a URL: "ftp://somebioinformaticsdataserver.net/pub/genbankdata/somedata.gb" will now be assumed to need the genbank parser, without having to open and read the file to detect the type and then reconnect to download it. (If the filename doesn't suggest an obvious type, it still falls back to the file-content-based detection as before.)

In addition to making it possible to autodetect some data without having to open a network connection twice when the file is on FTP or HTTP, this also opens up the possibility of collecting some of the less obvious imports and exports into the same interface as the "obvious" ones. Some possible examples (making these up off the top of my head):

  efetch://www.ncbi.nlm.nih.gov/?db=nucleotide&id=24653663,1234567
  biosql://biosql-user:biosql-pass@localhost:3306/?table=bioentry&biodatabase_id=1234&name=some%20sequence&accession=654321

A scheme along these lines ought to make it easy and relatively intuitive to incorporate quite a lot of different types of data sources into import and export, transparently for people using the code.
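For anyone who wants to picture the filename-based guess, here is a rough sketch. It is written in Python purely for illustration and is not seqIOimport's actual code: the function name guess_format and the extension table are made up, and a None result stands in for "fall back to content-based detection".

```python
import os
from urllib.parse import urlsplit

# Hypothetical extension-to-format table (an assumption for this sketch,
# not the mapping seqIOimport really uses).
EXTENSION_FORMATS = {
    ".gb": "genbank",
    ".gbk": "genbank",
    ".fasta": "fasta",
    ".fa": "fasta",
    ".phy": "phylip",
    ".aln": "clustal",
}


def guess_format(source):
    """Guess a format name from a filename or URL without opening it.

    Returns None when the extension is not recognised, in which case the
    caller would fall back to inspecting the file contents.
    """
    # For a URL this keeps only the path part; a plain filename passes
    # through unchanged.
    path = urlsplit(source).path
    extension = os.path.splitext(path)[1].lower()
    return EXTENSION_FORMATS.get(extension)


if __name__ == "__main__":
    url = "ftp://somebioinformaticsdataserver.net/pub/genbankdata/somedata.gb"
    print(guess_format(url))
    print(guess_format("mystery.dat"))
```

The point is only that the decision can be made from the name alone, so a remote file needs just one connection.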
One might then, for example, import:

  n_blast://www.ncbi.nlm.nih.gov/?descriptions=10&query=AGCTTCGAAACCTTGG&database=nr&program=blastn

to get the (up to) 10 best matches in NCBI's non-redundant database as determined by their online blastn search, then export those sequences to:

  ftp://some_server/pub/data_for_phylogenetic_analysis/somequery.phy

Looking for comments and suggestions... my basic idea (currently based on a very cursory glance at RFC 1738 [ http://www.w3.org/Addressing/rfc1738.txt ]) is that the URLs would all be constructed something like:

  (interface)://(user)@(host):(port)/(filename)?(parameters)

"file://" would be understood: if the interface part isn't included, it just means a local file.

"parameters" could be used to set optional parameters in regular files where applicable (e.g. phylip-format files with fastDNAml options).

"host" and "filename" would sometimes have a default (as in the case of the web interface to NCBI's blast server).

Some possible "interfaces" in addition to the obvious http and ftp:

  efetch:   - for EFetch of sequences
  n_blast:  - sequences from blast search via NCBI web interface
  l_blast:  - sequences from blast search via frontend to local blast executable
  biosql:   - once we've got an interface to the BioSQL schema written
  clustal:  - sequences returned from a run of clustalw through a frontend
  file:     - "default" interface for local files (really would only need to be specified if you actually wanted a file NAMED "http://something.txt" or something similarly confusing...)

Thoughts? If anyone besides me thinks this might be useful, and we can come up with a 'standard' way of constructing the URLs, it might make using BioPHP even more intuitive and easy.
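To make the proposed layout a bit more concrete, here is a minimal sketch of splitting such a URL into its parts. It is Python for illustration only, not BioPHP code: parse_source, DEFAULT_HOSTS and the returned field names are all hypothetical. The interface name is split off by hand because RFC 1738 only allows letters, digits, "+", "." and "-" in scheme names, so a standard URL parser may not accept names like "n_blast" as a scheme.

```python
from urllib.parse import parse_qsl, urlsplit

# Hypothetical per-interface default hosts (an assumption for this sketch).
DEFAULT_HOSTS = {
    "efetch": "www.ncbi.nlm.nih.gov",
    "n_blast": "www.ncbi.nlm.nih.gov",
}


def parse_source(url):
    """Split (interface)://(user)@(host):(port)/(filename)?(parameters)."""
    if "://" not in url:
        # No interface given: treat it as a local file, optionally
        # followed by ?parameters.
        filename, _, query = url.partition("?")
        return {
            "interface": "file", "user": None, "password": None,
            "host": None, "port": None, "filename": filename,
            "parameters": dict(parse_qsl(query)),
        }

    interface, _, rest = url.partition("://")
    # Re-parse the remainder as a scheme-less network location so the
    # standard library handles user, host, port, path and query.
    parts = urlsplit("//" + rest)
    return {
        "interface": interface.lower(),
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname or DEFAULT_HOSTS.get(interface.lower()),
        "port": parts.port,
        "filename": parts.path.lstrip("/"),
        # parse_qsl also decodes escapes such as %20.
        "parameters": dict(parse_qsl(parts.query)),
    }


if __name__ == "__main__":
    example = ("biosql://biosql-user:biosql-pass@localhost:3306/"
               "?table=bioentry&biodatabase_id=1234"
               "&name=some%20sequence&accession=654321")
    for key, value in parse_source(example).items():
        print(key, "=", value)
```

An import or export routine could then dispatch on the "interface" field and hand the remaining fields to the matching frontend.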