[Pipet Devel] Data Storage Interfaces
Humberto Ortiz Zuazaga
hortiz at neurobio.upr.clu.edu
Mon Jun 14 15:16:34 EDT 1999
justin at ukans.edu said:
> That's a good point. We haven't considered slow networks or very
> large files.
> I have, and that's my problem with 1) how Paos passes objects -- it
> sends the whole thing. I would prefer just sending updates. Breaking
> up the data into linked objects could be an adequate compromise.
Paos makes me nervous too. It looks complex, and I can't see what it buys us
over CORBA. ORBit is already a standard part of GNOME, and we may as well
leverage as much as we can from other efforts.
> 2) the independently roaming object concept where it's passed directly
> from tool to tool. Without a "home" everything has to be passed, and by
> the end of a complex series, that could be a large object.
> I'm beginning to think the optimal solution is a virtual interface (or
> set of optional interfaces) across all junctions. It's the most
> efficient (only what the receiving end wants is sent [and only the
> receiving end really knows what it wants]).
So, data objects have a URI, and a locus can request the data it needs by URI.
The local locid can fetch remote data objects and cache them. Each part of a
pipeline of loci can request only the data objects it needs. Your local
locus requests the results that it wants, and only those, and
displays them for you. This way only the necessary data objects need be
transferred.
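
To make the fetch-by-URI idea concrete, here is a minimal sketch of what a
caching locid might do. Everything in it is an assumption for illustration:
the `LocidCache` class, the `pipet://` URI scheme, and the toy `REMOTE` store
are invented here and are not part of any existing Pipet or Paos interface.

```python
from typing import Callable, Dict


class LocidCache:
    """Hypothetical local cache: fetch each data object by URI at most once."""

    def __init__(self, fetch_remote: Callable[[str], bytes]) -> None:
        self._fetch_remote = fetch_remote  # stand-in for the real transport
        self._cache: Dict[str, bytes] = {}
        self.remote_fetches = 0            # how many objects crossed the network

    def get(self, uri: str) -> bytes:
        # Only go to the network when the object is not already cached.
        if uri not in self._cache:
            self._cache[uri] = self._fetch_remote(uri)
            self.remote_fetches += 1
        return self._cache[uri]


# Toy "remote" store; the URIs are made up for illustration.
REMOTE = {
    "pipet://blast.example.org/results/42": b"hit list",
    "pipet://blast.example.org/log/42": b"very long log",
}

locid = LocidCache(REMOTE.__getitem__)
first = locid.get("pipet://blast.example.org/results/42")
second = locid.get("pipet://blast.example.org/results/42")  # served from cache
```

Only the one object that was asked for is transferred, and it is transferred
once; the log that nobody requested never leaves the remote side.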
Imagine a service that annotates a BLAST search. Your locus sends the sequence
data to the BLAST server, and the BLAST server sends the matching GenBank UIDs
to the annotation server. The annotation server may have a local copy of
GenBank, and gets the sequences from there. It then sends the UIDs and the
feature annotations back to your local locus, which may have to fetch some of
the UIDs from GenBank, then applies the annotations and displays the result.
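
The hops in that example could be sketched roughly as follows. This is a toy
simulation, not a real BLAST or GenBank client: the UIDs, sequences,
annotations, and function names are all made up, and "matching" is just a
substring test. The point is only to show what travels on each hop.

```python
# All data below is invented for illustration.
GENBANK = {"U00001": "ACGTACGT", "U00002": "TTGACCAA"}   # uid -> sequence
ANNOTATIONS = {"U00001": ["exon 1..4"], "U00002": ["promoter 1..3"]}


def blast_server(query):
    """Return only the UIDs of matching entries, not the sequences."""
    return [uid for uid, seq in sorted(GENBANK.items()) if query in seq]


def annotation_server(uids):
    """Works from its own GenBank copy; replies with UIDs and features only."""
    return {uid: ANNOTATIONS[uid] for uid in uids}


def local_locus(query, local_cache):
    uids = blast_server(query)
    features = annotation_server(uids)
    fetched = []
    for uid in uids:
        if uid not in local_cache:
            local_cache[uid] = GENBANK[uid]  # stand-in for a fetch from GenBank
            fetched.append(uid)
    # Apply the annotations to the sequences for display.
    display = {uid: (local_cache[uid], features[uid]) for uid in uids}
    return display, fetched


cache = {"U00001": "ACGTACGT"}       # one sequence is already held locally
display, fetched = local_locus("AC", cache)
```

In this run only the sequence that was missing locally has to be fetched;
everything else that crosses a junction is UIDs and annotations.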
> It's completely language
> independent, as well as "junction" independent (each end has a standard
> interface, regardless of whether a C, Python, or Perl script is on the
> other end, or whether the two are communicating via CORBA, TCP/IP,
> UDP/IP, shared memory, a pipe, a dynamically-loaded plug-in interface).
This sounds good, and can help make sure we don't overcommit to Paos. We just
need a simple way of communicating between loci: "here's this data, please run
foo v2 on it", "have your results, formatted for bar v1".
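
As a rough idea of how small such messages could be, here is a sketch of the
two messages as plain records with a JSON encoding. The record names, field
names, URIs, and the choice of JSON are all assumptions made for this example;
any transport (CORBA, a pipe, a socket) could carry the same content.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class RunRequest:
    """'Here's this data, please run foo v2 on it.'"""
    data_uri: str
    tool: str
    tool_version: int


@dataclass
class ResultReply:
    """'Have your results, formatted for bar v1.'"""
    result_uri: str
    output_format: str
    format_version: int


KINDS = {"RunRequest": RunRequest, "ResultReply": ResultReply}


def encode(msg) -> str:
    # Tag the message with its type so the receiver knows how to decode it.
    return json.dumps({"type": type(msg).__name__, "body": asdict(msg)})


def decode(text: str):
    envelope = json.loads(text)
    return KINDS[envelope["type"]](**envelope["body"])


request = RunRequest("file:///data/seqs.fasta", "foo", 2)
reply = ResultReply("pipet://hub.example.org/results/1", "bar", 1)
wire = encode(request)
```

Note that both messages carry URIs rather than the data itself, so the
receiving locus decides what it actually needs to pull.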
> This interface method requires a home location where the object
> resides throughout its processing life-time. This is what I had
> envisioned the work flow system to be (i.e. coordinating its various
> objects, where and when they connected, etc). This could be located on
> the client machine, and it allows the various other loci to be really
> dumb (which means small).
Data objects can be identified by URIs, with special URIs for data on a local
disk (the locid will need some way to service requests for your local
data, possibly from multiple loci).
But now say we want to run a five-step pipeline on 2 GB worth of genomic
sequences. Each of the five loci may want a copy of the sequence, which means
our machine will send the file five times. Try that over a modem!
Caching at loci hubs can help solve this problem.
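
A back-of-the-envelope calculation shows the scale of the problem. The numbers
are assumptions: a 56 kbit/s modem with no protocol overhead, 1 GB taken as
10**9 bytes, and a single hub that all five loci can reach, so that with
caching the file leaves our machine only once.

```python
def transfer_hours(size_bytes: float, link_bits_per_s: float, copies: int = 1) -> float:
    """Hours needed to send `copies` copies of a file over a link."""
    return copies * size_bytes * 8 / link_bits_per_s / 3600


SIZE = 2e9       # 2 GB of sequence data, with 1 GB = 10**9 bytes
MODEM = 56_000   # assumed 56 kbit/s modem, ignoring overhead

no_cache = transfer_hours(SIZE, MODEM, copies=5)        # each locus gets its own copy
with_hub_cache = transfer_hours(SIZE, MODEM, copies=1)  # hub caches the single upload

print(f"without caching: {no_cache:.0f} h, with hub caching: {with_hub_cache:.0f} h")
```

Under these assumptions a single copy already takes about 79 hours, and five
copies take five times that, so the hub cache is the difference between a
few days and a couple of weeks.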
--
Humberto Ortiz Zuazaga
Bioinformatics Specialist
Institute of Neurobiology
hortiz at neurobio.upr.clu.edu