[Biococoa-dev] Design question

Tue Aug 10 14:57:13 EDT 2004

> Given easy methods to convert between the array and strings, it would
> allow us to code all the methods using whichever format is easier.  And
> I'm all for easy to make methods....

So am I ;-) But there's one caveat: I personally think we should treat 
the singleton base sequence as the "native" format for our sequence 
class and throughout the framework. That means the 
stringRepresentation is merely a way to give users a string back in the 
end, while internally all methods should work with, and be optimized 
for, the singleton base classes. I outlined the disadvantages of the 
string-based approach (like the validation problem), and it would be a 
pity to keep having to watch for those caveats when we have such a nice 
system around. I hope that elegant and strong foundation classes based 
on the singletons will almost completely remove the need for the 
strings world ;-)
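
To make the "native format" idea concrete, here is a minimal sketch in Python (not Objective-C, and not the actual BioCocoa classes). The names Base, Sequence and string_representation are assumptions made up for illustration, and the alphabet is limited to A, C, G and T. It shows one shared object per base, validation happening once at construction, and the string form produced only on request.

```python
class Base:
    """One shared (singleton) instance per nucleotide symbol.

    Hypothetical stand-in for the singleton base classes discussed above.
    """
    _instances = {}  # symbol -> the single Base object for that symbol

    def __new__(cls, symbol):
        symbol = symbol.upper()
        # Validation happens here, once, instead of in every string method.
        if len(symbol) != 1 or symbol not in "ACGT":
            raise ValueError("not a valid base: %r" % symbol)
        if symbol not in cls._instances:
            obj = super().__new__(cls)
            obj.symbol = symbol
            cls._instances[symbol] = obj
        return cls._instances[symbol]


class Sequence:
    """Stores base objects internally; strings exist only at the edges."""

    def __init__(self, text):
        self.bases = [Base(ch) for ch in text]

    def string_representation(self):
        # Convenience for users who want a string back in the end.
        return "".join(b.symbol for b in self.bases)


seq = Sequence("gattaca")
print(seq.string_representation())
```

Because every "A" in the sequence is the same object, comparisons can use identity, and an invalid character is rejected when the sequence is built rather than later.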

>> Yesterday I was still thinking a bit more about the two options I
>> presented, and indeed the modification dictionary seems the best way
>> to go. I think it's a very nice approach to keep this in a similar way
>> as for instance the genbank records show features associated with the
>> sequence. I believe John also mentioned something about this. The
>> hierarchy would be something along the lines of a dictionary
>> containing BCAnnotation objects (biojava does this as well), that
>> would describe the positions in simple NSRanges and the type perhaps
>> as BCFunctionalGroup objects. One of the problems will be to keep the
>> system such that new (for us unknown) modifications/features are
>> easily added...
> I had thought an array of NSDictionary like objects, each a BCFeature
> (or BCAnnotation) would be easier.  The key thing would be to have a
> unique ID set when a feature is added, so the user is shielded from
> naming conflicts (they could add as many things named "ORF" as they
> want).  This would also allow a feature to point to a separate sequence
> within a bundle of sequences - ie, the amino acid sequence of that ORF.

You're right, an array filled with BCFeature or BCAnnotation objects 
would be very nice; this way we avoid namespace collisions. I'm not sure 
we actually need an additional layer of dictionaries in between, but 
perhaps I missed the point here. A BCFeature would then contain further 
info on the type/contents of the modification, plus the range over which 
the feature extends.
For the reference to other sequences we should come up with something 
like bundle identifiers that uniquely identify sequences inside a 
bundle, but that is a problem for later, when we devise the file format 
we plan to use (I think BioJava uses URIs here, see link below).
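
A rough sketch of the array-of-features idea, written in Python for brevity. The Feature class, its field names, and the "protein-2" reference are hypothetical; the real BCFeature would use an NSRange and whatever bundle identifier scheme we settle on. Each feature gets a unique ID when it is created, so two features can share the name "ORF" without colliding.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Feature:
    """Hypothetical stand-in for BCFeature."""
    name: str                            # user-visible, need not be unique
    kind: str                            # type/contents of the modification
    start: int                           # range start (like NSRange.location)
    length: int                          # range length (like NSRange.length)
    sequence_ref: Optional[str] = None   # assumed bundle identifier, if any
    # Unique ID assigned automatically when the feature is created.
    feature_id: str = field(default_factory=lambda: uuid.uuid4().hex)


def find_by_id(features, feature_id):
    """Linear lookup in the feature array; returns None if not found."""
    for f in features:
        if f.feature_id == feature_id:
            return f
    return None


# Two features with the same name live side by side in a plain list.
features = [
    Feature("ORF", "coding region", start=10, length=90),
    Feature("ORF", "coding region", start=200, length=150,
            sequence_ref="protein-2"),
]
```

Lookup by ID is a simple loop here; if that ever becomes a bottleneck, an index keyed on feature_id could be kept next to the array without changing the file layout.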

> Either way works, but I'd thought that an array as the root feature
> object had more parallels with other sequence file formats (ie -
> NCBI's) and having a regular, repeating structure would make the
> native file format a bit more readable.
I fully agree.

> The flipside is that looking up a specific object in a dictionary
> would be much simpler to code.  Maybe a vote on this is in order?
Well, I guess the enumerators available for arrays are sufficient here. 
It's worth mentioning again that we can build sort methods into the 
BCFeature class (as I showed in the attached headers last time), which 
allow sorting on type, name, length, position, etc.
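
As an illustration of sorting a feature array on different fields, here is a small Python sketch. It is not the code from the attached headers; the Feature tuple, its field names, and the sample entries are invented for the example.

```python
from collections import namedtuple

# Hypothetical stand-in for BCFeature: type, name, start position, length.
Feature = namedtuple("Feature", ["kind", "name", "start", "length"])

features = [
    Feature("ORF", "lacZ", 120, 300),
    Feature("promoter", "T7", 20, 19),
    Feature("restriction site", "BamHI", 95, 6),
]


def sorted_features(features, key):
    """Return a new list sorted on one field: kind, name, start or length."""
    if key not in Feature._fields:
        raise ValueError("unknown sort key: %r" % key)
    return sorted(features, key=lambda f: getattr(f, key))


by_position = sorted_features(features, "start")
by_length = sorted_features(features, "length")
```

With sorting plus plain enumeration, finding a specific feature in an array stays simple enough that a dictionary at the root is not strictly needed.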

> One thing I'd argue for is an enumeration of defined feature types.
> The user should be free to create their own, but there are huge
> advantages of a set of non-custom ones.  Imagine being able to search
> an institute wide plasmid collection for everything with a Vertebrate
> promoter, protein tag, and unique BamHI site....
True, we should try to keep a defined set. Perhaps we need an 
intermediate category level here, like restriction enzyme or 
structure type, with defined subtypes like BamHI for the first and 
helix or beta strand for the second. Let's see how far we can get; a 
plist with proposed categories and subtypes might be an option here. 
Keep in mind that the list might be pretty long, as it would already 
contain at least 700 restriction enzymes.
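
To sketch what a category/subtype plist and a search over it could look like, here is a Python example using the standard plistlib module. The plist contents, the plasmid names and the helper functions are all made up for illustration. It only checks that a feature is present and does not model the "unique site" condition from John's example.

```python
import plistlib

# Hypothetical plist: category -> list of defined subtypes.
PROPOSED_TYPES_PLIST = b"""<?xml version="1.0" encoding="UTF-8"?>
<plist version="1.0">
<dict>
    <key>restriction enzyme</key>
    <array><string>BamHI</string><string>EcoRI</string></array>
    <key>structure type</key>
    <array><string>helix</string><string>beta strand</string></array>
    <key>promoter</key>
    <array><string>vertebrate</string><string>bacterial</string></array>
</dict>
</plist>"""

DEFINED_TYPES = plistlib.loads(PROPOSED_TYPES_PLIST)


def is_defined(category, subtype):
    """True for a predefined type; custom types are allowed but return False."""
    return subtype in DEFINED_TYPES.get(category, [])


def plasmids_with(collection, wanted):
    """Names of plasmids carrying every wanted (category, subtype) pair."""
    return [name for name, feats in collection.items()
            if set(wanted) <= set(feats)]


# Made-up collection: plasmid name -> (category, subtype) pairs it carries.
collection = {
    "pExampleA": [("promoter", "vertebrate"), ("restriction enzyme", "BamHI")],
    "pExampleB": [("promoter", "bacterial"), ("restriction enzyme", "BamHI")],
}
hits = plasmids_with(collection, [("promoter", "vertebrate"),
                                  ("restriction enzyme", "BamHI")])
```

Because the defined types live in a data file instead of code, extending the list (for example with the full set of restriction enzymes) would not require changes to the framework.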

>
>> Another thought I would like you to comment on is the addition of a
>> "history/editing dictionary" which keeps track of who added/edited a
>> sequence and when/what things were edited. In general, I think it
>> would be nice if we would go for the "non-destructive editing
>> approach" wherever possible. My would-be Biococoa based
>> DNAStrider-like app would for instance allow the user to cut and
>> paste fragments and vectors, and it would be very nice if much of the
>> editing could always be undone, and the original sequence could
>> always be viewed. Think along the lines of a modern video editing
>> approach, the files are unchanged, only the displayed parts are
>> changed. This could save a lot of memory/disk reusal/writing as well.
>> Of course there must be methods to "crop" your file as it has no use
>> to keep a complete genome around if you're only interested in one
>> gene right...
> As you point out, the danger here would be that we'd have to guess in
> advance the information content that would best suit the user.
> Permanent undo's are also out of keeping with most AppKit design
> practices, where the UndoManager doesn't survive application quits.
> I'm all for keeping an internal Undo list in each sequence object and
> allowing that to transfer with drag/drop actions and such, but I'm
> hesitant about writing it to disk. Something like that might be better
> implemented on a per-program basis, rather than at the root of
> BioCocoa.

You're right John, that would be a real option. I guess a better 
solution, and something we should suggest, is that a developer creates 
program-specific BCFeatures, like a fragment feature or cut vector 
feature. This way he can easily add features that can be saved, and 
also preserve the original content if he likes (but display only the 
feature part).
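
A minimal sketch of such a program-specific fragment feature, in Python. FragmentFeature, SequenceRecord, displayed and crop are hypothetical names, not proposed API. The original sequence stays untouched; the feature only decides which part is shown, and crop() is the explicit, destructive step for when the rest is no longer needed.

```python
from dataclasses import dataclass, field


@dataclass
class FragmentFeature:
    """Program-specific feature marking the part of the sequence to display."""
    start: int
    length: int


@dataclass
class SequenceRecord:
    original: str                            # never changed by editing
    features: list = field(default_factory=list)

    def displayed(self):
        """Return the visible part; the most recent fragment feature wins."""
        for f in reversed(self.features):
            if isinstance(f, FragmentFeature):
                return self.original[f.start:f.start + f.length]
        return self.original

    def crop(self):
        """Destructive: keep only the displayed part and drop the fragments."""
        visible = self.displayed()
        self.original = visible
        self.features = [f for f in self.features
                         if not isinstance(f, FragmentFeature)]


rec = SequenceRecord("GATTACAGATTACA")
rec.features.append(FragmentFeature(start=2, length=5))
print(rec.displayed())   # the fragment only
print(rec.original)      # still the full sequence
```

Removing the fragment feature from the list brings the full sequence back into view, which gives a simple form of undo that can also be saved with the file.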
We should keep in mind the whole upcoming metadata story for Tiger; 
things like created by, created at, etc. should be added to our 
files. Do we create these as BCFeatures? I guess not... So we also need 
"file-wide" features that do not necessarily point to a specific part 
of the sequence, but to the whole. I propose a separate array 
containing BCFileAnnotations or something similar; this way each 
program can again add specific features to the file, which leaves plenty 
of room for both sequence-specific and file-specific info. We 
should stress in the docs that developers should expect to encounter 
non-biococoa-defined features/annotations in the files.
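
Roughly, a file could then hold two separate arrays, as in this Python sketch. All keys, type names and values here are invented for illustration and are not a proposed file format. It also shows one way a program could cope with features it does not recognise: set them aside and keep them, instead of discarding them.

```python
# Feature types this (hypothetical) program understands.
KNOWN_FEATURE_TYPES = {"ORF", "restriction site"}

record = {
    "sequence": "GATTACA",
    # Sequence-specific features: each one carries a range.
    "features": [
        {"type": "ORF", "start": 0, "length": 6},
        {"type": "com.example.otherapp.cutVector", "start": 2, "length": 3},
    ],
    # File-wide annotations: they describe the whole file, so no range.
    "file_annotations": [
        {"key": "created by", "value": "example user"},
        {"key": "created at", "value": "2004-08-10"},
    ],
}


def split_features(features, known=KNOWN_FEATURE_TYPES):
    """Separate understood features from foreign ones without dropping any."""
    understood = [f for f in features if f["type"] in known]
    foreign = [f for f in features if f["type"] not in known]
    return understood, foreign


understood, foreign = split_features(record["features"])
```

A program that writes the file back should include the foreign list again, so that features added by other applications survive a round trip.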

Just a few extra remarks / things to remember:
How BioJava tackles this problem is similar to what we propose and can 
be found here:
http://www.biojava.org/tutorials/chap2.html

I also read something about a "biofoundation-wide" standard of storing 
and accessing sequence data, here: http://obda.open-bio.org/  It also 
allows access to data stored using BioSQL. At the moment I have no clue 
what it means, but perhaps one of you knows more. Anyway, I don't think 
that has much priority right now.

Do you guys think we should, like the Cocoa frameworks, provide both 
immutable and mutable variants of our classes, or are all our objects 
mutable by definition? The immutable variants might allow further 
optimization, since they never need to change.
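
To illustrate the immutable/mutable split and the kind of optimization it could allow, here is a Python sketch loosely modelled on the NSString/NSMutableString pattern. The class and method names are assumptions. The immutable variant can cache its string form because its contents never change; the mutable variant has to rebuild it on each request.

```python
class Sequence:
    """Immutable variant: contents are fixed at creation."""

    def __init__(self, symbols):
        self._symbols = tuple(symbols)
        self._cached_string = None

    def __len__(self):
        return len(self._symbols)

    def string_representation(self):
        # Safe to cache: the symbols can never change.
        if self._cached_string is None:
            self._cached_string = "".join(self._symbols)
        return self._cached_string

    def mutable_copy(self):
        return MutableSequence(self._symbols)


class MutableSequence(Sequence):
    """Mutable variant: editable, so nothing is cached."""

    def __init__(self, symbols):
        self._symbols = list(symbols)

    def append(self, symbol):
        self._symbols.append(symbol)

    def string_representation(self):
        # Rebuilt every time, because the contents may have changed.
        return "".join(self._symbols)

    def immutable_copy(self):
        return Sequence(self._symbols)


fixed = Sequence("GATT")
editable = fixed.mutable_copy()
editable.append("A")
```

The cost is a second class per type, so it is probably only worth doing for the classes where such caching or sharing actually pays off.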

Looking forward to your reaction!
Alex

*********************************************************
                     ** Alexander Griekspoor **
*********************************************************
               The Netherlands Cancer Institute
               Department of Tumorbiology (H4)
          Plesmanlaan 121, 1066 CX, Amsterdam
                   Tel:  + 31 20 - 512 2023
                   Fax:  + 31 20 - 512 2029
                   AIM: mekentosj at mac.com
                   E-mail: a.griekspoor at nki.nl
               Web: http://www.mekentosj.com

                             iRNAi, do you?
              http://www.mekentosj.com/irnai

*********************************************************
