[Biococoa-dev] New Structure for BioCocoa part I

Sun Jul 3 07:55:39 EDT 2005

Hi guys,

My apologies for not having jumped in earlier. To Koen and John in
particular I'm sorry, I should have given a summary of the WWDC meeting
much earlier. I understand that all of this comes out of the blue,
but please don't get us wrong, this is all still open for debate. In
fact I hope that we will discuss things in more detail on the list in
order to come up with the best implementation. I'll try to summarise
the topics discussed at the WWDC and the thoughts behind them below.

The story begins while Phil and I were preparing the slides for our
small presentation planned for Wednesday evening. I have to admit that
I had spent very little time on BioCocoa in the month before and did
not have the exact structure in my head anymore. When I started to look
again at our implementation it was not at all obvious how the
sequence class cluster was set up, and Phil also had problems getting
the exact idea. John did a wonderful job explaining many of the ideas
in the document he created just before the WWDC, but I still think it
needs reconsideration. If even the developers of the framework can't
grasp it easily, imagine new users. So we decided not to spend much
time during the presentation on the implementation, both because it's
still a moving target and because we thought that it would not be
of particular interest to the audience. We did decide to talk about
our BioJava-like approach of singleton BCSymbol objects. That
pattern is easy to explain and easy to get. Our main focus, however,
was on what we have in mind for the framework, its potential
use, and a request for feedback and input. What needs can it
fulfil and what are people looking for?

Partially due to the rescheduled Apple Design Awards (hooray for
Peter!) we had a fairly small group of listeners, but the
discussion with the group alone was worth coming together for, I
think. It was clear that most "new" people were from fields that focus
on large scale genomics projects, clearly a different "target
audience" than the one our framework aims at. If I may be so blunt, I
think it's safe to say that initially we aim at developers like
ourselves, who create fairly small applications with many standard
(and fairly simple) sequence editing routines on small sequences. Of
course, we should aim to scale this up much further, but that's not
our initial goal, right? One of the guys in the audience explained
that the philosophy behind BioJava was actually the opposite: aimed at
large scale genome-sized sequences and mainly focused on annotations.
It was even difficult to convince him that there was a need for what
we do! I told him that there clearly is a need for programs like
Vector NTI, which he agreed with in the end.

Not surprisingly, the main topic of discussion quickly turned to
performance, with the basic question: where do we draw the line
between objects and structures? We want a Cocoa-like interface and
ease of use, but also performance in terms of speed and memory
footprint. Ideally we would like to have something like NSString,
which is easy to use and has many convenient methods, but works fast
because of an under-the-hood implementation that uses different C
structures based on the type of string you use. Now the problem is
that we have to design that under-the-hood part of our sequence
objects. Initially we chose the BioJava approach of singleton objects
(yes, I was(/am) a great fan).

Let me summarize the benefits:
-  Objects! Powerful methods, easily accessible properties, etc.: all
the nice goodies from Cocoa
-  Way more powerful than a simple char
-  Singleton objects dramatically reduce the memory footprint: a
sequence is simply a list of pointers to the singleton objects.

However, there are clear negatives as well, many discussed before:
- Objects! Bigger than a char, not that much but still. Storing 200Mb
of sequence or 4-8 times as much makes a difference! The singletons do
make a dramatic difference though, and I still consider this one
of the smallest problems.
- Speed. Object messaging is the number one problem here, requiring
all kinds of hacks and tricks to get decent performance. The main
problem lies in the use of NSArray and the like to store the list of
pointers to the symbols. Although very convenient for editing, this
kills performance, especially since the most frequent operation on
sequences is iteration over the array.

In conclusion, the singleton symbols are great! But the problem lies  
in the NSArray way of storing the sequence of them!
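
To make the size argument concrete, here is a minimal sketch in plain
C. It is not BioCocoa code: the Symbol struct, the SYMBOLS table and
the function names are all made up for illustration, and the table
only covers the four DNA bases. It shows one shared symbol per base,
and what a list of pointers costs compared to a list of chars.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a BCSymbol singleton: one shared,
   read-only instance per base. */
typedef struct {
    char code;          /* one-letter code, e.g. 'A' */
    const char *name;   /* full name of the base */
    char complement;    /* one-letter code of the complementary base */
} Symbol;

static const Symbol SYMBOLS[] = {
    {'A', "adenine",  'T'},
    {'C', "cytosine", 'G'},
    {'G', "guanine",  'C'},
    {'T', "thymine",  'A'},
};

/* Return the shared symbol for a one-letter code, or NULL if unknown. */
const Symbol *symbol_for_char(char c) {
    for (size_t i = 0; i < sizeof SYMBOLS / sizeof SYMBOLS[0]; i++) {
        if (SYMBOLS[i].code == c) {
            return &SYMBOLS[i];
        }
    }
    return NULL;
}

/* Storage for n positions kept as pointers to the shared symbols. */
size_t bytes_as_pointer_list(size_t n) {
    return n * sizeof(const Symbol *);
}

/* Storage for the same n positions kept as plain chars. */
size_t bytes_as_char_array(size_t n) {
    return n * sizeof(char);
}
```

With 4 or 8 byte pointers the pointer list is 4-8 times the size of
the char array, and that is before counting any overhead of the
container object itself.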

Now is there a better solution? Well, one obvious theme brought up
many times was the old trick of converting the sequence object to a
string, doing the stuff that needs to be done, and converting the
result back to a sequence object. The benefits are easy to see: chars
are smaller and speedier to work with, and another plus: many
algorithms are available for strings already. We also realized that
this was something that would often be needed, and thus called for a
general sequence-to-string-and-back implementation. Why not?
Because it's slow, yet another slowdown! True, the conversion time
would often be negligible compared to the actual work, but it would
still take time.
I always opposed all this quite strongly where I could. The idea
was simple: if we go for a certain implementation we should eat our
own dog food, and it should be good enough to handle
the problems described. Alignments should work natively with
BCSequences, so should reversing, etc. I came to realize that that was
an illusion, and not practical.
But now I realize even more that this tells us that we were
on the wrong track! Our BCSequences could not be used for this;
they're not suited for most of the tasks they should perform! We need
another implementation.

The credits have to go to Jeff, a graduate student new to BioCocoa
who I hope will join the project one day. From all of the above it
should be obvious what to do. We should use strings (or char arrays
to be more precise). Now to quote Koen: WTF, are we throwing away all
the things we did in the past months?
No, absolutely not. The idea is simple. The native way of storing the
sequence INSIDE a BCSequence object should not be an NSArray of
pointers to symbols, but a char array (or an NSData object as
Charles suggested, but let's skip the implementation for now and focus
on the idea). The BCSequence object would become a wrapper object
around the string as "data store". The benefits are easy to see:
- size is as compact as possible; one could even think of applying
classical compression algorithms to make it even smaller.
- the string is always available to any implementation so:
     - no conversion needed, the string is always there
     - speed, all implementations work with strings, no iterations
over NS/CFArrays
     - we can use all existing and standard string based algorithms,
e.g. for alignments, but also for instance standard regular
expression libraries for searching, matching, etc.
However, to the OUTSIDE world we ARE (or perhaps better SEEM) arrays
of singleton objects. If the sequence is asked for the symbol at
position 18 for instance, we return the singleton object. If they
want a subsequence however, we again return a BCSequence, which
internally has its own char array of course. If you think about the
number of times you really want the symbol and not for instance a
sequence, range or annotation, that's not many I think.
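
The wrapper idea can be sketched roughly as follows, again in plain C
rather than Objective-C. This is only an illustration under my own
assumptions: Sequence, sequence_symbol_at and sequence_subsequence are
invented names, not the BCSequence API, and error handling is kept to
the bare minimum.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical shared symbols, one per base (stand-in for BCSymbol). */
typedef struct {
    char code;
    const char *name;
} Symbol;

static const Symbol SYMBOLS[] = {
    {'A', "adenine"}, {'C', "cytosine"}, {'G', "guanine"}, {'T', "thymine"},
};

/* Hypothetical wrapper: the char array is the internal data store. */
typedef struct {
    char *bases;     /* NUL-terminated copy of the sequence */
    size_t length;   /* number of bases, excluding the terminator */
} Sequence;

/* Copy `length` chars into a new sequence; returns NULL on failure. */
Sequence *sequence_create(const char *bases, size_t length) {
    Sequence *s = malloc(sizeof *s);
    if (s == NULL) return NULL;
    s->bases = malloc(length + 1);
    if (s->bases == NULL) { free(s); return NULL; }
    memcpy(s->bases, bases, length);
    s->bases[length] = '\0';
    s->length = length;
    return s;
}

void sequence_free(Sequence *s) {
    if (s != NULL) { free(s->bases); free(s); }
}

/* To the outside we look like an array of symbols: map the stored char
   to its shared symbol on demand. NULL if out of range or unknown. */
const Symbol *sequence_symbol_at(const Sequence *s, size_t index) {
    if (index >= s->length) return NULL;
    for (size_t i = 0; i < sizeof SYMBOLS / sizeof SYMBOLS[0]; i++) {
        if (SYMBOLS[i].code == s->bases[index]) return &SYMBOLS[i];
    }
    return NULL;
}

/* A subsequence is again a sequence with its own char array. */
Sequence *sequence_subsequence(const Sequence *s, size_t start, size_t length) {
    if (start > s->length || length > s->length - start) return NULL;
    return sequence_create(s->bases + start, length);
}
```

Asking for the symbol at two positions holding the same base gives
back the very same pointer, so callers still see singletons, while
everything inside works on the char array. A real version would
probably use a lookup table indexed by char instead of the small loop.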

The only real downside I think is the fact that programming the
implementations using strings is somewhat more complex: more C, less
Cocoa, more pointer fiddling, fewer enumerators. But since on many
occasions we already started doing that to "hack" things faster, and
already opted to do the conversions necessary to get to that point, I
guess it's not so much of a problem. In fact, we can now use many
standard char implementations already available (and tested). Of
course, if speed is not an issue we can still do it the old way
because there is still a way to get the pointer to the symbol for any
position.
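
As an example of what working directly on the char array might look
like, here is a small sketch of searching and reversing. The function
names are hypothetical and only the four unambiguous DNA bases are
handled; strstr from the standard C library does the searching.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Offset of the first occurrence of `motif` in `bases`, or -1 if it
   does not occur. Both arguments must be NUL-terminated. */
long find_motif(const char *bases, const char *motif) {
    const char *hit = strstr(bases, motif);
    return hit != NULL ? (long)(hit - bases) : -1;
}

/* Complement of a single base; anything else is returned unchanged. */
static char complement(char c) {
    switch (c) {
        case 'A': return 'T';
        case 'T': return 'A';
        case 'C': return 'G';
        case 'G': return 'C';
        default:  return c;
    }
}

/* Reverse-complement the first `length` chars of `bases` in place by
   walking inwards from both ends; a middle base is just complemented. */
void reverse_complement(char *bases, size_t length) {
    for (size_t i = 0, j = length; i < j; i++) {
        j--;
        char left = complement(bases[i]);
        bases[i] = complement(bases[j]);
        bases[j] = left;
    }
}
```

This is the pointer fiddling mentioned above: no enumerators and no
message sends, just a single pass over the buffer.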

So far the theory, now part II: implementing the thing....

*********************************************************
                     ** Alexander Griekspoor **
*********************************************************
              The Netherlands Cancer Institute
              Department of Tumorbiology (H4)
         Plesmanlaan 121, 1066 CX, Amsterdam
                    Tel:  + 31 20 - 512 2023
                    Fax:  + 31 20 - 512 2029
                    AIM: mekentosj at mac.com
                    E-mail: a.griekspoor at nki.nl
                Web: http://www.mekentosj.com

       Microsoft is not the answer,
       Microsoft is the question,
       NO is the answer

*********************************************************
