[Biococoa-dev] New Structure for BioCocoa part I
Alexander Griekspoor
mek at mekentosj.com
Sun Jul 3 07:55:39 EDT 2005
Hi guys,
My apologies for not having jumped in earlier. Koen and John in
particular, I'm sorry: I should have given a summary of the WWDC
meeting much earlier. I understand that all of this comes out of the
blue, but please don't get us wrong, this is all still open for
debate. In fact I hope that we will discuss things in more detail on
the list in order to come up with the best implementation. I'll try
to summarise the topics discussed at the WWDC and the thoughts behind
them below.
The story begins while Phil and I were preparing the slides for our
small presentation planned for Wednesday evening. I have to admit that
I had spent very little time on BioCocoa in the month before and no
longer had the exact structure in my head. When I started to look
again at our implementation, it was far from obvious how the
sequence class cluster was set up, and Phil also had problems getting
the exact idea. John did a wonderful job explaining many of the ideas
in the document he created just before the WWDC, but still I think it
needs reconsideration. If even the developers of the framework can't
grasp it easily, imagine new users. So we decided not to spend much
time during the presentation on the implementation, both because it's
still a moving target and because we thought that it would not be
of particular interest to the audience. We did decide to talk about
our BioJava-like approach of singleton BCSymbol objects. That
pattern is easy to explain and easy to get. Our main focus, however,
was on the things we had in mind with the framework, the potential
use, and our request for feedback and input. What needs can it
fulfil and what are people looking for?
Partially due to the rescheduled Apple Design Awards (hooray for
Peter!) we had a fairly small group of listeners, but the discussion
with the group alone was worth coming together for, I think. It was
clear that most "new" people were from fields that focus on
large-scale genomics projects, clearly a different "target audience"
than the one our framework aims at. If I may be so blunt, I think it's
safe to say that initially we aim at developers like ourselves, who
create fairly small applications with many standard (and fairly
simple) sequence editing routines on small sequences. Of course, we
should aim at scaling up well beyond that eventually, but that's not
our initial goal, right? One of the guys in the audience explained
that the philosophy behind BioJava was actually the opposite: aimed at
large-scale, genome-sized sequences and mainly focused on
annotations. It was even difficult to convince the guy that there was
a need for what we do! I told him that there clearly is a need for
programs like Vector NTI, which he agreed with in the end.
Not surprisingly, the main topic of discussion quickly turned to
performance, with the basic question: where do we draw the line
between objects and structures? We want a Cocoa-like interface and
ease of use, but also performance in terms of speed and memory
footprint. Ideally we would like to have something like NSString,
which is easy to use and has many convenient methods, but works fast
because of an under-the-hood implementation that uses different C
structures based on the type of string you use. Now the problem is
that we have to design that under-the-hood part of our sequence objects.
Initially we chose the BioJava approach of singleton objects
(yes, I was (and am) a great fan).
Let me summarize the benefits:
- Objects! Powerful methods, easily accessible properties, etc.: all
the nice goodies from Cocoa
- Way more powerful than a simple char
- Singleton objects to dramatically reduce the memory footprint: a
sequence is simply a list of pointers to the singleton objects.
However, there are clear negatives as well, many discussed before:
- Objects! Bigger than a char, not that much but still. Storing 200Mb
of sequence or 4-8 times as much makes a difference! The singletons do
make a dramatic difference though, and I still consider this one
of the smallest problems.
- Speed. Object messaging is the number one problem here, requiring
all kinds of hacks and tricks to get decent performance. The main
problem lies in the use of NSArray and the like to store the list of
pointers to the symbols. Although very convenient for editing, this
kills performance, certainly when the most frequent operation on
sequences is iteration over the array.
In conclusion, the singleton symbols are great! But the problem lies
in the NSArray way of storing the sequence of them!
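
To make the size argument concrete, here is a minimal sketch in plain
C rather than Objective-C. The Symbol struct, the kSymbols table and
the function names are hypothetical stand-ins for illustration, not
the actual BCSymbol API. It shows one shared instance per base and
compares the storage for n symbols held as pointers with the same n
symbols held as chars.

```c
#include <stddef.h>

/* Hypothetical stand-in for a BCSymbol singleton: one shared,
   read-only instance per nucleotide. */
typedef struct {
    char        code;        /* one-letter code, e.g. 'A' */
    const char *name;        /* full name of the base */
    char        complement;  /* code of the complementary base */
} Symbol;

static const Symbol kSymbols[] = {
    { 'A', "adenine",  'T' },
    { 'C', "cytosine", 'G' },
    { 'G', "guanine",  'C' },
    { 'T', "thymine",  'A' },
};

/* Return the shared instance for a code, or NULL if the code is unknown.
   Every caller asking for 'A' gets the same pointer. */
const Symbol *symbol_for_code(char code)
{
    for (size_t i = 0; i < sizeof kSymbols / sizeof kSymbols[0]; i++) {
        if (kSymbols[i].code == code)
            return &kSymbols[i];
    }
    return NULL;
}

/* Storage needed for n symbols kept as pointers to singletons. */
size_t bytes_as_pointers(size_t n) { return n * sizeof(const Symbol *); }

/* Storage needed for the same n symbols kept as plain chars. */
size_t bytes_as_chars(size_t n) { return n * sizeof(char); }
```

With 4-byte or 8-byte pointers, bytes_as_pointers(n) is 4 or 8 times
bytes_as_chars(n), which is the factor mentioned above.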
Now is there a better solution? Well, one obvious theme brought up
many times was the old trick of converting the sequence object to a
string, doing the stuff that needs to be done, and converting the
result back to a sequence object. The benefits are easy to see: chars
are smaller and speedier to work with, and another plus: many
algorithms are available for strings already. We also realized that
this was something that would often be needed, and thus needed a
general sequence-to-string-and-back implementation. Why not?
Because it's slow: yet another slowdown! True, the conversion time
would often be negligible compared to the actual operation, but it
would still take time.
I always opposed all this quite strongly whenever I could. The idea
was simple: if we go for a certain implementation we should eat our
own dog food; it should be so good that it can handle
the problems described. Alignments should work natively with
BCSequences, reversing should too, etc. I realized that that was an
illusion, and not practical.
But now, I realize even more that this indeed tells us that we were
on the wrong track! Our BCSequences could not be used for this,
they're not suited for most of the tasks they should perform! We need
another implementation.
The credits have to go to Jeff, a graduate student who is new to
BioCocoa and who I hope will join the project one day. From all of the
above it should be obvious what to do: we should use strings (or char
arrays, to be more precise). Now to quote Koen: WTF, are we throwing
away all the things we did in the past months?
No, absolutely not. The idea is simple. The native way of storing the
sequence INSIDE a BCSequence object should not be an NSArray of
pointers to symbols, but a char array (or an NSData object as
Charles suggested, but let's skip the implementation for now and focus
on the idea). The BCSequence object would become a wrapper object
around the string as "data store". The benefits are easy to see:
- size is as compact as possible; one could even think of applying
classical compression algorithms to make it even smaller.
- the string is always available to any implementation, so:
  - no conversion needed, the string is always there
  - speed: all implementations work with strings, no iterations
  over NSArrays/CFArrays
  - we can use all existing and standard string-based algorithms,
  e.g. for alignments, but also for instance standard regular
  expression libraries for searching, matching, etc.
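
As a rough illustration of the wrapper idea, here is a minimal sketch
in plain C. The Sequence struct and the sequence_* functions are
hypothetical names chosen for this example, not BioCocoa code, and the
actual data store might well be an NSData object instead. The point is
only that, with a char array inside, a standard routine such as strstr
from the C library can search the sequence directly, without any
conversion step.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a BCSequence whose data store is a char array. */
typedef struct {
    char  *bases;   /* NUL-terminated buffer, owned by the sequence */
    size_t length;  /* number of symbols, excluding the NUL */
} Sequence;

/* Create a sequence by copying a C string; returns NULL if allocation fails. */
Sequence *sequence_create(const char *text)
{
    Sequence *seq = malloc(sizeof *seq);
    if (seq == NULL)
        return NULL;
    seq->length = strlen(text);
    seq->bases = malloc(seq->length + 1);
    if (seq->bases == NULL) {
        free(seq);
        return NULL;
    }
    memcpy(seq->bases, text, seq->length + 1);  /* copies the NUL as well */
    return seq;
}

/* Release the sequence and its buffer; passing NULL is allowed. */
void sequence_free(Sequence *seq)
{
    if (seq != NULL) {
        free(seq->bases);
        free(seq);
    }
}

/* Search for a motif with plain strstr, working directly on the data store.
   Returns the 0-based offset of the first match, or -1 if there is none. */
long sequence_find(const Sequence *seq, const char *motif)
{
    const char *hit = strstr(seq->bases, motif);
    return hit != NULL ? (long)(hit - seq->bases) : -1;
}
```

For example, searching a sequence created from "ATGGAATTCCGA" for the
motif "GAATTC" returns offset 3.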
However, to the OUTSIDE world we ARE (or perhaps better, SEEM) arrays
of singleton objects. If the sequence is asked for the symbol at
position 18, for instance, we return the singleton object. If they
want a subsequence, however, we again return a BCSequence, which
internally has its own char array of course. If you think about how
often you really want the symbol itself rather than, for instance, a
sequence, range or annotation, that's not very often, I think.
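
The outside view could be sketched roughly as follows, again in plain
C with hypothetical names (Symbol, Sequence, sequence_symbol_at and
sequence_subsequence are assumptions made for this example, not
existing BioCocoa methods). Asking for a position maps the stored char
to the shared symbol instance, while asking for a range returns a new
sequence with its own char array.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical singleton symbols: one shared instance per base. */
typedef struct { char code; const char *name; } Symbol;

static const Symbol kAdenine  = { 'A', "adenine"  };
static const Symbol kCytosine = { 'C', "cytosine" };
static const Symbol kGuanine  = { 'G', "guanine"  };
static const Symbol kThymine  = { 'T', "thymine"  };

/* Map a stored char to its shared symbol, or NULL if unknown. */
const Symbol *symbol_for_code(char code)
{
    switch (code) {
    case 'A': return &kAdenine;
    case 'C': return &kCytosine;
    case 'G': return &kGuanine;
    case 'T': return &kThymine;
    default:  return NULL;
    }
}

/* Hypothetical sequence wrapper around a char array. */
typedef struct { char *bases; size_t length; } Sequence;

/* Copy the first `length` chars of `text` into a new sequence. */
Sequence *sequence_create(const char *text, size_t length)
{
    Sequence *seq = malloc(sizeof *seq);
    if (seq == NULL)
        return NULL;
    seq->bases = malloc(length + 1);
    if (seq->bases == NULL) {
        free(seq);
        return NULL;
    }
    memcpy(seq->bases, text, length);
    seq->bases[length] = '\0';
    seq->length = length;
    return seq;
}

void sequence_free(Sequence *seq)
{
    if (seq != NULL) {
        free(seq->bases);
        free(seq);
    }
}

/* Outside view: the symbol at a position is the shared singleton. */
const Symbol *sequence_symbol_at(const Sequence *seq, size_t pos)
{
    assert(pos < seq->length);
    return symbol_for_code(seq->bases[pos]);
}

/* A subsequence is again a sequence, with its own copy of the chars. */
Sequence *sequence_subsequence(const Sequence *seq, size_t start, size_t count)
{
    assert(start <= seq->length && count <= seq->length - start);
    return sequence_create(seq->bases + start, count);
}
```

Two positions holding the same base return the very same pointer, so
callers can still compare symbols by identity.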
The only real downside, I think, is that programming the
implementations using strings is somewhat more complex: more C, less
Cocoa, more pointer fiddling, fewer enumerators. But since on many
occasions we had already started doing that to "hack" things faster,
and had already opted to do the conversions necessary to get to that
point, I guess it's not much of a problem. In fact, we can now use
many standard char implementations that are already available (and
tested). Of course, if speed is not an issue we can still do it the
old way, because there is still a way to get the pointer to the symbol
at any position.
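
To give a feel for the "pointer fiddling" style, here is a minimal
sketch of reversing and complementing a DNA char array in place. The
function names are hypothetical, and mapping any unrecognised
character to 'N' is an assumption made for this example only.

```c
#include <stddef.h>

/* Complement of a single base; anything unrecognised becomes 'N'
   (an assumption for this sketch). */
char complement_of(char base)
{
    switch (base) {
    case 'A': return 'T';
    case 'T': return 'A';
    case 'C': return 'G';
    case 'G': return 'C';
    default:  return 'N';
    }
}

/* Reverse-complement `length` chars in place using two pointers that
   walk towards each other from both ends of the buffer. */
void reverse_complement_in_place(char *bases, size_t length)
{
    if (length == 0)
        return;

    char *left  = bases;
    char *right = bases + length - 1;

    while (left < right) {
        char tmp = complement_of(*left);
        *left++  = complement_of(*right);
        *right-- = tmp;
    }
    /* Odd length: the middle base stays in place but is complemented. */
    if (left == right)
        *left = complement_of(*left);
}
```

No objects are created and no messages are sent per symbol; the whole
operation is a single pass over the buffer.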
So far the theory, now part II: implementing the thing....
*********************************************************
** Alexander Griekspoor **
*********************************************************
The Netherlands Cancer Institute
Department of Tumorbiology (H4)
Plesmanlaan 121, 1066 CX, Amsterdam
Tel: + 31 20 - 512 2023
Fax: + 31 20 - 512 2029
AIM: mekentosj at mac.com
E-mail: a.griekspoor at nki.nl
Web: http://www.mekentosj.com
Microsoft is not the answer,
Microsoft is the question,
NO is the answer
*********************************************************