[Biodevelopers] splice and clear...
Dan Bolser
dmb at mrc-dunn.cam.ac.uk
Thu Aug 28 14:00:18 EDT 2003
Sorry for the fig...
Here is empirical evidence for the non N**N
nature of the algorithm... (I am board)
X = number of HSPs
Y = number of inner loop itterations.
As you can see it is not nearly N**N, and is
bounded by NlogN except for small X (X<50).
The outer loop is visited once for each different
cluster of HSPs (C), so for small X that is likely
to be a bigger fraction of the total, hence more outer
loop visits per inner.
You get a near perfect fit with f(x) = 2x
so I guess I can call the complexity N, if it wasn't for
all those pesky splices...
There are 2*2N+C splices, and each splice is O(N)...
(again shrinking list size = faster to splice, but not
sure exactly)
Scaleable?
Sob....
On 27 Aug 2003, Joseph Landman wrote:
> On Wed, 2003-08-27 at 17:30, Dan Bolser wrote:
> > Joseph Landman said:
> > > Hi Dan:
> > >
> > > I am assuming that your arrays contain not HSP's, but some sort of
> > > object representing the HSP ala BioPerl. Is this correct?
> > >
> >
> > Nope, just a simple hash, key => value,
> > I.E.
> >
> > {
> > hsp_hit-from => 5,
> > hsp_hit-to => 10,
> > score => 20,
> > }
> >
>
> Ok.
>
> Comments interspersed:
>
>
> > >> __SKIP__
> > >>
> > >> preamble
> > >>
> > >> @hsps = array of HSP hashes, for a particular protein
> > >> each HSP can be from several SCOP sequences.
> > >>
> > >> __RESUME__
> > >>
> > >> my @result; # Final HSP's
> > >>
> > >> TOP:while (@hsps){ # NB: Ordered by Hsp_query-from
> > >> # (for optimzation).
> > >>
> > >> my $p = 0; # Current HSP pointer.
> > >>
> > >> MID:for (my $j=$p+1; $j<@hsps; $j++){ # Overlap slider.
>
> These two loops give you something comparable to an O(N^2) algorithm.
> That is, I can rewrite the while and the for as
>
> foreach my $hsp (@hsps) # Outer O(N): 1 iteration per entry
> foreach my $index (0 .. $#hsps) # Inner O(N): 1 iteration per
> entry
>
>
>
>
> > >>
> > >> # Family overlap only!
> > >>
> > >> next MID if
> > >> $hsps[$p]->{SCCS} != $hsps[$j]->{SCCS};
>
> Short circuits are good.
>
> > >>
> > >> # Optimization.
> > >>
> > >> if ( $THRESH >
> > >> $hsps[$p]->{P_END} - $hsps[$j]->{P_START} ){
> > >>
> > >> shift @hsps;
> > >> next TOP;
> > >> }
>
> A shift or a similar operation (splice, push, etc) will likely trigger
> the garbage collector in Perl, as well as the memory allocator
>
> > >>
> > >> # Pick best of pair (removing the other from the list).
> > >>
> > >> if ( $hsps[$p]->{E_VALUE} > $hsps[$j]->{E_VALUE} ){
> > >> splice (@hsps, $p, 1);
> > >> $j--;
> > >> $p = $j;
> > >> }
> > >> else {
> > >> splice (@hsps, $j, 1);
> > >> $j--;
> > >> }
> > >> }
> > >> push @result, splice(@hsps, $p, 1);
> > >> }
> > >> print "OK\n\n";
>
> I might suggest that instead of moving data around (splice, push, shift)
> and causing allocation/GC events, that you might maintain several lists,
> indicating which class the HSP's belong to.
>
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: itter_v_hsp_number3.gif
Type: image/gif
Size: 6629 bytes
Desc:
Url : http://bioinformatics.org/pipermail/biodevelopers/attachments/20030828/1d46b995/itter_v_hsp_number3.gif
More information about the Biodevelopers
mailing list