[Biodevelopers] HMMER-2.2g SSE2 implementation

Wed Jan 29 11:03:41 EST 2003

I have attempted an SSE2 implementation of HMMER.  The starting point was
Erik Lindahl's code.  The testbed was a Willamette-type 1.7 GHz P4 running
on an i845 motherboard with 512MB DDR RAM.  The compiler was gcc-3.3 which 
has implemented (slightly buggily) the SSE2 intrinsics.

The main showstopper is the absence of a 32-bit integer vector max 
instruction in SSE2 - there is one for bytes/shorts/floats/doubles but
not for 32 bit ints.  As the 32-bit int is the standard basis of the HMMER 
Viterbi algorithm and a vector max is required eight times during the inner 
loop, there have been severe consequences.  Its replacement requires five 
SSE2 instructions.
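
For illustration, here is a minimal sketch of one way a 32-bit integer 
vector max can be emulated with SSE2 intrinsics: a compare, an AND, an 
AND-NOT and an OR, to which the compiler typically adds a register copy.  
The helper names (max_epi32_sse2, max4_i32, max4_i32_selfcheck) are 
hypothetical and the sequence is an assumption; it is not necessarily the 
exact replacement used in the code described here.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Lane-wise signed 32-bit max built from SSE2 operations only.
 * Hypothetical helper; not taken from the HMMER source. */
static __m128i max_epi32_sse2(__m128i a, __m128i b)
{
    __m128i gt = _mm_cmpgt_epi32(a, b);            /* all-ones where a > b */
    return _mm_or_si128(_mm_and_si128(gt, a),      /* keep a where a > b   */
                        _mm_andnot_si128(gt, b));  /* keep b elsewhere     */
}

/* Convenience wrapper over plain arrays of four ints
 * (unaligned loads and stores, so no alignment is assumed). */
static void max4_i32(const int32_t *a, const int32_t *b, int32_t *out)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, max_epi32_sse2(va, vb));
}

/* Small self-check with made-up scores, including negative values. */
static void max4_i32_selfcheck(void)
{
    const int32_t a[4] = { -987, 5, 0, INT32_MIN };
    const int32_t b[4] = { -986, -5, 0, 7 };
    int32_t out[4];

    max4_i32(a, b, out);
    assert(out[0] == -986 && out[1] == 5 && out[2] == 0 && out[3] == 7);
}
```

Since the max is needed eight times per inner-loop iteration, a sequence 
of this kind would be paid eight times over, where a native instruction 
would cost one op each.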

The net impact is that in comparison to the standard HMMER code, the speedup 
achieved is 1.9x.  Half of this comes about solely from the modifications 
Erik Lindahl made to the data structures and the remainder from SSE 
accelerations (a part of this was achieved by running both the vector unit
and the ALU simultaneously on different parts of the problem).

While some more changes are possible (e.g. cacheline aligned structures and 
prefetch), I think there is little prospect of achieving large improvements 
in the Viterbi SSE2 implementation.   For the current algorithm, the CPU of 
choice has to be the Motorola G4.  

I have looked into possible reasons for the disappointing speedup.  Perhaps 
the key is that the MMX_ALU that executes the SSE2 ISA has 
latency/throughput of 2/2 for most key ops (2 cycles latency/new op every
two cycles).  OTOH, the normal integer ALU is double-pumped and capable of
2 ops a cycle on some instructions.  In a VERY crude sense therefore, the
MMX_ALU and normal ALU are equivalent in speed for quad doubleword ints and 
shifting processing from one to the other is unlikely to yield stupendous 
improvements. In contrast, the MPC7455 has latency/throughput of 1/1 for 
many ops and moving operations to the vector unit is crudely equivalent to 
a 4x speedup relative to the normal ALU.  Similarly, current P4s would 
require double the clockspeed for the MMX_ALU to match the MPC7455 Altivec 
unit.
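
The crude arithmetic above can be written out as a small sketch.  The 
function name lanes_per_cycle is hypothetical, and the figure of one scalar 
op per cycle for the G4 integer ALU is an assumption implied by the 4x 
estimate rather than a documented number.  The model ignores latency, 
dependencies and memory traffic entirely.

```c
#include <assert.h>

/* 32-bit results produced per clock cycle by a unit that handles
 * `lanes` values per op and can start a new op every
 * `issue_interval` cycles.  Deliberately crude. */
static double lanes_per_cycle(double lanes, double issue_interval)
{
    return lanes / issue_interval;
}

/* P4: vector op on 4 lanes every 2 cycles; scalar ALU double-pumped,
 * i.e. one op every half cycle. */
static double p4_vector_rate(void) { return lanes_per_cycle(4.0, 2.0); }
static double p4_scalar_rate(void) { return lanes_per_cycle(1.0, 0.5); }

/* MPC7455: vector op on 4 lanes every cycle; scalar rate of one op
 * per cycle is an ASSUMPTION for this sketch. */
static double g4_vector_rate(void) { return lanes_per_cycle(4.0, 1.0); }
static double g4_scalar_rate(void) { return lanes_per_cycle(1.0, 1.0); }
```

Under these assumptions p4_vector_rate() / p4_scalar_rate() is 1, 
g4_vector_rate() / g4_scalar_rate() is 4, and g4_vector_rate() / 
p4_vector_rate() is 2, which is the "double the clockspeed" figure.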

To explore the possibility of further speedup, I converted the model 
matrices to 32-bit floats so a vector max instruction was available.  Under 
these circumstances, an acceleration of 2.7x was achieved.  Some further 
experimentation is possible with reordering operations but I don't envisage 
gains equivalent to those observed with the G4.
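
As a rough sketch of why floats help: with 32-bit floats the max is a 
single SSE operation, so choosing between two paths for four cells is two 
adds and one max.  The function name and argument layout below are 
hypothetical and are not taken from the HMMER source.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE: _mm_add_ps, _mm_max_ps */

/* For four cells at once:
 *   out[i] = max(prev_a[i] + trans_a[i], prev_b[i] + trans_b[i])
 * Unaligned loads/stores are used so no alignment is assumed. */
static void best_of_two_paths_ps(const float *prev_a, const float *trans_a,
                                 const float *prev_b, const float *trans_b,
                                 float *out)
{
    __m128 path_a = _mm_add_ps(_mm_loadu_ps(prev_a), _mm_loadu_ps(trans_a));
    __m128 path_b = _mm_add_ps(_mm_loadu_ps(prev_b), _mm_loadu_ps(trans_b));

    /* Native vector max: one instruction instead of an emulated sequence. */
    _mm_storeu_ps(out, _mm_max_ps(path_a, path_b));
}
```

The price is that the scores are no longer the scaled integers the rest of 
the code expects, so results need checking against the integer version.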

As for imminent changes in the x86 platform, the Opteron does not have a 
32-bit int vector max either.  We will have to await the release of 
documentation before we know whether its MMX_ALU will operate with a better 
latency/throughput than the P4.  The Gallatin variant of the P4 just has
a larger L2 cache.  The impending Prescott is known to include new SSE 
instructions but what those might be remains unknown.

I do not know whether the limited speedup makes this code worth cleaning up 
for release.  The integer code will require the gcc-3.3 (in code freeze) 
compiler for SSE2 support or perhaps the Intel compiler.  The inline
assembler makes it untidy and in need of cleanup.  The floating version has 
only had limited testing - I am surprised it worked at all - and it will 
not integrate well with the existing code base.

Regards,
David Huen, Univ. of Cambridge