[Biodevelopers] HMMER-2.2g SSE2 implementation
David Huen
smh1008 at cus.cam.ac.uk
Wed Jan 29 11:03:41 EST 2003
I have attempted an SSE2 implementation of HMMER. The starting point was
Erik Lindahl's code. The testbed was a Willamette-type 1.7 GHz P4 running
on an i845 motherboard with 512MB DDR RAM. The compiler was gcc-3.3, which
implements the SSE2 intrinsics (slightly buggily).
The main showstopper is the absence of a 32-bit integer vector max
instruction in SSE2 - there is one for bytes/shorts/floats/doubles but
not for 32-bit ints. As the 32-bit int is the standard basis of the HMMER
Viterbi algorithm and a vector max is required eight times during the inner
loop, the consequences are severe. Its replacement requires five
SSE2 instructions.
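
For readers who have not met this workaround, here is a minimal sketch of one
common way to emulate a signed 32-bit vector max with SSE2 intrinsics: compare,
then blend the two inputs with the resulting mask. The function names
max_epi32_sse2 and max4_i32 are made up for illustration and are not taken
from the HMMER sources. The post does not list the actual replacement
sequence, so this may differ from it. On a two-operand instruction set the
blend typically costs a register copy plus pcmpgtd, pand, pandn and por.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Element-wise max of four signed 32-bit ints using only SSE2.
   Hypothetical helper; not the code from the post. */
static inline __m128i max_epi32_sse2(__m128i a, __m128i b)
{
    __m128i a_gt_b = _mm_cmpgt_epi32(a, b);        /* all-ones in lanes where a > b */
    __m128i from_a = _mm_and_si128(a_gt_b, a);     /* keep a in those lanes         */
    __m128i from_b = _mm_andnot_si128(a_gt_b, b);  /* keep b in the other lanes     */
    return _mm_or_si128(from_a, from_b);           /* merge the two halves          */
}

/* Convenience wrapper: out[i] = max(x[i], y[i]) for i = 0..3.
   Unaligned loads/stores are used so any int32_t[4] will do. */
static void max4_i32(const int32_t *x, const int32_t *y, int32_t *out)
{
    __m128i a = _mm_loadu_si128((const __m128i *)x);
    __m128i b = _mm_loadu_si128((const __m128i *)y);
    _mm_storeu_si128((__m128i *)out, max_epi32_sse2(a, b));
}
```

Calling max4_i32 on two four-element arrays gives the lane-wise maximum. In a
real inner loop the data would stay in registers and max_epi32_sse2 would be
applied directly.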
The net impact is that, in comparison to the standard HMMER code, the speedup
achieved is 1.9x. Half of this comes about solely from the modifications
Erik Lindahl made to the data structures and the remainder from SSE2
acceleration (part of which was achieved by running the vector unit and the
ALU simultaneously on different parts of the problem).
While some more changes are possible (e.g. cacheline aligned structures and
prefetch), I think there is little prospect of achieving large improvements
in the Viterbi SSE2 implementation. For the current algorithm, the CPU of
choice has to be the Motorola G4.
I have looked into possible reasons for the disappointing speedup. Perhaps
the key is that the MMX_ALU that executes the SSE2 ISA has a
latency/throughput of 2/2 for most key ops (2 cycles latency, one new op
every two cycles). OTOH, the normal integer ALU is double-pumped and capable
of 2 ops a cycle on some instructions. In a VERY crude sense, therefore, the
MMX_ALU and normal ALU are equivalent in speed for vectors of four 32-bit
ints, and shifting processing from one to the other is unlikely to yield
stupendous improvements. In contrast, the MPC7455 has latency/throughput of
1/1 for many ops, and moving operations to the vector unit is crudely
equivalent to a 4x speedup relative to the normal ALU. Similarly, current
P4s would require double the clockspeed for the MMX_ALU to match the
MPC7455 Altivec unit.
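
The crude comparison above can be written out as 32-bit operations completed
per clock. This only restates the rough model in the post: four 32-bit lanes
per vector op, the throughputs quoted above, and a G4 scalar ALU taken to
issue one op per clock, which is an assumption implied by the 4x figure
rather than stated. It ignores everything else about either pipeline.

```latex
\begin{align*}
\text{P4 MMX\_ALU:}\quad
  & \frac{4\ \text{lanes}}{2\ \text{cycles}} = 2\ \text{ops/cycle} \\
\text{P4 integer ALU (double-pumped):}\quad
  & 2\ \text{ops/cycle}
  \;\Rightarrow\; \text{vector/scalar} \approx 1\times \\
\text{MPC7455 AltiVec:}\quad
  & \frac{4\ \text{lanes}}{1\ \text{cycle}} = 4\ \text{ops/cycle} \\
\text{MPC7455 scalar ALU (assumed):}\quad
  & 1\ \text{op/cycle}
  \;\Rightarrow\; \text{vector/scalar} \approx 4\times \\
\text{AltiVec vs.\ MMX\_ALU, per clock:}\quad
  & \frac{4}{2} = 2\times
\end{align*}
```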
To explore the possibility of further speedup, I converted the model
matrices to 32-bit floats so a vector max instruction was available. Under
these circumstances, an acceleration of 2.7x was achieved. Some further
experimentation is possible with reordering operations but I don't envisage
gains equivalent to those observed with the G4.
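
As an illustration of why the float conversion helps, here is a minimal
sketch of a max-of-two-sums step on four float lanes, where the max is the
single SSE instruction maxps (intrinsic _mm_max_ps). The function
max_of_sums4_f32 and its parameter names are hypothetical. They only loosely
mimic the "score plus transition, then take the best" shape of a Viterbi
update and are not the code or data layout used in the experiment above.

```c
#include <xmmintrin.h>  /* SSE intrinsics for packed single-precision floats */

/* For i = 0..3: out[i] = max(a[i] + ta[i], b[i] + tb[i]).
   Hypothetical helper for illustration; not the code from the post. */
static void max_of_sums4_f32(const float *a, const float *ta,
                             const float *b, const float *tb,
                             float *out)
{
    __m128 s1 = _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(ta)); /* first candidate  */
    __m128 s2 = _mm_add_ps(_mm_loadu_ps(b), _mm_loadu_ps(tb)); /* second candidate */
    _mm_storeu_ps(out, _mm_max_ps(s1, s2));                    /* one-instruction max */
}
```

The whole max here is one instruction, where the 32-bit integer version has
to build it from a compare and several logical ops.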
As for imminent changes in the x86 platform, the Opteron does not have a
32-bit int vector max either. We will have to await the release of
documentation before we know whether its MMX_ALU will operate with a better
latency/throughput than the P4. The Gallatin variant of the P4 just has
a larger L2 cache. The impending Prescott is known to include new SSE
instructions but what those might be remains unknown.
I do not know whether the limited speedup makes this code worth cleaning up
for release. The integer code will require the gcc-3.3 compiler (currently
in code freeze) for SSE2 support, or perhaps the Intel compiler. The inline
assembler makes it untidy and in need of cleanup. The floating version has
only had limited testing - I am surprised it worked at all - and it will
not integrate well with the existing code base.
Regards,
David Huen, Univ. of Cambridge