I have attempted an SSE2 implementation of HMMER. The starting point was Erik Lindahl's code. The testbed was a Willamette-type 1.7 GHz P4 running on an i845 motherboard with 512 MB DDR RAM. The compiler was gcc-3.3, which has implemented (slightly buggily) the SSE2 intrinsics.

The main showstopper is the absence of a 32-bit integer vector max instruction in SSE2: there is one for bytes/shorts/floats/doubles but not for 32-bit ints. As the 32-bit int is the standard basis of the HMMER Viterbi algorithm and a vector max is required eight times during the inner loop, the consequences are severe. Each max has to be replaced by a sequence of five SSE2 instructions.

The net impact is that, in comparison to the standard HMMER code, the speedup achieved is 1.9x. Half of this comes about solely from the modifications Erik Lindahl made to the data structures and the remainder from SSE accelerations (a part of this was achieved by running both the vector unit and the ALU simultaneously on different parts of the problem). While some more changes are possible (e.g. cacheline-aligned structures and prefetch), I think there is little prospect of achieving large improvements in the Viterbi SSE2 implementation. For the current algorithm, the CPU of choice has to be the Motorola G4.

I have looked into possible reasons for the disappointing speedup. Perhaps the key is that the MMX_ALU that executes the SSE2 ISA has latency/throughput of 2/2 for most key ops (2 cycles latency, a new op every two cycles). OTOH, the normal integer ALU is double-pumped and capable of 2 ops a cycle on some instructions. In a VERY crude sense therefore, the MMX_ALU and normal ALU are equivalent in speed for quad doubleword ints (four lanes every two cycles against two scalar ops every cycle), and shifting processing from one to the other is unlikely to yield stupendous improvements. In contrast, the MPC7455 has latency/throughput of 1/1 for many ops and moving operations to the vector unit is crudely equivalent to a 4x speedup relative to the normal ALU.
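
For anyone unfamiliar with the workaround for the missing 32-bit integer vector max, here is a minimal sketch of one common way to emulate it in SSE2 with a compare followed by a bitwise blend. It is not necessarily the sequence used in the code described above, and the names max_epi32_sse2, max4_i32 and max4_i32_selfcheck are hypothetical, chosen for illustration only. The four intrinsics usually compile to four instructions plus a register copy, which would be consistent with a count of five, although the exact count depends on the compiler. A scalar fallback is included only so that the sketch still compiles on non-x86 machines.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: per-lane signed 32-bit max using SSE2 only. */
static inline __m128i max_epi32_sse2(__m128i a, __m128i b)
{
    __m128i gt     = _mm_cmpgt_epi32(a, b);    /* all-ones in lanes where a > b */
    __m128i from_a = _mm_and_si128(gt, a);     /* keep a where a > b            */
    __m128i from_b = _mm_andnot_si128(gt, b);  /* keep b elsewhere: (~gt) & b   */
    return _mm_or_si128(from_a, from_b);       /* merge the two selections      */
}

/* out[i] = max(x[i], y[i]) for i = 0..3, using unaligned loads and stores. */
void max4_i32(const int32_t *x, const int32_t *y, int32_t *out)
{
    __m128i a = _mm_loadu_si128((const __m128i *)x);
    __m128i b = _mm_loadu_si128((const __m128i *)y);
    _mm_storeu_si128((__m128i *)out, max_epi32_sse2(a, b));
}
#else
/* Scalar fallback (assumption: only here for portability of the sketch). */
void max4_i32(const int32_t *x, const int32_t *y, int32_t *out)
{
    for (int i = 0; i < 4; i++)
        out[i] = (x[i] > y[i]) ? x[i] : y[i];
}
#endif

/* Small self-check with arbitrary example values. */
void max4_i32_selfcheck(void)
{
    const int32_t x[4] = { -5, 7,  0, INT32_MIN };
    const int32_t y[4] = {  3, 7, -1, INT32_MAX };
    int32_t out[4];

    max4_i32(x, y, out);
    assert(out[0] == 3 && out[1] == 7 && out[2] == 0 && out[3] == INT32_MAX);
}
```

Calling max4_i32_selfcheck() from a test program exercises the helper on four lanes at once. With floats the whole blend collapses into a single max instruction, which is why the element type matters so much here.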

On the latency/throughput figures above, current P4s would require double the clock speed for the MMX_ALU to match the MPC7455 AltiVec unit.

To explore the possibility of further speedup, I converted the model matrices to 32-bit floats so that a vector max instruction was available. Under these circumstances, an acceleration of 2.7x was achieved. Some further experimentation is possible with reordering operations but I don't envisage gains equivalent to those observed with the G4.

As for imminent changes in the x86 platform, the Opteron does not have a 32-bit int vector max either. We will have to await the release of documentation before we know whether its MMX_ALU will operate with a better latency/throughput than the P4. The Gallatin variant of the P4 just has a larger L2 cache. The impending Prescott is known to include new SSE instructions but what those might be remains unknown.

I do not know whether the limited speedup makes this code worth cleaning up for release. The integer code will require the gcc-3.3 compiler (currently in code freeze) for SSE2 support, or perhaps the Intel compiler. The inline assembler makes it untidy and in need of cleanup. The floating-point version has only had limited testing (I am surprised it worked at all) and it will not integrate well with the existing code base.

Regards,
David Huen, Univ. of Cambridge