[Biodevelopers] HMMER-2.2g SSE2 implementation

Wed Jan 29 11:19:35 EST 2003

Hi David:

  I looked into this with GCC 3.2.1, hand-coding some of the
assembler macros.

  We could turn on various profiling mechanisms to see which part of
the code causes the largest performance loss.  The unaligned access
issue could contribute up to a factor of 2x for memory-bound problems.
Pipeline stalls and scheduler bubbles could account for quite a bit
more (most of the rest).  The P4 and the Athlon both have multiple
integer units and can overlap loads and computation for particular
mixtures of instructions.
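
  To make the alignment point concrete, here is a minimal sketch (not
code from HMMER or from David's patch; the helper names is_aligned16 and
add_i32 are made up for illustration).  It uses the aligned SSE2
load/store only when every pointer sits on a 16-byte boundary and falls
back to the unaligned forms otherwise.  How much the aligned path gains
depends on the CPU and the data, so treat it as something to measure
rather than a guaranteed win.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if p lies on a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical helper: out[i] = a[i] + b[i] for n ints.
 * Assumes n is a multiple of 4 (no scalar tail loop in this sketch). */
static void add_i32(const int32_t *a, const int32_t *b, int32_t *out, size_t n)
{
    size_t i;

    if (is_aligned16(a) && is_aligned16(b) && is_aligned16(out)) {
        /* Aligned path: movdqa loads and stores. */
        for (i = 0; i < n; i += 4) {
            __m128i va = _mm_load_si128((const __m128i *)(a + i));
            __m128i vb = _mm_load_si128((const __m128i *)(b + i));
            _mm_store_si128((__m128i *)(out + i), _mm_add_epi32(va, vb));
        }
    } else {
        /* Unaligned path: movdqu loads and stores, valid for any address. */
        for (i = 0; i < n; i += 4) {
            __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
            __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
            _mm_storeu_si128((__m128i *)(out + i), _mm_add_epi32(va, vb));
        }
    }
}
```

  Allocating the model arrays on 16-byte boundaries up front would let
the aligned path be taken every time.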

  I am busy until about Feb 13 on some other work, but I would be happy
to try to work on some of this with you if you would like.

Joe

On Wed, 2003-01-29 at 11:03, David Huen wrote:
> I have attempted an SSE2 implementation of HMMER.  The starting point was 
> Erik Lindahl's code.  The testbed was a Willamette-type 1.7 GHz P4 running 
> on an i845 motherboard with 512MB DDR RAM.  The compiler was gcc-3.3, which 
> has implemented (slightly buggily) the SSE2 intrinsics.
> 
> The main showstopper is the absence of a 32-bit integer vector max 
> instruction in SSE2 - there is one for bytes/shorts/floats/doubles but 
> not for 32 bit ints.  As the 32-bit int is the standard basis of the HMMER 
> Viterbi algorithm and a vector max is required eight times during the inner 
> loop, there have been severe consequences.  Its replacement requires five 
> SSE2 instructions.
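
  For illustration, one common way to emulate the missing instruction
is a compare-and-select.  This is a sketch under that assumption, not
the sequence from David's code; the names max_epi32_sse2 and max4_i32
are made up.  It takes four intrinsics (compare, and, and-not, or), and
since the two-operand SSE2 encodings overwrite one source, a compiler
will typically add a register copy, which may be where a count of five
comes from.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Per-lane signed 32-bit max using only SSE2 operations.
 * A compare mask selects a where a > b and b everywhere else. */
static __m128i max_epi32_sse2(__m128i a, __m128i b)
{
    __m128i gt     = _mm_cmpgt_epi32(a, b);   /* all-ones lanes where a > b */
    __m128i keep_a = _mm_and_si128(gt, a);    /* a in lanes where a > b     */
    __m128i keep_b = _mm_andnot_si128(gt, b); /* (~gt) & b: b elsewhere     */
    return _mm_or_si128(keep_a, keep_b);      /* merge the two selections   */
}

/* Hypothetical wrapper: out[i] = max(x[i], y[i]) for four ints. */
static void max4_i32(const int32_t *x, const int32_t *y, int32_t *out)
{
    __m128i a = _mm_loadu_si128((const __m128i *)x);
    __m128i b = _mm_loadu_si128((const __m128i *)y);
    _mm_storeu_si128((__m128i *)out, max_epi32_sse2(a, b));
}
```

  With a max needed eight times per inner-loop iteration, the extra
instructions add up quickly.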
> 
> The net impact is that in comparison to the standard HMMER code, the speedup 
> achieved is 1.9x.  Half of this comes about solely from the modifications 
> Erik Lindahl made to the data structures and the remainder from SSE 
> accelerations (a part of this was achieved by running both the vector unit 
> and the ALU simultaneously on different parts of the problem).
> 
> While some more changes are possible (e.g. cacheline aligned structures and 
> prefetch), I think there is little prospect of achieving large improvements 
> in the Viterbi SSE2 implementation.   For the current algorithm, the CPU of 
> choice has to be the Motorola G4.  
> 
> I have looked into possible reasons for the disappointing speedup.  Perhaps 
> the key is that the MMX_ALU that executes the SSE2 ISA has 
> latency/throughput of 2/2 for most key ops (2 cycles latency/new op every 
> two cycles).  OTOH, the normal integer ALU is double-pumped and capable of 
> 2 ops a cycle on some instructions.  In a VERY crude sense therefore, the 
> MMX_ALU and normal ALU are equivalent in speed for quad doubleword ints and 
> shifting processing from one to the other is unlikely to yield stupendous 
> improvements. In contrast, the MPC7455 has latency/throughput of 1/1 for 
> many ops and moving operations to the vector unit is crudely equivalent to 
> a 4x speedup relative to the normal ALU.  Similarly, current P4s would 
> require double the clockspeed for the MMX_ALU to match the MPC7455 Altivec 
> unit.
> 
> To explore the possibility of further speedup, I converted the model 
> matrices to 32-bit floats so a vector max instruction was available.  Under 
> these circumstances, an acceleration of 2.7x was achieved.  Some further 
> experimentation is possible with reordering operations but I don't envisage 
> gains equivalent to those observed with the G4.
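
  As a rough picture of what the float variant buys: with 32-bit
floats the max becomes a single instruction (maxps, via _mm_max_ps).
The sketch below shows one max-of-two-sums step on four lanes.  It is
an assumption about the general shape of such a step, not David's code,
and the function and parameter names are hypothetical.

```c
#include <xmmintrin.h>  /* SSE single-precision intrinsics */

/* Hypothetical step on four lanes:
 *   best[i] = max(prev_m[i] + t_mm[i], prev_i[i] + t_im[i]) */
static void viterbi_step4_f32(const float *prev_m, const float *t_mm,
                              const float *prev_i, const float *t_im,
                              float *best)
{
    __m128 from_m = _mm_add_ps(_mm_loadu_ps(prev_m), _mm_loadu_ps(t_mm));
    __m128 from_i = _mm_add_ps(_mm_loadu_ps(prev_i), _mm_loadu_ps(t_im));

    /* One maxps replaces the multi-instruction integer emulation. */
    _mm_storeu_ps(best, _mm_max_ps(from_m, from_i));
}
```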
> 
> As for imminent changes in the x86 platform, the Opteron does not have a 
> 32-bit int vector max either.  We will have to await the release of 
> documentation before we know whether its MMX_ALU will operate with a better 
> latency/throughput than the P4.  The Gallatin variant of the P4 just has 
> a larger L2 cache.  The impending Prescott is known to include new SSE 
> instructions but what those might be remains unknown.
> 
> I do not know whether the limited speedup makes this code worth cleaning up 
> for release.  The integer code will require the gcc-3.3 (in code freeze) 
> compiler for SSE2 support or perhaps the Intel compiler.  The inline 
> assembler makes it untidy and in need of cleanup.  The floating version has 
> only had limited testing - I am surprised it worked at all - and it will 
> not integrate well with the existing code base.
> 
> Regards,
> David Huen, Univ. of Cambridge
-- 
Joseph Landman <landman at scalableinformatics.com>