Tech Tip – Efficient loop optimization with unrolling and Duff’s device

One popular method of optimizing loops is to 'unroll' the loop so that more work is done for each pass through the loop. For instance, consider the following (trivial) example that accumulates the values of the elements in an array:
      long total = 0;
      unsigned char data[256];
      for ( int i = 0 ; i < 256 ; i++ ){
            total += data[i];
      }
With so little occurring in the loop body, it's not hard to believe that the overhead of the loop itself is greater than the work of accumulating the data. Generating the assembly code (/FAs for Microsoft compilers) produces the following:
      mov   DWORD PTR _i$5081[ebp], 0
      jmp   SHORT $LN3@Loop1
$LN2@Loop1:
      mov   eax, DWORD PTR _i$5081[ebp]
      add   eax, 1
      mov   DWORD PTR _i$5081[ebp], eax
$LN3@Loop1:
      cmp   DWORD PTR _i$5081[ebp], 256         ; 00000100H
      jge   SHORT $LN1@Loop1
; 12   :          total += data[i];
      mov   eax, DWORD PTR _i$5081[ebp]
      movzx ecx, BYTE PTR _data$[ebp+eax]
      add   ecx, DWORD PTR _total$[ebp]
      mov   DWORD PTR _total$[ebp], ecx
; 13   :    }
      jmp   SHORT $LN2@Loop1
$LN1@Loop1:
(In the above and following assembly listings, the loop overhead is highlighted in green and the actual work is shown in yellow.)
If we assume that the cost of each instruction is the same (which it isn't, but the assumption simplifies the discussion), then we could say that the loop overhead is twice that of the actual work, based on there being twice as many overhead instructions as work instructions. We can also calculate the total work as 12 + (11 * 255) = 2817 instructions.
We need a way to do more work in each pass of the loop; consider the following:
      long total = 0;
      unsigned char data[256];
      for ( int i = 0 […]
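One way the loop might be rewritten (an illustrative sketch with an assumed unroll factor of four, not necessarily the exact code from the original listing) is to replicate the body so that each pass performs four accumulations for a single round of loop overhead:

      long total = 0;
      unsigned char data[256];
      /* Unrolled by a factor of four: 256 divides evenly by 4, so this
         particular loop needs no remainder handling. */
      for ( int i = 0 ; i < 256 ; i += 4 ){
            total += data[i];
            total += data[i + 1];
            total += data[i + 2];
            total += data[i + 3];
      }

When the element count is not known to be a multiple of the unroll factor, Duff's device interleaves the remainder handling with the unrolled body by jumping into the middle of it. The sketch below (using a hypothetical sum_bytes helper and assuming count is greater than zero) applies the same idea to the accumulation:

      /* Duff's-device-style accumulator (illustrative sketch).
         Assumes count > 0: the switch jumps to the case matching the
         remainder, then the do/while loops over full groups of eight. */
      long sum_bytes( const unsigned char *p, int count ){
            long total = 0;
            int n = ( count + 7 ) / 8;    /* number of passes through the loop */
            switch ( count % 8 ) {
            case 0: do { total += *p++;
            case 7:      total += *p++;
            case 6:      total += *p++;
            case 5:      total += *p++;
            case 4:      total += *p++;
            case 3:      total += *p++;
            case 2:      total += *p++;
            case 1:      total += *p++;
                    } while ( --n > 0 );
            }
            return total;
      }

Both variants trade a small increase in code size for fewer compare/branch/increment instructions per element processed.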

By | July 9th, 2013 | Best Practices, Software |

Performance Matters #1, or, effective optimization requires focus

My first foray into performance optimization came when attempting to write computer games for the Intel 80x86 family. The clunky graphics of CGA and MCGA gave way to the higher resolutions (but fewer colours) of EGA and VGA. For PC games at that golden point in time, though, it was all about 320x200x8bpp, and more specifically the infamous ModeX.

This post isn’t about ModeX, or any of the other modes really, it’s about how the drive to produce faster graphics to enable more realistic animation and game play led to a lifelong obsession with getting every last piece of performance out of a usually reluctant piece of hardware.

But enough history; needless to say, the intent of these posts is to identify approaches and techniques, or perhaps establish some principles, that may help you the next time you are asked to go and optimize something. When I was spending all my time cranking through different assembly implementations of sprite-drawing routines, looking for the magic combination of opcodes (CPU instructions) to yield the best possible performance, I was guilty of the cardinal sin of Performance Optimization – I optimized the code that I found most interesting rather than the code that most needed optimizing.

What I should have done was determine how much of the game loop was actually being spent rendering sprites versus other activities such as maintaining object lists, handling input, updating state and so on. Those results would have told me exactly where the time was being spent, and where I should focus the optimization.

Back in those dark ages, none of the tools that we now take for granted existed, and those that were available were very expensive. So the measurement code I […]

By | July 9th, 2013 | Uncategorized |