Tech Tip – Efficient loop optimization with unrolling and Duff’s device
Efficient loop optimization with loop unrolling and Duff’s device
One popular method of optimizing loops is to ‘unroll’ the loop so that more work is done for each cycle through the loop.
For instance consider the following (trivial) example that accumulates the values of the elements in an array:
long total = 0;
unsigned char data[256];
for ( int i = 0 ; i < 256 ; i ++ ){
total += data[i];
}
With so little occurring in the loop body it’s not hard to believe that the overhead of the loop itself is greater than the work accumulating the data. Generating the assembly code (/Fas for Microsoft compilers) produces the following:
mov DWORD PTR _i$5081[ebp], 0
jmp SHORT $LN3@Loop1
$LN2@Loop1:
mov eax, DWORD PTR _i$5081[ebp]
add eax, 1
mov DWORD PTR _i$5081[ebp], eax
$LN3@Loop1:
cmp DWORD PTR _i$5081[ebp], 256 ; 00000100H
jge SHORT $LN1@Loop1
; 12 : total += data[i];
mov eax, DWORD PTR _i$5081[ebp]
movzx ecx, BYTE PTR _data$[ebp+eax]
add ecx, DWORD PTR _total$[ebp]
mov DWORD PTR _total$[ebp], ecx
; 13 : }
jmp SHORT $LN2@Loop1
$LN1@Loop1:
(In the above and following assembly listings, loop overhead is highlighted in green and the actual work shown in yellow)
If we assume that the cost of each instruction is the same (which it isn’t but will simplify the discussion) then we could say that the loop overhead is twice that of the actual work (based on there being twice as many instructions). We can also calculate the total work 12 + ( 11 * 255) = 2817 instructions.
We need a way to do more work in the loop, consider the following:
long total = 0;
unsigned char data[256];
for ( int i = 0 […]