Efficient loop optimization with loop unrolling and Duff’s device
One popular method of optimizing loops is to ‘unroll’ the loop so that more work is done for each cycle through the loop.
For instance, consider the following (trivial) example, which accumulates the values of the elements in an array:
long total = 0;
unsigned char data[256];
for ( int i = 0 ; i < 256 ; i++ ) {
    total += data[i];
}
With so little occurring in the loop body it's not hard to believe that the overhead of the loop itself is greater than the work of accumulating the data. Generating the assembly code (/Fas for Microsoft compilers) produces the following:
mov DWORD PTR _i$5081[ebp], 0
jmp SHORT $LN3@Loop1
$LN2@Loop1:
mov eax, DWORD PTR _i$5081[ebp]
add eax, 1
mov DWORD PTR _i$5081[ebp], eax
$LN3@Loop1:
cmp DWORD PTR _i$5081[ebp], 256 ; 00000100H
jge SHORT $LN1@Loop1
; 12 : total += data[i];
mov eax, DWORD PTR _i$5081[ebp]
movzx ecx, BYTE PTR _data$[ebp+eax]
add ecx, DWORD PTR _total$[ebp]
mov DWORD PTR _total$[ebp], ecx
; 13 : }
jmp SHORT $LN2@Loop1
$LN1@Loop1:
(In the above and following assembly listings, the loop overhead is the counter initialization, increment, compare, and jump instructions; the actual work is the block that loads each byte and adds it to the total.)
If we assume that the cost of each instruction is the same (which it isn't, but this simplifies the discussion), then we could say that the loop overhead is roughly twice that of the actual work, based on the relative instruction counts per iteration. We can also calculate the total work as 12 + (11 * 255) = 2817 instructions.
We need a way to do more work per pass through the loop; consider the following:
long total = 0;
unsigned char data[256];
for ( int i = 0 ; i < 256 ; i += 4 ) {
    total += data[i];
    total += data[i+1];
    total += data[i+2];
    total += data[i+3];
}
This produces the following output:
mov DWORD PTR _total$[ebp], 0
; 21 : unsigned char data[256];
; 22 :
; 23 : for ( int i = 0 ; i < 256 ; i += 4 ){
mov DWORD PTR _i$5090[ebp], 0
jmp SHORT $LN3@Loop2
$LN2@Loop2:
mov eax, DWORD PTR _i$5090[ebp]
add eax, 4
mov DWORD PTR _i$5090[ebp], eax
$LN3@Loop2:
cmp DWORD PTR _i$5090[ebp], 256 ; 00000100H
jge SHORT $LN1@Loop2
; 24 : total += data[i];
mov eax, DWORD PTR _i$5090[ebp]
movzx ecx, BYTE PTR _data$[ebp+eax]
add ecx, DWORD PTR _total$[ebp]
mov DWORD PTR _total$[ebp], ecx
; 25 : total += data[i+1];
mov eax, DWORD PTR _i$5090[ebp]
movzx ecx, BYTE PTR _data$[ebp+eax+1]
add ecx, DWORD PTR _total$[ebp]
mov DWORD PTR _total$[ebp], ecx
; 26 : total += data[i+2];
mov eax, DWORD PTR _i$5090[ebp]
movzx ecx, BYTE PTR _data$[ebp+eax+2]
add ecx, DWORD PTR _total$[ebp]
mov DWORD PTR _total$[ebp], ecx
; 27 : total += data[i+3];
mov eax, DWORD PTR _i$5090[ebp]
movzx ecx, BYTE PTR _data$[ebp+eax+3]
add ecx, DWORD PTR _total$[ebp]
mov DWORD PTR _total$[ebp], ecx
; 28 : }
jmp SHORT $LN2@Loop2
$LN1@Loop2:
Again, assuming the cost of each instruction is the same, the loop overhead is now a much smaller fraction of the actual work. We can also calculate the total work as 24 + (16 * 64) = 1048 instructions, less than half that of the original implementation.
This works very well, but as currently expressed the degree of loop unrolling is limited by the requirement that the number of loop iterations be an exact multiple of the unroll factor. That is, if you want to unroll 8 times (8 accumulations into total in this case), then the length of the data you are accumulating has to be a multiple of 8. Depending on the dataset this is not usually an issue, but sometimes we aren't at liberty to rework the data.
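Before reaching for Duff's device, it is worth noting the conventional workaround: pair the unrolled loop with a short cleanup loop for the leftover elements. The sketch below assumes the same byte-accumulation task as above; sum_unrolled is an illustrative name, not part of the original listings.

```c
#include <stddef.h>

/* Illustrative sketch: 4x-unrolled accumulation with a cleanup loop,
 * so that count need not be a multiple of the unroll factor. */
long sum_unrolled(const unsigned char *data, size_t count)
{
    long total = 0;
    size_t i = 0;

    /* Main unrolled loop: processes count rounded down to a multiple of 4. */
    for (; i + 4 <= count; i += 4) {
        total += data[i];
        total += data[i + 1];
        total += data[i + 2];
        total += data[i + 3];
    }

    /* Cleanup loop: at most 3 remaining elements. */
    for (; i < count; i++)
        total += data[i];

    return total;
}
```

The cleanup loop costs at most three extra iterations, but it introduces a second loop; Duff's device instead folds the remainder handling into the unrolled body itself.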
Duff's device is an efficient mechanism originally devised to optimize serial copies, but we can leverage its workings to gain as much benefit from unrolling as possible even when the length is not an exact multiple. For more information see http://en.wikipedia.org/wiki/Duff's_device
long total = 0;
unsigned char data[256];
/* count holds the number of elements to process (assumed > 0) */
int n = (count + 3) / 4;
int dataIndex = 0;
switch (count % 4) {
case 0: do { total += data[dataIndex++];
case 3:      total += data[dataIndex++];
case 2:      total += data[dataIndex++];
case 1:      total += data[dataIndex++];
        } while (--n > 0);
}
return total;
; 40 : int dataIndex = 0;
mov DWORD PTR _dataIndex$[ebp], 0
; 41 :
; 42 : switch(count % 4) {
mov eax, DWORD PTR _count$[ebp]
and eax, -2147483645 ; 80000003H
jns SHORT $LN15@Loop3
dec eax
or eax, -4 ; fffffffcH
inc eax
$LN15@Loop3:
mov DWORD PTR tv67[ebp], eax
cmp DWORD PTR tv67[ebp], 3
ja $LN8@Loop3
mov ecx, DWORD PTR tv67[ebp]
jmp DWORD PTR $LN16@Loop3[ecx*4]
$LN6@Loop3:
; 43 : case 0: do { total += data[dataIndex++];
mov eax, DWORD PTR _dataIndex$[ebp]
movzx ecx, BYTE PTR _data$[ebp+eax]
add ecx, DWORD PTR _total$[ebp]
mov DWORD PTR _total$[ebp], ecx
mov edx, DWORD PTR _dataIndex$[ebp]
add edx, 1
mov DWORD PTR _dataIndex$[ebp], edx
$LN3@Loop3:
; 44 : case 3: total += data[dataIndex++];
mov eax, DWORD PTR _dataIndex$[ebp]
movzx ecx, BYTE PTR _data$[ebp+eax]
add ecx, DWORD PTR _total$[ebp]
mov DWORD PTR _total$[ebp], ecx
mov edx, DWORD PTR _dataIndex$[ebp]
add edx, 1
mov DWORD PTR _dataIndex$[ebp], edx
$LN2@Loop3:
; 45 : case 2: total += data[dataIndex++];
mov eax, DWORD PTR _dataIndex$[ebp]
movzx ecx, BYTE PTR _data$[ebp+eax]
add ecx, DWORD PTR _total$[ebp]
mov DWORD PTR _total$[ebp], ecx
mov edx, DWORD PTR _dataIndex$[ebp]
add edx, 1
mov DWORD PTR _dataIndex$[ebp], edx
$LN1@Loop3:
; 46 : case 1: total += data[dataIndex++];
mov eax, DWORD PTR _dataIndex$[ebp]
movzx ecx, BYTE PTR _data$[ebp+eax]
add ecx, DWORD PTR _total$[ebp]
mov DWORD PTR _total$[ebp], ecx
mov edx, DWORD PTR _dataIndex$[ebp]
add edx, 1
mov DWORD PTR _dataIndex$[ebp], edx
; 47 : } while(--n > 0);
mov eax, DWORD PTR _n$[ebp]
sub eax, 1
mov DWORD PTR _n$[ebp], eax
cmp DWORD PTR _n$[ebp], 0
jg $LN6@Loop3
$LN8@Loop3:
Again, assuming the cost of each instruction is the same, we can calculate the total work as 17 + (28 * 64) = 1809 instructions, which is roughly midway between the first and second versions.
Both versions have their place, and the best results come from using both, depending on the nature of your data. A simple test that determines whether the length of the data you need to process is an integral multiple of your unroll factor can direct execution to either Duff's version or the fully unrolled version.
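That dispatch might be sketched as follows, again assuming the byte-accumulation task used throughout; sum_bytes, sum_duff, and sum_unrolled4 are hypothetical names introduced for illustration.

```c
#include <stddef.h>

/* Duff's-device accumulator; assumes count > 0. */
long sum_duff(const unsigned char *data, size_t count)
{
    long total = 0;
    size_t n = (count + 3) / 4;
    size_t i = 0;

    switch (count % 4) {
    case 0: do { total += data[i++];
    case 3:      total += data[i++];
    case 2:      total += data[i++];
    case 1:      total += data[i++];
            } while (--n > 0);
    }
    return total;
}

/* Fully tailored accumulator; assumes count is a multiple of 4. */
long sum_unrolled4(const unsigned char *data, size_t count)
{
    long total = 0;
    for (size_t i = 0; i < count; i += 4)
        total += data[i] + data[i + 1] + data[i + 2] + data[i + 3];
    return total;
}

/* Dispatch: use the tailored loop when the length divides evenly,
 * otherwise fall back to Duff's device. */
long sum_bytes(const unsigned char *data, size_t count)
{
    if (count == 0)
        return 0;
    return (count % 4 == 0) ? sum_unrolled4(data, count)
                            : sum_duff(data, count);
}
```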