Efficient loop optimization with loop unrolling and Duff’s device

One popular method of optimizing loops is to ‘unroll’ the loop so that more work is done for each cycle through the loop.

For instance, consider the following (trivial) example that accumulates the values of the elements in an array:

long total = 0;
unsigned char data[256];

for ( int i = 0 ; i < 256 ; i++ ){
    total += data[i];
}

With so little occurring in the loop body, it's not hard to believe that the overhead of the loop itself is greater than the work of accumulating the data. Generating the assembly code (/FAs for Microsoft compilers) produces the following:

mov   DWORD PTR _i$5081[ebp], 0
jmp   SHORT $LN3@Loop1
$LN2@Loop1:
mov   eax, DWORD PTR _i$5081[ebp]
add   eax, 1
mov   DWORD PTR _i$5081[ebp], eax
$LN3@Loop1:
cmp   DWORD PTR _i$5081[ebp], 256         ; 00000100H
jge   SHORT $LN1@Loop1
; 12   :          total += data[i];
mov   eax, DWORD PTR _i$5081[ebp]
movzx ecx, BYTE PTR _data$[ebp+eax]
add   ecx, DWORD PTR _total$[ebp]
mov   DWORD PTR _total$[ebp], ecx
; 13   :    }
jmp   SHORT $LN2@Loop1
$LN1@Loop1:

(In the above and following assembly listings, the loop overhead is the initialisation, increment, compare, and branch instructions; the actual work is the instructions that load each byte and accumulate it into total.)

If we assume that the cost of each instruction is the same (which it isn't, but the assumption simplifies the discussion) then we could say that the loop overhead is roughly twice that of the actual work, based on there being twice as many overhead instructions as work instructions. We can also calculate the total work: 12 + ( 11 * 255 ) = 2817 instructions.

We need a way to do more work in each pass through the loop; consider the following:

long total = 0;
unsigned char data[256];

for ( int i = 0 ; i < 256 ; i += 4 ){
    total += data[i];
    total += data[i+1];
    total += data[i+2];
    total += data[i+3];
}

This produces the following output:

mov   DWORD PTR _total$[ebp], 0
; 21   :    unsigned char data[256];
; 22   :
; 23   :    for ( int i = 0 ; i < 256 ; i += 4 ){
mov   DWORD PTR _i$5090[ebp], 0
jmp   SHORT $LN3@Loop2
$LN2@Loop2:
mov   eax, DWORD PTR _i$5090[ebp]
add   eax, 4
mov   DWORD PTR _i$5090[ebp], eax
$LN3@Loop2:
cmp   DWORD PTR _i$5090[ebp], 256         ; 00000100H
jge   SHORT $LN1@Loop2
; 24   :          total += data[i];
mov   eax, DWORD PTR _i$5090[ebp]
movzx ecx, BYTE PTR _data$[ebp+eax]
add   ecx, DWORD PTR _total$[ebp]
mov   DWORD PTR _total$[ebp], ecx
; 25   :          total += data[i+1];
mov   eax, DWORD PTR _i$5090[ebp]
movzx ecx, BYTE PTR _data$[ebp+eax+1]
add   ecx, DWORD PTR _total$[ebp]
mov   DWORD PTR _total$[ebp], ecx
; 26   :          total += data[i+2];
mov   eax, DWORD PTR _i$5090[ebp]
movzx ecx, BYTE PTR _data$[ebp+eax+2]
add   ecx, DWORD PTR _total$[ebp]
mov   DWORD PTR _total$[ebp], ecx
; 27   :          total += data[i+3];
mov   eax, DWORD PTR _i$5090[ebp]
movzx ecx, BYTE PTR _data$[ebp+eax+3]
add   ecx, DWORD PTR _total$[ebp]
mov   DWORD PTR _total$[ebp], ecx
; 28   :    }
jmp   SHORT $LN2@Loop2
$LN1@Loop2:

Again, if we assume that the cost of each instruction is the same, we could say that the loop overhead is now roughly half that of the actual work, based on there being half as many overhead instructions as work instructions. We can also calculate the total work: 24 + ( 16 * 64 ) = 1048 instructions, less than half that of the original implementation.

This works very well, but as it's currently expressed the degree of loop unrolling is limited by the need for the number of loop iterations to be an exact multiple of the number of unrolls in the loop. That is, if you want to unroll 8 times (8 additions to total in this case) then the length of the data you are accumulating has to be a multiple of 8. Depending on the dataset this is not usually an issue, but sometimes we aren't at liberty to rework the data.
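One standard workaround, when we can't pad the data, is to pair the unrolled loop with a small cleanup loop that handles the leftover elements one at a time. A minimal sketch, assuming an 8x unroll (the function name is illustrative):

```c
#include <stddef.h>

/* Sum 'count' bytes with an 8x-unrolled main loop plus a cleanup loop
   for the 0-7 leftover elements. */
long sum_with_cleanup(const unsigned char *data, int count) {
    long total = 0;
    int i = 0;

    /* Main loop: 8 elements per iteration, runs while a full block remains. */
    for (; i + 8 <= count; i += 8) {
        total += data[i];
        total += data[i + 1];
        total += data[i + 2];
        total += data[i + 3];
        total += data[i + 4];
        total += data[i + 5];
        total += data[i + 6];
        total += data[i + 7];
    }

    /* Cleanup loop: at most 7 iterations for the remainder. */
    for (; i < count; i++)
        total += data[i];

    return total;
}
```

The bulk of the data still gets the benefit of the unroll, and the cleanup loop's cost is bounded. Duff's device, discussed next, achieves a similar effect by folding the remainder handling into the loop itself.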

Duff's device is an efficient mechanism originally devised to optimize a serial copy, but we can leverage its workings to gain as much benefit from unrolling as possible. For more information see http://en.wikipedia.org/wiki/Duff%27s_device

long total = 0;
unsigned char data[256];

int n = (count + 3) / 4;
int dataIndex = 0;

switch(count % 4) {
case 0: do {    total += data[dataIndex++];
case 3:         total += data[dataIndex++];
case 2:         total += data[dataIndex++];
case 1:         total += data[dataIndex++];
            } while(--n > 0);
}

return (total);

; 40   :    int dataIndex = 0;
mov   DWORD PTR _dataIndex$[ebp], 0
; 41   :
; 42   :    switch(count % 4) {
mov   eax, DWORD PTR _count$[ebp]
and   eax, -2147483645                    ; 80000003H
jns   SHORT $LN15@Loop3
dec   eax
or    eax, -4                             ; fffffffcH
inc   eax
$LN15@Loop3:
mov   DWORD PTR tv67[ebp], eax
cmp   DWORD PTR tv67[ebp], 3
ja    $LN8@Loop3
mov   ecx, DWORD PTR tv67[ebp]
jmp   DWORD PTR $LN16@Loop3[ecx*4]
$LN6@Loop3:
; 43   :          case 0: do {    total += data[dataIndex++];
mov   eax, DWORD PTR _dataIndex$[ebp]
movzx ecx, BYTE PTR _data$[ebp+eax]
add   ecx, DWORD PTR _total$[ebp]
mov   DWORD PTR _total$[ebp], ecx
mov   edx, DWORD PTR _dataIndex$[ebp]
add   edx, 1
mov   DWORD PTR _dataIndex$[ebp], edx
$LN3@Loop3:
; 44   :          case 3:         total += data[dataIndex++];
mov   eax, DWORD PTR _dataIndex$[ebp]
movzx ecx, BYTE PTR _data$[ebp+eax]
add   ecx, DWORD PTR _total$[ebp]
mov   DWORD PTR _total$[ebp], ecx
mov   edx, DWORD PTR _dataIndex$[ebp]
add   edx, 1
mov   DWORD PTR _dataIndex$[ebp], edx
$LN2@Loop3:
; 45   :          case 2:         total += data[dataIndex++];
mov   eax, DWORD PTR _dataIndex$[ebp]
movzx ecx, BYTE PTR _data$[ebp+eax]
add   ecx, DWORD PTR _total$[ebp]
mov   DWORD PTR _total$[ebp], ecx
mov   edx, DWORD PTR _dataIndex$[ebp]
add   edx, 1
mov   DWORD PTR _dataIndex$[ebp], edx
$LN1@Loop3:
; 46   :          case 1:         total += data[dataIndex++];
mov   eax, DWORD PTR _dataIndex$[ebp]
movzx ecx, BYTE PTR _data$[ebp+eax]
add   ecx, DWORD PTR _total$[ebp]
mov   DWORD PTR _total$[ebp], ecx
mov   edx, DWORD PTR _dataIndex$[ebp]
add   edx, 1
mov   DWORD PTR _dataIndex$[ebp], edx
; 47   :                      }while(--n > 0);
mov   eax, DWORD PTR _n$[ebp]
sub   eax, 1
mov   DWORD PTR _n$[ebp], eax
cmp   DWORD PTR _n$[ebp], 0
jg    $LN6@Loop3
$LN8@Loop3:

Again, if we assume that the cost of each instruction is the same, we can calculate the total work: 17 + ( 28 * 64 ) = 1809 instructions, about midway between the first and second versions. The switch dispatch and the extra per-case bookkeeping on dataIndex cost us some of the benefit of the straight unroll, but in exchange the routine handles any positive count.

Both versions have their place, and the best results come from using both, depending on the nature of your data. A simple test that determines whether the length of the data you need to process is an integral multiple of your unroll amount can direct execution to either Duff's version or the fully unrolled version.
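That dispatch might look something like the following sketch (the function names and the 4x unroll factor are illustrative assumptions, not from the listings above):

```c
/* Fully unrolled path: assumes count is a multiple of 4. */
static long sum_unrolled4(const unsigned char *data, int count) {
    long total = 0;
    for (int i = 0; i < count; i += 4) {
        total += data[i];
        total += data[i + 1];
        total += data[i + 2];
        total += data[i + 3];
    }
    return total;
}

/* Duff's-device path: handles any positive count. */
static long sum_duff4(const unsigned char *data, int count) {
    long total = 0;
    int n = (count + 3) / 4;
    int dataIndex = 0;
    switch (count % 4) {
    case 0: do {    total += data[dataIndex++];
    case 3:         total += data[dataIndex++];
    case 2:         total += data[dataIndex++];
    case 1:         total += data[dataIndex++];
                } while (--n > 0);
    }
    return total;
}

/* Dispatch on whether count is an exact multiple of the unroll factor. */
long sum_bytes(const unsigned char *data, int count) {
    if (count <= 0)
        return 0;
    if (count % 4 == 0)
        return sum_unrolled4(data, count);
    return sum_duff4(data, count);
}
```

Note the count <= 0 guard: the do/while in the Duff path always executes at least once, so it must not be entered with a zero count.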