Efficient loop optimization with loop unrolling and Duff’s device

One popular method of optimizing loops is to 'unroll' the loop so that more work is done on each pass through it, reducing the relative cost of the loop's own bookkeeping.

For instance, consider the following (trivial) example that accumulates the values of the elements in an array:

      long total = 0;
      unsigned char data[256];

      for ( int i = 0; i < 256; i++ ) {
            total += data[i];
      }

With so little occurring in the loop body it's not hard to believe that the overhead of the loop itself is greater than the work of accumulating the data. Generating the assembly listing (/FAs for Microsoft compilers) produces the following:

      mov   DWORD PTR _i$5081[ebp], 0

      jmp   SHORT $LN3@Loop1

$LN2@Loop1:

      mov   eax, DWORD PTR _i$5081[ebp]

      add   eax, 1

      mov   DWORD PTR _i$5081[ebp], eax

$LN3@Loop1:

      cmp   DWORD PTR _i$5081[ebp], 256         ; 00000100H

      jge   SHORT $LN1@Loop1

; 12   :          total += data[i];

       mov   eax, DWORD PTR _i$5081[ebp]

      movzx ecx, BYTE PTR _data$[ebp+eax]

      add   ecx, DWORD PTR _total$[ebp]

      mov   DWORD PTR _total$[ebp], ecx

 ; 13   :    }

       jmp   SHORT $LN2@Loop1

$LN1@Loop1:

(In the above and following assembly listings, the instructions immediately after each `total += ...` source-line comment are the actual work; the remaining instructions, which increment, test, and branch, are loop overhead.)

If we assume that the cost of each instruction is the same (it isn't, but the assumption simplifies the discussion), the loop overhead exceeds the actual work: each iteration spends six instructions incrementing, testing, and branching against four instructions of accumulation. We can also estimate the total work as 12 + (11 * 255) = 2817 instructions.

We need a way to do more work per trip around the loop. Consider the following:

      long total = 0;
      unsigned char data[256];

      for ( int i = 0; i < 256; i += 4 ) {
            total += data[i];
            total += data[i+1];
            total += data[i+2];
            total += data[i+3];
      }

 This produces the following output:

      mov   DWORD PTR _total$[ebp], 0

 ; 21   :    unsigned char data[256];

; 22   :

; 23   :    for ( int i = 0 ; i < 256 ; i += 4 ){

 

      mov   DWORD PTR _i$5090[ebp], 0

      jmp   SHORT $LN3@Loop2

$LN2@Loop2:

      mov   eax, DWORD PTR _i$5090[ebp]

      add   eax, 4

      mov   DWORD PTR _i$5090[ebp], eax

$LN3@Loop2:

      cmp   DWORD PTR _i$5090[ebp], 256         ; 00000100H

      jge   SHORT $LN1@Loop2

 ; 24   :          total += data[i];

      mov   eax, DWORD PTR _i$5090[ebp]

      movzx ecx, BYTE PTR _data$[ebp+eax]

      add   ecx, DWORD PTR _total$[ebp]

      mov   DWORD PTR _total$[ebp], ecx

; 25   :          total += data[i+1];

      mov   eax, DWORD PTR _i$5090[ebp]

      movzx ecx, BYTE PTR _data$[ebp+eax+1]

      add   ecx, DWORD PTR _total$[ebp]

      mov   DWORD PTR _total$[ebp], ecx

; 26   :          total += data[i+2];

      mov   eax, DWORD PTR _i$5090[ebp]

      movzx ecx, BYTE PTR _data$[ebp+eax+2]

      add   ecx, DWORD PTR _total$[ebp]

      mov   DWORD PTR _total$[ebp], ecx

; 27   :          total += data[i+3];

      mov   eax, DWORD PTR _i$5090[ebp]

      movzx ecx, BYTE PTR _data$[ebp+eax+3]

      add   ecx, DWORD PTR _total$[ebp]

      mov   DWORD PTR _total$[ebp], ecx

 ; 28   :    }

      jmp   SHORT $LN2@Loop2

Again assuming that each instruction costs the same, the balance has reversed: the increment, test, and branch now run only once for every four accumulations, so the loop overhead is a fraction of the actual work. By the same accounting as before, the total work is 24 + (16 * 64) = 1048 instructions, less than half that of the original implementation.

This works very well, but as currently expressed the degree of loop unrolling is limited by the need for the number of loop iterations to be an exact multiple of the unroll factor. That is, if you want to unroll 8 times (8 accumulations into total in this case) then the number of elements you are accumulating has to be a multiple of 8. Depending on the dataset this is often not an issue, but sometimes we aren't at liberty to rework the data.
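One common way around this limitation is to unroll the bulk of the work and absorb the leftover elements with a short scalar loop. The sketch below assumes an unroll factor of 4 and a `count` parameter; the `accumulate_unrolled` name is illustrative, not from the original:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch: unroll by 4, then let a short "epilogue" loop absorb the
   0-3 elements left over when count is not a multiple of 4. */
static long accumulate_unrolled(const unsigned char *data, size_t count)
{
    long total = 0;
    size_t i = 0;

    /* Main unrolled body: runs while at least 4 elements remain. */
    for ( ; i + 4 <= count; i += 4 ) {
        total += data[i];
        total += data[i+1];
        total += data[i+2];
        total += data[i+3];
    }

    /* Epilogue: at most 3 trips through the simple loop. */
    for ( ; i < count; i++ ) {
        total += data[i];
    }

    return total;
}
```

This keeps the code portable and easy for a compiler to optimize further, at the cost of a second loop that Duff's device avoids.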

Duff's device is an efficient mechanism originally devised to optimize serial copies, but we can leverage its workings to gain as much benefit from unrolling as possible. For more information see http://en.wikipedia.org/wiki/Duff%27s_device

      long total = 0;
      unsigned char data[256];
      int count = 256;                 /* number of elements to process */

      int n = (count + 3) / 4;
      int dataIndex = 0;

      switch ( count % 4 ) {
            case 0: do {    total += data[dataIndex++];
            case 3:         total += data[dataIndex++];
            case 2:         total += data[dataIndex++];
            case 1:         total += data[dataIndex++];
                    } while ( --n > 0 );
      }

      return ( total );

 ; 40   :    int dataIndex = 0;

      mov   DWORD PTR _dataIndex$[ebp], 0

; 41   :

; 42   :    switch(count % 4) {

      mov   eax, DWORD PTR _count$[ebp]

      and   eax, -2147483645              ; 80000003H

      jns   SHORT $LN15@Loop3

      dec   eax

      or    eax, -4                             ;fffffffcH

      inc   eax

$LN15@Loop3:

      mov   DWORD PTR tv67[ebp], eax

      cmp   DWORD PTR tv67[ebp], 3

      ja    $LN8@Loop3

      mov   ecx, DWORD PTR tv67[ebp]

      jmp   DWORD PTR $LN16@Loop3[ecx*4]

$LN6@Loop3:

 ; 43   :          case 0: do {    total += data[dataIndex++];

      mov   eax, DWORD PTR _dataIndex$[ebp]

      movzx ecx, BYTE PTR _data$[ebp+eax]

      add   ecx, DWORD PTR _total$[ebp]

      mov   DWORD PTR _total$[ebp], ecx

      mov   edx, DWORD PTR _dataIndex$[ebp]

      add   edx, 1

      mov   DWORD PTR _dataIndex$[ebp], edx

$LN3@Loop3:

; 44   :          case 3:         total += data[dataIndex++];

      mov   eax, DWORD PTR _dataIndex$[ebp]

      movzx ecx, BYTE PTR _data$[ebp+eax]

      add   ecx, DWORD PTR _total$[ebp]

      mov   DWORD PTR _total$[ebp], ecx

      mov   edx, DWORD PTR _dataIndex$[ebp]

      add   edx, 1

      mov   DWORD PTR _dataIndex$[ebp], edx

$LN2@Loop3:

; 45   :          case 2:         total += data[dataIndex++];

      mov   eax, DWORD PTR _dataIndex$[ebp]

      movzx ecx, BYTE PTR _data$[ebp+eax]

      add   ecx, DWORD PTR _total$[ebp]

      mov   DWORD PTR _total$[ebp], ecx

      mov   edx, DWORD PTR _dataIndex$[ebp]

      add   edx, 1

      mov   DWORD PTR _dataIndex$[ebp], edx

$LN1@Loop3:

; 46   :          case 1:         total += data[dataIndex++];

      mov   eax, DWORD PTR _dataIndex$[ebp]

      movzx ecx, BYTE PTR _data$[ebp+eax]

      add   ecx, DWORD PTR _total$[ebp]

      mov   DWORD PTR _total$[ebp], ecx

      mov   edx, DWORD PTR _dataIndex$[ebp]

      add   edx, 1

      mov   DWORD PTR _dataIndex$[ebp], edx

; 47   :                      }while(--n > 0);

      mov   eax, DWORD PTR _n$[ebp]

      sub   eax, 1

      mov   DWORD PTR _n$[ebp], eax

      cmp   DWORD PTR _n$[ebp], 0

      jg    $LN6@Loop3

$LN8@Loop3:

Assuming once more that each instruction costs the same, the loop overhead (five instructions to decrement and test n) remains small relative to the work, but each accumulation now costs seven instructions because dataIndex must be incremented every time. By the same accounting, the total work is 17 + (28 * 64) = 1809 instructions, roughly midway between the first and second versions, in exchange for handling any element count.
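Under the same equal-cost assumption, the three estimates can be restated side by side; the constant names below are mine, and the formulas simply repeat the arithmetic from the three listings above:

```c
#include <assert.h>

/* Equal-cost instruction estimates from the three listings above. */
enum {
    SIMPLE_COST   = 12 + 11 * 255,  /* original loop   */
    UNROLLED_COST = 24 + 16 * 64,   /* unrolled by 4   */
    DUFF_COST     = 17 + 28 * 64    /* Duff's device   */
};
```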

Both versions have their place, and the best results come from using both, depending on the nature of your data. A simple test that determines whether the length of the data you need to process is an exact multiple of your unroll factor can direct execution to either Duff's version or the fully unrolled loop.
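That dispatch can be as simple as the sketch below. The function names and the unroll factor of 4 are illustrative; note the explicit guard for an empty input, since the switch-based device would otherwise still execute four accumulations when count is zero:

```c
#include <assert.h>
#include <stddef.h>

#define UNROLL 4

/* Straight unrolled loop: valid only when count is a multiple of UNROLL. */
static long accumulate_plain_unrolled(const unsigned char *data, size_t count)
{
    long total = 0;
    for (size_t i = 0; i < count; i += UNROLL) {
        total += data[i];
        total += data[i+1];
        total += data[i+2];
        total += data[i+3];
    }
    return total;
}

/* Duff's device version: handles any non-zero count. */
static long accumulate_duff(const unsigned char *data, size_t count)
{
    long total = 0;
    size_t dataIndex = 0;
    size_t n = (count + UNROLL - 1) / UNROLL;

    switch ( count % UNROLL ) {
        case 0: do {    total += data[dataIndex++];
        case 3:         total += data[dataIndex++];
        case 2:         total += data[dataIndex++];
        case 1:         total += data[dataIndex++];
                } while ( --n > 0 );
    }
    return total;
}

/* Dispatch: use the fully unrolled loop when the length divides evenly,
   otherwise fall back to Duff's device. */
static long accumulate(const unsigned char *data, size_t count)
{
    if (count == 0)                 /* the device always runs at least once */
        return 0;
    if (count % UNROLL == 0)
        return accumulate_plain_unrolled(data, count);
    return accumulate_duff(data, count);
}
```

Because both paths produce the same result, the choice is purely a performance decision that can be made once, up front, from the data length.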