A look at Halide's SSE2 3x3 Box Filter

Nils L. Corneliusen
24 August 2012

Introduction

Halide has been covered in the tech press lately. The PDF called Decoupling Algorithms from Schedules for Easy Optimization of Image Processing Pipelines is indeed an interesting read. So why do I bring it up here? Well, to have a look at the initial code example used, a 3x3 box filter. Page 1, top right corner, "Fast C++ (for x86)".

Exactly how effective is that code on a modern x86 core? Clearly, steps have been taken to try to make it reasonably efficient, but must the code look like a train wreck to accomplish that? I dust off my i7-950 to do the initial work. Later I'll look at some newer and older Intel chips to find out what's going on here.

Fast C++ for the masses

I'm not into C++, so let's just rewrite it in straight C using uint16_t pointers.

void fast_blur_halide( const uint16_t *in, uint16_t *blurred, int width, int height )
{
    int x, y;
    int xTile, yTile;
    __m128i one_third;
    
    one_third = _mm_set1_epi16(21846);

    for( yTile = 0; yTile < height; yTile += 32 ) {
        __m128i a, b, c;
        __m128i sum, avg;
        __m128i tmp[(256/8)*(32+2)];

        for( xTile = 0; xTile < width; xTile += 256 ) {
            __m128i *tmpPtr = tmp;
            /* First pass: blur each row of the tile horizontally into tmp,
               including one extra row above and below the tile */
            for (y = -1; y < 32+1; y++) {
                const uint16_t *inPtr = &in[(yTile+y)*width + xTile];
                for (x = 0; x < 256; x += 8) {
                    a = _mm_loadu_si128( (__m128i*)(inPtr-1) );
                    b = _mm_loadu_si128( (__m128i*)(inPtr+1) );
                    c = _mm_load_si128(  (__m128i*)(inPtr)   );
                    sum = _mm_add_epi16( _mm_add_epi16( a, b ), c );
                    avg = _mm_mulhi_epi16( sum, one_third );
                    _mm_store_si128( tmpPtr++, avg );
                    inPtr += 8;
                }
            }

            tmpPtr = tmp;

            /* Second pass: blur tmp vertically into the output */
            for (y = 0; y < 32; y++) {
                __m128i *outPtr = (__m128i *)&blurred[(yTile+y)*width + xTile];
                for( x = 0; x < 256; x += 8 ) {
                    a = _mm_load_si128( tmpPtr+(2*256)/8 );
                    b = _mm_load_si128( tmpPtr+256/8     );
                    c = _mm_load_si128( tmpPtr++         );
                    sum = _mm_add_epi16( _mm_add_epi16( a, b ), c );
                    avg = _mm_mulhi_epi16( sum, one_third );
                    _mm_store_si128( outPtr++, avg );
                }
            }
        }
    }
}

As mentioned, they've put some work into finding the correct block sizes; I'll look into that later. The result is an instruction count that's not too bad:

Box filter 3x3 instruction count, 8192*8192 image
      Halide
mul   17301504
add   34603008
loadu 17825792
loada 34078720
store 17301504

Vertical Insanity

So let's try to attack the instruction count first. We try the first extreme: process everything vertically, rolling the per-row results so each row is blurred horizontally only once:

void fast_blur_vertical( const uint16_t *in, uint16_t *blurred, int width, int height )
{
    int x, y;
    __m128i one_third;
    __m128i *dst;
    
    one_third = _mm_set1_epi16(21846);

    for( x = 0; x < width; x += 8 ) {
        __m128i s00, s01, s02, s10, s11, s12, s20, s21, s22;
        __m128i r00, r10, r20;
        const uint16_t *row0, *rowp, *rown;

        row0 = in + x;
        rowp = row0 - width;
        rown = row0 + width;

        dst = (__m128i *)(blurred + x);

        s00 = _mm_loadu_si128( (__m128i*)(rowp-1) );
        s01 = _mm_loadu_si128( (__m128i*)(rowp+1) );
        s02 = _mm_load_si128(  (__m128i*)(rowp) );
        r00 = _mm_mulhi_epi16( _mm_add_epi16( _mm_add_epi16( s00, s01 ), s02 ), one_third );

        s10 = _mm_loadu_si128( (__m128i*)(row0-1) );
        s11 = _mm_loadu_si128( (__m128i*)(row0+1) );
        s12 = _mm_load_si128(  (__m128i*)(row0) );
        r10 = _mm_mulhi_epi16( _mm_add_epi16( _mm_add_epi16( s10, s11 ), s12 ), one_third );

        for( y = 0; y < height; y++ ) {

            s20 = _mm_loadu_si128( (__m128i*)(rown-1) );
            s21 = _mm_loadu_si128( (__m128i*)(rown+1) );
            s22 = _mm_load_si128(  (__m128i*)(rown) );
            r20 = _mm_mulhi_epi16( _mm_add_epi16( _mm_add_epi16( s20, s21 ), s22 ), one_third );

            _mm_store_si128( dst, _mm_mulhi_epi16( _mm_add_epi16( _mm_add_epi16( r00, r10 ), r20 ), one_third ) );

            r00 = r10;   /* roll the blurred-row window down one row */
            r10 = r20;

            rown += width;
            dst  += (width>>3);

        }

    }

}

Now we have two versions to compare. This gives the following instruction count:

Box filter 3x3 instruction count, 8192*8192 image
      Halide   Vertical
mul   17301504  8390656
add   34603008 16781312
loadu 17825792 16781312
loada 34078720  8390656
store 17301504  8388608

That's pretty sweet. Unfortunately, it all goes to hell when measuring execution time: it's more than 4 times slower on the i7-950, presumably because each 8-pixel strip walks the full height of the image and thrashes the caches. Crap. Let's try the other extreme instead.

Horizontal Cool

The other extreme is of course a fully horizontal routine that reads three lines in parallel and completely ignores the overlap. (Actually I wrote this version first, due to earlier experiences with the yuv converter, but never mind.)

void fast_blur_horiz( const uint16_t *in, uint16_t *blurred, int width, int height )
{
    int x, y;
    __m128i one_third;
    __m128i *dst;
    
    one_third = _mm_set1_epi16(21846);
    dst = (__m128i *)blurred;

    for( y = 0; y < height; y++ ) {
        const uint16_t *row0, *rowp, *rown;

        row0 = &in[y*width];
        rowp = row0 - width;
        rown = row0 + width;

        for( x = 0; x < width; x += 8 ) {
            __m128i s0 ,s1 ,s2;
            __m128i r0, r1, r2;
            
            s0 = _mm_loadu_si128( (__m128i*)(row0-1) );
            s1 = _mm_loadu_si128( (__m128i*)(row0+1) );
            s2 = _mm_load_si128(  (__m128i*)(row0) );
            r0 = _mm_mulhi_epi16( _mm_add_epi16( _mm_add_epi16( s0, s1 ), s2 ), one_third );

            s0 = _mm_loadu_si128( (__m128i*)(rowp-1) );
            s1 = _mm_loadu_si128( (__m128i*)(rowp+1) );
            s2 = _mm_load_si128(  (__m128i*)(rowp) );
            r1 = _mm_mulhi_epi16( _mm_add_epi16( _mm_add_epi16( s0, s1 ), s2 ), one_third );

            s0 = _mm_loadu_si128( (__m128i*)(rown-1) );
            s1 = _mm_loadu_si128( (__m128i*)(rown+1) );
            s2 = _mm_load_si128(  (__m128i*)(rown) );
            r2 = _mm_mulhi_epi16( _mm_add_epi16( _mm_add_epi16( s0, s1 ), s2 ), one_third );

            _mm_store_si128( dst++, _mm_mulhi_epi16( _mm_add_epi16( _mm_add_epi16( r0, r1 ), r2 ), one_third ) );

            row0 += 8;
            rowp += 8;
            rown += 8;
        }

    }

}

That's pretty and readable, but the instruction count is a complete disaster:

Box filter 3x3 instruction count, 8192*8192 image
      Halide   Horiz    Vertical
mul   17301504 33554432  8390656
add   34603008 67108864 16781312
loadu 17825792 50331648 16781312
loada 34078720 25165824  8390656
store 17301504  8388608  8388608

Fortunately, it's a complete win when we consider execution times. It's about 31% faster than the original code on the i7-950. Neat.

Mental Overdrive

Can we do any better? Actually, yes. We can read one more line and write two destination rows at a time, making it 36% faster:

void fast_blur_horiz2d( const uint16_t *in, uint16_t *blurred, int width, int height )
{
    int x, y;
    __m128i one_third;
    __m128i *dst0, *dst1;
    
    one_third = _mm_set1_epi16(21846);
    dst0 = (__m128i *)blurred;
    dst1 = (__m128i *)(blurred+width);

    for( y = 0; y < height; y += 2 ) {
        const uint16_t *row0, *row1, *row2, *row3;

        row1 = in + y*width;
        row0 = row1 - width;
        row2 = row1 + width;
        row3 = row2 + width;

        for( x = 0; x < width; x += 8 ) {
            __m128i s00 ,s01 ,s02;
            __m128i r00, r01, r02;
            
            s00 = _mm_loadu_si128( (__m128i*)(row0-1) );
            s01 = _mm_loadu_si128( (__m128i*)(row0+1) );
            s02 = _mm_load_si128(  (__m128i*)(row0) );
            r00 = _mm_mulhi_epi16( _mm_add_epi16( _mm_add_epi16( s00, s01 ), s02 ), one_third );

            s00 = _mm_loadu_si128( (__m128i*)(row1-1) );
            s01 = _mm_loadu_si128( (__m128i*)(row1+1) );
            s02 = _mm_load_si128(  (__m128i*)(row1) );
            r01 = _mm_mulhi_epi16( _mm_add_epi16( _mm_add_epi16( s00, s01 ), s02 ), one_third );

            s00 = _mm_loadu_si128( (__m128i*)(row2-1) );
            s01 = _mm_loadu_si128( (__m128i*)(row2+1) );
            s02 = _mm_load_si128(  (__m128i*)(row2) );
            r02 = _mm_mulhi_epi16( _mm_add_epi16( _mm_add_epi16( s00, s01 ), s02 ), one_third );

            _mm_store_si128( dst0++, _mm_mulhi_epi16( _mm_add_epi16( _mm_add_epi16( r00, r01 ), r02 ), one_third ) );

            s00 = _mm_loadu_si128( (__m128i*)(row3-1) );
            s01 = _mm_loadu_si128( (__m128i*)(row3+1) );
            s02 = _mm_load_si128(  (__m128i*)(row3) );
            r00 = _mm_mulhi_epi16( _mm_add_epi16( _mm_add_epi16( s00, s01 ), s02 ), one_third );

            _mm_store_si128( dst1++, _mm_mulhi_epi16( _mm_add_epi16( _mm_add_epi16( r00, r01 ), r02 ), one_third ) );

            row0 += 8;
            row1 += 8;
            row2 += 8;
            row3 += 8;
        }

        dst0 += width>>3;
        dst1 += width>>3;

    }

}

Instruction count for completeness:

Box filter 3x3 instruction count, 8192*8192 image
      Halide   Horiz    Horiz2d  Vertical
mul   17301504 33554432 16777216  8390656
add   34603008 67108864 33554432 16781312
loadu 17825792 50331648 33554432 16781312
loada 34078720 25165824 16777216  8390656
store 17301504  8388608  8388608  8388608

Expanding further in the same manner did not give any meaningful reductions, so reading from 4 lines and writing 2 seems to be the practical limit.

Seeing is Believing

Let's sum up the execution times for my test system:

Image: 8192*8192, 100 iterations
CPU:   i7-950@3Ghz
RAM:   6x2GB@1066Mhz
                  Time    Perc
Vertical:     29541786 -326.2%
Halide:        6931559    0.0%
Horizontal:    4780441   31.0%
Horizontal2d:  4421372   36.2%

How about Sandy Bridge? Everything runs significantly faster, and the gap is now even wider:

Image: 8192*8192, 100 iterations
CPU:   i7-2600K@3.4Ghz
RAM:   4x4GB@1600Mhz
                  Time    Perc
Vertical:     25291926 -474.3%
Halide:        4403894    0.0%
Horizontal2d:  2608385   40.8%

Basically, when there's not a lot of processing being done per pixel, memory accesses will dominate. If we access memory linearly over not too many rows at a time, things execute extremely fast on any newer Intel CPU.

A trip down memory lane

But that doesn't explain the original code. How on earth did they come up with that code and those block sizes? A test on an old Core2 system makes it all clear. I'm only doing 10 iterations, for obvious reasons.

Image: 8192*8192, 10 iterations
CPU:   Q9550@2.8Ghz
RAM:   4x2GB@800Mhz
                  Time    Perc
Vertical:      7358867 -732.1%
Halide:         884382    0.0%
Horizontal:    1563866  -76.8%
Horizontal2d:  1250236  -41.4%

Aha! The code is optimized for older Core2 based systems. The memory subsystem is obviously much smarter in newer models. While their approach works well for Core2, it's not the way to go for Core i7 and later. That's good news; the Core2 approach is too messy.

Overflow

Cat Zimmerman sent me an email on 11 July 2015:

The box filter code at http://www.ignorantus.com/box_sse2/
overflows. Is there a reason you or Halide didn't use the saturating add instruction?
i.e:
_mm_add_epi16 becomes _mm_adds_epi16

That's correct. The test code only compared the outputs of the different versions with memcmp(), so the overflow went unnoticed. If you're going to use the code for anything, please replace the adds with the saturating version. It's always a good idea to check the output. Thanks, Cat!

If I've missed anything obvious you can figure out my email from the front page. Also, remember to appreciate this classic Abstruse Goose strip.


www.ignorantus.com