Bilinear Picture Scaling on Tilera TILE-Gx

Nils L. Corneliusen
7 November 2014

Introduction

I've earlier looked at how to do color space conversion on the Tilera TILE-Gx, but that code was not parallelized. The raytracer article presented a more complete solution, but lacked a basic synchronization mechanism. So let's solve that problem first by writing a parallelization library. Then I'll have a look at parallel bilinear picture scaling on the TILE-Gx and how to make it reasonably fast.

Parallel Code in 160 Lines

The Tilera Devkit has all the tools needed to make parallel code: The UDN network can be used to transmit messages quickly, and TMC and Pthreads can be used to put threads that share memory on different cores. Let's use those tools to make a library that can do asynchronous function calls on another core with varying arguments.

The following functions are needed:

int par_init( unsigned int cores );
void par_sendjob( void *func, void *data, unsigned int len );
void par_wait( void );
void par_set_cores( unsigned int cores );

par_init() sets up the threads and configures the host. The cores argument specifies the absolute max number of cores that'll be used. One core is needed for control, so 35 is usually a good choice. The number of cores used can be changed later.

Each thread has a control function that moves it to the correct core, initializes the UDN network, and allocates an I/O buffer. Then it reports back that it's ready for use.

par_sendjob() does the actual work. It'll take the given data and send it to the next available core. If no cores are available, it'll wait until one is.

par_wait() just waits until all started jobs are done.

par_set_cores() limits the number of cores to use, which makes it easy to see how things behave with different core counts.
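As a portable illustration of the same idea — asynchronous function calls with copied arguments, dispatched to the next free worker — here's a minimal pthreads-only sketch. The real library (par.c) does the handoff over the UDN network instead, so none of this is its actual internals; the names just mirror the API above.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

typedef void (*job_func)( void *data, unsigned int len );

#define MAXARG 256

static pthread_mutex_t lock    = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv_job  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  cv_done = PTHREAD_COND_INITIALIZER;
static job_func     pending_func;   /* NULL means the job slot is free */
static uint8_t      pending_data[MAXARG];
static unsigned int pending_len;
static int          active;         /* jobs picked up but not finished */

static void *worker( void *arg )
{
    (void)arg;
    uint8_t local[MAXARG];
    for(;;) {
        pthread_mutex_lock( &lock );
        while( pending_func == NULL )
            pthread_cond_wait( &cv_job, &lock );
        job_func f = pending_func;
        unsigned int len = pending_len;
        memcpy( local, pending_data, len );
        pending_func = NULL;                /* free the slot */
        active++;
        pthread_cond_broadcast( &cv_done ); /* sender may queue the next job */
        pthread_mutex_unlock( &lock );

        f( local, len );                    /* run the job outside the lock */

        pthread_mutex_lock( &lock );
        active--;
        pthread_cond_broadcast( &cv_done );
        pthread_mutex_unlock( &lock );
    }
    return NULL;
}

int par_init( unsigned int cores )
{
    for( unsigned int i = 0; i < cores; i++ ) {
        pthread_t t;
        if( pthread_create( &t, NULL, worker, NULL ) != 0 ) return -1;
        pthread_detach( t );
    }
    return 0;
}

void par_sendjob( void *func, void *data, unsigned int len )
{
    pthread_mutex_lock( &lock );
    while( pending_func != NULL )           /* wait until the slot is free */
        pthread_cond_wait( &cv_done, &lock );
    memcpy( pending_data, data, len );
    pending_len  = len;
    pending_func = (job_func)func;
    pthread_cond_signal( &cv_job );
    pthread_mutex_unlock( &lock );
}

void par_wait( void )
{
    pthread_mutex_lock( &lock );
    while( pending_func != NULL || active > 0 )
        pthread_cond_wait( &cv_done, &lock );
    pthread_mutex_unlock( &lock );
}

/* demo job: atomically adds its argument to a counter */
static _Atomic int demo_total;
static void demo_add( void *data, unsigned int len )
{
    (void)len;
    demo_total += *(int *)data;
}
```

Note that par_sendjob() copies the argument block before returning, so the caller is free to reuse the same struct for the next job — the real library relies on the same property.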

Source files:

Managing Images in a Painless Way

Let's do one more detour before looking at bilinear picture scaling. It would be very convenient to have a system for managing 32 bit ARGB bitmaps, and being able to load and save 24 bit bitmap files. A quick scan on the net turns up really complicated ways to load and save Windows Bitmap files. Since only loading and saving of 24 bit bitmaps are needed, there's no need to make it harder than necessary. Here's my solution to that problem:

typedef struct {
    int w, h;
    int stride;
    uint8_t *argb;
} bmp_argb;

bmp_argb *bmp_alloc( int w, int h );
void bmp_free( bmp_argb *bp );
bmp_argb *bmp_load( const char *fname, bool flip );
bool bmp_save( bmp_argb *bp, const char *fname, bool flip );

Source files:

Be aware that this code is very rudimentary: it's only supposed to load and save 24 bit files, and present them as 32 bit ARGB in memory.
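The important part is the layout: pixel (x,y) lives at argb + y*stride + x*4, and the allocation has a couple of spare columns and rows for the edge handling used later. Here's a hypothetical sketch of the allocation side — the real code is in bmp.c, and the exact amount of padding there may differ:

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    int w, h;
    int stride;        /* bytes per row, >= w*4 */
    uint8_t *argb;
} bmp_argb;

/* Hypothetical stand-in for bmp_alloc(): pads the buffer with two
 * extra columns and two extra rows so edge pixels can be duplicated
 * into it before scaling. */
static bmp_argb *bmp_alloc_sketch( int w, int h )
{
    bmp_argb *bp = malloc( sizeof(bmp_argb) );
    if( bp == NULL ) return NULL;
    bp->w = w;
    bp->h = h;
    bp->stride = (w+2)*4;                              /* two spare columns */
    bp->argb = calloc( (size_t)bp->stride*(h+2), 1 );  /* two spare rows */
    if( bp->argb == NULL ) { free( bp ); return NULL; }
    return bp;
}

/* Pixel (x,y) addressing via the stride. */
static uint32_t *bmp_pixel( bmp_argb *bp, int x, int y )
{
    return (uint32_t *)(bp->argb + y*bp->stride + x*4);
}
```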

A Single Pass Bilinear Scaler

Bilinear picture scaling is well covered on the internet. The boring mathematics are covered in this Wikipedia article. Something resembling useful C++ code can be found in this article.

Quoting from the last article:

There are five steps involved in interpolation:

That's a reasonable way to do it if there are lots of 16-bit multipliers or fast floating point hardware available. Unfortunately, the TILE-Gx only has 4 8-bit multipliers or 2 16-bit ones per instruction. There are more options, like the v1ddot, v1multu and v2mults instructions, but the registers are only 64 bits wide. So let's look at a slightly different method. Take the 2x2 block of source pixels surrounding the position:

p00 p01
p10 p11

Let's scale one pixel by first calculating the vertical weights, and then the horizontal:

    v00 = p00*(1-py) + p10*py;
    v01 = p01*(1-py) + p11*py;
    h00 = v00*(1-px) + v01*px;

That works the same way, but fits into 8-bit multipliers. Both the pixel values and the weights then need to be in the range 0x00..0xff, and just shifting down 8 bits after multiplying would divide by 256 instead of 255 and darken the image slightly.

What's needed is a cheap way to divide by 255, or something that's reasonably close. That's accomplished by adding the result shifted down by 8 bits plus a rounding constant of 128, and taking the high byte:

    v00 = p00*(0xff-py) + p10*py;
    v00 = v00+(v00>>8)+0x80;

    v01 = p01*(0xff-py) + p11*py;
    v01 = v01+(v01>>8)+0x80;

    h00 = v00*(0xff-px) + v01*px;
    h00 = h00+(h00>>8)+0x80;

Now the high 8 bits of each 16-bit result contain the expected pixel values.
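How close is "reasonably close"? It can be checked exhaustively: for every possible product of an 8-bit value and an 8-bit weight sum, the cheap version is at most one below exact rounded division by 255, and never above it. The function names here are mine:

```c
#include <stdint.h>

/* The cheap divide-by-255 used above: add the value shifted down 8
 * bits plus a rounding constant of 0x80, then take the high byte. */
static unsigned int div255_cheap( unsigned int x )
{
    return ( x + (x>>8) + 0x80 ) >> 8;
}

/* Exact round-to-nearest division by 255, for comparison. */
static unsigned int div255_exact( unsigned int x )
{
    return ( x + 127 ) / 255;
}
```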

Instead of using floating point for the position, I use 16.16 fixed point math. It'll drift a bit, but that'll be discussed later. It also makes it easy to pick out the 8 fraction bits needed for the weights.

The offset is needed to avoid the bad weighting patterns that appear when scaling, for example, 1920 to 1280. Without it we'd get the weights 0,128,... which would skip a lot of pixels entirely. With the offset, the pattern is a more useful 64,192,... The offset might give negative coordinates at the left and top edges, so the position needs to be clamped to 0. The if test for that gets compiled to a conditional move, so it shouldn't be a big issue.
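The numbers above are easy to reproduce. Here's the position arithmetic pulled out into small helpers (the actual scaler computes the same values inline as xadd/xoffset):

```c
#include <stdint.h>

/* 16.16 fixed point source position for destination pixel x:
 * step = srcw/dstw, offset by half a step minus half a source pixel. */
static int src_pos( int srcw, int dstw, int x )
{
    int add    = (srcw<<16)/dstw;
    int offset = (add>>1) - (1<<15);
    return offset + x*add;
}

/* The 8-bit blend weight is bits 8..15 of the position. */
static int weight8( int pos )
{
    return (pos>>8) & 0xff;
}
```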

An obvious problem with this approach is the excessive number of vertical multiplications done, and many unnecessary loads. But that's how a single pass version has to do it; a two-pass version will be looked at later.

The wrapper for a single pass implementation looks like this:

void scale_bilinear( bmp_argb *src, bmp_argb *dst, unsigned int startrow, unsigned int height )
{
    int xadd = (src->w<<16)/dst->w;
    int yadd = (src->h<<16)/dst->h;
    int xoffset = (xadd>>1)-(1<<15);
    int yoffset = (yadd>>1)-(1<<15);

    int dstw = dst->w;

    for( int y = startrow; y < startrow + height; y++ ) {

        int yy0 = yoffset+ y*yadd;
        int ypos0 = yy0>>16; if( ypos0 < 0 ) ypos0 = 0;

        // find rows to fetch data from
        uint8_t *row00 = src->argb + src->stride*min(ypos0+0, src->h-1);
        uint8_t *row01 = src->argb + src->stride*min(ypos0+1, src->h-1);

        uint8_t *dp0 = dst->argb +  y   *dst->stride;

        for( int x = 0; x < dstw; x++ ) {
            uint8_t *tmp;
            uint64_t p00, p01;
            uint64_t p10, p11;

            int xx0 = xoffset+ x*xadd;
            int xpos0 = (xx0>>16)*4; if( xpos0 < 0 ) xpos0 = 0;

            // load (x,y) and the surrounding pixels into p00,01,p10,p11
            tmp = row00 + xpos0; p00 = __insn_ld4u_add( tmp, 4 ); p01 = __insn_ld4u( tmp );
            tmp = row01 + xpos0; p10 = __insn_ld4u_add( tmp, 4 ); p11 = __insn_ld4u( tmp );

            //
            // calculations go here
            //
                                
            __insn_st4_add( dp0, ..., 4 );

        }

    }

}

For each destination row, determine the vertical weights:

    vmul0 = (yy0>>8)&0xff;
    vmul0 = __insn_shufflebytes( vmul0, 0, 0 );
    vmul1 = 0xffffffff - vmul0;

For each destination pixel, determine the horizontal weights:

    hmul0 = (((uint64_t)xx0)>>8)&0xff;
    hmul0 = __insn_shufflebytes( hmul0, 0, 0 );
    hmul1 = 0xffffffff - hmul0;

Then do the calculations as explained above:

    v00 = __insn_v1mulu( p00, vmul1 ) + __insn_v1mulu( p10, vmul0 );
    v01 = __insn_v1mulu( p01, vmul1 ) + __insn_v1mulu( p11, vmul0 );
    v00 =  __insn_v2packh( 0, v00 + __insn_v2shrui( v00, 8 ) + roundval );
    v01 =  __insn_v2packh( 0, v01 + __insn_v2shrui( v01, 8 ) + roundval );
    r00 = __insn_v1mulu( v00, hmul1 ) + __insn_v1mulu( v01, hmul0 );
    r00 = r00 + __insn_v2shrui( r00, 8 ) + roundval;

To achieve some level of speed, do the same for 3 more pixels in a 2x2 grid. The 16 bit results are now in r00,r01,r10,r11. These can be packed and stored:

    __insn_st_add( dp0, __insn_v2packh( r01, r00 ), 8 );
    __insn_st_add( dp1, __insn_v2packh( r11, r10 ), 8 );

Putting all that into a function gives this result:

void scale_bilinear( bmp_argb *src, bmp_argb *dst, unsigned int startrow, unsigned int height )
{
    uint64_t roundval = 0x0080008000800080;

    int xadd = (src->w<<16)/dst->w;
    int yadd = (src->h<<16)/dst->h;

    int xoffset = (xadd>>1)-(1<<15);
    int yoffset = (yadd>>1)-(1<<15);

    int dstw = dst->w;

    for( int y = startrow; y < startrow + height; y += 2 ) {
        uint64_t vmul0, vmul1, vmul2, vmul3;

        int yy0 = yoffset+ y   *yadd;
        int yy1 = yoffset+(y+1)*yadd;
        int ypos0 = yy0>>16; if( ypos0 < 0 ) ypos0 = 0;
        int ypos1 = yy1>>16; if( ypos1 < 0 ) ypos1 = 0;

        // find rows to fetch data from
        uint8_t *row00 = src->argb + src->stride*(ypos0+0);
        uint8_t *row01 = src->argb + src->stride*(ypos0+1);
        uint8_t *row10 = src->argb + src->stride*(ypos1+0);
        uint8_t *row11 = src->argb + src->stride*(ypos1+1);

        // get vertical weights
        vmul0 = (yy0>>8)&0xff;
        vmul0 = __insn_shufflebytes( vmul0, 0, 0 );
        vmul1 = 0xffffffff - vmul0;
        vmul2 = (yy1>>8)&0xff;
        vmul2 = __insn_shufflebytes( vmul2, 0, 0 );
        vmul3 = 0xffffffff - vmul2;

        uint8_t *dp0 = dst->argb +  y   *dst->stride;
        uint8_t *dp1 = dst->argb + (y+1)*dst->stride;

        for( int x = 0; x < dstw; x += 2 ) {
            uint8_t *tmp;
            uint64_t p00, p01, p02, p03;
            uint64_t p10, p11, p12, p13;
            uint64_t p20, p21, p22, p23;
            uint64_t p30, p31, p32, p33;
            uint64_t v00, v01;
            uint64_t r00, r01, r10, r11;
            uint64_t hmul0,hmul1,hmul2,hmul3;

            int xx0 = xoffset+ x   *xadd;
            int xx1 = xoffset+(x+1)*xadd;
            int xpos0 = (xx0>>16)*4; if( xpos0 < 0 ) xpos0 = 0;
            int xpos1 = (xx1>>16)*4; if( xpos1 < 0 ) xpos1 = 0;

            // load (2*2)*(2*2) pixels so it can be unrolled a bit
            tmp = row00 + xpos0; p00 = __insn_ld4u_add( tmp, 4 ); p01 = __insn_ld4u( tmp );
            tmp = row00 + xpos1; p02 = __insn_ld4u_add( tmp, 4 ); p03 = __insn_ld4u( tmp );
            tmp = row01 + xpos0; p10 = __insn_ld4u_add( tmp, 4 ); p11 = __insn_ld4u( tmp );
            tmp = row01 + xpos1; p12 = __insn_ld4u_add( tmp, 4 ); p13 = __insn_ld4u( tmp );
            tmp = row10 + xpos0; p20 = __insn_ld4u_add( tmp, 4 ); p21 = __insn_ld4u( tmp );
            tmp = row10 + xpos1; p22 = __insn_ld4u_add( tmp, 4 ); p23 = __insn_ld4u( tmp );
            tmp = row11 + xpos0; p30 = __insn_ld4u_add( tmp, 4 ); p31 = __insn_ld4u( tmp );
            tmp = row11 + xpos1; p32 = __insn_ld4u_add( tmp, 4 ); p33 = __insn_ld4u( tmp );

            // get horizontal weights
            hmul0 = (((uint64_t)xx0)>>8)&0xff;
            hmul0 = __insn_shufflebytes( hmul0, 0, 0 );
            hmul1 = 0xffffffff - hmul0;
            hmul2 = (((uint64_t)xx1)>>8)&0xff;
            hmul2 = __insn_shufflebytes( hmul2, 0, 0 );
            hmul3 = 0xffffffff - hmul2;

            // calculate 2*2 output pixels
            v00 = __insn_v1mulu( p00, vmul1 ) + __insn_v1mulu( p10, vmul0 );
            v01 = __insn_v1mulu( p01, vmul1 ) + __insn_v1mulu( p11, vmul0 );
            v00 =  __insn_v2packh( 0, v00 + __insn_v2shrui( v00, 8 ) + roundval );
            v01 =  __insn_v2packh( 0, v01 + __insn_v2shrui( v01, 8 ) + roundval );

            r00 = __insn_v1mulu( v00, hmul1 ) + __insn_v1mulu( v01, hmul0 );
            r00 = r00 + __insn_v2shrui( r00, 8 ) + roundval;

            v00 = __insn_v1mulu( p02, vmul1 ) + __insn_v1mulu( p12, vmul0 );
            v01 = __insn_v1mulu( p03, vmul1 ) + __insn_v1mulu( p13, vmul0 );
            v00 =  __insn_v2packh( 0, v00 + __insn_v2shrui( v00, 8 ) + roundval );
            v01 =  __insn_v2packh( 0, v01 + __insn_v2shrui( v01, 8 ) + roundval );

            r01 = __insn_v1mulu( v00, hmul3 ) + __insn_v1mulu( v01, hmul2 );
            r01 = r01 + __insn_v2shrui( r01, 8 ) + roundval;

            v00 = __insn_v1mulu( p20, vmul3 ) + __insn_v1mulu( p30, vmul2 );
            v01 = __insn_v1mulu( p21, vmul3 ) + __insn_v1mulu( p31, vmul2 );
            v00 =  __insn_v2packh( 0, v00 + __insn_v2shrui( v00, 8 ) + roundval );
            v01 =  __insn_v2packh( 0, v01 + __insn_v2shrui( v01, 8 ) + roundval );

            r10 = __insn_v1mulu( v00, hmul1 ) + __insn_v1mulu( v01, hmul0 );
            r10 = r10 + __insn_v2shrui( r10, 8 ) + roundval;

            v00 = __insn_v1mulu( p22, vmul3 ) + __insn_v1mulu( p32, vmul2 );
            v01 = __insn_v1mulu( p23, vmul3 ) + __insn_v1mulu( p33, vmul2 );
            v00 =  __insn_v2packh( 0, v00 + __insn_v2shrui( v00, 8 ) + roundval );
            v01 =  __insn_v2packh( 0, v01 + __insn_v2shrui( v01, 8 ) + roundval );

            r11 = __insn_v1mulu( v00, hmul3 ) + __insn_v1mulu( v01, hmul2 );
            r11 = r11 + __insn_v2shrui( r11, 8 ) + roundval;

            // pack and store
            __insn_st_add( dp0, __insn_v2packh( r01, r00 ), 8 );
            __insn_st_add( dp1, __insn_v2packh( r11, r10 ), 8 );

        }

    }

}
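As a sanity check, the per-channel arithmetic above can be modelled in plain C. This reference function is mine, not part of the scaler, but it performs the same two vertical blends, the cheap divide-by-255, and the horizontal blend:

```c
#include <stdint.h>

/* Plain-C model of one channel: vertical blends first, then the
 * horizontal one, each followed by the cheap /255 and taking the
 * high byte (which is what v2packh does in the vector code). */
static uint8_t bilerp_channel( uint8_t p00, uint8_t p01,
                               uint8_t p10, uint8_t p11,
                               unsigned int px, unsigned int py ) /* 0..255 */
{
    unsigned int v00, v01, h00;

    v00 = p00*(0xffu-py) + p10*py;
    v00 = ( v00 + (v00>>8) + 0x80 ) >> 8;

    v01 = p01*(0xffu-py) + p11*py;
    v01 = ( v01 + (v01>>8) + 0x80 ) >> 8;

    h00 = v00*(0xffu-px) + v01*px;
    h00 = ( h00 + (h00>>8) + 0x80 ) >> 8;

    return (uint8_t)h00;
}
```

Handy properties to check: the corner weights reproduce the source pixels exactly, flat areas stay flat, and the midpoint of 0 and 255 lands on 128.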

There's one small caveat left: What to do about the right and bottom edges, since fetches go one pixel beyond, to x+1,y+1? Also, the position calculation might drift up to a pixel across the picture since it's not 100% precise. Either insert more conditional moves, or allocate a slightly larger bitmap and duplicate the last column/row twice.

No point in adding more code in the inner loop: Let's make a pad function that has to be called on the source bitmap before scaling anything. The bitmap library already allocates slightly larger bitmaps than needed to cover this.

static void pad_right_lower( bmp_argb *bp )
{
    uint32_t *dst = (uint32_t *)(bp->argb + bp->w*4);
    uint32_t *src = dst-1;
    int str4 = bp->stride>>2;

    for( int y = 0; y < bp->h; y++ ) {
        dst[0] = *src;
        dst[1] = *src;
        src += str4;
        dst += str4;
    }

    uint32_t *dst0 = (uint32_t *)(bp->argb + bp->stride*bp->h);
    uint32_t *dst1 = dst0 + str4;
    src = dst0 - str4;
    for( int x = 0; x < bp->w+2; x++ ) {   // +2 covers the padded corner pixels
        *dst0++ = *src;
        *dst1++ = *src++;
    }
}
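The function relies on the bitmap library having allocated two spare columns and rows, as noted earlier. Here's a self-contained harness around the same padding logic, with the bottom loop running over the two padded columns as well so the bottom-right corner pixels are defined:

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct { int w, h; int stride; uint8_t *argb; } bmp_argb;

/* Duplicate the last column twice and the last row twice; assumes
 * the buffer has room for two extra columns and two extra rows. */
static void pad_sketch( bmp_argb *bp )
{
    uint32_t *dst = (uint32_t *)(bp->argb + bp->w*4);
    uint32_t *src = dst-1;
    int str4 = bp->stride>>2;

    for( int y = 0; y < bp->h; y++ ) {
        dst[0] = *src;
        dst[1] = *src;
        src += str4;
        dst += str4;
    }

    uint32_t *dst0 = (uint32_t *)(bp->argb + bp->stride*bp->h);
    uint32_t *dst1 = dst0 + str4;
    src = dst0 - str4;
    for( int x = 0; x < bp->w+2; x++ ) {
        *dst0++ = *src;
        *dst1++ = *src++;
    }
}
```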

The bilinear scaler is now ready for use. The following code will load, scale and save a picture:

int main( int argc, char *argv[] )
{
    bmp_argb *src;

    if( argc != 5 ) {
        printf( "Usage: %s <input pic> <output pic> <sizex> <sizey>\n", argv[0] );
        return 1;
    }
    
    src = bmp_load( argv[1], false );
    if( src == NULL ) return 1;
    
    pad_right_lower( src );

    unsigned int xs = atoi( argv[3] );
    unsigned int ys = atoi( argv[4] );

    bmp_argb *dst = bmp_alloc( xs, ys );
    if( dst == NULL ) {
        bmp_free( src );
        return 1;
    }

    printf( "Scale picture from %dx%d to %dx%d\n", src->w, src->h, dst->w, dst->h );

    scale_bilinear( src, dst, 0, dst->h );

    bmp_save( dst, argv[2], false );

    bmp_free( src );
    bmp_free( dst );

    return 0;
}

A Parallel Single Pass Bilinear Scaler

Writing a parallel version using the parallelization library presented earlier is very easy. Just stuff the arguments into a struct, and make a wrapper function that the thread can call. Since I'm interested in trying out different combinations of cores and lines, add a lines argument to the scale function:

typedef struct {
    bmp_argb src;
    bmp_argb dst;
    unsigned int startrow;
    unsigned int height;
} sc_info;

static void scale_bilinear_thread( void *data, unsigned int len )
{
    sc_info *sc = data;

    scale_bilinear( &sc->src, &sc->dst, sc->startrow, sc->height );
}

void scale_bilinear_parallel( bmp_argb *src, bmp_argb *dst, unsigned int lines )
{
    sc_info sc;

    sc.src = *src;
    sc.dst = *dst;

    for( unsigned int startrow = 0; startrow < dst->h; startrow += lines ) {

        sc.startrow = startrow;
        sc.height   = dst->h - startrow < lines ? dst->h - startrow : lines;

        par_sendjob( scale_bilinear_thread, &sc, sizeof(sc) );
    }

    par_wait();
}
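The chunking itself is worth a quick check: every destination row must be covered exactly once no matter how lines divides the height. Here's the startrow/height loop from scale_bilinear_parallel() pulled out in isolation:

```c
/* Sum of the chunk heights produced by the startrow/height loop;
 * must equal h for every value of lines. */
static unsigned int chunk_rows( unsigned int h, unsigned int lines )
{
    unsigned int covered = 0;
    for( unsigned int startrow = 0; startrow < h; startrow += lines ) {
        unsigned int height = h - startrow < lines ? h - startrow : lines;
        covered += height;
    }
    return covered;
}
```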

scale_bilinear_parallel() can replace the call to scale_bilinear() in the main() function. Let's make it a bit more interesting and loop through a set of lines and cores to make some nice graphs:

int main( int argc, char *argv[] )
{
    bmp_argb *src;

    if( argc != 5 ) {
        printf( "Usage: %s <input pic> <output pic> <sizex> <sizey>\n", argv[0] );
        return 1;
    }

    int rc = par_init( 35 );
    if( rc != 0 ) return rc;

    src = bmp_load( argv[1], false );
    if( src == NULL ) return 1;

    pad_right_lower( src );

    int xs = atoi( argv[3] );
    int ys = atoi( argv[4] );

    bmp_argb *dst = bmp_alloc( xs, ys );
    if( dst == NULL ) {
        bmp_free( src );
        return 1;
    }

    printf( "Scale %d*%d to %d*%d\n\n", src->w, src->h, dst->w, dst->h );

    printf( "            1        2        4        8       16       32       35\n" );

    for( int lines = 2; lines <= 128; lines *= 2 ) {

        printf( "%03d: ", lines ); fflush( stdout );

        for( int cores2 = 1; cores2 < 128; cores2 *= 2 ) {
            int cores = cores2 == 64 ? 35 : cores2;

            par_set_cores( cores );

            uint64_t t0 = get_cycle_count();
            for( int i = 0; i < cores*30; i++ )
                scale_bilinear_parallel( src, dst, lines );
            uint64_t t1 = get_cycle_count();

            uint64_t frametime = (t1-t0)/(cores*30);

            printf( "%08lld ", (long long int)frametime ); fflush( stdout );
        }
        printf( "\n" );
    }

    bmp_save( dst, argv[2], false );

    bmp_free( src );
    bmp_free( dst );

    return 0;
}

Test Run 1

Let's compile it and do some test runs:

[user@mainframe]$ tile-cc -O3 -Wall -std=c99 -o main main.c par.c bmp.c scaler.c -ltmc -lpthread
[user@mainframe]$ scp main root@lastv36:/tmp

[lastv36:/tmp] $ ./main pics/ebu3325.bmp output.bmp 1280 720
Loading bitmap pics/ebu3325.bmp
Scale 1920*1080 to 1280*720

            1        2        4        8       16       32       35
002: 18079820 09691598 05350570 02977762 02053148 01647987 01628302
004: 18798604 09744640 05606594 02957624 02094006 01697045 01696185
008: 18732162 09989230 05213762 03044871 02122177 01719981 01732858
016: 17703481 09793327 05282825 02999953 02207199 01878886 01918319
032: 17733139 09834596 05288407 03164021 02493402 01840813 01850595
064: 19268220 09883642 05446304 03725550 02596128 02612940 02596963
128: 18705930 09832945 06619117 04031104 04048189 04000686 04006442
Saving bitmap output.bmp

Graph 1

[lastv36:/tmp] $ ./main pics/ebu3325.bmp output.bmp 2560 1440
Loading bitmap pics/ebu3325.bmp
Scale 1920*1080 to 2560*1440

            1        2        4        8       16       32       35
002: 69840471 34676107 17686502 09206715 05090252 03649246 03643276
004: 69437485 34368224 17126034 08763251 04964263 03556673 03547991
008: 69585956 34307330 17482974 08992199 05190712 03779357 03793994
016: 69419415 34892357 17908136 09275802 05353747 03969738 03975066
032: 69190961 35422839 18390759 09614943 05591535 04523913 04600017
064: 69250473 35365001 18642379 09667397 06810635 04400612 04465336
128: 68879769 36677807 18792744 12492657 07027108 07092061 07023161
Saving bitmap output.bmp

Graph 2

Small jobs on many cores execute quicker. The frame time roughly halves going from 1 to 2, 2 to 4, and 4 to 8 cores. Then the effect of more cores tapers off, but not so much that it's pointless.

Is it possible to do better? Of course! The excessive number of vertical multiplies can be reduced a lot by writing a two-pass version.

A Parallel Two-Pass Bilinear Scaler

Bilinear scaler

Instead of scaling one pixel at a time, let's do a vertical pass first, and then a horizontal pass. It would be unwise to do a full vertical pass over the whole picture before the horizontal one, since the intermediate result wouldn't stay in the local cache. Instead, let's do just two lines at a time, so the intermediate output fits in the local cache. These lines are stored on the stack so they're not cached on the other cores.

Let's make some new wrapper functions:

static void scale_bilinear_thread_2pass( void *data, unsigned int len )
{
    sc_info *sc = data;

    scale_bilinear_2pass( &sc->src, &sc->dst, sc->startrow, sc->height );
}

void scale_bilinear_parallel_2pass( bmp_argb *src, bmp_argb *dst, unsigned int lines )
{
    sc_info sc;
    sc.src = *src;
    sc.dst = *dst;

    for( unsigned int startrow = 0; startrow < dst->h; startrow += lines ) {

        sc.startrow = startrow;
        sc.height   = dst->h - startrow < lines ? dst->h - startrow : lines;

        par_sendjob( scale_bilinear_thread_2pass, &sc, sizeof(sc) );
    }

    par_wait();
}

Then add the base function that calls the vertical and horizontal functions:

static void scale_bilinear_2pass( bmp_argb *src, bmp_argb *dst, unsigned int startrow, unsigned int height )
{
    uint8_t *vbuf = alloca( src->stride*4 );

    for( int y = startrow; y < startrow + height; y += 2 ) {

        scale_bilinear_vertical( src, dst, vbuf, y );

        scale_bilinear_horizontal( src, dst, vbuf, y );

    }

}

And finally, the two scaling functions. These are based on the previous scaler, but I'm using more variables to make things prettier. gcc is quite good at sorting out such stuff, as long as the code's not too long. The generated code looks pretty decent.

static inline void scale_bilinear_vertical( bmp_argb *src, bmp_argb *dst, uint8_t *out0, unsigned int startrow )
{
    uint64_t roundval = 0x0080008000800080;
    int yadd = (src->h<<16)/dst->h;
    int yoffset = (yadd>>1)-(1<<15);
    int srcw  = src->w;

    int yy0 = yoffset+ startrow   *yadd;
    int yy1 = yoffset+(startrow+1)*yadd;
    int ypos0 = yy0>>16; if( ypos0 < 0 ) ypos0 = 0;
    int ypos1 = yy1>>16; if( ypos1 < 0 ) ypos1 = 0;

    // find rows to fetch data from
    uint8_t *row00 = src->argb + src->stride*(ypos0+0);
    uint8_t *row01 = src->argb + src->stride*(ypos0+1);
    uint8_t *row10 = src->argb + src->stride*(ypos1+0);
    uint8_t *row11 = src->argb + src->stride*(ypos1+1);

    // get vertical weights
    uint64_t vmul0, vmul1, vmul2, vmul3;
    vmul0 = __insn_shufflebytes( (yy0>>8)&0xff, 0, 0 ); vmul1 = 0xffffffff - vmul0;
    vmul2 = __insn_shufflebytes( (yy1>>8)&0xff, 0, 0 ); vmul3 = 0xffffffff - vmul2;

    uint8_t *out1 = out0 + src->stride;

    for( int x = 0; x < srcw; x += 4 ) {

        uint64_t p001 = __insn_ld_add( row00, 8 ); uint64_t p023 = __insn_ld_add( row00, 8 );
        uint64_t p101 = __insn_ld_add( row01, 8 ); uint64_t p123 = __insn_ld_add( row01, 8 );
        uint64_t p201 = __insn_ld_add( row10, 8 ); uint64_t p223 = __insn_ld_add( row10, 8 );
        uint64_t p301 = __insn_ld_add( row11, 8 ); uint64_t p323 = __insn_ld_add( row11, 8 );

        uint64_t v001l = __insn_v1mulu( p001,     vmul1 ) + __insn_v1mulu( p101,     vmul0 );
        uint64_t v001h = __insn_v1mulu( p001>>32, vmul1 ) + __insn_v1mulu( p101>>32, vmul0 );
        uint64_t v023l = __insn_v1mulu( p023,     vmul1 ) + __insn_v1mulu( p123,     vmul0 );
        uint64_t v023h = __insn_v1mulu( p023>>32, vmul1 ) + __insn_v1mulu( p123>>32, vmul0 );
        uint64_t v101l = __insn_v1mulu( p201,     vmul3 ) + __insn_v1mulu( p301,     vmul2 );
        uint64_t v101h = __insn_v1mulu( p201>>32, vmul3 ) + __insn_v1mulu( p301>>32, vmul2 );
        uint64_t v123l = __insn_v1mulu( p223,     vmul3 ) + __insn_v1mulu( p323,     vmul2 );
        uint64_t v123h = __insn_v1mulu( p223>>32, vmul3 ) + __insn_v1mulu( p323>>32, vmul2 );

        uint64_t r001l =  v001l + __insn_v2shrui( v001l, 8 ) + roundval;
        uint64_t r001h =  v001h + __insn_v2shrui( v001h, 8 ) + roundval;
        uint64_t r023l =  v023l + __insn_v2shrui( v023l, 8 ) + roundval;
        uint64_t r023h =  v023h + __insn_v2shrui( v023h, 8 ) + roundval;
        uint64_t r101l =  v101l + __insn_v2shrui( v101l, 8 ) + roundval;
        uint64_t r101h =  v101h + __insn_v2shrui( v101h, 8 ) + roundval;
        uint64_t r123l =  v123l + __insn_v2shrui( v123l, 8 ) + roundval;
        uint64_t r123h =  v123h + __insn_v2shrui( v123h, 8 ) + roundval;

        __insn_st_add( out0, __insn_v2packh( r001h, r001l ), 8 );
        __insn_st_add( out0, __insn_v2packh( r023h, r023l ), 8 );
        __insn_st_add( out1, __insn_v2packh( r101h, r101l ), 8 );
        __insn_st_add( out1, __insn_v2packh( r123h, r123l ), 8 );

    }
}


static inline void scale_bilinear_horizontal( bmp_argb *src, bmp_argb *dst, uint8_t *in0, unsigned int startrow )
{
    uint64_t roundval = 0x0080008000800080;
    int xadd = (src->w<<16)/dst->w;
    int xoffset = (xadd>>1)-(1<<15);
    int dstw = dst->w;

    uint8_t *row00 = in0;
    uint8_t *row01 = in0 + src->stride;

    uint8_t *out0 = dst->argb +  startrow   * dst->stride;
    uint8_t *out1 = dst->argb + (startrow+1)* dst->stride;

    for( int x = 0; x < dstw; x += 4 ) {
        uint8_t *tmp;
        uint64_t p00,p01,p02,p03,p04,p05,p06,p07;
        uint64_t p10,p11,p12,p13,p14,p15,p16,p17;
        uint64_t hmul0,hmul1,hmul2,hmul3,hmul4,hmul5,hmul6,hmul7;

        int xx0 = xoffset+ x   *xadd;
        int xx1 = xoffset+(x+1)*xadd;
        int xx2 = xoffset+(x+2)*xadd;
        int xx3 = xoffset+(x+3)*xadd;
        int xpos0 = (xx0>>16)*4; if( xpos0 < 0 ) xpos0 = 0;
        int xpos1 = (xx1>>16)*4; if( xpos1 < 0 ) xpos1 = 0;
        int xpos2 = (xx2>>16)*4; if( xpos2 < 0 ) xpos2 = 0;
        int xpos3 = (xx3>>16)*4; if( xpos3 < 0 ) xpos3 = 0;

        tmp = row00 + xpos0; p00 = __insn_ld4u_add( tmp, 4 ); p01 = __insn_ld4u( tmp );
        tmp = row00 + xpos1; p02 = __insn_ld4u_add( tmp, 4 ); p03 = __insn_ld4u( tmp );
        tmp = row00 + xpos2; p04 = __insn_ld4u_add( tmp, 4 ); p05 = __insn_ld4u( tmp );
        tmp = row00 + xpos3; p06 = __insn_ld4u_add( tmp, 4 ); p07 = __insn_ld4u( tmp );
        tmp = row01 + xpos0; p10 = __insn_ld4u_add( tmp, 4 ); p11 = __insn_ld4u( tmp );
        tmp = row01 + xpos1; p12 = __insn_ld4u_add( tmp, 4 ); p13 = __insn_ld4u( tmp );
        tmp = row01 + xpos2; p14 = __insn_ld4u_add( tmp, 4 ); p15 = __insn_ld4u( tmp );
        tmp = row01 + xpos3; p16 = __insn_ld4u_add( tmp, 4 ); p17 = __insn_ld4u( tmp );

        hmul0 = __insn_shufflebytes( (((uint64_t)xx0)>>8)&0xff, 0, 0 ); hmul1 = 0xffffffff - hmul0;
        hmul2 = __insn_shufflebytes( (((uint64_t)xx1)>>8)&0xff, 0, 0 ); hmul3 = 0xffffffff - hmul2;
        hmul4 = __insn_shufflebytes( (((uint64_t)xx2)>>8)&0xff, 0, 0 ); hmul5 = 0xffffffff - hmul4;
        hmul6 = __insn_shufflebytes( (((uint64_t)xx3)>>8)&0xff, 0, 0 ); hmul7 = 0xffffffff - hmul6;

        uint64_t r00 = __insn_v1mulu( p00, hmul1 ) + __insn_v1mulu( p01, hmul0 );
        uint64_t r01 = __insn_v1mulu( p02, hmul3 ) + __insn_v1mulu( p03, hmul2 );
        uint64_t r02 = __insn_v1mulu( p04, hmul1 ) + __insn_v1mulu( p05, hmul0 );
        uint64_t r03 = __insn_v1mulu( p06, hmul3 ) + __insn_v1mulu( p07, hmul2 );
        uint64_t r10 = __insn_v1mulu( p10, hmul1 ) + __insn_v1mulu( p11, hmul0 );
        uint64_t r11 = __insn_v1mulu( p12, hmul3 ) + __insn_v1mulu( p13, hmul2 );
        uint64_t r12 = __insn_v1mulu( p14, hmul1 ) + __insn_v1mulu( p15, hmul0 );
        uint64_t r13 = __insn_v1mulu( p16, hmul3 ) + __insn_v1mulu( p17, hmul2 );

        r00 = r00 + __insn_v2shrui( r00, 8 ) + roundval;
        r01 = r01 + __insn_v2shrui( r01, 8 ) + roundval;
        r02 = r02 + __insn_v2shrui( r02, 8 ) + roundval;
        r03 = r03 + __insn_v2shrui( r03, 8 ) + roundval;
        r10 = r10 + __insn_v2shrui( r10, 8 ) + roundval;
        r11 = r11 + __insn_v2shrui( r11, 8 ) + roundval;
        r12 = r12 + __insn_v2shrui( r12, 8 ) + roundval;
        r13 = r13 + __insn_v2shrui( r13, 8 ) + roundval;

        __insn_st_add( out0, __insn_v2packh( r01, r00 ), 8 );
        __insn_st_add( out0, __insn_v2packh( r03, r02 ), 8 );
        __insn_st_add( out1, __insn_v2packh( r11, r10 ), 8 );
        __insn_st_add( out1, __insn_v2packh( r13, r12 ), 8 );
    }

}

Now the new two-pass function can be called from main(). Let's do a second test run.

Test Run 2

[cisco:/tmp] $ ./main pics/ebu3325.bmp output.bmp 1280 720
Loading bitmap pics/ebu3325.bmp
Scale 1920*1080 to 1280*720

            1        2        4        8       16       32       35
002: 16117904 08084943 04260390 02631100 01792549 01485572 01442396
004: 15968793 08080710 04289561 02598674 01868979 01492084 01497614
008: 15988335 08054551 04365845 02660698 01895226 01516145 01526233
016: 15931625 08156092 04474528 02742360 01908299 01656613 01671737
032: 15928102 08172282 04479788 02688426 02159633 01617142 01616620
064: 15907024 08496182 04497551 03299163 02196561 02166533 02158902
128: 15903972 08497828 05782103 03240409 03252182 03262079 03259239
Saving bitmap output.bmp
[cisco:/tmp] $ ./main pics/ebu3325.bmp output.bmp 2560 1440
Loading bitmap pics/ebu3325.bmp
Scale 1920*1080 to 2560*1440

            1        2        4        8       16       32       35
002: 43672896 22455916 11575339 06538864 04057607 03063350 03037687
004: 43227885 22273059 11277832 06329944 03896596 03059239 03025512
008: 43196391 21928897 11331747 06474072 04067784 03162889 03241387
016: 43129765 21859133 11473258 06678085 04149490 03245579 03281954
032: 43125239 22168511 11894737 06679241 04183771 03553712 03610323
064: 43097456 22144521 11907886 06656842 04953841 03304123 03324928
128: 43054993 23046248 11923922 08469932 05042979 05019210 04987422
Saving bitmap output.bmp    

That's pretty interesting. For test case 1, it's faster across the board, but not by much: on a single core with 2 lines per job, roughly 16.1M cycles against 18.1M. Test case 2 is also faster everywhere, and a lot faster on few cores: roughly 43.7M cycles against 69.8M on a single core, since the vertical work is no longer repeated for every output pixel. Let's compare the 2560x1440, 4 lines case with the first attempt:

Graph 3

Summing Up

A library to parallelize code on Tilera TILE-Gx easily:

A library to allocate, load and save 24 bit bitmap files:

A bilinear picture scaler for the Tilera TILE-Gx:

Test code to scale pictures and measure time used:

Compile, link and run:

[user@mainframe]$ tile-cc -O3 -Wall -std=c99 -o main main.c par.c bmp.c scaler.c -ltmc -lpthread
[user@mainframe]$ scp main root@lastv36:/tmp
[lastv36:/tmp] $ ./main pics/ebu3325.bmp output.bmp 970 550
(...)
[lastv36:/tmp] $ 

Comments are always appreciated. I prefer being contacted on LinkedIn. Please refer to this webpage if we have no connections in common. If you're using old school email, you can figure out my mail address from the front page. Be aware that the spam filtering is extreme; hardly anything gets through.

Remember to appreciate this classic XKCD strip.


www.ignorantus.com