YUV420 to RGB Conversion on Tilera TILE-Gx

Nils L. Corneliusen
7 November 2012

Introduction

I've earlier explored how to do some simple things on the Tilera TILE-Gx 36-core CPU, like the Doom port and a fast AES routine. The CPU has a reasonably advanced set of SIMD instructions that I haven't exploited much yet, so let's see if we can make a fast YUV420 to RGB conversion routine.

I'll use the same initial requirements as in the SSE2 YUV conversion article, i.e. this formula:

YUV to RGB conversion is defined as follows in Video Demystified:

B = 1.164(Y - 16)                   + 2.018(U - 128)
G = 1.164(Y - 16) - 0.813(V - 128)  - 0.391(U - 128)
R = 1.164(Y - 16) + 1.596(V - 128)
saturate results to 0..255.
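To have something to check the SIMD version against, the formula can be written as a plain scalar routine. This is just a reference sketch in portable C, not part of the optimized code:

```c
#include <stdint.h>

/* Scalar reference for the formula above: compute in floating point,
   round, and saturate to 0..255. */
static uint8_t clamp255( double v )
{
    if( v < 0.0 )   return 0;
    if( v > 255.0 ) return 255;
    return (uint8_t)(v + 0.5);
}

static void yuv2rgb_ref( uint8_t y, uint8_t u, uint8_t v,
                         uint8_t *r, uint8_t *g, uint8_t *b )
{
    double yt = 1.164 * ( y - 16 );
    *r = clamp255( yt + 1.596 * ( v - 128 ) );
    *g = clamp255( yt - 0.813 * ( v - 128 ) - 0.391 * ( u - 128 ) );
    *b = clamp255( yt + 2.018 * ( u - 128 ) );
}
```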

The SSE2 version was easier to write, since it has 8x16-bit multipliers. On the TILE-Gx we only have 4x8 or 4x16, and they're only available in the X0 pipeline slot. So we have to conserve multiplies, and use the 8-bit variants where possible to avoid excess interleaves.

Short -n- Sweet

Let's start with the inner loop; a complete routine is given at the bottom.

First, some defines for the multiplier constants. The y factor is 16 bits per lane; see below for why.

#define facy   0x004a004a004a004a
#define facrv  0x66666666
#define facgu  0x19191919
#define facgv  0x34343434
#define facbu  0x81818181

Next, we need to load and do the 8 bit subs on u and v:

    y0 = __insn_ld( srcy0 );
    y1 = __insn_ld( srcy0 + sy );
    u0 = __insn_v1addi( __insn_ld4u_add( srcu0, 4 ), -128 );
    v0 = __insn_v1addi( __insn_ld4u_add( srcv0, 4 ), -128 );
    srcy0 += 8;

If we only wanted to support the limited Y range starting at 16, the basic Y multiplications could be done in 8 bit. Unfortunately, sources that don't stick to that range will look like crap, so let's do the Y part in 16 bit, and the rest in 8 bit:

    y0l = __insn_v2mults( __insn_v2addi( __insn_v1int_l( 0, y0 ), -16 ), facy );
    y0h = __insn_v2mults( __insn_v2addi( __insn_v1int_h( 0, y0 ), -16 ), facy );
    y1l = __insn_v2mults( __insn_v2addi( __insn_v1int_l( 0, y1 ), -16 ), facy );
    y1h = __insn_v2mults( __insn_v2addi( __insn_v1int_h( 0, y1 ), -16 ), facy );
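16 bits is enough headroom here: even for full-range input the fixed-point Y term stays well inside a signed 16-bit lane, so v2mults never saturates before the shift. A quick sanity check (my arithmetic, not from the original):

```c
#include <assert.h>

/* The 6-bit fixed-point Y term as computed per 16-bit lane above:
   (y - 16) * 74. Worst cases are y = 255 and y = 0. */
static int yterm_fix6( int y )
{
    return ( y - 16 ) * 74;
}
```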

Then we calculate the u and v constants for both rows, using as few muls as possible to make scheduling easier:

    rv = __insn_v1mulus( facrv, v0 );
    gu = __insn_v1mulus( facgu, u0 );
    gv = __insn_v1mulus( facgv, v0 );
    bu = __insn_v1mulus( facbu, u0 );

    rvh = __insn_v2int_h( rv, rv ); rvl = __insn_v2int_l( rv, rv );
    buh = __insn_v2int_h( bu, bu ); bul = __insn_v2int_l( bu, bu );

If gcc were a smarter compiler, it'd figure out that the gu and gv parts can be combined. It doesn't, so we'll have to do that manually. I should probably try this in the SSE2 routine too.

    gugv  = __insn_v2addsc( gu, gv );
    gugvh = __insn_v2int_h( gugv, gugv );
    gugvl = __insn_v2int_l( gugv, gugv );
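Per pixel, the green pipeline now amounts to: sum the two chroma products first, subtract the sum from the Y term, shift down by 6, and saturate. A scalar model of that math (a sketch of one lane, not the vector code itself):

```c
#include <stdint.h>

/* Scalar model of the green computation: combined gu+gv term
   subtracted from the 6-bit fixed-point Y term, then shifted and
   clamped, mirroring v2subsc/v2shrsi/v2packuc per lane. */
static uint8_t green_fix6( uint8_t y, uint8_t u, uint8_t v )
{
    int yt   = ( y - 16 ) * 74;                       /* facy          */
    int gugv = ( v - 128 ) * 52 + ( u - 128 ) * 25;   /* facgv + facgu */
    int g    = ( yt - gugv ) >> 6;
    if( g < 0 )   g = 0;
    if( g > 255 ) g = 255;
    return (uint8_t)g;
}
```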

Now we have all the parts needed to calculate r, g and b values for row 0:

    r0 = __insn_v2packuc( __insn_v2shrsi( __insn_v2addsc( y0h, rvh ), 6 ),
                          __insn_v2shrsi( __insn_v2addsc( y0l, rvl ), 6 ) );

    g0 = __insn_v2packuc( __insn_v2shrsi( __insn_v2subsc( y0h, gugvh ), 6 ),
                          __insn_v2shrsi( __insn_v2subsc( y0l, gugvl ), 6 ) );

    b0 = __insn_v2packuc( __insn_v2shrsi( __insn_v2addsc( y0h, buh ), 6 ),
                          __insn_v2shrsi( __insn_v2addsc( y0l, bul ), 6 ) );

These results are planar, which isn't very useful. We need to shuffle them around before storing:

    zrl = __insn_v1int_l( 0,  r0 ); zrh = __insn_v1int_h( 0,  r0 );
    gbl = __insn_v1int_l( g0, b0 ); gbh = __insn_v1int_h( g0, b0 );

    __insn_st_add( dst0, __insn_v2int_l( zrl, gbl ), 8 );
    __insn_st_add( dst0, __insn_v2int_h( zrl, gbl ), 8 );
    __insn_st_add( dst0, __insn_v2int_l( zrh, gbh ), 8 );
    __insn_st_add( dst0, __insn_v2int_h( zrh, gbh ), 8 );
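As I read the interleaves, each pixel ends up as a 32-bit word with B in the low byte and zero in the high byte: first the byte interleaves build 0x00rr and 0xggbb halfwords, then the halfword interleave merges them into 0x00rrggbb. The net effect, as a portable sketch:

```c
#include <stdint.h>

/* Portable model of the final byte order: the interleave chain above
   packs each pixel as (msb) 0, R, G, B (lsb) in a 32-bit word. */
static void pack_argb( const uint8_t *r, const uint8_t *g,
                       const uint8_t *b, uint32_t *dst, int n )
{
    for( int i = 0; i < n; i++ )
        dst[i] = ( (uint32_t)r[i] << 16 )
               | ( (uint32_t)g[i] << 8 )
               |   (uint32_t)b[i];
}
```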

Then we just repeat the last two steps with y1 for row 1:

    r0 = __insn_v2packuc( __insn_v2shrsi( __insn_v2addsc( y1h, rvh ), 6 ),
                          __insn_v2shrsi( __insn_v2addsc( y1l, rvl ), 6 ) );

    g0 = __insn_v2packuc( __insn_v2shrsi( __insn_v2subsc( y1h, gugvh ), 6 ),
                          __insn_v2shrsi( __insn_v2subsc( y1l, gugvl ), 6 ) );

    b0 = __insn_v2packuc( __insn_v2shrsi( __insn_v2addsc( y1h, buh ), 6 ),
                          __insn_v2shrsi( __insn_v2addsc( y1l, bul ), 6 ) );

    zrl = __insn_v1int_l( 0,  r0 ); zrh = __insn_v1int_h( 0,  r0 );
    gbl = __insn_v1int_l( g0, b0 ); gbh = __insn_v1int_h( g0, b0 );

    __insn_st_add( dst1, __insn_v2int_l( zrl, gbl ), 8 );
    __insn_st_add( dst1, __insn_v2int_h( zrl, gbl ), 8 );
    __insn_st_add( dst1, __insn_v2int_l( zrh, gbh ), 8 );
    __insn_st_add( dst1, __insn_v2int_h( zrh, gbh ), 8 );

And we're done! Or are we? Of course not.

Looking at the generated code, there's a flurry of redundant movei rxx,0 instructions. Instead of loading zero once outside the loop, gcc reloads it right before every single use. So that has to be done manually; I do that in the code below.

Skyscrapers

A complete routine would look something like the code below. Gcc now manages to maintain the base pointers without all the usual fuss, so that code is kept as simple as possible.

void yuv420_to_argb8888( uint8_t *srcy, uint8_t *srcu, uint8_t *srcv,
                         uint32_t sy, uint32_t suv,
                         int width, int height,
                         uint32_t *rgb, uint32_t srgb )
{
    int x, y;
    uint8_t *srcy0, *srcu0, *srcv0;
    uint64_t *dst0, *dst1;
    uint64_t r0,g0,b0;
    uint64_t y0, y1, u0, v0;
    uint64_t y0l, y0h, y1l, y1h;
    uint64_t rv,gu,gv,bu;
    uint64_t rvh,rvl,buh,bul;
    uint64_t gugv, gugvh, gugvl;
    uint64_t zrl, zrh, gbl, gbh;
    uint64_t zero;

    zero = 0;

    for( y = 0; y < height; y += 2 ) {

        dst0 = (uint64_t *)(rgb + y*srgb);
        dst1 = (uint64_t *)(rgb + y*srgb + srgb);
        srcy0 = srcy + y*sy;
        srcu0 = srcu + y/2*suv;
        srcv0 = srcv + y/2*suv;

        for( x = 0; x < width; x += 8 ) {

            y0 = __insn_ld( srcy0 );
            y1 = __insn_ld( srcy0 + sy );
            u0 = __insn_v1addi( __insn_ld4u_add( srcu0, 4 ), -128 );
            v0 = __insn_v1addi( __insn_ld4u_add( srcv0, 4 ), -128 );
            srcy0 += 8;

            y0l = __insn_v2mults( __insn_v2addi( __insn_v1int_l( zero, y0 ), -16 ), facy );
            y0h = __insn_v2mults( __insn_v2addi( __insn_v1int_h( zero, y0 ), -16 ), facy );
            y1l = __insn_v2mults( __insn_v2addi( __insn_v1int_l( zero, y1 ), -16 ), facy );
            y1h = __insn_v2mults( __insn_v2addi( __insn_v1int_h( zero, y1 ), -16 ), facy );

            rv = __insn_v1mulus( facrv, v0 );
            gu = __insn_v1mulus( facgu, u0 );
            gv = __insn_v1mulus( facgv, v0 );
            bu = __insn_v1mulus( facbu, u0 );

            rvh = __insn_v2int_h( rv, rv ); rvl = __insn_v2int_l( rv, rv );
            buh = __insn_v2int_h( bu, bu ); bul = __insn_v2int_l( bu, bu );

            gugv  = __insn_v2addsc( gu, gv );
            gugvh = __insn_v2int_h( gugv, gugv );
            gugvl = __insn_v2int_l( gugv, gugv );

            // row 0
            r0 = __insn_v2packuc( __insn_v2shrsi( __insn_v2addsc( y0h, rvh ), 6 ),
                                  __insn_v2shrsi( __insn_v2addsc( y0l, rvl ), 6 ) );

            g0 = __insn_v2packuc( __insn_v2shrsi( __insn_v2subsc( y0h, gugvh ), 6 ),
                                  __insn_v2shrsi( __insn_v2subsc( y0l, gugvl ), 6 ) );

            b0 = __insn_v2packuc( __insn_v2shrsi( __insn_v2addsc( y0h, buh ), 6 ),
                                  __insn_v2shrsi( __insn_v2addsc( y0l, bul ), 6 ) );

            zrl = __insn_v1int_l( zero,  r0 ); zrh = __insn_v1int_h( zero,  r0 );
            gbl = __insn_v1int_l( g0,    b0 ); gbh = __insn_v1int_h( g0,    b0 );

            __insn_st_add( dst0, __insn_v2int_l( zrl, gbl ), 8 );
            __insn_st_add( dst0, __insn_v2int_h( zrl, gbl ), 8 );
            __insn_st_add( dst0, __insn_v2int_l( zrh, gbh ), 8 );
            __insn_st_add( dst0, __insn_v2int_h( zrh, gbh ), 8 );

            // row 1
            r0 = __insn_v2packuc( __insn_v2shrsi( __insn_v2addsc( y1h, rvh ), 6 ),
                                  __insn_v2shrsi( __insn_v2addsc( y1l, rvl ), 6 ) );

            g0 = __insn_v2packuc( __insn_v2shrsi( __insn_v2subsc( y1h, gugvh ), 6 ),
                                  __insn_v2shrsi( __insn_v2subsc( y1l, gugvl ), 6 ) );

            b0 = __insn_v2packuc( __insn_v2shrsi( __insn_v2addsc( y1h, buh ), 6 ),
                                  __insn_v2shrsi( __insn_v2addsc( y1l, bul ), 6 ) );

            zrl = __insn_v1int_l( zero,  r0 ); zrh = __insn_v1int_h( zero,  r0 );
            gbl = __insn_v1int_l( g0,    b0 ); gbh = __insn_v1int_h( g0,    b0 );

            __insn_st_add( dst1, __insn_v2int_l( zrl, gbl ), 8 );
            __insn_st_add( dst1, __insn_v2int_h( zrl, gbl ), 8 );
            __insn_st_add( dst1, __insn_v2int_l( zrh, gbh ), 8 );
            __insn_st_add( dst1, __insn_v2int_h( zrh, gbh ), 8 );

        }

    }

}

Fast Break

So, how quick is it? Due to some mishaps in the lab I cannot provide any real-world numbers, but running on the simulator in functional mode gives 5ms per 1080p frame. The inner loop is 46 cycles long, and it sustains 2 instructions per cycle everywhere except at the branch. It still refuses to hoist the one remaining movei rxx,0 out of the loop, but at least we're down to one. This is pretty close to optimal if we count the necessary instructions.
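The 5ms figure is consistent with the cycle count: one iteration handles an 8x2 pixel block, so a 1080p frame needs about 6.0M cycles, which is roughly 5ms at 1.2 GHz. The clock speed is my assumption, not a measurement:

```c
#include <assert.h>

/* Back-of-the-envelope check: total inner-loop cycles for one frame,
   at 46 cycles per 8-wide, 2-row iteration. Ignores loop overhead. */
static long frame_cycles( int width, int height, int cycles_per_iter )
{
    return (long)cycles_per_iter * ( width / 8 ) * ( height / 2 );
}
```

5,961,600 cycles / 1.2 GHz is about 4.97ms per frame, matching the simulator figure.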

If I've missed anything obvious you can figure out my email from the front page. Also, remember to appreciate this classic XKCD strip.


www.ignorantus.com