Raytracing on Tilera TILE-Gx

Nils L. Corneliusen
21 April 2014

Update 5 June 2017: Made a faster integer version with 40 spheres. New article here.

Introduction

RayGX

Raytracing is an interesting subject. The basics are simple, as explained in this article. Since each pixel can be calculated separately, it's a pretty easy job to parallelize given enough cores with floating point hardware.

The Tilera TILE-Gx36 CPU has 36 cores, obviously, but is a bit limited on the floating point side. There's the basic add/sub/mul instructions, but they must be piled up into sequences of four instructions: mul1/mul2/pack1/pack2. Given the execution times specified in the manual, and that all the instructions are pipelined, at least 4 cycles is needed per operation. It should do quite ok, but a project for the future is to make an integer based version.

Back in 1995 I used a raytracer made by P. C. Ødegård in an Amiga intro called "Speed" which can be found here (video here). S. A. Olsen made a parallel version of that which could trace on computers spread around on a network. Unfortunately, that tracer is a bit too complex, so time to look around on the net for another one.

Shodruky Rhyammer has written a very simple and clean raytracer in C for the Parallella board. The Parallella has a 16 core Adapteva Epiphany chip with pretty decent fpu performance. His code can be found on GitHub. He has also published this video of it on YouTube:

So let's use Mr Rhyammer's code as a starting point and try to get 1920x1080p60 raytracing on the Tilera TILE-Gx36. And, for a change, stop when the goal is reached.

Easy does it

An initial attempt to get it to run on a single core went very smoothly. Not very quick, but it worked. The Tilera devkit provides a simple example of how to write a basic parallel application. Using that code as a starting point, making the first parallel raytracer was easy. Just fork() off some processes, assign them to the correct cores, and things worked as expected.

The UDN network is used for communication. This net is very efficient but is tied to the core. So only a single thread or process on a given core can use it at a time. That's no big problem; use one core as a controller which distributes jobs and displays/saves images, and the rest are tracers doing actual work.

But we're still far away from 60 frames per second. The main time hogs seem to be the plane at the bottom and all the black pixels. Ok, let's remove the plane. This improves the situation, but not enough. Dialling down the number of reflections to 3 and setting a max of 6 spheres with 2 non-reflecting ones also improves things, but we're not at p60 yet.

There's some general optimization that can be done, in no particular order: Making the loop counter in trace_ray() constant lets gcc unroll things nicely. The shadow-if gets mispredicted, so put a __builtin_expect() there. The rad*rad calculation is done at setup time. The results from trace_ray() are packed differently. Some min() calls are not necessary and removed. Converting between float and int is costly; use a hand crafted version instead. Pointer referred vars are copied to local, but not all: Gcc will start dumping regs to stack if one more is copied.

Chop chop

With the plane gone, there's a majority of black pixels on screen. This can be exploited by assuming the whole area is black if the corners are black. So, by trial and error, dividing the screen into 16x16 blocks seems to be the most efficient size. This gives 4 tests for 256 pixels, and a massive speedup for the black areas.

The code that does is as follows:

    // just clear block if no hits on corners
    vec3_t spos;
    spos       = startpos;          vec3_t ul  = normalize( spos );
    spos.y    -= tg->ay*(BLOCKY-1); vec3_t bl  = normalize( spos );
    spos.x    += tg->ax*(BLOCKX-1); vec3_t br  = normalize( spos );
    spos.y    += tg->ay*(BLOCKY-1); vec3_t ur  = normalize( spos );

    int i;
    for( i = 0; i < OBJNUM; i++ ) {
        vec3_t  v  = tg->obj[i].poseye; // v  = vec3_sub(ray.pos, tg->obj[i].pos);
        float  rv  = tg->obj[i].rv;     // rv = -norm2( v ) + tg->obj[i].radsq;
        if( dotsq( v, ul ) + rv > 0.0f || dotsq( v, bl ) + rv > 0.0f ||
            dotsq( v, br ) + rv > 0.0f || dotsq( v, ur ) + rv > 0.0f    ) break;
    }
    if( i == OBJNUM ) clear block...

Gcc's optimizer actually manages to do a decent job of sorting the normalizes and such here, so no need to mess it up further. Some of the vars used between frames are precalculated by the controller as indicated.

This is not a perfect solution: It will fail on short edges, and completely remove very small spheres. So to avoid that a (crap) bounding box is added which the spheres bounce around in.

Communication and Control

1080 is not divisible by 16, so for simplicity the screen size is rounded up to 1920x1088. That gives a total of (1920*1088)/(16*16)=8160 jobs to be distributed. That's a reasonably low count, and "perf" says that very little time is wasted on the UDN calls. Unfortunately, "top" will show 100% load on all cores that are waiting for jobs in tmc_udn0_receive().

The resulting main controller loop is very simple:

        for( int y = 0; y < rh; y += BLOCKY ) {

           for( int x = 0; x < rw; x += BLOCKX ) {

               // wait for a core to be ready...
               uint32_t coreready = tmc_udn0_receive();

               // ...and give it the next job.
               tmc_udn_send_2( tmc_udn_header_from_cpu( coreready ), UDN0_DEMUX_TAG, (x<<16)|y, (uint64_t)dst );
            }

       }

And ditto for the tracer cores:

    while( true ) {

       // say we're ready
       tmc_udn_send_1( header, UDN0_DEMUX_TAG, coreid );

       // wait for a job...
       uint32_t xy = tmc_udn0_receive();
       if( xy == 0xffff ) exit( 0 );

       ptr = (uint8_t *)tmc_udn0_receive();

       // ...and do it.
       render( tglob, ptr, xy>>16, xy&0xffff );
    }

To fork() or not to fork()

The parallelization code from the example used fork(). That's cozy, but it would be more convenient if all cores shared the same data. The scene data is only modified by the controller but must be read by all the tracers. Pointers to the outbut buffers have to be distributed to the tracers, and do not like to be passed across multiple processes. Luckily, Mr T. Ringstad had some Tilera-code available that uses pthreads instead of fork(). Integrating that into the system was pretty simple and didn't change the execution speed considerably.

Measuring time used

RayGX Live

Mr Ringstad also helped me measure performance using assorted tools. Let's see how that worked out.

Since using the UDN network gives 100% load all the time, get_cycle_count() is used before and after the main loop and added up. This will not be 100% accurate since there's no message that says a core is finished. But there's little point in cluttering up the communication interface further; the data will be written by the time the display gets there.

The bar on top shows time spent on 5 frames. There'll be a bit of delay before it gets properly updated, depending on the number of frame buffers used. We should still be able to see the load go up when there's a lot of reflections, as in the top picture.

Using the "-m" argument will set a repeatable configuration for performance testing.

[lastv36:/tmp] $ ./ray -m
Measurement config being used
frames:    1200
cores:     36
blocksize: 16*16
timeskip:  5
randseed:  42
channel:   0
display:   off
output:    1920*1080
adjusted:  1920*1088
total cycles: 13556415290
cpp: 5.448019
average frame time: 11297013.000000
[lastv36:/tmp] $

The output is a bit crude. Should have put some more work into that. Anyway, 1200 frames has been traced using 13556415290 cycles on a 1.2GHz CPU. That should be about 56% load average. And there's still enough spare processing power deliver 60 fps when all spheres are in the front and reflecting alot. Job done!

"perf stat" gives some interesting statistics:

[lastv36:/tmp] $ perf stat ./ray -m
(...)
 Performance counter stats for './ray -m':

     408160.161969 task-clock                #   35.885 CPUs utilized
               931 context-switches          #    0.002 K/sec
                36 CPU-migrations            #    0.000 K/sec
                 0 page-faults               #    0.000 K/sec
      489981514476 cycles                    #    1.200 GHz
   <not supported> stalled-cycles-frontend
   <not supported> stalled-cycles-backend
      414677919463 instructions              #    0.85  insns per cycle
        5810962749 predicted branches        #   14.237 M/sec
        2185507061 branch-misses             #   27.33% of all branches

      11.374262960 seconds time elapsed

0.85 bundles per cycle is not bad, but a higher number would have been better. Let's read out some perfomance counters to locate possible bottlenecks:

[lastv36:/tmp] $ perf stat -e cycles -e instructions -e INSTRUCTION_BUNDLE -e INSTRUCTION_STALL -e ICACHE_FILL_PEND
                           -e ICACHE_FILL -e LOAD_STALL -e LOAD_HIT_STALL -e  COND_BRANCH_PRED_INCORRECT
                           -e WAY_MISPREDICT -e UDN_SRC_STALL ./ray -m
(...)
 Performance counter stats for './ray -m':

      492324609803 cycles                    #    0.000 GHz                     [36.37%]
      415678938112 instructions              #    0.84  insns per cycle         [36.37%]
      415664493993 INSTRUCTION_BUNDLE                                           [36.37%]
       60028451610 INSTRUCTION_STALL                                            [36.37%]
        1482515964 ICACHE_FILL_PEND                                             [36.37%]
         302649061 ICACHE_FILL                                                  [36.37%]
        2968067349 LOAD_STALL                                                   [36.37%]
          64152341 LOAD_HIT_STALL                                               [36.37%]
        2227614342 COND_BRANCH_PRED_INCORRECT                                    [36.37%]
         895564700 WAY_MISPREDICT                                               [36.37%]
       18672105113 UDN_SRC_STALL                                                [36.37%]

      11.482561104 seconds time elapsed

Using "mcstat -w 1" to watch memory utilitization shows no big surprises either. Memory saturation is at only 3 and 5 percent on controller 0 and 1. Less would have been preferred, since things should fit permanently in the cache, but that's a problem for another day.

That leaves plenty of time to appreciate this classic Abstruse Goose strip.

Results

Using an, err, spare Cisco SX80 to run it on, here's a couple of videos showing it in action:

Enough talk, where's the code?

Source files:

I had to replace the display code with some dummy functions due to copyright issues. Display will be permanently turned off and the output will be put in a malloc'ed buffer instead. Output can still be saved to bmp files. Due to the way the display works, the data is organized as rgb planar, so there's an extra conversion step before saving as bmp. It should be trivial to make it output rgba packed for the bmp files directly.

What about the Tilera TILE-Gx72?

Mr Ringstad had a spare TILE-Gx72 on the net, so why not try running the tracer on that? Apart from twice the number of cores, it has twice the cache and twice the number of memory controllers, so the effect of a lower clock is mitigated:

[lastv72:/tmp] $ ./ray -m
Measurement config being used
frames:    1200
cores:     72
blocksize: 16*16
timeskip:  5
randseed:  42
channel:   0
display:   off
output:    1920*1080
adjusted:  1920*1088
total cycles: 6610402977
cpp: 2.656573
average frame time: 5508669.000000
[lastv72:/tmp] $ 

That's twice as quick at 16.67% lower clock rate. Neat.


www.ignorantus.com