Al Green
Al Green's Brother
Home -- News -- Articles -- Books -- Source Code -- Videos -- Xmas -- LinkedIn -- About

Edgehog: 1080p60 Nano Fractals

Nils Liaaen Corneliusen - Article text, Edgehog and DFA code
Sjur Julin - OpenGL code, font, music remix
26 June 2023

Click to watch video on YouTube

Abstract

Presenting Edgehog: A method for doing fractal rectangle checking quickly on multiple ARM Neon cores. The output is rendered by a GPU using the Depth First Algorithm (DFA) described in the article GPU Hacks (Corneliusen 2016, 2017). This makes it possible to render 1080p60 depth 256 fractals on a dainty NVidia Jetson Nano. Full source code included.

Hogs on the Wing

DFA is discussed in more detail in the book Real Programming (Corneliusen/Julin 2021). Instead of doing the escape time check each step, calculations are done 32 at a time, with the result of the comparisons saved as bits in an uint. Locating the lowest bit in this uint determines where the first escape took place. If there was no escape, calculate the next 32 entries. This alternative brute force approach made it possible to render 1080p60 depth 256 fractals with all pixels at max depth on an NVidia TX2. 71 fps, actually. The TX1 peaked at 52 fps. The Nano is a TX1 with half the number of GPU cores and lower clocks, so it manages just 23.6 fps with the code from 2017. A lot of extra juice is needed.

The article Adaptive Parallel Computation with CUDA Dynamic Parallelism (Adinets 2014) [developer.nvidia.com] discusses using the rectangle checking method [en.wikipedia.org]: If all the pixels on the edge of an area have the same depth, the entire area has the same depth. A naive implementation of the Mariani-Silver algorithm is presented, which involves splitting the entire display into gradually smaller areas where edge depths are checked.

That method might be useful if you happen to have a large GPU. Unfortunately, that's not the case here. There are just 2 execution units with 64 cores each. However, there are 4 ARM cores in the Nano that are doing nothing useful while the GPU is working. My idea is to do edge checking on them in parallel for the next frame, essentially making it free. This saves a lot of GPU computations. However, the Nano still has to handle high load on all GPU and CPU cores. Is the standard cooling system adequate? Will it go into thermal shutdown? Will it render 1080p60 fractals? What kind of money is there in idioting? As usual, many questions and few answers. So far.

Edgehog Logo

Edgehog's Dilemma

An edge checking method I've dubbed Edgehog is presented in this chapter. It works as follows: Only areas of size 32x32 are evaluated, because... duh. Each ARM core picks up a job number and converts that to a coordinate. Then it calculates the depths for a left and top edge relative to that and stores the results in row and column buffers. Obviously, extra entries over the edges are needed.

The top edge has 32 pixels, and the left has 31 or the other way round. That doesn't matter: Say they're both 32 pixels to make the calculations simpler. The cost of calculating the corner pixel twice is considered negligible. A nifty two-part cutoff is used: First, if all the 32 pixels of the top or left edge escape at the same depth, store the depth and exit. Second, if any of the 32 pixels escape, store -1 for differing values and exit. Here's the kicker: Those are the only exit conditions if all of them are evaluated in parallel. There is absolutely no reason to track separate pixels or mask results away. As will be shown later, this can be done quite efficiently in C and Neon intrinsics.

Top left bottom right

When all the top and left edges are done, a dual loop slides over the buffers turning all the left depths into pairs of left and right depths, and all the top depths into pairs of top and bottom depths. Quads of top, bottom, left, and right depths determine what goes into the Edgehog buffer: -1 if they're not equal, otherwise the common depth. This gets fed to the GPU renderer that just fills all the pixels inside covered areas. The areas where actual work has to be done use the good, old DFA implementation from 2017.

Fractal Picture 00, normal Fractal Picture 00, Edgehog inverted

Some Assembly Required

Start by writing your own inline Assembler fetchadd(). Why? One is needed later, and it's fun! The load/store exclusive operations on ARM CPUs are awesome. They're also brutally fast. Compare that to Intel's lock+add fetchadd() solution that needs around 10000 cycles even if the address is uncontested. I measured that on an i7-7820X some years back. It's covered in the book that nobody has read, but everybody has very strong opinions about [abstrusegoose.com]. Anyway, maybe they have improved it in newer architectures. Don't hold your breath. Come to think of it, this implementation is A64-specific, so you might want to use something else on older ARMs.

static inline int atomic_fetchadd( volatile int *dst, int val )
{
    unsigned long status;
    int rc;
    int result;
    asm volatile(
        "// atomic_fetchadd\n\t"
        "1:\n\t"
        "ldaxr %w[rc],     [%[dst]]\n\t"
        "add   %w[result], %w[rc],     %w[val]\n\t"
        "stlxr %w[status], %w[result], [%[dst]]\n\t"
        "cbnz  %w[status], 1b\n\t"
    : [result] "=&r" (result), [rc] "=&r" (rc), [status] "=&r" (status), "+o" (*dst)
    : [dst] "r" (dst), [val] "Ir" (val)
    : "cc"
    );
    return rc;
}

Fractal Picture 01, normal Fractal Picture 01, Edgehog inverted

Set the Controls for the Heart of the Sun

A (small) pile of control code is needed for Edgehog. Start by spawning off some threads. That is considered trivial and not shown here. The thread control function processes a given number of frames by getting block numbers in each frame with the nifty fetchadd(). Those are sent to the Edgehog block function. Synchronization is done with a barrier. The timer stuff will be discussed later. So much for linearity in these new articles. The number of blocks processed is returned to the main thread when it's done:

static void *thread_func( void *arg )
{
    int me = (int)(size_t)arg;
    int ctr = 0;

    pthread_barrier_wait( &bar );
    printf( "%d: Ready\n", me);
    pthread_barrier_wait( &bar );

    int frcnt = frames;

    for( int i = 0; i < frcnt; i++ ) {
        pthread_barrier_wait( &bar );

        uint64_t frstart = framestart;

        while( 1 ) {
            int blockid = atomic_fetchadd( &blockctr, 1 );
            if( blockid >= ROWCOLCNT ) break;

            edgehog_top_left( blockid, &rowbuf[blockid], &colbuf[blockid] );
            ctr++;

            if( !disable_timer && !(ctr&0x0f) && get_usec() - frstart >= 15*1000 ) break;
        }

        pthread_barrier_wait( &bar );
    }

    return (void *)(size_t)ctr;
}

Let's look at the processing of blocks. The function edgehog_top_left() converts the block number to block coordinates and calls the edgehog32_top() and edgehog32_left() functions. Could have built a mega-function here. Didn't. Let's keep things orderly.

void edgehog_top_left( int blockid, int *rowdst, int *coldst )
{
    int bx = blockid%COLS;
    int by = blockid/COLS;

    float leftx = (-0.5f + bx/(DISPW/(float)BLOCKWH)) * 1.7778f * ezfac + ex;
    float topy  = (-0.5f + by/(DISPH/(float)BLOCKWH))           * ezfac + ey;

    *rowdst     = edgehog32_top(  leftx, topy );
    *coldst     = edgehog32_left( leftx, topy );
}

Those functions do the actual work and will be reviewed in the next chapter. They return either -1 for mismatch or the common depth. Store those in the row and column buffers. After all the blocks in a frame are done, the main loop calls edgehog_generate(). This function generates the Edgehog buffer by sliding over the buffers, as explained earlier. If you didn't get that, go away: There might be money in idioting. Lots of work can be put into pointer juggling and whatnot here, but there's no point. The buffers aren't that big. The invert flag is for visualizing where the Edgehog blocks are. It's used in the demo code.

static int edgehog_generate( const int *rbuf, const int *cbuf, int *dst, int invert )
{
    int hogs = 0;

    for( int y = 0; y < BLOCKSY; y++ ) {
        const int *row0 = rbuf + y*COLS;
        const int *col0 = cbuf + y*COLS;
        const int *row1 = row0 +   COLS;
        const int *col1 = col0 +      1;

        int *dstrow = dst + y*BLOCKSX;

        for( int x = 0; x < BLOCKSX; x++ ) {
            int topval   = *row0++;
            int botval   = *row1++;
            int leftval  = *col0++;
            int rightval = *col1++;

            int v = topval == botval && topval == leftval && topval == rightval ? topval : -1;
            if( invert && v != -1  ) v = 256 - v;

            hogs += v != -1 ? 1 : 0;

            *dstrow++ = v;
        }
    }

    return hogs;
}

Fractal Picture 02, normal Fractal Picture 02, Edgehog inverted

The Neon Wilderness

That was disturbingly simple. Let's review the edgehog32 functions that were skipped earlier. Only the edgehog32_top() function will be commented here. The left() function is pretty much the same, only vertical. The inputs are the X and Y coords to the upper left corner. The theme is: Go big or go home. It does 32 separate pixels per iteration, so the nifty cutoff can be easily applied. There should be no big differences from a normal fractal routine in the first part other than the width.

static int edgehog32_top( float xco, float yco )
{
    float32x4_t xco0 = vaddq_f32( wf0.val[0], vdupq_n_f32( leftx ) );
    float32x4_t xco1 = vaddq_f32( wf0.val[1], vdupq_n_f32( leftx ) );
    float32x4_t xco2 = vaddq_f32( wf0.val[2], vdupq_n_f32( leftx ) );
    float32x4_t xco3 = vaddq_f32( wf0.val[3], vdupq_n_f32( leftx ) );
    float32x4_t xco4 = vaddq_f32( wf1.val[0], vdupq_n_f32( leftx ) );
    float32x4_t xco5 = vaddq_f32( wf1.val[1], vdupq_n_f32( leftx ) );
    float32x4_t xco6 = vaddq_f32( wf1.val[2], vdupq_n_f32( leftx ) );
    float32x4_t xco7 = vaddq_f32( wf1.val[3], vdupq_n_f32( leftx ) );

    float32x4_t x0 = vdupq_n_f32( 0.0f ); float32x4_t x1 = x0;
    float32x4_t x2 = x0;                  float32x4_t x3 = x0;
    float32x4_t x4 = x0;                  float32x4_t x5 = x0;
    float32x4_t x6 = x0;                  float32x4_t x7 = x0;
    float32x4_t y0 = x0;                  float32x4_t y1 = x0;
    float32x4_t y2 = x0;                  float32x4_t y3 = x0;
    float32x4_t y4 = x0;                  float32x4_t y5 = x0;
    float32x4_t y6 = x0;                  float32x4_t y7 = x0;

    int i = 0;
    for( ; i < MAXITER; i++ ) {
        float32x4_t xt0 = vaddq_f32( vsubq_f32( vmulq_f32( x0, x0 ), vmulq_f32( y0, y0 ) ), xco0 );
        float32x4_t xt1 = vaddq_f32( vsubq_f32( vmulq_f32( x1, x1 ), vmulq_f32( y1, y1 ) ), xco1 );
        float32x4_t xt2 = vaddq_f32( vsubq_f32( vmulq_f32( x2, x2 ), vmulq_f32( y2, y2 ) ), xco2 );
        float32x4_t xt3 = vaddq_f32( vsubq_f32( vmulq_f32( x3, x3 ), vmulq_f32( y3, y3 ) ), xco3 );
        float32x4_t xt4 = vaddq_f32( vsubq_f32( vmulq_f32( x4, x4 ), vmulq_f32( y4, y4 ) ), xco4 );
        float32x4_t xt5 = vaddq_f32( vsubq_f32( vmulq_f32( x5, x5 ), vmulq_f32( y5, y5 ) ), xco5 );
        float32x4_t xt6 = vaddq_f32( vsubq_f32( vmulq_f32( x6, x6 ), vmulq_f32( y6, y6 ) ), xco6 );
        float32x4_t xt7 = vaddq_f32( vsubq_f32( vmulq_f32( x7, x7 ), vmulq_f32( y7, y7 ) ), xco7 );

        y0 = vaddq_f32( vmulq_n_f32( vmulq_f32( x0, y0 ), 2.0f ), vdupq_n_f32( topy ) ); x0  = xt0;
        y1 = vaddq_f32( vmulq_n_f32( vmulq_f32( x1, y1 ), 2.0f ), vdupq_n_f32( topy ) ); x1  = xt1;
        y2 = vaddq_f32( vmulq_n_f32( vmulq_f32( x2, y2 ), 2.0f ), vdupq_n_f32( topy ) ); x2  = xt2;
        y3 = vaddq_f32( vmulq_n_f32( vmulq_f32( x3, y3 ), 2.0f ), vdupq_n_f32( topy ) ); x3  = xt3;
        y4 = vaddq_f32( vmulq_n_f32( vmulq_f32( x4, y4 ), 2.0f ), vdupq_n_f32( topy ) ); x4  = xt4;
        y5 = vaddq_f32( vmulq_n_f32( vmulq_f32( x5, y5 ), 2.0f ), vdupq_n_f32( topy ) ); x5  = xt5;
        y6 = vaddq_f32( vmulq_n_f32( vmulq_f32( x6, y6 ), 2.0f ), vdupq_n_f32( topy ) ); x6  = xt6;
        y7 = vaddq_f32( vmulq_n_f32( vmulq_f32( x7, y7 ), 2.0f ), vdupq_n_f32( topy ) ); x7  = xt7;

Use vector compare to find out which (if any) pixels exited. Unsigned 32-bit values in the result vector will be set to -1 if it exited, or 0 if not. Yeah, I meant that. This is C, not amateur night decade at the C++ Club.

        uint32x4_t resa = vcgtq_f32( vaddq_f32( vmulq_f32( x0, x0 ), vmulq_f32( y0, y0 ) ), vdupq_n_f32( 4.0f ) );
        uint32x4_t resb = vcgtq_f32( vaddq_f32( vmulq_f32( x1, x1 ), vmulq_f32( y1, y1 ) ), vdupq_n_f32( 4.0f ) );
        uint32x4_t resc = vcgtq_f32( vaddq_f32( vmulq_f32( x2, x2 ), vmulq_f32( y2, y2 ) ), vdupq_n_f32( 4.0f ) );
        uint32x4_t resd = vcgtq_f32( vaddq_f32( vmulq_f32( x3, x3 ), vmulq_f32( y3, y3 ) ), vdupq_n_f32( 4.0f ) );
        uint32x4_t rese = vcgtq_f32( vaddq_f32( vmulq_f32( x4, x4 ), vmulq_f32( y4, y4 ) ), vdupq_n_f32( 4.0f ) );
        uint32x4_t resf = vcgtq_f32( vaddq_f32( vmulq_f32( x5, x5 ), vmulq_f32( y5, y5 ) ), vdupq_n_f32( 4.0f ) );
        uint32x4_t resg = vcgtq_f32( vaddq_f32( vmulq_f32( x6, x6 ), vmulq_f32( y6, y6 ) ), vdupq_n_f32( 4.0f ) );
        uint32x4_t resh = vcgtq_f32( vaddq_f32( vmulq_f32( x7, x7 ), vmulq_f32( y7, y7 ) ), vdupq_n_f32( 4.0f ) );

Remember the nifty cutoff? First, AND all the elements of the result vectors: This tells us if all of them escaped at the same value.

        uint32x4_t abcd;
        abcd = vandq_u32( vandq_u32( vandq_u32( resa, resb ), vandq_u32( resc, resd ) ),
                          vandq_u32( vandq_u32( rese, resf ), vandq_u32( resg, resh ) ) );
        if( (abcd[0]&abcd[1]&abcd[2]&abcd[3]) == -1 ) break;

Second, OR all the elements of the result vectors: This tells us if any of them escaped.

        abcd = vorrq_u32( vorrq_u32( vorrq_u32( resa, resb ), vorrq_u32( resc, resd ) ),
                          vorrq_u32( vorrq_u32( rese, resf ), vorrq_u32( resg, resh ) ) );
        if( (abcd[0]|abcd[1]|abcd[2]|abcd[3]) == -1 ) { i = -1; break; }
    }

    return i;
}

Can this be done faster? Most likely. Have a go at it! As always, remember to look at the generated code. A quick attempt at pulling out the multiplies etc. by hand yielded little change. Since register usage is quite high, an Assembler implementation is probably the best way to go. For once, I'm going to cut some corners and ask the burning question: What is the airspeed velocity... sorry, is this solution fast enough to make the Nano GPU render 1080p60 depth 256 fractals with all pixels at max depth? I made a performance measurement for that purpose:

nils@mork:~/src/edgehog$ ./edgehog -p
Performance test: 4 cores, 4 threads, 200 frames per pos
Block 32*32, grid 60*34, cnt 2040, 1920*1088
Image 0: x=-0.75000000, y= 0.00000000, z=  512.00. Hog/frame: 19.08ms, hogs: 2040 (100.00%), diff:    0 (  0.00%)
Image 1: x=-0.75000000, y= 0.00000000, z=    2.00. Hog/frame: 11.25ms, hogs: 1314 ( 64.41%), diff:  726 ( 35.59%)
Image 2: x=-0.74515796, y= 0.11257483, z= 1024.00. Hog/frame:  4.67ms, hogs:  758 ( 37.16%), diff: 1282 ( 62.84%)
Image 3: x=-0.34292933, y=-0.63927984, z=16384.00. Hog/frame: 10.21ms, hogs:  217 ( 10.64%), diff: 1823 ( 89.36%)
Image 4: x= 0.28004330, y= 0.00900090, z=  512.00. Hog/frame:  6.26ms, hogs:  546 ( 26.76%), diff: 1494 ( 73.24%)
Image 5: x=-1.50211167, y= 0.00000000, z= 1024.00. Hog/frame:  2.60ms, hogs:  415 ( 20.34%), diff: 1625 ( 79.66%)
Thread 0 done, blocks = 319236
Thread 1 done, blocks = 321153
Thread 2 done, blocks = 319628
Thread 3 done, blocks = 320983
nils@mork:~/src/edgehog$

Image 0 has all pixels at max depth. The multi-threaded Edgehog needs 19 ms to process that. Did it fail? Most certainly not. It just means that it can deal with approx 80% of the pixels at max depth. For each frame, set all pixels in the row and column buffers to -1, let the Edgehog threads run until the time limit is reached, stop the threads, convert the tables, and send the result to the GPU. Well, almost. The threads decide when to stop themselves, hence the timer stuff in thread_func() shown earlier. Cores are few and far between, so there is no point in having a central bureaucracy wasting precious resources. The GPU has no problems handling the rest of the max depth pixels. Problem solved.

That wraps up the Edgehog buffer generation. Let's get on with it and peek inside the new GPU shader that uses this to speed things up. Finally.

Fractal Picture 03, normal Fractal Picture 03, Edgehog inverted

DFA: Death From Above... Err

The wiseness of calling something DFA is questionable. The acronym means lots of things according to the only remaining place on the internet where searching doesn't make you sad: Wikipedia. I rarely bother searching on Google [youtube.com] anymore. So... bad naming in 2016. I'll stick with it.

Ah, the GPU GLSL code! Luckily, the Nano has a proper OS, so it can run proper OpenGL. Start with the old DFA implementation and replace the version, inputs, and outputs. Version 4.20 is particularly good. Trust me. Define max depth and DFA depth. I'll get back to that.

#version 420
precision highp float;
(...)
in  vec2 vUV1;
out vec4 outCol;

uniform float sx;
uniform float sy;
uniform float sz;

layout( binding = 0 ) uniform stuff
{
    int edgy[BLOCKCNT];
};

#define MAXDEPTH 256
#define DFADEPTH  32

But wait! Wouldn't that be a waste of space? Yes. Don't care. The table is 2040 entries. Pack it into bytes and reduce max depth if (small) size is important. The coordinates are put in separate uniforms for easier access, meaning it's easy to replace dfahog.glsl with dfa.glsl, or a random fractal shader from the internet. Try it!

A way to convert depth to color is needed. The book presents two remarkably simple and quick methods to do HSV to RGB conversion. Nobody seems to have noticed: I still see gigantic if-constructions and switches, if-statements replaced by mix/clamp/fmod and other wackiness everywhere. Do a search and see for yourself.

Well, screw that. We're not into idioting here. Leave that to the pythoneers [smbc-comics.com]. In the fractal case, set S=V=1.0, meaning the color will stick to the outside of a HSV hexagonal. Have black as last entry, since everybody expects fractals to fade to black. Unfortunately, with a max depth as low as 256, the results will be subpar. Instead, use the shorter table below for better color distribution and less aliasing. Feel free to replace it with something prettier. Shouldn't be hard. Faster would be hard.

#define RGBS 4
const vec3 cols[RGBS] = vec3[RGBS](
    vec3( 1.0f, 0.0f, 0.0f ),
    vec3( 1.0f, 1.0f, 0.0f ),
    vec3( 0.0f, 0.5f, 1.0f ),
    vec3( 0.0f, 0.0f, 0.0f )
);
vec3 getRGB( float h )
{
    float Slice     = float(RGBS-1) * h;
    float SliceInt  = floor( Slice );
    float SliceFrac = Slice - SliceInt;
    int i = int( SliceInt );
    return mix( cols[i], cols[i+1], SliceFrac );
}

Use the pixel coordinate to find the correct block in the Edgehog table. If it's a match, just fill the pixel and exit. An alternative approach would be covering the Edgehog blocks with colored OpenGL quads to avoid launching shaders for those pixels. Try it!

void main()
{
    // Check for Edgehog match
    int blx = int(floor(vUV1.x*float(DISPW-1)/float(BLOCKWH)));
    int bly = int(floor(vUV1.y*float(DISPH-1)/float(BLOCKWH)));
    int prval = edgy[bly*BLOCKSX+blx];
    if( prval != -1 ) {
        outCol = vec4( getRGB( float(prval) / float(MAXDEPTH) ), 1.0f );
        return;
    }

It didn't match, so use DFA. First, set up the position and init the x and y vectors as usual:

    // Not Edgehog, use DFA
    float x0 = 1.7778f * (vUV1.x - 0.5f) / (0.5f * sz) + sx;
    float y0 =           (vUV1.y - 0.5f) / (0.5f * sz) + sy;

    vec4 xv0 = vec4( 0.0f );
    vec4 yv0 = vec4( 0.0f );

Two loops, but the inner loop will be unrolled completely due to no breaks. It's covered in the book. DFADEPTH is 32, so evaluate 4*8=32 depths in one iteration, exactly the bit length of an uint. That's not a coincidence.

    uint res32;
    int i = 0;
    for( ; i < MAXDEPTH; i += DFADEPTH ) {
        for( int j = 0; j < DFADEPTH; j += 4 ) {

Calculate the 4 next steps in the vectors:

            xv0.x = xv0.w * xv0.w - yv0.w * yv0.w + x0; yv0.x = 2.0f * xv0.w * yv0.w + y0;
            xv0.y = xv0.x * xv0.x - yv0.x * yv0.x + x0; yv0.y = 2.0f * xv0.x * yv0.x + y0;
            xv0.z = xv0.y * xv0.y - yv0.y * yv0.y + x0; yv0.z = 2.0f * xv0.y * yv0.y + y0;
            xv0.w = xv0.z * xv0.z - yv0.z * yv0.z + x0; yv0.w = 2.0f * xv0.z * yv0.z + y0;

Do the compares and save the results as bits in res32. Several methods were tried in 2017. This turned out to be the fastest, though it seems counterintuitive. Still seems to be the fastest today. If you think you have a better solution, fine, but measure how fast it runs before gloating.

            res32 = bitfieldInsert( res32, uint(xv0.x * xv0.x + yv0.x * yv0.x > 4.0f), j+0, 1 );
            res32 = bitfieldInsert( res32, uint(xv0.y * xv0.y + yv0.y * yv0.y > 4.0f), j+1, 1 );
            res32 = bitfieldInsert( res32, uint(xv0.z * xv0.z + yv0.z * yv0.z > 4.0f), j+2, 1 );
            res32 = bitfieldInsert( res32, uint(xv0.w * xv0.w + yv0.w * yv0.w > 4.0f), j+3, 1 );
        }

When 32 values are done, check for any exits. Do the next 32 if not.

        if( res32 != 0u ) break;
    }

If an exit was found, find the lowest bit set in res32 meaning first exit and add it to the counter. That's the depth. Find the color. Job done!

    int lsb = findLSB( res32 );
    i += lsb == -1 ? 0 : lsb;

    outColor = vec4( getRGB( float(i) / float(MAXDEPTH) ), 1.0f );
}

For convenience, a separate DFA-only shader is provided in the Source Code section.

Fractal Picture 04, normal Fractal Picture 04, Edgehog inverted

Learning to Fly

That takes care of the essence of Edgehog and DFA. What's missing is the fun part: The demo code. All this was for nothing if the output can't be rendered in 1080p60 on a Nano. Sjur, the master of disaster... sorry, master of the V73D engine, put together just the stuff needed. Obviously, we don't need the entire V73D engine, just a subset that renders a single shader. He carefully extracted the pieces needed and put them in nanofrax_gl.c. He also made the awesome 8x8 font and remixed the music. It's based on an old Amiga module by Xerxes/Triumph called give it a try!! [modarchive.org].

Let's review the important parts of the setup code and control loop, to outline how Edgehog is used. Declare the shader uniform buffer where the Edgehog data goes:

typedef struct {
    int edgy[BLOCKCNT];
} gpufrac;

Init the uniform buffer and wait for threads to finish initializing:

    gpufrac gf;
    initUniformBuffer( NULL, sizeof(gf) );

    // Wait for init barrier and important printf barrier
    pthread_barrier_wait( &bar );
    pthread_barrier_wait( &bar );

    printf( "Demo: %d cores, %d threads\n", CORECNT, THREADCNT );

Edgehog calculates the next block inside the loop, so calculate the first block outside the loop:

    // Prepare first edgehog buffer
    getnextpos( &nextx, &nexty, &nextz );
    edgehog_setpos( nextx, nexty, nextz );
    blockctr = 0;
    framestart = get_usec();
    pthread_barrier_wait( &bar ); // Start
    pthread_barrier_wait( &bar ); // Wait for finish
    edgehog_generate( rowbuf, colbuf, gf.edgy, inverter );

Set up the loop and find the current and next set of fractal coordinates. Reset the row and column buffers. Start Edgehog with the next frame coords. Copy previous Edgehog result into the uniform buffer and make the shader use the current coords. Draw() paints the sky with stars... err, draws the screen, obviously.

    while(!quit) {

        // Mow along, nothing to see here
        currx = nextx; curry = nexty; currz = nextz;
        int wrap = getnextpos( &nextx, &nexty, &nextz );

        for( int i = 0; i < ROWCOLCNT; i++ ) {
            rowbuf[i] = -1;
            colbuf[i] = -1;
        }

        // Hogs on the wing for next frame
        blockctr = 0;
        edgehog_setpos( nextx, nexty, nextz );
        framestart = get_usec();
        pthread_barrier_wait( &bar );

        // Current position for this frame
        glUniform1f( sxLoc, currx );
        glUniform1f( syLoc, curry );
        glUniform1f( szLoc, currz );
        copyToUniformBuffer( &gf, sizeof(gf) );

        Draw();

When that's done, wait for Edgehog to finish and produce the next Edgehog buffer based on the row/column buffers. Recall that the Edgehog threads will stop themselves if they run for too long, so there's no extra administration here. I really don't care how far the threads got. Well, I do if the framerate drops below 60. Shouldn't happen. Jinx.

        // Wait for threads done
        pthread_barrier_wait( &bar );

        hogs += edgehog_generate( rowbuf, colbuf, gf.edgy, 0 );

        loops++;
    }
}

That's it. Simple and efficient, like all good C code.

Source Exposure

Complete source code archive:

Complete list files in the archive:

Xavier and Nano

Build a Fire

Compile and link it on a Nano:

nils@mork:~/src/Nanofrax$ make
gcc nanofrax.c nanofrax_gl.c edgehog.c txt8x8.c -O3 -ffast-math -Wall -Wextra -lm -lX11 -lGL -lGLEW -lpthread -o Nanofrax
nils@mork:~/src/Nanofrax$ ls -l Nanofrax
-rwxrwxr-x 1 nils nils 47632 Jun 25 09:15 Nanofrax
nils@mork:~/src/Nanofrax$

47 KB? It's getting bloated! Can squeeze it down to 43 KB with -O2, but the aggressive loop unrolling is needed for speed.

Start it with no arguments and it will run the demo. It does exactly the same as the video at the top. The video was recorded live from the HDMI output, so what is shown is the actual Nano output. We gave it that 1987 look just because we can. 1987 was the year the Amiga 500 launched, the greatest computer ever. The hoi polloi might disagree. We really don't care.

nils@mork:~/src/Nanofrax$ ./Nanofrax
Starting thread 0 on core 0
Starting thread 1 on core 1
Starting thread 2 on core 2
Starting thread 3 on core 3
2: Ready
1: Ready
3: Ready
0: Ready
Demo: 4 cores, 4 threads
(zzz)
Frames: 4800 FPS: 60.00 Hogs: 5790449
nils@mork:~/src/Nanofrax$

The Remains of the Day

We calculate 1080p60 depth 256 fractals on a measly NVidia Jetson Nano. The code uses all 4 ARM cores and the GPU in parallel. Have we imported gargantuan threading frameworks, 3D engines, imaging libraries, and the like? No. Edgehog is 200 lines, the shader is 90 lines, and the synchronization code is 18 lines, 4 of them Assembler. The executable size is below 50 KB. Is the code easy to understand? Definitely. It's C: It's simple, it's fast, it's readable. C is not hard to learn. C is hard to master. It requires that you apply yourself.

Apply Yourself

An entire book has been written about this: It's mainly about how hard problems can be solved in simpler and faster ways by just, wait for it, applying yourself. Not blindly trusting random crap written by strangers is also central. A well-documented fact that cannot be denied anymore is that Python uses 76 times more energy than C [uminho.pt]. While pythoneers waste immense amounts of energy, I'll invest in coal mines. Let me know when that Python fad wears off.

Feedback

Technical comments are welcome. Contact information is available here.

If you have a problem with the article text in any way, please watch this informative video:

Click to watch video on YouTube

Remember to appreciate this XKCD strip titled "Marble Run" [xkcd.com].

Article Licensing Information

This article is published under the following license: Attribution-NoDerivatives 4.0 International (CC BY-ND 4.0)
Short summary: You may copy and redistribute the material in any medium or format for any purpose, even commercially. You must give appropriate credit, provide a link to the license, and indicate if changes were made. If you remix, transform, or build upon the material, you may not distribute the modified material.

Licensed Items

Item: Modified logo from the game Sonic the Hedgehog. License: Public Domain


Ekte Programmering Norwegian flag
American flag Real Programming
Ignorantus AS