Random stuff I'm working on, in chronological order.

Video with 96 spheres. Doing 16x16 block probing on the CPU before rendering it on the GPU:

In the never ending hunt for more spheres, it's time to turn to the CPU for answers. It's mostly idling anwyay. Let the CPU check the corners of each 16x16 block of the display and determine if the area is covered by only spheres, only shadow plane/sky or both. This makes the GPU code significantly quicker, since half the code can be eliminated for most of the screen:

Corner checking takes around 6ms CPU time, so let a separate thread calculate that for the next frame while the current one is rendering. Synchronize it all with a barrier. Simple. This calls for an increase to 96 spheres at 60 fps (barely):

Code needs some work before publication.

There was a slight mishap in the raytracer code which makes some shadows disappear. It's noticeable in the video if you pay close attention to the shadows. The measurments are a bit off too, but it's still above 60 fps after fixing it, so I'm not gonna bother updating them. The GLSL source code in the GPU Hacks Article has been updated.

GPURay 1.7, 80 spheres at 60 fps on NVidia Tegra X1:

Music: 1992 Amiga Oktalyzer module by Jan Wegger.

There's an interesting pragma that can be used in NVidia GLSL code: #pragma optionNV (unroll all). Trying it on the raytracer gives interesting results: A huge assembler file that has no REP loops at all, and slightly quicker execution time.

So the trick is to find the correct unroll level and do it manually. Earlier attempts were clearly too conservative. Automatic unrolling doesn't work if there's if() statements in the loop. After some trial and error it seems that 16 is a good unroll value for a target of 80 spheres. Since the unroll is greater, using vec4 in the second loops starts paying off too. Let's increase to 80 spheres and do a test run. Blue is old code, orange is unroll all and green is unroll 16 + vec4:

The loops end up looking like this:

// trace_ray() for( int i = 0; i < OBJNUM; i += 16 ) { vec3 v0, v1, v2, v3; vec4 b03, d03; bvec4 dv; // 0-3 v0 = ray_pos - obj_pos[i+0].xyz; v1 = ray_pos - obj_pos[i+1].xyz; v2 = ray_pos - obj_pos[i+2].xyz; v3 = ray_pos - obj_pos[i+3].xyz; b03 = vec4( dot( v0, ray_dir ), dot( v1, ray_dir ), dot( v2, ray_dir ), dot( v3, ray_dir ) ); d03 = vec4( b03.x * b03.x - dot( v0, v0 ) + obj_pos[i+0].w, b03.y * b03.y - dot( v1, v1 ) + obj_pos[i+1].w, b03.z * b03.z - dot( v2, v2 ) + obj_pos[i+2].w, b03.w * b03.w - dot( v3, v3 ) + obj_pos[i+3].w ); dv = greaterThan( d03, vec4( 0.0f ) ); if( any( dv ) ) { vec4 t03 = -b03 - sqrt( d03 ); dv = dv && greaterThan( t03, vec4( 0.0f ) ); if( dv.x && t03.x < rt_dist ) { rt_dist = t03.x; rt_hit = i+0; } if( dv.y && t03.y < rt_dist ) { rt_dist = t03.y; rt_hit = i+1; } if( dv.z && t03.z < rt_dist ) { rt_dist = t03.z; rt_hit = i+2; } if( dv.w && t03.w < rt_dist ) { rt_dist = t03.w; rt_hit = i+3; } } // ditto for 4-7,8-11,12-15... } // ... // check shadow int i; for( i = 0; i < OBJNUM; i += 16 ) { vec3 v0, v1, v2, v3; vec4 b03, d03; v0 = ray_pos - obj_pos[i+0].xyz; v1 = ray_pos - obj_pos[i+1].xyz; v2 = ray_pos - obj_pos[i+2].xyz; v3 = ray_pos - obj_pos[i+3].xyz; b03 = vec4( dot( v0, l ), dot( v1, l ), dot( v2, l ), dot( v3, l ) ); d03 = vec4( b03.x * b03.x - dot( v0, v0 ) + obj_pos[i+0].w, b03.y * b03.y - dot( v1, v1 ) + obj_pos[i+1].w, b03.z * b03.z - dot( v2, v2 ) + obj_pos[i+2].w, b03.w * b03.w - dot( v3, v3 ) + obj_pos[i+3].w ); if( any( greaterThan( d03, vec4( 0.0f ) ) ) ) break; // ditto for 4-7,8-11,12-15... }

An attempt to maximize the Tegra X1 GPU power usage. Around 11 watts is the current highscore while still looking cool:

One issue with the solution below is the excessive use of if()-arguments. By wrapping things up in vectors we can get most of them out of the way. greaterThan() etc. in OpenGL compiles to the corresponding sgt etc.

for( int i = 0; i < OBJNUM; i += 4 ) { vec3 v0 = ray_pos - obj_pos[i+0].xyz; vec3 v1 = ray_pos - obj_pos[i+1].xyz; vec3 v2 = ray_pos - obj_pos[i+2].xyz; vec3 v3 = ray_pos - obj_pos[i+3].xyz; vec4 b03 = vec4( dot( v0, ray_dir ), dot( v1, ray_dir ), dot( v2, ray_dir ), dot( v3, ray_dir ) ); vec4 d03 = vec4( b03.x * b03.x - dot( v0, v0 ) + obj_pos[i+0].w, b03.y * b03.y - dot( v1, v1 ) + obj_pos[i+1].w, b03.z * b03.z - dot( v2, v2 ) + obj_pos[i+2].w, b03.w * b03.w - dot( v3, v3 ) + obj_pos[i+3].w ); bvec4 dv = greaterThanEqual( d03, vec4( 0.0f ) ); if( any( dv ) ) { vec4 t03 = -b03 - sqrt( d03 ); dv = dv && greaterThanEqual( t03, vec4( 0.0f ) ); if( dv.x && t03.x < rt_dist ) { rt_dist = t03.x; rt_hit = i; } if( dv.y && t03.y < rt_dist ) { rt_dist = t03.y; rt_hit = i+1; } if( dv.z && t03.z < rt_dist ) { rt_dist = t03.z; rt_hit = i+2; } if( dv.w && t03.w < rt_dist ) { rt_dist = t03.w; rt_hit = i+3; } } }

The if() set can be reduced with some mix() and min() and stuff. The generated code is neat, but it's not very quick yet. Have to investigate that further. Luckily, this is not like in the old days.

The cool part is that it can now do 72 spheres:

Video with 64 spheres:

While perusing more generated code and trying out other unroll structures, it's clear that the one from 28 October is not totally optimal for larger numbers of spheres. A different approach yielding better results seems to be turning down the lower unroll to 4 (not shown), and increasing the upper to 4 while moving the sqrts out of the way. So there's some double tests, but the compiler seems to figure it out. The whole point of that is to insert a continue if all 4 doesn't hit at all. The miss ratio is pretty high and we're going through everything anyway:

for( int i = 0; i < OBJNUM; i += 4 ) { vec3 v0 = ray_pos - obj_pos[i+0].xyz; vec3 v1 = ray_pos - obj_pos[i+1].xyz; vec3 v2 = ray_pos - obj_pos[i+2].xyz; vec3 v3 = ray_pos - obj_pos[i+3].xyz; float b0 = dot( v0, ray_dir ); float b1 = dot( v1, ray_dir ); float b2 = dot( v2, ray_dir ); float b3 = dot( v3, ray_dir ); float d0 = b0 * b0 - dot( v0, v0 ) + obj_pos[i+0].w; float d1 = b1 * b1 - dot( v1, v1 ) + obj_pos[i+1].w; float d2 = b2 * b2 - dot( v2, v2 ) + obj_pos[i+2].w; float d3 = b3 * b3 - dot( v3, v3 ) + obj_pos[i+3].w; if( d0 < 0.0f && d1 < 0.0f && d2 < 0.0f && d3 < 0.0f ) continue; float t0 = -b0 - sqrt( d0 ); float t1 = -b1 - sqrt( d1 ); float t2 = -b2 - sqrt( d2 ); float t3 = -b3 - sqrt( d3 ); if( d0 >= 0.0f && t0 > 0.0f && t0 < rt_dist ) { rt_dist = t0; rt_hit = i; } if( d1 >= 0.0f && t1 > 0.0f && t1 < rt_dist ) { rt_dist = t1; rt_hit = i+1; } if( d2 >= 0.0f && t2 > 0.0f && t2 < rt_dist ) { rt_dist = t2; rt_hit = i+2; } if( d3 >= 0.0f && t3 > 0.0f && t3 < rt_dist ) { rt_dist = t3; rt_hit = i+3; } }

The good news is that the number of spheres can be increased to 64 with plenty of cycles to spare. The bad news is that there's not enough cycles for next multiple of 4. Have to find a way to get around that.

In the never-ending hunt for more spheres in the GPU Raytracer, I noticed that the glGetProgramBinary() call outputs assembler code too, in NV_gpu_program5 format. That makes it reasonably easy to see where improvements can be made.

The conditional construct in the first loop seems to confuse the compiler, and blocking negative square roots is pointless when they're basically free. Wrap it all up in a simple if() that reduces nicely in the generated assembler code. Still need to unroll 2 by hand, though, since the compiler still cannot unroll automatically when there's if() expressions in the loop:

for( int i = 0; i < OBJNUM; i += 2 ) { vec3 v0 = ray_pos - obj_pos[i+0].xyz; vec3 v1 = ray_pos - obj_pos[i+1].xyz; float b0 = dot( v0, ray_dir ); float b1 = dot( v1, ray_dir ); float d0 = b0 * b0 - dot( v0, v0 ) + obj_pos[i+0].w; float d1 = b1 * b1 - dot( v1, v1 ) + obj_pos[i+1].w; float t0 = -b0 - sqrt( d0 ); float t1 = -b1 - sqrt( d1 ); if( d0 >= 0.0f && t0 > 0.0f && t0 < rt_dist ) { rt_dist = t0; rt_hit = i; } if( d1 >= 0.0f && t1 > 0.0f && t1 < rt_dist ) { rt_dist = t1; rt_hit = i+1; } }

A balance has to be made between unrolling loop 1 and 2. Seems like unrolling loop 1 twice, as above, and loop 2 10 (!) times is the best combination:

for( i = 0; i < OBJNUM; i += 10 ) { vec3 v0 = ray_pos - obj_pos[i+0].xyz; vec3 v1 = ray_pos - obj_pos[i+1].xyz; vec3 v2 = ray_pos - obj_pos[i+2].xyz; vec3 v3 = ray_pos - obj_pos[i+3].xyz; vec3 v4 = ray_pos - obj_pos[i+4].xyz; vec3 v5 = ray_pos - obj_pos[i+5].xyz; vec3 v6 = ray_pos - obj_pos[i+6].xyz; vec3 v7 = ray_pos - obj_pos[i+7].xyz; vec3 v8 = ray_pos - obj_pos[i+8].xyz; vec3 v9 = ray_pos - obj_pos[i+9].xyz; float b0 = dot( v0, l ); float b1 = dot( v1, l ); float b2 = dot( v2, l ); float b3 = dot( v3, l ); float b4 = dot( v4, l ); float b5 = dot( v5, l ); float b6 = dot( v6, l ); float b7 = dot( v7, l ); float b8 = dot( v8, l ); float b9 = dot( v9, l ); float d0 = b0 * b0 - dot( v0, v0 ) + obj_pos[i+0].w; float d1 = b1 * b1 - dot( v1, v1 ) + obj_pos[i+1].w; float d2 = b2 * b2 - dot( v2, v2 ) + obj_pos[i+2].w; float d3 = b3 * b3 - dot( v3, v3 ) + obj_pos[i+3].w; float d4 = b4 * b4 - dot( v4, v4 ) + obj_pos[i+4].w; float d5 = b5 * b5 - dot( v5, v5 ) + obj_pos[i+5].w; float d6 = b6 * b6 - dot( v6, v6 ) + obj_pos[i+6].w; float d7 = b7 * b7 - dot( v7, v7 ) + obj_pos[i+7].w; float d8 = b8 * b8 - dot( v8, v8 ) + obj_pos[i+8].w; float d9 = b9 * b9 - dot( v9, v9 ) + obj_pos[i+9].w; if( d0 > 0.0f || d1 > 0.0f || d2 > 0.0f || d3 > 0.0f || d4 > 0.0f || d5 > 0.0f || d6 > 0.0f || d7 > 0.0f || d8 > 0.0f || d9 > 0.0f ) break; }

Unlike gcc, the NVidia compiler seems to be able to look ahead far enough to convert this into a coherent set of dp3/mad/or/trunc instructions. It's now considered wise to keep number of spheres a multiple of 10. Looks like it can finally do 60 spheres at 60 fps:

New video:

I missed a trivial optimization in the shadow calculation in the GPU Raytracer. There's no point in calculating the distance, it's enough to know whether any sphere is hit or not. That simplifies the loop enough so it can be unrolled further. A limitation is that the sphere count has to be a multiple of 4, not 2 as earlier:

(...) int i; for( i = 0; i < OBJNUM; i += 4 ) { vec3 v0 = ray_pos - obj_pos[i+0].xyz; vec3 v1 = ray_pos - obj_pos[i+1].xyz; vec3 v2 = ray_pos - obj_pos[i+2].xyz; vec3 v3 = ray_pos - obj_pos[i+3].xyz; float b0 = dot( v0, l ); float b1 = dot( v1, l ); float b2 = dot( v2, l ); float b3 = dot( v3, l ); float d0 = b0 * b0 - dot( v0, v0 ) + obj_pos[i+0].w; float d1 = b1 * b1 - dot( v1, v1 ) + obj_pos[i+1].w; float d2 = b2 * b2 - dot( v2, v2 ) + obj_pos[i+2].w; float d3 = b3 * b3 - dot( v3, v3 ) + obj_pos[i+3].w; if( d0 > 0.0f || d1 > 0.0f || d2 > 0.0f || d3 > 0.0f ) break; } (...)

Let's compare the old and the new version, now with 56 spheres:

Obviously, the same is true for the input data in the fractal and quaternion code too. The results are less dramatic, but the quaternion routine is now stable above 60 fps. Updated the article and made new measurements.

The NVidia article How About Constant Buffers gives some interesting information about how data is stored and accessed on the GPU. The scene data in the GPU Raytracer is constant and the usage pattern fairly regular, so storing it in a constant buffer sounds like a good idea. Constant buffers translate to uniform buffers in OpenGL, so it's a matter of just replacing "shader storage" with "uniform" in the right places.

The results are spectacular. Old version (SSBO) versus new version (UBO), 32 spheres:

Video below. Can now have 52 spheres and keep fps above 60 all the time:

Made videos of MultiDoom and MultiQuake running on the Mellanox TILE-Gx36 multicore CPU:

It's been a slow summer, so I finally had time to update the SRTP AES Optimization article from 2010. Should now support all CPU types and packet sizes larger than 4096. It's available here: SRTP AES Optimization Revisited

Source files for the generic raytracer:

Just stuff the files in a directory and compile and link as specified below. Option -t0 (the default) will run it directly without setting up any threads.

A quick port of the GPU Raytracer to more generic C code. Also includes optional Pthreads support.

[user@phyreztorm]$ gcc -O3 -ffast-math -Wall -Wextra -std=gnu99 -o raytracer -lm -lpthread raytracer.c bmp.c [user@phyreztorm]$ ./raytracer -t 16 -o 7680x4320 Threads: 16 Thread 0: 270 lines ( 0- 269) Thread 2: 270 lines ( 540- 809) Thread 3: 270 lines ( 810-1079) (...etc...) Time: 892.050049 ms Saving bitmap out.bmp

Just dividing into chunks of lines isn't a very good idea. Should use blocks like in the TILE-Gx raytracer, but that'd require more work. Test run using assorted thread counts on Intel Xeon E5-2699v3:

Source code for GPU fractals, raytracer and quaternion raytraced julia sets now available here.

A friend scanned all issues of the old Norwegian Amiga magazine Digital 1989-1992. They're available here.

Back to 32 spheres in raytracer after removing some deadwood.

But that's boring, raytraced quaternion julia sets are cooler:

Still pretty rough, but it runs. Based on code by Keenan Crane.

Sacrificed some spheres to get 60 fps with plane shadows:

April is GPU appreciation month! Some attempts at making old cr... err, old stuff run on an NVidia Tegra X1.

36 objects, 3 reflections, non-reflective shadows on center sphere. 126 lines shader code, could be less if nvcc would stop screwing up unrolling.

Based on a GPU implementation by John Tsiombikas. I spent some time optimizing it and used a HSV'ish palette instead. 256 iterations. 45 fps if all pixels are at max iterations, aka. black screen. So 60 fps for all normal cases. Need bigger GPU.

Based on a CPU implementation by Lode Vandevenne. Relatively easy to port to a GLSL shader.

Using a HSV-based palette when making effects is useful, since it wraps without any edges. But the color has to be converted back to RGB before displaying, and this can be complicated. Several methods of assorted quality can be found on the net. This one seems popular, and this one is also interesting. Donald Reynold's implementation uses cosines for everything, but then the hexagonal shape is gone. So let's do a twist: Use a cosine for the basic H slope, clamp() for the flat top/bottom and mix() or similar for S and V:

// Input: HSV 0.0f..1.0f vec3 HSV2RGB( vec3 hsv ) { float h = hsv.x * 2.0f*PI; // h 0..2*PI float d120 = 2.0f*PI/3.0f; vec3 hv = vec3( h, h-d120, h+d120 ); vec3 rgb = clamp( 0.5f + cos( hv ), 0.0f, 1.0f ); // flatten h before s,v return ((1.0f-hsv.y) + rgb*hsv.y) * hsv.z; // mix( 1.0f,rgb,hsv.y) => ((rgb-1.0f)*hsv.y+1.0f)*hsv.z }

A straight H2RGB where S=V=1.0f can be made even simpler:

vec3 H2RGB( float h ) { h *= 2.0f*PI; // h 0..2*PI float d120 = 2.0f*PI/3.0f; vec3 hv = vec3( h, h-d120, h+d120 ); return 0.5f + cos( hv ); // add clamp() if needed }