Optimizing Perlin Noise

This post is a step-by-step guide through my attempts to optimize the perlin noise implementation in the noise-rs library.

What is perlin noise?

Perlin noise, invented by Ken Perlin, is an algorithm for creating smooth, continuous, random values over a given space. It has many applications both in 3D rendering and elsewhere. It, and algorithms like it, are the basis of almost all procedurally generated content whether it’s a marble texture on the floor of a CG scene or the entire world of Minecraft.

Perlin noise works by dividing space up into hypercubes. These are squares in 2D, cubes in 3D, tesseracts in 4D, and so on. It first figures out which cell your point is in, assigns a random gradient to each corner of that cell, then blends the gradients together to generate your final value. The gradients are chosen using only the corner coordinates, which means neighboring cells will share the same gradients for the corners they have in common. This makes neighboring cells blend together smoothly.
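
To make that concrete, here is a rough 2D sketch of the process. This is purely illustrative rather than the noise-rs code, and the gradient_at parameter stands in for whatever scheme is used to pick a pseudo-random gradient for each corner:

// Rough 2D sketch: find the cell, get a gradient per corner, dot each
// gradient with the offset from that corner to the point, then blend.
fn perlin2_sketch(x: f64, y: f64, gradient_at: impl Fn(i64, i64) -> (f64, f64)) -> f64 {
    let (x0, y0) = (x.floor() as i64, y.floor() as i64); // which cell we are in
    let (fx, fy) = (x - x.floor(), y - y.floor());       // position within the cell

    // contribution of one corner: gradient . (offset from corner to point)
    let corner = |cx: i64, cy: i64| {
        let (gx, gy) = gradient_at(cx, cy);
        gx * (x - cx as f64) + gy * (y - cy as f64)
    };

    // smooth the blend weights so values match up across cell boundaries
    let s = |t: f64| t * t * t * (t * (t * 6.0 - 15.0) + 10.0);
    let (u, v) = (s(fx), s(fy));

    // bilinear blend of the four corner contributions
    let lerp = |a: f64, b: f64, t: f64| a + t * (b - a);
    lerp(
        lerp(corner(x0, y0), corner(x0 + 1, y0), u),
        lerp(corner(x0, y0 + 1), corner(x0 + 1, y0 + 1), u),
        v,
    )
}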

Why worry about optimization?

This is always an important question to think about. While never worrying about optimization and performance can obviously have its downsides, you can just as easily end up obsessing over some small piece of your code that doesn’t actually make your overall program faster. You must consider Amdahl’s law to ensure you are getting the most benefit from your optimization work: if noise generation is only 10% of your program’s runtime, even an infinite speedup of it makes the whole program at most about 11% faster.

That being said, noise generation of any kind can easily be a large proportion of your overall computation, especially if you’re using it for on-the-fly terrain generation. This is even more true in the rendering world, where it has been observed that “90% of 3D rendering time is spent in shading, and 90% of that time is spent computing Perlin noise”. It seems clear that any optimization we can make to this process, however small, will have noticeable benefits in many domains.

Getting started

Before we make any attempts to optimize our noise, we must first have a solid starting codebase and measurements to compare our changes against. We’re going to be looking at 3D perlin noise here, as it is the most commonly used version. Here is our starting implementation, written in Rust:

pub fn perlin3<T: Float>(point: &math::Point3<T>) -> T {
    #[inline(always)]
    fn gradient<T: Float>(whole: math::Point3<isize>, frac: math::Vector3<T>) -> T {
        let x: usize = math::cast(whole[0] & math::cast(0xff));
        let y: usize = math::cast(whole[1] & math::cast(0xff));
        let z: usize = math::cast(whole[2] & math::cast(0xff));
        let hash = PERM[PERM[PERM[x] + y] + z] & 15;
        let u = if hash < 8 {
            frac[0]
        } else {
            frac[1]
        };
        let v = if hash < 4 {
            frac[1]
        } else if hash == 12 || hash == 14 {
            frac[0]
        } else {
            frac[2]
        };
        (if hash & 1 == 0 { u } else { -u }) + (if hash & 2 == 0 { v } else { -v })
    }

    let floored = math::map3(*point, Float::floor);
    let whole0  = math::map3(floored, math::cast);
    let whole1  = math::add3(whole0, math::one3());
    let frac0   = math::sub3(*point, floored);
    let frac1   = math::sub3(frac0, math::one3());
    let curve   = math::map3(frac0, math::scurve5);

    let f000 = gradient([whole0[0], whole0[1], whole0[2]], [frac0[0], frac0[1], frac0[2]]);
    let f100 = gradient([whole1[0], whole0[1], whole0[2]], [frac1[0], frac0[1], frac0[2]]);
    let f010 = gradient([whole0[0], whole1[1], whole0[2]], [frac0[0], frac1[1], frac0[2]]);
    let f110 = gradient([whole1[0], whole1[1], whole0[2]], [frac1[0], frac1[1], frac0[2]]);
    let f001 = gradient([whole0[0], whole0[1], whole1[2]], [frac0[0], frac0[1], frac1[2]]);
    let f101 = gradient([whole1[0], whole0[1], whole1[2]], [frac1[0], frac0[1], frac1[2]]);
    let f011 = gradient([whole0[0], whole1[1], whole1[2]], [frac0[0], frac1[1], frac1[2]]);
    let f111 = gradient([whole1[0], whole1[1], whole1[2]], [frac1[0], frac1[1], frac1[2]]);

    math::trilerp(curve, f000, f100, f010, f110, f001, f101, f011, f111)
}

This is a fairly standard implementation of improved perlin noise. The PERM variable is the standard permutation table used for perlin noise: the numbers 0 through 255, shuffled and then repeated to make a 512-element array. The various math helpers mainly let us apply an operation to multiple values at once and are not particularly interesting.
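
For reference, a table like that could be built along these lines. This is only a sketch of the idea (with a stand-in xorshift PRNG); in noise-rs the table is a fixed, pre-shuffled constant rather than something generated at runtime:

fn build_perm(seed: u64) -> [usize; 512] {
    // shuffle the values 0..=255 with a Fisher-Yates pass
    let mut values: Vec<usize> = (0..256usize).collect();
    let mut state = seed | 1; // xorshift must not start at zero
    for i in (1..256).rev() {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        values.swap(i, (state as usize) % (i + 1));
    }

    // repeat the shuffled block twice so PERM[a + b] never goes out of bounds
    let mut perm = [0usize; 512];
    for i in 0..512 {
        perm[i] = values[i % 256];
    }
    perm
}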

So, how fast is this? Using the built-in Rust benchmarking tool we can see that it takes 7730 ns for a single run of this function. That might seem like a large number, but remember, that’s nanoseconds. In a single second we can run this function 129,366 times, which is not too shabby. However, we can do better.
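
For the curious, the timings throughout this post come from Rust’s built-in (nightly-only) benchmark harness. A minimal sketch of such a benchmark looks roughly like this, assuming math::Point3<f64> is just a [f64; 3] and that the exact input point doesn’t matter much:

// Goes in the crate root; the bench harness requires a nightly compiler.
#![feature(test)]
extern crate test;

#[bench]
fn bench_perlin3(b: &mut test::Bencher) {
    let point = [0.5f64, 1.5, 2.5];
    // black_box keeps the optimizer from removing the call entirely
    b.iter(|| test::black_box(perlin3(&point)));
}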

Shrinking the permutation table

Our permutation table is storing the values between 0 and 255, which fit in a single byte. However, we’re using an array of usize elements to store these values, and a usize is either 4 or 8 bytes depending on your system. This is a waste of space and, more importantly, a waste of cache space. Our first change is to make this an array of u8 elements and modify the lookup code appropriately:

        let x: usize = math::cast(whole[0] & math::cast(0xff));
        let y: usize = math::cast(whole[1] & math::cast(0xff));
        let z: usize = math::cast(whole[2] & math::cast(0xff));
        let hash = PERM[PERM[PERM[x] as usize + y] as usize + z] & 15;

The extra casting is required because array indices must be usize in Rust, which is also why we leave the x, y, and z variables as usize. This reduces the size of the array from 2048 or 4096 bytes to just 512 bytes. Running our benchmark again shows us that our run time is now 608 ns. This is a huge performance gain for such a small change and shows the importance of fitting in cache.
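
The declaration side of the change is equally small; it looks something like this, with the table contents elided:

// Before: one machine word per entry (2048 or 4096 bytes in total).
static PERM: [usize; 512] = [/* the shuffled 0..=255 sequence, repeated */];

// After: the same values fit in a byte each (512 bytes in total).
static PERM: [u8; 512] = [/* the shuffled 0..=255 sequence, repeated */];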

That’s not the end of it, though: we can reduce the size of the permutation table even more. The reason we need a 512-entry permutation table, even though we’re only storing 256 distinct values, is so we can do the addition without a modulo operation to handle wraparound. This permutation table setup is almost, but not quite, what is called Pearson hashing. The main difference is that Pearson hashing uses an xor operation instead of addition. Switching to xor lets us cut our table size in half, since the xor of two byte-sized values can never leave the 0 to 255 range, so there is no overflow to worry about. Here is our new lookup code:

        let hash = PERM[PERM[PERM[x] as usize ^ y] as usize ^ z] & 15;

This minor change gets us another huge performance improvement. Our run time is now 320 ns. We are now capable of running this function 3,125,000 times per second, a significant boost over where we started.

Since every reduction in the permutation table’s size has made our code significantly faster, at this point you might be considering removing the table entirely. After all, the table is really just a means to generate a hash code, and there are many ways to do that without any extra storage. The problem is that our table now fits nicely in cache, which means accessing it is very fast. Any hash function would need to use very few, very efficient operations in order to match this speed. Every attempt I’ve made at this has either produced a poor hash function with unacceptable noise quality or been slower than the permutation table method.
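
For illustration, this is the general shape such a table-free hash might take. It is only an example of the idea, and any real candidate would still need both its speed and its noise quality validated:

// Illustrative only: mix the coordinates with large odd constants, then
// fold the result down to a byte. It has to beat an L1-resident 256-byte
// table, which is a high bar.
#[inline(always)]
fn hash3(x: isize, y: isize, z: isize) -> u8 {
    let mut h = (x as u64)
        .wrapping_mul(0x9E3779B97F4A7C15)
        .wrapping_add((y as u64).wrapping_mul(0xC2B2AE3D27D4EB4F))
        .wrapping_add((z as u64).wrapping_mul(0x165667B19E3779F9));
    h ^= h >> 33;
    h = h.wrapping_mul(0xFF51AFD7ED558CCD);
    h ^= h >> 33;
    (h & 0xFF) as u8
}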

If we can’t replace the permutation table, perhaps there is something else we can do to make it even faster. Reducing its size further, perhaps to 64 entries, might give another performance boost. However, this also reduces the period of the noise, so you start getting repeated values much sooner. Another approach is to use a separate permutation table for each dimension, as described in Better Gradient Noise. The advantage of this approach is that it removes the dependent lookups, so the CPU can fetch the x, y, and z values at the same time and then xor them together for the final hash. This does in fact give a noticeable performance improvement, but it also results in very poor quality noise output. To use this approach properly you need to replace the simple gradient vectors with a 256-entry table of evenly distributed vectors, and the extra cost of looking up these vectors and calculating the gradient with them results in an overall slowdown versus what we currently have.
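
The lookup side of that per-axis idea would look roughly like this, with hypothetical PERM_X, PERM_Y, and PERM_Z tables standing in for the three independent permutations:

        // three independent lookups the CPU can issue in parallel
        let hx = PERM_X[(whole[0] as usize) & 0xff];
        let hy = PERM_Y[(whole[1] as usize) & 0xff];
        let hz = PERM_Z[(whole[2] as usize) & 0xff];
        let hash = hx ^ hy ^ hz;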

Calculating the gradients

Getting the gradient value for each of our corners is done by looking up the gradient vector for that corner and taking the dot product of this vector with the position of our point relative to the corner. Perlin showed that we can use simple vectors with components of -1, 0, and 1 instead of random values, and by doing so avoid the costly dot product operation entirely. He also avoids another table lookup by using the bits of the hash to select and negate the components directly, as seen in our original implementation. This seems like it should be about the best we can do for performance, but in fact we can do much better.

Our first change is to replace the branches with a match expression (Rust’s equivalent of a switch). This results in a jump table instead of several branch operations, which is easier for the CPU to deal with. The resulting code:

        let hash = PERM[PERM[PERM[x] as usize ^ y] as usize ^ z] & 15;
        match hash {
            0  =>  frac[0] + frac[1],
            1  => -frac[0] + frac[1],
            2  =>  frac[0] - frac[1],
            3  => -frac[0] - frac[1],
            4  =>  frac[0] + frac[2],
            5  => -frac[0] + frac[2],
            6  =>  frac[0] - frac[2],
            7  => -frac[0] - frac[2],
            8  =>  frac[1] + frac[2],
            9  => -frac[1] + frac[2],
            10 =>  frac[1] - frac[2],
            11 => -frac[1] - frac[2],
            12 =>  frac[1] + frac[0],
            13 => -frac[1] + frac[2],
            14 =>  frac[1] - frac[0],
            15 => -frac[1] - frac[2],
            _  => unreachable!()
        }

Running our benchmark again shows us that we’ve reduced our time to 286 ns. Not nearly as much as we’d hoped, but still a small improvement. We can do even better, though this is one case where I don’t quite understand why it works. Instead of trying to avoid the dot product, we create a new grad function which returns one of our 12 possible gradient vectors:

#[inline(always)]
fn grad<T: Float>(index: u8) -> math::Point3<T> {
    let zero: T = math::cast(0.0);
    let one: T = math::cast(1.0);

    match index % 12 {
        0  => [ one,   one,   zero],
        1  => [-one,   one,   zero],
        2  => [ one,  -one,   zero],
        3  => [-one,  -one,   zero],
        4  => [ one,   zero,  one],
        5  => [-one,   zero,  one],
        6  => [ one,   zero, -one],
        7  => [-one,   zero, -one],
        8  => [ zero,  one,   one],
        9  => [ zero, -one,   one],
        10 => [ zero,  one,  -one],
        11 => [ zero, -one,  -one],
        _  => unreachable!()
    }
}

Using this, our gradient calculation now looks like this:

        let x: usize = math::cast(whole[0] & math::cast(0xff));
        let y: usize = math::cast(whole[1] & math::cast(0xff));
        let z: usize = math::cast(whole[2] & math::cast(0xff));
        let hash = PERM[PERM[PERM[x] as usize ^ y] as usize ^ z];
        math::dot3(frac, grad(hash))

With these changes in place our benchmark now gives us 38 ns, which means we can run this function 26,315,790 times per second, yet another huge performance win. As mentioned, I’m not clear why this is faster, as it clearly seems to be doing more work with that dot product in there. My only guess is that we’re somehow getting better instruction ordering and/or speculative execution with this setup. In any case, we’re now significantly faster than where we started and there isn’t much room left to improve. In fact, to get any better we’re going to have to look into algorithmic improvements, not just the micro-optimizations we’ve done so far.

Scaling better

Perlin noise calculates gradients for 2^n corners, where n is the number of dimensions, and then performs a linear interpolation of those gradients which also scales with 2^n. Simplex noise has two major differences from perlin noise: it uses a simplex rather than a hypercube, which results in fewer corners to calculate (n + 1 instead of 2^n), and it uses summation instead of linear interpolation. It is this second property that we’re interested in now.
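
To see why the interpolation is the expensive part, here is roughly what the trilinear interpolation step expands to. In n dimensions you pay for 2^n - 1 pairwise lerps on top of the 2^n gradient evaluations, whereas summation replaces all of that nesting with a flat sum:

fn lerp(a: f64, b: f64, t: f64) -> f64 {
    a + t * (b - a)
}

// Trilinear interpolation of the eight corner values: four lerps along x,
// two along y, one along z, for seven lerps in total.
fn trilerp(u: f64, v: f64, w: f64,
           f000: f64, f100: f64, f010: f64, f110: f64,
           f001: f64, f101: f64, f011: f64, f111: f64) -> f64 {
    let x00 = lerp(f000, f100, u);
    let x10 = lerp(f010, f110, u);
    let x01 = lerp(f001, f101, u);
    let x11 = lerp(f011, f111, u);
    let y0 = lerp(x00, x10, v);
    let y1 = lerp(x01, x11, v);
    lerp(y0, y1, w)
}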

In order to use summation we need to apply a radial attenuation function to our gradients. This ensures that points inside a cell will only be influenced by the contributions from the corners of that particular cell. Here is our new gradient function using this technique:

        let x: usize = math::cast(whole[0] & math::cast(0xff));
        let y: usize = math::cast(whole[1] & math::cast(0xff));
        let z: usize = math::cast(whole[2] & math::cast(0xff));
        let hash = PERM[PERM[PERM[x] as usize ^ y] as usize ^ z];

        let attn = math::cast::<_, T>(1.0) - math::dot3(frac, frac);
        if attn > Float::zero() {
            (attn * attn) * math::dot3(frac, grad(hash))
        } else {
            Float::zero()
        }

And we can now replace the call to math::trilerp in the main function with this:

    (f000 + f100 + f010 + f110 + f001 + f101 + f011 + f111)

This is much simpler, and the benchmarks back that up. Our final time for a single run is 23 ns, which means we can run this function 43,478,261 times per second. We’ve managed to achieve a 336x performance improvement over our original implementation. Even better, our new approach scales better with dimension, which means 4D noise gets an even bigger improvement.

Final results

The first thing to confirm is that we still have useful noise after all these changes, as it is quite easy to break the algorithm. Let’s look at some before and after samples of our noise:

[Image: noise output from the original implementation (left) and the final implementation (right)]

Yep, that still looks like perlin noise. It doesn’t look exactly the same, but that’s mainly because we’re using a different permutation table. When I made these changes to the actual noise-rs library, the only difference between the linear interpolation version and the summation version was in the blending; the overall shape of the noise stayed identical.

To see how we compare to simplex noise, which is usually preferred for higher dimensions due to performance, I added an implementation of it to my local copy of noise-rs. Here are the results:

        Perlin    Simplex
2D      16 ns     27 ns
3D      23 ns     30 ns
4D      40 ns     62 ns

As you can see, we’re faster across the board. 4D simplex noise is so much slower than 3D because of the complexity of determining which simplex the point is in, which we need to know before we can pick the corners to use. I’ve tried two approaches to this and attempted some basic optimization of each, but this is the fastest I could get. Even so, it would still be faster than the original 4D perlin noise, or even our 4D perlin without summation. I take this as a sign that I’m not doing something wrong; simplex noise was simply relying on perlin noise being slow in order to look good by comparison.

Even with these changes, simplex noise still has the advantage of “looking better”, so depending on what you’re using the noise for it might be the better choice. However, even in those cases you may decide this new perlin looks “good enough” and take advantage of the performance benefits. If you aren’t directly displaying the noise but instead using it for something like terrain generation, there is even less reason to choose simplex. My next focus will be on ways to improve the quality of perlin noise, to reduce the number of places where simplex would be preferred, but whether that can be accomplished will be the subject of another post.

