Thursday, 5 November 2015

LD_DEBUG

Posting this mainly as a reminder to myself...

If you ever find yourself needing to figure out a dynamic library loading problem on Linux, LD_DEBUG can be a massive help.

This is an environment variable you can set to make the dynamic linker print out a ton of useful diagnostic info. There are a number of different values which control the amount and type of diagnostics printed. One of the values is help; if you set LD_DEBUG to this and run an executable, it will print out a list of all the available options along with brief descriptions. For example, on my Linux workstation at the office:

> LD_DEBUG=help cat
Valid options for the LD_DEBUG environment variable are:

  libs        display library search paths
  reloc       display relocation processing
  files       display progress for input file
  symbols     display symbol table processing
  bindings    display information about symbol binding
  versions    display version dependencies
  all         all previous options combined
  statistics  display relocation statistics
  unused      determined unused DSOs
  help        display this help message and exit

To direct the debugging output into a file instead of standard output
a filename can be specified using the LD_DEBUG_OUTPUT environment variable.

Be warned: using LD_DEBUG=all will generate a lot of output!

This is documented on the man page for ld.so. There are a lot of other environment variables that can control the behaviour of the dynamic linker too, so it's worth a read.

Friday, 30 October 2015

Octree node identifiers

Let's say we have an octree and we want to come up with a unique integer that can identify any node in the tree - including interior nodes, not just leaf nodes. Let's also say that the octree has a maximum depth of 9 levels, i.e. the level containing the leaf nodes divides space into 512 parts along each axis.

The encoding

The morton encoding of a node's i,j,k coordinates within the tree lets us identify a node uniquely if we already know its depth. Without knowing the depth, there's no way to differentiate between cells at different depths in the tree. For example, the node at depth 1 with coords 0,0,0 has exactly the same morton encoding as the node at depth 2 with coords 0,0,0.

We can fix this by appending the depth of the node to the morton encoding. If we have an octree of depth 9 then we need up to 27 bits for the morton encoding and 4 bits for the depth, which still fits nicely into a 32-bit integer. We'll shift the morton code up so that it starts at the topmost bit and set any unused trailing bits to zero. We'll put the depth in the bottom 4 bits. That leaves at least one bit free between the two parts - bit 4 - which we'll always set to zero (this will be useful later). The structure of the key will look like this:

bits 31 down to (32 - 3 * depth):  the morton index (3 bits per level, left-aligned)
bits (31 - 3 * depth) down to 4:   unused, always 0
bits 3 down to 0:                  the depth

For a node at depth 7, for example, it would look like this:

bits 31 down to 11:  the morton index (21 bits)
bits 10 down to 4:   0
bits 3 down to 0:    7

Why do it this way? Well, in addition to identifying each node uniquely, this representation has a few other nice properties:

  1. Sorting nodes by this number puts them in the order they would be visited by a pre-order traversal of the octree.
  2. You can calculate the identifier of a parent node from a child using simple bitwise operations.
  3. If you have two nodes at the same depth in the tree, you can calculate their lowest common ancestor using simple bitwise operations too.

Implementing it

Here's an example implementation of the scheme above:

#include <algorithm> // for std::min
#include <cassert>
#include <cstdint>

struct NodeKey {
  uint32_t key;

  // Depth of this node.
  uint32_t depth() const
  {
    return key & 0xF;
  }

  // Morton index for this node.
  uint32_t index() const
  {
    return (depth() > 0) ? (key >> (32 - depth() * 3)) : 0;
  }

  // Calculate the NodeKey for the parent of this node.
  NodeKey parent() const
  {
    return ancestor_at_depth(depth() - 1);
  }

  // Calculate the NodeKey for the nth ancestor.
  // n = 0 means self, n = 1 means parent, n = 2 means grandparent, ...
  NodeKey ancestor_n(uint32_t n) const
  {
    return ancestor_at_depth(depth() - n);
  }
  
  // Calculate the NodeKey for the ancestor of this node at depth d in
  // the octree.
  NodeKey ancestor_at_depth(uint32_t d) const
  {
    assert(d <= depth());
    uint32_t mask = (d > 0) ? (0xFFFFFFFF << (32 - d * 3)) : 0;
    return { (key & mask) | d };
  }
};

// Find the deepest node that is a common ancestor of both input nodes.
NodeKey common_ancestor(const NodeKey a, const NodeKey b)
{
  uint32_t bits = std::min(a.depth(), b.depth()) * 3;
  uint32_t marker = 0x80000000 >> bits;
  uint32_t depth = __builtin_clz((a.key ^ b.key) | marker) / 3;
  return a.ancestor_at_depth(depth);
}
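
As an aside, here's a hypothetical child_key helper (not part of the scheme above, and the name is my own) showing how you'd descend from a node to one of its children under this encoding:

NodeKey child_key(const NodeKey node, uint32_t octant)
{
  // Append the child's 3-bit octant index just below the parent's morton bits,
  // then store the new depth in the bottom 4 bits.
  uint32_t d = node.depth() + 1;
  assert(d <= 9 && octant < 8);
  return { (node.key & ~0xFu) | (octant << (29 - node.depth() * 3)) | d };
}

Calling parent() on the result gets you back to the key you started with.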

The line in common_ancestor which calculates the depth of the common ancestor uses a slight variation on the technique from my previous post.

Variations

If you want a post-order traversal instead, just set the trailing bits to 1 rather than 0.

You can turn this into a breadth-first traversal by swapping the positions of the depth and the morton index: use the top 4 bits for the depth and the bottom 27 bits for the morton index. In this case the morton index should have leading bits set to zero instead of the trailing bits (i.e. the index should be aligned to the right of the available bits, instead of the left).

Another way to represent a breadth-first traversal is to put the morton index in the bottom-most bits and have a sentinel bit which is always set to 1 to indicate where the morton index ends. Every bit above the sentinel must be set to zero for this to work. This allows you to represent an additional octree level and makes it even simpler to calculate the parent identifier (it's just child >> 3), but calculating the depth of the node involves counting the leading zeros and dividing by 3.
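
Here's a rough sketch of that variant (my own code with made-up names, just to illustrate the description above):

struct SentinelKey {
  uint32_t key; // A 1 bit followed by the node's morton index; the root is just 1.

  uint32_t depth() const { return (31 - __builtin_clz(key)) / 3; }
  SentinelKey parent() const { return { key >> 3 }; }
  SentinelKey child(uint32_t octant) const { return { (key << 3) | octant }; }
};

With 32 bits this fits 10 levels below the root rather than 9, which is where the extra level comes from.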

Limitations

Assuming a 32-bit identifier, the maximum octree depth it can represent is 9 levels. Depending on your application this may or may not be sufficient. The scheme generalises easily to a 64-bit identifier which can store up to 19 levels, but that doubles the storage requirements and there are still some important platforms (e.g. many GPUs) which don't provide support for 64-bit integers.

Monday, 26 October 2015

Common bit prefix length for two integers

Here's a neat trick I discovered a couple of months back: given two signed or unsigned integers of the same bit width, you can calculate the length of their common prefix very efficiently:


  int common_prefix_length(int a, int b)
  {
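    // Note: __builtin_clz(0) is undefined, so this assumes a != b.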
    return __builtin_clz(a ^ b);
  }

What's this doing? Let's break it down:

a ^ b is the bitwise-xor of a and b. The boolean xor operation is true when one of its inputs is true and the other is false; or false if both have the same value. Another way to put this is that xor returns true when its inputs are different and false if they're the same. The bitwise-xor operation, then, returns a value which has zeros for every bit that is the same in both a and b; and ones for every bit that's different.

__builtin_clz is a GCC intrinsic function which counts the number of leading zero bits of its argument. It compiles down to a single machine code instruction on hardware that supports it (which includes every Intel chip made in this decade). The Visual C++ equivalent is the _BitScanReverse intrinsic, which has a slightly more complicated API; implementing the above with it is left as an exercise for the reader. :-)

Passing the result of a ^ b to __builtin_clz means we're counting the leading zero bits in a number where a zero bit indicates that the corresponding bits in a and b had the same value; which is exactly how we get the length of the common prefix.
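
If you do want a Visual C++ version, here's one possible sketch (spoiling the exercise slightly):

  #include <intrin.h>

  int common_prefix_length(unsigned int a, unsigned int b)
  {
    unsigned long highest_set_bit;
    if (!_BitScanReverse(&highest_set_bit, a ^ b))
      return 32; // a == b, so all 32 bits match.
    return 31 - int(highest_set_bit);
  }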

You can get the common suffix in the same way. The only difference is that you use the __builtin_ctz intrinsic (Visual C++: _BitScanForward) instead, to count the trailing zero bits:


  int common_suffix_length(int a, int b)
  {
    return __builtin_ctz(a ^ b);
  }

Neat huh?

Wednesday, 8 July 2015

Image based lighting notes

These are some of my notes from implementing image based lighting, a.k.a. IBL.

I thought I understood it pretty well until I started implementing it. Now, after a lot of reading, discussing and trying things out, I'm finally getting back to the stage where I tentatively believe I understand it. Hopefully these notes will save future me, or anyone else reading them, from going through the same difficulty. I'll update them if I find out anything further.

The radiance integral

$$L_o(v) = L_e(v) + \int_\Omega L_i(l) f(l, v) (n \cdot l) \, dl$$
  • The amount of light reflected at some point on a surface is the sum, over all incoming directions, of the light from that direction multiplied by the BRDF for the surface material, multiplied by the cosine of the angle between the light and the relevant normal vector, plus any light emitted by the surface itself.
  • We usually calculate this separately for the diffuse and specular lighting at the point, then sum the results.
  • For diffuse lighting, the relevant normal vector is the surface normal.
  • For specular lighting, the relevant normal vector is the half-vector: the vector halfway between the light direction and the view direction (where both vectors are pointing away from the surface point rather than towards it).

BRDF

  • A BRDF is a function which, given a view vector and a light vector, returns the amount of light which will be reflected along the view vector.
  • The view vector is the vector from the point being shaded to the view position, not the other way around.
  • The light vector is the vector from the point being shaded to the light position, not the other way around.
  • BRDFs are usually assumed to be isotropic: rotating the light and view vectors together around the surface normal doesn't change the result.
  • BRDFs usually have other parameters which are constant for any given material.

Diffuse BRDF

  • Diffuse contribution comes from light emerging from the surface after bouncing around inside it a bit.
  • This means the light direction can be fairly random.
  • So a fairly common diffuse BRDF is simply a constant value: the diffuse color for the material divided by pi (this is the Lambertian model).
  • Materials with lots of subsurface scattering will need a more complicated diffuse BRDF such as Oren-Nayar.
  • Oren-Nayar is based on the microfacet model described below.

Microfacet BRDF

$$f(l, v, h) = \frac{D(h) F(v, h) G(l, v, h)}{4 (n \cdot l) (n \cdot v)}$$
  • Usually used for specular BRDFs, but Oren-Nayar (diffuse) is also based on microfacets.
  • Assumes that a surface is composed of lots of tiny planar fragments whose normals vary according to some statistical distribution. This is the Cook-Torrance model.
  • Breaks the BRDF into 3 main components:
    • D(h), the normal distribution function.
    • F(v, h), the Fresnel function.
    • G(l, v, h), the geometric shadowing function.
  • D(h) tells us the fraction of microfacet normals which point in a given direction.
    • This describes how rough or smooth a material is.
    • Rougher materials will have a more even spread of normals, whereas smoother ones will tend to have them concentrated around a single direction.
    • Lots of options for this: Blinn-Phong, GGX, GTR, etc.
  • F(v, h) gives us the fraction of light that gets reflected in each channel.
    • This tells us the material's "colour".
    • Pretty much everyone uses Schlick's approximation for this (written out just after this list).
  • G(l, v, h) describes how much of the reflected light is blocked by other microfacets.
    • Tells us how "dark" a material is.
    • Is (or should be) affected by how rough the material is.
    • There are lots of options for this too.
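
For reference, Schlick's approximation mentioned above is

$$F(v, h) \approx F_0 + (1 - F_0)(1 - (v \cdot h))^5$$

where F0 is the reflectance at normal incidence (roughly speaking, the material's specular colour).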

IBL in general

  • The idea of IBL is to bake out components of the radiance integral into textures ahead of time, so that our shaders can rapidly approximate it given just a direction, a surface colour and a roughness value.
  • We can bake out the irradiance (incoming light) for any given direction:
    • The irradiance will be from all pixels within a cone around the given direction.
    • The solid angle of the cone is determined by the normal distribution function for your BRDF (which in turn is shaped by the material's roughness).
    • This will be stored in a texture where we can look up values using a direction vector, i.e. a cube map or a lat-long 2D texture.
    • You can use a mip-mapped texture to store the results for different roughness levels. Mip-level 0 will be the smoothest level, higher levels will be rougher.
    • Mip-level N+1 should represent a cone covering twice as many pixels as Mip-level N, so that the sampling works out correctly.
    • Because the mip-level contents depend on the mapping from the roughness parameter to a cone width, the irradiance map will be specific to a particular normal distribution function.
  • You can precalculate the rest of the BRDF separately as well:
    • This is the split sum approximation described in the SIGGRAPH 2013 course on physically based rendering.
    • Generate a 2D texture where the u axis corresponds to dot(N, V) values >= 0.0 and the v axis corresponds to roughness values between 0.0 and 1.0.
    • The texture contains a scale and bias to the material color.
    • Apply the scale and bias, then multiply the result by the value from the irradiance map to get the final color for the pixel (see the sketch after this list).
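
To make those last two steps concrete, here's a minimal sketch (my own, with made-up names; in practice this lives in a shader) of how the scale/bias and irradiance lookups get combined:

struct Vec3 { float x, y, z; };

// env is the value fetched from the pre-filtered irradiance map at the
// mip level matching the material's roughness; scale and bias are the two
// channels fetched from the precalculated BRDF texture at (dot(N, V), roughness).
Vec3 apply_split_sum(Vec3 specular_color, Vec3 env, float scale, float bias)
{
  return { env.x * (specular_color.x * scale + bias),
           env.y * (specular_color.y * scale + bias),
           env.z * (specular_color.z * scale + bias) };
}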

IBL vs. Reflections

  • What's the difference between IBL and reflections?
    • Theoretically: nothing. They're the same thing.
    • I think there may be an accident of terminology here, where "IBL" is often used to mean diffuse IBL and "reflections" is used to mean specular IBL.
  • It's common to provide a low-res HDR image with a matching high-res LDR image. In this case:
    • The low-res HDR image is for diffuse IBL.
    • The high-res LDR image is for reflections.
  • This implies that reflections are expected to sample from an LDR input whereas IBL (of any kind, specular or diffuse) is better off sampling from a HDR input.

Monday, 23 March 2015

Faster morton codes with compiler intrinsics

Today I learned that newer Intel processors have an instruction which is tailor-made for generating morton codes: the PDEP instruction. There's an instruction for the inverse as well, PEXT.

These exist in 32- and 64-bit versions and you can use them directly from C or C++ code via compiler intrinsics: _pdep_u32/u64 and _pext_u32/u64. Miraculously, both the Visual C++ and GCC versions of the intrinsics have the same names. You'll need an Intel Haswell processor or newer to be able to take advantage of them though.
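
As a quick illustration - this is my own sketch rather than anything from the docs - encoding and decoding a 3D morton code with these intrinsics looks something like this:

#include <immintrin.h> // needs BMI2; compile with -mbmi2 on GCC
#include <cstdint>

// Interleave the bits of x, y and z: x goes into bits 0, 3, 6, ...,
// y into bits 1, 4, 7, ... and z into bits 2, 5, 8, ...
uint32_t morton_encode(uint32_t x, uint32_t y, uint32_t z)
{
  return _pdep_u32(x, 0x49249249u)
       | _pdep_u32(y, 0x92492492u)
       | _pdep_u32(z, 0x24924924u);
}

// The inverse: gather every third bit back out again.
uint32_t morton_decode_x(uint32_t code) { return _pext_u32(code, 0x49249249u); }
uint32_t morton_decode_y(uint32_t code) { return _pext_u32(code, 0x92492492u); }
uint32_t morton_decode_z(uint32_t code) { return _pext_u32(code, 0x24924924u); }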

Docs for the instructions:


This page has a great write up of older techniques for generating morton codes:

...but the real gold is hidden at the bottom of that page in a comment from Julien Bilalte, which is what clued me in to the existence of these instructions.

Update: there's some useful info on Wikipedia about these instructions too.

Wednesday, 18 March 2015

Awesome tools for Windows users

I moved back to Windows on my home computer a few months back. There are a few amazing free tools I've found since then that have been making my life better and I thought they deserved a shout-out. They are:

SumatraPDF
A fantastic PDF reader. Does everything I want and nothing I don't.

RapidEE
A sane way to edit environment variables. The simple joy of just being able to resize the window is... incredible.

7-Zip
The best tool for dealing with compressed files on Windows, bar none.

GPU-Z
A really handy way to see details about your GPU(s).

XnView
An amazingly good image viewer, which can also do bulk file format conversions.

If you haven't already got these... go get them!

Whole program lexical analysis

I was thinking about parsing and lexical analysis of source code recently (after all, who doesn't... right??). Everywhere I've looked - which admittedly isn't in very many places - parsers still seem to treat input as a stream of tokens. The stream abstraction made sense in an era where memory was more limited. Does it still make sense now, when 8 GB of RAM is considered small?

What if, instead, we cache the entire token array for each file? So we mmap the file, lex it in place and store the tokens in an in-memory array. Does this make parsing any easier? Does it lead to any speed savings? Or does it just use an infeasible amount of memory?

Time for some back-of-the-napkin analysis. Let's say we're using data structures kind of like this to represent tokens:

  #include <string>
  #include <vector>

  enum TokenType { /* an entry for each distinct type of token */ };

  struct Token {
    TokenType type; // The type of token
    int start;      // Byte offset for the start of the token
    int end;        // Byte offset for the end of the token
  };

  struct TokenFile {
    std::string filename;
    std::vector<Token> tokens; // The tokens in order.
    std::vector<int> lines;    // Byte offsets for each line number.
  };

So for each file we tokenise, we'll store it in memory as an array of tokens. The tokens themselves don't store a string or a value - they just identify which range of bytes in the file the token came from. We also store byte offsets for line numbers, so that if we need them for any reason - error reporting, or a __LINE__ macro, for example - we can find them using a binary search (these are low frequency uses, so it's wasteful to store a line number for every token).
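
For example, looking up the line number for a token might look something like this (my own sketch, assuming the first entry in lines is the offset of line 1):

  #include <algorithm>

  // Return the 1-based line number containing the given byte offset, by
  // binary searching the per-file array of line start offsets.
  int line_number_for_offset(const TokenFile& file, int offset)
  {
    auto it = std::upper_bound(file.lines.begin(), file.lines.end(), offset);
    return int(it - file.lines.begin());
  }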

Assuming a 64 bit system, we get the following sizes:
  • std::string = 24 bytes (the GNU libstdc++ representation is 3 pointers, one each for start, end and capacity) + the size of the data
  • std::vector = 24 bytes (same representation as a std::string) + the data size
  • Token = 12 bytes (4 bytes for each of type, start and end).
  • TokenFile = 72 bytes + 12 * T + 4 * L + F, where T is the number of tokens in the file, L is the number of lines in the file and F is the length of the filename.

Now let's make up a theoretical large C++ code base and some metrics for it to see how things stack up. Let's say our code base has the following characteristics:
  • 10,000,000 lines of code
  • 10,000 separate files
  • An average of 1000 lines per file
  • 10 tokens per line, on average
  • The average filename length is 100 bytes.

Running the numbers, this means we have:
  • 10,000 TokenFile objects at (on average) 72 + 12 * 10,000 + 4 * 1000 + 100 bytes
    = 10,000 * 124,172 bytes
    = roughly 1.24 GB
Of which:
  • 720 KB is for the TokenFile objects
  • 1.2 GB is for the Token data
  • 40 MB is for the per-file line numbers
  • 1 MB is for the file names

That doesn't seem too bad considering it's supposed to be a large code base. 1.24 GB of RAM won't even touch the sides on a modern desktop computer! We could make a few optimisations to get the size down even further, but let's stick with that for now.

This structure doesn't store the token contents. For example, you can't use it to distinguish between two identifiers unless you load up the files containing the tokens and look at the byte ranges for each. Doing that might actually be feasible on an OS like Linux where opening files is fast, but it would be pretty bad on Windows where that's a more expensive operation. So what if we store the token contents as well? Is that feasible?

Let's make a few more assumptions about our theoretical code base:
  • There are around 1000 unique token strings per text file, on average.
  • Across the whole project there are about 10,000,000 unique strings with an average length of 10 chars.
We'll keep a cache of interned strings for each token (we won't worry about converting literals into actual ints/floats/doubles/whatever for now), so that each unique token string is stored once and once only. We'll also replace the 'end' member of our Token struct with a 'pos' member, holding an index into the interned string list.
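
Sketching that out (with names I've made up for this post), the modified structures might look like:

  struct Token {
    TokenType type; // The type of token
    int start;      // Byte offset for the start of the token
    int pos;        // Index into the table of interned token strings
  };

  struct StringTable {
    std::vector<std::string> strings; // Each unique token string, stored exactly once.
  };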

The Token struct stays the same size, so the sums above don't change. For the data size, we get:
  • 10,000,000 unique strings * (11 bytes per string + 8 bytes for the pointer to it)
    = approximately 190 MB
Based on all that, storing the entire tokenised source code for a 10 million line C++ project would only take about 1.43 GB. That's way below the 8 GB of RAM you get in an average desktop these days, so it seems like it would be workable.

Does it actually lead to any efficiency gains? That remains to be seen. It should mean that the file access pattern interacts better with the disk cache; and being able to index into an array of tokens might help reduce the cost of lookahead and backtracking in the parser; but for now it's just conjecture. There's no github repo to go with this post.

So it's over to you, readers. Do you know of anything that already does this? Any reasons why it'd be a crazy thing to do? If so, I'd love to hear about it!