Skip to main content

Faster morton codes with compiler intrinsics

Today I learned that newer Intel processors have an instruction which is tailor-made for generating morton codes: the PDEP instruction. There's an instruction for the inverse as well, PEXT.

These exist in 32- and 64-bit versions and you can use them directly from C or C++ code via compiler intrinsics: _pdep_u32/u64 and _pext_u32/u64. Miraculously, both the Visual C++ and GCC versions of the intrinsics have the same names. You'll need an Intel Haswell processor or newer to be able to take advantage of them though.

Docs for the instructions:


This page has a great write up of older techniques for generating morton codes:

...but the real gold is hidden at the bottom of that page in a comment from Julien Bilalte, which is what clued me in to the existence of these instructions.

Update: there's some useful info on Wikipedia about these intructions too.

Comments

Popular posts from this blog

Assert no lock required

This is a technique I learnt about from Jason Gregory's excellent book, Game Engine Architecture (3rd Edition) . If you have a shared resource accessed by multiple threads, where you're fairly certain that it's only ever accessed by one thread at a time, you can use an assert() to check for this at debug time without having to pay the runtime cost of locking a mutex. The implementation is fairly straightforward: class UnnecessaryMutex { public: void lock() { assert(!_locked); _locked = true; } void unlock() { assert(_locked); _locked = false; } private: volatile bool _locked = false; }; #ifdef ENABLE_LOCK_ASSERTS #define BEGIN_ASSERT_LOCK_NOT_REQUIRED(mutex) (mutex).lock() #define END_ASSERT_LOCK_NOT_REQUIRED(mutex) (mutex).unlock() #else #define BEGIN_ASSERT_LOCK_NOT_REQUIRED(mutex) #define END_ASSERT_LOCK_NOT_REQUIRED(mutex) #endif Usage is equally straightforward: UnnecessaryMutex gMutex; void PossiblyOverlappingFunction...

How to outperform std::vector in 1 easy step

Everyone who's familiar with C++ knows that you should avoid resizing a std::vector inside a loop wherever possible. The reasoning's pretty obvious: the memory allocated for the vector doubles in size each time it fills up and that doubling is a costly operation. Have you ever wondered why it's so costly though? It's tempting to assume that because implementations of the STL have been around for so long that they must be pretty efficient. It turns out that's a bad assumption because the problem, in this case, is the standard itself: specifically, the allocator interface. The allocator interface provides two methods that obtain and release memory: allocate allocates uninitialized storage (public member function) deallocate deallocates storage (public member function) (taken from this page ). What's missing is a   way of growing an existing memory allocation in place. In C this is provided by the realloc function, but there's no equiva...

Triangle bounding boxes in a single byte

Just thought of a way to store the bounding box for a single triangle in only one byte. It's not really practical or something you'd ever really want to use, but what the hell. Assume we have some kind of indexed mesh structure with a list of vertex positions and a list of triangle indices:   struct Mesh {     std::vector<vec3> verts;     std::vector<uvec3> triangles;   }; We can find the bounding box of a triangle by taking the min and max of all three vertices:   vec3 Mesh::lowerBound(uint32_t tri) const {     vec3 v0 = verts[triangles[tri].x];     vec3 v1 = verts[triangles[tri].y];     vec3 v2 = verts[triangles[tri].z];     return min(min(v0, v1), v2);   }   vec3 Mesh::upperBound(uint32_t tri) const {     vec3 v0 = verts[triangles[tri].x];     vec3 v1 = verts[triangles[tri].y];     vec3 v2 = verts[triangles[tri].z];     return ...