Skip to main content

Whole program lexical analysis

I was thinking about parsing and lexical analysis of source code recently (after all who doesn't... right??). Everywhere I've looked - which admittedly isn't in very many places - parsers still seem to treat input as a stream of tokens. The stream abstraction made sense in an era where memory was more limited. Does it still make sense now, when 8 Gb of RAM is considered small?

What if, instead, we cache the entire token array for each file? So we mmap the file, lex it in place and store the tokens in an in-memory array. Does this make parsing any easier? Does it lead to any speed savings? Or does it just use an infeasible amount of memory?

Time for some back-of-the napkin analysis. Let's say we're using data structures kind of like this to represent tokens:

  enum TokenType { /* an entry for each distinct type of token */ };

  struct Token {
    TokenType type; // The type of token
    int start;      // Byte offset for the start of the token
    int end;        // Byte offset for the end of the token

  struct TokenFile {
    std::string filename;
    std::vector<Token> tokens; // The tokens in order.
    std::vector<int> lines;    // Byte offsets for each line number.

So for each file we tokenise, we'll store it in memory as an array of tokens. The tokens themselves don't store a string or a value - they just identify which range of bytes in the file the token came from. We also store byte offsets for line numbers, so that if we need them for any reason - error reporting, or a __LINE__ macro, for example - we can find them using a binary search (these are low frequency uses, so it's wasteful to store a line number for every token).

Assuming a 64 bit system, we get the following sizes:
  • std::string = 24 bytes (the GNU stdlibc++ representation is 3 pointers, one each for start, end and capacity) + the size of the data
  • std::vector = 24 bytes (same representation as a std::string) + the data size
  • Token = 12 bytes (4 bytes for each of type, start and end).
  • TokenFile = 72 bytes + 12 * T + 4 * L + F, where T is the number of tokens in the file, L is the number of lines in the file and F is the length of the filename.

Now let's make up a theoretical large c++ code base and some metrics for it to see how things stack up. Let's say our code base has the following characteristics:
  • 10,000,000 lines of code
  • 10,000 separate files
  • An average of 1000 lines per file
  • 10 tokens per line, on average
  • The average filename length is 100 bytes.

Running the numbers, this means we have:
  • 10,000 TokenFile objects at (on average) 72 + 12 * 10,000 + 4 * 1000 + 100 bytes
    = 10,000 * 124,172 bytes
    = roughly 1.24 Gb;
Of which
  • 720 Kb is for the TokenFile objects
  • 1.2 Gb is for the Token data
  • 40 Mb is for the per-file line numbers
  • 1 Mb is for the file names

That doesn't seem too bad considering it's supposed to be a large code base. 1.24 Gb of RAM won't even touch the sides on a modern desktop computer! We could make a few optimisations to get the size down even further, but let's stick with that for now.

This structure doesn't store the token contents. For example, you can't use it to distinguish between two identifiers unless you load up the files containing the tokens and look at the byte ranges for each. Doing that might actually be feasible on an OS like Linux where opening files is fast, but it would be pretty bad on Windows where that's a more expensive operation. So what if we store the token contents as well? Is that feasible?

Let's make a few more assumptions about our theoretical code base:
  • There are around 1000 unique token strings per text file, on average.
  • Across the whole project there are about 10,000,000 unique strings with an average length of 10 chars.
We'll keep a cache of interned strings for each token (we won't worry about converting literals into actual ints/floats/doubles/whatever for now), so that each unique token string is stored once and once only. We'll also replace the 'end' member of our Token struct with a 'pos' member, holding an index into the interned string list.

The Token struct stays the same size so the sums above don't change. For the data size, we get:
  • 10,000,000 unique strings * (11 bytes per string + 8 bytes for the pointer to it)
    = approximately 190 Mb
Based on all that, storing the entire tokenised source code for a 10 million line c++ project would only take about 1.43 Gb. That's way below the 8Gb of RAM you get in an average desktop these days, so it seems like it would be workable.

Does it actually lead to any efficiency gains? That remains to be seen. It should mean that the file access pattern interacts better with the disk cache; and being able to index into an array of tokens might help reduce the cost of lookahead and backtracking in the parser; but for now it's just conjecture. There's no github repo to go with this post.

So it's over to you, readers. Do you know of anything that already does this? Any reasons why it'd be a crazy thing to do? If so, I'd love to hear about it!


Popular posts from this blog

Octree node identifiers

Let's say we have an octree and we want to come up with a unique integer that can identify any node in the tree - including interior nodes, not just leaf nodes. Let's also say that the octree has a maximum depth no greater than 9 levels, i.e. the level containing the leaf nodes divides space into 512 parts along each axis.

The encoding The morton encoding of a node's i,j,k coordinates within the tree lets us identify a node uniquely if we already know it's depth. Without knowing the depth, there's no way to differentiate between cells at different depths in the tree. For example, the node at depth 1 with coords 0,0,0 has exactly the same morton encoding as the node at depth 2 with coords 0,0,0.

We can fix this by appending the depth of the node to the morton encoding. If we have an octree of depth 9 then we need up to 27 bits for the morton encoding and 4 bits for the depth, which still fits nicely into a 32-bit integer. We'll shift the morton code up so that i…

How to outperform std::vector in 1 easy step

Everyone who's familiar with C++ knows that you should avoid resizing a std::vector inside a loop wherever possible. The reasoning's pretty obvious: the memory allocated for the vector doubles in size each time it fills up and that doubling is a costly operation. Have you ever wondered why it's so costly though?

It's tempting to assume that because implementations of the STL have been around for so long that they must be pretty efficient. It turns out that's a bad assumption because the problem, in this case, is the standard itself: specifically, the allocator interface.

The allocator interface provides two methods that obtain and release memory:

allocate allocates uninitialized storage
(public member function)deallocate deallocates storage
(public member function)

(taken from this page).

What's missing is away of growing an existing memory allocation in place. In C this is provided by the realloc function, but there's no equivalent in the std::allocator interfa…

Faster morton codes with compiler intrinsics

Today I learned that newer Intel processors have an instruction which is tailor-made for generating morton codes: the PDEP instruction. There's an instruction for the inverse as well, PEXT.

These exist in 32- and 64-bit versions and you can use them directly from C or C++ code via compiler intrinsics: _pdep_u32/u64 and _pext_u32/u64. Miraculously, both the Visual C++ and GCC versions of the intrinsics have the same names. You'll need an Intel Haswell processor or newer to be able to take advantage of them though.

Docs for the instructions:

Intel's docsGCC docsVisual C++ docs
This page has a great write up of older techniques for generating morton codes:

Jeroen Baert's blog ...but the real gold is hidden at the bottom of that page in a comment from Julien Bilalte, which is what clued me in to the existence of these instructions.
Update: there's some useful info on Wikipedia about these intructions too.