
Whole program lexical analysis

I was thinking about parsing and lexical analysis of source code recently (after all, who doesn't... right?). Everywhere I've looked - which admittedly isn't very many places - parsers still seem to treat their input as a stream of tokens. The stream abstraction made sense in an era when memory was more limited. Does it still make sense now, when 8 GB of RAM is considered small?

What if, instead, we cache the entire token array for each file? So we mmap the file, lex it in place and store the tokens in an in-memory array. Does this make parsing any easier? Does it lead to any speed savings? Or does it just use an infeasible amount of memory?
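The mmap-and-lex-in-place step might look something like this. This is a POSIX-only sketch; the MapFile helper is made up for illustration, and error handling (plus the eventual munmap) is deliberately thin:

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include <string_view>

// Map a whole file into memory read-only so the lexer can scan it in
// place without copying. Returns an empty view on failure.
std::string_view MapFile(const char* path) {
  int fd = open(path, O_RDONLY);
  if (fd < 0) return {};
  struct stat st;
  if (fstat(fd, &st) != 0) { close(fd); return {}; }
  void* data = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
  close(fd);  // the mapping stays valid after the fd is closed
  if (data == MAP_FAILED) return {};
  return {static_cast<const char*>(data), static_cast<size_t>(st.st_size)};
}
```

The lexer would then walk the returned view and emit tokens into the array described below.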

Time for some back-of-the-napkin analysis. Let's say we're using data structures something like this to represent tokens:

  #include <string>
  #include <vector>

  enum TokenType { /* an entry for each distinct type of token */ };

  struct Token {
    TokenType type; // The type of token
    int start;      // Byte offset for the start of the token
    int end;        // Byte offset for the end of the token
  };

  struct TokenFile {
    std::string filename;
    std::vector<Token> tokens; // The tokens in order.
    std::vector<int> lines;    // Byte offsets for each line number.
  };

So for each file we tokenise, we'll store it in memory as an array of tokens. The tokens themselves don't store a string or a value - they just identify which range of bytes in the file the token came from. We also store byte offsets for line numbers, so that if we need them for any reason - error reporting, or a __LINE__ macro, for example - we can find them using a binary search (these are low-frequency uses, so it's wasteful to store a line number for every token).
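That binary search is a single call to std::upper_bound over the lines array. A sketch (LineForOffset is a made-up helper, not part of the structures above):

```cpp
#include <algorithm>
#include <vector>

// Given the sorted byte offsets at which each line starts (the
// TokenFile::lines array), return the 1-based line number containing
// a given byte offset, via binary search.
int LineForOffset(const std::vector<int>& lines, int offset) {
  // upper_bound finds the first line start strictly after the offset;
  // the offset therefore lies on the line just before it.
  auto it = std::upper_bound(lines.begin(), lines.end(), offset);
  return static_cast<int>(it - lines.begin());
}
```

So a token's line number costs O(log L) only when it's actually needed, instead of 4 bytes per token all the time.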

Assuming a 64-bit system, we get the following sizes:
  • std::string = 24 bytes (a pointer, a size and a capacity; the exact layout varies by standard library - current GNU libstdc++ strings are actually 32 bytes because of the small-string buffer, but that barely changes the totals) + the size of the data
  • std::vector = 24 bytes (in GNU libstdc++, 3 pointers, one each for start, end and capacity) + the data size
  • Token = 12 bytes (4 bytes for each of type, start and end)
  • TokenFile = 72 bytes + 12 * T + 4 * L + F, where T is the number of tokens in the file, L is the number of lines in the file and F is the length of the filename.
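The Token size at least can be sanity-checked at compile time, assuming a typical 64-bit ABI where both enums and ints are 4 bytes:

```cpp
enum TokenType { kIdentifier, kKeyword, kNumber /* ... */ };

struct Token {
  TokenType type;  // 4 bytes on common ABIs
  int start;       // 4 bytes
  int end;         // 4 bytes
};

// Every member is 4-byte aligned, so no padding is inserted and the
// struct packs into exactly 12 bytes on a typical 64-bit ABI.
static_assert(sizeof(Token) == 12, "Token expected to be 12 bytes");
```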

Now let's make up a theoretical large C++ code base and some metrics for it to see how things stack up. Let's say our code base has the following characteristics:
  • 10,000,000 lines of code
  • 10,000 separate files
  • An average of 1000 lines per file
  • 10 tokens per line, on average
  • The average filename length is 100 bytes.

Running the numbers, this means we have:
  • 10,000 TokenFile objects at (on average) 72 + 12 * 10,000 + 4 * 1000 + 100 bytes each
    = 10,000 * 124,172 bytes
    = roughly 1.24 GB
Of which:
  • 720 KB is for the TokenFile objects
  • 1.2 GB is for the Token data
  • 40 MB is for the per-file line numbers
  • 1 MB is for the file names
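As a quick cross-check of that arithmetic, using the assumed metrics from above:

```cpp
// Cross-check of the back-of-the-napkin numbers, using the post's
// assumed code base metrics.
constexpr long long kFiles = 10'000;
constexpr long long kTokensPerFile = 10'000;  // 1000 lines * 10 tokens
constexpr long long kLinesPerFile = 1'000;
constexpr long long kFilenameLen = 100;

constexpr long long kPerFile =
    72 + 12 * kTokensPerFile + 4 * kLinesPerFile + kFilenameLen;
constexpr long long kTotal = kFiles * kPerFile;

static_assert(kPerFile == 124'172, "per-file size in bytes");
static_assert(kTotal == 1'241'720'000, "total size: roughly 1.24 GB");
```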

That doesn't seem too bad considering it's supposed to be a large code base. 1.24 GB of RAM won't even touch the sides on a modern desktop computer! We could make a few optimisations to get the size down even further, but let's stick with that for now.

This structure doesn't store the token contents. For example, you can't use it to distinguish between two identifiers unless you load up the files containing the tokens and look at the byte ranges for each. Doing that might actually be feasible on an OS like Linux where opening files is fast, but it would be pretty bad on Windows where that's a more expensive operation. So what if we store the token contents as well? Is that feasible?

Let's make a few more assumptions about our theoretical code base:
  • There are around 1000 unique token strings per text file, on average.
  • Across the whole project there are about 10,000,000 unique strings with an average length of 10 chars.

We'll keep a cache of interned strings (we won't worry about converting literals into actual ints/floats/doubles/whatever for now), so that each unique token string is stored once and only once. We'll also replace the 'end' member of our Token struct with a 'pos' member holding an index into the interned string list.
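An interning cache along those lines might look like this. This is just a sketch - the StringInterner class and its method names are made up for illustration, and a real one would probably avoid the duplicate copy in the map key:

```cpp
#include <string>
#include <string_view>
#include <unordered_map>
#include <vector>

// A minimal string interner: each unique token string is stored once,
// and tokens hold a small integer index instead of a byte range.
class StringInterner {
 public:
  // Return the index for this string, adding it if it's new.
  int Intern(std::string_view s) {
    auto it = index_.find(std::string(s));
    if (it != index_.end()) return it->second;
    int id = static_cast<int>(strings_.size());
    strings_.push_back(std::string(s));
    index_.emplace(strings_.back(), id);
    return id;
  }

  // Recover the string for a token's 'pos' member.
  const std::string& Lookup(int id) const { return strings_[id]; }

 private:
  std::vector<std::string> strings_;            // one entry per unique string
  std::unordered_map<std::string, int> index_;  // string -> index
};
```

With this in place, comparing two identifier tokens for equality is just an integer comparison of their 'pos' members - no file access needed.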

The Token struct stays the same size, so the sums above don't change. For the string data, we get:
  • 10,000,000 unique strings * (11 bytes per string + 8 bytes for a pointer to it)
    = approximately 190 MB

Based on all that, storing the entire tokenised source code for a 10 million line C++ project would only take about 1.43 GB. That's way below the 8 GB of RAM you get in an average desktop these days, so it seems like it would be workable.

Does it actually lead to any efficiency gains? That remains to be seen. It should mean that the file access pattern interacts better with the disk cache; and being able to index directly into an array of tokens might help reduce the cost of lookahead and backtracking in the parser; but for now it's just conjecture. There's no GitHub repo to go with this post.
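To illustrate the lookahead/backtracking point: with the whole token array in memory, a parser's cursor is just an index, so peeking ahead and rewinding are O(1) with no buffering layer. A sketch (TokenCursor is hypothetical, not anything that exists):

```cpp
#include <cstddef>
#include <vector>

enum TokenType { kIdentifier, kPlus, kEOF /* ... */ };

struct Token {
  TokenType type;
  int start;
  int end;
};

// Lookahead and backtracking over a fully tokenised file reduce to
// plain index arithmetic on the token array.
class TokenCursor {
 public:
  explicit TokenCursor(const std::vector<Token>& tokens) : tokens_(tokens) {}

  const Token& Peek(size_t ahead = 0) const {
    return tokens_[pos_ + ahead];  // arbitrary lookahead is an array index
  }
  void Advance() { ++pos_; }
  size_t Save() const { return pos_; }        // mark a backtrack point
  void Restore(size_t mark) { pos_ = mark; }  // rewind in O(1)

 private:
  const std::vector<Token>& tokens_;
  size_t pos_ = 0;
};
```

A stream-based parser needs an explicit token buffer to offer the same Save/Restore; here it falls out for free.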

So it's over to you, readers. Do you know of anything that already does this? Any reasons why it'd be a crazy thing to do? If so, I'd love to hear about it!
