Wednesday, 18 March 2015

Whole program lexical analysis

I was thinking about parsing and lexical analysis of source code recently (after all who doesn't... right??). Everywhere I've looked - which admittedly isn't in very many places - parsers still seem to treat input as a stream of tokens. The stream abstraction made sense in an era where memory was more limited. Does it still make sense now, when 8 Gb of RAM is considered small?

What if, instead, we cache the entire token array for each file? So we mmap the file, lex it in place and store the tokens in an in-memory array. Does this make parsing any easier? Does it lead to any speed savings? Or does it just use an infeasible amount of memory?

Time for some back-of-the napkin analysis. Let's say we're using data structures kind of like this to represent tokens:

  enum TokenType { /* an entry for each distinct type of token */ };

  struct Token {
    TokenType type; // The type of token
    int start;      // Byte offset for the start of the token
    int end;        // Byte offset for the end of the token
  };

  struct TokenFile {
    std::string filename;
    std::vector<Token> tokens; // The tokens in order.
    std::vector<int> lines;    // Byte offsets for each line number.
  };

So for each file we tokenise, we'll store it in memory as an array of tokens. The tokens themselves don't store a string or a value - they just identify which range of bytes in the file the token came from. We also store byte offsets for line numbers, so that if we need them for any reason - error reporting, or a __LINE__ macro, for example - we can find them using a binary search (these are low frequency uses, so it's wasteful to store a line number for every token).

Assuming a 64 bit system, we get the following sizes:
  • std::string = 24 bytes (the GNU stdlibc++ representation is 3 pointers, one each for start, end and capacity) + the size of the data
  • std::vector = 24 bytes (same representation as a std::string) + the data size
  • Token = 12 bytes (4 bytes for each of type, start and end).
  • TokenFile = 72 bytes + 12 * T + 4 * L + F, where T is the number of tokens in the file, L is the number of lines in the file and F is the length of the filename.

Now let's make up a theoretical large c++ code base and some metrics for it to see how things stack up. Let's say our code base has the following characteristics:
  • 10,000,000 lines of code
  • 10,000 separate files
  • An average of 1000 lines per file
  • 10 tokens per line, on average
  • The average filename length is 100 bytes.

Running the numbers, this means we have:
  • 10,000 TokenFile objects at (on average) 72 + 12 * 10,000 + 4 * 1000 + 100 bytes
    = 10,000 * 124,172 bytes
    = roughly 1.24 Gb;
Of which
  • 720 Kb is for the TokenFile objects
  • 1.2 Gb is for the Token data
  • 40 Mb is for the per-file line numbers
  • 1 Mb is for the file names

That doesn't seem too bad considering it's supposed to be a large code base. 1.24 Gb of RAM won't even touch the sides on a modern desktop computer! We could make a few optimisations to get the size down even further, but let's stick with that for now.

This structure doesn't store the token contents. For example, you can't use it to distinguish between two identifiers unless you load up the files containing the tokens and look at the byte ranges for each. Doing that might actually be feasible on an OS like Linux where opening files is fast, but it would be pretty bad on Windows where that's a more expensive operation. So what if we store the token contents as well? Is that feasible?

Let's make a few more assumptions about our theoretical code base:
  • There are around 1000 unique token strings per text file, on average.
  • Across the whole project there are about 10,000,000 unique strings with an average length of 10 chars.
We'll keep a cache of interned strings for each token (we won't worry about converting literals into actual ints/floats/doubles/whatever for now), so that each unique token string is stored once and once only. We'll also replace the 'end' member of our Token struct with a 'pos' member, holding an index into the interned string list.

The Token struct stays the same size so the sums above don't change. For the data size, we get:
  • 10,000,000 unique strings * (11 bytes per string + 8 bytes for the pointer to it)
    = approximately 190 Mb
Based on all that, storing the entire tokenised source code for a 10 million line c++ project would only take about 1.43 Gb. That's way below the 8Gb of RAM you get in an average desktop these days, so it seems like it would be workable.

Does it actually lead to any efficiency gains? That remains to be seen. It should mean that the file access pattern interacts better with the disk cache; and being able to index into an array of tokens might help reduce the cost of lookahead and backtracking in the parser; but for now it's just conjecture. There's no github repo to go with this post.

So it's over to you, readers. Do you know of anything that already does this? Any reasons why it'd be a crazy thing to do? If so, I'd love to hear about it!

No comments:

Post a Comment