Evan Martin’s Ninja retrospective discusses code/data generation as a separate step from processing. Processing means, for example, that a C compiler processes C code or a build tool processes Makefiles (or something similar). Generation means that an earlier program wrote that C code or Makefile in the first place. This conceptual split isn’t a generic solution to every programming problem, but it can still be a useful technique.
Go has go generate, unlike other languages with sophisticated (compile-time) macro systems or the ability to run a subset of the full language at compile time. The separate step keeps the language itself simpler (and therefore a whole host of static analysis and refactoring tools simpler, not just the compiler) and compilation faster.
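The mechanism itself is tiny: a magic comment in an ordinary .go file names a command, and running go generate executes that command, which typically (re-)writes a checked-in source file. A minimal sketch, with placeholder file names (gen.go and tables.go are made up here):

    // Running "go generate" in this package's directory executes the command
    // in the directive below. By convention, gen.go carries a "//go:build ignore"
    // constraint so it is a standalone program rather than part of the package;
    // its output, tables.go, is ordinary Go source that gets checked in.
    package example

    //go:generate go run gen.go -output tables.go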
Here are some examples.
For CCITT (fax’s image file format), go generate writes an efficient (pointer-free and therefore invisible to the garbage collector) representation of the binary trees for the hard-coded CCITT Huffman codes. Importantly, it also writes comments (ASCII art of those binary trees) that help future-me (or any other maintainer) understand the data structure that past-me wrote.
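I won’t reproduce the actual x/image/ccitt tables here, but a sketch of the general technique (with made-up node values) is to flatten each binary tree into an array of index pairs, so the whole structure is one pointer-free block of memory, and to emit the ASCII art right next to it:

    // A Huffman binary tree flattened into a pointer-free array. Element i
    // holds the {0-bit, 1-bit} child links of node i. Non-negative values
    // index further nodes; a negative value ^v marks a leaf whose decoded
    // symbol is v. With no pointers, the garbage collector never scans it.
    //
    //          .
    //         / \
    //        .   .
    //       / \ / \
    //      a  b c  .
    //             / \
    //            d   e
    var huffTree = [...][2]int16{
        0: {1, 2},
        1: {^0, ^1}, // leaves a, b
        2: {^2, 3},  // leaf c
        3: {^3, ^4}, // leaves d, e
    }

    // decode walks the flattened tree one input bit at a time.
    func decode(nextBit func() int) int {
        for i := int16(0); ; {
            i = huffTree[i][nextBit()]
            if i < 0 {
                return int(^i)
            }
        }
    }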
The golang.org/x/net/html/atom package converts common HTML attribute and element names (like “href”, “p” and “table”) from strings to unique 32-bit integers. Parsing HTML properly means sometimes treating a <p> child differently from a <tr> child, amongst many other idiosyncratic rules. Comparing variable-length strings takes longer than comparing fixed-length integers. Converting to integers first and working solely with integers afterwards can noticeably speed up parsing (part 1, part 2).
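The generated constants are what the parser actually compares against. Roughly, from the caller’s point of view (the code below uses the package’s exported Lookup function and Atom constants; the generated table behind them is internal):

    package main

    import (
        "fmt"

        "golang.org/x/net/html/atom"
    )

    func main() {
        // Lookup maps a known element or attribute name to its unique 32-bit
        // Atom; unknown names map to zero. Inside the parser, "is this a <p>?"
        // becomes a single integer comparison, not a string comparison.
        a := atom.Lookup([]byte("table"))
        fmt.Println(a == atom.Table)                            // true
        fmt.Println(atom.Lookup([]byte("not-an-element")) == 0) // true: unknown name
    }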
Conversion uses a hash table. In the general case, a hash table implementation needs to consider hash collisions. With code generation, we know all of the keys up front. We can spend some time searching the parameter space, keeping the smallest hash table that doesn’t exhibit collisions (or doesn’t exhibit too many collisions, for cuckoo hashing). At run time, the look-up code can therefore be simpler.
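I’m glossing over what the atom package’s generator actually does, but the shape of that generation-time search is roughly a brute-force loop over candidate parameters (the hash function and bounds below are placeholders, not the real ones):

    package main

    import (
        "fmt"
        "hash/fnv"
    )

    // findCollisionFreeTable brute-forces a (seed, size) pair for which every
    // known key hashes to a distinct bucket. It can afford to be slow, as it
    // only runs at generation time; the run-time lookup then needs no
    // collision handling at all.
    func findCollisionFreeTable(keys []string) (seed, size uint32) {
        for size = uint32(len(keys)); ; size++ {
            for seed = 0; seed < 1<<16; seed++ {
                used := make(map[uint32]bool, len(keys))
                ok := true
                for _, k := range keys {
                    h := fnv.New32a()
                    h.Write([]byte{byte(seed), byte(seed >> 8)})
                    h.Write([]byte(k))
                    if b := h.Sum32() % size; used[b] {
                        ok = false
                        break
                    } else {
                        used[b] = true
                    }
                }
                if ok {
                    return seed, size
                }
            }
        }
    }

    func main() {
        seed, size := findCollisionFreeTable([]string{"a", "href", "p", "table", "tr"})
        fmt.Printf("seed=%d size=%d\n", seed, size) // would be written into the generated .go file
    }

The generated table bakes those parameters in as constants, so a run-time lookup hashes once, indexes once and, for inputs that might not be in the table at all, confirms the hit with a single string comparison.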
For the public suffix list, go generate again writes an efficient representation of the tree. Importantly, the generation process can download the most recent data from the internet. It’s also OK for the generator to take a relatively long time to really compress the final form, as it’s only run e.g. once a week, not once per compile.
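A sketch of that generator’s outer loop (not the real golang.org/x/net/publicsuffix code; buildTable is a stand-in for the actual parsing-and-compressing work, and the URL is the list’s published location as I recall it):

    package main

    import (
        "io"
        "log"
        "net/http"
        "os"
    )

    func main() {
        // Download the most recent data from its canonical location.
        resp, err := http.Get("https://publicsuffix.org/list/public_suffix_list.dat")
        if err != nil {
            log.Fatal(err)
        }
        defer resp.Body.Close()
        raw, err := io.ReadAll(resp.Body)
        if err != nil {
            log.Fatal(err)
        }

        // buildTable can afford to be slow (it runs e.g. weekly, not per
        // compile), searching hard for a compact encoding of the tree.
        src := buildTable(raw)

        if err := os.WriteFile("table.go", src, 0644); err != nil {
            log.Fatal(err)
        }
    }

    // buildTable is a placeholder for the real work: parse the rules, build
    // the suffix tree and emit Go source for its compressed representation.
    func buildTable(raw []byte) []byte {
        return []byte("package publicsuffix\n\n// TODO: generated table\n")
    }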
Conversely, you generally don’t want compilation to require an internet connection (especially if you want reproducible builds) or to take too long. There’s admittedly now a question about the hard-coded list becoming stale, although for server software you can probably just re-build and deploy at a regular cadence. Engineering is trade-offs.
Published: 2020-06-05