erenon.hu

Some time ago I was tasked with improving the build system of a large company. Requirements: build (compile and test) a large amount of C++ code (n*100k+ lines), targeting various versions of Linux and Windows, scattered in several repositories, using an enormous number of dependencies (mostly stored as binaries), mixed with Python tests, with the possibility of extending it to Java.

This post explains why Bazel was chosen for this particular task. This is not a general introduction for Bazel, nor a thorough comparison against other build systems.

The Problem

The then current build system was based on a fork of a lesser known Makefile generator. It had the following issues:

The only form of caching was with ccache, that stores object files only, and has its own limitations.
The generated Makefiles allowed one to change a dependency without triggering a re-build (this is because make uses mtime for change detection), causing hard to debug issues, causing developers to lose trust in the system, causing them to regularly make clean first, further increasing resource usage.
The development of the tool was halted, there was no hope to get C++ Modules support for it.
The build system did not cover running tests. Even if the smallest change was made in a very large repository (think monorepo), all the tests must be re-run, taking a lot of time and resources.
Because running all the tests was slow, developers skipped them, or guessed, failing in CI much later, losing time.
Multiplatform builds were time consuming and difficult, making developers to push it on the CI, and running long cycles on failure.
Getting a project and all its dependencies build with a particular set of compiler flags (e.g: -fsanitize=address) was next to impossible, turning easy-to-find bugs to terribly difficult to debug failures.

The Solution

I chose to migrate to Bazel, that offers a set of self-reinforcing benefits. To understand these benefits, we need to understand how Bazel works (supposed to work). I’m going to simplify things for brevity.

Bazel is similar to make. The user defines a set of rules, that describe how the project can be built. Each rule has a set of inputs and an expected set of outputs. The input set contains e.g: a command, that defines what should be executed to turn the inputs into the output. The input set also contains a program (e.g: a compiler) that transforms the input files into the output files. The output of one rule can be the input of another, forming a tree. To build a particular output, first the system builds the dependencies of the rule (the inputs that are outputs of other rules), then it executes the rule.

Bazel is an improvement over make. First, it enforces (tries hard to) that rules are only allowed to read their declared inputs, to produce the output, and cannot depend on anything else (e.g: system time, network, etc). A build executed in this fashion is called a Hermetic Build.

If a build is really hermetic, executing it on the same source always yields to the same output, therefore it is a Reproducible Build.

For the first full build, there’s little difference compared to make. For subsequent, incremental builds however, there are plenty.

Because of the reproducible property, if the inputs of a rule did not change, the output wouldn’t change either, therefore the rule does not need to re-run. For example, if a comment is removed, but the compiler produces the same object file, the link step will not be executed again. Bazel manages tests, similar to compilation, therefore, if the test binary (and its inputs) did not change, the test doesn’t need to re-run either.

To decide whether the input has changed (compared to the previously built output), Bazel uses the hash of the input (compared to make, that uses ctime). This fixes the correctness issues, and a key piece for distributed execution and caching.

It is possible to setup a remote executor cluster (e.g: buildbarn), that accepts rules to be executed, along with the inputs. If the rule for the given set of input was already executed, it returns the cached result, otherwise it executes the rule, caches then returns the result. This includes not just compilation, but also running the tests (where the test results are cached).

This improves the developer workflow on several points:

Fetching and building the main branch of a repository is quick, as that state is probably already built and cached.
After a change, it is to run all the affected tests, and only those, implcitly caching the results at the same time.
The CI gets also fast, because it just fetches the repository, and probably gets back the green results of the tests, produced from the previous step.
The remote executor can support multiplatform builds, saving the programmer from shipping the code manually to multiple platforms for building.

It also greatly reduces the overall resource usage.

Because of its efficiency, Bazel allows one to put more and more code (also dependencies) into a single repository, building them together. This unlocks further opportunities: fully static builds for better optimization and simplicity, build governed flags (e.g: sanitizers).

Bazel is not without difficulties: making builds hermetic, maintaining proper state (the input hash, output storage), operating the remote executor and cache, etc., is a demanding task. Therefore Bazel is better suited for larger projects with more developers. During selection, CMake was considered and evaluated, but apart from it being a popular build system, it does not solve any of the issues Bazel does.