
Early Overview


This page contains an early overview of the Program Repository work. It tries to explain the thinking behind and the structure of the repository.

Program Repository

This document describes a program repository which provides a common data store to record output from the compiler (the information that would traditionally be contained in an object or bytecode file) as well as a means by which other tools, such as a debugger or profiler, can annotate the program or cache supplementary information. The repository may be queried and incrementally updated to enable:

  • The compiler to avoid optimization and code generation for functions and data that are unchanged between builds.
  • The linker to build an image from individual functions and data objects by following the edges that form the program graph.
  • The debugger to directly access debugging metadata produced by the compiler.
  • The linker to record additional metadata necessary to support incremental linking.
  • Other tools, such as a profiler or debugger, to discover and record information which can then be consumed by any of the others.

Important design constraints for the repository are:

  • Existing program source code should be supported without change.
  • As far as possible, existing build systems should be unaffected; the repository's effect should be felt through reduced turnaround times but should be otherwise largely transparent.
  • It should not require additional configuration or installation processes.

Program Fragments

A program fragment (or simply fragment) is the basic unit from which programs are composed; it is a single, self-contained vertex in the digraph that forms the program. In C++, this is equivalent to an individual definition: a single free, member, or template function, or a data object with static or thread storage duration. A fragment is made up of a collection of sections, where each section holds a specific type of data or metadata. Metadata necessary for language support (such as exception handling or virtual method tables), for debugging, or for incremental linking, as well as dynamically captured data (such as profiling or coverage information), is bundled in the fragment along with native code and/or bytecode. This logical association with the originating source code definition is explicitly preserved.

Two vectors are associated with each section: the internal and external fixups which fulfill the role of relocations in conventional object files. External fixups reference other fragments by name and form the edges in the program graph (fragments being the vertices) whereas the internal fixups define connections between the contents of the fragment's sections.

int foo(void) {
    static int a = 10;  /* static storage duration, local to foo */
    return a++;
}
static int g = 31;      /* internal linkage */
int bar(void) {
    return foo() + g++; /* references both foo and g */
}

For example, the simple C program snippet above defines three interconnected program fragments. Each is assigned a unique, globally visible name: foo and bar have external linkage, so their source names can be used; g has internal linkage, so a unique name must be generated for it. The program can be represented in the repository as three connected fragments:

Figure: Program Fragment Structure
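
As a concrete illustration, the structures described above might be sketched in C as follows. This is a minimal sketch only; the type and field names are assumptions made for this example, not the project's actual API.

#include <stddef.h>
#include <stdint.h>

typedef struct { uint8_t bytes[16]; } digest_t;  /* hash identifying a fragment */

enum section_kind { SK_TEXT, SK_DATA, SK_RODATA, SK_DEBUG_LINE };

struct internal_fixup {       /* connection between two sections of one fragment */
    enum section_kind target; /* the section being referenced */
    uint64_t offset;          /* where in this section the fixup is applied */
    uint64_t addend;
};

struct external_fixup {       /* reference to another fragment: a program-graph edge */
    const char *name;         /* name of the referenced fragment */
    uint64_t offset;
    uint64_t addend;
};

struct section {              /* holds one specific type of data or metadata */
    enum section_kind kind;
    const uint8_t *payload;   /* code, data, or metadata bytes */
    size_t size;
    struct internal_fixup *internal; size_t num_internal;
    struct external_fixup *external; size_t num_external;
};

struct fragment {             /* a vertex of the program graph */
    digest_t digest;
    struct section *sections; size_t num_sections;
};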

Compiling

Conventional Compilation

Figure: Three-phase compilation

Multi-language/multi-target compilers such as LLVM and GCC usually employ a high-level structure that is broken into three phases of translation:

  • Front-end: Lexical analysis and parsing produce an AST, which is lowered to an IR suitable for consumption by the subsequent phases.
  • Optimizer: A series of optimization passes are carried out; each consumes, modifies, then produces IR.
  • Back-end: Target-dependent operations such as instruction selection, register allocation, and scheduling are carried out. The resulting binary is then written to an output object file suitable for linking.

The entire translation unit passes through the optimization and back-end phases.

Compiling with the Repository

Figure: Three-phase compilation with repository support

The program repository introduces a new pass immediately following (or within) the front-end of the compiler. This pass enables the compiler to avoid the code-generation phases for those definitions for which a binary representation is already available, regardless of the translation unit in which that work was done. For best performance, this new pass should be placed as close as possible to the front-end of the compiler: this enables unnecessary IR (including unchanged objects, or objects with vague linkage for which a compiled instance already exists) to be removed as early as possible, avoiding wasteful work in the later phases of the compiler.

Hash consing is a technique which avoids constructing a new immutable object when an equivalent one is already available; a unique hash digest identifies each object. Here the strategy is used to memoize an immutable copy of the binary representation of a source code definition in the repository. This enables objects with matching source code to be identified and redundant copies eliminated; the overhead of entities requiring vague linkage is thus largely removed.

Inputs to the hash function must include all aspects of the source code definition and its environment that affect the final generated code, including the compilation options, the type-graph for referenced types, inlining decisions, local ABI optimizations, and so on. Each hash is compared to those recorded in the repository. Where the hash is not found, the optimization and code-emission phases of the compilation continue and the resulting fragment is recorded in the repository. Definitions whose name and body are unchanged between compilations, or whose body matches an existing record in the repository, do not pass through the latter two stages of the compiler.
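
A minimal sketch of this decision in C might look like the following; the helper functions (hash_definition, repo_contains, and so on) are hypothetical names introduced purely for illustration.

#include <stdbool.h>
#include <stdint.h>

typedef struct { uint8_t bytes[16]; } digest_t;

digest_t hash_definition(const void *ir);  /* digest over IR, options, types, ABI */
bool repo_contains(const digest_t *d);     /* repository lookup by digest */
void compile_and_record(const void *ir, const digest_t *d);
void note_for_ticket(const char *name, const digest_t *d);

void process_definition(const char *name, const void *ir) {
    digest_t d = hash_definition(ir);
    if (!repo_contains(&d)) {
        /* No matching fragment: run the optimizer and back-end and
           record the resulting fragment under this digest. */
        compile_and_record(ir, &d);
    }
    /* Whether new or reused, the name/digest pair is noted for the
       translation unit's ticket. */
    note_for_ticket(name, &d);
}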

The modification also affects the final stage of the compiler. Instead of writing a traditional object file, we associate the hash with a new fragment in the repository. Finally, the compiler creates a small ticket file containing a hash digest that corresponds to a key in a repository table; the value of that key is a list of name/digest pairs for each entity produced by the compilation (regardless of whether new fragments were created or existing ones reused). A sketch of this record appears after the list below. The ticket file serves a number of purposes:

  • It enables traditional build tools which rely on object-file timestamps, such as make, to function without modification; the modification time of the ticket file may be used to detect updated compilations.
  • It is provided to the linker to define the name/fragment correspondence for a translation unit. This collection of tickets completes the definition of the edges in the program graph by establishing the fragments to which external fixups refer.
  • It acts as the root for the tracing process used when the repository is compacted. Any ticket record that is contained in a repository for which the corresponding ticket file no longer exists (which includes those that have been replaced by a later compilation) may be removed, along with any other transitively unreferenced objects.
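
Based on the description above, a ticket record might be sketched like this; the layout is an assumption made for illustration, not the actual format.

#include <stddef.h>
#include <stdint.h>

typedef struct { uint8_t bytes[16]; } digest_t;

struct ticket_member {    /* one entity produced by the compilation */
    const char *name;     /* the entity's globally visible name */
    digest_t fragment;    /* digest of the fragment that defines it */
};

struct ticket {           /* the value stored in the repository's ticket table */
    struct ticket_member *members;
    size_t num_members;
};

/* The ticket file itself need only hold the key: a single digest that
   locates this record in the repository. */
struct ticket_file {
    digest_t key;
};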

Distributed Compilation with the Repository

As compilations proceed, the repository builds a unified view of the program which can be updated and inspected by the tools. Distributing the compilations to remote agents potentially disrupts the construction of this whole-program view: unless the repository is available via a network-visible server, no single build machine has a view of the entire build. To preserve the repository's benefits when the build is distributed across multiple hosts, some changes to the traditional distribution model are needed.

Figure: Distributed compilation

To reduce the overhead of copying a potentially very large repository to multiple remote build agents before the build can begin, a strip phase extracts the fragment and ticket digests to produce a minimal repository which lacks the associated content. This is then pushed to each remote agent. The remote compilations then add fragments (with their digests) for new or modified functions to their local repository as usual. Once the remote operations are complete, these remote repositories are pulled from the agents and a merge operation is carried out. This updates the original repository with the results of the remote compilations to produce the final repository. The remote repositories may each contain a copy of vague-linkage objects: these unneeded duplicates are eliminated by the merge process.
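
The merge step can be sketched as follows; the repository accessors are hypothetical names, and a real implementation would update the store transactionally.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint8_t bytes[16]; } digest_t;
struct fragment;  /* as sketched earlier */
struct repo;

size_t repo_size(const struct repo *r);
const digest_t *repo_digest_at(const struct repo *r, size_t i);
const struct fragment *repo_fragment_at(const struct repo *r, size_t i);
bool repo_contains(const struct repo *r, const digest_t *d);
void repo_insert(struct repo *r, const digest_t *d, const struct fragment *f);

/* Fold one remote repository into the original. Duplicates, including
   vague-linkage objects compiled independently by several agents,
   share a digest and so are inserted only once. */
void merge(struct repo *original, const struct repo *remote) {
    for (size_t i = 0; i < repo_size(remote); ++i) {
        const digest_t *d = repo_digest_at(remote, i);
        if (!repo_contains(original, d))
            repo_insert(original, d, repo_fragment_at(remote, i));
    }
}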

Linking

Conventional Linking

Figure: Traditional three-stage static linker

The purpose of the static linker is to gather together the collection of separately compiled object files and convert them to the final image for execution. The operation is carried out in three broad phases:

  1. Scan: The linker examines each of the object files presented on the command line, gathering information such as: (a) the types and sizes of the sections that they contain and (b) the symbols defined and referenced by each file.
  2. Layout: The linker assigns addresses to each of the loadable sections discovered during the scan phase, applying any restrictions required by the target operating system.
  3. Output: The linker copies all of the required data from the input object files to the output image, applying the relocations (or fixups) that enable one object file to reference data defined in another.

Linking from the Repository

Figure: Program repository based static linker

Linking from the program repository is fundamentally similar to linking from a collection of discrete object files. There are, however, some advantages:

  • The connections between each of the fragments are now defined explicitly by the program graph. The scan phase must now simply gather the collection of fragments that are eligible for inclusion in the link.
  • The fragments represent a smaller “atomic unit” than a traditional object file section: the linker operates at the level of individual functions and data objects. This is particularly useful when performing an incremental link because only those functions and data objects that have actually been changed since the previous link need to be considered.
  • The linker does not need to consider fragment sections that are not required in the final image. Debugging metadata, for example, can remain in the repository; there is no need for it to be copied since the debugger can access it directly.

The linker begins by enumerating the collection of ticket files on the command line; the hash digest contained within each ticket maps to the set of fragments that were produced by that compilation (including those items with vague linkage). This defines the collection of fragments that are eligible for inclusion in the link.

An initial batch link then begins the layout phase; this starts at the program entry points (such as the global constructors and _start for an executable, or the list of exported names for a shared library) and traverses the graph. As each fragment is visited, its sections are added to the relevant output stream (code, data, read-only data, and so on) and the corresponding address is recorded. The resulting mapping of fragments to addresses is then used in the output phase to apply all of the necessary fixups to the output.
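
The traversal might be sketched as a simple worklist walk over the program graph; the helpers here (already_placed, place, and the reference accessors) are illustrative names, not the project's API.

#include <stdbool.h>
#include <stddef.h>

struct fragment;

size_t num_references(const struct fragment *f);  /* external fixups */
struct fragment *reference_at(const struct fragment *f, size_t i);
bool already_placed(const struct fragment *f);
void place(struct fragment *f);  /* append sections to the output streams
                                    and record the assigned address */

void layout(struct fragment **roots, size_t num_roots) {
    enum { MAX = 1 << 20 };
    static struct fragment *work[MAX];
    size_t top = 0;
    for (size_t i = 0; i < num_roots && top < MAX; ++i)
        work[top++] = roots[i];  /* _start, global ctors, exported names */
    while (top > 0) {
        struct fragment *f = work[--top];
        if (already_placed(f))
            continue;            /* each fragment is laid out exactly once */
        place(f);
        for (size_t i = 0; i < num_references(f) && top < MAX; ++i)
            work[top++] = reference_at(f, i);  /* follow program-graph edges */
    }
}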

Debugging

A debuggable executable image contains references to the originating repository: the debugger loads the debugging metadata directly from there. Debugging metadata includes:

  • Line number information used to associate a specific line of source code with the corresponding machine instruction addresses.
  • Descriptions of the functions (program scope entries).
  • Type entries describing the types used in the program.
  • Call-frame information.

Some of this data, such as line-number or call-frame information, is clearly associated with a particular function and thus may be stored as a metadata section within the relevant fragment. However, type entries do not usually belong to a specific function: the type graph can be viewed as forming a separate layer in the overall program graph. Following the scheme described in the DWARF 4 standard (Appendix E), each distinct type is a vertex in that graph. A “signature” hash can be generated for the type, and hash consing can once again be used in the compiler to avoid the generation of duplicate copies at source. In contrast, DWARF type sections work by having the compiler create a COMDAT group for each type, each of which is written to disk, relying on the linker's COMDAT handling to avoid the presence of multiple copies in the final binary.

For example, consider a simple C program such as:

struct common {
    int a;
    int b;
};
int f(struct common *p) {
    return p->a;
}
int g(struct common *q) {
    return q->b;
}

This example can be represented in the repository in the form shown below:

Figure: Debug type representation
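
To illustrate the signature idea, the sketch below derives a hash from a canonical rendering of struct common, so that the compilations of f and g arrive at the same digest. FNV-1a stands in here for the real hash function, and the canonical string format is an assumption made purely for this example.

#include <stdint.h>
#include <stdio.h>

static uint64_t fnv1a(const char *s) {
    uint64_t h = 0xcbf29ce484222325u;  /* FNV-1a offset basis */
    for (; *s != '\0'; ++s) {
        h ^= (uint8_t)*s;
        h *= 0x100000001b3u;           /* FNV-1a prime */
    }
    return h;
}

int main(void) {
    /* Canonical rendering of the type as both compilations would
       produce it; matching renderings yield matching signatures. */
    const char *common = "struct common{int a;int b;}";
    printf("signature: %016llx\n", (unsigned long long)fnv1a(common));
    /* Both functions' debug metadata can then refer to this signature
       rather than embedding a private copy of the type description. */
    return 0;
}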

The debugger uses a table which connects addresses within an executable to the relevant fragments and therefore to the corresponding debugging metadata. This table may be stored within the executable, in an auxiliary file, or within the repository itself (where there is a record to bind an executable to the repository table in the same way that tickets bind a compilation to a collection of fragments).
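
One possible shape for that table is sketched below, assuming entries sorted by base address; the entry layout is illustrative only.

#include <stddef.h>
#include <stdint.h>

typedef struct { uint8_t bytes[16]; } digest_t;

struct addr_entry {
    uint64_t base;      /* address of the fragment's code in the image */
    uint64_t size;
    digest_t fragment;  /* digest locating the fragment and its metadata */
};

/* Binary search for the entry covering addr; returns NULL if none. */
const struct addr_entry *find_fragment(const struct addr_entry *table,
                                       size_t n, uint64_t addr) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (addr < table[mid].base)
            hi = mid;
        else if (addr >= table[mid].base + table[mid].size)
            lo = mid + 1;
        else
            return &table[mid];
    }
    return NULL;
}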

Note that it is possible for a static linker which consumes data from a program repository to produce debugging metadata that is functionally indistinguishable from existing formats. In this configuration, the expected performance benefits of the repository may not all be realised, but an existing debugger would continue to work without change.

Repository Compaction

Wasted space in the repository is recovered by compaction, performed during idle periods or when explicitly requested by the user. The compaction process starts by examining each of the tickets and linked binaries referenced by the repository: the referenced fragments and types are traced from these roots, and any transitively unreachable items are removed.
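
The tracing step amounts to a mark-and-sweep over the program graph; a sketch follows, with hypothetical accessor names.

#include <stdbool.h>
#include <stddef.h>

struct fragment;
struct ticket;

size_t ticket_size(const struct ticket *t);
struct fragment *ticket_member(const struct ticket *t, size_t i);
size_t num_references(const struct fragment *f);
struct fragment *reference_at(const struct fragment *f, size_t i);
bool is_marked(const struct fragment *f);
void set_mark(struct fragment *f);

static void mark(struct fragment *f) {
    if (is_marked(f))
        return;
    set_mark(f);
    for (size_t i = 0; i < num_references(f); ++i)
        mark(reference_at(f, i));  /* trace the program-graph edges */
}

/* Mark everything reachable from the live roots; a subsequent sweep
   removes every fragment and type left unmarked. */
void trace(struct ticket **live_roots, size_t n) {
    for (size_t t = 0; t < n; ++t)
        for (size_t i = 0; i < ticket_size(live_roots[t]); ++i)
            mark(ticket_member(live_roots[t], i));
}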

Summary and Future Work

An alternative method for recording the output from a static compilation has been presented, one which can maximize reuse and minimize redundant code generation and copying by the toolchain. Data has also been presented showing the cost of duplicate code generation in real-world source bases, in terms of both the quantity of unnecessary data and the increased compilation time that can result from excess generation of vague-linkage objects.

Although making statements about the effect of an optimization before it has been implemented and tested is notoriously dangerous, the data presented suggests that significant reductions in developer turnaround times may be obtainable. However, it is clear that the efficacy of the approach relies on a number of factors:

  • Fast hashing of the compiler IR with very low probability of collision.
  • Fast repository search for a hash digest.
  • Significant duplication between translation units or a small volume of change between each compilation.
  • Low impact compaction to prevent the repository from growing uncontrollably.

Further work will focus on addressing these concerns and constructing a prototype implementation based on LLVM.