diff --git a/Documentation/design-docs/inline-size-estimates.md b/Documentation/design-docs/inline-size-estimates.md
new file mode 100644
index 000000000000..12286c7f5d1f
--- /dev/null
+++ b/Documentation/design-docs/inline-size-estimates.md
@@ -0,0 +1,526 @@
+# Inline Size Estimates
+
+Note this is work in progress....
+
+Inlining is a heuristic-based optimization. There are some cases where
+it's obvious that a particular inline is good or bad for performance,
+but more typically, things are not that clear. In those cases where
+the benefit is in doubt, the compiler will rely on heuristics,
+typically based on various estimates of an inline's size impact, speed
+impact, or other important factors.
+
+In this writeup we consider approaches to what should be the simplest
+of these estimates: the size impact of an inline.
+
+## Background
+
+There are a number of interesting facets to the inline size estimate
+problem, but let's start with some generalities. Suppose we have some
+size estimate for the caller `C` (ignoring for the moment exactly what
+we are measuring), say `CallerSize`, and some size estimate for the
+callee `E`, say `CalleeSize`. Now suppose we are contemplating what
+would happen if `E` is inlined into `C` to create `C'`. We'd like some
+sort of size estimate `CallerSize'`. The simplest estimate is that
+`CallerSize'` = `CallerSize + CalleeSize`:
+
+```
+(1) CallerSize' = CallerSize + CalleeSize
+```
+
+However, calling conventions impose some additional code overhead on
+both the caller and callee. The caller must set up arguments in
+registers or on the stack, and if there is a return value, might need
+to move it or store it somewhere. It also might need to spill values
+that are stored in caller-saved registers around the call.
+
+The callee must set up a stack frame and possibly spill callee-save
+registers it intends to use, and then ultimately must undo all this.
+
+When we inline `E` into `C`, none of this calling convention setup is
+necessary. So one can imagine that there is some additional size
+savings from inlining, one that depends on the calling convention and
+the number and kind of arguments passed from caller to callee:
+
+```
+(2) CallerSize' = CallerSize + CalleeSize - Overhead
+```
+
+Note that it's entirely possible that `Overhead > CalleeSize`, so
+that `CallerSize' < CallerSize`; that is, the inline not only results
+in faster code but also in smaller code. Indeed, this state of affairs
+is increasingly common with the advent of modern programming styles
+that emphasize building functionality out of lots of small procedures.
+
+Alternatively, we can compute the size impact as the size change of
+`C` because of the inline of `E` (negative when the inline shrinks the
+caller):
+
+```
+(3) SizeImpact = CallerSize' - CallerSize
+```
+
+or, with a bit of math,
+
+```
+(4) SizeImpact = CalleeSize - Overhead
+```
+
+Now let's look at some actual data to see how well this simple model
+for `SizeImpact` holds up. The data below shows the measured size
+impact of a single inline into various calling methods in mscorlib.
+This data was obtained by first compiling all the methods without
+inlining to obtain a baseline size as measured by the encoded
+instruction size in bytes. The inliner was then modified so it would
+only do a single inline per caller, with the candidate chosen by a
+parameter `k` that could be varied externally. mscorlib was then
+repeatedly compiled with varying values of `k`, and the inlines
+performed and the resulting modified caller sizes recorded.
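+
+Before diving into the data, a note on presentation: each histogram
+below gives per-byte frequency counts of the observed `SizeImpact`,
+ten columns per row, with extra rows for underflow and overflow. As a
+minimal sketch of that bucketing (hypothetical code -- the actual
+measurement scripts are not part of this document):
+
+```C#
+using System.Collections.Generic;
+
+static class SizeImpactHistogram
+{
+    // One count per byte delta in [-50, 49], displayed ten per row
+    // below, plus an underflow bin (< -50) and an overflow bin (> 49).
+    public static int[] Bucket(IEnumerable<int> sizeImpacts)
+    {
+        var bins = new int[102]; // [0]: < -50, [1..100]: -50..49, [101]: > 49
+        foreach (int impact in sizeImpacts)
+        {
+            if (impact < -50) bins[0]++;
+            else if (impact > 49) bins[101]++;
+            else bins[impact + 51]++;
+        }
+        return bins;
+    }
+}
+```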
+
+This data was then collated and histogrammed to show the variety of
+`SizeImpact` values for a fixed `E`. Since the calling convention and
+the number and kind of parameters are fixed across all observations
+for a fixed `E`, the model above would predict the same `SizeImpact`
+in every case.
+
+In this first example, the callee `E` is `System.Object:.ctor()`,
+which is close to the simplest possible callee -- it takes one
+argument and has no return value. In all the observed cases the
+`SizeImpact` is negative, but it's far from the same in every case.
+
+```
+Inlining for System.Object:.ctor():this (size 6)
+Instances 1160 Mean SizeImpact -12.53 Min -52 Max -5
+Distribution
+  less than -50:   1
+      [-50,-41]:   0   1   1   0   0   1   0   0   1   1
+      [-40,-31]:   0   1   0   0   1   0   0   3   1   2
+      [-30,-21]:   1   3   2   2   1   8  76   5  22  18
+      [-20,-11]:  38  23  98   7  29   4   8  27  14  21
+      [-10, -1]:   1 362 373   0   0   3   0   0   0   0
+      [  0,  9]:   0   0   0   0   0   0   0   0   0   0
+      [ 10, 19]:   0   0   0   0   0   0   0   0   0   0
+      [ 20, 29]:   0   0   0   0   0   0   0   0   0   0
+      [ 30, 39]:   0   0   0   0   0   0   0   0   0   0
+      [ 40, 49]:   0   0   0   0   0   0   0   0   0   0
+greater than 49:   0
+```
+
+It should be evident from this set of observations that `SizeImpact`
+cannot be completely characterized by a simple formula like `(2)`. For
+this inlinee, inlining always saves at least 5 bytes; roughly half the
+time it saves 8 or 9 bytes; on average it saves about 12.5 bytes; and
+it often saves considerably more.
+
+Other inlinees show similar spreads in `SizeImpact`. Here we see a case
+where sometimes the `SizeImpact` is negative and other times it's
+positive.
+
+```
+Inlining for System.Threading.CancellationToken:get_IsCancellationRequested():bool:this (size 29)
+Instances 42 SizeImpact Mean 11.33 Min -20 Max 28
+Distribution
+  less than -50:   0
+      [-50,-41]:   0   0   0   0   0   0   0   0   0   0
+      [-40,-31]:   0   0   0   0   0   0   0   0   0   0
+      [-30,-21]:   0   0   0   0   0   0   0   0   0   0
+      [-20,-11]:   1   0   0   0   0   0   0   1   0   0
+      [-10, -1]:   0   2   0   0   0   0   0   0   1   2
+      [  0,  9]:   0   0   1   1   0   2   0   1   4   1
+      [ 10, 19]:   1   0   1   0   4   1   2   1   2   6
+      [ 20, 29]:   0   0   2   0   2   0   0   0   3   0
+      [ 30, 39]:   0   0   0   0   0   0   0   0   0   0
+      [ 40, 49]:   0   0   0   0   0   0   0   0   0   0
+greater than 49:   0
+```
+
+Not all inlinees exhibit such wide `SizeImpact` distributions. Some
+spread just a little:
+
+```
+Inlining for System.Environment:GetResourceString(ref):ref (size 15)
+Instances 2238 SizeImpact Mean 0.01 Min -3 Max 6
+Distribution
+  less than -50:   0
+      [-50,-41]:   0   0   0   0   0   0   0   0   0   0
+      [-40,-31]:   0   0   0   0   0   0   0   0   0   0
+      [-30,-21]:   0   0   0   0   0   0   0   0   0   0
+      [-20,-11]:   0   0   0   0   0   0   0   0   0   0
+      [-10, -1]:   0   0   0   0   0   0   0   7   0   7
+      [  0,  9]:2212   3   0   5   0   0   4   0   0   0
+      [ 10, 19]:   0   0   0   0   0   0   0   0   0   0
+      [ 20, 29]:   0   0   0   0   0   0   0   0   0   0
+      [ 30, 39]:   0   0   0   0   0   0   0   0   0   0
+      [ 40, 49]:   0   0   0   0   0   0   0   0   0   0
+greater than 49:   0
+```
+
+and some not at all:
+
+```
+Inlining for System.DateTime:get_Ticks():long:this (size 15)
+Instances 129 SizeImpact Mean 0.00 Min 0 Max 0
+Distribution
+  less than -50:   0
+      [-50,-41]:   0   0   0   0   0   0   0   0   0   0
+      [-40,-31]:   0   0   0   0   0   0   0   0   0   0
+      [-30,-21]:   0   0   0   0   0   0   0   0   0   0
+      [-20,-11]:   0   0   0   0   0   0   0   0   0   0
+      [-10, -1]:   0   0   0   0   0   0   0   0   0   0
+      [  0,  9]: 129   0   0   0   0   0   0   0   0   0
+      [ 10, 19]:   0   0   0   0   0   0   0   0   0   0
+      [ 20, 29]:   0   0   0   0   0   0   0   0   0   0
+      [ 30, 39]:   0   0   0   0   0   0   0   0   0   0
+      [ 40, 49]:   0   0   0   0   0   0   0   0   0   0
+greater than 49:   0
+```
+
+So it appears there must be other -- perhaps many -- factors that
+influence the `SizeImpact` of a particular inline. The question is,
+what factors are important, how do we measure them, and how do they
+combine to influence the overall result?
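+
+If the data is any guide, the fixed `Overhead` term in `(4)` would at
+a minimum have to become a function of the particular call site and
+its surroundings. Purely as an illustration (this refinement is not
+developed further here):
+
+```
+(5) SizeImpact = CalleeSize - Overhead(C, E, callsite)
+```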
+
+The remainder of this document will look into some of the ways we can
+answer this question.
+
+## Some Ground Rules
+
+Before we move on, however, we should address what kind of information
+will actually be available to observe and use in building a heuristic
+model.
+
+First, it's not realistic to feed the heuristic the actual encoded
+sizes of methods. While a method `C` is being compiled, its final
+encoded size is not yet known. It's also quite likely that the encoded
+size of the inlinee `E` is unknown, except possibly in some AOT
+scenarios where one could generally arrange for methods to be compiled
+in bottom-up order (that is, callees before callers). While one could
+imagine spending compile time doing experimental compilations of `C`'s
+and `E`'s to see what code size results, this is probably too costly
+to be practical.
+
+Second, even if we could obtain the actual size of prospective inline
+candidates, we might not want to use this data. The final code
+sequence emitted by the compiler depends intimately on details of the
+target architecture, runtime conventions (ABIs), and capabilities of
+the compiler phases that run after inlining. If we allow feedback into
+the heuristics by incorporating data from these "downstream" sources,
+we introduce various forms of coupling that have important
+consequences. For example, it would be challenging to work on a
+downstream phase (say, register allocation) if some trivial change to
+allocation in an inlinee `E` perturbs inlining decisions for other
+methods like `C`. Likewise, we may or may not want to allow runtime
+conventions to influence inlining -- if our application can run
+cross-platform (say, on both Windows and Linux), we might not want the
+various versions to exhibit vastly different inlining patterns.
+
+For that reason, we intend to restrict the information we can use in
+forming heuristics to things that are generally true of the caller and
+callee, and the general capabilities of the downstream phases. For the
+caller and callee this largely means information that can be obtained
+by inspecting either the input to the compiler, or the compiler's IR
+in the stages leading up to inlining, and perhaps (if available) some
+carefully chosen information obtained from actual compilation of the
+callee.
+
+At the same time, we still intend for the predictions made by the
+heuristics to be indicative of the final binary size. We have no other
+reliable source of truth to guide us.
+
+Given all this, we can restate the central question as: how best can
+the compiler estimate the ultimate size impact of an inline while
+restricting itself to considering features that are generally true of
+caller, callee, and the capabilities of the compiler?
+
+## Building a Heuristic Manually
+
+The tried-and-true approach is to build the heuristic manually. There
+are two prongs to the approach: the first is case analysis of actual
+behavior, and the second is modelling based on the compiler writer's
+experience and intuition.
+
+### Some Case Analysis
+
+#### Case 1: Maximal Savings
+
+It's always instructive to look at actual behavior, so let's consider
+some of the cases. From the table above for `System.Object:.ctor()` we
+see a number of cases where there were large savings in byte size.
+The most significant of these is the following:
+
+```
+Inline System.Object:.ctor():this
+into System.IO.UnmanagedMemoryAccessor:.ctor(ref,long,long,int):this
+CalleeSize = 6
+CallerSize = 72
+CallerSize' = 24
+SizeImpact = -48
+```
+
+where `System.Object:.ctor():this` is simply:
+
+```Assembly
+; System.Object:.ctor():this
+; Total bytes of code 6, prolog size 5
+       0F1F440000           nop
+       C3                   ret
+```
+
+As an aside, one might wonder why there are 5 bytes of nops here --
+they are put there to support the *rejit* feature used by the
+Application Insights framework.
+
+The caller's source code shows it simply delegates immediately to
+another method to initialize the object:
+
+```C#
+    UnmanagedMemoryAccessor(
+        SafeBuffer buffer, Int64 offset,
+        Int64 capacity, FileAccess access) {
+        Initialize(buffer, offset, capacity, access);
+    }
+```
+
+Here's the code before the inline:
+
+```Assembly
+; BEFORE
+; System.IO.UnmanagedMemoryAccessor:.ctor(ref,long,long,int):this
+; Total bytes of code 72, prolog size 10
+       4156                 push     r14
+       57                   push     rdi
+       56                   push     rsi
+       55                   push     rbp
+       53                   push     rbx
+       4883EC20             sub      rsp, 32
+       488BF1               mov      rsi, rcx
+       488BFA               mov      rdi, rdx
+       498BD8               mov      rbx, r8
+       498BE9               mov      rbp, r9
+       488BCE               mov      rcx, rsi
+       E800000000           call     System.Object:.ctor():this
+       448B742470           mov      r14d, dword ptr [rsp+70H]
+       4489742470           mov      dword ptr [rsp+70H], r14d
+       488BCE               mov      rcx, rsi
+       488BD7               mov      rdx, rdi
+       4C8BC3               mov      r8, rbx
+       4C8BCD               mov      r9, rbp
+       488D0500000000       lea      rax, [(reloc)]
+       4883C420             add      rsp, 32
+       5B                   pop      rbx
+       5D                   pop      rbp
+       5E                   pop      rsi
+       5F                   pop      rdi
+       415E                 pop      r14
+       48FFE0               rex.jmp  rax
+```
+
+and here's the code after the inline:
+
+```Assembly
+; AFTER
+; System.IO.UnmanagedMemoryAccessor:.ctor(ref,long,long,int):this
+; Total bytes of code 24, prolog size 5
+       0F1F440000           nop
+       90                   nop
+       8B442428             mov      eax, dword ptr [rsp+28H]
+       89442428             mov      dword ptr [rsp+28H], eax
+       488D0500000000       lea      rax, [(reloc)]
+       48FFE0               rex.jmp  rax
+```
+
+In both cases this method ends up tail calling to `Initialize` (via
+the `rex.jmp`), passing it the exact set of arguments that was passed
+to the method.
+
+In the before case, the un-inlined call changes the method from a leaf
+to a non-leaf. Leaf functions don't need to set up their own frame,
+while non-leaves do, so some of the extra code in the prolog and
+epilog is the frame setup. Beyond that, however, the un-inlined callee
+forces the compiler to move the arguments that will ultimately be
+passed to the tail-called method out of the volatile registers and
+into callee-saved registers before the call, then back to the proper
+argument registers after the call. And in order to use the
+callee-saved registers, they must be pushed in the prolog and popped
+in the epilog.
+
+In the after case, no frame setup is required and nothing needs to be
+preserved across a call, so no registers need saving and restoring.
+The two `mov`s that remain appear to be gratuitous. The first `nop` is
+there for rejit padding; it's not clear why the second one is there.
+
+So we can now see why this particular case has such extreme
+`SizeImpact`: the un-inlined call triggers frame creation and a fair
+amount of shuffling to accommodate the potential side effects of the
+call.
+
+#### Case 2: Typical Savings
+
+In the data above, the inline of `System.Object:.ctor()` typically
+saves either 8 or 9 bytes of code. Let's look at one such case.
+
+```
+Inline System.Object:.ctor():this
+into System.Security.SecurityElement:.ctor():this
+CalleeSize = 6
+CallerSize = 15
+CallerSize' = 6
+SizeImpact = -9
+```
+
+Here are before and after listings for the caller:
+
+```Assembly
+; BEFORE
+; System.Security.SecurityElement:.ctor():this
+; Total bytes of code 15, prolog size 5
+       0F1F440000           nop
+       488D0500000000       lea      rax, [(reloc)]
+       48FFE0               rex.jmp  rax
+```
+
+```Assembly
+; AFTER
+; System.Security.SecurityElement:.ctor():this
+; Total bytes of code 6, prolog size 5
+       0F1F440000           nop
+       C3                   ret
+```
+
+In this case the call site was initially handled via a tail call that
+required a 7-byte `lea` and a 3-byte `rex.jmp`. With the inline, all
+that's left is a single-byte `ret`. This case covers 154 of the 187
+instances where inlining `System.Object:.ctor()` saved 9 bytes.
+
+#### Cases 3, 4: Size May Decrease or Increase
+
+Now let's look at a set of examples where an inline can either
+decrease or increase the caller size. Here's the good case:
+
+```
+Inline System.IntPtr:op_Explicit(long):long
+into System.Runtime.InteropServices.Marshal:WriteIntPtr(ref,int,long)
+CalleeSize = 35
+CallerSize = 50
+CallerSize' = 22
+SizeImpact = -28
+```
+
+Here's the callee code:
+
+SIGH -- it turns out the above has the wrong callee size: there are
+several overloaded versions, and my script picks up the wrong one. The
+true callee is only 9 bytes long:
+
+```Assembly
+; System.IntPtr:op_Explicit(long):long
+; Total bytes of code 9, prolog size 5
+       0F1F440000           nop
+       488BC1               mov      rax, rcx
+       C3                   ret
+```
+
+At any rate, the `SizeImpact` is still correct.
+
+Here's the before caller code:
+
+```Assembly
+; BEFORE
+; System.Runtime.InteropServices.Marshal:WriteIntPtr(ref,int,long)
+; Total bytes of code 50, prolog size 12
+       55                   push     rbp
+       57                   push     rdi
+       56                   push     rsi
+       4883EC20             sub      rsp, 32
+       488D6C2430           lea      rbp, [rsp+30H]
+       488BF1               mov      rsi, rcx
+       8BFA                 mov      edi, edx
+       498BC8               mov      rcx, r8
+       E800000000           call     System.IntPtr:op_Explicit(long):long
+       4C8BC0               mov      r8, rax
+       8BD7                 mov      edx, edi
+       488BCE               mov      rcx, rsi
+       488D0500000000       lea      rax, [(reloc)]
+       488D65F0             lea      rsp, [rbp-10H]
+       5E                   pop      rsi
+       5F                   pop      rdi
+       5D                   pop      rbp
+       48FFE0               rex.jmp  rax // tail call WriteInt64
+```
+
+and the after caller code:
+
+```Assembly
+; AFTER
+; System.Runtime.InteropServices.Marshal:WriteIntPtr(ref,int,long)
+; Total bytes of code 22, prolog size 10
+       55                   push     rbp
+       4883EC20             sub      rsp, 32
+       488D6C2420           lea      rbp, [rsp+20H]
+       E800000000           call     System.Runtime.InteropServices.Marshal:WriteInt64(ref,int,long)
+       90                   nop
+       488D6500             lea      rsp, [rbp]
+       5D                   pop      rbp
+       C3                   ret
+```
+
+Somewhat confusingly, the inline has stopped the jit from making a
+tail call; otherwise the size savings would have been even greater.
+There appears to be some kind of code generation issue within the
+jit. The IL for the callee is
+
+```
+IL_0000  0f 00            ldarga.s     0x0
+IL_0002  7b 8f 02 00 04   ldfld        0x400028F
+IL_0007  6e               conv.u8
+IL_0008  2a               ret
+```
+
+and the JIT rejects the tail call because
+
+```
+Rejecting tail call late for call [000008]: Local address taken
+```
+
+so presumably the `ldarga.s` in the inlinee is causing trouble.
+
+Now let's see if that same inlinee can cause a size increase.
+Here's a case:
+
+(NOTE this is calling a different overload..., need to get this
+straightened out)
+
+```
+Inline System.IntPtr:op_Explicit(long):long [35]
+into EventData:get_DataPointer():long:this [18] size 35 delta 17
+CalleeSize = 35
+CallerSize = 18
+CallerSize' = 35
+SizeImpact = 17
+```
+
+```Assembly
+; BEFORE
+; EventData:get_DataPointer():long:this
+; Total bytes of code 18, prolog size
+       0F1F440000           nop
+       488B09               mov      rcx, qword ptr [rcx]
+       488D0500000000       lea      rax, [(reloc)]
+       48FFE0               rex.jmp  rax
+```
+
+```Assembly
+; AFTER
+; EventData:get_DataPointer():long:this
+; Total bytes of code 35, prolog size 5
+       4883EC28             sub      rsp, 40
+       488B11               mov      rdx, qword ptr [rcx]
+       33C9                 xor      rcx, rcx
+       48894C2420           mov      qword ptr [rsp+20H], rcx
+       488D4C2420           lea      rcx, bword ptr [rsp+20H]
+       E800000000           call     System.IntPtr:.ctor(long):this
+       488B442420           mov      rax, qword ptr [rsp+20H]
+       4883C428             add      rsp, 40
+       C3                   ret
+```
+
+Here the un-inlined case made a tail call. The inlined case was unable
+to do the same optimization for some reason, though it looks like it
+should have been able to do so as well.
diff --git a/Documentation/design-docs/inlining-plans.md b/Documentation/design-docs/inlining-plans.md
new file mode 100644
index 000000000000..2279c16e379f
--- /dev/null
+++ b/Documentation/design-docs/inlining-plans.md
@@ -0,0 +1,438 @@
+# Proposed Plans for Inlining
+
+This document outlines a plan for improving the inlining capabilities
+of RyuJit as well as adding inlining capabilities to LLILC.
+
+## Background
+
+Inlining is a key optimization, directly reducing call overhead and
+indirectly exposing a wider scope for powerful intraprocedural
+optimizations.
+
+From an implementation standpoint, an inliner has the following
+aspects:
+
+* Machinery: transformations on the compiler's IR that physically
+incorporate the code for the callee into the caller.
+
+* Legality: a set of rules that describe when inlining is legal (that
+is, preserves program semantics). Here, the legality rules are largely
+dictated to us by the CoreCLR runtime.
+
+* Ability: a set of rules that describe whether the machinery is able
+to perform a legal inline and whether the result can be handled by
+downstream phases.
+
+* Profitability: a set of heuristic rules that describe which legal
+and able inlines are desirable.
+
+The general consensus is that the inlining in RyuJit can be improved,
+while carefully maintaining RyuJit's current lean compilation
+overhead. It is anticipated that this work will encompass Machinery,
+Ability, and Profitability.
+
+LLILC does no inlining today. Since we aspire to have LLILC be a
+high-performance .Net code generator, we need to enable inlining in
+LLILC. LLILC can likely leverage much of LLVM's built-in inlining
+Machinery and Ability, but will need new code for Legality and
+Profitability.
+
+We envision various scenarios for .Net code generation that impact
+inlining: a first-tier JIT compiler, a higher-tier JIT compiler, a
+fast AOT compiler, and an optimizing AOT compiler. Each scenario calls
+for inlining, but the tradeoffs are different. For a given scenario,
+we may choose to deploy RyuJit or LLILC or both depending on the
+scenario requirements and code generator capabilities.
+
+There are also aspects of inlining that are runtime- and
+target-machine-specific. Each runtime and target architecture will
+require some specialized tuning to accommodate the different costs of
+operations at runtime.
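+
+As a way to picture how the four aspects listed above divide the work,
+here is a hedged sketch in code. It is purely illustrative -- neither
+RyuJit nor LLILC is structured this way today, and all of the names
+are hypothetical:
+
+```C#
+// Illustrative only: an inline decision factored into the four
+// aspects described above. The machinery then acts on Inline results.
+enum InlineDecision { Illegal, Unable, Unprofitable, Inline }
+
+abstract class InlinePolicy
+{
+    protected abstract bool IsLegal(object callSite);      // runtime-imposed rules
+    protected abstract bool IsAble(object callSite);       // compiler limitations
+    protected abstract bool IsProfitable(object callSite); // heuristics
+
+    public InlineDecision Evaluate(object callSite)
+    {
+        // Checking legality first keeps "why didn't this inline
+        // happen?" easy to answer (see the Legality section below).
+        if (!IsLegal(callSite)) return InlineDecision.Illegal;
+        if (!IsAble(callSite)) return InlineDecision.Unable;
+        if (!IsProfitable(callSite)) return InlineDecision.Unprofitable;
+        return InlineDecision.Inline;
+    }
+}
+```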
+
+Taking this all into account, the proposed scope of the work
+encompasses two different compiler code bases, a number of codegen
+scenarios, and a number of target architectures and runtime
+conventions. The goal is to come up with one conceptual framework that
+covers all these cases and avoids relying too much on time-intensive
+manual investigation and tuning.
+
+We will measure the quality of inlining along three axes: the time
+spent generating code (aka *throughput* -- TP), the time spent
+executing the code (aka *code quality* -- CQ), and the size of the
+generated code (aka *code size* -- CS).
+
+We also desire that the resulting decision-making machinery (in
+particular the Profitability heuristic) be expressed in terms that are
+sensible to compiler writers.
+
+## The Proposal
+
+### Machinery
+
+RyuJit's machinery may or may not require updates, depending on how
+many key inlines are rejected because of lack of ability. This is
+something that requires further investigation. We may also decide to
+pursue enhancements here to ensure that in cascaded inline cases the
+deeper callsites are provided up-to-date information.
+
+LLILC should be able to leverage LLVM's existing inline machinery. In
+JIT-like scenarios, LLILC's reader will need some updates to properly
+handle cases where a method is being "read" as an inline candidate
+rather than as a root method to be compiled. For AOT scenarios we
+ultimately will want to do something more elaborate where all of the
+methods being compiled are represented as LLVM IR.
+
+### Legality
+
+As mentioned above, the legality constraints are generally imposed by
+the CoreCLR runtime.
+
+These constraints are already captured in the RyuJit source
+code. However, we generally prefer to at least logically perform all
+the legality checks before any of the ability or profitability checks,
+so that we can reason more simply about why a particular inline didn't
+happen. This may not be entirely practical, since some of the
+profitability or ability early-outs may be helping TP by keeping
+RyuJit from doing a lot of exploratory work before ultimately
+rejecting an inline candidate.
+
+LLILC lacks legality checking, and this needs to be implemented.
+
+### Ability
+
+RyuJit currently bails out of some legal inlining cases because of
+internal restrictions. For instance, if a callee modifies one of its
+parameters, RyuJit won't inline it because of limitations in how the
+IR is incorporated. We expect to work on reducing or removing key
+ability limiters in RyuJit.
+
+Assuming the LLVM inlining framework is robust, LLILC shouldn't have
+too many ability constraints.
+
+### Profitability
+
+Given the wide range of scenarios, targets, and the two distinct code
+bases, we propose to rely heavily on machine learning techniques for
+building and tuning the Profitability components of the inliners.
+
+The general approach we suggest is based on past experiences we've had
+developing inliners for various compilers and is inspired by the CGO
+paper [Automatic Construction of Inlining Heuristics using Machine
+Learning](http://dl.acm.org/citation.cfm?id=2495914) by Kulkarni,
+Cavazos, Wimmer, and Simon.
+
+Kulkarni et al. treat profitability as an unsupervised learning
+problem, and create a well-performing heuristic black box (neural
+network) using evolutionary programming techniques.
+They then turn around and use this black box as an oracle to label
+inline instances, and from these labels drive a supervised machine
+learning algorithm to produce a decision tree that expresses the
+profitability heuristics in terms sensible to the compiler writer.
+
+The inputs to the heuristic black box are various facts and estimates
+about an inline candidate, based on observations of the caller,
+callee, and call site (collectively, *features*). These features may
+be booleans, enumerations (categories), integers, or floating-point
+values. For example:
+
+* (boolean) `CalleeIsLeaf`, `CallerSiteInTry`
+* (enumeration) `CalleeReturnCorType`
+* (int) `CalleeILSize`, `CallSiteProfileWeight`
+* (float) `CalleeLoadStoreFrequency`
+
+The output is a yes/no inlining decision for a particular candidate.
+
+The evolutionary process is governed by a fitness metric that expresses
+the appropriate balance of TP, CQ, and CS for the scenario. For
+instance, in a JIT scenario we likely will want to keep TP and CS close
+to the current baseline values. For AOT we may want to allow higher TP
+and larger CS if we can demonstrate CQ improvements.
+
+## Implications
+
+For this approach to work, we will need to make frequent and robust
+measurements of the key fitness metrics across a range of
+representative benchmarks. So this presumes:
+
+* we can identify and capture these key benchmarks
+* we can develop methodology for doing appropriate measurements
+* we can do these measurements at appropriate scale
+
+### The Benchmarks
+
+Clearly in a machine learning approach the actual observations made
+are crucial. We need a solid, representative benchmark set. But this
+is a general problem with any heuristic-based optimization and not an
+inherent limitation of this approach -- one can only measure what one
+can measure.
+
+We plan on capturing the results in a standard schema and hope that,
+with this and the regularization of the metric data (see the next
+section), interested third parties can perhaps supply their own
+benchmark results.
+
+### The Metrics
+
+CS is typically very easy to measure, and the measurements are "noise
+free". So the measurement only needs to be done once.
+
+If TP and CQ are both measured in terms of time (or cycles), then the
+measurements will be inherently noisy. To reduce the variance caused
+by noise, some repeated runs may be necessary, and it might also be
+necessary to run on dedicated, matched hardware. This severely limits
+scalability.
+
+We instead propose to measure both TP and CQ in terms of instructions
+retired. While this may not be entirely noise-free, we expect the
+measurements to be low-noise and fairly insensitive to other
+computations on the machine. Thus this should give us decent scale
+(for instance, we can run scenarios in the cloud on VMs with varying
+underlying hardware, etc.).
+
+For TP, instructions retired should be a reasonable proxy for time,
+since we expect the CPI for the compiler proper to be relatively
+static. For CQ, we realize there may not be as strong a correlation
+with time, but typically the inliner operates early in the compilation
+process, and inlining is mainly focused on reducing instruction count
+and less on reducing cycle stalls.
+
+We'll do some experimentation early on to see how well these proxy
+metrics hold up.
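+
+To make the fitness idea concrete, here is a hedged sketch of the kind
+of scoring the evolutionary process might use. The linear combination,
+the weights, and all of the names are invented for illustration; a
+real scenario would choose and tune these:
+
+```C#
+// Hypothetical fitness: combine relative TP, CQ, and CS improvements
+// versus a baseline into one score. Positive deltas are improvements.
+class Metrics
+{
+    public long JitInstructionsRetired; // TP proxy
+    public long RunInstructionsRetired; // CQ proxy
+    public long CodeBytes;              // CS
+}
+
+static class Fitness
+{
+    public static double Score(Metrics baseline, Metrics candidate,
+                               double tpWeight, double cqWeight, double csWeight)
+    {
+        double tp = Improvement(baseline.JitInstructionsRetired, candidate.JitInstructionsRetired);
+        double cq = Improvement(baseline.RunInstructionsRetired, candidate.RunInstructionsRetired);
+        double cs = Improvement(baseline.CodeBytes, candidate.CodeBytes);
+        return tpWeight * tp + cqWeight * cq + csWeight * cs;
+    }
+
+    // Fractional improvement: positive when the candidate uses less.
+    static double Improvement(long baseline, long candidate)
+        => (baseline - candidate) / (double)baseline;
+}
+```
+
+In a JIT scenario, the TP and CS weights would be relatively large, to
+hold those metrics near baseline; an optimizing AOT scenario could
+shift weight toward CQ.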
+
+### Analysis
+
+To help prioritize work on profitability and ability (and perhaps find
+areas where we might try to remove legality constraints), we need some
+mechanism to correlate inlining outcomes with success and failure
+reasons.
+
+We propose to develop a textual, machine-readable inline log schema
+that describes a set of inlining decisions. This schema can be
+produced by a compiler as it makes decisions, and contains sufficient
+correlating data so that we can then match it up with runtime
+observations. For example, we might end up with the following sort of
+per-method output (though more likely it will be embedded in some kind
+of xml or json markup):
+
+```
+Inlining for 9811 Lookup
+  [o] 22825 Lookup@Lattice
+    [o] 22827 ??0?$interior_ptr@PAVCell
+  [x] 22826 Lookup@Lattice@@@Z (reason: SizeLimit)
+  [o] 21728 ?get_CellList
+```
+
+where `[o]` is a successful inline, `[x]` a failed inline, and
+indentation shows the inlining tree. For .Net compilation we'll need
+some kind of persistent ID for methods, which may not be all that easy
+to come by.
+
+This inline log can also be read back by the code generator to enable
+*inline replay*, forcing the inliner to perform a particular pattern
+of inlines. This is useful for things like performance investigations,
+bug isolation, and trying to have one compiler emulate the inlining
+behavior of another. Since the logs are textual, they can also be
+edited by hand if necessary to perform experiments.
+
+### Obtaining Runtime Data
+
+If we can obtain runtime data about the frequency of call
+instructions, we can run this through the log to accumulate failure
+reasons and create a failure reason table like the following
+(categories and data here are totally made up):
+
+```
+FailedReason         |  Count   |  [%]  | Sites  |  [%]
+---------------------|----------|-------|--------|------
+SpeculativeNonGround | 79347955 | 37.8% |    417 | 14.1%
+NonGroundShared      | 28756634 | 13.7% |    217 | 07.4%
+SizeLimit            | 22066033 | 10.5% |    760 | 25.8%
+ExternFunction       | 21923402 | 10.5% |    300 | 10.2%
+... etc ...
+```
+
+This can be used to prioritize inlining work. For instance, if
+`SpeculativeNonGround` were an ability issue, and we wanted to improve
+CQ, we'd give it high priority to fix, since it impacts inlining most
+frequently.
+
+### Robustness
+
+Since we'll likely be altering inlining decisions, we may uncover bugs
+or performance blemishes in the compilers.
+
+It may make sense to build up a random inlining methodology (RyuJit
+evidently has something like this) to try to stress-test the compiler
+in the face of inlining changes. Log replay can be used with automatic
+bisection tools to try to isolate the particular inline that exposes
+a bug.
+
+Performance issues are harder to spot automatically, and will likely
+require some amount of manual analysis. However, we may be able to
+study inline instances that are not well predicted by the inlining
+models -- for instance, cases where the CS is much greater than
+predicted, or CQ is lower -- and see if the modelling shortfall is an
+indicator of problems within the compiler itself.
+
+## Development Stages
+
+Given the above, here's a rough draft outline for how the
+functionality could be built.
+
+### Preliminaries -- Measurement
+
+1. Make some benchmarks available for measurement.
+
+2. Demonstrate we can reliably measure the metrics of interest: CS,
+CQ, and TP (as bytes, instructions, and instructions) when jitting and
+pre-jitting the benchmarks.
+Get some idea of the noise levels for CQ and TP. Hopefully they are
+small.
+
+### Preliminaries -- RyuJit
+
+1. Refactor RyuJit code to clearly separate out the legality, ability,
+and profitability aspects of the inliner. Hopefully this can be done
+with no CQ changes and no TP impact.
+
+2. Develop the inline log format.
+
+3. Produce inline logs from RyuJit.
+
+4. Develop methodology for runtime correlation.
+
+5. Produce failure reason analysis for the benchmark set. Decide
+which abilities, if any, need improving.
+
+6. Implement inline log replay for RyuJit.
+
+7. Modify logs by hand or via a tool and replay them to assess the
+landscape of TP/CQ/CS tradeoffs from inliner changes.
+
+### Preliminaries -- LLILC
+
+1. Add code to capture legality constraints when reading in a method.
+
+2. Generalize LLILC's MSIL reader to read in an inlinee in
+the context of the caller.
+
+3. Write a simple LLVM pass (or specialize an existing pass) to
+perform inlining. Initially just implement a very simple profitability
+heuristic. Verify it all works correctly.
+
+4. Implement inline log generation by LLILC.
+
+5. Implement inline log replay by LLILC.
+
+6. Modify logs by hand or via a tool and replay them to assess the
+landscape of TP/CQ/CS tradeoffs from inliner changes.
+
+### Develop the Heuristics (JIT case)
+
+1. Start by trying to produce code size estimates for a given inline
+candidate. Measure actual code size impacts for individual inlining
+decisions.
+
+2. Determine features to surface and produce a labelled data set of
+(feature*, size impact) pairs. Use this to build a machine-learning
+model estimator for code size.
+
+3. We probably can't just plug the new size estimate into the existing
+heuristic framework and expect good results (generally, better
+information in does not always result in better decisions out). So
+instead build up a parallel (off by default) heuristic path where we
+use the new size estimator.
+
+4. Surface features for the CQ and TP impacts. Build a pluggable model
+for the black-box heuristic in the compiler. Likely this means that
+the model code is written by a tool from some textual description.
+
+5. Run evolutionary experiments to produce an optimized black-box
+heuristic. This involves: (a) randomly or otherwise generating an
+initial population of models; (b) running each model on the
+benchmarks, gathering metrics; (c) combining metrics across benchmarks
+to produce fitness; (d) using fitness to create a new set of models
+using various genetic approaches; (e) iterating until satisfied with
+the result.
+
+All of this setup should be automated and available in the CI system.
+
+6. Use the winning model to produce a labelled data set of (feature*,
+inlining decision) pairs for all the cases in the benchmarks. Build a
+decision tree from this via standard techniques. Decide if any pruning
+is appropriate.
+
+7. Back-validate the results via measurement to show that (a) the
+final heuristic works as expected and metrics match expectations on
+the training examples, and (b) the results are not over-fitted, by
+cross-validating on untrained benchmarks. Flight this to various third
+parties for independent confirmation.
+
+8. Enable the new heuristics as defaults.
+
+9. Repeat the above process for each code base, scenario,
+architecture, etc.
+
+### AOT Considerations
+
+In the AOT case (particularly the optimized AOT) we will likely need
+to develop a different set of heuristics.
+
+The current frameworks for AOT (NGEN and Ready2Run) give the code
+generator the opportunity to preview the methods to be compiled. We'll
+take advantage of that to build a call graph and orchestrate
+compilation of methods in a "bottom-up" fashion (callees before
+callers). Because of that, when inlining, the caller typically has the
+opportunity to look at a richer set of data about the callee -- for
+instance, a detailed usage summary for its parameters and the like.
+
+The inliner also has the opportunity to plan inlining over a wider
+scope, perhaps adjusting heuristics for the relative importance of
+callers or for the shape of the call graph.
+
+## Vetting and Revalidation
+
+With an evaluation framework in place, we can now run carefully
+controlled experiments to see if further enhancements to the inliner
+are warranted. For example, the order in which candidates are
+considered may have an impact -- ideally the inliner would greedily
+look at the best candidates first. Other feature data may prove
+important, for instance actual or synthetic profile data.
+
+The rest of the compiler also changes over time. Periodically we
+should rerun the heuristic development to see if the current set of
+heuristics is still good, especially if major new capabilities are
+added to the compiler.
+
+We should also continually strive to improve the benchmark set.
+
+## Risks
+
+The plan calls for the inliner's Profitability logic to be constructed
+via machine learning. This exposes us to a number of risks:
+
+* We may not understand how to successfully apply ML techniques.
+* We may run into one of the common ML pitfalls, like having too many
+features, or strongly correlated features.
+* We may not be using the right features.
+* Some of the features may themselves be estimates (e.g., code size).
+* The benchmark set may not be sufficiently representative.
+* We may overfit the benchmark set. The plan here is to cross-validate
+and not use all our benchmarks as training examples.
+* The training process may be too slow to be practical.
+* There may be too much measurement noise.
+* The resulting models may not be stable -- meaning that small changes
+in the input programs might trigger large changes in the inlining
+decisions.
+* The supervised learning output may still be black-box-like in
+practice.
+* The downstream phases of the compiler might not adapt well to
+inlining changes or may offer an uneven performance surface.
+* We've found it tricky in the past to produce a good combined fitness
+metric from a large set of benchmarks and metrics.
+* It's generally not possible to build a new heuristic which is better
+in every way than the old one -- some metrics on some benchmarks will
+inevitably regress.
+
+However, it's worth pointing out that a manually constructed heuristic
+faces many of these same risks.
+
+## Notes
+
+See this [early
+draft](https://github.com/AndyAyersMS/coreclr/commit/035054402a345f643d9dee0ec31dbdf5fadbb17c),
+now orphaned by a squash, for some interesting comments.