\chapter{Partial Inlining}
\label{sec:partial_inlining}
In this chapter we explain our partial inlining technique for speeding up
analysis routines that have a common fast path and an uncommon slower path.
Many analysis tools are structured in this exact way. To demonstrate how
partial inlining works, we consider two example tools in this chapter: a memory
alignment checker, and a memory trace tool.
\section{Tools with Fastpaths}
The memory alignment checker is a tool that checks if a memory access is aligned
to the size of the access and diagnoses accesses that are unaligned by reporting
information about the access. For the alignment checker, it is fair to assume
that for most programs most memory accesses are aligned, unless there is an
alignment issue in the program. In the case where the access is aligned, the
analysis routine does no work. This represents the fast path through the
routine. If an access is unaligned, the tool needs to make other function calls
to either record the access data in memory or write the information out to a
file. This represents the slow path. If many accesses are unaligned, the tool
will be slow anyway because it must do work to diagnose every unaligned access.
The source code for the alignment instrumentation routine is shown in Figure
\ref{fig:alignment_src}.
The memory trace tool has a similar property. The goal of the {\tt memtrace}
tool is to make a trace of all the memory accesses in the program and output it
to a file for offline analysis. This includes the application PC, the effective
address, whether the access was a read or a write, and how large it was. Rather
than formatting a string to write to a file after every access, it is faster to
fill a buffer with this information and flush it every time it fills up. If the
buffer is not full, the analysis routine does little work. If the buffer is
full, it needs to perform IO, which will require making system calls. The
source code for the memory trace instrumentation routine is shown in Figure
\ref{fig:memtrace_src}.
\begin{figure}
\begin{verbatim}
void
check_access(ptr_uint_t ea, app_pc pc, uint size, bool write)
{
    /* Check alignment. Assumes size is a power of two. */
    if ((ea & (size - 1)) == 0) {
        return;
    }
    dr_fprintf(STDOUT,
               "Unaligned %s access to ea "PFX" at pc "PFX" of size %d\n",
               (write ? "write" : "read"), ea, pc, size);
}
\end{verbatim}
\caption{Source for the instrumentation routine from the {\tt alignment} tool.}
\label{fig:alignment_src}
\end{figure}
\begin{figure}
\begin{verbatim}
typedef struct _memop_t {
    ptr_uint_t ea;
    ptr_uint_t pc;
    uint size;
    uint write;
} memop_t;
#define BUFFER_SIZE 1024
static uint pos;
static memop_t buffer[BUFFER_SIZE];
void
buffer_memop(ptr_uint_t ea, ptr_uint_t pc, uint size, bool write)
{
    buffer[pos].ea = ea;
    buffer[pos].pc = pc;
    buffer[pos].size = size;
    buffer[pos].write = write;
    pos++;
    if (pos >= BUFFER_SIZE)
        flush_buffer();
}
\end{verbatim}
\caption{Source fragment for the {\tt memtrace} tool.}
\label{fig:memtrace_src}
\end{figure}
We believe that this pattern of having fast, early checks in analysis routines
is common. For example, an undergraduate researcher in our group has been
working on a race detector for native applications using the FastTrack
algorithm\cite{fasttrack}. In the implementation, there is a notion of epochs
defined by synchronization events. Using shadow memory provided by
Umbra\cite{umbra}, the initial version of this tool checked if the memory in
question had been accessed during the last epoch. If so, it had already been
checked for races. It turned out that for most applications memory was reused
in the same epoch at least 50\% of the time, creating the fast path/slow path
structure that we believe is common. In later versions of the race detector,
the researcher generated custom instrumentation to implement the check inline
before jumping to an out-of-line clean call. Essentially, the researcher was
performing manual partial inlining, which is what we are attempting to do
automatically.
Before we dive into partial inlining, we need to cover how ordinary clean call
instrumentation works with complex argument values, as those are used throughout
our examples.
\section{Clean Calls with Arguments}
We have already discussed the major points about how clean calls work in Section
\ref{sec:clean_calls}. However, our previous instruction counting example did
not need to pass arguments to the instrumentation routine. Our alignment
checker and memory trace tool do.
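For concreteness, here is a minimal sketch of how the alignment tool might
insert the {\tt check\_access} call at a memory reference. It assumes the
standard {\tt dr\_insert\_clean\_call} interface and operand-creation macros;
how the memory operand is converted into an effective-address argument is
specific to our system (the {\tt lea} case described below), so that detail is
glossed over here.
\begin{verbatim}
/* A minimal sketch of inserting the check_access clean call at a memory
 * reference.  dr_insert_clean_call() and the OPND_CREATE_* macros are the
 * standard DynamoRIO interfaces; passing the memory operand itself so that
 * the effective address reaches the callee relies on the lea-style argument
 * handling described below, which is specific to our system. */
static void
instrument_mem(void *drcontext, instrlist_t *bb, instr_t *where,
               opnd_t memref, bool write)
{
    uint size = opnd_size_in_bytes(opnd_get_size(memref));
    dr_insert_clean_call(drcontext, bb, where, (void *)check_access,
                         false /* do not save FP state */, 4,
                         memref,                              /* ea */
                         OPND_CREATE_INTPTR(instr_get_app_pc(where)),
                         OPND_CREATE_INT32(size),
                         OPND_CREATE_INT32(write ? 1 : 0));
}
\end{verbatim}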
Arguments to instrumentation calls in DynamoRIO are specified as x86 instruction
operands evaluated in the application context. We support calling functions
with as many arguments as can be passed in registers using the native platform
calling convention. However, we can only materialize arguments after we have
saved the register used to hold the argument, or we will clobber it. This
creates a chicken and egg problem, because the argument may, for example,
reference the stack pointer, which is different once we switch to DynamoRIO's
execution context. Because we cannot materialize arguments before the stack
switch or register spills, there is no safe point in the inlined code to simply
evaluate the argument operand as it is. Instead, we have to interpret it and
materialize it from the saved application context.
For immediate values, there is no complication. We simply use a {\tt mov}
instruction to materialize the value in the argument register.
For register operands, if they are not clobbered by the routine, they will not
have been saved to the stack. Therefore, the application value is still live,
so we perform a simple register-to-register copy. If the register is used in
the inlined code, we make sure it is spilled before argument materialization,
and we load the application value back from our scratch space.
Another issue is that as we materialize arguments into registers, we may clobber
registers that we wish to use in later arguments. For example, on Linux
x86\_64, the first argument is materialized into {\tt \%rdi}, but the second
argument may read {\tt \%rdi}. Our solution is to only materialize arguments
that end up being used by the body inline. By following this rule, we can only
write registers that are saved in the context switch, and we can rematerialize
the application value of {\tt \%rdi} from our scratch space.
Finally, if the register is {\tt \%rsp}, we have a special case. The
application stack pointer is always saved in the TLS scratch space instead of
the stack, so we must restore it from there, as shown later in Figure
\ref{fig:arg_mat_lea}.
For memory operands, we support two operations: we can either load the
application value from the memory addressed by the operand using a {\tt mov}
instruction, or we can load the effective address of the memory operand with a
{\tt lea} instruction, giving us a pointer to the application value.
X86 memory operands complicate matters because they may use two registers for
the base and index. Similar to the register operand case, we need to
materialize the registers used by the operand before we can materialize the
argument. We cannot use the original registers of the operand, because they may
now contain materialized arguments. Therefore, we load the base into the
register we are using to hold the argument. We know the argument register will
be dead after the load or address computation, so this is safe. We then modify
the base register of the memory operand to be the argument register. If the
index has been clobbered, then we need a second dead register. We hard code the
choice of {\tt \%rax} for holding the index register because there are no
calling conventions that DynamoRIO supports that use {\tt \%rax} as a register
parameter, and it is often already saved because it must be used to save flags.
If it is not already saved, it will be marked for saving later when we
re-analyze the code sequence with argument materialization.
For a complete example, if we instrument the load instruction {\tt movq
0x44(\%rsp, \%rdi, 4) -> \%rax} with an unoptimized clean call, we emit the code
in Figure \ref{fig:arg_mat_lea}.
\begin{figure}
\begin{verbatim}
mov %rsp -> %gs:0x00 # Switch to dstack
mov %gs:0x20 -> %rsp
mov 0x000002c0(%rsp) -> %rsp
lea 0xfffffd50(%rsp) -> %rsp
mov %rax -> 0x48(%rsp) # Save GPRs
... save other GPRs
mov %rdi -> 0x10(%rsp) # Save rdi at offset 0x10
... save other GPRs and flags
... save all XMM regs
# Argument materialization:
mov %gs:0x00 -> %rdi # Load app rsp into rdi
mov 0x10(%rsp) -> %rax # Load app rdi into rax
lea 0x44(%rdi,%rax,4) -> %rdi # Load effective address into rdi
mov $0x00000000006187c0 -> %rsi # App PC of instruction
mov $0x00000008 -> %edx # Size of access
mov $0x00000000 -> %ecx # Not a write
call $0x00404670 %rsp -> %rsp 0xfffffff8(%rsp) # Call routine
... clean call restore code
\end{verbatim}
\caption{Materializing {\tt 0x44(\%rsp, \%rdi, 4)} into {\tt rdi}.}
\label{fig:arg_mat_lea}
\end{figure}
As in our instruction count example, using clean calls is clearly not the most
efficient way to implement this. However, our new routine code has control
flow, so before we can do any inlining, we need to find a reliable way to decode
it.
\section{Decoding with Control Flow}
\label{sec:decoding_cti}
When a tool inserts a clean call, all it provides is a list of machine operand
arguments and a function pointer to call. Given a function pointer, it is
challenging to reliably decode the function itself and determine where it ends.
The na\"ive approach described in Section \ref{sec:simple_inlining} of scanning
forward from the function entry to the first {\tt ret} instruction breaks down
quickly if we want to handle control flow. Partial inlining requires handling
control flow, so our decoding algorithm has to follow branches past the first
{\tt ret} instruction. Even our simple {\tt check\_access} instrumentation
routine jumps past the first {\tt ret}, as shown in Figure
\ref{fig:alignment_asm}.
Our decoding algorithm simply remembers the furthest forward branch target and
decodes at least up to that address. Once we have passed that address, we
continue decoding until we reach a {\tt ret}, a backwards branch, or a probable
tail call. Our heuristic for identifying tail calls is to check whether the
target address falls outside the range from the routine entry to the entry plus
4096 bytes. The choice of cutoff is arbitrary and has not been tuned, but it is
unlikely that a single instrumentation routine will be larger than 4096 bytes.
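A sketch of this decoding loop, using the public DynamoRIO decoding interface,
follows; error handling and the label rewriting discussed below are omitted.
\begin{verbatim}
/* A sketch of decoding a routine that contains control flow, assuming the
 * public DynamoRIO decoding interface. */
#define MAX_ROUTINE_SIZE 4096   /* untuned tail-call cutoff from the text */

static instrlist_t *
decode_routine(void *dc, app_pc entry)
{
    instrlist_t *ilist = instrlist_create(dc);
    app_pc pc = entry;
    app_pc max_tgt = entry;     /* furthest forward branch target seen */
    while (true) {
        instr_t *instr = instr_create(dc);
        pc = decode(dc, pc, instr);
        instrlist_append(ilist, instr);
        if (instr_is_cbr(instr) || instr_is_ubr(instr)) {
            opnd_t tgt_opnd = instr_get_target(instr);
            if (opnd_is_pc(tgt_opnd)) {
                app_pc tgt = opnd_get_pc(tgt_opnd);
                bool in_routine = tgt >= entry &&
                                  tgt < entry + MAX_ROUTINE_SIZE;
                if (in_routine && tgt > max_tgt)
                    max_tgt = tgt;  /* must decode at least this far */
                /* Past the furthest target, a backwards branch or a branch
                 * out of the routine (probable tail call) ends decoding. */
                if (pc > max_tgt && (!in_routine || tgt < pc))
                    break;
            }
        }
        /* Past the furthest target, a ret ends decoding. */
        if (instr_is_return(instr) && pc > max_tgt)
            break;
    }
    return ilist;
}
\end{verbatim}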
After decoding, we rewrite all the branches to use labels inserted in the
instruction list instead of raw PC values. This allows us to follow a branch to
a later point in the stream during optimization.
With the instruction list in hand, we proceed to partial inlining.
\begin{figure}
\begin{verbatim}
TAG 0x0000000000404670
+0 mov %edx -> %r8d # Copy size to r8
+3 mov %rdi -> %r10
+6 lea 0xffffffff(%r8) -> %eax # (size - 1)
+10 test %rax %rdi # ea & (size - 1)
+13 jz $0x00000000004046b0 # First branch for if
+15 mov <rel> 0x0000000000618764 -> %edi
+21 test %cl %cl
+23 mov $0x004133e6 -> %edx
+28 mov $0x004133eb -> %eax
+33 cmovnz %rax %rdx -> %rdx
+37 cmp %edi $0xffffffff
+40 jz $0x00000000004046b8 # Branch past ret
+42 mov %r8d -> %r9d # Merge point
+45 mov %r10 -> %rcx
+48 mov %rsi -> %r8
+51 xor %eax %eax -> %eax
+53 mov $0x00413470 -> %esi
+58 jmp $0x00000000004039e8 # Tail call to dr_fprintf.
+63 nop
+64 ret %rsp (%rsp) -> %rsp # First branch target
+66 data16 nop 0x00(%rax,%rax,1)
+72 mov <rel> 0x0000000000618770 -> %rax # Target of second branch
+79 mov 0x70(%rax) -> %edi
+82 jmp $0x000000000040469a # Backwards jump to +42
END 0x0000000000404670
\end{verbatim}
\caption{Assembly for the {\tt check\_access} instrumentation routine from our
alignment checker.}
\label{fig:alignment_asm}
\end{figure}
\section{Inlining the Fast Path}
\label{sec:inlining_fastpath}
Once we have the full routine instruction stream, we scan from the entry point
to the first control flow instruction. In the assembly listing of Figure
\ref{fig:alignment_asm}, this is the {\tt jz} instruction at offset +13. If it
is a conditional branch, as it is in the example, we scan forward from both
branch targets: the fall-through target and the branch-taken target.
If the first control flow instruction we hit on either path is a {\tt ret}
instruction, we say that path is a ``fast path.'' If both paths are fast, we
abort our partial inlining analysis and inline both as normal. If neither path
is fast, we also abort partial inlining, and most likely will fail to inline the
routine at all.
If one path is fast and the other is not, we designate the other path the ``slow
path.'' We then delete all instructions that do not flow directly from the
entry to the conditional branch and then to the fast path. In place of the slow
path, we insert a synthetic call instruction which we will expand later as
described in Section \ref{sec:slowpath_transition}, and a jump to the {\tt ret}
instruction on the fast path. The result of this transformation on the {\tt
check\_access} code is shown in Figure \ref{fig:alignment_fastpath}.
\begin{figure}
\begin{verbatim}
mov %edx -> %r8d # Entry block
mov %rdi -> %r10
lea 0xffffffff(%r8) -> %eax
test %rax %rdi
jz $0x00000000004046b0 # Inlined check
call <label> # Slowpath transition
jmp <label> # Jump to ret past fastpath code
# Fastpath code,
# none in this example
ret %rsp (%rsp) -> %rsp # Return
\end{verbatim}
\caption{The {\tt check\_access} routine after inlining the fast path and
replacing the slow path with a synthetic call.}
\label{fig:alignment_fastpath}
\end{figure}
This code fragment is much more amenable to inlining, but it is not complete.
We still need to generate code to handle the slow path case out of line.
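Before turning to the slow path, the classification and splitting step can be
summarized as follows; the helpers over the decoded instruction list ({\tt
first\_cti}, {\tt branch\_target}, {\tt fall\_through\_instr}, {\tt
replace\_slow\_path}) are hypothetical stand-ins for our analysis, not
DynamoRIO API routines.
\begin{verbatim}
/* A sketch of the fast path / slow path classification described above. */
static bool
path_is_fast(instr_t *start)
{
    /* A path is fast iff the first control flow instruction it reaches
     * is a ret. */
    instr_t *cti = first_cti(start);            /* hypothetical helper */
    return cti != NULL && instr_is_return(cti);
}

static bool
try_partial_inline(instrlist_t *ilist)
{
    instr_t *branch = first_cti(instrlist_first(ilist));
    if (branch == NULL || !instr_is_cbr(branch))
        return false;               /* no conditional check to split on */
    bool taken_fast = path_is_fast(branch_target(ilist, branch));
    bool fall_fast  = path_is_fast(fall_through_instr(branch));
    if (taken_fast == fall_fast)
        return false;               /* both fast or both slow: no split */
    /* Keep only the entry block, the branch, and the fast path; the slow
     * path becomes a synthetic call plus a jump to the fast path's ret. */
    replace_slow_path(ilist, branch, taken_fast);
    return true;
}
\end{verbatim}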
\section{Slowpath Transition}
\label{sec:slowpath_transition}
To enter the slow path, we abandon the inline execution and restart the routine
from its entry point with fresh arguments. The alternative we considered was to
attempt a side entry into the routine at the original slow path target PC, but
this approach would require carefully recreating the stack frame state to match
what it would have been if the routine had been executed out of line. Rather
than create such an analysis, we chose to simply execute the entry block again.
Unfortunately, if there are side effects in the entry block, executing them
twice may break the tool. For this section, we restrict ourselves to working on
routines without side effects in the entry block and address this problem later
in Section \ref{sec:deferring}.
Before we can call the tool routine entry point, we need to set up the stack as
though we had just finished the clean call entry sequence. By the time we reach
the slow path, the entry block leading up to the check will have already
clobbered many application registers, so we cannot blindly save all registers.
Instead, we save all application registers that were not saved before the
inlined code. This means emitting extra saves and restores for general purpose
registers, the flags register, and all of the XMM registers.
We do not emit all of the save and restore code inline, because it would pollute
the instruction cache with many uncommonly executed instructions. On Linux
x86\_64, a single clean call expands to at least 75 instructions and 542 bytes.
Therefore, we emit the slow path transition code in a code cache maintained on
the side. When we reach the slow path, we jump into this code fragment, and it
handles the rest of the transition before returning. Because we have already
switched stacks, this is easily implemented with the {\tt call} and {\tt ret}
instructions.
Even though we materialize the arguments to the function before the inlined
code, it is likely that they will be clobbered in the inlined code. Instead of
trying to preserve the original values, we rematerialize them in the slow path.
While it is possible to share the register spilling code, the arguments to the
routine are usually different at every call site. Therefore we rematerialize
the arguments before we jump into the shared slow path transition code. It is
possible to detect if the original materialized arguments are still live and do
not need to be rematerialized, but we have not yet implemented this optimization
because the slow path is unlikely to be executed in the first place.
One major consideration in our design is that DynamoRIO has a function called
{\tt dr\_get\_mcontext} that allows the tool to inspect all application
registers saved at the base of the stack. Even if this function is called from
the slow path of a partially inlined routine, we want it to behave correctly.
This means the stack has to be set up in exactly the same way it would have
been if we had executed a clean call, which means spilling all application
registers
into the right stack slots.
\section{Deferring Side Effects}
\label{sec:deferring}
As discussed previously, there is a problem with re-executing the entry block of
the instrumentation routine. If there are any side effects in the entry block,
they will occur twice. For example, if we extended our alignment checker to
count every memory access by incrementing a counter in the entry block, we
would end up counting an unaligned access twice. We solve this problem by
deferring such side effects until after the fast path check.
To identify instructions with side effects, we test each instruction in the
entry block to see if it writes to non-stack memory. A non-stack write is any
instruction writing to absolute addresses, to PC relative addresses, or to
addresses relative to any register other than the stack pointer.
Next, we attempt to defer the side effects until after the conditional branch.
The conditional branch may depend on memory reads which depend on memory writes
that we want to defer, so we may not be able to defer all side effects, in which
case we fail to perform partial inlining. Our algorithm starts at the
conditional branch and goes backwards towards the entry point while deferring
all instructions that can be moved safely past the branch. An instruction can
be moved past the branch if:
\begin{packed_enumerate}
\item Its result is unused by the conditional branch, meaning our analysis
starting from the branch considers it dead.
\item It does not use any registers clobbered by the branch or its dependent
instructions, which are the instructions that could not be deferred.
\end{packed_enumerate}
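These two conditions drive a short backwards pass over the entry block,
sketched here with illustrative helper names ({\tt result\_dead\_at\_branch},
{\tt reads\_clobbered\_reg}, {\tt writes\_non\_stack\_memory}); only the {\tt
instrlist\_*} and {\tt instr\_*} calls are DynamoRIO API.
\begin{verbatim}
/* A sketch of the deferral pass, walking backwards from the conditional
 * branch. */
static bool
defer_side_effects(instrlist_t *ilist, instr_t *branch)
{
    instr_t *instr = instr_get_prev(branch);
    while (instr != NULL) {
        instr_t *prev = instr_get_prev(instr);
        if (result_dead_at_branch(branch, instr) &&     /* condition 1 */
            !reads_clobbered_reg(branch, instr)) {      /* condition 2 */
            /* Move the instruction to just after the branch.  Walking
             * backwards and inserting after the branch preserves the
             * original order of the deferred group. */
            instrlist_remove(ilist, instr);
            instrlist_postinsert(ilist, branch, instr);
        }
        instr = prev;
    }
    /* Partial inlining proceeds only if no side effect remains before
     * the branch. */
    for (instr = instrlist_first(ilist); instr != branch;
         instr = instr_get_next(instr)) {
        if (writes_non_stack_memory(instr))
            return false;
    }
    return true;
}
\end{verbatim}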
Figure \ref{fig:memtrace_defer} shows the instruction stream of our memory trace
tool routine before and after deferring side effects.
\begin{figure}
\begin{verbatim}
BEFORE DEFER:
mov <rel> 0x000000007221b7a0 -> %r8d # Cannot defer
lea <rel> 0x000000007221b7c0 -> %r9
mov %r8d -> %eax # Cannot defer
add $0x00000001 %r8d -> %r8d # Cannot defer
lea (%rax,%rax,2) -> %rax
cmp %r8d $0x000003ff # Cannot defer
lea 0x00(,%rax,8) -> %r10
mov %rdi -> (%r9,%rax,8) # Write
mov %rsi -> 0x08(%r10,%r9,1) # Write
lea <rel> 0x000000007221b7d0 -> %rsi
mov %edx -> (%rsi,%rax,8) # Write
mov %ecx -> 0x04(%r10,%rsi,1) # Write
mov %r8d -> <rel> 0x000000007221b7a0 # Write
jbe $0x0000000041341560 # Cond branch
ret %rsp (%rsp) -> %rsp
AFTER DEFER:
mov <rel> 0x000000007221b7a0 -> %r8d
mov %r8d -> %eax
add $0x00000001 %r8d -> %r8d
cmp %r8d $0x000003ff
jbe $0x0000000041341560 # Cond branch
lea <rel> 0x000000007221b7c0 -> %r9
lea (%rax,%rax,2) -> %rax
lea 0x00(,%rax,8) -> %r10
mov %rdi -> (%r9,%rax,8) # Write
mov %rsi -> 0x08(%r10,%r9,1) # Write
lea <rel> 0x000000007221b7d0 -> %rsi
mov %edx -> (%rsi,%rax,8) # Write
mov %ecx -> 0x04(%r10,%rsi,1) # Write
mov %r8d -> <rel> 0x000000007221b7a0 # Write
ret %rsp (%rsp) -> %rsp
\end{verbatim}
\caption{Memory trace buffer filling routine before and after deferring side
effects.}
\label{fig:memtrace_defer}
\end{figure}
If we are unable to defer side effects until after the branch, we abandon
partial inlining and leave the instruction stream unmodified.
\section{Optimization Opportunities}
Putting together all of the techniques discussed so far, we are able to
successfully perform partial inlining, but there is still work to be done.
Figure \ref{fig:alignment_partial_noopt} shows what the assembly would look like
if we stopped here with our partial inlining techniques. As shown in the
listing, there are many opportunities for improvement. For example, we
materialize the size argument into {\tt \%edx}, which is then copied to {\tt
\%r8d}. {\tt \%r8d} is decremented and used in the {\tt test} instruction.
Ideally, we can propagate the copy, propagate the constant through the
decrement, and propagate the resulting value into the {\tt test} instruction.
We also have dead code clobbering registers, such as the leftover copy from
{\tt \%rdi} to {\tt \%r10} in the fast path. Furthermore, cleaning up this code
means it will use fewer registers, saving a load and a store for every register
avoided. In the next sections, we demonstrate how our optimizations iteratively
improve the code.
\begin{figure}
\begin{verbatim}
mov %rsp -> %gs:0x00
mov %gs:0x20 -> %rsp
mov 0x000002c0(%rsp) -> %rsp
lea 0xfffffd50(%rsp) -> %rsp
mov %rax -> 0x48(%rsp)
mov %rdx -> 0x38(%rsp)
mov %rdi -> 0x10(%rsp)
mov %r8 -> 0x50(%rsp)
mov %r10 -> 0x60(%rsp)
lahf -> %ah
seto -> %al
mov %rax -> 0x00000090(%rsp)
mov %gs:0x00 -> %rdi # Argument materialization
mov 0x10(%rsp) -> %rax
lea 0x44(%rdi,%rax,4) -> %rdi
mov $0x00000008 -> %edx # Constant
mov %edx -> %r8d # Extra copy
lea 0xffffffff(%r8) -> %eax # Decrement
test %rax %rdi
jz fastpath
# Slowpath start
mov %rcx -> 0x40(%rsp) # Extra register saves
mov %rsi -> 0x18(%rsp)
mov %gs:0x00 -> %rdi # Argument rematerialization
mov 0x10(%rsp) -> %rax
lea 0x44(%rdi,%rax,4) -> %rdi
mov $0x00000000006187c0 -> %rsi
mov $0x00000008 -> %edx
mov $0x00000000 -> %ecx
call $0x0000000075d18040 %rsp -> %rsp 0xfffffff8(%rsp)
mov 0x40(%rsp) -> %rcx # Extra register restores
mov 0x18(%rsp) -> %rsi
jmp mergepoint
fastpath:
mov %rdi -> %r10 # Dead
mergepoint:
mov 0x00000090(%rsp) -> %rax
add $0x7f %al -> %al
sahf %ah
mov 0x48(%rsp) -> %rax
mov 0x38(%rsp) -> %rdx
mov 0x10(%rsp) -> %rdi
mov 0x50(%rsp) -> %r8
mov 0x60(%rsp) -> %r10
mov %gs:0x00 -> %rsp
mov 0x44(%rsp,%rdi,4) -> %rax # Application instruction
\end{verbatim}
\caption{Partially inlined {\tt check\_access} routine without optimizations.}
\label{fig:alignment_partial_noopt}
\end{figure}
\section{Dead Code Elimination}
The basis for most of our transformations is register liveness analysis. This
is a standard dataflow analysis: starting at the end of the instruction stream
and stepping backwards, it maintains a set of live and dead registers based on
which registers each instruction reads and writes. We also model the liveness
of the individual bits of the x86 flags register, so we know whether a {\tt
cmp} or {\tt add} instruction is needed by a later branch or {\tt adc}
instruction.
With this liveness information, we can delete dead instructions. Dead
instructions arise often when performing partial inlining, because code that
previously used a register may now have been deleted, leaving the instruction
that defined it dead. We apply the usual precaution of not deleting
instructions that have effects beyond producing a register result, such as
control flow instructions, memory writes, and labels.
An example of an instruction deleted by our dead code elimination pass is in
Figure \ref{fig:alignment_partial_noopt} after the fastpath label with the ``Dead''
comment next to it.
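The pass itself is a standard backwards walk; in the sketch below, the {\tt
reg\_set\_t} type and its helpers are illustrative stand-ins for our liveness
bookkeeping, which also tracks the individual flag bits noted above.
\begin{verbatim}
/* A sketch of dead code elimination driven by backwards liveness. */
static void
eliminate_dead_code(void *dc, instrlist_t *ilist)
{
    reg_set_t live = all_registers();   /* conservatively live at the end */
    instr_t *instr = instrlist_last(ilist);
    while (instr != NULL) {
        instr_t *prev = instr_get_prev(instr);
        reg_set_t defs = regs_written_by(instr);
        reg_set_t uses = regs_read_by(instr);
        /* Keep instructions whose effects go beyond their register
         * results: control flow, memory writes, and labels. */
        bool needed = has_non_register_effects(instr) ||
                      reg_sets_intersect(defs, live);
        if (!needed) {
            instrlist_remove(ilist, instr);
            instr_destroy(dc, instr);
        } else {
            /* Standard transfer function: kill definitions, add uses. */
            live = reg_set_union(reg_set_minus(live, defs), uses);
        }
        instr = prev;
    }
}
\end{verbatim}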
\section{Copy Propagation}
Again, from liveness information, we can detect when it is profitable to
eliminate unnecessary copies. If the source register is dead and the
destination of a copy is live, we can simply replace all uses of the destination
register over its live range with the source register, so long as the source is
not clobbered before all uses of the destination register.
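The profitability and safety test can be expressed as follows, again with
illustrative liveness and rewriting helpers rather than DynamoRIO API.
\begin{verbatim}
/* A sketch of removing a copy "mov src -> dst" by rewriting later uses of
 * dst to use src. */
static bool
try_propagate_copy(instrlist_t *ilist, instr_t *copy)
{
    reg_id_t src = opnd_get_reg(instr_get_src(copy, 0));
    reg_id_t dst = opnd_get_reg(instr_get_dst(copy, 0));
    /* Profitable only if src is dead after the copy while dst is live. */
    if (!reg_dead_after(copy, src) || !reg_live_after(copy, dst))
        return false;
    /* Safe only if src is not clobbered before the last use of dst. */
    if (reg_written_before_last_use(copy, src, dst))
        return false;
    rewrite_uses_over_live_range(ilist, copy, dst, src);
    instrlist_remove(ilist, copy);      /* the copy itself is now dead */
    return true;
}
\end{verbatim}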
These copies are often introduced by the compiler when the slow path needs
access to a value before it is modified. Since we delete the slow path, the
original value is unused, and the copy is an extra step we do not need.
There is an example of such a copy in Figure \ref{fig:alignment_partial_noopt}
in the inline code sequence next to the ``Extra copy'' comment. The result of
propagating the copy is shown in Figure \ref{fig:copy_propagation}.
\begin{figure}
\begin{verbatim}
mov %gs:0x00 -> %rdi # Argument materialization
mov 0x10(%rsp) -> %rax
lea 0x44(%rdi,%rax,4) -> %rdi
mov $0x00000008 -> %edx # Constant
lea 0xffffffff(%rdx) -> %eax # Decrement
test %rax %rdi # Test
jz fastpath
\end{verbatim}
\caption{Example with copy removed.}
\label{fig:copy_propagation}
\end{figure}
\section{Constant Folding}
Perhaps the most classic optimization that compilers perform is constant
folding. We fold constants at the machine code level by searching for
instructions that materialize immediate values into registers. When we find
such a materialization, we check if we can rewrite the uses of the register to
use an immediate instead. We also have logic for folding an immediate into an
x86 memory operand displacement.
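In outline, the pass looks like this; {\tt fold\_immed\_into\_instr} and {\tt
redefines\_reg} are illustrative helpers for the substitution and for stopping
at control flow or a new definition of the register.
\begin{verbatim}
/* A sketch of folding an immediate materialization "mov $imm -> reg" into
 * its later uses. */
static void
try_fold_immediate(instrlist_t *ilist, instr_t *mat)
{
    ptr_int_t imm = opnd_get_immed_int(instr_get_src(mat, 0));
    reg_id_t reg = opnd_get_reg(instr_get_dst(mat, 0));
    bool all_folded = true;
    for (instr_t *use = instr_get_next(mat); use != NULL;
         use = instr_get_next(use)) {
        if (redefines_reg(use, reg))
            break;                  /* later uses see a different value */
        if (!instr_uses_reg(use, reg))
            continue;
        /* Either substitute the immediate for the register operand or add
         * it to the displacement of a memory operand that uses reg. */
        if (!fold_immed_into_instr(use, reg, imm))
            all_folded = false;
    }
    if (all_folded)
        instrlist_remove(ilist, mat);   /* materialization is now dead */
}
\end{verbatim}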
In Figure \ref{fig:copy_propagation}, we can combine the {\tt mov \$0x08, \%edx}
instruction with the {\tt lea 0xffffffff(\%rdx), \%eax} instruction into {\tt lea
0x7, \%eax}, which is effectively another immediate value materialization. Our
{\tt lea} folding optimization described in Section \ref{sec:fold_lea} is able
to fold the displacement-only {\tt lea} into the {\tt test} instruction. The
result is shown in Figure \ref{fig:fold_immediate}.
\begin{figure}
\begin{verbatim}
mov %gs:0x00 -> %rdi # Argument materialization
mov 0x10(%rsp) -> %rax
lea 0x44(%rdi,%rax,4) -> %rdi
test %rdi $0x00000007 # Test with an immediate
jz fastpath
\end{verbatim}
\caption{Check from {\tt check\_access} after constants and {\tt lea}
instructions have been folded.}
\label{fig:fold_immediate}
\end{figure}
Although we have only deleted a few register-to-register arithmetic operations
in the inline code, we have completely eliminated the use of the {\tt \%edx}
register and avoided an extra register spill.
\section{Chapter Summary}
As shown in this chapter, partial inlining can be quite profitable for
conditional analysis routines such as those present in our alignment checker and
memory trace examples. Partial inlining is fairly tricky, and requires special
attention to decoding in the face of control flow, identifying the fast path,
transitioning to the slow path, and avoiding repeated side effects.
However, these techniques alone are not enough to make partial inlining truly
profitable. As shown in Figure \ref{fig:alignment_partial_noopt}, the inserted
code can be quite large. In order to further optimize the inlined code, we
needed to develop a suite of more traditional compiler optimizations that take
advantage of the opportunities inlining creates. We found that dead code
elimination, copy propagation, constant folding, and {\tt lea} folding were
particularly effective.
The final result of these combined optimizations on the alignment checker is
shown in Figure \ref{fig:alignment_all_opt}. The techniques presented in this
chapter are promising, but unlike the previous chapter, we have not been able to
approach native performance in our example. Even after applying all of our
optimizations, we still have 20 instructions executed on the fast path for every
memory access in the program. Clever tools that instrument all memory accesses,
such as DrMemory, employ many manual techniques that we should automate as
future work. In Chapter \ref{sec:performance}, we look at the actual
performance achieved
with these optimizations on the SPEC 2006 CPU benchmarks.
\begin{figure}
\begin{verbatim}
mov %rsp -> %gs:0x00 # Stack switch
mov %gs:0x20 -> %rsp
mov 0x000002c0(%rsp) -> %rsp
lea 0xfffffd50(%rsp) -> %rsp
mov %rax -> 0x48(%rsp) # Only two register saves needed
mov %rdi -> 0x10(%rsp)
lahf -> %ah # Flags save
seto -> %al
mov %rax -> 0x00000090(%rsp)
mov %gs:0x00 -> %rdi # Arguments and entry check
mov 0x10(%rsp) -> %rax
lea 0x44(%rdi,%rax,4) -> %rdi
test %rdi $0x0000000000000007
jz done
# Slowpath start, this code is not executed commonly
mov %rcx -> 0x40(%rsp) # Extra register saves
mov %rdx -> 0x38(%rsp)
mov %rsi -> 0x18(%rsp)
mov %r8 -> 0x50(%rsp)
mov %gs:0x00 -> %rdi
mov 0x10(%rsp) -> %rax
lea 0x44(%rdi,%rax,4) -> %rdi
mov $0x00000000006187c0 -> %rsi
mov $0x00000008 -> %edx
mov $0x00000000 -> %ecx
call $0x0000000077b0c040 %rsp -> %rsp 0xfffffff8(%rsp)
mov 0x40(%rsp) -> %rcx
mov 0x38(%rsp) -> %rdx
mov 0x18(%rsp) -> %rsi
mov 0x50(%rsp) -> %r8
jmp done
# Slowpath end.
done:
mov 0x00000090(%rsp) -> %rax # Flags restore
add $0x7f %al -> %al
sahf %ah
mov 0x48(%rsp) -> %rax # Register restore
mov 0x10(%rsp) -> %rdi
mov %gs:0x00 -> %rsp # Switch to app stack
mov 0x44(%rsp,%rdi,4) -> %rax # Application instruction
\end{verbatim}
\caption{Alignment tool routine after partial inlining and optimization.}
\label{fig:alignment_all_opt}
\end{figure}