\chapter{System Overview}
\label{sec:system}
Having walked through the examples in the previous chapters, we now describe
the system in its final form and show how its components fit together.
\section{Call Site Insertion}
To use our system, the tool author requests call insertion through our
interface just as they would insert a normal clean call. At this point, we
have the following information: a function pointer to call, the number of
arguments, and the arguments for this particular call site. We expect that a
given tool will have a small number of routines that are called many times.
Therefore, we take the function pointer and the number of arguments, which are
the only things we can reasonably assume will be constant across call sites,
and analyze the routine. Analysis includes decoding the routine, analyzing its
stack usage, and optimizing it, and is covered in Section \ref{sec:callee}.
Once we have the analysis results, we save them, along with the rest of the
information needed to insert this call, into a pseudo-instruction representing
the entire call. The pseudo-instruction approach makes call coalescing easier,
as described later in this chapter. We then return to the tool, which performs
further analysis and instrumentation.
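
The sketch below illustrates this interface from the tool author's point of
view: a basic block event handler requests a counting call before every
instruction, supplying a function pointer, the number of arguments, and the
arguments themselves. The {\tt instr\_count} routine and the single immediate
argument are hypothetical, and the call shown is DynamoRIO's standard clean
call insertion; our own insertion routine takes the same information.

\begin{verbatim}
/* Sketch: requesting an instrumentation call before each instruction.
 * The callee, its argument, and the event handler are illustrative;
 * dr_insert_clean_call() is DynamoRIO's standard clean call interface. */
#include "dr_api.h"

static void instr_count(int increment);  /* tool routine we hope to inline */

static dr_emit_flags_t
event_bb(void *drcontext, void *tag, instrlist_t *bb,
         bool for_trace, bool translating)
{
    instr_t *where;
    for (where = instrlist_first(bb); where != NULL;
         where = instr_get_next(where)) {
        dr_insert_clean_call(drcontext, bb, where, (void *)instr_count,
                             false /*no fpstate save*/, 1 /*num_args*/,
                             OPND_CREATE_INT32(1));
    }
    return DR_EMIT_DEFAULT;
}
\end{verbatim}
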
\section{Callee Analysis}
\label{sec:callee}
For every routine that we wish to call, we need to perform analysis to decide
whether it can be inlined or partially inlined. First, we need to decode the
routine. In the absence of debug information or symbols, we use our own
heuristic to find the extent of the function. We decode one instruction at a
time and remember the target of the furthest forward branch, conditional or
unconditional, that lies within the next 4096 bytes; a branch whose target
falls outside that range is treated as a tail call. After passing the furthest
forward branch target, we continue decoding until the next return, backward
branch, or tail call.
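
A minimal sketch of this heuristic, written against DynamoRIO's decoding API,
appears below. The helper name, the treatment of indirect branches, and the
exact point from which the 4096-byte window is measured are simplifying
assumptions rather than the system's actual implementation.

\begin{verbatim}
/* Sketch of the function-extent heuristic using DynamoRIO's decoder. */
#include "dr_api.h"

#define FWD_BRANCH_WINDOW 4096   /* bytes */

static app_pc
find_routine_end(void *drcontext, app_pc entry)
{
    instr_t instr;
    app_pc pc = entry;
    app_pc furthest = entry;     /* furthest forward branch target seen */
    instr_init(drcontext, &instr);
    for (;;) {
        app_pc next_pc;
        instr_reset(drcontext, &instr);
        next_pc = decode(drcontext, pc, &instr);
        if (next_pc == NULL)
            break;                      /* undecodable bytes: give up */
        if (instr_is_cti(&instr) && !instr_is_call(&instr) &&
            !instr_is_return(&instr) &&
            opnd_is_pc(instr_get_target(&instr))) {
            app_pc tgt = opnd_get_pc(instr_get_target(&instr));
            if (tgt > pc && tgt < pc + FWD_BRANCH_WINDOW) {
                if (tgt > furthest)
                    furthest = tgt;     /* target stays inside the routine */
            } else if (pc >= furthest) {
                pc = next_pc;           /* backward branch or tail call */
                break;
            }
        }
        if (instr_is_return(&instr) && pc >= furthest) {
            pc = next_pc;               /* return ends the routine */
            break;
        }
        pc = next_pc;
    }
    instr_free(drcontext, &instr);
    return pc;                          /* first byte past the routine */
}
\end{verbatim}
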
Once we have decoded the routine, we analyze its stack usage. In particular, we
want to find and remove frame setup code, which we do not want to inline. In
general, we try to match frame setup and tear-down instructions with each other
in order to remove them. For functions with multiple exit points, we
need to find a matching tear-down instruction on each exit path for each setup
instruction. Furthermore, we need to consider high-level instructions, like
{\tt leave} and {\tt enter}, that implement multiple steps of frame setup or
tear-down.
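
As a rough sketch of the matching step, the check below verifies that a
{\tt push \%rbp} prologue is paired with a {\tt pop \%rbp} or {\tt leave}
immediately before every return, so that the frame setup could be deleted
before inlining. Handling of {\tt enter}, of {\tt mov \%rsp, \%rbp}, and of
frame pointer uses in the body is omitted, and the function name and operand
positions are illustrative assumptions.

\begin{verbatim}
/* Sketch: can the frame setup be matched and removed on every path? */
#include "dr_api.h"

static bool
frame_setup_removable(instrlist_t *ilist)
{
    instr_t *first = instrlist_first(ilist);
    instr_t *in;
    if (first == NULL || instr_get_opcode(first) != OP_push ||
        !opnd_is_reg(instr_get_src(first, 0)) ||
        opnd_get_reg(instr_get_src(first, 0)) != DR_REG_XBP)
        return false;                   /* no recognizable prologue */
    for (in = first; in != NULL; in = instr_get_next(in)) {
        if (!instr_is_return(in))
            continue;
        instr_t *prev = instr_get_prev(in);
        if (prev == NULL)
            return false;
        int opc = instr_get_opcode(prev);
        bool pops_xbp = (opc == OP_pop &&
                         opnd_is_reg(instr_get_dst(prev, 0)) &&
                         opnd_get_reg(instr_get_dst(prev, 0)) == DR_REG_XBP);
        if (!pops_xbp && opc != OP_leave)
            return false;               /* unmatched exit path */
    }
    return true;                        /* every exit has a tear-down */
}
\end{verbatim}
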
\section{Partial Inlining}
Next, we consider the routine for partial inlining. As described in Section
\ref{sec:inlining_fastpath}, we check whether the first control transfer
instruction after the entry point is a conditional branch. If so, we scan
forward from both the fallthrough instruction and the branch target, looking
for a {\tt ret} instruction. If one path has a {\tt ret} and the other does
not, the path with the {\tt ret} becomes the fast path and we apply our partial
inlining transformation. First, we delete all instructions except those in the
entry block and the fast path block. Next, we insert a synthetic call in the
slow path block, which we expand later. We cannot expand it at this point,
because we are performing call-site-independent analysis and do not have access
to the function arguments, which would need to be rematerialized.
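
The candidate test can be sketched as follows. The {\tt find\_target\_in\_list}
helper, which maps the conditional branch's target address back to the
corresponding instruction in the decoded list, is hypothetical; the rest uses
DynamoRIO's instruction list API.

\begin{verbatim}
/* Sketch: is this routine a partial inlining candidate? */
#include "dr_api.h"

/* Hypothetical helper: resolve a branch target to an instruction. */
static instr_t *find_target_in_list(instrlist_t *ilist, instr_t *cbr);

static bool
path_hits_ret(instr_t *start)
{
    instr_t *in;
    for (in = start; in != NULL; in = instr_get_next(in)) {
        if (instr_is_return(in))
            return true;
        if (instr_is_cti(in))
            return false;               /* left the straight-line path */
    }
    return false;
}

/* Returns the branch that splits fast and slow paths, or NULL. */
static instr_t *
find_fastpath_branch(instrlist_t *ilist)
{
    instr_t *in;
    for (in = instrlist_first(ilist); in != NULL; in = instr_get_next(in)) {
        if (!instr_is_cti(in))
            continue;                   /* still in the entry block */
        if (!instr_is_cbr(in))
            return NULL;                /* first CTI is not conditional */
        bool fall_ret  = path_hits_ret(instr_get_next(in));
        bool taken_ret = path_hits_ret(find_target_in_list(ilist, in));
        /* Exactly one side must reach a ret to form the fast path. */
        return (fall_ret != taken_ret) ? in : NULL;
    }
    return NULL;
}
\end{verbatim}
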
Because the slow path will eventually re-enter the routine from the beginning,
we need to defer all side effects in the entry block into the fast path block,
as described in Section \ref{sec:deferring}. An instruction has side effects if
it performs any non-stack memory write. We are careful not to move any
instruction in such a way that its input registers are clobbered, and we defer
additional instructions without side effects when needed to preserve this
property.
\section{Optimization}
Next, we run our suite of machine code optimizations to clean up the code.
Optimization at this stage is particularly important if we have applied partial
inlining, because deleting the slow path may have removed the only uses of
values computed in the entry block. Optimization also reduces the number of
registers used and deletes instructions, making it more likely that we will
meet our criteria for inlining. We apply the following optimizations, in order:
\begin{packed_enumerate}
\item Dead code elimination
\item Copy propagation
\item Dead register reuse
\item Constant folding
\item Flags avoidance
\item {\tt lea} folding
\item Redundant load elimination
\item Dead store elimination
\item Dead code elimination
\item Remove jumps to next instruction
\end{packed_enumerate}
This sequence was chosen empirically and works well on the example tools we
optimized. All of these optimizations have been discussed in previous chapters
except the removal of jumps to the next instruction. That situation arises in
partial inlining when the fast path has no instructions, as in the alignment
example: the slow path ends with a jump to the restore code after the fast
path, and if the fast path is empty, that jump is not needed.
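
A minimal sketch of that pass appears below; it assumes branch targets within
the inlined code are represented as instruction-list operands rather than raw
addresses.

\begin{verbatim}
/* Sketch: delete unconditional jumps whose target is the instruction
 * immediately following them. */
#include "dr_api.h"

static void
remove_jmps_to_next(void *drcontext, instrlist_t *ilist)
{
    instr_t *in, *next;
    for (in = instrlist_first(ilist); in != NULL; in = next) {
        next = instr_get_next(in);
        if (instr_is_ubr(in) && opnd_is_instr(instr_get_target(in)) &&
            opnd_get_instr(instr_get_target(in)) == next) {
            instrlist_remove(ilist, in);
            instr_destroy(drcontext, in);
        }
    }
}
\end{verbatim}
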
\section{Inlining Criteria}
At this point, we have simplified the routine's instruction list as much as
possible, and it is time to decide whether we can inline it at all. We use the
following criteria:
\begin{packed_itemize}
\item The callee is a leaf function. A non-leaf function requires saving all
registers.
\item The simplified callee instruction stream is no more than 20 instructions.
This avoids code bloat from overly aggressive inlining. The limit was chosen to
roughly match the point at which the overhead of a clean call no longer
dominates the cost of the call, so falling back to a full clean call carries
little penalty.
\item The callee does not use XMM registers. This avoids XMM saves.
\item The callee does not use more than a fixed number of general purpose
registers. We have not picked an appropriate limit yet.
\item The callee must have a simple stack frame that uses at most one stack
location.
\item The callee may only have as many arguments as can be passed in registers
on the current platform, or only one if the native calling convention does not
support register parameters.
\end{packed_itemize}
If any of these criteria is not satisfied, we throw away our simplified
instruction list and mark the routine as not suitable for inlining. The results
of all of this analysis are stored in a cache, so on future calls to the same
routine we know immediately whether it can be inlined.
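
Taken together, the decision reduces to a predicate over the analysis summary.
The sketch below uses a hypothetical summary structure and placeholder limits
for the values the criteria above leave open, the general purpose register
limit in particular.

\begin{verbatim}
/* Sketch of the inlining decision over a hypothetical analysis summary. */
#include <stdbool.h>

typedef struct {
    bool is_leaf;              /* calls no other routines */
    int  num_instrs;           /* after simplification */
    bool uses_xmm;
    int  gprs_used;
    int  stack_slots_used;
    int  num_args;
} callee_summary_t;

#define MAX_INLINE_INSTRS 20
#define MAX_GPRS_USED     8    /* placeholder: final limit not yet chosen */
#define MAX_REG_ARGS      6    /* e.g. Linux x86-64; 1 if no register args */

static bool
can_inline(const callee_summary_t *cs)
{
    return cs->is_leaf &&
           cs->num_instrs <= MAX_INLINE_INSTRS &&
           !cs->uses_xmm &&
           cs->gprs_used <= MAX_GPRS_USED &&
           cs->stack_slots_used <= 1 &&
           cs->num_args <= MAX_REG_ARGS;
}
\end{verbatim}
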
\section{Basic Block Optimization}
\label{sec:basic_block_opt}
After the tool has inserted all of its instrumentation calls, it must call our
system one last time before returning the instruction list to DynamoRIO. At
this point, we apply optimizations to the entire basic block; this is how we
were able to vastly improve performance on the instruction count example.

At this stage, each call in the instruction stream is a single
pseudo-instruction, so calls are easy to move around. Our current scheduling
heuristic is simple: it groups calls together so long as they have only
immediate integer arguments. If a call reads application register or memory
values, we do not want to move it outside the live range of the registers it
reads.

Once calls have been scheduled together, we expand them by one layer. First, we
insert the application state saving operations, represented as
pseudo-instructions. Next, we insert the simplified routine code, as described
further in Section \ref{sec:site_expansion}. Last, we insert the restore code,
again represented as pseudo-instructions.

Because the save and restore operations are high-level pseudo-instructions,
they are easy to identify and match with instructions that perform the
reciprocal operation. For example, if we switch to the application stack and then back to
DynamoRIO's stack, those two operations cancel each other out and we can delete
them both. If we restore and then save flags, those are reciprocal and can be
deleted. Similarly, we can avoid restoring and then saving a register. When
this is done, if the calls were scheduled together, there should be one save
sequence followed by multiple inlined calls followed by one restore sequence.
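
The cancellation step can be pictured as in the sketch below. The
pseudo-operation encoding and the pairing rules are hypothetical
simplifications, and a real pass also needs to iterate, since deleting one pair
can expose another.

\begin{verbatim}
/* Sketch: cancel a restore immediately followed by its reciprocal save. */
typedef enum {
    PSEUDO_SAVE_FLAGS, PSEUDO_RESTORE_FLAGS,
    PSEUDO_SAVE_REG,   PSEUDO_RESTORE_REG,
    PSEUDO_SWITCH_TO_DR_STACK, PSEUDO_SWITCH_TO_APP_STACK
} pseudo_kind_t;

typedef struct pseudo_op {
    pseudo_kind_t kind;
    int reg;                        /* valid for the *_REG kinds */
    struct pseudo_op *next;
} pseudo_op_t;

static int
is_reciprocal(const pseudo_op_t *a, const pseudo_op_t *b)
{
    if (a->kind == PSEUDO_RESTORE_FLAGS && b->kind == PSEUDO_SAVE_FLAGS)
        return 1;
    if (a->kind == PSEUDO_RESTORE_REG && b->kind == PSEUDO_SAVE_REG &&
        a->reg == b->reg)
        return 1;
    if (a->kind == PSEUDO_SWITCH_TO_APP_STACK &&
        b->kind == PSEUDO_SWITCH_TO_DR_STACK)
        return 1;
    return 0;
}

static pseudo_op_t *
cancel_reciprocal_pairs(pseudo_op_t *head)
{
    pseudo_op_t **cur = &head;
    while (*cur != NULL && (*cur)->next != NULL) {
        if (is_reciprocal(*cur, (*cur)->next))
            *cur = (*cur)->next->next;  /* unlink both operations */
        else
            cur = &(*cur)->next;
    }
    return head;
}
\end{verbatim}
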
Last, we run one more optimization sequence over the inlined code. As shown in
the instruction count example, redundant load elimination and dead store
elimination are very effective at eliminating extra loads and stores of the
global count value. Folding {\tt lea} instructions is also beneficial in that
example.
\section{Call Site Expansion}
\label{sec:site_expansion}
As discussed in Section \ref{sec:basic_block_opt}, each call is expanded after
being scheduled. At this point, we have the simplified routine code from the
shared routine analysis cache. However, we need to materialize arguments, which
are different at each call site.

Materializing immediate values is trivial, but materializing anything that
reads an application register value is more involved. Our rule is that if we
saved an application register, we reload it, because we might have clobbered it
during argument materialization or flags saving. We must special-case the stack
pointer, because we save it in a TLS slot.

For memory operand arguments, we have to be extremely careful. We cannot use a
single load instruction to rematerialize the argument, and we have a limited
number of available registers. We solve this by observing that there are at
least two available registers on all platforms: the destination register
itself, and {\tt \%rax}, which is never used as a parameter register on any
platform. We restore the base register into the argument register and the index
register into {\tt \%rax}, and rewrite the memory operand to use these two
registers. If either or both of the original application registers are
unclobbered, we leave them untouched.
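
A sketch of this rewriting step is shown below. The {\tt clobbered} and
{\tt insert\_restore\_app\_reg} helpers are hypothetical; the operand
manipulation itself uses DynamoRIO's operand API.

\begin{verbatim}
/* Sketch: rewrite a memory-operand argument so its base and index refer
 * to registers holding the application values at this point. */
#include "dr_api.h"

/* Hypothetical helpers supplied by the expansion machinery. */
static bool clobbered(reg_id_t reg);
static void insert_restore_app_reg(void *drcontext, instrlist_t *ilist,
                                   instr_t *where, reg_id_t app_reg,
                                   reg_id_t dst);

static opnd_t
rewrite_mem_arg(void *drcontext, instrlist_t *ilist, instr_t *where,
                opnd_t mem, reg_id_t arg_reg)
{
    reg_id_t base  = opnd_get_base(mem);
    reg_id_t index = opnd_get_index(mem);
    /* Restore a clobbered base into the argument register itself and a
     * clobbered index into %rax; unclobbered registers stay untouched. */
    if (base != DR_REG_NULL && clobbered(base)) {
        insert_restore_app_reg(drcontext, ilist, where, base, arg_reg);
        base = arg_reg;
    }
    if (index != DR_REG_NULL && clobbered(index)) {
        insert_restore_app_reg(drcontext, ilist, where, index, DR_REG_XAX);
        index = DR_REG_XAX;
    }
    return opnd_create_base_disp(base, index, opnd_get_scale(mem),
                                 opnd_get_disp(mem), opnd_get_size(mem));
}
\end{verbatim}
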
Often some of the arguments are immediate values, which can be folded further.
The alignment checker is a good example: it passes an access size parameter
that is combined with the address to perform the check. By folding constants
and {\tt lea} instructions one more time, we are able to fold that immediate
into the {\tt test} instruction shown in Figure \ref{fig:fold_immediate}.

Finally, after this optimization step, we perform register usage analysis and
emit save and restore code around the code we are inlining. We defer this
analysis until after argument materialization and optimization in order to
spill as few registers as possible.