-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathintroduction.tex
133 lines (112 loc) · 7.29 KB
/
introduction.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
\chapter{Introduction}
\section{Research Objectives}
Dynamic program analysis tools built with dynamic binary instrumentation (DBI)
have proven indispensable for developing large native applications. Memory
debuggers such as DrMemory\cite{drmemory} and Valgrind\cite{valgrind} in
particular have been instrumental in tracking down uninitialized memory and
memory leaks. Race detectors such as Helgrind\cite{helgrind} and Thread
Sanitizer\cite{tsan} have made programming with shared memory feasible. Cache
simulators and branch prediction simulators such as
Cachegrind\cite{valgrind_workloads} provide a way to optimize cache usage in a
given program.
These are just the most commonly used types of instrumentation tools.
Researchers in academia and industry are always building new tools, both
general-purpose as well as application-specific. For example, a research group
at MIT recently created Jolt\cite{jolt}, a tool for detecting and escaping from
infinite loops with Pin\cite{pin}. In industry, Determina extended DynamoRIO
to support program shepherding\cite{shepherding}, which detects application
security violations.
For the last several years, Pin has been the framework of choice for researchers
implementing custom program analysis tools. Pin's success in this area is due
to Pin's facilities for abstracting away architectural details from the
instrumentation tool. A Pin tool works by inserting calls to instrumentation
routines, which record information about the program execution. For example, if
a Pin tool wishes to analyze memory usage, it will instrument every memory
access to call to a function with information about that memory access. Each
call inserted with Pin generally expands to many instructions to spill
registers, switch stacks, and materialize arguments, so tools that use pervasive
instrumentation can be quite slow. To ameliorate this, Pin is cabable of
inlining short routines that have no control flow.
On the other hand, DynamoRIO\cite{bruening_phd} has a larger and more general
interface. Instead of only inserting calls to plain C or C++ routines, a tool
has the power to insert custom machine code into the application code. For
example, DrMemory uses the flexibility of DynamoRIO to generate memory accesses
that will fault if the application uses uninitialized data. While faulting is
expensive, it happens rarely. In the common case that the check succeeds, a
memory access is much faster than a conditional branch. As a result, DrMemory
is on average twice as fast as Valgrind's Memcheck.\cite{drmemory} Pin does not
provide the flexibility needed to generate and handle this kind of
instrumentation.
While the ability to insert custom machine code is powerful and can support
efficient tools, generating that code can be a daunting task for a researcher or
a beginner learning the framework. To help ameliorate the burden, DynamoRIO
also has facilities for inserting ``clean calls,'' which are similar to Pin's
instrumentation routines. However, DynamoRIO cannot inline clean calls, and is
generally not suited to pervasive instrumentation using clean calls.
The goal of this thesis is to make it possible to build performant program
anlysis tools with DynamoRIO without burdening the tool writer. We want to make
tools that are developed using ordinary clean calls fast enough to be broadly
useful. We believe that the slowdowns incurred by clean calls prevent people
from using them, and we would like reduce those slowdowns so that these novel
tools see wider adoption.
% Bad performance from DBI prevents people from using analysis tools.
Throughout this thesis, we follow the efforts of a tool author attempting to
implement three tools: an instruction counting tool, a memory alignment tool,
and a memory access trace tool. To support the author of these tools, we make
the following contributions:
\begin{packed_itemize}
\item An {\em inlining} optimization for instrumentation routines.
\item The first {\em partial inlining} optimization for instrumentation
routines with conditional analysis.
\item The first {\em call coalescing} optimization for instrumentation
routines.
\item A suite of x86 machine code optimizations leveraging the opportunities
created by inlining.
\end{packed_itemize}
Partial inlining is a well-understood compiler optimization that attempts to
inline the commonly executed code without inlining an entire function, which
would cause code bloat. In our case, not all tool code can be inlined into the
application code stream. Partial inlining allows us to inline the parts we can
and leave out the rest. As far as we know, this is the first application of
this technique to dynamic binary instrumentation.
For tools built with pervasive clean calls, the main cost is
usually the context switching between the application and DynamoRIO. The idea
behind call coalescing is that we should make multiple calls after one context
switch so that we need to make fewer context switches over all.
Finally, we found that applying the above techniques was not enough, and that we
needed to build a larger suite of general machine code optimizations to finish
cleaning up the inlined code. For example, in our memory alignment tool, the
size used for the alignment check is passed as a parameter, and it is always the
same at each call site. If we inline without folding this constant into the
alignment check, we are clobbering extra registers and issuing extra
instructions.
Using the above techniques, we have been able to dramatically improve
performance for an instruction counting tool by 54.8 times and a memory
alignment tool by almost 4 times.
\section{Thesis Overview}
In Chapter \ref{sec:background}, we discuss the motivation for using a dynamic
binary instrumentation framework such as DynamoRIO, Pin, or Valgrind in the
first place. In particular, we outline all the benefits they provide and the
challenges they help overcome. We also take a closer look at the execution
model of DynamoRIO because it has a great impact on the design of analysis
tools.
In Chapter \ref{sec:inlining}, we start with a na\"ive implementation of
instruction count and walk through the stages of optimizations that we apply.
As we go through the stages, the instrumentation code is progressively
simplified until it starts to look like the version we would write using custom
machine code.
In Chapter \ref{sec:partial_inlining}, we take a look at two more complex tools:
a memory alignment checker and a memory trace tool. These tools have the common
property that the instrumentation has an early conditional check that results
either in a fast or a slow path being taken. In the memory alignment tool, no
action needs to be taken if the memory access is aligned. In the memory trace
tool, the trace buffer does not need to be flushed if it is not full. We
describe how we go about inlining the fast paths while maintaining correctness
when the slow path is taken.
In Chapter \ref{sec:system}, we depart from our example-driven description of
the system to step back and look at the system hierarchy.
In Chapter \ref{sec:performance}, we present the experimental performance
results we achieved on the SPEC2006 CPU integer benchmarks\cite{spec_cpu_2k6}
for all three of the tools examined in this thesis.
In Chapter \ref{sec:contributions}, we look back on the contributions of this
thesis and suggest possible directions for future work.