-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathcomputer-architecture-notes.tex
360 lines (288 loc) · 15.4 KB
/
computer-architecture-notes.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
\documentclass[12pt, a4paper]{article}
\usepackage{graphicx}
\usepackage{float}
\usepackage[super, square]{natbib}
\usepackage[margin = 1in]{geometry}
\usepackage{
color,
clrscode,
amssymb,
ntheorem,
amsmath,
listings,
fontspec,
xcolor,
supertabular,
multirow
}
\definecolor{bgGray}{RGB}{36, 36, 36}
\usepackage[
colorlinks,
linkcolor=bgGray,
anchorcolor=blue,
citecolor=black
]{hyperref}
\newfontfamily\courier{Courier}
\theoremstyle{margin}
\theorembodyfont{\normalfont}
\newtheorem{theorem}{Theorem}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{example}[theorem]{Example}
\newcommand{\st}{\text{s.t.}}
\newcommand{\mn}{\mathnormal}
\newcommand{\tbf}{\textbf}
\newcommand{\fl}{\mathnormal{fl}}
\newcommand{\f}{\mathnormal{f}}
\newcommand{\g}{\mathnormal{g}}
\newcommand{\R}{\mathbf{R}}
\newcommand{\Q}{\mathbf{Q}}
\newcommand{\JD}{\textbf{D}}
\newcommand{\rd}{\mathrm{d}}
\newcommand{\str}{^*}
\newcommand{\vep}{\varepsilon}
\newcommand{\lhs}{\text{L.H.S}}
\newcommand{\rhs}{\text{R.H.S}}
\newcommand{\con}{\text{Const}}
\newcommand{\oneton}{1,\,2,\,\dots,\,n}
\newcommand{\aoneton}{a_1a_2\dots a_n}
\newcommand{\xoneton}{x_1,\,x_2,\,\dots,\,x_n}
\title{Lecture Notes of Computer Architecture\\\begin{large}(Lecturer \ A-Lei Liang)\end{large}}
\author{Zhou Fan\\ACM Class, Shanghai Jiao Tong University}
\date{}
\begin{document}
\lstset{numbers=left,
basicstyle=\scriptsize\courier,
numberstyle=\tiny\courier\color{red!89!green!36!blue!36},
language=C++,
breaklines=true,
keywordstyle=\color{blue!70},commentstyle=\color{red!50!green!50!blue!50},
morekeywords={},
stringstyle=\color{purple},
frame=shadowbox,
rulesepcolor=\color{red!20!green!20!blue!20}
}
\maketitle
\tableofcontents
\newpage
\section{Pipelining}
\subsection{Introduction of Pipelining}
\indent
\emph{Pipelining} is an implementation technique whereby multiple instructions are overlapped in execution; it takes advantage of parallelism that exists among the actions needed to execute an instruction.\cite{caqa}
\subsubsection{Laundry Example}
\indent
Suppose we have many loads of clothes to wash, dry and fold.
\begin{itemize}
\item Each step (washing, drying and folding) is called a \emph{pipe stage} or a \emph{pipe segment}.
\item \emph{Latency} is the total time spent on a single task, which is not improved by pipelining.Unbalanced lengths of pipe stages reduces speedup.
\item \emph{Throughput} is defined as the number of loads of clothes per minute. It shows how often a load of clothes exits the pipeline.
\item The time required between moving an instruction one step down the pipeline is a \emph{processor cycle}. In a computer, this processor cycle is usually 1 clock cycle.\cite{caqa}
\end{itemize}
\subsubsection{Speedup from Pipelining}
\indent
Throughput is what matters. Pipelining helps \emph{throughput} of whole workload, while it doesn't help \emph{latency} of single task.
\begin{itemize}
\item Unbalanced lengths of pipe stages reduces speedup.
\item Handover time between pipe stages reduces speedup.
\end{itemize}
To improve the efficiency of a pipeline, one should balance the length of each pipeline stage. If the stages are perfectly balanced, then the time per instruction on the pipeline processor is equal to (under ideal conditions)
$$\frac{\text{Time per instruction on unpipelined machine}}{\text{Number of pipe stages}}$$
and the throughput of the pipeline is equal to
$$\text{Number of pipe stages} \times \text{Throughput on unpipelined machine}$$
\subsection{The Basics of a RISC Instruction Set}
\subsubsection{Introduction of RISC}
\indent
A RISC\footnote{reduced instruction set computer} is a computer whose instruction set architecture has lower \emph{cycles per instruction} (CPI) than a CISC\footnote{complex instruction set computer}.
MIPS is a RISC instruction set architecture.
\paragraph{Key Properties of RISC Architectures\cite{caqa}}
\begin{itemize}
\item All operations on data apply to data in registers and typically change the entire register (32 or 64 bits per register).
\item Only load and store operations can affect memory. Load and store operations that load or store less than a full register are often available.
\item The instruction formats are few in number, with all instructions typically being one size.
\end{itemize}
These properties make the implementation of pipelining simple.
Most RISC architectures like MIPS have three classes of instructions:
\begin{enumerate}
\item ALU\footnote{Arithmetic logic unit} instructions
\item Load and store instructions
\item Branches and jumps
\end{enumerate}
\subsubsection{Implementation of a RISC Instruction Set}
\indent
The implementation here will focus only on a pipeline for an integer subsetof a RISC architecture that consists of load-store word, branch, and integer ALU operations.
\paragraph{Implementation Without Pipelining\cite{caqa}}
Every instruction in the RISC subset can be implemented in at most 5 clock cycles as follows:
\begin{enumerate}
\item \emph{Instruction fetch cycle} (IF):
Send the program counter (PC) to memory and fetch the current instruction from memory. Update the PC to the next sequential PC by adding 4 (since each instruction is 4 bytes) to the PC\footnote{program counter, indicates where a computer is in its program sequence.}.
\item \emph{Instruction decode / register fetch cycle} (ID):
Decode the instruction and read the registers corresponding to register source specifiers from the register file. Do the equality test on the registers as they are read, for a possible branch. Sign-extend (introduced later) the offset field of the instruction in case it is needed. Compute the possible branch target address by adding the sign-extended offset to the incremented PC.
Decoding is done in parallel with reading registers, which is possible because the register specifiers are at a fixed location in a RISC instruction. This technique is known as \emph{fixed-field decoding}. It may read a register that we don't use, and it doesn't help but also doesn't hurt performance.
\item \emph{Execution / effective address cycle} (EX):
The ALU operates on the operands prepared in the prior cycle, performing one of three functions depending on the instruction type.
\begin{itemize}
\item Memory reference --- The ALU adds the base register and the offset to form the effective address.
\item Register-Register ALU instruction --- The ALU performs the operation specified by the ALU opcode on the values read from the register file.
\item Register-Immediate ALU instruction --- The ALU performs the operation specified by the ALU opcode on the first value read from the register file and the sign-extended immediate.
\end{itemize}
In a load-store architecture the effective address and execution cycles can be combined into a single clock cycle, since no instruction needs to do both of them.
\item \emph{Memory Access} (MEM):
The memory does a read using the effective address computed in the previous cycle or writes the data read from the register file using the effective address, if the instruction is a load or a store.
\item \emph{Write-back cycle} (WB):
For a Register-Register ALU instruction or a load instruction, write the result into the register file.
\end{enumerate}
\paragraph{Sign Extension} In computer arithmetic, sigh extension is the operation of increasing the number of bits of a binary number while preserving the number's sign and value. This is done by appending digits (same as the sign bit) to the most significant side of the number, following a procedure dependent on the particular signed number representation used.
For example:
\begin{itemize}
\item ``00 1010'' (decimal positive 10) $\rightarrow$ ``0000 0000 0000 1010''
\item ``11 1111 0001'' (decimal negative 15) $\rightarrow$ ``1111 1111 1111 0001''
\end{itemize}
\paragraph{Pipelined Implementation\cite{caqa}}
Under ideal conditions, we can easily get a pipelined implementation from the execution described above, doing one of the five steps for five instructions in parallel in each pipeline stage. This procedure is showed in the following table.
\begin{table}[H]
\centering
\begin{tabular}{|l|l|l|l|l|l|l|l|l|l|}
\hline
& \multicolumn{9}{c|}{Clock Number} \\ \hline
Instruction Number & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 \\ \hline
Instruction $i$ & IF & ID & EX & MEM & WB & & & & \\ \hline
Instruction $i+1$ & & IF & ID & EX & MEM & WB & & & \\ \hline
Instruction $i+2$ & & & IF & ID & EX & MEM & WB & & \\ \hline
Instruction $i+3$ & & & & IF & ID & EX & MEM & WB & \\ \hline
Instruction $i+4$ & & & & & IF & ID & EX & MEM & WB \\ \hline
\end{tabular}
\end{table}
\subsection{Pipeline Hazards}
\subsubsection{Three Classes of Hazards}
Hazards prevent the next instruction in the instruction stream from executing during its designated clock cycle. There are three classes of hazards\cite{caqa}:
\begin{enumerate}
\item \emph{Structural hazards} arise from resource conflicts when the hardware cannot support all possible combinations of instructions simultaneously in overlapped execution.
\item \emph{Data hazards} arise when an instruction depends on the results of a previous instrucion still in the pipeline.
\item \emph{Control hazards} is caused by delay between the fetching of instruction and decisions about changes in control flow (branches and jumps).
\end{enumerate}
These hazards can make it necessary to \emph{stall} the pipeline.
There are three generic data hazards\cite{ppt}:
\begin{itemize}
\item Read After Write (RAW):
$Instr_J$ tries to read oprand before $Instr_I$ writes it.
\begin{itemize}
\item I: add \textbf{r1}, r2, r3
\item J: sub r4, \textbf{r1}, r3
\end{itemize}
\item Write After Read (WAR):
$Instr_J$ writes oprand before $Instr_I$ reads it.
\begin{itemize}
\item I: sub r4, \textbf{r1}, r3
\item J: add \textbf{r1}, r2, r3
\item K: mul r6, r1, r7
\end{itemize}
This cannot happen in MIPS 5 stage pipeline.
\item Write After Write (WAW):
$Instr_J$ writes oprand before $Instr_I$ writes it.
\begin{itemize}
\item I: sub \textbf{r1}, r4, r3
\item J: add \textbf{r1}, r2, r3
\item K: mul r6, r1, r7
\end{itemize}
This cannot happen in MIPS 5 stage pipeline.
\end{itemize}
\subsubsection{Solutions to Hazards}
\paragraph{Forwarding}
Some case of data hazards can be avoided using a technique called \emph{forwarding}. Forwarding can be generalized to include passing a result directly to the functional unit that requires it: A result is forwarded from the pipeline register corresponding to the output of one unit to the input of another, or from the result of a unit to the input of the same unit.\cite{caqa}\\
However, some data hazards still exists even with forwarding.
\paragraph{Software Scheduling}
Using software to rearrange instructions is a way to avoid load hazards. In the following example, rearrange instruction order, increase the distance between $Instr_I$ and $Instr_J$ to avoid the hazard:\\
\begin{itemize}
\item I: lw \textbf{r1}, addr
\item J: add r3, r2, \textbf{r1}
\end{itemize}
\paragraph{Control Hazards} Alternatives for control hazard\cite{caqa}:
\begin{itemize}
\item Stall until branch direction is clear
\item Predict Branch Not Taken / Branch Taken
\item Static Branch Prediction:
use profile information collected from ealier runsm, because an individual branch is often highly biased toward taken or untaken.
\item Dynamic Branch Prediction:
predict branches dynamically base on program behaviour.
\end{itemize}
\section{Placeholder}
Sorry for the blank here :-(
To be filled ...
\section{Memory Hierarchy Design}
\subsection{Memory Technology and Optimizations}
\subsubsection{Main Memory Organizations}
THIS PART\cite{ppt}
\begin{figure}[H]
\begin{center}
\includegraphics[height=12cm]{images/memory-organizations.png}
\end{center}
\end{figure}
\begin{itemize}
\item Simple: CPU, Cache, Bus, Memory same width (32 or 64 bits)
\item Wide: CPU/Mux 1 word; Mux/Cache, Bus, Memory N words
\item Interleaved: CPU, Cache, Bus 1 word; Memory N Banks, and these independent memory banks can access data parallelly
\end{itemize}
\begin{figure}[H]
\begin{center}
\includegraphics[height=9cm]{images/memory-organization-data.png}
\end{center}
\end{figure}
\paragraph{Memory Organization Performance}
Timing model (word size is 32 bits, Cache Block is 4 words):
\begin{itemize}
\item 1 to send address
\item 6 for access time
\item 1 to send data
\end{itemize}
Performance of these three memory organizations under the timing model above:
\begin{itemize}
\item Simple: $T=4\times(1+6+1)=32$
\item Wide: $T=1+6+1=8$
\item Interleaved: $T=1+6+4\times1=11$
\end{itemize}
Wide memory organization has the best time efficiency, but it causes bad compatibility of memory hardwares for its unusual width of Data Bus. Therefore, Interleaved memory organization is often a better choice for PCs.
\paragraph{Number of Banks} Note that: $\text{number banks} \le \text{number clocks to access word in bank}$, for sequential accesses, otherwise a bank will access a new word before the last accessed one being sent to Bus.
\paragraph{Avoiding Bank Conflicts} For the following code, independent banks would have no use, for sequential memory accesses come to the same bank since $512\bmod\text{number bank}=0$ if $\text{number bank}=2^k$.
\begin{lstlisting}
int x[256][512];
for (j = 0; j < 512; j = j+1)
for (i = 0; i < 256; i = i+1)
x[i][j] = 2 * x[i][j];
\end{lstlisting}
Solutions to bank conflicts:
\begin{itemize}
\item Software: loop interchange or declaring array not power of $2$ ("array padding")
\item Hardware: prime number of banks
\end{itemize}
\begin{figure}[H]
\begin{center}
\includegraphics[height=6cm]{images/address-within-bank.png}
\end{center}
\end{figure}
\paragraph{Fast Bank Number} Prime number of banks would cause efficiency problem in memory addressing. Assume the address $x$ would be located at $(R_x, C_x)$ of memory. If we set number of banks as a prime $p$,
\[R_x = x / p,\ C_x = x \bmod p\]
as shown in the \emph{Seq. Interleaved} part of the picture above, and calculation of $x / p$ is pretty time consuming.
If We arrange data as the \emph{Modulo Interleaved} part of the picture,
\[R_x = x \bmod 2^k,\ C_x = x \bmod p\]
and the calculation of $x \bmod 2^k$ is very fast.
For the calculation of $C_x = x \bmod p$, if we set $p = 2^t - 1$, we have
\[\begin{split}
x &= kp + r \\
&= k(2^t - 1) + r\\
&= k2^t + r - k \\
r &\equiv x + k \mod 2^k
\end{split}\]
So we can calculate $C_x$ with bit operations, which is very quick.
And it holds that
\[(R_i, C_i) \ne (R_j, C_j)\text{ if } i \ne j\]
It can be proved with the Chinese remainder theorem. So this arrangement of data does work.
\begin{thebibliography}{9}
\bibitem{caqa}
John L. Hennessy, David A. Patterson, et al.
\emph{Computer Architecture: A Quantitative Approach},
Fifth Edition, 2012.
\bibitem{ppt}
David A. Patterson.
PPT of \emph{CS252 Graduate Computer Architecture},
2001
\end{thebibliography}
\end{document}