\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage{amsmath}
\usepackage{graphicx}
\title{CS 189 Homework 6}
\author{Edwin Xie, John Du, Daniel Zezula}
\date{April 21, 2014}
\begin{document}
\maketitle
\section*{External Sources}
We used CUDA to parallelize the training process.
\section*{Single layer neural network}
\paragraph{(i)} Stochastic gradient descent for the loss functions:
We want to find an $n_{in} \times n_{out}$ matrix $\frac{\partial{J}}{\partial{\mathbf{W}}}$,
where $(\frac{\partial{J}}{\partial{\mathbf{W}}})_{ij}$ is the partial derivative of $J$ with respect to $W_{ij}$.
Define the columns $\mathbf{W}_k$ such that $\mathbf{W} = \begin{bmatrix} \mathbf{W}_1 & \mathbf{W}_2 & \dots & \mathbf{W}_{n_{out}} \end{bmatrix}$.
Then $\frac{\partial{J}}{\partial{\mathbf{W}}} =
\begin{bmatrix}\frac{\partial{J}}{\partial{\mathbf{W}_1}} &
\frac{\partial{J}}{\partial{\mathbf{W}_2}} & \dots &
\frac{\partial{J}}{\partial{\mathbf{W}_{n_{out}}}} \end{bmatrix}$.
We also want to find an $n_{out} \times 1$ vector $\frac{\partial{J}}{\partial{\mathbf{b}}}$,
where $\frac{\partial{J}}{\partial{\mathbf{b}}} =
\begin{bmatrix} \frac{\partial{J}}{\partial{b_1}} &
\frac{\partial{J}}{\partial{b_2}} & \dots &
\frac{\partial{J}}{\partial{b_{n_{out}}}} \end{bmatrix}^T$.
The stochastic gradient descent updates are then
$$\mathbf{W} = \mathbf{W} - \eta \dfrac{\partial{J}}{\partial{\mathbf{W}}}$$
$$\mathbf{b} = \mathbf{b} - \eta \dfrac{\partial{J}}{\partial{\mathbf{b}}}$$
for every training point $\mathbf{x}$.
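For concreteness, a minimal Python sketch of this update step (an illustration only; \texttt{grad\_fn} is a hypothetical helper standing in for the gradient formulas derived below, and our actual training loop runs on CUDA):
\begin{verbatim}
def sgd_step(W, b, x, t, grad_fn, eta=0.01):
    # grad_fn returns (dJ_dW, dJ_db) for the single example (x, t),
    # using one of the loss-specific formulas derived below.
    dJ_dW, dJ_db = grad_fn(W, b, x, t)
    W -= eta * dJ_dW   # W: n_features x n_out
    b -= eta * dJ_db   # b: n_out x 1
    return W, b
\end{verbatim}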
\subparagraph{Derivations for mean squared error:}
$$J = \dfrac{1}{2}\sum_{k=1}^{n_{out}}(t_k - y_k)^{2}$$
$$J = \dfrac{1}{2}(\mathbf{t} - \mathbf{y})^{T}(\mathbf{t} - \mathbf{y})$$
$$\dfrac{\partial{J}}{\partial{\mathbf{W_k}}} = -(t_k - y_k) \dfrac{\partial{y_k}}{\partial{\mathbf{W_k}}}$$
$$\dfrac{\partial{y_k}}{\partial{\mathbf{W_k}}} = y_k(1 - y_k)\dfrac{\partial}{\partial{\mathbf{W_k}}}(\mathbf{W}_k^T \mathbf{x} + b_k)$$
$$\dfrac{\partial{y_k}}{\partial{\mathbf{W_k}}} = y_k(1 - y_k)\mathbf{x}$$
$$\dfrac{\partial{J}}{\partial{\mathbf{W_k}}} = -(t_k - y_k)y_k(1 - y_k)\mathbf{x}$$
$$\dfrac{\partial{J}}{\partial{\mathbf{b_k}}} = -(t_k - y_k) \dfrac{\partial{y_k}}{\partial{\mathbf{b_k}}}$$
$$\dfrac{\partial{y_k}}{\partial{\mathbf{b_k}}} = y_k(1 - y_k)$$
$$\dfrac{\partial{J}}{\partial{\mathbf{b_k}}} = -(t_k - y_k)y_k(1 - y_k)$$
\subparagraph{Derivation for cross-entropy error:}
$$J = -\sum_{k=1}^{n_{out}}\left[t_k \ln{y_k} + (1-t_k)\ln{(1-y_k)}\right]$$
$$\dfrac{\partial{J}}{\partial{\mathbf{W_k}}} = \dfrac{\partial{J}}{\partial{y_k}} \dfrac{\partial{y_k}}{\partial{\mathbf{W_k}}}$$
$$\dfrac{\partial{J}}{\partial{y_k}} = -(\frac{t_k}{y_k} - \frac{1 - t_k}{1 - y_k}) = \frac{y_k - t_k}{y_k (1 - y_k)}$$
$$\dfrac{\partial{y_k}}{\partial{\mathbf{W_k}}} = y_k (1 - y_k) \mathbf{x}$$
$$\dfrac{\partial{J}}{\partial{\mathbf{W_k}}} = (y_k - t_k) \mathbf{x}$$
$$\dfrac{\partial{J}}{\partial{\mathbf{b_k}}} = (y_k - t_k)$$
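As a sanity check, a minimal NumPy sketch of these per-example gradients (an illustrative sketch assuming sigmoid outputs as above, not our CUDA implementation):
\begin{verbatim}
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def single_layer_grads(W, b, x, t, loss="xent"):
    # x: n_features x 1, t: n_out x 1, W: n_features x n_out, b: n_out x 1
    y = sigmoid(W.T @ x + b)                # forward pass
    if loss == "mse":
        delta = (y - t) * y * (1.0 - y)     # -(t_k - y_k) y_k (1 - y_k)
    else:
        delta = y - t                       # cross-entropy case
    dJ_dW = x @ delta.T                     # n_features x n_out
    dJ_db = delta                           # n_out x 1
    return dJ_dW, dJ_db
\end{verbatim}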
\subsubsection*{Derivation for Vectorized Forward Propagation}
For one training example:
$$y_k = {\sigma}(\sum_{j=1}^{n_{features}}{\mathbf{W}_{jk}}\mathbf{x}_j + {\mathbf{b}_k})$$
where $1\leq k \leq n_{out}$ and $1 \leq j \leq n_{features}$
$$y_k = \sigma(\mathbf{W_k}^T\mathbf{x} + {\mathbf{b}_k})$$
For element-wise sigmoid:
$$\mathbf{y} = \sigma(\mathbf{W}^T\mathbf{x} + \mathbf{b})$$
Above, the vector dimensions are:
$$\mathbf{y} = n_{out}\times 1, \; \mathbf{b} = n_{out}\times 1, \; \mathbf{W}_k = n_{features}\times 1, \; \mathbf{x} = n_{features}\times 1, \; \mathbf{W}=n_{features}\times n_{out}$$
For a batch of $m$ examples, we want $\mathbf{Y}$ whose columns are $\mathbf{y}_1, \; \dots, \; \mathbf{y}_m$:
$$Y = \sigma ([\mathbf{W}^T\mathbf{x}_1+\mathbf{b} \; \; \dots \; \; \mathbf{W}^T\mathbf{x}_m + \mathbf{b}])$$
$$Y = \sigma(\mathbf{W}^T\mathbf{X} + \mathbf{B})$$
where $\mathbf{B}$ has $m$ columns, each equal to $\mathbf{b}$, and
$$X = n_{features} \times m$$
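In NumPy, this batched forward pass is a single matrix product; broadcasting the bias over the $m$ columns plays the role of $\mathbf{B}$ (a sketch under the conventions above):
\begin{verbatim}
import numpy as np

def forward(W, b, X):
    # W: n_features x n_out, b: n_out x 1, X: n_features x m
    S = W.T @ X + b                   # broadcasting b stands in for B
    return 1.0 / (1.0 + np.exp(-S))   # Y: n_out x m, element-wise sigmoid
\end{verbatim}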
\subsubsection*{Derivation for Vectorized Back Propagation}
For one training example, with mean squared error:
$$\dfrac{\partial{J}}{\partial{\mathbf{W_k}}} = -(t_k - y_k)y_k(1 - y_k)\mathbf{x}$$
$$\delta_k^L = (y_k - t_k)y_k(1 - y_k)$$
For cross entropy:
$$\dfrac{\partial{J}}{\partial{\mathbf{W_k}}} = \dfrac{\partial{J}}{\partial{y_k}} \dfrac{\partial{y_k}}{\partial{\mathbf{W_k}}}$$
$$\dfrac{\partial{J}}{\partial{\mathbf{W_k}}} = (y_k - t_k) \mathbf{x}$$
$$\delta_k^L = (y_k - t_k)$$
In both cases continue with:
$$\dfrac{\partial{J}}{\partial{\mathbf{W}_k}} = \delta_k^L\,\mathbf{x}$$
Define
$$\dfrac{\partial{J}}{\partial{\mathbf{W}}} = \left[ \dfrac{\partial{J}}{\partial{\mathbf{W}_1}} \; \dots \; \dfrac{\partial{J}}{\partial{\mathbf{W}_{n_{out}}}}\right]$$
so that
$$\dfrac{\partial{J}}{\partial{\mathbf{W}}} = \mathbf{x}\, \delta ^{L^T}$$
For the bias:
$$\dfrac{\partial{J}}{\partial{\mathbf{b}_k}} = \delta _k ^{L}$$
Define $\dfrac{\partial{J}}{\partial{\mathbf{b}}}$ to have rows $\dfrac{\partial{J}}{\partial{b_1}}, \; \dots, \; \dfrac{\partial{J}}{\partial{b_{n_{out}}}}$; then
$$\dfrac{\partial{J}}{\partial{\mathbf{b}}} = \delta ^L$$
$$\dfrac{\partial{J}}{\partial{\mathbf{W}_k}} = n_{features} \times 1$$
$$\dfrac{\partial{J}}{\partial{\mathbf{W}}} = n_{features} \times n_{out}$$
where the columns of $\dfrac{\partial{J}}{\partial{\mathbf{W}}}$ are
$$\dfrac{\partial{J}}{\partial{\mathbf{W}_1}} \; \dots \; \dfrac{\partial{J}}{\partial{\mathbf{W}_{n_{out}}}}$$
$$\delta^{L} = n_{out} \times 1$$
For a batch of $m$ examples, let $\Delta ^{L}$ have columns $\delta_1^{L} \; \dots \; \delta_m^{L}$.
The gradient contribution of the $i$-th example is
$$\dfrac{\partial{J_i}}{\partial{\mathbf{W}}} = \mathbf{x}_i\,\delta_i^{L^T}$$
$$X \Delta ^{L^T} = \sum_{i=1}^{m}\left(\mathbf{x}_i\, \delta_i^{L^T}\right)$$
$$X \Delta ^{L^T} = \sum_{i=1}^{m}\dfrac{\partial{J_i}}{\partial{\mathbf{W}}}$$
$$X \Delta ^{L^T} = \dfrac{\partial{J}}{\partial{\mathbf{W}}}$$
Which is what we want for batch gradient descent. Similarly:
$$\dfrac{\partial{J}}{\partial{\mathbf{b}}} = \sum_{i=1}^{m}\delta_i^{L}$$
$$\dfrac{\partial{J}}{\partial{\mathbf{b}}} = \Delta ^{L}\begin{bmatrix}1 & \dots & 1\end{bmatrix}^T$$
where $\Delta^{L}$ has dimensions $n_{out} \times m$.
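These batch gradients reduce to two matrix operations, e.g.\ in NumPy (a sketch using the shapes above):
\begin{verbatim}
import numpy as np

def batch_grads(X, Delta):
    # X: n_features x m, Delta: n_out x m (columns are the delta_i^L)
    dJ_dW = X @ Delta.T                       # sum_i x_i delta_i^T
    dJ_db = Delta.sum(axis=1, keepdims=True)  # Delta times a ones vector
    return dJ_dW, dJ_db
\end{verbatim}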
\section*{Multilayer feedforward neural network}
\paragraph{(i)} Parameter update equations for the loss functions at the output layer:
$$\dfrac{\partial{J}}{\partial{\mathbf{W}_{ij}^{l}}} = \dfrac{\partial{J}}{\partial{\mathbf{y}_{j}^{l}}}\cdot \dfrac{\partial{\mathbf{y}_j^{l}}}{\partial{\mathbf{s}_{j}^{l}}}\cdot\dfrac{\partial{\mathbf{s}_j^{l}}}{\partial{\mathbf{W}_{ij}^{l}}}$$
For mean squared error:
$$\dfrac{\partial{J}}{\partial{\mathbf{y}_{j}^{L}}}\cdot \dfrac{\partial{\mathbf{y}_j^{L}}}{\partial{\mathbf{s}_{j}^{L}}} = (\mathbf{y}_j^{L} - t_j)\mathbf{y}_j^{L}(1-\mathbf{y}_j^{L}) = \delta _j ^{L}$$
For cross-entropy:
$$\dfrac{\partial{J}}{\partial{\mathbf{y}_{j}^{L}}}\cdot \dfrac{\partial{\mathbf{y}_j^{L}}}{\partial{\mathbf{s}_{j}^{L}}} = \frac{\mathbf{y}_j^{L} - t_j}{\mathbf{y}_j^{L} (1 - \mathbf{y}_j^{L})}\,\mathbf{y}_j^{L}(1-\mathbf{y}_j^{L}) = (\mathbf{y}_j^{L} - t_j) = \delta_j^{L}$$
In either case, continue with:
$$\dfrac{\partial{\mathbf{s}_j^{l}}}{\partial{\mathbf{W}_{ij}^{l}}} = \mathbf{y}_i^{l-1}$$
Inductive update for the hidden layers (the hidden units use the tanh activation, whose derivative is $1-\mathbf{y}^2$):
$$\delta_i^{l-1} = \dfrac{\partial{J}}{\partial{\mathbf{s}_i^{l-1}}} = \sum_{j=1}^{d(l)}\dfrac{\partial{J}}{\partial{\mathbf{s}_j^{l}}}\cdot\dfrac{\partial{\mathbf{s}_j^{l}}}{\partial{\mathbf{y}_{i}^{l-1}}}\cdot \dfrac{\partial{\mathbf{y}_i^{l-1}}}{\partial{\mathbf{s}_{i}^{l-1}}}$$
$$= \sum_{j=1}^{d(l)}\left(\delta_j^{l} \cdot \mathbf{W}_{ij}^{l} \cdot (1-(\mathbf{y}_i^{l-1})^2)\right)$$
$$=(1-(\mathbf{y}_i^{l-1})^2)\sum_{j=1}^{d(l)}\delta_j^{l} \cdot \mathbf{W}_{ij}^{l}$$
In vector form, for mean squared error:
$$\delta ^ {L} = (\mathbf{y}^{L} - \mathbf{t})\circ \mathbf{y}^{L}\circ(\mathbf{1} - \mathbf{y}^{L})$$
For cross-entropy:
$$\delta ^ {L} = \mathbf{y}^{L} - \mathbf{t}$$
where $\circ$ is the Hadamard (element-wise) product. For the hidden layers,
$$\delta^{l-1} = (\mathbf{1}-\mathbf{y}^{l-1}\circ\mathbf{y}^{l-1})\circ (\mathbf{W}^l\delta ^{l})$$
For a batch of $n$ examples, $\mathbf{y}^{l-1}$ has dimensions $d^{l-1}\times n$ and $\delta^{l}$ has dimensions $d^{l} \times n$, so
$$\dfrac{\partial{J}}{\partial{\mathbf{W}^{l}}} = \mathbf{y}^{l-1} \, \delta^{l^T} $$
$$\dfrac{\partial{J}}{\partial{\mathbf{b}^{l}}} = \delta ^{l}\begin{bmatrix}1 & \dots & 1\end{bmatrix}^T$$
with $\mathbf{y}^{0} =\mathbf{x}$. More generally, for any activation,
$$\delta ^{l-1} = \dfrac{\partial{\mathbf{y}^{l-1}}}{\partial{\mathbf{s}^{l-1}}}\circ\left(\mathbf{W}^{l}\delta ^{l}\right)$$
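Putting the recursion together, a minimal NumPy sketch of one backward pass (assuming tanh hidden units and a sigmoid output layer trained with cross-entropy, matching the derivative terms above; our actual implementation is the CUDA version):
\begin{verbatim}
import numpy as np

def backprop(Ws, bs, X, T):
    # Ws[l]: d_l x d_{l+1}, bs[l]: d_{l+1} x 1, X: d_0 x n inputs, T: d_L x n targets
    ys = [X]                                      # y^0 = x
    for l, (W, b) in enumerate(zip(Ws, bs)):
        s = W.T @ ys[-1] + b
        if l < len(Ws) - 1:
            ys.append(np.tanh(s))                 # tanh hidden layers
        else:
            ys.append(1.0 / (1.0 + np.exp(-s)))   # sigmoid output layer
    grads = [None] * len(Ws)
    delta = ys[-1] - T                            # delta^L for cross-entropy
    for l in range(len(Ws) - 1, -1, -1):
        dW = ys[l] @ delta.T                      # y^{l-1} (delta^l)^T
        db = delta.sum(axis=1, keepdims=True)     # delta^l summed over the batch
        grads[l] = (dW, db)
        if l > 0:
            delta = (1.0 - ys[l] ** 2) * (Ws[l] @ delta)  # tanh derivative term
    return grads
\end{verbatim}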
\section*{Results}
We implemented a single-layer neural network with both mean squared error and cross-entropy as error functions. We also implemented a multilayer neural network with two hidden layers, again with both loss functions. The graphs are shown below, where the green line is the test set and the blue line is the training set. The lowest error we obtained was with the multilayer neural network using the cross-entropy error function, which gave a 1.26\% error rate. With CUDA parallelization, training takes around 200 seconds.
\includegraphics[width=\textwidth]{slms}
\includegraphics[width=\textwidth]{slce}
\includegraphics[width=\textwidth]{mlms}
\includegraphics[width=\textwidth]{mlce}
\end{document}