`docs/src/models/recurrence.md`

Equivalent to the `RNN` stateful constructor, `LSTM` and `GRU` are also available.

Using these tools, we can now build the model shown in the above diagram with:

```julia
m = Chain(RNN(2, 5), Dense(5, 1))
```

In this example, each output has only one component.

## Working with sequences

Using the previously defined `m` recurrent model, we can now apply it to a single step from our sequence:

```julia
julia> x = rand(Float32, 2);

julia> m(x)
1-element Vector{Float32}:
 -0.12852919
```

The `m(x)` operation would be represented by `x1 -> A -> y1` in our diagram.
If we perform this operation a second time, it will be equivalent to `x2 -> A -> y2`
since the model `m` has stored the state resulting from the `x1` step.
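
For instance, calling the model a second time on the same input gives a different result, because the stored state changed after the first call. A sketch of what this looks like in the REPL (the values shown are illustrative and depend on the random initialization):

```julia
julia> m(x)  # same input as before, but the updated state changes the output
1-element Vector{Float32}:
 0.07381232
```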

Now, instead of computing a single step at a time, we can get the full `y1` to `y3` sequence in a single pass by
iterating the model on a sequence of data.

To do so, we'll need to structure the input data as a `Vector` of observations at each time step. This `Vector` will therefore be of `length = seq_length` and each of its elements will represent the input features for a given step. In our example, this translates into a `Vector` of length 3, where each element is a `Matrix` of size `(features, batch_size)`, or just a `Vector` of length `features` if dealing with a single observation.

```julia
julia> x = [rand(Float32, 2) for i = 1:3];

julia> [m(xi) for xi in x]
3-element Vector{Vector{Float32}}:
 [-0.018976994]
 [-0.8924057]
 [-0.34613007]
```

!!! warning "Use of map and broadcast"
    Mapping and broadcasting operations with stateful layers such as the one we are considering are discouraged,
    since the Julia language doesn't guarantee a specific execution order.
    Therefore, avoid
    ```julia
    y = m.(x)
    # or
    y = map(m, x)
    ```
    and use explicit loops instead:
    ```julia
    y = [m(xi) for xi in x]
    ```

If for some reason one wants to exclude the first step of the RNN chain from the computation of the loss, that can be handled with:

```julia
using Flux.Losses: mse

function loss(x, y)
  m(x[1]) # ignores the output but updates the hidden states
  sum(mse(m(xi), yi) for (xi, yi) in zip(x[2:end], y))
end

y = [rand(Float32, 1) for i = 1:2]
loss(x, y)
```

In such a model, only the last two outputs are used to compute the loss, hence the target `y` being of length 2. This is a strategy that can be used to easily handle a `seq-to-one` kind of structure, compared to the `seq-to-seq` assumed so far.
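
For a strict `seq-to-one` setup, where only the final step has a target, a minimal hypothetical sketch (not taken from the original docs) could look like the following, assuming the same `m` and `mse` as above and a single 1-element target `y_last`:

```julia
# Hypothetical seq-to-one loss: advance the state over all but the last step,
# then compare only the final output against a single target.
function loss_seq2one(x, y_last)
  for xi in x[1:end-1]
    m(xi)                  # update the hidden state, discard the output
  end
  mse(m(x[end]), y_last)
end

y_last = rand(Float32, 1)
loss_seq2one(x, y_last)
```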

Alternatively, if one wants to perform some warmup of the sequence, it could be performed once, followed by a regular training where all the steps of the sequence would be considered for the gradient update:

```julia
function loss(x, y)
  sum(mse(m(xi), yi) for (xi, yi) in zip(x, y))
end

seq_init = [rand(Float32, 2)]
seq_1 = [rand(Float32, 2) for i = 1:3]
seq_2 = [rand(Float32, 2) for i = 1:3]

y1 = [rand(Float32, 1) for i = 1:3]
y2 = [rand(Float32, 1) for i = 1:3]

X = [seq_1, seq_2]
Y = [y1, y2]
data = zip(X, Y)

Flux.reset!(m)
[m(x) for x in seq_init]

ps = params(m)
opt = ADAM(1e-3)
Flux.train!(loss, ps, data, opt)
```

In this previous example, the model's state is first reset with `Flux.reset!`. Then, a warmup is performed over a sequence of length 1 by feeding it `seq_init`, resulting in a warmup state. The model can then be trained for 1 epoch, where 2 batches are provided (`seq_1` and `seq_2`) and the outputs of all the timesteps are considered for the loss.

In this scenario, it is important to note that a single continuous sequence is considered. Since the model state is not reset between the 2 batches, the state of the model flows through the batches, which only makes sense in the context where `seq_1` is the continuation of `seq_init` and so on.
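
If, by contrast, the sequences in `data` were independent of each other, one option is to reset the hidden state before each sequence. The following is only a rough sketch of that idea, assuming `m`, `loss`, `data`, `ps` and `opt` are defined as above:

```julia
# Hypothetical variant for independent sequences (not from the original docs):
# reset the hidden state before each sequence so no state leaks between them.
for (x, y) in data
  Flux.reset!(m)                              # start every sequence from the initial state
  gs = Flux.gradient(() -> loss(x, y), ps)    # same `loss` and `ps` as above
  Flux.Optimise.update!(opt, ps, gs)
end
```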

The batch size would be 1 here as there's only a single sequence within each batch. If the model was to be trained on multiple independent sequences, then these sequences could be added to the input data as a second dimension. For example, in a language model, each batch would contain multiple independent sentences. In such a scenario, if we set the batch size to 4, a single batch would be of the shape:

```julia
x = [rand(Float32, 2, 4) for i = 1:3]
y = [rand(Float32, 1, 4) for i = 1:3]
```

That would mean that we have 4 sentences (or samples), each with 2 features (let's say a very small embedding!) and each with a length of 3 (3 words per sentence). Computing `m(x[1])` would still represent `x1 -> y1` in our diagram and return the first word output, but now for each of the 4 independent sentences (second dimension of the input matrix).
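
As a quick check of the shapes involved, assuming the `m` and `x` defined above (only the size is shown, since the values depend on the random initialization):

```julia
julia> Flux.reset!(m);  # start from a fresh state before switching to batched input

julia> size(m(x[1]))    # 1 output feature for each of the 4 independent sentences
(1, 4)
```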