
Commit 1a14301

Merge #1639

1639: fix recurrence docs r=CarloLucibello a=CarloLucibello

fix #1638

Co-authored-by: CarloLucibello <[email protected]>
Co-authored-by: Carlo Lucibello <[email protected]>

2 parents ee4c130 + 010c0bb, commit 1a14301

File tree

1 file changed: +48 -32 lines changed


docs/src/models/recurrence.md (+48 -32)
@@ -72,91 +72,107 @@ Equivalent to the `RNN` stateful constructor, `LSTM` and `GRU` are also availabl

  Using these tools, we can now build the model shown in the above diagram with:

  ```julia
- m = Chain(RNN(2, 5), Dense(5, 1), x -> reshape(x, :))
+ m = Chain(RNN(2, 5), Dense(5, 1))
  ```
+ In this example, each output has only one component.

  ## Working with sequences

  Using the previously defined `m` recurrent model, we can now apply it to a single step from our sequence:

  ```julia
- x = rand(Float32, 2)
+ julia> x = rand(Float32, 2);
+
  julia> m(x)
- 1-element Array{Float32,1}:
-  0.028398542
+ 2-element Vector{Float32}:
+  -0.12852919
+   0.009802654
  ```

  The `m(x)` operation would be represented by `x1 -> A -> y1` in our diagram.
- If we perform this operation a second time, it will be equivalent to `x2 -> A -> y2` since the model `m` has stored the state resulting from the `x1` step:
+ If we perform this operation a second time, it will be equivalent to `x2 -> A -> y2`
+ since the model `m` has stored the state resulting from the `x1` step.

- ```julia
- x = rand(Float32, 2)
- julia> m(x)
- 1-element Array{Float32,1}:
-  0.07381232
- ```
-
- Now, instead of computing a single step at a time, we can get the full `y1` to `y3` sequence in a single pass by broadcasting the model on a sequence of data.
+ Now, instead of computing a single step at a time, we can get the full `y1` to `y3` sequence in a single pass by
+ iterating the model on a sequence of data.

  To do so, we'll need to structure the input data as a `Vector` of observations at each time step. This `Vector` will therefore be of `length = seq_length` and each of its elements will represent the input features for a given step. In our example, this translates into a `Vector` of length 3, where each element is a `Matrix` of size `(features, batch_size)`, or just a `Vector` of length `features` if dealing with a single observation.

  ```julia
- x = [rand(Float32, 2) for i = 1:3]
- julia> m.(x)
- 3-element Array{Array{Float32,1},1}:
-  [-0.17945863]
-  [-0.20863166]
-  [-0.20693761]
+ julia> x = [rand(Float32, 2) for i = 1:3];
+
+ julia> [m(xi) for xi in x]
+ 3-element Vector{Vector{Float32}}:
+  [-0.018976994, 0.61098206]
+  [-0.8924057, -0.7512169]
+  [-0.34613007, -0.54565114]
  ```

+ !!! warning "Use of map and broadcast"
+     Mapping and broadcasting operations with stateful layers such as the one we are considering are discouraged,
+     since the Julia language doesn't guarantee a specific execution order.
+     Therefore, avoid
+     ```julia
+     y = m.(x)
+     # or
+     y = map(m, x)
+     ```
+     and use explicit loops
+     ```julia
+     y = [m(x) for x in x]
+     ```
+
  If for some reason one wants to exclude the first step of the RNN chain for the computation of the loss, that can be handled with:

  ```julia
+ using Flux.Losses: mse
+
  function loss(x, y)
-   sum((Flux.stack(m.(x)[2:end],1) .- y) .^ 2)
+   m(x[1]) # ignores the output but updates the hidden states
+   sum(mse(m(xi), yi) for (xi, yi) in zip(x[2:end], y))
  end

- y = rand(Float32, 2)
- julia> loss(x, y)
- 1.7021208968648693
+ y = [rand(Float32, 1) for i=1:2]
+ loss(x, y)
  ```

- In such model, only `y2` and `y3` are used to compute the loss, hence the target `y` being of length 2. This is a strategy that can be used to easily handle a `seq-to-one` kind of structure, compared to the `seq-to-seq` assumed so far.
+ In such a model, only the last two outputs are used to compute the loss, hence the target `y` being of length 2. This is a strategy that can be used to easily handle a `seq-to-one` kind of structure, compared to the `seq-to-seq` assumed so far.
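For a fully `seq-to-one` setup, where only the final step's output is compared to a single target, a minimal sketch could look as follows (not part of the diff above; it reuses the `m` and `mse` from the snippets, and `seq_to_one_loss` is just an illustrative name):

```julia
using Flux
using Flux.Losses: mse

m = Chain(RNN(2, 5), Dense(5, 1))

function seq_to_one_loss(x, y)
    Flux.reset!(m)            # start the sequence from a fresh hidden state
    for xi in x[1:end-1]
        m(xi)                 # feed the sequence, discarding intermediate outputs
    end
    mse(m(x[end]), y)         # only the last output enters the loss
end

x = [rand(Float32, 2) for _ in 1:3]   # a sequence of 3 steps
y = rand(Float32, 1)                  # a single target for the whole sequence
seq_to_one_loss(x, y)
```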

  Alternatively, if one wants to perform some warmup of the sequence, it could be performed once, followed with a regular training where all the steps of the sequence would be considered for the gradient update:

  ```julia
  function loss(x, y)
-   sum((Flux.stack(m.(x),1) .- y) .^ 2)
+   sum(mse(m(xi), yi) for (xi, yi) in zip(x, y))
  end

- seq_init = [rand(Float32, 2) for i = 1:1]
+ seq_init = [rand(Float32, 2)]
  seq_1 = [rand(Float32, 2) for i = 1:3]
  seq_2 = [rand(Float32, 2) for i = 1:3]

- y1 = rand(Float32, 3)
- y2 = rand(Float32, 3)
+ y1 = [rand(Float32, 1) for i = 1:3]
+ y2 = [rand(Float32, 1) for i = 1:3]

  X = [seq_1, seq_2]
  Y = [y1, y2]
  data = zip(X,Y)

  Flux.reset!(m)
- m.(seq_init)
+ [m(x) for x in seq_init]

  ps = params(m)
  opt= ADAM(1e-3)
  Flux.train!(loss, ps, data, opt)
  ```

- In this previous example, model's state is first reset with `Flux.reset!`. Then, there's a warmup that is performed over a sequence of length 1 by feeding it with `seq_init`, resulting in a warmup state. The model can then be trained for 1 epoch, where 2 batches are provided (`seq_1` and `seq_2`) and all the timesteps outputs are considered for the loss (we no longer use a subset of `m.(x)` in the loss function).
+ In this previous example, model's state is first reset with `Flux.reset!`. Then, there's a warmup that is performed over a sequence of length 1 by feeding it with `seq_init`, resulting in a warmup state. The model can then be trained for 1 epoch, where 2 batches are provided (`seq_1` and `seq_2`) and all the timesteps outputs are considered for the loss.

  In this scenario, it is important to note that a single continuous sequence is considered. Since the model state is not reset between the 2 batches, the state of the model flows through the batches, which only makes sense in the context where `seq_1` is the continuation of `seq_init` and so on.
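To make that flow explicit, one plausible epoch loop for a single continuous sequence (a sketch only, not part of the diff; it reuses `m`, `loss`, `seq_init`, `X` and `Y` from the snippet above, and the epoch count is arbitrary) would reset the state once per pass rather than between batches:

```julia
opt = ADAM(1e-3)
ps = params(m)

for epoch in 1:10
    Flux.reset!(m)                  # fresh state only at the start of the full sequence
    [m(x) for x in seq_init]        # warmup; the resulting state carries into training
    Flux.train!(loss, ps, zip(X, Y), opt)   # seq_1 then seq_2, state flowing through
end
```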

  Batch size would be 1 here as there's only a single sequence within each batch. If the model was to be trained on multiple independent sequences, then these sequences could be added to the input data as a second dimension. For example, in a language model, each batch would contain multiple independent sentences. In such scenario, if we set the batch size to 4, a single batch would be of the shape:

  ```julia
- batch = [rand(Float32, 2, 4) for i = 1:3]
+ x = [rand(Float32, 2, 4) for i = 1:3]
+ y = [rand(Float32, 1, 4) for i = 1:3]
  ```

  That would mean that we have 4 sentences (or samples), each with 2 features (let's say a very small embedding!) and each with a length of 3 (3 words per sentence). Computing `m(batch[1])`, would still represent `x1 -> y1` in our diagram and returns the first word output, but now for each of the 4 independent sentences (second dimension of the input matrix).
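As a quick sketch of that batched step (not part of the diff; it assumes the `Chain(RNN(2, 5), Dense(5, 1))` model from above, with `x` being the batch that the prose still calls `batch`):

```julia
x = [rand(Float32, 2, 4) for i = 1:3]   # 3 steps, each a (features, batch_size) matrix

Flux.reset!(m)
y1 = m(x[1])      # x1 -> y1, computed for all 4 sentences at once
size(y1)          # (1, 4): one output component per sentence
```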
@@ -166,7 +182,7 @@ In many situations, such as when dealing with a language model, each batch typic

  ```julia
  function loss(x, y)
    Flux.reset!(m)
-   sum((Flux.stack(m.(x),1) .- y) .^ 2)
+   sum(mse(m(xi), yi) for (xi, yi) in zip(x, y))
  end
  ```
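For illustration only (not part of the diff), this reset-per-call loss could be exercised on the batched data sketched earlier; the names below simply reuse `m`, `mse`, `x` and `y` from the snippets above:

```julia
x = [rand(Float32, 2, 4) for i = 1:3]   # 3 steps of 4 independent sentences
y = [rand(Float32, 1, 4) for i = 1:3]   # matching targets for every step

loss(x, y)                               # state is cleared inside loss, so batches stay independent

Flux.train!(loss, params(m), [(x, y)], ADAM(1e-3))
```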