added zoneout to rec.py #86

Merged: 3 commits into main on Dec 8, 2021

Conversation

Atticus1806
Contributor

Analogous to the existing LSTM module, this adds ZoneoutLSTM to rec.py.
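
For orientation, a minimal sketch (not the actual code in this PR) of roughly what such a module boils down to on the RETURNN layer-dict level; the zoneout factor names follow the RETURNN unit_opts discussed below:

def make_zoneout_lstm_layer(source, n_out, zoneout_factor_cell, zoneout_factor_output):
  """Build a RETURNN 'rec' layer dict that uses the zoneout LSTM cell."""
  return {
    "class": "rec",
    "unit": "zoneoutlstm",  # the unit string selects the cell implementation
    "from": source,
    "n_out": n_out,
    "unit_opts": {
      "zoneout_factor_cell": zoneout_factor_cell,
      "zoneout_factor_output": zoneout_factor_output,
    },
  }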

@albertz
Member

albertz commented Dec 2, 2021

I wonder whether we should rather make this an option to the LSTM?

Also relevant is the handling of the train flag (#18).

@Atticus1806
Contributor Author

I don't really have a strong opinion on this; adding it as an option would be fine with me too.
I liked the explicit approach where the name of the class defines the unit string, since all the parameters passed etc. depend on it, so I think it makes sense.

Also relevant is the handling of the train flag (#18).

From my understanding this is currently handled just in RETURNN, right? I see that knowing whether we are in training or not is relevant for the concept, but I am not sure why this would make a difference in returnn_common here.

@albertz
Member

albertz commented Dec 2, 2021

I liked the explicit approach where the name of the class defines the unit string, since all the parameters passed etc. depend on it, so I think it makes sense.

This is just a RETURNN-specific detail. Actually, on the RETURNN side we should maybe change that as well, e.g. make it an option to NativeLstm2 too.

I'm a bit afraid that we end up with different LSTMs which are basically all the same except for the underlying implementation, each supporting a different feature set (zoneout, layer norm, recurrent dropout, weight dropout, whatever other things), and that at some later point we would extend some of the implementations anyway (e.g. NativeLstm to include zoneout and layer norm).

I'm not sure...

Also relevant is the handling of the train flag (#18).

From my understanding this is currently handled just in RETURNN, right? I see that knowing whether we are in training or not is relevant for the concept, but I am not sure why this would make a difference in returnn_common here.

Well, yes, that's what I mean, that is the problem. You just ignore this here. This could be wrong, depending on how we handle #18.

@Atticus1806
Contributor Author

I'm a bit afraid that we end up with different LSTMs which are basically all the same except for the underlying implementation, each supporting a different feature set

This is actually a really good point. If you want to track this I can open an issue in the RETURNN repo so it doesn't get lost. Do you have a suggestion how to do this as explicitly as possible? Passing the unit string is something I am not a fan of. Maybe add flags for what to include (which might be mutually exclusive due to the nature of the concepts) and for now choose the unit based on that? That should be easy to adapt later when the RETURNN side changes. But maybe there is a better way.

Well, yes, that's what I mean, that is the problem. You just ignore this here. This could be wrong, depending on how we handle #18.

I was assuming that the train_flag, or however we handle it here in the end, needs to translate to RETURNN behavior either way; then I think this is fine. Right now the RETURNN layer checks this flag via self.is_training = TFNetwork.get_current_network().train_flag. Do you think that whatever we come up with in #18 really breaks this? That would require a RETURNN-side change to handle it anyway, right? I might be missing a point here, but I am not sure how and why we would want to handle this right here at the moment.

@albertz
Member

albertz commented Dec 2, 2021

I'm a bit afraid that we end up with different LSTMs which are basically all the same except for the underlying implementation, each supporting a different feature set

This is actually a really good point. If you want to track this I can open an issue in the RETURNN repo so it doesn't get lost. Do you have a suggestion how to do this as explicitly as possible? Passing the unit string is something I am not a fan of. Maybe add flags for what to include (which might be mutually exclusive due to the nature of the concepts) and for now choose the unit based on that? That should be easy to adapt later when the RETURNN side changes.

Yes, that was what I was thinking. We can directly try to do it in a clean way here, wrap the current RETURNN logic, and adapt when RETURNN is extended. (On the RETURNN side, I have wanted to implement a new extended NativeLstm for a while, which includes optional zoneout + layer norm + recurrent dropout.)

This is the same line of thought for many things here in returnn-common.

It always boils down to the question: what is actually the best and cleanest way or interface?

So is it cleaner to have ZoneoutLSTM, LayerNormLSTM, LSTM, etc., or just one single LSTM which can have options for the variants?

Note that in other situations, we argued that having many options is also not good. E.g. the SelfAttentionLayer on RETURNN side is such an example.

What we also want to avoid is that this gets out of hand. E.g. there will be further LSTM variants or options in the future, and we would keep adding more and more options.

Well, yes, that's what I mean, that is the problem. You just ignore this here. This could be wrong, depending on how we handle #18.

I was assuming that the train_flag, or however we handle it here in the end, needs to translate to RETURNN behavior either way; then I think this is fine. Right now the RETURNN layer checks this flag via self.is_training = TFNetwork.get_current_network().train_flag. Do you think that whatever we come up with in #18 really breaks this? That would require a RETURNN-side change to handle it anyway, right? I might be missing a point here, but I am not sure how and why we would want to handle this right here at the moment.

Yeah, well, I don't know. E.g. if we decide in #18 that we want to be explicit about the train flag here in returnn-common, and basically handle it here explicitly, e.g. such that every module has this flag, copying the PyTorch behavior very closely, then that flag must be passed over to RETURNN in some way. Yes, this might require some small changes on the RETURNN side, but I think whatever is needed would be simple to add.

But maybe we can also do that later here in returnn-common. I just wanted to say that #18 can have an influence on this code.

@Atticus1806
Contributor Author

Note that in other situations, we argued that having many options is also not good. E.g. the SelfAttentionLayer on RETURNN side is such an example.

I agree. I pushed an updated version which for now has a bool flag for the zoneout option. Right now my idea is that for every further option another flag is introduced, mutually exclusive for now (or not, if there are cases where two methods can form a unit together). But I am not sure if I like this; it might lead to quite a few bool flags. Do you have a better (but still explicit) idea how to handle this?
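
As a rough illustration of this flag idea (a hypothetical helper, not the code pushed in the PR), the flag could simply map to the RETURNN unit string plus its unit_opts:

def choose_lstm_unit(zoneout=False, zoneout_factor_cell=0.0, zoneout_factor_output=0.0):
  """Map option flags to (unit, unit_opts) for the RETURNN rec layer."""
  if zoneout:
    return "zoneoutlstm", {
      "zoneout_factor_cell": zoneout_factor_cell,
      "zoneout_factor_output": zoneout_factor_output,
    }
  return "nativelstm2", {}  # default plain LSTM implementation, no extra options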

For #18 I agree that this will influence this layer, but I think we can also change this layer in a separate PR once that decision is made.

@albertz
Member

albertz commented Dec 2, 2021

Note that in other situations, we argued that having many options is also not good. E.g. the SelfAttentionLayer on RETURNN side is such an example.

I agree. I pushed an updated version which for now has a bool flag for the zoneout option. Right now my idea is that for every further option another flag is introduced, mutually exclusive for now (or not, if there are cases where two methods can form a unit together). But I am not sure if I like this; it might lead to quite a few bool flags. Do you have a better (but still explicit) idea how to handle this?

So you agree that one single module (LSTM) with many options is bad, but you did it like that now anyway?

Yes, in principle, conceptually, many of the things I listed above could be combined in whatever way you like (zoneout, layer norm, recurrent dropout, etc.). That's one argument why it would make sense to have it in a single module, because we cannot have separate modules for all possible combinations. However, effectively, they are currently partly mutually exclusive because we just lack an implementation which is generic enough to cover all cases. But this could be extended later.

Or maybe it's cleaner if different LSTM modeling types (layer-norm, zoneout) really get a different module. I don't know.

@Atticus1806
Contributor Author

So you agree that one single module (LSTM) with many options is bad, but you did it like that now anyway?

I should stop working late... In "Note that in other situations, we argued that having many options is also not good." I read the word options as meaning different classes, not options for one class. My bad. I am not sure if this is a problem here though, because usually you would only call the LSTM initialization with one of the flags, maybe 2 or 3 with a more generic interface, plus whatever parameters are passed anyway (and thus cannot be reduced), so I am not sure whether this is really that big of an overhead.

Or maybe it's cleaner if different LSTM modeling types (layer-norm, zoneout) really get a different module. I don't know.

One benefit of having them in a separate class would be that the unit opts could be made explicit. Right now I would pass something like

unit_opts={'zoneout_factor_cell': zoneout_factor_cell, 'zoneout_factor_output': zoneout_factor_output}

to the class; having zoneout in a separate class would allow these parameters to be explicit. Writing them all into one LSTM class alongside the optional flags would make it far too big.
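
To make the contrast concrete, a hypothetical sketch (illustrative names, not the actual returnn-common class) of how a dedicated module keeps the parameters explicit while still producing the generic unit_opts internally:

class ZoneoutLSTMSketch:
  """Illustrative only: explicit parameters on the outside, generic unit_opts on the inside."""

  def __init__(self, n_out, zoneout_factor_cell, zoneout_factor_output):
    self.n_out = n_out
    self.unit = "zoneoutlstm"
    # The explicit constructor arguments are collected into the generic dict internally.
    self.unit_opts = {
      "zoneout_factor_cell": zoneout_factor_cell,
      "zoneout_factor_output": zoneout_factor_output,
    }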

@albertz
Member

albertz commented Dec 3, 2021

So you agree that one single module (LSTM) with many options is bad ...

...I am not sure if this is a problem here though, because usually you would only call the LSTM initialization with one of the flags, maybe 2 or 3 with a more generic interface, plus whatever parameters are passed anyway (and thus cannot be reduced), so I am not sure whether this is really that big of an overhead.

No, I'm not referring just to reading the user's code. Sure, for the user's code, it doesn't make much of a difference whether it is

lstm = LSTM(zoneout=True, ...)

vs

lstm = ZoneoutLSTM(...)

But I'm speaking about the maintainability of the returnn-common code. I think any functions, classes, modules, or layers which get too complex by combining too much functionality are bad and get ugly. And experience shows that, over time, they get more and more options. After a while we will have an LSTM with 20 options or so. Then when someone looks at this code (e.g. just to try to understand the LSTM code, or maybe to extend it, or to fix some issue), it is huge and complicated, covering lots of different cases due to all the options. Again, see SelfAttentionLayer as one such example. And I think we want to avoid that.

Having separate modules makes that cleaner.

Even separate modules like your initially proposed ZoneoutLSTM will have options, but fewer.

Maybe LSTM should just be the standard LSTM without fancy things, just like Transformer is the standard Transformer.

In other parts, e.g. also for the Transformer, and in RETURNN in general, our solution to still give the user full flexibility while avoiding complex layers or functions with too many options is to break things down into simple building blocks, such that the user can put them together in whatever way they want. This sometimes makes it difficult to still provide good efficiency. Our solution to that is automatic optimization in RETURNN, e.g. moving layers out of a loop, etc. Sometimes this is not so obvious though. See the long discussion on generalizing and extending self-attention in RETURNN (rwth-i6/returnn#391), although I'm very happy now with the result (CumConcatLayer).

Here, in the case of LSTMs, this is tricky again. Sure, the user could already just write down the formula for a zoneout LSTM or any other LSTM variant explicitly and that would work, so the building blocks are there. But when the user writes down the formulas for a standard LSTM, it would be tricky to automatically convert that into a NativeLSTM.
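
For reference, the zoneout update rule itself is easy to write down as a building block; a minimal standalone sketch in plain NumPy, independent of RETURNN:

import numpy as np

def zoneout_update(prev_state, new_state, zoneout_factor, train):
  """Blend the previous and the newly computed recurrent state according to zoneout."""
  prev_state = np.asarray(prev_state, dtype=np.float32)
  new_state = np.asarray(new_state, dtype=np.float32)
  if train:
    # During training, each unit keeps its previous value with probability zoneout_factor.
    keep = (np.random.rand(*new_state.shape) < zoneout_factor).astype(np.float32)
    return keep * prev_state + (1.0 - keep) * new_state
  # At inference, use the expected value of the random mask instead of sampling.
  return zoneout_factor * prev_state + (1.0 - zoneout_factor) * new_state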

Or maybe it's cleaner if different LSTM modeling types (layer-norm, zoneout) really get a different module. I don't know.

One benefit of having them in a separate class would be that the unit opts could be made explicit. Right now I would pass something like

unit_opts={'zoneout_factor_cell': zoneout_factor_cell, 'zoneout_factor_output': zoneout_factor_output}

to the class; having zoneout in a separate class would allow these parameters to be explicit. Writing them all into one LSTM class alongside the optional flags would make it far too big.

Yes exactly, that's another thing. In a generic LSTM module, you would then either have zoneout_opts or write down all the individual Zoneout options like zoneout_opt_a, zoneout_opt_b etc. Both variants are bad.

So, to conclude, I would avoid putting all functionality into one LSTM module. Do the separate ZoneoutLSTM as you had it initially.

Maybe at some point I will write a new NativeLstm3 which can do both layer norm and zoneout, but that's ok then. I might just make a new NativeLstm3 module on the returnn-common side specifically for it. And the standard LSTM could maybe also use NativeLstm3 if it is faster; otherwise it doesn't matter. But the standard LSTM would otherwise not be extended.

@albertz
Member

albertz commented Dec 3, 2021

Note: #81 and #82 will also be relevant. But we can also think about that later.

@Atticus1806
Contributor Author

So, to conclude, I would avoid putting all functionality into one LSTM module. Do the separate ZoneoutLSTM as you had it initially.

Okay, so this is in the PR now. If I understood you right, we will deal with the rest later when (or if) it becomes relevant, so this should be ready now?

@albertz
Member

albertz commented Dec 8, 2021

Can you rebase (due to conflicts)?

Unfortunately the master is currently broken. But I'm working on it.

@Atticus1806
Contributor Author

Atticus1806 commented Dec 8, 2021

Can you rebase (due to conflicts)?

Should be done.
What about the changes to LSTM? Do you also want me to adapt the nn.Dim style to the ZoneoutLSTM? (Note: my IDE doesn't find nn.Dim right now, but this might be due to the broken master? EDIT: never mind, RETURNN wasn't updated.)
Also: what is the difference to the old n_out?

@albertz merged commit 28e4921 into main on Dec 8, 2021
@albertz deleted the benedikt_zoneout branch on Dec 8, 2021, 10:49
@albertz
Member

albertz commented Dec 8, 2021

I merged now.

What about the changes to LSTM? Do you also want me to adapt the nn.Dim style to the ZoneoutLSTM?

We can just do that in a follow-up PR or commit. It's really work in progress and some parts of this are not fully decided yet.

Also: what is the difference to the old n_out?

See #17 as a starting point. Dim tags are used to distinguish dims explicitly. The dim value (some int, or None) is often ambiguous, and the absolute axis index (int) in some tensor (or layer output, or Data) must not be used because layers can potentially reorder the dims in any way (RETURNN principles).
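
As a rough illustration (the returnn-common API was still work in progress at the time, so the exact constructor names below are assumptions rather than the final interface):

from returnn_common import nn  # assumed import path

# Old style: the output size is just an int, which is ambiguous when several dims share a size.
#   lstm = LSTM(n_out=512)
# Dim-tag style: the output dim is an explicit nn.Dim object, identified by the tag itself
# rather than by its size or by its axis position.
out_dim = nn.FeatureDim("lstm-out", 512)
lstm = nn.LSTM(out_dim=out_dim)  # hypothetical signature, shown only to contrast with n_out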
