
Generalize DotLayer to allow sparse inputs #741


Closed · albertz opened this issue Nov 8, 2021 · 10 comments · Fixed by #895


albertz (Member) commented Nov 8, 2021

This basically would unify GatherLayer and DotLayer.

This is the same logic as for LinearLayer, which also works with either dense or sparse input.

I'm not really suggesting to do this; I just want to raise the question here of whether it makes sense.
Maybe it is cleaner, however, not to do this and to keep it more explicit for the user.
Maybe it also was a mistake that LinearLayer does this. Other frameworks usually have a separate Embedding or EmbeddingLayer for the case of sparse inputs.
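
(For context, the equivalence this would rely on, as a minimal plain-NumPy sketch, not RETURNN code: a dot product against the implicit one-hot expansion of a sparse input is exactly a gather into the weight matrix.)

```python
import numpy as np

# Minimal sketch (plain NumPy, not RETURNN): a sparse input of shape [T] with
# dim F behaves like a one-hot dense [T, F]; a dot product with a weight
# matrix [F, D] is then the same as gathering rows of that matrix.
rng = np.random.default_rng(0)
weights = rng.normal(size=(10, 4))   # F=10 (vocab), D=4
indices = np.array([3, 7, 7, 1])     # sparse input, values in [0, F)

one_hot = np.eye(10)[indices]        # explicit dense [T, F]
via_dot = one_hot @ weights          # what a sparse-aware DotLayer would compute
via_gather = weights[indices]        # what GatherLayer / an embedding lookup computes

assert np.allclose(via_dot, via_gather)
```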

Zettelkasten (Member) commented:

> Maybe it also was a mistake that LinearLayer does this. Other frameworks usually have a separate Embedding or EmbeddingLayer for the case of sparse inputs.

Why do you think it was a mistake to handle both? Did you ever run into issues with this?
Are there any dangerous things that can happen here?
At first I found an Embedding layer a bit weird, mainly because of its name (I think SparseLookup or similar would be a better name; it describes much better what the layer does. That it is an embedding is just one particular interpretation).

If you think about this sparse dim as just another (implicit) dim in the shape, I think it makes a lot of sense. Just like we would probably also add these sparse dims to out_shape in some way (#706).
So I would support sparse inputs whenever possible (so also here for DotLayer). Also for rwth-i6/returnn_common#38 (comment) I would just have one method that accepts both.

albertz (Member, Author) commented Nov 8, 2021

> Maybe it also was a mistake that LinearLayer does this. Other frameworks usually have a separate Embedding or EmbeddingLayer for the case of sparse inputs.

> Why do you think it was a mistake to handle both? Did you ever run into issues with this? Are there any dangerous things that can happen here?

No. I thought it might be unintuitive. But maybe not. Or maybe only for people coming from other frameworks.

> At first I found an Embedding layer a bit weird, mainly because of its name (I think SparseLookup or similar would be a better name; it describes much better what the layer does. That it is an embedding is just one particular interpretation).

Well, I think Embedding makes more sense, because it represents an embedding, it implies that it is trainable, and it contains the embedding matrix.
SparseLookup sounds to me like it would just do a lookup, i.e. an alias to tf.gather. I.e. purely functional, not trainable. It would not contain the embedding matrix.
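
(To illustrate the distinction, a rough plain-TensorFlow sketch; the two readings here are illustrative, not RETURNN classes:)

```python
import tensorflow as tf

indices = tf.constant([3, 7, 7, 1])

# "Embedding": owns a trainable matrix; the lookup is just how it is applied.
embedding_matrix = tf.Variable(tf.random.normal([10, 4]))  # part of the model, trained
embedded = tf.gather(embedding_matrix, indices)

# "SparseLookup" reading: purely functional, no parameters of its own.
some_table = tf.random.normal([10, 4])                      # comes from elsewhere, not trained
looked_up = tf.gather(some_table, indices)
```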

> If you think about this sparse dim as just another (implicit) dim in the shape, I think it makes a lot of sense. Just like we would probably also add these sparse dims to out_shape in some way (#706). So I would support sparse inputs whenever possible (so also here for DotLayer). Also for rwth-i6/returnn_common#38 (comment) I would just have one method that accepts both.

Yes, I see this viewpoint.

But then GatherLayer is also basically obsolete, right?

GatherLayer doesn't require the sparse flag to be set though. But this is a detail, and maybe the sparse flag actually should have been set in the cases where it is not. For those cases, you would just use ReinterpretDataLayer.

Although it might be a bit unintuitive for the user to realize when sparse is actually set and when it is not?
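
(A hedged sketch of what that could look like in a network config; the option names like "set_sparse" / "set_sparse_dim" are from memory and would need to be checked against the current ReinterpretDataLayer docs:)

```python
# Hypothetical network fragment; option names not verified against current RETURNN.
network = {
    # Take a plain int tensor and mark it as sparse with dim 1000, so that layers
    # which interpret the sparse flag (LinearLayer, and the proposed generalized
    # DotLayer) would treat it as an implicit one-hot input.
    "as_sparse": {
        "class": "reinterpret_data", "from": "indices",
        "set_sparse": True, "set_sparse_dim": 1000,
    },
}
```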

Zettelkasten (Member) commented:

Hm okay, I do see some dangers with sparse tensors now. E.g. if you have a sparse [B,T] tensor, which is then virtually a [B,T,F], and you e.g. reduce the T axis, I think it will currently reduce over the vocab ids. But instead it should behave like reducing the dense one-hot [B,T,F] over T, i.e. give a dense [B,F] which is just the sum of the one-hot vectors.
And probably also in many other cases. Like also DotLayer here.
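
(Concretely, in plain NumPy, just to make the concern explicit:)

```python
import numpy as np

# sparse [B=2, T=3] with dim F=4
sparse_bt = np.array([[1, 1, 3],
                      [0, 2, 2]])

# Reducing over T on the raw ids sums the vocab ids themselves:
reduce_on_ids = sparse_bt.sum(axis=1)          # [5, 4] -- probably not what you want

# Under the "implicitly dense" view you would reduce the one-hot [B, T, F]:
one_hot_btf = np.eye(4)[sparse_bt]             # explicit dense [B, T, F]
reduce_on_one_hot = one_hot_btf.sum(axis=1)    # [[0, 2, 0, 1],
                                               #  [1, 0, 2, 0]] -- per-class counts
```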

Zettelkasten (Member) commented:

> But then GatherLayer is also basically obsolete, right?

> GatherLayer doesn't require the sparse flag to be set though. But this is a detail, and maybe the sparse flag actually should have been set in the cases where it is not. For those cases, you would just use ReinterpretDataLayer.

> Although it might be a bit unintuitive for the user to realize when sparse is actually set and when it is not?

Hm, yeah, I don't know.
I usually use the indices I pass to GatherLayer as positions, which have a fixed order. So e.g. I would average and round them.
So here these index tensors need to be dense; otherwise this should operate on one-hot vectors (which I think it currently does not, but I would argue that is broken).
So in my case, I would always need to explicitly "mark it as sparse" if I wanted to gather using DotLayer.
Maybe that could be the main differentiation, that GatherLayer treats the inputs really as "indices", not so much as "one hot vectors" like the sparse flag would do?

(In the code, we can of course clean this up and use the same code for both GatherLayer and DotLayer.)
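
(To make the "positions" vs. "one-hot" reading concrete, a plain-NumPy illustration, not RETURNN code:)

```python
import numpy as np

positions = np.array([2, 6])                  # indices with a meaningful order

# GatherLayer-style reading: these are positions, so averaging + rounding is fine.
mean_position = int(round(positions.mean()))  # 4

# One-hot / sparse-flag reading: averaging should act on the one-hot vectors,
# which puts 0.5 mass on class 2 and class 6, and nothing at class 4.
one_hot = np.eye(8)[positions]
mean_one_hot = one_hot.mean(axis=0)           # [0, 0, .5, 0, 0, 0, .5, 0]
```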

albertz (Member, Author) commented Nov 8, 2021

> Hm okay, I do see some dangers with sparse tensors now. E.g. if you have a sparse [B,T] tensor, which is then virtually a [B,T,F], and you e.g. reduce the T axis, I think it will currently reduce over the vocab ids. But instead it should behave like reducing the dense one-hot [B,T,F] over T, i.e. give a dense [B,F] which is just the sum of the one-hot vectors. And probably also in many other cases. Like also DotLayer here.

Is this only for sparse tensors interpreted as implicit dense tensors, but not for expansions because of implicit dims via dyn_size_ext? (#721)

But it's also a matter of interpretation and definition, what the ReduceLayer should do in this case.

Zettelkasten (Member) commented Nov 8, 2021

> Is this only for sparse tensors interpreted as implicit dense tensors, but not for expansions because of implicit dims via dyn_size_ext? (#721)

Ah true, it's actually quite a similar issue to #721. In both cases you need to make all those implicit dims explicit.

> But it's also a matter of interpretation and definition, what the ReduceLayer should do in this case.

My interpretation was that sparse dims should always behave as if they were implicitly dense.
But I have also never used sparse dims before, apart from embeddings / specifying loss targets. We can discuss this (maybe in another issue; this is now much more general than DotLayer).

albertz (Member, Author) commented Nov 8, 2021

> So in my case, I would always need to explicitly "mark it as sparse" if I wanted to gather using DotLayer.
> Maybe that could be the main differentiation, that GatherLayer treats the inputs really as "indices", not so much as "one hot vectors" like the sparse flag would do?

There is some danger in this for DotLayer in situations where it is not obvious to the user whether some output is sparse or dense (whether the sparse flag is set or not). Although when reduce is explicitly specified, maybe not so much. And multiplying int with float would also throw an error.
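
(For reference, a plain-TensorFlow illustration of that last point, not RETURNN-specific: a mixed-dtype matmul fails loudly rather than silently.)

```python
import tensorflow as tf

ids = tf.constant([[1, 3, 2]], dtype=tf.int32)   # sparse-style int input
weights = tf.random.normal([3, 4])               # float32

try:
    tf.matmul(ids, weights)                      # int32 vs float32 is rejected
except Exception as exc:
    print(type(exc).__name__)                    # dtype mismatch raises an error
```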

Note that many (almost all?) layers really just ignore the sparse flag. So layers are not at all consistent in correctly interpreting it as a one-hot dense tensor. LinearLayer is one of the rare exceptions. And the losses (CrossEntropyLoss etc.) also handle it, because sparse targets are very common there.

So regarding the viewpoint of interpreting it as a one-hot dense tensor: if we consider all the layers that don't do this correctly as buggy, and if we were to "fix" them, this could potentially break a lot of configs. And maybe this viewpoint is not always useful, except in some cases. I'm not sure...

When we have #706 and the user somehow specifies the implicit dimensions more explicitly, then this is maybe less of an issue.

albertz (Member, Author) commented Nov 8, 2021

Although, to add, it is also rare that we actually have tensors with sparse=True at all, except for targets. So maybe by now, fixing or changing some layers' behavior on this would not really break anything.

albertz (Member, Author) commented Nov 9, 2021

> Although, to add, it is also rare that we actually have tensors with sparse=True at all, except for targets. So maybe by now, fixing or changing some layers' behavior on this would not really break anything.

Extending this thought, and getting back to the issue: so you say we should do this? I tend to agree.

albertz (Member, Author) commented Nov 21, 2021

Ok, no further comments here, so the conclusion: This should be implemented.

albertz changed the title from "Generalize DotLayer to allow sparse inputs?" to "Generalize DotLayer to allow sparse inputs" on Nov 21, 2021
albertz added commits that referenced this issue Jan 4, 2022