Although it may be out of scope, it would be nice to have an example of computing 4bit and 8bit tensors, to save memory bandwidth.