Something we realized today with @pmeier: even for pure detection tasks where masks aren't needed, the detection training references are still using the masks from COCO, which means that:
- those masks are decoded into images
- those masks get carried through every transform in the pipeline, e.g. here
Both of these are completely wasteful, since masks aren't needed for detection tasks. According to a simple benchmark, this significantly hurts performance.
(Not sure if that applies to Keypoints too, would need to check)
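One way this could be avoided is to strip the segmentation annotations at the dataset level, so the masks are never decoded or transformed in the first place. Below is a minimal sketch, not the current reference implementation: the wrapper class name is hypothetical, and it assumes the downstream target-preparation code tolerates annotations without a `"segmentation"` key and that the transforms callable accepts and returns an `(image, target)` pair.

```python
# Hypothetical sketch: a CocoDetection variant that drops segmentation
# annotations before any target conversion or augmentation runs, so masks
# are never decoded into dense arrays or passed through the transforms.
from torchvision.datasets import CocoDetection


class CocoDetectionNoMasks(CocoDetection):
    """CocoDetection wrapper for pure detection: no segmentation data."""

    def __init__(self, root, annFile, transforms=None):
        # Build the base dataset without transforms so we can strip the
        # annotations first and apply the transforms ourselves afterwards.
        super().__init__(root, annFile)
        self._detection_transforms = transforms

    def __getitem__(self, index):
        image, target = super().__getitem__(index)
        # Remove polygon/RLE data: it would otherwise be decoded into
        # full-size masks and carried through every augmentation.
        target = [
            {k: v for k, v in ann.items() if k != "segmentation"}
            for ann in target
        ]
        if self._detection_transforms is not None:
            image, target = self._detection_transforms(image, target)
        return image, target
```

With the field removed before the target conversion runs, the per-sample work is limited to boxes and labels, which is all a pure detection model consumes.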