According to the paper, 81-frame clips are extracted from Sekai-Real-HQ and SpatialVID-HQ and then quality-filtered. For each retained clip, Qwen2.5-VL-72B, GroundedSAM2, and MegaSAM provide captions, object masks, depth, and camera poses. These are lifted into background/object point clouds, fitted with 3D Gaussian trajectories, and rendered as background/trajectory maps plus a merged mask, which together constitute the 4D Geometric Control.
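To make sure I understand the first step, here is a minimal sketch of the 81-frame clip extraction as I read it. The stride and all function/variable names are my assumptions, not from the paper (it is unclear to me whether clips overlap):

```python
def extract_clips(num_frames: int, clip_len: int = 81, stride: int = 81):
    """Return (start, end) frame-index pairs for fixed-length clips.

    Assumes non-overlapping windows (stride == clip_len); a trailing
    segment shorter than clip_len is dropped. Both choices are my guesses.
    """
    clips = []
    start = 0
    while start + clip_len <= num_frames:
        clips.append((start, start + clip_len))
        start += stride
    return clips


# e.g. a 200-frame video yields two full 81-frame clips: (0, 81) and (81, 162)
print(extract_clips(200))
```

Each retained clip would then, as I understand it, be passed through the quality filter and the annotation models listed above.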
Are there any plans to open-source the code related to this training data augmentation pipeline?
Looking forward to your response. Thank you!