SAM 2 Update 12/11/2024 -- full model compilation for a major VOS speedup and a new SAM2VideoPredictor to better handle multi-object tracking #486
Merged
Commits:

- …ng accuracy regressions
- speed optimizations cleanup
- switch to a new implementation of the class `SAM2VideoPredictor` for per-object inference (sam2/sam2_video_predictor.py)

  In this PR, we switch to a new implementation of the `SAM2VideoPredictor` class in sam2/sam2_video_predictor.py, which allows for independent per-object inference. Specifically, the new `SAM2VideoPredictor`:
  * handles the inference of each object separately, as if we were opening a separate session for each object;
  * relaxes the assumptions on prompting:
    * previously, if a frame received clicks for only a subset of objects, the remaining (non-prompted) objects were assumed to be non-existent in that frame;
    * now, if a frame receives clicks for only a subset of objects, we make no assumptions about the remaining (non-prompted) objects;
  * allows adding new objects after tracking starts (see the sketch after this commit list).

  (The previous implementation is backed up as `SAM2VideoPredictor` in sam2/sam2_video_predictor_legacy.py.)

  Also fixes a small typo in the doc: `APP_URL` => `API_URL`.

  Test plan: tested with the predictor notebook `notebooks/video_predictor_example.ipynb` and the VOS script `tools/vos_inference.py`; also tested with the demo.
- …with `torch.compile`; update README.md
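To make the new per-object behavior concrete, here is a minimal sketch of prompting a second object after tracking has already started, based on the notebook-style API (`init_state`, `add_new_points_or_box`, `propagate_in_video`); the paths, frame indices, and click coordinates are placeholders:

```python
import numpy as np
import torch

from sam2.build_sam import build_sam2_video_predictor

# Placeholder config/checkpoint/video paths -- substitute your own.
predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_l.yaml", "./checkpoints/sam2.1_hiera_large.pt"
)

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    state = predictor.init_state(video_path="./videos/example.mp4")

    # Prompt object 1 on frame 0 with a single positive click.
    predictor.add_new_points_or_box(
        state, frame_idx=0, obj_id=1,
        points=np.array([[210, 350]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )

    # Track object 1 through the video.
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        pass  # consume per-object masks as needed

    # With per-object inference, a new object can be prompted *after*
    # tracking has started -- the legacy predictor disallowed this.
    predictor.add_new_points_or_box(
        state, frame_idx=30, obj_id=2,
        points=np.array([[120, 80]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )

    # Re-propagate; object 2 is tracked without affecting object 1.
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        pass
```

Note that the new prompt for object 2 does not disturb object 1's results, since each object is now handled as an independent session over shared backbone features.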
chayryali approved these changes on Dec 11, 2024.
raedle added a commit that referenced this pull request on Dec 25, 2024:
Summary: PR #486 changed propagation to track each object independently, which allows adding objects after the initial propagation. The frontend had a constraint that prevented users from adding new objects after the initial "tracking". Now that the model supports tracking objects independently, we can remove this constraint from the UI.

Test plan: `cd demo/frontend`, then `yarn lint`:

```
(base) ➜ demo/frontend $ yarn lint ⎇ remotes/origin/HEAD*
yarn run v1.18.0
$ eslint . --ext ts,tsx --report-unused-disable-directives --max-warnings 0
✨ Done in 16.98s.
```
mahmoudzaouali pushed a commit to mahmoudzaouali/sam2 that referenced this pull request on Jul 31, 2025:
SAM 2 Update 12/11/2024 -- full model compilation for a major VOS speedup and a new SAM2VideoPredictor to better handle multi-object tracking (facebookresearch#486)

This PR provides new features and updates for SAM 2:

- We now support `torch.compile` of the entire SAM 2 model on videos, which can be turned on by setting `vos_optimized=True` in `build_sam2_video_predictor` (it uses the new `SAM2VideoPredictorVOS` predictor class in `sam2/sam2_video_predictor.py`).
  * Compared to the previous setting (which only compiles the image encoder backbone), the new full model compilation gives a major speedup in inference FPS.
  * In the VOS prediction script `tools/vos_inference.py`, you can specify this option via the `--use_vos_optimized_video_predictor` flag.
  * Note that turning on this flag might introduce a small variance in the predictions due to numerical differences caused by `torch.compile` of the full model.
  * **PyTorch 2.5.1 is the minimum version for full support of this feature.** (Earlier PyTorch versions might run into compilation errors in some cases.) Therefore, we have updated the minimum PyTorch version to 2.5.1 accordingly in the installation scripts.
- We also update the implementation of the `SAM2VideoPredictor` class for SAM 2 video prediction in `sam2/sam2_video_predictor.py`, which allows for independent per-object inference. Specifically, in the new `SAM2VideoPredictor`:
  * Now **we handle the inference of each object independently** (as if we were opening a separate session for each object) while sharing their backbone features.
  * This change allows us to relax the assumption of prompting for multi-object tracking. Previously (due to the batching behavior in inference), if a video frame received clicks for only a subset of objects, the rest of the (non-prompted) objects were assumed to be non-existent in this frame (i.e., in such frames, the user was telling SAM 2 that the rest of the objects don't appear). Now, if a frame receives clicks for only a subset of objects, we do not make any assumptions about the remaining (non-prompted) objects (i.e., each object is handled independently and is not affected by how other objects are prompted). As a result, **we allow adding new objects after tracking starts**, which was previously a restriction on usage.
  * We believe that the new version is a more natural inference behavior and have therefore switched to it as the default. The previous implementation of `SAM2VideoPredictor` is backed up in `sam2/sam2_video_predictor_legacy.py`. All the VOS inference results using `tools/vos_inference.py` should remain the same after this change to the `SAM2VideoPredictor` class.
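As a usage illustration, the following minimal sketch shows how the optimized predictor might be constructed; the config, checkpoint, and video paths are placeholders:

```python
import torch

from sam2.build_sam import build_sam2_video_predictor

# vos_optimized=True selects the new SAM2VideoPredictorVOS class, which
# runs torch.compile over the full model (requires PyTorch >= 2.5.1).
predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_l.yaml",    # placeholder config path
    "./checkpoints/sam2.1_hiera_large.pt",   # placeholder checkpoint path
    vos_optimized=True,
)

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    state = predictor.init_state(video_path="./videos/example.mp4")
    # Prompt and propagate as usual; expect the first iterations to be slow
    # while compilation warms up, and small numerical differences vs. the
    # uncompiled model.
```

The same code path is exercised from the command line by passing `--use_vos_optimized_video_predictor` to `tools/vos_inference.py`.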