Skip to content

Add NULL checks after malloc and improve diagnostic logging in shadow pool functions#2188

Open
Pritiks23 wants to merge 1 commit into
NVIDIA:devfrom
Pritiks23:combine-prs
Open

Add NULL checks after malloc and improve diagnostic logging in shadow pool functions#2188
Pritiks23 wants to merge 1 commit into
NVIDIA:devfrom
Pritiks23:combine-prs

Conversation

@Pritiks23

Copy link
Copy Markdown

Description

This PR targets dev and carries the latest branch state from Pritiks23:combine-prs, rebased to latest upstream/dev and containing only the allocator hardening change in src/allocator.cc.

Changes included:

  • Added NULL checks after malloc and improved cleanup/error handling in ncclShadowPoolAlloc
  • Improved diagnostic WARN logging in ncclShadowPoolFree

This is opened in addition to #2175 (left open as requested).

Signed-off-by: Pritika Vipin 65793273+Pritiks23@users.noreply.github.com

Signed-off-by: Pritika Vipin <65793273+Pritiks23@users.noreply.github.com>
@Pritiks23

Copy link
Copy Markdown
Author

@justus-nv New PR to dev branch

@justus-nv

Copy link
Copy Markdown
Contributor

/mirror dev

Comment thread src/allocator.cc
Comment on lines -14 to -15
NCCL_PARAM(ShadowMempoolMaxSize, "SHADOW_MEMPOOL_MAX_SIZE", 1LL << 30);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this removed? It seems out of scope.

Comment thread src/allocator.cc
props.handleTypes = cudaMemHandleTypeNone;
props.location.type = cudaMemLocationTypeDevice;
cudaGetDevice(&props.location.id);
props.maxSize = (size_t)ncclParamShadowMempoolMaxSize();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here? Why was this removed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants