Fix resetting of collections in LST #47656

Merged


ariostas
Contributor

In one of our previous PRs we introduced a bug where two of the collections are not reset at the start of each event. On CPU this works fine, since the device collection is replaced every event and is used directly when getting the output. On GPU, however, getting the output ends up using a stale copy on the host instead of the new one that should have been copied from the device. This PR fixes the issue by resetting both collections at the start of each event.
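To illustrate the failure mode described above, here is a minimal sketch, not the actual LST code. It assumes the host copy is cached lazily and only refreshed when it is empty. All names (`EventSketch`, `resetForNewEvent`, `secondEventOutput`) are hypothetical, and plain `std::vector`s stand in for the device and host collections.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical stand-in: a plain vector plays the role of a collection.
using Collection = std::vector<int>;

// Hypothetical event wrapper; the names are not taken from RecoTracker/LSTCore.
class EventSketch {
public:
  explicit EventSketch(bool onCpu) : onCpu_(onCpu) {}

  // Called at the start of each event. With resetCollections == false both
  // collections are left untouched, which mimics the bug.
  void resetForNewEvent(bool resetCollections) {
    if (resetCollections) {
      device_.reset();
      host_.reset();
    }
  }

  // Stand-in for the reconstruction step: the "device" collection is replaced
  // on every event, whether or not it was reset beforehand.
  void run(int seed) { device_ = Collection{seed, seed + 1}; }

  // On CPU the device collection is returned directly. On GPU a host copy is
  // made only if none is cached, so a leftover copy from the previous event
  // is returned unchanged.
  const Collection& output() {
    assert(device_.has_value());
    if (onCpu_)
      return *device_;
    if (!host_)
      host_ = *device_;  // stands in for the device-to-host copy
    return *host_;
  }

private:
  bool onCpu_;
  std::optional<Collection> device_;
  std::optional<Collection> host_;
};

// Processes two events and returns the output seen for the second one.
Collection secondEventOutput(bool onCpu, bool resetCollections) {
  EventSketch ev(onCpu);
  for (int eventId : {1, 2}) {
    ev.resetForNewEvent(resetCollections);
    ev.run(eventId * 10);
    (void)ev.output();  // read the output once per event, as a consumer would
  }
  return ev.output();
}
```

In this sketch, `secondEventOutput(false, false)` (GPU path without the reset) returns the first event's data, while the CPU path and the GPU path with the reset both return the second event's data.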

@cmsbuild
Contributor

cmsbuild commented Mar 21, 2025

cms-bot internal usage

@cmsbuild
Contributor

A new Pull Request was created by @ariostas for master.

It involves the following packages:

  • RecoTracker/LSTCore (reconstruction)

@cmsbuild, @jfernan2, @mandrenguyen can you please review it and eventually sign? Thanks.
@GiacomoSguazzoni, @VinInn, @VourMa, @dgulhan, @felicepantaleo, @gpetruc, @missirol, @mmusich, @mtosi, @rovere this is something you requested to watch as well.
@antoniovilela, @mandrenguyen, @rappoccio, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@slava77
Contributor

slava77 commented Mar 21, 2025

test parameters:

  • enable_tests = gpu
  • workflows_gpu = 29634.704,29834.704
  • workflows = 29634.703,29834.703
  • relvals_opt = -w upgrade,standard
  • relvals_opt_gpu = -w upgrade,standard

@slava77
Contributor

slava77 commented Mar 21, 2025

@cmsbuild please test

@cmsbuild
Contributor

-1

Failed Tests: RelVals-CUDA
Size: This PR adds an extra 28KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-29c101/45135/summary.html
COMMIT: 2ba7b5c
CMSSW: CMSSW_15_1_X_2025-03-21-1100/el8_amd64_gcc12
Additional Tests: CUDA,ROCM
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/47656/45135/install.sh to create a dev area with all the needed externals and cmssw changes.

RelVals-CUDA

  • 12834.423: 12834.423_TTbar_14TeV+2024_Patatrack_HCALOnlyGPUandAlpaka_Validation/step2_TTbar_14TeV+2024_Patatrack_HCALOnlyGPUandAlpaka_Validation.log
  • 12834.422: 12834.422_TTbar_14TeV+2024_Patatrack_HCALOnlyAlpaka_Validation/step2_TTbar_14TeV+2024_Patatrack_HCALOnlyAlpaka_Validation.log
  • 12834.406: 12834.406_TTbar_14TeV+2024_Patatrack_PixelOnlyTripletsAlpaka/step2_TTbar_14TeV+2024_Patatrack_PixelOnlyTripletsAlpaka.log

Comparison Summary

Summary:

ROCM Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 101 differences found in the comparisons
  • DQMHistoTests: Total files compared: 9
  • DQMHistoTests: Total histograms compared: 117389
  • DQMHistoTests: Total failures: 2692
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 114697
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 8 files compared)
  • Checked 32 log files, 36 edm output root files, 9 DQM output files

@slava77
Contributor

slava77 commented Mar 22, 2025

Failed Tests: RelVals-CUDA

e.g. in 12834.423

src/HeterogeneousCore/CUDAServices/plugins/CUDAService.cc, line 193:
nvmlCheck(nvmlInitWithFlags(NVML_INIT_FLAG_NO_GPUS | NVML_INIT_FLAG_NO_ATTACH));
NVML Error 18: Driver/library version mismatch

looks like a node mixup.
@smuzaffar @iarspider please check

I'll just restart the tests for now

@slava77
Contributor

slava77 commented Mar 22, 2025

@cmsbuild please test

@cmsbuild
Contributor

+1

Size: This PR adds an extra 28KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-29c101/45149/summary.html
COMMIT: 2ba7b5c
CMSSW: CMSSW_15_1_X_2025-03-22-1100/el8_amd64_gcc12
Additional Tests: CUDA,ROCM
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/47656/45149/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

CUDA Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 153 differences found in the comparisons
  • DQMHistoTests: Total files compared: 9
  • DQMHistoTests: Total histograms compared: 117389
  • DQMHistoTests: Total failures: 3350
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 114039
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 8 files compared)
  • Checked 32 log files, 36 edm output root files, 9 DQM output files
  • TriggerResults: no differences found

ROCM Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 132 differences found in the comparisons
  • DQMHistoTests: Total files compared: 9
  • DQMHistoTests: Total histograms compared: 117389
  • DQMHistoTests: Total failures: 3044
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 114345
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 8 files compared)
  • Checked 32 log files, 36 edm output root files, 9 DQM output files

@jfernan2
Contributor

type bug-fix

@jfernan2
Contributor

+1

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @sextonkennedy, @antoniovilela, @mandrenguyen, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

@slava77
Contributor

slava77 commented Mar 25, 2025

@cms-sw/orp-l2
this has been fully signed for two days.
Please clarify if something else is needed before the PR can be merged.
Thank you.

@mandrenguyen
Contributor

+1

@cmsbuild cmsbuild merged commit a46bc8b into cms-sw:master Mar 25, 2025
18 checks passed