@BenWibking
Collaborator

@BenWibking BenWibking commented Oct 24, 2025

This problem is designed to go through (nearly) the full range of densities/temperatures we encounter in a hydro sim without doing hydro.

Prints relative L_1, L_2, and L_inf error norms of the GPU solution wrt the CPU solution.
(All error norms are relative to the CPU solution norm.)
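For reference, a minimal sketch of how such relative norms could be computed. It assumes each norm of the difference (GPU minus CPU) is divided by the same norm of the CPU solution; the function name `relative_error_norms` and the toy zone values are hypothetical and are not taken from the test code in this PR.

```python
import math


def relative_error_norms(cpu, gpu):
    """Return (L1, L2, L_inf) norms of gpu - cpu, each divided by the
    corresponding norm of the cpu values.

    Assumes both sequences hold one value per zone, are non-empty, and
    that the cpu values are not all zero (otherwise the ratio is undefined).
    """
    if len(cpu) != len(gpu):
        raise ValueError("cpu and gpu must have the same number of zones")

    diff = [g - c for c, g in zip(cpu, gpu)]

    def l1(v):
        return sum(abs(x) for x in v)

    def l2(v):
        return math.sqrt(sum(x * x for x in v))

    def linf(v):
        return max(abs(x) for x in v)

    return tuple(norm(diff) / norm(cpu) for norm in (l1, l2, linf))


# Toy data only (three made-up zones), not values from the burn test.
l1, l2, linf = relative_error_norms([1.0, 2.0, 4.0], [1.0, 2.5, 3.0])
print(f"L1 = {l1:.6g}, L2 = {l2:.6g}, L_inf = {linf:.6g}")
```

The numbers reported further down appear consistent with this definition: for xn(D) under rocm/6.3.1, (16.75658685 - 0.11817723) / 16.75658685 is about 0.99295, which agrees with the reported L_inf of 0.9929474164 if the largest CPU value also sits in zone 22. Because each norm is scaled by the CPU norm of the same type, values above 1 are possible, and the relative L_1 can exceed the relative L_inf.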

@BenWibking
Collaborator Author

On Frontier with rocm/6.4.3, I get this output:

starting the multi-zone burn...
  CPU multi-zone burn complete; zones failed = 0
  CPU multi-zone temperature range: [3032.437486, 3033.069765]
  CPU multi-zone retries applied: 1
  GPU multi-zone burn complete; zones failed = 0
  GPU multi-zone temperature range: [3029.019663, 3037.23735]
  GPU multi-zone retries applied: 33
  zone 0 T mismatch: CPU = 3032.689294, GPU = 3032.815841

Re-running now to compare with the "known good" rocm/6.3.1...

@BenWibking
Collaborator Author

rocm/6.3.1:

starting the multi-zone burn...
CPU multi-zone burn complete; zones failed = 0
CPU multi-zone temperature range: [3032.437486, 3033.069765]
CPU multi-zone retries applied: 1
GPU multi-zone burn complete; zones failed = 0
GPU multi-zone temperature range: [3033.179406, 3033.306148]
GPU multi-zone retries applied: 39
zone 0 T mismatch: CPU = 3032.689294, GPU = 3033.210804

I need to compute a better norm to compare them, but it looks like 6.4 is bad for us as well :/

@zingale
Member

zingale commented Oct 24, 2025

to make sure the integrator comparison is the same, run with integrator.use_jacobian_caching=0

@zingale zingale changed the base branch from main to development October 24, 2025 16:08
@BenWibking
Collaborator Author

@zingale Jacobian caching is off.

This problem is designed to go through (nearly) the full range of densities/temperatures we encounter in a hydro sim without doing hydro. All error norms are relative to the CPU solution norm.

Using rocm/6.3.1 (this is a "minimum acceptable" level of agreement -- I still don't know why the GPU burn fails so much):

starting the multi-zone burn...
CPU multi-zone burn complete; zones failed = 0
CPU multi-zone temperature range: [3032.437486, 3033.069765]
CPU multi-zone retries applied: 1
CPU multi-zone burn walltime (s): 22.2455658
GPU multi-zone burn complete; zones failed = 0
GPU multi-zone temperature range: [3033.179406, 3033.306148]
GPU multi-zone retries applied: 39
GPU multi-zone burn walltime (s): 148.3088752

CPU/GPU multi-zone difference norms:
  temperature: L1 = 0.0001577875179, L2 = 0.0001618865356, L_inf = 0.0002501927092
  number_density(E): L1 = 0.005209218474, L2 = 0.006753860604, L_inf = 0.03300111125
  number_density(Hp): L1 = 0.005209488338, L2 = 0.006754078937, L_inf = 0.03300177472
  number_density(H): L1 = 0.001276108352, L2 = 0.001309240176, L_inf = 0.002021783129
  number_density(Hm): L1 = 0.006037646397, L2 = 0.008004853872, L_inf = 0.03363601704
  number_density(Dp): L1 = 0.9773793951, L2 = 0.9949503469, L_inf = 0.9929392971
  number_density(D): L1 = 0.977414706, L2 = 0.9949523714, L_inf = 0.9929474164
  number_density(H2p): L1 = 0.004850787467, L2 = 0.006439096462, L_inf = 0.03233377457
  number_density(Dm): L1 = 0.9781727897, L2 = 0.9951584351, L_inf = 0.9929340037
  number_density(H2): L1 = 0.0003055037608, L2 = 0.0003134353239, L_inf = 0.0004843801255
  number_density(HDp): L1 = 0.9775253048, L2 = 0.9949493385, L_inf = 0.9929363062
  number_density(HD): L1 = 0.9774503449, L2 = 0.9949603367, L_inf = 0.992959557
  number_density(HEpp): L1 = 4.742830668, L2 = 3.440721513, L_inf = 3.178251298
  number_density(HEp): L1 = 0.02565528324, L2 = 0.1276112673, L_inf = 0.9903962888
  number_density(HE): L1 = 6.146444201e-08, L2 = 8.942926766e-08, L_inf = 3.738828979e-07

CPU/GPU mismatches detected:
  field T mismatched in 128 zone(s); largest difference at zone 7 (CPU = 3032.437486, GPU = 3033.196338)
  field e mismatched in 128 zone(s); largest difference at zone 7 (CPU = 2.721201708e+11, GPU = 2.722070711e+11)
  field xn(D) mismatched in 128 zone(s); largest difference at zone 22 (CPU = 16.75658685, GPU = 0.11817723)
  field xn(Dp) mismatched in 7 zone(s); largest difference at zone 22 (CPU = 3.174138657e-12, GPU = 2.241165001e-14)
  field xn(E) mismatched in 128 zone(s); largest difference at zone 104 (CPU = 19331.7002, GPU = 20000.05813)
  field xn(H) mismatched in 128 zone(s); largest difference at zone 7 (CPU = 1.616812031e+17, GPU = 1.620086389e+17)
  field xn(H2) mismatched in 128 zone(s); largest difference at zone 7 (CPU = 3.379982657e+17, GPU = 3.378345461e+17)
  field xn(H2p) mismatched in 128 zone(s); largest difference at zone 104 (CPU = 15.71035496, GPU = 16.24229929)
  field xn(HD) mismatched in 128 zone(s); largest difference at zone 22 (CPU = 160.5250696, GPU = 1.130167607)
  field xn(HE) mismatched in 128 zone(s); largest difference at zone 99 (CPU = 6.49133784e+16, GPU = 6.491340267e+16)
  field xn(HEp) mismatched in 2 zone(s); largest difference at zone 104 (CPU = 1e-100, GPU = 6.406499605e-12)
  field xn(Hm) mismatched in 128 zone(s); largest difference at zone 104 (CPU = 1.727013205, GPU = 1.789152394)
  field xn(Hp) mismatched in 128 zone(s); largest difference at zone 104 (CPU = 19317.71686, GPU = 19985.60498)

Using rocm/6.4.2 (slightly worse overall, but a lot worse for some species):

starting the multi-zone burn...
CPU multi-zone burn complete; zones failed = 0
CPU multi-zone temperature range: [3032.437486, 3033.069765]
CPU multi-zone retries applied: 1
CPU multi-zone burn walltime (s): 22.2893873
GPU multi-zone burn complete; zones failed = 0
GPU multi-zone temperature range: [3029.019663, 3037.23735]
GPU multi-zone retries applied: 33
GPU multi-zone burn walltime (s): 73.56593662

CPU/GPU multi-zone difference norms:
  temperature: L1 = 9.056564017e-05, L2 = 0.0002005176156, L_inf = 0.001545567878
  number_density(E): L1 = 0.003980398191, L2 = 0.005534918028, L_inf = 0.02641261182
  number_density(Hp): L1 = 0.00398043237, L2 = 0.005534968643, L_inf = 0.02641309697
  number_density(H): L1 = 0.0007323882078, L2 = 0.001622249965, L_inf = 0.01252560215
  number_density(Hm): L1 = 0.004824476122, L2 = 0.007289587544, L_inf = 0.0361106126
  number_density(Dp): L1 = 10.53935966, L2 = 48.17332479, L_inf = 92.24618187
  number_density(D): L1 = 10.68886968, L2 = 48.998352, L_inf = 93.88338289
  number_density(H2p): L1 = 0.003899314924, L2 = 0.005438905368, L_inf = 0.0260830975
  number_density(Dm): L1 = 11.31745067, L2 = 50.84429965, L_inf = 93.06558273
  number_density(H2): L1 = 0.0001753435257, L2 = 0.0003883778746, L_inf = 0.003000808856
  number_density(HDp): L1 = 10.51675715, L2 = 48.34467731, L_inf = 92.57997537
  number_density(HD): L1 = 10.55548406, L2 = 48.29096892, L_inf = 92.51377107
  number_density(HEpp): L1 = 1.014233689, L2 = 0.8154069422, L_inf = 0.7650545114
  number_density(HEp): L1 = 0.1078060008, L2 = 0.3210079961, L_inf = 0.9875212077
  number_density(HE): L1 = 1.565261702e-07, L2 = 3.53309392e-07, L_inf = 2.963830149e-06

CPU/GPU mismatches detected:
  field T mismatched in 128 zone(s); largest difference at zone 124 (CPU = 3032.549534, GPU = 3037.23735)
  field e mismatched in 128 zone(s); largest difference at zone 124 (CPU = 2.721329956e+11, GPU = 2.72670329e+11)
  field xn(D) mismatched in 128 zone(s); largest difference at zone 124 (CPU = 2.397211316, GPU = 1575.562271)
  field xn(Dp) mismatched in 8 zone(s); largest difference at zone 124 (CPU = 4.479440013e-13, GPU = 2.932501159e-10)
  field xn(E) mismatched in 128 zone(s); largest difference at zone 104 (CPU = 19331.7002, GPU = 19866.62396)
  field xn(H) mismatched in 128 zone(s); largest difference at zone 124 (CPU = 1.617295325e+17, GPU = 1.637581034e+17)
  field xn(H2) mismatched in 128 zone(s); largest difference at zone 124 (CPU = 3.379740888e+17, GPU = 3.369598206e+17)
  field xn(H2p) mismatched in 128 zone(s); largest difference at zone 104 (CPU = 15.71035496, GPU = 16.13946526)
  field xn(HD) mismatched in 128 zone(s); largest difference at zone 124 (CPU = 22.9727526, GPU = 14873.75229)
  field xn(HE) mismatched in 128 zone(s); largest difference at zone 98 (CPU = 6.491340265e+16, GPU = 6.491359504e+16)
  field xn(HEp) mismatched in 13 zone(s); largest difference at zone 57 (CPU = 6.387901791e-12, GPU = 1.000000088e-100)
  field xn(Hm) mismatched in 128 zone(s); largest difference at zone 53 (CPU = 1.847400329, GPU = 1.780689571)
  field xn(Hp) mismatched in 128 zone(s); largest difference at zone 104 (CPU = 19317.71686, GPU = 19852.26368)

Using rocm/7.0.2 (actually looks OK for us):

starting the multi-zone burn...
CPU multi-zone burn complete; zones failed = 0
CPU multi-zone temperature range: [3032.437486, 3033.069765]
CPU multi-zone retries applied: 1
CPU multi-zone burn walltime (s): 21.61658048
GPU multi-zone burn complete; zones failed = 0
GPU multi-zone temperature range: [3032.180939, 3033.197728]
GPU multi-zone retries applied: 24
GPU multi-zone burn walltime (s): 54.81239564

CPU/GPU multi-zone difference norms:
  temperature: L1 = 4.403323795e-05, L2 = 5.887483029e-05, L_inf = 0.0002261898486
  number_density(E): L1 = 0.004475270452, L2 = 0.006215328124, L_inf = 0.02840464773
  number_density(Hp): L1 = 0.004475191463, L2 = 0.006215284227, L_inf = 0.02840558136
  number_density(H): L1 = 0.0003560075324, L2 = 0.0004759775774, L_inf = 0.001826624112
  number_density(Hm): L1 = 0.005069360699, L2 = 0.007530198415, L_inf = 0.03478690901
  number_density(Dp): L1 = 1.221532946, L2 = 1.06892868, L_inf = 0.9835793297
  number_density(D): L1 = 1.220648686, L2 = 1.067929547, L_inf = 0.9835910807
  number_density(H2p): L1 = 0.004427627635, L2 = 0.006160110736, L_inf = 0.02803956417
  number_density(Dm): L1 = 1.283818822, L2 = 1.09576158, L_inf = 0.9835746268
  number_density(H2): L1 = 8.523078908e-05, L2 = 0.0001139545031, L_inf = 0.0004377261334
  number_density(HDp): L1 = 1.221549719, L2 = 1.068941756, L_inf = 0.9835783071
  number_density(HD): L1 = 1.220564661, L2 = 1.067872824, L_inf = 0.9835951946
  number_density(HEpp): L1 = 1.013570425, L2 = 0.814533011, L_inf = 0.7650545114
  number_density(HEp): L1 = 0.1711070716, L2 = 0.4084160359, L_inf = 0.9912271414
  number_density(HE): L1 = 1.090274196e-07, L2 = 1.50601551e-07, L_inf = 5.782735312e-07

CPU/GPU mismatches detected:
  field T mismatched in 128 zone(s); largest difference at zone 55 (CPU = 3032.866989, GPU = 3032.180939)
  field e mismatched in 128 zone(s); largest difference at zone 55 (CPU = 2.721693405e+11, GPU = 2.720908055e+11)
  field xn(D) mismatched in 128 zone(s); largest difference at zone 22 (CPU = 16.75658685, GPU = 0.2749574817)
  field xn(Dp) mismatched in 9 zone(s); largest difference at zone 22 (CPU = 3.174138657e-12, GPU = 5.212148444e-14)
  field xn(E) mismatched in 128 zone(s); largest difference at zone 104 (CPU = 19331.7002, GPU = 19906.96784)
  field xn(H) mismatched in 128 zone(s); largest difference at zone 55 (CPU = 1.618664577e+17, GPU = 1.615706287e+17)
  field xn(H2) mismatched in 128 zone(s); largest difference at zone 55 (CPU = 3.379056131e+17, GPU = 3.380535638e+17)
  field xn(H2p) mismatched in 128 zone(s); largest difference at zone 104 (CPU = 15.71035496, GPU = 16.17165239)
  field xn(HD) mismatched in 128 zone(s); largest difference at zone 22 (CPU = 160.5250696, GPU = 2.633382536)
  field xn(HE) mismatched in 128 zone(s); largest difference at zone 119 (CPU = 6.491340533e+16, GPU = 6.491344287e+16)
  field xn(HEp) mismatched in 21 zone(s); largest difference at zone 1 (CPU = 6.411874076e-12, GPU = 1e-100)
  field xn(Hm) mismatched in 128 zone(s); largest difference at zone 104 (CPU = 1.727013205, GPU = 1.791278553)
  field xn(Hp) mismatched in 128 zone(s); largest difference at zone 104 (CPU = 19317.71686, GPU = 19892.58747)
