I have set up an OMIP RYF8485 experiment with BLOM upgraded to tag v1.12.0, otherwise using the noresm3_0_beta02 code base. It crashed in month 4, apparently with a segmentation fault at line 166 of mod_vertinterp.F90.
Case folder on Betzy: /cluster/projects/nn9560k/matsbn/NorESM/cases/NOIIAJRARYF8485OC_TL319_tn14_beta02_blom_v1.12.0_20250904
Run folder: /cluster/work/users/matsbn/noresm/NOIIAJRARYF8485OC_TL319_tn14_beta02_blom_v1.12.0_20250904/run
The initial crash is logged in cesm.log.1212438.250904-222734. I then created an identical single node setup (/cluster/projects/nn9560k/matsbn/NorESM/cases/NOIIAJRARYF8485OC_TL319_tn14_beta02_blom_v1.12.0_1node_20250904) to be able to explore the crash more efficiently in the --qos=devel queue. With monthly restarts, the single node experiment was bit-identical to the original 5 node case until month 3, and seemed to remain identical when progressing into month 4. However, the single node case completed month 4 without crashing.
Suspecting Betzy issues, I started the 5 node experiment again. It crashed in month 4 again, reporting a segmentation fault at line 166 of mod_vertinterp.F90 (cesm.log.1213717.250905-114003), but at a different time step. So the crash does not seem to be reproducible, and the line where the segmentation fault is reported seems strange:
164   do j = 1,jj
165     do l = 1,isp(j)
166       do i = max(1,ifp(j,l)),min(ii,ilp(j,l))
167         do d = ind1(i,j),ind2(i,j)
168           fld_out(i,j,d) = fld_out(i,j,d) + fld_in(i,j,k)*wghts(i,j,d)*baclin
169         end do
170       end do
171     end do
172   end do
Line 166 is part of the i-loop logic, which is used throughout BLOM.
Not really related to the crash, but something that came to mind when I started investigating this: why is ocn2glc_coupling = .true. in a case where these coupling fields are not used? Does this add any significant communication overhead? Anyway, I have resubmitted the 5 node case with ocn2glc_coupling = .false. set in user_nl_cpl, to see if it has any impact on the crash and on performance (it will likely be in the queue for a while).
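For reference, the change in user_nl_cpl amounts to a single line. This is a minimal fragment showing only the setting mentioned above; any other entries in the file are left out, and the comment line is added here for illustration.

```fortran
! user_nl_cpl for the resubmitted 5 node case: turn off the ocean-to-glc coupling fields
ocn2glc_coupling = .false.
```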