BLOM segmentation fault in mod_vertinterp.F90 #655

@matsbn

I have set up an OMIP RYF8485 experiment with BLOM upgraded to tag v1.12.0, otherwise using the noresm3_0_beta02 code base. It crashed in month 4, apparently with a segmentation fault at line 166 of mod_vertinterp.F90.

Case folder on Betzy: /cluster/projects/nn9560k/matsbn/NorESM/cases/NOIIAJRARYF8485OC_TL319_tn14_beta02_blom_v1.12.0_20250904

Run folder: /cluster/work/users/matsbn/noresm/NOIIAJRARYF8485OC_TL319_tn14_beta02_blom_v1.12.0_20250904/run

The initial crash is logged in cesm.log.1212438.250904-222734. I then made an identical single-node setup (/cluster/projects/nn9560k/matsbn/NorESM/cases/NOIIAJRARYF8485OC_TL319_tn14_beta02_blom_v1.12.0_1node_20250904) to explore the crash more efficiently in the --qos=devel queue. With monthly restarts, the single-node experiment was bit-identical to the original 5-node case up to month 3, and seemed to remain identical when progressing into month 4. However, the single-node case completed month 4 without crashing.

Suspecting Betzy issues, I started the 5-node experiment again. It crashed in month 4 again, reporting a segmentation fault at line 166 of mod_vertinterp.F90 (cesm.log.1213717.250905-114003), but at a different time step. So the crash does not seem to be reproducible, and the line where the segmentation fault is reported seems strange:

164      do j = 1,jj
165         do l = 1,isp(j)
166            do i = max(1,ifp(j,l)),min(ii,ilp(j,l))
167               do d = ind1(i,j),ind2(i,j)
168                  fld_out(i,j,d) = fld_out(i,j,d) + fld_in(i,j,k)*wghts(i,j,d)*baclin
169               end do
170            end do
171         end do
172      end do

This is in the i-loop logic, which is used throughout BLOM.
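One way to narrow this down, assuming (and the logs do not establish this) that the fault comes from an out-of-range d index rather than from the i-loop bounds themselves, would be to validate ind1/ind2 against the extent of fld_out before accumulating. The stand-alone sketch below is hypothetical: the array names mirror the excerpt above, but the sizes, the values, the deliberately corrupted entry, and the simplified i-loop (no isp/ifp/ilp masking) are illustrative assumptions, not BLOM code.

```fortran
program check_vertinterp_bounds
   ! Hypothetical stand-alone sketch. Array names mirror the excerpt from
   ! mod_vertinterp.F90, but all sizes and values here are made up.
   implicit none
   integer, parameter :: ii = 4, jj = 3, kk = 2, ddm = 5
   real, parameter :: baclin = 1.0
   integer :: ind1(ii,jj), ind2(ii,jj)
   real :: fld_in(ii,jj,kk), fld_out(ii,jj,ddm), wghts(ii,jj,ddm)
   integer :: i, j, k, d, nbad

   ind1 = 1
   ind2 = ddm
   ind2(2,2) = ddm + 1   ! deliberately corrupt one entry for the demonstration
   fld_in = 1.0
   wghts = 0.5
   fld_out = 0.0
   nbad = 0
   k = 1

   do j = 1, jj
      do i = 1, ii       ! simplified i-loop: no isp/ifp/ilp masking here
         ! Report and skip any column whose d-range falls outside fld_out.
         if (ind1(i,j) < lbound(fld_out,3) .or. &
             ind2(i,j) > ubound(fld_out,3)) then
            nbad = nbad + 1
            write(*,'(a,4(1x,i0))') 'bad d-range at i,j,ind1,ind2:', &
               i, j, ind1(i,j), ind2(i,j)
            cycle
         end if
         do d = ind1(i,j), ind2(i,j)
            fld_out(i,j,d) = fld_out(i,j,d) + fld_in(i,j,k)*wghts(i,j,d)*baclin
         end do
      end do
   end do

   write(*,'(a,1x,i0)') 'number of bad columns:', nbad
end program check_vertinterp_bounds
```

A check of this kind only helps if the index arrays are actually at fault. Because the crash is not reproducible, memory corruption elsewhere that merely surfaces at line 166 remains equally plausible.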

Not directly related to the crash, but something that came to mind while investigating: why is ocn2glc_coupling = .true. in a case where these coupling fields are not used? Does this add any significant communication overhead? In any case, I have resubmitted the 5-node case with ocn2glc_coupling = .false. set in user_nl_cpl, to see whether it has any impact on the crash or on performance (it will likely be in the queue for a while).
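For reference, the change described above amounts to a single namelist line in the case's user_nl_cpl. The comment line is added here for illustration only.

```fortran
! user_nl_cpl: switch off ocean-to-land-ice coupling fields for this test
ocn2glc_coupling = .false.
```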

Labels: bug
