Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Infinite loop in nksol.m subroutine model #16

Open
sballin opened this issue Apr 15, 2020 · 9 comments
Open

Infinite loop in nksol.m subroutine model #16

sballin opened this issue Apr 15, 2020 · 9 comments

Comments

@sballin
Copy link

sballin commented Apr 15, 2020

I've been encountering a problem where certain cases "hang". The text output stops, the code continues running, but nothing else happens (for example, the intermediate hdf5 savefiles do not get updated). Here is an example of the output (I'm using rundt.py):

$ python2 runcase.py
 Read file "gridue  " with runid:  FREEGS     09/01/2020        # 0  0ms

File attributes:
('     written on: ', 'Mon Apr 13 17:51:10 2020')
('        by code: ', 'UEDGE')
('    physics tag: ', array(['$Name:  $'], dtype='|S80'))
 UEDGE $Name:  $
*** For isimpon=2, set afracs, not afrac ***
 Read file "gridue  " with runid:  FREEGS     09/01/2020        # 0  0ms

  Updating Jacobian, npe =                      1
 iter=    0 fnrm=     0.2571840422417168     nfe=      1


 nksol ---  iterm = 1.
            maxnorm(sf*f(u)) .le. ftol, where maxnorm() is
            the maximum norm function.  u is probably an
            approximate root of f.
*---------------------------------------------------------*
Need to take initial step with Jacobian; trying to do here
*---------------------------------------------------------*
*** For isimpon=2, set afracs, not afrac ***
 Read file "gridue  " with runid:  FREEGS     09/01/2020        # 0  0ms

  Updating Jacobian, npe =                      1
 iter=    0 fnrm=     0.2571840422417166     nfe=      1


 nksol ---  iterm = 1.
            maxnorm(sf*f(u)) .le. ftol, where maxnorm() is
            the maximum norm function.  u is probably an
            approximate root of f.
initial fnrm =2.5718E-01
--------------------------------------------------------------------
--------------------------------------------------------------------

*** Number time-step changes = 1 New time-step = 1.0000E-10
rundt time elapsed: 0:00:08
--------------------------------------------------------------------
*** For isimpon=2, set afracs, not afrac ***
 iter=    0 fnrm=     0.2571840422417166     nfe=      1
(stays like this forever)
^Z

It is also possible to get the code to continue by compiling without the -Ofast flag in Makefile.Forthon. In this case, you start seeing fnrm= nan but the output continues until dtreal goes below dtkill and the code quits by itself.

I examined the case with -Ofast which hangs. I found that the pandf subroutine was being called many times. By adding some print statements and going up the call stack, I found this infinite loop inside subroutine model in nksol.m:

 10   continue
      ipcur = 0
      write(STDOUT,*) 'sbb if pthrsh=',pthrsh,'gt onept5=',onept5
      write(STDOUT,*) '       and ipflg=',ipflg,'ne 0'
      if ( (pthrsh .gt. onept5) .and. (ipflg .ne. 0) ) then
        ier = 0
        write(STDOUT,*) 'sbb call psetnk'
        call pset (n, u, savf, su, sf, x, f, wm(locwmp), iwm(locimp),
     *             ier)
        npe = npe + 1
        ipcur = 1
        nnipset = nni
        if (ier .ne. 0) then
          iersl = 8
          return
          endif
        endif
c-----------------------------------------------------------------------
c     load x with -f(u).
c-----------------------------------------------------------------------
      do 100 i = 1,n
 100    x(i) = -savf(i)
c-----------------------------------------------------------------------
c     call solpk to solve j*x = -f using the appropriate krylov
c     algorithm.
c-----------------------------------------------------------------------
      call solpk (n,wm,lenwm,iwm,leniwm,u,savf,x,su,sf,f,jac,psol)
      write(STDOUT,*) 'sbb after solpk iersl=', iersl,'ipflg=',ipflg,
     * 'ipcur=',ipcur 
      if (iersl .lt. 0) then
c nonrecoverable error from psol.  set iersl and return.
         iersl = 9
         return
         endif
      if ( (iersl .gt. 0) .and. (ipflg .ne. 0) ) then
        if (ipcur .eq. 0) go to 10
        endif

I also put print statements around pandf1 calls in oderhs.m. The pandf1 calls followed by line numbers (messed up by the additional lines I inserted into the file) correspond to other pandf1 calls in the jac_calc subroutine.

c ... Beginning of execution for call rhsdpk (by daspk), check constraints
      entry rhsdpk (neq, t, yl, yldot, ifail)
      
      if (icflag .gt. 0 .and. t .gt. 0.) then     
         if (icflag .eq. 2) rlxl = rlx
         do 6 i = 1, neq     
            ylchng(i) = yl(i) - ylprevc(i)
 6       continue
         call cnstrt (neq,ylprevc,ylchng,icnstr,tau,rlxl,ifail,ivar) 
         if (ifail .ne. 0) then
            call remark ('***Constraint failure in DASPK, dt reduced***')
            write (*,*) 'variable index = ',ivar,'   time = ',t
            goto 20
         endif
      else
         ifail = 0
      endif
      call scopy (neq, yl, 1, ylprevc, 1)  #put yl into ylprevc 

 8    tloc = t
      write(STDOUT,*) 'sbb pandf1 -1 -1 goto'
      go to 10

c ... Beginning of execution for call rhsnk (by nksol).
      entry rhsnk (neq, yl, yldot)
      tloc = 0.

c ... Calculate right-hand sides for interior and boundary points.
ccc 10   call convsr_vo (-1,-1, yl)  # test new convsr placement
ccc      call convsr_aux (-1,-1, yl) # test new convsr placement
      write(STDOUT,*) 'sbb pandf1 -1 -1 sequential'
 10   call pandf1 (-1, -1, 0, neq, tloc, yl, yldot)

 20   continue
      return
      end

This produces the following output:

$ python2 runcase.py
 Read file "gridue  " with runid:  FREEGS     09/01/2020        # 0  0ms                       

File attributes:
('     written on: ', 'Mon Apr 13 17:51:10 2020')
('        by code: ', 'UEDGE')
('    physics tag: ', array(['$Name:  $'], dtype='|S80'))
 UEDGE $Name:  $                                                                       
*** For isimpon=2, set afracs, not afrac ***
 Read file "gridue  " with runid:  FREEGS     09/01/2020        # 0  0ms                       

 sbb pandf1 -1 -1 sequential
  Updating Jacobian, npe =                      1
 sbb pandf1 8449
 sbb pandf1 8518
 sbb pandf1 8449
 sbb pandf1 8518
 sbb pandf1 8449
 sbb pandf1 8518
(more pandf1 messages...)
 sbb pandf1 8449
 sbb pandf1 8518
 sbb pandf1 8524
 sbb pandf1 -1 -1 sequential
 sbb nksol
 sbb icntnu=                    0
 sbb pandf1 -1 -1 sequential
 iter=    0 fnrm=     0.2571840410398497     nfe=      1


 nksol ---  iterm = 1.
            maxnorm(sf*f(u)) .le. ftol, where maxnorm() is
            the maximum norm function.  u is probably an
            approximate root of f.
 sbb ffun
 sbb pandf1 -1 -1 goto
*---------------------------------------------------------*
Need to take initial step with Jacobian; trying to do here
*---------------------------------------------------------*
*** For isimpon=2, set afracs, not afrac ***
 Read file "gridue  " with runid:  FREEGS     09/01/2020        # 0  0ms                       

 sbb pandf1 -1 -1 sequential
  Updating Jacobian, npe =                      1
 sbb pandf1 8449
 sbb pandf1 8518
 sbb pandf1 8449
 sbb pandf1 8518
 sbb pandf1 8449
 sbb pandf1 8518
(more pandf1 messages...)
 sbb pandf1 8518
 sbb pandf1 8449
 sbb pandf1 8518
 sbb pandf1 8449
 sbb pandf1 8518
 sbb pandf1 8449
 sbb pandf1 8518
 sbb pandf1 8524
 sbb pandf1 -1 -1 sequential
 sbb nksol
 sbb icntnu=                    0
 sbb pandf1 -1 -1 sequential
 iter=    0 fnrm=     0.2571840410398498     nfe=      1


 nksol ---  iterm = 1.
            maxnorm(sf*f(u)) .le. ftol, where maxnorm() is
            the maximum norm function.  u is probably an
            approximate root of f.
 sbb ffun
 sbb pandf1 -1 -1 goto
initial fnrm =2.5718E-01
--------------------------------------------------------------------
--------------------------------------------------------------------
 
*** Number time-step changes = 1 New time-step = 1.0000E-10
rundt time elapsed: 0:00:10
--------------------------------------------------------------------
*** For isimpon=2, set afracs, not afrac ***
 sbb pandf1 -1 -1 sequential
 sbb nksol
 sbb icntnu=                    1
 sbb pandf1 -1 -1 sequential
 iter=    0 fnrm=     0.2571840410398498     nfe=      1
 sbb call model                     1
 sbb model
 sbb if pthrsh=   0.0000000000000000      gt onept5=   1.5000000000000000     
        and ipflg=                    1 ne 0
 sbb solpk
 sbb spimgr
 sbb atv
 sbb call f
 sbb pandf1 -1 -1 sequential
 sbb after solpk iersl=                    1 ipflg=                    1 ipcur=                    0
 sbb if pthrsh=   0.0000000000000000      gt onept5=   1.5000000000000000     
        and ipflg=                    1 ne 0
 sbb solpk
 sbb spimgr
 sbb atv
 sbb call f
 sbb pandf1 -1 -1 sequential
 sbb after solpk iersl=                    1 ipflg=                    1 ipcur=                    0
 sbb if pthrsh=   0.0000000000000000      gt onept5=   1.5000000000000000     
        and ipflg=                    1 ne 0

(and so on until ctrl-Z)

Because pthrsh is always 0, ipcur never gets set to anything other than 0, which is required in order to stop looping.

pthrsh is set to 0 if icntnu != 0. icntnu indicates if this is a continuation call to nksol that makes use of old values.

Another interesting thing is that the arguments supplied to pandf1 change around the time of the hang:

 pandf1 xc=49  yc=33
 pandf1 xc=49  yc=33
 pandf1 xc=49  yc=33
 pandf1 xc=49  yc=33
 pandf1 xc=49  yc=33
 pandf1 xc=49  yc=33
 pandf1 xc=49  yc=33
 pandf1 xc=-1  yc=-1
 pandf1 xc=-1  yc=-1
 iter=    0 fnrm=     0.2571840410398498     nfe=      1


 nksol ---  iterm = 1.
            maxnorm(sf*f(u)) .le. ftol, where maxnorm() is
            the maximum norm function.  u is probably an
            approximate root of f.
 pandf1 xc=-1  yc=-1
 pandf1 xc=-1  yc=-1
initial fnrm =2.5718E-01
--------------------------------------------------------------------
--------------------------------------------------------------------
 
*** Number time-step changes = 1 New time-step = 1.0000E-10
rundt time elapsed: 0:00:09
--------------------------------------------------------------------
*** For isimpon=2, set afracs, not afrac ***
 pandf1 xc=-1  yc=-1
 pandf1 xc=-1  yc=-1
 iter=    0 fnrm=     0.2571840410398498     nfe=      1
 pandf1 xc=-1  yc=-1
 pandf1 xc=-1  yc=-1
 pandf1 xc=-1  yc=-1
 pandf1 xc=-1  yc=-1
 pandf1 xc=-1  yc=-1
 pandf1 xc=-1  yc=-1
 pandf1 xc=-1  yc=-1
 pandf1 xc=-1  yc=-1
 pandf1 xc=-1  yc=-1
(and so on until ctrl-Z)

-1 is not an invalid argument, but apparently means "full RHS evaluation" rather than "poloidal/radial index of perturbed variable for Jacobian calc".

Also tested a case which does not hang (even with -Ofast) and found that it also has a long period of pandf(-1, -1, ...) calls where the following form repeats many times but eventually the code moves on:

 sbb atv
 sbb call f
 sbb pandf1 -1 -1 sequential
 sbb pandf1 xc=                   -1  yc=                   -1

From the absence of certain print statements (compare to end of the 4th code block in this post), we can see that these calls are being made from a different loop, supporting the claim that the loop identified above is the one that needs fixing.

Also ran the hanging case for several minutes to make sure that it was entirely pandf(-1, -1, ...) calls and it didn't start doing other things.

Also checked out Jerome Guterl's pandf issue but this seems to be a problem inside pandf, not outside, as we have here.

@sballin
Copy link
Author

sballin commented Apr 17, 2020

Noticed that in the above output, iersl is set to 1 after subroutine solpk finishes, indicating that "the krylov solver suffered a breakdown, and so the solution x is undefined."

When I compile without -Ofast, solpk runs once and finishes with iersl 0, indicating that no trouble occurred, and we get out of the loop successfully:

 sbb pandf1 8524
 sbb pandf1 xc=                   49  yc=                   33
 sbb pandf1 -1 -1 sequential
 sbb pandf1 xc=                   -1  yc=                   -1
 sbb nksol
 sbb icntnu=                    0
 sbb set pthrsh = two 795
 sbb pandf1 -1 -1 sequential
 sbb pandf1 xc=                   -1  yc=                   -1
 iter=    0 fnrm=     0.2571840410397648     nfe=      1


 nksol ---  iterm = 1.
            maxnorm(sf*f(u)) .le. ftol, where maxnorm() is
            the maximum norm function.  u is probably an
            approximate root of f.
 sbb ffun
 sbb pandf1 -1 -1 goto
 sbb pandf1 xc=                   -1  yc=                   -1
 sbb pandf1 xc=                   -1  yc=                   -1
initial fnrm =2.5718E-01
--------------------------------------------------------------------
--------------------------------------------------------------------
 
*** Number time-step changes = 1 New time-step = 1.0000E-10
rundt time elapsed: 0:00:23
--------------------------------------------------------------------
*** For isimpon=2, set afracs, not afrac ***
 sbb pandf1 -1 -1 sequential
 sbb pandf1 xc=                   -1  yc=                   -1
 sbb nksol
 sbb icntnu=                    1
 sbb set pthrsh = zero 799
 sbb pandf1 -1 -1 sequential
 sbb pandf1 xc=                   -1  yc=                   -1
 iter=    0 fnrm=     0.2571840410397648     nfe=      1
 sbb call model                     1
 sbb model
 sbb if pthrsh=   0.0000000000000000      gt onept5=   1.5000000000000000     
        and ipflg=                    1 ne 0
 sbb solpk
 sbb spimgr
 sbb atv
 sbb call f
 sbb pandf1 -1 -1 sequential
 sbb pandf1 xc=                   -1  yc=                   -1
 sbb atv
 sbb call f
 sbb pandf1 -1 -1 sequential
 sbb pandf1 xc=                   -1  yc=                   -1
(same messages repeating...)
 sbb atv
 sbb call f
 sbb pandf1 -1 -1 sequential
 sbb pandf1 xc=                   -1  yc=                   -1
 sbb atv
 sbb call f
 sbb pandf1 -1 -1 sequential
 sbb pandf1 xc=                   -1  yc=                   -1
 sbb after solpk iersl=                    0 ipflg=                    1 ipcur=                    0
 sbb pandf1 -1 -1 sequential
 sbb pandf1 xc=                   -1  yc=                   -1
 sbb set pthrsh = two 1241
 iter=    1 fnrm=                        NaN nfe=    102
 sbb call model                     2
 sbb model
 sbb if pthrsh=   2.0000000000000000      gt onept5=   1.5000000000000000     
        and ipflg=                    1 ne 0
 sbb call psetnk

@jguterl
Copy link
Contributor

jguterl commented Apr 28, 2020

replace -Ofast by -03 -fstack-arrays and see what happens. Ofast enables unsafe memory data racing.

@sballin
Copy link
Author

sballin commented Apr 29, 2020

Just checked and found that -O3 -fstack-arrays has the same behavior as -Ofast in this case.

@jguterl
Copy link
Contributor

jguterl commented Apr 29, 2020 via email

@sballin
Copy link
Author

sballin commented May 2, 2020

(Just keeping this issue up to date.) I put the following code at the end of subroutine pandf1 in oderhs.m:

do iv=1,neq
  if (isnan(yldot(iv))) then
    stop
  endif
enddo

and found that the code stopped at this spot even with -O3 -fstack-arrays.

@trognlien
Copy link
Contributor

trognlien commented May 5, 2020 via email

@sballin
Copy link
Author

sballin commented May 5, 2020

Maybe we could get you a temporary PSFC account? We have totalview. I am also happy to debug over Zoom as I have it all set up.

I am using UEDGE version 7.0.8.4.14 and not evolving the potential equation. It appears to me that bugs happen both with -Ofast and without, but they manifest differently. I don't think -g prevents optimization or otherwise affects the code.

I tried Roman's fixes with and without -Ofast, and the output/hanging behavior was the same.

@jguterl
Copy link
Contributor

jguterl commented May 5, 2020 via email

@trognlien
Copy link
Contributor

trognlien commented May 5, 2020 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants