
ForwardDiff NaN-safe mode #2769

@penelopeysm

Description

Due to recent improvements in ForwardDiff, models like the one below can now error quite easily. This has been causing a number of CI failures (most recently in the docs build).

julia> using Turing, Random
       @model function f()
           var ~ InverseGamma(2, 3)
           sd = sqrt(var)
           mean ~ Normal(0.0, sd)
           1.5 ~ Normal(mean, sd)
           2.0 ~ Normal(mean, sd)
       end
f (generic function with 2 methods)

julia> sample(Xoshiro(231), f(), NUTS(), 2000; progress=false)
┌ Info: Found initial step size
└   ϵ = 0.8
ERROR: DomainError with Dual{ForwardDiff.Tag{DynamicPPL.DynamicPPLTag, Float64}}(0.0,NaN,NaN):
Normal: the condition σ >= zero(σ) is not satisfied.
[...]
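One plausible way to end up with a Dual whose value is 0.0 but whose partials are NaN is sketched below. It assumes that var is sampled in log space, so that var = exp(x) for an unconstrained x. If exp(x) underflows to 0.0, its partial is also 0.0. The derivative of sqrt at zero is Inf, so the chain rule then produces 0.0 * Inf = NaN. The helper name sd_from_unconstrained is hypothetical and only mimics the sd = sqrt(var) line of the model; it is not what Turing actually runs.

```julia
using ForwardDiff

# Hypothetical stand-in for `sd = sqrt(var)`, assuming `var = exp(x)` for an
# unconstrained parameter `x` (i.e. `var` is sampled in log space).
sd_from_unconstrained(x) = sqrt(exp(x))

# For very negative `x`, `exp(x)` underflows to 0.0, so the intermediate Dual
# has value 0.0 and partial 0.0. The derivative of `sqrt` at 0.0 is Inf,
# and 0.0 * Inf is NaN.
g = ForwardDiff.derivative(sd_from_unconstrained, -1000.0)

# Expected to be NaN with ForwardDiff's default settings. With
# `nansafe_mode = true`, ForwardDiff should keep the zero partial as zero
# instead of producing NaN.
println("value      = ", sd_from_unconstrained(-1000.0))
println("derivative = ", g)
```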

The solution to this, it seems, is to enable NaN-safe mode for ForwardDiff. I can confirm that that does indeed make the above call work:

julia> using ForwardDiff, Preferences

julia> set_preferences!(ForwardDiff, "nansafe_mode" => true);

# reload session and rerun

julia> sample(Xoshiro(231), f(), NUTS(), 2000; progress=false)
┌ Info: Found initial step size
└   ϵ = 0.8
Chains MCMC chain (2000×16×1 Array{Float64, 3}):

Iterations        = 1001:1:3000
Number of chains  = 1
Samples per chain = 2000
Wall duration     = 3.34 seconds
Compute duration  = 3.34 seconds
parameters        = var, mean
internals         = n_steps, is_accept, acceptance_rate, log_density, hamiltonian_energy, hamiltonian_energy_error, max_hamiltonian_energy_error, tree_depth, numerical_error, step_size, nom_step_size, logprior, loglikelihood, logjoint

Use `describe(chains)` for summary statistics and quantiles.

This is fine, but honestly quite annoying. Firstly, it's not ideal that ForwardDiff uses Preferences.jl for this: the setting is stored in a LocalPreferences.toml file next to the active project, so it is not propagated to other environments, or even to other directories. It would be way better if it were controlled via ADTypes, and if we could then change the default backend to NaN-safe ForwardDiff. According to the docs page linked above, dynamically enabling NaN-safe mode via a config struct has been planned for many years, but this has not actually materialised.
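Because the setting lives in per-environment preferences rather than in the AD type, it can be useful to check what is actually in effect in a given session. The sketch below assumes that ForwardDiff exposes the value it was loaded with as the internal constant ForwardDiff.NANSAFE_MODE_ENABLED; that is an implementation detail, not a public API, and may change between versions.

```julia
using ForwardDiff, Preferences

# What Preferences.jl has stored for ForwardDiff in the active environment
# (`nothing` if the preference was never set there).
stored = load_preference(ForwardDiff, "nansafe_mode")

# What ForwardDiff was actually loaded with. Assumption: the internal constant
# `NANSAFE_MODE_ENABLED` is read once at load time, which is why changing the
# preference only takes effect after restarting Julia.
active = ForwardDiff.NANSAFE_MODE_ENABLED

println("stored preference:                    ", stored)
println("NaN-safe mode active in this session: ", active)
```

Running this in two different environments (or directories with different active projects) is a quick way to see that the preference does not carry over.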

Secondly, it's also pretty unfortunate that ForwardDiff is Turing's default backend, so this sort of error can crop up quite easily. Even if individual failures are rare (in the example above, 231 was the first seed that failed, which suggests a failure rate of roughly 0.5%), we run so many tests and docs builds with NUTS that something somewhere is bound to break quite often.

There isn't really much that can be done here. Some ideas I have are:

  1. Inside the inner constructor of LogDensityFunction, emit an @info or @warn about this behaviour, with maxlog=1 to avoid being spammy. That might be too extreme and annoying, though: since ForwardDiff is the default backend, literally everyone who ever runs a model with NUTS would see the warning, so I'm not inclined towards this.

  2. Write a docs page about it. This is partly why I'm opening an issue even though I can't do much about it: so that the information is out there somewhere.

  3. Pray that one day it's configurable via ADTypes, and then update our default.
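For reference, idea 1 could look roughly like the sketch below. The function maybe_warn_nansafe is hypothetical and is not part of DynamicPPL; it assumes the check would be called from the LogDensityFunction constructor, and it again relies on the internal constant ForwardDiff.NANSAFE_MODE_ENABLED (an assumption, not a public API).

```julia
using ADTypes: AbstractADType, AutoForwardDiff
import ForwardDiff

# Hypothetical helper (not DynamicPPL code) illustrating idea 1.
function maybe_warn_nansafe(adtype::AbstractADType)
    # Assumption: `NANSAFE_MODE_ENABLED` reflects the `nansafe_mode` preference.
    if adtype isa AutoForwardDiff && !ForwardDiff.NANSAFE_MODE_ENABLED
        # `maxlog=1` limits this log statement to a single message per session.
        @warn "ForwardDiff's NaN-safe mode is disabled; gradients may contain NaN." maxlog=1
    end
    return nothing
end

maybe_warn_nansafe(AutoForwardDiff())
maybe_warn_nansafe(AutoForwardDiff())  # suppressed by maxlog=1
```

Even with maxlog=1, every NUTS user on the default backend would see this once per session, which is exactly the drawback described in idea 1.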

Of these three, the second is the only realistic one.

Metadata

Assignees

No one assigned

    Labels

    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
