I can reproduce the issue with jax 0.6.0 (current main) on a Linux box. As you have already mentioned, the numerics of non-jitted and jitted JAX code are slightly different, and the learning rate is too high. In general, with too high a learning rate the SGD algorithm may diverge, and the final result then becomes very sensitive to any perturbation of the numerics, in any program.

In this particular case, with the given learning rate of 2.0, the final loss and gradient are very large (approximately 40 and 300000, respectively), meaning that the optimization did not converge; the results from the non-jitted and jitted code are then not comparable by definition, since both runs are divergent. With a smaller learning rate, say 2e-2, the optimization converges (the final loss and gradient are on the order of 0.01), and the results from the non-jitted and jitted programs become comparable and are indeed close.

In sum, the cause of the reported huge discrepancy between the non-jitted and jitted programs is a divergent optimization process whose result is very sensitive to the small numerical differences between non-jitted and jitted code. As a resolution, use hyperparameters (such as the learning rate) that lead to a convergent optimization process, which suppresses the possible discrepancies between the numerics of non-jitted and jitted programs.
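A minimal plain-JAX sketch of the point above (this is not the original MWE; the model, data, and step counts are illustrative assumptions): the same SGD loop is run eagerly and under jax.jit at the two learning rates. At lr=2.0 the trajectory is divergent, so any tiny numerical difference between the eager and compiled computations can be amplified into a large gap; at lr=2e-2 the trajectory is stable and the two runs should agree closely.

```python
import jax
import jax.numpy as jnp


def init_params(key):
    k1, k2 = jax.random.split(key)
    return {
        "w1": jax.random.normal(k1, (1, 32)),
        "w2": jax.random.normal(k2, (32, 1)),
    }


def predict(params, x):
    h = jnp.sin(x @ params["w1"])  # sinusoidal activation, as in the MWE
    return h @ params["w2"]


def loss_fn(params, x, y):
    return jnp.mean((predict(params, x) - y) ** 2)


def sgd_step(params, x, y, lr):
    grads = jax.grad(loss_fn)(params, x, y)
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)


def train(step_fn, lr, n_steps=200):
    x = jnp.linspace(-1.0, 1.0, 64)[:, None]
    y = jnp.sin(3.0 * x)
    params = init_params(jax.random.PRNGKey(0))
    for _ in range(n_steps):
        params = step_fn(params, x, y, lr)
    return loss_fn(params, x, y)


for lr in (2.0, 2e-2):
    loss_eager = train(sgd_step, lr)
    loss_jit = train(jax.jit(sgd_step), lr)
    # At lr=2.0 the run typically diverges and the two losses can differ
    # wildly; at lr=2e-2 the trajectory is stable and the two losses should
    # agree to within floating-point noise.
    print(f"lr={lr}: eager={float(loss_eager):.6g}, jit={float(loss_jit):.6g}")
```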
I’ve been working on differentiable simulations where gradients are propagated through physical solvers. Recently, I encountered a puzzling discrepancy: running the code with JIT enabled gives significantly different results than the non-JIT version, even though both use the same inputs and computation logic. I know that JIT can change the numerics, but I had always assumed the margin was not of particular concern (somewhere in the trailing decimals). The difference here, however, is difficult to ignore.
The issue surfaced (seemingly at random) for certain combinations of hyperparameters. After isolating the components of my simulation, I built a minimal working example (MWE) that doesn’t involve physics or FEA but still demonstrates the problem (I know the learning rate is too high and that I am using a sinusoidal activation, but that is by construction, to illustrate the point).
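A schematic stand-in for the MWE (the exact sizes, data, and step count below are arbitrary, not the real setup) captures the structure: a small equinox MLP with a sinusoidal activation trained by optax SGD at learning rate 2.0, where the only difference between the two runs is whether the update step is wrapped in eqx.filter_jit.

```python
# Schematic stand-in for the MWE (sizes, data and step count are arbitrary).
import equinox as eqx
import jax
import jax.numpy as jnp
import optax

key = jax.random.PRNGKey(0)
model = eqx.nn.MLP(in_size=1, out_size=1, width_size=32, depth=2,
                   activation=jnp.sin, key=key)  # sinusoidal activation
optim = optax.sgd(learning_rate=2.0)             # deliberately very high

x = jnp.linspace(-1.0, 1.0, 64)[:, None]
y = jnp.sin(3.0 * x)


def step(model, opt_state, x, y):
    def loss_fn(m):
        return jnp.mean((jax.vmap(m)(x) - y) ** 2)

    loss, grads = eqx.filter_value_and_grad(loss_fn)(model)
    updates, opt_state = optim.update(
        grads, opt_state, eqx.filter(model, eqx.is_array)
    )
    model = eqx.apply_updates(model, updates)
    return model, opt_state, loss


def train(step_fn, model, n_steps=100):
    opt_state = optim.init(eqx.filter(model, eqx.is_array))
    for _ in range(n_steps):
        model, opt_state, loss = step_fn(model, opt_state, x, y)
    return loss


# The only difference between the two runs is eqx.filter_jit, yet at this
# learning rate the final losses can differ by far more than rounding error.
print("eager:", float(train(step, model)))
print("jit:  ", float(train(eqx.filter_jit(step), model)))
```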
Has anyone else faced this issue?
JAX = 0.5.3 (CPU)
macOS (Apple M3)
Python = 3.11
optax = 0.2.4
equinox = 0.12.1