Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG]: Loss saved in hall of fame is not identical to loss recalculated based on prediction #816

Open
BrotherHa opened this issue Jan 24, 2025 · 13 comments
Assignees
Labels
bug Something isn't working

Comments

@BrotherHa
Copy link
Contributor

BrotherHa commented Jan 24, 2025

Hello,

I have identified an issue with some PySR regressions, that I ran. The loss saved in the hall of fame is not identical to the loss calculated with the same dataset and same custom loss function based on the PySR prediction. I have documented the issue in the notebook file attached.
Have similar problems occurred before or do you have any idea, what might be the reason for this?
As an additional test, i was thinking about trying to recalculate the loss of the individual inside PySR, the be able to reproduce the potentially faulty loss-calculation. Is there a way to do so?

Best regards and thanks in advance!

Jupyter Notebook Code:

import pickle
import numpy as np
import pandas as pd
from pysr import jl
from pysr.julia_helpers import jl_array
# Import object of class implementing regression with PySR (pysr_object.regression_object is PySRRegressor)
with open('cluster_2_round_0_runs_4_conv_suc_pysr_object_0103.pkl', 'rb') as file:
    pysr_object = pickle.load(file)
# Get loss of 22. individual for 2. y-variable
loss_in_pysr = pysr_object.regression_object.equations_[2].iloc[22].loss
# Training dataset used for regression
training_dataset = pysr_object.dataset.dataset[pysr_object.training_data_bool].reset_index(drop=True)
# Predict y-variables for training data with the 22. individual for the 2. y-variable
y_pred= pd.DataFrame(
    data=pysr_object.regression_object.predict(
        training_dataset[pysr_object.x_variables],
        index = [0,0,22,0,0]
    ),
    columns=pysr_object.y_variables
)
# Get prediction and actual data for 2. y-variable
y2_pred = y_pred[pysr_object.y_variables[2]]
y2_act = training_dataset[pysr_object.y_variables[2]]
# Create julia loss function string based on dataset
def make_fitness_function_v2_5on(y):
# For code see attached PDF
# Create julia custom loss function string, apart from in- and outputs identical to the one implemented in pysr_object (see below)
fitness_function = make_fitness_function_v2_5on(training_dataset[pysr_object.y_variables])
print(fitness_function)

#Printed:
function custom_loss_function(y_pred, y)
    if isapprox(y[1],2.2565436)
        p5 = 1.2804586
        p80 = 15.113608
        a_1 = 0.6248880798745418
        b_1 = 1.9206878219999999
        a_2 = 2
        b_3 = -6.09607886912224
        a_3 = 18.0350585617555
        c_3 = -9.43495030694149
    elseif isapprox(y[1],0.04650065)
        p5 = 0.0040257196
        p80 = 1.8914138
        a_1 = 11.144567754841631
        b_1 = 0.006038579739964286
        a_2 = 2
        b_3 = 0.0400789096218153
        a_3 = 3.86298538267220
        c_3 = 1.23985086650155
    elseif isapprox(y[1],0.6221858)
        p5 = 0.46299943
        p80 = 13.051656
        a_1 = 1.0391895552455737
        b_1 = 0.694499148
        a_2 = 2
        b_3 = 4.05385061247809
        a_3 = 34.2110131049562
        c_3 = -71.0354535190814
    elseif isapprox(y[1],-0.45914933)
        p5 = 0.41124585
        p80 = 5.6052365
        a_1 = 1.1026412343064227
        b_1 = 0.616868799645
        a_2 = 2
        b_3 = 0.514240734249554
        a_3 = 12.2389548644991
        c_3 = -10.9601082467183
    elseif isapprox(y[1],0.8109809)
        p5 = 0.00813762
        p80 = 17.448
        a_1 = 7.838560195225847
        b_1 = 0.012206430080501547
        a_2 = 2
        b_3 = 6.53098971058287
        a_3 = 47.9579776388101
        c_3 = -117.475032351869
    end
    y_border = zeros(length(y))
    y_border[abs.(y).<p5] .= (a_1 * y[abs.(y).<p5]).^2 .+ b_1
    y_border[(abs.(y).>=p5).&(abs.(y).<=p80)] .= a_2 * abs.(y[(abs.(y).>=p5).&(abs.(y).<=p80)])
    y_border[abs.(y).>p80] .= a_3 * log.(abs.(y[abs.(y).>p80]) .+ b_3) .+ c_3
    custom_error = (sum( ( (1 ./ y_border) .* (y_pred .- y) ) .^2 ) )/(length(y))
    return custom_error
end
# For comparison: loss function of pysr_object
loss_function_pysr = pysr_object.regression_object.loss_function
print(loss_function_pysr)

# Printed:
function custom_loss_function(tree, dataset::Dataset{T,L}, options::Options, idx=nothing)::L where {T,L}
    if isapprox(dataset.y[1],2.2565436)
        p5 = 1.2804586
        p80 = 15.113608
        a_1 = 0.6248880798745418
        b_1 = 1.9206878219999999
        a_2 = 2
        b_3 = -6.09607886912224
        a_3 = 18.0350585617555
        c_3 = -9.43495030694149
    elseif isapprox(dataset.y[1],0.04650065)
        p5 = 0.0040257196
        p80 = 1.8914138
        a_1 = 11.144567754841631
        b_1 = 0.006038579739964286
        a_2 = 2
        b_3 = 0.0400789096218153
        a_3 = 3.86298538267220
        c_3 = 1.23985086650155
    elseif isapprox(dataset.y[1],0.6221858)
        p5 = 0.46299943
        p80 = 13.051656
        a_1 = 1.0391895552455737
        b_1 = 0.694499148
        a_2 = 2
        b_3 = 4.05385061247809
        a_3 = 34.2110131049562
        c_3 = -71.0354535190814
    elseif isapprox(dataset.y[1],-0.45914933)
        p5 = 0.41124585
        p80 = 5.6052365
        a_1 = 1.1026412343064227
        b_1 = 0.616868799645
        a_2 = 2
        b_3 = 0.514240734249554
        a_3 = 12.2389548644991
        c_3 = -10.9601082467183
    elseif isapprox(dataset.y[1],0.8109809)
        p5 = 0.00813762
        p80 = 17.448
        a_1 = 7.838560195225847
        b_1 = 0.012206430080501547
        a_2 = 2
        b_3 = 6.53098971058287
        a_3 = 47.9579776388101
        c_3 = -117.475032351869
    end
    if idx == nothing
        y_pred, complete = eval_tree_array(tree, dataset.X, options)
        y = dataset.y
    else
        y_pred, complete = eval_tree_array(tree, dataset.X[:,idx], options)
        y = dataset.y[idx]
    end
    y_border = zeros(length(y))
    y_border[abs.(y).<p5] .= (a_1 * y[abs.(y).<p5]).^2 .+ b_1
    y_border[(abs.(y).>=p5).&(abs.(y).<=p80)] .= a_2 * abs.(y[(abs.(y).>=p5).&(abs.(y).<=p80)])
    y_border[abs.(y).>p80] .= a_3 * log.(abs.(y[abs.(y).>p80]) .+ b_3) .+ c_3
    if !complete
        return L(Inf)
    end
    custom_error = (sum( ( (1 ./ y_border) .* (y_pred .- y) ) .^2 ) )/(length(y))
    return custom_error
end
# Translating elements to julia
fitness_function_jl = jl.seval(fitness_function)
y2_pred_jl = jl_array(np.array(y2_pred, dtype=np.float32).T)
y2_act_jl = jl_array(np.array(y2_act, dtype=np.float32).T)
loss_recalculated = fitness_function_jl(y2_pred_jl, y2_act_jl)
print('Loss saved in PySR: ' + str(loss_in_pysr))
print('Loss recalculated: ' + str(loss_recalculated))

# Printed:
Loss saved in PySR: 0.01720243
Loss recalculated: 0.022473989882165576

-> Loss recalculated and loss calculated in PySR are not identical!

Version

1.3.1

Operating System

Windows

Package Manager

Conda

Interface

Other (specify below)

Relevant log output

Extra Info

loss_difference_pysr.pdf

@BrotherHa BrotherHa added the bug Something isn't working label Jan 24, 2025
@MilesCranmer
Copy link
Owner

@BrotherHa is it okay if you paste the code inline in the markdown block? This is so that search indexes like Google can correctly find this issue if other people have a similar problem.

@BrotherHa
Copy link
Contributor Author

Hello @MilesCranmer
Thank you very much for the quick reply. Is the format ok like that?
Best regards

@BrotherHa
Copy link
Contributor Author

I have searched a bit where the issue occurs and in fact for most individuals of the regression the loss values from the equations_-table and the recalculated ones are absolutely identical until very high precision, but for some rare cases they divert significantly as in the example.

@MilesCranmer
Copy link
Owner

Is it only when reloading from a pickle file?

@MilesCranmer
Copy link
Owner

And are you sure you are computing the loss in a way completely identical between the Julia and the Python side of things?

@BrotherHa
Copy link
Contributor Author

About the loss functions I am very sure. The only issues here might be problems caused by the precision of calculation, etc. And for more than 90% of all individuals both methods calculate exactly the same loss.
And concerning the pickle-file, I have also tried using the 'from_file' method, but it still stays the same.

@MilesCranmer
Copy link
Owner

MilesCranmer commented Jan 25, 2025

Just to confirm, it is only when you load the model from a saved checkpoint that this happens? It’s not when you use warm_start=True and call fit a second time?

@MilesCranmer
Copy link
Owner

90% of all individuals both methods calculate exactly the same loss

Do you see any pattern in the ones that don’t match? e.g., Are they very complex expressions?

@BrotherHa
Copy link
Contributor Author

90% of all individuals both methods calculate exactly the same loss

Do you see any pattern in the ones that don’t match? e.g., Are they very complex expressions?

I do not see any pattern in the equations, that have the error. They are not the most complex ones neither do they have explicitly complex operators.

@BrotherHa
Copy link
Contributor Author

Just to confirm, it is only when you load the model from a saved checkpoint that this happens? It’s not when you use warm_start=True and call fit a second time?

It happens also when using fit for the first time. I will try to reload and refit, when I find time.

@MilesCranmer
Copy link
Owner

But it's only when you have a model loaded from disk, right? Like if you did:

model = PySRRegressor(warm_start=True)  # fresh model; NOT loaded from disk

model.fit(X, y)

#= ... =#

model.fit(X, y)

@BrotherHa
Copy link
Contributor Author

BrotherHa commented Jan 28, 2025

The Problem occured in the first place when fitting a new model. Then directly after the regression finished I called predict() and calculated the loss for training, test and total dataset based on that. But before that I saved the output of model.get_best(). The value from get_best() and the manual calculation for the best individual shows the bahavior in a less significant way (0.014545 instead of 0.014536), but it is still not in the range most other values are equal. For the example given in my first messege here I called the predict() method also just after the initial run and calculated the loss as 0.022473989882165576, but I did not save the loss from model.equations_ separately. But in the hall of fame csv-file loss is also saved as 0.01720243, like it is outputed when I call model.equations_ from the loaded model.

@MilesCranmer
Copy link
Owner

@BrotherHa would it be possible for you to make a MWE of this issue so I can try to reproduce it? No worries if not. It's a bit tricky because the unittests already check for warm starts from files and they seem to pass. So I'm puzzled as to what conditions trigger this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants