Potential Data Leakage in data/data_loader.py due to Preprocessing Before Split #10

Sohrabbeig · 2024-01-16T03:31:19Z

I've been reviewing the data preprocessing steps in data/data_loader.py and noticed that the entire dataset undergoes fitting and transformation before being split into training, validation, and test sets. This process might lead to data leakage, where information from the test and validation sets inadvertently influences the training process.

Is this approach an intentional part of the model's design for a specific reason that I might have missed? Or could it be an oversight?
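To make the concern concrete, here is a minimal pure-Python sketch of how min/max scaling behaves when the statistics come from the full series versus the training portion only. The toy series, the `min_max_scale` helper, and the 0.8 split ratio are illustrative assumptions, not code from the repository.

```python
def min_max_scale(values, lo, hi):
    """Scale values linearly so that lo maps to 0 and hi maps to 1."""
    return [(v - lo) / (hi - lo) for v in values]


# Toy series (assumed for illustration); the last point plays the held-out role.
series = [1.0, 2.0, 3.0, 4.0, 10.0]
train_ratio = 0.8
training_end = int(len(series) * train_ratio)  # 4
train = series[:training_end]

# Leaky: min/max are computed over the whole series, so the held-out
# maximum (10.0) shapes how the training points are scaled.
leaky = min_max_scale(series, min(series), max(series))

# Leak-free: min/max come from the training portion only.
clean = min_max_scale(series, min(train), max(train))

print(leaky[-1])  # 1.0, the held-out point defines the top of the range
print(clean[-1])  # 3.0, the held-out point falls outside [0, 1]
```

In the leaky variant the training values are compressed because the scaler has already seen the held-out maximum; in the leak-free variant the held-out point simply lands outside the training range.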


aikunyi · 2024-02-03T07:25:27Z

We use the min/max values of the training set to normalize the validation and test sets.

yasinuygun · 2024-02-13T08:58:02Z

When I look at the code at data_loader.py, the full data is being used to normalize:

I guess it should have been like the following instead:

            training_end = int(len(data) * self.train_ratio)
            mms.fit(data[:training_end])
            data = mms.transform(data)

aikunyi · 2024-02-22T03:48:38Z

Thanks for the correction, we've revised it

tolinlaws · 2024-04-05T17:03:13Z

Looking at data_loader.py, the Dataset_Wiki and Dataset_Solar classes should use self.data instead of data, so that the fix matches the original code.

Original:

    self.data = mms.fit_transform(self.data)

Fixed:

    if type == '1':
        mms = MinMaxScaler(feature_range=(0, 1))
        training_end = int(len(self.data) * self.train_ratio)
        mms.fit(self.data[:training_end])
        self.data = mms.transform(self.data)
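For anyone who wants to check the behavior outside the dataset classes, the same idea can be sketched as a standalone function. This is only an illustration under stated assumptions: the function name `scale_with_train_stats` and the toy array are hypothetical, and the split is assumed to be chronological with the training rows first, as in the snippets above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler


def scale_with_train_stats(data, train_ratio):
    """Fit a MinMaxScaler on the leading training rows, then transform all rows.

    Hypothetical helper for illustration; not part of data_loader.py.
    """
    mms = MinMaxScaler(feature_range=(0, 1))
    training_end = int(len(data) * train_ratio)
    mms.fit(data[:training_end])  # statistics come from the training rows only
    return mms.transform(data), training_end


# Toy data (assumed): 8 time steps, 2 features, steadily increasing.
data = np.arange(16, dtype=float).reshape(8, 2)
scaled, training_end = scale_with_train_stats(data, train_ratio=0.75)

print(scaled[:training_end].max())  # 1.0, training rows stay inside [0, 1]
print(scaled[training_end:].max())  # above 1.0, later rows exceed the training range
```

One consequence worth keeping in mind: once the scaler is fitted on the training rows only, validation and test values are no longer guaranteed to lie in [0, 1].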
