
Potential Data Leakage in data/data_loader.py due to Preprocessing Before Split #10

Open
Sohrabbeig opened this issue Jan 16, 2024 · 4 comments

Comments

@Sohrabbeig

I've been reviewing the data preprocessing steps in data/data_loader.py and noticed that the entire dataset undergoes fitting and transformation before being split into training, validation, and test sets. This process might lead to data leakage, where information from the test and validation sets inadvertently influences the training process.

Is this approach an intentional part of the model's design for a specific reason that I might have missed? Or could it be an oversight?
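
To make the concern concrete, here is a minimal sketch in plain Python. It assumes a chronological split and uses a hand-rolled min-max scaler as a stand-in for scikit-learn's MinMaxScaler; the series values, the 0.8 ratio, and the helper names are hypothetical and not taken from the repository.

```python
def fit_min_max(values):
    """Return (min, max) of a 1-D sequence; stands in for a scaler's fit step."""
    return min(values), max(values)


def transform(values, lo, hi):
    """Scale values relative to the fitted (lo, hi) range."""
    return [(x - lo) / (hi - lo) for x in values]


# Hypothetical series whose largest value occurs in the test portion.
series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 20.0]
train_ratio = 0.8  # assumed split ratio
train_end = int(len(series) * train_ratio)

# Leaky: statistics come from the whole series, including future test values.
leaky_lo, leaky_hi = fit_min_max(series)
leaky_train = transform(series[:train_end], leaky_lo, leaky_hi)

# Leak-free: statistics come from the training portion only.
clean_lo, clean_hi = fit_min_max(series[:train_end])
clean_train = transform(series[:train_end], clean_lo, clean_hi)

print("fitted max:", leaky_hi, "vs", clean_hi)
print("largest scaled training value:", max(leaky_train), "vs", max(clean_train))
```

With the leaky fit, the scaled training values never get close to 1, which tells the model that a much larger value exists later in the series. With the training-only fit, the training data spans the full [0, 1] range and carries no information about the held-out portion.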

@aikunyi
Owner

aikunyi commented Feb 3, 2024

We use the min/max values of the training set to normalize the validation and test sets.

@yasinuygun
Contributor

Looking at the code in data_loader.py, the full dataset is being used to fit the normalization:

[image: screenshot of the normalization code in data_loader.py]

I guess it should have been like the following instead:

            # Fit the scaler on the training portion only, then scale all splits with it
            training_end = int(len(data) * self.train_ratio)
            mms.fit(data[:training_end])
            data = mms.transform(data)

@aikunyi
Owner

aikunyi commented Feb 22, 2024

Thanks for the correction; we've revised it.

@tolinlaws

Looking at data_loader.py, the Dataset_Wiki and Dataset_Solar classes should use self.data instead of data, to match the original code in those classes.

original:

    self.data = mms.fit_transform(self.data)

fixed:

    if type == '1':
        mms = MinMaxScaler(feature_range=(0, 1))
        # Fit on the training portion only, then scale the full array
        training_end = int(len(self.data) * self.train_ratio)
        mms.fit(self.data[:training_end])
        self.data = mms.transform(self.data)
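
For anyone who wants to check the behaviour in isolation, here is a self-contained sketch of the same pattern, assuming scikit-learn and NumPy are installed. The function name `normalize_with_train_stats`, the synthetic data, and the 0.7 ratio are hypothetical; only the `MinMaxScaler` calls mirror the snippet above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler


def normalize_with_train_stats(data, train_ratio):
    """Fit a MinMaxScaler on the leading training rows, then scale every row."""
    training_end = int(len(data) * train_ratio)
    mms = MinMaxScaler(feature_range=(0, 1))
    mms.fit(data[:training_end])
    return mms.transform(data), mms, training_end


# Synthetic stand-in for a dataset: 100 time steps, 3 series (random walks).
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 3)).cumsum(axis=0)

scaled, mms, training_end = normalize_with_train_stats(data, train_ratio=0.7)

# The fitted statistics match the training rows only.
assert np.allclose(mms.data_min_, data[:training_end].min(axis=0))
assert np.allclose(mms.data_max_, data[:training_end].max(axis=0))

# Training rows land in [0, 1]; later rows may fall outside that range.
print("train range:", scaled[:training_end].min(), scaled[:training_end].max())
print("val/test range:", scaled[training_end:].min(), scaled[training_end:].max())
```

One consequence worth knowing: because the scaler only sees the training rows, validation and test values can end up below 0 or above 1. That is expected with `MinMaxScaler`'s default `clip=False` and is the price of not peeking at the held-out data.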
