Question: XGBClassifier fails when target labels are non-zero-based #11228

Closed
M-Colley opened this issue Feb 10, 2025 · 2 comments
@M-Colley

I'm encountering an issue when using XGBClassifier for a lane number classification task. My target labels are lane numbers with values [3, 4, 5], but despite setting num_class=6 (to potentially cover six classes), the classifier fails during training with the following error:

ValueError: Invalid classes inferred from unique values of `y`.  Expected: [0 1 2], got [3 4 5]

It appears that the classifier infers three classes but expects them to be numbered [0, 1, 2] rather than [3, 4, 5].

Steps to Reproduce:

import xgboost as xgb
from sklearn.model_selection import train_test_split
import pandas as pd

# Assume df_original_xgboost is a DataFrame with our data,
# where 'target_column' (e.g., "lane_number_smoothed") contains lane numbers [3, 4, 5],
# and numerical_features and categorical_features are defined appropriately.

# Example definitions:
numerical_features = ['speed', 'Distance_vehicle_front', 'Distance_vehicle_front_left']
categorical_features = []  # or your actual categorical features list
target_column = 'lane_number_smoothed'

# Example DataFrame creation for demonstration (replace with your actual data):
data = {
    'speed': [50, 60, 55, 70, 65, 80],
    'Distance_vehicle_front': [10, 12, 11, 13, 12, 14],
    'Distance_vehicle_front_left': [8, 9, 8, 10, 9, 10],
    target_column: [3, 4, 5, 3, 4, 5]
}
df_original_xgboost = pd.DataFrame(data)

# Convert categorical columns to 'category' type if any
for feature in categorical_features:
    df_original_xgboost[feature] = df_original_xgboost[feature].astype('category')

# Prepare Features and Target variable
X = df_original_xgboost[numerical_features + categorical_features]
y = df_original_xgboost[target_column]

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print unique labels in y_train (should be [3, 4, 5])
print("Unique labels in training set:", sorted(y_train.unique()))

# Initialize XGBClassifier with num_class=6
model = xgb.XGBClassifier(random_state=42, num_class=6, enable_categorical=True, device="cuda")

# Attempt to train the model (this raises the ValueError)
model.fit(X_train, y_train)

Observed Behavior:

When training the model, I get the following error:

ValueError: Invalid classes inferred from unique values of `y`.  Expected: [0 1 2], got [3 4 5]

Expected Behavior:

I would expect one of the following:

1. XGBClassifier automatically remaps non-zero-based labels (such as lane numbers [3, 4, 5]) to a contiguous range starting at 0, or
2. the classifier exposes a clear configuration option or parameter for using non-zero-based labels.

Currently, the workaround is to manually remap the target labels before training (e.g., mapping {3: 0, 4: 1, 5: 2}), as sketched below. I would like to know whether this behavior is intended or whether a fix is planned to support non-zero-based labels directly.
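A minimal sketch of that workaround, reusing xgb, pd, X_train, X_test, y_train, and y_test from the reproduction snippet above (the *_mapped names are just illustrative):

# Remap lane numbers [3, 4, 5] to the contiguous range [0, 1, 2] expected by XGBoost.
label_map = {3: 0, 4: 1, 5: 2}
inverse_label_map = {v: k for k, v in label_map.items()}

y_train_mapped = y_train.map(label_map)
y_test_mapped = y_test.map(label_map)

# num_class is omitted here: the scikit-learn wrapper infers the number of classes from y.
model = xgb.XGBClassifier(random_state=42, enable_categorical=True, device="cuda")
model.fit(X_train, y_train_mapped)

# Map predictions back to the original lane numbers for evaluation/reporting.
y_pred = pd.Series(model.predict(X_test)).map(inverse_label_map)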

Environment:

XGBoost version: 2.1.4 or later
Python version: 3.11
Device: CUDA enabled

@trivialfis
Member

Hi, the behavior is expected: XGBoost requires encoded labels (consecutive integers starting at 0) as input. There has been discussion about whether the classifier should automatically route labels through sklearn's LabelEncoder before training, but we haven't decided to do so. There are many interfaces to cover, including the distributed ones, and we think the user should decide how to encode the labels. For instance, there are many label-encoder implementations, including the ones from sklearn and from cuml for GPU, and they need to handle various input types, including dataframes and arrays.
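For example, a minimal sketch of user-side encoding with sklearn's LabelEncoder, reusing X_train, y_train, and X_test from the reproduction snippet above (cuml provides a similar encoder for GPU dataframes):

from sklearn.preprocessing import LabelEncoder
import xgboost as xgb

# Encode the labels to 0-based integers before training.
le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)   # [3, 4, 5] -> [0, 1, 2]

clf = xgb.XGBClassifier(random_state=42, enable_categorical=True)
clf.fit(X_train, y_train_enc)

# Decode predictions back to the original label values.
y_pred = le.inverse_transform(clf.predict(X_test))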

@trivialfis
Member

Closing in favor of #11256.
