Question: XGBClassifier fails when target labels are non-zero-based #11228

Closed
M-Colley opened this issue Feb 10, 2025 · 2 comments
@M-Colley

I'm encountering an issue when using XGBClassifier for a lane number classification task. My target labels are lane numbers with values [3, 4, 5], but despite setting num_class=6 (to potentially cover six classes), the classifier fails during training with the following error:

ValueError: Invalid classes inferred from unique values of `y`.  Expected: [0 1 2], got [3 4 5]

It appears that the classifier infers three classes but expects them to be numbered [0, 1, 2] rather than [3, 4, 5].

Steps to Reproduce:

import xgboost as xgb
from sklearn.model_selection import train_test_split
import pandas as pd

# Assume df_original_xgboost is a DataFrame with our data,
# where 'target_column' (e.g., "lane_number_smoothed") contains lane numbers [3, 4, 5],
# and numerical_features and categorical_features are defined appropriately.

# Example definitions:
numerical_features = ['speed', 'Distance_vehicle_front', 'Distance_vehicle_front_left']
categorical_features = []  # or your actual categorical features list
target_column = 'lane_number_smoothed'

# Example DataFrame creation for demonstration (replace with your actual data):
data = {
    'speed': [50, 60, 55, 70, 65, 80],
    'Distance_vehicle_front': [10, 12, 11, 13, 12, 14],
    'Distance_vehicle_front_left': [8, 9, 8, 10, 9, 10],
    target_column: [3, 4, 5, 3, 4, 5]
}
df_original_xgboost = pd.DataFrame(data)

# Convert categorical columns to 'category' type if any
for feature in categorical_features:
    df_original_xgboost[feature] = df_original_xgboost[feature].astype('category')

# Prepare Features and Target variable
X = df_original_xgboost[numerical_features + categorical_features]
y = df_original_xgboost[target_column]

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print unique labels in y_train (should be [3, 4, 5])
print("Unique labels in training set:", sorted(y_train.unique()))

# Initialize XGBClassifier with num_class=6
model = xgb.XGBClassifier(random_state=42, num_class=6, enable_categorical=True, device="cuda")

# Attempt to train the model (this raises the ValueError)
model.fit(X_train, y_train)

Observed Behavior:

When training the model, I get the following error:

ValueError: Invalid classes inferred from unique values of `y`.  Expected: [0 1 2], got [3 4 5]

Expected Behavior:

I would expect one of the following:

1. XGBClassifier automatically remaps non-zero-based labels (such as lane numbers [3, 4, 5]) to a contiguous range starting at 0, or
2. the classifier exposes a clear configuration option or parameter for using non-zero-based labels.

Currently, the workaround is to manually remap the target labels before training (e.g., mapping {3: 0, 4: 1, 5: 2}), as sketched below. I would like to know whether this behavior is intended or whether a fix is planned to support non-zero-based labels directly.
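A minimal sketch of that workaround, reusing xgb, pd, X_train, X_test, y_train, and y_test from the reproduction snippet above (the *_mapped names are just illustrative):

# Remap lane numbers [3, 4, 5] to the contiguous range [0, 1, 2] expected by XGBoost.
label_map = {3: 0, 4: 1, 5: 2}
inverse_label_map = {v: k for k, v in label_map.items()}

y_train_mapped = y_train.map(label_map)
y_test_mapped = y_test.map(label_map)

# num_class is omitted here: the scikit-learn wrapper infers the number of classes from y.
model = xgb.XGBClassifier(random_state=42, enable_categorical=True, device="cuda")
model.fit(X_train, y_train_mapped)

# Map predictions back to the original lane numbers for evaluation/reporting.
y_pred = pd.Series(model.predict(X_test)).map(inverse_label_map)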

Environment:

XGBoost version: 2.1.4 or later
Python version: 3.11
Device: CUDA enabled

@trivialfis
Member

Hi, the behavior is expected: XGBoost requires encoded labels (consecutive integers starting at 0) as input. There has been discussion about whether the classifier should automatically route labels through sklearn's LabelEncoder before training, but we haven't decided to do so. There are many interfaces to cover, including the distributed ones, and we think the user should decide how to encode the labels. For instance, there are many label-encoder implementations, including the ones from sklearn and from cuml for GPU, and they need to handle various input types, including dataframes and arrays.
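For example, a minimal sketch of user-side encoding with sklearn's LabelEncoder, reusing X_train, y_train, and X_test from the reproduction snippet above (cuml provides a similar encoder for GPU dataframes):

from sklearn.preprocessing import LabelEncoder
import xgboost as xgb

# Encode the labels to 0-based integers before training.
le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)   # [3, 4, 5] -> [0, 1, 2]

clf = xgb.XGBClassifier(random_state=42, enable_categorical=True)
clf.fit(X_train, y_train_enc)

# Decode predictions back to the original label values.
y_pred = le.inverse_transform(clf.predict(X_test))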

@trivialfis
Member

Closing in favor of #11256.
