question: Understanding the relationship between get_split_value_histogram and trees_to_dataframe #11220

Open
CangyuanLi opened this issue Feb 7, 2025 · 2 comments

CangyuanLi commented Feb 7, 2025

Hi, thanks for the great library!

I am not sure if this is a bug or more of a misunderstanding on my part, but I am struggling to resolve some differences between the output of get_split_value_histogram and trees_to_dataframe. To my understanding, it should be possible to get the splits XGBoost uses for each feature from either method. However, I am getting drastically different results. As an example, here is a model with a float feature and some boolean features.

import numpy as np
import pandas as pd
import xgboost

N_ROWS = 1_000

np.random.seed(208)

X = pd.DataFrame({
    # "bool": np.random.choice([True, False], N_ROWS),
    "int_bool": np.random.choice([0, 1], N_ROWS),
    "float_bool": np.random.choice([0.0, 1.0], N_ROWS),
    "float": np.random.rand(N_ROWS),
})

y = np.random.choice([True, False], N_ROWS)


model = xgboost.XGBClassifier().fit(X, y)
booster = model.get_booster()
trees = booster.trees_to_dataframe()

Note that fitting with a boolean column works, but it breaks both get_split_value_histogram and trees_to_dataframe (already reported in #10437), which is why the "bool" column is commented out above. Looking at "int_bool",

print(booster.get_split_value_histogram("int_bool"))
print(trees.loc[trees["Feature"] == "int_bool"]["Split"].value_counts())

I get outputs of

   SplitValue  Count
0         1.5  223.0

and

Split
1.0    175

respectively. According to get_split_value_histogram, every split on this feature is at 1.5, which would be a trivial split for a 0/1 column, whereas trees_to_dataframe reports the more plausible split at 1.0. The counts also disagree (223 versus 175). Where does the 1.5 come from?
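
One possible explanation, offered as an assumption rather than something confirmed in this thread: if get_split_value_histogram bins the collected split values with numpy.histogram and reports the right edge of each bin, then a feature whose splits are all exactly 1.0 would be reported as 1.5. The sketch below reproduces only the NumPy behaviour; the array is a hypothetical stand-in for the split values, and it does not account for the difference in counts.

```python
import numpy as np

# Hypothetical stand-in for the split values collected for "int_bool":
# every split is at exactly 1.0 (the length is illustrative).
split_values = np.full(175, 1.0)

# When all values are identical, NumPy widens the histogram range to
# [value - 0.5, value + 0.5], so a single bin spans 0.5 to 1.5.
counts, edges = np.histogram(split_values, bins=1)

print(edges)      # both bin edges
print(edges[1:])  # right edges only (assumed to be what is reported)
print(counts)     # number of values in the single bin
```

If that assumption holds, the 1.5 is a bin edge rather than a threshold stored in any tree.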

Looking at the float feature,

print(booster.get_split_value_histogram("float").sort_values("SplitValue"))
print(
    trees.loc[trees["Feature"] == "float"]
    ["Split"]
    .value_counts()
    .to_frame()
    .reset_index()
    .sort_values("Split")
)

I get

     SplitValue  Count
0      0.012732    8.0
1      0.017076    5.0
2      0.025765    7.0
3      0.030109   20.0
4      0.034454    4.0
..          ...    ...
173    0.972841    9.0
174    0.977186    1.0
175    0.981530    5.0
176    0.990219    2.0
177    0.994563   11.0

[178 rows x 2 columns]

and

        Split  count
61   0.008877      7
20   0.015085     12
107  0.016895      5
17   0.021943     13
36   0.031006     10
..        ...    ...
2    0.977809     22
5    0.983249     20
28   0.987834     11
30   0.992312     10
50   0.994810      8

[220 rows x 2 columns]

respectively. According to get_split_value_histogram, there are 178 unique split values, while trees_to_dataframe reports 220. In addition, the split values themselves do not match between the two outputs.
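
The same assumption (equal-width binning with numpy.histogram, one bin per unique value, empty bins dropped, right edges reported) would also produce fewer rows than unique thresholds, with values that sit on bin edges instead of on the thresholds. The following sketch uses synthetic data, not the model above, so the exact numbers will differ from the outputs shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical thresholds: 220 distinct values in (0, 1), each used 1-19 times.
unique_splits = np.sort(rng.random(220))
split_values = np.repeat(unique_splits, rng.integers(1, 20, size=220))

n_unique = len(np.unique(split_values))

# One equal-width bin per unique value. Thresholds are not evenly spaced,
# so some bins hold several thresholds and others hold none.
counts, edges = np.histogram(split_values, bins=n_unique)

nonempty = counts > 0
reported = edges[1:][nonempty]  # right edges of the non-empty bins

print(n_unique, int(nonempty.sum()))
print(reported[:5])
print(unique_splits[:5])
```

Every split is still counted, but several thresholds can share one bin, which is why the number of rows drops.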

I would expect that the two functions return the same results.
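
For comparison, exact per-threshold counts can be tabulated directly from the trees_to_dataframe output, without any binning. This is a minimal sketch: exact_split_counts is a hypothetical helper, not part of XGBoost, and the toy frame only mimics the "Feature" and "Split" columns of the real output.

```python
import pandas as pd


def exact_split_counts(trees: pd.DataFrame, feature: str) -> pd.DataFrame:
    """Count how often each exact threshold is used for `feature`."""
    splits = trees.loc[trees["Feature"] == feature, "Split"]
    return (
        splits.value_counts()
        .rename_axis("SplitValue")
        .reset_index(name="Count")
        .sort_values("SplitValue", ignore_index=True)
    )


# Toy stand-in for booster.trees_to_dataframe(); leaf rows have no split.
toy = pd.DataFrame({
    "Feature": ["int_bool", "float", "Leaf", "int_bool", "float"],
    "Split": [1.0, 0.25, float("nan"), 1.0, 0.75],
})

result = exact_split_counts(toy, "int_bool")
print(result)
```

With the real model, passing `trees` from the snippet above in place of `toy` should give the same information as the value_counts calls shown earlier, in the column layout of get_split_value_histogram.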

@trivialfis
Member

Hmm, thank you for sharing. Will look into it after sorting out the work at hand.

@trivialfis
Member

Related: #6091
