question: Understanding the relationship between `get_split_value_histogram` and `trees_to_dataframe` #11220

CangyuanLi · 2025-02-07T21:46:33Z

Hi, thanks for the great library!

I am not sure whether this is a bug or a misunderstanding on my part, but I am struggling to reconcile the output of `get_split_value_histogram` with that of `trees_to_dataframe`. As I understand it, either method should give the split values XGBoost uses for each feature. However, I am getting drastically different results. As an example, here is a model with a float feature and some binary (0/1) features.

import numpy as np
import pandas as pd
import xgboost

N_ROWS = 1_000

np.random.seed(208)

X = pd.DataFrame({
    # "bool": np.random.choice([True, False], N_ROWS),
    "int_bool": np.random.choice([0, 1], N_ROWS),
    "float_bool": np.random.choice([0.0, 1.0], N_ROWS),
    "float": np.random.rand(N_ROWS),
})

y = np.random.choice([True, False], N_ROWS)


model = xgboost.XGBClassifier().fit(X, y)
booster = model.get_booster()
trees = booster.trees_to_dataframe()

Note that fitting with an actual boolean column works, but it then breaks both `get_split_value_histogram` and `trees_to_dataframe` (already reported in #10437). Looking at "int_bool",

print(booster.get_split_value_histogram("int_bool"))
print(trees.loc[trees["Feature"] == "int_bool"]["Split"].value_counts())

I get outputs of

   SplitValue  Count
0         1.5  223.0

and

Split
1.0    175

respectively. According to `get_split_value_histogram`, there is a trivial split at 1.5, whereas `trees_to_dataframe` reports a more plausible split at 1.0. Where does the 1.5 come from?
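One possible source of the 1.5, offered as an assumption about the internals rather than something confirmed in this thread: if `get_split_value_histogram` passes the raw thresholds through `numpy.histogram` and reports each bin's right edge, then a feature whose splits all sit at 1.0 would be reported at 1.5, because NumPy widens a zero-width range by 0.5 on each side. A minimal sketch using only NumPy, where `raw_splits` is a hypothetical stand-in for the thresholds:

```python
import numpy as np

# Hypothetical stand-in for the raw thresholds of a 0/1 feature:
# every split sits at the same value, 1.0 (175 of them, as in the
# trees_to_dataframe output).
raw_splits = np.full(175, 1.0)

# With only one unique value the data range has zero width, so
# numpy.histogram expands it to [value - 0.5, value + 0.5].
counts, edges = np.histogram(raw_splits, bins=1)

print(edges)      # [0.5 1.5]
print(edges[1:])  # [1.5]  <- right edge of the single bin
print(counts)     # [175]
```

If this is what happens, the 1.5 is a bin boundary rather than a threshold stored in any tree. The sketch does not account for the difference in counts (223 versus 175).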

Looking at the float feature,

print(booster.get_split_value_histogram("float").sort_values("SplitValue"))
print(
    trees.loc[trees["Feature"] == "float"]
    ["Split"]
    .value_counts()
    .to_frame()
    .reset_index()
    .sort_values("Split")
)

I get

     SplitValue  Count
0      0.012732    8.0
1      0.017076    5.0
2      0.025765    7.0
3      0.030109   20.0
4      0.034454    4.0
..          ...    ...
173    0.972841    9.0
174    0.977186    1.0
175    0.981530    5.0
176    0.990219    2.0
177    0.994563   11.0

[178 rows x 2 columns]

and

        Split  count
61   0.008877      7
20   0.015085     12
107  0.016895      5
17   0.021943     13
36   0.031006     10
..        ...    ...
2    0.977809     22
5    0.983249     20
28   0.987834     11
30   0.992312     10
50   0.994810      8

[220 rows x 2 columns]

respectively. According to `get_split_value_histogram`, there are 178 unique splits, whereas `trees_to_dataframe` reports 220. In addition, the split values themselves differ noticeably between the two.
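Both symptoms (fewer rows and shifted values) would be consistent with binning. The sketch below is an assumption, not the library's confirmed implementation: `right_edge_histogram` is a hypothetical helper that bins raw thresholds into as many equal-width bins as there are unique values, then keeps the right edge and count of each non-empty bin. The input values are made up for illustration.

```python
import numpy as np


def right_edge_histogram(values, bins=None):
    """Hypothetical helper: bin `values` into equal-width bins and
    return (right_edges, counts) for the non-empty bins only."""
    values = np.asarray(values, dtype=float)
    n_unique = len(np.unique(values))
    # Use at most one bin per unique value, and at least one bin.
    n_bins = max(n_unique if bins is None else min(bins, n_unique), 1)
    counts, edges = np.histogram(values, bins=n_bins)
    keep = counts > 0
    return edges[1:][keep], counts[keep]


# Made-up thresholds: four unique values, two of them close together.
raw = [0.0, 0.0, 0.05, 0.45, 1.0]
right_edges, counts = right_edge_histogram(raw)

print(len(np.unique(raw)))   # 4
print(right_edges.tolist())  # [0.25, 0.5, 1.0]
print(counts.tolist())       # [3, 1, 1]
```

Here 0.0 and 0.05 fall into the same bin, so four unique thresholds collapse to three rows, and the reported values are bin edges that mostly do not coincide with any raw threshold. Applying the same helper to `trees.loc[trees["Feature"] == "float", "Split"]` would be one way to check whether binning accounts for the 178 versus 220 gap.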

I would expect that the two functions return the same results.


trivialfis · 2025-02-08T17:11:56Z

Hmm, thank you for sharing. Will look into it after sorting out the work at hand.

trivialfis · 2025-02-13T13:18:39Z

Related: #6091

trivialfis added the ? Triage label Feb 8, 2025
