[WIP][New Model] Dynamic Programming Decision Trees #176
Conversation
Great to see the start of the contribution here @KohlerHECTOR :) Please ping me for any questions or once I should take a look at the code. One thought on the topic: can your DT do predict_proba? Also, I am happy to run the HPO myself once the PR is completed. I have access to free compute for this purpose.
Hello, it does not support predict proba unfortunately.
Yea, that sounds like a good idea. Or an RF version of it 🤔
I'll do that then :) thanks for the help. Even though it is too bad, as decision tree algos are known to perform well on tab data :) it could be nice to make the benchmark compatible with them :) .
Hello @LennartPurucker. I just finished implementing BoostedDPDT as an AG abstract model. I have also added a search space for the BoostedDPDT hyperparameters. This should be compatible with predict proba. What should I do next? :)
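For illustration, a rough sketch of what such an AutoGluon custom-model wrapper typically looks like; this is not the PR's actual file, and the class name, the dpdt import, and the n_jobs keyword are assumptions here (the real _fit excerpt under review appears further below):

    import pandas as pd
    from autogluon.core.models import AbstractModel

    class BoostedDPDTModel(AbstractModel):
        def get_model_cls(self):
            # Lazy import so the extra dependency is only needed when the model runs.
            from dpdt import BoostedDPDTClassifier  # assumed estimator/package name
            return BoostedDPDTClassifier

        def _fit(self, X: pd.DataFrame, y: pd.Series, num_cpus: int = 1, **kwargs):
            model_cls = self.get_model_cls()
            params = self._get_model_params()  # hyperparameters from the config / search space
            X = self.preprocess(X)             # AutoGluon's standard feature preprocessing
            self.model = model_cls(n_jobs=num_cpus, **params)
            self.model.fit(X, y)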
Great, thank you! Give me some time to review and test your code, and I will get back to you with questions if I have any. I just need a bit of time as my week is very full.
def _fit(self, X: pd.DataFrame, y: pd.Series, num_cpus: int = 1, **kwargs):
    model_cls = self.get_model_cls()
    hyp = self._get_model_params()
    if num_cpus < 1:
I think num_cpus would never be below 1, did you want to do <= ?
Hello, I will remove it. It is just from experience with the joblib library, where one writes n_jobs = -1 to use all available CPUs.
Ah, yes. Here num_cpus might be the string "auto" in edge cases (not within TabArena benchmarks).
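A minimal sketch of defensively resolving such a value before handing it to a joblib-style n_jobs argument; resolve_n_jobs is a hypothetical helper, not part of the PR or of AutoGluon:

    import os

    def resolve_n_jobs(num_cpus="auto"):
        """Map an AutoGluon-style num_cpus value (int or the string "auto") to a joblib n_jobs value."""
        if num_cpus == "auto" or num_cpus is None:
            return -1  # joblib convention: -1 means "use all available CPUs"
        num_cpus = int(num_cpus)
        return max(1, min(num_cpus, os.cpu_count() or 1))  # clamp nonsensical values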
Code looks very clean, nice :) Two questions that might have a large impact on performance:
Hello, thanks for the feedback. No hurry, let us take it slow. I have already benchmarked Boosted-DPDT on clusters for medium and large datasets, and in my experience the key time/memory bottleneck is the training set size. What is the usual train-test-validation split like in TabArena? I think Boosted-DPDT could support a time limit, but not a very fine-grained one, i.e. the time limit could be something like checking the elapsed time between boosting iterations and stopping once it is exceeded. I did not optimize DPDT inference yet, especially not for boosting, so doing inference on the validation set for early stopping would not be a good idea at the moment.
We have training data from 500 to 100k samples with up to 2,000 features, which should mostly drive the training cost.
This would be exactly what is needed. One could also add a check to see whether there is enough time left to run another iteration.
Note that a longer inference time will also make the model "training" slower, as we cross-validate the models, which is factored into the training time. Early stopping on the validation data is mostly important for obtaining peak performance.
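A minimal sketch of the coarse-grained time limit and "enough time for another iteration" check discussed above; fit_one_round is a placeholder for fitting one boosted tree, not DPDT's actual API:

    import time

    def fit_with_time_limit(fit_one_round, n_estimators, time_limit=None):
        """Fit up to n_estimators boosting rounds, stopping early when another
        average-length round would likely exceed the time budget (in seconds)."""
        start = time.time()
        avg_round = 0.0
        fitted = 0
        for i in range(n_estimators):
            elapsed = time.time() - start
            if time_limit is not None and elapsed + avg_round > time_limit:
                break  # not enough budget left for another round
            round_start = time.time()
            fit_one_round(i)  # placeholder for fitting the i-th boosted tree
            fitted += 1
            avg_round = (avg_round * i + (time.time() - round_start)) / (i + 1)
        return fitted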
Hello, I have updated the source code of DPDT.
Great! I think this should be all we need to run a first benchmark on TabArena-Lite. I can likely start the runs early next week, as I am currently still running another model.
Looking forward to hearing back from you! It was nice working on this issue collaboratively.
Hello @LennartPurucker. Did you manage to run the model? Is it running without errors? :) Thx
Not yet, sorry :( Quite busy debugging and running other models that were still on my TODOs, and I am at a conference right now.
No worries! Good luck, thanks :)
Hey @LennartPurucker! Thx so much for the hard work :) I am happy it was not too difficult to use my code and that DPDT is not that bad :p .
Hey @LennartPurucker, thanks for the results! Assuming that I am in the tabrepo/ folder, what command should I use to run the full benchmark, please? Thank you so much for the results! It really helps improve the model :) . I will further work on the inference time: one quick change is to reduce the number of boosted estimators from 1000 to e.g. 500, or simply expose it as a hyperparameter. Memory consumption is indeed hard to estimate, sorry.
Check out https://github.com/TabArena/tabarena_benchmarking_examples for examples and more info on how to run the benchmark! I am not sure I would make the number of estimators a hyperparameter, but one could also early stop based on the estimated inference time, or prune the model after fitting. I am also not sure why it is so slow; as it happens for datasets with a lot of categoricals, I think it might be related to that aspect. No worries, memory estimation is always something that one has to figure out by trial and error :D
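A hedged sketch of the "prune the model after fitting" idea: keep only a prefix of the boosted estimators until measured inference time fits a budget. It assumes the fitted model exposes a mutable estimators_ list that predict respects, which may not match DPDT's actual internals:

    import copy
    import time

    def prune_to_inference_budget(model, X_sample, max_sec_per_1k_rows):
        """Truncate the boosted ensemble so inference on X_sample stays within
        the budget, expressed in seconds per 1000 rows."""
        pruned = copy.deepcopy(model)
        while len(pruned.estimators_) > 1:
            start = time.time()
            pruned.predict(X_sample)
            sec_per_1k = (time.time() - start) * 1000 / len(X_sample)
            if sec_per_1k <= max_sec_per_1k_rows:
                break
            # Keeping only the first k estimators of a boosted ensemble is still a valid model.
            pruned.estimators_ = pruned.estimators_[: int(len(pruned.estimators_) * 0.8)]
        return pruned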
@LennartPurucker I have had a look at the examples, but they are not very clear to me. Is there a "main loop" that iterates over the training datasets somewhere? Btw, I have made the inference and training way (way way) faster: it was cubic time, now it is linear time. So do you think there is a command for me to run the benchmark somewhere? Thank you again for all the help.
Check out this example: https://github.com/TabArena/tabarena_benchmarking_examples/blob/main/tabarena_minimal_example/run_tabarena_lite.py In your case, you only need to change the get_configs function to use your own configs function from the new model you added. Let me know if this helps! If not, I can take a look later and send you an example script.
That is super helpful, thx so much, I will have a look! But so, can I generate 1 million random_configs? Also, if I want to run on TabArena (not Lite), should I change the arguments of
Generally, you can follow the settings used for the other methods as described in our paper: 200 configs, 32 GB RAM, 8 CPUs. But you are free to invest as much time or resources into the method as you like for your own studies. For the leaderboard, we would look at your result artifacts and only pick 200 random configurations to make it comparable.
Jup, you need to set the parameter to run the correct number of folds and repeats for the dataset size. See this metadata file for information per dataset https://github.com/TabArena/tabarena_dataset_curation/blob/main/dataset_creation_scripts/metadata/tabarena_dataset_metadata.csv
Yes!
Thank you so much! Thanks in advance.
To be compatible with TabArena and use the model pipeline framework we designate, you will have to use our benchmarking interface via https://github.com/TabArena/tabarena_benchmarking_examples/blob/main/tabarena_minimal_example/run_tabarena_lite.py
Our benchmarking interface runs the model in a well-designed pipeline that handles many problems you might encounter. Moreover, it saves all the results as needed. There is no straightforward or supported method for achieving the same without our benchmarking interface.
To test your method on your own, you can use the code above and run your method against it. However, this approach lacks sufficient support for proper cross-validation, HPO, and other features provided by the benchmarking interface. Thus, you would need to implement benchmarking code, which may encounter bugs and other issues that lead to unfair comparisons. I strongly recommend using our benchmarking interface. Is there something specific that stops you from using the benchmarking interface?
Well, I could naively run https://github.com/TabArena/tabarena_benchmarking_examples/blob/main/tabarena_minimal_example/run_tabarena_lite.py. But that would mean two problems:
An example code to run all experiments sequentially would be:

    import pandas as pd

    metadata_df = pd.read_csv(
        "https://raw.githubusercontent.com/TabArena/tabarena_dataset_curation/refs/heads/main/dataset_creation_scripts/metadata/tabarena_dataset_metadata.csv"
    )
    task_ids = metadata_df["task_id"].tolist()
    folds = metadata_df["num_folds"].tolist()
    repeats = metadata_df["tabarena_num_repeats"].tolist()

    run_experiments_new(
        output_dir=TABARENA_DIR,
        model_experiments=model_experiments,
        tasks=task_ids,
        repetitions_mode="matrix",
        repetitions_mode_args=[
            (n_fold, n_repeats) for n_fold, n_repeats in zip(folds, repeats)
        ],
    )
See TabFlow for examples: https://github.com/TabArena/tabarena_benchmarking_examples/tree/main/tabflow_slurm

Thus, you would want to split up the code above to look like this:

    import pandas as pd
    from itertools import product

    metadata_df = pd.read_csv(
        "https://raw.githubusercontent.com/TabArena/tabarena_dataset_curation/refs/heads/main/dataset_creation_scripts/metadata/tabarena_dataset_metadata.csv"
    )

    for row in metadata_df.itertuples():
        repeats_folds = product(
            range(int(row.tabarena_num_repeats)), range(int(row.num_folds))
        )
        for repeat_i, fold_i in repeats_folds:
            for model_experiment in model_experiments:
                # You likely want to parallelize this call/part
                run_experiments_new(
                    output_dir=TABARENA_DIR,
                    model_experiments=[model_experiment],
                    tasks=[row.task_id],
                    repetitions_mode="individual",
                    repetitions_mode_args=[(fold_i, repeat_i)],
                )

To provide a general sketch of how to set up benchmarking:
Our benchmarking code handles running the job and parallelizing the model training process.
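One naive way to parallelize the "parallelize this call/part" comment above locally is sketched below; the supported route for large runs is the TabFlow SLURM setup, and run_experiments_new, model_experiments, and TABARENA_DIR are the names from the previous snippet, assumed to be importable or defined in the same script:

    from concurrent.futures import ProcessPoolExecutor
    from itertools import product

    import pandas as pd

    def run_one(task_id, fold_i, repeat_i, model_experiment):
        # Same call as in the snippet above, one (task, fold, repeat, config) unit per job.
        run_experiments_new(
            output_dir=TABARENA_DIR,
            model_experiments=[model_experiment],
            tasks=[task_id],
            repetitions_mode="individual",
            repetitions_mode_args=[(fold_i, repeat_i)],
        )

    if __name__ == "__main__":
        metadata_df = pd.read_csv(
            "https://raw.githubusercontent.com/TabArena/tabarena_dataset_curation/refs/heads/main/dataset_creation_scripts/metadata/tabarena_dataset_metadata.csv"
        )
        jobs = [
            (row.task_id, fold_i, repeat_i, model_experiment)
            for row in metadata_df.itertuples()
            for repeat_i, fold_i in product(
                range(int(row.tabarena_num_repeats)), range(int(row.num_folds))
            )
            for model_experiment in model_experiments
        ]
        with ProcessPoolExecutor(max_workers=4) as pool:
            futures = [pool.submit(run_one, *job) for job in jobs]
            for f in futures:
                f.result()  # surface any exceptions from the workers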
That is actually so helpful!!!!!
Jup!
First commit to add DPDT to TabArena.
We committed the skeleton of a class.
The paper can be found here.
The original code for DPDT is here and passes the sklearn tests for a BaseEstimator.