
@LennartPurucker LennartPurucker commented Sep 5, 2025

This code adds LimiX (https://arxiv.org/pdf/2509.03505) from https://github.com/limix-ldm/LimiX

Notes

  • I have been using and running the non-retrieval version. In my small experiments, the retrieval version sometimes took more than one hour for a single small dataset and performed worse on most TabArena datasets. I suspect there is either a bug in the code or in the way I am using it.
  • I have not compiled the model and am using torch's native flash attention, so there may be room to make the model faster.
  • To run the method in a typical data science pipeline, I had to run DDP in a single process ("single"-threaded DDP; see the DDP sketch after this list). Unfortunately, the original code offers no option to simply disable DDP without major changes. Adding one would be a big TODO for proper integration (besides code-quality improvements in other areas and a closer alignment with the sklearn API). One also needs to fix some of the code related to shutting down DDP. Just to be clear: even when inference is configured to run without DDP, this code here is still the problem: https://github.com/limix-ldm/LimiX/blob/main/inference/inference_method.py#L18 (and I changed it in the code in this PR).
  • When trying to get the method to run, I noticed that this code https://github.com/limix-ldm/LimiX/blob/main/inference/predictor.py#L321 crashes on datasets with just one feature (in one of the preprocessing configs) because of the .squeeze(). I removed the .squeeze(); I am not sure why it was there in the first place, as a randomly appearing dimension sounds more like a bug (see the squeeze example after this list).
  • I also replaced all the lambda functions with real functions or functools.partial to make LimiX pickle-able (https://github.com/limix-ldm/LimiX/blob/main/inference/preprocess.py#L435; see the pickling sketch after this list).
  • LimiX does not yet support installation from GitHub or pip, which is quite unfortunate.
  • The cache path used in the tutorial for downloading the model is not a path one should use on all systems. I added the TabPFNv2-style path logic to make this stable (see the cache-path sketch after this list).
  • It is unclear how the default configs were found / optimized, but it would be good to get a search space we can tune over to see how much better we can make LimiX.
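
Regarding the DDP point above, here is a minimal sketch of what "single"-threading DDP looks like: a one-process process group so DDP-style code can run inside a normal pipeline. The function names and the gloo backend are illustrative assumptions, not the LimiX code.

```python
import os
import torch.distributed as dist

def init_single_process_group():
    # Assumption: env:// rendezvous with a local address so no launcher is needed.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    if not dist.is_initialized():
        # world_size=1, rank=0 -> collectives effectively become no-ops in this single process.
        dist.init_process_group(backend="gloo", rank=0, world_size=1)

def shutdown_process_group():
    # Guarded shutdown, so repeated calls (or a never-initialized group) do not crash.
    if dist.is_initialized():
        dist.destroy_process_group()
```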
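
A tiny illustration of the .squeeze() issue mentioned above: an unconditional squeeze drops every size-1 dimension, so the feature axis disappears whenever the dataset has exactly one feature (illustrative example, not the LimiX code itself).

```python
import torch

x = torch.randn(32, 1)     # 32 samples, 1 feature
print(x.squeeze().shape)   # torch.Size([32]) -- the feature axis is silently dropped
print(x.shape)             # torch.Size([32, 1]) -- without squeeze, it stays 2D as downstream code expects
```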
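
For the pickling point, this is the general pattern of the change (lambdas cannot be pickled, while module-level functions and functools.partial can); the names here are illustrative, not the actual preprocess.py code.

```python
import pickle
from functools import partial

def scale(x, factor):
    return x * factor

# pickle.dumps(lambda x: x * 2.0)  # fails: lambdas cannot be pickled by reference
scaled_by_two = partial(scale, factor=2.0)
restored = pickle.loads(pickle.dumps(scaled_by_two))  # works: partial of a top-level function
assert restored(3.0) == 6.0
```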
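
For the cache path, the idea is roughly the following sketch of a platform-stable cache directory; the environment variable and directory layout are assumptions for illustration, not the actual TabPFNv2/LimiX logic.

```python
import os
from pathlib import Path

def resolve_model_cache_dir() -> Path:
    # Hypothetical override variable for users who want a custom location.
    override = os.environ.get("LIMIX_MODEL_CACHE_DIR")
    if override:
        cache_dir = Path(override)
    else:
        xdg = os.environ.get("XDG_CACHE_HOME", str(Path.home() / ".cache"))
        cache_dir = Path(xdg) / "limix" / "models"
    cache_dir.mkdir(parents=True, exist_ok=True)
    return cache_dir
```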

Performance on the TabPFN-subset of TabArena-Full

[image: results on the TabPFN subset of TabArena-Full]

I have tried running the method on more datasets as well, and it worked. However, for larger datasets in TabArena (e.g., 50k samples, 130 features), it runs out of VRAM (given 40 GB of VRAM). So for now, I will stick to this subset.


By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@LennartPurucker
Collaborator Author

For the sake of completeness, here are the results with retrieval (see the prefix [RET]) on TabArena-Lite, again on the TabPFN subset.
On this subset of small datasets, retrieval failed to finish within the time limit of 1.5 hours (including overhead) on 24.24% of the datasets.

[image: results with retrieval ([RET]) on TabArena-Lite, TabPFN subset]

@LennartPurucker LennartPurucker commented Sep 17, 2025

I added batching of the test predictions, which allowed me to run LimiX on a few more datasets. The idea is roughly the sketch below.
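
A minimal sketch of the batched test-time prediction; model.predict_proba and the batch size are placeholders for illustration, not the LimiX API.

```python
import numpy as np

def predict_in_batches(model, X_test, batch_size=2048):
    # Predict in fixed-size chunks so peak VRAM stays bounded by the batch size.
    parts = [
        model.predict_proba(X_test[start:start + batch_size])
        for start in range(0, len(X_test), batch_size)
    ]
    return np.concatenate(parts, axis=0)
```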

Even with this batching, I am not able to run LimiX on an H200 with the current setup for the OpenML task IDs 363628 and 363673, as they have too many samples: 363628 has 90k training samples, 21 features; 363673 has 100k training samples, 10 features.

At the same time, the datasets that I can now run with batching on an H200 (roughly up to 70k training samples) also take a very long time to predict (multiple hours, and even longer because of the batched test predictions).

I will postpone further investigation of larger datasets until there is an update from the authors, and otherwise stick to the TabPFN-subset limits, as these seem roughly in scope in terms of efficiency.

@LennartPurucker LennartPurucker commented Sep 17, 2025

Here are the results on TabArena-Full for all datasets.

For LimiX, we had to impute two datasets (4%), that is, the tasks that ran out of VRAM as mentioned above.
All other imputed foundation models had many more datasets imputed (see the official leaderboard for the numbers), as they were not run on an H200.
[image: results on TabArena-Full, all datasets]
