
@LennartPurucker LennartPurucker commented Sep 5, 2025

This code adds LimiX (https://arxiv.org/pdf/2509.03505) from https://github.com/limix-ldm/LimiX

Notes

  • I have been using and running the non-retrieval version. In my small experiments, the retrieval version sometimes took more than one hour for a single small dataset and performed worse on most TabArena datasets. I suspect there is either a bug in the code or in the way I am using it.
  • I have not compiled the model and am using torch's native flash attention, so there may be room to make the model faster.
  • To run the method in a typical data science pipeline, I had to run DDP in a single process ("single"-threaded DDP; see the DDP sketch after this list). Unfortunately, the original code offers no option to simply disable DDP without major changes. Adding one would be a big TODO for proper integration (besides code-quality improvements in other areas and a closer alignment with the sklearn API). One also needs to fix some of the code related to shutting down DDP. Just to be clear: even when inference is configured to run without DDP, this code here is still the problem: https://github.com/limix-ldm/LimiX/blob/main/inference/inference_method.py#L18 (and I changed it in the code in this PR).
  • When trying to get the method to run, I noticed that this code https://github.com/limix-ldm/LimiX/blob/main/inference/predictor.py#L321 crashes on datasets with just one feature (in one of the preprocessing configs) because of the .squeeze(). I removed the .squeeze(); I am not sure why it was there in the first place, as a randomly appearing dimension sounds more like a bug (see the squeeze example after this list).
  • I also replaced all the lambda functions with real functions or functools.partial to make LimiX pickle-able (https://github.com/limix-ldm/LimiX/blob/main/inference/preprocess.py#L435; see the pickling sketch after this list).
  • LimiX does not yet support installation from GitHub or pip, which is quite unfortunate.
  • The cache path used in the tutorial for downloading the model is not a path one should use on all systems. I added the TabPFNv2-style path logic to make this stable (see the cache-path sketch after this list).
  • It is unclear how the default configs were found / optimized, but it would be good to get a search space we can tune over to see how much better we can make LimiX.
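
Regarding the DDP point above, here is a minimal sketch of what "single"-threading DDP looks like: a one-process process group so DDP-style code can run inside a normal pipeline. The function names and the gloo backend are illustrative assumptions, not the LimiX code.

```python
import os
import torch.distributed as dist

def init_single_process_group():
    # Assumption: env:// rendezvous with a local address so no launcher is needed.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    if not dist.is_initialized():
        # world_size=1, rank=0 -> collectives effectively become no-ops in this single process.
        dist.init_process_group(backend="gloo", rank=0, world_size=1)

def shutdown_process_group():
    # Guarded shutdown, so repeated calls (or a never-initialized group) do not crash.
    if dist.is_initialized():
        dist.destroy_process_group()
```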
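
A tiny illustration of the .squeeze() issue mentioned above: an unconditional squeeze drops every size-1 dimension, so the feature axis disappears whenever the dataset has exactly one feature (illustrative example, not the LimiX code itself).

```python
import torch

x = torch.randn(32, 1)     # 32 samples, 1 feature
print(x.squeeze().shape)   # torch.Size([32]) -- the feature axis is silently dropped
print(x.shape)             # torch.Size([32, 1]) -- without squeeze, it stays 2D as downstream code expects
```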
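
For the pickling point, this is the general pattern of the change (lambdas cannot be pickled, while module-level functions and functools.partial can); the names here are illustrative, not the actual preprocess.py code.

```python
import pickle
from functools import partial

def scale(x, factor):
    return x * factor

# pickle.dumps(lambda x: x * 2.0)  # fails: lambdas cannot be pickled by reference
scaled_by_two = partial(scale, factor=2.0)
restored = pickle.loads(pickle.dumps(scaled_by_two))  # works: partial of a top-level function
assert restored(3.0) == 6.0
```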
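
For the cache path, the idea is roughly the following sketch of a platform-stable cache directory; the environment variable and directory layout are assumptions for illustration, not the actual TabPFNv2/LimiX logic.

```python
import os
from pathlib import Path

def resolve_model_cache_dir() -> Path:
    # Hypothetical override variable for users who want a custom location.
    override = os.environ.get("LIMIX_MODEL_CACHE_DIR")
    if override:
        cache_dir = Path(override)
    else:
        xdg = os.environ.get("XDG_CACHE_HOME", str(Path.home() / ".cache"))
        cache_dir = Path(xdg) / "limix" / "models"
    cache_dir.mkdir(parents=True, exist_ok=True)
    return cache_dir
```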

Performance on the TabPFN-subset of TabArena-Full

[image: results on the TabPFN subset of TabArena-Full]

I have tried running the method on more datasets as well, and it worked. However, for larger datasets in TabArena (e.g., 50k samples, 130 features), it runs out of VRAM (given 40 GB of VRAM). So for now, I will stick to this subset.


By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@LennartPurucker
Collaborator Author

For the sake of completeness, here are the results with retrieval (see the prefix [RET]) on TabArena-Lite, again on the TabPFN subset.
On this subset of small datasets, retrieval failed to finish within the time limit of 1.5 hours (including overhead) on 24.24% of the datasets.

[image: results with retrieval ([RET]) on TabArena-Lite, TabPFN subset]

@LennartPurucker LennartPurucker commented Sep 17, 2025

I added batching of the test predictions, which allowed me to run LimiX on a few more datasets. The idea is roughly the sketch below.
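
A minimal sketch of the batched test-time prediction; model.predict_proba and the batch size are placeholders for illustration, not the LimiX API.

```python
import numpy as np

def predict_in_batches(model, X_test, batch_size=2048):
    # Predict in fixed-size chunks so peak VRAM stays bounded by the batch size.
    parts = [
        model.predict_proba(X_test[start:start + batch_size])
        for start in range(0, len(X_test), batch_size)
    ]
    return np.concatenate(parts, axis=0)
```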

Even with this batching, I am not able to run LimiX on an H200 with the current setup for the OpenML task IDs 363628 and 363673, as they have too many samples: 363628 has 90k training samples, 21 features; 363673 has 100k training samples, 10 features.

At the same time, the datasets that I can now run with batching on an H200 (roughly up to 70k training samples) also take a very long time to predict (multiple hours, and even longer because of the batched test predictions).

I will postpone further investigation of larger datasets until there is an update from the authors, and otherwise stick to the TabPFN-subset limits, as these seem roughly in scope in terms of efficiency.

@LennartPurucker LennartPurucker commented Sep 17, 2025

Here are the results on TabArena-Full for all datasets.

For LimiX, we had to impute two datasets (4%), that is, the tasks that ran out of VRAM as mentioned above.
All other imputed foundation models had many more datasets imputed (see the official leaderboard for the numbers), as they were not run on an H200.
[image: results on TabArena-Full, all datasets]
