[TabArena 2025 Roadmap] NeurIPS 2025 Camera Ready and Future Work Items

Planned items for camera-ready deadline of NeurIPS 2025 (Oct 23rd) and future plans for 2025.

# Remark
As we have already updated many models since the paper submission and rebuttal, we have three different states of results. (1) the state at submission time, (2) the state when uploading to arXiv, including the promises for rerunning models we made for the rebuttal [this is the current state of the live LB], and (3) the current state of all models we ran since then and which we will publish soon. The camera-ready version will be based on state (2), and the next version of the live leaderboard will be based on state (3). 

# P0 (Need to have) [Oct 23rd for Paper, 1st of December for Ecosystem]

### Paper 

- [ ] Finalize and add new figures (@Innixma)
	- [ ] Pareto front of Improvability and inference time
	- [ ] showing the performance over (tuning) time related to Improvability
	- [ ] Validation Overfitting plot to replace the ensemble-weight plot (or similar) + adjusting writing (@Innixma)
- [x] Finalize and add improvements to writing and the appendix based on feedback and our promises in the rebuttal (@LennartPurucker)
	- [ ] Ensure writing around the integration of new figures aligns with the rebuttal promises (@LennartPurucker)
- [x] Finalize and add improvements to writing and the appendix based on community feedback (@LennartPurucker)
- [ ] Ensure we mention and quantify likely dataset contamination for TabDPT
- [ ] Remove the imputed dataset for KNN from tables for performance per dataset  (@atschalz) and make more straightforward somewhere that KNN has imputed results for datasets without numerical features (we only state in C1 that we drop all categorical features). 
	- [x] PR Merged to remove imputed KNN from tables  (@atschalz)
	- [ ] Re-run with corrected code on the paper results (don't use new RealMLP/EBM results) (@Innixma)
	- [ ] New version of KNN preprocessing; rerun KNN; replot everything 

### Ecosystem
- [ ] Finalize integration of results for new models, which we already ran since submission
	- [x] RealMLP_GPU [Verifed]
	- [x] EBM [Verified]
	- [x] Mitra [Verified]
	- [x] xRFM [Verified]
	- [ ] LimiX [Not Verified]
	- [ ] TabFlex [Not Verified]
	- [ ] Beta-TabPFN [Not Verified]
- [ ] Update leaderboard
	- [ ] Ensure we are using the newest data from model runs that we have at the time of the update
	- [ ] Integrate Parteo front and tuning over time as plots into each leaderboard, like the main figure
	- [ ] Add support for results of unverified models 
	- [ ] Add test data leakage column (boolean or %)
	- [ ] Add additional subsets: binary, multiclass, "not small data"; consider removing/renaming TabICL-data and TabPFN-data.
	- [ ] Consider removing some models from the plots in the LB (KNN, worse than RF, ...) but keep in the LB tables.  
	- [ ] TabArena Rank Metric? := (rank (based on Elo) + rank based on Improvability + harmonic rank)/3
- [ ] Add verified / unverified to existing models: 
	- [ ] Verified: RealMLP, TabM, MondernNCA, xRFM, Mitra, TabPFNv2, TorchMLP, FastaiMLP
	- [ ] Unverified (technically could be verified by authors/maintainers): LightGBM, CatBoost, XGBoost, TabDPT
	- [ ] NA (IMO no need for verification as these are established baselines): KNN, Linear model, ExtraTrees, RandomForest


# P1 (Nice to have)

- [ ] More Models
	- [ ] TabICL with HPO [We got results but have not added them, need verification/confirmation from @dholzmueller]
	- [ ] RealTabPFN
	- [ ] PerpetualBoosting [We got some results, but not verified]
	- [ ] TabM version based on pip package and with varying inner seeds.  
	- [ ] LimiX with HPO
	- [ ] New run of TabDPT with verification from authors [context size, ensemble usage, ...]
	- [ ] Better KNN baseline pipeline (improve preprocessing and search space), or consider removing it.
	- [ ] Verifiy / improve linear model and its HPO if possible
- [ ] Improve User Experience
	- [ ] Polished end to end example of locally fitting a model -> evaluating on TabArena
	- [ ] Polished installation instructions & FAQ (ex: TabDPT install error)
	- [ ] Create a technical API for the relevant TabArena/TabRepo function
	- [ ] Create an onboarding page with different use cases and better step-by-step documentation (upgrade from https://github.com/TabArena/tabarena_benchmarking_examples)
- [ ] Technical Debt
	- [ ] `pip install tabarena`
	- [ ] Rename codebase to TabArena?
	- [ ] Cleanup/remove old TabRepo scripts (@Innixma)
	- [ ] Merge https://github.com/TabArena/tabarena_benchmarking_examples (as much as possible, maybe just examples) into TabArena codebase
	- [ ] Add easier way to run TabArena-Full into the runner (see the following: https://github.com/autogluon/tabrepo/pull/176#issuecomment-3176871534)
	
# P2 (Stretch Goal)

- [ ] Think about how to communicate the difference of HPO vs finetuning vs ICL-performance
- [ ] Improve tree-based models further
	- [ ] Update hyperparameters and implementation (also use newer versions if they exist) for tree-based models (@dholzmueller, #203, #124)
- [ ] Rerun all methods with varying inner seeds
	- [ ] CPU models: KNN, Linear, RF, ExtraTrees, FastaiMLP, TorchMLP, CatBoost, LightGBM, XGBoost
	- [ ] GPU models: Beta-TabPFN, ModernNCA (HPO)
	- [ ] GPU models with HPO and refitting (so minor impact): TabPFNv2 (HPO), TabICL (HPO)
	- [ ] GPU models without HPO and refitting (so likely no impact): TabFlex, TabDPT
	- [ ] Done models: RealMLP_GPU, xRFM, Mitra, EBM, LimiX, TabM (assuming we run the pip-version as stated above)
- [ ] Portfolio building logic that created AutoGluon 1.4 extreme preset portfolio? (@Innixma)
- [ ] AutoGluon high, HQIL, good, medium quality runs (@Innixma)
- [ ] AutoGluon w/ smaller time limit (5 min, 10 min, 30 min) (@Innixma)

# Below the line (Not scheduled so far)

- [ ] Integration with AMLB for other AutoML system results and support for systems/agents

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[TabArena 2025 Roadmap] NeurIPS 2025 Camera Ready and Future Work Items #213

Remark

P0 (Need to have) [Oct 23rd for Paper, 1st of December for Ecosystem]

Paper

Ecosystem

P1 (Nice to have)

P2 (Stretch Goal)

Below the line (Not scheduled so far)

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[TabArena 2025 Roadmap] NeurIPS 2025 Camera Ready and Future Work Items #213

Description

Remark

P0 (Need to have) [Oct 23rd for Paper, 1st of December for Ecosystem]

Paper

Ecosystem

P1 (Nice to have)

P2 (Stretch Goal)

Below the line (Not scheduled so far)

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions