This project explores the use of Machine Learning models to predict hydrogen adsorption energies on rocksalt-type complex oxide surfaces, using Density Functional Theory (DFT) calculations as ground truth.
The goal is to replace computationally expensive DFT calculations with ML models capable of producing predictions in milliseconds, accelerating the discovery and screening of new catalytic materials.
The problem can be summarized as:

$$
\text{material} \;\xrightarrow{\;\text{DFT}\;}\; \text{descriptors } \mathbf{x} \;\xrightarrow{\;f_\theta\;}\; \hat{E}_{\text{ads}}
$$

where:
- DFT generates physico-chemical descriptors of the material,
- the ML model $f_\theta$ learns the mapping from those descriptors to the adsorption energy.
DFT is one of the most widely used methods in computational chemistry and materials science for computing electronic properties from quantum first principles. Instead of directly solving the Schrödinger equation for every electron in the system, DFT reformulates the problem in terms of the electronic density, drastically reducing complexity and cost.
DFT delivers high accuracy, but has a critical limitation:
- a single simulation can take hours to days on an HPC cluster,
- screening thousands of candidate materials becomes computationally infeasible.
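The target property of this project is the hydrogen adsorption energy, taken here in its conventional form:

$$
E_{\text{ads}} = E_{\text{slab+H}} - E_{\text{slab}} - \tfrac{1}{2} E_{\text{H}_2}
$$

where $E_{\text{slab+H}}$, $E_{\text{slab}}$, and $E_{\text{H}_2}$ are the DFT total energies of the slab with adsorbed hydrogen, the clean slab, and an isolated H$_2$ molecule, and more negative values indicate a more favorable (exothermic) adsorption.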
The dataset contains:
- 336 samples
- 14 numerical descriptors
- 1 target variable (`Eads`)
Each sample corresponds to a distinct oxygen adsorption site on one of 21 slab models of the high-entropy oxide

| Feature group | Descriptors | What they capture |
|---|---|---|
| Electronic | `Bader-charge`, `Ave-O2p-up`, `Ave-O2p-down` | Local electronic distribution and charge transfer at the adsorption site |
| Geometric | `BLVE Neighbor 1–5` | Local geometry around the active site (bond length divided by valence electrons for each of the 5 nearest metals) |
| Compositional | `Freq Ni`, `Freq Mg`, `Freq Cu`, `Freq Zn` | Local cationic composition (neighbor counts, integers 0–5) |
| Chemical | `Ave-diff-EN`, `Ave-diff-IE` | Mean electronegativity and ionization-energy differences between the central oxygen and its metal neighbors |

- Dataset loading and validation with `pandas`
- Exploratory data analysis (EDA)
- 80/20 train/test split (`seed = 42`)
- Standardization via `StandardScaler` (fit on train only; applied to linear and MLP models, while tree ensembles and TabPFN use raw features)
- 5-fold cross-validation on the training set
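
The preprocessing steps above can be sketched with scikit-learn as follows. This is a minimal illustration, not the notebook's code: the data is a synthetic stand-in with the dataset's shape (336 × 14), and the variable names are assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the dataset's shape (336 samples, 14 descriptors).
rng = np.random.default_rng(42)
X = rng.normal(size=(336, 14))
y = rng.normal(size=336)

# 80/20 split with a fixed seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training split only, then reuse it for the test split.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

# 5-fold cross-validation indices over the training set.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
fold_sizes = [len(val_idx) for _, val_idx in kfold.split(X_train_std)]
print(X_train.shape, X_test.shape, fold_sizes)
```

With 336 samples, the 80/20 split yields 268 training and 68 test samples. Inside cross-validation, wrapping the scaler and the model in a scikit-learn `Pipeline` keeps each fold's scaler fitted on that fold's training part only.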
The MLP proposed in the project preliminary is a regularization-first feed-forward network implemented in PyTorch:
```
Input (14)
  → Linear(14 → 64) → ReLU → Dropout(0.25)
  → Linear(64 → 64) → ReLU → Dropout(0.25)
  → Linear(64 → 64) → ReLU → Dropout(0.25)
  → Linear(64 → 64) → ReLU → Dropout(0.25)
  → Linear(64 → 64) → ReLU → Dropout(0.25)
  → Linear(64 → 1)
Output (regression)
```
- Loss: MSE
- Optimizer: Adam
- Learning rate: `1e-4`
- Weight decay: `1e-4`
- Batch size: `32`
- Early stopping (patience 50 on validation loss)
- 5-fold cross-validation
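
The architecture and training settings above can be sketched in PyTorch as follows. This is an illustrative reconstruction from the listed layers and hyperparameters, not the notebook's implementation: `build_mlp` is a hypothetical helper, the batch is random placeholder data, and early stopping and cross-validation are left out for brevity.

```python
import torch
from torch import nn


def build_mlp(n_features: int = 14, hidden: int = 64,
              n_hidden_layers: int = 5, dropout: float = 0.25) -> nn.Sequential:
    """Five hidden blocks of Linear -> ReLU -> Dropout, then a linear output."""
    layers = []
    in_dim = n_features
    for _ in range(n_hidden_layers):
        layers += [nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(dropout)]
        in_dim = hidden
    layers.append(nn.Linear(in_dim, 1))  # single regression output
    return nn.Sequential(*layers)


torch.manual_seed(42)
model = build_mlp()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)

# One training step on a random batch of 32 samples (placeholder data).
x_batch = torch.randn(32, 14)
y_batch = torch.randn(32, 1)
model.train()
optimizer.zero_grad()
loss = criterion(model(x_batch), y_batch)
loss.backward()
optimizer.step()

n_params = sum(p.numel() for p in model.parameters())
print(n_params, float(loss))
```

The network has 17,665 trainable parameters for 268 training samples, which is why dropout, weight decay, and early stopping all matter in this setting.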
A family of regression models is benchmarked in order of increasing complexity:
| Model | Role |
|---|---|
| Linear Regression | Baseline |
| Ridge / Lasso | Regularized linear baselines |
| Random Forest | Non-linear tree ensemble |
| Gradient Boosting | Sequential tree ensemble (scikit-learn) |
| XGBoost | Gradient boosting with regularization |
| MLP (PyTorch) | Main neural network model |
| TabPFN | Transformer-based tabular foundation model |
TabPFN is included as the project's second main model: a pre-trained transformer that performs in-context learning over the training set without gradient updates, designed specifically for the small-data regime where this dataset lives.
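
A benchmark of this kind can be sketched as a loop over scikit-learn estimators evaluated with 5-fold cross-validation. The sketch below is an assumption-laden illustration: the data is synthetic, the hyperparameters are illustrative defaults rather than the notebook's settings, and only the scikit-learn models are included so the example stays free of extra dependencies.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data shaped like the training split (268 x 14).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(268, 14))
y_train = X_train[:, 0] - 0.5 * X_train[:, 1] + 0.1 * rng.normal(size=268)

# Linear models get standardized inputs; tree ensembles use raw features.
# Hyperparameters are illustrative defaults, not the notebook's settings.
models = {
    "Linear Regression": make_pipeline(StandardScaler(), LinearRegression()),
    "Ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "Lasso": make_pipeline(StandardScaler(), Lasso(alpha=0.01)),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}

cv_mae = {}
for name, model in models.items():
    # scikit-learn reports negated MAE so that higher is always better.
    scores = cross_val_score(model, X_train, y_train, cv=5,
                             scoring="neg_mean_absolute_error")
    cv_mae[name] = (-scores.mean(), scores.std())
    print(f"{name:18s} MAE = {cv_mae[name][0]:.3f} ± {cv_mae[name][1]:.3f}")
```

Estimators that follow the scikit-learn `fit`/`predict` interface, such as `xgboost.XGBRegressor` or `tabpfn.TabPFNRegressor`, could be added to the same dictionary.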
| Model | MAE test (eV) | RMSE test (eV) | R² test |
|---|---|---|---|
| Linear Regression | 0.190 | 0.243 | 0.130 |
| Random Forest | 0.174 | 0.219 | 0.296 |
| Gradient Boosting | 0.151 | 0.201 | 0.406 |
| XGBoost | 0.150 | 0.206 | 0.376 |
| MLP (PyTorch) | 0.174 | 0.225 | 0.255 |
| TabPFN | 0.144 | 0.184 | 0.500 |
5-fold CV MAE (eV): MLP = 0.173 ± 0.028, TabPFN = 0.152 ± 0.024.
TabPFN ranks first across every metric and is the only model without a severe train/test gap, making it both the most accurate and the most reliable surrogate on this dataset.
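
For reference, the three metrics reported above can be computed as in the following sketch. `regression_metrics` is a hypothetical helper and the values are made up for illustration; they are not taken from the dataset.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R² for two equal-length 1-D arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mae = np.mean(np.abs(residuals))
    rmse = np.sqrt(np.mean(residuals ** 2))
    ss_res = np.sum(residuals ** 2)                  # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return {"mae": mae, "rmse": rmse, "r2": 1.0 - ss_res / ss_tot}


# Toy values in eV, made up for illustration only.
y_true = [-0.50, -0.20, 0.10, 0.40]
y_pred = [-0.40, -0.30, 0.10, 0.20]
metrics = regression_metrics(y_true, y_pred)
print({k: round(float(v), 3) for k, v in metrics.items()})
```

Because R² compares the squared error with the variance of the test targets, a model can have a modest MAE and still a low R² when the targets span a narrow range.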
Beyond predictive accuracy, the project investigates what physical information the model is using to make its predictions.
This is done via SHAP (SHapley Additive Explanations), a game-theoretic technique that quantifies the contribution of each feature to individual predictions.
The analysis includes:
- Global feature importance
- Dependence plots
- Force plots on representative cases
- Residual analysis on the best-performing model
This allows us to answer questions like:
- Which descriptors have the largest influence on the adsorption energy?
- Are there non-linear interactions between features?
- Does the model learn physically consistent patterns?
- Which regions of chemical space show the largest prediction errors?
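
To make the idea behind SHAP concrete, the sketch below computes exact Shapley values for a toy three-feature model by enumerating every feature subset. It does not use the `shap` package and is not the notebook's analysis: `toy_model`, its coefficients, and the zero baseline are all hypothetical, and exhaustive enumeration is only practical for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x; absent features take baseline values."""
    n = len(x)
    features = range(n)

    def value(subset):
        # Features in `subset` keep their real value, the rest use the baseline.
        mixed = [x[i] if i in subset else baseline[i] for i in features]
        return predict(mixed)

    phi = [0.0] * n
    for i in features:
        others = [j for j in features if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Hypothetical toy surrogate with three inputs and one interaction term.
def toy_model(v):
    charge, diff_en, freq_ni = v
    return -0.8 * charge + 0.3 * diff_en + 0.2 * charge * freq_ni


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The contributions sum to the difference between the prediction at `x` and the prediction at the baseline, and the interaction term is split equally between the two features involved. That is how a feature with little direct effect can still receive credit through interactions.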
Key finding: both Gradient Boosting and the MLP agree that Bader-charge is by far the dominant predictor, followed by the chemical descriptors (Ave-diff-EN, Ave-diff-IE) and the O-2p band centers. Compositional features show low individual importance but contribute through interactions — consistent with the physical intuition that the local chemistry, not the bulk composition, controls reactivity.
The goal is not only to obtain an accurate model, but also to understand the relationship between electronic structure, local chemical environment, and surface reactivity.
| Metric | Target (preliminary) | Reference paper | This work (best: TabPFN) |
|---|---|---|---|
| MAE test | ≈ 0.1 eV | 0.06 eV | 0.144 eV |
| R² test | ≈ 0.8 | — | 0.500 |
The best model's test MAE (0.144 eV) is within a factor of about 1.4 of the ≈ 0.1 eV target but does not reach the 0.06 eV benchmark of the reference work, and its R² of 0.500 remains well below the ≈ 0.8 target. Likely causes: dataset noise floor (residual std ≈ 0.2 eV), small training set (268 samples), and underrepresentation of distribution tails. See the final report for the full discussion.
```bash
# 1. Clone the repo
git clone https://github.com/zqmelissa27/Project_AI.git
cd Project_AI
# 2. Install dependencies
pip install -r requirements.txt
# Main packages: pandas, numpy, scikit-learn, xgboost, torch, shap, tabpfn, matplotlib, seaborn
# 3. Open the notebook
jupyter notebook DFT.ipynb
```

The notebook runs end-to-end on CPU. The first execution of the TabPFN cell downloads the pre-trained checkpoint (~230 MB).
```
.
├── DFT.ipynb                     # Main notebook: preprocessing, training, evaluation, SHAP, TabPFN
├── Data.xlsx                     # Hydrogen adsorption energy dataset
├── Final_Project_AI_SLIDES.pdf   # Presentation slides
├── Final_Project_Report_AI.pdf   # Final project report
├── README.md                     # Project documentation
└── mlp_hae_model.pt              # Trained MLP model checkpoint
```
Anthropic's Claude (claude.ai) was used to assist with: (i) drafting the TabPFN integration cell from the official Prior Labs documentation, (ii) cross-checking SHAP and PyTorch APIs, and (iii) revising the final report and this README for clarity. All model-design decisions, hyperparameter choices, dataset interpretation, and results analysis are the authors' own.
- Luis Alejandro Baena: `labaenam@eafit.edu.co`
- Melissa Zuluaga Quintero: `kmzuluagaq@eafit.edu.co`
Mathematical Engineering, Universidad EAFIT — Artificial Intelligence Course, 2026.
- A. Domínguez-Castro, DFT and machine learning for predicting hydrogen adsorption energies on rocksalt complex oxides, Theoretical Chemistry Accounts 143, 50 (2024).
- N. Hollmann et al., Accurate predictions on small data with a tabular foundation model, Nature 637, 319–326 (2025).
- S. M. Lundberg, S.-I. Lee, A Unified Approach to Interpreting Model Predictions, NeurIPS 30 (2017).
- T. Chen, C. Guestrin, XGBoost: A Scalable Tree Boosting System, KDD (2016).