Project_AI

This project explores the use of machine learning (ML) models to predict hydrogen adsorption energies on rocksalt-type complex oxide surfaces, using Density Functional Theory (DFT) calculations as ground truth.

The goal is to replace computationally expensive DFT calculations with ML models capable of producing predictions in milliseconds, accelerating the discovery and screening of new catalytic materials.

The problem can be summarized as:

$$ \text{Atomic configuration} \;\xrightarrow{\text{DFT}}\; \mathbf{x} \in \mathbb{R}^{14} \;\xrightarrow{f_\theta}\; \hat{E}_{\text{ads}} $$

where:

DFT generates physico-chemical descriptors of the material,
the ML model $f_\theta$ learns the mapping from those descriptors to the adsorption energy.

Physical Background

Density Functional Theory (DFT)

DFT is one of the most widely used methods in computational chemistry and materials science for computing electronic properties from quantum first principles. Instead of directly solving the Schrödinger equation for every electron in the system, DFT reformulates the problem in terms of the electronic density, drastically reducing complexity and cost.

DFT delivers high accuracy, but has a critical limitation:

a single simulation can take hours to days on an HPC cluster,
screening thousands of candidate materials becomes computationally infeasible.

The target property of this project is the hydrogen adsorption energy:

$$E_{\text{ads}} = E_{\text{surface+H}} - E_{\text{surface}} - \tfrac{1}{2}\,E_{H_2}$$

where more negative values indicate a more favorable (exothermic) adsorption.
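As a quick illustration of the sign convention, the definition above can be evaluated directly. This is a minimal sketch: the function name `adsorption_energy` and the three total energies are hypothetical values chosen only to show the arithmetic, not numbers from the dataset.

```python
def adsorption_energy(e_surface_h: float, e_surface: float, e_h2: float) -> float:
    """E_ads = E(surface+H) - E(surface) - 1/2 * E(H2), all energies in eV."""
    return e_surface_h - e_surface - 0.5 * e_h2


# Hypothetical DFT total energies in eV (illustrative only, not from Data.xlsx).
e_ads = adsorption_energy(e_surface_h=-500.0, e_surface=-496.0, e_h2=-6.8)

# A negative value means adsorption is exothermic under this convention.
print(f"E_ads = {e_ads:.2f} eV")  # prints "E_ads = -0.60 eV"
```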

Dataset

The dataset contains:

336 samples
14 numerical descriptors
1 target variable ($E_{\text{ads}}$)

Each sample corresponds to a distinct oxygen adsorption site on one of 21 slab models of the high-entropy oxide $\mathrm{Ni_{16}Mg_{16}Cu_{16}Zn_{16}O_{64}}$, generated through DFT simulations.

Feature groups

Electronic features

Bader-charge
Ave-O2p-up
Ave-O2p-down

Describe the local electronic distribution and charge transfer at the adsorption site.

Geometric features

BLVE Neighbor 1–5

Capture the local geometry around the active site (bond length divided by valence electrons for each of the 5 nearest metals).

Compositional features

Freq Ni
Freq Mg
Freq Cu
Freq Zn

Encode the local cationic composition (neighbor counts, integers 0–5).
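A minimal sketch of how such counts could be derived for one site, assuming they are taken over the five nearest metal neighbors (the same neighborhood the BLVE features use). The helper `cation_frequencies` and the example neighbor list are hypothetical; the dataset already ships these columns precomputed.

```python
from collections import Counter

CATIONS = ("Ni", "Mg", "Cu", "Zn")


def cation_frequencies(neighbors):
    """Count how often each cation appears among a site's metal neighbors."""
    counts = Counter(neighbors)
    # Cations that do not appear still get an explicit 0 entry.
    return {f"Freq {element}": counts.get(element, 0) for element in CATIONS}


# Hypothetical neighbor shell of one oxygen site (illustrative only).
site_neighbors = ["Ni", "Cu", "Cu", "Zn", "Ni"]
freqs = cation_frequencies(site_neighbors)
print(freqs)
```

Under this assumption the four counts of a site always sum to the number of neighbors considered, so the compositional features are not independent of one another.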

Chemical features

Ave-diff-EN
Ave-diff-IE

Capture mean electronegativity and ionization-energy differences between the central oxygen and its metal neighbors.

Machine Learning Pipeline

Preprocessing

Dataset loading and validation with pandas
Exploratory data analysis (EDA)
80/20 train/test split (seed = 42)
Standardization via StandardScaler, fit on the training split only and applied to the linear and MLP models (tree ensembles and TabPFN use raw features)
5-fold cross-validation on the training set
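The split sizes and the "fit on train only" rule above can be sketched with the standard library alone. This is a stand-in under stated assumptions, not the notebook's code: the data are synthetic, and the shuffle will not reproduce the exact rows that scikit-learn selects with the same seed.

```python
import math
import random
import statistics

N_SAMPLES, TEST_FRACTION, SEED, K_FOLDS = 336, 0.2, 42, 5

rng = random.Random(SEED)
# Synthetic stand-in for one descriptor column (not the real data).
values = [rng.gauss(0.0, 1.0) for _ in range(N_SAMPLES)]

# 80/20 split: shuffle the indices and hold out 20% (rounded up) for testing.
indices = list(range(N_SAMPLES))
rng.shuffle(indices)
n_test = math.ceil(N_SAMPLES * TEST_FRACTION)
test_idx, train_idx = indices[:n_test], indices[n_test:]

# Fit mean and standard deviation on the training split only
# (population std, ddof=0, which is what StandardScaler uses).
train_values = [values[i] for i in train_idx]
mu = statistics.fmean(train_values)
sigma = statistics.pstdev(train_values)

train_scaled = [(v - mu) / sigma for v in train_values]
# The test split reuses the training statistics, so no test information leaks in.
test_scaled = [(values[i] - mu) / sigma for i in test_idx]

# Fold sizes for 5-fold CV on the training split
# (the first n % k folds receive one extra sample).
base, extra = divmod(len(train_idx), K_FOLDS)
fold_sizes = [base + (1 if i < extra else 0) for i in range(K_FOLDS)]

print(len(train_idx), len(test_idx), fold_sizes)  # prints "268 68 [54, 54, 54, 53, 53]"
```

With scikit-learn, the corresponding building blocks would be `train_test_split(..., test_size=0.2, random_state=42)`, `StandardScaler`, and `KFold(n_splits=5)`; whether the notebook shuffles the folds is not stated here.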

Model Architecture

The MLP proposed at the preliminary stage of the project is a feed-forward network implemented in PyTorch that puts regularization first, with dropout after every hidden layer:

Input (14)
 → Linear(14 → 64) → ReLU → Dropout(0.25)
 → Linear(64 → 64) → ReLU → Dropout(0.25)
 → Linear(64 → 64) → ReLU → Dropout(0.25)
 → Linear(64 → 64) → ReLU → Dropout(0.25)
 → Linear(64 → 64) → ReLU → Dropout(0.25)
 → Linear(64 → 1)
Output (regression)
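To put the size of this network in context, the number of trainable parameters can be counted from the layer widths in the diagram above. A minimal sketch, assuming every `Linear` layer has a bias term (the PyTorch default); ReLU and Dropout add no parameters.

```python
# Layer widths read off the diagram: input, five hidden layers, output.
LAYER_SIZES = [14, 64, 64, 64, 64, 64, 1]


def count_parameters(sizes):
    """Sum of weights (n_in * n_out) and biases (n_out) over consecutive layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


n_params = count_parameters(LAYER_SIZES)
print(n_params)  # prints 17665
```

With 268 training samples, that is roughly 66 parameters per sample, which is why dropout, weight decay, and early stopping all matter for this model.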

Training configuration

Loss: MSE
Optimizer: Adam
Learning rate: 1e-4
Weight decay: 1e-4
Batch size: 32
Early stopping (patience 50 on validation loss)
5-fold cross-validation
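The early-stopping rule in the list above can be sketched as a small function. This is an illustration under one common reading of "patience" (stop once 50 consecutive epochs pass without a new best validation loss); the notebook's exact criterion, such as a minimum improvement threshold, may differ. The function name and the loss curve are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=50):
    """Return (stop_epoch, best_epoch); stop_epoch is None if never triggered."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: remember it and reset the wait.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # `patience` epochs have passed since the last improvement.
            return epoch, best_epoch
    return None, best_epoch


# Hypothetical validation curve: improves for 100 epochs, then plateaus.
val_losses = [1.0 / (epoch + 1) for epoch in range(100)] + [0.02] * 200
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=50)
print(stop_epoch, best_epoch)  # prints "149 99"
```

In practice the weights from `best_epoch` are the ones to keep, since the model at `stop_epoch` has trained for 50 further epochs without improving on validation data.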

Evaluated Models

A family of regression models is benchmarked in order of increasing complexity:

Model	Role
Linear Regression	Baseline
Ridge / Lasso	Regularized linear baselines
Random Forest	Non-linear tree ensemble
Gradient Boosting	Sequential tree ensemble (scikit-learn)
XGBoost	Gradient boosting with regularization
MLP (PyTorch)	Main neural network model
TabPFN	Transformer-based tabular foundation model

TabPFN is included as the project's second main model: a pre-trained transformer that performs in-context learning over the training set without gradient updates, designed specifically for the small-data regime where this dataset lives.

Results

Model	MAE test (eV)	RMSE test (eV)	R² test
Linear Regression	0.190	0.243	0.130
Random Forest	0.174	0.219	0.296
Gradient Boosting	0.151	0.201	0.406
XGBoost	0.150	0.206	0.376
MLP (PyTorch)	0.174	0.225	0.255
TabPFN	0.144	0.184	0.500

5-fold CV MAE (eV): MLP = 0.173 ± 0.028, TabPFN = 0.152 ± 0.024.

TabPFN ranks first across every metric and is the only model without a severe train/test gap, making it both the most accurate and the most reliable surrogate on this dataset.
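For reference, the three metrics reported in the table can be written out in a few lines. This is a dependency-free sketch with made-up numbers, not the evaluation code from the notebook; scikit-learn's `mean_absolute_error`, `mean_squared_error`, and `r2_score` cover the same quantities.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up adsorption energies in eV (illustrative only).
y_true = [-0.5, -0.2, 0.1, 0.4]
y_pred = [-0.4, -0.3, 0.1, 0.2]
print(f"MAE={mae(y_true, y_pred):.3f}  RMSE={rmse(y_true, y_pred):.3f}  R2={r2(y_true, y_pred):.3f}")
```

Because R² compares the squared error against the variance of the test targets, a model can have a modest MAE and still a low R² when the targets themselves span a narrow range.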

Model Interpretability

Beyond predictive accuracy, the project investigates what physical information the model is using to make its predictions.

This is done via SHAP (SHapley Additive Explanations), a game-theoretic technique that quantifies the contribution of each feature to individual predictions.
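To make the idea concrete, Shapley values can be computed exactly by brute force for a tiny model. The sketch below is an illustration, not what the `shap` library does internally (it relies on faster, model-specific or sampling-based estimators). It assumes that a "missing" feature is replaced by a single baseline value; `toy_model`, `x`, and `baseline` are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of each feature of x relative to a baseline point."""
    n = len(x)

    def value(coalition):
        # Features in the coalition take their real value, the rest the baseline.
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(z):
    # Hypothetical model: one additive term plus one pairwise interaction.
    return 2.0 * z[0] + z[1] * z[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The contributions sum to `toy_model(x) - toy_model(baseline)`, and the interaction term is split evenly between the two features involved. The cost grows exponentially with the number of features, which is why practical tools approximate.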

The analysis includes:

Global feature importance
Dependence plots
Force plots on representative cases
Residual analysis on the best-performing model

This allows us to answer questions like:

Which descriptors have the largest influence on the adsorption energy?
Are there non-linear interactions between features?
Does the model learn physically consistent patterns?
Which regions of chemical space show the largest prediction errors?

Key finding: Gradient Boosting and the MLP agree that Bader-charge is by far the dominant predictor, followed by the chemical descriptors (Ave-diff-EN, Ave-diff-IE) and the O-2p band centers. Compositional features show low individual importance but contribute through interactions, consistent with the physical intuition that the local chemistry, not the bulk composition, controls reactivity.

The goal is not only to obtain an accurate model, but also to understand the relationship between electronic structure, local chemical environment, and surface reactivity.

Expected vs. Achieved Results

Metric	Target (preliminary)	Reference paper	This work (best: TabPFN)
MAE test	≈ 0.1 eV	0.06 eV	0.144 eV
R² test	≈ 0.8	—	0.500

The best model's test MAE (0.144 eV) is of the same order of magnitude as the ≈ 0.1 eV target, but it misses that target and is more than twice the 0.06 eV benchmark of the reference work; the test R² (0.500) is also well below the ≈ 0.8 target. Likely causes: dataset noise floor (residual std ≈ 0.2 eV), small training set (268 samples), and underrepresentation of distribution tails. See the final report for the full discussion.

How to Run

# 1. Clone the repo
git clone https://github.com/zqmelissa27/Project_AI.git
cd Project_AI

# 2. Install dependencies
pip install -r requirements.txt
# Main packages: pandas, numpy, scikit-learn, xgboost, torch, shap, tabpfn, matplotlib, seaborn

# 3. Open the notebook
jupyter notebook DFT.ipynb

The notebook runs end-to-end on CPU. The first execution of the TabPFN cell downloads the pre-trained checkpoint (~230 MB).

Repository Structure

.
├── DFT.ipynb                    # Main notebook: preprocessing, training, evaluation, SHAP, TabPFN
├── Data.xlsx                    # Hydrogen adsorption energy dataset
├── Final_Project_AI_SLIDES.pdf  # Presentation slides
├── Final_Project_Report_AI.pdf  # Final project report
├── README.md                    # Project documentation
└── mlp_hae_model.pt             # Trained MLP model checkpoint

AI Usage Statement

Anthropic's Claude (claude.ai) was used to assist with: (i) drafting the TabPFN integration cell from the official Prior Labs documentation, (ii) cross-checking SHAP and PyTorch APIs, and (iii) revising the final report and this README for clarity. All model-design decisions, hyperparameter choices, dataset interpretation, and results analysis are the authors' own.

Authors

Luis Alejandro Baena — labaenam@eafit.edu.co
Melissa Zuluaga Quintero — kmzuluagaq@eafit.edu.co

Mathematical Engineering, Universidad EAFIT — Artificial Intelligence Course, 2026.

References

A. Domínguez-Castro, DFT and machine learning for predicting hydrogen adsorption energies on rocksalt complex oxides, Theoretical Chemistry Accounts 143, 50 (2024).
N. Hollmann et al., Accurate predictions on small data with a tabular foundation model, Nature 637, 319–326 (2025).
S. M. Lundberg, S.-I. Lee, A Unified Approach to Interpreting Model Predictions, NeurIPS 30 (2017).
T. Chen, C. Guestrin, XGBoost: A Scalable Tree Boosting System, KDD (2016).
