Project_AI

This project explores the use of Machine Learning models to predict hydrogen adsorption energies on rocksalt-type complex oxide surfaces, using Density Functional Theory (DFT) calculations as ground truth.

The goal is to replace computationally expensive DFT calculations with ML models capable of producing predictions in milliseconds, accelerating the discovery and screening of new catalytic materials.

The problem can be summarized as:

$$ \text{Atomic configuration} \;\xrightarrow{\text{DFT}}\; \mathbf{x} \in \mathbb{R}^{14} \;\xrightarrow{f_\theta}\; \hat{E}_{\text{ads}} $$

where:

  • DFT generates physico-chemical descriptors of the material,
  • the ML model $f_\theta$ learns the mapping from those descriptors to the adsorption energy.

Physical Background

Density Functional Theory (DFT)

DFT is one of the most widely used methods in computational chemistry and materials science for computing electronic properties from quantum first principles. Instead of directly solving the Schrödinger equation for every electron in the system, DFT reformulates the problem in terms of the electronic density, drastically reducing complexity and cost.

DFT delivers high accuracy, but has a critical limitation:

  • a single simulation can take hours to days on an HPC cluster,
  • screening thousands of candidate materials becomes computationally infeasible.

The target property of this project is the hydrogen adsorption energy:

$$E_{\text{ads}} = E_{\text{surface+H}} - E_{\text{surface}} - \tfrac{1}{2}\,E_{H_2}$$

where more negative values indicate a more favorable (exothermic) adsorption.
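
As a worked instance of the formula above, the following minimal sketch evaluates the adsorption energy from three total energies. The function name `adsorption_energy` and the numerical energies are hypothetical placeholders chosen for illustration; they are not values from the dataset.

```python
def adsorption_energy(e_surface_h: float, e_surface: float, e_h2: float) -> float:
    """Return E_ads = E(surface+H) - E(surface) - 0.5 * E(H2), all in eV."""
    return e_surface_h - e_surface - 0.5 * e_h2


# Hypothetical DFT total energies in eV (illustrative only, not from the dataset).
e_ads = adsorption_energy(e_surface_h=-500.0, e_surface=-496.0, e_h2=-6.8)
print(f"E_ads = {e_ads:.2f} eV")  # a negative value means exothermic adsorption
```

With these placeholder energies the result is negative, which under the sign convention above corresponds to favorable adsorption.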


Dataset

The dataset contains:

  • 336 samples
  • 14 numerical descriptors
  • 1 target variable ($E_{\text{ads}}$, in eV)

Each sample corresponds to a distinct oxygen adsorption site on one of 21 slab models of the high-entropy oxide $\mathrm{Ni_{16}Mg_{16}Cu_{16}Zn_{16}O_{64}}$, generated through DFT simulations.

Feature groups

Electronic features

  • Bader-charge
  • Ave-O2p-up
  • Ave-O2p-down

Describe the local electronic distribution and charge transfer at the adsorption site.

Geometric features

  • BLVE Neighbor 1–5

Capture the local geometry around the active site (bond length divided by valence electrons for each of the 5 nearest metals).

Compositional features

  • Freq Ni
  • Freq Mg
  • Freq Cu
  • Freq Zn

Encode the local cationic composition (neighbor counts, integers 0–5).
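
The compositional features can be illustrated with a short counting sketch. It assumes each site is described by a list of its nearest metal neighbors; the helper `cation_frequencies` and the example site are hypothetical and are not taken from the notebook.

```python
from collections import Counter

CATIONS = ("Ni", "Mg", "Cu", "Zn")


def cation_frequencies(neighbors):
    """Count how often each cation appears among the metal neighbors of one O site."""
    counts = Counter(neighbors)
    unknown = set(counts) - set(CATIONS)
    if unknown:
        raise ValueError(f"unexpected species: {sorted(unknown)}")
    # Species that are absent from the neighborhood get an explicit 0.
    return {f"Freq {element}": counts.get(element, 0) for element in CATIONS}


# Hypothetical oxygen site with five metal neighbors.
site_neighbors = ["Ni", "Mg", "Mg", "Cu", "Mg"]
freqs = cation_frequencies(site_neighbors)
print(freqs)
```

Because every neighbor is exactly one of the four cations, the four counts always sum to the number of neighbors considered.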

Chemical features

  • Ave-diff-EN
  • Ave-diff-IE

Capture mean electronegativity and ionization-energy differences between the central oxygen and its metal neighbors.
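
A minimal sketch of how a descriptor like Ave-diff-EN could be computed is given below. It assumes Pauling electronegativities and the sign convention "oxygen minus metal"; the scale and sign actually used to build the dataset are not stated here, so both are assumptions, and `ave_diff_en` is a hypothetical helper.

```python
# Pauling electronegativities (assumed scale for this illustration).
PAULING_EN = {"O": 3.44, "Ni": 1.91, "Mg": 1.31, "Cu": 1.90, "Zn": 1.65}


def ave_diff_en(neighbors, center="O"):
    """Mean electronegativity difference between the central atom and its neighbors."""
    if not neighbors:
        raise ValueError("at least one neighbor is required")
    # Assumed sign convention: center minus neighbor.
    diffs = [PAULING_EN[center] - PAULING_EN[metal] for metal in neighbors]
    return sum(diffs) / len(diffs)


# Hypothetical site with five metal neighbors.
en_value = ave_diff_en(["Ni", "Mg", "Cu", "Zn", "Mg"])
print(f"Ave-diff-EN = {en_value:.3f}")
```

An ionization-energy analogue would follow the same pattern with a different lookup table.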


Machine Learning Pipeline

Preprocessing

  • Dataset loading and validation with pandas
  • Exploratory data analysis (EDA)
  • 80/20 train/test split (seed = 42)
  • Standardization via StandardScaler (fit on train only — applied to linear and MLP models; tree ensembles and TabPFN use raw features)
  • 5-fold cross-validation on the training set
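
The split, scaling, and cross-validation steps above can be sketched as follows. Synthetic random data stands in for the real descriptors, and `shuffle=True` in `KFold` is an assumption rather than a setting confirmed by the notebook.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the dataset's shape: 336 samples, 14 descriptors.
rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=3.0, size=(336, 14))
y = rng.normal(size=336)

# 80/20 split with the seed quoted above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training data only, then apply it to both splits,
# so no test-set statistics leak into preprocessing.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

# 5-fold cross-validation indices over the training set (shuffling is assumed).
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
fold_sizes = [len(val_idx) for _, val_idx in kfold.split(X_train_std)]

print(X_train.shape, X_test.shape, fold_sizes)
```

With 336 samples, this split leaves 268 for training and 68 for testing. The test set is standardized with the training mean and variance, so its columns are only approximately zero-mean.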

Model Architecture

The MLP proposed in the project's preliminary phase is a regularization-first feed-forward network implemented in PyTorch:

Input (14)
 → Linear(14 → 64) → ReLU → Dropout(0.25)
 → Linear(64 → 64) → ReLU → Dropout(0.25)
 → Linear(64 → 64) → ReLU → Dropout(0.25)
 → Linear(64 → 64) → ReLU → Dropout(0.25)
 → Linear(64 → 64) → ReLU → Dropout(0.25)
 → Linear(64 → 1)
Output (regression)
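
The layer diagram above maps onto `torch.nn.Sequential` roughly as in the sketch below. The factory function `build_mlp` and its argument names are hypothetical; the code in the notebook may be organized differently.

```python
import torch
from torch import nn


def build_mlp(n_features=14, hidden=64, n_hidden_layers=5, dropout=0.25):
    """Stack of Linear -> ReLU -> Dropout blocks followed by a linear output."""
    layers = []
    in_dim = n_features
    for _ in range(n_hidden_layers):
        layers += [nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(dropout)]
        in_dim = hidden
    layers.append(nn.Linear(in_dim, 1))  # single regression output
    return nn.Sequential(*layers)


model = build_mlp()
n_params = sum(p.numel() for p in model.parameters())

# Dropout is disabled in eval mode, so predictions are deterministic.
model.eval()
with torch.no_grad():
    out = model(torch.randn(8, 14))

print(tuple(out.shape), n_params)
```

The network has 17,665 trainable parameters, far more than the 268 training samples, which is why dropout, weight decay, and early stopping all matter here.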

Training configuration

  • Loss: MSE
  • Optimizer: Adam
  • Learning rate: 1e-4
  • Weight decay: 1e-4
  • Batch size: 32
  • Early stopping (patience 50 on validation loss)
  • 5-fold cross-validation
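
A minimal training loop with these settings might look like the sketch below. The function `train_with_early_stopping`, the small stand-in model, and the synthetic tensors are all hypothetical; only the loss, optimizer, learning rate, weight decay, batch size, and patience come from the list above.

```python
import copy

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def train_with_early_stopping(model, X_tr, y_tr, X_val, y_val,
                              max_epochs=1000, patience=50):
    """Train with Adam + MSE and stop once validation loss stalls for `patience` epochs."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)
    loss_fn = nn.MSELoss()
    loader = DataLoader(TensorDataset(X_tr, y_tr), batch_size=32, shuffle=True)

    best_val, best_state = float("inf"), None
    stale_epochs, epochs_run = 0, 0
    for _ in range(max_epochs):
        epochs_run += 1
        model.train()
        for xb, yb in loader:
            optimizer.zero_grad()
            loss_fn(model(xb), yb).backward()
            optimizer.step()

        model.eval()
        with torch.no_grad():
            val_loss = loss_fn(model(X_val), y_val).item()

        if val_loss < best_val:
            # Keep a copy of the best weights seen so far.
            best_val, stale_epochs = val_loss, 0
            best_state = copy.deepcopy(model.state_dict())
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break

    if best_state is not None:
        model.load_state_dict(best_state)  # restore the best checkpoint
    return best_val, epochs_run


# Tiny synthetic demonstration (not the project data); few epochs to keep it fast.
torch.manual_seed(0)
X_demo, y_demo = torch.randn(120, 14), torch.randn(120, 1)
demo_model = nn.Sequential(nn.Linear(14, 64), nn.ReLU(), nn.Linear(64, 1))
best_val, epochs_run = train_with_early_stopping(
    demo_model, X_demo[:96], y_demo[:96], X_demo[96:], y_demo[96:], max_epochs=5
)
print(f"best validation MSE = {best_val:.4f} after {epochs_run} epochs")
```

Restoring the best checkpoint instead of keeping the final weights is a design choice in this sketch; it means the returned model corresponds to the reported validation loss.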

Evaluated Models

A family of regression models is benchmarked in order of increasing complexity:

| Model | Role |
|---|---|
| Linear Regression | Baseline |
| Ridge / Lasso | Regularized linear baselines |
| Random Forest | Non-linear tree ensemble |
| Gradient Boosting | Sequential tree ensemble (scikit-learn) |
| XGBoost | Gradient boosting with regularization |
| MLP (PyTorch) | Main neural network model |
| TabPFN | Transformer-based tabular foundation model |

TabPFN is included as the project's second main model: a pre-trained transformer that performs in-context learning over the training set without gradient updates, designed specifically for the small-data regime where this dataset lives.


Results

| Model | MAE test (eV) | RMSE test (eV) | R² test |
|---|---|---|---|
| Linear Regression | 0.190 | 0.243 | 0.130 |
| Random Forest | 0.174 | 0.219 | 0.296 |
| Gradient Boosting | 0.151 | 0.201 | 0.406 |
| XGBoost | 0.150 | 0.206 | 0.376 |
| MLP (PyTorch) | 0.174 | 0.225 | 0.255 |
| TabPFN | 0.144 | 0.184 | 0.500 |

5-fold CV MAE (eV, mean ± standard deviation across folds): MLP = 0.173 ± 0.028, TabPFN = 0.152 ± 0.024.

TabPFN ranks first across every metric and is the only model without a severe train/test gap, making it both the most accurate and the most reliable surrogate on this dataset.
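
For reference, the three reported metrics can be computed as in this dependency-free sketch. The function `regression_metrics` and the toy values are illustrative only and are unrelated to the results table.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R^2 for two equally long sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "RMSE": rmse, "R2": r2}


# Toy values (not project data).
metrics = regression_metrics([0.0, 1.0, 2.0, 3.0], [0.5, 1.0, 2.0, 2.5])
print(metrics)
```

R² compares the model against always predicting the mean of the targets, so a value of 0.500 means the model removes half of the squared error that such a constant predictor would make.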


Model Interpretability

Beyond predictive accuracy, the project investigates what physical information the model is using to make its predictions.

This is done via SHAP (SHapley Additive Explanations), a game-theoretic technique that quantifies the contribution of each feature to individual predictions.
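
To make the game-theoretic idea concrete, the sketch below computes exact Shapley values for a toy three-feature model by enumerating all feature subsets. It is not the `shap` library and not the project's analysis: it replaces "absent" features with a single baseline vector (an assumption; SHAP explainers typically average over background data), and `shapley_values` and `toy_model` are hypothetical names.

```python
from itertools import combinations
from math import factorial


def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x, with absent features set to `baseline`."""
    n = len(x)

    def value(subset):
        # Features in `subset` take their real value, the rest the baseline value.
        z = [x[j] if j in subset else baseline[j] for j in range(n)]
        return f(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of feature i when added to coalition s.
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(z):
    # Two linear terms plus one interaction between features 0 and 2.
    return 2.0 * z[0] - 1.0 * z[1] + 0.5 * z[0] * z[2]


x = [1.0, 2.0, 4.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The contributions sum to the difference between the prediction at `x` and at the baseline, and the interaction term is split equally between the two features involved. Exact enumeration scales exponentially with the number of features, which is why practical tools rely on model-specific or sampling-based approximations.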

The analysis includes:

  • Global feature importance
  • Dependence plots
  • Force plots on representative cases
  • Residual analysis on the best-performing model

This allows us to answer questions like:

  • Which descriptors have the largest influence on the adsorption energy?
  • Are there non-linear interactions between features?
  • Does the model learn physically consistent patterns?
  • Which regions of chemical space show the largest prediction errors?

Key finding: both Gradient Boosting and the MLP agree that Bader-charge is by far the dominant predictor, followed by the chemical descriptors (Ave-diff-EN, Ave-diff-IE) and the O-2p band centers. Compositional features show low individual importance but contribute through interactions — consistent with the physical intuition that the local chemistry, not the bulk composition, controls reactivity.

The goal is not only to obtain an accurate model, but also to understand the relationship between electronic structure, local chemical environment, and surface reactivity.


Expected vs. Achieved Results

| Metric | Target (preliminary) | Reference paper | This work (best: TabPFN) |
|---|---|---|---|
| MAE test | ≈ 0.1 eV | 0.06 eV | 0.144 eV |
| R² test | ≈ 0.8 | | 0.500 |

The best model's test MAE (0.144 eV) is of the same order as the ≈ 0.1 eV target, but it does not reach the 0.06 eV benchmark of the reference work, and its test R² (0.500) falls short of the ≈ 0.8 target. Likely causes: dataset noise floor (residual std ≈ 0.2 eV), small training set (268 samples), and underrepresentation of distribution tails. See the final report for the full discussion.


How to Run

# 1. Clone the repo
git clone https://github.com/zqmelissa27/Project_AI.git
cd Project_AI

# 2. Install dependencies
pip install -r requirements.txt
# Main packages: pandas, numpy, scikit-learn, xgboost, torch, shap, tabpfn, matplotlib, seaborn

# 3. Open the notebook
jupyter notebook DFT.ipynb

The notebook runs end-to-end on CPU. The first execution of the TabPFN cell downloads the pre-trained checkpoint (~230 MB).


Repository Structure

.
├── DFT.ipynb                    # Main notebook: preprocessing, training, evaluation, SHAP, TabPFN
├── Data.xlsx                    # Hydrogen adsorption energy dataset
├── Final_Project_AI_SLIDES.pdf  # Presentation slides
├── Final_Project_Report_AI.pdf  # Final project report
├── README.md                    # Project documentation
└── mlp_hae_model.pt             # Trained MLP model checkpoint

AI Usage Statement

Anthropic's Claude (claude.ai) was used to assist with: (i) drafting the TabPFN integration cell from the official Prior Labs documentation, (ii) cross-checking SHAP and PyTorch APIs, and (iii) revising the final report and this README for clarity. All model-design decisions, hyperparameter choices, dataset interpretation, and results analysis are the authors' own.


Authors

  • Luis Alejandro Baena (labaenam@eafit.edu.co)
  • Melissa Zuluaga Quintero (kmzuluagaq@eafit.edu.co)

Mathematical Engineering, Universidad EAFIT — Artificial Intelligence Course, 2026.


References

  1. A. Domínguez-Castro, DFT and machine learning for predicting hydrogen adsorption energies on rocksalt complex oxides, Theoretical Chemistry Accounts 143, 50 (2024).
  2. N. Hollmann et al., Accurate predictions on small data with a tabular foundation model, Nature 637, 319–326 (2025).
  3. S. M. Lundberg, S.-I. Lee, A Unified Approach to Interpreting Model Predictions, NeurIPS 30 (2017).
  4. T. Chen, C. Guestrin, XGBoost: A Scalable Tree Boosting System, KDD (2016).
