An active learning framework for efficiently navigating the chemical space of ternary alloy catalysts for the Oxygen Evolution Reaction (OER). This repository implements Gaussian Process Regression (GPR) with intelligent query strategies to minimize the number of expensive DFT calculations needed to identify promising catalyst compositions.
- Overview
- Key Features
- Installation
- Quick Start
- Detailed Workflow
- File Descriptions
- Data Format
- Customization Guide
- License
Discovering optimal alloy catalysts for OER requires screening vast compositional spaces. Traditional approaches require DFT calculations for every candidate composition, which is computationally prohibitive.
This active learning framework:
- Generates a complete feature space representing all possible surface configurations
- Intelligently selects which configurations to calculate via DFT using acquisition functions
- Predicts energies for uncalculated configurations using Gaussian Process Regression
- Iteratively improves predictions by selecting the most informative next calculations
- Estimates catalyst activity using Boltzmann-weighted kinetic models
- Target system: Ni-Fe-Co ternary alloy catalysts
- Adsorbates: O and OH intermediates
- Surface sites: fcc and hcp hollow sites
- Feature space: 280,000 unique surface motifs
- Feature vector: 15-element fingerprint (3 metals × 5 coordination zones)
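The 280,000 figure can be reproduced by counting compositions zone by zone. The sketch below assumes the zone sizes (3, 6, 3, 3, 3) listed for GPRdataspace.py and treats atoms within a zone as interchangeable, so each zone contributes a stars-and-bars count. The function name `n_motifs` is illustrative and not part of the repository.

```python
from math import comb, prod

def n_motifs(n_metals, zone_sizes):
    """Count distinct fingerprints: one multiset of metals per zone."""
    # A zone of k interchangeable atoms drawn from n metals has
    # C(k + n - 1, n - 1) possible compositions.
    return prod(comb(k + n_metals - 1, n_metals - 1) for k in zone_sizes)

print(n_motifs(3, (3, 6, 3, 3, 3)))  # 280000
```

Under the same assumptions, a quaternary system with unchanged zone sizes would give `n_motifs(4, (3, 6, 3, 3, 3))`, which is 13,440,000 motifs.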
✅ Comprehensive feature generation - Automated fingerprint creation for all possible surface configurations
✅ Intelligent sampling - Active learning acquisition functions to minimize DFT calculations
✅ Gaussian Process Regression - Uncertainty-aware energy predictions
✅ Periodic boundary handling - DFT-compatible dataspace generation
✅ Activity estimation - Boltzmann distribution-based catalytic activity calculation
✅ Extensible framework - Easily adaptable to other alloy systems and reactions
- Python 3.8 or higher
- ASE (Atomic Simulation Environment) for structure handling
pip install "numpy>=1.21.2" "scipy>=1.7.1" "pandas>=1.3.3" "matplotlib>=3.4.3" "scikit-learn>=0.24.2"
git clone https://github.com/minhee2043/ActiveLearning_OER.git
cd ActiveLearning_OER

# Step 1: Generate complete feature space (280,000 configurations)
python GPRdataspace.py
# Step 2: Generate DFT-compatible subset for calculations
python possibleFp.py
# Step 3: Run active learning cycles (GPR + acquisition function)
python mygaussian.py
# Step 4: Preprocess data for activity calculation
python sum_element.py
# Step 5: Calculate and visualize catalyst activity
python activity_plot.py

For analyzing individual DFT-relaxed structures:
from motif_to_feature import Slab
import ase.io
import numpy as np
# Read DFT-relaxed structure
trajectory = ase.io.read('relaxed_structure.traj')
# Extract surface motif
motif = Slab(trajectory)
# Generate 15-element feature vector
# metals: reference order for counting
# zones: coordination shells around adsorption site
feature = np.array(motif.features(
    metals=['Ni', 'Fe', 'Co'],
    zones=['ens', 'sn', 'ssn', 'sf', 'ssf']
))
print(f"Feature vector: {feature}")
# Output: [Ni1,Fe1,Co1, Ni2,Fe2,Co2, Ni3,Fe3,Co3, Ni4,Fe4,Co4, Ni5,Fe5,Co5]

GPRdataspace.py generates all 280,000 possible surface configurations for the Ni-Fe-Co ternary system.
Input: None (uses default parameters: 3 metals, 5 zones)
Output: GPRdataspace.csv - 280,000 rows × 16 columns (15 features + 1 multiplicity)
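A minimal sketch of how such a table could be enumerated, assuming the multiplicity of a row is the product of per-zone multinomial coefficients (the number of distinct atom orderings that share the same counts). The helper names `zone_compositions` and `dataspace` are hypothetical; the actual script may organize this differently.

```python
from itertools import combinations_with_replacement, product
from math import factorial

def zone_compositions(n_metals, size):
    """Yield (counts, multiplicity) for one zone of `size` interchangeable atoms."""
    for combo in combinations_with_replacement(range(n_metals), size):
        counts = tuple(combo.count(m) for m in range(n_metals))
        # Multinomial coefficient: distinct orderings of this composition.
        mult = factorial(size)
        for c in counts:
            mult //= factorial(c)
        yield counts, mult

def dataspace(n_metals=3, zone_sizes=(3, 6, 3, 3, 3)):
    """Yield rows of per-zone metal counts followed by the total multiplicity."""
    per_zone = [list(zone_compositions(n_metals, s)) for s in zone_sizes]
    for choice in product(*per_zone):
        row, mult = [], 1
        for counts, m in choice:
            row.extend(counts)
            mult *= m
        yield row + [mult]

rows = list(dataspace())
print(len(rows))  # 280000
print(rows[0])    # [3, 0, 0, 6, 0, 0, 3, 0, 0, 3, 0, 0, 3, 0, 0, 1]
```

A quick consistency check on this reading: the multiplicities sum to 3**18, the number of ways to assign one of three metals to each of the 18 atoms in the motif.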
Key parameters (modifiable in code):
nMetals = 3 # Number of metal types
zoneSizes = (3, 6, 3, 3, 3) # Atoms in each coordination shell

possibleFp.py filters the complete dataspace down to the configurations compatible with periodic DFT calculations.
Input: GPRdataspace.csv
Output: possibleFp.csv - Subset of valid DFT configurations
mygaussian.py is the core active learning engine: it iteratively selects calculations and builds predictive models.
Inputs:
- DFT training data (user-provided)
- possibleFp.csv (candidate pool)
Outputs:
- Trained GPR models for each iteration
- Predicted energies for all configurations
- Selected next calculation points from the candidate pool
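The selection step can be sketched as follows, assuming GPR predictions `mu` and uncertainties `sigma` are already available for every candidate and that points are ranked by the UCB score `mu + kappa * sigma` given in the customization guide. The function `select_next`, its `batch_size` parameter, and the toy numbers are illustrative assumptions, not the logic of mygaussian.py. If lower energies are the target, the sign of `mu` would need to be flipped.

```python
def select_next(candidates, mu, sigma, calculated, kappa=2.0, batch_size=2):
    """Return the `batch_size` uncalculated candidates with the highest UCB score."""
    scored = []
    for fp, m, s in zip(candidates, mu, sigma):
        if tuple(fp) in calculated:
            continue  # skip configurations that already have DFT energies
        scored.append((m + kappa * s, tuple(fp)))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [fp for _, fp in scored[:batch_size]]

# Toy example with made-up numbers (not repository data);
# fingerprints are shortened to one zone for readability.
candidates = [(3, 0, 0), (2, 1, 0), (1, 1, 1), (0, 3, 0)]
mu = [-4.50, -4.40, -4.45, -4.60]
sigma = [0.05, 0.30, 0.10, 0.20]
calculated = {(3, 0, 0)}

print(select_next(candidates, mu, sigma, calculated))
# [(2, 1, 0), (0, 3, 0)]
```

The candidate with the largest uncertainty wins here even though its predicted mean is not the best, which is the exploration behavior that `kappa` controls.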
sum_element.py aggregates features by total composition for activity estimation.
Input: Predicted energies for all configurations
Output: Composition-aggregated data
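One plausible reading of "aggregating by total composition" is to sum each metal's count over all five zones and group the predicted energies by the resulting totals. The sketch below illustrates that reading only; `total_composition` and `aggregate` are hypothetical names, the energies are made up, and sum_element.py may do something different.

```python
def total_composition(feature, n_metals=3):
    """Sum each metal's count over all zones of a flat fingerprint."""
    totals = [0] * n_metals
    for i, count in enumerate(feature):
        totals[i % n_metals] += count  # columns cycle Ni, Fe, Co within each zone
    return tuple(totals)

def aggregate(rows, n_metals=3):
    """Group energies by total composition: {composition: [energies]}."""
    groups = {}
    for feature, energy in rows:
        key = total_composition(feature, n_metals)
        groups.setdefault(key, []).append(energy)
    return groups

# Made-up predictions for three fingerprints (zone sizes 3, 6, 3, 3, 3)
rows = [
    ([3, 0, 0, 6, 0, 0, 3, 0, 0, 3, 0, 0, 3, 0, 0], -4.50),
    ([2, 1, 0, 6, 0, 0, 3, 0, 0, 3, 0, 0, 3, 0, 0], -4.41),
    ([3, 0, 0, 5, 1, 0, 3, 0, 0, 3, 0, 0, 3, 0, 0], -4.47),
]
print(aggregate(rows))
# {(18, 0, 0): [-4.5], (17, 1, 0): [-4.41, -4.47]}
```

The last two rows place the single Fe atom in different zones but share the same overall composition, so they land in the same group.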
activity_plot.py estimates catalytic activity using microkinetic modeling.
Input: Composition-aggregated energies
Output:
- Activity maps
Key equations:
Rate ∝ Σ_i w_i × exp(-ΔG_i / kT)
where w_i = multiplicity, ΔG_i = reaction free energy
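The sum above can be evaluated directly. The sketch below assumes free energies in eV and room temperature; the function name `relative_rate` and the numerical inputs are illustrative, and the proportionality constant is left out, so only ratios between results are meaningful.

```python
from math import exp

K_B = 8.617333262e-5  # Boltzmann constant in eV/K

def relative_rate(delta_g, weights, temperature=298.15):
    """Evaluate sum_i w_i * exp(-dG_i / kT) for free energies in eV."""
    kT = K_B * temperature
    return sum(w * exp(-g / kT) for g, w in zip(delta_g, weights))

# Made-up free energies (eV) and multiplicities, for illustration only
delta_g = [0.40, 0.45, 0.60]
weights = [1, 3, 3]

rate = relative_rate(delta_g, weights)
kT = K_B * 298.15
shares = [w * exp(-g / kT) / rate for g, w in zip(delta_g, weights)]
print([round(s, 3) for s in shares])  # fraction of the total rate per site type
```

Because the exponential dominates, the site with the lowest ΔG typically carries most of the rate even when its multiplicity is small.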
| File | Purpose | Input | Output |
|---|---|---|---|
| GPRdataspace.py | Generate complete feature space | Parameters | GPRdataspace.csv |
| possibleFp.py | Filter DFT-compatible configurations | GPRdataspace.csv | possibleFp.csv |
| motif_to_feature.py | Convert structures to features | ASE Atoms | 15-element vector |
| mygaussian.py | Active learning engine | DFT data, possibleFp.csv | Predictions, next points |
| sum_element.py | Preprocess for activity | Predicted energies | Aggregated data |
| activity_plot.py | Calculate activity | Aggregated energies | Plots, rankings |

| File | Purpose |
|---|---|
| helperMethods.py | Mathematical utilities (count_atoms, multiplicity, etc.) |
- Format: ASE trajectory format
- Required tags:
  - tag = 0: Adsorbate atoms (O, OH)
  - tag = 1: Surface layer atoms
  - tag = 2: Subsurface layer atoms
  - tag = 3: Third layer atoms
  - tag = 4: Bottom layer atoms
- Atoms: Must contain Ni, Fe, Co metals
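These requirements can be checked before a structure is passed to the feature extractor. The sketch below works on plain lists of element symbols and tags so it runs without ASE; with ASE installed, the lists could be obtained from `atoms.get_chemical_symbols()` and `atoms.get_tags()`. The function `check_structure` is a hypothetical helper, not part of the repository.

```python
EXPECTED_TAGS = {0, 1, 2, 3, 4}   # 0 = adsorbate, 1-4 = slab layers from the surface down
METALS = {"Ni", "Fe", "Co"}
ADSORBATE_ELEMENTS = {"O", "H"}

def check_structure(symbols, tags):
    """Return a list of problems; an empty list means the layout looks valid."""
    problems = []
    missing = EXPECTED_TAGS - set(tags)
    if missing:
        problems.append(f"missing tags: {sorted(missing)}")
    for index, (symbol, tag) in enumerate(zip(symbols, tags)):
        if tag == 0 and symbol not in ADSORBATE_ELEMENTS:
            problems.append(f"atom {index}: {symbol} tagged as adsorbate")
        elif tag in (1, 2, 3, 4) and symbol not in METALS:
            problems.append(f"atom {index}: {symbol} in slab layer {tag}")
        elif tag not in EXPECTED_TAGS:
            problems.append(f"atom {index}: unexpected tag {tag}")
    return problems

# Minimal four-layer column with an OH adsorbate (illustrative, not a real slab)
symbols = ["Ni", "Fe", "Co", "Ni", "O", "H"]
tags = [4, 3, 2, 1, 0, 0]
print(check_structure(symbols, tags))  # []
```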
Expected format for initial DFT data:
Ni1,Fe1,Co1,Ni2,Fe2,Co2,Ni3,Fe3,Co3,Ni4,Fe4,Co4,Ni5,Fe5,Co5,Energy
3,0,0,6,0,0,3,0,0,3,0,0,3,0,0,-4.52
2,1,0,5,1,0,2,1,0,2,1,0,2,1,0,-4.38
...

GPRdataspace.csv:
# 280,000 rows × 16 columns
# Columns 1-15: Feature vector (metal counts in 5 zones of 3, 6, 3, 3, 3 atoms)
# Column 16: Multiplicity (degeneracy factor)
3,0,0,6,0,0,3,0,0,3,0,0,3,0,0,1
2,1,0,6,0,0,3,0,0,3,0,0,3,0,0,3
...

possibleFp.csv: Subset of GPRdataspace.csv with only DFT-compatible configurations.

Predicted energies:
Ni1,Fe1,Co1,Ni2,Fe2,Co2,Ni3,Fe3,Co3,Ni4,Fe4,Co4,Ni5,Fe5,Co5,Predicted_Energy,Uncertainty
3,0,0,6,0,0,3,0,0,3,0,0,3,0,0,-4.50,0.15
...

File: GPRdataspace.py, motif_to_feature.py
# Original: Ni-Fe-Co system
metals = ['Ni', 'Fe', 'Co']
# Example: Pt-Pd-Rh system
metals = ['Pt', 'Pd', 'Rh']

File: GPRdataspace.py
# Original: ternary system
nMetals = 3
range(3) # in combinations_with_replacement
# Example: quaternary system (4 metals)
nMetals = 4
range(4) # update this line

File: GPRdataspace.py
# Original: 5 zones with (3,6,3,3,3) atoms
zoneSizes = (3, 6, 3, 3, 3)
# Example: 3 zones for simpler fingerprint
zoneSizes = (1, 6, 3) # 1 on-top + 6 neighbors + 3 subsurface

File: motif_to_feature.py
# Update zone list accordingly
zones = ['ens', 's', 'ss'] # for 3-zone system

File: motif_to_feature.py
# Automatic detection
is_ontop = motif.onTop()
feature = motif.features(['Ni','Fe','Co'], onTop=is_ontop, zones=['ens','s','ss'])
# Manual specification
feature = motif.features(['Ni','Fe','Co'], onTop=True, zones=['ens','s','ss'])

Create custom zone methods in the Slab class:
def custom_zone(self, onTop=False):
    '''Your custom coordination zone definition'''
    # X, Y, A, B are placeholders for the neighbor-index bounds of your zone
    if onTop:
        return self.closest(layer=1, start=X, stop=Y)[0]
    else:
        return self.closest(layer=1, start=A, stop=B)[0]

File: mygaussian.py
from scipy.stats import norm

# All three functions assume a maximization target and sigma > 0.

# Upper Confidence Bound (default)
def acquisition_ucb(mu, sigma, kappa=2.0):
    return mu + kappa * sigma

# Expected Improvement
def acquisition_ei(mu, sigma, y_best):
    Z = (mu - y_best) / sigma
    return (mu - y_best) * norm.cdf(Z) + sigma * norm.pdf(Z)

# Probability of Improvement
def acquisition_pi(mu, sigma, y_best):
    Z = (mu - y_best) / sigma
    return norm.cdf(Z)

File: mygaussian.py
from sklearn.gaussian_process.kernels import RBF, Matern, RationalQuadratic, WhiteKernel
# Original: RBF kernel
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
# Alternative: Matern kernel (nu sets the smoothness; nu=2.5 is less smooth than RBF)
kernel = Matern(length_scale=1.0, nu=2.5) + WhiteKernel(noise_level=0.1)
# Alternative: Rational Quadratic (mixture of RBF kernels with different length scales)
kernel = RationalQuadratic(length_scale=1.0, alpha=0.1) + WhiteKernel(noise_level=0.1)

This project is licensed under the MIT License; see the LICENSE file for details.
For questions, issues, or collaboration:
- GitHub Issues: https://github.com/minhee2043/ActiveLearning_OER/issues