By Bijun Tang, Yuhao Lu, Jiadong Zhou, Tushar Chouhan, Han Wang, Prafful Golani, Manzhang Xu, Quan Xu, Cuntai Guan, Zheng Liu
Nanyang Technological University.
This repository contains the original models described in the paper "Machine learning-guided synthesis of advanced inorganic materials" (https://arxiv.org/abs/1905.03938). These are the models used for the MoS2 classification task and the CQD regression task.
If you use these models in your research, please cite:
@article{Tang2019,
author = {Bijun Tang and Yuhao Lu and Jiadong Zhou and Han Wang and Prafful Golani and Manzhang Xu and Quan Xu and Cuntai Guan and Zheng Liu},
title = {Machine learning-guided synthesis of advanced inorganic materials},
journal = {arXiv preprint arXiv:1905.03938},
year = {2019}
}
- Python environment setup:
  python 3.6.6 jupyter==1.0.0 matplotlib==2.2.3 numpy==1.16.0 pandas==0.22.0 scikit-learn==0.20.3 scipy==1.1.0 seaborn==0.9.0 shap==0.24.0 xgboost==0.80
- If errors occur during setup, checking that the following packages are installed (on Ubuntu or other Linux-based systems) may help:
  font-manager g++ gcc python3-dev
  Alternatively, upgrade pip.
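One way to narrow down setup errors is to compare the installed package versions against the pins in the environment list. The following is a minimal sketch and not part of the repository: `PINS` copies the versions from this README, `check_pins` is a hypothetical helper, and `importlib.metadata` assumes Python 3.8 or newer (on the pinned Python 3.6.6 the `importlib_metadata` backport would be needed instead).

```python
from importlib import metadata

# Pins copied from the environment list in this README.
PINS = {
    "jupyter": "1.0.0",
    "matplotlib": "2.2.3",
    "numpy": "1.16.0",
    "pandas": "0.22.0",
    "scikit-learn": "0.20.3",
    "scipy": "1.1.0",
    "seaborn": "0.9.0",
    "shap": "0.24.0",
    "xgboost": "0.80",
}


def check_pins(pins, lookup=metadata.version):
    """Return {package: (wanted, found)} for every pin that is not met.

    `found` is None when the package is not installed.
    """
    mismatches = {}
    for name, wanted in pins.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in sorted(check_pins(PINS).items()):
        print(f"{name}: wanted {wanted}, found {found or 'not installed'}")
```

An empty result means every pinned package is installed at the listed version. The `lookup` parameter exists only so the comparison can be exercised without touching the real environment.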
- Before running best_model_interpretation-*.ipynb, use utils.data_handler.fake_input_generator() to generate the input conditions. Then move the generated fake_input_*.csv files into the data folder.
- For a more detailed description of the dataset, please see our paper.
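Moving the generated fake_input_*.csv files into the data folder can also be scripted. This is a minimal sketch using only the standard library: `move_fake_inputs` is a hypothetical helper (not part of utils), and it assumes the generated files land in the current working directory, which this README does not state.

```python
import shutil
from pathlib import Path


def move_fake_inputs(src_dir=".", data_dir="data"):
    """Move every fake_input_*.csv from src_dir into data_dir.

    Returns the sorted list of destination paths.
    """
    src = Path(src_dir)
    dest = Path(data_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the data folder if missing
    moved = []
    for csv_path in sorted(src.glob("fake_input_*.csv")):
        target = dest / csv_path.name
        shutil.move(str(csv_path), str(target))
        moved.append(target)
    return moved


if __name__ == "__main__":
    for path in move_fake_inputs():
        print(f"moved -> {path}")
```

Only files matching fake_input_*.csv directly inside `src_dir` are moved; other CSV files are left where they are.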
- Code structure:
  - scripts
    - run_ipynb.sh: script to run all *.ipynb. Set up your directory in the file.
  - results: folder to store all results and generated figures
  - utils: supporting functions
  - data: download data before running code (see Data)
  - PAM_repeat1000times-*.py: repeats PAM 1000 times with randomly selected initial training sets. Note that running all 1000 repetitions takes considerable computational time; e.g., a single PAM run for classification may take around 1 hr.
  - PAM_guidedSynthesis-*.ipynb: runs PAM once and plots the figures
  - model_selection-*.ipynb: selects the best model with 10 repetitions of 10 × 10 cross-validation, plus result interpretation
  - data_overview.ipynb: plots the feature correlation of the dataset and computes other descriptive statistics
  - best_model_interpretation-*.ipynb: extracts feature attribution values and predicts on the generated input
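To make the cost of the 1000 PAM repetitions concrete, the sketch below shows one way to draw a reproducible random initial training set per repetition and to estimate the total wall-clock time from the roughly 1 hr per run quoted for classification. It does not implement PAM itself. Both function names, the per-run seeding scheme, and the worker count are assumptions for illustration and are not taken from PAM_repeat1000times-*.py.

```python
import math
import random


def initial_training_indices(n_samples, n_initial, seed):
    """Pick a reproducible random initial training set for one repetition."""
    rng = random.Random(seed)  # independent generator per run
    return sorted(rng.sample(range(n_samples), n_initial))


def estimated_wall_hours(n_runs, hours_per_run, n_workers=1):
    """Rough wall-clock estimate when runs are spread evenly over workers."""
    return math.ceil(n_runs / n_workers) * hours_per_run


# 1000 runs at about 1 hr each, the figure quoted for classification.
serial = estimated_wall_hours(1000, 1.0)       # 1000.0 hours on one worker
parallel = estimated_wall_hours(1000, 1.0, 8)  # 125.0 hours on 8 workers
```

Run serially, 1000 hours is roughly six weeks, so spreading the repetitions over several workers (with one seed per repetition so that results stay reproducible) is worth considering.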
Note: File names ending with '-classification' are for the classification (MoS2) dataset, while those ending with '-regression' are for the regression (CQD) dataset.
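The "10 repetitions of 10 X 10 cross validation" used in model_selection-*.ipynb can be read as repeated nested cross-validation: 10 outer folds for performance estimation, 10 inner folds for hyperparameter tuning, repeated 10 times with different shuffles. That reading is an assumption, and the pure-Python sketch below only generates the index splits under it. The function names are hypothetical, the sample count of 200 is illustrative, and the splits are not stratified, which a real classification setup would likely want.

```python
import random


def k_fold_indices(indices, k, rng):
    """Shuffle indices and split them into k near-equal folds."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_cv_splits(n_samples, repeats=10, outer_k=10, inner_k=10, seed=0):
    """Yield (repeat, outer_test, inner_train, inner_val) index lists.

    Outer folds estimate performance; inner folds, built only from the outer
    training part, are meant for hyperparameter tuning.
    """
    for rep in range(repeats):
        rng = random.Random(seed + rep)  # a different shuffle per repetition
        outer_folds = k_fold_indices(range(n_samples), outer_k, rng)
        for o, outer_test in enumerate(outer_folds):
            outer_train = [i for j, f in enumerate(outer_folds) if j != o for i in f]
            inner_folds = k_fold_indices(outer_train, inner_k, rng)
            for v, inner_val in enumerate(inner_folds):
                inner_train = [i for j, f in enumerate(inner_folds) if j != v for i in f]
                yield rep, outer_test, inner_train, inner_val


splits = list(nested_cv_splits(n_samples=200))
print(len(splits))  # 10 repeats * 10 outer * 10 inner = 1000 splits
```

Under this reading, each hyperparameter candidate is fitted 1000 times in the inner loops alone, which helps explain why model selection is kept in its own notebook.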