Official implementation for paper:
Two Anaconda environments are provided, corresponding to CUDA 10.2 and CUDA 11.3, respectively. Use one of the following commands to create the environment for running our code:
conda env create -f env_cu102.yml # for CUDA 10.2
conda env create -f env_cu113.yml # for CUDA 11.3
The raw data, processed data, checkpoints, and predicted results can be accessed via the provided link. The directory structure should be as follows:
UAlign
├───checkpoints
│ ├───USPTO-50K
│ │ ├───class_unknown.pkl
│ │ ├───class_unknown.pth
│ │ ├───class_known.pkl
│ │ └───class_known.pth
│ │
│ ├───USPTO-FULL
│ │ ├───model.pth
│ │ └───token.pkl
│ │
│ └───USPTO-MIT
│ ├───model.pth
│ └───token.pkl
│
├───Data
│ ├───USPTO-50K
│ │ ├───canonicalized_raw_test.csv
│ │ ├───canonicalized_raw_val.csv
│ │ ├───canonicalized_raw_train.csv
│ │ ├───raw_test.csv
│ │ ├───raw_val.csv
│ │ └───raw_train.csv
│ │
│ ├───USPTO-MIT
│ │ ├───canonicalized_raw_train.csv
│ │ ├───canonicalized_raw_val.csv
│ │ ├───canonicalized_raw_test.csv
│ │ ├───valid.txt
│ │ ├───test.txt
│ │ └───train.txt
│ │
│ └───USPTO-FULL
│ ├───canonicalized_raw_val.csv
│ ├───canonicalized_raw_test.csv
│ ├───canonicalized_raw_train.csv
│ ├───raw_val.csv
│ ├───raw_test.csv
│ └───raw_train.csv
│
└───predicted_results
├───USPTO-50K
│ ├───answer-1711345166.9484136.json
│ └───answer-1711345359.2533984.json
│
├───USPTO-MIT
│ ├───10000-20000.json
│ ├───30000-38648.json
│ ├───20000-30000.json
│ └───0-10000.json
│
└───USPTO-FULL
├───75000-96014.json
├───25000-50000.json
├───50000-75000.json
└───0-25000.json
Data
- The raw data of the USPTO-50K and USPTO-FULL datasets are stored in the corresponding folders as raw_train.csv, raw_val.csv, and raw_test.csv. The raw data of the USPTO-MIT dataset are named train.txt, valid.txt, and test.txt under the folder USPTO-MIT.
- All the processed data are named canonicalized_raw_train.csv, canonicalized_raw_val.csv, and canonicalized_raw_test.csv and are placed in the corresponding folders. If you want to use your own data for training, please make sure your files have the same format and the same names as the processed ones (see the sketch below for a quick check).
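As a quick sanity check for a custom dataset, the sketch below (an illustrative helper, not part of the repository) verifies that a folder provides the three processed files with the expected names:
import os

def check_dataset_folder(folder):
    # Expected split names, taken from the directory structure above.
    expected = [
        'canonicalized_raw_train.csv',
        'canonicalized_raw_val.csv',
        'canonicalized_raw_test.csv',
    ]
    missing = [f for f in expected if not os.path.exists(os.path.join(folder, f))]
    if missing:
        raise FileNotFoundError(f'Missing processed files in {folder}: {missing}')
    print(f'{folder} looks ready for training.')

check_dataset_folder('Data/USPTO-50K')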
Checkpoints
- Every checkpoint needs to be used together with its corresponding tokenizer. The tokenizers are stored as pkl files, while the trained model weights are stored in pth files. Matching model weights and tokenizers have the same name and are placed in the same folder (see the loading sketch below).
- The parameters of the checkpoint for USPTO-50K are
dim: 512 n_layer: 8 heads: 8 negative_slope: 0.2
- The parameters of the checkpoints for USPTO-MIT and USPTO-FULL are
dim: 768 n_layer: 8 heads: 12 negative_slope: 0.2
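As a rough illustration of how a weight/tokenizer pair fits together (the training and inference scripts below load them for you via --checkpoint and --token_ckpt; this is not the repository's loading code), the two files can be opened with standard pickle and PyTorch serialization:
import pickle
import torch

# Illustrative sketch: load a tokenizer/weight pair that share the same name.
with open('checkpoints/USPTO-50K/class_unknown.pkl', 'rb') as fin:
    tokenizer = pickle.load(fin)   # tokenizer matching the checkpoint's vocabulary
weights = torch.load('checkpoints/USPTO-50K/class_unknown.pth', map_location='cpu')
print(type(tokenizer), type(weights))   # inspect what was stored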
predicted_results
- In the USPTO-FULL and USPTO-MIT folders, there is only one set of experimental results each. The results are split into different files based on the index of the data.
- In USPTO-50K, there are two sets of experimental results. The file answer-1711345166.9484136.json corresponds to the reaction-class-unknown setting, while answer-1711345359.2533984.json corresponds to the reaction-class-known setting.
- Each json file contains the raw test data, the model's predictions, the corresponding logits, and the information about the checkpoint used to generate the file.
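Since the exact schema of these json files is not documented here, a safe way to explore one is to load it and inspect its top-level structure; the sketch below makes no assumption about the keys:
import json

# Illustrative sketch: peek at a prediction file without assuming its schema.
with open('predicted_results/USPTO-50K/answer-1711345166.9484136.json') as fin:
    results = json.load(fin)

if isinstance(results, dict):
    print('top-level keys:', list(results.keys()))
elif isinstance(results, list):
    print('number of records:', len(results))
    print('first record:', results[0])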
We provide the data preprocessing scripts in the folder data_proprocess. Each dataset is processed by a separate script. The atom-mapping numbers of each reaction are reassigned according to the canonical ranks of the atoms in the product to avoid information leakage (a sketch of this re-mapping idea is given after the commands below). The scripts for USPTO-50K and USPTO-FULL each process a single file. They can be used as follows, and the output file will be stored in the same folder as the input file:
python data_proprocess/canonicalize_data_50k.py --filename $dir_of_raw_file
python data_proprocess/canonicalize_data_full.py --filename $dir_of_raw_file
The script for USPTO-MIT processes all the files together and can be used by
python data_proprocess/canonicalize_data_full.py --dir $folder_of_raw_data --output_dir $output_dir
The $folder_of_raw_data should contain the following files: train.txt, valid.txt and test.txt.
For details about the data preprocessing, please refer to the article.
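The sketch below illustrates the re-mapping idea described above, i.e., deriving atom-map numbers from the canonical ranks of the product atoms and propagating them to the reactants. It is a simplified RDKit-based illustration, not the code in data_proprocess:
from rdkit import Chem

def remap_by_product_canonical_rank(reaction_smiles):
    # Split an atom-mapped reaction "reactants>>product".
    reactant_smi, product_smi = reaction_smiles.split('>>')
    product = Chem.MolFromSmiles(product_smi)
    reactants = Chem.MolFromSmiles(reactant_smi)

    # Remember the original map numbers, then clear them so the canonical
    # ranking does not depend on the existing (possibly leaky) mapping.
    old_maps = [atom.GetAtomMapNum() for atom in product.GetAtoms()]
    for atom in product.GetAtoms():
        atom.SetAtomMapNum(0)
    ranks = list(Chem.CanonicalRankAtoms(product, breakTies=True))

    # New map number = canonical rank + 1 (map number 0 means "unmapped").
    old_to_new = {}
    for atom, old, rank in zip(product.GetAtoms(), old_maps, ranks):
        atom.SetAtomMapNum(rank + 1)
        if old > 0:
            old_to_new[old] = rank + 1

    # Propagate the new numbering to the reactants; atoms that do not
    # appear in the product lose their map number.
    for atom in reactants.GetAtoms():
        atom.SetAtomMapNum(old_to_new.get(atom.GetAtomMapNum(), 0))

    return Chem.MolToSmiles(reactants) + '>>' + Chem.MolToSmiles(product)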
To build the tokenizer, we need a list of all the tokens that appear in the data. You can use the following command to generate the token list and store it in a file:
python generate_tokens $file_1 $file_2 ... $file_n $token_list.json
The script accepts multiple files as input, and the last argument should be the path of the file in which to store the token list. The input files should have the same format as the processed datasets.
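For example, a token list for USPTO-50K can be built from the processed splits listed in the directory structure above (the output path is an arbitrary choice):
python generate_tokens Data/USPTO-50K/canonicalized_raw_train.csv \
    Data/USPTO-50K/canonicalized_raw_val.csv \
    Data/USPTO-50K/canonicalized_raw_test.csv \
    Data/USPTO-50K/token_list.json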
Use the following command for training the first stage:
python pretrain.py --dim $dim \
--n_layer $n_layer \
--data_path $folder_of_dataset \
--seed $random_seed \
--bs $batch_size \
--epoch $epoch_for_training \
--early_stop $epoch_num_for_checking_early_stop \
--device $device_id \
--lr $learning_rate \
--dropout $dropout \
--base_log $folder_for_logging \
--heads $num_heads_for_attention \
--negative_slope $negative_slope_for_leaky_relu \
--token_path $path_of_token_list \
--checkpoint $path_of_checkpoint \
--token_ckpt $path_of_checkpoint_for_tokenizer \
--lrgamma $decay_rate_for_lr_scheduler \
--warmup $epoch_num_for_warmup \
--accu $batch_num_for_gradient_accumulation \
--num_worker $num_worker_for_data_loader
If checkpoints for the model and the tokenizer are provided, the token-list path is not necessary and will be ignored if you pass it to the script.
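As a concrete illustration, a first-stage run on USPTO-50K using the architecture of the released checkpoint (dim 512, 8 layers, 8 attention heads, negative slope 0.2) might look like the following. The batch size, learning rate, and the other numeric training hyperparameters are illustrative placeholders, not the settings used for the released checkpoints:
python pretrain.py --dim 512 --n_layer 8 --heads 8 --negative_slope 0.2 \
    --data_path Data/USPTO-50K --token_path Data/USPTO-50K/token_list.json \
    --bs 128 --epoch 200 --early_stop 15 --lr 1e-4 --dropout 0.1 \
    --warmup 4 --lrgamma 0.99 --accu 1 --num_worker 4 \
    --device 0 --seed 2023 --base_log logs/pretrain-50k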
For distributed data parallel training, you can alternatively use:
python ddp_pretrain.py --dim $dim \
--n_layer $n_layer \
--data_path $folder_of_dataset \
--seed $random_seed \
--bs $batch_size \
--epoch $epoch_for_training \
--early_stop $epoch_num_for_checking_early_stop \
--lr $learning_rate \
--dropout $dropout \
--base_log $folder_for_logging \
--heads $num_heads_for_attention \
--negative_slope $negative_slope_for_leaky_relu \
--token_path $path_of_token_list \
--checkpoint $path_of_checkpoint \
--token_ckpt $path_of_checkpoint_for_tokenizer \
--lrgamma $decay_rate_for_lr_scheduler \
--warmup $epoch_num_for_warmup \
--accu $batch_num_for_gradient_accumulation \
--num_worker $num_worker_for_data_loader \
--num_gpus $num_of_gpus_for_training \
--port $port_for_ddp_training
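For example, the same illustrative first-stage run could be launched with distributed data parallel training on 4 GPUs (numeric values remain placeholders):
python ddp_pretrain.py --dim 512 --n_layer 8 --heads 8 --negative_slope 0.2 \
    --data_path Data/USPTO-50K --token_path Data/USPTO-50K/token_list.json \
    --bs 128 --epoch 200 --early_stop 15 --lr 1e-4 --dropout 0.1 \
    --warmup 4 --lrgamma 0.99 --accu 1 --num_worker 4 \
    --seed 2023 --base_log logs/pretrain-50k-ddp --num_gpus 4 --port 12345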
Use the following command to train the second stage:
python train_trans.py --dim $dim \
--n_layer $n_layer \
--aug_prob $probability_for_data_augmentation \
--data_path $folder_of_dataset \
--seed $random_seed \
--bs $batch_size \
--epoch $epoch_for_training \
--early_stop $epoch_num_for_checking_early_stop \
--lr $learning_rate \
--dropout $dropout \
--base_log $folder_for_logging \
--heads $num_heads_for_attention \
--negative_slope $negative_slope_for_leaky_relu \
--token_path $path_of_token_list \
--checkpoint $path_of_checkpoint \
--token_ckpt $path_of_checkpoint_for_tokenizer \
--gamma $decay_rate_for_lr_scheduler \
--step_start $the_epoch_to_start_lr_decay \
--warmup $epoch_num_for_warmup \
--accu $batch_num_for_gradient_accumulation \
--num_worker $num_worker_for_data_loader \
--label_smoothing $label_smoothing_for_training \
[--use_class] # add it to the command for the reaction-class-known setting
If you want to train from scratch, pass the path of the token list to the script and do not provide any checkpoints.
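For instance, a second-stage run on USPTO-50K under the reaction-class-known setting, starting from a first-stage checkpoint (the checkpoint paths and the numeric training hyperparameters below are illustrative placeholders):
python train_trans.py --dim 512 --n_layer 8 --heads 8 --negative_slope 0.2 \
    --data_path Data/USPTO-50K --aug_prob 0.5 \
    --checkpoint logs/pretrain-50k/model.pth --token_ckpt logs/pretrain-50k/token.pkl \
    --bs 128 --epoch 300 --early_stop 20 --lr 1e-4 --dropout 0.1 \
    --gamma 0.99 --step_start 50 --warmup 4 --accu 1 --num_worker 4 \
    --label_smoothing 0.1 --seed 2023 --base_log logs/trans-50k --use_class
Drop --use_class for the reaction-class-unknown setting.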
For distributed data parallel training, you can alternatively use:
python ddp_train_trans.py --dim $dim \
--n_layer $n_layer \
--aug_prob $probability_for_data_augmentation \
--data_path $folder_of_dataset \
--seed $random_seed \
--bs $batch_size \
--epoch $epoch_for_training \
--early_stop $epoch_num_for_checking_early_stop \
--device $device_id \
--lr $learning_rate \
--dropout $dropout \
--base_log $folder_for_logging \
--heads $num_heads_for_attention \
--negative_slope $negative_slope_for_leaky_relu \
--token_path $path_of_token_list \
--checkpoint $path_of_checkpoint \
--token_ckpt $path_of_checkpoint_for_tokenizer \
--gamma $decay_rate_for_lr_scheduler \
--step_start $the_epoch_to_start_lr_decay \
--warmup $epoch_num_for_warmup \
--accu $batch_num_for_gradient_accumulation \
--num_worker $num_worker_for_data_loader \
--label_smoothing $label_smoothing_for_training \
--num_gpus $num_of_gpus_for_training \
--port $port_for_ddp_training
[--use_class] # add it to the command for the reaction-class-known setting
To run inference with the trained checkpoints, you can use the following command:
python inference.py --dim $dim \
--n_layer $n_layer \
--heads $num_heads_for_attention \
--seed $random_seed \
--data_path $path_for_file_of_testset \
--device $device_id \
--checkpoint $path_of_checkpoint \
--token_ckpt $path_of_checkpoint_for_tokenizer \
--negative_slope $negative_slope_for_leaky_relu \
--max_len $max_length_of_generated_smiles \
--beams $beam_size_for_beam_search \
--output_folder $the_folder_to_store_results \
--save_every $the_step_to_write_results_to_files \
[--use_class] # add it to the command for the reaction-class-known setting
The script will summarize all the results into a json file under the output folder, named by the timestamp. To evaluate the results and obtain the top-$k$ accuracy, use the following command:
python evaluate_answer.py --beams $beam_size_for_beam_search --path $path_of_result
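Putting the two steps together, an illustrative run on the USPTO-50K test set with the released class-unknown checkpoint (beam size, max length, and the other numeric values are placeholders) could be:
python inference.py --dim 512 --n_layer 8 --heads 8 --negative_slope 0.2 \
    --data_path Data/USPTO-50K/canonicalized_raw_test.csv \
    --checkpoint checkpoints/USPTO-50K/class_unknown.pth \
    --token_ckpt checkpoints/USPTO-50K/class_unknown.pkl \
    --device 0 --seed 2023 --max_len 500 --beams 10 \
    --output_folder results/USPTO-50K --save_every 100
python evaluate_answer.py --beams 10 --path results/USPTO-50K/answer-<timestamp>.json
Add --use_class and switch to the class_known checkpoint pair for the reaction-class-known setting.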
To speed up inference, you can use the following command to run inference on only a part of the test set, so that inference can be carried out in parallel:
python inference_part.py --dim $dim \
--n_layer $n_layer \
--heads $num_heads_for_attention \
--seed $random_seed \
--data_path $path_for_file_of_testset \
--device $device_id \
--checkpoint $path_of_checkpoint \
--token_ckpt $path_of_checkpoint_for_tokenizer \
--negative_slope $negative_slope_for_leaky_relu \
--max_len $max_length_of_generated_smiles \
--beams $beam_size_for_beam_search \
--output_folder $the_folder_to_store_results \
--save_every $the_step_to_write_results_to_files \
--start $start_idx \
--len $num_of_samples_to_test \
[--use_class] # add it to the command for the reaction-class-known setting
The script will summarize all the results into a json file under the output folder, named by the start and end indices of the data. To evaluate the results and obtain the top-$k$ accuracy, use the following command:
python evaluate_dir.py --beams $beam_size_for_beam_search --path $path_of_output_dir
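For example, the USPTO-MIT test set can be split into slices and processed in parallel, mirroring the 0-10000, 10000-20000, ... files shown in predicted_results (the checkpoint architecture is taken from the description above; the remaining numeric values are placeholders):
# run one command per slice of the test set, e.g. on different GPUs
python inference_part.py --dim 768 --n_layer 8 --heads 12 --negative_slope 0.2 \
    --data_path Data/USPTO-MIT/canonicalized_raw_test.csv \
    --checkpoint checkpoints/USPTO-MIT/model.pth \
    --token_ckpt checkpoints/USPTO-MIT/token.pkl \
    --device 0 --seed 2023 --max_len 500 --beams 10 \
    --output_folder results/USPTO-MIT --save_every 100 \
    --start 0 --len 10000
# repeat with --start 10000, 20000, ... until the whole test set is covered,
# then evaluate everything in the output folder at once
python evaluate_dir.py --beams 10 --path results/USPTO-MIT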
We also provide a script for running inference on a single product. You can use the following command:
python inference_one.py --dim $dim \
--n_layer $n_layer \
--heads $num_heads_for_attention \
--seed $random_seed \
--device $device_id \
--checkpoint $path_of_checkpoint \
--token_ckpt $path_of_checkpoint_for_tokenizer \
--negative_slope $negative_slope_for_leaky_relu \
--max_len $max_length_of_generated_smiles \
--beams $beam_size_for_beam_search \
--product_smiles $the_SMILES_of_product \
--input_class $class_number_for_reaction \
[--use_class] # add it to the command for the reaction-class-known setting
[--org_output] # add it to keep invalid SMILES in the outputs
If --use_class is added, --input_class is required. Also make sure that the product SMILES contains a single molecule.
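For example, to predict reactants for a single product with the released USPTO-50K class-unknown checkpoint (the product SMILES below is an arbitrary illustrative molecule, and the numeric values are placeholders):
python inference_one.py --dim 512 --n_layer 8 --heads 8 --negative_slope 0.2 \
    --checkpoint checkpoints/USPTO-50K/class_unknown.pth \
    --token_ckpt checkpoints/USPTO-50K/class_unknown.pkl \
    --device 0 --seed 2023 --max_len 500 --beams 10 \
    --product_smiles "CC(=O)Oc1ccccc1C(=O)O"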