Learn to boost fraud detection accuracy and developer efficiency with Intel's end-to-end, no-code, graph-neural-network-boosted, multi-node distributed workflows. Check out more workflow examples and reference implementations in the Intel Developer Catalog.
Fraud detection has traditionally been tackled with classical machine learning algorithms such as gradient boosted machines. However, such supervised machine learning algorithms can yield unsatisfactory precision and recall for several reasons:
- Severe class imbalance: ratio of fraud to non-fraud transactions is extremely imbalanced with typical values less than 1%
- Complex fraudster behavior which evolves with time: it is quite difficult to capture user behavior using traditional ML techniques
- Scale of data: credit card transaction datasets can have billions of transactions which require distributed preprocessing and training
- Latency of fraud detection: it is important to detect fraud quickly in order to minimize losses, thus highlighting the need for distributed inference
In Intel's Enhanced Fraud Detection reference kit, we employ Graph Neural Networks (GNNs), which are popular for their ability to capture complex behavioral patterns (e.g., fraudsters performing multiple small transactions from different cards to avoid getting caught). We also demonstrate a boost in accuracy by using GNN-boosted features over a baseline trained on traditional ML-only features. To make sure that we don't trade efficiency for accuracy, we enable distributed pipelines. Generally, distributed pipelines take data scientists weeks to set up and involve numerous technical challenges. Our reference kit lets you benefit from distributed capabilities through a no-code, config-driven user interface. We also provide ways for you to customize our solution to your own use cases.
- Significantly boost fraud classification accuracy by augmenting classical ML features with features generated through Graph Neural Networks (GNNs)
- Utilize our distributed preprocessing, training and inference pipelines to detect fraud quickly
- Improve developer efficiency and experimentation with our no-code, config-driven user interface
To learn more, visit the CC-Scanner GitHub repository.
There are workflow-specific hardware and software setup requirements depending on how the workflow is run. Bare metal development system and Docker* image running locally have the same system requirements.
Supported Hardware | Precision |
---|---|
Intel® 1st, 2nd, 3rd, and 4th Gen Xeon® Scalable Performance processors | FP32 |
Memory | >200GB |
Storage | >50GB |
To benefit from our distributed pipeline, please ensure that:
- Password-less SSH is set up on all the nodes.
- A network file system (NFS) is set up for all the nodes to access.
- A high-speed network between nodes is set up.
- Each node has at least 384 GB of RAM.
- Each node has at least 200 GB of storage space on its localdisk.
The high-level architecture of the reference use case is shown in the diagram below. We use a credit card transaction dataset open-sourced by IBM (commonly known as the tabformer dataset) in this reference use case to demonstrate the capabilities outlined in the Solution Technical Overview section.
The feature engineering stage ingests the raw data, encodes each column into features using the logic defined in the feature engineering config yaml file and saves processed data.
The GNN training stage creates homogeneous graphs from the processed data generated by the feature engineering stage and trains a GraphSage model in a self-supervised link prediction setting to learn latent representations of the nodes (cards and merchants). Once the GNN model is trained, the GNN workflow concatenates the card and merchant features generated by the model to the corresponding transaction features and saves the GNN-boosted features to a CSV file.
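The concatenation step described above can be illustrated with a minimal sketch. The function name, field names (`card_id`, `merchant_id`, `features`) and embedding sizes are hypothetical and chosen only for illustration; the actual workflow operates on CSV files and its own column names.

```python
from typing import Dict, List


def concat_gnn_features(
    transactions: List[dict],
    card_emb: Dict[str, List[float]],
    merchant_emb: Dict[str, List[float]],
) -> List[List[float]]:
    """Append card and merchant embeddings to each transaction's features."""
    rows = []
    for tx in transactions:
        row = list(tx["features"])               # original transaction (edge) features
        row += card_emb[tx["card_id"]]           # learned embedding of the card node
        row += merchant_emb[tx["merchant_id"]]   # learned embedding of the merchant node
        rows.append(row)
    return rows


# Toy data: two transactions, 2-dimensional embeddings (illustrative sizes only).
transactions = [
    {"card_id": "c1", "merchant_id": "m1", "features": [12.5, 0.0]},
    {"card_id": "c2", "merchant_id": "m1", "features": [3.0, 1.0]},
]
card_emb = {"c1": [0.1, 0.2], "c2": [0.3, 0.4]}
merchant_emb = {"m1": [0.9, 0.8]}

boosted = concat_gnn_features(transactions, card_emb, merchant_emb)
print(boosted[0])  # [12.5, 0.0, 0.1, 0.2, 0.9, 0.8]
```

Each output row keeps the original transaction features and gains one block of values per endpoint node, which is why the GNN-boosted file has more columns than the processed data it was built from.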
The XGBoost training stage trains a binary classification model using the data splitting, model parameters and runtime parameters set in the XGB training config yaml file. AUCPR (Area Under the Precision-Recall Curve) is used as the evaluation metric because it is robust on highly imbalanced datasets. Data is split in temporal order to simulate a real-life scenario. The model performance on the tabformer dataset can be found in the table in the results section.
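The temporal split described above can be sketched as follows. The year boundaries (train before 2018, validation in 2018, test after 2018) are taken from the results table later in this document; the `year` field name and the row layout are illustrative assumptions, not the workflow's actual schema.

```python
from typing import Dict, List


def temporal_split(rows: List[dict], val_year: int = 2018) -> Dict[str, List[dict]]:
    """Split rows by year so that the model is always evaluated on later data."""
    splits: Dict[str, List[dict]] = {"train": [], "val": [], "test": []}
    for row in rows:
        if row["year"] < val_year:
            splits["train"].append(row)   # past transactions
        elif row["year"] == val_year:
            splits["val"].append(row)     # used for model selection
        else:
            splits["test"].append(row)    # strictly later than anything seen in training
    return splits


rows = [{"year": y, "amount": 10.0} for y in (2015, 2017, 2018, 2019, 2020)]
splits = temporal_split(rows)
print({name: len(part) for name, part in splits.items()})
# {'train': 2, 'val': 1, 'test': 2}
```

Unlike a random split, this keeps future transactions out of training, which mirrors how a deployed model only ever sees past behavior.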
Start by defining an environment variable that stores the workspace path. This can be an existing directory or one to be created in later steps. This environment variable is used in all commands that take absolute paths.
For example:
export WORKSPACE=/mtw/work
export DATASET_DIR=$WORKSPACE/data
For distributed pipelines that require more than one kind of storage (NFS and localdisk, for instance), separate environment variables that clearly identify each storage location are used.
For example:
export NFS_DIR=/mtw/nfs-work
export LOCAL_DIR=/mtw/local-work
export DATASET_DIR=$LOCAL_DIR/data
Notes :
- If you are using the distributed pipeline, please repeat steps below on localdisk of all nodes as well as on NFS.
- If you are using the distributed pipeline, please refer to this NFS configuration as an example.
Create a working directory for the use case. Use these commands to set up the log folder, data folder and corresponding subfolders inside the working directory. We assume that your working directory is work.
mkdir -p ${WORKSPACE} && cd ${WORKSPACE}
mkdir data ml_tmp gnn_tmp
cd data && mkdir raw_data edge_data node_edge_data
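As a quick sanity check after running the commands above, the expected folder layout can be verified with a short script. This is a sketch under the assumption that the layout is exactly the one created by the `mkdir` commands in this section; the helper name `missing_dirs` is hypothetical.

```python
import os
from pathlib import Path
from typing import List

# Sub-folders created by the mkdir commands above, relative to the workspace.
EXPECTED_DIRS = [
    "ml_tmp",
    "gnn_tmp",
    "data/raw_data",
    "data/edge_data",
    "data/node_edge_data",
]


def missing_dirs(workspace: str) -> List[str]:
    """Return the expected sub-folders that do not exist under the workspace."""
    root = Path(workspace)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


# Falls back to the current directory when WORKSPACE is not set.
workspace = os.environ.get("WORKSPACE", ".")
print("missing folders:", missing_dirs(workspace))
```

An empty list means every folder is in place. For the distributed pipeline, the same check would be repeated on each node's localdisk and on NFS.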
Create a working directory for the workflow and clone the Main Repository into your working directory.
cd ${WORKSPACE}
git clone https://github.com/SelimWaly/CC-Scanner cc-scanner
cd cc-scanner
git submodule update --init --recursive
Your folder structure should follow the directory structure as shown in the figure below.
Download transactions.tgz from https://github.com/IBM/TabFormer/tree/main/data/credit_card, upload it to the $WORKSPACE/data/raw_data folder, then extract it with the commands below:
cd $WORKSPACE/data/raw_data
tar -zxvf transactions.tgz
If you want to bring your own dataset, put your raw data in the $WORKSPACE/data/raw_data folder.
You can execute the reference pipelines using the following environments:
- Docker
- Jupyter
- Argo
- Bare metal
Follow these instructions to set up and run a single-node pipeline with our provided Docker image. To run the distributed pipeline on bare metal, see the bare metal instructions.
You'll need to install Docker Engine on your development system. Note that while Docker Engine is free to use, Docker Desktop may require you to purchase a license. See the Docker Engine Server installation instructions for details.
Ensure you have Docker Compose installed on your machine. If you don't have this tool installed, consult the official Docker Compose installation documentation.
DOCKER_CONFIG=${DOCKER_CONFIG:-$HOME/.docker}
mkdir -p $DOCKER_CONFIG/cli-plugins
curl -SL https://github.com/docker/compose/releases/download/v2.7.0/docker-compose-linux-x86_64 -o $DOCKER_CONFIG/cli-plugins/docker-compose
chmod +x $DOCKER_CONFIG/cli-plugins/docker-compose
docker compose version
Build or pull the provided docker image.
cd $WORKSPACE/cc-scanner/docker
docker compose build
OR
cd $WORKSPACE/cc-scanner/docker
docker pull SelimWaly/ai-workflows:pa-fraud-detection-classical-ml
docker pull SelimWaly/ai-workflows:pa-fraud-detection-gnn
%%{init: {'theme': 'dark'}}%%
flowchart RL
VDATASETDIRrawdata{{"${DATASET_DIR}/raw_data"}} x-. /workspace/data/raw_data .-x preprocess
VOUTPUTDIRdataedgedata{{"${OUTPUT_DIR}/data/edge_data"}} x-. /workspace/data/edge_data .-x preprocess
VCONFIGDIR{{"${CONFIG_DIR"}} x-. "-$PWD/../configs/single-node}" .-x preprocess
classDef volumes fill:#0f544e,stroke:#23968b
class VDATASETDIRrawdata,VOUTPUTDIRdataedgedata,VCONFIGDIR volumes
The preprocess workflow will ingest the raw data in the $DATASET_DIR/raw_data/ directory, generate a preprocessed CSV file, and save it in the $OUTPUT_DIR/data/edge_data/ directory.
Run the preprocess workflow with the following command:
docker compose run preprocess 2>&1 | tee preprocess.log
The table below shows some of the environment variables you can control according to your needs.
Environment Variable Name | Default Value | Description |
---|---|---|
CONFIG_DIR | ${WORKSPACE}/cc-scanner/configs | Configurations directory |
OUTPUT_DIR | ${WORKSPACE}/cc-scanner/docker/output | Logfile and Checkpoint output |
%%{init: {'theme': 'dark'}}%%
flowchart RL
VCONFIGDIR{{"${CONFIG_DIR"}} x-. "-$PWD/../configs/single-node}" .-x baselinetraining[baseline-training]
VOUTPUTDIRdataedgedata{{"${OUTPUT_DIR}/data/edge_data"}} x-. /workspace/data/edge_data .-x baselinetraining
VOUTPUTDIRbaselinemodels{{"${OUTPUT_DIR}/baseline/models"}} x-. /workspace/tmp/models .-x baselinetraining
VOUTPUTDIRbaselinelogs{{"${OUTPUT_DIR}/baseline/logs"}} x-. /workspace/tmp/logs .-x baselinetraining
classDef volumes fill:#0f544e,stroke:#23968b
class VCONFIGDIR,VOUTPUTDIRdataedgedata,VOUTPUTDIRbaselinemodels,VOUTPUTDIRbaselinelogs volumes
The preprocess workflow must complete successfully before running the baseline-training workflow.
The baseline-training workflow will consume the CSV file generated by the preprocess workflow above and train an XGBoost model. It will also print out AUCPR (area under the precision-recall curve) results to the console.
Run the baseline-training workflow with the command below.
docker compose run baseline-training 2>&1 | tee baseline-training.log
The table below shows some of the environment variables you can control according to your needs.
Environment Variable Name | Default Value | Description |
---|---|---|
CONFIG_DIR | ${WORKSPACE}/cc-scanner/configs | Configurations directory |
OUTPUT_DIR | ${WORKSPACE}/cc-scanner/docker/output | Logfile and Checkpoint output |
%%{init: {'theme': 'dark'}}%%
flowchart RL
VOUTPUTDIRdataedgedata{{"${OUTPUT_DIR}/data/edge_data"}} x-. /DATA_IN .-x gnnanalytics[gnn-analytics]
VOUTPUTDIRdatanodeedgedata{{"${OUTPUT_DIR}/data/node_edge_data"}} x-. /DATA_OUT .-x gnnanalytics
VOUTPUTDIRgnncheckpoint{{"${OUTPUT_DIR}/gnn_checkpoint"}} x-. /GNN_TMP .-x gnnanalytics
VCONFIGDIR{{"${CONFIG_DIR"}} x-. "-$PWD/../configs/single-node}" .-x gnnanalytics
VOUTPUTDIRdatanodeedgedata x-. /workspace/data/node_edge_data .-x xgbtraining[xgb-training]
VOUTPUTDIRxgbtrainingmodels{{"${OUTPUT_DIR}/xgb-training/models"}} x-. /workspace/tmp/models .-x xgbtraining
VOUTPUTDIRxgbtraininglogs{{"${OUTPUT_DIR}/xgb-training/logs"}} x-. /workspace/tmp/logs .-x xgbtraining
VCONFIGDIR x-. "-$PWD/../configs/single-node}" .-x xgbtraining
xgbtraining --> gnnanalytics
classDef volumes fill:#0f544e,stroke:#23968b
class VOUTPUTDIRdataedgedata,VOUTPUTDIRdatanodeedgedata,VOUTPUTDIRgnncheckpoint,VCONFIGDIR,VOUTPUTDIRdatanodeedgedata,VOUTPUTDIRxgbtrainingmodels,VOUTPUTDIRxgbtraininglogs,VCONFIGDIR volumes
To see the improvement over the baseline training, you can run the xgb-training workflow. Before running xgb-training, the preprocess workflow must complete successfully.
The xgb-training workflow consumes the CSV file generated by preprocess above, runs the gnn-analytics pipeline to generate optimized features, and trains an XGBoost model using these features. It will also print out AUCPR (area under the precision-recall curve) results to the console.
Note: as this step runs GNN training first, expect no output for a while. Once GNN training finishes, output from XGBoost training will start to appear. You can also check the GNN log using the following command.
docker compose logs gnn-analytics -f
Run the xgb-training container with the command below.
docker compose run xgb-training 2>&1 | tee xgb-training.log
This command runs the gnn-analytics workflow implicitly to generate the node features first, then uses the edge features generated in Step 2 to train the XGBoost model, and prints AUCPR (area under the precision-recall curve) results to the console.
Note: This step runs the GNN training first, which can take several hours to finish.
The table below shows some of the environment variables you can control according to your needs.
Environment Variable Name | Default Value | Description |
---|---|---|
CONFIG_DIR | ${WORKSPACE}/cc-scanner/configs | Configurations directory |
OUTPUT_DIR | ${WORKSPACE}/cc-scanner/docker/output | Logfile and Checkpoint output |
Run these commands to check the preprocess, baseline-training, and xgb-training logs:
cat preprocess.log
cat baseline-training.log
cat xgb-training.log
You can also check GNN log using the following commands.
docker compose logs gnn-analytics -f
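To compare runs without scrolling through the full logs, the final test-set AUCPR can be pulled out of each log file. The sketch below assumes the log lines have the form shown in the expected output later in this document ("aucpr on test set is ..." and "aucpr of the best configs on test set is ..."); the helper name `extract_test_aucpr` is hypothetical.

```python
import re
from typing import List

# Matches both "aucpr on test set is X" and "aucpr of the best configs on test set is X".
TEST_AUCPR = re.compile(r"aucpr .*?on test set is ([0-9]+(?:\.[0-9]+)?)")


def extract_test_aucpr(log_text: str) -> List[float]:
    """Return every test-set AUCPR value reported in the given log text."""
    return [float(m.group(1)) for m in TEST_AUCPR.finditer(log_text)]


sample_log = """start xgboost model testing...
testing results: aucpr on test set is 0.8758093307052992
xgboost model is saved under /workspace/tmp/models.
"""
print(extract_test_aucpr(sample_log))  # [0.8758093307052992]
```

To use it on a real run, read the file first, for example `extract_test_aucpr(open("baseline-training.log").read())`.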
Run the following command to stop all services and containers created by docker compose and remove them.
docker compose down
To learn more, see the instructions for installing Anaconda on Linux.
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
To run Notebook.ipynb, a conda environment must be created:
conda create -n ju_fraud_detection -c conda-forge jupyterlab nb_conda_kernels python=3.9 -y
conda activate ju_fraud_detection
Follow the steps in the Getting Started section before continuing. Run the following command inside the project root directory. WORKSPACE and DATASET_DIR must be set in the same terminal that will run Jupyter Lab.
jupyter lab
Open Jupyter Lab in a web browser, select Notebook.ipynb and select conda env:ju_fraud_detection as the Jupyter kernel. Now you can follow the notebook's instructions step by step.
- Install Helm
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 && \
chmod 700 get_helm.sh && \
./get_helm.sh
- Install Argo Workflows and Argo CLI
- Configure your Artifact Repository
- Ensure that your dataset and config files are present in your chosen artifact repository.
Please refer to the document to fill out the values.yaml and then install the template to run workflow.
export NAMESPACE=argo
helm install --namespace ${NAMESPACE} --set proxy=${http_proxy} fraud-detection ./chart
argo submit --from wftmpl/fraud-detection --namespace=${NAMESPACE}
To view your workflow progress:
argo logs @latest -f
Follow these instructions to set up and run this distributed pipeline on your own development system. To run with Docker on single node, refer to Run with Docker on single node.
- Password-less ssh needs to be set up on all the nodes that you are using.
- Classical ML workflow requires code, configs and data directory to reside locally.
- GNN workflow requires the data directory and code repository to reside on Network File System (NFS).
- The nodes should be connected with a high-speed network to enjoy speedups.
- The nodes should have at least 384 GB of RAM, 200 GB free space on localdisk and 50 GB of free space on NFS.
- We assume that the work folder exists at the same location on the localdisk of all nodes (e.g. /mtw/local-work is the location of the work folder on all nodes).
You'll need to install Docker Engine on your development system. Note that while Docker Engine is free to use, Docker Desktop may require you to purchase a license. See the Docker Engine Server installation instructions for details. To build and run this workload inside a Docker Container, ensure you have Docker Compose installed on your machine. If you don't have this tool installed please consult official Docker Compose installation documentation.
# on master node
DOCKER_CONFIG=${DOCKER_CONFIG:-$HOME/.docker}
mkdir -p $DOCKER_CONFIG/cli-plugins
curl -SL https://github.com/docker/compose/releases/download/v2.7.0/docker-compose-linux-x86_64 -o $DOCKER_CONFIG/cli-plugins/docker-compose
chmod +x $DOCKER_CONFIG/cli-plugins/docker-compose
docker compose version
Ensure you have downloaded the dataset as described in Prepare data directory and set the path to the dataset as described below.
export DATASET_DIR=${LOCAL_DIR}/data
# on master node
cd ${LOCAL_DIR}/cc-scanner/docker
docker compose build
# on master node
docker save -o wf-image.tar SelimWaly/ai-workflows:pa-fraud-detection-classical-ml
# on master node
# Note : provide IP address to the worker node
export WORKER_IP=<worker-node-ip>
# Note : provide path to work directory on localdisk of worker node
scp wf-image.tar $WORKER_IP:${LOCAL_DIR}
# from master node, ssh into worker node
ssh $WORKER_IP
docker load -i ${LOCAL_DIR}/wf-image.tar
We allow the users to bring their own dataset, create their own preprocessing engine, build custom graph dataset from processed csv files as well as construct their own XGBoost and GraphSage model. Please find more information in How to customize this use case.
- On master node, prepare
${LOCAL_DIR}/cc-scanner/configs/distributed/workflow-data-preprocessing.yaml
to reflect desired IP addresses and absolute path to ${LOCAL_DIR}.
env:
  num_node: 2
  node_ips: #the first item in the ip list is the master ip, pls make sure that the ip doesn't contain space in the end
    - IP1
    - IP2
  # NOTE : please provide absolute path to ${LOCAL_DIR}
  tmp_path: <path-to-work-dir-on-localdisk>/ml_tmp
  data_path: <path-to-work-dir-on-localdisk>/data
  config_path: <path-to-work-dir-on-localdisk>/cc-scanner/configs/distributed
- Pass the workflow config yaml to the Classical ML workflow container and launch the workflow container from master node with the command below.
# on master node
cd ${LOCAL_DIR}/cc-scanner/classical-ml
./run-workflow.sh ${LOCAL_DIR}/cc-scanner/configs/distributed/workflow-data-preprocessing.yaml
The Classical ML workflow saves the processed data inside the ${LOCAL_DIR}/data/edge_data folder on the local disk of the master node. After a successful run, ${LOCAL_DIR}/data/edge_data/processed_data.csv should have (24198836, 26) shape.
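The shape check above can be done without loading the whole file into memory. The sketch below streams the CSV with the standard library; it assumes the file has a single header row and that the reported shape counts data rows only (as a dataframe shape would), which this document does not state explicitly.

```python
import csv
from typing import Tuple


def csv_shape(path: str) -> Tuple[int, int]:
    """Return (number of data rows, number of columns) of a CSV with one header row."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)             # first line holds the column names
        n_rows = sum(1 for _ in reader)   # stream the rest, one row at a time
    return n_rows, len(header)


if __name__ == "__main__":
    import sys
    # Usage: python csv_shape.py <path-to-processed_data.csv>
    if len(sys.argv) > 1:
        print(csv_shape(sys.argv[1]))
```

On the full dataset this takes a while, since every one of the roughly 24 million rows is read once.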
- On master node, prepare
${LOCAL_DIR}/cc-scanner/configs/distributed/workflow-baseline.yaml
to reflect desired IP addresses and absolute path to ${LOCAL_DIR}.
env:
  num_node: 2
  node_ips: #the first item in the ip list is the master ip, pls make sure that the ip doesn't contain space in the end
    - IP1
    - IP2
  # NOTE : please provide absolute path to ${LOCAL_DIR}
  tmp_path: <path-to-work-dir-on-localdisk>/ml_tmp
  data_path: <path-to-work-dir-on-localdisk>/data
  config_path: <path-to-work-dir-on-localdisk>/cc-scanner/configs/distributed
- Pass the workflow config yaml to the Classical ML workflow container and run the workflow container with the command below.
# on master node
cd ${LOCAL_DIR}/cc-scanner/classical-ml
./run-workflow.sh ${LOCAL_DIR}/cc-scanner/configs/distributed/workflow-baseline.yaml
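The workflow configs above carry a few easy-to-miss requirements: the IP list must match num_node, IPs must not end in a space, and paths must be absolute. A small pre-flight check can catch these before launching. This sketch assumes the `env` section has already been loaded into a Python dict (for example with a YAML parser); the function name `check_env_config` and the sample values are hypothetical.

```python
from typing import Dict, List


def check_env_config(env: Dict) -> List[str]:
    """Return a list of human-readable problems found in the env section."""
    problems = []
    ips = env.get("node_ips", [])
    if env.get("num_node") != len(ips):
        problems.append(f"num_node is {env.get('num_node')} but {len(ips)} IPs are listed")
    for ip in ips:
        if ip != ip.strip():
            problems.append(f"IP {ip!r} has leading or trailing whitespace")
    for key in ("tmp_path", "data_path", "config_path"):
        value = env.get(key, "")
        if not value.startswith("/"):
            problems.append(f"{key} should be an absolute path, got {value!r}")
    return problems


# Illustrative values only: the second IP has a trailing space and config_path is relative.
env = {
    "num_node": 2,
    "node_ips": ["10.0.0.1", "10.0.0.2 "],
    "tmp_path": "/mtw/local-work/ml_tmp",
    "data_path": "/mtw/local-work/data",
    "config_path": "cc-scanner/configs/distributed",
}
for problem in check_env_config(env):
    print(problem)
```

An empty result means the three requirements checked here are met; it does not validate that the nodes are reachable or that the paths exist.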
- Copy the processed data from localdisk of master node to NFS.
# on master node
scp ${LOCAL_DIR}/data/edge_data/processed_data.csv ${NFS_DIR}/data/edge_data/
- On master node, prepare ${NFS_DIR}/cc-scanner/configs/distributed/workflow-gnn-training.yaml to reflect desired IP addresses and absolute path to ${NFS_DIR}.
env:
  num_node: 2
  node_ips:
    - IP1
    - IP2
  # NOTE : please provide absolute path to ${NFS_DIR}
  # tmp_path used to save model, embeddings, partitions...
  tmp_path: <path-to-work-dir-on-NFS>/gnn_tmp
  # data_path should contain processed_data.csv
  data_path: <path-to-work-dir-on-NFS>/data/edge_data
  # in_data_filename is the name of input csv file
  in_data_filename: processed_data.csv
  # out_path will contain the output csv with the tabular data and new node embeddings
  out_path: <path-to-work-dir-on-NFS>/data/node_edge_data
  # config_path will contain all three configs required by GNN workflow
  config_path: <path-to-work-dir-on-NFS>/cc-scanner/configs/distributed
- Pass the workflow config yaml to the GNN workflow container and run the workflow container with the command below.
# on master node
cd ${NFS_DIR}/cc-scanner/gnn-analytics
./run-workflow.sh ${NFS_DIR}/cc-scanner/configs/distributed/workflow-gnn-training.yaml
The distributed GNN workflow will save GNN-boosted features to the ${NFS_DIR}/data/node_edge_data/ folder on NFS. After a successful run, ${NFS_DIR}/data/node_edge_data/tabular_with_gnn_emb.csv should have (24198836, 154) shape. Otherwise, please try the following troubleshooting steps.
- Please ensure a clean working environment, e.g., kill zombie processes with the pkill -9 <keyword> command. The keyword can be "python" so as to kill all the zombie processes running with python.
- For further information, please review the troubleshooting section for the GNN workflow.
- Copy GNN-boosted data from NFS to localdisk on master node.
# on master node
scp ${NFS_DIR}/data/node_edge_data/tabular_with_gnn_emb.csv ${LOCAL_DIR}/data/node_edge_data/
- On master node, prepare ${LOCAL_DIR}/cc-scanner/configs/distributed/workflow-xgb-training.yaml to reflect desired IP addresses and absolute path to work dir on localdisk.
env:
  num_node: 2
  node_ips: #the first item in the ip list is the master ip, pls make sure that the ip doesn't contain space in the end
    - IP1
    - IP2
  # NOTE : please provide absolute path to work dir on localdisk
  tmp_path: <path-to-work-dir-in-localdisk>/ml_tmp
  data_path: <path-to-work-dir-in-localdisk>/data
  config_path: <path-to-work-dir-in-localdisk>/cc-scanner/configs/distributed
- Pass the workflow config yaml to the Classical ML workflow container and run the workflow container with the command below.
# on master node in localdisk
cd ${LOCAL_DIR}/cc-scanner/classical-ml
./run-workflow.sh ${LOCAL_DIR}/cc-scanner/configs/distributed/workflow-xgb-training.yaml
We use Area Under the Precision-Recall Curve (AUCPR) as the evaluation metric (it lies between 0 and 1; higher is better).
You should expect to see a boost in AUCPR for test split by using GNN-boosted features.
Data split | Number of examples | AUCPR - Edge features only | AUCPR - GNN-boosted features |
---|---|---|---|
Train (year < 2018) | 20,604,847 | 0.92 | 0.98 |
Val (year = 2018) | 1,689,822 | 0.91 | 0.93 |
Test (year > 2018) | 1,904,167 | 0.88 | 0.94 |
You should expect to see a boost in AUCPR for test split by using GNN-boosted features.
Data split | Number of examples | AUCPR - Edge features only | AUCPR - GNN-boosted features |
---|---|---|---|
Train (year < 2018) | 20,604,847 | 0.95 | 0.96 |
Val (year = 2018) | 1,689,822 | 0.91 | 0.93 |
Test (year > 2018) | 1,904,167 | 0.88 | 0.94 |
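To make the AUCPR numbers in the tables above more concrete, the sketch below computes average precision, one common estimator of the area under the precision-recall curve. It is an illustration, not the workflow's implementation: XGBoost's built-in `aucpr` metric may interpolate the curve differently and can give slightly different values.

```python
from typing import List


def average_precision(labels: List[int], scores: List[float]) -> float:
    """Step-wise area under the precision-recall curve (labels are 0 or 1)."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive (fraud) label is required")
    # Rank examples from most to least suspicious; tied scores are not handled specially.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_pos += 1
            total += true_pos / rank   # precision at the point where recall increases
    return total / n_pos


scores = [0.9, 0.8, 0.7, 0.2, 0.1]
print(average_precision([1, 1, 0, 0, 0], scores))  # 1.0, both frauds ranked first
print(round(average_precision([1, 0, 1, 0, 0], scores), 4))  # 0.8333
```

In the second call a legitimate transaction is ranked above one of the frauds, so precision at the second fraud drops to 2/3 and the score falls to (1 + 2/3) / 2. The metric only rewards ranking frauds ahead of non-frauds, so the large number of easy negatives in an imbalanced dataset does not inflate it.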
You will see logs that look similar to the ones below once you run the use case successfully. Please note that the timing numbers depend on the hardware systems.
Preprocessing dataset
Failed to read model training configurations. This is either due to wrong parameters defined in the config file as shown: 'training'
Or there is no need for model training.
enter single-node mode...
reading data...
(24386900, 15)
preparing data...
engineering features...
splitting data...
encoding features...
saving data...
data saved under the path /fraud-detection/data/edge_data/processed_data.csv
Our workflow runs hyperparameter optimization by default and prints the best trial's hyperparameters.
Failed to read data preprocessing steps. This is either due to wrong parameters defined in the config file as shown: 'data_preprocess' or there is no need for data preprocessing.
no need for training
Failed to read end2end training configurations. This is either due to wrong parameters defined in the config file as shown: 'end2end_training' or there is no need for End-to-End training.
enter single-node mode...
reading training data...
reading without dropping columns...
data has the shape (24198836, 26)
start training models soon...
(24198836, 22)
read and prepare data for training...
start xgboost HPO...
[0] train-aucpr:0.39670 eval-aucpr:0.26043 test-aucpr:0.26432
[999] train-aucpr:0.98697 eval-aucpr:0.87243 test-aucpr:0.81485
Best trial: 0. Best value: 0.872425: 10%| 1/10 [22:23<3:21:33, 1343.71s/it][I 2023-04-26 23:19:42,948]
Trial 0 finished with value: 0.8724252645144079 and parameters: {'eta': 0.15, 'max_depth': 7, 'subsample': 0.5580179751710447, 'colsample_bytree': 0.8829791679963901, 'lambda': 0.6988028968743126, 'alpha': 0.9281407432733514, 'min_child_weight': 2}. Best is trial 0 with value: 0.8724252645144079.
...
[0] train-aucpr:0.20233 eval-aucpr:0.11631 test-aucpr:0.11132
[999] train-aucpr:0.93912 eval-aucpr:0.88756 test-aucpr:0.81800
Best trial: 7. Best value: 0.91189: 100%| 10/10 [3:32:04<00:00, 1272.49s/it]
Trial 9 finished with value: 0.8875619161413042 and parameters: {'eta': 0.17, 'max_depth': 5, 'subsample': 0.8748137912212397, 'colsample_bytree': 0.9733703420536219, 'lambda': 0.565297835149586, 'alpha': 0.43564481669099264, 'min_child_weight': 2}.
Best is trial 7 with value: 0.9118896651018639.
Value: 0.9118896651018639
Params:
eta: 0.1
max_depth: 9
subsample: 0.6372646065696512
colsample_bytree: 0.5940969469756718
lambda: 0.023810403412340413
alpha: 0.4955030354495986
min_child_weight: 9
aucpr of the best configs on test set is 0.8779567630966963
We saved our best model's hyper-parameters in $WORKSPACE/cc-scanner/configs/single-node/baseline-xgb-training.yaml. Simply comment out the hpo_spec section and uncomment the model_spec section to reproduce our results.
Failed to read data preprocessing steps. This is either due to wrong parameters defined in the config file as shown: 'data_preprocess' or there is no need for data preprocessing.
no need for HPO
Failed to read end2end training configurations. This is either due to wrong parameters defined in the config file as shown: 'end2end_training' or there is no need for End-to-End training.
enter single-node mode...
reading training data...
reading without dropping columns...
data has the shape (24198836, 26)
start training models soon...
(24198836, 22)
read and prepare data for training...
start xgboost model training...
[0] train-aucpr:0.41938 eval-aucpr:0.41885 test-aucpr:0.36533
[100] train-aucpr:0.78485 eval-aucpr:0.90691 test-aucpr:0.89396
[200] train-aucpr:0.81402 eval-aucpr:0.92188 test-aucpr:0.89991
[300] train-aucpr:0.84193 eval-aucpr:0.92164 test-aucpr:0.89123
[400] train-aucpr:0.86475 eval-aucpr:0.91879 test-aucpr:0.88308
[500] train-aucpr:0.87911 eval-aucpr:0.91601 test-aucpr:0.87810
[600] train-aucpr:0.88980 eval-aucpr:0.91699 test-aucpr:0.87880
[700] train-aucpr:0.89974 eval-aucpr:0.91588 test-aucpr:0.88025
[800] train-aucpr:0.90777 eval-aucpr:0.91510 test-aucpr:0.87856
[900] train-aucpr:0.91517 eval-aucpr:0.91347 test-aucpr:0.87711
[999] train-aucpr:0.92204 eval-aucpr:0.91267 test-aucpr:0.87582
start xgboost model testing...
testing results: aucpr on test set is 0.8758093307052992
xgboost model is saved under /workspace/tmp/models.
Our workflow runs hyperparameter optimization by default and prints the best trial's hyperparameters.
Creating Container docker-gnn-analytics-1
Created Container docker-gnn-analytics-1
Starting Container docker-gnn-analytics-1
Failed to read data preprocessing steps. This is either due to wrong parameters defined in the config file as shown: 'data_preprocess' or there is no need for data preprocessing.
no need for training
Failed to read end2end training configurations. This is either due to wrong parameters defined in the config file as shown: 'end2end_training' or there is no need for End-to-End training.
enter single-node mode...
reading training data...
reading without dropping columns...
data has the shape (24198836, 154)
start training models soon...
(24198836, 150)
read and prepare data for training...
start xgboost HPO...
[0] train-aucpr:0.19283 eval-aucpr:0.03559 test-aucpr:0.02577
[999] train-aucpr:0.94984 eval-aucpr:0.89502 test-aucpr:0.86376
Best trial: 0. Best value: 0.89502: 10%| 1/10 [17:27<2:37:05, 1047.28s/it] [I 2023-04-27 21:51:02,449]
Trial 0 finished with value: 0.8950199939942034 and parameters: {'eta': 0.2, 'max_depth': 4, 'subsample': 0.581804256586296, 'colsample_bytree': 0.3397014497235425, 'lambda': 0.4620219935728601, 'alpha': 0.8976349208109945, 'min_child_weight': 3}. Best is trial 0 with value: 0.8950199939942034.
...
Best trial: 5. Best value: 0.930132: 100%| 10/10 [2:48:54<00:00, 1013.50s/it] [I 2023-04-28 00:22:30,154]
Trial 9 finished with value: 0.9054094235120606 and parameters: {'eta': 0.1, 'max_depth': 5, 'subsample': 0.9740220576298498, 'colsample_bytree': 0.5915481518062535, 'lambda': 0.7854898534100563, 'alpha': 0.6962425381327314, 'min_child_weight': 3}.
Best is trial 5 with value: 0.930131689676218.
Value: 0.930131689676218
Params:
eta: 0.01
max_depth: 6
subsample: 0.8714669008983891
colsample_bytree: 0.8825897760478416
lambda: 0.4397103901613584
alpha: 0.8475402634335466
min_child_weight: 8
aucpr of the best configs on test set is 0.9224617296126916
We saved our best model's hyper-parameters in $WORKSPACE/cc-scanner/configs/single-node/xgb-training.yaml. Simply comment out the hpo_spec section and uncomment the model_spec section to reproduce our results.
Failed to read data preprocessing steps. This is either due to wrong parameters defined in the config file as shown: 'data_preprocess' or there is no need for data preprocessing.
no need for HPO
Failed to read end2end training configurations. This is either due to wrong parameters defined in the config file as shown: 'end2end_training' or there is no need for End-to-End training.
enter single-node mode...
reading training data...
reading without dropping columns...
data has the shape (24198836, 154)
start training models soon...
(24198836, 150)
read and prepare data for training...
start xgboost model training...
[0] train-aucpr:0.48145 eval-aucpr:0.23717 test-aucpr:0.21672
[100] train-aucpr:0.80701 eval-aucpr:0.84678 test-aucpr:0.83616
[200] train-aucpr:0.85851 eval-aucpr:0.93348 test-aucpr:0.95080
[300] train-aucpr:0.88907 eval-aucpr:0.93487 test-aucpr:0.95312
[400] train-aucpr:0.91095 eval-aucpr:0.93121 test-aucpr:0.95203
[500] train-aucpr:0.92976 eval-aucpr:0.93077 test-aucpr:0.95142
[600] train-aucpr:0.94440 eval-aucpr:0.92910 test-aucpr:0.94902
[700] train-aucpr:0.95611 eval-aucpr:0.92940 test-aucpr:0.94795
[800] train-aucpr:0.96405 eval-aucpr:0.92913 test-aucpr:0.94606
[900] train-aucpr:0.97143 eval-aucpr:0.92884 test-aucpr:0.94424
[999] train-aucpr:0.97721 eval-aucpr:0.92824 test-aucpr:0.94275
start xgboost model testing...
testing results: aucpr on test set is 0.9427465838947949
xgboost model is saved under /workspace/tmp/models.
The steps above demonstrate an accuracy boost from using GNNs, an efficiency boost from distributed pipelines, and a no-code user experience through configs.
How to Customize this Workflow/Use Case
To customize our reference kit to support your needs, you can:
- Bring your own dataset: You can add your own source data to data/raw_data.
- Bring your own preprocessor: You can create your own preprocessor (i.e. edge featurizer) by editing data-preprocessing.yaml. More information on how to write the config yaml file can be found in the Classical ML workflow GitHub repository.
- Bring your own baseline: You can set parameters of your baseline model such as learning_rate, eval_metric, num_boost_round, verbose_eval and so on in baseline-xgb-training.yaml.
- Bring your own graph: You can create your own graph by defining node columns and edge types in tabular2graph.yaml.
- Bring your own GNN: You can set parameters of your GraphSage model such as learning_rate, fan_out, epochs, eval_every and so on in gnn-training.yaml.
- Bring your own XGB model: You can set parameters of your final XGB model such as learning_rate, eval_metric, num_boost_round, verbose_eval and so on in xgb-training.yaml.
Make sure to edit your configs in the /configs/single-node folder if you're using the single-node setting and in /configs/distributed if you're using the distributed setting.
To learn more about our workflows, please refer to -
To troubleshoot, please submit a GitHub issue.
These materials are intended to assist designers who are developing applications within their scope. These materials do not purport to provide all of the requirements for a commercial, productions, or other solution. Any commercial or productions use of solution based on or derived from these materials is beyond their scope. You are solely responsible for the engineering, testing, safety, qualification, validation, and applicable approvals for any solution you build or use based on these materials. Intel bears no responsibility or liability for such use. You are solely responsible for using your independent analysis, evaluation and judgment in designing your applications and have full and exclusive responsibility to assure the safety of your applications and compliance of your applications with all applicable regulations, laws and other applicable requirements. You further understand that you are solely responsible for obtaining any licenses to third-party intellectual property rights that may be necessary for your applications or the use of these materials.
To the extent that any public or non-Intel datasets or models are referenced by or accessed using these materials those datasets or models are provided by the third party indicated as the content source. Intel does not create the content and does not warrant its accuracy or quality. By accessing the public content, or using materials trained on or with such content, you agree to the terms associated with that content and that your use complies with the applicable license.
Intel expressly disclaims the accuracy, adequacy, or completeness of any such public content, and is not liable for any errors, omissions, or defects in the content, or for any reliance on the content. Intel is not liable for any liability or damages relating to your use of public content.
Intel’s provision of these resources does not expand or otherwise alter Intel’s applicable published warranties or warranty disclaimers for Intel products or solutions, and no additional obligations, indemnifications, or liabilities arise from Intel providing such resources. Intel reserves the right, without notice, to make corrections, enhancements, improvements, and other changes to its materials.
Intel technologies may require enabled hardware, software or service activation. Performance varies by use, configuration and other factors. No product or component can be absolutely secure.
Intel is committed to respecting human rights and avoiding complicity in human rights abuses. See Intel's Global Human Rights Principles. Intel's content is intended only to be used in applications that do not cause or contribute to a violation of an internationally recognized human right.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names