This repository contains code for classifying clients based on their demographic features, history of interaction with a company, purchases, and transactions.
The classification problem is solved with a model that combines a fully connected ResNet with a Gaussian process. The model can estimate the uncertainty of its predictions, so predictions with high uncertainty can be excluded to achieve more reliable results. Methods for filtering noisy (mislabeled) examples from the dataset are also proposed.
`feature_generation_for_sequences`
contains code to create aggregated features from clients' historical sequential data (transactions); in other words, it converts sequences into tabular data.
Aggregations used:
- min
- max
- std
- mean
- one-hot encoding (OHE) for categorical features
- difference between consecutive temporal samples

The set of aggregation functions can be extended.
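The aggregation step can be sketched in plain Python. This is a hypothetical helper for illustration only; the repository's actual function names and data layout may differ:

```python
from statistics import mean, stdev

def aggregate_sequence(values):
    """Aggregate one client's sequence of transaction amounts into
    tabular features (illustrative helper, not the repository's API)."""
    # Differences between consecutive temporal samples
    diffs = [b - a for a, b in zip(values, values[1:])]
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "std": stdev(values),
        "mean_diff": mean(diffs),
    }

features = aggregate_sequence([10.0, 12.0, 11.0, 15.0])
```

For categorical columns, the same pattern applies with one-hot encoding instead of numeric statistics.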
`preprocessor`
contains code to create embeddings from the aggregated features via the PCA method; it can be applied to tabular data.
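The PCA embedding step can be sketched with NumPy. This is a minimal illustration under the assumption of plain SVD-based PCA; the repository's preprocessor presumably wraps its own configured implementation:

```python
import numpy as np

def pca_embed(X, n_components):
    """Project rows of X onto the top principal components
    (illustrative sketch, not the repository's API)."""
    Xc = X - X.mean(axis=0)                # center each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Rows of Vt are principal directions, sorted by explained variance
    return Xc @ Vt[:n_components].T        # shape (n_samples, n_components)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))             # hypothetical tabular features
emb = pca_embed(X, n_components=3)
```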
`srcprcskr`
contains code to create the classification model and to filter the dataset.
`notebooks`
contains Jupyter notebooks for demonstration purposes.
```sh
git clone https://gitlab.appliedai.tech/priceseekers/core/priceseekers
cd priceseekers
pip3 install -r requirements.txt
pip3 install .
```
A quick-start notebook can be found in `/examples/rosbank_demonstration.ipynb`.
The main class for solving classification problems and filtering noisy examples from the dataset.
```python
classifier = FilteredClassifier(
    run_name="user_run_name",
    log_dir="./logs",
    ckpt_dir="./ckpt_dir",
    path_to_dataconf="sber_ps/configs/data/data_config.yml",
    path_to_modelconf="sber_ps/configs/models/modeldue_config.yml"
)
```
- `run_name` (str) is used to distinguish classifier runs.
- `path_to_dataconf` (str) is the path to a `.yml` data configuration file.
- `path_to_modelconf` (str) is the path to a `.yml` model configuration file.
- `log_dir` (str) is the path to a directory where training logs will be saved. The logs are written to a subdirectory created inside `log_dir`.
- `ckpt_dir` (str) is the path to a directory where checkpoints of state dictionaries (e.g. the model) will be saved during training. The checkpoints are written to a subdirectory created inside `ckpt_dir`.
A method to train a model for solving the classification problem. The implementation follows the Deterministic Uncertainty Estimation (DUE) model, which was proposed in
*On Feature Collapse and Deep Kernel Learning for Single Forward Pass Uncertainty* (https://github.com/y0ast/DUE/tree/main).
The model combines a fully connected ResNet with a Gaussian process. The fully connected ResNet consists of residual feedforward layers with ReLU activation functions; as regularization, spectral normalization and dropout are applied to each layer. Combining residual connections with spectral normalization makes the fully connected network both smooth and sensitive. Smoothness (or stability) means that small changes in the input cannot cause massive shifts in the output; sensitivity means that when the input changes, the feature representation also changes. These properties prevent feature collapse, in which a neural network maps different input features to nearby vectors, or maps the same features to far-apart vectors. The Gaussian process classifies the vectors produced by the fully connected ResNet against the given labels. The main advantage of the Gaussian process is that it estimates the uncertainty of its predictions; predictions with high uncertainty can therefore be excluded to achieve more reliable results.
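The spectral-normalization idea behind smoothness can be illustrated with a pure-Python power iteration: dividing a weight matrix by its largest singular value bounds the layer's Lipschitz constant by 1. This is only a sketch of the concept; DUE itself relies on a deep-learning framework's spectral-norm implementation, and this helper is not part of the repository:

```python
import math
import random

random.seed(0)

def spectral_norm(W, n_iter=200):
    """Estimate the largest singular value of matrix W (a list of rows)
    by power iteration on W^T W (illustrative sketch)."""
    n = len(W[0])
    v = [random.random() + 0.1 for _ in range(n)]
    for _ in range(n_iter):
        u = [sum(w * x for w, x in zip(row, v)) for row in W]        # u = W v
        v = [sum(W[i][j] * u[i] for i in range(len(W))) for j in range(n)]  # v = W^T u
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    u = [sum(w * x for w, x in zip(row, v)) for row in W]
    return math.sqrt(sum(x * x for x in u))

def normalize(W):
    """Rescale W so its spectral norm is 1."""
    s = spectral_norm(W)
    return [[x / s for x in row] for row in W]

W = [[3.0, 0.0], [0.0, 1.0]]   # largest singular value is 3
sigma = spectral_norm(W)
W_norm = normalize(W)
```

After normalization the matrix cannot stretch any input direction by more than a factor of 1, which is the "small input change, small output change" property described above.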
```python
classifier.fit(
    features_path='data/embeddings/pca_embeddings.parquet',
    targets_path='data/targets/labels.parquet',
    split_frac_train_val=0.8,
    random_state=None,
    total_epochs=None,
    lr=None,
    path_to_examples_to_be_excluded=None,
    is_forgetting=False,
    metrics_on_train=False,
    ckpt_resume=None
)
```
- `features_path` (str) is the path to a file with features.
- `targets_path` (str) is the path to a file with true labels. The features and the true label of a given example are assumed to share the same index (name).
- `split_frac_train_val` (float) is the fraction of the full dataset used as the training part. The value 1.0 specifies that the full dataset will be used for training.
- `random_state` (int) provides reproducibility of computations. If it is `None`, the value of the field `random_state` from the data configuration file is used.
- `total_epochs` (int) is the number of epochs for model training. If it is `None`, the value of the field `total_epochs` from the model configuration file is used.
- `lr` (float) is the learning rate of the optimizer. If it is `None`, the value of the field `lr` from the model configuration file is used.
- `path_to_examples_to_be_excluded` (str) is the path to a `.txt` file with names of examples to be excluded from the original dataset during training.
- `is_forgetting` (bool) indicates that the masks required for computing forgetting counts of examples will be collected during training and saved in checkpoint files.
- `metrics_on_train` (bool) indicates that metrics will also be computed on the training dataset.
- `ckpt_resume` (str) is the path to a `*.ckpt` checkpoint file used to load the model. It should be `None` to train a model from an initial state.
An implementation of a method that finds noisy (mislabeled) examples in the dataset by counting how often examples are forgotten during training, referred to as the forgetting method. It was first proposed in
*An Empirical Study of Example Forgetting during Deep Neural Network Learning* (https://github.com/mtoneva/example_forgetting/tree/master).
According to the paper, noisy examples are frequently forgotten by a model during training or remain unlearned. Therefore, to find noisy examples, the following algorithm was implemented.
1. A model is trained on the full dataset, and forgetting masks of the examples are saved. The masks are formed by comparing the model's predictions at each epoch with the true labels.
2. After training, the number of epochs at which each example was forgotten is counted. The forgetting count of unlearned examples is set equal to `total_epochs`.
3. The file names of the unlearned examples are saved to a `.txt` file and can be used to exclude them from the dataset in subsequent trainings. By varying `threshold_val`, examples with high forgetting counts can also be excluded.
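The counting in the steps above can be sketched as follows. This is a hypothetical helper operating on per-epoch correctness masks; the repository's actual data structures may differ:

```python
def forgetting_counts(correct_masks):
    """Count forgetting events per example from per-epoch correctness masks.
    correct_masks[e][i] is True if example i was classified correctly at
    epoch e. A forgetting event is a transition from correct to incorrect
    between consecutive epochs. Unlearned examples (never classified
    correctly) get a count equal to the number of epochs."""
    total_epochs = len(correct_masks)
    n_examples = len(correct_masks[0])
    counts = []
    for i in range(n_examples):
        history = [mask[i] for mask in correct_masks]
        if not any(history):          # never learned -> maximal count
            counts.append(total_epochs)
        else:
            counts.append(sum(1 for prev, cur in zip(history, history[1:])
                              if prev and not cur))
    return counts

masks = [
    [True,  False, False],   # epoch 0
    [False, True,  False],   # epoch 1
    [True,  True,  False],   # epoch 2
]
counts = forgetting_counts(masks)
```

Here example 0 is forgotten once, example 1 never, and example 2 is unlearned, so its count equals the number of epochs.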
```python
df_examples_info = classifier.filtration_by_forgetting(
    features_path='data/embeddings/pca_embeddings.parquet',
    targets_path='data/targets/labels.parquet',
    example_forgetting_dir=None,
    threshold_val=None,
    random_state=None,
    total_epochs=None,
    lr=None,
    verbose=True,
    ckpt_resume=None,
    path_to_examples_to_be_excluded=None
)
```
- `features_path` (str) is the path to a file with features.
- `targets_path` (str) is the path to a file with true labels. The features and the true label of a given example are assumed to share the same index (name).
- `example_forgetting_dir` (str) is the path to a directory used to save the array of noisy examples. If it is `None`, a directory named `f"{data_name}_forgetting"` is created in the parent of the directory `data_filepath`. The field `data_name` is provided by the data configuration file.
- `threshold_val` (int) is the threshold value for `forgetting_counts`, which can be used to filter examples. If it is `None`, only unlearned examples are proposed for exclusion from the dataset.
- `random_state` (int) provides reproducibility of computations. If it is `None`, the value of the field `random_state` from the data configuration file is used.
- `total_epochs` (int) is the number of epochs for model training. If it is `None`, the value of the field `total_epochs` from the model configuration file is used.
- `lr` (float) is the learning rate of the optimizer. If it is `None`, the value of the field `lr` from the model configuration file is used.
- `verbose` (bool) indicates that a pd.DataFrame with forgetting counts for the examples will be returned.
- `ckpt_resume` (str) is the path to a `*.ckpt` checkpoint file used to load the model and the masks collected during training. It should be `None` to train a model from an initial state.
- `path_to_examples_to_be_excluded` (str) is the path to a `.txt` file with names of examples to be excluded from the original dataset during training.
Returns:
- `df_examples_info` (pd.DataFrame) contains forgetting counts for the examples of the dataset and the predictions given by the trained model.
An implementation of a method that finds noisy (mislabeled) examples in the dataset by sequentially training the model on its parts and counting forgotten examples, referred to as the second-split forgetting method. It was proposed in
*Characterizing Datapoints via Second-Split Forgetting* (https://github.com/pratyushmaini/ssft).
One disadvantage of the forgetting method is that the set of unlearned and frequently forgotten examples can include complex examples. Complex examples lie close to the boundary between classes and therefore contribute to model training. To separate noisy examples from complex ones, the following algorithm was implemented.
1. The full dataset is divided into two halves, referred to as the first part and the second part.
2. The model is trained on the first part of the dataset until the loss function or the tracked metric stabilizes (the first training). The model is then trained further on the second part of the dataset (the second training).
3. Examples from the second part of the dataset that were forgotten after one epoch of the second training are marked as noisy.
4. The model is then trained from the initial state on the second part followed by the first part. Examples from the first part that were forgotten after one epoch of training on it are marked as noisy.
5. The file names of the noisy examples are saved to a `.txt` file and can be used to exclude them from the dataset in subsequent trainings. By varying `threshold_val`, examples forgotten after a larger number of epochs of the subsequent training can also be excluded.
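The noisy-marking step of the algorithm above can be sketched as follows. The helper and example names are hypothetical; the repository tracks this via checkpointed masks:

```python
def noisy_after_second_split(correct_before, correct_after_one_epoch):
    """Mark examples that the model classified correctly at the end of
    training on their own split but forgot after a single epoch of
    training on the other split. Inputs map example name -> correctness."""
    return sorted(name for name, ok in correct_before.items()
                  if ok and not correct_after_one_epoch[name])

# Hypothetical correctness of first-split examples before and after
# one epoch of the second training step
before = {"a": True, "b": True, "c": False}
after  = {"a": False, "b": True, "c": False}
noisy = noisy_after_second_split(before, after)
```

Example "a" is marked as noisy because it was learned and then immediately forgotten; "c" was never learned on its own split, so it is not marked by this step.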
```python
df_examples = classifier.filtration_by_second_split_forgetting(
    features_path='data/embeddings/pca_embeddings.parquet',
    targets_path='data/targets/labels.parquet',
    example_forgetting_dir=None,
    threshold_val=None,
    random_state=None,
    total_epochs_per_step=None,
    lr=None,
    verbose=True,
    ckpt_resume=None,
    path_to_examples_to_be_excluded=None
)
```
- `features_path` (str) is the path to a file with features.
- `targets_path` (str) is the path to a file with true labels. The features and the true label of a given example are assumed to share the same index (name).
- `example_forgetting_dir` (str) is the path to a directory used to save the array of noisy examples. If it is `None`, a directory named `f"{data_name}_forgetting"` is created in the parent of the directory `data_filepath`. The field `data_name` is provided by the data configuration file.
- `threshold_val` (int) is the threshold value for `epoch_forget_forever`, which can be used to filter examples. If it is `None`, only examples forgotten after one epoch of the next training step are proposed for exclusion from the dataset.
- `random_state` (int) provides reproducibility of computations. If it is `None`, the value of the field `random_state` from the data configuration file is used.
- `total_epochs_per_step` (int) is the number of epochs for each training step. If it is `None`, the value of the field `total_epochs` from the model configuration file is used.
- `lr` (float) is the learning rate of the optimizer. If it is `None`, the value of the field `lr` from the model configuration file is used.
- `verbose` (bool) indicates that a pd.DataFrame with forgetting counts for the examples will be returned.
- `ckpt_resume` (str) is the path to a `*.ckpt` checkpoint file used to load the model and the masks collected during training. It should be `None` to train a model from an initial state.
- `path_to_examples_to_be_excluded` (str) is the path to a `.txt` file with names of examples to be excluded from the original dataset during training.
Returns:
- `df_examples` (pd.DataFrame) contains, for each example of the dataset, the number of the epoch at which it was forgotten forever, together with the predictions given by the trained model at the second and fourth training steps.
A method to get predictions using a model loaded from a checkpoint file.
```python
preds_proba, uncertainties, file_names, true_labels = classifier.predict(
    features_path='data/embeddings/pca_embeddings.parquet',
    targets_path='data/targets/labels.parquet',
    ckpt_resume='ckpt_dir/data_name_fit/epoch: 0120 - acc_score: 0.7514 - roc_auc_score: 0.749 - loss: 0.3942.ckpt',
    random_state=None,
    path_to_examples_to_be_excluded='data/data_name_second_forgetting/data_name_files_to_be_excluded.txt'
)
```
- `features_path` (str) is the path to a file with features.
- `targets_path` (str) is the path to a file with true labels. If it is omitted, `true_labels` is not returned.
- `ckpt_resume` (str) is the path to a `*.ckpt` checkpoint file used to load the model.
- `random_state` (int) provides reproducibility of computations. If it is `None`, the value of the field `random_state` from the data configuration file is used.
- `path_to_examples_to_be_excluded` (str) is the path to a `.txt` file with names of examples to be excluded from the original dataset.
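As noted earlier, predictions with high uncertainty can be excluded to obtain more reliable results. A minimal sketch of such post-processing on the outputs of `predict` is shown below; the threshold value and the scale of the uncertainties are assumptions, not part of the API:

```python
def filter_confident(preds_proba, uncertainties, threshold):
    """Keep only predictions whose uncertainty is below a threshold
    (illustrative post-processing, not a repository method)."""
    kept = [(p, u) for p, u in zip(preds_proba, uncertainties)
            if u < threshold]
    return [p for p, _ in kept], [u for _, u in kept]

# Hypothetical outputs of classifier.predict(...)
probs, uncs = filter_confident([0.9, 0.55, 0.8],
                               [0.05, 0.4, 0.1],
                               threshold=0.2)
```

The middle prediction is dropped because its uncertainty exceeds the threshold.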