Toolbox to train automatic classification models for UVP6 images and/or to evaluate their performance.
Minimal knowledge of Python, git and machine learning is needed.
This toolbox has been tested on macOS and Linux (e.g. Ubuntu 20.04/22.04 and Mint 21). We do not guarantee that it will work on Windows.
To install the package, you can type the following command in your terminal:
python -m pip install git+https://github.com/ecotaxa/uvpec
or
python -m pip install git+ssh://git@github.com/ecotaxa/uvpec.git
or using pip
pip install uvpec
uvpec should now appear if you type pip list | grep uvpec.
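The same check can be done from Python. This is a minimal sketch, assuming the distribution is registered under the name uvpec (as the pip commands above suggest); installed_version is a hypothetical helper written for this example, not a function of the package.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "uvpec" is assumed to be the distribution name used by pip.
    found = installed_version("uvpec")
    print("uvpec version:", found if found is not None else "not installed")
```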
For development purposes, you can also clone the repository locally. For this, you can either run (for HTTPS)
git clone https://github.com/ecotaxa/uvpec.git
or (for SSH)
git clone git@github.com:ecotaxa/uvpec.git
To use the package, you have to create a config.yaml file. Don't panic: an example of such a file is provided in your cloned repository at uvpec/uvpec/config.yaml. In this file, you need to specify three things: (1) what you want to do with the package, (2) some input/output information and (3) parameters for the gradient boosted trees algorithm (XGBoost) that will train and create a classification model.
For the process information, you need to specify two boolean variables:
- evaluate_only: true if you only want to evaluate an already created model. In that case, the package will not train any model and will only evaluate the model indicated by the model path with the test_features_file data. false if you want to train a model.
- train_only: true if you want to only train a model and skip the evaluation part. false if not. Not taken into account if evaluate_only is true.
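The interaction between the two flags can be summarized in a few lines of Python. This is only a sketch of the rules as described above, not the package's actual code; resolve_steps is a hypothetical name.

```python
def resolve_steps(evaluate_only, train_only):
    """Return (do_training, do_evaluation) for the two process flags."""
    if evaluate_only:
        # evaluate_only takes precedence: no training, train_only is ignored.
        return (False, True)
    if train_only:
        # Train a model and skip the evaluation part.
        return (True, False)
    # Both flags false: train a model, then evaluate it.
    return (True, True)


for flags in [(True, False), (True, True), (False, True), (False, False)]:
    print(flags, "->", resolve_steps(*flags))
```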
For the input/output (io), you need to specify:
- output_dir: an output directory, where the model and related information will be exported.
- train_images_dir: an image directory for the training set images. The plankton and/or particle images must be sorted by taxonomic class into subfolders, following the layout used with Ecotaxa. Each subfolder contains images from a single taxonomic class and is named after the class's display name and its Ecotaxa ID, separated by two underscores: 'DisplayName__EcotaxaID'. The typical way to export data from Ecotaxa in this folder organization is to make a D.O.I. export of all images and keep only the 'white on black' images, i.e. *_100.png (see here). The maximum number of accepted classes is 40.
- test_images_dir: an image directory for the test set images. It will only be used if you evaluate a model (training + evaluation or evaluation only).
- training_features_file: the name of your training features file. If it does not already exist, it will be created automatically, so give it a great name!
- test_features_file: the name of your test features file. If it does not already exist, it will be created automatically, so give it a great name as well! Unused if train_only is true.
- model: the path to a model (the file should be named Muvpec_KEY.model and be a model created using XGBoost). Only used if evaluate_only is true.
- objid_threshold_file: the path to a tsv file containing the objid and the UVP6 acquisition threshold of each image for which features will be extracted. Only used if use_objid_threshold_file is set to true.
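Before launching a training, it can help to check that the image directory follows the naming rule above. The sketch below is an illustration only: list_classes is a hypothetical helper, not part of uvpec, and the folder names in the demo use placeholder IDs rather than real Ecotaxa IDs.

```python
import tempfile
from pathlib import Path

MAX_CLASSES = 40  # maximum number of classes stated above


def list_classes(train_images_dir):
    """Map each class subfolder to (display_name, ecotaxa_id).

    Raises ValueError on a badly named subfolder or on too many classes.
    """
    classes = {}
    for sub in sorted(Path(train_images_dir).iterdir()):
        if not sub.is_dir():
            continue
        # Split on the last double underscore: 'DisplayName__EcotaxaID'.
        name, sep, taxo_id = sub.name.rpartition("__")
        if not sep or not name or not taxo_id:
            raise ValueError(f"unexpected folder name: {sub.name!r}")
        classes[sub.name] = (name, taxo_id)
    if len(classes) > MAX_CLASSES:
        raise ValueError(f"{len(classes)} classes found, maximum is {MAX_CLASSES}")
    return classes


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Placeholder folder names; the IDs are made up for the demo.
        for folder in ("detritus__1", "Copepoda__2"):
            (Path(tmp) / folder).mkdir()
        print(list_classes(tmp))
```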
For the instrument parameters, you need to specify:
- The pixel threshold of your UVP6 uvp_pixel_threshold, that is, the threshold value used to split image pixels into foreground (> threshold) and background (<= threshold) pixels. It is usually between 20 and 22.
- If you wish to use a variable threshold value (e.g. if you are working with images acquired with different UVP6 instruments), set use_objid_threshold_file to true.
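To make the foreground/background rule concrete, here is a toy illustration in plain Python. It is not the package's feature extraction code: foreground_mask is a hypothetical helper, and the small image is invented for the example.

```python
def foreground_mask(image, threshold):
    """Return a boolean mask: True where pixel > threshold (foreground)."""
    return [[pixel > threshold for pixel in row] for row in image]


# Tiny made-up grayscale image (values 0-255), white object on black background.
image = [
    [0, 5, 21, 0],
    [3, 120, 200, 22],
    [0, 20, 30, 0],
]
mask = foreground_mask(image, threshold=21)

# A pixel equal to the threshold counts as background (<= threshold).
n_foreground = sum(sum(row) for row in mask)
print(n_foreground)
```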
Then, for the XGBoost training parameters, you need to specify:
- An initialization seed random_state. It is important if you build multiple models with different XGBoost configurations. The number itself is not important; you can keep 42.
- A number of CPU cores n_jobs that will depend on the computational power of your machine or server.
- The learning rate. It controls the magnitude of the adjustments made to the model's parameters during each iteration of training (i.e. in our model, at each boosting round). A high learning rate may cause the optimization to miss the optimal parameter values (e.g. it leads to oscillations or divergence), while a low learning rate may slow down training because of slow convergence towards the minimum of the loss function, or may get stuck in local minima.
- The maximum depth of a tree max_depth. For technical reasons, it is forbidden to go above 7.
- weight_sensitivity represents the weight ($w$) you want to put on biological classes during training. The minimum value is 0 (i.e. no weight) and the maximum value is 1. It is useful to add a weight to smaller classes because a large share (often $\ge$ 80%) of the images in the training set are detritus; setting $w$ to 0.25, for example, puts more weight on small (biological) classes during training and forces the algorithm to pay more attention to those classes.
- detritus_subsampling can be used if you want to undersample the detritus class in your training. If you think that your detritus class (you must therefore have one named exactly 'detritus') is too populated (e.g. extreme dataset imbalance) and that removing part of it is not an issue for your application, then you can set a given percentage of subsampling for that class. For example, a subsampling_percentage of 20 means that you only keep 20% of your entire detritus class. Keep detritus_subsampling set to false if you don't want to use it.
- subsampling_percentage is the percentage of 'detritus' images from your training set that you want to keep for training.
- num_trees_CV stands for the number of boosting rounds you want to use for the cross-validation (CV). This is equivalent to the parameter num_round in XGBoost.
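The constraints listed above, and the effect of detritus subsampling, can be sketched as follows. Both helpers are hypothetical and are not the package's implementation. The accepted range for subsampling_percentage and the use of a seeded random draw are assumptions made for the example.

```python
import random


def check_training_params(max_depth, weight_sensitivity, subsampling_percentage):
    """Raise ValueError if a value falls outside the ranges described above."""
    if max_depth > 7:
        raise ValueError("max_depth must not exceed 7")
    if not 0 <= weight_sensitivity <= 1:
        raise ValueError("weight_sensitivity must lie between 0 and 1")
    # Assumption: a percentage is expected to lie between 0 and 100.
    if not 0 <= subsampling_percentage <= 100:
        raise ValueError("subsampling_percentage must lie between 0 and 100")


def subsample_detritus(labels, percentage, seed=42):
    """Return sorted indices to keep: all non-detritus, plus a share of detritus."""
    rng = random.Random(seed)  # fixed seed, so the draw is reproducible
    detritus = [i for i, label in enumerate(labels) if label == "detritus"]
    others = [i for i, label in enumerate(labels) if label != "detritus"]
    n_keep = round(len(detritus) * percentage / 100)
    return sorted(others + rng.sample(detritus, n_keep))


# Toy training set: 10 detritus images and 2 images of another class.
labels = ["detritus"] * 10 + ["Copepoda"] * 2
check_training_params(max_depth=5, weight_sensitivity=0.25, subsampling_percentage=20)
kept = subsample_detritus(labels, percentage=20)
print(len(kept), "images kept out of", len(labels))
```

With a subsampling_percentage of 20, only 2 of the 10 detritus images remain in this toy set, while the other class is left untouched.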
You will also notice one last parameter: use_C makes it possible to extract the features from images using a C++ extension. We advise keeping it set to true because it is much faster than the Python version.
Once you are done, run uvpec config.yaml in your terminal and wait for the magic to happen! You should get everything you need in the output folder you specified.
We have prepared a test folder in our package. This allows you to check that the pipeline works without launching a full process that would take a significant amount of time. It is always a good idea to check that everything works well before using it on a full training set, and also after package updates. To use it, navigate to the test folder using cd test, then run uvpec config.yaml. You should see something going on in your terminal. Don't forget to check your output folder afterwards!
In addition, there is another test that you can run to check that the pipeline is not broken somewhere. For that, run pytest (which looks for test_uvpec.py) in your terminal. If you only see green lights, all tests went smoothly! If not, something went wrong and the error messages can help you find where the problem is.
As a reminder, if you see errors during the test, check that you did not forget to run uvpec config.yaml.
pytest may not be installed on your machine by default. To install it, type pip install --user pytest in your terminal.
You can refer to the documentation on Ecotaxa to download all the vignettes you need to use for your training and/or test set. See the "export project" part of your project on https://ecotaxa.obs-vlfr.fr/.
Ecotaxa provides a REST API that has been designed to facilitate the work of users. Two packages have been developed to interact more easily with the API, in Python and in R. Be careful to download the vignettes with the black background, because every object is stored in two versions: one with a white background and one with a black background. You will also need to remove the size legend at the bottom of each vignette. To do so, crop 31 pixels from the bottom of the vignette.
Finally, just rename the vignettes with the uvpec standard (i.e. DisplayName__EcotaxaID), and you are good to go!
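The cropping step described above amounts to keeping everything except the last 31 rows of pixels. A minimal sketch, assuming you use an imaging library that accepts a (left, upper, right, lower) box, such as Pillow; crop_box is a hypothetical helper and the file names in the comments are placeholders.

```python
LEGEND_HEIGHT_PX = 31  # height of the size legend, as stated above


def crop_box(width, height, legend_px=LEGEND_HEIGHT_PX):
    """Return (left, upper, right, lower) keeping everything above the legend."""
    if height <= legend_px:
        raise ValueError("image is not taller than the legend")
    return (0, 0, width, height - legend_px)


# With Pillow installed (an assumption; any library with a box crop would do):
#     from PIL import Image
#     img = Image.open("vignette.png")            # placeholder file name
#     img.crop(crop_box(*img.size)).save("vignette_cropped.png")
print(crop_box(120, 151))
```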
To uninstall our (awesome-why-are-you-removing-it) package, type pip uninstall uvpec in your terminal.
For updates, either uninstall the package and reinstall it with the HTTPS or SSH command, or, more simply, update it using pip.