Here we provide short examples of various uses of the Data Prep Kit. Most users who want to jump right in can use a standard pip install to deploy the data-prep-kit and the Python or Ray transforms to their virtual Python environment.
When setting up a virtual environment, we recommend Python 3.11, as in the conda example below.
Set up a virtual environment using conda:
conda create -n data-prep-kit-1 -y python=3.11
Linux only: install gcc/g++ with the commands below. They are needed to compile and install fasttext, which is currently used by some of the transforms.
conda install gcc_linux-64
conda install gxx_linux-64
Activate the new conda environment:
conda activate data-prep-kit-1
Make sure the environment has switched to data-prep-kit-1 and check the Python version:
python --version
The command above should report Python 3.11.
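The version check can also be scripted so that a setup script notices a wrong interpreter early. The following is a minimal sketch; the helper name `is_py311` is hypothetical and not part of the Data Prep Kit.

```shell
# Hypothetical helper: succeeds only when the given version string is 3.11 or 3.11.x.
is_py311() {
  case "$1" in
    "Python 3.11"|"Python 3.11."*) return 0 ;;
    *) return 1 ;;
  esac
}

# Compare the active interpreter against the recommended version.
if is_py311 "$(python --version 2>&1)"; then
  echo "OK: Python 3.11 is active"
else
  echo "Warning: expected Python 3.11, found: $(python --version 2>&1)" >&2
fi
```

If the warning appears, the conda environment is most likely not activated in the current shell.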
Install the data prep toolkit:
pip3 install 'data-prep-toolkit-transforms[ray,all]'
The command above installs the complete library with all the transforms. In some situations, it may be preferable to install a specific transform, with or without the Ray runtime. In that case, specify the name of the transform in the [extra] value.
To install only the lang_id transform, use the following command:
pip3 install 'data-prep-toolkit-transforms[lang_id]'
To install the lang_id transform with the Ray runtime, use the following command:
pip3 install 'data-prep-toolkit-transforms[ray,lang_id]'
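The extras are a comma-separated list inside the square brackets, so the requirement string can be assembled from any combination of extras. Below is a small sketch of that pattern; the helper name `dpk_requirement` is hypothetical and only illustrates how the string is formed.

```shell
# Hypothetical helper: build the pip requirement string for a set of extras.
dpk_requirement() {
  local IFS=,   # "$*" joins the arguments with the first character of IFS
  printf 'data-prep-toolkit-transforms[%s]\n' "$*"
}

dpk_requirement lang_id        # prints data-prep-toolkit-transforms[lang_id]
dpk_requirement ray lang_id    # prints data-prep-toolkit-transforms[ray,lang_id]

# To install (not executed here):
#   pip3 install "$(dpk_requirement ray lang_id)"
```

Quoting the requirement string matters because many shells treat square brackets as glob characters.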
Install Jupyter and register a kernel for the new environment:
pip install jupyterlab ipykernel ipywidgets
python -m ipykernel install --user --name=data-prep-kit --display-name "dataprepkit"
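To confirm that the kernel was registered, the output of `jupyter kernelspec list` can be checked for the kernel name. This is a sketch that assumes the usual listing layout of one kernel name per line followed by its path; the helper name `has_kernel` is hypothetical.

```shell
# Hypothetical helper: succeeds if a kernel named $1 appears in the listing read from stdin.
has_kernel() {
  grep -Eq "^[[:space:]]*$1[[:space:]]"
}

# Check for the kernel name used in the install command above.
if jupyter kernelspec list 2>/dev/null | has_kernel data-prep-kit; then
  echo "kernel data-prep-kit is registered"
else
  echo "kernel data-prep-kit not found" >&2
fi
```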
- Notebooks
  - There is a simple notebook for running a single transform. It can be run from either Google Colab or, after downloading the file, the local environment.
  - Most individual transform folders include one (Python), two (Python and Ray), or three (Python, Ray and Spark) notebooks for running that transform. To run these notebooks in the local environment, clone the repo:
git clone git@github.com:IBM/data-prep-kit.git
Then go to an individual transform folder, which contains the corresponding notebooks. As an example:
cd data-prep-kit/transforms/universal/fdedup
make venv
source venv/bin/activate
pip install jupyterlab
jupyter lab
You can now run any of the three notebooks for this transform: the Python, Ray, or Spark version.
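To see which notebooks a given transform folder actually ships, the folder can be searched for `.ipynb` files. A minimal sketch follows; the helper name `list_notebooks` is hypothetical.

```shell
# Hypothetical helper: list notebooks under a directory (default: current directory),
# skipping Jupyter's autosave checkpoints.
list_notebooks() {
  find "${1:-.}" -name '*.ipynb' ! -path '*/.ipynb_checkpoints/*' | sort
}

# Run from inside a transform folder to list its notebooks.
list_notebooks
```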
- Command line
  - Using a Docker image - runs a transform in a Docker transform image
  - Using a virtual environment - runs a transform on the local host
- Tutorial - shows how to use the library to add a new transform.