Quick Start for Data Prep Kit

Here we provide short examples of various uses of the Data Prep Kit. Most users who want to jump right in can use a standard pip install to deploy the data-prep-kit and the Python or Ray transforms to their virtual Python environment.

When setting up a virtual environment, it is recommended to use Python 3.11, as in the conda example below.

Create a Virtual Environment

Set up a virtual environment using conda:

conda create -n data-prep-kit-1 -y python=3.11

Linux systems only: install gcc/g++ using the commands below. They are required to compile and install fasttext, which is currently used by some of the transforms.

conda install gcc_linux-64
conda install gxx_linux-64

Activate the new conda environment:

conda activate data-prep-kit-1

Make sure the environment has switched to data-prep-kit-1 and check the Python version:

python --version
The command above should report version 3.11 (for example, Python 3.11.x).
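The version check can also be scripted, which helps when the setup steps are run unattended. The sketch below is only an illustration: `version_matches` is a hypothetical helper, not part of the Data Prep Kit, and it assumes that `python --version` prints a string of the form `Python X.Y.Z`.

```shell
# Hypothetical helper (not part of data-prep-kit): succeeds when a
# `python --version` style string reports the wanted major.minor version.
version_matches() {
    wanted="$1"      # e.g. 3.11
    reported="$2"    # e.g. "Python 3.11.9"
    case "$reported" in
        "Python $wanted"|"Python $wanted".*) return 0 ;;
        *) return 1 ;;
    esac
}

# conda exports CONDA_DEFAULT_ENV with the name of the active environment.
echo "Active conda env: ${CONDA_DEFAULT_ENV:-<none>}"

# Compare against the interpreter on PATH, if there is one.
if command -v python >/dev/null 2>&1; then
    reported="$(python --version 2>&1)"
    if version_matches 3.11 "$reported"; then
        echo "OK: $reported"
    else
        echo "Warning: expected Python 3.11, got: $reported"
    fi
fi
```

The helper takes the version string as an argument rather than calling `python` itself, so it can be checked against any interpreter's output.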

Install the data prep toolkit:

pip3 install 'data-prep-toolkit-transforms[ray,all]'

The command above installs the complete library with all the transforms. In certain situations, it may be desirable to install a specific transform, with or without the Ray runtime. In that case, name the transform in the [extra] value of the command.

To install the lang_id transform, use the following command:

pip3 install 'data-prep-toolkit-transforms[lang_id]'

To install the lang_id transform with the Ray runtime, use the following command:

pip3 install 'data-prep-toolkit-transforms[ray,lang_id]'
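The [extra] value is a comma-separated list inside square brackets, so several extras can be combined in one requirement. As a minimal sketch, the hypothetical helper `dpk_requirement` below (an illustration, not something the toolkit ships) joins the extras it is given and prints the resulting install command without running it.

```shell
# Hypothetical helper (not part of data-prep-kit): build a pip requirement
# string for data-prep-toolkit-transforms from one or more extras.
dpk_requirement() {
    pkg="data-prep-toolkit-transforms"
    if [ "$#" -eq 0 ]; then
        echo "usage: dpk_requirement EXTRA [EXTRA...]" >&2
        return 1
    fi
    extras="$1"
    shift
    # Append the remaining extras, separated by commas.
    for extra in "$@"; do
        extras="$extras,$extra"
    done
    printf '%s[%s]\n' "$pkg" "$extras"
}

# Print (do not run) the install command for lang_id with the Ray runtime.
echo "pip3 install '$(dpk_requirement ray lang_id)'"
```

The quotes around the requirement matter in shells such as zsh, where unquoted square brackets are treated as glob patterns.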

Setting up JupyterLab for local experimentation with transform notebooks

pip install jupyterlab ipykernel ipywidgets
python -m ipykernel install --user --name=data-prep-kit --display-name "dataprepkit"
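To confirm that the kernel was registered, you can look for its name in the output of `jupyter kernelspec list`. The sketch below assumes the usual listing layout, with one kernel per line and the kernel name in the first column; `kernel_listed` is a hypothetical helper introduced here for illustration.

```shell
# Hypothetical helper: read a `jupyter kernelspec list` style listing on
# stdin and succeed if the given kernel name appears in the first column.
kernel_listed() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# Check for the kernel registered above (--name=data-prep-kit).
if command -v jupyter >/dev/null 2>&1; then
    if jupyter kernelspec list | kernel_listed data-prep-kit; then
        echo "Kernel data-prep-kit is registered"
    else
        echo "Kernel data-prep-kit was not found"
    fi
fi
```

The name to look for is the `--name` value (data-prep-kit), not the display name shown in the JupyterLab launcher.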

Running transforms

  • Notebooks

    • There is a simple notebook for running a single transform that can be run from either Google Colab or the local environment by downloading the file.
    • In most individual transform folders, we have included one (Python), two (Python and Ray), or three (Python, Ray and Spark) notebooks for running that transform. To run these notebooks in the local environment, first clone the repo:
    git clone git@github.com:IBM/data-prep-kit.git

    Then go to an individual transform folder, where the corresponding notebooks are located. As an example:

    cd data-prep-kit/transforms/universal/fdedup
    make venv
    source venv/bin/activate 
    pip install jupyterlab
    jupyter lab

    You can now run the Python, Ray, or Spark notebook for this transform.

  • Command line

Creating transforms

  • Tutorial - shows how to use the library to add a new transform.