Running a Transform from the Command Line

Here we address a simple use case: applying a single transform to a set of files. We use the pdf2parquet transform as an example, but the same process works for any of the transforms contained in Data Prep Kit. The examples below use the Python runtime, but they should also work with the Ray or Spark runtimes.

Install Data Prep Kit from PyPI

The latest version of Data Prep Kit is available on PyPI for Python 3.10, 3.11 or 3.12. It can be installed using:

pip install 'data-prep-toolkit-transforms[ray,all]'

The command above installs all available transforms and both the Python and Ray runtimes.
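Because the package is listed for Python 3.10, 3.11 and 3.12, a quick interpreter check before installing can help avoid a failed install. The following is a minimal sketch; the helper name `is_supported_python` is hypothetical and not part of Data Prep Kit.

```python
import sys

# Minor versions listed as supported in the text above
SUPPORTED_VERSIONS = {(3, 10), (3, 11), (3, 12)}


def is_supported_python(version_info=None) -> bool:
    """Return True if (major, minor) is one of the supported versions."""
    if version_info is None:
        version_info = sys.version_info
    return (version_info[0], version_info[1]) in SUPPORTED_VERSIONS


current = f"{sys.version_info[0]}.{sys.version_info[1]}"
if is_supported_python():
    print(f"Python {current} is supported")
else:
    print(f"Python {current} is not among the supported versions")
```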

NOTE: As of this writing, there is an issue installing fasttext for the lang_id transform on Linux systems. A workaround is to install using conda. Alternatively, you may choose to install only the transform(s) of interest (see below).

To install only selected transforms, specify the transform name in the pip command in place of all. For example, use the following command to install only the pdf2parquet transform:

pip install 'data-prep-toolkit-transforms[pdf2parquet]'
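After a selective install, it may be worth confirming that the transform's package can be found. The sketch below uses the standard library's `importlib.util.find_spec` and assumes the package name `dpk_pdf2parquet`, which is taken from the run command later in this guide; the helper `is_importable` is hypothetical.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# "dpk_pdf2parquet" is the package name used by the run command in this guide
if is_importable("dpk_pdf2parquet"):
    print("pdf2parquet transform is installed")
else:
    print("pdf2parquet transform not found; check the pip install step")
```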

As an alternative, instructions for installing in a conda environment can be found here.

Run a transform at the command line

Here we run the pdf2parquet transform on its sample input data to import PDF content into rows of a parquet file. First, we download some data for the transform to run on, using the following Python code:

import os
import urllib.request

# Location of the pdf2parquet sample inputs in the Data Prep Kit repository
base_url = "https://raw.githubusercontent.com/IBM/data-prep-kit/dev/transforms/language/pdf2parquet/test-data/input"

# Create the input folder and download both sample files into it
os.makedirs("input", exist_ok=True)
for name in ("archive1.zip", "redp5110-ch1.pdf"):
    urllib.request.urlretrieve(f"{base_url}/{name}", f"input/{name}")

% ls input
archive1.zip		redp5110-ch1.pdf
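Before running the transform, it can be useful to confirm which files in the input folder have the extensions the transform will be asked to process. The sketch below is an illustration only: `files_with_extensions` is a hypothetical helper, and matching by file suffix (case-insensitively) is an assumption of this example, not a description of how the transform itself selects files.

```python
from pathlib import Path


def files_with_extensions(folder, extensions):
    """Return sorted file names in `folder` whose suffix is in `extensions`."""
    # Compare suffixes case-insensitively (a choice made for this sketch)
    wanted = {ext.lower() for ext in extensions}
    return sorted(
        p.name for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in wanted
    )


# Only list files if the input folder from the download step exists
if Path("input").is_dir():
    print(files_with_extensions("input", [".pdf", ".zip"]))
```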

Next we run pdf2parquet on the data in the input folder.

python -m dpk_pdf2parquet.transform_python \
    --data_local_config "{ 'input_folder': 'input', 'output_folder': 'output'}" \
    --data_files_to_use "['.pdf', '.zip']" 
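The same command can also be launched from a Python script, which is convenient when the folders vary between runs. The following is a sketch under stated assumptions: `build_command` and `run_transform` are hypothetical helpers, and the argument strings are produced with `str()` so that they have the same single-quoted literal form as the command above. Paths that themselves contain quote characters would need extra care.

```python
import shlex
import subprocess
import sys


def build_command(input_folder="input", output_folder="output",
                  extensions=(".pdf", ".zip")):
    """Build the argument list for the pdf2parquet command shown above."""
    local_config = {"input_folder": input_folder, "output_folder": output_folder}
    return [
        sys.executable, "-m", "dpk_pdf2parquet.transform_python",
        # str() of a dict or list yields the single-quoted literal form used above
        "--data_local_config", str(local_config),
        "--data_files_to_use", str(list(extensions)),
    ]


def run_transform(**kwargs):
    """Run the transform as a child process; raises CalledProcessError on failure."""
    subprocess.run(build_command(**kwargs), check=True)


# Print the equivalent shell command; call run_transform() to actually execute it
print(shlex.join(build_command()))
```

Passing the arguments as a list avoids shell quoting problems, since each argument reaches the transform exactly as built.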

Parquet files are generated in the designated output folder:

% ls output
archive1.parquet        metadata.json           redp5110-ch1.parquet
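To check the results of a run programmatically, the output folder can be summarized with the standard library alone. This is a minimal sketch: `summarize_output` is a hypothetical helper, and because the structure of metadata.json is not described in this guide, the example simply loads and prints it without assuming any particular keys.

```python
import json
from pathlib import Path


def summarize_output(folder="output"):
    """Return (parquet file names, parsed metadata or None) for an output folder."""
    out = Path(folder)
    parquet_files = sorted(p.name for p in out.glob("*.parquet"))
    metadata_path = out / "metadata.json"
    metadata = None
    if metadata_path.is_file():
        with metadata_path.open(encoding="utf-8") as f:
            metadata = json.load(f)
    return parquet_files, metadata


# Only summarize if the output folder from the run above exists
if Path("output").is_dir():
    files, metadata = summarize_output("output")
    print("Parquet files:", files)
    # The structure of metadata.json is not assumed here, so print it as-is
    print(json.dumps(metadata, indent=2))
```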

All transforms are runnable from the command line in the manner above.