Before publishing a pull request, ensure that the following validations pass:

```shell
make test
make test-integration
make lint
```
We use `make` to automate common development tasks. If you feel that any recurring development routine is missing from the current set of Makefile targets, please propose a new one!
- `make venv`: Creates a virtual environment in the `venv` directory with all required development dependencies installed. You usually don't need to run this command directly, since it's invoked automatically by any other target that needs it.
- `make clean-venv`: Removes all artifacts created by the `venv` Makefile target.
- `make build`: Builds a redistributable wheel to the `build` directory. Stores intermediate artifacts in the `dist` directory.
- `make clean-build`: Removes all non-virtual-environment artifacts created by the `build` Makefile target.
- `make rebuild`: Runs `clean-build` followed by `build`.
- `make clean`: Removes all artifacts created by the `build` and `venv` Makefile targets.
- `make deploy-s3`: Builds and uploads a wheel to S3. See Build and Deploy an S3 Wheel.
- `make install`: Installs all developer requirements from `dev-requirements.txt` in your virtual environment.
- `make lint`: Runs the linter to ensure that code in your local workspace conforms to code-style guidelines.
- `make test`: Runs all unit tests.
- `make test-integration`: Runs all integration tests.
- `make test-integration-rebuild`: Rebuilds the integration test environment.
- `make benchmark-aws`: Runs AWS benchmarks.
You can deploy and test your local DeltaCAT changes on any AWS environment that can run Ray applications (e.g. EC2, Glue for Ray, EKS, etc.).

**Caution**: Iceberg script execution on Glue for Ray is currently broken. DeltaCAT and PyIceberg v0.5+ depend on a version of pydantic that is incompatible with the Ray v2.4 release used by Glue. See Ray Issue #37372 for additional details.

Use the Glue Runner at `dev/deploy/aws/scripts/runner.py` (invoked as `runner.py aws glue`) to configure your AWS account to run any Python script using AWS Glue for Ray. The Glue Runner can also be used to build and upload changes in your workspace to an S3 wheel used during execution instead of the default PyPI DeltaCAT wheel.
- Install and configure the latest version of the AWS CLI.
- Create an AWS Glue IAM Role that can create and run jobs.
- Install and configure boto3.
- Print usage instructions and exit:

  ```shell
  python runner.py aws glue -h
  ```

- Create a new Glue job and run your first script in us-east-1 (the default region):

  ```shell
  python runner.py aws glue deltacat/examples/hello_world.py --glue-iam-role "AWSGlueServiceRole"
  ```

- Run an example in us-east-1 using the last job config and DeltaCAT deploy:

  ```shell
  python runner.py aws glue deltacat/examples/hello_world.py
  ```

- Run an example in us-east-1 using your local workspace copy of DeltaCAT:

  ```shell
  python runner.py aws glue deltacat/examples/hello_world.py --deploy-local-deltacat
  ```
**Note**: The deployed package is referenced by an S3 URL that expires in 7 days. After 7 days, you must deploy a new DeltaCAT package to avoid receiving a 403 error!
- Create a new job and run an example in us-west-2:

  ```shell
  python runner.py aws glue deltacat/examples/hello_world.py \
    --region us-west-2 \
    --glue-iam-role "AWSGlueServiceRole"
  ```
- Pass arguments into an example script as environment variables:

  ```shell
  python runner.py aws glue deltacat/examples/basic_logging.py \
    --script-args '{"--var1":"Try that", "--var2":"DeltaCAT"}'
  ```
Each Glue Runner invocation then:

- Creates an S3 bucket at `s3://deltacat-packages-{stage}` if it doesn't already exist.
- [Optional] Builds a wheel containing your local workspace changes and uploads it to `s3://deltacat-packages-{stage}/` if the `--deploy-local-deltacat` flag is set.

  **Important**: `{stage}` is replaced with `os.environ["USER"]` unless you set the `$DELTACAT_STAGE` environment variable.

- Creates an S3 bucket at `s3://deltacat-glue-scripts-{stage}` if it doesn't already exist.
- Uploads the script to run to `s3://deltacat-glue-scripts-$USER`.
- Creates or updates the Glue Job `deltacat-runner-{stage}` to run this example.
- Runs the `deltacat-runner-{stage}` Glue Job with either the newly built DeltaCAT wheel or the last used wheel.
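The `{stage}` resolution rule described above can be sketched in a few lines. Note that `resolve_stage` is a hypothetical helper name used only for illustration, not a function in the DeltaCAT codebase; it simply mirrors the documented rule that `$DELTACAT_STAGE` takes precedence over `$USER`:

```python
import os

def resolve_stage(environ=os.environ):
    # Hypothetical helper: mirrors the documented rule that {stage}
    # is os.environ["USER"] unless $DELTACAT_STAGE is set.
    return environ.get("DELTACAT_STAGE", environ.get("USER"))

# With only USER set, the stage falls back to the username:
print(resolve_stage({"USER": "alice"}))                           # alice
# DELTACAT_STAGE takes precedence when defined:
print(resolve_stage({"USER": "alice", "DELTACAT_STAGE": "dev"}))  # dev
```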
If you'd like to run integration tests in any other custom environment, you can run a single command to package your local changes in a wheel, upload it to S3, then install it on your Ray cluster from a signed S3 URL.

Simply run `make deploy-s3` to upload your local workspace to a wheel at `s3://deltacat-packages-{stage}/deltacat-{version}-{timestamp}-{python}-{abi}-{platform}.whl`.

If the deploy succeeds, you should see text printed telling you how to install this wheel from a signed S3 URL:

```
to install run:
pip install deltacat @ `s3://deltacat-packages-{stage}/deltacat-{version}-{timestamp}-{python}-{abi}-{platform}.whl`
```
The variables in the above S3 URL will be replaced as follows:

- `stage`: The runtime value of the `$DELTACAT_STAGE` environment variable if defined, or the `$USER` environment variable if not.
- `version`: The current DeltaCAT distribution version. See https://peps.python.org/pep-0491/.
- `timestamp`: Second-precision epoch timestamp build tag. See https://peps.python.org/pep-0491/.
- `python`: Language implementation and version tag (e.g. `py27`, `py2`, `py3`). See https://peps.python.org/pep-0491/.
- `abi`: ABI tag (e.g. `cp33m`, `abi3`, `none`). See https://peps.python.org/pep-0491/.
- `platform`: Platform tag (e.g. `linux_x86_64`, `any`). See https://peps.python.org/pep-0491/.
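Since these components follow PEP 491's dash-separated wheel naming convention, a plain split recovers them. The wheel name below is a made-up example for illustration, not a real published artifact:

```python
# Made-up example wheel name following the naming convention above.
wheel = "deltacat-1.1.8-1700000000-py3-none-any.whl"

# PEP 491 components are dash-separated once the .whl suffix is removed.
distribution, version, build_tag, python_tag, abi_tag, platform_tag = (
    wheel[: -len(".whl")].split("-")
)

print(python_tag, abi_tag, platform_tag)  # py3 none any
```

Note that a plain split only works when, as here, the version itself contains no dashes.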
Use the `$DELTACAT_STAGE` environment variable to change the S3 bucket that your workspace wheel is uploaded to:

```shell
export DELTACAT_STAGE=dev
make deploy-s3
```

This uploads a wheel to `s3://deltacat-packages-dev/deltacat-{version}-{timestamp}-{python}-{abi}-{platform}.whl`.
You can benchmark your DeltaCAT changes on AWS by running:

```shell
make benchmark-aws
```
**Note**: We recommend running benchmarks in an environment configured for high-bandwidth access to cloud storage, e.g. an EC2 instance with enhanced networking support: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/enhanced-networking.html.
**Parquet Reads**: Modify the `SINGLE_COLUMN_BENCHMARKS` and `ALL_COLUMN_BENCHMARKS` fixtures in `deltacat/benchmarking/benchmark_parquet_reads.py` to add more files and benchmark test cases.
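The actual fixture definitions live in `deltacat/benchmarking/benchmark_parquet_reads.py`. As a purely illustrative sketch of the idea of adding a case (the `BenchmarkCase` type, its field names, and the file path below are all hypothetical, not DeltaCAT's real fixture shape):

```python
from typing import List, NamedTuple

class BenchmarkCase(NamedTuple):
    # Hypothetical case shape for illustration only; consult the real
    # SINGLE_COLUMN_BENCHMARKS / ALL_COLUMN_BENCHMARKS fixtures for
    # the actual structure expected by the benchmark suite.
    name: str
    parquet_path: str   # hypothetical example path, not a real dataset
    columns: List[str]  # columns to read during the benchmark

SINGLE_COLUMN_BENCHMARKS: List[BenchmarkCase] = [
    BenchmarkCase(
        name="one_string_column",
        parquet_path="s3://example-bucket/example.parquet",
        columns=["id"],
    ),
]

print(SINGLE_COLUMN_BENCHMARKS[0].name)  # one_string_column
```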
Some DeltaCAT compute functions interact with cloudpickle differently than the typical Ray application. This allows us to improve compute stability and efficiency at the cost of managing our own distributed object garbage collection instead of relying on Ray's automatic distributed object reference counting and garbage collection. For example, see the comment at `deltacat/compute/compactor/utils/primary_key_index.py` for an explanation of our custom `cloudpickle.dumps` usage.
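As background for why cloudpickle (rather than the standard `pickle` module) is used in Ray applications at all: `pickle` serializes functions by reference and therefore rejects closures and other locally defined functions, while cloudpickle serializes them by value, including their captured variables. A minimal stdlib-only illustration of the limitation:

```python
import pickle

def make_adder(n):
    # A closure defined inside another function: it can't be looked up
    # by module path, so the standard pickle module refuses to pickle it.
    def add(x):
        return x + n
    return add

add_five = make_adder(5)

try:
    pickle.dumps(add_five)
    pickled_ok = True
except Exception:
    # pickle raises here; cloudpickle.dumps(add_five) would instead
    # succeed by serializing the function body and its captured cell.
    pickled_ok = False

print(pickled_ok)  # False
```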