
ci: add Dataproc cluster recreation script and Cloud Build config #3536

Open
dborowitz wants to merge 1 commit into googleapis:main from dborowitz:recreate-cluster

Conversation

@dborowitz


Introduces a new Cloud Build configuration and a companion bash script to delete and recreate the Dataproc cluster used for integration testing. As a best practice, Dataproc clusters should not be very long-lived, as this prevents the cluster software from being updated.

In the initial implementation, the script deletes and then recreates the cluster sequentially. This causes some downtime (Dataproc integration tests will fail while the cluster is absent), but is very simple.

This commit doesn't include a schedule/trigger yet.
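
For readers without the diff open, the sequential delete-then-create flow described above could look roughly like the following. This is a minimal sketch, not the script from this PR: the function name `recreate_cluster` is hypothetical, the argument order mirrors the positional parameters visible in the review below, and the `create` call shows only the flags that the discussion mentions (the real script may pass more).

```shell
#!/usr/bin/env bash
# Sketch only: recreate_cluster is a hypothetical name, not taken from the PR.
# Deletes the cluster if it exists, then creates it again. Tests that need the
# cluster will fail between the two steps.

recreate_cluster() {
  local project="$1" region="$2" image_version="$3" cluster="$4"
  local out

  # Step 1: delete the cluster only if describe confirms it exists.
  if out=$(gcloud dataproc clusters describe "${cluster}" \
      --region="${region}" --project="${project}" 2>&1); then
    gcloud dataproc clusters delete "${cluster}" \
      --region="${region}" --project="${project}" --quiet || return 1
  elif ! grep -q "NOT_FOUND" <<<"${out}"; then
    # Any failure other than "cluster does not exist" aborts the run.
    echo "Error querying cluster existence: ${out}" >&2
    return 1
  fi

  # Step 2: create a fresh cluster on the requested image version.
  gcloud dataproc clusters create "${cluster}" \
    --region="${region}" --project="${project}" \
    --image-version="${image_version}"
}
```

Because creation only starts after deletion has returned, the length of the outage is roughly the sum of both operations.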

Description

Should include a concise description of the changes (bug or feature), its
impact, along with a summary of the solution

PR Checklist

Thank you for opening a Pull Request! Before submitting your PR, there are a
few things you can do to make sure it goes smoothly:

  • Make sure you reviewed CONTRIBUTING.md
  • Make sure to open an issue as a bug/issue before writing your code! That
    way we can discuss the change, evaluate designs, and agree on the general
    idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)
  • Make sure to add ! if this involves a breaking change

🛠️ Fixes #3535

@dborowitz dborowitz requested a review from a team as a code owner June 26, 2026 18:27

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request adds a Cloud Build configuration and a supporting bash script to recreate a Dataproc cluster. The feedback suggests improving the script's robustness by checking if the cluster is already in a 'DELETING' state and waiting for that process to finish before attempting recreation.

Comment on lines +41 to +58
set +e
DESCRIBE_OUT=$(gcloud dataproc clusters describe "${CLUSTER_NAME}" --region="${REGION}" --project="${PROJECT_ID}" 2>&1)
DESCRIBE_STATUS=$?
set -e

if [ ${DESCRIBE_STATUS} -eq 0 ]; then
  echo "Cluster '${CLUSTER_NAME}' exists. Deleting it..."
  gcloud dataproc clusters delete "${CLUSTER_NAME}" \
    --region="${REGION}" \
    --project="${PROJECT_ID}" \
    --quiet
  echo "Cluster '${CLUSTER_NAME}' deleted successfully."
elif echo "${DESCRIBE_OUT}" | grep -q "NOT_FOUND"; then
  echo "Cluster '${CLUSTER_NAME}' does not exist. Skipping deletion."
else
  echo "Error querying cluster existence: ${DESCRIBE_OUT}" >&2
  exit ${DESCRIBE_STATUS}
fi


medium

If a previous run of this script was cancelled or failed during deletion, the cluster might be left in a DELETING state. Running gcloud dataproc clusters delete on a cluster that is already deleting will fail, causing this script to exit with an error.

We can make this more robust by querying the cluster's state using --format="value(status.state)". If the state is DELETING, we can poll and wait for the deletion to complete before proceeding to the creation step.

set +e
DESCRIBE_OUT=$(gcloud dataproc clusters describe "${CLUSTER_NAME}" --region="${REGION}" --project="${PROJECT_ID}" --format="value(status.state)" 2>&1)
DESCRIBE_STATUS=$?
set -e

if [ ${DESCRIBE_STATUS} -eq 0 ]; then
  STATE="${DESCRIBE_OUT}"
  if [ "${STATE}" = "DELETING" ]; then
    echo "Cluster '${CLUSTER_NAME}' is already being deleted. Waiting for deletion to complete..."
    while gcloud dataproc clusters describe "${CLUSTER_NAME}" --region="${REGION}" --project="${PROJECT_ID}" &>/dev/null; do
      sleep 10
    done
    echo "Cluster '${CLUSTER_NAME}' deleted successfully."
  else
    echo "Cluster '${CLUSTER_NAME}' exists in state '${STATE}'. Deleting it..."
    gcloud dataproc clusters delete "${CLUSTER_NAME}" \
      --region="${REGION}" \
      --project="${PROJECT_ID}" \
      --quiet
    echo "Cluster '${CLUSTER_NAME}' deleted successfully."
  fi
elif echo "${DESCRIBE_OUT}" | grep -q "NOT_FOUND"; then
  echo "Cluster '${CLUSTER_NAME}' does not exist. Skipping deletion."
else
  echo "Error querying cluster existence: ${DESCRIBE_OUT}" >&2
  exit ${DESCRIBE_STATUS}
fi
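
One caveat with the polling loop suggested above: it treats any failing `describe` as "deletion finished", so a transient auth or network error would end the wait early, and it has no upper bound. A possible variant is sketched below. The function name `wait_for_cluster_deletion` and the `POLL_INTERVAL_SECONDS` and `MAX_WAIT_SECONDS` variables are hypothetical, and the `NOT_FOUND` check assumes the same error text the script already greps for.

```shell
#!/usr/bin/env bash
# Sketch only: the function and the two tuning variables are hypothetical
# names and are not part of the PR.

POLL_INTERVAL_SECONDS="${POLL_INTERVAL_SECONDS:-10}"
MAX_WAIT_SECONDS="${MAX_WAIT_SECONDS:-900}"

# Returns 0 once the cluster is gone, 1 on an unexpected describe error,
# and 2 if the cluster still exists after MAX_WAIT_SECONDS.
wait_for_cluster_deletion() {
  local cluster="$1" region="$2" project="$3"
  local waited=0 out

  while true; do
    if out=$(gcloud dataproc clusters describe "${cluster}" \
        --region="${region}" --project="${project}" \
        --format="value(status.state)" 2>&1); then
      : # describe succeeded, so the cluster still exists; keep waiting
    elif grep -q "NOT_FOUND" <<<"${out}"; then
      return 0
    else
      echo "Error while polling cluster '${cluster}': ${out}" >&2
      return 1
    fi

    if [ "${waited}" -ge "${MAX_WAIT_SECONDS}" ]; then
      echo "Timed out waiting for cluster '${cluster}' to be deleted." >&2
      return 2
    fi
    sleep "${POLL_INTERVAL_SECONDS}"
    waited=$((waited + POLL_INTERVAL_SECONDS))
  done
}
```

In the `DELETING` branch this would replace the bare `while ... describe` loop, for example `wait_for_cluster_deletion "${CLUSTER_NAME}" "${REGION}" "${PROJECT_ID}"`.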

IMAGE_VERSION="$3"
CLUSTER_NAME="$4"

SERVICE_ACCOUNT="toolbox-identity@${PROJECT_ID}.iam.gserviceaccount.com"


We should remove this. Cloud Build already exposes the build's service account as the built-in substitution $SERVICE_ACCOUNT_EMAIL, so we can map it to an env var in the step and pass it to the script:

- "SERVICE_ACCOUNT_EMAIL=$SERVICE_ACCOUNT_EMAIL"

- "PROJECT_ID=$PROJECT_ID"
- "CLUSTER_NAME=$_CLUSTER_NAME"
- "REGION=$_REGION"
- "IMAGE_VERSION=$_IMAGE_VERSION"


Suggested change
- "IMAGE_VERSION=$_IMAGE_VERSION"
- "IMAGE_VERSION=$_IMAGE_VERSION"
- "SERVICE_ACCOUNT_EMAIL=$SERVICE_ACCOUNT_EMAIL"

- "IMAGE_VERSION=$_IMAGE_VERSION"
script: |
  #!/usr/bin/env bash
  bash .ci/recreate_dataproc_cluster.sh "$${PROJECT_ID}" "$${REGION}" "$${IMAGE_VERSION}" "$${CLUSTER_NAME}"


Then pass in the service account here.
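
On the script side, accepting the service account from the caller instead of hard-coding it might look like the sketch below. It assumes the email arrives either as a fifth positional argument or through the `SERVICE_ACCOUNT_EMAIL` env var proposed above; the helper name `resolve_service_account` is hypothetical and not part of the PR.

```shell
#!/usr/bin/env bash
# Sketch only: resolve_service_account is a hypothetical helper.
# Prefers an explicit argument, falls back to the SERVICE_ACCOUNT_EMAIL
# env var, and fails loudly if neither is set.

resolve_service_account() {
  local from_arg="${1:-}"
  local resolved="${from_arg:-${SERVICE_ACCOUNT_EMAIL:-}}"

  if [ -z "${resolved}" ]; then
    echo "Service account not provided (argument or SERVICE_ACCOUNT_EMAIL)." >&2
    return 1
  fi
  printf '%s\n' "${resolved}"
}

# Possible use inside the script, replacing the hard-coded assignment:
#   SERVICE_ACCOUNT="$(resolve_service_account "${5:-}")"
```

Failing when the value is empty avoids silently creating the cluster with a different identity than the one the build intended.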

name: projects/$PROJECT_ID/locations/us-central1/workerPools/integration-testing

substitutions:
  _CLUSTER_NAME: "cluster-36"


Can we use a more descriptive name like dataproc-testing-cluster?

@duwenxin99


Let's create a folder in .ci/ to contain these Dataproc-specific scripts.


Development

Successfully merging this pull request may close these issues.

Periodically recreate Dataproc cluster used for integration testing

2 participants