Add network validation script executed in the sagemaker_ui_post_startup script #713

Open: wants to merge 1 commit into main

@marfriaz commented Jun 19, 2025

Description

This change introduces a network validation script that tests whether certain AWS services are reachable by making read-only API calls with a set timeout. If a call exceeds the timeout, the script infers a network misconfiguration, such as missing internet access or a missing VPC endpoint for that service. Calls that resolve within the timeout, whether they succeed or fail, are treated as evidence that the network path is set up correctly.
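The timeout inference can be sketched as follows. This is a minimal illustration, not the script from this PR: the function name `check_reachable`, the timeout values, and the use of coreutils `timeout` are all assumptions, and `false` and `sleep` stand in for real API calls.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify a command as reachable or unreachable
# based only on whether it finishes within the timeout.
check_reachable() {
  local timeout_secs="$1"
  shift
  # coreutils `timeout` exits with status 124 when the time limit is hit.
  timeout "$timeout_secs" "$@" > /dev/null 2>&1
  local status=$?
  if [[ $status -eq 124 ]]; then
    echo "unreachable"
  else
    # Any other exit code (success or an API error) means a response arrived.
    echo "reachable"
  fi
}

# Stand-ins for API calls: `false` fails quickly, `sleep 3` never answers in time.
check_reachable 1 false
check_reachable 1 sleep 3
```

A command that fails fast, for example with an access-denied error, still counts as reachable here, which matches the inference described above. One caveat of this sketch is that a command which itself exits with status 124 would be misclassified.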

The script checks the AWS services used by Compute Connections and Git. More specifically, it lists the DataZone connections to determine which services need to be checked.
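As a rough sketch of how the connection list could drive the set of checks: the `.props.sparkGlueProperties` and `.props.sparkEmrProperties.computeArn` fields and the `SPARK` type appear in the diff, but the overall payload shape (`.items[]`, `.type`), the sample data, and the helper name `services_for_connections` are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sample payload standing in for the DataZone connection list.
# The overall shape (.items[], .type) and the ARN are assumptions.
connections_json='{"items":[
  {"type":"SPARK","props":{"sparkGlueProperties":{}}},
  {"type":"SPARK","props":{"sparkEmrProperties":{"computeArn":"arn:aws:emr-serverless:us-east-1:111122223333:/applications/abc"}}}
]}'

# Hypothetical helper: print one service name per line for the given JSON.
services_for_connections() {
  local json="$1"
  jq -r '
    .items[]
    | select(.type == "SPARK")
    | (if .props.sparkGlueProperties then "Glue" else empty end),
      (if ((.props.sparkEmrProperties.computeArn // "") | contains("emr-serverless"))
       then "EMR Serverless" else empty end)
  ' <<< "$json" | sort -u
}

services_for_connections "$connections_json"
```

Each name printed would then map to one read-only command in `SERVICE_COMMANDS`; `sort -u` keeps a service from being checked twice when several connections need it.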

Unreachable services are aggregated and written to post-startup-status.json, which surfaces an error notification in the IDE.
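The aggregation step might look roughly like the sketch below. The schema of post-startup-status.json is not shown in this PR, so the field names (`status`, `message`), the message wording, and the helper name `write_status` are placeholders rather than the real format.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a status file listing unreachable services.
# The field names ("status", "message") are placeholders, not the real schema.
write_status() {
  local status_file="$1"
  shift
  local unreachable=("$@")
  if [[ ${#unreachable[@]} -eq 0 ]]; then
    printf '{"status":"success","message":""}\n' > "$status_file"
    return
  fi
  # Join the service names with ", " to form a single notification message.
  local joined
  joined=$(printf '%s, ' "${unreachable[@]}")
  joined=${joined%, }
  printf '{"status":"error","message":"Unreachable services: %s"}\n' "$joined" > "$status_file"
}

# Demo against a temporary file rather than the real status file.
status_file="$(mktemp)"
write_status "$status_file" "Glue" "EMR Serverless"
cat "$status_file"
```

This sketch assumes service names contain no characters that need JSON escaping; building the document with `jq -n --arg` would be the safer choice if that cannot be guaranteed.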

Testing

Tested in a SMUS portal under three network configurations: internet access, no internet access, and no internet access with VPC endpoints to DataZone and S3.


@marfriaz marfriaz requested a review from a team as a code owner June 19, 2025 23:51
# Initialize SERVICE_COMMANDS with always-needed STS and S3 checks
declare -A SERVICE_COMMANDS=(
  ["STS"]="aws sts get-caller-identity"
  ["S3"]="aws s3api list-buckets --max-items 1"

Contributor

Can we use list-objects at the project S3 path instead? This will result in a 4xx error as the project role doesn't have permissions to list buckets by default. While I understand the goal here is to check for what services are reachable, it would be better to invoke an API we expect to succeed so logs aren't polluted and 4xx metrics aren't impacted.
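One way the suggested change could look, as a sketch only: the helper name `s3_check_command` and the example URI are made up, and whether the project role can actually list that prefix is the reviewer's expectation, not something verified here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a list-objects-v2 command for an s3://bucket/prefix URI.
s3_check_command() {
  local uri="${1#s3://}"      # strip the scheme
  local bucket="${uri%%/*}"   # text before the first slash
  local prefix=""
  if [[ "$uri" == */* ]]; then
    prefix="${uri#*/}"        # text after the first slash
  fi
  local cmd="aws s3api list-objects-v2 --bucket $bucket --max-items 1"
  if [[ -n "$prefix" ]]; then
    cmd+=" --prefix $prefix"
  fi
  echo "$cmd"
}

# The URI below is invented; the real project path would come from the environment.
s3_check_command "s3://example-project-bucket/some/project/path"
```

The resulting string could replace the `list-buckets` entry under the `"S3"` key, so the check exercises the same endpoint with a call the role is more likely to be allowed to make.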

if [[ "$type" == "SPARK" ]]; then
  # If sparkGlueProperties present, add Glue check
  if echo "$item" | jq -e '.props.sparkGlueProperties' > /dev/null; then
    SERVICE_COMMANDS["Glue"]="aws glue get-crawlers --max-items 1"

Contributor

Same comment here: can we use glue get-catalogs or get-databases instead?

# Check for emr-serverless in sparkEmrProperties.computeArn for EMR Serverless check
emr_arn=$(echo "$item" | jq -r '.props.sparkEmrProperties.computeArn // empty')
if [[ "$emr_arn" == *"emr-serverless"* ]]; then
  SERVICE_COMMANDS["EMR Serverless"]="aws emr-serverless list-applications --max-results 1"

Contributor

Same thing here. Can we use get-application or another EMR Serverless API that the project role has permission to call?

# Optionally add CodeConnections if S3 storage flag is true (Git storage)
if [[ "$is_s3_storage" == "1" ]]; then
  SERVICE_COMMANDS["CodeConnections"]="aws codeconnections list-hosts --max-results 1"

Contributor

I haven't seen the managed policy updates for S3 storage; will those include this API?
