Conversation

@owenowenisme
Member

@owenowenisme owenowenisme commented Apr 22, 2025

Why are these changes needed?

This PR only demonstrates how I determined the appropriate amount of memory and CPU for the API server e2e tests, so it shouldn't be merged.

How I test them

I only ran the measurements on the tests that need compute resources.

I used the Kubernetes native metrics-server to collect the CPU and memory usage of each pod, querying it with kubectl top pod -n <namespace> at a 5-second interval (metrics-server itself scrapes metrics at a 15-second interval, so querying every 5 seconds is sufficient).
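
For illustration, here is a minimal sketch of what such a measurement loop might look like. This is not the actual utils.go helper; pollPeakUsage, its arguments, and the 60-second measurement window in main are hypothetical, and it assumes kubectl is on the PATH:

package main

import (
	"bufio"
	"bytes"
	"fmt"
	"math"
	"os/exec"
	"strconv"
	"strings"
	"time"
)

// pollPeakUsage shells out to `kubectl top pod` every five seconds and tracks
// the peak CPU (milli-cores) and memory (Mi) seen across all pods in the
// namespace, until the stop channel is closed.
func pollPeakUsage(namespace string, stop <-chan struct{}) (peakCPUm, peakMemMi float64) {
	ticker := time.NewTicker(5 * time.Second) // metrics-server itself scrapes every 15s
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return peakCPUm, peakMemMi
		case <-ticker.C:
			out, err := exec.Command("kubectl", "top", "pod", "-n", namespace, "--no-headers").Output()
			if err != nil {
				continue // metrics may not be available yet; try again next tick
			}
			scanner := bufio.NewScanner(bytes.NewReader(out))
			for scanner.Scan() {
				// Each line looks like: <pod-name> <cpu, e.g. 203m> <memory, e.g. 698Mi>
				fields := strings.Fields(scanner.Text())
				if len(fields) < 3 {
					continue
				}
				cpu, _ := strconv.ParseFloat(strings.TrimSuffix(fields[1], "m"), 64)
				mem, _ := strconv.ParseFloat(strings.TrimSuffix(fields[2], "Mi"), 64)
				peakCPUm = math.Max(peakCPUm, cpu)
				peakMemMi = math.Max(peakMemMi, mem)
			}
		}
	}
}

func main() {
	stop := make(chan struct{})
	go func() {
		time.Sleep(60 * time.Second) // measure for one minute
		close(stop)
	}()
	cpu, mem := pollPeakUsage("default", stop)
	fmt.Printf("Peak CPU usage: %.1fm\nPeak Memory usage: %.1fMi\n", cpu, mem)
}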

Result: TL;DR, CPU: 1 and memory: 1Gi would be enough. The highest peaks observed across all the tests below are 655m CPU and 796Mi memory, both comfortably under 1 CPU and 1Gi.

(The compute template is currently set to CPU: 2 and memory: 4Gi.)

  • cluster_server_e2e_test.go/TestCreateClusterEndpoint
=== RUN   TestCreateClusterEndpoint/Create_a_cluster_without_volumes
    utils.go:83: Found condition 'RayClusterProvisioned' for ray cluster 'bunny'
    utils.go:167: Metrics result:
        bunny-head-ndplw   203m         698Mi           
        bunny-head-ndplw   203m         698Mi           
        bunny-head-ndplw   203m         698Mi           
        
    utils.go:168: 
        Peak CPU usage: 203.0m
        Peak Memory usage: 698.0Mi
--- PASS: TestCreateClusterEndpoint/Create_a_cluster_without_volumes (32.59s)
=== RUN   TestCreateClusterEndpoint/Create_cluster_with_config_map_volume
    utils.go:83: Found condition 'RayClusterProvisioned' for ray cluster 'lioness'
    utils.go:167: Metrics result:
        bunny-head-ndplw   203m         698Mi           
        bunny-head-ndplw              79m          702Mi           
        bunny-small-wg-worker-6mn46   155m         244Mi           
        bunny-head-ndplw              79m          702Mi           
        bunny-small-wg-worker-6mn46   155m         244Mi           
        bunny-head-ndplw              79m          702Mi           
        bunny-small-wg-worker-6mn46   155m         244Mi           
        bunny-head-ndplw              66m          706Mi           
        bunny-small-wg-worker-6mn46   17m          149Mi           
        lioness-head-6ffhg            378m         697Mi           
        bunny-head-ndplw              66m          706Mi           
        bunny-small-wg-worker-6mn46   17m          149Mi           
        lioness-head-6ffhg            378m         697Mi           
        bunny-small-wg-worker-6mn46   17m          149Mi           
        lioness-head-6ffhg            378m         697Mi           
        lioness-head-6ffhg   65m          701Mi           
        lioness-head-6ffhg   65m          701Mi           
        lioness-head-6ffhg   65m          701Mi           
        lioness-head-6ffhg   52m          698Mi           
        lioness-head-6ffhg   52m          698Mi           
        lioness-head-6ffhg   52m          698Mi           
        
    utils.go:168: 
        Peak CPU usage: 378.0m
        Peak Memory usage: 706.0Mi
--- PASS: TestCreateClusterEndpoint/Create_cluster_with_config_map_volume (61.09s)
=== RUN   TestCreateClusterEndpoint/Create_cluster_with_no_workers
    utils.go:83: Found condition 'RayClusterProvisioned' for ray cluster 'macaw'
    utils.go:167: Metrics result:
        lioness-head-6ffhg   52m          698Mi           
        lioness-head-6ffhg              62m          466Mi           
        lioness-small-wg-worker-xmnhc   113m         234Mi           
        lioness-head-6ffhg              62m          466Mi           
        lioness-small-wg-worker-xmnhc   113m         234Mi           
        
    utils.go:168: 
        Peak CPU usage: 113.0m
        Peak Memory usage: 698.0Mi
  • cluster_server_autoscaler_e2e_test.go
=== RUN   TestCreateClusterAutoscaler
    utils.go:167: Metrics result:
        warthog-head-r6wmb   655m         746Mi           
        warthog-head-r6wmb   655m         746Mi           
        warthog-head-r6wmb   655m         746Mi           
        warthog-head-r6wmb   156m         780Mi           
        warthog-head-r6wmb   156m         780Mi           
        warthog-head-r6wmb   156m         780Mi           
        warthog-head-r6wmb   135m         788Mi           
        warthog-head-r6wmb   135m         788Mi           
        warthog-head-r6wmb   135m         788Mi           
        warthog-head-r6wmb              85m          794Mi           
        warthog-small-wg-worker-lkld4   38m          289Mi           
        warthog-head-r6wmb              85m          794Mi           
        warthog-small-wg-worker-lkld4   38m          289Mi           
        warthog-head-r6wmb              85m          794Mi           
        warthog-small-wg-worker-lkld4   38m          289Mi           
        warthog-head-r6wmb              84m          796Mi           
        warthog-small-wg-worker-lkld4   128m         133Mi           
        warthog-head-r6wmb              84m          796Mi           
        warthog-small-wg-worker-lkld4   128m         133Mi           
        warthog-head-r6wmb              84m          796Mi           
        warthog-small-wg-worker-lkld4   128m         133Mi           
        warthog-head-r6wmb              88m          566Mi           
        warthog-small-wg-worker-lkld4   5m           107Mi           
        warthog-head-r6wmb              88m          566Mi           
        warthog-small-wg-worker-lkld4   5m           107Mi           
        warthog-head-r6wmb              88m          566Mi           
        warthog-small-wg-worker-lkld4   5m           107Mi           
        
    utils.go:168: 
        Peak CPU usage: 655.0m
        Peak Memory usage: 796.0Mi
--- PASS: TestCreateClusterAutoscaler (105.75s)
  • job_server_e2e_test.go/TestCreateJobWithDisposableClusters
=== RUN   TestCreateJobWithDisposableClusters/Create_a_running_sample_job
    utils.go:167: Metrics result:
        frog-raycluster-6ktk2-head-ckh6d   373m         502Mi           
        frog-raycluster-6ktk2-head-ckh6d   373m         502Mi           
        frog-raycluster-6ktk2-head-ckh6d   373m         502Mi           
        frog-raycluster-6ktk2-head-ckh6d   55m          516Mi           
        frog-raycluster-6ktk2-head-ckh6d   55m          516Mi           
        frog-raycluster-6ktk2-head-ckh6d   55m          516Mi           
        frog-raycluster-6ktk2-head-ckh6d   63m          515Mi           
        frog-raycluster-6ktk2-head-ckh6d   63m          515Mi           
        frog-raycluster-6ktk2-head-ckh6d   63m          515Mi           
        frog-raycluster-6ktk2-head-ckh6d              432m         621Mi           
        frog-raycluster-6ktk2-small-wg-worker-pf7zn   148m         489Mi           
        frog-wct6w                                    301m         198Mi           
        frog-raycluster-6ktk2-head-ckh6d              432m         621Mi           
        frog-raycluster-6ktk2-small-wg-worker-pf7zn   148m         489Mi           
        frog-wct6w                                    301m         198Mi           
        
    utils.go:168: Peak CPU usage: 432.0m
        Peak Memory usage: 621.0Mi
--- PASS: TestCreateJobWithDisposableClusters/Create_a_running_sample_job (64.10s)

Next step

Lower the resource values in CreateComputeTemplate from the current CPU: 2 and memory: 4Gi to CPU: 1 and memory: 1Gi:

// CreateComputeTemplate creates the compute template that the e2e tests run against.
func (e2etc *End2EndTestingContext) CreateComputeTemplate(t *testing.T) {
	computeTemplateRequest := &api.CreateComputeTemplateRequest{
		ComputeTemplate: &api.ComputeTemplate{
			Name:      e2etc.computeTemplateName,
			Namespace: e2etc.namespaceName,
			Cpu:       2, // change to 1
			Memory:    4, // change to 1 (Gi)
		},
		Namespace: e2etc.namespaceName,
	}

	_, _, err := e2etc.kuberayAPIServerClient.CreateComputeTemplate(computeTemplateRequest)
	require.NoErrorf(t, err, "No error expected while creating a compute template (%s, %s)", e2etc.namespaceName, e2etc.computeTemplateName)
}
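
For reference, the template with the proposed values would read as follows (a sketch; it assumes, as the comments above imply, that the Memory field is expressed in Gi):

ComputeTemplate: &api.ComputeTemplate{
	Name:      e2etc.computeTemplateName,
	Namespace: e2etc.namespaceName,
	Cpu:       1, // covers the 655m peak observed above
	Memory:    1, // 1Gi covers the 796Mi peak observed above
},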

Related issue number

#3426

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@owenowenisme
Member Author

CI error is expected.

@owenowenisme
Member Author

@dentiny PTAL

Contributor

@dentiny dentiny left a comment


Thanks for the investigation! Looks good to me!
