[API server] performance improvement in local mode #5039

Merged: 26 commits from burstable-worker into master on Apr 8, 2025

Conversation

aylei
Collaborator

@aylei aylei commented Mar 26, 2025

Given the same resources, the max parallelism cannot be exactly as good as 0.8.0 due to the overhead of the uvicorn server and the status refresh daemon (see the test results). But this branch can already achieve similar concurrency if the user passes --async to the CLI (so there is no client-process overhead), so I think this closes #4979

Test

All the following tests are done with AWS, Azure and GCP enabled, but with --cloud=aws set so that instances are launched only on AWS.

Low resources: 0.5c 750MB

Setup: launch an AWS c6i.large instance and manually start a Docker container with a 0.5 CPU / 750MB resource limit

For the low-resource environment, we focus on whether sky can provide performance similar to 0.8.0:

  • This branch: supports 3 concurrent sky launch --async; 4 concurrent sky launch cause one of the worker processes to be OOM-killed, and an extra sky status or sky logs also causes the launch worker to be OOM-killed
  • 0.8.0: supports 3 concurrent sky launch &, each taking about 210 MB peak RSS; 4 concurrent sky launch cause OOM; there is still memory to run sky logs and sky status without blocking
  • master: supports 1 long request (e.g. sky launch) and 1 short request (e.g. sky logs) in parallel

Compared to 0.8.0, the overhead of this branch is about 200MB, including a single uvicorn process and a status refresh daemon process.

Common cases: 4c16g

Setup: launch an AWS c6i.4xlarge instance and manually start a Docker container with a 4 CPU / 16GB resource limit

For an environment with relatively sufficient resources for local usage, we focus on:

  1. whether the parallelism can burst to a high number if needed;
  2. the resource consumption after the burst peak.
  • This branch: up to 30 parallel launches (each worker process peaks at about 250MB); resident memory after the peak is about 3GB (4 long workers + 17 short workers)
  • 0.8.0: similar to this branch, but with no resident memory after the peak
  • master: supports 4 long requests and 17 short requests in parallel

For the common case, the current branch has about 500MB memory overhead compared to 0.8.0. Beyond the uvicorn process and the status refresh daemon process, the extra ~300MB comes from:

  • long worker process
  • short worker process
  • Queue server process

Screenshot of htop:

[image: htop screenshot]

Future work

  1. Better OOM handling: in 0.8.0, the visibility of a CLI process being OOM-killed is good: the user just sees it in the terminal or in ps. The API server, however, hides workers from the user, which needs some UX improvement;
  2. More cooperative operations: the overhead of a dedicated process for each request is significant; making operations cooperative would improve performance in all scenarios (see the sketch below).
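
As a rough illustration of what "cooperative" could mean (a sketch under assumptions, not the planned design; the handler and timings below are made up), short, I/O-bound requests could be served as coroutines sharing a single process and event loop instead of each occupying a dedicated worker process:

import asyncio


async def handle_short_request(request_id: int) -> str:
    """Stand-in for a short, I/O-bound request (e.g. a status query)."""
    await asyncio.sleep(0.1)  # Simulated I/O wait; no dedicated process needed.
    return f'request {request_id} done'


async def main() -> None:
    # 100 short requests share one process and one event loop, instead of
    # paying the memory cost of 100 worker processes.
    results = await asyncio.gather(
        *(handle_short_request(i) for i in range(100)))
    print(f'{len(results)} requests handled cooperatively')


if __name__ == '__main__':
    asyncio.run(main())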

Tests

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

Queue Scheduling Benchmark

Run benchmark:

python tests/load_tests/test_queue_dispatcher.py

Result:

Benchmark Results:
--------------------------------------------------
Process queue + 1 process dispatcher:
  Requests/sec:  451.57
Local thread queue + 1 thread dispatcher:
  Requests/sec:  502.04
Process queue + 1 thread dispatcher:
  Requests/sec:  468.69
Process queue + 10 process dispatcher:
  Requests/sec:  411.62
Local thread queue + 10 thread dispatcher (in one process):
  Requests/sec:  410.38

A simple benchmark is added to check whether it is okay to switch to a thread-based queue server and dispatcher in a low-resource environment.

Interestingly, multi-process dispatchers do not outperform the single-process/thread dispatcher on my laptop; this is worth digging into when we do a comprehensive queue optimization.
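
For readers who cannot run the script, a minimal sketch of the kind of comparison it makes (illustrative only, not the actual tests/load_tests/test_queue_dispatcher.py; the no-op handler and request count are assumptions):

import multiprocessing
import queue
import threading
import time

NUM_REQUESTS = 10_000


def handle(_request) -> None:
    """Stand-in for submitting a request to the executor (no-op here)."""


def dispatch(q) -> None:
    """Dispatcher loop: pull requests until the sentinel (None) is seen."""
    while True:
        request = q.get()
        if request is None:
            break
        handle(request)


def bench(make_queue, make_dispatcher, name: str) -> None:
    q = make_queue()
    dispatcher = make_dispatcher(target=dispatch, args=(q,))
    dispatcher.start()
    start = time.perf_counter()
    for i in range(NUM_REQUESTS):
        q.put(i)
    q.put(None)  # Sentinel: tell the dispatcher to exit.
    dispatcher.join()
    elapsed = time.perf_counter() - start
    print(f'{name}: {NUM_REQUESTS / elapsed:.2f} requests/sec')


if __name__ == '__main__':
    bench(multiprocessing.Queue, multiprocessing.Process,
          'Process queue + 1 process dispatcher')
    bench(queue.Queue, threading.Thread,
          'Local thread queue + 1 thread dispatcher')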

@aylei aylei changed the title from "[API server] support burstable workers in local mode" to "[API server] performance improvement in local mode" on Mar 27, 2025
aylei added 3 commits March 27, 2025 20:20
Signed-off-by: Aylei <[email protected]>
Signed-off-by: Aylei <[email protected]>
@aylei aylei marked this pull request as ready for review March 27, 2025 13:24
@aylei
Collaborator Author

aylei commented Mar 27, 2025

/quicktest-core

2 similar comments
@aylei
Collaborator Author

aylei commented Mar 28, 2025

/quicktest-core

@aylei
Collaborator Author

aylei commented Mar 28, 2025

/quicktest-core

@aylei aylei requested a review from Michaelvll March 28, 2025 05:40
@aylei aylei force-pushed the burstable-worker branch from 9514720 to 79be076 on March 29, 2025 00:10
Signed-off-by: Aylei <[email protected]>
@aylei
Collaborator Author

aylei commented Mar 29, 2025

/quicktest-core

Collaborator

@Michaelvll Michaelvll left a comment

This is awesome @aylei! I like the BurstableExecutor. One thing to discuss is whether we should just completely move the dispatcher to a thread, and (not urgent, if there is no latency difference) whether we should have multiple dispatchers in parallel. : )
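
The general idea of a burstable executor can be sketched roughly as follows (a hypothetical illustration, not the PR's BurstableExecutor; the class name, counters, and burst policy are assumptions): keep a small pool of resident workers for the guaranteed parallelism, and spawn short-lived extra worker processes during a burst, tearing each down once its request finishes.

import concurrent.futures
import threading
from typing import Any, Callable


class BurstableExecutorSketch:
    """Illustrative sketch only: `guaranteed` resident workers plus
    short-lived burst workers that are torn down after their task."""

    def __init__(self, guaranteed: int, burst_limit: int) -> None:
        self._resident = concurrent.futures.ProcessPoolExecutor(
            max_workers=guaranteed)
        self._guaranteed = guaranteed
        self._burst_limit = burst_limit
        self._inflight = 0
        self._lock = threading.Lock()

    def submit(self, fn: Callable[..., Any], *args, **kwargs):
        with self._lock:
            if self._inflight < self._guaranteed:
                # A resident worker is free: reuse the long-lived pool.
                executor, one_off = self._resident, False
            elif self._inflight < self._guaranteed + self._burst_limit:
                # Burst: spawn a dedicated worker process for this request.
                executor = concurrent.futures.ProcessPoolExecutor(max_workers=1)
                one_off = True
            else:
                raise RuntimeError('Burst limit exceeded')
            self._inflight += 1

        future = executor.submit(fn, *args, **kwargs)

        def _cleanup(_fut) -> None:
            with self._lock:
                self._inflight -= 1
            if one_off:
                # Release the burst worker's memory once the request is done.
                executor.shutdown(wait=False)

        future.add_done_callback(_cleanup)
        return future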

Comment on lines +245 to +246
while True:
    self.process_request(executor, queue)
Collaborator

Previously, there were multiple workers pulling from the request queue and submitting the requests to the executor, i.e. we used multiple dispatchers to handle the requests in the queue, but now there is a single thread doing that. Would that cause any performance issues (increased latency) when many requests come in, especially many concurrent short requests such as sky status?

If needed, we can even start multiple threads/async events to dispatch, if the performance is a concern.

Collaborator Author

Great idea! I added a simple benchmark script to test the scheduling performance. According to the result (updated in the PR description), there is no performance regression when there is little GIL contention. Since the current PR only runs the scheduler in a thread in low-resource mode (usually with few CPU cores), GIL contention is roughly equivalent to CPU contention and the benchmark result applies.

For more general cases where GIL contention might be a concern, i.e. high concurrency on a high-end machine, I think we can follow up in another pull request along with other enhancements like a queue in SQLite, wdyt?

Collaborator Author

Follow up: #5097

Comment on lines 555 to 565
def run_worker_in_background(worker: RequestWorker):
    if local_worker:
        # Use daemon thread for automatic cleanup.
        thread = threading.Thread(target=worker.run, daemon=True)
        thread.start()
    else:
        # Cannot use daemon process since daemon process cannot create
        # sub-processes, so we manually manage the cleanup.
        worker_proc = multiprocessing.Process(target=worker.run)
        worker_proc.start()
        sub_procs.append(worker_proc)
Collaborator

This is a good point. Previously we used multiple processes for the workers (dispatchers), because we were afraid of the GIL causing issues with the speed of dispatching requests to request processes. Since we now use a single thread (process) for dispatching, or may have multi-thread dispatching in the RequestWorker (may want to implement), do we still want to start worker_proc in a new process?

Maybe we should just keep the underlying executor on processes, but have the worker (dispatcher) use asyncio?

Maybe for the future: our separate request queue is a bit redundant considering we already have a request database. How do you feel about just using that request DB as the queue, or storing the queue in another table in the DB? That way, we could save the resources spent on an in-memory queue.
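
A rough sketch of what a DB-backed queue could look like (hypothetical; the table name, columns, and SQLite backend are assumptions, and a real version would need an atomic claim step across processes):

import sqlite3
from typing import Optional


class DBQueueSketch:
    """Illustrative only: a minimal request queue stored in a SQLite table,
    as an alternative to a dedicated in-memory queue process."""

    def __init__(self, path: str = 'requests.db') -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS queue ("
            "  id INTEGER PRIMARY KEY AUTOINCREMENT,"
            "  request_id TEXT,"
            "  status TEXT DEFAULT 'pending')")
        self._conn.commit()

    def put(self, request_id: str) -> None:
        self._conn.execute('INSERT INTO queue (request_id) VALUES (?)',
                           (request_id,))
        self._conn.commit()

    def get(self) -> Optional[str]:
        # Note: SELECT-then-UPDATE is not atomic across processes; a real
        # implementation would claim the row in a single transaction.
        row = self._conn.execute(
            "SELECT id, request_id FROM queue "
            "WHERE status = 'pending' ORDER BY id LIMIT 1").fetchone()
        if row is None:
            return None
        self._conn.execute(
            "UPDATE queue SET status = 'dispatched' WHERE id = ?", (row[0],))
        self._conn.commit()
        return row[1]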

Collaborator Author

@aylei aylei Mar 31, 2025

Sounds good! How about consolidating the discussion into #5039 (comment)?

@aylei aylei requested a review from Michaelvll March 31, 2025 15:54
@aylei
Collaborator Author

aylei commented Mar 31, 2025

@Michaelvll This PR is ready for another round of review, thanks!

@aylei
Collaborator Author

aylei commented Apr 2, 2025

  1. More comments about LOCAL mode, which will be deprecated in the future
  2. Unify the dispatcher to a thread

@aylei
Collaborator Author

aylei commented Apr 2, 2025

/quicktest-core

Collaborator

@Michaelvll Michaelvll left a comment

Thanks @aylei for the quick update! This is awesome! LGTM with some minor comments.

@aylei
Collaborator Author

aylei commented Apr 3, 2025

/quicktest-core

Collaborator

@Michaelvll Michaelvll left a comment

Thanks for the quick fix @aylei! We should also run the smoke tests on this PR to make sure non-deploy mode works as expected.

@aylei
Collaborator Author

aylei commented Apr 3, 2025

/smoke-test --remote-server

@aylei
Collaborator Author

aylei commented Apr 3, 2025

/smoke-test --remote-server -k test_minimal

@aylei
Collaborator Author

aylei commented Apr 3, 2025

/smoke-test --remote-server

@Michaelvll
Collaborator

btw, we should also run the smoke tests for the non-remote-server case : )

@aylei
Collaborator Author

aylei commented Apr 7, 2025

/smoke-test --remote-server -k test_requests_scheduling

Signed-off-by: Aylei <[email protected]>
@aylei
Collaborator Author

aylei commented Apr 7, 2025

/smoke-test --remote-server -k test_requests_scheduling

https://buildkite.com/skypilot-1/smoke-tests/builds/565 passed; only a single case is tested to work around #5126

@aylei
Collaborator Author

aylei commented Apr 7, 2025

@aylei
Collaborator Author

aylei commented Apr 7, 2025

/smoke-test -k test_skyserve_rolling_update

@aylei
Collaborator Author

aylei commented Apr 7, 2025

btw, we should also run the smoke tests for the non-remote-server case : )

Sure, I was trying to pass the remote one since the local one had already passed. Now both the remote and local cases have passed.

@aylei
Collaborator Author

aylei commented Apr 7, 2025

/smoke-test --remote-server
https://buildkite.com/skypilot-1/smoke-tests/builds/592

@aylei
Collaborator Author

aylei commented Apr 7, 2025

/smoke-test --remote-server -k test_file_mounts
/smoke-test --remote-server -k test_aws_storage_mounts_with_stop_only_mount
/smoke-test --remote-server -k test_azure_storage_mounts_with_stop
/smoke-test --remote-server -k test_docker_storage_mounts
https://buildkite.com/skypilot-1/smoke-tests/builds/593

@aylei aylei added this to the v0.9.0 milestone Apr 8, 2025
@aylei
Collaborator Author

aylei commented Apr 8, 2025

/smoke-test --remote-server -k test_file_mounts
/smoke-test --remote-server -k test_aws_storage_mounts_with_stop_only_mount
/smoke-test --remote-server -k test_azure_storage_mounts_with_stop
/smoke-test --remote-server -k test_docker_storage_mounts

https://buildkite.com/skypilot-1/smoke-tests/builds/619
Merge master to unblock the file mounts test

@aylei aylei merged commit e0674be into master Apr 8, 2025
21 checks passed
@aylei aylei deleted the burstable-worker branch April 8, 2025 06:46

Successfully merging this pull request may close these issues.

[API server] Local API server has lower concurrency compared to 0.8.0