[API server] performance improvement in local mode #5039

Merged: 26 commits from burstable-worker into master on Apr 8, 2025

Conversation

aylei
Collaborator

@aylei aylei commented Mar 26, 2025

Given the same resources, the max parallelism cannot be exactly as good as 0.8.0 due to the overhead of the uvicorn server and the status refresh daemon (see the test results). But this branch can already achieve similar concurrency if the user passes --async to the CLI (so there is no client-process overhead), so I think this closes #4979

Test

All the following tests are done with AWS, Azure and GCP enabled, but with --cloud=aws set so that instances are launched only on AWS.

Low resources: 0.5c 750MB

Setup: launch an AWS c6i.large instance and manually start a Docker container with a 0.5 CPU / 750MB resource limit

For the low-resource environment, we focus on whether sky can provide performance similar to 0.8.0:

  • This branch: supports 3 concurrent sky launch --async; 4 concurrent sky launch cause one of the worker processes to be OOM-killed, and an extra sky status or sky logs also causes the launch worker to be OOM-killed
  • 0.8.0: supports 3 concurrent sky launch &, each taking about 210 MB peak RSS; 4 concurrent sky launch cause OOM; there is still memory to run sky logs and sky status without blocking
  • master: supports 1 long request (e.g. sky launch) and 1 short request (e.g. sky logs) in parallel

Compared to 0.8.0, the overhead of this branch is about 200MB, including a single uvicorn process and a status refresh daemon process.

Common cases: 4c16g

Setup: launch an AWS c6i.4xlarge instance and manually start a Docker container with a 4 CPU / 16GB resource limit

For an environment with relatively sufficient resources for local usage, we focus on:

  1. whether the parallelism can burst to a high number if needed;
  2. the resource consumption after the burst peak.
  • This branch: up to 30 parallel launches (each worker process peaks at about 250MB); resident memory after the peak is about 3GB (4 long workers + 17 short workers)
  • 0.8.0: similar to this branch, but with no resident memory after the peak
  • master: supports 4 long requests and 17 short requests in parallel

For the common case, the current branch has about 500MB memory overhead compared to 0.8.0. Beyond the uvicorn process and the status refresh daemon process, the extra ~300MB comes from:

  • long worker process
  • short worker process
  • Queue server process

Screenshot of htop:

[image: htop screenshot]

Future work

  1. Better OOM handling: in 0.8.0, the visibility of a CLI process being OOM-killed is good: the user just sees it in the terminal or in ps. The API server, however, hides workers from the user, which needs some UX improvement;
  2. More cooperative operations: the overhead of a dedicated process for each request is significant; making operations cooperative would improve performance in all scenarios (see the sketch below).
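
As a rough illustration of what "cooperative" could mean (a sketch under assumptions, not the planned design; the handler and timings below are made up), short, I/O-bound requests could be served as coroutines sharing a single process and event loop instead of each occupying a dedicated worker process:

import asyncio


async def handle_short_request(request_id: int) -> str:
    """Stand-in for a short, I/O-bound request (e.g. a status query)."""
    await asyncio.sleep(0.1)  # Simulated I/O wait; no dedicated process needed.
    return f'request {request_id} done'


async def main() -> None:
    # 100 short requests share one process and one event loop, instead of
    # paying the memory cost of 100 worker processes.
    results = await asyncio.gather(
        *(handle_short_request(i) for i in range(100)))
    print(f'{len(results)} requests handled cooperatively')


if __name__ == '__main__':
    asyncio.run(main())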

Tests

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

Queue Scheduling Benchmark

Run benchmark:

python tests/load_tests/test_queue_dispatcher.py

Result:

Benchmark Results:
--------------------------------------------------
Process queue + 1 process dispatcher:
  Requests/sec:  451.57
Local thread queue + 1 thread dispatcher:
  Requests/sec:  502.04
Process queue + 1 thread dispatcher:
  Requests/sec:  468.69
Process queue + 10 process dispatcher:
  Requests/sec:  411.62
Local thread queue + 10 thread dispatcher (in one process):
  Requests/sec:  410.38

A simple benchmark is added to check whether it is okay to switch to a thread-based queue server and dispatcher in a low-resource environment.

Interestingly, multi-process dispatchers do not outperform the single-process/thread dispatcher on my laptop; this is worth digging into when we do a comprehensive queue optimization.
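
For readers who cannot run the script, a minimal sketch of the kind of comparison it makes (illustrative only, not the actual tests/load_tests/test_queue_dispatcher.py; the no-op handler and request count are assumptions):

import multiprocessing
import queue
import threading
import time

NUM_REQUESTS = 10_000


def handle(_request) -> None:
    """Stand-in for submitting a request to the executor (no-op here)."""


def dispatch(q) -> None:
    """Dispatcher loop: pull requests until the sentinel (None) is seen."""
    while True:
        request = q.get()
        if request is None:
            break
        handle(request)


def bench(make_queue, make_dispatcher, name: str) -> None:
    q = make_queue()
    dispatcher = make_dispatcher(target=dispatch, args=(q,))
    dispatcher.start()
    start = time.perf_counter()
    for i in range(NUM_REQUESTS):
        q.put(i)
    q.put(None)  # Sentinel: tell the dispatcher to exit.
    dispatcher.join()
    elapsed = time.perf_counter() - start
    print(f'{name}: {NUM_REQUESTS / elapsed:.2f} requests/sec')


if __name__ == '__main__':
    bench(multiprocessing.Queue, multiprocessing.Process,
          'Process queue + 1 process dispatcher')
    bench(queue.Queue, threading.Thread,
          'Local thread queue + 1 thread dispatcher')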

@aylei aylei changed the title from "[API server] support burstable workers in local mode" to "[API server] performance improvement in local mode" on Mar 27, 2025
aylei added 3 commits March 27, 2025 20:20
Signed-off-by: Aylei <[email protected]>
Signed-off-by: Aylei <[email protected]>
@aylei aylei marked this pull request as ready for review March 27, 2025 13:24
@aylei
Collaborator Author

aylei commented Mar 27, 2025

/quicktest-core

2 similar comments
@aylei
Collaborator Author

aylei commented Mar 28, 2025

/quicktest-core

@aylei
Collaborator Author

aylei commented Mar 28, 2025

/quicktest-core

@aylei aylei requested a review from Michaelvll March 28, 2025 05:40
@aylei aylei force-pushed the burstable-worker branch from 9514720 to 79be076 on March 29, 2025 00:10
Signed-off-by: Aylei <[email protected]>
@aylei
Collaborator Author

aylei commented Mar 29, 2025

/quicktest-core

Collaborator

@Michaelvll Michaelvll left a comment

This is awesome @aylei! I like the BurstableExecutor. One thing to discuss is whether we should just completely move the dispatcher to a thread, and (not urgent, if there is no latency difference) whether we should have multiple dispatchers in parallel. : )
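
The general idea of a burstable executor can be sketched roughly as follows (a hypothetical illustration, not the PR's BurstableExecutor; the class name, counters, and burst policy are assumptions): keep a small pool of resident workers for the guaranteed parallelism, and spawn short-lived extra worker processes during a burst, tearing each down once its request finishes.

import concurrent.futures
import threading
from typing import Any, Callable


class BurstableExecutorSketch:
    """Illustrative sketch only: `guaranteed` resident workers plus
    short-lived burst workers that are torn down after their task."""

    def __init__(self, guaranteed: int, burst_limit: int) -> None:
        self._resident = concurrent.futures.ProcessPoolExecutor(
            max_workers=guaranteed)
        self._guaranteed = guaranteed
        self._burst_limit = burst_limit
        self._inflight = 0
        self._lock = threading.Lock()

    def submit(self, fn: Callable[..., Any], *args, **kwargs):
        with self._lock:
            if self._inflight < self._guaranteed:
                # A resident worker is free: reuse the long-lived pool.
                executor, one_off = self._resident, False
            elif self._inflight < self._guaranteed + self._burst_limit:
                # Burst: spawn a dedicated worker process for this request.
                executor = concurrent.futures.ProcessPoolExecutor(max_workers=1)
                one_off = True
            else:
                raise RuntimeError('Burst limit exceeded')
            self._inflight += 1

        future = executor.submit(fn, *args, **kwargs)

        def _cleanup(_fut) -> None:
            with self._lock:
                self._inflight -= 1
            if one_off:
                # Release the burst worker's memory once the request is done.
                executor.shutdown(wait=False)

        future.add_done_callback(_cleanup)
        return future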

Comment on lines +245 to +246
while True:
    self.process_request(executor, queue)
Collaborator

Previously, there were multiple workers pulling from the request queue and submitting the requests to the executor, i.e. we used multiple dispatchers to handle the requests in the queue, but now there is a single thread doing that. Would that cause any performance issues (increased latency) when many requests come in, especially many concurrent short requests such as sky status?

If needed, we can even start multiple threads/async events to dispatch, if the performance is a concern.

Collaborator Author

Great idea! I added a simple benchmark script to test the scheduling performance. According to the result (updated in the PR description), there is no performance regression when there is little GIL contention. Since the current PR only runs the scheduler in a thread in low-resource mode (usually with few CPU cores), GIL contention is roughly equivalent to CPU contention and the benchmark result applies.

For more general cases where GIL contention might be a concern, i.e. high concurrency on a high-end machine, I think we can follow up in another pull request along with other enhancements like a queue in SQLite, wdyt?

Collaborator Author

Follow up: #5097

Comment on lines 555 to 565
def run_worker_in_background(worker: RequestWorker):
    if local_worker:
        # Use daemon thread for automatic cleanup.
        thread = threading.Thread(target=worker.run, daemon=True)
        thread.start()
    else:
        # Cannot use daemon process since daemon process cannot create
        # sub-processes, so we manually manage the cleanup.
        worker_proc = multiprocessing.Process(target=worker.run)
        worker_proc.start()
        sub_procs.append(worker_proc)
Collaborator

This is a good point. Previously we used multiple processes for the workers (dispatchers), because we were afraid of the GIL causing issues with the speed of dispatching requests to request processes. Since we now use a single thread (process) for dispatching, or may have multi-thread dispatching in the RequestWorker (may want to implement), do we still want to start worker_proc in a new process?

Maybe we should just keep the underlying executor on processes, but have the worker (dispatcher) use asyncio?

Maybe for the future: our separate request queue is a bit redundant considering we already have a request database. How do you feel about just using that request DB as the queue, or storing the queue in another table in the DB? That way, we could save the resources spent on an in-memory queue.
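
A rough sketch of what a DB-backed queue could look like (hypothetical; the table name, columns, and SQLite backend are assumptions, and a real version would need an atomic claim step across processes):

import sqlite3
from typing import Optional


class DBQueueSketch:
    """Illustrative only: a minimal request queue stored in a SQLite table,
    as an alternative to a dedicated in-memory queue process."""

    def __init__(self, path: str = 'requests.db') -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS queue ("
            "  id INTEGER PRIMARY KEY AUTOINCREMENT,"
            "  request_id TEXT,"
            "  status TEXT DEFAULT 'pending')")
        self._conn.commit()

    def put(self, request_id: str) -> None:
        self._conn.execute('INSERT INTO queue (request_id) VALUES (?)',
                           (request_id,))
        self._conn.commit()

    def get(self) -> Optional[str]:
        # Note: SELECT-then-UPDATE is not atomic across processes; a real
        # implementation would claim the row in a single transaction.
        row = self._conn.execute(
            "SELECT id, request_id FROM queue "
            "WHERE status = 'pending' ORDER BY id LIMIT 1").fetchone()
        if row is None:
            return None
        self._conn.execute(
            "UPDATE queue SET status = 'dispatched' WHERE id = ?", (row[0],))
        self._conn.commit()
        return row[1]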

Collaborator Author

@aylei aylei Mar 31, 2025

Sounds good! How about consolidating the discussion into #5039 (comment)?

@aylei aylei requested a review from Michaelvll March 31, 2025 15:54
@aylei
Collaborator Author

aylei commented Mar 31, 2025

@Michaelvll This PR is ready for another round of review, thanks!

@aylei
Collaborator Author

aylei commented Apr 2, 2025

  1. More comments about LOCAL mode, which will be deprecated in the future
  2. Unify the dispatcher to a thread

@aylei
Collaborator Author

aylei commented Apr 2, 2025

/quicktest-core

Collaborator

@Michaelvll Michaelvll left a comment

Thanks @aylei for the quick update! This is awesome! LGTM with some minor comments.

@aylei
Collaborator Author

aylei commented Apr 3, 2025

/quicktest-core

Collaborator

@Michaelvll Michaelvll left a comment

Thanks for the quick fix @aylei! We should also run the smoke tests on this PR to make sure non-deploy mode works as expected.

@aylei
Collaborator Author

aylei commented Apr 3, 2025

/smoke-test --remote-server

@aylei
Collaborator Author

aylei commented Apr 3, 2025

/smoke-test --remote-server -k test_minimal

@aylei
Collaborator Author

aylei commented Apr 3, 2025

/smoke-test --remote-server

@Michaelvll
Collaborator

btw, we should also run the smoke tests for the non-remote-server case : )

@aylei
Collaborator Author

aylei commented Apr 7, 2025

/smoke-test --remote-server -k test_requests_scheduling

Signed-off-by: Aylei <[email protected]>
@aylei
Collaborator Author

aylei commented Apr 7, 2025

/smoke-test --remote-server -k test_requests_scheduling

https://buildkite.com/skypilot-1/smoke-tests/builds/565 passed; only a single case is tested to work around #5126

@aylei
Collaborator Author

aylei commented Apr 7, 2025

@aylei
Collaborator Author

aylei commented Apr 7, 2025

/smoke-test -k test_skyserve_rolling_update

@aylei
Collaborator Author

aylei commented Apr 7, 2025

btw, we should also run the smoke tests for the non-remote-server case : )

Sure, I was trying to pass the remote one since the local one had already passed. Now both the remote and local cases have passed.

@aylei
Collaborator Author

aylei commented Apr 7, 2025

/smoke-test --remote-server
https://buildkite.com/skypilot-1/smoke-tests/builds/592

@aylei
Collaborator Author

aylei commented Apr 7, 2025

/smoke-test --remote-server -k test_file_mounts
/smoke-test --remote-server -k test_aws_storage_mounts_with_stop_only_mount
/smoke-test --remote-server -k test_azure_storage_mounts_with_stop
/smoke-test --remote-server -k test_docker_storage_mounts
https://buildkite.com/skypilot-1/smoke-tests/builds/593

@aylei aylei added this to the v0.9.0 milestone Apr 8, 2025
@aylei
Collaborator Author

aylei commented Apr 8, 2025

/smoke-test --remote-server -k test_file_mounts
/smoke-test --remote-server -k test_aws_storage_mounts_with_stop_only_mount
/smoke-test --remote-server -k test_azure_storage_mounts_with_stop
/smoke-test --remote-server -k test_docker_storage_mounts

https://buildkite.com/skypilot-1/smoke-tests/builds/619
Merge master to unblock the file mounts test

@aylei aylei merged commit e0674be into master Apr 8, 2025
21 checks passed
@aylei aylei deleted the burstable-worker branch April 8, 2025 06:46

Successfully merging this pull request may close these issues.

[API server] Local API server has lower concurrency compared to 0.8.0