[API server] performance improvement in local mode #5039
Conversation
Signed-off-by: Aylei <[email protected]>
/quicktest-core
This is awesome @aylei! I like the BurstableExecutor. One thing to discuss is whether we should just completely move the dispatcher to a thread, and (not urgent, if no latency difference) whether we should have multiple dispatchers in parallel. : )
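The BurstableExecutor implementation itself is not shown in this thread. A minimal sketch of the idea, with hypothetical names (`BurstableExecutor`, `guaranteed`, `max_burst`) rather than the PR's actual code, could look like:

```python
# Hypothetical sketch of a "burstable" executor: a fixed-size thread pool
# that can temporarily spin up extra one-off threads when the pool is
# saturated. Names and behavior are illustrative, not the PR's code.
import threading
from concurrent.futures import Future, ThreadPoolExecutor


class BurstableExecutor:
    """Run up to `guaranteed` tasks in a pool; burst extra threads on demand."""

    def __init__(self, guaranteed: int, max_burst: int):
        self._pool = ThreadPoolExecutor(max_workers=guaranteed)
        self._guaranteed = guaranteed
        self._burst_sem = threading.Semaphore(max_burst)
        self._inflight = 0
        self._lock = threading.Lock()

    def _task_done(self) -> None:
        with self._lock:
            self._inflight -= 1

    def submit(self, fn, *args, **kwargs) -> Future:
        with self._lock:
            saturated = self._inflight >= self._guaranteed
            self._inflight += 1
        if saturated and self._burst_sem.acquire(blocking=False):
            # Pool is busy and burst budget remains: run on a temporary
            # daemon thread instead of queueing behind the pool.
            fut: Future = Future()

            def _run() -> None:
                try:
                    fut.set_result(fn(*args, **kwargs))
                except BaseException as exc:  # propagate via the future
                    fut.set_exception(exc)
                finally:
                    self._burst_sem.release()
                    self._task_done()

            threading.Thread(target=_run, daemon=True).start()
            return fut
        fut = self._pool.submit(fn, *args, **kwargs)
        fut.add_done_callback(lambda _: self._task_done())
        return fut
```

A burst task runs on a throwaway daemon thread only when the guaranteed pool is saturated and burst budget remains; otherwise it queues in the fixed pool.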
```python
while True:
    self.process_request(executor, queue)
```
Previously, there were multiple workers pulling from the request queue and submitting the requests to the executor, i.e. we used multiple dispatchers for handling the requests in the queue, but now a single thread does that. Would that cause any performance issue (increased latency) when many requests come in, especially many concurrent short requests, such as `sky status`, etc.?
If needed, we can even start multiple threads/async events to dispatch if the performance is a concern.
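The multi-dispatcher idea suggested here can be sketched as below; `start_dispatchers` and the convention of callables as requests are assumptions for illustration, not the actual RequestWorker API:

```python
# Sketch of the multi-dispatcher idea: several daemon threads pull from a
# shared request queue and hand work to one executor. The helper name and
# the callable-as-request convention are illustrative.
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def start_dispatchers(executor, request_queue, num_dispatchers=2):
    """Start daemon threads that drain `request_queue` into `executor`."""

    def dispatch_loop():
        while True:
            request = request_queue.get()
            if request is None:  # sentinel: stop this dispatcher
                break
            executor.submit(request)

    threads = [threading.Thread(target=dispatch_loop, daemon=True)
               for _ in range(num_dispatchers)]
    for t in threads:
        t.start()
    return threads
```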
Great idea! I added a simple benchmark script to test the scheduling performance. According to the result (updated in the PR description), there is no performance regression when there is little GIL contention. Since the current PR only runs the scheduler in a thread in low-resource mode (usually few CPU cores), GIL contention is roughly equal to CPU contention and the benchmark result applies.
For more general cases where GIL contention might be a concern, i.e. high concurrency on a high-end machine, I think we can follow up in another pull request along with other enhancements like a queue in sqlite, wdyt?
Follow up: #5097
sky/server/requests/executor.py (Outdated)
```python
def run_worker_in_background(worker: RequestWorker):
    if local_worker:
        # Use daemon thread for automatic cleanup.
        thread = threading.Thread(target=worker.run, daemon=True)
        thread.start()
    else:
        # Cannot use daemon process since daemon process cannot create
        # sub-processes, so we manually manage the cleanup.
        worker_proc = multiprocessing.Process(target=worker.run)
        worker_proc.start()
        sub_procs.append(worker_proc)
```
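The comment in the else branch relies on a documented multiprocessing restriction: a daemonic process is not allowed to have children, and `Process.start()` raises `AssertionError` inside the daemon. A standalone check of that behavior (the helper names are illustrative, and the `fork` start method is assumed, so this is POSIX-only):

```python
# Demonstrates that a daemon process cannot spawn children: the nested
# Process.start() raises AssertionError inside the daemonic child.
import multiprocessing


def _try_spawn_child(result_queue):
    try:
        multiprocessing.Process(target=print, args=('hello',)).start()
        result_queue.put('spawned ok')
    except AssertionError as exc:
        result_queue.put(str(exc))


def daemon_child_error() -> str:
    """Report what happens when a daemon process tries to spawn a child."""
    ctx = multiprocessing.get_context('fork')  # POSIX-only; avoids pickling
    q = ctx.Queue()
    proc = ctx.Process(target=_try_spawn_child, args=(q,), daemon=True)
    proc.start()
    proc.join()
    return q.get(timeout=10)


if __name__ == '__main__':
    print(daemon_child_error())
```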
This is a good point. Previously we used multiple processes for the workers (dispatchers) because we were afraid of the GIL slowing down dispatching requests to the request processes. Since we now use a single thread (process) for dispatching, or may implement multi-threaded dispatching in the RequestWorker, do we still want to start the worker_proc in a new process?
Maybe we should just keep the underlying executor using processes, but have the worker (dispatcher) use asyncio?
Maybe for the future: our separate request queue is a bit redundant considering we already have a request database. How do you feel about using the request DB as a queue, or storing the queue in another table in the DB? That way, we might save the resources spent on an in-memory queue.
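The DB-as-queue idea can be sketched with a plain SQLite table used as a FIFO; the schema and helper names below are illustrative, not SkyPilot's actual request table:

```python
# Sketch of a SQLite-backed FIFO queue: rows ordered by autoincrement id,
# with an atomic select-and-delete for dequeue. Schema is illustrative.
import sqlite3


def init_queue(conn: sqlite3.Connection) -> None:
    conn.execute('CREATE TABLE IF NOT EXISTS request_queue ('
                 'id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT)')
    conn.commit()


def enqueue(conn: sqlite3.Connection, payload: str) -> None:
    conn.execute('INSERT INTO request_queue (payload) VALUES (?)', (payload,))
    conn.commit()


def dequeue(conn: sqlite3.Connection):
    """Pop the oldest request, or return None if the queue is empty."""
    with conn:  # transaction: select-and-delete must be atomic
        row = conn.execute(
            'SELECT id, payload FROM request_queue ORDER BY id LIMIT 1'
        ).fetchone()
        if row is None:
            return None
        conn.execute('DELETE FROM request_queue WHERE id = ?', (row[0],))
        return row[1]
```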
Sounds good! How about consolidating the discussion to #5039 (comment) ?
@Michaelvll This PR is ready for another round of review, thanks!
/quicktest-core
Thanks @aylei for the quick update! This is awesome! LGTM with some minor comments.
Co-authored-by: Zhanghao Wu <[email protected]>
/quicktest-core
Thanks for the quick fix @aylei! We should also run the smoke tests on this PR to make sure non-deploy mode works as expected.
/smoke-test --remote-server
/smoke-test --remote-server -k test_minimal
/smoke-test --remote-server
btw, we should also run the smoke tests for the non-remote-server case : )
/smoke-test --remote-server -k test_requests_scheduling
/smoke-test --remote-server -k test_requests_scheduling https://buildkite.com/skypilot-1/smoke-tests/builds/565 passed; only testing a single case to work around #5126
/smoke-test -k test_skyserve_rolling_update
Sure, I was trying to pass the remote one since the local one passed previously. Now both remote and local cases pass.
/smoke-test --remote-server
/smoke-test --remote-server -k test_file_mounts
/smoke-test --remote-server -k test_file_mounts https://buildkite.com/skypilot-1/smoke-tests/builds/619
Given the same resources, the max parallelism cannot be exactly as good as 0.8.0 due to the overhead of the uvicorn server and the status refresh daemon (see the test result). But this branch can already achieve similar concurrency if the user provides `--async` to the CLI (so there is no overhead of a client process), so I think this closes #4979.

Test
All the following tests are done with aws, azure and gcp enabled, but with `--cloud=aws` set to launch instances only on AWS.

Low resources: 0.5c 750MB
For the low-resources env, we focus on whether `sky` can provide performance similar to `0.8.0`:
- `sky launch --async`: 4 concurrent `sky launch` cause one of the worker processes to be OOM-killed, and an extra `sky status` or `sky logs` causes the launch worker to be OOM-killed.
- `sky launch &`: each takes about 210 MB peak RSS; 4 concurrent `sky launch` cause OOM, and there is still memory to run `sky logs` and `sky status` without blocking.
- long requests (e.g. `sky launch`) in parallel and 1 short request (e.g. `sky logs`).

Compared to 0.8.0, the overhead of this branch is about 200MB, including a single uvicorn process and a status refresh daemon process.
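The peak-RSS figures quoted here can be measured with the standard `resource` module; a small helper, assuming a POSIX system (Linux reports `ru_maxrss` in KiB, macOS in bytes):

```python
# Measure the peak resident set size of the current process in bytes.
# ru_maxrss units differ by platform: KiB on Linux, bytes on macOS.
import resource
import sys


def peak_rss_bytes() -> int:
    """Peak resident set size of the current process, in bytes."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return rss if sys.platform == 'darwin' else rss * 1024
```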
Common cases: 4c16g
For an environment that has relatively enough resources for local usage, the current branch has about 500MB memory overhead compared to 0.8.0; the extra 300MB comes on top of the uvicorn process and the status refresh daemon process.
Screenshot of `htop`:

Future works
- `ps`, but the API server hides workers from the user, which needs some UX improvement.

Tests
Tested (run the relevant ones):
- `bash format.sh`
- `/smoke-test` (CI) or `pytest tests/test_smoke.py` (local)
- `/smoke-test -k test_name` (CI) or `pytest tests/test_smoke.py::test_name` (local)
- `/quicktest-core` (CI) or `pytest tests/smoke_tests/test_backward_compat.py` (local)
(local)Queue Scheduling Benchmark
Run benchmark:
Result:
A simple benchmark is added to see whether it is okay to switch to a thread-based queue server and dispatcher in a low-resource env.
It is interesting that multiprocess dispatchers do not outperform the single-process/thread dispatcher on my laptop; worth digging deeper when we do comprehensive queue optimization.
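The benchmark script itself is not included in this thread. A minimal sketch of the same kind of measurement, timing one dispatcher thread draining a queue of no-op requests into a thread pool (function name, pool size, and request count are illustrative):

```python
# Micro-benchmark sketch: time how long a single dispatcher thread takes to
# drain a queue of no-op requests into a thread pool and finish them all.
import queue
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def bench_single_dispatcher(num_requests: int = 1000) -> float:
    """Return seconds to dispatch and finish `num_requests` no-op tasks."""
    req_queue: queue.Queue = queue.Queue()
    for _ in range(num_requests):
        req_queue.put(lambda: None)

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=4) as pool:

        def dispatch():
            while True:
                try:
                    fn = req_queue.get_nowait()
                except queue.Empty:
                    return
                pool.submit(fn)

        t = threading.Thread(target=dispatch)
        t.start()
        t.join()
    # Leaving the `with` block waits for all submitted tasks to finish.
    return time.perf_counter() - start
```

A process-based variant of the same loop would let one compare thread versus process dispatchers, which is the comparison described above.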