Stress Tests MCAD using KWOK #469
Open: vishakha-ramani wants to merge 20 commits into project-codeflare:main from vishakha-ramani:main
Changes from 11 commits
Commits
fbde315
Initial Commit
36a90e8
Create README for gpu-tests
vishakha-ramani ebe4d18
Update gpu-tests.md
vishakha-ramani aeb47e4
Stress tests
b5864f5
mcad stress-tests
6464fb4
mcad stress-tests
a43c26b
summer-tests
0c90578
reorganized stress tests
87f940e
Merge branch 'project-codeflare:main' into main
vishakha-ramani 937d6aa
sync perf-test
0f80364
Automated cleanup
cfaca58
Job complete check revised
2ca48b2
first gpu test
0017023
CDF plot for no MCAD system
531939a
Gpu and stress tests scripts
a66d621
Updated README
542eda9
Updated README
6f7b848
Updated README
081fd62
Minor edits
82308e8
Minor edits
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,160 @@ | ||
## MCAD GPU Request Performance test with KWOK | ||
This experiment assumes that you have a KWOK controller and an MCAD controller running inside a kind cluster. If not, follow [these installation instructions](https://github.com/vishakha-ramani/multi-cluster-app-dispatcher/blob/main/test/perf-test/simulatingnodesandappwrappers.md). | |
The MCAD service for GPU requests is a little weird. Here is the experiment I ran and what I observed: | |
1. Create two fake nodes with 8 GPUs each by running the script: | |
``` | ||
./nodes.sh | ||
``` | ||
2. Check that the requested number of nodes has started: | |
``` | ||
% kubectl get nodes | ||
NAME STATUS ROLES AGE VERSION | ||
kind-control-plane Ready control-plane 27d v1.27.1 | ||
kwok-node-1 Ready agent 7s fake | ||
kwok-node-2 Ready agent 7s fake | ||
``` | ||
|
||
3. Submit an AW job that wraps two pods, each requesting 8 GPUs. | |
``` | ||
% ./kwokmcadperf.sh | ||
Checking whether we have a valid cluster login or not... | ||
|
||
Nice, looks like you're logged in | ||
Checking MCAD Controller installation status | ||
|
||
Nice, MCAD Controller is installed | ||
Checking MCAD Controller installation status | ||
|
||
Nice, the KWOK Controller is installed | ||
|
||
How many fake KWOK appwrapper jobs do you want? 1 | ||
How many pods in a job? 2 | ||
How many GPUs do you want to allocate per pod? 8 | ||
jobs number is 1 | ||
Number of GPUs per pod: 8 | ||
Number of pods per AppWrapper: 2 | ||
... | ||
... | ||
``` | ||
|
||
4. We can see that the two pods are scheduled and run to completion. | ||
``` | ||
% kubectl get pods | ||
NAME READY STATUS RESTARTS AGE | ||
fake-defaultaw-schd-spec-with-timeout-1-4r4t2 0/1 Completed 0 4s | ||
fake-defaultaw-schd-spec-with-timeout-1-tx9d5 0/1 Completed 0 4s | ||
``` | ||
Furthermore, they are scheduled on two different nodes (as they should be). | ||
|
||
|
||
5. Delete the previous AW job | ||
``` | ||
kubectl delete appwrapper fake-defaultaw-schd-spec-with-timeout-1 | ||
``` | ||
|
||
6. Create a new AW consisting of one pod requesting 16 GPUs. | |
``` | ||
% ./kwokmcadperf.sh | ||
Checking whether we have a valid cluster login or not... | ||
|
||
Nice, looks like you're logged in | ||
Checking MCAD Controller installation status | ||
|
||
Nice, MCAD Controller is installed | ||
Checking MCAD Controller installation status | ||
|
||
Nice, the KWOK Controller is installed | ||
|
||
How many fake KWOK appwrapper jobs do you want? 1 | ||
How many pods in a job? 1 | ||
How many GPUs do you want to allocate per pod? 16 | ||
jobs number is 1 | ||
Number of GPUs per pod: 16 | ||
Number of pods per AppWrapper: 1 | ||
... | ||
... | ||
``` | ||
|
||
7. The pod is scheduled to one of the fake nodes (which it should not be, since each node has only 8 GPUs and no single node can satisfy a 16-GPU request). | |
Reviewer comment: Curious, did we ever print a histogram inside MCAD and see how it looks? Also, can we mention which version of MCAD was used for testing? |
||
``` | ||
% kubectl get pods | ||
NAME READY STATUS RESTARTS AGE | ||
fake-defaultaw-schd-spec-with-timeout-1-v7qbk 0/1 Completed 0 40s | ||
``` | ||
|
||
8. Delete the previous AW job | ||
``` | ||
kubectl delete appwrapper fake-defaultaw-schd-spec-with-timeout-1 | ||
``` | ||
|
||
9. Create a new AW consisting of one pod requesting 24 GPUs. | |
``` | ||
% ./kwokmcadperf.sh | ||
Checking whether we have a valid cluster login or not... | ||
|
||
Nice, looks like you're logged in | ||
Checking MCAD Controller installation status | ||
|
||
Nice, MCAD Controller is installed | ||
Checking MCAD Controller installation status | ||
|
||
Nice, the KWOK Controller is installed | ||
|
||
How many fake KWOK appwrapper jobs do you want? 1 | ||
How many pods in a job? 1 | ||
How many GPUs do you want to allocate per pod? 24 | ||
jobs number is 1 | ||
Number of GPUs per pod: 24 | ||
Number of pods per AppWrapper: 1 | ||
... | ||
... | ||
``` | ||
10. The AW job is now in the queue and is pending. | ||
``` | ||
% kubectl describe appwrapper fake-defaultaw-schd-spec-with-timeout-1 | ||
... | ||
... | ||
Status: | ||
Conditions: | ||
Last Transition Micro Time: 2023-06-21T14:12:40.279735Z | ||
Last Update Micro Time: 2023-06-21T14:12:40.279734Z | ||
Status: True | ||
Type: Init | ||
Last Transition Micro Time: 2023-06-21T14:12:40.280678Z | ||
Last Update Micro Time: 2023-06-21T14:12:40.280677Z | ||
Reason: AwaitingHeadOfLine | ||
Status: True | ||
Type: Queueing | ||
Last Transition Micro Time: 2023-06-21T14:12:40.289959Z | ||
Last Update Micro Time: 2023-06-21T14:12:40.289958Z | ||
Reason: FrontOfQueue. | ||
Status: True | ||
Type: HeadOfLine | ||
Last Transition Micro Time: 2023-06-21T14:12:40.297836Z | ||
Last Update Micro Time: 2023-06-21T14:12:40.297836Z | ||
Message: Insufficient resources to dispatch AppWrapper. | ||
Reason: AppWrapperNotRunnable. | ||
Status: True | ||
Type: Backoff | ||
Controllerfirsttimestamp: 2023-06-21T14:12:40.279730Z | ||
Filterignore: true | ||
Queuejobstate: HeadOfLine | ||
Sender: before ScheduleNext - setHOL | ||
State: Pending | ||
Systempriority: 9 | ||
``` | ||
|
||
11. Add a fake node with 8 GPUs to the cluster (at this point, the cluster has 24 GPUs in total, spread evenly across 3 nodes): | |
``` | ||
% kubectl apply -f fake-node.yaml | ||
node/fake-node-1 created | ||
``` | ||
|
||
12. The job is now dispatched, and runs to completion. | ||
``` | ||
% kubectl get pods | ||
NAME READY STATUS RESTARTS AGE | ||
fake-defaultaw-schd-spec-with-timeout-1-fb649 0/1 Completed 0 7s | ||
``` | ||
|
||
13. This tells us that, at least with KWOK nodes, MCAD compares the request against the GPU capacity aggregated across the whole cluster, not against what any single node can provide, before making a dispatch decision. |
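
To make the observation in steps 1 to 13 concrete, here is a minimal Python sketch that contrasts an aggregate capacity check with a per-node check. This is not MCAD's code. The function names `fits_aggregate` and `fits_per_node` are hypothetical, and the per-node check uses a simple greedy placement as a stand-in for what a real scheduler would do. The node and pod sizes mirror the experiment.

```python
from typing import List


def fits_aggregate(node_gpus: List[int], pod_requests: List[int]) -> bool:
    """Aggregate check: total requested GPUs must not exceed total cluster GPUs.

    This is the behaviour the experiment suggests (an assumption, not MCAD source).
    """
    return sum(pod_requests) <= sum(node_gpus)


def fits_per_node(node_gpus: List[int], pod_requests: List[int]) -> bool:
    """Per-node check: every pod must fit entirely on one node.

    Uses greedy first-fit on requests sorted largest first. This is a
    heuristic, so it can report False for some mixes of pod sizes that a
    smarter packing could still place.
    """
    free = list(node_gpus)  # remaining GPUs on each node
    for request in sorted(pod_requests, reverse=True):
        for i, available in enumerate(free):
            if request <= available:
                free[i] -= request
                break
        else:
            # No single node has enough free GPUs for this pod.
            return False
    return True


if __name__ == "__main__":
    scenarios = [
        ("steps 3-4: two pods x 8 GPUs, two nodes", [8, 8], [8, 8]),
        ("steps 6-7: one pod x 16 GPUs, two nodes", [8, 8], [16]),
        ("steps 9-10: one pod x 24 GPUs, two nodes", [8, 8], [24]),
        ("steps 11-12: one pod x 24 GPUs, three nodes", [8, 8, 8], [24]),
    ]
    for label, nodes, pods in scenarios:
        print(label,
              "| aggregate:", fits_aggregate(nodes, pods),
              "| per-node:", fits_per_node(nodes, pods))
```

With two 8-GPU nodes, a single 16-GPU pod passes the aggregate check but fails the per-node check, which matches the dispatch seen in step 7. The 24-GPU pod fails the aggregate check until the third node is added, which matches steps 10 and 12.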
for gpu requests is little weird
can we change the wording and give the user a little more insight into the use case we ought to test, please?