@fregataa fregataa commented May 26, 2024

Why does the resource sync API depend on this PR?

TL;DR

sync_container_lifecycles() must work reliably.

Detail

We have three different sources of kernel/container data:

  1. Manager-side DB
  2. Agent-side kernel_registry
  3. Agent-side actual Docker containers

I tried to synchronize all of the Agent's data to the Manager's DB, but synchronizing the actual containers to kernel_registry or to the Manager's data requires processing container destruction tasks, which take a long time.
Since the sync_container_lifecycles() task periodically synchronizes the actual containers to kernel_registry in the background, the resource-sync API only needs to sync kernel_registry to the Manager-side DB, provided that the background task is guaranteed to stay alive and work as expected.
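
To illustrate the reduced scope described above, here is a minimal sketch of comparing only kernel_registry against the Manager-side DB. The names `SyncDiff` and `diff_registry_against_db` are hypothetical and are not taken from this PR; the sketch only computes the discrepancies and makes no claim about how the actual resource-sync API resolves them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncDiff:
    # Hypothetical result type: kernel IDs present on only one side.
    only_in_manager_db: frozenset
    only_in_registry: frozenset


def diff_registry_against_db(manager_db_kernels, kernel_registry):
    """Compare kernel IDs known to the Manager DB with the Agent's kernel_registry.

    Assumption: the background task has already reconciled the real containers
    into kernel_registry, so the registry can be treated as the Agent-side truth.
    """
    db_ids = frozenset(manager_db_kernels)
    registry_ids = frozenset(kernel_registry)
    return SyncDiff(
        only_in_manager_db=db_ids - registry_ids,
        only_in_registry=registry_ids - db_ids,
    )
```

Because no container has to be inspected or destroyed in this step, the comparison stays cheap; the slow container-level work is left to the background task.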

Changes

Fix

  • Keep the sync_container_lifecycles() bgtask alive in a loop by handling any exceptions raised in the task.
  • In the try-except block around container creation, do not reference an unbound local variable.
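
Both fixes follow common Python patterns, sketched below under stated assumptions. `run_periodically` and `create_container` are hypothetical helpers, not the Agent's actual code, and the `max_iterations` parameter exists only so the sketch can terminate.

```python
import asyncio
import logging
from typing import Optional

log = logging.getLogger(__name__)


async def run_periodically(task_fn, interval: float,
                           max_iterations: Optional[int] = None) -> int:
    """Call task_fn repeatedly; log failures instead of letting them end the loop.

    Returns the number of failed iterations.
    """
    failures = 0
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        iterations += 1
        try:
            await task_fn()
        except asyncio.CancelledError:
            raise  # cancellation must still be able to stop the loop
        except Exception:
            failures += 1
            log.exception("periodic task failed; retrying on the next tick")
        await asyncio.sleep(interval)
    return failures


def create_container(factory):
    """Bind container_id before the try block so the except block can read it."""
    container_id = None
    try:
        container_id = factory()
        return container_id
    except Exception:
        # Without the initial binding, this line would raise UnboundLocalError
        # whenever factory() itself failed.
        log.warning("container creation failed (container_id=%s)", container_id)
        return None
```

Re-raising `asyncio.CancelledError` explicitly keeps shutdown working even on Python versions where it is still a subclass of `Exception`.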

Enhance

  • Fetch all containers eagerly in sync_container_lifecycles()
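
One possible reading of "eagerly" is to materialize the full container listing before acting on any entry, rather than processing items as they are yielded. The sketch below assumes that reading; `enumerate_containers` is a hypothetical stand-in for the Agent's real enumeration and returns made-up sample data.

```python
import asyncio
from typing import AsyncIterator, List, Tuple


async def enumerate_containers() -> AsyncIterator[Tuple[str, str]]:
    # Hypothetical stand-in that yields (kernel_id, status) pairs one by one.
    for item in [("k1", "running"), ("k2", "exited")]:
        await asyncio.sleep(0)
        yield item


async def fetch_all_containers() -> List[Tuple[str, str]]:
    # Collect the whole listing up front so later steps work on one snapshot.
    return [item async for item in enumerate_containers()]
```

Working from a single snapshot means a failure while handling one container cannot interrupt the enumeration itself.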

Checklist: (if applicable)

  • Milestone metadata specifying the target backport version

@github-actions github-actions bot added the comp:agent Related to Agent component label May 26, 2024
@github-actions github-actions bot added the size:M 30~100 LoC label May 26, 2024
@fregataa fregataa added this to the 24.03 milestone May 26, 2024
@fregataa fregataa added the urgency:4 As soon as feasible, implementation is essential. label May 26, 2024
@fregataa fregataa requested review from achimnol and kyujin-cho May 26, 2024 03:58
@fregataa fregataa marked this pull request as ready for review May 26, 2024 03:59
@fregataa fregataa changed the title from "fix: enhanced kernel termination handling" to "fix: enhance kernel termination handling" May 26, 2024
@github-actions github-actions bot added size:L 100~500 LoC and removed size:M 30~100 LoC labels May 27, 2024
@fregataa fregataa force-pushed the topic/05-23-fix_enhanced_kernel_termination_handling branch 3 times, most recently from c1cd4fa to 20a9b6c Compare June 4, 2024 06:52
@fregataa fregataa force-pushed the topic/05-23-fix_enhanced_kernel_termination_handling branch 2 times, most recently from 7a202ce to a172e1a Compare June 7, 2024 03:19
@fregataa fregataa force-pushed the topic/05-23-fix_enhanced_kernel_termination_handling branch 3 times, most recently from df467c8 to f784d6c Compare June 13, 2024 10:23
@fregataa fregataa force-pushed the topic/05-23-fix_enhanced_kernel_termination_handling branch 2 times, most recently from 331b744 to 2ac6c4e Compare June 19, 2024 02:14
@fregataa fregataa changed the title from "fix: enhance kernel termination handling" to "fix: Handle failure of kernel creation and termination" Jun 20, 2024
@fregataa fregataa force-pushed the topic/05-23-fix_enhanced_kernel_termination_handling branch from 2ac6c4e to 7ac6b0e Compare June 20, 2024 12:46
@fregataa fregataa marked this pull request as draft June 20, 2024 13:03
@fregataa fregataa marked this pull request as ready for review June 20, 2024 14:16
@fregataa fregataa changed the title from "fix: Handle failure of kernel creation and termination" to "fix: Keep sync_container_lifecycles() bgtask alive in a loop." Jun 20, 2024
@fregataa fregataa force-pushed the topic/05-23-fix_enhanced_kernel_termination_handling branch 2 times, most recently from bd7d1e4 to 25e70f3 Compare June 24, 2024 07:05
@fregataa fregataa force-pushed the topic/05-23-fix_enhanced_kernel_termination_handling branch 4 times, most recently from cefd447 to 552b746 Compare July 5, 2024 08:14
@fregataa fregataa force-pushed the topic/05-23-fix_enhanced_kernel_termination_handling branch from 552b746 to 59cbbc9 Compare July 5, 2024 09:56
@github-actions github-actions bot added the comp:common Related to Common component label Jul 5, 2024
@fregataa fregataa merged commit 3ddcc36 into main Jul 5, 2024
16 checks passed
@fregataa fregataa deleted the topic/05-23-fix_enhanced_kernel_termination_handling branch July 5, 2024 11:51
lablup-octodog pushed a commit that referenced this pull request Jul 5, 2024
Backported-from: main (24.09)
Backported-to: 24.03
Backport-of: 2178
lablup-octodog added a commit that referenced this pull request Jul 5, 2024
… (#2394)

Co-authored-by: Sanghun Lee <[email protected]>
Backported-from: main (24.09)
Backported-to: 24.03
Backport-of: 2178