
Conversation

@adamlazik1 (Contributor) commented Jan 30, 2025

This PR proposes to mark as cancelled every sub plan which finishes
unsuccessfully after the parent execution plan receives the cancel
directive. This is consistent with how a task is evaluated as cancelled
in the foreman project.

This PR has an additional effect: the parent execution plan now always
finishes with warning if at least one sub plan is cancelled in response
to the cancel event. Until now, it could either finish with success, if
all sub plans either finished with success or were not yet queued for
execution, or finish with warning, if an already pending sub plan was
cancelled and finished with the error result. In summary, this
additional change unifies the behavior of a cancelled execution plan
whenever there is anything left to cancel.
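
To illustrate the proposed rule, here is a minimal sketch (not the actual diff). It assumes the parent action records the time of the cancel event in its output hash; sub_plans_count and sub_plans_count_after are the counting helpers used in this PR, everything else is illustrative:

```ruby
# Minimal sketch, assuming a bulk action with an `output` hash.
def cancel!(force = false)
  # Remember when the parent plan received the cancel directive.
  output[:cancelled_timestamp] = Time.now.utc
  super
end

def recalculate_counts
  failed = sub_plans_count('state' => %w(paused stopped), 'result' => %w(error warning))
  if output[:cancelled_timestamp]
    # Sub plans that finished unsuccessfully after the cancel event
    # count as cancelled rather than failed.
    cancelled = sub_plans_count_after(output[:cancelled_timestamp],
                                      'state' => %w(paused stopped),
                                      'result' => %w(error warning))
    failed -= cancelled
  end
  # ... store failed, cancelled and the other counts ...
end
```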

@adamruzicka (Contributor) left a comment

The general approach looks good; I left a couple of comments. Also, rubocop is red and I'm not sure what the tests will say.

@adamlazik1 force-pushed the edit-cancelled-count branch from 50f7a38 to 16e5cee on February 3, 2025 18:52
@adamlazik1 (Contributor Author) commented

Thanks, I applied the suggestions. I noticed in the dynflow console that, for whatever reason, a cancelled RunHostsJob shows result success even though the counts are correct. On master it shows result = error if a job is cancelled. Is this a problem?

@adamlazik1 force-pushed the edit-cancelled-count branch from 16e5cee to 036ad60 on February 6, 2025 10:44
@adamruzicka (Contributor) commented

> Is this a problem?

Sounds like one. Ideally this should remain an "internal" change, meaning there should be no observable difference anywhere except for the action's output.

@adamlazik1 (Contributor Author) commented

Alright, unless I am missing something, I have fixed the issue with the differing result state. Now the only change should be in the cancelled and failed counts of a cancelled job.

@adamlazik1 (Contributor Author) commented Feb 20, 2025

Ugh, I see a problem when cancelling a job invocation on more than one host and I wonder what the correct implementation should be. The table below displays the result of RunHostsJob in different scenarios: the job invocation on one host is cancelled before finishing, and the job invocation on two hosts is run with concurrency level 1 and is cancelled after the command successfully completes on the first host.

| n. of hosts / branch | master | PR |
| --- | --- | --- |
| 1 | warning | warning |
| 2 | success | warning |

Under what circumstances should the result of RunHostsJob be warning, and under what circumstances success, when cancelling a job invocation?

@adamlazik1 (Contributor Author) commented Feb 25, 2025

> Ugh, I see a problem when cancelling a job invocation on more than one host and I wonder what the correct implementation should be. The table below displays the result of RunHostsJob in different scenarios: the job invocation on one host is cancelled before finishing, and the job invocation on two hosts is run with concurrency level 1 and is cancelled after the command successfully completes on the first host.
>
> | n. of hosts / branch | master | PR |
> | --- | --- | --- |
> | 1 | warning | warning |
> | 2 | success | warning |
>
> Under what circumstances should the result of RunHostsJob be warning, and under what circumstances success, when cancelling a job invocation?

After a discussion with @adamruzicka we concluded that the behavior on master could be considered an inconsistency, and thus the side effect of this PR in its current form is actually a plus, so I am keeping the behavior for now. I updated the commit message accordingly.

I introduced some additional errors in the last version and I hope I have fixed them in the current one, but I need to test more scenarios, so I will flip this to draft for now.

Also, I need to transfer a value (cancelled_unqueued_sub_plans_count) between different methods and I don't know what the correct way of doing that here is, so I used the output hash for now. Suggestions are welcome.

@adamlazik1 marked this pull request as draft on February 25, 2025 17:54
@adamruzicka (Contributor) commented

> Also, I need to transfer a value (cancelled_unqueued_sub_plans_count) between different methods and I don't know what the correct way of doing that here is, so I used the output hash for now. Suggestions are welcome.

Can't you calculate that from the other counts that you have?

@adamlazik1 (Contributor Author) commented Feb 26, 2025

I am not sure how I would do that. Those unqueued sub plans aren't even in the database, as far as I understand, no?

@adamlazik1 (Contributor Author) commented

> I am not sure how I would do that. Those unqueued sub plans aren't even in the database, as far as I understand, no?

Looking at the code, I could calculate it from total_count and planned_count, but I don't know where total_count even comes from, as it appears to be a virtual method. Can I count on it always returning the same value during one job invocation run? The same question goes for planned_count. From the code and from the output when running execution plans it seems to stay the same, but is there any scenario in which this value could change between the occurrence of the cancel event and the next call of recalculate_counts?

@adamlazik1 (Contributor Author) commented

On a separate note: I retested the current version and it appears to be working as expected; the errors I saw earlier seem to have been resolved. I am flipping this back to Ready for review, even though it is likely that more changes are pending based on the above comments.

@adamlazik1 marked this pull request as ready for review on February 26, 2025 14:01
@adamlazik1 (Contributor Author) commented

> I am not sure how I would do that. Those unqueued sub plans aren't even in the database, as far as I understand, no?

> Looking at the code, I could calculate it from total_count and planned_count, but I don't know where total_count even comes from, as it appears to be a virtual method. Can I count on it always returning the same value during one job invocation run? The same question goes for planned_count. From the code and from the output when running execution plans it seems to stay the same, but is there any scenario in which this value could change between the occurrence of the cancel event and the next call of recalculate_counts?

After another discussion it turns out that relying on the dynamic calculation of unqueued sub plans from total_count and planned_count should be safe, so I no longer need to transfer this value between methods. PR updated and ready for review.

I hope I was able to edit the remaining_count method correctly. In the previous implementation, cancelled_count could be either 0 or total_count - planned_count at the time of the cancel event, which means that at the time of cancellation this method should return 0.
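
For illustration, the dynamic calculation might look roughly like this sketch, assuming total_count and planned_count indeed stay constant during one run; success_count and failed_count are hypothetical names standing in for the other counts:

```ruby
# Minimal sketch, not the PR's actual implementation.

# Sub plans cancelled before ever being queued can be derived on the
# fly, so the value no longer needs to travel through the output hash.
def cancelled_unqueued_sub_plans_count
  total_count - planned_count
end

# Once every sub plan is accounted for as success, failed or cancelled
# (with cancelled including the unqueued ones), nothing remains, so at
# the time of cancellation this returns 0.
def remaining_count
  total_count - success_count - failed_count - cancelled_count
end
```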

@adamlazik1 (Contributor Author) commented

The test failure seems unrelated. Or at least I hope so.

@adamruzicka (Contributor) commented

Yeah, these do happen from time to time.

@adamruzicka merged commit 09b331c into Dynflow:master on Mar 17, 2025
6 of 7 checks passed
@adamruzicka (Contributor) commented

Thank you @adamlazik1 !

@adamlazik1 deleted the edit-cancelled-count branch on March 17, 2025 11:06
```ruby
failed = sub_plans_count('state' => %w(paused stopped), 'result' => %w(error warning))
total = total_count
if output[:cancelled_timestamp]
  cancelled_scheduled_plans = sub_plans_count_after(output[:cancelled_timestamp], { 'state' => %w(paused stopped), 'result' => %w(error warning) })
```


Is this bulletproof enough? Can the following happen?

  1. I run a job, it gets scheduled
  2. At time T, I cancel it and that time becomes cancelled_timestamp
  3. Before the run is actually cancelled, at time T+1, an error occurs so the task becomes stopped/error
  4. At T+2, an already failed run gets cancelled and nothing happens
    => The run failed but is counted as cancelled

@adamlazik1 (Contributor Author) replied

If it failed after being marked as cancelled, then I would assume it is ok to count it as cancelled.
