LOGMLE - Fix argo metaflow retry integration #1
base: master
Conversation
metaflow/cli_components/step_cmd.py
echo_always(f"{latest_done_attempt=}") | ||
echo_always(f"{retry_count=}") |
Will remove these after testing is complete.
metaflow/mflog/save_logs.py
# Use inferred attempt to save task_stdout.log and task_stderr.log
latest_done_attempt = flow_datastore.get_latest_done_attempt(
    run_id=run_id, step_name=step_name, task_id=task_id
)
task_datastore = flow_datastore.get_task_datastore(
    run_id, step_name, task_id, int(latest_done_attempt), mode="w"  # was: int(attempt)
)
This is needed because both the kubernetes and argo options run mflog.save_logs after the task has completed, which uploads the task_stdout and task_stderr logs to S3 and reflects them in the Metaflow UI. An example run without this change has no logs for the Argo-retried attempts, i.e. attempts 3, 4 & 5.
Why are we explicitly casting this to an int? Why don't we do this in the function?
Yeah, you are right, I don't think we need to cast it anymore; the function returns a count. I had it this way to keep the changes minimal.
Having the function return None integrates well here, since get_task_datastore(..., attempt=None) will not be broken.
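A minimal sketch of the resulting save_logs call path, assuming get_latest_done_attempt returns None when no done attempts exist and the int() cast is dropped per the discussion above:

```python
# Infer the attempt from the datastore instead of the stale MF_ATTEMPT value.
latest_done_attempt = flow_datastore.get_latest_done_attempt(
    run_id=run_id, step_name=step_name, task_id=task_id
)
# Passing None through keeps get_task_datastore's existing
# attempt-discovery behaviour, so first attempts are not broken.
task_datastore = flow_datastore.get_task_datastore(
    run_id, step_name, task_id, latest_done_attempt, mode="w"
)
```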
@@ -21,7 +21,6 @@ def _read_file(path):

# these env vars are set by mflog.mflog_env
pathspec = os.environ["MF_PATHSPEC"]
attempt = os.environ["MF_ATTEMPT"]
Where does this come from? I am wondering if we can leave all the code in this file alone and try to set MF_ATTEMPT with the value that we want.
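A hedged sketch of that suggestion, assuming the inferred attempt is available wherever the task environment is assembled (the exact injection point is hypothetical):

```python
import os

# If MF_ATTEMPT is exported with the inferred value before save_logs runs,
# the existing os.environ["MF_ATTEMPT"] read needs no changes at all.
latest_done_attempt = flow_datastore.get_latest_done_attempt(
    run_id=run_id, step_name=step_name, task_id=task_id
)
if latest_done_attempt is not None:
    os.environ["MF_ATTEMPT"] = str(latest_done_attempt + 1)
```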
Ok, one last "idea", because I feel like I'm being negative but don't want to be. What if, instead of creating more retries, we simply set the attempt number to the last available attempt and overwrite all the data on this attempt?
Pros:
Cons:
We might even be able to do this in the operator by rewriting the command when we see the annotation that the workflow is a retry. Even if we try this, we can still have this as a stop-gap solution.
I kind of looked into this idea already. One of the difficulties was: there is no annotation to spot whether a workflow has been retried.

Since one of the metaflow maintainers' main concerns is breaking "immutability", I think overwriting data would stop this from getting merged upstream. The other concern is making changes to the core task logic, which carries a risk of breaking other systems.
metaflow/datastore/flow_datastore.py
pathspecs=[f"{run_id}/{step_name}/{task_id}"], | ||
include_prior=True | ||
) | ||
return max([t.attempt for t in t_datastores], default=0) # returns default, if this was a first attempt. |
Shouldn't the default be -1 or None?
I will make the default None, since it makes more sense when no done attempts were found.
Also: #1 (comment)
Tracking discussion with metaflow maintainers in a separate PR.
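Putting the thread together, a minimal sketch of the helper with the agreed None default (the method body and the t_datastores naming are taken from the snippet above; its placement in FlowDataStore is assumed):

```python
def get_latest_done_attempt(self, run_id, step_name, task_id):
    t_datastores = self.get_task_datastores(
        pathspecs=[f"{run_id}/{step_name}/{task_id}"],
        include_prior=True,
    )
    # None when no done attempts were found; callers such as
    # get_task_datastore(..., attempt=None) already handle that.
    return max((t.attempt for t in t_datastores), default=None)
```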
# Not sure what are the side effects to this.
if retry_count >= max_user_code_retries:
    max_user_code_retries = retry_count
FYI, @colebaileygit

MAX_ATTEMPTS vs max_user_code_retries

I think max_user_code_retries is the number of times a task can be retried; this value is the same as the one the retry decorator contains. Whereas MAX_ATTEMPTS is the total number of attempts that a task can run [here], which is basically the number of retries + the initial run.
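A worked example of the distinction (the decorator value is illustrative; the MAX_ATTEMPTS import and its default of 6 are assumptions about metaflow's config):

```python
from metaflow.metaflow_config import MAX_ATTEMPTS  # hard ceiling; 6 by default (assumed)

max_user_code_retries = 3                    # e.g. @retry(times=3)
total_attempts = max_user_code_retries + 1   # 3 retries + the initial run == 4
assert total_attempts <= MAX_ATTEMPTS        # attempt numbers can never exceed the ceiling
```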
Testing

Hitting the argo retry button on the failed argo-workflow.
metaflow/cli_components/step_cmd.py
if latest_done_attempt:
    retry_count = latest_done_attempt + 1
    # Not sure what are the side effects to this.
Infer the retry count only if previously done attempts were detected.
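A consolidated sketch of the inference in step_cmd.py, assuming the None default agreed above; `is not None` is used so a done attempt 0 (which is falsy) is not skipped:

```python
latest_done_attempt = flow_datastore.get_latest_done_attempt(
    run_id=run_id, step_name=step_name, task_id=task_id
)
if latest_done_attempt is not None:
    retry_count = latest_done_attempt + 1
    # Not sure what the side effects of this are: widen the retry budget
    # so the inferred attempt is not treated as already exhausted.
    if retry_count >= max_user_code_retries:
        max_user_code_retries = retry_count
```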
Problem
https://github.com/deliveryhero/logistics-ds-metaflow-ext/issues/108#issue-2832910761
Solution

This PR modifies the step cli command to infer retry_count from the flow_datastore class. This class appears to hold all the information about the underlying datastore and run artifacts. It also updates the mflog.save_logs python script.

New Behaviour

Previously, argo retry --node <> would overwrite previously run attempts; now it adds artifacts as if it were a new attempt.

Fixes: https://github.com/deliveryhero/logistics-ds-metaflow-ext/issues/108
Slack Discussion: https://outerbounds-community.slack.com/archives/C020U025QJK/p1737483912784969
Testing and Limitations
Use this branch on a dummy model to test this:
Validation & QA
Documentation: