Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upper api #609

Draft
wants to merge 17 commits into
base: master
Choose a base branch
from
Draft

Upper api #609

wants to merge 17 commits into from

Conversation

marcmengel
Copy link
Contributor

@marcmengel marcmengel commented Mar 7, 2025

This is adding an upper level api to the big_jobsub_api branch / pull #594, which has 2 main calls,
and an extenstion of the Job class, SubmittedJob.

The two bare entry points are

  • submit( group="...", executable="..." ...) which has a plethora of jobsub_submit arguments and returns a SubmittedJob object for the newly submitted job and
  • q(group="...", user=""..."...) which returns a list of SubmittedJob objects, one for each entry returned by jobsub_q with the mapped argumetns. It may need more arguments mapped to meet peoples neesds.

Job objects (existing in condor.py) have members

  • id for jobids
  • cluster
  • proc
  • schedd

SubmittedJob objects add

  • pool
  • group
  • role
  • auth_methods
  • submit_output (onlyt if the result of submit())
  • owner (if jobsub_q output or q() methods called)
  • submitted (datetime.datetime) ( "" )
  • runtime (datetime.timedelta) ( "" )
  • status (htcondor.JobStatus) ( "" )
  • prio
  • size
  • command

as well as methods:

  • hold()
  • release()
  • rm()
  • fetchlog( destdir, condor=False)
  • q() (update owner, status, etc. with jobsub_q)
  • q_long() (...and return --long info as dictionary)
  • q_analyze() (return jobsub_q --better-analyze output)
  • wait() (run q() periodically until status COMPLETED, HELD, or REMOVED.)

Any of those SubmittedJob objects have the ability to manage the jobs with the methods mentioned above.

@marcmengel marcmengel marked this pull request as draft March 7, 2025 04:25
@marcmengel
Copy link
Contributor Author

marcmengel commented Mar 7, 2025

Recent playing with new api interactively:
demo.txt
although the job ended before I could demo hold() and release()

@shreyb
Copy link
Collaborator

shreyb commented Mar 7, 2025

In #594: @retzkek had a few comments, which we decided would move to this PR's scope. For completeness, I wanted to include those here:

General comment:

We should be careful and deliberate when adding an external API that it's something we can evolve without breaking compatibility, especially knowing how experiment code evolves (or doesn't).

Most APIs will offer high- and low-level interfaces, so I'm ok with offering jobsub_call() documented as a low-level interface that accepts arguments with little checking and returns the raw output as a string that is likely to change. It certainly meets the immediate request.

We should also look to offer high-level interfaces that return concrete types, but they don't all have to be (or should be) created now - we can let users drive that by requesting them. A basic jobsub_submit() and jobsub_q() is fine, and would be mostly wrapping up the already defined regexes.

In lib/jobsub_api.py; for jobsub_submit_re, and jobsub_q_re:

Can we wrap this in a small function that returns a concrete type as a high-level interface instead of exposing this detail? There already is a Job class that represents a job/submission.

class Job:

Maybe something like def jobsub_submit(*args: str, **kwargs) -> Job

@retzkek retzkek self-requested a review March 11, 2025 18:36
# =-=-=-=-=-=-=-=-=-=-=-=-=-=-=


def optfix(s: Optional[str]) -> str:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than have this function wrapping every jobsub_call invocation, perhaps it's better to have jobsub_call only return a string. Looking at that function, it looks like the only reason it needs to be Optional[str] is because you initialize res to None. Maybe if that's "", then you can always return a string, and then this function won't be necessary.

def hold(self, verbose: int = 0) -> str:
"""Hold this job with jobsub_hold"""
args = ["jobsub_hold"]
self._update_args(verbose, args)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since these last three lines are run for hold, release, and rm, could they be combined into a single function, which gets called from each of those funcs? Something like:

def some_better_name(self, command, verbose):
    args = [command]
    self._update_args(verbose, args)
    rs = optfix(jobsub_call(args, True))
    return rs

def hold(self, verbose):
    return self.some_better_name("jobsub_hold", verbose)

def release(self, verbose):
    return self.some_better_name("jobsub_release", verbose)

def rm(self, verbose):
    return self.some_better_name("jobsub_rm", verbose)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So that was the Idea of the _update_args routine already, so I collapsed it into one called _cmd() that takes
the initial args with the base command and base options, and moved the optfix(jobsub_call...) up there.

"""run 'jobsub_q' on this job and update values, status"""
rs = self._cmd(verbose, ["jobsub_q"])
lines = rs.split("\n")
if len(lines) == 2 and self.status is not None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I worry about specifying the lines count to see if the jobsub_q output. For example in this case, I think the current check breaks:

[sbhat@host~]$ jobsub_q -G uboone [email protected]
Attempting to get token from https://vaultserver.fnal.gov:8200 ... succeeded
Storing bearer token in /tmp/bt_token_uboone_Analysis_UID
JOBSUBJOBID                             OWNER           SUBMITTED     RUNTIME   ST PRIO   SIZE  COMMAND
0 total; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended

By my count, that's 4 lines, but it could apply to a job that we previously saw, but has now left the queue. Maybe this check needs to be more specific, perhaps something that looks for the standard jobsub_q header, in the lines, and then sees if the next element of lines has the jobid in it? Just thinking "out loud" here, I think that could look something like:

def has_q_output(self, q_lines) -> bool:
    header_idx = 0
    for i, line in enumerate(lines):
        if re.search("^JOBSUBJOBID\tOWNER.+$", line):
            header_idx = i
            break
    else:
        return False
    
    try:
        return lines[header_idx + 1].find(self.jobid) != -1:
    except IndexError:
        return False

There may be a more intelligent way than this example to check the jobsub_q output - but either way, I think we should make the check stronger.

# we saw it previously, and now it is not showing up..
self.status = JobStatus.COMPLETED
return
if len(lines) > 1:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now reading this part, if we implement an output checker like above, the logic of this whole method could be something along the lines of:

if not self.has_q_output(lines):
   if self.status is not None:
      self.status = JobStatus.COMPLETED
      return
   raise RuntimeError(....)

# This is the happy path
line = rs.split("\n")[1]
...
return

I think that helps with readability as well - what do you think?

if not line.find(" = ") > 0:
continue
k, v = line.split(" = ", 1)
if v[0] == '"':
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you have to do this check:

>>> print('"abc"'.strip('"'))
abc
>>> print('abc'.strip('"'))
abc
>>> 'abc'.strip('"') == '"abc"'.strip('"')
True

if v[0] == '"':
v = v.strip('"')
res[k] = v
if len(lines) == 1 and self.status is not None:
Copy link
Collaborator

@shreyb shreyb Mar 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VERY nitpicky here, but I think the intent of this would be clearer if rather than checking len(lines), you check to see if rs == "", or possibly if not rs. This case applies to a blank result from rs, if I understand correctly, and one of those ways would communicate that clearer.

The other issue is that len(lines) COULD be 1 and the result of rs non-blank and still if the condor devs in the future forget to put line breaks in. To illustrate the point

>>> lines = "".split("\n")
>>> lines, len(lines)
([''], 1)
>>> lines = "blahblahoopsnolineendingkey=value".split("\n")
>>> lines, len(lines)
(['blahblahoopsnolineendingkey=value'], 1)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, here I'm trying to handle the case that jobs just stop showing up when they complete around here; so you do
jobsub_q and it comes up empty. When you're doing plain jobsub_q , you still get the headers, so length(lines)==2, (["JOBSUBJOBID ...", ""])l; but for jobsub_q --long we don't get the header so length(lines)==1... So I was trying to have parallel tests in the two cases. But you're right, at least a comment here about what the len(lines) == n means would be appropriate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a little dissertation on detecting the emtpy result, and why we're marking it "COMPLETED", in comments.

Copy link
Collaborator

@shreyb shreyb Mar 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand where you're coming from, but I'm still not too keen on the check counting lines, if there's a more reliable way. How about some internal function that checks both cases? Like a

def _jobsub_q_output_nonempty(stdout: str, long: bool) -> bool:
    if long:
        return stdout != "" 
   # logic if not long
   stuff()

Then you could have the same function called in both cases, so you'd satisfy having parallel checks.

@retzkek
Copy link
Contributor

retzkek commented Mar 26, 2025

We should also give some thought to how this API will be documented for users.

Conveniently that's the topic of tomorrow's CSAID roadmap meeting.

@shreyb
Copy link
Collaborator

shreyb commented Mar 26, 2025

I'm sorry I missed the context of your comment in the meeting earlier today, Kevin. Yes, we should have this be a part of #612 as well. It's kind of in both items' scopes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants