Hi :) Well, we have certainly discussed it internally. This feature points toward a setup where HQ could essentially become a more direct wrapper/alternative to SLURM/PBS, with only a single HQ instance deployed per HPC cluster. While on the surface it's not that complicated to implement (we'd need to implement user management and namespace jobs per user), there are some complications with it, e.g. how to attach workers to specific users (it would probably need some authentication key mechanism to make sure that users do not share workers).

Furthermore, on an architectural level, the HQ server was always meant to be a relatively ephemeral component, running perhaps for a few hours or days. The server now actually does implement checkpointing and can restore its state from disk, but if it were to be a permanently running service in the background, we would probably need to make some very different design decisions about its performance and primarily its memory usage. The "accounting" side of the server is actually quite separate from the "scheduling" part (which is very performance sensitive), so we could maybe run the accounting side in one process and then start the scheduling processes per user, perhaps even inside allocations. But it would still be non-trivial to implement.

But perhaps the simplest reason why we didn't consider actually implementing user management in HQ so far is that it seems to be quite easily attainable with external logic. If you simply run a separate HQ server for each user (and make sure to restart it if it goes offline), each in a separate directory on disk whose permissions are scoped to that user, it should mostly "just work" out of the box. For example, here is a trivial Python script that sketches an incredibly naive version of a multi-user HQ deployment:

```python
# hqx.py
import getpass
import os
import subprocess
import sys
from pathlib import Path

# Get the arguments after `hqx <...>`
args = sys.argv[1:]

# Get the user's username
username = getpass.getuser()

# Assign a per-user server directory
server_dir = Path("/root/hq-servers") / username
if not server_dir.is_dir():
    # Start an HQ server with --server-dir configured to `server_dir`
    # chmod the directory to be only accessible by the user
    ...

# Actually run hq with the provided parameters
ret = subprocess.run(
    ["hq", *args],
    env={**os.environ, "HQ_SERVER_DIR": str(server_dir)},
)
sys.exit(ret.returncode)
```

With a script like this, users could just use e.g.
-
Hi there,
Has the possibility of a single server instance with multi-user isolation already been discussed?
We would like to automate the operation of HyperQueue on an HPC cluster and make it accessible to users across several organizations.
A single HyperQueue entry point (server) and the ability to dynamically spawn workers (e.g. via a Slurm job) in a user-isolated fashion could simplify automating access to this service.
Cheers,
Elia
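For concreteness, "dynamically spawn workers via a Slurm job" would amount to submitting a batch job whose payload starts an HQ worker pointed at the server's directory. A hypothetical sketch (the Slurm options and paths are placeholders, and `hq worker start --server-dir` is assumed to be how a worker connects):

```python
import subprocess

def submit_worker(server_dir: str, walltime: str = "01:00:00") -> None:
    """Sketch: submit a Slurm job that starts one HyperQueue worker
    connected to the server whose access files live in `server_dir`."""
    batch_script = "\n".join([
        "#!/bin/bash",
        f"#SBATCH --time={walltime}",
        "#SBATCH --nodes=1",
        # The worker connects using the access file in the server directory.
        f"hq worker start --server-dir {server_dir}",
    ])
    # sbatch reads the job script from stdin when no filename is given.
    subprocess.run(["sbatch"], input=batch_script, text=True, check=True)
```

In a multi-user setup, each user would submit such jobs under their own account, so the worker inherits that user's identity and can only read its own server directory.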