Is "Restart=always" in service units really necessary? #2125

Open
ShreyasMahangade opened this issue Jan 31, 2025 · 0 comments

Hi Folks,

I recently came across an issue where, when pmcd.service fails with exit status 2, it stalls several systemd jobs and leaves multi-user.target and other potentially important dependencies stuck.

Simple Reproducer:

Just empty pmcd.conf to simulate a problem with pmcd.service, then reboot the system:

[root@rhel94 ~]# > /etc/pcp/pmcd/pmcd.conf 
[root@rhel94 ~]# cat /etc/pcp/pmcd/pmcd.conf 
[root@rhel94 ~]# reboot -f

The system boots fine, services come up and everything seems to be in order, but the issue below will go unnoticed in most cases:

[root@rhel94 ~]# systemctl list-jobs
JOB  UNIT                                 TYPE  STATE  
7756 pmlogger_farm.service                start waiting
7669 pmlogger.service                     start waiting
135  multi-user.target                    start waiting
272  systemd-update-utmp-runlevel.service start waiting
7504 pmcd.service                         start running

5 jobs listed.
[root@rhel94 ~]# runlevel
unknown
[root@rhel94 ~]# 
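
To make this easier to spot, here is a rough sketch of a helper that pulls the units still in the "waiting" state out of `systemctl list-jobs` output. The function name `waiting_units` is made up for illustration, and it assumes the four-column layout (JOB, UNIT, TYPE, STATE) shown above.

```shell
# Print the units whose queued job has not started yet (STATE column is "waiting").
# Reads `systemctl list-jobs` output on stdin, so it also works on saved output.
# Header and summary lines are skipped because they lack a numeric job id
# followed by a STATE field equal to "waiting".
waiting_units() {
    awk '$1 ~ /^[0-9]+$/ && $4 == "waiting" { print $2 }'
}
```

On an affected system it could be used as `systemctl list-jobs --no-legend | waiting_units`; any output long after boot suggests a start job is being held up.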

The journal shows the following errors:

-- Boot f535e9965e31455fac39b9fb35c7806b --
Jan 31 10:55:21 rhel94.static systemd[1]: Starting Performance Metrics Collector Daemon...
Jan 31 10:56:27 rhel94.static systemd[1]: pmcd.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jan 31 10:56:27 rhel94.static systemd[1]: pmcd.service: Failed with result 'exit-code'.
Jan 31 10:56:27 rhel94.static systemd[1]: Failed to start Performance Metrics Collector Daemon.
Jan 31 10:56:27 rhel94.static systemd[1]: pmcd.service: Consumed 1.094s CPU time.
Jan 31 10:56:27 rhel94.static systemd[1]: pmcd.service: Scheduled restart job, restart counter is at 1.
Jan 31 10:56:27 rhel94.static systemd[1]: Stopped Performance Metrics Collector Daemon.
Jan 31 10:56:27 rhel94.static systemd[1]: pmcd.service: Consumed 1.094s CPU time.
Jan 31 10:56:27 rhel94.static systemd[1]: Starting Performance Metrics Collector Daemon...
Jan 31 10:57:29 rhel94.static systemd[1]: pmcd.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jan 31 10:57:29 rhel94.static systemd[1]: pmcd.service: Failed with result 'exit-code'.
Jan 31 10:57:29 rhel94.static systemd[1]: Failed to start Performance Metrics Collector Daemon.
Jan 31 10:57:29 rhel94.static systemd[1]: pmcd.service: Consumed 1.031s CPU time.
Jan 31 10:57:29 rhel94.static systemd[1]: pmcd.service: Scheduled restart job, restart counter is at 2.
Jan 31 10:57:29 rhel94.static systemd[1]: Stopped Performance Metrics Collector Daemon.
Jan 31 10:57:29 rhel94.static systemd[1]: pmcd.service: Consumed 1.031s CPU time.
Jan 31 10:57:29 rhel94.static systemd[1]: Starting Performance Metrics Collector Daemon...
Jan 31 10:58:30 rhel94.static root[5223]: pmcd_wait failed in /usr/libexec/pcp/lib/pmcd: exit status: 2
Jan 31 10:58:30 rhel94.static systemd[1]: pmcd.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jan 31 10:58:30 rhel94.static systemd[1]: pmcd.service: Failed with result 'exit-code'.
Jan 31 10:58:30 rhel94.static systemd[1]: Failed to start Performance Metrics Collector Daemon.
Jan 31 10:58:30 rhel94.static systemd[1]: pmcd.service: Scheduled restart job, restart counter is at 3.
Jan 31 10:58:30 rhel94.static systemd[1]: Stopped Performance Metrics Collector Daemon.
Jan 31 10:58:30 rhel94.static systemd[1]: Starting Performance Metrics Collector Daemon...

And this restart counter will keep going up (I guess forever), holding up all other dependent targets like multi-user.target.
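
For anyone who wants to check how far the loop has gone, a small sketch that extracts the highest restart counter from journal text. It only assumes the "restart counter is at N" wording seen in the log above; the function name `max_restart_counter` is hypothetical.

```shell
# Print the highest "restart counter is at N" value found on stdin (0 if none).
max_restart_counter() {
    awk '
        match($0, /restart counter is at [0-9]+/) {
            # The fixed prefix "restart counter is at " is 22 characters long,
            # so the digits start 22 characters after the match position.
            n = substr($0, RSTART + 22, RLENGTH - 22) + 0
            if (n > max) max = n
        }
        END { print max + 0 }
    '
}
```

Something like `journalctl -b -u pmcd.service --no-pager | max_restart_counter` should then give the number of restarts scheduled in the current boot.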

On a vanilla Red Hat installation this does not look very impactful, but where custom and important services start after multi-user.target it might have a big impact.

Obviously the reproducer used above is just one way to make pmcd fail; it might fail for other reasons as well.

We could also consider other options such as Restart=on-abnormal, which only restarts on an unclean signal, a timeout or a watchdog event. That would make this issue occur less frequently (but not solve it).

When tested with Restart=no, multi-user.target stalls for a few moments and then, once the service has failed to activate, systemd moves ahead with activating the remaining dependencies.
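
For anyone who wants to try a different policy without editing the packaged unit, a drop-in should be enough. Below is a minimal sketch that writes one; the function name `write_restart_dropin` and the file name `restart.conf` are my own choices for illustration, not something shipped by PCP.

```shell
# Write a drop-in that overrides only the Restart= policy of a unit.
# $1: drop-in directory (for pmcd: /etc/systemd/system/pmcd.service.d)
# $2: restart policy, e.g. no, on-failure, on-abnormal, always
write_restart_dropin() {
    dir=$1
    policy=$2
    mkdir -p "$dir" || return 1
    # restart.conf is a hypothetical name; any *.conf file in the directory is read.
    printf '[Service]\nRestart=%s\n' "$policy" > "$dir/restart.conf"
}
```

For example, `write_restart_dropin /etc/systemd/system/pmcd.service.d on-abnormal` followed by `systemctl daemon-reload` would switch the policy for testing. As far as I understand systemd.service(5), on-abnormal does not restart on a plain non-zero exit code such as the status 2 above, so the reproducer itself should no longer loop with that setting, but I have not verified this on the affected system.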

Edit: This seems to be the bugzilla where we added restart option: https://bugzilla.redhat.com/show_bug.cgi?id=1365658
