Description
Problem: The test test_rendered_golden_config_override (and other similar config reboot tests) fails on trixie builds (202511) because the pmon service hits start-limit-hit and does not come back up after config reload -y -f /etc/sonic/golden_config_db.json. The test applies a golden config override, runs a config reload, then waits up to 420 seconds for all critical services; pmon is the only one that reports False. All other services (snmp, lldp, bgp, swss, syncd, database, dhcp_relay, teamd) come up successfully.
Evidence: In live DUT testing on a Cisco-8102-C64, the trixie build (systemd v257) fails pmon on the 4th stop/start cycle with start-limit-hit, while the bookworm build (systemd v252, built from the same commit bf0cbb1 as the trixie build) passes 5+ cycles. Both builds have identical service configs (StartLimitBurst=3, StartLimitIntervalSec=1200). The exit code from docker-rs wait is 0 for all services. Only pmon is affected: snmp and lldp survive 10+ cycles with the same StartLimitBurst=3.
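For anyone who wants to reproduce the stop/start cycling described above, a minimal sketch follows. The exact procedure used on the DUT is not given in this issue, so the loop below is an assumption: it stops and starts a unit repeatedly and reports the first cycle at which the unit's Result becomes start-limit-hit. The function name cycle_until_limit and the SYSTEMCTL override variable are hypothetical helpers introduced only for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: cycle a unit and stop at the first start-limit-hit.
# SYSTEMCTL is a hypothetical override so the loop can be dry-run with a stub.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# Usage: cycle_until_limit <unit> <max_cycles>
# Returns 1 as soon as systemd reports start-limit-hit, 0 otherwise.
cycle_until_limit() {
    local unit="$1" max_cycles="$2" i result
    for ((i = 1; i <= max_cycles; i++)); do
        "$SYSTEMCTL" stop "$unit"
        # A refused start returns non-zero; keep going so Result can be read.
        "$SYSTEMCTL" start "$unit" || true
        result="$("$SYSTEMCTL" show "$unit" --property=Result --value)"
        echo "cycle $i: Result=$result"
        if [ "$result" = "start-limit-hit" ]; then
            return 1
        fi
    done
    return 0
}
```

Run as root, cycle_until_limit pmon 5 would be expected to stop at cycle 4 on the affected trixie build if the behaviour matches the evidence above; the same call with snmp or lldp gives a comparison point.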
Root Cause: The start-ratelimit counter for pmon accumulates across config reload cycles and is not reset on systemd v257 (Trixie). With StartLimitBurst=3, the 4th start attempt within the 20-minute window is refused. On systemd v252 (Bookworm), the counter resets during config reload, so the issue never occurs. Why only pmon triggers this (and not snmp/lldp with the same burst limit) is still under investigation.
Quick Fix
sudo mkdir -p /etc/systemd/system/pmon.service.d
echo -e "[Unit]\nStartLimitBurst=10" | sudo tee /etc/systemd/system/pmon.service.d/start_limit.conf
sudo systemctl daemon-reload && sudo systemctl reset-failed pmon
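To confirm that the drop-in above actually took effect, the effective value can be read back from systemd. This is a minimal sketch, not part of the original workaround: check_start_limit and the SYSTEMCTL override variable are hypothetical names introduced here, and the expected value 10 simply mirrors the drop-in.

```shell
#!/usr/bin/env bash
# Sketch only: compare a unit's effective StartLimitBurst with an expected value.
# SYSTEMCTL is a hypothetical override so the check can be dry-run with a stub.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# Usage: check_start_limit <unit> <expected_burst>
check_start_limit() {
    local unit="$1" expected="$2" actual
    actual="$("$SYSTEMCTL" show "$unit" --property=StartLimitBurst --value)"
    if [ "$actual" = "$expected" ]; then
        echo "$unit: StartLimitBurst=$actual (matches expected value)"
    else
        echo "$unit: StartLimitBurst=$actual, expected $expected" >&2
        return 1
    fi
}
```

Calling check_start_limit pmon 10 after the daemon-reload should succeed if the drop-in was picked up; systemctl cat pmon additionally lists which drop-in files are applied to the unit.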
Seeking community input: is this a known issue, and is a fix available in systemd releases after v255?