* RFC Systemd Service Restart Policy
@ 2017-09-06 19:03 Andrew Geissler
From: Andrew Geissler @ 2017-09-06 19:03 UTC
  To: OpenBMC Maillist

I’ve got an old but good one this sprint,
https://github.com/openbmc/openbmc/issues/272

The point of this issue is to define our restart and recovery policy
for openbmc services.

Currently we’re using the systemd defaults, which are the following:
RestartSec=100ms
StartLimitIntervalSec=10s
StartLimitBurst=5
StartLimitAction=none

So basically, if a service fails, we will restart it up to 5 times
within a 10s window, with a 100ms delay before each restart.
No action is taken when we reach the 5 restarts, other than to
do nothing until the 10s window has expired.
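
For reference, here is a rough sketch of how those defaults would look
if written out explicitly in a unit file.  The unit description and the
ExecStart path are made up for illustration.  The sketch also assumes
the unit sets Restart= itself, since the systemd default for Restart= is
"no" and the start limit only comes into play for units that opt in to
restarting.

```ini
# Hypothetical unit, for illustration only.
# The values below are the systemd defaults quoted above.
[Unit]
Description=Example daemon (hypothetical)
# Rate limit: at most 5 starts within any 10s window.
StartLimitIntervalSec=10s
StartLimitBurst=5
# Nothing extra happens when the limit is hit.
StartLimitAction=none

[Service]
# Hypothetical binary path.
ExecStart=/usr/bin/example-daemon
# Assumption: the unit opts in to restarts (the default is Restart=no).
Restart=on-failure
# Wait 100ms before each restart attempt.
RestartSec=100ms
```

With these values, a service that crashes immediately on every start
would burn through all 5 starts in roughly 5 x 100ms = 500ms.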

I’d like to propose a few changes for openbmc:

1.  Change the StartLimitBurst to 3
Five just seems excessive for our services in openbmc.  In all failure
scenarios I’ve seen so far (other than with phosphor-hwmon), either
restarting once does the job, or restarting all 5 times does not help
and we just end up hitting the limit of 5 anyway.

2. Change the RestartSec from 100ms to 1s.
When a service hits a failure, our new debug collection service kicks
in.  When a core file is involved we’ve found that generating 5 core
files within ~500ms puts a huge strain on the BMC.  Also, if we are
going to get a fix on a restart of a service, the more time the better
(think retries on device driver scenarios).

3. Define a StartLimitAction for critical services to “reboot” the BMC
With 1 and 2 above, we could have services restarting indefinitely with
no real recovery on the BMC.  Certain services are critical though,
and I believe their failure should result in a BMC reset to try and
recover.  Those services are the following:
   o dbus.service
   o xyz.openbmc_project.ObjectMapper.service
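
As a rough sketch of how proposals 1-3 could be expressed, the fragments
below use a manager-wide default for the restart delay and burst limit,
plus a per-unit drop-in for one of the critical services.  The file
locations and the drop-in file name are assumptions for illustration,
not something decided in the issue, and the drop-in assumes the service
already sets Restart= in its own unit file.

```ini
# /etc/systemd/system.conf (assumed location)
# Proposals 1 and 2 as manager-wide defaults for all units.
[Manager]
DefaultStartLimitBurst=3
DefaultRestartSec=1s
```

```ini
# /etc/systemd/system/xyz.openbmc_project.ObjectMapper.service.d/restart-policy.conf
# (drop-in file name is hypothetical)
# Proposal 3: reboot the BMC once this unit hits its start limit.
[Unit]
StartLimitIntervalSec=10s
StartLimitBurst=3
StartLimitAction=reboot
```

Units that are not given a drop-in like this would keep
StartLimitAction=none and simply stop being restarted once they hit
the limit.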

Some services that are on the bubble for me (external interfaces):
   o phosphor-ipmi-host.service
   o phosphor-ipmi-net.service
   o dropbear@.service
   o phosphor-gevent.service

I have some maintainability concerns with trying to pick specific
services to cause a BMC reboot.  Maybe it would be better to define a
default where all services cause a BMC reboot, then pick specific
ones that would not result in a reboot?  Or maybe it’s best to never
reboot, and just let the system owners manage it?  Thoughts
appreciated.
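
One possible shape for the "reboot by default, opt out per service"
alternative is sketched below.  It relies on type-level drop-in
directories (service.d/), which to my knowledge only exist in newer
systemd releases (v244 and later), so treat this as an assumption
rather than something available on every image.  The file names are
hypothetical, and phosphor-gevent.service is used only as an example of
a unit that might opt out.

```ini
# /etc/systemd/system/service.d/10-reboot-on-limit.conf
# (hypothetical file name; applies to every .service unit)
[Unit]
StartLimitAction=reboot
```

```ini
# /etc/systemd/system/phosphor-gevent.service.d/20-no-reboot.conf
# (hypothetical file name; per-unit opt-out)
[Unit]
StartLimitAction=none
```

The per-unit drop-in takes precedence over the type-level one, so the
list that has to be maintained becomes the set of exceptions rather than
the set of critical services.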

References:
https://www.freedesktop.org/software/systemd/man/systemd.unit.html#
