RFC Systemd Service Restart Policy

* RFC Systemd Service Restart Policy
From: Andrew Geissler @ 2017-09-06 19:03 UTC
  To: OpenBMC Maillist

I’ve got an old but good one this sprint,
https://github.com/openbmc/openbmc/issues/272

The point of this issue is to define our restart and recovery policy
for openbmc services.

Currently we’re using the systemd defaults, which are the following:
RestartSec=100ms
StartLimitIntervalSec=10s
StartLimitBurst=5
StartLimitAction=none

So basically if a service fails, we will restart it up to 5 times
within a 10s window, with a 100ms delay before each restart.
There is no action taken when we reach the 5 restarts, other than to
refuse further starts until the 10s window has expired.
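
For reference, this is roughly how those defaults would look if written
out explicitly in a unit file. The unit name and binary path are made up
for illustration, and Restart=on-failure is an assumption: the limits only
matter for automatic restarts if the unit sets Restart= to something other
than its default of "no".

```ini
# example.service (hypothetical unit, for illustration only)
[Unit]
Description=Hypothetical example service
# Allow at most 5 starts within any 10s window; take no action at the limit.
StartLimitIntervalSec=10s
StartLimitBurst=5
StartLimitAction=none

[Service]
# Hypothetical binary path.
ExecStart=/usr/bin/example-daemon
# Assumed here; without a Restart= setting systemd does not restart the service.
Restart=on-failure
# Wait 100ms before each automatic restart.
RestartSec=100ms
```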

I’d like to propose a few changes for openbmc:

1. Change the StartLimitBurst to 3
Five just seems excessive for our services in openbmc.  In all fail
scenarios I’ve seen so far (other than with phosphor-hwmon), either
restarting once does the job or restarting all 5 times does not help
and we just end up hitting the limit of 5 anyway.

2. Change the RestartSec from 100ms to 1s.
When a service hits a failure, our new debug collection service kicks
in.  When a core file is involved we’ve found that generating 5 core
files within ~500ms puts a huge strain on the BMC.  Also, if a restart
is going to fix a service, the more time we give it the better
(think retries on device driver scenarios).
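
One way changes 1 and 2 might be applied to every service at once is a
manager-level drop-in, sketched below. The file name is hypothetical; the
Default*= settings are the system.conf counterparts of the per-unit
directives and only affect units that do not set their own values.

```ini
# /etc/systemd/system.conf.d/10-restart-policy.conf (hypothetical file name)
[Manager]
# Proposal 2: wait 1s instead of 100ms before each automatic restart.
DefaultRestartSec=1s
# Proposal 1: allow 3 starts per interval instead of 5.
DefaultStartLimitBurst=3
# Unchanged from the current default, listed only for completeness.
DefaultStartLimitIntervalSec=10s
```

A unit that needs different values could still override them in its own
unit file or drop-in.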

3. Define a StartLimitAction for critical services to “reboot” the BMC
With 1 and 2 above, we could have services starting indefinitely with
no real recovery on the BMC.  Certain services are critical though,
and I believe their failure should result in a BMC reset to try to
recover.  Those services are the following:
   o dbus.service
   o xyz.openbmc_project.ObjectMapper.service

Some services that are on the bubble for me (external interfaces):
   o phosphor-ipmi-host.service
   o phosphor-ipmi-net.service
   o dropbear@.service
   o phosphor-gevent.service

I have some maintainability concerns with trying to pick specific
services to cause a BMC reboot.  Maybe it would be better to define a
default where all services cause a BMC reboot, then pick specific
ones that would not result in a reboot?  Or maybe it’s best to never
reboot, and just let the system owners manage it?  Thoughts
appreciated.
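
As a sketch of the per-service variant of change 3, a drop-in like the
following could be added for each critical service. The drop-in path and
file name are assumptions for illustration; only the unit name and the
StartLimitAction= directive come from the text above.

```ini
# /etc/systemd/system/xyz.openbmc_project.ObjectMapper.service.d/10-reboot.conf
# (hypothetical drop-in path and file name)
[Unit]
# Reboot the BMC once the start limit for this unit has been reached.
StartLimitAction=reboot
```

The maintainability concern shows up here directly: each service picked
for a reboot needs its own drop-in of this kind.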

References:
https://www.freedesktop.org/software/systemd/man/systemd.unit.html#

* Re: RFC Systemd Service Restart Policy
From: Vernon Mauery @ 2017-09-06 19:50 UTC
  To: Andrew Geissler; +Cc: OpenBMC Maillist

On 06-Sep-2017 02:03 PM, Andrew Geissler wrote:
> I’ve got an old but good one this sprint,
> https://github.com/openbmc/openbmc/issues/272
> 
> The point of this issue is to define our restart and recovery policy
> for openbmc services.
> 
> Currently we’re using the systemd defaults, which are the following:
> RestartSec=100ms
> StartLimitIntervalSec=10s
> StartLimitBurst=5
> StartLimitAction=none
> 
> So basically if a service fails, we will restart it up to 5 times
> within a 10s window, with a 100ms delay before each restart.
> There is no action taken when we reach the 5 restarts, other than to
> refuse further starts until the 10s window has expired.
> 
> I’d like to propose a few changes for openbmc:
> 
> 1. Change the StartLimitBurst to 3
> Five just seems excessive for our services in openbmc.  In all fail
> scenarios I’ve seen so far (other than with phosphor-hwmon), either
> restarting once does the job or restarting all 5 times does not help
> and we just end up hitting the limit of 5 anyway.
> 
> 2. Change the RestartSec from 100ms to 1s.
> When a service hits a failure, our new debug collection service kicks
> in.  When a core file is involved we’ve found that generating 5 core
> files within ~500ms puts a huge strain on the BMC.  Also, if a restart
> is going to fix a service, the more time we give it the better
> (think retries on device driver scenarios).

I think these two are pretty reasonable. We have had similar behavior 
implemented on prior generations of BMC. I like your reasoning for both 
changes.

> 3. Define a StartLimitAction for critical services to “reboot” the BMC
> With 1 and 2 above, we could have services starting indefinitely with
> no real recovery on the BMC.  Certain services are critical though,
> and I believe their failure should result in a BMC reset to try to
> recover.  Those services are the following:
>    o dbus.service
>    o xyz.openbmc_project.ObjectMapper.service
> 
> Some services that are on the bubble for me (external interfaces):
>    o phosphor-ipmi-host.service
>    o phosphor-ipmi-net.service
>    o dropbear@.service
>    o phosphor-gevent.service
> 
> I have some maintainability concerns with trying to pick specific
> services to cause a BMC reboot.  Maybe it would be better to define a
> default where all services cause a BMC reboot, then pick specific
> ones that would not result in a reboot?  Or maybe it’s best to never
> reboot, and just let the system owners manage it?  Thoughts
> appreciated.

I would prefer that we have a core set of services (such as dbus and
the mapper) whose failures are terminal faults (maybe even without
retries) and then assume that everything else can be restarted nicely.
If something cannot be restarted nicely, there should be a really good
reason for that, and that service's unit file can specify something
other than the defaults to change its behavior.
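
A minimal sketch of the "terminal fault without retries" idea, written as
a drop-in for the mapper unit. The drop-in path is hypothetical, and
combining FailureAction= with Restart=no is an assumption about how this
could be expressed, not something settled in this thread. FailureAction=
is documented in systemd.unit(5); the section it belongs in may differ on
older systemd releases, so check the version in use.

```ini
# /etc/systemd/system/xyz.openbmc_project.ObjectMapper.service.d/10-terminal.conf
# (hypothetical drop-in path and file name)
[Unit]
# Reboot the BMC as soon as the unit enters the failed state.
FailureAction=reboot

[Service]
# No automatic retries: the first failure is treated as terminal.
Restart=no
```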

This is a Linux system; in the ideal world, it should only need to be 
restarted for firmware updates. All other faults should be recoverable. 
Ideal world aside, individual services that can only be recovered with a 
reboot can handle that case without adjusting the global default.

--Vernon

> References:
> https://www.freedesktop.org/software/systemd/man/systemd.unit.html#

* Re: RFC Systemd Service Restart Policy
From: Andrew Jeffery @ 2017-09-07  1:47 UTC
  To: Vernon Mauery, Andrew Geissler; +Cc: OpenBMC Maillist

On Wed, 2017-09-06 at 12:50 -0700, Vernon Mauery wrote:
> > I have some maintainability concerns with trying to pick specific
> > services to cause a BMC reboot.  Maybe it would be better to define a
> > default where all services cause a BMC reboot, then pick specific
> > ones that would not result in a reboot?  Or maybe it’s best to never
> > reboot, and just let the system owners manage it?  Thoughts
> > appreciated.
> 
> I would prefer that we have a core set of services (such as dbus and
> the mapper) whose failures are terminal faults (maybe even without
> retries) and then assume that everything else can be restarted nicely.
> If something cannot be restarted nicely, there should be a really good
> reason for that, and that service's unit file can specify something
> other than the defaults to change its behavior.
> 
> This is a Linux system; in the ideal world, it should only need to be 
> restarted for firmware updates. All other faults should be recoverable. 
> Ideal world aside, individual services that can only be recovered with a 
> reboot can handle that case without adjusting the global default.

I second Vernon's position.

Andrew
