From: Vernon Mauery <vernon.mauery@linux.intel.com>
To: Andrew Geissler <geissonator@gmail.com>
Cc: OpenBMC Maillist <openbmc@lists.ozlabs.org>
Subject: Re: RFC Systemd Service Restart Policy
Date: Wed, 6 Sep 2017 12:50:26 -0700
Message-ID: <20170906195026.GD69617@mauery>
In-Reply-To: <CALLMt=q55PXP7_D49z2NHh9eZDzm1Z6MRzbSe=j1uaoUjsuFiA@mail.gmail.com>

On 06-Sep-2017 02:03 PM, Andrew Geissler wrote:
> I’ve got an old but good one this sprint,
> https://github.com/openbmc/openbmc/issues/272
> 
> The point of this issue is to define our restart and recovery policy
> for openbmc services.
> 
> Currently we’re using the systemd defaults, which are the following:
> RestartSec=100ms
> StartLimitIntervalSec=10s
> StartLimitBurst=5
> StartLimitAction=none
> 
> So basically if a service fails, we will restart it up to 5 times,
> every 10s, with a 100ms delay between each restart.
> There is no action taken when we reach the 5 restarts, other than to
> do nothing until the 10s window has expired.
> 
> I’d like to propose a few changes for openbmc:
> 
> 1.  Change the StartLimitBurst to 3
> Five just seems excessive for our services in openbmc.  In all fail
> scenarios I’ve seen so far (other than with phosphor-hwmon), either
> restarting once does the job or restarting all 5 times does not help
> and we just end up hitting the 5 limit anyway.
> 
> 2. Change the RestartSec from 100ms to 1s.
> When a service hits a failure, our new debug collection service kicks
> in.  When a core file is involved we’ve found that generating 5 core
> files within ~500ms puts a huge strain on the BMC.  Also, if we are
> going to get a fix on a restart of a service, the more time the better
> (think retries on device driver scenarios).

I think these two are pretty reasonable. We have had similar behavior 
implemented on prior generations of BMC. I like your reasoning for both 
changes.
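
As a rough sketch only, proposals 1 and 2 could be expressed as
manager-wide defaults along the following lines. The Default* settings
are documented in systemd-system.conf(5); placing them in
/etc/systemd/system.conf is an assumption here, since the thread does
not say how OpenBMC would ship the change.

```ini
# /etc/systemd/system.conf (assumed location; a drop-in under
# /etc/systemd/system.conf.d/ would work the same way)
[Manager]
# Proposal 2: wait 1s between restarts instead of the 100ms default
DefaultRestartSec=1s
# Proposal 1: allow 3 starts per interval instead of the default 5
DefaultStartLimitBurst=3
# Unchanged from the systemd default, shown for completeness
DefaultStartLimitIntervalSec=10s
```

These defaults only apply to units that do not set the corresponding
options themselves, and a unit still needs its own Restart= setting to
be restarted at all.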

> 3. Define a StartLimitAction for critical services to “reboot” the BMC
> With 1 and 2 above, we could have services starting indefinitely with
> no real recovery on the BMC.  Certain services are critical though,
> and I believe should result in a BMC reset to try and recover.  Those
> services are the following:
>    o dbus.service
>    o xyz.openbmc_project.ObjectMapper.service
> 
> Some services that are on the bubble for me (external interfaces):
>    o phosphor-ipmi-host.service
>    o phosphor-ipmi-net.service
>    o dropbear@.service
>    o phosphor-gevent.service
> 
> I have some maintainability concerns with trying to pick specific
> services to cause a BMC reboot.  Maybe it would be better to define a
> default that all services cause a BMC reboot, then pick specific
> ones that would not result in a reboot?  Or maybe it’s best to never
> reboot, and just let the system owners manage it?  Thoughts
> appreciated.

I would prefer that we have a core set of services (such as dbus and 
the mapper) whose failures are terminal faults (maybe even without 
retries) and then assume that everything else can be restarted nicely. 
If something cannot be restarted nicely, there should be a really good 
reason for that, and that service's unit file can specify something 
other than the defaults to change its behavior.

This is a Linux system; in the ideal world, it should only need to be 
restarted for firmware updates. All other faults should be recoverable. 
Ideal world aside, individual services that can only be recovered with a 
reboot can handle that case without adjusting the global default.
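
A minimal sketch of that per-service override, assuming a drop-in file
for the mapper; the drop-in file name is hypothetical and nothing here
is taken from an actual OpenBMC recipe:

```ini
# /etc/systemd/system/xyz.openbmc_project.ObjectMapper.service.d/reboot.conf
# (hypothetical drop-in path and file name)
[Unit]
# Reboot the BMC once this unit hits its start rate limit;
# units without this line keep the default StartLimitAction=none
StartLimitAction=reboot
```

Services that can be restarted nicely need no such file and simply 
inherit the global defaults.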

--Vernon

> References:
> https://www.freedesktop.org/software/systemd/man/systemd.unit.html#
