Date: Wed, 6 Sep 2017 12:50:26 -0700
From: Vernon Mauery
To: Andrew Geissler
Cc: OpenBMC Maillist
Subject: Re: RFC Systemd Service Restart Policy
Message-ID: <20170906195026.GD69617@mauery>

On 06-Sep-2017 02:03 PM, Andrew Geissler wrote:
> I’ve got an old but good one this sprint,
> https://github.com/openbmc/openbmc/issues/272
>
> The point of this issue is to define our restart and recovery policy
> for openbmc services.
>
> Currently we’re using the systemd defaults, which are the following:
> RestartSec=100ms
> StartLimitIntervalSec=10s
> StartLimitBurst=5
> StartLimitAction=none
>
> So basically if a service fails, we will restart it up to 5 times,
> every 10s, with a 100ms delay between each restart.
> There is no action taken when we reach the 5 restarts, other than to
> do nothing until the 10s window has expired.
>
> I’d like to propose a few changes for openbmc:
>
> 1. Change the StartLimitBurst to 3
> Five just seems excessive for our services in openbmc. In all fail
> scenarios I’ve seen so far (other than with phosphor-hwmon), either
> restarting once does the job or restarting all 5 times does not help
> and we just end up hitting the 5 limit anyway.
>
> 2. Change the RestartSec from 100ms to 1s.
> When a service hits a failure, our new debug collection service kicks
> in. When a core file is involved we’ve found that generating 5 core
> files within ~500ms puts a huge strain on the BMC. Also, if we are
> going to get a fix on a restart of a service, the more time the better
> (think retries on device driver scenarios).

I think these two are pretty reasonable. We have had similar behavior
implemented on prior generations of BMC. I like your reasoning for both
changes.

> 3. Define a StartLimitAction for critical services to “reboot” the BMC
> With 1 and 2 above, we could have services starting indefinitely with
> no real recovery on the BMC. Certain services are critical though,
> and I believe should result in a BMC reset to try and recover.
> Those services are the following:
> o dbus.service
> o xyz.openbmc_project.ObjectMapper.service
>
> Some services that are on the bubble for me (external interfaces):
> o phosphor-ipmi-host.service
> o phosphor-ipmi-net.service
> o dropbear@.service
> o phosphor-gevent.service
>
> I have some maintainability concerns with trying to pick specific
> services to cause a BMC reboot. Maybe it would be better to define a
> default that all services cause a BMC reboot, then pick specific
> ones that would not result in a reboot? Or maybe it’s best to never
> reboot, and just let the system owners manage it? Thoughts
> appreciated.

I would prefer that we have a set core (such as dbus and the mapper)
that are terminal faults (maybe even without retries) and then assume
that everything else can be restarted nicely. If something cannot be
restarted nicely, there should be a really good reason for that and that
service's unit file can specify something other than the defaults to
change its behavior.

This is a Linux system; in the ideal world, it should only need to be
restarted for firmware updates. All other faults should be recoverable.
Ideal world aside, individual services that can only be recovered with a
reboot can handle that case without adjusting the global default.

--Vernon

> References:
> https://www.freedesktop.org/software/systemd/man/systemd.unit.html#
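
For illustration only, the split described above (relaxed global
defaults from items 1 and 2, with a reboot action limited to a small
terminal-fault core) might be sketched roughly as follows. The file
paths and drop-in names are hypothetical and not taken from the OpenBMC
tree. The sketch also assumes a systemd version where StartLimitAction
is set in the [Unit] section, and that each service already sets its own
Restart= policy.

```ini
# Hypothetical path: /etc/systemd/system.conf.d/restart-policy.conf
# Manager-wide defaults applied to units that do not override them.
[Manager]
DefaultRestartSec=1s
DefaultStartLimitIntervalSec=10s
DefaultStartLimitBurst=3

# Hypothetical path: /etc/systemd/system/dbus.service.d/terminal-fault.conf
# Per-unit drop-in for a service treated as a terminal fault:
# reboot the BMC once the start limit is hit.
[Unit]
StartLimitAction=reboot
```

A service that cannot be restarted cleanly would carry its own drop-in
like the second fragment, so the global default stays untouched.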