From: Hannes Reinecke <hare@suse.de>
To: dm-devel@redhat.com
Subject: Re: [PATCH 09/13] multipathd: Implement systemd watchdog integration
Date: Mon, 25 Nov 2013 08:50:19 +0100
Message-ID: <5293013B.4010008@suse.de>
In-Reply-To: <20131122221759.GM1661@dhcp80-209.msp.redhat.com>

On 11/22/2013 11:17 PM, Benjamin Marzinski wrote:
> On Fri, Nov 15, 2013 at 11:29:40AM +0100, Hannes Reinecke wrote:
>> In the past there have been several instances where multipathd
>> would hang with the checkerloop as some path checker might not
>> be able to return in time.
>> This patch now activates the watchdog feature from systemd
>> to shutdown (and possibly restart) multipathd in these
>> situations.
>
> This might need more of a systemd fix than a multipathd one, but once
> multipathd times out the watchdog timer, even if it starts sending
> notifications at an acceptable rate again, the service is still listed
> as failed.
>
> # service multipathd status
> Redirecting to # /bin/systemctl status multipathd.service
> multipathd.service - Device-Mapper Multipath Device Controller
> Loaded: loaded (/usr/lib/systemd/system/multipathd.service; enabled)
> Active: failed (Result: watchdog) since Fri 2013-11-22 09:43:01 CST; 9min ago
> Main PID: 6321
> Status: "running"
> CGroup: name=systemd:/system/multipathd.service
> └─6321 /sbin/multipathd -d -s
>
> More annoying, the logs fill up with messages like
>
> Nov 22 09:46:28 ask-08 systemd[1]: multipathd.service: Got notification
> message from PID 6321, but reception only permitted for PID 0
> Nov 22 09:46:29 ask-08 systemd[1]: multipathd.service: Got notification
> message from PID 6321, but reception only permitted for PID 0
> Nov 22 09:46:30 ask-08 systemd[1]: multipathd.service: Got notification
> message from PID 6321, but reception only permitted for PID 0
> Nov 22 09:46:31 ask-08 systemd[1]: multipathd.service: Got notification
> message from PID 6321, but reception only permitted for PID 0
>
> Also
>
> # service multipathd stop
>
> won't kill it. Even worse
>
> # service multipathd start
>
> WILL kill it without successfully restarting another version. A second
>
> # service multipathd start
>
> is necessary to get things back to a functional state again.
>
Actually, upstream systemd (>= v207) now has a new flag
    Restart=on-watchdog
With that, systemd should restart multipathd after a watchdog
timeout. That should solve your immediate problem here.

> I'm not asking for systemd to actually shut down multipathd. In a
> production setup, killing multipathd because it had a temporary stall
> seems like bad default behavior. I haven't looked at the systemd
> watchdog code to know if this is possible, but ideally, multipathd would
> be able to just start sending watchdog notifications again, and be able
> to continue on with just a message in the logs recording the timeout.
>
Not stopping. Restarting.
The whole point of the watchdog code is to take some action if the
watchdog messages fail.
We should aim for
a) making the watchdog interval the longest time we're prepared to
   wait for checkerloop to complete (hence the patch to measure the
   elapsed time per loop iteration)
b) having systemd restart multipathd whenever the watchdog triggers,
   as then we're sure we can't recover from this.
That should cover your sentiment, right?
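
To make point a) a bit more concrete, here is a rough sketch of the
timing arithmetic involved. It is not the code from the patch series.
It assumes only what systemd documents: the watchdog timeout is passed
to the service in the WATCHDOG_USEC environment variable, and the
usual recommendation is to send keep-alives at half that interval.
All function names below are made up for illustration; a real daemon
would rather call sd_watchdog_enabled() and sd_notify() from
libsystemd than parse the environment itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper: parse the value of WATCHDOG_USEC.
 * Returns 0 (watchdog disabled) if the value is unset or is not a
 * plain decimal number. */
static inline uint64_t parse_watchdog_usec(const char *val)
{
	char *end;
	unsigned long long usec;

	if (!val || val[0] < '0' || val[0] > '9')
		return 0;
	errno = 0;
	usec = strtoull(val, &end, 10);
	if (errno != 0 || *end != '\0')
		return 0;
	return (uint64_t)usec;
}

/* Keep-alive period: half of the watchdog timeout. */
static inline uint64_t keepalive_usec(uint64_t watchdog_usec)
{
	return watchdog_usec / 2;
}

/* Microseconds elapsed between two CLOCK_MONOTONIC timestamps.
 * Returns 0 if 'end' lies before 'start'. */
static inline uint64_t elapsed_usec(const struct timespec *start,
				    const struct timespec *end)
{
	int64_t ns;

	assert(start && end);
	ns = ((int64_t)end->tv_sec - (int64_t)start->tv_sec) * 1000000000LL
	     + ((int64_t)end->tv_nsec - (int64_t)start->tv_nsec);
	return ns > 0 ? (uint64_t)(ns / 1000) : 0;
}

/* Nonzero if one loop iteration took longer than the watchdog
 * timeout allows. Always 0 when the watchdog is disabled. */
static inline int loop_overran(uint64_t elapsed, uint64_t watchdog_usec)
{
	return watchdog_usec > 0 && elapsed > watchdog_usec;
}
```

In the daemon this would be fed from getenv("WATCHDOG_USEC") and from
two clock_gettime(CLOCK_MONOTONIC, ...) calls taken around one pass of
the checker loop. If loop_overran() ever reports an overrun, the
configured interval is shorter than what the checkers need.
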
> I realize that there is a benefit to letting people know that there was
> a problem, but the way it's appearing now, it will be pretty confusing to
> the sysadmin who sees that, and filling up the logs with notification
> rejections is pretty annoying.
>
Yeah, correct. We should be using the 'restart' flag in the service
file. I did not do this as the patch went into systemd only
recently, and one would need to figure out how to treat
installations where an older systemd version is running.
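
One possible way to treat older installations, purely as a sketch: look
at the first line printed by 'systemctl --version', which starts with
"systemd NNN", and only rely on the new flag if NNN is high enough.
The helper names are hypothetical, and the minimum version is left as
a parameter; the 207 mentioned above should be checked against the
systemd NEWS file before it is hardcoded anywhere.

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: extract NNN from a line of the form
 * "systemd NNN...". Returns -1 if the line does not match. */
static inline int parse_systemd_version(const char *line)
{
	static const char prefix[] = "systemd ";
	const size_t plen = sizeof(prefix) - 1;
	char *end;
	long ver;

	if (!line || strncmp(line, prefix, plen) != 0)
		return -1;
	line += plen;
	if (*line < '0' || *line > '9')
		return -1;
	ver = strtol(line, &end, 10);
	if (ver <= 0 || ver > INT_MAX)
		return -1;
	return (int)ver;
}

/* Nonzero if the running systemd is at least min_version.
 * An unparsable version (-1) is treated as "too old". */
static inline int systemd_at_least(int version, int min_version)
{
	return version >= 0 && version >= min_version;
}
```

A packaging script or configure check could use the result to decide
which variant of the service file to install. Doing the check at
build time via pkg-config would be an equally valid choice.
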
> And as long as I'm asking for systemd things, the ability to add a rule
> to the unit file that kills the service and forces a core dump when
> watchdog timer was tripped would help tracking down what's stalling the
> checker loop. Like I said before, I don't think this should be
> happening by default, but putting it in there commented out might not be
> a bad idea.
>
Yeah, that would be preferable. Sadly there is no 'force coredump'
option. What I would like to have is an
'on-watchdog'
option in systemd, where one can configure the action to be taken
when the watchdog triggers.
But adding a new option touches systemd in a great many places, so
my initial attempt at that failed.
So I went for the easier route of just adding a new flag to an
existing setting.

Cheers,
Hannes
P.S.: But hey, at least someone is actually testing this stuff.
Cool.
--
Dr. Hannes Reinecke zSeries & Storage
hare@suse.de +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel