* [Drbd-dev] stonith-related regression introduced around kernel 3.13, with 3.15.3 still affected
From: Mariusz Mazur @ 2014-07-03 13:07 UTC (permalink / raw)
To: drbd-dev; +Cc: Jacek Konieczny
My setup is two nodes with DRBD in double master (dual-primary) mode,
corosync, pacemaker, clvmd and Xen 4.4.0.
Here's what happens when I run reboot -f on one of the nodes and the
surviving node is running kernel 3.12.23 or earlier (the oldest I tested
was 3.6.x):
[10512.040601] d-con vpbx_dev3: PingAck did not arrive in time.
[10512.136930] d-con vpbx_dev3: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
[10512.346951] d-con vpbx_dev3: asender terminated
[10512.443875] d-con vpbx_dev3: Terminating drbd_a_vpbx_dev
[10512.540918] d-con vpbx_dev3: Connection closed
[10512.636365] d-con vpbx_dev3: conn( NetworkFailure -> Unconnected )
[10512.733924] d-con vpbx_dev3: receiver terminated
[10512.829033] d-con vpbx_dev3: helper command: /sbin/drbdadm fence-peer vpbx_dev3
[10512.924943] d-con vpbx_dev3: Restarting receiver thread
[10513.022739] d-con vpbx_dev3: receiver (re)started
[10513.116863] d-con vpbx_dev3: conn( Unconnected -> WFConnection )
[10518.874822] dlm: closing connection to node 2
[10519.468190] d-con vpbx_dev3: helper command: /sbin/drbdadm fence-peer vpbx_dev3 exit code 5 (0x500)
[10519.604612] d-con vpbx_dev3: fence-peer helper returned 5 (peer is unreachable, assumed to be dead)
[10519.700071] d-con vpbx_dev3: pdsk( DUnknown -> Outdated )
[10519.810474] block drbd0: new current UUID 0E52D87F8EBDA1BB:A20A6CC6D7B5E6ED:2E667D8F22B5DA49:2E657D8F22B5DA49
[10519.943527] d-con vpbx_dev3: susp( 1 -> 0 )
[10522.114574] dlm: clvmd: dlm_recover 3
[10522.114619] dlm: clvmd: remove member 2
[10522.114623] dlm: clvmd: dlm_recover_members 1 nodes
[10522.114626] dlm: clvmd: generation 17 slots 1 1:1
[10522.114627] dlm: clvmd: dlm_recover_directory
[10522.114629] dlm: clvmd: dlm_recover_directory 0 in 0 new
[10522.114631] dlm: clvmd: dlm_recover_directory 0 out 0 messages
[10522.114633] dlm: clvmd: dlm_recover_masters
[10522.114634] dlm: clvmd: dlm_recover_masters 0 of 0
[10522.114636] dlm: clvmd: dlm_recover_locks 0 out
[10522.114637] dlm: clvmd: dlm_recover_locks 0 in
[10522.114662] dlm: clvmd: dlm_recover 3 generation 17 done: 0 ms
Everything seems fine, drbd remains accessible.
And here's what happens if the surviving node is running 3.13.6 (or 3.14.8 or
3.15.3):
[ 382.002770] d-con vpbx_dev3: PingAck did not arrive in time.
[ 382.079354] d-con vpbx_dev3: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
[ 382.245092] d-con vpbx_dev3: asender terminated
[ 382.322803] d-con vpbx_dev3: Terminating drbd_a_vpbx_dev
[ 382.400862] d-con vpbx_dev3: Connection closed
[ 382.484773] d-con vpbx_dev3: out of mem, failed to invoke fence-peer helper
[ 382.562487] d-con vpbx_dev3: conn( NetworkFailure -> Unconnected )
[ 382.640106] d-con vpbx_dev3: receiver terminated
[ 382.717000] d-con vpbx_dev3: Restarting receiver thread
[ 382.793857] d-con vpbx_dev3: receiver (re)started
[ 382.869857] d-con vpbx_dev3: conn( Unconnected -> WFConnection )
[ 384.326309] dlm: closing connection to node 1
[ 387.172256] dlm: clvmd: dlm_recover 3
[ 387.172303] dlm: clvmd: remove member 1
[ 387.172306] dlm: clvmd: dlm_recover_members 1 nodes
[ 387.172309] dlm: clvmd: generation 19 slots 1 2:2
[ 387.172311] dlm: clvmd: dlm_recover_directory
[ 387.172312] dlm: clvmd: dlm_recover_directory 0 in 0 new
[ 387.172314] dlm: clvmd: dlm_recover_directory 0 out 0 messages
[ 387.172316] dlm: clvmd: dlm_recover_masters
[ 387.172318] dlm: clvmd: dlm_recover_masters 0 of 0
[ 387.172320] dlm: clvmd: dlm_recover_locks 0 out
[ 387.172321] dlm: clvmd: dlm_recover_locks 0 in
[ 387.172345] dlm: clvmd: dlm_recover 3 generation 19 done: 0 ms
With the result being:
[root@dev3n2 ~]# cat /proc/drbd
version: 8.4.3 (api:1/proto:86-101)
srcversion: F97798065516C94BE0F27DC
0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s---d-
ns:0 nr:0 dw:0 dr:264 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
And I need to reboot the node before drbd becomes operational again.
I won't have time to properly bisect this for a few weeks, though if somebody
has a guess at what's wrong I can test a patch or provide more info.
--mmazur
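The stuck state in the /proc/drbd output above can be checked mechanically. What follows is a minimal C sketch, not part of DRBD or its tools: `struct drbd_status`, `parse_drbd_status()`, `drbd_io_suspended()` and `drbd_looks_stuck()` are hypothetical names introduced here. It assumes the 8.4-style one-line device summary quoted above, and it assumes that a leading `s` in the flags field means I/O is suspended, which would match the susp( 0 -> 1 ) transition that is never reverted in the failing log.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the fields of one /proc/drbd device line. */
struct drbd_status {
    int  minor;      /* device minor number, e.g. 0 */
    char cs[32];     /* connection state, e.g. "WFConnection" */
    char ro[32];     /* local/peer roles, e.g. "Primary/Unknown" */
    char ds[32];     /* local/peer disk states, e.g. "UpToDate/DUnknown" */
    char proto;      /* replication protocol letter, e.g. 'C' */
    char flags[16];  /* I/O flags, e.g. "s---d-" */
};

/* Parse a line such as
 *   " 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s---d-"
 * Returns 0 on success, -1 if the line does not have that shape. */
int parse_drbd_status(const char *line, struct drbd_status *st)
{
    assert(line != NULL && st != NULL);
    int n = sscanf(line, " %d: cs:%31s ro:%31s ds:%31s %c %15s",
                   &st->minor, st->cs, st->ro, st->ds,
                   &st->proto, st->flags);
    return n == 6 ? 0 : -1;
}

/* Assumption: the first flag character is 's' while I/O is suspended. */
int drbd_io_suspended(const struct drbd_status *st)
{
    return st->flags[0] == 's';
}

/* The symptom reported above: waiting for the peer with I/O still frozen. */
int drbd_looks_stuck(const struct drbd_status *st)
{
    return strcmp(st->cs, "WFConnection") == 0 && drbd_io_suspended(st);
}
```

Feeding the device line from the report into `parse_drbd_status()` and then calling `drbd_looks_stuck()` flags the frozen device. Header lines such as the `version:` line are rejected by the parser, so a caller can simply skip them.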
* Re: [Drbd-dev] stonith-related regression introduced around kernel 3.13, with 3.15.3 still affected
From: Lars Ellenberg @ 2014-07-03 13:44 UTC (permalink / raw)
To: drbd-dev
On Thu, Jul 03, 2014 at 03:07:18PM +0200, Mariusz Mazur wrote:
> My setup is two nodes with drbd double master, corosync, pacemaker, clvmd, xen
> 4.4.0.
>
> Here's what happens when I reboot -f one of the nodes and the surviving node
> is kernel 3.12.23 or earlier (oldest tested was 3.6.something):
Yep, someone changed the in-kernel kthread API
to use wait_for_completion_killable()
where it used to use wait_for_completion().
That interacts badly with how DRBD handles things.
This is being fixed.
Lars
--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
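The interaction described in the message above can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not kernel or DRBD code: `struct model_task` and the `model_*` functions are hypothetical names, the model treats any pending signal on the creating task as enough to abort a killable wait (a simplification of the kernel's fatal-signal check), and it assumes the aborted creation is reported as -EINTR. The message strings are copied from the logs in the original report.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Toy stand-in for a kernel task: it only tracks whether a signal is
 * still pending on the thread that wants to start a helper thread. */
struct model_task {
    bool signal_pending;
};

/* Older behaviour (uninterruptible wait): a pending signal on the
 * creator is irrelevant, so thread creation succeeds. */
int model_create_uninterruptible(const struct model_task *creator)
{
    (void)creator;
    return 0;
}

/* Newer behaviour (killable wait): the wait is abandoned when a signal
 * is pending, and creation fails. -EINTR is an assumption of this model. */
int model_create_killable(const struct model_task *creator)
{
    return creator->signal_pending ? -EINTR : 0;
}

/* Mirrors the shape of the caller's error check: every failure, whatever
 * its cause, produces the same "out of mem" message. */
const char *model_report(int err)
{
    return err ? "out of mem, failed to invoke fence-peer helper"
               : "helper command: /sbin/drbdadm fence-peer";
}
```

With `signal_pending` set, `model_create_uninterruptible()` succeeds while `model_create_killable()` fails, and `model_report()` turns that failure into the "out of mem" line seen on the affected kernels even though no allocation failed. Clearing the flag before the call makes the killable variant succeed again.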
* Re: [Drbd-dev] stonith-related regression introduced around kernel 3.13, with 3.15.3 still affected
From: Lars Ellenberg @ 2014-07-03 13:54 UTC (permalink / raw)
To: drbd-dev
On Thu, Jul 03, 2014 at 03:44:17PM +0200, Lars Ellenberg wrote:
> On Thu, Jul 03, 2014 at 03:07:18PM +0200, Mariusz Mazur wrote:
> > My setup is two nodes with drbd double master, corosync, pacemaker, clvmd, xen
> > 4.4.0.
> >
> > Here's what happens when I reboot -f one of the nodes and the surviving node
> > is kernel 3.12.23 or earlier (oldest tested was 3.6.something):
>
> Yep, someone changed the in kernel kthread api
> to use wait_for_completion_killable()
> where it used to be wait_for_completion().
>
> Which has some bad interactions with how DRBD handles things.
> This is being fixed.
Would you please try this patch:
diff --git a/drbd/drbd_nl.c b/drbd/drbd_nl.c
index 9e6adaa..88f480c 100644
--- a/drbd/drbd_nl.c
+++ b/drbd/drbd_nl.c
@@ -586,6 +586,7 @@ void conn_try_outdate_peer_async(struct drbd_connection *connection)
 	struct task_struct *opa;
 
 	kref_get(&connection->kref);
+	flush_pending_signals();
 	opa = kthread_run(_try_outdate_peer_async, connection, "drbd_async_h");
 	if (IS_ERR(opa)) {
 		drbd_err(connection, "out of mem, failed to invoke fence-peer helper\n");
* Re: [Drbd-dev] stonith-related regression introduced around kernel 3.13, with 3.15.3 still affected
From: Mariusz Mazur @ 2014-07-03 17:49 UTC (permalink / raw)
To: drbd-dev
On Thu of July 3 2014, Lars Ellenberg wrote:
> Would you please try this patch:
>
> diff --git a/drbd/drbd_nl.c b/drbd/drbd_nl.c
> index 9e6adaa..88f480c 100644
> --- a/drbd/drbd_nl.c
> +++ b/drbd/drbd_nl.c
> @@ -586,6 +586,7 @@ void conn_try_outdate_peer_async(struct drbd_connection *connection)
>  	struct task_struct *opa;
>
>  	kref_get(&connection->kref);
> +	flush_pending_signals();
>  	opa = kthread_run(_try_outdate_peer_async, connection, "drbd_async_h");
>  	if (IS_ERR(opa)) {
>  		drbd_err(connection, "out of mem, failed to invoke fence-peer helper\n");
There's no such function in the kernel.
--mmazur
* Re: [Drbd-dev] stonith-related regression introduced around kernel 3.13, with 3.15.3 still affected
From: Lars Ellenberg @ 2014-07-04 11:42 UTC (permalink / raw)
To: drbd-dev
On Thu, Jul 03, 2014 at 07:49:09PM +0200, Mariusz Mazur wrote:
> On Thu of July 3 2014, Lars Ellenberg wrote:
>
> > Would you please try this patch:
> >
> > diff --git a/drbd/drbd_nl.c b/drbd/drbd_nl.c
> > index 9e6adaa..88f480c 100644
> > --- a/drbd/drbd_nl.c
> > +++ b/drbd/drbd_nl.c
> > @@ -586,6 +586,7 @@ void conn_try_outdate_peer_async(struct drbd_connection *connection)
> >  	struct task_struct *opa;
> >
> >  	kref_get(&connection->kref);
> > +	flush_pending_signals();
> >  	opa = kthread_run(_try_outdate_peer_async, connection, "drbd_async_h");
> >  	if (IS_ERR(opa)) {
> >  		drbd_err(connection, "out of mem, failed to invoke fence-peer helper\n");
>
> There's no such function in the kernel.
Yeah, sorry, typo: leave off the "pending_".
It's flush_signals(current);
diff --git a/drbd/drbd_nl.c b/drbd/drbd_nl.c
index 9e6adaa..d8b83d7 100644
--- a/drbd/drbd_nl.c
+++ b/drbd/drbd_nl.c
@@ -586,6 +586,7 @@ void conn_try_outdate_peer_async(struct drbd_connection *connection)
 	struct task_struct *opa;
 
 	kref_get(&connection->kref);
+	flush_signals(current);
 	opa = kthread_run(_try_outdate_peer_async, connection, "drbd_async_h");
 	if (IS_ERR(opa)) {
 		drbd_err(connection, "out of mem, failed to invoke fence-peer helper\n");
* Re: [Drbd-dev] stonith-related regression introduced around kernel 3.13, with 3.15.3 still affected
From: Mariusz Mazur @ 2014-07-07 8:41 UTC (permalink / raw)
To: drbd-dev, Jacek Konieczny
On Fri of July 4 2014, Lars Ellenberg wrote:
> Yeah, sorry, typo, leave off the pending-.
>
> It's flush_signals(current);
>
> diff --git a/drbd/drbd_nl.c b/drbd/drbd_nl.c
> index 9e6adaa..d8b83d7 100644
> --- a/drbd/drbd_nl.c
> +++ b/drbd/drbd_nl.c
> @@ -586,6 +586,7 @@ void conn_try_outdate_peer_async(struct drbd_connection *connection)
>  	struct task_struct *opa;
>
>  	kref_get(&connection->kref);
> +	flush_signals(current);
>  	opa = kthread_run(_try_outdate_peer_async, connection, "drbd_async_h");
>  	if (IS_ERR(opa)) {
>  		drbd_err(connection, "out of mem, failed to invoke fence-peer helper\n");
Yup, this worked. Should it maybe get sent upstream to the kernel? :)
--mmazur
* Re: [Drbd-dev] stonith-related regression introduced around kernel 3.13, with 3.15.3 still affected
From: Lars Ellenberg @ 2014-07-07 9:03 UTC (permalink / raw)
To: drbd-dev
On Mon, Jul 07, 2014 at 10:41:33AM +0200, Mariusz Mazur wrote:
> On Fri of July 4 2014, Lars Ellenberg wrote:
>
> > Yeah, sorry, typo, leave off the pending-.
> >
> > It's flush_signals(current);
> >
> > diff --git a/drbd/drbd_nl.c b/drbd/drbd_nl.c
> > index 9e6adaa..d8b83d7 100644
> > --- a/drbd/drbd_nl.c
> > +++ b/drbd/drbd_nl.c
> > @@ -586,6 +586,7 @@ void conn_try_outdate_peer_async(struct drbd_connection *connection)
> >  	struct task_struct *opa;
> >
> >  	kref_get(&connection->kref);
> > +	flush_signals(current);
> >  	opa = kthread_run(_try_outdate_peer_async, connection, "drbd_async_h");
> >  	if (IS_ERR(opa)) {
> >  		drbd_err(connection, "out of mem, failed to invoke fence-peer helper\n");
>
> Yup, this worked. Should it maybe get sent upstream to the kernel? :)
Yes Sir Captain Obvious Sir.
;-)
Thanks for confirming, anyways.
Lars