* [Drbd-dev] stonith-related regression introduced around kernel 3.13, with 3.15.3 still affected
@ 2014-07-03 13:07 Mariusz Mazur
  2014-07-03 13:44 ` Lars Ellenberg
  0 siblings, 1 reply; 7+ messages in thread
From: Mariusz Mazur @ 2014-07-03 13:07 UTC
  To: drbd-dev; +Cc: Jacek Konieczny

My setup is two nodes with drbd double master, corosync, pacemaker, clvmd, xen 
4.4.0.

Here's what happens when I reboot -f one of the nodes and the surviving node 
is kernel 3.12.23 or earlier (oldest tested was 3.6.something):

 [10512.040601] d-con vpbx_dev3: PingAck did not arrive in time.
 [10512.136930] d-con vpbx_dev3: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
 [10512.346951] d-con vpbx_dev3: asender terminated
 [10512.443875] d-con vpbx_dev3: Terminating drbd_a_vpbx_dev
 [10512.540918] d-con vpbx_dev3: Connection closed
 [10512.636365] d-con vpbx_dev3: conn( NetworkFailure -> Unconnected )
 [10512.733924] d-con vpbx_dev3: receiver terminated
 [10512.829033] d-con vpbx_dev3: helper command: /sbin/drbdadm fence-peer vpbx_dev3
 [10512.924943] d-con vpbx_dev3: Restarting receiver thread
 [10513.022739] d-con vpbx_dev3: receiver (re)started
 [10513.116863] d-con vpbx_dev3: conn( Unconnected -> WFConnection )
 [10518.874822] dlm: closing connection to node 2
 [10519.468190] d-con vpbx_dev3: helper command: /sbin/drbdadm fence-peer vpbx_dev3 exit code 5 (0x500)
 [10519.604612] d-con vpbx_dev3: fence-peer helper returned 5 (peer is unreachable, assumed to be dead)
 [10519.700071] d-con vpbx_dev3: pdsk( DUnknown -> Outdated )
 [10519.810474] block drbd0: new current UUID 0E52D87F8EBDA1BB:A20A6CC6D7B5E6ED:2E667D8F22B5DA49:2E657D8F22B5DA49
 [10519.943527] d-con vpbx_dev3: susp( 1 -> 0 )
 [10522.114574] dlm: clvmd: dlm_recover 3
 [10522.114619] dlm: clvmd: remove member 2
 [10522.114623] dlm: clvmd: dlm_recover_members 1 nodes
 [10522.114626] dlm: clvmd: generation 17 slots 1 1:1
 [10522.114627] dlm: clvmd: dlm_recover_directory
 [10522.114629] dlm: clvmd: dlm_recover_directory 0 in 0 new
 [10522.114631] dlm: clvmd: dlm_recover_directory 0 out 0 messages
 [10522.114633] dlm: clvmd: dlm_recover_masters
 [10522.114634] dlm: clvmd: dlm_recover_masters 0 of 0
 [10522.114636] dlm: clvmd: dlm_recover_locks 0 out
 [10522.114637] dlm: clvmd: dlm_recover_locks 0 in
 [10522.114662] dlm: clvmd: dlm_recover 3 generation 17 done: 0 ms

Everything seems fine, drbd remains accessible.
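
The log above prints the helper result as "exit code 5 (0x500)". The hex value looks like a wait-style status word with the exit code in bits 8 to 15. Below is a minimal userspace sketch of that decoding, under that assumption. Both function names are made up for illustration, and only code 5 is described because it is the only code this log documents.

```c
/* Hypothetical helper: extract the exit code from a wait-style status
 * word, assuming the code sits in bits 8..15 (so 0x500 decodes to 5). */
static unsigned int helper_exit_code(unsigned int status)
{
	return (status >> 8) & 0xff;
}

/* Hypothetical helper: describe a fence-peer exit code. Only code 5 is
 * taken from the log above; other codes are deliberately left open. */
static const char *fence_peer_meaning(unsigned int code)
{
	if (code == 5)
		return "peer is unreachable, assumed to be dead";
	return "not covered by this log";
}
```

Calling helper_exit_code(0x500) yields the 5 that the next log line reports as "fence-peer helper returned 5".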

And here's what happens if the surviving node is running 3.13.6 (or 3.14.8 or 
3.15.3).

 [  382.002770] d-con vpbx_dev3: PingAck did not arrive in time.
 [  382.079354] d-con vpbx_dev3: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
 [  382.245092] d-con vpbx_dev3: asender terminated
 [  382.322803] d-con vpbx_dev3: Terminating drbd_a_vpbx_dev
 [  382.400862] d-con vpbx_dev3: Connection closed
 [  382.484773] d-con vpbx_dev3: out of mem, failed to invoke fence-peer helper
 [  382.562487] d-con vpbx_dev3: conn( NetworkFailure -> Unconnected )
 [  382.640106] d-con vpbx_dev3: receiver terminated
 [  382.717000] d-con vpbx_dev3: Restarting receiver thread
 [  382.793857] d-con vpbx_dev3: receiver (re)started
 [  382.869857] d-con vpbx_dev3: conn( Unconnected -> WFConnection )
 [  384.326309] dlm: closing connection to node 1
 [  387.172256] dlm: clvmd: dlm_recover 3
 [  387.172303] dlm: clvmd: remove member 1
 [  387.172306] dlm: clvmd: dlm_recover_members 1 nodes
 [  387.172309] dlm: clvmd: generation 19 slots 1 2:2
 [  387.172311] dlm: clvmd: dlm_recover_directory
 [  387.172312] dlm: clvmd: dlm_recover_directory 0 in 0 new
 [  387.172314] dlm: clvmd: dlm_recover_directory 0 out 0 messages
 [  387.172316] dlm: clvmd: dlm_recover_masters
 [  387.172318] dlm: clvmd: dlm_recover_masters 0 of 0
 [  387.172320] dlm: clvmd: dlm_recover_locks 0 out
 [  387.172321] dlm: clvmd: dlm_recover_locks 0 in
 [  387.172345] dlm: clvmd: dlm_recover 3 generation 19 done: 0 ms

With the result being:
[root@dev3n2 ~]# cat /proc/drbd 
version: 8.4.3 (api:1/proto:86-101)
srcversion: F97798065516C94BE0F27DC 
 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s---d-
    ns:0 nr:0 dw:0 dr:264 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

And I need to reboot the node before drbd becomes operational again.
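
For anyone who wants to detect this state from a script or monitoring tool, here is a rough userspace sketch that pulls the cs:, ro: and ds: fields out of a /proc/drbd minor line like the one above. The struct, the function names and the "unresolved" check are assumptions made for illustration, not part of DRBD. The check only encodes what this report shows: the connection is in WFConnection and the peer disk is still DUnknown, whereas the working kernel's log ends with the peer disk marked Outdated.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the fields of one /proc/drbd minor line. */
struct drbd_minor_status {
	int minor;
	char cs[32];	/* connection state, e.g. "WFConnection" */
	char ro[32];	/* roles, e.g. "Primary/Unknown" */
	char ds[32];	/* disk states, e.g. "UpToDate/DUnknown" */
};

/* Parse a line such as
 *   " 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s---d-"
 * Returns 0 on success, -1 if the line does not have that shape. */
static int parse_drbd_minor_line(const char *line,
				 struct drbd_minor_status *out)
{
	assert(line != NULL && out != NULL);
	if (sscanf(line, " %d: cs:%31s ro:%31s ds:%31s",
		   &out->minor, out->cs, out->ro, out->ds) != 4)
		return -1;
	return 0;
}

/* Hypothetical check: waiting for the peer while its disk state was
 * never resolved, as in the status output of this report. */
static int peer_state_unresolved(const struct drbd_minor_status *st)
{
	return strcmp(st->cs, "WFConnection") == 0 &&
	       strstr(st->ds, "/DUnknown") != NULL;
}
```

Feeding it the minor line from the output above gives cs "WFConnection" and ds "UpToDate/DUnknown", so peer_state_unresolved() returns nonzero. The "version:" and "srcversion:" lines are rejected by the parser.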

I won't have time to properly bisect this for a few weeks, though if somebody 
has a guess at what's wrong I can test a patch or provide more info.

--mmazur


* Re: [Drbd-dev] stonith-related regression introduced around kernel 3.13, with 3.15.3 still affected
  2014-07-03 13:07 [Drbd-dev] stonith-related regression introduced around kernel 3.13, with 3.15.3 still affected Mariusz Mazur
@ 2014-07-03 13:44 ` Lars Ellenberg
  2014-07-03 13:54   ` Lars Ellenberg
  0 siblings, 1 reply; 7+ messages in thread
From: Lars Ellenberg @ 2014-07-03 13:44 UTC
  To: drbd-dev

On Thu, Jul 03, 2014 at 03:07:18PM +0200, Mariusz Mazur wrote:
> My setup is two nodes with drbd double master, corosync, pacemaker, clvmd, xen 
> 4.4.0.
> 
> Here's what happens when I reboot -f one of the nodes and the surviving node 
> is kernel 3.12.23 or earlier (oldest tested was 3.6.something):

Yep, someone changed the in-kernel kthread API
to use wait_for_completion_killable()
where it used to be wait_for_completion().

Which has some bad interactions with how DRBD handles things.
This is being fixed.
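
As a rough illustration of that interaction, here is a toy userspace model (not kernel code). It assumes the calling DRBD thread still has a fatal signal pending when it tries to start the helper thread. The thread does not spell that out, but it is the situation a killable wait would react to. All toy_* names are invented for this sketch, and a single flag stands in for the real signal state.

```c
#include <stdbool.h>

/* Toy model: one flag stands in for "a fatal signal is pending on the
 * calling thread". */
struct toy_task {
	bool signal_pending;
};

/* Models wait_for_completion(): ignores signals and always completes. */
static int toy_wait(const struct toy_task *caller)
{
	(void)caller;
	return 0;
}

/* Models wait_for_completion_killable(): gives up with an error if a
 * fatal signal is pending on the caller. */
static int toy_wait_killable(const struct toy_task *caller)
{
	return caller->signal_pending ? -1 : 0;
}

/* Models clearing the caller's pending signals before the wait. */
static void toy_flush_signals(struct toy_task *caller)
{
	caller->signal_pending = false;
}

/* Models thread creation that waits killably for the new thread to
 * come up. Returns true if the helper thread was started. */
static bool toy_spawn_helper(const struct toy_task *caller)
{
	return toy_wait_killable(caller) == 0;
}
```

In this model, a caller with signal_pending set gets false from toy_spawn_helper(), which would correspond to the "failed to invoke fence-peer helper" line in the second log. With the old toy_wait() behaviour the same caller would have succeeded, and clearing the flag first makes the killable variant succeed as well.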

	Lars

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.


* Re: [Drbd-dev] stonith-related regression introduced around kernel 3.13, with 3.15.3 still affected
  2014-07-03 13:44 ` Lars Ellenberg
@ 2014-07-03 13:54   ` Lars Ellenberg
  2014-07-03 17:49     ` Mariusz Mazur
  0 siblings, 1 reply; 7+ messages in thread
From: Lars Ellenberg @ 2014-07-03 13:54 UTC
  To: drbd-dev

On Thu, Jul 03, 2014 at 03:44:17PM +0200, Lars Ellenberg wrote:
> On Thu, Jul 03, 2014 at 03:07:18PM +0200, Mariusz Mazur wrote:
> > My setup is two nodes with drbd double master, corosync, pacemaker, clvmd, xen 
> > 4.4.0.
> > 
> > Here's what happens when I reboot -f one of the nodes and the surviving node 
> > is kernel 3.12.23 or earlier (oldest tested was 3.6.something):
> 
> Yep, someone changed the in-kernel kthread API
> to use wait_for_completion_killable()
> where it used to be wait_for_completion().
> 
> Which has some bad interactions with how DRBD handles things.
> This is being fixed.

Would you please try this patch:

diff --git a/drbd/drbd_nl.c b/drbd/drbd_nl.c
index 9e6adaa..88f480c 100644
--- a/drbd/drbd_nl.c
+++ b/drbd/drbd_nl.c
@@ -586,6 +586,7 @@ void conn_try_outdate_peer_async(struct drbd_connection *connection)
 	struct task_struct *opa;
 
 	kref_get(&connection->kref);
+	flush_pending_signals();
 	opa = kthread_run(_try_outdate_peer_async, connection, "drbd_async_h");
 	if (IS_ERR(opa)) {
 		drbd_err(connection, "out of mem, failed to invoke fence-peer helper\n");


-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.


* Re: [Drbd-dev] stonith-related regression introduced around kernel 3.13, with 3.15.3 still affected
  2014-07-03 13:54   ` Lars Ellenberg
@ 2014-07-03 17:49     ` Mariusz Mazur
  2014-07-04 11:42       ` Lars Ellenberg
  0 siblings, 1 reply; 7+ messages in thread
From: Mariusz Mazur @ 2014-07-03 17:49 UTC
  To: drbd-dev

On Thu of July 3 2014, Lars Ellenberg wrote:

> Would you please try this patch:
> 
> diff --git a/drbd/drbd_nl.c b/drbd/drbd_nl.c
> index 9e6adaa..88f480c 100644
> --- a/drbd/drbd_nl.c
> +++ b/drbd/drbd_nl.c
> @@ -586,6 +586,7 @@ void conn_try_outdate_peer_async(struct drbd_connection *connection)
>  	struct task_struct *opa;
>  
>  	kref_get(&connection->kref);
> +	flush_pending_signals();
>  	opa = kthread_run(_try_outdate_peer_async, connection, "drbd_async_h");
>  	if (IS_ERR(opa)) {
>  		drbd_err(connection, "out of mem, failed to invoke fence-peer helper\n");

There's no such function in the kernel.


--mmazur


* Re: [Drbd-dev] stonith-related regression introduced around kernel 3.13, with 3.15.3 still affected
  2014-07-03 17:49     ` Mariusz Mazur
@ 2014-07-04 11:42       ` Lars Ellenberg
  2014-07-07  8:41         ` Mariusz Mazur
  0 siblings, 1 reply; 7+ messages in thread
From: Lars Ellenberg @ 2014-07-04 11:42 UTC
  To: drbd-dev

On Thu, Jul 03, 2014 at 07:49:09PM +0200, Mariusz Mazur wrote:
> On Thu of July 3 2014, Lars Ellenberg wrote:
> 
> > Would you please try this patch:
> > 
> > diff --git a/drbd/drbd_nl.c b/drbd/drbd_nl.c
> > index 9e6adaa..88f480c 100644
> > --- a/drbd/drbd_nl.c
> > +++ b/drbd/drbd_nl.c
> > @@ -586,6 +586,7 @@ void conn_try_outdate_peer_async(struct drbd_connection *connection)
> >  	struct task_struct *opa;
> >  
> >  	kref_get(&connection->kref);
> > +	flush_pending_signals();
> >  	opa = kthread_run(_try_outdate_peer_async, connection, "drbd_async_h");
> >  	if (IS_ERR(opa)) {
> >  		drbd_err(connection, "out of mem, failed to invoke fence-peer helper\n");
> 
> There's no such function in the kernel.

Yeah, sorry, typo, leave off the pending-.

It's flush_signals(current);

diff --git a/drbd/drbd_nl.c b/drbd/drbd_nl.c
index 9e6adaa..d8b83d7 100644
--- a/drbd/drbd_nl.c
+++ b/drbd/drbd_nl.c
@@ -586,6 +586,7 @@ void conn_try_outdate_peer_async(struct drbd_connection *connection)
 	struct task_struct *opa;
 
 	kref_get(&connection->kref);
+	flush_signals(current);
 	opa = kthread_run(_try_outdate_peer_async, connection, "drbd_async_h");
 	if (IS_ERR(opa)) {
 		drbd_err(connection, "out of mem, failed to invoke fence-peer helper\n");

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.


* Re: [Drbd-dev] stonith-related regression introduced around kernel 3.13, with 3.15.3 still affected
  2014-07-04 11:42       ` Lars Ellenberg
@ 2014-07-07  8:41         ` Mariusz Mazur
  2014-07-07  9:03           ` Lars Ellenberg
  0 siblings, 1 reply; 7+ messages in thread
From: Mariusz Mazur @ 2014-07-07  8:41 UTC
  To: drbd-dev, Jacek Konieczny

On Fri of July 4 2014, Lars Ellenberg wrote:

> Yeah, sorry, typo, leave off the pending-.
> 
> It's flush_signals(current);
> 
> diff --git a/drbd/drbd_nl.c b/drbd/drbd_nl.c
> index 9e6adaa..d8b83d7 100644
> --- a/drbd/drbd_nl.c
> +++ b/drbd/drbd_nl.c
> @@ -586,6 +586,7 @@ void conn_try_outdate_peer_async(struct drbd_connection *connection)
>  	struct task_struct *opa;
>  
>  	kref_get(&connection->kref);
> +	flush_signals(current);
>  	opa = kthread_run(_try_outdate_peer_async, connection, "drbd_async_h");
>  	if (IS_ERR(opa)) {
>  		drbd_err(connection, "out of mem, failed to invoke fence-peer helper\n");

Yup, this worked. Should it maybe get sent upstream to the kernel? :)


--mmazur


* Re: [Drbd-dev] stonith-related regression introduced around kernel 3.13, with 3.15.3 still affected
  2014-07-07  8:41         ` Mariusz Mazur
@ 2014-07-07  9:03           ` Lars Ellenberg
  0 siblings, 0 replies; 7+ messages in thread
From: Lars Ellenberg @ 2014-07-07  9:03 UTC
  To: drbd-dev

On Mon, Jul 07, 2014 at 10:41:33AM +0200, Mariusz Mazur wrote:
> On Fri of July 4 2014, Lars Ellenberg wrote:
> 
> > Yeah, sorry, typo, leave off the pending-.
> > 
> > It's flush_signals(current);
> > 
> > diff --git a/drbd/drbd_nl.c b/drbd/drbd_nl.c
> > index 9e6adaa..d8b83d7 100644
> > --- a/drbd/drbd_nl.c
> > +++ b/drbd/drbd_nl.c
> > @@ -586,6 +586,7 @@ void conn_try_outdate_peer_async(struct drbd_connection *connection)
> >  	struct task_struct *opa;
> >  
> >  	kref_get(&connection->kref);
> > +	flush_signals(current);
> >  	opa = kthread_run(_try_outdate_peer_async, connection, "drbd_async_h");
> >  	if (IS_ERR(opa)) {
> >  		drbd_err(connection, "out of mem, failed to invoke fence-peer helper\n");
> 
> Yup, this worked. Should it maybe get sent upstream to the kernel? :)

Yes Sir Captain Obvious Sir.

 ;-)

Thanks for confirming, anyways.

	Lars

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
