* [Cluster-devel] GPF in dlm_lowcomms_stop
From: dann frazier @ 2012-03-22  1:59 UTC
To: cluster-devel.redhat.com

I observed a GPF in dlm_lowcomms_stop() in a crash dump from a user.
I'll include a traceback at the bottom. I have a theory as to what is
going on, but could use some help proving it.

 1  void dlm_lowcomms_stop(void)
 2  {
 3  	/* Set all the flags to prevent any
 4  	   socket activity.
 5  	*/
 6  	mutex_lock(&connections_lock);
 7  	foreach_conn(stop_conn);
 8  	mutex_unlock(&connections_lock);
 9
10  	work_stop();
11
12  	mutex_lock(&connections_lock);
13  	clean_writequeues();
14
15  	foreach_conn(free_conn);
16
17  	mutex_unlock(&connections_lock);
18  	kmem_cache_destroy(con_cache);
19  }

Lines 6-8 should stop any traffic coming in on existing connections,
and the connections_lock prevents any new connections from being
added. Line 10 shuts down the work queues that process incoming data
on existing connections. Our connections are supposed to be dead, so
no new work should be added to these queues.

However... we've dropped the connections_lock, so it's possible that a
new connection gets created on line 9. This connection structure would
have pointers to the workqueues that we're about to destroy. Sometime
later we get data on this new connection, we try to add work to the
now-obliterated workqueues, and things blow up.

This is a 2.6.32 kernel, but this code looks to be about the same in
3.3. Normally I'd like to instrument the code to widen the race window
and see if I can trigger it, but I'm pretty clueless about the stack
here. Is there an easy way to mimic this activity from userspace?

[80061.292085] Call Trace:
[80061.312762]  <IRQ>
[80061.332554]  [<ffffffff810387b9>] default_spin_lock_flags+0x9/0x10
[80061.356963]  [<ffffffff8155c44f>] _spin_lock_irqsave+0x2f/0x40
[80061.380717]  [<ffffffff81080504>] __queue_work+0x24/0x50
[80061.403591]  [<ffffffff81080632>] queue_work_on+0x42/0x60
[80061.426314]  [<ffffffff810807ff>] queue_work+0x1f/0x30
[80061.448516]  [<ffffffffa027326b>] lowcomms_data_ready+0x3b/0x40 [dlm]
[80061.472149]  [<ffffffff814ba318>] tcp_data_queue+0x308/0xaf0
[80061.494912]  [<ffffffff814bd021>] tcp_rcv_established+0x371/0x730
[80061.517903]  [<ffffffff814c4bb3>] tcp_v4_do_rcv+0xf3/0x160
[80061.539872]  [<ffffffff814c6305>] tcp_v4_rcv+0x5b5/0x7e0
[80061.561284]  [<ffffffff814a4110>] ? ip_local_deliver_finish+0x0/0x2d0
[80061.583714]  [<ffffffff8149bb84>] ? nf_hook_slow+0x74/0x100
[80061.604714]  [<ffffffff814a4110>] ? ip_local_deliver_finish+0x0/0x2d0
[80061.626449]  [<ffffffff814a41ed>] ip_local_deliver_finish+0xdd/0x2d0
[80061.647767]  [<ffffffff814a4470>] ip_local_deliver+0x90/0xa0
[80061.667743]  [<ffffffff814a392d>] ip_rcv_finish+0x12d/0x440
[80061.687278]  [<ffffffff814a3eb5>] ip_rcv+0x275/0x360
[80061.705783]  [<ffffffff8147450a>] netif_receive_skb+0x38a/0x5d0
[80061.725345]  [<ffffffffa0257178>] br_handle_frame_finish+0x198/0x1a0 [bridge]
[80061.746206]  [<ffffffffa025cc78>] br_nf_pre_routing_finish+0x238/0x350 [bridge]
[80061.779534]  [<ffffffff8149bb84>] ? nf_hook_slow+0x74/0x100
[80061.798437]  [<ffffffffa025ca40>] ? br_nf_pre_routing_finish+0x0/0x350 [bridge]
[80061.831598]  [<ffffffffa025d15a>] br_nf_pre_routing+0x3ca/0x468 [bridge]
[80061.851561]  [<ffffffff81561765>] ? do_IRQ+0x75/0xf0
[80061.869287]  [<ffffffff812b3a63>] ? cpumask_next_and+0x23/0x40
[80061.887717]  [<ffffffff8149bacc>] nf_iterate+0x6c/0xb0
[80061.905491]  [<ffffffffa0256fe0>] ? br_handle_frame_finish+0x0/0x1a0 [bridge]
[80061.925829]  [<ffffffff8149bb84>] nf_hook_slow+0x74/0x100
[80061.944215]  [<ffffffffa0256fe0>] ? br_handle_frame_finish+0x0/0x1a0 [bridge]
[80061.964735]  [<ffffffff8146b216>] ? __netdev_alloc_skb+0x36/0x60
[80061.984128]  [<ffffffffa025730c>] br_handle_frame+0x18c/0x250 [bridge]
[80062.004294]  [<ffffffff81537138>] ? vlan_hwaccel_do_receive+0x38/0xf0
[80062.024707]  [<ffffffff8147437a>] netif_receive_skb+0x1fa/0x5d0
[80062.044500]  [<ffffffff81474920>] napi_skb_finish+0x50/0x70
[80062.063911]  [<ffffffff81537794>] vlan_gro_receive+0x84/0x90
[80062.083512]  [<ffffffffa0015fd9>] igb_clean_rx_irq_adv+0x259/0x6d0 [igb]
[80062.104516]  [<ffffffffa00164c2>] igb_poll+0x72/0x380 [igb]
[80062.124363]  [<ffffffffa0168236>] ? ioat2_issue_pending+0x86/0xa0 [ioatdma]
[80062.145937]  [<ffffffff8147500f>] net_rx_action+0x10f/0x250
[80062.165915]  [<ffffffff8109331e>] ? tick_handle_oneshot_broadcast+0x10e/0x120
[80062.187812]  [<ffffffff8106d9e7>] __do_softirq+0xb7/0x1f0
[80062.207733]  [<ffffffff810c47b0>] ? handle_IRQ_event+0x60/0x170
[80062.228411]  [<ffffffff810132ec>] call_softirq+0x1c/0x30
[80062.248332]  [<ffffffff81014cb5>] do_softirq+0x65/0xa0
[80062.268058]  [<ffffffff8106d7e5>] irq_exit+0x85/0x90
[80062.287578]  [<ffffffff81561765>] do_IRQ+0x75/0xf0
[80062.306763]  [<ffffffff81012b13>] ret_from_intr+0x0/0x11
[80062.326418]  <EOI>
[80062.342191]  [<ffffffff8130f59f>] ? acpi_idle_enter_c1+0xa3/0xc1
[80062.362503]  [<ffffffff8130f57e>] ? acpi_idle_enter_c1+0x82/0xc1
[80062.382525]  [<ffffffff8144f1e7>] ? cpuidle_idle_call+0xa7/0x140
[80062.402409]  [<ffffffff81010e63>] ? cpu_idle+0xb3/0x110
[80062.421258]  [<ffffffff81544137>] ? rest_init+0x77/0x80
[80062.439930]  [<ffffffff81882dd8>] ? start_kernel+0x368/0x371
[80062.459099]  [<ffffffff8188233a>] ? x86_64_start_reservations+0x125/0x129
[80062.479533]  [<ffffffff81882438>] ? x86_64_start_kernel+0xfa/0x109
[80062.499226] Code: 00 00 48 c7 c2 2e 85 03 81 48 c7 c1 31 85 03 81 e9 dd fe ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 55 b8 00 01 00 00 48 89 e5 <f0> 66 0f c1 07 38 e0 74 06 f3 90 8a 07 eb f6 c9 c3 66 0f 1f 44
[80062.563580] RIP  [<ffffffff81038729>] __ticket_spin_lock+0x9/0x20
[80062.584129] RSP <ffff880016603740>
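One way to widen the race window described above -- a debugging sketch only,
not something posted or tested in this thread -- is a deliberate delay between
dropping connections_lock and work_stop(). No locks are held at that point and
dlm_lowcomms_stop() runs in process context, so msleep() should be safe,
though fs/dlm/lowcomms.c may need an extra #include <linux/delay.h>:

--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ void dlm_lowcomms_stop(void)
 	mutex_lock(&connections_lock);
 	foreach_conn(stop_conn);
 	mutex_unlock(&connections_lock);

+	/* debug only: hold the race window open so a node reconnecting
+	 * now is accepted before work_stop() destroys the work queues */
+	msleep(10000);
+
 	work_stop();

With the window held open, restarting the DLM on a peer node so that it
reconnects and sends lockspace traffic during the sleep should queue work on
the soon-to-be-destroyed workqueues and reproduce the GPF.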
* [Cluster-devel] GPF in dlm_lowcomms_stop
From: David Teigland @ 2012-03-30 15:42 UTC
To: cluster-devel.redhat.com

On Wed, Mar 21, 2012 at 07:59:13PM -0600, dann frazier wrote:
> However... we've dropped the connections_lock, so it's possible that a
> new connection gets created on line 9. This connection structure would
> have pointers to the workqueues that we're about to destroy. Sometime
> later we get data on this new connection, we try to add work to the
> now-obliterated workqueues, and things blow up.

Hi Dann, I'm not very familiar with this code either, but I've talked
with Chrissie and she suggested we try something like this:

diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index 133ef6d..a3c431e 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -142,6 +142,7 @@ struct writequeue_entry {

 static struct sockaddr_storage *dlm_local_addr[DLM_MAX_ADDR_COUNT];
 static int dlm_local_count;
+static int dlm_allow_conn;

 /* Work queues */
 static struct workqueue_struct *recv_workqueue;
@@ -710,6 +711,13 @@ static int tcp_accept_from_sock(struct connection *con)
 	struct connection *newcon;
 	struct connection *addcon;

+	mutex_lock(&connections_lock);
+	if (!dlm_allow_conn) {
+		mutex_unlock(&connections_lock);
+		return -1;
+	}
+	mutex_unlock(&connections_lock);
+
 	memset(&peeraddr, 0, sizeof(peeraddr));
 	result = sock_create_kern(dlm_local_addr[0]->ss_family, SOCK_STREAM,
 				  IPPROTO_TCP, &newsock);
@@ -1503,6 +1511,7 @@ void dlm_lowcomms_stop(void)
 	   socket activity.
 	*/
 	mutex_lock(&connections_lock);
+	dlm_allow_conn = 0;
 	foreach_conn(stop_conn);
 	mutex_unlock(&connections_lock);

@@ -1540,6 +1549,8 @@ int dlm_lowcomms_start(void)
 	if (!con_cache)
 		goto out;

+	dlm_allow_conn = 1;
+
 	/* Start listening */
 	if (dlm_config.ci_protocol == 0)
 		error = tcp_listen_for_all();
@@ -1555,6 +1566,7 @@ int dlm_lowcomms_start(void)
 	return 0;

 fail_unlisten:
+	dlm_allow_conn = 0;
 	con = nodeid2con(0,0);
 	if (con) {
 		close_connection(con, false);
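The shape of this fix is a generic one: a guard flag flipped under the same
mutex the shutdown path holds, checked by the accept path before it publishes
anything that points at the work queues. A minimal user-space analogue of the
pattern -- pthreads, all names hypothetical, purely illustrative and not part
of the posted patch:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t connections_lock = PTHREAD_MUTEX_INITIALIZER;
static bool allow_conn;  /* analogue of dlm_allow_conn */

/* Accept path: mirrors the check the patch adds at the top of
 * tcp_accept_from_sock() -- refuse once shutdown has begun. */
static int accept_connection(void)
{
	pthread_mutex_lock(&connections_lock);
	if (!allow_conn) {
		pthread_mutex_unlock(&connections_lock);
		return -1;  /* caller just drops the socket */
	}
	pthread_mutex_unlock(&connections_lock);
	/* ... register the connection with the work queues ... */
	return 0;
}

/* Shutdown path: flip the flag under the same lock before tearing
 * down the work queues, as dlm_lowcomms_stop() now does. */
static void stop_connections(void)
{
	pthread_mutex_lock(&connections_lock);
	allow_conn = false;
	pthread_mutex_unlock(&connections_lock);
	/* ... stop existing connections, destroy work queues ... */
}

int main(void)
{
	allow_conn = true;
	printf("before stop: %d\n", accept_connection());  /* 0  */
	stop_connections();
	printf("after stop:  %d\n", accept_connection());  /* -1 */
	return 0;
}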
* [Cluster-devel] GPF in dlm_lowcomms_stop
From: David Teigland @ 2012-03-30 16:42 UTC
To: cluster-devel.redhat.com

On Fri, Mar 30, 2012 at 11:42:56AM -0400, David Teigland wrote:
> Hi Dann, I'm not very familiar with this code either, but I've talked
> with Chrissie and she suggested we try something like this:

A second version that addresses a potentially similar problem in start.

diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index 133ef6d..5c1b0e3 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -142,6 +142,7 @@ struct writequeue_entry {

 static struct sockaddr_storage *dlm_local_addr[DLM_MAX_ADDR_COUNT];
 static int dlm_local_count;
+static int dlm_allow_conn;

 /* Work queues */
 static struct workqueue_struct *recv_workqueue;
@@ -710,6 +711,13 @@ static int tcp_accept_from_sock(struct connection *con)
 	struct connection *newcon;
 	struct connection *addcon;

+	mutex_lock(&connections_lock);
+	if (!dlm_allow_conn) {
+		mutex_unlock(&connections_lock);
+		return -1;
+	}
+	mutex_unlock(&connections_lock);
+
 	memset(&peeraddr, 0, sizeof(peeraddr));
 	result = sock_create_kern(dlm_local_addr[0]->ss_family, SOCK_STREAM,
 				  IPPROTO_TCP, &newsock);
@@ -1503,6 +1511,7 @@ void dlm_lowcomms_stop(void)
 	   socket activity.
 	*/
 	mutex_lock(&connections_lock);
+	dlm_allow_conn = 0;
 	foreach_conn(stop_conn);
 	mutex_unlock(&connections_lock);

@@ -1530,7 +1539,7 @@ int dlm_lowcomms_start(void)
 	if (!dlm_local_count) {
 		error = -ENOTCONN;
 		log_print("no local IP address has been set");
-		goto out;
+		goto fail;
 	}

 	error = -ENOMEM;
@@ -1538,7 +1547,13 @@ int dlm_lowcomms_start(void)
 					      __alignof__(struct connection), 0,
 					      NULL);
 	if (!con_cache)
-		goto out;
+		goto fail;
+
+	error = work_start();
+	if (error)
+		goto fail_destroy;
+
+	dlm_allow_conn = 1;

 	/* Start listening */
 	if (dlm_config.ci_protocol == 0)
@@ -1548,20 +1563,17 @@ int dlm_lowcomms_start(void)
 	if (error)
 		goto fail_unlisten;

-	error = work_start();
-	if (error)
-		goto fail_unlisten;
-
 	return 0;

 fail_unlisten:
+	dlm_allow_conn = 0;
 	con = nodeid2con(0,0);
 	if (con) {
 		close_connection(con, false);
 		kmem_cache_free(con_cache, con);
 	}
+fail_destroy:
 	kmem_cache_destroy(con_cache);
-
-out:
+fail:
 	return error;
 }
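Beyond the guard flag, v2 reorders bring-up so the work queues exist before
the listening socket can hand them a connection, and it unwinds in reverse on
failure with the usual goto ladder. A compilable outline of the resulting
control flow -- stubbed helper names, a sketch rather than the literal kernel
code:

#include <stdio.h>

/* Stubs standing in for kmem_cache_create(), work_start(),
 * tcp_listen_for_all(), etc.; hypothetical, for illustration only. */
static int  alloc_cache(void)    { return 0; }
static void destroy_cache(void)  { }
static int  work_start(void)     { return 0; }
static int  listen_for_all(void) { return 0; }
static void stop_listening(void) { }

static int allow_conn;

static int lowcomms_start(void)
{
	int error = -1;

	if (alloc_cache() < 0)		/* resources first ... */
		goto fail;

	error = work_start();		/* ... then the workers ... */
	if (error)
		goto fail_destroy;

	allow_conn = 1;			/* ... then permit connections ... */

	error = listen_for_all();	/* ... and only now start listening */
	if (error)
		goto fail_unlisten;

	return 0;

fail_unlisten:				/* unwind in reverse order */
	allow_conn = 0;
	stop_listening();
fail_destroy:
	destroy_cache();
fail:
	return error;
}

int main(void)
{
	printf("start: %d\n", lowcomms_start());
	return 0;
}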
* [Cluster-devel] GPF in dlm_lowcomms_stop
From: dann frazier @ 2012-03-30 17:17 UTC
To: cluster-devel.redhat.com

On Fri, Mar 30, 2012 at 12:42:40PM -0400, David Teigland wrote:
> On Fri, Mar 30, 2012 at 11:42:56AM -0400, David Teigland wrote:
> > Hi Dann, I'm not very familiar with this code either, but I've talked
> > with Chrissie and she suggested we try something like this:

Yeah, that's the mechanism I was thinking of as well.

> A second version that addresses a potentially similar problem in
> start.

Oh, good catch! I hadn't considered that path.

Reviewed-by: dann frazier <dann.frazier@canonical.com>
* [Cluster-devel] GPF in dlm_lowcomms_stop
From: dann frazier @ 2012-05-04 17:33 UTC
To: cluster-devel.redhat.com

On Fri, Mar 30, 2012 at 11:17:56AM -0600, dann frazier wrote:
> On Fri, Mar 30, 2012 at 12:42:40PM -0400, David Teigland wrote:
> > On Fri, Mar 30, 2012 at 11:42:56AM -0400, David Teigland wrote:
> > > Hi Dann, I'm not very familiar with this code either, but I've talked
> > > with Chrissie and she suggested we try something like this:
>
> Yeah, that's the mechanism I was thinking of as well.
>
> > A second version that addresses a potentially similar problem in
> > start.
>
> Oh, good catch! I hadn't considered that path.
>
> Reviewed-by: dann frazier <dann.frazier@canonical.com>

fyi, we haven't uncovered any problems when testing with this patch.
Are there plans to submit it for the next merge cycle?

 -dann
* [Cluster-devel] GPF in dlm_lowcomms_stop
From: David Teigland @ 2012-05-04 17:51 UTC
To: cluster-devel.redhat.com

On Fri, May 04, 2012 at 11:33:17AM -0600, dann frazier wrote:
> fyi, we haven't uncovered any problems when testing with this patch.

Great, thanks.

> Are there plans to submit it for the next merge cycle?

Yes, it's here:
http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/teigland/linux-dlm.git;a=shortlog;h=refs/heads/next