* [Cluster-devel] GPF in dlm_lowcomms_stop
From: dann frazier @ 2012-03-22  1:59 UTC
To: cluster-devel.redhat.com

I observed a GPF in dlm_lowcomms_stop() in a crash dump from a user.
I'll include a traceback at the bottom. I have a theory as to what is
going on, but could use some help proving it.

 1  void dlm_lowcomms_stop(void)
 2  {
 3  	/* Set all the flags to prevent any
 4  	   socket activity.
 5  	*/
 6  	mutex_lock(&connections_lock);
 7  	foreach_conn(stop_conn);
 8  	mutex_unlock(&connections_lock);
 9
10  	work_stop();
11
12  	mutex_lock(&connections_lock);
13  	clean_writequeues();
14
15  	foreach_conn(free_conn);
16
17  	mutex_unlock(&connections_lock);
18  	kmem_cache_destroy(con_cache);
19  }

Lines 6-8 should stop any traffic coming in on existing connections,
and the connections_lock prevents any new connections from being
added. Line 10 shuts down the work queues that process incoming data
on existing connections. Our connections are supposed to be dead, so
no new work should be added to these queues.

However... we've dropped the connections_lock, so it's possible that a
new connection gets created on line 9. This connection structure would
have pointers to the workqueues that we're about to destroy. Sometime
later we get data on this new connection, we try to add work to the
now-obliterated workqueues, and things blow up.

This is a 2.6.32 kernel, but this code looks to be about the same in
3.3. Normally I'd like to instrument the code to widen the race window
and see if I can trigger it, but I'm pretty clueless about the stack
here. Is there an easy way to mimic this activity from userspace?

[80061.292085] Call Trace:
[80061.312762]  <IRQ>
[80061.332554]  [<ffffffff810387b9>] default_spin_lock_flags+0x9/0x10
[80061.356963]  [<ffffffff8155c44f>] _spin_lock_irqsave+0x2f/0x40
[80061.380717]  [<ffffffff81080504>] __queue_work+0x24/0x50
[80061.403591]  [<ffffffff81080632>] queue_work_on+0x42/0x60
[80061.426314]  [<ffffffff810807ff>] queue_work+0x1f/0x30
[80061.448516]  [<ffffffffa027326b>] lowcomms_data_ready+0x3b/0x40 [dlm]
[80061.472149]  [<ffffffff814ba318>] tcp_data_queue+0x308/0xaf0
[80061.494912]  [<ffffffff814bd021>] tcp_rcv_established+0x371/0x730
[80061.517903]  [<ffffffff814c4bb3>] tcp_v4_do_rcv+0xf3/0x160
[80061.539872]  [<ffffffff814c6305>] tcp_v4_rcv+0x5b5/0x7e0
[80061.561284]  [<ffffffff814a4110>] ? ip_local_deliver_finish+0x0/0x2d0
[80061.583714]  [<ffffffff8149bb84>] ? nf_hook_slow+0x74/0x100
[80061.604714]  [<ffffffff814a4110>] ? ip_local_deliver_finish+0x0/0x2d0
[80061.626449]  [<ffffffff814a41ed>] ip_local_deliver_finish+0xdd/0x2d0
[80061.647767]  [<ffffffff814a4470>] ip_local_deliver+0x90/0xa0
[80061.667743]  [<ffffffff814a392d>] ip_rcv_finish+0x12d/0x440
[80061.687278]  [<ffffffff814a3eb5>] ip_rcv+0x275/0x360
[80061.705783]  [<ffffffff8147450a>] netif_receive_skb+0x38a/0x5d0
[80061.725345]  [<ffffffffa0257178>] br_handle_frame_finish+0x198/0x1a0 [bridge]
[80061.746206]  [<ffffffffa025cc78>] br_nf_pre_routing_finish+0x238/0x350 [bridge]
[80061.779534]  [<ffffffff8149bb84>] ? nf_hook_slow+0x74/0x100
[80061.798437]  [<ffffffffa025ca40>] ? br_nf_pre_routing_finish+0x0/0x350 [bridge]
[80061.831598]  [<ffffffffa025d15a>] br_nf_pre_routing+0x3ca/0x468 [bridge]
[80061.851561]  [<ffffffff81561765>] ? do_IRQ+0x75/0xf0
[80061.869287]  [<ffffffff812b3a63>] ? cpumask_next_and+0x23/0x40
[80061.887717]  [<ffffffff8149bacc>] nf_iterate+0x6c/0xb0
[80061.905491]  [<ffffffffa0256fe0>] ? br_handle_frame_finish+0x0/0x1a0 [bridge]
[80061.925829]  [<ffffffff8149bb84>] nf_hook_slow+0x74/0x100
[80061.944215]  [<ffffffffa0256fe0>] ? br_handle_frame_finish+0x0/0x1a0 [bridge]
[80061.964735]  [<ffffffff8146b216>] ? __netdev_alloc_skb+0x36/0x60
[80061.984128]  [<ffffffffa025730c>] br_handle_frame+0x18c/0x250 [bridge]
[80062.004294]  [<ffffffff81537138>] ? vlan_hwaccel_do_receive+0x38/0xf0
[80062.024707]  [<ffffffff8147437a>] netif_receive_skb+0x1fa/0x5d0
[80062.044500]  [<ffffffff81474920>] napi_skb_finish+0x50/0x70
[80062.063911]  [<ffffffff81537794>] vlan_gro_receive+0x84/0x90
[80062.083512]  [<ffffffffa0015fd9>] igb_clean_rx_irq_adv+0x259/0x6d0 [igb]
[80062.104516]  [<ffffffffa00164c2>] igb_poll+0x72/0x380 [igb]
[80062.124363]  [<ffffffffa0168236>] ? ioat2_issue_pending+0x86/0xa0 [ioatdma]
[80062.145937]  [<ffffffff8147500f>] net_rx_action+0x10f/0x250
[80062.165915]  [<ffffffff8109331e>] ? tick_handle_oneshot_broadcast+0x10e/0x120
[80062.187812]  [<ffffffff8106d9e7>] __do_softirq+0xb7/0x1f0
[80062.207733]  [<ffffffff810c47b0>] ? handle_IRQ_event+0x60/0x170
[80062.228411]  [<ffffffff810132ec>] call_softirq+0x1c/0x30
[80062.248332]  [<ffffffff81014cb5>] do_softirq+0x65/0xa0
[80062.268058]  [<ffffffff8106d7e5>] irq_exit+0x85/0x90
[80062.287578]  [<ffffffff81561765>] do_IRQ+0x75/0xf0
[80062.306763]  [<ffffffff81012b13>] ret_from_intr+0x0/0x11
[80062.326418]  <EOI>
[80062.342191]  [<ffffffff8130f59f>] ? acpi_idle_enter_c1+0xa3/0xc1
[80062.362503]  [<ffffffff8130f57e>] ? acpi_idle_enter_c1+0x82/0xc1
[80062.382525]  [<ffffffff8144f1e7>] ? cpuidle_idle_call+0xa7/0x140
[80062.402409]  [<ffffffff81010e63>] ? cpu_idle+0xb3/0x110
[80062.421258]  [<ffffffff81544137>] ? rest_init+0x77/0x80
[80062.439930]  [<ffffffff81882dd8>] ? start_kernel+0x368/0x371
[80062.459099]  [<ffffffff8188233a>] ? x86_64_start_reservations+0x125/0x129
[80062.479533]  [<ffffffff81882438>] ? x86_64_start_kernel+0xfa/0x109
[80062.499226] Code: 00 00 48 c7 c2 2e 85 03 81 48 c7 c1 31 85 03 81 e9 dd fe ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 55 b8 00 01 00 00 48 89 e5 <f0> 66 0f c1 07 38 e0 74 06 f3 90 8a 07 eb f6 c9 c3 66 0f 1f 44
[80062.563580] RIP  [<ffffffff81038729>] __ticket_spin_lock+0x9/0x20
[80062.584129] RSP <ffff880016603740>
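One way to widen the race window described above -- a debugging sketch only,
not something posted or tested in this thread -- is a deliberate delay between
dropping connections_lock and work_stop(). No locks are held at that point and
dlm_lowcomms_stop() runs in process context, so msleep() should be safe,
though fs/dlm/lowcomms.c may need an extra #include <linux/delay.h>:

--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ void dlm_lowcomms_stop(void)
 	mutex_lock(&connections_lock);
 	foreach_conn(stop_conn);
 	mutex_unlock(&connections_lock);

+	/* debug only: hold the race window open so a node reconnecting
+	 * now is accepted before work_stop() destroys the work queues */
+	msleep(10000);
+
 	work_stop();

With the window held open, restarting the DLM on a peer node so that it
reconnects and sends lockspace traffic during the sleep should queue work on
the soon-to-be-destroyed workqueues and reproduce the GPF.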
* [Cluster-devel] GPF in dlm_lowcomms_stop
From: David Teigland @ 2012-03-30 15:42 UTC
To: cluster-devel.redhat.com

On Wed, Mar 21, 2012 at 07:59:13PM -0600, dann frazier wrote:
> However... we've dropped the connections_lock, so it's possible that a
> new connection gets created on line 9. This connection structure would
> have pointers to the workqueues that we're about to destroy. Sometime
> later we get data on this new connection, we try to add work to the
> now-obliterated workqueues, and things blow up.

Hi Dann, I'm not very familiar with this code either, but I've talked
with Chrissie and she suggested we try something like this:

diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index 133ef6d..a3c431e 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -142,6 +142,7 @@ struct writequeue_entry {

 static struct sockaddr_storage *dlm_local_addr[DLM_MAX_ADDR_COUNT];
 static int dlm_local_count;
+static int dlm_allow_conn;

 /* Work queues */
 static struct workqueue_struct *recv_workqueue;
@@ -710,6 +711,13 @@ static int tcp_accept_from_sock(struct connection *con)
 	struct connection *newcon;
 	struct connection *addcon;

+	mutex_lock(&connections_lock);
+	if (!dlm_allow_conn) {
+		mutex_unlock(&connections_lock);
+		return -1;
+	}
+	mutex_unlock(&connections_lock);
+
 	memset(&peeraddr, 0, sizeof(peeraddr));
 	result = sock_create_kern(dlm_local_addr[0]->ss_family, SOCK_STREAM,
 				  IPPROTO_TCP, &newsock);
@@ -1503,6 +1511,7 @@ void dlm_lowcomms_stop(void)
 	   socket activity.
 	*/
 	mutex_lock(&connections_lock);
+	dlm_allow_conn = 0;
 	foreach_conn(stop_conn);
 	mutex_unlock(&connections_lock);

@@ -1540,6 +1549,8 @@ int dlm_lowcomms_start(void)
 	if (!con_cache)
 		goto out;

+	dlm_allow_conn = 1;
+
 	/* Start listening */
 	if (dlm_config.ci_protocol == 0)
 		error = tcp_listen_for_all();
@@ -1555,6 +1566,7 @@ int dlm_lowcomms_start(void)
 	return 0;

 fail_unlisten:
+	dlm_allow_conn = 0;
 	con = nodeid2con(0,0);
 	if (con) {
 		close_connection(con, false);
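The shape of this fix is a generic one: a guard flag flipped under the same
mutex the shutdown path holds, checked by the accept path before it publishes
anything that points at the work queues. A minimal user-space analogue of the
pattern -- pthreads, all names hypothetical, purely illustrative and not part
of the posted patch:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t connections_lock = PTHREAD_MUTEX_INITIALIZER;
static bool allow_conn;  /* analogue of dlm_allow_conn */

/* Accept path: mirrors the check the patch adds at the top of
 * tcp_accept_from_sock() -- refuse once shutdown has begun. */
static int accept_connection(void)
{
	pthread_mutex_lock(&connections_lock);
	if (!allow_conn) {
		pthread_mutex_unlock(&connections_lock);
		return -1;  /* caller just drops the socket */
	}
	pthread_mutex_unlock(&connections_lock);
	/* ... register the connection with the work queues ... */
	return 0;
}

/* Shutdown path: flip the flag under the same lock before tearing
 * down the work queues, as dlm_lowcomms_stop() now does. */
static void stop_connections(void)
{
	pthread_mutex_lock(&connections_lock);
	allow_conn = false;
	pthread_mutex_unlock(&connections_lock);
	/* ... stop existing connections, destroy work queues ... */
}

int main(void)
{
	allow_conn = true;
	printf("before stop: %d\n", accept_connection());  /* 0  */
	stop_connections();
	printf("after stop:  %d\n", accept_connection());  /* -1 */
	return 0;
}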
* [Cluster-devel] GPF in dlm_lowcomms_stop
From: David Teigland @ 2012-03-30 16:42 UTC
To: cluster-devel.redhat.com

On Fri, Mar 30, 2012 at 11:42:56AM -0400, David Teigland wrote:
> Hi Dann, I'm not very familiar with this code either, but I've talked
> with Chrissie and she suggested we try something like this:

A second version that addresses a potentially similar problem in start.

diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index 133ef6d..5c1b0e3 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -142,6 +142,7 @@ struct writequeue_entry {

 static struct sockaddr_storage *dlm_local_addr[DLM_MAX_ADDR_COUNT];
 static int dlm_local_count;
+static int dlm_allow_conn;

 /* Work queues */
 static struct workqueue_struct *recv_workqueue;
@@ -710,6 +711,13 @@ static int tcp_accept_from_sock(struct connection *con)
 	struct connection *newcon;
 	struct connection *addcon;

+	mutex_lock(&connections_lock);
+	if (!dlm_allow_conn) {
+		mutex_unlock(&connections_lock);
+		return -1;
+	}
+	mutex_unlock(&connections_lock);
+
 	memset(&peeraddr, 0, sizeof(peeraddr));
 	result = sock_create_kern(dlm_local_addr[0]->ss_family, SOCK_STREAM,
 				  IPPROTO_TCP, &newsock);
@@ -1503,6 +1511,7 @@ void dlm_lowcomms_stop(void)
 	   socket activity.
 	*/
 	mutex_lock(&connections_lock);
+	dlm_allow_conn = 0;
 	foreach_conn(stop_conn);
 	mutex_unlock(&connections_lock);

@@ -1530,7 +1539,7 @@ int dlm_lowcomms_start(void)
 	if (!dlm_local_count) {
 		error = -ENOTCONN;
 		log_print("no local IP address has been set");
-		goto out;
+		goto fail;
 	}

 	error = -ENOMEM;
@@ -1538,7 +1547,13 @@ int dlm_lowcomms_start(void)
 					      __alignof__(struct connection), 0,
 					      NULL);
 	if (!con_cache)
-		goto out;
+		goto fail;
+
+	error = work_start();
+	if (error)
+		goto fail_destroy;
+
+	dlm_allow_conn = 1;

 	/* Start listening */
 	if (dlm_config.ci_protocol == 0)
@@ -1548,20 +1563,17 @@ int dlm_lowcomms_start(void)
 	if (error)
 		goto fail_unlisten;

-	error = work_start();
-	if (error)
-		goto fail_unlisten;
-
 	return 0;

 fail_unlisten:
+	dlm_allow_conn = 0;
 	con = nodeid2con(0,0);
 	if (con) {
 		close_connection(con, false);
 		kmem_cache_free(con_cache, con);
 	}
+fail_destroy:
 	kmem_cache_destroy(con_cache);
-
-out:
+fail:
 	return error;
 }
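Beyond the guard flag, v2 reorders bring-up so the work queues exist before
the listening socket can hand them a connection, and it unwinds in reverse on
failure with the usual goto ladder. A compilable outline of the resulting
control flow -- stubbed helper names, a sketch rather than the literal kernel
code:

#include <stdio.h>

/* Stubs standing in for kmem_cache_create(), work_start(),
 * tcp_listen_for_all(), etc.; hypothetical, for illustration only. */
static int  alloc_cache(void)    { return 0; }
static void destroy_cache(void)  { }
static int  work_start(void)     { return 0; }
static int  listen_for_all(void) { return 0; }
static void stop_listening(void) { }

static int allow_conn;

static int lowcomms_start(void)
{
	int error = -1;

	if (alloc_cache() < 0)		/* resources first ... */
		goto fail;

	error = work_start();		/* ... then the workers ... */
	if (error)
		goto fail_destroy;

	allow_conn = 1;			/* ... then permit connections ... */

	error = listen_for_all();	/* ... and only now start listening */
	if (error)
		goto fail_unlisten;

	return 0;

fail_unlisten:				/* unwind in reverse order */
	allow_conn = 0;
	stop_listening();
fail_destroy:
	destroy_cache();
fail:
	return error;
}

int main(void)
{
	printf("start: %d\n", lowcomms_start());
	return 0;
}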
* [Cluster-devel] GPF in dlm_lowcomms_stop
From: dann frazier @ 2012-03-30 17:17 UTC
To: cluster-devel.redhat.com

On Fri, Mar 30, 2012 at 12:42:40PM -0400, David Teigland wrote:
> On Fri, Mar 30, 2012 at 11:42:56AM -0400, David Teigland wrote:
> > Hi Dann, I'm not very familiar with this code either, but I've talked
> > with Chrissie and she suggested we try something like this:

Yeah, that's the mechanism I was thinking of as well.

> A second version that addresses a potentially similar problem in
> start.

Oh, good catch! I hadn't considered that path.

Reviewed-by: dann frazier <dann.frazier@canonical.com>
* [Cluster-devel] GPF in dlm_lowcomms_stop
From: dann frazier @ 2012-05-04 17:33 UTC
To: cluster-devel.redhat.com

On Fri, Mar 30, 2012 at 11:17:56AM -0600, dann frazier wrote:
> On Fri, Mar 30, 2012 at 12:42:40PM -0400, David Teigland wrote:
> > On Fri, Mar 30, 2012 at 11:42:56AM -0400, David Teigland wrote:
> > > Hi Dann, I'm not very familiar with this code either, but I've talked
> > > with Chrissie and she suggested we try something like this:
>
> Yeah, that's the mechanism I was thinking of as well.
>
> > A second version that addresses a potentially similar problem in
> > start.
>
> Oh, good catch! I hadn't considered that path.
>
> Reviewed-by: dann frazier <dann.frazier@canonical.com>

fyi, we haven't uncovered any problems when testing with this patch.
Are there plans to submit it for the next merge cycle?

 -dann
* [Cluster-devel] GPF in dlm_lowcomms_stop
From: David Teigland @ 2012-05-04 17:51 UTC
To: cluster-devel.redhat.com

On Fri, May 04, 2012 at 11:33:17AM -0600, dann frazier wrote:
> fyi, we haven't uncovered any problems when testing with this patch.

Great, thanks.

> Are there plans to submit it for the next merge cycle?

Yes, it's here:
http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/teigland/linux-dlm.git;a=shortlog;h=refs/heads/next