cluster-devel.redhat.com archive mirror
* [Cluster-devel] GPF in dlm_lowcomms_stop
@ 2012-03-22  1:59 dann frazier
  2012-03-30 15:42 ` David Teigland
  0 siblings, 1 reply; 6+ messages in thread
From: dann frazier @ 2012-03-22  1:59 UTC (permalink / raw)
  To: cluster-devel.redhat.com

I observed a GPF in dlm_lowcomms_stop() in a crash dump from a
user. I'll include a traceback at the bottom. I have a theory as to
what is going on, but could use some help proving it.

 1 void dlm_lowcomms_stop(void)
 2 {
 3         /* Set all the flags to prevent any
 4            socket activity.
 5         */
 6         mutex_lock(&connections_lock);
 7         foreach_conn(stop_conn);
 8         mutex_unlock(&connections_lock);
 9
10         work_stop();
11
12         mutex_lock(&connections_lock);
13         clean_writequeues();
14
15         foreach_conn(free_conn);
16
17         mutex_unlock(&connections_lock);
18         kmem_cache_destroy(con_cache);
19 }

Lines 6-8 should stop any traffic coming in on existing connections,
and holding connections_lock prevents any new connections from being
added while they run.
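
For reference, connection lookup/creation in lowcomms.c is serialized
on that same lock, roughly like this (paraphrased from memory, not
verbatim from either kernel):

        static struct connection *nodeid2con(int nodeid, gfp_t alloc)
        {
                struct connection *con;

                /* every lookup-or-create goes through connections_lock */
                mutex_lock(&connections_lock);
                con = __nodeid2con(nodeid, alloc);
                mutex_unlock(&connections_lock);

                return con;
        }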

Line 10 shuts down the work queues that process incoming data on
existing connections. Our connections are supposed to be dead by this
point, so no new work should be added to these queues.

However... we've dropped the connections_lock, so it's possible that a
new connection gets created in the window at line 9. That connection
structure would hold pointers to the workqueues we're about to destroy.
Sometime later, data arrives on this new connection, we try to queue
work on the now-obliterated workqueues, and things blow up.
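
The traceback at the bottom is consistent with this: the socket's
data_ready callback queues work straight onto the shared recv
workqueue, something like the following (again paraphrased, not
verbatim):

        static void lowcomms_data_ready(struct sock *sk, int count_unused)
        {
                struct connection *con = sock2con(sk);

                /* if work_stop() has already destroyed recv_workqueue,
                   this queue_work() is what blows up */
                if (con && !test_and_set_bit(CF_READ_PENDING, &con->flags))
                        queue_work(recv_workqueue, &con->rwork);
        }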

This is a 2.6.32 kernel, but the code looks to be about the same in
3.3. Normally I'd instrument the code to widen the race window and see
if I can trigger it, but I'm pretty clueless about this stack. Is there
an easy way to mimic this activity from userspace?
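
For what it's worth, the instrumentation I had in mind is just
stretching the window between lines 8 and 10, something like this
(untested sketch):

        mutex_unlock(&connections_lock);

        /* untested: widen the race window so a new connection can be
           accepted before work_stop() destroys the workqueues */
        msleep(10000);

        work_stop();

and then poking the dlm TCP listen port (21064 by default, if I
remember right) from another node during teardown, so that a fresh
connection is accepted inside the window.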

[80061.292085] Call Trace:
[80061.312762]  <IRQ>
[80061.332554]  [<ffffffff810387b9>] default_spin_lock_flags+0x9/0x10
[80061.356963]  [<ffffffff8155c44f>] _spin_lock_irqsave+0x2f/0x40
[80061.380717]  [<ffffffff81080504>] __queue_work+0x24/0x50
[80061.403591]  [<ffffffff81080632>] queue_work_on+0x42/0x60
[80061.426314]  [<ffffffff810807ff>] queue_work+0x1f/0x30
[80061.448516]  [<ffffffffa027326b>] lowcomms_data_ready+0x3b/0x40 [dlm]
[80061.472149]  [<ffffffff814ba318>] tcp_data_queue+0x308/0xaf0
[80061.494912]  [<ffffffff814bd021>] tcp_rcv_established+0x371/0x730
[80061.517903]  [<ffffffff814c4bb3>] tcp_v4_do_rcv+0xf3/0x160
[80061.539872]  [<ffffffff814c6305>] tcp_v4_rcv+0x5b5/0x7e0
[80061.561284]  [<ffffffff814a4110>] ? ip_local_deliver_finish+0x0/0x2d0
[80061.583714]  [<ffffffff8149bb84>] ? nf_hook_slow+0x74/0x100
[80061.604714]  [<ffffffff814a4110>] ? ip_local_deliver_finish+0x0/0x2d0
[80061.626449]  [<ffffffff814a41ed>] ip_local_deliver_finish+0xdd/0x2d0
[80061.647767]  [<ffffffff814a4470>] ip_local_deliver+0x90/0xa0
[80061.667743]  [<ffffffff814a392d>] ip_rcv_finish+0x12d/0x440
[80061.687278]  [<ffffffff814a3eb5>] ip_rcv+0x275/0x360
[80061.705783]  [<ffffffff8147450a>] netif_receive_skb+0x38a/0x5d0
[80061.725345]  [<ffffffffa0257178>] br_handle_frame_finish+0x198/0x1a0 [bridge]
[80061.746206]  [<ffffffffa025cc78>] br_nf_pre_routing_finish+0x238/0x350 [bridge]
[80061.779534]  [<ffffffff8149bb84>] ? nf_hook_slow+0x74/0x100
[80061.798437]  [<ffffffffa025ca40>] ? br_nf_pre_routing_finish+0x0/0x350 [bridge]
[80061.831598]  [<ffffffffa025d15a>] br_nf_pre_routing+0x3ca/0x468 [bridge]
[80061.851561]  [<ffffffff81561765>] ? do_IRQ+0x75/0xf0
[80061.869287]  [<ffffffff812b3a63>] ? cpumask_next_and+0x23/0x40
[80061.887717]  [<ffffffff8149bacc>] nf_iterate+0x6c/0xb0
[80061.905491]  [<ffffffffa0256fe0>] ? br_handle_frame_finish+0x0/0x1a0 [bridge]
[80061.925829]  [<ffffffff8149bb84>] nf_hook_slow+0x74/0x100
[80061.944215]  [<ffffffffa0256fe0>] ? br_handle_frame_finish+0x0/0x1a0 [bridge]
[80061.964735]  [<ffffffff8146b216>] ? __netdev_alloc_skb+0x36/0x60
[80061.984128]  [<ffffffffa025730c>] br_handle_frame+0x18c/0x250 [bridge]
[80062.004294]  [<ffffffff81537138>] ? vlan_hwaccel_do_receive+0x38/0xf0
[80062.024707]  [<ffffffff8147437a>] netif_receive_skb+0x1fa/0x5d0
[80062.044500]  [<ffffffff81474920>] napi_skb_finish+0x50/0x70
[80062.063911]  [<ffffffff81537794>] vlan_gro_receive+0x84/0x90
[80062.083512]  [<ffffffffa0015fd9>] igb_clean_rx_irq_adv+0x259/0x6d0 [igb]
[80062.104516]  [<ffffffffa00164c2>] igb_poll+0x72/0x380 [igb]
[80062.124363]  [<ffffffffa0168236>] ? ioat2_issue_pending+0x86/0xa0 [ioatdma]
[80062.145937]  [<ffffffff8147500f>] net_rx_action+0x10f/0x250
[80062.165915]  [<ffffffff8109331e>] ? tick_handle_oneshot_broadcast+0x10e/0x120
[80062.187812]  [<ffffffff8106d9e7>] __do_softirq+0xb7/0x1f0
[80062.207733]  [<ffffffff810c47b0>] ? handle_IRQ_event+0x60/0x170
[80062.228411]  [<ffffffff810132ec>] call_softirq+0x1c/0x30
[80062.248332]  [<ffffffff81014cb5>] do_softirq+0x65/0xa0
[80062.268058]  [<ffffffff8106d7e5>] irq_exit+0x85/0x90
[80062.287578]  [<ffffffff81561765>] do_IRQ+0x75/0xf0
[80062.306763]  [<ffffffff81012b13>] ret_from_intr+0x0/0x11
[80062.326418]  <EOI>
[80062.342191]  [<ffffffff8130f59f>] ? acpi_idle_enter_c1+0xa3/0xc1
[80062.362503]  [<ffffffff8130f57e>] ? acpi_idle_enter_c1+0x82/0xc1
[80062.382525]  [<ffffffff8144f1e7>] ? cpuidle_idle_call+0xa7/0x140
[80062.402409]  [<ffffffff81010e63>] ? cpu_idle+0xb3/0x110
[80062.421258]  [<ffffffff81544137>] ? rest_init+0x77/0x80
[80062.439930]  [<ffffffff81882dd8>] ? start_kernel+0x368/0x371
[80062.459099]  [<ffffffff8188233a>] ? x86_64_start_reservations+0x125/0x129
[80062.479533]  [<ffffffff81882438>] ? x86_64_start_kernel+0xfa/0x109
[80062.499226] Code: 00 00 48 c7 c2 2e 85 03 81 48 c7 c1 31 85 03 81 e9 dd fe ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 55 b8 00 01 00 00 48 89 e5 <f0> 66 0f c1 07 38 e0 74 06 f3 90 8a 07 eb f6 c9 c3 66 0f 1f 44
[80062.563580] RIP  [<ffffffff81038729>] __ticket_spin_lock+0x9/0x20
[80062.584129]  RSP <ffff880016603740>





Thread overview: 6+ messages:
2012-03-22  1:59 [Cluster-devel] GPF in dlm_lowcomms_stop dann frazier
2012-03-30 15:42 ` David Teigland
2012-03-30 16:42   ` David Teigland
2012-03-30 17:17     ` dann frazier
2012-05-04 17:33       ` dann frazier
2012-05-04 17:51         ` David Teigland
