* lockdep question (was Re: IPoIB caused a kernel: BUG: soft lockup detected on CPU#0!) [not found] ` <f8ca0a150703030937u61bcd9aj200415b76c00d151@mail.gmail.com> @ 2007-03-11 13:50 ` Michael S. Tsirkin 2007-03-11 15:04 ` Michael S. Tsirkin 2007-03-11 15:25 ` Peter Zijlstra 0 siblings, 2 replies; 10+ messages in thread From: Michael S. Tsirkin @ 2007-03-11 13:50 UTC (permalink / raw) To: Roland Dreier Cc: Tziporet Koren, Roland Dreier, general, Ingo Molnar, Linux Kernel Mailing List > Quoting Roland Dreier <roland.list@gmail.com>: > Subject: Re: IPoIB caused a kernel: BUG: soft lockup detected on CPU#0! > > >Feb 27 17:47:52 sw169 kernel: [<ffffffff8053aaf1>] _spin_lock_irqsave+0x15/0x24 > >Feb 27 17:47:52 sw169 kernel: [<ffffffff88067a23>] :ib_ipoib:ipoib_neigh_destructor+0xc2/0x139 > > It looks like this is deadlocking trying to take priv->lock in ipoib_neigh_destructor(). > One idea I just had would be to build a kernel with CONFIG_PROVE_LOCKING > turned on, and then rerun this test. There's a good chance that this would > diagnose the deadlock. (I don't have good access to my test machines right now, or > else I would do it myself) OK, I did that. But I get [13440.761857] INFO: trying to register non-static key. [13440.766903] the code is fine but needs lockdep annotation. [13440.772455] turning off the locking correctness validator. and I am not sure what triggers this, or how to fix it to have the validator actually do its job. Ingo, what key does the message refer to? The stack dump seems to point to drivers/infiniband/ulp/ipoib/ipoib_main.c line 829. Full message below: [13440.761857] INFO: trying to register non-static key. [13440.766903] the code is fine but needs lockdep annotation. [13440.772455] turning off the locking correctness validator. [13440.778008] [<c023c082>] __lock_acquire+0xae4/0xbb9 [13440.783078] [<c023c43d>] lock_acquire+0x56/0x71 [13440.787784] [<f899bff2>] ipoib_neigh_destructor+0xd0/0x132 [ib_ipoib] [13440.794412] [<c051ad41>] _spin_lock_irqsave+0x32/0x41 [13440.799649] [<f899bff2>] ipoib_neigh_destructor+0xd0/0x132 [ib_ipoib] [13440.806275] [<f899bff2>] ipoib_neigh_destructor+0xd0/0x132 [ib_ipoib] [13440.812897] [<c04a1c1b>] dst_run_gc+0xc/0x118 [13440.817439] [<c022af6e>] run_timer_softirq+0x37/0x16b [13440.822673] [<c04a1c0f>] dst_run_gc+0x0/0x118 [13440.827221] [<c04a3eab>] neigh_destroy+0xbe/0x104 [13440.832114] [<c04a1bb1>] dst_destroy+0x4d/0xab [13440.836751] [<c04a1c64>] dst_run_gc+0x55/0x118 [13440.841384] [<c022b03f>] run_timer_softirq+0x108/0x16b [13440.846711] [<c0227634>] __do_softirq+0x5a/0xd5 [13440.851427] [<c023b435>] trace_hardirqs_on+0x106/0x141 [13440.856754] [<c0227643>] __do_softirq+0x69/0xd5 [13440.861470] [<c02276e6>] do_softirq+0x37/0x4d [13440.866016] [<c02167b0>] smp_apic_timer_interrupt+0x6b/0x77 [13440.871774] [<c02029ef>] default_idle+0x3b/0x54 [13440.876491] [<c02029ef>] default_idle+0x3b/0x54 [13440.881211] [<c0204c33>] apic_timer_interrupt+0x33/0x38 [13440.886624] [<c02029ef>] default_idle+0x3b/0x54 [13440.891342] [<c02029f1>] default_idle+0x3d/0x54 [13440.896061] [<c0202aaa>] cpu_idle+0xa2/0xbb [13440.900436] ======================= [13768.711447] BUG: spinlock lockup on CPU#1, swapper/0, c0687880 [13768.717353] [<c031f919>] _raw_spin_lock+0xda/0xfd [13768.722247] [<c051ad48>] _spin_lock_irqsave+0x39/0x41 [13768.727486] [<f899bff2>] ipoib_neigh_destructor+0xd0/0x132 [ib_ipoib] [13768.734110] [<f899bff2>] ipoib_neigh_destructor+0xd0/0x132 [ib_ipoib] [13768.740735] [<c04a1c1b>] dst_run_gc+0xc/0x118 [13768.745276] [<c022af6e>] run_timer_softirq+0x37/0x16b [13768.750517] [<c04a1c0f>] dst_run_gc+0x0/0x118 [13768.755061] [<c04a3eab>] neigh_destroy+0xbe/0x104 [13768.759955] [<c04a1bb1>] dst_destroy+0x4d/0xab [13768.764586] [<c04a1c64>] dst_run_gc+0x55/0x118 [13768.769218] [<c022b03f>] run_timer_softirq+0x108/0x16b [13768.774542] [<c0227634>] __do_softirq+0x5a/0xd5 [13768.779261] [<c023b435>] trace_hardirqs_on+0x106/0x141 [13768.784588] [<c0227643>] __do_softirq+0x69/0xd5 [13768.789308] [<c02276e6>] do_softirq+0x37/0x4d [13768.793851] [<c02167b0>] smp_apic_timer_interrupt+0x6b/0x77 [13768.799609] [<c02029ef>] default_idle+0x3b/0x54 [13768.804326] [<c02029ef>] default_idle+0x3b/0x54 [13768.809054] [<c0204c33>] apic_timer_interrupt+0x33/0x38 [13768.814471] [<c02029ef>] default_idle+0x3b/0x54 [13768.819187] [<c02029f1>] default_idle+0x3d/0x54 [13768.823903] [<c0202aaa>] cpu_idle+0xa2/0xbb [13768.828279] ======================= -- MST ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: lockdep question (was Re: IPoIB caused a kernel: BUG: soft lockup detected on CPU#0!) 2007-03-11 13:50 ` lockdep question (was Re: IPoIB caused a kernel: BUG: soft lockup detected on CPU#0!) Michael S. Tsirkin @ 2007-03-11 15:04 ` Michael S. Tsirkin 2007-03-11 15:22 ` Michael S. Tsirkin 2007-03-11 19:19 ` Ingo Molnar 2007-03-11 15:25 ` Peter Zijlstra 1 sibling, 2 replies; 10+ messages in thread From: Michael S. Tsirkin @ 2007-03-11 15:04 UTC (permalink / raw) To: Roland Dreier Cc: Roland Dreier, Ingo Molnar, general, Linux Kernel Mailing List After adding some printks, I started getting these: [ 597.036720] BUG: MAX_STACK_TRACE_ENTRIES too low! [ 597.041546] turning off the locking correctness validator. [ 597.047135] [<c023a922>] save_trace+0x8a/0x8f [ 597.051751] [<c023ae8c>] mark_lock+0x65/0x3ff [ 597.056366] [<c023a8d6>] save_trace+0x3e/0x8f [ 597.060980] [<c023a9f0>] add_lock_to_list+0x62/0x85 [ 597.066116] [<c023b992>] __lock_acquire+0x3f4/0xbb9 [ 597.071252] [<f89da11f>] send_mad+0x79/0x103 [ib_sa] [ 597.076474] [<c031a475>] idr_get_new_above_int+0x13c/0x216 [ 597.082225] [<c023c43d>] lock_acquire+0x56/0x71 [ 597.087018] [<f89da11f>] send_mad+0x79/0x103 [ib_sa] [ 597.092240] [<c051ad41>] _spin_lock_irqsave+0x32/0x41 [ 597.097547] [<f89da11f>] send_mad+0x79/0x103 [ib_sa] [ 597.102770] [<f89da11f>] send_mad+0x79/0x103 [ib_sa] [ 597.107989] [<f89da8d9>] ib_sa_path_rec_get+0x134/0x172 [ib_sa] [ 597.114166] [<f899b73f>] path_rec_start+0x115/0x143 [ib_ipoib] [ 597.120254] [<f899cb38>] path_rec_completion+0x0/0x4f4 [ib_ipoib] [ 597.126610] [<f899b874>] path_rec_create+0x77/0x9d [ib_ipoib] [ 597.132617] [<f899c9fe>] ipoib_start_xmit+0x441/0x57b [ib_ipoib] [ 597.138888] [<c051ae06>] _spin_unlock_irqrestore+0x34/0x39 [ 597.144635] [<c023b435>] trace_hardirqs_on+0x106/0x141 [ 597.150035] [<c04a058b>] dev_queue_xmit+0x109/0x245 [ 597.155167] [<c022ae27>] __mod_timer+0x94/0x9e [ 597.159871] [<c04a0423>] dev_hard_start_xmit+0x1be/0x21d [ 597.165438] [<c04a9fa9>] __qdisc_run+0xd7/0x190 [ 597.170226] [<c04a05b7>] dev_queue_xmit+0x135/0x245 [ 597.175360] [<c04ce267>] arp_process+0x2c0/0x512 [ 597.180234] [<f8954346>] mthca_tavor_interrupt+0xf3/0x12b [ib_mthca] [ 597.186855] [<c04a088b>] netif_receive_skb+0x1c4/0x1da [ 597.192254] [<c023b435>] trace_hardirqs_on+0x106/0x141 [ 597.197648] [<c04a0935>] process_backlog+0x94/0x107 [ 597.202785] [<c049f02b>] net_rx_action+0x9a/0x15e [ 597.207743] [<c0227643>] __do_softirq+0x69/0xd5 [ 597.212530] [<c02276e6>] do_softirq+0x37/0x4d [ 597.217147] [<c020617e>] do_IRQ+0x5c/0x72 [ 597.221415] [<c0204b52>] common_interrupt+0x2e/0x34 [ 597.226549] [<c02029ef>] default_idle+0x3b/0x54 [ 597.231337] [<c02029f1>] default_idle+0x3d/0x54 [ 597.236124] [<c0202aaa>] cpu_idle+0xa2/0xbb [ 597.240567] ======================= And sometimes these: [ 404.493572] KERNEL: assertion (!timer_pending(&dev->watchdog_timer)) failed at net/sched/sch_generic.c (608) -- MST ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: lockdep question (was Re: IPoIB caused a kernel: BUG: soft lockup detected on CPU#0!) 2007-03-11 15:04 ` Michael S. Tsirkin @ 2007-03-11 15:22 ` Michael S. Tsirkin 2007-03-11 19:19 ` Ingo Molnar 1 sibling, 0 replies; 10+ messages in thread From: Michael S. Tsirkin @ 2007-03-11 15:22 UTC (permalink / raw) To: Roland Dreier Cc: Roland Dreier, Ingo Molnar, general, Linux Kernel Mailing List > Quoting Michael S. Tsirkin <mst@mellanox.co.il>: > Subject: Re: lockdep question (was Re: IPoIB caused a kernel: BUG: soft lockup detected on CPU#0!) > > > After adding some printks, I started getting these: > > [ 597.036720] BUG: MAX_STACK_TRACE_ENTRIES too low! > [ 597.041546] turning off the locking correctness validator. I looked at our stack usage a bit. It seems some work is in order. $ make checkstack | grep ib_ 0x00000603 mthca_init_hca [ib_mthca]: 764 0x000014ed mthca_init_hca [ib_mthca]: 764 0x000065ae ipoib_cm_tx_start [ib_ipoib]: 368 0x00006b0b ipoib_cm_tx_start [ib_ipoib]: 368 0x0000135f ib_uverbs_query_device [ib_uverbs]: 348 0x000015f9 ib_uverbs_query_device [ib_uverbs]: 348 0x000005d0 ib_ucm_init_qp_attr [ib_ucm]: 300 0x00000697 ib_ucm_init_qp_attr [ib_ucm]: 300 0x00007f9e ipoib_path_seq_show [ib_ipoib]: 264 0x00008092 ipoib_path_seq_show [ib_ipoib]: 264 0x00005b56 ipoib_cm_rx_handler [ib_ipoib]: 220 0x00005eec ipoib_cm_rx_handler [ib_ipoib]: 220 0x00007934 ipoib_cm_tx_handler [ib_ipoib]: 208 0x00007ce0 ipoib_cm_tx_handler [ib_ipoib]: 208 0x000032fe ib_uverbs_create_qp [ib_uverbs]: 192 0x000036fd ib_uverbs_create_qp [ib_uverbs]: 192 0x000028a9 srp_reset_host [ib_srp]: 192 0x00002a96 srp_reset_host [ib_srp]: 192 0x00001c99 show_sys_image_guid [ib_core]: 188 0x00001d2b show_sys_image_guid [ib_core]: 188 0x000001f9 ib_sa_service_rec_callback [ib_sa]: 180 0x00000234 ib_sa_service_rec_callback [ib_sa]: 180 0x00001b3c path_rec_completion [ib_ipoib]: 180 0x00002020 path_rec_completion [ib_ipoib]: 180 0x000070cf ipoib_cm_handle_rx_wc [ib_ipoib]: 180 0x00007402 ipoib_cm_handle_rx_wc [ib_ipoib]: 180 0x000009a7 srp_create_target [ib_srp]: 176 0x0000125f srp_create_target [ib_srp]: 176 0x00000d9d ib_cm_listen [ib_cm]: 172 0x000010b3 ib_cm_listen [ib_cm]: 172 0x00004455 ipoib_mcast_send [ib_ipoib]: 172 0x000048e0 ipoib_mcast_send [ib_ipoib]: 172 0x000015c1 ipoib_start_xmit [ib_ipoib]: 164 0x00001b2d ipoib_start_xmit [ib_ipoib]: 164 0x000056c8 mthca_make_profile [ib_mthca]: 160 0x00006051 mthca_make_profile [ib_mthca]: 160 0x00002abb ipoib_ib_dev_stop [ib_ipoib]: 160 0x00002d19 ipoib_ib_dev_stop [ib_ipoib]: 160 0x0000202b ib_uverbs_query_qp [ib_uverbs]: 156 0x000022c0 ib_uverbs_query_qp [ib_uverbs]: 156 0x00005269 ipoib_init_qp [ib_ipoib]: 152 0x000053bc ipoib_init_qp [ib_ipoib]: 152 0x0000327f ipoib_mcast_join [ib_ipoib]: 144 0x0000349d ipoib_mcast_join [ib_ipoib]: 144 0x00002092 ib_find_send_mad [ib_mad]: 140 0x000023fa ib_find_send_mad [ib_mad]: 140 0x000022cf ib_uverbs_modify_qp [ib_uverbs]: 140 0x000024f2 ib_uverbs_modify_qp [ib_uverbs]: 140 0x0000bc8e mthca_modify_qp [ib_mthca]: 136 0x0000c9cc mthca_modify_qp [ib_mthca]: 136 0x00010cb1 mthca_reg_phys_mr [ib_mthca]: 136 0x0001117a mthca_reg_phys_mr [ib_mthca]: 136 0x000035b4 ipoib_mcast_join_finish [ib_ipoib]: 136 0x00003a33 ipoib_mcast_join_finish [ib_ipoib]: 136 0x00000793 iser_cma_handler [ib_iser]: 132 0x00000bc1 iser_cma_handler [ib_iser]: 132 0x00001e37 srp_queuecommand [ib_srp]: 132 0x0000273b srp_queuecommand [ib_srp]: 132 0x00008a5a mthca_poll_cq [ib_mthca]: 128 0x00009204 mthca_poll_cq [ib_mthca]: 128 0x00003a42 ipoib_mcast_join_complete [ib_ipoib]: 128 0x00003e6e ipoib_mcast_join_complete [ib_ipoib]: 128 0x00004a58 ipoib_mcast_restart_task [ib_ipoib]: 128 0x00004eb8 ipoib_mcast_restart_task [ib_ipoib]: 128 0x000038e6 ib_uverbs_create_ah [ib_uverbs]: 116 0x00003ac4 ib_uverbs_create_ah [ib_uverbs]: 116 0x0000f6a5 mthca_process_mad [ib_mthca]: 116 0x0000fa93 mthca_process_mad [ib_mthca]: 116 0x000011ef mcast_work_handler [ib_sa]: 112 0x000016e6 mcast_work_handler [ib_sa]: 112 0x00000a20 ib_ucm_send_req [ib_ucm]: 112 0x00000b7c ib_ucm_send_req [ib_ucm]: 112 0x00001697 ib_post_send_mad [ib_mad]: 112 0x00001b05 ib_post_send_mad [ib_mad]: 112 0x0000030e iser_post_send [ib_iser]: 112 0x000003c5 iser_post_send [ib_iser]: 112 0x00001605 ib_uverbs_query_port [ib_uverbs]: 108 0x00001713 ib_uverbs_query_port [ib_uverbs]: 108 0x0000171d ib_uverbs_create_srq [ib_uverbs]: 108 0x00001946 ib_uverbs_create_srq [ib_uverbs]: 108 0x00002409 ib_mad_completion_handler [ib_mad]: 104 0x00002907 ib_mad_completion_handler [ib_mad]: 104 0x00002454 iser_reg_rdma_mem [ib_iser]: 104 0x00002aa8 iser_reg_rdma_mem [ib_iser]: 104 0x0416 rdma_resolve_ip [ib_addr]: 100 0x066f rdma_resolve_ip [ib_addr]: 100 0x00007dcb ipoib_mcg_seq_show [ib_ipoib]: 100 0x00007e65 ipoib_mcg_seq_show [ib_ipoib]: 100 -- MST ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: lockdep question (was Re: IPoIB caused a kernel: BUG: soft lockup detected on CPU#0!) 2007-03-11 15:04 ` Michael S. Tsirkin 2007-03-11 15:22 ` Michael S. Tsirkin @ 2007-03-11 19:19 ` Ingo Molnar 1 sibling, 0 replies; 10+ messages in thread From: Ingo Molnar @ 2007-03-11 19:19 UTC (permalink / raw) To: Michael S. Tsirkin Cc: Roland Dreier, Roland Dreier, general, Linux Kernel Mailing List * Michael S. Tsirkin <mst@mellanox.co.il> wrote: > After adding some printks, I started getting these: > > [ 597.036720] BUG: MAX_STACK_TRACE_ENTRIES too low! this should go away if you double the size of MAX_STACK_TRACE_ENTRIES in kernel/lockdep_internals.h. (keep it a power of two) If it doesnt go away then it might signal some sort of leak. Ingo ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: lockdep question (was Re: IPoIB caused a kernel: BUG: soft lockup detected on CPU#0!) 2007-03-11 13:50 ` lockdep question (was Re: IPoIB caused a kernel: BUG: soft lockup detected on CPU#0!) Michael S. Tsirkin 2007-03-11 15:04 ` Michael S. Tsirkin @ 2007-03-11 15:25 ` Peter Zijlstra 2007-03-11 16:12 ` lockdep question (was Re: IPoIB caused a kernel: BUG: softlockup " Michael S. Tsirkin 1 sibling, 1 reply; 10+ messages in thread From: Peter Zijlstra @ 2007-03-11 15:25 UTC (permalink / raw) To: Michael S. Tsirkin Cc: Roland Dreier, Tziporet Koren, Roland Dreier, general, Ingo Molnar, Linux Kernel Mailing List On Sun, 2007-03-11 at 15:50 +0200, Michael S. Tsirkin wrote: > > Quoting Roland Dreier <roland.list@gmail.com>: > > Subject: Re: IPoIB caused a kernel: BUG: soft lockup detected on CPU#0! > > > > >Feb 27 17:47:52 sw169 kernel: [<ffffffff8053aaf1>] _spin_lock_irqsave+0x15/0x24 > > >Feb 27 17:47:52 sw169 kernel: [<ffffffff88067a23>] :ib_ipoib:ipoib_neigh_destructor+0xc2/0x139 > > > > It looks like this is deadlocking trying to take priv->lock in ipoib_neigh_destructor(). > > One idea I just had would be to build a kernel with CONFIG_PROVE_LOCKING > > turned on, and then rerun this test. There's a good chance that this would > > diagnose the deadlock. (I don't have good access to my test machines right now, or > > else I would do it myself) > > OK, I did that. But I get > [13440.761857] INFO: trying to register non-static key. > [13440.766903] the code is fine but needs lockdep annotation. > [13440.772455] turning off the locking correctness validator. > and I am not sure what triggers this, or how to fix it to have the > validator actually do its job. It usually indicates a spinlock is not properly initialized. Like __SPIN_LOCK_UNLOCKED() used in a non-static context, use spin_lock_init() in these cases. However looking at the code, ipoib_neight_destructor only uses &priv->lock, and that seems to get properly initialized in ipoib_setup() using spin_lock_init(). So either there are other sites that instanciate those objects and forget about the lock init, or the object is corrupted (use after free?) > Ingo, what key does the message refer to? > > The stack dump seems to point to drivers/infiniband/ulp/ipoib/ipoib_main.c line > 829. > > Full message below: > > [13440.761857] INFO: trying to register non-static key. > [13440.766903] the code is fine but needs lockdep annotation. > [13440.772455] turning off the locking correctness validator. > [13440.778008] [<c023c082>] __lock_acquire+0xae4/0xbb9 > [13440.783078] [<c023c43d>] lock_acquire+0x56/0x71 > [13440.787784] [<f899bff2>] ipoib_neigh_destructor+0xd0/0x132 [ib_ipoib] > [13440.794412] [<c051ad41>] _spin_lock_irqsave+0x32/0x41 > [13440.799649] [<f899bff2>] ipoib_neigh_destructor+0xd0/0x132 [ib_ipoib] > [13440.806275] [<f899bff2>] ipoib_neigh_destructor+0xd0/0x132 [ib_ipoib] > [13440.812897] [<c04a1c1b>] dst_run_gc+0xc/0x118 > [13440.817439] [<c022af6e>] run_timer_softirq+0x37/0x16b > [13440.822673] [<c04a1c0f>] dst_run_gc+0x0/0x118 > [13440.827221] [<c04a3eab>] neigh_destroy+0xbe/0x104 > [13440.832114] [<c04a1bb1>] dst_destroy+0x4d/0xab > [13440.836751] [<c04a1c64>] dst_run_gc+0x55/0x118 > [13440.841384] [<c022b03f>] run_timer_softirq+0x108/0x16b > [13440.846711] [<c0227634>] __do_softirq+0x5a/0xd5 > [13440.851427] [<c023b435>] trace_hardirqs_on+0x106/0x141 > [13440.856754] [<c0227643>] __do_softirq+0x69/0xd5 > [13440.861470] [<c02276e6>] do_softirq+0x37/0x4d > [13440.866016] [<c02167b0>] smp_apic_timer_interrupt+0x6b/0x77 > [13440.871774] [<c02029ef>] default_idle+0x3b/0x54 > [13440.876491] [<c02029ef>] default_idle+0x3b/0x54 > [13440.881211] [<c0204c33>] apic_timer_interrupt+0x33/0x38 > [13440.886624] [<c02029ef>] default_idle+0x3b/0x54 > [13440.891342] [<c02029f1>] default_idle+0x3d/0x54 > [13440.896061] [<c0202aaa>] cpu_idle+0xa2/0xbb > [13440.900436] ======================= > [13768.711447] BUG: spinlock lockup on CPU#1, swapper/0, c0687880 > [13768.717353] [<c031f919>] _raw_spin_lock+0xda/0xfd > [13768.722247] [<c051ad48>] _spin_lock_irqsave+0x39/0x41 > [13768.727486] [<f899bff2>] ipoib_neigh_destructor+0xd0/0x132 [ib_ipoib] > [13768.734110] [<f899bff2>] ipoib_neigh_destructor+0xd0/0x132 [ib_ipoib] > [13768.740735] [<c04a1c1b>] dst_run_gc+0xc/0x118 > [13768.745276] [<c022af6e>] run_timer_softirq+0x37/0x16b > [13768.750517] [<c04a1c0f>] dst_run_gc+0x0/0x118 > [13768.755061] [<c04a3eab>] neigh_destroy+0xbe/0x104 > [13768.759955] [<c04a1bb1>] dst_destroy+0x4d/0xab > [13768.764586] [<c04a1c64>] dst_run_gc+0x55/0x118 > [13768.769218] [<c022b03f>] run_timer_softirq+0x108/0x16b > [13768.774542] [<c0227634>] __do_softirq+0x5a/0xd5 > [13768.779261] [<c023b435>] trace_hardirqs_on+0x106/0x141 > [13768.784588] [<c0227643>] __do_softirq+0x69/0xd5 > [13768.789308] [<c02276e6>] do_softirq+0x37/0x4d > [13768.793851] [<c02167b0>] smp_apic_timer_interrupt+0x6b/0x77 > [13768.799609] [<c02029ef>] default_idle+0x3b/0x54 > [13768.804326] [<c02029ef>] default_idle+0x3b/0x54 > [13768.809054] [<c0204c33>] apic_timer_interrupt+0x33/0x38 > [13768.814471] [<c02029ef>] default_idle+0x3b/0x54 > [13768.819187] [<c02029f1>] default_idle+0x3d/0x54 > [13768.823903] [<c0202aaa>] cpu_idle+0xa2/0xbb > [13768.828279] ======================= > > > -- > MST > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: lockdep question (was Re: IPoIB caused a kernel: BUG: softlockup detected on CPU#0!) 2007-03-11 15:25 ` Peter Zijlstra @ 2007-03-11 16:12 ` Michael S. Tsirkin 2007-03-11 18:48 ` Ingo Molnar 0 siblings, 1 reply; 10+ messages in thread From: Michael S. Tsirkin @ 2007-03-11 16:12 UTC (permalink / raw) To: Peter Zijlstra Cc: Roland Dreier, Tziporet Koren, Roland Dreier, general, Ingo Molnar, Linux Kernel Mailing List Quoting Peter Zijlstra <a.p.zijlstra@chello.nl>: Subject: Re: lockdep question (was Re: IPoIB caused a kernel: BUG: softlockup detected on CPU#0!) > On Sun, 2007-03-11 at 15:50 +0200, Michael S. Tsirkin wrote: > > > Quoting Roland Dreier <roland.list@gmail.com>: > > > Subject: Re: IPoIB caused a kernel: BUG: soft lockup detected on CPU#0! > > > > > > >Feb 27 17:47:52 sw169 kernel: [<ffffffff8053aaf1>] _spin_lock_irqsave+0x15/0x24 > > > >Feb 27 17:47:52 sw169 kernel: [<ffffffff88067a23>] :ib_ipoib:ipoib_neigh_destructor+0xc2/0x139 > > > > > > It looks like this is deadlocking trying to take priv->lock in ipoib_neigh_destructor(). > > > One idea I just had would be to build a kernel with CONFIG_PROVE_LOCKING > > > turned on, and then rerun this test. There's a good chance that this would > > > diagnose the deadlock. (I don't have good access to my test machines right now, or > > > else I would do it myself) > > > > OK, I did that. But I get > > [13440.761857] INFO: trying to register non-static key. > > [13440.766903] the code is fine but needs lockdep annotation. > > [13440.772455] turning off the locking correctness validator. > > and I am not sure what triggers this, or how to fix it to have the > > validator actually do its job. > > It usually indicates a spinlock is not properly initialized. Like > __SPIN_LOCK_UNLOCKED() used in a non-static context, use > spin_lock_init() in these cases. > > However looking at the code, ipoib_neight_destructor only uses > &priv->lock, and that seems to get properly initialized in ipoib_setup() > using spin_lock_init(). > > So either there are other sites that instanciate those objects and > forget about the lock init, or the object is corrupted (use after free?) OK, thanks for the hint. So I added this: diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index f9dbc6f..2eea467 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -821,8 +821,15 @@ static void ipoib_neigh_destructor(struct neighbour *n) unsigned long flags; struct ipoib_ah *ah = NULL; + if (n->dev->type != ARPHRD_INFINIBAND) { + printk(KERN_ERR "ipoib_neigh_destructor lock %p wrong type %d !!!!!!!!!!\n", + &priv->lock, n->dev->type); + BUG_ON(n->dev->type != ARPHRD_INFINIBAND); + return; + } + ipoib_dbg(priv, "neigh_destructor for %06x " IPOIB_GID_FMT "\n", IPOIB_QPN(n->ha), IPOIB_GID_RAW_ARG(n->ha + 4)); And sure enough it triggers: [ 858.503010] ipoib_neigh_destructor lock c0687880 wrong type 772 !!!!!!!!!! [ 858.510036] ------------[ cut here ]------------ [ 858.514723] kernel BUG at drivers/infiniband/ulp/ipoib/ipoib_main.c:827! [ 858.521486] invalid opcode: 0000 [#1] [ 858.525212] SMP [ 858.527173] Modules linked in: rdma_cm iw_cm ib_addr ib_ipoib ib_cm ib_sa ib_uverbs ibv [ 858.538736] CPU: 0 [ 858.538737] EIP: 0060:[<f899bfa5>] Not tainted VLI [ 858.538738] EFLAGS: 00010206 (2.6.21-rc3-i686-dbg #4) [ 858.551755] EIP is at ipoib_neigh_destructor+0x40/0x178 [ib_ipoib] [ 858.557996] eax: c0687300 ebx: f240e880 ecx: c0223114 edx: c064f280 [ 858.564851] esi: f240e880 edi: f240e880 ebp: c0687880 esp: c06c7e9c [ 858.571702] ds: 007b es: 007b fs: 00d8 gs: 0000 ss: 0068 [ 858.577602] Process swapper (pid: 0, ti=c06c6000 task=c064f280 task.ti=c06c6000) [ 858.584883] Stack: f89a37be c0687880 00000304 c022af6e c064f280 00000000 00000000 0000 [ 858.593573] 00000000 c06a2554 00000000 c064f280 00000001 00000000 c064f280 0000 [ 858.602259] c0860be0 c2a1fba0 00000246 c06a2554 f240e880 00000000 f240e880 c04a [ 858.610946] Call Trace: [ 858.613723] [<c022af6e>] run_timer_softirq+0x37/0x16b [ 858.618959] [<c04a1c0f>] dst_run_gc+0x0/0x118 [ 858.623498] [<c04a3eab>] neigh_destroy+0xbe/0x104 [ 858.628382] [<c04a1bb1>] dst_destroy+0x4d/0xab [ 858.632998] [<c04a1c64>] dst_run_gc+0x55/0x118 [ 858.637620] [<c022b03f>] run_timer_softirq+0x108/0x16b [ 858.642934] [<c0227634>] __do_softirq+0x5a/0xd5 [ 858.647648] [<c023b435>] trace_hardirqs_on+0x106/0x141 [ 858.652970] [<c0227643>] __do_softirq+0x69/0xd5 [ 858.657677] [<c02276e6>] do_softirq+0x37/0x4d [ 858.662210] [<c02167b0>] smp_apic_timer_interrupt+0x6b/0x77 [ 858.667965] [<c02029ef>] default_idle+0x3b/0x54 [ 858.672681] [<c02029ef>] default_idle+0x3b/0x54 [ 858.677391] [<c0204c33>] apic_timer_interrupt+0x33/0x38 [ 858.682796] [<c02029ef>] default_idle+0x3b/0x54 [ 858.687505] [<c02029f1>] default_idle+0x3d/0x54 [ 858.692211] [<c0202aaa>] cpu_idle+0xa2/0xbb [ 858.696569] [<c06cd7c3>] start_kernel+0x40b/0x413 [ 858.701453] [<c06cd1b3>] unknown_bootoption+0x0/0x205 [ 858.706678] ======================= [ 858.710321] Code: 66 83 f8 20 74 29 0f b7 c0 89 44 24 08 89 6c 24 04 c7 04 24 be 37 9a [ 858.730997] EIP: [<f899bfa5>] ipoib_neigh_destructor+0x40/0x178 [ib_ipoib] SS:ESP 0068c [ 858.740271] Kernel panic - not syncing: Fatal exception in interrupt Either something is corrupting neighbour dev pointer, or somehow a neighbour not related to infiniband is assigned ipoib_neigh_destructor. -- MST ^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: lockdep question (was Re: IPoIB caused a kernel: BUG: softlockup detected on CPU#0!) 2007-03-11 16:12 ` lockdep question (was Re: IPoIB caused a kernel: BUG: softlockup " Michael S. Tsirkin @ 2007-03-11 18:48 ` Ingo Molnar 2007-03-12 13:29 ` Michael S. Tsirkin 0 siblings, 1 reply; 10+ messages in thread From: Ingo Molnar @ 2007-03-11 18:48 UTC (permalink / raw) To: Michael S. Tsirkin Cc: Peter Zijlstra, Roland Dreier, Tziporet Koren, Roland Dreier, general, Linux Kernel Mailing List * Michael S. Tsirkin <mst@mellanox.co.il> wrote: > > So either there are other sites that instanciate those objects and > > forget about the lock init, or the object is corrupted (use after free?) > > OK, thanks for the hint. So I added this: > And sure enough it triggers: > > [ 858.503010] ipoib_neigh_destructor lock c0687880 wrong type 772 !!!!!!!!!! could you turn on CONFIG_SLAB_DEBUG as well? that should catch certain types of use-after-free accesses, and lockdep will also warn if a still locked object is freed. Ingo ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: lockdep question (was Re: IPoIB caused a kernel: BUG: softlockup detected on CPU#0!) 2007-03-11 18:48 ` Ingo Molnar @ 2007-03-12 13:29 ` Michael S. Tsirkin 2007-03-12 13:51 ` Ingo Molnar 0 siblings, 1 reply; 10+ messages in thread From: Michael S. Tsirkin @ 2007-03-12 13:29 UTC (permalink / raw) To: Ingo Molnar Cc: Peter Zijlstra, Roland Dreier, Tziporet Koren, Roland Dreier, general, Linux Kernel Mailing List > Quoting Ingo Molnar <mingo@elte.hu>: > Subject: Re: lockdep question (was Re: IPoIB caused a kernel: BUG: softlockup detected on CPU#0!) > > > * Michael S. Tsirkin <mst@mellanox.co.il> wrote: > > > > So either there are other sites that instanciate those objects and > > > forget about the lock init, or the object is corrupted (use after free?) > > > > OK, thanks for the hint. So I added this: > > > And sure enough it triggers: > > > > [ 858.503010] ipoib_neigh_destructor lock c0687880 wrong type 772 !!!!!!!!!! > > could you turn on CONFIG_SLAB_DEBUG as well? > > that should catch certain types of use-after-free accesses, and lockdep > will also warn if a still locked object is freed. Hmm, no, this does not look like use-after-free. I enabled CONFIG_SLAB_DEBUG, and I still see the same message, so the memory was not overwritten by slab debugger. -- MST ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: lockdep question (was Re: IPoIB caused a kernel: BUG: softlockup detected on CPU#0!) 2007-03-12 13:29 ` Michael S. Tsirkin @ 2007-03-12 13:51 ` Ingo Molnar 2007-03-12 14:20 ` Michael S. Tsirkin 0 siblings, 1 reply; 10+ messages in thread From: Ingo Molnar @ 2007-03-12 13:51 UTC (permalink / raw) To: Michael S. Tsirkin Cc: Peter Zijlstra, Roland Dreier, Tziporet Koren, Roland Dreier, general, Linux Kernel Mailing List * Michael S. Tsirkin <mst@mellanox.co.il> wrote: > > could you turn on CONFIG_SLAB_DEBUG as well? > > > > that should catch certain types of use-after-free accesses, and > > lockdep will also warn if a still locked object is freed. > > Hmm, no, this does not look like use-after-free. I enabled > CONFIG_SLAB_DEBUG, and I still see the same message, so the memory was > not overwritten by slab debugger. that's still not conclusive - the memory might not have been allocated by slab again to detect it. Your magic-number check definitely shows some sort of corruption going on, right? Ingo ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: lockdep question (was Re: IPoIB caused a kernel: BUG: softlockup detected on CPU#0!) 2007-03-12 13:51 ` Ingo Molnar @ 2007-03-12 14:20 ` Michael S. Tsirkin 0 siblings, 0 replies; 10+ messages in thread From: Michael S. Tsirkin @ 2007-03-12 14:20 UTC (permalink / raw) To: Ingo Molnar Cc: Peter Zijlstra, Linux Kernel Mailing List, general, Roland Dreier, netdev > Quoting Ingo Molnar <mingo@elte.hu>: > Subject: Re: lockdep question (was Re: IPoIB caused a kernel: BUG: softlockup detected on CPU#0!) > > > * Michael S. Tsirkin <mst@mellanox.co.il> wrote: > > > > could you turn on CONFIG_SLAB_DEBUG as well? > > > > > > that should catch certain types of use-after-free accesses, and > > > lockdep will also warn if a still locked object is freed. > > > > Hmm, no, this does not look like use-after-free. I enabled > > CONFIG_SLAB_DEBUG, and I still see the same message, so the memory was > > not overwritten by slab debugger. > > that's still not conclusive - the memory might not have been allocated > by slab again to detect it. Your magic-number check definitely shows > some sort of corruption going on, right? Not necessarily in such a direct way. I currently think we are somehow getting neighbours where neigh->dev points to a loopback device - that's type 772, and this seems to make sense. I printed out the device name and sure enough it is "lo". Is it true that sticking the following static int ipoib_neigh_setup_dev(struct net_device *dev, struct neigh_parms *parms) { parms->neigh_destructor = ipoib_neigh_destructor; return 0; } in dev->neigh_setup, as ipoib does, guarantees that neighbour->dev will point to the current device for any neighbour which ipoib_neigh_destructor gets? That's the assumption IPoIB makes, and it seems broken in this instance. How could that be? -- MST ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2007-03-12 14:19 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <45E552FC.4040305@mellanox.co.il>
[not found] ` <f8ca0a150703030937u61bcd9aj200415b76c00d151@mail.gmail.com>
2007-03-11 13:50 ` lockdep question (was Re: IPoIB caused a kernel: BUG: soft lockup detected on CPU#0!) Michael S. Tsirkin
2007-03-11 15:04 ` Michael S. Tsirkin
2007-03-11 15:22 ` Michael S. Tsirkin
2007-03-11 19:19 ` Ingo Molnar
2007-03-11 15:25 ` Peter Zijlstra
2007-03-11 16:12 ` lockdep question (was Re: IPoIB caused a kernel: BUG: softlockup " Michael S. Tsirkin
2007-03-11 18:48 ` Ingo Molnar
2007-03-12 13:29 ` Michael S. Tsirkin
2007-03-12 13:51 ` Ingo Molnar
2007-03-12 14:20 ` Michael S. Tsirkin
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox