* 3.11-rc6 genetlink locking fix offends lockdep @ 2013-08-19 5:04 Hugh Dickins 2013-08-19 8:00 ` Johannes Berg 0 siblings, 1 reply; 10+ messages in thread From: Hugh Dickins @ 2013-08-19 5:04 UTC (permalink / raw) To: Johannes Berg Cc: Linus Torvalds, Greg KH, David S. Miller, Andrei Otcheretianski, linux-kernel, netdev, stable 3.11-rc6's commit 58ad436fcf49 ("genetlink: fix family dump race") gives me the lockdep trace below at startup. I think it needs to be reverted until you can refine it. And it has already gone into today's stable review series, as 04/12 for 3.0.92, 26/34 for 3.4.59, 18/45 for 3.10.8: I raise an objection to those. Hugh [ 4.004286] e1000e 0000:00:19.0: irq 43 for MSI/MSI-X [ 4.105671] e1000e 0000:00:19.0: irq 43 for MSI/MSI-X [ 4.106123] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready [ 4.110096] [ 4.110113] ====================================================== [ 4.110146] [ INFO: possible circular locking dependency detected ] [ 4.110180] 3.11.0-rc6 #1 Not tainted [ 4.110201] ------------------------------------------------------- [ 4.110234] NetworkManager/358 is trying to acquire lock: [ 4.110262] (genl_mutex){+.+.+.}, at: [<ffffffff8148204d>] genl_lock+0x12/0x14 [ 4.110315] [ 4.110315] but task is already holding lock: [ 4.110346] (nlk->cb_mutex){+.+.+.}, at: [<ffffffff8147f148>] netlink_dump+0x1c/0x1d7 [ 4.110400] [ 4.110400] which lock already depends on the new lock. [ 4.110400] [ 4.110442] [ 4.110442] the existing dependency chain (in reverse order) is: [ 4.110482] [ 4.110482] -> #1 (nlk->cb_mutex){+.+.+.}: [ 4.110517] [<ffffffff810b34d2>] __lock_acquire+0x865/0x956 [ 4.110555] [<ffffffff810b39fc>] lock_acquire+0x57/0x6d [ 4.110589] [<ffffffff81583e42>] mutex_lock_nested+0x5e/0x345 [ 4.110627] [<ffffffff81480122>] __netlink_dump_start+0xae/0x14e [ 4.110665] [<ffffffff81482143>] genl_rcv_msg+0xf4/0x252 [ 4.110699] [<ffffffff81481742>] netlink_rcv_skb+0x3e/0x8c [ 4.110734] [<ffffffff8148199b>] genl_rcv+0x24/0x34 [ 4.110766] [<ffffffff814811ca>] netlink_unicast+0xed/0x17a [ 4.110801] [<ffffffff814815d4>] netlink_sendmsg+0x2fb/0x345 [ 4.110838] [<ffffffff814503f7>] sock_sendmsg+0x79/0x8e [ 4.110871] [<ffffffff81450707>] ___sys_sendmsg+0x231/0x2be [ 4.110907] [<ffffffff81453228>] __sys_sendmsg+0x3d/0x5e [ 4.110942] [<ffffffff81453256>] SyS_sendmsg+0xd/0x19 [ 4.110975] [<ffffffff81587c12>] system_call_fastpath+0x16/0x1b [ 4.111012] [ 4.111012] -> #0 (genl_mutex){+.+.+.}: [ 4.111047] [<ffffffff810b1fb0>] validate_chain.isra.21+0x836/0xe8e [ 4.111086] [<ffffffff810b34d2>] __lock_acquire+0x865/0x956 [ 4.111122] [<ffffffff810b39fc>] lock_acquire+0x57/0x6d [ 4.111157] [<ffffffff81583e42>] mutex_lock_nested+0x5e/0x345 [ 4.111193] [<ffffffff8148204d>] genl_lock+0x12/0x14 [ 4.111226] [<ffffffff814822d2>] ctrl_dumpfamily+0x31/0xfa [ 4.111260] [<ffffffff8147f1b4>] netlink_dump+0x88/0x1d7 [ 4.111295] [<ffffffff8147f4b4>] netlink_recvmsg+0x1b1/0x2d1 [ 4.111331] [<ffffffff81450328>] sock_recvmsg+0x83/0x98 [ 4.111365] [<ffffffff814500c6>] ___sys_recvmsg+0x15d/0x207 [ 4.111400] [<ffffffff814533f7>] __sys_recvmsg+0x3d/0x5e [ 4.111434] [<ffffffff81453425>] SyS_recvmsg+0xd/0x19 [ 4.111467] [<ffffffff81587c12>] system_call_fastpath+0x16/0x1b [ 4.111504] [ 4.111504] other info that might help us debug this: [ 4.111504] [ 4.111545] Possible unsafe locking scenario: [ 4.111545] [ 4.111577] CPU0 CPU1 [ 4.111601] ---- ---- [ 4.111625] lock(nlk->cb_mutex); [ 4.112865] lock(genl_mutex); [ 4.114216] lock(nlk->cb_mutex); [ 4.115315] lock(genl_mutex); [ 4.116500] [ 4.116500] *** DEADLOCK *** [ 4.116500] [ 4.119670] 1 lock held by NetworkManager/358: [ 4.120906] #0: (nlk->cb_mutex){+.+.+.}, at: [<ffffffff8147f148>] netlink_dump+0x1c/0x1d7 [ 4.122196] [ 4.122196] stack backtrace: [ 4.124533] CPU: 0 PID: 358 Comm: NetworkManager Not tainted 3.11.0-rc6 #1 [ 4.125779] Hardware name: LENOVO 4174EH1/4174EH1, BIOS 8CET51WW (1.31 ) 11/29/2011 [ 4.126979] ffffffff81d0a0f0 ffff88022b91d8c8 ffffffff8157cf80 0000000000000006 [ 4.128274] ffffffff81cc8750 ffff88022b91d918 ffffffff8157a898 ffff88022d798080 [ 4.129472] ffff88022d798080 ffff88022d798080 ffff88022d798750 ffff88022d798080 [ 4.130645] Call Trace: [ 4.131801] [<ffffffff8157cf80>] dump_stack+0x4f/0x84 [ 4.132817] [<ffffffff8157a898>] print_circular_bug+0x2ad/0x2be [ 4.133839] [<ffffffff810b1fb0>] validate_chain.isra.21+0x836/0xe8e [ 4.134821] [<ffffffff8145471a>] ? sock_def_write_space+0x1b5/0x1b5 [ 4.135800] [<ffffffff810b34d2>] __lock_acquire+0x865/0x956 [ 4.136842] [<ffffffff810b40bb>] ? mark_held_locks+0xce/0xfa [ 4.137828] [<ffffffff8148204d>] ? genl_lock+0x12/0x14 [ 4.138876] [<ffffffff810b39fc>] lock_acquire+0x57/0x6d [ 4.139856] [<ffffffff8148204d>] ? genl_lock+0x12/0x14 [ 4.141027] [<ffffffff81583e42>] mutex_lock_nested+0x5e/0x345 [ 4.142194] [<ffffffff8148204d>] ? genl_lock+0x12/0x14 [ 4.143219] [<ffffffff8111c594>] ? __kmalloc_node_track_caller+0x26/0x2d [ 4.144340] [<ffffffff8148204d>] genl_lock+0x12/0x14 [ 4.145387] [<ffffffff814822d2>] ctrl_dumpfamily+0x31/0xfa [ 4.146387] [<ffffffff8145ac41>] ? __alloc_skb+0x97/0x1a0 [ 4.147454] [<ffffffff8147f1b4>] netlink_dump+0x88/0x1d7 [ 4.148448] [<ffffffff8147f4b4>] netlink_recvmsg+0x1b1/0x2d1 [ 4.149475] [<ffffffff81450328>] sock_recvmsg+0x83/0x98 [ 4.150494] [<ffffffff810f86fa>] ? might_fault+0x52/0xa2 [ 4.151471] [<ffffffff814500c6>] ___sys_recvmsg+0x15d/0x207 [ 4.152516] [<ffffffff810b34d2>] ? __lock_acquire+0x865/0x956 [ 4.153501] [<ffffffff81148b2b>] ? fget_light+0x35c/0x377 [ 4.154550] [<ffffffff81148933>] ? fget_light+0x164/0x377 [ 4.155521] [<ffffffff814533f7>] __sys_recvmsg+0x3d/0x5e [ 4.156568] [<ffffffff8145471a>] ? sock_def_write_space+0x1b5/0x1b5 [ 4.157552] [<ffffffff81453425>] SyS_recvmsg+0xd/0x19 [ 4.158607] [<ffffffff81587c12>] system_call_fastpath+0x16/0x1b [ 4.160507] iwlwifi 0000:03:00.0: L1 Enabled; Disabling L0S [ 4.160709] iwlwifi 0000:03:00.0: Radio type=0x0-0x3-0x1 ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: 3.11-rc6 genetlink locking fix offends lockdep 2013-08-19 5:04 3.11-rc6 genetlink locking fix offends lockdep Hugh Dickins @ 2013-08-19 8:00 ` Johannes Berg 2013-08-19 11:00 ` Ding Tianhong 0 siblings, 1 reply; 10+ messages in thread From: Johannes Berg @ 2013-08-19 8:00 UTC (permalink / raw) To: Hugh Dickins Cc: Linus Torvalds, Greg KH, David S. Miller, Otcheretianski, Andrei, linux-kernel@vger.kernel.org, netdev@vger.kernel.org, stable@vger.kernel.org, Pravin B Shelar, Thomas Graf > 3.11-rc6's commit 58ad436fcf49 ("genetlink: fix family dump race") > gives me the lockdep trace below at startup. Hmm. Yes, I see now how this happens, not sure why I didn't run into it. The problem is that genl_family_rcv_msg() is called with the genl_lock held, and then calls netlink_dump_start() with it held, creating a genl_lock->cb_mutex dependency, but obviously the dump continuation is the other way around. We could use the semaphore instead, I believe, but I don't really understand the mutex vs. semaphore well enough to be sure that's correct. johannes diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c index f85f8a2..6cfa646 100644 --- a/net/netlink/genetlink.c +++ b/net/netlink/genetlink.c @@ -792,7 +792,7 @@ static int ctrl_dumpfamily(struct sk_buff *skb, struct netlink_callback *cb) bool need_locking = chains_to_skip || fams_to_skip; if (need_locking) - genl_lock(); + down_read(&cb_lock); for (i = chains_to_skip; i < GENL_FAM_TAB_SIZE; i++) { n = 0; @@ -815,7 +815,7 @@ errout: cb->args[1] = n; if (need_locking) - genl_unlock(); + up_read(&cb_lock); return skb->len; } ^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: 3.11-rc6 genetlink locking fix offends lockdep 2013-08-19 8:00 ` Johannes Berg @ 2013-08-19 11:00 ` Ding Tianhong 2013-08-19 11:22 ` Johannes Berg 0 siblings, 1 reply; 10+ messages in thread From: Ding Tianhong @ 2013-08-19 11:00 UTC (permalink / raw) To: Johannes Berg Cc: Hugh Dickins, Linus Torvalds, Greg KH, David S. Miller, Otcheretianski, Andrei, linux-kernel@vger.kernel.org, netdev@vger.kernel.org, stable@vger.kernel.org, Pravin B Shelar, Thomas Graf On 2013/8/19 16:00, Johannes Berg wrote: > >> 3.11-rc6's commit 58ad436fcf49 ("genetlink: fix family dump race") >> gives me the lockdep trace below at startup. > > Hmm. Yes, I see now how this happens, not sure why I didn't run into it. > > The problem is that genl_family_rcv_msg() is called with the genl_lock > held, and then calls netlink_dump_start() with it held, creating a > genl_lock->cb_mutex dependency, but obviously the dump continuation is > the other way around. > > We could use the semaphore instead, I believe, but I don't really > understand the mutex vs. semaphore well enough to be sure that's > correct. > > johannes > it is useless, the logic need to modify or otherwise it will still call lockdep trace. maybe i could send a patch for it, if you wish. > > diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c > index f85f8a2..6cfa646 100644 > --- a/net/netlink/genetlink.c > +++ b/net/netlink/genetlink.c > @@ -792,7 +792,7 @@ static int ctrl_dumpfamily(struct sk_buff *skb, struct netlink_callback *cb) > bool need_locking = chains_to_skip || fams_to_skip; > > if (need_locking) > - genl_lock(); > + down_read(&cb_lock); > > for (i = chains_to_skip; i < GENL_FAM_TAB_SIZE; i++) { > n = 0; > @@ -815,7 +815,7 @@ errout: > cb->args[1] = n; > > if (need_locking) > - genl_unlock(); > + up_read(&cb_lock); > > return skb->len; > } > > > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > > . > ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: 3.11-rc6 genetlink locking fix offends lockdep 2013-08-19 11:00 ` Ding Tianhong @ 2013-08-19 11:22 ` Johannes Berg 2013-08-19 18:52 ` Hugh Dickins 0 siblings, 1 reply; 10+ messages in thread From: Johannes Berg @ 2013-08-19 11:22 UTC (permalink / raw) To: Ding Tianhong Cc: Hugh Dickins, Linus Torvalds, Greg KH, David S. Miller, Otcheretianski, Andrei, linux-kernel@vger.kernel.org, netdev@vger.kernel.org, stable@vger.kernel.org, Pravin B Shelar, Thomas Graf On Mon, 2013-08-19 at 19:00 +0800, Ding Tianhong wrote: > On 2013/8/19 16:00, Johannes Berg wrote: > > > >> 3.11-rc6's commit 58ad436fcf49 ("genetlink: fix family dump race") > >> gives me the lockdep trace below at startup. > > > > Hmm. Yes, I see now how this happens, not sure why I didn't run into it. > > > > The problem is that genl_family_rcv_msg() is called with the genl_lock > > held, and then calls netlink_dump_start() with it held, creating a > > genl_lock->cb_mutex dependency, but obviously the dump continuation is > > the other way around. > > > > We could use the semaphore instead, I believe, but I don't really > > understand the mutex vs. semaphore well enough to be sure that's > > correct. > > > > johannes > > > it is useless, the logic need to modify or otherwise it will still call lockdep trace. I don't believe so, the semaphore and cb_mutex don't have a dependency yet, afaict. > maybe i could send a patch for it, if you wish. What do you mean? johannes ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: 3.11-rc6 genetlink locking fix offends lockdep 2013-08-19 11:22 ` Johannes Berg @ 2013-08-19 18:52 ` Hugh Dickins 2013-08-20 8:28 ` Johannes Berg 0 siblings, 1 reply; 10+ messages in thread From: Hugh Dickins @ 2013-08-19 18:52 UTC (permalink / raw) To: Johannes Berg Cc: Ding Tianhong, Linus Torvalds, Greg KH, David S. Miller, Otcheretianski, Andrei, linux-kernel@vger.kernel.org, netdev@vger.kernel.org, stable@vger.kernel.org, Pravin B Shelar, Thomas Graf On Mon, 19 Aug 2013, Johannes Berg wrote: > On Mon, 2013-08-19 at 19:00 +0800, Ding Tianhong wrote: > > On 2013/8/19 16:00, Johannes Berg wrote: > > > > > >> 3.11-rc6's commit 58ad436fcf49 ("genetlink: fix family dump race") > > >> gives me the lockdep trace below at startup. > > > > > > Hmm. Yes, I see now how this happens, not sure why I didn't run into it. > > > > > > The problem is that genl_family_rcv_msg() is called with the genl_lock > > > held, and then calls netlink_dump_start() with it held, creating a > > > genl_lock->cb_mutex dependency, but obviously the dump continuation is > > > the other way around. > > > > > > We could use the semaphore instead, I believe, but I don't really > > > understand the mutex vs. semaphore well enough to be sure that's > > > correct. > > > > > > johannes > > > > > it is useless, the logic need to modify or otherwise it will still call lockdep trace. > > I don't believe so, the semaphore and cb_mutex don't have a dependency > yet, afaict. The down_read(&cb_lock) patch you suggested gives the lockdep trace below. > > > maybe i could send a patch for it, if you wish. > > What do you mean? > > johannes [ 4.027797] e1000e 0000:00:19.0: irq 43 for MSI/MSI-X [ 4.129749] e1000e 0000:00:19.0: irq 43 for MSI/MSI-X [ 4.130179] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready [ 4.134629] [ 4.134646] ====================================================== [ 4.134680] [ INFO: possible circular locking dependency detected ] [ 4.134714] 3.11.0-rc6 #3 Not tainted [ 4.134735] ------------------------------------------------------- [ 4.134767] NetworkManager/357 is trying to acquire lock: [ 4.134797] (cb_lock){++++++}, at: [<ffffffff8148204a>] ctrl_dumpfamily+0x38/0x108 [ 4.134853] [ 4.134853] but task is already holding lock: [ 4.134883] (nlk->cb_mutex){+.+.+.}, at: [<ffffffff8147f148>] netlink_dump+0x1c/0x1d7 [ 4.134938] [ 4.134938] which lock already depends on the new lock. [ 4.134938] [ 4.134981] [ 4.134981] the existing dependency chain (in reverse order) is: [ 4.135020] [ 4.135020] -> #2 (nlk->cb_mutex){+.+.+.}: [ 4.135056] [<ffffffff810b34d2>] __lock_acquire+0x865/0x956 [ 4.135094] [<ffffffff810b39fc>] lock_acquire+0x57/0x6d [ 4.135129] [<ffffffff81583e52>] mutex_lock_nested+0x5e/0x345 [ 4.135167] [<ffffffff81480122>] __netlink_dump_start+0xae/0x14e [ 4.135205] [<ffffffff8148224b>] genl_rcv_msg+0xf4/0x252 [ 4.135239] [<ffffffff81481742>] netlink_rcv_skb+0x3e/0x8c [ 4.135275] [<ffffffff8148199b>] genl_rcv+0x24/0x34 [ 4.135307] [<ffffffff814811ca>] netlink_unicast+0xed/0x17a [ 4.135342] [<ffffffff814815d4>] netlink_sendmsg+0x2fb/0x345 [ 4.135378] [<ffffffff814503f7>] sock_sendmsg+0x79/0x8e [ 4.135412] [<ffffffff81450707>] ___sys_sendmsg+0x231/0x2be [ 4.135448] [<ffffffff81453228>] __sys_sendmsg+0x3d/0x5e [ 4.135483] [<ffffffff81453256>] SyS_sendmsg+0xd/0x19 [ 4.135517] [<ffffffff81587c12>] system_call_fastpath+0x16/0x1b [ 4.135555] [ 4.135555] -> #1 (genl_mutex){+.+.+.}: [ 4.135589] [<ffffffff810b34d2>] __lock_acquire+0x865/0x956 [ 4.135626] [<ffffffff810b39fc>] lock_acquire+0x57/0x6d [ 4.135661] [<ffffffff81583e52>] mutex_lock_nested+0x5e/0x345 [ 4.135697] [<ffffffff81482155>] genl_lock+0x12/0x14 [ 4.135730] [<ffffffff8148249d>] genl_lock_all+0x15/0x17 [ 4.135763] [<ffffffff81482b2a>] genl_register_family+0x51/0x142 [ 4.135801] [<ffffffff8148305f>] genl_register_family_with_ops+0x23/0x70 [ 4.135842] [<ffffffff8195e610>] genl_init+0x41/0x80 [ 4.135876] [<ffffffff81000267>] do_one_initcall+0x7f/0x108 [ 4.135912] [<ffffffff81930e29>] kernel_init_freeable+0x106/0x195 [ 4.135951] [<ffffffff81574621>] kernel_init+0x9/0xd1 [ 4.135985] [<ffffffff81587b6c>] ret_from_fork+0x7c/0xb0 [ 4.136019] [ 4.136019] -> #0 (cb_lock){++++++}: [ 4.136052] [<ffffffff810b1fb0>] validate_chain.isra.21+0x836/0xe8e [ 4.136091] [<ffffffff810b34d2>] __lock_acquire+0x865/0x956 [ 4.136127] [<ffffffff810b39fc>] lock_acquire+0x57/0x6d [ 4.136161] [<ffffffff81584629>] down_read+0x42/0x57 [ 4.136194] [<ffffffff8148204a>] ctrl_dumpfamily+0x38/0x108 [ 4.136230] [<ffffffff8147f1b4>] netlink_dump+0x88/0x1d7 [ 4.136264] [<ffffffff8147f4b4>] netlink_recvmsg+0x1b1/0x2d1 [ 4.137410] [<ffffffff81450328>] sock_recvmsg+0x83/0x98 [ 4.138459] [<ffffffff814500c6>] ___sys_recvmsg+0x15d/0x207 [ 4.139692] [<ffffffff814533f7>] __sys_recvmsg+0x3d/0x5e [ 4.140918] [<ffffffff81453425>] SyS_recvmsg+0xd/0x19 [ 4.141975] [<ffffffff81587c12>] system_call_fastpath+0x16/0x1b [ 4.143042] [ 4.143042] other info that might help us debug this: [ 4.143042] [ 4.146007] Chain exists of: [ 4.146007] cb_lock --> genl_mutex --> nlk->cb_mutex [ 4.146007] [ 4.148919] Possible unsafe locking scenario: [ 4.148919] [ 4.150955] CPU0 CPU1 [ 4.152060] ---- ---- [ 4.153030] lock(nlk->cb_mutex); [ 4.153950] lock(genl_mutex); [ 4.154912] lock(nlk->cb_mutex); [ 4.155832] lock(cb_lock); [ 4.156732] [ 4.156732] *** DEADLOCK *** [ 4.156732] [ 4.159477] 1 lock held by NetworkManager/357: [ 4.160354] #0: (nlk->cb_mutex){+.+.+.}, at: [<ffffffff8147f148>] netlink_dump+0x1c/0x1d7 [ 4.161411] [ 4.161411] stack backtrace: [ 4.163489] CPU: 0 PID: 357 Comm: NetworkManager Not tainted 3.11.0-rc6 #3 [ 4.164527] Hardware name: LENOVO 4174EH1/4174EH1, BIOS 8CET51WW (1.31 ) 11/29/2011 [ 4.165491] ffffffff81d0a450 ffff88022bd61938 ffffffff8157cf90 0000000000000006 [ 4.166520] ffffffff81cc85a0 ffff88022bd61988 ffffffff8157a8a8 ffff880200000001 [ 4.167486] ffff8802313ae080 ffff8802313ae080 ffff8802313ae750 ffff8802313ae080 [ 4.168532] Call Trace: [ 4.169491] [<ffffffff8157cf90>] dump_stack+0x4f/0x84 [ 4.170532] [<ffffffff8157a8a8>] print_circular_bug+0x2ad/0x2be [ 4.171507] [<ffffffff810b1fb0>] validate_chain.isra.21+0x836/0xe8e [ 4.172548] [<ffffffff810b34d2>] __lock_acquire+0x865/0x956 [ 4.173525] [<ffffffff810b39fc>] lock_acquire+0x57/0x6d [ 4.174536] [<ffffffff8148204a>] ? ctrl_dumpfamily+0x38/0x108 [ 4.175535] [<ffffffff81584629>] down_read+0x42/0x57 [ 4.176523] [<ffffffff8148204a>] ? ctrl_dumpfamily+0x38/0x108 [ 4.177564] [<ffffffff8148204a>] ctrl_dumpfamily+0x38/0x108 [ 4.178556] [<ffffffff8145ac41>] ? __alloc_skb+0x97/0x1a0 [ 4.179611] [<ffffffff8147f1b4>] netlink_dump+0x88/0x1d7 [ 4.180594] [<ffffffff8147f4b4>] netlink_recvmsg+0x1b1/0x2d1 [ 4.181645] [<ffffffff81450328>] sock_recvmsg+0x83/0x98 [ 4.182628] [<ffffffff810f86fa>] ? might_fault+0x52/0xa2 [ 4.183683] [<ffffffff814500c6>] ___sys_recvmsg+0x15d/0x207 [ 4.184668] [<ffffffff810b34d2>] ? __lock_acquire+0x865/0x956 [ 4.185731] [<ffffffff81148b2b>] ? fget_light+0x35c/0x377 [ 4.186707] [<ffffffff81148933>] ? fget_light+0x164/0x377 [ 4.187744] [<ffffffff814533f7>] __sys_recvmsg+0x3d/0x5e [ 4.188709] [<ffffffff8145471a>] ? sock_def_write_space+0x1b5/0x1b5 [ 4.189755] [<ffffffff81453425>] SyS_recvmsg+0xd/0x19 [ 4.190729] [<ffffffff81587c12>] system_call_fastpath+0x16/0x1b [ 4.192674] iwlwifi 0000:03:00.0: L1 Enabled; Disabling L0S [ 4.192887] iwlwifi 0000:03:00.0: Radio type=0x0-0x3-0x1 ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: 3.11-rc6 genetlink locking fix offends lockdep 2013-08-19 18:52 ` Hugh Dickins @ 2013-08-20 8:28 ` Johannes Berg 2013-08-20 9:04 ` Borislav Petkov 2013-08-20 19:02 ` Johannes Berg 0 siblings, 2 replies; 10+ messages in thread From: Johannes Berg @ 2013-08-20 8:28 UTC (permalink / raw) To: Hugh Dickins Cc: Ding Tianhong, Linus Torvalds, Greg KH, David S. Miller, Otcheretianski, Andrei, linux-kernel@vger.kernel.org, netdev@vger.kernel.org, stable@vger.kernel.org, Pravin B Shelar, Thomas Graf On Mon, 2013-08-19 at 11:52 -0700, Hugh Dickins wrote: > > > > We could use the semaphore instead, I believe, but I don't really > > > > understand the mutex vs. semaphore well enough to be sure that's > > > > correct. > > I don't believe so, the semaphore and cb_mutex don't have a dependency > > yet, afaict. > > The down_read(&cb_lock) patch you suggested gives the lockdep trace below. Clearly I was wrong then, not sure what I missed in the code though. >From the lockdep trace it looks like the dependency comes via the genl_lock, I didn't consider that. David, can you please just revert it for now and tag the revert for stable as well, in case Greg already took it somewhere? I think that some stable trees may need a different fix anyway since they don't have parallel_ops. We never saw a problem due to this, and though it's probably fairly easy to trigger by hand-rolling the dump loop in userspace, one does need to be able to rmmod to crash the kernel with it. The only way to fix this that I see right now (that doesn't rewrite the locking completely) would be to make genetlink use parallel_ops itself, thereby removing the genl_lock() in genl_rcv_msg() and breaking all those lock chains that lockdep reported. After that, it should be safe to use genl_lock() inside all the operations. Something like the patch below, perhaps? Completely untested so far. johannes diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c index f85f8a2..dea9586 100644 --- a/net/netlink/genetlink.c +++ b/net/netlink/genetlink.c @@ -665,6 +665,7 @@ static struct genl_family genl_ctrl = { .version = 0x2, .maxattr = CTRL_ATTR_MAX, .netnsok = true, + .parallel_ops = true, }; static int ctrl_fill_info(struct genl_family *family, u32 portid, u32 seq, @@ -789,10 +790,8 @@ static int ctrl_dumpfamily(struct sk_buff *skb, struct netlink_callback *cb) struct net *net = sock_net(skb->sk); int chains_to_skip = cb->args[0]; int fams_to_skip = cb->args[1]; - bool need_locking = chains_to_skip || fams_to_skip; - if (need_locking) - genl_lock(); + genl_lock(); for (i = chains_to_skip; i < GENL_FAM_TAB_SIZE; i++) { n = 0; @@ -814,8 +813,7 @@ errout: cb->args[0] = i; cb->args[1] = n; - if (need_locking) - genl_unlock(); + genl_unlock(); return skb->len; } @@ -870,6 +868,8 @@ static int ctrl_getfamily(struct sk_buff *skb, struct genl_info *info) struct genl_family *res = NULL; int err = -EINVAL; + genl_lock(); + if (info->attrs[CTRL_ATTR_FAMILY_ID]) { u16 id = nla_get_u16(info->attrs[CTRL_ATTR_FAMILY_ID]); res = genl_family_find_byid(id); @@ -896,19 +896,25 @@ static int ctrl_getfamily(struct sk_buff *skb, struct genl_info *info) } if (res == NULL) - return err; + goto out_unlock; if (!res->netnsok && !net_eq(genl_info_net(info), &init_net)) { /* family doesn't exist here */ - return -ENOENT; + err = -ENOENT; + goto out_unlock; } msg = ctrl_build_family_msg(res, info->snd_portid, info->snd_seq, CTRL_CMD_NEWFAMILY); - if (IS_ERR(msg)) - return PTR_ERR(msg); + if (IS_ERR(msg)) { + err = PTR_ERR(msg); + goto out_unlock; + } - return genlmsg_reply(msg, info); + err = genlmsg_reply(msg, info); +out_unlock: + genl_unlock(); + return err; } static int genl_ctrl_event(int event, void *data) ^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: 3.11-rc6 genetlink locking fix offends lockdep 2013-08-20 8:28 ` Johannes Berg @ 2013-08-20 9:04 ` Borislav Petkov 2013-08-20 15:49 ` Hugh Dickins 2013-08-20 19:02 ` Johannes Berg 1 sibling, 1 reply; 10+ messages in thread From: Borislav Petkov @ 2013-08-20 9:04 UTC (permalink / raw) To: Johannes Berg Cc: Hugh Dickins, Ding Tianhong, Linus Torvalds, Greg KH, David S. Miller, Otcheretianski, Andrei, linux-kernel@vger.kernel.org, netdev@vger.kernel.org, stable@vger.kernel.org, Pravin B Shelar, Thomas Graf On Tue, Aug 20, 2013 at 10:28:58AM +0200, Johannes Berg wrote: > Something like the patch below, perhaps? Completely untested so far. Yeah, this one seems to fix it here (I was seeing the same lockdep splat as Hugh). Thanks. -- Regards/Gruss, Boris. Sent from a fat crate under my desk. Formatting is fine. -- ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: 3.11-rc6 genetlink locking fix offends lockdep 2013-08-20 9:04 ` Borislav Petkov @ 2013-08-20 15:49 ` Hugh Dickins 0 siblings, 0 replies; 10+ messages in thread From: Hugh Dickins @ 2013-08-20 15:49 UTC (permalink / raw) To: Johannes Berg Cc: Borislav Petkov, Ding Tianhong, Linus Torvalds, Greg KH, David S. Miller, Otcheretianski, Andrei, linux-kernel@vger.kernel.org, netdev@vger.kernel.org, stable@vger.kernel.org, Pravin B Shelar, Thomas Graf On Tue, 20 Aug 2013, Borislav Petkov wrote: > On Tue, Aug 20, 2013 at 10:28:58AM +0200, Johannes Berg wrote: > > Something like the patch below, perhaps? Completely untested so far. > > Yeah, this one seems to fix it here (I was seeing the same lockdep splat > as Hugh). Not so good for me: it fixed the original splat, but a few seconds later: [ 4.073542] e1000e 0000:00:19.0: irq 43 for MSI/MSI-X [ 4.175849] e1000e 0000:00:19.0: irq 43 for MSI/MSI-X [ 4.176223] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready [ 4.182322] iwlwifi 0000:03:00.0: L1 Enabled; Disabling L0S [ 4.182537] iwlwifi 0000:03:00.0: Radio type=0x0-0x3-0x1 [ 4.405766] iwlwifi 0000:03:00.0: L1 Enabled; Disabling L0S [ 4.405973] iwlwifi 0000:03:00.0: Radio type=0x0-0x3-0x1 [ 4.504441] IPv6: ADDRCONF(NETDEV_UP): wlan1: link is not ready [ 6.204569] input: PS/2 Generic Mouse as /devices/platform/i8042/serio1/serio2/input/input8 [ 7.662343] [ 7.662361] ====================================================== [ 7.662393] [ INFO: possible circular locking dependency detected ] [ 7.662426] 3.11.0-rc6 #5 Not tainted [ 7.662462] ------------------------------------------------------- [ 7.662500] wpa_supplicant/418 is trying to acquire lock: [ 7.662533] (nlk->cb_mutex){+.+.+.}, at: [<ffffffff81480122>] __netlink_dump_start+0xae/0x14e [ 7.662603] [ 7.662603] but task is already holding lock: [ 7.662638] (genl_mutex){+.+.+.}, at: [<ffffffff8148204d>] genl_lock+0x12/0x14 [ 7.662695] [ 7.662695] which lock already depends on the new lock. [ 7.662695] [ 7.662743] [ 7.662743] the existing dependency chain (in reverse order) is: [ 7.662788] [ 7.662788] -> #1 (genl_mutex){+.+.+.}: [ 7.662827] [<ffffffff810b34d2>] __lock_acquire+0x865/0x956 [ 7.662870] [<ffffffff810b39fc>] lock_acquire+0x57/0x6d [ 7.662909] [<ffffffff81583e52>] mutex_lock_nested+0x5e/0x345 [ 7.662951] [<ffffffff8148204d>] genl_lock+0x12/0x14 [ 7.662989] [<ffffffff814822cc>] ctrl_dumpfamily+0x2b/0xea [ 7.663029] [<ffffffff8147f1b4>] netlink_dump+0x88/0x1d7 [ 7.663069] [<ffffffff81480187>] __netlink_dump_start+0x113/0x14e [ 7.663113] [<ffffffff81482143>] genl_rcv_msg+0xf4/0x252 [ 7.663152] [<ffffffff81481742>] netlink_rcv_skb+0x3e/0x8c [ 7.663192] [<ffffffff8148199b>] genl_rcv+0x24/0x34 [ 7.663228] [<ffffffff814811ca>] netlink_unicast+0xed/0x17a [ 7.663270] [<ffffffff814815d4>] netlink_sendmsg+0x2fb/0x345 [ 7.663311] [<ffffffff814503f7>] sock_sendmsg+0x79/0x8e [ 7.663351] [<ffffffff81450707>] ___sys_sendmsg+0x231/0x2be [ 7.663391] [<ffffffff81453228>] __sys_sendmsg+0x3d/0x5e [ 7.663431] [<ffffffff81453256>] SyS_sendmsg+0xd/0x19 [ 7.663469] [<ffffffff81587c12>] system_call_fastpath+0x16/0x1b [ 7.663513] [ 7.663513] -> #0 (nlk->cb_mutex){+.+.+.}: [ 7.663554] [<ffffffff810b1fb0>] validate_chain.isra.21+0x836/0xe8e [ 7.663599] [<ffffffff810b34d2>] __lock_acquire+0x865/0x956 [ 7.663640] [<ffffffff810b39fc>] lock_acquire+0x57/0x6d [ 7.663679] [<ffffffff81583e52>] mutex_lock_nested+0x5e/0x345 [ 7.663722] [<ffffffff81480122>] __netlink_dump_start+0xae/0x14e [ 7.663765] [<ffffffff81482143>] genl_rcv_msg+0xf4/0x252 [ 7.663804] [<ffffffff81481742>] netlink_rcv_skb+0x3e/0x8c [ 7.663844] [<ffffffff8148199b>] genl_rcv+0x24/0x34 [ 7.663881] [<ffffffff814811ca>] netlink_unicast+0xed/0x17a [ 7.663921] [<ffffffff814815d4>] netlink_sendmsg+0x2fb/0x345 [ 7.663961] [<ffffffff814503f7>] sock_sendmsg+0x79/0x8e [ 7.664000] [<ffffffff81450707>] ___sys_sendmsg+0x231/0x2be [ 7.664041] [<ffffffff81453228>] __sys_sendmsg+0x3d/0x5e [ 7.664081] [<ffffffff81453256>] SyS_sendmsg+0xd/0x19 [ 7.664119] [<ffffffff81587c12>] system_call_fastpath+0x16/0x1b [ 7.664161] [ 7.664161] other info that might help us debug this: [ 7.664161] [ 7.666543] Possible unsafe locking scenario: [ 7.666543] [ 7.668877] CPU0 CPU1 [ 7.670032] ---- ---- [ 7.671165] lock(genl_mutex); [ 7.672298] lock(nlk->cb_mutex); [ 7.673424] lock(genl_mutex); [ 7.674532] lock(nlk->cb_mutex); [ 7.675607] [ 7.675607] *** DEADLOCK *** [ 7.675607] [ 7.678696] 2 locks held by wpa_supplicant/418: [ 7.679704] #0: (cb_lock){++++++}, at: [<ffffffff8148198c>] genl_rcv+0x15/0x34 [ 7.680772] #1: (genl_mutex){+.+.+.}, at: [<ffffffff8148204d>] genl_lock+0x12/0x14 [ 7.681840] [ 7.681840] stack backtrace: [ 7.683933] CPU: 0 PID: 418 Comm: wpa_supplicant Not tainted 3.11.0-rc6 #5 [ 7.685022] Hardware name: LENOVO 4174EH1/4174EH1, BIOS 8CET51WW (1.31 ) 11/29/2011 [ 7.686118] ffffffff81cc8750 ffff88022b8117b8 ffffffff8157cf90 0000000000000006 [ 7.687230] ffffffff81d0a450 ffff88022b811808 ffffffff8157a8a8 0000000000000001 [ 7.688347] ffff880230d0a080 ffff880230d0a080 ffff880230d0a778 ffff880230d0a080 [ 7.689461] Call Trace: [ 7.690551] [<ffffffff8157cf90>] dump_stack+0x4f/0x84 [ 7.691657] [<ffffffff8157a8a8>] print_circular_bug+0x2ad/0x2be [ 7.692775] [<ffffffff810b1fb0>] validate_chain.isra.21+0x836/0xe8e [ 7.693893] [<ffffffff810b34d2>] __lock_acquire+0x865/0x956 [ 7.695012] [<ffffffff81480122>] ? __netlink_dump_start+0xae/0x14e [ 7.696137] [<ffffffff810b39fc>] lock_acquire+0x57/0x6d [ 7.697275] [<ffffffff81480122>] ? __netlink_dump_start+0xae/0x14e [ 7.698404] [<ffffffff81583e52>] mutex_lock_nested+0x5e/0x345 [ 7.699541] [<ffffffff81480122>] ? __netlink_dump_start+0xae/0x14e [ 7.700696] [<ffffffff81586c32>] ? _raw_read_unlock+0x2d/0x4a [ 7.701840] [<ffffffff81480122>] __netlink_dump_start+0xae/0x14e [ 7.702991] [<ffffffff81482143>] genl_rcv_msg+0xf4/0x252 [ 7.704143] [<ffffffff81536380>] ? nl80211_dump_mpath+0x10d/0x10d [ 7.705317] [<ffffffff8148204f>] ? genl_lock+0x14/0x14 [ 7.706464] [<ffffffff81481742>] netlink_rcv_skb+0x3e/0x8c [ 7.707614] [<ffffffff8148199b>] genl_rcv+0x24/0x34 [ 7.708756] [<ffffffff814811ca>] netlink_unicast+0xed/0x17a [ 7.709890] [<ffffffff814815d4>] netlink_sendmsg+0x2fb/0x345 [ 7.711018] [<ffffffff814503f7>] sock_sendmsg+0x79/0x8e [ 7.712144] [<ffffffff810f86fa>] ? might_fault+0x52/0xa2 [ 7.713280] [<ffffffff81450707>] ___sys_sendmsg+0x231/0x2be [ 7.714400] [<ffffffff810fd37b>] ? handle_mm_fault+0x47d/0x49d [ 7.715525] [<ffffffff81087aa5>] ? up_read+0x1b/0x32 [ 7.716663] [<ffffffff81053c34>] ? __do_page_fault+0x370/0x414 [ 7.717789] [<ffffffff811488e4>] ? fget_light+0x115/0x377 [ 7.718922] [<ffffffff810f86fa>] ? might_fault+0x52/0xa2 [ 7.720052] [<ffffffff81453228>] __sys_sendmsg+0x3d/0x5e [ 7.721192] [<ffffffff81453256>] SyS_sendmsg+0xd/0x19 [ 7.722309] [<ffffffff81587c12>] system_call_fastpath+0x16/0x1b [ 11.044520] wlan1: authenticate with c0:3f:0e:ad:ff:ee [ 11.096583] wlan1: send auth to c0:3f:0e:ad:ff:ee (try 1/3) [ 11.099448] wlan1: authenticated [ 11.100813] wlan1: associate with c0:3f:0e:ad:ff:ee (try 1/3) [ 11.105771] wlan1: RX AssocResp from c0:3f:0e:ad:ff:ee (capab=0x411 status=0 aid=6) [ 11.114801] IPv6: ADDRCONF(NETDEV_CHANGE): wlan1: link becomes ready [ 11.115884] wlan1: associated ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: 3.11-rc6 genetlink locking fix offends lockdep 2013-08-20 8:28 ` Johannes Berg 2013-08-20 9:04 ` Borislav Petkov @ 2013-08-20 19:02 ` Johannes Berg 2013-08-20 19:10 ` Johannes Berg 1 sibling, 1 reply; 10+ messages in thread From: Johannes Berg @ 2013-08-20 19:02 UTC (permalink / raw) To: Hugh Dickins Cc: Ding Tianhong, Linus Torvalds, Greg KH, David S. Miller, Otcheretianski, Andrei, linux-kernel@vger.kernel.org, netdev@vger.kernel.org, stable@vger.kernel.org, Pravin B Shelar, Thomas Graf On Tue, 2013-08-20 at 10:28 +0200, Johannes Berg wrote: > The only way to fix this that I see right now (that doesn't rewrite the > locking completely) would be to make genetlink use parallel_ops itself, > thereby removing the genl_lock() in genl_rcv_msg() and breaking all > those lock chains that lockdep reported. After that, it should be safe > to use genl_lock() inside all the operations. Something like the patch > below, perhaps? Completely untested so far. Tested now, and it still causes lockdep to complain, though that's a lockdep issue I believe, it thinks that genl_mutex and nlk->cb_mutex can be inverted although nlk->cb_mutex exists per family, so we need to annotate lockdep there. johannes ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: 3.11-rc6 genetlink locking fix offends lockdep 2013-08-20 19:02 ` Johannes Berg @ 2013-08-20 19:10 ` Johannes Berg 0 siblings, 0 replies; 10+ messages in thread From: Johannes Berg @ 2013-08-20 19:10 UTC (permalink / raw) To: Hugh Dickins Cc: Ding Tianhong, Linus Torvalds, Greg KH, David S. Miller, Otcheretianski, Andrei, linux-kernel@vger.kernel.org, netdev@vger.kernel.org, stable@vger.kernel.org, Pravin B Shelar, Thomas Graf On Tue, 2013-08-20 at 21:02 +0200, Johannes Berg wrote: > On Tue, 2013-08-20 at 10:28 +0200, Johannes Berg wrote: > > > The only way to fix this that I see right now (that doesn't rewrite the > > locking completely) would be to make genetlink use parallel_ops itself, > > thereby removing the genl_lock() in genl_rcv_msg() and breaking all > > those lock chains that lockdep reported. After that, it should be safe > > to use genl_lock() inside all the operations. Something like the patch > > below, perhaps? Completely untested so far. > > Tested now, and it still causes lockdep to complain, though that's a > lockdep issue I believe, it thinks that genl_mutex and nlk->cb_mutex can > be inverted although nlk->cb_mutex exists per family, so we need to > annotate lockdep there. No, lockdep is correct - generic netlink uses the same cb_mutex for all families, obviously, since it's all the same netlink family. I'll just convert it to RCU. johannes ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2013-08-20 19:10 UTC | newest] Thread overview: 10+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2013-08-19 5:04 3.11-rc6 genetlink locking fix offends lockdep Hugh Dickins 2013-08-19 8:00 ` Johannes Berg 2013-08-19 11:00 ` Ding Tianhong 2013-08-19 11:22 ` Johannes Berg 2013-08-19 18:52 ` Hugh Dickins 2013-08-20 8:28 ` Johannes Berg 2013-08-20 9:04 ` Borislav Petkov 2013-08-20 15:49 ` Hugh Dickins 2013-08-20 19:02 ` Johannes Berg 2013-08-20 19:10 ` Johannes Berg
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).