* netlink circular locking dependency
@ 2008-06-14 12:35 Marcel Holtmann
2008-06-16 21:34 ` Jarek Poplawski
0 siblings, 1 reply; 19+ messages in thread
From: Marcel Holtmann @ 2008-06-14 12:35 UTC (permalink / raw)
To: netdev; +Cc: Ingo Molnar
Hi guys,
so I have been working on integrating netlink support into the Bluetooth
stack. However this is running on a 2.6.26-rc2 kernel on a Quad G5 with
no netlink support for Bluetooth in the kernel, but with the userspace
part within hcid. That of course will fail to load, but that is fine and
expected if you run the new daemon on old kernels. I can upgrade that
machine to the latest vanilla kernel if needed, but just wanna quickly
check if anybody has seen this before:
NET: Registered protocol family 17
Bluetooth: L2CAP ver 2.9
Bluetooth: L2CAP socket layer initialized
Bluetooth: RFCOMM socket layer initialized
Bluetooth: RFCOMM TTY layer initialized
Bluetooth: RFCOMM ver 1.8
hci_cmd_task: hci0 command tx timeout
hci_cmd_task: hci0 command tx timeout
hci_cmd_task: hci0 command tx timeout
=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.26-rc2 #5
-------------------------------------------------------
hcid/4136 is trying to acquire lock:
(genl_mutex){--..}, at: [<c0000000002ace4c>] .ctrl_dumpfamily+0x74/0x174
but task is already holding lock:
(nlk->cb_mutex){--..}, at: [<c0000000002a766c>] .netlink_dump+0x58/0x27c
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (nlk->cb_mutex){--..}:
[<c000000000077758>] .__lock_acquire+0xd14/0xf88
[<c000000000077a9c>] .lock_acquire+0xd0/0x11c
[<c000000000314068>] .mutex_lock_nested+0x158/0x3f4
[<c0000000002aa9fc>] .netlink_dump_start+0x150/0x228
[<c0000000002accc4>] .genl_rcv_msg+0xe4/0x1f8
[<c0000000002a902c>] .netlink_rcv_skb+0x80/0x118
[<c0000000002ad0f0>] .genl_rcv+0x4c/0x7c
[<c0000000002a8ca4>] .netlink_unicast+0x2f8/0x3d8
[<c0000000002a9920>] .netlink_sendmsg+0x2dc/0x320
[<c00000000027e9b0>] .sock_sendmsg+0xf0/0x134
[<c00000000027ec10>] .sys_sendmsg+0x21c/0x2b4
[<c0000000002a2f68>] .compat_sys_sendmsg+0x20/0x38
[<c0000000002a4200>] .compat_sys_socketcall+0x1ec/0x220
[<c0000000000076b0>] syscall_exit+0x0/0x40
-> #0 (genl_mutex){--..}:
[<c000000000077650>] .__lock_acquire+0xc0c/0xf88
[<c000000000077a9c>] .lock_acquire+0xd0/0x11c
[<c000000000314068>] .mutex_lock_nested+0x158/0x3f4
[<c0000000002ace4c>] .ctrl_dumpfamily+0x74/0x174
[<c0000000002a76a0>] .netlink_dump+0x8c/0x27c
[<c0000000002a9b58>] .netlink_recvmsg+0x1f4/0x378
[<c00000000027e878>] .sock_recvmsg+0x100/0x148
[<c00000000027fedc>] .sys_recvmsg+0x180/0x31c
[<c0000000002a2f30>] .compat_sys_recvmsg+0x20/0x38
[<c0000000002a420c>] .compat_sys_socketcall+0x1f8/0x220
[<c0000000000076b0>] syscall_exit+0x0/0x40
other info that might help us debug this:
1 lock held by hcid/4136:
#0: (nlk->cb_mutex){--..}, at: [<c0000000002a766c>] .netlink_dump+0x58/0x27c
stack backtrace:
Call Trace:
[c00000003bf37190] [c00000000000f7a0] .show_stack+0x84/0x1b8 (unreliable)
[c00000003bf37250] [c00000000000f8f4] .dump_stack+0x20/0x34
[c00000003bf372d0] [c000000000074b54] .print_circular_bug_tail+0x88/0xac
[c00000003bf373a0] [c000000000077650] .__lock_acquire+0xc0c/0xf88
[c00000003bf37490] [c000000000077a9c] .lock_acquire+0xd0/0x11c
[c00000003bf37550] [c000000000314068] .mutex_lock_nested+0x158/0x3f4
[c00000003bf37640] [c0000000002ace4c] .ctrl_dumpfamily+0x74/0x174
[c00000003bf37700] [c0000000002a76a0] .netlink_dump+0x8c/0x27c
[c00000003bf377c0] [c0000000002a9b58] .netlink_recvmsg+0x1f4/0x378
[c00000003bf378d0] [c00000000027e878] .sock_recvmsg+0x100/0x148
[c00000003bf37af0] [c00000000027fedc] .sys_recvmsg+0x180/0x31c
[c00000003bf37d00] [c0000000002a2f30] .compat_sys_recvmsg+0x20/0x38
[c00000003bf37d80] [c0000000002a420c] .compat_sys_socketcall+0x1f8/0x220
[c00000003bf37e30] [c0000000000076b0] syscall_exit+0x0/0x40
Bluetooth: BNEP (Ethernet Emulation) ver 1.2
Bluetooth: BNEP filters: protocol multicast
The system is running Ubuntu Hardy and using stock libnl for generic
netlink access.
Regards
Marcel
^ permalink raw reply [flat|nested] 19+ messages in thread* Re: netlink circular locking dependency 2008-06-14 12:35 netlink circular locking dependency Marcel Holtmann @ 2008-06-16 21:34 ` Jarek Poplawski 2008-06-16 21:48 ` Patrick McHardy 0 siblings, 1 reply; 19+ messages in thread From: Jarek Poplawski @ 2008-06-16 21:34 UTC (permalink / raw) To: Marcel Holtmann; +Cc: netdev, Ingo Molnar, Thomas Graf Marcel Holtmann wrote, On 06/14/2008 02:35 PM: ... > ======================================================= > [ INFO: possible circular locking dependency detected ] > 2.6.26-rc2 #5 > ------------------------------------------------------- > hcid/4136 is trying to acquire lock: > (genl_mutex){--..}, at: [<c0000000002ace4c>] .ctrl_dumpfamily+0x74/0x174 > > but task is already holding lock: > (nlk->cb_mutex){--..}, at: [<c0000000002a766c>] .netlink_dump+0x58/0x27c > > which lock already depends on the new lock. ... Hi, IMHO it looks like a real lockup threat. Probably it needs something better, but for now here is my simplistic patch proposal for testing. Thanks, Jarek P. --- net/netlink/genetlink.c | 10 ++++++++-- 1 files changed, 8 insertions(+), 2 deletions(-) diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c index f5aa23c..bf939ba 100644 --- a/net/netlink/genetlink.c +++ b/net/netlink/genetlink.c @@ -32,6 +32,11 @@ static inline void genl_unlock(void) mutex_unlock(&genl_mutex); } +static inline int genl_trylock(void) +{ + return mutex_trylock(&genl_mutex); +} + #define GENL_FAM_TAB_SIZE 16 #define GENL_FAM_TAB_MASK (GENL_FAM_TAB_SIZE - 1) @@ -603,8 +608,9 @@ static int ctrl_dumpfamily(struct sk_buff *skb, struct netlink_callback *cb) int chains_to_skip = cb->args[0]; int fams_to_skip = cb->args[1]; - if (chains_to_skip != 0) - genl_lock(); + /* genl_lock vs. cb_mutex inversion threatens here */ + if (chains_to_skip != 0 && !genl_trylock()) + return skb->len; for (i = 0; i < GENL_FAM_TAB_SIZE; i++) { if (i < chains_to_skip) ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: netlink circular locking dependency 2008-06-16 21:34 ` Jarek Poplawski @ 2008-06-16 21:48 ` Patrick McHardy 2008-06-17 1:45 ` Marcel Holtmann 2008-06-17 8:49 ` Jarek Poplawski 0 siblings, 2 replies; 19+ messages in thread From: Patrick McHardy @ 2008-06-16 21:48 UTC (permalink / raw) To: Jarek Poplawski; +Cc: Marcel Holtmann, netdev, Ingo Molnar, Thomas Graf [-- Attachment #1: Type: text/plain, Size: 1713 bytes --] Jarek Poplawski wrote: > Marcel Holtmann wrote, On 06/14/2008 02:35 PM: > ... > >> ======================================================= >> [ INFO: possible circular locking dependency detected ] >> 2.6.26-rc2 #5 >> ------------------------------------------------------- >> hcid/4136 is trying to acquire lock: >> (genl_mutex){--..}, at: [<c0000000002ace4c>] .ctrl_dumpfamily+0x74/0x174 >> >> but task is already holding lock: >> (nlk->cb_mutex){--..}, at: [<c0000000002a766c>] .netlink_dump+0x58/0x27c >> >> which lock already depends on the new lock. >> > ... > > Hi, > > IMHO it looks like a real lockup threat. Probably it needs something > better, but for now here is my simplistic patch proposal for testing. > So we have: genl_rcv() : take genl_mutex genl_rcv_msg() : call netlink_dump_start() while holding genl_mutex netlink_dump_start(), netlink_dump() : take nlk->cb_mutex ctrl_dumpfamily() : try to detect this case and not take genl_mutex a second time netlink_rcv() : call netlink_dump netlink_dump : take nlk->cb_mutex ctrl_dumpfamily() : take genl_mutex which is a real bug. It seems the best fix is to use genl_mutex for the netlink cb_mutex, drop genl_mutex before calling netlink_dump_start and don't take it in ctrl_dumpfamily, relying completely on af_netlink.c for dump locking. Unfortunately this creates a race since the ops passed to netlink_dump_start are also protect by the mutex, so this patch is just for testing whether it fixes the warning. On second though - that race seems to be present already since the ops can be unregistered and the module unloaded while a dump is in progress. [-- Attachment #2: x --] [-- Type: text/plain, Size: 1378 bytes --] diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c index f5aa23c..3e1191c 100644 --- a/net/netlink/genetlink.c +++ b/net/netlink/genetlink.c @@ -444,8 +444,11 @@ static int genl_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh) if (ops->dumpit == NULL) return -EOPNOTSUPP; - return netlink_dump_start(genl_sock, skb, nlh, - ops->dumpit, ops->done); + genl_unlock(); + err = netlink_dump_start(genl_sock, skb, nlh, + ops->dumpit, ops->done); + genl_lock(); + return err; } if (ops->doit == NULL) @@ -603,9 +606,6 @@ static int ctrl_dumpfamily(struct sk_buff *skb, struct netlink_callback *cb) int chains_to_skip = cb->args[0]; int fams_to_skip = cb->args[1]; - if (chains_to_skip != 0) - genl_lock(); - for (i = 0; i < GENL_FAM_TAB_SIZE; i++) { if (i < chains_to_skip) continue; @@ -623,9 +623,6 @@ static int ctrl_dumpfamily(struct sk_buff *skb, struct netlink_callback *cb) } errout: - if (chains_to_skip != 0) - genl_unlock(); - cb->args[0] = i; cb->args[1] = n; @@ -770,7 +767,7 @@ static int __init genl_init(void) /* we'll bump the group number right afterwards */ genl_sock = netlink_kernel_create(&init_net, NETLINK_GENERIC, 0, - genl_rcv, NULL, THIS_MODULE); + genl_rcv, &genl_mutex, THIS_MODULE); if (genl_sock == NULL) panic("GENL: Cannot initialize generic netlink\n"); ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: netlink circular locking dependency 2008-06-16 21:48 ` Patrick McHardy @ 2008-06-17 1:45 ` Marcel Holtmann 2008-06-17 12:50 ` Patrick McHardy 2008-06-17 8:49 ` Jarek Poplawski 1 sibling, 1 reply; 19+ messages in thread From: Marcel Holtmann @ 2008-06-17 1:45 UTC (permalink / raw) To: Patrick McHardy; +Cc: Jarek Poplawski, netdev, Ingo Molnar, Thomas Graf Hi Patrick, > >> ======================================================= > >> [ INFO: possible circular locking dependency detected ] > >> 2.6.26-rc2 #5 > >> ------------------------------------------------------- > >> hcid/4136 is trying to acquire lock: > >> (genl_mutex){--..}, at: [<c0000000002ace4c>] .ctrl_dumpfamily+0x74/0x174 > >> > >> but task is already holding lock: > >> (nlk->cb_mutex){--..}, at: [<c0000000002a766c>] .netlink_dump+0x58/0x27c > >> > >> which lock already depends on the new lock. > >> > > ... > > > > Hi, > > > > IMHO it looks like a real lockup threat. Probably it needs something > > better, but for now here is my simplistic patch proposal for testing. > > > So we have: > > genl_rcv() : take genl_mutex > genl_rcv_msg() : call netlink_dump_start() while holding genl_mutex > netlink_dump_start(), > netlink_dump() : take nlk->cb_mutex > ctrl_dumpfamily() : try to detect this case and not take genl_mutex a > second time > > netlink_rcv() : call netlink_dump > netlink_dump : take nlk->cb_mutex > ctrl_dumpfamily() : take genl_mutex > > which is a real bug. > > It seems the best fix is to use genl_mutex for the netlink cb_mutex, > drop genl_mutex before calling netlink_dump_start and don't take it > in ctrl_dumpfamily, relying completely on af_netlink.c for dump > locking. Unfortunately this creates a race since the ops passed to > netlink_dump_start are also protect by the mutex, so this patch > is just for testing whether it fixes the warning. I updated my test kernel to 2.6.26-rc6 and then applied your patch and the lockdep warning goes away. Regards Marcel ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: netlink circular locking dependency 2008-06-17 1:45 ` Marcel Holtmann @ 2008-06-17 12:50 ` Patrick McHardy 2008-06-17 13:09 ` Jarek Poplawski 0 siblings, 1 reply; 19+ messages in thread From: Patrick McHardy @ 2008-06-17 12:50 UTC (permalink / raw) To: Marcel Holtmann; +Cc: Jarek Poplawski, netdev, Ingo Molnar, Thomas Graf Marcel Holtmann wrote: > Hi Patrick, > >> So we have: >> >> genl_rcv() : take genl_mutex >> genl_rcv_msg() : call netlink_dump_start() while holding genl_mutex >> netlink_dump_start(), >> netlink_dump() : take nlk->cb_mutex >> ctrl_dumpfamily() : try to detect this case and not take genl_mutex a >> second time >> >> netlink_rcv() : call netlink_dump >> netlink_dump : take nlk->cb_mutex >> ctrl_dumpfamily() : take genl_mutex >> >> which is a real bug. >> >> It seems the best fix is to use genl_mutex for the netlink cb_mutex, >> drop genl_mutex before calling netlink_dump_start and don't take it >> in ctrl_dumpfamily, relying completely on af_netlink.c for dump >> locking. Unfortunately this creates a race since the ops passed to >> netlink_dump_start are also protect by the mutex, so this patch >> is just for testing whether it fixes the warning. > > I updated my test kernel to 2.6.26-rc6 and then applied your patch and > the lockdep warning goes away. Thanks for testing. Unfortunately the module unload races look more complicated to fix and I'm busy with other things, so it would great if someone else could fix this. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: netlink circular locking dependency 2008-06-17 12:50 ` Patrick McHardy @ 2008-06-17 13:09 ` Jarek Poplawski 2008-06-17 13:07 ` Patrick McHardy 2008-06-17 13:08 ` Thomas Graf 0 siblings, 2 replies; 19+ messages in thread From: Jarek Poplawski @ 2008-06-17 13:09 UTC (permalink / raw) To: Patrick McHardy; +Cc: Marcel Holtmann, netdev, Ingo Molnar, Thomas Graf On Tue, Jun 17, 2008 at 02:50:19PM +0200, Patrick McHardy wrote: ... > Thanks for testing. Unfortunately the module unload races look > more complicated to fix and I'm busy with other things, so it > would great if someone else could fix this. Patrick, I wonder if simply adding an additional mutex e.g. genl_lock_table() around all the rest (after your patch) genl_locks could be enough until some major rework. This should prevent any new races and there are no lockups, I guess? Jarek P. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: netlink circular locking dependency 2008-06-17 13:09 ` Jarek Poplawski @ 2008-06-17 13:07 ` Patrick McHardy 2008-06-17 13:24 ` Jarek Poplawski 2008-06-17 13:08 ` Thomas Graf 1 sibling, 1 reply; 19+ messages in thread From: Patrick McHardy @ 2008-06-17 13:07 UTC (permalink / raw) To: Jarek Poplawski; +Cc: Marcel Holtmann, netdev, Ingo Molnar, Thomas Graf Jarek Poplawski wrote: > On Tue, Jun 17, 2008 at 02:50:19PM +0200, Patrick McHardy wrote: > ... >> Thanks for testing. Unfortunately the module unload races look >> more complicated to fix and I'm busy with other things, so it >> would great if someone else could fix this. > > Patrick, I wonder if simply adding an additional mutex e.g. > genl_lock_table() around all the rest (after your patch) genl_locks > could be enough until some major rework. This should prevent any > new races and there are no lockups, I guess? Not sure I understand you correctly, where exactly would this mutex be taken? ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: netlink circular locking dependency 2008-06-17 13:07 ` Patrick McHardy @ 2008-06-17 13:24 ` Jarek Poplawski 2008-06-17 13:27 ` Patrick McHardy 0 siblings, 1 reply; 19+ messages in thread From: Jarek Poplawski @ 2008-06-17 13:24 UTC (permalink / raw) To: Patrick McHardy; +Cc: Marcel Holtmann, netdev, Ingo Molnar, Thomas Graf On Tue, Jun 17, 2008 at 03:07:44PM +0200, Patrick McHardy wrote: > Jarek Poplawski wrote: >> On Tue, Jun 17, 2008 at 02:50:19PM +0200, Patrick McHardy wrote: >> ... >>> Thanks for testing. Unfortunately the module unload races look >>> more complicated to fix and I'm busy with other things, so it >>> would great if someone else could fix this. >> >> Patrick, I wonder if simply adding an additional mutex e.g. >> genl_lock_table() around all the rest (after your patch) genl_locks >> could be enough until some major rework. This should prevent any >> new races and there are no lockups, I guess? > > Not sure I understand you correctly, where exactly would > this mutex be taken? Around (before) each genl_lock(), so any change would need these two locks. genl_lock() alone would work like read lock (plus cb change). But, I can miss something... Of course, it's meant as a temporary solution, until Thomas does it right. Jarek P. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: netlink circular locking dependency 2008-06-17 13:24 ` Jarek Poplawski @ 2008-06-17 13:27 ` Patrick McHardy 2008-06-17 13:43 ` Jarek Poplawski 0 siblings, 1 reply; 19+ messages in thread From: Patrick McHardy @ 2008-06-17 13:27 UTC (permalink / raw) To: Jarek Poplawski; +Cc: Marcel Holtmann, netdev, Ingo Molnar, Thomas Graf Jarek Poplawski wrote: > On Tue, Jun 17, 2008 at 03:07:44PM +0200, Patrick McHardy wrote: >> Jarek Poplawski wrote: >>> On Tue, Jun 17, 2008 at 02:50:19PM +0200, Patrick McHardy wrote: >>> ... >>>> Thanks for testing. Unfortunately the module unload races look >>>> more complicated to fix and I'm busy with other things, so it >>>> would great if someone else could fix this. >>> Patrick, I wonder if simply adding an additional mutex e.g. >>> genl_lock_table() around all the rest (after your patch) genl_locks >>> could be enough until some major rework. This should prevent any >>> new races and there are no lockups, I guess? >> Not sure I understand you correctly, where exactly would >> this mutex be taken? > > Around (before) each genl_lock(), so any change would need these two > locks. genl_lock() alone would work like read lock (plus cb change). > But, I can miss something... Of course, it's meant as a temporary > solution, until Thomas does it right. I guess that might work. But simply using module references to prevent the module from going away while the mutex is not held doesn't appear to be more complicated. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: netlink circular locking dependency 2008-06-17 13:27 ` Patrick McHardy @ 2008-06-17 13:43 ` Jarek Poplawski 2008-06-18 4:30 ` David Miller 0 siblings, 1 reply; 19+ messages in thread From: Jarek Poplawski @ 2008-06-17 13:43 UTC (permalink / raw) To: Patrick McHardy; +Cc: Marcel Holtmann, netdev, Ingo Molnar, Thomas Graf On Tue, Jun 17, 2008 at 03:27:18PM +0200, Patrick McHardy wrote: ... > I guess that might work. But simply using module references to > prevent the module from going away while the mutex is not held > doesn't appear to be more complicated. O.K. Jarek P. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: netlink circular locking dependency 2008-06-17 13:43 ` Jarek Poplawski @ 2008-06-18 4:30 ` David Miller 2008-06-18 6:15 ` Jarek Poplawski 2008-06-18 8:52 ` Patrick McHardy 0 siblings, 2 replies; 19+ messages in thread From: David Miller @ 2008-06-18 4:30 UTC (permalink / raw) To: jarkao2; +Cc: kaber, marcel, netdev, mingo, tgraf From: Jarek Poplawski <jarkao2@gmail.com> Date: Tue, 17 Jun 2008 13:43:03 +0000 > On Tue, Jun 17, 2008 at 03:27:18PM +0200, Patrick McHardy wrote: > ... > > I guess that might work. But simply using module references to > > prevent the module from going away while the mutex is not held > > doesn't appear to be more complicated. > > O.K. In the meantime should I apply Patrick's current patch from this thread? If so I'd like a proper commit message and signoff, thanks! ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: netlink circular locking dependency 2008-06-18 4:30 ` David Miller @ 2008-06-18 6:15 ` Jarek Poplawski 2008-06-18 8:52 ` Patrick McHardy 1 sibling, 0 replies; 19+ messages in thread From: Jarek Poplawski @ 2008-06-18 6:15 UTC (permalink / raw) To: David Miller; +Cc: kaber, marcel, netdev, mingo, tgraf On Tue, Jun 17, 2008 at 09:30:12PM -0700, David Miller wrote: > From: Jarek Poplawski <jarkao2@gmail.com> > Date: Tue, 17 Jun 2008 13:43:03 +0000 > > > On Tue, Jun 17, 2008 at 03:27:18PM +0200, Patrick McHardy wrote: > > ... > > > I guess that might work. But simply using module references to > > > prevent the module from going away while the mutex is not held > > > doesn't appear to be more complicated. > > > > O.K. > > In the meantime should I apply Patrick's current patch > from this thread? > > If so I'd like a proper commit message and signoff, thanks! I hope this is a question "To:" Patrick... (I'm OK with any version of this patch as a temporary solution.) Regards, Jarek P. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: netlink circular locking dependency 2008-06-18 4:30 ` David Miller 2008-06-18 6:15 ` Jarek Poplawski @ 2008-06-18 8:52 ` Patrick McHardy 2008-06-18 9:08 ` David Miller 1 sibling, 1 reply; 19+ messages in thread From: Patrick McHardy @ 2008-06-18 8:52 UTC (permalink / raw) To: David Miller; +Cc: jarkao2, marcel, netdev, mingo, tgraf [-- Attachment #1: Type: text/plain, Size: 280 bytes --] David Miller wrote: > In the meantime should I apply Patrick's current patch > from this thread? I think that makes sense, the module unload race is present anyway. > > If so I'd like a proper commit message and signoff, thanks! Attached the patch with a proper changelog. [-- Attachment #2: x --] [-- Type: text/plain, Size: 2251 bytes --] netlink: genl: fix circular locking genetlink has a circular locking dependency when dumping the registered families: - dump start: genl_rcv() : take genl_mutex genl_rcv_msg() : call netlink_dump_start() while holding genl_mutex netlink_dump_start(), netlink_dump() : take nlk->cb_mutex ctrl_dumpfamily() : try to detect this case and not take genl_mutex a second time - dump continuance: netlink_rcv() : call netlink_dump netlink_dump : take nlk->cb_mutex ctrl_dumpfamily() : take genl_mutex Register genl_lock as callback mutex with netlink to fix this. This slightly widens an already existing module unload race, the genl ops used during the dump might go away when the module is unloaded. Thomas Graf is working on a seperate fix for this. Signed-off-by: Patrick McHardy <kaber@trash.net> diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c index f5aa23c..3e1191c 100644 --- a/net/netlink/genetlink.c +++ b/net/netlink/genetlink.c @@ -444,8 +444,11 @@ static int genl_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh) if (ops->dumpit == NULL) return -EOPNOTSUPP; - return netlink_dump_start(genl_sock, skb, nlh, - ops->dumpit, ops->done); + genl_unlock(); + err = netlink_dump_start(genl_sock, skb, nlh, + ops->dumpit, ops->done); + genl_lock(); + return err; } if (ops->doit == NULL) @@ -603,9 +606,6 @@ static int ctrl_dumpfamily(struct sk_buff *skb, struct netlink_callback *cb) int chains_to_skip = cb->args[0]; int fams_to_skip = cb->args[1]; - if (chains_to_skip != 0) - genl_lock(); - for (i = 0; i < GENL_FAM_TAB_SIZE; i++) { if (i < chains_to_skip) continue; @@ -623,9 +623,6 @@ static int ctrl_dumpfamily(struct sk_buff *skb, struct netlink_callback *cb) } errout: - if (chains_to_skip != 0) - genl_unlock(); - cb->args[0] = i; cb->args[1] = n; @@ -770,7 +767,7 @@ static int __init genl_init(void) /* we'll bump the group number right afterwards */ genl_sock = netlink_kernel_create(&init_net, NETLINK_GENERIC, 0, - genl_rcv, NULL, THIS_MODULE); + genl_rcv, &genl_mutex, THIS_MODULE); if (genl_sock == NULL) panic("GENL: Cannot initialize generic netlink\n"); ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: netlink circular locking dependency 2008-06-18 8:52 ` Patrick McHardy @ 2008-06-18 9:08 ` David Miller 2008-06-18 11:38 ` Marcel Holtmann 0 siblings, 1 reply; 19+ messages in thread From: David Miller @ 2008-06-18 9:08 UTC (permalink / raw) To: kaber; +Cc: jarkao2, marcel, netdev, mingo, tgraf From: Patrick McHardy <kaber@trash.net> Date: Wed, 18 Jun 2008 10:52:30 +0200 > David Miller wrote: > > In the meantime should I apply Patrick's current patch > > from this thread? > > I think that makes sense, the module unload race is present anyway. > > > > If so I'd like a proper commit message and signoff, thanks! > > Attached the patch with a proper changelog. Applied and pushed out to net-2.6, thanks a lot Patrick. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: netlink circular locking dependency 2008-06-18 9:08 ` David Miller @ 2008-06-18 11:38 ` Marcel Holtmann 2008-06-18 11:42 ` Patrick McHardy 0 siblings, 1 reply; 19+ messages in thread From: Marcel Holtmann @ 2008-06-18 11:38 UTC (permalink / raw) To: David Miller; +Cc: kaber, jarkao2, netdev, mingo, tgraf Hi David, > > > In the meantime should I apply Patrick's current patch > > > from this thread? > > > > I think that makes sense, the module unload race is present anyway. > > > > > > If so I'd like a proper commit message and signoff, thanks! > > > > Attached the patch with a proper changelog. > > Applied and pushed out to net-2.6, thanks a lot Patrick. you could have added a Tested-by: Marcel Holtmann <marcel@holtmann.org> Regards Marcel ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: netlink circular locking dependency 2008-06-18 11:38 ` Marcel Holtmann @ 2008-06-18 11:42 ` Patrick McHardy 0 siblings, 0 replies; 19+ messages in thread From: Patrick McHardy @ 2008-06-18 11:42 UTC (permalink / raw) To: Marcel Holtmann; +Cc: David Miller, jarkao2, netdev, mingo, tgraf Marcel Holtmann wrote: > Hi David, > >>>> In the meantime should I apply Patrick's current patch >>>> from this thread? >>> I think that makes sense, the module unload race is present anyway. >>>> If so I'd like a proper commit message and signoff, thanks! >>> Attached the patch with a proper changelog. >> Applied and pushed out to net-2.6, thanks a lot Patrick. > > you could have added a Tested-by: Marcel Holtmann <marcel@holtmann.org> My fault, I wanted to but forgot due to lack of coffee :) ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: netlink circular locking dependency 2008-06-17 13:09 ` Jarek Poplawski 2008-06-17 13:07 ` Patrick McHardy @ 2008-06-17 13:08 ` Thomas Graf 2008-06-17 13:19 ` Patrick McHardy 1 sibling, 1 reply; 19+ messages in thread From: Thomas Graf @ 2008-06-17 13:08 UTC (permalink / raw) To: Jarek Poplawski; +Cc: Patrick McHardy, Marcel Holtmann, netdev, Ingo Molnar * Jarek Poplawski <jarkao2@gmail.com> 2008-06-17 13:09 > On Tue, Jun 17, 2008 at 02:50:19PM +0200, Patrick McHardy wrote: > ... > > Thanks for testing. Unfortunately the module unload races look > > more complicated to fix and I'm busy with other things, so it > > would great if someone else could fix this. > > Patrick, I wonder if simply adding an additional mutex e.g. > genl_lock_table() around all the rest (after your patch) genl_locks > could be enough until some major rework. This should prevent any > new races and there are no lockups, I guess? It would be wise to redo the locking alltogether while we're fixing this. Decouple message serialization from register/unregister operations would be a nice thing f.e. I'll have a look if none beats me to it. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: netlink circular locking dependency 2008-06-17 13:08 ` Thomas Graf @ 2008-06-17 13:19 ` Patrick McHardy 0 siblings, 0 replies; 19+ messages in thread From: Patrick McHardy @ 2008-06-17 13:19 UTC (permalink / raw) To: Thomas Graf; +Cc: Jarek Poplawski, Marcel Holtmann, netdev, Ingo Molnar Thomas Graf wrote: > * Jarek Poplawski <jarkao2@gmail.com> 2008-06-17 13:09 >> On Tue, Jun 17, 2008 at 02:50:19PM +0200, Patrick McHardy wrote: >> ... >>> Thanks for testing. Unfortunately the module unload races look >>> more complicated to fix and I'm busy with other things, so it >>> would great if someone else could fix this. >> Patrick, I wonder if simply adding an additional mutex e.g. >> genl_lock_table() around all the rest (after your patch) genl_locks >> could be enough until some major rework. This should prevent any >> new races and there are no lockups, I guess? > > It would be wise to redo the locking alltogether while we're fixing > this. Decouple message serialization from register/unregister > operations would be a nice thing f.e. Agreed. > I'll have a look if none beats me to it. While whoever is at this, there are a few similar races I wanted to fix for a long time, but never got to it. - the rtnl_register/rtnl_unregister function don't perform any locking, so they might get called while unregistration is in progress - the individual ops are registered one at a time, which could theoretically lead to some strange behaviour where a subset of the ops can already be used, but others return EOPNOTSUPP - netlink_dump_start usage in rtnetlink has the exact same race as genetlink with my patch For 1. and 2., it seems the best way would be to register the entire set of ops (encapsulated in a struct) at once and hold the rtnl_mutex during that. For 3., the fix would probably identical to the genetlink fix. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: netlink circular locking dependency 2008-06-16 21:48 ` Patrick McHardy 2008-06-17 1:45 ` Marcel Holtmann @ 2008-06-17 8:49 ` Jarek Poplawski 1 sibling, 0 replies; 19+ messages in thread From: Jarek Poplawski @ 2008-06-17 8:49 UTC (permalink / raw) To: Patrick McHardy; +Cc: Marcel Holtmann, netdev, Ingo Molnar, Thomas Graf On Mon, Jun 16, 2008 at 11:48:01PM +0200, Patrick McHardy wrote: > Jarek Poplawski wrote: >> Marcel Holtmann wrote, On 06/14/2008 02:35 PM: >> ... >> >>> ======================================================= >>> [ INFO: possible circular locking dependency detected ] >>> 2.6.26-rc2 #5 >>> ------------------------------------------------------- >>> hcid/4136 is trying to acquire lock: >>> (genl_mutex){--..}, at: [<c0000000002ace4c>] .ctrl_dumpfamily+0x74/0x174 >>> >>> but task is already holding lock: >>> (nlk->cb_mutex){--..}, at: [<c0000000002a766c>] .netlink_dump+0x58/0x27c >>> >>> which lock already depends on the new lock. >>> >> ... >> >> Hi, >> >> IMHO it looks like a real lockup threat. Probably it needs something >> better, but for now here is my simplistic patch proposal for testing. >> > So we have: > > genl_rcv() : take genl_mutex > genl_rcv_msg() : call netlink_dump_start() while holding genl_mutex > netlink_dump_start(), > netlink_dump() : take nlk->cb_mutex > ctrl_dumpfamily() : try to detect this case and not take genl_mutex a > second time > > netlink_rcv() : call netlink_dump > netlink_dump : take nlk->cb_mutex > ctrl_dumpfamily() : take genl_mutex > > which is a real bug. Right. Probably there is also another variant: #1 genl_rcv() : take genl_mutex genl_rcv_msg() : call netlink_dump_start() while holding genl_mutex netlink_dump_start() : mutex_unlock nlk->cb_mutex #2 netlink_rcv() : call netlink_dump netlink_dump : take nlk->cb_mutex ctrl_dumpfamily() : 1st run without genl_mutex #1 netlink_dump() : take nlk->cb_mutex ctrl_dumpfamily() : > 1st run: take genl_mutex 2nd time?! > > It seems the best fix is to use genl_mutex for the netlink cb_mutex, > drop genl_mutex before calling netlink_dump_start and don't take it > in ctrl_dumpfamily, relying completely on af_netlink.c for dump > locking. Unfortunately this creates a race since the ops passed to > netlink_dump_start are also protect by the mutex, so this patch > is just for testing whether it fixes the warning. > > On second though - that race seems to be present already since > the ops can be unregistered and the module unloaded while a dump > is in progress. Yes, very interesting. I guess there will be some followup... Regards, Jarek P. > diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c > index f5aa23c..3e1191c 100644 > --- a/net/netlink/genetlink.c > +++ b/net/netlink/genetlink.c > @@ -444,8 +444,11 @@ static int genl_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh) > if (ops->dumpit == NULL) > return -EOPNOTSUPP; > > - return netlink_dump_start(genl_sock, skb, nlh, > - ops->dumpit, ops->done); > + genl_unlock(); > + err = netlink_dump_start(genl_sock, skb, nlh, > + ops->dumpit, ops->done); > + genl_lock(); > + return err; > } > > if (ops->doit == NULL) > @@ -603,9 +606,6 @@ static int ctrl_dumpfamily(struct sk_buff *skb, struct netlink_callback *cb) > int chains_to_skip = cb->args[0]; > int fams_to_skip = cb->args[1]; > > - if (chains_to_skip != 0) > - genl_lock(); > - > for (i = 0; i < GENL_FAM_TAB_SIZE; i++) { > if (i < chains_to_skip) > continue; > @@ -623,9 +623,6 @@ static int ctrl_dumpfamily(struct sk_buff *skb, struct netlink_callback *cb) > } > > errout: > - if (chains_to_skip != 0) > - genl_unlock(); > - > cb->args[0] = i; > cb->args[1] = n; > > @@ -770,7 +767,7 @@ static int __init genl_init(void) > > /* we'll bump the group number right afterwards */ > genl_sock = netlink_kernel_create(&init_net, NETLINK_GENERIC, 0, > - genl_rcv, NULL, THIS_MODULE); > + genl_rcv, &genl_mutex, THIS_MODULE); > if (genl_sock == NULL) > panic("GENL: Cannot initialize generic netlink\n"); > ^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2008-06-18 11:42 UTC | newest] Thread overview: 19+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2008-06-14 12:35 netlink circular locking dependency Marcel Holtmann 2008-06-16 21:34 ` Jarek Poplawski 2008-06-16 21:48 ` Patrick McHardy 2008-06-17 1:45 ` Marcel Holtmann 2008-06-17 12:50 ` Patrick McHardy 2008-06-17 13:09 ` Jarek Poplawski 2008-06-17 13:07 ` Patrick McHardy 2008-06-17 13:24 ` Jarek Poplawski 2008-06-17 13:27 ` Patrick McHardy 2008-06-17 13:43 ` Jarek Poplawski 2008-06-18 4:30 ` David Miller 2008-06-18 6:15 ` Jarek Poplawski 2008-06-18 8:52 ` Patrick McHardy 2008-06-18 9:08 ` David Miller 2008-06-18 11:38 ` Marcel Holtmann 2008-06-18 11:42 ` Patrick McHardy 2008-06-17 13:08 ` Thomas Graf 2008-06-17 13:19 ` Patrick McHardy 2008-06-17 8:49 ` Jarek Poplawski
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).