* [Patch 6/6] per task delay accounting taskstats interface: fix clone skbs for each listener
       [not found] <1152591838.14142.114.camel@localhost.localdomain>
@ 2006-07-11  4:36 ` Shailabh Nagar
  2006-07-11 10:05   ` Andrew Morton
  0 siblings, 1 reply; 7+ messages in thread
From: Shailabh Nagar @ 2006-07-11  4:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Jay Lan, Chris Sturtivant, Paul Jackson, Balbir Singh,
	Chandra Seetharaman, Jamal, netdev

Use a cloned sk_buff for each netlink message sent to multiple listeners.

Earlier, the same skb, representing a netlink message, was being
erroneously reused for doing genlmsg_unicast()'s (effectively
netlink_unicast()'s) to each listener on the per-cpu list of listeners.
Since netlink_unicast() frees up the skb passed to it, regardless of the
status of the send, reuse is bad.

Thanks to Chandra Seetharaman for discovering this bug.

Signed-Off-By: Shailabh Nagar <nagar@watson.ibm.com>
Signed-Off-By: Chandra Seetharaman <sekharan@us.ibm.com>

 kernel/taskstats.c |   13 ++++++++++++-
 1 files changed, 12 insertions(+), 1 deletion(-)

Index: linux-2.6.18-rc1/kernel/taskstats.c
===================================================================
--- linux-2.6.18-rc1.orig/kernel/taskstats.c	2006-07-10 23:44:56.000000000 -0400
+++ linux-2.6.18-rc1/kernel/taskstats.c	2006-07-10 23:45:15.000000000 -0400
@@ -125,6 +125,7 @@ static int send_cpu_listeners(struct sk_
 	struct genlmsghdr *genlhdr = nlmsg_data((struct nlmsghdr *)skb->data);
 	struct listener_list *listeners;
 	struct listener *s, *tmp;
+	struct sk_buff *skb_next, *skb_cur = skb;
 	void *reply = genlmsg_data(genlhdr);
 	int rc, ret;
 
@@ -138,12 +139,22 @@ static int send_cpu_listeners(struct sk_
 	listeners = &per_cpu(listener_array, cpu);
 	down_write(&listeners->sem);
 	list_for_each_entry_safe(s, tmp, &listeners->list, list) {
-		ret = genlmsg_unicast(skb, s->pid);
+		skb_next = NULL;
+		if (!list_islast(&s->list, &listeners->list)) {
+			skb_next = skb_clone(skb_cur, GFP_KERNEL);
+			if (!skb_next) {
+				nlmsg_free(skb_cur);
+				rc = -ENOMEM;
+				break;
+			}
+		}
+		ret = genlmsg_unicast(skb_cur, s->pid);
 		if (ret == -ECONNREFUSED) {
 			list_del(&s->list);
 			kfree(s);
 			rc = ret;
 		}
+		skb_cur = skb_next;
 	}
 	up_write(&listeners->sem);

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: [Patch 6/6] per task delay accounting taskstats interface: fix clone skbs for each listener
  2006-07-11  4:36 ` [Patch 6/6] per task delay accounting taskstats interface: fix clone skbs for each listener Shailabh Nagar
@ 2006-07-11 10:05   ` Andrew Morton
  2006-07-11 10:28     ` Herbert Xu
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2006-07-11 10:05 UTC (permalink / raw)
  To: nagar; +Cc: linux-kernel, jlan, csturtiv, pj, balbir, sekharan, hadi, netdev

On Tue, 11 Jul 2006 00:36:39 -0400
Shailabh Nagar <nagar@watson.ibm.com> wrote:

> 	down_write(&listeners->sem);
> 	list_for_each_entry_safe(s, tmp, &listeners->list, list) {
> -		ret = genlmsg_unicast(skb, s->pid);
> +		skb_next = NULL;
> +		if (!list_islast(&s->list, &listeners->list)) {
> +			skb_next = skb_clone(skb_cur, GFP_KERNEL);

If we do a GFP_KERNEL allocation with this semaphore held, and the
oom-killer tries to kill something to satisfy the allocation, and the
killed task gets stuck on that semaphore, I wonder if the box locks up.

Probably it'll work out OK if the semaphore is taken after that task has
had some resources torn down.

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: [Patch 6/6] per task delay accounting taskstats interface: fix clone skbs for each listener
  2006-07-11 10:05   ` Andrew Morton
@ 2006-07-11 10:28     ` Herbert Xu
  2006-07-11 10:57       ` Andrew Morton
  0 siblings, 1 reply; 7+ messages in thread
From: Herbert Xu @ 2006-07-11 10:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: nagar, linux-kernel, jlan, csturtiv, pj, balbir, sekharan, hadi, netdev

Andrew Morton <akpm@osdl.org> wrote:
> On Tue, 11 Jul 2006 00:36:39 -0400
> Shailabh Nagar <nagar@watson.ibm.com> wrote:
>
>> 	down_write(&listeners->sem);
>> 	list_for_each_entry_safe(s, tmp, &listeners->list, list) {
>> -		ret = genlmsg_unicast(skb, s->pid);
>> +		skb_next = NULL;
>> +		if (!list_islast(&s->list, &listeners->list)) {
>> +			skb_next = skb_clone(skb_cur, GFP_KERNEL);
>
> If we do a GFP_KERNEL allocation with this semaphore held, and the
> oom-killer tries to kill something to satisfy the allocation, and the
> killed task gets stuck on that semaphore, I wonder if the box locks up.

We do GFP_KERNEL inside semaphores/mutexes in lots of places.  So if this
can deadlock with the oom-killer we probably should fix that, preferably
by having GFP_KERNEL fail in that case.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: [Patch 6/6] per task delay accounting taskstats interface: fix clone skbs for each listener
  2006-07-11 10:28     ` Herbert Xu
@ 2006-07-11 10:57       ` Andrew Morton
  2006-07-11 11:15         ` Herbert Xu
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2006-07-11 10:57 UTC (permalink / raw)
  To: Herbert Xu
  Cc: nagar, linux-kernel, jlan, csturtiv, pj, balbir, sekharan, hadi, netdev

On Tue, 11 Jul 2006 20:28:50 +1000
Herbert Xu <herbert@gondor.apana.org.au> wrote:

> Andrew Morton <akpm@osdl.org> wrote:
> > On Tue, 11 Jul 2006 00:36:39 -0400
> > Shailabh Nagar <nagar@watson.ibm.com> wrote:
> >
> >> 	down_write(&listeners->sem);
> >> 	list_for_each_entry_safe(s, tmp, &listeners->list, list) {
> >> -		ret = genlmsg_unicast(skb, s->pid);
> >> +		skb_next = NULL;
> >> +		if (!list_islast(&s->list, &listeners->list)) {
> >> +			skb_next = skb_clone(skb_cur, GFP_KERNEL);
> >
> > If we do a GFP_KERNEL allocation with this semaphore held, and the
> > oom-killer tries to kill something to satisfy the allocation, and the
> > killed task gets stuck on that semaphore, I wonder if the box locks up.
>
> We do GFP_KERNEL inside semaphores/mutexes in lots of places.  So if this
> can deadlock with the oom-killer we probably should fix that, preferably
> by having GFP_KERNEL fail in that case.

This lock is special, in that it's taken on the exit() path (I think).  So
it can block tasks which are trying to exit.

But yes.  Reliable, deadlock-free oom-killing is, err, a matter of ongoing
research.

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: [Patch 6/6] per task delay accounting taskstats interface: fix clone skbs for each listener
  2006-07-11 10:57       ` Andrew Morton
@ 2006-07-11 11:15         ` Herbert Xu
  2006-07-11 17:44           ` Shailabh Nagar
  2006-07-12  5:41           ` Shailabh Nagar
  0 siblings, 2 replies; 7+ messages in thread
From: Herbert Xu @ 2006-07-11 11:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: nagar, linux-kernel, jlan, csturtiv, pj, balbir, sekharan, hadi, netdev

On Tue, Jul 11, 2006 at 03:57:31AM -0700, Andrew Morton wrote:
>
> > >> 	down_write(&listeners->sem);
> > >> 	list_for_each_entry_safe(s, tmp, &listeners->list, list) {
> > >> -		ret = genlmsg_unicast(skb, s->pid);
> > >> +		skb_next = NULL;
> > >> +		if (!list_islast(&s->list, &listeners->list)) {
> > >> +			skb_next = skb_clone(skb_cur, GFP_KERNEL);
> > >
> > > If we do a GFP_KERNEL allocation with this semaphore held, and the
> > > oom-killer tries to kill something to satisfy the allocation, and the
> > > killed task gets stuck on that semaphore, I wonder if the box locks up.
> >
> > We do GFP_KERNEL inside semaphores/mutexes in lots of places.  So if this
> > can deadlock with the oom-killer we probably should fix that, preferably
> > by having GFP_KERNEL fail in that case.
>
> This lock is special, in that it's taken on the exit() path (I think).  So
> it can block tasks which are trying to exit.

Sorry, missed the context.

If there is a deadlock then it's not just this allocation that you need
worry about.  There is also an allocation within genlmsg_unicast that
would be GFP_KERNEL.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: [Patch 6/6] per task delay accounting taskstats interface: fix clone skbs for each listener
  2006-07-11 11:15         ` Herbert Xu
@ 2006-07-11 17:44           ` Shailabh Nagar
  2006-07-12  5:41           ` Shailabh Nagar
  1 sibling, 0 replies; 7+ messages in thread
From: Shailabh Nagar @ 2006-07-11 17:44 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Andrew Morton, linux-kernel, jlan, csturtiv, pj, balbir, sekharan,
	hadi, netdev

Herbert Xu wrote:
> On Tue, Jul 11, 2006 at 03:57:31AM -0700, Andrew Morton wrote:
>
>>>>> 	down_write(&listeners->sem);
>>>>> 	list_for_each_entry_safe(s, tmp, &listeners->list, list) {
>>>>> -		ret = genlmsg_unicast(skb, s->pid);
>>>>> +		skb_next = NULL;
>>>>> +		if (!list_islast(&s->list, &listeners->list)) {
>>>>> +			skb_next = skb_clone(skb_cur, GFP_KERNEL);
>>>>
>>>> If we do a GFP_KERNEL allocation with this semaphore held, and the
>>>> oom-killer tries to kill something to satisfy the allocation, and the
>>>> killed task gets stuck on that semaphore, I wonder if the box locks up.

Hmm... doesn't look very safe, does it?

There's no real need for us to skb_clone() within the sem. Keeping a
count of listeners and doing the clone outside should let us avoid this
problem.

I was trying to avoid doing the above because of the potential for
listeners getting added continuously to the list (and having to repeat
the allocation loop outside the down_write). But on second thoughts,
we're under no obligation to send the data to all the listeners who add
themselves in the short time between our taking a snapshot of the
listener count and when the send is done (within the down_write). So it
should be ok.

>>>
>>> We do GFP_KERNEL inside semaphores/mutexes in lots of places.  So if this
>>> can deadlock with the oom-killer we probably should fix that, preferably
>>> by having GFP_KERNEL fail in that case.
>>
>> This lock is special, in that it's taken on the exit() path (I think).  So
>> it can block tasks which are trying to exit.
>
> Sorry, missed the context.
>
> If there is a deadlock then it's not just this allocation that you
> need worry about.  There is also an allocation within genlmsg_unicast
> that would be GFP_KERNEL.

That's true. The GFP_KERNEL allocation potentially called in
netlink_trim() as part of the genlmsg/netlink_unicast() is again a
problem.

So perhaps we should switch to using RCU for protecting the listener
list.

--Shailabh

> Cheers,

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: [Patch 6/6] per task delay accounting taskstats interface: fix clone skbs for each listener
  2006-07-11 11:15         ` Herbert Xu
  2006-07-11 17:44           ` Shailabh Nagar
@ 2006-07-12  5:41           ` Shailabh Nagar
  1 sibling, 0 replies; 7+ messages in thread
From: Shailabh Nagar @ 2006-07-12  5:41 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Andrew Morton, linux-kernel, Jay Lan, Chris Sturtivant, Paul Jackson,
	Balbir Singh, Chandra Seetharaman, Jamal, netdev

On Tue, 2006-07-11 at 07:15, Herbert Xu wrote:
> On Tue, Jul 11, 2006 at 03:57:31AM -0700, Andrew Morton wrote:
> >
> > > >> 	down_write(&listeners->sem);
> > > >> 	list_for_each_entry_safe(s, tmp, &listeners->list, list) {
> > > >> -		ret = genlmsg_unicast(skb, s->pid);
> > > >> +		skb_next = NULL;
> > > >> +		if (!list_islast(&s->list, &listeners->list)) {
> > > >> +			skb_next = skb_clone(skb_cur, GFP_KERNEL);
> > > >
> > > > If we do a GFP_KERNEL allocation with this semaphore held, and the
> > > > oom-killer tries to kill something to satisfy the allocation, and the
> > > > killed task gets stuck on that semaphore, I wonder if the box locks up.
> > >
> > > We do GFP_KERNEL inside semaphores/mutexes in lots of places.  So if this
> > > can deadlock with the oom-killer we probably should fix that, preferably
> > > by having GFP_KERNEL fail in that case.
> >
> > This lock is special, in that it's taken on the exit() path (I think).  So
> > it can block tasks which are trying to exit.
>
> Sorry, missed the context.
>
> If there is a deadlock then it's not just this allocation that you
> need worry about.  There is also an allocation within genlmsg_unicast
> that would be GFP_KERNEL.

Remove down_write() from taskstats code invoked on the exit() path.

In send_cpu_listeners(), which is called on the exit path, a
down_write() was protecting operations like skb_clone() and
genlmsg_unicast() that do GFP_KERNEL allocations. If the oom-killer
decides to kill tasks to satisfy the allocations, the exit of those
tasks could block on the same semaphore.

The down_write() was only needed to allow removal of invalid listeners
from the listener list. The patch converts the down_write to a
down_read and defers the removal to a separate critical region. This
ensures that even if the oom-killer is called, no other task's exit is
blocked as it can still acquire another down_read.

Thanks to Andrew Morton & Herbert Xu for pointing out the oom related
pitfalls, and to Chandra Seetharaman for suggesting this fix instead of
using something more complex like RCU.

Signed-Off-By: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-Off-By: Shailabh Nagar <nagar@watson.ibm.com>

---

 kernel/taskstats.c |   24 +++++++++++++++++++-----
 1 files changed, 19 insertions(+), 5 deletions(-)

Index: linux-2.6.18-rc1/kernel/taskstats.c
===================================================================
--- linux-2.6.18-rc1.orig/kernel/taskstats.c	2006-07-11 00:13:00.000000000 -0400
+++ linux-2.6.18-rc1/kernel/taskstats.c	2006-07-12 00:07:53.000000000 -0400
@@ -51,6 +51,7 @@ __read_mostly = {
 struct listener {
 	struct list_head list;
 	pid_t pid;
+	char valid;
 };
 
 struct listener_list {
@@ -127,7 +128,7 @@ static int send_cpu_listeners(struct sk_
 	struct listener *s, *tmp;
 	struct sk_buff *skb_next, *skb_cur = skb;
 	void *reply = genlmsg_data(genlhdr);
-	int rc, ret;
+	int rc, ret, delcount = 0;
 
 	rc = genlmsg_end(skb, reply);
 	if (rc < 0) {
@@ -137,7 +138,7 @@ static int send_cpu_listeners(struct sk_
 	rc = 0;
 	listeners = &per_cpu(listener_array, cpu);
-	down_write(&listeners->sem);
+	down_read(&listeners->sem);
 	list_for_each_entry_safe(s, tmp, &listeners->list, list) {
 		skb_next = NULL;
 		if (!list_islast(&s->list, &listeners->list)) {
@@ -150,14 +151,26 @@ static int send_cpu_listeners(struct sk_
 		}
 		ret = genlmsg_unicast(skb_cur, s->pid);
 		if (ret == -ECONNREFUSED) {
-			list_del(&s->list);
-			kfree(s);
+			s->valid = 0;
+			delcount++;
 			rc = ret;
 		}
 		skb_cur = skb_next;
 	}
-	up_write(&listeners->sem);
+	up_read(&listeners->sem);
+
+	if (!delcount)
+		return rc;
+
+	/* Delete invalidated entries */
+	down_write(&listeners->sem);
+	list_for_each_entry_safe(s, tmp, &listeners->list, list) {
+		if (!s->valid) {
+			list_del(&s->list);
+			kfree(s);
+		}
+	}
+	up_write(&listeners->sem);
 	return rc;
 }
 
@@ -290,6 +303,7 @@ static int add_del_listener(pid_t pid, c
 		goto cleanup;
 	s->pid = pid;
 	INIT_LIST_HEAD(&s->list);
+	s->valid = 1;
 
 	listeners = &per_cpu(listener_array, cpu);
 	down_write(&listeners->sem);

^ permalink raw reply	[flat|nested] 7+ messages in thread