[Patch 6/6] per task delay accounting taskstats interface: fix clone skbs for each listener

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [Patch 6/6] per task delay accounting taskstats interface: fix clone skbs for each listener
       [not found] <1152591838.14142.114.camel@localhost.localdomain>
@ 2006-07-11  4:36 ` Shailabh Nagar
  2006-07-11 10:05   ` Andrew Morton
  0 siblings, 1 reply; 7+ messages in thread
From: Shailabh Nagar @ 2006-07-11  4:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Jay Lan, Chris Sturtivant, Paul Jackson,
	Balbir Singh, Chandra Seetharaman, Jamal, netdev

Use a cloned sk_buff for each netlink message sent to multiple listeners.

Earlier, the same skb, representing a netlink message, was being erroneously 
reused for doing genetlink_unicast()'s (effectively netlink_unicast()) to 
each listener on the per-cpu list of listeners. Since netlink_unicast() frees
up the skb passed to it, regardless of status of the send, reuse is bad.

Thanks to Chandra Seetharaman for discovering this bug.

Signed-Off-By: Shailabh Nagar <nagar@watson.ibm.com>
Signed-Off-By: Chandra Seetharaman <sekharan@us.ibm.com>

 kernel/taskstats.c |   13 ++++++++++++-
 1 files changed, 12 insertions(+), 1 deletion(-)

Index: linux-2.6.18-rc1/kernel/taskstats.c
===================================================================
--- linux-2.6.18-rc1.orig/kernel/taskstats.c	2006-07-10 23:44:56.000000000 -0400
+++ linux-2.6.18-rc1/kernel/taskstats.c	2006-07-10 23:45:15.000000000 -0400
@@ -125,6 +125,7 @@ static int send_cpu_listeners(struct sk_
 	struct genlmsghdr *genlhdr = nlmsg_data((struct nlmsghdr *)skb->data);
 	struct listener_list *listeners;
 	struct listener *s, *tmp;
+	struct sk_buff *skb_next, *skb_cur = skb;
 	void *reply = genlmsg_data(genlhdr);
 	int rc, ret;
 
@@ -138,12 +139,22 @@ static int send_cpu_listeners(struct sk_
 	listeners = &per_cpu(listener_array, cpu);
 	down_write(&listeners->sem);
 	list_for_each_entry_safe(s, tmp, &listeners->list, list) {
-		ret = genlmsg_unicast(skb, s->pid);
+		skb_next = NULL;
+		if (!list_islast(&s->list, &listeners->list)) {
+			skb_next = skb_clone(skb_cur, GFP_KERNEL);
+			if (!skb_next) {
+				nlmsg_free(skb_cur);
+				rc = -ENOMEM;
+				break;
+			}
+		}
+		ret = genlmsg_unicast(skb_cur, s->pid);
 		if (ret == -ECONNREFUSED) {
 			list_del(&s->list);
 			kfree(s);
 			rc = ret;
 		}
+		skb_cur = skb_next;
 	}
 	up_write(&listeners->sem);
 



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Patch 6/6] per task delay accounting taskstats interface: fix clone skbs for each listener
  2006-07-11  4:36 ` [Patch 6/6] per task delay accounting taskstats interface: fix clone skbs for each listener Shailabh Nagar
@ 2006-07-11 10:05   ` Andrew Morton
  2006-07-11 10:28     ` Herbert Xu
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2006-07-11 10:05 UTC (permalink / raw)
  To: nagar; +Cc: linux-kernel, jlan, csturtiv, pj, balbir, sekharan, hadi, netdev

On Tue, 11 Jul 2006 00:36:39 -0400
Shailabh Nagar <nagar@watson.ibm.com> wrote:

>  	down_write(&listeners->sem);
>  	list_for_each_entry_safe(s, tmp, &listeners->list, list) {
> -		ret = genlmsg_unicast(skb, s->pid);
> +		skb_next = NULL;
> +		if (!list_islast(&s->list, &listeners->list)) {
> +			skb_next = skb_clone(skb_cur, GFP_KERNEL);

If we do a GFP_KERNEL allocation with this semaphore held, and the
oom-killer tries to kill something to satisfy the allocation, and the
killed task gets stuck on that semaphore, I wonder of the box locks up.

Probably it'll work out OK if the semaphore is taken after that task has
had some resources torn down.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Patch 6/6] per task delay accounting taskstats interface: fix clone skbs for each listener
  2006-07-11 10:05   ` Andrew Morton
@ 2006-07-11 10:28     ` Herbert Xu
  2006-07-11 10:57       ` Andrew Morton
  0 siblings, 1 reply; 7+ messages in thread
From: Herbert Xu @ 2006-07-11 10:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: nagar, linux-kernel, jlan, csturtiv, pj, balbir, sekharan, hadi,
	netdev

Andrew Morton <akpm@osdl.org> wrote:
> On Tue, 11 Jul 2006 00:36:39 -0400
> Shailabh Nagar <nagar@watson.ibm.com> wrote:
> 
>>       down_write(&listeners->sem);
>>       list_for_each_entry_safe(s, tmp, &listeners->list, list) {
>> -             ret = genlmsg_unicast(skb, s->pid);
>> +             skb_next = NULL;
>> +             if (!list_islast(&s->list, &listeners->list)) {
>> +                     skb_next = skb_clone(skb_cur, GFP_KERNEL);
> 
> If we do a GFP_KERNEL allocation with this semaphore held, and the
> oom-killer tries to kill something to satisfy the allocation, and the
> killed task gets stuck on that semaphore, I wonder of the box locks up.

We do GFP_KERNEL inside semaphores/mutexes in lots of places.  So if this
can deadlock with the oom-killer we probably should fix that, preferably
by having GFP_KERNEL fail in that case.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Patch 6/6] per task delay accounting taskstats interface: fix clone skbs for each listener
  2006-07-11 10:28     ` Herbert Xu
@ 2006-07-11 10:57       ` Andrew Morton
  2006-07-11 11:15         ` Herbert Xu
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2006-07-11 10:57 UTC (permalink / raw)
  To: Herbert Xu
  Cc: nagar, linux-kernel, jlan, csturtiv, pj, balbir, sekharan, hadi,
	netdev

On Tue, 11 Jul 2006 20:28:50 +1000
Herbert Xu <herbert@gondor.apana.org.au> wrote:

> Andrew Morton <akpm@osdl.org> wrote:
> > On Tue, 11 Jul 2006 00:36:39 -0400
> > Shailabh Nagar <nagar@watson.ibm.com> wrote:
> > 
> >>       down_write(&listeners->sem);
> >>       list_for_each_entry_safe(s, tmp, &listeners->list, list) {
> >> -             ret = genlmsg_unicast(skb, s->pid);
> >> +             skb_next = NULL;
> >> +             if (!list_islast(&s->list, &listeners->list)) {
> >> +                     skb_next = skb_clone(skb_cur, GFP_KERNEL);
> > 
> > If we do a GFP_KERNEL allocation with this semaphore held, and the
> > oom-killer tries to kill something to satisfy the allocation, and the
> > killed task gets stuck on that semaphore, I wonder of the box locks up.
> 
> We do GFP_KERNEL inside semaphores/mutexes in lots of places.  So if this
> can deadlock with the oom-killer we probably should fix that, preferably
> by having GFP_KERNEL fail in that case.

This lock is special, in that it's taken on the exit() path (I think).  So
it can block tasks which are trying to exit.

But yes.  Reliable, deadlock-free oom-killing is, err, a matter of ongoing
research.


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Patch 6/6] per task delay accounting taskstats interface: fix clone skbs for each listener
  2006-07-11 10:57       ` Andrew Morton
@ 2006-07-11 11:15         ` Herbert Xu
  2006-07-11 17:44           ` Shailabh Nagar
  2006-07-12  5:41           ` Shailabh Nagar
  0 siblings, 2 replies; 7+ messages in thread
From: Herbert Xu @ 2006-07-11 11:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: nagar, linux-kernel, jlan, csturtiv, pj, balbir, sekharan, hadi,
	netdev

On Tue, Jul 11, 2006 at 03:57:31AM -0700, Andrew Morton wrote:
>
> > >>       down_write(&listeners->sem);
> > >>       list_for_each_entry_safe(s, tmp, &listeners->list, list) {
> > >> -             ret = genlmsg_unicast(skb, s->pid);
> > >> +             skb_next = NULL;
> > >> +             if (!list_islast(&s->list, &listeners->list)) {
> > >> +                     skb_next = skb_clone(skb_cur, GFP_KERNEL);
> > > 
> > > If we do a GFP_KERNEL allocation with this semaphore held, and the
> > > oom-killer tries to kill something to satisfy the allocation, and the
> > > killed task gets stuck on that semaphore, I wonder of the box locks up.
> > 
> > We do GFP_KERNEL inside semaphores/mutexes in lots of places.  So if this
> > can deadlock with the oom-killer we probably should fix that, preferably
> > by having GFP_KERNEL fail in that case.
> 
> This lock is special, in that it's taken on the exit() path (I think).  So
> it can block tasks which are trying to exit.

Sorry, missed the context.

If there is a deadlock then it's not just this allocation that you
need worry about.  There is also an allocation within genlmsg_uniast
that would be GFP_KERNEL.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Patch 6/6] per task delay accounting taskstats interface: fix clone skbs for each listener
  2006-07-11 11:15         ` Herbert Xu
@ 2006-07-11 17:44           ` Shailabh Nagar
  2006-07-12  5:41           ` Shailabh Nagar
  1 sibling, 0 replies; 7+ messages in thread
From: Shailabh Nagar @ 2006-07-11 17:44 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Andrew Morton, linux-kernel, jlan, csturtiv, pj, balbir, sekharan,
	hadi, netdev

Herbert Xu wrote:
> On Tue, Jul 11, 2006 at 03:57:31AM -0700, Andrew Morton wrote:
> 
>>>>>      down_write(&listeners->sem);
>>>>>      list_for_each_entry_safe(s, tmp, &listeners->list, list) {
>>>>>-             ret = genlmsg_unicast(skb, s->pid);
>>>>>+             skb_next = NULL;
>>>>>+             if (!list_islast(&s->list, &listeners->list)) {
>>>>>+                     skb_next = skb_clone(skb_cur, GFP_KERNEL);
>>>>
>>>>If we do a GFP_KERNEL allocation with this semaphore held, and the
>>>>oom-killer tries to kill something to satisfy the allocation, and the
>>>>killed task gets stuck on that semaphore, I wonder of the box locks up.

Hmm...doesn't look very safe does it.
There's no real need for us to skb_clone within the sem. Keeping a count of
listeners and doing the clone outside should let us avoid this problem.

I was trying to avoid doing the above because of the potential for
listeners getting added continuously to the list
(and having to repeat the allocation loop outside the down_write).

But on second thoughts, we're under no obligation to send the data to all the
listeners who add themselves in the short time between our taking a snapshot
of the listener count and when the send is done (within the down_write). So
it should be ok.

>>>
>>>We do GFP_KERNEL inside semaphores/mutexes in lots of places.  So if this
>>>can deadlock with the oom-killer we probably should fix that, preferably
>>>by having GFP_KERNEL fail in that case.
>>
>>This lock is special, in that it's taken on the exit() path (I think).  So
>>it can block tasks which are trying to exit.
> 
> 
> Sorry, missed the context.
> 
> If there is a deadlock then it's not just this allocation that you
> need worry about.  There is also an allocation within genlmsg_uniast
> that would be GFP_KERNEL.
> 

Thats true. The GFP_KERNEL allocation potentially called in netlink_trim() as part
of the genlmsg/netlink_unicast() is again a problem.

So perhaps we should switch to using RCU for protecting the listener list.

--Shailabh

> Cheers,

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Patch 6/6] per task delay accounting taskstats interface: fix clone skbs for each listener
  2006-07-11 11:15         ` Herbert Xu
  2006-07-11 17:44           ` Shailabh Nagar
@ 2006-07-12  5:41           ` Shailabh Nagar
  1 sibling, 0 replies; 7+ messages in thread
From: Shailabh Nagar @ 2006-07-12  5:41 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Andrew Morton, linux-kernel, Jay Lan, Chris Sturtivant,
	Paul Jackson, Balbir Singh, Chandra Seetharaman, Jamal, netdev

On Tue, 2006-07-11 at 07:15, Herbert Xu wrote:
> On Tue, Jul 11, 2006 at 03:57:31AM -0700, Andrew Morton wrote:
> >
> > > >>       down_write(&listeners->sem);
> > > >>       list_for_each_entry_safe(s, tmp, &listeners->list, list) {
> > > >> -             ret = genlmsg_unicast(skb, s->pid);
> > > >> +             skb_next = NULL;
> > > >> +             if (!list_islast(&s->list, &listeners->list)) {
> > > >> +                     skb_next = skb_clone(skb_cur, GFP_KERNEL);
> > > > 
> > > > If we do a GFP_KERNEL allocation with this semaphore held, and the
> > > > oom-killer tries to kill something to satisfy the allocation, and the
> > > > killed task gets stuck on that semaphore, I wonder of the box locks up.
> > > 
> > > We do GFP_KERNEL inside semaphores/mutexes in lots of places.  So if this
> > > can deadlock with the oom-killer we probably should fix that, preferably
> > > by having GFP_KERNEL fail in that case.
> > 
> > This lock is special, in that it's taken on the exit() path (I think).  So
> > it can block tasks which are trying to exit.
> 
> Sorry, missed the context.
> 
> If there is a deadlock then it's not just this allocation that you
> need worry about.  There is also an allocation within genlmsg_uniast
> that would be GFP_KERNEL.


Remove down_write() from taskstats code invoked on the exit() path.

In send_cpu_listeners(), which is called on the exit path, 
a down_write() was protecting operations like skb_clone() and 
genlmsg_unicast() that do GFP_KERNEL allocations. If the oom-killer 
decides to kill tasks to satisfy the allocations,the exit of those 
tasks could block on the same semphore.

The down_write() was only needed to allow removal of invalid 
listeners from the listener list. The patch converts the down_write 
to a down_read and defers the removal to a separate critical region. 
This ensures that even if the oom-killer is called, no other task's 
exit is blocked as it can still acquire another down_read.

Thanks to Andrew Morton & Herbert Xu for pointing out the oom 
related pitfalls, and to Chandra Seetharaman for suggesting this 
fix instead of using something more complex like RCU.

Signed-Off-By: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-Off-By: Shailabh Nagar <nagar@watson.ibm.com>

---

 kernel/taskstats.c |   24 +++++++++++++++++++-----
 1 files changed, 19 insertions(+), 5 deletions(-)

Index: linux-2.6.18-rc1/kernel/taskstats.c
===================================================================
--- linux-2.6.18-rc1.orig/kernel/taskstats.c	2006-07-11 00:13:00.000000000 -0400
+++ linux-2.6.18-rc1/kernel/taskstats.c	2006-07-12 00:07:53.000000000 -0400
@@ -51,6 +51,7 @@ __read_mostly = {
 struct listener {
 	struct list_head list;
 	pid_t pid;
+	char valid;
 };
 
 struct listener_list {
@@ -127,7 +128,7 @@ static int send_cpu_listeners(struct sk_
 	struct listener *s, *tmp;
 	struct sk_buff *skb_next, *skb_cur = skb;
 	void *reply = genlmsg_data(genlhdr);
-	int rc, ret;
+	int rc, ret, delcount = 0;
 
 	rc = genlmsg_end(skb, reply);
 	if (rc < 0) {
@@ -137,7 +138,7 @@ static int send_cpu_listeners(struct sk_
 
 	rc = 0;
 	listeners = &per_cpu(listener_array, cpu);
-	down_write(&listeners->sem);
+	down_read(&listeners->sem);
 	list_for_each_entry_safe(s, tmp, &listeners->list, list) {
 		skb_next = NULL;
 		if (!list_islast(&s->list, &listeners->list)) {
@@ -150,14 +151,26 @@ static int send_cpu_listeners(struct sk_
 		}
 		ret = genlmsg_unicast(skb_cur, s->pid);
 		if (ret == -ECONNREFUSED) {
-			list_del(&s->list);
-			kfree(s);
+			s->valid = 0;
+			delcount++;
 			rc = ret;
 		}
 		skb_cur = skb_next;
 	}
-	up_write(&listeners->sem);
+	up_read(&listeners->sem);
+
+	if (!delcount)
+		return rc;
 
+	/* Delete invalidated entries */
+	down_write(&listeners->sem);
+	list_for_each_entry_safe(s, tmp, &listeners->list, list) {
+		if (!s->valid) {
+			list_del(&s->list);
+			kfree(s);
+		}
+	}
+	up_write(&listeners->sem);
 	return rc;
 }
 
@@ -290,6 +303,7 @@ static int add_del_listener(pid_t pid, c
 				goto cleanup;
 			s->pid = pid;
 			INIT_LIST_HEAD(&s->list);
+			s->valid = 1;
 
 			listeners = &per_cpu(listener_array, cpu);
 			down_write(&listeners->sem);



^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2006-07-12  5:41 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <1152591838.14142.114.camel@localhost.localdomain>
2006-07-11  4:36 ` [Patch 6/6] per task delay accounting taskstats interface: fix clone skbs for each listener Shailabh Nagar
2006-07-11 10:05   ` Andrew Morton
2006-07-11 10:28     ` Herbert Xu
2006-07-11 10:57       ` Andrew Morton
2006-07-11 11:15         ` Herbert Xu
2006-07-11 17:44           ` Shailabh Nagar
2006-07-12  5:41           ` Shailabh Nagar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).