public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* Re: [RFT 2/4] Add mod_timer_noact
       [not found]     ` <20090218.013007.117003889.davem@davemloft.net>
@ 2009-02-18 11:01       ` Ingo Molnar
  2009-02-18 11:39         ` Jarek Poplawski
                           ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Ingo Molnar @ 2009-02-18 11:01 UTC (permalink / raw)
  To: David Miller
  Cc: shemminger, kaber, rick.jones2, dada1, netdev, netfilter-devel,
	tglx, gandalf, linux-kernel


* David Miller <davem@davemloft.net> wrote:

> From: Ingo Molnar <mingo@elte.hu>
> Date: Wed, 18 Feb 2009 10:20:41 +0100
> 
> > Why dont you use something like this instead:
> > 
> > 	if (del_timer(timer))
> > 		add_timer(timer);
> > 
> > ?
> 
> Why don't you read his commit message?

Uhm, of course i have read this piece of non-info:

| Introduce mod_timer_noact() which for example is to replace 
| the calls to del_timer()/add_timer() in 
| __nf_ct_refresh_acct(). It works like mod_timer() but doesn't 
| activate or modify the timeout of an inactive timer which is 
| the behaviour we want in order to be able to use timers as a 
| means of synchronization in nf_conntrack.

It does not mention the overhead to the regular timer interfaces 
at all, nor does it explain the reasons for this change 
adequately.

And that's why i'm asking, why is the sequence i suggested above 
inadequate? If del_timer(timer) returns 1 it means the timer was 
active - and we call add_timer() only in that case. I.e. we dont 
activate or modify the timeout of an inactive timer.

It can _only_ make a difference in the narrow special case of 
code using the timer list lock as serialization: but that is a 
pretty poor solution in this proposed form as it slows down the 
other 2000 users of timers for no good reason. The changelog was 
completely silent about that overhead aspect (which is small but 
real), and i refuse to believe that this effect was not 
realized.

In other words, the changelog is useless and even borderline 
deceptive. Not a good sign if you are trying to get a patch 
accepted to the kernel.

Furthermore, no performance figures were posted along with this 
modification - it only stated that these are "performance 
improvements". Especially in cases where a change slows down the 
common case the showing of a very substantial performance 
benefit is a must-have, before a patch is considered for 
upstream merging.

You might be aware of that and you might have planned to provide 
such info in the future, but the changelog and the submission 
does not show any realization of this necessity, so i'm asking 
for that here out of caution, to make sure it's done.

In fact, the submission incorrectly stated:

| This patch set is against Patrick's netfilter next tree since
| it is where it should end up.
|      
| git.kernel.org:/pub/scm/linux/kernel/git/kaber/nf-next-2.6.git

This is wrong, the "netfilter next tree" is not where the "Add 
mod_timer_noact" change should end up, and you should ask your 
contributors to submit changes to other subsystems to their 
respective maintainer trees - the timer tree in this case.

> At least show him that much respect if you're going to be 
> against his patch.

Firstly, let me make clear what happened here.

Deep buried inside a networking patchset, Cc:-ed to the netdev 
and netfilter lists only, a core kernel change is embedded that 
in essence modifies 2000 callsites of the generic kernel. The 
patch was not Cc:-ed to lkml.

Secondly, all i'm doing here is reviewing patches of subsystems 
i maintain, so please stop attacking me for doing my job.

I noticed it because i read a lot of lists, but still, this was 
not done transparently at all. Please show minimal respect to 
Linux and post core kernel patches to lkml, and ask your 
sub-maintainers to do likewise. If there's someone here who has 
a moral basis for flaming here it's me, not you.

So, please, at minimum, follow the following well-established 
protocol of contribution:

 - Post timer patches to lkml (the mailing list mentioned in the 
   MAINTAINERS file), just like you expect networking patches to 
   be posted to netdev. It's basic courtesy and not doing so is 
   at minimum a double standard.

 - Declare negative performance impact to the common case very 
   prominently in the changelog, and include analysis about why 
   it's worth paying the price.

 - Include measurements that show clear positive performance
   impact at the new usage site - which offsets the negative 
   micro-costs that every other usage site pays.

 - Require your sub-contributors to write meaningful changelogs,
   that mention every substantial effect of a change, especially 
   when they change core kernel facilities. For example:

      Impact: add new API, slow down old APIs a tiny bit

   Would have alerted people straight away. I had to read the 
   actual patch to figure out this key information.

I'm also utterly puzzled by your apparent desire to flame me. 
This patch is wrong on so many levels that it's not even funny - 
and you as a long-time kernel contributor should have realized 
that straight away. Instead you forced me into wasting time on 
this rather long email (and you also forced the very unnecessary 
public embarrasment of a contributor), for what should have been 
an 'oops, right, will fix' routine matter.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFT 2/4] Add mod_timer_noact
  2009-02-18 11:01       ` [RFT 2/4] Add mod_timer_noact Ingo Molnar
@ 2009-02-18 11:39         ` Jarek Poplawski
  2009-02-18 12:37           ` Ingo Molnar
  2009-02-18 12:33         ` Patrick McHardy
  2009-02-18 21:39         ` David Miller
  2 siblings, 1 reply; 16+ messages in thread
From: Jarek Poplawski @ 2009-02-18 11:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Miller, shemminger, kaber, rick.jones2, dada1, netdev,
	netfilter-devel, tglx, gandalf, linux-kernel

On 18-02-2009 12:01, Ingo Molnar wrote:
...
> that straight away. Instead you forced me into wasting time on 
> this rather long email (and you also forced the very unnecessary 
> public embarrasment of a contributor), for what should have been 
> an 'oops, right, will fix' routine matter.

No problem! But next time use this shorter routine, please...

Thanks,
Jarek P.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [patch] timers: add mod_timer_pending()
       [not found]     ` <499BDDFE.5010101@trash.net>
@ 2009-02-18 12:05       ` Ingo Molnar
       [not found]         ` <499C000A.4040205@trash.net>
  0 siblings, 1 reply; 16+ messages in thread
From: Ingo Molnar @ 2009-02-18 12:05 UTC (permalink / raw)
  To: Patrick McHardy, Oleg Nesterov, Peter Zijlstra
  Cc: Stephen Hemminger, David Miller, Rick Jones, Eric Dumazet, netdev,
	netfilter-devel, tglx, Martin Josefsson


* Patrick McHardy <kaber@trash.net> wrote:

> Ingo Molnar wrote:
>>> -extern int __mod_timer(struct timer_list *timer, unsigned long expires);
>>> +extern int __mod_timer(struct timer_list *timer, unsigned long expires, int activate);
>>
>> This is not really acceptable, it slows down every single add_timer() 
>> and mod_timer() call in the kernel with a flag that has one specific 
>> value in all but your case. There's more than 2000 such callsites in 
>> the kernel.
>>
>> Why dont you use something like this instead:
>>
>> 	if (del_timer(timer))
>> 		add_timer(timer);
>
> We need to avoid having a timer that was deleted by one CPU
> getting re-added by another, but want to avoid taking the
> conntrack lock for every timer update. The timer-internal
> locking is enough for this as long as we have a mod_timer
> variant that forwards a timer, but doesn't activate it in
> case it isn't active already.

that makes sense - but the implementation is still somewhat 
ugly. How about the one below instead? Not tested.

One open question is this construct in mod_timer():

+	/*
+	 * This is a common optimization triggered by the
+	 * networking code - if the timer is re-modified
+	 * to be the same thing then just return:
+	 */
+	if (timer->expires == expires && timer_pending(timer))
+		return 1;

We've had this for ages, but it seems rather SMP-unsafe. 
timer_pending(), if used in an unserialized fashion, can be any 
random value in theory - there's no internal serialization here 
anywhere.

We could end up with incorrectly not re-activating a timer in 
mod_timer() for example - have such things never been observed 
in practice?

So the original patch which added this to mod_timer_noact() was 
racy i think, and we cannot preserve this optimization outside 
of the timer list lock. (we could do it inside of it.)

	Ingo

------------------->
Subject: timers: add mod_timer_pending()
From: Ingo Molnar <mingo@elte.hu>
Date: Wed, 18 Feb 2009 12:23:29 +0100

Impact: new timer API

Based on an idea from Stephen Hemminger: introduce
 mod_timer_pending() which is a mod_timer() offspring
that is an invariant on already removed timers.

(regular mod_timer() re-activates non-pending timers.)

This is useful for the networking code in that it can
allow unserialized mod_timer_pending() timer-forwarding
calls, but a single del_timer*() will stop the timer
from being reactivated again.

Also while at it:

- optimize the regular mod_timer() path some more, the
  timer-stat and a debug check was needlessly duplicated
  in __mod_timer().

- make the exports come straight after the function, as
  most other exports in timer.c already did.

- eliminate __mod_timer() as an external API, change the
  users to mod_timer().

The regular mod_timer() code path is not impacted
significantly, due to inlining optimizations and due to
the simplifications - but performance testing would be nice
nevertheless.

Based-on-patch-from: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/powerpc/platforms/cell/spufs/sched.c  |    2 
 drivers/infiniband/hw/ipath/ipath_driver.c |    6 -
 include/linux/timer.h                      |   22 -----
 kernel/relay.c                             |    2 
 kernel/timer.c                             |  110 +++++++++++++++++++----------
 5 files changed, 80 insertions(+), 62 deletions(-)

Index: linux/arch/powerpc/platforms/cell/spufs/sched.c
===================================================================
--- linux.orig/arch/powerpc/platforms/cell/spufs/sched.c
+++ linux/arch/powerpc/platforms/cell/spufs/sched.c
@@ -508,7 +508,7 @@ static void __spu_add_to_rq(struct spu_c
 		list_add_tail(&ctx->rq, &spu_prio->runq[ctx->prio]);
 		set_bit(ctx->prio, spu_prio->bitmap);
 		if (!spu_prio->nr_waiting++)
-			__mod_timer(&spusched_timer, jiffies + SPUSCHED_TICK);
+			mod_timer(&spusched_timer, jiffies + SPUSCHED_TICK);
 	}
 }
 
Index: linux/drivers/infiniband/hw/ipath/ipath_driver.c
===================================================================
--- linux.orig/drivers/infiniband/hw/ipath/ipath_driver.c
+++ linux/drivers/infiniband/hw/ipath/ipath_driver.c
@@ -2715,7 +2715,7 @@ static void ipath_hol_signal_up(struct i
  * to prevent HoL blocking, then start the HoL timer that
  * periodically continues, then stop procs, so they can detect
  * link down if they want, and do something about it.
- * Timer may already be running, so use __mod_timer, not add_timer.
+ * Timer may already be running, so use mod_timer, not add_timer.
  */
 void ipath_hol_down(struct ipath_devdata *dd)
 {
@@ -2724,7 +2724,7 @@ void ipath_hol_down(struct ipath_devdata
 	dd->ipath_hol_next = IPATH_HOL_DOWNCONT;
 	dd->ipath_hol_timer.expires = jiffies +
 		msecs_to_jiffies(ipath_hol_timeout_ms);
-	__mod_timer(&dd->ipath_hol_timer, dd->ipath_hol_timer.expires);
+	mod_timer(&dd->ipath_hol_timer, dd->ipath_hol_timer.expires);
 }
 
 /*
@@ -2763,7 +2763,7 @@ void ipath_hol_event(unsigned long opaqu
 	else {
 		dd->ipath_hol_timer.expires = jiffies +
 			msecs_to_jiffies(ipath_hol_timeout_ms);
-		__mod_timer(&dd->ipath_hol_timer,
+		mod_timer(&dd->ipath_hol_timer,
 			dd->ipath_hol_timer.expires);
 	}
 }
Index: linux/include/linux/timer.h
===================================================================
--- linux.orig/include/linux/timer.h
+++ linux/include/linux/timer.h
@@ -161,8 +161,8 @@ static inline int timer_pending(const st
 
 extern void add_timer_on(struct timer_list *timer, int cpu);
 extern int del_timer(struct timer_list * timer);
-extern int __mod_timer(struct timer_list *timer, unsigned long expires);
 extern int mod_timer(struct timer_list *timer, unsigned long expires);
+extern int mod_timer_pending(struct timer_list *timer, unsigned long expires);
 
 /*
  * The jiffies value which is added to now, when there is no timer
@@ -221,25 +221,7 @@ static inline void timer_stats_timer_cle
 }
 #endif
 
-/**
- * add_timer - start a timer
- * @timer: the timer to be added
- *
- * The kernel will do a ->function(->data) callback from the
- * timer interrupt at the ->expires point in the future. The
- * current time is 'jiffies'.
- *
- * The timer's ->expires, ->function (and if the handler uses it, ->data)
- * fields must be set prior calling this function.
- *
- * Timers with an ->expires field in the past will be executed in the next
- * timer tick.
- */
-static inline void add_timer(struct timer_list *timer)
-{
-	BUG_ON(timer_pending(timer));
-	__mod_timer(timer, timer->expires);
-}
+extern void add_timer(struct timer_list *timer);
 
 #ifdef CONFIG_SMP
   extern int try_to_del_timer_sync(struct timer_list *timer);
Index: linux/kernel/relay.c
===================================================================
--- linux.orig/kernel/relay.c
+++ linux/kernel/relay.c
@@ -748,7 +748,7 @@ size_t relay_switch_subbuf(struct rchan_
 			 * from the scheduler (trying to re-grab
 			 * rq->lock), so defer it.
 			 */
-			__mod_timer(&buf->timer, jiffies + 1);
+			mod_timer(&buf->timer, jiffies + 1);
 	}
 
 	old = buf->data;
Index: linux/kernel/timer.c
===================================================================
--- linux.orig/kernel/timer.c
+++ linux/kernel/timer.c
@@ -600,11 +600,14 @@ static struct tvec_base *lock_timer_base
 	}
 }
 
-int __mod_timer(struct timer_list *timer, unsigned long expires)
+static inline int
+__mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only)
 {
 	struct tvec_base *base, *new_base;
 	unsigned long flags;
-	int ret = 0;
+	int ret;
+
+	ret = 0;
 
 	timer_stats_timer_set_start_info(timer);
 	BUG_ON(!timer->function);
@@ -614,6 +617,9 @@ int __mod_timer(struct timer_list *timer
 	if (timer_pending(timer)) {
 		detach_timer(timer, 0);
 		ret = 1;
+	} else {
+		if (pending_only)
+			goto out_unlock;
 	}
 
 	debug_timer_activate(timer);
@@ -640,42 +646,28 @@ int __mod_timer(struct timer_list *timer
 
 	timer->expires = expires;
 	internal_add_timer(base, timer);
+
+out_unlock:
 	spin_unlock_irqrestore(&base->lock, flags);
 
 	return ret;
 }
 
-EXPORT_SYMBOL(__mod_timer);
-
 /**
- * add_timer_on - start a timer on a particular CPU
- * @timer: the timer to be added
- * @cpu: the CPU to start it on
+ * mod_timer_pending - modify a pending timer's timeout
+ * @timer: the pending timer to be modified
+ * @expires: new timeout in jiffies
  *
- * This is not very scalable on SMP. Double adds are not possible.
+ * mod_timer_pending() is the same for pending timers as mod_timer(),
+ * but will not re-activate and modify already deleted timers.
+ *
+ * It is useful for unserialized use of timers.
  */
-void add_timer_on(struct timer_list *timer, int cpu)
+int mod_timer_pending(struct timer_list *timer, unsigned long expires)
 {
-	struct tvec_base *base = per_cpu(tvec_bases, cpu);
-	unsigned long flags;
-
-	timer_stats_timer_set_start_info(timer);
-	BUG_ON(timer_pending(timer) || !timer->function);
-	spin_lock_irqsave(&base->lock, flags);
-	timer_set_base(timer, base);
-	debug_timer_activate(timer);
-	internal_add_timer(base, timer);
-	/*
-	 * Check whether the other CPU is idle and needs to be
-	 * triggered to reevaluate the timer wheel when nohz is
-	 * active. We are protected against the other CPU fiddling
-	 * with the timer by holding the timer base lock. This also
-	 * makes sure that a CPU on the way to idle can not evaluate
-	 * the timer wheel.
-	 */
-	wake_up_idle_cpu(cpu);
-	spin_unlock_irqrestore(&base->lock, flags);
+	return __mod_timer(timer, expires, true);
 }
+EXPORT_SYMBOL(mod_timer_pending);
 
 /**
  * mod_timer - modify a timer's timeout
@@ -699,9 +691,6 @@ void add_timer_on(struct timer_list *tim
  */
 int mod_timer(struct timer_list *timer, unsigned long expires)
 {
-	BUG_ON(!timer->function);
-
-	timer_stats_timer_set_start_info(timer);
 	/*
 	 * This is a common optimization triggered by the
 	 * networking code - if the timer is re-modified
@@ -710,12 +699,62 @@ int mod_timer(struct timer_list *timer, 
 	if (timer->expires == expires && timer_pending(timer))
 		return 1;
 
-	return __mod_timer(timer, expires);
+	return __mod_timer(timer, expires, false);
 }
-
 EXPORT_SYMBOL(mod_timer);
 
 /**
+ * add_timer - start a timer
+ * @timer: the timer to be added
+ *
+ * The kernel will do a ->function(->data) callback from the
+ * timer interrupt at the ->expires point in the future. The
+ * current time is 'jiffies'.
+ *
+ * The timer's ->expires, ->function (and if the handler uses it, ->data)
+ * fields must be set prior calling this function.
+ *
+ * Timers with an ->expires field in the past will be executed in the next
+ * timer tick.
+ */
+void add_timer(struct timer_list *timer)
+{
+	BUG_ON(timer_pending(timer));
+	mod_timer(timer, timer->expires);
+}
+EXPORT_SYMBOL(add_timer);
+
+/**
+ * add_timer_on - start a timer on a particular CPU
+ * @timer: the timer to be added
+ * @cpu: the CPU to start it on
+ *
+ * This is not very scalable on SMP. Double adds are not possible.
+ */
+void add_timer_on(struct timer_list *timer, int cpu)
+{
+	struct tvec_base *base = per_cpu(tvec_bases, cpu);
+	unsigned long flags;
+
+	timer_stats_timer_set_start_info(timer);
+	BUG_ON(timer_pending(timer) || !timer->function);
+	spin_lock_irqsave(&base->lock, flags);
+	timer_set_base(timer, base);
+	debug_timer_activate(timer);
+	internal_add_timer(base, timer);
+	/*
+	 * Check whether the other CPU is idle and needs to be
+	 * triggered to reevaluate the timer wheel when nohz is
+	 * active. We are protected against the other CPU fiddling
+	 * with the timer by holding the timer base lock. This also
+	 * makes sure that a CPU on the way to idle can not evaluate
+	 * the timer wheel.
+	 */
+	wake_up_idle_cpu(cpu);
+	spin_unlock_irqrestore(&base->lock, flags);
+}
+
+/**
  * del_timer - deactive a timer.
  * @timer: the timer to be deactivated
  *
@@ -744,7 +783,6 @@ int del_timer(struct timer_list *timer)
 
 	return ret;
 }
-
 EXPORT_SYMBOL(del_timer);
 
 #ifdef CONFIG_SMP
@@ -778,7 +816,6 @@ out:
 
 	return ret;
 }
-
 EXPORT_SYMBOL(try_to_del_timer_sync);
 
 /**
@@ -816,7 +853,6 @@ int del_timer_sync(struct timer_list *ti
 		cpu_relax();
 	}
 }
-
 EXPORT_SYMBOL(del_timer_sync);
 #endif
 
@@ -1314,7 +1350,7 @@ signed long __sched schedule_timeout(sig
 	expire = timeout + jiffies;
 
 	setup_timer_on_stack(&timer, process_timeout, (unsigned long)current);
-	__mod_timer(&timer, expire);
+	__mod_timer(&timer, expire, false);
 	schedule();
 	del_singleshot_timer_sync(&timer);
 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFT 2/4] Add mod_timer_noact
  2009-02-18 11:01       ` [RFT 2/4] Add mod_timer_noact Ingo Molnar
  2009-02-18 11:39         ` Jarek Poplawski
@ 2009-02-18 12:33         ` Patrick McHardy
  2009-02-18 21:39         ` David Miller
  2 siblings, 0 replies; 16+ messages in thread
From: Patrick McHardy @ 2009-02-18 12:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Miller, shemminger, rick.jones2, dada1, netdev,
	netfilter-devel, tglx, gandalf, linux-kernel

Ingo Molnar wrote:
> In other words, the changelog is useless and even borderline 
> deceptive. Not a good sign if you are trying to get a patch 
> accepted to the kernel.
> 
> Furthermore, no performance figures were posted along with this 
> modification - it only stated that these are "performance 
> improvements". Especially in cases where a change slows down the 
> common case the showing of a very substantial performance 
> benefit is a must-have, before a patch is considered for 
> upstream merging.

I think this is mainly a misunderstanding, Stephen posted these
patches as RFT so Rick and Eric could do benchmarks, they were
not intended for merging at this time.

> In fact, the submission incorrectly stated:
> 
> | This patch set is against Patrick's netfilter next tree since
> | it is where it should end up.
> |      
> | git.kernel.org:/pub/scm/linux/kernel/git/kaber/nf-next-2.6.git
> 
> This is wrong, the "netfilter next tree" is not where the "Add 
> mod_timer_noact" change should end up, and you should ask your 
> contributors to submit changes to other subsystems to their 
> respective maintainer trees - the timer tree in this case.

Absolutely, I wouldn't have taken it, and Dave wouldn't have taken
it from me, so no cause for alarm :)



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFT 2/4] Add mod_timer_noact
  2009-02-18 11:39         ` Jarek Poplawski
@ 2009-02-18 12:37           ` Ingo Molnar
  0 siblings, 0 replies; 16+ messages in thread
From: Ingo Molnar @ 2009-02-18 12:37 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: David Miller, shemminger, kaber, rick.jones2, dada1, netdev,
	netfilter-devel, tglx, gandalf, linux-kernel


* Jarek Poplawski <jarkao2@gmail.com> wrote:

> On 18-02-2009 12:01, Ingo Molnar wrote:
> ...
> > that straight away. Instead you forced me into wasting time on 
> > this rather long email (and you also forced the very unnecessary 
> > public embarrasment of a contributor), for what should have been 
> > an 'oops, right, will fix' routine matter.
> 
> No problem! But next time use this shorter routine, please...

Correct, the "oops, right, will fix" should have come as a reply 
to my mail, obviously - i did not submit the patch after all. 
Instead i got this accusatory mail from davem which certainly 
did not help bring the issue forward ...

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch] timers: add mod_timer_pending()
       [not found]         ` <499C000A.4040205@trash.net>
@ 2009-02-18 12:50           ` Ingo Molnar
  2009-02-18 12:54             ` Patrick McHardy
  0 siblings, 1 reply; 16+ messages in thread
From: Ingo Molnar @ 2009-02-18 12:50 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Oleg Nesterov, Peter Zijlstra, Stephen Hemminger, David Miller,
	Rick Jones, Eric Dumazet, netdev, netfilter-devel, tglx,
	Martin Josefsson, linux-kernel


* Patrick McHardy <kaber@trash.net> wrote:

> Ingo Molnar wrote:
>> * Patrick McHardy <kaber@trash.net> wrote:
>>
>>> We need to avoid having a timer that was deleted by one CPU
>>> getting re-added by another, but want to avoid taking the
>>> conntrack lock for every timer update. The timer-internal
>>> locking is enough for this as long as we have a mod_timer
>>> variant that forwards a timer, but doesn't activate it in
>>> case it isn't active already.
>>
>> that makes sense - but the implementation is still somewhat ugly. How 
>> about the one below instead? Not tested.
>
> This seems to fulfill our needs. I also like the mod_timer_pending()
> name better than mod_timer_noact().
>
>> One open question is this construct in mod_timer():
>>
>> +	/*
>> +	 * This is a common optimization triggered by the
>> +	 * networking code - if the timer is re-modified
>> +	 * to be the same thing then just return:
>> +	 */
>> +	if (timer->expires == expires && timer_pending(timer))
>> +		return 1;
>>
>> We've had this for ages, but it seems rather SMP-unsafe.  
>> timer_pending(), if used in an unserialized fashion, can be any random 
>> value in theory - there's no internal serialization here anywhere.
>>
>> We could end up with incorrectly not re-activating a timer in  
>> mod_timer() for example - have such things never been observed in 
>> practice?
>
> Yes, it seems racy if done for timers that might get 
> activated. For forwarding only without activation it seems OK, 
> in that case the timer_pending check doesn't seem necessary at 
> all.

ok.

To accelerate matters i've committed the new API patch into a 
new standalone topic branch: tip:timers/new-apis.

Unless there are objections or test failures, you (or Stephen or 
David) can then git-pull it into the networking tree via the Git 
coordinates below - and you'll get this single commit in a 
surgical manner - no other timer changes are included.

Doing so has the advantage of:

- You not having to wait a kernel cycle for the API to go
  upstream.

- You can also push it upstream without waiting for the timer 
  tree. (the timer tree and the networking tree will share the 
  exact same commit)

- It will also all merge cleanly with the timer tree in 
  linux-next, etc.

I'd suggest to do it in about a week, to make sure any after 
effects have trickled down and to make sure the topic has become 
append-only. You can ping Thomas and me about testing/review 
status then, whenever you want to do the pull.

	Ingo

------------->

You can pull the latest timers/new-apis git tree from:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git timers/new-apis

 Thanks,

	Ingo

------------------>
Ingo Molnar (1):
      timers: add mod_timer_pending()


 arch/powerpc/platforms/cell/spufs/sched.c  |    2 +-
 drivers/infiniband/hw/ipath/ipath_driver.c |    6 +-
 include/linux/timer.h                      |   22 +-----
 kernel/relay.c                             |    2 +-
 kernel/timer.c                             |  110 ++++++++++++++++++---------
 5 files changed, 80 insertions(+), 62 deletions(-)


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch] timers: add mod_timer_pending()
  2009-02-18 12:50           ` Ingo Molnar
@ 2009-02-18 12:54             ` Patrick McHardy
  2009-02-18 13:47               ` Ingo Molnar
  0 siblings, 1 reply; 16+ messages in thread
From: Patrick McHardy @ 2009-02-18 12:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Oleg Nesterov, Peter Zijlstra, Stephen Hemminger, David Miller,
	Rick Jones, Eric Dumazet, netdev, netfilter-devel, tglx,
	Martin Josefsson, linux-kernel

Ingo Molnar wrote:
> To accelerate matters i've committed the new API patch into a 
> new standalone topic branch: tip:timers/new-apis.
> 
> Unless there are objections or test failures, you (or Stephen or 
> David) can then git-pull it into the networking tree via the Git 
> coordinates below - and you'll get this single commit in a 
> surgical manner - no other timer changes are included.
> 
> Doing so has the advantage of:
> 
> - You not having to wait a kernel cycle for the API to go
>   upstream.
> 
> - You can also push it upstream without waiting for the timer 
>   tree. (the timer tree and the networking tree will share the 
>   exact same commit)
> 
> - It will also all merge cleanly with the timer tree in 
>   linux-next, etc.
> 
> I'd suggest to do it in about a week, to make sure any after 
> effects have trickled down and to make sure the topic has become 
> append-only. You can ping Thomas and me about testing/review 
> status then, whenever you want to do the pull.

Thanks Ingo. I'll wait for Stephen to rebase his patches on
top of your change and the test results and will let you know.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch] timers: add mod_timer_pending()
  2009-02-18 12:54             ` Patrick McHardy
@ 2009-02-18 13:47               ` Ingo Molnar
  0 siblings, 0 replies; 16+ messages in thread
From: Ingo Molnar @ 2009-02-18 13:47 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Oleg Nesterov, Peter Zijlstra, Stephen Hemminger, David Miller,
	Rick Jones, Eric Dumazet, netdev, netfilter-devel, tglx,
	Martin Josefsson, linux-kernel


* Patrick McHardy <kaber@trash.net> wrote:

> Ingo Molnar wrote:
>> To accelerate matters i've committed the new API patch into a new 
>> standalone topic branch: tip:timers/new-apis.
>>
>> Unless there are objections or test failures, you (or Stephen or  
>> David) can then git-pull it into the networking tree via the Git  
>> coordinates below - and you'll get this single commit in a surgical 
>> manner - no other timer changes are included.
>>
>> Doing so has the advantage of:
>>
>> - You not having to wait a kernel cycle for the API to go
>>   upstream.
>>
>> - You can also push it upstream without waiting for the timer   tree. 
>> (the timer tree and the networking tree will share the   exact same 
>> commit)
>>
>> - It will also all merge cleanly with the timer tree in   linux-next, 
>> etc.
>>
>> I'd suggest to do it in about a week, to make sure any after effects 
>> have trickled down and to make sure the topic has become append-only. 
>> You can ping Thomas and me about testing/review status then, whenever 
>> you want to do the pull.
>
> Thanks Ingo. I'll wait for Stephen to rebase his patches on 
> top of your change and the test results and will let you know.

Stress-testing here in the last ~2 hours on eight x86 test-boxes 
showed no problems so far.

	Ingo

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFT 2/4] Add mod_timer_noact
  2009-02-18 11:01       ` [RFT 2/4] Add mod_timer_noact Ingo Molnar
  2009-02-18 11:39         ` Jarek Poplawski
  2009-02-18 12:33         ` Patrick McHardy
@ 2009-02-18 21:39         ` David Miller
  2009-02-18 21:51           ` Ingo Molnar
  2 siblings, 1 reply; 16+ messages in thread
From: David Miller @ 2009-02-18 21:39 UTC (permalink / raw)
  To: mingo
  Cc: shemminger, kaber, rick.jones2, dada1, netdev, netfilter-devel,
	tglx, gandalf, linux-kernel

From: Ingo Molnar <mingo@elte.hu>
Date: Wed, 18 Feb 2009 12:01:44 +0100

> * David Miller <davem@davemloft.net> wrote:
> 
> | Introduce mod_timer_noact() which for example is to replace 
> | the calls to del_timer()/add_timer() in 
> | __nf_ct_refresh_acct(). It works like mod_timer() but doesn't 
> | activate or modify the timeout of an inactive timer which is 
> | the behaviour we want in order to be able to use timers as a 
> | means of synchronization in nf_conntrack.
> 
> It does not mention the overhead to the regular timer interfaces 
> at all, nor does it explain the reasons for this change 
> adequately.

You (conveniently) skipped this part of his commit message, so
I guess this is the part you didn't read very carefully:

	A later patch will modify __nf_ct_refresh_acct() to use
	mod_timer_noact() which will then save one spin_lock_irqsave()
	/ spin_lock_irqrestore() pair per conntrack timer update. This
	will also get rid of the race we currently have without adding
	more locking in nf_conntrack.

The whole point is to avoid two spin_lock_irqsave() sequences, thus
taking the timer locks twice.

So Ingo, when you say in response:

	Why don't you use?

		if (del_timer())
			add_timer();

you really look foolish and, in fact, disrespectful to Stephen.

This was my objection to your email, it proved that you didn't
really read his changelog message.  He explained perfectly well
what the final goal was of his changes.

And you have this knee-jerk reaction quite often.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFT 2/4] Add mod_timer_noact
  2009-02-18 21:39         ` David Miller
@ 2009-02-18 21:51           ` Ingo Molnar
  2009-02-18 22:04             ` David Miller
  0 siblings, 1 reply; 16+ messages in thread
From: Ingo Molnar @ 2009-02-18 21:51 UTC (permalink / raw)
  To: David Miller
  Cc: shemminger, kaber, rick.jones2, dada1, netdev, netfilter-devel,
	tglx, gandalf, linux-kernel


* David Miller <davem@davemloft.net> wrote:

> From: Ingo Molnar <mingo@elte.hu>
> Date: Wed, 18 Feb 2009 12:01:44 +0100
> 
> > * David Miller <davem@davemloft.net> wrote:
> > 
> > | Introduce mod_timer_noact() which for example is to replace 
> > | the calls to del_timer()/add_timer() in 
> > | __nf_ct_refresh_acct(). It works like mod_timer() but doesn't 
> > | activate or modify the timeout of an inactive timer which is 
> > | the behaviour we want in order to be able to use timers as a 
> > | means of synchronization in nf_conntrack.
> > 
> > It does not mention the overhead to the regular timer interfaces 
> > at all, nor does it explain the reasons for this change 
> > adequately.
> 
> You (conveniently) skipped this part of his commit message, so
> I guess this is the part you didn't read very carefully:
> 
> 	A later patch will modify __nf_ct_refresh_acct() to use
> 	mod_timer_noact() which will then save one spin_lock_irqsave()
> 	/ spin_lock_irqrestore() pair per conntrack timer update. This
> 	will also get rid of the race we currently have without adding
> 	more locking in nf_conntrack.
> 
> The whole point is to avoid two spin_lock_irqsave() sequences, thus
> taking the timer locks twice.
> 
> So Ingo, when you say in response:
> 
> 	Why don't you use?
> 
> 		if (del_timer())
> 			add_timer();
> 
> you really look foolish and, in fact, disrespectful to Stephen.
> 
> This was my objection to your email, it proved that you didn't
> really read his changelog message.  He explained perfectly well
> what the final goal was of his changes.
> 
> And you have this knee-jerk reaction quite often.

You accusing me of knee-jerk reaction is the joke of the century 
;-)

Anyway, it's all handled, you just need to read the rest of the 
thread.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFT 2/4] Add mod_timer_noact
  2009-02-18 21:51           ` Ingo Molnar
@ 2009-02-18 22:04             ` David Miller
  2009-02-18 22:42               ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: David Miller @ 2009-02-18 22:04 UTC (permalink / raw)
  To: mingo
  Cc: shemminger, kaber, rick.jones2, dada1, netdev, netfilter-devel,
	tglx, gandalf, linux-kernel

From: Ingo Molnar <mingo@elte.hu>
Date: Wed, 18 Feb 2009 22:51:40 +0100

> Anyway, it's all handled, you just need to read the rest of the 
> thread.

I did read the entire thread before replying, my objection
to your original posting still standed.

And as others have pointed out you also failed to recognize
the context of the patch posting.  It was part of a sequence
of patches for people to test some experimental netfilter
performance optimizations.  "RFT" was prefixed to every patch
subject line, if any more indication was necessary.

Yet you object that the patches are against the networking
and netfilter trees.

Again, your reactions were knee-jerk, by every definition of the
term.

I know how you work Ingo, you want to be fast and efficient.
But often, your "fast and efficient" is "careless", and this
wastes everyone elses time and in the final analysis makes
you "slow".

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFT 2/4] Add mod_timer_noact
  2009-02-18 22:04             ` David Miller
@ 2009-02-18 22:42               ` Peter Zijlstra
  2009-02-18 22:47                 ` David Miller
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2009-02-18 22:42 UTC (permalink / raw)
  To: David Miller
  Cc: mingo, shemminger, kaber, rick.jones2, dada1, netdev,
	netfilter-devel, tglx, gandalf, linux-kernel

On Wed, 2009-02-18 at 14:04 -0800, David Miller wrote:

> And as others have pointed out you also failed to recognize
> the context of the patch posting.  It was part of a sequence
> of patches for people to test some experimental netfilter
> performance optimizations.  "RFT" was prefixed to every patch
> subject line, if any more indication was necessary.

Be that as it may, its a maintainer seeing a patch against his
subsystem, reviewing it (albeit early -- we should all want to get
around to reviewing that early) and asking for some clarification.

The fact is, Steve's changelog was very unclear to people not intimately
familiar with the problem space. Asking some clarification just isn't
weird in any way.

> Yet you object that the patches are against the networking
> and netfilter trees.
> 
> Again, your reactions were knee-jerk, by every definition of the
> term.
> 
> I know how you work Ingo, you want to be fast and efficient.
> But often, your "fast and efficient" is "careless", and this
> wastes everyone elses time and in the final analysis makes
> you "slow".

Can we please leave it at this, the technical issue seems to be delt
with. You and Ingo seems to have a gift to rub each other the wrong way,
it would be grand if you could both try to be a little forgiving and
just focus on the code/technical issues which makes Linux to what it is,
technically excellent ;-)


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFT 2/4] Add mod_timer_noact
  2009-02-18 22:42               ` Peter Zijlstra
@ 2009-02-18 22:47                 ` David Miller
  2009-02-18 22:56                   ` Stephen Hemminger
  0 siblings, 1 reply; 16+ messages in thread
From: David Miller @ 2009-02-18 22:47 UTC (permalink / raw)
  To: peterz
  Cc: mingo, shemminger, kaber, rick.jones2, dada1, netdev,
	netfilter-devel, tglx, gandalf, linux-kernel

From: Peter Zijlstra <peterz@infradead.org>
Date: Wed, 18 Feb 2009 23:42:27 +0100

> Can we please leave it at this, the technical issue seems to be delt
> with. You and Ingo seems to have a gift to rub each other the wrong way,
> it would be grand if you could both try to be a little forgiving and
> just focus on the code/technical issues which makes Linux to what it is,
> technically excellent ;-)

Like it or not, open source development is as much about people
and their personalitites as it is about technical issues.

So every timeone someone says to concentrate on the technical
issues and get past the personalities, they really are missing
the point, and at best are being naive.

The Linux kernel has been shaped by overtly emotional discourse and
personal interaction as it has been by any technical achievement.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFT 2/4] Add mod_timer_noact
  2009-02-18 22:47                 ` David Miller
@ 2009-02-18 22:56                   ` Stephen Hemminger
  0 siblings, 0 replies; 16+ messages in thread
From: Stephen Hemminger @ 2009-02-18 22:56 UTC (permalink / raw)
  To: David Miller
  Cc: peterz, mingo, kaber, rick.jones2, dada1, netdev, netfilter-devel,
	tglx, gandalf, linux-kernel

On Wed, 18 Feb 2009 14:47:41 -0800 (PST)
David Miller <davem@davemloft.net> wrote:

> From: Peter Zijlstra <peterz@infradead.org>
> Date: Wed, 18 Feb 2009 23:42:27 +0100
> 
> > Can we please leave it at this, the technical issue seems to be delt
> > with. You and Ingo seems to have a gift to rub each other the wrong way,
> > it would be grand if you could both try to be a little forgiving and
> > just focus on the code/technical issues which makes Linux to what it is,
> > technically excellent ;-)
> 
> Like it or not, open source development is as much about people
> and their personalitites as it is about technical issues.
> 
> So every timeone someone says to concentrate on the technical
> issues and get past the personalities, they really are missing
> the point, and at best are being naive.
> 
> The Linux kernel has been shaped by overtly emotional discourse and
> personal interaction as it has been by any technical achievement.

Everyone, please read and internalize what Matt had to say.
He is right, the community needs to learn how to review:
   
http://mdzlog.wordpress.com/2008/12/24/constructive-criticism/

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH] rcu: increment quiescent state counter in ksoftirqd()
       [not found]       ` <49A7F262.8040805@cosmosbay.com>
@ 2009-02-27 16:08         ` Eric Dumazet
  2009-02-27 16:34           ` Paul E. McKenney
  0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2009-02-27 16:08 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Stephen Hemminger, David Miller, Patrick McHardy, Rick Jones,
	netdev, netfilter-devel, linux kernel

Eric Dumazet a écrit :
> Eric Dumazet a écrit :
>> Stephen Hemminger a écrit :
>>> The reader/writer lock in ip_tables is acquired in the critical path of
>>> processing packets and is one of the reasons just loading iptables can cause
>>> a 20% performance loss. The rwlock serves two functions:
>>>
>>> 1) it prevents changes to table state (xt_replace) while table is in use.
>>>    This is now handled by doing rcu on the xt_table. When table is
>>>    replaced, the new table(s) are put in and the old one table(s) are freed
>>>    after RCU period.
>>>
>>> 2) it provides synchronization when accesing the counter values.
>>>    This is now handled by swapping in new table_info entries for each cpu
>>>    then summing the old values, and putting the result back onto one
>>>    cpu.  On a busy system it may cause sampling to occur at different
>>>    times on each cpu, but no packet/byte counts are lost in the process.
>>>
>>> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
>>
>> Acked-by: Eric Dumazet <dada1@cosmosbay.com>
>>
>> Sucessfully tested on my dual quad core machine too, but iptables only (no ipv6 here)
>>
>> BTW, my new "tbench 8" result is 2450 MB/s, (it was 2150 MB/s not so long ago)
>>
>> Thanks Stephen, thats very cool stuff, yet another rwlock out of kernel :)
>>
> 
> While testing multicast flooding stuff, I found that "iptables -nvL" can 
> have a *very* slow response time on my dual quad core machine...
> 
> 
> # time iptables -nvL
> Chain INPUT (policy ACCEPT 416M packets, 64G bytes)
>  pkts bytes target     prot opt in     out     source               destination
> 
> Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
>  pkts bytes target     prot opt in     out     source               destination
> 
> Chain OUTPUT (policy ACCEPT 401M packets, 62G bytes)
>  pkts bytes target     prot opt in     out     source               destination
> 
> real    0m1.810s  <<<< HERE >>>>
> user    0m0.000s
> sys     0m0.001s
> 
> 
> CONFIG_NO_HZ=y
> CONFIG_HZ_1000=y
> CONFIG_HZ=1000
> 
> One cpu is 100% handling softirqs, could it be the problem ?
> 
> Cpu0  :  1.0%us, 14.7%sy,  0.0%ni, 83.3%id,  0.0%wa,  0.0%hi,  1.0%si,  0.0%st
> Cpu1  :  3.6%us, 23.2%sy,  0.0%ni, 71.6%id,  0.0%wa,  0.0%hi,  1.7%si,  0.0%st
> Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,100.0%si,  0.0%st
> Cpu3  :  2.7%us, 23.9%sy,  0.0%ni, 71.1%id,  0.7%wa,  0.0%hi,  1.7%si,  0.0%st
> Cpu4  :  1.3%us, 14.3%sy,  0.0%ni, 83.3%id,  0.0%wa,  0.0%hi,  1.0%si,  0.0%st
> Cpu5  :  1.0%us, 14.2%sy,  0.0%ni, 83.4%id,  0.0%wa,  0.0%hi,  1.3%si,  0.0%st
> Cpu6  :  0.3%us,  7.0%sy,  0.0%ni, 92.4%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
> Cpu7  :  0.7%us,  8.0%sy,  0.0%ni, 90.0%id,  0.7%wa,  0.0%hi,  0.7%si,  0.0%st

Hi Paul

I found following patch helps if one cpu is looping inside ksoftirqd()

synchronize_rcu() now completes in 40 ms instead of 1800 ms.

Thank you

[PATCH] rcu: increment quiescent state counter in ksoftirqd()

If a machine is flooded by network frames, a cpu can loop 100% of its time
inside ksoftirqd() without calling schedule().
This can delay RCU grace period to insane values. 

Adding rcu_qsctr_inc() call in ksoftirqd() solves this problem.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
diff --git a/kernel/softirq.c b/kernel/softirq.c
index bdbe9de..9041ea7 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -626,6 +626,7 @@ static int ksoftirqd(void * __bind_cpu)
 			preempt_enable_no_resched();
 			cond_resched();
 			preempt_disable();
+			rcu_qsctr_inc((long)__bind_cpu);
 		}
 		preempt_enable();
 		set_current_state(TASK_INTERRUPTIBLE);


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH] rcu: increment quiescent state counter in ksoftirqd()
  2009-02-27 16:08         ` [PATCH] rcu: increment quiescent state counter in ksoftirqd() Eric Dumazet
@ 2009-02-27 16:34           ` Paul E. McKenney
  0 siblings, 0 replies; 16+ messages in thread
From: Paul E. McKenney @ 2009-02-27 16:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Stephen Hemminger, David Miller, Patrick McHardy, Rick Jones,
	netdev, netfilter-devel, linux kernel

On Fri, Feb 27, 2009 at 05:08:04PM +0100, Eric Dumazet wrote:
> Eric Dumazet a écrit :
> > Eric Dumazet a écrit :
> >> Stephen Hemminger a écrit :
> >>> The reader/writer lock in ip_tables is acquired in the critical path of
> >>> processing packets and is one of the reasons just loading iptables can cause
> >>> a 20% performance loss. The rwlock serves two functions:
> >>>
> >>> 1) it prevents changes to table state (xt_replace) while table is in use.
> >>>    This is now handled by doing rcu on the xt_table. When table is
> >>>    replaced, the new table(s) are put in and the old one table(s) are freed
> >>>    after RCU period.
> >>>
> >>> 2) it provides synchronization when accesing the counter values.
> >>>    This is now handled by swapping in new table_info entries for each cpu
> >>>    then summing the old values, and putting the result back onto one
> >>>    cpu.  On a busy system it may cause sampling to occur at different
> >>>    times on each cpu, but no packet/byte counts are lost in the process.
> >>>
> >>> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
> >>
> >> Acked-by: Eric Dumazet <dada1@cosmosbay.com>
> >>
> >> Sucessfully tested on my dual quad core machine too, but iptables only (no ipv6 here)
> >>
> >> BTW, my new "tbench 8" result is 2450 MB/s, (it was 2150 MB/s not so long ago)
> >>
> >> Thanks Stephen, thats very cool stuff, yet another rwlock out of kernel :)
> >>
> > 
> > While testing multicast flooding stuff, I found that "iptables -nvL" can 
> > have a *very* slow response time on my dual quad core machine...
> > 
> > 
> > # time iptables -nvL
> > Chain INPUT (policy ACCEPT 416M packets, 64G bytes)
> >  pkts bytes target     prot opt in     out     source               destination
> > 
> > Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
> >  pkts bytes target     prot opt in     out     source               destination
> > 
> > Chain OUTPUT (policy ACCEPT 401M packets, 62G bytes)
> >  pkts bytes target     prot opt in     out     source               destination
> > 
> > real    0m1.810s  <<<< HERE >>>>
> > user    0m0.000s
> > sys     0m0.001s
> > 
> > 
> > CONFIG_NO_HZ=y
> > CONFIG_HZ_1000=y
> > CONFIG_HZ=1000
> > 
> > One cpu is 100% handling softirqs, could it be the problem ?
> > 
> > Cpu0  :  1.0%us, 14.7%sy,  0.0%ni, 83.3%id,  0.0%wa,  0.0%hi,  1.0%si,  0.0%st
> > Cpu1  :  3.6%us, 23.2%sy,  0.0%ni, 71.6%id,  0.0%wa,  0.0%hi,  1.7%si,  0.0%st
> > Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,100.0%si,  0.0%st
> > Cpu3  :  2.7%us, 23.9%sy,  0.0%ni, 71.1%id,  0.7%wa,  0.0%hi,  1.7%si,  0.0%st
> > Cpu4  :  1.3%us, 14.3%sy,  0.0%ni, 83.3%id,  0.0%wa,  0.0%hi,  1.0%si,  0.0%st
> > Cpu5  :  1.0%us, 14.2%sy,  0.0%ni, 83.4%id,  0.0%wa,  0.0%hi,  1.3%si,  0.0%st
> > Cpu6  :  0.3%us,  7.0%sy,  0.0%ni, 92.4%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
> > Cpu7  :  0.7%us,  8.0%sy,  0.0%ni, 90.0%id,  0.7%wa,  0.0%hi,  0.7%si,  0.0%st
> 
> Hi Paul
> 
> I found following patch helps if one cpu is looping inside ksoftirqd()
> 
> synchronize_rcu() now completes in 40 ms instead of 1800 ms.
> 
> Thank you
> 
> [PATCH] rcu: increment quiescent state counter in ksoftirqd()
> 
> If a machine is flooded by network frames, a cpu can loop 100% of its time
> inside ksoftirqd() without calling schedule().
> This can delay RCU grace period to insane values. 
> 
> Adding rcu_qsctr_inc() call in ksoftirqd() solves this problem.

Good catch!!!  This regression was a result of the recent change
from "schedule()" to "cond_resched()", which got rid of that quiescent
state in the common case where a reschedule is not needed.

Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> ---
> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index bdbe9de..9041ea7 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -626,6 +626,7 @@ static int ksoftirqd(void * __bind_cpu)
>  			preempt_enable_no_resched();
>  			cond_resched();
>  			preempt_disable();
> +			rcu_qsctr_inc((long)__bind_cpu);
>  		}
>  		preempt_enable();
>  		set_current_state(TASK_INTERRUPTIBLE);
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2009-02-27 16:34 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20090218051906.174295181@vyatta.com>
     [not found] ` <20090218052747.437271195@vyatta.com>
     [not found]   ` <20090218092041.GC3294@elte.hu>
     [not found]     ` <20090218.013007.117003889.davem@davemloft.net>
2009-02-18 11:01       ` [RFT 2/4] Add mod_timer_noact Ingo Molnar
2009-02-18 11:39         ` Jarek Poplawski
2009-02-18 12:37           ` Ingo Molnar
2009-02-18 12:33         ` Patrick McHardy
2009-02-18 21:39         ` David Miller
2009-02-18 21:51           ` Ingo Molnar
2009-02-18 22:04             ` David Miller
2009-02-18 22:42               ` Peter Zijlstra
2009-02-18 22:47                 ` David Miller
2009-02-18 22:56                   ` Stephen Hemminger
     [not found]     ` <499BDDFE.5010101@trash.net>
2009-02-18 12:05       ` [patch] timers: add mod_timer_pending() Ingo Molnar
     [not found]         ` <499C000A.4040205@trash.net>
2009-02-18 12:50           ` Ingo Molnar
2009-02-18 12:54             ` Patrick McHardy
2009-02-18 13:47               ` Ingo Molnar
     [not found] ` <20090218052747.321329022@vyatta.com>
     [not found]   ` <20090219114719.560999b5@extreme>
     [not found]     ` <499DEF49.3040602@cosmosbay.com>
     [not found]       ` <49A7F262.8040805@cosmosbay.com>
2009-02-27 16:08         ` [PATCH] rcu: increment quiescent state counter in ksoftirqd() Eric Dumazet
2009-02-27 16:34           ` Paul E. McKenney

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox