* [PATCH V3]hrtimer: Fix a performance regression by disable reprogramming in remove_hrtimer
@ 2013-07-27 20:04 ethan.kernel
2013-07-29 10:18 ` Thomas Gleixner
0 siblings, 1 reply; 31+ messages in thread
From: ethan.kernel @ 2013-07-27 20:04 UTC (permalink / raw)
To: mingo, tglx; +Cc: linux-kernel, johlstei, yinghai, joe.jin, ethan.zhao
Commit 968320b ("hrtimer: Fix extra wakeups from __remove_hrtimer()") introduced a significant scheduler
performance regression. The test follows:
a. Test environment:
SUN FIRE X4170 M2 SERVER
CPU model name: Intel(R) Xeon(R) CPU X5675 @ 3.07GHz
2 socket X 6 core X 2 thread
b. To eliminate interference from the CPU's variable-frequency features, we disabled C-states, P-states,
T-states, SpeedStep, Turbo Boost and other power management options in the BIOS configuration.
c. Test case:
1. Test tool (any better tools?)

#include <unistd.h>

#define LOOPS 1000000

int main(void)
{
	int pipe_1[2], pipe_2[2];
	int m = 0, i;

	pipe(pipe_1);
	pipe(pipe_2);
	if (!fork()) {
		/* child: echo each token back */
		for (i = 0; i < LOOPS; i++) {
			read(pipe_1[0], &m, sizeof(int));
			write(pipe_2[1], &m, sizeof(int));
		}
	} else {
		/* parent: send a token, wait for the echo */
		for (i = 0; i < LOOPS; i++) {
			write(pipe_1[1], &m, sizeof(int));
			read(pipe_2[0], &m, sizeof(int));
		}
	}
	return 0;
}
from http://people.redhat.com/mingo/cfs-scheduler/tools/pipe-test-1m.c
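For a quick sanity check on a machine without a C toolchain, the same ping-pong loop can be sketched in Python. This is an illustrative equivalent, not the benchmark actually used above; the loop count is deliberately reduced, since Python overhead makes the original 1,000,000 iterations impractical:

```python
import os

def pipe_pingpong(loops):
    """Ping-pong one byte between parent and child, as in pipe-test-1m.c."""
    p1_r, p1_w = os.pipe()
    p2_r, p2_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # child: echo every byte back
        for _ in range(loops):
            m = os.read(p1_r, 1)
            os.write(p2_w, m)
        os._exit(0)
    # parent: send a byte, then wait for the echo
    for _ in range(loops):
        os.write(p1_w, b"x")
        m = os.read(p2_r, 1)
    os.waitpid(pid, 0)
    return m

print(pipe_pingpong(1000))
```

Timing it with `time python3 pipe_pingpong.py` exercises the same scheduler wakeup path (two tasks blocking on each other's pipe reads), which is what makes this workload sensitive to hrtimer reprogramming cost.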
2. A few minutes after booting the test kernel, execute
$ time ./pipe-test-1m
and collect the data output by time, e.g.:
real 0m9.326s
user 0m0.352s
sys 0m5.640s
3. A few seconds after the test case finishes, repeat it.
d. Test result data
Kernel without commit 968320b ("hrtimer: Fix extra wakeups from __remove_hrtimer()"),
or with this patch applied to disable reprogramming in remove_hrtimer():
---------------------------------------------------------------------------------------------------------
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | AVG |
---------------------------------------------------------------------------------------------------------
| real | 0m5.328s | 0m5.372s | 0m5.307s | 0m5.307s | 0m5.330s | 0m5.315s | 0m5.318s | 0m5.317s | 5.32425s |
---------------------------------------------------------------------------------------------------------
| user | 0m0.106s | 0m0.098s | 0m0.108s | 0m0.120s | 0m0.113s | 0m0.121s | 0m0.125s | 0m0.103s | 0.11175s |
---------------------------------------------------------------------------------------------------------
| sys | 0m2.287s | 0m2.334s | 0m2.269s | 0m2.269s | 0m2.298s | 0m2.274s | 0m2.263s | 0m2.292s | 2.28575s |
---------------------------------------------------------------------------------------------------------
Kernel with commit 968320b ("hrtimer: Fix extra wakeups from __remove_hrtimer()"):
Each configuration was tested more than 10 times; 8 runs were selected and collected into the tables.
---------------------------------------------------------------------------------------------------------
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | AVG |
---------------------------------------------------------------------------------------------------------
| real | 0m7.132s | 0m6.741s | 0m6.996s | 0m9.269s | 0m6.742s | 0m6.977s | 0m6.802s | 0m6.957s | 7.202s |
---------------------------------------------------------------------------------------------------------
| user | 0m0.033s | 0m0.031s | 0m0.048s | 0m0.436s | 0m0.022s | 0m0.005s | 0m0.014s | 0m0.014s | 0.075375s|
---------------------------------------------------------------------------------------------------------
| sys | 0m3.119s | 0m2.940s | 0m3.185s | 0m4.023s | 0m3.053s | 0m3.152s | 0m3.054s | 0m3.124s | 3.20625s |
---------------------------------------------------------------------------------------------------------
e. Conclusion
We found that commit 968320b ("hrtimer: Fix extra wakeups from __remove_hrtimer()") costs
1.87775 s (average real time) on this workload.
That is a slowdown of more than 35%!
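The averages and the regression figure can be checked directly from the per-run "real" times in the two tables:

```python
# "real" times (seconds) from the two tables above
without = [5.328, 5.372, 5.307, 5.307, 5.330, 5.315, 5.318, 5.317]
with_968320b = [7.132, 6.741, 6.996, 9.269, 6.742, 6.977, 6.802, 6.957]

avg_without = sum(without) / len(without)         # 5.32425 s, matches AVG column
avg_with = sum(with_968320b) / len(with_968320b)  # 7.202 s, matches AVG column
slowdown = avg_with - avg_without                 # 1.87775 s
pct = slowdown / avg_without * 100                # ~35.3% slower than baseline

print(round(avg_without, 5), round(avg_with, 3), round(slowdown, 5), round(pct, 1))
```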
Disable reprogramming in remove_hrtimer() to fix this performance regression.
remove_hrtimer() is called with clock-device reprogramming in the following two cases:
1. In hrtimer_try_to_cancel():
Whether or not the clock device is reprogrammed, the timer function will not be called anymore,
so passing reprogram = 0 does not change the result of hrtimer_try_to_cancel().
hrtimer_try_to_cancel()
  -> remove_hrtimer()
    -> __remove_hrtimer(timer, base, state, reprogram);
2. In __hrtimer_start_range_ns():
After remove_hrtimer() is called, the rest of the function checks whether the newly enqueued
timer is the leftmost one and, if needed, reprograms the clock device.
__hrtimer_start_range_ns()
{
	... ...
	ret = remove_hrtimer(timer, base);
	... ...
	leftmost = enqueue_hrtimer(timer, new_base);
	if (leftmost && new_base->cpu_base == &__get_cpu_var(hrtimer_bases)
		&& hrtimer_enqueue_reprogram(timer, new_base)) {
	... ..
}
Will we lose the chance to reprogram the clock device after remove_hrtimer()?
I think not: even though we did not reprogram the clock device in remove_hrtimer(), if the removed
timer was the next one to expire, the device still gets reprogrammed in hrtimer_interrupt().
So reprogramming in remove_hrtimer() is not necessary -- if I am wrong, just point it out.
Why did commit 968320b ("hrtimer: Fix extra wakeups from __remove_hrtimer()") introduce a
performance regression?
Is the reprogramming expensive? It is not cheap so far.
Did we lose one chance to wake up?
Yes. Commit 968320b actually delays the next expiry to the timer next to the removed one, which looks rational.
But if the extra wakeup is not harmful, we really need it to keep the performance and to get the chance to wake up in time.
Reported-by: lijun.huang@oracle.com <lijun.huang@oracle.com>
Signed-off-by: ethan.zhao <ethan.kernel@gmail.com>
---
kernel/hrtimer.c | 13 +------------
1 files changed, 1 insertions(+), 12 deletions(-)
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 383319b..3c9e828 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -942,26 +942,15 @@ remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base)
 {
 	if (hrtimer_is_queued(timer)) {
 		unsigned long state;
-		int reprogram;
-		/*
-		 * Remove the timer and force reprogramming when high
-		 * resolution mode is active and the timer is on the current
-		 * CPU. If we remove a timer on another CPU, reprogramming is
-		 * skipped. The interrupt event on this CPU is fired and
-		 * reprogramming happens in the interrupt handler. This is a
-		 * rare case and less expensive than a smp call.
-		 */
-		debug_deactivate(timer);
 		timer_stats_hrtimer_clear_start_info(timer);
-		reprogram = base->cpu_base == &__get_cpu_var(hrtimer_bases);
 		/*
 		 * We must preserve the CALLBACK state flag here,
 		 * otherwise we could move the timer base in
 		 * switch_hrtimer_base.
 		 */
 		state = timer->state & HRTIMER_STATE_CALLBACK;
-		__remove_hrtimer(timer, base, state, reprogram);
+		__remove_hrtimer(timer, base, state, 0);
 		return 1;
 	}
 	return 0;
--
1.7.1
^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH V3]hrtimer: Fix a performance regression by disable reprogramming in remove_hrtimer
  2013-07-27 20:04 [PATCH V3]hrtimer: Fix a performance regression by disable reprogramming in remove_hrtimer ethan.kernel
@ 2013-07-29 10:18 ` Thomas Gleixner
  2013-07-29 11:57   ` Peter Zijlstra
       [not found]   ` <CABawtvP4oLuvHOS3prbbgPShXVziV_wTo7i6KCqJ9KkoVdz0ag@mail.gmail.com>
  0 siblings, 2 replies; 31+ messages in thread
From: Thomas Gleixner @ 2013-07-29 10:18 UTC (permalink / raw)
  To: ethan.zhao; +Cc: Ingo Molnar, LKML, johlstei, yinghai, joe.jin, Peter Zijlstra

On Sat, 27 Jul 2013, ethan.kernel@gmail.com wrote:
> commit 968320b hrtimer: Fix extra wakeups from __remove_hrtimer()
> introduced a significant scheduler performance regression,
> following is the test:

This is not a commit in Linus tree. Which kernel version are you
talking about?

> We found the performance has 1.87775S(average value) down introduced
> by commit 968320b hrtimer: Fix extra wakeups from
> __remove_hrtimer(). That is more than -35% !

The test case does not involve anything hrtimer related. Do you have
CONFIG_SCHED_HRTICK enabled?

> So reprogramming in remove_hrtimer() is not necessary-----If I am
> wrong, just point out.

The reason why we want to do that is:

	timer	expiry time
	A	 100us	-> programmed hardware event
	B	2000us

Restart timer A with an expiry time of 3000us without reprogramming:

	timer	expiry time
	NONE	 100us	-> programmed hardware event
	B	2000us
	A	3000us

We take an extra wakeup for reprogramming the hardware, which is
counterproductive for power saving. So disabling it blindly will cause
a regression for other people. We need to be smarter than that.

First of all we want to know, which particular hrtimer is causing that
issue. If it is the hrtick one, then I really have to ask why you want
to use it at all in such a high performance scenario.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 31+ messages in thread
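[Editorial note: Thomas's two-timer scenario above can be modeled with a toy simulation to show where the extra wakeup comes from. This is an illustrative sketch of the described behavior, not kernel code; the event-counting model is an assumption.]

```python
def wakeups(timers, programmed_event):
    """Count hardware timer events until all queued timers have fired.

    timers: expiry times (us) actually queued.
    programmed_event: expiry (us) currently armed in the hardware,
    which may be stale if removal skipped reprogramming.
    """
    events = 0
    armed = programmed_event
    pending = sorted(timers)
    while pending:
        events += 1  # hardware fires at 'armed'
        # the interrupt handler expires every timer that is due...
        while pending and pending[0] <= armed:
            pending.pop(0)
        # ...and reprograms the hardware to the next pending expiry
        if pending:
            armed = pending[0]
    return events

# Timer A (100us) is removed and re-armed at 3000us; B stays at 2000us.
# If removal reprograms the hardware to B's 2000us, two events suffice:
assert wakeups([2000, 3000], programmed_event=2000) == 2
# Without reprogramming, the stale 100us event still fires: one extra
# wakeup that finds nothing to expire, which is what hurts power saving.
assert wakeups([2000, 3000], programmed_event=100) == 3
```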
* Re: [PATCH V3]hrtimer: Fix a performance regression by disable reprogramming in remove_hrtimer
  2013-07-29 10:18 ` Thomas Gleixner
@ 2013-07-29 11:57   ` Peter Zijlstra
  2013-08-08  7:32     ` ethan.zhao
       [not found]   ` <CABawtvP4oLuvHOS3prbbgPShXVziV_wTo7i6KCqJ9KkoVdz0ag@mail.gmail.com>
  1 sibling, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2013-07-29 11:57 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: ethan.zhao, Ingo Molnar, LKML, johlstei, yinghai, joe.jin

On Mon, Jul 29, 2013 at 12:18:11PM +0200, Thomas Gleixner wrote:
> The reason why we want to do that is:
>
> 	timer	expiry time
> 	A	 100us	-> programmed hardware event
> 	B	2000us
>
> Restart timer A with an expiry time of 3000us without reprogramming:
>
> 	timer	expiry time
> 	NONE	 100us	-> programmed hardware event
> 	B	2000us
> 	A	3000us
>
> We take an extra wakeup for reprogramming the hardware, which is
> counterproductive for power saving. So disabling it blindly will cause
> a regression for other people. We need to be smarter than that.

So aside from the complete mess posted; how about something like this?

*completely untested etc..*

---
 include/linux/hrtimer.h |    5 +++++
 kernel/hrtimer.c        |   60 +++++++++++++++++++++++--------------------------
 2 files changed, 33 insertions(+), 32 deletions(-)

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index d19a5c2..f2bcb9c 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -185,6 +185,7 @@ struct hrtimer_cpu_base {
 	ktime_t				expires_next;
 	int				hres_active;
 	int				hang_detected;
+	int				timers_removed;
 	unsigned long			nr_events;
 	unsigned long			nr_retries;
 	unsigned long			nr_hangs;
@@ -261,6 +262,8 @@ static inline ktime_t hrtimer_expires_remaining(const struct hrtimer *timer)
 #ifdef CONFIG_HIGH_RES_TIMERS
 struct clock_event_device;
 
+extern void hrtimer_enter_idle(void);
+
 extern void hrtimer_interrupt(struct clock_event_device *dev);
 
 /*
@@ -296,6 +299,8 @@ extern void clock_was_set_delayed(void);
 # define MONOTONIC_RES_NSEC	LOW_RES_NSEC
 # define KTIME_MONOTONIC_RES	KTIME_LOW_RES
 
+static inline void hrtimer_enter_idle(void) { }
+
 static inline void hrtimer_peek_ahead_timers(void) { }
 
 /*
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index f0f4fe2..ffb16d3 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -673,6 +673,27 @@ static inline ktime_t hrtimer_update_base(struct hrtimer_cpu_base *base)
 	return ktime_get_update_offsets(offs_real, offs_boot, offs_tai);
 }
 
+static void __hrtimer_reprogram_all(struct hrtimer_cpu_base *base)
+{
+	if (!hrtimer_hres_active())
+		return;
+
+	raw_spin_lock(&base->lock);
+	hrtimer_update_base(base);
+	hrtimer_force_reprogram(base, 0);
+	raw_spin_unlock(&base->lock);
+}
+
+void hrtimer_enter_idle(void)
+{
+	struct hrtimer_cpu_base *base = &__get_cpu_var(hrtimer_bases);
+
+	if (base->timers_removed) {
+		base->timers_removed = 0;
+		__hrtimer_reprogram_all(base);
+	}
+}
+
 /*
  * Retrigger next event is called after clock was set
  *
@@ -682,13 +703,7 @@ static void retrigger_next_event(void *arg)
 {
 	struct hrtimer_cpu_base *base = &__get_cpu_var(hrtimer_bases);
 
-	if (!hrtimer_hres_active())
-		return;
-
-	raw_spin_lock(&base->lock);
-	hrtimer_update_base(base);
-	hrtimer_force_reprogram(base, 0);
-	raw_spin_unlock(&base->lock);
+	__hrtimer_reprogram_all(base);
 }
 
 /*
@@ -907,7 +922,7 @@ static int enqueue_hrtimer(struct hrtimer *timer,
  */
 static void __remove_hrtimer(struct hrtimer *timer,
 			     struct hrtimer_clock_base *base,
-			     unsigned long newstate, int reprogram)
+			     unsigned long newstate)
 {
 	struct timerqueue_node *next_timer;
 	if (!(timer->state & HRTIMER_STATE_ENQUEUED))
@@ -915,19 +930,10 @@ static void __remove_hrtimer(struct hrtimer *timer,
 
 	next_timer = timerqueue_getnext(&base->active);
 	timerqueue_del(&base->active, &timer->node);
-	if (&timer->node == next_timer) {
 #ifdef CONFIG_HIGH_RES_TIMERS
-		/* Reprogram the clock event device. if enabled */
-		if (reprogram && hrtimer_hres_active()) {
-			ktime_t expires;
-
-			expires = ktime_sub(hrtimer_get_expires(timer),
-					    base->offset);
-			if (base->cpu_base->expires_next.tv64 == expires.tv64)
-				hrtimer_force_reprogram(base->cpu_base, 1);
-		}
+	if (hrtimer_hres_active() && &timer->node == next_timer)
+		base->cpu_base->timers_removed++;
 #endif
-	}
 	if (!timerqueue_getnext(&base->active))
 		base->cpu_base->active_bases &= ~(1 << base->index);
 out:
@@ -942,26 +948,16 @@ remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base)
 {
 	if (hrtimer_is_queued(timer)) {
 		unsigned long state;
-		int reprogram;
 
-		/*
-		 * Remove the timer and force reprogramming when high
-		 * resolution mode is active and the timer is on the current
-		 * CPU. If we remove a timer on another CPU, reprogramming is
-		 * skipped. The interrupt event on this CPU is fired and
-		 * reprogramming happens in the interrupt handler. This is a
-		 * rare case and less expensive than a smp call.
-		 */
 		debug_deactivate(timer);
 		timer_stats_hrtimer_clear_start_info(timer);
-		reprogram = base->cpu_base == &__get_cpu_var(hrtimer_bases);
 		/*
 		 * We must preserve the CALLBACK state flag here,
 		 * otherwise we could move the timer base in
 		 * switch_hrtimer_base.
 		 */
 		state = timer->state & HRTIMER_STATE_CALLBACK;
-		__remove_hrtimer(timer, base, state, reprogram);
+		__remove_hrtimer(timer, base, state);
 		return 1;
 	}
 	return 0;
@@ -1243,7 +1239,7 @@ static void __run_hrtimer(struct hrtimer *timer, ktime_t *now)
 	WARN_ON(!irqs_disabled());
 
 	debug_deactivate(timer);
-	__remove_hrtimer(timer, base, HRTIMER_STATE_CALLBACK, 0);
+	__remove_hrtimer(timer, base, HRTIMER_STATE_CALLBACK);
 	timer_stats_account_hrtimer(timer);
 	fn = timer->function;
 
@@ -1690,7 +1686,7 @@ static void migrate_hrtimer_list(struct hrtimer_clock_base *old_base,
 		 * timer could be seen as !active and just vanish away
 		 * under us on another CPU
 		 */
-		__remove_hrtimer(timer, old_base, HRTIMER_STATE_MIGRATE, 0);
+		__remove_hrtimer(timer, old_base, HRTIMER_STATE_MIGRATE);
 		timer->base = new_base;
 		/*
 		 * Enqueue the timers on the new cpu. This does not

^ permalink raw reply related	[flat|nested] 31+ messages in thread
* Re: [PATCH V3]hrtimer: Fix a performance regression by disable reprogramming in remove_hrtimer
  2013-07-29 11:57 ` Peter Zijlstra
@ 2013-08-08  7:32   ` ethan.zhao
  2013-09-05  6:36     ` Mike Galbraith
  0 siblings, 1 reply; 31+ messages in thread
From: ethan.zhao @ 2013-08-08  7:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, LKML, johlstei, yinghai, joe.jin

Kernel 3.11-rc3
With Peter's patch and C-states disabled in BIOS, got 1 second better performance!

[root@localhost ~]# time ./pip1m

real	0m4.399s
user	0m0.092s
sys	0m2.775s

[root@localhost ~]# time ./pip1m

real	0m4.352s
user	0m0.099s
sys	0m2.751s

[root@localhost ~]# time ./pip1m

real	0m4.328s
user	0m0.102s
sys	0m2.731s

Compared with the default kernel 3.11-rc3 and C-states disabled in BIOS, test case 4:

4. Disable C1E C3 C6 C-states and SpeedStep in BIOS, default configuration of kernel 3.11-rc3.

[root@localhost ~]# time ./pip1m

real	0m5.371s
user	0m0.102s
sys	0m3.253s

[root@localhost ~]# time ./pip1m

real	0m5.329s
user	0m0.075s
sys	0m3.254s

That is a great improvement, so it is worth merging.

Thanks,
Ethan

On 2013-7-29, at 7:57 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> So aside from the complete mess posted; how about something like this?
>
> *completely untested etc..*
>
> ---
>  include/linux/hrtimer.h |    5 +++++
>  kernel/hrtimer.c        |   60 +++++++++++++++++++++++--------------------------
>  2 files changed, 33 insertions(+), 32 deletions(-)
>
> diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
> index d19a5c2..f2bcb9c 100644
> --- a/include/linux/hrtimer.h
> +++ b/include/linux/hrtimer.h
> @@ -185,6 +185,7 @@ struct hrtimer_cpu_base {
>  	ktime_t				expires_next;
>  	int				hres_active;
>  	int				hang_detected;
> +	int				timers_removed;
>  	unsigned long			nr_events;
>  	unsigned long			nr_retries;
>  	unsigned long			nr_hangs;
> @@ -261,6 +262,8 @@ static inline ktime_t hrtimer_expires_remaining(const struct hrtimer *timer)
>  #ifdef CONFIG_HIGH_RES_TIMERS
>  struct clock_event_device;
>
> +extern void hrtimer_enter_idle(void);
> +
>  extern void hrtimer_interrupt(struct clock_event_device *dev);
>
>  /*
> @@ -296,6 +299,8 @@ extern void clock_was_set_delayed(void);
>  # define MONOTONIC_RES_NSEC	LOW_RES_NSEC
>  # define KTIME_MONOTONIC_RES	KTIME_LOW_RES
>
> +static inline void hrtimer_enter_idle(void) { }
> +
>  static inline void hrtimer_peek_ahead_timers(void) { }
>
>  /*
> diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
> index f0f4fe2..ffb16d3 100644
> --- a/kernel/hrtimer.c
> +++ b/kernel/hrtimer.c
> @@ -673,6 +673,27 @@ static inline ktime_t hrtimer_update_base(struct hrtimer_cpu_base *base)
>  	return ktime_get_update_offsets(offs_real, offs_boot, offs_tai);
>  }
>
> +static void __hrtimer_reprogram_all(struct hrtimer_cpu_base *base)
> +{
> +	if (!hrtimer_hres_active())
> +		return;
> +
> +	raw_spin_lock(&base->lock);
> +	hrtimer_update_base(base);
> +	hrtimer_force_reprogram(base, 0);
> +	raw_spin_unlock(&base->lock);
> +}
> +
> +void hrtimer_enter_idle(void)
> +{
> +	struct hrtimer_cpu_base *base = &__get_cpu_var(hrtimer_bases);
> +
> +	if (base->timers_removed) {
> +		base->timers_removed = 0;
> +		__hrtimer_reprogram_all(base);
> +	}
> +}
> +
>  /*
>   * Retrigger next event is called after clock was set
>   *
> @@ -682,13 +703,7 @@ static void retrigger_next_event(void *arg)
>  {
>  	struct hrtimer_cpu_base *base = &__get_cpu_var(hrtimer_bases);
>
> -	if (!hrtimer_hres_active())
> -		return;
> -
> -	raw_spin_lock(&base->lock);
> -	hrtimer_update_base(base);
> -	hrtimer_force_reprogram(base, 0);
> -	raw_spin_unlock(&base->lock);
> +	__hrtimer_reprogram_all(base);
>  }
>
>  /*
> @@ -907,7 +922,7 @@ static int enqueue_hrtimer(struct hrtimer *timer,
>   */
>  static void __remove_hrtimer(struct hrtimer *timer,
>  			     struct hrtimer_clock_base *base,
> -			     unsigned long newstate, int reprogram)
> +			     unsigned long newstate)
>  {
>  	struct timerqueue_node *next_timer;
>  	if (!(timer->state & HRTIMER_STATE_ENQUEUED))
> @@ -915,19 +930,10 @@ static void __remove_hrtimer(struct hrtimer *timer,
>
>  	next_timer = timerqueue_getnext(&base->active);
>  	timerqueue_del(&base->active, &timer->node);
> -	if (&timer->node == next_timer) {
>  #ifdef CONFIG_HIGH_RES_TIMERS
> -		/* Reprogram the clock event device. if enabled */
> -		if (reprogram && hrtimer_hres_active()) {
> -			ktime_t expires;
> -
> -			expires = ktime_sub(hrtimer_get_expires(timer),
> -					    base->offset);
> -			if (base->cpu_base->expires_next.tv64 == expires.tv64)
> -				hrtimer_force_reprogram(base->cpu_base, 1);
> -		}
> +	if (hrtimer_hres_active() && &timer->node == next_timer)
> +		base->cpu_base->timers_removed++;
>  #endif
> -	}
>  	if (!timerqueue_getnext(&base->active))
>  		base->cpu_base->active_bases &= ~(1 << base->index);
> out:
> @@ -942,26 +948,16 @@ remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base)
>  {
>  	if (hrtimer_is_queued(timer)) {
>  		unsigned long state;
> -		int reprogram;
>
> -		/*
> -		 * Remove the timer and force reprogramming when high
> -		 * resolution mode is active and the timer is on the current
> -		 * CPU. If we remove a timer on another CPU, reprogramming is
> -		 * skipped. The interrupt event on this CPU is fired and
> -		 * reprogramming happens in the interrupt handler. This is a
> -		 * rare case and less expensive than a smp call.
> -		 */
>  		debug_deactivate(timer);
>  		timer_stats_hrtimer_clear_start_info(timer);
> -		reprogram = base->cpu_base == &__get_cpu_var(hrtimer_bases);
>  		/*
>  		 * We must preserve the CALLBACK state flag here,
>  		 * otherwise we could move the timer base in
>  		 * switch_hrtimer_base.
>  		 */
>  		state = timer->state & HRTIMER_STATE_CALLBACK;
> -		__remove_hrtimer(timer, base, state, reprogram);
> +		__remove_hrtimer(timer, base, state);
>  		return 1;
>  	}
>  	return 0;
> @@ -1243,7 +1239,7 @@ static void __run_hrtimer(struct hrtimer *timer, ktime_t *now)
>  	WARN_ON(!irqs_disabled());
>
>  	debug_deactivate(timer);
> -	__remove_hrtimer(timer, base, HRTIMER_STATE_CALLBACK, 0);
> +	__remove_hrtimer(timer, base, HRTIMER_STATE_CALLBACK);
>  	timer_stats_account_hrtimer(timer);
>  	fn = timer->function;
>
> @@ -1690,7 +1686,7 @@ static void migrate_hrtimer_list(struct hrtimer_clock_base *old_base,
>  		 * timer could be seen as !active and just vanish away
>  		 * under us on another CPU
>  		 */
> -		__remove_hrtimer(timer, old_base, HRTIMER_STATE_MIGRATE, 0);
> +		__remove_hrtimer(timer, old_base, HRTIMER_STATE_MIGRATE);
>  		timer->base = new_base;
>  		/*
>  		 * Enqueue the timers on the new cpu. This does not

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH V3]hrtimer: Fix a performance regression by disable reprogramming in remove_hrtimer
  2013-08-08  7:32 ` ethan.zhao
@ 2013-09-05  6:36   ` Mike Galbraith
       [not found]     ` <20130905111428.GB23362@gmail.com>
  0 siblings, 1 reply; 31+ messages in thread
From: Mike Galbraith @ 2013-09-05  6:36 UTC (permalink / raw)
  To: ethan.zhao
  Cc: Peter Zijlstra, Thomas Gleixner, Ingo Molnar, LKML, johlstei,
	yinghai, joe.jin

On Thu, 2013-08-08 at 15:32 +0800, ethan.zhao wrote:
...
> That is great improvement, So it is worth to merge.

It didn't swim upstream for some reason.

-Mike

^ permalink raw reply	[flat|nested] 31+ messages in thread
[parent not found: <20130905111428.GB23362@gmail.com>]
[parent not found: <1378386697.6567.9.camel@marge.simpson.net>]
[parent not found: <20130905133750.GA26637@gmail.com>]
[parent not found: <1378445942.5434.31.camel@marge.simpson.net>]
[parent not found: <20130909122325.GX31370@twins.programming.kicks-ass.net>]
[parent not found: <1378730538.5586.30.camel@marge.simpson.net>]
* Re: [PATCH V3]hrtimer: Fix a performance regression by disable reprogramming in remove_hrtimer
       [not found] ` <1378730538.5586.30.camel@marge.simpson.net>
@ 2013-09-09 13:30   ` Peter Zijlstra
  2013-09-09 13:46     ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2013-09-09 13:30 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Ingo Molnar, linux-kernel, Thomas Gleixner, ethan.kernel

Lets go back to the list with this..

On Mon, Sep 09, 2013 at 02:42:18PM +0200, Mike Galbraith wrote:
> On Mon, 2013-09-09 at 14:23 +0200, Peter Zijlstra wrote:
> > On Fri, Sep 06, 2013 at 07:39:02AM +0200, Mike Galbraith wrote:
> > > The patch takes a large bite out of regressions. What's left for a
> > > Westmere box is the introduction of reschedule_interrupt overhead
> > > introduced by 7d1a9417 x86: Use generic idle loop.
> >
> > How exactly does that commit cause extra IPIs? Did the entire TS_POLLING
> > stuff break or so?
>
> Seems so.
>
> > > Core2 eats that,
> > > plus Intel making mwait_idle() go away with no way for them to get to
> > > the remaining mwait_idle_with_hints().
> >
> > but but but drivers/idle/intel_idle.c still uses mwait.. what's the
> > exact complaint?
>
> reschedule_interrupt overhead for cross core pipe-test appeared in
> westmere box at 7d1a9417.

So that patch does indeed lose the TS_POLLING stuff for all of x86.

I'm not entirely sure where we want to add it back, but the best place
to me seems the idle loop implementations themselves. Below a patch
that does intel_idle.c which is what your WSM would be using I
suppose.

We'll probably want to iterate all idle implementations and do what
needs doing.

---
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index fa6964d..486c0ba 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -359,6 +359,14 @@ static int intel_idle(struct cpuidle_device *dev,
 	if (!(lapic_timer_reliable_states & (1 << (cstate))))
 		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
 
+	current_thread_info()->status |= TS_POLLING;
+
+	/*
+	 * Order against setting of TS_POLLING against the reading of
+	 * NEED_RESCHED, matched by resched_task().
+	 */
+	smp_mb();
+
 	if (!need_resched()) {
 
 		__monitor((void *)&current_thread_info()->flags, 0, 0);
@@ -367,6 +375,8 @@ static int intel_idle(struct cpuidle_device *dev,
 		__mwait(eax, ecx);
 	}
 
+	current_thread_info()->status &= ~TS_POLLING;
+
 	if (!(lapic_timer_reliable_states & (1 << (cstate))))
 		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);

^ permalink raw reply related	[flat|nested] 31+ messages in thread
* Re: [PATCH V3]hrtimer: Fix a performance regression by disable reprogramming in remove_hrtimer
  2013-09-09 13:30 ` Peter Zijlstra
@ 2013-09-09 13:46   ` Peter Zijlstra
  2013-09-11  8:56     ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2013-09-09 13:46 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Ingo Molnar, linux-kernel, Thomas Gleixner, ethan.kernel, rjw,
	daniel.lezcano, linux-pm, lenb

On Mon, Sep 09, 2013 at 03:30:44PM +0200, Peter Zijlstra wrote:
> diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
> index fa6964d..486c0ba 100644
> --- a/drivers/idle/intel_idle.c
> +++ b/drivers/idle/intel_idle.c
> @@ -359,6 +359,14 @@ static int intel_idle(struct cpuidle_device *dev,
>  	if (!(lapic_timer_reliable_states & (1 << (cstate))))
>  		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
>
> +	current_thread_info()->status |= TS_POLLING;
> +
> +	/*
> +	 * Order against setting of TS_POLLING against the reading of
> +	 * NEED_RESCHED, matched by resched_task().
> +	 */
> +	smp_mb();
> +
>  	if (!need_resched()) {
>
>  		__monitor((void *)&current_thread_info()->flags, 0, 0);
> @@ -367,6 +375,8 @@ static int intel_idle(struct cpuidle_device *dev,
>  		__mwait(eax, ecx);
>  	}
>
> +	current_thread_info()->status &= ~TS_POLLING;
> +
>  	if (!(lapic_timer_reliable_states & (1 << (cstate))))
>  		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);

Hmm, arguably it would be better to set this from
intel_idle_cpuidle_driver_init() and clear it whenever this goes away.
Not sure how all that works, the cpuidle driver framework seems 'weird'.

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH V3]hrtimer: Fix a performance regression by disable reprogramming in remove_hrtimer
  2013-09-09 13:46 ` Peter Zijlstra
@ 2013-09-11  8:56   ` Peter Zijlstra
  2013-09-11 10:25     ` Mike Galbraith
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2013-09-11  8:56 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Ingo Molnar, linux-kernel, Thomas Gleixner, ethan.kernel, rjw,
	daniel.lezcano, linux-pm, lenb

On Mon, Sep 09, 2013 at 03:46:35PM +0200, Peter Zijlstra wrote:
> On Mon, Sep 09, 2013 at 03:30:44PM +0200, Peter Zijlstra wrote:
> > diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
> > index fa6964d..486c0ba 100644
> > --- a/drivers/idle/intel_idle.c
> > +++ b/drivers/idle/intel_idle.c
> > @@ -359,6 +359,14 @@ static int intel_idle(struct cpuidle_device *dev,
> >  	if (!(lapic_timer_reliable_states & (1 << (cstate))))
> >  		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
> >
> > +	current_thread_info()->status |= TS_POLLING;
> > +
> > +	/*
> > +	 * Order against setting of TS_POLLING against the reading of
> > +	 * NEED_RESCHED, matched by resched_task().
> > +	 */
> > +	smp_mb();
> > +
> >  	if (!need_resched()) {
> >
> >  		__monitor((void *)&current_thread_info()->flags, 0, 0);
> > @@ -367,6 +375,8 @@ static int intel_idle(struct cpuidle_device *dev,
> >  		__mwait(eax, ecx);
> >  	}
> >
> > +	current_thread_info()->status &= ~TS_POLLING;
> > +
> >  	if (!(lapic_timer_reliable_states & (1 << (cstate))))
> >  		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
>
> Hmm, arguably it would be better to set this from
> intel_idle_cpuidle_driver_init() and clear it whenever this goes away.
> Not sure how all that works, the cpuidle driver framework seems 'weird'.

OK, so I went over the idle stuff again, and we do set TS_POLLING like
stuff, it got hidden in current_{clr,set}_polling().

Still if that patch causes extra IPIs its bound to be broken in some
creative way.. I'll prod.

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH V3]hrtimer: Fix a performance regression by disable reprogramming in remove_hrtimer
  2013-09-11  8:56 ` Peter Zijlstra
@ 2013-09-11 10:25   ` Mike Galbraith
  2013-10-04 12:06     ` Ethan Zhao
  0 siblings, 1 reply; 31+ messages in thread
From: Mike Galbraith @ 2013-09-11 10:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Thomas Gleixner, ethan.kernel, rjw,
	daniel.lezcano, linux-pm, lenb

On Wed, 2013-09-11 at 10:56 +0200, Peter Zijlstra wrote:
> On Mon, Sep 09, 2013 at 03:46:35PM +0200, Peter Zijlstra wrote:
> > On Mon, Sep 09, 2013 at 03:30:44PM +0200, Peter Zijlstra wrote:
> > > diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
> > > index fa6964d..486c0ba 100644
> > > --- a/drivers/idle/intel_idle.c
> > > +++ b/drivers/idle/intel_idle.c
> > > @@ -359,6 +359,14 @@ static int intel_idle(struct cpuidle_device *dev,
> > >  	if (!(lapic_timer_reliable_states & (1 << (cstate))))
> > >  		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
> > >
> > > +	current_thread_info()->status |= TS_POLLING;
> > > +
> > > +	/*
> > > +	 * Order against setting of TS_POLLING against the reading of
> > > +	 * NEED_RESCHED, matched by resched_task().
> > > +	 */
> > > +	smp_mb();
> > > +
> > >  	if (!need_resched()) {
> > >
> > >  		__monitor((void *)&current_thread_info()->flags, 0, 0);
> > > @@ -367,6 +375,8 @@ static int intel_idle(struct cpuidle_device *dev,
> > >  		__mwait(eax, ecx);
> > >  	}
> > >
> > > +	current_thread_info()->status &= ~TS_POLLING;
> > > +
> > >  	if (!(lapic_timer_reliable_states & (1 << (cstate))))
> > >  		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
> >
> > Hmm, arguably it would be better to set this from
> > intel_idle_cpuidle_driver_init() and clear it whenever this goes away.
> > Not sure how all that works, the cpuidle driver framework seems 'weird'.
>
> OK, so I went over the idle stuff again, and we do set TS_POLLING like
> stuff, it got hidden in current_{clr,set}_polling().
>
> Still if that patch causes extra IPIs its bound to be broken in some
> creative way.. I'll prod.

(thanks, when I get a break from testing/poking this'n'that, I'll take a
peek too... well, good_intentions++ anyway;)

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH V3]hrtimer: Fix a performance regression by disable reprogramming in remove_hrtimer
  2013-09-11 10:25 ` Mike Galbraith
@ 2013-10-04 12:06   ` Ethan Zhao
  2013-10-07  4:41     ` Mike Galbraith
  0 siblings, 1 reply; 31+ messages in thread
From: Ethan Zhao @ 2013-10-04 12:06 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Ingo Molnar, LKML, Thomas Gleixner, rjw,
	daniel.lezcano, linux-pm, lenb

Mike, Peter,
   Lots of work seems to have been done these days, studious guys. Have
those patches (fixing the performance regression caused by extra hrtimer
reprogramming and rescheduling IPIs, the confused idle path, etc.) been
merged into the latest stable/dev branch? Then I could just do a lazy
pull and test in my environment. I need to catch up with the other mail
threads after my vacation.

Thanks,
Ethan

On Wed, Sep 11, 2013 at 6:25 PM, Mike Galbraith <bitbucket@online.de> wrote:
> On Wed, 2013-09-11 at 10:56 +0200, Peter Zijlstra wrote:
>> On Mon, Sep 09, 2013 at 03:46:35PM +0200, Peter Zijlstra wrote:
>> > On Mon, Sep 09, 2013 at 03:30:44PM +0200, Peter Zijlstra wrote:
>> > > diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
>> > > index fa6964d..486c0ba 100644
>> > > --- a/drivers/idle/intel_idle.c
>> > > +++ b/drivers/idle/intel_idle.c
>> > > @@ -359,6 +359,14 @@ static int intel_idle(struct cpuidle_device *dev,
>> > >  	if (!(lapic_timer_reliable_states & (1 << (cstate))))
>> > >  		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
>> > >
>> > > +	current_thread_info()->status |= TS_POLLING;
>> > > +
>> > > +	/*
>> > > +	 * Order against setting of TS_POLLING against the reading of
>> > > +	 * NEED_RESCHED, matched by resched_task().
>> > > +	 */
>> > > +	smp_mb();
>> > > +
>> > >  	if (!need_resched()) {
>> > >
>> > >  		__monitor((void *)&current_thread_info()->flags, 0, 0);
>> > > @@ -367,6 +375,8 @@ static int intel_idle(struct cpuidle_device *dev,
>> > >  		__mwait(eax, ecx);
>> > >  	}
>> > >
>> > > +	current_thread_info()->status &= ~TS_POLLING;
>> > > +
>> > >  	if (!(lapic_timer_reliable_states & (1 << (cstate))))
>> > >  		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
>> >
>> > Hmm, arguably it would be better to set this from
>> > intel_idle_cpuidle_driver_init() and clear it whenever this goes away.
>> > Not sure how all that works, the cpuidle driver framework seems 'weird'.
>>
>> OK, so I went over the idle stuff again, and we do set TS_POLLING like
>> stuff, it got hidden in current_{clr,set}_polling().
>>
>> Still if that patch causes extra IPIs its bound to be broken in some
>> creative way.. I'll prod.
>
> (thanks, when I get a break from testing/poking this'n'that, I'll take a
> peek too... well, good_intentions++ anyway;)
>

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH V3]hrtimer: Fix a performance regression by disable reprogramming in remove_hrtimer
  2013-10-04 12:06               ` Ethan Zhao
@ 2013-10-07  4:41                 ` Mike Galbraith
  2013-10-07  4:57                   ` Ethan Zhao
  0 siblings, 1 reply; 31+ messages in thread
From: Mike Galbraith @ 2013-10-07 4:41 UTC (permalink / raw)
To: Ethan Zhao
Cc: Peter Zijlstra, Ingo Molnar, LKML, Thomas Gleixner, rjw,
	daniel.lezcano, linux-pm, lenb

On Fri, 2013-10-04 at 20:06 +0800, Ethan Zhao wrote:
> Mike, Peter,
> It seems a lot of work has been done these days, studious guys. Have
> those patches (fixing the performance regression caused by the extra
> hrtimer programming, the rescheduling IPIs, the confused idle logic,
> etc.) been merged into the latest stable/dev branch? If so I could just
> do a lazy pull and test in my environment. I need to catch up with the
> other mail threads after my vacation.

Massive timer overhead seems to have crawled off and died while I wasn't
looking.  Peter's fix for IPI woes..

	tip commit ea811747 sched, idle: Fix the idle polling state logic

..hasn't yet swum upstream.

-Mike

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH V3]hrtimer: Fix a performance regression by disable reprogramming in remove_hrtimer
  2013-10-07  4:41                 ` Mike Galbraith
@ 2013-10-07  4:57                   ` Ethan Zhao
  2013-12-12 14:14                     ` Ethan Zhao
  0 siblings, 1 reply; 31+ messages in thread
From: Ethan Zhao @ 2013-10-07 4:57 UTC (permalink / raw)
To: Mike Galbraith
Cc: Peter Zijlstra, Ingo Molnar, LKML, Thomas Gleixner, rjw,
	Daniel Lezcano, linux-pm, Len Brown

Got it.

On Mon, Oct 7, 2013 at 12:41 PM, Mike Galbraith <bitbucket@online.de> wrote:
> On Fri, 2013-10-04 at 20:06 +0800, Ethan Zhao wrote:
>> Mike, Peter,
>> It seems a lot of work has been done these days, studious guys. Have
>> those patches (fixing the performance regression caused by the extra
>> hrtimer programming, the rescheduling IPIs, the confused idle logic,
>> etc.) been merged into the latest stable/dev branch? If so I could just
>> do a lazy pull and test in my environment. I need to catch up with the
>> other mail threads after my vacation.
>
> Massive timer overhead seems to have crawled off and died while I wasn't
> looking.  Peter's fix for IPI woes..
>
>	tip commit ea811747 sched, idle: Fix the idle polling state logic
>
> ..hasn't yet swum upstream.
>
> -Mike
>

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH V3]hrtimer: Fix a performance regression by disable reprogramming in remove_hrtimer
  2013-10-07  4:57                   ` Ethan Zhao
@ 2013-12-12 14:14                     ` Ethan Zhao
  2013-12-12 14:42                       ` Mike Galbraith
  0 siblings, 1 reply; 31+ messages in thread
From: Ethan Zhao @ 2013-12-12 14:14 UTC (permalink / raw)
To: Mike Galbraith
Cc: Peter Zijlstra, Ingo Molnar, LKML, Thomas Gleixner, rjw,
	Daniel Lezcano, linux-pm, Len Brown

Mike,

It seems the IPI issue has a root cause (a CPU hardware erratum) now? I
only caught some pieces of the mails. Is that to say the crazy horse made
by Intel wakes up unexpectedly and eats too many reschedule IPIs, so the
horse gets hotter and performance drops, and the fix would be a 'CLFLUSH'
on all the buggy CPUs, is that right?

Thanks,
Ethan

On Mon, Oct 7, 2013 at 12:57 PM, Ethan Zhao <ethan.kernel@gmail.com> wrote:
> Got it.
>
> On Mon, Oct 7, 2013 at 12:41 PM, Mike Galbraith <bitbucket@online.de> wrote:
>> On Fri, 2013-10-04 at 20:06 +0800, Ethan Zhao wrote:
>>> Mike, Peter,
>>> It seems a lot of work has been done these days, studious guys. Have
>>> those patches (fixing the performance regression caused by the extra
>>> hrtimer programming, the rescheduling IPIs, the confused idle logic,
>>> etc.) been merged into the latest stable/dev branch? If so I could just
>>> do a lazy pull and test in my environment. I need to catch up with the
>>> other mail threads after my vacation.
>>
>> Massive timer overhead seems to have crawled off and died while I wasn't
>> looking.  Peter's fix for IPI woes..
>>
>>	tip commit ea811747 sched, idle: Fix the idle polling state logic
>>
>> ..hasn't yet swum upstream.
>>
>> -Mike
>>

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH V3]hrtimer: Fix a performance regression by disable reprogramming in remove_hrtimer
  2013-12-12 14:14                     ` Ethan Zhao
@ 2013-12-12 14:42                       ` Mike Galbraith
  0 siblings, 0 replies; 31+ messages in thread
From: Mike Galbraith @ 2013-12-12 14:42 UTC (permalink / raw)
To: Ethan Zhao
Cc: Peter Zijlstra, Ingo Molnar, LKML, Thomas Gleixner, rjw,
	Daniel Lezcano, linux-pm, Len Brown

On Thu, 2013-12-12 at 22:14 +0800, Ethan Zhao wrote:
> Mike,
> it seems the IPI issue got root cause (cpu hardware errata) now?
> I only catch some pieces of the mails, is that to say the crazy horse
> made by Intel will wake up unexpected, and eat too much reschedule
> IPI, then the horse got hotter, performance down, and so the fix would
> be a 'CLFLUSH' to all the buggy CPU, is that right?

You're mixing a couple different issues together, but yeah, close
enough.  The barking amazonian tree guppy is being saved as we speak.

-Mike

^ permalink raw reply	[flat|nested] 31+ messages in thread
[parent not found: <CABawtvP4oLuvHOS3prbbgPShXVziV_wTo7i6KCqJ9KkoVdz0ag@mail.gmail.com>]
* Re: [PATCH V3]hrtimer: Fix a performance regression by disable reprogramming in remove_hrtimer
       [not found] ` <CABawtvP4oLuvHOS3prbbgPShXVziV_wTo7i6KCqJ9KkoVdz0ag@mail.gmail.com>
@ 2013-07-30  9:35   ` Peter Zijlstra
  2013-07-30 11:44     ` Ethan Zhao
  2013-08-06  7:29     ` Mike Galbraith
  0 siblings, 2 replies; 31+ messages in thread
From: Peter Zijlstra @ 2013-07-30 9:35 UTC (permalink / raw)
To: Ethan Zhao
Cc: Thomas Gleixner, Ingo Molnar, LKML, johlstei, Yinghai Lu, Jin Feng

On Tue, Jul 30, 2013 at 05:12:49PM +0800, Ethan Zhao wrote:
> On Mon, Jul 29, 2013 at 6:18 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > The test case does not involve anything hrtimer related. Do you have
> > CONFIG_SCHED_HRTICK enabled?
>
> Yes. it is default configured in stable release.
> CONFIG_SCHED_HRTICK=y

Should still be disabled by default even if supported:

	# grep HRTICK kernel/sched/features.h
	SCHED_FEAT(HRTICK, false)

> > First of all we want to know, which particular hrtimer is causing that
> > issue. If it is the hrtick one, then I really have to ask why you want
> > to use it at all in such a high performance scenario.
>
> Any advice about the HZ in high performance scenario? hrtimer tick
> is not fit for high performance?

Hence why it's disabled; programming the timer hardware is too expensive.
But since you didn't even know that, I suspect you aren't in fact using
it.

It would be good if you could do what Thomas suggested and look at which
timer is actually active during your workload.

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH V3]hrtimer: Fix a performance regression by disable reprogramming in remove_hrtimer
  2013-07-30  9:35   ` Peter Zijlstra
@ 2013-07-30 11:44     ` Ethan Zhao
  2013-07-30 11:59       ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: Ethan Zhao @ 2013-07-30 11:44 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Thomas Gleixner, Ingo Molnar, LKML, johlstei, Yinghai Lu, Jin Feng

On Tue, Jul 30, 2013 at 5:35 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Jul 30, 2013 at 05:12:49PM +0800, Ethan Zhao wrote:
>> On Mon, Jul 29, 2013 at 6:18 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> > The test case does not involve anything hrtimer related. Do you have
>> > CONFIG_SCHED_HRTICK enabled?
>>
>> Yes. it is default configured in stable release.
>> CONFIG_SCHED_HRTICK=y
>
> Should still be disabled by default even if supported:
>
>	# grep HRTICK kernel/sched/features.h
>	SCHED_FEAT(HRTICK, false)
>
>> > First of all we want to know, which particular hrtimer is causing that
>> > issue. If it is the hrtick one, then I really have to ask why you want
>> > to use it at all in such a high performance scenario.
>>
>> Any advice about the HZ in high performance scenario? hrtimer tick
>> is not fit for high performance?
>
> Hence why it's disabled; programming the timer hardware is too expensive.
> But since you didn't even know that, I suspect you aren't in fact using
> it.

Got it. What tglx and you mean is: with SCHED_FEAT(HRTICK, false) there
is no hrtimer operation in

	void __sched __schedule(void)
	{
		...
		if (sched_feat(HRTICK))
			hrtick_clear(rq);
		...
	}

Yup. So what I am facing is not HRTICK. But that doesn't move my eyes
away from hrtimer, and I still suspect the reprogramming delays the
scheduling. The call stack looks like the following:

	cpu_idle()
	{
		...
		tick_nohz_stop_sched_tick()
		  --> hrtimer_start()
		    --> __hrtimer_start_range_ns()
		      --> remove_hrtimer()
		      --> raise_softirq_irqoff(TIMER_SOFTIRQ)
		        --> run_timer_softirq()
		          --> tick_sched_timer()
		            --> hrtimer_start_expires()
		...
		tick_nohz_restart_sched_tick()
		...
		schedule()
		...
	}

So the expensive thing may not be inside schedule(), but outside the
scheduler, in the bigger outer loop. This is one part of what I am
facing.

Thanks,
Ethan

> It would be good if you could do what Thomas suggested and look at which
> timer is actually active during your workload.

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH V3]hrtimer: Fix a performance regression by disable reprogramming in remove_hrtimer
  2013-07-30 11:44     ` Ethan Zhao
@ 2013-07-30 11:59       ` Peter Zijlstra
  2013-08-03  6:55         ` ethan
  2013-08-03  7:37         ` ethan
  0 siblings, 2 replies; 31+ messages in thread
From: Peter Zijlstra @ 2013-07-30 11:59 UTC (permalink / raw)
To: Ethan Zhao
Cc: Thomas Gleixner, Ingo Molnar, LKML, johlstei, Yinghai Lu, Jin Feng

On Tue, Jul 30, 2013 at 07:44:03PM +0800, Ethan Zhao wrote:
> Got it.
> What tglx and you mean is
>
> So the expensive thing may not be inside schedule(), but outside the
> scheduler, in the bigger outer loop.
>
> This is one part of what I am facing.

Right, so it would be good if you could further diagnose the problem so
we can come up with a solution that cures the problem while retaining
the current 'desired' properties.

The patch you pinpointed caused a regression in that it would wake from
NOHZ mode far too often. Could it be that the now longer idle sections
cause your CPU to go into deeper idle modes and you're suffering from
idle-exit latencies?

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH V3]hrtimer: Fix a performance regression by disable reprogramming in remove_hrtimer
  2013-07-30 11:59       ` Peter Zijlstra
@ 2013-08-03  6:55         ` ethan
  0 siblings, 0 replies; 31+ messages in thread
From: ethan @ 2013-08-03 6:55 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Thomas Gleixner, Ingo Molnar, LKML, johlstei, Yinghai Lu, Jin Feng

Peter and tglx,

About which hrtimer causes the performance issue: I did some tests with
my home server (an ASUS PC server with a 4-core Intel i5 CPU), since I
am on vacation. Running stable 3.11-rc3+ (with the reschedule IPI
removed), I got almost the same result as I did with the Sun servers. I
suspect it is the tick_sched_timer; the evidence is as follows.

While running test tools such as
http://people.redhat.com/mingo/cfs-scheduler/tools/pipe-test-1m.c, the
powertop tool shows the following result:

PowerTOP version 1.11      (C) 2007 Intel Corporation

Cn                Avg residency       P-states (frequencies)
C0 (cpu running)        ( 2.2%)       Turbo Mode     0.0%
polling           0.0ms ( 0.0%)       3.21 Ghz       0.0%
C1 mwait          0.6ms ( 0.9%)       3.10 Ghz       0.0%
C2 mwait          0.5ms ( 0.1%)       3.00 Ghz       0.0%
C3 mwait          0.9ms (96.7%)       1.60 Ghz     100.0%

Wakeups-from-idle per second : 1100.9     interval: 0.3s
no ACPI power usage estimate available

Top causes for wakeups:
  23.5% (  inf)       swapper/3 : hrtimer_start_range_ns (tick_sched_timer)
  23.5% (  inf)       swapper/0 : hrtimer_start_range_ns (tick_sched_timer)
  23.5% (  inf)       swapper/2 : hrtimer_start_range_ns (tick_sched_timer)
  23.5% (  inf)       swapper/1 : hrtimer_start_range_ns (tick_sched_timer)
   2.4% (  inf)     <interrupt> : xhci_hcd
   2.4% (  inf)  USB device 3-4 : Basic Optical Mouse (Microsoft)
   0.3% (  inf)       rcu_sched : rcu_gp_kthread (process_timeout)
   0.2% (  inf)  gnome-terminal : hrtimer_start_range_ns (hrtimer_wakeup)
   0.2% (  inf)     <interrupt> : PS/2 keyboard/mouse/touchpad
   0.1% (  inf)       swapper/0 : hrtimer_start (menu_hrtimer_notify)
   0.1% (  inf)       swapper/0 : clocksource_watchdog (clocksource_watchdog)
   0.1% (  inf)       cimserver : hrtimer_start_range_ns (hrtimer_wakeup)
0.1% ( inf) avahi-daemon : hrtimer_start_range_ns (hrtimer_wakeup) 0.1% ( inf) kworker/0:1 : queue_delayed_work_on (delayed_work_timer_fn) 0.1% ( inf) rtkit-daemon : hrtimer_start_range_ns (hrtimer_wakeup) And the /proc/timers_list [root@localhost proc]# cat timer_list Timer List Version: v0.7 HRTIMER_MAX_CLOCK_BASES: 4 now at 1463485951666 nsecs cpu: 0 clock 0: .base: ffff88021ea0d2c0 .index: 0 .resolution: 1 nsecs .get_time: ktime_get .offset: 0 nsecs active timers: #0: <ffff88021ea0d780>, tick_sched_timer, S:01, hrtimer_start_range_ns, swapper/0/1 # expires at 1463486000000-1463486000000 nsecs [in 48507 to 48507 nsecs] #1: <ffff88021ea0d900>, watchdog_timer_fn, S:01, hrtimer_start, watchdog/0/10 # expires at 1464038000000-1464038000000 nsecs [in 552048507 to 552048507 nsecs] #2: <ffff880215b139d8>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, gnome-power-man/7341 # expires at 1467041141813-1467051131812 nsecs [in 3555190320 to 3565180319 nsecs] #3: <ffff8802118eb8c8>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, ntpd/6806 # expires at 1469048348291-1469061256427 nsecs [in 5562396798 to 5575304934 nsecs] #4: <ffff88020f34b8c8>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, cimservermain/6993 # expires at 1521114040522-1521214040522 nsecs [in 57628089029 to 57728089029 nsecs] #5: <ffff880215b57d78>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, master/6886 # expires at 1521833602390-1521892602389 nsecs [in 58347650897 to 58406650896 nsecs] #6: <ffff880213c9c238>, it_real_fn, S:01, hrtimer_start, master/6886 # expires at 1795833601062-1795833601062 nsecs [in 332347649569 to 332347649569 nsecs] clock 1: .base: ffff88021ea0d300 .index: 1 .resolution: 1 nsecs .get_time: ktime_get_real .offset: 1375508485789320552 nsecs active timers: #0: <ffff8802157e5e28>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, automount/6767 # expires at 1375509966602894000-1375509966602944000 nsecs [in 1375508503116942507 to 1375508503116992507 nsecs] clock 2: .base: ffff88021ea0d340 
.index: 2 .resolution: 1 nsecs .get_time: ktime_get_boottime .offset: 0 nsecs active timers: clock 3: .base: ffff88021ea0d380 .index: 3 .resolution: 1 nsecs .get_time: ktime_get_clocktai .offset: 1375508485789320552 nsecs active timers: .expires_next : 1463487000000 nsecs .hres_active : 1 .nr_events : 1489177 .nr_retries : 145 .nr_hangs : 0 .max_hang_time : 0 nsecs .nohz_mode : 0 .last_tick : 0 nsecs .tick_stopped : 0 .idle_jiffies : 0 .idle_calls : 0 .idle_sleeps : 0 .idle_entrytime : 0 nsecs .idle_waketime : 0 nsecs .idle_exittime : 0 nsecs .idle_sleeptime : 0 nsecs .iowait_sleeptime: 0 nsecs .last_jiffies : 0 .next_jiffies : 0 .idle_expires : 0 nsecs jiffies: 4296130782 cpu: 1 clock 0: .base: ffff88021ea8d2c0 .index: 0 .resolution: 1 nsecs .get_time: ktime_get .offset: 0 nsecs active timers: #0: <ffff88021ea8d780>, tick_sched_timer, S:01, hrtimer_start_range_ns, swapper/1/0 # expires at 1463487000000-1463487000000 nsecs [in 1048507 to 1048507 nsecs] #1: <ffff88021ea8d900>, watchdog_timer_fn, S:01, hrtimer_start, watchdog/1/11 # expires at 1464038000000-1464038000000 nsecs [in 552048507 to 552048507 nsecs] #2: <ffff88020e2739d8>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, hald-addon-stor/6726 # expires at 1464211071890-1464213064889 nsecs [in 725120397 to 727113396 nsecs] #3: <ffff8802157c99d8>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, rtkit-daemon/7358 # expires at 1467525043210-1467530043209 nsecs [in 4039091717 to 4044091716 nsecs] #4: <ffff880213ec19d8>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, rtkit-daemon/7359 # expires at 1472525038227-1472525038227 nsecs [in 9039086734 to 9039086734 nsecs] #5: <ffff880211a29ea8>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, crond/6941 # expires at 1475497502106-1475497552106 nsecs [in 12011550613 to 12011600613 nsecs] #6: <ffff88021519d9d8>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, hald/6673 # expires at 1490211101222-1490241071221 nsecs [in 26725149729 to 26755119728 nsecs] #7: <ffff880215055d78>, 
hrtimer_wakeup, S:01, hrtimer_start_range_ns, certmonger/7048 # expires at 1492497024071-1492527024070 nsecs [in 29011072578 to 29041072577 nsecs] #8: <ffff88021182d8c8>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, rsyslogd/6412 # expires at 86416926181907-86417026181907 nsecs [in 84953440230414 to 84953540230414 nsecs] clock 1: .base: ffff88021ea8d300 .index: 1 .resolution: 1 nsecs .get_time: ktime_get_real .offset: 1375508485789320552 nsecs active timers: clock 2: .base: ffff88021ea8d340 .index: 2 .resolution: 1 nsecs .get_time: ktime_get_boottime .offset: 0 nsecs active timers: clock 3: .base: ffff88021ea8d380 .index: 3 .resolution: 1 nsecs .get_time: ktime_get_clocktai .offset: 1375508485789320552 nsecs active timers: .expires_next : 1463487000000 nsecs .hres_active : 1 .nr_events : 1505848 .nr_retries : 273 .nr_hangs : 0 .max_hang_time : 0 nsecs .nohz_mode : 0 .last_tick : 0 nsecs .tick_stopped : 0 .idle_jiffies : 0 .idle_calls : 0 .idle_sleeps : 0 .idle_entrytime : 0 nsecs .idle_waketime : 0 nsecs .idle_exittime : 0 nsecs .idle_sleeptime : 0 nsecs .iowait_sleeptime: 0 nsecs .last_jiffies : 0 .next_jiffies : 0 .idle_expires : 0 nsecs jiffies: 4296130782 cpu: 2 clock 0: .base: ffff88021eb0d2c0 .index: 0 .resolution: 1 nsecs .get_time: ktime_get .offset: 0 nsecs active timers: #0: <ffff88021eb0d780>, tick_sched_timer, S:01, hrtimer_start_range_ns, swapper/2/0 # expires at 1463487000000-1463487000000 nsecs [in 1048507 to 1048507 nsecs] #1: <ffff880213f918c8>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, cimserver/6992 # expires at 1463603820300-1463604070299 nsecs [in 117868807 to 118118806 nsecs] #2: <ffff88021eb0d900>, watchdog_timer_fn, S:01, hrtimer_start, watchdog/2/16 # expires at 1464050000000-1464050000000 nsecs [in 564048507 to 564048507 nsecs] #3: <ffff8802156939d8>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, gnome-settings-/7329 # expires at 1467041063935-1467045059934 nsecs [in 3555112442 to 3559108441 nsecs] #4: <ffff880214db39d8>, 
hrtimer_wakeup, S:01, hrtimer_start_range_ns, gconfd-2/7310 # expires at 1467041085986-1467071055985 nsecs [in 3555134493 to 3585104492 nsecs] #5: <ffff88020fb41ea8>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, irqbalance/6442 # expires at 1468210681330-1468210731330 nsecs [in 4724729837 to 4724779837 nsecs] #6: <ffff880211a5f9d8>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, rpcbind/6461 # expires at 1488949026351-1488979026350 nsecs [in 25463074858 to 25493074857 nsecs] #7: <ffff8802141ed9d8>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, NetworkManager/6495 # expires at 1518210833956-1518310833956 nsecs [in 54724882463 to 54824882463 nsecs] #8: <ffff88021492bd78>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, qmgr/6899 # expires at 1521592237483-1521692237483 nsecs [in 58106285990 to 58206285990 nsecs] #9: <ffff880214a929b8>, it_real_fn, S:01, hrtimer_start, qmgr/6899 # expires at 1554592221196-1554592221196 nsecs [in 91106269703 to 91106269703 nsecs] #10: <ffff880212685ea8>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, atd/6952 # expires at 3620649306705-3620649356705 nsecs [in 2157163355212 to 2157163405212 nsecs] clock 1: .base: ffff88021eb0d300 .index: 1 .resolution: 1 nsecs .get_time: ktime_get_real .offset: 1375508485789320552 nsecs active timers: clock 2: .base: ffff88021eb0d340 .index: 2 .resolution: 1 nsecs .get_time: ktime_get_boottime .offset: 0 nsecs active timers: clock 3: .base: ffff88021eb0d380 .index: 3 .resolution: 1 nsecs .get_time: ktime_get_clocktai .offset: 1375508485789320552 nsecs active timers: .expires_next : 1463487000000 nsecs .hres_active : 1 .nr_events : 1499325 .nr_retries : 236 .nr_hangs : 0 .max_hang_time : 0 nsecs .nohz_mode : 0 .last_tick : 0 nsecs .tick_stopped : 0 .idle_jiffies : 0 .idle_calls : 0 .idle_sleeps : 0 .idle_entrytime : 0 nsecs .idle_waketime : 0 nsecs .idle_exittime : 0 nsecs .idle_sleeptime : 0 nsecs .iowait_sleeptime: 0 nsecs .last_jiffies : 0 .next_jiffies : 0 .idle_expires : 0 nsecs jiffies: 4296130782 
cpu: 3 clock 0: .base: ffff88021eb8d2c0 .index: 0 .resolution: 1 nsecs .get_time: ktime_get .offset: 0 nsecs active timers: #0: <ffff88021eb8d780>, tick_sched_timer, S:01, hrtimer_start_range_ns, swapper/3/0 # expires at 1463487000000-1463487000000 nsecs [in 1048507 to 1048507 nsecs] #1: <ffff88021eb8d900>, watchdog_timer_fn, S:01, hrtimer_start, watchdog/3/21 # expires at 1464062000000-1464062000000 nsecs [in 576048507 to 576048507 nsecs] #2: <ffff880215b199d8>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, gdm-simple-gree/7340 # expires at 1474212089328-1474272025327 nsecs [in 10726137835 to 10786073834 nsecs] #3: <ffff88021402dea8>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, sleep/7669 # expires at 1520896517118-1520896567118 nsecs [in 57410565625 to 57410615625 nsecs] #4: <ffff880214befd78>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, pickup/6898 # expires at 1562833597553-1562933597552 nsecs [in 99347646060 to 99447646059 nsecs] #5: <ffff88021277d8c8>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, Xorg/7216 # expires at 1826170099331-1826270099331 nsecs [in 362684147838 to 362784147838 nsecs] #6: <ffff880214cda9b8>, it_real_fn, S:01, hrtimer_start, pickup/6898 # expires at 7462833595405-7462833595405 nsecs [in 5999347643912 to 5999347643912 nsecs] clock 1: .base: ffff88021eb8d300 .index: 1 .resolution: 1 nsecs .get_time: ktime_get_real .offset: 1375508485789320552 nsecs active timers: #0: <ffff880212e1de28>, hrtimer_wakeup, S:01, hrtimer_start_range_ns, cimservermain/6997 # expires at 1375509967695245000-1375509967695295000 nsecs [in 1375508504209293507 to 1375508504209343507 nsecs] clock 2: .base: ffff88021eb8d340 .index: 2 .resolution: 1 nsecs .get_time: ktime_get_boottime .offset: 0 nsecs active timers: clock 3: .base: ffff88021eb8d380 .index: 3 .resolution: 1 nsecs .get_time: ktime_get_clocktai .offset: 1375508485789320552 nsecs active timers: .expires_next : 1463487000000 nsecs .hres_active : 1 .nr_events : 1490001 .nr_retries : 347 .nr_hangs : 0 
.max_hang_time  : 0 nsecs
.nohz_mode      : 0
.last_tick      : 0 nsecs
.tick_stopped   : 0
.idle_jiffies   : 0
.idle_calls     : 0
.idle_sleeps    : 0
.idle_entrytime : 0 nsecs
.idle_waketime  : 0 nsecs
.idle_exittime  : 0 nsecs
.idle_sleeptime : 0 nsecs
.iowait_sleeptime: 0 nsecs
.last_jiffies   : 0
.next_jiffies   : 0
.idle_expires   : 0 nsecs
jiffies: 4296130782

Tick Device: mode:     1
Broadcast device
Clock Event Device: hpet
 max_delta_ns:   149983013276
 min_delta_ns:   13409
 mult:           61496111
 shift:          32
 mode:           1
 next_event:     9223372036854775807 nsecs
 set_next_event: hpet_legacy_next_event
 set_mode:       hpet_legacy_set_mode
 event_handler:  tick_handle_oneshot_broadcast
 retries:        0
tick_broadcast_mask: 00000000
tick_broadcast_oneshot_mask: 00000000

Tick Device: mode:     1
Per CPU device: 0
Clock Event Device: lapic
 max_delta_ns:   1377930800842
 min_delta_ns:   1000
 mult:           13387279
 shift:          27
 mode:           3
 next_event:     1463487000000 nsecs
 set_next_event: lapic_next_deadline
 set_mode:       lapic_timer_setup
 event_handler:  hrtimer_interrupt
 retries:        0

Tick Device: mode:     1
Per CPU device: 1
Clock Event Device: lapic
 max_delta_ns:   1377930800842
 min_delta_ns:   1000
 mult:           13387279
 shift:          27
 mode:           3
 next_event:     1463487000000 nsecs
 set_next_event: lapic_next_deadline
 set_mode:       lapic_timer_setup
 event_handler:  hrtimer_interrupt
 retries:        1

Tick Device: mode:     1
Per CPU device: 2
Clock Event Device: lapic
 max_delta_ns:   1377930800842
 min_delta_ns:   1000
 mult:           13387279
 shift:          27
 mode:           3
 next_event:     1463487000000 nsecs
 set_next_event: lapic_next_deadline
 set_mode:       lapic_timer_setup
 event_handler:  hrtimer_interrupt
 retries:        0

Tick Device: mode:     1
Per CPU device: 3
Clock Event Device: lapic
 max_delta_ns:   1377930800842
 min_delta_ns:   1000
 mult:           13387279
 shift:          27
 mode:           3
 next_event:     1463487000000 nsecs
 set_next_event: lapic_next_deadline
 set_mode:       lapic_timer_setup
 event_handler:  hrtimer_interrupt
 retries:        0

Note, I removed the reschedule IPI for testing; otherwise the reschedule
IPI is the first cause of wakeups (in another mail I will tell why I
removed the reschedule IPI).

The tick_sched_timers are issued by:

	cpu_idle_loop()
	  => tick_nohz_idle_enter()
	    => tick_nohz_stop_sched_tick()
	      => raise_softirq_irqoff(TIMER_SOFTIRQ)
	        => run_timer_softirq()
	          => tick_sched_timer()
	            => tick_sched_handle()
	              => update_process_times()
	                => run_local_timers()
	                  => raise_softirq_irqoff(TIMER_SOFTIRQ)
	                       (loops back to run_timer_softirq())

Thanks,
Ethan

On 2013-7-30, at 7:59 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Jul 30, 2013 at 07:44:03PM +0800, Ethan Zhao wrote:
>> Got it.
>> What tglx and you mean is
>>
>> So the expensive thing may not be inside schedule(), but outside the
>> scheduler, in the bigger outer loop.
>>
>> This is one part of what I am facing.
>
> Right, so it would be good if you could further diagnose the problem so
> we can come up with a solution that cures the problem while retaining
> the current 'desired' properties.
>
> The patch you pinpointed caused a regression in that it would wake from
> NOHZ mode far too often. Could it be that the now longer idle sections
> cause your CPU to go into deeper idle modes and you're suffering from
> idle-exit latencies?

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH V3]hrtimer: Fix a performance regression by disable reprogramming in remove_hrtimer
  2013-07-30 11:59       ` Peter Zijlstra
  2013-08-03  6:55         ` ethan
@ 2013-08-03  7:37         ` ethan
  0 siblings, 0 replies; 31+ messages in thread
From: ethan @ 2013-08-03 7:37 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Thomas Gleixner, Ingo Molnar, LKML, johlstei, Yinghai Lu, Jin Feng

Peter and tglx,

Some other tough hacking and testing, with results, FYI.

With the default kernel 2.6.32-279.19.1.el6.x86_64 of CentOS 6.3 running
on my ASUS 4-core Intel i5 server, I got almost the best performance of
the tool http://people.redhat.com/mingo/cfs-scheduler/tools/pipe-test-1m.c:

[root@localhost ~]# time ./pipe-test-1m

real	0m7.704s
user	0m0.047s
sys	0m4.815s

[root@localhost ~]# time ./pipe-test-1m

real	0m8.000s
user	0m0.071s
sys	0m5.035s

[root@localhost ~]# time ./pipe-test-1m

real	0m7.386s
user	0m0.086s
sys	0m4.591s

[root@localhost ~]# time ./pipe-test-1m

real	0m7.919s
user	0m0.064s
sys	0m4.912s

[root@localhost ~]# time ./pipe-test-1m

real	0m7.949s
user	0m0.083s
sys	0m4.917s

[root@localhost ~]# time ./pipe-test-1m

real	0m7.913s
user	0m0.070s
sys	0m4.903s

[root@localhost ~]# time ./pipe-test-1m

real	0m7.953s
user	0m0.092s
sys	0m4.881s

[root@localhost ~]# time ./pipe-test-1m

real	0m8.059s
user	0m0.108s
sys	0m5.037s

[root@localhost ~]#

Then I compiled and booted stable 3.11.0-rc3 with the default
configuration and redid the same test.
Got very bad performance:

[root@localhost ~]# uname -a
Linux localhost 3.11.0-rc3 #4 SMP Wed Jul 31 16:10:56 EDT 2013 x86_64 x86_64 x86_64 GNU/Linux

real	0m10.730s
user	0m0.245s
sys	0m6.596s

[root@localhost ~]# time ./pipe-test-1m

real	0m10.661s
user	0m0.218s
sys	0m6.520s

[root@localhost ~]# time ./pipe-test-1m

real	0m10.699s
user	0m0.233s
sys	0m6.534s

[root@localhost ~]# time ./pipe-test-1m

real	0m10.616s
user	0m0.191s
sys	0m6.505s

[root@localhost ~]# time ./pipe-test-1m

real	0m10.546s
user	0m0.214s
sys	0m6.441s

[root@localhost ~]# time ./pipe-test-1m

real	0m10.631s
user	0m0.204s
sys	0m6.509s

The first 'tough' hack was to disable the reprogramming in
__remove_hrtimer() in the 3.11-rc3 code and redo the test. Much better:

[root@localhost ~]# time ./pipe-test-1m

real	0m9.447s
user	0m0.227s
sys	0m5.900s

[root@localhost ~]# time ./pipe-test-1m

real	0m9.507s
user	0m0.226s
sys	0m5.922s

[root@localhost ~]# time ./pipe-test-1m

real	0m9.495s
user	0m0.228s
sys	0m5.916s

[root@localhost ~]# time ./pipe-test-1m

real	0m9.470s
user	0m0.229s
sys	0m5.938s

[root@localhost ~]# time ./pipe-test-1m

real	0m9.484s
user	0m0.269s
sys	0m5.875s

[root@localhost ~]# time ./pipe-test-1m

real	0m9.328s
user	0m0.242s
sys	0m5.767s

While monitoring the wakeups with powertop, I got:

Top causes for wakeups:
  98.5% (  inf)   <kernel IPI> : Rescheduling interrupts
   0.5% (  inf)      swapper/3 : hrtimer_start_range_ns (tick_sched_timer)
   0.3% (  inf)      swapper/2 : hrtimer_start_range_ns (tick_sched_timer)
   0.2% (  inf)      swapper/1 : hrtimer_start_range_ns (tick_sched_timer)
   0.2% (  inf)      swapper/0 : hrtimer_start_range_ns (tick_sched_timer)

So I did the second tough hack: commented out the rescheduling IPI send
in the following function and redid the test.
diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index 4137890..c27f04f 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -137,7 +137,7 @@ static inline void play_dead(void)

 static inline void smp_send_reschedule(int cpu)
 {
-	smp_ops.smp_send_reschedule(cpu);
+	/* smp_ops.smp_send_reschedule(cpu); */
 }

Got performance as good as the 2.6.32 kernel, and the scheduling also
seems OK:

[root@localhost ~]# time ./pipe-test-1m

real	0m7.661s
user	0m0.179s
sys	0m4.880s

[root@localhost ~]# time ./pipe-test-1m

real	0m7.473s
user	0m0.189s
sys	0m4.782s

[root@localhost ~]# time ./pipe-test-1m

real	0m7.658s
user	0m0.195s
sys	0m4.899s

[root@localhost ~]# time ./pipe-test-1m

real	0m7.644s
user	0m0.194s
sys	0m4.941s

[root@localhost ~]# time ./pipe-test-1m

real	0m7.694s
user	0m0.189s
sys	0m4.925s

[root@localhost ~]# time ./pipe-test-1m

real	0m7.694s
user	0m0.197s
sys	0m4.915s

[root@localhost ~]# time ./pipe-test-1m

real	0m7.597s
user	0m0.190s
sys	0m4.886s

The two processes of pipe-test-1m (parent and child) seem to be
balanced well across cpu0 to cpu3:

# top f J
14888 root 20 0  68   0 R 73.2 0.0 0:03.22 2 pip1m
14887 root 20 0 284 224 S 63.4 0.0 0:03.23 0 pip1m

So the above tough hacking and testing basically show that the No.1
expensive thing is the rescheduling IPI, and the No.2 expensive thing is
the extra hrtimer reprogramming/tick in the Linux 3.11-rc3 code. We need
to manage to do as few rescheduling IPIs and reprogrammings as possible
to get better performance.

Does it (the tough hacking and the test) make sense? Is the result
rational?

Thanks,
Ethan

On 2013-7-30, at 7:59 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Jul 30, 2013 at 07:44:03PM +0800, Ethan Zhao wrote:
>> Got it.
>> What tglx and you mean is
>>
>> So the expensive thing may not be inside schedule(), but outside the
>> scheduler, in the bigger outer loop.
>>
>> This is one part of what I am facing.
>
> Right, so it would be good if you could further diagnose the problem so
> we can come up with a solution that cures the problem while retaining
> the current 'desired' properties.
>
> The patch you pinpointed caused a regression in that it would wake from
> NOHZ mode far too often. Could it be that the now longer idle sections
> cause your CPU to go into deeper idle modes and you're suffering from
> idle-exit latencies?

^ permalink raw reply related	[flat|nested] 31+ messages in thread
* Re: [PATCH V3]hrtimer: Fix a performance regression by disable reprogramming in remove_hrtimer
  2013-07-30  9:35     ` Peter Zijlstra
  2013-07-30 11:44       ` Ethan Zhao
@ 2013-08-06  7:29       ` Mike Galbraith
  2013-08-06  7:46         ` Mike Galbraith
  ` (2 more replies)
  1 sibling, 3 replies; 31+ messages in thread
From: Mike Galbraith @ 2013-08-06  7:29 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ethan Zhao, Thomas Gleixner, Ingo Molnar, LKML, johlstei,
	Yinghai Lu, Jin Feng

On Tue, 2013-07-30 at 11:35 +0200, Peter Zijlstra wrote:

> It would be good if you could do what Thomas suggested and look at which
> timer is actually active during your workload.

Rebuilding regression test trees, some pipe-test results...

I'm missing mwait_idle() rather a lot on Q6600, and at 3.8, E5620 took a
severe NOHZ drubbing from the menu governor.

pipe-test, scheduling cross core

NOTE: nohz is throttled here (patchlet below), so as not to eat the
horrible micro-idle cost; see E5620 v3.7.10-nothrottle below.

Q6600
v3.8.13               500.6 KHz  1.000
v3.9.11               422.4 KHz   .843
v3.10.4               420.2 KHz   .839
v3.11-rc3-4-g36f571e  404.7 KHz   .808

Q6600 3.9 regression:
guilty party is 69fb3676 x86 idle: remove mwait_idle() and "idle=mwait" cmdline param
halt sucks, HTH does one activate mwait_idle_with_hints() [processor_idle()] for core2 boxen?

E5620                                      +write 0 -> /dev/cpu_dma_latency, hold open
v3.7.10               578.5 KHz  1.000     675.4 KHz  1.000
v3.7.10-nothrottle    366.7 KHz   .633     395.0 KHz   .584
v3.8.13               468.3 KHz   .809     690.0 KHz  1.021
v3.8.13 idle=mwait    595.1 KHz  1.028     NA
v3.9.11               462.0 KHz   .798     691.1 KHz  1.023
v3.10.4               419.4 KHz   .724     570.8 KHz   .845
v3.11-rc3-4-g36f571e  400.1 KHz   .691     538.5 KHz   .797

E5620 3.8 regression:
guilty party: 69a37bea cpuidle: Quickly notice prediction failure for repeat mode

Q6600 (2.4 GHz core2 quad)
v3.11-rc3-4-g36f571e                     v3.8.13
 7.97% [k] reschedule_interrupt           8.63% [k] __schedule
 6.27% [k] __schedule                     6.07% [k] native_sched_clock
 4.74% [k] native_sched_clock             4.96% [k] system_call
 4.23% [k] _raw_spin_lock_irqsave         4.30% [k] _raw_spin_lock_irqsave
 3.39% [k] system_call                    4.06% [k] resched_task
 2.89% [k] sched_clock_local              3.44% [k] sched_clock_local
 2.79% [k] mutex_lock                     3.39% [k] pipe_read
 2.57% [k] pipe_read                      3.21% [k] mutex_lock
 2.55% [k] __switch_to                    2.98% [k] read_tsc
 2.24% [k] read_tsc                       2.87% [k] __switch_to

E5620 (2.4 GHz Westmere quad)
v3.7.10                               v3.7.10-nothrottle                        v3.7.10-nothrottle
 8.01% [k] __schedule                 25.80% [k] _raw_spin_unlock_irqrestore    21.80% [k] _raw_spin_unlock_irqrestore
 4.49% [k] resched_task                4.64% [k] __hrtimer_start_range_ns       - _raw_spin_unlock_irqrestore
 3.94% [k] mutex_lock                  4.62% [k] timerqueue_add                 + 37.94% __hrtimer_start_range_ns
 3.44% [k] __switch_to                 4.54% [k] __schedule                       19.69% hrtimer_cancel
 3.18% [k] menu_select                 2.84% [k] enqueue_hrtimer                   tick_nohz_restart
 3.05% [k] copy_user_generic_string    2.64% [k] resched_task                      tick_nohz_idle_exit
 3.02% [k] task_waking_fair            2.29% [k] _raw_spin_lock_irqsave            cpu_idle
 2.91% [k] mutex_unlock                2.28% [k] mutex_lock                        start_secondary
 2.82% [k] pipe_read                   1.96% [k] __switch_to                    + 16.05% hrtimer_start_range_ns
 2.32% [k] ktime_get_real              1.73% [k] menu_select                      15.46% hrtimer_start
                                                                                    tick_nohz_stop_sched_tick
                                                                                    __tick_nohz_idle_enter
                                                                                    tick_nohz_idle_enter
                                                                                    cpu_idle
                                                                                    start_secondary
                                                                                  6.37% hrtimer_try_to_cancel
                                                                                    hrtimer_cancel
                                                                                    tick_nohz_restart
                                                                                    tick_nohz_idle_exit
                                                                                    cpu_idle
                                                                                    start_secondary

v3.8.13                                 v3.8.13 idle=mwait                     v3.8.13 (throttled, but menu gov bites.. HARD)
23.16% [k] _raw_spin_unlock_irqrestore   8.35% [k] __schedule                  - 22.91% [k] _raw_spin_unlock_irqrestore
 4.93% [k] __schedule                    6.49% [k] __switch_to                   - _raw_spin_unlock_irqrestore
 3.42% [k] resched_task                  5.71% [k] resched_task                    - 47.26% hrtimer_try_to_cancel
 3.27% [k] __switch_to                   4.64% [k] mutex_lock                          hrtimer_cancel
 3.05% [k] mutex_lock                    3.48% [k] copy_user_generic_string           menu_hrtimer_cancel
 2.32% [k] copy_user_generic_string      3.15% [k] task_waking_fair                   tick_nohz_idle_exit
 2.30% [k] _raw_spin_lock_irqsave        3.13% [k] pipe_read                          cpu_idle
 2.15% [k] pipe_read                     2.61% [k] mutex_unlock                       start_secondary
 2.15% [k] task_waking_fair              2.54% [k] finish_task_switch            - 40.01% __hrtimer_start_range_ns
 2.08% [k] ktime_get                     2.29% [k] _raw_spin_lock_irqsave             hrtimer_start
 1.87% [k] mutex_unlock                  1.91% [k] idle_cpu                           menu_select
 1.76% [k] finish_task_switch            1.84% [k] __wake_up_common                   cpuidle_idle_call
                                                                                      cpu_idle
                                                                                      start_secondary

v3.9.11
18.67% [k] _raw_spin_unlock_irqrestore
 4.36% [k] __schedule
 3.66% [k] __switch_to
 3.13% [k] mutex_lock
 2.97% [k] __hrtimer_start_range_ns
 2.69% [k] _raw_spin_lock_irqsave
 2.38% [k] copy_user_generic_string
 2.34% [k] hrtimer_reprogram.isra.32
 2.34% [k] task_waking_fair
 2.25% [k] ktime_get
 2.14% [k] pipe_read
 1.98% [k] menu_select

v3.10.4
20.42% [k] _raw_spin_unlock_irqrestore
 4.75% [k] __schedule
 4.42% [k] reschedule_interrupt   <== appears in 3.10, guilty party as yet unknown
 3.52% [k] __switch_to
 3.27% [k] resched_task
 2.64% [k] cpuidle_enter_state
 2.63% [k] _raw_spin_lock_irqsave
 2.04% [k] copy_user_generic_string
 2.00% [k] cpu_idle_loop
 1.97% [k] mutex_lock
 1.90% [k] ktime_get
 1.75% [k] task_waking_fair

v3.11-rc3-4-g36f571e
18.96% [k] _raw_spin_unlock_irqrestore
 4.84% [k] __schedule
 4.69% [k] reschedule_interrupt
 3.75% [k] __switch_to
 2.62% [k] _raw_spin_lock_irqsave
 2.43% [k] cpuidle_enter_state
 2.28% [k] resched_task
 2.20% [k] cpu_idle_loop
 1.97% [k] copy_user_generic_string
 1.88% [k] ktime_get
 1.81% [k] task_waking_fair
 1.75% [k] mutex_lock

sched: ratelimit nohz

Entering nohz code on every micro-idle is too expensive to bear.

Signed-off-by: Mike Galbraith <efault@gmx.de>

---
 include/linux/sched.h    | 5 +++++
 kernel/sched/core.c      | 5 +++++
 kernel/time/tick-sched.c | 2 +-
 3 files changed, 11 insertions(+), 1 deletion(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -235,9 +235,14 @@ extern int runqueue_is_locked(int cpu);
 extern void nohz_balance_enter_idle(int cpu);
 extern void set_cpu_sd_state_idle(void);
 extern int get_nohz_timer_target(void);
+extern int sched_needs_cpu(int cpu);
 #else
 static inline void nohz_balance_enter_idle(int cpu) { }
 static inline void set_cpu_sd_state_idle(void) { }
+static inline int sched_needs_cpu(int cpu)
+{
+	return 0;
+}
 #endif
 
 /*
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -650,6 +650,11 @@ static inline bool got_nohz_idle_kick(vo
 	return false;
 }
 
+int sched_needs_cpu(int cpu)
+{
+	return cpu_rq(cpu)->avg_idle < sysctl_sched_migration_cost;
+}
+
 #else /* CONFIG_NO_HZ_COMMON */
 
 static inline bool got_nohz_idle_kick(void)
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -548,7 +548,7 @@ static ktime_t tick_nohz_stop_sched_tick
 		time_delta = timekeeping_max_deferment();
 	} while (read_seqretry(&jiffies_lock, seq));
 
-	if (rcu_needs_cpu(cpu, &rcu_delta_jiffies) ||
+	if (sched_needs_cpu(cpu) || rcu_needs_cpu(cpu, &rcu_delta_jiffies) ||
 	    arch_needs_cpu(cpu) || irq_work_needs_cpu()) {
 		next_jiffies = last_jiffies + 1;
 		delta_jiffies = 1;

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH V3]hrtimer: Fix a performance regression by disable reprogramming in remove_hrtimer
  2013-08-06  7:29 ` Mike Galbraith
@ 2013-08-06  7:46   ` Mike Galbraith
  2013-08-08  4:31     ` ethan.zhao
  2013-08-07  8:25   ` Mike Galbraith
  2013-08-08 15:02   ` ethan.zhao
  2 siblings, 1 reply; 31+ messages in thread
From: Mike Galbraith @ 2013-08-06  7:46 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ethan Zhao, Thomas Gleixner, Ingo Molnar, LKML, johlstei,
	Yinghai Lu, Jin Feng, Youquan Song, LenBrown

(CCs Intel folks)

On Tue, 2013-08-06 at 09:29 +0200, Mike Galbraith wrote:
> On Tue, 2013-07-30 at 11:35 +0200, Peter Zijlstra wrote:
>
>> It would be good if you could do what Thomas suggested and look at which
>> timer is actually active during your workload.
>
> Rebuilding regression test trees, some pipe-test results...
>
> [snip: full results and the "sched: ratelimit nohz" patch, quoted
> verbatim from the previous message]

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH V3]hrtimer: Fix a performance regression by disable reprogramming in remove_hrtimer
  2013-08-06  7:46 ` Mike Galbraith
@ 2013-08-08  4:31   ` ethan.zhao
  2013-08-08  5:29     ` Mike Galbraith
  2013-08-08  9:04     ` ethan.zhao
  0 siblings, 2 replies; 31+ messages in thread
From: ethan.zhao @ 2013-08-08  4:31 UTC (permalink / raw)
To: Mike Galbraith
Cc: Peter Zijlstra, Thomas Gleixner, Ingo Molnar, LKML, johlstei,
	Yinghai Lu, Jin Feng, Youquan Song, LenBrown

Hi Peter and Mike,

Here are some additional tests to verify the causes of the regression,
run on a 4-core Intel i5 ASUS PC with the same pipe test.

1. Default BIOS configuration and default 3.11-rc3 kernel.

[root@localhost ~]# time ./pip1m

real	0m10.683s
user	0m0.204s
sys	0m6.597s
[root@localhost ~]# time ./pip1m

real	0m10.629s
user	0m0.185s
sys	0m6.546s
[root@localhost ~]# uname -a
Linux localhost 3.11.0-rc3 #4 SMP Wed Jul 31 16:10:56 EDT 2013 x86_64 x86_64 x86_64 GNU/Linux

2. Same as 1, plus the idle=halt command-line parameter.

[root@localhost ~]# time ./pip1m

real	0m9.904s
user	0m0.200s
sys	0m6.209s
[root@localhost ~]# time ./pip1m

real	0m9.972s
user	0m0.201s
sys	0m6.200s

3. Same as 1, plus the idle=nomwait command-line parameter.

real	0m13.634s
user	0m0.407s
sys	0m7.820s
[root@localhost ~]# time ./pip1m

real	0m13.684s
user	0m0.416s
sys	0m7.845s

4. C1E/C3/C6 C-states and SpeedStep disabled in the BIOS, default
configuration of kernel 3.11-rc3.

[root@localhost ~]# time ./pip1m

real	0m5.371s
user	0m0.102s
sys	0m3.253s
[root@localhost ~]# time ./pip1m

real	0m5.329s
user	0m0.075s
sys	0m3.254s
[root@localhost ~]#

5. Same as 4, with reschedule-IPI sending commented out:

[root@localhost ~]# time ./pip1m

real	0m3.883s
user	0m0.098s
sys	0m2.480s
[root@localhost ~]# time ./pip1m

real	0m3.907s
user	0m0.070s
sys	0m2.552s

diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index 4137890..c27f04f 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -137,7 +137,7 @@ static inline void play_dead(void)
 
 static inline void smp_send_reschedule(int cpu)
 {
-	smp_ops.smp_send_reschedule(cpu);
+	/* smp_ops.smp_send_reschedule(cpu); */
 }

6. Same as 5, plus no clock-device reprogramming in remove_hrtimer().
Got the same result as 5.

real	0m3.915s
user	0m0.086s
sys	0m2.499s
[root@localhost ~]# time ./pip1m

real	0m3.919s
user	0m0.110s
sys	0m2.509s

So with C-states disabled, skipping the hrtimer reprogramming gains no
extra performance; but with C-states enabled, not reprogramming the
clock device produces more wakeup chances.

Thanks,
Ethan

On Aug 6, 2013, at 3:46 PM, Mike Galbraith <bitbucket@online.de> wrote:

> (CCs Intel folks)
>
> On Tue, 2013-08-06 at 09:29 +0200, Mike Galbraith wrote:
>> [snip: full results and the "sched: ratelimit nohz" patch]

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH V3]hrtimer: Fix a performance regression by disable reprogramming in remove_hrtimer
  2013-08-08  4:31 ` ethan.zhao
@ 2013-08-08  5:29   ` Mike Galbraith
  2013-08-08  5:51     ` Mike Galbraith
  2013-08-08  9:04   ` ethan.zhao
  1 sibling, 1 reply; 31+ messages in thread
From: Mike Galbraith @ 2013-08-08  5:29 UTC (permalink / raw)
To: ethan.zhao
Cc: Peter Zijlstra, Thomas Gleixner, Ingo Molnar, LKML, johlstei,
	Yinghai Lu, Jin Feng, Youquan Song, LenBrown

(I got several copies of that, one would have been plenty.  Some people
get all grumpy when they get repeats, and at least one, who shall remain
unnamed, gets even grumpier when folks don't trim their replies, so be
careful;)

On Thu, 2013-08-08 at 12:31 +0800, ethan.zhao wrote:

> diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
> index 4137890..c27f04f 100644
> --- a/arch/x86/include/asm/smp.h
> +++ b/arch/x86/include/asm/smp.h
> @@ -137,7 +137,7 @@ static inline void play_dead(void)
> 
>  static inline void smp_send_reschedule(int cpu)
>  {
> -	smp_ops.smp_send_reschedule(cpu);
> +	/* smp_ops.smp_send_reschedule(cpu); */
>  }

Hm.  You're much braver than I am.

-Mike

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH V3]hrtimer: Fix a performance regression by disable reprogramming in remove_hrtimer
  2013-08-08  5:29 ` Mike Galbraith
@ 2013-08-08  5:51   ` Mike Galbraith
  0 siblings, 0 replies; 31+ messages in thread
From: Mike Galbraith @ 2013-08-08  5:51 UTC (permalink / raw)
To: ethan.zhao
Cc: Peter Zijlstra, Thomas Gleixner, Ingo Molnar, LKML, johlstei,
	Yinghai Lu, Jin Feng, Youquan Song, LenBrown

On Thu, 2013-08-08 at 07:29 +0200, Mike Galbraith wrote:
> On Thu, 2013-08-08 at 12:31 +0800, ethan.zhao wrote:
>
>> diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
>> index 4137890..c27f04f 100644
>> --- a/arch/x86/include/asm/smp.h
>> +++ b/arch/x86/include/asm/smp.h
>> @@ -137,7 +137,7 @@ static inline void play_dead(void)
>> 
>>  static inline void smp_send_reschedule(int cpu)
>>  {
>> -	smp_ops.smp_send_reschedule(cpu);
>> +	/* smp_ops.smp_send_reschedule(cpu); */
>>  }
>
> Hm.  You're much braver than I am.

BTW, according to my testing, Peter's patch should make your box a lot
happier.  It won't make reschedule_interrupt crawl back under its rock,
but it should improve your pipe-test numbers quite a bit.

Also note that pipe-test is really kinda silly cross core; it's really
good for measuring fastpath weight gain when pinned.  When scheduled
cross core (we do that to cut latency for stuff that does real work), it
will always suck rocks, as it's 100% synchronous; there's nothing to
gain by occupying two cores.

(fortunately, real programs tend to do more than play ping-pong with a
microscopic ball;)

-Mike

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH V3]hrtimer: Fix a performance regression by disable reprogramming in remove_hrtimer
  2013-08-08  4:31 ` ethan.zhao
  2013-08-08  5:29   ` Mike Galbraith
@ 2013-08-08  9:04   ` ethan.zhao
  2013-08-08  9:05     ` ethan.zhao
  2013-08-08 12:14     ` Mike Galbraith
  1 sibling, 2 replies; 31+ messages in thread
From: ethan.zhao @ 2013-08-08  9:04 UTC (permalink / raw)
To: Mike Galbraith
Cc: Peter Zijlstra, Thomas Gleixner, Ingo Molnar, LKML, johlstei,
	Yinghai Lu, Jin Feng, Youquan Song, LenBrown

Kernel 3.11-rc1 + Peter's patch + Mike's patch, no C-states in BIOS: got
the same result as with only Peter's patch.  Of course, with no C-states
to enter, that makes sense.

[root@localhost ~]# time ./pip1m

real	0m4.381s
user	0m0.099s
sys	0m2.784s
[root@localhost ~]# time ./pip1m

real	0m4.436s
user	0m0.093s
sys	0m2.809s
[root@localhost ~]#

Retest with C-states enabled in BIOS:

[root@localhost ~]# time ./pip1m

real	0m8.670s
user	0m0.203s
sys	0m5.459s
[root@localhost ~]# time ./pip1m

real	0m8.489s
user	0m0.184s
sys	0m5.360s
[root@localhost ~]#

So is it Peter's patch or Mike's that is working here?  Compared with
the default 3.11-rc3 result of test case 1, as follows, it looks great.

[root@localhost ~]# time ./pip1m

real	0m10.683s
user	0m0.204s
sys	0m6.597s
[root@localhost ~]# time ./pip1m

real	0m10.629s
user	0m0.185s
sys	0m6.546s

So revert Mike's patch and retest; got:

[root@localhost ~]# time ./pip1m

real	0m8.606s
user	0m0.193s
sys	0m5.449s
[root@localhost ~]# time ./pip1m

real	0m8.655s
user	0m0.198s
sys	0m5.519s
[root@localhost ~]#

So it's Peter's patch that is working...

The result of kernel 3.11-rc3 + Peter's patch + no rescheduling IPI and
no C-states in BIOS is:

[root@localhost ~]# time ./pip1m

real	0m3.915s
user	0m0.088s
sys	0m2.487s
[root@localhost ~]# time ./pip1m

real	0m3.929s
user	0m0.082s
sys	0m2.560s
[root@localhost ~]# time ./pip1m

That is about 0.5 sec better than Peter's patch alone; but it is strange
that disabling only the rescheduling IPI gets almost the same result.

Thanks,
Ethan

On Aug 8, 2013, at 12:31 PM, ethan.zhao <ethan.kernel@gmail.com> wrote:

>>> [snip: quoted "sched: ratelimit nohz" patch]

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH V3]hrtimer: Fix a performance regression by disable reprogramming in remove_hrtimer
  2013-08-08  9:04 ` ethan.zhao
@ 2013-08-08  9:05   ` ethan.zhao
  0 siblings, 0 replies; 31+ messages in thread
From: ethan.zhao @ 2013-08-08  9:05 UTC (permalink / raw)
To: Mike Galbraith
Cc: Peter Zijlstra, Thomas Gleixner, Ingo Molnar, LKML, johlstei,
	Yinghai Lu, Jin Feng, Youquan Song, LenBrown

On Aug 8, 2013, at 5:04 PM, "ethan.zhao" <ethan.kernel@gmail.com> wrote:

> Kernel 3.11-rc1 + Peter's patch + Mike's patch, no C-states in BIOS:
> got the same result as with only Peter's patch.

Typo: that should read 3.11-rc3.

> [snip: rest of the previous message and the quoted patch]

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH V3]hrtimer: Fix a performance regression by disable reprogramming in remove_hrtimer
  2013-08-08  9:04 ` ethan.zhao
  2013-08-08  9:05   ` ethan.zhao
@ 2013-08-08 12:14   ` Mike Galbraith
  1 sibling, 0 replies; 31+ messages in thread
From: Mike Galbraith @ 2013-08-08 12:14 UTC (permalink / raw)
  To: ethan.zhao
  Cc: Peter Zijlstra, Thomas Gleixner, Ingo Molnar, LKML, johlstei,
      Yinghai Lu, Jin Feng, Youquan Song, LenBrown

On Thu, 2013-08-08 at 17:04 +0800, ethan.zhao wrote:

> So, it's Peter's patch that is doing the work...

Yes.

My patch killed the nohz tick start/stop overhead prior to 3.8, but in
3.8, the menu governor changes added more high frequency timer banging
that negated its effect. In 3.9, mwait went away, making core2 boxen
very sad, and in 3.10 all of x86 got generic idle, making my boxen even
sadder.

select_idle_sibling() (the thing that makes us schedule pipe-test and
many other things cross core if there's an idle core available) has
always been two-faced, doing both good and evil.. but with recent work,
it's leaning more toward evil than ever before.

-Mike

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH V3]hrtimer: Fix a performance regression by disable reprogramming in remove_hrtimer
  2013-08-06  7:29 ` Mike Galbraith
  2013-08-06  7:46   ` Mike Galbraith
@ 2013-08-07  8:25   ` Mike Galbraith
  2013-08-08  4:05     ` Mike Galbraith
  2013-08-08 15:02   ` ethan.zhao
  2 siblings, 1 reply; 31+ messages in thread
From: Mike Galbraith @ 2013-08-07 8:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ethan Zhao, Thomas Gleixner, Ingo Molnar, LKML, johlstei,
      Yinghai Lu, Jin Feng, Youquan Song, LenBrown

On Tue, 2013-08-06 at 09:29 +0200, Mike Galbraith wrote:
...
> E5620                                    +write 0 -> /dev/cpu_dma_latency, hold open
> v3.7.10                578.5 KHz  1.000   675.4 KHz  1.000
> v3.7.10-nothrottle     366.7 KHz   .633   395.0 KHz   .584
> v3.8.13                468.3 KHz   .809   690.0 KHz  1.021
> v3.8.13 idle=mwait     595.1 KHz  1.028          NA
> v3.9.11                462.0 KHz   .798   691.1 KHz  1.023
> v3.10.4                419.4 KHz   .724   570.8 KHz   .845
> v3.11-rc3-4-g36f571e   400.1 KHz   .691   538.5 KHz   .797

Adds Peter's patch, re-tests master/E5620, nohz throttled/unthrottled..

v3.11-rc4-27-ge4ef108   481.9 KHz   .833  1.000  throttle           (400->481 36f571e..e4ef108? spinlock overhead gone)
v3.11-rc4-27-ge4ef108   496.7 KHz   .858  1.030  throttle+peterz    (spinlock overhead gone)
v3.11-rc4-27-ge4ef108   340.0 KHz   .587  1.000  nothrottle         (spinlock overhead present)
v3.11-rc4-27-ge4ef108   440.7 KHz   .761  1.296  nothrottle+peterz  (spinlock overhead gone)

E5620 (2.4 GHz Westmere) nothrottle
v3.11-rc4-27-ge4ef108                      v3.11-rc4-27-ge4ef108+peterz
17.48%  [k] _raw_spin_unlock_irqrestore     5.37%  [k] __schedule
11.36%  [k] __hrtimer_start_range_ns        4.96%  [k] reschedule_interrupt
 3.78%  [k] __schedule                      3.82%  [k] _raw_spin_lock_irqsave
 3.54%  [k] reschedule_interrupt            3.62%  [k] __switch_to
 2.81%  [k] __switch_to                     3.05%  [k] cpuidle_enter_state
 2.64%  [k] _raw_spin_lock_irqsave          2.99%  [k] resched_task
 2.14%  [k] cpuidle_enter_state             2.39%  [k] mutex_lock
 1.97%  [k] resched_task                    2.31%  [k] copy_user_generic_string
 1.68%  [k] copy_user_generic_string        2.23%  [k] cpu_idle_loop
 1.68%  [k] cpu_idle_loop                   2.20%  [k] task_waking_fair

E5620 (2.4 GHz Westmere) throttle
v3.11-rc4-27-ge4ef108                      v3.11-rc4-27-ge4ef108+peterz
 6.76%  [k] __schedule                      7.68%  [k] reschedule_interrupt  (find this little bastard)
 5.90%  [k] reschedule_interrupt            7.18%  [k] __schedule
 4.12%  [k] __switch_to                     4.52%  [k] __switch_to
 3.58%  [k] resched_task                    4.24%  [k] cpuidle_enter_state
 3.24%  [k] cpuidle_enter_state             2.96%  [k] cpu_idle_loop
 3.01%  [k] _raw_spin_lock_irqsave          2.96%  [k] resched_task
 2.67%  [k] cpu_idle_loop                   2.79%  [k] _raw_spin_lock_irqsave
 2.49%  [k] task_waking_fair                2.45%  [k] ktime_get
 2.38%  [k] copy_user_generic_string        2.34%  [k] copy_user_generic_string
 2.27%  [k] mutex_lock                      2.32%  [k] get_typical_interval

And below is peterz..

On Mon, Jul 29, 2013 at 12:18:11PM +0200, Thomas Gleixner wrote:
> The reason why we want to do that is:
>
>     timer   expiry time
>     A       100us   -> programmed hardware event
>     B       2000us
>
> Restart timer A with an expiry time of 3000us without reprogramming:
>
>     timer   expiry time
>     NONE    100us   -> programmed hardware event
>     B       2000us
>     A       3000us
>
> We take an extra wakeup for reprogramming the hardware, which is
> counterproductive for power saving. So disabling it blindly will cause
> a regression for other people. We need to be smarter than that.

So aside from the complete mess posted, how about something like this?

*completely untested etc..*

mike: Builds(now)/boots/makes E5620 happier.

---
 include/linux/hrtimer.h |  5 ++++
 kernel/hrtimer.c        | 60 ++++++++++++++++++++++--------------------------
 2 files changed, 33 insertions(+), 32 deletions(-)

--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -185,6 +185,7 @@ struct hrtimer_cpu_base {
 	ktime_t				expires_next;
 	int				hres_active;
 	int				hang_detected;
+	int				timers_removed;
 	unsigned long			nr_events;
 	unsigned long			nr_retries;
 	unsigned long			nr_hangs;
@@ -261,6 +262,8 @@ static inline ktime_t hrtimer_expires_re
 #ifdef CONFIG_HIGH_RES_TIMERS
 struct clock_event_device;

+extern void hrtimer_enter_idle(void);
+
 extern void hrtimer_interrupt(struct clock_event_device *dev);

 /*
@@ -296,6 +299,8 @@ extern void clock_was_set_delayed(void);
 # define MONOTONIC_RES_NSEC	LOW_RES_NSEC
 # define KTIME_MONOTONIC_RES	KTIME_LOW_RES

+static inline void hrtimer_enter_idle(void) { }
+
 static inline void hrtimer_peek_ahead_timers(void) { }

 /*
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -673,6 +673,27 @@ static inline ktime_t hrtimer_update_bas
 	return ktime_get_update_offsets(offs_real, offs_boot, offs_tai);
 }

+static void __hrtimer_reprogram_all(struct hrtimer_cpu_base *base)
+{
+	if (!hrtimer_hres_active())
+		return;
+
+	raw_spin_lock(&base->lock);
+	hrtimer_update_base(base);
+	hrtimer_force_reprogram(base, 0);
+	raw_spin_unlock(&base->lock);
+}
+
+void hrtimer_enter_idle(void)
+{
+	struct hrtimer_cpu_base *base = &__get_cpu_var(hrtimer_bases);
+
+	if (base->timers_removed) {
+		base->timers_removed = 0;
+		__hrtimer_reprogram_all(base);
+	}
+}
+
 /*
  * Retrigger next event is called after clock was set
  *
@@ -682,13 +703,7 @@ static void retrigger_next_event(void *a
 {
 	struct hrtimer_cpu_base *base = &__get_cpu_var(hrtimer_bases);

-	if (!hrtimer_hres_active())
-		return;
-
-	raw_spin_lock(&base->lock);
-	hrtimer_update_base(base);
-	hrtimer_force_reprogram(base, 0);
-	raw_spin_unlock(&base->lock);
+	__hrtimer_reprogram_all(base);
 }

 /*
@@ -907,7 +922,7 @@ static int enqueue_hrtimer(struct hrtime
  */
 static void __remove_hrtimer(struct hrtimer *timer,
 			     struct hrtimer_clock_base *base,
-			     unsigned long newstate, int reprogram)
+			     unsigned long newstate)
 {
 	struct timerqueue_node *next_timer;
 	if (!(timer->state & HRTIMER_STATE_ENQUEUED))
@@ -915,19 +930,10 @@ static void __remove_hrti
 	next_timer = timerqueue_getnext(&base->active);
 	timerqueue_del(&base->active, &timer->node);
-	if (&timer->node == next_timer) {
 #ifdef CONFIG_HIGH_RES_TIMERS
-		/* Reprogram the clock event device. if enabled */
-		if (reprogram && hrtimer_hres_active()) {
-			ktime_t expires;
-
-			expires = ktime_sub(hrtimer_get_expires(timer),
-					    base->offset);
-			if (base->cpu_base->expires_next.tv64 == expires.tv64)
-				hrtimer_force_reprogram(base->cpu_base, 1);
-		}
+	if (hrtimer_hres_active() && &timer->node == next_timer)
+		base->cpu_base->timers_removed++;
 #endif
-	}
 	if (!timerqueue_getnext(&base->active))
 		base->cpu_base->active_bases &= ~(1 << base->index);
 out:
@@ -942,26 +948,16 @@ remove_hrtimer(struct hrtimer *timer, st
 {
 	if (hrtimer_is_queued(timer)) {
 		unsigned long state;
-		int reprogram;

-		/*
-		 * Remove the timer and force reprogramming when high
-		 * resolution mode is active and the timer is on the current
-		 * CPU. If we remove a timer on another CPU, reprogramming is
-		 * skipped. The interrupt event on this CPU is fired and
-		 * reprogramming happens in the interrupt handler. This is a
-		 * rare case and less expensive than a smp call.
-		 */
 		debug_deactivate(timer);
 		timer_stats_hrtimer_clear_start_info(timer);
-		reprogram = base->cpu_base == &__get_cpu_var(hrtimer_bases);

 		/*
 		 * We must preserve the CALLBACK state flag here,
 		 * otherwise we could move the timer base in
 		 * switch_hrtimer_base.
 		 */
 		state = timer->state & HRTIMER_STATE_CALLBACK;
-		__remove_hrtimer(timer, base, state, reprogram);
+		__remove_hrtimer(timer, base, state);
 		return 1;
 	}
 	return 0;
@@ -1243,7 +1239,7 @@ static void __run_hrtimer(struct hrtimer

 	WARN_ON(!irqs_disabled());
 	debug_deactivate(timer);
-	__remove_hrtimer(timer, base, HRTIMER_STATE_CALLBACK, 0);
+	__remove_hrtimer(timer, base, HRTIMER_STATE_CALLBACK);
 	timer_stats_account_hrtimer(timer);
 	fn = timer->function;
@@ -1690,7 +1686,7 @@ static void migrate_hrtimer_list(struct
 		 * timer could be seen as !active and just vanish away
 		 * under us on another CPU
 		 */
-		__remove_hrtimer(timer, old_base, HRTIMER_STATE_MIGRATE, 0);
+		__remove_hrtimer(timer, old_base, HRTIMER_STATE_MIGRATE);
 		timer->base = new_base;
 		/*
 		 * Enqueue the timers on the new cpu. This does not

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH V3]hrtimer: Fix a performance regression by disable reprogramming in remove_hrtimer
  2013-08-07  8:25 ` Mike Galbraith
@ 2013-08-08  4:05   ` Mike Galbraith
  0 siblings, 0 replies; 31+ messages in thread
From: Mike Galbraith @ 2013-08-08 4:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ethan Zhao, Thomas Gleixner, Ingo Molnar, LKML, johlstei,
      Yinghai Lu, Jin Feng, Youquan Song, LenBrown

On Wed, 2013-08-07 at 10:25 +0200, Mike Galbraith wrote:

> E5620 (2.4 GHz Westmere) throttle
>
> v3.11-rc4-27-ge4ef108                      v3.11-rc4-27-ge4ef108+peterz
>  6.76%  [k] __schedule                      7.68%  [k] reschedule_interrupt  (find this little bastard)
                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>  5.90%  [k] reschedule_interrupt            7.18%  [k] __schedule
>  4.12%  [k] __switch_to                     4.52%  [k] __switch_to
>  3.58%  [k] resched_task                    4.24%  [k] cpuidle_enter_state
>  3.24%  [k] cpuidle_enter_state             2.96%  [k] cpu_idle_loop
>  3.01%  [k] _raw_spin_lock_irqsave          2.96%  [k] resched_task
>  2.67%  [k] cpu_idle_loop                   2.79%  [k] _raw_spin_lock_irqsave
>  2.49%  [k] task_waking_fair                2.45%  [k] ktime_get
>  2.38%  [k] copy_user_generic_string        2.34%  [k] copy_user_generic_string
>  2.27%  [k] mutex_lock                      2.32%  [k] get_typical_interval

7d1a941731fabf27e5fb6edbebb79fe856edb4e5 is the first bad commit
commit 7d1a941731fabf27e5fb6edbebb79fe856edb4e5
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Thu Mar 21 22:50:03 2013 +0100

    x86: Use generic idle loop

So, in summary, it seems we need to forget all about scheduling anything
remotely fast cross core these days, even on core2 with its lovely
shared L2, as otherwise we'll beat it all up. Boxen with no nohz
throttle maybe don't notice how badly all this progress hurt because
they're used to being in agony.

Your patch (modulo any bustage; I haven't run any testsuite, only ran it
for regression hunt/test purposes) should make smiles for those who
notice a large chunk of that pain going away.

My beloved ole Q6600 is currently not particularly happy, and that's the
best case: shared L2, where nearly any dinky overlap used to be
reclaimable. Even with your patch, we'll still be losers at network fast
movers across the board on any box, methinks.

-Mike

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH V3]hrtimer: Fix a performance regression by disable reprogramming in remove_hrtimer
  2013-08-06  7:29 ` Mike Galbraith
  2013-08-06  7:46   ` Mike Galbraith
  2013-08-07  8:25   ` Mike Galbraith
@ 2013-08-08 15:02   ` ethan.zhao
  2013-08-09  6:52     ` Mike Galbraith
  2 siblings, 1 reply; 31+ messages in thread
From: ethan.zhao @ 2013-08-08 15:02 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Thomas Gleixner, Ingo Molnar, LKML, johlstei,
      Yinghai Lu, Jin Feng

On 2013-8-6, at 3:29 PM, Mike Galbraith <bitbucket@online.de> wrote:

> +int sched_needs_cpu(int cpu)
> +{
> +	return cpu_rq(cpu)->avg_idle < sysctl_sched_migration_cost;
> +}
> +
> #else /* CONFIG_NO_HZ_COMMON */
>
> static inline bool got_nohz_idle_kick(void)
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -548,7 +548,7 @@ static ktime_t tick_nohz_stop_sched_tick
> 		time_delta = timekeeping_max_deferment();
> 	} while (read_seqretry(&jiffies_lock, seq));
>
> -	if (rcu_needs_cpu(cpu, &rcu_delta_jiffies) ||
> +	if (sched_needs_cpu(cpu) || rcu_needs_cpu(cpu, &rcu_delta_jiffies) ||
> 	    arch_needs_cpu(cpu) || irq_work_needs_cpu()) {
> 		next_jiffies = last_jiffies + 1;
> 		delta_jiffies = 1;

If the performance regression was caused by too-expensive clock device
reprogramming and too-frequent entering/exiting of C-states... then this
patch should work, except that the following condition is almost always
false under the 3.11-rc3 code:

> return cpu_rq(cpu)->avg_idle < sysctl_sched_migration_cost;

Ethan

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH V3]hrtimer: Fix a performance regression by disable reprogramming in remove_hrtimer
  2013-08-08 15:02 ` ethan.zhao
@ 2013-08-09  6:52   ` Mike Galbraith
  0 siblings, 0 replies; 31+ messages in thread
From: Mike Galbraith @ 2013-08-09 6:52 UTC (permalink / raw)
  To: ethan.zhao
  Cc: Peter Zijlstra, Thomas Gleixner, Ingo Molnar, LKML, johlstei,
      Yinghai Lu, Jin Feng

On Thu, 2013-08-08 at 23:02 +0800, ethan.zhao wrote:
> On 2013-8-6, at 3:29 PM, Mike Galbraith <bitbucket@online.de> wrote:
>
>> +int sched_needs_cpu(int cpu)
>> +{
>> +	return cpu_rq(cpu)->avg_idle < sysctl_sched_migration_cost;
>> +}
>> +
>> #else /* CONFIG_NO_HZ_COMMON */
>>
>> static inline bool got_nohz_idle_kick(void)
>> --- a/kernel/time/tick-sched.c
>> +++ b/kernel/time/tick-sched.c
>> @@ -548,7 +548,7 @@ static ktime_t tick_nohz_stop_sched_tick
>> 		time_delta = timekeeping_max_deferment();
>> 	} while (read_seqretry(&jiffies_lock, seq));
>>
>> -	if (rcu_needs_cpu(cpu, &rcu_delta_jiffies) ||
>> +	if (sched_needs_cpu(cpu) || rcu_needs_cpu(cpu, &rcu_delta_jiffies) ||
>> 	    arch_needs_cpu(cpu) || irq_work_needs_cpu()) {
>> 		next_jiffies = last_jiffies + 1;
>> 		delta_jiffies = 1;
>
> If the performance regression was caused by too-expensive clock device
> reprogramming and too-frequent entering/exiting of C-states... then this
> patch should work, except that the following condition is almost always
> false under the 3.11-rc3 code:
>
>> return cpu_rq(cpu)->avg_idle < sysctl_sched_migration_cost;

On my E5620 box, avg_idle works fine. The patchlet doesn't save as much
as it used to, thanks to Peter's patch now killing the worst of the
pain, but it still does save cycles. I have too much regression left to
say exactly how much it can now save at max; it doesn't matter much
either. The pertinent numbers:

v3.11-rc4-27-ge4ef108   496.7 KHz   .858  1.030  throttle+peterz
v3.11-rc4-27-ge4ef108   440.7 KHz   .761  1.296  nothrottle+peterz

-Mike

^ permalink raw reply	[flat|nested] 31+ messages in thread
End of thread (newest message: 2013-12-12 14:42 UTC)
Thread overview: 31+ messages
2013-07-27 20:04 [PATCH V3]hrtimer: Fix a performance regression by disable reprogramming in remove_hrtimer ethan.kernel
2013-07-29 10:18 ` Thomas Gleixner
2013-07-29 11:57 ` Peter Zijlstra
2013-08-08 7:32 ` ethan.zhao
2013-09-05 6:36 ` Mike Galbraith
[not found] ` <20130905111428.GB23362@gmail.com>
[not found] ` <1378386697.6567.9.camel@marge.simpson.net>
[not found] ` <20130905133750.GA26637@gmail.com>
[not found] ` <1378445942.5434.31.camel@marge.simpson.net>
[not found] ` <20130909122325.GX31370@twins.programming.kicks-ass.net>
[not found] ` <1378730538.5586.30.camel@marge.simpson.net>
2013-09-09 13:30 ` Peter Zijlstra
2013-09-09 13:46 ` Peter Zijlstra
2013-09-11 8:56 ` Peter Zijlstra
2013-09-11 10:25 ` Mike Galbraith
2013-10-04 12:06 ` Ethan Zhao
2013-10-07 4:41 ` Mike Galbraith
2013-10-07 4:57 ` Ethan Zhao
2013-12-12 14:14 ` Ethan Zhao
2013-12-12 14:42 ` Mike Galbraith
[not found] ` <CABawtvP4oLuvHOS3prbbgPShXVziV_wTo7i6KCqJ9KkoVdz0ag@mail.gmail.com>
2013-07-30 9:35 ` Peter Zijlstra
2013-07-30 11:44 ` Ethan Zhao
2013-07-30 11:59 ` Peter Zijlstra
2013-08-03 6:55 ` ethan
2013-08-03 7:37 ` ethan
2013-08-06 7:29 ` Mike Galbraith
2013-08-06 7:46 ` Mike Galbraith
2013-08-08 4:31 ` ethan.zhao
2013-08-08 5:29 ` Mike Galbraith
2013-08-08 5:51 ` Mike Galbraith
2013-08-08 9:04 ` ethan.zhao
2013-08-08 9:05 ` ethan.zhao
2013-08-08 12:14 ` Mike Galbraith
2013-08-07 8:25 ` Mike Galbraith
2013-08-08 4:05 ` Mike Galbraith
2013-08-08 15:02 ` ethan.zhao
2013-08-09 6:52 ` Mike Galbraith