* 3.0.14-rt31 + 64 cores = very bad jitter == highly synchronized tick?
@ 2011-12-24 9:06 Mike Galbraith
2011-12-25 7:31 ` Mike Galbraith
2011-12-27 6:40 ` Mike Galbraith
From: Mike Galbraith @ 2011-12-24 9:06 UTC
To: RT; +Cc: Thomas Gleixner, Steven Rostedt, Peter Zijlstra, Ingo Molnar
Greetings,
I'm trying to convince 3.0-rt to perform on a 64 core box, and having a
devil of a time with the darn thing. I have a wild theory that cores
are much more closely synchronized in newer kernels, and that's causing
massive QPI jabbering and xtime lock contention as cores bang
cpupri_set() and ktime_get() in lockstep.
The 33-rt kernel in the numbers below has Steven's cpupri fix, and there
it works a treat.  In 3.0-rt, it does NOT save the day, and the only
reason I can imagine for the observed behavior is that cores are ticking
in lockstep.
Anyway, tick perturbations are definitely much larger in 3.0-rt than in
33-rt, munching ~1.4% of every core vs ~0.19% for 33-rt.
Has anything been done between 33 and 3.0 that would account for this?
Numbers and such below.
-Mike
Test environment: nohz=off, cores 4-63 isolated via cpusets. Start a
perturbation measurement proggy (tight self-calibrating rdtsc loop) as
the only thing running on isolated core 63.
(ponders telling customer that 10 x 8 core synchronized boxen has more
blinky lights, makes much sexier product than boring 1 x 80 core DL980:)
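For reference, a minimal sketch of what such a self-calibrating rdtsc
loop looks like (this is not the actual pert tool: the 2x-overhead
threshold, ~5s duration, and reporting below are assumptions, and it is
x86-only):

/*
 * Minimal sketch of a self-calibrating rdtsc perturbation loop in the
 * spirit of the pert proggy above.  Not the actual tool: threshold,
 * duration and reporting are assumptions.
 */
#include <stdio.h>
#include <stdint.h>

static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;
	asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
	uint64_t prev, now, delta, overhead = ~0ULL;
	uint64_t hits = 0, sum = 0, max = 0, elapsed = 0;

	/* self-calibrate: the smallest back-to-back delta is loop overhead */
	prev = rdtsc();
	for (int i = 0; i < 1000000; i++) {
		now = rdtsc();
		if (now - prev < overhead)
			overhead = now - prev;
		prev = now;
	}

	/* anything well beyond loop overhead was stolen from us */
	prev = rdtsc();
	while (elapsed < 5ULL * 2260000000ULL) {	/* ~5s at 2.26GHz */
		now = rdtsc();
		delta = now - prev;
		prev = now;
		elapsed += delta;
		if (delta < 2 * overhead)
			continue;
		hits++;
		sum += delta;
		if (delta > max)
			max = delta;
	}
	printf("perts: %llu max: %llu sum: %llu (cycles)\n",
	       (unsigned long long)hits, (unsigned long long)max,
	       (unsigned long long)sum);
	return 0;
}

Anything the loop didn't spend spinning (ticks, interrupts, SMIs) shows
up as an oversized delta, which is what the pert/s lines below report.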
2.6.33.20-rt31
vogelweide:/abuild/mike/:[130]# sh -c 'echo $$ > /cpusets/rtcpus/tasks;taskset -c 63 pert 5'
2260.86 MHZ CPU
perturbation threshold 0.024 usecs.
pert/s: 1000 >14.27us: 1 min: 1.86 max: 16.22 avg: 1.90 sum/s: 1903us overhead: 0.19%
pert/s: 1000 >13.72us: 2 min: 1.86 max: 15.79 avg: 1.91 sum/s: 1909us overhead: 0.19%
pert/s: 1000 >13.23us: 1 min: 1.85 max: 15.59 avg: 1.91 sum/s: 1914us overhead: 0.19%
3.0.14-rt31 virgin
vogelweide:/abuild/mike/:[130]# sh -c 'echo $$ > /cpusets/rtcpus/tasks;taskset -c 63 pert 5'
2261.09 MHZ CPU
perturbation threshold 0.024 usecs.
pert/s: 1001 >57.09us: 52 min: 1.10 max: 83.94 avg: 14.38 sum/s: 14399us overhead: 1.44%
pert/s: 1001 >55.94us: 45 min: 1.10 max: 77.78 avg: 13.43 sum/s: 13455us overhead: 1.35%
pert/s: 1001 >54.87us: 65 min: 1.10 max: 75.77 avg: 14.57 sum/s: 14589us overhead: 1.46%
3.0.14-rt31 non-virgin, where I'm squabbling with this darn thing
vogelweide:/abuild/mike/:[130]# sh -c 'echo $$ > /cpusets/rtcpus/tasks;taskset -c 63 pert 5'
2260.90 MHZ CPU
perturbation threshold 0.024 usecs.
pert/s: 1001 >15.15us: 613 min: 1.10 max: 62.47 avg: 6.88 sum/s: 6895us overhead: 0.69%
pert/s: 1001 >16.55us: 719 min: 1.10 max: 50.05 avg: 8.38 sum/s: 8394us overhead: 0.84%
pert/s: 1001 >17.77us: 795 min: 1.13 max: 48.51 avg: 8.98 sum/s: 8997us overhead: 0.90%
pert/s: 1001 >19.22us: 640 min: 1.10 max: 56.00 avg: 8.51 sum/s: 8524us overhead: 0.85%
pert/s: 1001 >20.36us: 560 min: 1.10 max: 52.73 avg: 8.41 sum/s: 8428us overhead: 0.84%
pert/s: 1001 >21.38us: 561 min: 1.11 max: 52.65 avg: 8.60 sum/s: 8611us overhead: 0.86%
pert/s: 1001 >22.21us: 583 min: 1.14 max: 50.35 avg: 8.90 sum/s: 8913us overhead: 0.89%
pert/s: 1001 >22.75us: 473 min: 1.12 max: 46.76 avg: 8.50 sum/s: 8516us overhead: 0.85%
pert/s: 1001 >23.42us: 383 min: 1.11 max: 51.04 avg: 7.86 sum/s: 7873us overhead: 0.79%
pert/s: 1001 >23.89us: 421 min: 1.11 max: 47.42 avg: 8.81 sum/s: 8825us overhead: 0.88%
(bend/spindle/mutilate below: echo RT_ISOLATE > sched_features)
pert/s: 1001 >18.74us: 2 min: 1.07 max: 22.62 avg: 2.57 sum/s: 2570us overhead: 0.26%
pert/s: 1001 >18.16us: 1 min: 1.13 max: 23.28 avg: 2.56 sum/s: 2566us overhead: 0.26%
pert/s: 1001 >17.64us: 1 min: 1.09 max: 23.30 avg: 2.61 sum/s: 2610us overhead: 0.26%
pert/s: 1001 >17.22us: 2 min: 1.09 max: 24.44 avg: 2.59 sum/s: 2593us overhead: 0.26%
pert/s: 1001 >16.21us: 0 min: 1.06 max: 11.46 avg: 2.62 sum/s: 2620us overhead: 0.26%
pert/s: 1001 >15.33us: 0 min: 1.14 max: 12.40 avg: 2.59 sum/s: 2597us overhead: 0.26%
pert/s: 1001 >14.83us: 1 min: 1.10 max: 17.94 avg: 2.59 sum/s: 2599us overhead: 0.26%
pert/s: 1001 >14.03us: 0 min: 1.07 max: 11.20 avg: 2.60 sum/s: 2605us overhead: 0.26%
pert/s: 1001 >13.84us: 1 min: 1.12 max: 21.51 avg: 2.62 sum/s: 2629us overhead: 0.26%
pert/s: 1001 >13.63us: 4 min: 1.12 max: 20.90 avg: 2.60 sum/s: 2604us overhead: 0.26%
profile CPU 63

NO_RT_ISOLATE, 3.0.14-rt31:
 47.83% [kernel] [k] cpupri_set
 18.38% [kernel] [k] native_write_msr_safe
  6.83% [kernel] [k] cpuacct_charge
  2.19% [kernel] [k] rcu_enter_nohz
  2.12% [kernel] [k] __schedule
  1.95% [kernel] [k] apic_timer_interrupt
  1.91% [kernel] [k] tick_sched_timer
  1.56% [kernel] [k] ktime_get
  1.20% [kernel] [k] run_timer_softirq
  0.72% [kernel] [k] __switch_to
  0.61% [kernel] [k] rcu_preempt_note_context_switch
  0.55% [kernel] [k] scheduler_tick
  0.54% [kernel] [k] __thread_do_softirq
  0.51% [kernel] [k] __rcu_pending
  0.51% [kernel] [k] _raw_spin_lock
  0.48% [kernel] [k] native_read_tsc
  0.45% [kernel] [k] hrtimer_interrupt
  0.44% [kernel] [k] raise_softirq
  0.33% [kernel] [k] __enqueue_rt_entity
  0.31% [kernel] [k] rt_spin_unlock

RT_ISOLATE, 3.0.14-rt31:
  8.67% [kernel] [k] tick_sched_timer
  7.03% [kernel] [k] __schedule
  6.42% [kernel] [k] native_write_msr_safe
  6.02% [kernel] [k] apic_timer_interrupt
  3.39% [kernel] [k] __switch_to
  2.73% [kernel] [k] ktime_get
  2.21% [kernel] [k] rcu_preempt_note_context_switch
  1.97% [kernel] [k] rcu_check_callbacks
  1.85% [kernel] [k] run_posix_cpu_timers
  1.63% [kernel] [k] run_timer_softirq
  1.63% [kernel] [k] common_interrupt
  1.63% [kernel] [k] _raw_spin_unlock_irq
  1.60% [kernel] [k] __thread_do_softirq
  1.58% [kernel] [k] _raw_spin_lock
  1.46% [kernel] [k] __rcu_pending
  1.36% [kernel] [k] wakeup_softirqd
  1.35% [kernel] [k] finish_task_switch
  1.31% [kernel] [k] cpuacct_charge
  1.28% [kernel] [k] handle_pending_softirqs
  1.28% [kernel] [k] scheduler_tick

no hacks, 2.6.33-rt31:
  8.28% [kernel] [k] cpupri_set
  7.52% [kernel] [k] __schedule
  6.30% [kernel] [k] apic_timer_interrupt
  5.66% [kernel] [k] native_write_msr_safe
  3.13% [kernel] [k] scheduler_tick
  2.69% [kernel] [k] _raw_spin_lock
  2.61% [kernel] [k] __switch_to
  2.38% [kernel] [k] try_to_wake_up
  2.16% [kernel] [k] native_read_msr_safe
  1.99% [kernel] [k] native_read_tsc
  1.98% [kernel] [k] update_curr_rt
  1.94% [kernel] [k] perf_event_task_sched_in
  1.89% [kernel] [k] ktime_get
  1.87% [kernel] [k] cpuacct_charge
  1.80% [kernel] [k] run_ksoftirqd
  1.73% [kernel] [k] _raw_spin_unlock
  1.71% [kernel] [k] perf_adjust_period
  1.46% [kernel] [k] __dequeue_entity
  1.33% [kernel] [k] rb_insert_color
  1.28% [kernel] [k] __rcu_pending
profile all 64 CPUs
(RT_ISOLATE hack turned back off)

3.0.14-rt31:
 61.08% [kernel] [k] cpupri_set
 15.57% [kernel] [k] ktime_get
  5.79% [kernel] [k] apic_timer_interrupt
  4.31% [kernel] [k] rcu_enter_nohz
  2.84% [kernel] [k] cpuacct_charge
  1.17% [kernel] [k] __schedule
  0.92% [kernel] [k] tick_sched_timer
  0.65% [kernel] [k] native_write_msr_safe
  0.53% [kernel] [k] scheduler_tick
  0.41% [kernel] [k] tick_check_oneshot_broadcast
  0.35% [kernel] [k] native_load_tls
  0.34% [kernel] [k] update_cpu_load
  0.27% [kernel] [k] __rcu_pending
  0.23% [kernel] [k] _raw_spin_lock
  0.23% [kernel] [k] __thread_do_softirq
  0.21% [kernel] [k] run_timer_softirq
  0.19% [kernel] [k] read_tsc
  0.19% [kernel] [k] _raw_spin_lock_irqsave
  0.19% [kernel] [k] native_read_tsc
  0.17% [kernel] [k] rcu_preempt_note_context_switch
  0.16% [kernel] [k] __switch_to
  0.14% [kernel] [k] rt_spin_lock
  0.13% [kernel] [k] profile_tick
  0.13% [kernel] [k] rt_spin_unlock
  0.13% [kernel] [k] finish_task_switch
  0.11% [kernel] [k] run_ksoftirqd
  0.11% [kernel] [k] handle_pending_softirqs
  0.10% [kernel] [k] smp_apic_timer_interrupt
  0.09% [kernel] [k] tick_nohz_stop_sched_tick
  0.09% [kernel] [k] pick_next_task_rt
  0.09% [kernel] [k] _raw_spin_lock_irq
  0.09% [kernel] [k] timerqueue_del
  0.08% [kernel] [k] hrtimer_interrupt
  0.07% [kernel] [k] pick_next_task_stop
  0.07% [kernel] [k] migrate_enable
  0.07% [kernel] [k] wakeup_softirqd
  0.07% [kernel] [k] native_sched_clock
  0.06% [kernel] [k] __dequeue_rt_entity
  0.06% [kernel] [k] update_curr_rt
  0.06% [kernel] [k] _raw_spin_unlock_irq

2.6.33.20-rt31:
 27.50% [kernel] [k] apic_timer_interrupt
  7.52% [kernel] [k] cpupri_set
  5.35% [kernel] [k] __schedule
  4.75% [kernel] [k] _raw_spin_lock
  3.88% [kernel] [k] scheduler_tick
  2.81% [kernel] [k] ktime_get
  2.59% [kernel] [k] tick_check_oneshot_broadcast
  2.50% [kernel] [k] native_write_msr_safe
  2.28% [kernel] [k] native_read_tsc
  2.22% [kernel] [k] native_read_msr_safe
  1.11% [kernel] [k] __switch_to
  1.05% [kernel] [k] read_tsc
  1.03% [kernel] [k] rb_erase
  1.00% [kernel] [k] rcu_sched_qs
  0.94% [kernel] [k] resched_task
  0.93% [kernel] [k] run_ksoftirqd
  0.92% [kernel] [k] atomic_notifier_call_chain
  0.91% [kernel] [k] _raw_spin_unlock
  0.87% [kernel] [k] __rcu_read_unlock
  0.87% [kernel] [k] native_sched_clock
  0.87% [kernel] [k] x86_pmu_read
  0.85% [kernel] [k] perf_adjust_period
  0.83% [kernel] [k] try_to_wake_up
  0.81% [kernel] [k] tick_sched_timer
  0.80% [kernel] [k] __perf_pending_run
  0.77% [kernel] [k] sched_clock_cpu
  0.70% [kernel] [k] finish_task_switch
  0.68% [kernel] [k] __atomic_notifier_call_chain
  0.67% [kernel] [k] hrtimer_interrupt
  0.67% [kernel] [k] __remove_hrtimer
  0.66% [kernel] [k] save_args
  0.64% [kernel] [k] rt_spin_lock
  0.61% [kernel] [k] _raw_spin_lock_irq
  0.58% [kernel] [k] idle_cpu
  0.56% [kernel] [k] __rcu_pending
  0.56% [kernel] [k] account_process_tick
  0.55% [kernel] [k] tick_nohz_stop_sched_tick
  0.51% [kernel] [k] rb_next
  0.46% [kernel] [k] rt_spin_unlock
  0.45% [kernel] [k] rcu_irq_enter
RT_ISOLATE cpupri_set() isolation hacklet
---
kernel/sched_features.h | 5 +++++
kernel/sched_rt.c | 17 +++++++++++++++--
2 files changed, 20 insertions(+), 2 deletions(-)
--- a/kernel/sched_features.h
+++ b/kernel/sched_features.h
@@ -79,3 +79,8 @@ SCHED_FEAT(TTWU_QUEUE, 0)
 
 SCHED_FEAT(FORCE_SD_OVERLAP, 0)
 SCHED_FEAT(RT_RUNTIME_SHARE, 1)
+
+/*
+ * Protect isolated CPUs from cpupri latency
+ */
+SCHED_FEAT(RT_ISOLATE, 1)
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -876,6 +876,11 @@ void dec_rt_group(struct sched_rt_entity
 
 #endif /* CONFIG_RT_GROUP_SCHED */
 
+static inline int rq_isolate(struct rq *rq)
+{
+	return sched_feat(RT_ISOLATE) && !rq->sd;
+}
+
 static inline
 void inc_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
 {
@@ -884,7 +889,8 @@ void inc_rt_tasks(struct sched_rt_entity
 	WARN_ON(!rt_prio(prio));
 	rt_rq->rt_nr_running++;
 
-	inc_rt_prio(rt_rq, prio);
+	if (!rq_isolate(rq_of_rt_rq(rt_rq)))
+		inc_rt_prio(rt_rq, prio);
 	inc_rt_migration(rt_se, rt_rq);
 	inc_rt_group(rt_se, rt_rq);
 }
@@ -896,7 +902,8 @@ void dec_rt_tasks(struct sched_rt_entity
 	WARN_ON(!rt_rq->rt_nr_running);
 	rt_rq->rt_nr_running--;
 
-	dec_rt_prio(rt_rq, rt_se_prio(rt_se));
+	if (!rq_isolate(rq_of_rt_rq(rt_rq)))
+		dec_rt_prio(rt_rq, rt_se_prio(rt_se));
 	dec_rt_migration(rt_se, rt_rq);
 	dec_rt_group(rt_se, rt_rq);
 }
@@ -1110,6 +1117,9 @@ static void check_preempt_equal_prio(str
 	if (rq->curr->rt.nr_cpus_allowed == 1)
 		return;
 
+	if (rq_isolate(rq))
+		return;
+
 	if (p->rt.nr_cpus_allowed != 1
 	    && cpupri_find(&rq->rd->cpupri, p, NULL))
 		return;
@@ -1300,6 +1310,9 @@ static int find_lowest_rq(struct task_st
 	if (task->rt.nr_cpus_allowed == 1)
 		return -1; /* No other targets possible */
 
+	if (rq_isolate(cpu_rq(this_cpu)))
+		return -1;
+
 	if (!cpupri_find(&task_rq(task)->rd->cpupri, task, lowest_mask))
 		return -1; /* No targets found */
* Re: 3.0.14-rt31 + 64 cores = very bad jitter == highly synchronized tick?
From: Mike Galbraith @ 2011-12-25 7:31 UTC
To: RT; +Cc: Thomas Gleixner, Steven Rostedt, Peter Zijlstra, Ingo Molnar

On Sat, 2011-12-24 at 10:06 +0100, Mike Galbraith wrote:
> Greetings,
>
> I'm trying to convince 3.0-rt to perform on a 64 core box, and having a
> devil of a time with the darn thing.  I have a wild theory that cores
> are much more closely synchronized in newer kernels, and that's causing
> massive QPI jabbering and xtime lock contention as cores bang
> cpupri_set() and ktime_get() in lockstep.

Seems not so wild a theory.

<idle>-0 [055] 1285.013088: mwait_idle <-cpu_idle
<idle>-0 [053] 1285.013860: smp_apic_timer_interrupt <-apic_timer_interrupt
<idle>-0 [043] 1285.013861: smp_apic_timer_interrupt <-apic_timer_interrupt
<idle>-0 [053] 1285.013861: native_apic_mem_write <-smp_apic_timer_interrupt
<idle>-0 [044] 1285.013861: smp_apic_timer_interrupt <-apic_timer_interrupt
<idle>-0 [043] 1285.013861: native_apic_mem_write <-smp_apic_timer_interrupt
<idle>-0 [061] 1285.013861: smp_apic_timer_interrupt <-apic_timer_interrupt
<idle>-0 [054] 1285.013861: smp_apic_timer_interrupt <-apic_timer_interrupt
<idle>-0 [038] 1285.013861: smp_apic_timer_interrupt <-apic_timer_interrupt
<idle>-0 [053] 1285.013861: irq_enter <-smp_apic_timer_interrupt
<idle>-0 [044] 1285.013861: native_apic_mem_write <-smp_apic_timer_interrupt
<idle>-0 [043] 1285.013861: irq_enter <-smp_apic_timer_interrupt
<idle>-0 [008] 1285.013861: smp_apic_timer_interrupt <-apic_timer_interrupt
<idle>-0 [032] 1285.013861: smp_apic_timer_interrupt <-apic_timer_interrupt
<idle>-0 [051] 1285.013861: smp_apic_timer_interrupt <-apic_timer_interrupt
<idle>-0 [024] 1285.013861: smp_apic_timer_interrupt <-apic_timer_interrupt
<idle>-0 [054] 1285.013861: native_apic_mem_write <-smp_apic_timer_interrupt
<idle>-0 [038] 1285.013861: native_apic_mem_write <-smp_apic_timer_interrupt
<idle>-0 [053] 1285.013861: rcu_irq_enter <-irq_enter
<idle>-0 [044] 1285.013861: irq_enter <-smp_apic_timer_interrupt
<idle>-0 [045] 1285.013861: smp_apic_timer_interrupt <-apic_timer_interrupt
<idle>-0 [006] 1285.013861: smp_apic_timer_interrupt <-apic_timer_interrupt
<idle>-0 [043] 1285.013861: rcu_irq_enter <-irq_enter
<idle>-0 [029] 1285.013861: smp_apic_timer_interrupt <-apic_timer_interrupt
<idle>-0 [014] 1285.013861: smp_apic_timer_interrupt <-apic_timer_interrupt
<idle>-0 [032] 1285.013861: native_apic_mem_write <-smp_apic_timer_interrupt
<idle>-0 [042] 1285.013861: smp_apic_timer_interrupt <-apic_timer_interrupt
<idle>-0 [031] 1285.013861: smp_apic_timer_interrupt <-apic_timer_interrupt
<idle>-0 [051] 1285.013861: native_apic_mem_write <-smp_apic_timer_interrupt
<idle>-0 [024] 1285.013861: native_apic_mem_write <-smp_apic_timer_interrupt
<idle>-0 [054] 1285.013861: irq_enter <-smp_apic_timer_interrupt
<idle>-0 [015] 1285.013861: smp_apic_timer_interrupt <-apic_timer_interrupt
<idle>-0 [027] 1285.013861: smp_apic_timer_interrupt <-apic_timer_interrupt
<idle>-0 [038] 1285.013861: irq_enter <-smp_apic_timer_interrupt
<idle>-0 [044] 1285.013861: rcu_irq_enter <-irq_enter
<idle>-0 [053] 1285.013861: rcu_exit_nohz <-rcu_irq_enter
<idle>-0 [035] 1285.013861: smp_apic_timer_interrupt <-apic_timer_interrupt
<idle>-0 [045] 1285.013861: native_apic_mem_write <-smp_apic_timer_interrupt
<idle>-0 [022] 1285.013861: smp_apic_timer_interrupt <-apic_timer_interrupt
<idle>-0 [028] 1285.013861: smp_apic_timer_interrupt <-apic_timer_interrupt
<idle>-0 [050] 1285.013861: smp_apic_timer_interrupt <-apic_timer_interrupt
<idle>-0 [043] 1285.013861: rcu_exit_nohz <-rcu_irq_enter
<idle>-0 [049] 1285.013861: smp_apic_timer_interrupt <-apic_timer_interrupt
<idle>-0 [061] 1285.013861: native_apic_mem_write <-smp_apic_timer_interrupt
<idle>-0 [019] 1285.013861: smp_apic_timer_interrupt <-apic_timer_interrupt
<idle>-0 [032] 1285.013861: irq_enter <-smp_apic_timer_interrupt
<idle>-0 [029] 1285.013861: native_apic_mem_write <-smp_apic_timer_interrupt
<idle>-0 [014] 1285.013861: native_apic_mem_write <-smp_apic_timer_interrupt
<idle>-0 [024] 1285.013861: irq_enter <-smp_apic_timer_interrupt
<idle>-0 [042] 1285.013861: native_apic_mem_write <-smp_apic_timer_interrupt
<idle>-0 [039] 1285.013861: smp_apic_timer_interrupt <-apic_timer_interrupt
<idle>-0 [026] 1285.013861: smp_apic_timer_interrupt <-apic_timer_interrupt
<....snipage>

Guess I need to fight fire with fire.  Make ticks jitter a little
somehow, so they don't make itimer wakeup jitter a truckload when it
collides with a tick that is busy colliding with a zillion other ticks.

'course that helps the real problem (dram sucks) not one bit.

-Mike
* Re: 3.0.14-rt31 + 64 cores = very bad jitter == highly synchronized tick?
From: Mike Galbraith @ 2011-12-26 8:04 UTC
To: RT; +Cc: Thomas Gleixner, Steven Rostedt, Peter Zijlstra, Ingo Molnar

On Sun, 2011-12-25 at 08:31 +0100, Mike Galbraith wrote:
> Guess I need to fight fire with fire.  Make ticks jitter a little
> somehow, so they don't make itimer wakeup jitter a truckload when it
> collides with a tick that is busy colliding with a zillion other ticks.

Yup.  Perfect is the enemy of good.

non-virgin:
vogelweide:/abuild/mike/:[1]# sh -c 'echo $$ > /cpusets/rtcpus/tasks;./jitter -c 63 -f 960 -p 99 -t 10 -d 300'
CPU63 priority: 99 timer freq: 960 Hz (1041666 ns) tolerance: 10 usecs, stats interval: 300 secs
jitter: 8.87 min: 3.08 max: 11.95 mean: 4.92 stddev: 0.56
    4 > 10 us hits min: 11.01 max: 11.95 mean: 11.35 stddev: 0.37
jitter: 8.68 min: 3.09 max: 11.77 mean: 4.91 stddev: 0.56
    2 > 10 us hits min: 11.10 max: 11.77 mean: 11.44 stddev: 0.33
jitter: 7.90 min: 3.12 max: 11.02 mean: 4.91 stddev: 0.56
    1 > 10 us hits min: 11.02 max: 11.02 mean: 11.02 stddev: 0.00

virgin:
vogelweide:/abuild/mike/:[1]# sh -c 'echo $$ > /cpusets/rtcpus/tasks;./jitter -c 63 -f 960 -p 99 -t 10 -d 300'
CPU63 priority: 99 timer freq: 960 Hz (1041666 ns) tolerance: 10 usecs, stats interval: 300 secs
jitter: 68.30 min: 2.43 max: 70.72 mean: 6.22 stddev: 6.41
16668 > 10 us hits min: 11.00 max: 70.72 mean: 28.57 stddev: 13.08
jitter: 71.76 min: 2.56 max: 74.32 mean: 6.29 stddev: 6.61
17257 > 10 us hits min: 11.00 max: 74.32 mean: 28.95 stddev: 13.24
jitter: 70.51 min: 2.50 max: 73.01 mean: 6.17 stddev: 6.26
16368 > 10 us hits min: 11.00 max: 73.01 mean: 28.29 stddev: 12.76

I'm still colliding a bit, and overhead is still too high, but poking
tick in the eye with a sharp stick made it crawl under its rock, so
methinks the tail has been pinned on the right donkey.

-Mike

64 core DL980 idling, nohz=1, cores 4-63 isolated

non-virgin:
top - 08:26:35 up 2:04, 2 users, load average: 0.00, 0.01, 0.41
Tasks: 1051 total, 2 running, 1049 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 7911876k total, 1004812k used, 6907064k free, 12836k buffers
Swap: 1959924k total, 0k used, 1959924k free, 802324k cached

  PID USER PR  NI VIRT RES  SHR S %CPU %MEM TIME+    P  COMMAND
22210 root 20   0 9680 2032 928 R  1    0.0 0:00.50   0 top
    4 root -41  0    0    0   0 S  0    0.0 0:07.85   0 sirq-timer/0
   22 root -41  0    0    0   0 S  0    0.0 0:13.24   1 sirq-timer/1
   37 root -41  0    0    0   0 S  0    0.0 0:13.32   2 sirq-timer/2
   51 root -41  0    0    0   0 S  0    0.0 0:12.29   3 sirq-timer/3
   65 root -41  0    0    0   0 S  0    0.0 0:12.07   4 sirq-timer/4
   79 root -41  0    0    0   0 S  0    0.0 0:12.20   5 sirq-timer/5
   93 root -41  0    0    0   0 S  0    0.0 0:12.07   6 sirq-timer/6
  121 root -41  0    0    0   0 S  0    0.0 0:12.32   8 sirq-timer/8
  163 root -41  0    0    0   0 S  0    0.0 0:12.22  11 sirq-timer/11
  177 root -41  0    0    0   0 S  0    0.0 0:12.22  12 sirq-timer/12
  191 root -41  0    0    0   0 S  0    0.0 0:12.25  13 sirq-timer/13
  205 root -41  0    0    0   0 S  0    0.0 0:12.21  14 sirq-timer/14
  219 root -41  0    0    0   0 S  0    0.0 0:12.21  15 sirq-timer/15
  233 root -41  0    0    0   0 S  0    0.0 0:13.54  16 sirq-timer/16

virgin:
top - 08:57:39 up 23 min, 1 user, load average: 0.00, 0.02, 0.10
Tasks: 468 total, 2 running, 466 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 7914276k total, 471268k used, 7443008k free, 84040k buffers
Swap: 1959924k total, 0k used, 1959924k free, 243072k cached

  PID USER PR  NI VIRT RES  SHR S %CPU %MEM TIME+    P  COMMAND
  179 root RT   0    0    0   0 S  3    0.0 0:45.89  32 ksoftirqd/32
  231 root RT   0    0    0   0 S  3    0.0 0:46.48  40 ksoftirqd/40
  241 root RT   0    0    0   0 S  3    0.0 0:46.53  42 ksoftirqd/42
  246 root RT   0    0    0   0 S  3    0.0 0:46.27  43 ksoftirqd/43
  184 root RT   0    0    0   0 S  3    0.0 0:44.16  33 ksoftirqd/33
  206 root RT   0    0    0   0 S  3    0.0 0:44.48  35 ksoftirqd/35
  211 root RT   0    0    0   0 R  3    0.0 0:45.19  36 ksoftirqd/36
  216 root RT   0    0    0   0 S  3    0.0 0:44.83  37 ksoftirqd/37
  221 root RT   0    0    0   0 S  3    0.0 0:43.73  38 ksoftirqd/38
  226 root RT   0    0    0   0 S  3    0.0 0:44.73  39 ksoftirqd/39
  236 root RT   0    0    0   0 S  3    0.0 0:45.86  41 ksoftirqd/41
  251 root RT   0    0    0   0 S  3    0.0 0:43.64  44 ksoftirqd/44
  323 root RT   0    0    0   0 S  3    0.0 0:40.69  56 ksoftirqd/56
  345 root RT   0    0    0   0 S  3    0.0 0:40.83  58 ksoftirqd/58
  201 root RT   0    0    0   0 S  3    0.0 0:44.86  34 ksoftirqd/34
  256 root RT   0    0    0   0 S  3    0.0 0:41.62  45 ksoftirqd/45
  273 root RT   0    0    0   0 S  3    0.0 0:41.09  46 ksoftirqd/46
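For context, a jitter run like the one above boils down to a SCHED_FIFO
thread woken at a fixed rate by an absolute clock_nanosleep(), recording
how late each wakeup lands.  A minimal sketch of that technique follows;
this is not the actual ./jitter tool (the 960 Hz frequency and priority
99 mirror the command line above, everything else is an assumption):

/*
 * Minimal sketch of a periodic wakeup jitter measurement: a SCHED_FIFO
 * thread woken at 960 Hz by absolute clock_nanosleep(), recording
 * wakeup lateness.  Not the actual ./jitter tool.  Needs root for the
 * RT priority; pin it with taskset as above.
 */
#include <stdio.h>
#include <time.h>
#include <sched.h>

#define NSEC_PER_SEC 1000000000LL

static long long ts_sub_ns(struct timespec a, struct timespec b)
{
	return (a.tv_sec - b.tv_sec) * NSEC_PER_SEC + (a.tv_nsec - b.tv_nsec);
}

int main(void)
{
	struct sched_param sp = { .sched_priority = 99 };
	struct timespec next, now;
	long long period = NSEC_PER_SEC / 960;	/* 960 Hz, ~1041666 ns */
	long long late, max = 0;

	sched_setscheduler(0, SCHED_FIFO, &sp);
	clock_gettime(CLOCK_MONOTONIC, &next);

	for (int i = 0; i < 960 * 10; i++) {	/* 10 seconds */
		next.tv_nsec += period;
		while (next.tv_nsec >= NSEC_PER_SEC) {
			next.tv_nsec -= NSEC_PER_SEC;
			next.tv_sec++;
		}
		clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
		clock_gettime(CLOCK_MONOTONIC, &now);
		late = ts_sub_ns(now, next);	/* wakeup lateness = jitter */
		if (late > max)
			max = late;
	}
	printf("max jitter: %.2f us\n", max / 1000.0);
	return 0;
}

When the highest priority task on an otherwise idle isolated core is
woken like this, every microsecond of lateness is something the kernel
(here, the colliding ticks) stole from it.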
* Re: 3.0.14-rt31 + 64 cores = very bad jitter == highly synchronized tick?
From: Mike Galbraith @ 2011-12-27 6:40 UTC
To: RT; +Cc: Thomas Gleixner, Steven Rostedt, Peter Zijlstra, Ingo Molnar

On Sat, 2011-12-24 at 10:06 +0100, Mike Galbraith wrote:
> Has anything been done between 33 and 3.0 that would account for this?

Um, like af5ab277d for instance.  Arjan is right that this contention
trouble doesn't happen with nohz.. but low jitter doesn't happen with
nohz either.

-Mike
* [patch] clockevents: Reinstate the per cpu tick skew
From: Mike Galbraith @ 2011-12-27 9:20 UTC
To: RT
Cc: Thomas Gleixner, Steven Rostedt, Peter Zijlstra, Ingo Molnar, Arjan van de Ven

Quoting removal commit af5ab277ded04bd9bc6b048c5a2f0e7d70ef0867:

Historically, Linux has tried to make the regular timer tick on the
various CPUs not happen at the same time, to avoid contention on
xtime_lock.

Nowadays, with the tickless kernel, this contention no longer happens
since time keeping and updating are done differently. In addition,
this skew is actually hurting power consumption in a measurable way on
many-core systems.

End quote.

Contention remains a problem if NO_HZ is either not configured, or is
disabled with nohz=off due to workload constraints.  The RT kernel
running nohz=off was measured to be using > 1.4% CPU just ticking 64
CPUs, with tick perturbation reaching ~80us.  For loads where the
measured (>100us) NO_HZ latencies are intolerable, this is a must have.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Arjan van de Ven <arjan@linux.intel.com>
---
 kernel/time/tick-sched.c | 9 +++++++++
 1 file changed, 9 insertions(+)

--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -689,6 +689,7 @@ static inline void tick_check_nohz(int c
 
 static inline void tick_nohz_switch_to_nohz(void) { }
 static inline void tick_check_nohz(int cpu) { }
+#define tick_nohz_enabled 0
 
 #endif /* NO_HZ */
 
@@ -777,6 +778,14 @@ void tick_setup_sched_timer(void)
 	/* Get the next period (per cpu) */
 	hrtimer_set_expires(&ts->sched_timer, tick_init_jiffy_update());
 
+	/* Offset the tick when NO_HZ is configured out or boot disabled */
+	if (!tick_nohz_enabled) {
+		u64 offset = ktime_to_ns(tick_period) >> 1;
+		do_div(offset, num_possible_cpus());
+		offset *= smp_processor_id();
+		hrtimer_add_expires_ns(&ts->sched_timer, offset);
+	}
+
 	for (;;) {
 		hrtimer_forward(&ts->sched_timer, now, tick_period);
 		hrtimer_start_expires(&ts->sched_timer,
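The per cpu arithmetic in the patch is easy to sanity check from user
space.  A standalone sketch follows; this is not kernel code, and
HZ=250 with 64 possible CPUs are assumed values matching the DL980
discussed later in this thread:

/*
 * Standalone sketch of the skew computed per cpu in
 * tick_setup_sched_timer() above.  Not kernel code: HZ=250 and
 * 64 possible CPUs are assumptions.
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	const uint64_t tick_period_ns = 1000000000ULL / 250;	/* 4ms tick */
	const unsigned int ncpus = 64;

	for (unsigned int cpu = 0; cpu < ncpus; cpu += 16) {
		/* half a tick period, spread evenly over all CPUs */
		uint64_t offset = (tick_period_ns >> 1) / ncpus * cpu;
		printf("cpu %2u ticks %8.2f us after cpu 0\n",
		       cpu, offset / 1000.0);
	}
	return 0;	/* cpu 16: 500us, cpu 32: 1000us, cpu 48: 1500us */
}

So with those numbers, neighbouring CPUs end up roughly 31us apart
instead of all firing within the same microsecond, which is exactly
what the trace earlier in the thread shows them doing.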
* Re: [patch] clockevents: Reinstate the per cpu tick skew
From: Mike Galbraith @ 2011-12-28 5:17 UTC
To: RT
Cc: Thomas Gleixner, Steven Rostedt, Peter Zijlstra, Ingo Molnar, Arjan van de Ven

On Tue, 2011-12-27 at 10:20 +0100, Mike Galbraith wrote:
> Quoting removal commit af5ab277ded04bd9bc6b048c5a2f0e7d70ef0867:
>
> Historically, Linux has tried to make the regular timer tick on the
> various CPUs not happen at the same time, to avoid contention on
> xtime_lock.
>
> Nowadays, with the tickless kernel, this contention no longer happens
> since time keeping and updating are done differently. In addition,
> this skew is actually hurting power consumption in a measurable way on
> many-core systems.
>
> End quote.

Hm, nohz enabled, hogs burning up 60 of 64 cores.

56.11% [kernel] [k] ktime_get
 5.54% [kernel] [k] scheduler_tick
 4.02% [kernel] [k] cpuacct_charge
 3.78% [kernel] [k] __rcu_pending
 3.76% [kernel] [k] tick_sched_timer
 3.42% [kernel] [k] native_write_msr_safe
 1.58% [kernel] [k] run_timer_softirq
 1.28% [kernel] [k] __schedule
 1.21% [kernel] [k] apic_timer_interrupt
 1.07% [kernel] [k] _raw_spin_lock
 0.81% [kernel] [k] __switch_to
 0.67% [kernel] [k] thread_return

Maybe skew-me wants to become a boot option?

-Mike
* Re: [patch] clockevents: Reinstate the per cpu tick skew
From: Mike Galbraith @ 2011-12-28 8:22 UTC
To: RT
Cc: Thomas Gleixner, Steven Rostedt, Peter Zijlstra, Ingo Molnar, Arjan van de Ven

On Wed, 2011-12-28 at 06:17 +0100, Mike Galbraith wrote:
> On Tue, 2011-12-27 at 10:20 +0100, Mike Galbraith wrote:
> > Quoting removal commit af5ab277ded04bd9bc6b048c5a2f0e7d70ef0867:
> >
> > Historically, Linux has tried to make the regular timer tick on the
> > various CPUs not happen at the same time, to avoid contention on
> > xtime_lock.
> >
> > Nowadays, with the tickless kernel, this contention no longer happens
> > since time keeping and updating are done differently. In addition,
> > this skew is actually hurting power consumption in a measurable way on
> > many-core systems.
> >
> > End quote.
>
> Hm, nohz enabled, hogs burning up 60 of 64 cores.
>
> 56.11% [kernel] [k] ktime_get
>  5.54% [kernel] [k] scheduler_tick
>  4.02% [kernel] [k] cpuacct_charge
>  3.78% [kernel] [k] __rcu_pending
>  3.76% [kernel] [k] tick_sched_timer
>  3.42% [kernel] [k] native_write_msr_safe
>  1.58% [kernel] [k] run_timer_softirq
>  1.28% [kernel] [k] __schedule
>  1.21% [kernel] [k] apic_timer_interrupt
>  1.07% [kernel] [k] _raw_spin_lock
>  0.81% [kernel] [k] __switch_to
>  0.67% [kernel] [k] thread_return
>
> Maybe skew-me wants to become a boot option?

Yup.. or something.  As above, but with skew.

 3.06% [kernel] [k] ktime_get

(Hm, wonder if nohz is usable now... nope.  Tell nohz that isolated
cores don't play balancer again, maybe it'll work now)

-Mike
* Re: [patch] clockevents: Reinstate the per cpu tick skew
From: Mike Galbraith @ 2011-12-28 9:59 UTC
To: RT
Cc: Thomas Gleixner, Steven Rostedt, Peter Zijlstra, Ingo Molnar, Arjan van de Ven

On Wed, 2011-12-28 at 09:22 +0100, Mike Galbraith wrote:
> (Hm, wonder if nohz is usable now... nope.  Tell nohz that isolated
> cores don't play balancer again, maybe it'll work now)

Yup, worked.  60 core jitter test is approaching single digit.  Woohoo.

---
 kernel/sched_fair.c | 4 ++++
 1 file changed, 4 insertions(+)

--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -4517,6 +4517,10 @@ void select_nohz_load_balancer(int stop_
 {
 	int cpu = smp_processor_id();
 
+	/* Isolated cores do not play */
+	if (!cpu_rq(cpu)->sd)
+		return;
+
 	if (stop_tick) {
 		if (!cpu_active(cpu)) {
 			if (atomic_read(&nohz.load_balancer) != cpu)
* Re: [patch] clockevents: Reinstate the per cpu tick skew
From: Arjan van de Ven @ 2011-12-28 13:35 UTC
To: Mike Galbraith
Cc: RT, Thomas Gleixner, Steven Rostedt, Peter Zijlstra, Ingo Molnar

On 12/28/2011 6:17 AM, Mike Galbraith wrote:
> On Tue, 2011-12-27 at 10:20 +0100, Mike Galbraith wrote:
>> Quoting removal commit af5ab277ded04bd9bc6b048c5a2f0e7d70ef0867:
>>
>> Historically, Linux has tried to make the regular timer tick on
>> the various CPUs not happen at the same time, to avoid contention
>> on xtime_lock.
>>
>> Nowadays, with the tickless kernel, this contention no longer
>> happens since time keeping and updating are done differently. In
>> addition, this skew is actually hurting power consumption in a
>> measurable way on many-core systems.
>>
>> End quote.
>
> Hm, nohz enabled, hogs burning up 60 of 64 cores.
>
> 56.11% [kernel] [k] ktime_get
>  5.54% [kernel] [k] scheduler_tick
>  4.02% [kernel] [k] cpuacct_charge
>  3.78% [kernel] [k] __rcu_pending
>  3.76% [kernel] [k] tick_sched_timer
>  3.42% [kernel] [k] native_write_msr_safe
>  1.58% [kernel] [k] run_timer_softirq
>  1.28% [kernel] [k] __schedule
>  1.21% [kernel] [k] apic_timer_interrupt
>  1.07% [kernel] [k] _raw_spin_lock
>  0.81% [kernel] [k] __switch_to
>  0.67% [kernel] [k] thread_return
>
> Maybe skew-me wants to become a boot option?

this is 56% of kernel time.. of how much total time?

(and are you using a system where tsc/lapic can be used, or are you
using one of those boatanchors that need hpet?)
* Re: [patch] clockevents: Reinstate the per cpu tick skew
From: Mike Galbraith @ 2011-12-28 14:59 UTC
To: Arjan van de Ven
Cc: RT, Thomas Gleixner, Steven Rostedt, Peter Zijlstra, Ingo Molnar

On Wed, 2011-12-28 at 14:35 +0100, Arjan van de Ven wrote:
> On 12/28/2011 6:17 AM, Mike Galbraith wrote:
> > Hm, nohz enabled, hogs burning up 60 of 64 cores.
> >
> > 56.11% [kernel] [k] ktime_get
> >  5.54% [kernel] [k] scheduler_tick
> >  4.02% [kernel] [k] cpuacct_charge
> >  3.78% [kernel] [k] __rcu_pending
> >  3.76% [kernel] [k] tick_sched_timer
> >  3.42% [kernel] [k] native_write_msr_safe
> >  1.58% [kernel] [k] run_timer_softirq
> >  1.28% [kernel] [k] __schedule
> >  1.21% [kernel] [k] apic_timer_interrupt
> >  1.07% [kernel] [k] _raw_spin_lock
> >  0.81% [kernel] [k] __switch_to
> >  0.67% [kernel] [k] thread_return
> >
> > Maybe skew-me wants to become a boot option?
>
> this is 56% of kernel time.. of how much total time?

I'd have to re-measure.  I didn't have any reason to watch the total,
that it was a big perturbation source was all that mattered.  It's not
that it's a huge percentage of total time by any means, just that the
jitter induced is too large for the kernel to be usable for the realtime
load it's expected to support.  With 30 usecs to play with, every one
counts.

> (and are you using a system where tsc/lapic can be used, or are you
> using one of those boatanchors that need hpet?)

Box is an HP DL980, 64 x X7560.

-Mike
* Re: [patch] clockevents: Reinstate the per cpu tick skew
From: Peter Zijlstra @ 2011-12-28 16:57 UTC
To: Mike Galbraith
Cc: Arjan van de Ven, RT, Thomas Gleixner, Steven Rostedt, Ingo Molnar

On Wed, 2011-12-28 at 15:59 +0100, Mike Galbraith wrote:
> > (and are you using a system where tsc/lapic can be used, or are you
> > using one of those boatanchors that need hpet?)
>
> Box is an HP DL980, 64 x X7560.

That smells like NHM-EX, aka boatanchor.
* Re: [patch] clockevents: Reinstate the per cpu tick skew
From: Mike Galbraith @ 2011-12-28 17:28 UTC
To: Peter Zijlstra
Cc: Arjan van de Ven, RT, Thomas Gleixner, Steven Rostedt, Ingo Molnar

On Wed, 2011-12-28 at 17:57 +0100, Peter Zijlstra wrote:
> On Wed, 2011-12-28 at 15:59 +0100, Mike Galbraith wrote:
> > Box is an HP DL980, 64 x X7560.
>
> That smells like NHM-EX, aka boatanchor.

I have a 32 core test box that has a minimum itimer fires -> task runs
of 6.79 usecs.. now _that_ box would make a great boat anchor.  DL980
may be a work horse, but it ain't a broken down old nag that should be
sent off to the glue factory.. yet.

-Mike
* Re: [patch] clockevents: Reinstate the per cpu tick skew
From: Mike Galbraith @ 2011-12-29 7:22 UTC
To: Arjan van de Ven
Cc: RT, Thomas Gleixner, Steven Rostedt, Peter Zijlstra, Ingo Molnar

On Wed, 2011-12-28 at 14:35 +0100, Arjan van de Ven wrote:
> On 12/28/2011 6:17 AM, Mike Galbraith wrote:
> > Hm, nohz enabled, hogs burning up 60 of 64 cores.
> >
> > 56.11% [kernel] [k] ktime_get
> > [...]
> > Maybe skew-me wants to become a boot option?
>
> this is 56% of kernel time.. of how much total time?

To answer the question..

99.57% burn     [.] main
 0.14% [kernel] [k] ktime_get

That's the DL980 running a 250Hz kernel.  Dinky, but for my picky RT
load, too much nonetheless.

(hm, what would SGI monster box say?)

-Mike
* Re: [patch] clockevents: Reinstate the per cpu tick skew
From: Arjan van de Ven @ 2011-12-28 13:32 UTC
To: Mike Galbraith
Cc: RT, Thomas Gleixner, Steven Rostedt, Peter Zijlstra, Ingo Molnar

On 12/27/2011 10:20 AM, Mike Galbraith wrote:
>
> Quoting removal commit af5ab277ded04bd9bc6b048c5a2f0e7d70ef0867:
>
> Historically, Linux has tried to make the regular timer tick on
> the various CPUs not happen at the same time, to avoid contention
> on xtime_lock.
>
> Nowadays, with the tickless kernel, this contention no longer
> happens since time keeping and updating are done differently. In
> addition, this skew is actually hurting power consumption in a
> measurable way on many-core systems.
>
> End quote.
>
> Contention remains a problem if NO_HZ is either not configured, or
> is disabled with nohz=off due to workload constraints.  The RT kernel
> running nohz=off was measured to be using > 1.4% CPU just ticking
> 64 CPUs, with tick perturbation reaching ~80us.  For loads where the
> measured (>100us) NO_HZ latencies are intolerable, this is a must have.

I think we need to just say no to this, and kill the nohz=off option
entirely.

Seriously, are people still running with ticks for any legitimate
reasons? (and not just because they goofed their config file)
* Re: [patch] clockevents: Reinstate the per cpu tick skew
From: Mike Galbraith @ 2011-12-28 15:10 UTC
To: Arjan van de Ven
Cc: RT, Thomas Gleixner, Steven Rostedt, Peter Zijlstra, Ingo Molnar

On Wed, 2011-12-28 at 14:32 +0100, Arjan van de Ven wrote:
> On 12/27/2011 10:20 AM, Mike Galbraith wrote:
> >
> > Quoting removal commit af5ab277ded04bd9bc6b048c5a2f0e7d70ef0867:
> >
> > Historically, Linux has tried to make the regular timer tick on
> > the various CPUs not happen at the same time, to avoid contention
> > on xtime_lock.
> >
> > Nowadays, with the tickless kernel, this contention no longer
> > happens since time keeping and updating are done differently. In
> > addition, this skew is actually hurting power consumption in a
> > measurable way on many-core systems.
> >
> > End quote.
> >
> > Contention remains a problem if NO_HZ is either not configured, or
> > is disabled with nohz=off due to workload constraints.  The RT kernel
> > running nohz=off was measured to be using > 1.4% CPU just ticking
> > 64 CPUs, with tick perturbation reaching ~80us.  For loads where the
> > measured (>100us) NO_HZ latencies are intolerable, this is a must have.
>
> I think we need to just say no to this, and kill the nohz=off option
> entirely.
>
> Seriously, are people still running with ticks for any legitimate
> reasons? (and not just because they goofed their config file)

Yup.  Realtime loads sometimes need it.  Even without contention
problems, entering/leaving nohz is a latency source.  If every little
bit counts, you may have the choice of letting the electric meter spin
or not getting the job done at all.

-Mike
* Re: [patch] clockevents: Reinstate the per cpu tick skew
From: Mike Galbraith @ 2012-01-03 6:20 UTC
To: Arjan van de Ven
Cc: RT, Thomas Gleixner, Steven Rostedt, Peter Zijlstra, Ingo Molnar

On Wed, 2011-12-28 at 16:10 +0100, Mike Galbraith wrote:
> On Wed, 2011-12-28 at 14:32 +0100, Arjan van de Ven wrote:
> >
> > I think we need to just say no to this, and kill the nohz=off option
> > entirely.
> >
> > Seriously, are people still running with ticks for any legitimate
> > reasons? (and not just because they goofed their config file)
>
> Yup.  Realtime loads sometimes need it.  Even without contention
> problems, entering/leaving nohz is a latency source.  If every little
> bit counts, you may have the choice of letting the electric meter spin
> or not getting the job done at all.

Patch making tick skew a boot option below, and hard numbers below that.

Test setup:
60 isolated cores running a synchronized frame scheduler model for 1
hour, scheduling worker-bees at three frequencies.  (The testcase is
supposed to "good enough" simulate a real frame rate scheduler, and did
pretty well at showing the cost of these particular collisions.)

The first set of numbers is without tick skew, and nohz enabled.  The
second set is tick skewed, with nohz and rt push/pull turned off for the
isolated core set.  The tick skew alone is responsible for an order of
magnitude of jitter improvement.  I have hard numbers for nohz and
cpupri_set() as well, but the bottom line for me is that with nohz
enabled, my 30us jitter budget is nearly doubled, so even with the tick
skewed, nohz is just not a viable option ATM.

From: Mike Galbraith <mgalbraith@suse.de>

clockevents: Reinstate the per cpu tick skew

Quoting removal commit af5ab277ded04bd9bc6b048c5a2f0e7d70ef0867:

Historically, Linux has tried to make the regular timer tick on the
various CPUs not happen at the same time, to avoid contention on
xtime_lock.

Nowadays, with the tickless kernel, this contention no longer happens
since time keeping and updating are done differently. In addition,
this skew is actually hurting power consumption in a measurable way on
many-core systems.

End quote.

Contrary to the above, contention does still happen, and can be a
problem for realtime loads whether nohz is active or not, so give the
user the ability to decide whether power consumption or jitter is the
more important consideration.

Signed-off-by: Mike Galbraith <mgalbraith@suse.de>
Cc: Arjan van de Ven <arjan@linux.intel.com>

---
 Documentation/kernel-parameters.txt |  3 +++
 kernel/time/tick-sched.c            | 19 +++++++++++++++++++
 2 files changed, 22 insertions(+)

--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2295,6 +2295,9 @@ bytes respectively. Such letter suffixes
 	simeth=		[IA-64]
 	simscsi=
 
+	skew_tick=	[KNL] Offset the periodic timer tick per cpu to mitigate
+			xtime_lock contention on larger systems.
+
 	slram=		[HW,MTD]
 
 	slub_debug[=options[,slabs]]	[MM, SLUB]
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -759,6 +759,8 @@ static enum hrtimer_restart tick_sched_t
 	return HRTIMER_RESTART;
 }
 
+static int sched_skew_tick;
+
 /**
  * tick_setup_sched_timer - setup the tick emulation timer
  */
@@ -777,6 +779,14 @@ void tick_setup_sched_timer(void)
 	/* Get the next period (per cpu) */
 	hrtimer_set_expires(&ts->sched_timer, tick_init_jiffy_update());
 
+	/* Offset the tick to avert xtime_lock contention. */
+	if (sched_skew_tick) {
+		u64 offset = ktime_to_ns(tick_period) >> 1;
+		do_div(offset, num_possible_cpus());
+		offset *= smp_processor_id();
+		hrtimer_add_expires_ns(&ts->sched_timer, offset);
+	}
+
 	for (;;) {
 		hrtimer_forward(&ts->sched_timer, now, tick_period);
 		hrtimer_start_expires(&ts->sched_timer,
@@ -858,3 +868,12 @@ int tick_check_oneshot_change(int allow_
 	tick_nohz_switch_to_nohz();
 	return 0;
 }
+
+static int __init skew_tick(char *str)
+{
+	get_option(&str, &sched_skew_tick);
+
+	return 0;
+}
+early_param("skew_tick", skew_tick);
+

No skewed tick, nohz active:

FREQ=960 FRAMES=3456000 LOOP=50000 using CPUs 4 - 23
FREQ=666 FRAMES=2397600 LOOP=72072 using CPUs 24 - 43
FREQ=300 FRAMES=1080000 LOOP=160000 using CPUs 44 - 63
on your marks... get set... POW!
Cpu Frames  Min    Max(Frame)       Avg    Sigma  LastTrans Fliers(Frames)
 4 3456000 0.0159 51.51 (1751285)  1.0811 2.3215 0 (0)     940 (2496,2497,36625,36626,45649,..3438632)
 5 3456000 0.0159 57.44 (1301949)  1.1164 2.3599 0 (0)     1010 (32353,32354,36625,36626,43681,..3434312)
 6 3456000 0.0159 49.58 (546753)   1.0602 2.3222 0 (0)     1037 (32353,32354,36625,36626,41809,..3425240)
 7 3456000 0.0159 52.20 (546753)   1.0681 2.3370 0 (0)     1035 (32353,32354,36625,36626,41809,..3432248)
 8 3456000 0.0159 58.91 (1407504)  1.0592 2.0873 0 (0)     865 (11041,11042,15505,15506,25585,..3412208)
 9 3456000 0.0159 54.61 (1407504)  1.0581 2.0775 0 (0)     850 (11041,11042,15505,15506,20234,..3411272)
10 3456000 0.0159 52.91 (1338694)  1.1259 2.0825 0 (0)     799 (11041,11042,15505,15506,16465,..3400640)
11 3456000 0.0159 50.56 (2470554)  1.1881 2.0364 0 (0)     334 (50714,113715,113716,166349,178780,..3421185)
12 3456000 0.0159 50.29 (2462200)  0.9961 2.0202 0 (0)     639 (9337,9338,11041,11042,15505,..3452529)
13 3456000 0.0159 56.52 (2470554)  1.1478 2.0602 0 (0)     400 (2545,2546,9121,9122,66434,..3440289)
14 3456000 0.0159 55.06 (34587)    1.2129 2.4890 0 (0)     444 (34587,34588,62571,62572,62619,..3440434)
15 3456000 0.0159 46.48 (583883)   1.2891 2.1824 0 (0)     306 (91563,95739,95740,141197,155741,..3406785)
16 3456000 0.0159 103.70 (2828662) 2.1077 4.0380 410 (2)   9435 (697,698,1105,1106,1153,..3455937)
17 3456000 0.0159 73.89 (2470553)  2.1598 3.7529 0 (0)     6180 (2473,2474,3985,3986,8569,..3438201)
18 3456000 0.0159 54.14 (1212190)  2.2391 3.7075 0 (0)     5485 (10274,10275,13970,13971,14379,..3455794)
19 3456000 0.0159 99.20 (810712)   2.3861 4.5793 0 (0)     19845 (674,675,2259,2260,3554,..3455915)
20 3456000 0.0159 71.30 (631597)   2.2565 4.3141 0 (0)     9365 (674,675,3555,7394,7395,..3455914)
21 3456000 0.0159 71.51 (1431073)  2.3127 4.4810 0 (0)     25073 (1154,2259,2260,4011,4012,..3455963)
22 3456000 0.0159 62.45 (215262)   2.1318 4.3088 0 (0)     23570 (2259,2260,4011,4012,4539,..3455963)
23 3456000 0.0159 61.50 (212190)   2.1307 4.3165 0 (0)     23605 (2259,2260,4539,4540,5019,..3455963)
24 2397600 0.0587 145.26 (2229318) 2.6808 6.2104 492 (14)  32977 (812,813,1145,1470,1471,..2397564)
25 2397600 0.0587 133.93 (250966)  2.6171 6.3300 492 (13)  35463 (812,813,1145,1146,1462,..2397564)
26 2397600 0.0587 140.25 (1405878) 2.7079 6.1603 492 (12)  32428 (806,812,813,1145,1146,..2397564)
27 2397600 0.0587 141.56 (1405879) 2.6893 6.1515 492 (14)  32089 (808,809,810,811,812,..2397564)
28 2397600 0.0587 146.57 (1405879) 2.7129 6.0797 492 (14)  31637 (800,801,812,813,827,..2397564)
29 2397600 0.0587 137.99 (2172039) 2.3360 5.9859 492 (14)  30551 (826,827,1157,1480,1481,..2397564)
30 2397600 0.0587 144.06 (948198)  2.2381 5.0413 496 (6)   19401 (826,827,832,833,1175,..2397566)
31 2397600 0.0587 141.92 (948198)  2.2509 5.0654 496 (4)   19353 (826,827,832,833,1175,..2397566)
32 2397600 0.0587 149.31 (2172038) 2.7842 6.8891 492 (10)  41301 (822,823,824,825,826,..2397564)
33 2397600 0.0587 142.99 (1975198) 2.6904 5.3538 181 (6)   21954 (511,512,846,847,1175,..2397582)
34 2397600 0.0587 167.07 (948199)  2.6350 5.6616 179 (4)   23602 (503,504,507,508,511,..2397582)
35 2397600 0.0587 79.81 (2152123)  2.5135 4.1781 0 (0)     5406 (1879,1881,1882,2876,2877,..2396956)
36 2397600 0.0587 112.24 (1184061) 2.7419 5.3774 0 (0)     21005 (1185,1186,1189,1190,1518,..2397263)
37 2397600 0.0587 78.86 (986867)   2.6678 5.1954 0 (0)     19350 (529,530,861,863,1189,..2397263)
38 2397600 0.0587 77.90 (1782680)  2.5881 4.8399 0 (0)     13516 (525,526,529,530,860,..2396938)
39 2397600 0.0587 78.02 (1642135)  2.4351 3.8095 0 (0)     3569 (898,2900,2901,3561,3566,..2397291)
40 2397600 0.0587 218.81 (891116)  2.7215 6.6456 392 (8)   38961 (714,715,726,727,1046,..2397450)
41 2397600 0.0587 141.56 (1975198) 2.6441 5.2995 181 (4)   22572 (846,847,1179,1180,1185,..2397249)
42 2397600 0.0587 77.07 (1782679)  2.3957 5.0119 0 (0)     17798 (529,530,860,861,862,..2397263)
43 2397600 0.0587 81.72 (1333323)  2.3469 4.5082 0 (0)     11172 (1205,1206,1207,1208,1865,..2396552)
44 1080000 0.0032 168.33 (988438)  2.7037 7.1729 381 (10)  20368 (650,651,662,663,809,..1056079)
45 1080000 0.0032 156.88 (935898)  2.6181 7.1047 0 (0)     19932 (767,768,809,810,866,..1022038)
46 1080000 0.0032 156.40 (935898)  2.2137 6.8080 0 (0)     18522 (684567,684568,695466,695467,699570,..975856)
47 1080000 0.0032 150.20 (905448)  2.6011 7.0525 0 (0)     19427 (2012,2013,510347,510348,617324,..980947)
48 1080000 0.0032 163.08 (1012102) 3.0856 8.6857 491 (49)  32197 (527,528,536,537,545,..1059883)
49 1080000 0.0032 151.87 (861738)  2.1150 6.2499 0 (0)     14993 (679920,679921,681762,681763,684567,..889561)
50 1080000 0.0032 143.53 (843639)  2.3864 6.2304 0 (0)     14372 (673311,673312,676716,676717,679680,..907048)
51 1080000 0.0032 148.53 (815289)  2.4022 6.1284 0 (0)     13945 (667971,667972,672835,673311,673312,..925077)
52 1080000 0.0032 149.49 (815289)  2.4059 6.0745 0 (0)     13932 (667971,667972,672834,672835,673311,..925077)
53 1080000 0.0032 149.49 (788680)  2.2976 5.4171 0 (0)     10821 (662766,662767,664794,664795,667971,..851374)
54 1080000 0.0032 146.63 (788680)  2.1600 5.5494 0 (0)     11435 (662766,662767,664794,664795,667971,..925077)
55 1080000 0.0032 145.91 (817180)  2.3747 5.9131 0 (0)     13198 (664794,664795,667971,667972,672834,..925077)
56 1080000 0.0032 140.91 (788680)  2.4499 5.8216 0 (0)     13403 (641917,658567,662767,664794,664795,..925077)
57 1080000 0.0032 141.38 (707776)  1.2948 3.8831 0 (0)     5041 (654816,654817,658320,658321,658566,..757666)
58 1080000 0.0032 149.73 (707776)  1.2131 3.6946 0 (0)     4076 (641916,641917,654136,654816,654817,..739225)
59 1080000 0.0032 51.02 (220341)   1.3073 3.1542 0 (0)     1869 (138187,145140,145141,147822,147823,..1021026)
60 1080000 0.0032 119.93 (313205)  1.6518 5.2116 0 (0)     9504 (3019,3020,12955,12956,25645,..1078275)
61 1080000 0.0032 149.25 (707776)  1.2933 3.5546 0 (0)     3393 (631761,631762,641916,641917,647521,..732562)
62 1080000 0.0032 126.60 (222973)  2.0194 5.6079 0 (0)     11357 (3019,3020,12955,12956,14420,..1078275)
63 1080000 0.0032 126.60 (222973)  2.0223 5.6224 0 (0)     11452 (3019,3020,12955,12956,14420,..1078275)

Same kernel, tick skew enabled, nohz and push/pull (100% pinned load...)
disabled for the isolated cpuset.  This is 10us or so better than 33-rt
can do on this box with nohz=off, ie that's roughly the jitter that
cpupri_set() induces (_can_ double that very rarely it seems).  So with
a couple little tweaks, 3.0-rt performs better than 33-rt (and can
dynamically become "green" again when not running a picky rt load)
despite being a little fatter.  'Course if I applied the same dinky
tweaks to 33-rt, the weight gain would show.

Anyway, the numbers..

FREQ=960 FRAMES=3456000 LOOP=50000 using CPUs 4 - 23
FREQ=666 FRAMES=2397600 LOOP=72072 using CPUs 24 - 43
FREQ=300 FRAMES=1080000 LOOP=160000 using CPUs 44 - 63
on your marks... get set... POW!
Cpu Frames  Min    Max(Frame)      Avg    Sigma  LastTrans Fliers(Frames)
 4 3456000 0.0159 5.98 (1957035) 0.1275 0.2979 0 (0)
 5 3456000 0.0159 6.21 (2641598) 0.2173 0.3444 0 (0)
 6 3456000 0.0159 5.26 (1313825) 0.1599 0.2956 0 (0)
 7 3456000 0.0159 5.98 (346106)  0.1632 0.2877 0 (0)
 8 3456000 0.0159 5.50 (70893)   0.1437 0.3450 0 (0)
 9 3456000 0.0159 5.98 (1550901) 0.1381 0.3502 0 (0)
10 3456000 0.0159 5.74 (106100)  0.1478 0.3313 0 (0)
11 3456000 0.0159 5.71 (3174550) 0.1413 0.3090 0 (0)
12 3456000 0.0159 5.02 (1506694) 0.1761 0.3098 0 (0)
13 3456000 0.0159 5.71 (3054611) 0.1768 0.3546 0 (0)
14 3456000 0.0159 5.02 (3148871) 0.1299 0.3062 0 (0)
15 3456000 0.0159 4.99 (2122036) 0.1521 0.3132 0 (0)
16 3456000 0.0159 6.42 (1728959) 0.1521 0.3905 0 (0)
17 3456000 0.0159 6.21 (854434)  0.1618 0.3652 0 (0)
18 3456000 0.0159 6.93 (2190440) 0.1418 0.3548 0 (0)
19 3456000 0.0159 6.90 (1614252) 0.2075 0.4128 0 (0)
20 3456000 0.0159 5.47 (136316)  0.2002 0.3977 0 (0)
21 3456000 0.0159 6.69 (1057262) 0.1435 0.3475 0 (0)
22 3456000 0.0159 6.66 (3123382) 0.1602 0.3585 0 (0)
23 3456000 0.0159 5.94 (2297025) 0.2283 0.3616 0 (0)
24 2397600 0.0587 6.38 (991357)  0.2580 0.3817 0 (0)
25 2397600 0.0587 6.73 (1162518) 0.2380 0.3730 0 (0)
26 2397600 0.0587 7.21 (733474)  0.2502 0.3590 0 (0)
27 2397600 0.0587 6.86 (1873716) 0.2280 0.3768 0 (0)
28 2397600 0.0587 7.21 (2296767) 0.2521 0.3884 0 (0)
29 2397600 0.0587 7.21 (616888)  0.4165 0.4887 0 (0)
30 2397600 0.0587 7.09 (458995)  0.4245 0.4577 0 (0)
31 2397600 0.0587 6.14 (1674893) 0.3974 0.4544 0 (0)
32 2397600 0.0587 7.45 (130233)  0.4440 0.5456 0 (0)
33 2397600 0.0587 7.09 (1453350) 0.2482 0.3813 0 (0)
34 2397600 0.0587 6.73 (2365066) 0.2886 0.3827 0 (0)
35 2397600 0.0587 6.14 (35955)   0.2556 0.3841 0 (0)
36 2397600 0.0587 6.62 (2145554) 0.2566 0.3933 0 (0)
37 2397600 0.0587 7.81 (130234)  0.5375 0.5129 0 (0)
38 2397600 0.0587 7.33 (130234)  0.4921 0.5255 0 (0)
39 2397600 0.0587 7.57 (130234)  0.4200 0.4901 0 (0)
40 2397600 0.0587 6.62 (2367859) 0.2962 0.4553 0 (0)
41 2397600 0.0587 6.26 (206979)  0.5036 0.5491 0 (0)
42 2397600 0.0587 6.38 (1302660) 0.5093 0.5469 0 (0)
43 2397600 0.0587 6.73 (1825681) 0.5511 0.5734 0 (0)
44 1079999 0.0032 7.39 (91927)   0.4603 0.5291 0 (0)
45 1079999 0.0032 6.92 (977865)  0.3143 0.4378 0 (0)
46 1079999 0.0032 5.96 (1002473) 0.2129 0.3999 0 (0)
47 1079999 0.0032 6.44 (981423)  0.4193 0.5293 0 (0)
48 1079999 0.0032 6.20 (375165)  0.2602 0.4201 0 (0)
49 1079999 0.0032 5.73 (886536)  0.4002 0.5174 0 (0)
50 1079999 0.0032 6.44 (547629)  0.3182 0.4507 0 (0)
51 1079999 0.0032 5.73 (143994)  0.4736 0.5952 0 (0)
52 1079999 0.0032 6.68 (1053525) 0.4753 0.5132 0 (0)
53 1079999 0.0032 6.44 (378576)  0.3686 0.4691 0 (0)
54 1079999 0.0032 6.92 (886639)  0.6017 0.5538 0 (0)
55 1079999 0.0032 6.68 (1055655) 0.4917 0.5232 0 (0)
56 1079999 0.0032 6.44 (293526)  0.2752 0.4340 0 (0)
57 1079999 0.0032 8.59 (913209)  1.1433 0.8550 0 (0)
58 1079999 0.0032 5.25 (259824)  0.2139 0.3702 0 (0)
59 1079999 0.0032 6.68 (245211)  0.2031 0.3665 0 (0)
60 1079999 0.0032 6.44 (895440)  0.4445 0.4867 0 (0)
61 1079999 0.0032 5.96 (896382)  0.2541 0.3923 0 (0)
62 1079999 0.0032 7.16 (895440)  0.5437 0.5162 0 (0)
63 1079999 0.0032 6.44 (895371)  0.5707 0.5135 0 (0)

So IMHO there is a valid case for keeping NO_HZ a config option for
folks who can never tolerate the pricetag, but as for the nohz=off
option, methinks that could indeed go away, given it's easy to make an
on/off switch.  I made one for both nohz and push/pull, just need to
move it into cpusets and make it pretty enough to live.

WRT $subject, it seems pretty clear that the RT kernel either wants
tick skew back.. or collision avoidance radar.. or something.

-Mike
* irq latency regression post af5ab277 - was Re: [patch] clockevents: Reinstate the per cpu tick skew 2012-01-03 6:20 ` Mike Galbraith @ 2012-04-23 6:13 ` Mike Galbraith 0 siblings, 0 replies; 17+ messages in thread From: Mike Galbraith @ 2012-04-23 6:13 UTC (permalink / raw) To: Arjan van de Ven Cc: RT, Thomas Gleixner, Steven Rostedt, Peter Zijlstra, Ingo Molnar, LKML, Paul E. McKenney, Dimitri Sivanich Greetings, On Tue, 2012-01-03 at 07:20 +0100, Mike Galbraith wrote: > On Wed, 2011-12-28 at 16:10 +0100, Mike Galbraith wrote: > > On Wed, 2011-12-28 at 14:32 +0100, Arjan van de Ven wrote: > > > > > > I think we need to just say no to this, and kill the nohz=off option > > > entirely. > > > > > > Seriously, are people still running with ticks for any legitimate > > > reasons? (and not just because they goofed their config file) > > > > Yup. Realtime loads sometimes need it. Even without contention > > problems, entering/leaving nohz is a latency source. If every little > > bit counts, you may have the choice of letting the electric meter spin > > or not getting the job done at all. There are other facets to tick skew removal that have turned up while looking into an irq latency regression 2.6.32->3.0. Not only does skew removal induce jitter woes for moderate sized boxen running RT kernels, it's a jitter source for large machines in general. More interestingly, that skew removal also appears to be indirectly responsible for a rather large irq latency regression. I bisected the source of same to.. 0209f649 rcu: limit rcu_node leaf-level fanout .._but_, the source of the lock contention it addressed appears to be the very tick skew removal that caused my xtime_lock jitter woes in RT. Revert 0209f649 in CONFIG_MAXSMP CONFIG_PREEMPT_NONE kernel, contention appears, restore skew, it disappears virtually entirely. So it would appear that we induced a ~400% latency regression to combat contention that was itself induced by tick skew removal. In enterprise, I can revert 0209f649 and enable tick skew across the board instead of selectively, and kill the regression at the cost of losing whatever power savings killing skew brought us. May have to do that. In another thread, Paul suggested limiting GP initialization to CPUs that have been online, which indeed turned the regression into a modest progression. That's highly attractive long term, but doing that in a stable kernel before it's baked in mainline is not the least bit attractive. Hohum, rock or hard spot, pick one. Anyway, I thought I should summarize the linkage of RCU induced latency regression to tick skew removal. Seems likely I'm not the only sod who will have this land in their bug list. > Patch making tick skew a boot option below, and hard numbers below that. > > Test setup: > 60 isolated cores running a synchronized frame scheduler model for 1 > hour, scheduling worker-bees at three frequencies. (The testcase is > supposed to "good enough" simulate a real frame rate scheduler, and did > pretty well at showing the cost of these particular collisions.) > > First set of numbers is without tick skew, and nohz enabled. Second set > is tick skewed, nohz and rt push/pull turned off for the isolated core > set. The tick skew alone is responsible for an order of magnitude of > jitter improvement. I have hard numbers for nohz and cpupri_set() as > well, but bottom line for me is that with nohz enabled, my 30us jitter > budget is nearly doubled, so even with the tick skewed, nohz is just not > a viable option ATM. 
>
> From: Mike Galbraith <mgalbraith@suse.de>
>
> clockevents: Reinstate the per cpu tick skew
>
> Quoting removal commit af5ab277ded04bd9bc6b048c5a2f0e7d70ef0867
>
> Historically, Linux has tried to make the regular timer tick on the
> various CPUs not happen at the same time, to avoid contention on
> xtime_lock.
>
> Nowadays, with the tickless kernel, this contention no longer happens
> since time keeping and updating are done differently.  In addition,
> this skew is actually hurting power consumption in a measurable way on
> many-core systems.
>
> End quote
>
> Contrary to the above, contention does still happen, and can be a
> problem for realtime loads whether nohz is active or not, so give
> the user the ability to decide whether power consumption or jitter
> is the more important consideration.
>
> Signed-off-by: Mike Galbraith <mgalbraith@suse.de>
> Cc: Arjan van de Ven <arjan@linux.intel.com>
>
> ---
>  Documentation/kernel-parameters.txt |    3 +++
>  kernel/time/tick-sched.c            |   19 +++++++++++++++++++
>  2 files changed, 22 insertions(+)
>
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -2295,6 +2295,9 @@ bytes respectively. Such letter suffixes
>  	simeth=		[IA-64]
>  	simscsi=
>
> +	skew_tick=	[KNL] Offset the periodic timer tick per cpu to mitigate
> +			xtime_lock contention on larger systems.
> +
>  	slram=		[HW,MTD]
>
>  	slub_debug[=options[,slabs]]	[MM, SLUB]
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -759,6 +759,8 @@ static enum hrtimer_restart tick_sched_t
>  	return HRTIMER_RESTART;
>  }
>
> +static int sched_skew_tick;
> +
>  /**
>   * tick_setup_sched_timer - setup the tick emulation timer
>   */
> @@ -777,6 +779,14 @@ void tick_setup_sched_timer(void)
>  	/* Get the next period (per cpu) */
>  	hrtimer_set_expires(&ts->sched_timer, tick_init_jiffy_update());
>
> +	/* Offset the tick to avert xtime_lock contention. */
> +	if (sched_skew_tick) {
> +		u64 offset = ktime_to_ns(tick_period) >> 1;
> +		do_div(offset, num_possible_cpus());
> +		offset *= smp_processor_id();
> +		hrtimer_add_expires_ns(&ts->sched_timer, offset);
> +	}
> +
>  	for (;;) {
>  		hrtimer_forward(&ts->sched_timer, now, tick_period);
>  		hrtimer_start_expires(&ts->sched_timer,
> @@ -858,3 +868,12 @@ int tick_check_oneshot_change(int allow_
>  		tick_nohz_switch_to_nohz();
>  	return 0;
>  }
> +
> +static int __init skew_tick(char *str)
> +{
> +	get_option(&str, &sched_skew_tick);
> +
> +	return 0;
> +}
> +early_param("skew_tick", skew_tick);
> +
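[To put numbers to the hunk above: each CPU's sched tick gets pushed
forward by cpu * (tick_period / 2) / num_possible_cpus(), i.e. the
ticks are spread evenly across the first half of a tick period instead
of firing in lockstep.  A userspace sketch of the same arithmetic; the
HZ and CPU-count values are assumptions picked to match a 64-core box.]

/* skew.c -- the patch's offset arithmetic, rendered in userspace */
#include <stdio.h>

int main(void)
{
	const unsigned long long hz = 250;	/* assumption: HZ=250 */
	const unsigned long long nr_cpus = 64;	/* assumption: 64 possible CPUs */
	const unsigned long long period_ns = 1000000000ULL / hz;
	unsigned long long step = (period_ns >> 1) / nr_cpus;
	unsigned int cpu;

	for (cpu = 0; cpu < nr_cpus; cpu++)
		printf("cpu %2u: tick offset %7llu ns\n", cpu, step * cpu);
	return 0;
}

[With HZ=250 and 64 possible CPUs that works out to 31250ns of
separation between adjacent CPUs' ticks, which is why the collisions
visible in the first table below vanish in the second.]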
>
> No skewed tick, nohz active:
> FREQ=960 FRAMES=3456000 LOOP=50000 using CPUs 4 - 23
> FREQ=666 FRAMES=2397600 LOOP=72072 using CPUs 24 - 43
> FREQ=300 FRAMES=1080000 LOOP=160000 using CPUs 44 - 63
> on your marks... get set... POW!
>
> Cpu Frames Min Max(Frame) Avg Sigma LastTrans Fliers(Frames)
>  4 3456000 0.0159 51.51 (1751285) 1.0811 2.3215 0 (0) 940 (2496,2497,36625,36626,45649,..3438632)
>  5 3456000 0.0159 57.44 (1301949) 1.1164 2.3599 0 (0) 1010 (32353,32354,36625,36626,43681,..3434312)
>  6 3456000 0.0159 49.58 (546753) 1.0602 2.3222 0 (0) 1037 (32353,32354,36625,36626,41809,..3425240)
>  7 3456000 0.0159 52.20 (546753) 1.0681 2.3370 0 (0) 1035 (32353,32354,36625,36626,41809,..3432248)
>  8 3456000 0.0159 58.91 (1407504) 1.0592 2.0873 0 (0) 865 (11041,11042,15505,15506,25585,..3412208)
>  9 3456000 0.0159 54.61 (1407504) 1.0581 2.0775 0 (0) 850 (11041,11042,15505,15506,20234,..3411272)
> 10 3456000 0.0159 52.91 (1338694) 1.1259 2.0825 0 (0) 799 (11041,11042,15505,15506,16465,..3400640)
> 11 3456000 0.0159 50.56 (2470554) 1.1881 2.0364 0 (0) 334 (50714,113715,113716,166349,178780,..3421185)
> 12 3456000 0.0159 50.29 (2462200) 0.9961 2.0202 0 (0) 639 (9337,9338,11041,11042,15505,..3452529)
> 13 3456000 0.0159 56.52 (2470554) 1.1478 2.0602 0 (0) 400 (2545,2546,9121,9122,66434,..3440289)
> 14 3456000 0.0159 55.06 (34587) 1.2129 2.4890 0 (0) 444 (34587,34588,62571,62572,62619,..3440434)
> 15 3456000 0.0159 46.48 (583883) 1.2891 2.1824 0 (0) 306 (91563,95739,95740,141197,155741,..3406785)
> 16 3456000 0.0159 103.70 (2828662) 2.1077 4.0380 410 (2) 9435 (697,698,1105,1106,1153,..3455937)
> 17 3456000 0.0159 73.89 (2470553) 2.1598 3.7529 0 (0) 6180 (2473,2474,3985,3986,8569,..3438201)
> 18 3456000 0.0159 54.14 (1212190) 2.2391 3.7075 0 (0) 5485 (10274,10275,13970,13971,14379,..3455794)
> 19 3456000 0.0159 99.20 (810712) 2.3861 4.5793 0 (0) 19845 (674,675,2259,2260,3554,..3455915)
> 20 3456000 0.0159 71.30 (631597) 2.2565 4.3141 0 (0) 9365 (674,675,3555,7394,7395,..3455914)
> 21 3456000 0.0159 71.51 (1431073) 2.3127 4.4810 0 (0) 25073 (1154,2259,2260,4011,4012,..3455963)
> 22 3456000 0.0159 62.45 (215262) 2.1318 4.3088 0 (0) 23570 (2259,2260,4011,4012,4539,..3455963)
> 23 3456000 0.0159 61.50 (212190) 2.1307 4.3165 0 (0) 23605 (2259,2260,4539,4540,5019,..3455963)
> 24 2397600 0.0587 145.26 (2229318) 2.6808 6.2104 492 (14) 32977 (812,813,1145,1470,1471,..2397564)
> 25 2397600 0.0587 133.93 (250966) 2.6171 6.3300 492 (13) 35463 (812,813,1145,1146,1462,..2397564)
> 26 2397600 0.0587 140.25 (1405878) 2.7079 6.1603 492 (12) 32428 (806,812,813,1145,1146,..2397564)
> 27 2397600 0.0587 141.56 (1405879) 2.6893 6.1515 492 (14) 32089 (808,809,810,811,812,..2397564)
> 28 2397600 0.0587 146.57 (1405879) 2.7129 6.0797 492 (14) 31637 (800,801,812,813,827,..2397564)
> 29 2397600 0.0587 137.99 (2172039) 2.3360 5.9859 492 (14) 30551 (826,827,1157,1480,1481,..2397564)
> 30 2397600 0.0587 144.06 (948198) 2.2381 5.0413 496 (6) 19401 (826,827,832,833,1175,..2397566)
> 31 2397600 0.0587 141.92 (948198) 2.2509 5.0654 496 (4) 19353 (826,827,832,833,1175,..2397566)
> 32 2397600 0.0587 149.31 (2172038) 2.7842 6.8891 492 (10) 41301 (822,823,824,825,826,..2397564)
> 33 2397600 0.0587 142.99 (1975198) 2.6904 5.3538 181 (6) 21954 (511,512,846,847,1175,..2397582)
> 34 2397600 0.0587 167.07 (948199) 2.6350 5.6616 179 (4) 23602 (503,504,507,508,511,..2397582)
> 35 2397600 0.0587 79.81 (2152123) 2.5135 4.1781 0 (0) 5406 (1879,1881,1882,2876,2877,..2396956)
> 36 2397600 0.0587 112.24 (1184061) 2.7419 5.3774 0 (0) 21005 (1185,1186,1189,1190,1518,..2397263)
> 37 2397600 0.0587 78.86 (986867) 2.6678 5.1954 0 (0) 19350 (529,530,861,863,1189,..2397263)
> 38 2397600 0.0587 77.90 (1782680) 2.5881 4.8399 0 (0) 13516 (525,526,529,530,860,..2396938)
> 39 2397600 0.0587 78.02 (1642135) 2.4351 3.8095 0 (0) 3569 (898,2900,2901,3561,3566,..2397291)
> 40 2397600 0.0587 218.81 (891116) 2.7215 6.6456 392 (8) 38961 (714,715,726,727,1046,..2397450)
> 41 2397600 0.0587 141.56 (1975198) 2.6441 5.2995 181 (4) 22572 (846,847,1179,1180,1185,..2397249)
> 42 2397600 0.0587 77.07 (1782679) 2.3957 5.0119 0 (0) 17798 (529,530,860,861,862,..2397263)
> 43 2397600 0.0587 81.72 (1333323) 2.3469 4.5082 0 (0) 11172 (1205,1206,1207,1208,1865,..2396552)
> 44 1080000 0.0032 168.33 (988438) 2.7037 7.1729 381 (10) 20368 (650,651,662,663,809,..1056079)
> 45 1080000 0.0032 156.88 (935898) 2.6181 7.1047 0 (0) 19932 (767,768,809,810,866,..1022038)
> 46 1080000 0.0032 156.40 (935898) 2.2137 6.8080 0 (0) 18522 (684567,684568,695466,695467,699570,..975856)
> 47 1080000 0.0032 150.20 (905448) 2.6011 7.0525 0 (0) 19427 (2012,2013,510347,510348,617324,..980947)
> 48 1080000 0.0032 163.08 (1012102) 3.0856 8.6857 491 (49) 32197 (527,528,536,537,545,..1059883)
> 49 1080000 0.0032 151.87 (861738) 2.1150 6.2499 0 (0) 14993 (679920,679921,681762,681763,684567,..889561)
> 50 1080000 0.0032 143.53 (843639) 2.3864 6.2304 0 (0) 14372 (673311,673312,676716,676717,679680,..907048)
> 51 1080000 0.0032 148.53 (815289) 2.4022 6.1284 0 (0) 13945 (667971,667972,672835,673311,673312,..925077)
> 52 1080000 0.0032 149.49 (815289) 2.4059 6.0745 0 (0) 13932 (667971,667972,672834,672835,673311,..925077)
> 53 1080000 0.0032 149.49 (788680) 2.2976 5.4171 0 (0) 10821 (662766,662767,664794,664795,667971,..851374)
> 54 1080000 0.0032 146.63 (788680) 2.1600 5.5494 0 (0) 11435 (662766,662767,664794,664795,667971,..925077)
> 55 1080000 0.0032 145.91 (817180) 2.3747 5.9131 0 (0) 13198 (664794,664795,667971,667972,672834,..925077)
> 56 1080000 0.0032 140.91 (788680) 2.4499 5.8216 0 (0) 13403 (641917,658567,662767,664794,664795,..925077)
> 57 1080000 0.0032 141.38 (707776) 1.2948 3.8831 0 (0) 5041 (654816,654817,658320,658321,658566,..757666)
> 58 1080000 0.0032 149.73 (707776) 1.2131 3.6946 0 (0) 4076 (641916,641917,654136,654816,654817,..739225)
> 59 1080000 0.0032 51.02 (220341) 1.3073 3.1542 0 (0) 1869 (138187,145140,145141,147822,147823,..1021026)
> 60 1080000 0.0032 119.93 (313205) 1.6518 5.2116 0 (0) 9504 (3019,3020,12955,12956,25645,..1078275)
> 61 1080000 0.0032 149.25 (707776) 1.2933 3.5546 0 (0) 3393 (631761,631762,641916,641917,647521,..732562)
> 62 1080000 0.0032 126.60 (222973) 2.0194 5.6079 0 (0) 11357 (3019,3020,12955,12956,14420,..1078275)
> 63 1080000 0.0032 126.60 (222973) 2.0223 5.6224 0 (0) 11452 (3019,3020,12955,12956,14420,..1078275)
>
> Same kernel, tick skew enabled, nohz and push/pull (100% pinned load...)
> disabled for the isolated cpuset.  This is 10us or so better than 33-rt
> can do on this box with nohz=off, ie that's roughly the jitter that
> cpupri_set() induces (_can_ double that very rarely it seems).
>
> So with a couple little tweaks, 3.0-rt performs better than 33-rt (and
> can dynamically become "green" again when not running picky rt load)
> despite being a little fatter.  'Course if I applied the same dinky
> tweaks to 33-rt, the weight gain would show.  Anyway, the numbers..
>
> FREQ=960 FRAMES=3456000 LOOP=50000 using CPUs 4 - 23
> FREQ=666 FRAMES=2397600 LOOP=72072 using CPUs 24 - 43
> FREQ=300 FRAMES=1080000 LOOP=160000 using CPUs 44 - 63
> on your marks... get set... POW!
> Cpu Frames Min Max(Frame) Avg Sigma LastTrans Fliers(Frames)
>  4 3456000 0.0159 5.98 (1957035) 0.1275 0.2979 0 (0)
>  5 3456000 0.0159 6.21 (2641598) 0.2173 0.3444 0 (0)
>  6 3456000 0.0159 5.26 (1313825) 0.1599 0.2956 0 (0)
>  7 3456000 0.0159 5.98 (346106) 0.1632 0.2877 0 (0)
>  8 3456000 0.0159 5.50 (70893) 0.1437 0.3450 0 (0)
>  9 3456000 0.0159 5.98 (1550901) 0.1381 0.3502 0 (0)
> 10 3456000 0.0159 5.74 (106100) 0.1478 0.3313 0 (0)
> 11 3456000 0.0159 5.71 (3174550) 0.1413 0.3090 0 (0)
> 12 3456000 0.0159 5.02 (1506694) 0.1761 0.3098 0 (0)
> 13 3456000 0.0159 5.71 (3054611) 0.1768 0.3546 0 (0)
> 14 3456000 0.0159 5.02 (3148871) 0.1299 0.3062 0 (0)
> 15 3456000 0.0159 4.99 (2122036) 0.1521 0.3132 0 (0)
> 16 3456000 0.0159 6.42 (1728959) 0.1521 0.3905 0 (0)
> 17 3456000 0.0159 6.21 (854434) 0.1618 0.3652 0 (0)
> 18 3456000 0.0159 6.93 (2190440) 0.1418 0.3548 0 (0)
> 19 3456000 0.0159 6.90 (1614252) 0.2075 0.4128 0 (0)
> 20 3456000 0.0159 5.47 (136316) 0.2002 0.3977 0 (0)
> 21 3456000 0.0159 6.69 (1057262) 0.1435 0.3475 0 (0)
> 22 3456000 0.0159 6.66 (3123382) 0.1602 0.3585 0 (0)
> 23 3456000 0.0159 5.94 (2297025) 0.2283 0.3616 0 (0)
> 24 2397600 0.0587 6.38 (991357) 0.2580 0.3817 0 (0)
> 25 2397600 0.0587 6.73 (1162518) 0.2380 0.3730 0 (0)
> 26 2397600 0.0587 7.21 (733474) 0.2502 0.3590 0 (0)
> 27 2397600 0.0587 6.86 (1873716) 0.2280 0.3768 0 (0)
> 28 2397600 0.0587 7.21 (2296767) 0.2521 0.3884 0 (0)
> 29 2397600 0.0587 7.21 (616888) 0.4165 0.4887 0 (0)
> 30 2397600 0.0587 7.09 (458995) 0.4245 0.4577 0 (0)
> 31 2397600 0.0587 6.14 (1674893) 0.3974 0.4544 0 (0)
> 32 2397600 0.0587 7.45 (130233) 0.4440 0.5456 0 (0)
> 33 2397600 0.0587 7.09 (1453350) 0.2482 0.3813 0 (0)
> 34 2397600 0.0587 6.73 (2365066) 0.2886 0.3827 0 (0)
> 35 2397600 0.0587 6.14 (35955) 0.2556 0.3841 0 (0)
> 36 2397600 0.0587 6.62 (2145554) 0.2566 0.3933 0 (0)
> 37 2397600 0.0587 7.81 (130234) 0.5375 0.5129 0 (0)
> 38 2397600 0.0587 7.33 (130234) 0.4921 0.5255 0 (0)
> 39 2397600 0.0587 7.57 (130234) 0.4200 0.4901 0 (0)
> 40 2397600 0.0587 6.62 (2367859) 0.2962 0.4553 0 (0)
> 41 2397600 0.0587 6.26 (206979) 0.5036 0.5491 0 (0)
> 42 2397600 0.0587 6.38 (1302660) 0.5093 0.5469 0 (0)
> 43 2397600 0.0587 6.73 (1825681) 0.5511 0.5734 0 (0)
> 44 1079999 0.0032 7.39 (91927) 0.4603 0.5291 0 (0)
> 45 1079999 0.0032 6.92 (977865) 0.3143 0.4378 0 (0)
> 46 1079999 0.0032 5.96 (1002473) 0.2129 0.3999 0 (0)
> 47 1079999 0.0032 6.44 (981423) 0.4193 0.5293 0 (0)
> 48 1079999 0.0032 6.20 (375165) 0.2602 0.4201 0 (0)
> 49 1079999 0.0032 5.73 (886536) 0.4002 0.5174 0 (0)
> 50 1079999 0.0032 6.44 (547629) 0.3182 0.4507 0 (0)
> 51 1079999 0.0032 5.73 (143994) 0.4736 0.5952 0 (0)
> 52 1079999 0.0032 6.68 (1053525) 0.4753 0.5132 0 (0)
> 53 1079999 0.0032 6.44 (378576) 0.3686 0.4691 0 (0)
> 54 1079999 0.0032 6.92 (886639) 0.6017 0.5538 0 (0)
> 55 1079999 0.0032 6.68 (1055655) 0.4917 0.5232 0 (0)
> 56 1079999 0.0032 6.44 (293526) 0.2752 0.4340 0 (0)
> 57 1079999 0.0032 8.59 (913209) 1.1433 0.8550 0 (0)
> 58 1079999 0.0032 5.25 (259824) 0.2139 0.3702 0 (0)
> 59 1079999 0.0032 6.68 (245211) 0.2031 0.3665 0 (0)
> 60 1079999 0.0032 6.44 (895440) 0.4445 0.4867 0 (0)
> 61 1079999 0.0032 5.96 (896382) 0.2541 0.3923 0 (0)
> 62 1079999 0.0032 7.16 (895440) 0.5437 0.5162 0 (0)
> 63 1079999 0.0032 6.44 (895371) 0.5707 0.5135 0 (0)
>
> So IMHO there is a valid case for keeping NO_HZ a config option for
> folks who can never tolerate the pricetag, but as for the nohz=off
> option, methinks that could indeed go away, given it's easy to make an
> on/off switch.  I made one for both nohz and push/pull, just need to
> move it into cpusets and make it pretty enough to live.
>
> WRT $subject, it seems pretty clear that the RT kernel either wants tick
> skew back.. or collision avoidance radar.. or something.
>
> 	-Mike
end of thread, other threads:[~2012-04-23  6:13 UTC | newest]

Thread overview: 17+ messages
2011-12-24  9:06 3.0.14-rt31 + 64 cores = very bad jitter == highly synchronized tick? Mike Galbraith
2011-12-25  7:31 ` Mike Galbraith
2011-12-26  8:04 ` Mike Galbraith
2011-12-27  6:40 ` Mike Galbraith
2011-12-27  9:20 ` [patch] clockevents: Reinstate the per cpu tick skew Mike Galbraith
2011-12-28  5:17 ` Mike Galbraith
2011-12-28  8:22 ` Mike Galbraith
2011-12-28  9:59 ` Mike Galbraith
2011-12-28 13:35 ` Arjan van de Ven
2011-12-28 14:59 ` Mike Galbraith
2011-12-28 16:57 ` Peter Zijlstra
2011-12-28 17:28 ` Mike Galbraith
2011-12-29  7:22 ` Mike Galbraith
2011-12-28 13:32 ` Arjan van de Ven
2011-12-28 15:10 ` Mike Galbraith
2012-01-03  6:20 ` Mike Galbraith
2012-04-23  6:13 ` irq latency regression post af5ab277 - was " Mike Galbraith