* [PATCH 0/1] Reduce cost of accessing tg->load_avg
@ 2023-08-23 6:08 Aaron Lu
2023-08-23 6:08 ` [PATCH 1/1] sched/fair: ratelimit update to tg->load_avg Aaron Lu
2023-08-25 10:33 ` [PATCH 0/1] Reduce cost of accessing tg->load_avg Swapnil Sapkal
0 siblings, 2 replies; 15+ messages in thread
From: Aaron Lu @ 2023-08-23 6:08 UTC (permalink / raw)
To: Peter Zijlstra, Vincent Guittot, Ingo Molnar, Juri Lelli
Cc: Daniel Jordan, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
Tim Chen, Nitin Tekchandani, Yu Chen, Waiman Long, Deng Pan,
Mathieu Desnoyers, Gautham R . Shenoy, David Vernet, linux-kernel
RFC v2 -> v1:
- drop RFC;
- move cfs_rq->last_update_tg_load_avg before cfs_rq->tg_load_avg_contrib;
- add Vincent's reviewed-by tag.
RFC v2:
Nitin Tekchandani noticed that some scheduler functions have a high cost
according to perf/cycles while running the postgres_sysbench workload.
I perf/annotated the high cost functions update_cfs_group() and
update_load_avg() and found ~90% of the cost was due to accessing
tg->load_avg. This series is an attempt to reduce the overhead of
the two functions.
Thanks to Vincent's suggestion on v1, this revision uses a simpler way
to solve the overhead problem: limit updates to tg->load_avg to at most
once per ms. Benchmarks show good results, and with the rate limit in
place the other optimizations in v1 no longer improve performance
further, so they are dropped from this revision.
Aaron Lu (1):
sched/fair: ratelimit update to tg->load_avg
kernel/sched/fair.c | 13 ++++++++++++-
kernel/sched/sched.h | 1 +
2 files changed, 13 insertions(+), 1 deletion(-)
--
2.41.0
^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/1] sched/fair: ratelimit update to tg->load_avg
  2023-08-23  6:08 [PATCH 0/1] Reduce cost of accessing tg->load_avg Aaron Lu
@ 2023-08-23  6:08 ` Aaron Lu
  2023-08-23 14:05   ` Mathieu Desnoyers
    ` (2 more replies)
  2023-08-25 10:33 ` [PATCH 0/1] Reduce cost of accessing tg->load_avg Swapnil Sapkal
  1 sibling, 3 replies; 15+ messages in thread
From: Aaron Lu @ 2023-08-23  6:08 UTC (permalink / raw)
  To: Peter Zijlstra, Vincent Guittot, Ingo Molnar, Juri Lelli
  Cc: Daniel Jordan, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Tim Chen, Nitin Tekchandani, Yu Chen, Waiman Long, Deng Pan,
	Mathieu Desnoyers, Gautham R . Shenoy, David Vernet, linux-kernel

When using sysbench to benchmark Postgres in a single docker instance
with sysbench's nr_threads set to nr_cpu, it is observed that at times
update_cfs_group() and update_load_avg() show noticeable overhead on
a 2sockets/112cores/224cpus Intel Sapphire Rapids (SPR):

    13.75%    13.74%  [kernel.vmlinux]  [k] update_cfs_group
    10.63%    10.04%  [kernel.vmlinux]  [k] update_load_avg

Annotation shows the cycles are mostly spent on accessing tg->load_avg
with update_load_avg() being the write side and update_cfs_group() being
the read side. tg->load_avg is per task group and when different tasks
of the same task group running on different CPUs frequently access
tg->load_avg, it can be heavily contended.

E.g. when running postgres_sysbench on a 2sockets/112cores/224cpus Intel
Sapphire Rapids, during a 5s window, the wakeup number is 14 million and
the migration number is 11 million. With each migration, the task's load
transfers from the src cfs_rq to the target cfs_rq, and each change
involves an update to tg->load_avg. Since the workload can trigger that
many wakeups and migrations, the access (both read and write) to
tg->load_avg can be unbounded. As a result, the two mentioned functions
show noticeable overhead. With netperf/nr_client=nr_cpu/UDP_RR, the
problem is worse: during a 5s window, the wakeup number is 21 million
and the migration number is 14 million; update_cfs_group() costs ~25%
and update_load_avg() costs ~16%.

Reduce the overhead by limiting updates to tg->load_avg to at most once
per ms. After this change, the cost of accessing tg->load_avg is greatly
reduced and performance improved. Detailed test results below.

==============================
postgres_sysbench on SPR:
25%
base:   42382±19.8%
patch:  50174±9.5%  (noise)

50%
base:   67626±1.3%
patch:  67365±3.1%  (noise)

75%
base:  100216±1.2%
patch: 112470±0.1%  +12.2%

100%
base:   93671±0.4%
patch: 113563±0.2%  +21.2%

==============================
hackbench on ICL:
group=1
base:  114912±5.2%
patch: 117857±2.5%  (noise)

group=4
base:  359902±1.6%
patch: 361685±2.7%  (noise)

group=8
base:  461070±0.8%
patch: 491713±0.3%  +6.6%

group=16
base:  309032±5.0%
patch: 378337±1.3%  +22.4%

=============================
hackbench on SPR:
group=1
base:  100768±2.9%
patch: 103134±2.9%  (noise)

group=4
base:  413830±12.5%
patch: 378660±16.6% (noise)

group=8
base:  436124±0.6%
patch: 490787±3.2%  +12.5%

group=16
base:  457730±3.2%
patch: 680452±1.3%  +48.8%

============================
netperf/udp_rr on ICL
25%
base:  114413±0.1%
patch: 115111±0.0%  +0.6%

50%
base:  86803±0.5%
patch: 86611±0.0%   (noise)

75%
base:  35959±5.3%
patch: 49801±0.6%   +38.5%

100%
base:  61951±6.4%
patch: 70224±0.8%   +13.4%

===========================
netperf/udp_rr on SPR
25%
base:  104954±1.3%
patch: 107312±2.8%  (noise)

50%
base:  55394±4.6%
patch: 54940±7.4%   (noise)

75%
base:  13779±3.1%
patch: 36105±1.1%   +162%

100%
base:  9703±3.7%
patch: 28011±0.2%   +189%

==============================================
netperf/tcp_stream on ICL (all in noise range)
25%
base:  43092±0.1%
patch: 42891±0.5%

50%
base:  19278±14.9%
patch: 22369±7.2%

75%
base:  16822±3.0%
patch: 17086±2.3%

100%
base:  18216±0.6%
patch: 18078±2.9%

===============================================
netperf/tcp_stream on SPR (all in noise range)
25%
base:  34491±0.3%
patch: 34886±0.5%

50%
base:  19278±14.9%
patch: 22369±7.2%

75%
base:  16822±3.0%
patch: 17086±2.3%

100%
base:  18216±0.6%
patch: 18078±2.9%

Reported-by: Nitin Tekchandani <nitin.tekchandani@intel.com>
Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c  | 13 ++++++++++++-
 kernel/sched/sched.h |  1 +
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c28206499a3d..a5462d1fcc48 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3664,7 +3664,8 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
  */
 static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 {
-	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
+	long delta;
+	u64 now;
 
 	/*
 	 * No need to update load_avg for root_task_group as it is not used.
@@ -3672,9 +3673,19 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 	if (cfs_rq->tg == &root_task_group)
 		return;
 
+	/*
+	 * For migration heavy workload, access to tg->load_avg can be
+	 * unbound. Limit the update rate to at most once per ms.
+	 */
+	now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
+	if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
+		return;
+
+	delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
 	if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
 		atomic_long_add(delta, &cfs_rq->tg->load_avg);
 		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
+		cfs_rq->last_update_tg_load_avg = now;
 	}
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6a8b7b9ed089..52ee7027def9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -593,6 +593,7 @@ struct cfs_rq {
 	} removed;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
+	u64			last_update_tg_load_avg;
 	unsigned long		tg_load_avg_contrib;
 	long			propagate;
 	long			prop_runnable_sum;
--
2.41.0

^ permalink raw reply related	[flat|nested] 15+ messages in thread
* Re: [PATCH 1/1] sched/fair: ratelimit update to tg->load_avg 2023-08-23 6:08 ` [PATCH 1/1] sched/fair: ratelimit update to tg->load_avg Aaron Lu @ 2023-08-23 14:05 ` Mathieu Desnoyers 2023-08-23 14:17 ` Mathieu Desnoyers 2023-08-24 8:01 ` Aaron Lu 2023-08-24 18:48 ` David Vernet 2023-09-06 3:52 ` kernel test robot 2 siblings, 2 replies; 15+ messages in thread From: Mathieu Desnoyers @ 2023-08-23 14:05 UTC (permalink / raw) To: Aaron Lu, Peter Zijlstra, Vincent Guittot, Ingo Molnar, Juri Lelli Cc: Daniel Jordan, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, Tim Chen, Nitin Tekchandani, Yu Chen, Waiman Long, Deng Pan, Gautham R . Shenoy, David Vernet, linux-kernel On 8/23/23 02:08, Aaron Lu wrote: > When using sysbench to benchmark Postgres in a single docker instance > with sysbench's nr_threads set to nr_cpu, it is observed there are times > update_cfs_group() and update_load_avg() shows noticeable overhead on > a 2sockets/112core/224cpu Intel Sapphire Rapids(SPR): > > 13.75% 13.74% [kernel.vmlinux] [k] update_cfs_group > 10.63% 10.04% [kernel.vmlinux] [k] update_load_avg > > Annotate shows the cycles are mostly spent on accessing tg->load_avg > with update_load_avg() being the write side and update_cfs_group() being > the read side. tg->load_avg is per task group and when different tasks > of the same taskgroup running on different CPUs frequently access > tg->load_avg, it can be heavily contended. > > E.g. when running postgres_sysbench on a 2sockets/112cores/224cpus Intel > Sappire Rapids, during a 5s window, the wakeup number is 14millions and > migration number is 11millions and with each migration, the task's load > will transfer from src cfs_rq to target cfs_rq and each change involves > an update to tg->load_avg. Since the workload can trigger as many wakeups > and migrations, the access(both read and write) to tg->load_avg can be > unbound. 
As a result, the two mentioned functions showed noticeable
> overhead. With netperf/nr_client=nr_cpu/UDP_RR, the problem is worse:
> during a 5s window, wakeup number is 21millions and migration number is
> 14millions; update_cfs_group() costs ~25% and update_load_avg() costs ~16%.
>
> Reduce the overhead by limiting updates to tg->load_avg to at most once
> per ms. After this change, the cost of accessing tg->load_avg is greatly
> reduced and performance improved. Detailed test results below.

By applying your patch on top of my patchset at:

https://lore.kernel.org/lkml/20230822113133.643238-1-mathieu.desnoyers@efficios.com/

The combined hackbench results look very promising:

(hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100)
(192 cores AMD EPYC 9654 96-Core Processor (over 2 sockets), with
hyperthreading)

Baseline:                                     49s
With L2-ttwu-queue-skip:                      34s (30% speedup)
With L2-ttwu-queue-skip + ratelimit-load-avg: 26s (46% speedup)

Feel free to apply my:

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

Thanks Aaron!
Mathieu > > ============================== > postgres_sysbench on SPR: > 25% > base: 42382±19.8% > patch: 50174±9.5% (noise) > > 50% > base: 67626±1.3% > patch: 67365±3.1% (noise) > > 75% > base: 100216±1.2% > patch: 112470±0.1% +12.2% > > 100% > base: 93671±0.4% > patch: 113563±0.2% +21.2% > > ============================== > hackbench on ICL: > group=1 > base: 114912±5.2% > patch: 117857±2.5% (noise) > > group=4 > base: 359902±1.6% > patch: 361685±2.7% (noise) > > group=8 > base: 461070±0.8% > patch: 491713±0.3% +6.6% > > group=16 > base: 309032±5.0% > patch: 378337±1.3% +22.4% > > ============================= > hackbench on SPR: > group=1 > base: 100768±2.9% > patch: 103134±2.9% (noise) > > group=4 > base: 413830±12.5% > patch: 378660±16.6% (noise) > > group=8 > base: 436124±0.6% > patch: 490787±3.2% +12.5% > > group=16 > base: 457730±3.2% > patch: 680452±1.3% +48.8% > > ============================ > netperf/udp_rr on ICL > 25% > base: 114413±0.1% > patch: 115111±0.0% +0.6% > > 50% > base: 86803±0.5% > patch: 86611±0.0% (noise) > > 75% > base: 35959±5.3% > patch: 49801±0.6% +38.5% > > 100% > base: 61951±6.4% > patch: 70224±0.8% +13.4% > > =========================== > netperf/udp_rr on SPR > 25% > base: 104954±1.3% > patch: 107312±2.8% (noise) > > 50% > base: 55394±4.6% > patch: 54940±7.4% (noise) > > 75% > base: 13779±3.1% > patch: 36105±1.1% +162% > > 100% > base: 9703±3.7% > patch: 28011±0.2% +189% > > ============================================== > netperf/tcp_stream on ICL (all in noise range) > 25% > base: 43092±0.1% > patch: 42891±0.5% > > 50% > base: 19278±14.9% > patch: 22369±7.2% > > 75% > base: 16822±3.0% > patch: 17086±2.3% > > 100% > base: 18216±0.6% > patch: 18078±2.9% > > =============================================== > netperf/tcp_stream on SPR (all in noise range) > 25% > base: 34491±0.3% > patch: 34886±0.5% > > 50% > base: 19278±14.9% > patch: 22369±7.2% > > 75% > base: 16822±3.0% > patch: 17086±2.3% > > 100% > base: 18216±0.6% > patch: 
18078±2.9% > > Reported-by: Nitin Tekchandani <nitin.tekchandani@intel.com> > Suggested-by: Vincent Guittot <vincent.guittot@linaro.org> > Signed-off-by: Aaron Lu <aaron.lu@intel.com> > Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> > --- > kernel/sched/fair.c | 13 ++++++++++++- > kernel/sched/sched.h | 1 + > 2 files changed, 13 insertions(+), 1 deletion(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index c28206499a3d..a5462d1fcc48 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -3664,7 +3664,8 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq) > */ > static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) > { > - long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib; > + long delta; > + u64 now; > > /* > * No need to update load_avg for root_task_group as it is not used. > @@ -3672,9 +3673,19 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) > if (cfs_rq->tg == &root_task_group) > return; > > + /* > + * For migration heavy workload, access to tg->load_avg can be > + * unbound. Limit the update rate to at most once per ms. > + */ > + now = sched_clock_cpu(cpu_of(rq_of(cfs_rq))); > + if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC) > + return; > + > + delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib; > if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) { > atomic_long_add(delta, &cfs_rq->tg->load_avg); > cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg; > + cfs_rq->last_update_tg_load_avg = now; > } > } > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > index 6a8b7b9ed089..52ee7027def9 100644 > --- a/kernel/sched/sched.h > +++ b/kernel/sched/sched.h > @@ -593,6 +593,7 @@ struct cfs_rq { > } removed; > > #ifdef CONFIG_FAIR_GROUP_SCHED > + u64 last_update_tg_load_avg; > unsigned long tg_load_avg_contrib; > long propagate; > long prop_runnable_sum; -- Mathieu Desnoyers EfficiOS Inc. 
https://www.efficios.com ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/1] sched/fair: ratelimit update to tg->load_avg 2023-08-23 14:05 ` Mathieu Desnoyers @ 2023-08-23 14:17 ` Mathieu Desnoyers 2023-08-24 8:01 ` Aaron Lu 1 sibling, 0 replies; 15+ messages in thread From: Mathieu Desnoyers @ 2023-08-23 14:17 UTC (permalink / raw) To: Aaron Lu, Peter Zijlstra, Vincent Guittot, Ingo Molnar, Juri Lelli Cc: Daniel Jordan, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, Tim Chen, Nitin Tekchandani, Yu Chen, Waiman Long, Deng Pan, Gautham R . Shenoy, David Vernet, linux-kernel On 8/23/23 10:05, Mathieu Desnoyers wrote: > On 8/23/23 02:08, Aaron Lu wrote: >> When using sysbench to benchmark Postgres in a single docker instance >> with sysbench's nr_threads set to nr_cpu, it is observed there are times >> update_cfs_group() and update_load_avg() shows noticeable overhead on >> a 2sockets/112core/224cpu Intel Sapphire Rapids(SPR): >> >> 13.75% 13.74% [kernel.vmlinux] [k] update_cfs_group >> 10.63% 10.04% [kernel.vmlinux] [k] update_load_avg >> >> Annotate shows the cycles are mostly spent on accessing tg->load_avg >> with update_load_avg() being the write side and update_cfs_group() being >> the read side. tg->load_avg is per task group and when different tasks >> of the same taskgroup running on different CPUs frequently access >> tg->load_avg, it can be heavily contended. >> >> E.g. when running postgres_sysbench on a 2sockets/112cores/224cpus Intel >> Sappire Rapids, during a 5s window, the wakeup number is 14millions and >> migration number is 11millions and with each migration, the task's load >> will transfer from src cfs_rq to target cfs_rq and each change involves >> an update to tg->load_avg. Since the workload can trigger as many wakeups >> and migrations, the access(both read and write) to tg->load_avg can be >> unbound. As a result, the two mentioned functions showed noticeable >> overhead. 
With netperf/nr_client=nr_cpu/UDP_RR, the problem is worse:
>> during a 5s window, wakeup number is 21millions and migration number is
>> 14millions; update_cfs_group() costs ~25% and update_load_avg() costs
>> ~16%.
>>
>> Reduce the overhead by limiting updates to tg->load_avg to at most once
>> per ms. After this change, the cost of accessing tg->load_avg is greatly
>> reduced and performance improved. Detailed test results below.
>
> By applying your patch on top of my patchset at:
>
> https://lore.kernel.org/lkml/20230822113133.643238-1-mathieu.desnoyers@efficios.com/
>
> The combined hackbench results look very promising:
>
> (hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100)
> (192 cores AMD EPYC 9654 96-Core Processor (over 2 sockets), with
> hyperthreading)
>
> Baseline:                                     49s
> With L2-ttwu-queue-skip:                      34s (30% speedup)
> With L2-ttwu-queue-skip + ratelimit-load-avg: 26s (46% speedup)

Here is an additional interesting data point:

With only ratelimit-load-avg patch: 32s (35% speedup)

So each series appears to address a different scalability issue, and
combining both seems worthwhile, at least from the point of view of this
specific benchmark on this hardware.

I'm looking forward to seeing numbers for other benchmarks and hardware.

Thanks,

Mathieu

> Feel free to apply my:
>
> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>
> Thanks Aaron!
> > Mathieu > >> >> ============================== >> postgres_sysbench on SPR: >> 25% >> base: 42382±19.8% >> patch: 50174±9.5% (noise) >> >> 50% >> base: 67626±1.3% >> patch: 67365±3.1% (noise) >> >> 75% >> base: 100216±1.2% >> patch: 112470±0.1% +12.2% >> >> 100% >> base: 93671±0.4% >> patch: 113563±0.2% +21.2% >> >> ============================== >> hackbench on ICL: >> group=1 >> base: 114912±5.2% >> patch: 117857±2.5% (noise) >> >> group=4 >> base: 359902±1.6% >> patch: 361685±2.7% (noise) >> >> group=8 >> base: 461070±0.8% >> patch: 491713±0.3% +6.6% >> >> group=16 >> base: 309032±5.0% >> patch: 378337±1.3% +22.4% >> >> ============================= >> hackbench on SPR: >> group=1 >> base: 100768±2.9% >> patch: 103134±2.9% (noise) >> >> group=4 >> base: 413830±12.5% >> patch: 378660±16.6% (noise) >> >> group=8 >> base: 436124±0.6% >> patch: 490787±3.2% +12.5% >> >> group=16 >> base: 457730±3.2% >> patch: 680452±1.3% +48.8% >> >> ============================ >> netperf/udp_rr on ICL >> 25% >> base: 114413±0.1% >> patch: 115111±0.0% +0.6% >> >> 50% >> base: 86803±0.5% >> patch: 86611±0.0% (noise) >> >> 75% >> base: 35959±5.3% >> patch: 49801±0.6% +38.5% >> >> 100% >> base: 61951±6.4% >> patch: 70224±0.8% +13.4% >> >> =========================== >> netperf/udp_rr on SPR >> 25% >> base: 104954±1.3% >> patch: 107312±2.8% (noise) >> >> 50% >> base: 55394±4.6% >> patch: 54940±7.4% (noise) >> >> 75% >> base: 13779±3.1% >> patch: 36105±1.1% +162% >> >> 100% >> base: 9703±3.7% >> patch: 28011±0.2% +189% >> >> ============================================== >> netperf/tcp_stream on ICL (all in noise range) >> 25% >> base: 43092±0.1% >> patch: 42891±0.5% >> >> 50% >> base: 19278±14.9% >> patch: 22369±7.2% >> >> 75% >> base: 16822±3.0% >> patch: 17086±2.3% >> >> 100% >> base: 18216±0.6% >> patch: 18078±2.9% >> >> =============================================== >> netperf/tcp_stream on SPR (all in noise range) >> 25% >> base: 34491±0.3% >> patch: 34886±0.5% >> >> 50% >> 
base: 19278±14.9% >> patch: 22369±7.2% >> >> 75% >> base: 16822±3.0% >> patch: 17086±2.3% >> >> 100% >> base: 18216±0.6% >> patch: 18078±2.9% >> >> Reported-by: Nitin Tekchandani <nitin.tekchandani@intel.com> >> Suggested-by: Vincent Guittot <vincent.guittot@linaro.org> >> Signed-off-by: Aaron Lu <aaron.lu@intel.com> >> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> >> --- >> kernel/sched/fair.c | 13 ++++++++++++- >> kernel/sched/sched.h | 1 + >> 2 files changed, 13 insertions(+), 1 deletion(-) >> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index c28206499a3d..a5462d1fcc48 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -3664,7 +3664,8 @@ static inline bool cfs_rq_is_decayed(struct >> cfs_rq *cfs_rq) >> */ >> static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) >> { >> - long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib; >> + long delta; >> + u64 now; >> /* >> * No need to update load_avg for root_task_group as it is not >> used. >> @@ -3672,9 +3673,19 @@ static inline void update_tg_load_avg(struct >> cfs_rq *cfs_rq) >> if (cfs_rq->tg == &root_task_group) >> return; >> + /* >> + * For migration heavy workload, access to tg->load_avg can be >> + * unbound. Limit the update rate to at most once per ms. 
>> + */ >> + now = sched_clock_cpu(cpu_of(rq_of(cfs_rq))); >> + if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC) >> + return; >> + >> + delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib; >> if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) { >> atomic_long_add(delta, &cfs_rq->tg->load_avg); >> cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg; >> + cfs_rq->last_update_tg_load_avg = now; >> } >> } >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h >> index 6a8b7b9ed089..52ee7027def9 100644 >> --- a/kernel/sched/sched.h >> +++ b/kernel/sched/sched.h >> @@ -593,6 +593,7 @@ struct cfs_rq { >> } removed; >> #ifdef CONFIG_FAIR_GROUP_SCHED >> + u64 last_update_tg_load_avg; >> unsigned long tg_load_avg_contrib; >> long propagate; >> long prop_runnable_sum; > -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/1] sched/fair: ratelimit update to tg->load_avg 2023-08-23 14:05 ` Mathieu Desnoyers 2023-08-23 14:17 ` Mathieu Desnoyers @ 2023-08-24 8:01 ` Aaron Lu 2023-08-24 12:56 ` Mathieu Desnoyers 1 sibling, 1 reply; 15+ messages in thread From: Aaron Lu @ 2023-08-24 8:01 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Peter Zijlstra, Vincent Guittot, Ingo Molnar, Juri Lelli, Daniel Jordan, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, Tim Chen, Nitin Tekchandani, Yu Chen, Waiman Long, Deng Pan, Gautham R . Shenoy, David Vernet, linux-kernel On Wed, Aug 23, 2023 at 10:05:31AM -0400, Mathieu Desnoyers wrote: > On 8/23/23 02:08, Aaron Lu wrote: > > When using sysbench to benchmark Postgres in a single docker instance > > with sysbench's nr_threads set to nr_cpu, it is observed there are times > > update_cfs_group() and update_load_avg() shows noticeable overhead on > > a 2sockets/112core/224cpu Intel Sapphire Rapids(SPR): > > > > 13.75% 13.74% [kernel.vmlinux] [k] update_cfs_group > > 10.63% 10.04% [kernel.vmlinux] [k] update_load_avg > > > > Annotate shows the cycles are mostly spent on accessing tg->load_avg > > with update_load_avg() being the write side and update_cfs_group() being > > the read side. tg->load_avg is per task group and when different tasks > > of the same taskgroup running on different CPUs frequently access > > tg->load_avg, it can be heavily contended. > > > > E.g. when running postgres_sysbench on a 2sockets/112cores/224cpus Intel > > Sappire Rapids, during a 5s window, the wakeup number is 14millions and > > migration number is 11millions and with each migration, the task's load > > will transfer from src cfs_rq to target cfs_rq and each change involves > > an update to tg->load_avg. Since the workload can trigger as many wakeups > > and migrations, the access(both read and write) to tg->load_avg can be > > unbound. 
As a result, the two mentioned functions showed noticeable > > overhead. With netperf/nr_client=nr_cpu/UDP_RR, the problem is worse: > > during a 5s window, wakeup number is 21millions and migration number is > > 14millions; update_cfs_group() costs ~25% and update_load_avg() costs ~16%. > > > > Reduce the overhead by limiting updates to tg->load_avg to at most once > > per ms. After this change, the cost of accessing tg->load_avg is greatly > > reduced and performance improved. Detailed test results below. > > By applying your patch on top of my patchset at: > > https://lore.kernel.org/lkml/20230822113133.643238-1-mathieu.desnoyers@efficios.com/ > > The combined hackbench results look very promising: > > (hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100) > (192 cores AMD EPYC 9654 96-Core Processor (over 2 sockets), with hyperthreading) > > Baseline: 49s > With L2-ttwu-queue-skip: 34s (30% speedup) > With L2-ttwu-queue-skip + ratelimit-load-avg: 26s (46% speedup) > > Feel free to apply my: > > Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> > Tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Thanks a lot for running this and reviewing the patch. I'll add your number and tag in the changelog when sending a new version. 
Regards, Aaron > > > > ============================== > > postgres_sysbench on SPR: > > 25% > > base: 42382±19.8% > > patch: 50174±9.5% (noise) > > > > 50% > > base: 67626±1.3% > > patch: 67365±3.1% (noise) > > > > 75% > > base: 100216±1.2% > > patch: 112470±0.1% +12.2% > > > > 100% > > base: 93671±0.4% > > patch: 113563±0.2% +21.2% > > > > ============================== > > hackbench on ICL: > > group=1 > > base: 114912±5.2% > > patch: 117857±2.5% (noise) > > > > group=4 > > base: 359902±1.6% > > patch: 361685±2.7% (noise) > > > > group=8 > > base: 461070±0.8% > > patch: 491713±0.3% +6.6% > > > > group=16 > > base: 309032±5.0% > > patch: 378337±1.3% +22.4% > > > > ============================= > > hackbench on SPR: > > group=1 > > base: 100768±2.9% > > patch: 103134±2.9% (noise) > > > > group=4 > > base: 413830±12.5% > > patch: 378660±16.6% (noise) > > > > group=8 > > base: 436124±0.6% > > patch: 490787±3.2% +12.5% > > > > group=16 > > base: 457730±3.2% > > patch: 680452±1.3% +48.8% > > > > ============================ > > netperf/udp_rr on ICL > > 25% > > base: 114413±0.1% > > patch: 115111±0.0% +0.6% > > > > 50% > > base: 86803±0.5% > > patch: 86611±0.0% (noise) > > > > 75% > > base: 35959±5.3% > > patch: 49801±0.6% +38.5% > > > > 100% > > base: 61951±6.4% > > patch: 70224±0.8% +13.4% > > > > =========================== > > netperf/udp_rr on SPR > > 25% > > base: 104954±1.3% > > patch: 107312±2.8% (noise) > > > > 50% > > base: 55394±4.6% > > patch: 54940±7.4% (noise) > > > > 75% > > base: 13779±3.1% > > patch: 36105±1.1% +162% > > > > 100% > > base: 9703±3.7% > > patch: 28011±0.2% +189% > > > > ============================================== > > netperf/tcp_stream on ICL (all in noise range) > > 25% > > base: 43092±0.1% > > patch: 42891±0.5% > > > > 50% > > base: 19278±14.9% > > patch: 22369±7.2% > > > > 75% > > base: 16822±3.0% > > patch: 17086±2.3% > > > > 100% > > base: 18216±0.6% > > patch: 18078±2.9% > > > > =============================================== > 
> netperf/tcp_stream on SPR (all in noise range) > > 25% > > base: 34491±0.3% > > patch: 34886±0.5% > > > > 50% > > base: 19278±14.9% > > patch: 22369±7.2% > > > > 75% > > base: 16822±3.0% > > patch: 17086±2.3% > > > > 100% > > base: 18216±0.6% > > patch: 18078±2.9% > > > > Reported-by: Nitin Tekchandani <nitin.tekchandani@intel.com> > > Suggested-by: Vincent Guittot <vincent.guittot@linaro.org> > > Signed-off-by: Aaron Lu <aaron.lu@intel.com> > > Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> > > --- > > kernel/sched/fair.c | 13 ++++++++++++- > > kernel/sched/sched.h | 1 + > > 2 files changed, 13 insertions(+), 1 deletion(-) > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index c28206499a3d..a5462d1fcc48 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -3664,7 +3664,8 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq) > > */ > > static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) > > { > > - long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib; > > + long delta; > > + u64 now; > > /* > > * No need to update load_avg for root_task_group as it is not used. > > @@ -3672,9 +3673,19 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) > > if (cfs_rq->tg == &root_task_group) > > return; > > + /* > > + * For migration heavy workload, access to tg->load_avg can be > > + * unbound. Limit the update rate to at most once per ms. 
> > + */ > > + now = sched_clock_cpu(cpu_of(rq_of(cfs_rq))); > > + if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC) > > + return; > > + > > + delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib; > > if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) { > > atomic_long_add(delta, &cfs_rq->tg->load_avg); > > cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg; > > + cfs_rq->last_update_tg_load_avg = now; > > } > > } > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > > index 6a8b7b9ed089..52ee7027def9 100644 > > --- a/kernel/sched/sched.h > > +++ b/kernel/sched/sched.h > > @@ -593,6 +593,7 @@ struct cfs_rq { > > } removed; > > #ifdef CONFIG_FAIR_GROUP_SCHED > > + u64 last_update_tg_load_avg; > > unsigned long tg_load_avg_contrib; > > long propagate; > > long prop_runnable_sum; > > -- > Mathieu Desnoyers > EfficiOS Inc. > https://www.efficios.com > ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/1] sched/fair: ratelimit update to tg->load_avg 2023-08-24 8:01 ` Aaron Lu @ 2023-08-24 12:56 ` Mathieu Desnoyers 2023-08-24 13:03 ` Vincent Guittot 0 siblings, 1 reply; 15+ messages in thread From: Mathieu Desnoyers @ 2023-08-24 12:56 UTC (permalink / raw) To: Aaron Lu Cc: Peter Zijlstra, Vincent Guittot, Ingo Molnar, Juri Lelli, Daniel Jordan, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, Tim Chen, Nitin Tekchandani, Yu Chen, Waiman Long, Deng Pan, Gautham R . Shenoy, David Vernet, linux-kernel On 8/24/23 04:01, Aaron Lu wrote: > On Wed, Aug 23, 2023 at 10:05:31AM -0400, Mathieu Desnoyers wrote: >> On 8/23/23 02:08, Aaron Lu wrote: >>> When using sysbench to benchmark Postgres in a single docker instance >>> with sysbench's nr_threads set to nr_cpu, it is observed there are times >>> update_cfs_group() and update_load_avg() shows noticeable overhead on >>> a 2sockets/112core/224cpu Intel Sapphire Rapids(SPR): >>> >>> 13.75% 13.74% [kernel.vmlinux] [k] update_cfs_group >>> 10.63% 10.04% [kernel.vmlinux] [k] update_load_avg >>> >>> Annotate shows the cycles are mostly spent on accessing tg->load_avg >>> with update_load_avg() being the write side and update_cfs_group() being >>> the read side. tg->load_avg is per task group and when different tasks >>> of the same taskgroup running on different CPUs frequently access >>> tg->load_avg, it can be heavily contended. >>> >>> E.g. when running postgres_sysbench on a 2sockets/112cores/224cpus Intel >>> Sappire Rapids, during a 5s window, the wakeup number is 14millions and >>> migration number is 11millions and with each migration, the task's load >>> will transfer from src cfs_rq to target cfs_rq and each change involves >>> an update to tg->load_avg. Since the workload can trigger as many wakeups >>> and migrations, the access(both read and write) to tg->load_avg can be >>> unbound. 
As a result, the two mentioned functions showed noticeable
>>> overhead. With netperf/nr_client=nr_cpu/UDP_RR, the problem is worse:
>>> during a 5s window, wakeup number is 21millions and migration number is
>>> 14millions; update_cfs_group() costs ~25% and update_load_avg() costs ~16%.
>>>
>>> Reduce the overhead by limiting updates to tg->load_avg to at most once
>>> per ms. After this change, the cost of accessing tg->load_avg is greatly
>>> reduced and performance improved. Detailed test results below.
>>
>> By applying your patch on top of my patchset at:
>>
>> https://lore.kernel.org/lkml/20230822113133.643238-1-mathieu.desnoyers@efficios.com/
>>
>> The combined hackbench results look very promising:
>>
>> (hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100)
>> (192 cores AMD EPYC 9654 96-Core Processor (over 2 sockets), with
>> hyperthreading)
>>
>> Baseline:                                     49s
>> With L2-ttwu-queue-skip:                      34s (30% speedup)
>> With L2-ttwu-queue-skip + ratelimit-load-avg: 26s (46% speedup)
>>
>> Feel free to apply my:
>>
>> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> Tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>
> Thanks a lot for running this and reviewing the patch.
> I'll add your number and tag in the changelog when sending a new
> version.

Now that I come to think of it, I have a comment: why use
sched_clock_cpu() rather than just reading the jiffies value? AFAIR,
sched_clock can be slower than needed when read from a "remote" cpu on
architectures that have an unsynchronized tsc.

Considering that you only need a time reference more or less accurate at
the millisecond level, I suspect that jiffies is what you are looking
for here. This is what the NUMA balance code and rseq mm_cid use to
execute work every N milliseconds.
Thanks, Mathieu > > Regards, > Aaron > >>> >>> ============================== >>> postgres_sysbench on SPR: >>> 25% >>> base: 42382±19.8% >>> patch: 50174±9.5% (noise) >>> >>> 50% >>> base: 67626±1.3% >>> patch: 67365±3.1% (noise) >>> >>> 75% >>> base: 100216±1.2% >>> patch: 112470±0.1% +12.2% >>> >>> 100% >>> base: 93671±0.4% >>> patch: 113563±0.2% +21.2% >>> >>> ============================== >>> hackbench on ICL: >>> group=1 >>> base: 114912±5.2% >>> patch: 117857±2.5% (noise) >>> >>> group=4 >>> base: 359902±1.6% >>> patch: 361685±2.7% (noise) >>> >>> group=8 >>> base: 461070±0.8% >>> patch: 491713±0.3% +6.6% >>> >>> group=16 >>> base: 309032±5.0% >>> patch: 378337±1.3% +22.4% >>> >>> ============================= >>> hackbench on SPR: >>> group=1 >>> base: 100768±2.9% >>> patch: 103134±2.9% (noise) >>> >>> group=4 >>> base: 413830±12.5% >>> patch: 378660±16.6% (noise) >>> >>> group=8 >>> base: 436124±0.6% >>> patch: 490787±3.2% +12.5% >>> >>> group=16 >>> base: 457730±3.2% >>> patch: 680452±1.3% +48.8% >>> >>> ============================ >>> netperf/udp_rr on ICL >>> 25% >>> base: 114413±0.1% >>> patch: 115111±0.0% +0.6% >>> >>> 50% >>> base: 86803±0.5% >>> patch: 86611±0.0% (noise) >>> >>> 75% >>> base: 35959±5.3% >>> patch: 49801±0.6% +38.5% >>> >>> 100% >>> base: 61951±6.4% >>> patch: 70224±0.8% +13.4% >>> >>> =========================== >>> netperf/udp_rr on SPR >>> 25% >>> base: 104954±1.3% >>> patch: 107312±2.8% (noise) >>> >>> 50% >>> base: 55394±4.6% >>> patch: 54940±7.4% (noise) >>> >>> 75% >>> base: 13779±3.1% >>> patch: 36105±1.1% +162% >>> >>> 100% >>> base: 9703±3.7% >>> patch: 28011±0.2% +189% >>> >>> ============================================== >>> netperf/tcp_stream on ICL (all in noise range) >>> 25% >>> base: 43092±0.1% >>> patch: 42891±0.5% >>> >>> 50% >>> base: 19278±14.9% >>> patch: 22369±7.2% >>> >>> 75% >>> base: 16822±3.0% >>> patch: 17086±2.3% >>> >>> 100% >>> base: 18216±0.6% >>> patch: 18078±2.9% >>> >>> 
=============================================== >>> netperf/tcp_stream on SPR (all in noise range) >>> 25% >>> base: 34491±0.3% >>> patch: 34886±0.5% >>> >>> 50% >>> base: 19278±14.9% >>> patch: 22369±7.2% >>> >>> 75% >>> base: 16822±3.0% >>> patch: 17086±2.3% >>> >>> 100% >>> base: 18216±0.6% >>> patch: 18078±2.9% >>> >>> Reported-by: Nitin Tekchandani <nitin.tekchandani@intel.com> >>> Suggested-by: Vincent Guittot <vincent.guittot@linaro.org> >>> Signed-off-by: Aaron Lu <aaron.lu@intel.com> >>> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> >>> --- >>> kernel/sched/fair.c | 13 ++++++++++++- >>> kernel/sched/sched.h | 1 + >>> 2 files changed, 13 insertions(+), 1 deletion(-) >>> >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >>> index c28206499a3d..a5462d1fcc48 100644 >>> --- a/kernel/sched/fair.c >>> +++ b/kernel/sched/fair.c >>> @@ -3664,7 +3664,8 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq) >>> */ >>> static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) >>> { >>> - long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib; >>> + long delta; >>> + u64 now; >>> /* >>> * No need to update load_avg for root_task_group as it is not used. >>> @@ -3672,9 +3673,19 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) >>> if (cfs_rq->tg == &root_task_group) >>> return; >>> + /* >>> + * For migration heavy workload, access to tg->load_avg can be >>> + * unbound. Limit the update rate to at most once per ms. 
>>> + */ >>> + now = sched_clock_cpu(cpu_of(rq_of(cfs_rq))); >>> + if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC) >>> + return; >>> + >>> + delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib; >>> if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) { >>> atomic_long_add(delta, &cfs_rq->tg->load_avg); >>> cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg; >>> + cfs_rq->last_update_tg_load_avg = now; >>> } >>> } >>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h >>> index 6a8b7b9ed089..52ee7027def9 100644 >>> --- a/kernel/sched/sched.h >>> +++ b/kernel/sched/sched.h >>> @@ -593,6 +593,7 @@ struct cfs_rq { >>> } removed; >>> #ifdef CONFIG_FAIR_GROUP_SCHED >>> + u64 last_update_tg_load_avg; >>> unsigned long tg_load_avg_contrib; >>> long propagate; >>> long prop_runnable_sum; >> >> -- >> Mathieu Desnoyers >> EfficiOS Inc. >> https://www.efficios.com >> -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/1] sched/fair: ratelimit update to tg->load_avg 2023-08-24 12:56 ` Mathieu Desnoyers @ 2023-08-24 13:03 ` Vincent Guittot 2023-08-24 13:08 ` Mathieu Desnoyers 0 siblings, 1 reply; 15+ messages in thread From: Vincent Guittot @ 2023-08-24 13:03 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Aaron Lu, Peter Zijlstra, Ingo Molnar, Juri Lelli, Daniel Jordan, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, Tim Chen, Nitin Tekchandani, Yu Chen, Waiman Long, Deng Pan, Gautham R . Shenoy, David Vernet, linux-kernel On Thu, 24 Aug 2023 at 14:55, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > On 8/24/23 04:01, Aaron Lu wrote: > > On Wed, Aug 23, 2023 at 10:05:31AM -0400, Mathieu Desnoyers wrote: > >> On 8/23/23 02:08, Aaron Lu wrote: > >>> When using sysbench to benchmark Postgres in a single docker instance > >>> with sysbench's nr_threads set to nr_cpu, it is observed there are times > >>> update_cfs_group() and update_load_avg() shows noticeable overhead on > >>> a 2sockets/112core/224cpu Intel Sapphire Rapids(SPR): > >>> > >>> 13.75% 13.74% [kernel.vmlinux] [k] update_cfs_group > >>> 10.63% 10.04% [kernel.vmlinux] [k] update_load_avg > >>> > >>> Annotate shows the cycles are mostly spent on accessing tg->load_avg > >>> with update_load_avg() being the write side and update_cfs_group() being > >>> the read side. tg->load_avg is per task group and when different tasks > >>> of the same taskgroup running on different CPUs frequently access > >>> tg->load_avg, it can be heavily contended. > >>> > >>> E.g. when running postgres_sysbench on a 2sockets/112cores/224cpus Intel > >>> Sappire Rapids, during a 5s window, the wakeup number is 14millions and > >>> migration number is 11millions and with each migration, the task's load > >>> will transfer from src cfs_rq to target cfs_rq and each change involves > >>> an update to tg->load_avg. 
Since the workload can trigger as many wakeups > >>> and migrations, the access(both read and write) to tg->load_avg can be > >>> unbound. As a result, the two mentioned functions showed noticeable > >>> overhead. With netperf/nr_client=nr_cpu/UDP_RR, the problem is worse: > >>> during a 5s window, wakeup number is 21millions and migration number is > >>> 14millions; update_cfs_group() costs ~25% and update_load_avg() costs ~16%. > >>> > >>> Reduce the overhead by limiting updates to tg->load_avg to at most once > >>> per ms. After this change, the cost of accessing tg->load_avg is greatly > >>> reduced and performance improved. Detailed test results below. > >> > >> By applying your patch on top of my patchset at: > >> > >> https://lore.kernel.org/lkml/20230822113133.643238-1-mathieu.desnoyers@efficios.com/ > >> > >> The combined hackbench results look very promising: > >> > >> (hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100) > >> (192 cores AMD EPYC 9654 96-Core Processor (over 2 sockets), with hyperthreading) > >> > >> Baseline: 49s > >> With L2-ttwu-queue-skip: 34s (30% speedup) > >> With L2-ttwu-queue-skip + ratelimit-load-avg: 26s (46% speedup) > >> > >> Feel free to apply my: > >> > >> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> > >> Tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> > > > > Thanks a lot for running this and reviewing the patch. > > I'll add your number and tag in the changelog when sending a new > > version. > > Now that I come to think of it, I have comment: why use > sched_clock_cpu() rather than just read the jiffies value ? AFAIR, > sched_clock can be slower than needed when read from a "remote" cpu on > architectures that have an unsynchronized tsc. > > Considering that you only need a time reference more or less accurate at > the millisecond level, I suspect that jiffies is what you are looking > for here. 
This is what the NUMA balance code and rseq mm_cid use to > execute work every N milliseconds. tick can 4ms or even 10ms which means a rate limit up between 10ms to 20ms in the latter case > > Thanks, > > Mathieu > > > > > Regards, > > Aaron > > > >>> > >>> ============================== > >>> postgres_sysbench on SPR: > >>> 25% > >>> base: 42382±19.8% > >>> patch: 50174±9.5% (noise) > >>> > >>> 50% > >>> base: 67626±1.3% > >>> patch: 67365±3.1% (noise) > >>> > >>> 75% > >>> base: 100216±1.2% > >>> patch: 112470±0.1% +12.2% > >>> > >>> 100% > >>> base: 93671±0.4% > >>> patch: 113563±0.2% +21.2% > >>> > >>> ============================== > >>> hackbench on ICL: > >>> group=1 > >>> base: 114912±5.2% > >>> patch: 117857±2.5% (noise) > >>> > >>> group=4 > >>> base: 359902±1.6% > >>> patch: 361685±2.7% (noise) > >>> > >>> group=8 > >>> base: 461070±0.8% > >>> patch: 491713±0.3% +6.6% > >>> > >>> group=16 > >>> base: 309032±5.0% > >>> patch: 378337±1.3% +22.4% > >>> > >>> ============================= > >>> hackbench on SPR: > >>> group=1 > >>> base: 100768±2.9% > >>> patch: 103134±2.9% (noise) > >>> > >>> group=4 > >>> base: 413830±12.5% > >>> patch: 378660±16.6% (noise) > >>> > >>> group=8 > >>> base: 436124±0.6% > >>> patch: 490787±3.2% +12.5% > >>> > >>> group=16 > >>> base: 457730±3.2% > >>> patch: 680452±1.3% +48.8% > >>> > >>> ============================ > >>> netperf/udp_rr on ICL > >>> 25% > >>> base: 114413±0.1% > >>> patch: 115111±0.0% +0.6% > >>> > >>> 50% > >>> base: 86803±0.5% > >>> patch: 86611±0.0% (noise) > >>> > >>> 75% > >>> base: 35959±5.3% > >>> patch: 49801±0.6% +38.5% > >>> > >>> 100% > >>> base: 61951±6.4% > >>> patch: 70224±0.8% +13.4% > >>> > >>> =========================== > >>> netperf/udp_rr on SPR > >>> 25% > >>> base: 104954±1.3% > >>> patch: 107312±2.8% (noise) > >>> > >>> 50% > >>> base: 55394±4.6% > >>> patch: 54940±7.4% (noise) > >>> > >>> 75% > >>> base: 13779±3.1% > >>> patch: 36105±1.1% +162% > >>> > >>> 100% > >>> base: 9703±3.7% > 
>>> patch: 28011±0.2% +189% > >>> > >>> ============================================== > >>> netperf/tcp_stream on ICL (all in noise range) > >>> 25% > >>> base: 43092±0.1% > >>> patch: 42891±0.5% > >>> > >>> 50% > >>> base: 19278±14.9% > >>> patch: 22369±7.2% > >>> > >>> 75% > >>> base: 16822±3.0% > >>> patch: 17086±2.3% > >>> > >>> 100% > >>> base: 18216±0.6% > >>> patch: 18078±2.9% > >>> > >>> =============================================== > >>> netperf/tcp_stream on SPR (all in noise range) > >>> 25% > >>> base: 34491±0.3% > >>> patch: 34886±0.5% > >>> > >>> 50% > >>> base: 19278±14.9% > >>> patch: 22369±7.2% > >>> > >>> 75% > >>> base: 16822±3.0% > >>> patch: 17086±2.3% > >>> > >>> 100% > >>> base: 18216±0.6% > >>> patch: 18078±2.9% > >>> > >>> Reported-by: Nitin Tekchandani <nitin.tekchandani@intel.com> > >>> Suggested-by: Vincent Guittot <vincent.guittot@linaro.org> > >>> Signed-off-by: Aaron Lu <aaron.lu@intel.com> > >>> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> > >>> --- > >>> kernel/sched/fair.c | 13 ++++++++++++- > >>> kernel/sched/sched.h | 1 + > >>> 2 files changed, 13 insertions(+), 1 deletion(-) > >>> > >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > >>> index c28206499a3d..a5462d1fcc48 100644 > >>> --- a/kernel/sched/fair.c > >>> +++ b/kernel/sched/fair.c > >>> @@ -3664,7 +3664,8 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq) > >>> */ > >>> static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) > >>> { > >>> - long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib; > >>> + long delta; > >>> + u64 now; > >>> /* > >>> * No need to update load_avg for root_task_group as it is not used. > >>> @@ -3672,9 +3673,19 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) > >>> if (cfs_rq->tg == &root_task_group) > >>> return; > >>> + /* > >>> + * For migration heavy workload, access to tg->load_avg can be > >>> + * unbound. Limit the update rate to at most once per ms. 
> >>> + */ > >>> + now = sched_clock_cpu(cpu_of(rq_of(cfs_rq))); > >>> + if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC) > >>> + return; > >>> + > >>> + delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib; > >>> if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) { > >>> atomic_long_add(delta, &cfs_rq->tg->load_avg); > >>> cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg; > >>> + cfs_rq->last_update_tg_load_avg = now; > >>> } > >>> } > >>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > >>> index 6a8b7b9ed089..52ee7027def9 100644 > >>> --- a/kernel/sched/sched.h > >>> +++ b/kernel/sched/sched.h > >>> @@ -593,6 +593,7 @@ struct cfs_rq { > >>> } removed; > >>> #ifdef CONFIG_FAIR_GROUP_SCHED > >>> + u64 last_update_tg_load_avg; > >>> unsigned long tg_load_avg_contrib; > >>> long propagate; > >>> long prop_runnable_sum; > >> > >> -- > >> Mathieu Desnoyers > >> EfficiOS Inc. > >> https://www.efficios.com > >> > > -- > Mathieu Desnoyers > EfficiOS Inc. > https://www.efficios.com > ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/1] sched/fair: ratelimit update to tg->load_avg 2023-08-24 13:03 ` Vincent Guittot @ 2023-08-24 13:08 ` Mathieu Desnoyers 2023-08-24 13:24 ` Vincent Guittot 2023-08-25 6:08 ` Aaron Lu 0 siblings, 2 replies; 15+ messages in thread From: Mathieu Desnoyers @ 2023-08-24 13:08 UTC (permalink / raw) To: Vincent Guittot Cc: Aaron Lu, Peter Zijlstra, Ingo Molnar, Juri Lelli, Daniel Jordan, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, Tim Chen, Nitin Tekchandani, Yu Chen, Waiman Long, Deng Pan, Gautham R . Shenoy, David Vernet, linux-kernel On 8/24/23 09:03, Vincent Guittot wrote: > On Thu, 24 Aug 2023 at 14:55, Mathieu Desnoyers > <mathieu.desnoyers@efficios.com> wrote: >> >> On 8/24/23 04:01, Aaron Lu wrote: >>> On Wed, Aug 23, 2023 at 10:05:31AM -0400, Mathieu Desnoyers wrote: >>>> On 8/23/23 02:08, Aaron Lu wrote: >>>>> When using sysbench to benchmark Postgres in a single docker instance >>>>> with sysbench's nr_threads set to nr_cpu, it is observed there are times >>>>> update_cfs_group() and update_load_avg() shows noticeable overhead on >>>>> a 2sockets/112core/224cpu Intel Sapphire Rapids(SPR): >>>>> >>>>> 13.75% 13.74% [kernel.vmlinux] [k] update_cfs_group >>>>> 10.63% 10.04% [kernel.vmlinux] [k] update_load_avg >>>>> >>>>> Annotate shows the cycles are mostly spent on accessing tg->load_avg >>>>> with update_load_avg() being the write side and update_cfs_group() being >>>>> the read side. tg->load_avg is per task group and when different tasks >>>>> of the same taskgroup running on different CPUs frequently access >>>>> tg->load_avg, it can be heavily contended. >>>>> >>>>> E.g. 
when running postgres_sysbench on a 2sockets/112cores/224cpus Intel >>>>> Sappire Rapids, during a 5s window, the wakeup number is 14millions and >>>>> migration number is 11millions and with each migration, the task's load >>>>> will transfer from src cfs_rq to target cfs_rq and each change involves >>>>> an update to tg->load_avg. Since the workload can trigger as many wakeups >>>>> and migrations, the access(both read and write) to tg->load_avg can be >>>>> unbound. As a result, the two mentioned functions showed noticeable >>>>> overhead. With netperf/nr_client=nr_cpu/UDP_RR, the problem is worse: >>>>> during a 5s window, wakeup number is 21millions and migration number is >>>>> 14millions; update_cfs_group() costs ~25% and update_load_avg() costs ~16%. >>>>> >>>>> Reduce the overhead by limiting updates to tg->load_avg to at most once >>>>> per ms. After this change, the cost of accessing tg->load_avg is greatly >>>>> reduced and performance improved. Detailed test results below. >>>> >>>> By applying your patch on top of my patchset at: >>>> >>>> https://lore.kernel.org/lkml/20230822113133.643238-1-mathieu.desnoyers@efficios.com/ >>>> >>>> The combined hackbench results look very promising: >>>> >>>> (hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100) >>>> (192 cores AMD EPYC 9654 96-Core Processor (over 2 sockets), with hyperthreading) >>>> >>>> Baseline: 49s >>>> With L2-ttwu-queue-skip: 34s (30% speedup) >>>> With L2-ttwu-queue-skip + ratelimit-load-avg: 26s (46% speedup) >>>> >>>> Feel free to apply my: >>>> >>>> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> >>>> Tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> >>> >>> Thanks a lot for running this and reviewing the patch. >>> I'll add your number and tag in the changelog when sending a new >>> version. >> >> Now that I come to think of it, I have comment: why use >> sched_clock_cpu() rather than just read the jiffies value ? 
AFAIR, >> sched_clock can be slower than needed when read from a "remote" cpu on >> architectures that have an unsynchronized tsc. >> >> Considering that you only need a time reference more or less accurate at >> the millisecond level, I suspect that jiffies is what you are looking >> for here. This is what the NUMA balance code and rseq mm_cid use to >> execute work every N milliseconds. > > tick can 4ms or even 10ms which means a rate limit up between 10ms to > 20ms in the latter case Fair enough, so just to confirm: is the 1ms a target period which has been empirically determined to be optimal (lower having too much overhead, and higher not being precise enough) ? Thanks, Mathieu > >> >> Thanks, >> >> Mathieu >> >>> >>> Regards, >>> Aaron >>> >>>>> >>>>> ============================== >>>>> postgres_sysbench on SPR: >>>>> 25% >>>>> base: 42382±19.8% >>>>> patch: 50174±9.5% (noise) >>>>> >>>>> 50% >>>>> base: 67626±1.3% >>>>> patch: 67365±3.1% (noise) >>>>> >>>>> 75% >>>>> base: 100216±1.2% >>>>> patch: 112470±0.1% +12.2% >>>>> >>>>> 100% >>>>> base: 93671±0.4% >>>>> patch: 113563±0.2% +21.2% >>>>> >>>>> ============================== >>>>> hackbench on ICL: >>>>> group=1 >>>>> base: 114912±5.2% >>>>> patch: 117857±2.5% (noise) >>>>> >>>>> group=4 >>>>> base: 359902±1.6% >>>>> patch: 361685±2.7% (noise) >>>>> >>>>> group=8 >>>>> base: 461070±0.8% >>>>> patch: 491713±0.3% +6.6% >>>>> >>>>> group=16 >>>>> base: 309032±5.0% >>>>> patch: 378337±1.3% +22.4% >>>>> >>>>> ============================= >>>>> hackbench on SPR: >>>>> group=1 >>>>> base: 100768±2.9% >>>>> patch: 103134±2.9% (noise) >>>>> >>>>> group=4 >>>>> base: 413830±12.5% >>>>> patch: 378660±16.6% (noise) >>>>> >>>>> group=8 >>>>> base: 436124±0.6% >>>>> patch: 490787±3.2% +12.5% >>>>> >>>>> group=16 >>>>> base: 457730±3.2% >>>>> patch: 680452±1.3% +48.8% >>>>> >>>>> ============================ >>>>> netperf/udp_rr on ICL >>>>> 25% >>>>> base: 114413±0.1% >>>>> patch: 115111±0.0% +0.6% >>>>> >>>>> 50% 
>>>>> base: 86803±0.5% >>>>> patch: 86611±0.0% (noise) >>>>> >>>>> 75% >>>>> base: 35959±5.3% >>>>> patch: 49801±0.6% +38.5% >>>>> >>>>> 100% >>>>> base: 61951±6.4% >>>>> patch: 70224±0.8% +13.4% >>>>> >>>>> =========================== >>>>> netperf/udp_rr on SPR >>>>> 25% >>>>> base: 104954±1.3% >>>>> patch: 107312±2.8% (noise) >>>>> >>>>> 50% >>>>> base: 55394±4.6% >>>>> patch: 54940±7.4% (noise) >>>>> >>>>> 75% >>>>> base: 13779±3.1% >>>>> patch: 36105±1.1% +162% >>>>> >>>>> 100% >>>>> base: 9703±3.7% >>>>> patch: 28011±0.2% +189% >>>>> >>>>> ============================================== >>>>> netperf/tcp_stream on ICL (all in noise range) >>>>> 25% >>>>> base: 43092±0.1% >>>>> patch: 42891±0.5% >>>>> >>>>> 50% >>>>> base: 19278±14.9% >>>>> patch: 22369±7.2% >>>>> >>>>> 75% >>>>> base: 16822±3.0% >>>>> patch: 17086±2.3% >>>>> >>>>> 100% >>>>> base: 18216±0.6% >>>>> patch: 18078±2.9% >>>>> >>>>> =============================================== >>>>> netperf/tcp_stream on SPR (all in noise range) >>>>> 25% >>>>> base: 34491±0.3% >>>>> patch: 34886±0.5% >>>>> >>>>> 50% >>>>> base: 19278±14.9% >>>>> patch: 22369±7.2% >>>>> >>>>> 75% >>>>> base: 16822±3.0% >>>>> patch: 17086±2.3% >>>>> >>>>> 100% >>>>> base: 18216±0.6% >>>>> patch: 18078±2.9% >>>>> >>>>> Reported-by: Nitin Tekchandani <nitin.tekchandani@intel.com> >>>>> Suggested-by: Vincent Guittot <vincent.guittot@linaro.org> >>>>> Signed-off-by: Aaron Lu <aaron.lu@intel.com> >>>>> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> >>>>> --- >>>>> kernel/sched/fair.c | 13 ++++++++++++- >>>>> kernel/sched/sched.h | 1 + >>>>> 2 files changed, 13 insertions(+), 1 deletion(-) >>>>> >>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >>>>> index c28206499a3d..a5462d1fcc48 100644 >>>>> --- a/kernel/sched/fair.c >>>>> +++ b/kernel/sched/fair.c >>>>> @@ -3664,7 +3664,8 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq) >>>>> */ >>>>> static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) 
>>>>> { >>>>> - long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib; >>>>> + long delta; >>>>> + u64 now; >>>>> /* >>>>> * No need to update load_avg for root_task_group as it is not used. >>>>> @@ -3672,9 +3673,19 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) >>>>> if (cfs_rq->tg == &root_task_group) >>>>> return; >>>>> + /* >>>>> + * For migration heavy workload, access to tg->load_avg can be >>>>> + * unbound. Limit the update rate to at most once per ms. >>>>> + */ >>>>> + now = sched_clock_cpu(cpu_of(rq_of(cfs_rq))); >>>>> + if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC) >>>>> + return; >>>>> + >>>>> + delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib; >>>>> if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) { >>>>> atomic_long_add(delta, &cfs_rq->tg->load_avg); >>>>> cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg; >>>>> + cfs_rq->last_update_tg_load_avg = now; >>>>> } >>>>> } >>>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h >>>>> index 6a8b7b9ed089..52ee7027def9 100644 >>>>> --- a/kernel/sched/sched.h >>>>> +++ b/kernel/sched/sched.h >>>>> @@ -593,6 +593,7 @@ struct cfs_rq { >>>>> } removed; >>>>> #ifdef CONFIG_FAIR_GROUP_SCHED >>>>> + u64 last_update_tg_load_avg; >>>>> unsigned long tg_load_avg_contrib; >>>>> long propagate; >>>>> long prop_runnable_sum; >>>> >>>> -- >>>> Mathieu Desnoyers >>>> EfficiOS Inc. >>>> https://www.efficios.com >>>> >> >> -- >> Mathieu Desnoyers >> EfficiOS Inc. >> https://www.efficios.com >> -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/1] sched/fair: ratelimit update to tg->load_avg 2023-08-24 13:08 ` Mathieu Desnoyers @ 2023-08-24 13:24 ` Vincent Guittot 2023-08-25 6:08 ` Aaron Lu 1 sibling, 0 replies; 15+ messages in thread From: Vincent Guittot @ 2023-08-24 13:24 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Aaron Lu, Peter Zijlstra, Ingo Molnar, Juri Lelli, Daniel Jordan, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, Tim Chen, Nitin Tekchandani, Yu Chen, Waiman Long, Deng Pan, Gautham R . Shenoy, David Vernet, linux-kernel On Thu, 24 Aug 2023 at 15:07, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > On 8/24/23 09:03, Vincent Guittot wrote: > > On Thu, 24 Aug 2023 at 14:55, Mathieu Desnoyers > > <mathieu.desnoyers@efficios.com> wrote: > >> > >> On 8/24/23 04:01, Aaron Lu wrote: > >>> On Wed, Aug 23, 2023 at 10:05:31AM -0400, Mathieu Desnoyers wrote: > >>>> On 8/23/23 02:08, Aaron Lu wrote: > >>>>> When using sysbench to benchmark Postgres in a single docker instance > >>>>> with sysbench's nr_threads set to nr_cpu, it is observed there are times > >>>>> update_cfs_group() and update_load_avg() shows noticeable overhead on > >>>>> a 2sockets/112core/224cpu Intel Sapphire Rapids(SPR): > >>>>> > >>>>> 13.75% 13.74% [kernel.vmlinux] [k] update_cfs_group > >>>>> 10.63% 10.04% [kernel.vmlinux] [k] update_load_avg > >>>>> > >>>>> Annotate shows the cycles are mostly spent on accessing tg->load_avg > >>>>> with update_load_avg() being the write side and update_cfs_group() being > >>>>> the read side. tg->load_avg is per task group and when different tasks > >>>>> of the same taskgroup running on different CPUs frequently access > >>>>> tg->load_avg, it can be heavily contended. > >>>>> > >>>>> E.g. 
when running postgres_sysbench on a 2sockets/112cores/224cpus Intel > >>>>> Sappire Rapids, during a 5s window, the wakeup number is 14millions and > >>>>> migration number is 11millions and with each migration, the task's load > >>>>> will transfer from src cfs_rq to target cfs_rq and each change involves > >>>>> an update to tg->load_avg. Since the workload can trigger as many wakeups > >>>>> and migrations, the access(both read and write) to tg->load_avg can be > >>>>> unbound. As a result, the two mentioned functions showed noticeable > >>>>> overhead. With netperf/nr_client=nr_cpu/UDP_RR, the problem is worse: > >>>>> during a 5s window, wakeup number is 21millions and migration number is > >>>>> 14millions; update_cfs_group() costs ~25% and update_load_avg() costs ~16%. > >>>>> > >>>>> Reduce the overhead by limiting updates to tg->load_avg to at most once > >>>>> per ms. After this change, the cost of accessing tg->load_avg is greatly > >>>>> reduced and performance improved. Detailed test results below. > >>>> > >>>> By applying your patch on top of my patchset at: > >>>> > >>>> https://lore.kernel.org/lkml/20230822113133.643238-1-mathieu.desnoyers@efficios.com/ > >>>> > >>>> The combined hackbench results look very promising: > >>>> > >>>> (hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100) > >>>> (192 cores AMD EPYC 9654 96-Core Processor (over 2 sockets), with hyperthreading) > >>>> > >>>> Baseline: 49s > >>>> With L2-ttwu-queue-skip: 34s (30% speedup) > >>>> With L2-ttwu-queue-skip + ratelimit-load-avg: 26s (46% speedup) > >>>> > >>>> Feel free to apply my: > >>>> > >>>> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> > >>>> Tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> > >>> > >>> Thanks a lot for running this and reviewing the patch. > >>> I'll add your number and tag in the changelog when sending a new > >>> version. 
> >> > >> Now that I come to think of it, I have comment: why use > >> sched_clock_cpu() rather than just read the jiffies value ? AFAIR, > >> sched_clock can be slower than needed when read from a "remote" cpu on > >> architectures that have an unsynchronized tsc. > >> > >> Considering that you only need a time reference more or less accurate at > >> the millisecond level, I suspect that jiffies is what you are looking > >> for here. This is what the NUMA balance code and rseq mm_cid use to > >> execute work every N milliseconds. > > > > tick can 4ms or even 10ms which means a rate limit up between 10ms to > > 20ms in the latter case > > Fair enough, so just to confirm: is the 1ms a target period which has > been empirically determined to be optimal (lower having too much > overhead, and higher not being precise enough) ? yes it's a tradeoff. This impacts how much time a group can get on a rq > > Thanks, > > Mathieu > > > > >> > >> Thanks, > >> > >> Mathieu > >> > >>> > >>> Regards, > >>> Aaron > >>> > >>>>> > >>>>> ============================== > >>>>> postgres_sysbench on SPR: > >>>>> 25% > >>>>> base: 42382±19.8% > >>>>> patch: 50174±9.5% (noise) > >>>>> > >>>>> 50% > >>>>> base: 67626±1.3% > >>>>> patch: 67365±3.1% (noise) > >>>>> > >>>>> 75% > >>>>> base: 100216±1.2% > >>>>> patch: 112470±0.1% +12.2% > >>>>> > >>>>> 100% > >>>>> base: 93671±0.4% > >>>>> patch: 113563±0.2% +21.2% > >>>>> > >>>>> ============================== > >>>>> hackbench on ICL: > >>>>> group=1 > >>>>> base: 114912±5.2% > >>>>> patch: 117857±2.5% (noise) > >>>>> > >>>>> group=4 > >>>>> base: 359902±1.6% > >>>>> patch: 361685±2.7% (noise) > >>>>> > >>>>> group=8 > >>>>> base: 461070±0.8% > >>>>> patch: 491713±0.3% +6.6% > >>>>> > >>>>> group=16 > >>>>> base: 309032±5.0% > >>>>> patch: 378337±1.3% +22.4% > >>>>> > >>>>> ============================= > >>>>> hackbench on SPR: > >>>>> group=1 > >>>>> base: 100768±2.9% > >>>>> patch: 103134±2.9% (noise) > >>>>> > >>>>> group=4 > >>>>> base: 
413830±12.5% > >>>>> patch: 378660±16.6% (noise) > >>>>> > >>>>> group=8 > >>>>> base: 436124±0.6% > >>>>> patch: 490787±3.2% +12.5% > >>>>> > >>>>> group=16 > >>>>> base: 457730±3.2% > >>>>> patch: 680452±1.3% +48.8% > >>>>> > >>>>> ============================ > >>>>> netperf/udp_rr on ICL > >>>>> 25% > >>>>> base: 114413±0.1% > >>>>> patch: 115111±0.0% +0.6% > >>>>> > >>>>> 50% > >>>>> base: 86803±0.5% > >>>>> patch: 86611±0.0% (noise) > >>>>> > >>>>> 75% > >>>>> base: 35959±5.3% > >>>>> patch: 49801±0.6% +38.5% > >>>>> > >>>>> 100% > >>>>> base: 61951±6.4% > >>>>> patch: 70224±0.8% +13.4% > >>>>> > >>>>> =========================== > >>>>> netperf/udp_rr on SPR > >>>>> 25% > >>>>> base: 104954±1.3% > >>>>> patch: 107312±2.8% (noise) > >>>>> > >>>>> 50% > >>>>> base: 55394±4.6% > >>>>> patch: 54940±7.4% (noise) > >>>>> > >>>>> 75% > >>>>> base: 13779±3.1% > >>>>> patch: 36105±1.1% +162% > >>>>> > >>>>> 100% > >>>>> base: 9703±3.7% > >>>>> patch: 28011±0.2% +189% > >>>>> > >>>>> ============================================== > >>>>> netperf/tcp_stream on ICL (all in noise range) > >>>>> 25% > >>>>> base: 43092±0.1% > >>>>> patch: 42891±0.5% > >>>>> > >>>>> 50% > >>>>> base: 19278±14.9% > >>>>> patch: 22369±7.2% > >>>>> > >>>>> 75% > >>>>> base: 16822±3.0% > >>>>> patch: 17086±2.3% > >>>>> > >>>>> 100% > >>>>> base: 18216±0.6% > >>>>> patch: 18078±2.9% > >>>>> > >>>>> =============================================== > >>>>> netperf/tcp_stream on SPR (all in noise range) > >>>>> 25% > >>>>> base: 34491±0.3% > >>>>> patch: 34886±0.5% > >>>>> > >>>>> 50% > >>>>> base: 19278±14.9% > >>>>> patch: 22369±7.2% > >>>>> > >>>>> 75% > >>>>> base: 16822±3.0% > >>>>> patch: 17086±2.3% > >>>>> > >>>>> 100% > >>>>> base: 18216±0.6% > >>>>> patch: 18078±2.9% > >>>>> > >>>>> Reported-by: Nitin Tekchandani <nitin.tekchandani@intel.com> > >>>>> Suggested-by: Vincent Guittot <vincent.guittot@linaro.org> > >>>>> Signed-off-by: Aaron Lu <aaron.lu@intel.com> > >>>>> Reviewed-by: Vincent 
Guittot <vincent.guittot@linaro.org> > >>>>> --- > >>>>> kernel/sched/fair.c | 13 ++++++++++++- > >>>>> kernel/sched/sched.h | 1 + > >>>>> 2 files changed, 13 insertions(+), 1 deletion(-) > >>>>> > >>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > >>>>> index c28206499a3d..a5462d1fcc48 100644 > >>>>> --- a/kernel/sched/fair.c > >>>>> +++ b/kernel/sched/fair.c > >>>>> @@ -3664,7 +3664,8 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq) > >>>>> */ > >>>>> static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) > >>>>> { > >>>>> - long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib; > >>>>> + long delta; > >>>>> + u64 now; > >>>>> /* > >>>>> * No need to update load_avg for root_task_group as it is not used. > >>>>> @@ -3672,9 +3673,19 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) > >>>>> if (cfs_rq->tg == &root_task_group) > >>>>> return; > >>>>> + /* > >>>>> + * For migration heavy workload, access to tg->load_avg can be > >>>>> + * unbound. Limit the update rate to at most once per ms. 
> >>>>> + */ > >>>>> + now = sched_clock_cpu(cpu_of(rq_of(cfs_rq))); > >>>>> + if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC) > >>>>> + return; > >>>>> + > >>>>> + delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib; > >>>>> if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) { > >>>>> atomic_long_add(delta, &cfs_rq->tg->load_avg); > >>>>> cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg; > >>>>> + cfs_rq->last_update_tg_load_avg = now; > >>>>> } > >>>>> } > >>>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > >>>>> index 6a8b7b9ed089..52ee7027def9 100644 > >>>>> --- a/kernel/sched/sched.h > >>>>> +++ b/kernel/sched/sched.h > >>>>> @@ -593,6 +593,7 @@ struct cfs_rq { > >>>>> } removed; > >>>>> #ifdef CONFIG_FAIR_GROUP_SCHED > >>>>> + u64 last_update_tg_load_avg; > >>>>> unsigned long tg_load_avg_contrib; > >>>>> long propagate; > >>>>> long prop_runnable_sum; > >>>> > >>>> -- > >>>> Mathieu Desnoyers > >>>> EfficiOS Inc. > >>>> https://www.efficios.com > >>>> > >> > >> -- > >> Mathieu Desnoyers > >> EfficiOS Inc. > >> https://www.efficios.com > >> > > -- > Mathieu Desnoyers > EfficiOS Inc. > https://www.efficios.com > ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/1] sched/fair: ratelimit update to tg->load_avg 2023-08-24 13:08 ` Mathieu Desnoyers 2023-08-24 13:24 ` Vincent Guittot @ 2023-08-25 6:08 ` Aaron Lu 1 sibling, 0 replies; 15+ messages in thread From: Aaron Lu @ 2023-08-25 6:08 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Vincent Guittot, Peter Zijlstra, Ingo Molnar, Juri Lelli, Daniel Jordan, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, Tim Chen, Nitin Tekchandani, Yu Chen, Waiman Long, Deng Pan, Gautham R . Shenoy, David Vernet, linux-kernel On Thu, Aug 24, 2023 at 09:08:43AM -0400, Mathieu Desnoyers wrote: > On 8/24/23 09:03, Vincent Guittot wrote: > > On Thu, 24 Aug 2023 at 14:55, Mathieu Desnoyers > > <mathieu.desnoyers@efficios.com> wrote: > > > > > > On 8/24/23 04:01, Aaron Lu wrote: > > > > On Wed, Aug 23, 2023 at 10:05:31AM -0400, Mathieu Desnoyers wrote: > > > > > On 8/23/23 02:08, Aaron Lu wrote: > > > > > > When using sysbench to benchmark Postgres in a single docker instance > > > > > > with sysbench's nr_threads set to nr_cpu, it is observed there are times > > > > > > update_cfs_group() and update_load_avg() shows noticeable overhead on > > > > > > a 2sockets/112core/224cpu Intel Sapphire Rapids(SPR): > > > > > > > > > > > > 13.75% 13.74% [kernel.vmlinux] [k] update_cfs_group > > > > > > 10.63% 10.04% [kernel.vmlinux] [k] update_load_avg > > > > > > > > > > > > Annotate shows the cycles are mostly spent on accessing tg->load_avg > > > > > > with update_load_avg() being the write side and update_cfs_group() being > > > > > > the read side. tg->load_avg is per task group and when different tasks > > > > > > of the same taskgroup running on different CPUs frequently access > > > > > > tg->load_avg, it can be heavily contended. > > > > > > > > > > > > E.g. 
when running postgres_sysbench on a 2sockets/112cores/224cpus Intel > > > > > > Sappire Rapids, during a 5s window, the wakeup number is 14millions and > > > > > > migration number is 11millions and with each migration, the task's load > > > > > > will transfer from src cfs_rq to target cfs_rq and each change involves > > > > > > an update to tg->load_avg. Since the workload can trigger as many wakeups > > > > > > and migrations, the access(both read and write) to tg->load_avg can be > > > > > > unbound. As a result, the two mentioned functions showed noticeable > > > > > > overhead. With netperf/nr_client=nr_cpu/UDP_RR, the problem is worse: > > > > > > during a 5s window, wakeup number is 21millions and migration number is > > > > > > 14millions; update_cfs_group() costs ~25% and update_load_avg() costs ~16%. > > > > > > > > > > > > Reduce the overhead by limiting updates to tg->load_avg to at most once > > > > > > per ms. After this change, the cost of accessing tg->load_avg is greatly > > > > > > reduced and performance improved. Detailed test results below. > > > > > > > > > > By applying your patch on top of my patchset at: > > > > > > > > > > https://lore.kernel.org/lkml/20230822113133.643238-1-mathieu.desnoyers@efficios.com/ > > > > > > > > > > The combined hackbench results look very promising: > > > > > > > > > > (hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100) > > > > > (192 cores AMD EPYC 9654 96-Core Processor (over 2 sockets), with hyperthreading) > > > > > > > > > > Baseline: 49s > > > > > With L2-ttwu-queue-skip: 34s (30% speedup) > > > > > With L2-ttwu-queue-skip + ratelimit-load-avg: 26s (46% speedup) > > > > > > > > > > Feel free to apply my: > > > > > > > > > > Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> > > > > > Tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> > > > > > > > > Thanks a lot for running this and reviewing the patch. 
> > > > I'll add your number and tag in the changelog when sending a new > > > > version. > > > > > > Now that I come to think of it, I have a comment: why use > > > sched_clock_cpu() rather than just reading the jiffies value? AFAIR, > > > sched_clock can be slower than needed when read from a "remote" cpu on > > > architectures that have an unsynchronized tsc. > > > > > > Considering that you only need a time reference more or less accurate at > > > the millisecond level, I suspect that jiffies is what you are looking > > > for here. This is what the NUMA balance code and rseq mm_cid use to > > > execute work every N milliseconds. > > > > tick can be 4ms or even 10ms, which means a rate limit of between 10ms and > > 20ms in the latter case > > Fair enough, so just to confirm: is the 1ms a target period which has been > empirically determined to be optimal (lower having too much overhead, and > higher not being precise enough)? I chose 1ms because the PELT window is roughly 1ms. And during my tests, ratelimiting to once per ms delivers good performance and no regressions for the workloads that I have tested so far, so I didn't try other values. I can't say 1ms is the optimal value, but it appears to work well enough for now. 
Thanks, Aaron > > > > > > Thanks, > > > > > > Mathieu > > > > > > > > > > > Regards, > > > > Aaron > > > > > > > > > > > > > > > > ============================== > > > > > > postgres_sysbench on SPR: > > > > > > 25% > > > > > > base: 42382±19.8% > > > > > > patch: 50174±9.5% (noise) > > > > > > > > > > > > 50% > > > > > > base: 67626±1.3% > > > > > > patch: 67365±3.1% (noise) > > > > > > > > > > > > 75% > > > > > > base: 100216±1.2% > > > > > > patch: 112470±0.1% +12.2% > > > > > > > > > > > > 100% > > > > > > base: 93671±0.4% > > > > > > patch: 113563±0.2% +21.2% > > > > > > > > > > > > ============================== > > > > > > hackbench on ICL: > > > > > > group=1 > > > > > > base: 114912±5.2% > > > > > > patch: 117857±2.5% (noise) > > > > > > > > > > > > group=4 > > > > > > base: 359902±1.6% > > > > > > patch: 361685±2.7% (noise) > > > > > > > > > > > > group=8 > > > > > > base: 461070±0.8% > > > > > > patch: 491713±0.3% +6.6% > > > > > > > > > > > > group=16 > > > > > > base: 309032±5.0% > > > > > > patch: 378337±1.3% +22.4% > > > > > > > > > > > > ============================= > > > > > > hackbench on SPR: > > > > > > group=1 > > > > > > base: 100768±2.9% > > > > > > patch: 103134±2.9% (noise) > > > > > > > > > > > > group=4 > > > > > > base: 413830±12.5% > > > > > > patch: 378660±16.6% (noise) > > > > > > > > > > > > group=8 > > > > > > base: 436124±0.6% > > > > > > patch: 490787±3.2% +12.5% > > > > > > > > > > > > group=16 > > > > > > base: 457730±3.2% > > > > > > patch: 680452±1.3% +48.8% > > > > > > > > > > > > ============================ > > > > > > netperf/udp_rr on ICL > > > > > > 25% > > > > > > base: 114413±0.1% > > > > > > patch: 115111±0.0% +0.6% > > > > > > > > > > > > 50% > > > > > > base: 86803±0.5% > > > > > > patch: 86611±0.0% (noise) > > > > > > > > > > > > 75% > > > > > > base: 35959±5.3% > > > > > > patch: 49801±0.6% +38.5% > > > > > > > > > > > > 100% > > > > > > base: 61951±6.4% > > > > > > patch: 70224±0.8% +13.4% > > > > > > > > > > > 
> =========================== > > > > > > netperf/udp_rr on SPR > > > > > > 25% > > > > > > base: 104954±1.3% > > > > > > patch: 107312±2.8% (noise) > > > > > > > > > > > > 50% > > > > > > base: 55394±4.6% > > > > > > patch: 54940±7.4% (noise) > > > > > > > > > > > > 75% > > > > > > base: 13779±3.1% > > > > > > patch: 36105±1.1% +162% > > > > > > > > > > > > 100% > > > > > > base: 9703±3.7% > > > > > > patch: 28011±0.2% +189% > > > > > > > > > > > > ============================================== > > > > > > netperf/tcp_stream on ICL (all in noise range) > > > > > > 25% > > > > > > base: 43092±0.1% > > > > > > patch: 42891±0.5% > > > > > > > > > > > > 50% > > > > > > base: 19278±14.9% > > > > > > patch: 22369±7.2% > > > > > > > > > > > > 75% > > > > > > base: 16822±3.0% > > > > > > patch: 17086±2.3% > > > > > > > > > > > > 100% > > > > > > base: 18216±0.6% > > > > > > patch: 18078±2.9% > > > > > > > > > > > > =============================================== > > > > > > netperf/tcp_stream on SPR (all in noise range) > > > > > > 25% > > > > > > base: 34491±0.3% > > > > > > patch: 34886±0.5% > > > > > > > > > > > > 50% > > > > > > base: 19278±14.9% > > > > > > patch: 22369±7.2% > > > > > > > > > > > > 75% > > > > > > base: 16822±3.0% > > > > > > patch: 17086±2.3% > > > > > > > > > > > > 100% > > > > > > base: 18216±0.6% > > > > > > patch: 18078±2.9% > > > > > > > > > > > > Reported-by: Nitin Tekchandani <nitin.tekchandani@intel.com> > > > > > > Suggested-by: Vincent Guittot <vincent.guittot@linaro.org> > > > > > > Signed-off-by: Aaron Lu <aaron.lu@intel.com> > > > > > > Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> > > > > > > --- > > > > > > kernel/sched/fair.c | 13 ++++++++++++- > > > > > > kernel/sched/sched.h | 1 + > > > > > > 2 files changed, 13 insertions(+), 1 deletion(-) > > > > > > > > > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > > > > > index c28206499a3d..a5462d1fcc48 100644 > > > > > > --- a/kernel/sched/fair.c > > > > > > 
+++ b/kernel/sched/fair.c > > > > > > @@ -3664,7 +3664,8 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq) > > > > > > */ > > > > > > static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) > > > > > > { > > > > > > - long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib; > > > > > > + long delta; > > > > > > + u64 now; > > > > > > /* > > > > > > * No need to update load_avg for root_task_group as it is not used. > > > > > > @@ -3672,9 +3673,19 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) > > > > > > if (cfs_rq->tg == &root_task_group) > > > > > > return; > > > > > > + /* > > > > > > + * For migration heavy workload, access to tg->load_avg can be > > > > > > + * unbound. Limit the update rate to at most once per ms. > > > > > > + */ > > > > > > + now = sched_clock_cpu(cpu_of(rq_of(cfs_rq))); > > > > > > + if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC) > > > > > > + return; > > > > > > + > > > > > > + delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib; > > > > > > if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) { > > > > > > atomic_long_add(delta, &cfs_rq->tg->load_avg); > > > > > > cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg; > > > > > > + cfs_rq->last_update_tg_load_avg = now; > > > > > > } > > > > > > } > > > > > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > > > > > > index 6a8b7b9ed089..52ee7027def9 100644 > > > > > > --- a/kernel/sched/sched.h > > > > > > +++ b/kernel/sched/sched.h > > > > > > @@ -593,6 +593,7 @@ struct cfs_rq { > > > > > > } removed; > > > > > > #ifdef CONFIG_FAIR_GROUP_SCHED > > > > > > + u64 last_update_tg_load_avg; > > > > > > unsigned long tg_load_avg_contrib; > > > > > > long propagate; > > > > > > long prop_runnable_sum; > > > > > > > > > > -- > > > > > Mathieu Desnoyers > > > > > EfficiOS Inc. > > > > > https://www.efficios.com > > > > > > > > > > > -- > > > Mathieu Desnoyers > > > EfficiOS Inc. 
> > > https://www.efficios.com > > > > > -- > Mathieu Desnoyers > EfficiOS Inc. > https://www.efficios.com > ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/1] sched/fair: ratelimit update to tg->load_avg 2023-08-23 6:08 ` [PATCH 1/1] sched/fair: ratelimit update to tg->load_avg Aaron Lu 2023-08-23 14:05 ` Mathieu Desnoyers @ 2023-08-24 18:48 ` David Vernet 2023-08-25 6:18 ` Aaron Lu 2023-09-06 3:52 ` kernel test robot 2 siblings, 1 reply; 15+ messages in thread From: David Vernet @ 2023-08-24 18:48 UTC (permalink / raw) To: Aaron Lu Cc: Peter Zijlstra, Vincent Guittot, Ingo Molnar, Juri Lelli, Daniel Jordan, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, Tim Chen, Nitin Tekchandani, Yu Chen, Waiman Long, Deng Pan, Mathieu Desnoyers, Gautham R . Shenoy, linux-kernel On Wed, Aug 23, 2023 at 02:08:32PM +0800, Aaron Lu wrote: > When using sysbench to benchmark Postgres in a single docker instance > with sysbench's nr_threads set to nr_cpu, it is observed there are times > update_cfs_group() and update_load_avg() shows noticeable overhead on > a 2sockets/112core/224cpu Intel Sapphire Rapids(SPR): > > 13.75% 13.74% [kernel.vmlinux] [k] update_cfs_group > 10.63% 10.04% [kernel.vmlinux] [k] update_load_avg > > Annotate shows the cycles are mostly spent on accessing tg->load_avg > with update_load_avg() being the write side and update_cfs_group() being > the read side. tg->load_avg is per task group and when different tasks > of the same taskgroup running on different CPUs frequently access > tg->load_avg, it can be heavily contended. > > E.g. when running postgres_sysbench on a 2sockets/112cores/224cpus Intel > Sappire Rapids, during a 5s window, the wakeup number is 14millions and > migration number is 11millions and with each migration, the task's load > will transfer from src cfs_rq to target cfs_rq and each change involves > an update to tg->load_avg. Since the workload can trigger as many wakeups > and migrations, the access(both read and write) to tg->load_avg can be > unbound. As a result, the two mentioned functions showed noticeable > overhead. 
With netperf/nr_client=nr_cpu/UDP_RR, the problem is worse: > during a 5s window, wakeup number is 21millions and migration number is > 14millions; update_cfs_group() costs ~25% and update_load_avg() costs ~16%. > > Reduce the overhead by limiting updates to tg->load_avg to at most once > per ms. After this change, the cost of accessing tg->load_avg is greatly > reduced and performance improved. Detailed test results below. > > ============================== > postgres_sysbench on SPR: > 25% > base: 42382±19.8% > patch: 50174±9.5% (noise) > > 50% > base: 67626±1.3% > patch: 67365±3.1% (noise) > > 75% > base: 100216±1.2% > patch: 112470±0.1% +12.2% > > 100% > base: 93671±0.4% > patch: 113563±0.2% +21.2% > > ============================== > hackbench on ICL: > group=1 > base: 114912±5.2% > patch: 117857±2.5% (noise) > > group=4 > base: 359902±1.6% > patch: 361685±2.7% (noise) > > group=8 > base: 461070±0.8% > patch: 491713±0.3% +6.6% > > group=16 > base: 309032±5.0% > patch: 378337±1.3% +22.4% > > ============================= > hackbench on SPR: > group=1 > base: 100768±2.9% > patch: 103134±2.9% (noise) > > group=4 > base: 413830±12.5% > patch: 378660±16.6% (noise) > > group=8 > base: 436124±0.6% > patch: 490787±3.2% +12.5% > > group=16 > base: 457730±3.2% > patch: 680452±1.3% +48.8% > > ============================ > netperf/udp_rr on ICL > 25% > base: 114413±0.1% > patch: 115111±0.0% +0.6% > > 50% > base: 86803±0.5% > patch: 86611±0.0% (noise) > > 75% > base: 35959±5.3% > patch: 49801±0.6% +38.5% > > 100% > base: 61951±6.4% > patch: 70224±0.8% +13.4% > > =========================== > netperf/udp_rr on SPR > 25% > base: 104954±1.3% > patch: 107312±2.8% (noise) > > 50% > base: 55394±4.6% > patch: 54940±7.4% (noise) > > 75% > base: 13779±3.1% > patch: 36105±1.1% +162% > > 100% > base: 9703±3.7% > patch: 28011±0.2% +189% > > ============================================== > netperf/tcp_stream on ICL (all in noise range) > 25% > base: 43092±0.1% > patch: 42891±0.5% > > 
50% > base: 19278±14.9% > patch: 22369±7.2% > > 75% > base: 16822±3.0% > patch: 17086±2.3% > > 100% > base: 18216±0.6% > patch: 18078±2.9% > > =============================================== > netperf/tcp_stream on SPR (all in noise range) > 25% > base: 34491±0.3% > patch: 34886±0.5% > > 50% > base: 19278±14.9% > patch: 22369±7.2% > > 75% > base: 16822±3.0% > patch: 17086±2.3% > > 100% > base: 18216±0.6% > patch: 18078±2.9% > > Reported-by: Nitin Tekchandani <nitin.tekchandani@intel.com> > Suggested-by: Vincent Guittot <vincent.guittot@linaro.org> > Signed-off-by: Aaron Lu <aaron.lu@intel.com> > Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Hey Aaron, Thanks for working on this. It LGTM modulo two small nits. Feel free to add my Reviewed-by if you'd like regardless: Reviewed-by: David Vernet <void@manifault.com> > --- > kernel/sched/fair.c | 13 ++++++++++++- > kernel/sched/sched.h | 1 + > 2 files changed, 13 insertions(+), 1 deletion(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index c28206499a3d..a5462d1fcc48 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -3664,7 +3664,8 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq) > */ > static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) > { > - long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib; > + long delta; > + u64 now; > > /* > * No need to update load_avg for root_task_group as it is not used. > @@ -3672,9 +3673,19 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) > if (cfs_rq->tg == &root_task_group) > return; > > + /* > + * For migration heavy workload, access to tg->load_avg can be s/workload/workloads > + * unbound. Limit the update rate to at most once per ms. Can we describe either here or in the commit summary how we arrived at 1ms? 
I'm fine with hard-coded heuristics like this (just like the proposed 6-core shard size in the shared_runq patchset), but it would also be ideal to give a bit more color on how we arrived here, because we'll forget immediately otherwise. > + */ > + now = sched_clock_cpu(cpu_of(rq_of(cfs_rq))); > + if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC) > + return; > + > + delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib; > if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) { > atomic_long_add(delta, &cfs_rq->tg->load_avg); > cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg; > + cfs_rq->last_update_tg_load_avg = now; > } > } > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > index 6a8b7b9ed089..52ee7027def9 100644 > --- a/kernel/sched/sched.h > +++ b/kernel/sched/sched.h > @@ -593,6 +593,7 @@ struct cfs_rq { > } removed; > > #ifdef CONFIG_FAIR_GROUP_SCHED > + u64 last_update_tg_load_avg; > unsigned long tg_load_avg_contrib; > long propagate; > long prop_runnable_sum; > -- > 2.41.0 > ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/1] sched/fair: ratelimit update to tg->load_avg 2023-08-24 18:48 ` David Vernet @ 2023-08-25 6:18 ` Aaron Lu 0 siblings, 0 replies; 15+ messages in thread From: Aaron Lu @ 2023-08-25 6:18 UTC (permalink / raw) To: David Vernet Cc: Peter Zijlstra, Vincent Guittot, Ingo Molnar, Juri Lelli, Daniel Jordan, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, Tim Chen, Nitin Tekchandani, Yu Chen, Waiman Long, Deng Pan, Mathieu Desnoyers, Gautham R . Shenoy, linux-kernel On Thu, Aug 24, 2023 at 01:48:07PM -0500, David Vernet wrote: > On Wed, Aug 23, 2023 at 02:08:32PM +0800, Aaron Lu wrote: > > When using sysbench to benchmark Postgres in a single docker instance > > with sysbench's nr_threads set to nr_cpu, it is observed there are times > > update_cfs_group() and update_load_avg() shows noticeable overhead on > > a 2sockets/112core/224cpu Intel Sapphire Rapids(SPR): > > > > 13.75% 13.74% [kernel.vmlinux] [k] update_cfs_group > > 10.63% 10.04% [kernel.vmlinux] [k] update_load_avg > > > > Annotate shows the cycles are mostly spent on accessing tg->load_avg > > with update_load_avg() being the write side and update_cfs_group() being > > the read side. tg->load_avg is per task group and when different tasks > > of the same taskgroup running on different CPUs frequently access > > tg->load_avg, it can be heavily contended. > > > > E.g. when running postgres_sysbench on a 2sockets/112cores/224cpus Intel > > Sappire Rapids, during a 5s window, the wakeup number is 14millions and > > migration number is 11millions and with each migration, the task's load > > will transfer from src cfs_rq to target cfs_rq and each change involves > > an update to tg->load_avg. Since the workload can trigger as many wakeups > > and migrations, the access(both read and write) to tg->load_avg can be > > unbound. As a result, the two mentioned functions showed noticeable > > overhead. 
With netperf/nr_client=nr_cpu/UDP_RR, the problem is worse: > > during a 5s window, wakeup number is 21millions and migration number is > > 14millions; update_cfs_group() costs ~25% and update_load_avg() costs ~16%. > > > > Reduce the overhead by limiting updates to tg->load_avg to at most once > > per ms. After this change, the cost of accessing tg->load_avg is greatly > > reduced and performance improved. Detailed test results below. > > > > ============================== > > postgres_sysbench on SPR: > > 25% > > base: 42382±19.8% > > patch: 50174±9.5% (noise) > > > > 50% > > base: 67626±1.3% > > patch: 67365±3.1% (noise) > > > > 75% > > base: 100216±1.2% > > patch: 112470±0.1% +12.2% > > > > 100% > > base: 93671±0.4% > > patch: 113563±0.2% +21.2% > > > > ============================== > > hackbench on ICL: > > group=1 > > base: 114912±5.2% > > patch: 117857±2.5% (noise) > > > > group=4 > > base: 359902±1.6% > > patch: 361685±2.7% (noise) > > > > group=8 > > base: 461070±0.8% > > patch: 491713±0.3% +6.6% > > > > group=16 > > base: 309032±5.0% > > patch: 378337±1.3% +22.4% > > > > ============================= > > hackbench on SPR: > > group=1 > > base: 100768±2.9% > > patch: 103134±2.9% (noise) > > > > group=4 > > base: 413830±12.5% > > patch: 378660±16.6% (noise) > > > > group=8 > > base: 436124±0.6% > > patch: 490787±3.2% +12.5% > > > > group=16 > > base: 457730±3.2% > > patch: 680452±1.3% +48.8% > > > > ============================ > > netperf/udp_rr on ICL > > 25% > > base: 114413±0.1% > > patch: 115111±0.0% +0.6% > > > > 50% > > base: 86803±0.5% > > patch: 86611±0.0% (noise) > > > > 75% > > base: 35959±5.3% > > patch: 49801±0.6% +38.5% > > > > 100% > > base: 61951±6.4% > > patch: 70224±0.8% +13.4% > > > > =========================== > > netperf/udp_rr on SPR > > 25% > > base: 104954±1.3% > > patch: 107312±2.8% (noise) > > > > 50% > > base: 55394±4.6% > > patch: 54940±7.4% (noise) > > > > 75% > > base: 13779±3.1% > > patch: 36105±1.1% +162% > > > > 100% > > 
base: 9703±3.7% > > patch: 28011±0.2% +189% > > > > ============================================== > > netperf/tcp_stream on ICL (all in noise range) > > 25% > > base: 43092±0.1% > > patch: 42891±0.5% > > > > 50% > > base: 19278±14.9% > > patch: 22369±7.2% > > > > 75% > > base: 16822±3.0% > > patch: 17086±2.3% > > > > 100% > > base: 18216±0.6% > > patch: 18078±2.9% > > > > =============================================== > > netperf/tcp_stream on SPR (all in noise range) > > 25% > > base: 34491±0.3% > > patch: 34886±0.5% > > > > 50% > > base: 19278±14.9% > > patch: 22369±7.2% > > > > 75% > > base: 16822±3.0% > > patch: 17086±2.3% > > > > 100% > > base: 18216±0.6% > > patch: 18078±2.9% > > > > Reported-by: Nitin Tekchandani <nitin.tekchandani@intel.com> > > Suggested-by: Vincent Guittot <vincent.guittot@linaro.org> > > Signed-off-by: Aaron Lu <aaron.lu@intel.com> > > Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> > > Hey Aaron, > > Thanks for working on this. It LGTM modulo two small nits. Feel free to > add my Reviewed-by if you'd like regardless: > > Reviewed-by: David Vernet <void@manifault.com> Thanks! > > --- > > kernel/sched/fair.c | 13 ++++++++++++- > > kernel/sched/sched.h | 1 + > > 2 files changed, 13 insertions(+), 1 deletion(-) > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index c28206499a3d..a5462d1fcc48 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -3664,7 +3664,8 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq) > > */ > > static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) > > { > > - long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib; > > + long delta; > > + u64 now; > > > > /* > > * No need to update load_avg for root_task_group as it is not used. 
> > @@ -3672,9 +3673,19 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) > > if (cfs_rq->tg == &root_task_group) > > return; > > > > + /* > > + * For migration heavy workload, access to tg->load_avg can be > > s/workload/workloads Will change. > > + * unbound. Limit the update rate to at most once per ms. > > Can we describe either here or in the commit summary how we arrived at > 1ms? I'm fine with hard-coded heuristics like this (just like the > proposed 6-core shard size in the shared_runq patchset), but it would > also be ideal to give a bit more color on how we arrived here, because > we'll forget immediately otherwise. Agree. As I replied to Mathieu, I chose 1ms mainly because pelt window is roughly 1ms. I'll update the changelog when sending a new version. Thanks, Aaron > > + */ > > + now = sched_clock_cpu(cpu_of(rq_of(cfs_rq))); > > + if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC) > > + return; > > + > > + delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib; > > if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) { > > atomic_long_add(delta, &cfs_rq->tg->load_avg); > > cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg; > > + cfs_rq->last_update_tg_load_avg = now; > > } > > } > > > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > > index 6a8b7b9ed089..52ee7027def9 100644 > > --- a/kernel/sched/sched.h > > +++ b/kernel/sched/sched.h > > @@ -593,6 +593,7 @@ struct cfs_rq { > > } removed; > > > > #ifdef CONFIG_FAIR_GROUP_SCHED > > + u64 last_update_tg_load_avg; > > unsigned long tg_load_avg_contrib; > > long propagate; > > long prop_runnable_sum; > > -- > > 2.41.0 > > ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/1] sched/fair: ratelimit update to tg->load_avg 2023-08-23 6:08 ` [PATCH 1/1] sched/fair: ratelimit update to tg->load_avg Aaron Lu 2023-08-23 14:05 ` Mathieu Desnoyers 2023-08-24 18:48 ` David Vernet @ 2023-09-06 3:52 ` kernel test robot 2 siblings, 0 replies; 15+ messages in thread From: kernel test robot @ 2023-09-06 3:52 UTC (permalink / raw) To: Aaron Lu Cc: oe-lkp, lkp, Nitin Tekchandani, Vincent Guittot, linux-kernel, ying.huang, feng.tang, fengwei.yin, aubrey.li, yu.c.chen, Peter Zijlstra, Ingo Molnar, Juri Lelli, Daniel Jordan, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, Tim Chen, Waiman Long, Deng Pan, Mathieu Desnoyers, Gautham R . Shenoy, David Vernet, oliver.sang Hello, kernel test robot noticed a 141.1% improvement of stress-ng.nanosleep.ops_per_sec on: commit: 0a24d7afed5c3c59ee212782f9c902c7ada6c3a8 ("[PATCH 1/1] sched/fair: ratelimit update to tg->load_avg") url: https://github.com/intel-lab-lkp/linux/commits/Aaron-Lu/sched-fair-ratelimit-update-to-tg-load_avg/20230823-141042 base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git 63304558ba5dcaaff9e052ee43cfdcc7f9c29e85 patch link: https://lore.kernel.org/all/20230823060832.454842-2-aaron.lu@intel.com/ patch subject: [PATCH 1/1] sched/fair: ratelimit update to tg->load_avg testcase: stress-ng test machine: 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480CTDX (Sapphire Rapids) with 256G memory parameters: nr_threads: 100% testtime: 60s sc_pid_max: 4194304 class: scheduler test: nanosleep cpufreq_governor: performance In addition to that, the commit also has significant impact on the following tests: +------------------+---------------------------------------------------------------------------------------------+ | testcase: change | stress-ng: stress-ng.sem.ops_per_sec 120.7% improvement | | test machine | 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480CTDX (Sapphire Rapids) with 256G memory | | test 
parameters | class=scheduler | | | cpufreq_governor=performance | | | nr_threads=100% | | | sc_pid_max=4194304 | | | test=sem | | | testtime=60s | +------------------+---------------------------------------------------------------------------------------------+ | testcase: change | stress-ng: stress-ng.switch.ops_per_sec 422.1% improvement | | test machine | 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480CTDX (Sapphire Rapids) with 256G memory | | test parameters | class=scheduler | | | cpufreq_governor=performance | | | nr_threads=100% | | | sc_pid_max=4194304 | | | test=switch | | | testtime=60s | +------------------+---------------------------------------------------------------------------------------------+ Details are as below: --------------------------------------------------------------------------------------------------> The kernel config and materials to reproduce are available at: https://download.01.org/0day-ci/archive/20230906/202309061004.94b065e5-oliver.sang@intel.com ========================================================================================= class/compiler/cpufreq_governor/kconfig/nr_threads/rootfs/sc_pid_max/tbox_group/test/testcase/testtime: scheduler/gcc-12/performance/x86_64-rhel-8.3/100%/debian-11.1-x86_64-20220510.cgz/4194304/lkp-spr-r02/nanosleep/stress-ng/60s commit: 63304558ba ("sched/eevdf: Curb wakeup-preemption") 0a24d7afed ("sched/fair: ratelimit update to tg->load_avg") 63304558ba5dcaaf 0a24d7afed5c3c59ee212782f9c ---------------- --------------------------- %stddev %change %stddev \ | \ 1.114e+09 ± 5% +99.5% 2.223e+09 ± 2% cpuidle..time 32153856 ± 6% +989.9% 3.505e+08 cpuidle..usage 447243 ± 17% +164.1% 1181057 ± 29% numa-numastat.node0.numa_hit 1795453 ± 9% +45.6% 2613814 ± 13% numa-numastat.node1.local_node 1926792 ± 6% +41.7% 2729682 ± 13% numa-numastat.node1.numa_hit 1211 ± 14% +60.4% 1944 ± 15% perf-c2c.DRAM.local 43481 ± 4% +115.6% 93764 ± 2% perf-c2c.HITM.local 2142 ± 8% +18.6% 2540 ± 4% 
perf-c2c.HITM.remote
45623 ± 3% +111.1% 96304 ± 2% perf-c2c.HITM.total
6.50 ± 8% +9.2 15.72 ± 4% mpstat.cpu.all.idle%
63.25 -8.8 54.48 mpstat.cpu.all.irq%
0.23 ± 3% -0.1 0.13 ± 2% mpstat.cpu.all.soft%
23.74 -5.1 18.68 mpstat.cpu.all.sys%
6.26 ± 4% +4.7 10.99 mpstat.cpu.all.usr%
8.67 ± 10% +103.8% 17.67 ± 4% vmstat.cpu.id
6780737 ± 4% +62.2% 11001409 vmstat.memory.cache
807.67 -45.4% 441.17 ± 3% vmstat.procs.r
9455773 ± 3% +190.9% 27507213 vmstat.system.cs
2332672 ± 3% +135.7% 5497135 vmstat.system.in
8442 ±125% +361.9% 38993 ± 42% numa-meminfo.node0.Active
8394 ±126% +364.0% 38945 ± 42% numa-meminfo.node0.Active(anon)
3920452 ± 8% +59.6% 6258302 ± 13% numa-meminfo.node1.FilePages
4046956 ± 8% +61.7% 6543199 ± 12% numa-meminfo.node1.Inactive
4046855 ± 8% +61.7% 6542806 ± 12% numa-meminfo.node1.Inactive(anon)
809538 +23.4% 999251 numa-meminfo.node1.Mapped
5779147 ± 5% +44.2% 8333883 ± 11% numa-meminfo.node1.MemUsed
3797230 ± 8% +63.2% 6195756 ± 12% numa-meminfo.node1.Shmem
6594006 ± 4% +63.2% 10760383 meminfo.Cached
20357957 +20.8% 24592848 meminfo.Committed_AS
10202112 ± 4% +12.1% 11439445 ± 4% meminfo.DirectMap2M
4594955 ± 7% +90.9% 8772956 meminfo.Inactive
4594801 ± 7% +90.9% 8772510 meminfo.Inactive(anon)
1244823 +15.3% 1435693 meminfo.Mapped
10703248 ± 3% +39.2% 14903502 meminfo.Memused
3850091 ± 8% +108.2% 8016158 meminfo.Shmem
10828684 ± 3% +38.8% 15024943 meminfo.max_used_kB
191619 ± 2% -62.9% 71181 stress-ng.nanosleep.nanosec_sleep_overrun
27467749 ± 2% +141.1% 66219303 stress-ng.nanosleep.ops
457768 ± 2% +141.1% 1103623 stress-ng.nanosleep.ops_per_sec
34002509 -32.1% 23081269 ± 3% stress-ng.time.involuntary_context_switches
45135 ± 2% +10.2% 49751 stress-ng.time.minor_page_faults
4740 ± 2% +58.5% 7515 stress-ng.time.percent_of_cpu_this_job_got
2387 ± 2% +26.6% 3022 stress-ng.time.system_time
566.06 ± 4% +191.5% 1650 stress-ng.time.user_time
5.218e+08 ± 2% +140.9% 1.257e+09 stress-ng.time.voluntary_context_switches
2100 ±126% +364.0% 9746 ± 42% numa-vmstat.node0.nr_active_anon
2100 ±126% +364.0% 9746 ± 42% numa-vmstat.node0.nr_zone_active_anon
447538 ± 17% +163.9% 1181134 ± 29% numa-vmstat.node0.numa_hit
978909 ± 8% +59.9% 1564891 ± 13% numa-vmstat.node1.nr_file_pages
1010763 ± 8% +61.9% 1636024 ± 12% numa-vmstat.node1.nr_inactive_anon
201775 +23.8% 249732 numa-vmstat.node1.nr_mapped
948105 ± 8% +63.4% 1549255 ± 12% numa-vmstat.node1.nr_shmem
1010756 ± 8% +61.9% 1636022 ± 12% numa-vmstat.node1.nr_zone_inactive_anon
1926790 ± 6% +41.7% 2730098 ± 13% numa-vmstat.node1.numa_hit
1795451 ± 9% +45.6% 2614231 ± 13% numa-vmstat.node1.numa_local
23571016 ± 6% +1005.6% 2.606e+08 turbostat.C1
0.62 ± 7% +4.5 5.12 ± 4% turbostat.C1%
6.52 ± 6% +55.6% 10.15 ± 4% turbostat.CPU%c1
0.11 ± 3% +134.3% 0.26 turbostat.IPC
1.523e+08 ± 3% +135.8% 3.59e+08 turbostat.IRQ
4826320 ± 8% +1620.7% 83044122 ± 4% turbostat.POLL
1.18 ± 4% +3.2 4.39 ± 3% turbostat.POLL%
35.50 ± 2% +9.4% 38.83 ± 4% turbostat.PkgTmp
606.26 +11.4% 675.12 turbostat.PkgWatt
17.82 +8.1% 19.27 turbostat.RAMWatt
221604 +3.6% 229668 proc-vmstat.nr_anon_pages
6286339 -1.7% 6181379 proc-vmstat.nr_dirty_background_threshold
12588050 -1.7% 12377872 proc-vmstat.nr_dirty_threshold
1647119 ± 4% +63.3% 2690349 proc-vmstat.nr_file_pages
63240215 -1.7% 62188983 proc-vmstat.nr_free_pages
1147706 ± 7% +91.1% 2193365 proc-vmstat.nr_inactive_anon
310602 +15.6% 358915 proc-vmstat.nr_mapped
961140 ± 8% +108.5% 2004292 proc-vmstat.nr_shmem
40821 +5.6% 43093 proc-vmstat.nr_slab_reclaimable
1147706 ± 7% +91.1% 2193365 proc-vmstat.nr_zone_inactive_anon
307036 ± 6% +18.3% 363373 ± 6% proc-vmstat.numa_hint_faults
174792 ± 4% +47.6% 257908 ± 8% proc-vmstat.numa_hint_faults_local
2376244 ± 4% +64.7% 3912565 proc-vmstat.numa_hit
2148067 ± 5% +71.1% 3675698 proc-vmstat.numa_local
74658 ± 15% +32.3% 98789 ± 8% proc-vmstat.numa_pages_migrated
845358 ± 5% +16.6% 985918 ± 4% proc-vmstat.numa_pte_updates
2651749 ± 4% +58.7% 4208257 proc-vmstat.pgalloc_normal
1178205 +11.3% 1310935 proc-vmstat.pgfault
74658 ± 15% +32.3% 98789 ± 8% proc-vmstat.pgmigrate_success
1619572 ± 4% +31.0% 2121483 ± 2% sched_debug.cfs_rq:/.avg_vruntime.avg
1384229 ± 2% +21.2% 1678009 ± 6% sched_debug.cfs_rq:/.avg_vruntime.min
2.14 ± 4% -49.5% 1.08 ± 5% sched_debug.cfs_rq:/.h_nr_running.avg
1.59 ± 7% -23.0% 1.23 ± 10% sched_debug.cfs_rq:/.h_nr_running.stddev
755771 ± 4% +21.1% 914988 ± 7% sched_debug.cfs_rq:/.left_vruntime.stddev
1619572 ± 4% +31.0% 2121483 ± 2% sched_debug.cfs_rq:/.min_vruntime.avg
1384229 ± 2% +21.2% 1678009 ± 6% sched_debug.cfs_rq:/.min_vruntime.min
0.51 -13.3% 0.44 ± 2% sched_debug.cfs_rq:/.nr_running.avg
0.25 ± 6% +25.7% 0.32 ± 4% sched_debug.cfs_rq:/.nr_running.stddev
755771 ± 4% +21.1% 914988 ± 7% sched_debug.cfs_rq:/.right_vruntime.stddev
2098 -56.3% 916.79 ± 3% sched_debug.cfs_rq:/.runnable_avg.avg
5955 ± 6% -56.7% 2580 ± 51% sched_debug.cfs_rq:/.runnable_avg.max
789.15 ± 6% -55.7% 349.41 ± 20% sched_debug.cfs_rq:/.runnable_avg.stddev
307.62 +22.2% 375.91 sched_debug.cfs_rq:/.util_avg.avg
24.31 ± 5% -33.7% 16.12 ± 10% sched_debug.cfs_rq:/.util_est_enqueued.avg
519896 ± 8% -9.5% 470290 ± 2% sched_debug.cpu.avg_idle.avg
34.99 ± 9% -64.1% 12.55 ± 4% sched_debug.cpu.clock.stddev
64962 ± 54% -60.1% 25901 ± 75% sched_debug.cpu.max_idle_balance_cost.stddev
1.99 ± 3% -49.2% 1.01 ± 6% sched_debug.cpu.nr_running.avg
1.58 ± 8% -24.8% 1.19 ± 10% sched_debug.cpu.nr_running.stddev
1312627 ± 3% +189.6% 3801969 sched_debug.cpu.nr_switches.avg
1454886 ± 2% +180.8% 4084862 sched_debug.cpu.nr_switches.max
520509 ± 19% +118.1% 1135305 ± 6% sched_debug.cpu.nr_switches.min
107407 ± 27% +121.0% 237364 sched_debug.cpu.nr_switches.stddev
1.54 ± 16% -30.7% 1.06 ± 4% sched_debug.rt_rq:.rt_time.avg
344.38 ± 16% -30.7% 238.55 ± 4% sched_debug.rt_rq:.rt_time.max
22.96 ± 16% -30.7% 15.90 ± 4% sched_debug.rt_rq:.rt_time.stddev
21.30 -7.6% 19.68 perf-stat.i.MPKI
2.464e+10 ± 2% +116.4% 5.333e+10 perf-stat.i.branch-instructions
2.35 -0.2 2.15 perf-stat.i.branch-miss-rate%
5.179e+08 ± 3% +104.0% 1.056e+09 perf-stat.i.branch-misses
24433597 ± 4% +134.2% 57211897 perf-stat.i.cache-misses
2.292e+09 ± 3% +117.1% 4.977e+09 perf-stat.i.cache-references
9691167 ± 2% +192.2% 28317740 perf-stat.i.context-switches
5.50 ± 2% -58.0% 2.31 perf-stat.i.cpi
6.037e+11 -2.5% 5.886e+11 perf-stat.i.cpu-cycles
2763209 ± 6% +493.6% 16401859 perf-stat.i.cpu-migrations
27315 ± 5% -58.5% 11336 perf-stat.i.cycles-between-cache-misses
0.27 ± 3% +0.1 0.38 ± 5% perf-stat.i.dTLB-load-miss-rate%
75642414 ± 6% +234.9% 2.534e+08 ± 5% perf-stat.i.dTLB-load-misses
2.882e+10 ± 2% +133.7% 6.734e+10 perf-stat.i.dTLB-loads
0.09 +0.0 0.12 perf-stat.i.dTLB-store-miss-rate%
11815823 ± 4% +250.9% 41464964 perf-stat.i.dTLB-store-misses
1.431e+10 ± 2% +154.8% 3.647e+10 perf-stat.i.dTLB-stores
1.179e+11 ± 2% +125.2% 2.655e+11 perf-stat.i.instructions
0.22 ± 4% +107.2% 0.46 perf-stat.i.ipc
2.69 -2.5% 2.63 perf-stat.i.metric.GHz
168.78 ± 3% +191.3% 491.66 perf-stat.i.metric.K/sec
312.66 ± 2% +131.4% 723.39 perf-stat.i.metric.M/sec
83.64 -4.6 79.07 perf-stat.i.node-load-miss-rate%
8492785 ± 5% +91.4% 16253628 perf-stat.i.node-load-misses
1978671 ± 8% +156.8% 5080695 ± 3% perf-stat.i.node-loads
20.13 -5.6% 19.01 perf-stat.overall.MPKI
2.17 -0.2 2.01 perf-stat.overall.branch-miss-rate%
1.04 ± 4% +0.1 1.14 perf-stat.overall.cache-miss-rate%
5.27 ± 2% -57.4% 2.24 perf-stat.overall.cpi
25098 ± 4% -58.6% 10384 perf-stat.overall.cycles-between-cache-misses
0.27 ± 3% +0.1 0.38 ± 5% perf-stat.overall.dTLB-load-miss-rate%
0.08 +0.0 0.11 perf-stat.overall.dTLB-store-miss-rate%
0.19 ± 2% +134.7% 0.45 perf-stat.overall.ipc
78.22 ± 2% -3.3 74.96 perf-stat.overall.node-load-miss-rate%
2.351e+10 ± 2% +121.2% 5.2e+10 perf-stat.ps.branch-instructions
5.098e+08 ± 3% +105.1% 1.046e+09 perf-stat.ps.branch-misses
23627740 ± 4% +136.8% 55950974 perf-stat.ps.cache-misses
2.265e+09 ± 3% +117.5% 4.927e+09 perf-stat.ps.cache-references
9507437 ± 2% +194.8% 28026780 perf-stat.ps.context-switches
217084 +1.3% 219882 perf-stat.ps.cpu-clock
5.92e+11 -1.9% 5.809e+11 perf-stat.ps.cpu-cycles
2705311 ± 6% +500.2% 16238454 perf-stat.ps.cpu-migrations
73862253 ± 5% +238.6% 2.501e+08 ± 5% perf-stat.ps.dTLB-load-misses
2.76e+10 ± 2% +138.7% 6.588e+10 perf-stat.ps.dTLB-loads
11580338 ± 4% +254.4% 41039686 perf-stat.ps.dTLB-store-misses
1.367e+10 ± 2% +161.0% 3.568e+10 perf-stat.ps.dTLB-stores
1.125e+11 ± 2% +130.3% 2.592e+11 perf-stat.ps.instructions
17267 ± 2% +11.0% 19173 perf-stat.ps.minor-faults
8196664 ± 5% +94.6% 15947509 perf-stat.ps.node-load-misses
2277667 ± 7% +134.0% 5330072 ± 4% perf-stat.ps.node-loads
17267 ± 2% +11.0% 19173 perf-stat.ps.page-faults
217084 +1.3% 219882 perf-stat.ps.task-clock
7.044e+12 ± 2% +129.7% 1.618e+13 perf-stat.total.instructions
13.44 -13.4 0.00 perf-profile.calltrace.cycles-pp.update_cfs_group.dequeue_task_fair.__schedule.schedule.do_nanosleep
17.47 -12.7 4.76 perf-profile.calltrace.cycles-pp.dequeue_task_fair.__schedule.schedule.do_nanosleep.hrtimer_nanosleep
39.60 ± 2% -12.5 27.06 perf-profile.calltrace.cycles-pp.__schedule.schedule.do_nanosleep.hrtimer_nanosleep.common_nsleep
12.30 ± 2% -12.3 0.00 perf-profile.calltrace.cycles-pp.enqueue_task_fair.activate_task.ttwu_do_activate.try_to_wake_up.hrtimer_wakeup
39.94 -12.2 27.71 perf-profile.calltrace.cycles-pp.schedule.do_nanosleep.hrtimer_nanosleep.common_nsleep.__x64_sys_clock_nanosleep
12.32 ± 3% -12.2 0.17 ±141% perf-profile.calltrace.cycles-pp.activate_task.ttwu_do_activate.try_to_wake_up.hrtimer_wakeup.__hrtimer_run_queues
12.43 ± 2% -11.9 0.54 perf-profile.calltrace.cycles-pp.ttwu_do_activate.try_to_wake_up.hrtimer_wakeup.__hrtimer_run_queues.hrtimer_interrupt
43.43 -10.2 33.22 perf-profile.calltrace.cycles-pp.do_nanosleep.hrtimer_nanosleep.common_nsleep.__x64_sys_clock_nanosleep.do_syscall_64
20.08 -9.9 10.20 perf-profile.calltrace.cycles-pp.flush_smp_call_function_queue.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
44.17 -9.7 34.48 perf-profile.calltrace.cycles-pp.hrtimer_nanosleep.common_nsleep.__x64_sys_clock_nanosleep.do_syscall_64.entry_SYSCALL_64_after_hwframe
44.22 -9.6 34.58 perf-profile.calltrace.cycles-pp.common_nsleep.__x64_sys_clock_nanosleep.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_nanosleep
45.34 -8.8 36.51 perf-profile.calltrace.cycles-pp.__x64_sys_clock_nanosleep.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_nanosleep
10.46 ± 2% -8.0 2.44 perf-profile.calltrace.cycles-pp.enqueue_task_fair.activate_task.ttwu_do_activate.sched_ttwu_pending.__flush_smp_call_function_queue
7.82 ± 3% -7.8 0.00 perf-profile.calltrace.cycles-pp.update_cfs_group.enqueue_task_fair.activate_task.ttwu_do_activate.try_to_wake_up
44.86 -7.8 37.06 perf-profile.calltrace.cycles-pp.try_to_wake_up.hrtimer_wakeup.__hrtimer_run_queues.hrtimer_interrupt.__sysvec_apic_timer_interrupt
44.88 -7.8 37.12 perf-profile.calltrace.cycles-pp.hrtimer_wakeup.__hrtimer_run_queues.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt
45.94 -7.6 38.34 perf-profile.calltrace.cycles-pp.__hrtimer_run_queues.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt
12.12 -7.6 4.52 perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.flush_smp_call_function_queue.do_idle.cpu_startup_entry.start_secondary
12.05 -7.6 4.48 perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.flush_smp_call_function_queue.do_idle.cpu_startup_entry
11.96 -7.5 4.46 perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.flush_smp_call_function_queue.do_idle
11.95 -7.5 4.44 perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.flush_smp_call_function_queue
10.99 -7.1 3.91 perf-profile.calltrace.cycles-pp.available_idle_cpu.select_idle_cpu.select_idle_sibling.select_task_rq_fair.select_task_rq
47.61 -6.9 40.70 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_nanosleep
47.91 -6.6 41.28 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.clock_nanosleep
22.37 ± 3% -6.6 15.82 perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.finish_task_switch
22.41 ± 3% -6.5 15.86 perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.finish_task_switch.__schedule
6.41 ± 3% -5.7 0.70 perf-profile.calltrace.cycles-pp.update_load_avg.enqueue_entity.enqueue_task_fair.activate_task.ttwu_do_activate
6.40 ± 2% -4.7 1.70 perf-profile.calltrace.cycles-pp.enqueue_entity.enqueue_task_fair.activate_task.ttwu_do_activate.sched_ttwu_pending
16.05 ± 4% -4.7 11.35 perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.finish_task_switch.__schedule.schedule.do_nanosleep
15.91 ± 4% -4.7 11.22 perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.finish_task_switch.__schedule.schedule
4.88 ± 2% -4.3 0.55 perf-profile.calltrace.cycles-pp.sched_ttwu_pending.__flush_smp_call_function_queue.__sysvec_call_function_single.sysvec_call_function_single.asm_sysvec_call_function_single
16.36 ± 4% -3.8 12.61 perf-profile.calltrace.cycles-pp.finish_task_switch.__schedule.schedule.do_nanosleep.hrtimer_nanosleep
8.95 -3.6 5.34 perf-profile.calltrace.cycles-pp.finish_task_switch.__schedule.schedule_idle.do_idle.cpu_startup_entry
53.17 -3.3 49.84 perf-profile.calltrace.cycles-pp.clock_nanosleep
6.38 ± 2% -2.9 3.53 perf-profile.calltrace.cycles-pp.activate_task.ttwu_do_activate.sched_ttwu_pending.__flush_smp_call_function_queue.flush_smp_call_function_queue
6.41 ± 2% -2.8 3.63 perf-profile.calltrace.cycles-pp.ttwu_do_activate.sched_ttwu_pending.__flush_smp_call_function_queue.flush_smp_call_function_queue.do_idle
40.76 -2.4 38.40 perf-profile.calltrace.cycles-pp.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
40.82 -2.3 38.53 perf-profile.calltrace.cycles-pp.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
41.01 -2.3 38.72 perf-profile.calltrace.cycles-pp.secondary_startup_64_no_verify
40.82 -2.3 38.54 perf-profile.calltrace.cycles-pp.start_secondary.secondary_startup_64_no_verify
6.52 ± 2% -2.2 4.28 perf-profile.calltrace.cycles-pp.sched_ttwu_pending.__flush_smp_call_function_queue.flush_smp_call_function_queue.do_idle.cpu_startup_entry
6.81 -2.1 4.72 perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.finish_task_switch.__schedule.schedule_idle
6.84 -2.1 4.77 perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.finish_task_switch.__schedule.schedule_idle.do_idle
6.67 ± 2% -1.7 4.99 perf-profile.calltrace.cycles-pp.__flush_smp_call_function_queue.flush_smp_call_function_queue.do_idle.cpu_startup_entry.start_secondary
2.09 -1.6 0.51 perf-profile.calltrace.cycles-pp.asm_sysvec_call_function_single.finish_task_switch.__schedule.schedule_idle.do_idle
23.74 -1.0 22.79 perf-profile.calltrace.cycles-pp.select_idle_cpu.select_idle_sibling.select_task_rq_fair.select_task_rq.try_to_wake_up
10.21 -0.7 9.54 perf-profile.calltrace.cycles-pp.__schedule.schedule_idle.do_idle.cpu_startup_entry.start_secondary
10.43 -0.6 9.85 perf-profile.calltrace.cycles-pp.schedule_idle.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
1.07 ± 4% -0.4 0.67 perf-profile.calltrace.cycles-pp.__flush_smp_call_function_queue.__sysvec_call_function_single.sysvec_call_function_single.asm_sysvec_call_function_single.cpuidle_enter_state
1.07 ± 4% -0.4 0.69 perf-profile.calltrace.cycles-pp.__sysvec_call_function_single.sysvec_call_function_single.asm_sysvec_call_function_single.cpuidle_enter_state.cpuidle_enter
1.08 ± 4% -0.4 0.72 perf-profile.calltrace.cycles-pp.sysvec_call_function_single.asm_sysvec_call_function_single.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call
2.82 ± 2% -0.3 2.48 perf-profile.calltrace.cycles-pp.dequeue_entity.dequeue_task_fair.__schedule.schedule.do_nanosleep
1.08 ± 4% -0.3 0.80 perf-profile.calltrace.cycles-pp.asm_sysvec_call_function_single.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle
1.07 ± 6% -0.3 0.82 perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.nohz_run_idle_balance.do_idle.cpu_startup_entry.start_secondary
1.06 ± 6% -0.2 0.82 perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.nohz_run_idle_balance.do_idle.cpu_startup_entry
1.05 ± 6% -0.2 0.80 perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.nohz_run_idle_balance.do_idle
1.05 ± 6% -0.2 0.80 perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.nohz_run_idle_balance
0.70 ± 6% -0.2 0.54 perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.do_idle.cpu_startup_entry.start_secondary
0.67 ± 7% -0.2 0.52 perf-profile.calltrace.cycles-pp.sysvec_call_function_single.asm_sysvec_call_function_single.nohz_run_idle_balance.do_idle.cpu_startup_entry
0.70 ± 6% -0.2 0.54 perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
0.69 ± 6% -0.2 0.54 perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.do_idle.cpu_startup_entry
0.69 ± 6% -0.2 0.53 perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.do_idle
0.68 ± 7% -0.1 0.56 perf-profile.calltrace.cycles-pp.asm_sysvec_call_function_single.nohz_run_idle_balance.do_idle.cpu_startup_entry.start_secondary
0.57 ± 4% +0.1 0.67 perf-profile.calltrace.cycles-pp.pick_next_task_fair.__schedule.schedule_idle.do_idle.cpu_startup_entry
0.54 +0.2 0.74 perf-profile.calltrace.cycles-pp.prepare_task_switch.__schedule.schedule.do_nanosleep.hrtimer_nanosleep
0.87 +0.2 1.09 perf-profile.calltrace.cycles-pp.pick_next_task_fair.__schedule.schedule.do_nanosleep.hrtimer_nanosleep
0.76 +0.2 0.99 perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.hrtimer_active.hrtimer_try_to_cancel
0.78 +0.2 1.00 perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.hrtimer_active.hrtimer_try_to_cancel.do_nanosleep.hrtimer_nanosleep
0.76 +0.2 0.99 perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.hrtimer_active
0.77 +0.2 1.00 perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.hrtimer_active.hrtimer_try_to_cancel.do_nanosleep
0.80 +0.3 1.08 perf-profile.calltrace.cycles-pp.__hrtimer_start_range_ns.hrtimer_start_range_ns.do_nanosleep.hrtimer_nanosleep.common_nsleep
0.76 ± 2% +0.3 1.05 perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.do_nanosleep
0.76 ± 2% +0.3 1.05 perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.do_nanosleep.hrtimer_nanosleep
0.77 ± 2% +0.3 1.06 perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.do_nanosleep.hrtimer_nanosleep.common_nsleep
0.78 ± 3% +0.3 1.06 perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.do_nanosleep.hrtimer_nanosleep.common_nsleep.__x64_sys_clock_nanosleep
0.61 ± 2% +0.4 1.00 perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt
0.59 ± 2% +0.4 0.98 perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt
0.59 ± 2% +0.4 0.98 perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt
0.60 ± 2% +0.4 0.99 perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt
0.99 ± 5% +0.4 1.41 ± 4% perf-profile.calltrace.cycles-pp.stress_mwc32
0.00 +0.5 0.52 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.try_to_wake_up.hrtimer_wakeup.__hrtimer_run_queues
0.00 +0.5 0.53 perf-profile.calltrace.cycles-pp.tick_nohz_idle_enter.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
1.07 +0.5 1.61 perf-profile.calltrace.cycles-pp.hrtimer_start_range_ns.do_nanosleep.hrtimer_nanosleep.common_nsleep.__x64_sys_clock_nanosleep
0.00 +0.5 0.55 ± 2% perf-profile.calltrace.cycles-pp.set_task_cpu.try_to_wake_up.hrtimer_wakeup.__hrtimer_run_queues.hrtimer_interrupt
0.58 ± 2% +0.6 1.14 perf-profile.calltrace.cycles-pp._raw_spin_lock.__schedule.schedule.do_nanosleep.hrtimer_nanosleep
0.00 +0.6 0.58 perf-profile.calltrace.cycles-pp.llist_reverse_order.__flush_smp_call_function_queue.flush_smp_call_function_queue.do_idle.cpu_startup_entry
0.00 +0.6 0.58 ± 2% perf-profile.calltrace.cycles-pp.update_load_avg.dequeue_entity.dequeue_task_fair.__schedule.schedule
0.00 +0.6 0.60 perf-profile.calltrace.cycles-pp._raw_spin_lock.try_to_wake_up.hrtimer_wakeup.__hrtimer_run_queues.hrtimer_interrupt
0.87 ± 3% +0.6 1.46 ± 3% perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.stress_pthread_func
0.90 ± 4% +0.6 1.50 ± 3% perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.stress_pthread_func
0.87 ± 3% +0.6 1.47 ± 3% perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.stress_pthread_func
0.87 ± 3% +0.6 1.48 ± 3% perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.stress_pthread_func
0.00 +0.6 0.61 ± 2% perf-profile.calltrace.cycles-pp.menu_select.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary
0.00 +0.6 0.62 ± 11% perf-profile.calltrace.cycles-pp.__nanosleep
1.30 +0.6 1.92 perf-profile.calltrace.cycles-pp.switch_mm_irqs_off.__schedule.schedule.do_nanosleep.hrtimer_nanosleep
0.00 +0.6 0.64 perf-profile.calltrace.cycles-pp._raw_spin_lock.__schedule.schedule_idle.do_idle.cpu_startup_entry
0.00 +0.7 0.66 perf-profile.calltrace.cycles-pp.ttwu_queue_wakelist.try_to_wake_up.hrtimer_wakeup.__hrtimer_run_queues.hrtimer_interrupt
0.00 +0.7 0.67 perf-profile.calltrace.cycles-pp._copy_from_user.get_timespec64.__x64_sys_clock_nanosleep.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.00 +0.7 0.70 perf-profile.calltrace.cycles-pp.__switch_to.clock_nanosleep
0.58 ± 3% +0.7 1.28 perf-profile.calltrace.cycles-pp.__switch_to_asm.clock_nanosleep
0.00 +0.7 0.72 ± 3% perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.clock_gettime
0.00 +0.7 0.73 ± 3% perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.clock_gettime
0.00 +0.7 0.73 ± 3% perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.clock_gettime
0.00 +0.7 0.74 ± 3% perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.clock_gettime
0.00 +0.8 0.78 perf-profile.calltrace.cycles-pp.get_timespec64.__x64_sys_clock_nanosleep.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_nanosleep
0.00 +0.8 0.78 perf-profile.calltrace.cycles-pp.update_curr.dequeue_entity.dequeue_task_fair.__schedule.schedule
0.00 +0.8 0.83 ± 3% perf-profile.calltrace.cycles-pp.__update_idle_core.pick_next_task_idle.__schedule.schedule.do_nanosleep
0.00 +0.8 0.84 ± 3% perf-profile.calltrace.cycles-pp.pick_next_task_idle.__schedule.schedule.do_nanosleep.hrtimer_nanosleep
25.15 +0.9 26.05 perf-profile.calltrace.cycles-pp.select_idle_sibling.select_task_rq_fair.select_task_rq.try_to_wake_up.hrtimer_wakeup
0.00 +0.9 0.92 perf-profile.calltrace.cycles-pp.__switch_to_asm
2.94 ± 2% +0.9 3.87 perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.clock_nanosleep
3.09 ± 2% +0.9 4.02 perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.clock_nanosleep
2.96 ± 2% +0.9 3.89 perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.clock_nanosleep
2.95 ± 2% +0.9 3.88 perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.clock_nanosleep
1.44 +1.0 2.41 perf-profile.calltrace.cycles-pp.hrtimer_active.hrtimer_try_to_cancel.do_nanosleep.hrtimer_nanosleep.common_nsleep
1.40 ± 2% +1.0 2.41 perf-profile.calltrace.cycles-pp.restore_fpregs_from_fpstate.switch_fpu_return.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64
0.00 +1.0 1.04 perf-profile.calltrace.cycles-pp.sched_mm_cid_migrate_to.activate_task.ttwu_do_activate.sched_ttwu_pending.__flush_smp_call_function_queue
1.50 +1.1 2.58 perf-profile.calltrace.cycles-pp.hrtimer_try_to_cancel.do_nanosleep.hrtimer_nanosleep.common_nsleep.__x64_sys_clock_nanosleep
25.25 +1.1 26.32 perf-profile.calltrace.cycles-pp.select_task_rq_fair.select_task_rq.try_to_wake_up.hrtimer_wakeup.__hrtimer_run_queues
3.13 +1.1 4.22 perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state
3.13 +1.1 4.23 perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state.cpuidle_enter
25.31 +1.1 26.42 perf-profile.calltrace.cycles-pp.select_task_rq.try_to_wake_up.hrtimer_wakeup.__hrtimer_run_queues.hrtimer_interrupt
3.16 +1.1 4.31 perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call
3.17 +1.2 4.34 perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle
0.80 ± 8% +1.3 2.11 ± 11% perf-profile.calltrace.cycles-pp.clock_gettime
0.00 +1.4 1.44 perf-profile.calltrace.cycles-pp.switch_mm_irqs_off.__schedule.schedule_idle.do_idle.cpu_startup_entry
1.98 +1.5 3.45 perf-profile.calltrace.cycles-pp.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_nanosleep
2.04 +1.6 3.59 perf-profile.calltrace.cycles-pp.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_nanosleep
1.75 ± 2% +1.6 3.37 perf-profile.calltrace.cycles-pp.switch_fpu_return.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe
1.92 ± 4% +1.9 3.86 ± 2% perf-profile.calltrace.cycles-pp.stress_pthread_func
0.17 ±141% +2.2 2.39 perf-profile.calltrace.cycles-pp.intel_idle.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle
1.03 ± 2% +2.4 3.40 perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.poll_idle
1.03 ± 2% +2.4 3.40 perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.poll_idle.cpuidle_enter_state
1.04 ± 2% +2.4 3.45 perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.poll_idle.cpuidle_enter_state.cpuidle_enter
1.05 ± 2% +2.4 3.48 perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.poll_idle.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call
1.37 ± 3% +3.8 5.15 perf-profile.calltrace.cycles-pp.poll_idle.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle
0.00 +6.0 5.98 perf-profile.calltrace.cycles-pp.available_idle_cpu.select_idle_core.select_idle_cpu.select_idle_sibling.select_task_rq_fair
6.17 +6.8 12.96 perf-profile.calltrace.cycles-pp.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry
6.21 +6.9 13.08 perf-profile.calltrace.cycles-pp.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary
6.68 +7.6 14.28 perf-profile.calltrace.cycles-pp.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
1.25 +8.6 9.84 perf-profile.calltrace.cycles-pp.select_idle_core.select_idle_cpu.select_idle_sibling.select_task_rq_fair.select_task_rq
30.81 -29.8 1.00 ± 9% perf-profile.children.cycles-pp.update_cfs_group
26.74 -21.1 5.60 perf-profile.children.cycles-pp.enqueue_task_fair
27.45 -20.2 7.29 perf-profile.children.cycles-pp.activate_task
27.66 -20.0 7.70 perf-profile.children.cycles-pp.ttwu_do_activate
50.48 -13.5 37.02 perf-profile.children.cycles-pp.__schedule
17.49 -12.7 4.78 perf-profile.children.cycles-pp.dequeue_task_fair
40.51 ± 2% -12.5 27.97 perf-profile.children.cycles-pp.schedule
43.46 -10.2 33.28 perf-profile.children.cycles-pp.do_nanosleep
20.27 ± 2% -9.9 10.35 perf-profile.children.cycles-pp.flush_smp_call_function_queue
44.18 -9.7 34.50 perf-profile.children.cycles-pp.hrtimer_nanosleep
44.28 -9.6 34.72 perf-profile.children.cycles-pp.common_nsleep
45.35 -8.8 36.53 perf-profile.children.cycles-pp.__x64_sys_clock_nanosleep
12.11 -8.7 3.43 perf-profile.children.cycles-pp.enqueue_entity
11.22 -7.8 3.41 perf-profile.children.cycles-pp.update_load_avg
50.72 -7.7 42.98 perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
51.22 -7.7 43.50 perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
48.74 -7.7 41.08 perf-profile.children.cycles-pp.try_to_wake_up
25.96 ± 3% -7.7 18.30 perf-profile.children.cycles-pp.finish_task_switch
48.74 -7.6 41.10 perf-profile.children.cycles-pp.hrtimer_wakeup
49.92 -7.6 42.28 perf-profile.children.cycles-pp.__hrtimer_run_queues
50.13 -7.5 42.59 perf-profile.children.cycles-pp.hrtimer_interrupt
50.20 -7.5 42.70 perf-profile.children.cycles-pp.__sysvec_apic_timer_interrupt
47.73 -7.0 40.77 perf-profile.children.cycles-pp.do_syscall_64
48.02 -6.7 41.34 perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
12.52 ± 2% -5.9 6.60 perf-profile.children.cycles-pp.sched_ttwu_pending
12.86 ± 2% -5.1 7.80 perf-profile.children.cycles-pp.__flush_smp_call_function_queue
6.17 ± 2% -3.3 2.86 perf-profile.children.cycles-pp.__sysvec_call_function_single
6.20 ± 2% -3.2 2.99 perf-profile.children.cycles-pp.sysvec_call_function_single
53.30 -3.1 50.18 perf-profile.children.cycles-pp.clock_nanosleep
6.28 ± 2% -3.0 3.32 perf-profile.children.cycles-pp.asm_sysvec_call_function_single
40.96 -2.3 38.63 perf-profile.children.cycles-pp.do_idle
41.01 -2.3 38.72 perf-profile.children.cycles-pp.secondary_startup_64_no_verify
41.01 -2.3 38.72 perf-profile.children.cycles-pp.cpu_startup_entry
40.82 -2.3 38.54 perf-profile.children.cycles-pp.start_secondary
28.17 -1.6 26.60 perf-profile.children.cycles-pp.select_idle_cpu
10.48 -0.6 9.91 perf-profile.children.cycles-pp.schedule_idle
15.47 -0.5 14.92 perf-profile.children.cycles-pp.available_idle_cpu
0.65 ± 6% -0.3 0.31 ± 2% perf-profile.children.cycles-pp.exit_to_user_mode_loop
2.85 ± 2% -0.3 2.54 perf-profile.children.cycles-pp.dequeue_entity
0.41 ± 4% -0.3 0.14 ± 3% perf-profile.children.cycles-pp.__do_softirq
0.49 ± 4% -0.2 0.27 perf-profile.children.cycles-pp.__irq_exit_rcu
0.43 ± 3% -0.2 0.25 perf-profile.children.cycles-pp.tick_sched_handle
0.46 ± 6% -0.2 0.28 ± 2% perf-profile.children.cycles-pp.irqentry_exit_to_user_mode
0.42 ± 3% -0.2 0.24 ± 2% perf-profile.children.cycles-pp.update_process_times
0.44 ± 3% -0.2 0.27 perf-profile.children.cycles-pp.tick_sched_timer
0.38 ± 3% -0.2 0.22 ± 2% perf-profile.children.cycles-pp.scheduler_tick
0.46 ± 2% -0.1 0.32 perf-profile.children.cycles-pp._raw_spin_lock_irq
0.10 ± 3% -0.1 0.02 ± 99% perf-profile.children.cycles-pp.sched_clock_noinstr
0.74 -0.1 0.67 perf-profile.children.cycles-pp.update_rq_clock
0.10 ± 19% -0.1 0.04 ± 76% perf-profile.children.cycles-pp.record__mmap_read_evlist
0.10 ± 19% -0.1 0.04 ± 73% perf-profile.children.cycles-pp.perf_mmap__push
0.12 ± 15% -0.1 0.07 ± 18% perf-profile.children.cycles-pp.__libc_start_main
0.12 ± 15% -0.1 0.07 ± 18% perf-profile.children.cycles-pp.main
0.12 ± 15% -0.1 0.07 ± 18% perf-profile.children.cycles-pp.run_builtin
0.38 ± 3% -0.1 0.33 perf-profile.children.cycles-pp.get_nohz_timer_target
0.11 ± 19% -0.1 0.06 ± 49% perf-profile.children.cycles-pp.cmd_record
0.10 -0.1 0.05 perf-profile.children.cycles-pp.entity_eligible
0.06 +0.0 0.07 perf-profile.children.cycles-pp.__rb_insert_augmented
0.05 ± 7% +0.0 0.08 ± 6% perf-profile.children.cycles-pp.rebalance_domains
0.05 +0.0 0.07 ± 6% perf-profile.children.cycles-pp.put_prev_entity
0.13 ± 3% +0.0 0.15 ± 2% perf-profile.children.cycles-pp.perf_adjust_freq_unthr_context
0.13 ± 3% +0.0 0.15 ± 3% perf-profile.children.cycles-pp.perf_event_task_tick
0.05 ± 8% +0.0 0.08 perf-profile.children.cycles-pp.tracing_gen_ctx_irq_test
0.05 +0.0 0.08 perf-profile.children.cycles-pp._find_next_and_bit
0.16 +0.0 0.20 ± 2% perf-profile.children.cycles-pp.place_entity
0.06 ± 6% +0.0 0.10 ± 8% perf-profile.children.cycles-pp.mm_cid_get
0.08 ± 6% +0.0 0.11 ± 4% perf-profile.children.cycles-pp.perf_trace_buf_update
0.24 ± 3% +0.0 0.28 perf-profile.children.cycles-pp.call_cpuidle
0.14 +0.0 0.19 perf-profile.children.cycles-pp.avg_vruntime
0.00 +0.1 0.05 perf-profile.children.cycles-pp.hrtimer_get_next_event
0.00 +0.1 0.05 perf-profile.children.cycles-pp.save_fpregs_to_fpstate
0.00 +0.1 0.05 perf-profile.children.cycles-pp.__bitmap_and
0.00 +0.1 0.05 perf-profile.children.cycles-pp.idle_cpu
0.00 +0.1 0.05 perf-profile.children.cycles-pp.ct_kernel_enter
0.13 ± 2% +0.1 0.18 ± 2% perf-profile.children.cycles-pp.update_irq_load_avg
0.01 ±223% +0.1 0.06 ± 7% perf-profile.children.cycles-pp.perf_trace_sched_stat_runtime
0.23 +0.1 0.28 perf-profile.children.cycles-pp.__dequeue_entity
0.06 ± 7% +0.1 0.12 ± 3% perf-profile.children.cycles-pp.irqtime_account_irq
0.00 +0.1 0.06 perf-profile.children.cycles-pp.ct_kernel_exit_state
0.00 +0.1 0.06 perf-profile.children.cycles-pp.ct_idle_exit
0.08 ± 5% +0.1 0.14 ± 3% perf-profile.children.cycles-pp.resched_curr
0.01 ±223% +0.1 0.07 ± 21% perf-profile.children.cycles-pp.nanosleep@plt
0.00 +0.1 0.06 ± 7% perf-profile.children.cycles-pp.perf_exclude_event
0.00 +0.1 0.06 ± 7% perf-profile.children.cycles-pp.perf_trace_sched_migrate_task
0.13 ± 2% +0.1 0.20 ± 2% perf-profile.children.cycles-pp.lapic_next_deadline
0.00 +0.1 0.06 ± 7% perf-profile.children.cycles-pp.perf_trace_buf_alloc
0.20 ± 2% +0.1 0.26 perf-profile.children.cycles-pp.clockevents_program_event
0.02 ±141% +0.1 0.08 ± 5% perf-profile.children.cycles-pp.rb_next
0.00 +0.1 0.07 ± 7% perf-profile.children.cycles-pp.__update_load_avg_blocked_se
0.00 +0.1 0.07 ± 5% perf-profile.children.cycles-pp.local_clock_noinstr
0.00 +0.1 0.07 ± 9% perf-profile.children.cycles-pp.cpuidle_governor_latency_req
0.08 +0.1 0.16 perf-profile.children.cycles-pp.rcu_note_context_switch
0.22 ± 2% +0.1 0.30 perf-profile.children.cycles-pp.pick_eevdf
0.00 +0.1 0.08 ± 4% perf-profile.children.cycles-pp.perf_trace_sched_switch
0.15 +0.1 0.24 perf-profile.children.cycles-pp.hrtimer_init_sleeper
0.08 +0.1 0.17 perf-profile.children.cycles-pp.rb_erase
0.00 +0.1 0.09 perf-profile.children.cycles-pp.__hrtimer_next_event_base
0.00 +0.1 0.10 ± 5% perf-profile.children.cycles-pp.__x2apic_send_IPI_dest
0.06 ± 7% +0.1 0.16 ± 2% perf-profile.children.cycles-pp.native_apic_msr_eoi_write
0.00 +0.1 0.10 perf-profile.children.cycles-pp.__list_add_valid
0.00 +0.1 0.10 perf-profile.children.cycles-pp.syscall_exit_to_user_mode_prepare
0.00 +0.1 0.10 perf-profile.children.cycles-pp.__list_del_entry_valid
0.10 ± 4% +0.1 0.20 perf-profile.children.cycles-pp.__hrtimer_init
0.00 +0.1 0.10 ± 4% perf-profile.children.cycles-pp.hrtimer_next_event_without
0.45 ± 6% +0.1 0.55 perf-profile.children.cycles-pp.tick_nohz_idle_enter
0.07 +0.1 0.18 ± 2% perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack
0.00 +0.1 0.11 perf-profile.children.cycles-pp.syscall_return_via_sysret
0.00 +0.1 0.11 perf-profile.children.cycles-pp.get_next_timer_interrupt
0.00 +0.1 0.11 perf-profile.children.cycles-pp._raw_spin_unlock_irqrestore
0.08 +0.1 0.19 ± 2% perf-profile.children.cycles-pp.update_entity_lag
0.10 ± 4% +0.1 0.22 ± 2% perf-profile.children.cycles-pp.rb_insert_color
0.00 +0.1 0.12 perf-profile.children.cycles-pp.__calc_delta
0.01 ±223% +0.1 0.14 perf-profile.children.cycles-pp.__rdgsbase_inactive
0.24 +0.1 0.38 perf-profile.children.cycles-pp.os_xsave
0.06 +0.1 0.21 ± 2% perf-profile.children.cycles-pp.tick_nohz_idle_exit
0.17 ± 2% +0.1 0.32 perf-profile.children.cycles-pp.check_preempt_curr
0.07 ± 5% +0.1 0.22 perf-profile.children.cycles-pp.__wrgsbase_inactive
0.48 +0.1 0.62 perf-profile.children.cycles-pp.update_rq_clock_task
0.00 +0.2 0.15 ± 2% perf-profile.children.cycles-pp.newidle_balance
0.00 +0.2 0.16 ± 2% perf-profile.children.cycles-pp.tick_nohz_next_event
0.07 ± 5% +0.2 0.24 ± 6% perf-profile.children.cycles-pp.cpuacct_charge
0.23 ± 3% +0.2 0.41 perf-profile.children.cycles-pp.native_irq_return_iret
0.14 +0.2 0.32 perf-profile.children.cycles-pp.stress_mwc32modn
0.11 ± 4% +0.2 0.30 perf-profile.children.cycles-pp.read_tsc
0.22 +0.2 0.41 perf-profile.children.cycles-pp.ktime_get
0.24 ± 2% +0.2 0.43 perf-profile.children.cycles-pp.perf_tp_event
0.09 ± 4% +0.2 0.28 perf-profile.children.cycles-pp.attach_entity_load_avg
0.50 +0.2 0.70 perf-profile.children.cycles-pp.sched_clock_cpu
0.44 +0.2 0.64 perf-profile.children.cycles-pp.__update_load_avg_se
0.14 ± 3% +0.2 0.34 perf-profile.children.cycles-pp.cpus_share_cache
0.16 ± 2% +0.2 0.36 perf-profile.children.cycles-pp.update_min_vruntime
0.34 +0.2 0.54 perf-profile.children.cycles-pp.sched_clock
0.16 ± 2% +0.2 0.38 perf-profile.children.cycles-pp.timerqueue_del
0.14 ± 3% +0.2 0.36 perf-profile.children.cycles-pp.__entry_text_start
0.15 +0.2 0.38 perf-profile.children.cycles-pp.syscall_enter_from_user_mode
0.39 ± 2% +0.2 0.63 perf-profile.children.cycles-pp.perf_trace_sched_wakeup_template
0.21 +0.2 0.45 perf-profile.children.cycles-pp.__enqueue_entity
0.27 +0.2 0.52 perf-profile.children.cycles-pp.native_sched_clock
0.06 ± 7% +0.3 0.32 ± 2% perf-profile.children.cycles-pp.tick_nohz_get_sleep_length
0.00 +0.3 0.25 perf-profile.children.cycles-pp.asm_sysvec_reschedule_ipi
0.17 ± 2% +0.3 0.44 perf-profile.children.cycles-pp.timerqueue_add
0.40 ± 3% +0.3 0.68 perf-profile.children.cycles-pp._copy_from_user
0.19 ± 3% +0.3 0.48 perf-profile.children.cycles-pp.enqueue_hrtimer
0.81 +0.3 1.10 perf-profile.children.cycles-pp.__hrtimer_start_range_ns
0.35 ± 16% +0.3 0.66 ± 11% perf-profile.children.cycles-pp.__nanosleep
0.09 ± 5% +0.3 0.41 perf-profile.children.cycles-pp.call_function_single_prep_ipi
0.44 ± 3% +0.3 0.78 perf-profile.children.cycles-pp.get_timespec64
0.22 ± 2% +0.4 0.57 perf-profile.children.cycles-pp.___perf_sw_event
0.77 +0.4 1.14 perf-profile.children.cycles-pp.reweight_entity
1.49 ± 2% +0.4 1.86 perf-profile.children.cycles-pp.pick_next_task_fair
1.00 ± 5% +0.4 1.43 ± 4% perf-profile.children.cycles-pp.stress_mwc32
0.30 ± 15% +0.5 0.76 ± 9% perf-profile.children.cycles-pp.clock_gettime@plt
0.14 ± 7% +0.5 0.62 perf-profile.children.cycles-pp.menu_select
0.57 +0.5 1.06 perf-profile.children.cycles-pp.__update_load_avg_cfs_rq
1.08 +0.5 1.62 perf-profile.children.cycles-pp.hrtimer_start_range_ns
0.66 +0.6 1.26 perf-profile.children.cycles-pp.prepare_task_switch
1.25 +0.6 1.86 perf-profile.children.cycles-pp._find_next_bit
0.21 ± 5% +0.7 0.87 perf-profile.children.cycles-pp.llist_reverse_order
0.52 +0.7 1.23 perf-profile.children.cycles-pp.update_curr
0.05 ± 7% +0.8 0.83 ± 3% perf-profile.children.cycles-pp.__update_idle_core
0.06 ± 9% +0.8 0.85 ± 2% perf-profile.children.cycles-pp.pick_next_task_idle
0.81 ± 8% +0.8 1.65 ± 3% perf-profile.children.cycles-pp.clock_gettime
0.68 ± 2% +0.9 1.63 perf-profile.children.cycles-pp.sched_mm_cid_migrate_to
0.07 ± 6% +1.0 1.04 ± 2% perf-profile.children.cycles-pp.remove_entity_load_avg
0.58 ± 2% +1.0 1.55 perf-profile.children.cycles-pp.__switch_to
1.45 +1.0 2.43 perf-profile.children.cycles-pp.hrtimer_active
1.41 ± 2% +1.0 2.42 perf-profile.children.cycles-pp.restore_fpregs_from_fpstate
0.34 ± 6% +1.1 1.40 perf-profile.children.cycles-pp.llist_add_batch
1.51 +1.1 2.58 perf-profile.children.cycles-pp.hrtimer_try_to_cancel
0.30 ± 4% +1.2 1.54 perf-profile.children.cycles-pp.migrate_task_rq_fair
1.30 +1.2 2.55 perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
2.44 +1.3 3.74 perf-profile.children.cycles-pp.exit_to_user_mode_prepare
0.44 ± 6% +1.4 1.85 perf-profile.children.cycles-pp.__smp_call_single_queue
0.70 ± 3% +1.5 2.23 perf-profile.children.cycles-pp.__switch_to_asm
2.04 +1.6 3.61 perf-profile.children.cycles-pp.syscall_exit_to_user_mode
1.77 ± 2% +1.6 3.39 perf-profile.children.cycles-pp.switch_fpu_return
0.41 ± 2% +1.6 2.03 perf-profile.children.cycles-pp._raw_spin_lock_irqsave
0.21 ± 3% +1.7 1.91 ± 5% perf-profile.children.cycles-pp.__bitmap_andnot
0.43 ± 4% +1.8 2.18 perf-profile.children.cycles-pp.set_task_cpu
0.61 ± 5% +1.9 2.48 perf-profile.children.cycles-pp.ttwu_queue_wakelist
1.56 +1.9 3.42
perf-profile.children.cycles-pp.switch_mm_irqs_off 0.49 ± 3% +1.9 2.40 perf-profile.children.cycles-pp.intel_idle 1.94 ± 4% +2.0 3.92 ± 3% perf-profile.children.cycles-pp.stress_pthread_func 2.48 +2.2 4.65 perf-profile.children.cycles-pp._raw_spin_lock 1.38 ± 3% +3.8 5.23 perf-profile.children.cycles-pp.poll_idle 6.23 +6.9 13.14 perf-profile.children.cycles-pp.cpuidle_enter 6.22 +6.9 13.12 perf-profile.children.cycles-pp.cpuidle_enter_state 6.72 ± 2% +7.6 14.36 perf-profile.children.cycles-pp.cpuidle_idle_call 2.32 ± 2% +11.1 13.44 perf-profile.children.cycles-pp.select_idle_core 30.80 -29.8 0.98 ± 10% perf-profile.self.cycles-pp.update_cfs_group 10.11 -8.9 1.18 perf-profile.self.cycles-pp.update_load_avg 11.11 -4.9 6.26 perf-profile.self.cycles-pp.select_idle_cpu 15.35 -0.6 14.78 perf-profile.self.cycles-pp.available_idle_cpu 0.46 ± 2% -0.1 0.32 perf-profile.self.cycles-pp._raw_spin_lock_irq 0.38 ± 3% -0.1 0.33 perf-profile.self.cycles-pp.get_nohz_timer_target 0.10 ± 4% -0.0 0.05 perf-profile.self.cycles-pp.entity_eligible 0.12 ± 3% +0.0 0.13 perf-profile.self.cycles-pp.ktime_get 0.10 ± 3% +0.0 0.12 perf-profile.self.cycles-pp.__hrtimer_start_range_ns 0.07 +0.0 0.09 ± 4% perf-profile.self.cycles-pp.select_task_rq 0.13 ± 2% +0.0 0.15 ± 6% perf-profile.self.cycles-pp.sched_clock_cpu 0.05 ± 7% +0.0 0.08 ± 4% perf-profile.self.cycles-pp.tracing_gen_ctx_irq_test 0.06 ± 6% +0.0 0.09 ± 4% perf-profile.self.cycles-pp.perf_adjust_freq_unthr_context 0.06 ± 9% +0.0 0.09 ± 7% perf-profile.self.cycles-pp.mm_cid_get 0.15 ± 3% +0.0 0.19 perf-profile.self.cycles-pp.__dequeue_entity 0.06 ± 6% +0.0 0.10 ± 4% perf-profile.self.cycles-pp.ttwu_do_activate 0.12 ± 3% +0.0 0.17 perf-profile.self.cycles-pp.update_irq_load_avg 0.01 ±223% +0.0 0.06 ± 6% perf-profile.self.cycles-pp.perf_trace_sched_stat_runtime 0.00 +0.1 0.05 perf-profile.self.cycles-pp.syscall_exit_to_user_mode_prepare 0.00 +0.1 0.05 perf-profile.self.cycles-pp.__bitmap_and 0.00 +0.1 0.05 
perf-profile.self.cycles-pp.idle_cpu 0.00 +0.1 0.05 ± 7% perf-profile.self.cycles-pp.cpu_startup_entry 0.00 +0.1 0.05 ± 8% perf-profile.self.cycles-pp.perf_exclude_event 0.00 +0.1 0.06 ± 8% perf-profile.self.cycles-pp.syscall_enter_from_user_mode 0.00 +0.1 0.06 ± 11% perf-profile.self.cycles-pp.perf_trace_sched_migrate_task 0.02 ±141% +0.1 0.08 ± 6% perf-profile.self.cycles-pp._find_next_and_bit 0.05 +0.1 0.11 perf-profile.self.cycles-pp.check_preempt_curr 0.00 +0.1 0.06 perf-profile.self.cycles-pp.update_entity_lag 0.00 +0.1 0.06 perf-profile.self.cycles-pp.__update_load_avg_blocked_se 0.00 +0.1 0.06 perf-profile.self.cycles-pp.ct_kernel_exit_state 0.08 +0.1 0.14 perf-profile.self.cycles-pp.resched_curr 0.00 +0.1 0.06 ± 6% perf-profile.self.cycles-pp.remove_entity_load_avg 0.00 +0.1 0.06 ± 6% perf-profile.self.cycles-pp.activate_task 0.13 ± 2% +0.1 0.20 ± 2% perf-profile.self.cycles-pp.lapic_next_deadline 0.03 ± 70% +0.1 0.10 perf-profile.self.cycles-pp.set_next_entity 0.00 +0.1 0.07 perf-profile.self.cycles-pp.hrtimer_try_to_cancel 0.00 +0.1 0.07 perf-profile.self.cycles-pp.get_timespec64 0.00 +0.1 0.07 perf-profile.self.cycles-pp.rb_next 0.00 +0.1 0.07 ± 5% perf-profile.self.cycles-pp.perf_trace_sched_switch 0.33 +0.1 0.40 perf-profile.self.cycles-pp.update_rq_clock 0.06 +0.1 0.13 ± 3% perf-profile.self.cycles-pp.__hrtimer_init 0.09 ± 5% +0.1 0.17 perf-profile.self.cycles-pp.avg_vruntime 0.08 ± 6% +0.1 0.16 ± 3% perf-profile.self.cycles-pp.rcu_note_context_switch 0.08 +0.1 0.16 perf-profile.self.cycles-pp.rb_erase 0.00 +0.1 0.08 perf-profile.self.cycles-pp._raw_spin_unlock_irqrestore 0.10 ± 4% +0.1 0.18 ± 2% perf-profile.self.cycles-pp.select_task_rq_fair 0.00 +0.1 0.08 ± 4% perf-profile.self.cycles-pp.__hrtimer_next_event_base 0.05 ± 8% +0.1 0.14 perf-profile.self.cycles-pp.do_syscall_64 0.00 +0.1 0.09 ± 4% perf-profile.self.cycles-pp.__list_add_valid 0.09 ± 4% +0.1 0.18 perf-profile.self.cycles-pp.__hrtimer_run_queues 0.07 +0.1 0.16 
perf-profile.self.cycles-pp.common_nsleep 0.00 +0.1 0.09 perf-profile.self.cycles-pp.__list_del_entry_valid 0.35 ± 2% +0.1 0.44 perf-profile.self.cycles-pp.update_rq_clock_task 0.00 +0.1 0.10 ± 5% perf-profile.self.cycles-pp.__x2apic_send_IPI_dest 0.00 +0.1 0.10 ± 3% perf-profile.self.cycles-pp.__entry_text_start 0.06 ± 6% +0.1 0.16 ± 2% perf-profile.self.cycles-pp.native_apic_msr_eoi_write 0.00 +0.1 0.10 perf-profile.self.cycles-pp.schedule_idle 0.07 ± 5% +0.1 0.17 perf-profile.self.cycles-pp.entry_SYSRETQ_unsafe_stack 0.07 ± 5% +0.1 0.17 ± 2% perf-profile.self.cycles-pp.timerqueue_del 0.00 +0.1 0.11 perf-profile.self.cycles-pp.syscall_return_via_sysret 0.10 ± 5% +0.1 0.21 ± 2% perf-profile.self.cycles-pp.rb_insert_color 0.00 +0.1 0.11 ± 3% perf-profile.self.cycles-pp.__calc_delta 0.00 +0.1 0.12 ± 4% perf-profile.self.cycles-pp.call_cpuidle 0.12 ± 3% +0.1 0.25 ± 2% perf-profile.self.cycles-pp.pick_eevdf 0.00 +0.1 0.13 perf-profile.self.cycles-pp.__rdgsbase_inactive 0.15 ± 3% +0.1 0.28 perf-profile.self.cycles-pp.pick_next_task_fair 0.06 ± 8% +0.1 0.19 perf-profile.self.cycles-pp.hrtimer_start_range_ns 0.16 ± 2% +0.1 0.30 perf-profile.self.cycles-pp.do_nanosleep 0.09 ± 5% +0.1 0.23 perf-profile.self.cycles-pp.stress_mwc32modn 0.07 ± 5% +0.1 0.21 perf-profile.self.cycles-pp.__wrgsbase_inactive 0.24 +0.1 0.38 perf-profile.self.cycles-pp.os_xsave 0.00 +0.1 0.14 ± 2% perf-profile.self.cycles-pp.newidle_balance 0.13 +0.2 0.28 perf-profile.self.cycles-pp.try_to_wake_up 0.17 ± 2% +0.2 0.32 perf-profile.self.cycles-pp.perf_tp_event 0.00 +0.2 0.15 ± 3% perf-profile.self.cycles-pp.cpuidle_idle_call 0.06 ± 6% +0.2 0.21 perf-profile.self.cycles-pp.menu_select 0.07 ± 5% +0.2 0.23 ± 2% perf-profile.self.cycles-pp.timerqueue_add 0.07 ± 5% +0.2 0.23 ± 8% perf-profile.self.cycles-pp.cpuacct_charge 0.13 ± 13% +0.2 0.31 ± 15% perf-profile.self.cycles-pp.clock_gettime 0.23 ± 3% +0.2 0.41 perf-profile.self.cycles-pp.native_irq_return_iret 0.11 ± 3% +0.2 0.29 ± 2% 
perf-profile.self.cycles-pp.read_tsc 0.16 ± 12% +0.2 0.34 ± 10% perf-profile.self.cycles-pp.__nanosleep 0.16 ± 2% +0.2 0.35 perf-profile.self.cycles-pp.update_min_vruntime 0.09 ± 4% +0.2 0.28 perf-profile.self.cycles-pp.attach_entity_load_avg 0.41 +0.2 0.60 perf-profile.self.cycles-pp.__update_load_avg_se 0.16 ± 3% +0.2 0.35 perf-profile.self.cycles-pp.schedule 0.13 ± 3% +0.2 0.33 perf-profile.self.cycles-pp.cpus_share_cache 0.13 ± 5% +0.2 0.33 perf-profile.self.cycles-pp.__flush_smp_call_function_queue 0.21 ± 2% +0.2 0.42 perf-profile.self.cycles-pp._copy_from_user 0.00 +0.2 0.20 ± 2% perf-profile.self.cycles-pp.cpuidle_enter_state 0.21 ± 5% +0.2 0.42 perf-profile.self.cycles-pp.migrate_task_rq_fair 0.20 ± 2% +0.2 0.44 perf-profile.self.cycles-pp.__enqueue_entity 0.57 ± 5% +0.3 0.82 ± 3% perf-profile.self.cycles-pp.stress_mwc32 0.26 +0.3 0.52 perf-profile.self.cycles-pp.dequeue_entity 0.13 ± 6% +0.3 0.39 perf-profile.self.cycles-pp.flush_smp_call_function_queue 0.23 ± 2% +0.3 0.50 perf-profile.self.cycles-pp.native_sched_clock 0.30 ± 4% +0.3 0.58 perf-profile.self.cycles-pp.__x64_sys_clock_nanosleep 0.29 ± 3% +0.3 0.57 perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe 0.30 +0.3 0.59 perf-profile.self.cycles-pp.hrtimer_nanosleep 0.17 ± 2% +0.3 0.47 perf-profile.self.cycles-pp.___perf_sw_event 0.13 ± 7% +0.3 0.44 perf-profile.self.cycles-pp.ttwu_queue_wakelist 0.08 ± 7% +0.3 0.40 perf-profile.self.cycles-pp.sched_ttwu_pending 0.09 ± 5% +0.3 0.40 perf-profile.self.cycles-pp.call_function_single_prep_ipi 0.22 ± 13% +0.3 0.56 ± 10% perf-profile.self.cycles-pp.clock_gettime@plt 0.28 ± 2% +0.3 0.62 ± 3% perf-profile.self.cycles-pp.update_curr 0.00 +0.3 0.35 perf-profile.self.cycles-pp.do_idle 0.58 +0.4 0.94 perf-profile.self.cycles-pp.reweight_entity 0.54 +0.5 1.00 perf-profile.self.cycles-pp.prepare_task_switch 0.11 ± 6% +0.5 0.57 ± 3% perf-profile.self.cycles-pp.set_task_cpu 0.40 ± 2% +0.5 0.87 perf-profile.self.cycles-pp.enqueue_entity 0.07 ± 10% +0.5 0.54 
perf-profile.self.cycles-pp.nohz_run_idle_balance 0.55 +0.5 1.04 perf-profile.self.cycles-pp.__update_load_avg_cfs_rq 1.12 +0.5 1.66 perf-profile.self.cycles-pp._find_next_bit 0.23 ± 2% +0.5 0.77 ± 2% perf-profile.self.cycles-pp.enqueue_task_fair 0.64 +0.6 1.23 perf-profile.self.cycles-pp.dequeue_task_fair 0.36 ± 2% +0.6 0.96 perf-profile.self.cycles-pp.switch_fpu_return 0.21 ± 6% +0.7 0.87 perf-profile.self.cycles-pp.llist_reverse_order 0.00 +0.7 0.71 ± 3% perf-profile.self.cycles-pp.__update_idle_core 0.68 ± 3% +0.7 1.42 perf-profile.self.cycles-pp.hrtimer_active 0.47 ± 2% +0.8 1.26 perf-profile.self.cycles-pp.finish_task_switch 1.17 +0.9 2.09 perf-profile.self.cycles-pp._raw_spin_lock 0.68 ± 2% +0.9 1.62 perf-profile.self.cycles-pp.sched_mm_cid_migrate_to 0.57 ± 2% +0.9 1.52 perf-profile.self.cycles-pp.__switch_to 1.41 ± 2% +1.0 2.42 perf-profile.self.cycles-pp.restore_fpregs_from_fpstate 0.34 ± 6% +1.1 1.40 perf-profile.self.cycles-pp.llist_add_batch 0.72 ± 5% +1.1 1.80 ± 2% perf-profile.self.cycles-pp.clock_nanosleep 1.30 +1.2 2.55 perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath 0.96 ± 6% +1.3 2.28 ± 3% perf-profile.self.cycles-pp.stress_pthread_func 0.06 ± 11% +1.4 1.44 perf-profile.self.cycles-pp.poll_idle 0.54 +1.5 2.02 ± 2% perf-profile.self.cycles-pp.select_idle_sibling 0.69 ± 3% +1.5 2.22 perf-profile.self.cycles-pp.__switch_to_asm 0.38 ± 3% +1.6 1.97 perf-profile.self.cycles-pp._raw_spin_lock_irqsave 0.20 ± 5% +1.7 1.86 ± 5% perf-profile.self.cycles-pp.__bitmap_andnot 1.22 +1.8 3.06 perf-profile.self.cycles-pp.__schedule 1.54 +1.8 3.39 perf-profile.self.cycles-pp.switch_mm_irqs_off 0.49 ± 3% +1.9 2.40 perf-profile.self.cycles-pp.intel_idle 0.79 ± 2% +1.9 2.74 perf-profile.self.cycles-pp.select_idle_core *************************************************************************************************** lkp-spr-r02: 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480CTDX (Sapphire Rapids) with 256G memory 
=========================================================================================
class/compiler/cpufreq_governor/kconfig/nr_threads/rootfs/sc_pid_max/tbox_group/test/testcase/testtime:
  scheduler/gcc-12/performance/x86_64-rhel-8.3/100%/debian-11.1-x86_64-20220510.cgz/4194304/lkp-spr-r02/sem/stress-ng/60s

commit:
  63304558ba ("sched/eevdf: Curb wakeup-preemption")
  0a24d7afed ("sched/fair: ratelimit update to tg->load_avg")

63304558ba5dcaaf 0a24d7afed5c3c59ee212782f9c
---------------- ---------------------------
         %stddev     %change         %stddev
             \          |                \
11351 ± 3% +46.2% 16601 uptime.idle
3.846e+09 ± 2% +132.5% 8.942e+09 cpuidle..time
1.809e+08 ± 4% +188.7% 5.221e+08 cpuidle..usage
1748151 ± 30% +96.9% 3442946 ± 10% numa-numastat.node1.local_node
1884868 ± 27% +91.3% 3605026 ± 10% numa-numastat.node1.numa_hit
1396 ± 21% +31.4% 1834 ± 9% perf-c2c.DRAM.local
77083 ± 8% +25.5% 96721 ± 5% perf-c2c.HITM.local
78543 ± 8% +25.0% 98216 ± 5% perf-c2c.HITM.total
38.23 ± 3% +42.9 81.16 mpstat.cpu.all.idle%
32.27 ± 2% -23.0 9.27 mpstat.cpu.all.irq%
0.43 ± 6% -0.4 0.08 ± 3% mpstat.cpu.all.soft%
20.22 -14.8 5.47 mpstat.cpu.all.sys%
8.85 -4.8 4.01 ± 2% mpstat.cpu.all.usr%
39.67 ± 2% +104.2% 81.00 vmstat.cpu.id
50.83 ± 2% -72.5% 14.00 vmstat.cpu.sy
7584537 ± 4% +62.1% 12297640 ± 2% vmstat.memory.cache
141.17 -59.0% 57.83 ± 4% vmstat.procs.r
7826695 ± 4% +127.3% 17789598 vmstat.system.cs
3068646 ± 2% +34.3% 4120105 vmstat.system.in
84090 ± 31% +62.5% 136620 ± 9% numa-meminfo.node1.Active
83901 ± 31% +62.7% 136529 ± 9% numa-meminfo.node1.Active(anon)
4332711 ± 24% +100.1% 8670667 ± 10% numa-meminfo.node1.FilePages
4148689 ± 32% +115.2% 8929944 ± 10% numa-meminfo.node1.Inactive
4148234 ± 32% +115.3% 8929838 ± 10% numa-meminfo.node1.Inactive(anon)
755396 ± 2% +35.6% 1024533 numa-meminfo.node1.Mapped
6280993 ± 18% +68.2% 10562500 ± 8% numa-meminfo.node1.MemUsed
3791150 ± 35% +126.9% 8600616 ± 10% numa-meminfo.node1.Shmem
5.324e+08 ± 3% +120.7% 1.175e+09 stress-ng.sem.ops
8872696 ± 3% +120.7% 19585942 stress-ng.sem.ops_per_sec
36203483 -44.3% 20170299 stress-ng.time.involuntary_context_switches
41804 -5.5% 39488 stress-ng.time.minor_page_faults
7970 -44.6% 4412 stress-ng.time.percent_of_cpu_this_job_got
3548 -53.3% 1658 stress-ng.time.system_time
1419 -23.4% 1087 stress-ng.time.user_time
2.657e+08 ± 3% +120.7% 5.864e+08 stress-ng.time.voluntary_context_switches
21077 ± 31% +62.0% 34153 ± 9% numa-vmstat.node1.nr_active_anon
1083493 ± 24% +100.1% 2167954 ± 10% numa-vmstat.node1.nr_file_pages
1037164 ± 32% +115.3% 2232740 ± 10% numa-vmstat.node1.nr_inactive_anon
188463 ± 2% +36.0% 256345 numa-vmstat.node1.nr_mapped
948102 ± 35% +126.8% 2150441 ± 10% numa-vmstat.node1.nr_shmem
21077 ± 31% +62.0% 34153 ± 9% numa-vmstat.node1.nr_zone_active_anon
1037160 ± 32% +115.3% 2232735 ± 10% numa-vmstat.node1.nr_zone_inactive_anon
1884811 ± 27% +91.3% 3605221 ± 10% numa-vmstat.node1.numa_hit
1748094 ± 31% +97.0% 3443141 ± 10% numa-vmstat.node1.numa_local
113047 ± 12% +32.7% 150063 ± 4% meminfo.Active
112858 ± 12% +32.8% 149930 ± 4% meminfo.Active(anon)
7383553 ± 4% +63.3% 12055933 ± 2% meminfo.Cached
11049975 ± 3% +42.2% 15716475 meminfo.Committed_AS
5392328 ± 6% +86.8% 10072875 ± 2% meminfo.Inactive
5391867 ± 6% +86.8% 10072720 ± 2% meminfo.Inactive(anon)
1185651 +20.6% 1430121 meminfo.Mapped
11425070 ± 3% +40.8% 16088748 meminfo.Memused
4639309 ± 7% +100.7% 9312054 ± 3% meminfo.Shmem
11531671 ± 3% +41.0% 16259717 meminfo.max_used_kB
2128 -46.4% 1141 turbostat.Avg_MHz
74.52 -33.3 41.25 turbostat.Busy%
2868 -3.6% 2765 turbostat.Bzy_MHz
10682212 ± 8% +223.7% 34581909 ± 2% turbostat.C1
0.45 ± 8% +0.6 1.06 ± 4% turbostat.C1%
1.672e+08 ± 4% +186.8% 4.794e+08 turbostat.C1E
22.71 ± 2% +35.2 57.90 turbostat.C1E%
25.39 ± 2% +131.4% 58.75 turbostat.CPU%c1
2.003e+08 +34.2% 2.689e+08 turbostat.IRQ
2595431 ± 6% +191.2% 7557295 turbostat.POLL
0.39 -0.1 0.29 turbostat.POLL%
546.17 -4.0% 524.25 turbostat.PkgWatt
17.63 +5.4% 18.59 turbostat.RAMWatt
28245 ± 12% +32.8% 37500 ± 4% proc-vmstat.nr_active_anon
216361 +5.3% 227804 proc-vmstat.nr_anon_pages
6268223 -1.9% 6151776 proc-vmstat.nr_dirty_background_threshold
12551772 -1.9% 12318594 proc-vmstat.nr_dirty_threshold
1846243 ± 4% +63.3% 3014307 ± 2% proc-vmstat.nr_file_pages
63058692 -1.8% 61892607 proc-vmstat.nr_free_pages
1348115 ± 6% +86.8% 2518510 ± 2% proc-vmstat.nr_inactive_anon
296600 +20.6% 357727 proc-vmstat.nr_mapped
1160181 ± 7% +100.7% 2328337 ± 3% proc-vmstat.nr_shmem
41135 +6.4% 43765 proc-vmstat.nr_slab_reclaimable
28245 ± 12% +32.8% 37500 ± 4% proc-vmstat.nr_zone_active_anon
1348115 ± 6% +86.8% 2518510 ± 2% proc-vmstat.nr_zone_inactive_anon
305736 ± 9% +64.0% 501444 ± 16% proc-vmstat.numa_hint_faults
212439 ± 11% +68.3% 357589 ± 17% proc-vmstat.numa_hint_faults_local
2618374 ± 5% +65.4% 4331174 ± 2% proc-vmstat.numa_hit
1476 ± 2% -21.4% 1159 ± 4% proc-vmstat.numa_huge_pte_updates
2389468 ± 5% +71.5% 4099042 ± 2% proc-vmstat.numa_local
18075 ± 18% +243.7% 62123 ± 11% proc-vmstat.pgactivate
2893156 ± 4% +59.3% 4608688 ± 2% proc-vmstat.pgalloc_normal
1164766 ± 2% +24.7% 1452985 ± 6% proc-vmstat.pgfault
750923 ± 7% +20.2% 902954 ± 4% proc-vmstat.pgfree
1587344 -80.7% 305805 sched_debug.cfs_rq:/.avg_vruntime.avg
3053761 ± 15% -79.5% 627436 ± 20% sched_debug.cfs_rq:/.avg_vruntime.max
1349541 ± 8% -80.5% 263633 ± 7% sched_debug.cfs_rq:/.avg_vruntime.min
119162 ± 17% -72.5% 32738 ± 18% sched_debug.cfs_rq:/.avg_vruntime.stddev
0.43 ± 8% -57.8% 0.18 ± 6% sched_debug.cfs_rq:/.h_nr_running.avg
459622 ± 7% -94.1% 26943 ± 10% sched_debug.cfs_rq:/.left_vruntime.avg
1802751 ± 8% -81.4% 334976 ± 13% sched_debug.cfs_rq:/.left_vruntime.max
715952 ± 2% -88.0% 86086 ± 4% sched_debug.cfs_rq:/.left_vruntime.stddev
1587344 -80.7% 305805 sched_debug.cfs_rq:/.min_vruntime.avg
3053761 ± 15% -79.5% 627436 ± 20% sched_debug.cfs_rq:/.min_vruntime.max
1349542 ± 8% -80.5% 263633 ± 7% sched_debug.cfs_rq:/.min_vruntime.min
119162 ± 17% -72.5% 32738 ± 18% sched_debug.cfs_rq:/.min_vruntime.stddev
0.32 ± 3% -48.3% 0.17 ± 5% sched_debug.cfs_rq:/.nr_running.avg
459622 ± 7% -94.1% 26943 ± 10% sched_debug.cfs_rq:/.right_vruntime.avg
1802751 ± 8% -81.4% 334976 ± 13% sched_debug.cfs_rq:/.right_vruntime.max
715952 ± 2% -88.0% 86086 ± 4% sched_debug.cfs_rq:/.right_vruntime.stddev
456.43 ± 3% -58.2% 190.59 ± 4% sched_debug.cfs_rq:/.runnable_avg.avg
1516 ± 11% -23.5% 1159 ± 12% sched_debug.cfs_rq:/.runnable_avg.max
225.37 ± 4% -29.1% 159.79 ± 6% sched_debug.cfs_rq:/.runnable_avg.stddev
317.90 ± 2% -43.2% 180.41 ± 4% sched_debug.cfs_rq:/.util_avg.avg
20.61 ± 17% -54.7% 9.33 ± 21% sched_debug.cfs_rq:/.util_est_enqueued.avg
41.44 ± 18% -73.0% 11.19 ± 2% sched_debug.cpu.clock.stddev
1127 ± 9% -26.2% 831.86 sched_debug.cpu.clock_task.stddev
1517 ± 4% -23.9% 1155 ± 8% sched_debug.cpu.curr->pid.avg
0.00 ± 18% -70.3% 0.00 ± 10% sched_debug.cpu.next_balance.stddev
0.39 ± 4% -54.9% 0.17 ± 6% sched_debug.cpu.nr_running.avg
0.54 ± 6% -27.4% 0.39 ± 12% sched_debug.cpu.nr_running.stddev
1086308 ± 3% +126.7% 2462284 sched_debug.cpu.nr_switches.avg
1230656 ± 4% +115.6% 2653087 ± 2% sched_debug.cpu.nr_switches.max
511501 ± 20% +208.1% 1576039 ± 20% sched_debug.cpu.nr_switches.min
0.00 -100.0% 0.00 sched_debug.rt_rq:.rt_nr_migratory.avg
0.50 -100.0% 0.00 sched_debug.rt_rq:.rt_nr_migratory.max
0.03 -100.0% 0.00 sched_debug.rt_rq:.rt_nr_migratory.stddev
0.00 -100.0% 0.00 sched_debug.rt_rq:.rt_nr_running.avg
0.50 -100.0% 0.00 sched_debug.rt_rq:.rt_nr_running.max
0.03 -100.0% 0.00 sched_debug.rt_rq:.rt_nr_running.stddev
1.21 ± 19% -100.0% 0.00 sched_debug.rt_rq:.rt_time.avg
270.28 ± 19% -100.0% 0.00 sched_debug.rt_rq:.rt_time.max
18.02 ± 19% -100.0% 0.00 sched_debug.rt_rq:.rt_time.stddev
11.96 -25.6% 8.90 perf-stat.i.MPKI
1.534e+10 ± 2% +89.5% 2.908e+10 perf-stat.i.branch-instructions
1.45 -0.4 1.09 perf-stat.i.branch-miss-rate%
1.844e+08 ± 2% +46.4% 2.7e+08 perf-stat.i.branch-misses
3.02 ± 4% +0.2 3.27 perf-stat.i.cache-miss-rate%
14912654 ± 5% +92.1% 28640533 ± 2% perf-stat.i.cache-misses
7.818e+08 ± 3% +48.0% 1.157e+09 perf-stat.i.cache-references
8093326 ± 3% +128.3% 18474462 perf-stat.i.context-switches
7.31 ± 4% -75.0% 1.82 perf-stat.i.cpi
4.79e+11 -49.8% 2.405e+11 perf-stat.i.cpu-cycles
3201059 ± 3% +117.3% 6956049 perf-stat.i.cpu-migrations
41589 ± 5% -69.5% 12679 ± 2% perf-stat.i.cycles-between-cache-misses
0.24 ± 2% -0.1 0.15 ± 2% perf-stat.i.dTLB-load-miss-rate%
44628478 ± 4% +24.9% 55730478 ± 2% perf-stat.i.dTLB-load-misses
1.958e+10 ± 3% +97.5% 3.867e+10 perf-stat.i.dTLB-loads
0.08 -0.0 0.05 perf-stat.i.dTLB-store-miss-rate%
7655277 ± 2% +29.8% 9933453 perf-stat.i.dTLB-store-misses
1.103e+10 ± 3% +102.9% 2.238e+10 perf-stat.i.dTLB-stores
7.611e+10 ± 2% +89.8% 1.445e+11 perf-stat.i.instructions
0.19 ± 2% +212.9% 0.58 perf-stat.i.ipc
2.12 -49.6% 1.07 perf-stat.i.metric.GHz
128.80 ± 3% +68.7% 217.34 perf-stat.i.metric.K/sec
207.54 ± 3% +96.0% 406.79 perf-stat.i.metric.M/sec
19217 +15.9% 22269 ± 7% perf-stat.i.minor-faults
80.23 ± 2% -10.5 69.77 ± 2% perf-stat.i.node-load-miss-rate%
4830206 ± 4% +40.9% 6803808 ± 4% perf-stat.i.node-load-misses
1648661 ± 12% +152.3% 4158848 ± 6% perf-stat.i.node-loads
19217 +15.9% 22269 ± 7% perf-stat.i.page-faults
10.69 -24.6% 8.06 perf-stat.overall.MPKI
1.25 -0.3 0.93 perf-stat.overall.branch-miss-rate%
1.86 ± 5% +0.6 2.46 ± 2% perf-stat.overall.cache-miss-rate%
6.56 ± 4% -74.5% 1.67 perf-stat.overall.cpi
33078 ± 5% -74.4% 8459 perf-stat.overall.cycles-between-cache-misses
0.23 ± 2% -0.1 0.14 ± 2% perf-stat.overall.dTLB-load-miss-rate%
0.07 -0.0 0.04 perf-stat.overall.dTLB-store-miss-rate%
0.15 ± 4% +291.0% 0.60 perf-stat.overall.ipc
70.66 ± 3% -10.0 60.71 ± 3% perf-stat.overall.node-load-miss-rate%
1.442e+10 ± 3% +96.7% 2.838e+10 perf-stat.ps.branch-instructions
1.798e+08 ± 3% +47.5% 2.653e+08 perf-stat.ps.branch-misses
14255932 ± 5% +96.0% 27941497 ± 2% perf-stat.ps.cache-misses
7.673e+08 ± 3% +48.2% 1.137e+09 perf-stat.ps.cache-references
7947074 ± 4% +129.1% 18204433 perf-stat.ps.context-switches
4.7e+11 -49.7% 2.363e+11 perf-stat.ps.cpu-cycles
3148303 ± 4% +117.8% 6855484 perf-stat.ps.cpu-migrations
43383948 ± 5% +26.3% 54781737 ± 2% perf-stat.ps.dTLB-load-misses
1.859e+10 ± 3% +103.5% 3.783e+10 perf-stat.ps.dTLB-loads
7530893 ± 2% +30.0% 9786994 perf-stat.ps.dTLB-store-misses
1.049e+10 ± 3% +108.7% 2.19e+10 perf-stat.ps.dTLB-stores
7.175e+10 ± 3% +96.6% 1.411e+11 perf-stat.ps.instructions
17174 ± 2% +24.8% 21436 ± 6% perf-stat.ps.minor-faults
4621265 ± 4% +43.9% 6648855 ± 4% perf-stat.ps.node-load-misses
1927489 ± 13% +123.1% 4300926 ± 5% perf-stat.ps.node-loads
17174 ± 2% +24.8% 21436 ± 6% perf-stat.ps.page-faults
4.485e+12 ± 3% +98.9% 8.92e+12 perf-stat.total.instructions
27.32 -18.6 8.70 perf-profile.calltrace.cycles-pp.flush_smp_call_function_queue.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
21.32 -18.0 3.36 ± 2% perf-profile.calltrace.cycles-pp.enqueue_task_fair.activate_task.ttwu_do_activate.sched_ttwu_pending.__flush_smp_call_function_queue
18.25 -15.6 2.69 ± 3% perf-profile.calltrace.cycles-pp.enqueue_entity.enqueue_task_fair.activate_task.ttwu_do_activate.sched_ttwu_pending
18.99 -13.8 5.16 perf-profile.calltrace.cycles-pp.activate_task.ttwu_do_activate.sched_ttwu_pending.__flush_smp_call_function_queue.flush_smp_call_function_queue
19.06 -13.8 5.29 perf-profile.calltrace.cycles-pp.ttwu_do_activate.sched_ttwu_pending.__flush_smp_call_function_queue.flush_smp_call_function_queue.do_idle
17.39 -13.5 3.88 ± 2% perf-profile.calltrace.cycles-pp.dequeue_task_fair.__schedule.schedule.do_nanosleep.hrtimer_nanosleep
19.49 -13.0 6.44 perf-profile.calltrace.cycles-pp.sched_ttwu_pending.__flush_smp_call_function_queue.flush_smp_call_function_queue.do_idle.cpu_startup_entry
19.77 -12.5 7.27 perf-profile.calltrace.cycles-pp.__flush_smp_call_function_queue.flush_smp_call_function_queue.do_idle.cpu_startup_entry.start_secondary
11.46 ± 2% -10.7 0.80 ± 4% perf-profile.calltrace.cycles-pp.update_cfs_group.dequeue_entity.dequeue_task_fair.__schedule.schedule
14.02 -10.6 3.37 ± 2% perf-profile.calltrace.cycles-pp.dequeue_entity.dequeue_task_fair.__schedule.schedule.do_nanosleep
10.59 -10.5 0.09 ±223% perf-profile.calltrace.cycles-pp.update_cfs_group.enqueue_entity.enqueue_task_fair.activate_task.ttwu_do_activate
18.67 -8.4 10.29 perf-profile.calltrace.cycles-pp.try_to_wake_up.hrtimer_wakeup.__hrtimer_run_queues.hrtimer_interrupt.__sysvec_apic_timer_interrupt
9.63 -8.4 1.27 perf-profile.calltrace.cycles-pp.update_load_avg.enqueue_entity.enqueue_task_fair.activate_task.ttwu_do_activate
18.68 -8.3 10.34 perf-profile.calltrace.cycles-pp.hrtimer_wakeup.__hrtimer_run_queues.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt
22.51 -8.1 14.45 ± 4% perf-profile.calltrace.cycles-pp.__schedule.schedule.do_nanosleep.hrtimer_nanosleep.common_nsleep
19.24 -7.9 11.33 perf-profile.calltrace.cycles-pp.__hrtimer_run_queues.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt
22.71 -7.8 14.88 ± 4% perf-profile.calltrace.cycles-pp.schedule.do_nanosleep.hrtimer_nanosleep.common_nsleep.__x64_sys_clock_nanosleep
10.88 ± 4% -7.3 3.57 ± 2% perf-profile.calltrace.cycles-pp.select_idle_sibling.select_task_rq_fair.select_task_rq.try_to_wake_up.hrtimer_wakeup
9.66 ± 2% -7.3 2.40 perf-profile.calltrace.cycles-pp.select_idle_cpu.select_idle_sibling.select_task_rq_fair.select_task_rq.try_to_wake_up
10.95 ± 4% -7.2 3.70 perf-profile.calltrace.cycles-pp.select_task_rq_fair.select_task_rq.try_to_wake_up.hrtimer_wakeup.__hrtimer_run_queues
10.97 ± 4% -7.2 3.76 perf-profile.calltrace.cycles-pp.select_task_rq.try_to_wake_up.hrtimer_wakeup.__hrtimer_run_queues.hrtimer_interrupt
8.82 ± 2% -6.6 2.23 ± 2% perf-profile.calltrace.cycles-pp.select_idle_core.select_idle_cpu.select_idle_sibling.select_task_rq_fair.select_task_rq
25.03 -6.5 18.57 ± 3% perf-profile.calltrace.cycles-pp.do_nanosleep.hrtimer_nanosleep.common_nsleep.__x64_sys_clock_nanosleep.do_syscall_64
25.38 -6.1 19.25 ± 3% perf-profile.calltrace.cycles-pp.hrtimer_nanosleep.common_nsleep.__x64_sys_clock_nanosleep.do_syscall_64.entry_SYSCALL_64_after_hwframe
25.41 -6.1 19.30 ± 3% perf-profile.calltrace.cycles-pp.common_nsleep.__x64_sys_clock_nanosleep.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_nanosleep
26.14 -5.8 20.31 ± 2% perf-profile.calltrace.cycles-pp.__x64_sys_clock_nanosleep.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_nanosleep
6.52 ± 4% -5.6 0.89 ± 3% perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.flush_smp_call_function_queue.do_idle.cpu_startup_entry.start_secondary
6.36 ± 4% -5.5 0.82 ± 2% perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.flush_smp_call_function_queue.do_idle.cpu_startup_entry
6.20 ± 4% -5.4 0.80 ± 2% perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.flush_smp_call_function_queue.do_idle
6.17 ± 4% -5.4 0.78 ± 3% perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.flush_smp_call_function_queue
5.01 -5.0 0.00 perf-profile.calltrace.cycles-pp.update_load_avg.set_next_entity.pick_next_task_fair.__schedule.schedule_idle
29.04 -4.9 24.15 ± 2% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_nanosleep
60.01 -4.9 55.12 perf-profile.calltrace.cycles-pp.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
29.44 -4.9 24.57 ± 2% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.clock_nanosleep
60.32 -4.9 55.46 perf-profile.calltrace.cycles-pp.secondary_startup_64_no_verify
60.05 -4.8 55.24 perf-profile.calltrace.cycles-pp.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
12.60 -4.8 7.81 perf-profile.calltrace.cycles-pp.__schedule.schedule_idle.do_idle.cpu_startup_entry.start_secondary
60.06 -4.8 55.29 perf-profile.calltrace.cycles-pp.start_secondary.secondary_startup_64_no_verify
12.70 -4.6 8.08 perf-profile.calltrace.cycles-pp.schedule_idle.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
5.10 -4.4 0.67 perf-profile.calltrace.cycles-pp.set_next_entity.pick_next_task_fair.__schedule.schedule_idle.do_idle
5.23 -4.2 0.98 perf-profile.calltrace.cycles-pp.pick_next_task_fair.__schedule.schedule_idle.do_idle.cpu_startup_entry
4.86 -3.8 1.07 perf-profile.calltrace.cycles-pp.finish_task_switch.__schedule.schedule_idle.do_idle.cpu_startup_entry
4.90 ± 2% -3.5 1.38 ± 2% perf-profile.calltrace.cycles-pp.available_idle_cpu.select_idle_core.select_idle_cpu.select_idle_sibling.select_task_rq_fair
31.19 -3.4 27.79 perf-profile.calltrace.cycles-pp.clock_nanosleep
4.78 -3.3 1.45 perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.finish_task_switch
4.81 -3.3 1.49 perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.finish_task_switch.__schedule
3.99 -3.2 0.82 ± 9% perf-profile.calltrace.cycles-pp.enqueue_task_fair.activate_task.ttwu_do_activate.try_to_wake_up.hrtimer_wakeup
4.00 -3.2 0.84 ± 9% perf-profile.calltrace.cycles-pp.activate_task.ttwu_do_activate.try_to_wake_up.hrtimer_wakeup.__hrtimer_run_queues
4.09 -3.1 0.95 ± 8% perf-profile.calltrace.cycles-pp.ttwu_do_activate.try_to_wake_up.hrtimer_wakeup.__hrtimer_run_queues.hrtimer_interrupt
3.70 -3.0 0.67 ± 10% perf-profile.calltrace.cycles-pp.enqueue_entity.enqueue_task_fair.activate_task.ttwu_do_activate.try_to_wake_up
3.79 -3.0 0.78 perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.finish_task_switch.__schedule.schedule_idle.do_idle
3.73 -3.0 0.73 perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.finish_task_switch.__schedule.schedule_idle
2.35 -1.8 0.54 perf-profile.calltrace.cycles-pp.asm_sysvec_call_function_single.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle
1.74 -0.8 0.96 ± 2% perf-profile.calltrace.cycles-pp.update_load_avg.dequeue_entity.dequeue_task_fair.__schedule.schedule
1.36 ± 3% -0.6 0.81 ± 3% perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.finish_task_switch.__schedule.schedule
1.43 ± 3% -0.5 0.89 ± 3% perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.finish_task_switch.__schedule.schedule.do_nanosleep
0.70 ± 3% -0.1 0.56 perf-profile.calltrace.cycles-pp.do_sched_yield.__x64_sys_sched_yield.do_syscall_64.entry_SYSCALL_64_after_hwframe.__sched_yield
1.15 ± 2% -0.1 1.08 perf-profile.calltrace.cycles-pp.hrtimer_active.hrtimer_try_to_cancel.do_nanosleep.hrtimer_nanosleep.common_nsleep
0.43 ± 44% +0.2 0.62 perf-profile.calltrace.cycles-pp.poll_idle.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle
2.43 ± 2% +0.3 2.71 perf-profile.calltrace.cycles-pp.finish_task_switch.__schedule.schedule.do_nanosleep.hrtimer_nanosleep
2.16 +0.3 2.48 perf-profile.calltrace.cycles-pp.restore_fpregs_from_fpstate.switch_fpu_return.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64
1.26 +0.5 1.76 perf-profile.calltrace.cycles-pp.sched_mm_cid_migrate_to.activate_task.ttwu_do_activate.sched_ttwu_pending.__flush_smp_call_function_queue
0.00 +0.5 0.51 perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.poll_idle.cpuidle_enter_state
0.00 +0.5 0.53 perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.poll_idle.cpuidle_enter_state.cpuidle_enter
0.00 +0.6 0.57 ± 2% perf-profile.calltrace.cycles-pp.cpus_share_cache.select_idle_sibling.select_task_rq_fair.select_task_rq.try_to_wake_up
0.00 +0.6 0.57 perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.poll_idle.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call
0.00 +0.6 0.59 perf-profile.calltrace.cycles-pp.update_curr.pick_next_task_fair.__schedule.schedule.__x64_sys_sched_yield
0.00 +0.7 0.66 perf-profile.calltrace.cycles-pp.prepare_task_switch.__schedule.schedule.do_nanosleep.hrtimer_nanosleep
0.00 +0.7 0.68 perf-profile.calltrace.cycles-pp.update_curr.dequeue_entity.dequeue_task_fair.__schedule.schedule
0.00 +0.7 0.68 ± 2% perf-profile.calltrace.cycles-pp.llist_reverse_order.__flush_smp_call_function_queue.flush_smp_call_function_queue.do_idle.cpu_startup_entry
2.72 +0.7 3.41 perf-profile.calltrace.cycles-pp.switch_fpu_return.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe
2.77 +0.7 3.48 perf-profile.calltrace.cycles-pp.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_nanosleep
0.00 +0.7 0.74 ± 2% perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.try_to_wake_up.hrtimer_wakeup.__hrtimer_run_queues.hrtimer_interrupt
0.00 +0.7 0.75 perf-profile.calltrace.cycles-pp.__smp_call_single_queue.ttwu_queue_wakelist.try_to_wake_up.hrtimer_wakeup.__hrtimer_run_queues
2.46 +0.8 3.23 perf-profile.calltrace.cycles-pp.__x64_sys_sched_yield.do_syscall_64.entry_SYSCALL_64_after_hwframe.__sched_yield
2.80 +0.8 3.58 perf-profile.calltrace.cycles-pp.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_nanosleep
0.29 ±100% +0.8 1.09 ± 8% perf-profile.calltrace.cycles-pp.queue_event.ordered_events__queue.process_simple.reader__read_event.perf_session__process_events
0.29 ±100% +0.8 1.10 ± 8% perf-profile.calltrace.cycles-pp.ordered_events__queue.process_simple.reader__read_event.perf_session__process_events.record__finish_output
0.29 ±100% +0.8 1.10 ± 8% perf-profile.calltrace.cycles-pp.process_simple.reader__read_event.perf_session__process_events.record__finish_output.__cmd_record
0.30 ±100% +0.8 1.11 ± 8% perf-profile.calltrace.cycles-pp.__cmd_record
0.30 ±100% +0.8 1.11 ± 8% perf-profile.calltrace.cycles-pp.record__finish_output.__cmd_record
0.30 ±100% +0.8 1.11 ± 8% perf-profile.calltrace.cycles-pp.perf_session__process_events.record__finish_output.__cmd_record
0.30 ±100% +0.8 1.11 ± 8% perf-profile.calltrace.cycles-pp.reader__read_event.perf_session__process_events.record__finish_output.__cmd_record
1.70 +0.8 2.55 perf-profile.calltrace.cycles-pp.__schedule.schedule.__x64_sys_sched_yield.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.72 ± 3% +0.9 1.59 perf-profile.calltrace.cycles-pp.shim_nanosleep_uint64
2.57 +0.9 3.45 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__sched_yield
1.73 +0.9 2.63 perf-profile.calltrace.cycles-pp.schedule.__x64_sys_sched_yield.do_syscall_64.entry_SYSCALL_64_after_hwframe.__sched_yield
0.00 +0.9 0.90 perf-profile.calltrace.cycles-pp.__hrtimer_start_range_ns.hrtimer_start_range_ns.do_nanosleep.hrtimer_nanosleep.common_nsleep
0.00 +0.9 0.93 perf-profile.calltrace.cycles-pp.tick_nohz_get_sleep_length.menu_select.cpuidle_idle_call.do_idle.cpu_startup_entry
0.00 +0.9 0.95 perf-profile.calltrace.cycles-pp.ttwu_queue_wakelist.try_to_wake_up.hrtimer_wakeup.__hrtimer_run_queues.hrtimer_interrupt
2.60 +0.9 3.54 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__sched_yield
0.00 +1.0 0.98 perf-profile.calltrace.cycles-pp.__switch_to
0.00 +1.0 1.00 ± 4% perf-profile.calltrace.cycles-pp.sem_post@@GLIBC_2.2.5
0.62 ± 4% +1.0 1.65 ± 2% perf-profile.calltrace.cycles-pp.sem_getvalue@@GLIBC_2.2.5
0.00 +1.0 1.04 perf-profile.calltrace.cycles-pp.set_task_cpu.try_to_wake_up.hrtimer_wakeup.__hrtimer_run_queues.hrtimer_interrupt
0.00 +1.1 1.05 perf-profile.calltrace.cycles-pp.pick_next_task_fair.__schedule.schedule.__x64_sys_sched_yield.do_syscall_64
0.00
+1.1 1.13 perf-profile.calltrace.cycles-pp.prepare_task_switch.__schedule.schedule_idle.do_idle.cpu_startup_entry 0.64 ± 2% +1.2 1.81 ± 5% perf-profile.calltrace.cycles-pp.menu_select.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary 0.58 +1.2 1.77 perf-profile.calltrace.cycles-pp.__switch_to_asm 0.86 +1.2 2.08 perf-profile.calltrace.cycles-pp.hrtimer_start_range_ns.do_nanosleep.hrtimer_nanosleep.common_nsleep.__x64_sys_clock_nanosleep 1.00 ± 2% +1.3 2.26 ± 8% perf-profile.calltrace.cycles-pp.semaphore_posix_thrash 2.98 +1.3 4.25 perf-profile.calltrace.cycles-pp.__sched_yield 7.89 +1.9 9.77 perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state 0.00 +1.9 1.94 ± 31% perf-profile.calltrace.cycles-pp.update_sg_lb_stats.update_sd_lb_stats.find_busiest_group.load_balance.newidle_balance 0.70 ± 2% +2.0 2.67 perf-profile.calltrace.cycles-pp.switch_mm_irqs_off.__schedule.schedule_idle.do_idle.cpu_startup_entry 7.93 +2.1 10.00 perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state.cpuidle_enter 0.00 +2.1 2.14 ± 29% perf-profile.calltrace.cycles-pp.update_sd_lb_stats.find_busiest_group.load_balance.newidle_balance.pick_next_task_fair 0.00 +2.2 2.18 ± 29% perf-profile.calltrace.cycles-pp.find_busiest_group.load_balance.newidle_balance.pick_next_task_fair.__schedule 8.35 +2.3 10.66 perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call 0.00 +2.4 2.41 ± 26% perf-profile.calltrace.cycles-pp.load_balance.newidle_balance.pick_next_task_fair.__schedule.schedule 0.00 +3.4 3.35 ± 18% perf-profile.calltrace.cycles-pp.newidle_balance.pick_next_task_fair.__schedule.schedule.do_nanosleep 0.00 +3.5 3.48 ± 17% 
perf-profile.calltrace.cycles-pp.pick_next_task_fair.__schedule.schedule.do_nanosleep.hrtimer_nanosleep 8.48 +3.6 12.08 perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle 6.22 ± 3% +13.2 19.41 perf-profile.calltrace.cycles-pp.intel_idle.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle 17.85 +14.9 32.78 perf-profile.calltrace.cycles-pp.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry 18.03 +15.6 33.59 perf-profile.calltrace.cycles-pp.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary 18.91 +17.2 36.16 perf-profile.calltrace.cycles-pp.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify 29.17 -27.5 1.63 ± 4% perf-profile.children.cycles-pp.update_cfs_group 26.97 -22.3 4.65 ± 3% perf-profile.children.cycles-pp.enqueue_task_fair 28.59 -21.9 6.64 perf-profile.children.cycles-pp.activate_task 28.80 -21.9 6.90 perf-profile.children.cycles-pp.ttwu_do_activate 23.52 -19.7 3.81 ± 4% perf-profile.children.cycles-pp.enqueue_entity 27.63 -18.8 8.81 perf-profile.children.cycles-pp.flush_smp_call_function_queue 24.39 -17.3 7.10 perf-profile.children.cycles-pp.sched_ttwu_pending 24.82 -16.8 8.06 perf-profile.children.cycles-pp.__flush_smp_call_function_queue 18.44 -15.1 3.31 ± 2% perf-profile.children.cycles-pp.update_load_avg 17.44 -13.5 3.93 ± 2% perf-profile.children.cycles-pp.dequeue_task_fair 36.97 -11.9 25.11 ± 2% perf-profile.children.cycles-pp.__schedule 14.08 -10.6 3.47 ± 2% perf-profile.children.cycles-pp.dequeue_entity 21.16 -8.7 12.42 perf-profile.children.cycles-pp.try_to_wake_up 21.12 -8.7 12.43 perf-profile.children.cycles-pp.hrtimer_wakeup 21.83 -8.2 13.59 perf-profile.children.cycles-pp.__hrtimer_run_queues 22.20 -7.8 14.40 perf-profile.children.cycles-pp.hrtimer_interrupt 23.28 -7.7 15.54 perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt 11.63 ± 2% -7.7 3.95 ± 2% 
perf-profile.children.cycles-pp.select_idle_cpu
22.32 -7.6 14.74 perf-profile.children.cycles-pp.__sysvec_apic_timer_interrupt
23.88 -7.1 16.75 perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
12.80 ± 2% -7.0 5.74 ± 2% perf-profile.children.cycles-pp.select_idle_sibling
12.88 ± 2% -7.0 5.92 ± 2% perf-profile.children.cycles-pp.select_task_rq_fair
10.67 ± 2% -6.9 3.72 ± 2% perf-profile.children.cycles-pp.select_idle_core
24.50 -6.9 17.56 ± 3% perf-profile.children.cycles-pp.schedule
12.90 ± 2% -6.9 6.00 ± 2% perf-profile.children.cycles-pp.select_task_rq
25.06 -6.4 18.65 ± 3% perf-profile.children.cycles-pp.do_nanosleep
25.39 -6.1 19.28 ± 3% perf-profile.children.cycles-pp.hrtimer_nanosleep
25.46 -6.0 19.47 ± 3% perf-profile.children.cycles-pp.common_nsleep
26.16 -5.8 20.33 ± 2% perf-profile.children.cycles-pp.__x64_sys_clock_nanosleep
60.29 -4.9 55.35 perf-profile.children.cycles-pp.do_idle
60.32 -4.9 55.46 perf-profile.children.cycles-pp.secondary_startup_64_no_verify
60.32 -4.9 55.46 perf-profile.children.cycles-pp.cpu_startup_entry
60.06 -4.8 55.29 perf-profile.children.cycles-pp.start_secondary
12.76 -4.6 8.14 perf-profile.children.cycles-pp.schedule_idle
7.42 ± 3% -4.5 2.97 ± 2% perf-profile.children.cycles-pp.available_idle_cpu
5.20 -4.5 0.75 perf-profile.children.cycles-pp.set_next_entity
4.99 -4.2 0.79 perf-profile.children.cycles-pp.__sysvec_call_function_single
5.01 -4.2 0.84 perf-profile.children.cycles-pp.sysvec_call_function_single
5.12 -4.1 0.98 perf-profile.children.cycles-pp.asm_sysvec_call_function_single
31.91 -4.1 27.82 perf-profile.children.cycles-pp.do_syscall_64
32.32 -4.0 28.30 perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
7.73 -3.8 3.98 perf-profile.children.cycles-pp.finish_task_switch
31.29 -3.1 28.16 perf-profile.children.cycles-pp.clock_nanosleep
1.51 ± 5% -1.4 0.16 ± 2% perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
0.73 ± 6% -0.5 0.25 ± 3% perf-profile.children.cycles-pp.__do_softirq
0.88 ± 5% -0.4 0.46 ± 2% perf-profile.children.cycles-pp.__irq_exit_rcu
0.68 -0.4 0.32 perf-profile.children.cycles-pp._find_next_bit
2.55 ± 2% -0.3 2.26 ± 3% perf-profile.children.cycles-pp._raw_spin_lock
0.26 ± 24% -0.2 0.09 ± 5% perf-profile.children.cycles-pp.__update_idle_core
0.26 ± 24% -0.1 0.12 ± 4% perf-profile.children.cycles-pp.pick_next_task_idle
0.71 ± 3% -0.1 0.58 perf-profile.children.cycles-pp.do_sched_yield
0.43 ± 3% -0.1 0.32 ± 3% perf-profile.children.cycles-pp.__bitmap_andnot
0.27 ± 15% -0.1 0.17 ± 56% perf-profile.children.cycles-pp.x86_64_start_kernel
0.27 ± 15% -0.1 0.17 ± 56% perf-profile.children.cycles-pp.x86_64_start_reservations
0.27 ± 15% -0.1 0.17 ± 56% perf-profile.children.cycles-pp.start_kernel
0.27 ± 15% -0.1 0.17 ± 56% perf-profile.children.cycles-pp.arch_call_rest_init
0.27 ± 15% -0.1 0.17 ± 56% perf-profile.children.cycles-pp.rest_init
1.16 ± 2% -0.1 1.10 perf-profile.children.cycles-pp.hrtimer_active
0.13 ± 7% -0.0 0.10 ± 11% perf-profile.children.cycles-pp.do_futex
0.14 ± 8% -0.0 0.11 ± 9% perf-profile.children.cycles-pp.__x64_sys_futex
0.20 ± 3% -0.0 0.17 ± 2% perf-profile.children.cycles-pp.yield_task_fair
0.09 ± 5% +0.0 0.11 ± 4% perf-profile.children.cycles-pp.update_irq_load_avg
0.15 ± 3% +0.0 0.18 ± 4% perf-profile.children.cycles-pp.check_preempt_curr
0.10 ± 9% +0.0 0.14 ± 4% perf-profile.children.cycles-pp.clock_gettime
0.34 ± 2% +0.0 0.38 perf-profile.children.cycles-pp.nohz_run_idle_balance
0.14 ± 2% +0.0 0.19 ± 4% perf-profile.children.cycles-pp.attach_entity_load_avg
0.00 +0.1 0.05 perf-profile.children.cycles-pp.perf_trace_run_bpf_submit
0.00 +0.1 0.05 perf-profile.children.cycles-pp.cgroup_rstat_updated
0.00 +0.1 0.05 perf-profile.children.cycles-pp.ct_kernel_exit
0.03 ± 70% +0.1 0.09 ± 5% perf-profile.children.cycles-pp.put_prev_entity
0.06 ± 6% +0.1 0.11 ± 3% perf-profile.children.cycles-pp.rb_insert_color
0.06 ± 6% +0.1 0.11 ± 3% perf-profile.children.cycles-pp.entity_eligible
0.00 +0.1 0.06 ± 9% perf-profile.children.cycles-pp.perf_swevent_event
0.00 +0.1 0.06 ± 8% perf-profile.children.cycles-pp.__update_load_avg_blocked_se
0.00 +0.1 0.06 ± 6% perf-profile.children.cycles-pp.tick_nohz_stop_idle
0.00 +0.1 0.06 ± 6% perf-profile.children.cycles-pp.mm_cid_get
0.00 +0.1 0.06 ± 6% perf-profile.children.cycles-pp.menu_reflect
0.05 ± 7% +0.1 0.11 perf-profile.children.cycles-pp.perf_trace_buf_update
0.12 ± 11% +0.1 0.18 ± 2% perf-profile.children.cycles-pp.remove_entity_load_avg
0.31 ± 6% +0.1 0.37 ± 2% perf-profile.children.cycles-pp.scheduler_tick
0.08 +0.1 0.14 perf-profile.children.cycles-pp._raw_spin_lock_irq
0.00 +0.1 0.06 ± 6% perf-profile.children.cycles-pp.irqentry_enter
0.08 +0.1 0.14 ± 2% perf-profile.children.cycles-pp.rcu_note_context_switch
0.00 +0.1 0.06 ± 11% perf-profile.children.cycles-pp.pm_qos_read_value
0.10 ± 16% +0.1 0.17 ± 2% perf-profile.children.cycles-pp.hrtimer_get_next_event
0.06 +0.1 0.13 ± 8% perf-profile.children.cycles-pp.__cgroup_account_cputime
0.08 ± 14% +0.1 0.14 ± 7% perf-profile.children.cycles-pp.stress_mwc1
0.00 +0.1 0.07 perf-profile.children.cycles-pp.tsc_verify_tsc_adjust
0.00 +0.1 0.07 perf-profile.children.cycles-pp.hrtimer_update_next_event
0.00 +0.1 0.07 ± 5% perf-profile.children.cycles-pp.tracing_gen_ctx_irq_test
0.00 +0.1 0.07 ± 5% perf-profile.children.cycles-pp.entry_SYSCALL_64_safe_stack
0.34 ± 6% +0.1 0.41 ± 3% perf-profile.children.cycles-pp.update_process_times
0.34 ± 6% +0.1 0.42 ± 2% perf-profile.children.cycles-pp.tick_sched_handle
0.05 ± 8% +0.1 0.13 perf-profile.children.cycles-pp.ktime_get_update_offsets_now
0.00 +0.1 0.08 ± 4% perf-profile.children.cycles-pp.rb_next
0.00 +0.1 0.08 ± 4% perf-profile.children.cycles-pp.error_entry
0.00 +0.1 0.08 ± 4% perf-profile.children.cycles-pp.tick_nohz_tick_stopped
0.00 +0.1 0.08 perf-profile.children.cycles-pp.save_fpregs_to_fpstate
0.00 +0.1 0.08 perf-profile.children.cycles-pp.arch_cpu_idle_enter
0.00 +0.1 0.08 perf-profile.children.cycles-pp.perf_trace_buf_alloc
0.09 ± 5% +0.1 0.17 ± 6% perf-profile.children.cycles-pp.perf_trace_sched_stat_runtime
0.00 +0.1 0.08 ± 4% perf-profile.children.cycles-pp.perf_exclude_event
0.01 ±223% +0.1 0.09 ± 6% perf-profile.children.cycles-pp.__list_del_entry_valid
0.08 ± 6% +0.1 0.16 ± 3% perf-profile.children.cycles-pp.__intel_pmu_enable_all
0.00 +0.1 0.09 ± 5% perf-profile.children.cycles-pp.put_prev_task_fair
0.00 +0.1 0.09 ± 4% perf-profile.children.cycles-pp.sched_clock_noinstr
0.00 +0.1 0.09 ± 5% perf-profile.children.cycles-pp.syscall_exit_to_user_mode_prepare
0.14 ± 3% +0.1 0.23 perf-profile.children.cycles-pp.get_nohz_timer_target
0.00 +0.1 0.10 ± 5% perf-profile.children.cycles-pp.rb_erase
0.00 +0.1 0.10 ± 80% perf-profile.children.cycles-pp.get_cpu_device
0.00 +0.1 0.10 ± 5% perf-profile.children.cycles-pp.__list_add_valid
0.36 ± 6% +0.1 0.46 ± 3% perf-profile.children.cycles-pp.tick_sched_timer
0.07 ± 6% +0.1 0.18 ± 5% perf-profile.children.cycles-pp.syscall_enter_from_user_mode
0.07 +0.1 0.18 ± 2% perf-profile.children.cycles-pp.__dequeue_entity
0.06 +0.1 0.17 ± 2% perf-profile.children.cycles-pp.call_cpuidle
0.00 +0.1 0.12 ± 4% perf-profile.children.cycles-pp.perf_trace_sched_switch
0.14 ± 3% +0.1 0.27 ± 2% perf-profile.children.cycles-pp.update_min_vruntime
0.32 +0.1 0.45 ± 2% perf-profile.children.cycles-pp.llist_add_batch
0.05 ± 8% +0.1 0.18 ± 3% perf-profile.children.cycles-pp.hrtimer_reprogram
0.09 +0.1 0.22 ± 3% perf-profile.children.cycles-pp.irqtime_account_irq
0.52 ± 2% +0.1 0.65 perf-profile.children.cycles-pp.poll_idle
0.00 +0.1 0.14 ± 37% perf-profile.children.cycles-pp.cpu_util
0.01 ±223% +0.1 0.14 ± 3% perf-profile.children.cycles-pp._raw_spin_unlock_irqrestore
0.11 ± 12% +0.1 0.25 ± 4% perf-profile.children.cycles-pp.avg_vruntime
0.10 ± 3% +0.1 0.24 perf-profile.children.cycles-pp.__calc_delta
0.00 +0.1 0.14 ± 21% perf-profile.children.cycles-pp._find_next_and_bit
0.14 ± 2% +0.1 0.29 ± 2% perf-profile.children.cycles-pp.perf_event_task_tick
0.14 +0.1 0.29 ± 2% perf-profile.children.cycles-pp.perf_adjust_freq_unthr_context
0.13 ± 2% +0.1 0.28 perf-profile.children.cycles-pp.timerqueue_add
0.21 ± 2% +0.1 0.36 ± 2% perf-profile.children.cycles-pp.perf_tp_event
0.08 ± 13% +0.2 0.23 ± 9% perf-profile.children.cycles-pp.sem_getvalue@plt
0.07 ± 6% +0.2 0.23 ± 14% perf-profile.children.cycles-pp.__enqueue_entity
0.00 +0.2 0.16 ± 4% perf-profile.children.cycles-pp.raw_spin_rq_lock_nested
0.05 +0.2 0.21 ± 2% perf-profile.children.cycles-pp.__hrtimer_init
0.00 +0.2 0.16 ± 3% perf-profile.children.cycles-pp.ct_kernel_exit_state
0.15 ± 4% +0.2 0.32 perf-profile.children.cycles-pp.enqueue_hrtimer
0.08 ± 4% +0.2 0.24 ± 3% perf-profile.children.cycles-pp.hrtimer_init_sleeper
0.09 ± 5% +0.2 0.26 ± 3% perf-profile.children.cycles-pp.__hrtimer_next_event_base
0.45 ± 17% +0.2 0.62 perf-profile.children.cycles-pp.native_irq_return_iret
0.05 +0.2 0.22 ± 2% perf-profile.children.cycles-pp.__rdgsbase_inactive
0.00 +0.2 0.18 ± 2% perf-profile.children.cycles-pp.ct_kernel_enter
0.05 +0.2 0.23 ± 34% perf-profile.children.cycles-pp.cpuidle_governor_latency_req
0.26 ± 5% +0.2 0.45 ± 5% perf-profile.children.cycles-pp.update_blocked_averages
0.00 +0.2 0.20 ± 2% perf-profile.children.cycles-pp.tick_irq_enter
0.11 ± 6% +0.2 0.30 ± 3% perf-profile.children.cycles-pp.native_apic_msr_eoi_write
0.14 ± 3% +0.2 0.34 ± 2% perf-profile.children.cycles-pp.reweight_entity
0.01 ±223% +0.2 0.21 ± 3% perf-profile.children.cycles-pp.irq_enter_rcu
0.13 ± 8% +0.2 0.33 ± 8% perf-profile.children.cycles-pp.place_entity
0.19 ± 3% +0.2 0.39 perf-profile.children.cycles-pp.pick_eevdf
0.14 ± 11% +0.2 0.34 ± 2% perf-profile.children.cycles-pp.get_next_timer_interrupt
0.00 +0.2 0.21 ± 2% perf-profile.children.cycles-pp.ct_idle_exit
0.10 ± 4% +0.2 0.32 perf-profile.children.cycles-pp.update_entity_lag
0.09 ± 5% +0.2 0.31 ± 23% perf-profile.children.cycles-pp.idle_cpu
0.00 +0.2 0.22 ± 4% perf-profile.children.cycles-pp.syscall_return_via_sysret
0.34 +0.2 0.58 perf-profile.children.cycles-pp.perf_trace_sched_wakeup_template
0.13 ± 4% +0.2 0.38 ± 2% perf-profile.children.cycles-pp._copy_from_user
0.00 +0.3 0.26 ± 2% perf-profile.children.cycles-pp.local_clock_noinstr
0.12 ± 4% +0.3 0.38 perf-profile.children.cycles-pp.timerqueue_del
0.37 +0.3 0.63 ± 2% perf-profile.children.cycles-pp.update_rq_clock_task
0.10 ± 6% +0.3 0.37 ± 2% perf-profile.children.cycles-pp.hrtimer_next_event_without
0.21 ± 8% +0.3 0.49 ± 2% perf-profile.children.cycles-pp.tick_nohz_next_event
0.07 ± 5% +0.3 0.36 ± 2% perf-profile.children.cycles-pp.tick_nohz_idle_exit
0.21 ± 9% +0.3 0.52 ± 5% perf-profile.children.cycles-pp.__nanosleep
0.16 ± 3% +0.3 0.46 perf-profile.children.cycles-pp.get_timespec64
0.08 ± 4% +0.3 0.39 perf-profile.children.cycles-pp.__wrgsbase_inactive
0.19 +0.3 0.50 perf-profile.children.cycles-pp.__update_load_avg_se
0.09 ± 4% +0.3 0.40 perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack
2.18 +0.3 2.50 perf-profile.children.cycles-pp.restore_fpregs_from_fpstate
1.61 +0.3 1.93 perf-profile.children.cycles-pp.sched_mm_cid_migrate_to
0.19 ± 7% +0.3 0.53 ± 2% perf-profile.children.cycles-pp.tick_nohz_idle_enter
0.12 ± 4% +0.3 0.46 ± 2% perf-profile.children.cycles-pp.os_xsave
0.30 ± 3% +0.5 0.76 ± 2% perf-profile.children.cycles-pp.llist_reverse_order
0.67 ± 9% +0.5 1.14 ± 7% perf-profile.children.cycles-pp.__cmd_record
0.25 ± 2% +0.5 0.72 ± 2% perf-profile.children.cycles-pp.lapic_next_deadline
0.27 ± 2% +0.5 0.75 perf-profile.children.cycles-pp.call_function_single_prep_ipi
0.44 ± 2% +0.5 0.93 perf-profile.children.cycles-pp.__hrtimer_start_range_ns
0.23 ± 4% +0.5 0.72 ± 3% perf-profile.children.cycles-pp.___perf_sw_event
0.44 ± 2% +0.5 0.94 ± 5% perf-profile.children.cycles-pp.__update_load_avg_cfs_rq
0.53 ± 11% +0.6 1.10 ± 8% perf-profile.children.cycles-pp.process_simple
0.52 ± 11% +0.6 1.10 ± 8% perf-profile.children.cycles-pp.ordered_events__queue
0.54 ± 10% +0.6 1.11 ± 8% perf-profile.children.cycles-pp.record__finish_output
0.54 ± 10% +0.6 1.11 ± 8% perf-profile.children.cycles-pp.perf_session__process_events
0.54 ± 10% +0.6 1.11 ± 8% perf-profile.children.cycles-pp.reader__read_event
0.52 ± 11% +0.6 1.10 ± 8% perf-profile.children.cycles-pp.queue_event
0.44 ± 2% +0.6 1.02 ± 4% perf-profile.children.cycles-pp.sem_post@@GLIBC_2.2.5
0.38 +0.6 0.95 perf-profile.children.cycles-pp.clockevents_program_event
0.27 ± 3% +0.6 0.85 ± 2% perf-profile.children.cycles-pp.cpus_share_cache
0.17 ± 4% +0.6 0.77 ± 2% perf-profile.children.cycles-pp.read_tsc
0.34 ± 4% +0.6 0.95 perf-profile.children.cycles-pp.tick_nohz_get_sleep_length
0.22 ± 2% +0.6 0.82 ± 2% perf-profile.children.cycles-pp.__entry_text_start
0.26 +0.6 0.89 perf-profile.children.cycles-pp.update_rq_clock
0.60 +0.6 1.23 perf-profile.children.cycles-pp.__smp_call_single_queue
0.81 +0.7 1.46 perf-profile.children.cycles-pp.update_curr
2.76 +0.7 3.44 perf-profile.children.cycles-pp.switch_fpu_return
2.86 +0.7 3.58 perf-profile.children.cycles-pp.exit_to_user_mode_prepare
0.82 ± 2% +0.7 1.55 perf-profile.children.cycles-pp.set_task_cpu
0.83 +0.7 1.56 perf-profile.children.cycles-pp.ttwu_queue_wakelist
2.46 +0.8 3.24 perf-profile.children.cycles-pp.__x64_sys_sched_yield
0.24 ± 2% +0.8 1.03 perf-profile.children.cycles-pp.ktime_get
2.88 +0.8 3.72 perf-profile.children.cycles-pp.syscall_exit_to_user_mode
0.74 ± 3% +0.9 1.63 perf-profile.children.cycles-pp.shim_nanosleep_uint64
0.25 ± 4% +0.9 1.19 ± 2% perf-profile.children.cycles-pp.sched_clock
0.26 ± 4% +1.0 1.28 perf-profile.children.cycles-pp.native_sched_clock
0.63 ± 4% +1.0 1.67 perf-profile.children.cycles-pp.sem_getvalue@@GLIBC_2.2.5
0.58 +1.1 1.66 perf-profile.children.cycles-pp._raw_spin_lock_irqsave
0.29 ± 3% +1.1 1.38 perf-profile.children.cycles-pp.sched_clock_cpu
0.78 +1.1 1.88 perf-profile.children.cycles-pp.__switch_to
0.66 ± 2% +1.2 1.84 ± 5% perf-profile.children.cycles-pp.menu_select
0.66 ± 2%
+1.2 1.85 perf-profile.children.cycles-pp.prepare_task_switch
0.88 +1.2 2.10 perf-profile.children.cycles-pp.hrtimer_start_range_ns
1.03 ± 2% +1.4 2.38 ± 7% perf-profile.children.cycles-pp.semaphore_posix_thrash
0.90 +1.4 2.29 perf-profile.children.cycles-pp.__switch_to_asm
3.06 +1.5 4.58 perf-profile.children.cycles-pp.__sched_yield
0.99 +1.9 2.86 perf-profile.children.cycles-pp.switch_mm_irqs_off
0.00 +2.0 2.02 ± 30% perf-profile.children.cycles-pp.update_sg_lb_stats
0.00 +2.2 2.19 ± 29% perf-profile.children.cycles-pp.update_sd_lb_stats
0.00 +2.2 2.23 ± 28% perf-profile.children.cycles-pp.find_busiest_group
0.07 ± 7% +2.4 2.47 ± 25% perf-profile.children.cycles-pp.load_balance
0.06 ± 8% +3.3 3.38 ± 18% perf-profile.children.cycles-pp.newidle_balance
6.26 ± 3% +13.2 19.48 perf-profile.children.cycles-pp.intel_idle
18.09 +15.6 33.68 perf-profile.children.cycles-pp.cpuidle_enter_state
18.12 +15.6 33.72 perf-profile.children.cycles-pp.cpuidle_enter
19.00 +17.3 36.32 perf-profile.children.cycles-pp.cpuidle_idle_call
29.16 -27.6 1.61 ± 4% perf-profile.self.cycles-pp.update_cfs_group
17.63 -16.1 1.52 ± 2% perf-profile.self.cycles-pp.update_load_avg
7.40 ± 3% -4.5 2.94 ± 2% perf-profile.self.cycles-pp.available_idle_cpu
3.28 ± 2% -2.5 0.81 ± 2% perf-profile.self.cycles-pp.select_idle_core
1.51 ± 5% -1.4 0.16 ± 2% perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
0.62 -0.3 0.28 ± 2% perf-profile.self.cycles-pp._find_next_bit
0.38 ± 7% -0.3 0.13 ± 2% perf-profile.self.cycles-pp.select_idle_cpu
0.38 ± 3% -0.2 0.24 ± 2% perf-profile.self.cycles-pp.migrate_task_rq_fair
0.21 ± 31% -0.1 0.08 ± 6% perf-profile.self.cycles-pp.__update_idle_core
0.42 ± 3% -0.1 0.31 ± 3% perf-profile.self.cycles-pp.__bitmap_andnot
0.19 ± 6% -0.1 0.09 ± 4% perf-profile.self.cycles-pp.__update_blocked_fair
0.44 ± 2% -0.1 0.37 perf-profile.self.cycles-pp.__x64_sys_clock_nanosleep
0.05 +0.0 0.06 perf-profile.self.cycles-pp.clockevents_program_event
0.08 +0.0 0.09 ± 4% perf-profile.self.cycles-pp.perf_trace_sched_wakeup_template
0.09 +0.0 0.10 ± 4% perf-profile.self.cycles-pp.update_irq_load_avg
0.12 ± 3% +0.0 0.14 ± 2% perf-profile.self.cycles-pp.__hrtimer_run_queues
0.17 ± 2% +0.0 0.20 ± 2% perf-profile.self.cycles-pp.ttwu_queue_wakelist
0.14 ± 2% +0.0 0.19 ± 2% perf-profile.self.cycles-pp.attach_entity_load_avg
0.08 ± 6% +0.0 0.12 ± 3% perf-profile.self.cycles-pp.ttwu_do_activate
0.08 ± 6% +0.0 0.12 ± 3% perf-profile.self.cycles-pp.yield_task_fair
0.06 ± 8% +0.0 0.10 ± 4% perf-profile.self.cycles-pp.rb_insert_color
0.08 +0.0 0.13 ± 2% perf-profile.self.cycles-pp.rcu_note_context_switch
0.05 ± 8% +0.0 0.10 ± 3% perf-profile.self.cycles-pp.entity_eligible
0.00 +0.1 0.05 perf-profile.self.cycles-pp.__update_load_avg_blocked_se
0.00 +0.1 0.05 ± 7% perf-profile.self.cycles-pp.mm_cid_get
0.09 ± 4% +0.1 0.14 ± 4% perf-profile.self.cycles-pp.__hrtimer_start_range_ns
0.08 +0.1 0.13 ± 3% perf-profile.self.cycles-pp._raw_spin_lock_irq
0.00 +0.1 0.05 ± 8% perf-profile.self.cycles-pp.update_blocked_averages
0.00 +0.1 0.06 ± 9% perf-profile.self.cycles-pp.tsc_verify_tsc_adjust
0.00 +0.1 0.06 ± 9% perf-profile.self.cycles-pp.pm_qos_read_value
0.00 +0.1 0.06 ± 8% perf-profile.self.cycles-pp.hrtimer_try_to_cancel
0.00 +0.1 0.06 ± 6% perf-profile.self.cycles-pp.ktime_get_update_offsets_now
0.00 +0.1 0.06 perf-profile.self.cycles-pp.activate_task
0.00 +0.1 0.06 perf-profile.self.cycles-pp.tick_nohz_get_sleep_length
0.00 +0.1 0.06 perf-profile.self.cycles-pp.hrtimer_next_event_without
0.00 +0.1 0.06 perf-profile.self.cycles-pp.irqtime_account_irq
0.00 +0.1 0.06 perf-profile.self.cycles-pp.save_fpregs_to_fpstate
0.95 +0.1 1.01 perf-profile.self.cycles-pp.hrtimer_active
0.06 ± 14% +0.1 0.12 ± 8% perf-profile.self.cycles-pp.stress_mwc1
0.00 +0.1 0.06 ± 6% perf-profile.self.cycles-pp.perf_exclude_event
0.42 +0.1 0.48 perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
0.05 +0.1 0.11 ± 4% perf-profile.self.cycles-pp.set_next_entity
0.00 +0.1 0.06 ± 7% perf-profile.self.cycles-pp.__sysvec_apic_timer_interrupt
0.00 +0.1 0.06 ± 7% perf-profile.self.cycles-pp.rb_next
0.00 +0.1 0.06 ± 7% perf-profile.self.cycles-pp.tick_nohz_tick_stopped
0.00 +0.1 0.07 ± 7% perf-profile.self.cycles-pp.tick_nohz_idle_enter
0.00 +0.1 0.07 ± 7% perf-profile.self.cycles-pp.check_preempt_curr
0.06 ± 9% +0.1 0.13 ± 2% perf-profile.self.cycles-pp.perf_adjust_freq_unthr_context
0.00 +0.1 0.07 ± 5% perf-profile.self.cycles-pp.syscall_exit_to_user_mode_prepare
0.00 +0.1 0.07 ± 5% perf-profile.self.cycles-pp.tracing_gen_ctx_irq_test
0.00 +0.1 0.07 perf-profile.self.cycles-pp.asm_sysvec_apic_timer_interrupt
0.00 +0.1 0.07 ± 8% perf-profile.self.cycles-pp.hrtimer_interrupt
0.00 +0.1 0.07 ± 9% perf-profile.self.cycles-pp.__list_del_entry_valid
0.00 +0.1 0.07 ± 5% perf-profile.self.cycles-pp.entry_SYSCALL_64_safe_stack
0.09 ± 5% +0.1 0.16 ± 5% perf-profile.self.cycles-pp.perf_trace_sched_stat_runtime
0.00 +0.1 0.08 ± 6% perf-profile.self.cycles-pp.cpuidle_governor_latency_req
0.00 +0.1 0.08 ± 12% perf-profile.self.cycles-pp.__cgroup_account_cputime
0.00 +0.1 0.08 ± 6% perf-profile.self.cycles-pp.exit_to_user_mode_prepare
0.00 +0.1 0.08 ± 6% perf-profile.self.cycles-pp.error_entry
0.00 +0.1 0.08 ± 4% perf-profile.self.cycles-pp.sched_clock
0.00 +0.1 0.08 ± 4% perf-profile.self.cycles-pp.syscall_enter_from_user_mode
0.05 +0.1 0.13 ± 4% perf-profile.self.cycles-pp.__dequeue_entity
0.00 +0.1 0.08 perf-profile.self.cycles-pp.select_task_rq
0.00 +0.1 0.08 perf-profile.self.cycles-pp.do_sched_yield
0.00 +0.1 0.08 perf-profile.self.cycles-pp.tick_nohz_next_event
0.00 +0.1 0.08 perf-profile.self.cycles-pp.get_timespec64
0.00 +0.1 0.08 perf-profile.self.cycles-pp.tick_nohz_idle_exit
0.00 +0.1 0.08 perf-profile.self.cycles-pp.get_next_timer_interrupt
0.00 +0.1 0.08 ± 7% perf-profile.self.cycles-pp.__list_add_valid
0.08 ± 6% +0.1 0.16 ± 3% perf-profile.self.cycles-pp.__intel_pmu_enable_all
0.00 +0.1 0.08 ± 5% perf-profile.self.cycles-pp.syscall_exit_to_user_mode
0.16 ± 2% +0.1 0.25 ± 2% perf-profile.self.cycles-pp.perf_tp_event
0.00 +0.1 0.08 ± 5% perf-profile.self.cycles-pp.poll_idle
0.00 +0.1 0.08 ± 5% perf-profile.self.cycles-pp.rb_erase
0.13 ± 3% +0.1 0.22 ± 3% perf-profile.self.cycles-pp.__flush_smp_call_function_queue
0.14 ± 3% +0.1 0.22 ± 2% perf-profile.self.cycles-pp.get_nohz_timer_target
0.00 +0.1 0.09 ± 5% perf-profile.self.cycles-pp.ct_kernel_enter
0.10 ± 5% +0.1 0.19 perf-profile.self.cycles-pp.try_to_wake_up
0.07 ± 6% +0.1 0.17 perf-profile.self.cycles-pp.timerqueue_add
0.02 ± 99% +0.1 0.12 ± 3% perf-profile.self.cycles-pp.update_entity_lag
0.00 +0.1 0.10 ± 3% perf-profile.self.cycles-pp.perf_trace_sched_switch
0.28 +0.1 0.38 ± 4% perf-profile.self.cycles-pp.dequeue_task_fair
0.00 +0.1 0.10 ± 6% perf-profile.self.cycles-pp.load_balance
0.08 ± 4% +0.1 0.18 perf-profile.self.cycles-pp.select_task_rq_fair
0.14 ± 4% +0.1 0.24 ± 2% perf-profile.self.cycles-pp.update_min_vruntime
0.04 ± 47% +0.1 0.15 ± 12% perf-profile.self.cycles-pp.place_entity
0.00 +0.1 0.10 ± 4% perf-profile.self.cycles-pp._raw_spin_unlock_irqrestore
0.00 +0.1 0.11 ± 42% perf-profile.self.cycles-pp.cpu_util
0.10 +0.1 0.21 ± 4% perf-profile.self.cycles-pp.update_rq_clock
0.28 ± 3% +0.1 0.40 perf-profile.self.cycles-pp.dequeue_entity
0.07 ± 16% +0.1 0.20 ± 7% perf-profile.self.cycles-pp.sem_getvalue@plt
0.32 +0.1 0.44 ± 2% perf-profile.self.cycles-pp.llist_add_batch
0.05 ± 7% +0.1 0.18 ± 2% perf-profile.self.cycles-pp.hrtimer_reprogram
0.08 ± 10% +0.1 0.21 ± 3% perf-profile.self.cycles-pp.__entry_text_start
0.06 +0.1 0.19 perf-profile.self.cycles-pp.common_nsleep
0.09 ± 5% +0.1 0.22 perf-profile.self.cycles-pp.__calc_delta
0.00 +0.1 0.13 ± 23% perf-profile.self.cycles-pp._find_next_and_bit
0.07 +0.1 0.20 ± 16% perf-profile.self.cycles-pp.__enqueue_entity
0.09 ± 4% +0.1 0.23 ± 4% perf-profile.self.cycles-pp.avg_vruntime
0.00 +0.1 0.14 ± 9% perf-profile.self.cycles-pp.update_sd_lb_stats
0.13 ± 5% +0.1 0.27 perf-profile.self.cycles-pp.pick_eevdf
0.00 +0.1 0.15 ± 2% perf-profile.self.cycles-pp.cpu_startup_entry
0.21 ± 2% +0.2 0.36 ± 2% perf-profile.self.cycles-pp.hrtimer_nanosleep
0.00 +0.2 0.15 ± 4% perf-profile.self.cycles-pp.ct_kernel_exit_state
0.08 ± 5% +0.2 0.24 ± 2% perf-profile.self.cycles-pp.__hrtimer_next_event_base
0.00 +0.2 0.16 ± 3% perf-profile.self.cycles-pp.call_cpuidle
0.06 ± 6% +0.2 0.22 ± 3% perf-profile.self.cycles-pp.do_syscall_64
0.45 ± 17% +0.2 0.62 perf-profile.self.cycles-pp.native_irq_return_iret
0.08 ± 6% +0.2 0.25 ± 3% perf-profile.self.cycles-pp.timerqueue_del
0.13 ± 3% +0.2 0.31 ± 3% perf-profile.self.cycles-pp.nohz_run_idle_balance
0.00 +0.2 0.17 ± 2% perf-profile.self.cycles-pp.sched_clock_cpu
0.01 ±223% +0.2 0.19 ± 2% perf-profile.self.cycles-pp.__hrtimer_init
0.03 ± 70% +0.2 0.22 ± 2% perf-profile.self.cycles-pp.__rdgsbase_inactive
0.01 ±223% +0.2 0.19 ± 3% perf-profile.self.cycles-pp.schedule_idle
0.17 ± 2% +0.2 0.36 ± 2% perf-profile.self.cycles-pp.schedule
0.14 +0.2 0.33 ± 2% perf-profile.self.cycles-pp.do_nanosleep
0.11 ± 6% +0.2 0.30 ± 2% perf-profile.self.cycles-pp.native_apic_msr_eoi_write
0.12 ± 4% +0.2 0.32 ± 2% perf-profile.self.cycles-pp.reweight_entity
0.09 ± 5% +0.2 0.30 ± 23% perf-profile.self.cycles-pp.idle_cpu
0.00 +0.2 0.22 ± 3% perf-profile.self.cycles-pp.syscall_return_via_sysret
0.12 ± 5% +0.2 0.34 ± 2% perf-profile.self.cycles-pp.hrtimer_start_range_ns
0.11 ± 3% +0.2 0.34 ± 2% perf-profile.self.cycles-pp._copy_from_user
0.09 ± 5% +0.2 0.33 ± 2% perf-profile.self.cycles-pp.ktime_get
0.12 +0.2 0.37 perf-profile.self.cycles-pp.pick_next_task_fair
0.09 ± 6% +0.3 0.35 ± 2% perf-profile.self.cycles-pp.__sched_yield
0.28 ± 2% +0.3 0.54 ± 2% perf-profile.self.cycles-pp.update_rq_clock_task
0.07 ± 7% +0.3 0.33 ± 2% perf-profile.self.cycles-pp.cpuidle_idle_call
0.17 ± 2% +0.3 0.45 perf-profile.self.cycles-pp.__update_load_avg_se
0.19 ± 3% +0.3 0.48 perf-profile.self.cycles-pp.sched_ttwu_pending
0.34 ± 2% +0.3 0.62 perf-profile.self.cycles-pp.flush_smp_call_function_queue
0.19 ± 11% +0.3 0.48 ± 5% perf-profile.self.cycles-pp.__nanosleep
0.08 ± 7% +0.3 0.38 perf-profile.self.cycles-pp.__wrgsbase_inactive
0.08 ± 5% +0.3 0.39 perf-profile.self.cycles-pp.entry_SYSRETQ_unsafe_stack
2.18 +0.3 2.50 perf-profile.self.cycles-pp.restore_fpregs_from_fpstate
1.61 +0.3 1.93 perf-profile.self.cycles-pp.sched_mm_cid_migrate_to
0.31 ± 2% +0.3 0.64 ± 8% perf-profile.self.cycles-pp.enqueue_entity
0.11 ± 3% +0.3 0.45 perf-profile.self.cycles-pp.os_xsave
0.05 ± 8% +0.3 0.40 ± 12% perf-profile.self.cycles-pp.newidle_balance
0.31 +0.3 0.66 perf-profile.self.cycles-pp.update_curr
0.25 ± 4% +0.4 0.61 ± 2% perf-profile.self.cycles-pp.menu_select
0.58 +0.4 0.94 perf-profile.self.cycles-pp.switch_fpu_return
0.19 ± 5% +0.4 0.59 ± 2% perf-profile.self.cycles-pp.___perf_sw_event
0.10 ± 6% +0.4 0.54 perf-profile.self.cycles-pp.do_idle
0.43 +0.4 0.86 ± 3% perf-profile.self.cycles-pp.__update_load_avg_cfs_rq
0.30 ± 4% +0.5 0.75 ± 2% perf-profile.self.cycles-pp.llist_reverse_order
0.25 ± 2% +0.5 0.72 ± 2% perf-profile.self.cycles-pp.lapic_next_deadline
0.27 ± 2% +0.5 0.74 perf-profile.self.cycles-pp.call_function_single_prep_ipi
0.37 ± 8% +0.5 0.90 ± 3% perf-profile.self.cycles-pp.clock_nanosleep
0.21 ± 2% +0.6 0.76 ± 6% perf-profile.self.cycles-pp.enqueue_task_fair
0.52 ± 10% +0.6 1.08 ± 8% perf-profile.self.cycles-pp.queue_event
0.27 ± 3% +0.6 0.84 ± 2% perf-profile.self.cycles-pp.cpus_share_cache
0.31 ± 2% +0.6 0.90 ± 5% perf-profile.self.cycles-pp.sem_post@@GLIBC_2.2.5
0.17 ± 4% +0.6 0.75 perf-profile.self.cycles-pp.read_tsc
1.08 +0.7 1.83 perf-profile.self.cycles-pp.finish_task_switch
0.25 ± 2% +0.7 1.00 ± 2% perf-profile.self.cycles-pp.set_task_cpu
0.49 ± 2% +0.8 1.32 perf-profile.self.cycles-pp.shim_nanosleep_uint64
0.55 +0.9 1.44 perf-profile.self.cycles-pp.prepare_task_switch
0.46 ± 4% +0.9 1.37 perf-profile.self.cycles-pp.sem_getvalue@@GLIBC_2.2.5
0.25 ± 4% +1.0 1.23
perf-profile.self.cycles-pp.native_sched_clock 0.25 ± 4% +1.0 1.24 perf-profile.self.cycles-pp.cpuidle_enter_state 1.03 ± 2% +1.0 2.06 ± 3% perf-profile.self.cycles-pp._raw_spin_lock 0.57 ± 2% +1.1 1.62 perf-profile.self.cycles-pp._raw_spin_lock_irqsave 0.76 +1.1 1.83 perf-profile.self.cycles-pp.__switch_to 0.84 ± 2% +1.3 2.12 ± 8% perf-profile.self.cycles-pp.semaphore_posix_thrash 0.90 +1.4 2.28 perf-profile.self.cycles-pp.__switch_to_asm 0.00 +1.5 1.55 ± 30% perf-profile.self.cycles-pp.update_sg_lb_stats 0.98 +1.9 2.83 perf-profile.self.cycles-pp.switch_mm_irqs_off 1.97 +2.1 4.04 perf-profile.self.cycles-pp.__schedule 6.26 ± 3% +13.2 19.48 perf-profile.self.cycles-pp.intel_idle *************************************************************************************************** lkp-spr-r02: 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480CTDX (Sapphire Rapids) with 256G memory ========================================================================================= class/compiler/cpufreq_governor/kconfig/nr_threads/rootfs/sc_pid_max/tbox_group/test/testcase/testtime: scheduler/gcc-12/performance/x86_64-rhel-8.3/100%/debian-11.1-x86_64-20220510.cgz/4194304/lkp-spr-r02/switch/stress-ng/60s commit: 63304558ba ("sched/eevdf: Curb wakeup-preemption") 0a24d7afed ("sched/fair: ratelimit update to tg->load_avg") 63304558ba5dcaaf 0a24d7afed5c3c59ee212782f9c ---------------- --------------------------- %stddev %change %stddev \ | \ 1.792e+08 ± 4% +356.6% 8.181e+08 cpuidle..usage 1257499 ± 25% +97.2% 2479986 ± 4% numa-numastat.node1.local_node 1363779 ± 23% +92.3% 2622537 ± 3% numa-numastat.node1.numa_hit 1520 ± 3% -37.6% 949.17 ± 3% perf-c2c.DRAM.remote 83408 ± 2% +32.9% 110872 ± 2% perf-c2c.HITM.local 1113 ± 3% -28.4% 797.33 ± 3% perf-c2c.HITM.remote 84522 ± 2% +32.1% 111670 ± 2% perf-c2c.HITM.total 6744974 ± 4% +65.1% 11139304 ± 2% vmstat.memory.cache 182.17 +18.0% 215.00 vmstat.procs.r 7208379 ± 4% +394.4% 35639830 vmstat.system.cs 904223 ± 2% +192.2% 2641908 
vmstat.system.in 29.53 -4.7 24.87 mpstat.cpu.all.idle% 7.99 -5.0 2.98 mpstat.cpu.all.irq% 0.43 ± 5% -0.2 0.25 mpstat.cpu.all.soft% 56.25 +8.2 64.42 mpstat.cpu.all.sys% 5.80 +1.7 7.47 mpstat.cpu.all.usr% 1.415e+08 ± 5% +422.1% 7.387e+08 stress-ng.switch.ops 2357895 ± 5% +422.1% 12310763 stress-ng.switch.ops_per_sec 303844 ± 4% +1122.7% 3715226 stress-ng.time.involuntary_context_switches 12532 +14.7% 14376 stress-ng.time.percent_of_cpu_this_job_got 7172 +12.8% 8090 stress-ng.time.system_time 629.90 +35.9% 856.07 stress-ng.time.user_time 2.732e+08 ± 5% +403.8% 1.377e+09 stress-ng.time.voluntary_context_switches 438651 ± 2% +14.6% 502822 meminfo.AnonPages 6551534 ± 4% +66.5% 10905701 ± 2% meminfo.Cached 8926904 ± 3% +49.2% 13315485 meminfo.Committed_AS 4104936 ± 7% +107.1% 8502046 ± 2% meminfo.Inactive 4104785 ± 7% +107.1% 8501888 ± 2% meminfo.Inactive(anon) 1189315 ± 3% +25.4% 1491641 meminfo.Mapped 10181121 ± 2% +43.5% 14606335 meminfo.Memused 3807597 ± 7% +114.4% 8161798 ± 2% meminfo.Shmem 10280871 ± 3% +43.0% 14704729 meminfo.max_used_kB 378940 ± 45% +59.0% 602609 ± 9% numa-vmstat.node0.nr_inactive_anon 378938 ± 45% +59.0% 602607 ± 9% numa-vmstat.node0.nr_zone_inactive_anon 638178 ± 26% +139.5% 1528430 ± 3% numa-vmstat.node1.nr_file_pages 648468 ± 27% +135.1% 1524245 ± 2% numa-vmstat.node1.nr_inactive_anon 188166 ± 6% +39.7% 262869 numa-vmstat.node1.nr_mapped 615278 ± 29% +144.7% 1505431 ± 2% numa-vmstat.node1.nr_shmem 648464 ± 27% +135.1% 1524241 ± 2% numa-vmstat.node1.nr_zone_inactive_anon 1363943 ± 23% +92.3% 2622303 ± 3% numa-vmstat.node1.numa_hit 1257663 ± 25% +97.2% 2479754 ± 4% numa-vmstat.node1.numa_local 277293 ± 20% +99.2% 552340 ± 9% numa-meminfo.node0.AnonPages.max 1515282 ± 45% +59.1% 2410064 ± 9% numa-meminfo.node0.Inactive 1515280 ± 45% +59.0% 2410010 ± 9% numa-meminfo.node0.Inactive(anon) 5921261 ± 11% +14.3% 6767795 ± 2% numa-meminfo.node0.MemUsed 2551274 ± 26% +139.6% 6112583 ± 3% numa-meminfo.node1.FilePages 2592782 ± 27% +135.1% 6095995 ± 2% 
numa-meminfo.node1.Inactive 2592634 ± 27% +135.1% 6095892 ± 2% numa-meminfo.node1.Inactive(anon) 751876 ± 6% +39.7% 1049995 numa-meminfo.node1.Mapped 4262948 ± 16% +84.0% 7842823 ± 2% numa-meminfo.node1.MemUsed 2459674 ± 29% +144.8% 6020586 ± 2% numa-meminfo.node1.Shmem 20460839 ± 21% +2277.8% 4.865e+08 turbostat.C1 1.38 ± 17% +6.6 8.02 turbostat.C1% 1.557e+08 ± 2% -94.5% 8505821 ± 2% turbostat.C1E 14.93 -11.8 3.09 ± 7% turbostat.C1E% 18.32 -31.4% 12.56 ± 3% turbostat.CPU%c1 0.08 ± 6% +278.3% 0.29 turbostat.IPC 58972223 ± 3% +193.9% 1.733e+08 turbostat.IRQ 2614421 ± 3% +12238.7% 3.226e+08 turbostat.POLL 0.06 ± 6% +4.5 4.53 turbostat.POLL% 550.43 +22.6% 675.10 turbostat.PkgWatt 17.66 +3.4% 18.26 turbostat.RAMWatt 109633 ± 2% +14.5% 125554 proc-vmstat.nr_anon_pages 6299234 -1.7% 6189022 proc-vmstat.nr_dirty_background_threshold 12613872 -1.7% 12393176 proc-vmstat.nr_dirty_threshold 1638204 ± 4% +66.3% 2724201 ± 2% proc-vmstat.nr_file_pages 63369354 -1.7% 62265610 proc-vmstat.nr_free_pages 1026430 ± 7% +106.9% 2123321 ± 2% proc-vmstat.nr_inactive_anon 297396 ± 3% +24.9% 371454 proc-vmstat.nr_mapped 952218 ± 7% +114.1% 2038225 ± 3% proc-vmstat.nr_shmem 40692 +6.3% 43273 proc-vmstat.nr_slab_reclaimable 1026430 ± 7% +106.9% 2123321 ± 2% proc-vmstat.nr_zone_inactive_anon 243574 ± 8% +39.3% 339259 ± 2% proc-vmstat.numa_hint_faults 135818 ± 18% +70.6% 231662 ± 3% proc-vmstat.numa_hint_faults_local 2345361 ± 4% +68.7% 3956948 ± 2% proc-vmstat.numa_hit 2109765 ± 5% +76.6% 3724999 ± 2% proc-vmstat.numa_local 544814 ± 4% +17.3% 639106 ± 3% proc-vmstat.numa_pte_updates 16992 ± 10% +37.7% 23393 ± 6% proc-vmstat.pgactivate 2439275 ± 4% +66.3% 4056303 ± 2% proc-vmstat.pgalloc_normal 1112292 +16.6% 1296617 proc-vmstat.pgfault 3291142 +14.7% 3773335 sched_debug.cfs_rq:/.avg_vruntime.avg 4708699 ± 5% +15.7% 5449685 ± 5% sched_debug.cfs_rq:/.avg_vruntime.max 651007 ± 8% -41.3% 382212 ± 12% sched_debug.cfs_rq:/.left_vruntime.avg 3404176 +21.7% 4141767 ± 13% 
sched_debug.cfs_rq:/.left_vruntime.max 1303825 ± 3% -13.0% 1134537 ± 5% sched_debug.cfs_rq:/.left_vruntime.stddev 3291142 +14.7% 3773335 sched_debug.cfs_rq:/.min_vruntime.avg 4708699 ± 5% +15.7% 5449685 ± 5% sched_debug.cfs_rq:/.min_vruntime.max 651007 ± 8% -41.3% 382212 ± 12% sched_debug.cfs_rq:/.right_vruntime.avg 3404176 +21.7% 4141767 ± 13% sched_debug.cfs_rq:/.right_vruntime.max 1303825 ± 3% -13.0% 1134537 ± 5% sched_debug.cfs_rq:/.right_vruntime.stddev 309.43 ± 6% -22.4% 240.16 ± 5% sched_debug.cfs_rq:/.runnable_avg.stddev 184.67 ± 6% -16.8% 153.61 sched_debug.cfs_rq:/.util_avg.stddev 67.77 ± 14% -83.0% 11.54 ± 13% sched_debug.cfs_rq:/.util_est_enqueued.avg 99.34 ± 11% -35.4% 64.19 ± 7% sched_debug.cfs_rq:/.util_est_enqueued.stddev 44.07 ± 12% -70.0% 13.22 ± 4% sched_debug.cpu.clock.stddev 2416 ± 7% +21.8% 2943 ± 2% sched_debug.cpu.curr->pid.avg 0.00 ± 11% -60.0% 0.00 ± 10% sched_debug.cpu.next_balance.stddev 0.47 ± 6% +17.7% 0.55 ± 4% sched_debug.cpu.nr_running.avg 993749 ± 5% +397.7% 4946105 sched_debug.cpu.nr_switches.avg 1123429 ± 4% +367.4% 5250556 sched_debug.cpu.nr_switches.max 528386 ± 14% +339.9% 2324144 ± 17% sched_debug.cpu.nr_switches.min 81513 ± 18% +225.4% 265263 ± 4% sched_debug.cpu.nr_switches.stddev 1.55 ± 7% -31.3% 1.07 ± 2% sched_debug.rt_rq:.rt_time.avg 347.61 ± 7% -31.3% 238.83 ± 2% sched_debug.rt_rq:.rt_time.max 23.17 ± 7% -31.3% 15.92 ± 2% sched_debug.rt_rq:.rt_time.stddev 14.40 +8.8% 15.66 perf-stat.i.MPKI 1.419e+10 ± 4% +300.9% 5.688e+10 perf-stat.i.branch-instructions 1.36 -0.1 1.23 perf-stat.i.branch-miss-rate% 1.624e+08 ± 4% +298.8% 6.478e+08 perf-stat.i.branch-misses 2.71 ± 3% -0.8 1.89 ± 2% perf-stat.i.cache-miss-rate% 14303386 ± 3% +104.9% 29304692 perf-stat.i.cache-misses 8.988e+08 ± 5% +381.5% 4.328e+09 perf-stat.i.cache-references 7363534 ± 5% +401.1% 36899734 perf-stat.i.context-switches 8.28 ± 4% -74.7% 2.10 perf-stat.i.cpi 5.176e+11 +12.5% 5.822e+11 perf-stat.i.cpu-cycles 2716479 ± 5% +389.6% 13299565 
perf-stat.i.cpu-migrations 43307 ± 2% -37.0% 27270 perf-stat.i.cycles-between-cache-misses 41832335 ± 9% +354.4% 1.901e+08 ± 2% perf-stat.i.dTLB-load-misses 1.8e+10 ± 4% +315.9% 7.485e+10 perf-stat.i.dTLB-loads 5430661 ± 6% +373.7% 25724171 ± 2% perf-stat.i.dTLB-store-misses 9.847e+09 ± 4% +333.9% 4.272e+10 perf-stat.i.dTLB-stores 7.026e+10 ± 4% +305.4% 2.849e+11 perf-stat.i.instructions 0.17 ± 3% +205.6% 0.51 perf-stat.i.ipc 2.31 +12.5% 2.60 perf-stat.i.metric.GHz 111.01 ± 3% +264.1% 404.23 perf-stat.i.metric.K/sec 191.57 ± 4% +316.5% 797.89 perf-stat.i.metric.M/sec 18432 ± 2% +9.5% 20184 perf-stat.i.minor-faults 81.21 -13.1 68.09 perf-stat.i.node-load-miss-rate% 4976001 ± 2% +46.0% 7265429 perf-stat.i.node-load-misses 1487000 ± 6% +219.9% 4757391 perf-stat.i.node-loads 18432 ± 2% +9.5% 20184 perf-stat.i.page-faults 13.31 +15.2% 15.34 perf-stat.overall.MPKI 1.19 -0.0 1.15 perf-stat.overall.branch-miss-rate% 1.56 ± 4% -0.9 0.67 perf-stat.overall.cache-miss-rate% 7.69 ± 4% -73.2% 2.06 perf-stat.overall.cpi 37074 ± 3% -45.7% 20124 perf-stat.overall.cycles-between-cache-misses 0.06 ± 2% +0.0 0.06 ± 2% perf-stat.overall.dTLB-store-miss-rate% 0.13 ± 4% +272.6% 0.49 perf-stat.overall.ipc 73.48 -14.1 59.33 perf-stat.overall.node-load-miss-rate% 1.334e+10 ± 4% +317.8% 5.572e+10 perf-stat.ps.branch-instructions 1.584e+08 ± 5% +304.7% 6.411e+08 perf-stat.ps.branch-misses 13716290 ± 3% +108.4% 28582536 perf-stat.ps.cache-misses 8.82e+08 ± 5% +385.7% 4.284e+09 perf-stat.ps.cache-references 7221087 ± 5% +405.8% 36523032 perf-stat.ps.context-switches 217584 +1.3% 220460 perf-stat.ps.cpu-clock 5.081e+11 +13.2% 5.751e+11 perf-stat.ps.cpu-cycles 2673546 ± 5% +392.9% 13179198 perf-stat.ps.cpu-migrations 40827035 ± 9% +360.3% 1.879e+08 ± 2% perf-stat.ps.dTLB-load-misses 1.708e+10 ± 4% +330.6% 7.352e+10 perf-stat.ps.dTLB-loads 5337456 ± 6% +377.4% 25482959 ± 2% perf-stat.ps.dTLB-store-misses 9.357e+09 ± 5% +348.8% 4.199e+10 perf-stat.ps.dTLB-stores 6.622e+10 ± 4% +321.8% 2.793e+11 
perf-stat.ps.instructions 16190 ± 2% +19.0% 19266 perf-stat.ps.minor-faults 4783057 ± 2% +49.3% 7140248 perf-stat.ps.node-load-misses 1727513 ± 5% +183.3% 4893702 perf-stat.ps.node-loads 16190 ± 2% +19.0% 19266 perf-stat.ps.page-faults 217584 +1.3% 220460 perf-stat.ps.task-clock 4.156e+12 ± 4% +320.9% 1.749e+13 perf-stat.total.instructions 22.02 -17.7 4.28 perf-profile.calltrace.cycles-pp.enqueue_task_fair.activate_task.ttwu_do_activate.sched_ttwu_pending.__flush_smp_call_function_queue 18.48 -15.1 3.42 perf-profile.calltrace.cycles-pp.enqueue_entity.enqueue_task_fair.activate_task.ttwu_do_activate.sched_ttwu_pending 24.23 -15.0 9.26 perf-profile.calltrace.cycles-pp.flush_smp_call_function_queue.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify 19.99 -14.3 5.66 perf-profile.calltrace.cycles-pp.activate_task.ttwu_do_activate.sched_ttwu_pending.__flush_smp_call_function_queue.flush_smp_call_function_queue 20.09 -14.2 5.84 perf-profile.calltrace.cycles-pp.ttwu_do_activate.sched_ttwu_pending.__flush_smp_call_function_queue.flush_smp_call_function_queue.do_idle 20.68 -13.6 7.10 perf-profile.calltrace.cycles-pp.sched_ttwu_pending.__flush_smp_call_function_queue.flush_smp_call_function_queue.do_idle.cpu_startup_entry 21.09 -12.8 8.24 perf-profile.calltrace.cycles-pp.__flush_smp_call_function_queue.flush_smp_call_function_queue.do_idle.cpu_startup_entry.start_secondary 10.08 -10.1 0.00 perf-profile.calltrace.cycles-pp.update_cfs_group.dequeue_entity.dequeue_task_fair.__schedule.schedule 9.04 -9.0 0.00 perf-profile.calltrace.cycles-pp.update_cfs_group.enqueue_entity.enqueue_task_fair.activate_task.ttwu_do_activate 37.64 -7.7 29.92 perf-profile.calltrace.cycles-pp.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify 37.84 -7.7 30.19 perf-profile.calltrace.cycles-pp.secondary_startup_64_no_verify 37.66 -7.7 30.01 perf-profile.calltrace.cycles-pp.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify 37.67 -7.6 30.04 
perf-profile.calltrace.cycles-pp.start_secondary.secondary_startup_64_no_verify 10.34 -7.3 3.07 perf-profile.calltrace.cycles-pp.dequeue_task_fair.__schedule.schedule.pipe_read.vfs_read 10.14 -7.1 3.07 perf-profile.calltrace.cycles-pp.dequeue_task_fair.__schedule.schedule.pipe_write.vfs_write 8.31 -6.8 1.48 perf-profile.calltrace.cycles-pp.update_load_avg.enqueue_entity.enqueue_task_fair.activate_task.ttwu_do_activate 8.11 -5.8 2.26 perf-profile.calltrace.cycles-pp.dequeue_entity.dequeue_task_fair.__schedule.schedule.pipe_read 7.93 -5.7 2.26 perf-profile.calltrace.cycles-pp.dequeue_entity.dequeue_task_fair.__schedule.schedule.pipe_write 12.83 -4.9 7.91 perf-profile.calltrace.cycles-pp.__schedule.schedule.pipe_read.vfs_read.ksys_read 12.92 -4.8 8.07 perf-profile.calltrace.cycles-pp.schedule.pipe_read.vfs_read.ksys_read.do_syscall_64 12.61 -4.7 7.88 perf-profile.calltrace.cycles-pp.__schedule.schedule.pipe_write.vfs_write.ksys_write 12.69 -4.6 8.04 perf-profile.calltrace.cycles-pp.schedule.pipe_write.vfs_write.ksys_write.do_syscall_64 4.66 -2.6 2.06 perf-profile.calltrace.cycles-pp.update_load_avg.dequeue_entity.dequeue_task_fair.__schedule.schedule 1.94 -1.3 0.66 perf-profile.calltrace.cycles-pp.set_next_entity.pick_next_task_fair.__schedule.schedule_idle.do_idle 2.01 -1.1 0.91 perf-profile.calltrace.cycles-pp.pick_next_task_fair.__schedule.schedule_idle.do_idle.cpu_startup_entry 5.61 -1.0 4.65 perf-profile.calltrace.cycles-pp.intel_idle.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle 0.61 +0.2 0.79 perf-profile.calltrace.cycles-pp._raw_spin_lock.__schedule.schedule_idle.do_idle.cpu_startup_entry 1.18 ± 2% +0.2 1.38 perf-profile.calltrace.cycles-pp.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.read 1.17 ± 2% +0.2 1.39 perf-profile.calltrace.cycles-pp.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.write 0.56 ± 4% +0.3 0.82 ± 2% 
perf-profile.calltrace.cycles-pp._copy_from_iter.copy_page_from_iter.pipe_write.vfs_write.ksys_write 0.59 ± 4% +0.3 0.87 ± 2% perf-profile.calltrace.cycles-pp.copy_page_from_iter.pipe_write.vfs_write.ksys_write.do_syscall_64 1.22 ± 2% +0.3 1.54 perf-profile.calltrace.cycles-pp.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.read 1.22 ± 2% +0.3 1.54 perf-profile.calltrace.cycles-pp.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.write 0.97 +0.3 1.30 perf-profile.calltrace.cycles-pp.sched_mm_cid_migrate_to.activate_task.ttwu_do_activate.sched_ttwu_pending.__flush_smp_call_function_queue 26.91 +0.4 27.31 perf-profile.calltrace.cycles-pp.pipe_write.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe 2.19 ± 2% +0.4 2.61 perf-profile.calltrace.cycles-pp.switch_fpu_return.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe 0.00 +0.5 0.52 perf-profile.calltrace.cycles-pp.__switch_to 0.00 +0.5 0.53 perf-profile.calltrace.cycles-pp._raw_spin_lock.__schedule.schedule.pipe_write.vfs_write 0.00 +0.5 0.53 perf-profile.calltrace.cycles-pp._raw_spin_lock.__schedule.schedule.pipe_read.vfs_read 0.00 +0.6 0.62 perf-profile.calltrace.cycles-pp.__update_load_avg_cfs_rq.update_load_avg.enqueue_entity.enqueue_task_fair.activate_task 0.00 +0.6 0.63 perf-profile.calltrace.cycles-pp.finish_task_switch.__schedule.schedule.pipe_read.vfs_read 0.66 ± 4% +0.6 1.29 perf-profile.calltrace.cycles-pp.menu_select.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary 0.00 +0.6 0.64 perf-profile.calltrace.cycles-pp.finish_task_switch.__schedule.schedule.pipe_write.vfs_write 0.00 +0.6 0.64 perf-profile.calltrace.cycles-pp.tick_nohz_get_sleep_length.menu_select.cpuidle_idle_call.do_idle.cpu_startup_entry 0.34 ± 70% +0.7 1.04 perf-profile.calltrace.cycles-pp.prepare_to_wait_event.pipe_write.vfs_write.ksys_write.do_syscall_64 0.00 +0.7 0.72 
perf-profile.calltrace.cycles-pp.nohz_run_idle_balance.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify 0.00 +0.7 0.73 perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.prepare_to_wait_event.pipe_read.vfs_read.ksys_read 0.00 +0.7 0.74 perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.prepare_to_wait_event.pipe_write.vfs_write.ksys_write 0.00 +0.8 0.76 perf-profile.calltrace.cycles-pp.__update_idle_core.pick_next_task_idle.__schedule.schedule.pipe_write 0.00 +0.8 0.77 ± 2% perf-profile.calltrace.cycles-pp.__update_idle_core.pick_next_task_idle.__schedule.schedule.pipe_read 0.00 +0.8 0.77 perf-profile.calltrace.cycles-pp.copyout._copy_to_iter.copy_page_to_iter.pipe_read.vfs_read 0.00 +0.8 0.78 perf-profile.calltrace.cycles-pp.pick_next_task_idle.__schedule.schedule.pipe_write.vfs_write 0.00 +0.8 0.78 perf-profile.calltrace.cycles-pp.pick_next_task_idle.__schedule.schedule.pipe_read.vfs_read 1.41 ± 2% +0.8 2.19 perf-profile.calltrace.cycles-pp.migrate_task_rq_fair.set_task_cpu.try_to_wake_up.autoremove_wake_function.__wake_up_common 1.60 ± 2% +0.8 2.38 perf-profile.calltrace.cycles-pp.set_task_cpu.try_to_wake_up.autoremove_wake_function.__wake_up_common.__wake_up_common_lock 0.00 +0.8 0.82 perf-profile.calltrace.cycles-pp._copy_to_iter.copy_page_to_iter.pipe_read.vfs_read.ksys_read 27.39 +0.8 28.21 perf-profile.calltrace.cycles-pp.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe.write 0.00 +0.8 0.83 perf-profile.calltrace.cycles-pp.__switch_to_asm 0.00 +0.8 0.84 perf-profile.calltrace.cycles-pp.prepare_task_switch.__schedule.schedule_idle.do_idle.cpu_startup_entry 0.00 +0.9 0.86 perf-profile.calltrace.cycles-pp.copy_page_to_iter.pipe_read.vfs_read.ksys_read.do_syscall_64 27.50 +0.9 28.44 perf-profile.calltrace.cycles-pp.pipe_read.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe 27.56 +1.0 28.52 perf-profile.calltrace.cycles-pp.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe.write 0.00 +1.0 
0.97 perf-profile.calltrace.cycles-pp.llist_reverse_order.__flush_smp_call_function_queue.flush_smp_call_function_queue.do_idle.cpu_startup_entry 0.00 +1.0 1.02 perf-profile.calltrace.cycles-pp.prepare_to_wait_event.pipe_read.vfs_read.ksys_read.do_syscall_64 4.59 +1.0 5.60 perf-profile.calltrace.cycles-pp.__schedule.schedule_idle.do_idle.cpu_startup_entry.start_secondary 0.00 +1.1 1.14 perf-profile.calltrace.cycles-pp.wake_affine.select_task_rq_fair.select_task_rq.try_to_wake_up.autoremove_wake_function 0.00 +1.1 1.15 perf-profile.calltrace.cycles-pp.switch_mm_irqs_off.__schedule.schedule_idle.do_idle.cpu_startup_entry 4.66 +1.2 5.82 perf-profile.calltrace.cycles-pp.schedule_idle.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify 0.00 +1.2 1.20 perf-profile.calltrace.cycles-pp.remove_entity_load_avg.migrate_task_rq_fair.set_task_cpu.try_to_wake_up.autoremove_wake_function 1.03 ± 5% +1.2 2.27 ± 2% perf-profile.calltrace.cycles-pp.stress_switch_pipe 27.94 +1.3 29.20 perf-profile.calltrace.cycles-pp.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe.read 28.83 +1.4 30.20 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.write 0.00 +1.4 1.38 ± 3% perf-profile.calltrace.cycles-pp.__bitmap_andnot.select_idle_core.select_idle_cpu.select_idle_sibling.select_task_rq_fair 29.00 +1.4 30.43 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.write 28.10 +1.4 29.54 perf-profile.calltrace.cycles-pp.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe.read 0.00 +1.5 1.50 perf-profile.calltrace.cycles-pp.llist_add_batch.__smp_call_single_queue.ttwu_queue_wakelist.try_to_wake_up.autoremove_wake_function 1.44 +1.6 3.09 perf-profile.calltrace.cycles-pp.ttwu_queue_wakelist.try_to_wake_up.autoremove_wake_function.__wake_up_common.__wake_up_common_lock 29.38 +1.8 31.21 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.read 29.55 +1.9 31.44 
perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.read 0.00 +2.3 2.30 perf-profile.calltrace.cycles-pp.__smp_call_single_queue.ttwu_queue_wakelist.try_to_wake_up.autoremove_wake_function.__wake_up_common 29.58 +2.4 32.02 perf-profile.calltrace.cycles-pp.write 12.07 +2.9 15.00 perf-profile.calltrace.cycles-pp.try_to_wake_up.autoremove_wake_function.__wake_up_common.__wake_up_common_lock.pipe_write 12.15 +2.9 15.10 perf-profile.calltrace.cycles-pp.try_to_wake_up.autoremove_wake_function.__wake_up_common.__wake_up_common_lock.pipe_read 12.09 +3.0 15.06 perf-profile.calltrace.cycles-pp.autoremove_wake_function.__wake_up_common.__wake_up_common_lock.pipe_write.vfs_write 12.17 +3.0 15.16 perf-profile.calltrace.cycles-pp.autoremove_wake_function.__wake_up_common.__wake_up_common_lock.pipe_read.vfs_read 30.16 +3.0 33.17 perf-profile.calltrace.cycles-pp.read 12.31 +3.3 15.64 perf-profile.calltrace.cycles-pp.__wake_up_common.__wake_up_common_lock.pipe_write.vfs_write.ksys_write 12.50 +3.4 15.86 perf-profile.calltrace.cycles-pp.__wake_up_common_lock.pipe_write.vfs_write.ksys_write.do_syscall_64 12.40 +3.4 15.76 perf-profile.calltrace.cycles-pp.__wake_up_common.__wake_up_common_lock.pipe_read.vfs_read.ksys_read 12.60 +3.4 16.00 perf-profile.calltrace.cycles-pp.__wake_up_common_lock.pipe_read.vfs_read.ksys_read.do_syscall_64 5.68 +3.5 9.22 perf-profile.calltrace.cycles-pp.available_idle_cpu.select_idle_core.select_idle_cpu.select_idle_sibling.select_task_rq_fair 6.99 +3.8 10.76 perf-profile.calltrace.cycles-pp.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry 7.07 +3.8 10.85 perf-profile.calltrace.cycles-pp.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary 7.98 +4.7 12.72 perf-profile.calltrace.cycles-pp.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify 8.56 +5.0 13.55 
perf-profile.calltrace.cycles-pp.select_idle_core.select_idle_cpu.select_idle_sibling.select_task_rq_fair.select_task_rq 0.00 +5.1 5.14 perf-profile.calltrace.cycles-pp.poll_idle.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle 10.17 +5.7 15.90 perf-profile.calltrace.cycles-pp.select_idle_cpu.select_idle_sibling.select_task_rq_fair.select_task_rq.try_to_wake_up 11.79 +6.6 18.37 perf-profile.calltrace.cycles-pp.select_idle_sibling.select_task_rq_fair.select_task_rq.try_to_wake_up.autoremove_wake_function 12.90 +7.0 19.90 perf-profile.calltrace.cycles-pp.select_task_rq_fair.select_task_rq.try_to_wake_up.autoremove_wake_function.__wake_up_common 13.07 +7.0 20.12 perf-profile.calltrace.cycles-pp.select_task_rq.try_to_wake_up.autoremove_wake_function.__wake_up_common.__wake_up_common_lock 27.98 -26.6 1.35 ± 5% perf-profile.children.cycles-pp.update_cfs_group 24.39 -18.8 5.58 perf-profile.children.cycles-pp.enqueue_task_fair 25.65 -18.4 7.23 perf-profile.children.cycles-pp.activate_task 25.77 -18.2 7.60 perf-profile.children.cycles-pp.ttwu_do_activate 25.03 -16.7 8.29 perf-profile.children.cycles-pp.sched_ttwu_pending 25.58 -15.9 9.70 perf-profile.children.cycles-pp.__flush_smp_call_function_queue 20.04 -15.7 4.30 perf-profile.children.cycles-pp.enqueue_entity 24.41 -15.0 9.38 perf-profile.children.cycles-pp.flush_smp_call_function_queue 20.52 -14.4 6.17 perf-profile.children.cycles-pp.dequeue_task_fair 16.78 -11.6 5.22 perf-profile.children.cycles-pp.update_load_avg 16.11 -11.5 4.60 perf-profile.children.cycles-pp.dequeue_entity 25.62 -9.4 16.18 perf-profile.children.cycles-pp.schedule 30.10 -8.4 21.65 perf-profile.children.cycles-pp.__schedule 37.82 -7.7 30.12 perf-profile.children.cycles-pp.do_idle 37.84 -7.7 30.19 perf-profile.children.cycles-pp.secondary_startup_64_no_verify 37.84 -7.7 30.19 perf-profile.children.cycles-pp.cpu_startup_entry 37.67 -7.6 30.04 perf-profile.children.cycles-pp.start_secondary 4.42 -2.9 1.47 
perf-profile.children.cycles-pp.__sysvec_call_function_single 4.45 -2.9 1.58 perf-profile.children.cycles-pp.sysvec_call_function_single 4.56 -2.7 1.88 perf-profile.children.cycles-pp.asm_sysvec_call_function_single 2.32 -1.4 0.92 perf-profile.children.cycles-pp.set_next_entity 2.24 ± 3% -1.3 0.96 perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath 5.64 -1.0 4.68 perf-profile.children.cycles-pp.intel_idle 2.54 -0.8 1.75 perf-profile.children.cycles-pp.pick_next_task_fair 1.30 ± 6% -0.7 0.62 perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt 1.24 ± 6% -0.7 0.58 perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt 0.69 ± 6% -0.4 0.32 ± 2% perf-profile.children.cycles-pp.__sysvec_apic_timer_interrupt 0.68 ± 6% -0.4 0.31 ± 2% perf-profile.children.cycles-pp.hrtimer_interrupt 0.63 ± 6% -0.4 0.28 ± 2% perf-profile.children.cycles-pp.__hrtimer_run_queues 0.58 ± 7% -0.3 0.24 perf-profile.children.cycles-pp.tick_sched_handle 0.58 ± 7% -0.3 0.24 ± 2% perf-profile.children.cycles-pp.update_process_times 0.59 ± 6% -0.3 0.25 ± 2% perf-profile.children.cycles-pp.tick_sched_timer 0.54 ± 6% -0.3 0.21 ± 2% perf-profile.children.cycles-pp.scheduler_tick 0.42 ± 4% -0.3 0.11 ± 4% perf-profile.children.cycles-pp.__task_rq_lock 0.53 ± 5% -0.3 0.24 perf-profile.children.cycles-pp.__do_softirq 0.55 ± 5% -0.3 0.28 perf-profile.children.cycles-pp.__irq_exit_rcu 0.24 ± 3% -0.1 0.10 ± 3% perf-profile.children.cycles-pp.perf_trace_sched_stat_runtime 0.15 ± 5% -0.1 0.06 ± 7% perf-profile.children.cycles-pp.task_work_run 0.15 ± 8% -0.1 0.06 ± 7% perf-profile.children.cycles-pp.task_mm_cid_work 0.16 ± 7% -0.1 0.11 perf-profile.children.cycles-pp.exit_to_user_mode_loop 0.10 ± 5% -0.0 0.06 perf-profile.children.cycles-pp.tick_nohz_tick_stopped 0.24 ± 4% -0.0 0.21 perf-profile.children.cycles-pp.tracing_gen_ctx_irq_test 0.07 ± 5% -0.0 0.06 perf-profile.children.cycles-pp.inode_needs_update_time 0.33 ± 2% +0.0 0.35 perf-profile.children.cycles-pp.cpus_share_cache 0.12 
± 4% +0.0 0.15 ± 3% perf-profile.children.cycles-pp.file_update_time 0.08 ± 8% +0.0 0.11 ± 6% perf-profile.children.cycles-pp.anon_pipe_buf_release 0.09 ± 15% +0.0 0.13 ± 2% perf-profile.children.cycles-pp.hrtimer_get_next_event 0.16 ± 4% +0.0 0.19 perf-profile.children.cycles-pp.touch_atime 0.11 ± 4% +0.0 0.15 perf-profile.children.cycles-pp.atime_needs_update 0.06 +0.0 0.11 ± 4% perf-profile.children.cycles-pp.__get_task_ioprio 0.06 ± 9% +0.0 0.10 ± 4% perf-profile.children.cycles-pp.resched_curr 0.00 +0.1 0.05 perf-profile.children.cycles-pp.irqtime_account_irq 0.00 +0.1 0.05 perf-profile.children.cycles-pp.perf_swevent_event 0.00 +0.1 0.05 perf-profile.children.cycles-pp._find_next_and_bit 0.00 +0.1 0.05 perf-profile.children.cycles-pp.can_stop_idle_tick 0.00 +0.1 0.05 perf-profile.children.cycles-pp.tsc_verify_tsc_adjust 0.00 +0.1 0.05 perf-profile.children.cycles-pp.perf_trace_run_bpf_submit 0.00 +0.1 0.05 perf-profile.children.cycles-pp.kill_fasync 0.00 +0.1 0.06 ± 8% perf-profile.children.cycles-pp.arch_cpu_idle_enter 0.46 +0.1 0.52 perf-profile.children.cycles-pp.perf_tp_event 0.00 +0.1 0.06 perf-profile.children.cycles-pp.sched_clock_noinstr 0.00 +0.1 0.06 perf-profile.children.cycles-pp.save_fpregs_to_fpstate 0.00 +0.1 0.06 perf-profile.children.cycles-pp.rb_next 0.00 +0.1 0.06 ± 6% perf-profile.children.cycles-pp.__cgroup_account_cputime 0.00 +0.1 0.06 ± 7% perf-profile.children.cycles-pp.rcu_note_context_switch 0.02 ± 99% +0.1 0.09 ± 4% perf-profile.children.cycles-pp.aa_file_perm 0.00 +0.1 0.07 ± 7% perf-profile.children.cycles-pp.put_prev_task_fair 0.00 +0.1 0.07 ± 7% perf-profile.children.cycles-pp.put_prev_entity 0.00 +0.1 0.07 ± 7% perf-profile.children.cycles-pp.perf_exclude_event 0.12 ± 4% +0.1 0.20 ± 4% perf-profile.children.cycles-pp.cpuacct_charge 0.00 +0.1 0.07 perf-profile.children.cycles-pp.rcu_all_qs 0.00 +0.1 0.07 perf-profile.children.cycles-pp.mm_cid_get 0.00 +0.1 0.07 perf-profile.children.cycles-pp.native_apic_msr_eoi_write 0.00 +0.1 
0.07 perf-profile.children.cycles-pp.perf_trace_buf_alloc 0.00 +0.1 0.07 ± 11% perf-profile.children.cycles-pp.mutex_spin_on_owner 0.10 ± 8% +0.1 0.17 ± 2% perf-profile.children.cycles-pp.native_irq_return_iret 0.02 ± 99% +0.1 0.10 perf-profile.children.cycles-pp.current_time 0.00 +0.1 0.08 perf-profile.children.cycles-pp.tick_nohz_stop_idle 0.00 +0.1 0.08 perf-profile.children.cycles-pp.__x2apic_send_IPI_dest 0.00 +0.1 0.10 ± 5% perf-profile.children.cycles-pp.nr_iowait_cpu 0.00 +0.1 0.10 ± 23% perf-profile.children.cycles-pp.read@plt 0.00 +0.1 0.10 perf-profile.children.cycles-pp.syscall_enter_from_user_mode 0.00 +0.1 0.10 perf-profile.children.cycles-pp.__hrtimer_next_event_base 0.00 +0.1 0.10 ± 3% perf-profile.children.cycles-pp.perf_trace_sched_switch 0.45 ± 2% +0.1 0.56 perf-profile.children.cycles-pp.task_h_load 0.12 ± 4% +0.1 0.22 perf-profile.children.cycles-pp.avg_vruntime 0.12 ± 3% +0.1 0.22 ± 2% perf-profile.children.cycles-pp.attach_entity_load_avg 0.27 ± 3% +0.1 0.38 ± 2% perf-profile.children.cycles-pp.apparmor_file_permission 0.00 +0.1 0.11 perf-profile.children.cycles-pp.ct_kernel_enter 0.08 ± 6% +0.1 0.19 perf-profile.children.cycles-pp.__calc_delta 0.00 +0.1 0.12 perf-profile.children.cycles-pp.ct_kernel_exit_state 0.00 +0.1 0.12 ± 3% perf-profile.children.cycles-pp.__list_add_valid 0.14 ± 3% +0.1 0.26 perf-profile.children.cycles-pp.update_entity_lag 0.02 ±141% +0.1 0.14 ± 3% perf-profile.children.cycles-pp.__list_del_entry_valid 0.12 ± 12% +0.1 0.25 perf-profile.children.cycles-pp.get_next_timer_interrupt 0.00 +0.1 0.13 perf-profile.children.cycles-pp.ct_idle_exit 0.00 +0.1 0.13 perf-profile.children.cycles-pp.__cond_resched 0.00 +0.1 0.13 ± 2% perf-profile.children.cycles-pp.cpuidle_governor_latency_req 0.13 ± 3% +0.1 0.26 perf-profile.children.cycles-pp.place_entity 0.06 ± 7% +0.1 0.20 ± 2% perf-profile.children.cycles-pp.pick_eevdf 0.28 ± 3% +0.1 0.42 perf-profile.children.cycles-pp.security_file_permission 0.10 ± 5% +0.1 0.24 
perf-profile.children.cycles-pp.check_preempt_curr
0.04 ± 44%  +0.1  0.19 ± 2%  perf-profile.children.cycles-pp.__dequeue_entity
0.00  +0.2  0.15  perf-profile.children.cycles-pp._raw_spin_trylock
0.07 ± 7%  +0.2  0.22  perf-profile.children.cycles-pp.hrtimer_next_event_without
0.05 ± 8%  +0.2  0.21 ± 2%  perf-profile.children.cycles-pp.call_cpuidle
0.06 ± 6%  +0.2  0.22  perf-profile.children.cycles-pp.read_tsc
0.04 ± 44%  +0.2  0.21 ± 2%  perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack
0.00  +0.2  0.16 ± 3%  perf-profile.children.cycles-pp.syscall_return_via_sysret
0.17 ± 4%  +0.2  0.34 ± 2%  perf-profile.children.cycles-pp.update_min_vruntime
0.28 ± 7%  +0.2  0.45 ± 3%  perf-profile.children.cycles-pp.copyin
0.00  +0.2  0.17  perf-profile.children.cycles-pp.__rdgsbase_inactive
0.00  +0.2  0.17  perf-profile.children.cycles-pp.local_clock_noinstr
0.19 ± 9%  +0.2  0.36  perf-profile.children.cycles-pp.tick_nohz_next_event
0.44 ± 2%  +0.2  0.61  perf-profile.children.cycles-pp.update_rq_clock_task
0.04 ± 44%  +0.2  0.22  perf-profile.children.cycles-pp.finish_wait
0.02 ±141%  +0.2  0.20  perf-profile.children.cycles-pp._raw_spin_unlock_irqrestore
0.08 ± 6%  +0.2  0.27  perf-profile.children.cycles-pp.tick_nohz_idle_exit
0.23 ± 2%  +0.2  0.42  perf-profile.children.cycles-pp.__fget_light
0.07 ± 6%  +0.2  0.28  perf-profile.children.cycles-pp.syscall_exit_to_user_mode_prepare
0.20 ± 2%  +0.2  0.41  perf-profile.children.cycles-pp._raw_spin_lock_irq
0.94  +0.2  1.15  perf-profile.children.cycles-pp.wake_affine
0.23 ± 3%  +0.2  0.45  perf-profile.children.cycles-pp.__fdget_pos
0.06  +0.2  0.28 ± 2%  perf-profile.children.cycles-pp.newidle_balance
0.09 ± 4%  +0.2  0.31  perf-profile.children.cycles-pp.ktime_get
0.00  +0.2  0.23 ± 9%  perf-profile.children.cycles-pp.__mutex_lock
0.16 ± 5%  +0.2  0.39  perf-profile.children.cycles-pp.mutex_lock
0.06 ± 7%  +0.2  0.30  perf-profile.children.cycles-pp.__wrgsbase_inactive
0.11 ± 7%  +0.2  0.35  perf-profile.children.cycles-pp.tick_nohz_idle_enter
0.56 ± 4%  +0.3  0.83 ± 2%  perf-profile.children.cycles-pp._copy_from_iter
0.59 ± 3%  +0.3  0.88 ± 2%  perf-profile.children.cycles-pp.copy_page_from_iter
0.12 ± 6%  +0.3  0.42 ± 2%  perf-profile.children.cycles-pp.mutex_unlock
0.12 ± 4%  +0.3  0.44  perf-profile.children.cycles-pp.__entry_text_start
0.09 ± 7%  +0.3  0.42  perf-profile.children.cycles-pp.os_xsave
1.24  +0.3  1.57  perf-profile.children.cycles-pp.sched_mm_cid_migrate_to
0.21 ± 3%  +0.3  0.56  perf-profile.children.cycles-pp.__update_load_avg_se
0.08 ± 6%  +0.4  0.43  perf-profile.children.cycles-pp.__enqueue_entity
0.94  +0.4  1.32  perf-profile.children.cycles-pp.update_curr
0.28 ± 7%  +0.4  0.66  perf-profile.children.cycles-pp.tick_nohz_get_sleep_length
2.37 ± 2%  +0.4  2.79  perf-profile.children.cycles-pp.exit_to_user_mode_prepare
0.15 ± 3%  +0.4  0.58  perf-profile.children.cycles-pp.sched_clock
0.17 ± 6%  +0.4  0.60  perf-profile.children.cycles-pp.___perf_sw_event
2.19 ± 2%  +0.4  2.62  perf-profile.children.cycles-pp.switch_fpu_return
0.23 ± 2%  +0.4  0.67  perf-profile.children.cycles-pp.update_rq_clock
0.16 ± 4%  +0.4  0.61  perf-profile.children.cycles-pp.native_sched_clock
0.30  +0.4  0.74  perf-profile.children.cycles-pp.call_function_single_prep_ipi
26.92  +0.5  27.38  perf-profile.children.cycles-pp.pipe_write
0.33 ± 5%  +0.5  0.78  perf-profile.children.cycles-pp.copyout
1.53  +0.5  2.00  perf-profile.children.cycles-pp.finish_task_switch
0.35 ± 5%  +0.5  0.83  perf-profile.children.cycles-pp._copy_to_iter
0.18 ± 4%  +0.5  0.66  perf-profile.children.cycles-pp.reweight_entity
0.26  +0.5  0.76  perf-profile.children.cycles-pp.nohz_run_idle_balance
0.38 ± 5%  +0.5  0.87  perf-profile.children.cycles-pp.copy_page_to_iter
0.18 ± 3%  +0.5  0.72  perf-profile.children.cycles-pp.sched_clock_cpu
0.65  +0.5  1.19  perf-profile.children.cycles-pp._find_next_bit
0.62 ± 2%  +0.6  1.21  perf-profile.children.cycles-pp.remove_entity_load_avg
0.70 ± 3%  +0.6  1.32  perf-profile.children.cycles-pp.menu_select
2.44 ± 2%  +0.7  3.11  perf-profile.children.cycles-pp.syscall_exit_to_user_mode
0.39  +0.7  1.10  perf-profile.children.cycles-pp.llist_reverse_order
27.52  +0.7  28.25  perf-profile.children.cycles-pp.vfs_write
1.41 ± 2%  +0.8  2.20  perf-profile.children.cycles-pp.migrate_task_rq_fair
1.60 ± 2%  +0.8  2.39  perf-profile.children.cycles-pp.set_task_cpu
27.69  +0.9  28.55  perf-profile.children.cycles-pp.ksys_write
0.62  +0.9  1.52  perf-profile.children.cycles-pp.llist_add_batch
27.52  +1.0  28.50  perf-profile.children.cycles-pp.pipe_read
0.61  +1.0  1.63  perf-profile.children.cycles-pp.__switch_to_asm
0.69 ± 2%  +1.0  1.72  perf-profile.children.cycles-pp.__switch_to
0.64 ± 2%  +1.1  1.70  perf-profile.children.cycles-pp.__update_load_avg_cfs_rq
0.36 ± 5%  +1.1  1.42 ± 3%  perf-profile.children.cycles-pp.__bitmap_andnot
0.66 ± 2%  +1.1  1.75  perf-profile.children.cycles-pp.prepare_task_switch
0.96  +1.1  2.09  perf-profile.children.cycles-pp.prepare_to_wait_event
0.38 ± 12%  +1.2  1.54  perf-profile.children.cycles-pp.__update_idle_core
4.69  +1.2  5.87  perf-profile.children.cycles-pp.schedule_idle
0.38 ± 13%  +1.2  1.56  perf-profile.children.cycles-pp.pick_next_task_idle
1.03 ± 5%  +1.2  2.28  perf-profile.children.cycles-pp.stress_switch_pipe
27.96  +1.3  29.22  perf-profile.children.cycles-pp.vfs_read
0.64 ± 2%  +1.3  1.95  perf-profile.children.cycles-pp.switch_mm_irqs_off
0.95  +1.4  2.31  perf-profile.children.cycles-pp.__smp_call_single_queue
28.11  +1.4  29.54  perf-profile.children.cycles-pp.ksys_read
1.45  +1.6  3.10  perf-profile.children.cycles-pp.ttwu_queue_wakelist
1.29  +2.0  3.32  perf-profile.children.cycles-pp._raw_spin_lock_irqsave
29.82  +2.8  32.66  perf-profile.children.cycles-pp.write
58.39  +3.1  61.48  perf-profile.children.cycles-pp.do_syscall_64
58.71  +3.2  61.92  perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
30.42  +3.4  33.80  perf-profile.children.cycles-pp.read
7.53  +3.6  11.15  perf-profile.children.cycles-pp.available_idle_cpu
7.10  +3.8  10.91  perf-profile.children.cycles-pp.cpuidle_enter
7.05  +3.8  10.88  perf-profile.children.cycles-pp.cpuidle_enter_state
8.02  +4.8  12.81  perf-profile.children.cycles-pp.cpuidle_idle_call
8.62  +5.1  13.76  perf-profile.children.cycles-pp.select_idle_core
0.00  +5.2  5.20  perf-profile.children.cycles-pp.poll_idle
10.20  +5.8  16.02  perf-profile.children.cycles-pp.select_idle_cpu
24.24  +5.9  30.15  perf-profile.children.cycles-pp.try_to_wake_up
24.26  +6.0  30.22  perf-profile.children.cycles-pp.autoremove_wake_function
11.80  +6.6  18.41  perf-profile.children.cycles-pp.select_idle_sibling
24.71  +6.7  31.41  perf-profile.children.cycles-pp.__wake_up_common
25.12  +6.8  31.91  perf-profile.children.cycles-pp.__wake_up_common_lock
12.91  +7.0  19.92  perf-profile.children.cycles-pp.select_task_rq_fair
13.08  +7.0  20.13  perf-profile.children.cycles-pp.select_task_rq

27.98  -26.6  1.33 ± 5%  perf-profile.self.cycles-pp.update_cfs_group
15.77  -13.3  2.50  perf-profile.self.cycles-pp.update_load_avg
4.28  -2.9  1.36  perf-profile.self.cycles-pp.try_to_wake_up
2.24 ± 3%  -1.3  0.96  perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
5.64  -1.0  4.67  perf-profile.self.cycles-pp.intel_idle
0.24 ± 3%  -0.1  0.10 ± 5%  perf-profile.self.cycles-pp.perf_trace_sched_stat_runtime
0.21 ± 4%  -0.1  0.10  perf-profile.self.cycles-pp.perf_trace_sched_wakeup_template
0.14 ± 8%  -0.1  0.06  perf-profile.self.cycles-pp.task_mm_cid_work
0.23 ± 4%  -0.0  0.20 ± 2%  perf-profile.self.cycles-pp.tracing_gen_ctx_irq_test
0.07 ± 5%  -0.0  0.06  perf-profile.self.cycles-pp.inode_needs_update_time
0.06 ± 6%  +0.0  0.07  perf-profile.self.cycles-pp.perf_adjust_freq_unthr_context
0.13 ± 2%  +0.0  0.15 ± 2%  perf-profile.self.cycles-pp.set_task_cpu
0.06  +0.0  0.08 ± 5%  perf-profile.self.cycles-pp.update_entity_lag
0.08 ± 8%  +0.0  0.11 ± 3%  perf-profile.self.cycles-pp.anon_pipe_buf_release
0.05 ± 7%  +0.0  0.10  perf-profile.self.cycles-pp.__get_task_ioprio
0.05 ± 7%  +0.0  0.10  perf-profile.self.cycles-pp.resched_curr
0.00  +0.1  0.05  perf-profile.self.cycles-pp.exit_to_user_mode_prepare
0.00  +0.1  0.05  perf-profile.self.cycles-pp.__smp_call_single_queue
0.00  +0.1  0.05  perf-profile.self.cycles-pp.copy_page_from_iter
0.00  +0.1  0.05  perf-profile.self.cycles-pp._copy_to_iter
0.00  +0.1  0.05  perf-profile.self.cycles-pp.sched_clock
0.00  +0.1  0.05  perf-profile.self.cycles-pp.tick_nohz_next_event
0.00  +0.1  0.05  perf-profile.self.cycles-pp.tick_nohz_idle_enter
0.00  +0.1  0.05  perf-profile.self.cycles-pp.cpuidle_governor_latency_req
0.00  +0.1  0.05  perf-profile.self.cycles-pp.ct_kernel_enter
0.00  +0.1  0.05  perf-profile.self.cycles-pp.rb_next
0.00  +0.1  0.06 ± 9%  perf-profile.self.cycles-pp.tick_nohz_idle_exit
0.02 ± 99%  +0.1  0.08 ± 4%  perf-profile.self.cycles-pp.aa_file_perm
0.00  +0.1  0.06 ± 8%  perf-profile.self.cycles-pp.get_next_timer_interrupt
0.00  +0.1  0.06 ± 8%  perf-profile.self.cycles-pp.perf_exclude_event
0.32 ± 2%  +0.1  0.38 ± 2%  perf-profile.self.cycles-pp._copy_from_iter
0.00  +0.1  0.06  perf-profile.self.cycles-pp.syscall_exit_to_user_mode
0.00  +0.1  0.06  perf-profile.self.cycles-pp.rcu_note_context_switch
0.20 ± 2%  +0.1  0.27  perf-profile.self.cycles-pp.perf_tp_event
0.12 ± 4%  +0.1  0.19 ± 6%  perf-profile.self.cycles-pp.cpuacct_charge
0.00  +0.1  0.07 ± 7%  perf-profile.self.cycles-pp.mm_cid_get
0.07 ± 5%  +0.1  0.14  perf-profile.self.cycles-pp.ttwu_do_activate
0.00  +0.1  0.07 ± 5%  perf-profile.self.cycles-pp.__hrtimer_next_event_base
0.06  +0.1  0.13  perf-profile.self.cycles-pp.__entry_text_start
0.00  +0.1  0.07  perf-profile.self.cycles-pp.activate_task
0.00  +0.1  0.07  perf-profile.self.cycles-pp.__cond_resched
0.00  +0.1  0.07  perf-profile.self.cycles-pp.current_time
0.00  +0.1  0.07  perf-profile.self.cycles-pp.native_apic_msr_eoi_write
0.00  +0.1  0.07 ± 11%  perf-profile.self.cycles-pp.mutex_spin_on_owner
0.10 ± 8%  +0.1  0.17 ± 2%  perf-profile.self.cycles-pp.native_irq_return_iret
0.05  +0.1  0.13  perf-profile.self.cycles-pp.place_entity
0.00  +0.1  0.08  perf-profile.self.cycles-pp.__x2apic_send_IPI_dest
0.09 ± 5%  +0.1  0.17 ± 2%  perf-profile.self.cycles-pp.do_syscall_64
0.15 ± 6%  +0.1  0.24 ± 2%  perf-profile.self.cycles-pp.schedule
0.00  +0.1  0.09 ± 7%  perf-profile.self.cycles-pp.perf_trace_sched_switch
0.00  +0.1  0.09 ± 23%  perf-profile.self.cycles-pp.read@plt
0.00  +0.1  0.09  perf-profile.self.cycles-pp.ksys_read
0.00  +0.1  0.09  perf-profile.self.cycles-pp.ksys_write
0.00  +0.1  0.09  perf-profile.self.cycles-pp.ktime_get
0.00  +0.1  0.09  perf-profile.self.cycles-pp.nr_iowait_cpu
0.00  +0.1  0.09 ± 4%  perf-profile.self.cycles-pp.syscall_enter_from_user_mode
0.01 ±223%  +0.1  0.10  perf-profile.self.cycles-pp.__wake_up_common_lock
0.00  +0.1  0.09 ± 5%  perf-profile.self.cycles-pp.cpu_startup_entry
0.45 ± 2%  +0.1  0.55  perf-profile.self.cycles-pp.task_h_load
0.12 ± 3%  +0.1  0.22  perf-profile.self.cycles-pp.attach_entity_load_avg
0.00  +0.1  0.10  perf-profile.self.cycles-pp.check_preempt_curr
0.00  +0.1  0.10  perf-profile.self.cycles-pp.__list_add_valid
0.10 ± 4%  +0.1  0.21 ± 2%  perf-profile.self.cycles-pp.avg_vruntime
0.07 ± 5%  +0.1  0.17  perf-profile.self.cycles-pp.__calc_delta
0.33  +0.1  0.44  perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
0.00  +0.1  0.10 ± 9%  perf-profile.self.cycles-pp.__mutex_lock
0.00  +0.1  0.11  perf-profile.self.cycles-pp.set_next_entity
0.00  +0.1  0.11 ± 4%  perf-profile.self.cycles-pp.ct_kernel_exit_state
0.14 ± 3%  +0.1  0.25  perf-profile.self.cycles-pp.wake_affine
0.00  +0.1  0.12 ± 3%  perf-profile.self.cycles-pp.__list_del_entry_valid
0.34 ± 3%  +0.1  0.46  perf-profile.self.cycles-pp.menu_select
0.00  +0.1  0.12 ± 4%  perf-profile.self.cycles-pp.security_file_permission
0.05 ± 8%  +0.1  0.18 ± 2%  perf-profile.self.cycles-pp.pick_eevdf
0.00  +0.1  0.14 ± 3%  perf-profile.self.cycles-pp.sched_clock_cpu
0.00  +0.1  0.14  perf-profile.self.cycles-pp.__dequeue_entity
0.17 ± 2%  +0.1  0.32  perf-profile.self.cycles-pp.__flush_smp_call_function_queue
0.00  +0.1  0.14 ± 3%  perf-profile.self.cycles-pp.schedule_idle
0.00  +0.1  0.15 ± 3%  perf-profile.self.cycles-pp._raw_spin_trylock
0.05  +0.2  0.20  perf-profile.self.cycles-pp.call_cpuidle
0.06 ± 8%  +0.2  0.21  perf-profile.self.cycles-pp.read_tsc
0.00  +0.2  0.16 ± 3%  perf-profile.self.cycles-pp._raw_spin_unlock_irqrestore
0.17 ± 4%  +0.2  0.32 ± 2%  perf-profile.self.cycles-pp.update_min_vruntime
0.00  +0.2  0.16  perf-profile.self.cycles-pp.syscall_return_via_sysret
0.08 ± 6%  +0.2  0.24  perf-profile.self.cycles-pp.cpuidle_enter_state
0.13 ± 5%  +0.2  0.29 ± 2%  perf-profile.self.cycles-pp.mutex_lock
0.00  +0.2  0.16 ± 2%  perf-profile.self.cycles-pp.__rdgsbase_inactive
0.27 ± 6%  +0.2  0.44 ± 3%  perf-profile.self.cycles-pp.copyin
0.78 ± 3%  +0.2  0.95  perf-profile.self.cycles-pp.migrate_task_rq_fair
0.08 ± 4%  +0.2  0.26  perf-profile.self.cycles-pp.pick_next_task_fair
0.42  +0.2  0.60  perf-profile.self.cycles-pp.ttwu_queue_wakelist
0.02 ±141%  +0.2  0.20  perf-profile.self.cycles-pp.entry_SYSRETQ_unsafe_stack
0.22 ± 2%  +0.2  0.41  perf-profile.self.cycles-pp.__fget_light
0.08 ± 6%  +0.2  0.27 ± 2%  perf-profile.self.cycles-pp.cpuidle_idle_call
0.07  +0.2  0.27  perf-profile.self.cycles-pp.syscall_exit_to_user_mode_prepare
0.19 ± 2%  +0.2  0.40  perf-profile.self.cycles-pp._raw_spin_lock_irq
0.36 ± 3%  +0.2  0.57  perf-profile.self.cycles-pp.update_rq_clock_task
0.15 ± 3%  +0.2  0.36  perf-profile.self.cycles-pp.select_task_rq_fair
0.40 ± 2%  +0.2  0.61  perf-profile.self.cycles-pp.remove_entity_load_avg
0.29 ± 2%  +0.2  0.51  perf-profile.self.cycles-pp.vfs_read
0.05 ± 8%  +0.2  0.27 ± 2%  perf-profile.self.cycles-pp.newidle_balance
0.14 ± 3%  +0.2  0.36  perf-profile.self.cycles-pp.update_rq_clock
0.28 ± 2%  +0.2  0.50  perf-profile.self.cycles-pp.vfs_write
0.06 ± 7%  +0.2  0.29  perf-profile.self.cycles-pp.__wrgsbase_inactive
0.34 ± 2%  +0.2  0.58  perf-profile.self.cycles-pp.dequeue_entity
0.26 ± 2%  +0.2  0.50  perf-profile.self.cycles-pp.prepare_to_wait_event
0.25 ± 2%  +0.2  0.49 ± 2%  perf-profile.self.cycles-pp.pipe_write
0.38  +0.2  0.63  perf-profile.self.cycles-pp.update_curr
0.12 ± 6%  +0.3  0.41 ± 2%  perf-profile.self.cycles-pp.mutex_unlock
2.12 ± 2%  +0.3  2.41  perf-profile.self.cycles-pp.select_idle_core
0.19 ± 3%  +0.3  0.50  perf-profile.self.cycles-pp.__update_load_avg_se
0.09 ± 7%  +0.3  0.41  perf-profile.self.cycles-pp.os_xsave
1.24  +0.3  1.57  perf-profile.self.cycles-pp.sched_mm_cid_migrate_to
0.14 ± 7%  +0.3  0.48  perf-profile.self.cycles-pp.___perf_sw_event
0.07  +0.3  0.42  perf-profile.self.cycles-pp.__enqueue_entity
0.09 ± 6%  +0.4  0.45  perf-profile.self.cycles-pp.do_idle
0.44 ± 2%  +0.4  0.82  perf-profile.self.cycles-pp.switch_fpu_return
0.23 ± 3%  +0.4  0.61  perf-profile.self.cycles-pp.sched_ttwu_pending
0.21 ± 2%  +0.4  0.61  perf-profile.self.cycles-pp.enqueue_task_fair
0.14 ± 3%  +0.4  0.54  perf-profile.self.cycles-pp.reweight_entity
0.16 ± 4%  +0.4  0.58  perf-profile.self.cycles-pp.native_sched_clock
0.29 ± 2%  +0.4  0.72  perf-profile.self.cycles-pp.flush_smp_call_function_queue
0.30 ± 2%  +0.4  0.74  perf-profile.self.cycles-pp.call_function_single_prep_ipi
0.37  +0.4  0.81  perf-profile.self.cycles-pp.dequeue_task_fair
0.32 ± 6%  +0.4  0.77  perf-profile.self.cycles-pp.copyout
0.13 ± 5%  +0.5  0.59  perf-profile.self.cycles-pp.nohz_run_idle_balance
0.60  +0.5  1.06  perf-profile.self.cycles-pp._find_next_bit
0.61  +0.5  1.12  perf-profile.self.cycles-pp.finish_task_switch
0.37 ± 6%  +0.5  0.88  perf-profile.self.cycles-pp.write
0.40 ± 7%  +0.5  0.93 ± 2%  perf-profile.self.cycles-pp.read
0.33 ± 2%  +0.6  0.93  perf-profile.self.cycles-pp.enqueue_entity
0.61  +0.7  1.28  perf-profile.self.cycles-pp.select_idle_cpu
0.69 ± 2%  +0.7  1.37  perf-profile.self.cycles-pp.select_idle_sibling
0.38 ± 2%  +0.7  1.10  perf-profile.self.cycles-pp.llist_reverse_order
0.59 ± 8%  +0.7  1.31 ± 2%  perf-profile.self.cycles-pp.stress_switch_pipe
0.44  +0.7  1.18  perf-profile.self.cycles-pp.__wake_up_common
0.58  +0.8  1.33  perf-profile.self.cycles-pp.pipe_read
0.58 ± 2%  +0.8  1.40  perf-profile.self.cycles-pp.prepare_task_switch
0.62  +0.9  1.51  perf-profile.self.cycles-pp.llist_add_batch
0.68 ± 2%  +1.0  1.68  perf-profile.self.cycles-pp.__switch_to
0.61  +1.0  1.63  perf-profile.self.cycles-pp.__switch_to_asm
0.63 ± 2%  +1.0  1.65  perf-profile.self.cycles-pp.__update_load_avg_cfs_rq
0.34 ± 4%  +1.0  1.37 ± 3%  perf-profile.self.cycles-pp.__bitmap_andnot
0.30 ± 17%  +1.0  1.32  perf-profile.self.cycles-pp.__update_idle_core
0.64 ± 2%  +1.3  1.92  perf-profile.self.cycles-pp.switch_mm_irqs_off
0.88 ± 2%  +1.3  2.22  perf-profile.self.cycles-pp._raw_spin_lock
1.70  +1.6  3.28  perf-profile.self.cycles-pp.__schedule
1.28  +2.0  3.28  perf-profile.self.cycles-pp._raw_spin_lock_irqsave
7.50  +3.6  11.06  perf-profile.self.cycles-pp.available_idle_cpu
0.00  +4.8  4.81  perf-profile.self.cycles-pp.poll_idle

Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH 0/1] Reduce cost of accessing tg->load_avg
  2023-08-23  6:08 [PATCH 0/1] Reduce cost of accessing tg->load_avg Aaron Lu
  2023-08-23  6:08 ` [PATCH 1/1] sched/fair: ratelimit update to tg->load_avg Aaron Lu
@ 2023-08-25 10:33 ` Swapnil Sapkal
  2023-08-28 11:22   ` Aaron Lu
  1 sibling, 1 reply; 15+ messages in thread

From: Swapnil Sapkal @ 2023-08-25 10:33 UTC (permalink / raw)
To: Aaron Lu, Peter Zijlstra, Vincent Guittot, Ingo Molnar, Juri Lelli
Cc: Daniel Jordan, Dietmar Eggemann, Steven Rostedt, Ben Segall,
    Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
    Tim Chen, Nitin Tekchandani, Yu Chen, Waiman Long, Deng Pan,
    Mathieu Desnoyers, Gautham R . Shenoy, David Vernet, linux-kernel

Hello Aaron,

On 8/23/2023 11:38 AM, Aaron Lu wrote:
> RFC v2 -> v1:
> - drop RFC;
> - move cfs_rq->last_update_tg_load_avg before cfs_rq->tg_load_avg_contrib;
> - add Vincent's reviewed-by tag.
>
> RFC v2:
> Nitin Tekchandani noticed some scheduler functions have high cost
> according to perf/cycles while running postgres_sysbench workload.
> I perf/annotated the high cost functions: update_cfs_group() and
> update_load_avg() and found the costs were ~90% due to accessing to
> tg->load_avg. This series is an attempt to reduce the overhead of
> the two functions.
>
> Thanks to Vincent's suggestion from v1, this revision used a simpler way
> to solve the overhead problem by limiting updates to tg->load_avg to at
> most once per ms. Benchmark shows that it has good results and with the
> rate limit in place, other optimizations in v1 don't improve performance
> further so they are dropped from this revision.
>

I have tested this series alongside Mathieu's changes. You can find the
report here:
https://lore.kernel.org/all/f6dc1652-bc39-0b12-4b6b-29a2f9cd8484@amd.com/

Tested-by: Swapnil Sapkal <Swapnil.Sapkal@amd.com>

> Aaron Lu (1):
>   sched/fair: ratelimit update to tg->load_avg
>
>  kernel/sched/fair.c  | 13 ++++++++++++-
>  kernel/sched/sched.h |  1 +
>  2 files changed, 13 insertions(+), 1 deletion(-)

--
Thanks and Regards,
Swapnil
* Re: [PATCH 0/1] Reduce cost of accessing tg->load_avg
  2023-08-25 10:33 ` [PATCH 0/1] Reduce cost of accessing tg->load_avg Swapnil Sapkal
@ 2023-08-28 11:22   ` Aaron Lu
  0 siblings, 0 replies; 15+ messages in thread

From: Aaron Lu @ 2023-08-28 11:22 UTC (permalink / raw)
To: Swapnil Sapkal
Cc: Peter Zijlstra, Vincent Guittot, Ingo Molnar, Juri Lelli,
    Daniel Jordan, Dietmar Eggemann, Steven Rostedt, Ben Segall,
    Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
    Tim Chen, Nitin Tekchandani, Yu Chen, Waiman Long, Deng Pan,
    Mathieu Desnoyers, Gautham R . Shenoy, David Vernet, linux-kernel

Hi Swapnil,

On Fri, Aug 25, 2023 at 04:03:20PM +0530, Swapnil Sapkal wrote:
> Hello Aaron,
>
> On 8/23/2023 11:38 AM, Aaron Lu wrote:
> > RFC v2 -> v1:
> > - drop RFC;
> > - move cfs_rq->last_update_tg_load_avg before cfs_rq->tg_load_avg_contrib;
> > - add Vincent's reviewed-by tag.
> >
> > RFC v2:
> > Nitin Tekchandani noticed some scheduler functions have high cost
> > according to perf/cycles while running postgres_sysbench workload.
> > I perf/annotated the high cost functions: update_cfs_group() and
> > update_load_avg() and found the costs were ~90% due to accessing to
> > tg->load_avg. This series is an attempt to reduce the overhead of
> > the two functions.
> >
> > Thanks to Vincent's suggestion from v1, this revision used a simpler way
> > to solve the overhead problem by limiting updates to tg->load_avg to at
> > most once per ms. Benchmark shows that it has good results and with the
> > rate limit in place, other optimizations in v1 don't improve performance
> > further so they are dropped from this revision.
>
> I have tested this series alongside Mathieu's changes. You can find the
> report here: https://lore.kernel.org/all/f6dc1652-bc39-0b12-4b6b-29a2f9cd8484@amd.com/
>
> Tested-by: Swapnil Sapkal <Swapnil.Sapkal@amd.com>

Thanks a lot for running these workloads and sharing the results; I will
include your tag when sending the next version.

Regards,
Aaron

> > Aaron Lu (1):
> >   sched/fair: ratelimit update to tg->load_avg
> >
> >  kernel/sched/fair.c  | 13 ++++++++++++-
> >  kernel/sched/sched.h |  1 +
> >  2 files changed, 13 insertions(+), 1 deletion(-)
>
> --
> Thanks and Regards,
> Swapnil
end of thread, other threads: [~2023-09-06 3:53 UTC | newest]

Thread overview: 15+ messages -- links below jump to the message on this page --
2023-08-23  6:08 [PATCH 0/1] Reduce cost of accessing tg->load_avg Aaron Lu
2023-08-23  6:08 ` [PATCH 1/1] sched/fair: ratelimit update to tg->load_avg Aaron Lu
2023-08-23 14:05   ` Mathieu Desnoyers
2023-08-23 14:17     ` Mathieu Desnoyers
2023-08-24  8:01       ` Aaron Lu
2023-08-24 12:56         ` Mathieu Desnoyers
2023-08-24 13:03           ` Vincent Guittot
2023-08-24 13:08             ` Mathieu Desnoyers
2023-08-24 13:24               ` Vincent Guittot
2023-08-25  6:08                 ` Aaron Lu
2023-08-24 18:48   ` David Vernet
2023-08-25  6:18     ` Aaron Lu
2023-09-06  3:52   ` kernel test robot
2023-08-25 10:33 ` [PATCH 0/1] Reduce cost of accessing tg->load_avg Swapnil Sapkal
2023-08-28 11:22   ` Aaron Lu