public inbox for linux-kernel@vger.kernel.org
* [PATCH 0/1] Reduce cost of accessing tg->load_avg
@ 2023-08-23  6:08 Aaron Lu
  2023-08-23  6:08 ` [PATCH 1/1] sched/fair: ratelimit update to tg->load_avg Aaron Lu
  2023-08-25 10:33 ` [PATCH 0/1] Reduce cost of accessing tg->load_avg Swapnil Sapkal
  0 siblings, 2 replies; 15+ messages in thread
From: Aaron Lu @ 2023-08-23  6:08 UTC (permalink / raw)
  To: Peter Zijlstra, Vincent Guittot, Ingo Molnar, Juri Lelli
  Cc: Daniel Jordan, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Tim Chen, Nitin Tekchandani, Yu Chen, Waiman Long, Deng Pan,
	Mathieu Desnoyers, Gautham R . Shenoy, David Vernet, linux-kernel

RFC v2 -> v1:
- drop RFC;
- move cfs_rq->last_update_tg_load_avg before cfs_rq->tg_load_avg_contrib;
- add Vincent's reviewed-by tag.

RFC v2:
Nitin Tekchandani noticed some scheduler functions have high cost
according to perf/cycles while running the postgres_sysbench workload.
I perf/annotated the high cost functions, update_cfs_group() and
update_load_avg(), and found ~90% of their cost was due to accessing
tg->load_avg. This series is an attempt to reduce the overhead of
the two functions.

Thanks to Vincent's suggestion on v1, this revision uses a simpler way
to solve the overhead problem: limiting updates to tg->load_avg to at
most once per ms. Benchmarks show good results, and with the rate limit
in place the other optimizations in v1 don't improve performance
further, so they are dropped from this revision.
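
As a rough illustration of the once-per-ms limit described above, here is
a minimal userspace sketch in C. It is not the kernel code: the types
tg_model and cfs_rq_model and the function update_tg_load_avg_model() are
hypothetical stand-ins, the caller passes the timestamp in explicitly
instead of reading a scheduler clock, and tg_load_avg_contrib is modelled
as a plain long for simplicity. Only the field names, NSEC_PER_MSEC and
the 1/64 threshold are taken from the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define NSEC_PER_MSEC 1000000ULL

/* Hypothetical stand-in for the shared, per-task-group load sum. */
struct tg_model {
	atomic_long load_avg;
};

/* Hypothetical stand-in for the per-CPU fields the patch touches. */
struct cfs_rq_model {
	struct tg_model *tg;
	long load_avg;                    /* models cfs_rq->avg.load_avg */
	uint64_t last_update_tg_load_avg; /* ns timestamp of the last publish */
	long tg_load_avg_contrib;         /* value last published to the group */
};

/* Returns true only when the shared counter was actually written. */
bool update_tg_load_avg_model(struct cfs_rq_model *cfs_rq, uint64_t now)
{
	long delta;

	assert(cfs_rq != NULL && cfs_rq->tg != NULL);

	/* Rate limit: skip if the last publish was less than 1 ms ago. */
	if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
		return false;

	/* Publish only when the local value moved by more than 1/64. */
	delta = cfs_rq->load_avg - cfs_rq->tg_load_avg_contrib;
	if (labs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
		atomic_fetch_add(&cfs_rq->tg->load_avg, delta);
		cfs_rq->tg_load_avg_contrib = cfs_rq->load_avg;
		cfs_rq->last_update_tg_load_avg = now;
		return true;
	}
	return false;
}
```

Note that, as in the patch, the timestamp is refreshed only when a write
happens, so a call that passes the time check but fails the 1/64 check
does not restart the 1 ms window.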

Aaron Lu (1):
  sched/fair: ratelimit update to tg->load_avg

 kernel/sched/fair.c  | 13 ++++++++++++-
 kernel/sched/sched.h |  1 +
 2 files changed, 13 insertions(+), 1 deletion(-)

-- 
2.41.0



* [PATCH 1/1] sched/fair: ratelimit update to tg->load_avg
  2023-08-23  6:08 [PATCH 0/1] Reduce cost of accessing tg->load_avg Aaron Lu
@ 2023-08-23  6:08 ` Aaron Lu
  2023-08-23 14:05   ` Mathieu Desnoyers
                     ` (2 more replies)
  2023-08-25 10:33 ` [PATCH 0/1] Reduce cost of accessing tg->load_avg Swapnil Sapkal
  1 sibling, 3 replies; 15+ messages in thread
From: Aaron Lu @ 2023-08-23  6:08 UTC (permalink / raw)
  To: Peter Zijlstra, Vincent Guittot, Ingo Molnar, Juri Lelli
  Cc: Daniel Jordan, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Tim Chen, Nitin Tekchandani, Yu Chen, Waiman Long, Deng Pan,
	Mathieu Desnoyers, Gautham R . Shenoy, David Vernet, linux-kernel

When using sysbench to benchmark Postgres in a single docker instance
with sysbench's nr_threads set to nr_cpu, it is observed there are times
update_cfs_group() and update_load_avg() show noticeable overhead on
a 2sockets/112core/224cpu Intel Sapphire Rapids (SPR):

    13.75%    13.74%  [kernel.vmlinux]           [k] update_cfs_group
    10.63%    10.04%  [kernel.vmlinux]           [k] update_load_avg

Annotate shows the cycles are mostly spent on accessing tg->load_avg,
with update_load_avg() being the write side and update_cfs_group() being
the read side. tg->load_avg is per task group, and when different tasks
of the same task group running on different CPUs frequently access
tg->load_avg, it can be heavily contended.

E.g. when running postgres_sysbench on a 2sockets/112cores/224cpus Intel
Sapphire Rapids, during a 5s window, the number of wakeups is 14 million
and the number of migrations is 11 million. With each migration, the
task's load transfers from the src cfs_rq to the target cfs_rq, and each
change involves an update to tg->load_avg. Since the workload can trigger
this many wakeups and migrations, the accesses (both read and write) to
tg->load_avg can be unbounded. As a result, the two mentioned functions
show noticeable overhead. With netperf/nr_client=nr_cpu/UDP_RR, the
problem is worse: during a 5s window, the number of wakeups is 21 million
and the number of migrations is 14 million; update_cfs_group() costs ~25%
and update_load_avg() costs ~16%.

Reduce the overhead by limiting updates to tg->load_avg to at most once
per ms. After this change, the cost of accessing tg->load_avg is greatly
reduced and performance improved. Detailed test results below.
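
To put the numbers above side by side, the following small C sketch
converts the 5s counts into per-second rates and computes the ceiling the
once-per-ms limit places on writes to one task group's tg->load_avg. The
helper names are hypothetical, and the ceiling assumes one cfs_rq per CPU
for the task group. It is only an upper bound; the real write rate also
depends on the 1/64 threshold in update_tg_load_avg().

```c
#include <assert.h>
#include <stdint.h>

/* Average number of events per second over a measurement window. */
uint64_t events_per_sec(uint64_t events, uint64_t window_sec)
{
	assert(window_sec > 0);
	return events / window_sec;
}

/*
 * Upper bound on writes per second to one task group's load_avg when
 * each per-CPU cfs_rq publishes at most once per millisecond
 * (assumption: exactly one cfs_rq per CPU for that group).
 */
uint64_t max_tg_writes_per_sec(uint64_t nr_cpus)
{
	return nr_cpus * 1000;
}
```

With the postgres_sysbench figures, events_per_sec(11000000, 5) gives
2.2 million migrations per second, while max_tg_writes_per_sec(224) gives
224000, so the limit caps the write side at roughly a tenth of the
observed migration rate on this machine.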

==============================
postgres_sysbench on SPR:
25%
base:   42382±19.8%
patch:  50174±9.5%  (noise)

50%
base:   67626±1.3%
patch:  67365±3.1%  (noise)

75%
base:   100216±1.2%
patch:  112470±0.1% +12.2%

100%
base:    93671±0.4%
patch:  113563±0.2% +21.2%

==============================
hackbench on ICL:
group=1
base:    114912±5.2%
patch:   117857±2.5%  (noise)

group=4
base:    359902±1.6%
patch:   361685±2.7%  (noise)

group=8
base:    461070±0.8%
patch:   491713±0.3% +6.6%

group=16
base:    309032±5.0%
patch:   378337±1.3% +22.4%

=============================
hackbench on SPR:
group=1
base:    100768±2.9%
patch:   103134±2.9%  (noise)

group=4
base:    413830±12.5%
patch:   378660±16.6% (noise)

group=8
base:    436124±0.6%
patch:   490787±3.2% +12.5%

group=16
base:    457730±3.2%
patch:   680452±1.3% +48.8%

============================
netperf/udp_rr on ICL
25%
base:    114413±0.1%
patch:   115111±0.0% +0.6%

50%
base:    86803±0.5%
patch:   86611±0.0%  (noise)

75%
base:    35959±5.3%
patch:   49801±0.6% +38.5%

100%
base:    61951±6.4%
patch:   70224±0.8% +13.4%

===========================
netperf/udp_rr on SPR
25%
base:   104954±1.3%
patch:  107312±2.8%  (noise)

50%
base:    55394±4.6%
patch:   54940±7.4%  (noise)

75%
base:    13779±3.1%
patch:   36105±1.1% +162%

100%
base:     9703±3.7%
patch:   28011±0.2% +189%

==============================================
netperf/tcp_stream on ICL (all in noise range)
25%
base:    43092±0.1%
patch:   42891±0.5%

50%
base:    19278±14.9%
patch:   22369±7.2%

75%
base:    16822±3.0%
patch:   17086±2.3%

100%
base:    18216±0.6%
patch:   18078±2.9%

===============================================
netperf/tcp_stream on SPR (all in noise range)
25%
base:    34491±0.3%
patch:   34886±0.5%

50%
base:    19278±14.9%
patch:   22369±7.2%

75%
base:    16822±3.0%
patch:   17086±2.3%

100%
base:    18216±0.6%
patch:   18078±2.9%

Reported-by: Nitin Tekchandani <nitin.tekchandani@intel.com>
Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c  | 13 ++++++++++++-
 kernel/sched/sched.h |  1 +
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c28206499a3d..a5462d1fcc48 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3664,7 +3664,8 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
  */
 static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 {
-	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
+	long delta;
+	u64 now;
 
 	/*
 	 * No need to update load_avg for root_task_group as it is not used.
@@ -3672,9 +3673,19 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 	if (cfs_rq->tg == &root_task_group)
 		return;
 
+	/*
+	 * For migration heavy workload, access to tg->load_avg can be
+	 * unbound. Limit the update rate to at most once per ms.
+	 */
+	now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
+	if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
+		return;
+
+	delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
 	if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
 		atomic_long_add(delta, &cfs_rq->tg->load_avg);
 		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
+		cfs_rq->last_update_tg_load_avg = now;
 	}
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6a8b7b9ed089..52ee7027def9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -593,6 +593,7 @@ struct cfs_rq {
 	} removed;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
+	u64			last_update_tg_load_avg;
 	unsigned long		tg_load_avg_contrib;
 	long			propagate;
 	long			prop_runnable_sum;
-- 
2.41.0



* Re: [PATCH 1/1] sched/fair: ratelimit update to tg->load_avg
  2023-08-23  6:08 ` [PATCH 1/1] sched/fair: ratelimit update to tg->load_avg Aaron Lu
@ 2023-08-23 14:05   ` Mathieu Desnoyers
  2023-08-23 14:17     ` Mathieu Desnoyers
  2023-08-24  8:01     ` Aaron Lu
  2023-08-24 18:48   ` David Vernet
  2023-09-06  3:52   ` kernel test robot
  2 siblings, 2 replies; 15+ messages in thread
From: Mathieu Desnoyers @ 2023-08-23 14:05 UTC (permalink / raw)
  To: Aaron Lu, Peter Zijlstra, Vincent Guittot, Ingo Molnar,
	Juri Lelli
  Cc: Daniel Jordan, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Tim Chen, Nitin Tekchandani, Yu Chen, Waiman Long, Deng Pan,
	Gautham R . Shenoy, David Vernet, linux-kernel

On 8/23/23 02:08, Aaron Lu wrote:
> When using sysbench to benchmark Postgres in a single docker instance
> with sysbench's nr_threads set to nr_cpu, it is observed there are times
> update_cfs_group() and update_load_avg() shows noticeable overhead on
> a 2sockets/112core/224cpu Intel Sapphire Rapids(SPR):
> 
>      13.75%    13.74%  [kernel.vmlinux]           [k] update_cfs_group
>      10.63%    10.04%  [kernel.vmlinux]           [k] update_load_avg
> 
> Annotate shows the cycles are mostly spent on accessing tg->load_avg
> with update_load_avg() being the write side and update_cfs_group() being
> the read side. tg->load_avg is per task group and when different tasks
> of the same taskgroup running on different CPUs frequently access
> tg->load_avg, it can be heavily contended.
> 
> E.g. when running postgres_sysbench on a 2sockets/112cores/224cpus Intel
> Sappire Rapids, during a 5s window, the wakeup number is 14millions and
> migration number is 11millions and with each migration, the task's load
> will transfer from src cfs_rq to target cfs_rq and each change involves
> an update to tg->load_avg. Since the workload can trigger as many wakeups
> and migrations, the access(both read and write) to tg->load_avg can be
> unbound. As a result, the two mentioned functions showed noticeable
> overhead. With netperf/nr_client=nr_cpu/UDP_RR, the problem is worse:
> during a 5s window, wakeup number is 21millions and migration number is
> 14millions; update_cfs_group() costs ~25% and update_load_avg() costs ~16%.
> 
> Reduce the overhead by limiting updates to tg->load_avg to at most once
> per ms. After this change, the cost of accessing tg->load_avg is greatly
> reduced and performance improved. Detailed test results below.

By applying your patch on top of my patchset at:

https://lore.kernel.org/lkml/20230822113133.643238-1-mathieu.desnoyers@efficios.com/

The combined hackbench results look very promising:

(hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100)
(192 cores AMD EPYC 9654 96-Core Processor (over 2 sockets), with hyperthreading)

Baseline:                                       49s
With L2-ttwu-queue-skip:                        34s (30% speedup)
With L2-ttwu-queue-skip + ratelimit-load-avg:   26s (46% speedup)

Feel free to apply my:

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

Thanks Aaron!

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com



* Re: [PATCH 1/1] sched/fair: ratelimit update to tg->load_avg
  2023-08-23 14:05   ` Mathieu Desnoyers
@ 2023-08-23 14:17     ` Mathieu Desnoyers
  2023-08-24  8:01     ` Aaron Lu
  1 sibling, 0 replies; 15+ messages in thread
From: Mathieu Desnoyers @ 2023-08-23 14:17 UTC (permalink / raw)
  To: Aaron Lu, Peter Zijlstra, Vincent Guittot, Ingo Molnar,
	Juri Lelli
  Cc: Daniel Jordan, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Tim Chen, Nitin Tekchandani, Yu Chen, Waiman Long, Deng Pan,
	Gautham R . Shenoy, David Vernet, linux-kernel

On 8/23/23 10:05, Mathieu Desnoyers wrote:
> On 8/23/23 02:08, Aaron Lu wrote:
>> When using sysbench to benchmark Postgres in a single docker instance
>> with sysbench's nr_threads set to nr_cpu, it is observed there are times
>> update_cfs_group() and update_load_avg() shows noticeable overhead on
>> a 2sockets/112core/224cpu Intel Sapphire Rapids(SPR):
>>
>>      13.75%    13.74%  [kernel.vmlinux]           [k] update_cfs_group
>>      10.63%    10.04%  [kernel.vmlinux]           [k] update_load_avg
>>
>> Annotate shows the cycles are mostly spent on accessing tg->load_avg
>> with update_load_avg() being the write side and update_cfs_group() being
>> the read side. tg->load_avg is per task group and when different tasks
>> of the same taskgroup running on different CPUs frequently access
>> tg->load_avg, it can be heavily contended.
>>
>> E.g. when running postgres_sysbench on a 2sockets/112cores/224cpus Intel
>> Sappire Rapids, during a 5s window, the wakeup number is 14millions and
>> migration number is 11millions and with each migration, the task's load
>> will transfer from src cfs_rq to target cfs_rq and each change involves
>> an update to tg->load_avg. Since the workload can trigger as many wakeups
>> and migrations, the access(both read and write) to tg->load_avg can be
>> unbound. As a result, the two mentioned functions showed noticeable
>> overhead. With netperf/nr_client=nr_cpu/UDP_RR, the problem is worse:
>> during a 5s window, wakeup number is 21millions and migration number is
>> 14millions; update_cfs_group() costs ~25% and update_load_avg() costs 
>> ~16%.
>>
>> Reduce the overhead by limiting updates to tg->load_avg to at most once
>> per ms. After this change, the cost of accessing tg->load_avg is greatly
>> reduced and performance improved. Detailed test results below.
> 
> By applying your patch on top of my patchset at:
> 
> https://lore.kernel.org/lkml/20230822113133.643238-1-mathieu.desnoyers@efficios.com/
> 
> The combined hackbench results look very promising:
> 
> (hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100)
> (192 cores AMD EPYC 9654 96-Core Processor (over 2 sockets), with 
> hyperthreading)
> 
> Baseline:                                       49s
> With L2-ttwu-queue-skip:                        34s (30% speedup)
> With L2-ttwu-queue-skip + ratelimit-load-avg:   26s (46% speedup)

Here is an additional interesting data point:

With only ratelimit-load-avg patch:               32s (35% speedup)

So each series appears to address a different scalability issue, and
combining both seems worthwhile, at least from the point of view of
this specific benchmark on this hardware.
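
For reference, the percentages quoted above are consistent with a
reduction in elapsed time relative to the 49s baseline. A minimal C
sketch of that arithmetic follows; the function name is hypothetical and
the interpretation of "speedup" as runtime reduction is an assumption
drawn from the numbers themselves.

```c
#include <assert.h>

/* Percentage reduction in elapsed time relative to a baseline run. */
double runtime_reduction_pct(double base_sec, double new_sec)
{
	assert(base_sec > 0.0);
	return (base_sec - new_sec) / base_sec * 100.0;
}
```

Under that reading, 49s to 34s is about 30.6%, 49s to 26s is about 46.9%,
and 49s to 32s is about 34.7%, which match the quoted 30%, 46% and 35%
up to rounding.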

I'm looking forward to seeing numbers for other benchmarks and hardware.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com



* Re: [PATCH 1/1] sched/fair: ratelimit update to tg->load_avg
  2023-08-23 14:05   ` Mathieu Desnoyers
  2023-08-23 14:17     ` Mathieu Desnoyers
@ 2023-08-24  8:01     ` Aaron Lu
  2023-08-24 12:56       ` Mathieu Desnoyers
  1 sibling, 1 reply; 15+ messages in thread
From: Aaron Lu @ 2023-08-24  8:01 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Vincent Guittot, Ingo Molnar, Juri Lelli,
	Daniel Jordan, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Tim Chen, Nitin Tekchandani, Yu Chen, Waiman Long, Deng Pan,
	Gautham R . Shenoy, David Vernet, linux-kernel

On Wed, Aug 23, 2023 at 10:05:31AM -0400, Mathieu Desnoyers wrote:
> On 8/23/23 02:08, Aaron Lu wrote:
> > When using sysbench to benchmark Postgres in a single docker instance
> > with sysbench's nr_threads set to nr_cpu, it is observed there are times
> > update_cfs_group() and update_load_avg() shows noticeable overhead on
> > a 2sockets/112core/224cpu Intel Sapphire Rapids(SPR):
> > 
> >      13.75%    13.74%  [kernel.vmlinux]           [k] update_cfs_group
> >      10.63%    10.04%  [kernel.vmlinux]           [k] update_load_avg
> > 
> > Annotate shows the cycles are mostly spent on accessing tg->load_avg
> > with update_load_avg() being the write side and update_cfs_group() being
> > the read side. tg->load_avg is per task group and when different tasks
> > of the same taskgroup running on different CPUs frequently access
> > tg->load_avg, it can be heavily contended.
> > 
> > E.g. when running postgres_sysbench on a 2sockets/112cores/224cpus Intel
> > Sappire Rapids, during a 5s window, the wakeup number is 14millions and
> > migration number is 11millions and with each migration, the task's load
> > will transfer from src cfs_rq to target cfs_rq and each change involves
> > an update to tg->load_avg. Since the workload can trigger as many wakeups
> > and migrations, the access(both read and write) to tg->load_avg can be
> > unbound. As a result, the two mentioned functions showed noticeable
> > overhead. With netperf/nr_client=nr_cpu/UDP_RR, the problem is worse:
> > during a 5s window, wakeup number is 21millions and migration number is
> > 14millions; update_cfs_group() costs ~25% and update_load_avg() costs ~16%.
> > 
> > Reduce the overhead by limiting updates to tg->load_avg to at most once
> > per ms. After this change, the cost of accessing tg->load_avg is greatly
> > reduced and performance improved. Detailed test results below.
> 
> By applying your patch on top of my patchset at:
> 
> https://lore.kernel.org/lkml/20230822113133.643238-1-mathieu.desnoyers@efficios.com/
> 
> The combined hackbench results look very promising:
> 
> (hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100)
> (192 cores AMD EPYC 9654 96-Core Processor (over 2 sockets), with hyperthreading)
> 
> Baseline:                                       49s
> With L2-ttwu-queue-skip:                        34s (30% speedup)
> With L2-ttwu-queue-skip + ratelimit-load-avg:   26s (46% speedup)
> 
> Feel free to apply my:
> 
> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

Thanks a lot for running this and reviewing the patch.
I'll add your numbers and tags to the changelog when sending a new
version.

Regards,
Aaron
 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/1] sched/fair: ratelimit update to tg->load_avg
  2023-08-24  8:01     ` Aaron Lu
@ 2023-08-24 12:56       ` Mathieu Desnoyers
  2023-08-24 13:03         ` Vincent Guittot
  0 siblings, 1 reply; 15+ messages in thread
From: Mathieu Desnoyers @ 2023-08-24 12:56 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Peter Zijlstra, Vincent Guittot, Ingo Molnar, Juri Lelli,
	Daniel Jordan, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Tim Chen, Nitin Tekchandani, Yu Chen, Waiman Long, Deng Pan,
	Gautham R . Shenoy, David Vernet, linux-kernel

On 8/24/23 04:01, Aaron Lu wrote:
> On Wed, Aug 23, 2023 at 10:05:31AM -0400, Mathieu Desnoyers wrote:
>> On 8/23/23 02:08, Aaron Lu wrote:
>>> When using sysbench to benchmark Postgres in a single docker instance
>>> with sysbench's nr_threads set to nr_cpu, it is observed there are times
>>> update_cfs_group() and update_load_avg() shows noticeable overhead on
>>> a 2sockets/112core/224cpu Intel Sapphire Rapids(SPR):
>>>
>>>       13.75%    13.74%  [kernel.vmlinux]           [k] update_cfs_group
>>>       10.63%    10.04%  [kernel.vmlinux]           [k] update_load_avg
>>>
>>> Annotate shows the cycles are mostly spent on accessing tg->load_avg
>>> with update_load_avg() being the write side and update_cfs_group() being
>>> the read side. tg->load_avg is per task group and when different tasks
>>> of the same taskgroup running on different CPUs frequently access
>>> tg->load_avg, it can be heavily contended.
>>>
>>> E.g. when running postgres_sysbench on a 2sockets/112cores/224cpus Intel
>>> Sapphire Rapids, during a 5s window, the wakeup number is 14 million and
>>> migration number is 11 million and with each migration, the task's load
>>> will transfer from src cfs_rq to target cfs_rq and each change involves
>>> an update to tg->load_avg. Since the workload can trigger that many wakeups
>>> and migrations, the access (both read and write) to tg->load_avg can be
>>> unbounded. As a result, the two mentioned functions showed noticeable
>>> overhead. With netperf/nr_client=nr_cpu/UDP_RR, the problem is worse:
>>> during a 5s window, wakeup number is 21 million and migration number is
>>> 14 million; update_cfs_group() costs ~25% and update_load_avg() costs ~16%.
>>>
>>> Reduce the overhead by limiting updates to tg->load_avg to at most once
>>> per ms. After this change, the cost of accessing tg->load_avg is greatly
>>> reduced and performance improved. Detailed test results below.
>>
>> By applying your patch on top of my patchset at:
>>
>> https://lore.kernel.org/lkml/20230822113133.643238-1-mathieu.desnoyers@efficios.com/
>>
>> The combined hackbench results look very promising:
>>
>> (hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100)
>> (192 cores AMD EPYC 9654 96-Core Processor (over 2 sockets), with hyperthreading)
>>
>> Baseline:                                       49s
>> With L2-ttwu-queue-skip:                        34s (30% speedup)
>> With L2-ttwu-queue-skip + ratelimit-load-avg:   26s (46% speedup)
>>
>> Feel free to apply my:
>>
>> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> Tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> 
> Thanks a lot for running this and reviewing the patch.
> I'll add your number and tag in the changelog when sending a new
> version.

Now that I come to think of it, I have a comment: why use
sched_clock_cpu() rather than just reading the jiffies value? AFAIR,
sched_clock can be slower than needed when read from a "remote" CPU on
architectures that have an unsynchronized TSC.

Considering that you only need a time reference more or less accurate at 
the millisecond level, I suspect that jiffies is what you are looking 
for here. This is what the NUMA balance code and rseq mm_cid use to 
execute work every N milliseconds.
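
For reference, the check under discussion can be modelled in userspace
roughly as below. This is only a sketch: struct cfs_rq_model and
model_update_tg_load_avg() are hypothetical stand-ins for the kernel
structures, the shared tg->load_avg is a plain long rather than an
atomic, and the timestamp is passed in by the caller instead of being
read from sched_clock_cpu().

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define NSEC_PER_MSEC 1000000ULL

/* Hypothetical userspace stand-in for the cfs_rq fields the patch touches. */
struct cfs_rq_model {
	long load_avg;                    /* models cfs_rq->avg.load_avg */
	long tg_load_avg_contrib;         /* models cfs_rq->tg_load_avg_contrib */
	uint64_t last_update_tg_load_avg; /* ns timestamp of the last shared write */
};

/*
 * Models the ratelimited update: skip the shared counter entirely if the
 * last write from this cfs_rq was less than 1ms ago. Returns true only
 * when the shared sum was actually written.
 */
static bool model_update_tg_load_avg(struct cfs_rq_model *rq,
				     long *tg_load_avg, uint64_t now_ns)
{
	long delta;

	if (now_ns - rq->last_update_tg_load_avg < NSEC_PER_MSEC)
		return false;

	delta = rq->load_avg - rq->tg_load_avg_contrib;
	if (labs(delta) > rq->tg_load_avg_contrib / 64) {
		*tg_load_avg += delta;
		rq->tg_load_avg_contrib = rq->load_avg;
		rq->last_update_tg_load_avg = now_ns;
		return true;
	}
	return false;
}
```

Whatever clock is used only has to supply now_ns with enough
resolution to make the 1ms comparison meaningful.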

Thanks,

Mathieu

> 
> Regards,
> Aaron
>   
>>>
>>> ==============================
>>> postgres_sysbench on SPR:
>>> 25%
>>> base:   42382±19.8%
>>> patch:  50174±9.5%  (noise)
>>>
>>> 50%
>>> base:   67626±1.3%
>>> patch:  67365±3.1%  (noise)
>>>
>>> 75%
>>> base:   100216±1.2%
>>> patch:  112470±0.1% +12.2%
>>>
>>> 100%
>>> base:    93671±0.4%
>>> patch:  113563±0.2% +21.2%
>>>
>>> ==============================
>>> hackbench on ICL:
>>> group=1
>>> base:    114912±5.2%
>>> patch:   117857±2.5%  (noise)
>>>
>>> group=4
>>> base:    359902±1.6%
>>> patch:   361685±2.7%  (noise)
>>>
>>> group=8
>>> base:    461070±0.8%
>>> patch:   491713±0.3% +6.6%
>>>
>>> group=16
>>> base:    309032±5.0%
>>> patch:   378337±1.3% +22.4%
>>>
>>> =============================
>>> hackbench on SPR:
>>> group=1
>>> base:    100768±2.9%
>>> patch:   103134±2.9%  (noise)
>>>
>>> group=4
>>> base:    413830±12.5%
>>> patch:   378660±16.6% (noise)
>>>
>>> group=8
>>> base:    436124±0.6%
>>> patch:   490787±3.2% +12.5%
>>>
>>> group=16
>>> base:    457730±3.2%
>>> patch:   680452±1.3% +48.8%
>>>
>>> ============================
>>> netperf/udp_rr on ICL
>>> 25%
>>> base:    114413±0.1%
>>> patch:   115111±0.0% +0.6%
>>>
>>> 50%
>>> base:    86803±0.5%
>>> patch:   86611±0.0%  (noise)
>>>
>>> 75%
>>> base:    35959±5.3%
>>> patch:   49801±0.6% +38.5%
>>>
>>> 100%
>>> base:    61951±6.4%
>>> patch:   70224±0.8% +13.4%
>>>
>>> ===========================
>>> netperf/udp_rr on SPR
>>> 25%
>>> base:   104954±1.3%
>>> patch:  107312±2.8%  (noise)
>>>
>>> 50%
>>> base:    55394±4.6%
>>> patch:   54940±7.4%  (noise)
>>>
>>> 75%
>>> base:    13779±3.1%
>>> patch:   36105±1.1% +162%
>>>
>>> 100%
>>> base:     9703±3.7%
>>> patch:   28011±0.2% +189%
>>>
>>> ==============================================
>>> netperf/tcp_stream on ICL (all in noise range)
>>> 25%
>>> base:    43092±0.1%
>>> patch:   42891±0.5%
>>>
>>> 50%
>>> base:    19278±14.9%
>>> patch:   22369±7.2%
>>>
>>> 75%
>>> base:    16822±3.0%
>>> patch:   17086±2.3%
>>>
>>> 100%
>>> base:    18216±0.6%
>>> patch:   18078±2.9%
>>>
>>> ===============================================
>>> netperf/tcp_stream on SPR (all in noise range)
>>> 25%
>>> base:    34491±0.3%
>>> patch:   34886±0.5%
>>>
>>> 50%
>>> base:    19278±14.9%
>>> patch:   22369±7.2%
>>>
>>> 75%
>>> base:    16822±3.0%
>>> patch:   17086±2.3%
>>>
>>> 100%
>>> base:    18216±0.6%
>>> patch:   18078±2.9%
>>>
>>> Reported-by: Nitin Tekchandani <nitin.tekchandani@intel.com>
>>> Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
>>> Signed-off-by: Aaron Lu <aaron.lu@intel.com>
>>> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
>>> ---
>>>    kernel/sched/fair.c  | 13 ++++++++++++-
>>>    kernel/sched/sched.h |  1 +
>>>    2 files changed, 13 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index c28206499a3d..a5462d1fcc48 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -3664,7 +3664,8 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
>>>     */
>>>    static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
>>>    {
>>> -	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
>>> +	long delta;
>>> +	u64 now;
>>>    	/*
>>>    	 * No need to update load_avg for root_task_group as it is not used.
>>> @@ -3672,9 +3673,19 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
>>>    	if (cfs_rq->tg == &root_task_group)
>>>    		return;
>>> +	/*
>>> +	 * For migration heavy workload, access to tg->load_avg can be
>>> +	 * unbound. Limit the update rate to at most once per ms.
>>> +	 */
>>> +	now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
>>> +	if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
>>> +		return;
>>> +
>>> +	delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
>>>    	if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
>>>    		atomic_long_add(delta, &cfs_rq->tg->load_avg);
>>>    		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
>>> +		cfs_rq->last_update_tg_load_avg = now;
>>>    	}
>>>    }
>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>>> index 6a8b7b9ed089..52ee7027def9 100644
>>> --- a/kernel/sched/sched.h
>>> +++ b/kernel/sched/sched.h
>>> @@ -593,6 +593,7 @@ struct cfs_rq {
>>>    	} removed;
>>>    #ifdef CONFIG_FAIR_GROUP_SCHED
>>> +	u64			last_update_tg_load_avg;
>>>    	unsigned long		tg_load_avg_contrib;
>>>    	long			propagate;
>>>    	long			prop_runnable_sum;
>>
>> -- 
>> Mathieu Desnoyers
>> EfficiOS Inc.
>> https://www.efficios.com
>>

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/1] sched/fair: ratelimit update to tg->load_avg
  2023-08-24 12:56       ` Mathieu Desnoyers
@ 2023-08-24 13:03         ` Vincent Guittot
  2023-08-24 13:08           ` Mathieu Desnoyers
  0 siblings, 1 reply; 15+ messages in thread
From: Vincent Guittot @ 2023-08-24 13:03 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Aaron Lu, Peter Zijlstra, Ingo Molnar, Juri Lelli, Daniel Jordan,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tim Chen,
	Nitin Tekchandani, Yu Chen, Waiman Long, Deng Pan,
	Gautham R . Shenoy, David Vernet, linux-kernel

On Thu, 24 Aug 2023 at 14:55, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> On 8/24/23 04:01, Aaron Lu wrote:
> > On Wed, Aug 23, 2023 at 10:05:31AM -0400, Mathieu Desnoyers wrote:
> >> On 8/23/23 02:08, Aaron Lu wrote:
> >>> When using sysbench to benchmark Postgres in a single docker instance
> >>> with sysbench's nr_threads set to nr_cpu, it is observed there are times
> >>> update_cfs_group() and update_load_avg() shows noticeable overhead on
> >>> a 2sockets/112core/224cpu Intel Sapphire Rapids(SPR):
> >>>
> >>>       13.75%    13.74%  [kernel.vmlinux]           [k] update_cfs_group
> >>>       10.63%    10.04%  [kernel.vmlinux]           [k] update_load_avg
> >>>
> >>> Annotate shows the cycles are mostly spent on accessing tg->load_avg
> >>> with update_load_avg() being the write side and update_cfs_group() being
> >>> the read side. tg->load_avg is per task group and when different tasks
> >>> of the same taskgroup running on different CPUs frequently access
> >>> tg->load_avg, it can be heavily contended.
> >>>
> >>> E.g. when running postgres_sysbench on a 2sockets/112cores/224cpus Intel
> >>> Sapphire Rapids, during a 5s window, the wakeup number is 14 million and
> >>> migration number is 11 million and with each migration, the task's load
> >>> will transfer from src cfs_rq to target cfs_rq and each change involves
> >>> an update to tg->load_avg. Since the workload can trigger that many wakeups
> >>> and migrations, the access (both read and write) to tg->load_avg can be
> >>> unbounded. As a result, the two mentioned functions showed noticeable
> >>> overhead. With netperf/nr_client=nr_cpu/UDP_RR, the problem is worse:
> >>> during a 5s window, wakeup number is 21 million and migration number is
> >>> 14 million; update_cfs_group() costs ~25% and update_load_avg() costs ~16%.
> >>>
> >>> Reduce the overhead by limiting updates to tg->load_avg to at most once
> >>> per ms. After this change, the cost of accessing tg->load_avg is greatly
> >>> reduced and performance improved. Detailed test results below.
> >>
> >> By applying your patch on top of my patchset at:
> >>
> >> https://lore.kernel.org/lkml/20230822113133.643238-1-mathieu.desnoyers@efficios.com/
> >>
> >> The combined hackbench results look very promising:
> >>
> >> (hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100)
> >> (192 cores AMD EPYC 9654 96-Core Processor (over 2 sockets), with hyperthreading)
> >>
> >> Baseline:                                       49s
> >> With L2-ttwu-queue-skip:                        34s (30% speedup)
> >> With L2-ttwu-queue-skip + ratelimit-load-avg:   26s (46% speedup)
> >>
> >> Feel free to apply my:
> >>
> >> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> >> Tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> >
> > Thanks a lot for running this and reviewing the patch.
> > I'll add your number and tag in the changelog when sending a new
> > version.
>
> Now that I come to think of it, I have a comment: why use
> sched_clock_cpu() rather than just reading the jiffies value? AFAIR,
> sched_clock can be slower than needed when read from a "remote" CPU on
> architectures that have an unsynchronized TSC.
>
> Considering that you only need a time reference more or less accurate at
> the millisecond level, I suspect that jiffies is what you are looking
> for here. This is what the NUMA balance code and rseq mm_cid use to
> execute work every N milliseconds.

The tick can be 4ms or even 10ms, which means a rate limit of between
10ms and 20ms in the latter case.
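
The granularity argument can be checked with a little arithmetic. The
helpers below are a hypothetical userspace model, not kernel API: they
assume one jiffy lasts 1/HZ seconds and that a millisecond interval is
rounded up to whole jiffies.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL
#define MSEC_PER_SEC 1000ULL

/* Length of one jiffy in nanoseconds for a given tick rate (CONFIG_HZ). */
static uint64_t jiffy_period_ns(unsigned int hz)
{
	return NSEC_PER_SEC / hz;
}

/*
 * Hypothetical model: number of whole jiffies needed to cover `ms`
 * milliseconds, rounding up so the interval is never shortened.
 */
static uint64_t model_msecs_to_jiffies(uint64_t ms, unsigned int hz)
{
	return (ms * hz + MSEC_PER_SEC - 1) / MSEC_PER_SEC;
}

/* Nominal interval, in ns, that a jiffies-based limit of `ms` turns into. */
static uint64_t model_effective_interval_ns(uint64_t ms, unsigned int hz)
{
	return model_msecs_to_jiffies(ms, hz) * jiffy_period_ns(hz);
}
```

Under these assumptions a requested 1ms limit becomes one jiffy, i.e.
a nominal 4ms at HZ=250 and 10ms at HZ=100; only HZ=1000 gives 1ms.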

>
> Thanks,
>
> Mathieu
>
> >
> > Regards,
> > Aaron
> >
> >>>
> >>> ==============================
> >>> postgres_sysbench on SPR:
> >>> 25%
> >>> base:   42382±19.8%
> >>> patch:  50174±9.5%  (noise)
> >>>
> >>> 50%
> >>> base:   67626±1.3%
> >>> patch:  67365±3.1%  (noise)
> >>>
> >>> 75%
> >>> base:   100216±1.2%
> >>> patch:  112470±0.1% +12.2%
> >>>
> >>> 100%
> >>> base:    93671±0.4%
> >>> patch:  113563±0.2% +21.2%
> >>>
> >>> ==============================
> >>> hackbench on ICL:
> >>> group=1
> >>> base:    114912±5.2%
> >>> patch:   117857±2.5%  (noise)
> >>>
> >>> group=4
> >>> base:    359902±1.6%
> >>> patch:   361685±2.7%  (noise)
> >>>
> >>> group=8
> >>> base:    461070±0.8%
> >>> patch:   491713±0.3% +6.6%
> >>>
> >>> group=16
> >>> base:    309032±5.0%
> >>> patch:   378337±1.3% +22.4%
> >>>
> >>> =============================
> >>> hackbench on SPR:
> >>> group=1
> >>> base:    100768±2.9%
> >>> patch:   103134±2.9%  (noise)
> >>>
> >>> group=4
> >>> base:    413830±12.5%
> >>> patch:   378660±16.6% (noise)
> >>>
> >>> group=8
> >>> base:    436124±0.6%
> >>> patch:   490787±3.2% +12.5%
> >>>
> >>> group=16
> >>> base:    457730±3.2%
> >>> patch:   680452±1.3% +48.8%
> >>>
> >>> ============================
> >>> netperf/udp_rr on ICL
> >>> 25%
> >>> base:    114413±0.1%
> >>> patch:   115111±0.0% +0.6%
> >>>
> >>> 50%
> >>> base:    86803±0.5%
> >>> patch:   86611±0.0%  (noise)
> >>>
> >>> 75%
> >>> base:    35959±5.3%
> >>> patch:   49801±0.6% +38.5%
> >>>
> >>> 100%
> >>> base:    61951±6.4%
> >>> patch:   70224±0.8% +13.4%
> >>>
> >>> ===========================
> >>> netperf/udp_rr on SPR
> >>> 25%
> >>> base:   104954±1.3%
> >>> patch:  107312±2.8%  (noise)
> >>>
> >>> 50%
> >>> base:    55394±4.6%
> >>> patch:   54940±7.4%  (noise)
> >>>
> >>> 75%
> >>> base:    13779±3.1%
> >>> patch:   36105±1.1% +162%
> >>>
> >>> 100%
> >>> base:     9703±3.7%
> >>> patch:   28011±0.2% +189%
> >>>
> >>> ==============================================
> >>> netperf/tcp_stream on ICL (all in noise range)
> >>> 25%
> >>> base:    43092±0.1%
> >>> patch:   42891±0.5%
> >>>
> >>> 50%
> >>> base:    19278±14.9%
> >>> patch:   22369±7.2%
> >>>
> >>> 75%
> >>> base:    16822±3.0%
> >>> patch:   17086±2.3%
> >>>
> >>> 100%
> >>> base:    18216±0.6%
> >>> patch:   18078±2.9%
> >>>
> >>> ===============================================
> >>> netperf/tcp_stream on SPR (all in noise range)
> >>> 25%
> >>> base:    34491±0.3%
> >>> patch:   34886±0.5%
> >>>
> >>> 50%
> >>> base:    19278±14.9%
> >>> patch:   22369±7.2%
> >>>
> >>> 75%
> >>> base:    16822±3.0%
> >>> patch:   17086±2.3%
> >>>
> >>> 100%
> >>> base:    18216±0.6%
> >>> patch:   18078±2.9%
> >>>
> >>> Reported-by: Nitin Tekchandani <nitin.tekchandani@intel.com>
> >>> Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
> >>> Signed-off-by: Aaron Lu <aaron.lu@intel.com>
> >>> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
> >>> ---
> >>>    kernel/sched/fair.c  | 13 ++++++++++++-
> >>>    kernel/sched/sched.h |  1 +
> >>>    2 files changed, 13 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>> index c28206499a3d..a5462d1fcc48 100644
> >>> --- a/kernel/sched/fair.c
> >>> +++ b/kernel/sched/fair.c
> >>> @@ -3664,7 +3664,8 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> >>>     */
> >>>    static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
> >>>    {
> >>> -   long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> >>> +   long delta;
> >>> +   u64 now;
> >>>     /*
> >>>      * No need to update load_avg for root_task_group as it is not used.
> >>> @@ -3672,9 +3673,19 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
> >>>     if (cfs_rq->tg == &root_task_group)
> >>>             return;
> >>> +   /*
> >>> +    * For migration heavy workload, access to tg->load_avg can be
> >>> +    * unbound. Limit the update rate to at most once per ms.
> >>> +    */
> >>> +   now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
> >>> +   if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
> >>> +           return;
> >>> +
> >>> +   delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> >>>     if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
> >>>             atomic_long_add(delta, &cfs_rq->tg->load_avg);
> >>>             cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
> >>> +           cfs_rq->last_update_tg_load_avg = now;
> >>>     }
> >>>    }
> >>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> >>> index 6a8b7b9ed089..52ee7027def9 100644
> >>> --- a/kernel/sched/sched.h
> >>> +++ b/kernel/sched/sched.h
> >>> @@ -593,6 +593,7 @@ struct cfs_rq {
> >>>     } removed;
> >>>    #ifdef CONFIG_FAIR_GROUP_SCHED
> >>> +   u64                     last_update_tg_load_avg;
> >>>     unsigned long           tg_load_avg_contrib;
> >>>     long                    propagate;
> >>>     long                    prop_runnable_sum;
> >>
> >> --
> >> Mathieu Desnoyers
> >> EfficiOS Inc.
> >> https://www.efficios.com
> >>
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/1] sched/fair: ratelimit update to tg->load_avg
  2023-08-24 13:03         ` Vincent Guittot
@ 2023-08-24 13:08           ` Mathieu Desnoyers
  2023-08-24 13:24             ` Vincent Guittot
  2023-08-25  6:08             ` Aaron Lu
  0 siblings, 2 replies; 15+ messages in thread
From: Mathieu Desnoyers @ 2023-08-24 13:08 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Aaron Lu, Peter Zijlstra, Ingo Molnar, Juri Lelli, Daniel Jordan,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tim Chen,
	Nitin Tekchandani, Yu Chen, Waiman Long, Deng Pan,
	Gautham R . Shenoy, David Vernet, linux-kernel

On 8/24/23 09:03, Vincent Guittot wrote:
> On Thu, 24 Aug 2023 at 14:55, Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> On 8/24/23 04:01, Aaron Lu wrote:
>>> On Wed, Aug 23, 2023 at 10:05:31AM -0400, Mathieu Desnoyers wrote:
>>>> On 8/23/23 02:08, Aaron Lu wrote:
>>>>> When using sysbench to benchmark Postgres in a single docker instance
>>>>> with sysbench's nr_threads set to nr_cpu, it is observed there are times
>>>>> update_cfs_group() and update_load_avg() shows noticeable overhead on
>>>>> a 2sockets/112core/224cpu Intel Sapphire Rapids(SPR):
>>>>>
>>>>>        13.75%    13.74%  [kernel.vmlinux]           [k] update_cfs_group
>>>>>        10.63%    10.04%  [kernel.vmlinux]           [k] update_load_avg
>>>>>
>>>>> Annotate shows the cycles are mostly spent on accessing tg->load_avg
>>>>> with update_load_avg() being the write side and update_cfs_group() being
>>>>> the read side. tg->load_avg is per task group and when different tasks
>>>>> of the same taskgroup running on different CPUs frequently access
>>>>> tg->load_avg, it can be heavily contended.
>>>>>
>>>>> E.g. when running postgres_sysbench on a 2sockets/112cores/224cpus Intel
>>>>> Sapphire Rapids, during a 5s window, the wakeup number is 14 million and
>>>>> migration number is 11 million and with each migration, the task's load
>>>>> will transfer from src cfs_rq to target cfs_rq and each change involves
>>>>> an update to tg->load_avg. Since the workload can trigger that many wakeups
>>>>> and migrations, the access (both read and write) to tg->load_avg can be
>>>>> unbounded. As a result, the two mentioned functions showed noticeable
>>>>> overhead. With netperf/nr_client=nr_cpu/UDP_RR, the problem is worse:
>>>>> during a 5s window, wakeup number is 21 million and migration number is
>>>>> 14 million; update_cfs_group() costs ~25% and update_load_avg() costs ~16%.
>>>>>
>>>>> Reduce the overhead by limiting updates to tg->load_avg to at most once
>>>>> per ms. After this change, the cost of accessing tg->load_avg is greatly
>>>>> reduced and performance improved. Detailed test results below.
>>>>
>>>> By applying your patch on top of my patchset at:
>>>>
>>>> https://lore.kernel.org/lkml/20230822113133.643238-1-mathieu.desnoyers@efficios.com/
>>>>
>>>> The combined hackbench results look very promising:
>>>>
>>>> (hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100)
>>>> (192 cores AMD EPYC 9654 96-Core Processor (over 2 sockets), with hyperthreading)
>>>>
>>>> Baseline:                                       49s
>>>> With L2-ttwu-queue-skip:                        34s (30% speedup)
>>>> With L2-ttwu-queue-skip + ratelimit-load-avg:   26s (46% speedup)
>>>>
>>>> Feel free to apply my:
>>>>
>>>> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>>>> Tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>>>
>>> Thanks a lot for running this and reviewing the patch.
>>> I'll add your number and tag in the changelog when sending a new
>>> version.
>>
>> Now that I come to think of it, I have a comment: why use
>> sched_clock_cpu() rather than just reading the jiffies value? AFAIR,
>> sched_clock can be slower than needed when read from a "remote" CPU on
>> architectures that have an unsynchronized TSC.
>>
>> Considering that you only need a time reference more or less accurate at
>> the millisecond level, I suspect that jiffies is what you are looking
>> for here. This is what the NUMA balance code and rseq mm_cid use to
>> execute work every N milliseconds.
> 
> The tick can be 4ms or even 10ms, which means a rate limit of between
> 10ms and 20ms in the latter case.

Fair enough, so just to confirm: is the 1ms a target period which has
been empirically determined to be optimal (lower having too much
overhead, and higher not being precise enough)?

Thanks,

Mathieu

> 
>>
>> Thanks,
>>
>> Mathieu
>>
>>>
>>> Regards,
>>> Aaron
>>>
>>>>>
>>>>> ==============================
>>>>> postgres_sysbench on SPR:
>>>>> 25%
>>>>> base:   42382±19.8%
>>>>> patch:  50174±9.5%  (noise)
>>>>>
>>>>> 50%
>>>>> base:   67626±1.3%
>>>>> patch:  67365±3.1%  (noise)
>>>>>
>>>>> 75%
>>>>> base:   100216±1.2%
>>>>> patch:  112470±0.1% +12.2%
>>>>>
>>>>> 100%
>>>>> base:    93671±0.4%
>>>>> patch:  113563±0.2% +21.2%
>>>>>
>>>>> ==============================
>>>>> hackbench on ICL:
>>>>> group=1
>>>>> base:    114912±5.2%
>>>>> patch:   117857±2.5%  (noise)
>>>>>
>>>>> group=4
>>>>> base:    359902±1.6%
>>>>> patch:   361685±2.7%  (noise)
>>>>>
>>>>> group=8
>>>>> base:    461070±0.8%
>>>>> patch:   491713±0.3% +6.6%
>>>>>
>>>>> group=16
>>>>> base:    309032±5.0%
>>>>> patch:   378337±1.3% +22.4%
>>>>>
>>>>> =============================
>>>>> hackbench on SPR:
>>>>> group=1
>>>>> base:    100768±2.9%
>>>>> patch:   103134±2.9%  (noise)
>>>>>
>>>>> group=4
>>>>> base:    413830±12.5%
>>>>> patch:   378660±16.6% (noise)
>>>>>
>>>>> group=8
>>>>> base:    436124±0.6%
>>>>> patch:   490787±3.2% +12.5%
>>>>>
>>>>> group=16
>>>>> base:    457730±3.2%
>>>>> patch:   680452±1.3% +48.8%
>>>>>
>>>>> ============================
>>>>> netperf/udp_rr on ICL
>>>>> 25%
>>>>> base:    114413±0.1%
>>>>> patch:   115111±0.0% +0.6%
>>>>>
>>>>> 50%
>>>>> base:    86803±0.5%
>>>>> patch:   86611±0.0%  (noise)
>>>>>
>>>>> 75%
>>>>> base:    35959±5.3%
>>>>> patch:   49801±0.6% +38.5%
>>>>>
>>>>> 100%
>>>>> base:    61951±6.4%
>>>>> patch:   70224±0.8% +13.4%
>>>>>
>>>>> ===========================
>>>>> netperf/udp_rr on SPR
>>>>> 25%
>>>>> base:   104954±1.3%
>>>>> patch:  107312±2.8%  (noise)
>>>>>
>>>>> 50%
>>>>> base:    55394±4.6%
>>>>> patch:   54940±7.4%  (noise)
>>>>>
>>>>> 75%
>>>>> base:    13779±3.1%
>>>>> patch:   36105±1.1% +162%
>>>>>
>>>>> 100%
>>>>> base:     9703±3.7%
>>>>> patch:   28011±0.2% +189%
>>>>>
>>>>> ==============================================
>>>>> netperf/tcp_stream on ICL (all in noise range)
>>>>> 25%
>>>>> base:    43092±0.1%
>>>>> patch:   42891±0.5%
>>>>>
>>>>> 50%
>>>>> base:    19278±14.9%
>>>>> patch:   22369±7.2%
>>>>>
>>>>> 75%
>>>>> base:    16822±3.0%
>>>>> patch:   17086±2.3%
>>>>>
>>>>> 100%
>>>>> base:    18216±0.6%
>>>>> patch:   18078±2.9%
>>>>>
>>>>> ===============================================
>>>>> netperf/tcp_stream on SPR (all in noise range)
>>>>> 25%
>>>>> base:    34491±0.3%
>>>>> patch:   34886±0.5%
>>>>>
>>>>> 50%
>>>>> base:    19278±14.9%
>>>>> patch:   22369±7.2%
>>>>>
>>>>> 75%
>>>>> base:    16822±3.0%
>>>>> patch:   17086±2.3%
>>>>>
>>>>> 100%
>>>>> base:    18216±0.6%
>>>>> patch:   18078±2.9%
>>>>>
>>>>> Reported-by: Nitin Tekchandani <nitin.tekchandani@intel.com>
>>>>> Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
>>>>> Signed-off-by: Aaron Lu <aaron.lu@intel.com>
>>>>> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
>>>>> ---
>>>>>     kernel/sched/fair.c  | 13 ++++++++++++-
>>>>>     kernel/sched/sched.h |  1 +
>>>>>     2 files changed, 13 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>> index c28206499a3d..a5462d1fcc48 100644
>>>>> --- a/kernel/sched/fair.c
>>>>> +++ b/kernel/sched/fair.c
>>>>> @@ -3664,7 +3664,8 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
>>>>>      */
>>>>>     static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
>>>>>     {
>>>>> -   long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
>>>>> +   long delta;
>>>>> +   u64 now;
>>>>>      /*
>>>>>       * No need to update load_avg for root_task_group as it is not used.
>>>>> @@ -3672,9 +3673,19 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
>>>>>      if (cfs_rq->tg == &root_task_group)
>>>>>              return;
>>>>> +   /*
>>>>> +    * For migration heavy workload, access to tg->load_avg can be
>>>>> +    * unbound. Limit the update rate to at most once per ms.
>>>>> +    */
>>>>> +   now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
>>>>> +   if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
>>>>> +           return;
>>>>> +
>>>>> +   delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
>>>>>      if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
>>>>>              atomic_long_add(delta, &cfs_rq->tg->load_avg);
>>>>>              cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
>>>>> +           cfs_rq->last_update_tg_load_avg = now;
>>>>>      }
>>>>>     }
>>>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>>>>> index 6a8b7b9ed089..52ee7027def9 100644
>>>>> --- a/kernel/sched/sched.h
>>>>> +++ b/kernel/sched/sched.h
>>>>> @@ -593,6 +593,7 @@ struct cfs_rq {
>>>>>      } removed;
>>>>>     #ifdef CONFIG_FAIR_GROUP_SCHED
>>>>> +   u64                     last_update_tg_load_avg;
>>>>>      unsigned long           tg_load_avg_contrib;
>>>>>      long                    propagate;
>>>>>      long                    prop_runnable_sum;
>>>>
>>>> --
>>>> Mathieu Desnoyers
>>>> EfficiOS Inc.
>>>> https://www.efficios.com
>>>>
>>
>> --
>> Mathieu Desnoyers
>> EfficiOS Inc.
>> https://www.efficios.com
>>

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/1] sched/fair: ratelimit update to tg->load_avg
  2023-08-24 13:08           ` Mathieu Desnoyers
@ 2023-08-24 13:24             ` Vincent Guittot
  2023-08-25  6:08             ` Aaron Lu
  1 sibling, 0 replies; 15+ messages in thread
From: Vincent Guittot @ 2023-08-24 13:24 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Aaron Lu, Peter Zijlstra, Ingo Molnar, Juri Lelli, Daniel Jordan,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tim Chen,
	Nitin Tekchandani, Yu Chen, Waiman Long, Deng Pan,
	Gautham R . Shenoy, David Vernet, linux-kernel

On Thu, 24 Aug 2023 at 15:07, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> On 8/24/23 09:03, Vincent Guittot wrote:
> > On Thu, 24 Aug 2023 at 14:55, Mathieu Desnoyers
> > <mathieu.desnoyers@efficios.com> wrote:
> >>
> >> On 8/24/23 04:01, Aaron Lu wrote:
> >>> On Wed, Aug 23, 2023 at 10:05:31AM -0400, Mathieu Desnoyers wrote:
> >>>> On 8/23/23 02:08, Aaron Lu wrote:
> >>>>> When using sysbench to benchmark Postgres in a single docker instance
> >>>>> with sysbench's nr_threads set to nr_cpu, it is observed there are times
> >>>>> update_cfs_group() and update_load_avg() shows noticeable overhead on
> >>>>> a 2sockets/112core/224cpu Intel Sapphire Rapids(SPR):
> >>>>>
> >>>>>        13.75%    13.74%  [kernel.vmlinux]           [k] update_cfs_group
> >>>>>        10.63%    10.04%  [kernel.vmlinux]           [k] update_load_avg
> >>>>>
> >>>>> Annotate shows the cycles are mostly spent on accessing tg->load_avg
> >>>>> with update_load_avg() being the write side and update_cfs_group() being
> >>>>> the read side. tg->load_avg is per task group and when different tasks
> >>>>> of the same taskgroup running on different CPUs frequently access
> >>>>> tg->load_avg, it can be heavily contended.
> >>>>>
> >>>>> E.g. when running postgres_sysbench on a 2sockets/112cores/224cpus Intel
> >>>>> Sapphire Rapids, during a 5s window, the wakeup number is 14 million and
> >>>>> migration number is 11 million and with each migration, the task's load
> >>>>> will transfer from src cfs_rq to target cfs_rq and each change involves
> >>>>> an update to tg->load_avg. Since the workload can trigger that many wakeups
> >>>>> and migrations, the access (both read and write) to tg->load_avg can be
> >>>>> unbounded. As a result, the two mentioned functions showed noticeable
> >>>>> overhead. With netperf/nr_client=nr_cpu/UDP_RR, the problem is worse:
> >>>>> during a 5s window, wakeup number is 21 million and migration number is
> >>>>> 14 million; update_cfs_group() costs ~25% and update_load_avg() costs ~16%.
> >>>>>
> >>>>> Reduce the overhead by limiting updates to tg->load_avg to at most once
> >>>>> per ms. After this change, the cost of accessing tg->load_avg is greatly
> >>>>> reduced and performance improved. Detailed test results below.
> >>>>
> >>>> By applying your patch on top of my patchset at:
> >>>>
> >>>> https://lore.kernel.org/lkml/20230822113133.643238-1-mathieu.desnoyers@efficios.com/
> >>>>
> >>>> The combined hackbench results look very promising:
> >>>>
> >>>> (hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100)
> >>>> (192 cores AMD EPYC 9654 96-Core Processor (over 2 sockets), with hyperthreading)
> >>>>
> >>>> Baseline:                                       49s
> >>>> With L2-ttwu-queue-skip:                        34s (30% speedup)
> >>>> With L2-ttwu-queue-skip + ratelimit-load-avg:   26s (46% speedup)
> >>>>
> >>>> Feel free to apply my:
> >>>>
> >>>> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> >>>> Tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> >>>
> >>> Thanks a lot for running this and reviewing the patch.
> >>> I'll add your number and tag in the changelog when sending a new
> >>> version.
> >>
> >> Now that I come to think of it, I have a comment: why use
> >> sched_clock_cpu() rather than just read the jiffies value ? AFAIR,
> >> sched_clock can be slower than needed when read from a "remote" cpu on
> >> architectures that have an unsynchronized tsc.
> >>
> >> Considering that you only need a time reference more or less accurate at
> >> the millisecond level, I suspect that jiffies is what you are looking
> >> for here. This is what the NUMA balance code and rseq mm_cid use to
> >> execute work every N milliseconds.
> >
> > > tick can be 4ms or even 10ms, which means a rate limit of between
> > > 10ms and 20ms in the latter case
>
> Fair enough, so just to confirm: is the 1ms a target period which has
> been empirically determined to be optimal (lower having too much
> overhead, and higher not being precise enough) ?

Yes, it's a tradeoff. This impacts how much time a group can get on a rq.
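
To put numbers on the tick granularity mentioned in the quoted reply,
here is a minimal userspace sketch (not kernel code). The helper names
are hypothetical, and it assumes HZ divides 1000 evenly, which holds for
the common 100, 250 and 1000 settings. It only shows that a tick counter
cannot express a 1ms interval unless HZ is 1000:

```c
#include <assert.h>

/*
 * Length of one scheduler tick in milliseconds for a given HZ.
 * Assumes hz divides 1000 evenly (true for 100, 250 and 1000).
 */
static unsigned int tick_period_ms(unsigned int hz)
{
	assert(hz > 0 && 1000 % hz == 0);
	return 1000 / hz;
}

/*
 * Smallest number of whole ticks that covers at least `ms` milliseconds,
 * i.e. the shortest nonzero interval a tick counter can express for it.
 */
static unsigned int ms_to_ticks_round_up(unsigned int ms, unsigned int hz)
{
	unsigned int period = tick_period_ms(hz);

	return (ms + period - 1) / period;
}
```

With HZ=250 a requested 1ms rounds up to one tick of 4ms, and with
HZ=100 to one tick of 10ms, so a tick-based limit would be much coarser
than the 1ms target.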

>
> Thanks,
>
> Mathieu
>
> >
> >>
> >> Thanks,
> >>
> >> Mathieu
> >>
> >>>
> >>> Regards,
> >>> Aaron
> >>>
> >>>>>
> >>>>> ==============================
> >>>>> postgres_sysbench on SPR:
> >>>>> 25%
> >>>>> base:   42382±19.8%
> >>>>> patch:  50174±9.5%  (noise)
> >>>>>
> >>>>> 50%
> >>>>> base:   67626±1.3%
> >>>>> patch:  67365±3.1%  (noise)
> >>>>>
> >>>>> 75%
> >>>>> base:   100216±1.2%
> >>>>> patch:  112470±0.1% +12.2%
> >>>>>
> >>>>> 100%
> >>>>> base:    93671±0.4%
> >>>>> patch:  113563±0.2% +21.2%
> >>>>>
> >>>>> ==============================
> >>>>> hackbench on ICL:
> >>>>> group=1
> >>>>> base:    114912±5.2%
> >>>>> patch:   117857±2.5%  (noise)
> >>>>>
> >>>>> group=4
> >>>>> base:    359902±1.6%
> >>>>> patch:   361685±2.7%  (noise)
> >>>>>
> >>>>> group=8
> >>>>> base:    461070±0.8%
> >>>>> patch:   491713±0.3% +6.6%
> >>>>>
> >>>>> group=16
> >>>>> base:    309032±5.0%
> >>>>> patch:   378337±1.3% +22.4%
> >>>>>
> >>>>> =============================
> >>>>> hackbench on SPR:
> >>>>> group=1
> >>>>> base:    100768±2.9%
> >>>>> patch:   103134±2.9%  (noise)
> >>>>>
> >>>>> group=4
> >>>>> base:    413830±12.5%
> >>>>> patch:   378660±16.6% (noise)
> >>>>>
> >>>>> group=8
> >>>>> base:    436124±0.6%
> >>>>> patch:   490787±3.2% +12.5%
> >>>>>
> >>>>> group=16
> >>>>> base:    457730±3.2%
> >>>>> patch:   680452±1.3% +48.8%
> >>>>>
> >>>>> ============================
> >>>>> netperf/udp_rr on ICL
> >>>>> 25%
> >>>>> base:    114413±0.1%
> >>>>> patch:   115111±0.0% +0.6%
> >>>>>
> >>>>> 50%
> >>>>> base:    86803±0.5%
> >>>>> patch:   86611±0.0%  (noise)
> >>>>>
> >>>>> 75%
> >>>>> base:    35959±5.3%
> >>>>> patch:   49801±0.6% +38.5%
> >>>>>
> >>>>> 100%
> >>>>> base:    61951±6.4%
> >>>>> patch:   70224±0.8% +13.4%
> >>>>>
> >>>>> ===========================
> >>>>> netperf/udp_rr on SPR
> >>>>> 25%
> >>>>> base:   104954±1.3%
> >>>>> patch:  107312±2.8%  (noise)
> >>>>>
> >>>>> 50%
> >>>>> base:    55394±4.6%
> >>>>> patch:   54940±7.4%  (noise)
> >>>>>
> >>>>> 75%
> >>>>> base:    13779±3.1%
> >>>>> patch:   36105±1.1% +162%
> >>>>>
> >>>>> 100%
> >>>>> base:     9703±3.7%
> >>>>> patch:   28011±0.2% +189%
> >>>>>
> >>>>> ==============================================
> >>>>> netperf/tcp_stream on ICL (all in noise range)
> >>>>> 25%
> >>>>> base:    43092±0.1%
> >>>>> patch:   42891±0.5%
> >>>>>
> >>>>> 50%
> >>>>> base:    19278±14.9%
> >>>>> patch:   22369±7.2%
> >>>>>
> >>>>> 75%
> >>>>> base:    16822±3.0%
> >>>>> patch:   17086±2.3%
> >>>>>
> >>>>> 100%
> >>>>> base:    18216±0.6%
> >>>>> patch:   18078±2.9%
> >>>>>
> >>>>> ===============================================
> >>>>> netperf/tcp_stream on SPR (all in noise range)
> >>>>> 25%
> >>>>> base:    34491±0.3%
> >>>>> patch:   34886±0.5%
> >>>>>
> >>>>> 50%
> >>>>> base:    19278±14.9%
> >>>>> patch:   22369±7.2%
> >>>>>
> >>>>> 75%
> >>>>> base:    16822±3.0%
> >>>>> patch:   17086±2.3%
> >>>>>
> >>>>> 100%
> >>>>> base:    18216±0.6%
> >>>>> patch:   18078±2.9%
> >>>>>
> >>>>> Reported-by: Nitin Tekchandani <nitin.tekchandani@intel.com>
> >>>>> Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
> >>>>> Signed-off-by: Aaron Lu <aaron.lu@intel.com>
> >>>>> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
> >>>>> ---
> >>>>>     kernel/sched/fair.c  | 13 ++++++++++++-
> >>>>>     kernel/sched/sched.h |  1 +
> >>>>>     2 files changed, 13 insertions(+), 1 deletion(-)
> >>>>>
> >>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>>>> index c28206499a3d..a5462d1fcc48 100644
> >>>>> --- a/kernel/sched/fair.c
> >>>>> +++ b/kernel/sched/fair.c
> >>>>> @@ -3664,7 +3664,8 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> >>>>>      */
> >>>>>     static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
> >>>>>     {
> >>>>> -   long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> >>>>> +   long delta;
> >>>>> +   u64 now;
> >>>>>      /*
> >>>>>       * No need to update load_avg for root_task_group as it is not used.
> >>>>> @@ -3672,9 +3673,19 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
> >>>>>      if (cfs_rq->tg == &root_task_group)
> >>>>>              return;
> >>>>> +   /*
> >>>>> +    * For migration heavy workload, access to tg->load_avg can be
> >>>>> +    * unbound. Limit the update rate to at most once per ms.
> >>>>> +    */
> >>>>> +   now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
> >>>>> +   if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
> >>>>> +           return;
> >>>>> +
> >>>>> +   delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> >>>>>      if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
> >>>>>              atomic_long_add(delta, &cfs_rq->tg->load_avg);
> >>>>>              cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
> >>>>> +           cfs_rq->last_update_tg_load_avg = now;
> >>>>>      }
> >>>>>     }
> >>>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> >>>>> index 6a8b7b9ed089..52ee7027def9 100644
> >>>>> --- a/kernel/sched/sched.h
> >>>>> +++ b/kernel/sched/sched.h
> >>>>> @@ -593,6 +593,7 @@ struct cfs_rq {
> >>>>>      } removed;
> >>>>>     #ifdef CONFIG_FAIR_GROUP_SCHED
> >>>>> +   u64                     last_update_tg_load_avg;
> >>>>>      unsigned long           tg_load_avg_contrib;
> >>>>>      long                    propagate;
> >>>>>      long                    prop_runnable_sum;
> >>>>
> >>>> --
> >>>> Mathieu Desnoyers
> >>>> EfficiOS Inc.
> >>>> https://www.efficios.com
> >>>>
> >>
> >> --
> >> Mathieu Desnoyers
> >> EfficiOS Inc.
> >> https://www.efficios.com
> >>
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/1] sched/fair: ratelimit update to tg->load_avg
  2023-08-23  6:08 ` [PATCH 1/1] sched/fair: ratelimit update to tg->load_avg Aaron Lu
  2023-08-23 14:05   ` Mathieu Desnoyers
@ 2023-08-24 18:48   ` David Vernet
  2023-08-25  6:18     ` Aaron Lu
  2023-09-06  3:52   ` kernel test robot
  2 siblings, 1 reply; 15+ messages in thread
From: David Vernet @ 2023-08-24 18:48 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Peter Zijlstra, Vincent Guittot, Ingo Molnar, Juri Lelli,
	Daniel Jordan, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Tim Chen, Nitin Tekchandani, Yu Chen, Waiman Long, Deng Pan,
	Mathieu Desnoyers, Gautham R . Shenoy, linux-kernel

On Wed, Aug 23, 2023 at 02:08:32PM +0800, Aaron Lu wrote:
> When using sysbench to benchmark Postgres in a single docker instance
> with sysbench's nr_threads set to nr_cpu, it is observed there are times
> update_cfs_group() and update_load_avg() show noticeable overhead on
> a 2sockets/112core/224cpu Intel Sapphire Rapids(SPR):
> 
>     13.75%    13.74%  [kernel.vmlinux]           [k] update_cfs_group
>     10.63%    10.04%  [kernel.vmlinux]           [k] update_load_avg
> 
> Annotate shows the cycles are mostly spent on accessing tg->load_avg
> with update_load_avg() being the write side and update_cfs_group() being
> the read side. tg->load_avg is per task group and when different tasks
> of the same taskgroup running on different CPUs frequently access
> tg->load_avg, it can be heavily contended.
> 
> E.g. when running postgres_sysbench on a 2sockets/112cores/224cpus Intel
> Sapphire Rapids, during a 5s window, there are 14 million wakeups and
> 11 million migrations, and with each migration the task's load
> transfers from the src cfs_rq to the target cfs_rq, and each change involves
> an update to tg->load_avg. Since the workload can trigger this many wakeups
> and migrations, the accesses (both read and write) to tg->load_avg can be
> unbounded. As a result, the two mentioned functions showed noticeable
> overhead. With netperf/nr_client=nr_cpu/UDP_RR, the problem is worse:
> during a 5s window, there are 21 million wakeups and 14 million
> migrations; update_cfs_group() costs ~25% and update_load_avg() costs ~16%.
> 
> Reduce the overhead by limiting updates to tg->load_avg to at most once
> per ms. After this change, the cost of accessing tg->load_avg is greatly
> reduced and performance improved. Detailed test results below.
> 
> ==============================
> postgres_sysbench on SPR:
> 25%
> base:   42382±19.8%
> patch:  50174±9.5%  (noise)
> 
> 50%
> base:   67626±1.3%
> patch:  67365±3.1%  (noise)
> 
> 75%
> base:   100216±1.2%
> patch:  112470±0.1% +12.2%
> 
> 100%
> base:    93671±0.4%
> patch:  113563±0.2% +21.2%
> 
> ==============================
> hackbench on ICL:
> group=1
> base:    114912±5.2%
> patch:   117857±2.5%  (noise)
> 
> group=4
> base:    359902±1.6%
> patch:   361685±2.7%  (noise)
> 
> group=8
> base:    461070±0.8%
> patch:   491713±0.3% +6.6%
> 
> group=16
> base:    309032±5.0%
> patch:   378337±1.3% +22.4%
> 
> =============================
> hackbench on SPR:
> group=1
> base:    100768±2.9%
> patch:   103134±2.9%  (noise)
> 
> group=4
> base:    413830±12.5%
> patch:   378660±16.6% (noise)
> 
> group=8
> base:    436124±0.6%
> patch:   490787±3.2% +12.5%
> 
> group=16
> base:    457730±3.2%
> patch:   680452±1.3% +48.8%
> 
> ============================
> netperf/udp_rr on ICL
> 25%
> base:    114413±0.1%
> patch:   115111±0.0% +0.6%
> 
> 50%
> base:    86803±0.5%
> patch:   86611±0.0%  (noise)
> 
> 75%
> base:    35959±5.3%
> patch:   49801±0.6% +38.5%
> 
> 100%
> base:    61951±6.4%
> patch:   70224±0.8% +13.4%
> 
> ===========================
> netperf/udp_rr on SPR
> 25%
> base:   104954±1.3%
> patch:  107312±2.8%  (noise)
> 
> 50%
> base:    55394±4.6%
> patch:   54940±7.4%  (noise)
> 
> 75%
> base:    13779±3.1%
> patch:   36105±1.1% +162%
> 
> 100%
> base:     9703±3.7%
> patch:   28011±0.2% +189%
> 
> ==============================================
> netperf/tcp_stream on ICL (all in noise range)
> 25%
> base:    43092±0.1%
> patch:   42891±0.5%
> 
> 50%
> base:    19278±14.9%
> patch:   22369±7.2%
> 
> 75%
> base:    16822±3.0%
> patch:   17086±2.3%
> 
> 100%
> base:    18216±0.6%
> patch:   18078±2.9%
> 
> ===============================================
> netperf/tcp_stream on SPR (all in noise range)
> 25%
> base:    34491±0.3%
> patch:   34886±0.5%
> 
> 50%
> base:    19278±14.9%
> patch:   22369±7.2%
> 
> 75%
> base:    16822±3.0%
> patch:   17086±2.3%
> 
> 100%
> base:    18216±0.6%
> patch:   18078±2.9%
> 
> Reported-by: Nitin Tekchandani <nitin.tekchandani@intel.com>
> Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
> Signed-off-by: Aaron Lu <aaron.lu@intel.com>
> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>

Hey Aaron,

Thanks for working on this. It LGTM modulo two small nits. Feel free to
add my Reviewed-by if you'd like regardless:

Reviewed-by: David Vernet <void@manifault.com>

> ---
>  kernel/sched/fair.c  | 13 ++++++++++++-
>  kernel/sched/sched.h |  1 +
>  2 files changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c28206499a3d..a5462d1fcc48 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3664,7 +3664,8 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
>   */
>  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
>  {
> -	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> +	long delta;
> +	u64 now;
>  
>  	/*
>  	 * No need to update load_avg for root_task_group as it is not used.
> @@ -3672,9 +3673,19 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
>  	if (cfs_rq->tg == &root_task_group)
>  		return;
>  
> +	/*
> +	 * For migration heavy workload, access to tg->load_avg can be

s/workload/workloads

> +	 * unbound. Limit the update rate to at most once per ms.

Can we describe either here or in the commit summary how we arrived at
1ms? I'm fine with hard-coded heuristics like this (just like the
proposed 6-core shard size in the shared_runq patchset), but it would
also be ideal to give a bit more color on how we arrived here, because
we'll forget immediately otherwise.

> +	 */
> +	now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
> +	if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
> +		return;
> +
> +	delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
>  	if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
>  		atomic_long_add(delta, &cfs_rq->tg->load_avg);
>  		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
> +		cfs_rq->last_update_tg_load_avg = now;
>  	}
>  }
>  
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 6a8b7b9ed089..52ee7027def9 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -593,6 +593,7 @@ struct cfs_rq {
>  	} removed;
>  
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> +	u64			last_update_tg_load_avg;
>  	unsigned long		tg_load_avg_contrib;
>  	long			propagate;
>  	long			prop_runnable_sum;
> -- 
> 2.41.0
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/1] sched/fair: ratelimit update to tg->load_avg
  2023-08-24 13:08           ` Mathieu Desnoyers
  2023-08-24 13:24             ` Vincent Guittot
@ 2023-08-25  6:08             ` Aaron Lu
  1 sibling, 0 replies; 15+ messages in thread
From: Aaron Lu @ 2023-08-25  6:08 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Vincent Guittot, Peter Zijlstra, Ingo Molnar, Juri Lelli,
	Daniel Jordan, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Tim Chen, Nitin Tekchandani, Yu Chen, Waiman Long, Deng Pan,
	Gautham R . Shenoy, David Vernet, linux-kernel

On Thu, Aug 24, 2023 at 09:08:43AM -0400, Mathieu Desnoyers wrote:
> On 8/24/23 09:03, Vincent Guittot wrote:
> > On Thu, 24 Aug 2023 at 14:55, Mathieu Desnoyers
> > <mathieu.desnoyers@efficios.com> wrote:
> > > 
> > > On 8/24/23 04:01, Aaron Lu wrote:
> > > > On Wed, Aug 23, 2023 at 10:05:31AM -0400, Mathieu Desnoyers wrote:
> > > > > On 8/23/23 02:08, Aaron Lu wrote:
> > > > > > When using sysbench to benchmark Postgres in a single docker instance
> > > > > > with sysbench's nr_threads set to nr_cpu, it is observed there are times
> > > > > > > update_cfs_group() and update_load_avg() show noticeable overhead on
> > > > > > a 2sockets/112core/224cpu Intel Sapphire Rapids(SPR):
> > > > > > 
> > > > > >        13.75%    13.74%  [kernel.vmlinux]           [k] update_cfs_group
> > > > > >        10.63%    10.04%  [kernel.vmlinux]           [k] update_load_avg
> > > > > > 
> > > > > > Annotate shows the cycles are mostly spent on accessing tg->load_avg
> > > > > > with update_load_avg() being the write side and update_cfs_group() being
> > > > > > the read side. tg->load_avg is per task group and when different tasks
> > > > > > of the same taskgroup running on different CPUs frequently access
> > > > > > tg->load_avg, it can be heavily contended.
> > > > > > 
> > > > > > > E.g. when running postgres_sysbench on a 2sockets/112cores/224cpus Intel
> > > > > > > Sapphire Rapids, during a 5s window, there are 14 million wakeups and
> > > > > > > 11 million migrations, and with each migration the task's load
> > > > > > > transfers from the src cfs_rq to the target cfs_rq, and each change involves
> > > > > > > an update to tg->load_avg. Since the workload can trigger this many wakeups
> > > > > > > and migrations, the accesses (both read and write) to tg->load_avg can be
> > > > > > > unbounded. As a result, the two mentioned functions showed noticeable
> > > > > > > overhead. With netperf/nr_client=nr_cpu/UDP_RR, the problem is worse:
> > > > > > > during a 5s window, there are 21 million wakeups and 14 million
> > > > > > > migrations; update_cfs_group() costs ~25% and update_load_avg() costs ~16%.
> > > > > > 
> > > > > > Reduce the overhead by limiting updates to tg->load_avg to at most once
> > > > > > per ms. After this change, the cost of accessing tg->load_avg is greatly
> > > > > > reduced and performance improved. Detailed test results below.
> > > > > 
> > > > > By applying your patch on top of my patchset at:
> > > > > 
> > > > > https://lore.kernel.org/lkml/20230822113133.643238-1-mathieu.desnoyers@efficios.com/
> > > > > 
> > > > > The combined hackbench results look very promising:
> > > > > 
> > > > > (hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100)
> > > > > (192 cores AMD EPYC 9654 96-Core Processor (over 2 sockets), with hyperthreading)
> > > > > 
> > > > > Baseline:                                       49s
> > > > > With L2-ttwu-queue-skip:                        34s (30% speedup)
> > > > > With L2-ttwu-queue-skip + ratelimit-load-avg:   26s (46% speedup)
> > > > > 
> > > > > Feel free to apply my:
> > > > > 
> > > > > Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> > > > > Tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> > > > 
> > > > Thanks a lot for running this and reviewing the patch.
> > > > I'll add your number and tag in the changelog when sending a new
> > > > version.
> > > 
> > > > Now that I come to think of it, I have a comment: why use
> > > sched_clock_cpu() rather than just read the jiffies value ? AFAIR,
> > > sched_clock can be slower than needed when read from a "remote" cpu on
> > > architectures that have an unsynchronized tsc.
> > > 
> > > Considering that you only need a time reference more or less accurate at
> > > the millisecond level, I suspect that jiffies is what you are looking
> > > for here. This is what the NUMA balance code and rseq mm_cid use to
> > > execute work every N milliseconds.
> > 
> > > tick can be 4ms or even 10ms, which means a rate limit of between
> > > 10ms and 20ms in the latter case
> 
> Fair enough, so just to confirm: is the 1ms a target period which has been
> empirically determined to be optimal (lower having too much overhead, and
> higher not being precise enough) ?

I chose 1ms because the PELT window is roughly 1ms.

And during my tests, ratelimiting to once per 1ms delivered good
performance with no regressions for the workloads I have tested so far,
so I didn't try other values.

I can't say 1ms is the optimal value, but it appears to work well enough
for now.
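
For readers who want to play with the idea outside the kernel, below is
a minimal userspace sketch of the ratelimit, under stated assumptions:
`struct contrib_state` and `ratelimited_fold()` are hypothetical names,
the timestamp is passed in by the caller instead of coming from
sched_clock_cpu(), and a plain `long` stands in for the atomic
tg->load_avg. It mirrors the shape of the quoted diff and is not the
kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define NSEC_PER_MSEC 1000000ULL

/* Hypothetical stand-in for the per-cfs_rq state touched by the patch. */
struct contrib_state {
	long load_avg;            /* local value, cheap to read */
	long contrib;             /* last value folded into the shared sum */
	uint64_t last_update_ns;  /* time of the last fold */
};

/*
 * Fold the local change into *shared at most once per millisecond.
 * Returns true if the shared sum was written.
 */
static bool ratelimited_fold(struct contrib_state *s, long *shared,
			     uint64_t now_ns)
{
	long delta;

	assert(s && shared);

	/* Too soon since the last fold: skip without touching *shared. */
	if (now_ns - s->last_update_ns < NSEC_PER_MSEC)
		return false;

	delta = s->load_avg - s->contrib;
	/* Same 1/64 threshold as in the quoted diff. */
	if (labs(delta) > s->contrib / 64) {
		*shared += delta;   /* the kernel uses atomic_long_add() here */
		s->contrib = s->load_avg;
		s->last_update_ns = now_ns;
		return true;
	}
	return false;
}
```

Calling it twice within the same millisecond leaves the shared sum
untouched on the second call, and the pending change is picked up by the
first call made after the interval has passed.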

Thanks,
Aaron

> > > 
> > > Thanks,
> > > 
> > > Mathieu
> > > 
> > > > 
> > > > Regards,
> > > > Aaron
> > > > 
> > > > > > 
> > > > > > ==============================
> > > > > > postgres_sysbench on SPR:
> > > > > > 25%
> > > > > > base:   42382±19.8%
> > > > > > patch:  50174±9.5%  (noise)
> > > > > > 
> > > > > > 50%
> > > > > > base:   67626±1.3%
> > > > > > patch:  67365±3.1%  (noise)
> > > > > > 
> > > > > > 75%
> > > > > > base:   100216±1.2%
> > > > > > patch:  112470±0.1% +12.2%
> > > > > > 
> > > > > > 100%
> > > > > > base:    93671±0.4%
> > > > > > patch:  113563±0.2% +21.2%
> > > > > > 
> > > > > > ==============================
> > > > > > hackbench on ICL:
> > > > > > group=1
> > > > > > base:    114912±5.2%
> > > > > > patch:   117857±2.5%  (noise)
> > > > > > 
> > > > > > group=4
> > > > > > base:    359902±1.6%
> > > > > > patch:   361685±2.7%  (noise)
> > > > > > 
> > > > > > group=8
> > > > > > base:    461070±0.8%
> > > > > > patch:   491713±0.3% +6.6%
> > > > > > 
> > > > > > group=16
> > > > > > base:    309032±5.0%
> > > > > > patch:   378337±1.3% +22.4%
> > > > > > 
> > > > > > =============================
> > > > > > hackbench on SPR:
> > > > > > group=1
> > > > > > base:    100768±2.9%
> > > > > > patch:   103134±2.9%  (noise)
> > > > > > 
> > > > > > group=4
> > > > > > base:    413830±12.5%
> > > > > > patch:   378660±16.6% (noise)
> > > > > > 
> > > > > > group=8
> > > > > > base:    436124±0.6%
> > > > > > patch:   490787±3.2% +12.5%
> > > > > > 
> > > > > > group=16
> > > > > > base:    457730±3.2%
> > > > > > patch:   680452±1.3% +48.8%
> > > > > > 
> > > > > > ============================
> > > > > > netperf/udp_rr on ICL
> > > > > > 25%
> > > > > > base:    114413±0.1%
> > > > > > patch:   115111±0.0% +0.6%
> > > > > > 
> > > > > > 50%
> > > > > > base:    86803±0.5%
> > > > > > patch:   86611±0.0%  (noise)
> > > > > > 
> > > > > > 75%
> > > > > > base:    35959±5.3%
> > > > > > patch:   49801±0.6% +38.5%
> > > > > > 
> > > > > > 100%
> > > > > > base:    61951±6.4%
> > > > > > patch:   70224±0.8% +13.4%
> > > > > > 
> > > > > > ===========================
> > > > > > netperf/udp_rr on SPR
> > > > > > 25%
> > > > > > base:   104954±1.3%
> > > > > > patch:  107312±2.8%  (noise)
> > > > > > 
> > > > > > 50%
> > > > > > base:    55394±4.6%
> > > > > > patch:   54940±7.4%  (noise)
> > > > > > 
> > > > > > 75%
> > > > > > base:    13779±3.1%
> > > > > > patch:   36105±1.1% +162%
> > > > > > 
> > > > > > 100%
> > > > > > base:     9703±3.7%
> > > > > > patch:   28011±0.2% +189%
> > > > > > 
> > > > > > ==============================================
> > > > > > netperf/tcp_stream on ICL (all in noise range)
> > > > > > 25%
> > > > > > base:    43092±0.1%
> > > > > > patch:   42891±0.5%
> > > > > > 
> > > > > > 50%
> > > > > > base:    19278±14.9%
> > > > > > patch:   22369±7.2%
> > > > > > 
> > > > > > 75%
> > > > > > base:    16822±3.0%
> > > > > > patch:   17086±2.3%
> > > > > > 
> > > > > > 100%
> > > > > > base:    18216±0.6%
> > > > > > patch:   18078±2.9%
> > > > > > 
> > > > > > ===============================================
> > > > > > netperf/tcp_stream on SPR (all in noise range)
> > > > > > 25%
> > > > > > base:    34491±0.3%
> > > > > > patch:   34886±0.5%
> > > > > > 
> > > > > > 50%
> > > > > > base:    19278±14.9%
> > > > > > patch:   22369±7.2%
> > > > > > 
> > > > > > 75%
> > > > > > base:    16822±3.0%
> > > > > > patch:   17086±2.3%
> > > > > > 
> > > > > > 100%
> > > > > > base:    18216±0.6%
> > > > > > patch:   18078±2.9%
> > > > > > 
> > > > > > Reported-by: Nitin Tekchandani <nitin.tekchandani@intel.com>
> > > > > > Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
> > > > > > Signed-off-by: Aaron Lu <aaron.lu@intel.com>
> > > > > > Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
> > > > > > ---
> > > > > >     kernel/sched/fair.c  | 13 ++++++++++++-
> > > > > >     kernel/sched/sched.h |  1 +
> > > > > >     2 files changed, 13 insertions(+), 1 deletion(-)
> > > > > > 
> > > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > > > index c28206499a3d..a5462d1fcc48 100644
> > > > > > --- a/kernel/sched/fair.c
> > > > > > +++ b/kernel/sched/fair.c
> > > > > > @@ -3664,7 +3664,8 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> > > > > >      */
> > > > > >     static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
> > > > > >     {
> > > > > > -   long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> > > > > > +   long delta;
> > > > > > +   u64 now;
> > > > > >      /*
> > > > > >       * No need to update load_avg for root_task_group as it is not used.
> > > > > > @@ -3672,9 +3673,19 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
> > > > > >      if (cfs_rq->tg == &root_task_group)
> > > > > >              return;
> > > > > > +   /*
> > > > > > +    * For migration heavy workload, access to tg->load_avg can be
> > > > > > +    * unbound. Limit the update rate to at most once per ms.
> > > > > > +    */
> > > > > > +   now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
> > > > > > +   if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
> > > > > > +           return;
> > > > > > +
> > > > > > +   delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> > > > > >      if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
> > > > > >              atomic_long_add(delta, &cfs_rq->tg->load_avg);
> > > > > >              cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
> > > > > > +           cfs_rq->last_update_tg_load_avg = now;
> > > > > >      }
> > > > > >     }
> > > > > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > > > > > index 6a8b7b9ed089..52ee7027def9 100644
> > > > > > --- a/kernel/sched/sched.h
> > > > > > +++ b/kernel/sched/sched.h
> > > > > > @@ -593,6 +593,7 @@ struct cfs_rq {
> > > > > >      } removed;
> > > > > >     #ifdef CONFIG_FAIR_GROUP_SCHED
> > > > > > +   u64                     last_update_tg_load_avg;
> > > > > >      unsigned long           tg_load_avg_contrib;
> > > > > >      long                    propagate;
> > > > > >      long                    prop_runnable_sum;
> > > > > 
> > > > > --
> > > > > Mathieu Desnoyers
> > > > > EfficiOS Inc.
> > > > > https://www.efficios.com
> > > > > 
> > > 
> > > --
> > > Mathieu Desnoyers
> > > EfficiOS Inc.
> > > https://www.efficios.com
> > > 
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/1] sched/fair: ratelimit update to tg->load_avg
  2023-08-24 18:48   ` David Vernet
@ 2023-08-25  6:18     ` Aaron Lu
  0 siblings, 0 replies; 15+ messages in thread
From: Aaron Lu @ 2023-08-25  6:18 UTC (permalink / raw)
  To: David Vernet
  Cc: Peter Zijlstra, Vincent Guittot, Ingo Molnar, Juri Lelli,
	Daniel Jordan, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Tim Chen, Nitin Tekchandani, Yu Chen, Waiman Long, Deng Pan,
	Mathieu Desnoyers, Gautham R . Shenoy, linux-kernel

On Thu, Aug 24, 2023 at 01:48:07PM -0500, David Vernet wrote:
> On Wed, Aug 23, 2023 at 02:08:32PM +0800, Aaron Lu wrote:
> > When using sysbench to benchmark Postgres in a single docker instance
> > with sysbench's nr_threads set to nr_cpu, it is observed there are times
> > update_cfs_group() and update_load_avg() show noticeable overhead on
> > a 2sockets/112core/224cpu Intel Sapphire Rapids(SPR):
> > 
> >     13.75%    13.74%  [kernel.vmlinux]           [k] update_cfs_group
> >     10.63%    10.04%  [kernel.vmlinux]           [k] update_load_avg
> > 
> > Annotate shows the cycles are mostly spent on accessing tg->load_avg
> > with update_load_avg() being the write side and update_cfs_group() being
> > the read side. tg->load_avg is per task group and when different tasks
> > of the same taskgroup running on different CPUs frequently access
> > tg->load_avg, it can be heavily contended.
> > 
> > E.g. when running postgres_sysbench on a 2sockets/112cores/224cpus Intel
> > Sapphire Rapids, during a 5s window, there are 14 million wakeups and
> > 11 million migrations, and with each migration the task's load
> > transfers from the src cfs_rq to the target cfs_rq, and each change involves
> > an update to tg->load_avg. Since the workload can trigger this many wakeups
> > and migrations, the accesses (both read and write) to tg->load_avg can be
> > unbounded. As a result, the two mentioned functions showed noticeable
> > overhead. With netperf/nr_client=nr_cpu/UDP_RR, the problem is worse:
> > during a 5s window, there are 21 million wakeups and 14 million
> > migrations; update_cfs_group() costs ~25% and update_load_avg() costs ~16%.
> > 
> > Reduce the overhead by limiting updates to tg->load_avg to at most once
> > per ms. After this change, the cost of accessing tg->load_avg is greatly
> > reduced and performance improved. Detailed test results below.
> > 
> > ==============================
> > postgres_sysbench on SPR:
> > 25%
> > base:   42382±19.8%
> > patch:  50174±9.5%  (noise)
> > 
> > 50%
> > base:   67626±1.3%
> > patch:  67365±3.1%  (noise)
> > 
> > 75%
> > base:   100216±1.2%
> > patch:  112470±0.1% +12.2%
> > 
> > 100%
> > base:    93671±0.4%
> > patch:  113563±0.2% +21.2%
> > 
> > ==============================
> > hackbench on ICL:
> > group=1
> > base:    114912±5.2%
> > patch:   117857±2.5%  (noise)
> > 
> > group=4
> > base:    359902±1.6%
> > patch:   361685±2.7%  (noise)
> > 
> > group=8
> > base:    461070±0.8%
> > patch:   491713±0.3% +6.6%
> > 
> > group=16
> > base:    309032±5.0%
> > patch:   378337±1.3% +22.4%
> > 
> > =============================
> > hackbench on SPR:
> > group=1
> > base:    100768±2.9%
> > patch:   103134±2.9%  (noise)
> > 
> > group=4
> > base:    413830±12.5%
> > patch:   378660±16.6% (noise)
> > 
> > group=8
> > base:    436124±0.6%
> > patch:   490787±3.2% +12.5%
> > 
> > group=16
> > base:    457730±3.2%
> > patch:   680452±1.3% +48.8%
> > 
> > ============================
> > netperf/udp_rr on ICL
> > 25%
> > base:    114413±0.1%
> > patch:   115111±0.0% +0.6%
> > 
> > 50%
> > base:    86803±0.5%
> > patch:   86611±0.0%  (noise)
> > 
> > 75%
> > base:    35959±5.3%
> > patch:   49801±0.6% +38.5%
> > 
> > 100%
> > base:    61951±6.4%
> > patch:   70224±0.8% +13.4%
> > 
> > ===========================
> > netperf/udp_rr on SPR
> > 25%
> > base:   104954±1.3%
> > patch:  107312±2.8%  (noise)
> > 
> > 50%
> > base:    55394±4.6%
> > patch:   54940±7.4%  (noise)
> > 
> > 75%
> > base:    13779±3.1%
> > patch:   36105±1.1% +162%
> > 
> > 100%
> > base:     9703±3.7%
> > patch:   28011±0.2% +189%
> > 
> > ==============================================
> > netperf/tcp_stream on ICL (all in noise range)
> > 25%
> > base:    43092±0.1%
> > patch:   42891±0.5%
> > 
> > 50%
> > base:    19278±14.9%
> > patch:   22369±7.2%
> > 
> > 75%
> > base:    16822±3.0%
> > patch:   17086±2.3%
> > 
> > 100%
> > base:    18216±0.6%
> > patch:   18078±2.9%
> > 
> > ===============================================
> > netperf/tcp_stream on SPR (all in noise range)
> > 25%
> > base:    34491±0.3%
> > patch:   34886±0.5%
> > 
> > 50%
> > base:    19278±14.9%
> > patch:   22369±7.2%
> > 
> > 75%
> > base:    16822±3.0%
> > patch:   17086±2.3%
> > 
> > 100%
> > base:    18216±0.6%
> > patch:   18078±2.9%
> > 
> > Reported-by: Nitin Tekchandani <nitin.tekchandani@intel.com>
> > Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
> > Signed-off-by: Aaron Lu <aaron.lu@intel.com>
> > Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
> 
> Hey Aaron,
> 
> Thanks for working on this. It LGTM modulo two small nits. Feel free to
> add my Reviewed-by if you'd like regardless:
> 
> Reviewed-by: David Vernet <void@manifault.com>

Thanks!

> > ---
> >  kernel/sched/fair.c  | 13 ++++++++++++-
> >  kernel/sched/sched.h |  1 +
> >  2 files changed, 13 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index c28206499a3d..a5462d1fcc48 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3664,7 +3664,8 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> >   */
> >  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
> >  {
> > -	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> > +	long delta;
> > +	u64 now;
> >  
> >  	/*
> >  	 * No need to update load_avg for root_task_group as it is not used.
> > @@ -3672,9 +3673,19 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
> >  	if (cfs_rq->tg == &root_task_group)
> >  		return;
> >  
> > +	/*
> > +	 * For migration heavy workload, access to tg->load_avg can be
> 
> s/workload/workloads

Will change.

> > +	 * unbound. Limit the update rate to at most once per ms.
> 
> Can we describe either here or in the commit summary how we arrived at
> 1ms? I'm fine with hard-coded heuristics like this (just like the
> proposed 6-core shard size in the shared_runq patchset), but it would
> also be ideal to give a bit more color on how we arrived here, because
> we'll forget immediately otherwise.

Agreed. As I replied to Mathieu, I chose 1ms mainly because the PELT window
is roughly 1ms. I'll update the changelog when sending a new version.

Thanks,
Aaron

> > +	 */
> > +	now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
> > +	if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
> > +		return;
> > +
> > +	delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> >  	if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
> >  		atomic_long_add(delta, &cfs_rq->tg->load_avg);
> >  		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
> > +		cfs_rq->last_update_tg_load_avg = now;
> >  	}
> >  }
> >  
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 6a8b7b9ed089..52ee7027def9 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -593,6 +593,7 @@ struct cfs_rq {
> >  	} removed;
> >  
> >  #ifdef CONFIG_FAIR_GROUP_SCHED
> > +	u64			last_update_tg_load_avg;
> >  	unsigned long		tg_load_avg_contrib;
> >  	long			propagate;
> >  	long			prop_runnable_sum;
> > -- 
> > 2.41.0
> > 
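
For readers following the thread, the rate limit in the quoted patch can be
sketched in userspace C as below. This is only an illustration under stated
assumptions, not the kernel code: struct group, struct local_rq and
update_group_load() are hypothetical stand-ins for tg, cfs_rq and
update_tg_load_avg(), C11 atomics stand in for atomic_long_add(), and the
caller passes the timestamp in rather than reading sched_clock_cpu().

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

#define NSEC_PER_MSEC 1000000ULL

/* Hypothetical stand-in for the task group: one counter shared by all CPUs. */
struct group {
	atomic_long load_avg;
};

/* Hypothetical stand-in for a per-CPU cfs_rq. */
struct local_rq {
	struct group *grp;
	long load_avg;            /* local value, cheap to read */
	long contrib;             /* what was last folded into grp->load_avg */
	uint64_t last_update_ns;  /* timestamp of the last fold */
};

/*
 * Fold the local load into the shared counter at most once per millisecond.
 * Returns 1 if the shared counter was written, 0 if the call was skipped.
 */
static int update_group_load(struct local_rq *rq, uint64_t now_ns)
{
	long delta;

	/* Rate limit: bail out before touching any shared cache line. */
	if (now_ns - rq->last_update_ns < NSEC_PER_MSEC)
		return 0;

	delta = rq->load_avg - rq->contrib;

	/* Same 1/64 threshold as the quoted patch: ignore small changes. */
	if (labs(delta) > rq->contrib / 64) {
		atomic_fetch_add(&rq->grp->load_avg, delta);
		rq->contrib = rq->load_avg;
		rq->last_update_ns = now_ns;
		return 1;
	}

	return 0;
}
```

As in the patch, the timestamp is only advanced when the shared counter is
actually written, so a call that is skipped because the delta is too small
does not postpone the next eligible update.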

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/1] Reduce cost of accessing tg->load_avg
  2023-08-23  6:08 [PATCH 0/1] Reduce cost of accessing tg->load_avg Aaron Lu
  2023-08-23  6:08 ` [PATCH 1/1] sched/fair: ratelimit update to tg->load_avg Aaron Lu
@ 2023-08-25 10:33 ` Swapnil Sapkal
  2023-08-28 11:22   ` Aaron Lu
  1 sibling, 1 reply; 15+ messages in thread
From: Swapnil Sapkal @ 2023-08-25 10:33 UTC (permalink / raw)
  To: Aaron Lu, Peter Zijlstra, Vincent Guittot, Ingo Molnar,
	Juri Lelli
  Cc: Daniel Jordan, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Tim Chen, Nitin Tekchandani, Yu Chen, Waiman Long, Deng Pan,
	Mathieu Desnoyers, Gautham R . Shenoy, David Vernet, linux-kernel

Hello Aaron,

On 8/23/2023 11:38 AM, Aaron Lu wrote:
> RFC v2 -> v1:
> - drop RFC;
> - move cfs_rq->last_update_tg_load_avg before cfs_rq->tg_load_avg_contrib;
> - add Vincent's reviewed-by tag.
> 
> RFC v2:
> Nitin Tekchandani noticed some scheduler functions have high cost
> according to perf/cycles while running postgres_sysbench workload.
> I perf/annotated the high cost functions: update_cfs_group() and
> update_load_avg() and found the costs were ~90% due to accessing to
> tg->load_avg. This series is an attempt to reduce the overhead of
> the two functions.
> 
> Thanks to Vincent's suggestion from v1, this revision used a simpler way
> to solve the overhead problem by limiting updates to tg->load_avg to at
> most once per ms. Benchmark shows that it has good results and with the
> rate limit in place, other optimizations in v1 don't improve performance
> further so they are dropped from this revision.
> 

I have tested this series alongside Mathieu's changes. You can find the
report here: https://lore.kernel.org/all/f6dc1652-bc39-0b12-4b6b-29a2f9cd8484@amd.com/

Tested-by: Swapnil Sapkal <Swapnil.Sapkal@amd.com>

> Aaron Lu (1):
>    sched/fair: ratelimit update to tg->load_avg
> 
>   kernel/sched/fair.c  | 13 ++++++++++++-
>   kernel/sched/sched.h |  1 +
>   2 files changed, 13 insertions(+), 1 deletion(-)
> 
--
Thanks and Regards,
Swapnil

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/1] Reduce cost of accessing tg->load_avg
  2023-08-25 10:33 ` [PATCH 0/1] Reduce cost of accessing tg->load_avg Swapnil Sapkal
@ 2023-08-28 11:22   ` Aaron Lu
  0 siblings, 0 replies; 15+ messages in thread
From: Aaron Lu @ 2023-08-28 11:22 UTC (permalink / raw)
  To: Swapnil Sapkal
  Cc: Peter Zijlstra, Vincent Guittot, Ingo Molnar, Juri Lelli,
	Daniel Jordan, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Tim Chen, Nitin Tekchandani, Yu Chen, Waiman Long, Deng Pan,
	Mathieu Desnoyers, Gautham R . Shenoy, David Vernet, linux-kernel

Hi Swapnil,

On Fri, Aug 25, 2023 at 04:03:20PM +0530, Swapnil Sapkal wrote:
> Hello Aaron,
> 
> On 8/23/2023 11:38 AM, Aaron Lu wrote:
> > RFC v2 -> v1:
> > - drop RFC;
> > - move cfs_rq->last_update_tg_load_avg before cfs_rq->tg_load_avg_contrib;
> > - add Vincent's reviewed-by tag.
> > 
> > RFC v2:
> > Nitin Tekchandani noticed some scheduler functions have high cost
> > according to perf/cycles while running postgres_sysbench workload.
> > I perf/annotated the high cost functions: update_cfs_group() and
> > update_load_avg() and found the costs were ~90% due to accessing to
> > tg->load_avg. This series is an attempt to reduce the overhead of
> > the two functions.
> > Thanks to Vincent's suggestion from v1, this revision used a simpler way
> > to solve the overhead problem by limiting updates to tg->load_avg to at
> > most once per ms. Benchmark shows that it has good results and with the
> > rate limit in place, other optimizations in v1 don't improve performance
> > further so they are dropped from this revision.
> > 
> 
> I have tested this series alongside Mathieu's changes. You can find the
> report here: https://lore.kernel.org/all/f6dc1652-bc39-0b12-4b6b-29a2f9cd8484@amd.com/
> 
> Tested-by: Swapnil Sapkal <Swapnil.Sapkal@amd.com>

Thanks a lot for running these workloads and sharing the results. I will
include your tag when sending the next version.

Regards,
Aaron

> > Aaron Lu (1):
> >    sched/fair: ratelimit update to tg->load_avg
> > 
> >   kernel/sched/fair.c  | 13 ++++++++++++-
> >   kernel/sched/sched.h |  1 +
> >   2 files changed, 13 insertions(+), 1 deletion(-)
> > 
> --
> Thanks and Regards,
> Swapnil

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/1] sched/fair: ratelimit update to tg->load_avg
  2023-08-23  6:08 ` [PATCH 1/1] sched/fair: ratelimit update to tg->load_avg Aaron Lu
  2023-08-23 14:05   ` Mathieu Desnoyers
  2023-08-24 18:48   ` David Vernet
@ 2023-09-06  3:52   ` kernel test robot
  2 siblings, 0 replies; 15+ messages in thread
From: kernel test robot @ 2023-09-06  3:52 UTC (permalink / raw)
  To: Aaron Lu
  Cc: oe-lkp, lkp, Nitin Tekchandani, Vincent Guittot, linux-kernel,
	ying.huang, feng.tang, fengwei.yin, aubrey.li, yu.c.chen,
	Peter Zijlstra, Ingo Molnar, Juri Lelli, Daniel Jordan,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, Tim Chen,
	Waiman Long, Deng Pan, Mathieu Desnoyers, Gautham R . Shenoy,
	David Vernet, oliver.sang



Hello,

kernel test robot noticed a 141.1% improvement of stress-ng.nanosleep.ops_per_sec on:


commit: 0a24d7afed5c3c59ee212782f9c902c7ada6c3a8 ("[PATCH 1/1] sched/fair: ratelimit update to tg->load_avg")
url: https://github.com/intel-lab-lkp/linux/commits/Aaron-Lu/sched-fair-ratelimit-update-to-tg-load_avg/20230823-141042
base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git 63304558ba5dcaaff9e052ee43cfdcc7f9c29e85
patch link: https://lore.kernel.org/all/20230823060832.454842-2-aaron.lu@intel.com/
patch subject: [PATCH 1/1] sched/fair: ratelimit update to tg->load_avg

testcase: stress-ng
test machine: 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480CTDX (Sapphire Rapids) with 256G memory
parameters:

	nr_threads: 100%
	testtime: 60s
	sc_pid_max: 4194304
	class: scheduler
	test: nanosleep
	cpufreq_governor: performance


In addition to that, the commit also has a significant impact on the following tests:

+------------------+---------------------------------------------------------------------------------------------+
| testcase: change | stress-ng: stress-ng.sem.ops_per_sec 120.7% improvement                                     |
| test machine     | 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480CTDX (Sapphire Rapids) with 256G memory |
| test parameters  | class=scheduler                                                                             |
|                  | cpufreq_governor=performance                                                                |
|                  | nr_threads=100%                                                                             |
|                  | sc_pid_max=4194304                                                                          |
|                  | test=sem                                                                                    |
|                  | testtime=60s                                                                                |
+------------------+---------------------------------------------------------------------------------------------+
| testcase: change | stress-ng: stress-ng.switch.ops_per_sec 422.1% improvement                                  |
| test machine     | 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480CTDX (Sapphire Rapids) with 256G memory |
| test parameters  | class=scheduler                                                                             |
|                  | cpufreq_governor=performance                                                                |
|                  | nr_threads=100%                                                                             |
|                  | sc_pid_max=4194304                                                                          |
|                  | test=switch                                                                                 |
|                  | testtime=60s                                                                                |
+------------------+---------------------------------------------------------------------------------------------+




Details are as below:
-------------------------------------------------------------------------------------------------->


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20230906/202309061004.94b065e5-oliver.sang@intel.com

=========================================================================================
class/compiler/cpufreq_governor/kconfig/nr_threads/rootfs/sc_pid_max/tbox_group/test/testcase/testtime:
  scheduler/gcc-12/performance/x86_64-rhel-8.3/100%/debian-11.1-x86_64-20220510.cgz/4194304/lkp-spr-r02/nanosleep/stress-ng/60s

commit: 
  63304558ba ("sched/eevdf: Curb wakeup-preemption")
  0a24d7afed ("sched/fair: ratelimit update to tg->load_avg")

63304558ba5dcaaf 0a24d7afed5c3c59ee212782f9c 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
 1.114e+09 ±  5%     +99.5%  2.223e+09 ±  2%  cpuidle..time
  32153856 ±  6%    +989.9%  3.505e+08        cpuidle..usage
    447243 ± 17%    +164.1%    1181057 ± 29%  numa-numastat.node0.numa_hit
   1795453 ±  9%     +45.6%    2613814 ± 13%  numa-numastat.node1.local_node
   1926792 ±  6%     +41.7%    2729682 ± 13%  numa-numastat.node1.numa_hit
      1211 ± 14%     +60.4%       1944 ± 15%  perf-c2c.DRAM.local
     43481 ±  4%    +115.6%      93764 ±  2%  perf-c2c.HITM.local
      2142 ±  8%     +18.6%       2540 ±  4%  perf-c2c.HITM.remote
     45623 ±  3%    +111.1%      96304 ±  2%  perf-c2c.HITM.total
      6.50 ±  8%      +9.2       15.72 ±  4%  mpstat.cpu.all.idle%
     63.25            -8.8       54.48        mpstat.cpu.all.irq%
      0.23 ±  3%      -0.1        0.13 ±  2%  mpstat.cpu.all.soft%
     23.74            -5.1       18.68        mpstat.cpu.all.sys%
      6.26 ±  4%      +4.7       10.99        mpstat.cpu.all.usr%
      8.67 ± 10%    +103.8%      17.67 ±  4%  vmstat.cpu.id
   6780737 ±  4%     +62.2%   11001409        vmstat.memory.cache
    807.67           -45.4%     441.17 ±  3%  vmstat.procs.r
   9455773 ±  3%    +190.9%   27507213        vmstat.system.cs
   2332672 ±  3%    +135.7%    5497135        vmstat.system.in
      8442 ±125%    +361.9%      38993 ± 42%  numa-meminfo.node0.Active
      8394 ±126%    +364.0%      38945 ± 42%  numa-meminfo.node0.Active(anon)
   3920452 ±  8%     +59.6%    6258302 ± 13%  numa-meminfo.node1.FilePages
   4046956 ±  8%     +61.7%    6543199 ± 12%  numa-meminfo.node1.Inactive
   4046855 ±  8%     +61.7%    6542806 ± 12%  numa-meminfo.node1.Inactive(anon)
    809538           +23.4%     999251        numa-meminfo.node1.Mapped
   5779147 ±  5%     +44.2%    8333883 ± 11%  numa-meminfo.node1.MemUsed
   3797230 ±  8%     +63.2%    6195756 ± 12%  numa-meminfo.node1.Shmem
   6594006 ±  4%     +63.2%   10760383        meminfo.Cached
  20357957           +20.8%   24592848        meminfo.Committed_AS
  10202112 ±  4%     +12.1%   11439445 ±  4%  meminfo.DirectMap2M
   4594955 ±  7%     +90.9%    8772956        meminfo.Inactive
   4594801 ±  7%     +90.9%    8772510        meminfo.Inactive(anon)
   1244823           +15.3%    1435693        meminfo.Mapped
  10703248 ±  3%     +39.2%   14903502        meminfo.Memused
   3850091 ±  8%    +108.2%    8016158        meminfo.Shmem
  10828684 ±  3%     +38.8%   15024943        meminfo.max_used_kB
    191619 ±  2%     -62.9%      71181        stress-ng.nanosleep.nanosec_sleep_overrun
  27467749 ±  2%    +141.1%   66219303        stress-ng.nanosleep.ops
    457768 ±  2%    +141.1%    1103623        stress-ng.nanosleep.ops_per_sec
  34002509           -32.1%   23081269 ±  3%  stress-ng.time.involuntary_context_switches
     45135 ±  2%     +10.2%      49751        stress-ng.time.minor_page_faults
      4740 ±  2%     +58.5%       7515        stress-ng.time.percent_of_cpu_this_job_got
      2387 ±  2%     +26.6%       3022        stress-ng.time.system_time
    566.06 ±  4%    +191.5%       1650        stress-ng.time.user_time
 5.218e+08 ±  2%    +140.9%  1.257e+09        stress-ng.time.voluntary_context_switches
      2100 ±126%    +364.0%       9746 ± 42%  numa-vmstat.node0.nr_active_anon
      2100 ±126%    +364.0%       9746 ± 42%  numa-vmstat.node0.nr_zone_active_anon
    447538 ± 17%    +163.9%    1181134 ± 29%  numa-vmstat.node0.numa_hit
    978909 ±  8%     +59.9%    1564891 ± 13%  numa-vmstat.node1.nr_file_pages
   1010763 ±  8%     +61.9%    1636024 ± 12%  numa-vmstat.node1.nr_inactive_anon
    201775           +23.8%     249732        numa-vmstat.node1.nr_mapped
    948105 ±  8%     +63.4%    1549255 ± 12%  numa-vmstat.node1.nr_shmem
   1010756 ±  8%     +61.9%    1636022 ± 12%  numa-vmstat.node1.nr_zone_inactive_anon
   1926790 ±  6%     +41.7%    2730098 ± 13%  numa-vmstat.node1.numa_hit
   1795451 ±  9%     +45.6%    2614231 ± 13%  numa-vmstat.node1.numa_local
  23571016 ±  6%   +1005.6%  2.606e+08        turbostat.C1
      0.62 ±  7%      +4.5        5.12 ±  4%  turbostat.C1%
      6.52 ±  6%     +55.6%      10.15 ±  4%  turbostat.CPU%c1
      0.11 ±  3%    +134.3%       0.26        turbostat.IPC
 1.523e+08 ±  3%    +135.8%   3.59e+08        turbostat.IRQ
   4826320 ±  8%   +1620.7%   83044122 ±  4%  turbostat.POLL
      1.18 ±  4%      +3.2        4.39 ±  3%  turbostat.POLL%
     35.50 ±  2%      +9.4%      38.83 ±  4%  turbostat.PkgTmp
    606.26           +11.4%     675.12        turbostat.PkgWatt
     17.82            +8.1%      19.27        turbostat.RAMWatt
    221604            +3.6%     229668        proc-vmstat.nr_anon_pages
   6286339            -1.7%    6181379        proc-vmstat.nr_dirty_background_threshold
  12588050            -1.7%   12377872        proc-vmstat.nr_dirty_threshold
   1647119 ±  4%     +63.3%    2690349        proc-vmstat.nr_file_pages
  63240215            -1.7%   62188983        proc-vmstat.nr_free_pages
   1147706 ±  7%     +91.1%    2193365        proc-vmstat.nr_inactive_anon
    310602           +15.6%     358915        proc-vmstat.nr_mapped
    961140 ±  8%    +108.5%    2004292        proc-vmstat.nr_shmem
     40821            +5.6%      43093        proc-vmstat.nr_slab_reclaimable
   1147706 ±  7%     +91.1%    2193365        proc-vmstat.nr_zone_inactive_anon
    307036 ±  6%     +18.3%     363373 ±  6%  proc-vmstat.numa_hint_faults
    174792 ±  4%     +47.6%     257908 ±  8%  proc-vmstat.numa_hint_faults_local
   2376244 ±  4%     +64.7%    3912565        proc-vmstat.numa_hit
   2148067 ±  5%     +71.1%    3675698        proc-vmstat.numa_local
     74658 ± 15%     +32.3%      98789 ±  8%  proc-vmstat.numa_pages_migrated
    845358 ±  5%     +16.6%     985918 ±  4%  proc-vmstat.numa_pte_updates
   2651749 ±  4%     +58.7%    4208257        proc-vmstat.pgalloc_normal
   1178205           +11.3%    1310935        proc-vmstat.pgfault
     74658 ± 15%     +32.3%      98789 ±  8%  proc-vmstat.pgmigrate_success
   1619572 ±  4%     +31.0%    2121483 ±  2%  sched_debug.cfs_rq:/.avg_vruntime.avg
   1384229 ±  2%     +21.2%    1678009 ±  6%  sched_debug.cfs_rq:/.avg_vruntime.min
      2.14 ±  4%     -49.5%       1.08 ±  5%  sched_debug.cfs_rq:/.h_nr_running.avg
      1.59 ±  7%     -23.0%       1.23 ± 10%  sched_debug.cfs_rq:/.h_nr_running.stddev
    755771 ±  4%     +21.1%     914988 ±  7%  sched_debug.cfs_rq:/.left_vruntime.stddev
   1619572 ±  4%     +31.0%    2121483 ±  2%  sched_debug.cfs_rq:/.min_vruntime.avg
   1384229 ±  2%     +21.2%    1678009 ±  6%  sched_debug.cfs_rq:/.min_vruntime.min
      0.51           -13.3%       0.44 ±  2%  sched_debug.cfs_rq:/.nr_running.avg
      0.25 ±  6%     +25.7%       0.32 ±  4%  sched_debug.cfs_rq:/.nr_running.stddev
    755771 ±  4%     +21.1%     914988 ±  7%  sched_debug.cfs_rq:/.right_vruntime.stddev
      2098           -56.3%     916.79 ±  3%  sched_debug.cfs_rq:/.runnable_avg.avg
      5955 ±  6%     -56.7%       2580 ± 51%  sched_debug.cfs_rq:/.runnable_avg.max
    789.15 ±  6%     -55.7%     349.41 ± 20%  sched_debug.cfs_rq:/.runnable_avg.stddev
    307.62           +22.2%     375.91        sched_debug.cfs_rq:/.util_avg.avg
     24.31 ±  5%     -33.7%      16.12 ± 10%  sched_debug.cfs_rq:/.util_est_enqueued.avg
    519896 ±  8%      -9.5%     470290 ±  2%  sched_debug.cpu.avg_idle.avg
     34.99 ±  9%     -64.1%      12.55 ±  4%  sched_debug.cpu.clock.stddev
     64962 ± 54%     -60.1%      25901 ± 75%  sched_debug.cpu.max_idle_balance_cost.stddev
      1.99 ±  3%     -49.2%       1.01 ±  6%  sched_debug.cpu.nr_running.avg
      1.58 ±  8%     -24.8%       1.19 ± 10%  sched_debug.cpu.nr_running.stddev
   1312627 ±  3%    +189.6%    3801969        sched_debug.cpu.nr_switches.avg
   1454886 ±  2%    +180.8%    4084862        sched_debug.cpu.nr_switches.max
    520509 ± 19%    +118.1%    1135305 ±  6%  sched_debug.cpu.nr_switches.min
    107407 ± 27%    +121.0%     237364        sched_debug.cpu.nr_switches.stddev
      1.54 ± 16%     -30.7%       1.06 ±  4%  sched_debug.rt_rq:.rt_time.avg
    344.38 ± 16%     -30.7%     238.55 ±  4%  sched_debug.rt_rq:.rt_time.max
     22.96 ± 16%     -30.7%      15.90 ±  4%  sched_debug.rt_rq:.rt_time.stddev
     21.30            -7.6%      19.68        perf-stat.i.MPKI
 2.464e+10 ±  2%    +116.4%  5.333e+10        perf-stat.i.branch-instructions
      2.35            -0.2        2.15        perf-stat.i.branch-miss-rate%
 5.179e+08 ±  3%    +104.0%  1.056e+09        perf-stat.i.branch-misses
  24433597 ±  4%    +134.2%   57211897        perf-stat.i.cache-misses
 2.292e+09 ±  3%    +117.1%  4.977e+09        perf-stat.i.cache-references
   9691167 ±  2%    +192.2%   28317740        perf-stat.i.context-switches
      5.50 ±  2%     -58.0%       2.31        perf-stat.i.cpi
 6.037e+11            -2.5%  5.886e+11        perf-stat.i.cpu-cycles
   2763209 ±  6%    +493.6%   16401859        perf-stat.i.cpu-migrations
     27315 ±  5%     -58.5%      11336        perf-stat.i.cycles-between-cache-misses
      0.27 ±  3%      +0.1        0.38 ±  5%  perf-stat.i.dTLB-load-miss-rate%
  75642414 ±  6%    +234.9%  2.534e+08 ±  5%  perf-stat.i.dTLB-load-misses
 2.882e+10 ±  2%    +133.7%  6.734e+10        perf-stat.i.dTLB-loads
      0.09            +0.0        0.12        perf-stat.i.dTLB-store-miss-rate%
  11815823 ±  4%    +250.9%   41464964        perf-stat.i.dTLB-store-misses
 1.431e+10 ±  2%    +154.8%  3.647e+10        perf-stat.i.dTLB-stores
 1.179e+11 ±  2%    +125.2%  2.655e+11        perf-stat.i.instructions
      0.22 ±  4%    +107.2%       0.46        perf-stat.i.ipc
      2.69            -2.5%       2.63        perf-stat.i.metric.GHz
    168.78 ±  3%    +191.3%     491.66        perf-stat.i.metric.K/sec
    312.66 ±  2%    +131.4%     723.39        perf-stat.i.metric.M/sec
     83.64            -4.6       79.07        perf-stat.i.node-load-miss-rate%
   8492785 ±  5%     +91.4%   16253628        perf-stat.i.node-load-misses
   1978671 ±  8%    +156.8%    5080695 ±  3%  perf-stat.i.node-loads
     20.13            -5.6%      19.01        perf-stat.overall.MPKI
      2.17            -0.2        2.01        perf-stat.overall.branch-miss-rate%
      1.04 ±  4%      +0.1        1.14        perf-stat.overall.cache-miss-rate%
      5.27 ±  2%     -57.4%       2.24        perf-stat.overall.cpi
     25098 ±  4%     -58.6%      10384        perf-stat.overall.cycles-between-cache-misses
      0.27 ±  3%      +0.1        0.38 ±  5%  perf-stat.overall.dTLB-load-miss-rate%
      0.08            +0.0        0.11        perf-stat.overall.dTLB-store-miss-rate%
      0.19 ±  2%    +134.7%       0.45        perf-stat.overall.ipc
     78.22 ±  2%      -3.3       74.96        perf-stat.overall.node-load-miss-rate%
 2.351e+10 ±  2%    +121.2%    5.2e+10        perf-stat.ps.branch-instructions
 5.098e+08 ±  3%    +105.1%  1.046e+09        perf-stat.ps.branch-misses
  23627740 ±  4%    +136.8%   55950974        perf-stat.ps.cache-misses
 2.265e+09 ±  3%    +117.5%  4.927e+09        perf-stat.ps.cache-references
   9507437 ±  2%    +194.8%   28026780        perf-stat.ps.context-switches
    217084            +1.3%     219882        perf-stat.ps.cpu-clock
  5.92e+11            -1.9%  5.809e+11        perf-stat.ps.cpu-cycles
   2705311 ±  6%    +500.2%   16238454        perf-stat.ps.cpu-migrations
  73862253 ±  5%    +238.6%  2.501e+08 ±  5%  perf-stat.ps.dTLB-load-misses
  2.76e+10 ±  2%    +138.7%  6.588e+10        perf-stat.ps.dTLB-loads
  11580338 ±  4%    +254.4%   41039686        perf-stat.ps.dTLB-store-misses
 1.367e+10 ±  2%    +161.0%  3.568e+10        perf-stat.ps.dTLB-stores
 1.125e+11 ±  2%    +130.3%  2.592e+11        perf-stat.ps.instructions
     17267 ±  2%     +11.0%      19173        perf-stat.ps.minor-faults
   8196664 ±  5%     +94.6%   15947509        perf-stat.ps.node-load-misses
   2277667 ±  7%    +134.0%    5330072 ±  4%  perf-stat.ps.node-loads
     17267 ±  2%     +11.0%      19173        perf-stat.ps.page-faults
    217084            +1.3%     219882        perf-stat.ps.task-clock
 7.044e+12 ±  2%    +129.7%  1.618e+13        perf-stat.total.instructions
     13.44           -13.4        0.00        perf-profile.calltrace.cycles-pp.update_cfs_group.dequeue_task_fair.__schedule.schedule.do_nanosleep
     17.47           -12.7        4.76        perf-profile.calltrace.cycles-pp.dequeue_task_fair.__schedule.schedule.do_nanosleep.hrtimer_nanosleep
     39.60 ±  2%     -12.5       27.06        perf-profile.calltrace.cycles-pp.__schedule.schedule.do_nanosleep.hrtimer_nanosleep.common_nsleep
     12.30 ±  2%     -12.3        0.00        perf-profile.calltrace.cycles-pp.enqueue_task_fair.activate_task.ttwu_do_activate.try_to_wake_up.hrtimer_wakeup
     39.94           -12.2       27.71        perf-profile.calltrace.cycles-pp.schedule.do_nanosleep.hrtimer_nanosleep.common_nsleep.__x64_sys_clock_nanosleep
     12.32 ±  3%     -12.2        0.17 ±141%  perf-profile.calltrace.cycles-pp.activate_task.ttwu_do_activate.try_to_wake_up.hrtimer_wakeup.__hrtimer_run_queues
     12.43 ±  2%     -11.9        0.54        perf-profile.calltrace.cycles-pp.ttwu_do_activate.try_to_wake_up.hrtimer_wakeup.__hrtimer_run_queues.hrtimer_interrupt
     43.43           -10.2       33.22        perf-profile.calltrace.cycles-pp.do_nanosleep.hrtimer_nanosleep.common_nsleep.__x64_sys_clock_nanosleep.do_syscall_64
     20.08            -9.9       10.20        perf-profile.calltrace.cycles-pp.flush_smp_call_function_queue.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
     44.17            -9.7       34.48        perf-profile.calltrace.cycles-pp.hrtimer_nanosleep.common_nsleep.__x64_sys_clock_nanosleep.do_syscall_64.entry_SYSCALL_64_after_hwframe
     44.22            -9.6       34.58        perf-profile.calltrace.cycles-pp.common_nsleep.__x64_sys_clock_nanosleep.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_nanosleep
     45.34            -8.8       36.51        perf-profile.calltrace.cycles-pp.__x64_sys_clock_nanosleep.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_nanosleep
     10.46 ±  2%      -8.0        2.44        perf-profile.calltrace.cycles-pp.enqueue_task_fair.activate_task.ttwu_do_activate.sched_ttwu_pending.__flush_smp_call_function_queue
      7.82 ±  3%      -7.8        0.00        perf-profile.calltrace.cycles-pp.update_cfs_group.enqueue_task_fair.activate_task.ttwu_do_activate.try_to_wake_up
     44.86            -7.8       37.06        perf-profile.calltrace.cycles-pp.try_to_wake_up.hrtimer_wakeup.__hrtimer_run_queues.hrtimer_interrupt.__sysvec_apic_timer_interrupt
     44.88            -7.8       37.12        perf-profile.calltrace.cycles-pp.hrtimer_wakeup.__hrtimer_run_queues.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt
     45.94            -7.6       38.34        perf-profile.calltrace.cycles-pp.__hrtimer_run_queues.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt
     12.12            -7.6        4.52        perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.flush_smp_call_function_queue.do_idle.cpu_startup_entry.start_secondary
     12.05            -7.6        4.48        perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.flush_smp_call_function_queue.do_idle.cpu_startup_entry
     11.96            -7.5        4.46        perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.flush_smp_call_function_queue.do_idle
     11.95            -7.5        4.44        perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.flush_smp_call_function_queue
     10.99            -7.1        3.91        perf-profile.calltrace.cycles-pp.available_idle_cpu.select_idle_cpu.select_idle_sibling.select_task_rq_fair.select_task_rq
     47.61            -6.9       40.70        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_nanosleep
     47.91            -6.6       41.28        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.clock_nanosleep
     22.37 ±  3%      -6.6       15.82        perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.finish_task_switch
     22.41 ±  3%      -6.5       15.86        perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.finish_task_switch.__schedule
      6.41 ±  3%      -5.7        0.70        perf-profile.calltrace.cycles-pp.update_load_avg.enqueue_entity.enqueue_task_fair.activate_task.ttwu_do_activate
      6.40 ±  2%      -4.7        1.70        perf-profile.calltrace.cycles-pp.enqueue_entity.enqueue_task_fair.activate_task.ttwu_do_activate.sched_ttwu_pending
     16.05 ±  4%      -4.7       11.35        perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.finish_task_switch.__schedule.schedule.do_nanosleep
     15.91 ±  4%      -4.7       11.22        perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.finish_task_switch.__schedule.schedule
      4.88 ±  2%      -4.3        0.55        perf-profile.calltrace.cycles-pp.sched_ttwu_pending.__flush_smp_call_function_queue.__sysvec_call_function_single.sysvec_call_function_single.asm_sysvec_call_function_single
     16.36 ±  4%      -3.8       12.61        perf-profile.calltrace.cycles-pp.finish_task_switch.__schedule.schedule.do_nanosleep.hrtimer_nanosleep
      8.95            -3.6        5.34        perf-profile.calltrace.cycles-pp.finish_task_switch.__schedule.schedule_idle.do_idle.cpu_startup_entry
     53.17            -3.3       49.84        perf-profile.calltrace.cycles-pp.clock_nanosleep
      6.38 ±  2%      -2.9        3.53        perf-profile.calltrace.cycles-pp.activate_task.ttwu_do_activate.sched_ttwu_pending.__flush_smp_call_function_queue.flush_smp_call_function_queue
      6.41 ±  2%      -2.8        3.63        perf-profile.calltrace.cycles-pp.ttwu_do_activate.sched_ttwu_pending.__flush_smp_call_function_queue.flush_smp_call_function_queue.do_idle
     40.76            -2.4       38.40        perf-profile.calltrace.cycles-pp.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
     40.82            -2.3       38.53        perf-profile.calltrace.cycles-pp.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
     41.01            -2.3       38.72        perf-profile.calltrace.cycles-pp.secondary_startup_64_no_verify
     40.82            -2.3       38.54        perf-profile.calltrace.cycles-pp.start_secondary.secondary_startup_64_no_verify
      6.52 ±  2%      -2.2        4.28        perf-profile.calltrace.cycles-pp.sched_ttwu_pending.__flush_smp_call_function_queue.flush_smp_call_function_queue.do_idle.cpu_startup_entry
      6.81            -2.1        4.72        perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.finish_task_switch.__schedule.schedule_idle
      6.84            -2.1        4.77        perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.finish_task_switch.__schedule.schedule_idle.do_idle
      6.67 ±  2%      -1.7        4.99        perf-profile.calltrace.cycles-pp.__flush_smp_call_function_queue.flush_smp_call_function_queue.do_idle.cpu_startup_entry.start_secondary
      2.09            -1.6        0.51        perf-profile.calltrace.cycles-pp.asm_sysvec_call_function_single.finish_task_switch.__schedule.schedule_idle.do_idle
     23.74            -1.0       22.79        perf-profile.calltrace.cycles-pp.select_idle_cpu.select_idle_sibling.select_task_rq_fair.select_task_rq.try_to_wake_up
     10.21            -0.7        9.54        perf-profile.calltrace.cycles-pp.__schedule.schedule_idle.do_idle.cpu_startup_entry.start_secondary
     10.43            -0.6        9.85        perf-profile.calltrace.cycles-pp.schedule_idle.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
      1.07 ±  4%      -0.4        0.67        perf-profile.calltrace.cycles-pp.__flush_smp_call_function_queue.__sysvec_call_function_single.sysvec_call_function_single.asm_sysvec_call_function_single.cpuidle_enter_state
      1.07 ±  4%      -0.4        0.69        perf-profile.calltrace.cycles-pp.__sysvec_call_function_single.sysvec_call_function_single.asm_sysvec_call_function_single.cpuidle_enter_state.cpuidle_enter
      1.08 ±  4%      -0.4        0.72        perf-profile.calltrace.cycles-pp.sysvec_call_function_single.asm_sysvec_call_function_single.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call
      2.82 ±  2%      -0.3        2.48        perf-profile.calltrace.cycles-pp.dequeue_entity.dequeue_task_fair.__schedule.schedule.do_nanosleep
      1.08 ±  4%      -0.3        0.80        perf-profile.calltrace.cycles-pp.asm_sysvec_call_function_single.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle
      1.07 ±  6%      -0.3        0.82        perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.nohz_run_idle_balance.do_idle.cpu_startup_entry.start_secondary
      1.06 ±  6%      -0.2        0.82        perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.nohz_run_idle_balance.do_idle.cpu_startup_entry
      1.05 ±  6%      -0.2        0.80        perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.nohz_run_idle_balance.do_idle
      1.05 ±  6%      -0.2        0.80        perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.nohz_run_idle_balance
      0.70 ±  6%      -0.2        0.54        perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.do_idle.cpu_startup_entry.start_secondary
      0.67 ±  7%      -0.2        0.52        perf-profile.calltrace.cycles-pp.sysvec_call_function_single.asm_sysvec_call_function_single.nohz_run_idle_balance.do_idle.cpu_startup_entry
      0.70 ±  6%      -0.2        0.54        perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
      0.69 ±  6%      -0.2        0.54        perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.do_idle.cpu_startup_entry
      0.69 ±  6%      -0.2        0.53        perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.do_idle
      0.68 ±  7%      -0.1        0.56        perf-profile.calltrace.cycles-pp.asm_sysvec_call_function_single.nohz_run_idle_balance.do_idle.cpu_startup_entry.start_secondary
      0.57 ±  4%      +0.1        0.67        perf-profile.calltrace.cycles-pp.pick_next_task_fair.__schedule.schedule_idle.do_idle.cpu_startup_entry
      0.54            +0.2        0.74        perf-profile.calltrace.cycles-pp.prepare_task_switch.__schedule.schedule.do_nanosleep.hrtimer_nanosleep
      0.87            +0.2        1.09        perf-profile.calltrace.cycles-pp.pick_next_task_fair.__schedule.schedule.do_nanosleep.hrtimer_nanosleep
      0.76            +0.2        0.99        perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.hrtimer_active.hrtimer_try_to_cancel
      0.78            +0.2        1.00        perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.hrtimer_active.hrtimer_try_to_cancel.do_nanosleep.hrtimer_nanosleep
      0.76            +0.2        0.99        perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.hrtimer_active
      0.77            +0.2        1.00        perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.hrtimer_active.hrtimer_try_to_cancel.do_nanosleep
      0.80            +0.3        1.08        perf-profile.calltrace.cycles-pp.__hrtimer_start_range_ns.hrtimer_start_range_ns.do_nanosleep.hrtimer_nanosleep.common_nsleep
      0.76 ±  2%      +0.3        1.05        perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.do_nanosleep
      0.76 ±  2%      +0.3        1.05        perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.do_nanosleep.hrtimer_nanosleep
      0.77 ±  2%      +0.3        1.06        perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.do_nanosleep.hrtimer_nanosleep.common_nsleep
      0.78 ±  3%      +0.3        1.06        perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.do_nanosleep.hrtimer_nanosleep.common_nsleep.__x64_sys_clock_nanosleep
      0.61 ±  2%      +0.4        1.00        perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt
      0.59 ±  2%      +0.4        0.98        perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt
      0.59 ±  2%      +0.4        0.98        perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt
      0.60 ±  2%      +0.4        0.99        perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt
      0.99 ±  5%      +0.4        1.41 ±  4%  perf-profile.calltrace.cycles-pp.stress_mwc32
      0.00            +0.5        0.52        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.try_to_wake_up.hrtimer_wakeup.__hrtimer_run_queues
      0.00            +0.5        0.53        perf-profile.calltrace.cycles-pp.tick_nohz_idle_enter.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
      1.07            +0.5        1.61        perf-profile.calltrace.cycles-pp.hrtimer_start_range_ns.do_nanosleep.hrtimer_nanosleep.common_nsleep.__x64_sys_clock_nanosleep
      0.00            +0.5        0.55 ±  2%  perf-profile.calltrace.cycles-pp.set_task_cpu.try_to_wake_up.hrtimer_wakeup.__hrtimer_run_queues.hrtimer_interrupt
      0.58 ±  2%      +0.6        1.14        perf-profile.calltrace.cycles-pp._raw_spin_lock.__schedule.schedule.do_nanosleep.hrtimer_nanosleep
      0.00            +0.6        0.58        perf-profile.calltrace.cycles-pp.llist_reverse_order.__flush_smp_call_function_queue.flush_smp_call_function_queue.do_idle.cpu_startup_entry
      0.00            +0.6        0.58 ±  2%  perf-profile.calltrace.cycles-pp.update_load_avg.dequeue_entity.dequeue_task_fair.__schedule.schedule
      0.00            +0.6        0.60        perf-profile.calltrace.cycles-pp._raw_spin_lock.try_to_wake_up.hrtimer_wakeup.__hrtimer_run_queues.hrtimer_interrupt
      0.87 ±  3%      +0.6        1.46 ±  3%  perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.stress_pthread_func
      0.90 ±  4%      +0.6        1.50 ±  3%  perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.stress_pthread_func
      0.87 ±  3%      +0.6        1.47 ±  3%  perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.stress_pthread_func
      0.87 ±  3%      +0.6        1.48 ±  3%  perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.stress_pthread_func
      0.00            +0.6        0.61 ±  2%  perf-profile.calltrace.cycles-pp.menu_select.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary
      0.00            +0.6        0.62 ± 11%  perf-profile.calltrace.cycles-pp.__nanosleep
      1.30            +0.6        1.92        perf-profile.calltrace.cycles-pp.switch_mm_irqs_off.__schedule.schedule.do_nanosleep.hrtimer_nanosleep
      0.00            +0.6        0.64        perf-profile.calltrace.cycles-pp._raw_spin_lock.__schedule.schedule_idle.do_idle.cpu_startup_entry
      0.00            +0.7        0.66        perf-profile.calltrace.cycles-pp.ttwu_queue_wakelist.try_to_wake_up.hrtimer_wakeup.__hrtimer_run_queues.hrtimer_interrupt
      0.00            +0.7        0.67        perf-profile.calltrace.cycles-pp._copy_from_user.get_timespec64.__x64_sys_clock_nanosleep.do_syscall_64.entry_SYSCALL_64_after_hwframe
      0.00            +0.7        0.70        perf-profile.calltrace.cycles-pp.__switch_to.clock_nanosleep
      0.58 ±  3%      +0.7        1.28        perf-profile.calltrace.cycles-pp.__switch_to_asm.clock_nanosleep
      0.00            +0.7        0.72 ±  3%  perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.clock_gettime
      0.00            +0.7        0.73 ±  3%  perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.clock_gettime
      0.00            +0.7        0.73 ±  3%  perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.clock_gettime
      0.00            +0.7        0.74 ±  3%  perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.clock_gettime
      0.00            +0.8        0.78        perf-profile.calltrace.cycles-pp.get_timespec64.__x64_sys_clock_nanosleep.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_nanosleep
      0.00            +0.8        0.78        perf-profile.calltrace.cycles-pp.update_curr.dequeue_entity.dequeue_task_fair.__schedule.schedule
      0.00            +0.8        0.83 ±  3%  perf-profile.calltrace.cycles-pp.__update_idle_core.pick_next_task_idle.__schedule.schedule.do_nanosleep
      0.00            +0.8        0.84 ±  3%  perf-profile.calltrace.cycles-pp.pick_next_task_idle.__schedule.schedule.do_nanosleep.hrtimer_nanosleep
     25.15            +0.9       26.05        perf-profile.calltrace.cycles-pp.select_idle_sibling.select_task_rq_fair.select_task_rq.try_to_wake_up.hrtimer_wakeup
      0.00            +0.9        0.92        perf-profile.calltrace.cycles-pp.__switch_to_asm
      2.94 ±  2%      +0.9        3.87        perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.clock_nanosleep
      3.09 ±  2%      +0.9        4.02        perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.clock_nanosleep
      2.96 ±  2%      +0.9        3.89        perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.clock_nanosleep
      2.95 ±  2%      +0.9        3.88        perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.clock_nanosleep
      1.44            +1.0        2.41        perf-profile.calltrace.cycles-pp.hrtimer_active.hrtimer_try_to_cancel.do_nanosleep.hrtimer_nanosleep.common_nsleep
      1.40 ±  2%      +1.0        2.41        perf-profile.calltrace.cycles-pp.restore_fpregs_from_fpstate.switch_fpu_return.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64
      0.00            +1.0        1.04        perf-profile.calltrace.cycles-pp.sched_mm_cid_migrate_to.activate_task.ttwu_do_activate.sched_ttwu_pending.__flush_smp_call_function_queue
      1.50            +1.1        2.58        perf-profile.calltrace.cycles-pp.hrtimer_try_to_cancel.do_nanosleep.hrtimer_nanosleep.common_nsleep.__x64_sys_clock_nanosleep
     25.25            +1.1       26.32        perf-profile.calltrace.cycles-pp.select_task_rq_fair.select_task_rq.try_to_wake_up.hrtimer_wakeup.__hrtimer_run_queues
      3.13            +1.1        4.22        perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state
      3.13            +1.1        4.23        perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state.cpuidle_enter
     25.31            +1.1       26.42        perf-profile.calltrace.cycles-pp.select_task_rq.try_to_wake_up.hrtimer_wakeup.__hrtimer_run_queues.hrtimer_interrupt
      3.16            +1.1        4.31        perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call
      3.17            +1.2        4.34        perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle
      0.80 ±  8%      +1.3        2.11 ± 11%  perf-profile.calltrace.cycles-pp.clock_gettime
      0.00            +1.4        1.44        perf-profile.calltrace.cycles-pp.switch_mm_irqs_off.__schedule.schedule_idle.do_idle.cpu_startup_entry
      1.98            +1.5        3.45        perf-profile.calltrace.cycles-pp.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_nanosleep
      2.04            +1.6        3.59        perf-profile.calltrace.cycles-pp.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_nanosleep
      1.75 ±  2%      +1.6        3.37        perf-profile.calltrace.cycles-pp.switch_fpu_return.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe
      1.92 ±  4%      +1.9        3.86 ±  2%  perf-profile.calltrace.cycles-pp.stress_pthread_func
      0.17 ±141%      +2.2        2.39        perf-profile.calltrace.cycles-pp.intel_idle.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle
      1.03 ±  2%      +2.4        3.40        perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.poll_idle
      1.03 ±  2%      +2.4        3.40        perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.poll_idle.cpuidle_enter_state
      1.04 ±  2%      +2.4        3.45        perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.poll_idle.cpuidle_enter_state.cpuidle_enter
      1.05 ±  2%      +2.4        3.48        perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.poll_idle.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call
      1.37 ±  3%      +3.8        5.15        perf-profile.calltrace.cycles-pp.poll_idle.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle
      0.00            +6.0        5.98        perf-profile.calltrace.cycles-pp.available_idle_cpu.select_idle_core.select_idle_cpu.select_idle_sibling.select_task_rq_fair
      6.17            +6.8       12.96        perf-profile.calltrace.cycles-pp.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry
      6.21            +6.9       13.08        perf-profile.calltrace.cycles-pp.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary
      6.68            +7.6       14.28        perf-profile.calltrace.cycles-pp.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
      1.25            +8.6        9.84        perf-profile.calltrace.cycles-pp.select_idle_core.select_idle_cpu.select_idle_sibling.select_task_rq_fair.select_task_rq
     30.81           -29.8        1.00 ±  9%  perf-profile.children.cycles-pp.update_cfs_group
     26.74           -21.1        5.60        perf-profile.children.cycles-pp.enqueue_task_fair
     27.45           -20.2        7.29        perf-profile.children.cycles-pp.activate_task
     27.66           -20.0        7.70        perf-profile.children.cycles-pp.ttwu_do_activate
     50.48           -13.5       37.02        perf-profile.children.cycles-pp.__schedule
     17.49           -12.7        4.78        perf-profile.children.cycles-pp.dequeue_task_fair
     40.51 ±  2%     -12.5       27.97        perf-profile.children.cycles-pp.schedule
     43.46           -10.2       33.28        perf-profile.children.cycles-pp.do_nanosleep
     20.27 ±  2%      -9.9       10.35        perf-profile.children.cycles-pp.flush_smp_call_function_queue
     44.18            -9.7       34.50        perf-profile.children.cycles-pp.hrtimer_nanosleep
     44.28            -9.6       34.72        perf-profile.children.cycles-pp.common_nsleep
     45.35            -8.8       36.53        perf-profile.children.cycles-pp.__x64_sys_clock_nanosleep
     12.11            -8.7        3.43        perf-profile.children.cycles-pp.enqueue_entity
     11.22            -7.8        3.41        perf-profile.children.cycles-pp.update_load_avg
     50.72            -7.7       42.98        perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
     51.22            -7.7       43.50        perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
     48.74            -7.7       41.08        perf-profile.children.cycles-pp.try_to_wake_up
     25.96 ±  3%      -7.7       18.30        perf-profile.children.cycles-pp.finish_task_switch
     48.74            -7.6       41.10        perf-profile.children.cycles-pp.hrtimer_wakeup
     49.92            -7.6       42.28        perf-profile.children.cycles-pp.__hrtimer_run_queues
     50.13            -7.5       42.59        perf-profile.children.cycles-pp.hrtimer_interrupt
     50.20            -7.5       42.70        perf-profile.children.cycles-pp.__sysvec_apic_timer_interrupt
     47.73            -7.0       40.77        perf-profile.children.cycles-pp.do_syscall_64
     48.02            -6.7       41.34        perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
     12.52 ±  2%      -5.9        6.60        perf-profile.children.cycles-pp.sched_ttwu_pending
     12.86 ±  2%      -5.1        7.80        perf-profile.children.cycles-pp.__flush_smp_call_function_queue
      6.17 ±  2%      -3.3        2.86        perf-profile.children.cycles-pp.__sysvec_call_function_single
      6.20 ±  2%      -3.2        2.99        perf-profile.children.cycles-pp.sysvec_call_function_single
     53.30            -3.1       50.18        perf-profile.children.cycles-pp.clock_nanosleep
      6.28 ±  2%      -3.0        3.32        perf-profile.children.cycles-pp.asm_sysvec_call_function_single
     40.96            -2.3       38.63        perf-profile.children.cycles-pp.do_idle
     41.01            -2.3       38.72        perf-profile.children.cycles-pp.secondary_startup_64_no_verify
     41.01            -2.3       38.72        perf-profile.children.cycles-pp.cpu_startup_entry
     40.82            -2.3       38.54        perf-profile.children.cycles-pp.start_secondary
     28.17            -1.6       26.60        perf-profile.children.cycles-pp.select_idle_cpu
     10.48            -0.6        9.91        perf-profile.children.cycles-pp.schedule_idle
     15.47            -0.5       14.92        perf-profile.children.cycles-pp.available_idle_cpu
      0.65 ±  6%      -0.3        0.31 ±  2%  perf-profile.children.cycles-pp.exit_to_user_mode_loop
      2.85 ±  2%      -0.3        2.54        perf-profile.children.cycles-pp.dequeue_entity
      0.41 ±  4%      -0.3        0.14 ±  3%  perf-profile.children.cycles-pp.__do_softirq
      0.49 ±  4%      -0.2        0.27        perf-profile.children.cycles-pp.__irq_exit_rcu
      0.43 ±  3%      -0.2        0.25        perf-profile.children.cycles-pp.tick_sched_handle
      0.46 ±  6%      -0.2        0.28 ±  2%  perf-profile.children.cycles-pp.irqentry_exit_to_user_mode
      0.42 ±  3%      -0.2        0.24 ±  2%  perf-profile.children.cycles-pp.update_process_times
      0.44 ±  3%      -0.2        0.27        perf-profile.children.cycles-pp.tick_sched_timer
      0.38 ±  3%      -0.2        0.22 ±  2%  perf-profile.children.cycles-pp.scheduler_tick
      0.46 ±  2%      -0.1        0.32        perf-profile.children.cycles-pp._raw_spin_lock_irq
      0.10 ±  3%      -0.1        0.02 ± 99%  perf-profile.children.cycles-pp.sched_clock_noinstr
      0.74            -0.1        0.67        perf-profile.children.cycles-pp.update_rq_clock
      0.10 ± 19%      -0.1        0.04 ± 76%  perf-profile.children.cycles-pp.record__mmap_read_evlist
      0.10 ± 19%      -0.1        0.04 ± 73%  perf-profile.children.cycles-pp.perf_mmap__push
      0.12 ± 15%      -0.1        0.07 ± 18%  perf-profile.children.cycles-pp.__libc_start_main
      0.12 ± 15%      -0.1        0.07 ± 18%  perf-profile.children.cycles-pp.main
      0.12 ± 15%      -0.1        0.07 ± 18%  perf-profile.children.cycles-pp.run_builtin
      0.38 ±  3%      -0.1        0.33        perf-profile.children.cycles-pp.get_nohz_timer_target
      0.11 ± 19%      -0.1        0.06 ± 49%  perf-profile.children.cycles-pp.cmd_record
      0.10            -0.1        0.05        perf-profile.children.cycles-pp.entity_eligible
      0.06            +0.0        0.07        perf-profile.children.cycles-pp.__rb_insert_augmented
      0.05 ±  7%      +0.0        0.08 ±  6%  perf-profile.children.cycles-pp.rebalance_domains
      0.05            +0.0        0.07 ±  6%  perf-profile.children.cycles-pp.put_prev_entity
      0.13 ±  3%      +0.0        0.15 ±  2%  perf-profile.children.cycles-pp.perf_adjust_freq_unthr_context
      0.13 ±  3%      +0.0        0.15 ±  3%  perf-profile.children.cycles-pp.perf_event_task_tick
      0.05 ±  8%      +0.0        0.08        perf-profile.children.cycles-pp.tracing_gen_ctx_irq_test
      0.05            +0.0        0.08        perf-profile.children.cycles-pp._find_next_and_bit
      0.16            +0.0        0.20 ±  2%  perf-profile.children.cycles-pp.place_entity
      0.06 ±  6%      +0.0        0.10 ±  8%  perf-profile.children.cycles-pp.mm_cid_get
      0.08 ±  6%      +0.0        0.11 ±  4%  perf-profile.children.cycles-pp.perf_trace_buf_update
      0.24 ±  3%      +0.0        0.28        perf-profile.children.cycles-pp.call_cpuidle
      0.14            +0.0        0.19        perf-profile.children.cycles-pp.avg_vruntime
      0.00            +0.1        0.05        perf-profile.children.cycles-pp.hrtimer_get_next_event
      0.00            +0.1        0.05        perf-profile.children.cycles-pp.save_fpregs_to_fpstate
      0.00            +0.1        0.05        perf-profile.children.cycles-pp.__bitmap_and
      0.00            +0.1        0.05        perf-profile.children.cycles-pp.idle_cpu
      0.00            +0.1        0.05        perf-profile.children.cycles-pp.ct_kernel_enter
      0.13 ±  2%      +0.1        0.18 ±  2%  perf-profile.children.cycles-pp.update_irq_load_avg
      0.01 ±223%      +0.1        0.06 ±  7%  perf-profile.children.cycles-pp.perf_trace_sched_stat_runtime
      0.23            +0.1        0.28        perf-profile.children.cycles-pp.__dequeue_entity
      0.06 ±  7%      +0.1        0.12 ±  3%  perf-profile.children.cycles-pp.irqtime_account_irq
      0.00            +0.1        0.06        perf-profile.children.cycles-pp.ct_kernel_exit_state
      0.00            +0.1        0.06        perf-profile.children.cycles-pp.ct_idle_exit
      0.08 ±  5%      +0.1        0.14 ±  3%  perf-profile.children.cycles-pp.resched_curr
      0.01 ±223%      +0.1        0.07 ± 21%  perf-profile.children.cycles-pp.nanosleep@plt
      0.00            +0.1        0.06 ±  7%  perf-profile.children.cycles-pp.perf_exclude_event
      0.00            +0.1        0.06 ±  7%  perf-profile.children.cycles-pp.perf_trace_sched_migrate_task
      0.13 ±  2%      +0.1        0.20 ±  2%  perf-profile.children.cycles-pp.lapic_next_deadline
      0.00            +0.1        0.06 ±  7%  perf-profile.children.cycles-pp.perf_trace_buf_alloc
      0.20 ±  2%      +0.1        0.26        perf-profile.children.cycles-pp.clockevents_program_event
      0.02 ±141%      +0.1        0.08 ±  5%  perf-profile.children.cycles-pp.rb_next
      0.00            +0.1        0.07 ±  7%  perf-profile.children.cycles-pp.__update_load_avg_blocked_se
      0.00            +0.1        0.07 ±  5%  perf-profile.children.cycles-pp.local_clock_noinstr
      0.00            +0.1        0.07 ±  9%  perf-profile.children.cycles-pp.cpuidle_governor_latency_req
      0.08            +0.1        0.16        perf-profile.children.cycles-pp.rcu_note_context_switch
      0.22 ±  2%      +0.1        0.30        perf-profile.children.cycles-pp.pick_eevdf
      0.00            +0.1        0.08 ±  4%  perf-profile.children.cycles-pp.perf_trace_sched_switch
      0.15            +0.1        0.24        perf-profile.children.cycles-pp.hrtimer_init_sleeper
      0.08            +0.1        0.17        perf-profile.children.cycles-pp.rb_erase
      0.00            +0.1        0.09        perf-profile.children.cycles-pp.__hrtimer_next_event_base
      0.00            +0.1        0.10 ±  5%  perf-profile.children.cycles-pp.__x2apic_send_IPI_dest
      0.06 ±  7%      +0.1        0.16 ±  2%  perf-profile.children.cycles-pp.native_apic_msr_eoi_write
      0.00            +0.1        0.10        perf-profile.children.cycles-pp.__list_add_valid
      0.00            +0.1        0.10        perf-profile.children.cycles-pp.syscall_exit_to_user_mode_prepare
      0.00            +0.1        0.10        perf-profile.children.cycles-pp.__list_del_entry_valid
      0.10 ±  4%      +0.1        0.20        perf-profile.children.cycles-pp.__hrtimer_init
      0.00            +0.1        0.10 ±  4%  perf-profile.children.cycles-pp.hrtimer_next_event_without
      0.45 ±  6%      +0.1        0.55        perf-profile.children.cycles-pp.tick_nohz_idle_enter
      0.07            +0.1        0.18 ±  2%  perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack
      0.00            +0.1        0.11        perf-profile.children.cycles-pp.syscall_return_via_sysret
      0.00            +0.1        0.11        perf-profile.children.cycles-pp.get_next_timer_interrupt
      0.00            +0.1        0.11        perf-profile.children.cycles-pp._raw_spin_unlock_irqrestore
      0.08            +0.1        0.19 ±  2%  perf-profile.children.cycles-pp.update_entity_lag
      0.10 ±  4%      +0.1        0.22 ±  2%  perf-profile.children.cycles-pp.rb_insert_color
      0.00            +0.1        0.12        perf-profile.children.cycles-pp.__calc_delta
      0.01 ±223%      +0.1        0.14        perf-profile.children.cycles-pp.__rdgsbase_inactive
      0.24            +0.1        0.38        perf-profile.children.cycles-pp.os_xsave
      0.06            +0.1        0.21 ±  2%  perf-profile.children.cycles-pp.tick_nohz_idle_exit
      0.17 ±  2%      +0.1        0.32        perf-profile.children.cycles-pp.check_preempt_curr
      0.07 ±  5%      +0.1        0.22        perf-profile.children.cycles-pp.__wrgsbase_inactive
      0.48            +0.1        0.62        perf-profile.children.cycles-pp.update_rq_clock_task
      0.00            +0.2        0.15 ±  2%  perf-profile.children.cycles-pp.newidle_balance
      0.00            +0.2        0.16 ±  2%  perf-profile.children.cycles-pp.tick_nohz_next_event
      0.07 ±  5%      +0.2        0.24 ±  6%  perf-profile.children.cycles-pp.cpuacct_charge
      0.23 ±  3%      +0.2        0.41        perf-profile.children.cycles-pp.native_irq_return_iret
      0.14            +0.2        0.32        perf-profile.children.cycles-pp.stress_mwc32modn
      0.11 ±  4%      +0.2        0.30        perf-profile.children.cycles-pp.read_tsc
      0.22            +0.2        0.41        perf-profile.children.cycles-pp.ktime_get
      0.24 ±  2%      +0.2        0.43        perf-profile.children.cycles-pp.perf_tp_event
      0.09 ±  4%      +0.2        0.28        perf-profile.children.cycles-pp.attach_entity_load_avg
      0.50            +0.2        0.70        perf-profile.children.cycles-pp.sched_clock_cpu
      0.44            +0.2        0.64        perf-profile.children.cycles-pp.__update_load_avg_se
      0.14 ±  3%      +0.2        0.34        perf-profile.children.cycles-pp.cpus_share_cache
      0.16 ±  2%      +0.2        0.36        perf-profile.children.cycles-pp.update_min_vruntime
      0.34            +0.2        0.54        perf-profile.children.cycles-pp.sched_clock
      0.16 ±  2%      +0.2        0.38        perf-profile.children.cycles-pp.timerqueue_del
      0.14 ±  3%      +0.2        0.36        perf-profile.children.cycles-pp.__entry_text_start
      0.15            +0.2        0.38        perf-profile.children.cycles-pp.syscall_enter_from_user_mode
      0.39 ±  2%      +0.2        0.63        perf-profile.children.cycles-pp.perf_trace_sched_wakeup_template
      0.21            +0.2        0.45        perf-profile.children.cycles-pp.__enqueue_entity
      0.27            +0.2        0.52        perf-profile.children.cycles-pp.native_sched_clock
      0.06 ±  7%      +0.3        0.32 ±  2%  perf-profile.children.cycles-pp.tick_nohz_get_sleep_length
      0.00            +0.3        0.25        perf-profile.children.cycles-pp.asm_sysvec_reschedule_ipi
      0.17 ±  2%      +0.3        0.44        perf-profile.children.cycles-pp.timerqueue_add
      0.40 ±  3%      +0.3        0.68        perf-profile.children.cycles-pp._copy_from_user
      0.19 ±  3%      +0.3        0.48        perf-profile.children.cycles-pp.enqueue_hrtimer
      0.81            +0.3        1.10        perf-profile.children.cycles-pp.__hrtimer_start_range_ns
      0.35 ± 16%      +0.3        0.66 ± 11%  perf-profile.children.cycles-pp.__nanosleep
      0.09 ±  5%      +0.3        0.41        perf-profile.children.cycles-pp.call_function_single_prep_ipi
      0.44 ±  3%      +0.3        0.78        perf-profile.children.cycles-pp.get_timespec64
      0.22 ±  2%      +0.4        0.57        perf-profile.children.cycles-pp.___perf_sw_event
      0.77            +0.4        1.14        perf-profile.children.cycles-pp.reweight_entity
      1.49 ±  2%      +0.4        1.86        perf-profile.children.cycles-pp.pick_next_task_fair
      1.00 ±  5%      +0.4        1.43 ±  4%  perf-profile.children.cycles-pp.stress_mwc32
      0.30 ± 15%      +0.5        0.76 ±  9%  perf-profile.children.cycles-pp.clock_gettime@plt
      0.14 ±  7%      +0.5        0.62        perf-profile.children.cycles-pp.menu_select
      0.57            +0.5        1.06        perf-profile.children.cycles-pp.__update_load_avg_cfs_rq
      1.08            +0.5        1.62        perf-profile.children.cycles-pp.hrtimer_start_range_ns
      0.66            +0.6        1.26        perf-profile.children.cycles-pp.prepare_task_switch
      1.25            +0.6        1.86        perf-profile.children.cycles-pp._find_next_bit
      0.21 ±  5%      +0.7        0.87        perf-profile.children.cycles-pp.llist_reverse_order
      0.52            +0.7        1.23        perf-profile.children.cycles-pp.update_curr
      0.05 ±  7%      +0.8        0.83 ±  3%  perf-profile.children.cycles-pp.__update_idle_core
      0.06 ±  9%      +0.8        0.85 ±  2%  perf-profile.children.cycles-pp.pick_next_task_idle
      0.81 ±  8%      +0.8        1.65 ±  3%  perf-profile.children.cycles-pp.clock_gettime
      0.68 ±  2%      +0.9        1.63        perf-profile.children.cycles-pp.sched_mm_cid_migrate_to
      0.07 ±  6%      +1.0        1.04 ±  2%  perf-profile.children.cycles-pp.remove_entity_load_avg
      0.58 ±  2%      +1.0        1.55        perf-profile.children.cycles-pp.__switch_to
      1.45            +1.0        2.43        perf-profile.children.cycles-pp.hrtimer_active
      1.41 ±  2%      +1.0        2.42        perf-profile.children.cycles-pp.restore_fpregs_from_fpstate
      0.34 ±  6%      +1.1        1.40        perf-profile.children.cycles-pp.llist_add_batch
      1.51            +1.1        2.58        perf-profile.children.cycles-pp.hrtimer_try_to_cancel
      0.30 ±  4%      +1.2        1.54        perf-profile.children.cycles-pp.migrate_task_rq_fair
      1.30            +1.2        2.55        perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
      2.44            +1.3        3.74        perf-profile.children.cycles-pp.exit_to_user_mode_prepare
      0.44 ±  6%      +1.4        1.85        perf-profile.children.cycles-pp.__smp_call_single_queue
      0.70 ±  3%      +1.5        2.23        perf-profile.children.cycles-pp.__switch_to_asm
      2.04            +1.6        3.61        perf-profile.children.cycles-pp.syscall_exit_to_user_mode
      1.77 ±  2%      +1.6        3.39        perf-profile.children.cycles-pp.switch_fpu_return
      0.41 ±  2%      +1.6        2.03        perf-profile.children.cycles-pp._raw_spin_lock_irqsave
      0.21 ±  3%      +1.7        1.91 ±  5%  perf-profile.children.cycles-pp.__bitmap_andnot
      0.43 ±  4%      +1.8        2.18        perf-profile.children.cycles-pp.set_task_cpu
      0.61 ±  5%      +1.9        2.48        perf-profile.children.cycles-pp.ttwu_queue_wakelist
      1.56            +1.9        3.42        perf-profile.children.cycles-pp.switch_mm_irqs_off
      0.49 ±  3%      +1.9        2.40        perf-profile.children.cycles-pp.intel_idle
      1.94 ±  4%      +2.0        3.92 ±  3%  perf-profile.children.cycles-pp.stress_pthread_func
      2.48            +2.2        4.65        perf-profile.children.cycles-pp._raw_spin_lock
      1.38 ±  3%      +3.8        5.23        perf-profile.children.cycles-pp.poll_idle
      6.23            +6.9       13.14        perf-profile.children.cycles-pp.cpuidle_enter
      6.22            +6.9       13.12        perf-profile.children.cycles-pp.cpuidle_enter_state
      6.72 ±  2%      +7.6       14.36        perf-profile.children.cycles-pp.cpuidle_idle_call
      2.32 ±  2%     +11.1       13.44        perf-profile.children.cycles-pp.select_idle_core
     30.80           -29.8        0.98 ± 10%  perf-profile.self.cycles-pp.update_cfs_group
     10.11            -8.9        1.18        perf-profile.self.cycles-pp.update_load_avg
     11.11            -4.9        6.26        perf-profile.self.cycles-pp.select_idle_cpu
     15.35            -0.6       14.78        perf-profile.self.cycles-pp.available_idle_cpu
      0.46 ±  2%      -0.1        0.32        perf-profile.self.cycles-pp._raw_spin_lock_irq
      0.38 ±  3%      -0.1        0.33        perf-profile.self.cycles-pp.get_nohz_timer_target
      0.10 ±  4%      -0.0        0.05        perf-profile.self.cycles-pp.entity_eligible
      0.12 ±  3%      +0.0        0.13        perf-profile.self.cycles-pp.ktime_get
      0.10 ±  3%      +0.0        0.12        perf-profile.self.cycles-pp.__hrtimer_start_range_ns
      0.07            +0.0        0.09 ±  4%  perf-profile.self.cycles-pp.select_task_rq
      0.13 ±  2%      +0.0        0.15 ±  6%  perf-profile.self.cycles-pp.sched_clock_cpu
      0.05 ±  7%      +0.0        0.08 ±  4%  perf-profile.self.cycles-pp.tracing_gen_ctx_irq_test
      0.06 ±  6%      +0.0        0.09 ±  4%  perf-profile.self.cycles-pp.perf_adjust_freq_unthr_context
      0.06 ±  9%      +0.0        0.09 ±  7%  perf-profile.self.cycles-pp.mm_cid_get
      0.15 ±  3%      +0.0        0.19        perf-profile.self.cycles-pp.__dequeue_entity
      0.06 ±  6%      +0.0        0.10 ±  4%  perf-profile.self.cycles-pp.ttwu_do_activate
      0.12 ±  3%      +0.0        0.17        perf-profile.self.cycles-pp.update_irq_load_avg
      0.01 ±223%      +0.0        0.06 ±  6%  perf-profile.self.cycles-pp.perf_trace_sched_stat_runtime
      0.00            +0.1        0.05        perf-profile.self.cycles-pp.syscall_exit_to_user_mode_prepare
      0.00            +0.1        0.05        perf-profile.self.cycles-pp.__bitmap_and
      0.00            +0.1        0.05        perf-profile.self.cycles-pp.idle_cpu
      0.00            +0.1        0.05 ±  7%  perf-profile.self.cycles-pp.cpu_startup_entry
      0.00            +0.1        0.05 ±  8%  perf-profile.self.cycles-pp.perf_exclude_event
      0.00            +0.1        0.06 ±  8%  perf-profile.self.cycles-pp.syscall_enter_from_user_mode
      0.00            +0.1        0.06 ± 11%  perf-profile.self.cycles-pp.perf_trace_sched_migrate_task
      0.02 ±141%      +0.1        0.08 ±  6%  perf-profile.self.cycles-pp._find_next_and_bit
      0.05            +0.1        0.11        perf-profile.self.cycles-pp.check_preempt_curr
      0.00            +0.1        0.06        perf-profile.self.cycles-pp.update_entity_lag
      0.00            +0.1        0.06        perf-profile.self.cycles-pp.__update_load_avg_blocked_se
      0.00            +0.1        0.06        perf-profile.self.cycles-pp.ct_kernel_exit_state
      0.08            +0.1        0.14        perf-profile.self.cycles-pp.resched_curr
      0.00            +0.1        0.06 ±  6%  perf-profile.self.cycles-pp.remove_entity_load_avg
      0.00            +0.1        0.06 ±  6%  perf-profile.self.cycles-pp.activate_task
      0.13 ±  2%      +0.1        0.20 ±  2%  perf-profile.self.cycles-pp.lapic_next_deadline
      0.03 ± 70%      +0.1        0.10        perf-profile.self.cycles-pp.set_next_entity
      0.00            +0.1        0.07        perf-profile.self.cycles-pp.hrtimer_try_to_cancel
      0.00            +0.1        0.07        perf-profile.self.cycles-pp.get_timespec64
      0.00            +0.1        0.07        perf-profile.self.cycles-pp.rb_next
      0.00            +0.1        0.07 ±  5%  perf-profile.self.cycles-pp.perf_trace_sched_switch
      0.33            +0.1        0.40        perf-profile.self.cycles-pp.update_rq_clock
      0.06            +0.1        0.13 ±  3%  perf-profile.self.cycles-pp.__hrtimer_init
      0.09 ±  5%      +0.1        0.17        perf-profile.self.cycles-pp.avg_vruntime
      0.08 ±  6%      +0.1        0.16 ±  3%  perf-profile.self.cycles-pp.rcu_note_context_switch
      0.08            +0.1        0.16        perf-profile.self.cycles-pp.rb_erase
      0.00            +0.1        0.08        perf-profile.self.cycles-pp._raw_spin_unlock_irqrestore
      0.10 ±  4%      +0.1        0.18 ±  2%  perf-profile.self.cycles-pp.select_task_rq_fair
      0.00            +0.1        0.08 ±  4%  perf-profile.self.cycles-pp.__hrtimer_next_event_base
      0.05 ±  8%      +0.1        0.14        perf-profile.self.cycles-pp.do_syscall_64
      0.00            +0.1        0.09 ±  4%  perf-profile.self.cycles-pp.__list_add_valid
      0.09 ±  4%      +0.1        0.18        perf-profile.self.cycles-pp.__hrtimer_run_queues
      0.07            +0.1        0.16        perf-profile.self.cycles-pp.common_nsleep
      0.00            +0.1        0.09        perf-profile.self.cycles-pp.__list_del_entry_valid
      0.35 ±  2%      +0.1        0.44        perf-profile.self.cycles-pp.update_rq_clock_task
      0.00            +0.1        0.10 ±  5%  perf-profile.self.cycles-pp.__x2apic_send_IPI_dest
      0.00            +0.1        0.10 ±  3%  perf-profile.self.cycles-pp.__entry_text_start
      0.06 ±  6%      +0.1        0.16 ±  2%  perf-profile.self.cycles-pp.native_apic_msr_eoi_write
      0.00            +0.1        0.10        perf-profile.self.cycles-pp.schedule_idle
      0.07 ±  5%      +0.1        0.17        perf-profile.self.cycles-pp.entry_SYSRETQ_unsafe_stack
      0.07 ±  5%      +0.1        0.17 ±  2%  perf-profile.self.cycles-pp.timerqueue_del
      0.00            +0.1        0.11        perf-profile.self.cycles-pp.syscall_return_via_sysret
      0.10 ±  5%      +0.1        0.21 ±  2%  perf-profile.self.cycles-pp.rb_insert_color
      0.00            +0.1        0.11 ±  3%  perf-profile.self.cycles-pp.__calc_delta
      0.00            +0.1        0.12 ±  4%  perf-profile.self.cycles-pp.call_cpuidle
      0.12 ±  3%      +0.1        0.25 ±  2%  perf-profile.self.cycles-pp.pick_eevdf
      0.00            +0.1        0.13        perf-profile.self.cycles-pp.__rdgsbase_inactive
      0.15 ±  3%      +0.1        0.28        perf-profile.self.cycles-pp.pick_next_task_fair
      0.06 ±  8%      +0.1        0.19        perf-profile.self.cycles-pp.hrtimer_start_range_ns
      0.16 ±  2%      +0.1        0.30        perf-profile.self.cycles-pp.do_nanosleep
      0.09 ±  5%      +0.1        0.23        perf-profile.self.cycles-pp.stress_mwc32modn
      0.07 ±  5%      +0.1        0.21        perf-profile.self.cycles-pp.__wrgsbase_inactive
      0.24            +0.1        0.38        perf-profile.self.cycles-pp.os_xsave
      0.00            +0.1        0.14 ±  2%  perf-profile.self.cycles-pp.newidle_balance
      0.13            +0.2        0.28        perf-profile.self.cycles-pp.try_to_wake_up
      0.17 ±  2%      +0.2        0.32        perf-profile.self.cycles-pp.perf_tp_event
      0.00            +0.2        0.15 ±  3%  perf-profile.self.cycles-pp.cpuidle_idle_call
      0.06 ±  6%      +0.2        0.21        perf-profile.self.cycles-pp.menu_select
      0.07 ±  5%      +0.2        0.23 ±  2%  perf-profile.self.cycles-pp.timerqueue_add
      0.07 ±  5%      +0.2        0.23 ±  8%  perf-profile.self.cycles-pp.cpuacct_charge
      0.13 ± 13%      +0.2        0.31 ± 15%  perf-profile.self.cycles-pp.clock_gettime
      0.23 ±  3%      +0.2        0.41        perf-profile.self.cycles-pp.native_irq_return_iret
      0.11 ±  3%      +0.2        0.29 ±  2%  perf-profile.self.cycles-pp.read_tsc
      0.16 ± 12%      +0.2        0.34 ± 10%  perf-profile.self.cycles-pp.__nanosleep
      0.16 ±  2%      +0.2        0.35        perf-profile.self.cycles-pp.update_min_vruntime
      0.09 ±  4%      +0.2        0.28        perf-profile.self.cycles-pp.attach_entity_load_avg
      0.41            +0.2        0.60        perf-profile.self.cycles-pp.__update_load_avg_se
      0.16 ±  3%      +0.2        0.35        perf-profile.self.cycles-pp.schedule
      0.13 ±  3%      +0.2        0.33        perf-profile.self.cycles-pp.cpus_share_cache
      0.13 ±  5%      +0.2        0.33        perf-profile.self.cycles-pp.__flush_smp_call_function_queue
      0.21 ±  2%      +0.2        0.42        perf-profile.self.cycles-pp._copy_from_user
      0.00            +0.2        0.20 ±  2%  perf-profile.self.cycles-pp.cpuidle_enter_state
      0.21 ±  5%      +0.2        0.42        perf-profile.self.cycles-pp.migrate_task_rq_fair
      0.20 ±  2%      +0.2        0.44        perf-profile.self.cycles-pp.__enqueue_entity
      0.57 ±  5%      +0.3        0.82 ±  3%  perf-profile.self.cycles-pp.stress_mwc32
      0.26            +0.3        0.52        perf-profile.self.cycles-pp.dequeue_entity
      0.13 ±  6%      +0.3        0.39        perf-profile.self.cycles-pp.flush_smp_call_function_queue
      0.23 ±  2%      +0.3        0.50        perf-profile.self.cycles-pp.native_sched_clock
      0.30 ±  4%      +0.3        0.58        perf-profile.self.cycles-pp.__x64_sys_clock_nanosleep
      0.29 ±  3%      +0.3        0.57        perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
      0.30            +0.3        0.59        perf-profile.self.cycles-pp.hrtimer_nanosleep
      0.17 ±  2%      +0.3        0.47        perf-profile.self.cycles-pp.___perf_sw_event
      0.13 ±  7%      +0.3        0.44        perf-profile.self.cycles-pp.ttwu_queue_wakelist
      0.08 ±  7%      +0.3        0.40        perf-profile.self.cycles-pp.sched_ttwu_pending
      0.09 ±  5%      +0.3        0.40        perf-profile.self.cycles-pp.call_function_single_prep_ipi
      0.22 ± 13%      +0.3        0.56 ± 10%  perf-profile.self.cycles-pp.clock_gettime@plt
      0.28 ±  2%      +0.3        0.62 ±  3%  perf-profile.self.cycles-pp.update_curr
      0.00            +0.3        0.35        perf-profile.self.cycles-pp.do_idle
      0.58            +0.4        0.94        perf-profile.self.cycles-pp.reweight_entity
      0.54            +0.5        1.00        perf-profile.self.cycles-pp.prepare_task_switch
      0.11 ±  6%      +0.5        0.57 ±  3%  perf-profile.self.cycles-pp.set_task_cpu
      0.40 ±  2%      +0.5        0.87        perf-profile.self.cycles-pp.enqueue_entity
      0.07 ± 10%      +0.5        0.54        perf-profile.self.cycles-pp.nohz_run_idle_balance
      0.55            +0.5        1.04        perf-profile.self.cycles-pp.__update_load_avg_cfs_rq
      1.12            +0.5        1.66        perf-profile.self.cycles-pp._find_next_bit
      0.23 ±  2%      +0.5        0.77 ±  2%  perf-profile.self.cycles-pp.enqueue_task_fair
      0.64            +0.6        1.23        perf-profile.self.cycles-pp.dequeue_task_fair
      0.36 ±  2%      +0.6        0.96        perf-profile.self.cycles-pp.switch_fpu_return
      0.21 ±  6%      +0.7        0.87        perf-profile.self.cycles-pp.llist_reverse_order
      0.00            +0.7        0.71 ±  3%  perf-profile.self.cycles-pp.__update_idle_core
      0.68 ±  3%      +0.7        1.42        perf-profile.self.cycles-pp.hrtimer_active
      0.47 ±  2%      +0.8        1.26        perf-profile.self.cycles-pp.finish_task_switch
      1.17            +0.9        2.09        perf-profile.self.cycles-pp._raw_spin_lock
      0.68 ±  2%      +0.9        1.62        perf-profile.self.cycles-pp.sched_mm_cid_migrate_to
      0.57 ±  2%      +0.9        1.52        perf-profile.self.cycles-pp.__switch_to
      1.41 ±  2%      +1.0        2.42        perf-profile.self.cycles-pp.restore_fpregs_from_fpstate
      0.34 ±  6%      +1.1        1.40        perf-profile.self.cycles-pp.llist_add_batch
      0.72 ±  5%      +1.1        1.80 ±  2%  perf-profile.self.cycles-pp.clock_nanosleep
      1.30            +1.2        2.55        perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
      0.96 ±  6%      +1.3        2.28 ±  3%  perf-profile.self.cycles-pp.stress_pthread_func
      0.06 ± 11%      +1.4        1.44        perf-profile.self.cycles-pp.poll_idle
      0.54            +1.5        2.02 ±  2%  perf-profile.self.cycles-pp.select_idle_sibling
      0.69 ±  3%      +1.5        2.22        perf-profile.self.cycles-pp.__switch_to_asm
      0.38 ±  3%      +1.6        1.97        perf-profile.self.cycles-pp._raw_spin_lock_irqsave
      0.20 ±  5%      +1.7        1.86 ±  5%  perf-profile.self.cycles-pp.__bitmap_andnot
      1.22            +1.8        3.06        perf-profile.self.cycles-pp.__schedule
      1.54            +1.8        3.39        perf-profile.self.cycles-pp.switch_mm_irqs_off
      0.49 ±  3%      +1.9        2.40        perf-profile.self.cycles-pp.intel_idle
      0.79 ±  2%      +1.9        2.74        perf-profile.self.cycles-pp.select_idle_core
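
A note on reading the middle column, as a minimal sketch rather than a description of the robot's actual tooling. It assumes that rows whose values are already percentages (the perf-profile and mpstat rows) report an absolute difference in percentage points, while count-style rows such as stress-ng.sem.ops_per_sec report a relative %change. The helper names are hypothetical; the inputs are copied from rows in this report.

```python
def relative_change(base: float, patched: float) -> float:
    """Relative change from base to patched, in percent (assumed '%change' rows)."""
    return (patched - base) / base * 100.0


def point_change(base: float, patched: float) -> float:
    """Absolute difference in percentage points (assumed for share-of-cycles rows)."""
    return patched - base


# stress-ng.sem.ops_per_sec row (second table): 8872696 -> 19585942
print(f"{relative_change(8872696, 19585942):+.1f}%")

# perf-profile.self.cycles-pp.update_cfs_group row (this table): 30.80 -> 0.98
print(f"{point_change(30.80, 0.98):+.1f}")
```

The reported values are averages over several runs, so a delta recomputed from the rounded columns can differ from the printed one in the last digit.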


***************************************************************************************************
lkp-spr-r02: 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480CTDX (Sapphire Rapids) with 256G memory
=========================================================================================
class/compiler/cpufreq_governor/kconfig/nr_threads/rootfs/sc_pid_max/tbox_group/test/testcase/testtime:
  scheduler/gcc-12/performance/x86_64-rhel-8.3/100%/debian-11.1-x86_64-20220510.cgz/4194304/lkp-spr-r02/sem/stress-ng/60s

commit: 
  63304558ba ("sched/eevdf: Curb wakeup-preemption")
  0a24d7afed ("sched/fair: ratelimit update to tg->load_avg")

63304558ba5dcaaf 0a24d7afed5c3c59ee212782f9c 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
     11351 ±  3%     +46.2%      16601        uptime.idle
 3.846e+09 ±  2%    +132.5%  8.942e+09        cpuidle..time
 1.809e+08 ±  4%    +188.7%  5.221e+08        cpuidle..usage
   1748151 ± 30%     +96.9%    3442946 ± 10%  numa-numastat.node1.local_node
   1884868 ± 27%     +91.3%    3605026 ± 10%  numa-numastat.node1.numa_hit
      1396 ± 21%     +31.4%       1834 ±  9%  perf-c2c.DRAM.local
     77083 ±  8%     +25.5%      96721 ±  5%  perf-c2c.HITM.local
     78543 ±  8%     +25.0%      98216 ±  5%  perf-c2c.HITM.total
     38.23 ±  3%     +42.9       81.16        mpstat.cpu.all.idle%
     32.27 ±  2%     -23.0        9.27        mpstat.cpu.all.irq%
      0.43 ±  6%      -0.4        0.08 ±  3%  mpstat.cpu.all.soft%
     20.22           -14.8        5.47        mpstat.cpu.all.sys%
      8.85            -4.8        4.01 ±  2%  mpstat.cpu.all.usr%
     39.67 ±  2%    +104.2%      81.00        vmstat.cpu.id
     50.83 ±  2%     -72.5%      14.00        vmstat.cpu.sy
   7584537 ±  4%     +62.1%   12297640 ±  2%  vmstat.memory.cache
    141.17           -59.0%      57.83 ±  4%  vmstat.procs.r
   7826695 ±  4%    +127.3%   17789598        vmstat.system.cs
   3068646 ±  2%     +34.3%    4120105        vmstat.system.in
     84090 ± 31%     +62.5%     136620 ±  9%  numa-meminfo.node1.Active
     83901 ± 31%     +62.7%     136529 ±  9%  numa-meminfo.node1.Active(anon)
   4332711 ± 24%    +100.1%    8670667 ± 10%  numa-meminfo.node1.FilePages
   4148689 ± 32%    +115.2%    8929944 ± 10%  numa-meminfo.node1.Inactive
   4148234 ± 32%    +115.3%    8929838 ± 10%  numa-meminfo.node1.Inactive(anon)
    755396 ±  2%     +35.6%    1024533        numa-meminfo.node1.Mapped
   6280993 ± 18%     +68.2%   10562500 ±  8%  numa-meminfo.node1.MemUsed
   3791150 ± 35%    +126.9%    8600616 ± 10%  numa-meminfo.node1.Shmem
 5.324e+08 ±  3%    +120.7%  1.175e+09        stress-ng.sem.ops
   8872696 ±  3%    +120.7%   19585942        stress-ng.sem.ops_per_sec
  36203483           -44.3%   20170299        stress-ng.time.involuntary_context_switches
     41804            -5.5%      39488        stress-ng.time.minor_page_faults
      7970           -44.6%       4412        stress-ng.time.percent_of_cpu_this_job_got
      3548           -53.3%       1658        stress-ng.time.system_time
      1419           -23.4%       1087        stress-ng.time.user_time
 2.657e+08 ±  3%    +120.7%  5.864e+08        stress-ng.time.voluntary_context_switches
     21077 ± 31%     +62.0%      34153 ±  9%  numa-vmstat.node1.nr_active_anon
   1083493 ± 24%    +100.1%    2167954 ± 10%  numa-vmstat.node1.nr_file_pages
   1037164 ± 32%    +115.3%    2232740 ± 10%  numa-vmstat.node1.nr_inactive_anon
    188463 ±  2%     +36.0%     256345        numa-vmstat.node1.nr_mapped
    948102 ± 35%    +126.8%    2150441 ± 10%  numa-vmstat.node1.nr_shmem
     21077 ± 31%     +62.0%      34153 ±  9%  numa-vmstat.node1.nr_zone_active_anon
   1037160 ± 32%    +115.3%    2232735 ± 10%  numa-vmstat.node1.nr_zone_inactive_anon
   1884811 ± 27%     +91.3%    3605221 ± 10%  numa-vmstat.node1.numa_hit
   1748094 ± 31%     +97.0%    3443141 ± 10%  numa-vmstat.node1.numa_local
    113047 ± 12%     +32.7%     150063 ±  4%  meminfo.Active
    112858 ± 12%     +32.8%     149930 ±  4%  meminfo.Active(anon)
   7383553 ±  4%     +63.3%   12055933 ±  2%  meminfo.Cached
  11049975 ±  3%     +42.2%   15716475        meminfo.Committed_AS
   5392328 ±  6%     +86.8%   10072875 ±  2%  meminfo.Inactive
   5391867 ±  6%     +86.8%   10072720 ±  2%  meminfo.Inactive(anon)
   1185651           +20.6%    1430121        meminfo.Mapped
  11425070 ±  3%     +40.8%   16088748        meminfo.Memused
   4639309 ±  7%    +100.7%    9312054 ±  3%  meminfo.Shmem
  11531671 ±  3%     +41.0%   16259717        meminfo.max_used_kB
      2128           -46.4%       1141        turbostat.Avg_MHz
     74.52           -33.3       41.25        turbostat.Busy%
      2868            -3.6%       2765        turbostat.Bzy_MHz
  10682212 ±  8%    +223.7%   34581909 ±  2%  turbostat.C1
      0.45 ±  8%      +0.6        1.06 ±  4%  turbostat.C1%
 1.672e+08 ±  4%    +186.8%  4.794e+08        turbostat.C1E
     22.71 ±  2%     +35.2       57.90        turbostat.C1E%
     25.39 ±  2%    +131.4%      58.75        turbostat.CPU%c1
 2.003e+08           +34.2%  2.689e+08        turbostat.IRQ
   2595431 ±  6%    +191.2%    7557295        turbostat.POLL
      0.39            -0.1        0.29        turbostat.POLL%
    546.17            -4.0%     524.25        turbostat.PkgWatt
     17.63            +5.4%      18.59        turbostat.RAMWatt
     28245 ± 12%     +32.8%      37500 ±  4%  proc-vmstat.nr_active_anon
    216361            +5.3%     227804        proc-vmstat.nr_anon_pages
   6268223            -1.9%    6151776        proc-vmstat.nr_dirty_background_threshold
  12551772            -1.9%   12318594        proc-vmstat.nr_dirty_threshold
   1846243 ±  4%     +63.3%    3014307 ±  2%  proc-vmstat.nr_file_pages
  63058692            -1.8%   61892607        proc-vmstat.nr_free_pages
   1348115 ±  6%     +86.8%    2518510 ±  2%  proc-vmstat.nr_inactive_anon
    296600           +20.6%     357727        proc-vmstat.nr_mapped
   1160181 ±  7%    +100.7%    2328337 ±  3%  proc-vmstat.nr_shmem
     41135            +6.4%      43765        proc-vmstat.nr_slab_reclaimable
     28245 ± 12%     +32.8%      37500 ±  4%  proc-vmstat.nr_zone_active_anon
   1348115 ±  6%     +86.8%    2518510 ±  2%  proc-vmstat.nr_zone_inactive_anon
    305736 ±  9%     +64.0%     501444 ± 16%  proc-vmstat.numa_hint_faults
    212439 ± 11%     +68.3%     357589 ± 17%  proc-vmstat.numa_hint_faults_local
   2618374 ±  5%     +65.4%    4331174 ±  2%  proc-vmstat.numa_hit
      1476 ±  2%     -21.4%       1159 ±  4%  proc-vmstat.numa_huge_pte_updates
   2389468 ±  5%     +71.5%    4099042 ±  2%  proc-vmstat.numa_local
     18075 ± 18%    +243.7%      62123 ± 11%  proc-vmstat.pgactivate
   2893156 ±  4%     +59.3%    4608688 ±  2%  proc-vmstat.pgalloc_normal
   1164766 ±  2%     +24.7%    1452985 ±  6%  proc-vmstat.pgfault
    750923 ±  7%     +20.2%     902954 ±  4%  proc-vmstat.pgfree
   1587344           -80.7%     305805        sched_debug.cfs_rq:/.avg_vruntime.avg
   3053761 ± 15%     -79.5%     627436 ± 20%  sched_debug.cfs_rq:/.avg_vruntime.max
   1349541 ±  8%     -80.5%     263633 ±  7%  sched_debug.cfs_rq:/.avg_vruntime.min
    119162 ± 17%     -72.5%      32738 ± 18%  sched_debug.cfs_rq:/.avg_vruntime.stddev
      0.43 ±  8%     -57.8%       0.18 ±  6%  sched_debug.cfs_rq:/.h_nr_running.avg
    459622 ±  7%     -94.1%      26943 ± 10%  sched_debug.cfs_rq:/.left_vruntime.avg
   1802751 ±  8%     -81.4%     334976 ± 13%  sched_debug.cfs_rq:/.left_vruntime.max
    715952 ±  2%     -88.0%      86086 ±  4%  sched_debug.cfs_rq:/.left_vruntime.stddev
   1587344           -80.7%     305805        sched_debug.cfs_rq:/.min_vruntime.avg
   3053761 ± 15%     -79.5%     627436 ± 20%  sched_debug.cfs_rq:/.min_vruntime.max
   1349542 ±  8%     -80.5%     263633 ±  7%  sched_debug.cfs_rq:/.min_vruntime.min
    119162 ± 17%     -72.5%      32738 ± 18%  sched_debug.cfs_rq:/.min_vruntime.stddev
      0.32 ±  3%     -48.3%       0.17 ±  5%  sched_debug.cfs_rq:/.nr_running.avg
    459622 ±  7%     -94.1%      26943 ± 10%  sched_debug.cfs_rq:/.right_vruntime.avg
   1802751 ±  8%     -81.4%     334976 ± 13%  sched_debug.cfs_rq:/.right_vruntime.max
    715952 ±  2%     -88.0%      86086 ±  4%  sched_debug.cfs_rq:/.right_vruntime.stddev
    456.43 ±  3%     -58.2%     190.59 ±  4%  sched_debug.cfs_rq:/.runnable_avg.avg
      1516 ± 11%     -23.5%       1159 ± 12%  sched_debug.cfs_rq:/.runnable_avg.max
    225.37 ±  4%     -29.1%     159.79 ±  6%  sched_debug.cfs_rq:/.runnable_avg.stddev
    317.90 ±  2%     -43.2%     180.41 ±  4%  sched_debug.cfs_rq:/.util_avg.avg
     20.61 ± 17%     -54.7%       9.33 ± 21%  sched_debug.cfs_rq:/.util_est_enqueued.avg
     41.44 ± 18%     -73.0%      11.19 ±  2%  sched_debug.cpu.clock.stddev
      1127 ±  9%     -26.2%     831.86        sched_debug.cpu.clock_task.stddev
      1517 ±  4%     -23.9%       1155 ±  8%  sched_debug.cpu.curr->pid.avg
      0.00 ± 18%     -70.3%       0.00 ± 10%  sched_debug.cpu.next_balance.stddev
      0.39 ±  4%     -54.9%       0.17 ±  6%  sched_debug.cpu.nr_running.avg
      0.54 ±  6%     -27.4%       0.39 ± 12%  sched_debug.cpu.nr_running.stddev
   1086308 ±  3%    +126.7%    2462284        sched_debug.cpu.nr_switches.avg
   1230656 ±  4%    +115.6%    2653087 ±  2%  sched_debug.cpu.nr_switches.max
    511501 ± 20%    +208.1%    1576039 ± 20%  sched_debug.cpu.nr_switches.min
      0.00          -100.0%       0.00        sched_debug.rt_rq:.rt_nr_migratory.avg
      0.50          -100.0%       0.00        sched_debug.rt_rq:.rt_nr_migratory.max
      0.03          -100.0%       0.00        sched_debug.rt_rq:.rt_nr_migratory.stddev
      0.00          -100.0%       0.00        sched_debug.rt_rq:.rt_nr_running.avg
      0.50          -100.0%       0.00        sched_debug.rt_rq:.rt_nr_running.max
      0.03          -100.0%       0.00        sched_debug.rt_rq:.rt_nr_running.stddev
      1.21 ± 19%    -100.0%       0.00        sched_debug.rt_rq:.rt_time.avg
    270.28 ± 19%    -100.0%       0.00        sched_debug.rt_rq:.rt_time.max
     18.02 ± 19%    -100.0%       0.00        sched_debug.rt_rq:.rt_time.stddev
     11.96           -25.6%       8.90        perf-stat.i.MPKI
 1.534e+10 ±  2%     +89.5%  2.908e+10        perf-stat.i.branch-instructions
      1.45            -0.4        1.09        perf-stat.i.branch-miss-rate%
 1.844e+08 ±  2%     +46.4%    2.7e+08        perf-stat.i.branch-misses
      3.02 ±  4%      +0.2        3.27        perf-stat.i.cache-miss-rate%
  14912654 ±  5%     +92.1%   28640533 ±  2%  perf-stat.i.cache-misses
 7.818e+08 ±  3%     +48.0%  1.157e+09        perf-stat.i.cache-references
   8093326 ±  3%    +128.3%   18474462        perf-stat.i.context-switches
      7.31 ±  4%     -75.0%       1.82        perf-stat.i.cpi
  4.79e+11           -49.8%  2.405e+11        perf-stat.i.cpu-cycles
   3201059 ±  3%    +117.3%    6956049        perf-stat.i.cpu-migrations
     41589 ±  5%     -69.5%      12679 ±  2%  perf-stat.i.cycles-between-cache-misses
      0.24 ±  2%      -0.1        0.15 ±  2%  perf-stat.i.dTLB-load-miss-rate%
  44628478 ±  4%     +24.9%   55730478 ±  2%  perf-stat.i.dTLB-load-misses
 1.958e+10 ±  3%     +97.5%  3.867e+10        perf-stat.i.dTLB-loads
      0.08            -0.0        0.05        perf-stat.i.dTLB-store-miss-rate%
   7655277 ±  2%     +29.8%    9933453        perf-stat.i.dTLB-store-misses
 1.103e+10 ±  3%    +102.9%  2.238e+10        perf-stat.i.dTLB-stores
 7.611e+10 ±  2%     +89.8%  1.445e+11        perf-stat.i.instructions
      0.19 ±  2%    +212.9%       0.58        perf-stat.i.ipc
      2.12           -49.6%       1.07        perf-stat.i.metric.GHz
    128.80 ±  3%     +68.7%     217.34        perf-stat.i.metric.K/sec
    207.54 ±  3%     +96.0%     406.79        perf-stat.i.metric.M/sec
     19217           +15.9%      22269 ±  7%  perf-stat.i.minor-faults
     80.23 ±  2%     -10.5       69.77 ±  2%  perf-stat.i.node-load-miss-rate%
   4830206 ±  4%     +40.9%    6803808 ±  4%  perf-stat.i.node-load-misses
   1648661 ± 12%    +152.3%    4158848 ±  6%  perf-stat.i.node-loads
     19217           +15.9%      22269 ±  7%  perf-stat.i.page-faults
     10.69           -24.6%       8.06        perf-stat.overall.MPKI
      1.25            -0.3        0.93        perf-stat.overall.branch-miss-rate%
      1.86 ±  5%      +0.6        2.46 ±  2%  perf-stat.overall.cache-miss-rate%
      6.56 ±  4%     -74.5%       1.67        perf-stat.overall.cpi
     33078 ±  5%     -74.4%       8459        perf-stat.overall.cycles-between-cache-misses
      0.23 ±  2%      -0.1        0.14 ±  2%  perf-stat.overall.dTLB-load-miss-rate%
      0.07            -0.0        0.04        perf-stat.overall.dTLB-store-miss-rate%
      0.15 ±  4%    +291.0%       0.60        perf-stat.overall.ipc
     70.66 ±  3%     -10.0       60.71 ±  3%  perf-stat.overall.node-load-miss-rate%
 1.442e+10 ±  3%     +96.7%  2.838e+10        perf-stat.ps.branch-instructions
 1.798e+08 ±  3%     +47.5%  2.653e+08        perf-stat.ps.branch-misses
  14255932 ±  5%     +96.0%   27941497 ±  2%  perf-stat.ps.cache-misses
 7.673e+08 ±  3%     +48.2%  1.137e+09        perf-stat.ps.cache-references
   7947074 ±  4%    +129.1%   18204433        perf-stat.ps.context-switches
   4.7e+11           -49.7%  2.363e+11        perf-stat.ps.cpu-cycles
   3148303 ±  4%    +117.8%    6855484        perf-stat.ps.cpu-migrations
  43383948 ±  5%     +26.3%   54781737 ±  2%  perf-stat.ps.dTLB-load-misses
 1.859e+10 ±  3%    +103.5%  3.783e+10        perf-stat.ps.dTLB-loads
   7530893 ±  2%     +30.0%    9786994        perf-stat.ps.dTLB-store-misses
 1.049e+10 ±  3%    +108.7%   2.19e+10        perf-stat.ps.dTLB-stores
 7.175e+10 ±  3%     +96.6%  1.411e+11        perf-stat.ps.instructions
     17174 ±  2%     +24.8%      21436 ±  6%  perf-stat.ps.minor-faults
   4621265 ±  4%     +43.9%    6648855 ±  4%  perf-stat.ps.node-load-misses
   1927489 ± 13%    +123.1%    4300926 ±  5%  perf-stat.ps.node-loads
     17174 ±  2%     +24.8%      21436 ±  6%  perf-stat.ps.page-faults
 4.485e+12 ±  3%     +98.9%   8.92e+12        perf-stat.total.instructions
     27.32           -18.6        8.70        perf-profile.calltrace.cycles-pp.flush_smp_call_function_queue.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
     21.32           -18.0        3.36 ±  2%  perf-profile.calltrace.cycles-pp.enqueue_task_fair.activate_task.ttwu_do_activate.sched_ttwu_pending.__flush_smp_call_function_queue
     18.25           -15.6        2.69 ±  3%  perf-profile.calltrace.cycles-pp.enqueue_entity.enqueue_task_fair.activate_task.ttwu_do_activate.sched_ttwu_pending
     18.99           -13.8        5.16        perf-profile.calltrace.cycles-pp.activate_task.ttwu_do_activate.sched_ttwu_pending.__flush_smp_call_function_queue.flush_smp_call_function_queue
     19.06           -13.8        5.29        perf-profile.calltrace.cycles-pp.ttwu_do_activate.sched_ttwu_pending.__flush_smp_call_function_queue.flush_smp_call_function_queue.do_idle
     17.39           -13.5        3.88 ±  2%  perf-profile.calltrace.cycles-pp.dequeue_task_fair.__schedule.schedule.do_nanosleep.hrtimer_nanosleep
     19.49           -13.0        6.44        perf-profile.calltrace.cycles-pp.sched_ttwu_pending.__flush_smp_call_function_queue.flush_smp_call_function_queue.do_idle.cpu_startup_entry
     19.77           -12.5        7.27        perf-profile.calltrace.cycles-pp.__flush_smp_call_function_queue.flush_smp_call_function_queue.do_idle.cpu_startup_entry.start_secondary
     11.46 ±  2%     -10.7        0.80 ±  4%  perf-profile.calltrace.cycles-pp.update_cfs_group.dequeue_entity.dequeue_task_fair.__schedule.schedule
     14.02           -10.6        3.37 ±  2%  perf-profile.calltrace.cycles-pp.dequeue_entity.dequeue_task_fair.__schedule.schedule.do_nanosleep
     10.59           -10.5        0.09 ±223%  perf-profile.calltrace.cycles-pp.update_cfs_group.enqueue_entity.enqueue_task_fair.activate_task.ttwu_do_activate
     18.67            -8.4       10.29        perf-profile.calltrace.cycles-pp.try_to_wake_up.hrtimer_wakeup.__hrtimer_run_queues.hrtimer_interrupt.__sysvec_apic_timer_interrupt
      9.63            -8.4        1.27        perf-profile.calltrace.cycles-pp.update_load_avg.enqueue_entity.enqueue_task_fair.activate_task.ttwu_do_activate
     18.68            -8.3       10.34        perf-profile.calltrace.cycles-pp.hrtimer_wakeup.__hrtimer_run_queues.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt
     22.51            -8.1       14.45 ±  4%  perf-profile.calltrace.cycles-pp.__schedule.schedule.do_nanosleep.hrtimer_nanosleep.common_nsleep
     19.24            -7.9       11.33        perf-profile.calltrace.cycles-pp.__hrtimer_run_queues.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt
     22.71            -7.8       14.88 ±  4%  perf-profile.calltrace.cycles-pp.schedule.do_nanosleep.hrtimer_nanosleep.common_nsleep.__x64_sys_clock_nanosleep
     10.88 ±  4%      -7.3        3.57 ±  2%  perf-profile.calltrace.cycles-pp.select_idle_sibling.select_task_rq_fair.select_task_rq.try_to_wake_up.hrtimer_wakeup
      9.66 ±  2%      -7.3        2.40        perf-profile.calltrace.cycles-pp.select_idle_cpu.select_idle_sibling.select_task_rq_fair.select_task_rq.try_to_wake_up
     10.95 ±  4%      -7.2        3.70        perf-profile.calltrace.cycles-pp.select_task_rq_fair.select_task_rq.try_to_wake_up.hrtimer_wakeup.__hrtimer_run_queues
     10.97 ±  4%      -7.2        3.76        perf-profile.calltrace.cycles-pp.select_task_rq.try_to_wake_up.hrtimer_wakeup.__hrtimer_run_queues.hrtimer_interrupt
      8.82 ±  2%      -6.6        2.23 ±  2%  perf-profile.calltrace.cycles-pp.select_idle_core.select_idle_cpu.select_idle_sibling.select_task_rq_fair.select_task_rq
     25.03            -6.5       18.57 ±  3%  perf-profile.calltrace.cycles-pp.do_nanosleep.hrtimer_nanosleep.common_nsleep.__x64_sys_clock_nanosleep.do_syscall_64
     25.38            -6.1       19.25 ±  3%  perf-profile.calltrace.cycles-pp.hrtimer_nanosleep.common_nsleep.__x64_sys_clock_nanosleep.do_syscall_64.entry_SYSCALL_64_after_hwframe
     25.41            -6.1       19.30 ±  3%  perf-profile.calltrace.cycles-pp.common_nsleep.__x64_sys_clock_nanosleep.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_nanosleep
     26.14            -5.8       20.31 ±  2%  perf-profile.calltrace.cycles-pp.__x64_sys_clock_nanosleep.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_nanosleep
      6.52 ±  4%      -5.6        0.89 ±  3%  perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.flush_smp_call_function_queue.do_idle.cpu_startup_entry.start_secondary
      6.36 ±  4%      -5.5        0.82 ±  2%  perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.flush_smp_call_function_queue.do_idle.cpu_startup_entry
      6.20 ±  4%      -5.4        0.80 ±  2%  perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.flush_smp_call_function_queue.do_idle
      6.17 ±  4%      -5.4        0.78 ±  3%  perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.flush_smp_call_function_queue
      5.01            -5.0        0.00        perf-profile.calltrace.cycles-pp.update_load_avg.set_next_entity.pick_next_task_fair.__schedule.schedule_idle
     29.04            -4.9       24.15 ±  2%  perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_nanosleep
     60.01            -4.9       55.12        perf-profile.calltrace.cycles-pp.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
     29.44            -4.9       24.57 ±  2%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.clock_nanosleep
     60.32            -4.9       55.46        perf-profile.calltrace.cycles-pp.secondary_startup_64_no_verify
     60.05            -4.8       55.24        perf-profile.calltrace.cycles-pp.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
     12.60            -4.8        7.81        perf-profile.calltrace.cycles-pp.__schedule.schedule_idle.do_idle.cpu_startup_entry.start_secondary
     60.06            -4.8       55.29        perf-profile.calltrace.cycles-pp.start_secondary.secondary_startup_64_no_verify
     12.70            -4.6        8.08        perf-profile.calltrace.cycles-pp.schedule_idle.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
      5.10            -4.4        0.67        perf-profile.calltrace.cycles-pp.set_next_entity.pick_next_task_fair.__schedule.schedule_idle.do_idle
      5.23            -4.2        0.98        perf-profile.calltrace.cycles-pp.pick_next_task_fair.__schedule.schedule_idle.do_idle.cpu_startup_entry
      4.86            -3.8        1.07        perf-profile.calltrace.cycles-pp.finish_task_switch.__schedule.schedule_idle.do_idle.cpu_startup_entry
      4.90 ±  2%      -3.5        1.38 ±  2%  perf-profile.calltrace.cycles-pp.available_idle_cpu.select_idle_core.select_idle_cpu.select_idle_sibling.select_task_rq_fair
     31.19            -3.4       27.79        perf-profile.calltrace.cycles-pp.clock_nanosleep
      4.78            -3.3        1.45        perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.finish_task_switch
      4.81            -3.3        1.49        perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.finish_task_switch.__schedule
      3.99            -3.2        0.82 ±  9%  perf-profile.calltrace.cycles-pp.enqueue_task_fair.activate_task.ttwu_do_activate.try_to_wake_up.hrtimer_wakeup
      4.00            -3.2        0.84 ±  9%  perf-profile.calltrace.cycles-pp.activate_task.ttwu_do_activate.try_to_wake_up.hrtimer_wakeup.__hrtimer_run_queues
      4.09            -3.1        0.95 ±  8%  perf-profile.calltrace.cycles-pp.ttwu_do_activate.try_to_wake_up.hrtimer_wakeup.__hrtimer_run_queues.hrtimer_interrupt
      3.70            -3.0        0.67 ± 10%  perf-profile.calltrace.cycles-pp.enqueue_entity.enqueue_task_fair.activate_task.ttwu_do_activate.try_to_wake_up
      3.79            -3.0        0.78        perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.finish_task_switch.__schedule.schedule_idle.do_idle
      3.73            -3.0        0.73        perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.finish_task_switch.__schedule.schedule_idle
      2.35            -1.8        0.54        perf-profile.calltrace.cycles-pp.asm_sysvec_call_function_single.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle
      1.74            -0.8        0.96 ±  2%  perf-profile.calltrace.cycles-pp.update_load_avg.dequeue_entity.dequeue_task_fair.__schedule.schedule
      1.36 ±  3%      -0.6        0.81 ±  3%  perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.finish_task_switch.__schedule.schedule
      1.43 ±  3%      -0.5        0.89 ±  3%  perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.finish_task_switch.__schedule.schedule.do_nanosleep
      0.70 ±  3%      -0.1        0.56        perf-profile.calltrace.cycles-pp.do_sched_yield.__x64_sys_sched_yield.do_syscall_64.entry_SYSCALL_64_after_hwframe.__sched_yield
      1.15 ±  2%      -0.1        1.08        perf-profile.calltrace.cycles-pp.hrtimer_active.hrtimer_try_to_cancel.do_nanosleep.hrtimer_nanosleep.common_nsleep
      0.43 ± 44%      +0.2        0.62        perf-profile.calltrace.cycles-pp.poll_idle.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle
      2.43 ±  2%      +0.3        2.71        perf-profile.calltrace.cycles-pp.finish_task_switch.__schedule.schedule.do_nanosleep.hrtimer_nanosleep
      2.16            +0.3        2.48        perf-profile.calltrace.cycles-pp.restore_fpregs_from_fpstate.switch_fpu_return.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64
      1.26            +0.5        1.76        perf-profile.calltrace.cycles-pp.sched_mm_cid_migrate_to.activate_task.ttwu_do_activate.sched_ttwu_pending.__flush_smp_call_function_queue
      0.00            +0.5        0.51        perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.poll_idle.cpuidle_enter_state
      0.00            +0.5        0.53        perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.poll_idle.cpuidle_enter_state.cpuidle_enter
      0.00            +0.6        0.57 ±  2%  perf-profile.calltrace.cycles-pp.cpus_share_cache.select_idle_sibling.select_task_rq_fair.select_task_rq.try_to_wake_up
      0.00            +0.6        0.57        perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.poll_idle.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call
      0.00            +0.6        0.59        perf-profile.calltrace.cycles-pp.update_curr.pick_next_task_fair.__schedule.schedule.__x64_sys_sched_yield
      0.00            +0.7        0.66        perf-profile.calltrace.cycles-pp.prepare_task_switch.__schedule.schedule.do_nanosleep.hrtimer_nanosleep
      0.00            +0.7        0.68        perf-profile.calltrace.cycles-pp.update_curr.dequeue_entity.dequeue_task_fair.__schedule.schedule
      0.00            +0.7        0.68 ±  2%  perf-profile.calltrace.cycles-pp.llist_reverse_order.__flush_smp_call_function_queue.flush_smp_call_function_queue.do_idle.cpu_startup_entry
      2.72            +0.7        3.41        perf-profile.calltrace.cycles-pp.switch_fpu_return.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe
      2.77            +0.7        3.48        perf-profile.calltrace.cycles-pp.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_nanosleep
      0.00            +0.7        0.74 ±  2%  perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.try_to_wake_up.hrtimer_wakeup.__hrtimer_run_queues.hrtimer_interrupt
      0.00            +0.7        0.75        perf-profile.calltrace.cycles-pp.__smp_call_single_queue.ttwu_queue_wakelist.try_to_wake_up.hrtimer_wakeup.__hrtimer_run_queues
      2.46            +0.8        3.23        perf-profile.calltrace.cycles-pp.__x64_sys_sched_yield.do_syscall_64.entry_SYSCALL_64_after_hwframe.__sched_yield
      2.80            +0.8        3.58        perf-profile.calltrace.cycles-pp.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_nanosleep
      0.29 ±100%      +0.8        1.09 ±  8%  perf-profile.calltrace.cycles-pp.queue_event.ordered_events__queue.process_simple.reader__read_event.perf_session__process_events
      0.29 ±100%      +0.8        1.10 ±  8%  perf-profile.calltrace.cycles-pp.ordered_events__queue.process_simple.reader__read_event.perf_session__process_events.record__finish_output
      0.29 ±100%      +0.8        1.10 ±  8%  perf-profile.calltrace.cycles-pp.process_simple.reader__read_event.perf_session__process_events.record__finish_output.__cmd_record
      0.30 ±100%      +0.8        1.11 ±  8%  perf-profile.calltrace.cycles-pp.__cmd_record
      0.30 ±100%      +0.8        1.11 ±  8%  perf-profile.calltrace.cycles-pp.record__finish_output.__cmd_record
      0.30 ±100%      +0.8        1.11 ±  8%  perf-profile.calltrace.cycles-pp.perf_session__process_events.record__finish_output.__cmd_record
      0.30 ±100%      +0.8        1.11 ±  8%  perf-profile.calltrace.cycles-pp.reader__read_event.perf_session__process_events.record__finish_output.__cmd_record
      1.70            +0.8        2.55        perf-profile.calltrace.cycles-pp.__schedule.schedule.__x64_sys_sched_yield.do_syscall_64.entry_SYSCALL_64_after_hwframe
      0.72 ±  3%      +0.9        1.59        perf-profile.calltrace.cycles-pp.shim_nanosleep_uint64
      2.57            +0.9        3.45        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__sched_yield
      1.73            +0.9        2.63        perf-profile.calltrace.cycles-pp.schedule.__x64_sys_sched_yield.do_syscall_64.entry_SYSCALL_64_after_hwframe.__sched_yield
      0.00            +0.9        0.90        perf-profile.calltrace.cycles-pp.__hrtimer_start_range_ns.hrtimer_start_range_ns.do_nanosleep.hrtimer_nanosleep.common_nsleep
      0.00            +0.9        0.93        perf-profile.calltrace.cycles-pp.tick_nohz_get_sleep_length.menu_select.cpuidle_idle_call.do_idle.cpu_startup_entry
      0.00            +0.9        0.95        perf-profile.calltrace.cycles-pp.ttwu_queue_wakelist.try_to_wake_up.hrtimer_wakeup.__hrtimer_run_queues.hrtimer_interrupt
      2.60            +0.9        3.54        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__sched_yield
      0.00            +1.0        0.98        perf-profile.calltrace.cycles-pp.__switch_to
      0.00            +1.0        1.00 ±  4%  perf-profile.calltrace.cycles-pp.sem_post@@GLIBC_2.2.5
      0.62 ±  4%      +1.0        1.65 ±  2%  perf-profile.calltrace.cycles-pp.sem_getvalue@@GLIBC_2.2.5
      0.00            +1.0        1.04        perf-profile.calltrace.cycles-pp.set_task_cpu.try_to_wake_up.hrtimer_wakeup.__hrtimer_run_queues.hrtimer_interrupt
      0.00            +1.1        1.05        perf-profile.calltrace.cycles-pp.pick_next_task_fair.__schedule.schedule.__x64_sys_sched_yield.do_syscall_64
      0.00            +1.1        1.13        perf-profile.calltrace.cycles-pp.prepare_task_switch.__schedule.schedule_idle.do_idle.cpu_startup_entry
      0.64 ±  2%      +1.2        1.81 ±  5%  perf-profile.calltrace.cycles-pp.menu_select.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary
      0.58            +1.2        1.77        perf-profile.calltrace.cycles-pp.__switch_to_asm
      0.86            +1.2        2.08        perf-profile.calltrace.cycles-pp.hrtimer_start_range_ns.do_nanosleep.hrtimer_nanosleep.common_nsleep.__x64_sys_clock_nanosleep
      1.00 ±  2%      +1.3        2.26 ±  8%  perf-profile.calltrace.cycles-pp.semaphore_posix_thrash
      2.98            +1.3        4.25        perf-profile.calltrace.cycles-pp.__sched_yield
      7.89            +1.9        9.77        perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state
      0.00            +1.9        1.94 ± 31%  perf-profile.calltrace.cycles-pp.update_sg_lb_stats.update_sd_lb_stats.find_busiest_group.load_balance.newidle_balance
      0.70 ±  2%      +2.0        2.67        perf-profile.calltrace.cycles-pp.switch_mm_irqs_off.__schedule.schedule_idle.do_idle.cpu_startup_entry
      7.93            +2.1       10.00        perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state.cpuidle_enter
      0.00            +2.1        2.14 ± 29%  perf-profile.calltrace.cycles-pp.update_sd_lb_stats.find_busiest_group.load_balance.newidle_balance.pick_next_task_fair
      0.00            +2.2        2.18 ± 29%  perf-profile.calltrace.cycles-pp.find_busiest_group.load_balance.newidle_balance.pick_next_task_fair.__schedule
      8.35            +2.3       10.66        perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call
      0.00            +2.4        2.41 ± 26%  perf-profile.calltrace.cycles-pp.load_balance.newidle_balance.pick_next_task_fair.__schedule.schedule
      0.00            +3.4        3.35 ± 18%  perf-profile.calltrace.cycles-pp.newidle_balance.pick_next_task_fair.__schedule.schedule.do_nanosleep
      0.00            +3.5        3.48 ± 17%  perf-profile.calltrace.cycles-pp.pick_next_task_fair.__schedule.schedule.do_nanosleep.hrtimer_nanosleep
      8.48            +3.6       12.08        perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle
      6.22 ±  3%     +13.2       19.41        perf-profile.calltrace.cycles-pp.intel_idle.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle
     17.85           +14.9       32.78        perf-profile.calltrace.cycles-pp.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry
     18.03           +15.6       33.59        perf-profile.calltrace.cycles-pp.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary
     18.91           +17.2       36.16        perf-profile.calltrace.cycles-pp.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
     29.17           -27.5        1.63 ±  4%  perf-profile.children.cycles-pp.update_cfs_group
     26.97           -22.3        4.65 ±  3%  perf-profile.children.cycles-pp.enqueue_task_fair
     28.59           -21.9        6.64        perf-profile.children.cycles-pp.activate_task
     28.80           -21.9        6.90        perf-profile.children.cycles-pp.ttwu_do_activate
     23.52           -19.7        3.81 ±  4%  perf-profile.children.cycles-pp.enqueue_entity
     27.63           -18.8        8.81        perf-profile.children.cycles-pp.flush_smp_call_function_queue
     24.39           -17.3        7.10        perf-profile.children.cycles-pp.sched_ttwu_pending
     24.82           -16.8        8.06        perf-profile.children.cycles-pp.__flush_smp_call_function_queue
     18.44           -15.1        3.31 ±  2%  perf-profile.children.cycles-pp.update_load_avg
     17.44           -13.5        3.93 ±  2%  perf-profile.children.cycles-pp.dequeue_task_fair
     36.97           -11.9       25.11 ±  2%  perf-profile.children.cycles-pp.__schedule
     14.08           -10.6        3.47 ±  2%  perf-profile.children.cycles-pp.dequeue_entity
     21.16            -8.7       12.42        perf-profile.children.cycles-pp.try_to_wake_up
     21.12            -8.7       12.43        perf-profile.children.cycles-pp.hrtimer_wakeup
     21.83            -8.2       13.59        perf-profile.children.cycles-pp.__hrtimer_run_queues
     22.20            -7.8       14.40        perf-profile.children.cycles-pp.hrtimer_interrupt
     23.28            -7.7       15.54        perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
     11.63 ±  2%      -7.7        3.95 ±  2%  perf-profile.children.cycles-pp.select_idle_cpu
     22.32            -7.6       14.74        perf-profile.children.cycles-pp.__sysvec_apic_timer_interrupt
     23.88            -7.1       16.75        perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
     12.80 ±  2%      -7.0        5.74 ±  2%  perf-profile.children.cycles-pp.select_idle_sibling
     12.88 ±  2%      -7.0        5.92 ±  2%  perf-profile.children.cycles-pp.select_task_rq_fair
     10.67 ±  2%      -6.9        3.72 ±  2%  perf-profile.children.cycles-pp.select_idle_core
     24.50            -6.9       17.56 ±  3%  perf-profile.children.cycles-pp.schedule
     12.90 ±  2%      -6.9        6.00 ±  2%  perf-profile.children.cycles-pp.select_task_rq
     25.06            -6.4       18.65 ±  3%  perf-profile.children.cycles-pp.do_nanosleep
     25.39            -6.1       19.28 ±  3%  perf-profile.children.cycles-pp.hrtimer_nanosleep
     25.46            -6.0       19.47 ±  3%  perf-profile.children.cycles-pp.common_nsleep
     26.16            -5.8       20.33 ±  2%  perf-profile.children.cycles-pp.__x64_sys_clock_nanosleep
     60.29            -4.9       55.35        perf-profile.children.cycles-pp.do_idle
     60.32            -4.9       55.46        perf-profile.children.cycles-pp.secondary_startup_64_no_verify
     60.32            -4.9       55.46        perf-profile.children.cycles-pp.cpu_startup_entry
     60.06            -4.8       55.29        perf-profile.children.cycles-pp.start_secondary
     12.76            -4.6        8.14        perf-profile.children.cycles-pp.schedule_idle
      7.42 ±  3%      -4.5        2.97 ±  2%  perf-profile.children.cycles-pp.available_idle_cpu
      5.20            -4.5        0.75        perf-profile.children.cycles-pp.set_next_entity
      4.99            -4.2        0.79        perf-profile.children.cycles-pp.__sysvec_call_function_single
      5.01            -4.2        0.84        perf-profile.children.cycles-pp.sysvec_call_function_single
      5.12            -4.1        0.98        perf-profile.children.cycles-pp.asm_sysvec_call_function_single
     31.91            -4.1       27.82        perf-profile.children.cycles-pp.do_syscall_64
     32.32            -4.0       28.30        perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
      7.73            -3.8        3.98        perf-profile.children.cycles-pp.finish_task_switch
     31.29            -3.1       28.16        perf-profile.children.cycles-pp.clock_nanosleep
      1.51 ±  5%      -1.4        0.16 ±  2%  perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
      0.73 ±  6%      -0.5        0.25 ±  3%  perf-profile.children.cycles-pp.__do_softirq
      0.88 ±  5%      -0.4        0.46 ±  2%  perf-profile.children.cycles-pp.__irq_exit_rcu
      0.68            -0.4        0.32        perf-profile.children.cycles-pp._find_next_bit
      2.55 ±  2%      -0.3        2.26 ±  3%  perf-profile.children.cycles-pp._raw_spin_lock
      0.26 ± 24%      -0.2        0.09 ±  5%  perf-profile.children.cycles-pp.__update_idle_core
      0.26 ± 24%      -0.1        0.12 ±  4%  perf-profile.children.cycles-pp.pick_next_task_idle
      0.71 ±  3%      -0.1        0.58        perf-profile.children.cycles-pp.do_sched_yield
      0.43 ±  3%      -0.1        0.32 ±  3%  perf-profile.children.cycles-pp.__bitmap_andnot
      0.27 ± 15%      -0.1        0.17 ± 56%  perf-profile.children.cycles-pp.x86_64_start_kernel
      0.27 ± 15%      -0.1        0.17 ± 56%  perf-profile.children.cycles-pp.x86_64_start_reservations
      0.27 ± 15%      -0.1        0.17 ± 56%  perf-profile.children.cycles-pp.start_kernel
      0.27 ± 15%      -0.1        0.17 ± 56%  perf-profile.children.cycles-pp.arch_call_rest_init
      0.27 ± 15%      -0.1        0.17 ± 56%  perf-profile.children.cycles-pp.rest_init
      1.16 ±  2%      -0.1        1.10        perf-profile.children.cycles-pp.hrtimer_active
      0.13 ±  7%      -0.0        0.10 ± 11%  perf-profile.children.cycles-pp.do_futex
      0.14 ±  8%      -0.0        0.11 ±  9%  perf-profile.children.cycles-pp.__x64_sys_futex
      0.20 ±  3%      -0.0        0.17 ±  2%  perf-profile.children.cycles-pp.yield_task_fair
      0.09 ±  5%      +0.0        0.11 ±  4%  perf-profile.children.cycles-pp.update_irq_load_avg
      0.15 ±  3%      +0.0        0.18 ±  4%  perf-profile.children.cycles-pp.check_preempt_curr
      0.10 ±  9%      +0.0        0.14 ±  4%  perf-profile.children.cycles-pp.clock_gettime
      0.34 ±  2%      +0.0        0.38        perf-profile.children.cycles-pp.nohz_run_idle_balance
      0.14 ±  2%      +0.0        0.19 ±  4%  perf-profile.children.cycles-pp.attach_entity_load_avg
      0.00            +0.1        0.05        perf-profile.children.cycles-pp.perf_trace_run_bpf_submit
      0.00            +0.1        0.05        perf-profile.children.cycles-pp.cgroup_rstat_updated
      0.00            +0.1        0.05        perf-profile.children.cycles-pp.ct_kernel_exit
      0.03 ± 70%      +0.1        0.09 ±  5%  perf-profile.children.cycles-pp.put_prev_entity
      0.06 ±  6%      +0.1        0.11 ±  3%  perf-profile.children.cycles-pp.rb_insert_color
      0.06 ±  6%      +0.1        0.11 ±  3%  perf-profile.children.cycles-pp.entity_eligible
      0.00            +0.1        0.06 ±  9%  perf-profile.children.cycles-pp.perf_swevent_event
      0.00            +0.1        0.06 ±  8%  perf-profile.children.cycles-pp.__update_load_avg_blocked_se
      0.00            +0.1        0.06 ±  6%  perf-profile.children.cycles-pp.tick_nohz_stop_idle
      0.00            +0.1        0.06 ±  6%  perf-profile.children.cycles-pp.mm_cid_get
      0.00            +0.1        0.06 ±  6%  perf-profile.children.cycles-pp.menu_reflect
      0.05 ±  7%      +0.1        0.11        perf-profile.children.cycles-pp.perf_trace_buf_update
      0.12 ± 11%      +0.1        0.18 ±  2%  perf-profile.children.cycles-pp.remove_entity_load_avg
      0.31 ±  6%      +0.1        0.37 ±  2%  perf-profile.children.cycles-pp.scheduler_tick
      0.08            +0.1        0.14        perf-profile.children.cycles-pp._raw_spin_lock_irq
      0.00            +0.1        0.06 ±  6%  perf-profile.children.cycles-pp.irqentry_enter
      0.08            +0.1        0.14 ±  2%  perf-profile.children.cycles-pp.rcu_note_context_switch
      0.00            +0.1        0.06 ± 11%  perf-profile.children.cycles-pp.pm_qos_read_value
      0.10 ± 16%      +0.1        0.17 ±  2%  perf-profile.children.cycles-pp.hrtimer_get_next_event
      0.06            +0.1        0.13 ±  8%  perf-profile.children.cycles-pp.__cgroup_account_cputime
      0.08 ± 14%      +0.1        0.14 ±  7%  perf-profile.children.cycles-pp.stress_mwc1
      0.00            +0.1        0.07        perf-profile.children.cycles-pp.tsc_verify_tsc_adjust
      0.00            +0.1        0.07        perf-profile.children.cycles-pp.hrtimer_update_next_event
      0.00            +0.1        0.07 ±  5%  perf-profile.children.cycles-pp.tracing_gen_ctx_irq_test
      0.00            +0.1        0.07 ±  5%  perf-profile.children.cycles-pp.entry_SYSCALL_64_safe_stack
      0.34 ±  6%      +0.1        0.41 ±  3%  perf-profile.children.cycles-pp.update_process_times
      0.34 ±  6%      +0.1        0.42 ±  2%  perf-profile.children.cycles-pp.tick_sched_handle
      0.05 ±  8%      +0.1        0.13        perf-profile.children.cycles-pp.ktime_get_update_offsets_now
      0.00            +0.1        0.08 ±  4%  perf-profile.children.cycles-pp.rb_next
      0.00            +0.1        0.08 ±  4%  perf-profile.children.cycles-pp.error_entry
      0.00            +0.1        0.08 ±  4%  perf-profile.children.cycles-pp.tick_nohz_tick_stopped
      0.00            +0.1        0.08        perf-profile.children.cycles-pp.save_fpregs_to_fpstate
      0.00            +0.1        0.08        perf-profile.children.cycles-pp.arch_cpu_idle_enter
      0.00            +0.1        0.08        perf-profile.children.cycles-pp.perf_trace_buf_alloc
      0.09 ±  5%      +0.1        0.17 ±  6%  perf-profile.children.cycles-pp.perf_trace_sched_stat_runtime
      0.00            +0.1        0.08 ±  4%  perf-profile.children.cycles-pp.perf_exclude_event
      0.01 ±223%      +0.1        0.09 ±  6%  perf-profile.children.cycles-pp.__list_del_entry_valid
      0.08 ±  6%      +0.1        0.16 ±  3%  perf-profile.children.cycles-pp.__intel_pmu_enable_all
      0.00            +0.1        0.09 ±  5%  perf-profile.children.cycles-pp.put_prev_task_fair
      0.00            +0.1        0.09 ±  4%  perf-profile.children.cycles-pp.sched_clock_noinstr
      0.00            +0.1        0.09 ±  5%  perf-profile.children.cycles-pp.syscall_exit_to_user_mode_prepare
      0.14 ±  3%      +0.1        0.23        perf-profile.children.cycles-pp.get_nohz_timer_target
      0.00            +0.1        0.10 ±  5%  perf-profile.children.cycles-pp.rb_erase
      0.00            +0.1        0.10 ± 80%  perf-profile.children.cycles-pp.get_cpu_device
      0.00            +0.1        0.10 ±  5%  perf-profile.children.cycles-pp.__list_add_valid
      0.36 ±  6%      +0.1        0.46 ±  3%  perf-profile.children.cycles-pp.tick_sched_timer
      0.07 ±  6%      +0.1        0.18 ±  5%  perf-profile.children.cycles-pp.syscall_enter_from_user_mode
      0.07            +0.1        0.18 ±  2%  perf-profile.children.cycles-pp.__dequeue_entity
      0.06            +0.1        0.17 ±  2%  perf-profile.children.cycles-pp.call_cpuidle
      0.00            +0.1        0.12 ±  4%  perf-profile.children.cycles-pp.perf_trace_sched_switch
      0.14 ±  3%      +0.1        0.27 ±  2%  perf-profile.children.cycles-pp.update_min_vruntime
      0.32            +0.1        0.45 ±  2%  perf-profile.children.cycles-pp.llist_add_batch
      0.05 ±  8%      +0.1        0.18 ±  3%  perf-profile.children.cycles-pp.hrtimer_reprogram
      0.09            +0.1        0.22 ±  3%  perf-profile.children.cycles-pp.irqtime_account_irq
      0.52 ±  2%      +0.1        0.65        perf-profile.children.cycles-pp.poll_idle
      0.00            +0.1        0.14 ± 37%  perf-profile.children.cycles-pp.cpu_util
      0.01 ±223%      +0.1        0.14 ±  3%  perf-profile.children.cycles-pp._raw_spin_unlock_irqrestore
      0.11 ± 12%      +0.1        0.25 ±  4%  perf-profile.children.cycles-pp.avg_vruntime
      0.10 ±  3%      +0.1        0.24        perf-profile.children.cycles-pp.__calc_delta
      0.00            +0.1        0.14 ± 21%  perf-profile.children.cycles-pp._find_next_and_bit
      0.14 ±  2%      +0.1        0.29 ±  2%  perf-profile.children.cycles-pp.perf_event_task_tick
      0.14            +0.1        0.29 ±  2%  perf-profile.children.cycles-pp.perf_adjust_freq_unthr_context
      0.13 ±  2%      +0.1        0.28        perf-profile.children.cycles-pp.timerqueue_add
      0.21 ±  2%      +0.1        0.36 ±  2%  perf-profile.children.cycles-pp.perf_tp_event
      0.08 ± 13%      +0.2        0.23 ±  9%  perf-profile.children.cycles-pp.sem_getvalue@plt
      0.07 ±  6%      +0.2        0.23 ± 14%  perf-profile.children.cycles-pp.__enqueue_entity
      0.00            +0.2        0.16 ±  4%  perf-profile.children.cycles-pp.raw_spin_rq_lock_nested
      0.05            +0.2        0.21 ±  2%  perf-profile.children.cycles-pp.__hrtimer_init
      0.00            +0.2        0.16 ±  3%  perf-profile.children.cycles-pp.ct_kernel_exit_state
      0.15 ±  4%      +0.2        0.32        perf-profile.children.cycles-pp.enqueue_hrtimer
      0.08 ±  4%      +0.2        0.24 ±  3%  perf-profile.children.cycles-pp.hrtimer_init_sleeper
      0.09 ±  5%      +0.2        0.26 ±  3%  perf-profile.children.cycles-pp.__hrtimer_next_event_base
      0.45 ± 17%      +0.2        0.62        perf-profile.children.cycles-pp.native_irq_return_iret
      0.05            +0.2        0.22 ±  2%  perf-profile.children.cycles-pp.__rdgsbase_inactive
      0.00            +0.2        0.18 ±  2%  perf-profile.children.cycles-pp.ct_kernel_enter
      0.05            +0.2        0.23 ± 34%  perf-profile.children.cycles-pp.cpuidle_governor_latency_req
      0.26 ±  5%      +0.2        0.45 ±  5%  perf-profile.children.cycles-pp.update_blocked_averages
      0.00            +0.2        0.20 ±  2%  perf-profile.children.cycles-pp.tick_irq_enter
      0.11 ±  6%      +0.2        0.30 ±  3%  perf-profile.children.cycles-pp.native_apic_msr_eoi_write
      0.14 ±  3%      +0.2        0.34 ±  2%  perf-profile.children.cycles-pp.reweight_entity
      0.01 ±223%      +0.2        0.21 ±  3%  perf-profile.children.cycles-pp.irq_enter_rcu
      0.13 ±  8%      +0.2        0.33 ±  8%  perf-profile.children.cycles-pp.place_entity
      0.19 ±  3%      +0.2        0.39        perf-profile.children.cycles-pp.pick_eevdf
      0.14 ± 11%      +0.2        0.34 ±  2%  perf-profile.children.cycles-pp.get_next_timer_interrupt
      0.00            +0.2        0.21 ±  2%  perf-profile.children.cycles-pp.ct_idle_exit
      0.10 ±  4%      +0.2        0.32        perf-profile.children.cycles-pp.update_entity_lag
      0.09 ±  5%      +0.2        0.31 ± 23%  perf-profile.children.cycles-pp.idle_cpu
      0.00            +0.2        0.22 ±  4%  perf-profile.children.cycles-pp.syscall_return_via_sysret
      0.34            +0.2        0.58        perf-profile.children.cycles-pp.perf_trace_sched_wakeup_template
      0.13 ±  4%      +0.2        0.38 ±  2%  perf-profile.children.cycles-pp._copy_from_user
      0.00            +0.3        0.26 ±  2%  perf-profile.children.cycles-pp.local_clock_noinstr
      0.12 ±  4%      +0.3        0.38        perf-profile.children.cycles-pp.timerqueue_del
      0.37            +0.3        0.63 ±  2%  perf-profile.children.cycles-pp.update_rq_clock_task
      0.10 ±  6%      +0.3        0.37 ±  2%  perf-profile.children.cycles-pp.hrtimer_next_event_without
      0.21 ±  8%      +0.3        0.49 ±  2%  perf-profile.children.cycles-pp.tick_nohz_next_event
      0.07 ±  5%      +0.3        0.36 ±  2%  perf-profile.children.cycles-pp.tick_nohz_idle_exit
      0.21 ±  9%      +0.3        0.52 ±  5%  perf-profile.children.cycles-pp.__nanosleep
      0.16 ±  3%      +0.3        0.46        perf-profile.children.cycles-pp.get_timespec64
      0.08 ±  4%      +0.3        0.39        perf-profile.children.cycles-pp.__wrgsbase_inactive
      0.19            +0.3        0.50        perf-profile.children.cycles-pp.__update_load_avg_se
      0.09 ±  4%      +0.3        0.40        perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack
      2.18            +0.3        2.50        perf-profile.children.cycles-pp.restore_fpregs_from_fpstate
      1.61            +0.3        1.93        perf-profile.children.cycles-pp.sched_mm_cid_migrate_to
      0.19 ±  7%      +0.3        0.53 ±  2%  perf-profile.children.cycles-pp.tick_nohz_idle_enter
      0.12 ±  4%      +0.3        0.46 ±  2%  perf-profile.children.cycles-pp.os_xsave
      0.30 ±  3%      +0.5        0.76 ±  2%  perf-profile.children.cycles-pp.llist_reverse_order
      0.67 ±  9%      +0.5        1.14 ±  7%  perf-profile.children.cycles-pp.__cmd_record
      0.25 ±  2%      +0.5        0.72 ±  2%  perf-profile.children.cycles-pp.lapic_next_deadline
      0.27 ±  2%      +0.5        0.75        perf-profile.children.cycles-pp.call_function_single_prep_ipi
      0.44 ±  2%      +0.5        0.93        perf-profile.children.cycles-pp.__hrtimer_start_range_ns
      0.23 ±  4%      +0.5        0.72 ±  3%  perf-profile.children.cycles-pp.___perf_sw_event
      0.44 ±  2%      +0.5        0.94 ±  5%  perf-profile.children.cycles-pp.__update_load_avg_cfs_rq
      0.53 ± 11%      +0.6        1.10 ±  8%  perf-profile.children.cycles-pp.process_simple
      0.52 ± 11%      +0.6        1.10 ±  8%  perf-profile.children.cycles-pp.ordered_events__queue
      0.54 ± 10%      +0.6        1.11 ±  8%  perf-profile.children.cycles-pp.record__finish_output
      0.54 ± 10%      +0.6        1.11 ±  8%  perf-profile.children.cycles-pp.perf_session__process_events
      0.54 ± 10%      +0.6        1.11 ±  8%  perf-profile.children.cycles-pp.reader__read_event
      0.52 ± 11%      +0.6        1.10 ±  8%  perf-profile.children.cycles-pp.queue_event
      0.44 ±  2%      +0.6        1.02 ±  4%  perf-profile.children.cycles-pp.sem_post@@GLIBC_2.2.5
      0.38            +0.6        0.95        perf-profile.children.cycles-pp.clockevents_program_event
      0.27 ±  3%      +0.6        0.85 ±  2%  perf-profile.children.cycles-pp.cpus_share_cache
      0.17 ±  4%      +0.6        0.77 ±  2%  perf-profile.children.cycles-pp.read_tsc
      0.34 ±  4%      +0.6        0.95        perf-profile.children.cycles-pp.tick_nohz_get_sleep_length
      0.22 ±  2%      +0.6        0.82 ±  2%  perf-profile.children.cycles-pp.__entry_text_start
      0.26            +0.6        0.89        perf-profile.children.cycles-pp.update_rq_clock
      0.60            +0.6        1.23        perf-profile.children.cycles-pp.__smp_call_single_queue
      0.81            +0.7        1.46        perf-profile.children.cycles-pp.update_curr
      2.76            +0.7        3.44        perf-profile.children.cycles-pp.switch_fpu_return
      2.86            +0.7        3.58        perf-profile.children.cycles-pp.exit_to_user_mode_prepare
      0.82 ±  2%      +0.7        1.55        perf-profile.children.cycles-pp.set_task_cpu
      0.83            +0.7        1.56        perf-profile.children.cycles-pp.ttwu_queue_wakelist
      2.46            +0.8        3.24        perf-profile.children.cycles-pp.__x64_sys_sched_yield
      0.24 ±  2%      +0.8        1.03        perf-profile.children.cycles-pp.ktime_get
      2.88            +0.8        3.72        perf-profile.children.cycles-pp.syscall_exit_to_user_mode
      0.74 ±  3%      +0.9        1.63        perf-profile.children.cycles-pp.shim_nanosleep_uint64
      0.25 ±  4%      +0.9        1.19 ±  2%  perf-profile.children.cycles-pp.sched_clock
      0.26 ±  4%      +1.0        1.28        perf-profile.children.cycles-pp.native_sched_clock
      0.63 ±  4%      +1.0        1.67        perf-profile.children.cycles-pp.sem_getvalue@@GLIBC_2.2.5
      0.58            +1.1        1.66        perf-profile.children.cycles-pp._raw_spin_lock_irqsave
      0.29 ±  3%      +1.1        1.38        perf-profile.children.cycles-pp.sched_clock_cpu
      0.78            +1.1        1.88        perf-profile.children.cycles-pp.__switch_to
      0.66 ±  2%      +1.2        1.84 ±  5%  perf-profile.children.cycles-pp.menu_select
      0.66 ±  2%      +1.2        1.85        perf-profile.children.cycles-pp.prepare_task_switch
      0.88            +1.2        2.10        perf-profile.children.cycles-pp.hrtimer_start_range_ns
      1.03 ±  2%      +1.4        2.38 ±  7%  perf-profile.children.cycles-pp.semaphore_posix_thrash
      0.90            +1.4        2.29        perf-profile.children.cycles-pp.__switch_to_asm
      3.06            +1.5        4.58        perf-profile.children.cycles-pp.__sched_yield
      0.99            +1.9        2.86        perf-profile.children.cycles-pp.switch_mm_irqs_off
      0.00            +2.0        2.02 ± 30%  perf-profile.children.cycles-pp.update_sg_lb_stats
      0.00            +2.2        2.19 ± 29%  perf-profile.children.cycles-pp.update_sd_lb_stats
      0.00            +2.2        2.23 ± 28%  perf-profile.children.cycles-pp.find_busiest_group
      0.07 ±  7%      +2.4        2.47 ± 25%  perf-profile.children.cycles-pp.load_balance
      0.06 ±  8%      +3.3        3.38 ± 18%  perf-profile.children.cycles-pp.newidle_balance
      6.26 ±  3%     +13.2       19.48        perf-profile.children.cycles-pp.intel_idle
     18.09           +15.6       33.68        perf-profile.children.cycles-pp.cpuidle_enter_state
     18.12           +15.6       33.72        perf-profile.children.cycles-pp.cpuidle_enter
     19.00           +17.3       36.32        perf-profile.children.cycles-pp.cpuidle_idle_call
     29.16           -27.6        1.61 ±  4%  perf-profile.self.cycles-pp.update_cfs_group
     17.63           -16.1        1.52 ±  2%  perf-profile.self.cycles-pp.update_load_avg
      7.40 ±  3%      -4.5        2.94 ±  2%  perf-profile.self.cycles-pp.available_idle_cpu
      3.28 ±  2%      -2.5        0.81 ±  2%  perf-profile.self.cycles-pp.select_idle_core
      1.51 ±  5%      -1.4        0.16 ±  2%  perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
      0.62            -0.3        0.28 ±  2%  perf-profile.self.cycles-pp._find_next_bit
      0.38 ±  7%      -0.3        0.13 ±  2%  perf-profile.self.cycles-pp.select_idle_cpu
      0.38 ±  3%      -0.2        0.24 ±  2%  perf-profile.self.cycles-pp.migrate_task_rq_fair
      0.21 ± 31%      -0.1        0.08 ±  6%  perf-profile.self.cycles-pp.__update_idle_core
      0.42 ±  3%      -0.1        0.31 ±  3%  perf-profile.self.cycles-pp.__bitmap_andnot
      0.19 ±  6%      -0.1        0.09 ±  4%  perf-profile.self.cycles-pp.__update_blocked_fair
      0.44 ±  2%      -0.1        0.37        perf-profile.self.cycles-pp.__x64_sys_clock_nanosleep
      0.05            +0.0        0.06        perf-profile.self.cycles-pp.clockevents_program_event
      0.08            +0.0        0.09 ±  4%  perf-profile.self.cycles-pp.perf_trace_sched_wakeup_template
      0.09            +0.0        0.10 ±  4%  perf-profile.self.cycles-pp.update_irq_load_avg
      0.12 ±  3%      +0.0        0.14 ±  2%  perf-profile.self.cycles-pp.__hrtimer_run_queues
      0.17 ±  2%      +0.0        0.20 ±  2%  perf-profile.self.cycles-pp.ttwu_queue_wakelist
      0.14 ±  2%      +0.0        0.19 ±  2%  perf-profile.self.cycles-pp.attach_entity_load_avg
      0.08 ±  6%      +0.0        0.12 ±  3%  perf-profile.self.cycles-pp.ttwu_do_activate
      0.08 ±  6%      +0.0        0.12 ±  3%  perf-profile.self.cycles-pp.yield_task_fair
      0.06 ±  8%      +0.0        0.10 ±  4%  perf-profile.self.cycles-pp.rb_insert_color
      0.08            +0.0        0.13 ±  2%  perf-profile.self.cycles-pp.rcu_note_context_switch
      0.05 ±  8%      +0.0        0.10 ±  3%  perf-profile.self.cycles-pp.entity_eligible
      0.00            +0.1        0.05        perf-profile.self.cycles-pp.__update_load_avg_blocked_se
      0.00            +0.1        0.05 ±  7%  perf-profile.self.cycles-pp.mm_cid_get
      0.09 ±  4%      +0.1        0.14 ±  4%  perf-profile.self.cycles-pp.__hrtimer_start_range_ns
      0.08            +0.1        0.13 ±  3%  perf-profile.self.cycles-pp._raw_spin_lock_irq
      0.00            +0.1        0.05 ±  8%  perf-profile.self.cycles-pp.update_blocked_averages
      0.00            +0.1        0.06 ±  9%  perf-profile.self.cycles-pp.tsc_verify_tsc_adjust
      0.00            +0.1        0.06 ±  9%  perf-profile.self.cycles-pp.pm_qos_read_value
      0.00            +0.1        0.06 ±  8%  perf-profile.self.cycles-pp.hrtimer_try_to_cancel
      0.00            +0.1        0.06 ±  6%  perf-profile.self.cycles-pp.ktime_get_update_offsets_now
      0.00            +0.1        0.06        perf-profile.self.cycles-pp.activate_task
      0.00            +0.1        0.06        perf-profile.self.cycles-pp.tick_nohz_get_sleep_length
      0.00            +0.1        0.06        perf-profile.self.cycles-pp.hrtimer_next_event_without
      0.00            +0.1        0.06        perf-profile.self.cycles-pp.irqtime_account_irq
      0.00            +0.1        0.06        perf-profile.self.cycles-pp.save_fpregs_to_fpstate
      0.95            +0.1        1.01        perf-profile.self.cycles-pp.hrtimer_active
      0.06 ± 14%      +0.1        0.12 ±  8%  perf-profile.self.cycles-pp.stress_mwc1
      0.00            +0.1        0.06 ±  6%  perf-profile.self.cycles-pp.perf_exclude_event
      0.42            +0.1        0.48        perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
      0.05            +0.1        0.11 ±  4%  perf-profile.self.cycles-pp.set_next_entity
      0.00            +0.1        0.06 ±  7%  perf-profile.self.cycles-pp.__sysvec_apic_timer_interrupt
      0.00            +0.1        0.06 ±  7%  perf-profile.self.cycles-pp.rb_next
      0.00            +0.1        0.06 ±  7%  perf-profile.self.cycles-pp.tick_nohz_tick_stopped
      0.00            +0.1        0.07 ±  7%  perf-profile.self.cycles-pp.tick_nohz_idle_enter
      0.00            +0.1        0.07 ±  7%  perf-profile.self.cycles-pp.check_preempt_curr
      0.06 ±  9%      +0.1        0.13 ±  2%  perf-profile.self.cycles-pp.perf_adjust_freq_unthr_context
      0.00            +0.1        0.07 ±  5%  perf-profile.self.cycles-pp.syscall_exit_to_user_mode_prepare
      0.00            +0.1        0.07 ±  5%  perf-profile.self.cycles-pp.tracing_gen_ctx_irq_test
      0.00            +0.1        0.07        perf-profile.self.cycles-pp.asm_sysvec_apic_timer_interrupt
      0.00            +0.1        0.07 ±  8%  perf-profile.self.cycles-pp.hrtimer_interrupt
      0.00            +0.1        0.07 ±  9%  perf-profile.self.cycles-pp.__list_del_entry_valid
      0.00            +0.1        0.07 ±  5%  perf-profile.self.cycles-pp.entry_SYSCALL_64_safe_stack
      0.09 ±  5%      +0.1        0.16 ±  5%  perf-profile.self.cycles-pp.perf_trace_sched_stat_runtime
      0.00            +0.1        0.08 ±  6%  perf-profile.self.cycles-pp.cpuidle_governor_latency_req
      0.00            +0.1        0.08 ± 12%  perf-profile.self.cycles-pp.__cgroup_account_cputime
      0.00            +0.1        0.08 ±  6%  perf-profile.self.cycles-pp.exit_to_user_mode_prepare
      0.00            +0.1        0.08 ±  6%  perf-profile.self.cycles-pp.error_entry
      0.00            +0.1        0.08 ±  4%  perf-profile.self.cycles-pp.sched_clock
      0.00            +0.1        0.08 ±  4%  perf-profile.self.cycles-pp.syscall_enter_from_user_mode
      0.05            +0.1        0.13 ±  4%  perf-profile.self.cycles-pp.__dequeue_entity
      0.00            +0.1        0.08        perf-profile.self.cycles-pp.select_task_rq
      0.00            +0.1        0.08        perf-profile.self.cycles-pp.do_sched_yield
      0.00            +0.1        0.08        perf-profile.self.cycles-pp.tick_nohz_next_event
      0.00            +0.1        0.08        perf-profile.self.cycles-pp.get_timespec64
      0.00            +0.1        0.08        perf-profile.self.cycles-pp.tick_nohz_idle_exit
      0.00            +0.1        0.08        perf-profile.self.cycles-pp.get_next_timer_interrupt
      0.00            +0.1        0.08 ±  7%  perf-profile.self.cycles-pp.__list_add_valid
      0.08 ±  6%      +0.1        0.16 ±  3%  perf-profile.self.cycles-pp.__intel_pmu_enable_all
      0.00            +0.1        0.08 ±  5%  perf-profile.self.cycles-pp.syscall_exit_to_user_mode
      0.16 ±  2%      +0.1        0.25 ±  2%  perf-profile.self.cycles-pp.perf_tp_event
      0.00            +0.1        0.08 ±  5%  perf-profile.self.cycles-pp.poll_idle
      0.00            +0.1        0.08 ±  5%  perf-profile.self.cycles-pp.rb_erase
      0.13 ±  3%      +0.1        0.22 ±  3%  perf-profile.self.cycles-pp.__flush_smp_call_function_queue
      0.14 ±  3%      +0.1        0.22 ±  2%  perf-profile.self.cycles-pp.get_nohz_timer_target
      0.00            +0.1        0.09 ±  5%  perf-profile.self.cycles-pp.ct_kernel_enter
      0.10 ±  5%      +0.1        0.19        perf-profile.self.cycles-pp.try_to_wake_up
      0.07 ±  6%      +0.1        0.17        perf-profile.self.cycles-pp.timerqueue_add
      0.02 ± 99%      +0.1        0.12 ±  3%  perf-profile.self.cycles-pp.update_entity_lag
      0.00            +0.1        0.10 ±  3%  perf-profile.self.cycles-pp.perf_trace_sched_switch
      0.28            +0.1        0.38 ±  4%  perf-profile.self.cycles-pp.dequeue_task_fair
      0.00            +0.1        0.10 ±  6%  perf-profile.self.cycles-pp.load_balance
      0.08 ±  4%      +0.1        0.18        perf-profile.self.cycles-pp.select_task_rq_fair
      0.14 ±  4%      +0.1        0.24 ±  2%  perf-profile.self.cycles-pp.update_min_vruntime
      0.04 ± 47%      +0.1        0.15 ± 12%  perf-profile.self.cycles-pp.place_entity
      0.00            +0.1        0.10 ±  4%  perf-profile.self.cycles-pp._raw_spin_unlock_irqrestore
      0.00            +0.1        0.11 ± 42%  perf-profile.self.cycles-pp.cpu_util
      0.10            +0.1        0.21 ±  4%  perf-profile.self.cycles-pp.update_rq_clock
      0.28 ±  3%      +0.1        0.40        perf-profile.self.cycles-pp.dequeue_entity
      0.07 ± 16%      +0.1        0.20 ±  7%  perf-profile.self.cycles-pp.sem_getvalue@plt
      0.32            +0.1        0.44 ±  2%  perf-profile.self.cycles-pp.llist_add_batch
      0.05 ±  7%      +0.1        0.18 ±  2%  perf-profile.self.cycles-pp.hrtimer_reprogram
      0.08 ± 10%      +0.1        0.21 ±  3%  perf-profile.self.cycles-pp.__entry_text_start
      0.06            +0.1        0.19        perf-profile.self.cycles-pp.common_nsleep
      0.09 ±  5%      +0.1        0.22        perf-profile.self.cycles-pp.__calc_delta
      0.00            +0.1        0.13 ± 23%  perf-profile.self.cycles-pp._find_next_and_bit
      0.07            +0.1        0.20 ± 16%  perf-profile.self.cycles-pp.__enqueue_entity
      0.09 ±  4%      +0.1        0.23 ±  4%  perf-profile.self.cycles-pp.avg_vruntime
      0.00            +0.1        0.14 ±  9%  perf-profile.self.cycles-pp.update_sd_lb_stats
      0.13 ±  5%      +0.1        0.27        perf-profile.self.cycles-pp.pick_eevdf
      0.00            +0.1        0.15 ±  2%  perf-profile.self.cycles-pp.cpu_startup_entry
      0.21 ±  2%      +0.2        0.36 ±  2%  perf-profile.self.cycles-pp.hrtimer_nanosleep
      0.00            +0.2        0.15 ±  4%  perf-profile.self.cycles-pp.ct_kernel_exit_state
      0.08 ±  5%      +0.2        0.24 ±  2%  perf-profile.self.cycles-pp.__hrtimer_next_event_base
      0.00            +0.2        0.16 ±  3%  perf-profile.self.cycles-pp.call_cpuidle
      0.06 ±  6%      +0.2        0.22 ±  3%  perf-profile.self.cycles-pp.do_syscall_64
      0.45 ± 17%      +0.2        0.62        perf-profile.self.cycles-pp.native_irq_return_iret
      0.08 ±  6%      +0.2        0.25 ±  3%  perf-profile.self.cycles-pp.timerqueue_del
      0.13 ±  3%      +0.2        0.31 ±  3%  perf-profile.self.cycles-pp.nohz_run_idle_balance
      0.00            +0.2        0.17 ±  2%  perf-profile.self.cycles-pp.sched_clock_cpu
      0.01 ±223%      +0.2        0.19 ±  2%  perf-profile.self.cycles-pp.__hrtimer_init
      0.03 ± 70%      +0.2        0.22 ±  2%  perf-profile.self.cycles-pp.__rdgsbase_inactive
      0.01 ±223%      +0.2        0.19 ±  3%  perf-profile.self.cycles-pp.schedule_idle
      0.17 ±  2%      +0.2        0.36 ±  2%  perf-profile.self.cycles-pp.schedule
      0.14            +0.2        0.33 ±  2%  perf-profile.self.cycles-pp.do_nanosleep
      0.11 ±  6%      +0.2        0.30 ±  2%  perf-profile.self.cycles-pp.native_apic_msr_eoi_write
      0.12 ±  4%      +0.2        0.32 ±  2%  perf-profile.self.cycles-pp.reweight_entity
      0.09 ±  5%      +0.2        0.30 ± 23%  perf-profile.self.cycles-pp.idle_cpu
      0.00            +0.2        0.22 ±  3%  perf-profile.self.cycles-pp.syscall_return_via_sysret
      0.12 ±  5%      +0.2        0.34 ±  2%  perf-profile.self.cycles-pp.hrtimer_start_range_ns
      0.11 ±  3%      +0.2        0.34 ±  2%  perf-profile.self.cycles-pp._copy_from_user
      0.09 ±  5%      +0.2        0.33 ±  2%  perf-profile.self.cycles-pp.ktime_get
      0.12            +0.2        0.37        perf-profile.self.cycles-pp.pick_next_task_fair
      0.09 ±  6%      +0.3        0.35 ±  2%  perf-profile.self.cycles-pp.__sched_yield
      0.28 ±  2%      +0.3        0.54 ±  2%  perf-profile.self.cycles-pp.update_rq_clock_task
      0.07 ±  7%      +0.3        0.33 ±  2%  perf-profile.self.cycles-pp.cpuidle_idle_call
      0.17 ±  2%      +0.3        0.45        perf-profile.self.cycles-pp.__update_load_avg_se
      0.19 ±  3%      +0.3        0.48        perf-profile.self.cycles-pp.sched_ttwu_pending
      0.34 ±  2%      +0.3        0.62        perf-profile.self.cycles-pp.flush_smp_call_function_queue
      0.19 ± 11%      +0.3        0.48 ±  5%  perf-profile.self.cycles-pp.__nanosleep
      0.08 ±  7%      +0.3        0.38        perf-profile.self.cycles-pp.__wrgsbase_inactive
      0.08 ±  5%      +0.3        0.39        perf-profile.self.cycles-pp.entry_SYSRETQ_unsafe_stack
      2.18            +0.3        2.50        perf-profile.self.cycles-pp.restore_fpregs_from_fpstate
      1.61            +0.3        1.93        perf-profile.self.cycles-pp.sched_mm_cid_migrate_to
      0.31 ±  2%      +0.3        0.64 ±  8%  perf-profile.self.cycles-pp.enqueue_entity
      0.11 ±  3%      +0.3        0.45        perf-profile.self.cycles-pp.os_xsave
      0.05 ±  8%      +0.3        0.40 ± 12%  perf-profile.self.cycles-pp.newidle_balance
      0.31            +0.3        0.66        perf-profile.self.cycles-pp.update_curr
      0.25 ±  4%      +0.4        0.61 ±  2%  perf-profile.self.cycles-pp.menu_select
      0.58            +0.4        0.94        perf-profile.self.cycles-pp.switch_fpu_return
      0.19 ±  5%      +0.4        0.59 ±  2%  perf-profile.self.cycles-pp.___perf_sw_event
      0.10 ±  6%      +0.4        0.54        perf-profile.self.cycles-pp.do_idle
      0.43            +0.4        0.86 ±  3%  perf-profile.self.cycles-pp.__update_load_avg_cfs_rq
      0.30 ±  4%      +0.5        0.75 ±  2%  perf-profile.self.cycles-pp.llist_reverse_order
      0.25 ±  2%      +0.5        0.72 ±  2%  perf-profile.self.cycles-pp.lapic_next_deadline
      0.27 ±  2%      +0.5        0.74        perf-profile.self.cycles-pp.call_function_single_prep_ipi
      0.37 ±  8%      +0.5        0.90 ±  3%  perf-profile.self.cycles-pp.clock_nanosleep
      0.21 ±  2%      +0.6        0.76 ±  6%  perf-profile.self.cycles-pp.enqueue_task_fair
      0.52 ± 10%      +0.6        1.08 ±  8%  perf-profile.self.cycles-pp.queue_event
      0.27 ±  3%      +0.6        0.84 ±  2%  perf-profile.self.cycles-pp.cpus_share_cache
      0.31 ±  2%      +0.6        0.90 ±  5%  perf-profile.self.cycles-pp.sem_post@@GLIBC_2.2.5
      0.17 ±  4%      +0.6        0.75        perf-profile.self.cycles-pp.read_tsc
      1.08            +0.7        1.83        perf-profile.self.cycles-pp.finish_task_switch
      0.25 ±  2%      +0.7        1.00 ±  2%  perf-profile.self.cycles-pp.set_task_cpu
      0.49 ±  2%      +0.8        1.32        perf-profile.self.cycles-pp.shim_nanosleep_uint64
      0.55            +0.9        1.44        perf-profile.self.cycles-pp.prepare_task_switch
      0.46 ±  4%      +0.9        1.37        perf-profile.self.cycles-pp.sem_getvalue@@GLIBC_2.2.5
      0.25 ±  4%      +1.0        1.23        perf-profile.self.cycles-pp.native_sched_clock
      0.25 ±  4%      +1.0        1.24        perf-profile.self.cycles-pp.cpuidle_enter_state
      1.03 ±  2%      +1.0        2.06 ±  3%  perf-profile.self.cycles-pp._raw_spin_lock
      0.57 ±  2%      +1.1        1.62        perf-profile.self.cycles-pp._raw_spin_lock_irqsave
      0.76            +1.1        1.83        perf-profile.self.cycles-pp.__switch_to
      0.84 ±  2%      +1.3        2.12 ±  8%  perf-profile.self.cycles-pp.semaphore_posix_thrash
      0.90            +1.4        2.28        perf-profile.self.cycles-pp.__switch_to_asm
      0.00            +1.5        1.55 ± 30%  perf-profile.self.cycles-pp.update_sg_lb_stats
      0.98            +1.9        2.83        perf-profile.self.cycles-pp.switch_mm_irqs_off
      1.97            +2.1        4.04        perf-profile.self.cycles-pp.__schedule
      6.26 ±  3%     +13.2       19.48        perf-profile.self.cycles-pp.intel_idle



***************************************************************************************************
lkp-spr-r02: 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480CTDX (Sapphire Rapids) with 256G memory
=========================================================================================
class/compiler/cpufreq_governor/kconfig/nr_threads/rootfs/sc_pid_max/tbox_group/test/testcase/testtime:
  scheduler/gcc-12/performance/x86_64-rhel-8.3/100%/debian-11.1-x86_64-20220510.cgz/4194304/lkp-spr-r02/switch/stress-ng/60s

commit: 
  63304558ba ("sched/eevdf: Curb wakeup-preemption")
  0a24d7afed ("sched/fair: ratelimit update to tg->load_avg")

63304558ba5dcaaf 0a24d7afed5c3c59ee212782f9c 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
 1.792e+08 ±  4%    +356.6%  8.181e+08        cpuidle..usage
   1257499 ± 25%     +97.2%    2479986 ±  4%  numa-numastat.node1.local_node
   1363779 ± 23%     +92.3%    2622537 ±  3%  numa-numastat.node1.numa_hit
      1520 ±  3%     -37.6%     949.17 ±  3%  perf-c2c.DRAM.remote
     83408 ±  2%     +32.9%     110872 ±  2%  perf-c2c.HITM.local
      1113 ±  3%     -28.4%     797.33 ±  3%  perf-c2c.HITM.remote
     84522 ±  2%     +32.1%     111670 ±  2%  perf-c2c.HITM.total
   6744974 ±  4%     +65.1%   11139304 ±  2%  vmstat.memory.cache
    182.17           +18.0%     215.00        vmstat.procs.r
   7208379 ±  4%    +394.4%   35639830        vmstat.system.cs
    904223 ±  2%    +192.2%    2641908        vmstat.system.in
     29.53            -4.7       24.87        mpstat.cpu.all.idle%
      7.99            -5.0        2.98        mpstat.cpu.all.irq%
      0.43 ±  5%      -0.2        0.25        mpstat.cpu.all.soft%
     56.25            +8.2       64.42        mpstat.cpu.all.sys%
      5.80            +1.7        7.47        mpstat.cpu.all.usr%
 1.415e+08 ±  5%    +422.1%  7.387e+08        stress-ng.switch.ops
   2357895 ±  5%    +422.1%   12310763        stress-ng.switch.ops_per_sec
    303844 ±  4%   +1122.7%    3715226        stress-ng.time.involuntary_context_switches
     12532           +14.7%      14376        stress-ng.time.percent_of_cpu_this_job_got
      7172           +12.8%       8090        stress-ng.time.system_time
    629.90           +35.9%     856.07        stress-ng.time.user_time
 2.732e+08 ±  5%    +403.8%  1.377e+09        stress-ng.time.voluntary_context_switches
    438651 ±  2%     +14.6%     502822        meminfo.AnonPages
   6551534 ±  4%     +66.5%   10905701 ±  2%  meminfo.Cached
   8926904 ±  3%     +49.2%   13315485        meminfo.Committed_AS
   4104936 ±  7%    +107.1%    8502046 ±  2%  meminfo.Inactive
   4104785 ±  7%    +107.1%    8501888 ±  2%  meminfo.Inactive(anon)
   1189315 ±  3%     +25.4%    1491641        meminfo.Mapped
  10181121 ±  2%     +43.5%   14606335        meminfo.Memused
   3807597 ±  7%    +114.4%    8161798 ±  2%  meminfo.Shmem
  10280871 ±  3%     +43.0%   14704729        meminfo.max_used_kB
    378940 ± 45%     +59.0%     602609 ±  9%  numa-vmstat.node0.nr_inactive_anon
    378938 ± 45%     +59.0%     602607 ±  9%  numa-vmstat.node0.nr_zone_inactive_anon
    638178 ± 26%    +139.5%    1528430 ±  3%  numa-vmstat.node1.nr_file_pages
    648468 ± 27%    +135.1%    1524245 ±  2%  numa-vmstat.node1.nr_inactive_anon
    188166 ±  6%     +39.7%     262869        numa-vmstat.node1.nr_mapped
    615278 ± 29%    +144.7%    1505431 ±  2%  numa-vmstat.node1.nr_shmem
    648464 ± 27%    +135.1%    1524241 ±  2%  numa-vmstat.node1.nr_zone_inactive_anon
   1363943 ± 23%     +92.3%    2622303 ±  3%  numa-vmstat.node1.numa_hit
   1257663 ± 25%     +97.2%    2479754 ±  4%  numa-vmstat.node1.numa_local
    277293 ± 20%     +99.2%     552340 ±  9%  numa-meminfo.node0.AnonPages.max
   1515282 ± 45%     +59.1%    2410064 ±  9%  numa-meminfo.node0.Inactive
   1515280 ± 45%     +59.0%    2410010 ±  9%  numa-meminfo.node0.Inactive(anon)
   5921261 ± 11%     +14.3%    6767795 ±  2%  numa-meminfo.node0.MemUsed
   2551274 ± 26%    +139.6%    6112583 ±  3%  numa-meminfo.node1.FilePages
   2592782 ± 27%    +135.1%    6095995 ±  2%  numa-meminfo.node1.Inactive
   2592634 ± 27%    +135.1%    6095892 ±  2%  numa-meminfo.node1.Inactive(anon)
    751876 ±  6%     +39.7%    1049995        numa-meminfo.node1.Mapped
   4262948 ± 16%     +84.0%    7842823 ±  2%  numa-meminfo.node1.MemUsed
   2459674 ± 29%    +144.8%    6020586 ±  2%  numa-meminfo.node1.Shmem
  20460839 ± 21%   +2277.8%  4.865e+08        turbostat.C1
      1.38 ± 17%      +6.6        8.02        turbostat.C1%
 1.557e+08 ±  2%     -94.5%    8505821 ±  2%  turbostat.C1E
     14.93           -11.8        3.09 ±  7%  turbostat.C1E%
     18.32           -31.4%      12.56 ±  3%  turbostat.CPU%c1
      0.08 ±  6%    +278.3%       0.29        turbostat.IPC
  58972223 ±  3%    +193.9%  1.733e+08        turbostat.IRQ
   2614421 ±  3%  +12238.7%  3.226e+08        turbostat.POLL
      0.06 ±  6%      +4.5        4.53        turbostat.POLL%
    550.43           +22.6%     675.10        turbostat.PkgWatt
     17.66            +3.4%      18.26        turbostat.RAMWatt
    109633 ±  2%     +14.5%     125554        proc-vmstat.nr_anon_pages
   6299234            -1.7%    6189022        proc-vmstat.nr_dirty_background_threshold
  12613872            -1.7%   12393176        proc-vmstat.nr_dirty_threshold
   1638204 ±  4%     +66.3%    2724201 ±  2%  proc-vmstat.nr_file_pages
  63369354            -1.7%   62265610        proc-vmstat.nr_free_pages
   1026430 ±  7%    +106.9%    2123321 ±  2%  proc-vmstat.nr_inactive_anon
    297396 ±  3%     +24.9%     371454        proc-vmstat.nr_mapped
    952218 ±  7%    +114.1%    2038225 ±  3%  proc-vmstat.nr_shmem
     40692            +6.3%      43273        proc-vmstat.nr_slab_reclaimable
   1026430 ±  7%    +106.9%    2123321 ±  2%  proc-vmstat.nr_zone_inactive_anon
    243574 ±  8%     +39.3%     339259 ±  2%  proc-vmstat.numa_hint_faults
    135818 ± 18%     +70.6%     231662 ±  3%  proc-vmstat.numa_hint_faults_local
   2345361 ±  4%     +68.7%    3956948 ±  2%  proc-vmstat.numa_hit
   2109765 ±  5%     +76.6%    3724999 ±  2%  proc-vmstat.numa_local
    544814 ±  4%     +17.3%     639106 ±  3%  proc-vmstat.numa_pte_updates
     16992 ± 10%     +37.7%      23393 ±  6%  proc-vmstat.pgactivate
   2439275 ±  4%     +66.3%    4056303 ±  2%  proc-vmstat.pgalloc_normal
   1112292           +16.6%    1296617        proc-vmstat.pgfault
   3291142           +14.7%    3773335        sched_debug.cfs_rq:/.avg_vruntime.avg
   4708699 ±  5%     +15.7%    5449685 ±  5%  sched_debug.cfs_rq:/.avg_vruntime.max
    651007 ±  8%     -41.3%     382212 ± 12%  sched_debug.cfs_rq:/.left_vruntime.avg
   3404176           +21.7%    4141767 ± 13%  sched_debug.cfs_rq:/.left_vruntime.max
   1303825 ±  3%     -13.0%    1134537 ±  5%  sched_debug.cfs_rq:/.left_vruntime.stddev
   3291142           +14.7%    3773335        sched_debug.cfs_rq:/.min_vruntime.avg
   4708699 ±  5%     +15.7%    5449685 ±  5%  sched_debug.cfs_rq:/.min_vruntime.max
    651007 ±  8%     -41.3%     382212 ± 12%  sched_debug.cfs_rq:/.right_vruntime.avg
   3404176           +21.7%    4141767 ± 13%  sched_debug.cfs_rq:/.right_vruntime.max
   1303825 ±  3%     -13.0%    1134537 ±  5%  sched_debug.cfs_rq:/.right_vruntime.stddev
    309.43 ±  6%     -22.4%     240.16 ±  5%  sched_debug.cfs_rq:/.runnable_avg.stddev
    184.67 ±  6%     -16.8%     153.61        sched_debug.cfs_rq:/.util_avg.stddev
     67.77 ± 14%     -83.0%      11.54 ± 13%  sched_debug.cfs_rq:/.util_est_enqueued.avg
     99.34 ± 11%     -35.4%      64.19 ±  7%  sched_debug.cfs_rq:/.util_est_enqueued.stddev
     44.07 ± 12%     -70.0%      13.22 ±  4%  sched_debug.cpu.clock.stddev
      2416 ±  7%     +21.8%       2943 ±  2%  sched_debug.cpu.curr->pid.avg
      0.00 ± 11%     -60.0%       0.00 ± 10%  sched_debug.cpu.next_balance.stddev
      0.47 ±  6%     +17.7%       0.55 ±  4%  sched_debug.cpu.nr_running.avg
    993749 ±  5%    +397.7%    4946105        sched_debug.cpu.nr_switches.avg
   1123429 ±  4%    +367.4%    5250556        sched_debug.cpu.nr_switches.max
    528386 ± 14%    +339.9%    2324144 ± 17%  sched_debug.cpu.nr_switches.min
     81513 ± 18%    +225.4%     265263 ±  4%  sched_debug.cpu.nr_switches.stddev
      1.55 ±  7%     -31.3%       1.07 ±  2%  sched_debug.rt_rq:.rt_time.avg
    347.61 ±  7%     -31.3%     238.83 ±  2%  sched_debug.rt_rq:.rt_time.max
     23.17 ±  7%     -31.3%      15.92 ±  2%  sched_debug.rt_rq:.rt_time.stddev
     14.40            +8.8%      15.66        perf-stat.i.MPKI
 1.419e+10 ±  4%    +300.9%  5.688e+10        perf-stat.i.branch-instructions
      1.36            -0.1        1.23        perf-stat.i.branch-miss-rate%
 1.624e+08 ±  4%    +298.8%  6.478e+08        perf-stat.i.branch-misses
      2.71 ±  3%      -0.8        1.89 ±  2%  perf-stat.i.cache-miss-rate%
  14303386 ±  3%    +104.9%   29304692        perf-stat.i.cache-misses
 8.988e+08 ±  5%    +381.5%  4.328e+09        perf-stat.i.cache-references
   7363534 ±  5%    +401.1%   36899734        perf-stat.i.context-switches
      8.28 ±  4%     -74.7%       2.10        perf-stat.i.cpi
 5.176e+11           +12.5%  5.822e+11        perf-stat.i.cpu-cycles
   2716479 ±  5%    +389.6%   13299565        perf-stat.i.cpu-migrations
     43307 ±  2%     -37.0%      27270        perf-stat.i.cycles-between-cache-misses
  41832335 ±  9%    +354.4%  1.901e+08 ±  2%  perf-stat.i.dTLB-load-misses
   1.8e+10 ±  4%    +315.9%  7.485e+10        perf-stat.i.dTLB-loads
   5430661 ±  6%    +373.7%   25724171 ±  2%  perf-stat.i.dTLB-store-misses
 9.847e+09 ±  4%    +333.9%  4.272e+10        perf-stat.i.dTLB-stores
 7.026e+10 ±  4%    +305.4%  2.849e+11        perf-stat.i.instructions
      0.17 ±  3%    +205.6%       0.51        perf-stat.i.ipc
      2.31           +12.5%       2.60        perf-stat.i.metric.GHz
    111.01 ±  3%    +264.1%     404.23        perf-stat.i.metric.K/sec
    191.57 ±  4%    +316.5%     797.89        perf-stat.i.metric.M/sec
     18432 ±  2%      +9.5%      20184        perf-stat.i.minor-faults
     81.21           -13.1       68.09        perf-stat.i.node-load-miss-rate%
   4976001 ±  2%     +46.0%    7265429        perf-stat.i.node-load-misses
   1487000 ±  6%    +219.9%    4757391        perf-stat.i.node-loads
     18432 ±  2%      +9.5%      20184        perf-stat.i.page-faults
     13.31           +15.2%      15.34        perf-stat.overall.MPKI
      1.19            -0.0        1.15        perf-stat.overall.branch-miss-rate%
      1.56 ±  4%      -0.9        0.67        perf-stat.overall.cache-miss-rate%
      7.69 ±  4%     -73.2%       2.06        perf-stat.overall.cpi
     37074 ±  3%     -45.7%      20124        perf-stat.overall.cycles-between-cache-misses
      0.06 ±  2%      +0.0        0.06 ±  2%  perf-stat.overall.dTLB-store-miss-rate%
      0.13 ±  4%    +272.6%       0.49        perf-stat.overall.ipc
     73.48           -14.1       59.33        perf-stat.overall.node-load-miss-rate%
 1.334e+10 ±  4%    +317.8%  5.572e+10        perf-stat.ps.branch-instructions
 1.584e+08 ±  5%    +304.7%  6.411e+08        perf-stat.ps.branch-misses
  13716290 ±  3%    +108.4%   28582536        perf-stat.ps.cache-misses
  8.82e+08 ±  5%    +385.7%  4.284e+09        perf-stat.ps.cache-references
   7221087 ±  5%    +405.8%   36523032        perf-stat.ps.context-switches
    217584            +1.3%     220460        perf-stat.ps.cpu-clock
 5.081e+11           +13.2%  5.751e+11        perf-stat.ps.cpu-cycles
   2673546 ±  5%    +392.9%   13179198        perf-stat.ps.cpu-migrations
  40827035 ±  9%    +360.3%  1.879e+08 ±  2%  perf-stat.ps.dTLB-load-misses
 1.708e+10 ±  4%    +330.6%  7.352e+10        perf-stat.ps.dTLB-loads
   5337456 ±  6%    +377.4%   25482959 ±  2%  perf-stat.ps.dTLB-store-misses
 9.357e+09 ±  5%    +348.8%  4.199e+10        perf-stat.ps.dTLB-stores
 6.622e+10 ±  4%    +321.8%  2.793e+11        perf-stat.ps.instructions
     16190 ±  2%     +19.0%      19266        perf-stat.ps.minor-faults
   4783057 ±  2%     +49.3%    7140248        perf-stat.ps.node-load-misses
   1727513 ±  5%    +183.3%    4893702        perf-stat.ps.node-loads
     16190 ±  2%     +19.0%      19266        perf-stat.ps.page-faults
    217584            +1.3%     220460        perf-stat.ps.task-clock
 4.156e+12 ±  4%    +320.9%  1.749e+13        perf-stat.total.instructions
     22.02           -17.7        4.28        perf-profile.calltrace.cycles-pp.enqueue_task_fair.activate_task.ttwu_do_activate.sched_ttwu_pending.__flush_smp_call_function_queue
     18.48           -15.1        3.42        perf-profile.calltrace.cycles-pp.enqueue_entity.enqueue_task_fair.activate_task.ttwu_do_activate.sched_ttwu_pending
     24.23           -15.0        9.26        perf-profile.calltrace.cycles-pp.flush_smp_call_function_queue.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
     19.99           -14.3        5.66        perf-profile.calltrace.cycles-pp.activate_task.ttwu_do_activate.sched_ttwu_pending.__flush_smp_call_function_queue.flush_smp_call_function_queue
     20.09           -14.2        5.84        perf-profile.calltrace.cycles-pp.ttwu_do_activate.sched_ttwu_pending.__flush_smp_call_function_queue.flush_smp_call_function_queue.do_idle
     20.68           -13.6        7.10        perf-profile.calltrace.cycles-pp.sched_ttwu_pending.__flush_smp_call_function_queue.flush_smp_call_function_queue.do_idle.cpu_startup_entry
     21.09           -12.8        8.24        perf-profile.calltrace.cycles-pp.__flush_smp_call_function_queue.flush_smp_call_function_queue.do_idle.cpu_startup_entry.start_secondary
     10.08           -10.1        0.00        perf-profile.calltrace.cycles-pp.update_cfs_group.dequeue_entity.dequeue_task_fair.__schedule.schedule
      9.04            -9.0        0.00        perf-profile.calltrace.cycles-pp.update_cfs_group.enqueue_entity.enqueue_task_fair.activate_task.ttwu_do_activate
     37.64            -7.7       29.92        perf-profile.calltrace.cycles-pp.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
     37.84            -7.7       30.19        perf-profile.calltrace.cycles-pp.secondary_startup_64_no_verify
     37.66            -7.7       30.01        perf-profile.calltrace.cycles-pp.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
     37.67            -7.6       30.04        perf-profile.calltrace.cycles-pp.start_secondary.secondary_startup_64_no_verify
     10.34            -7.3        3.07        perf-profile.calltrace.cycles-pp.dequeue_task_fair.__schedule.schedule.pipe_read.vfs_read
     10.14            -7.1        3.07        perf-profile.calltrace.cycles-pp.dequeue_task_fair.__schedule.schedule.pipe_write.vfs_write
      8.31            -6.8        1.48        perf-profile.calltrace.cycles-pp.update_load_avg.enqueue_entity.enqueue_task_fair.activate_task.ttwu_do_activate
      8.11            -5.8        2.26        perf-profile.calltrace.cycles-pp.dequeue_entity.dequeue_task_fair.__schedule.schedule.pipe_read
      7.93            -5.7        2.26        perf-profile.calltrace.cycles-pp.dequeue_entity.dequeue_task_fair.__schedule.schedule.pipe_write
     12.83            -4.9        7.91        perf-profile.calltrace.cycles-pp.__schedule.schedule.pipe_read.vfs_read.ksys_read
     12.92            -4.8        8.07        perf-profile.calltrace.cycles-pp.schedule.pipe_read.vfs_read.ksys_read.do_syscall_64
     12.61            -4.7        7.88        perf-profile.calltrace.cycles-pp.__schedule.schedule.pipe_write.vfs_write.ksys_write
     12.69            -4.6        8.04        perf-profile.calltrace.cycles-pp.schedule.pipe_write.vfs_write.ksys_write.do_syscall_64
      4.66            -2.6        2.06        perf-profile.calltrace.cycles-pp.update_load_avg.dequeue_entity.dequeue_task_fair.__schedule.schedule
      1.94            -1.3        0.66        perf-profile.calltrace.cycles-pp.set_next_entity.pick_next_task_fair.__schedule.schedule_idle.do_idle
      2.01            -1.1        0.91        perf-profile.calltrace.cycles-pp.pick_next_task_fair.__schedule.schedule_idle.do_idle.cpu_startup_entry
      5.61            -1.0        4.65        perf-profile.calltrace.cycles-pp.intel_idle.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle
      0.61            +0.2        0.79        perf-profile.calltrace.cycles-pp._raw_spin_lock.__schedule.schedule_idle.do_idle.cpu_startup_entry
      1.18 ±  2%      +0.2        1.38        perf-profile.calltrace.cycles-pp.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.read
      1.17 ±  2%      +0.2        1.39        perf-profile.calltrace.cycles-pp.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.write
      0.56 ±  4%      +0.3        0.82 ±  2%  perf-profile.calltrace.cycles-pp._copy_from_iter.copy_page_from_iter.pipe_write.vfs_write.ksys_write
      0.59 ±  4%      +0.3        0.87 ±  2%  perf-profile.calltrace.cycles-pp.copy_page_from_iter.pipe_write.vfs_write.ksys_write.do_syscall_64
      1.22 ±  2%      +0.3        1.54        perf-profile.calltrace.cycles-pp.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.read
      1.22 ±  2%      +0.3        1.54        perf-profile.calltrace.cycles-pp.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.write
      0.97            +0.3        1.30        perf-profile.calltrace.cycles-pp.sched_mm_cid_migrate_to.activate_task.ttwu_do_activate.sched_ttwu_pending.__flush_smp_call_function_queue
     26.91            +0.4       27.31        perf-profile.calltrace.cycles-pp.pipe_write.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe
      2.19 ±  2%      +0.4        2.61        perf-profile.calltrace.cycles-pp.switch_fpu_return.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe
      0.00            +0.5        0.52        perf-profile.calltrace.cycles-pp.__switch_to
      0.00            +0.5        0.53        perf-profile.calltrace.cycles-pp._raw_spin_lock.__schedule.schedule.pipe_write.vfs_write
      0.00            +0.5        0.53        perf-profile.calltrace.cycles-pp._raw_spin_lock.__schedule.schedule.pipe_read.vfs_read
      0.00            +0.6        0.62        perf-profile.calltrace.cycles-pp.__update_load_avg_cfs_rq.update_load_avg.enqueue_entity.enqueue_task_fair.activate_task
      0.00            +0.6        0.63        perf-profile.calltrace.cycles-pp.finish_task_switch.__schedule.schedule.pipe_read.vfs_read
      0.66 ±  4%      +0.6        1.29        perf-profile.calltrace.cycles-pp.menu_select.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary
      0.00            +0.6        0.64        perf-profile.calltrace.cycles-pp.finish_task_switch.__schedule.schedule.pipe_write.vfs_write
      0.00            +0.6        0.64        perf-profile.calltrace.cycles-pp.tick_nohz_get_sleep_length.menu_select.cpuidle_idle_call.do_idle.cpu_startup_entry
      0.34 ± 70%      +0.7        1.04        perf-profile.calltrace.cycles-pp.prepare_to_wait_event.pipe_write.vfs_write.ksys_write.do_syscall_64
      0.00            +0.7        0.72        perf-profile.calltrace.cycles-pp.nohz_run_idle_balance.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
      0.00            +0.7        0.73        perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.prepare_to_wait_event.pipe_read.vfs_read.ksys_read
      0.00            +0.7        0.74        perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.prepare_to_wait_event.pipe_write.vfs_write.ksys_write
      0.00            +0.8        0.76        perf-profile.calltrace.cycles-pp.__update_idle_core.pick_next_task_idle.__schedule.schedule.pipe_write
      0.00            +0.8        0.77 ±  2%  perf-profile.calltrace.cycles-pp.__update_idle_core.pick_next_task_idle.__schedule.schedule.pipe_read
      0.00            +0.8        0.77        perf-profile.calltrace.cycles-pp.copyout._copy_to_iter.copy_page_to_iter.pipe_read.vfs_read
      0.00            +0.8        0.78        perf-profile.calltrace.cycles-pp.pick_next_task_idle.__schedule.schedule.pipe_write.vfs_write
      0.00            +0.8        0.78        perf-profile.calltrace.cycles-pp.pick_next_task_idle.__schedule.schedule.pipe_read.vfs_read
      1.41 ±  2%      +0.8        2.19        perf-profile.calltrace.cycles-pp.migrate_task_rq_fair.set_task_cpu.try_to_wake_up.autoremove_wake_function.__wake_up_common
      1.60 ±  2%      +0.8        2.38        perf-profile.calltrace.cycles-pp.set_task_cpu.try_to_wake_up.autoremove_wake_function.__wake_up_common.__wake_up_common_lock
      0.00            +0.8        0.82        perf-profile.calltrace.cycles-pp._copy_to_iter.copy_page_to_iter.pipe_read.vfs_read.ksys_read
     27.39            +0.8       28.21        perf-profile.calltrace.cycles-pp.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe.write
      0.00            +0.8        0.83        perf-profile.calltrace.cycles-pp.__switch_to_asm
      0.00            +0.8        0.84        perf-profile.calltrace.cycles-pp.prepare_task_switch.__schedule.schedule_idle.do_idle.cpu_startup_entry
      0.00            +0.9        0.86        perf-profile.calltrace.cycles-pp.copy_page_to_iter.pipe_read.vfs_read.ksys_read.do_syscall_64
     27.50            +0.9       28.44        perf-profile.calltrace.cycles-pp.pipe_read.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe
     27.56            +1.0       28.52        perf-profile.calltrace.cycles-pp.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe.write
      0.00            +1.0        0.97        perf-profile.calltrace.cycles-pp.llist_reverse_order.__flush_smp_call_function_queue.flush_smp_call_function_queue.do_idle.cpu_startup_entry
      0.00            +1.0        1.02        perf-profile.calltrace.cycles-pp.prepare_to_wait_event.pipe_read.vfs_read.ksys_read.do_syscall_64
      4.59            +1.0        5.60        perf-profile.calltrace.cycles-pp.__schedule.schedule_idle.do_idle.cpu_startup_entry.start_secondary
      0.00            +1.1        1.14        perf-profile.calltrace.cycles-pp.wake_affine.select_task_rq_fair.select_task_rq.try_to_wake_up.autoremove_wake_function
      0.00            +1.1        1.15        perf-profile.calltrace.cycles-pp.switch_mm_irqs_off.__schedule.schedule_idle.do_idle.cpu_startup_entry
      4.66            +1.2        5.82        perf-profile.calltrace.cycles-pp.schedule_idle.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
      0.00            +1.2        1.20        perf-profile.calltrace.cycles-pp.remove_entity_load_avg.migrate_task_rq_fair.set_task_cpu.try_to_wake_up.autoremove_wake_function
      1.03 ±  5%      +1.2        2.27 ±  2%  perf-profile.calltrace.cycles-pp.stress_switch_pipe
     27.94            +1.3       29.20        perf-profile.calltrace.cycles-pp.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe.read
     28.83            +1.4       30.20        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.write
      0.00            +1.4        1.38 ±  3%  perf-profile.calltrace.cycles-pp.__bitmap_andnot.select_idle_core.select_idle_cpu.select_idle_sibling.select_task_rq_fair
     29.00            +1.4       30.43        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.write
     28.10            +1.4       29.54        perf-profile.calltrace.cycles-pp.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe.read
      0.00            +1.5        1.50        perf-profile.calltrace.cycles-pp.llist_add_batch.__smp_call_single_queue.ttwu_queue_wakelist.try_to_wake_up.autoremove_wake_function
      1.44            +1.6        3.09        perf-profile.calltrace.cycles-pp.ttwu_queue_wakelist.try_to_wake_up.autoremove_wake_function.__wake_up_common.__wake_up_common_lock
     29.38            +1.8       31.21        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.read
     29.55            +1.9       31.44        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.read
      0.00            +2.3        2.30        perf-profile.calltrace.cycles-pp.__smp_call_single_queue.ttwu_queue_wakelist.try_to_wake_up.autoremove_wake_function.__wake_up_common
     29.58            +2.4       32.02        perf-profile.calltrace.cycles-pp.write
     12.07            +2.9       15.00        perf-profile.calltrace.cycles-pp.try_to_wake_up.autoremove_wake_function.__wake_up_common.__wake_up_common_lock.pipe_write
     12.15            +2.9       15.10        perf-profile.calltrace.cycles-pp.try_to_wake_up.autoremove_wake_function.__wake_up_common.__wake_up_common_lock.pipe_read
     12.09            +3.0       15.06        perf-profile.calltrace.cycles-pp.autoremove_wake_function.__wake_up_common.__wake_up_common_lock.pipe_write.vfs_write
     12.17            +3.0       15.16        perf-profile.calltrace.cycles-pp.autoremove_wake_function.__wake_up_common.__wake_up_common_lock.pipe_read.vfs_read
     30.16            +3.0       33.17        perf-profile.calltrace.cycles-pp.read
     12.31            +3.3       15.64        perf-profile.calltrace.cycles-pp.__wake_up_common.__wake_up_common_lock.pipe_write.vfs_write.ksys_write
     12.50            +3.4       15.86        perf-profile.calltrace.cycles-pp.__wake_up_common_lock.pipe_write.vfs_write.ksys_write.do_syscall_64
     12.40            +3.4       15.76        perf-profile.calltrace.cycles-pp.__wake_up_common.__wake_up_common_lock.pipe_read.vfs_read.ksys_read
     12.60            +3.4       16.00        perf-profile.calltrace.cycles-pp.__wake_up_common_lock.pipe_read.vfs_read.ksys_read.do_syscall_64
      5.68            +3.5        9.22        perf-profile.calltrace.cycles-pp.available_idle_cpu.select_idle_core.select_idle_cpu.select_idle_sibling.select_task_rq_fair
      6.99            +3.8       10.76        perf-profile.calltrace.cycles-pp.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry
      7.07            +3.8       10.85        perf-profile.calltrace.cycles-pp.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary
      7.98            +4.7       12.72        perf-profile.calltrace.cycles-pp.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
      8.56            +5.0       13.55        perf-profile.calltrace.cycles-pp.select_idle_core.select_idle_cpu.select_idle_sibling.select_task_rq_fair.select_task_rq
      0.00            +5.1        5.14        perf-profile.calltrace.cycles-pp.poll_idle.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle
     10.17            +5.7       15.90        perf-profile.calltrace.cycles-pp.select_idle_cpu.select_idle_sibling.select_task_rq_fair.select_task_rq.try_to_wake_up
     11.79            +6.6       18.37        perf-profile.calltrace.cycles-pp.select_idle_sibling.select_task_rq_fair.select_task_rq.try_to_wake_up.autoremove_wake_function
     12.90            +7.0       19.90        perf-profile.calltrace.cycles-pp.select_task_rq_fair.select_task_rq.try_to_wake_up.autoremove_wake_function.__wake_up_common
     13.07            +7.0       20.12        perf-profile.calltrace.cycles-pp.select_task_rq.try_to_wake_up.autoremove_wake_function.__wake_up_common.__wake_up_common_lock
     27.98           -26.6        1.35 ±  5%  perf-profile.children.cycles-pp.update_cfs_group
     24.39           -18.8        5.58        perf-profile.children.cycles-pp.enqueue_task_fair
     25.65           -18.4        7.23        perf-profile.children.cycles-pp.activate_task
     25.77           -18.2        7.60        perf-profile.children.cycles-pp.ttwu_do_activate
     25.03           -16.7        8.29        perf-profile.children.cycles-pp.sched_ttwu_pending
     25.58           -15.9        9.70        perf-profile.children.cycles-pp.__flush_smp_call_function_queue
     20.04           -15.7        4.30        perf-profile.children.cycles-pp.enqueue_entity
     24.41           -15.0        9.38        perf-profile.children.cycles-pp.flush_smp_call_function_queue
     20.52           -14.4        6.17        perf-profile.children.cycles-pp.dequeue_task_fair
     16.78           -11.6        5.22        perf-profile.children.cycles-pp.update_load_avg
     16.11           -11.5        4.60        perf-profile.children.cycles-pp.dequeue_entity
     25.62            -9.4       16.18        perf-profile.children.cycles-pp.schedule
     30.10            -8.4       21.65        perf-profile.children.cycles-pp.__schedule
     37.82            -7.7       30.12        perf-profile.children.cycles-pp.do_idle
     37.84            -7.7       30.19        perf-profile.children.cycles-pp.secondary_startup_64_no_verify
     37.84            -7.7       30.19        perf-profile.children.cycles-pp.cpu_startup_entry
     37.67            -7.6       30.04        perf-profile.children.cycles-pp.start_secondary
      4.42            -2.9        1.47        perf-profile.children.cycles-pp.__sysvec_call_function_single
      4.45            -2.9        1.58        perf-profile.children.cycles-pp.sysvec_call_function_single
      4.56            -2.7        1.88        perf-profile.children.cycles-pp.asm_sysvec_call_function_single
      2.32            -1.4        0.92        perf-profile.children.cycles-pp.set_next_entity
      2.24 ±  3%      -1.3        0.96        perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
      5.64            -1.0        4.68        perf-profile.children.cycles-pp.intel_idle
      2.54            -0.8        1.75        perf-profile.children.cycles-pp.pick_next_task_fair
      1.30 ±  6%      -0.7        0.62        perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
      1.24 ±  6%      -0.7        0.58        perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
      0.69 ±  6%      -0.4        0.32 ±  2%  perf-profile.children.cycles-pp.__sysvec_apic_timer_interrupt
      0.68 ±  6%      -0.4        0.31 ±  2%  perf-profile.children.cycles-pp.hrtimer_interrupt
      0.63 ±  6%      -0.4        0.28 ±  2%  perf-profile.children.cycles-pp.__hrtimer_run_queues
      0.58 ±  7%      -0.3        0.24        perf-profile.children.cycles-pp.tick_sched_handle
      0.58 ±  7%      -0.3        0.24 ±  2%  perf-profile.children.cycles-pp.update_process_times
      0.59 ±  6%      -0.3        0.25 ±  2%  perf-profile.children.cycles-pp.tick_sched_timer
      0.54 ±  6%      -0.3        0.21 ±  2%  perf-profile.children.cycles-pp.scheduler_tick
      0.42 ±  4%      -0.3        0.11 ±  4%  perf-profile.children.cycles-pp.__task_rq_lock
      0.53 ±  5%      -0.3        0.24        perf-profile.children.cycles-pp.__do_softirq
      0.55 ±  5%      -0.3        0.28        perf-profile.children.cycles-pp.__irq_exit_rcu
      0.24 ±  3%      -0.1        0.10 ±  3%  perf-profile.children.cycles-pp.perf_trace_sched_stat_runtime
      0.15 ±  5%      -0.1        0.06 ±  7%  perf-profile.children.cycles-pp.task_work_run
      0.15 ±  8%      -0.1        0.06 ±  7%  perf-profile.children.cycles-pp.task_mm_cid_work
      0.16 ±  7%      -0.1        0.11        perf-profile.children.cycles-pp.exit_to_user_mode_loop
      0.10 ±  5%      -0.0        0.06        perf-profile.children.cycles-pp.tick_nohz_tick_stopped
      0.24 ±  4%      -0.0        0.21        perf-profile.children.cycles-pp.tracing_gen_ctx_irq_test
      0.07 ±  5%      -0.0        0.06        perf-profile.children.cycles-pp.inode_needs_update_time
      0.33 ±  2%      +0.0        0.35        perf-profile.children.cycles-pp.cpus_share_cache
      0.12 ±  4%      +0.0        0.15 ±  3%  perf-profile.children.cycles-pp.file_update_time
      0.08 ±  8%      +0.0        0.11 ±  6%  perf-profile.children.cycles-pp.anon_pipe_buf_release
      0.09 ± 15%      +0.0        0.13 ±  2%  perf-profile.children.cycles-pp.hrtimer_get_next_event
      0.16 ±  4%      +0.0        0.19        perf-profile.children.cycles-pp.touch_atime
      0.11 ±  4%      +0.0        0.15        perf-profile.children.cycles-pp.atime_needs_update
      0.06            +0.0        0.11 ±  4%  perf-profile.children.cycles-pp.__get_task_ioprio
      0.06 ±  9%      +0.0        0.10 ±  4%  perf-profile.children.cycles-pp.resched_curr
      0.00            +0.1        0.05        perf-profile.children.cycles-pp.irqtime_account_irq
      0.00            +0.1        0.05        perf-profile.children.cycles-pp.perf_swevent_event
      0.00            +0.1        0.05        perf-profile.children.cycles-pp._find_next_and_bit
      0.00            +0.1        0.05        perf-profile.children.cycles-pp.can_stop_idle_tick
      0.00            +0.1        0.05        perf-profile.children.cycles-pp.tsc_verify_tsc_adjust
      0.00            +0.1        0.05        perf-profile.children.cycles-pp.perf_trace_run_bpf_submit
      0.00            +0.1        0.05        perf-profile.children.cycles-pp.kill_fasync
      0.00            +0.1        0.06 ±  8%  perf-profile.children.cycles-pp.arch_cpu_idle_enter
      0.46            +0.1        0.52        perf-profile.children.cycles-pp.perf_tp_event
      0.00            +0.1        0.06        perf-profile.children.cycles-pp.sched_clock_noinstr
      0.00            +0.1        0.06        perf-profile.children.cycles-pp.save_fpregs_to_fpstate
      0.00            +0.1        0.06        perf-profile.children.cycles-pp.rb_next
      0.00            +0.1        0.06 ±  6%  perf-profile.children.cycles-pp.__cgroup_account_cputime
      0.00            +0.1        0.06 ±  7%  perf-profile.children.cycles-pp.rcu_note_context_switch
      0.02 ± 99%      +0.1        0.09 ±  4%  perf-profile.children.cycles-pp.aa_file_perm
      0.00            +0.1        0.07 ±  7%  perf-profile.children.cycles-pp.put_prev_task_fair
      0.00            +0.1        0.07 ±  7%  perf-profile.children.cycles-pp.put_prev_entity
      0.00            +0.1        0.07 ±  7%  perf-profile.children.cycles-pp.perf_exclude_event
      0.12 ±  4%      +0.1        0.20 ±  4%  perf-profile.children.cycles-pp.cpuacct_charge
      0.00            +0.1        0.07        perf-profile.children.cycles-pp.rcu_all_qs
      0.00            +0.1        0.07        perf-profile.children.cycles-pp.mm_cid_get
      0.00            +0.1        0.07        perf-profile.children.cycles-pp.native_apic_msr_eoi_write
      0.00            +0.1        0.07        perf-profile.children.cycles-pp.perf_trace_buf_alloc
      0.00            +0.1        0.07 ± 11%  perf-profile.children.cycles-pp.mutex_spin_on_owner
      0.10 ±  8%      +0.1        0.17 ±  2%  perf-profile.children.cycles-pp.native_irq_return_iret
      0.02 ± 99%      +0.1        0.10        perf-profile.children.cycles-pp.current_time
      0.00            +0.1        0.08        perf-profile.children.cycles-pp.tick_nohz_stop_idle
      0.00            +0.1        0.08        perf-profile.children.cycles-pp.__x2apic_send_IPI_dest
      0.00            +0.1        0.10 ±  5%  perf-profile.children.cycles-pp.nr_iowait_cpu
      0.00            +0.1        0.10 ± 23%  perf-profile.children.cycles-pp.read@plt
      0.00            +0.1        0.10        perf-profile.children.cycles-pp.syscall_enter_from_user_mode
      0.00            +0.1        0.10        perf-profile.children.cycles-pp.__hrtimer_next_event_base
      0.00            +0.1        0.10 ±  3%  perf-profile.children.cycles-pp.perf_trace_sched_switch
      0.45 ±  2%      +0.1        0.56        perf-profile.children.cycles-pp.task_h_load
      0.12 ±  4%      +0.1        0.22        perf-profile.children.cycles-pp.avg_vruntime
      0.12 ±  3%      +0.1        0.22 ±  2%  perf-profile.children.cycles-pp.attach_entity_load_avg
      0.27 ±  3%      +0.1        0.38 ±  2%  perf-profile.children.cycles-pp.apparmor_file_permission
      0.00            +0.1        0.11        perf-profile.children.cycles-pp.ct_kernel_enter
      0.08 ±  6%      +0.1        0.19        perf-profile.children.cycles-pp.__calc_delta
      0.00            +0.1        0.12        perf-profile.children.cycles-pp.ct_kernel_exit_state
      0.00            +0.1        0.12 ±  3%  perf-profile.children.cycles-pp.__list_add_valid
      0.14 ±  3%      +0.1        0.26        perf-profile.children.cycles-pp.update_entity_lag
      0.02 ±141%      +0.1        0.14 ±  3%  perf-profile.children.cycles-pp.__list_del_entry_valid
      0.12 ± 12%      +0.1        0.25        perf-profile.children.cycles-pp.get_next_timer_interrupt
      0.00            +0.1        0.13        perf-profile.children.cycles-pp.ct_idle_exit
      0.00            +0.1        0.13        perf-profile.children.cycles-pp.__cond_resched
      0.00            +0.1        0.13 ±  2%  perf-profile.children.cycles-pp.cpuidle_governor_latency_req
      0.13 ±  3%      +0.1        0.26        perf-profile.children.cycles-pp.place_entity
      0.06 ±  7%      +0.1        0.20 ±  2%  perf-profile.children.cycles-pp.pick_eevdf
      0.28 ±  3%      +0.1        0.42        perf-profile.children.cycles-pp.security_file_permission
      0.10 ±  5%      +0.1        0.24        perf-profile.children.cycles-pp.check_preempt_curr
      0.04 ± 44%      +0.1        0.19 ±  2%  perf-profile.children.cycles-pp.__dequeue_entity
      0.00            +0.2        0.15        perf-profile.children.cycles-pp._raw_spin_trylock
      0.07 ±  7%      +0.2        0.22        perf-profile.children.cycles-pp.hrtimer_next_event_without
      0.05 ±  8%      +0.2        0.21 ±  2%  perf-profile.children.cycles-pp.call_cpuidle
      0.06 ±  6%      +0.2        0.22        perf-profile.children.cycles-pp.read_tsc
      0.04 ± 44%      +0.2        0.21 ±  2%  perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack
      0.00            +0.2        0.16 ±  3%  perf-profile.children.cycles-pp.syscall_return_via_sysret
      0.17 ±  4%      +0.2        0.34 ±  2%  perf-profile.children.cycles-pp.update_min_vruntime
      0.28 ±  7%      +0.2        0.45 ±  3%  perf-profile.children.cycles-pp.copyin
      0.00            +0.2        0.17        perf-profile.children.cycles-pp.__rdgsbase_inactive
      0.00            +0.2        0.17        perf-profile.children.cycles-pp.local_clock_noinstr
      0.19 ±  9%      +0.2        0.36        perf-profile.children.cycles-pp.tick_nohz_next_event
      0.44 ±  2%      +0.2        0.61        perf-profile.children.cycles-pp.update_rq_clock_task
      0.04 ± 44%      +0.2        0.22        perf-profile.children.cycles-pp.finish_wait
      0.02 ±141%      +0.2        0.20        perf-profile.children.cycles-pp._raw_spin_unlock_irqrestore
      0.08 ±  6%      +0.2        0.27        perf-profile.children.cycles-pp.tick_nohz_idle_exit
      0.23 ±  2%      +0.2        0.42        perf-profile.children.cycles-pp.__fget_light
      0.07 ±  6%      +0.2        0.28        perf-profile.children.cycles-pp.syscall_exit_to_user_mode_prepare
      0.20 ±  2%      +0.2        0.41        perf-profile.children.cycles-pp._raw_spin_lock_irq
      0.94            +0.2        1.15        perf-profile.children.cycles-pp.wake_affine
      0.23 ±  3%      +0.2        0.45        perf-profile.children.cycles-pp.__fdget_pos
      0.06            +0.2        0.28 ±  2%  perf-profile.children.cycles-pp.newidle_balance
      0.09 ±  4%      +0.2        0.31        perf-profile.children.cycles-pp.ktime_get
      0.00            +0.2        0.23 ±  9%  perf-profile.children.cycles-pp.__mutex_lock
      0.16 ±  5%      +0.2        0.39        perf-profile.children.cycles-pp.mutex_lock
      0.06 ±  7%      +0.2        0.30        perf-profile.children.cycles-pp.__wrgsbase_inactive
      0.11 ±  7%      +0.2        0.35        perf-profile.children.cycles-pp.tick_nohz_idle_enter
      0.56 ±  4%      +0.3        0.83 ±  2%  perf-profile.children.cycles-pp._copy_from_iter
      0.59 ±  3%      +0.3        0.88 ±  2%  perf-profile.children.cycles-pp.copy_page_from_iter
      0.12 ±  6%      +0.3        0.42 ±  2%  perf-profile.children.cycles-pp.mutex_unlock
      0.12 ±  4%      +0.3        0.44        perf-profile.children.cycles-pp.__entry_text_start
      0.09 ±  7%      +0.3        0.42        perf-profile.children.cycles-pp.os_xsave
      1.24            +0.3        1.57        perf-profile.children.cycles-pp.sched_mm_cid_migrate_to
      0.21 ±  3%      +0.3        0.56        perf-profile.children.cycles-pp.__update_load_avg_se
      0.08 ±  6%      +0.4        0.43        perf-profile.children.cycles-pp.__enqueue_entity
      0.94            +0.4        1.32        perf-profile.children.cycles-pp.update_curr
      0.28 ±  7%      +0.4        0.66        perf-profile.children.cycles-pp.tick_nohz_get_sleep_length
      2.37 ±  2%      +0.4        2.79        perf-profile.children.cycles-pp.exit_to_user_mode_prepare
      0.15 ±  3%      +0.4        0.58        perf-profile.children.cycles-pp.sched_clock
      0.17 ±  6%      +0.4        0.60        perf-profile.children.cycles-pp.___perf_sw_event
      2.19 ±  2%      +0.4        2.62        perf-profile.children.cycles-pp.switch_fpu_return
      0.23 ±  2%      +0.4        0.67        perf-profile.children.cycles-pp.update_rq_clock
      0.16 ±  4%      +0.4        0.61        perf-profile.children.cycles-pp.native_sched_clock
      0.30            +0.4        0.74        perf-profile.children.cycles-pp.call_function_single_prep_ipi
     26.92            +0.5       27.38        perf-profile.children.cycles-pp.pipe_write
      0.33 ±  5%      +0.5        0.78        perf-profile.children.cycles-pp.copyout
      1.53            +0.5        2.00        perf-profile.children.cycles-pp.finish_task_switch
      0.35 ±  5%      +0.5        0.83        perf-profile.children.cycles-pp._copy_to_iter
      0.18 ±  4%      +0.5        0.66        perf-profile.children.cycles-pp.reweight_entity
      0.26            +0.5        0.76        perf-profile.children.cycles-pp.nohz_run_idle_balance
      0.38 ±  5%      +0.5        0.87        perf-profile.children.cycles-pp.copy_page_to_iter
      0.18 ±  3%      +0.5        0.72        perf-profile.children.cycles-pp.sched_clock_cpu
      0.65            +0.5        1.19        perf-profile.children.cycles-pp._find_next_bit
      0.62 ±  2%      +0.6        1.21        perf-profile.children.cycles-pp.remove_entity_load_avg
      0.70 ±  3%      +0.6        1.32        perf-profile.children.cycles-pp.menu_select
      2.44 ±  2%      +0.7        3.11        perf-profile.children.cycles-pp.syscall_exit_to_user_mode
      0.39            +0.7        1.10        perf-profile.children.cycles-pp.llist_reverse_order
     27.52            +0.7       28.25        perf-profile.children.cycles-pp.vfs_write
      1.41 ±  2%      +0.8        2.20        perf-profile.children.cycles-pp.migrate_task_rq_fair
      1.60 ±  2%      +0.8        2.39        perf-profile.children.cycles-pp.set_task_cpu
     27.69            +0.9       28.55        perf-profile.children.cycles-pp.ksys_write
      0.62            +0.9        1.52        perf-profile.children.cycles-pp.llist_add_batch
     27.52            +1.0       28.50        perf-profile.children.cycles-pp.pipe_read
      0.61            +1.0        1.63        perf-profile.children.cycles-pp.__switch_to_asm
      0.69 ±  2%      +1.0        1.72        perf-profile.children.cycles-pp.__switch_to
      0.64 ±  2%      +1.1        1.70        perf-profile.children.cycles-pp.__update_load_avg_cfs_rq
      0.36 ±  5%      +1.1        1.42 ±  3%  perf-profile.children.cycles-pp.__bitmap_andnot
      0.66 ±  2%      +1.1        1.75        perf-profile.children.cycles-pp.prepare_task_switch
      0.96            +1.1        2.09        perf-profile.children.cycles-pp.prepare_to_wait_event
      0.38 ± 12%      +1.2        1.54        perf-profile.children.cycles-pp.__update_idle_core
      4.69            +1.2        5.87        perf-profile.children.cycles-pp.schedule_idle
      0.38 ± 13%      +1.2        1.56        perf-profile.children.cycles-pp.pick_next_task_idle
      1.03 ±  5%      +1.2        2.28        perf-profile.children.cycles-pp.stress_switch_pipe
     27.96            +1.3       29.22        perf-profile.children.cycles-pp.vfs_read
      0.64 ±  2%      +1.3        1.95        perf-profile.children.cycles-pp.switch_mm_irqs_off
      0.95            +1.4        2.31        perf-profile.children.cycles-pp.__smp_call_single_queue
     28.11            +1.4       29.54        perf-profile.children.cycles-pp.ksys_read
      1.45            +1.6        3.10        perf-profile.children.cycles-pp.ttwu_queue_wakelist
      1.29            +2.0        3.32        perf-profile.children.cycles-pp._raw_spin_lock_irqsave
     29.82            +2.8       32.66        perf-profile.children.cycles-pp.write
     58.39            +3.1       61.48        perf-profile.children.cycles-pp.do_syscall_64
     58.71            +3.2       61.92        perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
     30.42            +3.4       33.80        perf-profile.children.cycles-pp.read
      7.53            +3.6       11.15        perf-profile.children.cycles-pp.available_idle_cpu
      7.10            +3.8       10.91        perf-profile.children.cycles-pp.cpuidle_enter
      7.05            +3.8       10.88        perf-profile.children.cycles-pp.cpuidle_enter_state
      8.02            +4.8       12.81        perf-profile.children.cycles-pp.cpuidle_idle_call
      8.62            +5.1       13.76        perf-profile.children.cycles-pp.select_idle_core
      0.00            +5.2        5.20        perf-profile.children.cycles-pp.poll_idle
     10.20            +5.8       16.02        perf-profile.children.cycles-pp.select_idle_cpu
     24.24            +5.9       30.15        perf-profile.children.cycles-pp.try_to_wake_up
     24.26            +6.0       30.22        perf-profile.children.cycles-pp.autoremove_wake_function
     11.80            +6.6       18.41        perf-profile.children.cycles-pp.select_idle_sibling
     24.71            +6.7       31.41        perf-profile.children.cycles-pp.__wake_up_common
     25.12            +6.8       31.91        perf-profile.children.cycles-pp.__wake_up_common_lock
     12.91            +7.0       19.92        perf-profile.children.cycles-pp.select_task_rq_fair
     13.08            +7.0       20.13        perf-profile.children.cycles-pp.select_task_rq
     27.98           -26.6        1.33 ±  5%  perf-profile.self.cycles-pp.update_cfs_group
     15.77           -13.3        2.50        perf-profile.self.cycles-pp.update_load_avg
      4.28            -2.9        1.36        perf-profile.self.cycles-pp.try_to_wake_up
      2.24 ±  3%      -1.3        0.96        perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
      5.64            -1.0        4.67        perf-profile.self.cycles-pp.intel_idle
      0.24 ±  3%      -0.1        0.10 ±  5%  perf-profile.self.cycles-pp.perf_trace_sched_stat_runtime
      0.21 ±  4%      -0.1        0.10        perf-profile.self.cycles-pp.perf_trace_sched_wakeup_template
      0.14 ±  8%      -0.1        0.06        perf-profile.self.cycles-pp.task_mm_cid_work
      0.23 ±  4%      -0.0        0.20 ±  2%  perf-profile.self.cycles-pp.tracing_gen_ctx_irq_test
      0.07 ±  5%      -0.0        0.06        perf-profile.self.cycles-pp.inode_needs_update_time
      0.06 ±  6%      +0.0        0.07        perf-profile.self.cycles-pp.perf_adjust_freq_unthr_context
      0.13 ±  2%      +0.0        0.15 ±  2%  perf-profile.self.cycles-pp.set_task_cpu
      0.06            +0.0        0.08 ±  5%  perf-profile.self.cycles-pp.update_entity_lag
      0.08 ±  8%      +0.0        0.11 ±  3%  perf-profile.self.cycles-pp.anon_pipe_buf_release
      0.05 ±  7%      +0.0        0.10        perf-profile.self.cycles-pp.__get_task_ioprio
      0.05 ±  7%      +0.0        0.10        perf-profile.self.cycles-pp.resched_curr
      0.00            +0.1        0.05        perf-profile.self.cycles-pp.exit_to_user_mode_prepare
      0.00            +0.1        0.05        perf-profile.self.cycles-pp.__smp_call_single_queue
      0.00            +0.1        0.05        perf-profile.self.cycles-pp.copy_page_from_iter
      0.00            +0.1        0.05        perf-profile.self.cycles-pp._copy_to_iter
      0.00            +0.1        0.05        perf-profile.self.cycles-pp.sched_clock
      0.00            +0.1        0.05        perf-profile.self.cycles-pp.tick_nohz_next_event
      0.00            +0.1        0.05        perf-profile.self.cycles-pp.tick_nohz_idle_enter
      0.00            +0.1        0.05        perf-profile.self.cycles-pp.cpuidle_governor_latency_req
      0.00            +0.1        0.05        perf-profile.self.cycles-pp.ct_kernel_enter
      0.00            +0.1        0.05        perf-profile.self.cycles-pp.rb_next
      0.00            +0.1        0.06 ±  9%  perf-profile.self.cycles-pp.tick_nohz_idle_exit
      0.02 ± 99%      +0.1        0.08 ±  4%  perf-profile.self.cycles-pp.aa_file_perm
      0.00            +0.1        0.06 ±  8%  perf-profile.self.cycles-pp.get_next_timer_interrupt
      0.00            +0.1        0.06 ±  8%  perf-profile.self.cycles-pp.perf_exclude_event
      0.32 ±  2%      +0.1        0.38 ±  2%  perf-profile.self.cycles-pp._copy_from_iter
      0.00            +0.1        0.06        perf-profile.self.cycles-pp.syscall_exit_to_user_mode
      0.00            +0.1        0.06        perf-profile.self.cycles-pp.rcu_note_context_switch
      0.20 ±  2%      +0.1        0.27        perf-profile.self.cycles-pp.perf_tp_event
      0.12 ±  4%      +0.1        0.19 ±  6%  perf-profile.self.cycles-pp.cpuacct_charge
      0.00            +0.1        0.07 ±  7%  perf-profile.self.cycles-pp.mm_cid_get
      0.07 ±  5%      +0.1        0.14        perf-profile.self.cycles-pp.ttwu_do_activate
      0.00            +0.1        0.07 ±  5%  perf-profile.self.cycles-pp.__hrtimer_next_event_base
      0.06            +0.1        0.13        perf-profile.self.cycles-pp.__entry_text_start
      0.00            +0.1        0.07        perf-profile.self.cycles-pp.activate_task
      0.00            +0.1        0.07        perf-profile.self.cycles-pp.__cond_resched
      0.00            +0.1        0.07        perf-profile.self.cycles-pp.current_time
      0.00            +0.1        0.07        perf-profile.self.cycles-pp.native_apic_msr_eoi_write
      0.00            +0.1        0.07 ± 11%  perf-profile.self.cycles-pp.mutex_spin_on_owner
      0.10 ±  8%      +0.1        0.17 ±  2%  perf-profile.self.cycles-pp.native_irq_return_iret
      0.05            +0.1        0.13        perf-profile.self.cycles-pp.place_entity
      0.00            +0.1        0.08        perf-profile.self.cycles-pp.__x2apic_send_IPI_dest
      0.09 ±  5%      +0.1        0.17 ±  2%  perf-profile.self.cycles-pp.do_syscall_64
      0.15 ±  6%      +0.1        0.24 ±  2%  perf-profile.self.cycles-pp.schedule
      0.00            +0.1        0.09 ±  7%  perf-profile.self.cycles-pp.perf_trace_sched_switch
      0.00            +0.1        0.09 ± 23%  perf-profile.self.cycles-pp.read@plt
      0.00            +0.1        0.09        perf-profile.self.cycles-pp.ksys_read
      0.00            +0.1        0.09        perf-profile.self.cycles-pp.ksys_write
      0.00            +0.1        0.09        perf-profile.self.cycles-pp.ktime_get
      0.00            +0.1        0.09        perf-profile.self.cycles-pp.nr_iowait_cpu
      0.00            +0.1        0.09 ±  4%  perf-profile.self.cycles-pp.syscall_enter_from_user_mode
      0.01 ±223%      +0.1        0.10        perf-profile.self.cycles-pp.__wake_up_common_lock
      0.00            +0.1        0.09 ±  5%  perf-profile.self.cycles-pp.cpu_startup_entry
      0.45 ±  2%      +0.1        0.55        perf-profile.self.cycles-pp.task_h_load
      0.12 ±  3%      +0.1        0.22        perf-profile.self.cycles-pp.attach_entity_load_avg
      0.00            +0.1        0.10        perf-profile.self.cycles-pp.check_preempt_curr
      0.00            +0.1        0.10        perf-profile.self.cycles-pp.__list_add_valid
      0.10 ±  4%      +0.1        0.21 ±  2%  perf-profile.self.cycles-pp.avg_vruntime
      0.07 ±  5%      +0.1        0.17        perf-profile.self.cycles-pp.__calc_delta
      0.33            +0.1        0.44        perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
      0.00            +0.1        0.10 ±  9%  perf-profile.self.cycles-pp.__mutex_lock
      0.00            +0.1        0.11        perf-profile.self.cycles-pp.set_next_entity
      0.00            +0.1        0.11 ±  4%  perf-profile.self.cycles-pp.ct_kernel_exit_state
      0.14 ±  3%      +0.1        0.25        perf-profile.self.cycles-pp.wake_affine
      0.00            +0.1        0.12 ±  3%  perf-profile.self.cycles-pp.__list_del_entry_valid
      0.34 ±  3%      +0.1        0.46        perf-profile.self.cycles-pp.menu_select
      0.00            +0.1        0.12 ±  4%  perf-profile.self.cycles-pp.security_file_permission
      0.05 ±  8%      +0.1        0.18 ±  2%  perf-profile.self.cycles-pp.pick_eevdf
      0.00            +0.1        0.14 ±  3%  perf-profile.self.cycles-pp.sched_clock_cpu
      0.00            +0.1        0.14        perf-profile.self.cycles-pp.__dequeue_entity
      0.17 ±  2%      +0.1        0.32        perf-profile.self.cycles-pp.__flush_smp_call_function_queue
      0.00            +0.1        0.14 ±  3%  perf-profile.self.cycles-pp.schedule_idle
      0.00            +0.1        0.15 ±  3%  perf-profile.self.cycles-pp._raw_spin_trylock
      0.05            +0.2        0.20        perf-profile.self.cycles-pp.call_cpuidle
      0.06 ±  8%      +0.2        0.21        perf-profile.self.cycles-pp.read_tsc
      0.00            +0.2        0.16 ±  3%  perf-profile.self.cycles-pp._raw_spin_unlock_irqrestore
      0.17 ±  4%      +0.2        0.32 ±  2%  perf-profile.self.cycles-pp.update_min_vruntime
      0.00            +0.2        0.16        perf-profile.self.cycles-pp.syscall_return_via_sysret
      0.08 ±  6%      +0.2        0.24        perf-profile.self.cycles-pp.cpuidle_enter_state
      0.13 ±  5%      +0.2        0.29 ±  2%  perf-profile.self.cycles-pp.mutex_lock
      0.00            +0.2        0.16 ±  2%  perf-profile.self.cycles-pp.__rdgsbase_inactive
      0.27 ±  6%      +0.2        0.44 ±  3%  perf-profile.self.cycles-pp.copyin
      0.78 ±  3%      +0.2        0.95        perf-profile.self.cycles-pp.migrate_task_rq_fair
      0.08 ±  4%      +0.2        0.26        perf-profile.self.cycles-pp.pick_next_task_fair
      0.42            +0.2        0.60        perf-profile.self.cycles-pp.ttwu_queue_wakelist
      0.02 ±141%      +0.2        0.20        perf-profile.self.cycles-pp.entry_SYSRETQ_unsafe_stack
      0.22 ±  2%      +0.2        0.41        perf-profile.self.cycles-pp.__fget_light
      0.08 ±  6%      +0.2        0.27 ±  2%  perf-profile.self.cycles-pp.cpuidle_idle_call
      0.07            +0.2        0.27        perf-profile.self.cycles-pp.syscall_exit_to_user_mode_prepare
      0.19 ±  2%      +0.2        0.40        perf-profile.self.cycles-pp._raw_spin_lock_irq
      0.36 ±  3%      +0.2        0.57        perf-profile.self.cycles-pp.update_rq_clock_task
      0.15 ±  3%      +0.2        0.36        perf-profile.self.cycles-pp.select_task_rq_fair
      0.40 ±  2%      +0.2        0.61        perf-profile.self.cycles-pp.remove_entity_load_avg
      0.29 ±  2%      +0.2        0.51        perf-profile.self.cycles-pp.vfs_read
      0.05 ±  8%      +0.2        0.27 ±  2%  perf-profile.self.cycles-pp.newidle_balance
      0.14 ±  3%      +0.2        0.36        perf-profile.self.cycles-pp.update_rq_clock
      0.28 ±  2%      +0.2        0.50        perf-profile.self.cycles-pp.vfs_write
      0.06 ±  7%      +0.2        0.29        perf-profile.self.cycles-pp.__wrgsbase_inactive
      0.34 ±  2%      +0.2        0.58        perf-profile.self.cycles-pp.dequeue_entity
      0.26 ±  2%      +0.2        0.50        perf-profile.self.cycles-pp.prepare_to_wait_event
      0.25 ±  2%      +0.2        0.49 ±  2%  perf-profile.self.cycles-pp.pipe_write
      0.38            +0.2        0.63        perf-profile.self.cycles-pp.update_curr
      0.12 ±  6%      +0.3        0.41 ±  2%  perf-profile.self.cycles-pp.mutex_unlock
      2.12 ±  2%      +0.3        2.41        perf-profile.self.cycles-pp.select_idle_core
      0.19 ±  3%      +0.3        0.50        perf-profile.self.cycles-pp.__update_load_avg_se
      0.09 ±  7%      +0.3        0.41        perf-profile.self.cycles-pp.os_xsave
      1.24            +0.3        1.57        perf-profile.self.cycles-pp.sched_mm_cid_migrate_to
      0.14 ±  7%      +0.3        0.48        perf-profile.self.cycles-pp.___perf_sw_event
      0.07            +0.3        0.42        perf-profile.self.cycles-pp.__enqueue_entity
      0.09 ±  6%      +0.4        0.45        perf-profile.self.cycles-pp.do_idle
      0.44 ±  2%      +0.4        0.82        perf-profile.self.cycles-pp.switch_fpu_return
      0.23 ±  3%      +0.4        0.61        perf-profile.self.cycles-pp.sched_ttwu_pending
      0.21 ±  2%      +0.4        0.61        perf-profile.self.cycles-pp.enqueue_task_fair
      0.14 ±  3%      +0.4        0.54        perf-profile.self.cycles-pp.reweight_entity
      0.16 ±  4%      +0.4        0.58        perf-profile.self.cycles-pp.native_sched_clock
      0.29 ±  2%      +0.4        0.72        perf-profile.self.cycles-pp.flush_smp_call_function_queue
      0.30 ±  2%      +0.4        0.74        perf-profile.self.cycles-pp.call_function_single_prep_ipi
      0.37            +0.4        0.81        perf-profile.self.cycles-pp.dequeue_task_fair
      0.32 ±  6%      +0.4        0.77        perf-profile.self.cycles-pp.copyout
      0.13 ±  5%      +0.5        0.59        perf-profile.self.cycles-pp.nohz_run_idle_balance
      0.60            +0.5        1.06        perf-profile.self.cycles-pp._find_next_bit
      0.61            +0.5        1.12        perf-profile.self.cycles-pp.finish_task_switch
      0.37 ±  6%      +0.5        0.88        perf-profile.self.cycles-pp.write
      0.40 ±  7%      +0.5        0.93 ±  2%  perf-profile.self.cycles-pp.read
      0.33 ±  2%      +0.6        0.93        perf-profile.self.cycles-pp.enqueue_entity
      0.61            +0.7        1.28        perf-profile.self.cycles-pp.select_idle_cpu
      0.69 ±  2%      +0.7        1.37        perf-profile.self.cycles-pp.select_idle_sibling
      0.38 ±  2%      +0.7        1.10        perf-profile.self.cycles-pp.llist_reverse_order
      0.59 ±  8%      +0.7        1.31 ±  2%  perf-profile.self.cycles-pp.stress_switch_pipe
      0.44            +0.7        1.18        perf-profile.self.cycles-pp.__wake_up_common
      0.58            +0.8        1.33        perf-profile.self.cycles-pp.pipe_read
      0.58 ±  2%      +0.8        1.40        perf-profile.self.cycles-pp.prepare_task_switch
      0.62            +0.9        1.51        perf-profile.self.cycles-pp.llist_add_batch
      0.68 ±  2%      +1.0        1.68        perf-profile.self.cycles-pp.__switch_to
      0.61            +1.0        1.63        perf-profile.self.cycles-pp.__switch_to_asm
      0.63 ±  2%      +1.0        1.65        perf-profile.self.cycles-pp.__update_load_avg_cfs_rq
      0.34 ±  4%      +1.0        1.37 ±  3%  perf-profile.self.cycles-pp.__bitmap_andnot
      0.30 ± 17%      +1.0        1.32        perf-profile.self.cycles-pp.__update_idle_core
      0.64 ±  2%      +1.3        1.92        perf-profile.self.cycles-pp.switch_mm_irqs_off
      0.88 ±  2%      +1.3        2.22        perf-profile.self.cycles-pp._raw_spin_lock
      1.70            +1.6        3.28        perf-profile.self.cycles-pp.__schedule
      1.28            +2.0        3.28        perf-profile.self.cycles-pp._raw_spin_lock_irqsave
      7.50            +3.6       11.06        perf-profile.self.cycles-pp.available_idle_cpu
      0.00            +4.8        4.81        perf-profile.self.cycles-pp.poll_idle





Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



end of thread, other threads:[~2023-09-06  3:53 UTC | newest]

Thread overview: 15+ messages
2023-08-23  6:08 [PATCH 0/1] Reduce cost of accessing tg->load_avg Aaron Lu
2023-08-23  6:08 ` [PATCH 1/1] sched/fair: ratelimit update to tg->load_avg Aaron Lu
2023-08-23 14:05   ` Mathieu Desnoyers
2023-08-23 14:17     ` Mathieu Desnoyers
2023-08-24  8:01     ` Aaron Lu
2023-08-24 12:56       ` Mathieu Desnoyers
2023-08-24 13:03         ` Vincent Guittot
2023-08-24 13:08           ` Mathieu Desnoyers
2023-08-24 13:24             ` Vincent Guittot
2023-08-25  6:08             ` Aaron Lu
2023-08-24 18:48   ` David Vernet
2023-08-25  6:18     ` Aaron Lu
2023-09-06  3:52   ` kernel test robot
2023-08-25 10:33 ` [PATCH 0/1] Reduce cost of accessing tg->load_avg Swapnil Sapkal
2023-08-28 11:22   ` Aaron Lu
