public inbox for linux-kernel@vger.kernel.org
* [PATCH v2 0/1] Reduce cost of accessing tg->load_avg
@ 2023-09-12  6:58 Aaron Lu
  2023-09-12  6:58 ` [PATCH v2 1/1] sched/fair: ratelimit update to tg->load_avg Aaron Lu
  0 siblings, 1 reply; 6+ messages in thread
From: Aaron Lu @ 2023-09-12  6:58 UTC (permalink / raw)
  To: Peter Zijlstra, Vincent Guittot, Ingo Molnar
  Cc: Dietmar Eggemann, Mathieu Desnoyers, Gautham R . Shenoy,
	David Vernet, Nitin Tekchandani, Yu Chen, Daniel Jordan, Tim Chen,
	Swapnil Sapkal, linux-kernel

v2:
- Rebased on top of tag sched-core-2023-08-28; it also applies cleanly on
  top of v6.6-rc1;
- Explained in the changelog why updates are ratelimited to once per ms,
  as suggested by David Vernet;
- Collected reviewed-by and tested-by tags; thank you all for the reviews
  and tests!

After the rebase, I did a new run of the postgres_sysbench workload on
Intel Sapphire Rapids and the data is about the same as in v1.
Considering there is not much change to load tracking in v6.6, I've kept
the old data.

RFC v2 -> v1:
- drop RFC;
- move cfs_rq->last_update_tg_load_avg before cfs_rq->tg_load_avg_contrib;
- add Vincent's reviewed-by tag.

RFC v2:
Nitin Tekchandani noticed some scheduler functions have high cost
according to perf/cycles while running the postgres_sysbench workload.
I perf/annotated the high cost functions update_cfs_group() and
update_load_avg() and found ~90% of the cost was due to accessing
tg->load_avg. This series is an attempt to reduce the overhead of
the two functions.

Thanks to Vincent's suggestion from v1, this revision uses a simpler way
to solve the overhead problem: limiting updates to tg->load_avg to at
most once per ms. Benchmarks show good results and, with the rate limit
in place, the other optimizations in v1 don't improve performance
further, so they are dropped from this revision.
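
The actual change is in the patch that follows; as a quick preview, here
is a minimal sketch of the ratelimit pattern it applies inside
update_tg_load_avg() (all names are taken from the patch below):

	now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
	if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
		return;

	/* otherwise fall through to the existing tg->load_avg update and
	 * record cfs_rq->last_update_tg_load_avg = now when the update
	 * actually happens. */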

Aaron Lu (1):
  sched/fair: ratelimit update to tg->load_avg

 kernel/sched/fair.c  | 13 ++++++++++++-
 kernel/sched/sched.h |  1 +
 2 files changed, 13 insertions(+), 1 deletion(-)

-- 
2.41.0



* [PATCH v2 1/1] sched/fair: ratelimit update to tg->load_avg
  2023-09-12  6:58 [PATCH v2 0/1] Reduce cost of accessing tg->load_avg Aaron Lu
@ 2023-09-12  6:58 ` Aaron Lu
  2023-09-12 11:20   ` Peter Zijlstra
                     ` (3 more replies)
  0 siblings, 4 replies; 6+ messages in thread
From: Aaron Lu @ 2023-09-12  6:58 UTC (permalink / raw)
  To: Peter Zijlstra, Vincent Guittot, Ingo Molnar
  Cc: Dietmar Eggemann, Mathieu Desnoyers, Gautham R . Shenoy,
	David Vernet, Nitin Tekchandani, Yu Chen, Daniel Jordan, Tim Chen,
	Swapnil Sapkal, linux-kernel

When using sysbench to benchmark Postgres in a single docker instance
with sysbench's nr_threads set to nr_cpu, it is observed that at times
update_cfs_group() and update_load_avg() show noticeable overhead on
a 2-socket/112-core/224-CPU Intel Sapphire Rapids (SPR):

    13.75%    13.74%  [kernel.vmlinux]           [k] update_cfs_group
    10.63%    10.04%  [kernel.vmlinux]           [k] update_load_avg

Annotation shows the cycles are mostly spent on accessing tg->load_avg,
with update_load_avg() being the write side and update_cfs_group() being
the read side. tg->load_avg is per task group and when different tasks
of the same task group running on different CPUs frequently access
tg->load_avg, it can be heavily contended.
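
To make the contention concrete, the two sides look roughly like this
(simplified sketch; the write side is from update_tg_load_avg() as
changed below, the read side is the calc_group_shares() path used by
update_cfs_group() and is not part of this patch):

	/* write side: a cfs_rq's contribution to the group load changed */
	atomic_long_add(delta, &cfs_rq->tg->load_avg);

	/* read side: recompute this group entity's share of the weight */
	tg_weight = atomic_long_read(&tg->load_avg);

Both sides touch the same per-task-group location from every CPU the
group's tasks run on, so with enough wakeups/migrations the cacheline
keeps bouncing across all 224 CPUs.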

E.g. when running postgres_sysbench on a 2-socket/112-core/224-CPU Intel
Sapphire Rapids, during a 5s window there are ~14 million wakeups and
~11 million migrations. With each migration, the task's load is
transferred from the src cfs_rq to the target cfs_rq, and each such
change involves an update to tg->load_avg. Since the workload triggers
this many wakeups and migrations, the accesses (both read and write) to
tg->load_avg are effectively unbounded. As a result, the two mentioned
functions show noticeable overhead. With netperf/nr_client=nr_cpu/UDP_RR,
the problem is worse: during a 5s window there are ~21 million wakeups
and ~14 million migrations; update_cfs_group() costs ~25% and
update_load_avg() costs ~16%.
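
As a rough back-of-the-envelope: ~11 million migrations in 5s is ~2.2
million tg->load_avg updates per second (potentially more, since both
the src and target cfs_rq get updated), spread across 224 CPUs. With the
1ms ratelimit introduced below, each cfs_rq writes tg->load_avg at most
1000 times per second, i.e. at most ~224,000 writes per second per task
group on this machine, roughly an order of magnitude fewer for this
workload.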

Reduce the overhead by limiting updates to tg->load_avg to at most once
per ms. The update frequency is a tradeoff between tracking accuracy and
overhead. 1ms is chosen because the PELT window is roughly 1ms and it
delivered good results in the tests I've done. After this change, the
cost of accessing tg->load_avg is greatly reduced and performance
improves. Detailed test results below.

==============================
postgres_sysbench on SPR:
25%
base:   42382±19.8%
patch:  50174±9.5%  (noise)

50%
base:   67626±1.3%
patch:  67365±3.1%  (noise)

75%
base:   100216±1.2%
patch:  112470±0.1% +12.2%

100%
base:    93671±0.4%
patch:  113563±0.2% +21.2%

==============================
hackbench on ICL:
group=1
base:    114912±5.2%
patch:   117857±2.5%  (noise)

group=4
base:    359902±1.6%
patch:   361685±2.7%  (noise)

group=8
base:    461070±0.8%
patch:   491713±0.3% +6.6%

group=16
base:    309032±5.0%
patch:   378337±1.3% +22.4%

=============================
hackbench on SPR:
group=1
base:    100768±2.9%
patch:   103134±2.9%  (noise)

group=4
base:    413830±12.5%
patch:   378660±16.6% (noise)

group=8
base:    436124±0.6%
patch:   490787±3.2% +12.5%

group=16
base:    457730±3.2%
patch:   680452±1.3% +48.8%

============================
netperf/udp_rr on ICL
25%
base:    114413±0.1%
patch:   115111±0.0% +0.6%

50%
base:    86803±0.5%
patch:   86611±0.0%  (noise)

75%
base:    35959±5.3%
patch:   49801±0.6% +38.5%

100%
base:    61951±6.4%
patch:   70224±0.8% +13.4%

===========================
netperf/udp_rr on SPR
25%
base:   104954±1.3%
patch:  107312±2.8%  (noise)

50%
base:    55394±4.6%
patch:   54940±7.4%  (noise)

75%
base:    13779±3.1%
patch:   36105±1.1% +162%

100%
base:     9703±3.7%
patch:   28011±0.2% +189%

==============================================
netperf/tcp_stream on ICL (all in noise range)
25%
base:    43092±0.1%
patch:   42891±0.5%

50%
base:    19278±14.9%
patch:   22369±7.2%

75%
base:    16822±3.0%
patch:   17086±2.3%

100%
base:    18216±0.6%
patch:   18078±2.9%

===============================================
netperf/tcp_stream on SPR (all in noise range)
25%
base:    34491±0.3%
patch:   34886±0.5%

50%
base:    19278±14.9%
patch:   22369±7.2%

75%
base:    16822±3.0%
patch:   17086±2.3%

100%
base:    18216±0.6%
patch:   18078±2.9%

Reported-by: Nitin Tekchandani <nitin.tekchandani@intel.com>
Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: David Vernet <void@manifault.com>
Tested-by: Swapnil Sapkal <Swapnil.Sapkal@amd.com>
---
 kernel/sched/fair.c  | 13 ++++++++++++-
 kernel/sched/sched.h |  1 +
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0b7445cd5af9..0dd6f44c8e02 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3887,7 +3887,8 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
  */
 static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 {
-	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
+	long delta;
+	u64 now;
 
 	/*
 	 * No need to update load_avg for root_task_group as it is not used.
@@ -3895,9 +3896,19 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 	if (cfs_rq->tg == &root_task_group)
 		return;
 
+	/*
+	 * For migration heavy workloads, access to tg->load_avg can be
+	 * unbound. Limit the update rate to at most once per ms.
+	 */
+	now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
+	if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
+		return;
+
+	delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
 	if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
 		atomic_long_add(delta, &cfs_rq->tg->load_avg);
 		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
+		cfs_rq->last_update_tg_load_avg = now;
 	}
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3a01b7a2bf66..7df010ef9e5c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -594,6 +594,7 @@ struct cfs_rq {
 	} removed;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
+	u64			last_update_tg_load_avg;
 	unsigned long		tg_load_avg_contrib;
 	long			propagate;
 	long			prop_runnable_sum;
-- 
2.41.0



* Re: [PATCH v2 1/1] sched/fair: ratelimit update to tg->load_avg
  2023-09-12  6:58 ` [PATCH v2 1/1] sched/fair: ratelimit update to tg->load_avg Aaron Lu
@ 2023-09-12 11:20   ` Peter Zijlstra
  2023-09-13  3:20   ` Chen Yu
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 6+ messages in thread
From: Peter Zijlstra @ 2023-09-12 11:20 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Vincent Guittot, Ingo Molnar, Dietmar Eggemann, Mathieu Desnoyers,
	Gautham R . Shenoy, David Vernet, Nitin Tekchandani, Yu Chen,
	Daniel Jordan, Tim Chen, Swapnil Sapkal, linux-kernel

On Tue, Sep 12, 2023 at 02:58:08PM +0800, Aaron Lu wrote:
> When using sysbench to benchmark Postgres in a single docker instance
> with sysbench's nr_threads set to nr_cpu, it is observed that at times
> update_cfs_group() and update_load_avg() show noticeable overhead on
> a 2-socket/112-core/224-CPU Intel Sapphire Rapids (SPR):
> 
>     13.75%    13.74%  [kernel.vmlinux]           [k] update_cfs_group
>     10.63%    10.04%  [kernel.vmlinux]           [k] update_load_avg
> 
> Annotation shows the cycles are mostly spent on accessing tg->load_avg,
> with update_load_avg() being the write side and update_cfs_group() being
> the read side. tg->load_avg is per task group and when different tasks
> of the same task group running on different CPUs frequently access
> tg->load_avg, it can be heavily contended.
> 
> E.g. when running postgres_sysbench on a 2-socket/112-core/224-CPU Intel
> Sapphire Rapids, during a 5s window there are ~14 million wakeups and
> ~11 million migrations. With each migration, the task's load is
> transferred from the src cfs_rq to the target cfs_rq, and each such
> change involves an update to tg->load_avg. Since the workload triggers
> this many wakeups and migrations, the accesses (both read and write) to
> tg->load_avg are effectively unbounded. As a result, the two mentioned
> functions show noticeable overhead. With netperf/nr_client=nr_cpu/UDP_RR,
> the problem is worse: during a 5s window there are ~21 million wakeups
> and ~14 million migrations; update_cfs_group() costs ~25% and
> update_load_avg() costs ~16%.
> 
> Reduce the overhead by limiting updates to tg->load_avg to at most once
> per ms. The update frequency is a tradeoff between tracking accuracy and
> overhead. 1ms is chosen because the PELT window is roughly 1ms and it
> delivered good results in the tests I've done. After this change, the
> cost of accessing tg->load_avg is greatly reduced and performance
> improves. Detailed test results below.
> 

> 
> Reported-by: Nitin Tekchandani <nitin.tekchandani@intel.com>
> Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
> Signed-off-by: Aaron Lu <aaron.lu@intel.com>
> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Reviewed-by: David Vernet <void@manifault.com>
> Tested-by: Swapnil Sapkal <Swapnil.Sapkal@amd.com>

Thanks!


* Re: [PATCH v2 1/1] sched/fair: ratelimit update to tg->load_avg
  2023-09-12  6:58 ` [PATCH v2 1/1] sched/fair: ratelimit update to tg->load_avg Aaron Lu
  2023-09-12 11:20   ` Peter Zijlstra
@ 2023-09-13  3:20   ` Chen Yu
  2023-09-17 10:12   ` [tip: sched/core] sched/fair: Ratelimit " tip-bot2 for Aaron Lu
  2023-09-18  6:21   ` tip-bot2 for Aaron Lu
  3 siblings, 0 replies; 6+ messages in thread
From: Chen Yu @ 2023-09-13  3:20 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Peter Zijlstra, Vincent Guittot, Ingo Molnar, Dietmar Eggemann,
	Mathieu Desnoyers, Gautham R . Shenoy, David Vernet,
	Nitin Tekchandani, Daniel Jordan, Tim Chen, Swapnil Sapkal,
	linux-kernel

On 2023-09-12 at 14:58:08 +0800, Aaron Lu wrote:
> When using sysbench to benchmark Postgres in a single docker instance
> with sysbench's nr_threads set to nr_cpu, it is observed that at times
> update_cfs_group() and update_load_avg() show noticeable overhead on
> a 2-socket/112-core/224-CPU Intel Sapphire Rapids (SPR):
> 
>     13.75%    13.74%  [kernel.vmlinux]           [k] update_cfs_group
>     10.63%    10.04%  [kernel.vmlinux]           [k] update_load_avg
> 
> Annotation shows the cycles are mostly spent on accessing tg->load_avg,
> with update_load_avg() being the write side and update_cfs_group() being
> the read side. tg->load_avg is per task group and when different tasks
> of the same task group running on different CPUs frequently access
> tg->load_avg, it can be heavily contended.
> 
> E.g. when running postgres_sysbench on a 2-socket/112-core/224-CPU Intel
> Sapphire Rapids, during a 5s window there are ~14 million wakeups and
> ~11 million migrations. With each migration, the task's load is
> transferred from the src cfs_rq to the target cfs_rq, and each such
> change involves an update to tg->load_avg. Since the workload triggers
> this many wakeups and migrations, the accesses (both read and write) to
> tg->load_avg are effectively unbounded. As a result, the two mentioned
> functions show noticeable overhead. With netperf/nr_client=nr_cpu/UDP_RR,
> the problem is worse: during a 5s window there are ~21 million wakeups
> and ~14 million migrations; update_cfs_group() costs ~25% and
> update_load_avg() costs ~16%.
> 
> Reduce the overhead by limiting updates to tg->load_avg to at most once
> per ms. The update frequency is a tradeoff between tracking accuracy and
> overhead. 1ms is chosen because the PELT window is roughly 1ms and it
> delivered good results in the tests I've done. After this change, the
> cost of accessing tg->load_avg is greatly reduced and performance
> improves. Detailed test results below.
> 
> ==============================
> postgres_sysbench on SPR:
> 25%
> base:   42382±19.8%
> patch:  50174±9.5%  (noise)
> 
> 50%
> base:   67626±1.3%
> patch:  67365±3.1%  (noise)
> 
> 75%
> base:   100216±1.2%
> patch:  112470±0.1% +12.2%
> 
> 100%
> base:    93671±0.4%
> patch:  113563±0.2% +21.2%
> 
> ==============================
> hackbench on ICL:
> group=1
> base:    114912±5.2%
> patch:   117857±2.5%  (noise)
> 
> group=4
> base:    359902±1.6%
> patch:   361685±2.7%  (noise)
> 
> group=8
> base:    461070±0.8%
> patch:   491713±0.3% +6.6%
> 
> group=16
> base:    309032±5.0%
> patch:   378337±1.3% +22.4%
> 
> =============================
> hackbench on SPR:
> group=1
> base:    100768±2.9%
> patch:   103134±2.9%  (noise)
> 
> group=4
> base:    413830±12.5%
> patch:   378660±16.6% (noise)
> 
> group=8
> base:    436124±0.6%
> patch:   490787±3.2% +12.5%
> 
> group=16
> base:    457730±3.2%
> patch:   680452±1.3% +48.8%
> 
> ============================
> netperf/udp_rr on ICL
> 25%
> base:    114413±0.1%
> patch:   115111±0.0% +0.6%
> 
> 50%
> base:    86803±0.5%
> patch:   86611±0.0%  (noise)
> 
> 75%
> base:    35959±5.3%
> patch:   49801±0.6% +38.5%
> 
> 100%
> base:    61951±6.4%
> patch:   70224±0.8% +13.4%
> 
> ===========================
> netperf/udp_rr on SPR
> 25%
> base:   104954±1.3%
> patch:  107312±2.8%  (noise)
> 
> 50%
> base:    55394±4.6%
> patch:   54940±7.4%  (noise)
> 
> 75%
> base:    13779±3.1%
> patch:   36105±1.1% +162%
> 
> 100%
> base:     9703±3.7%
> patch:   28011±0.2% +189%
> 
> ==============================================
> netperf/tcp_stream on ICL (all in noise range)
> 25%
> base:    43092±0.1%
> patch:   42891±0.5%
> 
> 50%
> base:    19278±14.9%
> patch:   22369±7.2%
> 
> 75%
> base:    16822±3.0%
> patch:   17086±2.3%
> 
> 100%
> base:    18216±0.6%
> patch:   18078±2.9%
> 
> ===============================================
> netperf/tcp_stream on SPR (all in noise range)
> 25%
> base:    34491±0.3%
> patch:   34886±0.5%
> 
> 50%
> base:    19278±14.9%
> patch:   22369±7.2%
> 
> 75%
> base:    16822±3.0%
> patch:   17086±2.3%
> 
> 100%
> base:    18216±0.6%
> patch:   18078±2.9%
> 
> Reported-by: Nitin Tekchandani <nitin.tekchandani@intel.com>
> Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
> Signed-off-by: Aaron Lu <aaron.lu@intel.com>
> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Reviewed-by: David Vernet <void@manifault.com>
> Tested-by: Swapnil Sapkal <Swapnil.Sapkal@amd.com>
> ---
>

Since we know this patch brings good improvements for netperf and
hackbench, I did some further verification with tbench/schbench on an
Ice Lake Xeon Platinum 8360Y, and it reports good results too:

schbench
========
case            	load    	baseline(std%)	compare%( std%)
normal          	1-mthreads	 1.00 (  1.70)	 +0.00 (  0.00)
normal          	2-mthreads	 1.00 (  2.32)	 -0.62 (  5.24)
normal          	4-mthreads	 1.00 (  3.17)	 -1.86 (  3.11)

tbench
======
case            	load    	baseline(std%)	compare%( std%)
loopback        	36-threads	 1.00 (  2.80)	 +1.85 (  0.45)
loopback        	72-threads	 1.00 (  0.27)	 -0.20 (  0.51)
loopback        	108-threads	 1.00 (  0.06)	+21.92 (  0.10)
loopback        	144-threads	 1.00 (  1.47)	+28.42 (  0.11)

Tested-by: Chen Yu <yu.c.chen@intel.com>

thanks,
Chenyu


* [tip: sched/core] sched/fair: Ratelimit update to tg->load_avg
  2023-09-12  6:58 ` [PATCH v2 1/1] sched/fair: ratelimit update to tg->load_avg Aaron Lu
  2023-09-12 11:20   ` Peter Zijlstra
  2023-09-13  3:20   ` Chen Yu
@ 2023-09-17 10:12   ` tip-bot2 for Aaron Lu
  2023-09-18  6:21   ` tip-bot2 for Aaron Lu
  3 siblings, 0 replies; 6+ messages in thread
From: tip-bot2 for Aaron Lu @ 2023-09-17 10:12 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Nitin Tekchandani, Vincent Guittot, Aaron Lu,
	Peter Zijlstra (Intel), Mathieu Desnoyers, David Vernet,
	Swapnil Sapkal, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     afc1996859a2c26fd1190ec2b9ccf26e700e8ed7
Gitweb:        https://git.kernel.org/tip/afc1996859a2c26fd1190ec2b9ccf26e700e8ed7
Author:        Aaron Lu <aaron.lu@intel.com>
AuthorDate:    Tue, 12 Sep 2023 14:58:08 +08:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Sat, 16 Sep 2023 11:16:02 +02:00

sched/fair: Ratelimit update to tg->load_avg

When using sysbench to benchmark Postgres in a single docker instance
with sysbench's nr_threads set to nr_cpu, it is observed that at times
update_cfs_group() and update_load_avg() show noticeable overhead on
a 2-socket/112-core/224-CPU Intel Sapphire Rapids (SPR):

    13.75%    13.74%  [kernel.vmlinux]           [k] update_cfs_group
    10.63%    10.04%  [kernel.vmlinux]           [k] update_load_avg

Annotation shows the cycles are mostly spent on accessing tg->load_avg,
with update_load_avg() being the write side and update_cfs_group() being
the read side. tg->load_avg is per task group and when different tasks
of the same task group running on different CPUs frequently access
tg->load_avg, it can be heavily contended.

E.g. when running postgres_sysbench on a 2-socket/112-core/224-CPU Intel
Sapphire Rapids, during a 5s window there are ~14 million wakeups and
~11 million migrations. With each migration, the task's load is
transferred from the src cfs_rq to the target cfs_rq, and each such
change involves an update to tg->load_avg. Since the workload triggers
this many wakeups and migrations, the accesses (both read and write) to
tg->load_avg are effectively unbounded. As a result, the two mentioned
functions show noticeable overhead. With netperf/nr_client=nr_cpu/UDP_RR,
the problem is worse: during a 5s window there are ~21 million wakeups
and ~14 million migrations; update_cfs_group() costs ~25% and
update_load_avg() costs ~16%.

Reduce the overhead by limiting updates to tg->load_avg to at most once
per ms. The update frequency is a tradeoff between tracking accuracy and
overhead. 1ms is chosen because the PELT window is roughly 1ms and it
delivered good results in the tests I've done. After this change, the
cost of accessing tg->load_avg is greatly reduced and performance
improves. Detailed test results below.

  ==============================
  postgres_sysbench on SPR:
  25%
  base:   42382±19.8%
  patch:  50174±9.5%  (noise)

  50%
  base:   67626±1.3%
  patch:  67365±3.1%  (noise)

  75%
  base:   100216±1.2%
  patch:  112470±0.1% +12.2%

  100%
  base:    93671±0.4%
  patch:  113563±0.2% +21.2%

  ==============================
  hackbench on ICL:
  group=1
  base:    114912±5.2%
  patch:   117857±2.5%  (noise)

  group=4
  base:    359902±1.6%
  patch:   361685±2.7%  (noise)

  group=8
  base:    461070±0.8%
  patch:   491713±0.3% +6.6%

  group=16
  base:    309032±5.0%
  patch:   378337±1.3% +22.4%

  =============================
  hackbench on SPR:
  group=1
  base:    100768±2.9%
  patch:   103134±2.9%  (noise)

  group=4
  base:    413830±12.5%
  patch:   378660±16.6% (noise)

  group=8
  base:    436124±0.6%
  patch:   490787±3.2% +12.5%

  group=16
  base:    457730±3.2%
  patch:   680452±1.3% +48.8%

  ============================
  netperf/udp_rr on ICL
  25%
  base:    114413±0.1%
  patch:   115111±0.0% +0.6%

  50%
  base:    86803±0.5%
  patch:   86611±0.0%  (noise)

  75%
  base:    35959±5.3%
  patch:   49801±0.6% +38.5%

  100%
  base:    61951±6.4%
  patch:   70224±0.8% +13.4%

  ===========================
  netperf/udp_rr on SPR
  25%
  base:   104954±1.3%
  patch:  107312±2.8%  (noise)

  50%
  base:    55394±4.6%
  patch:   54940±7.4%  (noise)

  75%
  base:    13779±3.1%
  patch:   36105±1.1% +162%

  100%
  base:     9703±3.7%
  patch:   28011±0.2% +189%

  ==============================================
  netperf/tcp_stream on ICL (all in noise range)
  25%
  base:    43092±0.1%
  patch:   42891±0.5%

  50%
  base:    19278±14.9%
  patch:   22369±7.2%

  75%
  base:    16822±3.0%
  patch:   17086±2.3%

  100%
  base:    18216±0.6%
  patch:   18078±2.9%

  ===============================================
  netperf/tcp_stream on SPR (all in noise range)
  25%
  base:    34491±0.3%
  patch:   34886±0.5%

  50%
  base:    19278±14.9%
  patch:   22369±7.2%

  75%
  base:    16822±3.0%
  patch:   17086±2.3%

  100%
  base:    18216±0.6%
  patch:   18078±2.9%

Reported-by: Nitin Tekchandani <nitin.tekchandani@intel.com>
Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: David Vernet <void@manifault.com>
Tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Tested-by: Swapnil Sapkal <Swapnil.Sapkal@amd.com>
Link: https://lkml.kernel.org/r/20230912065808.2530-2-aaron.lu@intel.com
---
 kernel/sched/fair.c  | 13 ++++++++++++-
 kernel/sched/sched.h |  1 +
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c893721..d087787 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3876,7 +3876,8 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
  */
 static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 {
-	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
+	long delta;
+	u64 now;
 
 	/*
 	 * No need to update load_avg for root_task_group as it is not used.
@@ -3884,9 +3885,19 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 	if (cfs_rq->tg == &root_task_group)
 		return;
 
+	/*
+	 * For migration heavy workloads, access to tg->load_avg can be
+	 * unbound. Limit the update rate to at most once per ms.
+	 */
+	now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
+	if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
+		return;
+
+	delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
 	if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
 		atomic_long_add(delta, &cfs_rq->tg->load_avg);
 		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
+		cfs_rq->last_update_tg_load_avg = now;
 	}
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 68768f4..887468c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -594,6 +594,7 @@ struct cfs_rq {
 	} removed;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
+	u64			last_update_tg_load_avg;
 	unsigned long		tg_load_avg_contrib;
 	long			propagate;
 	long			prop_runnable_sum;


* [tip: sched/core] sched/fair: Ratelimit update to tg->load_avg
  2023-09-12  6:58 ` [PATCH v2 1/1] sched/fair: ratelimit update to tg->load_avg Aaron Lu
                     ` (2 preceding siblings ...)
  2023-09-17 10:12   ` [tip: sched/core] sched/fair: Ratelimit " tip-bot2 for Aaron Lu
@ 2023-09-18  6:21   ` tip-bot2 for Aaron Lu
  3 siblings, 0 replies; 6+ messages in thread
From: tip-bot2 for Aaron Lu @ 2023-09-18  6:21 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Nitin Tekchandani, Vincent Guittot, Aaron Lu,
	Peter Zijlstra (Intel), Ingo Molnar, Mathieu Desnoyers,
	David Vernet, Swapnil Sapkal, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     1528c661c24b407e92194426b0adbb43de859ce0
Gitweb:        https://git.kernel.org/tip/1528c661c24b407e92194426b0adbb43de859ce0
Author:        Aaron Lu <aaron.lu@intel.com>
AuthorDate:    Tue, 12 Sep 2023 14:58:08 +08:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Mon, 18 Sep 2023 08:14:45 +02:00

sched/fair: Ratelimit update to tg->load_avg

When using sysbench to benchmark Postgres in a single docker instance
with sysbench's nr_threads set to nr_cpu, it is observed that at times
update_cfs_group() and update_load_avg() show noticeable overhead on
a 2-socket/112-core/224-CPU Intel Sapphire Rapids (SPR):

    13.75%    13.74%  [kernel.vmlinux]           [k] update_cfs_group
    10.63%    10.04%  [kernel.vmlinux]           [k] update_load_avg

Annotation shows the cycles are mostly spent on accessing tg->load_avg,
with update_load_avg() being the write side and update_cfs_group() being
the read side. tg->load_avg is per task group and when different tasks
of the same task group running on different CPUs frequently access
tg->load_avg, it can be heavily contended.

E.g. when running postgres_sysbench on a 2-socket/112-core/224-CPU Intel
Sapphire Rapids, during a 5s window there are ~14 million wakeups and
~11 million migrations. With each migration, the task's load is
transferred from the src cfs_rq to the target cfs_rq, and each such
change involves an update to tg->load_avg. Since the workload triggers
this many wakeups and migrations, the accesses (both read and write) to
tg->load_avg are effectively unbounded. As a result, the two mentioned
functions show noticeable overhead. With netperf/nr_client=nr_cpu/UDP_RR,
the problem is worse: during a 5s window there are ~21 million wakeups
and ~14 million migrations; update_cfs_group() costs ~25% and
update_load_avg() costs ~16%.

Reduce the overhead by limiting updates to tg->load_avg to at most once
per ms. The update frequency is a tradeoff between tracking accuracy and
overhead. 1ms is chosen because the PELT window is roughly 1ms and it
delivered good results in the tests I've done. After this change, the
cost of accessing tg->load_avg is greatly reduced and performance
improves. Detailed test results below.

  ==============================
  postgres_sysbench on SPR:
  25%
  base:   42382±19.8%
  patch:  50174±9.5%  (noise)

  50%
  base:   67626±1.3%
  patch:  67365±3.1%  (noise)

  75%
  base:   100216±1.2%
  patch:  112470±0.1% +12.2%

  100%
  base:    93671±0.4%
  patch:  113563±0.2% +21.2%

  ==============================
  hackbench on ICL:
  group=1
  base:    114912±5.2%
  patch:   117857±2.5%  (noise)

  group=4
  base:    359902±1.6%
  patch:   361685±2.7%  (noise)

  group=8
  base:    461070±0.8%
  patch:   491713±0.3% +6.6%

  group=16
  base:    309032±5.0%
  patch:   378337±1.3% +22.4%

  =============================
  hackbench on SPR:
  group=1
  base:    100768±2.9%
  patch:   103134±2.9%  (noise)

  group=4
  base:    413830±12.5%
  patch:   378660±16.6% (noise)

  group=8
  base:    436124±0.6%
  patch:   490787±3.2% +12.5%

  group=16
  base:    457730±3.2%
  patch:   680452±1.3% +48.8%

  ============================
  netperf/udp_rr on ICL
  25%
  base:    114413±0.1%
  patch:   115111±0.0% +0.6%

  50%
  base:    86803±0.5%
  patch:   86611±0.0%  (noise)

  75%
  base:    35959±5.3%
  patch:   49801±0.6% +38.5%

  100%
  base:    61951±6.4%
  patch:   70224±0.8% +13.4%

  ===========================
  netperf/udp_rr on SPR
  25%
  base:   104954±1.3%
  patch:  107312±2.8%  (noise)

  50%
  base:    55394±4.6%
  patch:   54940±7.4%  (noise)

  75%
  base:    13779±3.1%
  patch:   36105±1.1% +162%

  100%
  base:     9703±3.7%
  patch:   28011±0.2% +189%

  ==============================================
  netperf/tcp_stream on ICL (all in noise range)
  25%
  base:    43092±0.1%
  patch:   42891±0.5%

  50%
  base:    19278±14.9%
  patch:   22369±7.2%

  75%
  base:    16822±3.0%
  patch:   17086±2.3%

  100%
  base:    18216±0.6%
  patch:   18078±2.9%

  ===============================================
  netperf/tcp_stream on SPR (all in noise range)
  25%
  base:    34491±0.3%
  patch:   34886±0.5%

  50%
  base:    19278±14.9%
  patch:   22369±7.2%

  75%
  base:    16822±3.0%
  patch:   17086±2.3%

  100%
  base:    18216±0.6%
  patch:   18078±2.9%

Reported-by: Nitin Tekchandani <nitin.tekchandani@intel.com>
Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: David Vernet <void@manifault.com>
Tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Tested-by: Swapnil Sapkal <Swapnil.Sapkal@amd.com>
Link: https://lkml.kernel.org/r/20230912065808.2530-2-aaron.lu@intel.com
---
 kernel/sched/fair.c  | 13 ++++++++++++-
 kernel/sched/sched.h |  1 +
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c893721..d087787 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3876,7 +3876,8 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
  */
 static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 {
-	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
+	long delta;
+	u64 now;
 
 	/*
 	 * No need to update load_avg for root_task_group as it is not used.
@@ -3884,9 +3885,19 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 	if (cfs_rq->tg == &root_task_group)
 		return;
 
+	/*
+	 * For migration heavy workloads, access to tg->load_avg can be
+	 * unbound. Limit the update rate to at most once per ms.
+	 */
+	now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
+	if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
+		return;
+
+	delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
 	if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
 		atomic_long_add(delta, &cfs_rq->tg->load_avg);
 		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
+		cfs_rq->last_update_tg_load_avg = now;
 	}
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 68768f4..887468c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -594,6 +594,7 @@ struct cfs_rq {
 	} removed;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
+	u64			last_update_tg_load_avg;
 	unsigned long		tg_load_avg_contrib;
 	long			propagate;
 	long			prop_runnable_sum;


Thread overview: 6 messages
2023-09-12  6:58 [PATCH v2 0/1] Reduce cost of accessing tg->load_avg Aaron Lu
2023-09-12  6:58 ` [PATCH v2 1/1] sched/fair: ratelimit update to tg->load_avg Aaron Lu
2023-09-12 11:20   ` Peter Zijlstra
2023-09-13  3:20   ` Chen Yu
2023-09-17 10:12   ` [tip: sched/core] sched/fair: Ratelimit " tip-bot2 for Aaron Lu
2023-09-18  6:21   ` tip-bot2 for Aaron Lu
