* [RFC PATCH v2 0/1] Reduce cost of accessing tg->load_avg
@ 2023-08-16  2:48 Aaron Lu
  2023-08-16  2:48 ` [RFC PATCH 1/1] sched/fair: ratelimit update to tg->load_avg Aaron Lu
  0 siblings, 1 reply; 7+ messages in thread

From: Aaron Lu @ 2023-08-16  2:48 UTC
To: Peter Zijlstra, Vincent Guittot, Ingo Molnar, Juri Lelli
Cc: Daniel Jordan, Dietmar Eggemann, Steven Rostedt, Ben Segall,
    Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, Tim Chen,
    Nitin Tekchandani, Yu Chen, Waiman Long, Deng Pan, Mathieu Desnoyers,
    linux-kernel

Nitin Tekchandani noticed that some scheduler functions have a high cost
according to perf/cycles while running the postgres_sysbench workload.
I annotated the high-cost functions, update_cfs_group() and
update_load_avg(), and found ~90% of their cost was due to accessing
tg->load_avg. This series is an attempt to reduce the overhead of the
two functions.

Thanks to Vincent's suggestion on v1, this revision uses a simpler way
to solve the overhead problem: limit updates to tg->load_avg to at most
once per ms. Benchmarks show good results, and with the rate limit in
place the other optimizations in v1 no longer improve performance
further, so they are dropped from this revision.

Aaron Lu (1):
  sched/fair: ratelimit update to tg->load_avg

 kernel/sched/fair.c  | 13 ++++++++++++-
 kernel/sched/sched.h |  1 +
 2 files changed, 13 insertions(+), 1 deletion(-)

--
2.41.0
* [RFC PATCH 1/1] sched/fair: ratelimit update to tg->load_avg
  2023-08-16  2:48 [RFC PATCH v2 0/1] Reduce cost of accessing tg->load_avg Aaron Lu
@ 2023-08-16  2:48 ` Aaron Lu
  2023-08-16  4:42   ` Gautham R. Shenoy
  2023-08-17 12:56   ` Vincent Guittot
  0 siblings, 2 replies; 7+ messages in thread

From: Aaron Lu @ 2023-08-16  2:48 UTC
To: Peter Zijlstra, Vincent Guittot, Ingo Molnar, Juri Lelli
Cc: Daniel Jordan, Dietmar Eggemann, Steven Rostedt, Ben Segall,
    Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, Tim Chen,
    Nitin Tekchandani, Yu Chen, Waiman Long, Deng Pan, Mathieu Desnoyers,
    linux-kernel

When using sysbench to benchmark Postgres in a single docker instance
with sysbench's nr_threads set to nr_cpu, it is observed that at times
update_cfs_group() and update_load_avg() show noticeable overhead on a
2-socket/112-core/224-CPU Intel Sapphire Rapids (SPR):

    13.75%  13.74%  [kernel.vmlinux]  [k] update_cfs_group
    10.63%  10.04%  [kernel.vmlinux]  [k] update_load_avg

perf annotate shows the cycles are mostly spent on accessing
tg->load_avg, with update_load_avg() being the write side and
update_cfs_group() being the read side. tg->load_avg is per task group,
and when different tasks of the same task group running on different
CPUs frequently access it, it can be heavily contended.

The frequent access to tg->load_avg is due to task migration on the
wakeup path: e.g. when running postgres_sysbench on a
2-socket/112-core/224-CPU Intel Sapphire Rapids, during a 5s window the
wakeup count is 14 million and the migration count is 11 million, and
with each migration the task's load transfers from the src cfs_rq to
the target cfs_rq, each change involving an update to tg->load_avg.
Since the workload can trigger this many wakeups and migrations, the
access (both read and write) to tg->load_avg can be unbounded. As a
result, the two mentioned functions showed noticeable overhead. With
netperf/nr_client=nr_cpu/UDP_RR, the problem is worse: during a 5s
window, the wakeup count is 21 million and the migration count is 14
million; update_cfs_group() costs ~25% and update_load_avg() ~16%.

Reduce the overhead by limiting updates to tg->load_avg to at most once
per ms. After this change, the cost of accessing tg->load_avg is
greatly reduced and performance improved. Detailed test results below.
==============================
postgres_sysbench on SPR:
25%
base:   42382±19.8%
patch:  50174±9.5%  (noise)

50%
base:   67626±1.3%
patch:  67365±3.1%  (noise)

75%
base:  100216±1.2%
patch: 112470±0.1%  +12.2%

100%
base:   93671±0.4%
patch: 113563±0.2%  +21.2%

==============================
hackbench on ICL:
group=1
base:  114912±5.2%
patch: 117857±2.5%  (noise)

group=4
base:  359902±1.6%
patch: 361685±2.7%  (noise)

group=8
base:  461070±0.8%
patch: 491713±0.3%  +6.6%

group=16
base:  309032±5.0%
patch: 378337±1.3%  +22.4%

=============================
hackbench on SPR:
group=1
base:  100768±2.9%
patch: 103134±2.9%  (noise)

group=4
base:  413830±12.5%
patch: 378660±16.6%  (noise)

group=8
base:  436124±0.6%
patch: 490787±3.2%  +12.5%

group=16
base:  457730±3.2%
patch: 680452±1.3%  +48.8%

============================
netperf/udp_rr on ICL
25%
base:  114413±0.1%
patch: 115111±0.0%  +0.6%

50%
base:   86803±0.5%
patch:  86611±0.0%  (noise)

75%
base:   35959±5.3%
patch:  49801±0.6%  +38.5%

100%
base:   61951±6.4%
patch:  70224±0.8%  +13.4%

===========================
netperf/udp_rr on SPR
25%
base:  104954±1.3%
patch: 107312±2.8%  (noise)

50%
base:   55394±4.6%
patch:  54940±7.4%  (noise)

75%
base:   13779±3.1%
patch:  36105±1.1%  +162%

100%
base:    9703±3.7%
patch:  28011±0.2%  +189%

==============================================
netperf/tcp_stream on ICL (all in noise range)
25%
base:  43092±0.1%
patch: 42891±0.5%

50%
base:  19278±14.9%
patch: 22369±7.2%

75%
base:  16822±3.0%
patch: 17086±2.3%

100%
base:  18216±0.6%
patch: 18078±2.9%

===============================================
netperf/tcp_stream on SPR (all in noise range)
25%
base:  34491±0.3%
patch: 34886±0.5%

50%
base:  19278±14.9%
patch: 22369±7.2%

75%
base:  16822±3.0%
patch: 17086±2.3%

100%
base:  18216±0.6%
patch: 18078±2.9%

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 kernel/sched/fair.c  | 13 ++++++++++++-
 kernel/sched/sched.h |  1 +
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6e79de26a25d..ab055c72cc64 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3664,7 +3664,8 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
  */
 static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 {
-        long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
+        long delta;
+        u64 now;

         /*
          * No need to update load_avg for root_task_group as it is not used.
@@ -3672,9 +3673,19 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
         if (cfs_rq->tg == &root_task_group)
                 return;

+        /*
+         * For migration heavy workloads, access to tg->load_avg can be
+         * unbounded. Limit the update rate to at most once per ms.
+         */
+        now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
+        if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
+                return;
+
+        delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
         if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
                 atomic_long_add(delta, &cfs_rq->tg->load_avg);
                 cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
+                cfs_rq->last_update_tg_load_avg = now;
         }
 }

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 19af1766df2d..228a298bf3b5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -594,6 +594,7 @@ struct cfs_rq {

 #ifdef CONFIG_FAIR_GROUP_SCHED
         unsigned long tg_load_avg_contrib;
+        u64 last_update_tg_load_avg;
         long propagate;
         long prop_runnable_sum;

--
2.41.0
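The patch is a textbook rate-limiting pattern: keep accumulating into
the per-cfs_rq fields as before, but only publish to the contended
shared counter when at least 1 ms has passed since the last publish.
A minimal userspace sketch of the same pattern, with hypothetical
stand-ins for the kernel pieces (shared_load models tg->load_avg,
struct local_ctx models the relevant cfs_rq fields, and now_ns() is an
assumed substitute for sched_clock_cpu()):

#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

#define NSEC_PER_MSEC 1000000ULL

static atomic_long shared_load;      /* models tg->load_avg */

struct local_ctx {                   /* models one cfs_rq */
        long load_avg;               /* current local average load */
        long contrib;                /* last value published to shared_load */
        uint64_t last_update_ns;     /* models last_update_tg_load_avg */
};

static uint64_t now_ns(void)         /* hypothetical stand-in for sched_clock_cpu() */
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

static void update_shared(struct local_ctx *c)
{
        uint64_t now = now_ns();
        long delta;

        /* Rate limit: skip the contended shared update if we published
         * less than 1 ms ago; the local value keeps accumulating and a
         * later call publishes the whole delta. */
        if (now - c->last_update_ns < NSEC_PER_MSEC)
                return;

        delta = c->load_avg - c->contrib;
        if (labs(delta) > c->contrib / 64) {
                atomic_fetch_add(&shared_load, delta); /* the contended write */
                c->contrib = c->load_avg;
                c->last_update_ns = now;
        }
}

The trade-off is bounded staleness: the shared value can lag the local
one by up to 1 ms, which the benchmark results above suggest is an
acceptable price for removing the cacheline contention.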
* Re: [RFC PATCH 1/1] sched/fair: ratelimit update to tg->load_avg
  2023-08-16  2:48 ` [RFC PATCH 1/1] sched/fair: ratelimit update to tg->load_avg Aaron Lu
@ 2023-08-16  4:42   ` Gautham R. Shenoy
  2023-08-16  5:55     ` Aaron Lu
  1 sibling, 1 reply; 7+ messages in thread

From: Gautham R. Shenoy @ 2023-08-16  4:42 UTC
To: Aaron Lu
Cc: Peter Zijlstra, Vincent Guittot, Ingo Molnar, Juri Lelli,
    Daniel Jordan, Dietmar Eggemann, Steven Rostedt, Ben Segall,
    Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, Tim Chen,
    Nitin Tekchandani, Yu Chen, Waiman Long, Deng Pan, Mathieu Desnoyers,
    linux-kernel, David Vernet

Hello Aaron,

(Adding David Vernet)

On Wed, Aug 16, 2023 at 10:48:31AM +0800, Aaron Lu wrote:
> When using sysbench to benchmark Postgres in a single docker instance
> with sysbench's nr_threads set to nr_cpu, it is observed that at times
> update_cfs_group() and update_load_avg() show noticeable overhead on a
> 2-socket/112-core/224-CPU Intel Sapphire Rapids (SPR):
>
>     13.75%  13.74%  [kernel.vmlinux]  [k] update_cfs_group
>     10.63%  10.04%  [kernel.vmlinux]  [k] update_load_avg
>
> perf annotate shows the cycles are mostly spent on accessing
> tg->load_avg, with update_load_avg() being the write side and
> update_cfs_group() being the read side. tg->load_avg is per task group,
> and when different tasks of the same task group running on different
> CPUs frequently access it, it can be heavily contended.

Interestingly, I observed this contention on 2-socket EPYC servers
(Zen3 and Zen4) while running tbench and netperf with David Vernet's
shared-runqueue v3 patches. The contention was observed only when
running with the shared runqueue enabled, not otherwise.

  Overhead  Command  Shared Object     Symbol
+   20.54%  tbench   [kernel.vmlinux]  [k] update_cfs_group
+   15.78%  tbench   [kernel.vmlinux]  [k] update_load_avg

This was causing tbench (and netperf) to not scale beyond 32 clients
when the shared runqueue was enabled.

>
> The frequent access to tg->load_avg is due to task migration on the
> wakeup path: e.g. when running postgres_sysbench on a
> 2-socket/112-core/224-CPU Intel Sapphire Rapids, during a 5s window the
> wakeup count is 14 million and the migration count is 11 million, and
> with each migration the task's load transfers from the src cfs_rq to
> the target cfs_rq, each change involving an update to tg->load_avg.

With the shared-runqueue patches, we see a lot more task migrations
since the newidle_balance() path would pull tasks from the shared
runqueue. While the read of tg->load_avg is via READ_ONCE on x86, the
write is atomic.

> Since the workload can trigger this many wakeups and migrations, the
> access (both read and write) to tg->load_avg can be unbounded. As a
> result, the two mentioned functions showed noticeable overhead. With
> netperf/nr_client=nr_cpu/UDP_RR, the problem is worse: during a 5s
> window, the wakeup count is 21 million and the migration count is 14
> million; update_cfs_group() costs ~25% and update_load_avg() ~16%.
>
> Reduce the overhead by limiting updates to tg->load_avg to at most once
> per ms. After this change, the cost of accessing tg->load_avg is
> greatly reduced and performance improved. Detailed test results below.

I will try this patch with David's series today.

--
Thanks and Regards
gautham.
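For readers following along, the access pattern Gautham describes can
be modeled in a few lines of C11 (a sketch with reduced types; in the
kernel, the writer is the atomic_long_add() visible in the patch above,
and the reader is assumed here to be the atomic_long_read() done when
recomputing group shares, which compiles to a plain load on x86):

#include <stdatomic.h>

struct task_group_model {
        atomic_long load_avg;   /* the contended cacheline */
};

/* Write side, as in update_tg_load_avg(): an atomic read-modify-write
 * that takes the cacheline in exclusive state on the writing CPU. */
static void writer(struct task_group_model *tg, long delta)
{
        atomic_fetch_add_explicit(&tg->load_avg, delta,
                                  memory_order_relaxed);
}

/* Read side: a plain load on x86, but every write invalidates the
 * reader's cached copy, so a high migration rate keeps the line
 * bouncing between all CPUs running tasks of the group. */
static long reader(struct task_group_model *tg)
{
        return atomic_load_explicit(&tg->load_avg, memory_order_relaxed);
}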
* Re: [RFC PATCH 1/1] sched/fair: ratelimit update to tg->load_avg
  2023-08-16  4:42   ` Gautham R. Shenoy
@ 2023-08-16  5:55     ` Aaron Lu
  0 siblings, 0 replies; 7+ messages in thread

From: Aaron Lu @ 2023-08-16  5:55 UTC
To: Gautham R. Shenoy
Cc: Peter Zijlstra, Vincent Guittot, Ingo Molnar, Juri Lelli,
    Daniel Jordan, Dietmar Eggemann, Steven Rostedt, Ben Segall,
    Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, Tim Chen,
    Nitin Tekchandani, Yu Chen, Waiman Long, Deng Pan, Mathieu Desnoyers,
    linux-kernel, David Vernet

Hi Gautham,

Thanks for sharing this information.

On Wed, Aug 16, 2023 at 10:12:52AM +0530, Gautham R. Shenoy wrote:
> Hello Aaron,
>
> (Adding David Vernet)
>
> On Wed, Aug 16, 2023 at 10:48:31AM +0800, Aaron Lu wrote:
> > [ ...description of tg->load_avg contention snipped... ]
>
> Interestingly, I observed this contention on 2-socket EPYC servers
> (Zen3 and Zen4) while running tbench and netperf with David Vernet's
> shared-runqueue v3 patches. The contention was observed only when
> running with the shared runqueue enabled, not otherwise.

I think without the shared runqueue, the migration count is pretty low
on EPYC since it has a much smaller LLC than Intel's.

>
>   Overhead  Command  Shared Object     Symbol
> +   20.54%  tbench   [kernel.vmlinux]  [k] update_cfs_group
> +   15.78%  tbench   [kernel.vmlinux]  [k] update_load_avg
>
> This was causing tbench (and netperf) to not scale beyond 32 clients
> when the shared runqueue was enabled.
>
> With the shared-runqueue patches, we see a lot more task migrations
> since the newidle_balance() path would pull tasks from the shared
> runqueue. While the read of tg->load_avg is via READ_ONCE on x86, the
> write is atomic.

That makes sense. More migrations, more overhead from update_load_avg()
and update_cfs_group().

> I will try this patch with David's series today.

Looking forward to seeing your numbers.

Thanks,
Aaron
* Re: [RFC PATCH 1/1] sched/fair: ratelimit update to tg->load_avg
  2023-08-16  2:48 ` [RFC PATCH 1/1] sched/fair: ratelimit update to tg->load_avg Aaron Lu
  2023-08-16  4:42   ` Gautham R. Shenoy
@ 2023-08-17 12:56   ` Vincent Guittot
  2023-08-21  6:06     ` Aaron Lu
  1 sibling, 1 reply; 7+ messages in thread

From: Vincent Guittot @ 2023-08-17 12:56 UTC
To: Aaron Lu
Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Daniel Jordan,
    Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
    Daniel Bristot de Oliveira, Valentin Schneider, Tim Chen,
    Nitin Tekchandani, Yu Chen, Waiman Long, Deng Pan,
    Mathieu Desnoyers, linux-kernel

On Wed, 16 Aug 2023 at 04:48, Aaron Lu <aaron.lu@intel.com> wrote:
>
> [ ...commit message and benchmark results snipped... ]
>
> Signed-off-by: Aaron Lu <aaron.lu@intel.com>
> ---
>  kernel/sched/fair.c  | 13 ++++++++++++-
>  kernel/sched/sched.h |  1 +
>  2 files changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6e79de26a25d..ab055c72cc64 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3664,7 +3664,8 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
>   */
>  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
>  {
> -        long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> +        long delta;
> +        u64 now;
>
>         /*
>          * No need to update load_avg for root_task_group as it is not used.
> @@ -3672,9 +3673,19 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
>         if (cfs_rq->tg == &root_task_group)
>                 return;
>
> +        /*
> +         * For migration heavy workloads, access to tg->load_avg can be
> +         * unbounded. Limit the update rate to at most once per ms.
> +         */
> +        now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
> +        if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
> +                return;
> +
> +        delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
>         if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
>                 atomic_long_add(delta, &cfs_rq->tg->load_avg);
>                 cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
> +                cfs_rq->last_update_tg_load_avg = now;
>         }
>  }
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 19af1766df2d..228a298bf3b5 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -594,6 +594,7 @@ struct cfs_rq {
>
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>         unsigned long tg_load_avg_contrib;
> +        u64 last_update_tg_load_avg;

Moving last_update_tg_load_avg before tg_load_avg_contrib should
minimize the padding on 32-bit arches, as long is only 4 bytes there.

Apart from this, looks good to me.

>         long propagate;
>         long prop_runnable_sum;
>
> --
> 2.41.0
>
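To illustrate Vincent's point, here is a hypothetical pair of reduced
structs showing the layouts before and after the move, on a 32-bit ABI
that aligns u64 to 8 bytes (e.g. ARM EABI; i386 is a notable exception
that aligns u64 to only 4 bytes, where the order makes no difference):

typedef unsigned long long u64;  /* as in the kernel */

/* u64 after a 4-byte field: the compiler must insert padding. */
struct layout_before {
        unsigned long tg_load_avg_contrib; /* offset  0, 4 bytes */
        /* 4 bytes of padding inserted here */
        u64 last_update_tg_load_avg;       /* offset  8, 8 bytes */
        long propagate;                    /* offset 16, 4 bytes */
        /* 4 bytes of tail padding for 8-byte struct alignment */
};                                         /* sizeof == 24 */

/* u64 first: the 4-byte fields pack behind it with no padding. */
struct layout_after {
        u64 last_update_tg_load_avg;       /* offset  0, 8 bytes */
        unsigned long tg_load_avg_contrib; /* offset  8, 4 bytes */
        long propagate;                    /* offset 12, 4 bytes */
};                                         /* sizeof == 16 */

The usual way to verify such layouts in the kernel is pahole on the
built object files.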
* Re: [RFC PATCH 1/1] sched/fair: ratelimit update to tg->load_avg
  2023-08-17 12:56   ` Vincent Guittot
@ 2023-08-21  6:06     ` Aaron Lu
  2023-08-21 11:57       ` Vincent Guittot
  0 siblings, 1 reply; 7+ messages in thread

From: Aaron Lu @ 2023-08-21  6:06 UTC
To: Vincent Guittot
Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Daniel Jordan,
    Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
    Daniel Bristot de Oliveira, Valentin Schneider, Tim Chen,
    Nitin Tekchandani, Yu Chen, Waiman Long, Deng Pan,
    Mathieu Desnoyers, linux-kernel

On Thu, Aug 17, 2023 at 02:56:00PM +0200, Vincent Guittot wrote:
> On Wed, 16 Aug 2023 at 04:48, Aaron Lu <aaron.lu@intel.com> wrote:
> >
> > [ ...commit message, benchmark results and fair.c hunk snipped... ]
> >
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 19af1766df2d..228a298bf3b5 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -594,6 +594,7 @@ struct cfs_rq {
> >
> >  #ifdef CONFIG_FAIR_GROUP_SCHED
> >         unsigned long tg_load_avg_contrib;
> > +        u64 last_update_tg_load_avg;
>
> Moving last_update_tg_load_avg before tg_load_avg_contrib should
> minimize the padding on 32-bit arches, as long is only 4 bytes there.

Got it. That would be:

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6a8b7b9ed089..52ee7027def9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -593,6 +593,7 @@ struct cfs_rq {
         } removed;

 #ifdef CONFIG_FAIR_GROUP_SCHED
+        u64 last_update_tg_load_avg;
         unsigned long tg_load_avg_contrib;
         long propagate;
         long prop_runnable_sum;
--
2.41.0

> Apart from this, looks good to me.

Thanks a lot for your review!
Can I add your Reviewed-by in my next update with the above change?

Regards,
Aaron
* Re: [RFC PATCH 1/1] sched/fair: ratelimit update to tg->load_avg
  2023-08-21  6:06     ` Aaron Lu
@ 2023-08-21 11:57       ` Vincent Guittot
  0 siblings, 0 replies; 7+ messages in thread

From: Vincent Guittot @ 2023-08-21 11:57 UTC
To: Aaron Lu
Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Daniel Jordan,
    Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
    Daniel Bristot de Oliveira, Valentin Schneider, Tim Chen,
    Nitin Tekchandani, Yu Chen, Waiman Long, Deng Pan,
    Mathieu Desnoyers, linux-kernel

On Mon, 21 Aug 2023 at 08:06, Aaron Lu <aaron.lu@intel.com> wrote:
>
> On Thu, Aug 17, 2023 at 02:56:00PM +0200, Vincent Guittot wrote:
> > [ ...quoted patch snipped... ]
> >
> > Moving last_update_tg_load_avg before tg_load_avg_contrib should
> > minimize the padding on 32-bit arches, as long is only 4 bytes there.
>
> Got it. That would be:
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 6a8b7b9ed089..52ee7027def9 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -593,6 +593,7 @@ struct cfs_rq {
>          } removed;
>
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> +        u64 last_update_tg_load_avg;
>          unsigned long tg_load_avg_contrib;
>          long propagate;
>          long prop_runnable_sum;
> --
> 2.41.0
>
> > Apart from this, looks good to me.
>
> Thanks a lot for your review!
> Can I add your Reviewed-by in my next update with the above change?

Yes, with the above:

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>

>
> Regards,
> Aaron
end of thread, other threads: [~2023-08-21 11:57 UTC | newest]

Thread overview: 7+ messages:
2023-08-16  2:48 [RFC PATCH v2 0/1] Reduce cost of accessing tg->load_avg Aaron Lu
2023-08-16  2:48 ` [RFC PATCH 1/1] sched/fair: ratelimit update to tg->load_avg Aaron Lu
2023-08-16  4:42   ` Gautham R. Shenoy
2023-08-16  5:55     ` Aaron Lu
2023-08-17 12:56   ` Vincent Guittot
2023-08-21  6:06     ` Aaron Lu
2023-08-21 11:57       ` Vincent Guittot