public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Aaron Lu <aaron.lu@intel.com>
To: Peter Zijlstra <peterz@infradead.org>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Ingo Molnar <mingo@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	"Gautham R . Shenoy" <gautham.shenoy@amd.com>,
	David Vernet <void@manifault.com>,
	Nitin Tekchandani <nitin.tekchandani@intel.com>,
	Yu Chen <yu.c.chen@intel.com>,
	Daniel Jordan <daniel.m.jordan@oracle.com>,
	Tim Chen <tim.c.chen@intel.com>,
	Swapnil Sapkal <Swapnil.Sapkal@amd.com>,
	linux-kernel@vger.kernel.org
Subject: [PATCH v2 1/1] sched/fair: ratelimit update to tg->load_avg
Date: Tue, 12 Sep 2023 14:58:08 +0800	[thread overview]
Message-ID: <20230912065808.2530-2-aaron.lu@intel.com> (raw)
In-Reply-To: <20230912065808.2530-1-aaron.lu@intel.com>

When using sysbench to benchmark Postgres in a single docker instance
with sysbench's nr_threads set to nr_cpu, it is observed there are times
update_cfs_group() and update_load_avg() shows noticeable overhead on
a 2sockets/112core/224cpu Intel Sapphire Rapids(SPR):

    13.75%    13.74%  [kernel.vmlinux]           [k] update_cfs_group
    10.63%    10.04%  [kernel.vmlinux]           [k] update_load_avg

Annotate shows the cycles are mostly spent on accessing tg->load_avg
with update_load_avg() being the write side and update_cfs_group() being
the read side. tg->load_avg is per task group and when different tasks
of the same taskgroup running on different CPUs frequently access
tg->load_avg, it can be heavily contended.

E.g. when running postgres_sysbench on a 2sockets/112cores/224cpus Intel
Sappire Rapids, during a 5s window, the wakeup number is 14millions and
migration number is 11millions and with each migration, the task's load
will transfer from src cfs_rq to target cfs_rq and each change involves
an update to tg->load_avg. Since the workload can trigger as many wakeups
and migrations, the access(both read and write) to tg->load_avg can be
unbound. As a result, the two mentioned functions showed noticeable
overhead. With netperf/nr_client=nr_cpu/UDP_RR, the problem is worse:
during a 5s window, wakeup number is 21millions and migration number is
14millions; update_cfs_group() costs ~25% and update_load_avg() costs ~16%.

Reduce the overhead by limiting updates to tg->load_avg to at most once
per ms. The update frequency is a tradeoff between tracking accuracy and
overhead. 1ms is chosen because PELT window is roughly 1ms and it
delivered good results for the tests that I've done. After this change,
the cost of accessing tg->load_avg is greatly reduced and performance
improved. Detailed test results below.

==============================
postgres_sysbench on SPR:
25%
base:   42382±19.8%
patch:  50174±9.5%  (noise)

50%
base:   67626±1.3%
patch:  67365±3.1%  (noise)

75%
base:   100216±1.2%
patch:  112470±0.1% +12.2%

100%
base:    93671±0.4%
patch:  113563±0.2% +21.2%

==============================
hackbench on ICL:
group=1
base:    114912±5.2%
patch:   117857±2.5%  (noise)

group=4
base:    359902±1.6%
patch:   361685±2.7%  (noise)

group=8
base:    461070±0.8%
patch:   491713±0.3% +6.6%

group=16
base:    309032±5.0%
patch:   378337±1.3% +22.4%

=============================
hackbench on SPR:
group=1
base:    100768±2.9%
patch:   103134±2.9%  (noise)

group=4
base:    413830±12.5%
patch:   378660±16.6% (noise)

group=8
base:    436124±0.6%
patch:   490787±3.2% +12.5%

group=16
base:    457730±3.2%
patch:   680452±1.3% +48.8%

============================
netperf/udp_rr on ICL
25%
base:    114413±0.1%
patch:   115111±0.0% +0.6%

50%
base:    86803±0.5%
patch:   86611±0.0%  (noise)

75%
base:    35959±5.3%
patch:   49801±0.6% +38.5%

100%
base:    61951±6.4%
patch:   70224±0.8% +13.4%

===========================
netperf/udp_rr on SPR
25%
base:   104954±1.3%
patch:  107312±2.8%  (noise)

50%
base:    55394±4.6%
patch:   54940±7.4%  (noise)

75%
base:    13779±3.1%
patch:   36105±1.1% +162%

100%
base:     9703±3.7%
patch:   28011±0.2% +189%

==============================================
netperf/tcp_stream on ICL (all in noise range)
25%
base:    43092±0.1%
patch:   42891±0.5%

50%
base:    19278±14.9%
patch:   22369±7.2%

75%
base:    16822±3.0%
patch:   17086±2.3%

100%
base:    18216±0.6%
patch:   18078±2.9%

===============================================
netperf/tcp_stream on SPR (all in noise range)
25%
base:    34491±0.3%
patch:   34886±0.5%

50%
base:    19278±14.9%
patch:   22369±7.2%

75%
base:    16822±3.0%
patch:   17086±2.3%

100%
base:    18216±0.6%
patch:   18078±2.9%

Reported-by: Nitin Tekchandani <nitin.tekchandani@intel.com>
Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: David Vernet <void@manifault.com>
Tested-by: Swapnil Sapkal <Swapnil.Sapkal@amd.com>
---
 kernel/sched/fair.c  | 13 ++++++++++++-
 kernel/sched/sched.h |  1 +
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0b7445cd5af9..0dd6f44c8e02 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3887,7 +3887,8 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
  */
 static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 {
-	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
+	long delta;
+	u64 now;
 
 	/*
 	 * No need to update load_avg for root_task_group as it is not used.
@@ -3895,9 +3896,19 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 	if (cfs_rq->tg == &root_task_group)
 		return;
 
+	/*
+	 * For migration heavy workloads, access to tg->load_avg can be
+	 * unbound. Limit the update rate to at most once per ms.
+	 */
+	now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
+	if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
+		return;
+
+	delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
 	if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
 		atomic_long_add(delta, &cfs_rq->tg->load_avg);
 		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
+		cfs_rq->last_update_tg_load_avg = now;
 	}
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3a01b7a2bf66..7df010ef9e5c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -594,6 +594,7 @@ struct cfs_rq {
 	} removed;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
+	u64			last_update_tg_load_avg;
 	unsigned long		tg_load_avg_contrib;
 	long			propagate;
 	long			prop_runnable_sum;
-- 
2.41.0


  reply	other threads:[~2023-09-12  6:58 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-09-12  6:58 [PATCH v2 0/1] Reduce cost of accessing tg->load_avg Aaron Lu
2023-09-12  6:58 ` Aaron Lu [this message]
2023-09-12 11:20   ` [PATCH v2 1/1] sched/fair: ratelimit update to tg->load_avg Peter Zijlstra
2023-09-13  3:20   ` Chen Yu
2023-09-17 10:12   ` [tip: sched/core] sched/fair: Ratelimit " tip-bot2 for Aaron Lu
2023-09-18  6:21   ` tip-bot2 for Aaron Lu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230912065808.2530-2-aaron.lu@intel.com \
    --to=aaron.lu@intel.com \
    --cc=Swapnil.Sapkal@amd.com \
    --cc=daniel.m.jordan@oracle.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=gautham.shenoy@amd.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=mingo@redhat.com \
    --cc=nitin.tekchandani@intel.com \
    --cc=peterz@infradead.org \
    --cc=tim.c.chen@intel.com \
    --cc=vincent.guittot@linaro.org \
    --cc=void@manifault.com \
    --cc=yu.c.chen@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox