All of lore.kernel.org
 help / color / mirror / Atom feed
From: Aaron Lu <aaron.lu@intel.com>
To: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Daniel Bristot de Oliveira <bristot@redhat.com>,
	Valentin Schneider <vschneid@redhat.com>,
	Tim Chen <tim.c.chen@intel.com>,
	Nitin Tekchandani <nitin.tekchandani@intel.com>,
	Yu Chen <yu.c.chen@intel.com>, Waiman Long <longman@redhat.com>,
	linux-kernel@vger.kernel.org
Subject: [RFC PATCH 3/4] sched/fair: delay update_tg_load_avg() for cfs_rq's removed load
Date: Tue, 18 Jul 2023 21:41:19 +0800	[thread overview]
Message-ID: <20230718134120.81199-4-aaron.lu@intel.com> (raw)
In-Reply-To: <20230718134120.81199-1-aaron.lu@intel.com>

When a workload involves many wake time task migrations, tg->load_avg
can be heavily contended among CPUs because every migration involves
removing the task's load from its src cfs_rq and attach that load to
its new cfs_rq. Both the remove and attach requires an update to
tg->load_avg as well as propagating the change up the hierarchy.

E.g. when running postgres_sysbench on a 2sockets/112cores/224cpus Intel
Sappire Rapids, during a 5s window, the wakeup number is 14millions and
migration number is 11millions. Since the workload can trigger many
wakeups and migrations, the access(both read and write) to tg->load_avg
can be unbound. For the above said workload, the profile shows
update_cfs_group() costs ~13% and update_load_avg() costs ~10%. With
netperf/nr_client=nr_cpu/UDP_RR, the wakeup number is 21millions and
migration number is 14millions; update_cfs_group() costs ~25% and
update_load_avg() costs ~16%.

This patch is an attempt to reduce the cost of accessing tg->load_avg.

Current logic will immediately do a update_tg_load_avg() if cfs_rq has
removed load; this patch changes this behavior: if this cfs_rq has
removed load as discovered by update_cfs_rq_load_avg(), it didn't call
update_tg_load_avg() or propagate the removed load immediately, instead,
the update to tg->load_avg and propagated load can be dealed with by a
following event like task attached to this cfs_rq or in
update_blocked_averages(). This way, the call to update_tg_load_avg()
for this cfs_rq and its ancestors can be reduced by about half.

================================================
postgres_sysbench(transaction, higher is better)
nr_thread=100%/75%/50% were tested on 2 sockets SPR and Icelake and
results that have a measuable difference are:

nr_thread=100% on SPR:
base:   90569.11±1.15%
node:  104152.26±0.34%  +15.0%
delay: 127309.46±4.25%  +40.6%

nr_thread=75% on SPR:
base:  100803.96±0.57%
node:  107333.58±0.44%   +6.5%
delay: 124332.39±0.51%  +23.3%

nr_thread=75% on ICL:
base:  61961.26±0.41%
node:  61585.45±0.50%
delay: 72420.52±0.14%  +16.9%

=======================================================================
hackbench/pipe/threads/fd=20/loop=1000000 (throughput, higher is better)
group=1/4/8/16 were tested on 2 sockets SPR and Cascade lake and the
results that have a measuable difference are:

group=8 on SPR:
base:  437163±2.6%
node:  471203±1.2%   +7.8%
delay: 490780±0.9%  +12.3%

group=16 on SPR:
base:  468279±1.9%
node:  580385±1.7%  +23.9%
delay: 664422±0.2%  +41.9%

================================================
netperf/TCP_STRAM (throughput, higher is better)
nr_thread=1/25%/50%/75%/100% were tested on 2 sockets SPR and Cascade
Lake and results that have a measuable difference are:

nr_thread=50% on CSL:
base:  16258±0.7%
node:  16172±2.9%
delay: 17729±0.7%  +9.0%

nr_thread=75% on CSL:
base:  12923±1.2%
node:  13011±2.2%
delay: 15452±1.6%  +19.6%

nr_thread=75% on SPR:
base:  16232±11.9%
node:  13962±5.1%
delay: 21089±0.8%  +29.9%

nr_thread=100% on SPR:
base:  13220±0.6%
node:  13113±0.0%
delay: 18258±11.3% +38.1%

=============================================
netperf/UDP_RR (throughput, higher is better)
nr_thread=1/25%/50%/75%/100% were tested on 2 sockets SPR and Cascade
Lake and results that have measuable difference are:

nr_thread=1 on CSL:
base:  128521±0.5%
node:  127935±0.6%
delay: 126317±0.4%  -1.7%

nr_thread=75% on CSL:
base:  36701±1.7%
node:  39949±1.4%   +8.8%
delay: 42516±0.3%  +15.8%

nr_thread=75% on SPR:
base:  14249±3.8%
node:  19890±2.0%   +39.6%
delay: 31331±0.5%  +119.9%

nr_thread=100% on CSL:
base:  52275±0.6%
node:  53827±0.4%   +3.0%
delay: 78386±0.7%  +49.9%

nr_thread=100% on SPR:
base:   9560±1.6%
node:  14186±3.9%   +48.4%
delay: 20779±2.8%  +117.4%

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 kernel/sched/fair.c  | 23 ++++++++++++++++++-----
 kernel/sched/sched.h |  1 +
 2 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aceb8f5922cb..564ffe3e59c1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3645,6 +3645,9 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 	if (child_cfs_rq_on_list(cfs_rq))
 		return false;
 
+	if (cfs_rq->prop_removed_sum)
+		return false;
+
 	return true;
 }
 
@@ -3911,6 +3914,11 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
 {
 	cfs_rq->propagate = 1;
 	cfs_rq->prop_runnable_sum += runnable_sum;
+
+	if (cfs_rq->prop_removed_sum) {
+		cfs_rq->prop_runnable_sum += cfs_rq->prop_removed_sum;
+		cfs_rq->prop_removed_sum = 0;
+	}
 }
 
 /* Update task and its cfs_rq load average */
@@ -4133,13 +4141,11 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 		 * removed_runnable is the unweighted version of removed_load so we
 		 * can use it to estimate removed_load_sum.
 		 */
-		add_tg_cfs_propagate(cfs_rq,
-			-(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
-
-		decayed = 1;
+		cfs_rq->prop_removed_sum +=
+			-(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT;
 	}
 
-	decayed |= __update_load_avg_cfs_rq(now, cfs_rq);
+	decayed = __update_load_avg_cfs_rq(now, cfs_rq);
 	u64_u32_store_copy(sa->last_update_time,
 			   cfs_rq->last_update_time_copy,
 			   sa->last_update_time);
@@ -9001,6 +9007,13 @@ static bool __update_blocked_fair(struct rq *rq, bool *done)
 
 			if (cfs_rq == &rq->cfs)
 				decayed = true;
+
+			/*
+			 * If the aggregated removed_sum hasn't been taken care of,
+			 * deal with it now before this cfs_rq is removed from the list.
+			 */
+			if (cfs_rq->prop_removed_sum)
+				add_tg_cfs_propagate(cfs_rq, 0);
 		}
 
 		/* Propagate pending load changes to the parent, if any: */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9cece2dbc95b..ab540b21d071 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -619,6 +619,7 @@ struct cfs_rq {
 	unsigned long		tg_load_avg_contrib;
 	long			propagate;
 	long			prop_runnable_sum;
+	long			prop_removed_sum;
 
 	/*
 	 *   h_load = weight * f(tg)
-- 
2.41.0


  parent reply	other threads:[~2023-07-18 13:41 UTC|newest]

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-07-18 13:41 [RFC PATCH 0/4] Reduce cost of accessing tg->load_avg Aaron Lu
2023-07-18 13:41 ` [PATCH 1/4] sched/fair: free allocated memory on error in alloc_fair_sched_group() Aaron Lu
2023-07-18 15:13   ` Chen Yu
2023-07-19  2:13     ` Aaron Lu
2023-08-02  7:01       ` Aaron Lu
2023-08-02  8:17         ` Chen Yu
2023-07-18 13:41 ` [RFC PATCH 2/4] sched/fair: Make tg->load_avg per node Aaron Lu
2023-07-19 11:53   ` Peter Zijlstra
2023-07-19 13:45     ` Aaron Lu
2023-07-19 13:53       ` Peter Zijlstra
2023-07-19 14:22         ` Aaron Lu
2023-08-02 11:28       ` Peter Zijlstra
2023-08-11  9:48         ` Aaron Lu
2023-07-19 15:59     ` Yury Norov
2023-07-18 13:41 ` Aaron Lu [this message]
2023-07-18 16:01   ` [RFC PATCH 3/4] sched/fair: delay update_tg_load_avg() for cfs_rq's removed load Vincent Guittot
2023-07-19  5:18     ` Aaron Lu
2023-07-19  8:01       ` Aaron Lu
2023-07-19  9:47         ` Vincent Guittot
2023-07-19 13:29           ` Aaron Lu
2023-07-20 13:10             ` Vincent Guittot
2023-07-20 14:42               ` Aaron Lu
2023-07-20 15:02                 ` Vincent Guittot
2023-07-20 15:22                   ` Dietmar Eggemann
2023-07-20 15:24                     ` Vincent Guittot
2023-07-21  6:42                     ` Aaron Lu
2023-07-21  1:57                   ` Aaron Lu
2023-08-11  9:28                     ` Aaron Lu
2023-07-20 15:04                 ` Vincent Guittot
2023-07-19  8:11       ` Aaron Lu
2023-07-19  9:12         ` Vincent Guittot
2023-07-19  9:09       ` Vincent Guittot
2023-07-18 13:41 ` [RFC PATCH 4/4] sched/fair: skip some update_cfs_group() on en/dequeue_entity() Aaron Lu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230718134120.81199-4-aaron.lu@intel.com \
    --to=aaron.lu@intel.com \
    --cc=bristot@redhat.com \
    --cc=bsegall@google.com \
    --cc=daniel.m.jordan@oracle.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=juri.lelli@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=longman@redhat.com \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=nitin.tekchandani@intel.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=tim.c.chen@intel.com \
    --cc=vincent.guittot@linaro.org \
    --cc=vschneid@redhat.com \
    --cc=yu.c.chen@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.