From: Aaron Lu <aaron.lu@intel.com>
To: Peter Zijlstra <peterz@infradead.org>,
Ingo Molnar <mingo@redhat.com>,
Juri Lelli <juri.lelli@redhat.com>,
Vincent Guittot <vincent.guittot@linaro.org>,
Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>,
Steven Rostedt <rostedt@goodmis.org>,
Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
Daniel Bristot de Oliveira <bristot@redhat.com>,
Valentin Schneider <vschneid@redhat.com>,
Tim Chen <tim.c.chen@intel.com>,
Nitin Tekchandani <nitin.tekchandani@intel.com>,
Yu Chen <yu.c.chen@intel.com>, Waiman Long <longman@redhat.com>,
linux-kernel@vger.kernel.org
Subject: [RFC PATCH 2/4] sched/fair: Make tg->load_avg per node
Date: Tue, 18 Jul 2023 21:41:18 +0800
Message-ID: <20230718134120.81199-3-aaron.lu@intel.com>
In-Reply-To: <20230718134120.81199-1-aaron.lu@intel.com>

When using sysbench to benchmark Postgres in a single docker instance
with sysbench's nr_threads set to nr_cpu, update_cfs_group() and
update_load_avg() at times show noticeable overhead on a
2 sockets/112 cores/224 CPUs Intel Sapphire Rapids (SPR) machine:

    13.75%  13.74%  [kernel.vmlinux]  [k] update_cfs_group
    10.63%  10.04%  [kernel.vmlinux]  [k] update_load_avg

perf annotate shows that the cycles are mostly spent accessing
tg->load_avg, with update_load_avg() being the write side and
update_cfs_group() being the read side.

Tim Chen told me that PeterZ once mentioned a way to solve a similar
problem by making the counter per node, so do the same for tg->load_avg:
each CPU updates only the counter of its own node, and readers sum the
counters of all nodes.
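
As a rough userspace illustration of the per-node counter idea (not the
kernel code in this patch), the sketch below uses C11 atomics. NR_NODES,
CACHELINE, struct node_counter, struct group_load, group_load_add() and
group_load_read() are hypothetical names made up for this example; the
fixed node count and the 64-byte cache line size are assumptions.

```c
#include <stdatomic.h>
#include <stddef.h>

#define NR_NODES   2	/* assumption: two NUMA nodes, as on a 2-socket box */
#define CACHELINE 64	/* assumption: 64-byte cache lines */

/* One counter per node, each aligned to its own cache line. */
struct node_counter {
	_Alignas(CACHELINE) atomic_long load_avg;
};

struct group_load {
	struct node_counter node[NR_NODES];
};

/* Write side: a CPU only touches the counter of the node it belongs to. */
static void group_load_add(struct group_load *g, int node, long delta)
{
	atomic_fetch_add_explicit(&g->node[node].load_avg, delta,
				  memory_order_relaxed);
}

/* Read side: sum all per-node counters to get the group-wide value. */
static long group_load_read(struct group_load *g)
{
	long sum = 0;

	for (int i = 0; i < NR_NODES; i++)
		sum += atomic_load_explicit(&g->node[i].load_avg,
					    memory_order_relaxed);
	return sum;
}
```

With a zero-initialized struct group_load g, calling
group_load_add(&g, 0, 100) and then group_load_add(&g, 1, 50) makes
group_load_read(&g) return 150. The trade-off is that a read now touches
one cache line per node instead of a single line, while writers on
different nodes no longer bounce the same line between sockets.
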

After this change, the cost of the two functions is reduced and
sysbench transaction throughput increases on SPR. Test results are
below.

===============================================
postgres_sysbench (transactions, higher is better)
nr_thread=100%/75%/50% were tested on 2-socket SPR and Ice Lake
machines; the results with a measurable difference are:

nr_thread=100% on SPR:
base:   90569.11±1.15%
node:  104152.26±0.34%  +15.0%

nr_thread=75% on SPR:
base:  100803.96±0.57%
node:  107333.58±0.44%  +6.5%

=======================================================================
hackbench/pipe/threads/fd=20/loop=1000000 (throughput, higher is better)
group=1/4/8/16 were tested on 2-socket SPR and Cascade Lake machines;
the results with a measurable difference are:

group=8 on SPR:
base:  437163±2.6%
node:  471203±1.2%  +7.8%

group=16 on SPR:
base:  468279±1.9%
node:  580385±1.7%  +23.9%

=============================================
netperf/TCP_STREAM
nr_thread=1/25%/50%/75%/100% were tested on 2-socket SPR and Cascade
Lake machines; there is no measurable difference.

=============================================
netperf/UDP_RR (throughput, higher is better)
nr_thread=1/25%/50%/75%/100% were tested on 2-socket SPR and Cascade
Lake machines; the results with a measurable difference are:

nr_thread=75% on Cascade Lake:
base:  36701±1.7%
node:  39949±1.4%  +8.8%

nr_thread=75% on SPR:
base:  14249±3.8%
node:  19890±2.0%  +39.6%

nr_thread=100% on Cascade Lake:
base:  52275±0.6%
node:  53827±0.4%  +3.0%

nr_thread=100% on SPR:
base:   9560±1.6%
node:  14186±3.9%  +48.4%

Reported-by: Nitin Tekchandani <nitin.tekchandani@intel.com>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
kernel/sched/debug.c | 2 +-
kernel/sched/fair.c | 29 ++++++++++++++++++++++++++---
kernel/sched/sched.h | 43 +++++++++++++++++++++++++++++++++----------
3 files changed, 60 insertions(+), 14 deletions(-)
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 066ff1c8ae4e..3af965a18866 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -691,7 +691,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
SEQ_printf(m, " .%-30s: %lu\n", "tg_load_avg_contrib",
cfs_rq->tg_load_avg_contrib);
SEQ_printf(m, " .%-30s: %ld\n", "tg_load_avg",
- atomic_long_read(&cfs_rq->tg->load_avg));
+ tg_load_avg(cfs_rq->tg));
#endif
#endif
#ifdef CONFIG_CFS_BANDWIDTH
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0f913487928d..aceb8f5922cb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3496,7 +3496,7 @@ static long calc_group_shares(struct cfs_rq *cfs_rq)
load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
- tg_weight = atomic_long_read(&tg->load_avg);
+ tg_weight = tg_load_avg(tg);
/* Ensure tg_weight >= load */
tg_weight -= cfs_rq->tg_load_avg_contrib;
@@ -3665,6 +3665,7 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
{
long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
+ int node = cpu_to_node(smp_processor_id());
/*
* No need to update load_avg for root_task_group as it is not used.
@@ -3673,7 +3674,7 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
return;
if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
- atomic_long_add(delta, &cfs_rq->tg->load_avg);
+ atomic_long_add(delta, &cfs_rq->tg->node_info[node]->load_avg);
cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
}
}
@@ -12439,7 +12440,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
{
struct sched_entity *se;
struct cfs_rq *cfs_rq;
- int i;
+ int i, nodes;
tg->cfs_rq = kcalloc(nr_cpu_ids, sizeof(cfs_rq), GFP_KERNEL);
if (!tg->cfs_rq)
@@ -12468,8 +12469,30 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
init_entity_runnable_average(se);
}
+#ifdef CONFIG_SMP
+ nodes = nr_node_ids;
+ tg->node_info = kcalloc(nodes, sizeof(struct tg_node_info *), GFP_KERNEL);
+ if (!tg->node_info)
+ goto err_free;
+
+ for_each_node(i) {
+ tg->node_info[i] = kzalloc_node(sizeof(struct tg_node_info), GFP_KERNEL, i);
+ if (!tg->node_info[i])
+ goto err_free_node;
+ }
+#endif
+
return 1;
+#ifdef CONFIG_SMP
+err_free_node:
+ for_each_node(i) {
+ kfree(tg->node_info[i]);
+ if (!tg->node_info[i])
+ break;
+ }
+ kfree(tg->node_info);
+#endif
err_free:
for_each_possible_cpu(i) {
kfree(tg->cfs_rq[i]);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 14dfaafb3a8f..9cece2dbc95b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -359,6 +359,17 @@ struct cfs_bandwidth {
#endif
};
+struct tg_node_info {
+ /*
+ * load_avg can be heavily contended at clock tick time and task
+ * enqueue/dequeue time, so put it in its own cacheline separated
+ * from other fields.
+ */
+ struct {
+ atomic_long_t load_avg;
+ } ____cacheline_aligned_in_smp;
+};
+
/* Task group related information */
struct task_group {
struct cgroup_subsys_state css;
@@ -373,15 +384,8 @@ struct task_group {
/* A positive value indicates that this is a SCHED_IDLE group. */
int idle;
-#ifdef CONFIG_SMP
- /*
- * load_avg can be heavily contended at clock tick time, so put
- * it in its own cacheline separated from the fields above which
- * will also be accessed at each tick.
- */
- struct {
- atomic_long_t load_avg;
- } ____cacheline_aligned_in_smp;
+#ifdef CONFIG_SMP
+ struct tg_node_info **node_info;
#endif
#endif
@@ -413,9 +417,28 @@ struct task_group {
/* Effective clamp values used for a task group */
struct uclamp_se uclamp[UCLAMP_CNT];
#endif
-
};
+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+static inline long tg_load_avg(struct task_group *tg)
+{
+ long load_avg = 0;
+ int i;
+
+ /*
+ * The only path that can give us a root_task_group
+ * here is from print_cfs_rq() thus unlikely.
+ */
+ if (unlikely(tg == &root_task_group))
+ return 0;
+
+ for_each_node(i)
+ load_avg += atomic_long_read(&tg->node_info[i]->load_avg);
+
+ return load_avg;
+}
+#endif
+
#ifdef CONFIG_FAIR_GROUP_SCHED
#define ROOT_TASK_GROUP_LOAD NICE_0_LOAD
--
2.41.0