From: Dietmar Eggemann <dietmar.eggemann@arm.com>
To: Aaron Lu <aaron.lu@intel.com>,
Peter Zijlstra <peterz@infradead.org>,
Ingo Molnar <mingo@redhat.com>,
Juri Lelli <juri.lelli@redhat.com>,
Vincent Guittot <vincent.guittot@linaro.org>
Cc: Steven Rostedt <rostedt@goodmis.org>,
Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
Daniel Bristot de Oliveira <bristot@redhat.com>,
Valentin Schneider <vschneid@redhat.com>,
Tim Chen <tim.c.chen@intel.com>,
Nitin Tekchandani <nitin.tekchandani@intel.com>,
Waiman Long <longman@redhat.com>, Yu Chen <yu.c.chen@intel.com>,
linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH] sched/fair: Make tg->load_avg per node
Date: Tue, 28 Mar 2023 14:09:39 +0200
Message-ID: <943d44f7-1fa9-8545-dc1d-890e4dd20854@arm.com>
In-Reply-To: <20230327053955.GA570404@ziqianlu-desk2>

On 27/03/2023 07:39, Aaron Lu wrote:
> When using sysbench to benchmark Postgres in a single docker instance
> with sysbench's nr_threads set to nr_cpu, it is observed that at times
> update_cfs_group() and update_load_avg() show noticeable overhead on
> cpus of one node of a 2sockets/112core/224cpu Intel Sapphire Rapids:
>
> 10.01% 9.86% [kernel.vmlinux] [k] update_cfs_group
> 7.84% 7.43% [kernel.vmlinux] [k] update_load_avg
>
> While cpus of the other node normally see a lower cycle percent:
>
> 4.46% 4.36% [kernel.vmlinux] [k] update_cfs_group
> 4.02% 3.40% [kernel.vmlinux] [k] update_load_avg
>
> Annotate shows the cycles are mostly spent on accessing tg->load_avg
> with update_load_avg() being the write side and update_cfs_group() being
> the read side.
>
> The reason why only cpus of one node have bigger overhead is: task_group
> is allocated on demand from a slab and whichever cpu happens to do the
> allocation, the allocated tg will be located on that node and accessing
> tg->load_avg will have a lower cost for cpus on the same node and
> a higher cost for cpus of the remote node.
>
> Tim Chen told me that PeterZ once mentioned a way to solve a similar
> problem by making a counter per node so do the same for tg->load_avg.
> After this change, the worst numbers I saw during a 5 minute run from
> both nodes are:
>
> 2.77% 2.11% [kernel.vmlinux] [k] update_load_avg
> 2.72% 2.59% [kernel.vmlinux] [k] update_cfs_group
>
> Another observation of this workload is: it has a lot of wakeup time
> task migrations and that is the reason why update_load_avg() and
> update_cfs_group() show noticeable cost. Running this workload in an N
> instance setup where N >= 2 with sysbench's nr_threads set to 1/N nr_cpu,
> task migrations at wakeup time are greatly reduced and the overhead from
> the two above mentioned functions also drops a lot. It's not clear to
> me yet why running in multiple instances can reduce task migrations on
> the wakeup path.
>
> Reported-by: Nitin Tekchandani <nitin.tekchandani@intel.com>
> Signed-off-by: Aaron Lu <aaron.lu@intel.com>

I'm so far not seeing this issue on my Arm64 server.

$ numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 2 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 3 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node distances:
node   0   1   2   3
  0:  10  12  20  22
  1:  12  10  22  24
  2:  20  22  10  12
  3:  22  24  12  10

sysbench --table-size=100000 --tables=24 --threads=96 ...
/usr/share/sysbench/oltp_read_write.lua run
perf report | grep kernel | head
9.12% sysbench [kernel.vmlinux] [k] _raw_spin_unlock_irqrestore
5.26% sysbench [kernel.vmlinux] [k] finish_task_switch
1.56% sysbench [kernel.vmlinux] [k] __do_softirq
1.22% sysbench [kernel.vmlinux] [k] arch_local_irq_restore
1.12% sysbench [kernel.vmlinux] [k] __arch_copy_to_user
1.12% sysbench [kernel.vmlinux] [k] el0_svc_common.constprop.1
0.95% sysbench [kernel.vmlinux] [k] __fget_light
0.94% sysbench [kernel.vmlinux] [k] rwsem_spin_on_owner
0.85% sysbench [kernel.vmlinux] [k] tcp_ack
0.56% sysbench [kernel.vmlinux] [k] do_sys_poll

Is your postgres/sysbench running in a cgroup with the cpu controller
attached? Mine isn't.

Maybe I'm doing something else differently?