Re: [RFC PATCH] sched/fair: Make tg->load_avg per node

public inbox for linux-kernel@vger.kernel.org

From: Aaron Lu <aaron.lu@intel.com>
To: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	"Ben Segall" <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	"Daniel Bristot de Oliveira" <bristot@redhat.com>,
	Valentin Schneider <vschneid@redhat.com>,
	Tim Chen <tim.c.chen@intel.com>,
	Nitin Tekchandani <nitin.tekchandani@intel.com>,
	Waiman Long <longman@redhat.com>, Yu Chen <yu.c.chen@intel.com>,
	<linux-kernel@vger.kernel.org>
Subject: Re: [RFC PATCH] sched/fair: Make tg->load_avg per node
Date: Tue, 28 Mar 2023 20:56:24 +0800
Message-ID: <20230328125624.GA6130@ziqianlu-desk2>
In-Reply-To: <943d44f7-1fa9-8545-dc1d-890e4dd20854@arm.com>

Hi Dietmar,

Thanks for taking a look.

On Tue, Mar 28, 2023 at 02:09:39PM +0200, Dietmar Eggemann wrote:
> On 27/03/2023 07:39, Aaron Lu wrote:
> > When using sysbench to benchmark Postgres in a single docker instance
> > with sysbench's nr_threads set to nr_cpu, it is observed there are times
> > update_cfs_group() and update_load_avg() show noticeable overhead on
> > cpus of one node of a 2sockets/112core/224cpu Intel Sapphire Rapids:
> > 
> >     10.01%     9.86%  [kernel.vmlinux]        [k] update_cfs_group
> >      7.84%     7.43%  [kernel.vmlinux]        [k] update_load_avg
> > 
> > While cpus of the other node normally see a lower cycle percent:
> > 
> >      4.46%     4.36%  [kernel.vmlinux]        [k] update_cfs_group
> >      4.02%     3.40%  [kernel.vmlinux]        [k] update_load_avg
> > 
> > Annotate shows the cycles are mostly spent on accessing tg->load_avg
> > with update_load_avg() being the write side and update_cfs_group() being
> > the read side.
> > 
> > The reason why only cpus of one node have bigger overhead is: task_group
> > is allocated on demand from a slab and whichever cpu happens to do the
> > allocation, the allocated tg will be located on that node and accessing
> > tg->load_avg will have a lower cost for cpus on the same node and
> > a higher cost for cpus of the remote node.
> > 
> > Tim Chen told me that PeterZ once mentioned a way to solve a similar
> > problem by making a counter per node so do the same for tg->load_avg.
> > After this change, the worst numbers I saw during a 5 minute run from
> > both nodes are:
> > 
> >      2.77%     2.11%  [kernel.vmlinux]        [k] update_load_avg
> >      2.72%     2.59%  [kernel.vmlinux]        [k] update_cfs_group
> > 
> > Another observation of this workload is: it has a lot of wakeup time
> > task migrations and that is the reason why update_load_avg() and
> > update_cfs_group() show noticeable cost. Running this workload in N
> > instances setup where N >= 2 with sysbench's nr_threads set to 1/N nr_cpu,
> > task migrations on wake up time are greatly reduced and the overhead from
> > the two above mentioned functions also dropped a lot. It's not clear to
> > me why running in multiple instances can reduce task migrations on
> > wakeup path yet.
> > 
> > Reported-by: Nitin Tekchandani <nitin.tekchandani@intel.com>
> > Signed-off-by: Aaron Lu <aaron.lu@intel.com>
> 
> I'm so far not seeing this issue on my Arm64 server.
> 
> $ numactl -H
> available: 4 nodes (0-3)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
> node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43
> 44 45 46 47
> node 2 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67
> 68 69 70 71
> node 3 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91
> 92 93 94 95
> node distances:
> node   0   1   2   3
>   0:  10  12  20  22
>   1:  12  10  22  24
>   2:  20  22  10  12
>   3:  22  24  12  10
> 
> sysbench --table-size=100000 --tables=24 --threads=96 ...
> /usr/share/sysbench/oltp_read_write.lua run
> 
> perf report | grep kernel | head
> 
> 9.12%  sysbench  [kernel.vmlinux]        [k] _raw_spin_unlock_irqrestore
> 5.26%  sysbench  [kernel.vmlinux]        [k] finish_task_switch
> 1.56%  sysbench  [kernel.vmlinux]        [k] __do_softirq
> 1.22%  sysbench  [kernel.vmlinux]        [k] arch_local_irq_restore
> 1.12%  sysbench  [kernel.vmlinux]        [k] __arch_copy_to_user
> 1.12%  sysbench  [kernel.vmlinux]        [k] el0_svc_common.constprop.1
> 0.95%  sysbench  [kernel.vmlinux]        [k] __fget_light
> 0.94%  sysbench  [kernel.vmlinux]        [k] rwsem_spin_on_owner
> 0.85%  sysbench  [kernel.vmlinux]        [k] tcp_ack
> 0.56%  sysbench  [kernel.vmlinux]        [k] do_sys_poll

Did you test with a v6.3-rc based kernel?
I hit another problem on those kernels and had to temporarily use a
v6.2 based kernel; you may have to do the same:
https://lore.kernel.org/lkml/20230327080502.GA570847@ziqianlu-desk2/

> 
> Is your postgres/sysbench running in a cgroup with cpu controller
> attached? Mine isn't.

Yes, I had postgres and sysbench running in the same cgroup with cpu
controller enabled. docker created the cgroup directory under
/sys/fs/cgroup/system.slice/docker-XXX and cgroup.controllers has cpu
there.
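
A quick way to check this on your side could look like the sketch below.
The helper name has_cpu_controller is made up for illustration, and the
pid lookup in the usage comment is an assumption about your setup. It
relies only on cgroup.controllers being a single space-separated line.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given cgroup.controllers file lists "cpu".
# The file is one space-separated line, e.g. "cpuset cpu io memory pids".
has_cpu_controller() {
    # Put each controller on its own line so "cpu" does not match "cpuset".
    tr ' ' '\n' < "$1" | grep -qx cpu
}

# Example usage on the host (cgroup v2; the pid lookup is an assumption):
#   pid=$(pgrep -o postgres)
#   cg=$(sed -n 's/^0:://p' "/proc/$pid/cgroup")
#   has_cpu_controller "/sys/fs/cgroup$cg/cgroup.controllers" && echo "cpu controller available"
```

If the helper fails for the cgroup your postgres tasks are in, the tasks
are most likely not in a task group of their own as far as the cpu
controller is concerned.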

> 
> Maybe I'm doing something else differently?

Maybe. You didn't mention how you started postgres. If you started it from
the same session as sysbench and autogroup is enabled, then all those
tasks would be in the same autogroup task group, which should have the
same effect as my setup.
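
To see whether autogroup is in play, something like the following sketch
may help. autogroup_state is a hypothetical helper; the two /proc files
are the standard autogroup interfaces and are only present on kernels
built with CONFIG_SCHED_AUTOGROUP.

```shell
#!/bin/sh
# Hypothetical helper: map the content of
# /proc/sys/kernel/sched_autogroup_enabled (1 or 0) to a readable word.
autogroup_state() {
    case "$1" in
        1) echo enabled ;;
        *) echo disabled ;;
    esac
}

f=/proc/sys/kernel/sched_autogroup_enabled
if [ -r "$f" ]; then
    echo "autogroup: $(autogroup_state "$(cat "$f")")"
else
    echo "autogroup: not available on this kernel"
fi

# /proc/<pid>/autogroup names the autogroup a task belongs to, in the form
# "/autogroup-<id> nice <n>". Comparing it for a postgres backend and a
# sysbench thread shows whether they share one task group.
if [ -r /proc/self/autogroup ]; then
    cat /proc/self/autogroup
fi
```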

Anyway, you can try the following steps to see if you can reproduce this
problem on your Arm64 server:

1 docker pull postgres
2 sudo docker run --rm --name postgres-instance -e POSTGRES_PASSWORD=mypass -e POSTGRES_USER=sbtest -d postgres -c shared_buffers=80MB -c max_connections=250
3 go inside the container
  sudo docker exec -it $the_just_started_container_id bash
4 install sysbench inside container
  apt update and apt install sysbench
5 prepare
  root@container:/# sysbench --db-driver=pgsql --pgsql-user=sbtest --pgsql_password=mypass --pgsql-db=sbtest --pgsql-port=5432 --tables=16 --table-size=10000 --threads=224 --time=60 --report-interval=2 /usr/share/sysbench/oltp_read_only.lua prepare
6 run
  root@container:/# sysbench --db-driver=pgsql --pgsql-user=sbtest --pgsql_password=mypass --pgsql-db=sbtest --pgsql-port=5432 --tables=16 --table-size=10000 --threads=224 --time=60 --report-interval=2 /usr/share/sysbench/oltp_read_only.lua run

Note that I used 224 threads, where this problem is visible. I also tried
96 threads; update_cfs_group() and update_load_avg() then cost about 1%
of cycles.
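
While the run step is in progress, the two functions can be picked out
of a profile with a small filter such as the sketch below. The helper
name filter_tg_load_avg_syms is made up, and the perf invocation in the
usage comment is only one possible way to collect the profile, not
necessarily the one used for the numbers in this thread.

```shell
#!/bin/sh
# Hypothetical helper: read `perf report --stdio` output on stdin and keep
# only the kernel symbol lines for the two functions of interest.
filter_tg_load_avg_syms() {
    grep -E '\[k\] (update_cfs_group|update_load_avg)([^[:alnum:]_]|$)'
}

# Example usage (system-wide profile for 10 seconds; usually needs root):
#   perf record -a -g -- sleep 10
#   perf report --stdio --no-children | filter_tg_load_avg_syms
```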
