From: Peter Zijlstra <peterz@infradead.org>
To: Vishal Chourasia <vishalc@linux.vnet.ibm.com>
Cc: linux-kernel@vger.kernel.org, mingo@redhat.com,
vincent.guittot@linaro.org, vschneid@redhat.com,
srikar@linux.vnet.ibm.com, sshegde@linux.ibm.com,
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Subject: Re: sched/debug: CPU hotplug operation suffers in a large cpu systems
Date: Mon, 17 Oct 2022 16:19:31 +0200
Message-ID: <Y01kc4g9CVmoyOxj@hirez.programming.kicks-ass.net>
In-Reply-To: <Y01UWQL2y2r69sBX@li-05afa54c-330e-11b2-a85c-e3f3aa0db1e9.ibm.com>
+GregKH who actually knows about debugfs.
On Mon, Oct 17, 2022 at 06:40:49PM +0530, Vishal Chourasia wrote:
> The smt=off operation on a system with 1920 CPUs takes approx. 59 minutes
> on v5.14 versus 29 minutes on v5.11, measured using:
> # time ppc64_cpu --smt=off
>
> A git bisect between kernel v5.11 and v5.14 pointed to commit
> 3b87f136f8fc ("sched,debug: Convert sysctl sched_domains to debugfs"),
> which moves the sched_domain information that was originally exported
> via sysctl over to debugfs.
>
> Reverting the said commit gives us the expected good result.
>
> Previously, the sched_domain information was exported via procfs (sysctl)
> at /proc/sys/kernel/sched_domain/; it is now exported via debugfs at
> /sys/kernel/debug/sched/domains/.
>
> We also observe the regression in kernel v6.0-rc4, and it vanishes after
> reverting commit 3b87f136f8fc.
>
> # Output of `time ppc64_cpu --smt=off` on different kernel versions
> |-------------------------------------+------------+----------+----------|
> | kernel version | real | user | sys |
> |-------------------------------------+------------+----------+----------|
> | v5.11 | 29m22.007s | 0m0.001s | 0m6.444s |
> | v5.14 | 58m15.719s | 0m0.037s | 0m7.482s |
> | v6.0-rc4 | 59m30.318s | 0m0.055s | 0m7.681s |
> | v6.0-rc4 with 3b87f136f8fc reverted | 32m20.486s | 0m0.029s | 0m7.361s |
> |-------------------------------------+------------+----------+----------|
>
> A machine with 1920 CPUs was used for the above experiments. The output of
> lscpu is included below.
>
> # lscpu
> Architecture: ppc64le
> Byte Order: Little Endian
> CPU(s): 1920
> On-line CPU(s) list: 0-1919
> Model name: POWER10 (architected), altivec supported
> Model: 2.0 (pvr 0080 0200)
> Thread(s) per core: 8
> Core(s) per socket: 14
> Socket(s): 17
> Physical sockets: 15
> Physical chips: 1
> Physical cores/chip: 16
>
> Through our experiments we have found that, even when offlining a single
> CPU, the functions responsible for exporting sched_domain information take
> more time with debugfs than with sysctl.
>
> Experiments using the trace-cmd function-graph plugin show that the
> execution times of certain functions common to both scenarios (procfs and
> debugfs) differ drastically.
>
> The table below lists the execution times of some of these symbols for the
> sysctl (procfs) and debugfs cases.
>
> |--------------------------------+----------------+--------------|
> | method | sysctl | debugfs |
> |--------------------------------+----------------+--------------|
> | unregister_sysctl_table | 0.020050 s | NA |
> | build_sched_domains | 3.090563 s | 3.119130 s |
> | register_sched_domain_sysctl | 0.065487 s | NA |
> | update_sched_domain_debugfs | NA | 2.791232 s |
> | partition_sched_domains_locked | 3.195958 s | 5.933254 s |
> |--------------------------------+----------------+--------------|
>
> Note: partition_sched_domains_locked internally calls build_sched_domains
> and then the function corresponding to whichever mechanism is currently
> used to export the information, i.e. sysctl or debugfs.
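> 
> For reference, a simplified sketch of that call structure, paraphrased
> from our reading of kernel/sched/topology.c (not the exact kernel code):
> 
>   void partition_sched_domains_locked(int ndoms_new,
>                                       cpumask_var_t doms_new[],
>                                       struct sched_domain_attr *dattr_new)
>   {
>       ...
>       /* rebuild the domains that actually changed */
>       build_sched_domains(doms_new[i], dattr_new ? dattr_new + i : NULL);
>       ...
>       /* before 3b87f136f8fc: export via sysctl */
>       register_sched_domain_sysctl();
>       /* after 3b87f136f8fc: export via debugfs */
>       update_sched_domain_debugfs();
>   }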
>
> The above numbers are from the case of offlining 1 CPU on a system with
> 1920 online CPUs.
>
> From the above table, register_sched_domain_sysctl and
> unregister_sysctl_table collectively took ~0.085 s, whereas
> update_sched_domain_debugfs alone took ~2.79 s.
>
> Root cause:
>
> The observed regression stems from the way these two pseudo-filesystems handle
> creation and deletion of files and directories internally.
>
> update_sched_domain_debugfs builds and exports the sched_domain information
> to userspace. It begins by tearing down the per-CPU directories under
> /sys/kernel/debug/sched/domains/ one by one for each possible CPU, and then
> recreates the per-CPU, per-domain directories and files, again one by one
> for each possible CPU.
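> 
> A simplified sketch of that pattern, based on our reading of
> kernel/sched/debug.c (details may differ slightly):
> 
>   void update_sched_domain_debugfs(void)
>   {
>       int cpu, i;
> 
>       for_each_cpu(cpu, sd_sysctl_cpus) {
>           struct sched_domain *sd;
>           struct dentry *d_cpu;
>           char buf[32];
> 
>           snprintf(buf, sizeof(buf), "cpu%d", cpu);
>           /* tear down the stale per-cpu directory ... */
>           debugfs_remove(debugfs_lookup(buf, sd_dentry));
>           /* ... and recreate it from scratch */
>           d_cpu = debugfs_create_dir(buf, sd_dentry);
> 
>           i = 0;
>           for_each_domain(cpu, sd) {
>               struct dentry *d_sd;
> 
>               snprintf(buf, sizeof(buf), "domain%d", i);
>               d_sd = debugfs_create_dir(buf, d_cpu);
>               /* min_interval, max_interval, flags, ... per-domain files */
>               register_sd(sd, d_sd);
>               i++;
>           }
>       }
>   }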
>
> Excerpt from the trace-cmd output for the debugfs case
> ...
> | update_sched_domain_debugfs() {
> + 14.526 us | debugfs_lookup();
> # 1092.64 us | debugfs_remove();
> + 48.408 us | debugfs_create_dir(); - creates per-cpu dir
> 9.038 us | debugfs_create_dir(); - creates per-domain dir
> 9.638 us | debugfs_create_ulong(); -+
> 7.762 us | debugfs_create_ulong(); |
> 7.776 us | debugfs_create_u64(); |
> 7.502 us | debugfs_create_u32(); |__ creates per-domain files
> 7.646 us | debugfs_create_u32(); |
> 7.702 us | debugfs_create_u32(); |
> 6.974 us | debugfs_create_str(); |
> 7.628 us | debugfs_create_file(); -+
> ... - repeat other domains and cpus
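> 
> For comparison, the pre-3b87f136f8fc sysctl path built one ctl_table tree
> covering all CPUs and domains and registered it with a single call; a rough
> sketch (paraphrased, assuming the old kernel/sched/debug.c code):
> 
>   void register_sched_domain_sysctl(void)
>   {
>       /* allocate and populate one ctl_table tree for all cpus/domains */
>       ...
>       /* a single registration covers the whole tree */
>       sd_sysctl_header = register_sysctl_table(sd_ctl_root);
>   }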
>
> As a first step, we used debugfs_remove_recursive to remove the entries for
> all CPUs in one go instead of calling debugfs_remove per CPU, but we did
> not see any improvement whatsoever.
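> 
> Roughly, the change we tried looked like the sketch below (not a tested
> patch; sd_dentry and debugfs_sched are assumed to be the static dentries
> used by kernel/sched/debug.c):
> 
>   void update_sched_domain_debugfs(void)
>   {
>       int cpu;
> 
>       /* drop all per-cpu entries in one go ... */
>       debugfs_remove_recursive(sd_dentry);
>       /* ... recreate the parent "domains" directory ... */
>       sd_dentry = debugfs_create_dir("domains", debugfs_sched);
> 
>       /* ... then rebuild per-cpu/per-domain dirs and files as before */
>       for_each_cpu(cpu, sd_sysctl_cpus) {
>           ...
>       }
>   }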
>
> We understand that debugfs does not concern itself with performance, and
> that the smt=off operation is not invoked very often, statistically
> speaking. But as the number of CPUs in a system scales up, debugfs becomes
> a massive performance bottleneck that should not be ignored.
>
> Even on a system with 240 CPUs, update_sched_domain_debugfs is roughly 17
> times slower than register_sched_domain_sysctl when rebuilding the
> sched_domain directories for just those 240 CPUs (see the table below).
>
> # For 240 CPU system
> |------------------------------+---------------|
> | method | time taken |
> |------------------------------+---------------|
> | update_sched_domain_debugfs | 236550.996 us |
> | register_sched_domain_sysctl | 13907.940 us |
> |------------------------------+---------------|
>
> Any ideas from the community on how to improve this would be much appreciated.
>
> Meanwhile, we will keep posting our progress updates.
>
> -- vishal.c