public inbox for linux-kernel@vger.kernel.org
From: Peter Zijlstra <peterz@infradead.org>
To: Vishal Chourasia <vishalc@linux.vnet.ibm.com>
Cc: linux-kernel@vger.kernel.org, mingo@redhat.com,
	vincent.guittot@linaro.org, vschneid@redhat.com,
	srikar@linux.vnet.ibm.com, sshegde@linux.ibm.com,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Subject: Re: sched/debug: CPU hotplug operation suffers in a large cpu systems
Date: Mon, 17 Oct 2022 16:19:31 +0200	[thread overview]
Message-ID: <Y01kc4g9CVmoyOxj@hirez.programming.kicks-ass.net> (raw)
In-Reply-To: <Y01UWQL2y2r69sBX@li-05afa54c-330e-11b2-a85c-e3f3aa0db1e9.ibm.com>


+GregKH who actually knows about debugfs.

On Mon, Oct 17, 2022 at 06:40:49PM +0530, Vishal Chourasia wrote:
> The smt=off operation on a system with 1920 CPUs takes approx. 59 minutes on
> v5.14 versus 29 minutes on v5.11, measured using:
> # time ppc64_cpu --smt=off
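(For readers without the ppc64 tooling: conceptually, smt=off offlines every
sibling thread except the first of each core through sysfs. A minimal,
untested Python sketch of that logic, with the sysfs root parameterized so it
can be exercised without root; the helper names are ours, not from any tool,
and ppc64_cpu itself may use a different mechanism.)

```python
from pathlib import Path

def parse_cpu_list(s):
    """Expand a sysfs CPU list like '0-3,8' into a sorted list of ints."""
    cpus = []
    for part in s.strip().split(','):
        if '-' in part:
            lo, hi = part.split('-')
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return sorted(cpus)

def smt_off(sysfs_root="/sys/devices/system/cpu"):
    """Offline every SMT sibling except the first thread of each core."""
    root = Path(sysfs_root)
    for cpu_dir in sorted(root.glob("cpu[0-9]*")):
        siblings = parse_cpu_list(
            (cpu_dir / "topology" / "thread_siblings_list").read_text())
        cpu = int(cpu_dir.name[3:])
        if cpu != siblings[0]:          # keep the first thread of each core
            (cpu_dir / "online").write_text("0")
```

Each write to an online file is one CPU hotplug operation, and each such
operation rebuilds the sched domains, which is where the time below goes.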
> 
> Doing a git bisect between kernel v5.11 and v5.14 pointed to the commit
> 3b87f136f8fc ("sched,debug: Convert sysctl sched_domains to debugfs"). This
> commit moves sched_domain information that was originally exported using sysctl
> to debugfs.
> 
> Reverting the said commit restores the expected, faster timing.
> 
> Previously, sched_domain information was exported via procfs (sysctl) at
> /proc/sys/kernel/sched_domain/; now it is exported via debugfs at
> /sys/kernel/debug/sched/domains/
> 
> We also observe the regression in kernel v6.0-rc4; it vanishes after
> reverting commit 3b87f136f8fc.
> 
> # Output of `time ppc64_cpu --smt=off` on different kernel versions
> |-------------------------------------+------------+----------+----------|
> | kernel version                      | real       | user     | sys      |
> |-------------------------------------+------------+----------+----------|
> | v5.11                               | 29m22.007s | 0m0.001s | 0m6.444s |
> | v5.14                               | 58m15.719s | 0m0.037s | 0m7.482s |
> | v6.0-rc4                            | 59m30.318s | 0m0.055s | 0m7.681s |
> | v6.0-rc4 with 3b87f136f8fc reverted | 32m20.486s | 0m0.029s | 0m7.361s |
> |-------------------------------------+------------+----------+----------|
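(A quick arithmetic check of the slowdown those wall-clock times imply,
converting the `real` column to seconds:)

```python
# Convert the `real` column of the table above to seconds and compare.
def to_seconds(minutes, seconds):
    return minutes * 60 + seconds

v5_11    = to_seconds(29, 22.007)   # v5.11
v5_14    = to_seconds(58, 15.719)   # v5.14
v6_0rc4  = to_seconds(59, 30.318)   # v6.0-rc4
reverted = to_seconds(32, 20.486)   # v6.0-rc4, 3b87f136f8fc reverted

print(round(v5_14 / v5_11, 2))      # 1.98 -- roughly 2x slower
print(round(v6_0rc4 / reverted, 2)) # 1.84
```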
> 
> A machine with 1920 CPUs was used for the above experiments. The output of
> lscpu is included below.
> 
> # lscpu 
> Architecture: ppc64le
> Byte Order: Little Endian
> CPU(s): 1920
> On-line CPU(s) list: 0-1919
> Model name: POWER10 (architected), altivec supported
> Model: 2.0 (pvr 0080 0200)
> Thread(s) per core: 8
> Core(s) per socket: 14
> Socket(s): 17
> Physical sockets: 15
> Physical chips: 1
> Physical cores/chip: 16
> 
> Through our experiments we have found that, even when offlining a single CPU,
> the functions responsible for exporting sched_domain information take more
> time with debugfs than with sysctl.
> 
> Experiments using the trace-cmd function-graph plugin have shown that the
> execution times of certain functions common to both scenarios (procfs and
> debugfs) differ drastically.
> 
> The table below lists the execution times of some of these symbols for the
> sysctl (procfs) and debugfs cases.
> 
> |--------------------------------+----------------+--------------|
> | method                         | sysctl         | debugfs      |
> |--------------------------------+----------------+--------------|
> | unregister_sysctl_table        |   0.020050 s   | NA           |
> | build_sched_domains            |   3.090563 s   | 3.119130 s   |
> | register_sched_domain_sysctl   |   0.065487 s   | NA           |
> | update_sched_domain_debugfs    |   NA           | 2.791232 s   |
> | partition_sched_domains_locked |   3.195958 s   | 5.933254 s   |
> |--------------------------------+----------------+--------------|
> 
> Note: partition_sched_domains_locked internally calls build_sched_domains,
>       plus the functions specific to whichever interface is currently used
>       to export the information, i.e. sysctl or debugfs.
> 
> The above numbers are from the case of offlining one CPU on a system with
> 1920 online CPUs.
> 
> From the above table, register_sched_domain_sysctl and
> unregister_sysctl_table collectively took ~0.085 s, whereas
> update_sched_domain_debugfs alone took ~2.79 s.
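(Those figures are internally consistent with the table; a quick check:)

```python
# Seconds, taken from the table above.
sysctl_export  = 0.020050 + 0.065487   # unregister_sysctl_table + register_sched_domain_sysctl
debugfs_export = 2.791232              # update_sched_domain_debugfs

print(round(sysctl_export, 3))                # 0.086
print(round(debugfs_export / sysctl_export))  # 33 -- export path ~33x slower

# The gap between the two partition_sched_domains_locked totals should be
# mostly this export cost:
print(round(5.933254 - 3.195958, 3))          # 2.737 -- close to 2.791
```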
> 
> Root cause:
> 
> The observed regression stems from the way these two pseudo-filesystems handle
> creation and deletion of files and directories internally.  
> 
> update_sched_domain_debugfs builds and exports the sched_domain information
> to userspace. It begins by removing/tearing down the per-CPU directories
> under /sys/kernel/debug/sched/domains/ one by one, for each possible CPU,
> and then recreates the per-CPU, per-domain files and directories one by one,
> again for each possible CPU.
> 
> Excerpt from the trace-cmd output for the debugfs case
> ...
>              |  update_sched_domain_debugfs() {
> + 14.526 us  |    debugfs_lookup();
> # 1092.64 us |    debugfs_remove();
> + 48.408 us  |    debugfs_create_dir();      -   creates per-cpu    dir
>   9.038 us   |    debugfs_create_dir();      -   creates per-domain dir
>   9.638 us   |    debugfs_create_ulong();   -+
>   7.762 us   |    debugfs_create_ulong();    |
>   7.776 us   |    debugfs_create_u64();      |
>   7.502 us   |    debugfs_create_u32();      |__ creates per-domain files
>   7.646 us   |    debugfs_create_u32();      |
>   7.702 us   |    debugfs_create_u32();      |
>   6.974 us   |    debugfs_create_str();      |
>   7.628 us   |    debugfs_create_file();    -+
> ...                                          -   repeat other domains and cpus
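(From the trace excerpt, each CPU costs one debugfs_remove plus one directory
create, and each domain another directory plus eight file creates. A rough,
back-of-envelope count of the debugfs calls this pattern issues; the number of
domains per CPU is an assumption for illustration, not a measured value:)

```python
def debugfs_calls(cpus, domains_per_cpu):
    """Count debugfs API calls made by the teardown/recreate pattern above."""
    per_cpu    = 1 + 1   # debugfs_remove + per-cpu debugfs_create_dir
    per_domain = 1 + 8   # domain dir + 8 per-domain file creates (see trace)
    return cpus * (per_cpu + domains_per_cpu * per_domain)

print(debugfs_calls(1920, 3))   # 55680 calls for the 1920-CPU machine
```

And this whole pattern repeats once per CPU hotplug operation, so offlining
many CPUs multiplies it again.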
> 
> As a first step, we used debugfs_remove_recursive() to remove the entries
> for each CPU in one go instead of calling debugfs_remove() per CPU, but we
> did not see any improvement whatsoever.
> 
> We understand that debugfs does not concern itself with performance, and
> that the smt=off operation is not invoked very often, statistically
> speaking. But as the CPU count of a system scales, debugfs becomes a massive
> performance bottleneck that should not be ignored.
> 
> Even on a system with only 240 CPUs, update_sched_domain_debugfs is roughly
> 17 times slower than register_sched_domain_sysctl when building the
> sched_domain directories for those 240 CPUs.
> 
> # For 240 CPU system
> |------------------------------+---------------|
> | method                       | time taken    |
> |------------------------------+---------------|
> | update_sched_domain_debugfs  | 236550.996 us |
> | register_sched_domain_sysctl | 13907.940 us  |
> |------------------------------+---------------|
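(A quick check of the ratio that table implies:)

```python
# Microseconds, from the 240-CPU table above.
print(round(236550.996 / 13907.940, 1))   # 17.0 -- debugfs is ~17x slower
```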
> 
> Any ideas from the community on how to improve this are much appreciated.
> 
> Meanwhile, we will keep posting our progress updates.
> 
> -- vishal.c




Thread overview: 21+ messages
2022-10-17 13:10 sched/debug: CPU hotplug operation suffers in a large cpu systems Vishal Chourasia
2022-10-17 14:19 ` Peter Zijlstra [this message]
2022-10-17 14:54   ` Greg Kroah-Hartman
2022-10-18 10:37     ` Vishal Chourasia
2022-10-18 11:04       ` Greg Kroah-Hartman
2022-10-26  6:37         ` Vishal Chourasia
2022-10-26  7:02           ` Greg Kroah-Hartman
2022-10-26  9:10             ` Peter Zijlstra
2022-11-08 10:00               ` Vishal Chourasia
2022-11-08 12:24                 ` Greg Kroah-Hartman
2022-11-08 14:51                   ` Srikar Dronamraju
2022-11-08 15:38                     ` Greg Kroah-Hartman
2022-12-12 19:17                   ` Phil Auld
2022-12-13  2:17                     ` kernel test robot
2022-12-13  6:23                     ` Greg Kroah-Hartman
2022-12-13 13:22                       ` Phil Auld
2022-12-13 14:31                         ` Greg Kroah-Hartman
2022-12-13 14:45                           ` Phil Auld
2023-01-19 15:31                           ` Phil Auld
2022-12-13 23:41                         ` Michael Ellerman
2022-12-14  2:26                           ` Phil Auld
