[RFC PATCH 0/7] Optimization to reduce the cost of newidle balance

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: Chen Yu <yu.c.chen@intel.com>
To: Peter Zijlstra <peterz@infradead.org>,
	Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ingo Molnar <mingo@redhat.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	Tim Chen <tim.c.chen@intel.com>,
	Mel Gorman <mgorman@techsingularity.net>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	K Prateek Nayak <kprateek.nayak@amd.com>,
	"Gautham R . Shenoy" <gautham.shenoy@amd.com>,
	Chen Yu <yu.chen.surf@gmail.com>, Aaron Lu <aaron.lu@intel.com>,
	linux-kernel@vger.kernel.org, Chen Yu <yu.c.chen@intel.com>
Subject: [RFC PATCH 0/7] Optimization to reduce the cost of newidle balance
Date: Thu, 27 Jul 2023 22:33:58 +0800	[thread overview]
Message-ID: <cover.1690273854.git.yu.c.chen@intel.com> (raw)

Hi,

This is the second version of the newidle balance optimization[1].
It aims to reduce the cost of newidle balance which is found to
occupy noticeable CPU cycles on some high-core count systems.

For example, when running sqlite on Intel Sapphire Rapids, which has
2 x 56C/112T = 224 CPUs:

6.69%    0.09%  sqlite3     [kernel.kallsyms]   [k] newidle_balance
5.39%    4.71%  sqlite3     [kernel.kallsyms]   [k] update_sd_lb_stats

To mitigate this cost, the optimization is inspired by the question
raised by Tim:
Do we always have to find the busiest group and pull from it? Would
a relatively busy group be enough?

There are two proposals in this patch set.
The first one is ILB_UTIL. It was proposed to limit the scan
depth in update_sd_lb_stats(). The scan depth is based on the
overall utilization of this sched domain. The higher the utilization
is, the less update_sd_lb_stats() scans. Vice versa.

The second one is ILB_FAST. Instead of always finding the busiest
group in update_sd_lb_stats(), lower the bar and try to find a
relatively busy group. ILB_FAST takes effect when the local group
is group_has_spare. Because when there are many CPUs running
newidle_balance() concurrently, the sched groups should have a
high idle percentage.

Compared between ILB_UTIL and ILB_FAST, the former inhibits the
sched group scan when the system is busy. While the latter
chooses a compromised busy group when the system is not busy.
So they are complementary to each other and work independently.

patch 1/7 and patch 2/7 are preparation for ILB_UTIL.

patch 3/7 is a preparation for both ILB_UTIL and ILB_FAST.

patch 4/7 is part of ILB_UTIL. It calculates the scan depth
          of sched groups which will be used by
          update_sd_lb_stats(). The depth is calculated by the
          periodic load balance.

patch 5/7 introduces the ILB_UTIL.

patch 6/7 introduces the ILB_FAST.

patch 7/7 is a debug patch to print more sched statistics, inspired
          by Prateek's test report.

In the previous version, Prateek found some regressions[2].
This is probably caused by:
1. Cross Numa access to sched_domain_shared. So this version removed
   the sched_domain_shared for Numa domain.
2. newidle balance did not try so hard to scan for the busiest
   group. This version still keeps the linear scan function. If
   the regression is still there, we can try to leverage the result
   of SIS_UTIL. Because SIS_UTIL is a quadratic function which
   could help scan the domain harder when the system is not
   overloaded.

Changes since the previous version:
1. For all levels except for NUMA, connect a sched_domain_shared
   instance. This makes the newidle balance optimization more
   generic, and not only for LLC domain. (Peter, Gautham)
2. Introduce ILB_FAST, which terminates the sched group scan
   earlier, if it finds a proper group rather than the busiest
   one (Tim).

Peter has suggested reusing the statistics of the sched group
if multiple CPUs trigger newidle balance concurrently[3]. I created
a prototype[4] based on this direction. According to the test, there
are some regressions. The bottlenecks are a spin_trylock() and the
memory load from the 'cached' shared region. It is still under
investigation so I did not include that change into this patch set.

Any comments would be appreciated.

[1] https://lore.kernel.org/lkml/cover.1686554037.git.yu.c.chen@intel.com/
[2] https://lore.kernel.org/lkml/7e31ad34-ce2c-f64b-a852-f88f8a5749a6@amd.com/
[3] https://lore.kernel.org/lkml/20230621111721.GA2053369@hirez.programming.kicks-ass.net/
[4] https://github.com/chen-yu-surf/linux/commit/a6b33df883b972d6aaab5fceeddb11c34cc59059.patch

Chen Yu (7):
  sched/topology: Assign sd_share for all non NUMA sched domains
  sched/topology: Introduce nr_groups in sched_domain to indicate the
    number of groups
  sched/fair: Save a snapshot of sched domain total_load and
    total_capacity
  sched/fair: Calculate the scan depth for idle balance based on system
    utilization
  sched/fair: Adjust the busiest group scanning depth in idle load
    balance
  sched/fair: Pull from a relatively busy group during newidle balance
  sched/stats: Track the scan number of groups during load balance

 include/linux/sched/topology.h |   5 ++
 kernel/sched/fair.c            | 114 ++++++++++++++++++++++++++++++++-
 kernel/sched/features.h        |   4 ++
 kernel/sched/stats.c           |   5 +-
 kernel/sched/topology.c        |  14 ++--
 5 files changed, 135 insertions(+), 7 deletions(-)

-- 
2.25.1

next             reply	other threads:[~2023-07-27  6:38 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-07-27 14:33 Chen Yu [this message]
2023-07-27 14:34 ` [RFC PATCH 1/7] sched/topology: Assign sd_share for all non NUMA sched domains Chen Yu
2023-07-27 14:34 ` [RFC PATCH 2/7] sched/topology: Introduce nr_groups in sched_domain to indicate the number of groups Chen Yu
2023-07-27 14:34 ` [RFC PATCH 3/7] sched/fair: Save a snapshot of sched domain total_load and total_capacity Chen Yu
2023-07-27 14:35 ` [RFC PATCH 4/7] sched/fair: Calculate the scan depth for idle balance based on system utilization Chen Yu
2023-08-25  6:02   ` Shrikanth Hegde
2023-08-30 15:30     ` Chen Yu
2023-07-27 14:35 ` [RFC PATCH 5/7] sched/fair: Adjust the busiest group scanning depth in idle load balance Chen Yu
2023-08-25  6:00   ` Shrikanth Hegde
2023-08-30 15:35     ` Chen Yu
2023-07-27 14:35 ` [RFC PATCH 6/7] sched/fair: Pull from a relatively busy group during newidle balance Chen Yu
2023-07-27 14:35 ` [RFC PATCH 7/7] sched/stats: Track the scan number of groups during load balance Chen Yu
2023-08-25  7:48 ` [RFC PATCH 0/7] Optimization to reduce the cost of newidle balance Shrikanth Hegde
2023-08-30 15:26   ` Chen Yu
2023-09-10  7:51     ` Shrikanth Hegde
2024-07-16 14:16 ` Matt Fleming
2024-07-17  3:52   ` Chen Yu
2024-07-17 15:31     ` Matt Fleming
2024-07-17 12:17 ` Peter Zijlstra
2024-07-18  9:28   ` K Prateek Nayak
2024-07-18 17:01     ` Chen Yu
2024-07-18 16:57   ` Chen Yu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=cover.1690273854.git.yu.c.chen@intel.com \
    --to=yu.c.chen@intel.com \
    --cc=aaron.lu@intel.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=gautham.shenoy@amd.com \
    --cc=juri.lelli@redhat.com \
    --cc=kprateek.nayak@amd.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@techsingularity.net \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=tim.c.chen@intel.com \
    --cc=vincent.guittot@linaro.org \
    --cc=yu.chen.surf@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox