* [RFC PATCH 0/7] Optimization to reduce the cost of newidle balance
@ 2023-07-27 14:33 Chen Yu
2023-07-27 14:34 ` [RFC PATCH 1/7] sched/topology: Assign sd_share for all non NUMA sched domains Chen Yu
` (9 more replies)
0 siblings, 10 replies; 22+ messages in thread
From: Chen Yu @ 2023-07-27 14:33 UTC (permalink / raw)
To: Peter Zijlstra, Vincent Guittot
Cc: Ingo Molnar, Juri Lelli, Tim Chen, Mel Gorman, Dietmar Eggemann,
K Prateek Nayak, Gautham R . Shenoy, Chen Yu, Aaron Lu,
linux-kernel, Chen Yu
Hi,
This is the second version of the newidle balance optimization[1].
It aims to reduce the cost of newidle balance, which has been found
to occupy noticeable CPU cycles on some high-core-count systems.
For example, when running sqlite on Intel Sapphire Rapids, which has
2 x 56C/112T = 224 CPUs:
6.69% 0.09% sqlite3 [kernel.kallsyms] [k] newidle_balance
5.39% 4.71% sqlite3 [kernel.kallsyms] [k] update_sd_lb_stats
To mitigate this cost, the optimization is inspired by the question
raised by Tim:
Do we always have to find the busiest group and pull from it? Would
a relatively busy group be enough?
There are two proposals in this patch set.
The first one is ILB_UTIL. It limits the scan depth in
update_sd_lb_stats() based on the overall utilization of the
sched domain: the higher the utilization, the fewer groups
update_sd_lb_stats() scans, and vice versa.
The second one is ILB_FAST. Instead of always finding the busiest
group in update_sd_lb_stats(), lower the bar and try to find a
relatively busy group. ILB_FAST takes effect when the local group
is group_has_spare, because when many CPUs are running
newidle_balance() concurrently, the sched groups should have a
high idle percentage.
Comparing ILB_UTIL and ILB_FAST: the former inhibits the
sched group scan when the system is busy, while the latter
settles for a compromise busy group when the system is not busy.
So they are complementary to each other and work independently.
patch 1/7 and patch 2/7 are preparation for ILB_UTIL.
patch 3/7 is a preparation for both ILB_UTIL and ILB_FAST.
patch 4/7 is part of ILB_UTIL. It calculates the scan depth
of sched groups, which will later be used by
update_sd_lb_stats(). The depth is computed during the
periodic load balance.
patch 5/7 introduces the ILB_UTIL.
patch 6/7 introduces the ILB_FAST.
patch 7/7 is a debug patch to print more sched statistics, inspired
by Prateek's test report.
In the previous version, Prateek found some regressions[2].
This is probably caused by:
1. Cross-NUMA access to sched_domain_shared. This version therefore
removes the sched_domain_shared for the NUMA domain.
2. newidle balance did not try hard enough to scan for the busiest
group. This version still keeps the linear scan function. If
the regression is still there, we can try to leverage the result
of SIS_UTIL, because SIS_UTIL uses a quadratic function which
could scan the domain harder when the system is not
overloaded.
Changes since the previous version:
1. For all levels except for NUMA, connect a sched_domain_shared
instance. This makes the newidle balance optimization more
generic, and not only for LLC domain. (Peter, Gautham)
2. Introduce ILB_FAST, which terminates the sched group scan
early if it finds a suitable group rather than the busiest
one (Tim).
Peter has suggested reusing the statistics of the sched group
when multiple CPUs trigger newidle balance concurrently[3]. I created
a prototype[4] based on this direction. According to the tests, there
are some regressions. The bottlenecks are a spin_trylock() and the
memory load from the 'cached' shared region. This is still under
investigation, so I did not include that change in this patch set.
Any comments would be appreciated.
[1] https://lore.kernel.org/lkml/cover.1686554037.git.yu.c.chen@intel.com/
[2] https://lore.kernel.org/lkml/7e31ad34-ce2c-f64b-a852-f88f8a5749a6@amd.com/
[3] https://lore.kernel.org/lkml/20230621111721.GA2053369@hirez.programming.kicks-ass.net/
[4] https://github.com/chen-yu-surf/linux/commit/a6b33df883b972d6aaab5fceeddb11c34cc59059.patch
Chen Yu (7):
sched/topology: Assign sd_share for all non NUMA sched domains
sched/topology: Introduce nr_groups in sched_domain to indicate the
number of groups
sched/fair: Save a snapshot of sched domain total_load and
total_capacity
sched/fair: Calculate the scan depth for idle balance based on system
utilization
sched/fair: Adjust the busiest group scanning depth in idle load
balance
sched/fair: Pull from a relatively busy group during newidle balance
sched/stats: Track the scan number of groups during load balance
include/linux/sched/topology.h | 5 ++
kernel/sched/fair.c | 114 ++++++++++++++++++++++++++++++++-
kernel/sched/features.h | 4 ++
kernel/sched/stats.c | 5 +-
kernel/sched/topology.c | 14 ++--
5 files changed, 135 insertions(+), 7 deletions(-)
--
2.25.1
^ permalink raw reply [flat|nested] 22+ messages in thread
* [RFC PATCH 1/7] sched/topology: Assign sd_share for all non NUMA sched domains
2023-07-27 14:33 [RFC PATCH 0/7] Optimization to reduce the cost of newidle balance Chen Yu
@ 2023-07-27 14:34 ` Chen Yu
2023-07-27 14:34 ` [RFC PATCH 2/7] sched/topology: Introduce nr_groups in sched_domain to indicate the number of groups Chen Yu
` (8 subsequent siblings)
9 siblings, 0 replies; 22+ messages in thread
From: Chen Yu @ 2023-07-27 14:34 UTC (permalink / raw)
To: Peter Zijlstra, Vincent Guittot
Cc: Ingo Molnar, Juri Lelli, Tim Chen, Mel Gorman, Dietmar Eggemann,
K Prateek Nayak, Gautham R . Shenoy, Chen Yu, Aaron Lu,
linux-kernel, Chen Yu
Currently, only a domain with the SD_SHARE_PKG_RESOURCES flag
shares one sd_share instance among all CPUs in that domain. Remove
this restriction and extend it to the other sched domains below the
NUMA domain.
This shared field will be used by a later patch which optimizes
newidle balancing.
Suggested-by: "Gautham R. Shenoy" <gautham.shenoy@amd.com>
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
kernel/sched/topology.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index d3a3b2646ec4..64212f514765 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1641,10 +1641,10 @@ sd_init(struct sched_domain_topology_level *tl,
}
/*
- * For all levels sharing cache; connect a sched_domain_shared
+ * For all levels except for NUMA; connect a sched_domain_shared
* instance.
*/
- if (sd->flags & SD_SHARE_PKG_RESOURCES) {
+ if (!(sd->flags & SD_NUMA)) {
sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
atomic_inc(&sd->shared->ref);
atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
--
2.25.1
* [RFC PATCH 2/7] sched/topology: Introduce nr_groups in sched_domain to indicate the number of groups
2023-07-27 14:33 [RFC PATCH 0/7] Optimization to reduce the cost of newidle balance Chen Yu
2023-07-27 14:34 ` [RFC PATCH 1/7] sched/topology: Assign sd_share for all non NUMA sched domains Chen Yu
@ 2023-07-27 14:34 ` Chen Yu
2023-07-27 14:34 ` [RFC PATCH 3/7] sched/fair: Save a snapshot of sched domain total_load and total_capacity Chen Yu
` (7 subsequent siblings)
9 siblings, 0 replies; 22+ messages in thread
From: Chen Yu @ 2023-07-27 14:34 UTC (permalink / raw)
To: Peter Zijlstra, Vincent Guittot
Cc: Ingo Molnar, Juri Lelli, Tim Chen, Mel Gorman, Dietmar Eggemann,
K Prateek Nayak, Gautham R . Shenoy, Chen Yu, Aaron Lu,
linux-kernel, Chen Yu
Record the number of sched groups within each sched domain. This
prepares for the newidle_balance() scan depth calculation introduced
by ILB_UTIL.
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
include/linux/sched/topology.h | 1 +
kernel/sched/topology.c | 10 ++++++++--
2 files changed, 9 insertions(+), 2 deletions(-)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 67b573d5bf28..c07f2f00317a 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -152,6 +152,7 @@ struct sched_domain {
struct sched_domain_shared *shared;
unsigned int span_weight;
+ unsigned int nr_groups;
/*
* Span of all CPUs in this domain.
*
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 64212f514765..56dc564fc9a3 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1023,7 +1023,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
struct cpumask *covered = sched_domains_tmpmask;
struct sd_data *sdd = sd->private;
struct sched_domain *sibling;
- int i;
+ int i, nr_groups = 0;
cpumask_clear(covered);
@@ -1087,6 +1087,8 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
if (!sg)
goto fail;
+ nr_groups++;
+
sg_span = sched_group_span(sg);
cpumask_or(covered, covered, sg_span);
@@ -1100,6 +1102,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
last->next = first;
}
sd->groups = first;
+ sd->nr_groups = nr_groups;
return 0;
@@ -1233,7 +1236,7 @@ build_sched_groups(struct sched_domain *sd, int cpu)
struct sd_data *sdd = sd->private;
const struct cpumask *span = sched_domain_span(sd);
struct cpumask *covered;
- int i;
+ int i, nr_groups = 0;
lockdep_assert_held(&sched_domains_mutex);
covered = sched_domains_tmpmask;
@@ -1248,6 +1251,8 @@ build_sched_groups(struct sched_domain *sd, int cpu)
sg = get_group(i, sdd);
+ nr_groups++;
+
cpumask_or(covered, covered, sched_group_span(sg));
if (!first)
@@ -1258,6 +1263,7 @@ build_sched_groups(struct sched_domain *sd, int cpu)
}
last->next = first;
sd->groups = first;
+ sd->nr_groups = nr_groups;
return 0;
}
--
2.25.1
* [RFC PATCH 3/7] sched/fair: Save a snapshot of sched domain total_load and total_capacity
2023-07-27 14:33 [RFC PATCH 0/7] Optimization to reduce the cost of newidle balance Chen Yu
2023-07-27 14:34 ` [RFC PATCH 1/7] sched/topology: Assign sd_share for all non NUMA sched domains Chen Yu
2023-07-27 14:34 ` [RFC PATCH 2/7] sched/topology: Introduce nr_groups in sched_domain to indicate the number of groups Chen Yu
@ 2023-07-27 14:34 ` Chen Yu
2023-07-27 14:35 ` [RFC PATCH 4/7] sched/fair: Calculate the scan depth for idle balance based on system utilization Chen Yu
` (6 subsequent siblings)
9 siblings, 0 replies; 22+ messages in thread
From: Chen Yu @ 2023-07-27 14:34 UTC (permalink / raw)
To: Peter Zijlstra, Vincent Guittot
Cc: Ingo Molnar, Juri Lelli, Tim Chen, Mel Gorman, Dietmar Eggemann,
K Prateek Nayak, Gautham R . Shenoy, Chen Yu, Aaron Lu,
linux-kernel, Chen Yu
Save the total_load and total_capacity of the current sched domain in
each periodic load balance. These statistics can be used later by a
CPU_NEWLY_IDLE load balance if it quits the scan early. Introduce a
sched feature, ILB_SNAPSHOT, to control this. Code can check whether
sd_share->total_capacity is non-zero to verify that the stats are valid.
In theory, once the system has reached a stable state, the total_capacity
and total_load should not change dramatically.
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
include/linux/sched/topology.h | 2 ++
kernel/sched/fair.c | 25 +++++++++++++++++++++++++
kernel/sched/features.h | 2 ++
3 files changed, 29 insertions(+)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index c07f2f00317a..d6a64a2c92aa 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -82,6 +82,8 @@ struct sched_domain_shared {
atomic_t nr_busy_cpus;
int has_idle_cores;
int nr_idle_scan;
+ unsigned long total_load;
+ unsigned long total_capacity;
};
struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b3e25be58e2b..edcfee9965cd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10132,6 +10132,27 @@ static void update_idle_cpu_scan(struct lb_env *env,
WRITE_ONCE(sd_share->nr_idle_scan, (int)y);
}
+static void ilb_save_stats(struct lb_env *env,
+ struct sched_domain_shared *sd_share,
+ struct sd_lb_stats *sds)
+{
+ if (!sched_feat(ILB_SNAPSHOT))
+ return;
+
+ if (!sd_share)
+ return;
+
+ /* newidle balance is too frequent */
+ if (env->idle == CPU_NEWLY_IDLE)
+ return;
+
+ if (sds->total_load != sd_share->total_load)
+ WRITE_ONCE(sd_share->total_load, sds->total_load);
+
+ if (sds->total_capacity != sd_share->total_capacity)
+ WRITE_ONCE(sd_share->total_capacity, sds->total_capacity);
+}
+
/**
* update_sd_lb_stats - Update sched_domain's statistics for load balancing.
* @env: The load balancing environment.
@@ -10140,6 +10161,7 @@ static void update_idle_cpu_scan(struct lb_env *env,
static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
{
+ struct sched_domain_shared *sd_share = env->sd->shared;
struct sched_group *sg = env->sd->groups;
struct sg_lb_stats *local = &sds->local_stat;
struct sg_lb_stats tmp_sgs;
@@ -10209,6 +10231,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
}
update_idle_cpu_scan(env, sum_util);
+
+ /* save a snapshot of stats during periodic load balance */
+ ilb_save_stats(env, sd_share, sds);
}
/**
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..3cb71c8cddc0 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -101,3 +101,5 @@ SCHED_FEAT(LATENCY_WARN, false)
SCHED_FEAT(ALT_PERIOD, true)
SCHED_FEAT(BASE_SLICE, true)
+
+SCHED_FEAT(ILB_SNAPSHOT, true)
--
2.25.1
* [RFC PATCH 4/7] sched/fair: Calculate the scan depth for idle balance based on system utilization
2023-07-27 14:33 [RFC PATCH 0/7] Optimization to reduce the cost of newidle balance Chen Yu
` (2 preceding siblings ...)
2023-07-27 14:34 ` [RFC PATCH 3/7] sched/fair: Save a snapshot of sched domain total_load and total_capacity Chen Yu
@ 2023-07-27 14:35 ` Chen Yu
2023-08-25 6:02 ` Shrikanth Hegde
2023-07-27 14:35 ` [RFC PATCH 5/7] sched/fair: Adjust the busiest group scanning depth in idle load balance Chen Yu
` (5 subsequent siblings)
9 siblings, 1 reply; 22+ messages in thread
From: Chen Yu @ 2023-07-27 14:35 UTC (permalink / raw)
To: Peter Zijlstra, Vincent Guittot
Cc: Ingo Molnar, Juri Lelli, Tim Chen, Mel Gorman, Dietmar Eggemann,
K Prateek Nayak, Gautham R . Shenoy, Chen Yu, Aaron Lu,
linux-kernel, Chen Yu
When a CPU is about to enter idle, it invokes newidle_balance()
to pull some tasks from other runqueues. Although there is a per
domain max_newidle_lb_cost to throttle newidle_balance(), it
would be good to further limit the scan based on overall system
utilization. The reason is that there is no limit on how many
CPUs may launch newidle_balance() simultaneously. Since each
newidle_balance() has to traverse all the groups to calculate
their statistics one by one, the total time cost of
newidle_balance() could be O(n^2), where n is the number of
groups. This issue is more severe if there are many groups
within one domain, for example, a system with a large number of
cores in an LLC domain. This is not good for performance or
power saving.
sqlite has spent quite some time in newidle_balance() on Intel
Sapphire Rapids, which has 2 x 56C/112T = 224 CPUs:
6.69% 0.09% sqlite3 [kernel.kallsyms] [k] newidle_balance
5.39% 4.71% sqlite3 [kernel.kallsyms] [k] update_sd_lb_stats
Based on this observation, limit the scan depth of newidle_balance()
by considering the utilization of the sched domain. Let the number of
scanned groups be a linear function of the utilization ratio:
nr_groups_to_scan = nr_groups * (1 - util_ratio)
Suggested-by: Tim Chen <tim.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
include/linux/sched/topology.h | 1 +
kernel/sched/fair.c | 30 ++++++++++++++++++++++++++++++
kernel/sched/features.h | 1 +
3 files changed, 32 insertions(+)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index d6a64a2c92aa..af2261308529 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -84,6 +84,7 @@ struct sched_domain_shared {
int nr_idle_scan;
unsigned long total_load;
unsigned long total_capacity;
+ int nr_sg_scan;
};
struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index edcfee9965cd..6925813db59b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10153,6 +10153,35 @@ static void ilb_save_stats(struct lb_env *env,
WRITE_ONCE(sd_share->total_capacity, sds->total_capacity);
}
+static void update_ilb_group_scan(struct lb_env *env,
+ unsigned long sum_util,
+ struct sched_domain_shared *sd_share)
+{
+ u64 tmp, nr_scan;
+
+ if (!sched_feat(ILB_UTIL))
+ return;
+
+ if (!sd_share)
+ return;
+
+ if (env->idle == CPU_NEWLY_IDLE)
+ return;
+
+ /*
+ * Limit the newidle balance scan depth based on overall system
+ * utilization:
+ * nr_groups_scan = nr_groups * (1 - util_ratio)
+ * and util_ratio = sum_util / (sd_weight * SCHED_CAPACITY_SCALE)
+ */
+ nr_scan = env->sd->nr_groups * sum_util;
+ tmp = env->sd->span_weight * SCHED_CAPACITY_SCALE;
+ do_div(nr_scan, tmp);
+ nr_scan = env->sd->nr_groups - nr_scan;
+ if ((int)nr_scan != sd_share->nr_sg_scan)
+ WRITE_ONCE(sd_share->nr_sg_scan, (int)nr_scan);
+}
+
/**
* update_sd_lb_stats - Update sched_domain's statistics for load balancing.
* @env: The load balancing environment.
@@ -10231,6 +10260,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
}
update_idle_cpu_scan(env, sum_util);
+ update_ilb_group_scan(env, sum_util, sd_share);
/* save a snapshot of stats during periodic load balance */
ilb_save_stats(env, sd_share, sds);
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 3cb71c8cddc0..30f6d1a2f235 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -103,3 +103,4 @@ SCHED_FEAT(ALT_PERIOD, true)
SCHED_FEAT(BASE_SLICE, true)
SCHED_FEAT(ILB_SNAPSHOT, true)
+SCHED_FEAT(ILB_UTIL, true)
--
2.25.1
* [RFC PATCH 5/7] sched/fair: Adjust the busiest group scanning depth in idle load balance
2023-07-27 14:33 [RFC PATCH 0/7] Optimization to reduce the cost of newidle balance Chen Yu
` (3 preceding siblings ...)
2023-07-27 14:35 ` [RFC PATCH 4/7] sched/fair: Calculate the scan depth for idle balance based on system utilization Chen Yu
@ 2023-07-27 14:35 ` Chen Yu
2023-08-25 6:00 ` Shrikanth Hegde
2023-07-27 14:35 ` [RFC PATCH 6/7] sched/fair: Pull from a relatively busy group during newidle balance Chen Yu
` (4 subsequent siblings)
9 siblings, 1 reply; 22+ messages in thread
From: Chen Yu @ 2023-07-27 14:35 UTC (permalink / raw)
To: Peter Zijlstra, Vincent Guittot
Cc: Ingo Molnar, Juri Lelli, Tim Chen, Mel Gorman, Dietmar Eggemann,
K Prateek Nayak, Gautham R . Shenoy, Chen Yu, Aaron Lu,
linux-kernel, Chen Yu
Scanning the whole sched domain to find the busiest group is costly
during newidle_balance(). And if a CPU becomes idle, it would be good
if this idle CPU pulls some tasks from other CPUs as quickly as possible.
Limit the scan depth of newidle_balance() to only a limited number
of sched groups, to find a relatively busy group and pull from it.
In summary, the more spare capacity there is in the domain, the more
groups each newidle balance can scan. Although the newidle balance
has a per domain max_newidle_lb_cost to decide whether to launch
the balance at all, ILB_UTIL provides a finer granularity to decide
how many groups each newidle balance can scan. The scan depth is
calculated by the previous periodic load balance based on its
overall utilization.
Tested on top of v6.5-rc2, Sapphire Rapids with 2 x 56C/112T = 224 CPUs,
with the cpufreq governor set to performance and C6 disabled.
First, tested with an extreme synthetic test[1], which launches 224
processes. Each process is a loop of nanosleep(1 us), which is supposed
to trigger newidle balance as often as possible:
i=1;while [ $i -le "224" ]; do ./nano_sleep 1000 & i=$(($i+1)); done;
NO_ILB_UTIL + ILB_SNAPSHOT:
9.38% 0.45% [kernel.kallsyms] [k] newidle_balance
6.84% 5.32% [kernel.kallsyms] [k] update_sd_lb_stats.constprop.0
ILB_UTIL + ILB_SNAPSHOT:
3.35% 0.38% [kernel.kallsyms] [k] newidle_balance
2.30% 1.81% [kernel.kallsyms] [k] update_sd_lb_stats.constprop.0
With ILB_UTIL enabled, the total number of newidle_balance() and
update_sd_lb_stats() calls drops, but the reason why there are fewer
newidle balances has not been investigated. According to the low
util_avg values in /sys/kernel/debug/sched/debug, there should be
little impact on the nanosleep stress test.
Test in a wider range:
[netperf]
Launches nr instances of:
netperf -4 -H 127.0.0.1 -t $work_mode -c -C -l 100 &
nr: 56, 112, 168, 224, 280, 336, 392, 448
work_mode: TCP_RR UDP_RR
throughput
=======
case load baseline(std%) compare%( std%)
TCP_RR 56-threads 1.00 ( 5.15) -3.96 ( 2.17)
TCP_RR 112-threads 1.00 ( 2.84) -0.82 ( 2.24)
TCP_RR 168-threads 1.00 ( 2.11) -0.03 ( 2.31)
TCP_RR 224-threads 1.00 ( 1.76) +0.01 ( 2.12)
TCP_RR 280-threads 1.00 ( 62.46) +56.56 ( 56.91)
TCP_RR 336-threads 1.00 ( 19.81) +0.27 ( 17.90)
TCP_RR 392-threads 1.00 ( 30.85) +0.13 ( 29.09)
TCP_RR 448-threads 1.00 ( 39.71) -18.82 ( 45.93)
UDP_RR 56-threads 1.00 ( 2.08) -0.31 ( 7.89)
UDP_RR 112-threads 1.00 ( 3.22) -0.50 ( 15.19)
UDP_RR 168-threads 1.00 ( 11.77) +0.37 ( 10.30)
UDP_RR 224-threads 1.00 ( 14.03) +0.25 ( 12.88)
UDP_RR 280-threads 1.00 ( 16.83) -0.57 ( 15.34)
UDP_RR 336-threads 1.00 ( 22.57) +0.01 ( 24.68)
UDP_RR 392-threads 1.00 ( 33.89) +2.65 ( 33.89)
UDP_RR 448-threads 1.00 ( 44.18) +0.81 ( 41.28)
Considering the std%, there is not much difference for netperf.
[tbench]
tbench -t 100 $job 127.0.0.1
job: 56, 112, 168, 224, 280, 336, 392, 448
throughput
======
case load baseline(std%) compare%( std%)
loopback 56-threads 1.00 ( 2.20) -0.09 ( 2.05)
loopback 112-threads 1.00 ( 0.29) -0.88 ( 0.10)
loopback 168-threads 1.00 ( 0.02) +62.92 ( 54.57)
loopback 224-threads 1.00 ( 0.05) +234.30 ( 1.81)
loopback 280-threads 1.00 ( 0.08) -0.11 ( 0.21)
loopback 336-threads 1.00 ( 0.17) -0.17 ( 0.08)
loopback 392-threads 1.00 ( 0.14) -0.09 ( 0.18)
loopback 448-threads 1.00 ( 0.24) -0.53 ( 0.55)
There is an improvement for tbench in the 224-thread case.
[hackbench]
hackbench -g $job --$work_type --pipe -l 200000 -s 100 -f 28
and
hackbench -g $job --$work_type -l 200000 -s 100 -f 28
job: 1, 2, 4, 8
work_type: process threads
throughput
==========
case load baseline(std%) compare%( std%)
process-pipe 1-groups 1.00 ( 0.20) +1.57 ( 0.58)
process-pipe 2-groups 1.00 ( 3.53) +2.99 ( 2.03)
process-pipe 4-groups 1.00 ( 1.07) +0.17 ( 1.64)
process-sockets 1-groups 1.00 ( 0.36) -0.04 ( 1.44)
process-sockets 2-groups 1.00 ( 0.84) +0.65 ( 1.65)
process-sockets 4-groups 1.00 ( 0.04) +0.89 ( 0.08)
threads-pipe 1-groups 1.00 ( 3.62) -0.53 ( 1.67)
threads-pipe 2-groups 1.00 ( 4.17) -4.79 ( 0.53)
threads-pipe 4-groups 1.00 ( 5.30) +5.06 ( 1.95)
threads-sockets 1-groups 1.00 ( 0.40) +1.44 ( 0.53)
threads-sockets 2-groups 1.00 ( 2.54) +2.21 ( 2.51)
threads-sockets 4-groups 1.00 ( 0.05) +1.29 ( 0.05)
Not much difference for hackbench.
[schbench(old)]
schbench -m $job -t 56 -r 30
job: 1, 2, 4, 8
3 iterations
99.0th latency
========
case load baseline(std%) compare%( std%)
normal 1-mthreads 1.00 ( 0.56) -0.91 ( 0.32)
normal 2-mthreads 1.00 ( 0.95) -4.05 ( 3.63)
normal 4-mthreads 1.00 ( 4.04) -0.30 ( 2.35)
Not much difference for schbench.
[Limitation]
In the previous version, Prateek reported a regression. That could be
due to concurrent access across NUMA nodes, or because ILB_UTIL did
not scan hard enough to pull from the busiest group. The former issue
is fixed by not enabling ILB_UTIL for the NUMA domain. If the
regression is still present in this version, we can leverage the
result of SIS_UTIL to provide a quadratic function rather than a
linear one, to scan harder when the system is idle.
Link: https://raw.githubusercontent.com/chen-yu-surf/tools/master/stress_nanosleep.c #1
Suggested-by: Tim Chen <tim.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
kernel/sched/fair.c | 20 +++++++++++++++++++-
1 file changed, 19 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6925813db59b..4e360ed16e14 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10195,7 +10195,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
struct sg_lb_stats *local = &sds->local_stat;
struct sg_lb_stats tmp_sgs;
unsigned long sum_util = 0;
- int sg_status = 0;
+ int sg_status = 0, nr_sg_scan;
+ /* only newidle CPU can load the snapshot */
+ bool ilb_can_load = env->idle == CPU_NEWLY_IDLE &&
+ sd_share && READ_ONCE(sd_share->total_capacity);
+
+ if (sched_feat(ILB_UTIL) && ilb_can_load)
+ nr_sg_scan = sd_share->nr_sg_scan;
do {
struct sg_lb_stats *sgs = &tmp_sgs;
@@ -10222,6 +10228,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
sds->busiest_stat = *sgs;
}
+ if (sched_feat(ILB_UTIL) && ilb_can_load && --nr_sg_scan <= 0)
+ goto load_snapshot;
+
next_group:
/* Now, start updating sd_lb_stats */
sds->total_load += sgs->group_load;
@@ -10231,6 +10240,15 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
sg = sg->next;
} while (sg != env->sd->groups);
+ ilb_can_load = false;
+
+load_snapshot:
+ if (ilb_can_load) {
+ /* borrow the statistic of previous periodic load balance */
+ sds->total_load = READ_ONCE(sd_share->total_load);
+ sds->total_capacity = READ_ONCE(sd_share->total_capacity);
+ }
+
/*
* Indicate that the child domain of the busiest group prefers tasks
* go to a child's sibling domains first. NB the flags of a sched group
--
2.25.1
* [RFC PATCH 6/7] sched/fair: Pull from a relatively busy group during newidle balance
2023-07-27 14:33 [RFC PATCH 0/7] Optimization to reduce the cost of newidle balance Chen Yu
` (4 preceding siblings ...)
2023-07-27 14:35 ` [RFC PATCH 5/7] sched/fair: Adjust the busiest group scanning depth in idle load balance Chen Yu
@ 2023-07-27 14:35 ` Chen Yu
2023-07-27 14:35 ` [RFC PATCH 7/7] sched/stats: Track the scan number of groups during load balance Chen Yu
` (3 subsequent siblings)
9 siblings, 0 replies; 22+ messages in thread
From: Chen Yu @ 2023-07-27 14:35 UTC (permalink / raw)
To: Peter Zijlstra, Vincent Guittot
Cc: Ingo Molnar, Juri Lelli, Tim Chen, Mel Gorman, Dietmar Eggemann,
K Prateek Nayak, Gautham R . Shenoy, Chen Yu, Aaron Lu,
linux-kernel, Chen Yu
Scanning the whole sched domain to find the busiest group is
costly during newidle_balance() on a high core count system.
Introduce ILB_FAST to lower the bar during the busiest group
scan. If the target sched group is relatively busier than the
local group, terminate the scan and try to pull from that group
directly.
Comparing ILB_UTIL and ILB_FAST: the former inhibits the
sched group scan when the system is busy, while the latter
settles for a compromise busy group when the system is not busy.
So they are complementary to each other and work independently.
Tested on top of v6.5-rc2,
Sapphire Rapids with 2 x 56C/112T = 224 CPUs.
With cpufreq governor set to performance, and C6 disabled.
First, tested with an extreme synthetic test[1] borrowed from
Tianyou. It launches 224 processes. Each process is a loop of
nanosleep(1 us), which is supposed to trigger newidle balance
frequently:
i=1;while [ $i -le "224" ]; do ./nano_sleep 1000 & i=$(($i+1)); done;
[ILB_SNAPSHOT + NO_ILB_UTIL + NO_ILB_FAST]
Check the /proc/schedstat delta on CPU8 within 5 seconds using
the following script[2] by running: schedstat.py -i 5 -c 8
Mon Jul 24 23:43:43 2023 cpu8
.domain0.CPU_IDLE.lb_balanced 843
.domain0.CPU_IDLE.lb_count 843
.domain0.CPU_IDLE.lb_nobusyg 843
.domain0.CPU_IDLE.lb_sg_scan 843
.domain0.CPU_NEWLY_IDLE.lb_balanced 836
.domain0.CPU_NEWLY_IDLE.lb_count 837
.domain0.CPU_NEWLY_IDLE.lb_gained 1
.domain0.CPU_NEWLY_IDLE.lb_imbalance 1
.domain0.CPU_NEWLY_IDLE.lb_nobusyg 836
.domain0.CPU_NEWLY_IDLE.lb_sg_scan 837
.domain1.CPU_IDLE.lb_balanced 41
.domain1.CPU_IDLE.lb_count 41
.domain1.CPU_IDLE.lb_nobusyg 39
.domain1.CPU_IDLE.lb_sg_scan 2145
.domain1.CPU_NEWLY_IDLE.lb_balanced 732 <-----
.domain1.CPU_NEWLY_IDLE.lb_count 822 <-----
.domain1.CPU_NEWLY_IDLE.lb_failed 90
.domain1.CPU_NEWLY_IDLE.lb_imbalance 90
.domain1.CPU_NEWLY_IDLE.lb_nobusyg 497
.domain1.CPU_NEWLY_IDLE.lb_nobusyq 235
.domain1.CPU_NEWLY_IDLE.lb_sg_scan 45210 <-----
.domain1.ttwu_wake_remote 626
.domain2.CPU_IDLE.lb_balanced 15
.domain2.CPU_IDLE.lb_count 15
.domain2.CPU_NEWLY_IDLE.lb_balanced 635
.domain2.CPU_NEWLY_IDLE.lb_count 655
.domain2.CPU_NEWLY_IDLE.lb_failed 20
.domain2.CPU_NEWLY_IDLE.lb_imbalance 40
.domain2.CPU_NEWLY_IDLE.lb_nobusyg 633
.domain2.CPU_NEWLY_IDLE.lb_nobusyq 2
.domain2.CPU_NEWLY_IDLE.lb_sg_scan 655
.stats.rq_cpu_time 227910772
.stats.rq_sched_info.pcount 89393
.stats.rq_sched_info.run_delay 2145671
.stats.sched_count 178783
.stats.sched_goidle 89390
.stats.ttwu_count 89392
.stats.ttwu_local 88766
For domain1, there are 822 newidle balance attempts, and
the total number of groups scanned is 45210, so each
balance scans about 55 groups. Of these 822 balances,
732 become (or already are) balanced, so the effective
balance success ratio is (822 - 732) / 822 = 10.94%
The perf:
9.38% 0.45% [kernel.kallsyms] [k] newidle_balance
6.84% 5.32% [kernel.kallsyms] [k] update_sd_lb_stats.constprop.0
[ILB_SNAPSHOT + NO_ILB_UTIL + ILB_FAST]
Mon Jul 24 23:43:50 2023 cpu8
.domain0.CPU_IDLE.lb_balanced 918
.domain0.CPU_IDLE.lb_count 918
.domain0.CPU_IDLE.lb_nobusyg 918
.domain0.CPU_IDLE.lb_sg_scan 918
.domain0.CPU_NEWLY_IDLE.lb_balanced 1536
.domain0.CPU_NEWLY_IDLE.lb_count 1545
.domain0.CPU_NEWLY_IDLE.lb_failed 1
.domain0.CPU_NEWLY_IDLE.lb_gained 8
.domain0.CPU_NEWLY_IDLE.lb_imbalance 9
.domain0.CPU_NEWLY_IDLE.lb_nobusyg 1536
.domain0.CPU_NEWLY_IDLE.lb_sg_scan 1545
.domain1.CPU_IDLE.lb_balanced 45
.domain1.CPU_IDLE.lb_count 45
.domain1.CPU_IDLE.lb_nobusyg 43
.domain1.CPU_IDLE.lb_sg_scan 2365
.domain1.CPU_NEWLY_IDLE.lb_balanced 1196 <------
.domain1.CPU_NEWLY_IDLE.lb_count 1496 <------
.domain1.CPU_NEWLY_IDLE.lb_failed 296
.domain1.CPU_NEWLY_IDLE.lb_gained 4
.domain1.CPU_NEWLY_IDLE.lb_imbalance 301
.domain1.CPU_NEWLY_IDLE.lb_nobusyg 1182
.domain1.CPU_NEWLY_IDLE.lb_nobusyq 14
.domain1.CPU_NEWLY_IDLE.lb_sg_scan 30127 <------
.domain1.ttwu_wake_remote 2688
.domain2.CPU_IDLE.lb_balanced 13
.domain2.CPU_IDLE.lb_count 13
.domain2.CPU_NEWLY_IDLE.lb_balanced 898
.domain2.CPU_NEWLY_IDLE.lb_count 904
.domain2.CPU_NEWLY_IDLE.lb_failed 6
.domain2.CPU_NEWLY_IDLE.lb_imbalance 11
.domain2.CPU_NEWLY_IDLE.lb_nobusyg 896
.domain2.CPU_NEWLY_IDLE.lb_nobusyq 2
.domain2.CPU_NEWLY_IDLE.lb_sg_scan 904
.stats.rq_cpu_time 239830575
.stats.rq_sched_info.pcount 90879
.stats.rq_sched_info.run_delay 2436461
.stats.sched_count 181732
.stats.sched_goidle 90853
.stats.ttwu_count 90880
.stats.ttwu_local 88192
With ILB_FAST enabled, the CPU_NEWLY_IDLE count in domain1 on CPU8
is 1496, and the total number of groups scanned is 30127. Each
load balance scans about 20 groups, far fewer than the 56 groups
in the domain. Of these 1496 balances, 1196 are balanced, so the
effective balance success ratio is (1496 - 1196) / 1496 = 20.95%,
which is higher than the 10.94% when ILB_FAST is disabled.
perf profile:
2.95% 0.38% [kernel.kallsyms] [k] newidle_balance
2.00% 1.51% [kernel.kallsyms] [k] update_sd_lb_stats.constprop.0
With ILB_FAST enabled, the total cost of update_sd_lb_stats() has dropped significantly.
More benchmark results are shown below.
Baseline is ILB_SNAPSHOT + NO_ILB_UTIL, to compare with
ILB_SNAPSHOT + NO_ILB_UTIL + ILB_FAST
[netperf]
Launches nr instances of:
netperf -4 -H 127.0.0.1 -t $work_mode -c -C -l 100 &
nr: 56, 112, 168, 224, 280, 336, 392, 448
work_mode: TCP_RR UDP_RR
throughput
=======
case load baseline(std%) compare%( std%)
TCP_RR 56-threads 1.00 ( 1.83) +4.25 ( 5.15)
TCP_RR 112-threads 1.00 ( 2.19) +0.96 ( 2.84)
TCP_RR 168-threads 1.00 ( 1.92) -0.04 ( 2.11)
TCP_RR 224-threads 1.00 ( 1.98) -0.03 ( 1.76)
TCP_RR 280-threads 1.00 ( 63.11) -7.59 ( 62.46)
TCP_RR 336-threads 1.00 ( 18.44) -0.45 ( 19.81)
TCP_RR 392-threads 1.00 ( 26.49) -0.09 ( 30.85)
TCP_RR 448-threads 1.00 ( 40.47) -0.28 ( 39.71)
UDP_RR 56-threads 1.00 ( 1.83) -0.31 ( 2.08)
UDP_RR 112-threads 1.00 ( 13.77) +3.58 ( 3.22)
UDP_RR 168-threads 1.00 ( 10.97) -0.08 ( 11.77)
UDP_RR 224-threads 1.00 ( 12.83) -0.04 ( 14.03)
UDP_RR 280-threads 1.00 ( 13.89) +0.35 ( 16.83)
UDP_RR 336-threads 1.00 ( 24.91) +1.38 ( 22.57)
UDP_RR 392-threads 1.00 ( 34.86) -0.91 ( 33.89)
UDP_RR 448-threads 1.00 ( 40.63) +0.70 ( 44.18)
[tbench]
tbench -t 100 $job 127.0.0.1
job: 56, 112, 168, 224, 280, 336, 392, 448
throughput
======
case load baseline(std%) compare%( std%)
loopback 56-threads 1.00 ( 0.89) +1.51 ( 2.20)
loopback 112-threads 1.00 ( 0.03) +1.15 ( 0.29)
loopback 168-threads 1.00 ( 53.55) -37.92 ( 0.02)
loopback 224-threads 1.00 ( 61.24) -43.18 ( 0.01)
loopback 280-threads 1.00 ( 0.04) +0.33 ( 0.08)
loopback 336-threads 1.00 ( 0.35) +0.40 ( 0.17)
loopback 392-threads 1.00 ( 0.61) +0.49 ( 0.14)
loopback 448-threads 1.00 ( 0.08) +0.01 ( 0.24)
[schbench]
schbench -m $job -t 56 -r 30
job: 1, 2, 4, 8
3 iterations
99.0th latency
========
case load baseline(std%) compare%( std%)
normal 1-mthreads 1.00 ( 0.56) -0.45 ( 0.32)
normal 2-mthreads 1.00 ( 0.95) +1.01 ( 3.45)
normal 4-mthreads 1.00 ( 4.04) -0.60 ( 1.26)
[hackbench]
hackbench -g $job --$work_type --pipe -l 200000 -s 100 -f 28
and
hackbench -g $job --$work_type -l 200000 -s 100 -f 28
job: 1, 2, 4, 8
work_type: process threads
throughput
=========
case load baseline(std%) compare%( std%)
process-pipe 1-groups 1.00 ( 0.20) +2.30 ( 0.26)
process-pipe 2-groups 1.00 ( 3.53) +6.14 ( 2.45)
process-pipe 4-groups 1.00 ( 1.07) -4.58 ( 2.58)
process-sockets 1-groups 1.00 ( 0.36) +0.75 ( 1.22)
process-sockets 2-groups 1.00 ( 0.84) +1.26 ( 1.11)
process-sockets 4-groups 1.00 ( 0.04) +0.97 ( 0.11)
threads-pipe 1-groups 1.00 ( 3.62) +3.22 ( 2.64)
threads-pipe 2-groups 1.00 ( 4.17) +5.85 ( 7.53)
threads-pipe 4-groups 1.00 ( 5.30) -4.14 ( 5.39)
threads-sockets 1-groups 1.00 ( 0.40) +3.50 ( 3.13)
threads-sockets 2-groups 1.00 ( 2.54) +1.79 ( 0.80)
threads-sockets 4-groups 1.00 ( 0.05) +1.33 ( 0.03)
Considering the std%, there is not much score difference noticed.
This probably indicates that ILB_FAST reduces the cost of newidle
balance without hurting performance.
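The baseline(std%)/compare%(std%) columns in the tables above can be reproduced with a few lines. This is a sketch of the presumed methodology only; the exact post-processing scripts for these runs are not shown here, so normalize() and its std% definition (coefficient of variation of each side) are illustrative assumptions:

```python
import statistics

def normalize(baseline_runs, compare_runs):
    """Assumed post-processing: return (compare% delta vs baseline,
    baseline std%, compare std%), where std% is each side's standard
    deviation relative to its own mean."""
    base_mean = statistics.mean(baseline_runs)
    cmp_mean = statistics.mean(compare_runs)
    delta = (cmp_mean - base_mean) / base_mean * 100
    base_std = statistics.stdev(baseline_runs) / base_mean * 100
    cmp_std = statistics.stdev(compare_runs) / cmp_mean * 100
    return delta, base_std, cmp_std

# Two hypothetical throughput runs per side:
d, bs, cs = normalize([100, 102], [104, 106])
print(round(d, 2), round(bs, 2), round(cs, 2))
```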
Link: https://raw.githubusercontent.com/chen-yu-surf/tools/master/stress_nanosleep.c #1
Link: https://raw.githubusercontent.com/chen-yu-surf/tools/master/schedstat.py #2
Suggested-by: Tim Chen <tim.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
kernel/sched/fair.c | 37 +++++++++++++++++++++++++++++++++++++
kernel/sched/features.h | 1 +
2 files changed, 38 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4e360ed16e14..9af57b5a24dc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10182,6 +10182,36 @@ static void update_ilb_group_scan(struct lb_env *env,
WRITE_ONCE(sd_share->nr_sg_scan, (int)nr_scan);
}
+static bool can_pull_busiest(struct sg_lb_stats *local,
+ struct sg_lb_stats *busiest)
+{
+ /*
+ * Check if the local group can pull from the 'busiest'
+ * group directly. When reaching here, update_sd_pick_busiest()
+ * has already filtered a candidate.
+ * The scan in newidle load balance on high-core-count systems
+ * is costly, thus provide this shortcut to find a relatively busy
+ * group rather than the busiest one.
+ *
+ * Only enable this shortcut when the local group is quite
+ * idle. This is because the total cost of newidle_balance()
+ * becomes severe when multiple CPUs fall into idle and launch
+ * newidle_balance() concurrently. And that usually indicates
+ * a group_has_spare status.
+ */
+ if (local->group_type != group_has_spare)
+ return false;
+
+ if (busiest->idle_cpus > local->idle_cpus)
+ return false;
+
+ if (busiest->idle_cpus == local->idle_cpus &&
+ busiest->sum_nr_running <= local->sum_nr_running)
+ return false;
+
+ return true;
+}
+
/**
* update_sd_lb_stats - Update sched_domain's statistics for load balancing.
* @env: The load balancing environment.
@@ -10226,6 +10256,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
if (update_sd_pick_busiest(env, sds, sg, sgs)) {
sds->busiest = sg;
sds->busiest_stat = *sgs;
+ /*
+ * Check if this busiest group can be pulled by the
+ * local group directly.
+ */
+ if (sched_feat(ILB_FAST) && ilb_can_load &&
+ can_pull_busiest(local, sgs))
+ goto load_snapshot;
}
if (sched_feat(ILB_UTIL) && ilb_can_load && --nr_sg_scan <= 0)
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 30f6d1a2f235..4d67e0abb78c 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -104,3 +104,4 @@ SCHED_FEAT(BASE_SLICE, true)
SCHED_FEAT(ILB_SNAPSHOT, true)
SCHED_FEAT(ILB_UTIL, true)
+SCHED_FEAT(ILB_FAST, true)
--
2.25.1
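The early-exit heuristic in can_pull_busiest() above can be modeled in userspace to see which candidates it accepts. This is an illustrative sketch only, not kernel code; GroupStats and GROUP_HAS_SPARE are simplified stand-ins for the kernel's sg_lb_stats and group_has_spare:

```python
from dataclasses import dataclass

GROUP_HAS_SPARE = "group_has_spare"  # stand-in for the kernel enum value

@dataclass
class GroupStats:
    group_type: str
    idle_cpus: int
    sum_nr_running: int

def can_pull_busiest(local: GroupStats, busiest: GroupStats) -> bool:
    """Mirror of the kernel check: stop scanning early and pull from a
    candidate that is merely busier than the (quite idle) local group."""
    if local.group_type != GROUP_HAS_SPARE:
        return False
    if busiest.idle_cpus > local.idle_cpus:
        return False
    if (busiest.idle_cpus == local.idle_cpus and
            busiest.sum_nr_running <= local.sum_nr_running):
        return False
    return True

local = GroupStats(GROUP_HAS_SPARE, idle_cpus=4, sum_nr_running=2)
print(can_pull_busiest(local, GroupStats(GROUP_HAS_SPARE, 1, 6)))  # → True
print(can_pull_busiest(local, GroupStats(GROUP_HAS_SPARE, 4, 2)))  # → False
```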
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [RFC PATCH 7/7] sched/stats: Track the scan number of groups during load balance
2023-07-27 14:33 [RFC PATCH 0/7] Optimization to reduce the cost of newidle balance Chen Yu
` (5 preceding siblings ...)
2023-07-27 14:35 ` [RFC PATCH 6/7] sched/fair: Pull from a relatively busy group during newidle balance Chen Yu
@ 2023-07-27 14:35 ` Chen Yu
2023-08-25 7:48 ` [RFC PATCH 0/7] Optimization to reduce the cost of newidle balance Shrikanth Hegde
` (2 subsequent siblings)
9 siblings, 0 replies; 22+ messages in thread
From: Chen Yu @ 2023-07-27 14:35 UTC (permalink / raw)
To: Peter Zijlstra, Vincent Guittot
Cc: Ingo Molnar, Juri Lelli, Tim Chen, Mel Gorman, Dietmar Eggemann,
K Prateek Nayak, Gautham R . Shenoy, Chen Yu, Aaron Lu,
linux-kernel, Chen Yu
This metric can be used to evaluate the cost and efficiency of
load balancing.
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
include/linux/sched/topology.h | 1 +
kernel/sched/fair.c | 2 ++
kernel/sched/stats.c | 5 +++--
3 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index af2261308529..fa8fc6a497fd 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -124,6 +124,7 @@ struct sched_domain {
unsigned int lb_hot_gained[CPU_MAX_IDLE_TYPES];
unsigned int lb_nobusyg[CPU_MAX_IDLE_TYPES];
unsigned int lb_nobusyq[CPU_MAX_IDLE_TYPES];
+ unsigned int lb_sg_scan[CPU_MAX_IDLE_TYPES];
/* Active load balancing */
unsigned int alb_count;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9af57b5a24dc..96df7c5706d1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10253,6 +10253,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
goto next_group;
+ schedstat_inc(env->sd->lb_sg_scan[env->idle]);
+
if (update_sd_pick_busiest(env, sds, sg, sgs)) {
sds->busiest = sg;
sds->busiest_stat = *sgs;
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 857f837f52cb..38608f791363 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -152,7 +152,7 @@ static int show_schedstat(struct seq_file *seq, void *v)
cpumask_pr_args(sched_domain_span(sd)));
for (itype = CPU_IDLE; itype < CPU_MAX_IDLE_TYPES;
itype++) {
- seq_printf(seq, " %u %u %u %u %u %u %u %u",
+ seq_printf(seq, " %u %u %u %u %u %u %u %u %u",
sd->lb_count[itype],
sd->lb_balanced[itype],
sd->lb_failed[itype],
@@ -160,7 +160,8 @@ static int show_schedstat(struct seq_file *seq, void *v)
sd->lb_gained[itype],
sd->lb_hot_gained[itype],
sd->lb_nobusyq[itype],
- sd->lb_nobusyg[itype]);
+ sd->lb_nobusyg[itype],
+ sd->lb_sg_scan[itype]);
}
seq_printf(seq,
" %u %u %u %u %u %u %u %u %u %u %u %u\n",
--
2.25.1
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 5/7] sched/fair: Adjust the busiest group scanning depth in idle load balance
2023-07-27 14:35 ` [RFC PATCH 5/7] sched/fair: Adjust the busiest group scanning depth in idle load balance Chen Yu
@ 2023-08-25 6:00 ` Shrikanth Hegde
2023-08-30 15:35 ` Chen Yu
0 siblings, 1 reply; 22+ messages in thread
From: Shrikanth Hegde @ 2023-08-25 6:00 UTC (permalink / raw)
To: Chen Yu
Cc: Ingo Molnar, Juri Lelli, Tim Chen, Mel Gorman, Dietmar Eggemann,
K Prateek Nayak, Gautham R . Shenoy, Chen Yu, Aaron Lu,
linux-kernel, Peter Zijlstra, Vincent Guittot
On 7/27/23 8:05 PM, Chen Yu wrote:
> Scanning the whole sched domain to find the busiest group is time costly
> during newidle_balance(). And if a CPU becomes idle, it would be good
> if this idle CPU pulls some tasks from other CPUs as quickly as possible.
>
> Limit the scan depth of newidle_balance() to only scan for a limited number
> of sched groups to find a relatively busy group, and pull from it.
> In summary, the more spare time there is in the domain, the more time
> each newidle balance can spend on scanning for a busy group. Although
> the newidle balance has a per-domain max_newidle_lb_cost to decide
> whether to launch the balance or not, ILB_UTIL provides a finer
> granularity to decide how many groups each newidle balance can scan.
>
> The scanning depth is calculated by the previous periodic load balance
> based on its overall utilization.
>
> Tested on top of v6.5-rc2, Sapphire Rapids with 2 x 56C/112T = 224 CPUs.
> With cpufreq governor set to performance, and C6 disabled.
>
> Firstly, tested on an extreme synthetic test[1], which launches 224
> processes. Each process is a loop of nanosleep(1 us), which is supposed
> to trigger newidle balance as much as possible:
>
> i=1;while [ $i -le "224" ]; do ./nano_sleep 1000 & i=$(($i+1)); done;
>
> NO_ILB_UTIL + ILB_SNAPSHOT:
> 9.38% 0.45% [kernel.kallsyms] [k] newidle_balance
> 6.84% 5.32% [kernel.kallsyms] [k] update_sd_lb_stats.constprop.0
>
> ILB_UTIL + ILB_SNAPSHOT:
> 3.35% 0.38% [kernel.kallsyms] [k] newidle_balance
> 2.30% 1.81% [kernel.kallsyms] [k] update_sd_lb_stats.constprop.0
> [...]
> Link: https://raw.githubusercontent.com/chen-yu-surf/tools/master/stress_nanosleep.c #1
> Suggested-by: Tim Chen <tim.c.chen@intel.com>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> ---
> kernel/sched/fair.c | 20 +++++++++++++++++++-
> 1 file changed, 19 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6925813db59b..4e360ed16e14 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10195,7 +10195,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> struct sg_lb_stats *local = &sds->local_stat;
> struct sg_lb_stats tmp_sgs;
> unsigned long sum_util = 0;
> - int sg_status = 0;
> + int sg_status = 0, nr_sg_scan;
> + /* only newidle CPU can load the snapshot */
> + bool ilb_can_load = env->idle == CPU_NEWLY_IDLE &&
> + sd_share && READ_ONCE(sd_share->total_capacity);
> +
> + if (sched_feat(ILB_UTIL) && ilb_can_load)
Suggestion for a small improvement:
Could it be the following? This could save a few cycles spent checking
whether the feature is enabled when it is not a newidle balance.
if (ilb_can_load && sched_feat(ILB_UTIL))
The same comment applies below in this patch as well as in PATCH 6/7.
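The suggestion relies on short-circuit evaluation of &&: when the first operand is false, the second is never evaluated, so the cheap, most-often-false condition should come first. A small illustration of the principle (not kernel code; the counter exists only to make the effect visible):

```python
feat_checks = 0

def sched_feat_ilb_util() -> bool:
    """Stand-in for sched_feat(ILB_UTIL): cheap, but still an access
    that short-circuiting can skip entirely."""
    global feat_checks
    feat_checks += 1
    return True

hits = 0
for i in range(100):
    ilb_can_load = (i % 10 == 0)  # rarely true, like a newidle balance

    # Rare condition first: sched_feat_ilb_util() runs only when
    # ilb_can_load is already true.
    if ilb_can_load and sched_feat_ilb_util():
        hits += 1

print(hits, feat_checks)  # → 10 10 (feature checked 10 times, not 100)
```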
> + nr_sg_scan = sd_share->nr_sg_scan;
>
> do {
> struct sg_lb_stats *sgs = &tmp_sgs;
> @@ -10222,6 +10228,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> sds->busiest_stat = *sgs;
> }
>
> + if (sched_feat(ILB_UTIL) && ilb_can_load && --nr_sg_scan <= 0)
> + goto load_snapshot;
> +
Same comment as above.
> next_group:
> /* Now, start updating sd_lb_stats */
> sds->total_load += sgs->group_load;
> @@ -10231,6 +10240,15 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> sg = sg->next;
> } while (sg != env->sd->groups);
>
> + ilb_can_load = false;
> +
> +load_snapshot:
> + if (ilb_can_load) {
> + /* borrow the statistic of previous periodic load balance */
> + sds->total_load = READ_ONCE(sd_share->total_load);
> + sds->total_capacity = READ_ONCE(sd_share->total_capacity);
> + }
> +
> /*
> * Indicate that the child domain of the busiest group prefers tasks
> * go to a child's sibling domains first. NB the flags of a sched group
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 4/7] sched/fair: Calculate the scan depth for idle balance based on system utilization
2023-07-27 14:35 ` [RFC PATCH 4/7] sched/fair: Calculate the scan depth for idle balance based on system utilization Chen Yu
@ 2023-08-25 6:02 ` Shrikanth Hegde
2023-08-30 15:30 ` Chen Yu
0 siblings, 1 reply; 22+ messages in thread
From: Shrikanth Hegde @ 2023-08-25 6:02 UTC (permalink / raw)
To: Chen Yu
Cc: Ingo Molnar, Juri Lelli, Tim Chen, Mel Gorman, Dietmar Eggemann,
K Prateek Nayak, Gautham R . Shenoy, Chen Yu, Aaron Lu,
linux-kernel, Peter Zijlstra, Vincent Guittot
On 7/27/23 8:05 PM, Chen Yu wrote:
> When the CPU is about to enter idle, it invokes newidle_balance()
> to pull some tasks from other runqueues. Although there is per
> domain max_newidle_lb_cost to throttle the newidle_balance(), it
> would be good to further limit the scan based on overall system
> utilization. The reason is that there is no limitation for
> newidle_balance() to launch this balance simultaneously on
> multiple CPUs. Since each newidle_balance() has to traverse all
> the groups to calculate the statistics one by one, this total
> time cost on newidle_balance() could be O(n^2). n is the number
> of groups. This issue is more severe if there are many groups
> within 1 domain, for example, a system with a large number of
> Cores in a LLC domain. This is not good for performance or
> power saving.
>
> sqlite has spent quite some time on newidle balance() on Intel
> Sapphire Rapids, which has 2 x 56C/112T = 224 CPUs:
> 6.69% 0.09% sqlite3 [kernel.kallsyms] [k] newidle_balance
> 5.39% 4.71% sqlite3 [kernel.kallsyms] [k] update_sd_lb_stats
>
> Based on this observation, limit the scan depth of newidle_balance()
> by considering the utilization of the sched domain. Let the number of
> scanned groups be a linear function of the utilization ratio:
>
> nr_groups_to_scan = nr_groups * (1 - util_ratio)
>
> Suggested-by: Tim Chen <tim.c.chen@intel.com>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> ---
> include/linux/sched/topology.h | 1 +
> kernel/sched/fair.c | 30 ++++++++++++++++++++++++++++++
> kernel/sched/features.h | 1 +
> 3 files changed, 32 insertions(+)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index d6a64a2c92aa..af2261308529 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -84,6 +84,7 @@ struct sched_domain_shared {
> int nr_idle_scan;
> unsigned long total_load;
> unsigned long total_capacity;
> + int nr_sg_scan;
> };
>
> struct sched_domain {
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index edcfee9965cd..6925813db59b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10153,6 +10153,35 @@ static void ilb_save_stats(struct lb_env *env,
> WRITE_ONCE(sd_share->total_capacity, sds->total_capacity);
> }
>
> +static void update_ilb_group_scan(struct lb_env *env,
> + unsigned long sum_util,
> + struct sched_domain_shared *sd_share)
> +{
> + u64 tmp, nr_scan;
> +
> + if (!sched_feat(ILB_UTIL))
> + return;
> +
> + if (!sd_share)
> + return;
> +
> + if (env->idle == CPU_NEWLY_IDLE)
> + return;
Suggestion for a small improvement:
The first if condition here could be the check for newidle. Since that
case occurs very often, this could save a few cycles of checking the
sched feature.
> + if (env->idle == CPU_NEWLY_IDLE)
> + return;
> +
> + /*
> + * Limit the newidle balance scan depth based on overall system
> + * utilization:
> + * nr_groups_scan = nr_groups * (1 - util_ratio)
> + * and util_ratio = sum_util / (sd_weight * SCHED_CAPACITY_SCALE)
> + */
> + nr_scan = env->sd->nr_groups * sum_util;
> + tmp = env->sd->span_weight * SCHED_CAPACITY_SCALE;
> + do_div(nr_scan, tmp);
> + nr_scan = env->sd->nr_groups - nr_scan;
> + if ((int)nr_scan != sd_share->nr_sg_scan)
> + WRITE_ONCE(sd_share->nr_sg_scan, (int)nr_scan);
> +}
> +
> /**
> * update_sd_lb_stats - Update sched_domain's statistics for load balancing.
> * @env: The load balancing environment.
> @@ -10231,6 +10260,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> }
>
> update_idle_cpu_scan(env, sum_util);
> + update_ilb_group_scan(env, sum_util, sd_share);
>
> /* save a snapshot of stats during periodic load balance */
> ilb_save_stats(env, sd_share, sds);
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 3cb71c8cddc0..30f6d1a2f235 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -103,3 +103,4 @@ SCHED_FEAT(ALT_PERIOD, true)
> SCHED_FEAT(BASE_SLICE, true)
>
> SCHED_FEAT(ILB_SNAPSHOT, true)
> +SCHED_FEAT(ILB_UTIL, true)
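For reference, the scan-depth formula quoted above reduces to a few lines of integer arithmetic. This is a sketch only; ilb_nr_scan() is an illustrative model of update_ilb_group_scan(), and SCHED_CAPACITY_SCALE is 1024 in the kernel:

```python
SCHED_CAPACITY_SCALE = 1024  # kernel fixed-point capacity unit

def ilb_nr_scan(nr_groups: int, span_weight: int, sum_util: int) -> int:
    """nr_groups_to_scan = nr_groups * (1 - util_ratio), computed with
    the same integer arithmetic as the quoted update_ilb_group_scan():
    util_ratio = sum_util / (span_weight * SCHED_CAPACITY_SCALE)."""
    tmp = span_weight * SCHED_CAPACITY_SCALE
    scanned_off = (nr_groups * sum_util) // tmp  # nr_groups * util_ratio
    return nr_groups - scanned_off

# 56 groups, 112 CPUs in the domain, ~50% utilized:
print(ilb_nr_scan(56, 112, 112 * 512))  # → 28, i.e. scan half the groups
```

The higher the domain utilization, the fewer groups each newidle balance is allowed to scan, down to zero at full utilization.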
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 0/7] Optimization to reduce the cost of newidle balance
2023-07-27 14:33 [RFC PATCH 0/7] Optimization to reduce the cost of newidle balance Chen Yu
` (6 preceding siblings ...)
2023-07-27 14:35 ` [RFC PATCH 7/7] sched/stats: Track the scan number of groups during load balance Chen Yu
@ 2023-08-25 7:48 ` Shrikanth Hegde
2023-08-30 15:26 ` Chen Yu
2024-07-16 14:16 ` Matt Fleming
2024-07-17 12:17 ` Peter Zijlstra
9 siblings, 1 reply; 22+ messages in thread
From: Shrikanth Hegde @ 2023-08-25 7:48 UTC (permalink / raw)
To: Chen Yu
Cc: Ingo Molnar, Juri Lelli, Tim Chen, Mel Gorman, Dietmar Eggemann,
K Prateek Nayak, Gautham R . Shenoy, Chen Yu, Aaron Lu,
linux-kernel, Peter Zijlstra, Vincent Guittot
On 7/27/23 8:03 PM, Chen Yu wrote:
> Hi,
>
> This is the second version of the newidle balance optimization[1].
> It aims to reduce the cost of newidle balance which is found to
> occupy noticeable CPU cycles on some high-core count systems.
>
> For example, when running sqlite on Intel Sapphire Rapids, which has
> 2 x 56C/112T = 224 CPUs:
>
> 6.69% 0.09% sqlite3 [kernel.kallsyms] [k] newidle_balance
> 5.39% 4.71% sqlite3 [kernel.kallsyms] [k] update_sd_lb_stats
>
> To mitigate this cost, the optimization is inspired by the question
> raised by Tim:
> Do we always have to find the busiest group and pull from it? Would
> a relatively busy group be enough?
>
> There are two proposals in this patch set.
> The first one is ILB_UTIL. It was proposed to limit the scan
> depth in update_sd_lb_stats(). The scan depth is based on the
> overall utilization of this sched domain. The higher the utilization
> is, the less update_sd_lb_stats() scans. Vice versa.
>
> The second one is ILB_FAST. Instead of always finding the busiest
> group in update_sd_lb_stats(), lower the bar and try to find a
> relatively busy group. ILB_FAST takes effect when the local group
> is group_has_spare. Because when there are many CPUs running
> newidle_balance() concurrently, the sched groups should have a
> high idle percentage.
>
> Compared between ILB_UTIL and ILB_FAST, the former inhibits the
> sched group scan when the system is busy. While the latter
> chooses a compromised busy group when the system is not busy.
> So they are complementary to each other and work independently.
>
> patch 1/7 and patch 2/7 are preparation for ILB_UTIL.
>
> patch 3/7 is a preparation for both ILB_UTIL and ILB_FAST.
>
> patch 4/7 is part of ILB_UTIL. It calculates the scan depth
> of sched groups which will be used by
> update_sd_lb_stats(). The depth is calculated by the
> periodic load balance.
>
> patch 5/7 introduces the ILB_UTIL.
>
> patch 6/7 introduces the ILB_FAST.
>
> patch 7/7 is a debug patch to print more sched statistics, inspired
> by Prateek's test report.
>
> In the previous version, Prateek found some regressions[2].
> This is probably caused by:
> 1. Cross Numa access to sched_domain_shared. So this version removed
> the sched_domain_shared for Numa domain.
> 2. newidle balance did not try so hard to scan for the busiest
> group. This version still keeps the linear scan function. If
> the regression is still there, we can try to leverage the result
> of SIS_UTIL. Because SIS_UTIL is a quadratic function which
> could help scan the domain harder when the system is not
> overloaded.
>
> Changes since the previous version:
> 1. For all levels except for NUMA, connect a sched_domain_shared
> instance. This makes the newidle balance optimization more
> generic, and not only for LLC domain. (Peter, Gautham)
> 2. Introduce ILB_FAST, which terminates the sched group scan
> earlier, if it finds a proper group rather than the busiest
> one (Tim).
>
>
> Peter has suggested reusing the statistics of the sched group
> if multiple CPUs trigger newidle balance concurrently[3]. I created
> a prototype[4] based on this direction. According to the test, there
> are some regressions. The bottlenecks are a spin_trylock() and the
> memory load from the 'cached' shared region. It is still under
> investigation so I did not include that change into this patch set.
>
> Any comments would be appreciated.
>
> [1] https://lore.kernel.org/lkml/cover.1686554037.git.yu.c.chen@intel.com/
> [2] https://lore.kernel.org/lkml/7e31ad34-ce2c-f64b-a852-f88f8a5749a6@amd.com/
> [3] https://lore.kernel.org/lkml/20230621111721.GA2053369@hirez.programming.kicks-ass.net/
> [4] https://github.com/chen-yu-surf/linux/commit/a6b33df883b972d6aaab5fceeddb11c34cc59059.patch
>
> Chen Yu (7):
> sched/topology: Assign sd_share for all non NUMA sched domains
> sched/topology: Introduce nr_groups in sched_domain to indicate the
> number of groups
> sched/fair: Save a snapshot of sched domain total_load and
> total_capacity
> sched/fair: Calculate the scan depth for idle balance based on system
> utilization
> sched/fair: Adjust the busiest group scanning depth in idle load
> balance
> sched/fair: Pull from a relatively busy group during newidle balance
> sched/stats: Track the scan number of groups during load balance
>
> include/linux/sched/topology.h | 5 ++
> kernel/sched/fair.c | 114 ++++++++++++++++++++++++++++++++-
> kernel/sched/features.h | 4 ++
> kernel/sched/stats.c | 5 +-
> kernel/sched/topology.c | 14 ++--
> 5 files changed, 135 insertions(+), 7 deletions(-)
>
Hi Chen. This is a nice patch series in the effort to reduce the newidle cost.
It introduces the idea of reusing the calculations done in load_balance
across different idle types.
It was interesting to see how this would work on Power systems, since we have
a large core count and a small LLC, i.e. at the small-core level (llc_weight=4).
This would mean quite frequent access to sd_share at different levels, which
resides on the first_cpu of the sched domain and might result in more
cache misses. But perf stats didn't show that.
Another concern is the larger number of sched groups at the DIE level, which
might take a hit if balancing takes longer for the system to stabilize.
tl;dr
Tested with micro-benchmarks on a system with 96 cores and SMT=8, for a total
of 768 CPUs. There is some amount of regression with hackbench and schbench;
I haven't looked into why yet. Any pointers on what to check would be helpful.
Also did a test with a more realistic workload we have, called daytrader. It is
a DB workload which reports total transactions done per second. That doesn't
show any regression.
It's true that not all benchmarks will be happy.
Maybe in the cases below, newidle is not that costly. Do you have any specific
benchmark that should be tried?
-----------------------------------------------------------------------------------------------------
6.5.rc4 6.5.rc4 + PATCH_V2 gain
Daytrader: 55049 55378 0.59%
-----------------------------------------------------------------------------------------------------
hackbench(50 iterations): 6.5.rc4 6.5.rc4 + PATCH_V2(gain%)
Process 10 groups : 0.19, 0.19(0.00)
Process 20 groups : 0.23, 0.24(-4.35)
Process 30 groups : 0.28, 0.30(-7.14)
Process 40 groups : 0.38, 0.40(-5.26)
Process 50 groups : 0.43, 0.45(-4.65)
Process 60 groups : 0.51, 0.51(0.00)
thread 10 Time : 0.21, 0.22(-4.76)
thread 20 Time : 0.27, 0.32(-18.52)
Process(Pipe) 10 Time : 0.17, 0.17(0.00)
Process(Pipe) 20 Time : 0.23, 0.23(0.00)
Process(Pipe) 30 Time : 0.28, 0.28(0.00)
Process(Pipe) 40 Time : 0.33, 0.32(3.03)
Process(Pipe) 50 Time : 0.38, 0.36(5.26)
Process(Pipe) 60 Time : 0.40, 0.39(2.50)
thread(Pipe) 10 Time : 0.14, 0.14(0.00)
thread(Pipe) 20 Time : 0.20, 0.19(5.00)
Observation: lower is better. Socket-based runs show quite a bit of regression;
pipe shows a slight improvement.
-----------------------------------------------------------------------------------------------------
Unixbench(10 iterations): 6.5.rc4 6.5.rc4 + PATCH_V2(gain%)
1 X Execl Throughput : 4280.15, 4398.30(2.76)
4 X Execl Throughput : 8171.60, 8061.60(-1.35)
1 X Pipe-based Context Switching : 172455.50, 174586.60(1.24)
4 X Pipe-based Context Switching : 633708.35, 664659.85(4.88)
1 X Process Creation : 6891.20, 7056.85(2.40)
4 X Process Creation : 8826.20, 8996.25(1.93)
1 X Shell Scripts (1 concurrent) : 9272.05, 9456.10(1.98)
4 X Shell Scripts (1 concurrent) : 27919.60, 25319.75(-9.31)
1 X Shell Scripts (8 concurrent) : 4462.70, 4392.75(-1.57)
4 X Shell Scripts (8 concurrent) : 11852.30, 10820.70(-8.70)
Observation: higher is better. Results are somewhat mixed.
-----------------------------------------------------------------------------------------------------
schbench(10 iterations) 6.5.rc4 6.5.rc4 + PATCH_V2(gain%)
1 Threads
50.0th: 8.00, 7.00(12.50)
75.0th: 8.00, 7.60(5.00)
90.0th: 8.80, 8.00(9.09)
95.0th: 10.20, 8.20(19.61)
99.0th: 13.60, 11.00(19.12)
99.5th: 14.00, 12.80(8.57)
99.9th: 15.80, 35.00(-121.52)
2 Threads
50.0th: 8.40, 8.20(2.38)
75.0th: 9.00, 8.60(4.44)
90.0th: 10.20, 9.60(5.88)
95.0th: 11.20, 10.20(8.93)
99.0th: 14.40, 11.40(20.83)
99.5th: 14.80, 12.80(13.51)
99.9th: 17.60, 14.80(15.91)
4 Threads
50.0th: 10.60, 10.40(1.89)
75.0th: 12.20, 11.60(4.92)
90.0th: 13.60, 12.60(7.35)
95.0th: 14.40, 13.00(9.72)
99.0th: 16.40, 15.60(4.88)
99.5th: 16.80, 16.60(1.19)
99.9th: 22.00, 29.00(-31.82)
8 Threads
50.0th: 12.00, 11.80(1.67)
75.0th: 14.40, 14.40(0.00)
90.0th: 17.00, 18.00(-5.88)
95.0th: 19.20, 19.80(-3.13)
99.0th: 23.00, 24.20(-5.22)
99.5th: 26.80, 29.20(-8.96)
99.9th: 68.00, 97.20(-42.94)
16 Threads
50.0th: 18.00, 18.20(-1.11)
75.0th: 23.20, 23.60(-1.72)
90.0th: 28.00, 27.40(2.14)
95.0th: 31.20, 30.40(2.56)
99.0th: 38.60, 38.20(1.04)
99.5th: 50.60, 50.40(0.40)
99.9th: 122.80, 108.00(12.05)
32 Threads
50.0th: 30.00, 30.20(-0.67)
75.0th: 42.20, 42.60(-0.95)
90.0th: 52.60, 55.40(-5.32)
95.0th: 58.60, 63.00(-7.51)
99.0th: 69.60, 78.20(-12.36)
99.5th: 79.20, 103.80(-31.06)
99.9th: 171.80, 209.60(-22.00)
Observation: lower is better. Tail latencies seem to go up. schbench also has run-to-run variation.
-----------------------------------------------------------------------------------------------------
stress-ng(20 iterations) 6.5.rc4 6.5.rc4 + PATCH_V2(gain%)
( 100000 cpu-ops)
--cpu=768 Time : 1.58, 1.53(3.16)
--cpu=384 Time : 1.66, 1.63(1.81)
--cpu=192 Time : 2.67, 2.77(-3.75)
--cpu=96 Time : 3.70, 3.69(0.27)
--cpu=48 Time : 5.73, 5.69(0.70)
--cpu=24 Time : 7.27, 7.26(0.14)
--cpu=12 Time : 14.25, 14.24(0.07)
--cpu=6 Time : 28.42, 28.40(0.07)
--cpu=3 Time : 56.81, 56.68(0.23)
--cpu=768 -util=10 Time : 3.69, 3.70(-0.27)
--cpu=768 -util=20 Time : 5.67, 5.70(-0.53)
--cpu=768 -util=30 Time : 7.08, 7.12(-0.56)
--cpu=768 -util=40 Time : 8.23, 8.27(-0.49)
--cpu=768 -util=50 Time : 9.22, 9.26(-0.43)
--cpu=768 -util=60 Time : 10.09, 10.15(-0.59)
--cpu=768 -util=70 Time : 10.93, 10.98(-0.46)
--cpu=768 -util=80 Time : 11.79, 11.79(0.00)
--cpu=768 -util=90 Time : 12.63, 12.60(0.24)
Observation: lower is better. Almost no difference.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 0/7] Optimization to reduce the cost of newidle balance
2023-08-25 7:48 ` [RFC PATCH 0/7] Optimization to reduce the cost of newidle balance Shrikanth Hegde
@ 2023-08-30 15:26 ` Chen Yu
2023-09-10 7:51 ` Shrikanth Hegde
0 siblings, 1 reply; 22+ messages in thread
From: Chen Yu @ 2023-08-30 15:26 UTC (permalink / raw)
To: Shrikanth Hegde
Cc: Ingo Molnar, Juri Lelli, Tim Chen, Mel Gorman, Dietmar Eggemann,
K Prateek Nayak, Gautham R . Shenoy, Chen Yu, Aaron Lu,
linux-kernel, Peter Zijlstra, Vincent Guittot
Hi Shrikanth,
On 2023-08-25 at 13:18:56 +0530, Shrikanth Hegde wrote:
>
> On 7/27/23 8:03 PM, Chen Yu wrote:
>
> Hi Chen. This is a nice patch series in the effort to reduce the newidle cost.
> It introduces the idea of reusing the calculations done in load_balance
> across different idle types.
>
Thanks for taking a look at this patch set.
> It was interesting to see how this would work on Power systems, since we have
> a large core count and a small LLC, i.e. at the small-core level (llc_weight=4).
> This would mean quite frequent access to sd_share at different levels, which
> resides on the first_cpu of the sched domain and might result in more
> cache misses. But perf stats didn't show that.
>
Do you mean that 1 large domain (the DIE domain?) has many LLC sched domains as
its children, and that accessing the large domain's sd_share field would cross
different LLCs with high latency? Yes, this could be a problem, and it depends
on how fast the hardware lets different LLCs snoop data from each other.
On the other hand, the periodic load balance is the writer of sd_share, and its
interval is based on the cpu_weight of that domain. So the writes might be less
frequent on large domains, and most accesses to sd_share would be the reads
issued by newidle balance, which are less costly.
> Another concern is the larger number of sched groups at the DIE level, which
> might take a hit if balancing takes longer for the system to stabilize.
Do you mean that if newidle balance does not pull tasks hard enough, the
imbalance between groups would last longer? Yes, Prateek has mentioned this
point; ILB_UTIL has this problem, and I'll think more about it. We want to find
a way for newidle balance to do less scanning while still pulling tasks as hard
as before.
>
> tl;dr
>
> Tested with micro-benchmarks on a system with 96 cores and SMT=8, for a total of 768 CPUs. There is some amount
May I know the sched domain hierarchy of this platform?
grep . /sys/kernel/debug/sched/domains/cpu0/domain*/*
cat /proc/schedstat | grep cpu0 -A 4 (4 domains?)
> of regression with hackbench and schbench; haven't looked into why yet. Any pointers on what to check would be helpful.
May I know the commands used to run hackbench and schbench below? For example,
the fd number, message size and loop count of hackbench, and the number of
message threads and worker threads of schbench, etc. I assume you are using
the old schbench? The latest schbench tracks other metrics besides tail
latency.
> Did a test with a more realistic workload we have, called daytrader. It is
> a DB workload which reports total transactions done per second. That doesn't
> show any regression.
>
> It's true that not all benchmarks will be happy.
> Maybe in the cases below, newidle is not that costly. Do you have any specific benchmark that should be tried?
>
Previously I tested schbench/hackbench/netperf/tbench/sqlite, and I'm also
planning to try an OLTP workload.
> -----------------------------------------------------------------------------------------------------
> 6.5.rc4 6.5.rc4 + PATCH_V2 gain
> Daytrader: 55049 55378 0.59%
>
> -----------------------------------------------------------------------------------------------------
> hackbench(50 iterations): 6.5.rc4 6.5.rc4 + PATCH_V2(gain%)
>
>
> Process 10 groups : 0.19, 0.19(0.00)
> Process 20 groups : 0.23, 0.24(-4.35)
> Process 30 groups : 0.28, 0.30(-7.14)
> Process 40 groups : 0.38, 0.40(-5.26)
> Process 50 groups : 0.43, 0.45(-4.65)
> Process 60 groups : 0.51, 0.51(0.00)
> thread 10 Time : 0.21, 0.22(-4.76)
> thread 20 Time : 0.27, 0.32(-18.52)
> Process(Pipe) 10 Time : 0.17, 0.17(0.00)
> Process(Pipe) 20 Time : 0.23, 0.23(0.00)
> Process(Pipe) 30 Time : 0.28, 0.28(0.00)
> Process(Pipe) 40 Time : 0.33, 0.32(3.03)
> Process(Pipe) 50 Time : 0.38, 0.36(5.26)
> Process(Pipe) 60 Time : 0.40, 0.39(2.50)
> thread(Pipe) 10 Time : 0.14, 0.14(0.00)
> thread(Pipe) 20 Time : 0.20, 0.19(5.00)
>
> Observation: lower is better. socket based runs show regression quite a bit,
> pipe shows slight improvement.
>
>
> -----------------------------------------------------------------------------------------------------
> Unixbench(10 iterations): 6.5.rc4 6.5.rc4 + PATCH_V2(gain%)
>
> 1 X Execl Throughput : 4280.15, 4398.30(2.76)
> 4 X Execl Throughput : 8171.60, 8061.60(-1.35)
> 1 X Pipe-based Context Switching : 172455.50, 174586.60(1.24)
> 4 X Pipe-based Context Switching : 633708.35, 664659.85(4.88)
> 1 X Process Creation : 6891.20, 7056.85(2.40)
> 4 X Process Creation : 8826.20, 8996.25(1.93)
> 1 X Shell Scripts (1 concurrent) : 9272.05, 9456.10(1.98)
> 4 X Shell Scripts (1 concurrent) : 27919.60, 25319.75(-9.31)
> 1 X Shell Scripts (8 concurrent) : 4462.70, 4392.75(-1.57)
> 4 X Shell Scripts (8 concurrent) : 11852.30, 10820.70(-8.70)
>
> Observation: higher is better. Results are somewhat mixed.
>
>
> -----------------------------------------------------------------------------------------------------
> schbench(10 iterations) 6.5.rc4 6.5.rc4 + PATCH_V2(gain%)
>
> 1 Threads
> 50.0th: 8.00, 7.00(12.50)
> 75.0th: 8.00, 7.60(5.00)
> 90.0th: 8.80, 8.00(9.09)
> 95.0th: 10.20, 8.20(19.61)
> 99.0th: 13.60, 11.00(19.12)
> 99.5th: 14.00, 12.80(8.57)
> 99.9th: 15.80, 35.00(-121.52)
> 2 Threads
> 50.0th: 8.40, 8.20(2.38)
> 75.0th: 9.00, 8.60(4.44)
> 90.0th: 10.20, 9.60(5.88)
> 95.0th: 11.20, 10.20(8.93)
> 99.0th: 14.40, 11.40(20.83)
> 99.5th: 14.80, 12.80(13.51)
> 99.9th: 17.60, 14.80(15.91)
> 4 Threads
> 50.0th: 10.60, 10.40(1.89)
> 75.0th: 12.20, 11.60(4.92)
> 90.0th: 13.60, 12.60(7.35)
> 95.0th: 14.40, 13.00(9.72)
> 99.0th: 16.40, 15.60(4.88)
> 99.5th: 16.80, 16.60(1.19)
> 99.9th: 22.00, 29.00(-31.82)
> 8 Threads
> 50.0th: 12.00, 11.80(1.67)
> 75.0th: 14.40, 14.40(0.00)
> 90.0th: 17.00, 18.00(-5.88)
> 95.0th: 19.20, 19.80(-3.13)
> 99.0th: 23.00, 24.20(-5.22)
> 99.5th: 26.80, 29.20(-8.96)
> 99.9th: 68.00, 97.20(-42.94)
> 16 Threads
> 50.0th: 18.00, 18.20(-1.11)
> 75.0th: 23.20, 23.60(-1.72)
> 90.0th: 28.00, 27.40(2.14)
> 95.0th: 31.20, 30.40(2.56)
> 99.0th: 38.60, 38.20(1.04)
> 99.5th: 50.60, 50.40(0.40)
> 99.9th: 122.80, 108.00(12.05)
> 32 Threads
> 50.0th: 30.00, 30.20(-0.67)
> 75.0th: 42.20, 42.60(-0.95)
> 90.0th: 52.60, 55.40(-5.32)
> 95.0th: 58.60, 63.00(-7.51)
> 99.0th: 69.60, 78.20(-12.36)
> 99.5th: 79.20, 103.80(-31.06)
> 99.9th: 171.80, 209.60(-22.00)
>
> Observation: lower is better. tail latencies seem to go up. schbench also has run to run variations.
>
> -----------------------------------------------------------------------------------------------------
> stress-ng(20 iterations) 6.5.rc4 6.5.rc4 + PATCH_V2(gain%)
> ( 100000 cpu-ops)
>
> --cpu=768 Time : 1.58, 1.53(3.16)
> --cpu=384 Time : 1.66, 1.63(1.81)
> --cpu=192 Time : 2.67, 2.77(-3.75)
> --cpu=96 Time : 3.70, 3.69(0.27)
> --cpu=48 Time : 5.73, 5.69(0.70)
> --cpu=24 Time : 7.27, 7.26(0.14)
> --cpu=12 Time : 14.25, 14.24(0.07)
> --cpu=6 Time : 28.42, 28.40(0.07)
> --cpu=3 Time : 56.81, 56.68(0.23)
> --cpu=768 -util=10 Time : 3.69, 3.70(-0.27)
> --cpu=768 -util=20 Time : 5.67, 5.70(-0.53)
> --cpu=768 -util=30 Time : 7.08, 7.12(-0.56)
> --cpu=768 -util=40 Time : 8.23, 8.27(-0.49)
> --cpu=768 -util=50 Time : 9.22, 9.26(-0.43)
> --cpu=768 -util=60 Time : 10.09, 10.15(-0.59)
> --cpu=768 -util=70 Time : 10.93, 10.98(-0.46)
> --cpu=768 -util=80 Time : 11.79, 11.79(0.00)
> --cpu=768 -util=90 Time : 12.63, 12.60(0.24)
>
>
> Observation: lower is better. Almost no difference.
I'll try to run the same tests of hackbench/schbench on my machine, to
see if I could find any clue for the regression.
thanks,
Chenyu
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 4/7] sched/fair: Calculate the scan depth for idle balance based on system utilization
2023-08-25 6:02 ` Shrikanth Hegde
@ 2023-08-30 15:30 ` Chen Yu
0 siblings, 0 replies; 22+ messages in thread
From: Chen Yu @ 2023-08-30 15:30 UTC (permalink / raw)
To: Shrikanth Hegde
Cc: Ingo Molnar, Juri Lelli, Tim Chen, Mel Gorman, Dietmar Eggemann,
K Prateek Nayak, Gautham R . Shenoy, Chen Yu, Aaron Lu,
linux-kernel, Peter Zijlstra, Vincent Guittot
On 2023-08-25 at 11:32:01 +0530, Shrikanth Hegde wrote:
>
>
> On 7/27/23 8:05 PM, Chen Yu wrote:
> > When the CPU is about to enter idle, it invokes newidle_balance()
> > to pull some tasks from other runqueues. Although there is per
> > domain max_newidle_lb_cost to throttle the newidle_balance(), it
> > would be good to further limit the scan based on overall system
> > utilization. The reason is that there is no limitation for
> > newidle_balance() to launch this balance simultaneously on
> > multiple CPUs. Since each newidle_balance() has to traverse all
> > the groups to calculate the statistics one by one, the total
> > time cost of newidle_balance() could be O(n^2), where n is the number
> > of groups. This issue is more severe if there are many groups
> > within 1 domain, for example, a system with a large number of
> > Cores in a LLC domain. This is not good for performance or
> > power saving.
> >
> > sqlite has spent quite some time in newidle_balance() on Intel
> > Sapphire Rapids, which has 2 x 56C/112T = 224 CPUs:
> > 6.69% 0.09% sqlite3 [kernel.kallsyms] [k] newidle_balance
> > 5.39% 4.71% sqlite3 [kernel.kallsyms] [k] update_sd_lb_stats
> >
> > Based on this observation, limit the scan depth of newidle_balance()
> > by considering the utilization of the sched domain. Let the number of
> > scanned groups be a linear function of the utilization ratio:
> >
> > nr_groups_to_scan = nr_groups * (1 - util_ratio)
> >
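[Editor's note: the linear scan-depth formula above can be sketched as a standalone C helper. This is a hedged illustration with an assumed integer fixed-point computation; the function name and clamping details are illustrative, not the actual patch code.]

```c
#include <assert.h>

/*
 * Hedged sketch of: nr_groups_to_scan = nr_groups * (1 - util_ratio),
 * computed in integer arithmetic with util_ratio = sum_util / total_capacity.
 */
static int ilb_nr_groups_to_scan(unsigned long sum_util,
                                 unsigned long total_capacity,
                                 int nr_groups)
{
	if (!total_capacity)
		return nr_groups;	/* no snapshot yet: scan everything */
	if (sum_util > total_capacity)
		sum_util = total_capacity;

	return (int)((unsigned long)nr_groups *
		     (total_capacity - sum_util) / total_capacity);
}
```

For example, a half-utilized domain with 16 groups would scan 8 of them, while a fully utilized one would skip the scan entirely.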
> > Suggested-by: Tim Chen <tim.c.chen@intel.com>
> > Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> > ---
> > include/linux/sched/topology.h | 1 +
> > kernel/sched/fair.c | 30 ++++++++++++++++++++++++++++++
> > kernel/sched/features.h | 1 +
> > 3 files changed, 32 insertions(+)
> >
> > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> > index d6a64a2c92aa..af2261308529 100644
> > --- a/include/linux/sched/topology.h
> > +++ b/include/linux/sched/topology.h
> > @@ -84,6 +84,7 @@ struct sched_domain_shared {
> > int nr_idle_scan;
> > unsigned long total_load;
> > unsigned long total_capacity;
> > + int nr_sg_scan;
> > };
> >
> > struct sched_domain {
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index edcfee9965cd..6925813db59b 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -10153,6 +10153,35 @@ static void ilb_save_stats(struct lb_env *env,
> > WRITE_ONCE(sd_share->total_capacity, sds->total_capacity);
> > }
> >
> > +static void update_ilb_group_scan(struct lb_env *env,
> > + unsigned long sum_util,
> > + struct sched_domain_shared *sd_share)
> > +{
> > + u64 tmp, nr_scan;
> > +
> > + if (!sched_feat(ILB_UTIL))
> > + return;
> > +
> > + if (!sd_share)
> > + return;
> > +
> > + if (env->idle == CPU_NEWLY_IDLE)
> > + return;
>
>
> Suggestion for small improvement:
>
> The first if condition here could be the check for newidle. As newidle happens very often, we could save a few cycles of checking the
> sched feature.
>
Yes, this makes sense, I'll change it.
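[Editor's note: the saving from the suggested reordering comes from C's short-circuit evaluation. A minimal standalone sketch (hypothetical helper names, not the kernel code) shows the skipped feature lookup:]

```c
#include <assert.h>

static int feat_lookups;	/* counts sched_feat()-style lookups */

/* Stand-in for the sched_feat(ILB_UTIL) test; assumed always enabled. */
static int ilb_util_enabled(void)
{
	feat_lookups++;
	return 1;
}

/* Original ordering: the feature lookup runs on every call. */
static int should_bail_orig(int is_newidle)
{
	if (!ilb_util_enabled())
		return 1;
	if (is_newidle)
		return 1;
	return 0;
}

/* Suggested ordering: the cheap newidle test is evaluated first,
 * so the feature lookup is skipped on the frequent newidle path. */
static int should_bail_reordered(int is_newidle)
{
	if (is_newidle)
		return 1;
	if (!ilb_util_enabled())
		return 1;
	return 0;
}
```

Both orderings return the same result; only the number of feature lookups differs.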
thanks,
Chenyu
* Re: [RFC PATCH 5/7] sched/fair: Adjust the busiest group scanning depth in idle load balance
2023-08-25 6:00 ` Shrikanth Hegde
@ 2023-08-30 15:35 ` Chen Yu
0 siblings, 0 replies; 22+ messages in thread
From: Chen Yu @ 2023-08-30 15:35 UTC (permalink / raw)
To: Shrikanth Hegde
Cc: Ingo Molnar, Juri Lelli, Tim Chen, Mel Gorman, Dietmar Eggemann,
K Prateek Nayak, Gautham R . Shenoy, Chen Yu, Aaron Lu,
linux-kernel, Peter Zijlstra, Vincent Guittot
On 2023-08-25 at 11:30:05 +0530, Shrikanth Hegde wrote:
>
>
> On 7/27/23 8:05 PM, Chen Yu wrote:
> > Scanning the whole sched domain to find the busiest group is time-consuming
> > during newidle_balance(). And if a CPU becomes idle, it would be good
> > if this idle CPU pulls some tasks from other CPUs as quickly as possible.
> >
> > Limit the scan depth of newidle_balance() to only scan for a limited number
> > of sched groups to find a relatively busy group, and pull from it.
> > In summary, the more spare time there is in the domain, the more time
> > each newidle balance can spend on scanning for a busy group. Although
> > the newidle balance has per domain max_newidle_lb_cost to decide
> > whether to launch the balance or not, the ILB_UTIL provides a smaller
> > granularity to decide how many groups each newidle balance can scan.
> >
> > The scanning depth is calculated by the previous periodic load balance
> > based on its overall utilization.
> >
> > Tested on top of v6.5-rc2, Sapphire Rapids with 2 x 56C/112T = 224 CPUs.
> > With cpufreq governor set to performance, and C6 disabled.
> >
> > Firstly, tested on an extreme synthetic test[1], which launches 224
> > processes. Each process is a loop of nanosleep(1 us), which is supposed
> > to trigger newidle balance as much as possible:
> >
> > i=1;while [ $i -le "224" ]; do ./nano_sleep 1000 & i=$(($i+1)); done;
> >
> > NO_ILB_UTIL + ILB_SNAPSHOT:
> > 9.38% 0.45% [kernel.kallsyms] [k] newidle_balance
> > 6.84% 5.32% [kernel.kallsyms] [k] update_sd_lb_stats.constprop.0
> >
> > ILB_UTIL + ILB_SNAPSHOT:
> > 3.35% 0.38% [kernel.kallsyms] [k] newidle_balance
> > 2.30% 1.81% [kernel.kallsyms] [k] update_sd_lb_stats.constprop.0
> > [...]
>
> > Link: https://raw.githubusercontent.com/chen-yu-surf/tools/master/stress_nanosleep.c #1
> > Suggested-by: Tim Chen <tim.c.chen@intel.com>
> > Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> > ---
> > kernel/sched/fair.c | 20 +++++++++++++++++++-
> > 1 file changed, 19 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 6925813db59b..4e360ed16e14 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -10195,7 +10195,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> > struct sg_lb_stats *local = &sds->local_stat;
> > struct sg_lb_stats tmp_sgs;
> > unsigned long sum_util = 0;
> > - int sg_status = 0;
> > + int sg_status = 0, nr_sg_scan;
> > + /* only newidle CPU can load the snapshot */
> > + bool ilb_can_load = env->idle == CPU_NEWLY_IDLE &&
> > + sd_share && READ_ONCE(sd_share->total_capacity);
> > +
> > + if (sched_feat(ILB_UTIL) && ilb_can_load)
>
> Suggestion for small improvement:
>
> it could be the following? This could help save a few cycles of checking if the feature is enabled when it's not newidle.
>
> if ( ilb_can_load && sched_feat(ILB_UTIL))
>
> Same comments below in this patch as well in PATCH 6/7.
>
Yes, this makes sense because the feature is enabled by default.
> > + nr_sg_scan = sd_share->nr_sg_scan;
> >
> > do {
> > struct sg_lb_stats *sgs = &tmp_sgs;
> > @@ -10222,6 +10228,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> > sds->busiest_stat = *sgs;
> > }
> >
> > + if (sched_feat(ILB_UTIL) && ilb_can_load && --nr_sg_scan <= 0)
> > + goto load_snapshot;
> > +
>
> Same comment as above.
>
OK, will do.
thanks,
Chenyu
* Re: [RFC PATCH 0/7] Optimization to reduce the cost of newidle balance
2023-08-30 15:26 ` Chen Yu
@ 2023-09-10 7:51 ` Shrikanth Hegde
0 siblings, 0 replies; 22+ messages in thread
From: Shrikanth Hegde @ 2023-09-10 7:51 UTC (permalink / raw)
To: Chen Yu
Cc: Ingo Molnar, Juri Lelli, Tim Chen, Mel Gorman, Dietmar Eggemann,
K Prateek Nayak, Gautham R . Shenoy, Chen Yu, Aaron Lu,
linux-kernel, Peter Zijlstra, Vincent Guittot
On 8/30/23 8:56 PM, Chen Yu wrote:
> Hi Shrikanth,
Hi Chen, sorry for the slightly delayed response.
Note: the patch, as is, fails to apply cleanly, as BASE_SLICE is not a
feature in the latest tip/sched/core.
>
> On 2023-08-25 at 13:18:56 +0530, Shrikanth Hegde wrote:
>>
>> On 7/27/23 8:03 PM, Chen Yu wrote:
>>
>> Hi Chen. It is a nice patch series in an effort to reduce the newidle cost.
>> It gives the idea of making use of the calculations done in load_balance to be used
>> across different idle types.
>>
>
> Thanks for taking a look at this patch set.
>
>> It was interesting to see how this would work on Power systems. The reason being, we have
>> a large core count and the LLC size is small, i.e., at the small-core level (llc_weight=4). This would
>> mean quite frequent accesses to sd_share at different levels, which would reside on the first_cpu of
>> the sched domain, and might result in more cache misses. But perf stats didn't show that.
>>
>
> Do you mean 1 large domain (Die domain?) has many LLC sched domains as its children,
> and accessing the large domain's sd_share field would cross different LLCs and the
> latency is high? Yes, this could be a problem, and it depends on the hardware how
> fast different LLCs snoop data from each other.
Yes
> On the other hand, the periodic load balance is the writer of sd_share, and the
> interval is based on the cpu_weight of that domain. So the write might be less frequent
> on large domains, and most access to sd_share would be the read issued by newidle balance,
> which is less costly.
>
>> Another concern is the larger number of sched groups at the DIE level, which might take a hit if
>> the balancing takes longer for the system to stabilize.
>
> Do you mean, if newidle balance does not pull tasks hard enough, the imbalance between groups
> would last longer? Yes, Prateek has mentioned this point; ILB_UTIL has this problem, and I'll
> think more about it. We want newidle balance to do less scanning, but still pull
> tasks as hard as before.
>
>>
>> tl;dr
>>
>> Tested with micro-benchmarks on a system with 96 cores with SMT=8, a total of 768 CPUs. There is some amount
>
> May I know the sched domain hierarchy of this platform?
> grep . /sys/kernel/debug/sched/domains/cpu0/domain*/*
> cat /proc/schedstat | grep cpu0 -A 4 (4 domains?)
/sys/kernel/debug/sched/domains/cpu0/domain0/name:SMT
/sys/kernel/debug/sched/domains/cpu0/domain1/name:MC
/sys/kernel/debug/sched/domains/cpu0/domain2/name:DIE
/sys/kernel/debug/sched/domains/cpu0/domain3/name:NUMA
/sys/kernel/debug/sched/domains/cpu0/domain4/name:NUMA
domain-0: span=0,2,4,6 level=SMT
groups: 0:{ span=0 }, 2:{ span=2 }, 4:{ span=4 }, 6:{ span=6 }
domain-1: span=0-7,24-39,48-55,72-87 level=MC
groups: 0:{ span=0,2,4,6 cap=4096 }, 1:{ span=1,3,5,7 cap=4096 }, 24:{ span=24,26,28,30 cap=4096 }, 25:{ span=25,27,29,31 cap=4096 }, 32:{ span=32,34,36,38 cap=4096 }, 33:{ span=33,35,37,39 cap=4096 }, 48:{ span=48,50,52,54 cap=4096 }, 49:{ span=49,51,53,55 cap=4096 }, 72:{ span=72,74,76,78 cap=4096 }, 73:{ span=73,75,77,79 cap=4096 }, 80:{ span=80,82,84,86 cap=4096 }, 81:{ span=81,83,85,87 cap=4096 }
domain-2: span=0-95 level=DIE
groups: 0:{ span=0-7,24-39,48-55,72-87 cap=49152 }, 8:{ span=8-23,40-47,56-71,88-95 cap=49152 }
domain-3: span=0-191 level=NUMA
groups: 0:{ span=0-95 cap=98304 }, 96:{ span=96-191 cap=98304 }
domain-4: span=0-767 level=NUMA
groups: 0:{ span=0-191 cap=196608 }, 192:{ span=192-383 cap=196608 }, 384:{ span=384-575 cap=196608 }, 576:{ span=576-767 cap=196608 }
Our LLC is at the SMT domain. In an MC domain there can be up to 16 such LLCs.
That is for Dedicated Logical Partitions (LPARs).
On Shared Processor Logical Partitions (SPLPARs), it is observed that the MC domain
doesn't make sense. After the below proposed change, the DIE domain would have SMT domains as groups.
After that, the max number of LLCs in a DIE can go up to 30.
https://lore.kernel.org/lkml/20230830105244.62477-5-srikar@linux.vnet.ibm.com/#r
>
>> of regression with hackbench and schbench. I haven't looked into why. Any pointers to check would be helpful.
>
> May I know the commands used to run hackbench and schbench below? For example,
> the fd number, packet size, and loop count for hackbench, and the
> number of message threads and worker threads for schbench, etc. I assume
> you are using the old schbench? The latest schbench tracks other metrics
> besides tail latency.
>
>
Yes, the old schbench; and hackbench is from LTP.
I can try to test the next version.
>> Did a test with a more realistic workload that we have, called daytrader. It is a DB workload which gives total
>> transactions done per second. That doesn't show any regression.
>>
>> It's true that not all benchmarks will be happy.
>> Maybe in the below cases, newidle may not be that costly. Do you have any specific benchmark to be tried?
>>
>
> Previously I tested schbench/hackbench/netperf/tbench/sqlite, and I'm also planning
> to try an OLTP workload.
>
>> -----------------------------------------------------------------------------------------------------
>> 6.5.rc4 6.5.rc4 + PATCH_V2 gain
>> Daytrader: 55049 55378 0.59%
>>
>> -----------------------------------------------------------------------------------------------------
>> hackbench(50 iterations): 6.5.rc4 6.5.rc4 + PATCH_V2(gain%)
>>
>>
>> Process 10 groups : 0.19, 0.19(0.00)
>> Process 20 groups : 0.23, 0.24(-4.35)
>> Process 30 groups : 0.28, 0.30(-7.14)
>> Process 40 groups : 0.38, 0.40(-5.26)
>> Process 50 groups : 0.43, 0.45(-4.65)
>> Process 60 groups : 0.51, 0.51(0.00)
>> thread 10 Time : 0.21, 0.22(-4.76)
>> thread 20 Time : 0.27, 0.32(-18.52)
>> Process(Pipe) 10 Time : 0.17, 0.17(0.00)
>> Process(Pipe) 20 Time : 0.23, 0.23(0.00)
>> Process(Pipe) 30 Time : 0.28, 0.28(0.00)
>> Process(Pipe) 40 Time : 0.33, 0.32(3.03)
>> Process(Pipe) 50 Time : 0.38, 0.36(5.26)
>> Process(Pipe) 60 Time : 0.40, 0.39(2.50)
>> thread(Pipe) 10 Time : 0.14, 0.14(0.00)
>> thread(Pipe) 20 Time : 0.20, 0.19(5.00)
>>
>> Observation: lower is better. socket based runs show regression quite a bit,
>> pipe shows slight improvement.
>>
>>
>> -----------------------------------------------------------------------------------------------------
>> Unixbench(10 iterations): 6.5.rc4 6.5.rc4 + PATCH_V2(gain%)
>>
>> 1 X Execl Throughput : 4280.15, 4398.30(2.76)
>> 4 X Execl Throughput : 8171.60, 8061.60(-1.35)
>> 1 X Pipe-based Context Switching : 172455.50, 174586.60(1.24)
>> 4 X Pipe-based Context Switching : 633708.35, 664659.85(4.88)
>> 1 X Process Creation : 6891.20, 7056.85(2.40)
>> 4 X Process Creation : 8826.20, 8996.25(1.93)
>> 1 X Shell Scripts (1 concurrent) : 9272.05, 9456.10(1.98)
>> 4 X Shell Scripts (1 concurrent) : 27919.60, 25319.75(-9.31)
>> 1 X Shell Scripts (8 concurrent) : 4462.70, 4392.75(-1.57)
>> 4 X Shell Scripts (8 concurrent) : 11852.30, 10820.70(-8.70)
>>
>> Observation: higher is better. Results are somewhat mixed.
>>
>>
>> -----------------------------------------------------------------------------------------------------
>> schbench(10 iterations) 6.5.rc4 6.5.rc4 + PATCH_V2(gain%)
>>
>> 1 Threads
>> 50.0th: 8.00, 7.00(12.50)
>> 75.0th: 8.00, 7.60(5.00)
>> 90.0th: 8.80, 8.00(9.09)
>> 95.0th: 10.20, 8.20(19.61)
>> 99.0th: 13.60, 11.00(19.12)
>> 99.5th: 14.00, 12.80(8.57)
>> 99.9th: 15.80, 35.00(-121.52)
>> 2 Threads
>> 50.0th: 8.40, 8.20(2.38)
>> 75.0th: 9.00, 8.60(4.44)
>> 90.0th: 10.20, 9.60(5.88)
>> 95.0th: 11.20, 10.20(8.93)
>> 99.0th: 14.40, 11.40(20.83)
>> 99.5th: 14.80, 12.80(13.51)
>> 99.9th: 17.60, 14.80(15.91)
>> 4 Threads
>> 50.0th: 10.60, 10.40(1.89)
>> 75.0th: 12.20, 11.60(4.92)
>> 90.0th: 13.60, 12.60(7.35)
>> 95.0th: 14.40, 13.00(9.72)
>> 99.0th: 16.40, 15.60(4.88)
>> 99.5th: 16.80, 16.60(1.19)
>> 99.9th: 22.00, 29.00(-31.82)
>> 8 Threads
>> 50.0th: 12.00, 11.80(1.67)
>> 75.0th: 14.40, 14.40(0.00)
>> 90.0th: 17.00, 18.00(-5.88)
>> 95.0th: 19.20, 19.80(-3.13)
>> 99.0th: 23.00, 24.20(-5.22)
>> 99.5th: 26.80, 29.20(-8.96)
>> 99.9th: 68.00, 97.20(-42.94)
>> 16 Threads
>> 50.0th: 18.00, 18.20(-1.11)
>> 75.0th: 23.20, 23.60(-1.72)
>> 90.0th: 28.00, 27.40(2.14)
>> 95.0th: 31.20, 30.40(2.56)
>> 99.0th: 38.60, 38.20(1.04)
>> 99.5th: 50.60, 50.40(0.40)
>> 99.9th: 122.80, 108.00(12.05)
>> 32 Threads
>> 50.0th: 30.00, 30.20(-0.67)
>> 75.0th: 42.20, 42.60(-0.95)
>> 90.0th: 52.60, 55.40(-5.32)
>> 95.0th: 58.60, 63.00(-7.51)
>> 99.0th: 69.60, 78.20(-12.36)
>> 99.5th: 79.20, 103.80(-31.06)
>> 99.9th: 171.80, 209.60(-22.00)
>>
>> Observation: lower is better. tail latencies seem to go up. schbench also has run to run variations.
>>
>> -----------------------------------------------------------------------------------------------------
>> stress-ng(20 iterations) 6.5.rc4 6.5.rc4 + PATCH_V2(gain%)
>> ( 100000 cpu-ops)
>>
>> --cpu=768 Time : 1.58, 1.53(3.16)
>> --cpu=384 Time : 1.66, 1.63(1.81)
>> --cpu=192 Time : 2.67, 2.77(-3.75)
>> --cpu=96 Time : 3.70, 3.69(0.27)
>> --cpu=48 Time : 5.73, 5.69(0.70)
>> --cpu=24 Time : 7.27, 7.26(0.14)
>> --cpu=12 Time : 14.25, 14.24(0.07)
>> --cpu=6 Time : 28.42, 28.40(0.07)
>> --cpu=3 Time : 56.81, 56.68(0.23)
>> --cpu=768 -util=10 Time : 3.69, 3.70(-0.27)
>> --cpu=768 -util=20 Time : 5.67, 5.70(-0.53)
>> --cpu=768 -util=30 Time : 7.08, 7.12(-0.56)
>> --cpu=768 -util=40 Time : 8.23, 8.27(-0.49)
>> --cpu=768 -util=50 Time : 9.22, 9.26(-0.43)
>> --cpu=768 -util=60 Time : 10.09, 10.15(-0.59)
>> --cpu=768 -util=70 Time : 10.93, 10.98(-0.46)
>> --cpu=768 -util=80 Time : 11.79, 11.79(0.00)
>> --cpu=768 -util=90 Time : 12.63, 12.60(0.24)
>>
>>
>> Observation: lower is better. Almost no difference.
>
> I'll try to run the same tests of hackbench/schbench on my machine, to
> see if I could find any clue for the regression.
>
>
> thanks,
> Chenyu
* Re: [RFC PATCH 0/7] Optimization to reduce the cost of newidle balance
2023-07-27 14:33 [RFC PATCH 0/7] Optimization to reduce the cost of newidle balance Chen Yu
` (7 preceding siblings ...)
2023-08-25 7:48 ` [RFC PATCH 0/7] Optimization to reduce the cost of newidle balance Shrikanth Hegde
@ 2024-07-16 14:16 ` Matt Fleming
2024-07-17 3:52 ` Chen Yu
2024-07-17 12:17 ` Peter Zijlstra
9 siblings, 1 reply; 22+ messages in thread
From: Matt Fleming @ 2024-07-16 14:16 UTC (permalink / raw)
To: yu.c.chen
Cc: aaron.lu, dietmar.eggemann, gautham.shenoy, juri.lelli,
kprateek.nayak, linux-kernel, mgorman, mingo, peterz, tim.c.chen,
vincent.guittot, yu.chen.surf
> Hi,
>
> This is the second version of the newidle balance optimization[1].
> It aims to reduce the cost of newidle balance which is found to
> occupy noticeable CPU cycles on some high-core count systems.
Hi there, what's the status of this series?
I'm seeing this same symptom of burning cycles in update_sd_lb_stats() on an
AMD EPYC 7713 machine (128 CPUs, 8 NUMA nodes). The machine is about 50% idle
and update_sd_lb_stats() sits as the first entry in perf top with about 3.62%
of CPU cycles.
Thanks,
Matt
* Re: [RFC PATCH 0/7] Optimization to reduce the cost of newidle balance
2024-07-16 14:16 ` Matt Fleming
@ 2024-07-17 3:52 ` Chen Yu
2024-07-17 15:31 ` Matt Fleming
0 siblings, 1 reply; 22+ messages in thread
From: Chen Yu @ 2024-07-17 3:52 UTC (permalink / raw)
To: Matt Fleming
Cc: aaron.lu, dietmar.eggemann, gautham.shenoy, juri.lelli,
kprateek.nayak, linux-kernel, mgorman, mingo, peterz, tim.c.chen,
vincent.guittot, yu.chen.surf, yujie.liu
Hi Matt,
On 2024-07-16 at 15:16:45 +0100, Matt Fleming wrote:
> > Hi,
> >
> > This is the second version of the newidle balance optimization[1].
> > It aims to reduce the cost of newidle balance which is found to
> > occupy noticeable CPU cycles on some high-core count systems.
>
> Hi there, what's the status of this series?
>
Thanks for your interest in this patch series. The RFC patch series was sent
out to seek direction and to see whether this issue is worth fixing. Since
you have encountered this issue as well and it seems to be a generic issue,
I'll rebase this patch series, retest it on top of the latest kernel, and then
send out a new version.
> I'm seeing this same symptom of burning cycles in update_sd_lb_stats() on an
> AMD EPYC 7713 machine (128 CPUs, 8 NUMA nodes). The machine is about 50% idle
> and update_sd_lb_stats() sits as the first entry in perf top with about 3.62%
> of CPU cycles.
May I know what benchmark (test scenario) you are testing? I'd like to replicate
this test on my machine as well.
thanks,
Chenyu
>
> Thanks,
> Matt
* Re: [RFC PATCH 0/7] Optimization to reduce the cost of newidle balance
2023-07-27 14:33 [RFC PATCH 0/7] Optimization to reduce the cost of newidle balance Chen Yu
` (8 preceding siblings ...)
2024-07-16 14:16 ` Matt Fleming
@ 2024-07-17 12:17 ` Peter Zijlstra
2024-07-18 9:28 ` K Prateek Nayak
2024-07-18 16:57 ` Chen Yu
9 siblings, 2 replies; 22+ messages in thread
From: Peter Zijlstra @ 2024-07-17 12:17 UTC (permalink / raw)
To: Chen Yu
Cc: Vincent Guittot, Ingo Molnar, Juri Lelli, Tim Chen, Mel Gorman,
Dietmar Eggemann, K Prateek Nayak, Gautham R . Shenoy, Chen Yu,
Aaron Lu, linux-kernel, void
On Thu, Jul 27, 2023 at 10:33:58PM +0800, Chen Yu wrote:
> Hi,
>
> This is the second version of the newidle balance optimization[1].
> It aims to reduce the cost of newidle balance which is found to
> occupy noticeable CPU cycles on some high-core count systems.
>
> For example, when running sqlite on Intel Sapphire Rapids, which has
> 2 x 56C/112T = 224 CPUs:
>
> 6.69% 0.09% sqlite3 [kernel.kallsyms] [k] newidle_balance
> 5.39% 4.71% sqlite3 [kernel.kallsyms] [k] update_sd_lb_stats
>
> To mitigate this cost, the optimization is inspired by the question
> raised by Tim:
> Do we always have to find the busiest group and pull from it? Would
> a relatively busy group be enough?
So doesn't this basically boil down to recognising that new-idle might
not be the same as regular load-balancing -- we need any task, fast,
rather than we need to make equal load.
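[Editor's note: the distinction can be sketched with a toy model (made-up group loads and illustrative names, not kernel code): a regular balance scans every group for the busiest one, while a "need any task, fast" pass can stop at the first group that is busy enough.]

```c
#include <assert.h>

#define NR_GROUPS 8

/* Hypothetical per-group load figures, for illustration only. */
static const int group_load[NR_GROUPS] = { 3, 1, 7, 2, 9, 0, 4, 5 };

/* Regular balance: scan every group to find the busiest. */
static int find_busiest(int *scanned)
{
	int i, busiest = 0;

	*scanned = 0;
	for (i = 0; i < NR_GROUPS; i++) {
		(*scanned)++;
		if (group_load[i] > group_load[busiest])
			busiest = i;
	}
	return busiest;
}

/* "Any task, fast": stop at the first group busy enough to pull from. */
static int find_busy_enough(int threshold, int *scanned)
{
	int i;

	*scanned = 0;
	for (i = 0; i < NR_GROUPS; i++) {
		(*scanned)++;
		if (group_load[i] >= threshold)
			return i;
	}
	return -1;	/* nothing busy enough: fall back or give up */
}
```

The early-stop variant trades pull quality (a relatively busy group instead of the busiest) for fewer groups visited, which is the trade-off both ILB_FAST and the shared-runqueue approach explore.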
David's shared runqueue patches did the same, they re-imagined this very
path.
Now, David's thing went side-ways because of some regression that wasn't
further investigated.
But it occurs to me this might be the same thing that Prateek chased
down here:
https://lkml.kernel.org/r/20240710090210.41856-1-kprateek.nayak@amd.com
Hmm ?
Supposing that is indeed the case, I think it makes more sense to
proceed with that approach. That is, completely redo the sub-numa new
idle balance.
* Re: [RFC PATCH 0/7] Optimization to reduce the cost of newidle balance
2024-07-17 3:52 ` Chen Yu
@ 2024-07-17 15:31 ` Matt Fleming
0 siblings, 0 replies; 22+ messages in thread
From: Matt Fleming @ 2024-07-17 15:31 UTC (permalink / raw)
To: Chen Yu
Cc: aaron.lu, dietmar.eggemann, gautham.shenoy, juri.lelli,
kprateek.nayak, linux-kernel, mgorman, mingo, peterz, tim.c.chen,
vincent.guittot, yu.chen.surf, yujie.liu, kernel-team, yunzhao
On Wed, Jul 17, 2024 at 4:53 AM Chen Yu <yu.c.chen@intel.com> wrote:
>
> Thanks for your interest in this patch series. The RFC patch series was sent
> out to seek direction and to see whether this issue is worth fixing. Since
> you have encountered this issue as well and it seems to be a generic issue,
> I'll rebase this patch series, retest it on top of the latest kernel, and then
> send out a new version.
Great, thanks!
> > I'm seeing this same symptom of burning cycles in update_sd_lb_stats() on an
> > AMD EPYC 7713 machine (128 CPUs, 8 NUMA nodes). The machine is about 50% idle
> > and update_sd_lb_stats() sits as the first entry in perf top with about 3.62%
> > of CPU cycles.
>
> May I know what benchmark (test scenario) you are testing? I'd like to replicate
> this test on my machine as well.
Actually this isn't a benchmark -- this was observed on Cloudflare's
production machines. I'm happy to try out your series and report back.
Thanks,
Matt
* Re: [RFC PATCH 0/7] Optimization to reduce the cost of newidle balance
2024-07-17 12:17 ` Peter Zijlstra
@ 2024-07-18 9:28 ` K Prateek Nayak
2024-07-18 17:01 ` Chen Yu
2024-07-18 16:57 ` Chen Yu
1 sibling, 1 reply; 22+ messages in thread
From: K Prateek Nayak @ 2024-07-18 9:28 UTC (permalink / raw)
To: Peter Zijlstra, Chen Yu
Cc: Vincent Guittot, Ingo Molnar, Juri Lelli, Tim Chen, Mel Gorman,
Dietmar Eggemann, Gautham R . Shenoy, Chen Yu, Aaron Lu,
linux-kernel, void, Matt Fleming
Hello Peter,
On 7/17/2024 5:47 PM, Peter Zijlstra wrote:
> On Thu, Jul 27, 2023 at 10:33:58PM +0800, Chen Yu wrote:
>> Hi,
>>
>> This is the second version of the newidle balance optimization[1].
>> It aims to reduce the cost of newidle balance which is found to
>> occupy noticeable CPU cycles on some high-core count systems.
>>
>> For example, when running sqlite on Intel Sapphire Rapids, which has
>> 2 x 56C/112T = 224 CPUs:
>>
>> 6.69% 0.09% sqlite3 [kernel.kallsyms] [k] newidle_balance
>> 5.39% 4.71% sqlite3 [kernel.kallsyms] [k] update_sd_lb_stats
>>
>> To mitigate this cost, the optimization is inspired by the question
>> raised by Tim:
>> Do we always have to find the busiest group and pull from it? Would
>> a relatively busy group be enough?
>
> So doesn't this basically boil down to recognising that new-idle might
> not be the same as regular load-balancing -- we need any task, fast,
> rather than we need to make equal load.
>
> David's shared runqueue patches did the same, they re-imagined this very
> path.
>
> Now, David's thing went side-ways because of some regression that wasn't
> further investigated.
In the case of SHARED_RUNQ, I suspected that the frequent wakeup-sleep pattern of
hackbench at lower utilization raised some contention somewhere,
but a perf profile with IBS showed nothing specific and I left it there.
I revisited this again today and found this interesting data for perf
bench sched messaging running with one group pinned to one LLC domain on
my system:
- NO_SHARED_RUNQ
$ time ./perf bench sched messaging -p -t -l 100000 -g 1
# Running 'sched/messaging' benchmark:
# 20 sender and receiver threads per group
# 1 groups == 40 threads run
Total time: 3.972 [sec] (*)
real 0m3.985s
user 0m6.203s (*)
sys 1m20.087s (*)
$ sudo perf record -C 0-7,128-135 --off-cpu -- taskset -c 0-7,128-135 perf bench sched messaging -p -t -l 100000 -g 1
$ sudo perf report --no-children
Samples: 128 of event 'offcpu-time', Event count (approx.): 96,216,883,498 (*)
Overhead Command Shared Object Symbol
+ 51.43% sched-messaging libc.so.6 [.] read
+ 44.94% sched-messaging libc.so.6 [.] __GI___libc_write
+ 3.60% sched-messaging libc.so.6 [.] __GI___futex_abstimed_wait_cancelable64
0.03% sched-messaging libc.so.6 [.] __poll
0.00% sched-messaging perf [.] sender
- SHARED_RUNQ
$ time taskset -c 0-7,128-135 perf bench sched messaging -p -t -l 100000 -g 1
# Running 'sched/messaging' benchmark:
# 20 sender and receiver threads per group
# 1 groups == 40 threads run
Total time: 48.171 [sec] (*)
real 0m48.186s
user 0m5.409s (*)
sys 0m41.185s (*)
$ sudo perf record -C 0-7,128-135 --off-cpu -- taskset -c 0-7,128-135 perf bench sched messaging -p -t -l 100000 -g 1
$ sudo perf report --no-children
Samples: 157 of event 'offcpu-time', Event count (approx.): 5,882,929,338,882 (*)
Overhead Command Shared Object Symbol
+ 47.49% sched-messaging libc.so.6 [.] read
+ 46.33% sched-messaging libc.so.6 [.] __GI___libc_write
+ 2.40% sched-messaging libc.so.6 [.] __GI___futex_abstimed_wait_cancelable64
+ 1.08% snapd snapd [.] 0x000000000006caa3
+ 1.02% cron libc.so.6 [.] clock_nanosleep@GLIBC_2.2.5
+ 0.86% containerd containerd [.] runtime.futex.abi0
+ 0.82% containerd containerd [.] runtime/internal/syscall.Syscall6
(*) The runtime has bloated massively but both "user" and "sys" time
are down and the "offcpu-time" count goes up with SHARED_RUNQ.
There seems to be a corner case that is not accounted for, but I'm not
sure where it lies currently. P.S. I tested this on a v6.8-rc4 kernel
since that is what I initially tested the series on, but I can see the
same behavior when I rebased the changes on the current v6.10-rc5 based
tip:sched/core.
>
> But it occurs to me this might be the same thing that Prateek chased
> down here:
>
> https://lkml.kernel.org/r/20240710090210.41856-1-kprateek.nayak@amd.com
>
> Hmm ?
Without the nohz_csd_func fix and the SM_IDLE fast path (Patches 1 and
2), the scheduler currently depends on newidle_balance() to pull tasks
to an idle CPU. Vincent had pointed this out on the first RFC that
tried to tackle the problem by doing what SM_IDLE does, but for the
fair class alone:
https://lore.kernel.org/all/CAKfTPtC446Lo9CATPp7PExdkLhHQFoBuY-JMGC7agOHY4hs-Pw@mail.gmail.com/
It shouldn't be too frequent, but it could be the reason why
newidle_balance() jumps up in traces, especially if it decides to scan
a domain with a large number of CPUs (NUMA1/NUMA2 in Matt's case,
perhaps PKG/NUMA in the case Chenyu was investigating initially).
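As a rough illustration of why the domain width matters (the per-CPU
cost below is a made-up constant, not a measurement of any kernel):
update_sd_lb_stats() visits every CPU of every sched group in the
domain, so one pass over a wide NUMA domain dwarfs one over a single
LLC:

```python
# Back-of-the-envelope model of one update_sd_lb_stats() pass.
# per_cpu_ns is a hypothetical constant, purely for illustration.
def stats_pass_cost_ns(nr_groups, cpus_per_group, per_cpu_ns=40):
    """Cost grows with the total number of CPUs the domain spans."""
    return nr_groups * cpus_per_group * per_cpu_ns

# e.g. one LLC domain vs. a 2-socket NUMA domain on a 224-CPU machine
llc = stats_pass_cost_ns(nr_groups=8, cpus_per_group=14)
numa = stats_pass_cost_ns(nr_groups=2, cpus_per_group=112)
```

With these (assumed) shapes the NUMA pass costs twice the LLC pass,
and the gap only widens with socket size.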
>
> Supposing that is indeed the case, I think it makes more sense to
> proceed with that approach. That is, completely redo the sub-numa new
> idle balance.
>
>
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 0/7] Optimization to reduce the cost of newidle balance
2024-07-17 12:17 ` Peter Zijlstra
2024-07-18 9:28 ` K Prateek Nayak
@ 2024-07-18 16:57 ` Chen Yu
1 sibling, 0 replies; 22+ messages in thread
From: Chen Yu @ 2024-07-18 16:57 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Vincent Guittot, Ingo Molnar, Juri Lelli, Tim Chen, Mel Gorman,
Dietmar Eggemann, K Prateek Nayak, Gautham R . Shenoy, Chen Yu,
Aaron Lu, linux-kernel, void
Hi Peter,
On 2024-07-17 at 14:17:45 +0200, Peter Zijlstra wrote:
> On Thu, Jul 27, 2023 at 10:33:58PM +0800, Chen Yu wrote:
> > Hi,
> >
> > This is the second version of the newidle balance optimization[1].
> > It aims to reduce the cost of newidle balance which is found to
> > occupy noticeable CPU cycles on some high-core count systems.
> >
> > For example, when running sqlite on Intel Sapphire Rapids, which has
> > 2 x 56C/112T = 224 CPUs:
> >
> > 6.69% 0.09% sqlite3 [kernel.kallsyms] [k] newidle_balance
> > 5.39% 4.71% sqlite3 [kernel.kallsyms] [k] update_sd_lb_stats
> >
> > To mitigate this cost, the optimization is inspired by the question
> > raised by Tim:
> > Do we always have to find the busiest group and pull from it? Would
> > a relatively busy group be enough?
>
> So doesn't this basically boil down to recognising that new-idle might
> not be the same as regular load-balancing -- we need any task, fast,
> rather than we need to make equal load.
>
Yes, exactly.
> David's shared runqueue patches did the same, they re-imagined this very
> path.
>
> Now, David's thing went side-ways because of some regression that wasn't
> further investigated.
>
> But it occurs to me this might be the same thing that Prateek chased
> down here:
>
> https://lkml.kernel.org/r/20240710090210.41856-1-kprateek.nayak@amd.com
>
> Hmm ?
>
Thanks for the patch link. I took a look, and if I understand correctly,
Prateek's patch set fixes three issues related to TIF_POLLING_NRFLAG.
And the following two issues might cause an aggressive newidle balance:
1. The normal idle load balance does not get a chance to be triggered
when exiting the idle loop. Since the normal idle load balance does
not work, we have to count on the newidle balance to do more work.
2. The newly idle load balance is incorrectly triggered when exiting
from idle due to send_ipi(), even when there is no task about to
sleep.
Issue 2 will increase the frequency of invoking the newly idle
balance, but issue 1 would not. Issue 1 mainly impacts the success
ratio of each newidle balance, but might not increase the frequency
of triggering a newidle balance - that should mainly depend on the
task runtime behavior. Please correct me if I'm wrong.
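To put that distinction in concrete terms, here is a toy model (all
numbers are hypothetical, not kernel measurements): issue 2 changes
how often newidle balance fires, while issue 1 changes the success
ratio of each attempt without changing its rate:

```python
# Toy model separating the two effects on newidle balance; the inputs
# are hypothetical and only encode the direction of each issue.
def newidle_stats(triggers_per_sec, scan_cost_us, success_ratio):
    """Return (scan time burnt per second in us, useful pulls per second)."""
    return triggers_per_sec * scan_cost_us, triggers_per_sec * success_ratio

# Issue 2 (spurious trigger on IPI exit) inflates the trigger rate,
# so fixing it cuts the scanning overhead directly:
burnt_buggy, _ = newidle_stats(2000, 10, 0.2)
burnt_fixed, pulls = newidle_stats(500, 10, 0.2)

# Issue 1 (missed nohz idle balance) instead shifts the success ratio
# of each attempt while leaving the trigger rate alone:
burnt_same, pulls_shifted = newidle_stats(500, 10, 0.4)
```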
All three of Prateek's patches fix existing newidle balance issues;
I'll apply his patch set and re-test.
> Supposing that is indeed the case, I think it makes more sense to
> proceed with that approach. That is, completely redo the sub-numa new
> idle balance.
>
I did not quite follow this; Prateek's patch set does not redo the
sub-NUMA newidle balance, I suppose? Or do you mean further work based
on Prateek's patch set?
thanks,
Chenyu
* Re: [RFC PATCH 0/7] Optimization to reduce the cost of newidle balance
2024-07-18 9:28 ` K Prateek Nayak
@ 2024-07-18 17:01 ` Chen Yu
0 siblings, 0 replies; 22+ messages in thread
From: Chen Yu @ 2024-07-18 17:01 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Peter Zijlstra, Vincent Guittot, Ingo Molnar, Juri Lelli,
Tim Chen, Mel Gorman, Dietmar Eggemann, Gautham R . Shenoy,
Chen Yu, Aaron Lu, linux-kernel, void, Matt Fleming
Hi Prateek,
On 2024-07-18 at 14:58:30 +0530, K Prateek Nayak wrote:
> Hello Peter,
>
> On 7/17/2024 5:47 PM, Peter Zijlstra wrote:
> > On Thu, Jul 27, 2023 at 10:33:58PM +0800, Chen Yu wrote:
> > > Hi,
> > >
> > > This is the second version of the newidle balance optimization[1].
> > > It aims to reduce the cost of newidle balance which is found to
> > > occupy noticeable CPU cycles on some high-core count systems.
> > >
> > > For example, when running sqlite on Intel Sapphire Rapids, which has
> > > 2 x 56C/112T = 224 CPUs:
> > >
> > > 6.69% 0.09% sqlite3 [kernel.kallsyms] [k] newidle_balance
> > > 5.39% 4.71% sqlite3 [kernel.kallsyms] [k] update_sd_lb_stats
> > >
> > > To mitigate this cost, the optimization is inspired by the question
> > > raised by Tim:
> > > Do we always have to find the busiest group and pull from it? Would
> > > a relatively busy group be enough?
> >
> > So doesn't this basically boil down to recognising that new-idle might
> > not be the same as regular load-balancing -- we need any task, fast,
> > rather than we need to make equal load.
> >
> > David's shared runqueue patches did the same, they re-imagined this very
> > path.
> >
> > Now, David's thing went side-ways because of some regression that wasn't
> > further investigated.
>
> In case of SHARED_RUNQ, I suspected frequent wakeup-sleep pattern of
> hackbench at lower utilization seemed to raise some contention somewhere
> but perf profile with IBS showed nothing specific and I left it there.
>
> I revisited this again today and found this interesting data for perf
> bench sched messaging running with one group pinned to one LLC domain on
> my system:
>
> - NO_SHARED_RUNQ
>
> $ time ./perf bench sched messaging -p -t -l 100000 -g 1
> # Running 'sched/messaging' benchmark:
> # 20 sender and receiver threads per group
> # 1 groups == 40 threads run
> Total time: 3.972 [sec] (*)
> real 0m3.985s
> user 0m6.203s (*)
> sys 1m20.087s (*)
>
> $ sudo perf record -C 0-7,128-135 --off-cpu -- taskset -c 0-7,128-135 perf bench sched messaging -p -t -l 100000 -g 1
> $ sudo perf report --no-children
>
> Samples: 128 of event 'offcpu-time', Event count (approx.): 96,216,883,498 (*)
> Overhead Command Shared Object Symbol
> + 51.43% sched-messaging libc.so.6 [.] read
> + 44.94% sched-messaging libc.so.6 [.] __GI___libc_write
> + 3.60% sched-messaging libc.so.6 [.] __GI___futex_abstimed_wait_cancelable64
> 0.03% sched-messaging libc.so.6 [.] __poll
> 0.00% sched-messaging perf [.] sender
>
>
> - SHARED_RUNQ
>
> $ time taskset -c 0-7,128-135 perf bench sched messaging -p -t -l 100000 -g 1
> # Running 'sched/messaging' benchmark:
> # 20 sender and receiver threads per group
> # 1 groups == 40 threads run
> Total time: 48.171 [sec] (*)
> real 0m48.186s
> user 0m5.409s (*)
> sys 0m41.185s (*)
>
> $ sudo perf record -C 0-7,128-135 --off-cpu -- taskset -c 0-7,128-135 perf bench sched messaging -p -t -l 100000 -g 1
> $ sudo perf report --no-children
>
> Samples: 157 of event 'offcpu-time', Event count (approx.): 5,882,929,338,882 (*)
> Overhead Command Shared Object Symbol
> + 47.49% sched-messaging libc.so.6 [.] read
> + 46.33% sched-messaging libc.so.6 [.] __GI___libc_write
> + 2.40% sched-messaging libc.so.6 [.] __GI___futex_abstimed_wait_cancelable64
> + 1.08% snapd snapd [.] 0x000000000006caa3
> + 1.02% cron libc.so.6 [.] clock_nanosleep@GLIBC_2.2.5
> + 0.86% containerd containerd [.] runtime.futex.abi0
> + 0.82% containerd containerd [.] runtime/internal/syscall.Syscall6
>
>
> (*) The runtime has bloated massively but both "user" and "sys" time
> are down and the "offcpu-time" count goes up with SHARED_RUNQ.
>
> There seems to be a corner case that is not accounted for but I'm not
> sure where it lies currently. P.S. I tested this on a v6.8-rc4 kernel
> since that is what I initially tested the series on but I can see the
> same behavior when I rebased the changes on the current v6.10-rc5-based
> tip:sched/core.
>
> >
> > But it occurs to me this might be the same thing that Prateek chased
> > down here:
> >
> > https://lkml.kernel.org/r/20240710090210.41856-1-kprateek.nayak@amd.com
> >
> > Hmm ?
>
> Without the nohz_csd_func fix and the SM_IDLE fast path (Patches 1 and
> 2), the scheduler currently depends on newidle_balance() to pull tasks
> to an idle CPU. Vincent had pointed this out on the first RFC that
> tried to tackle the problem by doing what SM_IDLE does, but for the
> fair class alone:
>
> https://lore.kernel.org/all/CAKfTPtC446Lo9CATPp7PExdkLhHQFoBuY-JMGC7agOHY4hs-Pw@mail.gmail.com/
>
> It shouldn't be too frequent, but it could be the reason why
> newidle_balance() jumps up in traces, especially if it decides to scan
> a domain with a large number of CPUs (NUMA1/NUMA2 in Matt's case,
> perhaps PKG/NUMA in the case Chenyu was investigating initially).
>
>
Yes, this is my understanding too. I'll apply your patches and re-test.
thanks,
Chenyu
Thread overview: 22+ messages
2023-07-27 14:33 [RFC PATCH 0/7] Optimization to reduce the cost of newidle balance Chen Yu
2023-07-27 14:34 ` [RFC PATCH 1/7] sched/topology: Assign sd_share for all non NUMA sched domains Chen Yu
2023-07-27 14:34 ` [RFC PATCH 2/7] sched/topology: Introduce nr_groups in sched_domain to indicate the number of groups Chen Yu
2023-07-27 14:34 ` [RFC PATCH 3/7] sched/fair: Save a snapshot of sched domain total_load and total_capacity Chen Yu
2023-07-27 14:35 ` [RFC PATCH 4/7] sched/fair: Calculate the scan depth for idle balance based on system utilization Chen Yu
2023-08-25 6:02 ` Shrikanth Hegde
2023-08-30 15:30 ` Chen Yu
2023-07-27 14:35 ` [RFC PATCH 5/7] sched/fair: Adjust the busiest group scanning depth in idle load balance Chen Yu
2023-08-25 6:00 ` Shrikanth Hegde
2023-08-30 15:35 ` Chen Yu
2023-07-27 14:35 ` [RFC PATCH 6/7] sched/fair: Pull from a relatively busy group during newidle balance Chen Yu
2023-07-27 14:35 ` [RFC PATCH 7/7] sched/stats: Track the scan number of groups during load balance Chen Yu
2023-08-25 7:48 ` [RFC PATCH 0/7] Optimization to reduce the cost of newidle balance Shrikanth Hegde
2023-08-30 15:26 ` Chen Yu
2023-09-10 7:51 ` Shrikanth Hegde
2024-07-16 14:16 ` Matt Fleming
2024-07-17 3:52 ` Chen Yu
2024-07-17 15:31 ` Matt Fleming
2024-07-17 12:17 ` Peter Zijlstra
2024-07-18 9:28 ` K Prateek Nayak
2024-07-18 17:01 ` Chen Yu
2024-07-18 16:57 ` Chen Yu