* [PATCH 0/8] x86, sched: Dynamic ITMT core ranking support and some yak shaving
@ 2024-12-11 18:55 K Prateek Nayak
2024-12-11 18:55 ` [PATCH 1/8] x86/itmt: Convert "sysctl_sched_itmt_enabled" to boolean K Prateek Nayak
` (8 more replies)
0 siblings, 9 replies; 32+ messages in thread
From: K Prateek Nayak @ 2024-12-11 18:55 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Peter Zijlstra, Juri Lelli, Vincent Guittot, x86, linux-kernel
Cc: H. Peter Anvin, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Rafael J. Wysocki, Ricardo Neri,
Tim Chen, Mario Limonciello, Meng Li, Huang Rui,
Gautham R. Shenoy, K Prateek Nayak
The ITMT infrastructure currently assumes that ITMT rankings are
static and set correctly before ITMT support is enabled, which allows
the CPU with the highest core ranking to be cached as the
"asym_prefer_cpu" in the sched_group struct. However, with the
introduction of Preferred Core support in amd-pstate, these rankings
can change at runtime.
This series adds support for dynamic rankings in the generic scheduler
layer without the need to rebuild the sched domain hierarchy, and fixes
an issue with x86_die_flags() on AMD systems that support Preferred
Core ranking, with some yak shaving done along the way.
Patches 1 to 4 are independent cleanups around the ITMT infrastructure,
removal of the x86_smt_flags() wrapper, and moving of the
"sched_itmt_enabled" sysctl to debugfs.
Patch 5 adds the SD_ASYM_PACKING flag to the PKG domain on all ITMT
enabled systems. The rationale behind the addition is elaborated in the
patch itself. One open question remains for Intel processors with
multiple Tiles in a PKG that advertise themselves as multiple LLCs in a
PKG and support ITMT - is it okay to set SD_ASYM_PACKING for the PKG
domain on these processors?
Patches 6 and 7 are independent, possible micro-optimizations
discovered when auditing update_sg_lb_stats().
Patch 8 uncaches the asym_prefer_cpu from the sched_group struct and
finds it during load balancing in update_sg_lb_stats() before it is
used to make any scheduling decisions. This is the simplest approach;
an alternate approach would be to move the asym_prefer_cpu to
sched_domain_shared and allow the first load balancing instance after a
priority change to update the cached asym_prefer_cpu. On systems with
static priorities, this would retain the benefits of caching, while on
systems with dynamic priorities, it would reduce the overhead of
finding the "asym_prefer_cpu" each time update_sg_lb_stats() is called.
However, these benefits come with added code complexity, which is why
Patch 8 is marked as an RFC.
This series is based on
git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core
at commit 2a77e4be12cb ("sched/fair: Untangle NEXT_BUDDY and
pick_next_task()") and is a spiritual successor to a previous attempt
at fixing the x86_die_flags() on Preferred Core enabled system by Mario
that can be found at
https://lore.kernel.org/lkml/20241203201129.31957-1-mario.limonciello@amd.com/
---
K Prateek Nayak (8):
x86/itmt: Convert "sysctl_sched_itmt_enabled" to boolean
x86/itmt: Use guard() for itmt_update_mutex
x86/itmt: Move the "sched_itmt_enabled" sysctl to debugfs
x86/topology: Remove x86_smt_flags and use cpu_smt_flags directly
x86/topology: Use x86_sched_itmt_flags for PKG domain unconditionally
sched/fair: Do not compute NUMA Balancing stats unnecessarily during
lb
sched/fair: Do not compute overloaded status unnecessarily during lb
sched/fair: Uncache asym_prefer_cpu and find it during
update_sd_lb_stats()
arch/x86/include/asm/topology.h | 4 +-
arch/x86/kernel/itmt.c | 81 ++++++++++++++-------------------
arch/x86/kernel/smpboot.c | 19 +-------
kernel/sched/fair.c | 41 +++++++++++++----
kernel/sched/sched.h | 1 -
kernel/sched/topology.c | 15 +-----
6 files changed, 69 insertions(+), 92 deletions(-)
base-commit: 2a77e4be12cb58bbf774e7c717c8bb80e128b7a4
--
2.34.1
* [PATCH 1/8] x86/itmt: Convert "sysctl_sched_itmt_enabled" to boolean
2024-12-11 18:55 [PATCH 0/8] x86, sched: Dynamic ITMT core ranking support and some yak shaving K Prateek Nayak
@ 2024-12-11 18:55 ` K Prateek Nayak
2024-12-12 18:09 ` Tim Chen
2024-12-11 18:55 ` [PATCH 2/8] x86/itmt: Use guard() for itmt_update_mutex K Prateek Nayak
` (7 subsequent siblings)
8 siblings, 1 reply; 32+ messages in thread
From: K Prateek Nayak @ 2024-12-11 18:55 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Peter Zijlstra, Juri Lelli, Vincent Guittot, x86, linux-kernel
Cc: H. Peter Anvin, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Rafael J. Wysocki, Ricardo Neri,
Tim Chen, Mario Limonciello, Meng Li, Huang Rui,
Gautham R. Shenoy, K Prateek Nayak
In preparation to move "sysctl_sched_itmt_enabled" to debugfs, convert
the unsigned int to bool since debugfs readily exposes boolean fops
primitives (debugfs_read_file_bool, debugfs_write_file_bool) which can
streamline the conversion.
Since the current ctl_table initializes extra1 and extra2 to SYSCTL_ZERO
and SYSCTL_ONE respectively, the value of "sysctl_sched_itmt_enabled"
can only be 0 or 1 and this datatype conversion should not cause any
functional changes.
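
For reference, these debugfs helpers are typically wired up as in the
minimal sketch below (not part of this patch; "example_enabled" and the
file name are made up for illustration):

  #include <linux/debugfs.h>
  #include <linux/fs.h>

  static bool example_enabled;

  static const struct file_operations example_fops = {
          .read   = debugfs_read_file_bool,
          .write  = debugfs_write_file_bool,
          .open   = simple_open,
          .llseek = default_llseek,
  };

  static int __init example_init(void)
  {
          /* Creates /sys/kernel/debug/example_enabled backed by &example_enabled */
          debugfs_create_file_unsafe("example_enabled", 0644, NULL,
                                     &example_enabled, &example_fops);
          return 0;
  }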
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
arch/x86/include/asm/topology.h | 4 ++--
arch/x86/kernel/itmt.c | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index fd41103ad342..63bab25a4896 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -250,7 +250,7 @@ extern bool x86_topology_update;
#include <asm/percpu.h>
DECLARE_PER_CPU_READ_MOSTLY(int, sched_core_priority);
-extern unsigned int __read_mostly sysctl_sched_itmt_enabled;
+extern bool __read_mostly sysctl_sched_itmt_enabled;
/* Interface to set priority of a cpu */
void sched_set_itmt_core_prio(int prio, int core_cpu);
@@ -263,7 +263,7 @@ void sched_clear_itmt_support(void);
#else /* CONFIG_SCHED_MC_PRIO */
-#define sysctl_sched_itmt_enabled 0
+#define sysctl_sched_itmt_enabled false
static inline void sched_set_itmt_core_prio(int prio, int core_cpu)
{
}
diff --git a/arch/x86/kernel/itmt.c b/arch/x86/kernel/itmt.c
index 51b805c727fc..28f449123d68 100644
--- a/arch/x86/kernel/itmt.c
+++ b/arch/x86/kernel/itmt.c
@@ -36,7 +36,7 @@ static bool __read_mostly sched_itmt_capable;
*
* It can be set via /proc/sys/kernel/sched_itmt_enabled
*/
-unsigned int __read_mostly sysctl_sched_itmt_enabled;
+bool __read_mostly sysctl_sched_itmt_enabled;
static int sched_itmt_update_handler(const struct ctl_table *table, int write,
void *buffer, size_t *lenp, loff_t *ppos)
--
2.34.1
* [PATCH 2/8] x86/itmt: Use guard() for itmt_update_mutex
2024-12-11 18:55 [PATCH 0/8] x86, sched: Dynamic ITMT core ranking support and some yak shaving K Prateek Nayak
2024-12-11 18:55 ` [PATCH 1/8] x86/itmt: Convert "sysctl_sched_itmt_enabled" to boolean K Prateek Nayak
@ 2024-12-11 18:55 ` K Prateek Nayak
2024-12-12 18:22 ` Tim Chen
2024-12-11 18:55 ` [PATCH 3/8] x86/itmt: Move the "sched_itmt_enabled" sysctl to debugfs K Prateek Nayak
` (6 subsequent siblings)
8 siblings, 1 reply; 32+ messages in thread
From: K Prateek Nayak @ 2024-12-11 18:55 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Peter Zijlstra, Juri Lelli, Vincent Guittot, x86, linux-kernel
Cc: H. Peter Anvin, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Rafael J. Wysocki, Ricardo Neri,
Tim Chen, Mario Limonciello, Meng Li, Huang Rui,
Gautham R. Shenoy, K Prateek Nayak
Use guard() for itmt_update_mutex which avoids the extra mutex_unlock()
in the bailout and return paths.
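
For reference, guard(mutex)(&m) from the <linux/cleanup.h> machinery in
<linux/mutex.h> acquires the mutex and releases it automatically when
the enclosing scope is left, so the two forms below are roughly
equivalent (a sketch with made-up names; not part of this patch):

  static DEFINE_MUTEX(m);

  /* explicit lock/unlock on every exit path */
  static int example_locked(bool cond)
  {
          mutex_lock(&m);
          if (!cond) {
                  mutex_unlock(&m);
                  return -EINVAL;
          }
          mutex_unlock(&m);
          return 0;
  }

  /* scope-based: the mutex is dropped automatically on both returns */
  static int example_guarded(bool cond)
  {
          guard(mutex)(&m);
          if (!cond)
                  return -EINVAL;
          return 0;
  }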
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
arch/x86/kernel/itmt.c | 29 ++++++++---------------------
1 file changed, 8 insertions(+), 21 deletions(-)
diff --git a/arch/x86/kernel/itmt.c b/arch/x86/kernel/itmt.c
index 28f449123d68..ee43d1bd41d0 100644
--- a/arch/x86/kernel/itmt.c
+++ b/arch/x86/kernel/itmt.c
@@ -44,12 +44,10 @@ static int sched_itmt_update_handler(const struct ctl_table *table, int write,
unsigned int old_sysctl;
int ret;
- mutex_lock(&itmt_update_mutex);
+ guard(mutex)(&itmt_update_mutex);
- if (!sched_itmt_capable) {
- mutex_unlock(&itmt_update_mutex);
+ if (!sched_itmt_capable)
return -EINVAL;
- }
old_sysctl = sysctl_sched_itmt_enabled;
ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
@@ -59,8 +57,6 @@ static int sched_itmt_update_handler(const struct ctl_table *table, int write,
rebuild_sched_domains();
}
- mutex_unlock(&itmt_update_mutex);
-
return ret;
}
@@ -97,18 +93,14 @@ static struct ctl_table_header *itmt_sysctl_header;
*/
int sched_set_itmt_support(void)
{
- mutex_lock(&itmt_update_mutex);
+ guard(mutex)(&itmt_update_mutex);
- if (sched_itmt_capable) {
- mutex_unlock(&itmt_update_mutex);
+ if (sched_itmt_capable)
return 0;
- }
itmt_sysctl_header = register_sysctl("kernel", itmt_kern_table);
- if (!itmt_sysctl_header) {
- mutex_unlock(&itmt_update_mutex);
+ if (!itmt_sysctl_header)
return -ENOMEM;
- }
sched_itmt_capable = true;
@@ -117,8 +109,6 @@ int sched_set_itmt_support(void)
x86_topology_update = true;
rebuild_sched_domains();
- mutex_unlock(&itmt_update_mutex);
-
return 0;
}
@@ -134,12 +124,11 @@ int sched_set_itmt_support(void)
*/
void sched_clear_itmt_support(void)
{
- mutex_lock(&itmt_update_mutex);
+ guard(mutex)(&itmt_update_mutex);
- if (!sched_itmt_capable) {
- mutex_unlock(&itmt_update_mutex);
+ if (!sched_itmt_capable)
return;
- }
+
sched_itmt_capable = false;
if (itmt_sysctl_header) {
@@ -153,8 +142,6 @@ void sched_clear_itmt_support(void)
x86_topology_update = true;
rebuild_sched_domains();
}
-
- mutex_unlock(&itmt_update_mutex);
}
int arch_asym_cpu_priority(int cpu)
--
2.34.1
* [PATCH 3/8] x86/itmt: Move the "sched_itmt_enabled" sysctl to debugfs
2024-12-11 18:55 [PATCH 0/8] x86, sched: Dynamic ITMT core ranking support and some yak shaving K Prateek Nayak
2024-12-11 18:55 ` [PATCH 1/8] x86/itmt: Convert "sysctl_sched_itmt_enabled" to boolean K Prateek Nayak
2024-12-11 18:55 ` [PATCH 2/8] x86/itmt: Use guard() for itmt_update_mutex K Prateek Nayak
@ 2024-12-11 18:55 ` K Prateek Nayak
2024-12-12 19:15 ` Tim Chen
2024-12-11 18:55 ` [PATCH 4/8] x86/topology: Remove x86_smt_flags and use cpu_smt_flags directly K Prateek Nayak
` (5 subsequent siblings)
8 siblings, 1 reply; 32+ messages in thread
From: K Prateek Nayak @ 2024-12-11 18:55 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Peter Zijlstra, Juri Lelli, Vincent Guittot, x86, linux-kernel
Cc: H. Peter Anvin, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Rafael J. Wysocki, Ricardo Neri,
Tim Chen, Mario Limonciello, Meng Li, Huang Rui,
Gautham R. Shenoy, K Prateek Nayak
"sched_itmt_enabled" was only introduced as a debug toggle for any funky
ITMT behavior. Move the sysctl controlled from
"/proc/sys/kernel/sched_itmt_enabled" to debugfs at
"/sys/kernel/debug/x86/sched_itmt_enabled" with a notable change that a
cat on the file will return "Y" or "N" instead of "1" or "0" to
indicate that feature is enabled or disabled respectively.
Since ITMT is x86 specific (and PowerPC uses SD_ASYM_PACKING too), the
toggle was moved to "/sys/kernel/debug/x86/" as opposed to
"/sys/kernel/debug/sched/"
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
arch/x86/kernel/itmt.c | 56 ++++++++++++++++++++----------------------
1 file changed, 27 insertions(+), 29 deletions(-)
diff --git a/arch/x86/kernel/itmt.c b/arch/x86/kernel/itmt.c
index ee43d1bd41d0..9cea1fc36c18 100644
--- a/arch/x86/kernel/itmt.c
+++ b/arch/x86/kernel/itmt.c
@@ -19,6 +19,7 @@
#include <linux/sched.h>
#include <linux/cpumask.h>
#include <linux/cpuset.h>
+#include <linux/debugfs.h>
#include <linux/mutex.h>
#include <linux/sysctl.h>
#include <linux/nodemask.h>
@@ -34,45 +35,38 @@ static bool __read_mostly sched_itmt_capable;
* of higher turbo frequency for cpus supporting Intel Turbo Boost Max
* Technology 3.0.
*
- * It can be set via /proc/sys/kernel/sched_itmt_enabled
+ * It can be set via /sys/kernel/debug/x86/sched_itmt_enabled
*/
bool __read_mostly sysctl_sched_itmt_enabled;
-static int sched_itmt_update_handler(const struct ctl_table *table, int write,
- void *buffer, size_t *lenp, loff_t *ppos)
+static ssize_t sched_itmt_enabled_write(struct file *filp,
+ const char __user *ubuf,
+ size_t cnt, loff_t *ppos)
{
- unsigned int old_sysctl;
- int ret;
+ ssize_t result;
+ bool orig;
guard(mutex)(&itmt_update_mutex);
- if (!sched_itmt_capable)
- return -EINVAL;
-
- old_sysctl = sysctl_sched_itmt_enabled;
- ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+ orig = sysctl_sched_itmt_enabled;
+ result = debugfs_write_file_bool(filp, ubuf, cnt, ppos);
- if (!ret && write && old_sysctl != sysctl_sched_itmt_enabled) {
+ if (sysctl_sched_itmt_enabled != orig) {
x86_topology_update = true;
rebuild_sched_domains();
}
- return ret;
+ return result;
}
-static struct ctl_table itmt_kern_table[] = {
- {
- .procname = "sched_itmt_enabled",
- .data = &sysctl_sched_itmt_enabled,
- .maxlen = sizeof(unsigned int),
- .mode = 0644,
- .proc_handler = sched_itmt_update_handler,
- .extra1 = SYSCTL_ZERO,
- .extra2 = SYSCTL_ONE,
- },
+static const struct file_operations dfs_sched_itmt_fops = {
+ .read = debugfs_read_file_bool,
+ .write = sched_itmt_enabled_write,
+ .open = simple_open,
+ .llseek = default_llseek,
};
-static struct ctl_table_header *itmt_sysctl_header;
+static struct dentry *dfs_sched_itmt;
/**
* sched_set_itmt_support() - Indicate platform supports ITMT
@@ -98,9 +92,15 @@ int sched_set_itmt_support(void)
if (sched_itmt_capable)
return 0;
- itmt_sysctl_header = register_sysctl("kernel", itmt_kern_table);
- if (!itmt_sysctl_header)
+ dfs_sched_itmt = debugfs_create_file_unsafe("sched_itmt_enabled",
+ 0644,
+ arch_debugfs_dir,
+ &sysctl_sched_itmt_enabled,
+ &dfs_sched_itmt_fops);
+ if (IS_ERR_OR_NULL(dfs_sched_itmt)) {
+ dfs_sched_itmt = NULL;
return -ENOMEM;
+ }
sched_itmt_capable = true;
@@ -131,10 +131,8 @@ void sched_clear_itmt_support(void)
sched_itmt_capable = false;
- if (itmt_sysctl_header) {
- unregister_sysctl_table(itmt_sysctl_header);
- itmt_sysctl_header = NULL;
- }
+ debugfs_remove(dfs_sched_itmt);
+ dfs_sched_itmt = NULL;
if (sysctl_sched_itmt_enabled) {
/* disable sched_itmt if we are no longer ITMT capable */
--
2.34.1
* [PATCH 4/8] x86/topology: Remove x86_smt_flags and use cpu_smt_flags directly
2024-12-11 18:55 [PATCH 0/8] x86, sched: Dynamic ITMT core ranking support and some yak shaving K Prateek Nayak
` (2 preceding siblings ...)
2024-12-11 18:55 ` [PATCH 3/8] x86/itmt: Move the "sched_itmt_enabled" sysctl to debugfs K Prateek Nayak
@ 2024-12-11 18:55 ` K Prateek Nayak
2024-12-12 21:05 ` Tim Chen
2024-12-11 18:55 ` [PATCH 5/8] x86/topology: Use x86_sched_itmt_flags for PKG domain unconditionally K Prateek Nayak
` (4 subsequent siblings)
8 siblings, 1 reply; 32+ messages in thread
From: K Prateek Nayak @ 2024-12-11 18:55 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Peter Zijlstra, Juri Lelli, Vincent Guittot, x86, linux-kernel
Cc: H. Peter Anvin, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Rafael J. Wysocki, Ricardo Neri,
Tim Chen, Mario Limonciello, Meng Li, Huang Rui,
Gautham R. Shenoy, K Prateek Nayak
The x86_*_flags() wrappers were introduced with commit d3d37d850d1d
("x86/sched: Add SD_ASYM_PACKING flags to x86 ITMT CPU") to add
x86_sched_itmt_flags() on top of the default domain flags for the SMT
and MC domains.
Commit 995998ebdebd ("x86/sched: Remove SD_ASYM_PACKING from the
SMT domain flags") removed the ITMT flags for the SMT domain but not
the x86_smt_flags() wrapper, which just returns cpu_smt_flags().
Remove x86_smt_flags() and use cpu_smt_flags() directly to derive the
flags for the SMT domain. No functional changes intended.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
arch/x86/kernel/smpboot.c | 8 +-------
1 file changed, 1 insertion(+), 7 deletions(-)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index b5a8f0891135..6e300897b7ee 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -482,12 +482,6 @@ static int x86_core_flags(void)
return cpu_core_flags() | x86_sched_itmt_flags();
}
#endif
-#ifdef CONFIG_SCHED_SMT
-static int x86_smt_flags(void)
-{
- return cpu_smt_flags();
-}
-#endif
#ifdef CONFIG_SCHED_CLUSTER
static int x86_cluster_flags(void)
{
@@ -519,7 +513,7 @@ static void __init build_sched_topology(void)
#ifdef CONFIG_SCHED_SMT
x86_topology[i++] = (struct sched_domain_topology_level){
- cpu_smt_mask, x86_smt_flags, SD_INIT_NAME(SMT)
+ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT)
};
#endif
#ifdef CONFIG_SCHED_CLUSTER
--
2.34.1
* [PATCH 5/8] x86/topology: Use x86_sched_itmt_flags for PKG domain unconditionally
2024-12-11 18:55 [PATCH 0/8] x86, sched: Dynamic ITMT core ranking support and some yak shaving K Prateek Nayak
` (3 preceding siblings ...)
2024-12-11 18:55 ` [PATCH 4/8] x86/topology: Remove x86_smt_flags and use cpu_smt_flags directly K Prateek Nayak
@ 2024-12-11 18:55 ` K Prateek Nayak
2024-12-13 21:07 ` Tim Chen
2024-12-11 18:55 ` [PATCH 6/8] sched/fair: Do not compute NUMA Balancing stats unnecessarily during lb K Prateek Nayak
` (3 subsequent siblings)
8 siblings, 1 reply; 32+ messages in thread
From: K Prateek Nayak @ 2024-12-11 18:55 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Peter Zijlstra, Juri Lelli, Vincent Guittot, x86, linux-kernel
Cc: H. Peter Anvin, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Rafael J. Wysocki, Ricardo Neri,
Tim Chen, Mario Limonciello, Meng Li, Huang Rui,
Gautham R. Shenoy, K Prateek Nayak
x86_sched_itmt_flags() returns SD_ASYM_PACKING if ITMT support is
enabled by the system. Without ITMT support enabled, it returns 0,
similar to the current x86_die_flags() on non-Hybrid systems
(!X86_HYBRID_CPU and !X86_FEATURE_AMD_HETEROGENEOUS_CORES).
On Intel systems that enable ITMT support, either the MC domain
coincides with the PKG domain, or, in case of multiple MC groups within
a PKG domain, either Sub-NUMA Cluster (SNC) is enabled or the processor
features a Hybrid core layout (X86_HYBRID_CPU). This leads to three
distinct possibilities:
o If PKG and MC domains coincide, PKG domain is degenerated by
sd_parent_degenerate() when building sched domain topology.
o If SNC is enabled, PKG domain is never added since
"x86_has_numa_in_package" is set and the topology will instead contain
NODE and NUMA domains.
o On X86_HYBRID_CPU which contains multiple MC groups within the PKG,
the PKG domain requires x86_sched_itmt_flags().
Thus, on Intel systems that contain multiple MC groups within the PKG
and enable ITMT support, the PKG domain requires
x86_sched_itmt_flags(). In all other cases the PKG domain is either
never added or is degenerated. Thus, returning x86_sched_itmt_flags()
unconditionally for the PKG domain on Intel systems should not lead to
any functional changes.
On AMD systems with multiple LLCs (MC groups) within a PKG domain,
enabling ITMT support requires setting SD_ASYM_PACKING on the PKG
domain since the core rankings are assigned PKG-wide.
Core rankings on AMD processors are currently set by the amd-pstate
driver when the Preferred Core feature is supported. A subset of the
systems that support the Preferred Core feature can be detected using
X86_FEATURE_AMD_HETEROGENEOUS_CORES; however, this does not cover all
the systems that support Preferred Core ranking.
Detecting Preferred Core support on AMD systems requires inspecting the
CPPC Highest Perf value on all present CPUs and checking whether it
differs on at least one CPU. A previous suggestion to use a synthetic
feature to detect Preferred Core support [1] was found to be
non-trivial to implement since the BSP alone cannot detect whether
Preferred Core is supported, and by the time the APs come up,
alternatives are already patched and setting an X86_FEATURE_* bit is no
longer possible.
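
For illustration, such a check would have to look roughly like the
sketch below (hypothetical helper, not proposed code), which needs the
CPPC data of all present CPUs and hence cannot run on the BSP alone:

  /* Returns true if CPPC Highest Perf differs across present CPUs */
  static bool prefcore_rankings_differ(void)
  {
          u64 highest_perf, prev = 0;
          bool first = true;
          int cpu;

          for_each_present_cpu(cpu) {
                  if (cppc_get_highest_perf(cpu, &highest_perf))
                          return false;
                  if (!first && highest_perf != prev)
                          return true;    /* at least one CPU differs */
                  prev = highest_perf;
                  first = false;
          }

          return false;
  }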
Since x86 processors that enable ITMT support and consist of multiple
non-NUMA MC groups within a PKG require the SD_ASYM_PACKING flag to be
set at the PKG domain, return x86_sched_itmt_flags() unconditionally
for the PKG domain.
Since x86_die_flags() would have just returned x86_sched_itmt_flags()
after the change, remove the unnecessary wrapper and pass
x86_sched_itmt_flags() directly as the flags function.
Link: https://lore.kernel.org/lkml/20241203221746.GKZ0-Dii5rnZppkM_e@fat_crate.local/ [1]
Fixes: f3a052391822 ("cpufreq: amd-pstate: Enable amd-pstate preferred core support")
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
arch/x86/kernel/smpboot.c | 11 +----------
1 file changed, 1 insertion(+), 10 deletions(-)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 6e300897b7ee..ef63b1c0b491 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -489,15 +489,6 @@ static int x86_cluster_flags(void)
}
#endif
-static int x86_die_flags(void)
-{
- if (cpu_feature_enabled(X86_FEATURE_HYBRID_CPU) ||
- cpu_feature_enabled(X86_FEATURE_AMD_HETEROGENEOUS_CORES))
- return x86_sched_itmt_flags();
-
- return 0;
-}
-
/*
* Set if a package/die has multiple NUMA nodes inside.
* AMD Magny-Cours, Intel Cluster-on-Die, and Intel
@@ -533,7 +524,7 @@ static void __init build_sched_topology(void)
*/
if (!x86_has_numa_in_package) {
x86_topology[i++] = (struct sched_domain_topology_level){
- cpu_cpu_mask, x86_die_flags, SD_INIT_NAME(PKG)
+ cpu_cpu_mask, x86_sched_itmt_flags, SD_INIT_NAME(PKG)
};
}
--
2.34.1
* [PATCH 6/8] sched/fair: Do not compute NUMA Balancing stats unnecessarily during lb
2024-12-11 18:55 [PATCH 0/8] x86, sched: Dynamic ITMT core ranking support and some yak shaving K Prateek Nayak
` (4 preceding siblings ...)
2024-12-11 18:55 ` [PATCH 5/8] x86/topology: Use x86_sched_itmt_flags for PKG domain unconditionally K Prateek Nayak
@ 2024-12-11 18:55 ` K Prateek Nayak
2024-12-12 11:05 ` Vincent Guittot
2024-12-13 14:55 ` Shrikanth Hegde
2024-12-11 18:55 ` [PATCH 7/8] sched/fair: Do not compute overloaded status " K Prateek Nayak
` (2 subsequent siblings)
8 siblings, 2 replies; 32+ messages in thread
From: K Prateek Nayak @ 2024-12-11 18:55 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Peter Zijlstra, Juri Lelli, Vincent Guittot, x86, linux-kernel
Cc: H. Peter Anvin, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Rafael J. Wysocki, Ricardo Neri,
Tim Chen, Mario Limonciello, Meng Li, Huang Rui,
Gautham R. Shenoy, K Prateek Nayak
Aggregate nr_numa_running and nr_preferred_running when load balancing
at NUMA domains only. While at it, also move the aggregation below the
idle_cpu() check since an idle CPU cannot have any preferred tasks.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
kernel/sched/fair.c | 15 +++++++++------
1 file changed, 9 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2c4ebfc82917..ec2a79c8d0e7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10340,7 +10340,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
bool *sg_overloaded,
bool *sg_overutilized)
{
- int i, nr_running, local_group;
+ int i, nr_running, local_group, sd_flags = env->sd->flags;
memset(sgs, 0, sizeof(*sgs));
@@ -10364,10 +10364,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
if (cpu_overutilized(i))
*sg_overutilized = 1;
-#ifdef CONFIG_NUMA_BALANCING
- sgs->nr_numa_running += rq->nr_numa_running;
- sgs->nr_preferred_running += rq->nr_preferred_running;
-#endif
/*
* No need to call idle_cpu() if nr_running is not 0
*/
@@ -10377,10 +10373,17 @@ static inline void update_sg_lb_stats(struct lb_env *env,
continue;
}
+#ifdef CONFIG_NUMA_BALANCING
+ /* Only fbq_classify_group() uses this to classify NUMA groups */
+ if (sd_flags & SD_NUMA) {
+ sgs->nr_numa_running += rq->nr_numa_running;
+ sgs->nr_preferred_running += rq->nr_preferred_running;
+ }
+#endif
if (local_group)
continue;
- if (env->sd->flags & SD_ASYM_CPUCAPACITY) {
+ if (sd_flags & SD_ASYM_CPUCAPACITY) {
/* Check for a misfit task on the cpu */
if (sgs->group_misfit_task_load < rq->misfit_task_load) {
sgs->group_misfit_task_load = rq->misfit_task_load;
--
2.34.1
* [PATCH 7/8] sched/fair: Do not compute overloaded status unnecessarily during lb
2024-12-11 18:55 [PATCH 0/8] x86, sched: Dynamic ITMT core ranking support and some yak shaving K Prateek Nayak
` (5 preceding siblings ...)
2024-12-11 18:55 ` [PATCH 6/8] sched/fair: Do not compute NUMA Balancing stats unnecessarily during lb K Prateek Nayak
@ 2024-12-11 18:55 ` K Prateek Nayak
2024-12-12 9:56 ` Vincent Guittot
2024-12-13 14:57 ` Shrikanth Hegde
2024-12-11 18:55 ` [RFC PATCH 8/8] sched/fair: Uncache asym_prefer_cpu and find it during update_sd_lb_stats() K Prateek Nayak
2024-12-13 0:33 ` [PATCH 0/8] x86, sched: Dynamic ITMT core ranking support and some yak shaving Tim Chen
8 siblings, 2 replies; 32+ messages in thread
From: K Prateek Nayak @ 2024-12-11 18:55 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Peter Zijlstra, Juri Lelli, Vincent Guittot, x86, linux-kernel
Cc: H. Peter Anvin, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Rafael J. Wysocki, Ricardo Neri,
Tim Chen, Mario Limonciello, Meng Li, Huang Rui,
Gautham R. Shenoy, K Prateek Nayak
Only set sg_overloaded when computing sg_lb_stats() at the highest sched
domain since rd->overloaded status is updated only when load balancing
at the highest domain. While at it, move setting of sg_overloaded below
idle_cpu() check since an idle CPU can never be overloaded.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
kernel/sched/fair.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ec2a79c8d0e7..3f36805ecdca 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10358,9 +10358,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
nr_running = rq->nr_running;
sgs->sum_nr_running += nr_running;
- if (nr_running > 1)
- *sg_overloaded = 1;
-
if (cpu_overutilized(i))
*sg_overutilized = 1;
@@ -10373,6 +10370,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
continue;
}
+ /* Overload indicator is only updated at root domain */
+ if (!env->sd->parent && nr_running > 1)
+ *sg_overloaded = 1;
+
#ifdef CONFIG_NUMA_BALANCING
/* Only fbq_classify_group() uses this to classify NUMA groups */
if (sd_flags & SD_NUMA) {
--
2.34.1
* [RFC PATCH 8/8] sched/fair: Uncache asym_prefer_cpu and find it during update_sd_lb_stats()
2024-12-11 18:55 [PATCH 0/8] x86, sched: Dynamic ITMT core ranking support and some yak shaving K Prateek Nayak
` (6 preceding siblings ...)
2024-12-11 18:55 ` [PATCH 7/8] sched/fair: Do not compute overloaded status " K Prateek Nayak
@ 2024-12-11 18:55 ` K Prateek Nayak
2024-12-13 15:02 ` Shrikanth Hegde
2024-12-13 0:33 ` [PATCH 0/8] x86, sched: Dynamic ITMT core ranking support and some yak shaving Tim Chen
8 siblings, 1 reply; 32+ messages in thread
From: K Prateek Nayak @ 2024-12-11 18:55 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Peter Zijlstra, Juri Lelli, Vincent Guittot, x86, linux-kernel
Cc: H. Peter Anvin, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Rafael J. Wysocki, Ricardo Neri,
Tim Chen, Mario Limonciello, Meng Li, Huang Rui,
Gautham R. Shenoy, K Prateek Nayak
On AMD processors supporting dynamic preferred core ranking, the
asym_prefer_cpu cached in sched_group can change dynamically. Since
asym_prefer_cpu is cached when the sched domain hierarchy is built,
updating the cached value across the system would require rebuilding
the sched domain which is prohibitively expensive.
All the asym_prefer_cpu comparisons in the load balancing path are
carried out only after the sched group stats have been updated by
iterating over all the CPUs in the group. Uncache the asym_prefer_cpu
and compute it while the sched group statistics are being updated as
part of sg_lb_stats.
Fixes: f3a052391822 ("cpufreq: amd-pstate: Enable amd-pstate preferred core support")
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
kernel/sched/fair.c | 21 +++++++++++++++++++--
kernel/sched/sched.h | 1 -
kernel/sched/topology.c | 15 +--------------
3 files changed, 20 insertions(+), 17 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3f36805ecdca..166b8e831064 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9911,6 +9911,8 @@ struct sg_lb_stats {
unsigned int sum_nr_running; /* Nr of all tasks running in the group */
unsigned int sum_h_nr_running; /* Nr of CFS tasks running in the group */
unsigned int idle_cpus; /* Nr of idle CPUs in the group */
+ unsigned int asym_prefer_cpu; /* CPU with highest asym priority */
+ int highest_asym_prio; /* Asym priority of asym_prefer_cpu */
unsigned int group_weight;
enum group_type group_type;
unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
@@ -10243,7 +10245,7 @@ sched_group_asym(struct lb_env *env, struct sg_lb_stats *sgs, struct sched_group
(sgs->group_weight - sgs->idle_cpus != 1))
return false;
- return sched_asym(env->sd, env->dst_cpu, group->asym_prefer_cpu);
+ return sched_asym(env->sd, env->dst_cpu, sgs->asym_prefer_cpu);
}
/* One group has more than one SMT CPU while the other group does not */
@@ -10324,6 +10326,17 @@ sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
return check_cpu_capacity(rq, sd);
}
+static inline void
+update_sg_pick_asym_prefer(struct sg_lb_stats *sgs, int cpu)
+{
+ int asym_prio = arch_asym_cpu_priority(cpu);
+
+ if (asym_prio > sgs->highest_asym_prio) {
+ sgs->asym_prefer_cpu = cpu;
+ sgs->highest_asym_prio = asym_prio;
+ }
+}
+
/**
* update_sg_lb_stats - Update sched_group's statistics for load balancing.
* @env: The load balancing environment.
@@ -10345,6 +10358,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
memset(sgs, 0, sizeof(*sgs));
local_group = group == sds->local;
+ sgs->highest_asym_prio = INT_MIN;
for_each_cpu_and(i, sched_group_span(group), env->cpus) {
struct rq *rq = cpu_rq(i);
@@ -10358,6 +10372,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
nr_running = rq->nr_running;
sgs->sum_nr_running += nr_running;
+ if (sd_flags & SD_ASYM_PACKING)
+ update_sg_pick_asym_prefer(sgs, i);
+
if (cpu_overutilized(i))
*sg_overutilized = 1;
@@ -10479,7 +10496,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
case group_asym_packing:
/* Prefer to move from lowest priority CPU's work */
- return sched_asym_prefer(sds->busiest->asym_prefer_cpu, sg->asym_prefer_cpu);
+ return sched_asym_prefer(busiest->asym_prefer_cpu, sgs->asym_prefer_cpu);
case group_misfit_task:
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index aef716c41edb..a3f0d326bd11 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2047,7 +2047,6 @@ struct sched_group {
unsigned int group_weight;
unsigned int cores;
struct sched_group_capacity *sgc;
- int asym_prefer_cpu; /* CPU of highest priority in group */
int flags;
/*
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9c405f0e7b26..20aa087710f0 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1302,7 +1302,7 @@ static void init_sched_groups_capacity(int cpu, struct sched_domain *sd)
WARN_ON(!sg);
do {
- int cpu, cores = 0, max_cpu = -1;
+ int cpu, cores = 0;
sg->group_weight = cpumask_weight(sched_group_span(sg));
@@ -1314,19 +1314,6 @@ static void init_sched_groups_capacity(int cpu, struct sched_domain *sd)
#endif
}
sg->cores = cores;
-
- if (!(sd->flags & SD_ASYM_PACKING))
- goto next;
-
- for_each_cpu(cpu, sched_group_span(sg)) {
- if (max_cpu < 0)
- max_cpu = cpu;
- else if (sched_asym_prefer(cpu, max_cpu))
- max_cpu = cpu;
- }
- sg->asym_prefer_cpu = max_cpu;
-
-next:
sg = sg->next;
} while (sg != sd->groups);
--
2.34.1
* Re: [PATCH 7/8] sched/fair: Do not compute overloaded status unnecessarily during lb
2024-12-11 18:55 ` [PATCH 7/8] sched/fair: Do not compute overloaded status " K Prateek Nayak
@ 2024-12-12 9:56 ` Vincent Guittot
2024-12-12 11:01 ` K Prateek Nayak
2024-12-13 14:57 ` Shrikanth Hegde
1 sibling, 1 reply; 32+ messages in thread
From: Vincent Guittot @ 2024-12-12 9:56 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Peter Zijlstra, Juri Lelli, x86, linux-kernel, H. Peter Anvin,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Rafael J. Wysocki, Ricardo Neri, Tim Chen,
Mario Limonciello, Meng Li, Huang Rui, Gautham R. Shenoy
On Wed, 11 Dec 2024 at 19:58, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Only set sg_overloaded when computing sg_lb_stats() at the highest sched
> domain since rd->overloaded status is updated only when load balancing
> at the highest domain. While at it, move setting of sg_overloaded below
> idle_cpu() check since an idle CPU can never be overloaded.
>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
> kernel/sched/fair.c | 7 ++++---
> 1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ec2a79c8d0e7..3f36805ecdca 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10358,9 +10358,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> nr_running = rq->nr_running;
> sgs->sum_nr_running += nr_running;
>
> - if (nr_running > 1)
> - *sg_overloaded = 1;
> -
> if (cpu_overutilized(i))
> *sg_overutilized = 1;
>
> @@ -10373,6 +10370,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> continue;
> }
>
> + /* Overload indicator is only updated at root domain */
> + if (!env->sd->parent && nr_running > 1)
nit: may be worth checking local variable 1st which should be cheaper
than env->sd->parent
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
> + *sg_overloaded = 1;
> +
> #ifdef CONFIG_NUMA_BALANCING
> /* Only fbq_classify_group() uses this to classify NUMA groups */
> if (sd_flags & SD_NUMA) {
> --
> 2.34.1
>
* Re: [PATCH 7/8] sched/fair: Do not compute overloaded status unnecessarily during lb
2024-12-12 9:56 ` Vincent Guittot
@ 2024-12-12 11:01 ` K Prateek Nayak
2024-12-12 11:18 ` Vincent Guittot
0 siblings, 1 reply; 32+ messages in thread
From: K Prateek Nayak @ 2024-12-12 11:01 UTC (permalink / raw)
To: Vincent Guittot
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Peter Zijlstra, Juri Lelli, x86, linux-kernel, H. Peter Anvin,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Rafael J. Wysocki, Ricardo Neri, Tim Chen,
Mario Limonciello, Meng Li, Huang Rui, Gautham R. Shenoy
Hello Vincent,
Thank you for reviewing the patch.
On 12/12/2024 3:26 PM, Vincent Guittot wrote:
> On Wed, 11 Dec 2024 at 19:58, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>>
>> Only set sg_overloaded when computing sg_lb_stats() at the highest sched
>> domain since rd->overloaded status is updated only when load balancing
>> at the highest domain. While at it, move setting of sg_overloaded below
>> idle_cpu() check since an idle CPU can never be overloaded.
>>
>> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
>> ---
>> kernel/sched/fair.c | 7 ++++---
>> 1 file changed, 4 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index ec2a79c8d0e7..3f36805ecdca 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -10358,9 +10358,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>> nr_running = rq->nr_running;
>> sgs->sum_nr_running += nr_running;
>>
>> - if (nr_running > 1)
>> - *sg_overloaded = 1;
>> -
>> if (cpu_overutilized(i))
>> *sg_overutilized = 1;
>>
>> @@ -10373,6 +10370,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>> continue;
>> }
>>
>> + /* Overload indicator is only updated at root domain */
>> + if (!env->sd->parent && nr_running > 1)
>
> nit: may be worth checking local variable 1st which should be cheaper
> than env->sd->parent
What are your thoughts on passing NULL for "sg_overloaded" when calling
update_sg_lb_stats() at non-root domains? Would it be equally cheap to
do:
if (sg_overloaded && nr_running > 1)
*sg_overloaded = 1;
>
> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Thank you.
>
>
>> + *sg_overloaded = 1;
>> +
>> #ifdef CONFIG_NUMA_BALANCING
>> /* Only fbq_classify_group() uses this to classify NUMA groups */
>> if (sd_flags & SD_NUMA) {
>> --
>> 2.34.1
>>
--
Thanks and Regards,
Prateek
* Re: [PATCH 6/8] sched/fair: Do not compute NUMA Balancing stats unnecessarily during lb
2024-12-11 18:55 ` [PATCH 6/8] sched/fair: Do not compute NUMA Balancing stats unnecessarily during lb K Prateek Nayak
@ 2024-12-12 11:05 ` Vincent Guittot
2024-12-12 11:43 ` K Prateek Nayak
2024-12-13 14:55 ` Shrikanth Hegde
1 sibling, 1 reply; 32+ messages in thread
From: Vincent Guittot @ 2024-12-12 11:05 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Peter Zijlstra, Juri Lelli, x86, linux-kernel, H. Peter Anvin,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Rafael J. Wysocki, Ricardo Neri, Tim Chen,
Mario Limonciello, Meng Li, Huang Rui, Gautham R. Shenoy
On Wed, 11 Dec 2024 at 19:58, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Aggregate nr_numa_running and nr_preferred_running when load balancing
> at NUMA domains only. While at it, also move the aggregation below the
> idle_cpu() check since an idle CPU cannot have any preferred tasks.
>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
> kernel/sched/fair.c | 15 +++++++++------
> 1 file changed, 9 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2c4ebfc82917..ec2a79c8d0e7 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10340,7 +10340,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> bool *sg_overloaded,
> bool *sg_overutilized)
> {
> - int i, nr_running, local_group;
> + int i, nr_running, local_group, sd_flags = env->sd->flags;
>
> memset(sgs, 0, sizeof(*sgs));
>
> @@ -10364,10 +10364,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> if (cpu_overutilized(i))
> *sg_overutilized = 1;
>
> -#ifdef CONFIG_NUMA_BALANCING
> - sgs->nr_numa_running += rq->nr_numa_running;
> - sgs->nr_preferred_running += rq->nr_preferred_running;
> -#endif
> /*
> * No need to call idle_cpu() if nr_running is not 0
> */
> @@ -10377,10 +10373,17 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> continue;
> }
>
> +#ifdef CONFIG_NUMA_BALANCING
> + /* Only fbq_classify_group() uses this to classify NUMA groups */
and fbq_classify_rq() which is also used by non-NUMA groups. AFAICT
It doesn't change anything at the end because group type is "all" for
non numa groups but we need some explanations why It's ok to skip numa
stats and default behavior will remain unchanged
> + if (sd_flags & SD_NUMA) {
> + sgs->nr_numa_running += rq->nr_numa_running;
> + sgs->nr_preferred_running += rq->nr_preferred_running;
> + }
> +#endif
> if (local_group)
> continue;
>
> - if (env->sd->flags & SD_ASYM_CPUCAPACITY) {
> + if (sd_flags & SD_ASYM_CPUCAPACITY) {
> /* Check for a misfit task on the cpu */
> if (sgs->group_misfit_task_load < rq->misfit_task_load) {
> sgs->group_misfit_task_load = rq->misfit_task_load;
> --
> 2.34.1
>
* Re: [PATCH 7/8] sched/fair: Do not compute overloaded status unnecessarily during lb
2024-12-12 11:01 ` K Prateek Nayak
@ 2024-12-12 11:18 ` Vincent Guittot
2024-12-12 11:30 ` K Prateek Nayak
0 siblings, 1 reply; 32+ messages in thread
From: Vincent Guittot @ 2024-12-12 11:18 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Peter Zijlstra, Juri Lelli, x86, linux-kernel, H. Peter Anvin,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Rafael J. Wysocki, Ricardo Neri, Tim Chen,
Mario Limonciello, Meng Li, Huang Rui, Gautham R. Shenoy
On Thu, 12 Dec 2024 at 12:01, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello Vincent,
>
> Thank you for reviewing the patch.
>
> On 12/12/2024 3:26 PM, Vincent Guittot wrote:
> > On Wed, 11 Dec 2024 at 19:58, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> >>
> >> Only set sg_overloaded when computing sg_lb_stats() at the highest sched
> >> domain since rd->overloaded status is updated only when load balancing
> >> at the highest domain. While at it, move setting of sg_overloaded below
> >> idle_cpu() check since an idle CPU can never be overloaded.
> >>
> >> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> >> ---
> >> kernel/sched/fair.c | 7 ++++---
> >> 1 file changed, 4 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index ec2a79c8d0e7..3f36805ecdca 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -10358,9 +10358,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> >> nr_running = rq->nr_running;
> >> sgs->sum_nr_running += nr_running;
> >>
> >> - if (nr_running > 1)
> >> - *sg_overloaded = 1;
> >> -
> >> if (cpu_overutilized(i))
> >> *sg_overutilized = 1;
> >>
> >> @@ -10373,6 +10370,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> >> continue;
> >> }
> >>
> >> + /* Overload indicator is only updated at root domain */
> >> + if (!env->sd->parent && nr_running > 1)
> >
> > nit: may be worth checking local variable 1st which should be cheaper
> > than env->sd->parent
>
> What are your thoughts on passing NULL for "sg_overloaded" when calling
> update_sg_lb_stats() at non-root domains? Would it be equally cheap to
> do:
>
> if (sg_overloaded && nr_running > 1)
> *sg_overloaded = 1;
you will have to test it twice as it is also set for misfit task
>
> >
> > Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
>
> Thank you.
>
> >
> >
> >> + *sg_overloaded = 1;
> >> +
> >> #ifdef CONFIG_NUMA_BALANCING
> >> /* Only fbq_classify_group() uses this to classify NUMA groups */
> >> if (sd_flags & SD_NUMA) {
> >> --
> >> 2.34.1
> >>
>
> --
> Thanks and Regards,
> Prateek
>
* Re: [PATCH 7/8] sched/fair: Do not compute overloaded status unnecessarily during lb
2024-12-12 11:18 ` Vincent Guittot
@ 2024-12-12 11:30 ` K Prateek Nayak
0 siblings, 0 replies; 32+ messages in thread
From: K Prateek Nayak @ 2024-12-12 11:30 UTC (permalink / raw)
To: Vincent Guittot
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Peter Zijlstra, Juri Lelli, x86, linux-kernel, H. Peter Anvin,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Rafael J. Wysocki, Ricardo Neri, Tim Chen,
Mario Limonciello, Meng Li, Huang Rui, Gautham R. Shenoy
Hello Vincent,
On 12/12/2024 4:48 PM, Vincent Guittot wrote:
> On Thu, 12 Dec 2024 at 12:01, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>>
>> Hello Vincent,
>>
>> Thank you for reviewing the patch.
>>
>> On 12/12/2024 3:26 PM, Vincent Guittot wrote:
>>> On Wed, 11 Dec 2024 at 19:58, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>>>>
>>>> Only set sg_overloaded when computing sg_lb_stats() at the highest sched
>>>> domain since rd->overloaded status is updated only when load balancing
>>>> at the highest domain. While at it, move setting of sg_overloaded below
>>>> idle_cpu() check since an idle CPU can never be overloaded.
>>>>
>>>> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
>>>> ---
>>>> kernel/sched/fair.c | 7 ++++---
>>>> 1 file changed, 4 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index ec2a79c8d0e7..3f36805ecdca 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -10358,9 +10358,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>>>> nr_running = rq->nr_running;
>>>> sgs->sum_nr_running += nr_running;
>>>>
>>>> - if (nr_running > 1)
>>>> - *sg_overloaded = 1;
>>>> -
>>>> if (cpu_overutilized(i))
>>>> *sg_overutilized = 1;
>>>>
>>>> @@ -10373,6 +10370,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>>>> continue;
>>>> }
>>>>
>>>> + /* Overload indicator is only updated at root domain */
>>>> + if (!env->sd->parent && nr_running > 1)
>>>
>>> nit: may be worth checking local variable 1st which should be cheaper
>>> than env->sd->parent
>>
>> What are your thoughts on passing NULL for "sg_overloaded" when calling
>> update_sg_lb_stats() at non-root domains? Would it be equally cheap to
>> do:
>>
>> if (sg_overloaded && nr_running > 1)
>> *sg_overloaded = 1;
>
> you will have to test it twice as it is also set for misfit task
Ah! True that. Local variable approach is indeed simpler.
--
Thanks and Regards,
Prateek
>
>>
>>>
>>> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
>>
>> Thank you.
>>
>>>
>>>
>>>> + *sg_overloaded = 1;
>>>> +
>>>> #ifdef CONFIG_NUMA_BALANCING
>>>> /* Only fbq_classify_group() uses this to classify NUMA groups */
>>>> if (sd_flags & SD_NUMA) {
>>>> --
>>>> 2.34.1
>>>>
>>
>> --
>> Thanks and Regards,
>> Prateek
>>
* Re: [PATCH 6/8] sched/fair: Do not compute NUMA Balancing stats unnecessarily during lb
2024-12-12 11:05 ` Vincent Guittot
@ 2024-12-12 11:43 ` K Prateek Nayak
2024-12-12 13:28 ` Vincent Guittot
0 siblings, 1 reply; 32+ messages in thread
From: K Prateek Nayak @ 2024-12-12 11:43 UTC (permalink / raw)
To: Vincent Guittot
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Peter Zijlstra, Juri Lelli, x86, linux-kernel, H. Peter Anvin,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Rafael J. Wysocki, Ricardo Neri, Tim Chen,
Mario Limonciello, Meng Li, Huang Rui, Gautham R. Shenoy
Hello Vincent,
On 12/12/2024 4:35 PM, Vincent Guittot wrote:
> On Wed, 11 Dec 2024 at 19:58, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>>
>> Aggregate nr_numa_running and nr_preferred_running when load balancing
>> at NUMA domains only. While at it, also move the aggregation below the
>> idle_cpu() check since an idle CPU cannot have any preferred tasks.
>>
>> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
>> ---
>> kernel/sched/fair.c | 15 +++++++++------
>> 1 file changed, 9 insertions(+), 6 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 2c4ebfc82917..ec2a79c8d0e7 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -10340,7 +10340,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>> bool *sg_overloaded,
>> bool *sg_overutilized)
>> {
>> - int i, nr_running, local_group;
>> + int i, nr_running, local_group, sd_flags = env->sd->flags;
>>
>> memset(sgs, 0, sizeof(*sgs));
>>
>> @@ -10364,10 +10364,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>> if (cpu_overutilized(i))
>> *sg_overutilized = 1;
>>
>> -#ifdef CONFIG_NUMA_BALANCING
>> - sgs->nr_numa_running += rq->nr_numa_running;
>> - sgs->nr_preferred_running += rq->nr_preferred_running;
>> -#endif
>> /*
>> * No need to call idle_cpu() if nr_running is not 0
>> */
>> @@ -10377,10 +10373,17 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>> continue;
>> }
>>
>> +#ifdef CONFIG_NUMA_BALANCING
>> + /* Only fbq_classify_group() uses this to classify NUMA groups */
>
> and fbq_classify_rq() which is also used by non-NUMA groups.
Yup but that just looks at rq's "nr_numa_running" and
"nr_preferred_running".
> AFAICT
> It doesn't change anything at the end because group type is "all" for
> non numa groups but we need some explanations why It's ok to skip numa
> stats and default behavior will remain unchanged
I'll elaborate that comment with complete details:
/*
* Only fbq_classify_group() uses these NUMA stats to derive the
* fbq_type of NUMA groups. By default, it is initialized to
* "all" - the highest type. sched_balance_find_src_rq() inhibits
* load balancing from runqueue whose fbq_type is found to be
* higher than the busiest group's fbq_type but since it is
* always initialized to the largest value, and remains same for
* non-NUMA groups, skip this aggregation when balancing at
* non-NUMA domains.
*/
--
Thanks and Regards,
Prateek
>
>> + if (sd_flags & SD_NUMA) {
>> + sgs->nr_numa_running += rq->nr_numa_running;
>> + sgs->nr_preferred_running += rq->nr_preferred_running;
>> + }
>> +#endif
>> if (local_group)
>> continue;
>>
>> - if (env->sd->flags & SD_ASYM_CPUCAPACITY) {
>> + if (sd_flags & SD_ASYM_CPUCAPACITY) {
>> /* Check for a misfit task on the cpu */
>> if (sgs->group_misfit_task_load < rq->misfit_task_load) {
>> sgs->group_misfit_task_load = rq->misfit_task_load;
>> --
>> 2.34.1
>>
* Re: [PATCH 6/8] sched/fair: Do not compute NUMA Balancing stats unnecessarily during lb
2024-12-12 11:43 ` K Prateek Nayak
@ 2024-12-12 13:28 ` Vincent Guittot
0 siblings, 0 replies; 32+ messages in thread
From: Vincent Guittot @ 2024-12-12 13:28 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
Peter Zijlstra, Juri Lelli, x86, linux-kernel, H. Peter Anvin,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Rafael J. Wysocki, Ricardo Neri, Tim Chen,
Mario Limonciello, Meng Li, Huang Rui, Gautham R. Shenoy
On Thu, 12 Dec 2024 at 12:44, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello Vincent,
>
> On 12/12/2024 4:35 PM, Vincent Guittot wrote:
> > On Wed, 11 Dec 2024 at 19:58, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> >>
> >> Aggregate nr_numa_running and nr_preferred_running when load balancing
> >> at NUMA domains only. While at it, also move the aggregation below the
> >> idle_cpu() check since an idle CPU cannot have any preferred tasks.
> >>
> >> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> >> ---
> >> kernel/sched/fair.c | 15 +++++++++------
> >> 1 file changed, 9 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 2c4ebfc82917..ec2a79c8d0e7 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -10340,7 +10340,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> >> bool *sg_overloaded,
> >> bool *sg_overutilized)
> >> {
> >> - int i, nr_running, local_group;
> >> + int i, nr_running, local_group, sd_flags = env->sd->flags;
> >>
> >> memset(sgs, 0, sizeof(*sgs));
> >>
> >> @@ -10364,10 +10364,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> >> if (cpu_overutilized(i))
> >> *sg_overutilized = 1;
> >>
> >> -#ifdef CONFIG_NUMA_BALANCING
> >> - sgs->nr_numa_running += rq->nr_numa_running;
> >> - sgs->nr_preferred_running += rq->nr_preferred_running;
> >> -#endif
> >> /*
> >> * No need to call idle_cpu() if nr_running is not 0
> >> */
> >> @@ -10377,10 +10373,17 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> >> continue;
> >> }
> >>
> >> +#ifdef CONFIG_NUMA_BALANCING
> >> + /* Only fbq_classify_group() uses this to classify NUMA groups */
> >
> > and fbq_classify_rq() which is also used by non-NUMA groups.
>
> Yup but that just looks at rq's "nr_numa_running" and
> "nr_preferred_running".
Ah yes, I misread and mixed rq and not sgs. You can forget my comment
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
>
> > AFAICT
> > It doesn't change anything at the end because group type is "all" for
> > non numa groups but we need some explanations why It's ok to skip numa
> > stats and default behavior will remain unchanged
>
> I'll elaborate that comment with complete details:
>
> /*
> * Only fbq_classify_group() uses these NUMA stats to derive the
> * fbq_type of NUMA groups. By default, it is initialized to
> * "all" - the highest type. sched_balance_find_src_rq() inhibits
> * load balancing from runqueue whose fbq_type is found to be
> * higher than the busiest group's fbq_type but since it is
> * always initialized to the largest value, and remains same for
> * non-NUMA groups, skip this aggregation when balancing at
> * non-NUMA domains.
> */
>
> --
> Thanks and Regards,
> Prateek
>
> >
> >> + if (sd_flags & SD_NUMA) {
> >> + sgs->nr_numa_running += rq->nr_numa_running;
> >> + sgs->nr_preferred_running += rq->nr_preferred_running;
> >> + }
> >> +#endif
> >> if (local_group)
> >> continue;
> >>
> >> - if (env->sd->flags & SD_ASYM_CPUCAPACITY) {
> >> + if (sd_flags & SD_ASYM_CPUCAPACITY) {
> >> /* Check for a misfit task on the cpu */
> >> if (sgs->group_misfit_task_load < rq->misfit_task_load) {
> >> sgs->group_misfit_task_load = rq->misfit_task_load;
> >> --
> >> 2.34.1
> >>
>
>
* Re: [PATCH 1/8] x86/itmt: Convert "sysctl_sched_itmt_enabled" to boolean
2024-12-11 18:55 ` [PATCH 1/8] x86/itmt: Convert "sysctl_sched_itmt_enabled" to boolean K Prateek Nayak
@ 2024-12-12 18:09 ` Tim Chen
0 siblings, 0 replies; 32+ messages in thread
From: Tim Chen @ 2024-12-12 18:09 UTC (permalink / raw)
To: K Prateek Nayak, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Peter Zijlstra, Juri Lelli, Vincent Guittot, x86,
linux-kernel
Cc: H. Peter Anvin, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Rafael J. Wysocki, Ricardo Neri,
Mario Limonciello, Meng Li, Huang Rui, Gautham R. Shenoy
On Wed, 2024-12-11 at 18:55 +0000, K Prateek Nayak wrote:
> In preparation to move "sysctl_sched_itmt_enabled" to debugfs, convert
> the unsigned int to bool since debugfs readily exposes boolean fops
> primitives (debugfs_read_file_bool, debugfs_write_file_bool) which can
> streamline the conversion.
>
> Since the current ctl_table initializes extra1 and extra2 to SYSCTL_ZERO
> and SYSCTL_ONE respectively, the value of "sysctl_sched_itmt_enabled"
> can only be 0 or 1 and this datatype conversion should not cause any
> functional changes.
>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
> arch/x86/include/asm/topology.h | 4 ++--
> arch/x86/kernel/itmt.c | 2 +-
> 2 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
> index fd41103ad342..63bab25a4896 100644
> --- a/arch/x86/include/asm/topology.h
> +++ b/arch/x86/include/asm/topology.h
> @@ -250,7 +250,7 @@ extern bool x86_topology_update;
> #include <asm/percpu.h>
>
> DECLARE_PER_CPU_READ_MOSTLY(int, sched_core_priority);
> -extern unsigned int __read_mostly sysctl_sched_itmt_enabled;
> +extern bool __read_mostly sysctl_sched_itmt_enabled;
>
> /* Interface to set priority of a cpu */
> void sched_set_itmt_core_prio(int prio, int core_cpu);
> @@ -263,7 +263,7 @@ void sched_clear_itmt_support(void);
>
> #else /* CONFIG_SCHED_MC_PRIO */
>
> -#define sysctl_sched_itmt_enabled 0
> +#define sysctl_sched_itmt_enabled false
> static inline void sched_set_itmt_core_prio(int prio, int core_cpu)
> {
> }
> diff --git a/arch/x86/kernel/itmt.c b/arch/x86/kernel/itmt.c
> index 51b805c727fc..28f449123d68 100644
> --- a/arch/x86/kernel/itmt.c
> +++ b/arch/x86/kernel/itmt.c
> @@ -36,7 +36,7 @@ static bool __read_mostly sched_itmt_capable;
> *
> * It can be set via /proc/sys/kernel/sched_itmt_enabled
> */
> -unsigned int __read_mostly sysctl_sched_itmt_enabled;
> +bool __read_mostly sysctl_sched_itmt_enabled;
>
> static int sched_itmt_update_handler(const struct ctl_table *table, int write,
> void *buffer, size_t *lenp, loff_t *ppos)
* Re: [PATCH 2/8] x86/itmt: Use guard() for itmt_update_mutex
2024-12-11 18:55 ` [PATCH 2/8] x86/itmt: Use guard() for itmt_update_mutex K Prateek Nayak
@ 2024-12-12 18:22 ` Tim Chen
0 siblings, 0 replies; 32+ messages in thread
From: Tim Chen @ 2024-12-12 18:22 UTC (permalink / raw)
To: K Prateek Nayak, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Peter Zijlstra, Juri Lelli, Vincent Guittot, x86,
linux-kernel
Cc: H. Peter Anvin, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Rafael J. Wysocki, Ricardo Neri,
Mario Limonciello, Meng Li, Huang Rui, Gautham R. Shenoy
On Wed, 2024-12-11 at 18:55 +0000, K Prateek Nayak wrote:
> Use guard() for itmt_update_mutex, which avoids the extra mutex_unlock()
> in the bailout and return paths.
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Tim
>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
> arch/x86/kernel/itmt.c | 29 ++++++++---------------------
> 1 file changed, 8 insertions(+), 21 deletions(-)
>
> diff --git a/arch/x86/kernel/itmt.c b/arch/x86/kernel/itmt.c
> index 28f449123d68..ee43d1bd41d0 100644
> --- a/arch/x86/kernel/itmt.c
> +++ b/arch/x86/kernel/itmt.c
> @@ -44,12 +44,10 @@ static int sched_itmt_update_handler(const struct ctl_table *table, int write,
> unsigned int old_sysctl;
> int ret;
>
> - mutex_lock(&itmt_update_mutex);
> + guard(mutex)(&itmt_update_mutex);
>
> - if (!sched_itmt_capable) {
> - mutex_unlock(&itmt_update_mutex);
> + if (!sched_itmt_capable)
> return -EINVAL;
> - }
>
> old_sysctl = sysctl_sched_itmt_enabled;
> ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
> @@ -59,8 +57,6 @@ static int sched_itmt_update_handler(const struct ctl_table *table, int write,
> rebuild_sched_domains();
> }
>
> - mutex_unlock(&itmt_update_mutex);
> -
> return ret;
> }
>
> @@ -97,18 +93,14 @@ static struct ctl_table_header *itmt_sysctl_header;
> */
> int sched_set_itmt_support(void)
> {
> - mutex_lock(&itmt_update_mutex);
> + guard(mutex)(&itmt_update_mutex);
>
> - if (sched_itmt_capable) {
> - mutex_unlock(&itmt_update_mutex);
> + if (sched_itmt_capable)
> return 0;
> - }
>
> itmt_sysctl_header = register_sysctl("kernel", itmt_kern_table);
> - if (!itmt_sysctl_header) {
> - mutex_unlock(&itmt_update_mutex);
> + if (!itmt_sysctl_header)
> return -ENOMEM;
> - }
>
> sched_itmt_capable = true;
>
> @@ -117,8 +109,6 @@ int sched_set_itmt_support(void)
> x86_topology_update = true;
> rebuild_sched_domains();
>
> - mutex_unlock(&itmt_update_mutex);
> -
> return 0;
> }
>
> @@ -134,12 +124,11 @@ int sched_set_itmt_support(void)
> */
> void sched_clear_itmt_support(void)
> {
> - mutex_lock(&itmt_update_mutex);
> + guard(mutex)(&itmt_update_mutex);
>
> - if (!sched_itmt_capable) {
> - mutex_unlock(&itmt_update_mutex);
> + if (!sched_itmt_capable)
> return;
> - }
> +
> sched_itmt_capable = false;
>
> if (itmt_sysctl_header) {
> @@ -153,8 +142,6 @@ void sched_clear_itmt_support(void)
> x86_topology_update = true;
> rebuild_sched_domains();
> }
> -
> - mutex_unlock(&itmt_update_mutex);
> }
>
> int arch_asym_cpu_priority(int cpu)
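For readers unfamiliar with guard(): it comes from include/linux/cleanup.h and relies on compiler-assisted scope-based cleanup, so the mutex is dropped automatically when the guard variable goes out of scope, covering every early return above. A rough stand-alone illustration of the idea using the underlying cleanup attribute (hypothetical names, not the kernel's actual macros):
struct mutex_guard { struct mutex *lock; };
static inline struct mutex_guard mutex_guard_acquire(struct mutex *lock)
{
	mutex_lock(lock);
	return (struct mutex_guard){ .lock = lock };
}
static inline void mutex_guard_release(struct mutex_guard *g)
{
	mutex_unlock(g->lock);
}
/* Declares a local whose cleanup handler unlocks the mutex at scope exit */
#define MUTEX_GUARD(name, m)						\
	struct mutex_guard name						\
	__attribute__((cleanup(mutex_guard_release))) = mutex_guard_acquire(m)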
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 3/8] x86/itmt: Move the "sched_itmt_enabled" sysctl to debugfs
2024-12-11 18:55 ` [PATCH 3/8] x86/itmt: Move the "sched_itmt_enabled" sysctl to debugfs K Prateek Nayak
@ 2024-12-12 19:15 ` Tim Chen
2024-12-13 4:01 ` K Prateek Nayak
2024-12-13 9:16 ` Peter Zijlstra
0 siblings, 2 replies; 32+ messages in thread
From: Tim Chen @ 2024-12-12 19:15 UTC (permalink / raw)
To: K Prateek Nayak, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Peter Zijlstra, Juri Lelli, Vincent Guittot, x86,
linux-kernel
Cc: H. Peter Anvin, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Rafael J. Wysocki, Ricardo Neri,
Mario Limonciello, Meng Li, Huang Rui, Gautham R. Shenoy
On Wed, 2024-12-11 at 18:55 +0000, K Prateek Nayak wrote:
> "sched_itmt_enabled" was only introduced as a debug toggle for any funky
> ITMT behavior. Move the sysctl control from
> "/proc/sys/kernel/sched_itmt_enabled" to debugfs at
> "/sys/kernel/debug/x86/sched_itmt_enabled" with a notable change that a
> cat on the file will return "Y" or "N" instead of "1" or "0" to
> indicate that the feature is enabled or disabled respectively.
>
Valid values of setting "sched_itmt_enabled" likewise change from "1" or "0"
to "Y" or "N".
> Since ITMT is x86 specific (and PowerPC uses SD_ASYM_PACKING too), the
> toggle was moved to "/sys/kernel/debug/x86/" as opposed to
> "/sys/kernel/debug/sched/"
>
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Tim
> ---
> arch/x86/kernel/itmt.c | 56 ++++++++++++++++++++----------------------
> 1 file changed, 27 insertions(+), 29 deletions(-)
>
> diff --git a/arch/x86/kernel/itmt.c b/arch/x86/kernel/itmt.c
> index ee43d1bd41d0..9cea1fc36c18 100644
> --- a/arch/x86/kernel/itmt.c
> +++ b/arch/x86/kernel/itmt.c
> @@ -19,6 +19,7 @@
> #include <linux/sched.h>
> #include <linux/cpumask.h>
> #include <linux/cpuset.h>
> +#include <linux/debugfs.h>
> #include <linux/mutex.h>
> #include <linux/sysctl.h>
> #include <linux/nodemask.h>
> @@ -34,45 +35,38 @@ static bool __read_mostly sched_itmt_capable;
> * of higher turbo frequency for cpus supporting Intel Turbo Boost Max
> * Technology 3.0.
> *
> - * It can be set via /proc/sys/kernel/sched_itmt_enabled
> + * It can be set via /sys/kernel/debug/x86/sched_itmt_enabled
> */
> bool __read_mostly sysctl_sched_itmt_enabled;
>
> -static int sched_itmt_update_handler(const struct ctl_table *table, int write,
> - void *buffer, size_t *lenp, loff_t *ppos)
> +static ssize_t sched_itmt_enabled_write(struct file *filp,
> + const char __user *ubuf,
> + size_t cnt, loff_t *ppos)
> {
> - unsigned int old_sysctl;
> - int ret;
> + ssize_t result;
> + bool orig;
>
> guard(mutex)(&itmt_update_mutex);
>
> - if (!sched_itmt_capable)
> - return -EINVAL;
> -
> - old_sysctl = sysctl_sched_itmt_enabled;
> - ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
> + orig = sysctl_sched_itmt_enabled;
> + result = debugfs_write_file_bool(filp, ubuf, cnt, ppos);
>
> - if (!ret && write && old_sysctl != sysctl_sched_itmt_enabled) {
> + if (sysctl_sched_itmt_enabled != orig) {
> x86_topology_update = true;
> rebuild_sched_domains();
> }
>
> - return ret;
> + return result;
> }
>
> -static struct ctl_table itmt_kern_table[] = {
> - {
> - .procname = "sched_itmt_enabled",
> - .data = &sysctl_sched_itmt_enabled,
> - .maxlen = sizeof(unsigned int),
> - .mode = 0644,
> - .proc_handler = sched_itmt_update_handler,
> - .extra1 = SYSCTL_ZERO,
> - .extra2 = SYSCTL_ONE,
> - },
> +static const struct file_operations dfs_sched_itmt_fops = {
> + .read = debugfs_read_file_bool,
> + .write = sched_itmt_enabled_write,
> + .open = simple_open,
> + .llseek = default_llseek,
> };
>
> -static struct ctl_table_header *itmt_sysctl_header;
> +static struct dentry *dfs_sched_itmt;
>
> /**
> * sched_set_itmt_support() - Indicate platform supports ITMT
> @@ -98,9 +92,15 @@ int sched_set_itmt_support(void)
> if (sched_itmt_capable)
> return 0;
>
> - itmt_sysctl_header = register_sysctl("kernel", itmt_kern_table);
> - if (!itmt_sysctl_header)
> + dfs_sched_itmt = debugfs_create_file_unsafe("sched_itmt_enabled",
> + 0644,
> + arch_debugfs_dir,
> + &sysctl_sched_itmt_enabled,
> + &dfs_sched_itmt_fops);
> + if (IS_ERR_OR_NULL(dfs_sched_itmt)) {
> + dfs_sched_itmt = NULL;
> return -ENOMEM;
> + }
>
> sched_itmt_capable = true;
>
> @@ -131,10 +131,8 @@ void sched_clear_itmt_support(void)
>
> sched_itmt_capable = false;
>
> - if (itmt_sysctl_header) {
> - unregister_sysctl_table(itmt_sysctl_header);
> - itmt_sysctl_header = NULL;
> - }
> + debugfs_remove(dfs_sched_itmt);
> + dfs_sched_itmt = NULL;
>
> if (sysctl_sched_itmt_enabled) {
> /* disable sched_itmt if we are no longer ITMT capable */
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 4/8] x86/topology: Remove x86_smt_flags and use cpu_smt_flags directly
2024-12-11 18:55 ` [PATCH 4/8] x86/topology: Remove x86_smt_flags and use cpu_smt_flags directly K Prateek Nayak
@ 2024-12-12 21:05 ` Tim Chen
0 siblings, 0 replies; 32+ messages in thread
From: Tim Chen @ 2024-12-12 21:05 UTC (permalink / raw)
To: K Prateek Nayak, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Peter Zijlstra, Juri Lelli, Vincent Guittot, x86,
linux-kernel
Cc: H. Peter Anvin, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Rafael J. Wysocki, Ricardo Neri,
Mario Limonciello, Meng Li, Huang Rui, Gautham R. Shenoy
On Wed, 2024-12-11 at 18:55 +0000, K Prateek Nayak wrote:
> x86_*_flags() wrappers were introduced with commit d3d37d850d1d
> ("x86/sched: Add SD_ASYM_PACKING flags to x86 ITMT CPU") to add
> x86_sched_itmt_flags() in addition to the default domain flags for the SMT
> and MC domains.
>
> commit 995998ebdebd ("x86/sched: Remove SD_ASYM_PACKING from the
> SMT domain flags") removed the ITMT flags for the SMT domain but not the
> x86_smt_flags() wrapper, which directly returns cpu_smt_flags().
>
> Remove x86_smt_flags() and directly use cpu_smt_flags() to derive the
> flags for the SMT domain. No functional changes intended.
>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Tim
> ---
> arch/x86/kernel/smpboot.c | 8 +-------
> 1 file changed, 1 insertion(+), 7 deletions(-)
>
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index b5a8f0891135..6e300897b7ee 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -482,12 +482,6 @@ static int x86_core_flags(void)
> return cpu_core_flags() | x86_sched_itmt_flags();
> }
> #endif
> -#ifdef CONFIG_SCHED_SMT
> -static int x86_smt_flags(void)
> -{
> - return cpu_smt_flags();
> -}
> -#endif
> #ifdef CONFIG_SCHED_CLUSTER
> static int x86_cluster_flags(void)
> {
> @@ -519,7 +513,7 @@ static void __init build_sched_topology(void)
>
> #ifdef CONFIG_SCHED_SMT
> x86_topology[i++] = (struct sched_domain_topology_level){
> - cpu_smt_mask, x86_smt_flags, SD_INIT_NAME(SMT)
> + cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT)
> };
> #endif
> #ifdef CONFIG_SCHED_CLUSTER
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 0/8] x86, sched: Dynamic ITMT core ranking support and some yak shaving
2024-12-11 18:55 [PATCH 0/8] x86, sched: Dynamic ITMT core ranking support and some yak shaving K Prateek Nayak
` (7 preceding siblings ...)
2024-12-11 18:55 ` [RFC PATCH 8/8] sched/fair: Uncache asym_prefer_cpu and find it during update_sd_lb_stats() K Prateek Nayak
@ 2024-12-13 0:33 ` Tim Chen
2024-12-13 4:12 ` K Prateek Nayak
8 siblings, 1 reply; 32+ messages in thread
From: Tim Chen @ 2024-12-13 0:33 UTC (permalink / raw)
To: K Prateek Nayak, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Peter Zijlstra, Juri Lelli, Vincent Guittot, x86,
linux-kernel
Cc: H. Peter Anvin, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Rafael J. Wysocki, Ricardo Neri,
Mario Limonciello, Meng Li, Huang Rui, Gautham R. Shenoy,
Srinivas Pandruvada
On Wed, 2024-12-11 at 18:55 +0000, K Prateek Nayak wrote:
> The ITMT infrastructure currently assumes ITMT rankings to be static and
> is set correctly prior to enabling ITMT support which allows the CPU
> with the highest core ranking to be cached as the "asym_prefer_cpu" in
> the sched_group struct. However, with the introduction of Preferred Core
> support in amd-pstate, these rankings can change at runtime.
>
> This series adds support for dynamic ranking in generic scheduler layer
> without the need to rebuild the sched domain hierarchy and fixes an
> issue with x86_die_flags() on AMD systems that support Preferred Core
> ranking with some yak shaving done along the way.
>
> Patch 1 to 4 are independent cleanup around ITMT infrastructure, removal
> of x86_smt_flags wrapper, and moving the "sched_itmt_enabled" sysctl to
> debugfs.
>
> Patch 5 adds the SD_ASYM_PACKING flag to the PKG domain on all ITMT
> enabled systems. The rationale behind the addition is elaborated in the
> same. One open question that remains is for Intel processors with multiple
> Tiles in a PKG which advertises itself as multiple LLCs in a PKG and
> supports ITMT - is it okay to set SD_ASYM_PACKING for PKG domain on
> these processors?
After talking to my colleagues Ricardo and Srinivas, we think that this
should be fine for Intel CPUs.
Tim
>
> Patch 6 and 7 are independent possible micro-optimizations discovered
> when auditing update_sg_lb_stats()
>
> Patch 8 uncaches the asym_prefer_cpu from the sched_group struct and
> finds it during load balancing in update_sg_lb_stats() before it is used
> to make any scheduling decisions. This is the simplest approach; an
> alternate approach would be to move the asym_prefer_cpu to
> sched_domain_shared and allow the first load balancing instance post a
> priority change to update the cached asym_prefer_cpu. On systems with
> static priorities, this would allow benefits of caching while on systems
> with dynamic priorities, it'll reduce the overhead of finding
> "asym_prefer_cpu" each time update_sg_lb_stats() is called however the
> benefits come with added code complexity which is why Patch 8 is marked
> as an RFC.
>
> This series is based on
>
> git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core
>
> at commit 2a77e4be12cb ("sched/fair: Untangle NEXT_BUDDY and
> pick_next_task()") and is a spiritual successor to a previous attempt
> at fixing the x86_die_flags() on Preferred Core enabled system by Mario
> that can be found at
> https://lore.kernel.org/lkml/20241203201129.31957-1-mario.limonciello@amd.com/
>
> ---
> K Prateek Nayak (8):
> x86/itmt: Convert "sysctl_sched_itmt_enabled" to boolean
> x86/itmt: Use guard() for itmt_update_mutex
> x86/itmt: Move the "sched_itmt_enabled" sysctl to debugfs
> x86/topology: Remove x86_smt_flags and use cpu_smt_flags directly
> x86/topology: Use x86_sched_itmt_flags for PKG domain unconditionally
> sched/fair: Do not compute NUMA Balancing stats unnecessarily during
> lb
> sched/fair: Do not compute overloaded status unnecessarily during lb
> sched/fair: Uncache asym_prefer_cpu and find it during
> update_sd_lb_stats()
>
> arch/x86/include/asm/topology.h | 4 +-
> arch/x86/kernel/itmt.c | 81 ++++++++++++++-------------------
> arch/x86/kernel/smpboot.c | 19 +-------
> kernel/sched/fair.c | 41 +++++++++++++----
> kernel/sched/sched.h | 1 -
> kernel/sched/topology.c | 15 +-----
> 6 files changed, 69 insertions(+), 92 deletions(-)
>
>
> base-commit: 2a77e4be12cb58bbf774e7c717c8bb80e128b7a4
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 3/8] x86/itmt: Move the "sched_itmt_enabled" sysctl to debugfs
2024-12-12 19:15 ` Tim Chen
@ 2024-12-13 4:01 ` K Prateek Nayak
2024-12-13 17:19 ` Tim Chen
2024-12-13 9:16 ` Peter Zijlstra
1 sibling, 1 reply; 32+ messages in thread
From: K Prateek Nayak @ 2024-12-13 4:01 UTC (permalink / raw)
To: Tim Chen, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Peter Zijlstra, Juri Lelli, Vincent Guittot, x86,
linux-kernel
Cc: H. Peter Anvin, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Rafael J. Wysocki, Ricardo Neri,
Mario Limonciello, Meng Li, Huang Rui, Gautham R. Shenoy
Hello Tim,
Thank you for reviewing the series.
On 12/13/2024 12:45 AM, Tim Chen wrote:
> On Wed, 2024-12-11 at 18:55 +0000, K Prateek Nayak wrote:
>> "sched_itmt_enabled" was only introduced as a debug toggle for any funky
>> ITMT behavior. Move the sysctl control from
>> "/proc/sys/kernel/sched_itmt_enabled" to debugfs at
>> "/sys/kernel/debug/x86/sched_itmt_enabled" with a notable change that a
>> cat on the file will return "Y" or "N" instead of "1" or "0" to
>> indicate that the feature is enabled or disabled respectively.
>>
>
> Valid values of setting "sched_itmt_enabled" likewise change from "1" or "0"
> to "Y" or "N".
Turns out you can still use "1" and "0". Running:
echo Y > /sys/kernel/debug/sched/verbose
echo "sched_itmt_enabled: $(cat /sys/kernel/debug/x86/sched_itmt_enabled)";
for i in 0 1 N Y;
do
echo "Writing $i to /sys/kernel/debug/x86/sched_itmt_enabled";
echo $i > /sys/kernel/debug/x86/sched_itmt_enabled;
echo "sched_itmt_enabled: $(cat /sys/kernel/debug/x86/sched_itmt_enabled)";
echo "sched domain flags:";
cat /sys/kernel/debug/sched/domains/cpu0/domain*/flags;
echo;
done
Yields the following output on my system:
sched_itmt_enabled: Y
Writing 0 to /sys/kernel/debug/x86/sched_itmt_enabled
sched_itmt_enabled: N
sched domain flags:
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_LLC SD_PREFER_SIBLING
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_LLC SD_PREFER_SIBLING
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_PREFER_SIBLING
Writing 1 to /sys/kernel/debug/x86/sched_itmt_enabled
sched_itmt_enabled: Y
sched domain flags:
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_LLC SD_PREFER_SIBLING
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_LLC SD_ASYM_PACKING SD_PREFER_SIBLING
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_ASYM_PACKING SD_PREFER_SIBLING
Writing N to /sys/kernel/debug/x86/sched_itmt_enabled
sched_itmt_enabled: N
sched domain flags:
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_LLC SD_PREFER_SIBLING
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_LLC SD_PREFER_SIBLING
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_PREFER_SIBLING
Writing Y to /sys/kernel/debug/x86/sched_itmt_enabled
sched_itmt_enabled: Y
sched domain flags:
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_LLC SD_PREFER_SIBLING
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_LLC SD_ASYM_PACKING SD_PREFER_SIBLING
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_ASYM_PACKING SD_PREFER_SIBLING
Would you like me to extend that note as:
... with a notable change that a
cat on the file will return "Y" or "N" instead of "1" or "0" to
indicate that the feature is enabled or disabled respectively. Users can
either write "0" or "N" to toggle the feature off when enabled, or
"1" or "Y" to toggle the feature on when disabled.
for the record?
>
>> Since ITMT is x86 specific (and PowerPC uses SD_ASYM_PACKING too), the
>> toggle was moved to "/sys/kernel/debug/x86/" as opposed to
>> "/sys/kernel/debug/sched/"
>>
>> Suggested-by: Peter Zijlstra <peterz@infradead.org>
>> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
>
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Thank you.
--
Thanks and Regards,
Prateek
>
> Tim
>
> [..snip..]
>
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 0/8] x86, sched: Dynamic ITMT core ranking support and some yak shaving
2024-12-13 0:33 ` [PATCH 0/8] x86, sched: Dynamic ITMT core ranking support and some yak shaving Tim Chen
@ 2024-12-13 4:12 ` K Prateek Nayak
2024-12-13 21:11 ` Tim Chen
0 siblings, 1 reply; 32+ messages in thread
From: K Prateek Nayak @ 2024-12-13 4:12 UTC (permalink / raw)
To: Tim Chen, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Peter Zijlstra, Juri Lelli, Vincent Guittot, x86,
linux-kernel
Cc: H. Peter Anvin, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Rafael J. Wysocki, Ricardo Neri,
Mario Limonciello, Meng Li, Huang Rui, Gautham R. Shenoy,
Srinivas Pandruvada
Hello Tim,
On 12/13/2024 6:03 AM, Tim Chen wrote:
> On Wed, 2024-12-11 at 18:55 +0000, K Prateek Nayak wrote:
>> The ITMT infrastructure currently assumes ITMT rankings to be static and
>> is set correctly prior to enabling ITMT support which allows the CPU
>> with the highest core ranking to be cached as the "asym_prefer_cpu" in
>> the sched_group struct. However, with the introduction of Preferred Core
>> support in amd-pstate, these rankings can change at runtime.
>>
>> This series adds support for dynamic ranking in generic scheduler layer
>> without the need to rebuild the sched domain hierarchy and fixes an
>> issue with x86_die_flags() on AMD systems that support Preferred Core
>> ranking with some yak shaving done along the way.
>>
>> Patch 1 to 4 are independent cleanup around ITMT infrastructure, removal
>> of x86_smt_flags wrapper, and moving the "sched_itmt_enabled" sysctl to
>> debugfs.
>>
>> Patch 5 adds the SD_ASYM_PACKING flag to the PKG domain on all ITMT
>> enabled systems. The rationale behind the addition is elaborated in the
>> same. One open question that remains is for Intel processors with multiple
>> Tiles in a PKG which advertises itself as multiple LLCs in a PKG and
>> supports ITMT - is it okay to set SD_ASYM_PACKING for PKG domain on
>> these processors?
>
> After talking to my colleagues Ricardo and Srinivas, we think that this
> should be fine for Intel CPUs.
Thank you for confirming that. Could you also confirm if my observations
for Intel systems on Patch 5 covered all possible scenarios for the ones
that feature multiple MC groups within a PKG and enable ITMT support? If
I'm missing something, please do let me know and we can hash out the
implementation details.
Thanks a ton for reviewing the series!
--
Thanks and Regards,
Prateek
>
> Tim
>
>>
>> Patch 6 and 7 are independent possible micro-optimizations discovered
>> when auditing update_sg_lb_stats()
>>
>> Patch 8 uncaches the asym_prefer_cpu from the sched_group struct and
>> finds it during load balancing in update_sg_lb_stats() before it is used
>> to make any scheduling decisions. This is the simplest approach; an
>> alternate approach would be to move the asym_prefer_cpu to
>> sched_domain_shared and allow the first load balancing instance post a
>> priority change to update the cached asym_prefer_cpu. On systems with
>> static priorities, this would allow benefits of caching while on systems
>> with dynamic priorities, it'll reduce the overhead of finding
>> "asym_prefer_cpu" each time update_sg_lb_stats() is called however the
>> benefits come with added code complexity which is why Patch 8 is marked
>> as an RFC.
>
> [..snip..]
>
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 3/8] x86/itmt: Move the "sched_itmt_enabled" sysctl to debugfs
2024-12-12 19:15 ` Tim Chen
2024-12-13 4:01 ` K Prateek Nayak
@ 2024-12-13 9:16 ` Peter Zijlstra
1 sibling, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2024-12-13 9:16 UTC (permalink / raw)
To: Tim Chen
Cc: K Prateek Nayak, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Juri Lelli, Vincent Guittot, x86, linux-kernel,
H. Peter Anvin, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Rafael J. Wysocki, Ricardo Neri,
Mario Limonciello, Meng Li, Huang Rui, Gautham R. Shenoy
On Thu, Dec 12, 2024 at 11:15:43AM -0800, Tim Chen wrote:
> On Wed, 2024-12-11 at 18:55 +0000, K Prateek Nayak wrote:
> > "sched_itmt_enabled" was only introduced as a debug toggle for any funky
> > ITMT behavior. Move the sysctl control from
> > "/proc/sys/kernel/sched_itmt_enabled" to debugfs at
> > "/sys/kernel/debug/x86/sched_itmt_enabled" with a notable change that a
> > cat on the file will return "Y" or "N" instead of "1" or "0" to
> > indicate that the feature is enabled or disabled respectively.
> >
>
> Valid values of setting "sched_itmt_enabled" likewise change from "1" or "0"
> to "Y" or "N".
Nope, kstrtobool() accepts a ton of options, very much including 1,0.
It's just the print side that has to pick one and went with the Y,N
thing.
Look at the function and marvel how you can even write: oNcology,oFficious
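For reference, the parser only inspects the first character (and the second one for "on"/"off") and ignores whatever follows, which is why both 0/1 and strings like the above are accepted; paraphrased and simplified from lib/kstrtox.c:
int kstrtobool(const char *s, bool *res)
{
	if (!s)
		return -EINVAL;
	switch (s[0]) {
	case 'y': case 'Y': case '1':
		*res = true;
		return 0;
	case 'n': case 'N': case '0':
		*res = false;
		return 0;
	case 'o': case 'O':
		switch (s[1]) {
		case 'n': case 'N':	/* "oNcology" lands here */
			*res = true;
			return 0;
		case 'f': case 'F':	/* "oFficious" lands here */
			*res = false;
			return 0;
		}
		break;
	}
	return -EINVAL;
}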
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 6/8] sched/fair: Do not compute NUMA Balancing stats unnecessarily during lb
2024-12-11 18:55 ` [PATCH 6/8] sched/fair: Do not compute NUMA Balancing stats unnecessarily during lb K Prateek Nayak
2024-12-12 11:05 ` Vincent Guittot
@ 2024-12-13 14:55 ` Shrikanth Hegde
1 sibling, 0 replies; 32+ messages in thread
From: Shrikanth Hegde @ 2024-12-13 14:55 UTC (permalink / raw)
To: K Prateek Nayak
Cc: H. Peter Anvin, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Rafael J. Wysocki, Ricardo Neri,
Tim Chen, Mario Limonciello, Meng Li, Huang Rui,
Gautham R. Shenoy, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Peter Zijlstra, Juri Lelli, Vincent Guittot, x86,
linux-kernel
On 12/12/24 00:25, K Prateek Nayak wrote:
> Aggregate nr_numa_running and nr_preferred_running when load balancing
> at NUMA domains only. While at it, also move the aggregation below the
> idle_cpu() check since an idle CPU cannot have any preferred tasks.
>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
> kernel/sched/fair.c | 15 +++++++++------
> 1 file changed, 9 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2c4ebfc82917..ec2a79c8d0e7 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10340,7 +10340,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> bool *sg_overloaded,
> bool *sg_overutilized)
> {
> - int i, nr_running, local_group;
> + int i, nr_running, local_group, sd_flags = env->sd->flags;
>
> memset(sgs, 0, sizeof(*sgs));
>
> @@ -10364,10 +10364,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> if (cpu_overutilized(i))
> *sg_overutilized = 1;
>
> -#ifdef CONFIG_NUMA_BALANCING
> - sgs->nr_numa_running += rq->nr_numa_running;
> - sgs->nr_preferred_running += rq->nr_preferred_running;
> -#endif
> /*
> * No need to call idle_cpu() if nr_running is not 0
> */
> @@ -10377,10 +10373,17 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> continue;
> }
>
> +#ifdef CONFIG_NUMA_BALANCING
> + /* Only fbq_classify_group() uses this to classify NUMA groups */
> + if (sd_flags & SD_NUMA) {
> + sgs->nr_numa_running += rq->nr_numa_running;
> + sgs->nr_preferred_running += rq->nr_preferred_running;
> + }
> +#endif
> if (local_group)
> continue;
>
> - if (env->sd->flags & SD_ASYM_CPUCAPACITY) {
> + if (sd_flags & SD_ASYM_CPUCAPACITY) {
> /* Check for a misfit task on the cpu */
> if (sgs->group_misfit_task_load < rq->misfit_task_load) {
> sgs->group_misfit_task_load = rq->misfit_task_load;
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 7/8] sched/fair: Do not compute overloaded status unnecessarily during lb
2024-12-11 18:55 ` [PATCH 7/8] sched/fair: Do not compute overloaded status " K Prateek Nayak
2024-12-12 9:56 ` Vincent Guittot
@ 2024-12-13 14:57 ` Shrikanth Hegde
2024-12-13 19:51 ` K Prateek Nayak
1 sibling, 1 reply; 32+ messages in thread
From: Shrikanth Hegde @ 2024-12-13 14:57 UTC (permalink / raw)
To: K Prateek Nayak
Cc: H. Peter Anvin, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Rafael J. Wysocki, Ricardo Neri,
Tim Chen, Mario Limonciello, Meng Li, Huang Rui,
Gautham R. Shenoy, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Peter Zijlstra, Juri Lelli, Vincent Guittot, x86,
linux-kernel
On 12/12/24 00:25, K Prateek Nayak wrote:
> Only set sg_overloaded when computing sg_lb_stats() at the highest sched
> domain since rd->overloaded status is updated only when load balancing
> at the highest domain. While at it, move setting of sg_overloaded below
> idle_cpu() check since an idle CPU can never be overloaded.
>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
> kernel/sched/fair.c | 7 ++++---
> 1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ec2a79c8d0e7..3f36805ecdca 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10358,9 +10358,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> nr_running = rq->nr_running;
> sgs->sum_nr_running += nr_running;
>
> - if (nr_running > 1)
> - *sg_overloaded = 1;
> -
> if (cpu_overutilized(i))
> *sg_overutilized = 1;
Maybe it's worth moving the overutilized check after the idle checks too. An
idle CPU can't be overutilized, right?
>
> @@ -10373,6 +10370,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> continue;
> }
>
> + /* Overload indicator is only updated at root domain */
> + if (!env->sd->parent && nr_running > 1)
> + *sg_overloaded = 1;
> +
> #ifdef CONFIG_NUMA_BALANCING
> /* Only fbq_classify_group() uses this to classify NUMA groups */
> if (sd_flags & SD_NUMA) {
Other than the point above,
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC PATCH 8/8] sched/fair: Uncache asym_prefer_cpu and find it during update_sd_lb_stats()
2024-12-11 18:55 ` [RFC PATCH 8/8] sched/fair: Uncache asym_prefer_cpu and find it during update_sd_lb_stats() K Prateek Nayak
@ 2024-12-13 15:02 ` Shrikanth Hegde
2024-12-13 20:00 ` K Prateek Nayak
0 siblings, 1 reply; 32+ messages in thread
From: Shrikanth Hegde @ 2024-12-13 15:02 UTC (permalink / raw)
To: K Prateek Nayak
Cc: H. Peter Anvin, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Rafael J. Wysocki, Ricardo Neri,
Tim Chen, Mario Limonciello, Meng Li, Huang Rui,
Gautham R. Shenoy, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Peter Zijlstra, Juri Lelli, Vincent Guittot, x86,
linux-kernel
On 12/12/24 00:25, K Prateek Nayak wrote:
> On AMD processors supporting dynamic preferred core ranking, the
> asym_prefer_cpu cached in sched_group can change dynamically. Since
> asym_prefer_cpu is cached when the sched domain hierarchy is built,
> updating the cached value across the system would require rebuilding
> the sched domain which is prohibitively expensive.
>
> All the asym_prefer_cpu comparisons in the load balancing path are only
> carried out after the sched group stats have been updated by iterating
> over all the CPUs in the group. Uncache the asym_prefer_cpu and compute it
> while the sched group statistics are being updated as a part of sg_lb_stats.
>
> Fixes: f3a052391822 ("cpufreq: amd-pstate: Enable amd-pstate preferred core support")
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
> kernel/sched/fair.c | 21 +++++++++++++++++++--
> kernel/sched/sched.h | 1 -
> kernel/sched/topology.c | 15 +--------------
> 3 files changed, 20 insertions(+), 17 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3f36805ecdca..166b8e831064 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9911,6 +9911,8 @@ struct sg_lb_stats {
> unsigned int sum_nr_running; /* Nr of all tasks running in the group */
> unsigned int sum_h_nr_running; /* Nr of CFS tasks running in the group */
> unsigned int idle_cpus; /* Nr of idle CPUs in the group */
> + unsigned int asym_prefer_cpu; /* CPU with highest asym priority */
> + int highest_asym_prio; /* Asym priority of asym_prefer_cpu */
It's better to move this after the group_asym_packing field, so all related
fields are together.
> unsigned int group_weight;
> enum group_type group_type;
> unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
> @@ -10243,7 +10245,7 @@ sched_group_asym(struct lb_env *env, struct sg_lb_stats *sgs, struct sched_group
> (sgs->group_weight - sgs->idle_cpus != 1))
> return false;
>
> - return sched_asym(env->sd, env->dst_cpu, group->asym_prefer_cpu);
> + return sched_asym(env->sd, env->dst_cpu, sgs->asym_prefer_cpu);
> }
>
> /* One group has more than one SMT CPU while the other group does not */
> @@ -10324,6 +10326,17 @@ sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
> return check_cpu_capacity(rq, sd);
> }
>
> +static inline void
> +update_sg_pick_asym_prefer(struct sg_lb_stats *sgs, int cpu)
> +{
> + int asym_prio = arch_asym_cpu_priority(cpu);
> +
> + if (asym_prio > sgs->highest_asym_prio) {
> + sgs->asym_prefer_cpu = cpu;
> + sgs->highest_asym_prio = asym_prio;
> + }
> +}
> +
> /**
> * update_sg_lb_stats - Update sched_group's statistics for load balancing.
> * @env: The load balancing environment.
> @@ -10345,6 +10358,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> memset(sgs, 0, sizeof(*sgs));
>
> local_group = group == sds->local;
> + sgs->highest_asym_prio = INT_MIN;
>
> for_each_cpu_and(i, sched_group_span(group), env->cpus) {
> struct rq *rq = cpu_rq(i);
> @@ -10358,6 +10372,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> nr_running = rq->nr_running;
> sgs->sum_nr_running += nr_running;
>
> + if (sd_flags & SD_ASYM_PACKING)
> + update_sg_pick_asym_prefer(sgs, i);
> +
> if (cpu_overutilized(i))
> *sg_overutilized = 1;
>
> @@ -10479,7 +10496,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>
> case group_asym_packing:
> /* Prefer to move from lowest priority CPU's work */
> - return sched_asym_prefer(sds->busiest->asym_prefer_cpu, sg->asym_prefer_cpu);
> + return sched_asym_prefer(busiest->asym_prefer_cpu, sgs->asym_prefer_cpu);
>
> case group_misfit_task:
> /*
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index aef716c41edb..a3f0d326bd11 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2047,7 +2047,6 @@ struct sched_group {
> unsigned int group_weight;
> unsigned int cores;
> struct sched_group_capacity *sgc;
> - int asym_prefer_cpu; /* CPU of highest priority in group */
> int flags;
>
> /*
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 9c405f0e7b26..20aa087710f0 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1302,7 +1302,7 @@ static void init_sched_groups_capacity(int cpu, struct sched_domain *sd)
> WARN_ON(!sg);
>
> do {
> - int cpu, cores = 0, max_cpu = -1;
> + int cpu, cores = 0;
>
> sg->group_weight = cpumask_weight(sched_group_span(sg));
>
> @@ -1314,19 +1314,6 @@ static void init_sched_groups_capacity(int cpu, struct sched_domain *sd)
> #endif
> }
> sg->cores = cores;
> -
> - if (!(sd->flags & SD_ASYM_PACKING))
> - goto next;
> -
> - for_each_cpu(cpu, sched_group_span(sg)) {
> - if (max_cpu < 0)
> - max_cpu = cpu;
> - else if (sched_asym_prefer(cpu, max_cpu))
> - max_cpu = cpu;
> - }
> - sg->asym_prefer_cpu = max_cpu;
> -
> -next:
> sg = sg->next;
> } while (sg != sd->groups);
>
Tried minimal testing of ASYM_PACKING behavior on Power10 Shared VM. It
is working as expected with the patch as well. (functionality wise,
performance isn't tested)
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 3/8] x86/itmt: Move the "sched_itmt_enabled" sysctl to debugfs
2024-12-13 4:01 ` K Prateek Nayak
@ 2024-12-13 17:19 ` Tim Chen
0 siblings, 0 replies; 32+ messages in thread
From: Tim Chen @ 2024-12-13 17:19 UTC (permalink / raw)
To: K Prateek Nayak, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Peter Zijlstra, Juri Lelli, Vincent Guittot, x86,
linux-kernel
Cc: H. Peter Anvin, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Rafael J. Wysocki, Ricardo Neri,
Mario Limonciello, Meng Li, Huang Rui, Gautham R. Shenoy
On Fri, 2024-12-13 at 09:31 +0530, K Prateek Nayak wrote:
> Hello Tim,
>
> Thank you for reviewing the series.
>
> On 12/13/2024 12:45 AM, Tim Chen wrote:
> > On Wed, 2024-12-11 at 18:55 +0000, K Prateek Nayak wrote:
> > > "sched_itmt_enabled" was only introduced as a debug toggle for any funky
> > > ITMT behavior. Move the sysctl control from
> > > "/proc/sys/kernel/sched_itmt_enabled" to debugfs at
> > > "/sys/kernel/debug/x86/sched_itmt_enabled" with a notable change that a
> > > cat on the file will return "Y" or "N" instead of "1" or "0" to
> > > indicate that the feature is enabled or disabled respectively.
> > >
> >
> > Valid values of setting "sched_itmt_enabled" likewise change from "1" or "0"
> > to "Y" or "N".
>
> Turns out you can still use "1" and "0". Running:
>
> echo Y > /sys/kernel/debug/sched/verbose
> echo "sched_itmt_enabled: $(cat /sys/kernel/debug/x86/sched_itmt_enabled)";
> for i in 0 1 N Y;
> do
> echo "Writing $i to /sys/kernel/debug/x86/sched_itmt_enabled";
> echo $i > /sys/kernel/debug/x86/sched_itmt_enabled;
> echo "sched_itmt_enabled: $(cat /sys/kernel/debug/x86/sched_itmt_enabled)";
> echo "sched domain flags:";
> cat /sys/kernel/debug/sched/domains/cpu0/domain*/flags;
> echo;
> done
>
> Yields the following output on my system:
>
> sched_itmt_enabled: Y
>
> Writing 0 to /sys/kernel/debug/x86/sched_itmt_enabled
> sched_itmt_enabled: N
> sched domain flags:
> SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_LLC SD_PREFER_SIBLING
> SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_LLC SD_PREFER_SIBLING
> SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_PREFER_SIBLING
>
> Writing 1 to /sys/kernel/debug/x86/sched_itmt_enabled
> sched_itmt_enabled: Y
> sched domain flags:
> SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_LLC SD_PREFER_SIBLING
> SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_LLC SD_ASYM_PACKING SD_PREFER_SIBLING
> SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_ASYM_PACKING SD_PREFER_SIBLING
>
> Writing N to /sys/kernel/debug/x86/sched_itmt_enabled
> sched_itmt_enabled: N
> sched domain flags:
> SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_LLC SD_PREFER_SIBLING
> SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_LLC SD_PREFER_SIBLING
> SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_PREFER_SIBLING
>
> Writing Y to /sys/kernel/debug/x86/sched_itmt_enabled
> sched_itmt_enabled: Y
> sched domain flags:
> SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_LLC SD_PREFER_SIBLING
> SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_LLC SD_ASYM_PACKING SD_PREFER_SIBLING
> SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_ASYM_PACKING SD_PREFER_SIBLING
>
> Would you like me to extend that note as:
>
> ... with a notable change that a
> cat on the file will return "Y" or "N" instead of "1" or "0" to
> indicate that the feature is enabled or disabled respectively. Users can
> either write "0" or "N" to toggle the feature off when enabled, or
> "1" or "Y" to toggle the feature on when disabled.
>
> for the record?
>
>
Sure. Thanks.
Tim
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 7/8] sched/fair: Do not compute overloaded status unnecessarily during lb
2024-12-13 14:57 ` Shrikanth Hegde
@ 2024-12-13 19:51 ` K Prateek Nayak
0 siblings, 0 replies; 32+ messages in thread
From: K Prateek Nayak @ 2024-12-13 19:51 UTC (permalink / raw)
To: Shrikanth Hegde
Cc: H. Peter Anvin, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Rafael J. Wysocki, Ricardo Neri,
Tim Chen, Mario Limonciello, Meng Li, Huang Rui,
Gautham R. Shenoy, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Peter Zijlstra, Juri Lelli, Vincent Guittot, x86,
linux-kernel
Hello Shrikanth,
Thank you for reviewing the series.
On 12/13/2024 8:27 PM, Shrikanth Hegde wrote:
>
>
> On 12/12/24 00:25, K Prateek Nayak wrote:
>> Only set sg_overloaded when computing sg_lb_stats() at the highest sched
>> domain since rd->overloaded status is updated only when load balancing
>> at the highest domain. While at it, move setting of sg_overloaded below
>> idle_cpu() check since an idle CPU can never be overloaded.
>>
>> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
>> ---
>> kernel/sched/fair.c | 7 ++++---
>> 1 file changed, 4 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index ec2a79c8d0e7..3f36805ecdca 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -10358,9 +10358,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>> nr_running = rq->nr_running;
>> sgs->sum_nr_running += nr_running;
>> - if (nr_running > 1)
>> - *sg_overloaded = 1;
>> -
>> if (cpu_overutilized(i))
>> *sg_overutilized = 1;
>
> Maybe it's worth moving the overutilized check after the idle checks too. An idle CPU can't be overutilized, right?
Since there are no tasks on the CPU, there are no UCLAMP constraints,
which means cpu_overutilized() boils down to:
!fits_capacity(cpu_util_cfs(cpu), capacity_of(cpu))
But the utilization averages can still carry blocked-load contributions and
capacity_of(cpu) can also change with arch_scale_cpu_capacity(), so I cannot
say with 100% confidence that an idle CPU cannot appear overutilized.
Vincent, Dietmar, do you think it is possible for an idle CPU to be
overutilized? If not, I'll move this update below idle_cpu() check too.
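For reference, fits_capacity() is a fixed headroom check; paraphrased from kernel/sched/fair.c, so treat the exact constants as approximate:
/* util "fits" only while it stays below ~80% of capacity (1280/1024 margin) */
#define fits_capacity(cap, max)	((cap) * 1280 < (max) * 1024)
A stale blocked-load contribution in cpu_util_cfs() is therefore the main way an idle CPU could, at least in theory, still fail the check.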
>
>> @@ -10373,6 +10370,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>> continue;
>> }
>> + /* Overload indicator is only updated at root domain */
>> + if (!env->sd->parent && nr_running > 1)
>> + *sg_overloaded = 1;
>> +
>> #ifdef CONFIG_NUMA_BALANCING
>> /* Only fbq_classify_group() uses this to classify NUMA groups */
>> if (sd_flags & SD_NUMA) {
>
> Other than the point above,
> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Thank you!
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC PATCH 8/8] sched/fair: Uncache asym_prefer_cpu and find it during update_sd_lb_stats()
2024-12-13 15:02 ` Shrikanth Hegde
@ 2024-12-13 20:00 ` K Prateek Nayak
0 siblings, 0 replies; 32+ messages in thread
From: K Prateek Nayak @ 2024-12-13 20:00 UTC (permalink / raw)
To: Shrikanth Hegde
Cc: H. Peter Anvin, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Rafael J. Wysocki, Ricardo Neri,
Tim Chen, Mario Limonciello, Meng Li, Huang Rui,
Gautham R. Shenoy, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Peter Zijlstra, Juri Lelli, Vincent Guittot, x86,
linux-kernel
Hello Shrikanth,
On 12/13/2024 8:32 PM, Shrikanth Hegde wrote:
>
>
> On 12/12/24 00:25, K Prateek Nayak wrote:
>> On AMD processors supporting dynamic preferred core ranking, the
>> asym_prefer_cpu cached in sched_group can change dynamically. Since
>> asym_prefer_cpu is cached when the sched domain hierarchy is built,
>> updating the cached value across the system would require rebuilding
>> the sched domain which is prohibitively expensive.
>>
>> All the asym_prefer_cpu comparisons in the load balancing path are only
>> carried out after the sched group stats have been updated by iterating
>> over all the CPUs in the group. Uncache the asym_prefer_cpu and compute it
>> while the sched group statistics are being updated as a part of sg_lb_stats.
>>
>> Fixes: f3a052391822 ("cpufreq: amd-pstate: Enable amd-pstate preferred core support")
>> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
>> ---
>> kernel/sched/fair.c | 21 +++++++++++++++++++--
>> kernel/sched/sched.h | 1 -
>> kernel/sched/topology.c | 15 +--------------
>> 3 files changed, 20 insertions(+), 17 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 3f36805ecdca..166b8e831064 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -9911,6 +9911,8 @@ struct sg_lb_stats {
>> unsigned int sum_nr_running; /* Nr of all tasks running in the group */
>> unsigned int sum_h_nr_running; /* Nr of CFS tasks running in the group */
>> unsigned int idle_cpus; /* Nr of idle CPUs in the group */
>> + unsigned int asym_prefer_cpu; /* CPU with highest asym priority */
>> + int highest_asym_prio; /* Asym priority of asym_prefer_cpu */
>
> It's better to move this after the group_asym_packing field, so all related fields are together.
Sure, I'll move them around in the next iteration if folks are okay
with this approach.
>
>> unsigned int group_weight;
>> enum group_type group_type;
>> unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
>> [..snip..]
>
> Tried minimal testing of ASYM_PACKING behavior on Power10 Shared VM. It is working as expected with the patch as well. (functionality wise, performance isn't tested)
Thank you for testing! Let me know if there are any visible regressions,
in which case let's see if the alternate approach suggested in the cover
letter fares any better.
Thanks a ton for reviewing and testing the series.
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 5/8] x86/topology: Use x86_sched_itmt_flags for PKG domain unconditionally
2024-12-11 18:55 ` [PATCH 5/8] x86/topology: Use x86_sched_itmt_flags for PKG domain unconditionally K Prateek Nayak
@ 2024-12-13 21:07 ` Tim Chen
0 siblings, 0 replies; 32+ messages in thread
From: Tim Chen @ 2024-12-13 21:07 UTC (permalink / raw)
To: K Prateek Nayak, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Peter Zijlstra, Juri Lelli, Vincent Guittot, x86,
linux-kernel
Cc: H. Peter Anvin, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Rafael J. Wysocki, Ricardo Neri,
Mario Limonciello, Meng Li, Huang Rui, Gautham R. Shenoy
On Wed, 2024-12-11 at 18:55 +0000, K Prateek Nayak wrote:
> x86_sched_itmt_flags() returns SD_ASYM_PACKING if ITMT support is
> enabled by the system. Without ITMT support being enabled, it returns 0
> similar to current x86_die_flags() on non-Hybrid systems
> (!X86_HYBRID_CPU and !X86_FEATURE_AMD_HETEROGENEOUS_CORES)
>
> On Intel systems that enable ITMT support, either the MC domain
> coincides with the PKG domain, or in case of multiple MC groups
> within a PKG domain, either Sub-NUMA Cluster (SNC) is enabled or the
> processor features Hybrid core layout (X86_HYBRID_CPU) which leads to
> three distinct possibilities:
>
> o If PKG and MC domains coincide, PKG domain is degenerated by
> sd_parent_degenerate() when building sched domain topology.
>
> o If SNC is enabled, PKG domain is never added since
> "x86_has_numa_in_package" is set and the topology will instead contain
> NODE and NUMA domains.
>
> o On X86_HYBRID_CPU which contains multiple MC groups within the PKG,
> the PKG domain requires x86_sched_itmt_flags().
>
> Thus, on Intel systems that contain multiple MC groups within the PKG
> and enable ITMT support, the PKG domain requires
> x86_sched_itmt_flags(). In all other cases PKG domain is either never
> added or is degenerated. Thus, returning x86_sched_itmt_flags()
> unconditionally at PKG domain on Intel systems should not lead to any
> functional changes.
>
> On AMD systems with multiple LLCs (MC groups) within a PKG domain,
> enabling ITMT support requires setting SD_ASYM_PACKING to the PKG domain
> since the core rankings are assigned PKG-wide.
>
> Core rankings on AMD processors are currently set by the amd-pstate
> driver when the Preferred Core feature is supported. A subset of systems that
> support the Preferred Core feature can be detected using
> X86_FEATURE_AMD_HETEROGENEOUS_CORES; however, this does not cover all the
> systems that support Preferred Core ranking.
>
> Detecting Preferred Core support on AMD systems requires inspecting CPPC
> Highest Perf on all present CPUs and checking if it differs on at least
> one CPU. A previous suggestion to use a synthetic feature to detect
> Preferred Core support [1] was found to be non-trivial to implement
> since the BSP alone cannot detect if Preferred Core is supported, and by the
> time the APs come up, alternatives are already patched and setting an
> X86_FEATURE_* bit is no longer possible.
>
> Since x86 processors that enable ITMT support and consist of multiple
> non-NUMA MC groups within a PKG require the SD_ASYM_PACKING flag to be set
> at the PKG domain, return x86_sched_itmt_flags() unconditionally for the PKG
> domain.
>
> Since x86_die_flags() would have just returned x86_sched_itmt_flags()
> after the change, remove the unnecessary wrapper and pass
> x86_sched_itmt_flags() directly as the flags function.
>
> Link: https://lore.kernel.org/lkml/20241203221746.GKZ0-Dii5rnZppkM_e@fat_crate.local/ [1]
> Fixes: f3a052391822 ("cpufreq: amd-pstate: Enable amd-pstate preferred core support")
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Tim
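For context, the resulting PKG entry in build_sched_topology() should end up along these lines (a sketch inferred from the changelog above, not the actual hunk from the patch):
x86_topology[i++] = (struct sched_domain_topology_level){
	cpu_cpu_mask, x86_sched_itmt_flags, SD_INIT_NAME(PKG)
};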
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 0/8] x86, sched: Dynamic ITMT core ranking support and some yak shaving
2024-12-13 4:12 ` K Prateek Nayak
@ 2024-12-13 21:11 ` Tim Chen
0 siblings, 0 replies; 32+ messages in thread
From: Tim Chen @ 2024-12-13 21:11 UTC (permalink / raw)
To: K Prateek Nayak, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Peter Zijlstra, Juri Lelli, Vincent Guittot, x86,
linux-kernel
Cc: H. Peter Anvin, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Rafael J. Wysocki, Ricardo Neri,
Mario Limonciello, Meng Li, Huang Rui, Gautham R. Shenoy,
Srinivas Pandruvada
On Fri, 2024-12-13 at 09:42 +0530, K Prateek Nayak wrote:
> Hello Tim,
>
> On 12/13/2024 6:03 AM, Tim Chen wrote:
> > On Wed, 2024-12-11 at 18:55 +0000, K Prateek Nayak wrote:
> > > The ITMT infrastructure currently assumes ITMT rankings to be static and
> > > is set correctly prior to enabling ITMT support which allows the CPU
> > > with the highest core ranking to be cached as the "asym_prefer_cpu" in
> > > the sched_group struct. However, with the introduction of Preferred Core
> > > support in amd-pstate, these rankings can change at runtime.
> > >
> > > This series adds support for dynamic ranking in generic scheduler layer
> > > without the need to rebuild the sched domain hierarchy and fixes an
> > > issue with x86_die_flags() on AMD systems that support Preferred Core
> > > ranking with some yak shaving done along the way.
> > >
> > > Patch 1 to 4 are independent cleanup around ITMT infrastructure, removal
> > > of x86_smt_flags wrapper, and moving the "sched_itmt_enabled" sysctl to
> > > debugfs.
> > >
> > > Patch 5 adds the SD_ASYM_PACKING flag to the PKG domain on all ITMT
> > > enabled systems. The rationale behind the addition is elaborated in the
> > > same. One open question that remains is for Intel processors with multiple
> > > Tiles in a PKG which advertises itself as multiple LLCs in a PKG and
> > > supports ITMT - is it okay to set SD_ASYM_PACKING for PKG domain on
> > > these processors?
> >
> > After talking to my colleagues Ricardo and Srinivas, we think that this
> > should be fine for Intel CPUs.
>
> Thank you for confirming that. Could you also confirm if my observations
> for Intel systems on Patch 5 covered all possible scenarios for the ones
> that feature multiple MC groups within a PKG and enable ITMT support? If
> I'm missing something, please do let me know and we can hash out the
> implementation details.
Your patch 5 implementation should be fine as far as we can tell.
Tim
>
> Thanks a ton for reviewing the series!
>
^ permalink raw reply [flat|nested] 32+ messages in thread
end of thread, other threads:[~2024-12-13 21:11 UTC | newest]
Thread overview: 32+ messages
2024-12-11 18:55 [PATCH 0/8] x86, sched: Dynamic ITMT core ranking support and some yak shaving K Prateek Nayak
2024-12-11 18:55 ` [PATCH 1/8] x86/itmt: Convert "sysctl_sched_itmt_enabled" to boolean K Prateek Nayak
2024-12-12 18:09 ` Tim Chen
2024-12-11 18:55 ` [PATCH 2/8] x86/itmt: Use guard() for itmt_update_mutex K Prateek Nayak
2024-12-12 18:22 ` Tim Chen
2024-12-11 18:55 ` [PATCH 3/8] x86/itmt: Move the "sched_itmt_enabled" sysctl to debugfs K Prateek Nayak
2024-12-12 19:15 ` Tim Chen
2024-12-13 4:01 ` K Prateek Nayak
2024-12-13 17:19 ` Tim Chen
2024-12-13 9:16 ` Peter Zijlstra
2024-12-11 18:55 ` [PATCH 4/8] x86/topology: Remove x86_smt_flags and use cpu_smt_flags directly K Prateek Nayak
2024-12-12 21:05 ` Tim Chen
2024-12-11 18:55 ` [PATCH 5/8] x86/topology: Use x86_sched_itmt_flags for PKG domain unconditionally K Prateek Nayak
2024-12-13 21:07 ` Tim Chen
2024-12-11 18:55 ` [PATCH 6/8] sched/fair: Do not compute NUMA Balancing stats unnecessarily during lb K Prateek Nayak
2024-12-12 11:05 ` Vincent Guittot
2024-12-12 11:43 ` K Prateek Nayak
2024-12-12 13:28 ` Vincent Guittot
2024-12-13 14:55 ` Shrikanth Hegde
2024-12-11 18:55 ` [PATCH 7/8] sched/fair: Do not compute overloaded status " K Prateek Nayak
2024-12-12 9:56 ` Vincent Guittot
2024-12-12 11:01 ` K Prateek Nayak
2024-12-12 11:18 ` Vincent Guittot
2024-12-12 11:30 ` K Prateek Nayak
2024-12-13 14:57 ` Shrikanth Hegde
2024-12-13 19:51 ` K Prateek Nayak
2024-12-11 18:55 ` [RFC PATCH 8/8] sched/fair: Uncache asym_prefer_cpu and find it during update_sd_lb_stats() K Prateek Nayak
2024-12-13 15:02 ` Shrikanth Hegde
2024-12-13 20:00 ` K Prateek Nayak
2024-12-13 0:33 ` [PATCH 0/8] x86, sched: Dynamic ITMT core ranking support and some yak shaving Tim Chen
2024-12-13 4:12 ` K Prateek Nayak
2024-12-13 21:11 ` Tim Chen