public inbox for linux-kernel@vger.kernel.org
* [PATCH 0/6] Fixes and improvements in /proc/schedstat
@ 2024-12-18  4:36 Swapnil Sapkal
  2024-12-18  4:36 ` [PATCH 1/6] sched/fair: Fix value reported by hot tasks pulled " Swapnil Sapkal
                   ` (5 more replies)
  0 siblings, 6 replies; 9+ messages in thread
From: Swapnil Sapkal @ 2024-12-18  4:36 UTC (permalink / raw)
  To: peterz, mingo, juri.lelli, vincent.guittot, corbet
  Cc: dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
	iamjoonsoo.kim, qyousef, sshegde, alexs, lukasz.luba,
	gautham.shenoy, kprateek.nayak, ravi.bangoria, linux-kernel, doc,
	Swapnil Sapkal

This patch series adds fixes and improvements to /proc/schedstat. Patches
are based on top of tip/sched/core commit 7675361ff9a1
("sched: deadline: Cleanup goto label in pick_earliest_pushable_dl_task")

Patch 1 and 2: In /proc/schedstat, lb_hot_gained reports the number of
times cache-hot tasks were migrated as a part of load balancing. This
value is incremented in can_migrate_task() if the task is cache hot and
migratable. But after incrementing this value, it is possible that the
task won't get migrated, in which case this value will be incorrect.
Fix this by incrementing it in detach_task().

While at it, cleanup migrate_degrades_locality() by making it return an
enum instead of the {-1,0,1} to improve the readability of
can_migrate_task(). Previously these patches were sent here[1].

Patch 3: Adds new fields to /proc/schedstat. Previously this patch was
sent here[2]. This change is intended to be a part of v17 of
/proc/schedstat.

In /proc/schedstat, lb_imbalance reports the sum of imbalances discovered
in sched domains with each call to sched_balance_rq(), which is not very
useful because lb_imbalance does not mention whether the imbalance is due
to load, utilization, nr_tasks or misfit_tasks. Remove this field from
/proc/schedstat.

Currently there is no field in /proc/schedstat to report different types
of imbalances. Introduce new fields in /proc/schedstat to report the
total imbalances in load, utilization, nr_tasks or misfit_tasks.

Patch 4 and 5: Currently, there does not exist a straightforward way to
extract the names of the sched domains and match them to the per-cpu
domain entry in /proc/schedstat other than looking at the debugfs files
which are only visible after enabling "verbose" debug after commit
34320745dfc9 ("sched/debug: Put sched/domains files under the verbose flag")

Since tools like `perf sched stats`[3] require displaying per-domain
information in a user-friendly manner, and aggregating domain-level
stats needs a way to identify each domain, display the names of the
sched domains, alongside their levels, in /proc/schedstat. But the
sched domain name is guarded by CONFIG_SCHED_DEBUG. As per the
discussion[4], these patches move the sched domain name out of
CONFIG_SCHED_DEBUG and print it in /proc/schedstat.

Patch 6: Updates the Schedstat version to 17 as more fields are added to
report different kinds of imbalances in the sched domain. The domain
fields now also print the sched domain name.

[1] https://lore.kernel.org/all/20230614102224.12555-1-swapnil.sapkal@amd.com/
[2] https://lore.kernel.org/lkml/66f1e42c-9035-4f9b-8c77-976ab50638bd@amd.com/
[3] https://lore.kernel.org/lkml/20241122084452.1064968-1-swapnil.sapkal@amd.com/
[4] https://lore.kernel.org/lkml/fcefeb4d-3acb-462d-9c9b-3df8d927e522@amd.com/

K Prateek Nayak (1):
  sched/stats: Print domain name in /proc/schedstat

Peter Zijlstra (2):
  sched/fair: Fix value reported by hot tasks pulled in /proc/schedstat
  sched/fair: Cleanup in migrate_degrades_locality() to improve
    readability

Swapnil Sapkal (3):
  sched: Report the different kinds of imbalances in /proc/schedstat
  sched: Move sched domain name out of CONFIG_SCHED_DEBUG
  docs: Update Schedstat version to 17

 Documentation/scheduler/sched-stats.rst | 126 ++++++++++++++----------
 include/linux/sched.h                   |   1 +
 include/linux/sched/topology.h          |  13 +--
 kernel/sched/fair.c                     |  77 ++++++++++-----
 kernel/sched/stats.c                    |  11 ++-
 kernel/sched/topology.c                 |   4 -
 6 files changed, 140 insertions(+), 92 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 1/6] sched/fair: Fix value reported by hot tasks pulled in /proc/schedstat
  2024-12-18  4:36 [PATCH 0/6] Fixes and improvements in /proc/schedstat Swapnil Sapkal
@ 2024-12-18  4:36 ` Swapnil Sapkal
  2024-12-18  4:36 ` [PATCH 2/6] sched/fair: Cleanup in migrate_degrades_locality() to improve readability Swapnil Sapkal
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: Swapnil Sapkal @ 2024-12-18  4:36 UTC (permalink / raw)
  To: peterz, mingo, juri.lelli, vincent.guittot, corbet
  Cc: dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
	iamjoonsoo.kim, qyousef, sshegde, alexs, lukasz.luba,
	gautham.shenoy, kprateek.nayak, ravi.bangoria, linux-kernel, doc,
	Swapnil Sapkal

From: Peter Zijlstra <peterz@infradead.org>

In /proc/schedstat, lb_hot_gained reports the number of hot tasks pulled
during load balance. This value is incremented in can_migrate_task()
if the task is migratable and hot. After incrementing the value, the
load balancer can still decide not to migrate this task, leading to
wrong accounting. Fix this by incrementing the stats when hot tasks are
detached. This issue only exists in detach_tasks(), where we can decide
not to migrate a hot task even if it is migratable. However, in
detach_one_task(), we migrate it unconditionally.

Link: https://lore.kernel.org/all/20230619092228.GK4253@hirez.programming.kicks-ass.net/

[Swapnil: Handled the case where nr_failed_migrations_hot was not accounted properly and wrote commit log]

Fixes: d31980846f96 ("sched: Move up affinity check to mitigate useless redoing overhead")
Reported-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Not-yet-signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
---
 include/linux/sched.h |  1 +
 kernel/sched/fair.c   | 17 +++++++++++++----
 2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1d5cc3e50884..cf798d4e3ca6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -937,6 +937,7 @@ struct task_struct {
 	unsigned			sched_reset_on_fork:1;
 	unsigned			sched_contributes_to_load:1;
 	unsigned			sched_migrated:1;
+	unsigned			sched_task_hot:1;
 
 	/* Force alignment to the next boundary: */
 	unsigned			:0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3356315d7e64..baefbf1b07fd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9354,6 +9354,8 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	int tsk_cache_hot;
 
 	lockdep_assert_rq_held(env->src_rq);
+	if (p->sched_task_hot)
+		p->sched_task_hot = 0;
 
 	/*
 	 * We do not migrate tasks that are:
@@ -9426,10 +9428,8 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 
 	if (tsk_cache_hot <= 0 ||
 	    env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
-		if (tsk_cache_hot == 1) {
-			schedstat_inc(env->sd->lb_hot_gained[env->idle]);
-			schedstat_inc(p->stats.nr_forced_migrations);
-		}
+		if (tsk_cache_hot == 1)
+			p->sched_task_hot = 1;
 		return 1;
 	}
 
@@ -9444,6 +9444,12 @@ static void detach_task(struct task_struct *p, struct lb_env *env)
 {
 	lockdep_assert_rq_held(env->src_rq);
 
+	if (p->sched_task_hot) {
+		p->sched_task_hot = 0;
+		schedstat_inc(env->sd->lb_hot_gained[env->idle]);
+		schedstat_inc(p->stats.nr_forced_migrations);
+	}
+
 	deactivate_task(env->src_rq, p, DEQUEUE_NOCLOCK);
 	set_task_cpu(p, env->dst_cpu);
 }
@@ -9604,6 +9610,9 @@ static int detach_tasks(struct lb_env *env)
 
 		continue;
 next:
+		if (p->sched_task_hot)
+			schedstat_inc(p->stats.nr_failed_migrations_hot);
+
 		list_move(&p->se.group_node, tasks);
 	}
 
-- 
2.43.0



* [PATCH 2/6] sched/fair: Cleanup in migrate_degrades_locality() to improve readability
  2024-12-18  4:36 [PATCH 0/6] Fixes and improvements in /proc/schedstat Swapnil Sapkal
  2024-12-18  4:36 ` [PATCH 1/6] sched/fair: Fix value reported by hot tasks pulled " Swapnil Sapkal
@ 2024-12-18  4:36 ` Swapnil Sapkal
  2024-12-18  4:36 ` [PATCH 3/6] sched: Report the different kinds of imbalances in /proc/schedstat Swapnil Sapkal
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: Swapnil Sapkal @ 2024-12-18  4:36 UTC (permalink / raw)
  To: peterz, mingo, juri.lelli, vincent.guittot, corbet
  Cc: dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
	iamjoonsoo.kim, qyousef, sshegde, alexs, lukasz.luba,
	gautham.shenoy, kprateek.nayak, ravi.bangoria, linux-kernel, doc,
	Swapnil Sapkal

From: Peter Zijlstra <peterz@infradead.org>

migrate_degrades_locality() would return {1, 0, -1} to indicate that
migration would degrade locality, would improve locality, or would be
ambivalent to locality improvements, respectively.

This patch improves readability by changing the return value to mean:
* Any positive value degrades locality
* 0 migration doesn't affect locality
* Any negative value improves locality.

Link: https://lore.kernel.org/all/20230619094529.GL4253@hirez.programming.kicks-ass.net/

[Swapnil: Fixed comments around code and wrote commit log]

Not-yet-signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
---
 kernel/sched/fair.c | 41 +++++++++++++++++++++--------------------
 1 file changed, 21 insertions(+), 20 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index baefbf1b07fd..ec403e81ffef 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9287,43 +9287,43 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
 
 #ifdef CONFIG_NUMA_BALANCING
 /*
- * Returns 1, if task migration degrades locality
- * Returns 0, if task migration improves locality i.e migration preferred.
- * Returns -1, if task migration is not affected by locality.
+ * Returns a positive value, if task migration degrades locality.
+ * Returns 0, if task migration is not affected by locality.
+ * Returns a negative value, if task migration improves locality i.e migration preferred.
  */
-static int migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
+static long migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
 {
 	struct numa_group *numa_group = rcu_dereference(p->numa_group);
 	unsigned long src_weight, dst_weight;
 	int src_nid, dst_nid, dist;
 
 	if (!static_branch_likely(&sched_numa_balancing))
-		return -1;
+		return 0;
 
 	if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
-		return -1;
+		return 0;
 
 	src_nid = cpu_to_node(env->src_cpu);
 	dst_nid = cpu_to_node(env->dst_cpu);
 
 	if (src_nid == dst_nid)
-		return -1;
+		return 0;
 
 	/* Migrating away from the preferred node is always bad. */
 	if (src_nid == p->numa_preferred_nid) {
 		if (env->src_rq->nr_running > env->src_rq->nr_preferred_running)
 			return 1;
 		else
-			return -1;
+			return 0;
 	}
 
 	/* Encourage migration to the preferred node. */
 	if (dst_nid == p->numa_preferred_nid)
-		return 0;
+		return -1;
 
 	/* Leaving a core idle is often worse than degrading locality. */
 	if (env->idle == CPU_IDLE)
-		return -1;
+		return 0;
 
 	dist = node_distance(src_nid, dst_nid);
 	if (numa_group) {
@@ -9334,14 +9334,14 @@ static int migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
 		dst_weight = task_weight(p, dst_nid, dist);
 	}
 
-	return dst_weight < src_weight;
+	return src_weight - dst_weight;
 }
 
 #else
-static inline int migrate_degrades_locality(struct task_struct *p,
+static inline long migrate_degrades_locality(struct task_struct *p,
 					     struct lb_env *env)
 {
-	return -1;
+	return 0;
 }
 #endif
 
@@ -9351,7 +9351,7 @@ static inline int migrate_degrades_locality(struct task_struct *p,
 static
 int can_migrate_task(struct task_struct *p, struct lb_env *env)
 {
-	int tsk_cache_hot;
+	long degrades, hot;
 
 	lockdep_assert_rq_held(env->src_rq);
 	if (p->sched_task_hot)
@@ -9422,13 +9422,14 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	if (env->flags & LBF_ACTIVE_LB)
 		return 1;
 
-	tsk_cache_hot = migrate_degrades_locality(p, env);
-	if (tsk_cache_hot == -1)
-		tsk_cache_hot = task_hot(p, env);
+	degrades = migrate_degrades_locality(p, env);
+	if (!degrades)
+		hot = task_hot(p, env);
+	else
+		hot = degrades > 0;
 
-	if (tsk_cache_hot <= 0 ||
-	    env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
-		if (tsk_cache_hot == 1)
+	if (!hot || env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
+		if (hot)
 			p->sched_task_hot = 1;
 		return 1;
 	}
-- 
2.43.0



* [PATCH 3/6] sched: Report the different kinds of imbalances in /proc/schedstat
  2024-12-18  4:36 [PATCH 0/6] Fixes and improvements in /proc/schedstat Swapnil Sapkal
  2024-12-18  4:36 ` [PATCH 1/6] sched/fair: Fix value reported by hot tasks pulled " Swapnil Sapkal
  2024-12-18  4:36 ` [PATCH 2/6] sched/fair: Cleanup in migrate_degrades_locality() to improve readability Swapnil Sapkal
@ 2024-12-18  4:36 ` Swapnil Sapkal
  2024-12-18 11:40   ` Peter Zijlstra
  2024-12-18  4:36 ` [PATCH 4/6] sched: Move sched domain name out of CONFIG_SCHED_DEBUG Swapnil Sapkal
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 9+ messages in thread
From: Swapnil Sapkal @ 2024-12-18  4:36 UTC (permalink / raw)
  To: peterz, mingo, juri.lelli, vincent.guittot, corbet
  Cc: dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
	iamjoonsoo.kim, qyousef, sshegde, alexs, lukasz.luba,
	gautham.shenoy, kprateek.nayak, ravi.bangoria, linux-kernel, doc,
	Swapnil Sapkal

In /proc/schedstat, lb_imbalance reports the sum of imbalances
discovered in sched domains with each call to sched_balance_rq(), which is
not very useful because lb_imbalance does not mention whether the imbalance
is due to load, utilization, nr_tasks or misfit_tasks. Remove this field
from /proc/schedstat.

Currently there is no field in /proc/schedstat to report different types
of imbalances. Introduce new fields in /proc/schedstat to report the
total imbalances in load, utilization, nr_tasks or misfit_tasks.

Added fields to /proc/schedstat:
        - lb_imbalance_load: Total imbalance due to load.
        - lb_imbalance_util: Total imbalance due to utilization.
        - lb_imbalance_task: Total imbalance due to number of tasks.
        - lb_imbalance_misfit: Total imbalance due to misfit tasks.

Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
---
 include/linux/sched/topology.h |  5 ++++-
 kernel/sched/fair.c            | 21 ++++++++++++++++++++-
 kernel/sched/stats.c           |  7 +++++--
 3 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 4237daa5ac7a..76a662e1ec24 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -114,7 +114,10 @@ struct sched_domain {
 	unsigned int lb_count[CPU_MAX_IDLE_TYPES];
 	unsigned int lb_failed[CPU_MAX_IDLE_TYPES];
 	unsigned int lb_balanced[CPU_MAX_IDLE_TYPES];
-	unsigned int lb_imbalance[CPU_MAX_IDLE_TYPES];
+	unsigned int lb_imbalance_load[CPU_MAX_IDLE_TYPES];
+	unsigned int lb_imbalance_util[CPU_MAX_IDLE_TYPES];
+	unsigned int lb_imbalance_task[CPU_MAX_IDLE_TYPES];
+	unsigned int lb_imbalance_misfit[CPU_MAX_IDLE_TYPES];
 	unsigned int lb_gained[CPU_MAX_IDLE_TYPES];
 	unsigned int lb_hot_gained[CPU_MAX_IDLE_TYPES];
 	unsigned int lb_nobusyg[CPU_MAX_IDLE_TYPES];
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ec403e81ffef..91f33cb9fb23 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11659,6 +11659,25 @@ static int should_we_balance(struct lb_env *env)
 	return group_balance_cpu(sg) == env->dst_cpu;
 }
 
+static void update_lb_imbalance_stat(struct lb_env *env, struct sched_domain *sd,
+				     enum cpu_idle_type idle)
+{
+	switch (env->migration_type) {
+	case migrate_load:
+		schedstat_add(sd->lb_imbalance_load[idle], env->imbalance);
+		break;
+	case migrate_util:
+		schedstat_add(sd->lb_imbalance_util[idle], env->imbalance);
+		break;
+	case migrate_task:
+		schedstat_add(sd->lb_imbalance_task[idle], env->imbalance);
+		break;
+	case migrate_misfit:
+		schedstat_add(sd->lb_imbalance_misfit[idle], env->imbalance);
+		break;
+	}
+}
+
 /*
  * Check this_cpu to ensure it is balanced within domain. Attempt to move
  * tasks if there is an imbalance.
@@ -11709,7 +11728,7 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
 
 	WARN_ON_ONCE(busiest == env.dst_rq);
 
-	schedstat_add(sd->lb_imbalance[idle], env.imbalance);
+	update_lb_imbalance_stat(&env, sd, idle);
 
 	env.src_cpu = busiest->cpu;
 	env.src_rq = busiest;
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index eb0cdcd4d921..802bd9398a2e 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -141,11 +141,14 @@ static int show_schedstat(struct seq_file *seq, void *v)
 			seq_printf(seq, "domain%d %*pb", dcount++,
 				   cpumask_pr_args(sched_domain_span(sd)));
 			for (itype = 0; itype < CPU_MAX_IDLE_TYPES; itype++) {
-				seq_printf(seq, " %u %u %u %u %u %u %u %u",
+				seq_printf(seq, " %u %u %u %u %u %u %u %u %u %u %u",
 				    sd->lb_count[itype],
 				    sd->lb_balanced[itype],
 				    sd->lb_failed[itype],
-				    sd->lb_imbalance[itype],
+				    sd->lb_imbalance_load[itype],
+				    sd->lb_imbalance_util[itype],
+				    sd->lb_imbalance_task[itype],
+				    sd->lb_imbalance_misfit[itype],
 				    sd->lb_gained[itype],
 				    sd->lb_hot_gained[itype],
 				    sd->lb_nobusyq[itype],
-- 
2.43.0



* [PATCH 4/6] sched: Move sched domain name out of CONFIG_SCHED_DEBUG
  2024-12-18  4:36 [PATCH 0/6] Fixes and improvements in /proc/schedstat Swapnil Sapkal
                   ` (2 preceding siblings ...)
  2024-12-18  4:36 ` [PATCH 3/6] sched: Report the different kinds of imbalances in /proc/schedstat Swapnil Sapkal
@ 2024-12-18  4:36 ` Swapnil Sapkal
  2024-12-18  4:36 ` [PATCH 5/6] sched/stats: Print domain name in /proc/schedstat Swapnil Sapkal
  2024-12-18  4:36 ` [PATCH 6/6] docs: Update Schedstat version to 17 Swapnil Sapkal
  5 siblings, 0 replies; 9+ messages in thread
From: Swapnil Sapkal @ 2024-12-18  4:36 UTC (permalink / raw)
  To: peterz, mingo, juri.lelli, vincent.guittot, corbet
  Cc: dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
	iamjoonsoo.kim, qyousef, sshegde, alexs, lukasz.luba,
	gautham.shenoy, kprateek.nayak, ravi.bangoria, linux-kernel, doc,
	Swapnil Sapkal

The /proc/schedstat file shows CPU- and sched-domain-level scheduler
statistics, but it does not show the domain name, only the domain level.
It would be very useful for tools like `perf sched stats`[1] to
aggregate domain-level stats if domain names were shown in
/proc/schedstat. But the sched domain name is guarded by
CONFIG_SCHED_DEBUG. As per the discussion[2], move the sched domain name
out of CONFIG_SCHED_DEBUG.

[1] https://lore.kernel.org/lkml/20241122084452.1064968-1-swapnil.sapkal@amd.com/
[2] https://lore.kernel.org/lkml/fcefeb4d-3acb-462d-9c9b-3df8d927e522@amd.com/

Suggested-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
---
 include/linux/sched/topology.h | 8 --------
 kernel/sched/topology.c        | 4 ----
 2 files changed, 12 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 76a662e1ec24..7f3dbafe1817 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -143,9 +143,7 @@ struct sched_domain {
 	unsigned int ttwu_move_affine;
 	unsigned int ttwu_move_balance;
 #endif
-#ifdef CONFIG_SCHED_DEBUG
 	char *name;
-#endif
 	union {
 		void *private;		/* used during construction */
 		struct rcu_head rcu;	/* used during destruction */
@@ -201,18 +199,12 @@ struct sched_domain_topology_level {
 	int		    flags;
 	int		    numa_level;
 	struct sd_data      data;
-#ifdef CONFIG_SCHED_DEBUG
 	char                *name;
-#endif
 };
 
 extern void __init set_sched_topology(struct sched_domain_topology_level *tl);
 
-#ifdef CONFIG_SCHED_DEBUG
 # define SD_INIT_NAME(type)		.name = #type
-#else
-# define SD_INIT_NAME(type)
-#endif
 
 #else /* CONFIG_SMP */
 
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9748a4c8d668..88bd9344730d 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1635,9 +1635,7 @@ sd_init(struct sched_domain_topology_level *tl,
 		.max_newidle_lb_cost	= 0,
 		.last_decay_max_lb_cost	= jiffies,
 		.child			= child,
-#ifdef CONFIG_SCHED_DEBUG
 		.name			= tl->name,
-#endif
 	};
 
 	sd_span = sched_domain_span(sd);
@@ -2338,10 +2336,8 @@ static struct sched_domain *build_sched_domain(struct sched_domain_topology_leve
 		if (!cpumask_subset(sched_domain_span(child),
 				    sched_domain_span(sd))) {
 			pr_err("BUG: arch topology borken\n");
-#ifdef CONFIG_SCHED_DEBUG
 			pr_err("     the %s domain not a subset of the %s domain\n",
 					child->name, sd->name);
-#endif
 			/* Fixup, ensure @sd has at least @child CPUs. */
 			cpumask_or(sched_domain_span(sd),
 				   sched_domain_span(sd),
-- 
2.43.0



* [PATCH 5/6] sched/stats: Print domain name in /proc/schedstat
  2024-12-18  4:36 [PATCH 0/6] Fixes and improvements in /proc/schedstat Swapnil Sapkal
                   ` (3 preceding siblings ...)
  2024-12-18  4:36 ` [PATCH 4/6] sched: Move sched domain name out of CONFIG_SCHED_DEBUG Swapnil Sapkal
@ 2024-12-18  4:36 ` Swapnil Sapkal
  2024-12-18  4:36 ` [PATCH 6/6] docs: Update Schedstat version to 17 Swapnil Sapkal
  5 siblings, 0 replies; 9+ messages in thread
From: Swapnil Sapkal @ 2024-12-18  4:36 UTC (permalink / raw)
  To: peterz, mingo, juri.lelli, vincent.guittot, corbet
  Cc: dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
	iamjoonsoo.kim, qyousef, sshegde, alexs, lukasz.luba,
	gautham.shenoy, kprateek.nayak, ravi.bangoria, linux-kernel, doc,
	Swapnil Sapkal, James Clark

From: K Prateek Nayak <kprateek.nayak@amd.com>

Currently, there does not exist a straightforward way to extract the
names of the sched domains and match them to the per-cpu domain entry in
/proc/schedstat other than looking at the debugfs files which are only
visible after enabling "verbose" debug after commit 34320745dfc9
("sched/debug: Put sched/domains files under the verbose flag")

Since tools like `perf sched stats`[1] require displaying per-domain
information in a user-friendly manner, display the names of the sched
domains, alongside their levels, in /proc/schedstat.

Domain names also make the /proc/schedstat data unambiguous when some
of the CPUs are offline. For example, on a 128-CPU AMD Zen3 machine
where CPU0 and CPU64 are SMT siblings and CPU64 is offline:

Before:
    cpu0 ...
    domain0 ...
    domain1 ...
    cpu1 ...
    domain0 ...
    domain1 ...
    domain2 ...

After:
    cpu0 ...
    domain0 MC ...
    domain1 PKG ...
    cpu1 ...
    domain0 SMT ...
    domain1 MC ...
    domain2 PKG ...

[1] https://lore.kernel.org/lkml/20241122084452.1064968-1-swapnil.sapkal@amd.com/

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
Tested-by: James Clark <james.clark@linaro.org>
---
 kernel/sched/stats.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 802bd9398a2e..5f563965976c 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -138,7 +138,7 @@ static int show_schedstat(struct seq_file *seq, void *v)
 		for_each_domain(cpu, sd) {
 			enum cpu_idle_type itype;
 
-			seq_printf(seq, "domain%d %*pb", dcount++,
+			seq_printf(seq, "domain%d %s %*pb", dcount++, sd->name,
 				   cpumask_pr_args(sched_domain_span(sd)));
 			for (itype = 0; itype < CPU_MAX_IDLE_TYPES; itype++) {
 				seq_printf(seq, " %u %u %u %u %u %u %u %u %u %u %u",
-- 
2.43.0



* [PATCH 6/6] docs: Update Schedstat version to 17
  2024-12-18  4:36 [PATCH 0/6] Fixes and improvements in /proc/schedstat Swapnil Sapkal
                   ` (4 preceding siblings ...)
  2024-12-18  4:36 ` [PATCH 5/6] sched/stats: Print domain name in /proc/schedstat Swapnil Sapkal
@ 2024-12-18  4:36 ` Swapnil Sapkal
  5 siblings, 0 replies; 9+ messages in thread
From: Swapnil Sapkal @ 2024-12-18  4:36 UTC (permalink / raw)
  To: peterz, mingo, juri.lelli, vincent.guittot, corbet
  Cc: dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
	iamjoonsoo.kim, qyousef, sshegde, alexs, lukasz.luba,
	gautham.shenoy, kprateek.nayak, ravi.bangoria, linux-kernel, doc,
	Swapnil Sapkal

Update the Schedstat version to 17 as more fields are added to report
different kinds of imbalances in the sched domain. The domain field now
also prints the corresponding sched domain name.

Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
---
 Documentation/scheduler/sched-stats.rst | 126 ++++++++++++++----------
 kernel/sched/stats.c                    |   2 +-
 2 files changed, 76 insertions(+), 52 deletions(-)

diff --git a/Documentation/scheduler/sched-stats.rst b/Documentation/scheduler/sched-stats.rst
index 7c2b16c4729d..caea83d91c67 100644
--- a/Documentation/scheduler/sched-stats.rst
+++ b/Documentation/scheduler/sched-stats.rst
@@ -2,6 +2,12 @@
 Scheduler Statistics
 ====================
 
+Version 17 of schedstats removed 'lb_imbalance' field as it has no
+significance anymore and instead added more relevant fields namely
+'lb_imbalance_load', 'lb_imbalance_util', 'lb_imbalance_task' and
+'lb_imbalance_misfit'. The domain field prints the name of the
+corresponding sched domain from this version onwards.
+
 Version 16 of schedstats changed the order of definitions within
 'enum cpu_idle_type', which changed the order of [CPU_MAX_IDLE_TYPES]
 columns in show_schedstat(). In particular the position of CPU_IDLE
@@ -9,7 +15,9 @@ and __CPU_NOT_IDLE changed places. The size of the array is unchanged.
 
 Version 15 of schedstats dropped counters for some sched_yield:
 yld_exp_empty, yld_act_empty and yld_both_empty. Otherwise, it is
-identical to version 14.
+identical to version 14. Details are available at
+
+	https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/scheduler/sched-stats.txt?id=1e1dbb259c79b
 
 Version 14 of schedstats includes support for sched_domains, which hit the
 mainline kernel in 2.6.20 although it is identical to the stats from version
@@ -26,7 +34,14 @@ cpus on the machine, while domain0 is the most tightly focused domain,
 sometimes balancing only between pairs of cpus.  At this time, there
 are no architectures which need more than three domain levels. The first
 field in the domain stats is a bit map indicating which cpus are affected
-by that domain.
+by that domain. Details are available at
+
+	https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/sched-stats.txt?id=b762f3ffb797c
+
+The schedstat documentation is maintained version 10 onwards and is not
+updated for version 11 and 12. The details for version 10 are available at
+
+	https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/sched-stats.txt?id=1da177e4c3f4
 
 These fields are counters, and only increment.  Programs which make use
 of these will need to start with a baseline observation and then calculate
@@ -71,88 +86,97 @@ Domain statistics
 -----------------
 One of these is produced per domain for each cpu described. (Note that if
 CONFIG_SMP is not defined, *no* domains are utilized and these lines
-will not appear in the output.)
+will not appear in the output. <name> is an extension to the domain field
+that prints the name of the corresponding sched domain. It can appear in
+schedstat version 17 and above.)
 
-domain<N> <cpumask> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
+domain<N> <name> <cpumask> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
 
 The first field is a bit mask indicating what cpus this domain operates over.
 
-The next 24 are a variety of sched_balance_rq() statistics in grouped into types
-of idleness (idle, busy, and newly idle):
+The next 33 are a variety of sched_balance_rq() statistics grouped into types
+of idleness (busy, idle and newly idle):
 
     1)  # of times in this domain sched_balance_rq() was called when the
+        cpu was busy
+    2)  # of times in this domain sched_balance_rq() checked but found the
+        load did not require balancing when busy
+    3)  # of times in this domain sched_balance_rq() tried to move one or
+        more tasks and failed, when the cpu was busy
+    4)  Total imbalance in load when the cpu was busy
+    5)  Total imbalance in utilization when the cpu was busy
+    6)  Total imbalance in number of tasks when the cpu was busy
+    7)  Total imbalance due to misfit tasks when the cpu was busy
+    8)  # of times in this domain pull_task() was called when busy
+    9)  # of times in this domain pull_task() was called even though the
+        target task was cache-hot when busy
+    10) # of times in this domain sched_balance_rq() was called but did not
+        find a busier queue while the cpu was busy
+    11) # of times in this domain a busier queue was found while the cpu
+        was busy but no busier group was found
+
+    12) # of times in this domain sched_balance_rq() was called when the
         cpu was idle
-    2)  # of times in this domain sched_balance_rq() checked but found
+    13) # of times in this domain sched_balance_rq() checked but found
         the load did not require balancing when the cpu was idle
-    3)  # of times in this domain sched_balance_rq() tried to move one or
+    14) # of times in this domain sched_balance_rq() tried to move one or
         more tasks and failed, when the cpu was idle
-    4)  sum of imbalances discovered (if any) with each call to
-        sched_balance_rq() in this domain when the cpu was idle
-    5)  # of times in this domain pull_task() was called when the cpu
+    15) Total imbalance in load when the cpu was idle
+    16) Total imbalance in utilization when the cpu was idle
+    17) Total imbalance in number of tasks when the cpu was idle
+    18) Total imbalance due to misfit tasks when the cpu was idle
+    19) # of times in this domain pull_task() was called when the cpu
         was idle
-    6)  # of times in this domain pull_task() was called even though
+    20) # of times in this domain pull_task() was called even though
         the target task was cache-hot when idle
-    7)  # of times in this domain sched_balance_rq() was called but did
+    21) # of times in this domain sched_balance_rq() was called but did
         not find a busier queue while the cpu was idle
-    8)  # of times in this domain a busier queue was found while the
+    22) # of times in this domain a busier queue was found while the
         cpu was idle but no busier group was found
-    9)  # of times in this domain sched_balance_rq() was called when the
-        cpu was busy
-    10) # of times in this domain sched_balance_rq() checked but found the
-        load did not require balancing when busy
-    11) # of times in this domain sched_balance_rq() tried to move one or
-        more tasks and failed, when the cpu was busy
-    12) sum of imbalances discovered (if any) with each call to
-        sched_balance_rq() in this domain when the cpu was busy
-    13) # of times in this domain pull_task() was called when busy
-    14) # of times in this domain pull_task() was called even though the
-        target task was cache-hot when busy
-    15) # of times in this domain sched_balance_rq() was called but did not
-        find a busier queue while the cpu was busy
-    16) # of times in this domain a busier queue was found while the cpu
-        was busy but no busier group was found
 
-    17) # of times in this domain sched_balance_rq() was called when the
-        cpu was just becoming idle
-    18) # of times in this domain sched_balance_rq() checked but found the
+    23) # of times in this domain sched_balance_rq() was called when the
+        cpu was just becoming idle
+    24) # of times in this domain sched_balance_rq() checked but found the
         load did not require balancing when the cpu was just becoming idle
-    19) # of times in this domain sched_balance_rq() tried to move one or more
+    25) # of times in this domain sched_balance_rq() tried to move one or more
         tasks and failed, when the cpu was just becoming idle
-    20) sum of imbalances discovered (if any) with each call to
-        sched_balance_rq() in this domain when the cpu was just becoming idle
-    21) # of times in this domain pull_task() was called when newly idle
-    22) # of times in this domain pull_task() was called even though the
+    26) Total imbalance in load when the cpu was just becoming idle
+    27) Total imbalance in utilization when the cpu was just becoming idle
+    28) Total imbalance in number of tasks when the cpu was just becoming idle
+    29) Total imbalance due to misfit tasks when the cpu was just becoming idle
+    30) # of times in this domain pull_task() was called when newly idle
+    31) # of times in this domain pull_task() was called even though the
         target task was cache-hot when just becoming idle
-    23) # of times in this domain sched_balance_rq() was called but did not
+    32) # of times in this domain sched_balance_rq() was called but did not
         find a busier queue while the cpu was just becoming idle
-    24) # of times in this domain a busier queue was found while the cpu
+    33) # of times in this domain a busier queue was found while the cpu
         was just becoming idle but no busier group was found
 
    Next three are active_load_balance() statistics:
 
-    25) # of times active_load_balance() was called
-    26) # of times active_load_balance() tried to move a task and failed
-    27) # of times active_load_balance() successfully moved a task
+    34) # of times active_load_balance() was called
+    35) # of times active_load_balance() tried to move a task and failed
+    36) # of times active_load_balance() successfully moved a task
 
    Next three are sched_balance_exec() statistics:
 
-    28) sbe_cnt is not used
-    29) sbe_balanced is not used
-    30) sbe_pushed is not used
+    37) sbe_cnt is not used
+    38) sbe_balanced is not used
+    39) sbe_pushed is not used
 
    Next three are sched_balance_fork() statistics:
 
-    31) sbf_cnt is not used
-    32) sbf_balanced is not used
-    33) sbf_pushed is not used
+    40) sbf_cnt is not used
+    41) sbf_balanced is not used
+    42) sbf_pushed is not used
 
    Next three are try_to_wake_up() statistics:
 
-    34) # of times in this domain try_to_wake_up() awoke a task that
+    43) # of times in this domain try_to_wake_up() awoke a task that
         last ran on a different cpu in this domain
-    35) # of times in this domain try_to_wake_up() moved a task to the
+    44) # of times in this domain try_to_wake_up() moved a task to the
         waking cpu because it was cache-cold on its own cpu anyway
-    36) # of times in this domain try_to_wake_up() started passive balancing
+    45) # of times in this domain try_to_wake_up() started passive balancing
 
 /proc/<pid>/schedstat
 ---------------------
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 5f563965976c..4346fd81c31f 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -103,7 +103,7 @@ void __update_stats_enqueue_sleeper(struct rq *rq, struct task_struct *p,
  * Bump this up when changing the output format or the meaning of an existing
  * format, so that tools can adapt (or abort)
  */
-#define SCHEDSTAT_VERSION 16
+#define SCHEDSTAT_VERSION 17
 
 static int show_schedstat(struct seq_file *seq, void *v)
 {
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH 3/6] sched: Report the different kinds of imbalances in /proc/schedstat
  2024-12-18  4:36 ` [PATCH 3/6] sched: Report the different kinds of imbalances in /proc/schedstat Swapnil Sapkal
@ 2024-12-18 11:40   ` Peter Zijlstra
  2024-12-19  6:08     ` Sapkal, Swapnil
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2024-12-18 11:40 UTC (permalink / raw)
  To: Swapnil Sapkal
  Cc: mingo, juri.lelli, vincent.guittot, corbet, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, iamjoonsoo.kim, qyousef,
	sshegde, alexs, lukasz.luba, gautham.shenoy, kprateek.nayak,
	ravi.bangoria, linux-kernel, doc

On Wed, Dec 18, 2024 at 04:36:26AM +0000, Swapnil Sapkal wrote:

> +static void update_lb_imbalance_stat(struct lb_env *env, struct sched_domain *sd,
> +				     enum cpu_idle_type idle)
> +{
> +	switch (env->migration_type) {
> +	case migrate_load:
> +		schedstat_add(sd->lb_imbalance_load[idle], env->imbalance);
> +		break;
> +	case migrate_util:
> +		schedstat_add(sd->lb_imbalance_util[idle], env->imbalance);
> +		break;
> +	case migrate_task:
> +		schedstat_add(sd->lb_imbalance_task[idle], env->imbalance);
> +		break;
> +	case migrate_misfit:
> +		schedstat_add(sd->lb_imbalance_misfit[idle], env->imbalance);
> +		break;
> +	}
> +}


Can you please write that like:

	if (!schedstat_enabled())
		return;

	switch () {
	case ...
		__schedstat_add();
	}

It makes no sense to have 4 copies of schedstat_enabled() inside the
switch statement -- esp. since afaik the compilers aren't able to CSE
static keys :/


* Re: [PATCH 3/6] sched: Report the different kinds of imbalances in /proc/schedstat
  2024-12-18 11:40   ` Peter Zijlstra
@ 2024-12-19  6:08     ` Sapkal, Swapnil
  0 siblings, 0 replies; 9+ messages in thread
From: Sapkal, Swapnil @ 2024-12-19  6:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, vincent.guittot, corbet, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, iamjoonsoo.kim, qyousef,
	sshegde, alexs, lukasz.luba, gautham.shenoy, kprateek.nayak,
	ravi.bangoria, linux-kernel

Hello Peter,

Thanks for the review.

On 12/18/2024 5:10 PM, Peter Zijlstra wrote:
> On Wed, Dec 18, 2024 at 04:36:26AM +0000, Swapnil Sapkal wrote:
> 
>> +static void update_lb_imbalance_stat(struct lb_env *env, struct sched_domain *sd,
>> +				     enum cpu_idle_type idle)
>> +{
>> +	switch (env->migration_type) {
>> +	case migrate_load:
>> +		schedstat_add(sd->lb_imbalance_load[idle], env->imbalance);
>> +		break;
>> +	case migrate_util:
>> +		schedstat_add(sd->lb_imbalance_util[idle], env->imbalance);
>> +		break;
>> +	case migrate_task:
>> +		schedstat_add(sd->lb_imbalance_task[idle], env->imbalance);
>> +		break;
>> +	case migrate_misfit:
>> +		schedstat_add(sd->lb_imbalance_misfit[idle], env->imbalance);
>> +		break;
>> +	}
>> +}
> 
> 
> Can you please write that like:
> 
> 	if (!schedstat_enabled())
> 		return;
> 
> 	switch () {
> 	case ...
> 		__schedstat_add();
> 	}
> 
> It makes no sense to have 4 copies of schedstat_enabled() inside the
> switch statement -- esp. since afaik the compilers aren't able to CSE
> static keys :/

This makes sense. I will update this change in v2.

--
Thanks and Regards,
Swapnil


end of thread, other threads:[~2024-12-19  6:08 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-12-18  4:36 [PATCH 0/6] Fixes and improvements in /proc/schedstat Swapnil Sapkal
2024-12-18  4:36 ` [PATCH 1/6] sched/fair: Fix value reported by hot tasks pulled " Swapnil Sapkal
2024-12-18  4:36 ` [PATCH 2/6] sched/fair: Cleanup in migrate_degrades_locality() to improve readability Swapnil Sapkal
2024-12-18  4:36 ` [PATCH 3/6] sched: Report the different kinds of imbalances in /proc/schedstat Swapnil Sapkal
2024-12-18 11:40   ` Peter Zijlstra
2024-12-19  6:08     ` Sapkal, Swapnil
2024-12-18  4:36 ` [PATCH 4/6] sched: Move sched domain name out of CONFIG_SCHED_DEBUG Swapnil Sapkal
2024-12-18  4:36 ` [PATCH 5/6] sched/stats: Print domain name in /proc/schedstat Swapnil Sapkal
2024-12-18  4:36 ` [PATCH 6/6] docs: Update Schedstat version to 17 Swapnil Sapkal
