* [PATCH v2 0/3] Fix and cleanup and extend cpu.stat
@ 2025-01-25 5:25 Abel Wu
2025-01-25 5:25 ` [PATCH v2 1/3] cgroup/rstat: Fix forceidle time in cpu.stat Abel Wu
` (4 more replies)
0 siblings, 5 replies; 20+ messages in thread
From: Abel Wu @ 2025-01-25 5:25 UTC (permalink / raw)
To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Andrew Morton, Yury Norov, Abel Wu,
Thomas Gleixner, Bitao Hu, Chen Ridong
Cc: open list:CONTROL GROUP (CGROUP), open list:DOCUMENTATION,
open list
Patch 1: fixes an issue where forceidle time can be inconsistent with
other cputimes.
Patch 2: cleans up the #ifdef mess in cpu.stat.
Patch 3: extends run_delay accounting to cgroups to show how severely
a cgroup is stalled.
v2:
- Fix internal function naming. (Michal Koutny)
Abel Wu (3):
cgroup/rstat: Fix forceidle time in cpu.stat
cgroup/rstat: Cleanup cpu.stat once for all
cgroup/rstat: Add run_delay accounting for cgroups
Documentation/admin-guide/cgroup-v2.rst | 1 +
include/linux/cgroup-defs.h | 3 +
include/linux/kernel_stat.h | 14 +++++
kernel/cgroup/rstat.c | 75 ++++++++++++++++---------
kernel/sched/cputime.c | 12 ++++
kernel/sched/stats.h | 2 +
6 files changed, 79 insertions(+), 28 deletions(-)
--
2.37.3
* [PATCH v2 1/3] cgroup/rstat: Fix forceidle time in cpu.stat
2025-01-25 5:25 [PATCH v2 0/3] Fix and cleanup and extend cpu.stat Abel Wu
@ 2025-01-25 5:25 ` Abel Wu
2025-01-25 5:25 ` [PATCH v2 2/3] cgroup/rstat: Cleanup cpu.stat once for all Abel Wu
` (3 subsequent siblings)
4 siblings, 0 replies; 20+ messages in thread
From: Abel Wu @ 2025-01-25 5:25 UTC (permalink / raw)
To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Yury Norov, Bitao Hu, Abel Wu, Andrew Morton,
Thomas Gleixner, Chen Ridong
Cc: open list:CONTROL GROUP (CGROUP), open list:DOCUMENTATION,
open list
Commit b824766504e4 ("cgroup/rstat: add force idle show helper")
retrieves forceidle_time outside of cgroup_rstat_lock for non-root
cgroups, which can potentially be inconsistent with the other stats.
Rather than reverting that commit, fix it in a way that retains the
effort of cleaning up the #ifdef mess.
Fixes: b824766504e4 ("cgroup/rstat: add force idle show helper")
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
kernel/cgroup/rstat.c | 29 +++++++++++++----------------
1 file changed, 13 insertions(+), 16 deletions(-)
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 5877974ece92..c2784c317cdd 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -613,36 +613,33 @@ static void cgroup_force_idle_show(struct seq_file *seq, struct cgroup_base_stat
void cgroup_base_stat_cputime_show(struct seq_file *seq)
{
struct cgroup *cgrp = seq_css(seq)->cgroup;
- u64 usage, utime, stime, ntime;
+ struct cgroup_base_stat bstat;
if (cgroup_parent(cgrp)) {
cgroup_rstat_flush_hold(cgrp);
- usage = cgrp->bstat.cputime.sum_exec_runtime;
+ bstat = cgrp->bstat;
cputime_adjust(&cgrp->bstat.cputime, &cgrp->prev_cputime,
- &utime, &stime);
- ntime = cgrp->bstat.ntime;
+ &bstat.cputime.utime, &bstat.cputime.stime);
cgroup_rstat_flush_release(cgrp);
} else {
- /* cgrp->bstat of root is not actually used, reuse it */
- root_cgroup_cputime(&cgrp->bstat);
- usage = cgrp->bstat.cputime.sum_exec_runtime;
- utime = cgrp->bstat.cputime.utime;
- stime = cgrp->bstat.cputime.stime;
- ntime = cgrp->bstat.ntime;
+ root_cgroup_cputime(&bstat);
}
- do_div(usage, NSEC_PER_USEC);
- do_div(utime, NSEC_PER_USEC);
- do_div(stime, NSEC_PER_USEC);
- do_div(ntime, NSEC_PER_USEC);
+ do_div(bstat.cputime.sum_exec_runtime, NSEC_PER_USEC);
+ do_div(bstat.cputime.utime, NSEC_PER_USEC);
+ do_div(bstat.cputime.stime, NSEC_PER_USEC);
+ do_div(bstat.ntime, NSEC_PER_USEC);
seq_printf(seq, "usage_usec %llu\n"
"user_usec %llu\n"
"system_usec %llu\n"
"nice_usec %llu\n",
- usage, utime, stime, ntime);
+ bstat.cputime.sum_exec_runtime,
+ bstat.cputime.utime,
+ bstat.cputime.stime,
+ bstat.ntime);
- cgroup_force_idle_show(seq, &cgrp->bstat);
+ cgroup_force_idle_show(seq, &bstat);
}
/* Add bpf kfuncs for cgroup_rstat_updated() and cgroup_rstat_flush() */
--
2.37.3
* [PATCH v2 2/3] cgroup/rstat: Cleanup cpu.stat once for all
2025-01-25 5:25 [PATCH v2 0/3] Fix and cleanup and extend cpu.stat Abel Wu
2025-01-25 5:25 ` [PATCH v2 1/3] cgroup/rstat: Fix forceidle time in cpu.stat Abel Wu
@ 2025-01-25 5:25 ` Abel Wu
2025-01-27 20:17 ` Tejun Heo
2025-01-25 5:25 ` [PATCH v2 3/3] cgroup/rstat: Add run_delay accounting for cgroups Abel Wu
` (2 subsequent siblings)
4 siblings, 1 reply; 20+ messages in thread
From: Abel Wu @ 2025-01-25 5:25 UTC (permalink / raw)
To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Andrew Morton, Bitao Hu, Yury Norov, Abel Wu,
Thomas Gleixner, Chen Ridong
Cc: open list:CONTROL GROUP (CGROUP), open list:DOCUMENTATION,
open list
There have been efforts like b824766504e4 ("cgroup/rstat: add force idle show helper")
to escape from the #ifdef hell, and new stats could be added in the
future, so let's clean it up once and for all.
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
kernel/cgroup/rstat.c | 47 ++++++++++++++++++++++++-------------------
1 file changed, 26 insertions(+), 21 deletions(-)
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index c2784c317cdd..dc6acab00d69 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -599,21 +599,39 @@ static void root_cgroup_cputime(struct cgroup_base_stat *bstat)
}
}
+static struct bstat_entry {
+ const char *name;
+ const int offset;
+} bstats[] = {
+#define BSTAT_ENTRY(name, field) \
+ { name, offsetof(struct cgroup_base_stat, field) }
+ BSTAT_ENTRY("usage_usec", cputime.sum_exec_runtime),
+ BSTAT_ENTRY("user_usec", cputime.utime),
+ BSTAT_ENTRY("system_usec", cputime.stime),
+ BSTAT_ENTRY("nice_usec", ntime),
+#ifdef CONFIG_SCHED_CORE
+ BSTAT_ENTRY("core_sched.force_idle_usec", forceidle_sum),
+#endif
+ { NULL } /* must be at end */
+#undef BSTAT_ENTRY
+};
-static void cgroup_force_idle_show(struct seq_file *seq, struct cgroup_base_stat *bstat)
+static void cgroup_bstat_entry_show(struct seq_file *seq,
+ struct cgroup_base_stat *bstat,
+ struct bstat_entry *entry)
{
-#ifdef CONFIG_SCHED_CORE
- u64 forceidle_time = bstat->forceidle_sum;
+ u64 *val;
- do_div(forceidle_time, NSEC_PER_USEC);
- seq_printf(seq, "core_sched.force_idle_usec %llu\n", forceidle_time);
-#endif
+ val = (void *)bstat + entry->offset;
+ do_div(*val, NSEC_PER_USEC);
+ seq_printf(seq, "%s %llu\n", entry->name, *val);
}
void cgroup_base_stat_cputime_show(struct seq_file *seq)
{
struct cgroup *cgrp = seq_css(seq)->cgroup;
struct cgroup_base_stat bstat;
+ struct bstat_entry *e;
if (cgroup_parent(cgrp)) {
cgroup_rstat_flush_hold(cgrp);
@@ -625,21 +643,8 @@ void cgroup_base_stat_cputime_show(struct seq_file *seq)
root_cgroup_cputime(&bstat);
}
- do_div(bstat.cputime.sum_exec_runtime, NSEC_PER_USEC);
- do_div(bstat.cputime.utime, NSEC_PER_USEC);
- do_div(bstat.cputime.stime, NSEC_PER_USEC);
- do_div(bstat.ntime, NSEC_PER_USEC);
-
- seq_printf(seq, "usage_usec %llu\n"
- "user_usec %llu\n"
- "system_usec %llu\n"
- "nice_usec %llu\n",
- bstat.cputime.sum_exec_runtime,
- bstat.cputime.utime,
- bstat.cputime.stime,
- bstat.ntime);
-
- cgroup_force_idle_show(seq, &bstat);
+ for (e = bstats; e->name; e++)
+ cgroup_bstat_entry_show(seq, &bstat, e);
}
/* Add bpf kfuncs for cgroup_rstat_updated() and cgroup_rstat_flush() */
--
2.37.3
* [PATCH v2 3/3] cgroup/rstat: Add run_delay accounting for cgroups
2025-01-25 5:25 [PATCH v2 0/3] Fix and cleanup and extend cpu.stat Abel Wu
2025-01-25 5:25 ` [PATCH v2 1/3] cgroup/rstat: Fix forceidle time in cpu.stat Abel Wu
2025-01-25 5:25 ` [PATCH v2 2/3] cgroup/rstat: Cleanup cpu.stat once for all Abel Wu
@ 2025-01-25 5:25 ` Abel Wu
2025-01-27 14:10 ` Michal Koutný
2025-02-03 8:11 ` [PATCH v2 0/3] Fix and cleanup and extend cpu.stat Abel Wu
2025-02-19 2:34 ` Abel Wu
4 siblings, 1 reply; 20+ messages in thread
From: Abel Wu @ 2025-01-25 5:25 UTC (permalink / raw)
To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Thomas Gleixner, Yury Norov, Abel Wu,
Andrew Morton, Bitao Hu, Chen Ridong
Cc: open list:CONTROL GROUP (CGROUP), open list:DOCUMENTATION,
open list
Per-task and per-cpu run_delay are already tracked via
t->sched_info.run_delay and rq->rq_sched_info.run_delay respectively.
Extend this accounting to cgroups as well.
The PSI indicator, the "some" value of cpu.pressure, loses insight into
how severely a cgroup is stalled: whether 100 tasks or just 1 task is
stalled at a given point makes no difference to the "some" pressure.
IOW "some" is a flat value that is not weighted by the severity (e.g.
the number of stalled tasks).
Only cgroup v2 is supported. Like the task accounting, the cgroup
accounting requires CONFIG_SCHED_INFO to be enabled.
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
Documentation/admin-guide/cgroup-v2.rst | 1 +
include/linux/cgroup-defs.h | 3 +++
include/linux/kernel_stat.h | 14 ++++++++++++++
kernel/cgroup/rstat.c | 17 +++++++++++++++++
kernel/sched/cputime.c | 12 ++++++++++++
kernel/sched/stats.h | 2 ++
6 files changed, 49 insertions(+)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 315ede811c9d..440c3800c49c 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1100,6 +1100,7 @@ All time durations are in microseconds.
- usage_usec
- user_usec
- system_usec
+ - run_delay_usec (requires CONFIG_SCHED_INFO)
and the following five when the controller is enabled:
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 1b20d2d8ef7c..287366e60414 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -328,6 +328,9 @@ struct cgroup_base_stat {
u64 forceidle_sum;
#endif
u64 ntime;
+#ifdef CONFIG_SCHED_INFO
+ u64 run_delay;
+#endif
};
/*
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index b97ce2df376f..256b1a55de62 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -31,6 +31,15 @@ enum cpu_usage_stat {
CPUTIME_FORCEIDLE,
#endif
NR_STATS,
+
+#ifdef CONFIG_SCHED_INFO
+ /*
+ * Instead of cputime, run_delay is tracked through
+ * sched_info by task and rq, so there is no need to
+ * enlarge the cpustat[] array.
+ */
+ CPUTIME_RUN_DELAY,
+#endif
};
struct kernel_cpustat {
@@ -141,4 +150,9 @@ extern void account_idle_ticks(unsigned long ticks);
extern void __account_forceidle_time(struct task_struct *tsk, u64 delta);
#endif
+#ifdef CONFIG_SCHED_INFO
+extern void account_run_delay_time(struct task_struct *tsk, u64 delta);
+extern u64 get_cpu_run_delay(int cpu);
+#endif
+
#endif /* _LINUX_KERNEL_STAT_H */
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index dc6acab00d69..c7f9397a714e 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -445,6 +445,9 @@ static void cgroup_base_stat_add(struct cgroup_base_stat *dst_bstat,
dst_bstat->forceidle_sum += src_bstat->forceidle_sum;
#endif
dst_bstat->ntime += src_bstat->ntime;
+#ifdef CONFIG_SCHED_INFO
+ dst_bstat->run_delay += src_bstat->run_delay;
+#endif
}
static void cgroup_base_stat_sub(struct cgroup_base_stat *dst_bstat,
@@ -457,6 +460,9 @@ static void cgroup_base_stat_sub(struct cgroup_base_stat *dst_bstat,
dst_bstat->forceidle_sum -= src_bstat->forceidle_sum;
#endif
dst_bstat->ntime -= src_bstat->ntime;
+#ifdef CONFIG_SCHED_INFO
+ dst_bstat->run_delay -= src_bstat->run_delay;
+#endif
}
static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
@@ -551,6 +557,11 @@ void __cgroup_account_cputime_field(struct cgroup *cgrp,
case CPUTIME_FORCEIDLE:
rstatc->bstat.forceidle_sum += delta_exec;
break;
+#endif
+#ifdef CONFIG_SCHED_INFO
+ case CPUTIME_RUN_DELAY:
+ rstatc->bstat.run_delay += delta_exec;
+ break;
#endif
default:
break;
@@ -596,6 +607,9 @@ static void root_cgroup_cputime(struct cgroup_base_stat *bstat)
bstat->forceidle_sum += cpustat[CPUTIME_FORCEIDLE];
#endif
bstat->ntime += cpustat[CPUTIME_NICE];
+#ifdef CONFIG_SCHED_INFO
+ bstat->run_delay += get_cpu_run_delay(i);
+#endif
}
}
@@ -611,6 +625,9 @@ static struct bstat_entry {
BSTAT_ENTRY("nice_usec", ntime),
#ifdef CONFIG_SCHED_CORE
BSTAT_ENTRY("core_sched.force_idle_usec", forceidle_sum),
+#endif
+#ifdef CONFIG_SCHED_INFO
+ BSTAT_ENTRY("run_delay_usec", run_delay),
#endif
{ NULL } /* must be at end */
#undef BSTAT_ENTRY
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 5d9143dd0879..e6be57cdb54e 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -243,6 +243,18 @@ void __account_forceidle_time(struct task_struct *p, u64 delta)
}
#endif
+#ifdef CONFIG_SCHED_INFO
+void account_run_delay_time(struct task_struct *p, u64 delta)
+{
+ cgroup_account_cputime_field(p, CPUTIME_RUN_DELAY, delta);
+}
+
+u64 get_cpu_run_delay(int cpu)
+{
+ return cpu_rq(cpu)->rq_sched_info.run_delay;
+}
+#endif
+
/*
* When a guest is interrupted for a longer amount of time, missed clock
* ticks are not redelivered later. Due to that, this function may on
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 6ade91bce63e..b21a2c4b9c54 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -249,6 +249,7 @@ static inline void sched_info_dequeue(struct rq *rq, struct task_struct *t)
t->sched_info.last_queued = 0;
t->sched_info.run_delay += delta;
+ account_run_delay_time(t, delta);
rq_sched_info_dequeue(rq, delta);
}
@@ -271,6 +272,7 @@ static void sched_info_arrive(struct rq *rq, struct task_struct *t)
t->sched_info.last_arrival = now;
t->sched_info.pcount++;
+ account_run_delay_time(t, delta);
rq_sched_info_arrive(rq, delta);
}
--
2.37.3
* Re: [PATCH v2 3/3] cgroup/rstat: Add run_delay accounting for cgroups
2025-01-25 5:25 ` [PATCH v2 3/3] cgroup/rstat: Add run_delay accounting for cgroups Abel Wu
@ 2025-01-27 14:10 ` Michal Koutný
2025-01-29 4:48 ` Abel Wu
0 siblings, 1 reply; 20+ messages in thread
From: Michal Koutný @ 2025-01-27 14:10 UTC (permalink / raw)
To: Abel Wu
Cc: Tejun Heo, Johannes Weiner, Jonathan Corbet, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Thomas Gleixner, Yury Norov, Andrew Morton, Bitao Hu, Chen Ridong,
open list:CONTROL GROUP (CGROUP), open list:DOCUMENTATION,
open list
Hello.
On Sat, Jan 25, 2025 at 01:25:12PM +0800, Abel Wu <wuyun.abel@bytedance.com> wrote:
> The per-task and per-cpu accounting have already been tracked by
> t->sched_info.run_delay and rq->rq_sched_info.run_delay respectively.
> Extends this to also include cgroups.
>
> The PSI indicator, "some" of cpu.pressure, loses the insight into how
> severely that cgroup is stalled. Say 100 tasks or just 1 task that gets
> stalled at a certain point will show no difference in "some" pressure.
> IOW "some" is a flat value that not weighted by the severity (e.g. # of
> tasks).
IIUC below are three examples of when "some" tasks are waiting for CPU:
a)
t1 |----|
t2 |xx--|
b)
t1 |----|
t2 |x---|
t3 |-x--|
c)
t1 |----|
t2 |xx--|
t3 |xx--|
(- means runnable on CPU, x means runnable waiting on RQ)
Which pair from a), b), c) is indistinguishable via PSI? (Or can you
please add your own illustrative example?)
And how does per-cgroup run_delay distinguish those?
Thanks,
Michal
* Re: [PATCH v2 2/3] cgroup/rstat: Cleanup cpu.stat once for all
2025-01-25 5:25 ` [PATCH v2 2/3] cgroup/rstat: Cleanup cpu.stat once for all Abel Wu
@ 2025-01-27 20:17 ` Tejun Heo
2025-01-29 4:47 ` Abel Wu
0 siblings, 1 reply; 20+ messages in thread
From: Tejun Heo @ 2025-01-27 20:17 UTC (permalink / raw)
To: Abel Wu
Cc: Johannes Weiner, Michal Koutný, Jonathan Corbet, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Andrew Morton, Bitao Hu, Yury Norov, Thomas Gleixner, Chen Ridong,
open list:CONTROL GROUP (CGROUP), open list:DOCUMENTATION,
open list
On Sat, Jan 25, 2025 at 01:25:11PM +0800, Abel Wu wrote:
> There were efforts like b824766504e4 ("cgroup/rstat: add force idle show helper")
> to escape from #ifdef hells, and there could be new stats coming out in
> the future, let's clean it up once for all.
>
> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
> ---
> kernel/cgroup/rstat.c | 47 ++++++++++++++++++++++++-------------------
> 1 file changed, 26 insertions(+), 21 deletions(-)
Is this materially better? The existing code has ifdef in one place which
the new code can't avoid. The new code is more complex and has more lines.
Does the balance get better with additions of new entries?
Thanks.
--
tejun
* Re: Re: [PATCH v2 2/3] cgroup/rstat: Cleanup cpu.stat once for all
2025-01-27 20:17 ` Tejun Heo
@ 2025-01-29 4:47 ` Abel Wu
0 siblings, 0 replies; 20+ messages in thread
From: Abel Wu @ 2025-01-29 4:47 UTC (permalink / raw)
To: Tejun Heo
Cc: Johannes Weiner, Michal Koutný, Jonathan Corbet, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Andrew Morton, Bitao Hu, Yury Norov, Thomas Gleixner, Chen Ridong,
open list:CONTROL GROUP (CGROUP), open list:DOCUMENTATION,
open list
Hi Tejun,
On 1/28/25 4:17 AM, Tejun Heo Wrote:
> On Sat, Jan 25, 2025 at 01:25:11PM +0800, Abel Wu wrote:
>> There were efforts like b824766504e4 ("cgroup/rstat: add force idle show helper")
>> to escape from #ifdef hells, and there could be new stats coming out in
>> the future, let's clean it up once for all.
>>
>> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
>> ---
>> kernel/cgroup/rstat.c | 47 ++++++++++++++++++++++++-------------------
>> 1 file changed, 26 insertions(+), 21 deletions(-)
>
> Is this materially better? The existing code has ifdef in one place which
> the new code can't avoid.
Indeed, the number of #ifdefs will stay unchanged, but they will be
folded into one place inside the bstats[] array, much like in the
definition of struct cgroup_base_stat, which IMHO won't hurt
readability.
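For reference, that definition is roughly the following (at this
series' base, i.e. before patch 3 adds run_delay):

struct cgroup_base_stat {
	struct task_cputime cputime;

#ifdef CONFIG_SCHED_CORE
	u64 forceidle_sum;
#endif
	u64 ntime;
};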
> The new code is more complex and has more lines.
> Does the balance get better with additions of new entries?
The line diff is 5, and 4 of those lines are for readability. Adding
one more field to cpu.stat will take 1 or 3 lines (without or with an
#ifdef respectively), compared to 8 or 10 lines without this cleanup.
So the balance gets better as cpu.stat extends, and it also avoids
duplicating the show code for each field.
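E.g. a hypothetical new field (call it "foo") would only need a new
line in the bstats[] table:

	BSTAT_ENTRY("foo_usec", foo),

or, if it is config-dependent:

#ifdef CONFIG_FOO
	BSTAT_ENTRY("foo_usec", foo),
#endif

plus, of course, the field itself in struct cgroup_base_stat and the
accounting that fills it.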
Thanks,
Abel
* Re: [PATCH v2 3/3] cgroup/rstat: Add run_delay accounting for cgroups
2025-01-27 14:10 ` Michal Koutný
@ 2025-01-29 4:48 ` Abel Wu
2025-02-10 15:38 ` Michal Koutný
0 siblings, 1 reply; 20+ messages in thread
From: Abel Wu @ 2025-01-29 4:48 UTC (permalink / raw)
To: Michal Koutný
Cc: Tejun Heo, Johannes Weiner, Jonathan Corbet, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Thomas Gleixner, Yury Norov, Andrew Morton, Bitao Hu, Chen Ridong,
open list:CONTROL GROUP (CGROUP), open list:DOCUMENTATION,
open list
Hi Michal,
On 1/27/25 10:10 PM, Michal Koutný Wrote:
> Hello.
>
> On Sat, Jan 25, 2025 at 01:25:12PM +0800, Abel Wu <wuyun.abel@bytedance.com> wrote:
>> The per-task and per-cpu accounting have already been tracked by
>> t->sched_info.run_delay and rq->rq_sched_info.run_delay respectively.
>> Extends this to also include cgroups.
>>
>> The PSI indicator, "some" of cpu.pressure, loses the insight into how
>> severely that cgroup is stalled. Say 100 tasks or just 1 task that gets
>> stalled at a certain point will show no difference in "some" pressure.
>> IOW "some" is a flat value that not weighted by the severity (e.g. # of
>> tasks).
>
> IIUC below are three examples of when "some" tasks are waiting for CPU:
>
> a)
> t1 |----|
> t2 |xx--|
>
> b)
> t1 |----|
> t2 |x---|
> t3 |-x--|
>
> c)
> t1 |----|
> t2 |xx--|
> t3 |xx--|
>
> (- means runnable on CPU, x means runnable waiting on RQ)
>
> Which pair from a), b), c) is indistinguishable via PSI? (Or can you
> please add your illustrative example?)
PSI tracks stall time for each cpu, and
tSOME[cpu] = time(nr_delayed_tasks[cpu] != 0)
which turns nr_delayed_tasks[cpu] into a boolean value, hence losing
insight into how severely this task group is stalled on that cpu.
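To make it concrete with your diagrams (reading them as 4 equal ticks,
with all of the waiting happening on one cpu's RQ):

	a) tSOME = 2 ticks (t2 delayed in ticks 1-2),    summed run_delay = 2 ticks
	b) tSOME = 2 ticks (t2 in tick 1, t3 in tick 2), summed run_delay = 2 ticks
	c) tSOME = 2 ticks (t2 and t3 in ticks 1-2),     summed run_delay = 4 ticks

So under that reading none of the pairs can be told apart by tSOME
alone, while the summed run_delay separates c) from a) and b).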
Thanks,
Abel
* Re: [PATCH v2 0/3] Fix and cleanup and extend cpu.stat
2025-01-25 5:25 [PATCH v2 0/3] Fix and cleanup and extend cpu.stat Abel Wu
` (2 preceding siblings ...)
2025-01-25 5:25 ` [PATCH v2 3/3] cgroup/rstat: Add run_delay accounting for cgroups Abel Wu
@ 2025-02-03 8:11 ` Abel Wu
2025-02-04 20:46 ` Tejun Heo
2025-02-19 2:34 ` Abel Wu
4 siblings, 1 reply; 20+ messages in thread
From: Abel Wu @ 2025-02-03 8:11 UTC (permalink / raw)
To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Andrew Morton, Yury Norov, Thomas Gleixner,
Bitao Hu, Chen Ridong
Cc: open list:CONTROL GROUP (CGROUP), open list:DOCUMENTATION,
open list
Ping :)
On 1/25/25 1:25 PM, Abel Wu Wrote:
> Patch 1: fixes an issue that forceidle time can be inconsistant with
> other cputimes.
>
> Patch 2: cleans up the #ifdef mess in cpu.stat.
>
> Patch 3: extend run_delay accounting to cgroup to show how severely
> a cgroup is stalled.
>
> v2:
> - Fix internal function naming. (Michal Koutny)
>
> Abel Wu (3):
> cgroup/rstat: Fix forceidle time in cpu.stat
> cgroup/rstat: Cleanup cpu.stat once for all
> cgroup/rstat: Add run_delay accounting for cgroups
>
> Documentation/admin-guide/cgroup-v2.rst | 1 +
> include/linux/cgroup-defs.h | 3 +
> include/linux/kernel_stat.h | 14 +++++
> kernel/cgroup/rstat.c | 75 ++++++++++++++++---------
> kernel/sched/cputime.c | 12 ++++
> kernel/sched/stats.h | 2 +
> 6 files changed, 79 insertions(+), 28 deletions(-)
>
* Re: [PATCH v2 0/3] Fix and cleanup and extend cpu.stat
2025-02-03 8:11 ` [PATCH v2 0/3] Fix and cleanup and extend cpu.stat Abel Wu
@ 2025-02-04 20:46 ` Tejun Heo
2025-02-05 2:42 ` Abel Wu
2025-02-10 8:51 ` Abel Wu
0 siblings, 2 replies; 20+ messages in thread
From: Tejun Heo @ 2025-02-04 20:46 UTC (permalink / raw)
To: Abel Wu
Cc: Johannes Weiner, Michal Koutný, Jonathan Corbet, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Andrew Morton, Yury Norov, Thomas Gleixner, Bitao Hu, Chen Ridong,
open list:CONTROL GROUP (CGROUP), open list:DOCUMENTATION,
open list
On Mon, Feb 03, 2025 at 04:11:27PM +0800, Abel Wu wrote:
> Ping :)
>
> On 1/25/25 1:25 PM, Abel Wu Wrote:
> > Patch 1: fixes an issue that forceidle time can be inconsistant with
> > other cputimes.
> >
> > Patch 2: cleans up the #ifdef mess in cpu.stat.
I wasn't sure whether the new code was materially better than the existing.
Can we try without this patch?
> > Patch 3: extend run_delay accounting to cgroup to show how severely
> > a cgroup is stalled.
I'm not necessarily against adding this. Johannes, what do you think?
Thanks.
--
tejun
* Re: Re: [PATCH v2 0/3] Fix and cleanup and extend cpu.stat
2025-02-04 20:46 ` Tejun Heo
@ 2025-02-05 2:42 ` Abel Wu
2025-02-10 8:51 ` Abel Wu
1 sibling, 0 replies; 20+ messages in thread
From: Abel Wu @ 2025-02-05 2:42 UTC (permalink / raw)
To: Tejun Heo
Cc: Johannes Weiner, Michal Koutný, Jonathan Corbet, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Andrew Morton, Yury Norov, Thomas Gleixner, Bitao Hu, Chen Ridong,
open list:CONTROL GROUP (CGROUP), open list:DOCUMENTATION,
open list
On 2/5/25 4:46 AM, Tejun Heo Wrote:
> On Mon, Feb 03, 2025 at 04:11:27PM +0800, Abel Wu wrote:
>> Ping :)
>>
>> On 1/25/25 1:25 PM, Abel Wu Wrote:
>>> Patch 1: fixes an issue that forceidle time can be inconsistant with
>>> other cputimes.
>>>
>>> Patch 2: cleans up the #ifdef mess in cpu.stat.
>
> I wasn't sure whether the new code was materially better than the existing.
> Can we try without this patch?
Sure, will drop this one.
* Re: Re: [PATCH v2 0/3] Fix and cleanup and extend cpu.stat
2025-02-04 20:46 ` Tejun Heo
2025-02-05 2:42 ` Abel Wu
@ 2025-02-10 8:51 ` Abel Wu
1 sibling, 0 replies; 20+ messages in thread
From: Abel Wu @ 2025-02-10 8:51 UTC (permalink / raw)
To: Tejun Heo, Johannes Weiner
Cc: Michal Koutný, Jonathan Corbet, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Andrew Morton,
Yury Norov, Thomas Gleixner, Bitao Hu, Chen Ridong,
open list:CONTROL GROUP (CGROUP), open list:DOCUMENTATION,
open list
On 2/5/25 4:46 AM, Tejun Heo Wrote:
> On Mon, Feb 03, 2025 at 04:11:27PM +0800, Abel Wu wrote:
>> Ping :)
>>
>> On 1/25/25 1:25 PM, Abel Wu Wrote:
>>> Patch 1: fixes an issue that forceidle time can be inconsistant with
>>> other cputimes.
>>>
>>> Patch 2: cleans up the #ifdef mess in cpu.stat.
>
> I wasn't sure whether the new code was materially better than the existing.
> Can we try without this patch?
>
>>> Patch 3: extend run_delay accounting to cgroup to show how severely
>>> a cgroup is stalled.
>
> I'm not necessarily against adding this. Johannes, what do you think?
Hi Johannes, it would be much appreciated if you could take a look at this.
The latest version is:
https://lore.kernel.org/lkml/20250209061322.15260-3-wuyun.abel@bytedance.com/
Thanks!
* Re: [PATCH v2 3/3] cgroup/rstat: Add run_delay accounting for cgroups
2025-01-29 4:48 ` Abel Wu
@ 2025-02-10 15:38 ` Michal Koutný
2025-02-10 16:20 ` Tejun Heo
2025-02-12 15:12 ` Abel Wu
0 siblings, 2 replies; 20+ messages in thread
From: Michal Koutný @ 2025-02-10 15:38 UTC (permalink / raw)
To: Abel Wu
Cc: Tejun Heo, Johannes Weiner, Jonathan Corbet, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Thomas Gleixner, Yury Norov, Andrew Morton, Bitao Hu, Chen Ridong,
open list:CONTROL GROUP (CGROUP), open list:DOCUMENTATION,
open list
Hello Abel (sorry for my delay).
On Wed, Jan 29, 2025 at 12:48:09PM +0800, Abel Wu <wuyun.abel@bytedance.com> wrote:
> PSI tracks stall times for each cpu, and
>
> tSOME[cpu] = time(nr_delayed_tasks[cpu] != 0)
>
> which turns nr_delayed_tasks[cpu] into boolean value, hence loses
> insight into how severely this task group is stalled on this cpu.
Thanks for the example. So the lost information is a kind of group load.
What meaning does it have when there is no group throttling?
Honestly, I can't reason about either PSI.some or Σ run_delay wrt
feedback for resource control. What is slightly bugging me is the
introduction of another stats field before the first one was explored :-)
But if there's information loss with PSI -- could cpu.pressure:some be
removed in favor of Σ run_delay? (The former could be calculated from
the latter if you're right :-p)
(I didn't like the before/after shuffling with enum cpu_usage_stat
NR_STATS but I saw v4 where you tackled that.)
Michal
More context from the previous message: the difference is between a) and c),
or better, with equal lanes:
a')
t1 |----|
t2 |xx--|
t3 |----|
c)
t1 |----|
t2 |xx--|
t3 |xx--|
<-Δt->
run_delay can be calculated independently of cpu.pressure:some
because there is still a difference between a') and c) in terms of total
cpu usage.
Δrun_delay = nr * Δt - Δusage
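E.g. with 4 equal ticks and all three tasks runnable for the whole Δt:

	a') Δusage = 4 + 2 + 4 = 10 ticks  =>  Δrun_delay = 3*4 - 10 = 2 ticks
	c)  Δusage = 4 + 2 + 2 =  8 ticks  =>  Δrun_delay = 3*4 -  8 = 4 ticks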
The challenge is with nr (assuming they're all runnable during Δt), which
would need to be sampled from /sys/kernel/debug/sched/debug. But then
you can get whatever load you want for individual cfs_rqs from there. Hm,
does it even make sense to add up run_delays from different CPUs?
* Re: [PATCH v2 3/3] cgroup/rstat: Add run_delay accounting for cgroups
2025-02-10 15:38 ` Michal Koutný
@ 2025-02-10 16:20 ` Tejun Heo
2025-02-10 18:25 ` Johannes Weiner
2025-02-12 15:14 ` Abel Wu
2025-02-12 15:12 ` Abel Wu
1 sibling, 2 replies; 20+ messages in thread
From: Tejun Heo @ 2025-02-10 16:20 UTC (permalink / raw)
To: Michal Koutný
Cc: Abel Wu, Johannes Weiner, Jonathan Corbet, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Thomas Gleixner, Yury Norov, Andrew Morton, Bitao Hu, Chen Ridong,
open list:CONTROL GROUP (CGROUP), open list:DOCUMENTATION,
open list
On Mon, Feb 10, 2025 at 04:38:56PM +0100, Michal Koutný wrote:
...
> The challenge is with nr (assuming they're all runnable during Δt), that
> would need to be sampled from /sys/kernel/debug/sched/debug. But then
> you can get whatever load for individual cfs_rqs from there. Hm, does it
> even make sense to add up run_delays from different CPUs?
The difficulty in aggregating across CPUs is why some and full pressures are
defined the way they are. Ideally, we'd want full distribution of stall
states across CPUs but both aggregation and presentation become challenging,
so some/full provide the two extremes. Sum of all cpu_delay adds more
incomplete signal on top. I don't know how useful it'd be. At meta, we
depend on PSI a lot when investigating resource problems and we've never
felt the need for the sum time, so that's one data point with the caveat
that usually our focus is on mem and io pressures where some and full
pressure metrics usually seem to provide sufficient information.
As the picture provided by some and full metrics is incomplete, I can
imagine adding the sum being useful. That said, it'd help if Abel could
provide more concrete examples of it being useful. Another thing to consider
is whether we should add this across resources monitored by PSI - cpu, mem
and io.
Thanks.
--
tejun
* Re: [PATCH v2 3/3] cgroup/rstat: Add run_delay accounting for cgroups
2025-02-10 16:20 ` Tejun Heo
@ 2025-02-10 18:25 ` Johannes Weiner
2025-02-12 15:17 ` Abel Wu
2025-02-21 15:36 ` Michal Koutný
2025-02-12 15:14 ` Abel Wu
1 sibling, 2 replies; 20+ messages in thread
From: Johannes Weiner @ 2025-02-10 18:25 UTC (permalink / raw)
To: Tejun Heo
Cc: Michal Koutný, Abel Wu, Jonathan Corbet, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Thomas Gleixner, Yury Norov, Andrew Morton, Bitao Hu, Chen Ridong,
open list:CONTROL GROUP (CGROUP), open list:DOCUMENTATION,
open list
On Mon, Feb 10, 2025 at 06:20:12AM -1000, Tejun Heo wrote:
> On Mon, Feb 10, 2025 at 04:38:56PM +0100, Michal Koutný wrote:
> ...
> > The challenge is with nr (assuming they're all runnable during Δt), that
> > would need to be sampled from /sys/kernel/debug/sched/debug. But then
> > you can get whatever load for individual cfs_rqs from there. Hm, does it
> > even make sense to add up run_delays from different CPUs?
>
> The difficulty in aggregating across CPUs is why some and full pressures are
> defined the way they are. Ideally, we'd want full distribution of stall
> states across CPUs but both aggregation and presentation become challenging,
> so some/full provide the two extremes. Sum of all cpu_delay adds more
> incomplete signal on top. I don't know how useful it'd be. At meta, we
> depend on PSI a lot when investigating resource problems and we've never
> felt the need for the sum time, so that's one data point with the caveat
> that usually our focus is on mem and io pressures where some and full
> pressure metrics usually seem to provide sufficient information.
>
> As the picture provided by some and full metrics is incomplete, I can
> imagine adding the sum being useful. That said, it'd help if Able can
> provide more concrete examples on it being useful. Another thing to consider
> is whether we should add this across resources monitored by PSI - cpu, mem
> and io.
Yes, a more detailed description of the usecase would be helpful.
I'm not exactly sure how the sum of wait times in a cgroup would be
used to gauge load without taking available concurrency into account.
One second of aggregate wait time means something very different if
you have 200 cpus compared to if you have 2.
This is precisely what psi tries to capture. "Some" does provide group
loading information in a sense, but it's a ratio over available
concurrency, and currently capped at 100%. I.e. if you have N cpus,
100% some is "at least N threads waiting at all times." There is a
gradient below that, but not above.
It's conceivable percentages over 100% might be useful, to capture the
degree of contention beyond that. Although like Tejun says, we've not
felt the need for that so far. Whether something is actionable or not
tends to be in the 0-1 range, and beyond that it's just "all bad".
High overload scenarios can also be gauged with tools like runqlat[1],
which give a histogram over individual tasks' delays. We've used this
one extensively to track down issues.
[1] https://github.com/iovisor/bcc/blob/master/tools/runqlat.py
* Re: Re: [PATCH v2 3/3] cgroup/rstat: Add run_delay accounting for cgroups
2025-02-10 15:38 ` Michal Koutný
2025-02-10 16:20 ` Tejun Heo
@ 2025-02-12 15:12 ` Abel Wu
1 sibling, 0 replies; 20+ messages in thread
From: Abel Wu @ 2025-02-12 15:12 UTC (permalink / raw)
To: Michal Koutný
Cc: Tejun Heo, Johannes Weiner, Jonathan Corbet, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Thomas Gleixner, Yury Norov, Andrew Morton, Bitao Hu, Chen Ridong,
open list:CONTROL GROUP (CGROUP), open list:DOCUMENTATION,
open list
On 2/10/25 11:38 PM, Michal Koutný Wrote:
> Hello Abel (sorry for my delay).
>
> On Wed, Jan 29, 2025 at 12:48:09PM +0800, Abel Wu <wuyun.abel@bytedance.com> wrote:
>> PSI tracks stall times for each cpu, and
>>
>> tSOME[cpu] = time(nr_delayed_tasks[cpu] != 0)
>>
>> which turns nr_delayed_tasks[cpu] into boolean value, hence loses
>> insight into how severely this task group is stalled on this cpu.
>
> Thanks for example. So the lost information is kind of a group load.
Exactly.
> What meaning it has when there is no group throttling?
It means how severely this cgroup is interfered with by co-located tasks.
Both psi and run_delay are tracked in (part of) our fleet, and the
spikes usually lead to poor SLI. But we do find circumstances where
run_delay has a better correlation with SLI, due to the method of
stall time accounting mentioned above.
Both are treated as indicators for triggering throttling or evicting
the co-located low-priority jobs.
In fact we also track per-cpu stats (cpu.stat.percpu) for cgroups,
including run_delay, which helps us decide which job should be the
victim and also provides useful info when we diagnose issues.
>
> Honestly, I can't reason neither about PSI.some nor Σ run_delay wrt
> feedback for resource control. What it is slightly bugging me is
> introduction of another stats field before first one was explored :-)
>
> But if there's information loss with PSI -- could cpu.pressure:some be
> removed in favor of Σ run_delay? (The former could be calculated from
> latter if you're right :-p)
It is not my intent to replace cpu.pressure:some with run_delay. The
former provides a normalized value that can be used to compare
different cgroups, while the latter cannot.
>
> (I didn't like the before/after shuffling with enum cpu_usage_stat
> NR_STATS but I saw v4 where you tackled that.)
>
> Michal
>
>
> More context form previous message, the difference is between a) and c),
> or better equal lanes:
>
> a')
> t1 |----|
> t2 |xx--|
> t3 |----|
>
> c)
> t1 |----|
> t2 |xx--|
> t3 |xx--|
>
> <-Δt->
Yes, a) and c) have the same cpu.pressure:some but make different progress.
>
> run_delay can be calculated indepently of cpu.pressure:some
> because there is still difference between a') and c) in terms of total
> cpu usage.
>
> Δrun_delay = nr * Δt - Δusage
>
> The challenge is with nr (assuming they're all runnable during Δt), that
> would need to be sampled from /sys/kernel/debug/sched/debug. But then
> you can get whatever load for individual cfs_rqs from there. Hm, does it
> even make sense to add up run_delays from different CPUs?
Very good question. In our case, this summed value is used as a general
indicator to trigger a strategy that further depends on the raw per-cpu
data provided by cpu.stat.percpu, which implies that what we actually
want is the per-cpu data.
Thanks & Best Regards,
Abel
* Re: Re: [PATCH v2 3/3] cgroup/rstat: Add run_delay accounting for cgroups
2025-02-10 16:20 ` Tejun Heo
2025-02-10 18:25 ` Johannes Weiner
@ 2025-02-12 15:14 ` Abel Wu
1 sibling, 0 replies; 20+ messages in thread
From: Abel Wu @ 2025-02-12 15:14 UTC (permalink / raw)
To: Tejun Heo, Michal Koutný
Cc: Johannes Weiner, Jonathan Corbet, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Thomas Gleixner,
Yury Norov, Andrew Morton, Bitao Hu, Chen Ridong,
open list:CONTROL GROUP (CGROUP), open list:DOCUMENTATION,
open list
On 2/11/25 12:20 AM, Tejun Heo Wrote:
> On Mon, Feb 10, 2025 at 04:38:56PM +0100, Michal Koutný wrote:
> ...
>> The challenge is with nr (assuming they're all runnable during Δt), that
>> would need to be sampled from /sys/kernel/debug/sched/debug. But then
>> you can get whatever load for individual cfs_rqs from there. Hm, does it
>> even make sense to add up run_delays from different CPUs?
>
> The difficulty in aggregating across CPUs is why some and full pressures are
> defined the way they are. Ideally, we'd want full distribution of stall
> states across CPUs but both aggregation and presentation become challenging,
> so some/full provide the two extremes. Sum of all cpu_delay adds more
> incomplete signal on top. I don't know how useful it'd be. At meta, we
> depend on PSI a lot when investigating resource problems and we've never
> felt the need for the sum time, so that's one data point with the caveat
> that usually our focus is on mem and io pressures where some and full
> pressure metrics usually seem to provide sufficient information.
It's interesting, as we also find PSI to be of great use for memory
and io, and never thought of aggregating them across CPUs. With my limited
knowledge, I guess it's because they have a shared global bottleneck. E.g.
a shortage of memory will put all the tasks of that memcg in the same
situation no matter which cpu they are running on. And the io issues are
generally caused by legacy backends with poor performance, that is lower
speed or fewer hwqueues, so tasks still contend with each other outside
the scope of cpus. Scheduling is different though: some cpus can be much
more contended than others due to affinity constraints or something else,
since different cpus have separate bandwidth.
>
> As the picture provided by some and full metrics is incomplete, I can
> imagine adding the sum being useful. That said, it'd help if Able can
> provide more concrete examples on it being useful. Another thing to consider
> is whether we should add this across resources monitored by PSI - cpu, mem
> and io.
Please check my last reply to see our use case, and it would be appreciated
if you could shed some light on it.
Thanks & Best Regards,
Abel
* Re: Re: [PATCH v2 3/3] cgroup/rstat: Add run_delay accounting for cgroups
2025-02-10 18:25 ` Johannes Weiner
@ 2025-02-12 15:17 ` Abel Wu
2025-02-21 15:36 ` Michal Koutný
1 sibling, 0 replies; 20+ messages in thread
From: Abel Wu @ 2025-02-12 15:17 UTC (permalink / raw)
To: Johannes Weiner, Tejun Heo
Cc: Michal Koutný, Jonathan Corbet, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Thomas Gleixner,
Yury Norov, Andrew Morton, Bitao Hu, Chen Ridong,
open list:CONTROL GROUP (CGROUP), open list:DOCUMENTATION,
open list
On 2/11/25 2:25 AM, Johannes Weiner Wrote:
> On Mon, Feb 10, 2025 at 06:20:12AM -1000, Tejun Heo wrote:
>> On Mon, Feb 10, 2025 at 04:38:56PM +0100, Michal Koutný wrote:
>> ...
>>> The challenge is with nr (assuming they're all runnable during Δt), that
>>> would need to be sampled from /sys/kernel/debug/sched/debug. But then
>>> you can get whatever load for individual cfs_rqs from there. Hm, does it
>>> even make sense to add up run_delays from different CPUs?
>>
>> The difficulty in aggregating across CPUs is why some and full pressures are
>> defined the way they are. Ideally, we'd want full distribution of stall
>> states across CPUs but both aggregation and presentation become challenging,
>> so some/full provide the two extremes. Sum of all cpu_delay adds more
>> incomplete signal on top. I don't know how useful it'd be. At meta, we
>> depend on PSI a lot when investigating resource problems and we've never
>> felt the need for the sum time, so that's one data point with the caveat
>> that usually our focus is on mem and io pressures where some and full
>> pressure metrics usually seem to provide sufficient information.
>>
>> As the picture provided by some and full metrics is incomplete, I can
>> imagine adding the sum being useful. That said, it'd help if Able can
>> provide more concrete examples on it being useful. Another thing to consider
>> is whether we should add this across resources monitored by PSI - cpu, mem
>> and io.
>
> Yes, a more detailed description of the usecase would be helpful.
Please check my last reply to see our use case, and it would be appreciated
if you could shed some light on it.
>
> I'm not exactly sure how the sum of wait times in a cgroup would be
> used to gauge load without taking available concurrency into account.
> One second of aggregate wait time means something very different if
> you have 200 cpus compared to if you have 2.
Indeed. So instead of comparing between different cgroups, we only
compare with the past for each cgroup we care about.
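A minimal sketch of that kind of sampling (purely illustrative: it
assumes run_delay_usec ends up in cpu.stat as proposed by patch 3, and
the cgroup path is made up):

/* poll one cgroup's run_delay_usec and print the per-second delta */
#include <stdio.h>
#include <unistd.h>

static unsigned long long read_run_delay(const char *path)
{
	char line[256];
	unsigned long long val = 0;
	FILE *f = fopen(path, "r");

	if (!f)
		return 0;
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "run_delay_usec %llu", &val) == 1)
			break;
	fclose(f);
	return val;
}

int main(void)
{
	/* hypothetical cgroup path, adjust to the job being watched */
	const char *path = "/sys/fs/cgroup/mygroup/cpu.stat";
	unsigned long long prev = read_run_delay(path), cur;

	for (;;) {
		sleep(1);
		cur = read_run_delay(path);
		/* it is the delta over time, not the absolute value, that we act on */
		printf("run_delay delta: %llu us/s\n", cur - prev);
		prev = cur;
	}
}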
>
> This is precisely what psi tries to capture. "Some" does provide group
> loading information in a sense, but it's a ratio over available
> concurrency, and currently capped at 100%. I.e. if you have N cpus,
> 100% some is "at least N threads waiting at all times." There is a
> gradient below that, but not above.
>
> It's conceivable percentages over 100% might be useful, to capture the
> degree of contention beyond that. Although like Tejun says, we've not
> felt the need for that so far. Whether something is actionable or not
> tends to be in the 0-1 range, and beyond that it's just "all bad".
PSI is very useful and we heavily rely on it, especially for mem and
io. The reason for adding run_delay is to capture the piece lost by
cpu.some when it is below 100%. Please check Michal's example.
>
> High overload scenarios can also be gauged with tools like runqlat[1],
> which give a histogram over individual tasks' delays. We've used this
> one extensively to track down issues.
>
> [1] https://github.com/iovisor/bcc/blob/master/tools/runqlat.py
Thanks for mentioning this, but I am not sure whether it is appropriate
to keep it constantly enabled, as we rely on it being realtime to trigger
throttling or eviction of certain jobs.
Thanks & Best Regards,
Abel
* Re: [PATCH v2 0/3] Fix and cleanup and extend cpu.stat
2025-01-25 5:25 [PATCH v2 0/3] Fix and cleanup and extend cpu.stat Abel Wu
` (3 preceding siblings ...)
2025-02-03 8:11 ` [PATCH v2 0/3] Fix and cleanup and extend cpu.stat Abel Wu
@ 2025-02-19 2:34 ` Abel Wu
4 siblings, 0 replies; 20+ messages in thread
From: Abel Wu @ 2025-02-19 2:34 UTC (permalink / raw)
To: Tejun Heo, Johannes Weiner, Michal Koutný, Jonathan Corbet,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Andrew Morton, Yury Norov, Thomas Gleixner,
Bitao Hu, Chen Ridong
Cc: open list:CONTROL GROUP (CGROUP), open list:DOCUMENTATION,
open list
Gentle ping :)
The latest series:
https://lore.kernel.org/lkml/20250209061322.15260-1-wuyun.abel@bytedance.com/
On 1/25/25 1:25 PM, Abel Wu Wrote:
> Patch 1: fixes an issue that forceidle time can be inconsistant with
> other cputimes.
>
> Patch 2: cleans up the #ifdef mess in cpu.stat.
>
> Patch 3: extend run_delay accounting to cgroup to show how severely
> a cgroup is stalled.
>
> v2:
> - Fix internal function naming. (Michal Koutny)
>
> Abel Wu (3):
> cgroup/rstat: Fix forceidle time in cpu.stat
> cgroup/rstat: Cleanup cpu.stat once for all
> cgroup/rstat: Add run_delay accounting for cgroups
>
> Documentation/admin-guide/cgroup-v2.rst | 1 +
> include/linux/cgroup-defs.h | 3 +
> include/linux/kernel_stat.h | 14 +++++
> kernel/cgroup/rstat.c | 75 ++++++++++++++++---------
> kernel/sched/cputime.c | 12 ++++
> kernel/sched/stats.h | 2 +
> 6 files changed, 79 insertions(+), 28 deletions(-)
>
* Re: [PATCH v2 3/3] cgroup/rstat: Add run_delay accounting for cgroups
2025-02-10 18:25 ` Johannes Weiner
2025-02-12 15:17 ` Abel Wu
@ 2025-02-21 15:36 ` Michal Koutný
1 sibling, 0 replies; 20+ messages in thread
From: Michal Koutný @ 2025-02-21 15:36 UTC (permalink / raw)
To: Johannes Weiner
Cc: Tejun Heo, Abel Wu, Jonathan Corbet, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Thomas Gleixner,
Yury Norov, Andrew Morton, Bitao Hu, Chen Ridong,
open list:CONTROL GROUP (CGROUP), open list:DOCUMENTATION,
open list
On Mon, Feb 10, 2025 at 01:25:45PM -0500, Johannes Weiner <hannes@cmpxchg.org> wrote:
> Yes, a more detailed description of the usecase would be helpful.
>
> I'm not exactly sure how the sum of wait times in a cgroup would be
> used to gauge load without taking available concurrency into account.
> One second of aggregate wait time means something very different if
> you have 200 cpus compared to if you have 2.
>
> This is precisely what psi tries to capture. "Some" does provide group
> loading information in a sense, but it's a
>
> ratio over available concurrency,
This comes as a surprise to me (I originally assumed it's only
time(some)/time(interval)).
But I confirm that after actually looking at the avg* values it is over
nr_tasks.
If the value is already normalized by nr_tasks, I'm seeing less of a
benefit of Σ run_delay.
> and currently capped at 100%. I.e. if you have N cpus, 100% some is
> "at least N threads waiting at all times." There is a gradient below
> that, but not above.
Is this a typo? (s/some/full/ or s/at least N/at least 1/)
(Actually, if I correct my thinking with the nr_tasks normalization,
then your statement makes sense. OTOH, what is the difference between
'full' and 'some' at 100%?)
Also I played a bit.
cat >/root/cpu_n.sh <<'EOD'
#!/bin/bash
# Busy-loop worker pinned to the CPU passed as $1.
worker() {
	echo "$BASHPID: starting on $1"
	taskset -c -p $1 $BASHPID
	while true ; do
		true
	done
}

# Spawn one pinned worker per CPU 1..N (N defaults to 1).
for i in $(seq ${1:-1}) ; do
	worker $i &
	pids+=($!)
done
echo pids: ${pids[*]}
wait
EOD
systemd-run -u test.service /root/cpu_n.sh 2
# test.service/cpu.pressure:some is ~0
systemd-run -u pressure.service /root/cpu_n.sh 1
# test.service/cpu.pressure:some settles at ~25%, cpu1 is free, cpu2 half
# test.service/cpu.pressure:full settles at ~25% too(?!), I'd expect 0
^^^^^^^^^^^^
(kernel v6.13)
# pressure.service/cpu.pressure:some settles at ~50%, makes sense
# pressure.service/cpu.pressure:full settles at ~50%, makes sense
Thanks,
Michal