public inbox for cgroups@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/2] Exposing nice CPU usage to userspace
@ 2024-08-30 14:19 Joshua Hahn
  2024-08-30 14:19 ` [PATCH v2 1/2] Tracking cgroup-level niced CPU time Joshua Hahn
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Joshua Hahn @ 2024-08-30 14:19 UTC (permalink / raw)
  To: tj
  Cc: cgroups, hannes, linux-kernel, linux-kselftest, lizefan.x,
	mkoutny, shuah

From: Joshua Hahn <joshua.hahn6@gmail.com>

v1 -> v2: Edited commit messages for clarity.

Niced CPU usage is a metric reported in host-level /prot/stat, but is
not reported in cgroup-level statistics in cpu.stat. However, when a
host contains multiple tasks across different workloads, it becomes
difficult to gauge how much of the task is being spent on niced
processes based on /proc/stat alone, since host-level metrics do not
provide this cgroup-level granularity.

Exposing this metric will allow users to accurately probe the niced CPU
metric for each workload, and make more informed decisions when
directing higher priority tasks.

Joshua Hahn (2):
  Tracking cgroup-level niced CPU time
  Selftests for niced CPU statistics

 include/linux/cgroup-defs.h               |  1 +
 kernel/cgroup/rstat.c                     | 16 ++++-
 tools/testing/selftests/cgroup/test_cpu.c | 72 +++++++++++++++++++++++
 3 files changed, 86 insertions(+), 3 deletions(-)

-- 
2.43.5


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH v2 1/2] Tracking cgroup-level niced CPU time
  2024-08-30 14:19 [PATCH v2 0/2] Exposing nice CPU usage to userspace Joshua Hahn
@ 2024-08-30 14:19 ` Joshua Hahn
  2024-08-30 19:41   ` Tejun Heo
  2024-08-30 14:19 ` [PATCH v2 2/2] Selftests for niced CPU statistics Joshua Hahn
  2024-09-02 15:45 ` [PATCH v2 0/2] Exposing nice CPU usage to userspace Michal Koutný
  2 siblings, 1 reply; 7+ messages in thread
From: Joshua Hahn @ 2024-08-30 14:19 UTC (permalink / raw)
  To: tj
  Cc: cgroups, hannes, linux-kernel, linux-kselftest, lizefan.x,
	mkoutny, shuah

From: Joshua Hahn <joshua.hahn6@gmail.com>

Cgroup-level CPU statistics currently include time spent on
user/system processes, but do not include niced CPU time (despite
already being tracked). This patch exposes niced CPU time to the
userspace, allowing users to get a better understanding of their
hardware limits and can facilitate more informed workload distribution.

A new field 'ntime' is added to struct cgroup_base_stat as opposed to
struct task_cputime to minimize footprint.
---
 include/linux/cgroup-defs.h |  1 +
 kernel/cgroup/rstat.c       | 16 +++++++++++++---
 2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index ae04035b6cbe..a2fcb3db6c52 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -315,6 +315,7 @@ struct cgroup_base_stat {
 #ifdef CONFIG_SCHED_CORE
 	u64 forceidle_sum;
 #endif
+	u64 ntime;
 };
 
 /*
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index a06b45272411..a77ba9a83bab 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -444,6 +444,7 @@ static void cgroup_base_stat_add(struct cgroup_base_stat *dst_bstat,
 #ifdef CONFIG_SCHED_CORE
 	dst_bstat->forceidle_sum += src_bstat->forceidle_sum;
 #endif
+	dst_bstat->ntime += src_bstat->ntime;
 }
 
 static void cgroup_base_stat_sub(struct cgroup_base_stat *dst_bstat,
@@ -455,6 +456,7 @@ static void cgroup_base_stat_sub(struct cgroup_base_stat *dst_bstat,
 #ifdef CONFIG_SCHED_CORE
 	dst_bstat->forceidle_sum -= src_bstat->forceidle_sum;
 #endif
+	dst_bstat->ntime -= src_bstat->ntime;
 }
 
 static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
@@ -535,7 +537,10 @@ void __cgroup_account_cputime_field(struct cgroup *cgrp,
 
 	switch (index) {
 	case CPUTIME_USER:
+		rstatc->bstat.cputime.utime += delta_exec;
+		break;
 	case CPUTIME_NICE:
+		rstatc->bstat.ntime += delta_exec;
 		rstatc->bstat.cputime.utime += delta_exec;
 		break;
 	case CPUTIME_SYSTEM:
@@ -591,6 +596,7 @@ static void root_cgroup_cputime(struct cgroup_base_stat *bstat)
 #ifdef CONFIG_SCHED_CORE
 		bstat->forceidle_sum += cpustat[CPUTIME_FORCEIDLE];
 #endif
+		bstat->ntime += cpustat[CPUTIME_NICE];
 	}
 }
 
@@ -608,13 +614,14 @@ static void cgroup_force_idle_show(struct seq_file *seq, struct cgroup_base_stat
 void cgroup_base_stat_cputime_show(struct seq_file *seq)
 {
 	struct cgroup *cgrp = seq_css(seq)->cgroup;
-	u64 usage, utime, stime;
+	u64 usage, utime, stime, ntime;
 
 	if (cgroup_parent(cgrp)) {
 		cgroup_rstat_flush_hold(cgrp);
 		usage = cgrp->bstat.cputime.sum_exec_runtime;
 		cputime_adjust(&cgrp->bstat.cputime, &cgrp->prev_cputime,
 			       &utime, &stime);
+		ntime = cgrp->bstat.ntime;
 		cgroup_rstat_flush_release(cgrp);
 	} else {
 		/* cgrp->bstat of root is not actually used, reuse it */
@@ -622,16 +629,19 @@ void cgroup_base_stat_cputime_show(struct seq_file *seq)
 		usage = cgrp->bstat.cputime.sum_exec_runtime;
 		utime = cgrp->bstat.cputime.utime;
 		stime = cgrp->bstat.cputime.stime;
+		ntime = cgrp->bstat.ntime;
 	}
 
 	do_div(usage, NSEC_PER_USEC);
 	do_div(utime, NSEC_PER_USEC);
 	do_div(stime, NSEC_PER_USEC);
+	do_div(ntime, NSEC_PER_USEC);
 
 	seq_printf(seq, "usage_usec %llu\n"
 		   "user_usec %llu\n"
-		   "system_usec %llu\n",
-		   usage, utime, stime);
+			 "system_usec %llu\n"
+			 "nice_usec %llu\n",
+			 usage, utime, stime, ntime);
 
 	cgroup_force_idle_show(seq, &cgrp->bstat);
 }
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH v2 2/2] Selftests for niced CPU statistics
  2024-08-30 14:19 [PATCH v2 0/2] Exposing nice CPU usage to userspace Joshua Hahn
  2024-08-30 14:19 ` [PATCH v2 1/2] Tracking cgroup-level niced CPU time Joshua Hahn
@ 2024-08-30 14:19 ` Joshua Hahn
  2024-09-02 15:45 ` [PATCH v2 0/2] Exposing nice CPU usage to userspace Michal Koutný
  2 siblings, 0 replies; 7+ messages in thread
From: Joshua Hahn @ 2024-08-30 14:19 UTC (permalink / raw)
  To: tj
  Cc: cgroups, hannes, linux-kernel, linux-kselftest, lizefan.x,
	mkoutny, shuah

From: Joshua Hahn <joshua.hahn6@gmail.com>

Creates a cgroup with a single nice CPU hog process running.
fork() is called to generate the nice process because un-nicing is
not possible (see man nice(3)). If fork() was not used to generate
the CPU hog, we would run the rest of the cgroup selftest suite as a
nice process.
---
 tools/testing/selftests/cgroup/test_cpu.c | 72 +++++++++++++++++++++++
 1 file changed, 72 insertions(+)

diff --git a/tools/testing/selftests/cgroup/test_cpu.c b/tools/testing/selftests/cgroup/test_cpu.c
index dad2ed82f3ef..cd5550391f49 100644
--- a/tools/testing/selftests/cgroup/test_cpu.c
+++ b/tools/testing/selftests/cgroup/test_cpu.c
@@ -8,6 +8,7 @@
 #include <pthread.h>
 #include <stdio.h>
 #include <time.h>
+#include <unistd.h>
 
 #include "../kselftest.h"
 #include "cgroup_util.h"
@@ -229,6 +230,76 @@ static int test_cpucg_stats(const char *root)
 	return ret;
 }
 
+/*
+ * Creates a nice process that consumes CPU and checks that the elapsed
+ * usertime in the cgroup is close to the expected time.
+ */
+static int test_cpucg_nice(const char *root)
+{
+	int ret = KSFT_FAIL;
+	int status;
+	long user_usec, nice_usec;
+	long usage_seconds = 2;
+	long expected_nice_usec = usage_seconds * USEC_PER_SEC;
+	char *cpucg;
+	pid_t pid;
+
+	cpucg = cg_name(root, "cpucg_test");
+	if (!cpucg)
+		goto cleanup;
+
+	if (cg_create(cpucg))
+		goto cleanup;
+
+	user_usec = cg_read_key_long(cpucg, "cpu.stat", "user_usec");
+	nice_usec = cg_read_key_long(cpucg, "cpu.stat", "nice_usec");
+	if (user_usec != 0 || nice_usec != 0)
+		goto cleanup;
+
+	/*
+	 * We fork here to create a new process that can be niced without
+	 * polluting the nice value of other selftests
+	 */
+	pid = fork();
+	if (pid < 0) {
+		goto cleanup;
+	} else if (pid == 0) {
+		struct cpu_hog_func_param param = {
+			.nprocs = 1,
+			.ts = {
+				.tv_sec = usage_seconds,
+				.tv_nsec = 0,
+			},
+			.clock_type = CPU_HOG_CLOCK_PROCESS,
+		};
+
+		/* Try to keep niced CPU usage as constrained to hog_cpu as possible */
+		nice(1);
+		cg_run(cpucg, hog_cpus_timed, (void *)&param);
+		exit(0);
+	} else {
+		waitpid(pid, &status, 0);
+		if (!WIFEXITED(status))
+			goto cleanup;
+
+		user_usec = cg_read_key_long(cpucg, "cpu.stat", "user_usec");
+		nice_usec = cg_read_key_long(cpucg, "cpu.stat", "nice_usec");
+		if (nice_usec > user_usec || user_usec <= 0)
+			goto cleanup;
+
+		if (!values_close(nice_usec, expected_nice_usec, 1))
+			goto cleanup;
+
+		ret = KSFT_PASS;
+	}
+
+cleanup:
+	cg_destroy(cpucg);
+	free(cpucg);
+
+	return ret;
+}
+
 static int
 run_cpucg_weight_test(
 		const char *root,
@@ -686,6 +757,7 @@ struct cpucg_test {
 } tests[] = {
 	T(test_cpucg_subtree_control),
 	T(test_cpucg_stats),
+	T(test_cpucg_nice),
 	T(test_cpucg_weight_overprovisioned),
 	T(test_cpucg_weight_underprovisioned),
 	T(test_cpucg_nested_weight_overprovisioned),
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH v2 1/2] Tracking cgroup-level niced CPU time
  2024-08-30 14:19 ` [PATCH v2 1/2] Tracking cgroup-level niced CPU time Joshua Hahn
@ 2024-08-30 19:41   ` Tejun Heo
  2024-08-30 20:06     ` Joshua Hahn
  0 siblings, 1 reply; 7+ messages in thread
From: Tejun Heo @ 2024-08-30 19:41 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: cgroups, hannes, linux-kernel, linux-kselftest, lizefan.x,
	mkoutny, shuah

On Fri, Aug 30, 2024 at 07:19:38AM -0700, Joshua Hahn wrote:
> From: Joshua Hahn <joshua.hahn6@gmail.com>
> 
> Cgroup-level CPU statistics currently include time spent on
> user/system processes, but do not include niced CPU time (despite
> already being tracked). This patch exposes niced CPU time to the
> userspace, allowing users to get a better understanding of their
> hardware limits and can facilitate more informed workload distribution.
> 
> A new field 'ntime' is added to struct cgroup_base_stat as opposed to
> struct task_cputime to minimize footprint.

Patch looks fine to me but can you please do the followings?

- Add subsystem prefix to the patch titles. Look other commits for examples.

- Add Signed-off-by to both.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2 1/2] Tracking cgroup-level niced CPU time
  2024-08-30 19:41   ` Tejun Heo
@ 2024-08-30 20:06     ` Joshua Hahn
  0 siblings, 0 replies; 7+ messages in thread
From: Joshua Hahn @ 2024-08-30 20:06 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cgroups, hannes, linux-kernel, linux-kselftest, lizefan.x,
	mkoutny, shuah

Hello, thank you for reviewing the v2.

> Patch looks fine to me but can you please do the followings?
>
> - Add subsystem prefix to the patch titles. Look other commits for examples.
> - Add Signed-off-by to both.
> --
> tejun

I will send out a v3 with the signed-off-by, and I will add
cgroup/rstat to the patch title.
Thank you again!

Joshua

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2 0/2] Exposing nice CPU usage to userspace
  2024-08-30 14:19 [PATCH v2 0/2] Exposing nice CPU usage to userspace Joshua Hahn
  2024-08-30 14:19 ` [PATCH v2 1/2] Tracking cgroup-level niced CPU time Joshua Hahn
  2024-08-30 14:19 ` [PATCH v2 2/2] Selftests for niced CPU statistics Joshua Hahn
@ 2024-09-02 15:45 ` Michal Koutný
  2024-09-10 21:01   ` Tejun Heo
  2 siblings, 1 reply; 7+ messages in thread
From: Michal Koutný @ 2024-09-02 15:45 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: tj, cgroups, hannes, linux-kernel, linux-kselftest, lizefan.x,
	shuah

[-- Attachment #1: Type: text/plain, Size: 1480 bytes --]

Hello Joshua.

On Fri, Aug 30, 2024 at 07:19:37AM GMT, Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
> Exposing this metric will allow users to accurately probe the niced CPU
> metric for each workload, and make more informed decisions when
> directing higher priority tasks.

I'm afraid I can't still appreciate exposing this value:

- It makes (some) sense only on leave cgroups (where variously nice'd
  tasks are competing against each other). Not so much on inner node
  cgroups (where it's a mere sum but sibling cgroups could have different
  weights, so the absolute times would contribute differently).

- When all tasks have nice > 0 (or nice <= 0), it loses any information
  it could have had.

(Thus I don't know whether to commit to exposing that value via cgroups.)

I wonder, wouldn't your use case be equally served by some
post-processing [1] of /sys/kernel/debug/sched/debug info which is
already available?

Regards,
Michal

[1]
Your metric is supposed to represent
	Σ_i^tasks ∫_t is_nice(i, t) dt

If I try to address the second remark by looking at
	Σ_i^tasks ∫_t nice(i, t) dt

and that resembles (nice=0 ~ weight=1024)
	Σ_i^tasks ∫_t weight(i, t) dt

swap sum and int
	∫_t Σ_i^tasks weight(i, t) dt

where
	Σ_i^tasks weight(i, t) 

can be taken from
	/sys/kernel/debug/sched/debug:cfs_rq[0].load_avg

above is only for CPU nr=0. So processing would mean sampling that file
over all CPUs and time.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2 0/2] Exposing nice CPU usage to userspace
  2024-09-02 15:45 ` [PATCH v2 0/2] Exposing nice CPU usage to userspace Michal Koutný
@ 2024-09-10 21:01   ` Tejun Heo
  0 siblings, 0 replies; 7+ messages in thread
From: Tejun Heo @ 2024-09-10 21:01 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Joshua Hahn, cgroups, hannes, linux-kernel, linux-kselftest,
	lizefan.x, shuah

Hello, Michal.

On Mon, Sep 02, 2024 at 05:45:39PM +0200, Michal Koutný wrote:
> - It makes (some) sense only on leave cgroups (where variously nice'd
>   tasks are competing against each other). Not so much on inner node
>   cgroups (where it's a mere sum but sibling cgroups could have different
>   weights, so the absolute times would contribute differently).
>
> - When all tasks have nice > 0 (or nice <= 0), it loses any information
>   it could have had.

I think it's as useful as system-wide nice metric is. It's not a versatile
metric but is widely available and understood and people use it. Maybe a
workload is split across a sub-hierarchy and they wanna collect how much
lowpri threads are consuming. cpu.stats is available without cpu control
being enabled and people use it as a way to just aggregate metrics across a
portion of the system.

> (Thus I don't know whether to commit to exposing that value via cgroups.)
> 
> I wonder, wouldn't your use case be equally served by some
> post-processing [1] of /sys/kernel/debug/sched/debug info which is
> already available?
...
> above is only for CPU nr=0. So processing would mean sampling that file
> over all CPUs and time.

I think there are benefits to mirroring system wide metrics, at least ones
as widely spread as nice.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2024-09-10 21:01 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-08-30 14:19 [PATCH v2 0/2] Exposing nice CPU usage to userspace Joshua Hahn
2024-08-30 14:19 ` [PATCH v2 1/2] Tracking cgroup-level niced CPU time Joshua Hahn
2024-08-30 19:41   ` Tejun Heo
2024-08-30 20:06     ` Joshua Hahn
2024-08-30 14:19 ` [PATCH v2 2/2] Selftests for niced CPU statistics Joshua Hahn
2024-09-02 15:45 ` [PATCH v2 0/2] Exposing nice CPU usage to userspace Michal Koutný
2024-09-10 21:01   ` Tejun Heo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox