* [PATCH v2 1/2] Tracking cgroup-level niced CPU time
2024-08-30 14:19 [PATCH v2 0/2] Exposing nice CPU usage to userspace Joshua Hahn
@ 2024-08-30 14:19 ` Joshua Hahn
2024-08-30 19:41 ` Tejun Heo
2024-08-30 14:19 ` [PATCH v2 2/2] Selftests for niced CPU statistics Joshua Hahn
2024-09-02 15:45 ` [PATCH v2 0/2] Exposing nice CPU usage to userspace Michal Koutný
2 siblings, 1 reply; 7+ messages in thread
From: Joshua Hahn @ 2024-08-30 14:19 UTC (permalink / raw)
To: tj
Cc: cgroups, hannes, linux-kernel, linux-kselftest, lizefan.x,
mkoutny, shuah
From: Joshua Hahn <joshua.hahn6@gmail.com>
Cgroup-level CPU statistics currently include time spent on
user/system processes, but do not include niced CPU time, despite
it already being tracked internally. This patch exposes niced CPU
time to userspace, giving users a better understanding of their
hardware limits and facilitating more informed workload distribution.
A new field 'ntime' is added to struct cgroup_base_stat rather than
struct task_cputime to minimize the memory footprint.
---
include/linux/cgroup-defs.h | 1 +
kernel/cgroup/rstat.c | 16 +++++++++++++---
2 files changed, 14 insertions(+), 3 deletions(-)
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index ae04035b6cbe..a2fcb3db6c52 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -315,6 +315,7 @@ struct cgroup_base_stat {
#ifdef CONFIG_SCHED_CORE
u64 forceidle_sum;
#endif
+ u64 ntime;
};
/*
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index a06b45272411..a77ba9a83bab 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -444,6 +444,7 @@ static void cgroup_base_stat_add(struct cgroup_base_stat *dst_bstat,
#ifdef CONFIG_SCHED_CORE
dst_bstat->forceidle_sum += src_bstat->forceidle_sum;
#endif
+ dst_bstat->ntime += src_bstat->ntime;
}
static void cgroup_base_stat_sub(struct cgroup_base_stat *dst_bstat,
@@ -455,6 +456,7 @@ static void cgroup_base_stat_sub(struct cgroup_base_stat *dst_bstat,
#ifdef CONFIG_SCHED_CORE
dst_bstat->forceidle_sum -= src_bstat->forceidle_sum;
#endif
+ dst_bstat->ntime -= src_bstat->ntime;
}
static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
@@ -535,7 +537,10 @@ void __cgroup_account_cputime_field(struct cgroup *cgrp,
switch (index) {
case CPUTIME_USER:
+ rstatc->bstat.cputime.utime += delta_exec;
+ break;
case CPUTIME_NICE:
+ rstatc->bstat.ntime += delta_exec;
rstatc->bstat.cputime.utime += delta_exec;
break;
case CPUTIME_SYSTEM:
@@ -591,6 +596,7 @@ static void root_cgroup_cputime(struct cgroup_base_stat *bstat)
#ifdef CONFIG_SCHED_CORE
bstat->forceidle_sum += cpustat[CPUTIME_FORCEIDLE];
#endif
+ bstat->ntime += cpustat[CPUTIME_NICE];
}
}
@@ -608,13 +614,14 @@ static void cgroup_force_idle_show(struct seq_file *seq, struct cgroup_base_stat
void cgroup_base_stat_cputime_show(struct seq_file *seq)
{
struct cgroup *cgrp = seq_css(seq)->cgroup;
- u64 usage, utime, stime;
+ u64 usage, utime, stime, ntime;
if (cgroup_parent(cgrp)) {
cgroup_rstat_flush_hold(cgrp);
usage = cgrp->bstat.cputime.sum_exec_runtime;
cputime_adjust(&cgrp->bstat.cputime, &cgrp->prev_cputime,
&utime, &stime);
+ ntime = cgrp->bstat.ntime;
cgroup_rstat_flush_release(cgrp);
} else {
/* cgrp->bstat of root is not actually used, reuse it */
@@ -622,16 +629,19 @@ void cgroup_base_stat_cputime_show(struct seq_file *seq)
usage = cgrp->bstat.cputime.sum_exec_runtime;
utime = cgrp->bstat.cputime.utime;
stime = cgrp->bstat.cputime.stime;
+ ntime = cgrp->bstat.ntime;
}
do_div(usage, NSEC_PER_USEC);
do_div(utime, NSEC_PER_USEC);
do_div(stime, NSEC_PER_USEC);
+ do_div(ntime, NSEC_PER_USEC);
seq_printf(seq, "usage_usec %llu\n"
"user_usec %llu\n"
- "system_usec %llu\n",
- usage, utime, stime);
+ "system_usec %llu\n"
+ "nice_usec %llu\n",
+ usage, utime, stime, ntime);
cgroup_force_idle_show(seq, &cgrp->bstat);
}
--
2.43.5
* Re: [PATCH v2 1/2] Tracking cgroup-level niced CPU time
2024-08-30 14:19 ` [PATCH v2 1/2] Tracking cgroup-level niced CPU time Joshua Hahn
@ 2024-08-30 19:41 ` Tejun Heo
2024-08-30 20:06 ` Joshua Hahn
0 siblings, 1 reply; 7+ messages in thread
From: Tejun Heo @ 2024-08-30 19:41 UTC (permalink / raw)
To: Joshua Hahn
Cc: cgroups, hannes, linux-kernel, linux-kselftest, lizefan.x,
mkoutny, shuah
On Fri, Aug 30, 2024 at 07:19:38AM -0700, Joshua Hahn wrote:
> From: Joshua Hahn <joshua.hahn6@gmail.com>
>
> Cgroup-level CPU statistics currently include time spent on
> user/system processes, but do not include niced CPU time (despite
> already being tracked). This patch exposes niced CPU time to the
> userspace, allowing users to get a better understanding of their
> hardware limits and can facilitate more informed workload distribution.
>
> A new field 'ntime' is added to struct cgroup_base_stat as opposed to
> struct task_cputime to minimize footprint.
Patch looks fine to me but can you please do the following?
- Add subsystem prefix to the patch titles. Look at other commits for examples.
- Add Signed-off-by to both.
Thanks.
--
tejun
* Re: [PATCH v2 1/2] Tracking cgroup-level niced CPU time
2024-08-30 19:41 ` Tejun Heo
@ 2024-08-30 20:06 ` Joshua Hahn
0 siblings, 0 replies; 7+ messages in thread
From: Joshua Hahn @ 2024-08-30 20:06 UTC (permalink / raw)
To: Tejun Heo
Cc: cgroups, hannes, linux-kernel, linux-kselftest, lizefan.x,
mkoutny, shuah
Hello, thank you for reviewing the v2.
> Patch looks fine to me but can you please do the following?
>
> - Add subsystem prefix to the patch titles. Look at other commits for examples.
> - Add Signed-off-by to both.
> --
> tejun
I will send out a v3 with the Signed-off-by tags, and I will add the
cgroup/rstat prefix to the patch titles.
Thank you again!
Joshua
* [PATCH v2 2/2] Selftests for niced CPU statistics
2024-08-30 14:19 [PATCH v2 0/2] Exposing nice CPU usage to userspace Joshua Hahn
2024-08-30 14:19 ` [PATCH v2 1/2] Tracking cgroup-level niced CPU time Joshua Hahn
@ 2024-08-30 14:19 ` Joshua Hahn
2024-09-02 15:45 ` [PATCH v2 0/2] Exposing nice CPU usage to userspace Michal Koutný
2 siblings, 0 replies; 7+ messages in thread
From: Joshua Hahn @ 2024-08-30 14:19 UTC (permalink / raw)
To: tj
Cc: cgroups, hannes, linux-kernel, linux-kselftest, lizefan.x,
mkoutny, shuah
From: Joshua Hahn <joshua.hahn6@gmail.com>
This test creates a cgroup with a single niced CPU hog process running.
fork() is used to generate the niced process because un-nicing is
not possible (see man nice(3)). If fork() were not used to generate
the CPU hog, the rest of the cgroup selftest suite would run as a
niced process.
---
tools/testing/selftests/cgroup/test_cpu.c | 72 +++++++++++++++++++++++
1 file changed, 72 insertions(+)
diff --git a/tools/testing/selftests/cgroup/test_cpu.c b/tools/testing/selftests/cgroup/test_cpu.c
index dad2ed82f3ef..cd5550391f49 100644
--- a/tools/testing/selftests/cgroup/test_cpu.c
+++ b/tools/testing/selftests/cgroup/test_cpu.c
@@ -8,6 +8,7 @@
#include <pthread.h>
#include <stdio.h>
#include <time.h>
+#include <unistd.h>
#include "../kselftest.h"
#include "cgroup_util.h"
@@ -229,6 +230,76 @@ static int test_cpucg_stats(const char *root)
return ret;
}
+/*
+ * Creates a nice process that consumes CPU and checks that the elapsed
+ * usertime in the cgroup is close to the expected time.
+ */
+static int test_cpucg_nice(const char *root)
+{
+ int ret = KSFT_FAIL;
+ int status;
+ long user_usec, nice_usec;
+ long usage_seconds = 2;
+ long expected_nice_usec = usage_seconds * USEC_PER_SEC;
+ char *cpucg;
+ pid_t pid;
+
+ cpucg = cg_name(root, "cpucg_test");
+ if (!cpucg)
+ goto cleanup;
+
+ if (cg_create(cpucg))
+ goto cleanup;
+
+ user_usec = cg_read_key_long(cpucg, "cpu.stat", "user_usec");
+ nice_usec = cg_read_key_long(cpucg, "cpu.stat", "nice_usec");
+ if (user_usec != 0 || nice_usec != 0)
+ goto cleanup;
+
+ /*
+ * We fork here to create a new process that can be niced without
+ * polluting the nice value of other selftests
+ */
+ pid = fork();
+ if (pid < 0) {
+ goto cleanup;
+ } else if (pid == 0) {
+ struct cpu_hog_func_param param = {
+ .nprocs = 1,
+ .ts = {
+ .tv_sec = usage_seconds,
+ .tv_nsec = 0,
+ },
+ .clock_type = CPU_HOG_CLOCK_PROCESS,
+ };
+
+ /* Try to keep niced CPU usage as constrained to hog_cpu as possible */
+ nice(1);
+ cg_run(cpucg, hog_cpus_timed, (void *)¶m);
+ exit(0);
+ } else {
+ waitpid(pid, &status, 0);
+ if (!WIFEXITED(status))
+ goto cleanup;
+
+ user_usec = cg_read_key_long(cpucg, "cpu.stat", "user_usec");
+ nice_usec = cg_read_key_long(cpucg, "cpu.stat", "nice_usec");
+ if (nice_usec > user_usec || user_usec <= 0)
+ goto cleanup;
+
+ if (!values_close(nice_usec, expected_nice_usec, 1))
+ goto cleanup;
+
+ ret = KSFT_PASS;
+ }
+
+cleanup:
+ cg_destroy(cpucg);
+ free(cpucg);
+
+ return ret;
+}
+
static int
run_cpucg_weight_test(
const char *root,
@@ -686,6 +757,7 @@ struct cpucg_test {
} tests[] = {
T(test_cpucg_subtree_control),
T(test_cpucg_stats),
+ T(test_cpucg_nice),
T(test_cpucg_weight_overprovisioned),
T(test_cpucg_weight_underprovisioned),
T(test_cpucg_nested_weight_overprovisioned),
--
2.43.5
* Re: [PATCH v2 0/2] Exposing nice CPU usage to userspace
2024-08-30 14:19 [PATCH v2 0/2] Exposing nice CPU usage to userspace Joshua Hahn
2024-08-30 14:19 ` [PATCH v2 1/2] Tracking cgroup-level niced CPU time Joshua Hahn
2024-08-30 14:19 ` [PATCH v2 2/2] Selftests for niced CPU statistics Joshua Hahn
@ 2024-09-02 15:45 ` Michal Koutný
2024-09-10 21:01 ` Tejun Heo
2 siblings, 1 reply; 7+ messages in thread
From: Michal Koutný @ 2024-09-02 15:45 UTC (permalink / raw)
To: Joshua Hahn
Cc: tj, cgroups, hannes, linux-kernel, linux-kselftest, lizefan.x,
shuah
Hello Joshua.
On Fri, Aug 30, 2024 at 07:19:37AM GMT, Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
> Exposing this metric will allow users to accurately probe the niced CPU
> metric for each workload, and make more informed decisions when
> directing higher priority tasks.
I'm afraid I still can't appreciate exposing this value:
- It makes (some) sense only on leaf cgroups (where variously nice'd
tasks are competing against each other). Not so much on inner node
cgroups (where it's a mere sum, but sibling cgroups could have different
weights, so the absolute times would contribute differently).
- When all tasks have nice > 0 (or nice <= 0), it loses any information
it could have had.
(Thus I don't know whether to commit to exposing that value via cgroups.)
I wonder, wouldn't your use case be equally served by some
post-processing [1] of /sys/kernel/debug/sched/debug info which is
already available?
Regards,
Michal
[1]
Your metric is supposed to represent
Σ_i^tasks ∫_t is_nice(i, t) dt
If I try to address the second remark by looking at
Σ_i^tasks ∫_t nice(i, t) dt
and that resembles (nice=0 ~ weight=1024)
Σ_i^tasks ∫_t weight(i, t) dt
swap sum and int
∫_t Σ_i^tasks weight(i, t) dt
where
Σ_i^tasks weight(i, t)
can be taken from
/sys/kernel/debug/sched/debug:cfs_rq[0].load_avg
above is only for CPU nr=0. So processing would mean sampling that file
over all CPUs and time.
* Re: [PATCH v2 0/2] Exposing nice CPU usage to userspace
2024-09-02 15:45 ` [PATCH v2 0/2] Exposing nice CPU usage to userspace Michal Koutný
@ 2024-09-10 21:01 ` Tejun Heo
0 siblings, 0 replies; 7+ messages in thread
From: Tejun Heo @ 2024-09-10 21:01 UTC (permalink / raw)
To: Michal Koutný
Cc: Joshua Hahn, cgroups, hannes, linux-kernel, linux-kselftest,
lizefan.x, shuah
Hello, Michal.
On Mon, Sep 02, 2024 at 05:45:39PM +0200, Michal Koutný wrote:
> - It makes (some) sense only on leave cgroups (where variously nice'd
> tasks are competing against each other). Not so much on inner node
> cgroups (where it's a mere sum but sibling cgroups could have different
> weights, so the absolute times would contribute differently).
>
> - When all tasks have nice > 0 (or nice <= 0), it loses any information
> it could have had.
I think it's as useful as the system-wide nice metric is. It's not a versatile
metric but it is widely available and understood, and people use it. Maybe a
workload is split across a sub-hierarchy and they want to collect how much its
low-priority threads are consuming. cpu.stat is available without the cpu
controller being enabled, and people use it as a way to aggregate metrics
across a portion of the system.
> (Thus I don't know whether to commit to exposing that value via cgroups.)
>
> I wonder, wouldn't your use case be equally served by some
> post-processing [1] of /sys/kernel/debug/sched/debug info which is
> already available?
...
> above is only for CPU nr=0. So processing would mean sampling that file
> over all CPUs and time.
I think there are benefits to mirroring system-wide metrics, at least ones
as widely used as nice.
Thanks.
--
tejun