* [PATCH] cgroup: add cpu.stat.percpu for per-CPU cgroup stats
From: Willy Barro Raffel @ 2026-04-07  1:06 UTC
  To: Tejun Heo, Johannes Weiner, Michal Koutný, cgroups,
	linux-kernel, Willy Barro Raffel
  Cc: Justinien Bouron, Gunnar Kudrjavets

Expose per-CPU subtree_bstat via a new cgroupfs file cpu.stat.percpu.
Each line shows one CPU's cumulative stats in io.stat-style key=value
format:

  cpu0 usage_usec=123 user_usec=45 system_usec=78 nice_usec=0
  cpu1 usage_usec=456 user_usec=123 system_usec=333 nice_usec=0

This completes the interface left as a TODO in commit 7716f383a583
("Merge tag 'cgroup-for-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup")
which added per-CPU subtree_bstat but only exposed it via BPF/drgn.

Signed-off-by: Willy Barro Raffel <willybar@amazon.com>
Reviewed-by: Justinien Bouron <jbouron@amazon.com>
Reviewed-by: Gunnar Kudrjavets <gunnarku@amazon.com>
---
 kernel/cgroup/cgroup-internal.h |  1 +
 kernel/cgroup/cgroup.c          | 10 +++++++++
 kernel/cgroup/rstat.c           | 36 +++++++++++++++++++++++++++++++++
 3 files changed, 47 insertions(+)

diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index 3bfe37693d68..28aff03975f2 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -277,6 +277,7 @@ int css_rstat_init(struct cgroup_subsys_state *css);
 void css_rstat_exit(struct cgroup_subsys_state *css);
 int ss_rstat_init(struct cgroup_subsys *ss);
 void cgroup_base_stat_cputime_show(struct seq_file *seq);
+void cgroup_base_stat_cputime_show_percpu(struct seq_file *seq);
 
 /*
  * namespace.c
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index be1d71dda317..652fae15d7c5 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3968,6 +3968,12 @@ static int cpu_local_stat_show(struct seq_file *seq, void *v)
 	return ret;
 }
 
+static int cpu_percpu_stat_show(struct seq_file *seq, void *v)
+{
+	cgroup_base_stat_cputime_show_percpu(seq);
+	return 0;
+}
+
 #ifdef CONFIG_PSI
 static int cgroup_io_pressure_show(struct seq_file *seq, void *v)
 {
@@ -5499,6 +5505,10 @@ static struct cftype cgroup_base_files[] = {
 		.name = "cpu.stat.local",
 		.seq_show = cpu_local_stat_show,
 	},
+	{
+		.name = "cpu.stat.percpu",
+		.seq_show = cpu_percpu_stat_show,
+	},
 	{ }	/* terminate */
 };
 
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 150e5871e66f..f1aaed87180c 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -743,6 +743,42 @@ void cgroup_base_stat_cputime_show(struct seq_file *seq)
 	cgroup_force_idle_show(seq, &bstat);
 }
 
+
+void cgroup_base_stat_cputime_show_percpu(struct seq_file *seq)
+{
+	struct cgroup *cgrp = seq_css(seq)->cgroup;
+	int cpu;
+
+	css_rstat_flush(&cgrp->self);
+
+	for_each_possible_cpu(cpu) {
+		struct cgroup_rstat_base_cpu *rstatbc;
+		struct cgroup_base_stat bstat;
+		unsigned int seq_cnt;
+
+		/* Reacquire for each CPU to avoid disabling IRQs too long */
+		__css_rstat_lock(&cgrp->self, cpu);
+		rstatbc = cgroup_rstat_base_cpu(cgrp, cpu);
+		do {
+			seq_cnt = __u64_stats_fetch_begin(&rstatbc->bsync);
+			bstat = rstatbc->subtree_bstat;
+		} while (__u64_stats_fetch_retry(&rstatbc->bsync, seq_cnt));
+		__css_rstat_unlock(&cgrp->self, cpu);
+
+		do_div(bstat.cputime.sum_exec_runtime, NSEC_PER_USEC);
+		do_div(bstat.cputime.utime, NSEC_PER_USEC);
+		do_div(bstat.cputime.stime, NSEC_PER_USEC);
+		do_div(bstat.ntime, NSEC_PER_USEC);
+
+		seq_printf(seq, "cpu%d usage_usec=%llu user_usec=%llu system_usec=%llu nice_usec=%llu\n",
+			   cpu,
+			   bstat.cputime.sum_exec_runtime,
+			   bstat.cputime.utime,
+			   bstat.cputime.stime,
+			   bstat.ntime);
+	}
+}
+
 /* Add bpf kfuncs for css_rstat_updated() and css_rstat_flush() */
 BTF_KFUNCS_START(bpf_rstat_kfunc_ids)
 BTF_ID_FLAGS(func, css_rstat_updated)
-- 
2.50.1 (Apple Git-155)


* Re: [PATCH] cgroup: add cpu.stat.percpu for per-CPU cgroup stats
From: Tejun Heo @ 2026-04-07 18:27 UTC
  To: Willy Barro Raffel
  Cc: Johannes Weiner, Michal Koutný, cgroups, linux-kernel,
	Justinien Bouron, Gunnar Kudrjavets

On Mon, Apr 06, 2026 at 06:06:43PM -0700, Willy Barro Raffel wrote:
> Expose per-CPU subtree_bstat via a new cgroupfs file cpu.stat.percpu.
> Each line shows one CPU cumulative stats in io.stat-style key=value
> format:
> 
>   cpu0 usage_usec=123 user_usec=45 system_usec=78 nice_usec=0
>   cpu1 usage_usec=456 user_usec=123 system_usec=333 nice_usec=0
> 
> This completes the interface left as a TODO in commit 7716f383a583
> ("Merge tag 'cgroup-for-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup")
> which added per-CPU subtree_bstat but only exposed it via BPF/drgn.

Given how quickly CPU counts are increasing, with 1k CPUs on common prod
machines not too far off, I'm not sure naively formatting output for every
possible CPU is desirable.
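
For a rough sense of scale, the worst-case size of such a file can be
estimated in userspace. This is only a sketch under stated assumptions:
it uses the line format proposed in the patch, takes every counter at
UINT64_MAX, and the helper names worst_case_line_len() and
worst_case_file_len() are hypothetical, not part of the patch.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Worst-case length in bytes of one line in the proposed format, with
 * every counter at UINT64_MAX. snprintf(NULL, 0, ...) only measures the
 * output and writes nothing.
 */
static int worst_case_line_len(int cpu)
{
	unsigned long long max = UINT64_MAX;

	return snprintf(NULL, 0,
			"cpu%d usage_usec=%llu user_usec=%llu system_usec=%llu nice_usec=%llu\n",
			cpu, max, max, max, max);
}

/* Upper bound on the whole file when one line is printed per CPU. */
static long worst_case_file_len(int ncpus)
{
	long total = 0;

	for (int cpu = 0; cpu < ncpus; cpu++)
		total += worst_case_line_len(cpu);
	return total;
}
```

Under these assumptions worst_case_file_len(1024) comes to roughly
134 KiB. Real counters have far fewer digits, so typical output would
sit well below that bound, but the size still grows linearly with the
number of CPUs printed.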

Thanks.

-- 
tejun

* Re: [PATCH] cgroup: add cpu.stat.percpu for per-CPU cgroup stats
From: Barro Raffel, Willy @ 2026-04-07 20:24 UTC
  To: Tejun Heo
  Cc: Johannes Weiner, Michal Koutný, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org, Bouron, Justinien,
	Kudrjavets, Gunnar

On Tue, Apr 07, 2026 at 08:27:41AM -1000, Tejun Heo wrote:
>On Mon, Apr 06, 2026 at 06:06:43PM -0700, Willy Barro Raffel wrote:
>> Expose per-CPU subtree_bstat via a new cgroupfs file cpu.stat.percpu.
>> Each line shows one CPU cumulative stats in io.stat-style key=value
>> format:
>>
>>   cpu0 usage_usec=123 user_usec=45 system_usec=78 nice_usec=0
>>   cpu1 usage_usec=456 user_usec=123 system_usec=333 nice_usec=0
>>
>> This completes the interface left as a TODO in commit 7716f383a583
>> ("Merge tag 'cgroup-for-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup")
>> which added per-CPU subtree_bstat but only exposed it via BPF/drgn.
>
>Given how quickly cpu count is increasing with 1k CPUs on common prod
>machines not too far off, I'm not sure naively formatting output for every
>possible CPU is desirable.
>
>Thanks.
>
>--
>tejun

Good point. I can skip CPUs with zero stats in the output; e.g., a cgroup
running on 4 of 1024 CPUs would produce only 4 lines. Would that address
your concern?
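
A minimal userspace sketch of the zero-skipping idea, for illustration
only. It is not kernel code: struct percpu_sample and both helpers are
hypothetical stand-ins for the four counters the patch prints, and the
line format is the one proposed above.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the four per-CPU counters the patch prints. */
struct percpu_sample {
	uint64_t usage_usec;
	uint64_t user_usec;
	uint64_t system_usec;
	uint64_t nice_usec;
};

/* True when no time has been accounted on this CPU at all. */
static bool percpu_sample_is_zero(const struct percpu_sample *s)
{
	return !(s->usage_usec | s->user_usec | s->system_usec | s->nice_usec);
}

/*
 * Format the non-zero entries of samples[0..ncpus) into buf, one line
 * per CPU. Returns the number of lines emitted, or -1 if buf is too small.
 */
static int format_nonzero_cpus(const struct percpu_sample *samples, int ncpus,
			       char *buf, size_t len)
{
	size_t off = 0;
	int lines = 0;

	if (len)
		buf[0] = '\0';

	for (int cpu = 0; cpu < ncpus; cpu++) {
		const struct percpu_sample *s = &samples[cpu];
		int n;

		if (percpu_sample_is_zero(s))
			continue;	/* skipped CPUs produce no line */

		n = snprintf(buf + off, len - off,
			     "cpu%d usage_usec=%llu user_usec=%llu system_usec=%llu nice_usec=%llu\n",
			     cpu,
			     (unsigned long long)s->usage_usec,
			     (unsigned long long)s->user_usec,
			     (unsigned long long)s->system_usec,
			     (unsigned long long)s->nice_usec);
		if (n < 0 || (size_t)n >= len - off)
			return -1;
		off += (size_t)n;
		lines++;
	}
	return lines;
}
```

With this scheme a reader has to treat a missing cpuN line as all
zeroes, since the counters are cumulative and an absent line carries no
other information.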

* Re: [PATCH] cgroup: add cpu.stat.percpu for per-CPU cgroup stats
From: Michal Koutný @ 2026-04-08 12:30 UTC
  To: Barro Raffel, Willy
  Cc: Tejun Heo, Johannes Weiner, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org, Bouron, Justinien,
	Kudrjavets, Gunnar


On Tue, Apr 07, 2026 at 08:24:33PM +0000, "Barro Raffel, Willy" <willybar@amazon.com> wrote:
> On Tue, Apr 07, 2026 at 08:27:41AM -1000, Tejun Heo wrote:
> ...
> >Given how quickly cpu count is increasing with 1k CPUs on common prod
> >machines not too far off, I'm not sure naively formatting output for every
> >possible CPU is desirable.

Fair point. OTOH, /proc/schedstat also outputs a line for each CPU
(admittedly in a simpler format, and for online CPUs rather than possible
ones).

> Good point. I can skip CPUs with zero stats in the output, i.e.: a
> cgroup running on 4 of 1024 CPUs would only produce 4 lines. Would
> that address your concern?

The "to complete the interface" argument does not convincingly explain the
actual need for such a new attribute.

Willy, what is the expected use of these per-cgroup per-cpu stats?
(Given there's: global per-cpu stat, per-cgroup total stat, cpusets for
binding and the mentioned bpf/drgn availability for precise
control/debugging.)

Thanks,
Michal

* Re: [PATCH] cgroup: add cpu.stat.percpu for per-CPU cgroup stats
From: Barro Raffel, Willy @ 2026-04-08 18:31 UTC
  To: Michal Koutný
  Cc: Tejun Heo, Johannes Weiner, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org, Bouron, Justinien,
	Kudrjavets, Gunnar

On Wed, Apr 08, 2026 at 02:30:11PM +0200, Michal Koutný wrote:
> ...
>The "to complete the interface" argument does not convincingly explain the
>actual need for such a new attribute.
>
>Willy, what is the expected use of these per-cgroup per-cpu stats?
>(Given there's: global per-cpu stat, per-cgroup total stat, cpusets for
>binding and the mentioned bpf/drgn availability for precise
>control/debugging.)

Our use case is that we run systems where services in separate cgroups
are pinned to specific CPUs via sched_setaffinity (not cgroup cpusets).
We need to know how much of each core's time each cgroup is consuming,
particularly on shared cores where multiple services compete. I believe
this use case is not unique to us.

/proc/stat gives per-CPU totals without per-cgroup breakdown.
cpu.stat gives per-cgroup totals without per-CPU breakdown.
Neither answers "how much of core N is cgroup X using?"
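
As an illustration of how a monitoring tool might answer that question
from the proposed file, here is a minimal userspace sketch. It assumes
the line format from this patch (which may still change in review);
struct percpu_line, parse_percpu_line() and core_share_pct() are
hypothetical names, not part of the patch.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical userspace view of one line of the proposed file. */
struct percpu_line {
	int cpu;
	uint64_t usage_usec;
	uint64_t user_usec;
	uint64_t system_usec;
	uint64_t nice_usec;
};

/* Parse one "cpuN key=value ..." line; returns 0 on success, -1 otherwise. */
static int parse_percpu_line(const char *line, struct percpu_line *out)
{
	int n = sscanf(line,
		       "cpu%d usage_usec=%" SCNu64 " user_usec=%" SCNu64
		       " system_usec=%" SCNu64 " nice_usec=%" SCNu64,
		       &out->cpu, &out->usage_usec, &out->user_usec,
		       &out->system_usec, &out->nice_usec);

	return n == 5 ? 0 : -1;
}

/*
 * Share of one core consumed by the cgroup between two reads, in percent.
 * interval_usec is the wall-clock time between the two reads.
 */
static double core_share_pct(uint64_t prev_usage_usec, uint64_t cur_usage_usec,
			     uint64_t interval_usec)
{
	if (interval_usec == 0 || cur_usage_usec < prev_usage_usec)
		return 0.0;
	return (double)(cur_usage_usec - prev_usage_usec) * 100.0 /
	       (double)interval_usec;
}
```

A tool would read the file twice, parse the line for core N each time,
and pass the two usage_usec values plus the elapsed time to
core_share_pct(). Because the counters are cumulative, only the delta
between reads is meaningful.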

The data already exists in subtree_bstat per CPU. BPF can access
per-cgroup totals, but reading the per-CPU subtree_bstat requires either
Clang-compiled kernels (for percpu type tags) or custom kfuncs IIRC,
which are nontrivial dependencies for simple monitoring.

>Thanks,
>Michal

Regarding output format: I'm open to a more compact format if preferred,
for example skipping CPUs with zero stats, skipping offline CPUs, using a
simpler positional format without keys, or a mix of these ideas.

I personally prefer clear key-value pairs that don't send the developer
or operator to the manual just to find out what a number in a certain
position means.

Happy to adjust based on what you all think fits best though.

Thanks! Willy
