* [PATCH v4 0/5] cgroup: separate rstat trees
@ 2025-04-04  1:10 JP Kobryn
  2025-04-04  1:10 ` [PATCH v4 1/5] cgroup: move rstat base stat objects into their own struct JP Kobryn
                   ` (4 more replies)
From: JP Kobryn @ 2025-04-04  1:10 UTC (permalink / raw)
  To: tj, shakeel.butt, yosryahmed, mkoutny, hannes, akpm
  Cc: linux-mm, cgroups, kernel-team

The current design of rstat takes the approach that if one subsystem is to
be flushed, all other subsystems with pending updates should also be
flushed. A flush may be initiated by reading specific stat files (like
cpu.stat), and the other subsystems with pending updates will be flushed
alongside. The complexity of flushing
some subsystems has grown to the extent that the overhead of side flushes
is causing noticeable delays in reading the desired stats.

One big area where the issue comes up is system telemetry, where programs
periodically sample cpu stats while the memory controller is enabled. It
would benefit programs sampling cpu.stat if the overhead of having to flush
memory (and also io) stats were eliminated. It would save cpu
cycles for existing stat reader programs and improve scalability in terms
of sampling frequency and host volume.

This series changes the approach of "flush all subsystems" to "flush only
the requested subsystem". The core design change is moving from a unified
model where rstat trees are shared by subsystems to having separate trees
for each subsystem. On a per-cpu basis, there will be separate trees for
each enabled subsystem that implements css_rstat_flush plus one tree
dedicated to the base stats. In order to do this, the rstat list pointers
were moved off the cgroup and onto the css, and their type was changed from
struct cgroup to struct cgroup_subsys_state. Finally, the updated/flush API
was changed to accept a reference to a css instead of a cgroup. This allows
a specific subsystem to be associated with a given update or flush. The
result is that rstat trees will now be made up of css
nodes, and a given tree will only contain nodes associated with a specific
subsystem.
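
For illustration, the update/flush entry points change as follows (the
signatures below are taken from patch 3 of this series; the memcg call is
shown as an example usage):

	/* before: updates and flushes are keyed by cgroup */
	void cgroup_rstat_updated(struct cgroup *cgrp, int cpu);
	void cgroup_rstat_flush(struct cgroup *cgrp);

	/* after: updates and flushes are keyed by css, i.e. (cgroup, subsystem) */
	void css_rstat_updated(struct cgroup_subsys_state *css, int cpu);
	void css_rstat_flush(struct cgroup_subsys_state *css);

	/* e.g. memcg now flushes only its own tree */
	css_rstat_flush(&memcg->css);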

Since separate trees will now be in use, the locking scheme was adjusted.
The global locks were split up in such a way that there are separate locks
for the base stats and also for each subsystem (memory, io, etc). This
allows different subsystems (and base stats) to use rstat in parallel with
no contention.
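
As a rough sketch, lock selection ends up looking something like the below.
The helper and base-lock names here are illustrative only; css_is_cgroup()
is added in patch 2 and ss->rstat_ss_lock is named in the changelog below:

	static spinlock_t *ss_rstat_lock(struct cgroup_subsys_state *css)
	{
		/* cgroup::self carries the base stats and gets its own lock */
		if (css_is_cgroup(css))
			return &rstat_base_lock;	/* assumed name */

		/* every other css uses its subsystem's dedicated lock */
		return &css->ss->rstat_ss_lock;
	}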

Breaking up the unified tree into separate trees eliminates the overhead
and scalability issues explained in the first section, but comes at the
cost of additional memory. Originally, each cgroup contained a per-cpu
instance of struct cgroup_rstat_cpu. The design change of moving to css-based
trees calls for each css to have the rstat per-cpu objects instead. Moving
these objects
to every css is where this overhead is created. In an effort to minimize
this, the cgroup_rstat_cpu struct was split into two separate structs. One
is the cgroup_rstat_base_cpu struct which only contains the per-cpu base
stat objects used in rstat. The other is the css_rstat_cpu struct which
contains the minimum number of pointers needed for a css to participate in
rstat. Since only the cgroup::self css is associated with the base stats,
an instance of the cgroup_rstat_base_cpu struct is placed on the cgroup.
Meanwhile an instance of the css_rstat_cpu is placed on the
cgroup_subsys_state. This allows for all css's to participate in rstat
while avoiding the unnecessary inclusion of the base stats. The base stat
objects will only exist once per cgroup regardless of how many
subsystems are enabled. With this division of rstat list pointers and base
stats, the change in memory overhead on a per-cpu basis before/after is
shown below.

memory overhead before:
	nr_cgroups * sizeof(struct cgroup_rstat_cpu)
where
	sizeof(struct cgroup_rstat_cpu) = 144 bytes /* config-dependent */
resulting in
	nr_cgroups * 144 bytes

memory overhead after:
	nr_cgroups * (
		sizeof(struct cgroup_rstat_base_cpu) +
			sizeof(struct css_rstat_cpu) * (1 + nr_rstat_controllers)
		)
where
	sizeof(struct cgroup_rstat_base_cpu) = 128 bytes
	sizeof(struct css_rstat_cpu) = 16 bytes
	the constant "1" accounts for the cgroup::self css
	nr_rstat_controllers = number of controllers defining css_rstat_flush
when both memory and io are enabled
	nr_rstat_controllers = 2
resulting in
	nr_cgroups * (128 + 16 * (1 + 2))
	nr_cgroups * 176 bytes

This leaves us with an increase in memory overhead of:
	32 bytes per cgroup per cpu
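
For reference, the resulting split looks roughly like this (fields as they
appear in patches 1 and 4, matching the sizes quoted above):

	/* one instance per css per cpu (~16 bytes): just the updated-tree linkage */
	struct css_rstat_cpu {
		struct cgroup_subsys_state *updated_children;
		struct cgroup_subsys_state *updated_next;
	};

	/* one instance per cgroup per cpu (~128 bytes): base stat bookkeeping */
	struct cgroup_rstat_base_cpu {
		struct u64_stats_sync bsync;
		struct cgroup_base_stat bstat;
		struct cgroup_base_stat last_bstat;
		struct cgroup_base_stat subtree_bstat;
		struct cgroup_base_stat last_subtree_bstat;
	};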

Validation was performed by reading some *.stat files of a target parent
cgroup while the system was under different workloads. A test program was
made to loop 1M times, reading the files cgroup.stat, cpu.stat, io.stat,
memory.stat of the parent cgroup each iteration. Using a non-patched kernel
as the control and a kernel with this series applied as the experiment, the
results show performance gains when reading stats.
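
A minimal sketch of the reader loop (iteration count and file list from the
description above; the cgroup path is assumed and the actual test program
may differ):

	#include <stdio.h>

	static const char *files[] = {
		"/sys/fs/cgroup/parent/cgroup.stat",	/* assumed cgroup path */
		"/sys/fs/cgroup/parent/cpu.stat",
		"/sys/fs/cgroup/parent/io.stat",
		"/sys/fs/cgroup/parent/memory.stat",
	};

	int main(void)
	{
		static char buf[64 * 1024];

		for (int i = 0; i < 1000000; i++) {
			for (int f = 0; f < 4; f++) {
				FILE *fp = fopen(files[f], "r");

				if (!fp)
					return 1;
				/* read and discard; only the flush cost matters */
				while (fread(buf, 1, sizeof(buf), fp) > 0)
					;
				fclose(fp);
			}
		}
		return 0;
	}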

The first experiment consisted of a parent cgroup with memory.swap.max=0
and memory.max=1G. On a 52-cpu machine, 26 child cgroups were created and
within each child cgroup a process was spawned to frequently update the
memory cgroup stats by creating and then reading a file of size 1T
(encouraging reclaim). The test program was run alongside these 26 tasks in
parallel. The results showed time and perf gains for the reader test
program.

test program elapsed time
control:
real    1m0.956s
user    0m0.569s
sys     1m0.195s

experiment:
real    0m37.660s
user    0m0.463s
sys     0m37.078s

test program perf
control:
24.62% mem_cgroup_css_rstat_flush
 4.97% __blkcg_rstat_flush
 0.09% cpu_stat_show
 0.05% cgroup_base_stat_cputime_show

experiment:
2.68% mem_cgroup_css_rstat_flush
0.04% blkcg_print_stat
0.07% cpu_stat_show
0.06% cgroup_base_stat_cputime_show

It's worth noting that memcg uses heuristics to optimize flushing.
Depending on the state of updated stats at a given time, a memcg flush may
be considered unnecessary and skipped as a result. This opportunity to skip
a flush is bypassed when memcg is flushed as a consequence of sharing the
tree with another controller.
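
Roughly, the skip path being bypassed looks like this (simplified sketch;
the threshold helper name is illustrative, not the actual memcg function):

	/* a direct memcg flush can bail out cheaply... */
	if (!memcg_pending_updates_above_threshold(memcg))	/* illustrative helper */
		return;
	css_rstat_flush(&memcg->css);

	/* ...but a flush done on memcg's behalf by another controller sharing
	 * the tree never gets the chance to take this early return. */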

A second experiment was set up on the same host using a parent cgroup with
two child cgroups. The same swap and memory max were used as in the
previous experiment. In the two child cgroups, kernel builds were done in
parallel, each using "-j 20". The perf comparison is shown below.

test program elapsed time
control:
real    1m27.620s
user    0m0.779s
sys     1m26.258s

experiment:
real    0m45.805s
user    0m0.723s
sys     0m44.757s

test program perf
control:
30.84% mem_cgroup_css_rstat_flush
 6.75% __blkcg_rstat_flush
 0.08% cpu_stat_show
 0.04% cgroup_base_stat_cputime_show

experiment:
1.55% mem_cgroup_css_rstat_flush
0.15% blkcg_print_stat
0.10% cpu_stat_show
0.09% cgroup_base_stat_cputime_show
0.00% __blkcg_rstat_flush

The final experiment differs from the previous two in that it measures
performance from the stat updater perspective. A kernel build was run in a
child node with -j 20 on the same host and cgroup setup. A baseline was
established by having the build run while no stats were read. The builds
were then repeated while stats were constantly being read. In all cases,
perf appeared similar in cycles spent on cgroup_rstat_updated()
(insignificant compared to the other recorded events). As for the elapsed
build times, the results of the different scenarios are shown below,
showing no significant drawbacks of the split tree approach.

control with no readers
real    3m21.003s
user    55m52.133s
sys     2m40.728s

control with constant readers of {memory,io,cpu,cgroup}.stat
real    3m26.164s
user    56m49.474s
sys     2m56.389s

experiment with no readers
real    3m22.740s
user    56m18.972s
sys     2m45.041s

experiment with constant readers of {memory,io,cpu,cgroup}.stat
real    3m26.971s
user    57m11.540s
sys     2m49.735s

changelog
v4:
	drop bpf api patch
	drop cgroup_rstat_cpu split and union patch,
		replace with patch for moving base stats into new struct
	new patch for renaming rstat api's from cgroup_* to css_*
	new patch for adding css_is_cgroup() helper
	rename ss->lock and ss->cpu_lock to ss->rstat_ss_lock and
		ss->rstat_ss_cpu_lock respectively
	rename root_self_stat_cpu to root_base_rstat_cpu
	rename cgroup_rstat_push_children to css_rstat_push_children
	format comments for consistency in wings and capitalization
	update comments in bpf selftests

v3:
	new bpf kfunc api for updated/flush
	rename cgroup_rstat_{updated,flush} and related to "css_rstat_*"
	check for ss->css_rstat_flush existence where applicable
	rename locks for base stats
	move subsystem locks to cgroup_subsys struct
	change cgroup_rstat_boot() to ss_rstat_init(ss) and init locks within
	change lock helpers to accept css and perform lock selection within
	fix comments that had outdated lock names
	add open css_is_cgroup() helper
	rename rstatc to rstatbc to reflect base stats in use
	rename cgroup_dfl_root_rstat_cpu to root_self_rstat_cpu
	add comments in early init code to explain deferred allocation
	misc formatting fixes

v2:
	drop the patch creating a new cgroup_rstat struct and related code
	drop bpf-specific patches. instead just use cgroup::self in bpf progs
	drop the cpu lock patches. instead select cpu lock in updated_list func
	relocate the cgroup_rstat_init() call to inside css_create()
	relocate the cgroup_rstat_exit() cleanup from apply_control_enable()
		to css_free_rwork_fn()
v1:
	https://lore.kernel.org/all/20250218031448.46951-1-inwardvessel@gmail.com/

JP Kobryn (5):
  cgroup: move rstat base stat objects into their own struct
  cgroup: add helper for checking when css is cgroup::self
  cgroup: change rstat function signatures from cgroup-based to css-based
  cgroup: use separate rstat trees for each subsystem
  cgroup: use subsystem-specific rstat locks to avoid contention

 block/blk-cgroup.c                            |   6 +-
 include/linux/cgroup-defs.h                   |  80 ++--
 include/linux/cgroup.h                        |  16 +-
 include/trace/events/cgroup.h                 |  10 +-
 kernel/cgroup/cgroup-internal.h               |   6 +-
 kernel/cgroup/cgroup.c                        |  69 +--
 kernel/cgroup/rstat.c                         | 412 +++++++++++-------
 mm/memcontrol.c                               |   4 +-
 .../selftests/bpf/progs/btf_type_tag_percpu.c |   5 +-
 .../bpf/progs/cgroup_hierarchical_stats.c     |   8 +-
 10 files changed, 363 insertions(+), 253 deletions(-)

-- 
2.47.1



* [PATCH v4 1/5] cgroup: move rstat base stat objects into their own struct
  2025-04-04  1:10 [PATCH v4 0/5] cgroup: separate rstat trees JP Kobryn
@ 2025-04-04  1:10 ` JP Kobryn
  2025-04-15 17:16   ` Michal Koutný
                     ` (2 more replies)
  2025-04-04  1:10 ` [PATCH v4 2/5] cgroup: add helper for checking when css is cgroup::self JP Kobryn
                   ` (3 subsequent siblings)
From: JP Kobryn @ 2025-04-04  1:10 UTC (permalink / raw)
  To: tj, shakeel.butt, yosryahmed, mkoutny, hannes, akpm
  Cc: linux-mm, cgroups, kernel-team

This non-functional change serves as preparation for moving to
subsystem-based rstat trees. The base stats are not an actual subsystem,
but in future commits they will have exclusive rstat trees just as other
subsystems will.

Moving the base stat objects into a new struct allows the cgroup_rstat_cpu
struct to become more compact, since it now only contains the minimum number
of pointers needed for rstat participation. Subsystems will (in future
commits) make use of the compact cgroup_rstat_cpu struct while avoiding the
memory overhead of the base stat objects which they will not use.

An instance of the new struct cgroup_rstat_base_cpu was placed on the
cgroup struct so it can retain ownership of these base stats common to all
cgroups. A helper function was added for looking up the cpu-specific base
stats of a given cgroup. Finally, initialization and variable names were
adjusted where applicable.

Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
 include/linux/cgroup-defs.h | 38 ++++++++++-------
 kernel/cgroup/cgroup.c      |  8 +++-
 kernel/cgroup/rstat.c       | 84 ++++++++++++++++++++++---------------
 3 files changed, 79 insertions(+), 51 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 485b651869d9..6d177f770d28 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -344,10 +344,29 @@ struct cgroup_base_stat {
  * frequency decreases the cost of each read.
  *
  * This struct hosts both the fields which implement the above -
- * updated_children and updated_next - and the fields which track basic
- * resource statistics on top of it - bsync, bstat and last_bstat.
+ * updated_children and updated_next.
  */
 struct cgroup_rstat_cpu {
+	/*
+	 * Child cgroups with stat updates on this cpu since the last read
+	 * are linked on the parent's ->updated_children through
+	 * ->updated_next.
+	 *
+	 * In addition to being more compact, singly-linked list pointing
+	 * to the cgroup makes it unnecessary for each per-cpu struct to
+	 * point back to the associated cgroup.
+	 *
+	 * Protected by per-cpu cgroup_rstat_cpu_lock.
+	 */
+	struct cgroup *updated_children;	/* terminated by self cgroup */
+	struct cgroup *updated_next;		/* NULL iff not on the list */
+};
+
+/*
+ * This struct hosts the fields which track basic resource statistics on
+ * top of rstat - bsync, bstat and last_bstat.
+ */
+struct cgroup_rstat_base_cpu {
 	/*
 	 * ->bsync protects ->bstat.  These are the only fields which get
 	 * updated in the hot path.
@@ -374,20 +393,6 @@ struct cgroup_rstat_cpu {
 	 * deltas to propagate to the per-cpu subtree_bstat.
 	 */
 	struct cgroup_base_stat last_subtree_bstat;
-
-	/*
-	 * Child cgroups with stat updates on this cpu since the last read
-	 * are linked on the parent's ->updated_children through
-	 * ->updated_next.
-	 *
-	 * In addition to being more compact, singly-linked list pointing
-	 * to the cgroup makes it unnecessary for each per-cpu struct to
-	 * point back to the associated cgroup.
-	 *
-	 * Protected by per-cpu cgroup_rstat_cpu_lock.
-	 */
-	struct cgroup *updated_children;	/* terminated by self cgroup */
-	struct cgroup *updated_next;		/* NULL iff not on the list */
 };
 
 struct cgroup_freezer_state {
@@ -518,6 +523,7 @@ struct cgroup {
 
 	/* per-cpu recursive resource statistics */
 	struct cgroup_rstat_cpu __percpu *rstat_cpu;
+	struct cgroup_rstat_base_cpu __percpu *rstat_base_cpu;
 	struct list_head rstat_css_list;
 
 	/*
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index ac2db99941ca..77349d07b117 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -161,10 +161,14 @@ static struct static_key_true *cgroup_subsys_on_dfl_key[] = {
 };
 #undef SUBSYS
 
-static DEFINE_PER_CPU(struct cgroup_rstat_cpu, cgrp_dfl_root_rstat_cpu);
+static DEFINE_PER_CPU(struct cgroup_rstat_cpu, root_rstat_cpu);
+static DEFINE_PER_CPU(struct cgroup_rstat_base_cpu, root_rstat_base_cpu);
 
 /* the default hierarchy */
-struct cgroup_root cgrp_dfl_root = { .cgrp.rstat_cpu = &cgrp_dfl_root_rstat_cpu };
+struct cgroup_root cgrp_dfl_root = {
+	.cgrp.rstat_cpu = &root_rstat_cpu,
+	.cgrp.rstat_base_cpu = &root_rstat_base_cpu,
+};
 EXPORT_SYMBOL_GPL(cgrp_dfl_root);
 
 /*
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 4bb587d5d34f..a20e3ab3f7d3 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -19,6 +19,12 @@ static struct cgroup_rstat_cpu *cgroup_rstat_cpu(struct cgroup *cgrp, int cpu)
 	return per_cpu_ptr(cgrp->rstat_cpu, cpu);
 }
 
+static struct cgroup_rstat_base_cpu *cgroup_rstat_base_cpu(
+		struct cgroup *cgrp, int cpu)
+{
+	return per_cpu_ptr(cgrp->rstat_base_cpu, cpu);
+}
+
 /*
  * Helper functions for rstat per CPU lock (cgroup_rstat_cpu_lock).
  *
@@ -351,12 +357,22 @@ int cgroup_rstat_init(struct cgroup *cgrp)
 			return -ENOMEM;
 	}
 
+	if (!cgrp->rstat_base_cpu) {
+		cgrp->rstat_base_cpu = alloc_percpu(struct cgroup_rstat_base_cpu);
+		if (!cgrp->rstat_base_cpu) {
+			free_percpu(cgrp->rstat_cpu);
+			return -ENOMEM;
+		}
+	}
+
 	/* ->updated_children list is self terminated */
 	for_each_possible_cpu(cpu) {
 		struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
+		struct cgroup_rstat_base_cpu *rstatbc =
+			cgroup_rstat_base_cpu(cgrp, cpu);
 
 		rstatc->updated_children = cgrp;
-		u64_stats_init(&rstatc->bsync);
+		u64_stats_init(&rstatbc->bsync);
 	}
 
 	return 0;
@@ -379,6 +395,8 @@ void cgroup_rstat_exit(struct cgroup *cgrp)
 
 	free_percpu(cgrp->rstat_cpu);
 	cgrp->rstat_cpu = NULL;
+	free_percpu(cgrp->rstat_base_cpu);
+	cgrp->rstat_base_cpu = NULL;
 }
 
 void __init cgroup_rstat_boot(void)
@@ -419,9 +437,9 @@ static void cgroup_base_stat_sub(struct cgroup_base_stat *dst_bstat,
 
 static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
 {
-	struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
+	struct cgroup_rstat_base_cpu *rstatbc = cgroup_rstat_base_cpu(cgrp, cpu);
 	struct cgroup *parent = cgroup_parent(cgrp);
-	struct cgroup_rstat_cpu *prstatc;
+	struct cgroup_rstat_base_cpu *prstatbc;
 	struct cgroup_base_stat delta;
 	unsigned seq;
 
@@ -431,15 +449,15 @@ static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
 
 	/* fetch the current per-cpu values */
 	do {
-		seq = __u64_stats_fetch_begin(&rstatc->bsync);
-		delta = rstatc->bstat;
-	} while (__u64_stats_fetch_retry(&rstatc->bsync, seq));
+		seq = __u64_stats_fetch_begin(&rstatbc->bsync);
+		delta = rstatbc->bstat;
+	} while (__u64_stats_fetch_retry(&rstatbc->bsync, seq));
 
 	/* propagate per-cpu delta to cgroup and per-cpu global statistics */
-	cgroup_base_stat_sub(&delta, &rstatc->last_bstat);
+	cgroup_base_stat_sub(&delta, &rstatbc->last_bstat);
 	cgroup_base_stat_add(&cgrp->bstat, &delta);
-	cgroup_base_stat_add(&rstatc->last_bstat, &delta);
-	cgroup_base_stat_add(&rstatc->subtree_bstat, &delta);
+	cgroup_base_stat_add(&rstatbc->last_bstat, &delta);
+	cgroup_base_stat_add(&rstatbc->subtree_bstat, &delta);
 
 	/* propagate cgroup and per-cpu global delta to parent (unless that's root) */
 	if (cgroup_parent(parent)) {
@@ -448,73 +466,73 @@ static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
 		cgroup_base_stat_add(&parent->bstat, &delta);
 		cgroup_base_stat_add(&cgrp->last_bstat, &delta);
 
-		delta = rstatc->subtree_bstat;
-		prstatc = cgroup_rstat_cpu(parent, cpu);
-		cgroup_base_stat_sub(&delta, &rstatc->last_subtree_bstat);
-		cgroup_base_stat_add(&prstatc->subtree_bstat, &delta);
-		cgroup_base_stat_add(&rstatc->last_subtree_bstat, &delta);
+		delta = rstatbc->subtree_bstat;
+		prstatbc = cgroup_rstat_base_cpu(parent, cpu);
+		cgroup_base_stat_sub(&delta, &rstatbc->last_subtree_bstat);
+		cgroup_base_stat_add(&prstatbc->subtree_bstat, &delta);
+		cgroup_base_stat_add(&rstatbc->last_subtree_bstat, &delta);
 	}
 }
 
-static struct cgroup_rstat_cpu *
+static struct cgroup_rstat_base_cpu *
 cgroup_base_stat_cputime_account_begin(struct cgroup *cgrp, unsigned long *flags)
 {
-	struct cgroup_rstat_cpu *rstatc;
+	struct cgroup_rstat_base_cpu *rstatbc;
 
-	rstatc = get_cpu_ptr(cgrp->rstat_cpu);
-	*flags = u64_stats_update_begin_irqsave(&rstatc->bsync);
-	return rstatc;
+	rstatbc = get_cpu_ptr(cgrp->rstat_base_cpu);
+	*flags = u64_stats_update_begin_irqsave(&rstatbc->bsync);
+	return rstatbc;
 }
 
 static void cgroup_base_stat_cputime_account_end(struct cgroup *cgrp,
-						 struct cgroup_rstat_cpu *rstatc,
+						 struct cgroup_rstat_base_cpu *rstatbc,
 						 unsigned long flags)
 {
-	u64_stats_update_end_irqrestore(&rstatc->bsync, flags);
+	u64_stats_update_end_irqrestore(&rstatbc->bsync, flags);
 	cgroup_rstat_updated(cgrp, smp_processor_id());
-	put_cpu_ptr(rstatc);
+	put_cpu_ptr(rstatbc);
 }
 
 void __cgroup_account_cputime(struct cgroup *cgrp, u64 delta_exec)
 {
-	struct cgroup_rstat_cpu *rstatc;
+	struct cgroup_rstat_base_cpu *rstatbc;
 	unsigned long flags;
 
-	rstatc = cgroup_base_stat_cputime_account_begin(cgrp, &flags);
-	rstatc->bstat.cputime.sum_exec_runtime += delta_exec;
-	cgroup_base_stat_cputime_account_end(cgrp, rstatc, flags);
+	rstatbc = cgroup_base_stat_cputime_account_begin(cgrp, &flags);
+	rstatbc->bstat.cputime.sum_exec_runtime += delta_exec;
+	cgroup_base_stat_cputime_account_end(cgrp, rstatbc, flags);
 }
 
 void __cgroup_account_cputime_field(struct cgroup *cgrp,
 				    enum cpu_usage_stat index, u64 delta_exec)
 {
-	struct cgroup_rstat_cpu *rstatc;
+	struct cgroup_rstat_base_cpu *rstatbc;
 	unsigned long flags;
 
-	rstatc = cgroup_base_stat_cputime_account_begin(cgrp, &flags);
+	rstatbc = cgroup_base_stat_cputime_account_begin(cgrp, &flags);
 
 	switch (index) {
 	case CPUTIME_NICE:
-		rstatc->bstat.ntime += delta_exec;
+		rstatbc->bstat.ntime += delta_exec;
 		fallthrough;
 	case CPUTIME_USER:
-		rstatc->bstat.cputime.utime += delta_exec;
+		rstatbc->bstat.cputime.utime += delta_exec;
 		break;
 	case CPUTIME_SYSTEM:
 	case CPUTIME_IRQ:
 	case CPUTIME_SOFTIRQ:
-		rstatc->bstat.cputime.stime += delta_exec;
+		rstatbc->bstat.cputime.stime += delta_exec;
 		break;
 #ifdef CONFIG_SCHED_CORE
 	case CPUTIME_FORCEIDLE:
-		rstatc->bstat.forceidle_sum += delta_exec;
+		rstatbc->bstat.forceidle_sum += delta_exec;
 		break;
 #endif
 	default:
 		break;
 	}
 
-	cgroup_base_stat_cputime_account_end(cgrp, rstatc, flags);
+	cgroup_base_stat_cputime_account_end(cgrp, rstatbc, flags);
 }
 
 /*
-- 
2.47.1



* [PATCH v4 2/5] cgroup: add helper for checking when css is cgroup::self
  2025-04-04  1:10 [PATCH v4 0/5] cgroup: separate rstat trees JP Kobryn
  2025-04-04  1:10 ` [PATCH v4 1/5] cgroup: move rstat base stat objects into their own struct JP Kobryn
@ 2025-04-04  1:10 ` JP Kobryn
  2025-04-22 12:19   ` Yosry Ahmed
       [not found]   ` <68078968.5d0a0220.2c3c35.bab3SMTPIN_ADDED_BROKEN@mx.google.com>
  2025-04-04  1:10 ` [PATCH v4 3/5] cgroup: change rstat function signatures from cgroup-based to css-based JP Kobryn
                   ` (2 subsequent siblings)
From: JP Kobryn @ 2025-04-04  1:10 UTC (permalink / raw)
  To: tj, shakeel.butt, yosryahmed, mkoutny, hannes, akpm
  Cc: linux-mm, cgroups, kernel-team

The cgroup struct has a css field called "self". The main difference
between this css and the others found in the cgroup::subsys array is that
cgroup::self has a NULL subsystem pointer. There are several places where
checks are performed to determine whether the css in question is
cgroup::self or not. Instead of accessing css->ss directly, introduce a
helper function that shows the intent and use where applicable.

Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
 include/linux/cgroup.h | 5 +++++
 kernel/cgroup/cgroup.c | 4 ++--
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 28e999f2c642..7c120efd5e49 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -347,6 +347,11 @@ static inline bool css_is_dying(struct cgroup_subsys_state *css)
 	return !(css->flags & CSS_NO_REF) && percpu_ref_is_dying(&css->refcnt);
 }
 
+static inline bool css_is_cgroup(struct cgroup_subsys_state *css)
+{
+	return css->ss == NULL;
+}
+
 static inline void cgroup_get(struct cgroup *cgrp)
 {
 	css_get(&cgrp->self);
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 77349d07b117..00eb882dc6e7 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1719,7 +1719,7 @@ static void css_clear_dir(struct cgroup_subsys_state *css)
 
 	css->flags &= ~CSS_VISIBLE;
 
-	if (!css->ss) {
+	if (css_is_cgroup(css)) {
 		if (cgroup_on_dfl(cgrp)) {
 			cgroup_addrm_files(css, cgrp,
 					   cgroup_base_files, false);
@@ -1751,7 +1751,7 @@ static int css_populate_dir(struct cgroup_subsys_state *css)
 	if (css->flags & CSS_VISIBLE)
 		return 0;
 
-	if (!css->ss) {
+	if (css_is_cgroup(css)) {
 		if (cgroup_on_dfl(cgrp)) {
 			ret = cgroup_addrm_files(css, cgrp,
 						 cgroup_base_files, true);
-- 
2.47.1



* [PATCH v4 3/5] cgroup: change rstat function signatures from cgroup-based to css-based
  2025-04-04  1:10 [PATCH v4 0/5] cgroup: separate rstat trees JP Kobryn
  2025-04-04  1:10 ` [PATCH v4 1/5] cgroup: move rstat base stat objects into their own struct JP Kobryn
  2025-04-04  1:10 ` [PATCH v4 2/5] cgroup: add helper for checking when css is cgroup::self JP Kobryn
@ 2025-04-04  1:10 ` JP Kobryn
  2025-04-04 20:00   ` Tejun Heo
                     ` (3 more replies)
  2025-04-04  1:10 ` [PATCH v4 4/5] cgroup: use separate rstat trees for each subsystem JP Kobryn
  2025-04-04  1:10 ` [PATCH v4 5/5] cgroup: use subsystem-specific rstat locks to avoid contention JP Kobryn
From: JP Kobryn @ 2025-04-04  1:10 UTC (permalink / raw)
  To: tj, shakeel.butt, yosryahmed, mkoutny, hannes, akpm
  Cc: linux-mm, cgroups, kernel-team

This non-functional change serves as preparation for moving to
subsystem-based rstat trees. To simplify future commits, change the
signatures of existing cgroup-based rstat functions to become css-based and
rename them to reflect that.

Though the signatures have changed, the implementations have not. Within
these functions, use the css->cgroup pointer to obtain the associated cgroup
so that the code behaves just as it did before this patch. At applicable
call sites, pass the subsystem-specific css pointer as an argument, or a
pointer to cgroup::self when not in subsystem context.

Note that cgroup_rstat_updated_list() and cgroup_rstat_push_children()
are not altered yet, since doing so would require a larger number of
css-to-cgroup conversions and could overcomplicate the code at this
intermediate phase.

Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
 block/blk-cgroup.c                            |  6 +-
 include/linux/cgroup-defs.h                   |  2 +-
 include/linux/cgroup.h                        |  4 +-
 kernel/cgroup/cgroup-internal.h               |  4 +-
 kernel/cgroup/cgroup.c                        | 30 ++++---
 kernel/cgroup/rstat.c                         | 83 +++++++++++--------
 mm/memcontrol.c                               |  4 +-
 .../bpf/progs/cgroup_hierarchical_stats.c     |  9 +-
 8 files changed, 80 insertions(+), 62 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 5905f277057b..0560ea402856 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1144,7 +1144,7 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 /*
  * We source root cgroup stats from the system-wide stats to avoid
  * tracking the same information twice and incurring overhead when no
- * cgroups are defined. For that reason, cgroup_rstat_flush in
+ * cgroups are defined. For that reason, css_rstat_flush in
  * blkcg_print_stat does not actually fill out the iostat in the root
  * cgroup's blkcg_gq.
  *
@@ -1253,7 +1253,7 @@ static int blkcg_print_stat(struct seq_file *sf, void *v)
 	if (!seq_css(sf)->parent)
 		blkcg_fill_root_iostats();
 	else
-		cgroup_rstat_flush(blkcg->css.cgroup);
+		css_rstat_flush(&blkcg->css);
 
 	rcu_read_lock();
 	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
@@ -2243,7 +2243,7 @@ void blk_cgroup_bio_start(struct bio *bio)
 	}
 
 	u64_stats_update_end_irqrestore(&bis->sync, flags);
-	cgroup_rstat_updated(blkcg->css.cgroup, cpu);
+	css_rstat_updated(&blkcg->css, cpu);
 	put_cpu();
 }
 
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 6d177f770d28..e4a9fb00b228 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -536,7 +536,7 @@ struct cgroup {
 	/*
 	 * A singly-linked list of cgroup structures to be rstat flushed.
 	 * This is a scratch field to be used exclusively by
-	 * cgroup_rstat_flush_locked() and protected by cgroup_rstat_lock.
+	 * css_rstat_flush_locked() and protected by cgroup_rstat_lock.
 	 */
 	struct cgroup	*rstat_flush_next;
 
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 7c120efd5e49..b906c53953b9 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -693,8 +693,8 @@ static inline void cgroup_path_from_kernfs_id(u64 id, char *buf, size_t buflen)
 /*
  * cgroup scalable recursive statistics.
  */
-void cgroup_rstat_updated(struct cgroup *cgrp, int cpu);
-void cgroup_rstat_flush(struct cgroup *cgrp);
+void css_rstat_updated(struct cgroup_subsys_state *css, int cpu);
+void css_rstat_flush(struct cgroup_subsys_state *css);
 
 /*
  * Basic resource stats.
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index 95ab39e1ec8f..c161d34be634 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -270,8 +270,8 @@ int cgroup_task_count(const struct cgroup *cgrp);
 /*
  * rstat.c
  */
-int cgroup_rstat_init(struct cgroup *cgrp);
-void cgroup_rstat_exit(struct cgroup *cgrp);
+int css_rstat_init(struct cgroup_subsys_state *css);
+void css_rstat_exit(struct cgroup_subsys_state *css);
 void cgroup_rstat_boot(void);
 void cgroup_base_stat_cputime_show(struct seq_file *seq);
 
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 00eb882dc6e7..f8cee7819642 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1375,7 +1375,7 @@ static void cgroup_destroy_root(struct cgroup_root *root)
 
 	cgroup_unlock();
 
-	cgroup_rstat_exit(cgrp);
+	css_rstat_exit(&cgrp->self);
 	kernfs_destroy_root(root->kf_root);
 	cgroup_free_root(root);
 }
@@ -2150,7 +2150,7 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask)
 	if (ret)
 		goto destroy_root;
 
-	ret = cgroup_rstat_init(root_cgrp);
+	ret = css_rstat_init(&root_cgrp->self);
 	if (ret)
 		goto destroy_root;
 
@@ -2192,7 +2192,7 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask)
 	goto out;
 
 exit_stats:
-	cgroup_rstat_exit(root_cgrp);
+	css_rstat_exit(&root_cgrp->self);
 destroy_root:
 	kernfs_destroy_root(root->kf_root);
 	root->kf_root = NULL;
@@ -5449,7 +5449,7 @@ static void css_free_rwork_fn(struct work_struct *work)
 			cgroup_put(cgroup_parent(cgrp));
 			kernfs_put(cgrp->kn);
 			psi_cgroup_free(cgrp);
-			cgroup_rstat_exit(cgrp);
+			css_rstat_exit(css);
 			kfree(cgrp);
 		} else {
 			/*
@@ -5479,7 +5479,7 @@ static void css_release_work_fn(struct work_struct *work)
 
 		/* css release path */
 		if (!list_empty(&css->rstat_css_node)) {
-			cgroup_rstat_flush(cgrp);
+			css_rstat_flush(css);
 			list_del_rcu(&css->rstat_css_node);
 		}
 
@@ -5507,7 +5507,7 @@ static void css_release_work_fn(struct work_struct *work)
 		/* cgroup release path */
 		TRACE_CGROUP_PATH(release, cgrp);
 
-		cgroup_rstat_flush(cgrp);
+		css_rstat_flush(css);
 
 		spin_lock_irq(&css_set_lock);
 		for (tcgrp = cgroup_parent(cgrp); tcgrp;
@@ -5700,17 +5700,13 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
 	if (ret)
 		goto out_free_cgrp;
 
-	ret = cgroup_rstat_init(cgrp);
-	if (ret)
-		goto out_cancel_ref;
-
 	/* create the directory */
 	kn = kernfs_create_dir_ns(parent->kn, name, mode,
 				  current_fsuid(), current_fsgid(),
 				  cgrp, NULL);
 	if (IS_ERR(kn)) {
 		ret = PTR_ERR(kn);
-		goto out_stat_exit;
+		goto out_cancel_ref;
 	}
 	cgrp->kn = kn;
 
@@ -5720,6 +5716,14 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
 	cgrp->root = root;
 	cgrp->level = level;
 
+	/*
+	 * Now that init_cgroup_housekeeping() has been called and cgrp->self
+	 * is setup, it is safe to perform rstat initialization on it.
+	 */
+	ret = css_rstat_init(&cgrp->self);
+	if (ret)
+		goto out_stat_exit;
+
 	ret = psi_cgroup_alloc(cgrp);
 	if (ret)
 		goto out_kernfs_remove;
@@ -5790,10 +5794,10 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
 
 out_psi_free:
 	psi_cgroup_free(cgrp);
+out_stat_exit:
+	css_rstat_exit(&cgrp->self);
 out_kernfs_remove:
 	kernfs_remove(cgrp->kn);
-out_stat_exit:
-	cgroup_rstat_exit(cgrp);
 out_cancel_ref:
 	percpu_ref_exit(&cgrp->self.refcnt);
 out_free_cgrp:
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index a20e3ab3f7d3..5bca55b4ec15 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -34,9 +34,10 @@ static struct cgroup_rstat_base_cpu *cgroup_rstat_base_cpu(
  * operations without handling high-frequency fast-path "update" events.
  */
 static __always_inline
-unsigned long _cgroup_rstat_cpu_lock(raw_spinlock_t *cpu_lock, int cpu,
-				     struct cgroup *cgrp, const bool fast_path)
+unsigned long _css_rstat_cpu_lock(raw_spinlock_t *cpu_lock, int cpu,
+				     struct cgroup_subsys_state *css, const bool fast_path)
 {
+	struct cgroup *cgrp = css->cgroup;
 	unsigned long flags;
 	bool contended;
 
@@ -67,10 +68,12 @@ unsigned long _cgroup_rstat_cpu_lock(raw_spinlock_t *cpu_lock, int cpu,
 }
 
 static __always_inline
-void _cgroup_rstat_cpu_unlock(raw_spinlock_t *cpu_lock, int cpu,
-			      struct cgroup *cgrp, unsigned long flags,
+void _css_rstat_cpu_unlock(raw_spinlock_t *cpu_lock, int cpu,
+			      struct cgroup_subsys_state *css, unsigned long flags,
 			      const bool fast_path)
 {
+	struct cgroup *cgrp = css->cgroup;
+
 	if (fast_path)
 		trace_cgroup_rstat_cpu_unlock_fastpath(cgrp, cpu, false);
 	else
@@ -80,16 +83,17 @@ void _cgroup_rstat_cpu_unlock(raw_spinlock_t *cpu_lock, int cpu,
 }
 
 /**
- * cgroup_rstat_updated - keep track of updated rstat_cpu
- * @cgrp: target cgroup
+ * css_rstat_updated - keep track of updated rstat_cpu
+ * @css: target cgroup subsystem state
  * @cpu: cpu on which rstat_cpu was updated
  *
- * @cgrp's rstat_cpu on @cpu was updated.  Put it on the parent's matching
- * rstat_cpu->updated_children list.  See the comment on top of
+ * @css->cgroup's rstat_cpu on @cpu was updated. Put it on the parent's
+ * matching rstat_cpu->updated_children list. See the comment on top of
  * cgroup_rstat_cpu definition for details.
  */
-__bpf_kfunc void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
+__bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
 {
+	struct cgroup *cgrp = css->cgroup;
 	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
 	unsigned long flags;
 
@@ -104,7 +108,7 @@ __bpf_kfunc void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
 	if (data_race(cgroup_rstat_cpu(cgrp, cpu)->updated_next))
 		return;
 
-	flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, cgrp, true);
+	flags = _css_rstat_cpu_lock(cpu_lock, cpu, css, true);
 
 	/* put @cgrp and all ancestors on the corresponding updated lists */
 	while (true) {
@@ -132,7 +136,7 @@ __bpf_kfunc void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
 		cgrp = parent;
 	}
 
-	_cgroup_rstat_cpu_unlock(cpu_lock, cpu, cgrp, flags, true);
+	_css_rstat_cpu_unlock(cpu_lock, cpu, css, flags, true);
 }
 
 /**
@@ -213,7 +217,7 @@ static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
 	struct cgroup *head = NULL, *parent, *child;
 	unsigned long flags;
 
-	flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, root, false);
+	flags = _css_rstat_cpu_lock(cpu_lock, cpu, &root->self, false);
 
 	/* Return NULL if this subtree is not on-list */
 	if (!rstatc->updated_next)
@@ -250,14 +254,14 @@ static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
 	if (child != root)
 		head = cgroup_rstat_push_children(head, child, cpu);
 unlock_ret:
-	_cgroup_rstat_cpu_unlock(cpu_lock, cpu, root, flags, false);
+	_css_rstat_cpu_unlock(cpu_lock, cpu, &root->self, flags, false);
 	return head;
 }
 
 /*
  * A hook for bpf stat collectors to attach to and flush their stats.
- * Together with providing bpf kfuncs for cgroup_rstat_updated() and
- * cgroup_rstat_flush(), this enables a complete workflow where bpf progs that
+ * Together with providing bpf kfuncs for css_rstat_updated() and
+ * css_rstat_flush(), this enables a complete workflow where bpf progs that
  * collect cgroup stats can integrate with rstat for efficient flushing.
  *
  * A static noinline declaration here could cause the compiler to optimize away
@@ -285,9 +289,11 @@ __bpf_hook_end();
  * value -1 is used when obtaining the main lock else this is the CPU
  * number processed last.
  */
-static inline void __cgroup_rstat_lock(struct cgroup *cgrp, int cpu_in_loop)
+static inline void __css_rstat_lock(struct cgroup_subsys_state *css,
+		int cpu_in_loop)
 	__acquires(&cgroup_rstat_lock)
 {
+	struct cgroup *cgrp = css->cgroup;
 	bool contended;
 
 	contended = !spin_trylock_irq(&cgroup_rstat_lock);
@@ -298,36 +304,41 @@ static inline void __cgroup_rstat_lock(struct cgroup *cgrp, int cpu_in_loop)
 	trace_cgroup_rstat_locked(cgrp, cpu_in_loop, contended);
 }
 
-static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
+static inline void __css_rstat_unlock(struct cgroup_subsys_state *css,
+		int cpu_in_loop)
 	__releases(&cgroup_rstat_lock)
 {
+	struct cgroup *cgrp = css->cgroup;
+
 	trace_cgroup_rstat_unlock(cgrp, cpu_in_loop, false);
 	spin_unlock_irq(&cgroup_rstat_lock);
 }
 
 /**
- * cgroup_rstat_flush - flush stats in @cgrp's subtree
- * @cgrp: target cgroup
+ * css_rstat_flush - flush stats in @css->cgroup's subtree
+ * @css: target cgroup subsystem state
  *
- * Collect all per-cpu stats in @cgrp's subtree into the global counters
+ * Collect all per-cpu stats in @css->cgroup's subtree into the global counters
  * and propagate them upwards.  After this function returns, all cgroups in
  * the subtree have up-to-date ->stat.
  *
- * This also gets all cgroups in the subtree including @cgrp off the
+ * This also gets all cgroups in the subtree including @css->cgroup off the
  * ->updated_children lists.
  *
  * This function may block.
  */
-__bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
+__bpf_kfunc void css_rstat_flush(struct cgroup_subsys_state *css)
 {
+	struct cgroup *cgrp = css->cgroup;
 	int cpu;
 
 	might_sleep();
 	for_each_possible_cpu(cpu) {
-		struct cgroup *pos = cgroup_rstat_updated_list(cgrp, cpu);
+		struct cgroup *pos;
 
 		/* Reacquire for each CPU to avoid disabling IRQs too long */
-		__cgroup_rstat_lock(cgrp, cpu);
+		__css_rstat_lock(css, cpu);
+		pos = cgroup_rstat_updated_list(cgrp, cpu);
 		for (; pos; pos = pos->rstat_flush_next) {
 			struct cgroup_subsys_state *css;
 
@@ -340,14 +351,15 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
 				css->ss->css_rstat_flush(css, cpu);
 			rcu_read_unlock();
 		}
-		__cgroup_rstat_unlock(cgrp, cpu);
+		__css_rstat_unlock(css, cpu);
 		if (!cond_resched())
 			cpu_relax();
 	}
 }
 
-int cgroup_rstat_init(struct cgroup *cgrp)
+int css_rstat_init(struct cgroup_subsys_state *css)
 {
+	struct cgroup *cgrp = css->cgroup;
 	int cpu;
 
 	/* the root cgrp has rstat_cpu preallocated */
@@ -378,11 +390,12 @@ int cgroup_rstat_init(struct cgroup *cgrp)
 	return 0;
 }
 
-void cgroup_rstat_exit(struct cgroup *cgrp)
+void css_rstat_exit(struct cgroup_subsys_state *css)
 {
+	struct cgroup *cgrp = css->cgroup;
 	int cpu;
 
-	cgroup_rstat_flush(cgrp);
+	css_rstat_flush(&cgrp->self);
 
 	/* sanity check */
 	for_each_possible_cpu(cpu) {
@@ -489,7 +502,7 @@ static void cgroup_base_stat_cputime_account_end(struct cgroup *cgrp,
 						 unsigned long flags)
 {
 	u64_stats_update_end_irqrestore(&rstatbc->bsync, flags);
-	cgroup_rstat_updated(cgrp, smp_processor_id());
+	css_rstat_updated(&cgrp->self, smp_processor_id());
 	put_cpu_ptr(rstatbc);
 }
 
@@ -591,12 +604,12 @@ void cgroup_base_stat_cputime_show(struct seq_file *seq)
 	struct cgroup_base_stat bstat;
 
 	if (cgroup_parent(cgrp)) {
-		cgroup_rstat_flush(cgrp);
-		__cgroup_rstat_lock(cgrp, -1);
+		css_rstat_flush(&cgrp->self);
+		__css_rstat_lock(&cgrp->self, -1);
 		bstat = cgrp->bstat;
 		cputime_adjust(&cgrp->bstat.cputime, &cgrp->prev_cputime,
 			       &bstat.cputime.utime, &bstat.cputime.stime);
-		__cgroup_rstat_unlock(cgrp, -1);
+		__css_rstat_unlock(&cgrp->self, -1);
 	} else {
 		root_cgroup_cputime(&bstat);
 	}
@@ -618,10 +631,10 @@ void cgroup_base_stat_cputime_show(struct seq_file *seq)
 	cgroup_force_idle_show(seq, &bstat);
 }
 
-/* Add bpf kfuncs for cgroup_rstat_updated() and cgroup_rstat_flush() */
+/* Add bpf kfuncs for css_rstat_updated() and css_rstat_flush() */
 BTF_KFUNCS_START(bpf_rstat_kfunc_ids)
-BTF_ID_FLAGS(func, cgroup_rstat_updated)
-BTF_ID_FLAGS(func, cgroup_rstat_flush, KF_SLEEPABLE)
+BTF_ID_FLAGS(func, css_rstat_updated)
+BTF_ID_FLAGS(func, css_rstat_flush, KF_SLEEPABLE)
 BTF_KFUNCS_END(bpf_rstat_kfunc_ids)
 
 static const struct btf_kfunc_id_set bpf_rstat_kfunc_set = {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 421740f1bcdc..140b31821b48 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -582,7 +582,7 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 	if (!val)
 		return;
 
-	cgroup_rstat_updated(memcg->css.cgroup, cpu);
+	css_rstat_updated(&memcg->css, cpu);
 	statc = this_cpu_ptr(memcg->vmstats_percpu);
 	for (; statc; statc = statc->parent) {
 		stats_updates = READ_ONCE(statc->stats_updates) + abs(val);
@@ -614,7 +614,7 @@ static void __mem_cgroup_flush_stats(struct mem_cgroup *memcg, bool force)
 	if (mem_cgroup_is_root(memcg))
 		WRITE_ONCE(flush_last_time, jiffies_64);
 
-	cgroup_rstat_flush(memcg->css.cgroup);
+	css_rstat_flush(&memcg->css);
 }
 
 /*
diff --git a/tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c b/tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
index c74362854948..ff189a736ad8 100644
--- a/tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
+++ b/tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
@@ -37,8 +37,9 @@ struct {
 	__type(value, struct attach_counter);
 } attach_counters SEC(".maps");
 
-extern void cgroup_rstat_updated(struct cgroup *cgrp, int cpu) __ksym;
-extern void cgroup_rstat_flush(struct cgroup *cgrp) __ksym;
+extern void css_rstat_updated(
+		struct cgroup_subsys_state *css, int cpu) __ksym;
+extern void css_rstat_flush(struct cgroup_subsys_state *css) __ksym;
 
 static uint64_t cgroup_id(struct cgroup *cgrp)
 {
@@ -75,7 +76,7 @@ int BPF_PROG(counter, struct cgroup *dst_cgrp, struct task_struct *leader,
 	else if (create_percpu_attach_counter(cg_id, 1))
 		return 0;
 
-	cgroup_rstat_updated(dst_cgrp, bpf_get_smp_processor_id());
+	css_rstat_updated(&dst_cgrp->self, bpf_get_smp_processor_id());
 	return 0;
 }
 
@@ -141,7 +142,7 @@ int BPF_PROG(dumper, struct bpf_iter_meta *meta, struct cgroup *cgrp)
 		return 1;
 
 	/* Flush the stats to make sure we get the most updated numbers */
-	cgroup_rstat_flush(cgrp);
+	css_rstat_flush(&cgrp->self);
 
 	total_counter = bpf_map_lookup_elem(&attach_counters, &cg_id);
 	if (!total_counter) {
-- 
2.47.1



* [PATCH v4 4/5] cgroup: use separate rstat trees for each subsystem
  2025-04-04  1:10 [PATCH v4 0/5] cgroup: separate rstat trees JP Kobryn
                   ` (2 preceding siblings ...)
  2025-04-04  1:10 ` [PATCH v4 3/5] cgroup: change rstat function signatures from cgroup-based to css-based JP Kobryn
@ 2025-04-04  1:10 ` JP Kobryn
  2025-04-15 17:15   ` Michal Koutný
                     ` (3 more replies)
  2025-04-04  1:10 ` [PATCH v4 5/5] cgroup: use subsystem-specific rstat locks to avoid contention JP Kobryn
From: JP Kobryn @ 2025-04-04  1:10 UTC (permalink / raw)
  To: tj, shakeel.butt, yosryahmed, mkoutny, hannes, akpm
  Cc: linux-mm, cgroups, kernel-team

Different subsystems may call cgroup_rstat_updated() within the same
cgroup, resulting in a tree of pending updates from multiple subsystems.
When one of these subsystems is flushed via cgroup_rstat_flush(), all
other subsystems with pending updates on the tree will also be flushed.

Change the paradigm of having a single rstat tree for all subsystems to
having separate trees for each subsystem. This separation allows for
subsystems to perform flushes without the side effects of other subsystems.
As an example, flushing the cpu stats will no longer cause the memory stats
to be flushed and vice versa.

In order to achieve subsystem-specific trees, change the tree node type
from cgroup to cgroup_subsys_state pointer. Then remove those pointers from
the cgroup and instead place them on the css. Finally, change update/flush
functions to make use of the different node type (css). These changes allow
a specific subsystem to be associated with an update or flush. Separate
rstat trees will now exist for each unique subsystem.

Since updating/flushing will now be done at the subsystem level, there is
no longer a need to keep track of updated css nodes at the cgroup level.
The list management of these nodes done within the cgroup (rstat_css_list
and related) has been removed accordingly.

Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
 include/linux/cgroup-defs.h                   |  40 ++--
 kernel/cgroup/cgroup.c                        |  49 ++---
 kernel/cgroup/rstat.c                         | 174 +++++++++---------
 .../selftests/bpf/progs/btf_type_tag_percpu.c |  18 +-
 4 files changed, 151 insertions(+), 130 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index e4a9fb00b228..c58c21c2110a 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -169,6 +169,9 @@ struct cgroup_subsys_state {
 	/* reference count - access via css_[try]get() and css_put() */
 	struct percpu_ref refcnt;
 
+	/* per-cpu recursive resource statistics */
+	struct css_rstat_cpu __percpu *rstat_cpu;
+
 	/*
 	 * siblings list anchored at the parent's ->children
 	 *
@@ -177,9 +180,6 @@ struct cgroup_subsys_state {
 	struct list_head sibling;
 	struct list_head children;
 
-	/* flush target list anchored at cgrp->rstat_css_list */
-	struct list_head rstat_css_node;
-
 	/*
 	 * PI: Subsys-unique ID.  0 is unused and root is always 1.  The
 	 * matching css can be looked up using css_from_id().
@@ -219,6 +219,13 @@ struct cgroup_subsys_state {
 	 * Protected by cgroup_mutex.
 	 */
 	int nr_descendants;
+
+	/*
+	 * A singly-linked list of css structures to be rstat flushed.
+	 * This is a scratch field to be used exclusively by
+	 * css_rstat_flush_locked() and protected by cgroup_rstat_lock.
+	 */
+	struct cgroup_subsys_state *rstat_flush_next;
 };
 
 /*
@@ -329,10 +336,10 @@ struct cgroup_base_stat {
 
 /*
  * rstat - cgroup scalable recursive statistics.  Accounting is done
- * per-cpu in cgroup_rstat_cpu which is then lazily propagated up the
+ * per-cpu in css_rstat_cpu which is then lazily propagated up the
  * hierarchy on reads.
  *
- * When a stat gets updated, the cgroup_rstat_cpu and its ancestors are
+ * When a stat gets updated, the css_rstat_cpu and its ancestors are
  * linked into the updated tree.  On the following read, propagation only
  * considers and consumes the updated tree.  This makes reading O(the
  * number of descendants which have been active since last read) instead of
@@ -346,7 +353,7 @@ struct cgroup_base_stat {
  * This struct hosts both the fields which implement the above -
  * updated_children and updated_next.
  */
-struct cgroup_rstat_cpu {
+struct css_rstat_cpu {
 	/*
 	 * Child cgroups with stat updates on this cpu since the last read
 	 * are linked on the parent's ->updated_children through
@@ -358,8 +365,8 @@ struct cgroup_rstat_cpu {
 	 *
 	 * Protected by per-cpu cgroup_rstat_cpu_lock.
 	 */
-	struct cgroup *updated_children;	/* terminated by self cgroup */
-	struct cgroup *updated_next;		/* NULL iff not on the list */
+	struct cgroup_subsys_state *updated_children;	/* terminated by self css */
+	struct cgroup_subsys_state *updated_next;	/* NULL iff not on the list */
 };
 
 /*
@@ -521,25 +528,16 @@ struct cgroup {
 	struct cgroup *dom_cgrp;
 	struct cgroup *old_dom_cgrp;		/* used while enabling threaded */
 
-	/* per-cpu recursive resource statistics */
-	struct cgroup_rstat_cpu __percpu *rstat_cpu;
+	/* per-cpu recursive basic resource statistics */
 	struct cgroup_rstat_base_cpu __percpu *rstat_base_cpu;
-	struct list_head rstat_css_list;
 
 	/*
-	 * Add padding to separate the read mostly rstat_cpu and
-	 * rstat_css_list into a different cacheline from the following
-	 * rstat_flush_next and *bstat fields which can have frequent updates.
+	 * Add padding to keep the read mostly rstat per-cpu pointer on a
+	 * different cacheline than the following *bstat fields which can have
+	 * frequent updates.
 	 */
 	CACHELINE_PADDING(_pad_);
 
-	/*
-	 * A singly-linked list of cgroup structures to be rstat flushed.
-	 * This is a scratch field to be used exclusively by
-	 * css_rstat_flush_locked() and protected by cgroup_rstat_lock.
-	 */
-	struct cgroup	*rstat_flush_next;
-
 	/* cgroup basic resource statistics */
 	struct cgroup_base_stat last_bstat;
 	struct cgroup_base_stat bstat;
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index f8cee7819642..c351b98ebd06 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -161,12 +161,12 @@ static struct static_key_true *cgroup_subsys_on_dfl_key[] = {
 };
 #undef SUBSYS
 
-static DEFINE_PER_CPU(struct cgroup_rstat_cpu, root_rstat_cpu);
+static DEFINE_PER_CPU(struct css_rstat_cpu, root_rstat_cpu);
 static DEFINE_PER_CPU(struct cgroup_rstat_base_cpu, root_rstat_base_cpu);
 
 /* the default hierarchy */
 struct cgroup_root cgrp_dfl_root = {
-	.cgrp.rstat_cpu = &root_rstat_cpu,
+	.cgrp.self.rstat_cpu = &root_rstat_cpu,
 	.cgrp.rstat_base_cpu = &root_rstat_base_cpu,
 };
 EXPORT_SYMBOL_GPL(cgrp_dfl_root);
@@ -1880,13 +1880,6 @@ int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask)
 		}
 		spin_unlock_irq(&css_set_lock);
 
-		if (ss->css_rstat_flush) {
-			list_del_rcu(&css->rstat_css_node);
-			synchronize_rcu();
-			list_add_rcu(&css->rstat_css_node,
-				     &dcgrp->rstat_css_list);
-		}
-
 		/* default hierarchy doesn't enable controllers by default */
 		dst_root->subsys_mask |= 1 << ssid;
 		if (dst_root == &cgrp_dfl_root) {
@@ -2069,7 +2062,6 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
 	cgrp->dom_cgrp = cgrp;
 	cgrp->max_descendants = INT_MAX;
 	cgrp->max_depth = INT_MAX;
-	INIT_LIST_HEAD(&cgrp->rstat_css_list);
 	prev_cputime_init(&cgrp->prev_cputime);
 
 	for_each_subsys(ss, ssid)
@@ -5425,6 +5417,9 @@ static void css_free_rwork_fn(struct work_struct *work)
 		struct cgroup_subsys_state *parent = css->parent;
 		int id = css->id;
 
+		if (ss->css_rstat_flush)
+			css_rstat_exit(css);
+
 		ss->css_free(css);
 		cgroup_idr_remove(&ss->css_idr, id);
 		cgroup_put(cgrp);
@@ -5477,11 +5472,8 @@ static void css_release_work_fn(struct work_struct *work)
 	if (ss) {
 		struct cgroup *parent_cgrp;
 
-		/* css release path */
-		if (!list_empty(&css->rstat_css_node)) {
+		if (ss->css_rstat_flush)
 			css_rstat_flush(css);
-			list_del_rcu(&css->rstat_css_node);
-		}
 
 		cgroup_idr_replace(&ss->css_idr, NULL, css->id);
 		if (ss->css_released)
@@ -5507,7 +5499,7 @@ static void css_release_work_fn(struct work_struct *work)
 		/* cgroup release path */
 		TRACE_CGROUP_PATH(release, cgrp);
 
-		css_rstat_flush(css);
+		css_rstat_flush(&cgrp->self);
 
 		spin_lock_irq(&css_set_lock);
 		for (tcgrp = cgroup_parent(cgrp); tcgrp;
@@ -5555,7 +5547,6 @@ static void init_and_link_css(struct cgroup_subsys_state *css,
 	css->id = -1;
 	INIT_LIST_HEAD(&css->sibling);
 	INIT_LIST_HEAD(&css->children);
-	INIT_LIST_HEAD(&css->rstat_css_node);
 	css->serial_nr = css_serial_nr_next++;
 	atomic_set(&css->online_cnt, 0);
 
@@ -5564,9 +5555,6 @@ static void init_and_link_css(struct cgroup_subsys_state *css,
 		css_get(css->parent);
 	}
 
-	if (ss->css_rstat_flush)
-		list_add_rcu(&css->rstat_css_node, &cgrp->rstat_css_list);
-
 	BUG_ON(cgroup_css(cgrp, ss));
 }
 
@@ -5659,6 +5647,12 @@ static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
 		goto err_free_css;
 	css->id = err;
 
+	if (ss->css_rstat_flush) {
+		err = css_rstat_init(css);
+		if (err)
+			goto err_free_css;
+	}
+
 	/* @css is ready to be brought online now, make it visible */
 	list_add_tail_rcu(&css->sibling, &parent_css->children);
 	cgroup_idr_replace(&ss->css_idr, css, css->id);
@@ -5672,7 +5666,6 @@ static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
 err_list_del:
 	list_del_rcu(&css->sibling);
 err_free_css:
-	list_del_rcu(&css->rstat_css_node);
 	INIT_RCU_WORK(&css->destroy_rwork, css_free_rwork_fn);
 	queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
 	return ERR_PTR(err);
@@ -6104,11 +6097,17 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
 	css->flags |= CSS_NO_REF;
 
 	if (early) {
-		/* allocation can't be done safely during early init */
+		/*
+		 * Allocation can't be done safely during early init.
+		 * Defer IDR and rstat allocations until cgroup_init().
+		 */
 		css->id = 1;
 	} else {
 		css->id = cgroup_idr_alloc(&ss->css_idr, css, 1, 2, GFP_KERNEL);
 		BUG_ON(css->id < 0);
+
+		if (ss->css_rstat_flush)
+			BUG_ON(css_rstat_init(css));
 	}
 
 	/* Update the init_css_set to contain a subsys
@@ -6207,9 +6206,17 @@ int __init cgroup_init(void)
 			struct cgroup_subsys_state *css =
 				init_css_set.subsys[ss->id];
 
+			/*
+			 * It is now safe to perform allocations.
+			 * Finish setting up subsystems that previously
+			 * deferred IDR and rstat allocations.
+			 */
 			css->id = cgroup_idr_alloc(&ss->css_idr, css, 1, 2,
 						   GFP_KERNEL);
 			BUG_ON(css->id < 0);
+
+			if (ss->css_rstat_flush)
+				BUG_ON(css_rstat_init(css));
 		} else {
 			cgroup_init_subsys(ss, false);
 		}
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 5bca55b4ec15..37d9e5012b2d 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -14,9 +14,10 @@ static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
 
 static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
 
-static struct cgroup_rstat_cpu *cgroup_rstat_cpu(struct cgroup *cgrp, int cpu)
+static struct css_rstat_cpu *css_rstat_cpu(
+		struct cgroup_subsys_state *css, int cpu)
 {
-	return per_cpu_ptr(cgrp->rstat_cpu, cpu);
+	return per_cpu_ptr(css->rstat_cpu, cpu);
 }
 
 static struct cgroup_rstat_base_cpu *cgroup_rstat_base_cpu(
@@ -87,13 +88,12 @@ void _css_rstat_cpu_unlock(raw_spinlock_t *cpu_lock, int cpu,
  * @css: target cgroup subsystem state
  * @cpu: cpu on which rstat_cpu was updated
  *
- * @css->cgroup's rstat_cpu on @cpu was updated. Put it on the parent's
- * matching rstat_cpu->updated_children list. See the comment on top of
- * cgroup_rstat_cpu definition for details.
+ * @css's rstat_cpu on @cpu was updated. Put it on the parent's matching
+ * rstat_cpu->updated_children list. See the comment on top of
+ * css_rstat_cpu definition for details.
  */
 __bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
 {
-	struct cgroup *cgrp = css->cgroup;
 	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
 	unsigned long flags;
 
@@ -102,19 +102,19 @@ __bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
 	 * temporary inaccuracies, which is fine.
 	 *
 	 * Because @parent's updated_children is terminated with @parent
-	 * instead of NULL, we can tell whether @cgrp is on the list by
+	 * instead of NULL, we can tell whether @css is on the list by
 	 * testing the next pointer for NULL.
 	 */
-	if (data_race(cgroup_rstat_cpu(cgrp, cpu)->updated_next))
+	if (data_race(css_rstat_cpu(css, cpu)->updated_next))
 		return;
 
 	flags = _css_rstat_cpu_lock(cpu_lock, cpu, css, true);
 
-	/* put @cgrp and all ancestors on the corresponding updated lists */
+	/* put @css and all ancestors on the corresponding updated lists */
 	while (true) {
-		struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
-		struct cgroup *parent = cgroup_parent(cgrp);
-		struct cgroup_rstat_cpu *prstatc;
+		struct css_rstat_cpu *rstatc = css_rstat_cpu(css, cpu);
+		struct cgroup_subsys_state *parent = css->parent;
+		struct css_rstat_cpu *prstatc;
 
 		/*
 		 * Both additions and removals are bottom-up.  If a cgroup
@@ -125,39 +125,40 @@ __bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
 
 		/* Root has no parent to link it to, but mark it busy */
 		if (!parent) {
-			rstatc->updated_next = cgrp;
+			rstatc->updated_next = css;
 			break;
 		}
 
-		prstatc = cgroup_rstat_cpu(parent, cpu);
+		prstatc = css_rstat_cpu(parent, cpu);
 		rstatc->updated_next = prstatc->updated_children;
-		prstatc->updated_children = cgrp;
+		prstatc->updated_children = css;
 
-		cgrp = parent;
+		css = parent;
 	}
 
 	_css_rstat_cpu_unlock(cpu_lock, cpu, css, flags, true);
 }
 
 /**
- * cgroup_rstat_push_children - push children cgroups into the given list
+ * css_rstat_push_children - push children css's into the given list
  * @head: current head of the list (= subtree root)
  * @child: first child of the root
  * @cpu: target cpu
  * Return: A new singly linked list of cgroups to be flush
  *
- * Iteratively traverse down the cgroup_rstat_cpu updated tree level by
+ * Iteratively traverse down the css_rstat_cpu updated tree level by
  * level and push all the parents first before their next level children
  * into a singly linked list built from the tail backward like "pushing"
  * cgroups into a stack. The root is pushed by the caller.
  */
-static struct cgroup *cgroup_rstat_push_children(struct cgroup *head,
-						 struct cgroup *child, int cpu)
+static struct cgroup_subsys_state *css_rstat_push_children(
+		struct cgroup_subsys_state *head,
+		struct cgroup_subsys_state *child, int cpu)
 {
-	struct cgroup *chead = child;	/* Head of child cgroup level */
-	struct cgroup *ghead = NULL;	/* Head of grandchild cgroup level */
-	struct cgroup *parent, *grandchild;
-	struct cgroup_rstat_cpu *crstatc;
+	struct cgroup_subsys_state *chead = child;	/* Head of child css level */
+	struct cgroup_subsys_state *ghead = NULL;	/* Head of grandchild css level */
+	struct cgroup_subsys_state *parent, *grandchild;
+	struct css_rstat_cpu *crstatc;
 
 	child->rstat_flush_next = NULL;
 
@@ -165,13 +166,13 @@ static struct cgroup *cgroup_rstat_push_children(struct cgroup *head,
 	while (chead) {
 		child = chead;
 		chead = child->rstat_flush_next;
-		parent = cgroup_parent(child);
+		parent = child->parent;
 
 		/* updated_next is parent cgroup terminated */
 		while (child != parent) {
 			child->rstat_flush_next = head;
 			head = child;
-			crstatc = cgroup_rstat_cpu(child, cpu);
+			crstatc = css_rstat_cpu(child, cpu);
 			grandchild = crstatc->updated_children;
 			if (grandchild != child) {
 				/* Push the grand child to the next level */
@@ -193,31 +194,32 @@ static struct cgroup *cgroup_rstat_push_children(struct cgroup *head,
 }
 
 /**
- * cgroup_rstat_updated_list - return a list of updated cgroups to be flushed
- * @root: root of the cgroup subtree to traverse
+ * css_rstat_updated_list - return a list of updated cgroups to be flushed
+ * @root: root of the css subtree to traverse
  * @cpu: target cpu
  * Return: A singly linked list of cgroups to be flushed
  *
  * Walks the updated rstat_cpu tree on @cpu from @root.  During traversal,
- * each returned cgroup is unlinked from the updated tree.
+ * each returned css is unlinked from the updated tree.
  *
  * The only ordering guarantee is that, for a parent and a child pair
  * covered by a given traversal, the child is before its parent in
  * the list.
  *
  * Note that updated_children is self terminated and points to a list of
- * child cgroups if not empty. Whereas updated_next is like a sibling link
- * within the children list and terminated by the parent cgroup. An exception
+ * child css's if not empty. Whereas updated_next is like a sibling link
+ * within the children list and terminated by the parent css. An exception
  * here is the cgroup root whose updated_next can be self terminated.
  */
-static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
+static struct cgroup_subsys_state *css_rstat_updated_list(
+		struct cgroup_subsys_state *root, int cpu)
 {
 	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
-	struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(root, cpu);
-	struct cgroup *head = NULL, *parent, *child;
+	struct css_rstat_cpu *rstatc = css_rstat_cpu(root, cpu);
+	struct cgroup_subsys_state *head = NULL, *parent, *child;
 	unsigned long flags;
 
-	flags = _css_rstat_cpu_lock(cpu_lock, cpu, &root->self, false);
+	flags = _css_rstat_cpu_lock(cpu_lock, cpu, root, false);
 
 	/* Return NULL if this subtree is not on-list */
 	if (!rstatc->updated_next)
@@ -227,17 +229,17 @@ static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
 	 * Unlink @root from its parent. As the updated_children list is
 	 * singly linked, we have to walk it to find the removal point.
 	 */
-	parent = cgroup_parent(root);
+	parent = root->parent;
 	if (parent) {
-		struct cgroup_rstat_cpu *prstatc;
-		struct cgroup **nextp;
+		struct css_rstat_cpu *prstatc;
+		struct cgroup_subsys_state **nextp;
 
-		prstatc = cgroup_rstat_cpu(parent, cpu);
+		prstatc = css_rstat_cpu(parent, cpu);
 		nextp = &prstatc->updated_children;
 		while (*nextp != root) {
-			struct cgroup_rstat_cpu *nrstatc;
+			struct css_rstat_cpu *nrstatc;
 
-			nrstatc = cgroup_rstat_cpu(*nextp, cpu);
+			nrstatc = css_rstat_cpu(*nextp, cpu);
 			WARN_ON_ONCE(*nextp == parent);
 			nextp = &nrstatc->updated_next;
 		}
@@ -252,9 +254,9 @@ static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
 	child = rstatc->updated_children;
 	rstatc->updated_children = root;
 	if (child != root)
-		head = cgroup_rstat_push_children(head, child, cpu);
+		head = css_rstat_push_children(head, child, cpu);
 unlock_ret:
-	_css_rstat_cpu_unlock(cpu_lock, cpu, &root->self, flags, false);
+	_css_rstat_cpu_unlock(cpu_lock, cpu, root, flags, false);
 	return head;
 }
 
@@ -315,41 +317,36 @@ static inline void __css_rstat_unlock(struct cgroup_subsys_state *css,
 }
 
 /**
- * css_rstat_flush - flush stats in @css->cgroup's subtree
+ * css_rstat_flush - flush stats in @css's rstat subtree
  * @css: target cgroup subsystem state
  *
- * Collect all per-cpu stats in @css->cgroup's subtree into the global counters
- * and propagate them upwards.  After this function returns, all cgroups in
- * the subtree have up-to-date ->stat.
+ * Collect all per-cpu stats in @css's subtree into the global counters
+ * and propagate them upwards. After this function returns, all rstat
+ * nodes in the subtree have up-to-date ->stat.
  *
- * This also gets all cgroups in the subtree including @css->cgroup off the
+ * This also gets all rstat nodes in the subtree including @css off the
  * ->updated_children lists.
  *
  * This function may block.
  */
 __bpf_kfunc void css_rstat_flush(struct cgroup_subsys_state *css)
 {
-	struct cgroup *cgrp = css->cgroup;
 	int cpu;
 
 	might_sleep();
 	for_each_possible_cpu(cpu) {
-		struct cgroup *pos;
+		struct cgroup_subsys_state *pos;
 
 		/* Reacquire for each CPU to avoid disabling IRQs too long */
 		__css_rstat_lock(css, cpu);
-		pos = cgroup_rstat_updated_list(cgrp, cpu);
+		pos = css_rstat_updated_list(css, cpu);
 		for (; pos; pos = pos->rstat_flush_next) {
-			struct cgroup_subsys_state *css;
-
-			cgroup_base_stat_flush(pos, cpu);
-			bpf_rstat_flush(pos, cgroup_parent(pos), cpu);
-
-			rcu_read_lock();
-			list_for_each_entry_rcu(css, &pos->rstat_css_list,
-						rstat_css_node)
+			if (css_is_cgroup(pos)) {
+				cgroup_base_stat_flush(pos->cgroup, cpu);
+				bpf_rstat_flush(pos->cgroup,
+						cgroup_parent(pos->cgroup), cpu);
+			} else if (pos->ss->css_rstat_flush)
 				css->ss->css_rstat_flush(css, cpu);
-			rcu_read_unlock();
 		}
 		__css_rstat_unlock(css, cpu);
 		if (!cond_resched())
@@ -362,29 +359,38 @@ int css_rstat_init(struct cgroup_subsys_state *css)
 	struct cgroup *cgrp = css->cgroup;
 	int cpu;
 
-	/* the root cgrp has rstat_cpu preallocated */
-	if (!cgrp->rstat_cpu) {
-		cgrp->rstat_cpu = alloc_percpu(struct cgroup_rstat_cpu);
-		if (!cgrp->rstat_cpu)
-			return -ENOMEM;
+	/* the root cgrp has rstat_base_cpu preallocated */
+	if (css_is_cgroup(css)) {
+		if (!cgrp->rstat_base_cpu) {
+			cgrp->rstat_base_cpu = alloc_percpu(struct cgroup_rstat_base_cpu);
+			if (!cgrp->rstat_base_cpu)
+				return -ENOMEM;
+		}
 	}
 
-	if (!cgrp->rstat_base_cpu) {
-		cgrp->rstat_base_cpu = alloc_percpu(struct cgroup_rstat_base_cpu);
-		if (!cgrp->rstat_cpu) {
-			free_percpu(cgrp->rstat_cpu);
+	/* the root cgrp's self css has rstat_cpu preallocated */
+	if (!css->rstat_cpu) {
+		css->rstat_cpu = alloc_percpu(struct css_rstat_cpu);
+		if (!css->rstat_cpu) {
+			if (css_is_cgroup(css))
+				free_percpu(cgrp->rstat_base_cpu);
+
 			return -ENOMEM;
 		}
 	}
 
 	/* ->updated_children list is self terminated */
 	for_each_possible_cpu(cpu) {
-		struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
-		struct cgroup_rstat_base_cpu *rstatbc =
-			cgroup_rstat_base_cpu(cgrp, cpu);
+		struct css_rstat_cpu *rstatc = css_rstat_cpu(css, cpu);
 
-		rstatc->updated_children = cgrp;
-		u64_stats_init(&rstatbc->bsync);
+		rstatc->updated_children = css;
+
+		if (css_is_cgroup(css)) {
+			struct cgroup_rstat_base_cpu *rstatbc;
+
+			rstatbc = cgroup_rstat_base_cpu(cgrp, cpu);
+			u64_stats_init(&rstatbc->bsync);
+		}
 	}
 
 	return 0;
@@ -392,24 +398,28 @@ int css_rstat_init(struct cgroup_subsys_state *css)
 
 void css_rstat_exit(struct cgroup_subsys_state *css)
 {
-	struct cgroup *cgrp = css->cgroup;
 	int cpu;
 
-	css_rstat_flush(&cgrp->self);
+	css_rstat_flush(css);
 
 	/* sanity check */
 	for_each_possible_cpu(cpu) {
-		struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
+		struct css_rstat_cpu *rstatc = css_rstat_cpu(css, cpu);
 
-		if (WARN_ON_ONCE(rstatc->updated_children != cgrp) ||
+		if (WARN_ON_ONCE(rstatc->updated_children != css) ||
 		    WARN_ON_ONCE(rstatc->updated_next))
 			return;
 	}
 
-	free_percpu(cgrp->rstat_cpu);
-	cgrp->rstat_cpu = NULL;
-	free_percpu(cgrp->rstat_base_cpu);
-	cgrp->rstat_base_cpu = NULL;
+	if (css_is_cgroup(css)) {
+		struct cgroup *cgrp = css->cgroup;
+
+		free_percpu(cgrp->rstat_base_cpu);
+		cgrp->rstat_base_cpu = NULL;
+	}
+
+	free_percpu(css->rstat_cpu);
+	css->rstat_cpu = NULL;
 }
 
 void __init cgroup_rstat_boot(void)
diff --git a/tools/testing/selftests/bpf/progs/btf_type_tag_percpu.c b/tools/testing/selftests/bpf/progs/btf_type_tag_percpu.c
index 38f78d9345de..69f81cb555ca 100644
--- a/tools/testing/selftests/bpf/progs/btf_type_tag_percpu.c
+++ b/tools/testing/selftests/bpf/progs/btf_type_tag_percpu.c
@@ -30,22 +30,27 @@ int BPF_PROG(test_percpu2, struct bpf_testmod_btf_type_tag_2 *arg)
 
 /* trace_cgroup_mkdir(struct cgroup *cgrp, const char *path)
  *
- * struct cgroup_rstat_cpu {
+ * struct css_rstat_cpu {
  *   ...
- *   struct cgroup *updated_children;
+ *   struct cgroup_subsys_state *updated_children;
  *   ...
  * };
  *
- * struct cgroup {
+ * struct cgroup_subsys_state {
+ *   ...
+ *   struct css_rstat_cpu __percpu *rstat_cpu;
  *   ...
- *   struct cgroup_rstat_cpu __percpu *rstat_cpu;
+ * };
+ *
+ * struct cgroup {
+ *   struct cgroup_subsys_state self;
  *   ...
  * };
  */
 SEC("tp_btf/cgroup_mkdir")
 int BPF_PROG(test_percpu_load, struct cgroup *cgrp, const char *path)
 {
-	g = (__u64)cgrp->rstat_cpu->updated_children;
+	g = (__u64)cgrp->self.rstat_cpu->updated_children;
 	return 0;
 }
 
@@ -56,7 +61,8 @@ int BPF_PROG(test_percpu_helper, struct cgroup *cgrp, const char *path)
 	__u32 cpu;
 
 	cpu = bpf_get_smp_processor_id();
-	rstat = (struct cgroup_rstat_cpu *)bpf_per_cpu_ptr(cgrp->rstat_cpu, cpu);
+	rstat = (struct cgroup_rstat_cpu *)bpf_per_cpu_ptr(
+			cgrp->self.rstat_cpu, cpu);
 	if (rstat) {
 		/* READ_ONCE */
 		*(volatile int *)rstat;
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v4 5/5] cgroup: use subsystem-specific rstat locks to avoid contention
  2025-04-04  1:10 [PATCH v4 0/5] cgroup: separate rstat trees JP Kobryn
                   ` (3 preceding siblings ...)
  2025-04-04  1:10 ` [PATCH v4 4/5] cgroup: use separate rstat trees for each subsystem JP Kobryn
@ 2025-04-04  1:10 ` JP Kobryn
  2025-04-04 20:28   ` Tejun Heo
                     ` (3 more replies)
  4 siblings, 4 replies; 38+ messages in thread
From: JP Kobryn @ 2025-04-04  1:10 UTC (permalink / raw)
  To: tj, shakeel.butt, yosryahmed, mkoutny, hannes, akpm
  Cc: linux-mm, cgroups, kernel-team

It is possible to eliminate contention between subsystems when
updating/flushing stats by using subsystem-specific locks. Let the existing
rstat locks be dedicated to the cgroup base stats and rename them to
reflect that. Add similar locks to the cgroup_subsys struct for use with
individual subsystems.

Lock initialization is done in the new function ss_rstat_init(ss) which
replaces cgroup_rstat_boot(void). If NULL is passed to this function, the
global base stat locks will be initialized. Otherwise, the subsystem locks
will be initialized.

Change the existing lock helper functions to accept a reference to a css.
Then within these functions, conditionally select the appropriate locks
based on the subsystem affiliation of the given css. Add helper functions
for this selection routine to avoid repeated code.

Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
 block/blk-cgroup.c              |   2 +-
 include/linux/cgroup-defs.h     |  16 +++--
 include/trace/events/cgroup.h   |  12 +++-
 kernel/cgroup/cgroup-internal.h |   2 +-
 kernel/cgroup/cgroup.c          |  10 +++-
 kernel/cgroup/rstat.c           | 101 +++++++++++++++++++++++---------
 6 files changed, 103 insertions(+), 40 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 0560ea402856..62d0bf1e1a04 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1074,7 +1074,7 @@ static void __blkcg_rstat_flush(struct blkcg *blkcg, int cpu)
 	/*
 	 * For covering concurrent parent blkg update from blkg_release().
 	 *
-	 * When flushing from cgroup, cgroup_rstat_lock is always held, so
+	 * When flushing from cgroup, the subsystem lock is always held, so
 	 * this lock won't cause contention most of time.
 	 */
 	raw_spin_lock_irqsave(&blkg_stat_lock, flags);
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index c58c21c2110a..bb5a355524d6 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -223,7 +223,10 @@ struct cgroup_subsys_state {
 	/*
 	 * A singly-linked list of css structures to be rstat flushed.
 	 * This is a scratch field to be used exclusively by
-	 * css_rstat_flush_locked() and protected by cgroup_rstat_lock.
+	 * css_rstat_flush_locked().
+	 *
+	 * Protected by rstat_base_lock when css is cgroup::self.
+	 * Protected by css->ss->rstat_ss_lock otherwise.
 	 */
 	struct cgroup_subsys_state *rstat_flush_next;
 };
@@ -359,11 +362,11 @@ struct css_rstat_cpu {
 	 * are linked on the parent's ->updated_children through
 	 * ->updated_next.
 	 *
-	 * In addition to being more compact, singly-linked list pointing
-	 * to the cgroup makes it unnecessary for each per-cpu struct to
-	 * point back to the associated cgroup.
+	 * In addition to being more compact, singly-linked list pointing to
+	 * the css makes it unnecessary for each per-cpu struct to point back
+	 * to the associated css.
 	 *
-	 * Protected by per-cpu cgroup_rstat_cpu_lock.
+	 * Protected by per-cpu css->ss->rstat_ss_cpu_lock.
 	 */
 	struct cgroup_subsys_state *updated_children;	/* terminated by self cgroup */
 	struct cgroup_subsys_state *updated_next;	/* NULL iff not on the list */
@@ -793,6 +796,9 @@ struct cgroup_subsys {
 	 * specifies the mask of subsystems that this one depends on.
 	 */
 	unsigned int depends_on;
+
+	spinlock_t rstat_ss_lock;
+	raw_spinlock_t __percpu *rstat_ss_cpu_lock;
 };
 
 extern struct percpu_rw_semaphore cgroup_threadgroup_rwsem;
diff --git a/include/trace/events/cgroup.h b/include/trace/events/cgroup.h
index af2755bda6eb..7d332387be6c 100644
--- a/include/trace/events/cgroup.h
+++ b/include/trace/events/cgroup.h
@@ -231,7 +231,11 @@ DECLARE_EVENT_CLASS(cgroup_rstat,
 		  __entry->cpu, __entry->contended)
 );
 
-/* Related to global: cgroup_rstat_lock */
+/*
+ * Related to locks:
+ * global rstat_base_lock for base stats
+ * cgroup_subsys::rstat_ss_lock for subsystem stats
+ */
 DEFINE_EVENT(cgroup_rstat, cgroup_rstat_lock_contended,
 
 	TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
@@ -253,7 +257,11 @@ DEFINE_EVENT(cgroup_rstat, cgroup_rstat_unlock,
 	TP_ARGS(cgrp, cpu, contended)
 );
 
-/* Related to per CPU: cgroup_rstat_cpu_lock */
+/*
+ * Related to per CPU locks:
+ * global rstat_base_cpu_lock for base stats
+ * cgroup_subsys::rstat_ss_cpu_lock for subsystem stats
+ */
 DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_lock_contended,
 
 	TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index c161d34be634..b14e61c64a34 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -272,7 +272,7 @@ int cgroup_task_count(const struct cgroup *cgrp);
  */
 int css_rstat_init(struct cgroup_subsys_state *css);
 void css_rstat_exit(struct cgroup_subsys_state *css);
-void cgroup_rstat_boot(void);
+int ss_rstat_init(struct cgroup_subsys *ss);
 void cgroup_base_stat_cputime_show(struct seq_file *seq);
 
 /*
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index c351b98ebd06..2e0348d78679 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -6106,8 +6106,10 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
 		css->id = cgroup_idr_alloc(&ss->css_idr, css, 1, 2, GFP_KERNEL);
 		BUG_ON(css->id < 0);
 
-		if (ss->css_rstat_flush)
+		if (ss->css_rstat_flush) {
+			BUG_ON(ss_rstat_init(ss));
 			BUG_ON(css_rstat_init(css));
+		}
 	}
 
 	/* Update the init_css_set to contain a subsys
@@ -6184,7 +6186,7 @@ int __init cgroup_init(void)
 	BUG_ON(cgroup_init_cftypes(NULL, cgroup_psi_files));
 	BUG_ON(cgroup_init_cftypes(NULL, cgroup1_base_files));
 
-	cgroup_rstat_boot();
+	BUG_ON(ss_rstat_init(NULL));
 
 	get_user_ns(init_cgroup_ns.user_ns);
 
@@ -6215,8 +6217,10 @@ int __init cgroup_init(void)
 						   GFP_KERNEL);
 			BUG_ON(css->id < 0);
 
-			if (ss->css_rstat_flush)
+			if (ss->css_rstat_flush) {
+				BUG_ON(ss_rstat_init(ss));
 				BUG_ON(css_rstat_init(css));
+			}
 		} else {
 			cgroup_init_subsys(ss, false);
 		}
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 37d9e5012b2d..bcc253aec774 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -9,8 +9,8 @@
 
 #include <trace/events/cgroup.h>
 
-static DEFINE_SPINLOCK(cgroup_rstat_lock);
-static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
+static DEFINE_SPINLOCK(rstat_base_lock);
+static DEFINE_PER_CPU(raw_spinlock_t, rstat_base_cpu_lock);
 
 static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
 
@@ -26,8 +26,24 @@ static struct cgroup_rstat_base_cpu *cgroup_rstat_base_cpu(
 	return per_cpu_ptr(cgrp->rstat_base_cpu, cpu);
 }
 
+static spinlock_t *ss_rstat_lock(struct cgroup_subsys *ss)
+{
+	if (ss)
+		return &ss->rstat_ss_lock;
+
+	return &rstat_base_lock;
+}
+
+static raw_spinlock_t *ss_rstat_cpu_lock(struct cgroup_subsys *ss, int cpu)
+{
+	if (ss)
+		return per_cpu_ptr(ss->rstat_ss_cpu_lock, cpu);
+
+	return per_cpu_ptr(&rstat_base_cpu_lock, cpu);
+}
+
 /*
- * Helper functions for rstat per CPU lock (cgroup_rstat_cpu_lock).
+ * Helper functions for rstat per CPU locks.
  *
  * This makes it easier to diagnose locking issues and contention in
  * production environments. The parameter @fast_path determine the
@@ -35,21 +51,23 @@ static struct cgroup_rstat_base_cpu *cgroup_rstat_base_cpu(
  * operations without handling high-frequency fast-path "update" events.
  */
 static __always_inline
-unsigned long _css_rstat_cpu_lock(raw_spinlock_t *cpu_lock, int cpu,
-				     struct cgroup_subsys_state *css, const bool fast_path)
+unsigned long _css_rstat_cpu_lock(struct cgroup_subsys_state *css, int cpu,
+		const bool fast_path)
 {
 	struct cgroup *cgrp = css->cgroup;
+	raw_spinlock_t *cpu_lock;
 	unsigned long flags;
 	bool contended;
 
 	/*
-	 * The _irqsave() is needed because cgroup_rstat_lock is
-	 * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
-	 * this lock with the _irq() suffix only disables interrupts on
-	 * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
-	 * interrupts on both configurations. The _irqsave() ensures
-	 * that interrupts are always disabled and later restored.
+	 * The _irqsave() is needed because the locks used for flushing are
+	 * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring this lock
+	 * with the _irq() suffix only disables interrupts on a non-PREEMPT_RT
+	 * kernel. The raw_spinlock_t below disables interrupts on both
+	 * configurations. The _irqsave() ensures that interrupts are always
+	 * disabled and later restored.
 	 */
+	cpu_lock = ss_rstat_cpu_lock(css->ss, cpu);
 	contended = !raw_spin_trylock_irqsave(cpu_lock, flags);
 	if (contended) {
 		if (fast_path)
@@ -69,17 +87,18 @@ unsigned long _css_rstat_cpu_lock(raw_spinlock_t *cpu_lock, int cpu,
 }
 
 static __always_inline
-void _css_rstat_cpu_unlock(raw_spinlock_t *cpu_lock, int cpu,
-			      struct cgroup_subsys_state *css, unsigned long flags,
-			      const bool fast_path)
+void _css_rstat_cpu_unlock(struct cgroup_subsys_state *css, int cpu,
+		unsigned long flags, const bool fast_path)
 {
 	struct cgroup *cgrp = css->cgroup;
+	raw_spinlock_t *cpu_lock;
 
 	if (fast_path)
 		trace_cgroup_rstat_cpu_unlock_fastpath(cgrp, cpu, false);
 	else
 		trace_cgroup_rstat_cpu_unlock(cgrp, cpu, false);
 
+	cpu_lock = ss_rstat_cpu_lock(css->ss, cpu);
 	raw_spin_unlock_irqrestore(cpu_lock, flags);
 }
 
@@ -94,7 +113,6 @@ void _css_rstat_cpu_unlock(raw_spinlock_t *cpu_lock, int cpu,
  */
 __bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
 {
-	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
 	unsigned long flags;
 
 	/*
@@ -108,7 +126,7 @@ __bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
 	if (data_race(css_rstat_cpu(css, cpu)->updated_next))
 		return;
 
-	flags = _css_rstat_cpu_lock(cpu_lock, cpu, css, true);
+	flags = _css_rstat_cpu_lock(css, cpu, true);
 
 	/* put @css and all ancestors on the corresponding updated lists */
 	while (true) {
@@ -136,7 +154,7 @@ __bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
 		css = parent;
 	}
 
-	_css_rstat_cpu_unlock(cpu_lock, cpu, css, flags, true);
+	_css_rstat_cpu_unlock(css, cpu, flags, true);
 }
 
 /**
@@ -214,12 +232,11 @@ static struct cgroup_subsys_state *css_rstat_push_children(
 static struct cgroup_subsys_state *css_rstat_updated_list(
 		struct cgroup_subsys_state *root, int cpu)
 {
-	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
 	struct css_rstat_cpu *rstatc = css_rstat_cpu(root, cpu);
 	struct cgroup_subsys_state *head = NULL, *parent, *child;
 	unsigned long flags;
 
-	flags = _css_rstat_cpu_lock(cpu_lock, cpu, root, false);
+	flags = _css_rstat_cpu_lock(root, cpu, false);
 
 	/* Return NULL if this subtree is not on-list */
 	if (!rstatc->updated_next)
@@ -256,7 +273,7 @@ static struct cgroup_subsys_state *css_rstat_updated_list(
 	if (child != root)
 		head = css_rstat_push_children(head, child, cpu);
 unlock_ret:
-	_css_rstat_cpu_unlock(cpu_lock, cpu, root, flags, false);
+	_css_rstat_cpu_unlock(root, cpu, flags, false);
 	return head;
 }
 
@@ -283,7 +300,7 @@ __weak noinline void bpf_rstat_flush(struct cgroup *cgrp,
 __bpf_hook_end();
 
 /*
- * Helper functions for locking cgroup_rstat_lock.
+ * Helper functions for locking.
  *
  * This makes it easier to diagnose locking issues and contention in
  * production environments.  The parameter @cpu_in_loop indicate lock
@@ -293,27 +310,31 @@ __bpf_hook_end();
  */
 static inline void __css_rstat_lock(struct cgroup_subsys_state *css,
 		int cpu_in_loop)
-	__acquires(&cgroup_rstat_lock)
+	__acquires(lock)
 {
 	struct cgroup *cgrp = css->cgroup;
+	spinlock_t *lock;
 	bool contended;
 
-	contended = !spin_trylock_irq(&cgroup_rstat_lock);
+	lock = ss_rstat_lock(css->ss);
+	contended = !spin_trylock_irq(lock);
 	if (contended) {
 		trace_cgroup_rstat_lock_contended(cgrp, cpu_in_loop, contended);
-		spin_lock_irq(&cgroup_rstat_lock);
+		spin_lock_irq(lock);
 	}
 	trace_cgroup_rstat_locked(cgrp, cpu_in_loop, contended);
 }
 
 static inline void __css_rstat_unlock(struct cgroup_subsys_state *css,
 		int cpu_in_loop)
-	__releases(&cgroup_rstat_lock)
+	__releases(lock)
 {
 	struct cgroup *cgrp = css->cgroup;
+	spinlock_t *lock;
 
+	lock = ss_rstat_lock(css->ss);
 	trace_cgroup_rstat_unlock(cgrp, cpu_in_loop, false);
-	spin_unlock_irq(&cgroup_rstat_lock);
+	spin_unlock_irq(lock);
 }
 
 /**
@@ -422,12 +443,36 @@ void css_rstat_exit(struct cgroup_subsys_state *css)
 	css->rstat_cpu = NULL;
 }
 
-void __init cgroup_rstat_boot(void)
+/**
+ * ss_rstat_init - subsystem-specific rstat initialization
+ * @ss: target subsystem
+ *
+ * If @ss is NULL, the static locks associated with the base stats
+ * are initialized. If @ss is non-NULL, the subsystem-specific locks
+ * are initialized.
+ */
+int __init ss_rstat_init(struct cgroup_subsys *ss)
 {
 	int cpu;
 
+	if (!ss) {
+		spin_lock_init(&rstat_base_lock);
+
+		for_each_possible_cpu(cpu)
+			raw_spin_lock_init(per_cpu_ptr(&rstat_base_cpu_lock, cpu));
+
+		return 0;
+	}
+
+	spin_lock_init(&ss->rstat_ss_lock);
+	ss->rstat_ss_cpu_lock = alloc_percpu(raw_spinlock_t);
+	if (!ss->rstat_ss_cpu_lock)
+		return -ENOMEM;
+
 	for_each_possible_cpu(cpu)
-		raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu));
+		raw_spin_lock_init(per_cpu_ptr(ss->rstat_ss_cpu_lock, cpu));
+
+	return 0;
 }
 
 /*
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 3/5] cgroup: change rstat function signatures from cgroup-based to css-based
  2025-04-04  1:10 ` [PATCH v4 3/5] cgroup: change rstat function signatures from cgroup-based to css-based JP Kobryn
@ 2025-04-04 20:00   ` Tejun Heo
  2025-04-04 20:09     ` Tejun Heo
  2025-04-22 12:35   ` Yosry Ahmed
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 38+ messages in thread
From: Tejun Heo @ 2025-04-04 20:00 UTC (permalink / raw)
  To: JP Kobryn
  Cc: shakeel.butt, yosryahmed, mkoutny, hannes, akpm, linux-mm,
	cgroups, kernel-team

On Thu, Apr 03, 2025 at 06:10:48PM -0700, JP Kobryn wrote:
> This non-functional change serves as preparation for moving to
> subsystem-based rstat trees. To simplify future commits, change the
> signatures of existing cgroup-based rstat functions to become css-based and
> rename them to reflect that.
> 
> Though the signatures have changed, the implementations have not. Within
> these functions use the css->cgroup pointer to obtain the associated cgroup
> and allow code to function the same just as it did before this patch. At
> applicable call sites, pass the subsystem-specific css pointer as an
> argument or pass a pointer to cgroup::self if not in subsystem context.
> 
> Note that cgroup_rstat_updated_list() and cgroup_rstat_push_children()
> are not altered yet since there would be a larger amount of css to
> cgroup conversions which may overcomplicate the code at this
> intermediate phase.
> 
> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>

Applied 1-3 to cgroup/for-5.16.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 3/5] cgroup: change rstat function signatures from cgroup-based to css-based
  2025-04-04 20:00   ` Tejun Heo
@ 2025-04-04 20:09     ` Tejun Heo
  2025-04-04 21:21       ` JP Kobryn
  0 siblings, 1 reply; 38+ messages in thread
From: Tejun Heo @ 2025-04-04 20:09 UTC (permalink / raw)
  To: JP Kobryn
  Cc: shakeel.butt, yosryahmed, mkoutny, hannes, akpm, linux-mm,
	cgroups, kernel-team

On Fri, Apr 04, 2025 at 10:00:25AM -1000, Tejun Heo wrote:
> On Thu, Apr 03, 2025 at 06:10:48PM -0700, JP Kobryn wrote:
> > This non-functional change serves as preparation for moving to
> > subsystem-based rstat trees. To simplify future commits, change the
> > signatures of existing cgroup-based rstat functions to become css-based and
> > rename them to reflect that.
> > 
> > Though the signatures have changed, the implementations have not. Within
> > these functions use the css->cgroup pointer to obtain the associated cgroup
> > and allow code to function the same just as it did before this patch. At
> > applicable call sites, pass the subsystem-specific css pointer as an
> > argument or pass a pointer to cgroup::self if not in subsystem context.
> > 
> > Note that cgroup_rstat_updated_list() and cgroup_rstat_push_children()
> > are not altered yet since there would be a larger amount of css to
> > cgroup conversions which may overcomplicate the code at this
> > intermediate phase.
> > 
> > Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
> 
> Applied 1-3 to cgroup/for-5.16.

There were some conflicts with the commits already in cgroup/for-5.15-fixes.
I resolved them but it'd be great if you can verify that I didn't do
anything silly.

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-6.16

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 5/5] cgroup: use subsystem-specific rstat locks to avoid contention
  2025-04-04  1:10 ` [PATCH v4 5/5] cgroup: use subsystem-specific rstat locks to avoid contention JP Kobryn
@ 2025-04-04 20:28   ` Tejun Heo
  2025-04-11  3:31     ` JP Kobryn
  2025-04-15 17:15   ` Michal Koutný
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 38+ messages in thread
From: Tejun Heo @ 2025-04-04 20:28 UTC (permalink / raw)
  To: JP Kobryn
  Cc: shakeel.butt, yosryahmed, mkoutny, hannes, akpm, linux-mm,
	cgroups, kernel-team

On Thu, Apr 03, 2025 at 06:10:50PM -0700, JP Kobryn wrote:
> It is possible to eliminate contention between subsystems when
> updating/flushing stats by using subsystem-specific locks. Let the existing
> rstat locks be dedicated to the cgroup base stats and rename them to
> reflect that. Add similar locks to the cgroup_subsys struct for use with
> individual subsystems.
> 
> Lock initialization is done in the new function ss_rstat_init(ss) which
> replaces cgroup_rstat_boot(void). If NULL is passed to this function, the
> global base stat locks will be initialized. Otherwise, the subsystem locks
> will be initialized.
> 
> Change the existing lock helper functions to accept a reference to a css.
> Then within these functions, conditionally select the appropriate locks
> based on the subsystem affiliation of the given css. Add helper functions
> for this selection routine to avoid repeated code.
> 
> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>

4-5 look fine to me. I'll wait for others to chime in before applying.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 3/5] cgroup: change rstat function signatures from cgroup-based to css-based
  2025-04-04 20:09     ` Tejun Heo
@ 2025-04-04 21:21       ` JP Kobryn
  0 siblings, 0 replies; 38+ messages in thread
From: JP Kobryn @ 2025-04-04 21:21 UTC (permalink / raw)
  To: Tejun Heo
  Cc: shakeel.butt, yosryahmed, mkoutny, hannes, akpm, linux-mm,
	cgroups, kernel-team

On 4/4/25 1:09 PM, Tejun Heo wrote:
> On Fri, Apr 04, 2025 at 10:00:25AM -1000, Tejun Heo wrote:
>> On Thu, Apr 03, 2025 at 06:10:48PM -0700, JP Kobryn wrote:
>>> This non-functional change serves as preparation for moving to
>>> subsystem-based rstat trees. To simplify future commits, change the
>>> signatures of existing cgroup-based rstat functions to become css-based and
>>> rename them to reflect that.
>>>
>>> Though the signatures have changed, the implementations have not. Within
>>> these functions use the css->cgroup pointer to obtain the associated cgroup
>>> and allow code to function the same just as it did before this patch. At
>>> applicable call sites, pass the subsystem-specific css pointer as an
>>> argument or pass a pointer to cgroup::self if not in subsystem context.
>>>
>>> Note that cgroup_rstat_updated_list() and cgroup_rstat_push_children()
>>> are not altered yet since there would be a larger amount of css to
>>> cgroup conversions which may overcomplicate the code at this
>>> intermediate phase.
>>>
>>> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
>>
>> Applied 1-3 to cgroup/for-5.16.
> 
> There were some conflicts with the commits already in cgroup/for-5.15-fixes.
> I resolved them but it'd be great if you can verify that I didn't do
> anything silly.

I'm thinking you meant 6.* instead of 5.* in the two branches above. The
changes to resolve conflicts look good to me. Thanks for the quick
review.

> 
>    git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-6.16
> 
> Thanks.
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 5/5] cgroup: use subsystem-specific rstat locks to avoid contention
  2025-04-04 20:28   ` Tejun Heo
@ 2025-04-11  3:31     ` JP Kobryn
  0 siblings, 0 replies; 38+ messages in thread
From: JP Kobryn @ 2025-04-11  3:31 UTC (permalink / raw)
  To: Tejun Heo
  Cc: shakeel.butt, yosryahmed, mkoutny, hannes, akpm, linux-mm,
	cgroups, kernel-team

On 4/4/25 1:28 PM, Tejun Heo wrote:
> On Thu, Apr 03, 2025 at 06:10:50PM -0700, JP Kobryn wrote:
>> It is possible to eliminate contention between subsystems when
>> updating/flushing stats by using subsystem-specific locks. Let the existing
>> rstat locks be dedicated to the cgroup base stats and rename them to
>> reflect that. Add similar locks to the cgroup_subsys struct for use with
>> individual subsystems.
>>
>> Lock initialization is done in the new function ss_rstat_init(ss) which
>> replaces cgroup_rstat_boot(void). If NULL is passed to this function, the
>> global base stat locks will be initialized. Otherwise, the subsystem locks
>> will be initialized.
>>
>> Change the existing lock helper functions to accept a reference to a css.
>> Then within these functions, conditionally select the appropriate locks
>> based on the subsystem affiliation of the given css. Add helper functions
>> for this selection routine to avoid repeated code.
>>
>> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
> 4-5 look fine to me. I'll wait for others to chime in before applying.
>
> Thanks.
>
Michal, Yosry, can you give patches 4 and 5 a look? I think the changes
requested up to this point have been worked into the series.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 4/5] cgroup: use separate rstat trees for each subsystem
  2025-04-04  1:10 ` [PATCH v4 4/5] cgroup: use separate rstat trees for each subsystem JP Kobryn
@ 2025-04-15 17:15   ` Michal Koutný
  2025-04-16 21:43     ` JP Kobryn
  2025-04-21 18:18   ` JP Kobryn
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 38+ messages in thread
From: Michal Koutný @ 2025-04-15 17:15 UTC (permalink / raw)
  To: JP Kobryn
  Cc: tj, shakeel.butt, yosryahmed, hannes, akpm, linux-mm, cgroups,
	kernel-team

On Thu, Apr 03, 2025 at 06:10:49PM -0700, JP Kobryn <inwardvessel@gmail.com> wrote:
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
...
> @@ -5425,6 +5417,9 @@ static void css_free_rwork_fn(struct work_struct *work)
>  		struct cgroup_subsys_state *parent = css->parent;
>  		int id = css->id;
>  
> +		if (ss->css_rstat_flush)
> +			css_rstat_exit(css);
> +

It should be safe to call this unguarded (see also my comment below at
css_rstat_flush()).

>  		ss->css_free(css);
>  		cgroup_idr_remove(&ss->css_idr, id);
>  		cgroup_put(cgrp);
> @@ -5477,11 +5472,8 @@ static void css_release_work_fn(struct work_struct *work)
>  	if (ss) {
>  		struct cgroup *parent_cgrp;
>  
> -		/* css release path */
> -		if (!list_empty(&css->rstat_css_node)) {
> +		if (ss->css_rstat_flush)
>  			css_rstat_flush(css);
> -			list_del_rcu(&css->rstat_css_node);
> -		}

Ditto.

>  __bpf_kfunc void css_rstat_flush(struct cgroup_subsys_state *css)
>  {
> -	struct cgroup *cgrp = css->cgroup;
>  	int cpu;
>  
>  	might_sleep();
>  	for_each_possible_cpu(cpu) {
> -		struct cgroup *pos;
> +		struct cgroup_subsys_state *pos;
>  
>  		/* Reacquire for each CPU to avoid disabling IRQs too long */
>  		__css_rstat_lock(css, cpu);
> -		pos = cgroup_rstat_updated_list(cgrp, cpu);
> +		pos = css_rstat_updated_list(css, cpu);
>  		for (; pos; pos = pos->rstat_flush_next) {
> -			struct cgroup_subsys_state *css;
> -
> -			cgroup_base_stat_flush(pos, cpu);
> -			bpf_rstat_flush(pos, cgroup_parent(pos), cpu);
> -
> -			rcu_read_lock();
> -			list_for_each_entry_rcu(css, &pos->rstat_css_list,
> -						rstat_css_node)
> +			if (css_is_cgroup(pos)) {
> +				cgroup_base_stat_flush(pos->cgroup, cpu);
> +				bpf_rstat_flush(pos->cgroup,
> +						cgroup_parent(pos->cgroup), cpu);
> +			} else if (pos->ss->css_rstat_flush)
>  				css->ss->css_rstat_flush(css, cpu);

These conditions -- css_is_cgroup(pos) and pos->ss->css_rstat_flush
should be invariant wrt pos in the split tree, right?
It's a μoptimization but may be worth checking only once before
processing the update tree?


> -			rcu_read_unlock();
>  		}
>  		__css_rstat_unlock(css, cpu);
>  		if (!cond_resched())
> @@ -362,29 +359,38 @@ int css_rstat_init(struct cgroup_subsys_state *css)
>  	struct cgroup *cgrp = css->cgroup;
>  	int cpu;
>  
> -	/* the root cgrp has rstat_cpu preallocated */
> -	if (!cgrp->rstat_cpu) {
> -		cgrp->rstat_cpu = alloc_percpu(struct cgroup_rstat_cpu);
> -		if (!cgrp->rstat_cpu)
> -			return -ENOMEM;
> +	/* the root cgrp has rstat_base_cpu preallocated */
> +	if (css_is_cgroup(css)) {
> +		if (!cgrp->rstat_base_cpu) {
> +			cgrp->rstat_base_cpu = alloc_percpu(struct cgroup_rstat_base_cpu);
> +			if (!cgrp->rstat_base_cpu)
> +				return -ENOMEM;
> +		}
>  	}
>  
> -	if (!cgrp->rstat_base_cpu) {
> -		cgrp->rstat_base_cpu = alloc_percpu(struct cgroup_rstat_base_cpu);
> -		if (!cgrp->rstat_cpu) {
> -			free_percpu(cgrp->rstat_cpu);

Thanks,
Michal

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 5/5] cgroup: use subsystem-specific rstat locks to avoid contention
  2025-04-04  1:10 ` [PATCH v4 5/5] cgroup: use subsystem-specific rstat locks to avoid contention JP Kobryn
  2025-04-04 20:28   ` Tejun Heo
@ 2025-04-15 17:15   ` Michal Koutný
  2025-04-15 19:30     ` Tejun Heo
  2025-04-16 18:01     ` JP Kobryn
  2025-04-22 14:01   ` Yosry Ahmed
       [not found]   ` <6807a132.df0a0220.28dc80.a1f0SMTPIN_ADDED_BROKEN@mx.google.com>
  3 siblings, 2 replies; 38+ messages in thread
From: Michal Koutný @ 2025-04-15 17:15 UTC (permalink / raw)
  To: JP Kobryn
  Cc: tj, shakeel.butt, yosryahmed, hannes, akpm, linux-mm, cgroups,
	kernel-team

On Thu, Apr 03, 2025 at 06:10:50PM -0700, JP Kobryn <inwardvessel@gmail.com> wrote:
> --- a/kernel/cgroup/rstat.c
> +++ b/kernel/cgroup/rstat.c
...
>  static inline void __css_rstat_lock(struct cgroup_subsys_state *css,
>  		int cpu_in_loop)
> -	__acquires(&cgroup_rstat_lock)
> +	__acquires(lock)

Maybe
	__acquires(ss_rstat_lock(css->ss))

It shouldn't matter anyway but that may be more specific than a
generic 'lock' expression [1].

Besides that this patch LGTM.

Michal

[1] https://sparse.docs.kernel.org/en/latest/annotations.html#context-ctxt-entry-exit

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 1/5] cgroup: move rstat base stat objects into their own struct
  2025-04-04  1:10 ` [PATCH v4 1/5] cgroup: move rstat base stat objects into their own struct JP Kobryn
@ 2025-04-15 17:16   ` Michal Koutný
  2025-04-22 12:13   ` Yosry Ahmed
  2025-05-29 18:58   ` Ihor Solodrai
  2 siblings, 0 replies; 38+ messages in thread
From: Michal Koutný @ 2025-04-15 17:16 UTC (permalink / raw)
  To: JP Kobryn
  Cc: tj, shakeel.butt, yosryahmed, hannes, akpm, linux-mm, cgroups,
	kernel-team

On Thu, Apr 03, 2025 at 06:10:46PM -0700, JP Kobryn <inwardvessel@gmail.com> wrote:
> --- a/kernel/cgroup/rstat.c
> +++ b/kernel/cgroup/rstat.c
...
> @@ -351,12 +357,22 @@ int cgroup_rstat_init(struct cgroup *cgrp)
>  			return -ENOMEM;
>  	}
>  
> +	if (!cgrp->rstat_base_cpu) {
> +		cgrp->rstat_base_cpu = alloc_percpu(struct cgroup_rstat_base_cpu);
> +		if (!cgrp->rstat_cpu) {
> +			free_percpu(cgrp->rstat_cpu);
> +			return -ENOMEM;
> +		}
> +	}

This looks like it should check
		if (!cgrp->rstat_base_cpu) {

But nvm, it's replaced/fixed in patch 4/5.

Michal

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 5/5] cgroup: use subsystem-specific rstat locks to avoid contention
  2025-04-15 17:15   ` Michal Koutný
@ 2025-04-15 19:30     ` Tejun Heo
  2025-04-16  9:50       ` Michal Koutný
  2025-04-16 18:01     ` JP Kobryn
  1 sibling, 1 reply; 38+ messages in thread
From: Tejun Heo @ 2025-04-15 19:30 UTC (permalink / raw)
  To: Michal Koutný
  Cc: JP Kobryn, shakeel.butt, yosryahmed, hannes, akpm, linux-mm,
	cgroups, kernel-team

On Tue, Apr 15, 2025 at 07:15:48PM +0200, Michal Koutný wrote:
> On Thu, Apr 03, 2025 at 06:10:50PM -0700, JP Kobryn <inwardvessel@gmail.com> wrote:
> > --- a/kernel/cgroup/rstat.c
> > +++ b/kernel/cgroup/rstat.c
> ...
> >  static inline void __css_rstat_lock(struct cgroup_subsys_state *css,
> >  		int cpu_in_loop)
> > -	__acquires(&cgroup_rstat_lock)
> > +	__acquires(lock)
> 
> Maybe
> 	__acquires(ss_rstat_lock(css->ss))
> 
> It shouldn't matter anyway but that may be more specific than a
> generic 'lock' expression [1].

I don't know whether this is a controversial opinion but I personally would
prefer to not have __acquires/__releases() notations at all. They add
unverifiable clutter and don't really add anything valuable on top of
lockdep.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 5/5] cgroup: use subsystem-specific rstat locks to avoid contention
  2025-04-15 19:30     ` Tejun Heo
@ 2025-04-16  9:50       ` Michal Koutný
  2025-04-16 18:10         ` JP Kobryn
  0 siblings, 1 reply; 38+ messages in thread
From: Michal Koutný @ 2025-04-16  9:50 UTC (permalink / raw)
  To: Tejun Heo
  Cc: JP Kobryn, shakeel.butt, yosryahmed, hannes, akpm, linux-mm,
	cgroups, kernel-team

On Tue, Apr 15, 2025 at 09:30:35AM -1000, Tejun Heo <tj@kernel.org> wrote:
> I don't know whether this is a controversial opinion but I personally would
> prefer to not have __acquires/__releases() notations at all.

Thanks for pointing it out.

> They add unverifiable clutter

IIUC, `make C=1` should verify them, but they're admittedly in a sorrier
state than selftests.

> and don't really add anything valuable on top of lockdep.

I also find lockdep macros better at documenting (cf `__releases(lock)
__acquires(lock)` annotations) and more functional (part of running
other tests with lockdep).
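
For example (just a sketch, with the ss_rstat_lock() helper from patch
5/5 assumed and a made-up function name):

	static void example_flush_step(struct cgroup_subsys_state *css, int cpu)
	{
		/* documents the locking expectation and checks it at runtime */
		lockdep_assert_held(ss_rstat_lock(css->ss));
		...
	}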

So I'd say cleanup with review of lockdep annotations is not a bad
(separate) idea.

Michal

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 5/5] cgroup: use subsystem-specific rstat locks to avoid contention
  2025-04-15 17:15   ` Michal Koutný
  2025-04-15 19:30     ` Tejun Heo
@ 2025-04-16 18:01     ` JP Kobryn
  1 sibling, 0 replies; 38+ messages in thread
From: JP Kobryn @ 2025-04-16 18:01 UTC (permalink / raw)
  To: Michal Koutný
  Cc: tj, shakeel.butt, yosryahmed, hannes, akpm, linux-mm, cgroups,
	kernel-team

On 4/15/25 10:15 AM, Michal Koutný wrote:
> On Thu, Apr 03, 2025 at 06:10:50PM -0700, JP Kobryn <inwardvessel@gmail.com> wrote:
>> --- a/kernel/cgroup/rstat.c
>> +++ b/kernel/cgroup/rstat.c
> ...
>>   static inline void __css_rstat_lock(struct cgroup_subsys_state *css,
>>   		int cpu_in_loop)
>> -	__acquires(&cgroup_rstat_lock)
>> +	__acquires(lock)
> 
> Maybe
> 	__acquires(ss_rstat_lock(css->ss))
> 
> It shouldn't matter anyway but that may be more specific than a
> generic 'lock' expression [1].

Thanks. I see what you mean in terms of improving the output.

> 
> Besides that this patch LGTM.
> 
> Michal
> 
> [1] https://sparse.docs.kernel.org/en/latest/annotations.html#context-ctxt-entry-exit


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 5/5] cgroup: use subsystem-specific rstat locks to avoid contention
  2025-04-16  9:50       ` Michal Koutný
@ 2025-04-16 18:10         ` JP Kobryn
  2025-04-16 18:14           ` Tejun Heo
  0 siblings, 1 reply; 38+ messages in thread
From: JP Kobryn @ 2025-04-16 18:10 UTC (permalink / raw)
  To: Michal Koutný, Tejun Heo
  Cc: shakeel.butt, yosryahmed, hannes, akpm, linux-mm, cgroups,
	kernel-team

On 4/16/25 2:50 AM, Michal Koutný wrote:
> On Tue, Apr 15, 2025 at 09:30:35AM -1000, Tejun Heo <tj@kernel.org> wrote:
>> I don't know whether this is a controversial opinion but I personally would
>> prefer to not have __acquires/__releases() notations at all.
> 
> Thanks for pointing it out.
> 
>> They add unverifiable clutter
> 
> IIUC, `make C=1` should verify them but they're admittedly in more sorry
> state than selftests.
> 
>> and don't really add anything valuable on top of lockdep.
> 
> I also find lockdep macros better at documenting (cf `__releases(lock)
> __acquires(lock)` annotations) and more functional (part of running
> other tests with lockdep).
> 
> So I'd say cleanup with review of lockdep annotations is not a bad
> (separate) idea.

Maybe we can keep these intact for now while updating the text within
them to something more specific, as Michal is suggesting?
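
Something along these lines (sketch only):

	static inline void __css_rstat_lock(struct cgroup_subsys_state *css,
			int cpu_in_loop)
		__acquires(ss_rstat_lock(css->ss))
	...
	static inline void __css_rstat_unlock(struct cgroup_subsys_state *css,
			int cpu_in_loop)
		__releases(ss_rstat_lock(css->ss))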

Tejun, let us know if you have a strong feeling on removing them at
this point.

> 
> Michal


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 5/5] cgroup: use subsystem-specific rstat locks to avoid contention
  2025-04-16 18:10         ` JP Kobryn
@ 2025-04-16 18:14           ` Tejun Heo
  0 siblings, 0 replies; 38+ messages in thread
From: Tejun Heo @ 2025-04-16 18:14 UTC (permalink / raw)
  To: JP Kobryn
  Cc: Michal Koutný, shakeel.butt, yosryahmed, hannes, akpm,
	linux-mm, cgroups, kernel-team

On Wed, Apr 16, 2025 at 11:10:20AM -0700, JP Kobryn wrote:
> On 4/16/25 2:50 AM, Michal Koutný wrote:
> > On Tue, Apr 15, 2025 at 09:30:35AM -1000, Tejun Heo <tj@kernel.org> wrote:
> > > I don't know whether this is a controversial opinion but I personally would
> > > prefer to not have __acquires/__releases() notations at all.
> > 
> > Thanks for pointing it out.
> > 
> > > They add unverifiable clutter
> > 
> > IIUC, `make C=1` should verify them but they're admittedly in more sorry
> > state than selftests.
> > 
> > > and don't really add anything valuable on top of lockdep.
> > 
> > I also find lockdep macros better at documenting (cf `__releases(lock)
> > __acquires(lock)` annotations) and more functional (part of running
> > other tests with lockdep).
> > 
> > So I'd say cleanup with review of lockdep annotations is not a bad
> > (separate) idea.
> 
> Maybe we can keep these intact for now while updating the text within
> to something more specific like Michal is suggesting?
> 
> Tejun, let us know if you have a strong feeling on removing them at
> this point.

No strong feelings. No need to churn for it.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 4/5] cgroup: use separate rstat trees for each subsystem
  2025-04-15 17:15   ` Michal Koutný
@ 2025-04-16 21:43     ` JP Kobryn
  2025-04-17  9:26       ` Michal Koutný
  0 siblings, 1 reply; 38+ messages in thread
From: JP Kobryn @ 2025-04-16 21:43 UTC (permalink / raw)
  To: Michal Koutný
  Cc: tj, shakeel.butt, yosryahmed, hannes, akpm, linux-mm, cgroups,
	kernel-team

On 4/15/25 10:15 AM, Michal Koutný wrote:
> On Thu, Apr 03, 2025 at 06:10:49PM -0700, JP Kobryn <inwardvessel@gmail.com> wrote:
>> --- a/kernel/cgroup/cgroup.c
>> +++ b/kernel/cgroup/cgroup.c
> ...
>> @@ -5425,6 +5417,9 @@ static void css_free_rwork_fn(struct work_struct *work)
>>   		struct cgroup_subsys_state *parent = css->parent;
>>   		int id = css->id;
>>   
>> +		if (ss->css_rstat_flush)
>> +			css_rstat_exit(css);
>> +
> 
> It should be safe to call this unguarded (see also my comment below at
> css_rstat_flush()).

Hmmm, I checked my initial assumptions. I'm still finding that css's from
any subsystem, regardless of rstat usage, can reach this call to
css_rstat_exit(). Without the guard there will be undefined behavior.

> 
>>   		ss->css_free(css);
>>   		cgroup_idr_remove(&ss->css_idr, id);
>>   		cgroup_put(cgrp);
>> @@ -5477,11 +5472,8 @@ static void css_release_work_fn(struct work_struct *work)
>>   	if (ss) {
>>   		struct cgroup *parent_cgrp;
>>   
>> -		/* css release path */
>> -		if (!list_empty(&css->rstat_css_node)) {
>> +		if (ss->css_rstat_flush)
>>   			css_rstat_flush(css);
>> -			list_del_rcu(&css->rstat_css_node);
>> -		}
> 
> Ditto.

Same as mentioned above, there are cases where some css's belonging to a
subsystem like "cpu" (via CONFIG_CGROUP_SCHED) can reach this code.

> 
> These conditions -- css_is_cgroup(pos) and pos->ss->css_rstat_flush
> should be invariant wrt pos in the split tree, right?

It's a good point. Yes, I would say so.

> It's a μoptimization but may be worth checking only once before
> processing the update tree?

Let me do that that in the next rev.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 4/5] cgroup: use separate rstat trees for each subsystem
  2025-04-16 21:43     ` JP Kobryn
@ 2025-04-17  9:26       ` Michal Koutný
  2025-04-17 19:05         ` JP Kobryn
  0 siblings, 1 reply; 38+ messages in thread
From: Michal Koutný @ 2025-04-17  9:26 UTC (permalink / raw)
  To: JP Kobryn
  Cc: tj, shakeel.butt, yosryahmed, hannes, akpm, linux-mm, cgroups,
	kernel-team

On Wed, Apr 16, 2025 at 02:43:57PM -0700, JP Kobryn <inwardvessel@gmail.com> wrote:
> Hmmm I checked my initial assumptions. I'm still finding that css's from
> any subsystem regardless of rstat usage can reach this call to exit.
> Without the guard there will be undefined behavior.

At which place is the UB? (I saw that all funnels to css_rstat_flush()
that does the check but I may have overlooked something in the diffs.)

Michal

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 4/5] cgroup: use separate rstat trees for each subsystem
  2025-04-17  9:26       ` Michal Koutný
@ 2025-04-17 19:05         ` JP Kobryn
  2025-04-17 20:10           ` JP Kobryn
  0 siblings, 1 reply; 38+ messages in thread
From: JP Kobryn @ 2025-04-17 19:05 UTC (permalink / raw)
  To: Michal Koutný
  Cc: tj, shakeel.butt, yosryahmed, hannes, akpm, linux-mm, cgroups,
	kernel-team

On 4/17/25 2:26 AM, Michal Koutný wrote:
> On Wed, Apr 16, 2025 at 02:43:57PM -0700, JP Kobryn <inwardvessel@gmail.com> wrote:
>> Hmmm I checked my initial assumptions. I'm still finding that css's from
>> any subsystem regardless of rstat usage can reach this call to exit.
>> Without the guard there will be undefined behavior.
> 
> At which place is the UB? (I saw that all funnels to css_rstat_flush()
> that does the check but I may have overlooked something in the diffs.)

It would occur on access to the per-cpu rstat pointer during the tree
building in the sequence below.

css_rstat_exit(css)
	css_rstat_flush(css)
		css_rstat_updated_list(css, cpu)
			rstatc = css_rstat_cpu(css, cpu)
				per_cpu_ptr(css->rstat_cpu, cpu)

Since I'm doing the early checks in css_rstat_flush() in the next rev
though, I was thinking of this:

void css_rstat_flush(css)
{
	bool is_cgroup = css_is_cgroup(css);
	
	if (!is_cgroup && !css->ss->css_rstat_flush)
		return;

	...

	for (...) {
		if (is_cgroup)
			/* flush base stats and bpf */
		else
			/* flush via css_rstat_flush */
	}
}

Then we could remove the two conditional guards in css_release_work_fn()
and css_free_rwork_fn(). Thoughts?

Note that I was also thinking of doing the same early check in
css_rstat_updated() since it is exposed as a kfunc and someone could
pass in a non-rstat css other than cgroup::self.
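
Something like this (just a sketch of the guard, reusing css_is_cgroup()
from patch 2/5):

	__bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
	{
		if (!css_is_cgroup(css) && !css->ss->css_rstat_flush)
			return;

		/* existing update-tree linking continues as before */
		...
	}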

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 4/5] cgroup: use separate rstat trees for each subsystem
  2025-04-17 19:05         ` JP Kobryn
@ 2025-04-17 20:10           ` JP Kobryn
  0 siblings, 0 replies; 38+ messages in thread
From: JP Kobryn @ 2025-04-17 20:10 UTC (permalink / raw)
  To: Michal Koutný
  Cc: tj, shakeel.butt, yosryahmed, hannes, akpm, linux-mm, cgroups,
	kernel-team

On 4/17/25 12:05 PM, JP Kobryn wrote:
> On 4/17/25 2:26 AM, Michal Koutný wrote:
>> On Wed, Apr 16, 2025 at 02:43:57PM -0700, JP Kobryn 
>> <inwardvessel@gmail.com> wrote:
>>> Hmmm I checked my initial assumptions. I'm still finding that css's from
>>> any subsystem regardless of rstat usage can reach this call to exit.
>>> Without the guard there will be undefined behavior.
>>
>> At which place is the UB? (I saw that all funnels to css_rstat_flush()
>> that does the check but I may have overlooked something in the diffs.)
> 
> It would occur on access to the per-cpu rstat pointer during the tree
> building in the sequence below.
> 
> css_rstat_exit(css)
>      css_rstat_flush(css)
>          css_rstat_updated_list(css, cpu)
>              rstatc = css_rstat_cpu(css, cpu)
>                  per_cpu_ptr(css->rstat_cpu, cpu)
> 
> Since I'm doing the early checks in css_rstat_flush() in the next rev
> though, I was thinking of this:
> 
> void css_rstat_flush(css)
> {
>      bool is_cgroup = css_is_cgroup(css);
> 
>      if (!is_cgroup && !css->ss->css_rstat_flush)
>          return;
> 
>      ...
> 
>      for (...) {
>          if (is_cgroup)
>              /* flush base stats and bpf */
>          else
>              /* flush via css_rstat_flush */
>      }
> }
> 
> Then we could remove the two conditional guards in css_release_work_fn()
> and css_free_rwork_fn(). Thoughts?

Correction: just the one in css_release_work_fn(). In the case of
css_free_rwork_fn(), there is a call to css_rstat_exit() which would
still need to be guarded (or it will touch the per-cpu rstat pointer).

> 
> Note that I was also thinking of doing the same early check in
> css_rstat_updated() since it is exposed as a kfunc and someone could
> pass in a non-rstat css other than cgroup::self.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 4/5] cgroup: use separate rstat trees for each subsystem
  2025-04-04  1:10 ` [PATCH v4 4/5] cgroup: use separate rstat trees for each subsystem JP Kobryn
  2025-04-15 17:15   ` Michal Koutný
@ 2025-04-21 18:18   ` JP Kobryn
  2025-04-22 13:33   ` Yosry Ahmed
       [not found]   ` <68079aa7.df0a0220.30a1a0.cbb2SMTPIN_ADDED_BROKEN@mx.google.com>
  3 siblings, 0 replies; 38+ messages in thread
From: JP Kobryn @ 2025-04-21 18:18 UTC (permalink / raw)
  To: tj, shakeel.butt, yosryahmed, mkoutny, hannes, akpm
  Cc: linux-mm, cgroups, kernel-team

On 4/3/25 6:10 PM, JP Kobryn wrote:
>   __bpf_kfunc void css_rstat_flush(struct cgroup_subsys_state *css)
>   {
> -	struct cgroup *cgrp = css->cgroup;
>   	int cpu;
>   
>   	might_sleep();
>   	for_each_possible_cpu(cpu) {
> -		struct cgroup *pos;
> +		struct cgroup_subsys_state *pos;
>   
>   		/* Reacquire for each CPU to avoid disabling IRQs too long */
>   		__css_rstat_lock(css, cpu);
> -		pos = cgroup_rstat_updated_list(cgrp, cpu);
> +		pos = css_rstat_updated_list(css, cpu);
>   		for (; pos; pos = pos->rstat_flush_next) {
> -			struct cgroup_subsys_state *css;
> -
> -			cgroup_base_stat_flush(pos, cpu);
> -			bpf_rstat_flush(pos, cgroup_parent(pos), cpu);
> -
> -			rcu_read_lock();
> -			list_for_each_entry_rcu(css, &pos->rstat_css_list,
> -						rstat_css_node)
> +			if (css_is_cgroup(pos)) {
> +				cgroup_base_stat_flush(pos->cgroup, cpu);
> +				bpf_rstat_flush(pos->cgroup,
> +						cgroup_parent(pos->cgroup), cpu);
> +			} else if (pos->ss->css_rstat_flush)
>   				css->ss->css_rstat_flush(css, cpu);

This needs to change from css to pos. Note this was not an issue up
through v3. The flushing functions changed upstream between v3 and v4 so
this was possibly missed during the rebase.

> -			rcu_read_unlock();
>   		}
>   		__css_rstat_unlock(css, cpu);
>   		if (!cond_resched())

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 1/5] cgroup: move rstat base stat objects into their own struct
  2025-04-04  1:10 ` [PATCH v4 1/5] cgroup: move rstat base stat objects into their own struct JP Kobryn
  2025-04-15 17:16   ` Michal Koutný
@ 2025-04-22 12:13   ` Yosry Ahmed
  2025-05-29 18:58   ` Ihor Solodrai
  2 siblings, 0 replies; 38+ messages in thread
From: Yosry Ahmed @ 2025-04-22 12:13 UTC (permalink / raw)
  To: JP Kobryn
  Cc: tj, shakeel.butt, yosryahmed, mkoutny, hannes, akpm, linux-mm,
	cgroups, kernel-team

On Thu, Apr 03, 2025 at 06:10:46PM -0700, JP Kobryn wrote:
> This non-functional change serves as preparation for moving to
> subsystem-based rstat trees. The base stats are not an actual subsystem,
> but in future commits they will have exclusive rstat trees just as other
> subsystems will.
> 
> Moving the base stat objects into a new struct allows the cgroup_rstat_cpu
> struct to become more compact since it now only contains the minimum amount
> of pointers needed for rstat participation. Subsystems will (in future
> commits) make use of the compact cgroup_rstat_cpu struct while avoiding the
> memory overhead of the base stat objects which they will not use.
> 
> An instance of the new struct cgroup_rstat_base_cpu was placed on the
> cgroup struct so it can retain ownership of these base stats common to all
> cgroups. A helper function was added for looking up the cpu-specific base
> stats of a given cgroup. Finally, initialization and variable names were
> adjusted where applicable.
> 
> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>

Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 2/5] cgroup: add helper for checking when css is cgroup::self
  2025-04-04  1:10 ` [PATCH v4 2/5] cgroup: add helper for checking when css is cgroup::self JP Kobryn
@ 2025-04-22 12:19   ` Yosry Ahmed
       [not found]   ` <68078968.5d0a0220.2c3c35.bab3SMTPIN_ADDED_BROKEN@mx.google.com>
  1 sibling, 0 replies; 38+ messages in thread
From: Yosry Ahmed @ 2025-04-22 12:19 UTC (permalink / raw)
  To: JP Kobryn
  Cc: tj, shakeel.butt, yosryahmed, mkoutny, hannes, akpm, linux-mm,
	cgroups, kernel-team

On Thu, Apr 03, 2025 at 06:10:47PM -0700, JP Kobryn wrote:
> The cgroup struct has a css field called "self". The main difference
> between this css and the others found in the cgroup::subsys array is that
> cgroup::self has a NULL subsystem pointer. There are several places where
> checks are performed to determine whether the css in question is
> cgroup::self or not. Instead of accessing css->ss directly, introduce a
> helper function that shows the intent and use where applicable.
> 
> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
> ---
>  include/linux/cgroup.h | 5 +++++
>  kernel/cgroup/cgroup.c | 4 ++--
>  2 files changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 28e999f2c642..7c120efd5e49 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -347,6 +347,11 @@ static inline bool css_is_dying(struct cgroup_subsys_state *css)
>  	return !(css->flags & CSS_NO_REF) && percpu_ref_is_dying(&css->refcnt);
>  }
>  
> +static inline bool css_is_cgroup(struct cgroup_subsys_state *css)

I think css_is_self() or css_is_cgroup_self() may be clearer given that
we are basically checking if css is the same as css->cgroup->self. As I
write this out, I am wondering why don't we check css ==
css->cgroup->self instead (and perhaps add a WARN to make sure css->ss
is NULL as expected)?

This seems clearer to me unless I am missing something.
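
Something like this, for illustration (the helper name and the WARN
placement are just a suggestion, untested):

static inline bool css_is_self(struct cgroup_subsys_state *css)
{
	if (css == &css->cgroup->self) {
		/* cgroup::self is not expected to have a subsystem */
		WARN_ON_ONCE(css->ss);
		return true;
	}
	return false;
}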

> +{
> +	return css->ss == NULL;
> +}
> +
>  static inline void cgroup_get(struct cgroup *cgrp)
>  {
>  	css_get(&cgrp->self);
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 77349d07b117..00eb882dc6e7 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -1719,7 +1719,7 @@ static void css_clear_dir(struct cgroup_subsys_state *css)
>  
>  	css->flags &= ~CSS_VISIBLE;
>  
> -	if (!css->ss) {
> +	if (css_is_cgroup(css)) {
>  		if (cgroup_on_dfl(cgrp)) {
>  			cgroup_addrm_files(css, cgrp,
>  					   cgroup_base_files, false);
> @@ -1751,7 +1751,7 @@ static int css_populate_dir(struct cgroup_subsys_state *css)
>  	if (css->flags & CSS_VISIBLE)
>  		return 0;
>  
> -	if (!css->ss) {
> +	if (css_is_cgroup(css)) {
>  		if (cgroup_on_dfl(cgrp)) {
>  			ret = cgroup_addrm_files(css, cgrp,
>  						 cgroup_base_files, true);
> -- 
> 2.47.1
> 
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 3/5] cgroup: change rstat function signatures from cgroup-based to css-based
  2025-04-04  1:10 ` [PATCH v4 3/5] cgroup: change rstat function signatures from cgroup-based to css-based JP Kobryn
  2025-04-04 20:00   ` Tejun Heo
@ 2025-04-22 12:35   ` Yosry Ahmed
  2025-04-22 12:39   ` Yosry Ahmed
       [not found]   ` <68078d3c.050a0220.3d37e.6d82SMTPIN_ADDED_BROKEN@mx.google.com>
  3 siblings, 0 replies; 38+ messages in thread
From: Yosry Ahmed @ 2025-04-22 12:35 UTC (permalink / raw)
  To: JP Kobryn
  Cc: tj, shakeel.butt, yosryahmed, mkoutny, hannes, akpm, linux-mm,
	cgroups, kernel-team

On Thu, Apr 03, 2025 at 06:10:48PM -0700, JP Kobryn wrote:
> This non-functional change serves as preparation for moving to
> subsystem-based rstat trees. To simplify future commits, change the
> signatures of existing cgroup-based rstat functions to become css-based and
> rename them to reflect that.
> 
> Though the signatures have changed, the implementations have not. Within
> these functions use the css->cgroup pointer to obtain the associated cgroup
> and allow code to function the same just as it did before this patch. At
> applicable call sites, pass the subsystem-specific css pointer as an
> argument or pass a pointer to cgroup::self if not in subsystem context.
> 
> Note that cgroup_rstat_updated_list() and cgroup_rstat_push_children()
> are not altered yet since there would be a larger amount of css to
> cgroup conversions which may overcomplicate the code at this
> intermediate phase.
> 
> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
[..]
> @@ -5720,6 +5716,14 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
>  	cgrp->root = root;
>  	cgrp->level = level;
>  
> +	/*
> +	 * Now that init_cgroup_housekeeping() has been called and cgrp->self
> +	 * is setup, it is safe to perform rstat initialization on it.
> +	 */
> +	ret = css_rstat_init(&cgrp->self);
> +	if (ret)
> +		goto out_stat_exit;
> +

Sorry for the late review, but this looks wrong to me. I think this
should goto out_kernfs_remove..

>  	ret = psi_cgroup_alloc(cgrp);
>  	if (ret)
>  		goto out_kernfs_remove;

..and this should goto out_stat_exit.
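
IOW, the two error paths would be (sketch):

	ret = css_rstat_init(&cgrp->self);
	if (ret)
		goto out_kernfs_remove;

	ret = psi_cgroup_alloc(cgrp);
	if (ret)
		goto out_stat_exit;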

> @@ -5790,10 +5794,10 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
>  
>  out_psi_free:
>  	psi_cgroup_free(cgrp);
> +out_stat_exit:
> +	css_rstat_exit(&cgrp->self);
>  out_kernfs_remove:
>  	kernfs_remove(cgrp->kn);
> -out_stat_exit:
> -	cgroup_rstat_exit(cgrp);
>  out_cancel_ref:
>  	percpu_ref_exit(&cgrp->self.refcnt);
>  out_free_cgrp:
[..]
> @@ -298,36 +304,41 @@ static inline void __cgroup_rstat_lock(struct cgroup *cgrp, int cpu_in_loop)
>  	trace_cgroup_rstat_locked(cgrp, cpu_in_loop, contended);
>  }
>  
> -static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
> +static inline void __css_rstat_unlock(struct cgroup_subsys_state *css,
> +		int cpu_in_loop)
>  	__releases(&cgroup_rstat_lock)
>  {
> +	struct cgroup *cgrp = css->cgroup;
> +
>  	trace_cgroup_rstat_unlock(cgrp, cpu_in_loop, false);
>  	spin_unlock_irq(&cgroup_rstat_lock);
>  }
>  
>  /**
> - * cgroup_rstat_flush - flush stats in @cgrp's subtree
> - * @cgrp: target cgroup
> + * css_rstat_flush - flush stats in @css->cgroup's subtree
> + * @css: target cgroup subsystem state
>   *
> - * Collect all per-cpu stats in @cgrp's subtree into the global counters
> + * Collect all per-cpu stats in @css->cgroup's subtree into the global counters
>   * and propagate them upwards.  After this function returns, all cgroups in
>   * the subtree have up-to-date ->stat.
>   *
> - * This also gets all cgroups in the subtree including @cgrp off the
> + * This also gets all cgroups in the subtree including @css->cgroup off the
>   * ->updated_children lists.
>   *
>   * This function may block.
>   */
> -__bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
> +__bpf_kfunc void css_rstat_flush(struct cgroup_subsys_state *css)
>  {
> +	struct cgroup *cgrp = css->cgroup;
>  	int cpu;
>  
>  	might_sleep();
>  	for_each_possible_cpu(cpu) {
> -		struct cgroup *pos = cgroup_rstat_updated_list(cgrp, cpu);
> +		struct cgroup *pos;
>  
>  		/* Reacquire for each CPU to avoid disabling IRQs too long */
> -		__cgroup_rstat_lock(cgrp, cpu);
> +		__css_rstat_lock(css, cpu);
> +		pos = cgroup_rstat_updated_list(cgrp, cpu);

Moving this call under the lock is an unrelated bug fix that was already
done by Shakeel in commit 7d6c63c31914 ("cgroup: rstat: call
cgroup_rstat_updated_list with cgroup_rstat_lock").

Otherwise this LGTM.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 3/5] cgroup: change rstat function signatures from cgroup-based to css-based
  2025-04-04  1:10 ` [PATCH v4 3/5] cgroup: change rstat function signatures from cgroup-based to css-based JP Kobryn
  2025-04-04 20:00   ` Tejun Heo
  2025-04-22 12:35   ` Yosry Ahmed
@ 2025-04-22 12:39   ` Yosry Ahmed
       [not found]   ` <68078d3c.050a0220.3d37e.6d82SMTPIN_ADDED_BROKEN@mx.google.com>
  3 siblings, 0 replies; 38+ messages in thread
From: Yosry Ahmed @ 2025-04-22 12:39 UTC (permalink / raw)
  To: JP Kobryn
  Cc: tj, shakeel.butt, yosryahmed, mkoutny, hannes, akpm, linux-mm,
	cgroups, kernel-team

On Thu, Apr 03, 2025 at 06:10:48PM -0700, JP Kobryn wrote:
> This non-functional change serves as preparation for moving to
> subsystem-based rstat trees. To simplify future commits, change the
> signatures of existing cgroup-based rstat functions to become css-based and
> rename them to reflect that.
> 
> Though the signatures have changed, the implementations have not. Within
> these functions use the css->cgroup pointer to obtain the associated cgroup
> and allow code to function the same just as it did before this patch. At
> applicable call sites, pass the subsystem-specific css pointer as an
> argument or pass a pointer to cgroup::self if not in subsystem context.
> 
> Note that cgroup_rstat_updated_list() and cgroup_rstat_push_children()
> are not altered yet since there would be a larger amount of css to
> cgroup conversions which may overcomplicate the code at this
> intermediate phase.
> 
> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
> ---
>  block/blk-cgroup.c                            |  6 +-
>  include/linux/cgroup-defs.h                   |  2 +-
>  include/linux/cgroup.h                        |  4 +-
>  kernel/cgroup/cgroup-internal.h               |  4 +-
>  kernel/cgroup/cgroup.c                        | 30 ++++---
>  kernel/cgroup/rstat.c                         | 83 +++++++++++--------
>  mm/memcontrol.c                               |  4 +-
>  .../bpf/progs/cgroup_hierarchical_stats.c     |  9 +-
>  8 files changed, 80 insertions(+), 62 deletions(-)
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 5905f277057b..0560ea402856 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -1144,7 +1144,7 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
>  /*
>   * We source root cgroup stats from the system-wide stats to avoid
>   * tracking the same information twice and incurring overhead when no
> - * cgroups are defined. For that reason, cgroup_rstat_flush in
> + * cgroups are defined. For that reason, css_rstat_flush in
>   * blkcg_print_stat does not actually fill out the iostat in the root
>   * cgroup's blkcg_gq.
>   *
> @@ -1253,7 +1253,7 @@ static int blkcg_print_stat(struct seq_file *sf, void *v)
>  	if (!seq_css(sf)->parent)
>  		blkcg_fill_root_iostats();
>  	else
> -		cgroup_rstat_flush(blkcg->css.cgroup);
> +		css_rstat_flush(&blkcg->css);
>  
>  	rcu_read_lock();
>  	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
> @@ -2243,7 +2243,7 @@ void blk_cgroup_bio_start(struct bio *bio)
>  	}
>  
>  	u64_stats_update_end_irqrestore(&bis->sync, flags);
> -	cgroup_rstat_updated(blkcg->css.cgroup, cpu);
> +	css_rstat_updated(&blkcg->css, cpu);
>  	put_cpu();
>  }
>  
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index 6d177f770d28..e4a9fb00b228 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -536,7 +536,7 @@ struct cgroup {
>  	/*
>  	 * A singly-linked list of cgroup structures to be rstat flushed.
>  	 * This is a scratch field to be used exclusively by
> -	 * cgroup_rstat_flush_locked() and protected by cgroup_rstat_lock.
> +	 * css_rstat_flush_locked() and protected by cgroup_rstat_lock.

Also, I believe this function does not exist anymore.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 4/5] cgroup: use separate rstat trees for each subsystem
  2025-04-04  1:10 ` [PATCH v4 4/5] cgroup: use separate rstat trees for each subsystem JP Kobryn
  2025-04-15 17:15   ` Michal Koutný
  2025-04-21 18:18   ` JP Kobryn
@ 2025-04-22 13:33   ` Yosry Ahmed
       [not found]   ` <68079aa7.df0a0220.30a1a0.cbb2SMTPIN_ADDED_BROKEN@mx.google.com>
  3 siblings, 0 replies; 38+ messages in thread
From: Yosry Ahmed @ 2025-04-22 13:33 UTC (permalink / raw)
  To: JP Kobryn
  Cc: tj, shakeel.butt, yosryahmed, mkoutny, hannes, akpm, linux-mm,
	cgroups, kernel-team

On Thu, Apr 03, 2025 at 06:10:49PM -0700, JP Kobryn wrote:
> Different subsystems may call cgroup_rstat_updated() within the same
> cgroup, resulting in a tree of pending updates from multiple subsystems.
> When one of these subsystems is flushed via cgroup_rstat_flush(), all
> other subsystems with pending updates on the tree will also be flushed.
> 
> Change the paradigm of having a single rstat tree for all subsystems to
> having separate trees for each subsystem. This separation allows for
> subsystems to perform flushes without the side effects of other subsystems.
> As an example, flushing the cpu stats will no longer cause the memory stats
> to be flushed and vice versa.
> 
> In order to achieve subsystem-specific trees, change the tree node type
> from cgroup to cgroup_subsys_state pointer. Then remove those pointers from
> the cgroup and instead place them on the css. Finally, change update/flush
> functions to make use of the different node type (css). These changes allow
> a specific subsystem to be associated with an update or flush. Separate
> rstat trees will now exist for each unique subsystem.
> 
> Since updating/flushing will now be done at the subsystem level, there is
> no longer a need to keep track of updated css nodes at the cgroup level.
> The list management of these nodes done within the cgroup (rstat_css_list
> and related) has been removed accordingly.
> 
> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
[..]
> @@ -219,6 +219,13 @@ struct cgroup_subsys_state {
>  	 * Protected by cgroup_mutex.
>  	 */
>  	int nr_descendants;
> +
> +	/*
> +	 * A singly-linked list of css structures to be rstat flushed.
> +	 * This is a scratch field to be used exclusively by
> +	 * css_rstat_flush_locked() and protected by cgroup_rstat_lock.
> +	 */

Ditto obsolete function name.

> +	struct cgroup_subsys_state *rstat_flush_next;
>  };
>  
>  /*
> @@ -329,10 +336,10 @@ struct cgroup_base_stat {
>  
>  /*
>   * rstat - cgroup scalable recursive statistics.  Accounting is done
> - * per-cpu in cgroup_rstat_cpu which is then lazily propagated up the
> + * per-cpu in css_rstat_cpu which is then lazily propagated up the
>   * hierarchy on reads.
>   *
> - * When a stat gets updated, the cgroup_rstat_cpu and its ancestors are
> + * When a stat gets updated, the css_rstat_cpu and its ancestors are
>   * linked into the updated tree.  On the following read, propagation only
>   * considers and consumes the updated tree.  This makes reading O(the
>   * number of descendants which have been active since last read) instead of
> @@ -346,7 +353,7 @@ struct cgroup_base_stat {
>   * This struct hosts both the fields which implement the above -
>   * updated_children and updated_next.
>   */
> -struct cgroup_rstat_cpu {
> +struct css_rstat_cpu {
>  	/*
>  	 * Child cgroups with stat updates on this cpu since the last read
>  	 * are linked on the parent's ->updated_children through
> @@ -358,8 +365,8 @@ struct cgroup_rstat_cpu {
>  	 *
>  	 * Protected by per-cpu cgroup_rstat_cpu_lock.
>  	 */
> -	struct cgroup *updated_children;	/* terminated by self cgroup */
> -	struct cgroup *updated_next;		/* NULL iff not on the list */
> +	struct cgroup_subsys_state *updated_children;	/* terminated by self cgroup */
> +	struct cgroup_subsys_state *updated_next;	/* NULL iff not on the list */
>  };
>  
>  /*
> @@ -521,25 +528,16 @@ struct cgroup {
>  	struct cgroup *dom_cgrp;
>  	struct cgroup *old_dom_cgrp;		/* used while enabling threaded */
>  
> -	/* per-cpu recursive resource statistics */
> -	struct cgroup_rstat_cpu __percpu *rstat_cpu;
> +	/* per-cpu recursive basic resource statistics */

This comment should probably be dropped now as it's only providing
partial info and is potentially confusing.

>  	struct cgroup_rstat_base_cpu __percpu *rstat_base_cpu;
> -	struct list_head rstat_css_list;
>  
>  	/*
> -	 * Add padding to separate the read mostly rstat_cpu and
> -	 * rstat_css_list into a different cacheline from the following
> -	 * rstat_flush_next and *bstat fields which can have frequent updates.
> +	 * Add padding to keep the read mostly rstat per-cpu pointer on a
> +	 * different cacheline than the following *bstat fields which can have
> +	 * frequent updates.
>  	 */
>  	CACHELINE_PADDING(_pad_);
>  
> -	/*
> -	 * A singly-linked list of cgroup structures to be rstat flushed.
> -	 * This is a scratch field to be used exclusively by
> -	 * css_rstat_flush_locked() and protected by cgroup_rstat_lock.
> -	 */
> -	struct cgroup	*rstat_flush_next;
> -
>  	/* cgroup basic resource statistics */
>  	struct cgroup_base_stat last_bstat;
>  	struct cgroup_base_stat bstat;
[..]
> @@ -5425,6 +5417,9 @@ static void css_free_rwork_fn(struct work_struct *work)
>  		struct cgroup_subsys_state *parent = css->parent;
>  		int id = css->id;
>  
> +		if (ss->css_rstat_flush)
> +			css_rstat_exit(css);
> +

This call now exists in both branches (self css or not), so it's
probably best to pull it outside. We should probably also pull the call
in cgroup_destroy_root() outside into css_free_rwork_fn() so that we end
up with a single call to css_rstat_exit() (apart from failure paths).

We can probably also use css_is_cgroup() here instead of 'if (ss)' for
consistency.

>  		ss->css_free(css);
>  		cgroup_idr_remove(&ss->css_idr, id);
>  		cgroup_put(cgrp);
[..]  
> @@ -5659,6 +5647,12 @@ static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
>  		goto err_free_css;
>  	css->id = err;
>  
> +	if (ss->css_rstat_flush) {
> +		err = css_rstat_init(css);
> +		if (err)
> +			goto err_free_css;
> +	}
> +
>  	/* @css is ready to be brought online now, make it visible */
>  	list_add_tail_rcu(&css->sibling, &parent_css->children);
>  	cgroup_idr_replace(&ss->css_idr, css, css->id);
> @@ -5672,7 +5666,6 @@ static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
>  err_list_del:
>  	list_del_rcu(&css->sibling);
>  err_free_css:
> -	list_del_rcu(&css->rstat_css_node);
>  	INIT_RCU_WORK(&css->destroy_rwork, css_free_rwork_fn);
>  	queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
>  	return ERR_PTR(err);
> @@ -6104,11 +6097,17 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
>  	css->flags |= CSS_NO_REF;
>  
>  	if (early) {
> -		/* allocation can't be done safely during early init */
> +		/*
> +		 * Allocation can't be done safely during early init.
> +		 * Defer IDR and rstat allocations until cgroup_init().
> +		 */
>  		css->id = 1;
>  	} else {
>  		css->id = cgroup_idr_alloc(&ss->css_idr, css, 1, 2, GFP_KERNEL);
>  		BUG_ON(css->id < 0);
> +
> +		if (ss->css_rstat_flush)
> +			BUG_ON(css_rstat_init(css));
>  	}
>  
>  	/* Update the init_css_set to contain a subsys
> @@ -6207,9 +6206,17 @@ int __init cgroup_init(void)
>  			struct cgroup_subsys_state *css =
>  				init_css_set.subsys[ss->id];
>  
> +			/*
> +			 * It is now safe to perform allocations.
> +			 * Finish setting up subsystems that previously
> +			 * deferred IDR and rstat allocations.
> +			 */
>  			css->id = cgroup_idr_alloc(&ss->css_idr, css, 1, 2,
>  						   GFP_KERNEL);
>  			BUG_ON(css->id < 0);
> +
> +			if (ss->css_rstat_flush)
> +				BUG_ON(css_rstat_init(css));

The calls to css_rstat_init() are really difficult to track. Let's
recap, before this change we had two calls:
- In cgroup_setup_root(), for root cgroups.
- In cgroup_create(), for non-root cgroups.

This patch adds 3 more, so we end up with 5 calls as follows:
- In cgroup_setup_root(), for root self css's.
- In cgroup_create(), for non-root self css's.
- In cgroup_init_subsys(), for root subsys css's without early
  initialization.
- In cgroup_init(), for root subsys css's with early
  initialization, as we cannot call it from cgroup_init_subsys() early
  as allocations are not allowed during early init.
- In css_create(), for non-root non-self css's.

We should try to consolidate as much as possible. For example:
- Can we always make the call for root subsys css's in cgroup_init(),
  regardless of early initialization status? Is there a need to make the
  call early, in cgroup_init_subsys(), for subsystems that use early
  initialization?

- Can we always make the call for root css's in cgroup_init(),
  regardless of whether the css is a self css or a subsys css? I imagine
  we'd still need two separate calls, one outside the loop for the self
  css's, and one in the loop for subsys css's, but having them in the
  same place should make things easier.

Ideally if we can do both the above, we'd end up with 3 calling
functions only:
- cgroup_init() -> for all root css's.
- cgroup_create() -> for non-root self css's.
- css_create() -> for non-root subsys css's.

Also, we should probably document all the different call paths for
css_rstat_init() and css_rstat_exit() somewhere.


>  		} else {
>  			cgroup_init_subsys(ss, false);
>  		}

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 5/5] cgroup: use subsystem-specific rstat locks to avoid contention
  2025-04-04  1:10 ` [PATCH v4 5/5] cgroup: use subsystem-specific rstat locks to avoid contention JP Kobryn
  2025-04-04 20:28   ` Tejun Heo
  2025-04-15 17:15   ` Michal Koutný
@ 2025-04-22 14:01   ` Yosry Ahmed
  2025-04-24 17:25     ` Shakeel Butt
       [not found]   ` <6807a132.df0a0220.28dc80.a1f0SMTPIN_ADDED_BROKEN@mx.google.com>
  3 siblings, 1 reply; 38+ messages in thread
From: Yosry Ahmed @ 2025-04-22 14:01 UTC (permalink / raw)
  To: JP Kobryn
  Cc: tj, shakeel.butt, yosryahmed, mkoutny, hannes, akpm, linux-mm,
	cgroups, kernel-team

On Thu, Apr 03, 2025 at 06:10:50PM -0700, JP Kobryn wrote:
> It is possible to eliminate contention between subsystems when
> updating/flushing stats by using subsystem-specific locks. Let the existing
> rstat locks be dedicated to the cgroup base stats and rename them to
> reflect that. Add similar locks to the cgroup_subsys struct for use with
> individual subsystems.
> 
> Lock initialization is done in the new function ss_rstat_init(ss) which
> replaces cgroup_rstat_boot(void). If NULL is passed to this function, the
> global base stat locks will be initialized. Otherwise, the subsystem locks
> will be initialized.
> 
> Change the existing lock helper functions to accept a reference to a css.
> Then within these functions, conditionally select the appropriate locks
> based on the subsystem affiliation of the given css. Add helper functions
> for this selection routine to avoid repeated code.
> 
> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
> ---
>  block/blk-cgroup.c              |   2 +-
>  include/linux/cgroup-defs.h     |  16 +++--
>  include/trace/events/cgroup.h   |  12 +++-
>  kernel/cgroup/cgroup-internal.h |   2 +-
>  kernel/cgroup/cgroup.c          |  10 +++-
>  kernel/cgroup/rstat.c           | 101 +++++++++++++++++++++++---------
>  6 files changed, 103 insertions(+), 40 deletions(-)
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 0560ea402856..62d0bf1e1a04 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -1074,7 +1074,7 @@ static void __blkcg_rstat_flush(struct blkcg *blkcg, int cpu)
>  	/*
>  	 * For covering concurrent parent blkg update from blkg_release().
>  	 *
> -	 * When flushing from cgroup, cgroup_rstat_lock is always held, so
> +	 * When flushing from cgroup, the subsystem lock is always held, so
>  	 * this lock won't cause contention most of time.
>  	 */
>  	raw_spin_lock_irqsave(&blkg_stat_lock, flags);
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index c58c21c2110a..bb5a355524d6 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -223,7 +223,10 @@ struct cgroup_subsys_state {
>  	/*
>  	 * A singly-linked list of css structures to be rstat flushed.
>  	 * This is a scratch field to be used exclusively by
> -	 * css_rstat_flush_locked() and protected by cgroup_rstat_lock.
> +	 * css_rstat_flush_locked().
> +	 *
> +	 * Protected by rstat_base_lock when css is cgroup::self.
> +	 * Protected by css->ss->rstat_ss_lock otherwise.
>  	 */
>  	struct cgroup_subsys_state *rstat_flush_next;
>  };
> @@ -359,11 +362,11 @@ struct css_rstat_cpu {
>  	 * are linked on the parent's ->updated_children through
>  	 * ->updated_next.
>  	 *
> -	 * In addition to being more compact, singly-linked list pointing
> -	 * to the cgroup makes it unnecessary for each per-cpu struct to
> -	 * point back to the associated cgroup.
> +	 * In addition to being more compact, singly-linked list pointing to
> +	 * the css makes it unnecessary for each per-cpu struct to point back
> +	 * to the associated css.
>  	 *
> -	 * Protected by per-cpu cgroup_rstat_cpu_lock.
> +	 * Protected by per-cpu css->ss->rstat_ss_cpu_lock.
>  	 */
>  	struct cgroup_subsys_state *updated_children;	/* terminated by self cgroup */

This rename belongs in the previous patch, also the comment about
updated_children should probably say "self css" now.

>  	struct cgroup_subsys_state *updated_next;	/* NULL iff not on the list */
> @@ -793,6 +796,9 @@ struct cgroup_subsys {
>  	 * specifies the mask of subsystems that this one depends on.
>  	 */
>  	unsigned int depends_on;
> +
> +	spinlock_t rstat_ss_lock;
> +	raw_spinlock_t __percpu *rstat_ss_cpu_lock;

Can we use local_lock_t here instead? I guess it would be annoying
because we won't be able to have common code for locking/unlocking. It's
annoying because the local lock is a spinlock under the hood for non-RT
kernels anyway..

>  };
>  
>  extern struct percpu_rw_semaphore cgroup_threadgroup_rwsem;
[..]
> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> index 37d9e5012b2d..bcc253aec774 100644
> --- a/kernel/cgroup/rstat.c
> +++ b/kernel/cgroup/rstat.c
> @@ -9,8 +9,8 @@
>  
>  #include <trace/events/cgroup.h>
>  
> -static DEFINE_SPINLOCK(cgroup_rstat_lock);
> -static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
> +static DEFINE_SPINLOCK(rstat_base_lock);
> +static DEFINE_PER_CPU(raw_spinlock_t, rstat_base_cpu_lock);

Can we do something like this (not sure the macro usage is correct):

static DEFINE_PER_CPU(raw_spinlock_t, rstat_base_cpu_lock) = __RAW_SPIN_LOCK_UNLOCKED(rstat_base_cpu_lock);

This should initialize the per-CPU spinlocks the same way
DEFINE_SPINLOCK does IIUC.

>  
>  static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
>  
[..]
> @@ -422,12 +443,36 @@ void css_rstat_exit(struct cgroup_subsys_state *css)
>  	css->rstat_cpu = NULL;
>  }
>  
> -void __init cgroup_rstat_boot(void)
> +/**
> + * ss_rstat_init - subsystem-specific rstat initialization
> + * @ss: target subsystem
> + *
> + * If @ss is NULL, the static locks associated with the base stats
> + * are initialized. If @ss is non-NULL, the subsystem-specific locks
> + * are initialized.
> + */
> +int __init ss_rstat_init(struct cgroup_subsys *ss)
>  {
>  	int cpu;
>  
> +	if (!ss) {
> +		spin_lock_init(&rstat_base_lock);

IIUC locks defined with DEFINE_SPINLOCK() do not need to be initialized,
and I believe we can achieve the same for the per-CPU locks as I
described above and eliminate this branch completely.

> +
> +		for_each_possible_cpu(cpu)
> +			raw_spin_lock_init(per_cpu_ptr(&rstat_base_cpu_lock, cpu));
> +
> +		return 0;
> +	}
> +
> +	spin_lock_init(&ss->rstat_ss_lock);
> +	ss->rstat_ss_cpu_lock = alloc_percpu(raw_spinlock_t);
> +	if (!ss->rstat_ss_cpu_lock)
> +		return -ENOMEM;
> +
>  	for_each_possible_cpu(cpu)
> -		raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu));
> +		raw_spin_lock_init(per_cpu_ptr(ss->rstat_ss_cpu_lock, cpu));
> +
> +	return 0;
>  }
>  
>  /*
> -- 
> 2.47.1
> 
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 2/5] cgroup: add helper for checking when css is cgroup::self
       [not found]   ` <68078968.5d0a0220.2c3c35.bab3SMTPIN_ADDED_BROKEN@mx.google.com>
@ 2025-04-24 16:59     ` JP Kobryn
  0 siblings, 0 replies; 38+ messages in thread
From: JP Kobryn @ 2025-04-24 16:59 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: tj, shakeel.butt, yosryahmed, mkoutny, hannes, akpm, linux-mm,
	cgroups, kernel-team

On 4/22/25 5:19 AM, Yosry Ahmed wrote:
> On Thu, Apr 03, 2025 at 06:10:47PM -0700, JP Kobryn wrote:
...
>> +static inline bool css_is_cgroup(struct cgroup_subsys_state *css)
> 
> I think css_is_self() or css_is_cgroup_self() may be clearer given that
> we are basically checking if css is the same as css->cgroup->self. As I
> write this out, I am wondering why don't we check css ==
> css->cgroup->self instead (and perhaps add a WARN to make sure css->ss
> is NULL as expected)?

The check for css->cgroup->self seems reasonable. The intention of this
patch was just to replace explicit checks for existence of css->ss.
Regardless, since patches 1 to 3 of this series already been merged I
think we can consider changing the implementation outside of this
series.

> 
> This seems clearer to me unless I am missing something.

If the implementation changes, I think adding "_self" could make sense.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 3/5] cgroup: change rstat function signatures from cgroup-based to css-based
       [not found]   ` <68078d3c.050a0220.3d37e.6d82SMTPIN_ADDED_BROKEN@mx.google.com>
@ 2025-04-24 17:10     ` JP Kobryn
  0 siblings, 0 replies; 38+ messages in thread
From: JP Kobryn @ 2025-04-24 17:10 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: tj, shakeel.butt, yosryahmed, mkoutny, hannes, akpm, linux-mm,
	cgroups, kernel-team

On 4/22/25 5:35 AM, Yosry Ahmed wrote:
> On Thu, Apr 03, 2025 at 06:10:48PM -0700, JP Kobryn wrote:
>> This non-functional change serves as preparation for moving to
>> subsystem-based rstat trees. To simplify future commits, change the
>> signatures of existing cgroup-based rstat functions to become css-based and
>> rename them to reflect that.
>>
>> Though the signatures have changed, the implementations have not. Within
>> these functions use the css->cgroup pointer to obtain the associated cgroup
>> and allow code to function the same just as it did before this patch. At
>> applicable call sites, pass the subsystem-specific css pointer as an
>> argument or pass a pointer to cgroup::self if not in subsystem context.
>>
>> Note that cgroup_rstat_updated_list() and cgroup_rstat_push_children()
>> are not altered yet since there would be a larger amount of css to
>> cgroup conversions which may overcomplicate the code at this
>> intermediate phase.
>>
>> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
> [..]
>> @@ -5720,6 +5716,14 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
>>   	cgrp->root = root;
>>   	cgrp->level = level;
>>   
>> +	/*
>> +	 * Now that init_cgroup_housekeeping() has been called and cgrp->self
>> +	 * is setup, it is safe to perform rstat initialization on it.
>> +	 */
>> +	ret = css_rstat_init(&cgrp->self);
>> +	if (ret)
>> +		goto out_stat_exit;
>> +
> 
> Sorry for the late review, but this looks wrong to me. I think this
> should goto out_kernfs_remove..
> 
>>   	ret = psi_cgroup_alloc(cgrp);
>>   	if (ret)
>>   		goto out_kernfs_remove;
> 
> ..and this should goto out_stat_exit.

Thanks, will invert in separate patch.
> 
>> @@ -5790,10 +5794,10 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
>>   
>>   out_psi_free:
>>   	psi_cgroup_free(cgrp);
>> +out_stat_exit:
>> +	css_rstat_exit(&cgrp->self);
>>   out_kernfs_remove:
>>   	kernfs_remove(cgrp->kn);
>> -out_stat_exit:
>> -	cgroup_rstat_exit(cgrp);
>>   out_cancel_ref:
>>   	percpu_ref_exit(&cgrp->self.refcnt);
>>   out_free_cgrp:
> [..]
>> @@ -298,36 +304,41 @@ static inline void __cgroup_rstat_lock(struct cgroup *cgrp, int cpu_in_loop)
>>   	trace_cgroup_rstat_locked(cgrp, cpu_in_loop, contended);
>>   }
>>   
>> -static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
>> +static inline void __css_rstat_unlock(struct cgroup_subsys_state *css,
>> +		int cpu_in_loop)
>>   	__releases(&cgroup_rstat_lock)
>>   {
>> +	struct cgroup *cgrp = css->cgroup;
>> +
>>   	trace_cgroup_rstat_unlock(cgrp, cpu_in_loop, false);
>>   	spin_unlock_irq(&cgroup_rstat_lock);
>>   }
>>   
>>   /**
>> - * cgroup_rstat_flush - flush stats in @cgrp's subtree
>> - * @cgrp: target cgroup
>> + * css_rstat_flush - flush stats in @css->cgroup's subtree
>> + * @css: target cgroup subsystem state
>>    *
>> - * Collect all per-cpu stats in @cgrp's subtree into the global counters
>> + * Collect all per-cpu stats in @css->cgroup's subtree into the global counters
>>    * and propagate them upwards.  After this function returns, all cgroups in
>>    * the subtree have up-to-date ->stat.
>>    *
>> - * This also gets all cgroups in the subtree including @cgrp off the
>> + * This also gets all cgroups in the subtree including @css->cgroup off the
>>    * ->updated_children lists.
>>    *
>>    * This function may block.
>>    */
>> -__bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
>> +__bpf_kfunc void css_rstat_flush(struct cgroup_subsys_state *css)
>>   {
>> +	struct cgroup *cgrp = css->cgroup;
>>   	int cpu;
>>   
>>   	might_sleep();
>>   	for_each_possible_cpu(cpu) {
>> -		struct cgroup *pos = cgroup_rstat_updated_list(cgrp, cpu);
>> +		struct cgroup *pos;
>>   
>>   		/* Reacquire for each CPU to avoid disabling IRQs too long */
>> -		__cgroup_rstat_lock(cgrp, cpu);
>> +		__css_rstat_lock(css, cpu);
>> +		pos = cgroup_rstat_updated_list(cgrp, cpu);
> 
> Moving this call under the lock is an unrelated bug fix that was already
> done by Shakeel in commit 7d6c63c31914 ("cgroup: rstat: call
> cgroup_rstat_updated_list with cgroup_rstat_lock").

Right. I had been working off of a tree that had not yet received the
patch. Moving forward I will be using the tj/cgroup tree.

> 
> Otherwise this LGTM.

Thanks for reviewing.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 5/5] cgroup: use subsystem-specific rstat locks to avoid contention
  2025-04-22 14:01   ` Yosry Ahmed
@ 2025-04-24 17:25     ` Shakeel Butt
  0 siblings, 0 replies; 38+ messages in thread
From: Shakeel Butt @ 2025-04-24 17:25 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: JP Kobryn, tj, yosryahmed, mkoutny, hannes, akpm, linux-mm,
	cgroups, kernel-team

On Tue, Apr 22, 2025 at 07:01:15AM -0700, Yosry Ahmed wrote:
> >  	struct cgroup_subsys_state *updated_next;	/* NULL iff not on the list */
> > @@ -793,6 +796,9 @@ struct cgroup_subsys {
> >  	 * specifies the mask of subsystems that this one depends on.
> >  	 */
> >  	unsigned int depends_on;
> > +
> > +	spinlock_t rstat_ss_lock;
> > +	raw_spinlock_t __percpu *rstat_ss_cpu_lock;
> 
> Can we use local_lock_t here instead? I guess it would be annoying
> because we won't be able to have common code for locking/unlocking. It's
> annoying because the local lock is a spinlock under the hood for non-RT
> kernels anyway..
> 

local_lock_t are spinlocks for RT kernels and just preempt_disable() for
non-RT kernels. Orthogonally I am exploring making memcg stats nmi safe
for bpf stuff and may look into making cgroup_rstat_updated() nmi safe
as well (which would require exploring local_trylock_t here).
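
For illustration only (names made up, not from this series), per-cpu data
protected by a local_lock_t looks roughly like:

	#include <linux/local_lock.h>
	#include <linux/percpu.h>

	struct ss_rstat_pcpu {
		local_lock_t lock;
		u64 pending;
	};

	static DEFINE_PER_CPU(struct ss_rstat_pcpu, ss_rstat_pcpu) = {
		.lock = INIT_LOCAL_LOCK(lock),
	};

	static void ss_rstat_account(u64 delta)
	{
		/* non-RT: effectively preempt_disable(); RT: a per-cpu spinlock */
		local_lock(&ss_rstat_pcpu.lock);
		this_cpu_ptr(&ss_rstat_pcpu)->pending += delta;
		local_unlock(&ss_rstat_pcpu.lock);
	}

Note that local_lock() only operates on the current CPU's instance, so a
flush path that walks remote CPUs would still need a different scheme.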

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 5/5] cgroup: use subsystem-specific rstat locks to avoid contention
       [not found]   ` <6807a132.df0a0220.28dc80.a1f0SMTPIN_ADDED_BROKEN@mx.google.com>
@ 2025-04-25  0:18     ` JP Kobryn
  0 siblings, 0 replies; 38+ messages in thread
From: JP Kobryn @ 2025-04-25  0:18 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: tj, shakeel.butt, yosryahmed, mkoutny, hannes, akpm, linux-mm,
	cgroups, kernel-team

On 4/22/25 7:01 AM, Yosry Ahmed wrote:
> On Thu, Apr 03, 2025 at 06:10:50PM -0700, JP Kobryn wrote:
>> It is possible to eliminate contention between subsystems when
>> updating/flushing stats by using subsystem-specific locks. Let the existing
>> rstat locks be dedicated to the cgroup base stats and rename them to
>> reflect that. Add similar locks to the cgroup_subsys struct for use with
>> individual subsystems.
>>
>> Lock initialization is done in the new function ss_rstat_init(ss) which
>> replaces cgroup_rstat_boot(void). If NULL is passed to this function, the
>> global base stat locks will be initialized. Otherwise, the subsystem locks
>> will be initialized.
>>
>> Change the existing lock helper functions to accept a reference to a css.
>> Then within these functions, conditionally select the appropriate locks
>> based on the subsystem affiliation of the given css. Add helper functions
>> for this selection routine to avoid repeated code.
>>
>> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
>> ---
>>   block/blk-cgroup.c              |   2 +-
>>   include/linux/cgroup-defs.h     |  16 +++--
>>   include/trace/events/cgroup.h   |  12 +++-
>>   kernel/cgroup/cgroup-internal.h |   2 +-
>>   kernel/cgroup/cgroup.c          |  10 +++-
>>   kernel/cgroup/rstat.c           | 101 +++++++++++++++++++++++---------
>>   6 files changed, 103 insertions(+), 40 deletions(-)
>>
>> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
>> index 0560ea402856..62d0bf1e1a04 100644
>> --- a/block/blk-cgroup.c
>> +++ b/block/blk-cgroup.c
>> @@ -1074,7 +1074,7 @@ static void __blkcg_rstat_flush(struct blkcg *blkcg, int cpu)
>>   	/*
>>   	 * For covering concurrent parent blkg update from blkg_release().
>>   	 *
>> -	 * When flushing from cgroup, cgroup_rstat_lock is always held, so
>> +	 * When flushing from cgroup, the subsystem lock is always held, so
>>   	 * this lock won't cause contention most of time.
>>   	 */
>>   	raw_spin_lock_irqsave(&blkg_stat_lock, flags);
>> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
>> index c58c21c2110a..bb5a355524d6 100644
>> --- a/include/linux/cgroup-defs.h
>> +++ b/include/linux/cgroup-defs.h
>> @@ -223,7 +223,10 @@ struct cgroup_subsys_state {
>>   	/*
>>   	 * A singly-linked list of css structures to be rstat flushed.
>>   	 * This is a scratch field to be used exclusively by
>> -	 * css_rstat_flush_locked() and protected by cgroup_rstat_lock.
>> +	 * css_rstat_flush_locked().
>> +	 *
>> +	 * Protected by rstat_base_lock when css is cgroup::self.
>> +	 * Protected by css->ss->rstat_ss_lock otherwise.
>>   	 */
>>   	struct cgroup_subsys_state *rstat_flush_next;
>>   };
>> @@ -359,11 +362,11 @@ struct css_rstat_cpu {
>>   	 * are linked on the parent's ->updated_children through
>>   	 * ->updated_next.
>>   	 *
>> -	 * In addition to being more compact, singly-linked list pointing
>> -	 * to the cgroup makes it unnecessary for each per-cpu struct to
>> -	 * point back to the associated cgroup.
>> +	 * In addition to being more compact, singly-linked list pointing to
>> +	 * the css makes it unnecessary for each per-cpu struct to point back
>> +	 * to the associated css.
>>   	 *
>> -	 * Protected by per-cpu cgroup_rstat_cpu_lock.
>> +	 * Protected by per-cpu css->ss->rstat_ss_cpu_lock.
>>   	 */
>>   	struct cgroup_subsys_state *updated_children;	/* terminated by self cgroup */
> 
> This rename belongs in the previous patch, also the comment about
> updated_children should probably say "self css" now.

Thanks, will adjust in next rev.

> 
>>   	struct cgroup_subsys_state *updated_next;	/* NULL iff not on the list */
>> @@ -793,6 +796,9 @@ struct cgroup_subsys {
>>   	 * specifies the mask of subsystems that this one depends on.
>>   	 */
>>   	unsigned int depends_on;
>> +
>> +	spinlock_t rstat_ss_lock;
>> +	raw_spinlock_t __percpu *rstat_ss_cpu_lock;
> 
> Can we use local_lock_t here instead? I guess it would be annoying
> because we won't be able to have common code for locking/unlocking. It's
> annoying because the local lock is a spinlock under the hood for non-RT
> kernels anyway..
> 
>>   };
>>   
>>   extern struct percpu_rw_semaphore cgroup_threadgroup_rwsem;
> [..]
>> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
>> index 37d9e5012b2d..bcc253aec774 100644
>> --- a/kernel/cgroup/rstat.c
>> +++ b/kernel/cgroup/rstat.c
>> @@ -9,8 +9,8 @@
>>   
>>   #include <trace/events/cgroup.h>
>>   
>> -static DEFINE_SPINLOCK(cgroup_rstat_lock);
>> -static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
>> +static DEFINE_SPINLOCK(rstat_base_lock);
>> +static DEFINE_PER_CPU(raw_spinlock_t, rstat_base_cpu_lock);
> 
> Can we do something like this (not sure the macro usage is correct):
> 
> static DEFINE_PER_CPU(raw_spinlock_t, rstat_base_cpu_lock) = __RAW_SPIN_LOCK_UNLOCKED(rstat_base_cpu_lock);
> 
> This should initialize the per-CPU spinlocks the same way
> DEFINE_SPINLOCK does IIUC.

Yes, this could work. There was some earlier discussion and it was
decided against though [0].

> 
>>   
>>   static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
>>   
> [..]
>> @@ -422,12 +443,36 @@ void css_rstat_exit(struct cgroup_subsys_state *css)
>>   	css->rstat_cpu = NULL;
>>   }
>>   
>> -void __init cgroup_rstat_boot(void)
>> +/**
>> + * ss_rstat_init - subsystem-specific rstat initialization
>> + * @ss: target subsystem
>> + *
>> + * If @ss is NULL, the static locks associated with the base stats
>> + * are initialized. If @ss is non-NULL, the subsystem-specific locks
>> + * are initialized.
>> + */
>> +int __init ss_rstat_init(struct cgroup_subsys *ss)
>>   {
>>   	int cpu;
>>   
>> +	if (!ss) {
>> +		spin_lock_init(&rstat_base_lock);
> 
> IIUC locks defined with DEFINE_SPINLOCK() do not need to be initialized,
> and I believe we can achieve the same for the per-CPU locks as I
> described above and eliminate this branch completely.

True. I'll remove this in the next rev, but will keep the initialization
of the per-cpu locks based on [0].

[0] 
https://lore.kernel.org/all/jtqkpzqv4xqy4vajm6fljin6ospot37qg252dfk3yldzq6aubu@icg3ndtg3k7i/


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 4/5] cgroup: use separate rstat trees for each subsystem
       [not found]   ` <68079aa7.df0a0220.30a1a0.cbb2SMTPIN_ADDED_BROKEN@mx.google.com>
@ 2025-04-30 23:43     ` JP Kobryn
  2025-05-06  9:37       ` Yosry Ahmed
  0 siblings, 1 reply; 38+ messages in thread
From: JP Kobryn @ 2025-04-30 23:43 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: tj, shakeel.butt, yosryahmed, mkoutny, hannes, akpm, linux-mm,
	cgroups, kernel-team

On 4/22/25 6:33 AM, Yosry Ahmed wrote:
> On Thu, Apr 03, 2025 at 06:10:49PM -0700, JP Kobryn wrote:
[..]
>> @@ -5425,6 +5417,9 @@ static void css_free_rwork_fn(struct work_struct *work)
>>   		struct cgroup_subsys_state *parent = css->parent;
>>   		int id = css->id;
>>   
>> +		if (ss->css_rstat_flush)
>> +			css_rstat_exit(css);
>> +
> 
> This call now exists in both branches (self css or not), so it's
> probably best to pull it outside. We should probably also pull the call
> in cgroup_destroy_root() outside into css_free_rwork_fn() so that we end
> up with a single call to css_rstat_exit() (apart from failure paths).

This can be done if css_rstat_exit() is modified to guard against
invalid css's like being a subsystem css and not having implemented
css_rstat_flush.
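
Something along these lines (rough sketch, untested):

void css_rstat_exit(struct cgroup_subsys_state *css)
{
	if (!css_is_cgroup(css) && !css->ss->css_rstat_flush)
		return;

	/* ... existing per-cpu teardown ... */
}

With that, the callers would no longer need their own ss->css_rstat_flush
checks.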

> 
> We can probably also use css_is_cgroup() here instead of 'if (ss)' for
> consistency.
> 
>>   		ss->css_free(css);
>>   		cgroup_idr_remove(&ss->css_idr, id);
>>   		cgroup_put(cgrp);
> [..]
>> @@ -5659,6 +5647,12 @@ static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
>>   		goto err_free_css;
>>   	css->id = err;
>>   
>> +	if (ss->css_rstat_flush) {
>> +		err = css_rstat_init(css);
>> +		if (err)
>> +			goto err_free_css;
>> +	}
>> +
>>   	/* @css is ready to be brought online now, make it visible */
>>   	list_add_tail_rcu(&css->sibling, &parent_css->children);
>>   	cgroup_idr_replace(&ss->css_idr, css, css->id);
>> @@ -5672,7 +5666,6 @@ static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
>>   err_list_del:
>>   	list_del_rcu(&css->sibling);
>>   err_free_css:
>> -	list_del_rcu(&css->rstat_css_node);
>>   	INIT_RCU_WORK(&css->destroy_rwork, css_free_rwork_fn);
>>   	queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
>>   	return ERR_PTR(err);
>> @@ -6104,11 +6097,17 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
>>   	css->flags |= CSS_NO_REF;
>>   
>>   	if (early) {
>> -		/* allocation can't be done safely during early init */
>> +		/*
>> +		 * Allocation can't be done safely during early init.
>> +		 * Defer IDR and rstat allocations until cgroup_init().
>> +		 */
>>   		css->id = 1;
>>   	} else {
>>   		css->id = cgroup_idr_alloc(&ss->css_idr, css, 1, 2, GFP_KERNEL);
>>   		BUG_ON(css->id < 0);
>> +
>> +		if (ss->css_rstat_flush)
>> +			BUG_ON(css_rstat_init(css));
>>   	}
>>   
>>   	/* Update the init_css_set to contain a subsys
>> @@ -6207,9 +6206,17 @@ int __init cgroup_init(void)
>>   			struct cgroup_subsys_state *css =
>>   				init_css_set.subsys[ss->id];
>>   
>> +			/*
>> +			 * It is now safe to perform allocations.
>> +			 * Finish setting up subsystems that previously
>> +			 * deferred IDR and rstat allocations.
>> +			 */
>>   			css->id = cgroup_idr_alloc(&ss->css_idr, css, 1, 2,
>>   						   GFP_KERNEL);
>>   			BUG_ON(css->id < 0);
>> +
>> +			if (ss->css_rstat_flush)
>> +				BUG_ON(css_rstat_init(css));
> 
> The calls to css_rstat_init() are really difficult to track. Let's
> recap, before this change we had two calls:
> - In cgroup_setup_root(), for root cgroups.
> - In cgroup_create(), for non-root cgroups.
> 
> This patch adds 3 more, so we end up with 5 calls as follows:
> - In cgroup_setup_root(), for root self css's.
> - In cgroup_create(), for non-root self css's.
> - In cgroup_init_subsys(), for root subsys css's without early
>    initialization.
> - In cgroup_init(), for root subsys css's with early
>    initialization, as we cannot call it from cgroup_init_subsys() early
>    as allocations are not allowed during early init.
> - In css_create(), for non-root non-self css's.
> 
> We should try to consolidate as much as possible. For example:
> - Can we always make the call for root subsys css's in cgroup_init(),
>    regardless of early initialization status? Is there a need to make the
>    call early, in cgroup_init_subsys(), for subsystems that use early
>    initialization?
> 
> - Can we always make the call for root css's in cgroup_init(),
>    regardless of whether the css is a self css or a subsys css? I imagine
>    we'd still need two separate calls, one outside the loop for the self
>    css's, and one in the loop for subsys css's, but having them in the
>    same place should make things easier.

The answer might be the same for the two questions above. I think at
best, we can eliminate the single call below to css_rstat_init():

cgroup_init()
	for_each_subsys(ss, ssid)
		if (ss->early_init)
			css_rstat_init(css) /* remove */

I'm not sure whether it's by design (it seems to be an undocumented
constraint), but I find that a cgroup subsystem designated for "early
init" cannot also participate in rstat. The
necessary ordering of actions should be:

init_and_link_css(css, ss, ...)
css_rstat_init(css)
css_online(css)

This sequence occurs within cgroup_init_subsys() when ss->early_init is
false. However, when early_init is true, the last two calls are
effectively inverted:

css_online(css)
css_rstat_init(css) /* too late */

This needs to be avoided or else failures will occur during boot.

Note that this constraint seems to have existed even before this
series. Using branch for-6.16 as a base, I experimented with a minimal
custom test subsystem that implements css_rstat_flush with early_init
set. The system fails during boot because online_css() is called
before cgroup_rstat_init().

cgroup_init_early()
	for_each_subsys(ss, ssid)
		if (ss->early)
			cgroup_init_subsys(ss, true)
				css = ss->css_alloc()
				online_css(css)
cgroup_init()
	cgroup_setup_root()
		cgroup_rstat_init(root_cgrp) /* too late */

Unless I"m missing something, do you think this constraint is worthy of
a separate patch? Something like this that prevents the combination of
rstat with early_init:

--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -6130,7 +6130,8 @@ int __init cgroup_init_early(void)
     for_each_subsys(ss, i) {
-       WARN(!ss->css_alloc || !ss->css_free || ss->name || ss->id,
+       WARN(!ss->css_alloc || !ss->css_free || ss->name || ss->id ||
+           (ss->early_init && ss->css_rstat_flush),

> 
> Ideally if we can do both the above, we'd end up with 3 calling
> functions only:
> - cgroup_init() -> for all root css's.
> - cgroup_create() -> for non-root self css's.
> - css_create() -> for non-root subsys css's.
> 
> Also, we should probably document all the different call paths for
> css_rstat_init() and css_rstat_exit() somewhere.

Will do.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 4/5] cgroup: use separate rstat trees for each subsystem
  2025-04-30 23:43     ` JP Kobryn
@ 2025-05-06  9:37       ` Yosry Ahmed
  0 siblings, 0 replies; 38+ messages in thread
From: Yosry Ahmed @ 2025-05-06  9:37 UTC (permalink / raw)
  To: JP Kobryn
  Cc: tj, shakeel.butt, mkoutny, hannes, akpm, linux-mm, cgroups,
	kernel-team

On Wed, Apr 30, 2025 at 04:43:41PM -0700, JP Kobryn wrote:
> On 4/22/25 6:33 AM, Yosry Ahmed wrote:
> > On Thu, Apr 03, 2025 at 06:10:49PM -0700, JP Kobryn wrote:
> [..]
> > > @@ -5425,6 +5417,9 @@ static void css_free_rwork_fn(struct work_struct *work)
> > >   		struct cgroup_subsys_state *parent = css->parent;
> > >   		int id = css->id;
> > > +		if (ss->css_rstat_flush)
> > > +			css_rstat_exit(css);
> > > +
> > 
> > This call now exists in both branches (self css or not), so it's
> > probably best to pull it outside. We should probably also pull the call
> > in cgroup_destroy_root() outside into css_free_rwork_fn() so that we end
> > up with a single call to css_rstat_exit() (apart from failure paths).
> 
> This can be done if css_rstat_exit() is modified to guard against
> invalid css's like being a subsystem css and not having implemented
> css_rstat_flush.
> 
> > 
> > We can probably also use css_is_cgroup() here instead of 'if (ss)' for
> > consistency.
> > 
> > >   		ss->css_free(css);
> > >   		cgroup_idr_remove(&ss->css_idr, id);
> > >   		cgroup_put(cgrp);
> > [..]
> > > @@ -5659,6 +5647,12 @@ static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
> > >   		goto err_free_css;
> > >   	css->id = err;
> > > +	if (ss->css_rstat_flush) {
> > > +		err = css_rstat_init(css);
> > > +		if (err)
> > > +			goto err_free_css;
> > > +	}
> > > +
> > >   	/* @css is ready to be brought online now, make it visible */
> > >   	list_add_tail_rcu(&css->sibling, &parent_css->children);
> > >   	cgroup_idr_replace(&ss->css_idr, css, css->id);
> > > @@ -5672,7 +5666,6 @@ static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
> > >   err_list_del:
> > >   	list_del_rcu(&css->sibling);
> > >   err_free_css:
> > > -	list_del_rcu(&css->rstat_css_node);
> > >   	INIT_RCU_WORK(&css->destroy_rwork, css_free_rwork_fn);
> > >   	queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
> > >   	return ERR_PTR(err);
> > > @@ -6104,11 +6097,17 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
> > >   	css->flags |= CSS_NO_REF;
> > >   	if (early) {
> > > -		/* allocation can't be done safely during early init */
> > > +		/*
> > > +		 * Allocation can't be done safely during early init.
> > > +		 * Defer IDR and rstat allocations until cgroup_init().
> > > +		 */
> > >   		css->id = 1;
> > >   	} else {
> > >   		css->id = cgroup_idr_alloc(&ss->css_idr, css, 1, 2, GFP_KERNEL);
> > >   		BUG_ON(css->id < 0);
> > > +
> > > +		if (ss->css_rstat_flush)
> > > +			BUG_ON(css_rstat_init(css));
> > >   	}
> > >   	/* Update the init_css_set to contain a subsys
> > > @@ -6207,9 +6206,17 @@ int __init cgroup_init(void)
> > >   			struct cgroup_subsys_state *css =
> > >   				init_css_set.subsys[ss->id];
> > > +			/*
> > > +			 * It is now safe to perform allocations.
> > > +			 * Finish setting up subsystems that previously
> > > +			 * deferred IDR and rstat allocations.
> > > +			 */
> > >   			css->id = cgroup_idr_alloc(&ss->css_idr, css, 1, 2,
> > >   						   GFP_KERNEL);
> > >   			BUG_ON(css->id < 0);
> > > +
> > > +			if (ss->css_rstat_flush)
> > > +				BUG_ON(css_rstat_init(css));
> > 
> > The calls to css_rstat_init() are really difficult to track. Let's
> > recap, before this change we had two calls:
> > - In cgroup_setup_root(), for root cgroups.
> > - In cgroup_create(), for non-root cgroups.
> > 
> > This patch adds 3 more, so we end up with 5 calls as follows:
> > - In cgroup_setup_root(), for root self css's.
> > - In cgroup_create(), for non-root self css's.
> > - In cgroup_init_subsys(), for root subsys css's without early
> >    initialization.
> > - In cgroup_init(), for root subsys css's with early
> >    initialization, as we cannot call it from cgroup_init_subsys() early
> >    as allocations are not allowed during early init.
> > - In css_create(), for non-root non-self css's.
> > 
> > We should try to consolidate as much as possible. For example:
> > - Can we always make the call for root subsys css's in cgroup_init(),
> >    regardless of early initialization status? Is there a need to make the
> >    call early, in cgroup_init_subsys(), for subsystems that use early
> >    initialization?
> > 
> > - Can we always make the call for root css's in cgroup_init(),
> >    regardless of whether the css is a self css or a subsys css? I imagine
> >    we'd still need two separate calls, one outside the loop for the self
> >    css's, and one in the loop for subsys css's, but having them in the
> >    same place should make things easier.
> 
> The answer might be the same for the two questions above. I think at
> best, we can eliminate the single call below to css_rstat_init():
> 
> cgroup_init()
> 	for_each_subsys(ss, ssid)
> 		if (ss->early_init)
> 			css_rstat_init(css) /* remove */
> 
> I'm not sure whether it's by design (it seems to be an undocumented
> constraint), but I find that a cgroup subsystem designated for "early
> init" cannot also participate in rstat. The
> necessary ordering of actions should be:
> 
> init_and_link_css(css, ss, ...)
> css_rstat_init(css)
> css_online(css)
> 
> This sequence occurs within cgroup_init_subsys() when ss->early_init is
> false. However, when early_init is true, the last two calls are
> effectively inverted:
> 
> css_online(css)
> css_rstat_init(css) /* too late */
> 
> This needs to be avoided or else failures will occur during boot.
> 
> Note that this constraint seems to have existed even before this
> series. Using branch for-6.16 as a base, I experimented with a minimal
> custom test subsystem that implements css_rstat_flush with early_init
> set. The system fails during boot because online_css() is called
> before cgroup_rstat_init().
> 
> cgroup_init_early()
> 	for_each_subsys(ss, ssid)
> 		if (ss->early)
> 			cgroup_init_subsys(ss, true)
> 				css = ss->css_alloc()
> 				online_css(css)
> cgroup_init()
> 	cgroup_setup_root()
> 		cgroup_rstat_init(root_cgrp) /* too late */

I am looking at online_css() and cgroup_rstat_init() and I cannot
immediately see the ordering requirement.

Is the idea that once a css is online an rstat flush can occur, and it
would crash if cgroup_rstat_init() hadn't been called yet? I am
wondering when the flush would occur before cgroup_init() is called.

Anyway, if that is possible I think enforcing this is probably
important.

> 
> Unless I"m missing something, do you think this constraint is worthy of
> a separate patch? Something like this that prevents the combination of
> rstat with early_init:
> 
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -6130,7 +6130,8 @@ int __init cgroup_init_early(void)
>     for_each_subsys(ss, i) {
> -       WARN(!ss->css_alloc || !ss->css_free || ss->name || ss->id,
> +       WARN(!ss->css_alloc || !ss->css_free || ss->name || ss->id ||
> +           (ss->early_init && ss->css_rstat_flush),
> 
> > 
> > Ideally if we can do both the above, we'd end up with 3 calling
> > functions only:
> > - cgroup_init() -> for all root css's.
> > - cgroup_create() -> for non-root self css's.
> > - css_create() -> for non-root subsys css's.
> > 
> > Also, we should probably document all the different call paths for
> > css_rstat_init() and css_rstat_exit() somewhere.
> 
> Will do.
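
For reference, a minimal sketch of what that guard could look like once
filled out. This is untested and assumes the existing WARN() message
format and argument list in cgroup_init_early() stay unchanged:

	for_each_subsys(ss, i) {
		/* reject subsystems that combine early init with rstat */
		WARN(!ss->css_alloc || !ss->css_free || ss->name || ss->id ||
		     (ss->early_init && ss->css_rstat_flush),
		     "invalid cgroup_subsys %d:%s css_alloc=%p css_free=%p id:name=%d:%s\n",
		     i, cgroup_subsys_name[i], ss->css_alloc, ss->css_free,
		     ss->id, ss->name);
		...
	}

If the new early_init/css_rstat_flush combination is what trips the
warning, the message itself probably wants to be extended to say so.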

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 1/5] cgroup: move rstat base stat objects into their own struct
  2025-04-04  1:10 ` [PATCH v4 1/5] cgroup: move rstat base stat objects into their own struct JP Kobryn
  2025-04-15 17:16   ` Michal Koutný
  2025-04-22 12:13   ` Yosry Ahmed
@ 2025-05-29 18:58   ` Ihor Solodrai
  2025-05-29 19:11     ` Yonghong Song
  2 siblings, 1 reply; 38+ messages in thread
From: Ihor Solodrai @ 2025-05-29 18:58 UTC (permalink / raw)
  To: JP Kobryn, tj, shakeel.butt, yosryahmed, mkoutny, hannes, akpm
  Cc: linux-mm, cgroups, kernel-team, bpf

On 4/3/25 6:10 PM, JP Kobryn wrote:
> This non-functional change serves as preparation for moving to
> subsystem-based rstat trees. The base stats are not an actual subsystem,
> but in future commits they will have exclusive rstat trees just as other
> subsystems will.
> 
> Moving the base stat objects into a new struct allows the cgroup_rstat_cpu
> struct to become more compact since it now only contains the minimum amount
> of pointers needed for rstat participation. Subsystems will (in future
> commits) make use of the compact cgroup_rstat_cpu struct while avoiding the
> memory overhead of the base stat objects which they will not use.
> 
> An instance of the new struct cgroup_rstat_base_cpu was placed on the
> cgroup struct so it can retain ownership of these base stats common to all
> cgroups. A helper function was added for looking up the cpu-specific base
> stats of a given cgroup. Finally, initialization and variable names were
> adjusted where applicable.
> 
> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
> ---
>   include/linux/cgroup-defs.h | 38 ++++++++++-------
>   kernel/cgroup/cgroup.c      |  8 +++-
>   kernel/cgroup/rstat.c       | 84 ++++++++++++++++++++++---------------
>   3 files changed, 79 insertions(+), 51 deletions(-)
> 

Hi everyone.

BPF CI started failing after recent upstream merges (tip: 90b83efa6701).
I bisected the issue to this patch, see a log snippet below [1]:

     ##[error]#44/9 btf_tag/btf_type_tag_percpu_vmlinux_helper
     load_btfs:PASS:could not load vmlinux BTF 0 nsec
     test_btf_type_tag_vmlinux_percpu:PASS:btf_type_tag_percpu 0 nsec
     libbpf: prog 'test_percpu_helper': BPF program load failed: -EACCES
     libbpf: prog 'test_percpu_helper': -- BEGIN PROG LOAD LOG --
     0: R1=ctx() R10=fp0
     ; int BPF_PROG(test_percpu_helper, struct cgroup *cgrp, const char *path) @ btf_type_tag_percpu.c:58
     0: (79) r1 = *(u64 *)(r1 +0)
     func 'cgroup_mkdir' arg0 has btf_id 437 type STRUCT 'cgroup'
     1: R1_w=trusted_ptr_cgroup()
     ; cpu = bpf_get_smp_processor_id(); @ btf_type_tag_percpu.c:63
     1: (7b) *(u64 *)(r10 -8) = r1         ; R1_w=trusted_ptr_cgroup() R10=fp0 fp-8_w=trusted_ptr_cgroup()
     2: (85) call bpf_get_smp_processor_id#8       ; R0_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=1,var_off=(0x0; 0x1))
     3: (79) r1 = *(u64 *)(r10 -8)         ; R1_w=trusted_ptr_cgroup() R10=fp0 fp-8_w=trusted_ptr_cgroup()
     ; cgrp->self.rstat_cpu, cpu); @ btf_type_tag_percpu.c:65
     4: (79) r1 = *(u64 *)(r1 +32)         ; R1_w=percpu_ptr_css_rstat_cpu()
     ; rstat = (struct cgroup_rstat_cpu *)bpf_per_cpu_ptr( @ btf_type_tag_percpu.c:64
     5: (bc) w2 = w0                       ; R0_w=scalar(id=1,smin=smin32=0,smax=umax=smax32=umax32=1,var_off=(0x0; 0x1)) R2_w=scalar(id=1,smin=smin32=0,smax=umax=smax32=umax32=1,var_off=(0x0; 0x1))
     6: (85) call bpf_per_cpu_ptr#153      ; R0=ptr_or_null_css_rstat_cpu(id=2)
     ; if (rstat) { @ btf_type_tag_percpu.c:66
     7: (15) if r0 == 0x0 goto pc+1        ; R0=ptr_css_rstat_cpu()
     ; *(volatile int *)rstat; @ btf_type_tag_percpu.c:68
     8: (61) r1 = *(u32 *)(r0 +0)
     cannot access ptr member updated_children with moff 0 in struct css_rstat_cpu with off 0 size 4
     processed 9 insns (limit 1000000) max_states_per_insn 0 total_states 1 peak_states 1 mark_read 1
     -- END PROG LOAD LOG --
     libbpf: prog 'test_percpu_helper': failed to load: -EACCES
     libbpf: failed to load object 'btf_type_tag_percpu'
     libbpf: failed to load BPF skeleton 'btf_type_tag_percpu': -EACCES
     test_btf_type_tag_vmlinux_percpu:FAIL:btf_type_tag_percpu_helper unexpected error: -13 (errno 13)

The test in question [2]:

SEC("tp_btf/cgroup_mkdir")
int BPF_PROG(test_percpu_helper, struct cgroup *cgrp, const char *path)
{
	struct cgroup_rstat_cpu *rstat;
	__u32 cpu;

	cpu = bpf_get_smp_processor_id();
	rstat = (struct cgroup_rstat_cpu *)bpf_per_cpu_ptr(cgrp->rstat_cpu, cpu);
	if (rstat) {
		/* READ_ONCE */
		*(volatile int *)rstat; // BPF verification fails here
	}

	return 0;
}

Any ideas about how to properly fix this?

Thanks.

[1] https://github.com/kernel-patches/bpf/actions/runs/15316839796/job/43125242673
[2] https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/btf_type_tag_percpu.c#n68

> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index 485b651869d9..6d177f770d28 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -344,10 +344,29 @@ struct cgroup_base_stat {
>    * frequency decreases the cost of each read.
>    *
>    * This struct hosts both the fields which implement the above -
> - * updated_children and updated_next - and the fields which track basic
> - * resource statistics on top of it - bsync, bstat and last_bstat.
> + * updated_children and updated_next.
>    */
>   struct cgroup_rstat_cpu {
> +	/*
> +	 * Child cgroups with stat updates on this cpu since the last read
> +	 * are linked on the parent's ->updated_children through
> +	 * ->updated_next.
> +	 *
> +	 * In addition to being more compact, singly-linked list pointing
> +	 * to the cgroup makes it unnecessary for each per-cpu struct to
> +	 * point back to the associated cgroup.
> +	 *
> +	 * Protected by per-cpu cgroup_rstat_cpu_lock.
> +	 */
> +	struct cgroup *updated_children;	/* terminated by self cgroup */
> +	struct cgroup *updated_next;		/* NULL iff not on the list */
> +};
> +
> +/*
> + * This struct hosts the fields which track basic resource statistics on
> + * top of it - bsync, bstat and last_bstat.
> + */
> +struct cgroup_rstat_base_cpu {
>   	/*
>   	 * ->bsync protects ->bstat.  These are the only fields which get
>   	 * updated in the hot path.
> @@ -374,20 +393,6 @@ struct cgroup_rstat_cpu {
>   	 * deltas to propagate to the per-cpu subtree_bstat.
>   	 */
>   	struct cgroup_base_stat last_subtree_bstat;
> -
> -	/*
> -	 * Child cgroups with stat updates on this cpu since the last read
> -	 * are linked on the parent's ->updated_children through
> -	 * ->updated_next.
> -	 *
> -	 * In addition to being more compact, singly-linked list pointing
> -	 * to the cgroup makes it unnecessary for each per-cpu struct to
> -	 * point back to the associated cgroup.
> -	 *
> -	 * Protected by per-cpu cgroup_rstat_cpu_lock.
> -	 */
> -	struct cgroup *updated_children;	/* terminated by self cgroup */
> -	struct cgroup *updated_next;		/* NULL iff not on the list */
>   };
>   
>   struct cgroup_freezer_state {
> @@ -518,6 +523,7 @@ struct cgroup {
>   
>   	/* per-cpu recursive resource statistics */
>   	struct cgroup_rstat_cpu __percpu *rstat_cpu;
> +	struct cgroup_rstat_base_cpu __percpu *rstat_base_cpu;
>   	struct list_head rstat_css_list;
>   
>   	/*
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index ac2db99941ca..77349d07b117 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -161,10 +161,14 @@ static struct static_key_true *cgroup_subsys_on_dfl_key[] = {
>   };
>   #undef SUBSYS
>   
> -static DEFINE_PER_CPU(struct cgroup_rstat_cpu, cgrp_dfl_root_rstat_cpu);
> +static DEFINE_PER_CPU(struct cgroup_rstat_cpu, root_rstat_cpu);
> +static DEFINE_PER_CPU(struct cgroup_rstat_base_cpu, root_rstat_base_cpu);
>   
>   /* the default hierarchy */
> -struct cgroup_root cgrp_dfl_root = { .cgrp.rstat_cpu = &cgrp_dfl_root_rstat_cpu };
> +struct cgroup_root cgrp_dfl_root = {
> +	.cgrp.rstat_cpu = &root_rstat_cpu,
> +	.cgrp.rstat_base_cpu = &root_rstat_base_cpu,
> +};
>   EXPORT_SYMBOL_GPL(cgrp_dfl_root);
>   
>   /*
> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> index 4bb587d5d34f..a20e3ab3f7d3 100644
> --- a/kernel/cgroup/rstat.c
> +++ b/kernel/cgroup/rstat.c
> @@ -19,6 +19,12 @@ static struct cgroup_rstat_cpu *cgroup_rstat_cpu(struct cgroup *cgrp, int cpu)
>   	return per_cpu_ptr(cgrp->rstat_cpu, cpu);
>   }
>   
> +static struct cgroup_rstat_base_cpu *cgroup_rstat_base_cpu(
> +		struct cgroup *cgrp, int cpu)
> +{
> +	return per_cpu_ptr(cgrp->rstat_base_cpu, cpu);
> +}
> +
>   /*
>    * Helper functions for rstat per CPU lock (cgroup_rstat_cpu_lock).
>    *
> @@ -351,12 +357,22 @@ int cgroup_rstat_init(struct cgroup *cgrp)
>   			return -ENOMEM;
>   	}
>   
> +	if (!cgrp->rstat_base_cpu) {
> +		cgrp->rstat_base_cpu = alloc_percpu(struct cgroup_rstat_base_cpu);
> +		if (!cgrp->rstat_base_cpu) {
> +			free_percpu(cgrp->rstat_cpu);
> +			return -ENOMEM;
> +		}
> +	}
> +
>   	/* ->updated_children list is self terminated */
>   	for_each_possible_cpu(cpu) {
>   		struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
> +		struct cgroup_rstat_base_cpu *rstatbc =
> +			cgroup_rstat_base_cpu(cgrp, cpu);
>   
>   		rstatc->updated_children = cgrp;
> -		u64_stats_init(&rstatc->bsync);
> +		u64_stats_init(&rstatbc->bsync);
>   	}
>   
>   	return 0;
> @@ -379,6 +395,8 @@ void cgroup_rstat_exit(struct cgroup *cgrp)
>   
>   	free_percpu(cgrp->rstat_cpu);
>   	cgrp->rstat_cpu = NULL;
> +	free_percpu(cgrp->rstat_base_cpu);
> +	cgrp->rstat_base_cpu = NULL;
>   }
>   
>   void __init cgroup_rstat_boot(void)
> @@ -419,9 +437,9 @@ static void cgroup_base_stat_sub(struct cgroup_base_stat *dst_bstat,
>   
>   static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
>   {
> -	struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
> +	struct cgroup_rstat_base_cpu *rstatbc = cgroup_rstat_base_cpu(cgrp, cpu);
>   	struct cgroup *parent = cgroup_parent(cgrp);
> -	struct cgroup_rstat_cpu *prstatc;
> +	struct cgroup_rstat_base_cpu *prstatbc;
>   	struct cgroup_base_stat delta;
>   	unsigned seq;
>   
> @@ -431,15 +449,15 @@ static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
>   
>   	/* fetch the current per-cpu values */
>   	do {
> -		seq = __u64_stats_fetch_begin(&rstatc->bsync);
> -		delta = rstatc->bstat;
> -	} while (__u64_stats_fetch_retry(&rstatc->bsync, seq));
> +		seq = __u64_stats_fetch_begin(&rstatbc->bsync);
> +		delta = rstatbc->bstat;
> +	} while (__u64_stats_fetch_retry(&rstatbc->bsync, seq));
>   
>   	/* propagate per-cpu delta to cgroup and per-cpu global statistics */
> -	cgroup_base_stat_sub(&delta, &rstatc->last_bstat);
> +	cgroup_base_stat_sub(&delta, &rstatbc->last_bstat);
>   	cgroup_base_stat_add(&cgrp->bstat, &delta);
> -	cgroup_base_stat_add(&rstatc->last_bstat, &delta);
> -	cgroup_base_stat_add(&rstatc->subtree_bstat, &delta);
> +	cgroup_base_stat_add(&rstatbc->last_bstat, &delta);
> +	cgroup_base_stat_add(&rstatbc->subtree_bstat, &delta);
>   
>   	/* propagate cgroup and per-cpu global delta to parent (unless that's root) */
>   	if (cgroup_parent(parent)) {
> @@ -448,73 +466,73 @@ static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
>   		cgroup_base_stat_add(&parent->bstat, &delta);
>   		cgroup_base_stat_add(&cgrp->last_bstat, &delta);
>   
> -		delta = rstatc->subtree_bstat;
> -		prstatc = cgroup_rstat_cpu(parent, cpu);
> -		cgroup_base_stat_sub(&delta, &rstatc->last_subtree_bstat);
> -		cgroup_base_stat_add(&prstatc->subtree_bstat, &delta);
> -		cgroup_base_stat_add(&rstatc->last_subtree_bstat, &delta);
> +		delta = rstatbc->subtree_bstat;
> +		prstatbc = cgroup_rstat_base_cpu(parent, cpu);
> +		cgroup_base_stat_sub(&delta, &rstatbc->last_subtree_bstat);
> +		cgroup_base_stat_add(&prstatbc->subtree_bstat, &delta);
> +		cgroup_base_stat_add(&rstatbc->last_subtree_bstat, &delta);
>   	}
>   }
>   
> -static struct cgroup_rstat_cpu *
> +static struct cgroup_rstat_base_cpu *
>   cgroup_base_stat_cputime_account_begin(struct cgroup *cgrp, unsigned long *flags)
>   {
> -	struct cgroup_rstat_cpu *rstatc;
> +	struct cgroup_rstat_base_cpu *rstatbc;
>   
> -	rstatc = get_cpu_ptr(cgrp->rstat_cpu);
> -	*flags = u64_stats_update_begin_irqsave(&rstatc->bsync);
> -	return rstatc;
> +	rstatbc = get_cpu_ptr(cgrp->rstat_base_cpu);
> +	*flags = u64_stats_update_begin_irqsave(&rstatbc->bsync);
> +	return rstatbc;
>   }
>   
>   static void cgroup_base_stat_cputime_account_end(struct cgroup *cgrp,
> -						 struct cgroup_rstat_cpu *rstatc,
> +						 struct cgroup_rstat_base_cpu *rstatbc,
>   						 unsigned long flags)
>   {
> -	u64_stats_update_end_irqrestore(&rstatc->bsync, flags);
> +	u64_stats_update_end_irqrestore(&rstatbc->bsync, flags);
>   	cgroup_rstat_updated(cgrp, smp_processor_id());
> -	put_cpu_ptr(rstatc);
> +	put_cpu_ptr(rstatbc);
>   }
>   
>   void __cgroup_account_cputime(struct cgroup *cgrp, u64 delta_exec)
>   {
> -	struct cgroup_rstat_cpu *rstatc;
> +	struct cgroup_rstat_base_cpu *rstatbc;
>   	unsigned long flags;
>   
> -	rstatc = cgroup_base_stat_cputime_account_begin(cgrp, &flags);
> -	rstatc->bstat.cputime.sum_exec_runtime += delta_exec;
> -	cgroup_base_stat_cputime_account_end(cgrp, rstatc, flags);
> +	rstatbc = cgroup_base_stat_cputime_account_begin(cgrp, &flags);
> +	rstatbc->bstat.cputime.sum_exec_runtime += delta_exec;
> +	cgroup_base_stat_cputime_account_end(cgrp, rstatbc, flags);
>   }
>   
>   void __cgroup_account_cputime_field(struct cgroup *cgrp,
>   				    enum cpu_usage_stat index, u64 delta_exec)
>   {
> -	struct cgroup_rstat_cpu *rstatc;
> +	struct cgroup_rstat_base_cpu *rstatbc;
>   	unsigned long flags;
>   
> -	rstatc = cgroup_base_stat_cputime_account_begin(cgrp, &flags);
> +	rstatbc = cgroup_base_stat_cputime_account_begin(cgrp, &flags);
>   
>   	switch (index) {
>   	case CPUTIME_NICE:
> -		rstatc->bstat.ntime += delta_exec;
> +		rstatbc->bstat.ntime += delta_exec;
>   		fallthrough;
>   	case CPUTIME_USER:
> -		rstatc->bstat.cputime.utime += delta_exec;
> +		rstatbc->bstat.cputime.utime += delta_exec;
>   		break;
>   	case CPUTIME_SYSTEM:
>   	case CPUTIME_IRQ:
>   	case CPUTIME_SOFTIRQ:
> -		rstatc->bstat.cputime.stime += delta_exec;
> +		rstatbc->bstat.cputime.stime += delta_exec;
>   		break;
>   #ifdef CONFIG_SCHED_CORE
>   	case CPUTIME_FORCEIDLE:
> -		rstatc->bstat.forceidle_sum += delta_exec;
> +		rstatbc->bstat.forceidle_sum += delta_exec;
>   		break;
>   #endif
>   	default:
>   		break;
>   	}
>   
> -	cgroup_base_stat_cputime_account_end(cgrp, rstatc, flags);
> +	cgroup_base_stat_cputime_account_end(cgrp, rstatbc, flags);
>   }
>   
>   /*


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v4 1/5] cgroup: move rstat base stat objects into their own struct
  2025-05-29 18:58   ` Ihor Solodrai
@ 2025-05-29 19:11     ` Yonghong Song
  0 siblings, 0 replies; 38+ messages in thread
From: Yonghong Song @ 2025-05-29 19:11 UTC (permalink / raw)
  To: Ihor Solodrai, JP Kobryn, tj, shakeel.butt, yosryahmed, mkoutny,
	hannes, akpm
  Cc: linux-mm, cgroups, kernel-team, bpf



On 5/29/25 11:58 AM, Ihor Solodrai wrote:
> On 4/3/25 6:10 PM, JP Kobryn wrote:
>> This non-functional change serves as preparation for moving to
>> subsystem-based rstat trees. The base stats are not an actual subsystem,
>> but in future commits they will have exclusive rstat trees just as other
>> subsystems will.
>>
>> Moving the base stat objects into a new struct allows the 
>> cgroup_rstat_cpu
>> struct to become more compact since it now only contains the minimum 
>> amount
>> of pointers needed for rstat participation. Subsystems will (in future
>> commits) make use of the compact cgroup_rstat_cpu struct while 
>> avoiding the
>> memory overhead of the base stat objects which they will not use.
>>
>> An instance of the new struct cgroup_rstat_base_cpu was placed on the
>> cgroup struct so it can retain ownership of these base stats common 
>> to all
>> cgroups. A helper function was added for looking up the cpu-specific 
>> base
>> stats of a given cgroup. Finally, initialization and variable names were
>> adjusted where applicable.
>>
>> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
>> ---
>>   include/linux/cgroup-defs.h | 38 ++++++++++-------
>>   kernel/cgroup/cgroup.c      |  8 +++-
>>   kernel/cgroup/rstat.c       | 84 ++++++++++++++++++++++---------------
>>   3 files changed, 79 insertions(+), 51 deletions(-)
>>
>
> Hi everyone.
>
> BPF CI started failing after recent upstream merges (tip: 90b83efa6701).
> I bisected the issue to this patch, see a log snippet below [1]:
>
>     ##[error]#44/9 btf_tag/btf_type_tag_percpu_vmlinux_helper
>     load_btfs:PASS:could not load vmlinux BTF 0 nsec
>     test_btf_type_tag_vmlinux_percpu:PASS:btf_type_tag_percpu 0 nsec
>     libbpf: prog 'test_percpu_helper': BPF program load failed: -EACCES
>     libbpf: prog 'test_percpu_helper': -- BEGIN PROG LOAD LOG --
>     0: R1=ctx() R10=fp0
>     ; int BPF_PROG(test_percpu_helper, struct cgroup *cgrp, const char *path) @ btf_type_tag_percpu.c:58
>     0: (79) r1 = *(u64 *)(r1 +0)
>     func 'cgroup_mkdir' arg0 has btf_id 437 type STRUCT 'cgroup'
>     1: R1_w=trusted_ptr_cgroup()
>     ; cpu = bpf_get_smp_processor_id(); @ btf_type_tag_percpu.c:63
>     1: (7b) *(u64 *)(r10 -8) = r1         ; R1_w=trusted_ptr_cgroup() R10=fp0 fp-8_w=trusted_ptr_cgroup()
>     2: (85) call bpf_get_smp_processor_id#8       ; R0_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=1,var_off=(0x0; 0x1))
>     3: (79) r1 = *(u64 *)(r10 -8)         ; R1_w=trusted_ptr_cgroup() R10=fp0 fp-8_w=trusted_ptr_cgroup()
>     ; cgrp->self.rstat_cpu, cpu); @ btf_type_tag_percpu.c:65
>     4: (79) r1 = *(u64 *)(r1 +32)         ; R1_w=percpu_ptr_css_rstat_cpu()
>     ; rstat = (struct cgroup_rstat_cpu *)bpf_per_cpu_ptr( @ btf_type_tag_percpu.c:64
>     5: (bc) w2 = w0                       ; R0_w=scalar(id=1,smin=smin32=0,smax=umax=smax32=umax32=1,var_off=(0x0; 0x1)) R2_w=scalar(id=1,smin=smin32=0,smax=umax=smax32=umax32=1,var_off=(0x0; 0x1))
>     6: (85) call bpf_per_cpu_ptr#153      ; R0=ptr_or_null_css_rstat_cpu(id=2)
>     ; if (rstat) { @ btf_type_tag_percpu.c:66
>     7: (15) if r0 == 0x0 goto pc+1        ; R0=ptr_css_rstat_cpu()
>     ; *(volatile int *)rstat; @ btf_type_tag_percpu.c:68
>     8: (61) r1 = *(u32 *)(r0 +0)
>     cannot access ptr member updated_children with moff 0 in struct css_rstat_cpu with off 0 size 4
>     processed 9 insns (limit 1000000) max_states_per_insn 0 total_states 1 peak_states 1 mark_read 1
>     -- END PROG LOAD LOG --
>     libbpf: prog 'test_percpu_helper': failed to load: -EACCES
>     libbpf: failed to load object 'btf_type_tag_percpu'
>     libbpf: failed to load BPF skeleton 'btf_type_tag_percpu': -EACCES
>     test_btf_type_tag_vmlinux_percpu:FAIL:btf_type_tag_percpu_helper unexpected error: -13 (errno 13)
>
> The test in question [2]:
>
> SEC("tp_btf/cgroup_mkdir")
> int BPF_PROG(test_percpu_helper, struct cgroup *cgrp, const char *path)
> {
>     struct cgroup_rstat_cpu *rstat;
>     __u32 cpu;
>
>     cpu = bpf_get_smp_processor_id();
>     rstat = (struct cgroup_rstat_cpu *)bpf_per_cpu_ptr(cgrp->rstat_cpu, cpu);
>     if (rstat) {
>         /* READ_ONCE */
>         *(volatile int *)rstat; // BPF verification fails here
>     }
>
>     return 0;
> }
>
> Any ideas about how to properly fix this?

The struct cgroup_rstat_cpu was renamed to css_rstat_cpu, so most likely
the test needs to be updated accordingly. I will take a look.
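
Just to sketch the direction (a rough sketch only, not the actual fix,
and not run through the verifier): assuming the rename to css_rstat_cpu
and the per-cpu pointer now hanging off cgroup::self, as the verifier
log above suggests, the test would presumably look something like the
below. The old 4-byte volatile load would also need rework, since
updated_children is now a pointer member and the verifier rejects the
partial read of it:

SEC("tp_btf/cgroup_mkdir")
int BPF_PROG(test_percpu_helper, struct cgroup *cgrp, const char *path)
{
	struct css_rstat_cpu *rstat;
	__u32 cpu;

	cpu = bpf_get_smp_processor_id();
	/* rstat_cpu moved from struct cgroup onto cgroup_subsys_state */
	rstat = (struct css_rstat_cpu *)bpf_per_cpu_ptr(cgrp->self.rstat_cpu, cpu);
	if (rstat) {
		/* full-width read of the pointer member instead of the
		 * old *(volatile int *) access
		 */
		bpf_printk("updated_children=%p", rstat->updated_children);
	}

	return 0;
}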

>
> Thanks.
>
> [1] https://github.com/kernel-patches/bpf/actions/runs/15316839796/job/43125242673
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/btf_type_tag_percpu.c#n68

[...]


^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2025-05-29 19:11 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-04-04  1:10 [PATCH v4 0/5] cgroup: separate rstat trees JP Kobryn
2025-04-04  1:10 ` [PATCH v4 1/5] cgroup: move rstat base stat objects into their own struct JP Kobryn
2025-04-15 17:16   ` Michal Koutný
2025-04-22 12:13   ` Yosry Ahmed
2025-05-29 18:58   ` Ihor Solodrai
2025-05-29 19:11     ` Yonghong Song
2025-04-04  1:10 ` [PATCH v4 2/5] cgroup: add helper for checking when css is cgroup::self JP Kobryn
2025-04-22 12:19   ` Yosry Ahmed
     [not found]   ` <68078968.5d0a0220.2c3c35.bab3SMTPIN_ADDED_BROKEN@mx.google.com>
2025-04-24 16:59     ` JP Kobryn
2025-04-04  1:10 ` [PATCH v4 3/5] cgroup: change rstat function signatures from cgroup-based to css-based JP Kobryn
2025-04-04 20:00   ` Tejun Heo
2025-04-04 20:09     ` Tejun Heo
2025-04-04 21:21       ` JP Kobryn
2025-04-22 12:35   ` Yosry Ahmed
2025-04-22 12:39   ` Yosry Ahmed
     [not found]   ` <68078d3c.050a0220.3d37e.6d82SMTPIN_ADDED_BROKEN@mx.google.com>
2025-04-24 17:10     ` JP Kobryn
2025-04-04  1:10 ` [PATCH v4 4/5] cgroup: use separate rstat trees for each subsystem JP Kobryn
2025-04-15 17:15   ` Michal Koutný
2025-04-16 21:43     ` JP Kobryn
2025-04-17  9:26       ` Michal Koutný
2025-04-17 19:05         ` JP Kobryn
2025-04-17 20:10           ` JP Kobryn
2025-04-21 18:18   ` JP Kobryn
2025-04-22 13:33   ` Yosry Ahmed
     [not found]   ` <68079aa7.df0a0220.30a1a0.cbb2SMTPIN_ADDED_BROKEN@mx.google.com>
2025-04-30 23:43     ` JP Kobryn
2025-05-06  9:37       ` Yosry Ahmed
2025-04-04  1:10 ` [PATCH v4 5/5] cgroup: use subsystem-specific rstat locks to avoid contention JP Kobryn
2025-04-04 20:28   ` Tejun Heo
2025-04-11  3:31     ` JP Kobryn
2025-04-15 17:15   ` Michal Koutný
2025-04-15 19:30     ` Tejun Heo
2025-04-16  9:50       ` Michal Koutný
2025-04-16 18:10         ` JP Kobryn
2025-04-16 18:14           ` Tejun Heo
2025-04-16 18:01     ` JP Kobryn
2025-04-22 14:01   ` Yosry Ahmed
2025-04-24 17:25     ` Shakeel Butt
     [not found]   ` <6807a132.df0a0220.28dc80.a1f0SMTPIN_ADDED_BROKEN@mx.google.com>
2025-04-25  0:18     ` JP Kobryn

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).