* [PATCH 00/11] cgroup: separate rstat trees
@ 2025-02-18 3:14 JP Kobryn
2025-02-18 3:14 ` [PATCH 01/11] cgroup: move rstat pointers into struct of their own JP Kobryn
` (12 more replies)
0 siblings, 13 replies; 42+ messages in thread
From: JP Kobryn @ 2025-02-18 3:14 UTC (permalink / raw)
To: shakeel.butt, tj, mhocko, hannes, yosryahmed, akpm
Cc: linux-mm, cgroups, kernel-team
The current design of rstat takes the approach that if one subsystem is to
be flushed, all other subsystem controllers with pending updates should
also be flushed. Over time, the stat-keeping of some subsystems has grown
to the point where it noticeably slows down others. This has been most
observable in situations where the memory controller is enabled. One big
area where the issue comes up is system telemetry, where programs
periodically sample cpu stats. Programs like these would benefit if the
overhead of flushing memory stats (and others) could be eliminated: it
would save cpu cycles for existing cpu-based telemetry programs and improve
scalability in terms of increasing sampling frequency and the number of
hosts the programs run on, since the cycles saved free up budget for
collecting the stats that are actually wanted.
This series changes the approach from "flush all subsystems" to "flush only
the requested subsystem". The core design change is moving from a single
unified rstat tree to separate trees: one for each enabled subsystem
controller (if it implements css_rstat_flush()), one for the base stats
(cgroup::self) subsystem state, and one dedicated to bpf-based cgroups (if
enabled).

A layer of indirection was introduced to rstat. Where a cgroup reference
was previously used as a parameter, the updated/flush families of functions
now accept a reference to a new cgroup_rstat struct, along with a new
interface containing ops that perform type-specific pointer offsets for
accessing common types. The ops allow the rstat routines to work only with
common types while hiding away any unique types. Together, the new struct
and interface allow extending the kind of entity that can participate in
rstat. In this series, cgroup_subsys_state and cgroup_bpf are the
participants. For both of these structs, the cgroup_rstat struct was added
as a new field and ops were statically defined for each type in order to
provide access to related objects. To illustrate, cgroup_subsys_state was
given an rstat field and one of its ops returns a pointer to the rstat
struct of its parent.

Public APIs were changed. So that clients can be specific about which stats
are being updated or flushed, a reference to the given cgroup_subsys_state
is passed instead of the cgroup. For the bpf APIs, a cgroup is still passed
as an argument since there is no subsystem state associated with these
custom cgroups; however, the names of those API calls now carry a "bpf_"
prefix. Since separate trees are in use, the locking scheme was adjusted to
prevent any contention between them. Separate locks exist for the three
categories: base stats (cgroup::self), formal subsystem controllers
(memory, io, etc.), and bpf-based cgroups. Where applicable, the functions
for lock management were adjusted to accept parameters instead of using
globals.
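
To make the indirection concrete, below is a minimal sketch of the idea.
The names here (cgroup_rstat_ops, parent_fn, cgroup_fn, rstat_css_parent)
are illustrative only; the ops interface introduced later in the series may
differ in fields and naming.

	/* sketch only: per-type ops hide the owning struct from rstat */
	struct cgroup_rstat;

	struct cgroup_rstat_ops {
		/* return the parent rstat node, or NULL at the root */
		struct cgroup_rstat *(*parent_fn)(struct cgroup_rstat *rstat);
		/* return the cgroup associated with this rstat node */
		struct cgroup *(*cgroup_fn)(struct cgroup_rstat *rstat);
	};

	/* example op, statically defined for cgroup_subsys_state */
	static struct cgroup_rstat *rstat_css_parent(struct cgroup_rstat *rstat)
	{
		struct cgroup_subsys_state *css =
			container_of(rstat, struct cgroup_subsys_state, rstat);

		return css->parent ? &css->parent->rstat : NULL;
	}

With this in place, the core update/flush routines only ever see a struct
cgroup_rstat and reach the owning object (css, cgroup or cgroup_bpf)
through the ops.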
Breaking up the unified tree into separate trees eliminates the overhead
and scalability issue explained in the first section, but comes at the
expense of additional memory. In an effort to minimize this overhead, a
conditional allocation is performed. The cgroup_rstat_cpu struct originally
contained both the rstat list pointers and the base stat entities; it was
reduced to contain only the list pointers. For the single case where base
stats participate in rstat, a new struct cgroup_rstat_base_cpu was created
that contains the list pointers plus the base stat entities. This larger
allocation is done only when the cgroup::self subsys_state is initialized.
Since the list pointers sit at the beginning of both cgroup_rstat_cpu and
cgroup_rstat_base_cpu, a union is used to access one type of per-cpu
pointer or the other, depending on whether cgroup::self is detected; it is
the one subsystem state whose subsystem pointer is NULL. (A sketch of this
layout follows the overhead figures below.) With this change, the total
memory overhead on a per-cpu basis is:
	nr_cgroups * (
		sizeof(struct cgroup_rstat_base_cpu) +
		sizeof(struct cgroup_rstat_cpu) * nr_controllers
	)

or, if bpf-based cgroups are enabled:

	nr_cgroups * (
		sizeof(struct cgroup_rstat_base_cpu) +
		sizeof(struct cgroup_rstat_cpu) * (nr_controllers + 1)
	)

... where nr_controllers is the number of enabled cgroup controllers
that implement css_rstat_flush().
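
For reference, here is a sketch of the conditional per-cpu layout described
above. The exact definitions appear in the later patches and may differ
slightly; this only illustrates the shared-prefix/union idea.

	/* list pointers only: allocated per enabled controller */
	struct cgroup_rstat_cpu {
		struct cgroup_rstat *updated_children;	/* terminated by self */
		struct cgroup_rstat *updated_next;	/* NULL if not on the list */
	};

	/* list pointers plus base stat entities: cgroup::self only */
	struct cgroup_rstat_base_cpu {
		struct cgroup_rstat *updated_children;
		struct cgroup_rstat *updated_next;

		struct u64_stats_sync bsync;
		struct cgroup_base_stat bstat;
		struct cgroup_base_stat last_bstat;
		struct cgroup_base_stat subtree_bstat;
		struct cgroup_base_stat last_subtree_bstat;
	};

	struct cgroup_rstat {
		union {
			struct cgroup_rstat_cpu __percpu *rstat_cpu;
			struct cgroup_rstat_base_cpu __percpu *rstat_base_cpu;
		};
		/* ... */
	};

Because the list pointers sit at offset zero in both per-cpu structs, the
tree-walking code can always go through rstat_cpu, while the base stat code
uses rstat_base_cpu only for cgroup::self (the one css whose ->ss pointer
is NULL).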
With regard to validation, there is a measurable benefit when reading a
specific set of stats. Using the cpu stats as the flush target, some
experiments were set up to measure the perf and time differences.

The first experiment consisted of a parent cgroup with memory.swap.max=0
and memory.max=1G. On a 52-cpu machine, 26 child cgroups were created, and
within each child cgroup a process was spawned to encourage updates to the
memory cgroup stats by creating and then reading a file of size 1T
(encouraging reclaim). These 26 tasks were run in parallel. While this was
going on, a custom program was used to open the cpu.stat file of the parent
cgroup, read the entire file 1M times, then close it (a rough sketch of
this reader follows the timings below). On the control kernel, the perf
report for the reading task showed that most of the cycles (42%) were spent
in mem_cgroup_css_rstat_flush(). It also showed a smaller but significant
number of cycles spent in __blkcg_rstat_flush(). The perf report for the
patched kernel differed in that no cycles were spent in these functions;
instead, most cycles were spent in cgroup_base_stat_flush(). Aside from the
perf reports, the time spent running the program that reads cpu.stat showed
a gain when comparing the control to the experimental kernel: the time in
kernel mode was reduced.
before:
real 0m18.449s
user 0m0.209s
sys 0m18.165s
after:
real 0m6.080s
user 0m0.170s
sys 0m5.890s
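
For context, the reader used above was roughly of the following shape. The
actual test program is not part of this series and the cgroup path below is
just an example.

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		char buf[4096];
		ssize_t n;
		int fd = open("/sys/fs/cgroup/parent/cpu.stat", O_RDONLY);

		if (fd < 0) {
			perror("open");
			return 1;
		}

		for (long i = 0; i < 1000000; i++) {	/* 1M full reads */
			while ((n = read(fd, buf, sizeof(buf))) > 0)
				;
			if (n < 0) {
				perror("read");
				break;
			}
			lseek(fd, 0, SEEK_SET);	/* rewind for the next pass */
		}

		close(fd);
		return 0;
	}

The before/after figures are the real/user/sys times of this reader while
the child-cgroup workloads were running.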
Another experiment on the same host was set up using a parent cgroup with
two child cgroups. The same swap and memory limits were used as in the
previous experiment. In the two child cgroups, kernel builds were done in
parallel, each using "-j 20". The program from the previous experiment was
used to perform 1M reads of the parent cpu.stat file. The perf comparison
showed results similar to the previous experiment: on the control side, a
majority of cycles (42%) were spent in mem_cgroup_css_rstat_flush() and
significant cycles in __blkcg_rstat_flush(); on the experimental side, most
cycles were spent in cgroup_base_stat_flush() and no cycles were spent
flushing memory or io. The time taken by the program reading cpu.stat is
shown below.
before:
real 0m17.223s
user 0m0.259s
sys 0m16.871s
after:
real 0m6.498s
user 0m0.237s
sys 0m6.220s
For the final experiment, perf events were recorded during a kernel build
with the same host and cgroup setup. The builds took place in the child
node. The control and experimental kernels showed similar cycle counts for
cgroup_rstat_updated(), which appeared insignificant compared to the other
events recorded with the workload.
JP Kobryn (11):
cgroup: move rstat pointers into struct of their own
cgroup: add level of indirection for cgroup_rstat struct
cgroup: move cgroup_rstat from cgroup to cgroup_subsys_state
cgroup: introduce cgroup_rstat_ops
cgroup: separate rstat for bpf cgroups
cgroup: rstat lock indirection
cgroup: fetch cpu-specific lock in rstat cpu lock helpers
cgroup: rstat cpu lock indirection
cgroup: separate rstat locks for bpf cgroups
cgroup: separate rstat locks for subsystems
cgroup: separate rstat list pointers from base stats
block/blk-cgroup.c | 4 +-
include/linux/bpf-cgroup-defs.h | 3 +
include/linux/cgroup-defs.h | 98 +--
include/linux/cgroup.h | 11 +-
include/linux/cgroup_rstat.h | 97 +++
kernel/bpf/cgroup.c | 6 +
kernel/cgroup/cgroup-internal.h | 9 +-
kernel/cgroup/cgroup.c | 65 +-
kernel/cgroup/rstat.c | 556 +++++++++++++-----
mm/memcontrol.c | 4 +-
.../selftests/bpf/progs/btf_type_tag_percpu.c | 5 +-
.../bpf/progs/cgroup_hierarchical_stats.c | 6 +-
12 files changed, 594 insertions(+), 270 deletions(-)
create mode 100644 include/linux/cgroup_rstat.h
--
2.43.5
* [PATCH 01/11] cgroup: move rstat pointers into struct of their own
2025-02-18 3:14 [PATCH 00/11] cgroup: separate rstat trees JP Kobryn
@ 2025-02-18 3:14 ` JP Kobryn
2025-02-19 1:05 ` Shakeel Butt
2025-02-20 16:53 ` Yosry Ahmed
2025-02-18 3:14 ` [PATCH 02/11] cgroup: add level of indirection for cgroup_rstat struct JP Kobryn
` (11 subsequent siblings)
12 siblings, 2 replies; 42+ messages in thread
From: JP Kobryn @ 2025-02-18 3:14 UTC (permalink / raw)
To: shakeel.butt, tj, mhocko, hannes, yosryahmed, akpm
Cc: linux-mm, cgroups, kernel-team
The rstat infrastructure makes use of pointers for list management.
These pointers only exist as fields in the cgroup struct, so moving them
into their own struct will allow them to be used elsewhere. The base
stat entities are included with them for now.
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
include/linux/cgroup-defs.h | 90 +-----------------
include/linux/cgroup_rstat.h | 92 +++++++++++++++++++
kernel/cgroup/cgroup.c | 3 +-
kernel/cgroup/rstat.c | 27 +++---
.../selftests/bpf/progs/btf_type_tag_percpu.c | 4 +-
5 files changed, 112 insertions(+), 104 deletions(-)
create mode 100644 include/linux/cgroup_rstat.h
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 1b20d2d8ef7c..6b6cc027fe70 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -17,7 +17,7 @@
#include <linux/refcount.h>
#include <linux/percpu-refcount.h>
#include <linux/percpu-rwsem.h>
-#include <linux/u64_stats_sync.h>
+#include <linux/cgroup_rstat.h>
#include <linux/workqueue.h>
#include <linux/bpf-cgroup-defs.h>
#include <linux/psi_types.h>
@@ -321,78 +321,6 @@ struct css_set {
struct rcu_head rcu_head;
};
-struct cgroup_base_stat {
- struct task_cputime cputime;
-
-#ifdef CONFIG_SCHED_CORE
- u64 forceidle_sum;
-#endif
- u64 ntime;
-};
-
-/*
- * rstat - cgroup scalable recursive statistics. Accounting is done
- * per-cpu in cgroup_rstat_cpu which is then lazily propagated up the
- * hierarchy on reads.
- *
- * When a stat gets updated, the cgroup_rstat_cpu and its ancestors are
- * linked into the updated tree. On the following read, propagation only
- * considers and consumes the updated tree. This makes reading O(the
- * number of descendants which have been active since last read) instead of
- * O(the total number of descendants).
- *
- * This is important because there can be a lot of (draining) cgroups which
- * aren't active and stat may be read frequently. The combination can
- * become very expensive. By propagating selectively, increasing reading
- * frequency decreases the cost of each read.
- *
- * This struct hosts both the fields which implement the above -
- * updated_children and updated_next - and the fields which track basic
- * resource statistics on top of it - bsync, bstat and last_bstat.
- */
-struct cgroup_rstat_cpu {
- /*
- * ->bsync protects ->bstat. These are the only fields which get
- * updated in the hot path.
- */
- struct u64_stats_sync bsync;
- struct cgroup_base_stat bstat;
-
- /*
- * Snapshots at the last reading. These are used to calculate the
- * deltas to propagate to the global counters.
- */
- struct cgroup_base_stat last_bstat;
-
- /*
- * This field is used to record the cumulative per-cpu time of
- * the cgroup and its descendants. Currently it can be read via
- * eBPF/drgn etc, and we are still trying to determine how to
- * expose it in the cgroupfs interface.
- */
- struct cgroup_base_stat subtree_bstat;
-
- /*
- * Snapshots at the last reading. These are used to calculate the
- * deltas to propagate to the per-cpu subtree_bstat.
- */
- struct cgroup_base_stat last_subtree_bstat;
-
- /*
- * Child cgroups with stat updates on this cpu since the last read
- * are linked on the parent's ->updated_children through
- * ->updated_next.
- *
- * In addition to being more compact, singly-linked list pointing
- * to the cgroup makes it unnecessary for each per-cpu struct to
- * point back to the associated cgroup.
- *
- * Protected by per-cpu cgroup_rstat_cpu_lock.
- */
- struct cgroup *updated_children; /* terminated by self cgroup */
- struct cgroup *updated_next; /* NULL iff not on the list */
-};
-
struct cgroup_freezer_state {
/* Should the cgroup and its descendants be frozen. */
bool freeze;
@@ -517,23 +445,9 @@ struct cgroup {
struct cgroup *old_dom_cgrp; /* used while enabling threaded */
/* per-cpu recursive resource statistics */
- struct cgroup_rstat_cpu __percpu *rstat_cpu;
+ struct cgroup_rstat rstat;
struct list_head rstat_css_list;
- /*
- * Add padding to separate the read mostly rstat_cpu and
- * rstat_css_list into a different cacheline from the following
- * rstat_flush_next and *bstat fields which can have frequent updates.
- */
- CACHELINE_PADDING(_pad_);
-
- /*
- * A singly-linked list of cgroup structures to be rstat flushed.
- * This is a scratch field to be used exclusively by
- * cgroup_rstat_flush_locked() and protected by cgroup_rstat_lock.
- */
- struct cgroup *rstat_flush_next;
-
/* cgroup basic resource statistics */
struct cgroup_base_stat last_bstat;
struct cgroup_base_stat bstat;
diff --git a/include/linux/cgroup_rstat.h b/include/linux/cgroup_rstat.h
new file mode 100644
index 000000000000..f95474d6f8ab
--- /dev/null
+++ b/include/linux/cgroup_rstat.h
@@ -0,0 +1,92 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_RSTAT_H
+#define _LINUX_RSTAT_H
+
+#include <linux/u64_stats_sync.h>
+
+struct cgroup_rstat_cpu;
+
+/*
+ * rstat - cgroup scalable recursive statistics. Accounting is done
+ * per-cpu in cgroup_rstat_cpu which is then lazily propagated up the
+ * hierarchy on reads.
+ *
+ * When a stat gets updated, the cgroup_rstat_cpu and its ancestors are
+ * linked into the updated tree. On the following read, propagation only
+ * considers and consumes the updated tree. This makes reading O(the
+ * number of descendants which have been active since last read) instead of
+ * O(the total number of descendants).
+ *
+ * This is important because there can be a lot of (draining) cgroups which
+ * aren't active and stat may be read frequently. The combination can
+ * become very expensive. By propagating selectively, increasing reading
+ * frequency decreases the cost of each read.
+ *
+ * This struct hosts both the fields which implement the above -
+ * updated_children and updated_next - and the fields which track basic
+ * resource statistics on top of it - bsync, bstat and last_bstat.
+ */
+struct cgroup_rstat {
+ struct cgroup_rstat_cpu __percpu *rstat_cpu;
+
+ /*
+ * Add padding to separate the read mostly rstat_cpu and
+ * rstat_css_list into a different cacheline from the following
+ * rstat_flush_next and containing struct fields which can have
+ * frequent updates.
+ */
+ CACHELINE_PADDING(_pad_);
+ struct cgroup *rstat_flush_next;
+};
+
+struct cgroup_base_stat {
+ struct task_cputime cputime;
+
+#ifdef CONFIG_SCHED_CORE
+ u64 forceidle_sum;
+#endif
+ u64 ntime;
+};
+
+struct cgroup_rstat_cpu {
+ /*
+ * Child cgroups with stat updates on this cpu since the last read
+ * are linked on the parent's ->updated_children through
+ * ->updated_next.
+ *
+ * In addition to being more compact, singly-linked list pointing
+ * to the cgroup makes it unnecessary for each per-cpu struct to
+ * point back to the associated cgroup.
+ */
+ struct cgroup *updated_children; /* terminated by self */
+ struct cgroup *updated_next; /* NULL if not on the list */
+
+ /*
+ * ->bsync protects ->bstat. These are the only fields which get
+ * updated in the hot path.
+ */
+ struct u64_stats_sync bsync;
+ struct cgroup_base_stat bstat;
+
+ /*
+ * Snapshots at the last reading. These are used to calculate the
+ * deltas to propagate to the global counters.
+ */
+ struct cgroup_base_stat last_bstat;
+
+ /*
+ * This field is used to record the cumulative per-cpu time of
+ * the cgroup and its descendants. Currently it can be read via
+ * eBPF/drgn etc, and we are still trying to determine how to
+ * expose it in the cgroupfs interface.
+ */
+ struct cgroup_base_stat subtree_bstat;
+
+ /*
+ * Snapshots at the last reading. These are used to calculate the
+ * deltas to propagate to the per-cpu subtree_bstat.
+ */
+ struct cgroup_base_stat last_subtree_bstat;
+};
+
+#endif /* _LINUX_RSTAT_H */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index d9061bd55436..03a3a4da49f1 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -31,6 +31,7 @@
#include "cgroup-internal.h"
#include <linux/bpf-cgroup.h>
+#include <linux/cgroup_rstat.h>
#include <linux/cred.h>
#include <linux/errno.h>
#include <linux/init_task.h>
@@ -164,7 +165,7 @@ static struct static_key_true *cgroup_subsys_on_dfl_key[] = {
static DEFINE_PER_CPU(struct cgroup_rstat_cpu, cgrp_dfl_root_rstat_cpu);
/* the default hierarchy */
-struct cgroup_root cgrp_dfl_root = { .cgrp.rstat_cpu = &cgrp_dfl_root_rstat_cpu };
+struct cgroup_root cgrp_dfl_root = { .cgrp.rstat.rstat_cpu = &cgrp_dfl_root_rstat_cpu };
EXPORT_SYMBOL_GPL(cgrp_dfl_root);
/*
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 5877974ece92..7e7879d88c38 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -16,7 +16,7 @@ static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
static struct cgroup_rstat_cpu *cgroup_rstat_cpu(struct cgroup *cgrp, int cpu)
{
- return per_cpu_ptr(cgrp->rstat_cpu, cpu);
+ return per_cpu_ptr(cgrp->rstat.rstat_cpu, cpu);
}
/*
@@ -149,24 +149,24 @@ static struct cgroup *cgroup_rstat_push_children(struct cgroup *head,
struct cgroup *parent, *grandchild;
struct cgroup_rstat_cpu *crstatc;
- child->rstat_flush_next = NULL;
+ child->rstat.rstat_flush_next = NULL;
next_level:
while (chead) {
child = chead;
- chead = child->rstat_flush_next;
+ chead = child->rstat.rstat_flush_next;
parent = cgroup_parent(child);
/* updated_next is parent cgroup terminated */
while (child != parent) {
- child->rstat_flush_next = head;
+ child->rstat.rstat_flush_next = head;
head = child;
crstatc = cgroup_rstat_cpu(child, cpu);
grandchild = crstatc->updated_children;
if (grandchild != child) {
/* Push the grand child to the next level */
crstatc->updated_children = child;
- grandchild->rstat_flush_next = ghead;
+ grandchild->rstat.rstat_flush_next = ghead;
ghead = grandchild;
}
child = crstatc->updated_next;
@@ -238,7 +238,7 @@ static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
/* Push @root to the list first before pushing the children */
head = root;
- root->rstat_flush_next = NULL;
+ root->rstat.rstat_flush_next = NULL;
child = rstatc->updated_children;
rstatc->updated_children = root;
if (child != root)
@@ -310,7 +310,7 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
for_each_possible_cpu(cpu) {
struct cgroup *pos = cgroup_rstat_updated_list(cgrp, cpu);
- for (; pos; pos = pos->rstat_flush_next) {
+ for (; pos; pos = pos->rstat.rstat_flush_next) {
struct cgroup_subsys_state *css;
cgroup_base_stat_flush(pos, cpu);
@@ -387,9 +387,10 @@ int cgroup_rstat_init(struct cgroup *cgrp)
int cpu;
/* the root cgrp has rstat_cpu preallocated */
- if (!cgrp->rstat_cpu) {
- cgrp->rstat_cpu = alloc_percpu(struct cgroup_rstat_cpu);
- if (!cgrp->rstat_cpu)
+ if (!cgrp->rstat.rstat_cpu) {
+ cgrp->rstat.rstat_cpu = alloc_percpu(
+ struct cgroup_rstat_cpu);
+ if (!cgrp->rstat.rstat_cpu)
return -ENOMEM;
}
@@ -419,8 +420,8 @@ void cgroup_rstat_exit(struct cgroup *cgrp)
return;
}
- free_percpu(cgrp->rstat_cpu);
- cgrp->rstat_cpu = NULL;
+ free_percpu(cgrp->rstat.rstat_cpu);
+ cgrp->rstat.rstat_cpu = NULL;
}
void __init cgroup_rstat_boot(void)
@@ -503,7 +504,7 @@ cgroup_base_stat_cputime_account_begin(struct cgroup *cgrp, unsigned long *flags
{
struct cgroup_rstat_cpu *rstatc;
- rstatc = get_cpu_ptr(cgrp->rstat_cpu);
+ rstatc = get_cpu_ptr(cgrp->rstat.rstat_cpu);
*flags = u64_stats_update_begin_irqsave(&rstatc->bsync);
return rstatc;
}
diff --git a/tools/testing/selftests/bpf/progs/btf_type_tag_percpu.c b/tools/testing/selftests/bpf/progs/btf_type_tag_percpu.c
index 38f78d9345de..035412265c3c 100644
--- a/tools/testing/selftests/bpf/progs/btf_type_tag_percpu.c
+++ b/tools/testing/selftests/bpf/progs/btf_type_tag_percpu.c
@@ -45,7 +45,7 @@ int BPF_PROG(test_percpu2, struct bpf_testmod_btf_type_tag_2 *arg)
SEC("tp_btf/cgroup_mkdir")
int BPF_PROG(test_percpu_load, struct cgroup *cgrp, const char *path)
{
- g = (__u64)cgrp->rstat_cpu->updated_children;
+ g = (__u64)cgrp->rstat.rstat_cpu->updated_children;
return 0;
}
@@ -56,7 +56,7 @@ int BPF_PROG(test_percpu_helper, struct cgroup *cgrp, const char *path)
__u32 cpu;
cpu = bpf_get_smp_processor_id();
- rstat = (struct cgroup_rstat_cpu *)bpf_per_cpu_ptr(cgrp->rstat_cpu, cpu);
+ rstat = (struct cgroup_rstat_cpu *)bpf_per_cpu_ptr(cgrp->rstat.rstat_cpu, cpu);
if (rstat) {
/* READ_ONCE */
*(volatile int *)rstat;
--
2.48.1
* [PATCH 02/11] cgroup: add level of indirection for cgroup_rstat struct
2025-02-18 3:14 [PATCH 00/11] cgroup: separate rstat trees JP Kobryn
2025-02-18 3:14 ` [PATCH 01/11] cgroup: move rstat pointers into struct of their own JP Kobryn
@ 2025-02-18 3:14 ` JP Kobryn
2025-02-19 2:26 ` Shakeel Butt
2025-02-19 5:57 ` kernel test robot
2025-02-18 3:14 ` [PATCH 03/11] cgroup: move cgroup_rstat from cgroup to cgroup_subsys_state JP Kobryn
` (10 subsequent siblings)
12 siblings, 2 replies; 42+ messages in thread
From: JP Kobryn @ 2025-02-18 3:14 UTC (permalink / raw)
To: shakeel.butt, tj, mhocko, hannes, yosryahmed, akpm
Cc: linux-mm, cgroups, kernel-team
Change the type of the rstat node from cgroup to the new cgroup_rstat
struct. Then, for the rstat updated/flush API calls, add double-underscore
versions that accept references to the cgroup_rstat struct. This new level
of indirection will allow the public API to be extended further, i.e. the
cgroup_rstat struct can be embedded in a new type of object and a public
API can be added for that new type.
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
include/linux/cgroup_rstat.h | 6 +-
kernel/cgroup/cgroup-internal.h | 4 +-
kernel/cgroup/cgroup.c | 12 +-
kernel/cgroup/rstat.c | 204 ++++++++++++++++++++------------
4 files changed, 141 insertions(+), 85 deletions(-)
diff --git a/include/linux/cgroup_rstat.h b/include/linux/cgroup_rstat.h
index f95474d6f8ab..780b826ea364 100644
--- a/include/linux/cgroup_rstat.h
+++ b/include/linux/cgroup_rstat.h
@@ -36,7 +36,7 @@ struct cgroup_rstat {
* frequent updates.
*/
CACHELINE_PADDING(_pad_);
- struct cgroup *rstat_flush_next;
+ struct cgroup_rstat *rstat_flush_next;
};
struct cgroup_base_stat {
@@ -58,8 +58,8 @@ struct cgroup_rstat_cpu {
* to the cgroup makes it unnecessary for each per-cpu struct to
* point back to the associated cgroup.
*/
- struct cgroup *updated_children; /* terminated by self */
- struct cgroup *updated_next; /* NULL if not on the list */
+ struct cgroup_rstat *updated_children; /* terminated by self */
+ struct cgroup_rstat *updated_next; /* NULL if not on the list */
/*
* ->bsync protects ->bstat. These are the only fields which get
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index c964dd7ff967..03139018da43 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -269,8 +269,8 @@ int cgroup_task_count(const struct cgroup *cgrp);
/*
* rstat.c
*/
-int cgroup_rstat_init(struct cgroup *cgrp);
-void cgroup_rstat_exit(struct cgroup *cgrp);
+int cgroup_rstat_init(struct cgroup_rstat *rstat);
+void cgroup_rstat_exit(struct cgroup_rstat *rstat);
void cgroup_rstat_boot(void);
void cgroup_base_stat_cputime_show(struct seq_file *seq);
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 03a3a4da49f1..02d6c11ccccb 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1359,7 +1359,7 @@ static void cgroup_destroy_root(struct cgroup_root *root)
cgroup_unlock();
- cgroup_rstat_exit(cgrp);
+ cgroup_rstat_exit(&cgrp->rstat);
kernfs_destroy_root(root->kf_root);
cgroup_free_root(root);
}
@@ -2133,7 +2133,7 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask)
if (ret)
goto destroy_root;
- ret = cgroup_rstat_init(root_cgrp);
+ ret = cgroup_rstat_init(&root_cgrp->rstat);
if (ret)
goto destroy_root;
@@ -2175,7 +2175,7 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask)
goto out;
exit_stats:
- cgroup_rstat_exit(root_cgrp);
+ cgroup_rstat_exit(&root_cgrp->rstat);
destroy_root:
kernfs_destroy_root(root->kf_root);
root->kf_root = NULL;
@@ -5436,7 +5436,7 @@ static void css_free_rwork_fn(struct work_struct *work)
cgroup_put(cgroup_parent(cgrp));
kernfs_put(cgrp->kn);
psi_cgroup_free(cgrp);
- cgroup_rstat_exit(cgrp);
+ cgroup_rstat_exit(&cgrp->rstat);
kfree(cgrp);
} else {
/*
@@ -5687,7 +5687,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
if (ret)
goto out_free_cgrp;
- ret = cgroup_rstat_init(cgrp);
+ ret = cgroup_rstat_init(&cgrp->rstat);
if (ret)
goto out_cancel_ref;
@@ -5780,7 +5780,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
out_kernfs_remove:
kernfs_remove(cgrp->kn);
out_stat_exit:
- cgroup_rstat_exit(cgrp);
+ cgroup_rstat_exit(&cgrp->rstat);
out_cancel_ref:
percpu_ref_exit(&cgrp->self.refcnt);
out_free_cgrp:
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 7e7879d88c38..13090dda56aa 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -14,9 +14,20 @@ static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
-static struct cgroup_rstat_cpu *cgroup_rstat_cpu(struct cgroup *cgrp, int cpu)
+static struct cgroup_rstat_cpu *rstat_cpu(struct cgroup_rstat *rstat, int cpu)
{
- return per_cpu_ptr(cgrp->rstat.rstat_cpu, cpu);
+ return per_cpu_ptr(rstat->rstat_cpu, cpu);
+}
+
+static struct cgroup_rstat *rstat_parent(struct cgroup_rstat *rstat)
+{
+ struct cgroup *cgrp = container_of(rstat, typeof(*cgrp), rstat);
+ struct cgroup *parent = cgroup_parent(cgrp);
+
+ if (!parent)
+ return NULL;
+
+ return &parent->rstat;
}
/*
@@ -73,17 +84,9 @@ void _cgroup_rstat_cpu_unlock(raw_spinlock_t *cpu_lock, int cpu,
raw_spin_unlock_irqrestore(cpu_lock, flags);
}
-/**
- * cgroup_rstat_updated - keep track of updated rstat_cpu
- * @cgrp: target cgroup
- * @cpu: cpu on which rstat_cpu was updated
- *
- * @cgrp's rstat_cpu on @cpu was updated. Put it on the parent's matching
- * rstat_cpu->updated_children list. See the comment on top of
- * cgroup_rstat_cpu definition for details.
- */
-__bpf_kfunc void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
+static void __cgroup_rstat_updated(struct cgroup_rstat *rstat, int cpu)
{
+ struct cgroup *cgrp = container_of(rstat, typeof(*cgrp), rstat);
raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
unsigned long flags;
@@ -95,15 +98,15 @@ __bpf_kfunc void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
* instead of NULL, we can tell whether @cgrp is on the list by
* testing the next pointer for NULL.
*/
- if (data_race(cgroup_rstat_cpu(cgrp, cpu)->updated_next))
+ if (data_race(rstat_cpu(rstat, cpu)->updated_next))
return;
flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, cgrp, true);
/* put @cgrp and all ancestors on the corresponding updated lists */
while (true) {
- struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
- struct cgroup *parent = cgroup_parent(cgrp);
+ struct cgroup_rstat_cpu *rstatc = rstat_cpu(rstat, cpu);
+ struct cgroup_rstat *parent = rstat_parent(rstat);
struct cgroup_rstat_cpu *prstatc;
/*
@@ -115,20 +118,34 @@ __bpf_kfunc void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
/* Root has no parent to link it to, but mark it busy */
if (!parent) {
- rstatc->updated_next = cgrp;
+ rstatc->updated_next = rstat;
break;
}
- prstatc = cgroup_rstat_cpu(parent, cpu);
+ prstatc = rstat_cpu(parent, cpu);
rstatc->updated_next = prstatc->updated_children;
- prstatc->updated_children = cgrp;
+ prstatc->updated_children = rstat;
- cgrp = parent;
+ rstat = parent;
}
_cgroup_rstat_cpu_unlock(cpu_lock, cpu, cgrp, flags, true);
}
+/**
+ * cgroup_rstat_updated - keep track of updated rstat_cpu
+ * @cgrp: target cgroup
+ * @cpu: cpu on which rstat_cpu was updated
+ *
+ * @cgrp's rstat_cpu on @cpu was updated. Put it on the parent's matching
+ * rstat_cpu->updated_children list. See the comment on top of
+ * cgroup_rstat_cpu definition for details.
+ */
+__bpf_kfunc void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
+{
+ __cgroup_rstat_updated(&cgrp->rstat, cpu);
+}
+
/**
* cgroup_rstat_push_children - push children cgroups into the given list
* @head: current head of the list (= subtree root)
@@ -141,32 +158,32 @@ __bpf_kfunc void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
* into a singly linked list built from the tail backward like "pushing"
* cgroups into a stack. The root is pushed by the caller.
*/
-static struct cgroup *cgroup_rstat_push_children(struct cgroup *head,
- struct cgroup *child, int cpu)
+static struct cgroup_rstat *cgroup_rstat_push_children(
+ struct cgroup_rstat *head, struct cgroup_rstat *child, int cpu)
{
- struct cgroup *chead = child; /* Head of child cgroup level */
- struct cgroup *ghead = NULL; /* Head of grandchild cgroup level */
- struct cgroup *parent, *grandchild;
+ struct cgroup_rstat *chead = child; /* Head of child cgroup level */
+ struct cgroup_rstat *ghead = NULL; /* Head of grandchild cgroup level */
+ struct cgroup_rstat *parent, *grandchild;
struct cgroup_rstat_cpu *crstatc;
- child->rstat.rstat_flush_next = NULL;
+ child->rstat_flush_next = NULL;
next_level:
while (chead) {
child = chead;
- chead = child->rstat.rstat_flush_next;
- parent = cgroup_parent(child);
+ chead = child->rstat_flush_next;
+ parent = rstat_parent(child);
/* updated_next is parent cgroup terminated */
while (child != parent) {
- child->rstat.rstat_flush_next = head;
+ child->rstat_flush_next = head;
head = child;
- crstatc = cgroup_rstat_cpu(child, cpu);
+ crstatc = rstat_cpu(child, cpu);
grandchild = crstatc->updated_children;
if (grandchild != child) {
/* Push the grand child to the next level */
crstatc->updated_children = child;
- grandchild->rstat.rstat_flush_next = ghead;
+ grandchild->rstat_flush_next = ghead;
ghead = grandchild;
}
child = crstatc->updated_next;
@@ -200,14 +217,16 @@ static struct cgroup *cgroup_rstat_push_children(struct cgroup *head,
* within the children list and terminated by the parent cgroup. An exception
* here is the cgroup root whose updated_next can be self terminated.
*/
-static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
+static struct cgroup_rstat *cgroup_rstat_updated_list(
+ struct cgroup_rstat *root, int cpu)
{
+ struct cgroup *cgrp = container_of(root, typeof(*cgrp), rstat);
raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
- struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(root, cpu);
- struct cgroup *head = NULL, *parent, *child;
+ struct cgroup_rstat_cpu *rstatc = rstat_cpu(root, cpu);
+ struct cgroup_rstat *head = NULL, *parent, *child;
unsigned long flags;
- flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, root, false);
+ flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, cgrp, false);
/* Return NULL if this subtree is not on-list */
if (!rstatc->updated_next)
@@ -217,17 +236,17 @@ static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
* Unlink @root from its parent. As the updated_children list is
* singly linked, we have to walk it to find the removal point.
*/
- parent = cgroup_parent(root);
+ parent = rstat_parent(root);
if (parent) {
struct cgroup_rstat_cpu *prstatc;
- struct cgroup **nextp;
+ struct cgroup_rstat **nextp;
- prstatc = cgroup_rstat_cpu(parent, cpu);
+ prstatc = rstat_cpu(parent, cpu);
nextp = &prstatc->updated_children;
while (*nextp != root) {
struct cgroup_rstat_cpu *nrstatc;
- nrstatc = cgroup_rstat_cpu(*nextp, cpu);
+ nrstatc = rstat_cpu(*nextp, cpu);
WARN_ON_ONCE(*nextp == parent);
nextp = &nrstatc->updated_next;
}
@@ -238,13 +257,13 @@ static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
/* Push @root to the list first before pushing the children */
head = root;
- root->rstat.rstat_flush_next = NULL;
+ root->rstat_flush_next = NULL;
child = rstatc->updated_children;
rstatc->updated_children = root;
if (child != root)
head = cgroup_rstat_push_children(head, child, cpu);
unlock_ret:
- _cgroup_rstat_cpu_unlock(cpu_lock, cpu, root, flags, false);
+ _cgroup_rstat_cpu_unlock(cpu_lock, cpu, cgrp, flags, false);
return head;
}
@@ -300,24 +319,26 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
}
/* see cgroup_rstat_flush() */
-static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
+static void cgroup_rstat_flush_locked(struct cgroup_rstat *rstat)
__releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
{
+ struct cgroup *cgrp = container_of(rstat, typeof(*cgrp), rstat);
int cpu;
lockdep_assert_held(&cgroup_rstat_lock);
for_each_possible_cpu(cpu) {
- struct cgroup *pos = cgroup_rstat_updated_list(cgrp, cpu);
+ struct cgroup_rstat *pos = cgroup_rstat_updated_list(rstat, cpu);
- for (; pos; pos = pos->rstat.rstat_flush_next) {
+ for (; pos; pos = pos->rstat_flush_next) {
+ struct cgroup *pos_cgroup = container_of(pos, struct cgroup, rstat);
struct cgroup_subsys_state *css;
- cgroup_base_stat_flush(pos, cpu);
- bpf_rstat_flush(pos, cgroup_parent(pos), cpu);
+ cgroup_base_stat_flush(pos_cgroup, cpu);
+ bpf_rstat_flush(pos_cgroup, cgroup_parent(pos_cgroup), cpu);
rcu_read_lock();
- list_for_each_entry_rcu(css, &pos->rstat_css_list,
+ list_for_each_entry_rcu(css, &pos_cgroup->rstat_css_list,
rstat_css_node)
css->ss->css_rstat_flush(css, cpu);
rcu_read_unlock();
@@ -333,6 +354,17 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
}
}
+static void __cgroup_rstat_flush(struct cgroup_rstat *rstat)
+{
+ struct cgroup *cgrp = container_of(rstat, typeof(*cgrp), rstat);
+
+ might_sleep();
+
+ __cgroup_rstat_lock(cgrp, -1);
+ cgroup_rstat_flush_locked(rstat);
+ __cgroup_rstat_unlock(cgrp, -1);
+}
+
/**
* cgroup_rstat_flush - flush stats in @cgrp's subtree
* @cgrp: target cgroup
@@ -348,11 +380,17 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
*/
__bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
{
- might_sleep();
+ __cgroup_rstat_flush(&cgrp->rstat);
+}
+
+static void __cgroup_rstat_flush_hold(struct cgroup_rstat *rstat)
+ __acquires(&cgroup_rstat_lock)
+{
+ struct cgroup *cgrp = container_of(rstat, typeof(*cgrp), rstat);
+ might_sleep();
__cgroup_rstat_lock(cgrp, -1);
- cgroup_rstat_flush_locked(cgrp);
- __cgroup_rstat_unlock(cgrp, -1);
+ cgroup_rstat_flush_locked(rstat);
}
/**
@@ -365,63 +403,81 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
* This function may block.
*/
void cgroup_rstat_flush_hold(struct cgroup *cgrp)
- __acquires(&cgroup_rstat_lock)
{
- might_sleep();
- __cgroup_rstat_lock(cgrp, -1);
- cgroup_rstat_flush_locked(cgrp);
+ __cgroup_rstat_flush_hold(&cgrp->rstat);
}
/**
* cgroup_rstat_flush_release - release cgroup_rstat_flush_hold()
* @cgrp: cgroup used by tracepoint
*/
-void cgroup_rstat_flush_release(struct cgroup *cgrp)
+static void __cgroup_rstat_flush_release(struct cgroup_rstat *rstat)
__releases(&cgroup_rstat_lock)
{
+ struct cgroup *cgrp = container_of(rstat, typeof(*cgrp), rstat);
+
__cgroup_rstat_unlock(cgrp, -1);
}
-int cgroup_rstat_init(struct cgroup *cgrp)
+/**
+ * cgroup_rstat_flush_release - release cgroup_rstat_flush_hold()
+ * @cgrp: cgroup used by tracepoint
+ */
+void cgroup_rstat_flush_release(struct cgroup *cgrp)
{
- int cpu;
+ __cgroup_rstat_flush_release(&cgrp->rstat);
+}
- /* the root cgrp has rstat_cpu preallocated */
- if (!cgrp->rstat.rstat_cpu) {
- cgrp->rstat.rstat_cpu = alloc_percpu(
- struct cgroup_rstat_cpu);
- if (!cgrp->rstat.rstat_cpu)
- return -ENOMEM;
- }
+static void __cgroup_rstat_init(struct cgroup_rstat *rstat)
+{
+ int cpu;
/* ->updated_children list is self terminated */
for_each_possible_cpu(cpu) {
- struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
+ struct cgroup_rstat_cpu *rstatc = rstat_cpu(rstat, cpu);
- rstatc->updated_children = cgrp;
+ rstatc->updated_children = rstat;
u64_stats_init(&rstatc->bsync);
}
+}
+
+int cgroup_rstat_init(struct cgroup_rstat *rstat)
+{
+ /* the root cgrp has rstat_cpu preallocated */
+ if (!rstat->rstat_cpu) {
+ rstat->rstat_cpu = alloc_percpu(struct cgroup_rstat_cpu);
+ if (!rstat->rstat_cpu)
+ return -ENOMEM;
+ }
+
+ __cgroup_rstat_init(rstat);
return 0;
}
-void cgroup_rstat_exit(struct cgroup *cgrp)
+static void __cgroup_rstat_exit(struct cgroup_rstat *rstat)
{
int cpu;
- cgroup_rstat_flush(cgrp);
-
/* sanity check */
for_each_possible_cpu(cpu) {
- struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
+ struct cgroup_rstat_cpu *rstatc = rstat_cpu(rstat, cpu);
- if (WARN_ON_ONCE(rstatc->updated_children != cgrp) ||
+ if (WARN_ON_ONCE(rstatc->updated_children != rstat) ||
WARN_ON_ONCE(rstatc->updated_next))
return;
}
- free_percpu(cgrp->rstat.rstat_cpu);
- cgrp->rstat.rstat_cpu = NULL;
+ free_percpu(rstat->rstat_cpu);
+ rstat->rstat_cpu = NULL;
+}
+
+void cgroup_rstat_exit(struct cgroup_rstat *rstat)
+{
+ struct cgroup *cgrp = container_of(rstat, typeof(*cgrp), rstat);
+
+ cgroup_rstat_flush(cgrp);
+ __cgroup_rstat_exit(rstat);
}
void __init cgroup_rstat_boot(void)
@@ -462,7 +518,7 @@ static void cgroup_base_stat_sub(struct cgroup_base_stat *dst_bstat,
static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
{
- struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
+ struct cgroup_rstat_cpu *rstatc = rstat_cpu(&cgrp->rstat, cpu);
struct cgroup *parent = cgroup_parent(cgrp);
struct cgroup_rstat_cpu *prstatc;
struct cgroup_base_stat delta;
@@ -492,7 +548,7 @@ static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
cgroup_base_stat_add(&cgrp->last_bstat, &delta);
delta = rstatc->subtree_bstat;
- prstatc = cgroup_rstat_cpu(parent, cpu);
+ prstatc = rstat_cpu(&parent->rstat, cpu);
cgroup_base_stat_sub(&delta, &rstatc->last_subtree_bstat);
cgroup_base_stat_add(&prstatc->subtree_bstat, &delta);
cgroup_base_stat_add(&rstatc->last_subtree_bstat, &delta);
--
2.48.1
* [PATCH 03/11] cgroup: move cgroup_rstat from cgroup to cgroup_subsys_state
2025-02-18 3:14 [PATCH 00/11] cgroup: separate rstat trees JP Kobryn
2025-02-18 3:14 ` [PATCH 01/11] cgroup: move rstat pointers into struct of their own JP Kobryn
2025-02-18 3:14 ` [PATCH 02/11] cgroup: add level of indirection for cgroup_rstat struct JP Kobryn
@ 2025-02-18 3:14 ` JP Kobryn
2025-02-20 17:06 ` Shakeel Butt
2025-02-18 3:14 ` [PATCH 04/11] cgroup: introduce cgroup_rstat_ops JP Kobryn
` (9 subsequent siblings)
12 siblings, 1 reply; 42+ messages in thread
From: JP Kobryn @ 2025-02-18 3:14 UTC (permalink / raw)
To: shakeel.butt, tj, mhocko, hannes, yosryahmed, akpm
Cc: linux-mm, cgroups, kernel-team
Each cgroup owns a single cgroup_rstat instance. This means that a tree
of pending rstat updates can contain changes from different subsystems.
Because of this arrangement, when one subsystem is flushed via the
public API cgroup_rstat_flush(), all other subsystems with pending
updates will also be flushed. Remove the cgroup_rstat instance from the
cgroup and instead give one instance to each cgroup_subsys_state.
Separate rstat trees will now exist for each unique subsystem. This
separation allows subsystems to make updates and flushes without the
side effects of other subsystems, i.e. flushing the cpu stats does not
cause the memory stats to be flushed and vice versa. The change in
cgroup_rstat ownership from cgroup to cgroup_subsys_state allows for
direct flushing of the css, so the rcu list management entities and
operations previously tied to the cgroup, which were used for managing a
list of subsystem states with pending flushes, are removed. In terms of
client code, the public API calls were changed to accept the
cgroup_subsys_state so that when flushing or updating, a specific
subsystem is associated with the call.
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
block/blk-cgroup.c | 4 +-
include/linux/cgroup-defs.h | 10 +-
include/linux/cgroup.h | 8 +-
kernel/cgroup/cgroup-internal.h | 4 +-
kernel/cgroup/cgroup.c | 64 ++++++----
kernel/cgroup/rstat.c | 117 ++++++++++--------
mm/memcontrol.c | 4 +-
.../selftests/bpf/progs/btf_type_tag_percpu.c | 5 +-
.../bpf/progs/cgroup_hierarchical_stats.c | 8 +-
9 files changed, 123 insertions(+), 101 deletions(-)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 9ed93d91d754..6a0680d8ce6a 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1201,7 +1201,7 @@ static int blkcg_print_stat(struct seq_file *sf, void *v)
if (!seq_css(sf)->parent)
blkcg_fill_root_iostats();
else
- cgroup_rstat_flush(blkcg->css.cgroup);
+ cgroup_rstat_flush(&blkcg->css);
rcu_read_lock();
hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
@@ -2186,7 +2186,7 @@ void blk_cgroup_bio_start(struct bio *bio)
}
u64_stats_update_end_irqrestore(&bis->sync, flags);
- cgroup_rstat_updated(blkcg->css.cgroup, cpu);
+ cgroup_rstat_updated(&blkcg->css, cpu);
put_cpu();
}
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 6b6cc027fe70..81ec56860ee5 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -180,9 +180,6 @@ struct cgroup_subsys_state {
struct list_head sibling;
struct list_head children;
- /* flush target list anchored at cgrp->rstat_css_list */
- struct list_head rstat_css_node;
-
/*
* PI: Subsys-unique ID. 0 is unused and root is always 1. The
* matching css can be looked up using css_from_id().
@@ -222,6 +219,9 @@ struct cgroup_subsys_state {
* Protected by cgroup_mutex.
*/
int nr_descendants;
+
+ /* per-cpu recursive resource statistics */
+ struct cgroup_rstat rstat;
};
/*
@@ -444,10 +444,6 @@ struct cgroup {
struct cgroup *dom_cgrp;
struct cgroup *old_dom_cgrp; /* used while enabling threaded */
- /* per-cpu recursive resource statistics */
- struct cgroup_rstat rstat;
- struct list_head rstat_css_list;
-
/* cgroup basic resource statistics */
struct cgroup_base_stat last_bstat;
struct cgroup_base_stat bstat;
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index f8ef47f8a634..eec970622419 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -687,10 +687,10 @@ static inline void cgroup_path_from_kernfs_id(u64 id, char *buf, size_t buflen)
/*
* cgroup scalable recursive statistics.
*/
-void cgroup_rstat_updated(struct cgroup *cgrp, int cpu);
-void cgroup_rstat_flush(struct cgroup *cgrp);
-void cgroup_rstat_flush_hold(struct cgroup *cgrp);
-void cgroup_rstat_flush_release(struct cgroup *cgrp);
+void cgroup_rstat_updated(struct cgroup_subsys_state *css, int cpu);
+void cgroup_rstat_flush(struct cgroup_subsys_state *css);
+void cgroup_rstat_flush_hold(struct cgroup_subsys_state *css);
+void cgroup_rstat_flush_release(struct cgroup_subsys_state *css);
/*
* Basic resource stats.
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index 03139018da43..87d062baff90 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -269,8 +269,8 @@ int cgroup_task_count(const struct cgroup *cgrp);
/*
* rstat.c
*/
-int cgroup_rstat_init(struct cgroup_rstat *rstat);
-void cgroup_rstat_exit(struct cgroup_rstat *rstat);
+int cgroup_rstat_init(struct cgroup_subsys_state *css);
+void cgroup_rstat_exit(struct cgroup_subsys_state *css);
void cgroup_rstat_boot(void);
void cgroup_base_stat_cputime_show(struct seq_file *seq);
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 02d6c11ccccb..916e9c5a1fd7 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -165,7 +165,9 @@ static struct static_key_true *cgroup_subsys_on_dfl_key[] = {
static DEFINE_PER_CPU(struct cgroup_rstat_cpu, cgrp_dfl_root_rstat_cpu);
/* the default hierarchy */
-struct cgroup_root cgrp_dfl_root = { .cgrp.rstat.rstat_cpu = &cgrp_dfl_root_rstat_cpu };
+struct cgroup_root cgrp_dfl_root = {
+ .cgrp.self.rstat.rstat_cpu = &cgrp_dfl_root_rstat_cpu
+};
EXPORT_SYMBOL_GPL(cgrp_dfl_root);
/*
@@ -1359,7 +1361,7 @@ static void cgroup_destroy_root(struct cgroup_root *root)
cgroup_unlock();
- cgroup_rstat_exit(&cgrp->rstat);
+ cgroup_rstat_exit(&cgrp->self);
kernfs_destroy_root(root->kf_root);
cgroup_free_root(root);
}
@@ -1864,13 +1866,6 @@ int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask)
}
spin_unlock_irq(&css_set_lock);
- if (ss->css_rstat_flush) {
- list_del_rcu(&css->rstat_css_node);
- synchronize_rcu();
- list_add_rcu(&css->rstat_css_node,
- &dcgrp->rstat_css_list);
- }
-
/* default hierarchy doesn't enable controllers by default */
dst_root->subsys_mask |= 1 << ssid;
if (dst_root == &cgrp_dfl_root) {
@@ -2053,7 +2048,6 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
cgrp->dom_cgrp = cgrp;
cgrp->max_descendants = INT_MAX;
cgrp->max_depth = INT_MAX;
- INIT_LIST_HEAD(&cgrp->rstat_css_list);
prev_cputime_init(&cgrp->prev_cputime);
for_each_subsys(ss, ssid)
@@ -2133,7 +2127,7 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask)
if (ret)
goto destroy_root;
- ret = cgroup_rstat_init(&root_cgrp->rstat);
+ ret = cgroup_rstat_init(&root_cgrp->self);
if (ret)
goto destroy_root;
@@ -2175,7 +2169,7 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask)
goto out;
exit_stats:
- cgroup_rstat_exit(&root_cgrp->rstat);
+ cgroup_rstat_exit(&root_cgrp->self);
destroy_root:
kernfs_destroy_root(root->kf_root);
root->kf_root = NULL;
@@ -3240,6 +3234,12 @@ static int cgroup_apply_control_enable(struct cgroup *cgrp)
css = css_create(dsct, ss);
if (IS_ERR(css))
return PTR_ERR(css);
+
+ if (css->ss && css->ss->css_rstat_flush) {
+ ret = cgroup_rstat_init(css);
+ if (ret)
+ goto err_out;
+ }
}
WARN_ON_ONCE(percpu_ref_is_dying(&css->refcnt));
@@ -3253,6 +3253,21 @@ static int cgroup_apply_control_enable(struct cgroup *cgrp)
}
return 0;
+
+err_out:
+ cgroup_for_each_live_descendant_pre(dsct, d_css, cgrp) {
+ for_each_subsys(ss, ssid) {
+ struct cgroup_subsys_state *css = cgroup_css(dsct, ss);
+
+ if (!(cgroup_ss_mask(dsct) & (1 << ss->id)))
+ continue;
+
+ if (css && css->ss && css->ss->css_rstat_flush)
+ cgroup_rstat_exit(css);
+ }
+ }
+
+ return ret;
}
/**
@@ -5436,7 +5451,7 @@ static void css_free_rwork_fn(struct work_struct *work)
cgroup_put(cgroup_parent(cgrp));
kernfs_put(cgrp->kn);
psi_cgroup_free(cgrp);
- cgroup_rstat_exit(&cgrp->rstat);
+ cgroup_rstat_exit(&cgrp->self);
kfree(cgrp);
} else {
/*
@@ -5464,11 +5479,7 @@ static void css_release_work_fn(struct work_struct *work)
if (ss) {
struct cgroup *parent_cgrp;
- /* css release path */
- if (!list_empty(&css->rstat_css_node)) {
- cgroup_rstat_flush(cgrp);
- list_del_rcu(&css->rstat_css_node);
- }
+ cgroup_rstat_flush(css);
cgroup_idr_replace(&ss->css_idr, NULL, css->id);
if (ss->css_released)
@@ -5494,7 +5505,7 @@ static void css_release_work_fn(struct work_struct *work)
/* cgroup release path */
TRACE_CGROUP_PATH(release, cgrp);
- cgroup_rstat_flush(cgrp);
+ cgroup_rstat_flush(&cgrp->self);
spin_lock_irq(&css_set_lock);
for (tcgrp = cgroup_parent(cgrp); tcgrp;
@@ -5542,7 +5553,6 @@ static void init_and_link_css(struct cgroup_subsys_state *css,
css->id = -1;
INIT_LIST_HEAD(&css->sibling);
INIT_LIST_HEAD(&css->children);
- INIT_LIST_HEAD(&css->rstat_css_node);
css->serial_nr = css_serial_nr_next++;
atomic_set(&css->online_cnt, 0);
@@ -5551,9 +5561,6 @@ static void init_and_link_css(struct cgroup_subsys_state *css,
css_get(css->parent);
}
- if (ss->css_rstat_flush)
- list_add_rcu(&css->rstat_css_node, &cgrp->rstat_css_list);
-
BUG_ON(cgroup_css(cgrp, ss));
}
@@ -5659,7 +5666,6 @@ static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
err_list_del:
list_del_rcu(&css->sibling);
err_free_css:
- list_del_rcu(&css->rstat_css_node);
INIT_RCU_WORK(&css->destroy_rwork, css_free_rwork_fn);
queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
return ERR_PTR(err);
@@ -5687,7 +5693,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
if (ret)
goto out_free_cgrp;
- ret = cgroup_rstat_init(&cgrp->rstat);
+ ret = cgroup_rstat_init(&cgrp->self);
if (ret)
goto out_cancel_ref;
@@ -5780,7 +5786,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
out_kernfs_remove:
kernfs_remove(cgrp->kn);
out_stat_exit:
- cgroup_rstat_exit(&cgrp->rstat);
+ cgroup_rstat_exit(&cgrp->self);
out_cancel_ref:
percpu_ref_exit(&cgrp->self.refcnt);
out_free_cgrp:
@@ -6092,6 +6098,9 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
} else {
css->id = cgroup_idr_alloc(&ss->css_idr, css, 1, 2, GFP_KERNEL);
BUG_ON(css->id < 0);
+
+ if (css->ss && css->ss->css_rstat_flush)
+ BUG_ON(cgroup_rstat_init(css));
}
/* Update the init_css_set to contain a subsys
@@ -6193,6 +6202,9 @@ int __init cgroup_init(void)
css->id = cgroup_idr_alloc(&ss->css_idr, css, 1, 2,
GFP_KERNEL);
BUG_ON(css->id < 0);
+
+ if (css->ss && css->ss->css_rstat_flush)
+ BUG_ON(cgroup_rstat_init(css));
} else {
cgroup_init_subsys(ss, false);
}
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 13090dda56aa..a32bcd7942a5 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -21,13 +21,13 @@ static struct cgroup_rstat_cpu *rstat_cpu(struct cgroup_rstat *rstat, int cpu)
static struct cgroup_rstat *rstat_parent(struct cgroup_rstat *rstat)
{
- struct cgroup *cgrp = container_of(rstat, typeof(*cgrp), rstat);
- struct cgroup *parent = cgroup_parent(cgrp);
+ struct cgroup_subsys_state *css = container_of(
+ rstat, typeof(*css), rstat);
- if (!parent)
+ if (!css->parent)
return NULL;
- return &parent->rstat;
+ return &(css->parent->rstat);
}
/*
@@ -86,7 +86,9 @@ void _cgroup_rstat_cpu_unlock(raw_spinlock_t *cpu_lock, int cpu,
static void __cgroup_rstat_updated(struct cgroup_rstat *rstat, int cpu)
{
- struct cgroup *cgrp = container_of(rstat, typeof(*cgrp), rstat);
+ struct cgroup_subsys_state *css = container_of(
+ rstat, typeof(*css), rstat);
+ struct cgroup *cgrp = css->cgroup;
raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
unsigned long flags;
@@ -95,7 +97,7 @@ static void __cgroup_rstat_updated(struct cgroup_rstat *rstat, int cpu)
* temporary inaccuracies, which is fine.
*
* Because @parent's updated_children is terminated with @parent
- * instead of NULL, we can tell whether @cgrp is on the list by
+ * instead of NULL, we can tell whether @rstat is on the list by
* testing the next pointer for NULL.
*/
if (data_race(rstat_cpu(rstat, cpu)->updated_next))
@@ -103,7 +105,7 @@ static void __cgroup_rstat_updated(struct cgroup_rstat *rstat, int cpu)
flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, cgrp, true);
- /* put @cgrp and all ancestors on the corresponding updated lists */
+ /* put @rstat and all ancestors on the corresponding updated lists */
while (true) {
struct cgroup_rstat_cpu *rstatc = rstat_cpu(rstat, cpu);
struct cgroup_rstat *parent = rstat_parent(rstat);
@@ -134,16 +136,16 @@ static void __cgroup_rstat_updated(struct cgroup_rstat *rstat, int cpu)
/**
* cgroup_rstat_updated - keep track of updated rstat_cpu
- * @cgrp: target cgroup
+ * @css: target cgroup subsystem state
* @cpu: cpu on which rstat_cpu was updated
*
- * @cgrp's rstat_cpu on @cpu was updated. Put it on the parent's matching
+ * @css's rstat_cpu on @cpu was updated. Put it on the parent's matching
* rstat_cpu->updated_children list. See the comment on top of
* cgroup_rstat_cpu definition for details.
*/
-__bpf_kfunc void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
+__bpf_kfunc void cgroup_rstat_updated(struct cgroup_subsys_state *css, int cpu)
{
- __cgroup_rstat_updated(&cgrp->rstat, cpu);
+ __cgroup_rstat_updated(&css->rstat, cpu);
}
/**
@@ -220,7 +222,9 @@ static struct cgroup_rstat *cgroup_rstat_push_children(
static struct cgroup_rstat *cgroup_rstat_updated_list(
struct cgroup_rstat *root, int cpu)
{
- struct cgroup *cgrp = container_of(root, typeof(*cgrp), rstat);
+ struct cgroup_subsys_state *css = container_of(
+ root, typeof(*css), rstat);
+ struct cgroup *cgrp = css->cgroup;
raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
struct cgroup_rstat_cpu *rstatc = rstat_cpu(root, cpu);
struct cgroup_rstat *head = NULL, *parent, *child;
@@ -322,7 +326,9 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
static void cgroup_rstat_flush_locked(struct cgroup_rstat *rstat)
__releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
{
- struct cgroup *cgrp = container_of(rstat, typeof(*cgrp), rstat);
+ struct cgroup_subsys_state *css = container_of(
+ rstat, typeof(*css), rstat);
+ struct cgroup *cgrp = css->cgroup;
int cpu;
lockdep_assert_held(&cgroup_rstat_lock);
@@ -331,17 +337,16 @@ static void cgroup_rstat_flush_locked(struct cgroup_rstat *rstat)
struct cgroup_rstat *pos = cgroup_rstat_updated_list(rstat, cpu);
for (; pos; pos = pos->rstat_flush_next) {
- struct cgroup *pos_cgroup = container_of(pos, struct cgroup, rstat);
- struct cgroup_subsys_state *css;
+ struct cgroup_subsys_state *pos_css = container_of(
+ pos, typeof(*pos_css), rstat);
+ struct cgroup *pos_cgroup = pos_css->cgroup;
- cgroup_base_stat_flush(pos_cgroup, cpu);
- bpf_rstat_flush(pos_cgroup, cgroup_parent(pos_cgroup), cpu);
+ if (!pos_css->ss)
+ cgroup_base_stat_flush(pos_cgroup, cpu);
+ else
+ pos_css->ss->css_rstat_flush(pos_css, cpu);
- rcu_read_lock();
- list_for_each_entry_rcu(css, &pos_cgroup->rstat_css_list,
- rstat_css_node)
- css->ss->css_rstat_flush(css, cpu);
- rcu_read_unlock();
+ bpf_rstat_flush(pos_cgroup, cgroup_parent(pos_cgroup), cpu);
}
/* play nice and yield if necessary */
@@ -356,7 +361,9 @@ static void cgroup_rstat_flush_locked(struct cgroup_rstat *rstat)
static void __cgroup_rstat_flush(struct cgroup_rstat *rstat)
{
- struct cgroup *cgrp = container_of(rstat, typeof(*cgrp), rstat);
+ struct cgroup_subsys_state *css = container_of(
+ rstat, typeof(*css), rstat);
+ struct cgroup *cgrp = css->cgroup;
might_sleep();
@@ -366,27 +373,29 @@ static void __cgroup_rstat_flush(struct cgroup_rstat *rstat)
}
/**
- * cgroup_rstat_flush - flush stats in @cgrp's subtree
- * @cgrp: target cgroup
+ * cgroup_rstat_flush - flush stats in @css's rstat subtree
+ * @css: target cgroup subsystem state
*
- * Collect all per-cpu stats in @cgrp's subtree into the global counters
- * and propagate them upwards. After this function returns, all cgroups in
- * the subtree have up-to-date ->stat.
+ * Collect all per-cpu stats in @css's subtree into the global counters
+ * and propagate them upwards. After this function returns, all rstat
+ * nodes in the subtree have up-to-date ->stat.
*
- * This also gets all cgroups in the subtree including @cgrp off the
+ * This also gets all rstat nodes in the subtree including @css off the
* ->updated_children lists.
*
* This function may block.
*/
-__bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
+__bpf_kfunc void cgroup_rstat_flush(struct cgroup_subsys_state *css)
{
- __cgroup_rstat_flush(&cgrp->rstat);
+ __cgroup_rstat_flush(&css->rstat);
}
static void __cgroup_rstat_flush_hold(struct cgroup_rstat *rstat)
__acquires(&cgroup_rstat_lock)
{
- struct cgroup *cgrp = container_of(rstat, typeof(*cgrp), rstat);
+ struct cgroup_subsys_state *css = container_of(
+ rstat, typeof(*css), rstat);
+ struct cgroup *cgrp = css->cgroup;
might_sleep();
__cgroup_rstat_lock(cgrp, -1);
@@ -394,38 +403,40 @@ static void __cgroup_rstat_flush_hold(struct cgroup_rstat *rstat)
}
/**
- * cgroup_rstat_flush_hold - flush stats in @cgrp's subtree and hold
- * @cgrp: target cgroup
+ * cgroup_rstat_flush_hold - flush stats in @css's rstat subtree and hold
+ * @css: target subsystem state
*
- * Flush stats in @cgrp's subtree and prevent further flushes. Must be
+ * Flush stats in @css's rstat subtree and prevent further flushes. Must be
* paired with cgroup_rstat_flush_release().
*
* This function may block.
*/
-void cgroup_rstat_flush_hold(struct cgroup *cgrp)
+void cgroup_rstat_flush_hold(struct cgroup_subsys_state *css)
{
- __cgroup_rstat_flush_hold(&cgrp->rstat);
+ __cgroup_rstat_flush_hold(&css->rstat);
}
/**
* cgroup_rstat_flush_release - release cgroup_rstat_flush_hold()
- * @cgrp: cgroup used by tracepoint
+ * @rstat: rstat node used to find associated cgroup used by tracepoint
*/
static void __cgroup_rstat_flush_release(struct cgroup_rstat *rstat)
__releases(&cgroup_rstat_lock)
{
- struct cgroup *cgrp = container_of(rstat, typeof(*cgrp), rstat);
+ struct cgroup_subsys_state *css = container_of(
+ rstat, typeof(*css), rstat);
+ struct cgroup *cgrp = css->cgroup;
__cgroup_rstat_unlock(cgrp, -1);
}
/**
* cgroup_rstat_flush_release - release cgroup_rstat_flush_hold()
- * @cgrp: cgroup used by tracepoint
+ * @css: css that was previously used for the call to flush hold
*/
-void cgroup_rstat_flush_release(struct cgroup *cgrp)
+void cgroup_rstat_flush_release(struct cgroup_subsys_state *css)
{
- __cgroup_rstat_flush_release(&cgrp->rstat);
+ __cgroup_rstat_flush_release(&css->rstat);
}
static void __cgroup_rstat_init(struct cgroup_rstat *rstat)
@@ -441,8 +452,10 @@ static void __cgroup_rstat_init(struct cgroup_rstat *rstat)
}
}
-int cgroup_rstat_init(struct cgroup_rstat *rstat)
+int cgroup_rstat_init(struct cgroup_subsys_state *css)
{
+ struct cgroup_rstat *rstat = &css->rstat;
+
/* the root cgrp has rstat_cpu preallocated */
if (!rstat->rstat_cpu) {
rstat->rstat_cpu = alloc_percpu(struct cgroup_rstat_cpu);
@@ -472,11 +485,11 @@ static void __cgroup_rstat_exit(struct cgroup_rstat *rstat)
rstat->rstat_cpu = NULL;
}
-void cgroup_rstat_exit(struct cgroup_rstat *rstat)
+void cgroup_rstat_exit(struct cgroup_subsys_state *css)
{
- struct cgroup *cgrp = container_of(rstat, typeof(*cgrp), rstat);
+ struct cgroup_rstat *rstat = &css->rstat;
- cgroup_rstat_flush(cgrp);
+ cgroup_rstat_flush(css);
__cgroup_rstat_exit(rstat);
}
@@ -518,7 +531,7 @@ static void cgroup_base_stat_sub(struct cgroup_base_stat *dst_bstat,
static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
{
- struct cgroup_rstat_cpu *rstatc = rstat_cpu(&cgrp->rstat, cpu);
+ struct cgroup_rstat_cpu *rstatc = rstat_cpu(&(cgrp->self.rstat), cpu);
struct cgroup *parent = cgroup_parent(cgrp);
struct cgroup_rstat_cpu *prstatc;
struct cgroup_base_stat delta;
@@ -548,7 +561,7 @@ static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
cgroup_base_stat_add(&cgrp->last_bstat, &delta);
delta = rstatc->subtree_bstat;
- prstatc = rstat_cpu(&parent->rstat, cpu);
+ prstatc = rstat_cpu(&(parent->self.rstat), cpu);
cgroup_base_stat_sub(&delta, &rstatc->last_subtree_bstat);
cgroup_base_stat_add(&prstatc->subtree_bstat, &delta);
cgroup_base_stat_add(&rstatc->last_subtree_bstat, &delta);
@@ -560,7 +573,7 @@ cgroup_base_stat_cputime_account_begin(struct cgroup *cgrp, unsigned long *flags
{
struct cgroup_rstat_cpu *rstatc;
- rstatc = get_cpu_ptr(cgrp->rstat.rstat_cpu);
+ rstatc = get_cpu_ptr(cgrp->self.rstat.rstat_cpu);
*flags = u64_stats_update_begin_irqsave(&rstatc->bsync);
return rstatc;
}
@@ -570,7 +583,7 @@ static void cgroup_base_stat_cputime_account_end(struct cgroup *cgrp,
unsigned long flags)
{
u64_stats_update_end_irqrestore(&rstatc->bsync, flags);
- cgroup_rstat_updated(cgrp, smp_processor_id());
+ cgroup_rstat_updated(&cgrp->self, smp_processor_id());
put_cpu_ptr(rstatc);
}
@@ -673,12 +686,12 @@ void cgroup_base_stat_cputime_show(struct seq_file *seq)
u64 usage, utime, stime, ntime;
if (cgroup_parent(cgrp)) {
- cgroup_rstat_flush_hold(cgrp);
+ cgroup_rstat_flush_hold(&cgrp->self);
usage = cgrp->bstat.cputime.sum_exec_runtime;
cputime_adjust(&cgrp->bstat.cputime, &cgrp->prev_cputime,
&utime, &stime);
ntime = cgrp->bstat.ntime;
- cgroup_rstat_flush_release(cgrp);
+ cgroup_rstat_flush_release(&cgrp->self);
} else {
/* cgrp->bstat of root is not actually used, reuse it */
root_cgroup_cputime(&cgrp->bstat);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 46f8b372d212..88c2c8e610b1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -579,7 +579,7 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
if (!val)
return;
- cgroup_rstat_updated(memcg->css.cgroup, cpu);
+ cgroup_rstat_updated(&memcg->css, cpu);
statc = this_cpu_ptr(memcg->vmstats_percpu);
for (; statc; statc = statc->parent) {
stats_updates = READ_ONCE(statc->stats_updates) + abs(val);
@@ -611,7 +611,7 @@ static void __mem_cgroup_flush_stats(struct mem_cgroup *memcg, bool force)
if (mem_cgroup_is_root(memcg))
WRITE_ONCE(flush_last_time, jiffies_64);
- cgroup_rstat_flush(memcg->css.cgroup);
+ cgroup_rstat_flush(&memcg->css);
}
/*
diff --git a/tools/testing/selftests/bpf/progs/btf_type_tag_percpu.c b/tools/testing/selftests/bpf/progs/btf_type_tag_percpu.c
index 035412265c3c..310cd51e12e8 100644
--- a/tools/testing/selftests/bpf/progs/btf_type_tag_percpu.c
+++ b/tools/testing/selftests/bpf/progs/btf_type_tag_percpu.c
@@ -45,7 +45,7 @@ int BPF_PROG(test_percpu2, struct bpf_testmod_btf_type_tag_2 *arg)
SEC("tp_btf/cgroup_mkdir")
int BPF_PROG(test_percpu_load, struct cgroup *cgrp, const char *path)
{
- g = (__u64)cgrp->rstat.rstat_cpu->updated_children;
+ g = (__u64)cgrp->self.rstat.rstat_cpu->updated_children;
return 0;
}
@@ -56,7 +56,8 @@ int BPF_PROG(test_percpu_helper, struct cgroup *cgrp, const char *path)
__u32 cpu;
cpu = bpf_get_smp_processor_id();
- rstat = (struct cgroup_rstat_cpu *)bpf_per_cpu_ptr(cgrp->rstat.rstat_cpu, cpu);
+ rstat = (struct cgroup_rstat_cpu *)bpf_per_cpu_ptr(
+ cgrp->self.rstat.rstat_cpu, cpu);
if (rstat) {
/* READ_ONCE */
*(volatile int *)rstat;
diff --git a/tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c b/tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
index c74362854948..10c803c8dc70 100644
--- a/tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
+++ b/tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
@@ -37,8 +37,8 @@ struct {
__type(value, struct attach_counter);
} attach_counters SEC(".maps");
-extern void cgroup_rstat_updated(struct cgroup *cgrp, int cpu) __ksym;
-extern void cgroup_rstat_flush(struct cgroup *cgrp) __ksym;
+extern void cgroup_rstat_updated(struct cgroup_subsys_state *css, int cpu) __ksym;
+extern void cgroup_rstat_flush(struct cgroup_subsys_state *css) __ksym;
static uint64_t cgroup_id(struct cgroup *cgrp)
{
@@ -75,7 +75,7 @@ int BPF_PROG(counter, struct cgroup *dst_cgrp, struct task_struct *leader,
else if (create_percpu_attach_counter(cg_id, 1))
return 0;
- cgroup_rstat_updated(dst_cgrp, bpf_get_smp_processor_id());
+ cgroup_rstat_updated(&dst_cgrp->self, bpf_get_smp_processor_id());
return 0;
}
@@ -141,7 +141,7 @@ int BPF_PROG(dumper, struct bpf_iter_meta *meta, struct cgroup *cgrp)
return 1;
/* Flush the stats to make sure we get the most updated numbers */
- cgroup_rstat_flush(cgrp);
+ cgroup_rstat_flush(&cgrp->self);
total_counter = bpf_map_lookup_elem(&attach_counters, &cg_id);
if (!total_counter) {
--
2.48.1
* [PATCH 04/11] cgroup: introduce cgroup_rstat_ops
2025-02-18 3:14 [PATCH 00/11] cgroup: separate rstat trees JP Kobryn
` (2 preceding siblings ...)
2025-02-18 3:14 ` [PATCH 03/11] cgroup: move cgroup_rstat from cgroup to cgroup_subsys_state JP Kobryn
@ 2025-02-18 3:14 ` JP Kobryn
2025-02-19 7:21 ` kernel test robot
2025-02-20 17:50 ` Shakeel Butt
2025-02-18 3:14 ` [PATCH 05/11] cgroup: separate rstat for bpf cgroups JP Kobryn
` (8 subsequent siblings)
12 siblings, 2 replies; 42+ messages in thread
From: JP Kobryn @ 2025-02-18 3:14 UTC (permalink / raw)
To: shakeel.butt, tj, mhocko, hannes, yosryahmed, akpm
Cc: linux-mm, cgroups, kernel-team
The cgroup_rstat_ops interface provides a way for type-specific
operations to be hidden from the common rstat operations. Use it to
decouple the cgroup_subsys_state from the internal rstat updated/flush
routines. The new ops interface allows for greater extensibility in
terms of future changes: public updated/flush APIs can be created that
accept arbitrary types, as long as each type has an associated ops
interface.
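To illustrate, here is a sketch of how an arbitrary embedding type could hook
in. This is purely illustrative and not part of the patch: "struct foo", its
parent/cgrp members, and foo_flush() are hypothetical; only the shape of the
ops is taken from the code below.

  static struct cgroup_rstat *rstat_parent_via_foo(struct cgroup_rstat *rstat)
  {
          struct foo *f = container_of(rstat, struct foo, rstat);

          return f->parent ? &f->parent->rstat : NULL;
  }

  static struct cgroup *rstat_cgroup_via_foo(struct cgroup_rstat *rstat)
  {
          return container_of(rstat, struct foo, rstat)->cgrp;
  }

  static void rstat_flush_via_foo(struct cgroup_rstat *rstat, int cpu)
  {
          /* type-specific flush of the stats owned by foo */
          foo_flush(container_of(rstat, struct foo, rstat), cpu);
  }

  static struct cgroup_rstat_ops rstat_foo_ops = {
          .parent_fn      = rstat_parent_via_foo,
          .cgroup_fn      = rstat_cgroup_via_foo,
          .flush_fn       = rstat_flush_via_foo,
  };

The common updated/flush routines only ever see struct cgroup_rstat and reach
any type-specific state through the ops.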
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
kernel/cgroup/rstat.c | 131 +++++++++++++++++++++++++++---------------
1 file changed, 85 insertions(+), 46 deletions(-)
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index a32bcd7942a5..a8bb304e49c4 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -9,6 +9,12 @@
#include <trace/events/cgroup.h>
+struct cgroup_rstat_ops {
+ struct cgroup_rstat *(*parent_fn)(struct cgroup_rstat *);
+ struct cgroup *(*cgroup_fn)(struct cgroup_rstat *);
+ void (*flush_fn)(struct cgroup_rstat *, int);
+};
+
static DEFINE_SPINLOCK(cgroup_rstat_lock);
static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
@@ -19,7 +25,17 @@ static struct cgroup_rstat_cpu *rstat_cpu(struct cgroup_rstat *rstat, int cpu)
return per_cpu_ptr(rstat->rstat_cpu, cpu);
}
-static struct cgroup_rstat *rstat_parent(struct cgroup_rstat *rstat)
+static inline bool is_base_css(struct cgroup_subsys_state *css)
+{
+ /* css for base stats has no subsystem */
+ if (!css->ss)
+ return true;
+
+ return false;
+}
+
+static struct cgroup_rstat *rstat_parent_via_css(
+ struct cgroup_rstat *rstat)
{
struct cgroup_subsys_state *css = container_of(
rstat, typeof(*css), rstat);
@@ -30,6 +46,33 @@ static struct cgroup_rstat *rstat_parent(struct cgroup_rstat *rstat)
return &(css->parent->rstat);
}
+static struct cgroup *rstat_cgroup_via_css(struct cgroup_rstat *rstat)
+{
+ struct cgroup_subsys_state *css =
+ container_of(rstat, struct cgroup_subsys_state, rstat);
+
+ return css->cgroup;
+}
+
+static void rstat_flush_via_css(struct cgroup_rstat *rstat, int cpu)
+{
+ struct cgroup_subsys_state *css = container_of(
+ rstat, typeof(*css), rstat);
+
+ if (is_base_css(css)) {
+ cgroup_base_stat_flush(css->cgroup, cpu);
+ return;
+ }
+
+ css->ss->css_rstat_flush(css, cpu);
+}
+
+static struct cgroup_rstat_ops rstat_css_ops = {
+ .parent_fn = rstat_parent_via_css,
+ .cgroup_fn = rstat_cgroup_via_css,
+ .flush_fn = rstat_flush_via_css,
+};
+
/*
* Helper functions for rstat per CPU lock (cgroup_rstat_cpu_lock).
*
@@ -84,11 +127,11 @@ void _cgroup_rstat_cpu_unlock(raw_spinlock_t *cpu_lock, int cpu,
raw_spin_unlock_irqrestore(cpu_lock, flags);
}
-static void __cgroup_rstat_updated(struct cgroup_rstat *rstat, int cpu)
+static void __cgroup_rstat_updated(struct cgroup_rstat *rstat, int cpu,
+ struct cgroup_rstat_ops *ops)
{
- struct cgroup_subsys_state *css = container_of(
- rstat, typeof(*css), rstat);
- struct cgroup *cgrp = css->cgroup;
+ struct cgroup *cgrp;
+
raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
unsigned long flags;
@@ -103,12 +146,13 @@ static void __cgroup_rstat_updated(struct cgroup_rstat *rstat, int cpu)
if (data_race(rstat_cpu(rstat, cpu)->updated_next))
return;
+ cgrp = ops->cgroup_fn(rstat);
flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, cgrp, true);
/* put @rstat and all ancestors on the corresponding updated lists */
while (true) {
struct cgroup_rstat_cpu *rstatc = rstat_cpu(rstat, cpu);
- struct cgroup_rstat *parent = rstat_parent(rstat);
+ struct cgroup_rstat *parent = ops->parent_fn(rstat);
struct cgroup_rstat_cpu *prstatc;
/*
@@ -145,7 +189,7 @@ static void __cgroup_rstat_updated(struct cgroup_rstat *rstat, int cpu)
*/
__bpf_kfunc void cgroup_rstat_updated(struct cgroup_subsys_state *css, int cpu)
{
- __cgroup_rstat_updated(&css->rstat, cpu);
+ __cgroup_rstat_updated(&css->rstat, cpu, &rstat_css_ops);
}
/**
@@ -161,7 +205,8 @@ __bpf_kfunc void cgroup_rstat_updated(struct cgroup_subsys_state *css, int cpu)
* cgroups into a stack. The root is pushed by the caller.
*/
static struct cgroup_rstat *cgroup_rstat_push_children(
- struct cgroup_rstat *head, struct cgroup_rstat *child, int cpu)
+ struct cgroup_rstat *head, struct cgroup_rstat *child, int cpu,
+ struct cgroup_rstat_ops *ops)
{
struct cgroup_rstat *chead = child; /* Head of child cgroup level */
struct cgroup_rstat *ghead = NULL; /* Head of grandchild cgroup level */
@@ -174,7 +219,7 @@ static struct cgroup_rstat *cgroup_rstat_push_children(
while (chead) {
child = chead;
chead = child->rstat_flush_next;
- parent = rstat_parent(child);
+ parent = ops->parent_fn(child);
/* updated_next is parent cgroup terminated */
while (child != parent) {
@@ -220,16 +265,15 @@ static struct cgroup_rstat *cgroup_rstat_push_children(
* here is the cgroup root whose updated_next can be self terminated.
*/
static struct cgroup_rstat *cgroup_rstat_updated_list(
- struct cgroup_rstat *root, int cpu)
+ struct cgroup_rstat *root, int cpu, struct cgroup_rstat_ops *ops)
{
- struct cgroup_subsys_state *css = container_of(
- root, typeof(*css), rstat);
- struct cgroup *cgrp = css->cgroup;
raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
struct cgroup_rstat_cpu *rstatc = rstat_cpu(root, cpu);
struct cgroup_rstat *head = NULL, *parent, *child;
+ struct cgroup *cgrp;
unsigned long flags;
+ cgrp = ops->cgroup_fn(root);
flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, cgrp, false);
/* Return NULL if this subtree is not on-list */
@@ -240,7 +284,7 @@ static struct cgroup_rstat *cgroup_rstat_updated_list(
* Unlink @root from its parent. As the updated_children list is
* singly linked, we have to walk it to find the removal point.
*/
- parent = rstat_parent(root);
+ parent = ops->parent_fn(root);
if (parent) {
struct cgroup_rstat_cpu *prstatc;
struct cgroup_rstat **nextp;
@@ -265,7 +309,7 @@ static struct cgroup_rstat *cgroup_rstat_updated_list(
child = rstatc->updated_children;
rstatc->updated_children = root;
if (child != root)
- head = cgroup_rstat_push_children(head, child, cpu);
+ head = cgroup_rstat_push_children(head, child, cpu, ops);
unlock_ret:
_cgroup_rstat_cpu_unlock(cpu_lock, cpu, cgrp, flags, false);
return head;
@@ -323,34 +367,30 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
}
/* see cgroup_rstat_flush() */
-static void cgroup_rstat_flush_locked(struct cgroup_rstat *rstat)
+static void cgroup_rstat_flush_locked(struct cgroup_rstat *rstat,
+ struct cgroup_rstat_ops *ops)
__releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
{
- struct cgroup_subsys_state *css = container_of(
- rstat, typeof(*css), rstat);
- struct cgroup *cgrp = css->cgroup;
int cpu;
lockdep_assert_held(&cgroup_rstat_lock);
for_each_possible_cpu(cpu) {
- struct cgroup_rstat *pos = cgroup_rstat_updated_list(rstat, cpu);
+ struct cgroup_rstat *pos = cgroup_rstat_updated_list(
+ rstat, cpu, ops);
for (; pos; pos = pos->rstat_flush_next) {
- struct cgroup_subsys_state *pos_css = container_of(
- pos, typeof(*pos_css), rstat);
- struct cgroup *pos_cgroup = pos_css->cgroup;
-
- if (!pos_css->ss)
- cgroup_base_stat_flush(pos_cgroup, cpu);
- else
- pos_css->ss->css_rstat_flush(pos_css, cpu);
+ struct cgroup *pos_cgroup = ops->cgroup_fn(pos);
+ ops->flush_fn(pos, cpu);
bpf_rstat_flush(pos_cgroup, cgroup_parent(pos_cgroup), cpu);
}
/* play nice and yield if necessary */
if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
+ struct cgroup *cgrp;
+
+ cgrp = ops->cgroup_fn(rstat);
__cgroup_rstat_unlock(cgrp, cpu);
if (!cond_resched())
cpu_relax();
@@ -359,16 +399,15 @@ static void cgroup_rstat_flush_locked(struct cgroup_rstat *rstat)
}
}
-static void __cgroup_rstat_flush(struct cgroup_rstat *rstat)
+static void __cgroup_rstat_flush(struct cgroup_rstat *rstat,
+ struct cgroup_rstat_ops *ops)
{
- struct cgroup_subsys_state *css = container_of(
- rstat, typeof(*css), rstat);
- struct cgroup *cgrp = css->cgroup;
+ struct cgroup *cgrp;
might_sleep();
-
+ cgrp = ops->cgroup_fn(rstat);
__cgroup_rstat_lock(cgrp, -1);
- cgroup_rstat_flush_locked(rstat);
+ cgroup_rstat_flush_locked(rstat, ops);
__cgroup_rstat_unlock(cgrp, -1);
}
@@ -387,19 +426,19 @@ static void __cgroup_rstat_flush(struct cgroup_rstat *rstat)
*/
__bpf_kfunc void cgroup_rstat_flush(struct cgroup_subsys_state *css)
{
- __cgroup_rstat_flush(&css->rstat);
+ __cgroup_rstat_flush(&css->rstat, &rstat_css_ops);
}
-static void __cgroup_rstat_flush_hold(struct cgroup_rstat *rstat)
+static void __cgroup_rstat_flush_hold(struct cgroup_rstat *rstat,
+ struct cgroup_rstat_ops *ops)
__acquires(&cgroup_rstat_lock)
{
- struct cgroup_subsys_state *css = container_of(
- rstat, typeof(*css), rstat);
- struct cgroup *cgrp = css->cgroup;
+ struct cgroup *cgrp;
might_sleep();
+ cgrp = ops->cgroup_fn(rstat);
__cgroup_rstat_lock(cgrp, -1);
- cgroup_rstat_flush_locked(rstat);
+ cgroup_rstat_flush_locked(rstat, ops);
}
/**
@@ -413,20 +452,20 @@ static void __cgroup_rstat_flush_hold(struct cgroup_rstat *rstat)
*/
void cgroup_rstat_flush_hold(struct cgroup_subsys_state *css)
{
- __cgroup_rstat_flush_hold(&css->rstat);
+ __cgroup_rstat_flush_hold(&css->rstat, &rstat_css_ops);
}
/**
* cgroup_rstat_flush_release - release cgroup_rstat_flush_hold()
* @rstat: rstat node used to find associated cgroup used by tracepoint
*/
-static void __cgroup_rstat_flush_release(struct cgroup_rstat *rstat)
+static void __cgroup_rstat_flush_release(struct cgroup_rstat *rstat,
+ struct cgroup_rstat_ops *ops)
__releases(&cgroup_rstat_lock)
{
- struct cgroup_subsys_state *css = container_of(
- rstat, typeof(*css), rstat);
- struct cgroup *cgrp = css->cgroup;
+ struct cgroup *cgrp;
+ cgrp = ops->cgroup_fn(rstat);
__cgroup_rstat_unlock(cgrp, -1);
}
@@ -436,7 +475,7 @@ static void __cgroup_rstat_flush_release(struct cgroup_rstat *rstat)
*/
void cgroup_rstat_flush_release(struct cgroup_subsys_state *css)
{
- __cgroup_rstat_flush_release(&css->rstat);
+ __cgroup_rstat_flush_release(&css->rstat, &rstat_css_ops);
}
static void __cgroup_rstat_init(struct cgroup_rstat *rstat)
--
2.48.1
* [PATCH 05/11] cgroup: separate rstat for bpf cgroups
2025-02-18 3:14 [PATCH 00/11] cgroup: separate rstat trees JP Kobryn
` (3 preceding siblings ...)
2025-02-18 3:14 ` [PATCH 04/11] cgroup: introduce cgroup_rstat_ops JP Kobryn
@ 2025-02-18 3:14 ` JP Kobryn
2025-02-21 18:14 ` Shakeel Butt
2025-02-18 3:14 ` [PATCH 06/11] cgroup: rstat lock indirection JP Kobryn
` (7 subsequent siblings)
12 siblings, 1 reply; 42+ messages in thread
From: JP Kobryn @ 2025-02-18 3:14 UTC (permalink / raw)
To: shakeel.butt, tj, mhocko, hannes, yosryahmed, akpm
Cc: linux-mm, cgroups, kernel-team
The processing of bpf cgroup stats is currently tied to the rstat actions of
other subsystems. Change this so that bpf cgroup stats are updated and
flushed independently. Give the cgroup_bpf struct its own cgroup_rstat
instance and define a new cgroup_rstat_ops instance specifically for
cgroup_bpf. Then drop the kfunc status of the existing updated/flush API
calls and, in their place, create new updated/flush kfuncs specifically for
bpf cgroups. In these new kfuncs, use the bpf-specific rstat ops to plumb
back into the existing rstat routines. Where applicable, guard the
bpf-related rstat code with pre-processor conditionals.
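For reference, the caller side in a bpf program mirrors the selftest changes
further down in this patch (sketch only):

  extern void bpf_cgroup_rstat_updated(struct cgroup *cgrp, int cpu) __ksym;
  extern void bpf_cgroup_rstat_flush(struct cgroup *cgrp) __ksym;

  /* on the update path, mark the cgroup as having pending bpf stats */
  bpf_cgroup_rstat_updated(cgrp, bpf_get_smp_processor_id());

  /* before reading aggregated numbers, flush the bpf rstat tree */
  bpf_cgroup_rstat_flush(cgrp);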
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
include/linux/bpf-cgroup-defs.h | 3 +
include/linux/cgroup.h | 3 +
kernel/bpf/cgroup.c | 6 ++
kernel/cgroup/cgroup-internal.h | 5 +
kernel/cgroup/rstat.c | 95 ++++++++++++++++---
.../selftests/bpf/progs/btf_type_tag_percpu.c | 4 +-
.../bpf/progs/cgroup_hierarchical_stats.c | 8 +-
7 files changed, 107 insertions(+), 17 deletions(-)
diff --git a/include/linux/bpf-cgroup-defs.h b/include/linux/bpf-cgroup-defs.h
index 0985221d5478..e68359f861fb 100644
--- a/include/linux/bpf-cgroup-defs.h
+++ b/include/linux/bpf-cgroup-defs.h
@@ -75,6 +75,9 @@ struct cgroup_bpf {
/* cgroup_bpf is released using a work queue */
struct work_struct release_work;
+
+ /* per-cpu recursive resource statistics */
+ struct cgroup_rstat rstat;
};
#else /* CONFIG_CGROUP_BPF */
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index eec970622419..253ce4bff576 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -836,6 +836,9 @@ static inline bool cgroup_task_frozen(struct task_struct *task)
#endif /* !CONFIG_CGROUPS */
#ifdef CONFIG_CGROUP_BPF
+void bpf_cgroup_rstat_updated(struct cgroup *cgrp, int cpu);
+void bpf_cgroup_rstat_flush(struct cgroup *cgrp);
+
static inline void cgroup_bpf_get(struct cgroup *cgrp)
{
percpu_ref_get(&cgrp->bpf.refcnt);
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 46e5db65dbc8..72bcfdbda6b1 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -210,6 +210,7 @@ void cgroup_bpf_offline(struct cgroup *cgrp)
{
cgroup_get(cgrp);
percpu_ref_kill(&cgrp->bpf.refcnt);
+ bpf_cgroup_rstat_exit(&cgrp->bpf);
}
static void bpf_cgroup_storages_free(struct bpf_cgroup_storage *storages[])
@@ -490,6 +491,10 @@ int cgroup_bpf_inherit(struct cgroup *cgrp)
if (ret)
return ret;
+ ret = bpf_cgroup_rstat_init(&cgrp->bpf);
+ if (ret)
+ goto cleanup_ref;
+
for (p = cgroup_parent(cgrp); p; p = cgroup_parent(p))
cgroup_bpf_get(p);
@@ -513,6 +518,7 @@ int cgroup_bpf_inherit(struct cgroup *cgrp)
for (p = cgroup_parent(cgrp); p; p = cgroup_parent(p))
cgroup_bpf_put(p);
+cleanup_ref:
percpu_ref_exit(&cgrp->bpf.refcnt);
return -ENOMEM;
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index 87d062baff90..bba1a1794de2 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -274,6 +274,11 @@ void cgroup_rstat_exit(struct cgroup_subsys_state *css);
void cgroup_rstat_boot(void);
void cgroup_base_stat_cputime_show(struct seq_file *seq);
+#ifdef CONFIG_CGROUP_BPF
+int bpf_cgroup_rstat_init(struct cgroup_bpf *bpf);
+void bpf_cgroup_rstat_exit(struct cgroup_bpf *bpf);
+#endif /* CONFIG_CGROUP_BPF */
+
/*
* namespace.c
*/
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index a8bb304e49c4..14dd8217db64 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -73,6 +73,47 @@ static struct cgroup_rstat_ops rstat_css_ops = {
.flush_fn = rstat_flush_via_css,
};
+#ifdef CONFIG_CGROUP_BPF
+__weak noinline void bpf_rstat_flush(struct cgroup *cgrp,
+ struct cgroup *parent, int cpu);
+
+static struct cgroup *rstat_cgroup_via_bpf(struct cgroup_rstat *rstat)
+{
+ struct cgroup_bpf *bpf = container_of(rstat, typeof(*bpf), rstat);
+ struct cgroup *cgrp = container_of(bpf, typeof(*cgrp), bpf);
+
+ return cgrp;
+}
+
+static struct cgroup_rstat *rstat_parent_via_bpf(
+ struct cgroup_rstat *rstat)
+{
+ struct cgroup *cgrp, *cgrp_parent;
+
+ cgrp = rstat_cgroup_via_bpf(rstat);
+ cgrp_parent = cgroup_parent(cgrp);
+ if (!cgrp_parent)
+ return NULL;
+
+ return &(cgrp_parent->bpf.rstat);
+}
+
+static void rstat_flush_via_bpf(struct cgroup_rstat *rstat, int cpu)
+{
+ struct cgroup *cgrp, *cgrp_parent;
+
+ cgrp = rstat_cgroup_via_bpf(rstat);
+ cgrp_parent = cgroup_parent(cgrp);
+ bpf_rstat_flush(cgrp, cgrp_parent, cpu);
+}
+
+static struct cgroup_rstat_ops rstat_bpf_ops = {
+ .parent_fn = rstat_parent_via_bpf,
+ .cgroup_fn = rstat_cgroup_via_bpf,
+ .flush_fn = rstat_flush_via_bpf,
+};
+#endif /* CONFIG_CGROUP_BPF */
+
/*
* Helper functions for rstat per CPU lock (cgroup_rstat_cpu_lock).
*
@@ -187,11 +228,18 @@ static void __cgroup_rstat_updated(struct cgroup_rstat *rstat, int cpu,
* rstat_cpu->updated_children list. See the comment on top of
* cgroup_rstat_cpu definition for details.
*/
-__bpf_kfunc void cgroup_rstat_updated(struct cgroup_subsys_state *css, int cpu)
+void cgroup_rstat_updated(struct cgroup_subsys_state *css, int cpu)
{
__cgroup_rstat_updated(&css->rstat, cpu, &rstat_css_ops);
}
+#ifdef CONFIG_CGROUP_BPF
+__bpf_kfunc void bpf_cgroup_rstat_updated(struct cgroup *cgroup, int cpu)
+{
+ __cgroup_rstat_updated(&(cgroup->bpf.rstat), cpu, &rstat_bpf_ops);
+}
+#endif /* CONFIG_CGROUP_BPF */
+
/**
* cgroup_rstat_push_children - push children cgroups into the given list
* @head: current head of the list (= subtree root)
@@ -330,8 +378,7 @@ static struct cgroup_rstat *cgroup_rstat_updated_list(
__bpf_hook_start();
-__weak noinline void bpf_rstat_flush(struct cgroup *cgrp,
- struct cgroup *parent, int cpu)
+void bpf_rstat_flush(struct cgroup *cgrp, struct cgroup *parent, int cpu)
{
}
@@ -379,12 +426,8 @@ static void cgroup_rstat_flush_locked(struct cgroup_rstat *rstat,
struct cgroup_rstat *pos = cgroup_rstat_updated_list(
rstat, cpu, ops);
- for (; pos; pos = pos->rstat_flush_next) {
- struct cgroup *pos_cgroup = ops->cgroup_fn(pos);
-
+ for (; pos; pos = pos->rstat_flush_next)
ops->flush_fn(pos, cpu);
- bpf_rstat_flush(pos_cgroup, cgroup_parent(pos_cgroup), cpu);
- }
/* play nice and yield if necessary */
if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
@@ -424,11 +467,18 @@ static void __cgroup_rstat_flush(struct cgroup_rstat *rstat,
*
* This function may block.
*/
-__bpf_kfunc void cgroup_rstat_flush(struct cgroup_subsys_state *css)
+void cgroup_rstat_flush(struct cgroup_subsys_state *css)
{
__cgroup_rstat_flush(&css->rstat, &rstat_css_ops);
}
+#ifdef CONFIG_CGROUP_BPF
+__bpf_kfunc void bpf_cgroup_rstat_flush(struct cgroup *cgroup)
+{
+ __cgroup_rstat_flush(&(cgroup->bpf.rstat), &rstat_bpf_ops);
+}
+#endif /* CONFIG_CGROUP_BPF */
+
static void __cgroup_rstat_flush_hold(struct cgroup_rstat *rstat,
struct cgroup_rstat_ops *ops)
__acquires(&cgroup_rstat_lock)
@@ -532,6 +582,27 @@ void cgroup_rstat_exit(struct cgroup_subsys_state *css)
__cgroup_rstat_exit(rstat);
}
+#ifdef CONFIG_CGROUP_BPF
+int bpf_cgroup_rstat_init(struct cgroup_bpf *bpf)
+{
+ struct cgroup_rstat *rstat = &bpf->rstat;
+
+ rstat->rstat_cpu = alloc_percpu(struct cgroup_rstat_cpu);
+ if (!rstat->rstat_cpu)
+ return -ENOMEM;
+
+ __cgroup_rstat_init(rstat);
+
+ return 0;
+}
+
+void bpf_cgroup_rstat_exit(struct cgroup_bpf *bpf)
+{
+ __cgroup_rstat_flush(&bpf->rstat, &rstat_bpf_ops);
+ __cgroup_rstat_exit(&bpf->rstat);
+}
+#endif /* CONFIG_CGROUP_BPF */
+
void __init cgroup_rstat_boot(void)
{
int cpu;
@@ -754,10 +825,11 @@ void cgroup_base_stat_cputime_show(struct seq_file *seq)
cgroup_force_idle_show(seq, &cgrp->bstat);
}
+#ifdef CONFIG_CGROUP_BPF
/* Add bpf kfuncs for cgroup_rstat_updated() and cgroup_rstat_flush() */
BTF_KFUNCS_START(bpf_rstat_kfunc_ids)
-BTF_ID_FLAGS(func, cgroup_rstat_updated)
-BTF_ID_FLAGS(func, cgroup_rstat_flush, KF_SLEEPABLE)
+BTF_ID_FLAGS(func, bpf_cgroup_rstat_updated)
+BTF_ID_FLAGS(func, bpf_cgroup_rstat_flush, KF_SLEEPABLE)
BTF_KFUNCS_END(bpf_rstat_kfunc_ids)
static const struct btf_kfunc_id_set bpf_rstat_kfunc_set = {
@@ -771,3 +843,4 @@ static int __init bpf_rstat_kfunc_init(void)
&bpf_rstat_kfunc_set);
}
late_initcall(bpf_rstat_kfunc_init);
+#endif /* CONFIG_CGROUP_BPF */
diff --git a/tools/testing/selftests/bpf/progs/btf_type_tag_percpu.c b/tools/testing/selftests/bpf/progs/btf_type_tag_percpu.c
index 310cd51e12e8..da15ada56218 100644
--- a/tools/testing/selftests/bpf/progs/btf_type_tag_percpu.c
+++ b/tools/testing/selftests/bpf/progs/btf_type_tag_percpu.c
@@ -45,7 +45,7 @@ int BPF_PROG(test_percpu2, struct bpf_testmod_btf_type_tag_2 *arg)
SEC("tp_btf/cgroup_mkdir")
int BPF_PROG(test_percpu_load, struct cgroup *cgrp, const char *path)
{
- g = (__u64)cgrp->self.rstat.rstat_cpu->updated_children;
+ g = (__u64)cgrp->bpf.rstat.rstat_cpu->updated_children;
return 0;
}
@@ -57,7 +57,7 @@ int BPF_PROG(test_percpu_helper, struct cgroup *cgrp, const char *path)
cpu = bpf_get_smp_processor_id();
rstat = (struct cgroup_rstat_cpu *)bpf_per_cpu_ptr(
- cgrp->self.rstat.rstat_cpu, cpu);
+ cgrp->bpf.rstat.rstat_cpu, cpu);
if (rstat) {
/* READ_ONCE */
*(volatile int *)rstat;
diff --git a/tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c b/tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
index 10c803c8dc70..24450dd4d3f3 100644
--- a/tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
+++ b/tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
@@ -37,8 +37,8 @@ struct {
__type(value, struct attach_counter);
} attach_counters SEC(".maps");
-extern void cgroup_rstat_updated(struct cgroup_subsys_state *css, int cpu) __ksym;
-extern void cgroup_rstat_flush(struct cgroup_subsys_state *css) __ksym;
+extern void bpf_cgroup_rstat_updated(struct cgroup *cgrp, int cpu) __ksym;
+extern void bpf_cgroup_rstat_flush(struct cgroup *cgrp) __ksym;
static uint64_t cgroup_id(struct cgroup *cgrp)
{
@@ -75,7 +75,7 @@ int BPF_PROG(counter, struct cgroup *dst_cgrp, struct task_struct *leader,
else if (create_percpu_attach_counter(cg_id, 1))
return 0;
- cgroup_rstat_updated(&dst_cgrp->self, bpf_get_smp_processor_id());
+ bpf_cgroup_rstat_updated(dst_cgrp, bpf_get_smp_processor_id());
return 0;
}
@@ -141,7 +141,7 @@ int BPF_PROG(dumper, struct bpf_iter_meta *meta, struct cgroup *cgrp)
return 1;
/* Flush the stats to make sure we get the most updated numbers */
- cgroup_rstat_flush(&cgrp->self);
+ bpf_cgroup_rstat_flush(cgrp);
total_counter = bpf_map_lookup_elem(&attach_counters, &cg_id);
if (!total_counter) {
--
2.48.1
* [PATCH 06/11] cgroup: rstat lock indirection
2025-02-18 3:14 [PATCH 00/11] cgroup: separate rstat trees JP Kobryn
` (4 preceding siblings ...)
2025-02-18 3:14 ` [PATCH 05/11] cgroup: separate rstat for bpf cgroups JP Kobryn
@ 2025-02-18 3:14 ` JP Kobryn
2025-02-21 22:09 ` Shakeel Butt
2025-02-18 3:14 ` [PATCH 07/11] cgroup: fetch cpu-specific lock in rstat cpu lock helpers JP Kobryn
` (6 subsequent siblings)
12 siblings, 1 reply; 42+ messages in thread
From: JP Kobryn @ 2025-02-18 3:14 UTC (permalink / raw)
To: shakeel.butt, tj, mhocko, hannes, yosryahmed, akpm
Cc: linux-mm, cgroups, kernel-team
Instead of accessing the target lock directly via a global variable, access
it indirectly through a new parameter. Also change the parameter ordering to
be consistent with the related per-cpu locking function
_cgroup_rstat_cpu_lock().
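At this point in the series only the global lock is ever passed in; the call
sites simply become explicit about which lock they take (before/after
sketch):

  /* before */
  __cgroup_rstat_lock(cgrp, -1);
  ...
  __cgroup_rstat_unlock(cgrp, -1);

  /* after */
  __cgroup_rstat_lock(&cgroup_rstat_lock, cgrp, -1);
  ...
  __cgroup_rstat_unlock(&cgroup_rstat_lock, cgrp, -1);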
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
kernel/cgroup/rstat.c | 30 ++++++++++++++++--------------
1 file changed, 16 insertions(+), 14 deletions(-)
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 14dd8217db64..26c75629bca2 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -385,7 +385,7 @@ void bpf_rstat_flush(struct cgroup *cgrp, struct cgroup *parent, int cpu)
__bpf_hook_end();
/*
- * Helper functions for locking cgroup_rstat_lock.
+ * Helper functions for locking.
*
* This makes it easier to diagnose locking issues and contention in
* production environments. The parameter @cpu_in_loop indicate lock
@@ -393,24 +393,26 @@ __bpf_hook_end();
* value -1 is used when obtaining the main lock else this is the CPU
* number processed last.
*/
-static inline void __cgroup_rstat_lock(struct cgroup *cgrp, int cpu_in_loop)
- __acquires(&cgroup_rstat_lock)
+static inline void __cgroup_rstat_lock(spinlock_t *lock,
+ struct cgroup *cgrp, int cpu_in_loop)
+ __acquires(lock)
{
bool contended;
- contended = !spin_trylock_irq(&cgroup_rstat_lock);
+ contended = !spin_trylock_irq(lock);
if (contended) {
trace_cgroup_rstat_lock_contended(cgrp, cpu_in_loop, contended);
- spin_lock_irq(&cgroup_rstat_lock);
+ spin_lock_irq(lock);
}
trace_cgroup_rstat_locked(cgrp, cpu_in_loop, contended);
}
-static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
- __releases(&cgroup_rstat_lock)
+static inline void __cgroup_rstat_unlock(spinlock_t *lock,
+ struct cgroup *cgrp, int cpu_in_loop)
+ __releases(lock)
{
trace_cgroup_rstat_unlock(cgrp, cpu_in_loop, false);
- spin_unlock_irq(&cgroup_rstat_lock);
+ spin_unlock_irq(lock);
}
/* see cgroup_rstat_flush() */
@@ -434,10 +436,10 @@ static void cgroup_rstat_flush_locked(struct cgroup_rstat *rstat,
struct cgroup *cgrp;
cgrp = ops->cgroup_fn(rstat);
- __cgroup_rstat_unlock(cgrp, cpu);
+ __cgroup_rstat_unlock(&cgroup_rstat_lock, cgrp, cpu);
if (!cond_resched())
cpu_relax();
- __cgroup_rstat_lock(cgrp, cpu);
+ __cgroup_rstat_lock(&cgroup_rstat_lock, cgrp, cpu);
}
}
}
@@ -449,9 +451,9 @@ static void __cgroup_rstat_flush(struct cgroup_rstat *rstat,
might_sleep();
cgrp = ops->cgroup_fn(rstat);
- __cgroup_rstat_lock(cgrp, -1);
+ __cgroup_rstat_lock(&cgroup_rstat_lock, cgrp, -1);
cgroup_rstat_flush_locked(rstat, ops);
- __cgroup_rstat_unlock(cgrp, -1);
+ __cgroup_rstat_unlock(&cgroup_rstat_lock, cgrp, -1);
}
/**
@@ -487,7 +489,7 @@ static void __cgroup_rstat_flush_hold(struct cgroup_rstat *rstat,
might_sleep();
cgrp = ops->cgroup_fn(rstat);
- __cgroup_rstat_lock(cgrp, -1);
+ __cgroup_rstat_lock(&cgroup_rstat_lock, cgrp, -1);
cgroup_rstat_flush_locked(rstat, ops);
}
@@ -516,7 +518,7 @@ static void __cgroup_rstat_flush_release(struct cgroup_rstat *rstat,
struct cgroup *cgrp;
cgrp = ops->cgroup_fn(rstat);
- __cgroup_rstat_unlock(cgrp, -1);
+ __cgroup_rstat_unlock(&cgroup_rstat_lock, cgrp, -1);
}
/**
--
2.48.1
* [PATCH 07/11] cgroup: fetch cpu-specific lock in rstat cpu lock helpers
2025-02-18 3:14 [PATCH 00/11] cgroup: separate rstat trees JP Kobryn
` (5 preceding siblings ...)
2025-02-18 3:14 ` [PATCH 06/11] cgroup: rstat lock indirection JP Kobryn
@ 2025-02-18 3:14 ` JP Kobryn
2025-02-21 22:35 ` Shakeel Butt
2025-02-18 3:14 ` [PATCH 08/11] cgroup: rstat cpu lock indirection JP Kobryn
` (5 subsequent siblings)
12 siblings, 1 reply; 42+ messages in thread
From: JP Kobryn @ 2025-02-18 3:14 UTC (permalink / raw)
To: shakeel.butt, tj, mhocko, hannes, yosryahmed, akpm
Cc: linux-mm, cgroups, kernel-team
The lock/unlock helper functions for per-cpu locks accept a cpu argument.
This makes it look as if the cpu is used to resolve the per-cpu lock from its
base pointer, but in fact the cpu is only used as a tracepoint argument.
Change the functions so that the cpu is also used to look up the lock
specific to that cpu. The call sites can then be adjusted so they no longer
perform the per-cpu lookup before calling these helpers. Note that this
follows suit with other functions in the rstat source: functions that accept
a cpu argument perform the per-cpu pointer lookup internally instead of
requiring callers to do it in advance.
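A before/after sketch of a call site:

  /* before: caller resolves the per-cpu pointer itself */
  raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
  flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, cgrp, true);

  /* after: pass the per-cpu variable and let the helper do the lookup */
  flags = _cgroup_rstat_cpu_lock(&cgroup_rstat_cpu_lock, cpu, cgrp, true);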
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
kernel/cgroup/rstat.c | 37 +++++++++++++++++++++----------------
1 file changed, 21 insertions(+), 16 deletions(-)
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 26c75629bca2..4cb0f3ffc1db 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -115,7 +115,12 @@ static struct cgroup_rstat_ops rstat_bpf_ops = {
#endif /* CONFIG_CGROUP_BPF */
/*
- * Helper functions for rstat per CPU lock (cgroup_rstat_cpu_lock).
+ * Helper functions for rstat per-cpu locks.
+ * @lock: pointer to per-cpu lock variable
+ * @cpu: the cpu to use for getting the cpu-specific lock
+ * @cgrp: the associated cgroup
+ * @fast_path: whether this function is called while updating
+ * in the fast path or flushing in the NON-fast path
*
* This makes it easier to diagnose locking issues and contention in
* production environments. The parameter @fast_path determine the
@@ -123,19 +128,20 @@ static struct cgroup_rstat_ops rstat_bpf_ops = {
* operations without handling high-frequency fast-path "update" events.
*/
static __always_inline
-unsigned long _cgroup_rstat_cpu_lock(raw_spinlock_t *cpu_lock, int cpu,
+unsigned long _cgroup_rstat_cpu_lock(raw_spinlock_t *lock, int cpu,
struct cgroup *cgrp, const bool fast_path)
{
+ raw_spinlock_t *cpu_lock = per_cpu_ptr(lock, cpu);
unsigned long flags;
bool contended;
/*
- * The _irqsave() is needed because cgroup_rstat_lock is
- * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
- * this lock with the _irq() suffix only disables interrupts on
- * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
- * interrupts on both configurations. The _irqsave() ensures
- * that interrupts are always disabled and later restored.
+ * The _irqsave() is needed because the locks used for flushing
+ * are spinlock_t which is a sleeping lock on PREEMPT_RT.
+ * Acquiring this lock with the _irq() suffix only disables
+ * interrupts on a non-PREEMPT_RT kernel. The raw_spinlock_t below
+ * disables interrupts on both configurations. The _irqsave()
+ * ensures that interrupts are always disabled and later restored.
*/
contended = !raw_spin_trylock_irqsave(cpu_lock, flags);
if (contended) {
@@ -156,10 +162,12 @@ unsigned long _cgroup_rstat_cpu_lock(raw_spinlock_t *cpu_lock, int cpu,
}
static __always_inline
-void _cgroup_rstat_cpu_unlock(raw_spinlock_t *cpu_lock, int cpu,
+void _cgroup_rstat_cpu_unlock(raw_spinlock_t *lock, int cpu,
struct cgroup *cgrp, unsigned long flags,
const bool fast_path)
{
+ raw_spinlock_t *cpu_lock = per_cpu_ptr(lock, cpu);
+
if (fast_path)
trace_cgroup_rstat_cpu_unlock_fastpath(cgrp, cpu, false);
else
@@ -172,8 +180,6 @@ static void __cgroup_rstat_updated(struct cgroup_rstat *rstat, int cpu,
struct cgroup_rstat_ops *ops)
{
struct cgroup *cgrp;
-
- raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
unsigned long flags;
/*
@@ -188,7 +194,7 @@ static void __cgroup_rstat_updated(struct cgroup_rstat *rstat, int cpu,
return;
cgrp = ops->cgroup_fn(rstat);
- flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, cgrp, true);
+ flags = _cgroup_rstat_cpu_lock(&cgroup_rstat_cpu_lock, cpu, cgrp, true);
/* put @rstat and all ancestors on the corresponding updated lists */
while (true) {
@@ -216,7 +222,7 @@ static void __cgroup_rstat_updated(struct cgroup_rstat *rstat, int cpu,
rstat = parent;
}
- _cgroup_rstat_cpu_unlock(cpu_lock, cpu, cgrp, flags, true);
+ _cgroup_rstat_cpu_unlock(&cgroup_rstat_cpu_lock, cpu, cgrp, flags, true);
}
/**
@@ -315,14 +321,13 @@ static struct cgroup_rstat *cgroup_rstat_push_children(
static struct cgroup_rstat *cgroup_rstat_updated_list(
struct cgroup_rstat *root, int cpu, struct cgroup_rstat_ops *ops)
{
- raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
struct cgroup_rstat_cpu *rstatc = rstat_cpu(root, cpu);
struct cgroup_rstat *head = NULL, *parent, *child;
struct cgroup *cgrp;
unsigned long flags;
cgrp = ops->cgroup_fn(root);
- flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, cgrp, false);
+ flags = _cgroup_rstat_cpu_lock(&cgroup_rstat_cpu_lock, cpu, cgrp, false);
/* Return NULL if this subtree is not on-list */
if (!rstatc->updated_next)
@@ -359,7 +364,7 @@ static struct cgroup_rstat *cgroup_rstat_updated_list(
if (child != root)
head = cgroup_rstat_push_children(head, child, cpu, ops);
unlock_ret:
- _cgroup_rstat_cpu_unlock(cpu_lock, cpu, cgrp, flags, false);
+ _cgroup_rstat_cpu_unlock(&cgroup_rstat_cpu_lock, cpu, cgrp, flags, false);
return head;
}
--
2.48.1
* [PATCH 08/11] cgroup: rstat cpu lock indirection
2025-02-18 3:14 [PATCH 00/11] cgroup: separate rstat trees JP Kobryn
` (6 preceding siblings ...)
2025-02-18 3:14 ` [PATCH 07/11] cgroup: fetch cpu-specific lock in rstat cpu lock helpers JP Kobryn
@ 2025-02-18 3:14 ` JP Kobryn
2025-02-19 8:48 ` kernel test robot
2025-02-22 0:18 ` Shakeel Butt
2025-02-18 3:14 ` [PATCH 09/11] cgroup: separate rstat locks for bpf cgroups JP Kobryn
` (4 subsequent siblings)
12 siblings, 2 replies; 42+ messages in thread
From: JP Kobryn @ 2025-02-18 3:14 UTC (permalink / raw)
To: shakeel.butt, tj, mhocko, hannes, yosryahmed, akpm
Cc: linux-mm, cgroups, kernel-team
Where functions access the global per-cpu lock, change their signatures to
instead accept the lock as a parameter, and change the code within these
functions to only access the parameter. This indirection allows future code
to pass in different locks, increasing extensibility. For example, a new lock
could be added specifically for the bpf cgroups and it would not contend with
the existing lock.
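As a sketch of what this enables, the same internal routine can be driven
with a different lock pair. The bpf-specific locks shown here are only
introduced later in the series (patch 9); they are used for illustration:

  __cgroup_rstat_flush(&cgroup->bpf.rstat, &rstat_bpf_ops,
                       &cgroup_rstat_bpf_lock, &cgroup_rstat_bpf_cpu_lock);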
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
kernel/cgroup/rstat.c | 74 +++++++++++++++++++++++++------------------
1 file changed, 43 insertions(+), 31 deletions(-)
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 4cb0f3ffc1db..9f6da3ea3c8c 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -177,7 +177,7 @@ void _cgroup_rstat_cpu_unlock(raw_spinlock_t *lock, int cpu,
}
static void __cgroup_rstat_updated(struct cgroup_rstat *rstat, int cpu,
- struct cgroup_rstat_ops *ops)
+ struct cgroup_rstat_ops *ops, raw_spinlock_t *cpu_lock)
{
struct cgroup *cgrp;
unsigned long flags;
@@ -194,7 +194,7 @@ static void __cgroup_rstat_updated(struct cgroup_rstat *rstat, int cpu,
return;
cgrp = ops->cgroup_fn(rstat);
- flags = _cgroup_rstat_cpu_lock(&cgroup_rstat_cpu_lock, cpu, cgrp, true);
+ flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, cgrp, true);
/* put @rstat and all ancestors on the corresponding updated lists */
while (true) {
@@ -222,7 +222,7 @@ static void __cgroup_rstat_updated(struct cgroup_rstat *rstat, int cpu,
rstat = parent;
}
- _cgroup_rstat_cpu_unlock(&cgroup_rstat_cpu_lock, cpu, cgrp, flags, true);
+ _cgroup_rstat_cpu_unlock(cpu_lock, cpu, cgrp, flags, true);
}
/**
@@ -236,13 +236,15 @@ static void __cgroup_rstat_updated(struct cgroup_rstat *rstat, int cpu,
*/
void cgroup_rstat_updated(struct cgroup_subsys_state *css, int cpu)
{
- __cgroup_rstat_updated(&css->rstat, cpu, &rstat_css_ops);
+ __cgroup_rstat_updated(&css->rstat, cpu, &rstat_css_ops,
+ &cgroup_rstat_cpu_lock);
}
#ifdef CONFIG_CGROUP_BPF
__bpf_kfunc void bpf_cgroup_rstat_updated(struct cgroup *cgroup, int cpu)
{
- __cgroup_rstat_updated(&(cgroup->bpf.rstat), cpu, &rstat_bpf_ops);
+ __cgroup_rstat_updated(&(cgroup->bpf.rstat), cpu, &rstat_bpf_ops,
+ &cgroup_rstat_cpu_lock);
}
#endif /* CONFIG_CGROUP_BPF */
@@ -319,7 +321,8 @@ static struct cgroup_rstat *cgroup_rstat_push_children(
* here is the cgroup root whose updated_next can be self terminated.
*/
static struct cgroup_rstat *cgroup_rstat_updated_list(
- struct cgroup_rstat *root, int cpu, struct cgroup_rstat_ops *ops)
+ struct cgroup_rstat *root, int cpu, struct cgroup_rstat_ops *ops,
+ raw_spinlock_t *cpu_lock)
{
struct cgroup_rstat_cpu *rstatc = rstat_cpu(root, cpu);
struct cgroup_rstat *head = NULL, *parent, *child;
@@ -327,7 +330,7 @@ static struct cgroup_rstat *cgroup_rstat_updated_list(
unsigned long flags;
cgrp = ops->cgroup_fn(root);
- flags = _cgroup_rstat_cpu_lock(&cgroup_rstat_cpu_lock, cpu, cgrp, false);
+ flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, cgrp, false);
/* Return NULL if this subtree is not on-list */
if (!rstatc->updated_next)
@@ -364,7 +367,7 @@ static struct cgroup_rstat *cgroup_rstat_updated_list(
if (child != root)
head = cgroup_rstat_push_children(head, child, cpu, ops);
unlock_ret:
- _cgroup_rstat_cpu_unlock(&cgroup_rstat_cpu_lock, cpu, cgrp, flags, false);
+ _cgroup_rstat_cpu_unlock(cpu_lock, cpu, cgrp, flags, false);
return head;
}
@@ -422,43 +425,46 @@ static inline void __cgroup_rstat_unlock(spinlock_t *lock,
/* see cgroup_rstat_flush() */
static void cgroup_rstat_flush_locked(struct cgroup_rstat *rstat,
- struct cgroup_rstat_ops *ops)
- __releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
+ struct cgroup_rstat_ops *ops, spinlock_t *lock,
+ raw_spinlock_t *cpu_lock)
+ __releases(lock) __acquires(lock)
{
int cpu;
- lockdep_assert_held(&cgroup_rstat_lock);
+ lockdep_assert_held(lock);
for_each_possible_cpu(cpu) {
struct cgroup_rstat *pos = cgroup_rstat_updated_list(
- rstat, cpu, ops);
+ rstat, cpu, ops, cpu_lock);
for (; pos; pos = pos->rstat_flush_next)
ops->flush_fn(pos, cpu);
/* play nice and yield if necessary */
- if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
+ if (need_resched() || spin_needbreak(lock)) {
struct cgroup *cgrp;
cgrp = ops->cgroup_fn(rstat);
- __cgroup_rstat_unlock(&cgroup_rstat_lock, cgrp, cpu);
+ __cgroup_rstat_unlock(lock, cgrp, cpu);
if (!cond_resched())
cpu_relax();
- __cgroup_rstat_lock(&cgroup_rstat_lock, cgrp, cpu);
+ __cgroup_rstat_lock(lock, cgrp, cpu);
}
}
}
static void __cgroup_rstat_flush(struct cgroup_rstat *rstat,
- struct cgroup_rstat_ops *ops)
+ struct cgroup_rstat_ops *ops, spinlock_t *lock,
+ raw_spinlock_t *cpu_lock)
+ __acquires(lock) __releases(lock)
{
struct cgroup *cgrp;
might_sleep();
cgrp = ops->cgroup_fn(rstat);
- __cgroup_rstat_lock(&cgroup_rstat_lock, cgrp, -1);
- cgroup_rstat_flush_locked(rstat, ops);
- __cgroup_rstat_unlock(&cgroup_rstat_lock, cgrp, -1);
+ __cgroup_rstat_lock(lock, cgrp, -1);
+ cgroup_rstat_flush_locked(rstat, ops, lock, cpu_lock);
+ __cgroup_rstat_unlock(lock, cgrp, -1);
}
/**
@@ -476,26 +482,29 @@ static void __cgroup_rstat_flush(struct cgroup_rstat *rstat,
*/
void cgroup_rstat_flush(struct cgroup_subsys_state *css)
{
- __cgroup_rstat_flush(&css->rstat, &rstat_css_ops);
+ __cgroup_rstat_flush(&css->rstat, &rstat_css_ops,
+ &cgroup_rstat_lock, &cgroup_rstat_cpu_lock);
}
#ifdef CONFIG_CGROUP_BPF
__bpf_kfunc void bpf_cgroup_rstat_flush(struct cgroup *cgroup)
{
- __cgroup_rstat_flush(&(cgroup->bpf.rstat), &rstat_bpf_ops);
+ __cgroup_rstat_flush(&(cgroup->bpf.rstat), &rstat_bpf_ops,
+ &cgroup_rstat_lock, &cgroup_rstat_cpu_lock);
}
#endif /* CONFIG_CGROUP_BPF */
static void __cgroup_rstat_flush_hold(struct cgroup_rstat *rstat,
- struct cgroup_rstat_ops *ops)
- __acquires(&cgroup_rstat_lock)
+ struct cgroup_rstat_ops *ops, spinlock_t *lock,
+ raw_spinlock_t *cpu_lock)
+ __acquires(lock)
{
struct cgroup *cgrp;
might_sleep();
cgrp = ops->cgroup_fn(rstat);
- __cgroup_rstat_lock(&cgroup_rstat_lock, cgrp, -1);
- cgroup_rstat_flush_locked(rstat, ops);
+ __cgroup_rstat_lock(lock, cgrp, -1);
+ cgroup_rstat_flush_locked(rstat, ops, lock, cpu_lock);
}
/**
@@ -509,7 +518,8 @@ static void __cgroup_rstat_flush_hold(struct cgroup_rstat *rstat,
*/
void cgroup_rstat_flush_hold(struct cgroup_subsys_state *css)
{
- __cgroup_rstat_flush_hold(&css->rstat, &rstat_css_ops);
+ __cgroup_rstat_flush_hold(&css->rstat, &rstat_css_ops,
+ &cgroup_rstat_lock, &cgroup_rstat_cpu_lock);
}
/**
@@ -517,13 +527,13 @@ void cgroup_rstat_flush_hold(struct cgroup_subsys_state *css)
* @rstat: rstat node used to find associated cgroup used by tracepoint
*/
static void __cgroup_rstat_flush_release(struct cgroup_rstat *rstat,
- struct cgroup_rstat_ops *ops)
- __releases(&cgroup_rstat_lock)
+ struct cgroup_rstat_ops *ops, spinlock_t *lock)
+ __releases(lock)
{
struct cgroup *cgrp;
cgrp = ops->cgroup_fn(rstat);
- __cgroup_rstat_unlock(&cgroup_rstat_lock, cgrp, -1);
+ __cgroup_rstat_unlock(lock, cgrp, -1);
}
/**
@@ -532,7 +542,8 @@ static void __cgroup_rstat_flush_release(struct cgroup_rstat *rstat,
*/
void cgroup_rstat_flush_release(struct cgroup_subsys_state *css)
{
- __cgroup_rstat_flush_release(&css->rstat, &rstat_css_ops);
+ __cgroup_rstat_flush_release(&css->rstat, &rstat_css_ops,
+ &cgroup_rstat_lock);
}
static void __cgroup_rstat_init(struct cgroup_rstat *rstat)
@@ -605,7 +616,8 @@ int bpf_cgroup_rstat_init(struct cgroup_bpf *bpf)
void bpf_cgroup_rstat_exit(struct cgroup_bpf *bpf)
{
- __cgroup_rstat_flush(&bpf->rstat, &rstat_bpf_ops);
+ __cgroup_rstat_flush(&bpf->rstat, &rstat_bpf_ops,
+ &cgroup_rstat_lock, &cgroup_rstat_cpu_lock);
__cgroup_rstat_exit(&bpf->rstat);
}
#endif /* CONFIG_CGROUP_BPF */
--
2.48.1
* [PATCH 09/11] cgroup: separate rstat locks for bpf cgroups
2025-02-18 3:14 [PATCH 00/11] cgroup: separate rstat trees JP Kobryn
` (7 preceding siblings ...)
2025-02-18 3:14 ` [PATCH 08/11] cgroup: rstat cpu lock indirection JP Kobryn
@ 2025-02-18 3:14 ` JP Kobryn
2025-02-18 3:14 ` [PATCH 10/11] cgroup: separate rstat locks for subsystems JP Kobryn
` (3 subsequent siblings)
12 siblings, 0 replies; 42+ messages in thread
From: JP Kobryn @ 2025-02-18 3:14 UTC (permalink / raw)
To: shakeel.butt, tj, mhocko, hannes, yosryahmed, akpm
Cc: linux-mm, cgroups, kernel-team
Use new locks with the rstat entities specific to bpf cgroups. Having
these locks avoids contention with subsystems such as memory or io while
updating/flushing bpf cgroup stats.
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
kernel/cgroup/rstat.c | 18 ++++++++++++++----
1 file changed, 14 insertions(+), 4 deletions(-)
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 9f6da3ea3c8c..7d9abfd644ca 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -18,6 +18,11 @@ struct cgroup_rstat_ops {
static DEFINE_SPINLOCK(cgroup_rstat_lock);
static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
+#ifdef CONFIG_CGROUP_BPF
+static DEFINE_SPINLOCK(cgroup_rstat_bpf_lock);
+static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_bpf_cpu_lock);
+#endif /* CONFIG_CGROUP_BPF */
+
static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
static struct cgroup_rstat_cpu *rstat_cpu(struct cgroup_rstat *rstat, int cpu)
@@ -244,7 +249,7 @@ void cgroup_rstat_updated(struct cgroup_subsys_state *css, int cpu)
__bpf_kfunc void bpf_cgroup_rstat_updated(struct cgroup *cgroup, int cpu)
{
__cgroup_rstat_updated(&(cgroup->bpf.rstat), cpu, &rstat_bpf_ops,
- &cgroup_rstat_cpu_lock);
+ &cgroup_rstat_bpf_cpu_lock);
}
#endif /* CONFIG_CGROUP_BPF */
@@ -490,7 +495,7 @@ void cgroup_rstat_flush(struct cgroup_subsys_state *css)
__bpf_kfunc void bpf_cgroup_rstat_flush(struct cgroup *cgroup)
{
__cgroup_rstat_flush(&(cgroup->bpf.rstat), &rstat_bpf_ops,
- &cgroup_rstat_lock, &cgroup_rstat_cpu_lock);
+ &cgroup_rstat_bpf_lock, &cgroup_rstat_bpf_cpu_lock);
}
#endif /* CONFIG_CGROUP_BPF */
@@ -617,7 +622,7 @@ int bpf_cgroup_rstat_init(struct cgroup_bpf *bpf)
void bpf_cgroup_rstat_exit(struct cgroup_bpf *bpf)
{
__cgroup_rstat_flush(&bpf->rstat, &rstat_bpf_ops,
- &cgroup_rstat_lock, &cgroup_rstat_cpu_lock);
+ &cgroup_rstat_bpf_lock, &cgroup_rstat_bpf_cpu_lock);
__cgroup_rstat_exit(&bpf->rstat);
}
#endif /* CONFIG_CGROUP_BPF */
@@ -626,8 +631,13 @@ void __init cgroup_rstat_boot(void)
{
int cpu;
- for_each_possible_cpu(cpu)
+ for_each_possible_cpu(cpu) {
raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu));
+
+#ifdef CONFIG_CGROUP_BPF
+ raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_bpf_cpu_lock, cpu));
+#endif /* CONFIG_CGROUP_BPF */
+ }
}
/*
--
2.48.1
* [PATCH 10/11] cgroup: separate rstat locks for subsystems
2025-02-18 3:14 [PATCH 00/11] cgroup: separate rstat trees JP Kobryn
` (8 preceding siblings ...)
2025-02-18 3:14 ` [PATCH 09/11] cgroup: separate rstat locks for bpf cgroups JP Kobryn
@ 2025-02-18 3:14 ` JP Kobryn
2025-02-22 0:23 ` Shakeel Butt
2025-02-18 3:14 ` [PATCH 11/11] cgroup: separate rstat list pointers from base stats JP Kobryn
` (2 subsequent siblings)
12 siblings, 1 reply; 42+ messages in thread
From: JP Kobryn @ 2025-02-18 3:14 UTC (permalink / raw)
To: shakeel.butt, tj, mhocko, hannes, yosryahmed, akpm
Cc: linux-mm, cgroups, kernel-team
Add new rstat locks for each subsystem. When handling cgroup subsystem
states, distinguish between states associated with formal subsystems
(memory, io, etc.) and the base stats subsystem state (represented by
cgroup::self). This prevents the subsystems from contending with one another
when updating/flushing stats.
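Illustrative only: with this split, a memory controller flush and a base-stat
flush (e.g. for a cpu.stat read) take independent locks and no longer
serialize against each other:

  cgroup_rstat_flush(&memcg->css);       /* cgroup_rstat_subsys_lock[memory_cgrp_id] */
  cgroup_rstat_flush_hold(&cgrp->self);  /* cgroup_rstat_base_lock */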
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
kernel/cgroup/rstat.c | 68 ++++++++++++++++++++++++++++++++++++-------
1 file changed, 58 insertions(+), 10 deletions(-)
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 7d9abfd644ca..93b97bddec9c 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -15,8 +15,11 @@ struct cgroup_rstat_ops {
void (*flush_fn)(struct cgroup_rstat *, int);
};
-static DEFINE_SPINLOCK(cgroup_rstat_lock);
-static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
+static DEFINE_SPINLOCK(cgroup_rstat_base_lock);
+static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_base_cpu_lock);
+
+static spinlock_t cgroup_rstat_subsys_lock[CGROUP_SUBSYS_COUNT];
+static raw_spinlock_t __percpu cgroup_rstat_subsys_cpu_lock[CGROUP_SUBSYS_COUNT];
#ifdef CONFIG_CGROUP_BPF
static DEFINE_SPINLOCK(cgroup_rstat_bpf_lock);
@@ -241,8 +244,14 @@ static void __cgroup_rstat_updated(struct cgroup_rstat *rstat, int cpu,
*/
void cgroup_rstat_updated(struct cgroup_subsys_state *css, int cpu)
{
- __cgroup_rstat_updated(&css->rstat, cpu, &rstat_css_ops,
- &cgroup_rstat_cpu_lock);
+ raw_spinlock_t *cpu_lock;
+
+ if (is_base_css(css))
+ cpu_lock = &cgroup_rstat_base_cpu_lock;
+ else
+ cpu_lock = &cgroup_rstat_subsys_cpu_lock[css->ss->id];
+
+ __cgroup_rstat_updated(&css->rstat, cpu, &rstat_css_ops, cpu_lock);
}
#ifdef CONFIG_CGROUP_BPF
@@ -487,8 +496,19 @@ static void __cgroup_rstat_flush(struct cgroup_rstat *rstat,
*/
void cgroup_rstat_flush(struct cgroup_subsys_state *css)
{
+ spinlock_t *lock;
+ raw_spinlock_t *cpu_lock;
+
+ if (is_base_css(css)) {
+ lock = &cgroup_rstat_base_lock;
+ cpu_lock = &cgroup_rstat_base_cpu_lock;
+ } else {
+ lock = &cgroup_rstat_subsys_lock[css->ss->id];
+ cpu_lock = &cgroup_rstat_subsys_cpu_lock[css->ss->id];
+ }
+
__cgroup_rstat_flush(&css->rstat, &rstat_css_ops,
- &cgroup_rstat_lock, &cgroup_rstat_cpu_lock);
+ lock, cpu_lock);
}
#ifdef CONFIG_CGROUP_BPF
@@ -523,8 +543,19 @@ static void __cgroup_rstat_flush_hold(struct cgroup_rstat *rstat,
*/
void cgroup_rstat_flush_hold(struct cgroup_subsys_state *css)
{
+ spinlock_t *lock;
+ raw_spinlock_t *cpu_lock;
+
+ if (is_base_css(css)) {
+ lock = &cgroup_rstat_base_lock;
+ cpu_lock = &cgroup_rstat_base_cpu_lock;
+ } else {
+ lock = &cgroup_rstat_subsys_lock[css->ss->id];
+ cpu_lock = &cgroup_rstat_subsys_cpu_lock[css->ss->id];
+ }
+
__cgroup_rstat_flush_hold(&css->rstat, &rstat_css_ops,
- &cgroup_rstat_lock, &cgroup_rstat_cpu_lock);
+ lock, cpu_lock);
}
/**
@@ -547,8 +578,14 @@ static void __cgroup_rstat_flush_release(struct cgroup_rstat *rstat,
*/
void cgroup_rstat_flush_release(struct cgroup_subsys_state *css)
{
- __cgroup_rstat_flush_release(&css->rstat, &rstat_css_ops,
- &cgroup_rstat_lock);
+ spinlock_t *lock;
+
+ if (is_base_css(css))
+ lock = &cgroup_rstat_base_lock;
+ else
+ lock = &cgroup_rstat_subsys_lock[css->ss->id];
+
+ __cgroup_rstat_flush_release(&css->rstat, &rstat_css_ops, lock);
}
static void __cgroup_rstat_init(struct cgroup_rstat *rstat)
@@ -629,10 +666,21 @@ void bpf_cgroup_rstat_exit(struct cgroup_bpf *bpf)
void __init cgroup_rstat_boot(void)
{
- int cpu;
+ struct cgroup_subsys *ss;
+ int cpu, ssid;
+
+ for_each_subsys(ss, ssid) {
+ spin_lock_init(&cgroup_rstat_subsys_lock[ssid]);
+
+ for_each_possible_cpu(cpu) {
+ raw_spinlock_t *cpu_lock =
+ per_cpu_ptr(&cgroup_rstat_subsys_cpu_lock[ssid], cpu);
+ raw_spin_lock_init(cpu_lock);
+ }
+ }
for_each_possible_cpu(cpu) {
- raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu));
+ raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_base_cpu_lock, cpu));
#ifdef CONFIG_CGROUP_BPF
raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_bpf_cpu_lock, cpu));
--
2.48.1
* [PATCH 11/11] cgroup: separate rstat list pointers from base stats
2025-02-18 3:14 [PATCH 00/11] cgroup: separate rstat trees JP Kobryn
` (9 preceding siblings ...)
2025-02-18 3:14 ` [PATCH 10/11] cgroup: separate rstat locks for subsystems JP Kobryn
@ 2025-02-18 3:14 ` JP Kobryn
2025-02-22 0:28 ` Shakeel Butt
2025-02-20 15:51 ` [PATCH 00/11] cgroup: separate rstat trees Tejun Heo
2025-02-20 17:26 ` Yosry Ahmed
12 siblings, 1 reply; 42+ messages in thread
From: JP Kobryn @ 2025-02-18 3:14 UTC (permalink / raw)
To: shakeel.butt, tj, mhocko, hannes, yosryahmed, akpm
Cc: linux-mm, cgroups, kernel-team
A majority of the cgroup_rstat_cpu struct size is made up of the base stat
entities. Since only the "self" subsystem state makes use of these, move them
into a struct of their own. This allows for a new compact cgroup_rstat_cpu
struct that the formal subsystems can make use of. At allocation time, choose
between the compact struct and the full struct that includes the base stats,
depending on whether the css being initialized is the base (cgroup::self)
one.
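A sketch of how the two layers are addressed after the split, following the
names used in the hunks below: the list pointers live in the embedded compact
struct, while the base stat fields hang directly off the full struct.

  struct cgroup_rstat_base_cpu *rstatc =
          rstat_base_cpu(&cgrp->self.rstat, cpu);

  /* list pointers are in the embedded compact struct */
  rstatc->self.updated_children = &cgrp->self.rstat;

  /* base stat fields are only allocated for cgroup::self */
  u64_stats_init(&rstatc->bsync);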
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
include/linux/cgroup_rstat.h | 8 +++++-
kernel/cgroup/rstat.c | 55 +++++++++++++++++++++++++++---------
2 files changed, 48 insertions(+), 15 deletions(-)
diff --git a/include/linux/cgroup_rstat.h b/include/linux/cgroup_rstat.h
index 780b826ea364..fc26c0aa91ef 100644
--- a/include/linux/cgroup_rstat.h
+++ b/include/linux/cgroup_rstat.h
@@ -27,7 +27,10 @@ struct cgroup_rstat_cpu;
* resource statistics on top of it - bsync, bstat and last_bstat.
*/
struct cgroup_rstat {
- struct cgroup_rstat_cpu __percpu *rstat_cpu;
+ union {
+ struct cgroup_rstat_cpu __percpu *rstat_cpu;
+ struct cgroup_rstat_base_cpu __percpu *rstat_base_cpu;
+ };
/*
* Add padding to separate the read mostly rstat_cpu and
@@ -60,7 +63,10 @@ struct cgroup_rstat_cpu {
*/
struct cgroup_rstat *updated_children; /* terminated by self */
struct cgroup_rstat *updated_next; /* NULL if not on the list */
+};
+struct cgroup_rstat_base_cpu {
+ struct cgroup_rstat_cpu self;
/*
* ->bsync protects ->bstat. These are the only fields which get
* updated in the hot path.
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 93b97bddec9c..6b14241d0924 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -33,6 +33,12 @@ static struct cgroup_rstat_cpu *rstat_cpu(struct cgroup_rstat *rstat, int cpu)
return per_cpu_ptr(rstat->rstat_cpu, cpu);
}
+static struct cgroup_rstat_base_cpu *rstat_base_cpu(
+ struct cgroup_rstat *rstat, int cpu)
+{
+ return per_cpu_ptr(rstat->rstat_base_cpu, cpu);
+}
+
static inline bool is_base_css(struct cgroup_subsys_state *css)
{
/* css for base stats has no subsystem */
@@ -597,6 +603,18 @@ static void __cgroup_rstat_init(struct cgroup_rstat *rstat)
struct cgroup_rstat_cpu *rstatc = rstat_cpu(rstat, cpu);
rstatc->updated_children = rstat;
+ }
+}
+
+static void __cgroup_rstat_base_init(struct cgroup_rstat *rstat)
+{
+ int cpu;
+
+ /* ->updated_children list is self terminated */
+ for_each_possible_cpu(cpu) {
+ struct cgroup_rstat_base_cpu *rstatc = rstat_base_cpu(rstat, cpu);
+
+ rstatc->self.updated_children = rstat;
u64_stats_init(&rstatc->bsync);
}
}
@@ -607,13 +625,21 @@ int cgroup_rstat_init(struct cgroup_subsys_state *css)
/* the root cgrp has rstat_cpu preallocated */
if (!rstat->rstat_cpu) {
- rstat->rstat_cpu = alloc_percpu(struct cgroup_rstat_cpu);
- if (!rstat->rstat_cpu)
- return -ENOMEM;
+ if (is_base_css(css)) {
+ rstat->rstat_base_cpu = alloc_percpu(struct cgroup_rstat_base_cpu);
+ if (!rstat->rstat_base_cpu)
+ return -ENOMEM;
+
+ __cgroup_rstat_base_init(rstat);
+ } else {
+ rstat->rstat_cpu = alloc_percpu(struct cgroup_rstat_cpu);
+ if (!rstat->rstat_cpu)
+ return -ENOMEM;
+
+ __cgroup_rstat_init(rstat);
+ }
}
- __cgroup_rstat_init(rstat);
-
return 0;
}
@@ -718,9 +744,10 @@ static void cgroup_base_stat_sub(struct cgroup_base_stat *dst_bstat,
static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
{
- struct cgroup_rstat_cpu *rstatc = rstat_cpu(&(cgrp->self.rstat), cpu);
+ struct cgroup_rstat_base_cpu *rstatc = rstat_base_cpu(
+ &(cgrp->self.rstat), cpu);
struct cgroup *parent = cgroup_parent(cgrp);
- struct cgroup_rstat_cpu *prstatc;
+ struct cgroup_rstat_base_cpu *prstatc;
struct cgroup_base_stat delta;
unsigned seq;
@@ -748,25 +775,25 @@ static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
cgroup_base_stat_add(&cgrp->last_bstat, &delta);
delta = rstatc->subtree_bstat;
- prstatc = rstat_cpu(&(parent->self.rstat), cpu);
+ prstatc = rstat_base_cpu(&(parent->self.rstat), cpu);
cgroup_base_stat_sub(&delta, &rstatc->last_subtree_bstat);
cgroup_base_stat_add(&prstatc->subtree_bstat, &delta);
cgroup_base_stat_add(&rstatc->last_subtree_bstat, &delta);
}
}
-static struct cgroup_rstat_cpu *
+static struct cgroup_rstat_base_cpu *
cgroup_base_stat_cputime_account_begin(struct cgroup *cgrp, unsigned long *flags)
{
- struct cgroup_rstat_cpu *rstatc;
+ struct cgroup_rstat_base_cpu *rstatc;
- rstatc = get_cpu_ptr(cgrp->self.rstat.rstat_cpu);
+ rstatc = get_cpu_ptr(cgrp->self.rstat.rstat_base_cpu);
*flags = u64_stats_update_begin_irqsave(&rstatc->bsync);
return rstatc;
}
static void cgroup_base_stat_cputime_account_end(struct cgroup *cgrp,
- struct cgroup_rstat_cpu *rstatc,
+ struct cgroup_rstat_base_cpu *rstatc,
unsigned long flags)
{
u64_stats_update_end_irqrestore(&rstatc->bsync, flags);
@@ -776,7 +803,7 @@ static void cgroup_base_stat_cputime_account_end(struct cgroup *cgrp,
void __cgroup_account_cputime(struct cgroup *cgrp, u64 delta_exec)
{
- struct cgroup_rstat_cpu *rstatc;
+ struct cgroup_rstat_base_cpu *rstatc;
unsigned long flags;
rstatc = cgroup_base_stat_cputime_account_begin(cgrp, &flags);
@@ -787,7 +814,7 @@ void __cgroup_account_cputime(struct cgroup *cgrp, u64 delta_exec)
void __cgroup_account_cputime_field(struct cgroup *cgrp,
enum cpu_usage_stat index, u64 delta_exec)
{
- struct cgroup_rstat_cpu *rstatc;
+ struct cgroup_rstat_base_cpu *rstatc;
unsigned long flags;
rstatc = cgroup_base_stat_cputime_account_begin(cgrp, &flags);
--
2.48.1
* Re: [PATCH 01/11] cgroup: move rstat pointers into struct of their own
2025-02-18 3:14 ` [PATCH 01/11] cgroup: move rstat pointers into struct of their own JP Kobryn
@ 2025-02-19 1:05 ` Shakeel Butt
2025-02-19 1:23 ` Shakeel Butt
2025-02-20 16:53 ` Yosry Ahmed
1 sibling, 1 reply; 42+ messages in thread
From: Shakeel Butt @ 2025-02-19 1:05 UTC (permalink / raw)
To: JP Kobryn
Cc: tj, mhocko, hannes, yosryahmed, akpm, linux-mm, cgroups,
kernel-team
Thanks JP for the awesome work. I am doing a quick first pass and will
do a deeper review later.
On Mon, Feb 17, 2025 at 07:14:38PM -0800, JP Kobryn wrote:
> struct cgroup_freezer_state {
> /* Should the cgroup and its descendants be frozen. */
> bool freeze;
> @@ -517,23 +445,9 @@ struct cgroup {
> struct cgroup *old_dom_cgrp; /* used while enabling threaded */
>
> /* per-cpu recursive resource statistics */
> - struct cgroup_rstat_cpu __percpu *rstat_cpu;
> + struct cgroup_rstat rstat;
> struct list_head rstat_css_list;
You might want to place rstat after rstat_css_list just to keep it
(hopefully) on the same cacheline as before; otherwise this will put
rstat_css_list next to rstat_flush_next, which the current padding is
trying to avoid. This is just to be safe. Later we might want to
reevaluate the padding and cacheline alignment of the fields of struct
cgroup.
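Roughly, a sketch of the ordering I have in mind (illustrative only,
surrounding fields elided):

	struct cgroup {
		/* ... */
		struct list_head rstat_css_list;
		/* per-cpu recursive resource statistics, read mostly */
		struct cgroup_rstat rstat;
		/* ... */
	};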
>
> - /*
> - * Add padding to separate the read mostly rstat_cpu and
> - * rstat_css_list into a different cacheline from the following
> - * rstat_flush_next and *bstat fields which can have frequent updates.
> - */
> - CACHELINE_PADDING(_pad_);
> -
> - /*
> - * A singly-linked list of cgroup structures to be rstat flushed.
> - * This is a scratch field to be used exclusively by
> - * cgroup_rstat_flush_locked() and protected by cgroup_rstat_lock.
> - */
> - struct cgroup *rstat_flush_next;
> -
> /* cgroup basic resource statistics */
> struct cgroup_base_stat last_bstat;
> struct cgroup_base_stat bstat;
> diff --git a/include/linux/cgroup_rstat.h b/include/linux/cgroup_rstat.h
> new file mode 100644
> index 000000000000..f95474d6f8ab
> --- /dev/null
> +++ b/include/linux/cgroup_rstat.h
> @@ -0,0 +1,92 @@
[...]
> +struct cgroup_rstat {
> + struct cgroup_rstat_cpu __percpu *rstat_cpu;
> +
> + /*
> + * Add padding to separate the read mostly rstat_cpu and
> + * rstat_css_list into a different cacheline from the following
> + * rstat_flush_next and containing struct fields which can have
> + * frequent updates.
> + */
> + CACHELINE_PADDING(_pad_);
> + struct cgroup *rstat_flush_next;
> +};
* Re: [PATCH 01/11] cgroup: move rstat pointers into struct of their own
2025-02-19 1:05 ` Shakeel Butt
@ 2025-02-19 1:23 ` Shakeel Butt
0 siblings, 0 replies; 42+ messages in thread
From: Shakeel Butt @ 2025-02-19 1:23 UTC (permalink / raw)
To: JP Kobryn
Cc: tj, mhocko, hannes, yosryahmed, akpm, linux-mm, cgroups,
kernel-team
On Tue, Feb 18, 2025 at 05:05:52PM -0800, Shakeel Butt wrote:
> Thanks JP for the awesome work. I am doing a quick first pass and will
> do a deeper review later.
>
> On Mon, Feb 17, 2025 at 07:14:38PM -0800, JP Kobryn wrote:
> > struct cgroup_freezer_state {
> > /* Should the cgroup and its descendants be frozen. */
> > bool freeze;
> > @@ -517,23 +445,9 @@ struct cgroup {
> > struct cgroup *old_dom_cgrp; /* used while enabling threaded */
> >
> > /* per-cpu recursive resource statistics */
> > - struct cgroup_rstat_cpu __percpu *rstat_cpu;
> > + struct cgroup_rstat rstat;
> > struct list_head rstat_css_list;
>
> You might want to place rstat after rstat_css_list just to keep it
> (hopefully) on the same cacheline as before; otherwise this will put
> rstat_css_list next to rstat_flush_next, which the current padding is
> trying to avoid. This is just to be safe. Later we might want to
> reevaluate the padding and cacheline alignment of the fields of struct
> cgroup.
>
Ah, I see that later you removed rstat_css_list since you moved the rstat
state from cgroup to css and no longer need rstat_css_list.
* Re: [PATCH 02/11] cgroup: add level of indirection for cgroup_rstat struct
2025-02-18 3:14 ` [PATCH 02/11] cgroup: add level of indirection for cgroup_rstat struct JP Kobryn
@ 2025-02-19 2:26 ` Shakeel Butt
2025-02-20 17:08 ` Yosry Ahmed
2025-02-19 5:57 ` kernel test robot
1 sibling, 1 reply; 42+ messages in thread
From: Shakeel Butt @ 2025-02-19 2:26 UTC (permalink / raw)
To: JP Kobryn
Cc: tj, mhocko, hannes, yosryahmed, akpm, linux-mm, cgroups,
kernel-team
On Mon, Feb 17, 2025 at 07:14:39PM -0800, JP Kobryn wrote:
> Change the type of rstat node from cgroup to the new cgroup_rstat
> struct. Then for the rstat updated/flush api calls, add double under
> versions that accept references to the cgroup_rstat struct. This new
> level of indirection will allow for extending the public api further.
> i.e. the cgroup_rstat struct can be embedded in a new type of object and
> a public api can be added for that new type.
>
> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
I think the code looks good here but the commit message needs some
massaging. From what I understand, you are trying to decouple struct
cgroup_rstat from struct cgroup, so that later you can move struct
cgroup_rstat into a different structure (and maybe later some new
structure can also include cgroup_rstat to take advantage of the rstat
infra). I am not sure I would call this "add level of indirection".
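As an illustration of what the decoupling enables (hypothetical type,
not something in this series), any object could embed the rstat node
and then hand its &...->rstat to the double-underscore helpers:

	struct foo {
		struct cgroup_rstat rstat;	/* node in an rstat tree */
		/* foo-specific counters ... */
	};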
* Re: [PATCH 02/11] cgroup: add level of indirection for cgroup_rstat struct
2025-02-18 3:14 ` [PATCH 02/11] cgroup: add level of indirection for cgroup_rstat struct JP Kobryn
2025-02-19 2:26 ` Shakeel Butt
@ 2025-02-19 5:57 ` kernel test robot
1 sibling, 0 replies; 42+ messages in thread
From: kernel test robot @ 2025-02-19 5:57 UTC (permalink / raw)
To: JP Kobryn, shakeel.butt, tj, mhocko, hannes, yosryahmed, akpm
Cc: oe-kbuild-all, linux-mm, cgroups, kernel-team
Hi JP,
kernel test robot noticed the following build warnings:
[auto build test WARNING on tj-cgroup/for-next]
[also build test WARNING on bpf-next/master bpf/master linus/master v6.14-rc3 next-20250218]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/JP-Kobryn/cgroup-move-rstat-pointers-into-struct-of-their-own/20250218-111725
base: https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-next
patch link: https://lore.kernel.org/r/20250218031448.46951-3-inwardvessel%40gmail.com
patch subject: [PATCH 02/11] cgroup: add level of indirection for cgroup_rstat struct
config: arc-randconfig-002-20250219 (https://download.01.org/0day-ci/archive/20250219/202502191307.zqD101Gp-lkp@intel.com/config)
compiler: arc-elf-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250219/202502191307.zqD101Gp-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202502191307.zqD101Gp-lkp@intel.com/
All warnings (new ones prefixed by >>):
kernel/cgroup/rstat.c:416: warning: Function parameter or struct member 'rstat' not described in '__cgroup_rstat_flush_release'
>> kernel/cgroup/rstat.c:416: warning: expecting prototype for cgroup_rstat_flush_release(). Prototype was for __cgroup_rstat_flush_release() instead
vim +416 kernel/cgroup/rstat.c
6162cef0f741c7 Tejun Heo 2018-04-26 409
6162cef0f741c7 Tejun Heo 2018-04-26 410 /**
6162cef0f741c7 Tejun Heo 2018-04-26 411 * cgroup_rstat_flush_release - release cgroup_rstat_flush_hold()
97a46a66ad7d91 Jesper Dangaard Brouer 2024-04-17 412 * @cgrp: cgroup used by tracepoint
6162cef0f741c7 Tejun Heo 2018-04-26 413 */
3d844899ba042a JP Kobryn 2025-02-17 414 static void __cgroup_rstat_flush_release(struct cgroup_rstat *rstat)
0fa294fb1985c0 Tejun Heo 2018-04-26 415 __releases(&cgroup_rstat_lock)
6162cef0f741c7 Tejun Heo 2018-04-26 @416 {
3d844899ba042a JP Kobryn 2025-02-17 417 struct cgroup *cgrp = container_of(rstat, typeof(*cgrp), rstat);
3d844899ba042a JP Kobryn 2025-02-17 418
fc29e04ae1ad4c Jesper Dangaard Brouer 2024-04-16 419 __cgroup_rstat_unlock(cgrp, -1);
6162cef0f741c7 Tejun Heo 2018-04-26 420 }
6162cef0f741c7 Tejun Heo 2018-04-26 421
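One way to quiet these warnings would be to retarget the kerneldoc at
the renamed function and describe its parameter, e.g. (sketch only, not
a posted fix):

	/**
	 * __cgroup_rstat_flush_release - release cgroup_rstat_flush_hold()
	 * @rstat: rstat node of the cgroup used by tracepoints
	 */
	static void __cgroup_rstat_flush_release(struct cgroup_rstat *rstat)
		__releases(&cgroup_rstat_lock)
	{
		struct cgroup *cgrp = container_of(rstat, typeof(*cgrp), rstat);

		__cgroup_rstat_unlock(cgrp, -1);
	}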
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH 04/11] cgroup: introduce cgroup_rstat_ops
2025-02-18 3:14 ` [PATCH 04/11] cgroup: introduce cgroup_rstat_ops JP Kobryn
@ 2025-02-19 7:21 ` kernel test robot
2025-02-20 17:50 ` Shakeel Butt
1 sibling, 0 replies; 42+ messages in thread
From: kernel test robot @ 2025-02-19 7:21 UTC (permalink / raw)
To: JP Kobryn, shakeel.butt, tj, mhocko, hannes, yosryahmed, akpm
Cc: oe-kbuild-all, linux-mm, cgroups, kernel-team
Hi JP,
kernel test robot noticed the following build warnings:
[auto build test WARNING on tj-cgroup/for-next]
[also build test WARNING on bpf-next/master bpf/master linus/master v6.14-rc3 next-20250219]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/JP-Kobryn/cgroup-move-rstat-pointers-into-struct-of-their-own/20250218-111725
base: https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-next
patch link: https://lore.kernel.org/r/20250218031448.46951-5-inwardvessel%40gmail.com
patch subject: [PATCH 04/11] cgroup: introduce cgroup_rstat_ops
config: arc-randconfig-002-20250219 (https://download.01.org/0day-ci/archive/20250219/202502191558.xCTZRkPs-lkp@intel.com/config)
compiler: arc-elf-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250219/202502191558.xCTZRkPs-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202502191558.xCTZRkPs-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> kernel/cgroup/rstat.c:210: warning: Function parameter or struct member 'ops' not described in 'cgroup_rstat_push_children'
>> kernel/cgroup/rstat.c:269: warning: Function parameter or struct member 'ops' not described in 'cgroup_rstat_updated_list'
>> kernel/cgroup/rstat.c:465: warning: Function parameter or struct member 'ops' not described in '__cgroup_rstat_flush_release'
kernel/cgroup/rstat.c:465: warning: expecting prototype for cgroup_rstat_flush_release(). Prototype was for __cgroup_rstat_flush_release() instead
vim +210 kernel/cgroup/rstat.c
3d844899ba042a kernel/cgroup/rstat.c JP Kobryn 2025-02-17 194
041cd640b2f3c5 kernel/cgroup/stat.c Tejun Heo 2017-09-25 195 /**
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 196 * cgroup_rstat_push_children - push children cgroups into the given list
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 197 * @head: current head of the list (= subtree root)
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 198 * @child: first child of the root
041cd640b2f3c5 kernel/cgroup/stat.c Tejun Heo 2017-09-25 199 * @cpu: target cpu
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 200 * Return: A new singly linked list of cgroups to be flush
041cd640b2f3c5 kernel/cgroup/stat.c Tejun Heo 2017-09-25 201 *
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 202 * Iteratively traverse down the cgroup_rstat_cpu updated tree level by
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 203 * level and push all the parents first before their next level children
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 204 * into a singly linked list built from the tail backward like "pushing"
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 205 * cgroups into a stack. The root is pushed by the caller.
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 206 */
3d844899ba042a kernel/cgroup/rstat.c JP Kobryn 2025-02-17 207 static struct cgroup_rstat *cgroup_rstat_push_children(
d67ed623c585f2 kernel/cgroup/rstat.c JP Kobryn 2025-02-17 208 struct cgroup_rstat *head, struct cgroup_rstat *child, int cpu,
d67ed623c585f2 kernel/cgroup/rstat.c JP Kobryn 2025-02-17 209 struct cgroup_rstat_ops *ops)
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 @210 {
3d844899ba042a kernel/cgroup/rstat.c JP Kobryn 2025-02-17 211 struct cgroup_rstat *chead = child; /* Head of child cgroup level */
3d844899ba042a kernel/cgroup/rstat.c JP Kobryn 2025-02-17 212 struct cgroup_rstat *ghead = NULL; /* Head of grandchild cgroup level */
3d844899ba042a kernel/cgroup/rstat.c JP Kobryn 2025-02-17 213 struct cgroup_rstat *parent, *grandchild;
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 214 struct cgroup_rstat_cpu *crstatc;
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 215
3d844899ba042a kernel/cgroup/rstat.c JP Kobryn 2025-02-17 216 child->rstat_flush_next = NULL;
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 217
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 218 next_level:
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 219 while (chead) {
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 220 child = chead;
3d844899ba042a kernel/cgroup/rstat.c JP Kobryn 2025-02-17 221 chead = child->rstat_flush_next;
d67ed623c585f2 kernel/cgroup/rstat.c JP Kobryn 2025-02-17 222 parent = ops->parent_fn(child);
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 223
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 224 /* updated_next is parent cgroup terminated */
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 225 while (child != parent) {
3d844899ba042a kernel/cgroup/rstat.c JP Kobryn 2025-02-17 226 child->rstat_flush_next = head;
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 227 head = child;
3d844899ba042a kernel/cgroup/rstat.c JP Kobryn 2025-02-17 228 crstatc = rstat_cpu(child, cpu);
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 229 grandchild = crstatc->updated_children;
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 230 if (grandchild != child) {
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 231 /* Push the grand child to the next level */
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 232 crstatc->updated_children = child;
3d844899ba042a kernel/cgroup/rstat.c JP Kobryn 2025-02-17 233 grandchild->rstat_flush_next = ghead;
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 234 ghead = grandchild;
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 235 }
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 236 child = crstatc->updated_next;
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 237 crstatc->updated_next = NULL;
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 238 }
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 239 }
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 240
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 241 if (ghead) {
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 242 chead = ghead;
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 243 ghead = NULL;
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 244 goto next_level;
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 245 }
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 246 return head;
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 247 }
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 248
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 249 /**
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 250 * cgroup_rstat_updated_list - return a list of updated cgroups to be flushed
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 251 * @root: root of the cgroup subtree to traverse
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 252 * @cpu: target cpu
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 253 * Return: A singly linked list of cgroups to be flushed
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 254 *
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 255 * Walks the updated rstat_cpu tree on @cpu from @root. During traversal,
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 256 * each returned cgroup is unlinked from the updated tree.
041cd640b2f3c5 kernel/cgroup/stat.c Tejun Heo 2017-09-25 257 *
041cd640b2f3c5 kernel/cgroup/stat.c Tejun Heo 2017-09-25 258 * The only ordering guarantee is that, for a parent and a child pair
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 259 * covered by a given traversal, the child is before its parent in
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 260 * the list.
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 261 *
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 262 * Note that updated_children is self terminated and points to a list of
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 263 * child cgroups if not empty. Whereas updated_next is like a sibling link
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 264 * within the children list and terminated by the parent cgroup. An exception
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 265 * here is the cgroup root whose updated_next can be self terminated.
041cd640b2f3c5 kernel/cgroup/stat.c Tejun Heo 2017-09-25 266 */
3d844899ba042a kernel/cgroup/rstat.c JP Kobryn 2025-02-17 267 static struct cgroup_rstat *cgroup_rstat_updated_list(
d67ed623c585f2 kernel/cgroup/rstat.c JP Kobryn 2025-02-17 268 struct cgroup_rstat *root, int cpu, struct cgroup_rstat_ops *ops)
041cd640b2f3c5 kernel/cgroup/stat.c Tejun Heo 2017-09-25 @269 {
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 270 raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
3d844899ba042a kernel/cgroup/rstat.c JP Kobryn 2025-02-17 271 struct cgroup_rstat_cpu *rstatc = rstat_cpu(root, cpu);
3d844899ba042a kernel/cgroup/rstat.c JP Kobryn 2025-02-17 272 struct cgroup_rstat *head = NULL, *parent, *child;
d67ed623c585f2 kernel/cgroup/rstat.c JP Kobryn 2025-02-17 273 struct cgroup *cgrp;
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 274 unsigned long flags;
041cd640b2f3c5 kernel/cgroup/stat.c Tejun Heo 2017-09-25 275
d67ed623c585f2 kernel/cgroup/rstat.c JP Kobryn 2025-02-17 276 cgrp = ops->cgroup_fn(root);
3d844899ba042a kernel/cgroup/rstat.c JP Kobryn 2025-02-17 277 flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, cgrp, false);
041cd640b2f3c5 kernel/cgroup/stat.c Tejun Heo 2017-09-25 278
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 279 /* Return NULL if this subtree is not on-list */
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 280 if (!rstatc->updated_next)
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 281 goto unlock_ret;
041cd640b2f3c5 kernel/cgroup/stat.c Tejun Heo 2017-09-25 282
041cd640b2f3c5 kernel/cgroup/stat.c Tejun Heo 2017-09-25 283 /*
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 284 * Unlink @root from its parent. As the updated_children list is
041cd640b2f3c5 kernel/cgroup/stat.c Tejun Heo 2017-09-25 285 * singly linked, we have to walk it to find the removal point.
041cd640b2f3c5 kernel/cgroup/stat.c Tejun Heo 2017-09-25 286 */
d67ed623c585f2 kernel/cgroup/rstat.c JP Kobryn 2025-02-17 287 parent = ops->parent_fn(root);
dc26532aed0ab2 kernel/cgroup/rstat.c Johannes Weiner 2021-04-29 288 if (parent) {
dc26532aed0ab2 kernel/cgroup/rstat.c Johannes Weiner 2021-04-29 289 struct cgroup_rstat_cpu *prstatc;
3d844899ba042a kernel/cgroup/rstat.c JP Kobryn 2025-02-17 290 struct cgroup_rstat **nextp;
041cd640b2f3c5 kernel/cgroup/stat.c Tejun Heo 2017-09-25 291
3d844899ba042a kernel/cgroup/rstat.c JP Kobryn 2025-02-17 292 prstatc = rstat_cpu(parent, cpu);
c58632b3631cb2 kernel/cgroup/rstat.c Tejun Heo 2018-04-26 293 nextp = &prstatc->updated_children;
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 294 while (*nextp != root) {
dc26532aed0ab2 kernel/cgroup/rstat.c Johannes Weiner 2021-04-29 295 struct cgroup_rstat_cpu *nrstatc;
dc26532aed0ab2 kernel/cgroup/rstat.c Johannes Weiner 2021-04-29 296
3d844899ba042a kernel/cgroup/rstat.c JP Kobryn 2025-02-17 297 nrstatc = rstat_cpu(*nextp, cpu);
041cd640b2f3c5 kernel/cgroup/stat.c Tejun Heo 2017-09-25 298 WARN_ON_ONCE(*nextp == parent);
c58632b3631cb2 kernel/cgroup/rstat.c Tejun Heo 2018-04-26 299 nextp = &nrstatc->updated_next;
041cd640b2f3c5 kernel/cgroup/stat.c Tejun Heo 2017-09-25 300 }
c58632b3631cb2 kernel/cgroup/rstat.c Tejun Heo 2018-04-26 301 *nextp = rstatc->updated_next;
dc26532aed0ab2 kernel/cgroup/rstat.c Johannes Weiner 2021-04-29 302 }
9a9e97b2f1f27e kernel/cgroup/rstat.c Tejun Heo 2018-04-26 303
dc26532aed0ab2 kernel/cgroup/rstat.c Johannes Weiner 2021-04-29 304 rstatc->updated_next = NULL;
e76d28bdf9ba53 kernel/cgroup/rstat.c Waiman Long 2023-11-03 305
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 306 /* Push @root to the list first before pushing the children */
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 307 head = root;
3d844899ba042a kernel/cgroup/rstat.c JP Kobryn 2025-02-17 308 root->rstat_flush_next = NULL;
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 309 child = rstatc->updated_children;
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 310 rstatc->updated_children = root;
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 311 if (child != root)
d67ed623c585f2 kernel/cgroup/rstat.c JP Kobryn 2025-02-17 312 head = cgroup_rstat_push_children(head, child, cpu, ops);
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 313 unlock_ret:
3d844899ba042a kernel/cgroup/rstat.c JP Kobryn 2025-02-17 314 _cgroup_rstat_cpu_unlock(cpu_lock, cpu, cgrp, flags, false);
e76d28bdf9ba53 kernel/cgroup/rstat.c Waiman Long 2023-11-03 315 return head;
041cd640b2f3c5 kernel/cgroup/stat.c Tejun Heo 2017-09-25 316 }
041cd640b2f3c5 kernel/cgroup/stat.c Tejun Heo 2017-09-25 317
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH 08/11] cgroup: rstat cpu lock indirection
2025-02-18 3:14 ` [PATCH 08/11] cgroup: rstat cpu lock indirection JP Kobryn
@ 2025-02-19 8:48 ` kernel test robot
2025-02-22 0:18 ` Shakeel Butt
1 sibling, 0 replies; 42+ messages in thread
From: kernel test robot @ 2025-02-19 8:48 UTC (permalink / raw)
To: JP Kobryn, shakeel.butt, tj, mhocko, hannes, yosryahmed, akpm
Cc: oe-kbuild-all, linux-mm, cgroups, kernel-team
Hi JP,
kernel test robot noticed the following build warnings:
[auto build test WARNING on tj-cgroup/for-next]
[also build test WARNING on bpf-next/master bpf/master linus/master v6.14-rc3 next-20250219]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/JP-Kobryn/cgroup-move-rstat-pointers-into-struct-of-their-own/20250218-111725
base: https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-next
patch link: https://lore.kernel.org/r/20250218031448.46951-9-inwardvessel%40gmail.com
patch subject: [PATCH 08/11] cgroup: rstat cpu lock indirection
config: arc-randconfig-002-20250219 (https://download.01.org/0day-ci/archive/20250219/202502191619.0t8nOsuQ-lkp@intel.com/config)
compiler: arc-elf-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250219/202502191619.0t8nOsuQ-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202502191619.0t8nOsuQ-lkp@intel.com/
All warnings (new ones prefixed by >>):
kernel/cgroup/rstat.c:266: warning: Function parameter or struct member 'ops' not described in 'cgroup_rstat_push_children'
kernel/cgroup/rstat.c:326: warning: Function parameter or struct member 'ops' not described in 'cgroup_rstat_updated_list'
>> kernel/cgroup/rstat.c:326: warning: Function parameter or struct member 'cpu_lock' not described in 'cgroup_rstat_updated_list'
kernel/cgroup/rstat.c:532: warning: Function parameter or struct member 'ops' not described in '__cgroup_rstat_flush_release'
>> kernel/cgroup/rstat.c:532: warning: Function parameter or struct member 'lock' not described in '__cgroup_rstat_flush_release'
kernel/cgroup/rstat.c:532: warning: expecting prototype for cgroup_rstat_flush_release(). Prototype was for __cgroup_rstat_flush_release() instead
vim +326 kernel/cgroup/rstat.c
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 304
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 305 /**
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 306 * cgroup_rstat_updated_list - return a list of updated cgroups to be flushed
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 307 * @root: root of the cgroup subtree to traverse
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 308 * @cpu: target cpu
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 309 * Return: A singly linked list of cgroups to be flushed
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 310 *
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 311 * Walks the updated rstat_cpu tree on @cpu from @root. During traversal,
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 312 * each returned cgroup is unlinked from the updated tree.
041cd640b2f3c5 kernel/cgroup/stat.c Tejun Heo 2017-09-25 313 *
041cd640b2f3c5 kernel/cgroup/stat.c Tejun Heo 2017-09-25 314 * The only ordering guarantee is that, for a parent and a child pair
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 315 * covered by a given traversal, the child is before its parent in
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 316 * the list.
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 317 *
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 318 * Note that updated_children is self terminated and points to a list of
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 319 * child cgroups if not empty. Whereas updated_next is like a sibling link
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 320 * within the children list and terminated by the parent cgroup. An exception
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 321 * here is the cgroup root whose updated_next can be self terminated.
041cd640b2f3c5 kernel/cgroup/stat.c Tejun Heo 2017-09-25 322 */
3d844899ba042a kernel/cgroup/rstat.c JP Kobryn 2025-02-17 323 static struct cgroup_rstat *cgroup_rstat_updated_list(
85c7ff288b9391 kernel/cgroup/rstat.c JP Kobryn 2025-02-17 324 struct cgroup_rstat *root, int cpu, struct cgroup_rstat_ops *ops,
85c7ff288b9391 kernel/cgroup/rstat.c JP Kobryn 2025-02-17 325 raw_spinlock_t *cpu_lock)
041cd640b2f3c5 kernel/cgroup/stat.c Tejun Heo 2017-09-25 @326 {
3d844899ba042a kernel/cgroup/rstat.c JP Kobryn 2025-02-17 327 struct cgroup_rstat_cpu *rstatc = rstat_cpu(root, cpu);
3d844899ba042a kernel/cgroup/rstat.c JP Kobryn 2025-02-17 328 struct cgroup_rstat *head = NULL, *parent, *child;
d67ed623c585f2 kernel/cgroup/rstat.c JP Kobryn 2025-02-17 329 struct cgroup *cgrp;
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 330 unsigned long flags;
041cd640b2f3c5 kernel/cgroup/stat.c Tejun Heo 2017-09-25 331
d67ed623c585f2 kernel/cgroup/rstat.c JP Kobryn 2025-02-17 332 cgrp = ops->cgroup_fn(root);
85c7ff288b9391 kernel/cgroup/rstat.c JP Kobryn 2025-02-17 333 flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, cgrp, false);
041cd640b2f3c5 kernel/cgroup/stat.c Tejun Heo 2017-09-25 334
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 335 /* Return NULL if this subtree is not on-list */
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 336 if (!rstatc->updated_next)
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 337 goto unlock_ret;
041cd640b2f3c5 kernel/cgroup/stat.c Tejun Heo 2017-09-25 338
041cd640b2f3c5 kernel/cgroup/stat.c Tejun Heo 2017-09-25 339 /*
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 340 * Unlink @root from its parent. As the updated_children list is
041cd640b2f3c5 kernel/cgroup/stat.c Tejun Heo 2017-09-25 341 * singly linked, we have to walk it to find the removal point.
041cd640b2f3c5 kernel/cgroup/stat.c Tejun Heo 2017-09-25 342 */
d67ed623c585f2 kernel/cgroup/rstat.c JP Kobryn 2025-02-17 343 parent = ops->parent_fn(root);
dc26532aed0ab2 kernel/cgroup/rstat.c Johannes Weiner 2021-04-29 344 if (parent) {
dc26532aed0ab2 kernel/cgroup/rstat.c Johannes Weiner 2021-04-29 345 struct cgroup_rstat_cpu *prstatc;
3d844899ba042a kernel/cgroup/rstat.c JP Kobryn 2025-02-17 346 struct cgroup_rstat **nextp;
041cd640b2f3c5 kernel/cgroup/stat.c Tejun Heo 2017-09-25 347
3d844899ba042a kernel/cgroup/rstat.c JP Kobryn 2025-02-17 348 prstatc = rstat_cpu(parent, cpu);
c58632b3631cb2 kernel/cgroup/rstat.c Tejun Heo 2018-04-26 349 nextp = &prstatc->updated_children;
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 350 while (*nextp != root) {
dc26532aed0ab2 kernel/cgroup/rstat.c Johannes Weiner 2021-04-29 351 struct cgroup_rstat_cpu *nrstatc;
dc26532aed0ab2 kernel/cgroup/rstat.c Johannes Weiner 2021-04-29 352
3d844899ba042a kernel/cgroup/rstat.c JP Kobryn 2025-02-17 353 nrstatc = rstat_cpu(*nextp, cpu);
041cd640b2f3c5 kernel/cgroup/stat.c Tejun Heo 2017-09-25 354 WARN_ON_ONCE(*nextp == parent);
c58632b3631cb2 kernel/cgroup/rstat.c Tejun Heo 2018-04-26 355 nextp = &nrstatc->updated_next;
041cd640b2f3c5 kernel/cgroup/stat.c Tejun Heo 2017-09-25 356 }
c58632b3631cb2 kernel/cgroup/rstat.c Tejun Heo 2018-04-26 357 *nextp = rstatc->updated_next;
dc26532aed0ab2 kernel/cgroup/rstat.c Johannes Weiner 2021-04-29 358 }
9a9e97b2f1f27e kernel/cgroup/rstat.c Tejun Heo 2018-04-26 359
dc26532aed0ab2 kernel/cgroup/rstat.c Johannes Weiner 2021-04-29 360 rstatc->updated_next = NULL;
e76d28bdf9ba53 kernel/cgroup/rstat.c Waiman Long 2023-11-03 361
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 362 /* Push @root to the list first before pushing the children */
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 363 head = root;
3d844899ba042a kernel/cgroup/rstat.c JP Kobryn 2025-02-17 364 root->rstat_flush_next = NULL;
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 365 child = rstatc->updated_children;
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 366 rstatc->updated_children = root;
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 367 if (child != root)
d67ed623c585f2 kernel/cgroup/rstat.c JP Kobryn 2025-02-17 368 head = cgroup_rstat_push_children(head, child, cpu, ops);
d499fd418fa159 kernel/cgroup/rstat.c Waiman Long 2023-11-30 369 unlock_ret:
85c7ff288b9391 kernel/cgroup/rstat.c JP Kobryn 2025-02-17 370 _cgroup_rstat_cpu_unlock(cpu_lock, cpu, cgrp, flags, false);
e76d28bdf9ba53 kernel/cgroup/rstat.c Waiman Long 2023-11-03 371 return head;
041cd640b2f3c5 kernel/cgroup/stat.c Tejun Heo 2017-09-25 372 }
041cd640b2f3c5 kernel/cgroup/stat.c Tejun Heo 2017-09-25 373
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH 00/11] cgroup: separate rstat trees
2025-02-18 3:14 [PATCH 00/11] cgroup: separate rstat trees JP Kobryn
` (10 preceding siblings ...)
2025-02-18 3:14 ` [PATCH 11/11] cgroup: separate rstat list pointers from base stats JP Kobryn
@ 2025-02-20 15:51 ` Tejun Heo
2025-02-27 23:44 ` JP Kobryn
2025-02-20 17:26 ` Yosry Ahmed
12 siblings, 1 reply; 42+ messages in thread
From: Tejun Heo @ 2025-02-20 15:51 UTC (permalink / raw)
To: JP Kobryn
Cc: shakeel.butt, mhocko, hannes, yosryahmed, akpm, linux-mm, cgroups,
kernel-team
Hello,
On Mon, Feb 17, 2025 at 07:14:37PM -0800, JP Kobryn wrote:
...
> The first experiment consisted of a parent cgroup with memory.swap.max=0
> and memory.max=1G. On a 52-cpu machine, 26 child cgroups were created and
> within each child cgroup a process was spawned to encourage the updating of
> memory cgroup stats by creating and then reading a file of size 1T
> (encouraging reclaim). These 26 tasks were run in parallel. While this was
> going on, a custom program was used to open cpu.stat file of the parent
> cgroup, read the entire file 1M times, then close it. The perf report for
> the task performing the reading showed that most of the cycles (42%) were
> spent on the function mem_cgroup_css_rstat_flush() of the control side. It
> also showed a smaller but significant number of cycles spent in
> __blkcg_rstat_flush. The perf report for patched kernel differed in that no
> cycles were spent in these functions. Instead most cycles were spent on
> cgroup_base_stat_flush(). Aside from the perf reports, the amount of time
> spent running the program performing the reading of cpu.stat showed a gain
> when comparing the control to the experimental kernel. The time in kernel
> mode was reduced.
>
> before:
> real 0m18.449s
> user 0m0.209s
> sys 0m18.165s
>
> after:
> real 0m6.080s
> user 0m0.170s
> sys 0m5.890s
>
> Another experiment on the same host was setup using a parent cgroup with
> two child cgroups. The same swap and memory max were used as the previous
> experiment. In the two child cgroups, kernel builds were done in parallel,
> each using "-j 20". The program from the previous experiment was used to
> perform 1M reads of the parent cpu.stat file. The perf comparison showed
> similar results as the previous experiment. For the control side, a
> majority of cycles (42%) on mem_cgroup_css_rstat_flush() and significant
> cycles in __blkcg_rstat_flush(). On the experimental side, most cycles were
> spent on cgroup_base_stat_flush() and no cycles were spent flushing memory
> or io. As for the time taken by the program reading cpu.stat, measurements
> are shown below.
>
> before:
> real 0m17.223s
> user 0m0.259s
> sys 0m16.871s
>
> after:
> real 0m6.498s
> user 0m0.237s
> sys 0m6.220s
>
> For the final experiment, perf events were recorded during a kernel build
> with the same host and cgroup setup. The builds took place in the child
> node. Control and experimental sides both showed similar cycles spent
> on cgroup_rstat_updated(), which appeared insignificant compared to the
> other events recorded with the workload.
One of the reasons why the original design used one rstat tree is that
readers, in addition to writers, can often be correlated too - e.g. you'd
often have periodic monitoring tools which poll all the major stat files
periodically. Splitting the trees will likely make those at least a bit
worse. Can you test how much worse that'd be? i.e. repeat the above tests
but read all the major stat files - cgroup.stat, cpu.stat, memory.stat and
io.stat.
Thanks.
--
tejun
* Re: [PATCH 01/11] cgroup: move rstat pointers into struct of their own
2025-02-18 3:14 ` [PATCH 01/11] cgroup: move rstat pointers into struct of their own JP Kobryn
2025-02-19 1:05 ` Shakeel Butt
@ 2025-02-20 16:53 ` Yosry Ahmed
2025-02-24 17:06 ` JP Kobryn
1 sibling, 1 reply; 42+ messages in thread
From: Yosry Ahmed @ 2025-02-20 16:53 UTC (permalink / raw)
To: JP Kobryn
Cc: shakeel.butt, tj, mhocko, hannes, akpm, linux-mm, cgroups,
kernel-team
On Mon, Feb 17, 2025 at 07:14:38PM -0800, JP Kobryn wrote:
> The rstat infrastructure makes use of pointers for list management.
> These pointers only exist as fields in the cgroup struct, so moving them
> into their own struct will allow them to be used elsewhere. The base
> stat entities are included with them for now.
>
> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
> ---
> include/linux/cgroup-defs.h | 90 +-----------------
> include/linux/cgroup_rstat.h | 92 +++++++++++++++++++
> kernel/cgroup/cgroup.c | 3 +-
> kernel/cgroup/rstat.c | 27 +++---
> .../selftests/bpf/progs/btf_type_tag_percpu.c | 4 +-
> 5 files changed, 112 insertions(+), 104 deletions(-)
> create mode 100644 include/linux/cgroup_rstat.h
>
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index 1b20d2d8ef7c..6b6cc027fe70 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -17,7 +17,7 @@
> #include <linux/refcount.h>
> #include <linux/percpu-refcount.h>
> #include <linux/percpu-rwsem.h>
> -#include <linux/u64_stats_sync.h>
> +#include <linux/cgroup_rstat.h>
> #include <linux/workqueue.h>
> #include <linux/bpf-cgroup-defs.h>
> #include <linux/psi_types.h>
> @@ -321,78 +321,6 @@ struct css_set {
> struct rcu_head rcu_head;
> };
>
> -struct cgroup_base_stat {
> - struct task_cputime cputime;
> -
> -#ifdef CONFIG_SCHED_CORE
> - u64 forceidle_sum;
> -#endif
> - u64 ntime;
> -};
> -
> -/*
> - * rstat - cgroup scalable recursive statistics. Accounting is done
> - * per-cpu in cgroup_rstat_cpu which is then lazily propagated up the
> - * hierarchy on reads.
> - *
> - * When a stat gets updated, the cgroup_rstat_cpu and its ancestors are
> - * linked into the updated tree. On the following read, propagation only
> - * considers and consumes the updated tree. This makes reading O(the
> - * number of descendants which have been active since last read) instead of
> - * O(the total number of descendants).
> - *
> - * This is important because there can be a lot of (draining) cgroups which
> - * aren't active and stat may be read frequently. The combination can
> - * become very expensive. By propagating selectively, increasing reading
> - * frequency decreases the cost of each read.
> - *
> - * This struct hosts both the fields which implement the above -
> - * updated_children and updated_next - and the fields which track basic
> - * resource statistics on top of it - bsync, bstat and last_bstat.
> - */
> -struct cgroup_rstat_cpu {
> - /*
> - * ->bsync protects ->bstat. These are the only fields which get
> - * updated in the hot path.
> - */
> - struct u64_stats_sync bsync;
> - struct cgroup_base_stat bstat;
> -
> - /*
> - * Snapshots at the last reading. These are used to calculate the
> - * deltas to propagate to the global counters.
> - */
> - struct cgroup_base_stat last_bstat;
> -
> - /*
> - * This field is used to record the cumulative per-cpu time of
> - * the cgroup and its descendants. Currently it can be read via
> - * eBPF/drgn etc, and we are still trying to determine how to
> - * expose it in the cgroupfs interface.
> - */
> - struct cgroup_base_stat subtree_bstat;
> -
> - /*
> - * Snapshots at the last reading. These are used to calculate the
> - * deltas to propagate to the per-cpu subtree_bstat.
> - */
> - struct cgroup_base_stat last_subtree_bstat;
> -
> - /*
> - * Child cgroups with stat updates on this cpu since the last read
> - * are linked on the parent's ->updated_children through
> - * ->updated_next.
> - *
> - * In addition to being more compact, singly-linked list pointing
> - * to the cgroup makes it unnecessary for each per-cpu struct to
> - * point back to the associated cgroup.
> - *
> - * Protected by per-cpu cgroup_rstat_cpu_lock.
> - */
> - struct cgroup *updated_children; /* terminated by self cgroup */
> - struct cgroup *updated_next; /* NULL iff not on the list */
> -};
> -
> struct cgroup_freezer_state {
> /* Should the cgroup and its descendants be frozen. */
> bool freeze;
> @@ -517,23 +445,9 @@ struct cgroup {
> struct cgroup *old_dom_cgrp; /* used while enabling threaded */
>
> /* per-cpu recursive resource statistics */
> - struct cgroup_rstat_cpu __percpu *rstat_cpu;
> + struct cgroup_rstat rstat;
> struct list_head rstat_css_list;
>
> - /*
> - * Add padding to separate the read mostly rstat_cpu and
> - * rstat_css_list into a different cacheline from the following
> - * rstat_flush_next and *bstat fields which can have frequent updates.
> - */
> - CACHELINE_PADDING(_pad_);
> -
> - /*
> - * A singly-linked list of cgroup structures to be rstat flushed.
> - * This is a scratch field to be used exclusively by
> - * cgroup_rstat_flush_locked() and protected by cgroup_rstat_lock.
> - */
> - struct cgroup *rstat_flush_next;
> -
> /* cgroup basic resource statistics */
> struct cgroup_base_stat last_bstat;
> struct cgroup_base_stat bstat;
> diff --git a/include/linux/cgroup_rstat.h b/include/linux/cgroup_rstat.h
> new file mode 100644
> index 000000000000..f95474d6f8ab
> --- /dev/null
> +++ b/include/linux/cgroup_rstat.h
> @@ -0,0 +1,92 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_RSTAT_H
> +#define _LINUX_RSTAT_H
> +
> +#include <linux/u64_stats_sync.h>
> +
> +struct cgroup_rstat_cpu;
Why do we need the forward declaration instead of just defining struct
cgroup_rstat_cpu first? Also, why do we need a new header for these
definitions rather than just adding struct cgroup_rstat to
cgroup-defs.h?
> +
> +/*
> + * rstat - cgroup scalable recursive statistics. Accounting is done
> + * per-cpu in cgroup_rstat_cpu which is then lazily propagated up the
> + * hierarchy on reads.
> + *
> + * When a stat gets updated, the cgroup_rstat_cpu and its ancestors are
> + * linked into the updated tree. On the following read, propagation only
> + * considers and consumes the updated tree. This makes reading O(the
> + * number of descendants which have been active since last read) instead of
> + * O(the total number of descendants).
> + *
> + * This is important because there can be a lot of (draining) cgroups which
> + * aren't active and stat may be read frequently. The combination can
> + * become very expensive. By propagating selectively, increasing reading
> + * frequency decreases the cost of each read.
> + *
> + * This struct hosts both the fields which implement the above -
> + * updated_children and updated_next - and the fields which track basic
> + * resource statistics on top of it - bsync, bstat and last_bstat.
> + */
> +struct cgroup_rstat {
> + struct cgroup_rstat_cpu __percpu *rstat_cpu;
> +
> + /*
> + * Add padding to separate the read mostly rstat_cpu and
> + * rstat_css_list into a different cacheline from the following
> + * rstat_flush_next and containing struct fields which can have
> + * frequent updates.
> + */
> + CACHELINE_PADDING(_pad_);
> + struct cgroup *rstat_flush_next;
> +};
> +
> +struct cgroup_base_stat {
> + struct task_cputime cputime;
> +
> +#ifdef CONFIG_SCHED_CORE
> + u64 forceidle_sum;
> +#endif
> + u64 ntime;
> +};
> +
> +struct cgroup_rstat_cpu {
> + /*
> + * Child cgroups with stat updates on this cpu since the last read
> + * are linked on the parent's ->updated_children through
> + * ->updated_next.
> + *
> + * In addition to being more compact, singly-linked list pointing
> + * to the cgroup makes it unnecessary for each per-cpu struct to
> + * point back to the associated cgroup.
> + */
> + struct cgroup *updated_children; /* terminated by self */
> + struct cgroup *updated_next; /* NULL if not on the list */
> +
> + /*
> + * ->bsync protects ->bstat. These are the only fields which get
> + * updated in the hot path.
> + */
> + struct u64_stats_sync bsync;
> + struct cgroup_base_stat bstat;
> +
> + /*
> + * Snapshots at the last reading. These are used to calculate the
> + * deltas to propagate to the global counters.
> + */
> + struct cgroup_base_stat last_bstat;
> +
> + /*
> + * This field is used to record the cumulative per-cpu time of
> + * the cgroup and its descendants. Currently it can be read via
> + * eBPF/drgn etc, and we are still trying to determine how to
> + * expose it in the cgroupfs interface.
> + */
> + struct cgroup_base_stat subtree_bstat;
> +
> + /*
> + * Snapshots at the last reading. These are used to calculate the
> + * deltas to propagate to the per-cpu subtree_bstat.
> + */
> + struct cgroup_base_stat last_subtree_bstat;
> +};
> +
> +#endif /* _LINUX_RSTAT_H */
* Re: [PATCH 03/11] cgroup: move cgroup_rstat from cgroup to cgroup_subsys_state
2025-02-18 3:14 ` [PATCH 03/11] cgroup: move cgroup_rstat from cgroup to cgroup_subsys_state JP Kobryn
@ 2025-02-20 17:06 ` Shakeel Butt
2025-02-20 17:22 ` Yosry Ahmed
0 siblings, 1 reply; 42+ messages in thread
From: Shakeel Butt @ 2025-02-20 17:06 UTC (permalink / raw)
To: JP Kobryn
Cc: tj, mhocko, hannes, yosryahmed, akpm, linux-mm, cgroups,
kernel-team
On Mon, Feb 17, 2025 at 07:14:40PM -0800, JP Kobryn wrote:
[...]
> @@ -3240,6 +3234,12 @@ static int cgroup_apply_control_enable(struct cgroup *cgrp)
> css = css_create(dsct, ss);
> if (IS_ERR(css))
> return PTR_ERR(css);
Since rstat is part of css, why not cgroup_rstat_init() inside
css_create()?
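i.e. something like (sketch only; the error label is hypothetical):

	/* in css_create(), once the css is otherwise set up */
	if (css->ss && css->ss->css_rstat_flush) {
		err = cgroup_rstat_init(css);
		if (err)
			goto err_free_css;	/* hypothetical unwind label */
	}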
> +
> + if (css->ss && css->ss->css_rstat_flush) {
> + ret = cgroup_rstat_init(css);
> + if (ret)
> + goto err_out;
> + }
> }
>
> WARN_ON_ONCE(percpu_ref_is_dying(&css->refcnt));
> @@ -3253,6 +3253,21 @@ static int cgroup_apply_control_enable(struct cgroup *cgrp)
> }
>
> return 0;
Why not the following cleanup in css_kill()? If you handle it in
css_kill(), you don't need this special handling.
> +
> +err_out:
> + cgroup_for_each_live_descendant_pre(dsct, d_css, cgrp) {
> + for_each_subsys(ss, ssid) {
> + struct cgroup_subsys_state *css = cgroup_css(dsct, ss);
> +
> + if (!(cgroup_ss_mask(dsct) & (1 << ss->id)))
> + continue;
> +
> + if (css && css->ss && css->ss->css_rstat_flush)
> + cgroup_rstat_exit(css);
> + }
> + }
> +
> + return ret;
> }
* Re: [PATCH 02/11] cgroup: add level of indirection for cgroup_rstat struct
2025-02-19 2:26 ` Shakeel Butt
@ 2025-02-20 17:08 ` Yosry Ahmed
0 siblings, 0 replies; 42+ messages in thread
From: Yosry Ahmed @ 2025-02-20 17:08 UTC (permalink / raw)
To: Shakeel Butt
Cc: JP Kobryn, tj, mhocko, hannes, akpm, linux-mm, cgroups,
kernel-team
On Tue, Feb 18, 2025 at 06:26:07PM -0800, Shakeel Butt wrote:
> On Mon, Feb 17, 2025 at 07:14:39PM -0800, JP Kobryn wrote:
> > Change the type of rstat node from cgroup to the new cgroup_rstat
> > struct. Then for the rstat updated/flush api calls, add double under
> > versions that accept references to the cgroup_rstat struct. This new
> > level of indirection will allow for extending the public api further.
> > i.e. the cgroup_rstat struct can be embedded in a new type of object and
> > a public api can be added for that new type.
> >
> > Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
>
> I think the code looks good here but the commit message needs some
> massaging. From what I understand, you are trying to decouple struct
> cgroup_rstat from struct cgroup, so that later you can move struct
> cgroup_rstat into a different structure (and maybe later some new
> structure can also include cgroup_rstat to take advantage of the rstat
> infra). I am not sure I would call this "add level of indirection".
+1 here and probably in other patches too. "Add level of indirection"
sounds like we are dereferencing one more pointer, which is not the case
here.
* Re: [PATCH 03/11] cgroup: move cgroup_rstat from cgroup to cgroup_subsys_state
2025-02-20 17:06 ` Shakeel Butt
@ 2025-02-20 17:22 ` Yosry Ahmed
2025-02-25 19:20 ` JP Kobryn
0 siblings, 1 reply; 42+ messages in thread
From: Yosry Ahmed @ 2025-02-20 17:22 UTC (permalink / raw)
To: Shakeel Butt
Cc: JP Kobryn, tj, mhocko, hannes, akpm, linux-mm, cgroups,
kernel-team
On Thu, Feb 20, 2025 at 09:06:44AM -0800, Shakeel Butt wrote:
> On Mon, Feb 17, 2025 at 07:14:40PM -0800, JP Kobryn wrote:
> [...]
> > @@ -3240,6 +3234,12 @@ static int cgroup_apply_control_enable(struct cgroup *cgrp)
> > css = css_create(dsct, ss);
> > if (IS_ERR(css))
> > return PTR_ERR(css);
>
> Since rstat is part of css, why not cgroup_rstat_init() inside
> css_create()?
>
> > +
> > + if (css->ss && css->ss->css_rstat_flush) {
> > + ret = cgroup_rstat_init(css);
> > + if (ret)
> > + goto err_out;
> > + }
> > }
> >
> > WARN_ON_ONCE(percpu_ref_is_dying(&css->refcnt));
> > @@ -3253,6 +3253,21 @@ static int cgroup_apply_control_enable(struct cgroup *cgrp)
> > }
> >
> > return 0;
>
> Why not the following cleanup in css_kill()? If you handle it in
> css_kill(), you don't need this special handling.
Also, I don't think we are currently calling cgroup_rstat_exit() for
every css when it is destroyed, so moving the cleanup to css_kill()
should handle this as well.
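Roughly (sketch, mirroring the init-side check):

	/* in the css kill/free path */
	if (css->ss && css->ss->css_rstat_flush)
		cgroup_rstat_exit(css);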
>
> > +
> > +err_out:
> > + cgroup_for_each_live_descendant_pre(dsct, d_css, cgrp) {
> > + for_each_subsys(ss, ssid) {
> > + struct cgroup_subsys_state *css = cgroup_css(dsct, ss);
> > +
> > + if (!(cgroup_ss_mask(dsct) & (1 << ss->id)))
> > + continue;
> > +
> > + if (css && css->ss && css->ss->css_rstat_flush)
> > + cgroup_rstat_exit(css);
> > + }
> > + }
> > +
> > + return ret;
> > }
>
* Re: [PATCH 00/11] cgroup: separate rstat trees
2025-02-18 3:14 [PATCH 00/11] cgroup: separate rstat trees JP Kobryn
` (11 preceding siblings ...)
2025-02-20 15:51 ` [PATCH 00/11] cgroup: separate rstat trees Tejun Heo
@ 2025-02-20 17:26 ` Yosry Ahmed
2025-02-20 17:53 ` Shakeel Butt
12 siblings, 1 reply; 42+ messages in thread
From: Yosry Ahmed @ 2025-02-20 17:26 UTC (permalink / raw)
To: JP Kobryn
Cc: shakeel.butt, tj, mhocko, hannes, akpm, linux-mm, cgroups,
kernel-team
On Mon, Feb 17, 2025 at 07:14:37PM -0800, JP Kobryn wrote:
> The current design of rstat takes the approach that if one subsystem is to
> be flushed, all other subsystem controllers with pending updates should
> also be flushed. It seems that over time, the stat-keeping of some
> subsystems has grown in size to the extent that they are noticeably slowing
> down others. This has been most observable in situations where the memory
> controller is enabled. One big area where the issue comes up is system
> telemetry, where programs periodically sample cpu stats. It would be a
> benefit for programs like this if the overhead of flushing memory stats
> (and others) could be eliminated. It would save cpu cycles for existing
> cpu-based telemetry programs and improve scalability in terms of increasing
> sampling frequency and volume of hosts the program is running on - the cpu
> cycles saved helps free up the budget for cycles to be spent on the desired
> stats.
>
> This series changes the approach of "flush all subsystems" to "flush only the
> requested subsystem". The core design change is moving from a single unified
> rstat tree to having separate trees: one for each enabled subsystem controller
> (if it implements css_rstat_flush()), one for the base stats (cgroup::self)
> subsystem state, and one dedicated to bpf-based cgroups (if enabled). A layer
> of indirection was introduced to rstat. Where a cgroup reference was previously
> used as a parameter, the updated/flush families of functions were changed to
> instead accept a references to a new cgroup_rstat struct along with a new
> interface containing ops that perform type-specific pointer offsets for
> accessing common types. The ops allow rstat routines to work only with common
> types while hiding away any unique types. Together, the new struct and
> interface allows for extending the type of entity that can participate in
> rstat. In this series, the cgroup_subsys_state and the cgroup_bpf are now
> participants. For both of these structs, the cgroup_rstat struct was added as a
> new field and ops were statically defined for both types in order to provide
> access to related objects. To illustrate, the cgroup_subsys_state was given an
> rstat_struct field and one of its ops was defined to get a pointer to the
> rstat_struct of its parent. Public api's were changed. In order for clients to
> be specific about which stats are being updated or flushed, a reference to the
> given cgroup_subsys_state is passed instead of the cgroup. For the bpf api's, a
> cgroup is still passed as an argument since there is no subsystem state
> associated with these custom cgroups. However, the names of the api calls were
> changed and now have a "bpf_" prefix. Since separate trees are in use, the
> locking scheme was adjusted to prevent any contention. Separate locks exist for
> the three categories: base stats (cgroup::self), formal subsystem controllers
> (memory, io, etc), and bpf-based cgroups. Where applicable, the functions for
> lock management were adjusted to accept parameters instead of globals.
>
> Breaking up the unified tree into separate trees eliminates the overhead
> and scalability issue explained in the first section, but comes at the
> expense of using additional memory. In an effort to minimize the additional
> memory overhead, a conditional allocation is performed. The
> cgroup_rstat_cpu originally contained the rstat list pointers and the base
> stat entities. The struct was reduced to only contain the list pointers.
> For the single case of where base stats are participating in rstat, a new
> struct cgroup_rstat_base_cpu was created that contains the list pointers
> and the base stat entities. The conditional allocation is done only when
> the cgroup::self subsys_state is initialized. Since the list pointers exist
> at the beginning of the cgroup_rstat_cpu and cgroup_rstat_base_cpu struct,
> a union is used to access one type of pointer or the other, depending on
> whether cgroup::self is detected; it is the one subsystem state whose subsystem
> pointer is NULL. With this change, the total memory overhead on a per-cpu
> basis is:
>
> nr_cgroups * (
>         sizeof(struct cgroup_rstat_base_cpu) +
>         sizeof(struct cgroup_rstat_cpu) * nr_controllers
> )
>
> if bpf-based cgroups are enabled:
>
> nr_cgroups * (
>         sizeof(struct cgroup_rstat_base_cpu) +
>         sizeof(struct cgroup_rstat_cpu) * (nr_controllers + 1)
> )
Adding actual numbers here as examples would help, instead of people
having to look at the size of these structs at the end of the series.
You can calculate the per-CPU, per-cgroup size on a popular arch (e.g.
x86_64) and add that here.
Also, it would be useful to specify the change in memory overhead. IIUC,
sizeof(struct cgroup_rstat_base_cpu) is irrelevant because we are
already using that today anyway, so the actual increase is sizeof(struct
cgroup_rstat_cpu) * (nr_controllers - 1).
Another question is, does it make sense to keep BPF flushing in the
"self" css with base stats flushing for now? IIUC BPF flushing is not
very popular now anyway, and doing so will remove the need to support
flushing and updating things that are not css's. Just food for thought.
>
> ... where nr_controllers is the number of enabled cgroup controllers
> that implement css_rstat_flush().
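As a purely illustrative calculation (every size below is an assumption,
not a value measured from this series): if, on x86_64, the reduced
struct cgroup_rstat_cpu were 16 bytes (two pointers) and struct
cgroup_rstat_base_cpu were roughly 176 bytes, then with
nr_controllers = 3 and 1000 cgroups the first formula above gives,
per cpu:

    1000 * (176 + 16 * 3) = 224000 bytes, i.e. about 219 KiB

Of that, the compact per-controller structs account for
1000 * 16 * 3 = 48000 bytes per cpu, which is the portion that is new
relative to a kernel that already allocates the base-stat-sized struct
for every cgroup.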
>
> With regard to validation, there is a measurable benefit when reading a
> specific set of stats. Using the cpu stats as a basis for flushing, some
> experiments were set up to measure the perf and time differences.
>
> The first experiment consisted of a parent cgroup with memory.swap.max=0
> and memory.max=1G. On a 52-cpu machine, 26 child cgroups were created and
> within each child cgroup a process was spawned to encourage the updating of
> memory cgroup stats by creating and then reading a file of size 1T
> (encouraging reclaim). These 26 tasks were run in parallel. While this was
> going on, a custom program was used to open the cpu.stat file of the parent
> cgroup, read the entire file 1M times, then close it. The perf report for
> the task performing the reading showed that most of the cycles (42%) were
> spent on the function mem_cgroup_css_rstat_flush() on the control side. It
> also showed a smaller but significant number of cycles spent in
> __blkcg_rstat_flush(). The perf report for the patched kernel differed in
> that no cycles were spent in these functions. Instead most cycles were spent
> on cgroup_base_stat_flush(). Aside from the perf reports, the amount of time
> spent running the program performing the reading of cpu.stat showed a gain
> when comparing the control to the experimental kernel. The time in kernel
> mode was reduced.
>
> before:
> real 0m18.449s
> user 0m0.209s
> sys 0m18.165s
>
> after:
> real 0m6.080s
> user 0m0.170s
> sys 0m5.890s
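For reference, a minimal sketch of the kind of cpu.stat reader described
above (the cgroup path, buffer size, and iteration count are assumptions
for illustration, not the author's actual tool):

#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
	/* hypothetical parent cgroup path */
	const char *path = "/sys/fs/cgroup/parent/cpu.stat";
	char buf[8192];
	int fd = open(path, O_RDONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* open once, read the whole file 1M times, then close */
	for (long i = 0; i < 1000000; i++) {
		ssize_t n;

		while ((n = read(fd, buf, sizeof(buf))) > 0)
			;
		if (n < 0) {
			perror("read");
			break;
		}
		lseek(fd, 0, SEEK_SET);
	}
	close(fd);
	return 0;
}

Each full read of cpu.stat triggers a flush on the reader's side, which
is what the perf reports above capture.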
>
> Another experiment on the same host was set up using a parent cgroup with
> two child cgroups. The same swap and memory max were used as the previous
> experiment. In the two child cgroups, kernel builds were done in parallel,
> each using "-j 20". The program from the previous experiment was used to
> perform 1M reads of the parent cpu.stat file. The perf comparison showed
> similar results to the previous experiment. On the control side, a
> majority of cycles (42%) were spent on mem_cgroup_css_rstat_flush(), with
> significant cycles in __blkcg_rstat_flush(). On the experimental side, most cycles were
> spent on cgroup_base_stat_flush() and no cycles were spent flushing memory
> or io. As for the time taken by the program reading cpu.stat, measurements
> are shown below.
>
> before:
> real 0m17.223s
> user 0m0.259s
> sys 0m16.871s
>
> after:
> real 0m6.498s
> user 0m0.237s
> sys 0m6.220s
>
> For the final experiment, perf events were recorded during a kernel build
> with the same host and cgroup setup. The builds took place in the child
> cgroups. Control and experimental sides both showed similar cycles spent
> on cgroup_rstat_updated(), which appeared insignificant compared to the
> other events recorded with the workload.
>
> JP Kobryn (11):
> cgroup: move rstat pointers into struct of their own
> cgroup: add level of indirection for cgroup_rstat struct
> cgroup: move cgroup_rstat from cgroup to cgroup_subsys_state
> cgroup: introduce cgroup_rstat_ops
> cgroup: separate rstat for bpf cgroups
> cgroup: rstat lock indirection
> cgroup: fetch cpu-specific lock in rstat cpu lock helpers
> cgroup: rstat cpu lock indirection
> cgroup: separate rstat locks for bpf cgroups
> cgroup: separate rstat locks for subsystems
> cgroup: separate rstat list pointers from base stats
>
> block/blk-cgroup.c | 4 +-
> include/linux/bpf-cgroup-defs.h | 3 +
> include/linux/cgroup-defs.h | 98 +--
> include/linux/cgroup.h | 11 +-
> include/linux/cgroup_rstat.h | 97 +++
> kernel/bpf/cgroup.c | 6 +
> kernel/cgroup/cgroup-internal.h | 9 +-
> kernel/cgroup/cgroup.c | 65 +-
> kernel/cgroup/rstat.c | 556 +++++++++++++-----
> mm/memcontrol.c | 4 +-
> .../selftests/bpf/progs/btf_type_tag_percpu.c | 5 +-
> .../bpf/progs/cgroup_hierarchical_stats.c | 6 +-
> 12 files changed, 594 insertions(+), 270 deletions(-)
> create mode 100644 include/linux/cgroup_rstat.h
>
> --
> 2.43.5
>
>
* Re: [PATCH 04/11] cgroup: introduce cgroup_rstat_ops
2025-02-18 3:14 ` [PATCH 04/11] cgroup: introduce cgroup_rstat_ops JP Kobryn
2025-02-19 7:21 ` kernel test robot
@ 2025-02-20 17:50 ` Shakeel Butt
1 sibling, 0 replies; 42+ messages in thread
From: Shakeel Butt @ 2025-02-20 17:50 UTC (permalink / raw)
To: JP Kobryn
Cc: tj, mhocko, hannes, yosryahmed, akpm, linux-mm, cgroups,
kernel-team
On Mon, Feb 17, 2025 at 07:14:41PM -0800, JP Kobryn wrote:
> The cgroup_rstat_ops interface provides a way for type-specific
> operations to be hidden from the common rstat operations. Use it to
> decouple the cgroup_subsys_type from within the internal rstat
> updated/flush routines. The new ops interface allows for greater
> extensibility in terms of future changes. i.e. public updated/flush
Here you might need to be explicit about what future changes you mean. Will
all controllers using rstat require this ops interface, or only some of them?
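For readers following the thread, the ops interface described in the
quoted commit message might look roughly like the sketch below; the
member names and signatures are assumptions for illustration, not the
actual definition from the patch:

struct cgroup;
struct cgroup_rstat;

struct cgroup_rstat_ops {
	/* resolve the cgroup that owns this rstat instance */
	struct cgroup *(*cgroup_fn)(struct cgroup_rstat *rstat);
	/* walk to the parent's rstat instance for tree propagation */
	struct cgroup_rstat *(*parent_fn)(struct cgroup_rstat *rstat);
	/* type-specific flush callback invoked by the common code */
	void (*flush_fn)(struct cgroup_rstat *rstat, int cpu);
};

The common updated/flush routines would then only ever see a struct
cgroup_rstat plus one of these ops tables, regardless of whether the
participant is a css or a cgroup_bpf.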
* Re: [PATCH 00/11] cgroup: separate rstat trees
2025-02-20 17:26 ` Yosry Ahmed
@ 2025-02-20 17:53 ` Shakeel Butt
2025-02-20 17:59 ` Yosry Ahmed
0 siblings, 1 reply; 42+ messages in thread
From: Shakeel Butt @ 2025-02-20 17:53 UTC (permalink / raw)
To: Yosry Ahmed
Cc: JP Kobryn, tj, mhocko, hannes, akpm, linux-mm, cgroups,
kernel-team
On Thu, Feb 20, 2025 at 05:26:04PM +0000, Yosry Ahmed wrote:
>
> Another question is, does it make sense to keep BPF flushing in the
> "self" css with base stats flushing for now? IIUC BPF flushing is not
> very popular now anyway, and doing so will remove the need to support
> flushing and updating things that are not css's. Just food for thought.
>
Oh if this simplifies the code, I would say go for it.
* Re: [PATCH 00/11] cgroup: separate rstat trees
2025-02-20 17:53 ` Shakeel Butt
@ 2025-02-20 17:59 ` Yosry Ahmed
2025-02-20 18:14 ` JP Kobryn
0 siblings, 1 reply; 42+ messages in thread
From: Yosry Ahmed @ 2025-02-20 17:59 UTC (permalink / raw)
To: Shakeel Butt
Cc: JP Kobryn, tj, mhocko, hannes, akpm, linux-mm, cgroups,
kernel-team
On Thu, Feb 20, 2025 at 09:53:33AM -0800, Shakeel Butt wrote:
> On Thu, Feb 20, 2025 at 05:26:04PM +0000, Yosry Ahmed wrote:
> >
> > Another question is, does it make sense to keep BPF flushing in the
> > "self" css with base stats flushing for now? IIUC BPF flushing is not
> > very popular now anyway, and doing so will remove the need to support
> > flushing and updating things that are not css's. Just food for thought.
> >
>
> Oh if this simplifies the code, I would say go for it.
I think we wouldn't need cgroup_rstat_ops and some of the refactoring
may not be needed. It will also reduce the memory overhead, and keep it
constant regardless of using BPF which is nice.
* Re: [PATCH 00/11] cgroup: separate rstat trees
2025-02-20 17:59 ` Yosry Ahmed
@ 2025-02-20 18:14 ` JP Kobryn
2025-02-20 20:04 ` Yosry Ahmed
0 siblings, 1 reply; 42+ messages in thread
From: JP Kobryn @ 2025-02-20 18:14 UTC (permalink / raw)
To: Yosry Ahmed, Shakeel Butt
Cc: tj, mhocko, hannes, akpm, linux-mm, cgroups, kernel-team
On 2/20/25 9:59 AM, Yosry Ahmed wrote:
> On Thu, Feb 20, 2025 at 09:53:33AM -0800, Shakeel Butt wrote:
>> On Thu, Feb 20, 2025 at 05:26:04PM +0000, Yosry Ahmed wrote:
>>>
>>> Another question is, does it make sense to keep BPF flushing in the
>>> "self" css with base stats flushing for now? IIUC BPF flushing is not
>>> very popular now anyway, and doing so will remove the need to support
>>> flushing and updating things that are not css's. Just food for thought.
>>>
>>
>> Oh if this simplifies the code, I would say go for it.
>
> I think we wouldn't need cgroup_rstat_ops and some of the refactoring
> may not be needed. It will also reduce the memory overhead, and keep it
> constant regardless of using BPF which is nice.
Yes, this is true. cgroup_rstat_ops was only added to allow cgroup_bpf
to make use of rstat. If the bpf flushing remains tied to
cgroup_subsys_state::self, then the ops interface and supporting code
can be removed. Probably stating the obvious but the trade-off would be
that if bpf cgroups are in use, they would account for some extra
overhead while flushing the base stats. Is Google making use of bpf-
based cgroups?
* Re: [PATCH 00/11] cgroup: separate rstat trees
2025-02-20 18:14 ` JP Kobryn
@ 2025-02-20 20:04 ` Yosry Ahmed
2025-02-20 20:22 ` Yosry Ahmed
2025-02-24 21:13 ` Shakeel Butt
0 siblings, 2 replies; 42+ messages in thread
From: Yosry Ahmed @ 2025-02-20 20:04 UTC (permalink / raw)
To: JP Kobryn
Cc: Shakeel Butt, tj, mhocko, hannes, akpm, linux-mm, cgroups,
kernel-team
On Thu, Feb 20, 2025 at 10:14:45AM -0800, JP Kobryn wrote:
> On 2/20/25 9:59 AM, Yosry Ahmed wrote:
> > On Thu, Feb 20, 2025 at 09:53:33AM -0800, Shakeel Butt wrote:
> > > On Thu, Feb 20, 2025 at 05:26:04PM +0000, Yosry Ahmed wrote:
> > > >
> > > > Another question is, does it make sense to keep BPF flushing in the
> > > > "self" css with base stats flushing for now? IIUC BPF flushing is not
> > > > very popular now anyway, and doing so will remove the need to support
> > > > flushing and updating things that are not css's. Just food for thought.
> > > >
> > >
> > > Oh if this simplifies the code, I would say go for it.
> >
> > I think we wouldn't need cgroup_rstat_ops and some of the refactoring
> > may not be needed. It will also reduce the memory overhead, and keep it
> > constant regardless of using BPF which is nice.
>
> Yes, this is true. cgroup_rstat_ops was only added to allow cgroup_bpf
> to make use of rstat. If the bpf flushing remains tied to
> cgroup_subsys_state::self, then the ops interface and supporting code
> can be removed. Probably stating the obvious but the trade-off would be
> that if bpf cgroups are in use, they would account for some extra
> overhead while flushing the base stats. Is Google making use of bpf-
> based cgroups?
Ironically I don't know, but I don't expect the BPF flushing to be
expensive enough to affect this. If someone has the use case that loads
enough BPF programs to cause a noticeable impact, we can address it
then.
This series will still be an improvement anyway.
* Re: [PATCH 00/11] cgroup: separate rstat trees
2025-02-20 20:04 ` Yosry Ahmed
@ 2025-02-20 20:22 ` Yosry Ahmed
2025-02-24 21:13 ` Shakeel Butt
1 sibling, 0 replies; 42+ messages in thread
From: Yosry Ahmed @ 2025-02-20 20:22 UTC (permalink / raw)
To: JP Kobryn
Cc: Shakeel Butt, tj, mhocko, hannes, akpm, linux-mm, cgroups,
kernel-team
> > Yes, this is true. cgroup_rstat_ops was only added to allow cgroup_bpf
> > to make use of rstat. If the bpf flushing remains tied to
> > cgroup_subsys_state::self, then the ops interface and supporting code
> > can be removed. Probably stating the obvious but the trade-off would be
> > that if bpf cgroups are in use, they would account for some extra
> > overhead while flushing the base stats. Is Google making use of bpf-
> > based cgroups?
> >
>
> Ironically I don't know, but I don't expect the BPF flushing to be
> expensive enough to affect this. If someone has the use case that loads
> enough BPF programs to cause a noticeable impact, we can address it
> then.
>
> This series will still be an improvement anyway.
I actually just remembered. Using BPF programs to collect stats using
rstat is currently independent of the CGROUP_BPF stuff. So I think this
approach breaks that anyway.
* Re: [PATCH 05/11] cgroup: separate rstat for bpf cgroups
2025-02-18 3:14 ` [PATCH 05/11] cgroup: separate rstat for bpf cgroups JP Kobryn
@ 2025-02-21 18:14 ` Shakeel Butt
0 siblings, 0 replies; 42+ messages in thread
From: Shakeel Butt @ 2025-02-21 18:14 UTC (permalink / raw)
To: JP Kobryn
Cc: tj, mhocko, hannes, yosryahmed, akpm, linux-mm, cgroups,
kernel-team
On Mon, Feb 17, 2025 at 07:14:42PM -0800, JP Kobryn wrote:
> The processing of bpf cgroup stats is tied to the rstat actions of other
> subsystems. Make changes to have them updated/flushed independently.
> Give the cgroup_bpf struct its own cgroup_rstat instance and define a
> new cgroup_rstat_ops instance specifically for the cgroup_bpf. Then
> replace the kfunc status of the existing updated/flush api calls with
> non-kfunc status. As an alternative, create new updated/flush kfuncs
> specifically for bpf cgroups. In these new kfuncs, make use of the
> bpf-specific rstat ops to plumb back in to the existing rstat routines.
> Where applicable, use pre-processor conditionals to define bpf rstat
> related stuff.
>
> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Since we decided to keep bpf with the base stats, this patch will go
away.
* Re: [PATCH 06/11] cgroup: rstat lock indirection
2025-02-18 3:14 ` [PATCH 06/11] cgroup: rstat lock indirection JP Kobryn
@ 2025-02-21 22:09 ` Shakeel Butt
0 siblings, 0 replies; 42+ messages in thread
From: Shakeel Butt @ 2025-02-21 22:09 UTC (permalink / raw)
To: JP Kobryn
Cc: tj, mhocko, hannes, yosryahmed, akpm, linux-mm, cgroups,
kernel-team
On Mon, Feb 17, 2025 at 07:14:43PM -0800, JP Kobryn wrote:
> Instead of accessing the target lock directly via global var, access it
> indirectly in the form of a new parameter. Also change the ordering of
> the parameters to be consistent with the related per-cpu locking
> function _cgroup_rstat_cpu_lock().
>
> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
* Re: [PATCH 07/11] cgroup: fetch cpu-specific lock in rstat cpu lock helpers
2025-02-18 3:14 ` [PATCH 07/11] cgroup: fetch cpu-specific lock in rstat cpu lock helpers JP Kobryn
@ 2025-02-21 22:35 ` Shakeel Butt
0 siblings, 0 replies; 42+ messages in thread
From: Shakeel Butt @ 2025-02-21 22:35 UTC (permalink / raw)
To: JP Kobryn
Cc: tj, mhocko, hannes, yosryahmed, akpm, linux-mm, cgroups,
kernel-team
On Mon, Feb 17, 2025 at 07:14:44PM -0800, JP Kobryn wrote:
> The lock/unlock helper functions for per-cpu locks accept a cpu
> argument. This makes them appear as if the cpu will be used as the
> offset off of the base per-cpu pointer. But in fact, the cpu is only
> used as a tracepoint argument. Change the functions so that the cpu is
> also used primarily for looking up the lock specific to this cpu. This
> means the call sites can be adjusted to not have to perform the offset
> prior to calling this function. Note that this follows suit with other
> functions in the rstat source - functions that accept a cpu argument
> perform the per-cpu pointer lookup within as opposed to having clients
> lookup in advance.
>
> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
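A rough sketch of the shape this gives the helper (tracepoint details
omitted; the exact signature is an assumption, not the patch's code):

static unsigned long _cgroup_rstat_cpu_lock(raw_spinlock_t __percpu *pcpu_lock,
					    int cpu)
{
	/* the helper resolves the per-cpu lock itself, so call sites
	 * pass the percpu base pointer and the cpu instead of doing
	 * the per_cpu_ptr() lookup in advance */
	raw_spinlock_t *lock = per_cpu_ptr(pcpu_lock, cpu);
	unsigned long flags;

	raw_spin_lock_irqsave(lock, flags);
	return flags;
}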
* Re: [PATCH 08/11] cgroup: rstat cpu lock indirection
2025-02-18 3:14 ` [PATCH 08/11] cgroup: rstat cpu lock indirection JP Kobryn
2025-02-19 8:48 ` kernel test robot
@ 2025-02-22 0:18 ` Shakeel Butt
1 sibling, 0 replies; 42+ messages in thread
From: Shakeel Butt @ 2025-02-22 0:18 UTC (permalink / raw)
To: JP Kobryn
Cc: tj, mhocko, hannes, yosryahmed, akpm, linux-mm, cgroups,
kernel-team
On Mon, Feb 17, 2025 at 07:14:45PM -0800, JP Kobryn wrote:
> Where functions access the global per-cpu lock, change their signature
> to accept the lock instead as a parameter. Change the code within these
> functions to only access the parameter. This indirection allows for
> future code to accept different locks, increasing extensibility. For
> example, a new lock could be added specifically for the bpf cgroups and
> it would not contend with the existing lock.
>
> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
* Re: [PATCH 10/11] cgroup: separate rstat locks for subsystems
2025-02-18 3:14 ` [PATCH 10/11] cgroup: separate rstat locks for subsystems JP Kobryn
@ 2025-02-22 0:23 ` Shakeel Butt
0 siblings, 0 replies; 42+ messages in thread
From: Shakeel Butt @ 2025-02-22 0:23 UTC (permalink / raw)
To: JP Kobryn
Cc: tj, mhocko, hannes, yosryahmed, akpm, linux-mm, cgroups,
kernel-team
On Mon, Feb 17, 2025 at 07:14:47PM -0800, JP Kobryn wrote:
> Add new rstat locks for each subsystem. When handling cgroup subsystem
> states, distinguish between states associated with formal subsystems
> (memory, io, etc) and the base stats subsystem state (represented by
> cgroup::self). This change is made to prevent contention when
> updating/flushing stats.
>
> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
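An illustrative sketch of the kind of lock selection this describes (the
names and array layout are assumptions, not the code from the patch):

static DEFINE_SPINLOCK(cgroup_rstat_base_lock);
/* entries initialized with spin_lock_init() during early boot */
static spinlock_t cgroup_rstat_subsys_lock[CGROUP_SUBSYS_COUNT];

static spinlock_t *css_rstat_lock(struct cgroup_subsys_state *css)
{
	/* cgroup::self is the one css with no subsystem attached */
	if (!css->ss)
		return &cgroup_rstat_base_lock;
	return &cgroup_rstat_subsys_lock[css->ss->id];
}

With a layout like this, a memory flush only ever contends with other
memory flushers, never with readers of the base cpu stats.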
This looks good as well.
* Re: [PATCH 11/11] cgroup: separate rstat list pointers from base stats
2025-02-18 3:14 ` [PATCH 11/11] cgroup: separate rstat list pointers from base stats JP Kobryn
@ 2025-02-22 0:28 ` Shakeel Butt
0 siblings, 0 replies; 42+ messages in thread
From: Shakeel Butt @ 2025-02-22 0:28 UTC (permalink / raw)
To: JP Kobryn
Cc: tj, mhocko, hannes, yosryahmed, akpm, linux-mm, cgroups,
kernel-team
On Mon, Feb 17, 2025 at 07:14:48PM -0800, JP Kobryn wrote:
> A majority of the cgroup_rstat_cpu struct size is made up of the base stat
> entities. Since only the "self" subsystem state makes use of these, move
> them into a struct of their own. This allows for a new compact
> cgroup_rstat_cpu struct that the formal subsystems can make use of.
> Where applicable, decide on whether to allocate the compact or the full
> struct including the base stats.
>
> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
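A rough sketch of the split described above (the exact field sets and
names are assumptions for illustration, not the definitions from the
patch):

/* compact form used by the formal subsystems: list pointers only */
struct cgroup_rstat_cpu {
	struct cgroup *updated_children;
	struct cgroup *updated_next;
};

/* full form used only for cgroup::self; list pointers come first so
 * both types can be reached through the same percpu pointer */
struct cgroup_rstat_base_cpu {
	struct cgroup_rstat_cpu rstatc;
	struct u64_stats_sync bsync;
	struct cgroup_base_stat bstat;
	struct cgroup_base_stat last_bstat;
};

struct cgroup_rstat {
	/* one percpu allocation; which view is valid depends on
	 * whether this instance belongs to cgroup::self */
	union {
		struct cgroup_rstat_cpu __percpu *rstat_cpu;
		struct cgroup_rstat_base_cpu __percpu *rstat_base_cpu;
	};
};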
This looks good. Now there are two major asks and then minor comments on
some individual patches.
Two main asks are:
1. Experiment requested by Tejun.
2. BPF rstat using the base subsystem as requested by Yosry.
I would say send the v2 addressing these requests.
* Re: [PATCH 01/11] cgroup: move rstat pointers into struct of their own
2025-02-20 16:53 ` Yosry Ahmed
@ 2025-02-24 17:06 ` JP Kobryn
2025-02-24 18:36 ` Yosry Ahmed
0 siblings, 1 reply; 42+ messages in thread
From: JP Kobryn @ 2025-02-24 17:06 UTC (permalink / raw)
To: Yosry Ahmed
Cc: shakeel.butt, tj, mhocko, hannes, akpm, linux-mm, cgroups,
kernel-team
On 2/20/25 8:53 AM, Yosry Ahmed wrote:
> On Mon, Feb 17, 2025 at 07:14:38PM -0800, JP Kobryn wrote:
>> The rstat infrastructure makes use of pointers for list management.
>> These pointers only exist as fields in the cgroup struct, so moving them
>> into their own struct will allow them to be used elsewhere. The base
>> stat entities are included with them for now.
>>
>> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
>> ---
>> include/linux/cgroup-defs.h | 90 +-----------------
>> include/linux/cgroup_rstat.h | 92 +++++++++++++++++++
>> kernel/cgroup/cgroup.c | 3 +-
>> kernel/cgroup/rstat.c | 27 +++---
>> .../selftests/bpf/progs/btf_type_tag_percpu.c | 4 +-
>> 5 files changed, 112 insertions(+), 104 deletions(-)
>> create mode 100644 include/linux/cgroup_rstat.h
>>
>> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
>> index 1b20d2d8ef7c..6b6cc027fe70 100644
>> --- a/include/linux/cgroup-defs.h
>> +++ b/include/linux/cgroup-defs.h
>> @@ -17,7 +17,7 @@
>> #include <linux/refcount.h>
>> #include <linux/percpu-refcount.h>
>> #include <linux/percpu-rwsem.h>
>> -#include <linux/u64_stats_sync.h>
>> +#include <linux/cgroup_rstat.h>
>> #include <linux/workqueue.h>
>> #include <linux/bpf-cgroup-defs.h>
>> #include <linux/psi_types.h>
>> @@ -321,78 +321,6 @@ struct css_set {
>> struct rcu_head rcu_head;
>> };
>>
>> -struct cgroup_base_stat {
>> - struct task_cputime cputime;
>> -
>> -#ifdef CONFIG_SCHED_CORE
>> - u64 forceidle_sum;
>> -#endif
>> - u64 ntime;
>> -};
>> -
>> -/*
>> - * rstat - cgroup scalable recursive statistics. Accounting is done
>> - * per-cpu in cgroup_rstat_cpu which is then lazily propagated up the
>> - * hierarchy on reads.
>> - *
>> - * When a stat gets updated, the cgroup_rstat_cpu and its ancestors are
>> - * linked into the updated tree. On the following read, propagation only
>> - * considers and consumes the updated tree. This makes reading O(the
>> - * number of descendants which have been active since last read) instead of
>> - * O(the total number of descendants).
>> - *
>> - * This is important because there can be a lot of (draining) cgroups which
>> - * aren't active and stat may be read frequently. The combination can
>> - * become very expensive. By propagating selectively, increasing reading
>> - * frequency decreases the cost of each read.
>> - *
>> - * This struct hosts both the fields which implement the above -
>> - * updated_children and updated_next - and the fields which track basic
>> - * resource statistics on top of it - bsync, bstat and last_bstat.
>> - */
>> -struct cgroup_rstat_cpu {
>> - /*
>> - * ->bsync protects ->bstat. These are the only fields which get
>> - * updated in the hot path.
>> - */
>> - struct u64_stats_sync bsync;
>> - struct cgroup_base_stat bstat;
>> -
>> - /*
>> - * Snapshots at the last reading. These are used to calculate the
>> - * deltas to propagate to the global counters.
>> - */
>> - struct cgroup_base_stat last_bstat;
>> -
>> - /*
>> - * This field is used to record the cumulative per-cpu time of
>> - * the cgroup and its descendants. Currently it can be read via
>> - * eBPF/drgn etc, and we are still trying to determine how to
>> - * expose it in the cgroupfs interface.
>> - */
>> - struct cgroup_base_stat subtree_bstat;
>> -
>> - /*
>> - * Snapshots at the last reading. These are used to calculate the
>> - * deltas to propagate to the per-cpu subtree_bstat.
>> - */
>> - struct cgroup_base_stat last_subtree_bstat;
>> -
>> - /*
>> - * Child cgroups with stat updates on this cpu since the last read
>> - * are linked on the parent's ->updated_children through
>> - * ->updated_next.
>> - *
>> - * In addition to being more compact, singly-linked list pointing
>> - * to the cgroup makes it unnecessary for each per-cpu struct to
>> - * point back to the associated cgroup.
>> - *
>> - * Protected by per-cpu cgroup_rstat_cpu_lock.
>> - */
>> - struct cgroup *updated_children; /* terminated by self cgroup */
>> - struct cgroup *updated_next; /* NULL iff not on the list */
>> -};
>> -
>> struct cgroup_freezer_state {
>> /* Should the cgroup and its descendants be frozen. */
>> bool freeze;
>> @@ -517,23 +445,9 @@ struct cgroup {
>> struct cgroup *old_dom_cgrp; /* used while enabling threaded */
>>
>> /* per-cpu recursive resource statistics */
>> - struct cgroup_rstat_cpu __percpu *rstat_cpu;
>> + struct cgroup_rstat rstat;
>> struct list_head rstat_css_list;
>>
>> - /*
>> - * Add padding to separate the read mostly rstat_cpu and
>> - * rstat_css_list into a different cacheline from the following
>> - * rstat_flush_next and *bstat fields which can have frequent updates.
>> - */
>> - CACHELINE_PADDING(_pad_);
>> -
>> - /*
>> - * A singly-linked list of cgroup structures to be rstat flushed.
>> - * This is a scratch field to be used exclusively by
>> - * cgroup_rstat_flush_locked() and protected by cgroup_rstat_lock.
>> - */
>> - struct cgroup *rstat_flush_next;
>> -
>> /* cgroup basic resource statistics */
>> struct cgroup_base_stat last_bstat;
>> struct cgroup_base_stat bstat;
>> diff --git a/include/linux/cgroup_rstat.h b/include/linux/cgroup_rstat.h
>> new file mode 100644
>> index 000000000000..f95474d6f8ab
>> --- /dev/null
>> +++ b/include/linux/cgroup_rstat.h
>> @@ -0,0 +1,92 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef _LINUX_RSTAT_H
>> +#define _LINUX_RSTAT_H
>> +
>> +#include <linux/u64_stats_sync.h>
>> +
>> +struct cgroup_rstat_cpu;
>
> Why do we need the forward declaration instead of just defining struct
> cgroup_rstat_cpu first? Also, why do we need a new header for these
> definitions rather than just adding struct cgroup_rstat to
> cgroup-defs.h?
The new header was added so the cgroup_rstat type can be used in
bpf-cgroup-defs.h. As for the forward declaration, this was done so that
the updated_next and updated_children fields of cgroup_rstat_cpu could
change type from cgroup to cgroup_rstat.
Regardless, based on the direction we are moving with bpf sharing the
"self" tree, this new header will NOT be needed in v2.
>
>> +
>> +/*
>> + * rstat - cgroup scalable recursive statistics. Accounting is done
>> + * per-cpu in cgroup_rstat_cpu which is then lazily propagated up the
>> + * hierarchy on reads.
>> + *
>> + * When a stat gets updated, the cgroup_rstat_cpu and its ancestors are
>> + * linked into the updated tree. On the following read, propagation only
>> + * considers and consumes the updated tree. This makes reading O(the
>> + * number of descendants which have been active since last read) instead of
>> + * O(the total number of descendants).
>> + *
>> + * This is important because there can be a lot of (draining) cgroups which
>> + * aren't active and stat may be read frequently. The combination can
>> + * become very expensive. By propagating selectively, increasing reading
>> + * frequency decreases the cost of each read.
>> + *
>> + * This struct hosts both the fields which implement the above -
>> + * updated_children and updated_next - and the fields which track basic
>> + * resource statistics on top of it - bsync, bstat and last_bstat.
>> + */
>> +struct cgroup_rstat {
>> + struct cgroup_rstat_cpu __percpu *rstat_cpu;
>> +
>> + /*
>> + * Add padding to separate the read mostly rstat_cpu and
>> + * rstat_css_list into a different cacheline from the following
>> + * rstat_flush_next and containing struct fields which can have
>> + * frequent updates.
>> + */
>> + CACHELINE_PADDING(_pad_);
>> + struct cgroup *rstat_flush_next;
>> +};
>> +
>> +struct cgroup_base_stat {
>> + struct task_cputime cputime;
>> +
>> +#ifdef CONFIG_SCHED_CORE
>> + u64 forceidle_sum;
>> +#endif
>> + u64 ntime;
>> +};
>> +
>> +struct cgroup_rstat_cpu {
>> + /*
>> + * Child cgroups with stat updates on this cpu since the last read
>> + * are linked on the parent's ->updated_children through
>> + * ->updated_next.
>> + *
>> + * In addition to being more compact, singly-linked list pointing
>> + * to the cgroup makes it unnecessary for each per-cpu struct to
>> + * point back to the associated cgroup.
>> + */
>> + struct cgroup *updated_children; /* terminated by self */
>> + struct cgroup *updated_next; /* NULL if not on the list */
>> +
>> + /*
>> + * ->bsync protects ->bstat. These are the only fields which get
>> + * updated in the hot path.
>> + */
>> + struct u64_stats_sync bsync;
>> + struct cgroup_base_stat bstat;
>> +
>> + /*
>> + * Snapshots at the last reading. These are used to calculate the
>> + * deltas to propagate to the global counters.
>> + */
>> + struct cgroup_base_stat last_bstat;
>> +
>> + /*
>> + * This field is used to record the cumulative per-cpu time of
>> + * the cgroup and its descendants. Currently it can be read via
>> + * eBPF/drgn etc, and we are still trying to determine how to
>> + * expose it in the cgroupfs interface.
>> + */
>> + struct cgroup_base_stat subtree_bstat;
>> +
>> + /*
>> + * Snapshots at the last reading. These are used to calculate the
>> + * deltas to propagate to the per-cpu subtree_bstat.
>> + */
>> + struct cgroup_base_stat last_subtree_bstat;
>> +};
>> +
>> +#endif /* _LINUX_RSTAT_H */
* Re: [PATCH 01/11] cgroup: move rstat pointers into struct of their own
2025-02-24 17:06 ` JP Kobryn
@ 2025-02-24 18:36 ` Yosry Ahmed
0 siblings, 0 replies; 42+ messages in thread
From: Yosry Ahmed @ 2025-02-24 18:36 UTC (permalink / raw)
To: JP Kobryn
Cc: shakeel.butt, tj, mhocko, hannes, akpm, linux-mm, cgroups,
kernel-team
On Mon, Feb 24, 2025 at 09:06:21AM -0800, JP Kobryn wrote:
> On 2/20/25 8:53 AM, Yosry Ahmed wrote:
> > On Mon, Feb 17, 2025 at 07:14:38PM -0800, JP Kobryn wrote:
> > > The rstat infrastructure makes use of pointers for list management.
> > > These pointers only exist as fields in the cgroup struct, so moving them
> > > into their own struct will allow them to be used elsewhere. The base
> > > stat entities are included with them for now.
> > >
> > > Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
> > > ---
> > > include/linux/cgroup-defs.h | 90 +-----------------
> > > include/linux/cgroup_rstat.h | 92 +++++++++++++++++++
> > > kernel/cgroup/cgroup.c | 3 +-
> > > kernel/cgroup/rstat.c | 27 +++---
> > > .../selftests/bpf/progs/btf_type_tag_percpu.c | 4 +-
> > > 5 files changed, 112 insertions(+), 104 deletions(-)
> > > create mode 100644 include/linux/cgroup_rstat.h
> > >
> > > diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> > > index 1b20d2d8ef7c..6b6cc027fe70 100644
> > > --- a/include/linux/cgroup-defs.h
> > > +++ b/include/linux/cgroup-defs.h
> > > @@ -17,7 +17,7 @@
> > > #include <linux/refcount.h>
> > > #include <linux/percpu-refcount.h>
> > > #include <linux/percpu-rwsem.h>
> > > -#include <linux/u64_stats_sync.h>
> > > +#include <linux/cgroup_rstat.h>
> > > #include <linux/workqueue.h>
> > > #include <linux/bpf-cgroup-defs.h>
> > > #include <linux/psi_types.h>
> > > @@ -321,78 +321,6 @@ struct css_set {
> > > struct rcu_head rcu_head;
> > > };
> > > -struct cgroup_base_stat {
> > > - struct task_cputime cputime;
> > > -
> > > -#ifdef CONFIG_SCHED_CORE
> > > - u64 forceidle_sum;
> > > -#endif
> > > - u64 ntime;
> > > -};
> > > -
> > > -/*
> > > - * rstat - cgroup scalable recursive statistics. Accounting is done
> > > - * per-cpu in cgroup_rstat_cpu which is then lazily propagated up the
> > > - * hierarchy on reads.
> > > - *
> > > - * When a stat gets updated, the cgroup_rstat_cpu and its ancestors are
> > > - * linked into the updated tree. On the following read, propagation only
> > > - * considers and consumes the updated tree. This makes reading O(the
> > > - * number of descendants which have been active since last read) instead of
> > > - * O(the total number of descendants).
> > > - *
> > > - * This is important because there can be a lot of (draining) cgroups which
> > > - * aren't active and stat may be read frequently. The combination can
> > > - * become very expensive. By propagating selectively, increasing reading
> > > - * frequency decreases the cost of each read.
> > > - *
> > > - * This struct hosts both the fields which implement the above -
> > > - * updated_children and updated_next - and the fields which track basic
> > > - * resource statistics on top of it - bsync, bstat and last_bstat.
> > > - */
> > > -struct cgroup_rstat_cpu {
> > > - /*
> > > - * ->bsync protects ->bstat. These are the only fields which get
> > > - * updated in the hot path.
> > > - */
> > > - struct u64_stats_sync bsync;
> > > - struct cgroup_base_stat bstat;
> > > -
> > > - /*
> > > - * Snapshots at the last reading. These are used to calculate the
> > > - * deltas to propagate to the global counters.
> > > - */
> > > - struct cgroup_base_stat last_bstat;
> > > -
> > > - /*
> > > - * This field is used to record the cumulative per-cpu time of
> > > - * the cgroup and its descendants. Currently it can be read via
> > > - * eBPF/drgn etc, and we are still trying to determine how to
> > > - * expose it in the cgroupfs interface.
> > > - */
> > > - struct cgroup_base_stat subtree_bstat;
> > > -
> > > - /*
> > > - * Snapshots at the last reading. These are used to calculate the
> > > - * deltas to propagate to the per-cpu subtree_bstat.
> > > - */
> > > - struct cgroup_base_stat last_subtree_bstat;
> > > -
> > > - /*
> > > - * Child cgroups with stat updates on this cpu since the last read
> > > - * are linked on the parent's ->updated_children through
> > > - * ->updated_next.
> > > - *
> > > - * In addition to being more compact, singly-linked list pointing
> > > - * to the cgroup makes it unnecessary for each per-cpu struct to
> > > - * point back to the associated cgroup.
> > > - *
> > > - * Protected by per-cpu cgroup_rstat_cpu_lock.
> > > - */
> > > - struct cgroup *updated_children; /* terminated by self cgroup */
> > > - struct cgroup *updated_next; /* NULL iff not on the list */
> > > -};
> > > -
> > > struct cgroup_freezer_state {
> > > /* Should the cgroup and its descendants be frozen. */
> > > bool freeze;
> > > @@ -517,23 +445,9 @@ struct cgroup {
> > > struct cgroup *old_dom_cgrp; /* used while enabling threaded */
> > > /* per-cpu recursive resource statistics */
> > > - struct cgroup_rstat_cpu __percpu *rstat_cpu;
> > > + struct cgroup_rstat rstat;
> > > struct list_head rstat_css_list;
> > > - /*
> > > - * Add padding to separate the read mostly rstat_cpu and
> > > - * rstat_css_list into a different cacheline from the following
> > > - * rstat_flush_next and *bstat fields which can have frequent updates.
> > > - */
> > > - CACHELINE_PADDING(_pad_);
> > > -
> > > - /*
> > > - * A singly-linked list of cgroup structures to be rstat flushed.
> > > - * This is a scratch field to be used exclusively by
> > > - * cgroup_rstat_flush_locked() and protected by cgroup_rstat_lock.
> > > - */
> > > - struct cgroup *rstat_flush_next;
> > > -
> > > /* cgroup basic resource statistics */
> > > struct cgroup_base_stat last_bstat;
> > > struct cgroup_base_stat bstat;
> > > diff --git a/include/linux/cgroup_rstat.h b/include/linux/cgroup_rstat.h
> > > new file mode 100644
> > > index 000000000000..f95474d6f8ab
> > > --- /dev/null
> > > +++ b/include/linux/cgroup_rstat.h
> > > @@ -0,0 +1,92 @@
> > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > +#ifndef _LINUX_RSTAT_H
> > > +#define _LINUX_RSTAT_H
> > > +
> > > +#include <linux/u64_stats_sync.h>
> > > +
> > > +struct cgroup_rstat_cpu;
> >
> > Why do we need the forward declaration instead of just defining struct
> > cgroup_rstat_cpu first? Also, why do we need a new header for these
> > definitions rather than just adding struct cgroup_rstat to
> > cgroup-defs.h?
>
> The new header was added so the cgroup_rstat type can be used in bpf
> cgroup-defs.h. As for the forward declaration, this was done so that
> updated_next and updated children fields of the cgroup_rstat_cpu can
> change type from from cgroup to cgroup_rstat.
>
> Regardless, based on the direction we are moving with bpf sharing the
> "self" tree, this new header will NOT be needed in v2.
Nice. Seems like the series will be simplified quite a bit by doing
this.
* Re: [PATCH 00/11] cgroup: separate rstat trees
2025-02-20 20:04 ` Yosry Ahmed
2025-02-20 20:22 ` Yosry Ahmed
@ 2025-02-24 21:13 ` Shakeel Butt
2025-02-24 21:54 ` Yosry Ahmed
1 sibling, 1 reply; 42+ messages in thread
From: Shakeel Butt @ 2025-02-24 21:13 UTC (permalink / raw)
To: Yosry Ahmed
Cc: JP Kobryn, tj, mhocko, hannes, akpm, linux-mm, cgroups,
kernel-team
On Thu, Feb 20, 2025 at 08:04:02PM +0000, Yosry Ahmed wrote:
> On Thu, Feb 20, 2025 at 10:14:45AM -0800, JP Kobryn wrote:
> > On 2/20/25 9:59 AM, Yosry Ahmed wrote:
> > > On Thu, Feb 20, 2025 at 09:53:33AM -0800, Shakeel Butt wrote:
> > > > On Thu, Feb 20, 2025 at 05:26:04PM +0000, Yosry Ahmed wrote:
> > > > >
> > > > > Another question is, does it make sense to keep BPF flushing in the
> > > > > "self" css with base stats flushing for now? IIUC BPF flushing is not
> > > > > very popular now anyway, and doing so will remove the need to support
> > > > > flushing and updating things that are not css's. Just food for thought.
> > > > >
> > > >
> > > > Oh if this simplifies the code, I would say go for it.
> > >
> > > I think we wouldn't need cgroup_rstat_ops and some of the refactoring
> > > may not be needed. It will also reduce the memory overhead, and keep it
> > > constant regardless of using BPF which is nice.
> >
> > Yes, this is true. cgroup_rstat_ops was only added to allow cgroup_bpf
> > to make use of rstat. If the bpf flushing remains tied to
> > cgroup_subsys_state::self, then the ops interface and supporting code
> > can be removed. Probably stating the obvious but the trade-off would be
> > that if bpf cgroups are in use, they would account for some extra
> > overhead while flushing the base stats. Is Google making use of bpf-
> > based cgroups?
>
> Ironically I don't know, but I don't expect the BPF flushing to be
> expensive enough to affect this. If someone has the use case that loads
> enough BPF programs to cause a noticeable impact, we can address it
> then.
>
> This series will still be an improvement anyway.
If no one is using the bpf+rstat infra then maybe we should rip it out.
Do you have any concerns?
* Re: [PATCH 00/11] cgroup: separate rstat trees
2025-02-24 21:13 ` Shakeel Butt
@ 2025-02-24 21:54 ` Yosry Ahmed
0 siblings, 0 replies; 42+ messages in thread
From: Yosry Ahmed @ 2025-02-24 21:54 UTC (permalink / raw)
To: Shakeel Butt
Cc: JP Kobryn, tj, mhocko, hannes, akpm, linux-mm, cgroups,
kernel-team
On Mon, Feb 24, 2025 at 01:13:35PM -0800, Shakeel Butt wrote:
> On Thu, Feb 20, 2025 at 08:04:02PM +0000, Yosry Ahmed wrote:
> > On Thu, Feb 20, 2025 at 10:14:45AM -0800, JP Kobryn wrote:
> > > On 2/20/25 9:59 AM, Yosry Ahmed wrote:
> > > > On Thu, Feb 20, 2025 at 09:53:33AM -0800, Shakeel Butt wrote:
> > > > > On Thu, Feb 20, 2025 at 05:26:04PM +0000, Yosry Ahmed wrote:
> > > > > >
> > > > > > Another question is, does it make sense to keep BPF flushing in the
> > > > > > "self" css with base stats flushing for now? IIUC BPF flushing is not
> > > > > > very popular now anyway, and doing so will remove the need to support
> > > > > > flushing and updating things that are not css's. Just food for thought.
> > > > > >
> > > > >
> > > > > Oh if this simplifies the code, I would say go for it.
> > > >
> > > > I think we wouldn't need cgroup_rstat_ops and some of the refactoring
> > > > may not be needed. It will also reduce the memory overhead, and keep it
> > > > constant regardless of using BPF which is nice.
> > >
> > > Yes, this is true. cgroup_rstat_ops was only added to allow cgroup_bpf
> > > to make use of rstat. If the bpf flushing remains tied to
> > > cgroup_subsys_state::self, then the ops interface and supporting code
> > > can be removed. Probably stating the obvious but the trade-off would be
> > > that if bpf cgroups are in use, they would account for some extra
> > > overhead while flushing the base stats. Is Google making use of bpf-
> > > based cgroups?
> >
> > Ironically I don't know, but I don't expect the BPF flushing to be
> > expensive enough to affect this. If someone has the use case that loads
> > enough BPF programs to cause a noticeable impact, we can address it
> > then.
> >
> > This series will still be an improvement anyway.
>
> If no one is using the bpf+rstat infra then maybe we should rip it out.
> Do you have any concerns?
We did not end up using the BPF+rstat infra, so I have no objection to
removing that. They are kfuncs, and supposedly there is no guarantee of
them hanging around.
However, looking back at the patch series [1], there were 3 main
components:
(a) cgroup_iter BPF programs support.
(b) kfunc hooks for BPF+rstat infra.
(c) Selftests.
I am not sure if there are other users for cgroup_iter for different
purposes than BPF+rstat, and I am not sure if we can remove an iterator
program type (in terms of stability).
We can drop the kfunc hooks, but they are not really a big deal imo. I
am fine either way.
If we remove (b) we can also remove the corresponding test, but not the
test for cgroup_iter as long as it stays.
[1] https://lore.kernel.org/all/20220824233117.1312810-1-haoluo@google.com/
* Re: [PATCH 03/11] cgroup: move cgroup_rstat from cgroup to cgroup_subsys_state
2025-02-20 17:22 ` Yosry Ahmed
@ 2025-02-25 19:20 ` JP Kobryn
0 siblings, 0 replies; 42+ messages in thread
From: JP Kobryn @ 2025-02-25 19:20 UTC (permalink / raw)
To: Yosry Ahmed, Shakeel Butt
Cc: tj, mhocko, hannes, akpm, linux-mm, cgroups, kernel-team
On 2/20/25 9:22 AM, Yosry Ahmed wrote:
> On Thu, Feb 20, 2025 at 09:06:44AM -0800, Shakeel Butt wrote:
>> On Mon, Feb 17, 2025 at 07:14:40PM -0800, JP Kobryn wrote:
>> [...]
>>> @@ -3240,6 +3234,12 @@ static int cgroup_apply_control_enable(struct cgroup *cgrp)
>>> css = css_create(dsct, ss);
>>> if (IS_ERR(css))
>>> return PTR_ERR(css);
>>
>> Since rstat is part of css, why not cgroup_rstat_init() inside
>> css_create()?
Good idea. Will do that in v2.
>>
>>> +
>>> + if (css->ss && css->ss->css_rstat_flush) {
>>> + ret = cgroup_rstat_init(css);
>>> + if (ret)
>>> + goto err_out;
>>> + }
>>> }
>>>
>>> WARN_ON_ONCE(percpu_ref_is_dying(&css->refcnt));
>>> @@ -3253,6 +3253,21 @@ static int cgroup_apply_control_enable(struct cgroup *cgrp)
>>> }
>>>
>>> return 0;
>>
>> Why not the following cleanup in css_kill()? If you handle it in
>> css_kill(), you don't need this special handling.
>
> Also, I don't think we are currently calling cgroup_rstat_exit() for
> every css when it is destroyed, so moving the cleanup to css_kill()
> should handle this as well.
Thanks, this makes sense and will eliminate the need for the extra
cleanup here. It seems that instead of kill_css(), the
css_free_rwork_fn() function is a good place to call css_rstat_exit(),
since it corresponds with calling cgroup_rstat_exit() when the css is
the "self" base stats.
>
>>
>>> +
>>> +err_out:
>>> + cgroup_for_each_live_descendant_pre(dsct, d_css, cgrp) {
>>> + for_each_subsys(ss, ssid) {
>>> + struct cgroup_subsys_state *css = cgroup_css(dsct, ss);
>>> +
>>> + if (!(cgroup_ss_mask(dsct) & (1 << ss->id)))
>>> + continue;
>>> +
>>> + if (css && css->ss && css->ss->css_rstat_flush)
>>> + cgroup_rstat_exit(css);
>>> + }
>>> + }
>>> +
>>> + return ret;
>>> }
>>
* Re: [PATCH 00/11] cgroup: separate rstat trees
2025-02-20 15:51 ` [PATCH 00/11] cgroup: separate rstat trees Tejun Heo
@ 2025-02-27 23:44 ` JP Kobryn
0 siblings, 0 replies; 42+ messages in thread
From: JP Kobryn @ 2025-02-27 23:44 UTC (permalink / raw)
To: Tejun Heo
Cc: shakeel.butt, mhocko, hannes, yosryahmed, akpm, linux-mm, cgroups,
kernel-team
On 2/20/25 7:51 AM, Tejun Heo wrote:
> Hello,
>
> On Mon, Feb 17, 2025 at 07:14:37PM -0800, JP Kobryn wrote:
> ...
>> The first experiment consisted of a parent cgroup with memory.swap.max=0
>> and memory.max=1G. On a 52-cpu machine, 26 child cgroups were created and
>> within each child cgroup a process was spawned to encourage the updating of
>> memory cgroup stats by creating and then reading a file of size 1T
>> (encouraging reclaim). These 26 tasks were run in parallel. While this was
>> going on, a custom program was used to open the cpu.stat file of the parent
>> cgroup, read the entire file 1M times, then close it. The perf report for
>> the task performing the reading showed that most of the cycles (42%) were
>> spent on the function mem_cgroup_css_rstat_flush() on the control side. It
>> also showed a smaller but significant number of cycles spent in
>> __blkcg_rstat_flush(). The perf report for the patched kernel differed in
>> that no cycles were spent in these functions. Instead most cycles were spent
>> on cgroup_base_stat_flush(). Aside from the perf reports, the amount of time
>> spent running the program performing the reading of cpu.stat showed a gain
>> when comparing the control to the experimental kernel. The time in kernel
>> mode was reduced.
>>
>> before:
>> real 0m18.449s
>> user 0m0.209s
>> sys 0m18.165s
>>
>> after:
>> real 0m6.080s
>> user 0m0.170s
>> sys 0m5.890s
>>
>> Another experiment on the same host was set up using a parent cgroup with
>> two child cgroups. The same swap and memory max were used as the previous
>> experiment. In the two child cgroups, kernel builds were done in parallel,
>> each using "-j 20". The program from the previous experiment was used to
>> perform 1M reads of the parent cpu.stat file. The perf comparison showed
>> similar results to the previous experiment. On the control side, a
>> majority of cycles (42%) were spent on mem_cgroup_css_rstat_flush(), with
>> significant cycles in __blkcg_rstat_flush(). On the experimental side, most cycles were
>> spent on cgroup_base_stat_flush() and no cycles were spent flushing memory
>> or io. As for the time taken by the program reading cpu.stat, measurements
>> are shown below.
>>
>> before:
>> real 0m17.223s
>> user 0m0.259s
>> sys 0m16.871s
>>
>> after:
>> real 0m6.498s
>> user 0m0.237s
>> sys 0m6.220s
>>
>> For the final experiment, perf events were recorded during a kernel build
>> with the same host and cgroup setup. The builds took place in the child
>> cgroups. Control and experimental sides both showed similar cycles spent
>> on cgroup_rstat_updated(), which appeared insignificant compared to the
>> other events recorded with the workload.
>
> One of the reasons why the original design used one rstat tree is because
> readers, in addition to writers, can often be correlated too - e.g. You'd
> often have periodic monitoring tools which poll all the major stat files
> periodically. Splitting the trees will likely make those at least a bit
> worse. Can you test how much worse that'd be? i.e. Repeat the above tests but
> read all the major stat files - cgroup.stat, cpu.stat, memory.stat and
> io.stat.
Sure. I changed the experiment to read all of these files. It still
showed an improvement in performance. You can see the details in
v2 [0] which I sent out earlier today.
[0] https://lore.kernel.org/all/20250227215543.49928-1-inwardvessel@gmail.com/
>
> Thanks.
>
Thread overview: 42+ messages
2025-02-18 3:14 [PATCH 00/11] cgroup: separate rstat trees JP Kobryn
2025-02-18 3:14 ` [PATCH 01/11] cgroup: move rstat pointers into struct of their own JP Kobryn
2025-02-19 1:05 ` Shakeel Butt
2025-02-19 1:23 ` Shakeel Butt
2025-02-20 16:53 ` Yosry Ahmed
2025-02-24 17:06 ` JP Kobryn
2025-02-24 18:36 ` Yosry Ahmed
2025-02-18 3:14 ` [PATCH 02/11] cgroup: add level of indirection for cgroup_rstat struct JP Kobryn
2025-02-19 2:26 ` Shakeel Butt
2025-02-20 17:08 ` Yosry Ahmed
2025-02-19 5:57 ` kernel test robot
2025-02-18 3:14 ` [PATCH 03/11] cgroup: move cgroup_rstat from cgroup to cgroup_subsys_state JP Kobryn
2025-02-20 17:06 ` Shakeel Butt
2025-02-20 17:22 ` Yosry Ahmed
2025-02-25 19:20 ` JP Kobryn
2025-02-18 3:14 ` [PATCH 04/11] cgroup: introduce cgroup_rstat_ops JP Kobryn
2025-02-19 7:21 ` kernel test robot
2025-02-20 17:50 ` Shakeel Butt
2025-02-18 3:14 ` [PATCH 05/11] cgroup: separate rstat for bpf cgroups JP Kobryn
2025-02-21 18:14 ` Shakeel Butt
2025-02-18 3:14 ` [PATCH 06/11] cgroup: rstat lock indirection JP Kobryn
2025-02-21 22:09 ` Shakeel Butt
2025-02-18 3:14 ` [PATCH 07/11] cgroup: fetch cpu-specific lock in rstat cpu lock helpers JP Kobryn
2025-02-21 22:35 ` Shakeel Butt
2025-02-18 3:14 ` [PATCH 08/11] cgroup: rstat cpu lock indirection JP Kobryn
2025-02-19 8:48 ` kernel test robot
2025-02-22 0:18 ` Shakeel Butt
2025-02-18 3:14 ` [PATCH 09/11] cgroup: separate rstat locks for bpf cgroups JP Kobryn
2025-02-18 3:14 ` [PATCH 10/11] cgroup: separate rstat locks for subsystems JP Kobryn
2025-02-22 0:23 ` Shakeel Butt
2025-02-18 3:14 ` [PATCH 11/11] cgroup: separate rstat list pointers from base stats JP Kobryn
2025-02-22 0:28 ` Shakeel Butt
2025-02-20 15:51 ` [PATCH 00/11] cgroup: separate rstat trees Tejun Heo
2025-02-27 23:44 ` JP Kobryn
2025-02-20 17:26 ` Yosry Ahmed
2025-02-20 17:53 ` Shakeel Butt
2025-02-20 17:59 ` Yosry Ahmed
2025-02-20 18:14 ` JP Kobryn
2025-02-20 20:04 ` Yosry Ahmed
2025-02-20 20:22 ` Yosry Ahmed
2025-02-24 21:13 ` Shakeel Butt
2025-02-24 21:54 ` Yosry Ahmed