* [RFC PATCH 0/3] cgroup: nmi safe css_rstat_updated
@ 2025-04-29 6:12 Shakeel Butt
2025-04-29 6:12 ` [RFC PATCH 1/3] llist: add list_add_iff_not_on_list() Shakeel Butt
` (3 more replies)
0 siblings, 4 replies; 22+ messages in thread
From: Shakeel Butt @ 2025-04-29 6:12 UTC (permalink / raw)
To: Tejun Heo, Andrew Morton, Alexei Starovoitov
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
Yosry Ahmed, Michal Koutný, Vlastimil Babka,
Sebastian Andrzej Siewior, JP Kobryn, bpf, linux-mm, cgroups,
linux-kernel, Meta kernel team
BPF programs can run in nmi context and may trigger memcg charged memory
allocation in such context. Recently Linux added support for nmi safe
page allocation along with memcg charging of such allocations. However
the kmalloc/slab support and the corresponding memcg charging are still
lacking.
To provide nmi safe memcg charging for kmalloc/slab allocations, we need
nmi safe memcg stats, and for that we need an nmi safe
css_rstat_updated(), which adds the given cgroup subsystem state whose
stats were updated to the per-cpu per-ss update tree. This series aims
to make css_rstat_updated() nmi safe.
This series is based on [1].
[1] http://lore.kernel.org/20250404011050.121777-1-inwardvessel@gmail.com
Shakeel Butt (3):
llist: add list_add_iff_not_on_list()
cgroup: support to enable nmi-safe css_rstat_updated
cgroup: make css_rstat_updated nmi safe
include/linux/cgroup-defs.h | 4 ++
include/linux/llist.h | 3 +
kernel/cgroup/rstat.c | 112 ++++++++++++++++++++++++++++--------
lib/llist.c | 30 ++++++++++
4 files changed, 124 insertions(+), 25 deletions(-)
--
2.47.1
^ permalink raw reply [flat|nested] 22+ messages in thread
* [RFC PATCH 1/3] llist: add list_add_iff_not_on_list()
2025-04-29 6:12 [RFC PATCH 0/3] cgroup: nmi safe css_rstat_updated Shakeel Butt
@ 2025-04-29 6:12 ` Shakeel Butt
2025-04-30 12:44 ` [RFC PATCH 1/3] llist: add list_add_iff_not_on_list() Yosry Ahmed
2025-04-29 6:12 ` [RFC PATCH 2/3] cgroup: support to enable nmi-safe css_rstat_updated Shakeel Butt
` (2 subsequent siblings)
3 siblings, 1 reply; 22+ messages in thread
From: Shakeel Butt @ 2025-04-29 6:12 UTC (permalink / raw)
To: Tejun Heo, Andrew Morton, Alexei Starovoitov
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
Yosry Ahmed, Michal Koutný, Vlastimil Babka,
Sebastian Andrzej Siewior, JP Kobryn, bpf, linux-mm, cgroups,
linux-kernel, Meta kernel team
As the name implies, llist_add_iff_not_on_list() adds the given node to
the given list only if the node is not already on any list. Many CPUs
can call this concurrently on the same node and only one of them will
succeed.
This is also useful across different contexts like task, irq and nmi. In
the case of failure, either the node was already present on some list or
the caller lost the race to add the given node to a list. That node will
eventually be added to a list by the winner.
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
include/linux/llist.h | 3 +++
lib/llist.c | 30 ++++++++++++++++++++++++++++++
2 files changed, 33 insertions(+)
diff --git a/include/linux/llist.h b/include/linux/llist.h
index 2c982ff7475a..030cfec8778b 100644
--- a/include/linux/llist.h
+++ b/include/linux/llist.h
@@ -236,6 +236,9 @@ static inline bool __llist_add_batch(struct llist_node *new_first,
return new_last->next == NULL;
}
+extern bool llist_add_iff_not_on_list(struct llist_node *new,
+ struct llist_head *head);
+
/**
* llist_add - add a new entry
* @new: new entry to be added
diff --git a/lib/llist.c b/lib/llist.c
index f21d0cfbbaaa..9d743164720f 100644
--- a/lib/llist.c
+++ b/lib/llist.c
@@ -36,6 +36,36 @@ bool llist_add_batch(struct llist_node *new_first, struct llist_node *new_last,
}
EXPORT_SYMBOL_GPL(llist_add_batch);
+/**
+ * llist_add_iff_not_on_list - add an entry if it is not on list
+ * @new: entry to be added
+ * @head: the head for your lock-less list
+ *
+ * Adds the given entry to the given list only if the entry is not on any list.
+ * This is useful for cases where multiple CPUs try to add the same node to
+ * the list or multiple contexts (process, irq or nmi) may add the same node to
+ * the list.
+ *
+ * Returns true only if the caller has successfully added the given node to the
+ * list. Returns false if the entry is already on some list or if another inserter
+ * wins the race to eventually add the given node to the list.
+ */
+bool llist_add_iff_not_on_list(struct llist_node *new, struct llist_head *head)
+{
+ struct llist_node *first = READ_ONCE(head->first);
+
+ if (llist_on_list(new))
+ return false;
+
+ if (cmpxchg(&new->next, new, first) != new)
+ return false;
+
+ while (!try_cmpxchg(&head->first, &first, new))
+ new->next = first;
+ return true;
+}
+EXPORT_SYMBOL_GPL(llist_add_iff_not_on_list);
+
/**
* llist_del_first - delete the first entry of lock-less list
* @head: the head for your lock-less list
--
2.47.1
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [RFC PATCH 2/3] cgroup: support to enable nmi-safe css_rstat_updated
2025-04-29 6:12 [RFC PATCH 0/3] cgroup: nmi safe css_rstat_updated Shakeel Butt
2025-04-29 6:12 ` [RFC PATCH 1/3] llist: add list_add_iff_not_on_list() Shakeel Butt
@ 2025-04-29 6:12 ` Shakeel Butt
2025-04-29 6:12 ` [RFC PATCH 3/3] cgroup: make css_rstat_updated nmi safe Shakeel Butt
2025-04-29 6:12 ` [OFFLIST PATCH 1/2] cgroup: use separate rstat trees for each subsystem Shakeel Butt
3 siblings, 0 replies; 22+ messages in thread
From: Shakeel Butt @ 2025-04-29 6:12 UTC (permalink / raw)
To: Tejun Heo, Andrew Morton, Alexei Starovoitov
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
Yosry Ahmed, Michal Koutný, Vlastimil Babka,
Sebastian Andrzej Siewior, JP Kobryn, bpf, linux-mm, cgroups,
linux-kernel, Meta kernel team
Add the necessary infrastructure to enable nmi-safe execution of
css_rstat_updated(). Currently css_rstat_updated() takes a per-cpu
per-ss raw spinlock to add the given css to the per-cpu per-ss update
tree. However the kernel cannot spin in nmi context, so we need to
replace spinning on the raw spinlock with a trylock and, on failure,
add the given css to a per-cpu backlog which will be processed later by
a context that can spin on the raw spinlock.
For now, this patch just adds the necessary data structures to the css
and ss structures.
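For reference, the relationship the new fields establish looks roughly
like this (condensed from the diff below; the comments are only
explanatory):

	/* per-cpu, per-css state */
	struct css_rstat_cpu {
		...
		struct llist_node lnode;		/* entry in the per-cpu backlog */
		struct cgroup_subsys_state *owner;	/* css this entry belongs to */
	};

	/* per-subsystem */
	struct cgroup_subsys {
		...
		struct llist_head __percpu *lhead;	/* one backlog list per cpu */
	};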
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
include/linux/cgroup-defs.h | 4 ++++
kernel/cgroup/rstat.c | 13 +++++++++++--
2 files changed, 15 insertions(+), 2 deletions(-)
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 560582c4dbeb..f7b680f853ea 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -370,6 +370,9 @@ struct css_rstat_cpu {
*/
struct cgroup_subsys_state *updated_children; /* terminated by self cgroup */
struct cgroup_subsys_state *updated_next; /* NULL iff not on the list */
+
+ struct llist_node lnode; /* lockless backlog node */
+ struct cgroup_subsys_state *owner; /* back pointer */
};
/*
@@ -800,6 +803,7 @@ struct cgroup_subsys {
spinlock_t rstat_ss_lock;
raw_spinlock_t __percpu *rstat_ss_cpu_lock;
+ struct llist_head __percpu *lhead; /* lockless backlog list */
};
extern struct percpu_rw_semaphore cgroup_threadgroup_rwsem;
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index a30bcc4d4f48..d3092b4c85d7 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -419,7 +419,8 @@ int css_rstat_init(struct cgroup_subsys_state *css)
for_each_possible_cpu(cpu) {
struct css_rstat_cpu *rstatc = css_rstat_cpu(css, cpu);
- rstatc->updated_children = css;
+ rstatc->owner = rstatc->updated_children = css;
+ init_llist_node(&rstatc->lnode);
if (css_is_cgroup(css)) {
struct cgroup_rstat_base_cpu *rstatbc;
@@ -484,8 +485,16 @@ int __init ss_rstat_init(struct cgroup_subsys *ss)
if (!ss->rstat_ss_cpu_lock)
return -ENOMEM;
- for_each_possible_cpu(cpu)
+ ss->lhead = alloc_percpu(struct llist_head);
+ if (!ss->lhead) {
+ free_percpu(ss->rstat_ss_cpu_lock);
+ return -ENOMEM;
+ }
+
+ for_each_possible_cpu(cpu) {
raw_spin_lock_init(per_cpu_ptr(ss->rstat_ss_cpu_lock, cpu));
+ init_llist_head(per_cpu_ptr(ss->lhead, cpu));
+ }
return 0;
}
--
2.47.1
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [RFC PATCH 3/3] cgroup: make css_rstat_updated nmi safe
2025-04-29 6:12 [RFC PATCH 0/3] cgroup: nmi safe css_rstat_updated Shakeel Butt
2025-04-29 6:12 ` [RFC PATCH 1/3] llist: add list_add_iff_not_on_list() Shakeel Butt
2025-04-29 6:12 ` [RFC PATCH 2/3] cgroup: support to enable nmi-safe css_rstat_updated Shakeel Butt
@ 2025-04-29 6:12 ` Shakeel Butt
2025-04-30 13:14 ` Yosry Ahmed
2025-04-29 6:12 ` [OFFLIST PATCH 1/2] cgroup: use separate rstat trees for each subsystem Shakeel Butt
3 siblings, 1 reply; 22+ messages in thread
From: Shakeel Butt @ 2025-04-29 6:12 UTC (permalink / raw)
To: Tejun Heo, Andrew Morton, Alexei Starovoitov
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
Yosry Ahmed, Michal Koutný, Vlastimil Babka,
Sebastian Andrzej Siewior, JP Kobryn, bpf, linux-mm, cgroups,
linux-kernel, Meta kernel team
To make css_rstat_updated() safe to run in nmi context, it cannot spin
on locks and instead has to trylock the per-cpu per-ss raw spinlock.
This patch implements a backlog mechanism to handle failure to acquire
the per-cpu per-ss raw spinlock.
Each subsystem provides a per-cpu lockless list on which the kernel
stores the css given to css_rstat_updated() on trylock failure. These
lockless lists serve as the backlog. On the cgroup stats flushing code
path, the kernel first processes all the per-cpu lockless backlog lists
of the given ss and then proceeds to flush the updated stat trees.
With css_rstat_updated() being nmi safe, the memcg stats can and will be
converted to be nmi safe to enable nmi safe memcg charging.
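Condensed sketch of the resulting control flow (simplified from the
hunks below, not a drop-in replacement for the actual code):

	/* update side, may run in nmi context */
	if (!_css_rstat_cpu_trylock(css, cpu, &flags)) {
		css_add_to_backlog(css, cpu);	/* llist_add_iff_not_on_list() */
		return;
	}
	__css_rstat_updated(css, cpu);		/* link css and ancestors into the tree */
	_css_rstat_cpu_unlock(css, cpu, flags, true);

	/* flush side, can spin on the lock */
	flags = _css_rstat_cpu_lock(root, cpu, false);
	css_process_backlog(root->ss, cpu);	/* drain the backlog into the tree */
	/* ... then walk the updated list as before ... */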
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
kernel/cgroup/rstat.c | 99 +++++++++++++++++++++++++++++++++----------
1 file changed, 76 insertions(+), 23 deletions(-)
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index d3092b4c85d7..ac533e46afa9 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -11,6 +11,7 @@
static DEFINE_SPINLOCK(rstat_base_lock);
static DEFINE_PER_CPU(raw_spinlock_t, rstat_base_cpu_lock);
+static DEFINE_PER_CPU(struct llist_head, rstat_backlog_list);
static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
@@ -42,6 +43,13 @@ static raw_spinlock_t *ss_rstat_cpu_lock(struct cgroup_subsys *ss, int cpu)
return per_cpu_ptr(&rstat_base_cpu_lock, cpu);
}
+static struct llist_head *ss_lhead_cpu(struct cgroup_subsys *ss, int cpu)
+{
+ if (ss)
+ return per_cpu_ptr(ss->lhead, cpu);
+ return per_cpu_ptr(&rstat_backlog_list, cpu);
+}
+
/*
* Helper functions for rstat per CPU locks.
*
@@ -86,6 +94,21 @@ unsigned long _css_rstat_cpu_lock(struct cgroup_subsys_state *css, int cpu,
return flags;
}
+static __always_inline
+bool _css_rstat_cpu_trylock(struct cgroup_subsys_state *css, int cpu,
+ unsigned long *flags)
+{
+ struct cgroup *cgrp = css->cgroup;
+ raw_spinlock_t *cpu_lock;
+ bool contended;
+
+ cpu_lock = ss_rstat_cpu_lock(css->ss, cpu);
+ contended = !raw_spin_trylock_irqsave(cpu_lock, *flags);
+ if (contended)
+ trace_cgroup_rstat_cpu_lock_contended(cgrp, cpu, contended);
+ return !contended;
+}
+
static __always_inline
void _css_rstat_cpu_unlock(struct cgroup_subsys_state *css, int cpu,
unsigned long flags, const bool fast_path)
@@ -102,32 +125,16 @@ void _css_rstat_cpu_unlock(struct cgroup_subsys_state *css, int cpu,
raw_spin_unlock_irqrestore(cpu_lock, flags);
}
-/**
- * css_rstat_updated - keep track of updated rstat_cpu
- * @css: target cgroup subsystem state
- * @cpu: cpu on which rstat_cpu was updated
- *
- * @css's rstat_cpu on @cpu was updated. Put it on the parent's matching
- * rstat_cpu->updated_children list. See the comment on top of
- * css_rstat_cpu definition for details.
- */
-__bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
+static void css_add_to_backlog(struct cgroup_subsys_state *css, int cpu)
{
- unsigned long flags;
-
- /*
- * Speculative already-on-list test. This may race leading to
- * temporary inaccuracies, which is fine.
- *
- * Because @parent's updated_children is terminated with @parent
- * instead of NULL, we can tell whether @css is on the list by
- * testing the next pointer for NULL.
- */
- if (data_race(css_rstat_cpu(css, cpu)->updated_next))
- return;
+ struct llist_head *lhead = ss_lhead_cpu(css->ss, cpu);
+ struct css_rstat_cpu *rstatc = css_rstat_cpu(css, cpu);
- flags = _css_rstat_cpu_lock(css, cpu, true);
+ llist_add_iff_not_on_list(&rstatc->lnode, lhead);
+}
+static void __css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
+{
/* put @css and all ancestors on the corresponding updated lists */
while (true) {
struct css_rstat_cpu *rstatc = css_rstat_cpu(css, cpu);
@@ -153,6 +160,51 @@ __bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
css = parent;
}
+}
+
+static void css_process_backlog(struct cgroup_subsys *ss, int cpu)
+{
+ struct llist_head *lhead = ss_lhead_cpu(ss, cpu);
+ struct llist_node *lnode;
+
+ while ((lnode = llist_del_first_init(lhead))) {
+ struct css_rstat_cpu *rstatc;
+
+ rstatc = container_of(lnode, struct css_rstat_cpu, lnode);
+ __css_rstat_updated(rstatc->owner, cpu);
+ }
+}
+
+/**
+ * css_rstat_updated - keep track of updated rstat_cpu
+ * @css: target cgroup subsystem state
+ * @cpu: cpu on which rstat_cpu was updated
+ *
+ * @css's rstat_cpu on @cpu was updated. Put it on the parent's matching
+ * rstat_cpu->updated_children list. See the comment on top of
+ * css_rstat_cpu definition for details.
+ */
+__bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
+{
+ unsigned long flags;
+
+ /*
+ * Speculative already-on-list test. This may race leading to
+ * temporary inaccuracies, which is fine.
+ *
+ * Because @parent's updated_children is terminated with @parent
+ * instead of NULL, we can tell whether @css is on the list by
+ * testing the next pointer for NULL.
+ */
+ if (data_race(css_rstat_cpu(css, cpu)->updated_next))
+ return;
+
+ if (!_css_rstat_cpu_trylock(css, cpu, &flags)) {
+ css_add_to_backlog(css, cpu);
+ return;
+ }
+
+ __css_rstat_updated(css, cpu);
_css_rstat_cpu_unlock(css, cpu, flags, true);
}
@@ -255,6 +307,7 @@ static struct cgroup_subsys_state *css_rstat_updated_list(
flags = _css_rstat_cpu_lock(root, cpu, false);
+ css_process_backlog(root->ss, cpu);
/* Return NULL if this subtree is not on-list */
if (!rstatc->updated_next)
goto unlock_ret;
--
2.47.1
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [OFFLIST PATCH 1/2] cgroup: use separate rstat trees for each subsystem
2025-04-29 6:12 [RFC PATCH 0/3] cgroup: nmi safe css_rstat_updated Shakeel Butt
` (2 preceding siblings ...)
2025-04-29 6:12 ` [RFC PATCH 3/3] cgroup: make css_rstat_updated nmi safe Shakeel Butt
@ 2025-04-29 6:12 ` Shakeel Butt
2025-04-29 6:12 ` [OFFLIST PATCH 2/2] cgroup: use subsystem-specific rstat locks to avoid contention Shakeel Butt
2025-04-29 6:15 ` [OFFLIST PATCH 1/2] cgroup: use separate rstat trees for each subsystem Shakeel Butt
3 siblings, 2 replies; 22+ messages in thread
From: Shakeel Butt @ 2025-04-29 6:12 UTC (permalink / raw)
To: Tejun Heo, Andrew Morton, Alexei Starovoitov, shakeel.butt
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
Yosry Ahmed, Michal Koutný, Vlastimil Babka,
Sebastian Andrzej Siewior, JP Kobryn, bpf, linux-mm, cgroups,
linux-kernel, Meta kernel team
From: JP Kobryn <inwardvessel@gmail.com>
Different subsystems may call cgroup_rstat_updated() within the same
cgroup, resulting in a tree of pending updates from multiple subsystems.
When one of these subsystems is flushed via cgroup_rstat_flush(), all
other subsystems with pending updates on the tree will also be flushed.
Change the paradigm of having a single rstat tree for all subsystems to
having separate trees for each subsystem. This separation allows for
subsystems to perform flushes without the side effects of other subsystems.
As an example, flushing the cpu stats will no longer cause the memory stats
to be flushed and vice versa.
In order to achieve subsystem-specific trees, change the tree node type
from cgroup to cgroup_subsys_state pointer. Then remove those pointers from
the cgroup and instead place them on the css. Finally, change update/flush
functions to make use of the different node type (css). These changes allow
a specific subsystem to be associated with an update or flush. Separate
rstat trees will now exist for each unique subsystem.
Since updating/flushing will now be done at the subsystem level, there is
no longer a need to keep track of updated css nodes at the cgroup level.
The list management of these nodes done within the cgroup (rstat_css_list
and related) has been removed accordingly.
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
include/linux/cgroup-defs.h | 40 ++--
kernel/cgroup/cgroup.c | 49 ++---
kernel/cgroup/rstat.c | 174 +++++++++---------
.../selftests/bpf/progs/btf_type_tag_percpu.c | 18 +-
4 files changed, 150 insertions(+), 131 deletions(-)
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index e58bfb880111..45a605c74ff8 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -169,6 +169,9 @@ struct cgroup_subsys_state {
/* reference count - access via css_[try]get() and css_put() */
struct percpu_ref refcnt;
+ /* per-cpu recursive resource statistics */
+ struct css_rstat_cpu __percpu *rstat_cpu;
+
/*
* siblings list anchored at the parent's ->children
*
@@ -177,9 +180,6 @@ struct cgroup_subsys_state {
struct list_head sibling;
struct list_head children;
- /* flush target list anchored at cgrp->rstat_css_list */
- struct list_head rstat_css_node;
-
/*
* PI: Subsys-unique ID. 0 is unused and root is always 1. The
* matching css can be looked up using css_from_id().
@@ -219,6 +219,13 @@ struct cgroup_subsys_state {
* Protected by cgroup_mutex.
*/
int nr_descendants;
+
+ /*
+ * A singly-linked list of css structures to be rstat flushed.
+ * This is a scratch field to be used exclusively by
+ * css_rstat_flush_locked() and protected by cgroup_rstat_lock.
+ */
+ struct cgroup_subsys_state *rstat_flush_next;
};
/*
@@ -329,10 +336,10 @@ struct cgroup_base_stat {
/*
* rstat - cgroup scalable recursive statistics. Accounting is done
- * per-cpu in cgroup_rstat_cpu which is then lazily propagated up the
+ * per-cpu in css_rstat_cpu which is then lazily propagated up the
* hierarchy on reads.
*
- * When a stat gets updated, the cgroup_rstat_cpu and its ancestors are
+ * When a stat gets updated, the css_rstat_cpu and its ancestors are
* linked into the updated tree. On the following read, propagation only
* considers and consumes the updated tree. This makes reading O(the
* number of descendants which have been active since last read) instead of
@@ -346,7 +353,7 @@ struct cgroup_base_stat {
* This struct hosts both the fields which implement the above -
* updated_children and updated_next.
*/
-struct cgroup_rstat_cpu {
+struct css_rstat_cpu {
/*
* Child cgroups with stat updates on this cpu since the last read
* are linked on the parent's ->updated_children through
@@ -358,8 +365,8 @@ struct cgroup_rstat_cpu {
*
* Protected by per-cpu cgroup_rstat_cpu_lock.
*/
- struct cgroup *updated_children; /* terminated by self cgroup */
- struct cgroup *updated_next; /* NULL iff not on the list */
+ struct cgroup_subsys_state *updated_children; /* terminated by self cgroup */
+ struct cgroup_subsys_state *updated_next; /* NULL iff not on the list */
};
/*
@@ -521,25 +528,16 @@ struct cgroup {
struct cgroup *dom_cgrp;
struct cgroup *old_dom_cgrp; /* used while enabling threaded */
- /* per-cpu recursive resource statistics */
- struct cgroup_rstat_cpu __percpu *rstat_cpu;
+ /* per-cpu recursive basic resource statistics */
struct cgroup_rstat_base_cpu __percpu *rstat_base_cpu;
- struct list_head rstat_css_list;
/*
- * Add padding to separate the read mostly rstat_cpu and
- * rstat_css_list into a different cacheline from the following
- * rstat_flush_next and *bstat fields which can have frequent updates.
+ * Add padding to keep the read mostly rstat per-cpu pointer on a
+ * different cacheline than the following *bstat fields which can have
+ * frequent updates.
*/
CACHELINE_PADDING(_pad_);
- /*
- * A singly-linked list of cgroup structures to be rstat flushed.
- * This is a scratch field to be used exclusively by
- * css_rstat_flush_locked() and protected by cgroup_rstat_lock.
- */
- struct cgroup *rstat_flush_next;
-
/* cgroup basic resource statistics */
struct cgroup_base_stat last_bstat;
struct cgroup_base_stat bstat;
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index a314138894e1..d9865299edf5 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -161,12 +161,12 @@ static struct static_key_true *cgroup_subsys_on_dfl_key[] = {
};
#undef SUBSYS
-static DEFINE_PER_CPU(struct cgroup_rstat_cpu, root_rstat_cpu);
+static DEFINE_PER_CPU(struct css_rstat_cpu, root_rstat_cpu);
static DEFINE_PER_CPU(struct cgroup_rstat_base_cpu, root_rstat_base_cpu);
/* the default hierarchy */
struct cgroup_root cgrp_dfl_root = {
- .cgrp.rstat_cpu = &root_rstat_cpu,
+ .cgrp.self.rstat_cpu = &root_rstat_cpu,
.cgrp.rstat_base_cpu = &root_rstat_base_cpu,
};
EXPORT_SYMBOL_GPL(cgrp_dfl_root);
@@ -1880,13 +1880,6 @@ int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask)
}
spin_unlock_irq(&css_set_lock);
- if (ss->css_rstat_flush) {
- list_del_rcu(&css->rstat_css_node);
- synchronize_rcu();
- list_add_rcu(&css->rstat_css_node,
- &dcgrp->rstat_css_list);
- }
-
/* default hierarchy doesn't enable controllers by default */
dst_root->subsys_mask |= 1 << ssid;
if (dst_root == &cgrp_dfl_root) {
@@ -2069,7 +2062,6 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
cgrp->dom_cgrp = cgrp;
cgrp->max_descendants = INT_MAX;
cgrp->max_depth = INT_MAX;
- INIT_LIST_HEAD(&cgrp->rstat_css_list);
prev_cputime_init(&cgrp->prev_cputime);
for_each_subsys(ss, ssid)
@@ -5454,6 +5446,9 @@ static void css_free_rwork_fn(struct work_struct *work)
struct cgroup_subsys_state *parent = css->parent;
int id = css->id;
+ if (ss->css_rstat_flush)
+ css_rstat_exit(css);
+
ss->css_free(css);
cgroup_idr_remove(&ss->css_idr, id);
cgroup_put(cgrp);
@@ -5506,11 +5501,8 @@ static void css_release_work_fn(struct work_struct *work)
if (ss) {
struct cgroup *parent_cgrp;
- /* css release path */
- if (!list_empty(&css->rstat_css_node)) {
+ if (ss->css_rstat_flush)
css_rstat_flush(css);
- list_del_rcu(&css->rstat_css_node);
- }
cgroup_idr_replace(&ss->css_idr, NULL, css->id);
if (ss->css_released)
@@ -5536,7 +5528,7 @@ static void css_release_work_fn(struct work_struct *work)
/* cgroup release path */
TRACE_CGROUP_PATH(release, cgrp);
- css_rstat_flush(css);
+ css_rstat_flush(&cgrp->self);
spin_lock_irq(&css_set_lock);
for (tcgrp = cgroup_parent(cgrp); tcgrp;
@@ -5584,7 +5576,6 @@ static void init_and_link_css(struct cgroup_subsys_state *css,
css->id = -1;
INIT_LIST_HEAD(&css->sibling);
INIT_LIST_HEAD(&css->children);
- INIT_LIST_HEAD(&css->rstat_css_node);
css->serial_nr = css_serial_nr_next++;
atomic_set(&css->online_cnt, 0);
@@ -5593,9 +5584,6 @@ static void init_and_link_css(struct cgroup_subsys_state *css,
css_get(css->parent);
}
- if (ss->css_rstat_flush)
- list_add_rcu(&css->rstat_css_node, &cgrp->rstat_css_list);
-
BUG_ON(cgroup_css(cgrp, ss));
}
@@ -5688,6 +5676,12 @@ static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
goto err_free_css;
css->id = err;
+ if (ss->css_rstat_flush) {
+ err = css_rstat_init(css);
+ if (err)
+ goto err_free_css;
+ }
+
/* @css is ready to be brought online now, make it visible */
list_add_tail_rcu(&css->sibling, &parent_css->children);
cgroup_idr_replace(&ss->css_idr, css, css->id);
@@ -5701,7 +5695,6 @@ static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
err_list_del:
list_del_rcu(&css->sibling);
err_free_css:
- list_del_rcu(&css->rstat_css_node);
INIT_RCU_WORK(&css->destroy_rwork, css_free_rwork_fn);
queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
return ERR_PTR(err);
@@ -6139,11 +6132,17 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
css->flags |= CSS_NO_REF;
if (early) {
- /* allocation can't be done safely during early init */
+ /*
+ * Allocation can't be done safely during early init.
+ * Defer IDR and rstat allocations until cgroup_init().
+ */
css->id = 1;
} else {
css->id = cgroup_idr_alloc(&ss->css_idr, css, 1, 2, GFP_KERNEL);
BUG_ON(css->id < 0);
+
+ if (ss->css_rstat_flush)
+ BUG_ON(css_rstat_init(css));
}
/* Update the init_css_set to contain a subsys
@@ -6242,9 +6241,17 @@ int __init cgroup_init(void)
struct cgroup_subsys_state *css =
init_css_set.subsys[ss->id];
+ /*
+ * It is now safe to perform allocations.
+ * Finish setting up subsystems that previously
+ * deferred IDR and rstat allocations.
+ */
css->id = cgroup_idr_alloc(&ss->css_idr, css, 1, 2,
GFP_KERNEL);
BUG_ON(css->id < 0);
+
+ if (ss->css_rstat_flush)
+ BUG_ON(css_rstat_init(css));
} else {
cgroup_init_subsys(ss, false);
}
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 357c538d14da..ddc799ca6591 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -14,9 +14,10 @@ static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
-static struct cgroup_rstat_cpu *cgroup_rstat_cpu(struct cgroup *cgrp, int cpu)
+static struct css_rstat_cpu *css_rstat_cpu(
+ struct cgroup_subsys_state *css, int cpu)
{
- return per_cpu_ptr(cgrp->rstat_cpu, cpu);
+ return per_cpu_ptr(css->rstat_cpu, cpu);
}
static struct cgroup_rstat_base_cpu *cgroup_rstat_base_cpu(
@@ -87,13 +88,12 @@ void _css_rstat_cpu_unlock(raw_spinlock_t *cpu_lock, int cpu,
* @css: target cgroup subsystem state
* @cpu: cpu on which rstat_cpu was updated
*
- * @css->cgroup's rstat_cpu on @cpu was updated. Put it on the parent's
- * matching rstat_cpu->updated_children list. See the comment on top of
- * cgroup_rstat_cpu definition for details.
+ * @css's rstat_cpu on @cpu was updated. Put it on the parent's matching
+ * rstat_cpu->updated_children list. See the comment on top of
+ * css_rstat_cpu definition for details.
*/
__bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
{
- struct cgroup *cgrp = css->cgroup;
raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
unsigned long flags;
@@ -102,19 +102,19 @@ __bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
* temporary inaccuracies, which is fine.
*
* Because @parent's updated_children is terminated with @parent
- * instead of NULL, we can tell whether @cgrp is on the list by
+ * instead of NULL, we can tell whether @css is on the list by
* testing the next pointer for NULL.
*/
- if (data_race(cgroup_rstat_cpu(cgrp, cpu)->updated_next))
+ if (data_race(css_rstat_cpu(css, cpu)->updated_next))
return;
flags = _css_rstat_cpu_lock(cpu_lock, cpu, css, true);
- /* put @cgrp and all ancestors on the corresponding updated lists */
+ /* put @css and all ancestors on the corresponding updated lists */
while (true) {
- struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
- struct cgroup *parent = cgroup_parent(cgrp);
- struct cgroup_rstat_cpu *prstatc;
+ struct css_rstat_cpu *rstatc = css_rstat_cpu(css, cpu);
+ struct cgroup_subsys_state *parent = css->parent;
+ struct css_rstat_cpu *prstatc;
/*
* Both additions and removals are bottom-up. If a cgroup
@@ -125,40 +125,41 @@ __bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
/* Root has no parent to link it to, but mark it busy */
if (!parent) {
- rstatc->updated_next = cgrp;
+ rstatc->updated_next = css;
break;
}
- prstatc = cgroup_rstat_cpu(parent, cpu);
+ prstatc = css_rstat_cpu(parent, cpu);
rstatc->updated_next = prstatc->updated_children;
- prstatc->updated_children = cgrp;
+ prstatc->updated_children = css;
- cgrp = parent;
+ css = parent;
}
_css_rstat_cpu_unlock(cpu_lock, cpu, css, flags, true);
}
/**
- * cgroup_rstat_push_children - push children cgroups into the given list
+ * css_rstat_push_children - push children css's into the given list
* @head: current head of the list (= subtree root)
* @child: first child of the root
* @cpu: target cpu
* Return: A new singly linked list of cgroups to be flushed
*
- * Iteratively traverse down the cgroup_rstat_cpu updated tree level by
+ * Iteratively traverse down the css_rstat_cpu updated tree level by
* level and push all the parents first before their next level children
* into a singly linked list via the rstat_flush_next pointer built from the
* tail backward like "pushing" cgroups into a stack. The root is pushed by
* the caller.
*/
-static struct cgroup *cgroup_rstat_push_children(struct cgroup *head,
- struct cgroup *child, int cpu)
+static struct cgroup_subsys_state *css_rstat_push_children(
+ struct cgroup_subsys_state *head,
+ struct cgroup_subsys_state *child, int cpu)
{
- struct cgroup *cnext = child; /* Next head of child cgroup level */
- struct cgroup *ghead = NULL; /* Head of grandchild cgroup level */
- struct cgroup *parent, *grandchild;
- struct cgroup_rstat_cpu *crstatc;
+ struct cgroup_subsys_state *cnext = child; /* Next head of child css level */
+ struct cgroup_subsys_state *ghead = NULL; /* Head of grandchild css level */
+ struct cgroup_subsys_state *parent, *grandchild;
+ struct css_rstat_cpu *crstatc;
child->rstat_flush_next = NULL;
@@ -189,13 +190,13 @@ static struct cgroup *cgroup_rstat_push_children(struct cgroup *head,
while (cnext) {
child = cnext;
cnext = child->rstat_flush_next;
- parent = cgroup_parent(child);
+ parent = child->parent;
/* updated_next is parent cgroup terminated if !NULL */
while (child != parent) {
child->rstat_flush_next = head;
head = child;
- crstatc = cgroup_rstat_cpu(child, cpu);
+ crstatc = css_rstat_cpu(child, cpu);
grandchild = crstatc->updated_children;
if (grandchild != child) {
/* Push the grand child to the next level */
@@ -217,31 +218,32 @@ static struct cgroup *cgroup_rstat_push_children(struct cgroup *head,
}
/**
- * cgroup_rstat_updated_list - return a list of updated cgroups to be flushed
- * @root: root of the cgroup subtree to traverse
+ * css_rstat_updated_list - return a list of updated cgroups to be flushed
+ * @root: root of the css subtree to traverse
* @cpu: target cpu
* Return: A singly linked list of cgroups to be flushed
*
* Walks the updated rstat_cpu tree on @cpu from @root. During traversal,
- * each returned cgroup is unlinked from the updated tree.
+ * each returned css is unlinked from the updated tree.
*
* The only ordering guarantee is that, for a parent and a child pair
* covered by a given traversal, the child is before its parent in
* the list.
*
* Note that updated_children is self terminated and points to a list of
- * child cgroups if not empty. Whereas updated_next is like a sibling link
- * within the children list and terminated by the parent cgroup. An exception
+ * child css's if not empty. Whereas updated_next is like a sibling link
+ * within the children list and terminated by the parent css. An exception
* here is the cgroup root whose updated_next can be self terminated.
*/
-static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
+static struct cgroup_subsys_state *css_rstat_updated_list(
+ struct cgroup_subsys_state *root, int cpu)
{
raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
- struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(root, cpu);
- struct cgroup *head = NULL, *parent, *child;
+ struct css_rstat_cpu *rstatc = css_rstat_cpu(root, cpu);
+ struct cgroup_subsys_state *head = NULL, *parent, *child;
unsigned long flags;
- flags = _css_rstat_cpu_lock(cpu_lock, cpu, &root->self, false);
+ flags = _css_rstat_cpu_lock(cpu_lock, cpu, root, false);
/* Return NULL if this subtree is not on-list */
if (!rstatc->updated_next)
@@ -251,17 +253,17 @@ static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
* Unlink @root from its parent. As the updated_children list is
* singly linked, we have to walk it to find the removal point.
*/
- parent = cgroup_parent(root);
+ parent = root->parent;
if (parent) {
- struct cgroup_rstat_cpu *prstatc;
- struct cgroup **nextp;
+ struct css_rstat_cpu *prstatc;
+ struct cgroup_subsys_state **nextp;
- prstatc = cgroup_rstat_cpu(parent, cpu);
+ prstatc = css_rstat_cpu(parent, cpu);
nextp = &prstatc->updated_children;
while (*nextp != root) {
- struct cgroup_rstat_cpu *nrstatc;
+ struct css_rstat_cpu *nrstatc;
- nrstatc = cgroup_rstat_cpu(*nextp, cpu);
+ nrstatc = css_rstat_cpu(*nextp, cpu);
WARN_ON_ONCE(*nextp == parent);
nextp = &nrstatc->updated_next;
}
@@ -276,9 +278,9 @@ static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
child = rstatc->updated_children;
rstatc->updated_children = root;
if (child != root)
- head = cgroup_rstat_push_children(head, child, cpu);
+ head = css_rstat_push_children(head, child, cpu);
unlock_ret:
- _css_rstat_cpu_unlock(cpu_lock, cpu, &root->self, flags, false);
+ _css_rstat_cpu_unlock(cpu_lock, cpu, root, flags, false);
return head;
}
@@ -339,41 +341,36 @@ static inline void __css_rstat_unlock(struct cgroup_subsys_state *css,
}
/**
- * css_rstat_flush - flush stats in @css->cgroup's subtree
+ * css_rstat_flush - flush stats in @css's rstat subtree
* @css: target cgroup subsystem state
*
- * Collect all per-cpu stats in @css->cgroup's subtree into the global counters
- * and propagate them upwards. After this function returns, all cgroups in
- * the subtree have up-to-date ->stat.
+ * Collect all per-cpu stats in @css's subtree into the global counters
+ * and propagate them upwards. After this function returns, all rstat
+ * nodes in the subtree have up-to-date ->stat.
*
- * This also gets all cgroups in the subtree including @css->cgroup off the
+ * This also gets all rstat nodes in the subtree including @css off the
* ->updated_children lists.
*
* This function may block.
*/
__bpf_kfunc void css_rstat_flush(struct cgroup_subsys_state *css)
{
- struct cgroup *cgrp = css->cgroup;
int cpu;
might_sleep();
for_each_possible_cpu(cpu) {
- struct cgroup *pos;
+ struct cgroup_subsys_state *pos;
/* Reacquire for each CPU to avoid disabling IRQs too long */
__css_rstat_lock(css, cpu);
- pos = cgroup_rstat_updated_list(cgrp, cpu);
+ pos = css_rstat_updated_list(css, cpu);
for (; pos; pos = pos->rstat_flush_next) {
- struct cgroup_subsys_state *css;
-
- cgroup_base_stat_flush(pos, cpu);
- bpf_rstat_flush(pos, cgroup_parent(pos), cpu);
-
- rcu_read_lock();
- list_for_each_entry_rcu(css, &pos->rstat_css_list,
- rstat_css_node)
+ if (css_is_cgroup(pos)) {
+ cgroup_base_stat_flush(pos->cgroup, cpu);
+ bpf_rstat_flush(pos->cgroup,
+ cgroup_parent(pos->cgroup), cpu);
+ } else if (pos->ss->css_rstat_flush)
+ pos->ss->css_rstat_flush(pos, cpu);
- rcu_read_unlock();
}
__css_rstat_unlock(css, cpu);
if (!cond_resched())
@@ -386,29 +383,36 @@ int css_rstat_init(struct cgroup_subsys_state *css)
struct cgroup *cgrp = css->cgroup;
int cpu;
- /* the root cgrp has rstat_cpu preallocated */
- if (!cgrp->rstat_cpu) {
- cgrp->rstat_cpu = alloc_percpu(struct cgroup_rstat_cpu);
- if (!cgrp->rstat_cpu)
- return -ENOMEM;
+ /* the root cgrp has rstat_base_cpu preallocated */
+ if (css_is_cgroup(css)) {
+ if (!cgrp->rstat_base_cpu) {
+ cgrp->rstat_base_cpu = alloc_percpu(struct cgroup_rstat_base_cpu);
+ if (!cgrp->rstat_base_cpu)
+ return -ENOMEM;
+ }
}
- if (!cgrp->rstat_base_cpu) {
- cgrp->rstat_base_cpu = alloc_percpu(struct cgroup_rstat_base_cpu);
- if (!cgrp->rstat_base_cpu) {
- free_percpu(cgrp->rstat_cpu);
- return -ENOMEM;
+ /* the root cgrp's self css has rstat_cpu preallocated */
+ if (!css->rstat_cpu) {
+ css->rstat_cpu = alloc_percpu(struct css_rstat_cpu);
+ if (!css->rstat_cpu) {
+ if (css_is_cgroup(css))
+ free_percpu(cgrp->rstat_base_cpu);
+ return -ENOMEM;
}
}
/* ->updated_children list is self terminated */
for_each_possible_cpu(cpu) {
- struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
- struct cgroup_rstat_base_cpu *rstatbc =
- cgroup_rstat_base_cpu(cgrp, cpu);
+ struct css_rstat_cpu *rstatc = css_rstat_cpu(css, cpu);
- rstatc->updated_children = cgrp;
- u64_stats_init(&rstatbc->bsync);
+ rstatc->updated_children = css;
+
+ if (css_is_cgroup(css)) {
+ struct cgroup_rstat_base_cpu *rstatbc;
+
+ rstatbc = cgroup_rstat_base_cpu(cgrp, cpu);
+ u64_stats_init(&rstatbc->bsync);
+ }
}
return 0;
@@ -416,24 +420,28 @@ int css_rstat_init(struct cgroup_subsys_state *css)
void css_rstat_exit(struct cgroup_subsys_state *css)
{
- struct cgroup *cgrp = css->cgroup;
int cpu;
- css_rstat_flush(&cgrp->self);
+ css_rstat_flush(css);
/* sanity check */
for_each_possible_cpu(cpu) {
- struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
+ struct css_rstat_cpu *rstatc = css_rstat_cpu(css, cpu);
- if (WARN_ON_ONCE(rstatc->updated_children != cgrp) ||
+ if (WARN_ON_ONCE(rstatc->updated_children != css) ||
WARN_ON_ONCE(rstatc->updated_next))
return;
}
- free_percpu(cgrp->rstat_cpu);
- cgrp->rstat_cpu = NULL;
- free_percpu(cgrp->rstat_base_cpu);
- cgrp->rstat_base_cpu = NULL;
+ if (css_is_cgroup(css)) {
+ struct cgroup *cgrp = css->cgroup;
+
+ free_percpu(cgrp->rstat_base_cpu);
+ cgrp->rstat_base_cpu = NULL;
+ }
+
+ free_percpu(css->rstat_cpu);
+ css->rstat_cpu = NULL;
}
void __init cgroup_rstat_boot(void)
diff --git a/tools/testing/selftests/bpf/progs/btf_type_tag_percpu.c b/tools/testing/selftests/bpf/progs/btf_type_tag_percpu.c
index 38f78d9345de..69f81cb555ca 100644
--- a/tools/testing/selftests/bpf/progs/btf_type_tag_percpu.c
+++ b/tools/testing/selftests/bpf/progs/btf_type_tag_percpu.c
@@ -30,22 +30,27 @@ int BPF_PROG(test_percpu2, struct bpf_testmod_btf_type_tag_2 *arg)
/* trace_cgroup_mkdir(struct cgroup *cgrp, const char *path)
*
- * struct cgroup_rstat_cpu {
+ * struct css_rstat_cpu {
* ...
- * struct cgroup *updated_children;
+ * struct cgroup_subsys_state *updated_children;
* ...
* };
*
- * struct cgroup {
+ * struct cgroup_subsys_state {
+ * ...
+ * struct css_rstat_cpu __percpu *rstat_cpu;
* ...
- * struct cgroup_rstat_cpu __percpu *rstat_cpu;
+ * };
+ *
+ * struct cgroup {
+ * struct cgroup_subsys_state self;
* ...
* };
*/
SEC("tp_btf/cgroup_mkdir")
int BPF_PROG(test_percpu_load, struct cgroup *cgrp, const char *path)
{
- g = (__u64)cgrp->rstat_cpu->updated_children;
+ g = (__u64)cgrp->self.rstat_cpu->updated_children;
return 0;
}
@@ -56,7 +61,8 @@ int BPF_PROG(test_percpu_helper, struct cgroup *cgrp, const char *path)
__u32 cpu;
cpu = bpf_get_smp_processor_id();
- rstat = (struct cgroup_rstat_cpu *)bpf_per_cpu_ptr(cgrp->rstat_cpu, cpu);
+ rstat = (struct cgroup_rstat_cpu *)bpf_per_cpu_ptr(
+ cgrp->self.rstat_cpu, cpu);
if (rstat) {
/* READ_ONCE */
*(volatile int *)rstat;
--
2.47.1
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [OFFLIST PATCH 2/2] cgroup: use subsystem-specific rstat locks to avoid contention
2025-04-29 6:12 ` [OFFLIST PATCH 1/2] cgroup: use separate rstat trees for each subsystem Shakeel Butt
@ 2025-04-29 6:12 ` Shakeel Butt
2025-04-29 6:15 ` Shakeel Butt
2025-04-29 6:15 ` [OFFLIST PATCH 1/2] cgroup: use separate rstat trees for each subsystem Shakeel Butt
1 sibling, 1 reply; 22+ messages in thread
From: Shakeel Butt @ 2025-04-29 6:12 UTC (permalink / raw)
To: Tejun Heo, Andrew Morton, Alexei Starovoitov, shakeel.butt
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
Yosry Ahmed, Michal Koutný, Vlastimil Babka,
Sebastian Andrzej Siewior, JP Kobryn, bpf, linux-mm, cgroups,
linux-kernel, Meta kernel team
From: JP Kobryn <inwardvessel@gmail.com>
It is possible to eliminate contention between subsystems when
updating/flushing stats by using subsystem-specific locks. Let the existing
rstat locks be dedicated to the cgroup base stats and rename them to
reflect that. Add similar locks to the cgroup_subsys struct for use with
individual subsystems.
Lock initialization is done in the new function ss_rstat_init(ss) which
replaces cgroup_rstat_boot(void). If NULL is passed to this function, the
global base stat locks will be initialized. Otherwise, the subsystem locks
will be initialized.
Change the existing lock helper functions to accept a reference to a css.
Then within these functions, conditionally select the appropriate locks
based on the subsystem affiliation of the given css. Add helper functions
for this selection routine to avoid repeated code.
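For example, the lock selection helper ends up looking roughly like this
(condensed from the diff below):

	static spinlock_t *ss_rstat_lock(struct cgroup_subsys *ss)
	{
		/* NULL ss means the base stats tracked by the cgroup itself */
		return ss ? &ss->rstat_ss_lock : &rstat_base_lock;
	}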
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
block/blk-cgroup.c | 2 +-
include/linux/cgroup-defs.h | 16 +++--
include/trace/events/cgroup.h | 12 +++-
kernel/cgroup/cgroup-internal.h | 2 +-
kernel/cgroup/cgroup.c | 10 ++-
kernel/cgroup/rstat.c | 108 +++++++++++++++++++++-----------
6 files changed, 103 insertions(+), 47 deletions(-)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index abeb7ec27e92..d7563b4bb795 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1074,7 +1074,7 @@ static void __blkcg_rstat_flush(struct blkcg *blkcg, int cpu)
/*
* For covering concurrent parent blkg update from blkg_release().
*
- * When flushing from cgroup, cgroup_rstat_lock is always held, so
+ * When flushing from cgroup, the subsystem lock is always held, so
* this lock won't cause contention most of time.
*/
raw_spin_lock_irqsave(&blkg_stat_lock, flags);
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 45a605c74ff8..560582c4dbeb 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -223,7 +223,10 @@ struct cgroup_subsys_state {
/*
* A singly-linked list of css structures to be rstat flushed.
* This is a scratch field to be used exclusively by
- * css_rstat_flush_locked() and protected by cgroup_rstat_lock.
+ * css_rstat_flush_locked().
+ *
+ * Protected by rstat_base_lock when css is cgroup::self.
+ * Protected by css->ss->rstat_ss_lock otherwise.
*/
struct cgroup_subsys_state *rstat_flush_next;
};
@@ -359,11 +362,11 @@ struct css_rstat_cpu {
* are linked on the parent's ->updated_children through
* ->updated_next.
*
- * In addition to being more compact, singly-linked list pointing
- * to the cgroup makes it unnecessary for each per-cpu struct to
- * point back to the associated cgroup.
+ * In addition to being more compact, singly-linked list pointing to
+ * the css makes it unnecessary for each per-cpu struct to point back
+ * to the associated css.
*
- * Protected by per-cpu cgroup_rstat_cpu_lock.
+ * Protected by per-cpu css->ss->rstat_ss_cpu_lock.
*/
struct cgroup_subsys_state *updated_children; /* terminated by self cgroup */
struct cgroup_subsys_state *updated_next; /* NULL iff not on the list */
@@ -794,6 +797,9 @@ struct cgroup_subsys {
* specifies the mask of subsystems that this one depends on.
*/
unsigned int depends_on;
+
+ spinlock_t rstat_ss_lock;
+ raw_spinlock_t __percpu *rstat_ss_cpu_lock;
};
extern struct percpu_rw_semaphore cgroup_threadgroup_rwsem;
diff --git a/include/trace/events/cgroup.h b/include/trace/events/cgroup.h
index af2755bda6eb..7d332387be6c 100644
--- a/include/trace/events/cgroup.h
+++ b/include/trace/events/cgroup.h
@@ -231,7 +231,11 @@ DECLARE_EVENT_CLASS(cgroup_rstat,
__entry->cpu, __entry->contended)
);
-/* Related to global: cgroup_rstat_lock */
+/*
+ * Related to locks:
+ * global rstat_base_lock for base stats
+ * cgroup_subsys::rstat_ss_lock for subsystem stats
+ */
DEFINE_EVENT(cgroup_rstat, cgroup_rstat_lock_contended,
TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
@@ -253,7 +257,11 @@ DEFINE_EVENT(cgroup_rstat, cgroup_rstat_unlock,
TP_ARGS(cgrp, cpu, contended)
);
-/* Related to per CPU: cgroup_rstat_cpu_lock */
+/*
+ * Related to per CPU locks:
+ * global rstat_base_cpu_lock for base stats
+ * cgroup_subsys::rstat_ss_cpu_lock for subsystem stats
+ */
DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_lock_contended,
TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index c161d34be634..b14e61c64a34 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -272,7 +272,7 @@ int cgroup_task_count(const struct cgroup *cgrp);
*/
int css_rstat_init(struct cgroup_subsys_state *css);
void css_rstat_exit(struct cgroup_subsys_state *css);
-void cgroup_rstat_boot(void);
+int ss_rstat_init(struct cgroup_subsys *ss);
void cgroup_base_stat_cputime_show(struct seq_file *seq);
/*
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index d9865299edf5..3528381ea73c 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -6141,8 +6141,10 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
css->id = cgroup_idr_alloc(&ss->css_idr, css, 1, 2, GFP_KERNEL);
BUG_ON(css->id < 0);
- if (ss->css_rstat_flush)
+ if (ss->css_rstat_flush) {
+ BUG_ON(ss_rstat_init(ss));
BUG_ON(css_rstat_init(css));
+ }
}
/* Update the init_css_set to contain a subsys
@@ -6219,7 +6221,7 @@ int __init cgroup_init(void)
BUG_ON(cgroup_init_cftypes(NULL, cgroup_psi_files));
BUG_ON(cgroup_init_cftypes(NULL, cgroup1_base_files));
- cgroup_rstat_boot();
+ BUG_ON(ss_rstat_init(NULL));
get_user_ns(init_cgroup_ns.user_ns);
@@ -6250,8 +6252,10 @@ int __init cgroup_init(void)
GFP_KERNEL);
BUG_ON(css->id < 0);
- if (ss->css_rstat_flush)
+ if (ss->css_rstat_flush) {
+ BUG_ON(ss_rstat_init(ss));
BUG_ON(css_rstat_init(css));
+ }
} else {
cgroup_init_subsys(ss, false);
}
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index ddc799ca6591..a30bcc4d4f48 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -9,8 +9,8 @@
#include <trace/events/cgroup.h>
-static DEFINE_SPINLOCK(cgroup_rstat_lock);
-static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
+static DEFINE_SPINLOCK(rstat_base_lock);
+static DEFINE_PER_CPU(raw_spinlock_t, rstat_base_cpu_lock);
static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
@@ -26,8 +26,24 @@ static struct cgroup_rstat_base_cpu *cgroup_rstat_base_cpu(
return per_cpu_ptr(cgrp->rstat_base_cpu, cpu);
}
+static spinlock_t *ss_rstat_lock(struct cgroup_subsys *ss)
+{
+ if (ss)
+ return &ss->rstat_ss_lock;
+
+ return &rstat_base_lock;
+}
+
+static raw_spinlock_t *ss_rstat_cpu_lock(struct cgroup_subsys *ss, int cpu)
+{
+ if (ss)
+ return per_cpu_ptr(ss->rstat_ss_cpu_lock, cpu);
+
+ return per_cpu_ptr(&rstat_base_cpu_lock, cpu);
+}
+
/*
- * Helper functions for rstat per CPU lock (cgroup_rstat_cpu_lock).
+ * Helper functions for rstat per CPU locks.
*
* This makes it easier to diagnose locking issues and contention in
* production environments. The parameter @fast_path determine the
@@ -35,21 +51,23 @@ static struct cgroup_rstat_base_cpu *cgroup_rstat_base_cpu(
* operations without handling high-frequency fast-path "update" events.
*/
static __always_inline
-unsigned long _css_rstat_cpu_lock(raw_spinlock_t *cpu_lock, int cpu,
- struct cgroup_subsys_state *css, const bool fast_path)
+unsigned long _css_rstat_cpu_lock(struct cgroup_subsys_state *css, int cpu,
+ const bool fast_path)
{
struct cgroup *cgrp = css->cgroup;
+ raw_spinlock_t *cpu_lock;
unsigned long flags;
bool contended;
/*
- * The _irqsave() is needed because cgroup_rstat_lock is
- * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
- * this lock with the _irq() suffix only disables interrupts on
- * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
- * interrupts on both configurations. The _irqsave() ensures
- * that interrupts are always disabled and later restored.
+ * The _irqsave() is needed because the locks used for flushing are
+ * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring this lock
+ * with the _irq() suffix only disables interrupts on a non-PREEMPT_RT
+ * kernel. The raw_spinlock_t below disables interrupts on both
+ * configurations. The _irqsave() ensures that interrupts are always
+ * disabled and later restored.
*/
+ cpu_lock = ss_rstat_cpu_lock(css->ss, cpu);
contended = !raw_spin_trylock_irqsave(cpu_lock, flags);
if (contended) {
if (fast_path)
@@ -69,17 +87,18 @@ unsigned long _css_rstat_cpu_lock(raw_spinlock_t *cpu_lock, int cpu,
}
static __always_inline
-void _css_rstat_cpu_unlock(raw_spinlock_t *cpu_lock, int cpu,
- struct cgroup_subsys_state *css, unsigned long flags,
- const bool fast_path)
+void _css_rstat_cpu_unlock(struct cgroup_subsys_state *css, int cpu,
+ unsigned long flags, const bool fast_path)
{
struct cgroup *cgrp = css->cgroup;
+ raw_spinlock_t *cpu_lock;
if (fast_path)
trace_cgroup_rstat_cpu_unlock_fastpath(cgrp, cpu, false);
else
trace_cgroup_rstat_cpu_unlock(cgrp, cpu, false);
+ cpu_lock = ss_rstat_cpu_lock(css->ss, cpu);
raw_spin_unlock_irqrestore(cpu_lock, flags);
}
@@ -94,7 +113,6 @@ void _css_rstat_cpu_unlock(raw_spinlock_t *cpu_lock, int cpu,
*/
__bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
{
- raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
unsigned long flags;
/*
@@ -108,7 +126,7 @@ __bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
if (data_race(css_rstat_cpu(css, cpu)->updated_next))
return;
- flags = _css_rstat_cpu_lock(cpu_lock, cpu, css, true);
+ flags = _css_rstat_cpu_lock(css, cpu, true);
/* put @css and all ancestors on the corresponding updated lists */
while (true) {
@@ -136,7 +154,7 @@ __bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
css = parent;
}
- _css_rstat_cpu_unlock(cpu_lock, cpu, css, flags, true);
+ _css_rstat_cpu_unlock(css, cpu, flags, true);
}
/**
@@ -163,13 +181,6 @@ static struct cgroup_subsys_state *css_rstat_push_children(
child->rstat_flush_next = NULL;
- /*
- * The cgroup_rstat_lock must be held for the whole duration from
- * here as the rstat_flush_next list is being constructed to when
- * it is consumed later in css_rstat_flush().
- */
- lockdep_assert_held(&cgroup_rstat_lock);
-
/*
* Notation: -> updated_next pointer
* => rstat_flush_next pointer
@@ -238,12 +249,11 @@ static struct cgroup_subsys_state *css_rstat_push_children(
static struct cgroup_subsys_state *css_rstat_updated_list(
struct cgroup_subsys_state *root, int cpu)
{
- raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
struct css_rstat_cpu *rstatc = css_rstat_cpu(root, cpu);
struct cgroup_subsys_state *head = NULL, *parent, *child;
unsigned long flags;
- flags = _css_rstat_cpu_lock(cpu_lock, cpu, root, false);
+ flags = _css_rstat_cpu_lock(root, cpu, false);
/* Return NULL if this subtree is not on-list */
if (!rstatc->updated_next)
@@ -280,7 +290,7 @@ static struct cgroup_subsys_state *css_rstat_updated_list(
if (child != root)
head = css_rstat_push_children(head, child, cpu);
unlock_ret:
- _css_rstat_cpu_unlock(cpu_lock, cpu, root, flags, false);
+ _css_rstat_cpu_unlock(root, cpu, flags, false);
return head;
}
@@ -307,7 +317,7 @@ __weak noinline void bpf_rstat_flush(struct cgroup *cgrp,
__bpf_hook_end();
/*
- * Helper functions for locking cgroup_rstat_lock.
+ * Helper functions for locking.
*
* This makes it easier to diagnose locking issues and contention in
* production environments. The parameter @cpu_in_loop indicate lock
@@ -317,27 +327,31 @@ __bpf_hook_end();
*/
static inline void __css_rstat_lock(struct cgroup_subsys_state *css,
int cpu_in_loop)
- __acquires(&cgroup_rstat_lock)
+ __acquires(lock)
{
struct cgroup *cgrp = css->cgroup;
+ spinlock_t *lock;
bool contended;
- contended = !spin_trylock_irq(&cgroup_rstat_lock);
+ lock = ss_rstat_lock(css->ss);
+ contended = !spin_trylock_irq(lock);
if (contended) {
trace_cgroup_rstat_lock_contended(cgrp, cpu_in_loop, contended);
- spin_lock_irq(&cgroup_rstat_lock);
+ spin_lock_irq(lock);
}
trace_cgroup_rstat_locked(cgrp, cpu_in_loop, contended);
}
static inline void __css_rstat_unlock(struct cgroup_subsys_state *css,
int cpu_in_loop)
- __releases(&cgroup_rstat_lock)
+ __releases(lock)
{
struct cgroup *cgrp = css->cgroup;
+ spinlock_t *lock;
+ lock = ss_rstat_lock(css->ss);
trace_cgroup_rstat_unlock(cgrp, cpu_in_loop, false);
- spin_unlock_irq(&cgroup_rstat_lock);
+ spin_unlock_irq(lock);
}
/**
@@ -444,12 +458,36 @@ void css_rstat_exit(struct cgroup_subsys_state *css)
css->rstat_cpu = NULL;
}
-void __init cgroup_rstat_boot(void)
+/**
+ * ss_rstat_init - subsystem-specific rstat initialization
+ * @ss: target subsystem
+ *
+ * If @ss is NULL, the static locks associated with the base stats
+ * are initialized. If @ss is non-NULL, the subsystem-specific locks
+ * are initialized.
+ */
+int __init ss_rstat_init(struct cgroup_subsys *ss)
{
int cpu;
+ if (!ss) {
+ spin_lock_init(&rstat_base_lock);
+
+ for_each_possible_cpu(cpu)
+ raw_spin_lock_init(per_cpu_ptr(&rstat_base_cpu_lock, cpu));
+
+ return 0;
+ }
+
+ spin_lock_init(&ss->rstat_ss_lock);
+ ss->rstat_ss_cpu_lock = alloc_percpu(raw_spinlock_t);
+ if (!ss->rstat_ss_cpu_lock)
+ return -ENOMEM;
+
for_each_possible_cpu(cpu)
- raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu));
+ raw_spin_lock_init(per_cpu_ptr(ss->rstat_ss_cpu_lock, cpu));
+
+ return 0;
}
/*
--
2.47.1
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [OFFLIST PATCH 1/2] cgroup: use separate rstat trees for each subsystem
2025-04-29 6:12 ` [OFFLIST PATCH 1/2] cgroup: use separate rstat trees for each subsystem Shakeel Butt
2025-04-29 6:12 ` [OFFLIST PATCH 2/2] cgroup: use subsystem-specific rstat locks to avoid contention Shakeel Butt
@ 2025-04-29 6:15 ` Shakeel Butt
1 sibling, 0 replies; 22+ messages in thread
From: Shakeel Butt @ 2025-04-29 6:15 UTC (permalink / raw)
To: Tejun Heo, Andrew Morton, Alexei Starovoitov
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
Yosry Ahmed, Michal Koutný, Vlastimil Babka,
Sebastian Andrzej Siewior, JP Kobryn, bpf, linux-mm, cgroups,
linux-kernel, Meta kernel team
Please ignore this patch as it was sent by mistake.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [OFFLIST PATCH 2/2] cgroup: use subsystem-specific rstat locks to avoid contention
2025-04-29 6:12 ` [OFFLIST PATCH 2/2] cgroup: use subsystem-specific rstat locks to avoid contention Shakeel Butt
@ 2025-04-29 6:15 ` Shakeel Butt
2025-05-21 22:23 ` Klara Modin
0 siblings, 1 reply; 22+ messages in thread
From: Shakeel Butt @ 2025-04-29 6:15 UTC (permalink / raw)
To: Tejun Heo, Andrew Morton, Alexei Starovoitov
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
Yosry Ahmed, Michal Koutný, Vlastimil Babka,
Sebastian Andrzej Siewior, JP Kobryn, bpf, linux-mm, cgroups,
linux-kernel, Meta kernel team
Please ignore this patch as it was sent by mistake.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 1/3] llist: add list_add_iff_not_on_list()
2025-04-29 6:12 ` [RFC PATCH 1/3] llist: add list_add_iff_not_on_list() Shakeel Butt
@ 2025-04-30 12:44 ` Yosry Ahmed
0 siblings, 0 replies; 22+ messages in thread
From: Yosry Ahmed @ 2025-04-30 12:44 UTC (permalink / raw)
To: Shakeel Butt
Cc: Tejun Heo, Andrew Morton, Alexei Starovoitov, Johannes Weiner,
Michal Hocko, Roman Gushchin, Muchun Song, Michal Koutný,
Vlastimil Babka, Sebastian Andrzej Siewior, JP Kobryn, bpf,
linux-mm, cgroups, linux-kernel, Meta kernel team
On Mon, Apr 28, 2025 at 11:12:07PM -0700, Shakeel Butt wrote:
> As the name implies, llist_add_iff_not_on_list() adds the given node to
> the given list only if the node is not already on any list. Many CPUs
> can call this concurrently on the same node and only one of them will
> succeed.
>
> This is also useful across different contexts like task, irq and nmi.
> In the case of failure, either the node was already present on some
> list or the caller lost the race to add the given node to a list. That
> node will eventually be added to a list by the winner.
>
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> ---
> include/linux/llist.h | 3 +++
> lib/llist.c | 30 ++++++++++++++++++++++++++++++
> 2 files changed, 33 insertions(+)
>
> diff --git a/include/linux/llist.h b/include/linux/llist.h
> index 2c982ff7475a..030cfec8778b 100644
> --- a/include/linux/llist.h
> +++ b/include/linux/llist.h
> @@ -236,6 +236,9 @@ static inline bool __llist_add_batch(struct llist_node *new_first,
> return new_last->next == NULL;
> }
>
> +extern bool llist_add_iff_not_on_list(struct llist_node *new,
> + struct llist_head *head);
> +
> /**
> * llist_add - add a new entry
> * @new: new entry to be added
> diff --git a/lib/llist.c b/lib/llist.c
> index f21d0cfbbaaa..9d743164720f 100644
> --- a/lib/llist.c
> +++ b/lib/llist.c
> @@ -36,6 +36,36 @@ bool llist_add_batch(struct llist_node *new_first, struct llist_node *new_last,
> }
> EXPORT_SYMBOL_GPL(llist_add_batch);
>
> +/**
> + * llist_add_iff_not_on_list - add an entry if it is not on list
> + * @new: entry to be added
> + * @head: the head for your lock-less list
> + *
> + * Adds the given entry to the given list only if the entry is not on any list.
> + * This is useful for cases where multiple CPUs try to add the same node to
> + * the list or multiple contexts (process, irq or nmi) may add the same node to
> + * the list.
> + *
> + * Return true only if the caller has successfully added the given node to the
> + * list. Returns false if entry is already on some list or if another inserter
> + * wins the race to eventually add the given node to the list.
> + */
> +bool llist_add_iff_not_on_list(struct llist_node *new, struct llist_head *head)
What about llist_try_add()?
> +{
> + struct llist_node *first = READ_ONCE(head->first);
> +
> + if (llist_on_list(new))
> + return false;
> +
> + if (cmpxchg(&new->next, new, first) != new)
> + return false;
Here we will set new->next to the current head of the list, but this may
change from under us, and the next loop will then set it correctly
anyway. This is a bit confusing though.
Would it be better if we set new->next to NULL here, and then completely
rely on the loop below to set it properly?
> +
> + while (!try_cmpxchg(&head->first, &first, new))
> + new->next = first;
Not a big deal, but should we use llist_add_batch() here instead?
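Putting the rename and both simplifications together, a minimal sketch of
the suggested variant (untested; it assumes the self-pointing next
convention the existing helpers such as llist_on_list() and
llist_del_first_init() use for off-list nodes, and llist_try_add() is just
the name suggested above):

bool llist_try_add(struct llist_node *new, struct llist_head *head)
{
	if (llist_on_list(new))
		return false;

	/*
	 * Claim the node: only one concurrent caller flips next from the
	 * "not on any list" self-pointer to NULL. Losers bail out here.
	 */
	if (cmpxchg(&new->next, new, NULL) != new)
		return false;

	/* llist_add_batch() handles racing pushers and sets new->next properly. */
	llist_add_batch(new, new, head);
	return true;
}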
> + return true;
> +}
> +EXPORT_SYMBOL_GPL(llist_add_iff_not_on_list);
> +
> /**
> * llist_del_first - delete the first entry of lock-less list
> * @head: the head for your lock-less list
> --
> 2.47.1
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 3/3] cgroup: make css_rstat_updated nmi safe
2025-04-29 6:12 ` [RFC PATCH 3/3] cgroup: make css_rstat_updated nmi safe Shakeel Butt
@ 2025-04-30 13:14 ` Yosry Ahmed
2025-05-01 22:10 ` Shakeel Butt
0 siblings, 1 reply; 22+ messages in thread
From: Yosry Ahmed @ 2025-04-30 13:14 UTC (permalink / raw)
To: Shakeel Butt
Cc: Tejun Heo, Andrew Morton, Alexei Starovoitov, Johannes Weiner,
Michal Hocko, Roman Gushchin, Muchun Song, Michal Koutný,
Vlastimil Babka, Sebastian Andrzej Siewior, JP Kobryn, bpf,
linux-mm, cgroups, linux-kernel, Meta kernel team
On Mon, Apr 28, 2025 at 11:12:09PM -0700, Shakeel Butt wrote:
> To make css_rstat_updated() able to run safely in nmi context, it cannot
> spin on locks and instead has to trylock the per-cpu per-ss raw
> spinlock. This patch implements a backlog mechanism to handle
> failure to acquire the per-cpu per-ss raw spinlock.
>
> Each subsystem provides a per-cpu lockless list on which the kernel
> stores the css given to css_rstat_updated() on trylock failure. These
> lockless lists serve as the backlog. On the cgroup stats flushing code path, the
> kernel first processes all the per-cpu lockless backlog lists of the
> given ss and then proceeds to flush the update stat trees.
>
> With css_rstat_updated() being nmi safe, the memcg stats can and will be
> converted to be nmi safe to enable nmi safe mem charging.
>
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> ---
> kernel/cgroup/rstat.c | 99 +++++++++++++++++++++++++++++++++----------
> 1 file changed, 76 insertions(+), 23 deletions(-)
>
[..]
> @@ -153,6 +160,51 @@ __bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
>
> css = parent;
> }
> +}
> +
> +static void css_process_backlog(struct cgroup_subsys *ss, int cpu)
> +{
> + struct llist_head *lhead = ss_lhead_cpu(ss, cpu);
> + struct llist_node *lnode;
> +
> + while ((lnode = llist_del_first_init(lhead))) {
> + struct css_rstat_cpu *rstatc;
> +
> + rstatc = container_of(lnode, struct css_rstat_cpu, lnode);
> + __css_rstat_updated(rstatc->owner, cpu);
> + }
> +}
> +
> +/**
> + * css_rstat_updated - keep track of updated rstat_cpu
> + * @css: target cgroup subsystem state
> + * @cpu: cpu on which rstat_cpu was updated
> + *
> + * @css's rstat_cpu on @cpu was updated. Put it on the parent's matching
> + * rstat_cpu->updated_children list. See the comment on top of
> + * css_rstat_cpu definition for details.
> + */
> +__bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
> +{
> + unsigned long flags;
> +
> + /*
> + * Speculative already-on-list test. This may race leading to
> + * temporary inaccuracies, which is fine.
> + *
> + * Because @parent's updated_children is terminated with @parent
> + * instead of NULL, we can tell whether @css is on the list by
> + * testing the next pointer for NULL.
> + */
> + if (data_race(css_rstat_cpu(css, cpu)->updated_next))
> + return;
> +
> + if (!_css_rstat_cpu_trylock(css, cpu, &flags)) {
IIUC this trylock will only fail if a BPF program runs in NMI context
and tries to update cgroup stats, interrupting a context that is already
holding the lock (i.e. updating or flushing stats).
How often does this happen in practice tho? Is it worth the complexity?
I wonder if it's better if we make css_rstat_updated() inherently
lockless instead.
What if css_rstat_updated() always just adds to a lockless tree, and we
defer constructing the proper tree to the flushing side? This should
make updates generally faster and avoids locking or disabling interrupts
in the fast path. We essentially push more work to the flushing side.
We may be able to consolidate some of the code too if all the logic
manipulating the tree is on the flushing side.
WDYT? Am I missing something here?
> + css_add_to_backlog(css, cpu);
> + return;
> + }
> +
> + __css_rstat_updated(css, cpu);
>
> _css_rstat_cpu_unlock(css, cpu, flags, true);
> }
> @@ -255,6 +307,7 @@ static struct cgroup_subsys_state *css_rstat_updated_list(
>
> flags = _css_rstat_cpu_lock(root, cpu, false);
>
> + css_process_backlog(root->ss, cpu);
> /* Return NULL if this subtree is not on-list */
> if (!rstatc->updated_next)
> goto unlock_ret;
> --
> 2.47.1
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 3/3] cgroup: make css_rstat_updated nmi safe
2025-04-30 13:14 ` Yosry Ahmed
@ 2025-05-01 22:10 ` Shakeel Butt
2025-05-06 9:41 ` Yosry Ahmed
0 siblings, 1 reply; 22+ messages in thread
From: Shakeel Butt @ 2025-05-01 22:10 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Tejun Heo, Andrew Morton, Alexei Starovoitov, Johannes Weiner,
Michal Hocko, Roman Gushchin, Muchun Song, Michal Koutný,
Vlastimil Babka, Sebastian Andrzej Siewior, JP Kobryn, bpf,
linux-mm, cgroups, linux-kernel, Meta kernel team
On Wed, Apr 30, 2025 at 06:14:28AM -0700, Yosry Ahmed wrote:
[...]
> > +
> > + if (!_css_rstat_cpu_trylock(css, cpu, &flags)) {
>
>
> IIUC this trylock will only fail if a BPF program runs in NMI context
> and tries to update cgroup stats, interrupting a context that is already
> holding the lock (i.e. updating or flushing stats).
>
Correct (though note that flushing side can be on a different CPU).
> How often does this happen in practice tho? Is it worth the complexity?
This is about correctness, so even a chance of occurrence needs a
solution.
>
> I wonder if it's better if we make css_rstat_updated() inherently
> lockless instead.
>
> What if css_rstat_updated() always just adds to a lockless tree,
Here I assume you meant lockless list instead of tree.
> and we
> defer constructing the proper tree to the flushing side? This should
> make updates generally faster and avoids locking or disabling interrupts
> in the fast path. We essentially push more work to the flushing side.
>
> We may be able to consolidate some of the code too if all the logic
> manipulating the tree is on the flushing side.
>
> WDYT? Am I missing something here?
>
Yes this can be done but I don't think we need to tie that to current
series. I think we can start with lockless in the nmi context and then
iteratively make css_rstat_updated() lockless for all contexts.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 3/3] cgroup: make css_rstat_updated nmi safe
2025-05-01 22:10 ` Shakeel Butt
@ 2025-05-06 9:41 ` Yosry Ahmed
2025-05-06 19:30 ` Shakeel Butt
0 siblings, 1 reply; 22+ messages in thread
From: Yosry Ahmed @ 2025-05-06 9:41 UTC (permalink / raw)
To: Shakeel Butt
Cc: Tejun Heo, Andrew Morton, Alexei Starovoitov, Johannes Weiner,
Michal Hocko, Roman Gushchin, Muchun Song, Michal Koutný,
Vlastimil Babka, Sebastian Andrzej Siewior, JP Kobryn, bpf,
linux-mm, cgroups, linux-kernel, Meta kernel team
On Thu, May 01, 2025 at 03:10:20PM -0700, Shakeel Butt wrote:
> On Wed, Apr 30, 2025 at 06:14:28AM -0700, Yosry Ahmed wrote:
> [...]
> > > +
> > > + if (!_css_rstat_cpu_trylock(css, cpu, &flags)) {
> >
> >
> > IIUC this trylock will only fail if a BPF program runs in NMI context
> > and tries to update cgroup stats, interrupting a context that is already
> > holding the lock (i.e. updating or flushing stats).
> >
>
> Correct (though note that flushing side can be on a different CPU).
>
> > How often does this happen in practice tho? Is it worth the complexity?
>
> This is about correctness, so even a chance of occurrence needs a
> solution.
Right, my question was more about the need to special case NMIs, see
below.
>
> >
> > I wonder if it's better if we make css_rstat_updated() inherently
> > lockless instead.
> >
> > What if css_rstat_updated() always just adds to a lockless tree,
>
> Here I assume you meant lockless list instead of tree.
Yeah, in a sense. I meant using lockless lists to implement the rstat
tree instead of normal linked lists.
>
> > and we
> > defer constructing the proper tree to the flushing side? This should
> > make updates generally faster and avoids locking or disabling interrupts
> > in the fast path. We essentially push more work to the flushing side.
> >
> > We may be able to consolidate some of the code too if all the logic
> > manipulating the tree is on the flushing side.
> >
> > WDYT? Am I missing something here?
> >
>
> Yes this can be done but I don't think we need to tie that to current
> series. I think we can start with lockless in the nmi context and then
> iteratively make css_rstat_updated() lockless for all contexts.
My question is basically whether it would be simpler to actually make it
all lockless than special casing NMIs. With this patch we have two
different paths and a deferred list that we process at a later point. I
think it may be simpler if we just make it all lockless to begin with.
Then we would have a single path and no special deferred processing.
WDYT?
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 3/3] cgroup: make css_rstat_updated nmi safe
2025-05-06 9:41 ` Yosry Ahmed
@ 2025-05-06 19:30 ` Shakeel Butt
2025-05-07 6:52 ` Yosry Ahmed
0 siblings, 1 reply; 22+ messages in thread
From: Shakeel Butt @ 2025-05-06 19:30 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Tejun Heo, Andrew Morton, Alexei Starovoitov, Johannes Weiner,
Michal Hocko, Roman Gushchin, Muchun Song, Michal Koutný,
Vlastimil Babka, Sebastian Andrzej Siewior, JP Kobryn, bpf,
linux-mm, cgroups, linux-kernel, Meta kernel team
On Tue, May 06, 2025 at 09:41:04AM +0000, Yosry Ahmed wrote:
> On Thu, May 01, 2025 at 03:10:20PM -0700, Shakeel Butt wrote:
> > On Wed, Apr 30, 2025 at 06:14:28AM -0700, Yosry Ahmed wrote:
> > [...]
> > > > +
> > > > + if (!_css_rstat_cpu_trylock(css, cpu, &flags)) {
> > >
> > >
> > > IIUC this trylock will only fail if a BPF program runs in NMI context
> > > and tries to update cgroup stats, interrupting a context that is already
> > > holding the lock (i.e. updating or flushing stats).
> > >
> >
> > Correct (though note that flushing side can be on a different CPU).
> >
> > > How often does this happen in practice tho? Is it worth the complexity?
> >
> > This is about correctness, so even a chance of occurrence needs a
> > solution.
>
> Right, my question was more about the need to special case NMIs, see
> below.
>
> >
> > >
> > > I wonder if it's better if we make css_rstat_updated() inherently
> > > lockless instead.
> > >
> > > What if css_rstat_updated() always just adds to a lockless tree,
> >
> > Here I assume you meant lockless list instead of tree.
>
> Yeah, in a sense. I meant using lockless lists to implement the rstat
> tree instead of normal linked lists.
>
> >
> > > and we
> > > defer constructing the proper tree to the flushing side? This should
> > > make updates generally faster and avoids locking or disabling interrupts
> > > in the fast path. We essentially push more work to the flushing side.
> > >
> > > We may be able to consolidate some of the code too if all the logic
> > > manipulating the tree is on the flushing side.
> > >
> > > WDYT? Am I missing something here?
> > >
> >
> > Yes this can be done but I don't think we need to tie that to current
> > series. I think we can start with lockless in the nmi context and then
> > iteratively make css_rstat_updated() lockless for all contexts.
>
> My question is basically whether it would be simpler to actually make it
> all lockless than special casing NMIs. With this patch we have two
> different paths and a deferred list that we process at a later point. I
> think it may be simpler if we just make it all lockless to begin with.
> Then we would have a single path and no special deferred processing.
>
> WDYT?
So, on the update side, always add to the lockless list (if not already
there) and on the flush side, build the update tree from the lockless list
and flush it. Hopefully this tree building and flushing can be done in a
more optimized way. Is this what you are suggesting?
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 3/3] cgroup: make css_rstat_updated nmi safe
2025-05-06 19:30 ` Shakeel Butt
@ 2025-05-07 6:52 ` Yosry Ahmed
0 siblings, 0 replies; 22+ messages in thread
From: Yosry Ahmed @ 2025-05-07 6:52 UTC (permalink / raw)
To: Shakeel Butt
Cc: Tejun Heo, Andrew Morton, Alexei Starovoitov, Johannes Weiner,
Michal Hocko, Roman Gushchin, Muchun Song, Michal Koutný,
Vlastimil Babka, Sebastian Andrzej Siewior, JP Kobryn, bpf,
linux-mm, cgroups, linux-kernel, Meta kernel team
On Tue, May 06, 2025 at 12:30:18PM -0700, Shakeel Butt wrote:
> On Tue, May 06, 2025 at 09:41:04AM +0000, Yosry Ahmed wrote:
> > On Thu, May 01, 2025 at 03:10:20PM -0700, Shakeel Butt wrote:
> > > On Wed, Apr 30, 2025 at 06:14:28AM -0700, Yosry Ahmed wrote:
> > > [...]
> > > > > +
> > > > > + if (!_css_rstat_cpu_trylock(css, cpu, &flags)) {
> > > >
> > > >
> > > > IIUC this trylock will only fail if a BPF program runs in NMI context
> > > > and tries to update cgroup stats, interrupting a context that is already
> > > > holding the lock (i.e. updating or flushing stats).
> > > >
> > >
> > > Correct (though note that flushing side can be on a different CPU).
> > >
> > > > How often does this happen in practice tho? Is it worth the complexity?
> > >
> > > This is about correctness, so even a chance of occurrence needs a
> > > solution.
> >
> > Right, my question was more about the need to special case NMIs, see
> > below.
> >
> > >
> > > >
> > > > I wonder if it's better if we make css_rstat_updated() inherently
> > > > lockless instead.
> > > >
> > > > What if css_rstat_updated() always just adds to a lockless tree,
> > >
> > > Here I assume you meant lockless list instead of tree.
> >
> > Yeah, in a sense. I meant using lockless lists to implement the rstat
> > tree instead of normal linked lists.
> >
> > >
> > > > and we
> > > > defer constructing the proper tree to the flushing side? This should
> > > > make updates generally faster and avoids locking or disabling interrupts
> > > > in the fast path. We essentially push more work to the flushing side.
> > > >
> > > > We may be able to consolidate some of the code too if all the logic
> > > > manipulating the tree is on the flushing side.
> > > >
> > > > WDYT? Am I missing something here?
> > > >
> > >
> > > Yes this can be done but I don't think we need to tie that to current
> > > series. I think we can start with lockless in the nmi context and then
> > > iteratively make css_rstat_updated() lockless for all contexts.
> >
> > My question is basically whether it would be simpler to actually make it
> > all lockless than special casing NMIs. With this patch we have two
> > different paths and a deferred list that we process at a later point. I
> > think it may be simpler if we just make it all lockless to begin with.
> > Then we would have a single path and no special deferred processing.
> >
> > WDYT?
>
> So, on the update side, always add to the lockless list (if not already
> there) and on the flush side, build the update tree from the lockless list
> and flush it.
Exactly, yes.
> Hopefully this tree building and flushing can be done in a
> more optimized way. Is this what you are suggesting?
Yes, but this latter part can be a follow-up if it's not straightforward.
For now we can just use a lockless list on the update side and move
updating the tree (i.e. updated_next and updated_children) to the flush
side.
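A rough sketch of that direction (just to make the idea concrete, not a
patch; css_build_update_tree() is a hypothetical name, the rest reuses
ss_lhead_cpu(), css_rstat_cpu(), __css_rstat_updated() and the llist
helper from this series, so the flush-side loop is essentially
css_process_backlog() promoted to be the only path):

__bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
{
	struct css_rstat_cpu *rstatc = css_rstat_cpu(css, cpu);

	/* Lockless and nmi safe; a no-op if the css is already queued. */
	llist_add_iff_not_on_list(&rstatc->lnode, ss_lhead_cpu(css->ss, cpu));
}

/* Runs on the flush path, with the per-cpu lock held, before the walk. */
static void css_build_update_tree(struct cgroup_subsys *ss, int cpu)
{
	struct llist_head *lhead = ss_lhead_cpu(ss, cpu);
	struct llist_node *lnode;

	while ((lnode = llist_del_first_init(lhead))) {
		struct css_rstat_cpu *rstatc;

		rstatc = container_of(lnode, struct css_rstat_cpu, lnode);
		/* old update-side linking of updated_next/updated_children */
		__css_rstat_updated(rstatc->owner, cpu);
	}
}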
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [OFFLIST PATCH 2/2] cgroup: use subsystem-specific rstat locks to avoid contention
2025-04-29 6:15 ` Shakeel Butt
@ 2025-05-21 22:23 ` Klara Modin
2025-05-21 22:29 ` Tejun Heo
2025-05-21 23:23 ` Shakeel Butt
0 siblings, 2 replies; 22+ messages in thread
From: Klara Modin @ 2025-05-21 22:23 UTC (permalink / raw)
To: Shakeel Butt
Cc: Tejun Heo, Andrew Morton, Alexei Starovoitov, Johannes Weiner,
Michal Hocko, Roman Gushchin, Muchun Song, Yosry Ahmed,
Michal Koutný, Vlastimil Babka, Sebastian Andrzej Siewior,
JP Kobryn, bpf, linux-mm, cgroups, linux-kernel, Meta kernel team
[-- Attachment #1: Type: text/plain, Size: 1330 bytes --]
Hi,
On 2025-04-28 23:15:58 -0700, Shakeel Butt wrote:
> Please ignore this patch as it was sent by mistake.
This seems to have made it into next:
748922dcfabd ("cgroup: use subsystem-specific rstat locks to avoid contention")
It causes a BUG and eventually a panic on my Raspberry Pi 1:
WARNING: CPU: 0 PID: 0 at mm/percpu.c:1766 pcpu_alloc_noprof (mm/percpu.c:1766 (discriminator 2))
illegal size (0) or align (4) for percpu allocation
CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.15.0-rc7-next-20250521-00086-ga9fb18e56aad #263 NONE
Hardware name: BCM2835
Call trace:
unwind_backtrace from show_stack (arch/arm/kernel/traps.c:259)
show_stack from dump_stack_lvl (lib/dump_stack.c:122)
dump_stack_lvl from __warn (kernel/panic.c:729 kernel/panic.c:784)
__warn from warn_slowpath_fmt (kernel/panic.c:815)
warn_slowpath_fmt from pcpu_alloc_noprof (mm/percpu.c:1766 (discriminator 2))
pcpu_alloc_noprof from ss_rstat_init (kernel/cgroup/rstat.c:515)
ss_rstat_init from cgroup_init_subsys (kernel/cgroup/cgroup.c:6134 (discriminator 2))
cgroup_init_subsys from cgroup_init (kernel/cgroup/cgroup.c:6240)
cgroup_init from start_kernel (init/main.c:1093)
start_kernel from 0x0
...
kernel BUG at kernel/cgroup/cgroup.c:6134!
Internal error: Oops - BUG: 0 [#1] ARM
Reverting resolved it for me.
Regards,
Klara Modin
[-- Attachment #2: bisect-log --]
[-- Type: text/plain, Size: 2923 bytes --]
# bad: [7bac2c97af4078d7a627500c9bcdd5b033f97718] Add linux-next specific files for 20250521
# good: [b36ddb9210e6812eb1c86ad46b66cc46aa193487] Merge tag 'for-linus-6.15-ofs2' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux
git bisect start 'next/master' 'next/stable'
# good: [d86bd5895658c9c4317ecf66d43bd16995e18483] Merge branch 'spi-nor/next' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux.git
git bisect good d86bd5895658c9c4317ecf66d43bd16995e18483
# good: [bf65a86fb3f72e1da64f68c29fe6d51de870e463] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux-dt.git
git bisect good bf65a86fb3f72e1da64f68c29fe6d51de870e463
# good: [56baa9ffcd362c315ee1d01bb089b8c2d9d068d3] Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/westeri/thunderbolt.git
git bisect good 56baa9ffcd362c315ee1d01bb089b8c2d9d068d3
# bad: [2c0faaaca998913917026967c2b5adeac3168855] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git
git bisect bad 2c0faaaca998913917026967c2b5adeac3168855
# good: [be4e0b332d0f11bd70e2db7a9bccfdfd95f58dec] Merge branch 'togreg' of git://git.kernel.org/pub/scm/linux/kernel/git/jic23/iio.git
git bisect good be4e0b332d0f11bd70e2db7a9bccfdfd95f58dec
# good: [427ab512c2c8784686738ade287e5eb52bd8292a] staging: gpib: Change error code for no listener
git bisect good 427ab512c2c8784686738ade287e5eb52bd8292a
# good: [3139fde502884920b33404600ae918a08d683703] Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/soundwire.git
git bisect good 3139fde502884920b33404600ae918a08d683703
# good: [86d2a97e8b5851cfeafc08c9efd7846e4f72adf3] Merge branch 'counter-next' of git://git.kernel.org/pub/scm/linux/kernel/git/wbg/counter.git
git bisect good 86d2a97e8b5851cfeafc08c9efd7846e4f72adf3
# good: [9002b75aa8e6f034ffbd1c1ccac46927a1cf0f12] irqchip/renesas-rzv2h: Add rzv2h_icu_register_dma_req()
git bisect good 9002b75aa8e6f034ffbd1c1ccac46927a1cf0f12
# good: [541a4219bd66bef56d93dbd306dc64a4d70ae99e] cgroup: compare css to cgroup::self in helper for distingushing css
git bisect good 541a4219bd66bef56d93dbd306dc64a4d70ae99e
# bad: [86aadd4d2347b11a239e755d5eb540488bbf5b8c] Merge branch 'for-6.16' into for-next
git bisect bad 86aadd4d2347b11a239e755d5eb540488bbf5b8c
# bad: [93b35663f2018ff2accf4336a909081883eda76b] cgroup: helper for checking rstat participation of css
git bisect bad 93b35663f2018ff2accf4336a909081883eda76b
# bad: [748922dcfabdd655d25fb6dd09a60e694a3d35e6] cgroup: use subsystem-specific rstat locks to avoid contention
git bisect bad 748922dcfabdd655d25fb6dd09a60e694a3d35e6
# good: [5da3bfa029d6809e192d112f39fca4dbe0137aaf] cgroup: use separate rstat trees for each subsystem
git bisect good 5da3bfa029d6809e192d112f39fca4dbe0137aaf
# first bad commit: [748922dcfabdd655d25fb6dd09a60e694a3d35e6] cgroup: use subsystem-specific rstat locks to avoid contention
[-- Attachment #3: boot-log --]
[-- Type: text/plain, Size: 7746 bytes --]
[ 0.000000] Booting Linux on physical CPU 0x0
[ 0.000000] Linux version 6.15.0-rc7-next-20250521-00086-ga9fb18e56aad (klara@soda.int.kasm.eu) (armv6j-unknown-linux-gnueabihf-gcc (Gentoo 15.1.0 p55) 15.1.0, GNU ld (Gentoo 2.44 p1) 2.44.0) #263 Wed May 21 23:48:00 CEST 2025
[ 0.000000] CPU: ARMv6-compatible processor [410fb767] revision 7 (ARMv7), cr=00c5387d
[ 0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT nonaliasing instruction cache
[ 0.000000] OF: fdt: Machine model: Raspberry Pi Model B Rev 2
[ 0.000000] earlycon: pl11 at MMIO32 0x20201000 (options '')
[ 0.000000] printk: legacy bootconsole [pl11] enabled
[ 0.000000] Memory policy: Data cache writeback
[ 0.000000] Reserved memory: created CMA memory pool at 0x1b000000, size 64 MiB
[ 0.000000] OF: reserved mem: initialized node linux,cma, compatible id shared-dma-pool
[ 0.000000] OF: reserved mem: 0x1b000000..0x1effffff (65536 KiB) map reusable linux,cma
[ 0.000000] Zone ranges:
[ 0.000000] DMA [mem 0x0000000000000000-0x000000001effffff]
[ 0.000000] Normal empty
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x0000000000000000-0x000000001effffff]
[ 0.000000] Initmem setup node 0 [mem 0x0000000000000000-0x000000001effffff]
[ 0.000000] CPU: All CPU(s) started in SVC mode.
[ 0.000000] Kernel command line: bootargs_pre=coherent_pool=6M smsc95xx.turbo_mode=N dwc_otg.lpm_enable=0 console=tty0 console=ttyAMA0,115200 kgdboc=ttyAMA0,115200 root=/dev/mmcblk0p2 rw rootfstype=nilfs2 rootflags=discard earlycon=pl011,mmio32,0x20201000
[ 0.000000] Unknown kernel command line parameters "bootargs_pre=coherent_pool=6M", will be passed to user space.
[ 0.000000] printk: log buffer data + meta data: 16384 + 51200 = 67584 bytes
[ 0.000000] Dentry cache hash table entries: 65536 (order: 6, 262144 bytes, linear)
[ 0.000000] Inode-cache hash table entries: 32768 (order: 5, 131072 bytes, linear)
[ 0.000000] Built 1 zonelists, mobility grouping on. Total pages: 126976
[ 0.000000] mem auto-init: stack:all(zero), heap alloc:off, heap free:off
[ 0.000000] SLUB: HWalign=32, Order=0-1, MinObjects=0, CPUs=1, Nodes=1
[ 0.000000] ftrace: allocating 33191 entries in 65 pages
[ 0.000000] ftrace: allocated 65 pages with 2 groups
[ 0.000000] RCU Tasks Rude: Setting shift to 0 and lim to 1 rcu_task_cb_adjust=1 rcu_task_cpu_ids=1.
[ 0.000000] RCU Tasks Trace: Setting shift to 0 and lim to 1 rcu_task_cb_adjust=1 rcu_task_cpu_ids=1.
[ 0.000000] NR_IRQS: 16, nr_irqs: 16, preallocated irqs: 16
[ 0.000013] sched_clock: 32 bits at 1000kHz, resolution 1000ns, wraps every 2147483647500ns
[ 0.008502] clocksource: timer: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275 ns
[ 0.017997] bcm2835: system timer (irq = 27)
[ 0.023325] kfence: initialized - using 2097152 bytes for 255 objects at 0x(ptrval)-0x(ptrval)
[ 0.032532] Console: colour dummy device 80x30
[ 0.037075] printk: legacy console [tty0] enabled
[ 0.043066] Calibrating delay loop... 697.95 BogoMIPS (lpj=3489792)
[ 0.108806] CPU: Testing write buffer coherency: ok
[ 0.113907] pid_max: default: 4096 minimum: 301
[ 0.118984] Mount-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
[ 0.126496] Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
[ 0.135751] ------------[ cut here ]------------
[ 0.140554] WARNING: CPU: 0 PID: 0 at mm/percpu.c:1766 pcpu_alloc_noprof (mm/percpu.c:1766 (discriminator 2))
[ 0.148469] illegal size (0) or align (4) for percpu allocation
[ 0.154511] CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.15.0-rc7-next-20250521-00086-ga9fb18e56aad #263 NONE
[ 0.164999] Hardware name: BCM2835
[ 0.168456] Call trace:
[ 0.168485] unwind_backtrace from show_stack (arch/arm/kernel/traps.c:259)
[ 0.176410] show_stack from dump_stack_lvl (lib/dump_stack.c:122)
[ 0.181574] dump_stack_lvl from __warn (kernel/panic.c:729 kernel/panic.c:784)
[ 0.186379] __warn from warn_slowpath_fmt (kernel/panic.c:815)
[ 0.191448] warn_slowpath_fmt from pcpu_alloc_noprof (mm/percpu.c:1766 (discriminator 2))
[ 0.197674] pcpu_alloc_noprof from ss_rstat_init (kernel/cgroup/rstat.c:515)
[ 0.203393] ss_rstat_init from cgroup_init_subsys (kernel/cgroup/cgroup.c:6134 (discriminator 2))
[ 0.209267] cgroup_init_subsys from cgroup_init (kernel/cgroup/cgroup.c:6240)
[ 0.215049] cgroup_init from start_kernel (init/main.c:1093)
[ 0.220315] start_kernel from 0x0
[ 0.223837] ---[ end trace 0000000000000000 ]---
[ 0.228555] ------------[ cut here ]------------
[ 0.233239] kernel BUG at kernel/cgroup/cgroup.c:6134!
[ 0.238450] Internal error: Oops - BUG: 0 [#1] ARM
[ 0.243323] CPU: 0 UID: 0 PID: 0 Comm: swapper Tainted: G W 6.15.0-rc7-next-20250521-00086-ga9fb18e56aad #263 NONE
[ 0.255389] Tainted: [W]=WARN
[ 0.258403] Hardware name: BCM2835
[ 0.261856] PC is at cgroup_init_subsys (kernel/cgroup/cgroup.c:6134 (discriminator 3))
[ 0.266747] LR is at pcpu_alloc_noprof (mm/percpu.c:1766 (discriminator 2))
[ 0.271640] pc : lr : psr: a0000053
[ 0.277988] sp : c0d01f70 ip : 00000000 fp : c0d296fc
[ 0.283283] r10: c0d29704 r9 : c0d2a2e8 r8 : c0d29788
[ 0.288579] r7 : 00000000 r6 : c0d2a2d8 r5 : c0ddc590 r4 : c0d6d894
[ 0.295189] r3 : c0d08c80 r2 : 00000000 r1 : 00000000 r0 : fffffff4
[ 0.301800] Flags: NzCv IRQs on FIQs off Mode SVC_32 ISA ARM Segment none
[ 0.309121] Control: 00c5387d Table: 00004008 DAC: 00000051
[ 0.314940] Register r0 information: non-paged memory
[ 0.320077] Register r1 information: NULL pointer
[ 0.324858] Register r2 information: NULL pointer
[ 0.329637] Register r3 information: non-slab/vmalloc memory
[ 0.335384] Register r4 information: non-slab/vmalloc memory
[ 0.341130] Register r5 information: non-slab/vmalloc memory
[ 0.346875] Register r6 information: non-slab/vmalloc memory
[ 0.352620] Register r7 information: NULL pointer
[ 0.357399] Register r8 information: non-slab/vmalloc memory
[ 0.363141] Register r9 information: non-slab/vmalloc memory
[ 0.368885] Register r10 information: non-slab/vmalloc memory
[ 0.374716] Register r11 information: non-slab/vmalloc memory
[ 0.380549] Register r12 information: NULL pointer
[ 0.385416] Process swapper (pid: 0, stack limit = 0x(ptrval))
[ 0.391330] Stack: (0xc0d01f70 to 0xc0d02000)
[ 0.395761] 1f60: c0d6d894 00000002 c0d296f4 c0dc2278
[ 0.404053] 1f80: c0d29788 c0c0ff40 dafff19e c0c1953c c0d2b618 c0d12e18 c0d088e0 c0dba048
[ 0.412344] 1fa0: dafff180 00000000 c0dba024 c0d088e0 c0d08960 c0dba010 dafff19e c0c016bc
[ 0.420635] 1fc0: ffffffff ffffffff 00000000 c0c00694 00000000 00000000 c0c34f28 00000000
[ 0.428921] 1fe0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 0.437197] Call trace:
[ 0.437219] cgroup_init_subsys from cgroup_init (kernel/cgroup/cgroup.c:6240)
[ 0.445566] cgroup_init from start_kernel (init/main.c:1093)
[ 0.450828] start_kernel from 0x0
[ 0.454326] Code: e1a00004 eb0001bb e3500000 0a000001 (e7f001f2)
All code
========
0: e1a00004 mov r0, r4
4: eb0001bb bl 0x6f8
8: e3500000 cmp r0, #0
c: 0a000001 beq 0x18
10:* e7f001f2 udf #18 <-- trapping instruction
Code starting with the faulting instruction
===========================================
0: e7f001f2 udf #18
[ 0.460550] ---[ end trace 0000000000000000 ]---
[ 0.465252] Kernel panic - not syncing: Attempted to kill the idle task!
[ 0.472066] Rebooting in 30 seconds..
[-- Attachment #4: config.gz --]
[-- Type: application/gzip, Size: 24847 bytes --]
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [OFFLIST PATCH 2/2] cgroup: use subsystem-specific rstat locks to avoid contention
2025-05-21 22:23 ` Klara Modin
@ 2025-05-21 22:29 ` Tejun Heo
2025-05-21 23:23 ` Shakeel Butt
1 sibling, 0 replies; 22+ messages in thread
From: Tejun Heo @ 2025-05-21 22:29 UTC (permalink / raw)
To: Klara Modin
Cc: Shakeel Butt, Andrew Morton, Alexei Starovoitov, Johannes Weiner,
Michal Hocko, Roman Gushchin, Muchun Song, Yosry Ahmed,
Michal Koutný, Vlastimil Babka, Sebastian Andrzej Siewior,
JP Kobryn, bpf, linux-mm, cgroups, linux-kernel, Meta kernel team
On Thu, May 22, 2025 at 12:23:44AM +0200, Klara Modin wrote:
> Hi,
>
> On 2025-04-28 23:15:58 -0700, Shakeel Butt wrote:
> > Please ignore this patch as it was sent by mistake.
>
> This seems to have made it into next:
>
> 748922dcfabd ("cgroup: use subsystem-specific rstat locks to avoid contention")
>
> It causes a BUG and eventually a panic on my Raspberry Pi 1:
>
> WARNING: CPU: 0 PID: 0 at mm/percpu.c:1766 pcpu_alloc_noprof (mm/percpu.c:1766 (discriminator 2))
> illegal size (0) or align (4) for percpu allocation
> CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.15.0-rc7-next-20250521-00086-ga9fb18e56aad #263 NONE
> Hardware name: BCM2835
> Call trace:
> unwind_backtrace from show_stack (arch/arm/kernel/traps.c:259)
> show_stack from dump_stack_lvl (lib/dump_stack.c:122)
> dump_stack_lvl from __warn (kernel/panic.c:729 kernel/panic.c:784)
> __warn from warn_slowpath_fmt (kernel/panic.c:815)
> warn_slowpath_fmt from pcpu_alloc_noprof (mm/percpu.c:1766 (discriminator 2))
> pcpu_alloc_noprof from ss_rstat_init (kernel/cgroup/rstat.c:515)
> ss_rstat_init from cgroup_init_subsys (kernel/cgroup/cgroup.c:6134 (discriminator 2))
> cgroup_init_subsys from cgroup_init (kernel/cgroup/cgroup.c:6240)
> cgroup_init from start_kernel (init/main.c:1093)
> start_kernel from 0x0
> ...
> kernel BUG at kernel/cgroup/cgroup.c:6134!
> Internal error: Oops - BUG: 0 [#1] ARM
>
> Reverting resolved it for me.
This posting was a mistake but direct postings from JP weren't. This being
pretty close to the merge window, unless the problem is trivial, the right
thing to do probably is reverting the series. JP, what do you think?
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [OFFLIST PATCH 2/2] cgroup: use subsystem-specific rstat locks to avoid contention
2025-05-21 22:23 ` Klara Modin
2025-05-21 22:29 ` Tejun Heo
@ 2025-05-21 23:23 ` Shakeel Butt
2025-05-21 23:33 ` Shakeel Butt
1 sibling, 1 reply; 22+ messages in thread
From: Shakeel Butt @ 2025-05-21 23:23 UTC (permalink / raw)
To: Klara Modin
Cc: Tejun Heo, Andrew Morton, Alexei Starovoitov, Johannes Weiner,
Michal Hocko, Roman Gushchin, Muchun Song, Yosry Ahmed,
Michal Koutný, Vlastimil Babka, Sebastian Andrzej Siewior,
JP Kobryn, bpf, linux-mm, cgroups, linux-kernel, Meta kernel team
On Thu, May 22, 2025 at 12:23:44AM +0200, Klara Modin wrote:
> Hi,
>
> On 2025-04-28 23:15:58 -0700, Shakeel Butt wrote:
> > Please ignore this patch as it was sent by mistake.
>
> This seems to have made it into next:
>
> 748922dcfabd ("cgroup: use subsystem-specific rstat locks to avoid contention")
>
> It causes a BUG and eventually a panic on my Raspberry Pi 1:
>
> WARNING: CPU: 0 PID: 0 at mm/percpu.c:1766 pcpu_alloc_noprof (mm/percpu.c:1766 (discriminator 2))
> illegal size (0) or align (4) for percpu allocation
Ok this config is without CONFIG_SMP and on such configs we have:
typedef struct { } arch_spinlock_t;
So, we are doing ss->rstat_ss_cpu_lock = alloc_percpu(0).
Hmm, let me think more on how to fix this.
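For reference, the size really does collapse to zero there; a standalone
demo (plain gcc, not kernel code, just mimicking the !CONFIG_SMP no-debug
layout) shows why alloc_percpu() ends up requesting 0 bytes, which
pcpu_alloc_noprof() then warns about:

#include <stdio.h>

typedef struct { } arch_spinlock_t;	/* the UP definition quoted above */

typedef struct raw_spinlock {
	arch_spinlock_t raw_lock;	/* no debug fields in this config */
} raw_spinlock_t;

int main(void)
{
	/* Prints 0: empty structs have size zero in GNU C, and
	 * alloc_percpu(type) passes sizeof(type) on to __alloc_percpu(). */
	printf("sizeof(raw_spinlock_t) = %zu\n", sizeof(raw_spinlock_t));
	return 0;
}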
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [OFFLIST PATCH 2/2] cgroup: use subsystem-specific rstat locks to avoid contention
2025-05-21 23:23 ` Shakeel Butt
@ 2025-05-21 23:33 ` Shakeel Butt
2025-05-21 23:47 ` JP Kobryn
2025-05-21 23:47 ` Shakeel Butt
0 siblings, 2 replies; 22+ messages in thread
From: Shakeel Butt @ 2025-05-21 23:33 UTC (permalink / raw)
To: Klara Modin
Cc: Tejun Heo, Andrew Morton, Alexei Starovoitov, Johannes Weiner,
Michal Hocko, Roman Gushchin, Muchun Song, Yosry Ahmed,
Michal Koutný, Vlastimil Babka, Sebastian Andrzej Siewior,
JP Kobryn, bpf, linux-mm, cgroups, linux-kernel, Meta kernel team
On Wed, May 21, 2025 at 04:23:44PM -0700, Shakeel Butt wrote:
> On Thu, May 22, 2025 at 12:23:44AM +0200, Klara Modin wrote:
> > Hi,
> >
> > On 2025-04-28 23:15:58 -0700, Shakeel Butt wrote:
> > > Please ignore this patch as it was sent by mistake.
> >
> > This seems to have made it into next:
> >
> > 748922dcfabd ("cgroup: use subsystem-specific rstat locks to avoid contention")
> >
> > It causes a BUG and eventually a panic on my Raspberry Pi 1:
> >
> > WARNING: CPU: 0 PID: 0 at mm/percpu.c:1766 pcpu_alloc_noprof (mm/percpu.c:1766 (discriminator 2))
> > illegal size (0) or align (4) for percpu allocation
>
> Ok this config is without CONFIG_SMP and on such configs we have:
>
> typedef struct { } arch_spinlock_t;
>
> So, we are doing ss->rstat_ss_cpu_lock = alloc_percpu(0).
>
> Hmm, let me think more on how to fix this.
>
I think following is the simplest fix:
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 7dd396ae3c68..aab09495192e 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -511,7 +511,10 @@ int __init ss_rstat_init(struct cgroup_subsys *ss)
int cpu;
if (ss) {
- ss->rstat_ss_cpu_lock = alloc_percpu(raw_spinlock_t);
+ size_t size = sizeof(raw_spinlock_t) ?: 1;
+
+ ss->rstat_ss_cpu_lock = __alloc_percpu(size,
+ __alignof__(raw_spinlock_t));
if (!ss->rstat_ss_cpu_lock)
return -ENOMEM;
}
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [OFFLIST PATCH 2/2] cgroup: use subsystem-specific rstat locks to avoid contention
2025-05-21 23:33 ` Shakeel Butt
@ 2025-05-21 23:47 ` JP Kobryn
2025-05-21 23:50 ` Shakeel Butt
2025-05-21 23:47 ` Shakeel Butt
1 sibling, 1 reply; 22+ messages in thread
From: JP Kobryn @ 2025-05-21 23:47 UTC (permalink / raw)
To: Shakeel Butt, Klara Modin
Cc: Tejun Heo, Andrew Morton, Alexei Starovoitov, Johannes Weiner,
Michal Hocko, Roman Gushchin, Muchun Song, Yosry Ahmed,
Michal Koutný, Vlastimil Babka, Sebastian Andrzej Siewior,
bpf, linux-mm, cgroups, linux-kernel, Meta kernel team
On 5/21/25 4:33 PM, Shakeel Butt wrote:
> On Wed, May 21, 2025 at 04:23:44PM -0700, Shakeel Butt wrote:
>> On Thu, May 22, 2025 at 12:23:44AM +0200, Klara Modin wrote:
>>> Hi,
>>>
>>> On 2025-04-28 23:15:58 -0700, Shakeel Butt wrote:
>>>> Please ignore this patch as it was sent by mistake.
>>>
>>> This seems to have made it into next:
>>>
>>> 748922dcfabd ("cgroup: use subsystem-specific rstat locks to avoid contention")
>>>
>>> It causes a BUG and eventually a panic on my Raspberry Pi 1:
>>>
>>> WARNING: CPU: 0 PID: 0 at mm/percpu.c:1766 pcpu_alloc_noprof (mm/percpu.c:1766 (discriminator 2))
>>> illegal size (0) or align (4) for percpu allocation
>>
>> Ok this config is without CONFIG_SMP and on such configs we have:
>>
>> typedef struct { } arch_spinlock_t;
>>
>> So, we are doing ss->rstat_ss_cpu_lock = alloc_percpu(0).
>>
>> Hmm, let me think more on how to fix this.
>>
>
> I think following is the simplest fix:
>
> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> index 7dd396ae3c68..aab09495192e 100644
> --- a/kernel/cgroup/rstat.c
> +++ b/kernel/cgroup/rstat.c
> @@ -511,7 +511,10 @@ int __init ss_rstat_init(struct cgroup_subsys *ss)
> int cpu;
>
> if (ss) {
> - ss->rstat_ss_cpu_lock = alloc_percpu(raw_spinlock_t);
> + size_t size = sizeof(raw_spinlock_t) ?: 1;
> +
> + ss->rstat_ss_cpu_lock = __alloc_percpu(size,
> + __alignof__(raw_spinlock_t));
Thanks for narrowing this one down so fast. Would this approach be more
straightforward?
if (ss) {
#ifdef CONFIG_SMP
ss->rstat_ss_cpu_lock = alloc_percpu(raw_spinlock_t);
#endif
Since on non-smp the lock functions are no-ops, the ss cpu lock can
perhaps be left NULL. I could include a comment as well explaining why.
> if (!ss->rstat_ss_cpu_lock)
> return -ENOMEM;
> }
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [OFFLIST PATCH 2/2] cgroup: use subsystem-specific rstat locks to avoid contention
2025-05-21 23:33 ` Shakeel Butt
2025-05-21 23:47 ` JP Kobryn
@ 2025-05-21 23:47 ` Shakeel Butt
1 sibling, 0 replies; 22+ messages in thread
From: Shakeel Butt @ 2025-05-21 23:47 UTC (permalink / raw)
To: Klara Modin
Cc: Tejun Heo, Andrew Morton, Alexei Starovoitov, Johannes Weiner,
Michal Hocko, Roman Gushchin, Muchun Song, Yosry Ahmed,
Michal Koutný, Vlastimil Babka, Sebastian Andrzej Siewior,
JP Kobryn, bpf, linux-mm, cgroups, linux-kernel, Meta kernel team
On Wed, May 21, 2025 at 04:33:35PM -0700, Shakeel Butt wrote:
> On Wed, May 21, 2025 at 04:23:44PM -0700, Shakeel Butt wrote:
> > On Thu, May 22, 2025 at 12:23:44AM +0200, Klara Modin wrote:
> > > Hi,
> > >
> > > On 2025-04-28 23:15:58 -0700, Shakeel Butt wrote:
> > > > Please ignore this patch as it was sent by mistake.
> > >
> > > This seems to have made it into next:
> > >
> > > 748922dcfabd ("cgroup: use subsystem-specific rstat locks to avoid contention")
> > >
> > > It causes a BUG and eventually a panic on my Raspberry Pi 1:
> > >
> > > WARNING: CPU: 0 PID: 0 at mm/percpu.c:1766 pcpu_alloc_noprof (mm/percpu.c:1766 (discriminator 2))
> > > illegal size (0) or align (4) for percpu allocation
> >
> > Ok this config is without CONFIG_SMP and on such configs we have:
> >
> > typedef struct { } arch_spinlock_t;
> >
> > So, we are doing ss->rstat_ss_cpu_lock = alloc_percpu(0).
> >
> > Hmm, let me think more on how to fix this.
> >
>
> I think following is the simplest fix:
>
> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> index 7dd396ae3c68..aab09495192e 100644
> --- a/kernel/cgroup/rstat.c
> +++ b/kernel/cgroup/rstat.c
> @@ -511,7 +511,10 @@ int __init ss_rstat_init(struct cgroup_subsys *ss)
> int cpu;
>
> if (ss) {
> - ss->rstat_ss_cpu_lock = alloc_percpu(raw_spinlock_t);
> + size_t size = sizeof(raw_spinlock_t) ?: 1;
> +
> + ss->rstat_ss_cpu_lock = __alloc_percpu(size,
> + __alignof__(raw_spinlock_t));
> if (!ss->rstat_ss_cpu_lock)
> return -ENOMEM;
> }
JP suggested avoiding the allocation altogether on such configs. So,
something like the following:
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 7dd396ae3c68..d0c88903d033 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -510,7 +510,7 @@ int __init ss_rstat_init(struct cgroup_subsys *ss)
{
int cpu;
- if (ss) {
+ if (ss && sizeof(raw_spinlock_t)) {
ss->rstat_ss_cpu_lock = alloc_percpu(raw_spinlock_t);
if (!ss->rstat_ss_cpu_lock)
return -ENOMEM;
Tejun, which one do you prefer?
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [OFFLIST PATCH 2/2] cgroup: use subsystem-specific rstat locks to avoid contention
2025-05-21 23:47 ` JP Kobryn
@ 2025-05-21 23:50 ` Shakeel Butt
2025-05-21 23:52 ` JP Kobryn
0 siblings, 1 reply; 22+ messages in thread
From: Shakeel Butt @ 2025-05-21 23:50 UTC (permalink / raw)
To: JP Kobryn
Cc: Klara Modin, Tejun Heo, Andrew Morton, Alexei Starovoitov,
Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
Yosry Ahmed, Michal Koutný, Vlastimil Babka,
Sebastian Andrzej Siewior, bpf, linux-mm, cgroups, linux-kernel,
Meta kernel team
On Wed, May 21, 2025 at 04:47:01PM -0700, JP Kobryn wrote:
>
>
> On 5/21/25 4:33 PM, Shakeel Butt wrote:
> > On Wed, May 21, 2025 at 04:23:44PM -0700, Shakeel Butt wrote:
> > > On Thu, May 22, 2025 at 12:23:44AM +0200, Klara Modin wrote:
> > > > Hi,
> > > >
> > > > On 2025-04-28 23:15:58 -0700, Shakeel Butt wrote:
> > > > > Please ignore this patch as it was sent by mistake.
> > > >
> > > > This seems to have made it into next:
> > > >
> > > > 748922dcfabd ("cgroup: use subsystem-specific rstat locks to avoid contention")
> > > >
> > > > It causes a BUG and eventually a panic on my Raspberry Pi 1:
> > > >
> > > > WARNING: CPU: 0 PID: 0 at mm/percpu.c:1766 pcpu_alloc_noprof (mm/percpu.c:1766 (discriminator 2))
> > > > illegal size (0) or align (4) for percpu allocation
> > >
> > > Ok this config is without CONFIG_SMP and on such configs we have:
> > >
> > > typedef struct { } arch_spinlock_t;
> > >
> > > So, we are doing ss->rstat_ss_cpu_lock = alloc_percpu(0).
> > >
> > > Hmm, let me think more on how to fix this.
> > >
> >
> > I think following is the simplest fix:
> >
> > diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> > index 7dd396ae3c68..aab09495192e 100644
> > --- a/kernel/cgroup/rstat.c
> > +++ b/kernel/cgroup/rstat.c
> > @@ -511,7 +511,10 @@ int __init ss_rstat_init(struct cgroup_subsys *ss)
> > int cpu;
> > if (ss) {
> > - ss->rstat_ss_cpu_lock = alloc_percpu(raw_spinlock_t);
> > + size_t size = sizeof(raw_spinlock_t) ?: 1;
> > +
> > + ss->rstat_ss_cpu_lock = __alloc_percpu(size,
> > + __alignof__(raw_spinlock_t));
>
> Thanks for narrowing this one down so fast. Would this approach be more
> straightforward?
>
> if (ss) {
> #ifdef CONFIG_SMP
> ss->rstat_ss_cpu_lock = alloc_percpu(raw_spinlock_t);
> #endif
>
> Since on non-smp the lock functions are no-ops, the ss cpu lock can
> perhaps be left NULL. I could include a comment as well explaining why.
>
> > if (!ss->rstat_ss_cpu_lock)
> > return -ENOMEM;
Include this check and return -ENOMEM in the ifdef as well.
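So the combined suggestion would look roughly like this (sketch only;
rstat_ss_cpu_lock stays NULL on !SMP, where the per-cpu lock helpers are
no-ops anyway):

	if (ss) {
#ifdef CONFIG_SMP
		ss->rstat_ss_cpu_lock = alloc_percpu(raw_spinlock_t);
		if (!ss->rstat_ss_cpu_lock)
			return -ENOMEM;
#endif
	}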
> > }
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [OFFLIST PATCH 2/2] cgroup: use subsystem-specific rstat locks to avoid contention
2025-05-21 23:50 ` Shakeel Butt
@ 2025-05-21 23:52 ` JP Kobryn
0 siblings, 0 replies; 22+ messages in thread
From: JP Kobryn @ 2025-05-21 23:52 UTC (permalink / raw)
To: Shakeel Butt
Cc: Klara Modin, Tejun Heo, Andrew Morton, Alexei Starovoitov,
Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
Yosry Ahmed, Michal Koutný, Vlastimil Babka,
Sebastian Andrzej Siewior, bpf, linux-mm, cgroups, linux-kernel,
Meta kernel team
On 5/21/25 4:50 PM, Shakeel Butt wrote:
> On Wed, May 21, 2025 at 04:47:01PM -0700, JP Kobryn wrote:
>>
>>
>> On 5/21/25 4:33 PM, Shakeel Butt wrote:
>>> On Wed, May 21, 2025 at 04:23:44PM -0700, Shakeel Butt wrote:
>>>> On Thu, May 22, 2025 at 12:23:44AM +0200, Klara Modin wrote:
>>>>> Hi,
>>>>>
>>>>> On 2025-04-28 23:15:58 -0700, Shakeel Butt wrote:
>>>>>> Please ignore this patch as it was sent by mistake.
>>>>>
>>>>> This seems to have made it into next:
>>>>>
>>>>> 748922dcfabd ("cgroup: use subsystem-specific rstat locks to avoid contention")
>>>>>
>>>>> It causes a BUG and eventually a panic on my Raspberry Pi 1:
>>>>>
>>>>> WARNING: CPU: 0 PID: 0 at mm/percpu.c:1766 pcpu_alloc_noprof (mm/percpu.c:1766 (discriminator 2))
>>>>> illegal size (0) or align (4) for percpu allocation
>>>>
>>>> Ok this config is without CONFIG_SMP and on such configs we have:
>>>>
>>>> typedef struct { } arch_spinlock_t;
>>>>
>>>> So, we are doing ss->rstat_ss_cpu_lock = alloc_percpu(0).
>>>>
>>>> Hmm, let me think more on how to fix this.
>>>>
>>>
>>> I think following is the simplest fix:
>>>
>>> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
>>> index 7dd396ae3c68..aab09495192e 100644
>>> --- a/kernel/cgroup/rstat.c
>>> +++ b/kernel/cgroup/rstat.c
>>> @@ -511,7 +511,10 @@ int __init ss_rstat_init(struct cgroup_subsys *ss)
>>> int cpu;
>>> if (ss) {
>>> - ss->rstat_ss_cpu_lock = alloc_percpu(raw_spinlock_t);
>>> + size_t size = sizeof(raw_spinlock_t) ?: 1;
>>> +
>>> + ss->rstat_ss_cpu_lock = __alloc_percpu(size,
>>> + __alignof__(raw_spinlock_t));
>>
>> Thanks for narrowing this one down so fast. Would this approach be more
>> straightforward?
>>
>> if (ss) {
>> #ifdef CONFIG_SMP
>> ss->rstat_ss_cpu_lock = alloc_percpu(raw_spinlock_t);
>> #endif
>>
>> Since on non-smp the lock functions are no-ops, the ss cpu lock can
>> perhaps be left NULL. I could include a comment as well explaining why.
>>
>>> if (!ss->rstat_ss_cpu_lock)
>>> return -ENOMEM;
>
> Include this check and return -ENOMEM in the ifdef as well.
Good call. I will get a patch out tonight.
>
>>> }
>>
^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread
Thread overview: 22+ messages
2025-04-29 6:12 [RFC PATCH 0/3] cgroup: nmi safe css_rstat_updated Shakeel Butt
2025-04-29 6:12 ` [RFC PATCH 1/3] llist: add list_add_iff_not_on_list() Shakeel Butt
2025-04-30 12:44 ` [RFC PATCH 1/3] llist: add list_add_iff_not_on_list() Yosry Ahmed
2025-04-29 6:12 ` [RFC PATCH 2/3] cgroup: support to enable nmi-safe css_rstat_updated Shakeel Butt
2025-04-29 6:12 ` [RFC PATCH 3/3] cgroup: make css_rstat_updated nmi safe Shakeel Butt
2025-04-30 13:14 ` Yosry Ahmed
2025-05-01 22:10 ` Shakeel Butt
2025-05-06 9:41 ` Yosry Ahmed
2025-05-06 19:30 ` Shakeel Butt
2025-05-07 6:52 ` Yosry Ahmed
2025-04-29 6:12 ` [OFFLIST PATCH 1/2] cgroup: use separate rstat trees for each subsystem Shakeel Butt
2025-04-29 6:12 ` [OFFLIST PATCH 2/2] cgroup: use subsystem-specific rstat locks to avoid contention Shakeel Butt
2025-04-29 6:15 ` Shakeel Butt
2025-05-21 22:23 ` Klara Modin
2025-05-21 22:29 ` Tejun Heo
2025-05-21 23:23 ` Shakeel Butt
2025-05-21 23:33 ` Shakeel Butt
2025-05-21 23:47 ` JP Kobryn
2025-05-21 23:50 ` Shakeel Butt
2025-05-21 23:52 ` JP Kobryn
2025-05-21 23:47 ` Shakeel Butt
2025-04-29 6:15 ` [OFFLIST PATCH 1/2] cgroup: use separate rstat trees for each subsystem Shakeel Butt