* [PATCH v2 1/4] cgroup: support to enable nmi-safe css_rstat_updated
2025-06-11 22:15 [PATCH v2 0/4] cgroup: nmi safe css_rstat_updated Shakeel Butt
@ 2025-06-11 22:15 ` Shakeel Butt
2025-06-11 22:15 ` [PATCH v2 2/4] cgroup: make css_rstat_updated nmi safe Shakeel Butt
` (4 subsequent siblings)
5 siblings, 0 replies; 11+ messages in thread
From: Shakeel Butt @ 2025-06-11 22:15 UTC (permalink / raw)
To: Tejun Heo, Andrew Morton
Cc: JP Kobryn, Johannes Weiner, Michal Hocko, Roman Gushchin,
Muchun Song, Vlastimil Babka, Alexei Starovoitov,
Sebastian Andrzej Siewior, Michal Koutný, Harry Yoo,
Yosry Ahmed, bpf, linux-mm, cgroups, linux-kernel,
Meta kernel team
Add the necessary infrastructure to enable nmi-safe execution of
css_rstat_updated(). Currently css_rstat_updated() takes a per-cpu
per-css raw spinlock to add the given css to the per-cpu per-css update
tree. However the kernel cannot spin in nmi context, so we need to
remove the spinning on the raw spinlock in css_rstat_updated().
To support a lockless css_rstat_updated(), let's add the necessary data
structures to the css and ss structures.
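For reference, here is an illustrative sketch (not part of this patch) of
how the next patch in the series is expected to use these new fields. The
example_* function names are made up for the example; the helpers are the
ones added or already present in rstat.c:

/* update side: queue the css on its subsystem's per-cpu lockless list */
static void example_update(struct cgroup_subsys_state *css, int cpu)
{
	struct css_rstat_cpu *rstatc = css_rstat_cpu(css, cpu);

	llist_add(&rstatc->lnode, ss_lhead_cpu(css->ss, cpu));
}

/* flush side: drain the list and recover each css via the owner pointer */
static void example_flush(struct cgroup_subsys *ss, int cpu)
{
	struct llist_node *lnode;

	while ((lnode = llist_del_first_init(ss_lhead_cpu(ss, cpu)))) {
		struct css_rstat_cpu *rstatc;

		rstatc = container_of(lnode, struct css_rstat_cpu, lnode);
		/* build the per-cpu update tree for rstatc->owner here */
	}
}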
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
include/linux/cgroup-defs.h | 4 ++++
kernel/cgroup/rstat.c | 23 +++++++++++++++++++++--
2 files changed, 25 insertions(+), 2 deletions(-)
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index e61687d5e496..45860fe5dd0c 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -384,6 +384,9 @@ struct css_rstat_cpu {
*/
struct cgroup_subsys_state *updated_children;
struct cgroup_subsys_state *updated_next; /* NULL if not on the list */
+
+ struct llist_node lnode; /* lockless list for update */
+ struct cgroup_subsys_state *owner; /* back pointer */
};
/*
@@ -822,6 +825,7 @@ struct cgroup_subsys {
spinlock_t rstat_ss_lock;
raw_spinlock_t __percpu *rstat_ss_cpu_lock;
+ struct llist_head __percpu *lhead; /* lockless update list head */
};
extern struct percpu_rw_semaphore cgroup_threadgroup_rwsem;
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index cbeaa499a96a..a5608ae2be27 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -11,6 +11,7 @@
static DEFINE_SPINLOCK(rstat_base_lock);
static DEFINE_PER_CPU(raw_spinlock_t, rstat_base_cpu_lock);
+static DEFINE_PER_CPU(struct llist_head, rstat_backlog_list);
static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
@@ -45,6 +46,13 @@ static spinlock_t *ss_rstat_lock(struct cgroup_subsys *ss)
return &rstat_base_lock;
}
+static inline struct llist_head *ss_lhead_cpu(struct cgroup_subsys *ss, int cpu)
+{
+ if (ss)
+ return per_cpu_ptr(ss->lhead, cpu);
+ return per_cpu_ptr(&rstat_backlog_list, cpu);
+}
+
static raw_spinlock_t *ss_rstat_cpu_lock(struct cgroup_subsys *ss, int cpu)
{
if (ss) {
@@ -468,7 +476,8 @@ int css_rstat_init(struct cgroup_subsys_state *css)
for_each_possible_cpu(cpu) {
struct css_rstat_cpu *rstatc = css_rstat_cpu(css, cpu);
- rstatc->updated_children = css;
+ rstatc->owner = rstatc->updated_children = css;
+ init_llist_node(&rstatc->lnode);
if (is_self) {
struct cgroup_rstat_base_cpu *rstatbc;
@@ -532,9 +541,19 @@ int __init ss_rstat_init(struct cgroup_subsys *ss)
return -ENOMEM;
}
+ if (ss) {
+ ss->lhead = alloc_percpu(struct llist_head);
+ if (!ss->lhead) {
+ free_percpu(ss->rstat_ss_cpu_lock);
+ return -ENOMEM;
+ }
+ }
+
spin_lock_init(ss_rstat_lock(ss));
- for_each_possible_cpu(cpu)
+ for_each_possible_cpu(cpu) {
raw_spin_lock_init(ss_rstat_cpu_lock(ss, cpu));
+ init_llist_head(ss_lhead_cpu(ss, cpu));
+ }
return 0;
}
--
2.47.1
* [PATCH v2 2/4] cgroup: make css_rstat_updated nmi safe
2025-06-11 22:15 [PATCH v2 0/4] cgroup: nmi safe css_rstat_updated Shakeel Butt
2025-06-11 22:15 ` [PATCH v2 1/4] cgroup: support to enable nmi-safe css_rstat_updated Shakeel Butt
@ 2025-06-11 22:15 ` Shakeel Butt
2025-06-11 22:15 ` [PATCH v2 3/4] cgroup: remove per-cpu per-subsystem locks Shakeel Butt
` (3 subsequent siblings)
5 siblings, 0 replies; 11+ messages in thread
From: Shakeel Butt @ 2025-06-11 22:15 UTC (permalink / raw)
To: Tejun Heo, Andrew Morton
Cc: JP Kobryn, Johannes Weiner, Michal Hocko, Roman Gushchin,
Muchun Song, Vlastimil Babka, Alexei Starovoitov,
Sebastian Andrzej Siewior, Michal Koutný, Harry Yoo,
Yosry Ahmed, bpf, linux-mm, cgroups, linux-kernel,
Meta kernel team
To make css_rstat_updated() safe to run in nmi context, let's move the
rstat update tree creation to the flush side and use per-cpu lockless
lists in struct cgroup_subsys to track the css whose stats are updated
on that cpu.
The struct cgroup_subsys_state now has a per-cpu lnode which needs to be
inserted into the corresponding per-cpu lhead of struct cgroup_subsys.
Since we want the insertion to be nmi safe, there can be multiple
inserters on the same cpu for the same lnode, where the inserters come
from stacked contexts like softirq, hardirq and nmi.
The current llist API does not provide a function that protects against
multiple inserters using the same lnode, so a plain llist_add() out of
the box is not safe for this scenario.
However, we can protect against multiple inserters sharing the same
lnode by using the fact that an llist node points to itself when it is
not on a list: atomically reset that pointer and make the winner of the
race the single inserter.
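For illustration only, a minimal sketch of this single-inserter pattern;
the function and parameter names below are made up for the example, while
the actual code in this patch operates on css_rstat_cpu::lnode:

/*
 * 'pcpu_node' is a per-cpu llist_node shared by task, softirq, hardirq
 * and nmi contexts on the same cpu; it was set up with init_llist_node()
 * so it points to itself while off the list.
 */
static void example_try_insert(struct llist_node __percpu *pcpu_node,
			       struct llist_head *head)
{
	struct llist_node *self = this_cpu_ptr(pcpu_node);

	/* Speculative check: already queued, nothing to do. */
	if (llist_on_list(self))
		return;

	/*
	 * Exactly one of the stacked contexts wins the cmpxchg from
	 * 'self' to NULL and performs the llist_add(); the losers can
	 * return, knowing the winner will complete the insertion.
	 */
	if (this_cpu_cmpxchg(pcpu_node->next, self, NULL) != self)
		return;

	llist_add(self, head);
}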
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
Changes since v1:
- Clearer code comment, as suggested by Tejun.
kernel/cgroup/rstat.c | 65 +++++++++++++++++++++++++++++++++++--------
1 file changed, 53 insertions(+), 12 deletions(-)
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index a5608ae2be27..a7550961dd12 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -138,13 +138,16 @@ void _css_rstat_cpu_unlock(struct cgroup_subsys_state *css, int cpu,
* @css: target cgroup subsystem state
* @cpu: cpu on which rstat_cpu was updated
*
- * @css's rstat_cpu on @cpu was updated. Put it on the parent's matching
- * rstat_cpu->updated_children list. See the comment on top of
- * css_rstat_cpu definition for details.
+ * Atomically inserts the css into the ss's llist for the given cpu. This is
+ * reentrant safe i.e. safe against softirq, hardirq and nmi. The ss's llist
+ * will be processed at flush time to create the update tree.
*/
__bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
{
- unsigned long flags;
+ struct llist_head *lhead;
+ struct css_rstat_cpu *rstatc;
+ struct css_rstat_cpu __percpu *rstatc_pcpu;
+ struct llist_node *self;
/*
* Since bpf programs can call this function, prevent access to
@@ -153,19 +156,44 @@ __bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
if (!css_uses_rstat(css))
return;
+ lockdep_assert_preemption_disabled();
+
+ /*
+ * For archs without nmi safe cmpxchg or percpu ops support, ignore
+ * the requests from nmi context.
+ */
+ if ((!IS_ENABLED(CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG) ||
+ !IS_ENABLED(CONFIG_ARCH_HAS_NMI_SAFE_THIS_CPU_OPS)) && in_nmi())
+ return;
+
+ rstatc = css_rstat_cpu(css, cpu);
+ /* If already on the list, return. */
+ if (llist_on_list(&rstatc->lnode))
+ return;
+
/*
- * Speculative already-on-list test. This may race leading to
- * temporary inaccuracies, which is fine.
+ * This function can be reentered by irqs and nmis for the same cgroup
+ * and may try to insert the same per-cpu lnode into the llist. Note
+ * that llist_add() does not protect against such scenarios.
*
- * Because @parent's updated_children is terminated with @parent
- * instead of NULL, we can tell whether @css is on the list by
- * testing the next pointer for NULL.
+ * To protect against such stacked contexts of irqs/nmis, we use the
+ * fact that lnode points to itself when not on a list and then use
+ * this_cpu_cmpxchg() to atomically set it to NULL to select the winner
+ * which will call llist_add(). The losers can assume the insertion is
+ * successful and the winner will eventually add the per-cpu lnode to
+ * the llist.
*/
- if (data_race(css_rstat_cpu(css, cpu)->updated_next))
+ self = &rstatc->lnode;
+ rstatc_pcpu = css->rstat_cpu;
+ if (this_cpu_cmpxchg(rstatc_pcpu->lnode.next, self, NULL) != self)
return;
- flags = _css_rstat_cpu_lock(css, cpu, true);
+ lhead = ss_lhead_cpu(css->ss, cpu);
+ llist_add(&rstatc->lnode, lhead);
+}
+static void __css_process_update_tree(struct cgroup_subsys_state *css, int cpu)
+{
/* put @css and all ancestors on the corresponding updated lists */
while (true) {
struct css_rstat_cpu *rstatc = css_rstat_cpu(css, cpu);
@@ -191,8 +219,19 @@ __bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
css = parent;
}
+}
+
+static void css_process_update_tree(struct cgroup_subsys *ss, int cpu)
+{
+ struct llist_head *lhead = ss_lhead_cpu(ss, cpu);
+ struct llist_node *lnode;
+
+ while ((lnode = llist_del_first_init(lhead))) {
+ struct css_rstat_cpu *rstatc;
- _css_rstat_cpu_unlock(css, cpu, flags, true);
+ rstatc = container_of(lnode, struct css_rstat_cpu, lnode);
+ __css_process_update_tree(rstatc->owner, cpu);
+ }
}
/**
@@ -300,6 +339,8 @@ static struct cgroup_subsys_state *css_rstat_updated_list(
flags = _css_rstat_cpu_lock(root, cpu, false);
+ css_process_update_tree(root->ss, cpu);
+
/* Return NULL if this subtree is not on-list */
if (!rstatc->updated_next)
goto unlock_ret;
--
2.47.1
* [PATCH v2 3/4] cgroup: remove per-cpu per-subsystem locks
2025-06-11 22:15 [PATCH v2 0/4] cgroup: nmi safe css_rstat_updated Shakeel Butt
2025-06-11 22:15 ` [PATCH v2 1/4] cgroup: support to enable nmi-safe css_rstat_updated Shakeel Butt
2025-06-11 22:15 ` [PATCH v2 2/4] cgroup: make css_rstat_updated nmi safe Shakeel Butt
@ 2025-06-11 22:15 ` Shakeel Butt
2025-06-11 22:15 ` [PATCH v2 4/4] memcg: cgroup: call css_rstat_updated irrespective of in_nmi() Shakeel Butt
` (2 subsequent siblings)
5 siblings, 0 replies; 11+ messages in thread
From: Shakeel Butt @ 2025-06-11 22:15 UTC (permalink / raw)
To: Tejun Heo, Andrew Morton
Cc: JP Kobryn, Johannes Weiner, Michal Hocko, Roman Gushchin,
Muchun Song, Vlastimil Babka, Alexei Starovoitov,
Sebastian Andrzej Siewior, Michal Koutný, Harry Yoo,
Yosry Ahmed, bpf, linux-mm, cgroups, linux-kernel,
Meta kernel team
The rstat update side used to insert the cgroup whose stats were updated
into the update tree, and the read side flushed the update tree to get
the latest up-to-date stats. The per-cpu per-subsystem locks were used to
synchronize the update and flush sides. However, the update side now no
longer accesses the update tree and instead uses per-cpu lockless lists,
so the locks are no longer needed to synchronize the update and flush
sides. Let's remove them.
Suggested-by: JP Kobryn <inwardvessel@gmail.com>
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
include/linux/cgroup-defs.h | 7 ---
include/trace/events/cgroup.h | 47 ---------------
kernel/cgroup/rstat.c | 107 ++--------------------------------
3 files changed, 4 insertions(+), 157 deletions(-)
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 45860fe5dd0c..bca3562e3df4 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -375,12 +375,6 @@ struct css_rstat_cpu {
* Child cgroups with stat updates on this cpu since the last read
* are linked on the parent's ->updated_children through
* ->updated_next. updated_children is terminated by its container css.
- *
- * In addition to being more compact, singly-linked list pointing to
- * the css makes it unnecessary for each per-cpu struct to point back
- * to the associated css.
- *
- * Protected by per-cpu css->ss->rstat_ss_cpu_lock.
*/
struct cgroup_subsys_state *updated_children;
struct cgroup_subsys_state *updated_next; /* NULL if not on the list */
@@ -824,7 +818,6 @@ struct cgroup_subsys {
unsigned int depends_on;
spinlock_t rstat_ss_lock;
- raw_spinlock_t __percpu *rstat_ss_cpu_lock;
struct llist_head __percpu *lhead; /* lockless update list head */
};
diff --git a/include/trace/events/cgroup.h b/include/trace/events/cgroup.h
index 7d332387be6c..ba9229af9a34 100644
--- a/include/trace/events/cgroup.h
+++ b/include/trace/events/cgroup.h
@@ -257,53 +257,6 @@ DEFINE_EVENT(cgroup_rstat, cgroup_rstat_unlock,
TP_ARGS(cgrp, cpu, contended)
);
-/*
- * Related to per CPU locks:
- * global rstat_base_cpu_lock for base stats
- * cgroup_subsys::rstat_ss_cpu_lock for subsystem stats
- */
-DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_lock_contended,
-
- TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
-
- TP_ARGS(cgrp, cpu, contended)
-);
-
-DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_lock_contended_fastpath,
-
- TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
-
- TP_ARGS(cgrp, cpu, contended)
-);
-
-DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_locked,
-
- TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
-
- TP_ARGS(cgrp, cpu, contended)
-);
-
-DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_locked_fastpath,
-
- TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
-
- TP_ARGS(cgrp, cpu, contended)
-);
-
-DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_unlock,
-
- TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
-
- TP_ARGS(cgrp, cpu, contended)
-);
-
-DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_unlock_fastpath,
-
- TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
-
- TP_ARGS(cgrp, cpu, contended)
-);
-
#endif /* _TRACE_CGROUP_H */
/* This part must be outside protection */
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index a7550961dd12..c8a48cf83878 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -10,7 +10,6 @@
#include <trace/events/cgroup.h>
static DEFINE_SPINLOCK(rstat_base_lock);
-static DEFINE_PER_CPU(raw_spinlock_t, rstat_base_cpu_lock);
static DEFINE_PER_CPU(struct llist_head, rstat_backlog_list);
static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
@@ -53,86 +52,6 @@ static inline struct llist_head *ss_lhead_cpu(struct cgroup_subsys *ss, int cpu)
return per_cpu_ptr(&rstat_backlog_list, cpu);
}
-static raw_spinlock_t *ss_rstat_cpu_lock(struct cgroup_subsys *ss, int cpu)
-{
- if (ss) {
- /*
- * Depending on config, the subsystem per-cpu lock type may be an
- * empty struct. In enviromnents where this is the case, allocation
- * of this field is not performed in ss_rstat_init(). Avoid a
- * cpu-based offset relative to NULL by returning early. When the
- * lock type is zero in size, the corresponding lock functions are
- * no-ops so passing them NULL is acceptable.
- */
- if (sizeof(*ss->rstat_ss_cpu_lock) == 0)
- return NULL;
-
- return per_cpu_ptr(ss->rstat_ss_cpu_lock, cpu);
- }
-
- return per_cpu_ptr(&rstat_base_cpu_lock, cpu);
-}
-
-/*
- * Helper functions for rstat per CPU locks.
- *
- * This makes it easier to diagnose locking issues and contention in
- * production environments. The parameter @fast_path determine the
- * tracepoints being added, allowing us to diagnose "flush" related
- * operations without handling high-frequency fast-path "update" events.
- */
-static __always_inline
-unsigned long _css_rstat_cpu_lock(struct cgroup_subsys_state *css, int cpu,
- const bool fast_path)
-{
- struct cgroup *cgrp = css->cgroup;
- raw_spinlock_t *cpu_lock;
- unsigned long flags;
- bool contended;
-
- /*
- * The _irqsave() is needed because the locks used for flushing are
- * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring this lock
- * with the _irq() suffix only disables interrupts on a non-PREEMPT_RT
- * kernel. The raw_spinlock_t below disables interrupts on both
- * configurations. The _irqsave() ensures that interrupts are always
- * disabled and later restored.
- */
- cpu_lock = ss_rstat_cpu_lock(css->ss, cpu);
- contended = !raw_spin_trylock_irqsave(cpu_lock, flags);
- if (contended) {
- if (fast_path)
- trace_cgroup_rstat_cpu_lock_contended_fastpath(cgrp, cpu, contended);
- else
- trace_cgroup_rstat_cpu_lock_contended(cgrp, cpu, contended);
-
- raw_spin_lock_irqsave(cpu_lock, flags);
- }
-
- if (fast_path)
- trace_cgroup_rstat_cpu_locked_fastpath(cgrp, cpu, contended);
- else
- trace_cgroup_rstat_cpu_locked(cgrp, cpu, contended);
-
- return flags;
-}
-
-static __always_inline
-void _css_rstat_cpu_unlock(struct cgroup_subsys_state *css, int cpu,
- unsigned long flags, const bool fast_path)
-{
- struct cgroup *cgrp = css->cgroup;
- raw_spinlock_t *cpu_lock;
-
- if (fast_path)
- trace_cgroup_rstat_cpu_unlock_fastpath(cgrp, cpu, false);
- else
- trace_cgroup_rstat_cpu_unlock(cgrp, cpu, false);
-
- cpu_lock = ss_rstat_cpu_lock(css->ss, cpu);
- raw_spin_unlock_irqrestore(cpu_lock, flags);
-}
-
/**
* css_rstat_updated - keep track of updated rstat_cpu
* @css: target cgroup subsystem state
@@ -335,15 +254,12 @@ static struct cgroup_subsys_state *css_rstat_updated_list(
{
struct css_rstat_cpu *rstatc = css_rstat_cpu(root, cpu);
struct cgroup_subsys_state *head = NULL, *parent, *child;
- unsigned long flags;
-
- flags = _css_rstat_cpu_lock(root, cpu, false);
css_process_update_tree(root->ss, cpu);
/* Return NULL if this subtree is not on-list */
if (!rstatc->updated_next)
- goto unlock_ret;
+ return NULL;
/*
* Unlink @root from its parent. As the updated_children list is
@@ -375,8 +291,7 @@ static struct cgroup_subsys_state *css_rstat_updated_list(
rstatc->updated_children = root;
if (child != root)
head = css_rstat_push_children(head, child, cpu);
-unlock_ret:
- _css_rstat_cpu_unlock(root, cpu, flags, false);
+
return head;
}
@@ -572,29 +487,15 @@ int __init ss_rstat_init(struct cgroup_subsys *ss)
{
int cpu;
- /*
- * Depending on config, the subsystem per-cpu lock type may be an empty
- * struct. Avoid allocating a size of zero in this case.
- */
- if (ss && sizeof(*ss->rstat_ss_cpu_lock)) {
- ss->rstat_ss_cpu_lock = alloc_percpu(raw_spinlock_t);
- if (!ss->rstat_ss_cpu_lock)
- return -ENOMEM;
- }
-
if (ss) {
ss->lhead = alloc_percpu(struct llist_head);
- if (!ss->lhead) {
- free_percpu(ss->rstat_ss_cpu_lock);
+ if (!ss->lhead)
return -ENOMEM;
- }
}
spin_lock_init(ss_rstat_lock(ss));
- for_each_possible_cpu(cpu) {
- raw_spin_lock_init(ss_rstat_cpu_lock(ss, cpu));
+ for_each_possible_cpu(cpu)
init_llist_head(ss_lhead_cpu(ss, cpu));
- }
return 0;
}
--
2.47.1
* [PATCH v2 4/4] memcg: cgroup: call css_rstat_updated irrespective of in_nmi()
2025-06-11 22:15 [PATCH v2 0/4] cgroup: nmi safe css_rstat_updated Shakeel Butt
` (2 preceding siblings ...)
2025-06-11 22:15 ` [PATCH v2 3/4] cgroup: remove per-cpu per-subsystem locks Shakeel Butt
@ 2025-06-11 22:15 ` Shakeel Butt
2025-06-16 18:15 ` [PATCH v2 0/4] cgroup: nmi safe css_rstat_updated Tejun Heo
2025-06-16 20:08 ` JP Kobryn
5 siblings, 0 replies; 11+ messages in thread
From: Shakeel Butt @ 2025-06-11 22:15 UTC (permalink / raw)
To: Tejun Heo, Andrew Morton
Cc: JP Kobryn, Johannes Weiner, Michal Hocko, Roman Gushchin,
Muchun Song, Vlastimil Babka, Alexei Starovoitov,
Sebastian Andrzej Siewior, Michal Koutný, Harry Yoo,
Yosry Ahmed, bpf, linux-mm, cgroups, linux-kernel,
Meta kernel team
css_rstat_updated() is now nmi safe, so there is no need to avoid calling
it from in_nmi() context. Remove the check.
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
mm/memcontrol.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 902da8a9c643..d122bfe33e98 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -573,9 +573,7 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val,
if (!val)
return;
- /* TODO: add to cgroup update tree once it is nmi-safe. */
- if (!in_nmi())
- css_rstat_updated(&memcg->css, cpu);
+ css_rstat_updated(&memcg->css, cpu);
statc_pcpu = memcg->vmstats_percpu;
for (; statc_pcpu; statc_pcpu = statc->parent_pcpu) {
statc = this_cpu_ptr(statc_pcpu);
@@ -2530,7 +2528,8 @@ static inline void account_slab_nmi_safe(struct mem_cgroup *memcg,
} else {
struct mem_cgroup_per_node *pn = memcg->nodeinfo[pgdat->node_id];
- /* TODO: add to cgroup update tree once it is nmi-safe. */
+ /* preemption is disabled in nmi context. */
+ css_rstat_updated(&memcg->css, smp_processor_id());
if (idx == NR_SLAB_RECLAIMABLE_B)
atomic_add(nr, &pn->slab_reclaimable);
else
@@ -2753,7 +2752,8 @@ static inline void account_kmem_nmi_safe(struct mem_cgroup *memcg, int val)
if (likely(!in_nmi())) {
mod_memcg_state(memcg, MEMCG_KMEM, val);
} else {
- /* TODO: add to cgroup update tree once it is nmi-safe. */
+ /* preemption is disabled in nmi context. */
+ css_rstat_updated(&memcg->css, smp_processor_id());
atomic_add(val, &memcg->kmem_stat);
}
}
--
2.47.1
* Re: [PATCH v2 0/4] cgroup: nmi safe css_rstat_updated
2025-06-11 22:15 [PATCH v2 0/4] cgroup: nmi safe css_rstat_updated Shakeel Butt
` (3 preceding siblings ...)
2025-06-11 22:15 ` [PATCH v2 4/4] memcg: cgroup: call css_rstat_updated irrespective of in_nmi() Shakeel Butt
@ 2025-06-16 18:15 ` Tejun Heo
2025-06-16 19:20 ` Shakeel Butt
2025-06-16 20:08 ` JP Kobryn
5 siblings, 1 reply; 11+ messages in thread
From: Tejun Heo @ 2025-06-16 18:15 UTC (permalink / raw)
To: Shakeel Butt
Cc: Andrew Morton, JP Kobryn, Johannes Weiner, Michal Hocko,
Roman Gushchin, Muchun Song, Vlastimil Babka, Alexei Starovoitov,
Sebastian Andrzej Siewior, Michal Koutný, Harry Yoo,
Yosry Ahmed, bpf, linux-mm, cgroups, linux-kernel,
Meta kernel team
Hello,
On Wed, Jun 11, 2025 at 03:15:28PM -0700, Shakeel Butt wrote:
> Shakeel Butt (4):
> cgroup: support to enable nmi-safe css_rstat_updated
> cgroup: make css_rstat_updated nmi safe
> cgroup: remove per-cpu per-subsystem locks
> memcg: cgroup: call css_rstat_updated irrespective of in_nmi()
The patches look good to me. How should it be routed? Should I take all
four, just the first three, or would it be better to route all through -mm?
Thanks.
--
tejun
* Re: [PATCH v2 0/4] cgroup: nmi safe css_rstat_updated
2025-06-16 18:15 ` [PATCH v2 0/4] cgroup: nmi safe css_rstat_updated Tejun Heo
@ 2025-06-16 19:20 ` Shakeel Butt
2025-06-17 19:06 ` Tejun Heo
0 siblings, 1 reply; 11+ messages in thread
From: Shakeel Butt @ 2025-06-16 19:20 UTC (permalink / raw)
To: Tejun Heo
Cc: Andrew Morton, JP Kobryn, Johannes Weiner, Michal Hocko,
Roman Gushchin, Muchun Song, Vlastimil Babka, Alexei Starovoitov,
Sebastian Andrzej Siewior, Michal Koutný, Harry Yoo,
Yosry Ahmed, bpf, linux-mm, cgroups, linux-kernel,
Meta kernel team
On Mon, Jun 16, 2025 at 08:15:17AM -1000, Tejun Heo wrote:
> Hello,
>
> On Wed, Jun 11, 2025 at 03:15:28PM -0700, Shakeel Butt wrote:
> > Shakeel Butt (4):
> > cgroup: support to enable nmi-safe css_rstat_updated
> > cgroup: make css_rstat_updated nmi safe
> > cgroup: remove per-cpu per-subsystem locks
> > memcg: cgroup: call css_rstat_updated irrespective of in_nmi()
>
> The patches look good to me. How should it be routed? Should I take all
> four, just the first three, or would it be better to route all through -mm?
>
I would like all four to be together and, since most of the code is in
cgroup, the cgroup tree makes more sense unless Andrew has a different
opinion.
thanks,
Shakeel
* Re: [PATCH v2 0/4] cgroup: nmi safe css_rstat_updated
2025-06-16 19:20 ` Shakeel Butt
@ 2025-06-17 19:06 ` Tejun Heo
2025-06-17 19:38 ` Shakeel Butt
0 siblings, 1 reply; 11+ messages in thread
From: Tejun Heo @ 2025-06-17 19:06 UTC (permalink / raw)
To: Shakeel Butt
Cc: Andrew Morton, JP Kobryn, Johannes Weiner, Michal Hocko,
Roman Gushchin, Muchun Song, Vlastimil Babka, Alexei Starovoitov,
Sebastian Andrzej Siewior, Michal Koutný, Harry Yoo,
Yosry Ahmed, bpf, linux-mm, cgroups, linux-kernel,
Meta kernel team
On Mon, Jun 16, 2025 at 12:20:28PM -0700, Shakeel Butt wrote:
> On Mon, Jun 16, 2025 at 08:15:17AM -1000, Tejun Heo wrote:
> > Hello,
> >
> > On Wed, Jun 11, 2025 at 03:15:28PM -0700, Shakeel Butt wrote:
> > > Shakeel Butt (4):
> > > cgroup: support to enable nmi-safe css_rstat_updated
> > > cgroup: make css_rstat_updated nmi safe
> > > cgroup: remove per-cpu per-subsystem locks
> > > memcg: cgroup: call css_rstat_updated irrespective of in_nmi()
> >
> > The patches look good to me. How should it be routed? Should I take all
> > four, just the first three, or would it be better to route all through -mm?
> >
>
> I would like all four to be together and, since most of the code is in
> cgroup, the cgroup tree makes more sense unless Andrew has a different
> opinion.
Okay, I'll route them through cgroup. The patches don't apply cleanly on
cgroup/for-6.17. Can you please send a refreshed set?
Thanks.
--
tejun
* Re: [PATCH v2 0/4] cgroup: nmi safe css_rstat_updated
2025-06-17 19:06 ` Tejun Heo
@ 2025-06-17 19:38 ` Shakeel Butt
0 siblings, 0 replies; 11+ messages in thread
From: Shakeel Butt @ 2025-06-17 19:38 UTC (permalink / raw)
To: Tejun Heo
Cc: Andrew Morton, JP Kobryn, Johannes Weiner, Michal Hocko,
Roman Gushchin, Muchun Song, Vlastimil Babka, Alexei Starovoitov,
Sebastian Andrzej Siewior, Michal Koutný, Harry Yoo,
Yosry Ahmed, bpf, linux-mm, cgroups, linux-kernel,
Meta kernel team
On Tue, Jun 17, 2025 at 09:06:01AM -1000, Tejun Heo wrote:
> On Mon, Jun 16, 2025 at 12:20:28PM -0700, Shakeel Butt wrote:
> > On Mon, Jun 16, 2025 at 08:15:17AM -1000, Tejun Heo wrote:
> > > Hello,
> > >
> > > On Wed, Jun 11, 2025 at 03:15:28PM -0700, Shakeel Butt wrote:
> > > > Shakeel Butt (4):
> > > > cgroup: support to enable nmi-safe css_rstat_updated
> > > > cgroup: make css_rstat_updated nmi safe
> > > > cgroup: remove per-cpu per-subsystem locks
> > > > memcg: cgroup: call css_rstat_updated irrespective of in_nmi()
> > >
> > > The patches look good to me. How should it be routed? Should I take all
> > > four, just the first three, or would it be better to route all through -mm?
> > >
> >
> > I would like all four to be together and, since most of the code is in
> > cgroup, the cgroup tree makes more sense unless Andrew has a different
> > opinion.
>
> Okay, I'll route them through cgroup. The patches don't apply cleanly on
> cgroup/for-6.17. Can you please send a refreshed set?
Yup, I will do asap.
* Re: [PATCH v2 0/4] cgroup: nmi safe css_rstat_updated
2025-06-11 22:15 [PATCH v2 0/4] cgroup: nmi safe css_rstat_updated Shakeel Butt
` (4 preceding siblings ...)
2025-06-16 18:15 ` [PATCH v2 0/4] cgroup: nmi safe css_rstat_updated Tejun Heo
@ 2025-06-16 20:08 ` JP Kobryn
2025-06-16 20:13 ` Shakeel Butt
5 siblings, 1 reply; 11+ messages in thread
From: JP Kobryn @ 2025-06-16 20:08 UTC (permalink / raw)
To: Shakeel Butt, Tejun Heo, Andrew Morton
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
Vlastimil Babka, Alexei Starovoitov, Sebastian Andrzej Siewior,
Michal Koutný, Harry Yoo, Yosry Ahmed, bpf, linux-mm,
cgroups, linux-kernel, Meta kernel team
On 6/11/25 3:15 PM, Shakeel Butt wrote:
> BPF programs can run in nmi context and may trigger memcg charged memory
> allocation in such context. Recently linux added support for nmi safe
> page allocation along with memcg charging of such allocations. However
> the kmalloc/slab support and the corresponding memcg charging are still
> lacking.
>
> To provide nmi safe support for memcg charging for kmalloc/slab
> allocations, we need nmi safe memcg stats because, for kernel memory,
> charging and stats updates happen together. At the moment, memcg charging
> and memcg stats are nmi safe; the only thing which is not nmi safe is
> adding the cgroup to the per-cpu rstat update tree, i.e.
> css_rstat_updated(), which is what this series addresses.
>
> This series makes css_rstat_updated() nmi safe by using per-cpu lockless
> lists whose node is embedded in each struct cgroup_subsys_state and whose
> per-cpu head is placed in struct cgroup_subsys. For rstat users without a
> cgroup_subsys, a global per-cpu lockless list head is created. The main
> challenge in using a lockless list in this scenario was the potential for
> multiple inserters from stacked contexts, i.e. process, softirq, hardirq &
> nmi, using the same per-cpu lockless node of a given cgroup_subsys_state.
> The normal lockless list does not protect against such a scenario.
>
> The problem of multiple stacked inserters potentially using the same
> lockless node is resolved by letting them race to reset the lockless node:
> the winner gets to insert the node into the corresponding lockless list,
> while the losers can assume the insertion will eventually succeed and
> continue their operation.
>
> Changelog since v2:
> - Add more clear explanation in cover letter and in the comment as
> suggested by Andrew, Michal & Tejun.
> - Use this_cpu_cmpxchg() instead of try_cmpxchg() as suggested by Tejun.
> - Remove the per-cpu ss locks as they are not needed anymore.
>
> Changelog since v1:
> - Based on Yosry's suggestion always use llist on the update side and
> create the update tree on flush side
>
> [v1] https://lore.kernel.org/cgroups/20250429061211.1295443-1-shakeel.butt@linux.dev/
>
>
>
> Shakeel Butt (4):
> cgroup: support to enable nmi-safe css_rstat_updated
> cgroup: make css_rstat_updated nmi safe
> cgroup: remove per-cpu per-subsystem locks
> memcg: cgroup: call css_rstat_updated irrespective of in_nmi()
>
> include/linux/cgroup-defs.h | 11 +--
> include/trace/events/cgroup.h | 47 ----------
> kernel/cgroup/rstat.c | 169 +++++++++++++---------------------
> mm/memcontrol.c | 10 +-
> 4 files changed, 74 insertions(+), 163 deletions(-)
>
I tested this series by doing some updates/flushes on a cgroup hierarchy
with four levels. This tag can be added to the patches in this series.
Tested-by: JP Kobryn <inwardvessel@gmail.com>
* Re: [PATCH v2 0/4] cgroup: nmi safe css_rstat_updated
2025-06-16 20:08 ` JP Kobryn
@ 2025-06-16 20:13 ` Shakeel Butt
0 siblings, 0 replies; 11+ messages in thread
From: Shakeel Butt @ 2025-06-16 20:13 UTC (permalink / raw)
To: JP Kobryn
Cc: Tejun Heo, Andrew Morton, Johannes Weiner, Michal Hocko,
Roman Gushchin, Muchun Song, Vlastimil Babka, Alexei Starovoitov,
Sebastian Andrzej Siewior, Michal Koutný, Harry Yoo,
Yosry Ahmed, bpf, linux-mm, cgroups, linux-kernel,
Meta kernel team
On Mon, Jun 16, 2025 at 01:08:49PM -0700, JP Kobryn wrote:
> On 6/11/25 3:15 PM, Shakeel Butt wrote:
> > [...]
>
> I tested this series by doing some updates/flushes on a cgroup hierarchy
> with four levels. This tag can be added to the patches in this series.
>
> Tested-by: JP Kobryn <inwardvessel@gmail.com>
>
Thanks a lot.