* [PATCH 0/3] cgroup: nmi safe css_rstat_updated
@ 2025-06-09 22:56 Shakeel Butt
2025-06-09 22:56 ` [PATCH 1/3] cgroup: support to enable nmi-safe css_rstat_updated Shakeel Butt
` (4 more replies)
0 siblings, 5 replies; 15+ messages in thread
From: Shakeel Butt @ 2025-06-09 22:56 UTC (permalink / raw)
To: Tejun Heo, Andrew Morton
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
Vlastimil Babka, Alexei Starovoitov, Sebastian Andrzej Siewior,
Michal Koutný, Harry Yoo, Yosry Ahmed, bpf, linux-mm,
cgroups, linux-kernel, Meta kernel team
BPF programs can run in nmi context and may trigger memcg-charged memory
allocations in such a context. Recently Linux added support for nmi-safe
page allocation along with memcg charging of such allocations. However,
the kmalloc/slab support and the corresponding memcg charging are still
lacking.
To provide nmi-safe memcg charging for kmalloc/slab allocations, we need
nmi-safe memcg stats, and for that we need an nmi-safe css_rstat_updated(),
which adds the given cgroup state whose stats were updated to the per-cpu
per-ss update tree. This series aims to make css_rstat_updated() nmi safe.
This series makes css_rstat_updated() nmi safe by using per-cpu lockless
lists whose nodes are embedded in individual struct cgroup_subsys_state and
whose per-cpu heads are placed in struct cgroup_subsys. For rstat users
without a cgroup_subsys, a global per-cpu lockless list head is created. The
main challenge in using lockless lists here is that multiple inserters may
use the same lockless node of a cgroup_subsys_state, which differs from
traditional users of lockless lists.
The race between multiple inserters on the same lockless node is resolved
by letting only one of them succeed in resetting the lockless node; the
winner then inserts the lockless node into the corresponding lockless list.
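A minimal sketch of this update-side scheme in C (helper and field names
follow the patches in this series; the function name is made up and the code
is illustrative only, not the exact diff):

	static void rstat_updated_sketch(struct cgroup_subsys_state *css, int cpu)
	{
		struct css_rstat_cpu *rstatc = css_rstat_cpu(css, cpu);
		struct llist_node *self = &rstatc->lnode;

		/* Fast path: this css is already queued on this cpu's llist. */
		if (llist_on_list(&rstatc->lnode))
			return;

		/*
		 * The lnode points to itself when not on a list. Only one of
		 * the stacked contexts (task, irq, nmi) wins this cmpxchg;
		 * the losers simply return.
		 */
		if (!try_cmpxchg(&rstatc->lnode.next, &self, NULL))
			return;

		llist_add(&rstatc->lnode, ss_lhead_cpu(css->ss, cpu));
	}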
Changelog since v1:
- Based on Yosry's suggestion, always use llist on the update side and
  create the update tree on the flush side
[v1] https://lore.kernel.org/cgroups/20250429061211.1295443-1-shakeel.butt@linux.dev/
Shakeel Butt (3):
cgroup: support to enable nmi-safe css_rstat_updated
cgroup: make css_rstat_updated nmi safe
memcg: cgroup: call memcg_rstat_updated irrespective of in_nmi()
include/linux/cgroup-defs.h | 4 ++
kernel/cgroup/rstat.c | 80 ++++++++++++++++++++++++++++++-------
mm/memcontrol.c | 10 ++---
3 files changed, 75 insertions(+), 19 deletions(-)
--
2.47.1
* [PATCH 1/3] cgroup: support to enable nmi-safe css_rstat_updated
2025-06-09 22:56 [PATCH 0/3] cgroup: nmi safe css_rstat_updated Shakeel Butt
@ 2025-06-09 22:56 ` Shakeel Butt
2025-06-09 22:56 ` [PATCH 2/3] cgroup: make css_rstat_updated nmi safe Shakeel Butt
` (3 subsequent siblings)
4 siblings, 0 replies; 15+ messages in thread
From: Shakeel Butt @ 2025-06-09 22:56 UTC (permalink / raw)
To: Tejun Heo, Andrew Morton
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
Vlastimil Babka, Alexei Starovoitov, Sebastian Andrzej Siewior,
Michal Koutný, Harry Yoo, Yosry Ahmed, bpf, linux-mm,
cgroups, linux-kernel, Meta kernel team
Add the necessary infrastructure to enable nmi-safe execution of
css_rstat_updated(). Currently css_rstat_updated() takes a per-cpu
per-css raw spinlock to add the given css to the per-cpu per-css update
tree. However, the kernel cannot spin in nmi context, so we need to
remove the spinning on that raw spinlock in css_rstat_updated().
To support a lockless css_rstat_updated(), add the necessary data
structures to the css and ss structures.
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
include/linux/cgroup-defs.h | 4 ++++
kernel/cgroup/rstat.c | 23 +++++++++++++++++++++--
2 files changed, 25 insertions(+), 2 deletions(-)
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index e61687d5e496..45860fe5dd0c 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -384,6 +384,9 @@ struct css_rstat_cpu {
*/
struct cgroup_subsys_state *updated_children;
struct cgroup_subsys_state *updated_next; /* NULL if not on the list */
+
+ struct llist_node lnode; /* lockless list for update */
+ struct cgroup_subsys_state *owner; /* back pointer */
};
/*
@@ -822,6 +825,7 @@ struct cgroup_subsys {
spinlock_t rstat_ss_lock;
raw_spinlock_t __percpu *rstat_ss_cpu_lock;
+ struct llist_head __percpu *lhead; /* lockless update list head */
};
extern struct percpu_rw_semaphore cgroup_threadgroup_rwsem;
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index cbeaa499a96a..a5608ae2be27 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -11,6 +11,7 @@
static DEFINE_SPINLOCK(rstat_base_lock);
static DEFINE_PER_CPU(raw_spinlock_t, rstat_base_cpu_lock);
+static DEFINE_PER_CPU(struct llist_head, rstat_backlog_list);
static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
@@ -45,6 +46,13 @@ static spinlock_t *ss_rstat_lock(struct cgroup_subsys *ss)
return &rstat_base_lock;
}
+static inline struct llist_head *ss_lhead_cpu(struct cgroup_subsys *ss, int cpu)
+{
+ if (ss)
+ return per_cpu_ptr(ss->lhead, cpu);
+ return per_cpu_ptr(&rstat_backlog_list, cpu);
+}
+
static raw_spinlock_t *ss_rstat_cpu_lock(struct cgroup_subsys *ss, int cpu)
{
if (ss) {
@@ -468,7 +476,8 @@ int css_rstat_init(struct cgroup_subsys_state *css)
for_each_possible_cpu(cpu) {
struct css_rstat_cpu *rstatc = css_rstat_cpu(css, cpu);
- rstatc->updated_children = css;
+ rstatc->owner = rstatc->updated_children = css;
+ init_llist_node(&rstatc->lnode);
if (is_self) {
struct cgroup_rstat_base_cpu *rstatbc;
@@ -532,9 +541,19 @@ int __init ss_rstat_init(struct cgroup_subsys *ss)
return -ENOMEM;
}
+ if (ss) {
+ ss->lhead = alloc_percpu(struct llist_head);
+ if (!ss->lhead) {
+ free_percpu(ss->rstat_ss_cpu_lock);
+ return -ENOMEM;
+ }
+ }
+
spin_lock_init(ss_rstat_lock(ss));
- for_each_possible_cpu(cpu)
+ for_each_possible_cpu(cpu) {
raw_spin_lock_init(ss_rstat_cpu_lock(ss, cpu));
+ init_llist_head(ss_lhead_cpu(ss, cpu));
+ }
return 0;
}
--
2.47.1
* [PATCH 2/3] cgroup: make css_rstat_updated nmi safe
2025-06-09 22:56 [PATCH 0/3] cgroup: nmi safe css_rstat_updated Shakeel Butt
2025-06-09 22:56 ` [PATCH 1/3] cgroup: support to enable nmi-safe css_rstat_updated Shakeel Butt
@ 2025-06-09 22:56 ` Shakeel Butt
2025-06-10 21:26 ` Tejun Heo
2025-06-11 5:23 ` JP Kobryn
2025-06-09 22:56 ` [PATCH 3/3] memcg: cgroup: call memcg_rstat_updated irrespective of in_nmi() Shakeel Butt
` (2 subsequent siblings)
4 siblings, 2 replies; 15+ messages in thread
From: Shakeel Butt @ 2025-06-09 22:56 UTC (permalink / raw)
To: Tejun Heo, Andrew Morton
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
Vlastimil Babka, Alexei Starovoitov, Sebastian Andrzej Siewior,
Michal Koutný, Harry Yoo, Yosry Ahmed, bpf, linux-mm,
cgroups, linux-kernel, Meta kernel team
To make css_rstat_updated() safe to run in nmi context, move the rstat
update tree creation to the flush side and use per-cpu lockless lists in
struct cgroup_subsys to track the csses whose stats were updated on that
cpu.
struct cgroup_subsys_state now has a per-cpu lnode which needs to be
inserted into the corresponding per-cpu lhead of struct cgroup_subsys.
Since we want the insertion to be nmi safe, there can be multiple
inserters on the same cpu for the same lnode. The current llist
implementation does not provide a function that protects against multiple
inserters using the same lnode, so using llist_add() out of the box is
not safe for this scenario.
However, we can protect against multiple inserters using the same lnode by
exploiting the fact that an llist node points to itself when not on a list:
atomically reset it and let the single context that wins the reset be the
sole inserter.
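To illustrate why a bare llist_add() on a shared lnode is not enough, here is
a hypothetical interleaving on one cpu (illustrative only, not taken from the
patch below):

	task:	sees lnode not on the list
	nmi:	sees lnode not on the list
	nmi:	llist_add(&lnode, lhead)	/* queued */
	task:	llist_add(&lnode, lhead)	/* lnode->next overwritten,
						   llist corrupted */

With the scheme above, the lnode instead moves through three states:
initialized or removed (lnode.next == &lnode), claimed by the single winner of
the atomic reset (lnode.next == NULL), and queued on the per-cpu llist; the
flush side restores the first state with llist_del_first_init().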
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
kernel/cgroup/rstat.c | 57 ++++++++++++++++++++++++++++++++++---------
1 file changed, 45 insertions(+), 12 deletions(-)
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index a5608ae2be27..4fabd7973067 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -138,13 +138,15 @@ void _css_rstat_cpu_unlock(struct cgroup_subsys_state *css, int cpu,
* @css: target cgroup subsystem state
* @cpu: cpu on which rstat_cpu was updated
*
- * @css's rstat_cpu on @cpu was updated. Put it on the parent's matching
- * rstat_cpu->updated_children list. See the comment on top of
- * css_rstat_cpu definition for details.
+ * Atomically inserts the css in the ss's llist for the given cpu. This is nmi
+ * safe. The ss's llist will be processed at the flush time to create the update
+ * tree.
*/
__bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
{
- unsigned long flags;
+ struct llist_head *lhead = ss_lhead_cpu(css->ss, cpu);
+ struct css_rstat_cpu *rstatc = css_rstat_cpu(css, cpu);
+ struct llist_node *self;
/*
* Since bpf programs can call this function, prevent access to
@@ -153,19 +155,37 @@ __bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
if (!css_uses_rstat(css))
return;
+ lockdep_assert_preemption_disabled();
+
+ /*
+ * For arch that does not support nmi safe cmpxchg, we ignore the
+ * requests from nmi context for rstat update llist additions.
+ */
+ if (!IS_ENABLED(CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG) && in_nmi())
+ return;
+
+ /* If already on list return. */
+ if (llist_on_list(&rstatc->lnode))
+ return;
+
/*
- * Speculative already-on-list test. This may race leading to
- * temporary inaccuracies, which is fine.
+ * Make sure only one insert request can proceed on this cpu for this
+ * specific lnode and thus this needs to be safe against irqs and nmis.
*
- * Because @parent's updated_children is terminated with @parent
- * instead of NULL, we can tell whether @css is on the list by
- * testing the next pointer for NULL.
+ * Please note that llist_add() does not protect against multiple
+ * inserters for the same lnode. We use the fact that lnode points to
+ * itself when not on a list and then atomically set it to NULL to
+ * select the single inserter.
*/
- if (data_race(css_rstat_cpu(css, cpu)->updated_next))
+ self = &rstatc->lnode;
+ if (!try_cmpxchg(&(rstatc->lnode.next), &self, NULL))
return;
- flags = _css_rstat_cpu_lock(css, cpu, true);
+ llist_add(&rstatc->lnode, lhead);
+}
+static void __css_process_update_tree(struct cgroup_subsys_state *css, int cpu)
+{
/* put @css and all ancestors on the corresponding updated lists */
while (true) {
struct css_rstat_cpu *rstatc = css_rstat_cpu(css, cpu);
@@ -191,8 +211,19 @@ __bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
css = parent;
}
+}
+
+static void css_process_update_tree(struct cgroup_subsys *ss, int cpu)
+{
+ struct llist_head *lhead = ss_lhead_cpu(ss, cpu);
+ struct llist_node *lnode;
+
+ while ((lnode = llist_del_first_init(lhead))) {
+ struct css_rstat_cpu *rstatc;
- _css_rstat_cpu_unlock(css, cpu, flags, true);
+ rstatc = container_of(lnode, struct css_rstat_cpu, lnode);
+ __css_process_update_tree(rstatc->owner, cpu);
+ }
}
/**
@@ -300,6 +331,8 @@ static struct cgroup_subsys_state *css_rstat_updated_list(
flags = _css_rstat_cpu_lock(root, cpu, false);
+ css_process_update_tree(root->ss, cpu);
+
/* Return NULL if this subtree is not on-list */
if (!rstatc->updated_next)
goto unlock_ret;
--
2.47.1
* [PATCH 3/3] memcg: cgroup: call memcg_rstat_updated irrespective of in_nmi()
2025-06-09 22:56 [PATCH 0/3] cgroup: nmi safe css_rstat_updated Shakeel Butt
2025-06-09 22:56 ` [PATCH 1/3] cgroup: support to enable nmi-safe css_rstat_updated Shakeel Butt
2025-06-09 22:56 ` [PATCH 2/3] cgroup: make css_rstat_updated nmi safe Shakeel Butt
@ 2025-06-09 22:56 ` Shakeel Butt
2025-06-09 23:44 ` [PATCH 0/3] cgroup: nmi safe css_rstat_updated Andrew Morton
2025-06-10 10:53 ` Michal Koutný
4 siblings, 0 replies; 15+ messages in thread
From: Shakeel Butt @ 2025-06-09 22:56 UTC (permalink / raw)
To: Tejun Heo, Andrew Morton
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
Vlastimil Babka, Alexei Starovoitov, Sebastian Andrzej Siewior,
Michal Koutný, Harry Yoo, Yosry Ahmed, bpf, linux-mm,
cgroups, linux-kernel, Meta kernel team
css_rstat_updated() is now nmi safe, so there is no need to avoid calling it
when in_nmi(); remove that check.
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
mm/memcontrol.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 902da8a9c643..d122bfe33e98 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -573,9 +573,7 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val,
if (!val)
return;
- /* TODO: add to cgroup update tree once it is nmi-safe. */
- if (!in_nmi())
- css_rstat_updated(&memcg->css, cpu);
+ css_rstat_updated(&memcg->css, cpu);
statc_pcpu = memcg->vmstats_percpu;
for (; statc_pcpu; statc_pcpu = statc->parent_pcpu) {
statc = this_cpu_ptr(statc_pcpu);
@@ -2530,7 +2528,8 @@ static inline void account_slab_nmi_safe(struct mem_cgroup *memcg,
} else {
struct mem_cgroup_per_node *pn = memcg->nodeinfo[pgdat->node_id];
- /* TODO: add to cgroup update tree once it is nmi-safe. */
+ /* preemption is disabled in_nmi(). */
+ css_rstat_updated(&memcg->css, smp_processor_id());
if (idx == NR_SLAB_RECLAIMABLE_B)
atomic_add(nr, &pn->slab_reclaimable);
else
@@ -2753,7 +2752,8 @@ static inline void account_kmem_nmi_safe(struct mem_cgroup *memcg, int val)
if (likely(!in_nmi())) {
mod_memcg_state(memcg, MEMCG_KMEM, val);
} else {
- /* TODO: add to cgroup update tree once it is nmi-safe. */
+ /* preemption is disabled in_nmi(). */
+ css_rstat_updated(&memcg->css, smp_processor_id());
atomic_add(val, &memcg->kmem_stat);
}
}
--
2.47.1
* Re: [PATCH 0/3] cgroup: nmi safe css_rstat_updated
2025-06-09 22:56 [PATCH 0/3] cgroup: nmi safe css_rstat_updated Shakeel Butt
` (2 preceding siblings ...)
2025-06-09 22:56 ` [PATCH 3/3] memcg: cgroup: call memcg_rstat_updated irrespective of in_nmi() Shakeel Butt
@ 2025-06-09 23:44 ` Andrew Morton
2025-06-09 23:51 ` Shakeel Butt
2025-06-10 10:53 ` Michal Koutný
4 siblings, 1 reply; 15+ messages in thread
From: Andrew Morton @ 2025-06-09 23:44 UTC (permalink / raw)
To: Shakeel Butt
Cc: Tejun Heo, Johannes Weiner, Michal Hocko, Roman Gushchin,
Muchun Song, Vlastimil Babka, Alexei Starovoitov,
Sebastian Andrzej Siewior, Michal Koutný, Harry Yoo,
Yosry Ahmed, bpf, linux-mm, cgroups, linux-kernel,
Meta kernel team
On Mon, 9 Jun 2025 15:56:08 -0700 Shakeel Butt <shakeel.butt@linux.dev> wrote:
> BPF programs can run in nmi context and may trigger memcg charged memory
> allocation in such context. Recently linux added support to nmi safe
> page allocation along with memcg charging of such allocations. However
> the kmalloc/slab support and corresponding memcg charging is still
> lacking,
>
> To provide nmi safe support for memcg charging for kmalloc/slab
> allocations, we need nmi safe memcg stats and for that we need nmi safe
> css_rstat_updated() which adds the given cgroup state whose stats are
> updated into the per-cpu per-ss update tree. This series took the aim to
> make css_rstat_updated() nmi safe.
>
> This series made css_rstat_updated by using per-cpu lockless lists whose
> node in embedded in individual struct cgroup_subsys_state and the
> per-cpu head is placed in struct cgroup_subsys. For rstat users without
> cgroup_subsys, a global per-cpu lockless list head is created. The main
> challenge to use lockless in this scenario was the potential multiple
> inserters using the same lockless node of a cgroup_subsys_state which is
> different from traditional users of lockless lists.
>
> The multiple inserters using potentially same lockless node was resolved
> by making one of them succeed on reset the lockless node and the winner
> gets to insert the lockless node in the corresponding lockless list.
And what happens with the losers?
* Re: [PATCH 0/3] cgroup: nmi safe css_rstat_updated
2025-06-09 23:44 ` [PATCH 0/3] cgroup: nmi safe css_rstat_updated Andrew Morton
@ 2025-06-09 23:51 ` Shakeel Butt
0 siblings, 0 replies; 15+ messages in thread
From: Shakeel Butt @ 2025-06-09 23:51 UTC (permalink / raw)
To: Andrew Morton
Cc: Tejun Heo, Johannes Weiner, Michal Hocko, Roman Gushchin,
Muchun Song, Vlastimil Babka, Alexei Starovoitov,
Sebastian Andrzej Siewior, Michal Koutný, Harry Yoo,
Yosry Ahmed, bpf, linux-mm, cgroups, linux-kernel,
Meta kernel team
On Mon, Jun 09, 2025 at 04:44:10PM -0700, Andrew Morton wrote:
> On Mon, 9 Jun 2025 15:56:08 -0700 Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> > BPF programs can run in nmi context and may trigger memcg charged memory
> > allocation in such context. Recently linux added support to nmi safe
> > page allocation along with memcg charging of such allocations. However
> > the kmalloc/slab support and corresponding memcg charging is still
> > lacking,
> >
> > To provide nmi safe support for memcg charging for kmalloc/slab
> > allocations, we need nmi safe memcg stats and for that we need nmi safe
> > css_rstat_updated() which adds the given cgroup state whose stats are
> > updated into the per-cpu per-ss update tree. This series took the aim to
> > make css_rstat_updated() nmi safe.
> >
> > This series made css_rstat_updated by using per-cpu lockless lists whose
> > node in embedded in individual struct cgroup_subsys_state and the
> > per-cpu head is placed in struct cgroup_subsys. For rstat users without
> > cgroup_subsys, a global per-cpu lockless list head is created. The main
> > challenge to use lockless in this scenario was the potential multiple
> > inserters using the same lockless node of a cgroup_subsys_state which is
> > different from traditional users of lockless lists.
> >
> > The multiple inserters using potentially same lockless node was resolved
> > by making one of them succeed on reset the lockless node and the winner
> > gets to insert the lockless node in the corresponding lockless list.
>
> And what happens with the losers?
The losers can continue their normal work without worrying about this
specific insertion; we basically need just one successful insertion. In
addition, this is contention between process context, softirq, hardirq
and nmi on the same cpu for the same cgroup, which should be very
unlikely.
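For example (one possible interleaving on a single cpu, just to illustrate):

	task:	llist_on_list(&lnode) == false
	nmi:	llist_on_list(&lnode) == false
	nmi:	try_cmpxchg(&lnode.next, &self, NULL) succeeds	/* winner */
	nmi:	llist_add(&lnode, lhead)			/* css queued */
	task:	try_cmpxchg() fails, returns			/* loser */

The loser just returns; the css was already queued by the winner, so no update
is lost.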
* Re: [PATCH 0/3] cgroup: nmi safe css_rstat_updated
2025-06-09 22:56 [PATCH 0/3] cgroup: nmi safe css_rstat_updated Shakeel Butt
` (3 preceding siblings ...)
2025-06-09 23:44 ` [PATCH 0/3] cgroup: nmi safe css_rstat_updated Andrew Morton
@ 2025-06-10 10:53 ` Michal Koutný
2025-06-10 16:24 ` Shakeel Butt
4 siblings, 1 reply; 15+ messages in thread
From: Michal Koutný @ 2025-06-10 10:53 UTC (permalink / raw)
To: Shakeel Butt
Cc: Tejun Heo, Andrew Morton, Johannes Weiner, Michal Hocko,
Roman Gushchin, Muchun Song, Vlastimil Babka, Alexei Starovoitov,
Sebastian Andrzej Siewior, Harry Yoo, Yosry Ahmed, bpf, linux-mm,
cgroups, linux-kernel, Meta kernel team
On Mon, Jun 09, 2025 at 03:56:08PM -0700, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> BPF programs can run in nmi context and may trigger memcg charged memory
> allocation in such context. Recently linux added support to nmi safe
> page allocation along with memcg charging of such allocations. However
> the kmalloc/slab support and corresponding memcg charging is still
> lacking,
>
> To provide nmi safe support for memcg charging for kmalloc/slab
> allocations, we need nmi safe memcg stats and for that we need nmi safe
> css_rstat_updated() which adds the given cgroup state whose stats are
> updated into the per-cpu per-ss update tree. This series took the aim to
> make css_rstat_updated() nmi safe.
memcg charging relies on page counters and per-cpu stocks.
css_rstat_updated() is "only" for statistics (which has admiteddly some
in-kernel consumers but those are already affected by batching and
flushing errors).
Have I missed some updates that make css_rstat_updated() calls critical
for memcg charging? I'd find it useful to explain this aspect more in
the cover letter.
Thanks,
Michal
* Re: [PATCH 0/3] cgroup: nmi safe css_rstat_updated
2025-06-10 10:53 ` Michal Koutný
@ 2025-06-10 16:24 ` Shakeel Butt
0 siblings, 0 replies; 15+ messages in thread
From: Shakeel Butt @ 2025-06-10 16:24 UTC (permalink / raw)
To: Michal Koutný
Cc: Tejun Heo, Andrew Morton, Johannes Weiner, Michal Hocko,
Roman Gushchin, Muchun Song, Vlastimil Babka, Alexei Starovoitov,
Sebastian Andrzej Siewior, Harry Yoo, Yosry Ahmed, bpf, linux-mm,
cgroups, linux-kernel, Meta kernel team
On Tue, Jun 10, 2025 at 12:53:11PM +0200, Michal Koutný wrote:
> On Mon, Jun 09, 2025 at 03:56:08PM -0700, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > BPF programs can run in nmi context and may trigger memcg charged memory
> > allocation in such context. Recently linux added support to nmi safe
> > page allocation along with memcg charging of such allocations. However
> > the kmalloc/slab support and corresponding memcg charging is still
> > lacking,
> >
> > To provide nmi safe support for memcg charging for kmalloc/slab
> > allocations, we need nmi safe memcg stats and for that we need nmi safe
> > css_rstat_updated() which adds the given cgroup state whose stats are
> > updated into the per-cpu per-ss update tree. This series took the aim to
> > make css_rstat_updated() nmi safe.
>
> memcg charging relies on page counters and per-cpu stocks.
> css_rstat_updated() is "only" for statistics (which has admiteddly some
> in-kernel consumers but those are already affected by batching and
> flushing errors).
>
> Have I missed some updates that make css_rstat_updated() calls critical
> for memcg charging? I'd find it useful to explain this aspect more in
> the cover letter.
For kernel memory, the charging and stats (MEMCG_KMEM,
NR_SLAB_RECLAIMABLE_B, NR_SLAB_UNRECLAIMABLE_B) updates happen together.
I will add a line or two in the next version.
* Re: [PATCH 2/3] cgroup: make css_rstat_updated nmi safe
2025-06-09 22:56 ` [PATCH 2/3] cgroup: make css_rstat_updated nmi safe Shakeel Butt
@ 2025-06-10 21:26 ` Tejun Heo
2025-06-10 22:31 ` Shakeel Butt
2025-06-11 5:23 ` JP Kobryn
1 sibling, 1 reply; 15+ messages in thread
From: Tejun Heo @ 2025-06-10 21:26 UTC (permalink / raw)
To: Shakeel Butt
Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
Muchun Song, Vlastimil Babka, Alexei Starovoitov,
Sebastian Andrzej Siewior, Michal Koutný, Harry Yoo,
Yosry Ahmed, bpf, linux-mm, cgroups, linux-kernel,
Meta kernel team
Hello,
On Mon, Jun 09, 2025 at 03:56:10PM -0700, Shakeel Butt wrote:
...
> + self = &rstatc->lnode;
> + if (!try_cmpxchg(&(rstatc->lnode.next), &self, NULL))
> return;
>
> + llist_add(&rstatc->lnode, lhead);
I may be missing something, but when you say multiple inserters, you mean the
function being re-entered from stacked contexts - i.e. process context, BH,
irq, nmi? If so, would it make sense to make the nmi and non-nmi paths use
separate lnodes? In the non-nmi path, we can just disable irqs, test whether
the lnode is empty and add it. The nmi path can just test whether its lnode is
empty and add it. I suppose NMIs don't nest, right? If they do, we can do
try_cmpxchg() there, I suppose.
While the actual addition to the list would be relatively low frequency,
css_rstat_updated() itself can be called pretty frequently. Before, the hot
path was an early exit after data_race(css_rstat_cpu(css, cpu)->updated_next).
After, the hot path is the !try_cmpxchg(), which doesn't seem great.
Thanks.
--
tejun
* Re: [PATCH 2/3] cgroup: make css_rstat_updated nmi safe
2025-06-10 21:26 ` Tejun Heo
@ 2025-06-10 22:31 ` Shakeel Butt
2025-06-10 22:39 ` Tejun Heo
0 siblings, 1 reply; 15+ messages in thread
From: Shakeel Butt @ 2025-06-10 22:31 UTC (permalink / raw)
To: Tejun Heo
Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
Muchun Song, Vlastimil Babka, Alexei Starovoitov,
Sebastian Andrzej Siewior, Michal Koutný, Harry Yoo,
Yosry Ahmed, bpf, linux-mm, cgroups, linux-kernel,
Meta kernel team
On Tue, Jun 10, 2025 at 11:26:35AM -1000, Tejun Heo wrote:
> Hello,
>
> On Mon, Jun 09, 2025 at 03:56:10PM -0700, Shakeel Butt wrote:
> ...
> > + self = &rstatc->lnode;
> > + if (!try_cmpxchg(&(rstatc->lnode.next), &self, NULL))
> > return;
> >
> > + llist_add(&rstatc->lnode, lhead);
>
> I may be missing something but when you say multiple inserters, you mean the
> function being re-entered from stacked contexts - ie. process context, BH,
> irq, nmi?
Yes.
> If so, would it make sense to make the nmi and non-nmi paths use
> separate lnode? In non-nmi path, we can just disable irq and test whether
> lnode is empty and add it. nmi path can just test whether its lnode is empty
> and add it. I suppose nmi's don't nest, right? If they do, we can do
> try_cmpxchg() there I suppose.
>
> While the actual addition to the list would be relatively low frequency,
> css_rstat_updated() itself can be called pretty frequently. Before, the hot
> path was early exit after data_race(css_rstat_cpu(css, cpu)->updated_next).
> After, the hot path is now !try_cmpxchg() which doesn't seem great.
>
A couple of lines above I have the llist_on_list(&rstatc->lnode) check, which
should be as cheap as data_race(css_rstat_cpu(css, cpu)->updated_next).
So I can add separate lnodes for the nmi and non-nmi contexts (with irqs
disabled), but I think that is not needed. Actually, I ran the netperf
benchmark (36 parallel instances) and I see no significant difference with
and without the patch.
Thanks for taking a look.
Shakeel
* Re: [PATCH 2/3] cgroup: make css_rstat_updated nmi safe
2025-06-10 22:31 ` Shakeel Butt
@ 2025-06-10 22:39 ` Tejun Heo
2025-06-10 23:28 ` Shakeel Butt
0 siblings, 1 reply; 15+ messages in thread
From: Tejun Heo @ 2025-06-10 22:39 UTC (permalink / raw)
To: Shakeel Butt
Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
Muchun Song, Vlastimil Babka, Alexei Starovoitov,
Sebastian Andrzej Siewior, Michal Koutný, Harry Yoo,
Yosry Ahmed, bpf, linux-mm, cgroups, linux-kernel,
Meta kernel team
Hello,
On Tue, Jun 10, 2025 at 03:31:03PM -0700, Shakeel Butt wrote:
...
> Couple of lines above I have llist_on_list(&rstatc->lnode) check which
> should be as cheap as data_race(css_rstat_cpu(css, cpu)->updated_next).
Ah, I missed that.
> So, I can add lnode for nmi and non-nmi contexts (with irqs disabled)
> but I think that is not needed. Actually I ran the netperf benchmark (36
> parallel instances) and I see no significant differences with and
> without the patch.
Yeah, as long as the hot path doesn't hit the extra cmpxchg, I think it
should be fine. Can you fortify the comments a bit to note that the
synchronization is against the stacked contexts on the same CPU? The use of
cmpxchg for something like this is a bit unusual and it'd be nice to have an
explanation of why it's done this way and why the overhead doesn't matter.
Thanks.
--
tejun
* Re: [PATCH 2/3] cgroup: make css_rstat_updated nmi safe
2025-06-10 22:39 ` Tejun Heo
@ 2025-06-10 23:28 ` Shakeel Butt
2025-06-10 23:33 ` Tejun Heo
0 siblings, 1 reply; 15+ messages in thread
From: Shakeel Butt @ 2025-06-10 23:28 UTC (permalink / raw)
To: Tejun Heo
Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
Muchun Song, Vlastimil Babka, Alexei Starovoitov,
Sebastian Andrzej Siewior, Michal Koutný, Harry Yoo,
Yosry Ahmed, bpf, linux-mm, cgroups, linux-kernel,
Meta kernel team
On Tue, Jun 10, 2025 at 12:39:18PM -1000, Tejun Heo wrote:
> Hello,
>
> On Tue, Jun 10, 2025 at 03:31:03PM -0700, Shakeel Butt wrote:
> ...
> > Couple of lines above I have llist_on_list(&rstatc->lnode) check which
> > should be as cheap as data_race(css_rstat_cpu(css, cpu)->updated_next).
>
> Ah, I missed that.
>
> > So, I can add lnode for nmi and non-nmi contexts (with irqs disabled)
> > but I think that is not needed. Actually I ran the netperf benchmark (36
> > parallel instances) and I see no significant differences with and
> > without the patch.
>
> Yeah, as long as the hot path doesn't hit the extra cmpxchg, I think it
> should be fine. Can you fortify the comments a bit that the synchronization
> is against the stacking contexts on the same CPU. The use of cmpxchg for
> something like this is a bit unusual and it'd be nice to have explanation on
> why it's done this way and why the overhead doesn't matter.
I was actually thinking of using this_cpu_cmpxchg, but then I would also need
to check for CONFIG_ARCH_HAS_NMI_SAFE_THIS_CPU_OPS. However, if you prefer
that, I can try this_cpu_cmpxchg in the next version.
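Roughly something along these lines (just a sketch, assuming the per-cpu
rstat state is reachable as css->rstat_cpu and that the update side always
runs on the local cpu; not the actual next version):

	if (!IS_ENABLED(CONFIG_ARCH_HAS_NMI_SAFE_THIS_CPU_OPS) && in_nmi())
		return;

	self = &rstatc->lnode;
	/* this_cpu_cmpxchg() returns the old value rather than a bool. */
	if (this_cpu_cmpxchg(css->rstat_cpu->lnode.next, self, NULL) != self)
		return;

	llist_add(&rstatc->lnode, lhead);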
I will also fix the comment with additional information about stacking
context.
* Re: [PATCH 2/3] cgroup: make css_rstat_updated nmi safe
2025-06-10 23:28 ` Shakeel Butt
@ 2025-06-10 23:33 ` Tejun Heo
0 siblings, 0 replies; 15+ messages in thread
From: Tejun Heo @ 2025-06-10 23:33 UTC (permalink / raw)
To: Shakeel Butt
Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
Muchun Song, Vlastimil Babka, Alexei Starovoitov,
Sebastian Andrzej Siewior, Michal Koutný, Harry Yoo,
Yosry Ahmed, bpf, linux-mm, cgroups, linux-kernel,
Meta kernel team
Hello,
On Tue, Jun 10, 2025 at 04:28:23PM -0700, Shakeel Butt wrote:
...
> I was actually thinking of using this_cpu_cmpxchg but then I need to
> also check for CONFIG_ARCH_HAS_NMI_SAFE_THIS_CPU_OPS. However if you
> prefer that, I can try this_cpu_cmpxchg in the next version.
Yeah, I don't think it'd make any performance difference, but, provided it
doesn't add too much complexity, it'd make things less confusing as the
construct being used aligns with the problem being solved.
Thanks.
--
tejun
* Re: [PATCH 2/3] cgroup: make css_rstat_updated nmi safe
2025-06-09 22:56 ` [PATCH 2/3] cgroup: make css_rstat_updated nmi safe Shakeel Butt
2025-06-10 21:26 ` Tejun Heo
@ 2025-06-11 5:23 ` JP Kobryn
2025-06-11 13:56 ` Shakeel Butt
1 sibling, 1 reply; 15+ messages in thread
From: JP Kobryn @ 2025-06-11 5:23 UTC (permalink / raw)
To: Shakeel Butt
Cc: Johannes Weiner, Andrew Morton, Tejun Heo, Michal Hocko,
Roman Gushchin, Muchun Song, Vlastimil Babka, Alexei Starovoitov,
Sebastian Andrzej Siewior, Michal Koutný, Harry Yoo,
Yosry Ahmed, bpf, linux-mm, cgroups, linux-kernel,
Meta kernel team
On 6/9/25 3:56 PM, Shakeel Butt wrote:
[..]
> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> index a5608ae2be27..4fabd7973067 100644
> --- a/kernel/cgroup/rstat.c
> +++ b/kernel/cgroup/rstat.c
> @@ -138,13 +138,15 @@ void _css_rstat_cpu_unlock(struct cgroup_subsys_state *css, int cpu,
> * @css: target cgroup subsystem state
> * @cpu: cpu on which rstat_cpu was updated
> *
> - * @css's rstat_cpu on @cpu was updated. Put it on the parent's matching
> - * rstat_cpu->updated_children list. See the comment on top of
> - * css_rstat_cpu definition for details.
> + * Atomically inserts the css in the ss's llist for the given cpu. This is nmi
> + * safe. The ss's llist will be processed at the flush time to create the update
> + * tree.
> */
> __bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
> {
> - unsigned long flags;
> + struct llist_head *lhead = ss_lhead_cpu(css->ss, cpu);
> + struct css_rstat_cpu *rstatc = css_rstat_cpu(css, cpu);
> + struct llist_node *self;
>
> /*
> - flags = _css_rstat_cpu_lock(css, cpu, true);
> + llist_add(&rstatc->lnode, lhead);
> +}
[..]
> +
> +static void css_process_update_tree(struct cgroup_subsys *ss, int cpu)
> +{
> + struct llist_head *lhead = ss_lhead_cpu(ss, cpu);
> + struct llist_node *lnode;
> +
> + while ((lnode = llist_del_first_init(lhead))) {
> + struct css_rstat_cpu *rstatc;
>
> - _css_rstat_cpu_unlock(css, cpu, flags, true);
> + rstatc = container_of(lnode, struct css_rstat_cpu, lnode);
> + __css_process_update_tree(rstatc->owner, cpu);
> + }
> }
>
> /**
> @@ -300,6 +331,8 @@ static struct cgroup_subsys_state *css_rstat_updated_list(
>
> flags = _css_rstat_cpu_lock(root, cpu, false);
The subsystem per-cpu locks were used to synchronize updater and flusher
on a given cpu. Since you no longer use them on the updater side, it
seems these locks can be removed altogether.
* Re: [PATCH 2/3] cgroup: make css_rstat_updated nmi safe
2025-06-11 5:23 ` JP Kobryn
@ 2025-06-11 13:56 ` Shakeel Butt
0 siblings, 0 replies; 15+ messages in thread
From: Shakeel Butt @ 2025-06-11 13:56 UTC (permalink / raw)
To: JP Kobryn
Cc: Johannes Weiner, Andrew Morton, Tejun Heo, Michal Hocko,
Roman Gushchin, Muchun Song, Vlastimil Babka, Alexei Starovoitov,
Sebastian Andrzej Siewior, Michal Koutný, Harry Yoo,
Yosry Ahmed, bpf, linux-mm, cgroups, linux-kernel,
Meta kernel team
On Tue, Jun 10, 2025 at 10:23:51PM -0700, JP Kobryn wrote:
>
> The subsystem per-cpu locks were used to synchronize updater and flusher
> on a given cpu. Since you no longer use them on the updater side, it
> seems these locks can be removed altogether.
Indeed you are right, thanks for the suggestion.
Thread overview: 15+ messages
2025-06-09 22:56 [PATCH 0/3] cgroup: nmi safe css_rstat_updated Shakeel Butt
2025-06-09 22:56 ` [PATCH 1/3] cgroup: support to enable nmi-safe css_rstat_updated Shakeel Butt
2025-06-09 22:56 ` [PATCH 2/3] cgroup: make css_rstat_updated nmi safe Shakeel Butt
2025-06-10 21:26 ` Tejun Heo
2025-06-10 22:31 ` Shakeel Butt
2025-06-10 22:39 ` Tejun Heo
2025-06-10 23:28 ` Shakeel Butt
2025-06-10 23:33 ` Tejun Heo
2025-06-11 5:23 ` JP Kobryn
2025-06-11 13:56 ` Shakeel Butt
2025-06-09 22:56 ` [PATCH 3/3] memcg: cgroup: call memcg_rstat_updated irrespective of in_nmi() Shakeel Butt
2025-06-09 23:44 ` [PATCH 0/3] cgroup: nmi safe css_rstat_updated Andrew Morton
2025-06-09 23:51 ` Shakeel Butt
2025-06-10 10:53 ` Michal Koutný
2025-06-10 16:24 ` Shakeel Butt