bpf.vger.kernel.org archive mirror
* [RFC PATCH 0/7] memcg: make memcg stats irq safe
@ 2025-05-13  3:13 Shakeel Butt
  2025-05-13  3:13 ` [RFC PATCH 1/7] memcg: memcg_rstat_updated re-entrant safe against irqs Shakeel Butt
                   ` (6 more replies)
  0 siblings, 7 replies; 16+ messages in thread
From: Shakeel Butt @ 2025-05-13  3:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Vlastimil Babka, Alexei Starovoitov, Sebastian Andrzej Siewior,
	Harry Yoo, Yosry Ahmed, bpf, linux-mm, cgroups, linux-kernel,
	Meta kernel team

This series makes memcg stats irq safe, i.e. memcg stats can be updated
in any context (task, softirq or hardirq) without disabling irqs.

This is still an RFC as I am not yet satisfied with the usage of
atomic_* ops in memcg_rstat_updated(). In addition, I still need to run
performance benchmarks (any suggestions/recommendations would be
appreciated). I am sending this out early to get feedback.

This is based on the latest mm-everything branch along with the
nmi-safe memcg series [1].

Link: http://lore.kernel.org/20250509232859.657525-1-shakeel.butt@linux.dev

Shakeel Butt (7):
  memcg: memcg_rstat_updated re-entrant safe against irqs
  memcg: move preempt disable to callers of memcg_rstat_updated
  memcg: make mod_memcg_state re-entrant safe against irqs
  memcg: make count_memcg_events re-entrant safe against irqs
  memcg: make __mod_memcg_lruvec_state re-entrant safe against irqs
  memcg: objcg stock trylock without irq disabling
  memcg: no stock lock for cpu hot-unplug

 include/linux/memcontrol.h |  41 +--------
 mm/memcontrol-v1.c         |   6 +-
 mm/memcontrol.c            | 167 +++++++++++++++----------------------
 mm/swap.c                  |   8 +-
 mm/vmscan.c                |  14 ++--
 5 files changed, 85 insertions(+), 151 deletions(-)

-- 
2.47.1


^ permalink raw reply	[flat|nested] 16+ messages in thread
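
For illustration, the net effect of the series on a stats-updating caller
looks roughly like the following sketch. It is derived from the
mod_memcg_state() wrapper that patch 3 removes and is not code posted in
this series; memcg, nr_pages and the MEMCG_KMEM item are just placeholders
taken from the later patches.

/*
 * Before the series: the per-cpu update path is not re-entrant against
 * irqs, so the caller (or the mod_memcg_state() wrapper) must disable
 * irqs around it.
 */
unsigned long flags;

local_irq_save(flags);
__mod_memcg_state(memcg, MEMCG_KMEM, -nr_pages);
local_irq_restore(flags);

/*
 * After the series: the update path is re-entrant against irqs and can
 * be called directly from task, softirq or hardirq context.
 */
mod_memcg_state(memcg, MEMCG_KMEM, -nr_pages);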

* [RFC PATCH 1/7] memcg: memcg_rstat_updated re-entrant safe against irqs
  2025-05-13  3:13 [RFC PATCH 0/7] memcg: make memcg stats irq safe Shakeel Butt
@ 2025-05-13  3:13 ` Shakeel Butt
  2025-05-13 10:22   ` Vlastimil Babka
  2025-05-13  3:13 ` [RFC PATCH 2/7] memcg: move preempt disable to callers of memcg_rstat_updated Shakeel Butt
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 16+ messages in thread
From: Shakeel Butt @ 2025-05-13  3:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Vlastimil Babka, Alexei Starovoitov, Sebastian Andrzej Siewior,
	Harry Yoo, Yosry Ahmed, bpf, linux-mm, cgroups, linux-kernel,
	Meta kernel team

The function memcg_rstat_updated() is used to track the memcg stats
updates for optimizing the flushes. At the moment, it is not re-entrant
safe and the callers disable irqs before calling it. However, to
achieve the goal of updating memcg stats without disabling irqs,
memcg_rstat_updated() needs to be re-entrant safe against irqs.

This patch makes memcg_rstat_updated() re-entrant safe against irqs.
However, it now uses atomic_* ops which, on x86, add a lock prefix to
the instructions. Since this is per-cpu data, the this_cpu_* ops would
be preferred. However, the percpu pointer is stored in struct
mem_cgroup, and the upward traversal through struct mem_cgroup may
cause two cache misses compared to traversing through the cached
struct memcg_vmstats_percpu parent pointer.

NOTE: explore if there is an atomic_* ops alternative without the lock
prefix.

Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
 mm/memcontrol.c | 21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6cfa3550f300..2c4c095bf26c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -503,7 +503,7 @@ static inline int memcg_events_index(enum vm_event_item idx)
 
 struct memcg_vmstats_percpu {
 	/* Stats updates since the last flush */
-	unsigned int			stats_updates;
+	atomic_t			stats_updates;
 
 	/* Cached pointers for fast iteration in memcg_rstat_updated() */
 	struct memcg_vmstats_percpu	*parent;
@@ -590,12 +590,15 @@ static bool memcg_vmstats_needs_flush(struct memcg_vmstats *vmstats)
 static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 {
 	struct memcg_vmstats_percpu *statc;
-	int cpu = smp_processor_id();
-	unsigned int stats_updates;
+	int cpu;
+	int stats_updates;
 
 	if (!val)
 		return;
 
+	/* Don't assume callers have preemption disabled. */
+	cpu = get_cpu();
+
 	cgroup_rstat_updated(memcg->css.cgroup, cpu);
 	statc = this_cpu_ptr(memcg->vmstats_percpu);
 	for (; statc; statc = statc->parent) {
@@ -607,14 +610,16 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 		if (memcg_vmstats_needs_flush(statc->vmstats))
 			break;
 
-		stats_updates = READ_ONCE(statc->stats_updates) + abs(val);
-		WRITE_ONCE(statc->stats_updates, stats_updates);
+		stats_updates = atomic_add_return(abs(val), &statc->stats_updates);
 		if (stats_updates < MEMCG_CHARGE_BATCH)
 			continue;
 
-		atomic64_add(stats_updates, &statc->vmstats->stats_updates);
-		WRITE_ONCE(statc->stats_updates, 0);
+		stats_updates = atomic_xchg(&statc->stats_updates, 0);
+		if (stats_updates)
+			atomic64_add(stats_updates,
+				     &statc->vmstats->stats_updates);
 	}
+	put_cpu();
 }
 
 static void __mem_cgroup_flush_stats(struct mem_cgroup *memcg, bool force)
@@ -4155,7 +4160,7 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 		mem_cgroup_stat_aggregate(&ac);
 
 	}
-	WRITE_ONCE(statc->stats_updates, 0);
+	atomic_set(&statc->stats_updates, 0);
 	/* We are in a per-cpu loop here, only do the atomic write once */
 	if (atomic64_read(&memcg->vmstats->stats_updates))
 		atomic64_set(&memcg->vmstats->stats_updates, 0);
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread
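
To spell out the race this hunk closes, here is a condensed before/after
of the update path with explanatory comments added; it is a readability
sketch using the field and constant names from the patch, not additional
code.

/*
 * Before: a plain read-modify-write. If an irq fires between the read
 * and the write and its handler also bumps stats_updates, that
 * contribution is overwritten when the interrupted context stores its
 * stale sum.
 */
stats_updates = READ_ONCE(statc->stats_updates) + abs(val);
WRITE_ONCE(statc->stats_updates, stats_updates);

/*
 * After: the add and the reset are each a single atomic RMW, so an
 * update done by an interrupting context cannot be lost.
 */
stats_updates = atomic_add_return(abs(val), &statc->stats_updates);
if (stats_updates >= MEMCG_CHARGE_BATCH) {
        stats_updates = atomic_xchg(&statc->stats_updates, 0);
        if (stats_updates)
                atomic64_add(stats_updates, &statc->vmstats->stats_updates);
}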

* [RFC PATCH 2/7] memcg: move preempt disable to callers of memcg_rstat_updated
  2025-05-13  3:13 [RFC PATCH 0/7] memcg: make memcg stats irq safe Shakeel Butt
  2025-05-13  3:13 ` [RFC PATCH 1/7] memcg: memcg_rstat_updated re-entrant safe against irqs Shakeel Butt
@ 2025-05-13  3:13 ` Shakeel Butt
  2025-05-13 10:34   ` Vlastimil Babka
  2025-05-13  3:13 ` [RFC PATCH 3/7] memcg: make mod_memcg_state re-entrant safe against irqs Shakeel Butt
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 16+ messages in thread
From: Shakeel Butt @ 2025-05-13  3:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Vlastimil Babka, Alexei Starovoitov, Sebastian Andrzej Siewior,
	Harry Yoo, Yosry Ahmed, bpf, linux-mm, cgroups, linux-kernel,
	Meta kernel team

Let's move the explicit preempt disable code to the callers of
memcg_rstat_updated() and also remove memcg_stats_lock() and related
functions, which were there to ensure that the callers of the stats
update functions had disabled preemption, because now the stats update
functions themselves explicitly disable preemption.

Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
 mm/memcontrol.c | 74 +++++++++++++------------------------------------
 1 file changed, 19 insertions(+), 55 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2c4c095bf26c..62450e7991d8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -558,47 +558,21 @@ static u64 flush_last_time;
 
 #define FLUSH_TIME (2UL*HZ)
 
-/*
- * Accessors to ensure that preemption is disabled on PREEMPT_RT because it can
- * not rely on this as part of an acquired spinlock_t lock. These functions are
- * never used in hardirq context on PREEMPT_RT and therefore disabling preemtion
- * is sufficient.
- */
-static void memcg_stats_lock(void)
-{
-	preempt_disable_nested();
-	VM_WARN_ON_IRQS_ENABLED();
-}
-
-static void __memcg_stats_lock(void)
-{
-	preempt_disable_nested();
-}
-
-static void memcg_stats_unlock(void)
-{
-	preempt_enable_nested();
-}
-
-
 static bool memcg_vmstats_needs_flush(struct memcg_vmstats *vmstats)
 {
 	return atomic64_read(&vmstats->stats_updates) >
 		MEMCG_CHARGE_BATCH * num_online_cpus();
 }
 
-static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
+static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val,
+				       int cpu)
 {
 	struct memcg_vmstats_percpu *statc;
-	int cpu;
 	int stats_updates;
 
 	if (!val)
 		return;
 
-	/* Don't assume callers have preemption disabled. */
-	cpu = get_cpu();
-
 	cgroup_rstat_updated(memcg->css.cgroup, cpu);
 	statc = this_cpu_ptr(memcg->vmstats_percpu);
 	for (; statc; statc = statc->parent) {
@@ -619,7 +593,6 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 			atomic64_add(stats_updates,
 				     &statc->vmstats->stats_updates);
 	}
-	put_cpu();
 }
 
 static void __mem_cgroup_flush_stats(struct mem_cgroup *memcg, bool force)
@@ -717,6 +690,7 @@ void __mod_memcg_state(struct mem_cgroup *memcg, enum memcg_stat_item idx,
 		       int val)
 {
 	int i = memcg_stats_index(idx);
+	int cpu;
 
 	if (mem_cgroup_disabled())
 		return;
@@ -724,12 +698,14 @@ void __mod_memcg_state(struct mem_cgroup *memcg, enum memcg_stat_item idx,
 	if (WARN_ONCE(BAD_STAT_IDX(i), "%s: missing stat item %d\n", __func__, idx))
 		return;
 
-	memcg_stats_lock();
+	cpu = get_cpu();
+
 	__this_cpu_add(memcg->vmstats_percpu->state[i], val);
 	val = memcg_state_val_in_pages(idx, val);
-	memcg_rstat_updated(memcg, val);
+	memcg_rstat_updated(memcg, val, cpu);
 	trace_mod_memcg_state(memcg, idx, val);
-	memcg_stats_unlock();
+
+	put_cpu();
 }
 
 #ifdef CONFIG_MEMCG_V1
@@ -758,6 +734,7 @@ static void __mod_memcg_lruvec_state(struct lruvec *lruvec,
 	struct mem_cgroup_per_node *pn;
 	struct mem_cgroup *memcg;
 	int i = memcg_stats_index(idx);
+	int cpu;
 
 	if (WARN_ONCE(BAD_STAT_IDX(i), "%s: missing stat item %d\n", __func__, idx))
 		return;
@@ -765,24 +742,7 @@ static void __mod_memcg_lruvec_state(struct lruvec *lruvec,
 	pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
 	memcg = pn->memcg;
 
-	/*
-	 * The caller from rmap relies on disabled preemption because they never
-	 * update their counter from in-interrupt context. For these two
-	 * counters we check that the update is never performed from an
-	 * interrupt context while other caller need to have disabled interrupt.
-	 */
-	__memcg_stats_lock();
-	if (IS_ENABLED(CONFIG_DEBUG_VM)) {
-		switch (idx) {
-		case NR_ANON_MAPPED:
-		case NR_FILE_MAPPED:
-		case NR_ANON_THPS:
-			WARN_ON_ONCE(!in_task());
-			break;
-		default:
-			VM_WARN_ON_IRQS_ENABLED();
-		}
-	}
+	cpu = get_cpu();
 
 	/* Update memcg */
 	__this_cpu_add(memcg->vmstats_percpu->state[i], val);
@@ -791,9 +751,10 @@ static void __mod_memcg_lruvec_state(struct lruvec *lruvec,
 	__this_cpu_add(pn->lruvec_stats_percpu->state[i], val);
 
 	val = memcg_state_val_in_pages(idx, val);
-	memcg_rstat_updated(memcg, val);
+	memcg_rstat_updated(memcg, val, cpu);
 	trace_mod_memcg_lruvec_state(memcg, idx, val);
-	memcg_stats_unlock();
+
+	put_cpu();
 }
 
 /**
@@ -873,6 +834,7 @@ void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
 			  unsigned long count)
 {
 	int i = memcg_events_index(idx);
+	int cpu;
 
 	if (mem_cgroup_disabled())
 		return;
@@ -880,11 +842,13 @@ void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
 	if (WARN_ONCE(BAD_STAT_IDX(i), "%s: missing stat item %d\n", __func__, idx))
 		return;
 
-	memcg_stats_lock();
+	cpu = get_cpu();
+
 	__this_cpu_add(memcg->vmstats_percpu->events[i], count);
-	memcg_rstat_updated(memcg, count);
+	memcg_rstat_updated(memcg, count, cpu);
 	trace_count_memcg_events(memcg, idx, count);
-	memcg_stats_unlock();
+
+	put_cpu();
 }
 
 unsigned long memcg_events(struct mem_cgroup *memcg, int event)
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread
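
Pulled out of the hunks above for readability, the caller pattern after
this patch is the one below: get_cpu()/put_cpu() only pins the CPU so the
per-cpu pointers stay stable, while safety against interrupting updates
comes from the update primitives themselves.

cpu = get_cpu();        /* disables preemption only, irqs stay enabled */

/* still __this_cpu_add() here; converted to this_cpu_add() by patch 3 */
__this_cpu_add(memcg->vmstats_percpu->state[i], val);
val = memcg_state_val_in_pages(idx, val);
memcg_rstat_updated(memcg, val, cpu);   /* re-entrant against irqs since patch 1 */

put_cpu();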

* [RFC PATCH 3/7] memcg: make mod_memcg_state re-entrant safe against irqs
  2025-05-13  3:13 [RFC PATCH 0/7] memcg: make memcg stats irq safe Shakeel Butt
  2025-05-13  3:13 ` [RFC PATCH 1/7] memcg: memcg_rstat_updated re-entrant safe against irqs Shakeel Butt
  2025-05-13  3:13 ` [RFC PATCH 2/7] memcg: move preempt disable to callers of memcg_rstat_updated Shakeel Butt
@ 2025-05-13  3:13 ` Shakeel Butt
  2025-05-13 10:38   ` Vlastimil Babka
  2025-05-13  3:13 ` [RFC PATCH 4/7] memcg: make count_memcg_events " Shakeel Butt
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 16+ messages in thread
From: Shakeel Butt @ 2025-05-13  3:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Vlastimil Babka, Alexei Starovoitov, Sebastian Andrzej Siewior,
	Harry Yoo, Yosry Ahmed, bpf, linux-mm, cgroups, linux-kernel,
	Meta kernel team

Let's make mod_memcg_state re-entrant safe against irqs. The only thing
needed is to convert the usage of __this_cpu_add() to this_cpu_add().
In addition, with re-entrant safety, there is no need to disable irqs.

mod_memcg_state() is not safe against nmi, so let's add a warning if
someone tries to call it in nmi context.

Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
 include/linux/memcontrol.h | 20 ++------------------
 mm/memcontrol.c            | 12 ++++++++----
 2 files changed, 10 insertions(+), 22 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ed9acb68652a..84e2cea7e666 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -911,19 +911,9 @@ struct mem_cgroup *mem_cgroup_get_oom_group(struct task_struct *victim,
 					    struct mem_cgroup *oom_domain);
 void mem_cgroup_print_oom_group(struct mem_cgroup *memcg);
 
-void __mod_memcg_state(struct mem_cgroup *memcg, enum memcg_stat_item idx,
-		       int val);
-
 /* idx can be of type enum memcg_stat_item or node_stat_item */
-static inline void mod_memcg_state(struct mem_cgroup *memcg,
-				   enum memcg_stat_item idx, int val)
-{
-	unsigned long flags;
-
-	local_irq_save(flags);
-	__mod_memcg_state(memcg, idx, val);
-	local_irq_restore(flags);
-}
+void mod_memcg_state(struct mem_cgroup *memcg,
+		     enum memcg_stat_item idx, int val);
 
 static inline void mod_memcg_page_state(struct page *page,
 					enum memcg_stat_item idx, int val)
@@ -1390,12 +1380,6 @@ static inline void mem_cgroup_print_oom_group(struct mem_cgroup *memcg)
 {
 }
 
-static inline void __mod_memcg_state(struct mem_cgroup *memcg,
-				     enum memcg_stat_item idx,
-				     int nr)
-{
-}
-
 static inline void mod_memcg_state(struct mem_cgroup *memcg,
 				   enum memcg_stat_item idx,
 				   int nr)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 62450e7991d8..373d36cae069 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -681,12 +681,12 @@ static int memcg_state_val_in_pages(int idx, int val)
 }
 
 /**
- * __mod_memcg_state - update cgroup memory statistics
+ * mod_memcg_state - update cgroup memory statistics
  * @memcg: the memory cgroup
  * @idx: the stat item - can be enum memcg_stat_item or enum node_stat_item
  * @val: delta to add to the counter, can be negative
  */
-void __mod_memcg_state(struct mem_cgroup *memcg, enum memcg_stat_item idx,
+void mod_memcg_state(struct mem_cgroup *memcg, enum memcg_stat_item idx,
 		       int val)
 {
 	int i = memcg_stats_index(idx);
@@ -698,9 +698,13 @@ void __mod_memcg_state(struct mem_cgroup *memcg, enum memcg_stat_item idx,
 	if (WARN_ONCE(BAD_STAT_IDX(i), "%s: missing stat item %d\n", __func__, idx))
 		return;
 
+	if (WARN_ONCE(in_nmi(), "%s: called in nmi context for stat item %d\n",
+		      __func__, idx))
+		return;
+
 	cpu = get_cpu();
 
-	__this_cpu_add(memcg->vmstats_percpu->state[i], val);
+	this_cpu_add(memcg->vmstats_percpu->state[i], val);
 	val = memcg_state_val_in_pages(idx, val);
 	memcg_rstat_updated(memcg, val, cpu);
 	trace_mod_memcg_state(memcg, idx, val);
@@ -2969,7 +2973,7 @@ static void drain_obj_stock(struct obj_stock_pcp *stock)
 
 			memcg = get_mem_cgroup_from_objcg(old);
 
-			__mod_memcg_state(memcg, MEMCG_KMEM, -nr_pages);
+			mod_memcg_state(memcg, MEMCG_KMEM, -nr_pages);
 			memcg1_account_kmem(memcg, -nr_pages);
 			if (!mem_cgroup_is_root(memcg))
 				memcg_uncharge(memcg, nr_pages);
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread
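
Background on why the one-line conversion is enough (not from the patch
itself): __this_cpu_add() may expand to a separate load, add and store and
relies on the caller to prevent re-entrancy, while this_cpu_add() is
guaranteed to be irq safe, on x86 a single per-cpu add instruction. That
is what makes it possible to drop the irq-saving mod_memcg_state()
wrapper from memcontrol.h.

/* not re-entrant: may compile to a load/add/store sequence */
__this_cpu_add(memcg->vmstats_percpu->state[i], val);

/* re-entrant: a single per-cpu add on x86, nothing to interleave with */
this_cpu_add(memcg->vmstats_percpu->state[i], val);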

* [RFC PATCH 4/7] memcg: make count_memcg_events re-entrant safe against irqs
  2025-05-13  3:13 [RFC PATCH 0/7] memcg: make memcg stats irq safe Shakeel Butt
                   ` (2 preceding siblings ...)
  2025-05-13  3:13 ` [RFC PATCH 3/7] memcg: make mod_memcg_state re-entrant safe against irqs Shakeel Butt
@ 2025-05-13  3:13 ` Shakeel Butt
  2025-05-13 10:39   ` Vlastimil Babka
  2025-05-13  3:13 ` [RFC PATCH 5/7] memcg: make __mod_memcg_lruvec_state " Shakeel Butt
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 16+ messages in thread
From: Shakeel Butt @ 2025-05-13  3:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Vlastimil Babka, Alexei Starovoitov, Sebastian Andrzej Siewior,
	Harry Yoo, Yosry Ahmed, bpf, linux-mm, cgroups, linux-kernel,
	Meta kernel team

Let's make count_memcg_events re-entrant safe against irqs. The only
thing needed is to convert the usage of __this_cpu_add() to
this_cpu_add(). In addition, with re-entrant safety, there is no need
to disable irqs. Also add warnings for in_nmi() as it is not safe
against nmi context.

Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
 include/linux/memcontrol.h | 21 ++-------------------
 mm/memcontrol-v1.c         |  6 +++---
 mm/memcontrol.c            | 10 +++++++---
 mm/swap.c                  |  8 ++++----
 mm/vmscan.c                | 14 +++++++-------
 5 files changed, 23 insertions(+), 36 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 84e2cea7e666..31b9ab93d4e1 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -952,19 +952,8 @@ static inline void mod_lruvec_kmem_state(void *p, enum node_stat_item idx,
 	local_irq_restore(flags);
 }
 
-void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
-			  unsigned long count);
-
-static inline void count_memcg_events(struct mem_cgroup *memcg,
-				      enum vm_event_item idx,
-				      unsigned long count)
-{
-	unsigned long flags;
-
-	local_irq_save(flags);
-	__count_memcg_events(memcg, idx, count);
-	local_irq_restore(flags);
-}
+void count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
+			unsigned long count);
 
 static inline void count_memcg_folio_events(struct folio *folio,
 		enum vm_event_item idx, unsigned long nr)
@@ -1438,12 +1427,6 @@ static inline void mod_lruvec_kmem_state(void *p, enum node_stat_item idx,
 }
 
 static inline void count_memcg_events(struct mem_cgroup *memcg,
-				      enum vm_event_item idx,
-				      unsigned long count)
-{
-}
-
-static inline void __count_memcg_events(struct mem_cgroup *memcg,
 					enum vm_event_item idx,
 					unsigned long count)
 {
diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index 3852f0713ad2..581c960ba19b 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -512,9 +512,9 @@ static void memcg1_charge_statistics(struct mem_cgroup *memcg, int nr_pages)
 {
 	/* pagein of a big page is an event. So, ignore page size */
 	if (nr_pages > 0)
-		__count_memcg_events(memcg, PGPGIN, 1);
+		count_memcg_events(memcg, PGPGIN, 1);
 	else {
-		__count_memcg_events(memcg, PGPGOUT, 1);
+		count_memcg_events(memcg, PGPGOUT, 1);
 		nr_pages = -nr_pages; /* for event */
 	}
 
@@ -689,7 +689,7 @@ void memcg1_uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout,
 	unsigned long flags;
 
 	local_irq_save(flags);
-	__count_memcg_events(memcg, PGPGOUT, pgpgout);
+	count_memcg_events(memcg, PGPGOUT, pgpgout);
 	__this_cpu_add(memcg->events_percpu->nr_page_events, nr_memory);
 	memcg1_check_events(memcg, nid);
 	local_irq_restore(flags);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 373d36cae069..9e7dc90cc460 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -829,12 +829,12 @@ void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val)
 }
 
 /**
- * __count_memcg_events - account VM events in a cgroup
+ * count_memcg_events - account VM events in a cgroup
  * @memcg: the memory cgroup
  * @idx: the event item
  * @count: the number of events that occurred
  */
-void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
+void count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
 			  unsigned long count)
 {
 	int i = memcg_events_index(idx);
@@ -846,9 +846,13 @@ void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
 	if (WARN_ONCE(BAD_STAT_IDX(i), "%s: missing stat item %d\n", __func__, idx))
 		return;
 
+	if (WARN_ONCE(in_nmi(), "%s: called in nmi context for stat item %d\n",
+		      __func__, idx))
+		return;
+
 	cpu = get_cpu();
 
-	__this_cpu_add(memcg->vmstats_percpu->events[i], count);
+	this_cpu_add(memcg->vmstats_percpu->events[i], count);
 	memcg_rstat_updated(memcg, count, cpu);
 	trace_count_memcg_events(memcg, idx, count);
 
diff --git a/mm/swap.c b/mm/swap.c
index 77b2d5997873..4fc322f7111a 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -309,7 +309,7 @@ static void lru_activate(struct lruvec *lruvec, struct folio *folio)
 	trace_mm_lru_activate(folio);
 
 	__count_vm_events(PGACTIVATE, nr_pages);
-	__count_memcg_events(lruvec_memcg(lruvec), PGACTIVATE, nr_pages);
+	count_memcg_events(lruvec_memcg(lruvec), PGACTIVATE, nr_pages);
 }
 
 #ifdef CONFIG_SMP
@@ -581,7 +581,7 @@ static void lru_deactivate_file(struct lruvec *lruvec, struct folio *folio)
 
 	if (active) {
 		__count_vm_events(PGDEACTIVATE, nr_pages);
-		__count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE,
+		count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE,
 				     nr_pages);
 	}
 }
@@ -599,7 +599,7 @@ static void lru_deactivate(struct lruvec *lruvec, struct folio *folio)
 	lruvec_add_folio(lruvec, folio);
 
 	__count_vm_events(PGDEACTIVATE, nr_pages);
-	__count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_pages);
+	count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_pages);
 }
 
 static void lru_lazyfree(struct lruvec *lruvec, struct folio *folio)
@@ -625,7 +625,7 @@ static void lru_lazyfree(struct lruvec *lruvec, struct folio *folio)
 	lruvec_add_folio(lruvec, folio);
 
 	__count_vm_events(PGLAZYFREE, nr_pages);
-	__count_memcg_events(lruvec_memcg(lruvec), PGLAZYFREE, nr_pages);
+	count_memcg_events(lruvec_memcg(lruvec), PGLAZYFREE, nr_pages);
 }
 
 /*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5efd939d8c76..f86d264558f5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2028,7 +2028,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 	item = PGSCAN_KSWAPD + reclaimer_offset(sc);
 	if (!cgroup_reclaim(sc))
 		__count_vm_events(item, nr_scanned);
-	__count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
+	count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
 	__count_vm_events(PGSCAN_ANON + file, nr_scanned);
 
 	spin_unlock_irq(&lruvec->lru_lock);
@@ -2048,7 +2048,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 	item = PGSTEAL_KSWAPD + reclaimer_offset(sc);
 	if (!cgroup_reclaim(sc))
 		__count_vm_events(item, nr_reclaimed);
-	__count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
+	count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
 	__count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
 	spin_unlock_irq(&lruvec->lru_lock);
 
@@ -2138,7 +2138,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 
 	if (!cgroup_reclaim(sc))
 		__count_vm_events(PGREFILL, nr_scanned);
-	__count_memcg_events(lruvec_memcg(lruvec), PGREFILL, nr_scanned);
+	count_memcg_events(lruvec_memcg(lruvec), PGREFILL, nr_scanned);
 
 	spin_unlock_irq(&lruvec->lru_lock);
 
@@ -2195,7 +2195,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	nr_deactivate = move_folios_to_lru(lruvec, &l_inactive);
 
 	__count_vm_events(PGDEACTIVATE, nr_deactivate);
-	__count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
+	count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
 	spin_unlock_irq(&lruvec->lru_lock);
@@ -4616,8 +4616,8 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 		__count_vm_events(item, isolated);
 		__count_vm_events(PGREFILL, sorted);
 	}
-	__count_memcg_events(memcg, item, isolated);
-	__count_memcg_events(memcg, PGREFILL, sorted);
+	count_memcg_events(memcg, item, isolated);
+	count_memcg_events(memcg, PGREFILL, sorted);
 	__count_vm_events(PGSCAN_ANON + type, isolated);
 	trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, MAX_LRU_BATCH,
 				scanned, skipped, isolated,
@@ -4769,7 +4769,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 	item = PGSTEAL_KSWAPD + reclaimer_offset(sc);
 	if (!cgroup_reclaim(sc))
 		__count_vm_events(item, reclaimed);
-	__count_memcg_events(memcg, item, reclaimed);
+	count_memcg_events(memcg, item, reclaimed);
 	__count_vm_events(PGSTEAL_ANON + type, reclaimed);
 
 	spin_unlock_irq(&lruvec->lru_lock);
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [RFC PATCH 5/7] memcg: make __mod_memcg_lruvec_state re-entrant safe against irqs
  2025-05-13  3:13 [RFC PATCH 0/7] memcg: make memcg stats irq safe Shakeel Butt
                   ` (3 preceding siblings ...)
  2025-05-13  3:13 ` [RFC PATCH 4/7] memcg: make count_memcg_events " Shakeel Butt
@ 2025-05-13  3:13 ` Shakeel Butt
  2025-05-13 10:40   ` Vlastimil Babka
  2025-05-13  3:13 ` [RFC PATCH 6/7] memcg: objcg stock trylock without irq disabling Shakeel Butt
  2025-05-13  3:13 ` [RFC PATCH 7/7] memcg: no stock lock for cpu hot-unplug Shakeel Butt
  6 siblings, 1 reply; 16+ messages in thread
From: Shakeel Butt @ 2025-05-13  3:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Vlastimil Babka, Alexei Starovoitov, Sebastian Andrzej Siewior,
	Harry Yoo, Yosry Ahmed, bpf, linux-mm, cgroups, linux-kernel,
	Meta kernel team

Let's make __mod_memcg_lruvec_state re-entrant safe and rename it to
mod_memcg_lruvec_state(). The only thing needed is to convert the usage
of __this_cpu_add() to this_cpu_add(). There are two callers of
mod_memcg_lruvec_state(), and one of them, i.e. __mod_objcg_mlstate(),
will be re-entrant safe as well, so rename it to mod_objcg_mlstate().
The last caller, __mod_lruvec_state(), still calls
__mod_node_page_state(), which is not re-entrant safe yet, so keep it
as is.

Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
 mm/memcontrol.c | 28 ++++++++++++++++------------
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9e7dc90cc460..adf2f1922118 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -731,7 +731,7 @@ unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx)
 }
 #endif
 
-static void __mod_memcg_lruvec_state(struct lruvec *lruvec,
+static void mod_memcg_lruvec_state(struct lruvec *lruvec,
 				     enum node_stat_item idx,
 				     int val)
 {
@@ -743,16 +743,20 @@ static void __mod_memcg_lruvec_state(struct lruvec *lruvec,
 	if (WARN_ONCE(BAD_STAT_IDX(i), "%s: missing stat item %d\n", __func__, idx))
 		return;
 
+	if (WARN_ONCE(in_nmi(), "%s: called in nmi context for stat item %d\n",
+		      __func__, idx))
+		return;
+
 	pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
 	memcg = pn->memcg;
 
 	cpu = get_cpu();
 
 	/* Update memcg */
-	__this_cpu_add(memcg->vmstats_percpu->state[i], val);
+	this_cpu_add(memcg->vmstats_percpu->state[i], val);
 
 	/* Update lruvec */
-	__this_cpu_add(pn->lruvec_stats_percpu->state[i], val);
+	this_cpu_add(pn->lruvec_stats_percpu->state[i], val);
 
 	val = memcg_state_val_in_pages(idx, val);
 	memcg_rstat_updated(memcg, val, cpu);
@@ -779,7 +783,7 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 
 	/* Update memcg and lruvec */
 	if (!mem_cgroup_disabled())
-		__mod_memcg_lruvec_state(lruvec, idx, val);
+		mod_memcg_lruvec_state(lruvec, idx, val);
 }
 
 void __lruvec_stat_mod_folio(struct folio *folio, enum node_stat_item idx,
@@ -2559,7 +2563,7 @@ static void commit_charge(struct folio *folio, struct mem_cgroup *memcg)
 	folio->memcg_data = (unsigned long)memcg;
 }
 
-static inline void __mod_objcg_mlstate(struct obj_cgroup *objcg,
+static inline void mod_objcg_mlstate(struct obj_cgroup *objcg,
 				       struct pglist_data *pgdat,
 				       enum node_stat_item idx, int nr)
 {
@@ -2570,7 +2574,7 @@ static inline void __mod_objcg_mlstate(struct obj_cgroup *objcg,
 	memcg = obj_cgroup_memcg(objcg);
 	if (likely(!in_nmi())) {
 		lruvec = mem_cgroup_lruvec(memcg, pgdat);
-		__mod_memcg_lruvec_state(lruvec, idx, nr);
+		mod_memcg_lruvec_state(lruvec, idx, nr);
 	} else {
 		struct mem_cgroup_per_node *pn = memcg->nodeinfo[pgdat->node_id];
 
@@ -2901,12 +2905,12 @@ static void __account_obj_stock(struct obj_cgroup *objcg,
 		struct pglist_data *oldpg = stock->cached_pgdat;
 
 		if (stock->nr_slab_reclaimable_b) {
-			__mod_objcg_mlstate(objcg, oldpg, NR_SLAB_RECLAIMABLE_B,
+			mod_objcg_mlstate(objcg, oldpg, NR_SLAB_RECLAIMABLE_B,
 					  stock->nr_slab_reclaimable_b);
 			stock->nr_slab_reclaimable_b = 0;
 		}
 		if (stock->nr_slab_unreclaimable_b) {
-			__mod_objcg_mlstate(objcg, oldpg, NR_SLAB_UNRECLAIMABLE_B,
+			mod_objcg_mlstate(objcg, oldpg, NR_SLAB_UNRECLAIMABLE_B,
 					  stock->nr_slab_unreclaimable_b);
 			stock->nr_slab_unreclaimable_b = 0;
 		}
@@ -2932,7 +2936,7 @@ static void __account_obj_stock(struct obj_cgroup *objcg,
 		}
 	}
 	if (nr)
-		__mod_objcg_mlstate(objcg, pgdat, idx, nr);
+		mod_objcg_mlstate(objcg, pgdat, idx, nr);
 }
 
 static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
@@ -3004,13 +3008,13 @@ static void drain_obj_stock(struct obj_stock_pcp *stock)
 	 */
 	if (stock->nr_slab_reclaimable_b || stock->nr_slab_unreclaimable_b) {
 		if (stock->nr_slab_reclaimable_b) {
-			__mod_objcg_mlstate(old, stock->cached_pgdat,
+			mod_objcg_mlstate(old, stock->cached_pgdat,
 					  NR_SLAB_RECLAIMABLE_B,
 					  stock->nr_slab_reclaimable_b);
 			stock->nr_slab_reclaimable_b = 0;
 		}
 		if (stock->nr_slab_unreclaimable_b) {
-			__mod_objcg_mlstate(old, stock->cached_pgdat,
+			mod_objcg_mlstate(old, stock->cached_pgdat,
 					  NR_SLAB_UNRECLAIMABLE_B,
 					  stock->nr_slab_unreclaimable_b);
 			stock->nr_slab_unreclaimable_b = 0;
@@ -3050,7 +3054,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
 
 	if (unlikely(in_nmi())) {
 		if (pgdat)
-			__mod_objcg_mlstate(objcg, pgdat, idx, nr_bytes);
+			mod_objcg_mlstate(objcg, pgdat, idx, nr_bytes);
 		nr_pages = nr_bytes >> PAGE_SHIFT;
 		nr_bytes = nr_bytes & (PAGE_SIZE - 1);
 		atomic_add(nr_bytes, &objcg->nr_charged_bytes);
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [RFC PATCH 6/7] memcg: objcg stock trylock without irq disabling
  2025-05-13  3:13 [RFC PATCH 0/7] memcg: make memcg stats irq safe Shakeel Butt
                   ` (4 preceding siblings ...)
  2025-05-13  3:13 ` [RFC PATCH 5/7] memcg: make __mod_memcg_lruvec_state " Shakeel Butt
@ 2025-05-13  3:13 ` Shakeel Butt
  2025-05-13 13:05   ` Vlastimil Babka
  2025-05-13  3:13 ` [RFC PATCH 7/7] memcg: no stock lock for cpu hot-unplug Shakeel Butt
  6 siblings, 1 reply; 16+ messages in thread
From: Shakeel Butt @ 2025-05-13  3:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Vlastimil Babka, Alexei Starovoitov, Sebastian Andrzej Siewior,
	Harry Yoo, Yosry Ahmed, bpf, linux-mm, cgroups, linux-kernel,
	Meta kernel team

There is no need to disable irqs to use objcg per-cpu stock, so let's
just not do that but consume_obj_stock() and refill_obj_stock() will
need to use trylock instead to keep per-cpu stock safe. One consequence
of this change is that the charge request from irq context may take
slowpath more often but it should be rare.

Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
 mm/memcontrol.c | 26 ++++++++++----------------
 1 file changed, 10 insertions(+), 16 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index adf2f1922118..af7df675d733 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1918,18 +1918,17 @@ static void drain_local_memcg_stock(struct work_struct *dummy)
 static void drain_local_obj_stock(struct work_struct *dummy)
 {
 	struct obj_stock_pcp *stock;
-	unsigned long flags;
 
 	if (WARN_ONCE(!in_task(), "drain in non-task context"))
 		return;
 
-	local_lock_irqsave(&obj_stock.lock, flags);
+	local_lock(&obj_stock.lock);
 
 	stock = this_cpu_ptr(&obj_stock);
 	drain_obj_stock(stock);
 	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
 
-	local_unlock_irqrestore(&obj_stock.lock, flags);
+	local_unlock(&obj_stock.lock);
 }
 
 static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
@@ -2062,14 +2061,13 @@ void drain_all_stock(struct mem_cgroup *root_memcg)
 static int memcg_hotplug_cpu_dead(unsigned int cpu)
 {
 	struct obj_stock_pcp *obj_st;
-	unsigned long flags;
 
 	obj_st = &per_cpu(obj_stock, cpu);
 
 	/* drain_obj_stock requires objstock.lock */
-	local_lock_irqsave(&obj_stock.lock, flags);
+	local_lock(&obj_stock.lock);
 	drain_obj_stock(obj_st);
-	local_unlock_irqrestore(&obj_stock.lock, flags);
+	local_unlock(&obj_stock.lock);
 
 	/* no need for the local lock */
 	drain_stock_fully(&per_cpu(memcg_stock, cpu));
@@ -2943,14 +2941,12 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
 			      struct pglist_data *pgdat, enum node_stat_item idx)
 {
 	struct obj_stock_pcp *stock;
-	unsigned long flags;
 	bool ret = false;
 
-	if (unlikely(in_nmi()))
+	if (unlikely(in_nmi()) ||
+	    !local_trylock(&obj_stock.lock))
 		return ret;
 
-	local_lock_irqsave(&obj_stock.lock, flags);
-
 	stock = this_cpu_ptr(&obj_stock);
 	if (objcg == READ_ONCE(stock->cached_objcg) && stock->nr_bytes >= nr_bytes) {
 		stock->nr_bytes -= nr_bytes;
@@ -2960,7 +2956,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
 			__account_obj_stock(objcg, stock, nr_bytes, pgdat, idx);
 	}
 
-	local_unlock_irqrestore(&obj_stock.lock, flags);
+	local_unlock(&obj_stock.lock);
 
 	return ret;
 }
@@ -3049,10 +3045,10 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
 		enum node_stat_item idx)
 {
 	struct obj_stock_pcp *stock;
-	unsigned long flags;
 	unsigned int nr_pages = 0;
 
-	if (unlikely(in_nmi())) {
+	if (unlikely(in_nmi()) ||
+	    !local_trylock(&obj_stock.lock)) {
 		if (pgdat)
 			mod_objcg_mlstate(objcg, pgdat, idx, nr_bytes);
 		nr_pages = nr_bytes >> PAGE_SHIFT;
@@ -3061,8 +3057,6 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
 		goto out;
 	}
 
-	local_lock_irqsave(&obj_stock.lock, flags);
-
 	stock = this_cpu_ptr(&obj_stock);
 	if (READ_ONCE(stock->cached_objcg) != objcg) { /* reset if necessary */
 		drain_obj_stock(stock);
@@ -3083,7 +3077,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
 		stock->nr_bytes &= (PAGE_SIZE - 1);
 	}
 
-	local_unlock_irqrestore(&obj_stock.lock, flags);
+	local_unlock(&obj_stock.lock);
 out:
 	if (nr_pages)
 		obj_cgroup_uncharge_pages(objcg, nr_pages);
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread
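
Condensed from the consume_obj_stock() hunk above, with comments spelling
out why a trylock is needed once irqs are no longer disabled (a
readability sketch, not additional code):

if (unlikely(in_nmi()) || !local_trylock(&obj_stock.lock)) {
        /*
         * With irqs enabled, an interrupt can arrive while the task it
         * interrupted on this CPU already holds obj_stock.lock, so the
         * lock must not be acquired again from here. Detect that with
         * the trylock and make the caller fall back to the slowpath
         * charge instead.
         */
        return ret;
}

stock = this_cpu_ptr(&obj_stock);
/* ... fast path: consume from the cached per-cpu stock ... */
local_unlock(&obj_stock.lock);

return ret;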

* [RFC PATCH 7/7] memcg: no stock lock for cpu hot-unplug
  2025-05-13  3:13 [RFC PATCH 0/7] memcg: make memcg stats irq safe Shakeel Butt
                   ` (5 preceding siblings ...)
  2025-05-13  3:13 ` [RFC PATCH 6/7] memcg: objcg stock trylock without irq disabling Shakeel Butt
@ 2025-05-13  3:13 ` Shakeel Butt
  2025-05-13 13:10   ` Vlastimil Babka
  6 siblings, 1 reply; 16+ messages in thread
From: Shakeel Butt @ 2025-05-13  3:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Vlastimil Babka, Alexei Starovoitov, Sebastian Andrzej Siewior,
	Harry Yoo, Yosry Ahmed, bpf, linux-mm, cgroups, linux-kernel,
	Meta kernel team

Previously on the cpu hot-unplug, the kernel would call
drain_obj_stock() with objcg local lock. However local lock was not
neede as the stock which was accessed belongs to a dead cpu but we kept
it there to disable irqs as drain_obj_stock() may call
mod_objcg_mlstate() which required irqs disabled. However there is no
need to disable irqs now for mod_objcg_mlstate(), so we can remove the
lcoal lock altogether from cpu hot-unplug path.

Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
 mm/memcontrol.c | 10 +---------
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index af7df675d733..539cd76e1492 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2060,16 +2060,8 @@ void drain_all_stock(struct mem_cgroup *root_memcg)
 
 static int memcg_hotplug_cpu_dead(unsigned int cpu)
 {
-	struct obj_stock_pcp *obj_st;
-
-	obj_st = &per_cpu(obj_stock, cpu);
-
-	/* drain_obj_stock requires objstock.lock */
-	local_lock(&obj_stock.lock);
-	drain_obj_stock(obj_st);
-	local_unlock(&obj_stock.lock);
-
 	/* no need for the local lock */
+	drain_obj_stock(&per_cpu(obj_stock, cpu));
 	drain_stock_fully(&per_cpu(memcg_stock, cpu));
 
 	return 0;
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 1/7] memcg: memcg_rstat_updated re-entrant safe against irqs
  2025-05-13  3:13 ` [RFC PATCH 1/7] memcg: memcg_rstat_updated re-entrant safe against irqs Shakeel Butt
@ 2025-05-13 10:22   ` Vlastimil Babka
  2025-05-13 18:09     ` Shakeel Butt
  0 siblings, 1 reply; 16+ messages in thread
From: Vlastimil Babka @ 2025-05-13 10:22 UTC (permalink / raw)
  To: Shakeel Butt, Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Alexei Starovoitov, Sebastian Andrzej Siewior, Harry Yoo,
	Yosry Ahmed, bpf, linux-mm, cgroups, linux-kernel,
	Meta kernel team, Mathieu Desnoyers

On 5/13/25 05:13, Shakeel Butt wrote:
> The function memcg_rstat_updated() is used to track the memcg stats
> updates for optimizing the flushes. At the moment, it is not re-entrant
> safe and the callers disable irqs before calling it. However, to
> achieve the goal of updating memcg stats without disabling irqs,
> memcg_rstat_updated() needs to be re-entrant safe against irqs.
> 
> This patch makes memcg_rstat_updated() re-entrant safe against irqs.
> However, it now uses atomic_* ops which, on x86, add a lock prefix to
> the instructions. Since this is per-cpu data, the this_cpu_* ops would
> be preferred. However, the percpu pointer is stored in struct
> mem_cgroup, and the upward traversal through struct mem_cgroup may
> cause two cache misses compared to traversing through the cached
> struct memcg_vmstats_percpu parent pointer.
> 
> NOTE: explore if there is an atomic_* ops alternative without the lock
> prefix.

local_t might be what you want here
https://docs.kernel.org/core-api/local_ops.html

Or maybe just add __percpu to parent like this?

struct memcg_vmstats_percpu {
...
        struct memcg_vmstats_percpu __percpu *parent;
...
}

Yes, it means that on each cpu's struct memcg_vmstats_percpu instance the
same value (the percpu offset) will be stored instead of the cpu-specific
parent pointer, which might seem wasteful. But AFAIK this_cpu_* is
optimized enough, thanks to the segment register usage, that it doesn't
matter? It shouldn't cause the extra cache miss you worry about, IIUC?

With that I think you could refactor that code to use e.g.
this_cpu_add_return() and this_cpu_xchg() on the stats_updates and obtain
the parent "pointer" in a way that's also compatible with these operations.

That is unless we also want nmi safety, then we're back to the issue of the
previous series...

> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> ---
>  mm/memcontrol.c | 21 +++++++++++++--------
>  1 file changed, 13 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6cfa3550f300..2c4c095bf26c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -503,7 +503,7 @@ static inline int memcg_events_index(enum vm_event_item idx)
>  
>  struct memcg_vmstats_percpu {
>  	/* Stats updates since the last flush */
> -	unsigned int			stats_updates;
> +	atomic_t			stats_updates;
>  
>  	/* Cached pointers for fast iteration in memcg_rstat_updated() */
>  	struct memcg_vmstats_percpu	*parent;
> @@ -590,12 +590,15 @@ static bool memcg_vmstats_needs_flush(struct memcg_vmstats *vmstats)
>  static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
>  {
>  	struct memcg_vmstats_percpu *statc;
> -	int cpu = smp_processor_id();
> -	unsigned int stats_updates;
> +	int cpu;
> +	int stats_updates;
>  
>  	if (!val)
>  		return;
>  
> +	/* Don't assume callers have preemption disabled. */
> +	cpu = get_cpu();
> +
>  	cgroup_rstat_updated(memcg->css.cgroup, cpu);
>  	statc = this_cpu_ptr(memcg->vmstats_percpu);
>  	for (; statc; statc = statc->parent) {
> @@ -607,14 +610,16 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
>  		if (memcg_vmstats_needs_flush(statc->vmstats))
>  			break;
>  
> -		stats_updates = READ_ONCE(statc->stats_updates) + abs(val);
> -		WRITE_ONCE(statc->stats_updates, stats_updates);
> +		stats_updates = atomic_add_return(abs(val), &statc->stats_updates);
>  		if (stats_updates < MEMCG_CHARGE_BATCH)
>  			continue;
>  
> -		atomic64_add(stats_updates, &statc->vmstats->stats_updates);
> -		WRITE_ONCE(statc->stats_updates, 0);
> +		stats_updates = atomic_xchg(&statc->stats_updates, 0);
> +		if (stats_updates)
> +			atomic64_add(stats_updates,
> +				     &statc->vmstats->stats_updates);
>  	}
> +	put_cpu();
>  }
>  
>  static void __mem_cgroup_flush_stats(struct mem_cgroup *memcg, bool force)
> @@ -4155,7 +4160,7 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
>  		mem_cgroup_stat_aggregate(&ac);
>  
>  	}
> -	WRITE_ONCE(statc->stats_updates, 0);
> +	atomic_set(&statc->stats_updates, 0);
>  	/* We are in a per-cpu loop here, only do the atomic write once */
>  	if (atomic64_read(&memcg->vmstats->stats_updates))
>  		atomic64_set(&memcg->vmstats->stats_updates, 0);


^ permalink raw reply	[flat|nested] 16+ messages in thread
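
For concreteness, the refactor suggested above might look roughly like the
sketch below, written against the patch 1 version of memcg_rstat_updated().
It is only a sketch: the __percpu parent field (called parent_pcpu here) is
a hypothetical name, nothing in this thread defines it, and the local_t
alternative would look similar with local_add_return()/local_xchg().

struct memcg_vmstats_percpu {
        /* back to a plain counter, updated with this_cpu_* ops */
        unsigned int                    stats_updates;

        /* hypothetical field: the same percpu offset stored on every CPU */
        struct memcg_vmstats_percpu __percpu *parent_pcpu;
        /* ... */
};

static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
{
        struct memcg_vmstats_percpu __percpu *statc_pcpu;
        struct memcg_vmstats_percpu *statc;
        unsigned int stats_updates;
        int cpu;

        if (!val)
                return;

        cpu = get_cpu();

        cgroup_rstat_updated(memcg->css.cgroup, cpu);
        for (statc_pcpu = memcg->vmstats_percpu; statc_pcpu;
             statc_pcpu = statc->parent_pcpu) {
                statc = this_cpu_ptr(statc_pcpu);

                /* flush checks as in the original function elided */

                /* re-entrant against irqs, no lock prefix on x86 */
                stats_updates = this_cpu_add_return(statc_pcpu->stats_updates,
                                                    abs(val));
                if (stats_updates < MEMCG_CHARGE_BATCH)
                        continue;

                stats_updates = this_cpu_xchg(statc_pcpu->stats_updates, 0);
                if (stats_updates)
                        atomic64_add(stats_updates,
                                     &statc->vmstats->stats_updates);
        }
        put_cpu();
}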

* Re: [RFC PATCH 2/7] memcg: move preempt disable to callers of memcg_rstat_updated
  2025-05-13  3:13 ` [RFC PATCH 2/7] memcg: move preempt disable to callers of memcg_rstat_updated Shakeel Butt
@ 2025-05-13 10:34   ` Vlastimil Babka
  0 siblings, 0 replies; 16+ messages in thread
From: Vlastimil Babka @ 2025-05-13 10:34 UTC (permalink / raw)
  To: Shakeel Butt, Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Alexei Starovoitov, Sebastian Andrzej Siewior, Harry Yoo,
	Yosry Ahmed, bpf, linux-mm, cgroups, linux-kernel,
	Meta kernel team

On 5/13/25 05:13, Shakeel Butt wrote:
> Let's move the explicit preempt disable code to the callers of
> memcg_rstat_updated() and also remove memcg_stats_lock() and related
> functions, which were there to ensure that the callers of the stats
> update functions had disabled preemption, because now the stats update
> functions themselves explicitly disable preemption.
> 
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>

A welcome cleanup!

Acked-by: Vlastimil Babka <vbabka@suse.cz>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 3/7] memcg: make mod_memcg_state re-entrant safe against irqs
  2025-05-13  3:13 ` [RFC PATCH 3/7] memcg: make mod_memcg_state re-entrant safe against irqs Shakeel Butt
@ 2025-05-13 10:38   ` Vlastimil Babka
  0 siblings, 0 replies; 16+ messages in thread
From: Vlastimil Babka @ 2025-05-13 10:38 UTC (permalink / raw)
  To: Shakeel Butt, Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Alexei Starovoitov, Sebastian Andrzej Siewior, Harry Yoo,
	Yosry Ahmed, bpf, linux-mm, cgroups, linux-kernel,
	Meta kernel team

On 5/13/25 05:13, Shakeel Butt wrote:
> Let's make mod_memcg_state re-entrant safe against irqs. The only thing
> needed is to convert the usage of __this_cpu_add() to this_cpu_add().
> In addition, with re-entrant safety, there is no need to disable irqs.
> 
> mod_memcg_state() is not safe against nmi, so let's add a warning if
> someone tries to call it in nmi context.
> 
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>

Also a good cleanup.

Acked-by: Vlastimil Babka <vbabka@suse.cz>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 4/7] memcg: make count_memcg_events re-entrant safe against irqs
  2025-05-13  3:13 ` [RFC PATCH 4/7] memcg: make count_memcg_events " Shakeel Butt
@ 2025-05-13 10:39   ` Vlastimil Babka
  0 siblings, 0 replies; 16+ messages in thread
From: Vlastimil Babka @ 2025-05-13 10:39 UTC (permalink / raw)
  To: Shakeel Butt, Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Alexei Starovoitov, Sebastian Andrzej Siewior, Harry Yoo,
	Yosry Ahmed, bpf, linux-mm, cgroups, linux-kernel,
	Meta kernel team

On 5/13/25 05:13, Shakeel Butt wrote:
> Let's make count_memcg_events re-entrant safe against irqs. The only
> thing needed is to convert the usage of __this_cpu_add() to
> this_cpu_add(). In addition, with re-entrant safety, there is no need
> to disable irqs. Also add warnings for in_nmi() as it is not safe
> against nmi context.
> 
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>

Acked-by: Vlastimil Babka <vbabka@suse.cz>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 5/7] memcg: make __mod_memcg_lruvec_state re-entrant safe against irqs
  2025-05-13  3:13 ` [RFC PATCH 5/7] memcg: make __mod_memcg_lruvec_state " Shakeel Butt
@ 2025-05-13 10:40   ` Vlastimil Babka
  0 siblings, 0 replies; 16+ messages in thread
From: Vlastimil Babka @ 2025-05-13 10:40 UTC (permalink / raw)
  To: Shakeel Butt, Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Alexei Starovoitov, Sebastian Andrzej Siewior, Harry Yoo,
	Yosry Ahmed, bpf, linux-mm, cgroups, linux-kernel,
	Meta kernel team

On 5/13/25 05:13, Shakeel Butt wrote:
> Let's make __mod_memcg_lruvec_state re-entrant safe and rename it to
> mod_memcg_lruvec_state(). The only thing needed is to convert the usage
> of __this_cpu_add() to this_cpu_add(). There are two callers of
> mod_memcg_lruvec_state(), and one of them, i.e. __mod_objcg_mlstate(),
> will be re-entrant safe as well, so rename it to mod_objcg_mlstate().
> The last caller, __mod_lruvec_state(), still calls
> __mod_node_page_state(), which is not re-entrant safe yet, so keep it
> as is.
> 
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>

Acked-by: Vlastimil Babka <vbabka@suse.cz>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 6/7] memcg: objcg stock trylock without irq disabling
  2025-05-13  3:13 ` [RFC PATCH 6/7] memcg: objcg stock trylock without irq disabling Shakeel Butt
@ 2025-05-13 13:05   ` Vlastimil Babka
  0 siblings, 0 replies; 16+ messages in thread
From: Vlastimil Babka @ 2025-05-13 13:05 UTC (permalink / raw)
  To: Shakeel Butt, Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Alexei Starovoitov, Sebastian Andrzej Siewior, Harry Yoo,
	Yosry Ahmed, bpf, linux-mm, cgroups, linux-kernel,
	Meta kernel team

On 5/13/25 05:13, Shakeel Butt wrote:
> There is no need to disable irqs to use objcg per-cpu stock, so let's
> just not do that but consume_obj_stock() and refill_obj_stock() will
> need to use trylock instead to keep per-cpu stock safe. One consequence

I'd rather say "to avoid deadlock".

> of this change is that the charge request from irq context may take
> slowpath more often but it should be rare.
> 
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>

Acked-by: Vlastimil Babka <vbabka@suse.cz>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 7/7] memcg: no stock lock for cpu hot-unplug
  2025-05-13  3:13 ` [RFC PATCH 7/7] memcg: no stock lock for cpu hot-unplug Shakeel Butt
@ 2025-05-13 13:10   ` Vlastimil Babka
  0 siblings, 0 replies; 16+ messages in thread
From: Vlastimil Babka @ 2025-05-13 13:10 UTC (permalink / raw)
  To: Shakeel Butt, Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Alexei Starovoitov, Sebastian Andrzej Siewior, Harry Yoo,
	Yosry Ahmed, bpf, linux-mm, cgroups, linux-kernel,
	Meta kernel team

On 5/13/25 05:13, Shakeel Butt wrote:
> Previously on the cpu hot-unplug, the kernel would call
> drain_obj_stock() with objcg local lock. However local lock was not
> neede as the stock which was accessed belongs to a dead cpu but we kept

needed

> it there to disable irqs as drain_obj_stock() may call
> mod_objcg_mlstate() which required irqs disabled. However there is no
> need to disable irqs now for mod_objcg_mlstate(), so we can remove the
> lcoal lock altogether from cpu hot-unplug path.

local

> 
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>

I think you could switch ordering of this patch with 6/7 to avoid changing
memcg_hotplug_cpu_dead() twice?

Acked-by: Vlastimil Babka <vbabka@suse.cz>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 1/7] memcg: memcg_rstat_updated re-entrant safe against irqs
  2025-05-13 10:22   ` Vlastimil Babka
@ 2025-05-13 18:09     ` Shakeel Butt
  0 siblings, 0 replies; 16+ messages in thread
From: Shakeel Butt @ 2025-05-13 18:09 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song, Alexei Starovoitov, Sebastian Andrzej Siewior,
	Harry Yoo, Yosry Ahmed, bpf, linux-mm, cgroups, linux-kernel,
	Meta kernel team, Mathieu Desnoyers

On Tue, May 13, 2025 at 12:22:28PM +0200, Vlastimil Babka wrote:
> On 5/13/25 05:13, Shakeel Butt wrote:
> > The function memcg_rstat_updated() is used to track the memcg stats
> > updates for optimizing the flushes. At the moment, it is not re-entrant
> > safe and the callers disable irqs before calling it. However, to
> > achieve the goal of updating memcg stats without disabling irqs,
> > memcg_rstat_updated() needs to be re-entrant safe against irqs.
> > 
> > This patch makes memcg_rstat_updated() re-entrant safe against irqs.
> > However, it now uses atomic_* ops which, on x86, add a lock prefix to
> > the instructions. Since this is per-cpu data, the this_cpu_* ops would
> > be preferred. However, the percpu pointer is stored in struct
> > mem_cgroup, and the upward traversal through struct mem_cgroup may
> > cause two cache misses compared to traversing through the cached
> > struct memcg_vmstats_percpu parent pointer.
> > 
> > NOTE: explore if there is an atomic_* ops alternative without the lock
> > prefix.
> 
> local_t might be what you want here
> https://docs.kernel.org/core-api/local_ops.html
> 
> Or maybe just add __percpu to parent like this?
> 
> struct memcg_vmstats_percpu {
> ...
>         struct memcg_vmstats_percpu __percpu *parent;
> ...
> }
> 
> Yes, it means on each cpu's struct memcg_vmstats_percpu instance there will
> be actually the same value stored (the percpu offset) instead of the
> cpu-specific parent pointer, which might seem wasteful. But AFAIK this_cpu_*
> is optimized enough thanks to the segment register usage, that it doesn't
> matter? It shouldn't cause any extra cache miss you worry about, IIUC?
> 
> With that I think you could refactor that code to use e.g.
> this_cpu_add_return() and this_cpu_xchg() on the stats_updates and obtain
> the parent "pointer" in a way that's also compatible with these operations.
> 

Thanks, I will try both of these and see which one looks better.

> That is unless we want also nmi safety, then we're back to the issue of the
> previous series...

Nah, just irq safety for now, and thanks a lot for the quick feedback
and review.

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2025-05-13 18:09 UTC | newest]

Thread overview: 16+ messages
2025-05-13  3:13 [RFC PATCH 0/7] memcg: make memcg stats irq safe Shakeel Butt
2025-05-13  3:13 ` [RFC PATCH 1/7] memcg: memcg_rstat_updated re-entrant safe against irqs Shakeel Butt
2025-05-13 10:22   ` Vlastimil Babka
2025-05-13 18:09     ` Shakeel Butt
2025-05-13  3:13 ` [RFC PATCH 2/7] memcg: move preempt disable to callers of memcg_rstat_updated Shakeel Butt
2025-05-13 10:34   ` Vlastimil Babka
2025-05-13  3:13 ` [RFC PATCH 3/7] memcg: make mod_memcg_state re-entrant safe against irqs Shakeel Butt
2025-05-13 10:38   ` Vlastimil Babka
2025-05-13  3:13 ` [RFC PATCH 4/7] memcg: make count_memcg_events " Shakeel Butt
2025-05-13 10:39   ` Vlastimil Babka
2025-05-13  3:13 ` [RFC PATCH 5/7] memcg: make __mod_memcg_lruvec_state " Shakeel Butt
2025-05-13 10:40   ` Vlastimil Babka
2025-05-13  3:13 ` [RFC PATCH 6/7] memcg: objcg stock trylock without irq disabling Shakeel Butt
2025-05-13 13:05   ` Vlastimil Babka
2025-05-13  3:13 ` [RFC PATCH 7/7] memcg: no stock lock for cpu hot-unplug Shakeel Butt
2025-05-13 13:10   ` Vlastimil Babka
