linux-mm.kvack.org archive mirror
* [PATCH v2 0/3] memcg: decouple memcg and objcg stocks
@ 2025-05-02  0:17 Shakeel Butt
  2025-05-02  0:17 ` [PATCH v2 1/3] memcg: separate local_trylock for memcg and obj Shakeel Butt
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Shakeel Butt @ 2025-05-02  0:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Alexei Starovoitov, linux-mm, cgroups, bpf, linux-kernel,
	Meta kernel team

The per-cpu memcg charge cache and objcg charge cache are coupled in a
single struct memcg_stock_pcp, and a single local lock protects both
caches. This makes it challenging to make memcg charging and objcg
charging nmi safe. Decoupling the memcg and objcg stocks allows each to
be made nmi safe independently and even to work without disabling irqs.
This series completely decouples the memcg and objcg stocks.
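
For orientation, here is the per-cpu layout the series ends up with,
assembled from the diffs below (a simplified sketch of the end state, not a
substitute for reading the patches): two independent per-cpu structs, each
guarded by its own local_trylock_t, so one cache can gain nmi safety or drop
irq disabling without constraining the other.

	/* memcg page-charge cache, guarded only by its own lock */
	struct memcg_stock_pcp {
		local_trylock_t lock;
		uint8_t nr_pages[NR_MEMCG_STOCK];
		struct mem_cgroup *cached[NR_MEMCG_STOCK];

		struct work_struct work;
		unsigned long flags;
	};

	/* objcg byte-charge cache, guarded only by its own lock */
	struct obj_stock_pcp {
		local_trylock_t lock;
		unsigned int nr_bytes;
		struct obj_cgroup *cached_objcg;
		struct pglist_data *cached_pgdat;
		int nr_slab_reclaimable_b;
		int nr_slab_unreclaimable_b;

		struct work_struct work;
		unsigned long flags;
	};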

Changes since v1:
- Drop first patch as requested by Alexei.
- Remove preempt_disable() usage as suggested by Vlastimil.

Shakeel Butt (3):
  memcg: separate local_trylock for memcg and obj
  memcg: completely decouple memcg and obj stocks
  memcg: no irq disable for memcg stock lock

 mm/memcontrol.c | 159 +++++++++++++++++++++++++++++-------------------
 1 file changed, 97 insertions(+), 62 deletions(-)

-- 
2.47.1




* [PATCH v2 1/3] memcg: separate local_trylock for memcg and obj
  2025-05-02  0:17 [PATCH v2 0/3] memcg: decouple memcg and objcg stocks Shakeel Butt
@ 2025-05-02  0:17 ` Shakeel Butt
  2025-05-02  0:17 ` [PATCH v2 2/3] memcg: completely decouple memcg and obj stocks Shakeel Butt
  2025-05-02  0:17 ` [PATCH v2 3/3] memcg: no irq disable for memcg stock lock Shakeel Butt
  2 siblings, 0 replies; 12+ messages in thread
From: Shakeel Butt @ 2025-05-02  0:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Alexei Starovoitov, linux-mm, cgroups, bpf, linux-kernel,
	Meta kernel team

The per-cpu stock_lock protects the cached memcg and cached objcg and
their respective fields. However there is no dependency between these
fields, and it is better to have fine-grained separate locks for the
cached memcg and the cached objcg. This decoupling of locks allows us to
make the memcg charge cache and the objcg charge cache nmi safe
independently.

At the moment, the memcg charge cache is already nmi safe, and this
decoupling will allow the memcg charge cache to work without disabling
irqs.

Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
Changes since v1:
- Drop usage of preempt_disable() as suggested by Vlastimil.

 mm/memcontrol.c | 51 ++++++++++++++++++++++++++-----------------------
 1 file changed, 27 insertions(+), 24 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0d42699bb564..14714e1d36e9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1806,13 +1806,14 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg)
  */
 #define NR_MEMCG_STOCK 7
 struct memcg_stock_pcp {
-	local_trylock_t stock_lock;
+	local_trylock_t memcg_lock;
 	uint8_t nr_pages[NR_MEMCG_STOCK];
 	struct mem_cgroup *cached[NR_MEMCG_STOCK];
 
+	local_trylock_t obj_lock;
+	unsigned int nr_bytes;
 	struct obj_cgroup *cached_objcg;
 	struct pglist_data *cached_pgdat;
-	unsigned int nr_bytes;
 	int nr_slab_reclaimable_b;
 	int nr_slab_unreclaimable_b;
 
@@ -1821,7 +1822,8 @@ struct memcg_stock_pcp {
 #define FLUSHING_CACHED_CHARGE	0
 };
 static DEFINE_PER_CPU_ALIGNED(struct memcg_stock_pcp, memcg_stock) = {
-	.stock_lock = INIT_LOCAL_TRYLOCK(stock_lock),
+	.memcg_lock = INIT_LOCAL_TRYLOCK(memcg_lock),
+	.obj_lock = INIT_LOCAL_TRYLOCK(obj_lock),
 };
 static DEFINE_MUTEX(percpu_charge_mutex);
 
@@ -1854,8 +1856,8 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
 		return ret;
 
 	if (gfpflags_allow_spinning(gfp_mask))
-		local_lock_irqsave(&memcg_stock.stock_lock, flags);
-	else if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags))
+		local_lock_irqsave(&memcg_stock.memcg_lock, flags);
+	else if (!local_trylock_irqsave(&memcg_stock.memcg_lock, flags))
 		return ret;
 
 	stock = this_cpu_ptr(&memcg_stock);
@@ -1872,7 +1874,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
 		break;
 	}
 
-	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
+	local_unlock_irqrestore(&memcg_stock.memcg_lock, flags);
 
 	return ret;
 }
@@ -1918,19 +1920,19 @@ static void drain_local_stock(struct work_struct *dummy)
 	struct memcg_stock_pcp *stock;
 	unsigned long flags;
 
-	/*
-	 * The only protection from cpu hotplug (memcg_hotplug_cpu_dead) vs.
-	 * drain_stock races is that we always operate on local CPU stock
-	 * here with IRQ disabled
-	 */
-	local_lock_irqsave(&memcg_stock.stock_lock, flags);
+	if (WARN_ONCE(!in_task(), "drain in non-task context"))
+		return;
 
+	local_lock_irqsave(&memcg_stock.obj_lock, flags);
 	stock = this_cpu_ptr(&memcg_stock);
 	drain_obj_stock(stock);
+	local_unlock_irqrestore(&memcg_stock.obj_lock, flags);
+
+	local_lock_irqsave(&memcg_stock.memcg_lock, flags);
+	stock = this_cpu_ptr(&memcg_stock);
 	drain_stock_fully(stock);
 	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
-
-	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
+	local_unlock_irqrestore(&memcg_stock.memcg_lock, flags);
 }
 
 static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
@@ -1953,10 +1955,10 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 	VM_WARN_ON_ONCE(mem_cgroup_is_root(memcg));
 
 	if (nr_pages > MEMCG_CHARGE_BATCH ||
-	    !local_trylock_irqsave(&memcg_stock.stock_lock, flags)) {
+	    !local_trylock_irqsave(&memcg_stock.memcg_lock, flags)) {
 		/*
 		 * In case of larger than batch refill or unlikely failure to
-		 * lock the percpu stock_lock, uncharge memcg directly.
+		 * lock the percpu memcg_lock, uncharge memcg directly.
 		 */
 		memcg_uncharge(memcg, nr_pages);
 		return;
@@ -1988,7 +1990,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 		WRITE_ONCE(stock->nr_pages[i], nr_pages);
 	}
 
-	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
+	local_unlock_irqrestore(&memcg_stock.memcg_lock, flags);
 }
 
 static bool is_drain_needed(struct memcg_stock_pcp *stock,
@@ -2063,11 +2065,12 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu)
 
 	stock = &per_cpu(memcg_stock, cpu);
 
-	/* drain_obj_stock requires stock_lock */
-	local_lock_irqsave(&memcg_stock.stock_lock, flags);
+	/* drain_obj_stock requires obj_lock */
+	local_lock_irqsave(&memcg_stock.obj_lock, flags);
 	drain_obj_stock(stock);
-	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
+	local_unlock_irqrestore(&memcg_stock.obj_lock, flags);
 
+	/* no need for the local lock */
 	drain_stock_fully(stock);
 
 	return 0;
@@ -2920,7 +2923,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
 	unsigned long flags;
 	bool ret = false;
 
-	local_lock_irqsave(&memcg_stock.stock_lock, flags);
+	local_lock_irqsave(&memcg_stock.obj_lock, flags);
 
 	stock = this_cpu_ptr(&memcg_stock);
 	if (objcg == READ_ONCE(stock->cached_objcg) && stock->nr_bytes >= nr_bytes) {
@@ -2931,7 +2934,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
 			__account_obj_stock(objcg, stock, nr_bytes, pgdat, idx);
 	}
 
-	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
+	local_unlock_irqrestore(&memcg_stock.obj_lock, flags);
 
 	return ret;
 }
@@ -3020,7 +3023,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
 	unsigned long flags;
 	unsigned int nr_pages = 0;
 
-	local_lock_irqsave(&memcg_stock.stock_lock, flags);
+	local_lock_irqsave(&memcg_stock.obj_lock, flags);
 
 	stock = this_cpu_ptr(&memcg_stock);
 	if (READ_ONCE(stock->cached_objcg) != objcg) { /* reset if necessary */
@@ -3042,7 +3045,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
 		stock->nr_bytes &= (PAGE_SIZE - 1);
 	}
 
-	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
+	local_unlock_irqrestore(&memcg_stock.obj_lock, flags);
 
 	if (nr_pages)
 		obj_cgroup_uncharge_pages(objcg, nr_pages);
-- 
2.47.1




* [PATCH v2 2/3] memcg: completely decouple memcg and obj stocks
  2025-05-02  0:17 [PATCH v2 0/3] memcg: decouple memcg and objcg stocks Shakeel Butt
  2025-05-02  0:17 ` [PATCH v2 1/3] memcg: separate local_trylock for memcg and obj Shakeel Butt
@ 2025-05-02  0:17 ` Shakeel Butt
  2025-05-02  0:17 ` [PATCH v2 3/3] memcg: no irq disable for memcg stock lock Shakeel Butt
  2 siblings, 0 replies; 12+ messages in thread
From: Shakeel Butt @ 2025-05-02  0:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Alexei Starovoitov, linux-mm, cgroups, bpf, linux-kernel,
	Meta kernel team, Vlastimil Babka

Let's completely decouple the memcg and obj per-cpu stocks. This will
enable the memcg per-cpu stocks to be used without disabling irqs. It
will also enable the obj stocks to be made nmi safe independently, which
is required to make kmalloc/slab safe for allocations from nmi context.

Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/memcontrol.c | 151 +++++++++++++++++++++++++++++-------------------
 1 file changed, 93 insertions(+), 58 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 14714e1d36e9..cd81c70d144b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1804,13 +1804,23 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg)
  * The value of NR_MEMCG_STOCK is selected to keep the cached memcgs and their
  * nr_pages in a single cacheline. This may change in future.
  */
+#define FLUSHING_CACHED_CHARGE	0
 #define NR_MEMCG_STOCK 7
 struct memcg_stock_pcp {
-	local_trylock_t memcg_lock;
+	local_trylock_t lock;
 	uint8_t nr_pages[NR_MEMCG_STOCK];
 	struct mem_cgroup *cached[NR_MEMCG_STOCK];
 
-	local_trylock_t obj_lock;
+	struct work_struct work;
+	unsigned long flags;
+};
+
+static DEFINE_PER_CPU_ALIGNED(struct memcg_stock_pcp, memcg_stock) = {
+	.lock = INIT_LOCAL_TRYLOCK(lock),
+};
+
+struct obj_stock_pcp {
+	local_trylock_t lock;
 	unsigned int nr_bytes;
 	struct obj_cgroup *cached_objcg;
 	struct pglist_data *cached_pgdat;
@@ -1819,16 +1829,16 @@ struct memcg_stock_pcp {
 
 	struct work_struct work;
 	unsigned long flags;
-#define FLUSHING_CACHED_CHARGE	0
 };
-static DEFINE_PER_CPU_ALIGNED(struct memcg_stock_pcp, memcg_stock) = {
-	.memcg_lock = INIT_LOCAL_TRYLOCK(memcg_lock),
-	.obj_lock = INIT_LOCAL_TRYLOCK(obj_lock),
+
+static DEFINE_PER_CPU_ALIGNED(struct obj_stock_pcp, obj_stock) = {
+	.lock = INIT_LOCAL_TRYLOCK(lock),
 };
+
 static DEFINE_MUTEX(percpu_charge_mutex);
 
-static void drain_obj_stock(struct memcg_stock_pcp *stock);
-static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
+static void drain_obj_stock(struct obj_stock_pcp *stock);
+static bool obj_stock_flush_required(struct obj_stock_pcp *stock,
 				     struct mem_cgroup *root_memcg);
 
 /**
@@ -1856,8 +1866,8 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
 		return ret;
 
 	if (gfpflags_allow_spinning(gfp_mask))
-		local_lock_irqsave(&memcg_stock.memcg_lock, flags);
-	else if (!local_trylock_irqsave(&memcg_stock.memcg_lock, flags))
+		local_lock_irqsave(&memcg_stock.lock, flags);
+	else if (!local_trylock_irqsave(&memcg_stock.lock, flags))
 		return ret;
 
 	stock = this_cpu_ptr(&memcg_stock);
@@ -1874,7 +1884,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
 		break;
 	}
 
-	local_unlock_irqrestore(&memcg_stock.memcg_lock, flags);
+	local_unlock_irqrestore(&memcg_stock.lock, flags);
 
 	return ret;
 }
@@ -1915,7 +1925,7 @@ static void drain_stock_fully(struct memcg_stock_pcp *stock)
 		drain_stock(stock, i);
 }
 
-static void drain_local_stock(struct work_struct *dummy)
+static void drain_local_memcg_stock(struct work_struct *dummy)
 {
 	struct memcg_stock_pcp *stock;
 	unsigned long flags;
@@ -1923,16 +1933,30 @@ static void drain_local_stock(struct work_struct *dummy)
 	if (WARN_ONCE(!in_task(), "drain in non-task context"))
 		return;
 
-	local_lock_irqsave(&memcg_stock.obj_lock, flags);
-	stock = this_cpu_ptr(&memcg_stock);
-	drain_obj_stock(stock);
-	local_unlock_irqrestore(&memcg_stock.obj_lock, flags);
+	local_lock_irqsave(&memcg_stock.lock, flags);
 
-	local_lock_irqsave(&memcg_stock.memcg_lock, flags);
 	stock = this_cpu_ptr(&memcg_stock);
 	drain_stock_fully(stock);
 	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
-	local_unlock_irqrestore(&memcg_stock.memcg_lock, flags);
+
+	local_unlock_irqrestore(&memcg_stock.lock, flags);
+}
+
+static void drain_local_obj_stock(struct work_struct *dummy)
+{
+	struct obj_stock_pcp *stock;
+	unsigned long flags;
+
+	if (WARN_ONCE(!in_task(), "drain in non-task context"))
+		return;
+
+	local_lock_irqsave(&obj_stock.lock, flags);
+
+	stock = this_cpu_ptr(&obj_stock);
+	drain_obj_stock(stock);
+	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
+
+	local_unlock_irqrestore(&obj_stock.lock, flags);
 }
 
 static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
@@ -1955,10 +1979,10 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 	VM_WARN_ON_ONCE(mem_cgroup_is_root(memcg));
 
 	if (nr_pages > MEMCG_CHARGE_BATCH ||
-	    !local_trylock_irqsave(&memcg_stock.memcg_lock, flags)) {
+	    !local_trylock_irqsave(&memcg_stock.lock, flags)) {
 		/*
 		 * In case of larger than batch refill or unlikely failure to
-		 * lock the percpu memcg_lock, uncharge memcg directly.
+		 * lock the percpu memcg_stock.lock, uncharge memcg directly.
 		 */
 		memcg_uncharge(memcg, nr_pages);
 		return;
@@ -1990,23 +2014,17 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 		WRITE_ONCE(stock->nr_pages[i], nr_pages);
 	}
 
-	local_unlock_irqrestore(&memcg_stock.memcg_lock, flags);
+	local_unlock_irqrestore(&memcg_stock.lock, flags);
 }
 
-static bool is_drain_needed(struct memcg_stock_pcp *stock,
-			    struct mem_cgroup *root_memcg)
+static bool is_memcg_drain_needed(struct memcg_stock_pcp *stock,
+				  struct mem_cgroup *root_memcg)
 {
 	struct mem_cgroup *memcg;
 	bool flush = false;
 	int i;
 
 	rcu_read_lock();
-
-	if (obj_stock_flush_required(stock, root_memcg)) {
-		flush = true;
-		goto out;
-	}
-
 	for (i = 0; i < NR_MEMCG_STOCK; ++i) {
 		memcg = READ_ONCE(stock->cached[i]);
 		if (!memcg)
@@ -2018,7 +2036,6 @@ static bool is_drain_needed(struct memcg_stock_pcp *stock,
 			break;
 		}
 	}
-out:
 	rcu_read_unlock();
 	return flush;
 }
@@ -2043,15 +2060,27 @@ void drain_all_stock(struct mem_cgroup *root_memcg)
 	migrate_disable();
 	curcpu = smp_processor_id();
 	for_each_online_cpu(cpu) {
-		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
-		bool flush = is_drain_needed(stock, root_memcg);
+		struct memcg_stock_pcp *memcg_st = &per_cpu(memcg_stock, cpu);
+		struct obj_stock_pcp *obj_st = &per_cpu(obj_stock, cpu);
 
-		if (flush &&
-		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
+		if (!test_bit(FLUSHING_CACHED_CHARGE, &memcg_st->flags) &&
+		    is_memcg_drain_needed(memcg_st, root_memcg) &&
+		    !test_and_set_bit(FLUSHING_CACHED_CHARGE,
+				      &memcg_st->flags)) {
 			if (cpu == curcpu)
-				drain_local_stock(&stock->work);
+				drain_local_memcg_stock(&memcg_st->work);
 			else if (!cpu_is_isolated(cpu))
-				schedule_work_on(cpu, &stock->work);
+				schedule_work_on(cpu, &memcg_st->work);
+		}
+
+		if (!test_bit(FLUSHING_CACHED_CHARGE, &obj_st->flags) &&
+		    obj_stock_flush_required(obj_st, root_memcg) &&
+		    !test_and_set_bit(FLUSHING_CACHED_CHARGE,
+				      &obj_st->flags)) {
+			if (cpu == curcpu)
+				drain_local_obj_stock(&obj_st->work);
+			else if (!cpu_is_isolated(cpu))
+				schedule_work_on(cpu, &obj_st->work);
 		}
 	}
 	migrate_enable();
@@ -2060,18 +2089,18 @@ void drain_all_stock(struct mem_cgroup *root_memcg)
 
 static int memcg_hotplug_cpu_dead(unsigned int cpu)
 {
-	struct memcg_stock_pcp *stock;
+	struct obj_stock_pcp *obj_st;
 	unsigned long flags;
 
-	stock = &per_cpu(memcg_stock, cpu);
+	obj_st = &per_cpu(obj_stock, cpu);
 
-	/* drain_obj_stock requires obj_lock */
-	local_lock_irqsave(&memcg_stock.obj_lock, flags);
-	drain_obj_stock(stock);
-	local_unlock_irqrestore(&memcg_stock.obj_lock, flags);
+	/* drain_obj_stock requires objstock.lock */
+	local_lock_irqsave(&obj_stock.lock, flags);
+	drain_obj_stock(obj_st);
+	local_unlock_irqrestore(&obj_stock.lock, flags);
 
 	/* no need for the local lock */
-	drain_stock_fully(stock);
+	drain_stock_fully(&per_cpu(memcg_stock, cpu));
 
 	return 0;
 }
@@ -2868,7 +2897,7 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
 }
 
 static void __account_obj_stock(struct obj_cgroup *objcg,
-				struct memcg_stock_pcp *stock, int nr,
+				struct obj_stock_pcp *stock, int nr,
 				struct pglist_data *pgdat, enum node_stat_item idx)
 {
 	int *bytes;
@@ -2919,13 +2948,13 @@ static void __account_obj_stock(struct obj_cgroup *objcg,
 static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
 			      struct pglist_data *pgdat, enum node_stat_item idx)
 {
-	struct memcg_stock_pcp *stock;
+	struct obj_stock_pcp *stock;
 	unsigned long flags;
 	bool ret = false;
 
-	local_lock_irqsave(&memcg_stock.obj_lock, flags);
+	local_lock_irqsave(&obj_stock.lock, flags);
 
-	stock = this_cpu_ptr(&memcg_stock);
+	stock = this_cpu_ptr(&obj_stock);
 	if (objcg == READ_ONCE(stock->cached_objcg) && stock->nr_bytes >= nr_bytes) {
 		stock->nr_bytes -= nr_bytes;
 		ret = true;
@@ -2934,12 +2963,12 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
 			__account_obj_stock(objcg, stock, nr_bytes, pgdat, idx);
 	}
 
-	local_unlock_irqrestore(&memcg_stock.obj_lock, flags);
+	local_unlock_irqrestore(&obj_stock.lock, flags);
 
 	return ret;
 }
 
-static void drain_obj_stock(struct memcg_stock_pcp *stock)
+static void drain_obj_stock(struct obj_stock_pcp *stock)
 {
 	struct obj_cgroup *old = READ_ONCE(stock->cached_objcg);
 
@@ -3000,32 +3029,35 @@ static void drain_obj_stock(struct memcg_stock_pcp *stock)
 	obj_cgroup_put(old);
 }
 
-static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
+static bool obj_stock_flush_required(struct obj_stock_pcp *stock,
 				     struct mem_cgroup *root_memcg)
 {
 	struct obj_cgroup *objcg = READ_ONCE(stock->cached_objcg);
 	struct mem_cgroup *memcg;
+	bool flush = false;
 
+	rcu_read_lock();
 	if (objcg) {
 		memcg = obj_cgroup_memcg(objcg);
 		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
-			return true;
+			flush = true;
 	}
+	rcu_read_unlock();
 
-	return false;
+	return flush;
 }
 
 static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
 		bool allow_uncharge, int nr_acct, struct pglist_data *pgdat,
 		enum node_stat_item idx)
 {
-	struct memcg_stock_pcp *stock;
+	struct obj_stock_pcp *stock;
 	unsigned long flags;
 	unsigned int nr_pages = 0;
 
-	local_lock_irqsave(&memcg_stock.obj_lock, flags);
+	local_lock_irqsave(&obj_stock.lock, flags);
 
-	stock = this_cpu_ptr(&memcg_stock);
+	stock = this_cpu_ptr(&obj_stock);
 	if (READ_ONCE(stock->cached_objcg) != objcg) { /* reset if necessary */
 		drain_obj_stock(stock);
 		obj_cgroup_get(objcg);
@@ -3045,7 +3077,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
 		stock->nr_bytes &= (PAGE_SIZE - 1);
 	}
 
-	local_unlock_irqrestore(&memcg_stock.obj_lock, flags);
+	local_unlock_irqrestore(&obj_stock.lock, flags);
 
 	if (nr_pages)
 		obj_cgroup_uncharge_pages(objcg, nr_pages);
@@ -5164,9 +5196,12 @@ int __init mem_cgroup_init(void)
 	cpuhp_setup_state_nocalls(CPUHP_MM_MEMCQ_DEAD, "mm/memctrl:dead", NULL,
 				  memcg_hotplug_cpu_dead);
 
-	for_each_possible_cpu(cpu)
+	for_each_possible_cpu(cpu) {
 		INIT_WORK(&per_cpu_ptr(&memcg_stock, cpu)->work,
-			  drain_local_stock);
+			  drain_local_memcg_stock);
+		INIT_WORK(&per_cpu_ptr(&obj_stock, cpu)->work,
+			  drain_local_obj_stock);
+	}
 
 	memcg_size = struct_size_t(struct mem_cgroup, nodeinfo, nr_node_ids);
 	memcg_cachep = kmem_cache_create("mem_cgroup", memcg_size, 0,
-- 
2.47.1




* [PATCH v2 3/3] memcg: no irq disable for memcg stock lock
  2025-05-02  0:17 [PATCH v2 0/3] memcg: decouple memcg and objcg stocks Shakeel Butt
  2025-05-02  0:17 ` [PATCH v2 1/3] memcg: separate local_trylock for memcg and obj Shakeel Butt
  2025-05-02  0:17 ` [PATCH v2 2/3] memcg: completely decouple memcg and obj stocks Shakeel Butt
@ 2025-05-02  0:17 ` Shakeel Butt
  2025-05-02 18:29   ` Alexei Starovoitov
  2 siblings, 1 reply; 12+ messages in thread
From: Shakeel Butt @ 2025-05-02  0:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Alexei Starovoitov, linux-mm, cgroups, bpf, linux-kernel,
	Meta kernel team, Vlastimil Babka

There is no need to disable irqs to use the memcg per-cpu stock, so
let's just not do that. One consequence of this change: if a task
holds the memcg stock lock and that cpu gets interrupted, memcg charges
on that cpu in irq context will take the slow path of memcg charging.
However that should be super rare and should be fine in general.

Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/memcontrol.c | 17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cd81c70d144b..f8b9c7aa6771 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1858,7 +1858,6 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
 {
 	struct memcg_stock_pcp *stock;
 	uint8_t stock_pages;
-	unsigned long flags;
 	bool ret = false;
 	int i;
 
@@ -1866,8 +1865,8 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
 		return ret;
 
 	if (gfpflags_allow_spinning(gfp_mask))
-		local_lock_irqsave(&memcg_stock.lock, flags);
-	else if (!local_trylock_irqsave(&memcg_stock.lock, flags))
+		local_lock(&memcg_stock.lock);
+	else if (!local_trylock(&memcg_stock.lock))
 		return ret;
 
 	stock = this_cpu_ptr(&memcg_stock);
@@ -1884,7 +1883,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
 		break;
 	}
 
-	local_unlock_irqrestore(&memcg_stock.lock, flags);
+	local_unlock(&memcg_stock.lock);
 
 	return ret;
 }
@@ -1928,18 +1927,17 @@ static void drain_stock_fully(struct memcg_stock_pcp *stock)
 static void drain_local_memcg_stock(struct work_struct *dummy)
 {
 	struct memcg_stock_pcp *stock;
-	unsigned long flags;
 
 	if (WARN_ONCE(!in_task(), "drain in non-task context"))
 		return;
 
-	local_lock_irqsave(&memcg_stock.lock, flags);
+	local_lock(&memcg_stock.lock);
 
 	stock = this_cpu_ptr(&memcg_stock);
 	drain_stock_fully(stock);
 	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
 
-	local_unlock_irqrestore(&memcg_stock.lock, flags);
+	local_unlock(&memcg_stock.lock);
 }
 
 static void drain_local_obj_stock(struct work_struct *dummy)
@@ -1964,7 +1962,6 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 	struct memcg_stock_pcp *stock;
 	struct mem_cgroup *cached;
 	uint8_t stock_pages;
-	unsigned long flags;
 	bool success = false;
 	int empty_slot = -1;
 	int i;
@@ -1979,7 +1976,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 	VM_WARN_ON_ONCE(mem_cgroup_is_root(memcg));
 
 	if (nr_pages > MEMCG_CHARGE_BATCH ||
-	    !local_trylock_irqsave(&memcg_stock.lock, flags)) {
+	    !local_trylock(&memcg_stock.lock)) {
 		/*
 		 * In case of larger than batch refill or unlikely failure to
 		 * lock the percpu memcg_stock.lock, uncharge memcg directly.
@@ -2014,7 +2011,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 		WRITE_ONCE(stock->nr_pages[i], nr_pages);
 	}
 
-	local_unlock_irqrestore(&memcg_stock.lock, flags);
+	local_unlock(&memcg_stock.lock);
 }
 
 static bool is_memcg_drain_needed(struct memcg_stock_pcp *stock,
-- 
2.47.1




* Re: [PATCH v2 3/3] memcg: no irq disable for memcg stock lock
  2025-05-02  0:17 ` [PATCH v2 3/3] memcg: no irq disable for memcg stock lock Shakeel Butt
@ 2025-05-02 18:29   ` Alexei Starovoitov
  2025-05-02 23:03     ` Shakeel Butt
  0 siblings, 1 reply; 12+ messages in thread
From: Alexei Starovoitov @ 2025-05-02 18:29 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song, Alexei Starovoitov, linux-mm,
	open list:CONTROL GROUP (CGROUP), bpf, LKML, Meta kernel team,
	Vlastimil Babka

On Thu, May 1, 2025 at 5:18 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> There is no need to disable irqs to use memcg per-cpu stock, so let's
> just not do that. One consequence of this change is if the kernel while
> in task context has the memcg stock lock and that cpu got interrupted.
> The memcg charges on that cpu in the irq context will take the slow path
> of memcg charging. However that should be super rare and should be fine
> in general.
>
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  mm/memcontrol.c | 17 +++++++----------
>  1 file changed, 7 insertions(+), 10 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index cd81c70d144b..f8b9c7aa6771 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1858,7 +1858,6 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
>  {
>         struct memcg_stock_pcp *stock;
>         uint8_t stock_pages;
> -       unsigned long flags;
>         bool ret = false;
>         int i;
>
> @@ -1866,8 +1865,8 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
>                 return ret;
>
>         if (gfpflags_allow_spinning(gfp_mask))
> -               local_lock_irqsave(&memcg_stock.lock, flags);
> -       else if (!local_trylock_irqsave(&memcg_stock.lock, flags))
> +               local_lock(&memcg_stock.lock);
> +       else if (!local_trylock(&memcg_stock.lock))
>                 return ret;

I don't think it works.
When there is a normal irq and something doing regular GFP_NOWAIT
allocation gfpflags_allow_spinning() will be true and
local_lock() will reenter and complain that lock->acquired is
already set... but only with lockdep on.
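
To spell out the scenario (a sketch of the call chain being described, using
the function names from the patch, not an actual lockdep splat):

	task context, charge with a spinnable gfp mask:
	  consume_stock()
	    local_lock(&memcg_stock.lock)          /* irqs stay enabled now */
	      <hard irq arrives on this cpu>
	        GFP_NOWAIT allocation in the handler charges memcg:
	          consume_stock()
	            gfpflags_allow_spinning() == true
	            local_lock(&memcg_stock.lock)  /* same per-cpu lock: lockdep
	                                              sees ->acquired already set */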



* Re: [PATCH v2 3/3] memcg: no irq disable for memcg stock lock
  2025-05-02 18:29   ` Alexei Starovoitov
@ 2025-05-02 23:03     ` Shakeel Butt
  2025-05-02 23:28       ` Alexei Starovoitov
  2025-05-05 10:28       ` Vlastimil Babka
  0 siblings, 2 replies; 12+ messages in thread
From: Shakeel Butt @ 2025-05-02 23:03 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song, Alexei Starovoitov, linux-mm,
	open list:CONTROL GROUP (CGROUP), bpf, LKML, Meta kernel team,
	Vlastimil Babka

On Fri, May 2, 2025 at 11:29 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, May 1, 2025 at 5:18 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > There is no need to disable irqs to use memcg per-cpu stock, so let's
> > just not do that. One consequence of this change is if the kernel while
> > in task context has the memcg stock lock and that cpu got interrupted.
> > The memcg charges on that cpu in the irq context will take the slow path
> > of memcg charging. However that should be super rare and should be fine
> > in general.
> >
> > Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> > Acked-by: Vlastimil Babka <vbabka@suse.cz>
> > ---
> >  mm/memcontrol.c | 17 +++++++----------
> >  1 file changed, 7 insertions(+), 10 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index cd81c70d144b..f8b9c7aa6771 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1858,7 +1858,6 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
> >  {
> >         struct memcg_stock_pcp *stock;
> >         uint8_t stock_pages;
> > -       unsigned long flags;
> >         bool ret = false;
> >         int i;
> >
> > @@ -1866,8 +1865,8 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
> >                 return ret;
> >
> >         if (gfpflags_allow_spinning(gfp_mask))
> > -               local_lock_irqsave(&memcg_stock.lock, flags);
> > -       else if (!local_trylock_irqsave(&memcg_stock.lock, flags))
> > +               local_lock(&memcg_stock.lock);
> > +       else if (!local_trylock(&memcg_stock.lock))
> >                 return ret;
>
> I don't think it works.
> When there is a normal irq and something doing regular GFP_NOWAIT
> allocation gfpflags_allow_spinning() will be true and
> local_lock() will reenter and complain that lock->acquired is
> already set... but only with lockdep on.

Yes indeed. I dropped the first patch and didn't fix this one
accordingly. I think the fix can be as simple as checking for
in_task() here instead of gfp_mask. That should work for both RT and
non-RT kernels.



* Re: [PATCH v2 3/3] memcg: no irq disable for memcg stock lock
  2025-05-02 23:03     ` Shakeel Butt
@ 2025-05-02 23:28       ` Alexei Starovoitov
  2025-05-02 23:40         ` Shakeel Butt
  2025-05-05 10:28       ` Vlastimil Babka
  1 sibling, 1 reply; 12+ messages in thread
From: Alexei Starovoitov @ 2025-05-02 23:28 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song, Alexei Starovoitov, linux-mm,
	open list:CONTROL GROUP (CGROUP), bpf, LKML, Meta kernel team,
	Vlastimil Babka

On Fri, May 2, 2025 at 4:03 PM Shakeel Butt <shakeel.butt@gmail.com> wrote:
>
> On Fri, May 2, 2025 at 11:29 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Thu, May 1, 2025 at 5:18 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > >
> > > There is no need to disable irqs to use memcg per-cpu stock, so let's
> > > just not do that. One consequence of this change is if the kernel while
> > > in task context has the memcg stock lock and that cpu got interrupted.
> > > The memcg charges on that cpu in the irq context will take the slow path
> > > of memcg charging. However that should be super rare and should be fine
> > > in general.
> > >
> > > Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> > > Acked-by: Vlastimil Babka <vbabka@suse.cz>
> > > ---
> > >  mm/memcontrol.c | 17 +++++++----------
> > >  1 file changed, 7 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index cd81c70d144b..f8b9c7aa6771 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -1858,7 +1858,6 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
> > >  {
> > >         struct memcg_stock_pcp *stock;
> > >         uint8_t stock_pages;
> > > -       unsigned long flags;
> > >         bool ret = false;
> > >         int i;
> > >
> > > @@ -1866,8 +1865,8 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
> > >                 return ret;
> > >
> > >         if (gfpflags_allow_spinning(gfp_mask))
> > > -               local_lock_irqsave(&memcg_stock.lock, flags);
> > > -       else if (!local_trylock_irqsave(&memcg_stock.lock, flags))
> > > +               local_lock(&memcg_stock.lock);
> > > +       else if (!local_trylock(&memcg_stock.lock))
> > >                 return ret;
> >
> > I don't think it works.
> > When there is a normal irq and something doing regular GFP_NOWAIT
> > allocation gfpflags_allow_spinning() will be true and
> > local_lock() will reenter and complain that lock->acquired is
> > already set... but only with lockdep on.
>
> Yes indeed. I dropped the first patch and didn't fix this one
> accordingly. I think the fix can be as simple as checking for
> in_task() here instead of gfp_mask. That should work for both RT and
> non-RT kernels.

Like:
if (in_task())
  local_lock(...);
else if (!local_trylock(...))

Most of the networking runs in bh, so it will be using the
local_trylock() path, which is probably ok in !PREEMPT_RT,
but will cause random performance issues in PREEMPT_RT,
since rt_spin_trylock() will be randomly failing and taking
the slow path of charging. It's not going to cause a permanent
nginx 3x regression :), but unlucky slowdowns will be seen.
A task can grab that per-cpu rt_spin_lock and be preempted
by network processing.
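
Roughly the PREEMPT_RT sequence being described (a sketch; on PREEMPT_RT the
local lock is a per-cpu rt_spin_lock, so the owner can be preempted while
holding it):

	task A (in_task() == true):
	  local_lock(&memcg_stock.lock)       /* takes the per-cpu rt_spin_lock */
	    <preempted by network processing in BH on the same cpu>
	      charge from BH: in_task() == false
	        local_trylock(&memcg_stock.lock) fails  /* owner A is preempted */
	        -> falls back to the slow path of charging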



* Re: [PATCH v2 3/3] memcg: no irq disable for memcg stock lock
  2025-05-02 23:28       ` Alexei Starovoitov
@ 2025-05-02 23:40         ` Shakeel Butt
  2025-05-05  9:06           ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 12+ messages in thread
From: Shakeel Butt @ 2025-05-02 23:40 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song, Alexei Starovoitov, linux-mm,
	open list:CONTROL GROUP (CGROUP), bpf, LKML, Meta kernel team,
	Vlastimil Babka, Sebastian Andrzej Siewior

On Fri, May 2, 2025 at 4:28 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
[...]
> > >
> > > I don't think it works.
> > > When there is a normal irq and something doing regular GFP_NOWAIT
> > > allocation gfpflags_allow_spinning() will be true and
> > > local_lock() will reenter and complain that lock->acquired is
> > > already set... but only with lockdep on.
> >
> > Yes indeed. I dropped the first patch and didn't fix this one
> > accordingly. I think the fix can be as simple as checking for
> > in_task() here instead of gfp_mask. That should work for both RT and
> > non-RT kernels.
>
> Like:
> if (in_task())
>   local_lock(...);
> else if (!local_trylock(...))
>
> Most of the networking runs in bh, so it will be using
> local_trylock() path which is probably ok in !PREEMPT_RT,
> but will cause random performance issues in PREEMP_RT,
> since rt_spin_trylock() will be randomly failing and taking
> slow path of charging. It's not going to cause permanent
> nginx 3x regression :), but unlucky slowdowns will be seen.
> A task can grab that per-cpu rt_spin_lock and preempted
> by network processing.

Does networking run in bh for PREEMPT_RT as well?

I think I should get the networking & RT folks' opinion on this one. I will
decouple this irq patch from the decoupling lock patches and start a
separate discussion thread.



* Re: [PATCH v2 3/3] memcg: no irq disable for memcg stock lock
  2025-05-02 23:40         ` Shakeel Butt
@ 2025-05-05  9:06           ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 12+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-05-05  9:06 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Alexei Starovoitov, Andrew Morton, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, Alexei Starovoitov, linux-mm,
	open list:CONTROL GROUP (CGROUP), bpf, LKML, Meta kernel team,
	Vlastimil Babka

On 2025-05-02 16:40:53 [-0700], Shakeel Butt wrote:
> On Fri, May 2, 2025 at 4:28 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> [...]
> > > >
> > > > I don't think it works.
> > > > When there is a normal irq and something doing regular GFP_NOWAIT
> > > > allocation gfpflags_allow_spinning() will be true and
> > > > local_lock() will reenter and complain that lock->acquired is
> > > > already set... but only with lockdep on.
> > >
> > > Yes indeed. I dropped the first patch and didn't fix this one
> > > accordingly. I think the fix can be as simple as checking for
> > > in_task() here instead of gfp_mask. That should work for both RT and
> > > non-RT kernels.
> >
> > Like:
> > if (in_task())
> >   local_lock(...);
> > else if (!local_trylock(...))
> >
> > Most of the networking runs in bh, so it will be using
> > local_trylock() path which is probably ok in !PREEMPT_RT,
> > but will cause random performance issues in PREEMP_RT,
> > since rt_spin_trylock() will be randomly failing and taking
> > slow path of charging. It's not going to cause permanent
> > nginx 3x regression :), but unlucky slowdowns will be seen.
> > A task can grab that per-cpu rt_spin_lock and preempted
> > by network processing.
> 
> Does networking run in bh for PREEMPT_RT as well?

It does but BH is preemptible.

> I think I should get networking & RT folks opinion on this one. I will
> decouple this irq patch from the decoupling lock patches and start a
> separate discussion thread.

Sebastian



* Re: [PATCH v2 3/3] memcg: no irq disable for memcg stock lock
  2025-05-02 23:03     ` Shakeel Butt
  2025-05-02 23:28       ` Alexei Starovoitov
@ 2025-05-05 10:28       ` Vlastimil Babka
  2025-05-05 17:13         ` Shakeel Butt
  1 sibling, 1 reply; 12+ messages in thread
From: Vlastimil Babka @ 2025-05-05 10:28 UTC (permalink / raw)
  To: Shakeel Butt, Alexei Starovoitov
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song, Alexei Starovoitov, linux-mm,
	open list:CONTROL GROUP (CGROUP), bpf, LKML, Meta kernel team,
	Sebastian Andrzej Siewior

On 5/3/25 01:03, Shakeel Butt wrote:
>> > index cd81c70d144b..f8b9c7aa6771 100644
>> > --- a/mm/memcontrol.c
>> > +++ b/mm/memcontrol.c
>> > @@ -1858,7 +1858,6 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
>> >  {
>> >         struct memcg_stock_pcp *stock;
>> >         uint8_t stock_pages;
>> > -       unsigned long flags;
>> >         bool ret = false;
>> >         int i;
>> >
>> > @@ -1866,8 +1865,8 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
>> >                 return ret;
>> >
>> >         if (gfpflags_allow_spinning(gfp_mask))
>> > -               local_lock_irqsave(&memcg_stock.lock, flags);
>> > -       else if (!local_trylock_irqsave(&memcg_stock.lock, flags))
>> > +               local_lock(&memcg_stock.lock);
>> > +       else if (!local_trylock(&memcg_stock.lock))
>> >                 return ret;
>>
>> I don't think it works.
>> When there is a normal irq and something doing regular GFP_NOWAIT
>> allocation gfpflags_allow_spinning() will be true and
>> local_lock() will reenter and complain that lock->acquired is
>> already set... but only with lockdep on.
> 
> Yes indeed. I dropped the first patch and didn't fix this one
> accordingly. I think the fix can be as simple as checking for
> in_task() here instead of gfp_mask. That should work for both RT and
> non-RT kernels.

These in_task() checks seem hacky to me. I think patch 1 in v1 was the
correct way to use local_trylock() to avoid them.

As for the RT concerns, AFAIK RT isn't about being fast, but about being
preemptible, and the v1 approach didn't violate that - taking the slowpaths
more often shouldn't be an issue.

Let me quote Shakeel's scenario from the v1 thread:

> I didn't really think too much about PREEMPT_RT kernels as I assume
> performance is not top priority but I think I get your point. Let me

Agreed.

> explain and correct me if I am wrong. On PREEMPT_RT kernel, the local
> lock is a spin lock which is actually a mutex but with priority
> inheritance. A task having the local lock can still get context switched

Let's say (seems implied already) this is a low prio task.

> (but will remain on same CPU run queue) and the newer task can try to

And this is a high prio task.

> acquire the memcg stock local lock. If we just do trylock, it will
> always go to the slow path but if we do local_lock() then it will sleeps
> and possibly gives its priority to the task owning the lock and possibly
> make that task to get the CPU. Later the task slept on memcg stock lock
> will wake up and go through fast path.

I think from an RT latency perspective it could very much be better for the
high prio task to just skip the fast path and go for the slowpath, instead
of going to sleep while boosting the low prio task so that the high prio
task can use the fast path later. It's not really a fast path anymore, I'd say.
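
In other words, for a high prio task that hits a stock lock held by a
preempted low prio task, the two options compare roughly like this (a sketch
of the trade-off, not measured numbers):

	local_trylock():  trylock fails -> take the slow path of charging
	                  (a couple of atomic operations, per the reply below)
	                  and keep running
	local_lock():     sleep, boost the low prio owner, wait for it to drop
	                  the lock, wake up, and only then use the per-cpu cache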



* Re: [PATCH v2 3/3] memcg: no irq disable for memcg stock lock
  2025-05-05 10:28       ` Vlastimil Babka
@ 2025-05-05 17:13         ` Shakeel Butt
  2025-05-05 20:49           ` Shakeel Butt
  0 siblings, 1 reply; 12+ messages in thread
From: Shakeel Butt @ 2025-05-05 17:13 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Shakeel Butt, Alexei Starovoitov, Andrew Morton, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Muchun Song, Alexei Starovoitov,
	linux-mm, open list:CONTROL GROUP (CGROUP), bpf, LKML,
	Meta kernel team, Sebastian Andrzej Siewior, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, David S. Miller, netdev

Ccing networking folks.

Background: https://lore.kernel.org/dvyyqubghf67b3qsuoreegqk4qnuuqfkk7plpfhhrck5yeeuic@xbn4c6c7yc42/

On Mon, May 05, 2025 at 12:28:43PM +0200, Vlastimil Babka wrote:
> On 5/3/25 01:03, Shakeel Butt wrote:
> >> > index cd81c70d144b..f8b9c7aa6771 100644
> >> > --- a/mm/memcontrol.c
> >> > +++ b/mm/memcontrol.c
> >> > @@ -1858,7 +1858,6 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
> >> >  {
> >> >         struct memcg_stock_pcp *stock;
> >> >         uint8_t stock_pages;
> >> > -       unsigned long flags;
> >> >         bool ret = false;
> >> >         int i;
> >> >
> >> > @@ -1866,8 +1865,8 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
> >> >                 return ret;
> >> >
> >> >         if (gfpflags_allow_spinning(gfp_mask))
> >> > -               local_lock_irqsave(&memcg_stock.lock, flags);
> >> > -       else if (!local_trylock_irqsave(&memcg_stock.lock, flags))
> >> > +               local_lock(&memcg_stock.lock);
> >> > +       else if (!local_trylock(&memcg_stock.lock))
> >> >                 return ret;
> >>
> >> I don't think it works.
> >> When there is a normal irq and something doing regular GFP_NOWAIT
> >> allocation gfpflags_allow_spinning() will be true and
> >> local_lock() will reenter and complain that lock->acquired is
> >> already set... but only with lockdep on.
> > 
> > Yes indeed. I dropped the first patch and didn't fix this one
> > accordingly. I think the fix can be as simple as checking for
> > in_task() here instead of gfp_mask. That should work for both RT and
> > non-RT kernels.
> 
> These in_task() checks seem hacky to me. I think the patch 1 in v1 was the
> correct way how to use the local_trylock() to avoid these.
> 
> As for the RT concerns, AFAIK RT isn't about being fast, but about being
> preemptible, and the v1 approach didn't violate that - taking the slowpaths
> more often shouldn't be an issue.
> 
> Let me quote Shakeel's scenario from the v1 thread:
> 
> > I didn't really think too much about PREEMPT_RT kernels as I assume
> > performance is not top priority but I think I get your point. Let me
> 
> Agreed.
> 
> > explain and correct me if I am wrong. On PREEMPT_RT kernel, the local
> > lock is a spin lock which is actually a mutex but with priority
> > inheritance. A task having the local lock can still get context switched
> 
> Let's say (seems implied already) this is a low prio task.
> 
> > (but will remain on same CPU run queue) and the newer task can try to
> 
> And this is a high prio task.
> 
> > acquire the memcg stock local lock. If we just do trylock, it will
> > always go to the slow path but if we do local_lock() then it will sleeps
> > and possibly gives its priority to the task owning the lock and possibly
> > make that task to get the CPU. Later the task slept on memcg stock lock
> > will wake up and go through fast path.
> 
> I think from RT latency perspective it could very much be better for the
> high prio task just skip the fast path and go for the slowpath, instead of
> going to sleep while boosting the low prio task to let the high prio task
> use the fast path later. It's not really a fast path anymore I'd say.

Thanks Vlastimil, this is actually a very good point. The slow path of memcg
charging is a couple of atomic operations, while the alternative here is at
least two context switches (and possibly scheduler delay). So it does not
seem like a fast path anymore.

I have cc'ed networking folks to get their take as well. Orthogonally I
will do some netperf benchmarking on v1 with RT kernel.



* Re: [PATCH v2 3/3] memcg: no irq disable for memcg stock lock
  2025-05-05 17:13         ` Shakeel Butt
@ 2025-05-05 20:49           ` Shakeel Butt
  0 siblings, 0 replies; 12+ messages in thread
From: Shakeel Butt @ 2025-05-05 20:49 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Shakeel Butt, Alexei Starovoitov, Andrew Morton, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Muchun Song, Alexei Starovoitov,
	linux-mm, open list:CONTROL GROUP (CGROUP), bpf, LKML,
	Meta kernel team, Sebastian Andrzej Siewior, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, David S. Miller, netdev

On Mon, May 05, 2025 at 10:13:37AM -0700, Shakeel Butt wrote:
> Ccing networking folks.
> 
> Background: https://lore.kernel.org/dvyyqubghf67b3qsuoreegqk4qnuuqfkk7plpfhhrck5yeeuic@xbn4c6c7yc42/
> 
> On Mon, May 05, 2025 at 12:28:43PM +0200, Vlastimil Babka wrote:
> > On 5/3/25 01:03, Shakeel Butt wrote:
> > >> > index cd81c70d144b..f8b9c7aa6771 100644
> > >> > --- a/mm/memcontrol.c
> > >> > +++ b/mm/memcontrol.c
> > >> > @@ -1858,7 +1858,6 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
> > >> >  {
> > >> >         struct memcg_stock_pcp *stock;
> > >> >         uint8_t stock_pages;
> > >> > -       unsigned long flags;
> > >> >         bool ret = false;
> > >> >         int i;
> > >> >
> > >> > @@ -1866,8 +1865,8 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
> > >> >                 return ret;
> > >> >
> > >> >         if (gfpflags_allow_spinning(gfp_mask))
> > >> > -               local_lock_irqsave(&memcg_stock.lock, flags);
> > >> > -       else if (!local_trylock_irqsave(&memcg_stock.lock, flags))
> > >> > +               local_lock(&memcg_stock.lock);
> > >> > +       else if (!local_trylock(&memcg_stock.lock))
> > >> >                 return ret;
> > >>
> > >> I don't think it works.
> > >> When there is a normal irq and something doing regular GFP_NOWAIT
> > >> allocation gfpflags_allow_spinning() will be true and
> > >> local_lock() will reenter and complain that lock->acquired is
> > >> already set... but only with lockdep on.
> > > 
> > > Yes indeed. I dropped the first patch and didn't fix this one
> > > accordingly. I think the fix can be as simple as checking for
> > > in_task() here instead of gfp_mask. That should work for both RT and
> > > non-RT kernels.
> > 
> > These in_task() checks seem hacky to me. I think the patch 1 in v1 was the
> > correct way how to use the local_trylock() to avoid these.
> > 
> > As for the RT concerns, AFAIK RT isn't about being fast, but about being
> > preemptible, and the v1 approach didn't violate that - taking the slowpaths
> > more often shouldn't be an issue.
> > 
> > Let me quote Shakeel's scenario from the v1 thread:
> > 
> > > I didn't really think too much about PREEMPT_RT kernels as I assume
> > > performance is not top priority but I think I get your point. Let me
> > 
> > Agreed.
> > 
> > > explain and correct me if I am wrong. On PREEMPT_RT kernel, the local
> > > lock is a spin lock which is actually a mutex but with priority
> > > inheritance. A task having the local lock can still get context switched
> > 
> > Let's say (seems implied already) this is a low prio task.
> > 
> > > (but will remain on same CPU run queue) and the newer task can try to
> > 
> > And this is a high prio task.
> > 
> > > acquire the memcg stock local lock. If we just do trylock, it will
> > > always go to the slow path but if we do local_lock() then it will sleeps
> > > and possibly gives its priority to the task owning the lock and possibly
> > > make that task to get the CPU. Later the task slept on memcg stock lock
> > > will wake up and go through fast path.
> > 
> > I think from RT latency perspective it could very much be better for the
> > high prio task just skip the fast path and go for the slowpath, instead of
> > going to sleep while boosting the low prio task to let the high prio task
> > use the fast path later. It's not really a fast path anymore I'd say.
> 
> Thanks Vlastimil, this is actually a very good point. Slow path of memcg
> charging is couple of atomic operations while the alternative here is at
> least two context switches (and possibly scheduler delay). So, it does
> not seem like a fast path anymore.
> 
> I have cc'ed networking folks to get their take as well. Orthogonally I
> will do some netperf benchmarking on v1 with RT kernel.

Let me share the results with the PREEMPT_RT config on next-20250505, with
and without v1 of this series.

I ran a varying number of netperf clients in different cgroups on a 72 CPU
machine.

 $ netserver -6
 $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

number of clients | Without series | With series
  6               | 38559.1 Mbps   | 38652.6 Mbps
  12              | 37388.8 Mbps   | 37560.1 Mbps
  18              | 30707.5 Mbps   | 31378.3 Mbps
  24              | 25908.4 Mbps   | 26423.9 Mbps
  30              | 22347.7 Mbps   | 22326.5 Mbps
  36              | 20235.1 Mbps   | 20165.0 Mbps

I don't see any significant performance difference for this network
intensive workload with the series.

I am going to send out v3, which will be a rebased version of v1 with all
these details, unless someone has concerns about this.


