* [PATCH 0/4] memcg: decouple memcg and objcg stocks
@ 2025-04-29 23:04 Shakeel Butt
  2025-04-29 23:04 ` [PATCH 1/4] memcg: simplify consume_stock Shakeel Butt
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Shakeel Butt @ 2025-04-29 23:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
      Sebastian Andrzej Siewior, Vlastimil Babka, linux-mm, cgroups,
      linux-kernel, Meta kernel team

The per-cpu memcg charge cache and objcg charge cache are coupled in a
single struct memcg_stock_pcp, and a single local lock is used to protect
both of the caches. This makes it challenging to make memcg charging and
objcg charging nmi safe. Decoupling the memcg and objcg stocks would allow
us to make each of them nmi safe independently, and even to work without
disabling irqs.

This series completely decouples the memcg and objcg stocks.

Shakeel Butt (4):
  memcg: simplify consume_stock
  memcg: separate local_trylock for memcg and obj
  memcg: completely decouple memcg and obj stocks
  memcg: no irq disable for memcg stock lock

 mm/memcontrol.c | 174 ++++++++++++++++++++++++++++--------------------
 1 file changed, 102 insertions(+), 72 deletions(-)

--
2.47.1

^ permalink raw reply	[flat|nested] 13+ messages in thread
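For orientation, the net effect of the series on the per-cpu data structures
looks roughly like the following, reconstructed from the diffs of patches 2
and 3 below (the FLUSHING_CACHED_CHARGE define and the DEFINE_PER_CPU
initialisers are elided); this is a sketch for readability, not additional
posted code:

/* Before the series: one per-cpu structure, one local lock for both caches */
struct memcg_stock_pcp {
	local_trylock_t stock_lock;		/* protects everything below */
	uint8_t nr_pages[NR_MEMCG_STOCK];	/* memcg page-charge cache */
	struct mem_cgroup *cached[NR_MEMCG_STOCK];

	struct obj_cgroup *cached_objcg;	/* objcg byte-charge cache */
	struct pglist_data *cached_pgdat;
	unsigned int nr_bytes;
	int nr_slab_reclaimable_b;
	int nr_slab_unreclaimable_b;

	struct work_struct work;
	unsigned long flags;
};

/* After patch 3: two per-cpu structures, each with its own local lock */
struct memcg_stock_pcp {
	local_trylock_t lock;
	uint8_t nr_pages[NR_MEMCG_STOCK];
	struct mem_cgroup *cached[NR_MEMCG_STOCK];

	struct work_struct work;
	unsigned long flags;
};

struct obj_stock_pcp {
	local_trylock_t lock;
	unsigned int nr_bytes;
	struct obj_cgroup *cached_objcg;
	struct pglist_data *cached_pgdat;
	int nr_slab_reclaimable_b;
	int nr_slab_unreclaimable_b;

	struct work_struct work;
	unsigned long flags;
};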
* [PATCH 1/4] memcg: simplify consume_stock 2025-04-29 23:04 [PATCH 0/4] memcg: decouple memcg and objcg stocks Shakeel Butt @ 2025-04-29 23:04 ` Shakeel Butt 2025-04-29 23:51 ` Alexei Starovoitov 2025-04-29 23:04 ` [PATCH 2/4] memcg: separate local_trylock for memcg and obj Shakeel Butt ` (2 subsequent siblings) 3 siblings, 1 reply; 13+ messages in thread From: Shakeel Butt @ 2025-04-29 23:04 UTC (permalink / raw) To: Andrew Morton Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song, Sebastian Andrzej Siewior, Vlastimil Babka, linux-mm, cgroups, linux-kernel, Meta kernel team The consume_stock() does not need to check gfp_mask for spinning and can simply trylock the local lock to decide to proceed or fail. No need to spin at all for local lock. Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> --- mm/memcontrol.c | 20 +++++++------------- 1 file changed, 7 insertions(+), 13 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 650fe4314c39..40d0838d88bc 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1804,16 +1804,14 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock, * consume_stock: Try to consume stocked charge on this cpu. * @memcg: memcg to consume from. * @nr_pages: how many pages to charge. - * @gfp_mask: allocation mask. * - * The charges will only happen if @memcg matches the current cpu's memcg - * stock, and at least @nr_pages are available in that stock. Failure to - * service an allocation will refill the stock. + * Consume the cached charge if enough nr_pages are present otherwise return + * failure. Also return failure for charge request larger than + * MEMCG_CHARGE_BATCH or if the local lock is already taken. * * returns true if successful, false otherwise. */ -static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages, - gfp_t gfp_mask) +static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages) { struct memcg_stock_pcp *stock; uint8_t stock_pages; @@ -1821,12 +1819,8 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages, bool ret = false; int i; - if (nr_pages > MEMCG_CHARGE_BATCH) - return ret; - - if (gfpflags_allow_spinning(gfp_mask)) - local_lock_irqsave(&memcg_stock.stock_lock, flags); - else if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags)) + if (nr_pages > MEMCG_CHARGE_BATCH || + !local_trylock_irqsave(&memcg_stock.stock_lock, flags)) return ret; stock = this_cpu_ptr(&memcg_stock); @@ -2329,7 +2323,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, unsigned long pflags; retry: - if (consume_stock(memcg, nr_pages, gfp_mask)) + if (consume_stock(memcg, nr_pages)) return 0; if (!gfpflags_allow_spinning(gfp_mask)) -- 2.47.1 ^ permalink raw reply related [flat|nested] 13+ messages in thread
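Since the hunks above are dense, here is roughly what consume_stock() looks
like once this patch is applied. The lookup loop in the middle is not visible
in the diff and is paraphrased from the surrounding context (the stock_pages
and i declarations and the break/unlock tail), so treat it as a sketch rather
than the exact upstream body:

static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
{
	struct memcg_stock_pcp *stock;
	uint8_t stock_pages;
	unsigned long flags;
	bool ret = false;
	int i;

	/* Never spin: either the percpu lock is free or we take the slow path. */
	if (nr_pages > MEMCG_CHARGE_BATCH ||
	    !local_trylock_irqsave(&memcg_stock.stock_lock, flags))
		return ret;

	stock = this_cpu_ptr(&memcg_stock);

	/* Look up this memcg among the cached entries (paraphrased). */
	for (i = 0; i < NR_MEMCG_STOCK; ++i) {
		if (memcg != READ_ONCE(stock->cached[i]))
			continue;

		stock_pages = READ_ONCE(stock->nr_pages[i]);
		if (stock_pages >= nr_pages) {
			WRITE_ONCE(stock->nr_pages[i], stock_pages - nr_pages);
			ret = true;
		}
		break;
	}

	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);

	return ret;
}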
* Re: [PATCH 1/4] memcg: simplify consume_stock 2025-04-29 23:04 ` [PATCH 1/4] memcg: simplify consume_stock Shakeel Butt @ 2025-04-29 23:51 ` Alexei Starovoitov 2025-04-30 4:37 ` Shakeel Butt 0 siblings, 1 reply; 13+ messages in thread From: Alexei Starovoitov @ 2025-04-29 23:51 UTC (permalink / raw) To: Shakeel Butt Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song, Sebastian Andrzej Siewior, Vlastimil Babka, linux-mm, cgroups, linux-kernel, Meta kernel team, bpf On Tue, Apr 29, 2025 at 04:04:25PM -0700, Shakeel Butt wrote: > The consume_stock() does not need to check gfp_mask for spinning and can > simply trylock the local lock to decide to proceed or fail. No need to > spin at all for local lock. > > Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> > --- > mm/memcontrol.c | 20 +++++++------------- > 1 file changed, 7 insertions(+), 13 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 650fe4314c39..40d0838d88bc 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -1804,16 +1804,14 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock, > * consume_stock: Try to consume stocked charge on this cpu. > * @memcg: memcg to consume from. > * @nr_pages: how many pages to charge. > - * @gfp_mask: allocation mask. > * > - * The charges will only happen if @memcg matches the current cpu's memcg > - * stock, and at least @nr_pages are available in that stock. Failure to > - * service an allocation will refill the stock. > + * Consume the cached charge if enough nr_pages are present otherwise return > + * failure. Also return failure for charge request larger than > + * MEMCG_CHARGE_BATCH or if the local lock is already taken. > * > * returns true if successful, false otherwise. > */ > -static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages, > - gfp_t gfp_mask) > +static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages) > { > struct memcg_stock_pcp *stock; > uint8_t stock_pages; > @@ -1821,12 +1819,8 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages, > bool ret = false; > int i; > > - if (nr_pages > MEMCG_CHARGE_BATCH) > - return ret; > - > - if (gfpflags_allow_spinning(gfp_mask)) > - local_lock_irqsave(&memcg_stock.stock_lock, flags); > - else if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags)) > + if (nr_pages > MEMCG_CHARGE_BATCH || > + !local_trylock_irqsave(&memcg_stock.stock_lock, flags)) I don't think it's a good idea. spin_trylock() will fail often enough in PREEMPT_RT. Even during normal boot I see preemption between tasks and they contend on the same cpu for the same local_lock==spin_lock. Making them take slow path is a significant behavior change that needs to be carefully considered. Also please cc bpf@vger in the future for these kind of changes. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/4] memcg: simplify consume_stock 2025-04-29 23:51 ` Alexei Starovoitov @ 2025-04-30 4:37 ` Shakeel Butt 2025-04-30 15:11 ` Sebastian Andrzej Siewior 0 siblings, 1 reply; 13+ messages in thread From: Shakeel Butt @ 2025-04-30 4:37 UTC (permalink / raw) To: Alexei Starovoitov Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song, Sebastian Andrzej Siewior, Vlastimil Babka, linux-mm, cgroups, linux-kernel, Meta kernel team, bpf On Tue, Apr 29, 2025 at 04:51:03PM -0700, Alexei Starovoitov wrote: > On Tue, Apr 29, 2025 at 04:04:25PM -0700, Shakeel Butt wrote: > > The consume_stock() does not need to check gfp_mask for spinning and can > > simply trylock the local lock to decide to proceed or fail. No need to > > spin at all for local lock. > > > > Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> > > --- > > mm/memcontrol.c | 20 +++++++------------- > > 1 file changed, 7 insertions(+), 13 deletions(-) > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index 650fe4314c39..40d0838d88bc 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -1804,16 +1804,14 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock, > > * consume_stock: Try to consume stocked charge on this cpu. > > * @memcg: memcg to consume from. > > * @nr_pages: how many pages to charge. > > - * @gfp_mask: allocation mask. > > * > > - * The charges will only happen if @memcg matches the current cpu's memcg > > - * stock, and at least @nr_pages are available in that stock. Failure to > > - * service an allocation will refill the stock. > > + * Consume the cached charge if enough nr_pages are present otherwise return > > + * failure. Also return failure for charge request larger than > > + * MEMCG_CHARGE_BATCH or if the local lock is already taken. > > * > > * returns true if successful, false otherwise. > > */ > > -static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages, > > - gfp_t gfp_mask) > > +static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages) > > { > > struct memcg_stock_pcp *stock; > > uint8_t stock_pages; > > @@ -1821,12 +1819,8 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages, > > bool ret = false; > > int i; > > > > - if (nr_pages > MEMCG_CHARGE_BATCH) > > - return ret; > > - > > - if (gfpflags_allow_spinning(gfp_mask)) > > - local_lock_irqsave(&memcg_stock.stock_lock, flags); > > - else if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags)) > > + if (nr_pages > MEMCG_CHARGE_BATCH || > > + !local_trylock_irqsave(&memcg_stock.stock_lock, flags)) > > I don't think it's a good idea. > spin_trylock() will fail often enough in PREEMPT_RT. > Even during normal boot I see preemption between tasks and they > contend on the same cpu for the same local_lock==spin_lock. > Making them take slow path is a significant behavior change > that needs to be carefully considered. I didn't really think too much about PREEMPT_RT kernels as I assume performance is not top priority but I think I get your point. Let me explain and correct me if I am wrong. On PREEMPT_RT kernel, the local lock is a spin lock which is actually a mutex but with priority inheritance. A task having the local lock can still get context switched (but will remain on same CPU run queue) and the newer task can try to acquire the memcg stock local lock. 
If we just do trylock, it will always go to the slow path but if we do local_lock() then it will sleeps and possibly gives its priority to the task owning the lock and possibly make that task to get the CPU. Later the task slept on memcg stock lock will wake up and go through fast path. Ok, I will drop the first patch. Please let me know your comments on the remaining series. > > Also please cc bpf@vger in the future for these kind of changes. Sounds good. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/4] memcg: simplify consume_stock 2025-04-30 4:37 ` Shakeel Butt @ 2025-04-30 15:11 ` Sebastian Andrzej Siewior 0 siblings, 0 replies; 13+ messages in thread From: Sebastian Andrzej Siewior @ 2025-04-30 15:11 UTC (permalink / raw) To: Shakeel Butt Cc: Alexei Starovoitov, Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song, Vlastimil Babka, linux-mm, cgroups, linux-kernel, Meta kernel team, bpf On 2025-04-29 21:37:26 [-0700], Shakeel Butt wrote: > > > - if (gfpflags_allow_spinning(gfp_mask)) > > > - local_lock_irqsave(&memcg_stock.stock_lock, flags); > > > - else if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags)) > > > + if (nr_pages > MEMCG_CHARGE_BATCH || > > > + !local_trylock_irqsave(&memcg_stock.stock_lock, flags)) > > > > I don't think it's a good idea. > > spin_trylock() will fail often enough in PREEMPT_RT. > > Even during normal boot I see preemption between tasks and they > > contend on the same cpu for the same local_lock==spin_lock. > > Making them take slow path is a significant behavior change > > that needs to be carefully considered. > > I didn't really think too much about PREEMPT_RT kernels as I assume > performance is not top priority but I think I get your point. Let me Not sure if this is performance nor simply failing to allocate memory. > explain and correct me if I am wrong. On PREEMPT_RT kernel, the local > lock is a spin lock which is actually a mutex but with priority > inheritance. A task having the local lock can still get context switched > (but will remain on same CPU run queue) and the newer task can try to > acquire the memcg stock local lock. If we just do trylock, it will > always go to the slow path but if we do local_lock() then it will sleeps > and possibly gives its priority to the task owning the lock and possibly > make that task to get the CPU. Later the task slept on memcg stock lock > will wake up and go through fast path. So far correct. On PREEMPT_RT a task with spinlock_t or local_lock_t can get preempted while owning the lock. The local_lock_t is a per-CPU lock. Sebastian ^ permalink raw reply [flat|nested] 13+ messages in thread
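To make the behaviour under discussion easier to see outside the kernel, here
is a small userspace analogy (not kernel code; the file name and all symbols
are made up for illustration). A priority-inheritance pthread mutex stands in
for the PREEMPT_RT local_lock, which is a sleeping lock with PI: a trylock
fails while another task on the "CPU" owns the lock and the caller must take
the slow path, whereas a blocking lock sleeps, lends its priority to the
owner, and then runs the fast path once the lock is released, which is the
trade-off Alexei and Shakeel describe above. Build with: cc -O2 -pthread
pi_lock_demo.c

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t stock_lock;

/* Stand-in for the preempted task that currently owns the per-CPU lock. */
static void *lock_holder(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&stock_lock);
	usleep(200 * 1000);		/* pretend to update the per-cpu stock */
	pthread_mutex_unlock(&stock_lock);
	return NULL;
}

int main(void)
{
	pthread_mutexattr_t attr;
	pthread_t holder;

	pthread_mutexattr_init(&attr);
	/* Rough stand-in for PREEMPT_RT: sleeping lock with priority inheritance. */
	pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
	pthread_mutex_init(&stock_lock, &attr);

	pthread_create(&holder, NULL, lock_holder, NULL);
	usleep(50 * 1000);		/* let the "preempted owner" grab the lock */

	/* Patch 1 behaviour: trylock; on contention give up and use the slow path. */
	if (pthread_mutex_trylock(&stock_lock) == 0) {
		puts("trylock: lock was free, fast path");
		pthread_mutex_unlock(&stock_lock);
	} else {
		puts("trylock: contended, would fall back to the memcg slow path");
	}

	/* Pre-patch behaviour for spinnable contexts: block (boosting the owner
	 * via PI), then use the fast path once the lock is released. */
	pthread_mutex_lock(&stock_lock);
	puts("blocking lock: waited for the owner, fast path afterwards");
	pthread_mutex_unlock(&stock_lock);

	pthread_join(holder, NULL);
	pthread_mutex_destroy(&stock_lock);
	pthread_mutexattr_destroy(&attr);
	return 0;
}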
* [PATCH 2/4] memcg: separate local_trylock for memcg and obj 2025-04-29 23:04 [PATCH 0/4] memcg: decouple memcg and objcg stocks Shakeel Butt 2025-04-29 23:04 ` [PATCH 1/4] memcg: simplify consume_stock Shakeel Butt @ 2025-04-29 23:04 ` Shakeel Butt 2025-04-30 11:42 ` Vlastimil Babka 2025-04-29 23:04 ` [PATCH 3/4] memcg: completely decouple memcg and obj stocks Shakeel Butt 2025-04-29 23:04 ` [PATCH 4/4] memcg: no irq disable for memcg stock lock Shakeel Butt 3 siblings, 1 reply; 13+ messages in thread From: Shakeel Butt @ 2025-04-29 23:04 UTC (permalink / raw) To: Andrew Morton Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song, Sebastian Andrzej Siewior, Vlastimil Babka, linux-mm, cgroups, linux-kernel, Meta kernel team The per-cpu stock_lock protects cached memcg and cached objcg and their respective fields. However there is no dependency between these fields and it is better to have fine grained separate locks for cached memcg and cached objcg. This decoupling of locks allows us to make the memcg charge cache and objcg charge cache to be nmi safe independently. At the moment, memcg charge cache is already nmi safe and this decoupling will allow to make memcg charge cache work without disabling irqs. Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> --- mm/memcontrol.c | 52 +++++++++++++++++++++++++++---------------------- 1 file changed, 29 insertions(+), 23 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 40d0838d88bc..460634e8435f 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1777,13 +1777,14 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg) #define NR_MEMCG_STOCK 7 struct memcg_stock_pcp { - local_trylock_t stock_lock; + local_trylock_t memcg_lock; uint8_t nr_pages[NR_MEMCG_STOCK]; struct mem_cgroup *cached[NR_MEMCG_STOCK]; + local_trylock_t obj_lock; + unsigned int nr_bytes; struct obj_cgroup *cached_objcg; struct pglist_data *cached_pgdat; - unsigned int nr_bytes; int nr_slab_reclaimable_b; int nr_slab_unreclaimable_b; @@ -1792,7 +1793,8 @@ struct memcg_stock_pcp { #define FLUSHING_CACHED_CHARGE 0 }; static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock) = { - .stock_lock = INIT_LOCAL_TRYLOCK(stock_lock), + .memcg_lock = INIT_LOCAL_TRYLOCK(memcg_lock), + .obj_lock = INIT_LOCAL_TRYLOCK(obj_lock), }; static DEFINE_MUTEX(percpu_charge_mutex); @@ -1820,7 +1822,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages) int i; if (nr_pages > MEMCG_CHARGE_BATCH || - !local_trylock_irqsave(&memcg_stock.stock_lock, flags)) + !local_trylock_irqsave(&memcg_stock.memcg_lock, flags)) return ret; stock = this_cpu_ptr(&memcg_stock); @@ -1837,7 +1839,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages) break; } - local_unlock_irqrestore(&memcg_stock.stock_lock, flags); + local_unlock_irqrestore(&memcg_stock.memcg_lock, flags); return ret; } @@ -1883,19 +1885,22 @@ static void drain_local_stock(struct work_struct *dummy) struct memcg_stock_pcp *stock; unsigned long flags; - /* - * The only protection from cpu hotplug (memcg_hotplug_cpu_dead) vs. 
- * drain_stock races is that we always operate on local CPU stock - * here with IRQ disabled - */ - local_lock_irqsave(&memcg_stock.stock_lock, flags); + if (WARN_ONCE(!in_task(), "drain in non-task context")) + return; + preempt_disable(); stock = this_cpu_ptr(&memcg_stock); + + local_lock_irqsave(&memcg_stock.obj_lock, flags); drain_obj_stock(stock); + local_unlock_irqrestore(&memcg_stock.obj_lock, flags); + + local_lock_irqsave(&memcg_stock.memcg_lock, flags); drain_stock_fully(stock); - clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags); + local_unlock_irqrestore(&memcg_stock.memcg_lock, flags); - local_unlock_irqrestore(&memcg_stock.stock_lock, flags); + clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags); + preempt_enable(); } static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) @@ -1918,10 +1923,10 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) VM_WARN_ON_ONCE(mem_cgroup_is_root(memcg)); if (nr_pages > MEMCG_CHARGE_BATCH || - !local_trylock_irqsave(&memcg_stock.stock_lock, flags)) { + !local_trylock_irqsave(&memcg_stock.memcg_lock, flags)) { /* * In case of larger than batch refill or unlikely failure to - * lock the percpu stock_lock, uncharge memcg directly. + * lock the percpu memcg_lock, uncharge memcg directly. */ memcg_uncharge(memcg, nr_pages); return; @@ -1953,7 +1958,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) WRITE_ONCE(stock->nr_pages[i], nr_pages); } - local_unlock_irqrestore(&memcg_stock.stock_lock, flags); + local_unlock_irqrestore(&memcg_stock.memcg_lock, flags); } static bool is_drain_needed(struct memcg_stock_pcp *stock, @@ -2028,11 +2033,12 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu) stock = &per_cpu(memcg_stock, cpu); - /* drain_obj_stock requires stock_lock */ - local_lock_irqsave(&memcg_stock.stock_lock, flags); + /* drain_obj_stock requires obj_lock */ + local_lock_irqsave(&memcg_stock.obj_lock, flags); drain_obj_stock(stock); - local_unlock_irqrestore(&memcg_stock.stock_lock, flags); + local_unlock_irqrestore(&memcg_stock.obj_lock, flags); + /* no need for the local lock */ drain_stock_fully(stock); return 0; @@ -2885,7 +2891,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, unsigned long flags; bool ret = false; - local_lock_irqsave(&memcg_stock.stock_lock, flags); + local_lock_irqsave(&memcg_stock.obj_lock, flags); stock = this_cpu_ptr(&memcg_stock); if (objcg == READ_ONCE(stock->cached_objcg) && stock->nr_bytes >= nr_bytes) { @@ -2896,7 +2902,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, __account_obj_stock(objcg, stock, nr_bytes, pgdat, idx); } - local_unlock_irqrestore(&memcg_stock.stock_lock, flags); + local_unlock_irqrestore(&memcg_stock.obj_lock, flags); return ret; } @@ -2985,7 +2991,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, unsigned long flags; unsigned int nr_pages = 0; - local_lock_irqsave(&memcg_stock.stock_lock, flags); + local_lock_irqsave(&memcg_stock.obj_lock, flags); stock = this_cpu_ptr(&memcg_stock); if (READ_ONCE(stock->cached_objcg) != objcg) { /* reset if necessary */ @@ -3007,7 +3013,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, stock->nr_bytes &= (PAGE_SIZE - 1); } - local_unlock_irqrestore(&memcg_stock.stock_lock, flags); + local_unlock_irqrestore(&memcg_stock.obj_lock, flags); if (nr_pages) obj_cgroup_uncharge_pages(objcg, nr_pages); -- 2.47.1 ^ permalink raw reply related [flat|nested] 13+ 
messages in thread
* Re: [PATCH 2/4] memcg: separate local_trylock for memcg and obj 2025-04-29 23:04 ` [PATCH 2/4] memcg: separate local_trylock for memcg and obj Shakeel Butt @ 2025-04-30 11:42 ` Vlastimil Babka 2025-04-30 15:03 ` Shakeel Butt 0 siblings, 1 reply; 13+ messages in thread From: Vlastimil Babka @ 2025-04-30 11:42 UTC (permalink / raw) To: Shakeel Butt, Andrew Morton Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song, Sebastian Andrzej Siewior, linux-mm, cgroups, linux-kernel, Meta kernel team, bpf On 4/30/25 01:04, Shakeel Butt wrote: > The per-cpu stock_lock protects cached memcg and cached objcg and their > respective fields. However there is no dependency between these fields > and it is better to have fine grained separate locks for cached memcg > and cached objcg. This decoupling of locks allows us to make the memcg > charge cache and objcg charge cache to be nmi safe independently. > > At the moment, memcg charge cache is already nmi safe and this > decoupling will allow to make memcg charge cache work without disabling > irqs. > > Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> > --- > mm/memcontrol.c | 52 +++++++++++++++++++++++++++---------------------- > 1 file changed, 29 insertions(+), 23 deletions(-) > @@ -1883,19 +1885,22 @@ static void drain_local_stock(struct work_struct *dummy) > struct memcg_stock_pcp *stock; > unsigned long flags; > > - /* > - * The only protection from cpu hotplug (memcg_hotplug_cpu_dead) vs. > - * drain_stock races is that we always operate on local CPU stock > - * here with IRQ disabled > - */ > - local_lock_irqsave(&memcg_stock.stock_lock, flags); > + if (WARN_ONCE(!in_task(), "drain in non-task context")) > + return; > > + preempt_disable(); > stock = this_cpu_ptr(&memcg_stock); > + > + local_lock_irqsave(&memcg_stock.obj_lock, flags); > drain_obj_stock(stock); > + local_unlock_irqrestore(&memcg_stock.obj_lock, flags); > + > + local_lock_irqsave(&memcg_stock.memcg_lock, flags); > drain_stock_fully(stock); > - clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags); > + local_unlock_irqrestore(&memcg_stock.memcg_lock, flags); > > - local_unlock_irqrestore(&memcg_stock.stock_lock, flags); > + clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags); > + preempt_enable(); This usage of preempt_disable() looks rather weird and makes RT unhappy as the local lock is a mutex, so it gives you this: BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48 I know the next patch removes it again but for bisectability purposes it should be avoided. Instead of preempt_disable() we can extend the local lock scope here? > } > > static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) > @@ -1918,10 +1923,10 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) > VM_WARN_ON_ONCE(mem_cgroup_is_root(memcg)); > > if (nr_pages > MEMCG_CHARGE_BATCH || > - !local_trylock_irqsave(&memcg_stock.stock_lock, flags)) { > + !local_trylock_irqsave(&memcg_stock.memcg_lock, flags)) { > /* > * In case of larger than batch refill or unlikely failure to > - * lock the percpu stock_lock, uncharge memcg directly. > + * lock the percpu memcg_lock, uncharge memcg directly. 
> */ > memcg_uncharge(memcg, nr_pages); > return; > @@ -1953,7 +1958,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) > WRITE_ONCE(stock->nr_pages[i], nr_pages); > } > > - local_unlock_irqrestore(&memcg_stock.stock_lock, flags); > + local_unlock_irqrestore(&memcg_stock.memcg_lock, flags); > } > > static bool is_drain_needed(struct memcg_stock_pcp *stock, > @@ -2028,11 +2033,12 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu) > > stock = &per_cpu(memcg_stock, cpu); > > - /* drain_obj_stock requires stock_lock */ > - local_lock_irqsave(&memcg_stock.stock_lock, flags); > + /* drain_obj_stock requires obj_lock */ > + local_lock_irqsave(&memcg_stock.obj_lock, flags); > drain_obj_stock(stock); > - local_unlock_irqrestore(&memcg_stock.stock_lock, flags); > + local_unlock_irqrestore(&memcg_stock.obj_lock, flags); > > + /* no need for the local lock */ > drain_stock_fully(stock); > > return 0; > @@ -2885,7 +2891,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, > unsigned long flags; > bool ret = false; > > - local_lock_irqsave(&memcg_stock.stock_lock, flags); > + local_lock_irqsave(&memcg_stock.obj_lock, flags); > > stock = this_cpu_ptr(&memcg_stock); > if (objcg == READ_ONCE(stock->cached_objcg) && stock->nr_bytes >= nr_bytes) { > @@ -2896,7 +2902,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, > __account_obj_stock(objcg, stock, nr_bytes, pgdat, idx); > } > > - local_unlock_irqrestore(&memcg_stock.stock_lock, flags); > + local_unlock_irqrestore(&memcg_stock.obj_lock, flags); > > return ret; > } > @@ -2985,7 +2991,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, > unsigned long flags; > unsigned int nr_pages = 0; > > - local_lock_irqsave(&memcg_stock.stock_lock, flags); > + local_lock_irqsave(&memcg_stock.obj_lock, flags); > > stock = this_cpu_ptr(&memcg_stock); > if (READ_ONCE(stock->cached_objcg) != objcg) { /* reset if necessary */ > @@ -3007,7 +3013,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, > stock->nr_bytes &= (PAGE_SIZE - 1); > } > > - local_unlock_irqrestore(&memcg_stock.stock_lock, flags); > + local_unlock_irqrestore(&memcg_stock.obj_lock, flags); > > if (nr_pages) > obj_cgroup_uncharge_pages(objcg, nr_pages); ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 2/4] memcg: separate local_trylock for memcg and obj 2025-04-30 11:42 ` Vlastimil Babka @ 2025-04-30 15:03 ` Shakeel Butt 0 siblings, 0 replies; 13+ messages in thread From: Shakeel Butt @ 2025-04-30 15:03 UTC (permalink / raw) To: Vlastimil Babka Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song, Sebastian Andrzej Siewior, linux-mm, cgroups, linux-kernel, Meta kernel team, bpf On Wed, Apr 30, 2025 at 01:42:47PM +0200, Vlastimil Babka wrote: > On 4/30/25 01:04, Shakeel Butt wrote: > > The per-cpu stock_lock protects cached memcg and cached objcg and their > > respective fields. However there is no dependency between these fields > > and it is better to have fine grained separate locks for cached memcg > > and cached objcg. This decoupling of locks allows us to make the memcg > > charge cache and objcg charge cache to be nmi safe independently. > > > > At the moment, memcg charge cache is already nmi safe and this > > decoupling will allow to make memcg charge cache work without disabling > > irqs. > > > > Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> > > --- > > mm/memcontrol.c | 52 +++++++++++++++++++++++++++---------------------- > > 1 file changed, 29 insertions(+), 23 deletions(-) > > > @@ -1883,19 +1885,22 @@ static void drain_local_stock(struct work_struct *dummy) > > struct memcg_stock_pcp *stock; > > unsigned long flags; > > > > - /* > > - * The only protection from cpu hotplug (memcg_hotplug_cpu_dead) vs. > > - * drain_stock races is that we always operate on local CPU stock > > - * here with IRQ disabled > > - */ > > - local_lock_irqsave(&memcg_stock.stock_lock, flags); > > + if (WARN_ONCE(!in_task(), "drain in non-task context")) > > + return; > > > > + preempt_disable(); > > stock = this_cpu_ptr(&memcg_stock); > > + > > + local_lock_irqsave(&memcg_stock.obj_lock, flags); > > drain_obj_stock(stock); > > + local_unlock_irqrestore(&memcg_stock.obj_lock, flags); > > + > > + local_lock_irqsave(&memcg_stock.memcg_lock, flags); > > drain_stock_fully(stock); > > - clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags); > > + local_unlock_irqrestore(&memcg_stock.memcg_lock, flags); > > > > - local_unlock_irqrestore(&memcg_stock.stock_lock, flags); > > + clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags); > > + preempt_enable(); > > This usage of preempt_disable() looks rather weird and makes RT unhappy as > the local lock is a mutex, so it gives you this: > > BUG: sleeping function called from invalid context at > kernel/locking/spinlock_rt.c:48 > > I know the next patch removes it again but for bisectability purposes it > should be avoided. Instead of preempt_disable() we can extend the local lock > scope here? > Indeed and thanks for the suggestion, will fix in v2. ^ permalink raw reply [flat|nested] 13+ messages in thread
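For reference, one way to read Vlastimil's suggestion for v2 is to drop
preempt_disable() entirely and let each local lock cover its own per-cpu
access. The sketch below is consistent with the pre-patch-3 code visible in
the v3 diff later in the thread, but it is an illustration, not the posted v2
patch:

static void drain_local_stock(struct work_struct *dummy)
{
	struct memcg_stock_pcp *stock;
	unsigned long flags;

	if (WARN_ONCE(!in_task(), "drain in non-task context"))
		return;

	/* obj_lock keeps us on this CPU while draining the objcg cache */
	local_lock_irqsave(&memcg_stock.obj_lock, flags);
	stock = this_cpu_ptr(&memcg_stock);
	drain_obj_stock(stock);
	local_unlock_irqrestore(&memcg_stock.obj_lock, flags);

	/* memcg_lock does the same for the memcg cache and the flag */
	local_lock_irqsave(&memcg_stock.memcg_lock, flags);
	stock = this_cpu_ptr(&memcg_stock);
	drain_stock_fully(stock);
	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
	local_unlock_irqrestore(&memcg_stock.memcg_lock, flags);
}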
* [PATCH 3/4] memcg: completely decouple memcg and obj stocks 2025-04-29 23:04 [PATCH 0/4] memcg: decouple memcg and objcg stocks Shakeel Butt 2025-04-29 23:04 ` [PATCH 1/4] memcg: simplify consume_stock Shakeel Butt 2025-04-29 23:04 ` [PATCH 2/4] memcg: separate local_trylock for memcg and obj Shakeel Butt @ 2025-04-29 23:04 ` Shakeel Butt 2025-04-30 12:21 ` Vlastimil Babka 2025-04-29 23:04 ` [PATCH 4/4] memcg: no irq disable for memcg stock lock Shakeel Butt 3 siblings, 1 reply; 13+ messages in thread From: Shakeel Butt @ 2025-04-29 23:04 UTC (permalink / raw) To: Andrew Morton Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song, Sebastian Andrzej Siewior, Vlastimil Babka, linux-mm, cgroups, linux-kernel, Meta kernel team Let's completely decouple the memcg and obj per-cpu stocks. This will enable us to make memcg per-cpu stocks to used without disabling irqs. Also it will enable us to make obj stocks nmi safe independently which is required to make kmalloc/slab safe for allocations from nmi context. Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> --- mm/memcontrol.c | 150 +++++++++++++++++++++++++++++------------------- 1 file changed, 91 insertions(+), 59 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 460634e8435f..8f31b35ddcb3 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1775,13 +1775,23 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg) pr_cont(" are going to be killed due to memory.oom.group set\n"); } +#define FLUSHING_CACHED_CHARGE 0 #define NR_MEMCG_STOCK 7 struct memcg_stock_pcp { - local_trylock_t memcg_lock; + local_trylock_t lock; uint8_t nr_pages[NR_MEMCG_STOCK]; struct mem_cgroup *cached[NR_MEMCG_STOCK]; - local_trylock_t obj_lock; + struct work_struct work; + unsigned long flags; +}; + +static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock) = { + .lock = INIT_LOCAL_TRYLOCK(lock), +}; + +struct obj_stock_pcp { + local_trylock_t lock; unsigned int nr_bytes; struct obj_cgroup *cached_objcg; struct pglist_data *cached_pgdat; @@ -1790,16 +1800,16 @@ struct memcg_stock_pcp { struct work_struct work; unsigned long flags; -#define FLUSHING_CACHED_CHARGE 0 }; -static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock) = { - .memcg_lock = INIT_LOCAL_TRYLOCK(memcg_lock), - .obj_lock = INIT_LOCAL_TRYLOCK(obj_lock), + +static DEFINE_PER_CPU(struct obj_stock_pcp, obj_stock) = { + .lock = INIT_LOCAL_TRYLOCK(lock), }; + static DEFINE_MUTEX(percpu_charge_mutex); -static void drain_obj_stock(struct memcg_stock_pcp *stock); -static bool obj_stock_flush_required(struct memcg_stock_pcp *stock, +static void drain_obj_stock(struct obj_stock_pcp *stock); +static bool obj_stock_flush_required(struct obj_stock_pcp *stock, struct mem_cgroup *root_memcg); /** @@ -1822,7 +1832,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages) int i; if (nr_pages > MEMCG_CHARGE_BATCH || - !local_trylock_irqsave(&memcg_stock.memcg_lock, flags)) + !local_trylock_irqsave(&memcg_stock.lock, flags)) return ret; stock = this_cpu_ptr(&memcg_stock); @@ -1839,7 +1849,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages) break; } - local_unlock_irqrestore(&memcg_stock.memcg_lock, flags); + local_unlock_irqrestore(&memcg_stock.lock, flags); return ret; } @@ -1880,7 +1890,7 @@ static void drain_stock_fully(struct memcg_stock_pcp *stock) drain_stock(stock, i); } -static void drain_local_stock(struct work_struct *dummy) +static void drain_local_memcg_stock(struct work_struct *dummy) { struct memcg_stock_pcp *stock; 
unsigned long flags; @@ -1888,19 +1898,30 @@ static void drain_local_stock(struct work_struct *dummy) if (WARN_ONCE(!in_task(), "drain in non-task context")) return; - preempt_disable(); + local_lock_irqsave(&memcg_stock.lock, flags); + stock = this_cpu_ptr(&memcg_stock); + drain_stock_fully(stock); + clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags); - local_lock_irqsave(&memcg_stock.obj_lock, flags); - drain_obj_stock(stock); - local_unlock_irqrestore(&memcg_stock.obj_lock, flags); + local_unlock_irqrestore(&memcg_stock.lock, flags); +} - local_lock_irqsave(&memcg_stock.memcg_lock, flags); - drain_stock_fully(stock); - local_unlock_irqrestore(&memcg_stock.memcg_lock, flags); +static void drain_local_obj_stock(struct work_struct *dummy) +{ + struct obj_stock_pcp *stock; + unsigned long flags; + if (WARN_ONCE(!in_task(), "drain in non-task context")) + return; + + local_lock_irqsave(&obj_stock.lock, flags); + + stock = this_cpu_ptr(&obj_stock); + drain_obj_stock(stock); clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags); - preempt_enable(); + + local_unlock_irqrestore(&obj_stock.lock, flags); } static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) @@ -1923,10 +1944,10 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) VM_WARN_ON_ONCE(mem_cgroup_is_root(memcg)); if (nr_pages > MEMCG_CHARGE_BATCH || - !local_trylock_irqsave(&memcg_stock.memcg_lock, flags)) { + !local_trylock_irqsave(&memcg_stock.lock, flags)) { /* * In case of larger than batch refill or unlikely failure to - * lock the percpu memcg_lock, uncharge memcg directly. + * lock the percpu memcg_stock.lock, uncharge memcg directly. */ memcg_uncharge(memcg, nr_pages); return; @@ -1958,23 +1979,17 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) WRITE_ONCE(stock->nr_pages[i], nr_pages); } - local_unlock_irqrestore(&memcg_stock.memcg_lock, flags); + local_unlock_irqrestore(&memcg_stock.lock, flags); } -static bool is_drain_needed(struct memcg_stock_pcp *stock, - struct mem_cgroup *root_memcg) +static bool is_memcg_drain_needed(struct memcg_stock_pcp *stock, + struct mem_cgroup *root_memcg) { struct mem_cgroup *memcg; bool flush = false; int i; rcu_read_lock(); - - if (obj_stock_flush_required(stock, root_memcg)) { - flush = true; - goto out; - } - for (i = 0; i < NR_MEMCG_STOCK; ++i) { memcg = READ_ONCE(stock->cached[i]); if (!memcg) @@ -1986,7 +2001,6 @@ static bool is_drain_needed(struct memcg_stock_pcp *stock, break; } } -out: rcu_read_unlock(); return flush; } @@ -2011,15 +2025,27 @@ void drain_all_stock(struct mem_cgroup *root_memcg) migrate_disable(); curcpu = smp_processor_id(); for_each_online_cpu(cpu) { - struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu); - bool flush = is_drain_needed(stock, root_memcg); + struct memcg_stock_pcp *memcg_st = &per_cpu(memcg_stock, cpu); + struct obj_stock_pcp *obj_st = &per_cpu(obj_stock, cpu); - if (flush && - !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) { + if (!test_bit(FLUSHING_CACHED_CHARGE, &memcg_st->flags) && + is_memcg_drain_needed(memcg_st, root_memcg) && + !test_and_set_bit(FLUSHING_CACHED_CHARGE, + &memcg_st->flags)) { if (cpu == curcpu) - drain_local_stock(&stock->work); + drain_local_memcg_stock(&memcg_st->work); else if (!cpu_is_isolated(cpu)) - schedule_work_on(cpu, &stock->work); + schedule_work_on(cpu, &memcg_st->work); + } + + if (!test_bit(FLUSHING_CACHED_CHARGE, &obj_st->flags) && + obj_stock_flush_required(obj_st, root_memcg) && + !test_and_set_bit(FLUSHING_CACHED_CHARGE, + 
&obj_st->flags)) { + if (cpu == curcpu) + drain_local_obj_stock(&obj_st->work); + else if (!cpu_is_isolated(cpu)) + schedule_work_on(cpu, &obj_st->work); } } migrate_enable(); @@ -2028,18 +2054,18 @@ void drain_all_stock(struct mem_cgroup *root_memcg) static int memcg_hotplug_cpu_dead(unsigned int cpu) { - struct memcg_stock_pcp *stock; + struct obj_stock_pcp *obj_st; unsigned long flags; - stock = &per_cpu(memcg_stock, cpu); + obj_st = &per_cpu(obj_stock, cpu); - /* drain_obj_stock requires obj_lock */ - local_lock_irqsave(&memcg_stock.obj_lock, flags); - drain_obj_stock(stock); - local_unlock_irqrestore(&memcg_stock.obj_lock, flags); + /* drain_obj_stock requires objstock.lock */ + local_lock_irqsave(&obj_stock.lock, flags); + drain_obj_stock(obj_st); + local_unlock_irqrestore(&obj_stock.lock, flags); /* no need for the local lock */ - drain_stock_fully(stock); + drain_stock_fully(&per_cpu(memcg_stock, cpu)); return 0; } @@ -2836,7 +2862,7 @@ void __memcg_kmem_uncharge_page(struct page *page, int order) } static void __account_obj_stock(struct obj_cgroup *objcg, - struct memcg_stock_pcp *stock, int nr, + struct obj_stock_pcp *stock, int nr, struct pglist_data *pgdat, enum node_stat_item idx) { int *bytes; @@ -2887,13 +2913,13 @@ static void __account_obj_stock(struct obj_cgroup *objcg, static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, struct pglist_data *pgdat, enum node_stat_item idx) { - struct memcg_stock_pcp *stock; + struct obj_stock_pcp *stock; unsigned long flags; bool ret = false; - local_lock_irqsave(&memcg_stock.obj_lock, flags); + local_lock_irqsave(&obj_stock.lock, flags); - stock = this_cpu_ptr(&memcg_stock); + stock = this_cpu_ptr(&obj_stock); if (objcg == READ_ONCE(stock->cached_objcg) && stock->nr_bytes >= nr_bytes) { stock->nr_bytes -= nr_bytes; ret = true; @@ -2902,12 +2928,12 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, __account_obj_stock(objcg, stock, nr_bytes, pgdat, idx); } - local_unlock_irqrestore(&memcg_stock.obj_lock, flags); + local_unlock_irqrestore(&obj_stock.lock, flags); return ret; } -static void drain_obj_stock(struct memcg_stock_pcp *stock) +static void drain_obj_stock(struct obj_stock_pcp *stock) { struct obj_cgroup *old = READ_ONCE(stock->cached_objcg); @@ -2968,32 +2994,35 @@ static void drain_obj_stock(struct memcg_stock_pcp *stock) obj_cgroup_put(old); } -static bool obj_stock_flush_required(struct memcg_stock_pcp *stock, +static bool obj_stock_flush_required(struct obj_stock_pcp *stock, struct mem_cgroup *root_memcg) { struct obj_cgroup *objcg = READ_ONCE(stock->cached_objcg); struct mem_cgroup *memcg; + bool flush = false; + rcu_read_lock(); if (objcg) { memcg = obj_cgroup_memcg(objcg); if (memcg && mem_cgroup_is_descendant(memcg, root_memcg)) - return true; + flush = true; } + rcu_read_unlock(); - return false; + return flush; } static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, bool allow_uncharge, int nr_acct, struct pglist_data *pgdat, enum node_stat_item idx) { - struct memcg_stock_pcp *stock; + struct obj_stock_pcp *stock; unsigned long flags; unsigned int nr_pages = 0; - local_lock_irqsave(&memcg_stock.obj_lock, flags); + local_lock_irqsave(&obj_stock.lock, flags); - stock = this_cpu_ptr(&memcg_stock); + stock = this_cpu_ptr(&obj_stock); if (READ_ONCE(stock->cached_objcg) != objcg) { /* reset if necessary */ drain_obj_stock(stock); obj_cgroup_get(objcg); @@ -3013,7 +3042,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int 
nr_bytes, stock->nr_bytes &= (PAGE_SIZE - 1); } - local_unlock_irqrestore(&memcg_stock.obj_lock, flags); + local_unlock_irqrestore(&obj_stock.lock, flags); if (nr_pages) obj_cgroup_uncharge_pages(objcg, nr_pages); @@ -5078,9 +5107,12 @@ int __init mem_cgroup_init(void) cpuhp_setup_state_nocalls(CPUHP_MM_MEMCQ_DEAD, "mm/memctrl:dead", NULL, memcg_hotplug_cpu_dead); - for_each_possible_cpu(cpu) + for_each_possible_cpu(cpu) { INIT_WORK(&per_cpu_ptr(&memcg_stock, cpu)->work, - drain_local_stock); + drain_local_memcg_stock); + INIT_WORK(&per_cpu_ptr(&obj_stock, cpu)->work, + drain_local_obj_stock); + } memcg_size = struct_size_t(struct mem_cgroup, nodeinfo, nr_node_ids); memcg_cachep = kmem_cache_create("mem_cgroup", memcg_size, 0, -- 2.47.1 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH 3/4] memcg: completely decouple memcg and obj stocks 2025-04-29 23:04 ` [PATCH 3/4] memcg: completely decouple memcg and obj stocks Shakeel Butt @ 2025-04-30 12:21 ` Vlastimil Babka 0 siblings, 0 replies; 13+ messages in thread From: Vlastimil Babka @ 2025-04-30 12:21 UTC (permalink / raw) To: Shakeel Butt, Andrew Morton Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song, Sebastian Andrzej Siewior, linux-mm, cgroups, linux-kernel, Meta kernel team, bpf On 4/30/25 01:04, Shakeel Butt wrote: > Let's completely decouple the memcg and obj per-cpu stocks. This will > enable us to make memcg per-cpu stocks to used without disabling irqs. > Also it will enable us to make obj stocks nmi safe independently which > is required to make kmalloc/slab safe for allocations from nmi context. > > Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Vlastimil Babka <vbabka@suse.cz> ^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH 4/4] memcg: no irq disable for memcg stock lock 2025-04-29 23:04 [PATCH 0/4] memcg: decouple memcg and objcg stocks Shakeel Butt ` (2 preceding siblings ...) 2025-04-29 23:04 ` [PATCH 3/4] memcg: completely decouple memcg and obj stocks Shakeel Butt @ 2025-04-29 23:04 ` Shakeel Butt 2025-04-30 12:26 ` Vlastimil Babka 3 siblings, 1 reply; 13+ messages in thread From: Shakeel Butt @ 2025-04-29 23:04 UTC (permalink / raw) To: Andrew Morton Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song, Sebastian Andrzej Siewior, Vlastimil Babka, linux-mm, cgroups, linux-kernel, Meta kernel team There is no need to disable irqs to use memcg per-cpu stock, so let's just not do that. One consequence of this change is if the kernel while in task context has the memcg stock lock and that cpu got interrupted. The memcg charges on that cpu in the irq context will take the slow path of memcg charging. However that should be super rare and should be fine in general. Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> --- mm/memcontrol.c | 14 ++++++-------- 1 file changed, 6 insertions(+), 8 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 8f31b35ddcb3..25bd312d0455 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1827,12 +1827,11 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages) { struct memcg_stock_pcp *stock; uint8_t stock_pages; - unsigned long flags; bool ret = false; int i; if (nr_pages > MEMCG_CHARGE_BATCH || - !local_trylock_irqsave(&memcg_stock.lock, flags)) + !local_trylock(&memcg_stock.lock)) return ret; stock = this_cpu_ptr(&memcg_stock); @@ -1849,7 +1848,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages) break; } - local_unlock_irqrestore(&memcg_stock.lock, flags); + local_unlock(&memcg_stock.lock); return ret; } @@ -1893,18 +1892,17 @@ static void drain_stock_fully(struct memcg_stock_pcp *stock) static void drain_local_memcg_stock(struct work_struct *dummy) { struct memcg_stock_pcp *stock; - unsigned long flags; if (WARN_ONCE(!in_task(), "drain in non-task context")) return; - local_lock_irqsave(&memcg_stock.lock, flags); + local_lock(&memcg_stock.lock); stock = this_cpu_ptr(&memcg_stock); drain_stock_fully(stock); clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags); - local_unlock_irqrestore(&memcg_stock.lock, flags); + local_unlock(&memcg_stock.lock); } static void drain_local_obj_stock(struct work_struct *dummy) @@ -1944,7 +1942,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) VM_WARN_ON_ONCE(mem_cgroup_is_root(memcg)); if (nr_pages > MEMCG_CHARGE_BATCH || - !local_trylock_irqsave(&memcg_stock.lock, flags)) { + !local_trylock(&memcg_stock.lock)) { /* * In case of larger than batch refill or unlikely failure to * lock the percpu memcg_stock.lock, uncharge memcg directly. @@ -1979,7 +1977,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) WRITE_ONCE(stock->nr_pages[i], nr_pages); } - local_unlock_irqrestore(&memcg_stock.lock, flags); + local_unlock(&memcg_stock.lock); } static bool is_memcg_drain_needed(struct memcg_stock_pcp *stock, -- 2.47.1 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH 4/4] memcg: no irq disable for memcg stock lock 2025-04-29 23:04 ` [PATCH 4/4] memcg: no irq disable for memcg stock lock Shakeel Butt @ 2025-04-30 12:26 ` Vlastimil Babka 0 siblings, 0 replies; 13+ messages in thread From: Vlastimil Babka @ 2025-04-30 12:26 UTC (permalink / raw) To: Shakeel Butt, Andrew Morton Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song, Sebastian Andrzej Siewior, linux-mm, cgroups, linux-kernel, Meta kernel team On 4/30/25 01:04, Shakeel Butt wrote: > There is no need to disable irqs to use memcg per-cpu stock, so let's > just not do that. One consequence of this change is if the kernel while > in task context has the memcg stock lock and that cpu got interrupted. > The memcg charges on that cpu in the irq context will take the slow path > of memcg charging. However that should be super rare and should be fine > in general. > > Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Vlastimil Babka <vbabka@suse.cz> ^ permalink raw reply [flat|nested] 13+ messages in thread
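To make the consequence described in the changelog concrete, here is the
fast-path guard from the hunk above with the interrupt scenario spelled out
in comments; the commentary is an illustration of the changelog's argument,
not additional posted code:

	if (nr_pages > MEMCG_CHARGE_BATCH ||
	    !local_trylock(&memcg_stock.lock))	/* no irqsave any more */
		return ret;
	/*
	 * If a hard irq arrives after the trylock succeeds and the irq
	 * context also tries to charge on this CPU, its own local_trylock()
	 * fails (the interrupted task owns the lock) and that charge falls
	 * back to the page_counter slow path. The trylock never spins on the
	 * lock of the task it interrupted, so dropping the irq disabling
	 * cannot deadlock; the cost is an occasional slow-path charge from
	 * irq context, which the changelog argues is rare.
	 */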
* [PATCH v3 0/4] memcg: decouple memcg and objcg stocks
@ 2025-05-06 22:55 Shakeel Butt
  2025-05-06 22:55 ` [PATCH 3/4] memcg: completely decouple memcg and obj stocks Shakeel Butt
  0 siblings, 1 reply; 13+ messages in thread
From: Shakeel Butt @ 2025-05-06 22:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
      Alexei Starovoitov, Vlastimil Babka, Sebastian Andrzej Siewior,
      Jakub Kicinski, Eric Dumazet, linux-mm, cgroups, linux-kernel,
      bpf, netdev, Meta kernel team

The per-cpu memcg charge cache and objcg charge cache are coupled in a
single struct memcg_stock_pcp, and a single local lock is used to protect
both of the caches. This makes it challenging to make memcg charging and
objcg charging nmi safe. Decoupling the memcg and objcg stocks would allow
us to make each of them nmi safe independently, and even to work without
disabling irqs.

This series completely decouples the memcg and objcg stocks.

To evaluate the impact of this series with and without PREEMPT_RT config,
we ran a varying number of netperf clients in different cgroups on a
72 CPU machine.

 $ netserver -6
 $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

PREEMPT_RT config:
------------------
number of clients | Without series | With series
        6         |  38559.1 Mbps  | 38652.6 Mbps
       12         |  37388.8 Mbps  | 37560.1 Mbps
       18         |  30707.5 Mbps  | 31378.3 Mbps
       24         |  25908.4 Mbps  | 26423.9 Mbps
       30         |  22347.7 Mbps  | 22326.5 Mbps
       36         |  20235.1 Mbps  | 20165.0 Mbps

!PREEMPT_RT config:
-------------------
number of clients | Without series | With series
        6         |  50235.7 Mbps  | 51415.4 Mbps
       12         |  49336.5 Mbps  | 49901.4 Mbps
       18         |  46306.8 Mbps  | 46482.7 Mbps
       24         |  38145.7 Mbps  | 38729.4 Mbps
       30         |  30347.6 Mbps  | 31698.2 Mbps
       36         |  26976.6 Mbps  | 27364.4 Mbps

No performance regression was observed.

Changes since v2:
- Ran and included network intensive benchmarking results
- Brought back the simplify patch dropped in v2 after perf experiment.

Changes since v1:
- Drop first patch as requested by Alexei.
- Remove preempt_disable() usage as suggested by Vlastimil.

Shakeel Butt (4):
  memcg: simplify consume_stock
  memcg: separate local_trylock for memcg and obj
  memcg: completely decouple memcg and obj stocks
  memcg: no irq disable for memcg stock lock

 mm/memcontrol.c | 175 ++++++++++++++++++++++++++++--------------------
 1 file changed, 102 insertions(+), 73 deletions(-)

--
2.47.1

^ permalink raw reply	[flat|nested] 13+ messages in thread
* [PATCH 3/4] memcg: completely decouple memcg and obj stocks 2025-05-06 22:55 [PATCH v3 0/4] memcg: decouple memcg and objcg stocks Shakeel Butt @ 2025-05-06 22:55 ` Shakeel Butt 0 siblings, 0 replies; 13+ messages in thread From: Shakeel Butt @ 2025-05-06 22:55 UTC (permalink / raw) To: Andrew Morton Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song, Alexei Starovoitov, Vlastimil Babka, Sebastian Andrzej Siewior, Jakub Kicinski, Eric Dumazet, linux-mm, cgroups, linux-kernel, bpf, netdev, Meta kernel team Let's completely decouple the memcg and obj per-cpu stocks. This will enable us to make memcg per-cpu stocks to used without disabling irqs. Also it will enable us to make obj stocks nmi safe independently which is required to make kmalloc/slab safe for allocations from nmi context. Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Vlastimil Babka <vbabka@suse.cz> --- mm/memcontrol.c | 149 ++++++++++++++++++++++++++++++------------------ 1 file changed, 92 insertions(+), 57 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 09a78c606d8d..14d73e5352fe 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1780,12 +1780,22 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg) * nr_pages in a single cacheline. This may change in future. */ #define NR_MEMCG_STOCK 7 +#define FLUSHING_CACHED_CHARGE 0 struct memcg_stock_pcp { - local_trylock_t memcg_lock; + local_trylock_t lock; uint8_t nr_pages[NR_MEMCG_STOCK]; struct mem_cgroup *cached[NR_MEMCG_STOCK]; - local_trylock_t obj_lock; + struct work_struct work; + unsigned long flags; +}; + +static DEFINE_PER_CPU_ALIGNED(struct memcg_stock_pcp, memcg_stock) = { + .lock = INIT_LOCAL_TRYLOCK(lock), +}; + +struct obj_stock_pcp { + local_trylock_t lock; unsigned int nr_bytes; struct obj_cgroup *cached_objcg; struct pglist_data *cached_pgdat; @@ -1794,16 +1804,16 @@ struct memcg_stock_pcp { struct work_struct work; unsigned long flags; -#define FLUSHING_CACHED_CHARGE 0 }; -static DEFINE_PER_CPU_ALIGNED(struct memcg_stock_pcp, memcg_stock) = { - .memcg_lock = INIT_LOCAL_TRYLOCK(memcg_lock), - .obj_lock = INIT_LOCAL_TRYLOCK(obj_lock), + +static DEFINE_PER_CPU_ALIGNED(struct obj_stock_pcp, obj_stock) = { + .lock = INIT_LOCAL_TRYLOCK(lock), }; + static DEFINE_MUTEX(percpu_charge_mutex); -static void drain_obj_stock(struct memcg_stock_pcp *stock); -static bool obj_stock_flush_required(struct memcg_stock_pcp *stock, +static void drain_obj_stock(struct obj_stock_pcp *stock); +static bool obj_stock_flush_required(struct obj_stock_pcp *stock, struct mem_cgroup *root_memcg); /** @@ -1826,7 +1836,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages) int i; if (nr_pages > MEMCG_CHARGE_BATCH || - !local_trylock_irqsave(&memcg_stock.memcg_lock, flags)) + !local_trylock_irqsave(&memcg_stock.lock, flags)) return ret; stock = this_cpu_ptr(&memcg_stock); @@ -1843,7 +1853,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages) break; } - local_unlock_irqrestore(&memcg_stock.memcg_lock, flags); + local_unlock_irqrestore(&memcg_stock.lock, flags); return ret; } @@ -1884,7 +1894,7 @@ static void drain_stock_fully(struct memcg_stock_pcp *stock) drain_stock(stock, i); } -static void drain_local_stock(struct work_struct *dummy) +static void drain_local_memcg_stock(struct work_struct *dummy) { struct memcg_stock_pcp *stock; unsigned long flags; @@ -1892,16 +1902,30 @@ static void drain_local_stock(struct work_struct *dummy) if (WARN_ONCE(!in_task(), "drain in non-task context")) 
return; - local_lock_irqsave(&memcg_stock.obj_lock, flags); - stock = this_cpu_ptr(&memcg_stock); - drain_obj_stock(stock); - local_unlock_irqrestore(&memcg_stock.obj_lock, flags); + local_lock_irqsave(&memcg_stock.lock, flags); - local_lock_irqsave(&memcg_stock.memcg_lock, flags); stock = this_cpu_ptr(&memcg_stock); drain_stock_fully(stock); clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags); - local_unlock_irqrestore(&memcg_stock.memcg_lock, flags); + + local_unlock_irqrestore(&memcg_stock.lock, flags); +} + +static void drain_local_obj_stock(struct work_struct *dummy) +{ + struct obj_stock_pcp *stock; + unsigned long flags; + + if (WARN_ONCE(!in_task(), "drain in non-task context")) + return; + + local_lock_irqsave(&obj_stock.lock, flags); + + stock = this_cpu_ptr(&obj_stock); + drain_obj_stock(stock); + clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags); + + local_unlock_irqrestore(&obj_stock.lock, flags); } static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) @@ -1924,10 +1948,10 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) VM_WARN_ON_ONCE(mem_cgroup_is_root(memcg)); if (nr_pages > MEMCG_CHARGE_BATCH || - !local_trylock_irqsave(&memcg_stock.memcg_lock, flags)) { + !local_trylock_irqsave(&memcg_stock.lock, flags)) { /* * In case of larger than batch refill or unlikely failure to - * lock the percpu memcg_lock, uncharge memcg directly. + * lock the percpu memcg_stock.lock, uncharge memcg directly. */ memcg_uncharge(memcg, nr_pages); return; @@ -1959,23 +1983,17 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) WRITE_ONCE(stock->nr_pages[i], nr_pages); } - local_unlock_irqrestore(&memcg_stock.memcg_lock, flags); + local_unlock_irqrestore(&memcg_stock.lock, flags); } -static bool is_drain_needed(struct memcg_stock_pcp *stock, - struct mem_cgroup *root_memcg) +static bool is_memcg_drain_needed(struct memcg_stock_pcp *stock, + struct mem_cgroup *root_memcg) { struct mem_cgroup *memcg; bool flush = false; int i; rcu_read_lock(); - - if (obj_stock_flush_required(stock, root_memcg)) { - flush = true; - goto out; - } - for (i = 0; i < NR_MEMCG_STOCK; ++i) { memcg = READ_ONCE(stock->cached[i]); if (!memcg) @@ -1987,7 +2005,6 @@ static bool is_drain_needed(struct memcg_stock_pcp *stock, break; } } -out: rcu_read_unlock(); return flush; } @@ -2012,15 +2029,27 @@ void drain_all_stock(struct mem_cgroup *root_memcg) migrate_disable(); curcpu = smp_processor_id(); for_each_online_cpu(cpu) { - struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu); - bool flush = is_drain_needed(stock, root_memcg); + struct memcg_stock_pcp *memcg_st = &per_cpu(memcg_stock, cpu); + struct obj_stock_pcp *obj_st = &per_cpu(obj_stock, cpu); - if (flush && - !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) { + if (!test_bit(FLUSHING_CACHED_CHARGE, &memcg_st->flags) && + is_memcg_drain_needed(memcg_st, root_memcg) && + !test_and_set_bit(FLUSHING_CACHED_CHARGE, + &memcg_st->flags)) { if (cpu == curcpu) - drain_local_stock(&stock->work); + drain_local_memcg_stock(&memcg_st->work); else if (!cpu_is_isolated(cpu)) - schedule_work_on(cpu, &stock->work); + schedule_work_on(cpu, &memcg_st->work); + } + + if (!test_bit(FLUSHING_CACHED_CHARGE, &obj_st->flags) && + obj_stock_flush_required(obj_st, root_memcg) && + !test_and_set_bit(FLUSHING_CACHED_CHARGE, + &obj_st->flags)) { + if (cpu == curcpu) + drain_local_obj_stock(&obj_st->work); + else if (!cpu_is_isolated(cpu)) + schedule_work_on(cpu, &obj_st->work); } } migrate_enable(); @@ -2029,18 
+2058,18 @@ void drain_all_stock(struct mem_cgroup *root_memcg) static int memcg_hotplug_cpu_dead(unsigned int cpu) { - struct memcg_stock_pcp *stock; + struct obj_stock_pcp *obj_st; unsigned long flags; - stock = &per_cpu(memcg_stock, cpu); + obj_st = &per_cpu(obj_stock, cpu); - /* drain_obj_stock requires obj_lock */ - local_lock_irqsave(&memcg_stock.obj_lock, flags); - drain_obj_stock(stock); - local_unlock_irqrestore(&memcg_stock.obj_lock, flags); + /* drain_obj_stock requires objstock.lock */ + local_lock_irqsave(&obj_stock.lock, flags); + drain_obj_stock(obj_st); + local_unlock_irqrestore(&obj_stock.lock, flags); /* no need for the local lock */ - drain_stock_fully(stock); + drain_stock_fully(&per_cpu(memcg_stock, cpu)); return 0; } @@ -2837,7 +2866,7 @@ void __memcg_kmem_uncharge_page(struct page *page, int order) } static void __account_obj_stock(struct obj_cgroup *objcg, - struct memcg_stock_pcp *stock, int nr, + struct obj_stock_pcp *stock, int nr, struct pglist_data *pgdat, enum node_stat_item idx) { int *bytes; @@ -2888,13 +2917,13 @@ static void __account_obj_stock(struct obj_cgroup *objcg, static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, struct pglist_data *pgdat, enum node_stat_item idx) { - struct memcg_stock_pcp *stock; + struct obj_stock_pcp *stock; unsigned long flags; bool ret = false; - local_lock_irqsave(&memcg_stock.obj_lock, flags); + local_lock_irqsave(&obj_stock.lock, flags); - stock = this_cpu_ptr(&memcg_stock); + stock = this_cpu_ptr(&obj_stock); if (objcg == READ_ONCE(stock->cached_objcg) && stock->nr_bytes >= nr_bytes) { stock->nr_bytes -= nr_bytes; ret = true; @@ -2903,12 +2932,12 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, __account_obj_stock(objcg, stock, nr_bytes, pgdat, idx); } - local_unlock_irqrestore(&memcg_stock.obj_lock, flags); + local_unlock_irqrestore(&obj_stock.lock, flags); return ret; } -static void drain_obj_stock(struct memcg_stock_pcp *stock) +static void drain_obj_stock(struct obj_stock_pcp *stock) { struct obj_cgroup *old = READ_ONCE(stock->cached_objcg); @@ -2969,32 +2998,35 @@ static void drain_obj_stock(struct memcg_stock_pcp *stock) obj_cgroup_put(old); } -static bool obj_stock_flush_required(struct memcg_stock_pcp *stock, +static bool obj_stock_flush_required(struct obj_stock_pcp *stock, struct mem_cgroup *root_memcg) { struct obj_cgroup *objcg = READ_ONCE(stock->cached_objcg); struct mem_cgroup *memcg; + bool flush = false; + rcu_read_lock(); if (objcg) { memcg = obj_cgroup_memcg(objcg); if (memcg && mem_cgroup_is_descendant(memcg, root_memcg)) - return true; + flush = true; } + rcu_read_unlock(); - return false; + return flush; } static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, bool allow_uncharge, int nr_acct, struct pglist_data *pgdat, enum node_stat_item idx) { - struct memcg_stock_pcp *stock; + struct obj_stock_pcp *stock; unsigned long flags; unsigned int nr_pages = 0; - local_lock_irqsave(&memcg_stock.obj_lock, flags); + local_lock_irqsave(&obj_stock.lock, flags); - stock = this_cpu_ptr(&memcg_stock); + stock = this_cpu_ptr(&obj_stock); if (READ_ONCE(stock->cached_objcg) != objcg) { /* reset if necessary */ drain_obj_stock(stock); obj_cgroup_get(objcg); @@ -3014,7 +3046,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, stock->nr_bytes &= (PAGE_SIZE - 1); } - local_unlock_irqrestore(&memcg_stock.obj_lock, flags); + local_unlock_irqrestore(&obj_stock.lock, flags); if (nr_pages) 
obj_cgroup_uncharge_pages(objcg, nr_pages); @@ -5079,9 +5111,12 @@ int __init mem_cgroup_init(void) cpuhp_setup_state_nocalls(CPUHP_MM_MEMCQ_DEAD, "mm/memctrl:dead", NULL, memcg_hotplug_cpu_dead); - for_each_possible_cpu(cpu) + for_each_possible_cpu(cpu) { INIT_WORK(&per_cpu_ptr(&memcg_stock, cpu)->work, - drain_local_stock); + drain_local_memcg_stock); + INIT_WORK(&per_cpu_ptr(&obj_stock, cpu)->work, + drain_local_obj_stock); + } memcg_size = struct_size_t(struct mem_cgroup, nodeinfo, nr_node_ids); memcg_cachep = kmem_cache_create("mem_cgroup", memcg_size, 0, -- 2.47.1 ^ permalink raw reply related [flat|nested] 13+ messages in thread