* [PATCH v2 0/3] memcg: decouple memcg and objcg stocks
@ 2025-05-02  0:17 Shakeel Butt

From: Shakeel Butt @ 2025-05-02  0:17 UTC
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
    Alexei Starovoitov, linux-mm, cgroups, bpf, linux-kernel,
    Meta kernel team

The per-cpu memcg charge cache and the objcg charge cache are coupled in
a single struct memcg_stock_pcp, and a single local lock protects both
caches. This makes it challenging to make memcg and objcg charging nmi
safe. Decoupling the memcg and objcg stocks would allow us to make each
of them nmi safe independently, and even to work without disabling irqs.
This series completely decouples the memcg and objcg stocks.

Changes since v1:
- Drop the first patch as requested by Alexei.
- Remove preempt_disable() usage as suggested by Vlastimil.

Shakeel Butt (3):
  memcg: separate local_trylock for memcg and obj
  memcg: completely decouple memcg and obj stocks
  memcg: no irq disable for memcg stock lock

 mm/memcontrol.c | 159 +++++++++++++++++++++++++++++-------------------
 1 file changed, 97 insertions(+), 62 deletions(-)

--
2.47.1
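For orientation, the two independent per-cpu caches the series ends up
with look roughly as follows (condensed from the definitions introduced
in patch 2/3 below; the comments are illustrative and not from the patch):

  /* page-granular memcg charge cache, with its own lock */
  struct memcg_stock_pcp {
          local_trylock_t lock;
          uint8_t nr_pages[NR_MEMCG_STOCK];
          struct mem_cgroup *cached[NR_MEMCG_STOCK];
          struct work_struct work;
          unsigned long flags;
  };

  /* byte-granular objcg charge cache, with its own lock */
  struct obj_stock_pcp {
          local_trylock_t lock;
          unsigned int nr_bytes;
          struct obj_cgroup *cached_objcg;
          struct pglist_data *cached_pgdat;
          int nr_slab_reclaimable_b;
          int nr_slab_unreclaimable_b;
          struct work_struct work;
          unsigned long flags;
  };

Once split this way, each cache can take its own lock (or fail a trylock
and fall back to the slow path) without caring what context is using the
other cache, which is what makes independent nmi safety and the later
irq-enabled memcg stock possible.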
* [PATCH v2 1/3] memcg: separate local_trylock for memcg and obj 2025-05-02 0:17 [PATCH v2 0/3] memcg: decouple memcg and objcg stocks Shakeel Butt @ 2025-05-02 0:17 ` Shakeel Butt 2025-05-02 0:17 ` [PATCH v2 2/3] memcg: completely decouple memcg and obj stocks Shakeel Butt 2025-05-02 0:17 ` [PATCH v2 3/3] memcg: no irq disable for memcg stock lock Shakeel Butt 2 siblings, 0 replies; 12+ messages in thread From: Shakeel Butt @ 2025-05-02 0:17 UTC (permalink / raw) To: Andrew Morton Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song, Alexei Starovoitov, linux-mm, cgroups, bpf, linux-kernel, Meta kernel team The per-cpu stock_lock protects cached memcg and cached objcg and their respective fields. However there is no dependency between these fields and it is better to have fine grained separate locks for cached memcg and cached objcg. This decoupling of locks allows us to make the memcg charge cache and objcg charge cache to be nmi safe independently. At the moment, memcg charge cache is already nmi safe and this decoupling will allow to make memcg charge cache work without disabling irqs. Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> --- Changes since v1: - Drop usage of preempt_disable() as suggested by Vlastimil. mm/memcontrol.c | 51 ++++++++++++++++++++++++++----------------------- 1 file changed, 27 insertions(+), 24 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 0d42699bb564..14714e1d36e9 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1806,13 +1806,14 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg) */ #define NR_MEMCG_STOCK 7 struct memcg_stock_pcp { - local_trylock_t stock_lock; + local_trylock_t memcg_lock; uint8_t nr_pages[NR_MEMCG_STOCK]; struct mem_cgroup *cached[NR_MEMCG_STOCK]; + local_trylock_t obj_lock; + unsigned int nr_bytes; struct obj_cgroup *cached_objcg; struct pglist_data *cached_pgdat; - unsigned int nr_bytes; int nr_slab_reclaimable_b; int nr_slab_unreclaimable_b; @@ -1821,7 +1822,8 @@ struct memcg_stock_pcp { #define FLUSHING_CACHED_CHARGE 0 }; static DEFINE_PER_CPU_ALIGNED(struct memcg_stock_pcp, memcg_stock) = { - .stock_lock = INIT_LOCAL_TRYLOCK(stock_lock), + .memcg_lock = INIT_LOCAL_TRYLOCK(memcg_lock), + .obj_lock = INIT_LOCAL_TRYLOCK(obj_lock), }; static DEFINE_MUTEX(percpu_charge_mutex); @@ -1854,8 +1856,8 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages, return ret; if (gfpflags_allow_spinning(gfp_mask)) - local_lock_irqsave(&memcg_stock.stock_lock, flags); - else if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags)) + local_lock_irqsave(&memcg_stock.memcg_lock, flags); + else if (!local_trylock_irqsave(&memcg_stock.memcg_lock, flags)) return ret; stock = this_cpu_ptr(&memcg_stock); @@ -1872,7 +1874,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages, break; } - local_unlock_irqrestore(&memcg_stock.stock_lock, flags); + local_unlock_irqrestore(&memcg_stock.memcg_lock, flags); return ret; } @@ -1918,19 +1920,19 @@ static void drain_local_stock(struct work_struct *dummy) struct memcg_stock_pcp *stock; unsigned long flags; - /* - * The only protection from cpu hotplug (memcg_hotplug_cpu_dead) vs. 
- * drain_stock races is that we always operate on local CPU stock - * here with IRQ disabled - */ - local_lock_irqsave(&memcg_stock.stock_lock, flags); + if (WARN_ONCE(!in_task(), "drain in non-task context")) + return; + local_lock_irqsave(&memcg_stock.obj_lock, flags); stock = this_cpu_ptr(&memcg_stock); drain_obj_stock(stock); + local_unlock_irqrestore(&memcg_stock.obj_lock, flags); + + local_lock_irqsave(&memcg_stock.memcg_lock, flags); + stock = this_cpu_ptr(&memcg_stock); drain_stock_fully(stock); clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags); - - local_unlock_irqrestore(&memcg_stock.stock_lock, flags); + local_unlock_irqrestore(&memcg_stock.memcg_lock, flags); } static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) @@ -1953,10 +1955,10 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) VM_WARN_ON_ONCE(mem_cgroup_is_root(memcg)); if (nr_pages > MEMCG_CHARGE_BATCH || - !local_trylock_irqsave(&memcg_stock.stock_lock, flags)) { + !local_trylock_irqsave(&memcg_stock.memcg_lock, flags)) { /* * In case of larger than batch refill or unlikely failure to - * lock the percpu stock_lock, uncharge memcg directly. + * lock the percpu memcg_lock, uncharge memcg directly. */ memcg_uncharge(memcg, nr_pages); return; @@ -1988,7 +1990,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) WRITE_ONCE(stock->nr_pages[i], nr_pages); } - local_unlock_irqrestore(&memcg_stock.stock_lock, flags); + local_unlock_irqrestore(&memcg_stock.memcg_lock, flags); } static bool is_drain_needed(struct memcg_stock_pcp *stock, @@ -2063,11 +2065,12 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu) stock = &per_cpu(memcg_stock, cpu); - /* drain_obj_stock requires stock_lock */ - local_lock_irqsave(&memcg_stock.stock_lock, flags); + /* drain_obj_stock requires obj_lock */ + local_lock_irqsave(&memcg_stock.obj_lock, flags); drain_obj_stock(stock); - local_unlock_irqrestore(&memcg_stock.stock_lock, flags); + local_unlock_irqrestore(&memcg_stock.obj_lock, flags); + /* no need for the local lock */ drain_stock_fully(stock); return 0; @@ -2920,7 +2923,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, unsigned long flags; bool ret = false; - local_lock_irqsave(&memcg_stock.stock_lock, flags); + local_lock_irqsave(&memcg_stock.obj_lock, flags); stock = this_cpu_ptr(&memcg_stock); if (objcg == READ_ONCE(stock->cached_objcg) && stock->nr_bytes >= nr_bytes) { @@ -2931,7 +2934,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, __account_obj_stock(objcg, stock, nr_bytes, pgdat, idx); } - local_unlock_irqrestore(&memcg_stock.stock_lock, flags); + local_unlock_irqrestore(&memcg_stock.obj_lock, flags); return ret; } @@ -3020,7 +3023,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, unsigned long flags; unsigned int nr_pages = 0; - local_lock_irqsave(&memcg_stock.stock_lock, flags); + local_lock_irqsave(&memcg_stock.obj_lock, flags); stock = this_cpu_ptr(&memcg_stock); if (READ_ONCE(stock->cached_objcg) != objcg) { /* reset if necessary */ @@ -3042,7 +3045,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, stock->nr_bytes &= (PAGE_SIZE - 1); } - local_unlock_irqrestore(&memcg_stock.stock_lock, flags); + local_unlock_irqrestore(&memcg_stock.obj_lock, flags); if (nr_pages) obj_cgroup_uncharge_pages(objcg, nr_pages); -- 2.47.1 ^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH v2 2/3] memcg: completely decouple memcg and obj stocks 2025-05-02 0:17 [PATCH v2 0/3] memcg: decouple memcg and objcg stocks Shakeel Butt 2025-05-02 0:17 ` [PATCH v2 1/3] memcg: separate local_trylock for memcg and obj Shakeel Butt @ 2025-05-02 0:17 ` Shakeel Butt 2025-05-02 0:17 ` [PATCH v2 3/3] memcg: no irq disable for memcg stock lock Shakeel Butt 2 siblings, 0 replies; 12+ messages in thread From: Shakeel Butt @ 2025-05-02 0:17 UTC (permalink / raw) To: Andrew Morton Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song, Alexei Starovoitov, linux-mm, cgroups, bpf, linux-kernel, Meta kernel team, Vlastimil Babka Let's completely decouple the memcg and obj per-cpu stocks. This will enable us to make memcg per-cpu stocks to used without disabling irqs. Also it will enable us to make obj stocks nmi safe independently which is required to make kmalloc/slab safe for allocations from nmi context. Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Vlastimil Babka <vbabka@suse.cz> --- mm/memcontrol.c | 151 +++++++++++++++++++++++++++++------------------- 1 file changed, 93 insertions(+), 58 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 14714e1d36e9..cd81c70d144b 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1804,13 +1804,23 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg) * The value of NR_MEMCG_STOCK is selected to keep the cached memcgs and their * nr_pages in a single cacheline. This may change in future. */ +#define FLUSHING_CACHED_CHARGE 0 #define NR_MEMCG_STOCK 7 struct memcg_stock_pcp { - local_trylock_t memcg_lock; + local_trylock_t lock; uint8_t nr_pages[NR_MEMCG_STOCK]; struct mem_cgroup *cached[NR_MEMCG_STOCK]; - local_trylock_t obj_lock; + struct work_struct work; + unsigned long flags; +}; + +static DEFINE_PER_CPU_ALIGNED(struct memcg_stock_pcp, memcg_stock) = { + .lock = INIT_LOCAL_TRYLOCK(lock), +}; + +struct obj_stock_pcp { + local_trylock_t lock; unsigned int nr_bytes; struct obj_cgroup *cached_objcg; struct pglist_data *cached_pgdat; @@ -1819,16 +1829,16 @@ struct memcg_stock_pcp { struct work_struct work; unsigned long flags; -#define FLUSHING_CACHED_CHARGE 0 }; -static DEFINE_PER_CPU_ALIGNED(struct memcg_stock_pcp, memcg_stock) = { - .memcg_lock = INIT_LOCAL_TRYLOCK(memcg_lock), - .obj_lock = INIT_LOCAL_TRYLOCK(obj_lock), + +static DEFINE_PER_CPU_ALIGNED(struct obj_stock_pcp, obj_stock) = { + .lock = INIT_LOCAL_TRYLOCK(lock), }; + static DEFINE_MUTEX(percpu_charge_mutex); -static void drain_obj_stock(struct memcg_stock_pcp *stock); -static bool obj_stock_flush_required(struct memcg_stock_pcp *stock, +static void drain_obj_stock(struct obj_stock_pcp *stock); +static bool obj_stock_flush_required(struct obj_stock_pcp *stock, struct mem_cgroup *root_memcg); /** @@ -1856,8 +1866,8 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages, return ret; if (gfpflags_allow_spinning(gfp_mask)) - local_lock_irqsave(&memcg_stock.memcg_lock, flags); - else if (!local_trylock_irqsave(&memcg_stock.memcg_lock, flags)) + local_lock_irqsave(&memcg_stock.lock, flags); + else if (!local_trylock_irqsave(&memcg_stock.lock, flags)) return ret; stock = this_cpu_ptr(&memcg_stock); @@ -1874,7 +1884,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages, break; } - local_unlock_irqrestore(&memcg_stock.memcg_lock, flags); + local_unlock_irqrestore(&memcg_stock.lock, flags); return ret; } @@ -1915,7 +1925,7 @@ static void drain_stock_fully(struct memcg_stock_pcp *stock) drain_stock(stock, 
i); } -static void drain_local_stock(struct work_struct *dummy) +static void drain_local_memcg_stock(struct work_struct *dummy) { struct memcg_stock_pcp *stock; unsigned long flags; @@ -1923,16 +1933,30 @@ static void drain_local_stock(struct work_struct *dummy) if (WARN_ONCE(!in_task(), "drain in non-task context")) return; - local_lock_irqsave(&memcg_stock.obj_lock, flags); - stock = this_cpu_ptr(&memcg_stock); - drain_obj_stock(stock); - local_unlock_irqrestore(&memcg_stock.obj_lock, flags); + local_lock_irqsave(&memcg_stock.lock, flags); - local_lock_irqsave(&memcg_stock.memcg_lock, flags); stock = this_cpu_ptr(&memcg_stock); drain_stock_fully(stock); clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags); - local_unlock_irqrestore(&memcg_stock.memcg_lock, flags); + + local_unlock_irqrestore(&memcg_stock.lock, flags); +} + +static void drain_local_obj_stock(struct work_struct *dummy) +{ + struct obj_stock_pcp *stock; + unsigned long flags; + + if (WARN_ONCE(!in_task(), "drain in non-task context")) + return; + + local_lock_irqsave(&obj_stock.lock, flags); + + stock = this_cpu_ptr(&obj_stock); + drain_obj_stock(stock); + clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags); + + local_unlock_irqrestore(&obj_stock.lock, flags); } static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) @@ -1955,10 +1979,10 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) VM_WARN_ON_ONCE(mem_cgroup_is_root(memcg)); if (nr_pages > MEMCG_CHARGE_BATCH || - !local_trylock_irqsave(&memcg_stock.memcg_lock, flags)) { + !local_trylock_irqsave(&memcg_stock.lock, flags)) { /* * In case of larger than batch refill or unlikely failure to - * lock the percpu memcg_lock, uncharge memcg directly. + * lock the percpu memcg_stock.lock, uncharge memcg directly. 
*/ memcg_uncharge(memcg, nr_pages); return; @@ -1990,23 +2014,17 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) WRITE_ONCE(stock->nr_pages[i], nr_pages); } - local_unlock_irqrestore(&memcg_stock.memcg_lock, flags); + local_unlock_irqrestore(&memcg_stock.lock, flags); } -static bool is_drain_needed(struct memcg_stock_pcp *stock, - struct mem_cgroup *root_memcg) +static bool is_memcg_drain_needed(struct memcg_stock_pcp *stock, + struct mem_cgroup *root_memcg) { struct mem_cgroup *memcg; bool flush = false; int i; rcu_read_lock(); - - if (obj_stock_flush_required(stock, root_memcg)) { - flush = true; - goto out; - } - for (i = 0; i < NR_MEMCG_STOCK; ++i) { memcg = READ_ONCE(stock->cached[i]); if (!memcg) @@ -2018,7 +2036,6 @@ static bool is_drain_needed(struct memcg_stock_pcp *stock, break; } } -out: rcu_read_unlock(); return flush; } @@ -2043,15 +2060,27 @@ void drain_all_stock(struct mem_cgroup *root_memcg) migrate_disable(); curcpu = smp_processor_id(); for_each_online_cpu(cpu) { - struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu); - bool flush = is_drain_needed(stock, root_memcg); + struct memcg_stock_pcp *memcg_st = &per_cpu(memcg_stock, cpu); + struct obj_stock_pcp *obj_st = &per_cpu(obj_stock, cpu); - if (flush && - !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) { + if (!test_bit(FLUSHING_CACHED_CHARGE, &memcg_st->flags) && + is_memcg_drain_needed(memcg_st, root_memcg) && + !test_and_set_bit(FLUSHING_CACHED_CHARGE, + &memcg_st->flags)) { if (cpu == curcpu) - drain_local_stock(&stock->work); + drain_local_memcg_stock(&memcg_st->work); else if (!cpu_is_isolated(cpu)) - schedule_work_on(cpu, &stock->work); + schedule_work_on(cpu, &memcg_st->work); + } + + if (!test_bit(FLUSHING_CACHED_CHARGE, &obj_st->flags) && + obj_stock_flush_required(obj_st, root_memcg) && + !test_and_set_bit(FLUSHING_CACHED_CHARGE, + &obj_st->flags)) { + if (cpu == curcpu) + drain_local_obj_stock(&obj_st->work); + else if (!cpu_is_isolated(cpu)) + schedule_work_on(cpu, &obj_st->work); } } migrate_enable(); @@ -2060,18 +2089,18 @@ void drain_all_stock(struct mem_cgroup *root_memcg) static int memcg_hotplug_cpu_dead(unsigned int cpu) { - struct memcg_stock_pcp *stock; + struct obj_stock_pcp *obj_st; unsigned long flags; - stock = &per_cpu(memcg_stock, cpu); + obj_st = &per_cpu(obj_stock, cpu); - /* drain_obj_stock requires obj_lock */ - local_lock_irqsave(&memcg_stock.obj_lock, flags); - drain_obj_stock(stock); - local_unlock_irqrestore(&memcg_stock.obj_lock, flags); + /* drain_obj_stock requires objstock.lock */ + local_lock_irqsave(&obj_stock.lock, flags); + drain_obj_stock(obj_st); + local_unlock_irqrestore(&obj_stock.lock, flags); /* no need for the local lock */ - drain_stock_fully(stock); + drain_stock_fully(&per_cpu(memcg_stock, cpu)); return 0; } @@ -2868,7 +2897,7 @@ void __memcg_kmem_uncharge_page(struct page *page, int order) } static void __account_obj_stock(struct obj_cgroup *objcg, - struct memcg_stock_pcp *stock, int nr, + struct obj_stock_pcp *stock, int nr, struct pglist_data *pgdat, enum node_stat_item idx) { int *bytes; @@ -2919,13 +2948,13 @@ static void __account_obj_stock(struct obj_cgroup *objcg, static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, struct pglist_data *pgdat, enum node_stat_item idx) { - struct memcg_stock_pcp *stock; + struct obj_stock_pcp *stock; unsigned long flags; bool ret = false; - local_lock_irqsave(&memcg_stock.obj_lock, flags); + local_lock_irqsave(&obj_stock.lock, flags); - stock = 
this_cpu_ptr(&memcg_stock); + stock = this_cpu_ptr(&obj_stock); if (objcg == READ_ONCE(stock->cached_objcg) && stock->nr_bytes >= nr_bytes) { stock->nr_bytes -= nr_bytes; ret = true; @@ -2934,12 +2963,12 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, __account_obj_stock(objcg, stock, nr_bytes, pgdat, idx); } - local_unlock_irqrestore(&memcg_stock.obj_lock, flags); + local_unlock_irqrestore(&obj_stock.lock, flags); return ret; } -static void drain_obj_stock(struct memcg_stock_pcp *stock) +static void drain_obj_stock(struct obj_stock_pcp *stock) { struct obj_cgroup *old = READ_ONCE(stock->cached_objcg); @@ -3000,32 +3029,35 @@ static void drain_obj_stock(struct memcg_stock_pcp *stock) obj_cgroup_put(old); } -static bool obj_stock_flush_required(struct memcg_stock_pcp *stock, +static bool obj_stock_flush_required(struct obj_stock_pcp *stock, struct mem_cgroup *root_memcg) { struct obj_cgroup *objcg = READ_ONCE(stock->cached_objcg); struct mem_cgroup *memcg; + bool flush = false; + rcu_read_lock(); if (objcg) { memcg = obj_cgroup_memcg(objcg); if (memcg && mem_cgroup_is_descendant(memcg, root_memcg)) - return true; + flush = true; } + rcu_read_unlock(); - return false; + return flush; } static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, bool allow_uncharge, int nr_acct, struct pglist_data *pgdat, enum node_stat_item idx) { - struct memcg_stock_pcp *stock; + struct obj_stock_pcp *stock; unsigned long flags; unsigned int nr_pages = 0; - local_lock_irqsave(&memcg_stock.obj_lock, flags); + local_lock_irqsave(&obj_stock.lock, flags); - stock = this_cpu_ptr(&memcg_stock); + stock = this_cpu_ptr(&obj_stock); if (READ_ONCE(stock->cached_objcg) != objcg) { /* reset if necessary */ drain_obj_stock(stock); obj_cgroup_get(objcg); @@ -3045,7 +3077,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, stock->nr_bytes &= (PAGE_SIZE - 1); } - local_unlock_irqrestore(&memcg_stock.obj_lock, flags); + local_unlock_irqrestore(&obj_stock.lock, flags); if (nr_pages) obj_cgroup_uncharge_pages(objcg, nr_pages); @@ -5164,9 +5196,12 @@ int __init mem_cgroup_init(void) cpuhp_setup_state_nocalls(CPUHP_MM_MEMCQ_DEAD, "mm/memctrl:dead", NULL, memcg_hotplug_cpu_dead); - for_each_possible_cpu(cpu) + for_each_possible_cpu(cpu) { INIT_WORK(&per_cpu_ptr(&memcg_stock, cpu)->work, - drain_local_stock); + drain_local_memcg_stock); + INIT_WORK(&per_cpu_ptr(&obj_stock, cpu)->work, + drain_local_obj_stock); + } memcg_size = struct_size_t(struct mem_cgroup, nodeinfo, nr_node_ids); memcg_cachep = kmem_cache_create("mem_cgroup", memcg_size, 0, -- 2.47.1 ^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH v2 3/3] memcg: no irq disable for memcg stock lock 2025-05-02 0:17 [PATCH v2 0/3] memcg: decouple memcg and objcg stocks Shakeel Butt 2025-05-02 0:17 ` [PATCH v2 1/3] memcg: separate local_trylock for memcg and obj Shakeel Butt 2025-05-02 0:17 ` [PATCH v2 2/3] memcg: completely decouple memcg and obj stocks Shakeel Butt @ 2025-05-02 0:17 ` Shakeel Butt 2025-05-02 18:29 ` Alexei Starovoitov 2 siblings, 1 reply; 12+ messages in thread From: Shakeel Butt @ 2025-05-02 0:17 UTC (permalink / raw) To: Andrew Morton Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song, Alexei Starovoitov, linux-mm, cgroups, bpf, linux-kernel, Meta kernel team, Vlastimil Babka There is no need to disable irqs to use memcg per-cpu stock, so let's just not do that. One consequence of this change is if the kernel while in task context has the memcg stock lock and that cpu got interrupted. The memcg charges on that cpu in the irq context will take the slow path of memcg charging. However that should be super rare and should be fine in general. Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Vlastimil Babka <vbabka@suse.cz> --- mm/memcontrol.c | 17 +++++++---------- 1 file changed, 7 insertions(+), 10 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index cd81c70d144b..f8b9c7aa6771 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1858,7 +1858,6 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages, { struct memcg_stock_pcp *stock; uint8_t stock_pages; - unsigned long flags; bool ret = false; int i; @@ -1866,8 +1865,8 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages, return ret; if (gfpflags_allow_spinning(gfp_mask)) - local_lock_irqsave(&memcg_stock.lock, flags); - else if (!local_trylock_irqsave(&memcg_stock.lock, flags)) + local_lock(&memcg_stock.lock); + else if (!local_trylock(&memcg_stock.lock)) return ret; stock = this_cpu_ptr(&memcg_stock); @@ -1884,7 +1883,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages, break; } - local_unlock_irqrestore(&memcg_stock.lock, flags); + local_unlock(&memcg_stock.lock); return ret; } @@ -1928,18 +1927,17 @@ static void drain_stock_fully(struct memcg_stock_pcp *stock) static void drain_local_memcg_stock(struct work_struct *dummy) { struct memcg_stock_pcp *stock; - unsigned long flags; if (WARN_ONCE(!in_task(), "drain in non-task context")) return; - local_lock_irqsave(&memcg_stock.lock, flags); + local_lock(&memcg_stock.lock); stock = this_cpu_ptr(&memcg_stock); drain_stock_fully(stock); clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags); - local_unlock_irqrestore(&memcg_stock.lock, flags); + local_unlock(&memcg_stock.lock); } static void drain_local_obj_stock(struct work_struct *dummy) @@ -1964,7 +1962,6 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) struct memcg_stock_pcp *stock; struct mem_cgroup *cached; uint8_t stock_pages; - unsigned long flags; bool success = false; int empty_slot = -1; int i; @@ -1979,7 +1976,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) VM_WARN_ON_ONCE(mem_cgroup_is_root(memcg)); if (nr_pages > MEMCG_CHARGE_BATCH || - !local_trylock_irqsave(&memcg_stock.lock, flags)) { + !local_trylock(&memcg_stock.lock)) { /* * In case of larger than batch refill or unlikely failure to * lock the percpu memcg_stock.lock, uncharge memcg directly. 
@@ -2014,7 +2011,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) WRITE_ONCE(stock->nr_pages[i], nr_pages); } - local_unlock_irqrestore(&memcg_stock.lock, flags); + local_unlock(&memcg_stock.lock); } static bool is_memcg_drain_needed(struct memcg_stock_pcp *stock, -- 2.47.1 ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH v2 3/3] memcg: no irq disable for memcg stock lock 2025-05-02 0:17 ` [PATCH v2 3/3] memcg: no irq disable for memcg stock lock Shakeel Butt @ 2025-05-02 18:29 ` Alexei Starovoitov 2025-05-02 23:03 ` Shakeel Butt 0 siblings, 1 reply; 12+ messages in thread From: Alexei Starovoitov @ 2025-05-02 18:29 UTC (permalink / raw) To: Shakeel Butt Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song, Alexei Starovoitov, linux-mm, open list:CONTROL GROUP (CGROUP), bpf, LKML, Meta kernel team, Vlastimil Babka On Thu, May 1, 2025 at 5:18 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > There is no need to disable irqs to use memcg per-cpu stock, so let's > just not do that. One consequence of this change is if the kernel while > in task context has the memcg stock lock and that cpu got interrupted. > The memcg charges on that cpu in the irq context will take the slow path > of memcg charging. However that should be super rare and should be fine > in general. > > Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> > Acked-by: Vlastimil Babka <vbabka@suse.cz> > --- > mm/memcontrol.c | 17 +++++++---------- > 1 file changed, 7 insertions(+), 10 deletions(-) > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index cd81c70d144b..f8b9c7aa6771 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -1858,7 +1858,6 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages, > { > struct memcg_stock_pcp *stock; > uint8_t stock_pages; > - unsigned long flags; > bool ret = false; > int i; > > @@ -1866,8 +1865,8 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages, > return ret; > > if (gfpflags_allow_spinning(gfp_mask)) > - local_lock_irqsave(&memcg_stock.lock, flags); > - else if (!local_trylock_irqsave(&memcg_stock.lock, flags)) > + local_lock(&memcg_stock.lock); > + else if (!local_trylock(&memcg_stock.lock)) > return ret; I don't think it works. When there is a normal irq and something doing regular GFP_NOWAIT allocation gfpflags_allow_spinning() will be true and local_lock() will reenter and complain that lock->acquired is already set... but only with lockdep on. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 3/3] memcg: no irq disable for memcg stock lock 2025-05-02 18:29 ` Alexei Starovoitov @ 2025-05-02 23:03 ` Shakeel Butt 2025-05-02 23:28 ` Alexei Starovoitov 2025-05-05 10:28 ` Vlastimil Babka 0 siblings, 2 replies; 12+ messages in thread From: Shakeel Butt @ 2025-05-02 23:03 UTC (permalink / raw) To: Alexei Starovoitov Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song, Alexei Starovoitov, linux-mm, open list:CONTROL GROUP (CGROUP), bpf, LKML, Meta kernel team, Vlastimil Babka On Fri, May 2, 2025 at 11:29 AM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Thu, May 1, 2025 at 5:18 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > There is no need to disable irqs to use memcg per-cpu stock, so let's > > just not do that. One consequence of this change is if the kernel while > > in task context has the memcg stock lock and that cpu got interrupted. > > The memcg charges on that cpu in the irq context will take the slow path > > of memcg charging. However that should be super rare and should be fine > > in general. > > > > Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> > > Acked-by: Vlastimil Babka <vbabka@suse.cz> > > --- > > mm/memcontrol.c | 17 +++++++---------- > > 1 file changed, 7 insertions(+), 10 deletions(-) > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index cd81c70d144b..f8b9c7aa6771 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -1858,7 +1858,6 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages, > > { > > struct memcg_stock_pcp *stock; > > uint8_t stock_pages; > > - unsigned long flags; > > bool ret = false; > > int i; > > > > @@ -1866,8 +1865,8 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages, > > return ret; > > > > if (gfpflags_allow_spinning(gfp_mask)) > > - local_lock_irqsave(&memcg_stock.lock, flags); > > - else if (!local_trylock_irqsave(&memcg_stock.lock, flags)) > > + local_lock(&memcg_stock.lock); > > + else if (!local_trylock(&memcg_stock.lock)) > > return ret; > > I don't think it works. > When there is a normal irq and something doing regular GFP_NOWAIT > allocation gfpflags_allow_spinning() will be true and > local_lock() will reenter and complain that lock->acquired is > already set... but only with lockdep on. Yes indeed. I dropped the first patch and didn't fix this one accordingly. I think the fix can be as simple as checking for in_task() here instead of gfp_mask. That should work for both RT and non-RT kernels. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 3/3] memcg: no irq disable for memcg stock lock 2025-05-02 23:03 ` Shakeel Butt @ 2025-05-02 23:28 ` Alexei Starovoitov 2025-05-02 23:40 ` Shakeel Butt 2025-05-05 10:28 ` Vlastimil Babka 1 sibling, 1 reply; 12+ messages in thread From: Alexei Starovoitov @ 2025-05-02 23:28 UTC (permalink / raw) To: Shakeel Butt Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song, Alexei Starovoitov, linux-mm, open list:CONTROL GROUP (CGROUP), bpf, LKML, Meta kernel team, Vlastimil Babka On Fri, May 2, 2025 at 4:03 PM Shakeel Butt <shakeel.butt@gmail.com> wrote: > > On Fri, May 2, 2025 at 11:29 AM Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: > > > > On Thu, May 1, 2025 at 5:18 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > > > There is no need to disable irqs to use memcg per-cpu stock, so let's > > > just not do that. One consequence of this change is if the kernel while > > > in task context has the memcg stock lock and that cpu got interrupted. > > > The memcg charges on that cpu in the irq context will take the slow path > > > of memcg charging. However that should be super rare and should be fine > > > in general. > > > > > > Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> > > > Acked-by: Vlastimil Babka <vbabka@suse.cz> > > > --- > > > mm/memcontrol.c | 17 +++++++---------- > > > 1 file changed, 7 insertions(+), 10 deletions(-) > > > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > > index cd81c70d144b..f8b9c7aa6771 100644 > > > --- a/mm/memcontrol.c > > > +++ b/mm/memcontrol.c > > > @@ -1858,7 +1858,6 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages, > > > { > > > struct memcg_stock_pcp *stock; > > > uint8_t stock_pages; > > > - unsigned long flags; > > > bool ret = false; > > > int i; > > > > > > @@ -1866,8 +1865,8 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages, > > > return ret; > > > > > > if (gfpflags_allow_spinning(gfp_mask)) > > > - local_lock_irqsave(&memcg_stock.lock, flags); > > > - else if (!local_trylock_irqsave(&memcg_stock.lock, flags)) > > > + local_lock(&memcg_stock.lock); > > > + else if (!local_trylock(&memcg_stock.lock)) > > > return ret; > > > > I don't think it works. > > When there is a normal irq and something doing regular GFP_NOWAIT > > allocation gfpflags_allow_spinning() will be true and > > local_lock() will reenter and complain that lock->acquired is > > already set... but only with lockdep on. > > Yes indeed. I dropped the first patch and didn't fix this one > accordingly. I think the fix can be as simple as checking for > in_task() here instead of gfp_mask. That should work for both RT and > non-RT kernels. Like: if (in_task()) local_lock(...); else if (!local_trylock(...)) Most of the networking runs in bh, so it will be using local_trylock() path which is probably ok in !PREEMPT_RT, but will cause random performance issues in PREEMP_RT, since rt_spin_trylock() will be randomly failing and taking slow path of charging. It's not going to cause permanent nginx 3x regression :), but unlucky slowdowns will be seen. A task can grab that per-cpu rt_spin_lock and preempted by network processing. ^ permalink raw reply [flat|nested] 12+ messages in thread
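To make the suggestion concrete, here is a sketch of how an
in_task()-gated consume_stock() could look. This is hypothetical -- it is
neither the posted v2 hunk nor necessarily the eventual fix -- and the
stock-array lookup is elided:

  static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
                            gfp_t gfp_mask)
  {
          bool ret = false;

          /* gfp_mask is no longer consulted for the locking decision */

          /*
           * Task context may block on the per-cpu lock (a sleeping
           * rt_spin_lock on PREEMPT_RT); any other context (hardirq,
           * softirq, nmi) only trylocks and falls back to the charging
           * slow path on failure.
           */
          if (in_task())
                  local_lock(&memcg_stock.lock);
          else if (!local_trylock(&memcg_stock.lock))
                  return ret;

          /* ... look up memcg in stock->cached[] and consume nr_pages ... */

          local_unlock(&memcg_stock.lock);
          return ret;
  }

The !in_task() leg is exactly where the concern above applies: networking
in bh would take the trylock path, which is fine on !PREEMPT_RT but on
PREEMPT_RT can fail whenever a task holding the per-cpu lock has been
preempted by the network processing, pushing those charges onto the slow
path.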
* Re: [PATCH v2 3/3] memcg: no irq disable for memcg stock lock 2025-05-02 23:28 ` Alexei Starovoitov @ 2025-05-02 23:40 ` Shakeel Butt 2025-05-05 9:06 ` Sebastian Andrzej Siewior 0 siblings, 1 reply; 12+ messages in thread From: Shakeel Butt @ 2025-05-02 23:40 UTC (permalink / raw) To: Alexei Starovoitov Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song, Alexei Starovoitov, linux-mm, open list:CONTROL GROUP (CGROUP), bpf, LKML, Meta kernel team, Vlastimil Babka, Sebastian Andrzej Siewior On Fri, May 2, 2025 at 4:28 PM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: [...] > > > > > > I don't think it works. > > > When there is a normal irq and something doing regular GFP_NOWAIT > > > allocation gfpflags_allow_spinning() will be true and > > > local_lock() will reenter and complain that lock->acquired is > > > already set... but only with lockdep on. > > > > Yes indeed. I dropped the first patch and didn't fix this one > > accordingly. I think the fix can be as simple as checking for > > in_task() here instead of gfp_mask. That should work for both RT and > > non-RT kernels. > > Like: > if (in_task()) > local_lock(...); > else if (!local_trylock(...)) > > Most of the networking runs in bh, so it will be using > local_trylock() path which is probably ok in !PREEMPT_RT, > but will cause random performance issues in PREEMP_RT, > since rt_spin_trylock() will be randomly failing and taking > slow path of charging. It's not going to cause permanent > nginx 3x regression :), but unlucky slowdowns will be seen. > A task can grab that per-cpu rt_spin_lock and preempted > by network processing. Does networking run in bh for PREEMPT_RT as well? I think I should get networking & RT folks opinion on this one. I will decouple this irq patch from the decoupling lock patches and start a separate discussion thread. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 3/3] memcg: no irq disable for memcg stock lock 2025-05-02 23:40 ` Shakeel Butt @ 2025-05-05 9:06 ` Sebastian Andrzej Siewior 0 siblings, 0 replies; 12+ messages in thread From: Sebastian Andrzej Siewior @ 2025-05-05 9:06 UTC (permalink / raw) To: Shakeel Butt Cc: Alexei Starovoitov, Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song, Alexei Starovoitov, linux-mm, open list:CONTROL GROUP (CGROUP), bpf, LKML, Meta kernel team, Vlastimil Babka On 2025-05-02 16:40:53 [-0700], Shakeel Butt wrote: > On Fri, May 2, 2025 at 4:28 PM Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: > [...] > > > > > > > > I don't think it works. > > > > When there is a normal irq and something doing regular GFP_NOWAIT > > > > allocation gfpflags_allow_spinning() will be true and > > > > local_lock() will reenter and complain that lock->acquired is > > > > already set... but only with lockdep on. > > > > > > Yes indeed. I dropped the first patch and didn't fix this one > > > accordingly. I think the fix can be as simple as checking for > > > in_task() here instead of gfp_mask. That should work for both RT and > > > non-RT kernels. > > > > Like: > > if (in_task()) > > local_lock(...); > > else if (!local_trylock(...)) > > > > Most of the networking runs in bh, so it will be using > > local_trylock() path which is probably ok in !PREEMPT_RT, > > but will cause random performance issues in PREEMP_RT, > > since rt_spin_trylock() will be randomly failing and taking > > slow path of charging. It's not going to cause permanent > > nginx 3x regression :), but unlucky slowdowns will be seen. > > A task can grab that per-cpu rt_spin_lock and preempted > > by network processing. > > Does networking run in bh for PREEMPT_RT as well? It does but BH is preemptible. > I think I should get networking & RT folks opinion on this one. I will > decouple this irq patch from the decoupling lock patches and start a > separate discussion thread. Sebastian ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 3/3] memcg: no irq disable for memcg stock lock 2025-05-02 23:03 ` Shakeel Butt 2025-05-02 23:28 ` Alexei Starovoitov @ 2025-05-05 10:28 ` Vlastimil Babka 2025-05-05 17:13 ` Shakeel Butt 1 sibling, 1 reply; 12+ messages in thread From: Vlastimil Babka @ 2025-05-05 10:28 UTC (permalink / raw) To: Shakeel Butt, Alexei Starovoitov Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song, Alexei Starovoitov, linux-mm, open list:CONTROL GROUP (CGROUP), bpf, LKML, Meta kernel team, Sebastian Andrzej Siewior On 5/3/25 01:03, Shakeel Butt wrote: >> > index cd81c70d144b..f8b9c7aa6771 100644 >> > --- a/mm/memcontrol.c >> > +++ b/mm/memcontrol.c >> > @@ -1858,7 +1858,6 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages, >> > { >> > struct memcg_stock_pcp *stock; >> > uint8_t stock_pages; >> > - unsigned long flags; >> > bool ret = false; >> > int i; >> > >> > @@ -1866,8 +1865,8 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages, >> > return ret; >> > >> > if (gfpflags_allow_spinning(gfp_mask)) >> > - local_lock_irqsave(&memcg_stock.lock, flags); >> > - else if (!local_trylock_irqsave(&memcg_stock.lock, flags)) >> > + local_lock(&memcg_stock.lock); >> > + else if (!local_trylock(&memcg_stock.lock)) >> > return ret; >> >> I don't think it works. >> When there is a normal irq and something doing regular GFP_NOWAIT >> allocation gfpflags_allow_spinning() will be true and >> local_lock() will reenter and complain that lock->acquired is >> already set... but only with lockdep on. > > Yes indeed. I dropped the first patch and didn't fix this one > accordingly. I think the fix can be as simple as checking for > in_task() here instead of gfp_mask. That should work for both RT and > non-RT kernels. These in_task() checks seem hacky to me. I think the patch 1 in v1 was the correct way how to use the local_trylock() to avoid these. As for the RT concerns, AFAIK RT isn't about being fast, but about being preemptible, and the v1 approach didn't violate that - taking the slowpaths more often shouldn't be an issue. Let me quote Shakeel's scenario from the v1 thread: > I didn't really think too much about PREEMPT_RT kernels as I assume > performance is not top priority but I think I get your point. Let me Agreed. > explain and correct me if I am wrong. On PREEMPT_RT kernel, the local > lock is a spin lock which is actually a mutex but with priority > inheritance. A task having the local lock can still get context switched Let's say (seems implied already) this is a low prio task. > (but will remain on same CPU run queue) and the newer task can try to And this is a high prio task. > acquire the memcg stock local lock. If we just do trylock, it will > always go to the slow path but if we do local_lock() then it will sleeps > and possibly gives its priority to the task owning the lock and possibly > make that task to get the CPU. Later the task slept on memcg stock lock > will wake up and go through fast path. I think from RT latency perspective it could very much be better for the high prio task just skip the fast path and go for the slowpath, instead of going to sleep while boosting the low prio task to let the high prio task use the fast path later. It's not really a fast path anymore I'd say. ^ permalink raw reply [flat|nested] 12+ messages in thread
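The v1 approach being referred to is not quoted in this thread; roughly,
it keeps a single unconditional trylock on the fast path so that no
context ever sleeps on, or priority-boosts the holder of, the per-cpu
lock (a sketch under that assumption, not the actual v1 diff):

          /*
           * Never spin or sleep on the per-cpu stock lock. If it is
           * already taken on this CPU -- an irq interrupting a task that
           * holds it, or on PREEMPT_RT a higher-priority task preempting
           * the holder -- skip the per-cpu cache and let the caller
           * charge via the slow path.
           */
          if (!local_trylock(&memcg_stock.lock))
                  return false;

On PREEMPT_RT this trades an occasional miss of the per-cpu cache for
never blocking in the charge path, which is the latency argument being
made here.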
* Re: [PATCH v2 3/3] memcg: no irq disable for memcg stock lock 2025-05-05 10:28 ` Vlastimil Babka @ 2025-05-05 17:13 ` Shakeel Butt 2025-05-05 20:49 ` Shakeel Butt 0 siblings, 1 reply; 12+ messages in thread From: Shakeel Butt @ 2025-05-05 17:13 UTC (permalink / raw) To: Vlastimil Babka Cc: Shakeel Butt, Alexei Starovoitov, Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song, Alexei Starovoitov, linux-mm, open list:CONTROL GROUP (CGROUP), bpf, LKML, Meta kernel team, Sebastian Andrzej Siewior, Eric Dumazet, Jakub Kicinski, Paolo Abeni, David S. Miller, netdev Ccing networking folks. Background: https://lore.kernel.org/dvyyqubghf67b3qsuoreegqk4qnuuqfkk7plpfhhrck5yeeuic@xbn4c6c7yc42/ On Mon, May 05, 2025 at 12:28:43PM +0200, Vlastimil Babka wrote: > On 5/3/25 01:03, Shakeel Butt wrote: > >> > index cd81c70d144b..f8b9c7aa6771 100644 > >> > --- a/mm/memcontrol.c > >> > +++ b/mm/memcontrol.c > >> > @@ -1858,7 +1858,6 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages, > >> > { > >> > struct memcg_stock_pcp *stock; > >> > uint8_t stock_pages; > >> > - unsigned long flags; > >> > bool ret = false; > >> > int i; > >> > > >> > @@ -1866,8 +1865,8 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages, > >> > return ret; > >> > > >> > if (gfpflags_allow_spinning(gfp_mask)) > >> > - local_lock_irqsave(&memcg_stock.lock, flags); > >> > - else if (!local_trylock_irqsave(&memcg_stock.lock, flags)) > >> > + local_lock(&memcg_stock.lock); > >> > + else if (!local_trylock(&memcg_stock.lock)) > >> > return ret; > >> > >> I don't think it works. > >> When there is a normal irq and something doing regular GFP_NOWAIT > >> allocation gfpflags_allow_spinning() will be true and > >> local_lock() will reenter and complain that lock->acquired is > >> already set... but only with lockdep on. > > > > Yes indeed. I dropped the first patch and didn't fix this one > > accordingly. I think the fix can be as simple as checking for > > in_task() here instead of gfp_mask. That should work for both RT and > > non-RT kernels. > > These in_task() checks seem hacky to me. I think the patch 1 in v1 was the > correct way how to use the local_trylock() to avoid these. > > As for the RT concerns, AFAIK RT isn't about being fast, but about being > preemptible, and the v1 approach didn't violate that - taking the slowpaths > more often shouldn't be an issue. > > Let me quote Shakeel's scenario from the v1 thread: > > > I didn't really think too much about PREEMPT_RT kernels as I assume > > performance is not top priority but I think I get your point. Let me > > Agreed. > > > explain and correct me if I am wrong. On PREEMPT_RT kernel, the local > > lock is a spin lock which is actually a mutex but with priority > > inheritance. A task having the local lock can still get context switched > > Let's say (seems implied already) this is a low prio task. > > > (but will remain on same CPU run queue) and the newer task can try to > > And this is a high prio task. > > > acquire the memcg stock local lock. If we just do trylock, it will > > always go to the slow path but if we do local_lock() then it will sleeps > > and possibly gives its priority to the task owning the lock and possibly > > make that task to get the CPU. Later the task slept on memcg stock lock > > will wake up and go through fast path. 
>
> I think from RT latency perspective it could very much be better for the
> high prio task just skip the fast path and go for the slowpath, instead of
> going to sleep while boosting the low prio task to let the high prio task
> use the fast path later. It's not really a fast path anymore I'd say.

Thanks Vlastimil, this is actually a very good point. The slow path of
memcg charging is a couple of atomic operations, while the alternative
here is at least two context switches (and possibly scheduler delay). So
it does not seem like a fast path anymore.

I have cc'ed networking folks to get their take as well. Orthogonally I
will do some netperf benchmarking on v1 with an RT kernel.
* Re: [PATCH v2 3/3] memcg: no irq disable for memcg stock lock 2025-05-05 17:13 ` Shakeel Butt @ 2025-05-05 20:49 ` Shakeel Butt 0 siblings, 0 replies; 12+ messages in thread From: Shakeel Butt @ 2025-05-05 20:49 UTC (permalink / raw) To: Vlastimil Babka Cc: Shakeel Butt, Alexei Starovoitov, Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song, Alexei Starovoitov, linux-mm, open list:CONTROL GROUP (CGROUP), bpf, LKML, Meta kernel team, Sebastian Andrzej Siewior, Eric Dumazet, Jakub Kicinski, Paolo Abeni, David S. Miller, netdev On Mon, May 05, 2025 at 10:13:37AM -0700, Shakeel Butt wrote: > Ccing networking folks. > > Background: https://lore.kernel.org/dvyyqubghf67b3qsuoreegqk4qnuuqfkk7plpfhhrck5yeeuic@xbn4c6c7yc42/ > > On Mon, May 05, 2025 at 12:28:43PM +0200, Vlastimil Babka wrote: > > On 5/3/25 01:03, Shakeel Butt wrote: > > >> > index cd81c70d144b..f8b9c7aa6771 100644 > > >> > --- a/mm/memcontrol.c > > >> > +++ b/mm/memcontrol.c > > >> > @@ -1858,7 +1858,6 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages, > > >> > { > > >> > struct memcg_stock_pcp *stock; > > >> > uint8_t stock_pages; > > >> > - unsigned long flags; > > >> > bool ret = false; > > >> > int i; > > >> > > > >> > @@ -1866,8 +1865,8 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages, > > >> > return ret; > > >> > > > >> > if (gfpflags_allow_spinning(gfp_mask)) > > >> > - local_lock_irqsave(&memcg_stock.lock, flags); > > >> > - else if (!local_trylock_irqsave(&memcg_stock.lock, flags)) > > >> > + local_lock(&memcg_stock.lock); > > >> > + else if (!local_trylock(&memcg_stock.lock)) > > >> > return ret; > > >> > > >> I don't think it works. > > >> When there is a normal irq and something doing regular GFP_NOWAIT > > >> allocation gfpflags_allow_spinning() will be true and > > >> local_lock() will reenter and complain that lock->acquired is > > >> already set... but only with lockdep on. > > > > > > Yes indeed. I dropped the first patch and didn't fix this one > > > accordingly. I think the fix can be as simple as checking for > > > in_task() here instead of gfp_mask. That should work for both RT and > > > non-RT kernels. > > > > These in_task() checks seem hacky to me. I think the patch 1 in v1 was the > > correct way how to use the local_trylock() to avoid these. > > > > As for the RT concerns, AFAIK RT isn't about being fast, but about being > > preemptible, and the v1 approach didn't violate that - taking the slowpaths > > more often shouldn't be an issue. > > > > Let me quote Shakeel's scenario from the v1 thread: > > > > > I didn't really think too much about PREEMPT_RT kernels as I assume > > > performance is not top priority but I think I get your point. Let me > > > > Agreed. > > > > > explain and correct me if I am wrong. On PREEMPT_RT kernel, the local > > > lock is a spin lock which is actually a mutex but with priority > > > inheritance. A task having the local lock can still get context switched > > > > Let's say (seems implied already) this is a low prio task. > > > > > (but will remain on same CPU run queue) and the newer task can try to > > > > And this is a high prio task. > > > > > acquire the memcg stock local lock. If we just do trylock, it will > > > always go to the slow path but if we do local_lock() then it will sleeps > > > and possibly gives its priority to the task owning the lock and possibly > > > make that task to get the CPU. Later the task slept on memcg stock lock > > > will wake up and go through fast path. 
> >
> > I think from RT latency perspective it could very much be better for the
> > high prio task just skip the fast path and go for the slowpath, instead of
> > going to sleep while boosting the low prio task to let the high prio task
> > use the fast path later. It's not really a fast path anymore I'd say.
>
> Thanks Vlastimil, this is actually a very good point. The slow path of
> memcg charging is a couple of atomic operations, while the alternative
> here is at least two context switches (and possibly scheduler delay). So
> it does not seem like a fast path anymore.
>
> I have cc'ed networking folks to get their take as well. Orthogonally I
> will do some netperf benchmarking on v1 with an RT kernel.

Let me share the results with the PREEMPT_RT config on next-20250505,
with and without v1 of this series. I ran a varying number of netperf
clients in different cgroups on a 72 CPU machine.

 $ netserver -6
 $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

 number of clients | Without series | With series
                 6 |  38559.1 Mbps  | 38652.6 Mbps
                12 |  37388.8 Mbps  | 37560.1 Mbps
                18 |  30707.5 Mbps  | 31378.3 Mbps
                24 |  25908.4 Mbps  | 26423.9 Mbps
                30 |  22347.7 Mbps  | 22326.5 Mbps
                36 |  20235.1 Mbps  | 20165.0 Mbps

I don't see any significant performance difference for the network
intensive workload with this series. I am going to send out v3, which
will be a rebased version of v1 with all these details, unless someone
has concerns about this.