* [PATCH RFC 0/4] memcg_kmem hooks refactoring and kmem_cache_charge()
@ 2024-03-01 17:07 Vlastimil Babka
2024-03-01 17:07 ` [PATCH RFC 1/4] mm, slab: move memcg charging to post-alloc hook Vlastimil Babka
` (3 more replies)
0 siblings, 4 replies; 21+ messages in thread
From: Vlastimil Babka @ 2024-03-01 17:07 UTC (permalink / raw)
To: Linus Torvalds, Josh Poimboeuf, Jeff Layton, Chuck Lever,
Kees Cook, Christoph Lameter, Pekka Enberg, David Rientjes,
Joonsoo Kim, Andrew Morton, Roman Gushchin, Hyeonggon Yoo,
Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song,
Alexander Viro, Christian Brauner, Jan Kara
Cc: linux-mm, linux-kernel, cgroups, linux-fsdevel, Vlastimil Babka
Hi,
I have tried to look into Linus's suggestions to reduce slab memcg
accounting overhead [1] [2].
The reorganized hooks are in Patch 1 and it definitely seems like a nice
cleanup on its own.
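Roughly, the allocation path after Patch 1 looks like the sketch below.
This is simplified, alloc_from_slab() is just a placeholder for the real
fastpath/slowpath, so see the patch itself for the actual code:

        static void *sketch_alloc(struct kmem_cache *s, struct list_lru *lru,
                                  gfp_t gfp, size_t orig_size)
        {
                void *object;
                bool init;

                /* only debug and failslab checks remain here, no memcg work */
                s = slab_pre_alloc_hook(s, gfp);
                if (!s)
                        return NULL;

                /* optimistically allocate the object first */
                object = alloc_from_slab(s, gfp); /* placeholder for the real fastpath */

                init = slab_want_init_on_alloc(gfp, s);

                /*
                 * A single post-alloc hook now charges the objcg and assigns
                 * the objcg pointer. If the (rare) charge fails, it frees the
                 * object back and sets it to NULL, so the result can be
                 * returned as is.
                 */
                slab_post_alloc_hook(s, lru, gfp, 1, &object, init, orig_size);

                return object;
        }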
In Patch 2 I have tried to move them to mm/memcontrol.c to reduce calls
into memcg code. I hoped to see better performance from that, but
probably didn't get any.
Patch 3 introduces the suggested kmem_cache_charge() API and Patch 4
tries to use it for the testcase in [1], but that one is unfinished due
to my lack of VFS knowledge.
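To illustrate where patches 3+4 are heading, the open() path could look
roughly like the sketch below. The kmem_cache_charge() signature, the
do_the_open() helper and the way the initial allocation avoids being
accounted are all just illustrative assumptions here, not what the
patches do in detail:

        #include <linux/err.h>
        #include <linux/fs.h>
        #include <linux/slab.h>

        static struct file *sketch_open(struct kmem_cache *cache)
        {
                struct file *f;
                int err;

                /* allocate without charging a memcg (how exactly is glossed over) */
                f = kmem_cache_alloc(cache, GFP_KERNEL);
                if (!f)
                        return ERR_PTR(-ENOMEM);

                /* do_the_open() stands in for the actual open/lookup work */
                err = do_the_open(f);
                if (err) {
                        /* a failed open never touches memcg accounting at all */
                        kmem_cache_free(cache, f);
                        return ERR_PTR(err);
                }

                /* charge only the objects that survive the open */
                err = kmem_cache_charge(cache, GFP_KERNEL, f);
                if (err) {
                        kmem_cache_free(cache, f);
                        return ERR_PTR(err);
                }

                return f;
        }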
I haven't done much benchmarking yet, just in a guest VM on my desktop
for the test case from [1]. Applying patches 1+2 might have improved it
slightly, but that could be noise. With 3+4 the memcg overhead is gone as
expected (the charging never happens), but due to the unfinished state I
don't know yet whether the separation might hurt cases where the open()
actually succeeds.
Anyway, I thought I would share this already so others can play with it
and see if it's a good direction to pursue (with patches 3+4). I think
Patch 1 should be good to apply in any case (after more testing and
review); I'm not yet sure about Patch 2.
[1] https://lore.kernel.org/all/CAHk-=whYOOdM7jWy5jdrAm8LxcgCMFyk2bt8fYYvZzM4U-zAQA@mail.gmail.com/
[2] https://lore.kernel.org/all/CAHk-=whw936qzDLBQdUz-He5WK_0fRSWwKAjtbVsMGfX70Nf_Q@mail.gmail.com/
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
Vlastimil Babka (4):
mm, slab: move memcg charging to post-alloc hook
mm, slab: move slab_memcg hooks to mm/memcontrol.c
mm, slab: introduce kmem_cache_charge()
UNFINISHED mm, fs: use kmem_cache_charge() in path_openat()
fs/file_table.c | 9 +-
fs/internal.h | 1 +
fs/namei.c | 4 +-
include/linux/slab.h | 10 +++
mm/memcontrol.c | 90 ++++++++++++++++++++
mm/slab.h | 10 +++
mm/slub.c | 231 +++++++++++++++------------------------------------
7 files changed, 188 insertions(+), 167 deletions(-)
---
base-commit: 6613476e225e090cc9aad49be7fa504e290dd33d
change-id: 20240229-slab-memcg-ae6b3789c924
Best regards,
--
Vlastimil Babka <vbabka@suse.cz>
^ permalink raw reply [flat|nested] 21+ messages in thread* [PATCH RFC 1/4] mm, slab: move memcg charging to post-alloc hook 2024-03-01 17:07 [PATCH RFC 0/4] memcg_kmem hooks refactoring and kmem_cache_charge() Vlastimil Babka @ 2024-03-01 17:07 ` Vlastimil Babka 2024-03-12 18:52 ` Roman Gushchin 2024-03-15 3:23 ` Chengming Zhou 2024-03-01 17:07 ` [PATCH RFC 2/4] mm, slab: move slab_memcg hooks to mm/memcontrol.c Vlastimil Babka ` (2 subsequent siblings) 3 siblings, 2 replies; 21+ messages in thread From: Vlastimil Babka @ 2024-03-01 17:07 UTC (permalink / raw) To: Linus Torvalds, Josh Poimboeuf, Jeff Layton, Chuck Lever, Kees Cook, Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton, Roman Gushchin, Hyeonggon Yoo, Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song, Alexander Viro, Christian Brauner, Jan Kara Cc: linux-mm, linux-kernel, cgroups, linux-fsdevel, Vlastimil Babka The MEMCG_KMEM integration with slab currently relies on two hooks during allocation. memcg_slab_pre_alloc_hook() determines the objcg and charges it, and memcg_slab_post_alloc_hook() assigns the objcg pointer to the allocated object(s). As Linus pointed out, this is unnecessarily complex. Failing to charge due to memcg limits should be rare, so we can optimistically allocate the object(s) and do the charging together with assigning the objcg pointer in a single post_alloc hook. In the rare case the charging fails, we can free the object(s) back. This simplifies the code (no need to pass around the objcg pointer) and potentially allows to separate charging from allocation in cases where it's common that the allocation would be immediately freed, and the memcg handling overhead could be saved. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/all/CAHk-=whYOOdM7jWy5jdrAm8LxcgCMFyk2bt8fYYvZzM4U-zAQA@mail.gmail.com/ Signed-off-by: Vlastimil Babka <vbabka@suse.cz> --- mm/slub.c | 180 +++++++++++++++++++++++++++----------------------------------- 1 file changed, 77 insertions(+), 103 deletions(-) diff --git a/mm/slub.c b/mm/slub.c index 2ef88bbf56a3..7022a1246bab 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1897,23 +1897,36 @@ static inline size_t obj_full_size(struct kmem_cache *s) return s->size + sizeof(struct obj_cgroup *); } -/* - * Returns false if the allocation should fail. - */ -static bool __memcg_slab_pre_alloc_hook(struct kmem_cache *s, - struct list_lru *lru, - struct obj_cgroup **objcgp, - size_t objects, gfp_t flags) +static bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, + struct list_lru *lru, + gfp_t flags, size_t size, + void **p) { + struct obj_cgroup *objcg; + struct slab *slab; + unsigned long off; + size_t i; + /* * The obtained objcg pointer is safe to use within the current scope, * defined by current task or set_active_memcg() pair. * obj_cgroup_get() is used to get a permanent reference. */ - struct obj_cgroup *objcg = current_obj_cgroup(); + objcg = current_obj_cgroup(); if (!objcg) return true; + /* + * slab_alloc_node() avoids the NULL check, so we might be called with a + * single NULL object. kmem_cache_alloc_bulk() aborts if it can't fill + * the whole requested size. 
+ * return success as there's nothing to free back + */ + if (unlikely(*p == NULL)) + return true; + + flags &= gfp_allowed_mask; + if (lru) { int ret; struct mem_cgroup *memcg; @@ -1926,71 +1939,51 @@ static bool __memcg_slab_pre_alloc_hook(struct kmem_cache *s, return false; } - if (obj_cgroup_charge(objcg, flags, objects * obj_full_size(s))) + if (obj_cgroup_charge(objcg, flags, size * obj_full_size(s))) return false; - *objcgp = objcg; + for (i = 0; i < size; i++) { + slab = virt_to_slab(p[i]); + + if (!slab_objcgs(slab) && + memcg_alloc_slab_cgroups(slab, s, flags, false)) { + obj_cgroup_uncharge(objcg, obj_full_size(s)); + continue; + } + + off = obj_to_index(s, slab, p[i]); + obj_cgroup_get(objcg); + slab_objcgs(slab)[off] = objcg; + mod_objcg_state(objcg, slab_pgdat(slab), + cache_vmstat_idx(s), obj_full_size(s)); + } + return true; } -/* - * Returns false if the allocation should fail. - */ +static void memcg_alloc_abort_single(struct kmem_cache *s, void *object); + static __fastpath_inline -bool memcg_slab_pre_alloc_hook(struct kmem_cache *s, struct list_lru *lru, - struct obj_cgroup **objcgp, size_t objects, - gfp_t flags) +bool memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru, + gfp_t flags, size_t size, void **p) { - if (!memcg_kmem_online()) + if (likely(!memcg_kmem_online())) return true; if (likely(!(flags & __GFP_ACCOUNT) && !(s->flags & SLAB_ACCOUNT))) return true; - return likely(__memcg_slab_pre_alloc_hook(s, lru, objcgp, objects, - flags)); -} - -static void __memcg_slab_post_alloc_hook(struct kmem_cache *s, - struct obj_cgroup *objcg, - gfp_t flags, size_t size, - void **p) -{ - struct slab *slab; - unsigned long off; - size_t i; - - flags &= gfp_allowed_mask; - - for (i = 0; i < size; i++) { - if (likely(p[i])) { - slab = virt_to_slab(p[i]); - - if (!slab_objcgs(slab) && - memcg_alloc_slab_cgroups(slab, s, flags, false)) { - obj_cgroup_uncharge(objcg, obj_full_size(s)); - continue; - } + if (likely(__memcg_slab_post_alloc_hook(s, lru, flags, size, p))) + return true; - off = obj_to_index(s, slab, p[i]); - obj_cgroup_get(objcg); - slab_objcgs(slab)[off] = objcg; - mod_objcg_state(objcg, slab_pgdat(slab), - cache_vmstat_idx(s), obj_full_size(s)); - } else { - obj_cgroup_uncharge(objcg, obj_full_size(s)); - } + if (likely(size == 1)) { + memcg_alloc_abort_single(s, p); + *p = NULL; + } else { + kmem_cache_free_bulk(s, size, p); } -} - -static __fastpath_inline -void memcg_slab_post_alloc_hook(struct kmem_cache *s, struct obj_cgroup *objcg, - gfp_t flags, size_t size, void **p) -{ - if (likely(!memcg_kmem_online() || !objcg)) - return; - return __memcg_slab_post_alloc_hook(s, objcg, flags, size, p); + return false; } static void __memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab, @@ -2029,14 +2022,6 @@ void memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab, void **p, __memcg_slab_free_hook(s, slab, p, objects, objcgs); } - -static inline -void memcg_slab_alloc_error_hook(struct kmem_cache *s, int objects, - struct obj_cgroup *objcg) -{ - if (objcg) - obj_cgroup_uncharge(objcg, objects * obj_full_size(s)); -} #else /* CONFIG_MEMCG_KMEM */ static inline struct mem_cgroup *memcg_from_slab_obj(void *ptr) { @@ -2047,31 +2032,18 @@ static inline void memcg_free_slab_cgroups(struct slab *slab) { } -static inline bool memcg_slab_pre_alloc_hook(struct kmem_cache *s, - struct list_lru *lru, - struct obj_cgroup **objcgp, - size_t objects, gfp_t flags) -{ - return true; -} - -static inline void memcg_slab_post_alloc_hook(struct kmem_cache 
*s, - struct obj_cgroup *objcg, +static inline bool memcg_slab_post_alloc_hook(struct kmem_cache *s, + struct list_lru *lru, gfp_t flags, size_t size, void **p) { + return true; } static inline void memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab, void **p, int objects) { } - -static inline -void memcg_slab_alloc_error_hook(struct kmem_cache *s, int objects, - struct obj_cgroup *objcg) -{ -} #endif /* CONFIG_MEMCG_KMEM */ /* @@ -3751,10 +3723,7 @@ noinline int should_failslab(struct kmem_cache *s, gfp_t gfpflags) ALLOW_ERROR_INJECTION(should_failslab, ERRNO); static __fastpath_inline -struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s, - struct list_lru *lru, - struct obj_cgroup **objcgp, - size_t size, gfp_t flags) +struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s, gfp_t flags) { flags &= gfp_allowed_mask; @@ -3763,14 +3732,11 @@ struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s, if (unlikely(should_failslab(s, flags))) return NULL; - if (unlikely(!memcg_slab_pre_alloc_hook(s, lru, objcgp, size, flags))) - return NULL; - return s; } static __fastpath_inline -void slab_post_alloc_hook(struct kmem_cache *s, struct obj_cgroup *objcg, +bool slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru, gfp_t flags, size_t size, void **p, bool init, unsigned int orig_size) { @@ -3819,7 +3785,7 @@ void slab_post_alloc_hook(struct kmem_cache *s, struct obj_cgroup *objcg, kmsan_slab_alloc(s, p[i], init_flags); } - memcg_slab_post_alloc_hook(s, objcg, flags, size, p); + return memcg_slab_post_alloc_hook(s, lru, flags, size, p); } /* @@ -3836,10 +3802,9 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list gfp_t gfpflags, int node, unsigned long addr, size_t orig_size) { void *object; - struct obj_cgroup *objcg = NULL; bool init = false; - s = slab_pre_alloc_hook(s, lru, &objcg, 1, gfpflags); + s = slab_pre_alloc_hook(s, gfpflags); if (unlikely(!s)) return NULL; @@ -3856,8 +3821,10 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list /* * When init equals 'true', like for kzalloc() family, only * @orig_size bytes might be zeroed instead of s->object_size + * In case this fails due to memcg_slab_post_alloc_hook(), + * object is set to NULL */ - slab_post_alloc_hook(s, objcg, gfpflags, 1, &object, init, orig_size); + slab_post_alloc_hook(s, lru, gfpflags, 1, &object, init, orig_size); return object; } @@ -4300,6 +4267,16 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object, do_slab_free(s, slab, object, object, 1, addr); } +#ifdef CONFIG_MEMCG_KMEM +/* Do not inline the rare memcg charging failed path into the allocation path */ +static noinline +void memcg_alloc_abort_single(struct kmem_cache *s, void *object) +{ + if (likely(slab_free_hook(s, object, slab_want_init_on_free(s)))) + do_slab_free(s, virt_to_slab(object), object, object, 1, _RET_IP_); +} +#endif + static __fastpath_inline void slab_free_bulk(struct kmem_cache *s, struct slab *slab, void *head, void *tail, void **p, int cnt, unsigned long addr) @@ -4635,29 +4612,26 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size, void **p) { int i; - struct obj_cgroup *objcg = NULL; if (!size) return 0; - /* memcg and kmem_cache debug support */ - s = slab_pre_alloc_hook(s, NULL, &objcg, size, flags); + s = slab_pre_alloc_hook(s, flags); if (unlikely(!s)) return 0; i = __kmem_cache_alloc_bulk(s, flags, size, p); + if (unlikely(i == 0)) + return 0; /* * memcg and kmem_cache debug support and memory 
initialization. * Done outside of the IRQ disabled fastpath loop. */ - if (likely(i != 0)) { - slab_post_alloc_hook(s, objcg, flags, size, p, - slab_want_init_on_alloc(flags, s), s->object_size); - } else { - memcg_slab_alloc_error_hook(s, size, objcg); + if (unlikely(!slab_post_alloc_hook(s, NULL, flags, size, p, + slab_want_init_on_alloc(flags, s), s->object_size))) { + return 0; } - return i; } EXPORT_SYMBOL(kmem_cache_alloc_bulk); -- 2.44.0 ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH RFC 1/4] mm, slab: move memcg charging to post-alloc hook 2024-03-01 17:07 ` [PATCH RFC 1/4] mm, slab: move memcg charging to post-alloc hook Vlastimil Babka @ 2024-03-12 18:52 ` Roman Gushchin 2024-03-12 18:59 ` Matthew Wilcox 2024-03-13 10:55 ` Vlastimil Babka 2024-03-15 3:23 ` Chengming Zhou 1 sibling, 2 replies; 21+ messages in thread From: Roman Gushchin @ 2024-03-12 18:52 UTC (permalink / raw) To: Vlastimil Babka Cc: Linus Torvalds, Josh Poimboeuf, Jeff Layton, Chuck Lever, Kees Cook, Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton, Hyeonggon Yoo, Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song, Alexander Viro, Christian Brauner, Jan Kara, linux-mm, linux-kernel, cgroups, linux-fsdevel On Fri, Mar 01, 2024 at 06:07:08PM +0100, Vlastimil Babka wrote: > The MEMCG_KMEM integration with slab currently relies on two hooks > during allocation. memcg_slab_pre_alloc_hook() determines the objcg and > charges it, and memcg_slab_post_alloc_hook() assigns the objcg pointer > to the allocated object(s). > > As Linus pointed out, this is unnecessarily complex. Failing to charge > due to memcg limits should be rare, so we can optimistically allocate > the object(s) and do the charging together with assigning the objcg > pointer in a single post_alloc hook. In the rare case the charging > fails, we can free the object(s) back. > > This simplifies the code (no need to pass around the objcg pointer) and > potentially allows to separate charging from allocation in cases where > it's common that the allocation would be immediately freed, and the > memcg handling overhead could be saved. > > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> > Link: https://lore.kernel.org/all/CAHk-=whYOOdM7jWy5jdrAm8LxcgCMFyk2bt8fYYvZzM4U-zAQA@mail.gmail.com/ > Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Nice cleanup, Vlastimil! Couple of small nits below, but otherwise, please, add my Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Thanks! --- > mm/slub.c | 180 +++++++++++++++++++++++++++----------------------------------- > 1 file changed, 77 insertions(+), 103 deletions(-) > > diff --git a/mm/slub.c b/mm/slub.c > index 2ef88bbf56a3..7022a1246bab 100644 > --- a/mm/slub.c > +++ b/mm/slub.c > @@ -1897,23 +1897,36 @@ static inline size_t obj_full_size(struct kmem_cache *s) > return s->size + sizeof(struct obj_cgroup *); > } > > -/* > - * Returns false if the allocation should fail. > - */ > -static bool __memcg_slab_pre_alloc_hook(struct kmem_cache *s, > - struct list_lru *lru, > - struct obj_cgroup **objcgp, > - size_t objects, gfp_t flags) > +static bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, > + struct list_lru *lru, > + gfp_t flags, size_t size, > + void **p) > { > + struct obj_cgroup *objcg; > + struct slab *slab; > + unsigned long off; > + size_t i; > + > /* > * The obtained objcg pointer is safe to use within the current scope, > * defined by current task or set_active_memcg() pair. > * obj_cgroup_get() is used to get a permanent reference. > */ > - struct obj_cgroup *objcg = current_obj_cgroup(); > + objcg = current_obj_cgroup(); > if (!objcg) > return true; > > + /* > + * slab_alloc_node() avoids the NULL check, so we might be called with a > + * single NULL object. kmem_cache_alloc_bulk() aborts if it can't fill > + * the whole requested size. > + * return success as there's nothing to free back > + */ > + if (unlikely(*p == NULL)) > + return true; Probably better to move this check up? 
current_obj_cgroup() != NULL check is more expensive. > + > + flags &= gfp_allowed_mask; > + > if (lru) { > int ret; > struct mem_cgroup *memcg; > @@ -1926,71 +1939,51 @@ static bool __memcg_slab_pre_alloc_hook(struct kmem_cache *s, > return false; > } > > - if (obj_cgroup_charge(objcg, flags, objects * obj_full_size(s))) > + if (obj_cgroup_charge(objcg, flags, size * obj_full_size(s))) > return false; > > - *objcgp = objcg; > + for (i = 0; i < size; i++) { > + slab = virt_to_slab(p[i]); Not specific to this change, but I wonder if it makes sense to introduce virt_to_slab() variant without any extra checks for this and similar cases, where we know for sure that p resides on a slab page. What do you think? ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH RFC 1/4] mm, slab: move memcg charging to post-alloc hook 2024-03-12 18:52 ` Roman Gushchin @ 2024-03-12 18:59 ` Matthew Wilcox 2024-03-12 20:35 ` Roman Gushchin 2024-03-13 10:55 ` Vlastimil Babka 1 sibling, 1 reply; 21+ messages in thread From: Matthew Wilcox @ 2024-03-12 18:59 UTC (permalink / raw) To: Roman Gushchin Cc: Vlastimil Babka, Linus Torvalds, Josh Poimboeuf, Jeff Layton, Chuck Lever, Kees Cook, Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton, Hyeonggon Yoo, Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song, Alexander Viro, Christian Brauner, Jan Kara, linux-mm, linux-kernel, cgroups, linux-fsdevel On Tue, Mar 12, 2024 at 11:52:46AM -0700, Roman Gushchin wrote: > On Fri, Mar 01, 2024 at 06:07:08PM +0100, Vlastimil Babka wrote: > > @@ -1926,71 +1939,51 @@ static bool __memcg_slab_pre_alloc_hook(struct kmem_cache *s, > > return false; > > } > > > > - if (obj_cgroup_charge(objcg, flags, objects * obj_full_size(s))) > > + if (obj_cgroup_charge(objcg, flags, size * obj_full_size(s))) > > return false; > > > > - *objcgp = objcg; > > + for (i = 0; i < size; i++) { > > + slab = virt_to_slab(p[i]); > > Not specific to this change, but I wonder if it makes sense to introduce virt_to_slab() > variant without any extra checks for this and similar cases, where we know for sure > that p resides on a slab page. What do you think? You'd only save a single test_bit() ... is it really worth doing? Cache misses are the expensive thing, not instructions. And debugging time: if somehow p[i] becomes not-on-a-slab-anymore, getting a NULL pointer splat here before we go any further might be worth all the CPU time wasted doing that test_bit(). ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH RFC 1/4] mm, slab: move memcg charging to post-alloc hook 2024-03-12 18:59 ` Matthew Wilcox @ 2024-03-12 20:35 ` Roman Gushchin 0 siblings, 0 replies; 21+ messages in thread From: Roman Gushchin @ 2024-03-12 20:35 UTC (permalink / raw) To: Matthew Wilcox Cc: Vlastimil Babka, Linus Torvalds, Josh Poimboeuf, Jeff Layton, Chuck Lever, Kees Cook, Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton, Hyeonggon Yoo, Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song, Alexander Viro, Christian Brauner, Jan Kara, linux-mm, linux-kernel, cgroups, linux-fsdevel On Tue, Mar 12, 2024 at 06:59:37PM +0000, Matthew Wilcox wrote: > On Tue, Mar 12, 2024 at 11:52:46AM -0700, Roman Gushchin wrote: > > On Fri, Mar 01, 2024 at 06:07:08PM +0100, Vlastimil Babka wrote: > > > @@ -1926,71 +1939,51 @@ static bool __memcg_slab_pre_alloc_hook(struct kmem_cache *s, > > > return false; > > > } > > > > > > - if (obj_cgroup_charge(objcg, flags, objects * obj_full_size(s))) > > > + if (obj_cgroup_charge(objcg, flags, size * obj_full_size(s))) > > > return false; > > > > > > - *objcgp = objcg; > > > + for (i = 0; i < size; i++) { > > > + slab = virt_to_slab(p[i]); > > > > Not specific to this change, but I wonder if it makes sense to introduce virt_to_slab() > > variant without any extra checks for this and similar cases, where we know for sure > > that p resides on a slab page. What do you think? > > You'd only save a single test_bit() ... is it really worth doing? > Cache misses are the expensive thing, not instructions. I agree here, unlikely it will produce a significant difference. > And debugging > time: if somehow p[i] becomes not-on-a-slab-anymore, getting a NULL > pointer splat here before we go any further might be worth all the CPU > time wasted doing that test_bit(). Well, Idk if it's a feasible concern here, hard to imagine how p[i] wouldn't belong to a slab page without something like a major memory corruption. Overall I agree it's not a big deal and the current code is fine. Thanks! ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH RFC 1/4] mm, slab: move memcg charging to post-alloc hook 2024-03-12 18:52 ` Roman Gushchin 2024-03-12 18:59 ` Matthew Wilcox @ 2024-03-13 10:55 ` Vlastimil Babka 2024-03-13 17:34 ` Roman Gushchin 1 sibling, 1 reply; 21+ messages in thread From: Vlastimil Babka @ 2024-03-13 10:55 UTC (permalink / raw) To: Roman Gushchin Cc: Linus Torvalds, Josh Poimboeuf, Jeff Layton, Chuck Lever, Kees Cook, Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton, Hyeonggon Yoo, Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song, Alexander Viro, Christian Brauner, Jan Kara, linux-mm, linux-kernel, cgroups, linux-fsdevel On 3/12/24 19:52, Roman Gushchin wrote: > On Fri, Mar 01, 2024 at 06:07:08PM +0100, Vlastimil Babka wrote: >> The MEMCG_KMEM integration with slab currently relies on two hooks >> during allocation. memcg_slab_pre_alloc_hook() determines the objcg and >> charges it, and memcg_slab_post_alloc_hook() assigns the objcg pointer >> to the allocated object(s). >> >> As Linus pointed out, this is unnecessarily complex. Failing to charge >> due to memcg limits should be rare, so we can optimistically allocate >> the object(s) and do the charging together with assigning the objcg >> pointer in a single post_alloc hook. In the rare case the charging >> fails, we can free the object(s) back. >> >> This simplifies the code (no need to pass around the objcg pointer) and >> potentially allows to separate charging from allocation in cases where >> it's common that the allocation would be immediately freed, and the >> memcg handling overhead could be saved. >> >> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> >> Link: https://lore.kernel.org/all/CAHk-=whYOOdM7jWy5jdrAm8LxcgCMFyk2bt8fYYvZzM4U-zAQA@mail.gmail.com/ >> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> > > Nice cleanup, Vlastimil! > Couple of small nits below, but otherwise, please, add my > > Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Thanks! >> /* >> * The obtained objcg pointer is safe to use within the current scope, >> * defined by current task or set_active_memcg() pair. >> * obj_cgroup_get() is used to get a permanent reference. >> */ >> - struct obj_cgroup *objcg = current_obj_cgroup(); >> + objcg = current_obj_cgroup(); >> if (!objcg) >> return true; >> >> + /* >> + * slab_alloc_node() avoids the NULL check, so we might be called with a >> + * single NULL object. kmem_cache_alloc_bulk() aborts if it can't fill >> + * the whole requested size. >> + * return success as there's nothing to free back >> + */ >> + if (unlikely(*p == NULL)) >> + return true; > > Probably better to move this check up? current_obj_cgroup() != NULL check is more > expensive. It probably doesn't matter in practice anyway, but my thinking was that *p == NULL is so rare (the object allocation failed) it shouldn't matter that we did current_obj_cgroup() uselessly in case it happens. OTOH current_obj_cgroup() returning NULL is not that rare (?) so it could be useful to not check *p in those cases? 
>> + >> + flags &= gfp_allowed_mask; >> + >> if (lru) { >> int ret; >> struct mem_cgroup *memcg; >> @@ -1926,71 +1939,51 @@ static bool __memcg_slab_pre_alloc_hook(struct kmem_cache *s, >> return false; >> } >> >> - if (obj_cgroup_charge(objcg, flags, objects * obj_full_size(s))) >> + if (obj_cgroup_charge(objcg, flags, size * obj_full_size(s))) >> return false; >> >> - *objcgp = objcg; >> + for (i = 0; i < size; i++) { >> + slab = virt_to_slab(p[i]); > > Not specific to this change, but I wonder if it makes sense to introduce virt_to_slab() > variant without any extra checks for this and similar cases, where we know for sure > that p resides on a slab page. What do you think? ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH RFC 1/4] mm, slab: move memcg charging to post-alloc hook 2024-03-13 10:55 ` Vlastimil Babka @ 2024-03-13 17:34 ` Roman Gushchin 0 siblings, 0 replies; 21+ messages in thread From: Roman Gushchin @ 2024-03-13 17:34 UTC (permalink / raw) To: Vlastimil Babka Cc: Linus Torvalds, Josh Poimboeuf, Jeff Layton, Chuck Lever, Kees Cook, Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton, Hyeonggon Yoo, Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song, Alexander Viro, Christian Brauner, Jan Kara, linux-mm, linux-kernel, cgroups, linux-fsdevel On Wed, Mar 13, 2024 at 11:55:04AM +0100, Vlastimil Babka wrote: > On 3/12/24 19:52, Roman Gushchin wrote: > > On Fri, Mar 01, 2024 at 06:07:08PM +0100, Vlastimil Babka wrote: > >> The MEMCG_KMEM integration with slab currently relies on two hooks > >> during allocation. memcg_slab_pre_alloc_hook() determines the objcg and > >> charges it, and memcg_slab_post_alloc_hook() assigns the objcg pointer > >> to the allocated object(s). > >> > >> As Linus pointed out, this is unnecessarily complex. Failing to charge > >> due to memcg limits should be rare, so we can optimistically allocate > >> the object(s) and do the charging together with assigning the objcg > >> pointer in a single post_alloc hook. In the rare case the charging > >> fails, we can free the object(s) back. > >> > >> This simplifies the code (no need to pass around the objcg pointer) and > >> potentially allows to separate charging from allocation in cases where > >> it's common that the allocation would be immediately freed, and the > >> memcg handling overhead could be saved. > >> > >> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> > >> Link: https://lore.kernel.org/all/CAHk-=whYOOdM7jWy5jdrAm8LxcgCMFyk2bt8fYYvZzM4U-zAQA@mail.gmail.com/ > >> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> > > > > Nice cleanup, Vlastimil! > > Couple of small nits below, but otherwise, please, add my > > > > Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> > > Thanks! > > >> /* > >> * The obtained objcg pointer is safe to use within the current scope, > >> * defined by current task or set_active_memcg() pair. > >> * obj_cgroup_get() is used to get a permanent reference. > >> */ > >> - struct obj_cgroup *objcg = current_obj_cgroup(); > >> + objcg = current_obj_cgroup(); > >> if (!objcg) > >> return true; > >> > >> + /* > >> + * slab_alloc_node() avoids the NULL check, so we might be called with a > >> + * single NULL object. kmem_cache_alloc_bulk() aborts if it can't fill > >> + * the whole requested size. > >> + * return success as there's nothing to free back > >> + */ > >> + if (unlikely(*p == NULL)) > >> + return true; > > > > Probably better to move this check up? current_obj_cgroup() != NULL check is more > > expensive. > > It probably doesn't matter in practice anyway, but my thinking was that > *p == NULL is so rare (the object allocation failed) it shouldn't matter > that we did current_obj_cgroup() uselessly in case it happens. > OTOH current_obj_cgroup() returning NULL is not that rare (?) so it > could be useful to not check *p in those cases? I see... Hm, I'd generally expect a speculative execution of the second check anyway (especially with an unlikely() hint for the first one), and because as you said p == NULL is almost never true, the additional cost is zero. But the same is true otherwise, so it really doesn't matter that much. Thanks for explaining your logic, it wasn't obvious to me. 
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH RFC 1/4] mm, slab: move memcg charging to post-alloc hook 2024-03-01 17:07 ` [PATCH RFC 1/4] mm, slab: move memcg charging to post-alloc hook Vlastimil Babka 2024-03-12 18:52 ` Roman Gushchin @ 2024-03-15 3:23 ` Chengming Zhou 1 sibling, 0 replies; 21+ messages in thread From: Chengming Zhou @ 2024-03-15 3:23 UTC (permalink / raw) To: Vlastimil Babka, Linus Torvalds, Josh Poimboeuf, Jeff Layton, Chuck Lever, Kees Cook, Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton, Roman Gushchin, Hyeonggon Yoo, Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song, Alexander Viro, Christian Brauner, Jan Kara Cc: linux-mm, linux-kernel, cgroups, linux-fsdevel On 2024/3/2 01:07, Vlastimil Babka wrote: > The MEMCG_KMEM integration with slab currently relies on two hooks > during allocation. memcg_slab_pre_alloc_hook() determines the objcg and > charges it, and memcg_slab_post_alloc_hook() assigns the objcg pointer > to the allocated object(s). > > As Linus pointed out, this is unnecessarily complex. Failing to charge > due to memcg limits should be rare, so we can optimistically allocate > the object(s) and do the charging together with assigning the objcg > pointer in a single post_alloc hook. In the rare case the charging > fails, we can free the object(s) back. > > This simplifies the code (no need to pass around the objcg pointer) and > potentially allows to separate charging from allocation in cases where > it's common that the allocation would be immediately freed, and the > memcg handling overhead could be saved. > > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> > Link: https://lore.kernel.org/all/CAHk-=whYOOdM7jWy5jdrAm8LxcgCMFyk2bt8fYYvZzM4U-zAQA@mail.gmail.com/ > Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Nice! Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Thanks. > --- > mm/slub.c | 180 +++++++++++++++++++++++++++----------------------------------- > 1 file changed, 77 insertions(+), 103 deletions(-) > > diff --git a/mm/slub.c b/mm/slub.c > index 2ef88bbf56a3..7022a1246bab 100644 > --- a/mm/slub.c > +++ b/mm/slub.c > @@ -1897,23 +1897,36 @@ static inline size_t obj_full_size(struct kmem_cache *s) > return s->size + sizeof(struct obj_cgroup *); > } > > -/* > - * Returns false if the allocation should fail. > - */ > -static bool __memcg_slab_pre_alloc_hook(struct kmem_cache *s, > - struct list_lru *lru, > - struct obj_cgroup **objcgp, > - size_t objects, gfp_t flags) > +static bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, > + struct list_lru *lru, > + gfp_t flags, size_t size, > + void **p) > { > + struct obj_cgroup *objcg; > + struct slab *slab; > + unsigned long off; > + size_t i; > + > /* > * The obtained objcg pointer is safe to use within the current scope, > * defined by current task or set_active_memcg() pair. > * obj_cgroup_get() is used to get a permanent reference. > */ > - struct obj_cgroup *objcg = current_obj_cgroup(); > + objcg = current_obj_cgroup(); > if (!objcg) > return true; > > + /* > + * slab_alloc_node() avoids the NULL check, so we might be called with a > + * single NULL object. kmem_cache_alloc_bulk() aborts if it can't fill > + * the whole requested size. 
> + * return success as there's nothing to free back > + */ > + if (unlikely(*p == NULL)) > + return true; > + > + flags &= gfp_allowed_mask; > + > if (lru) { > int ret; > struct mem_cgroup *memcg; > @@ -1926,71 +1939,51 @@ static bool __memcg_slab_pre_alloc_hook(struct kmem_cache *s, > return false; > } > > - if (obj_cgroup_charge(objcg, flags, objects * obj_full_size(s))) > + if (obj_cgroup_charge(objcg, flags, size * obj_full_size(s))) > return false; > > - *objcgp = objcg; > + for (i = 0; i < size; i++) { > + slab = virt_to_slab(p[i]); > + > + if (!slab_objcgs(slab) && > + memcg_alloc_slab_cgroups(slab, s, flags, false)) { > + obj_cgroup_uncharge(objcg, obj_full_size(s)); > + continue; > + } > + > + off = obj_to_index(s, slab, p[i]); > + obj_cgroup_get(objcg); > + slab_objcgs(slab)[off] = objcg; > + mod_objcg_state(objcg, slab_pgdat(slab), > + cache_vmstat_idx(s), obj_full_size(s)); > + } > + > return true; > } > > -/* > - * Returns false if the allocation should fail. > - */ > +static void memcg_alloc_abort_single(struct kmem_cache *s, void *object); > + > static __fastpath_inline > -bool memcg_slab_pre_alloc_hook(struct kmem_cache *s, struct list_lru *lru, > - struct obj_cgroup **objcgp, size_t objects, > - gfp_t flags) > +bool memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru, > + gfp_t flags, size_t size, void **p) > { > - if (!memcg_kmem_online()) > + if (likely(!memcg_kmem_online())) > return true; > > if (likely(!(flags & __GFP_ACCOUNT) && !(s->flags & SLAB_ACCOUNT))) > return true; > > - return likely(__memcg_slab_pre_alloc_hook(s, lru, objcgp, objects, > - flags)); > -} > - > -static void __memcg_slab_post_alloc_hook(struct kmem_cache *s, > - struct obj_cgroup *objcg, > - gfp_t flags, size_t size, > - void **p) > -{ > - struct slab *slab; > - unsigned long off; > - size_t i; > - > - flags &= gfp_allowed_mask; > - > - for (i = 0; i < size; i++) { > - if (likely(p[i])) { > - slab = virt_to_slab(p[i]); > - > - if (!slab_objcgs(slab) && > - memcg_alloc_slab_cgroups(slab, s, flags, false)) { > - obj_cgroup_uncharge(objcg, obj_full_size(s)); > - continue; > - } > + if (likely(__memcg_slab_post_alloc_hook(s, lru, flags, size, p))) > + return true; > > - off = obj_to_index(s, slab, p[i]); > - obj_cgroup_get(objcg); > - slab_objcgs(slab)[off] = objcg; > - mod_objcg_state(objcg, slab_pgdat(slab), > - cache_vmstat_idx(s), obj_full_size(s)); > - } else { > - obj_cgroup_uncharge(objcg, obj_full_size(s)); > - } > + if (likely(size == 1)) { > + memcg_alloc_abort_single(s, p); > + *p = NULL; > + } else { > + kmem_cache_free_bulk(s, size, p); > } > -} > - > -static __fastpath_inline > -void memcg_slab_post_alloc_hook(struct kmem_cache *s, struct obj_cgroup *objcg, > - gfp_t flags, size_t size, void **p) > -{ > - if (likely(!memcg_kmem_online() || !objcg)) > - return; > > - return __memcg_slab_post_alloc_hook(s, objcg, flags, size, p); > + return false; > } > > static void __memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab, > @@ -2029,14 +2022,6 @@ void memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab, void **p, > > __memcg_slab_free_hook(s, slab, p, objects, objcgs); > } > - > -static inline > -void memcg_slab_alloc_error_hook(struct kmem_cache *s, int objects, > - struct obj_cgroup *objcg) > -{ > - if (objcg) > - obj_cgroup_uncharge(objcg, objects * obj_full_size(s)); > -} > #else /* CONFIG_MEMCG_KMEM */ > static inline struct mem_cgroup *memcg_from_slab_obj(void *ptr) > { > @@ -2047,31 +2032,18 @@ static inline void memcg_free_slab_cgroups(struct 
slab *slab) > { > } > > -static inline bool memcg_slab_pre_alloc_hook(struct kmem_cache *s, > - struct list_lru *lru, > - struct obj_cgroup **objcgp, > - size_t objects, gfp_t flags) > -{ > - return true; > -} > - > -static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s, > - struct obj_cgroup *objcg, > +static inline bool memcg_slab_post_alloc_hook(struct kmem_cache *s, > + struct list_lru *lru, > gfp_t flags, size_t size, > void **p) > { > + return true; > } > > static inline void memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab, > void **p, int objects) > { > } > - > -static inline > -void memcg_slab_alloc_error_hook(struct kmem_cache *s, int objects, > - struct obj_cgroup *objcg) > -{ > -} > #endif /* CONFIG_MEMCG_KMEM */ > > /* > @@ -3751,10 +3723,7 @@ noinline int should_failslab(struct kmem_cache *s, gfp_t gfpflags) > ALLOW_ERROR_INJECTION(should_failslab, ERRNO); > > static __fastpath_inline > -struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s, > - struct list_lru *lru, > - struct obj_cgroup **objcgp, > - size_t size, gfp_t flags) > +struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s, gfp_t flags) > { > flags &= gfp_allowed_mask; > > @@ -3763,14 +3732,11 @@ struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s, > if (unlikely(should_failslab(s, flags))) > return NULL; > > - if (unlikely(!memcg_slab_pre_alloc_hook(s, lru, objcgp, size, flags))) > - return NULL; > - > return s; > } > > static __fastpath_inline > -void slab_post_alloc_hook(struct kmem_cache *s, struct obj_cgroup *objcg, > +bool slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru, > gfp_t flags, size_t size, void **p, bool init, > unsigned int orig_size) > { > @@ -3819,7 +3785,7 @@ void slab_post_alloc_hook(struct kmem_cache *s, struct obj_cgroup *objcg, > kmsan_slab_alloc(s, p[i], init_flags); > } > > - memcg_slab_post_alloc_hook(s, objcg, flags, size, p); > + return memcg_slab_post_alloc_hook(s, lru, flags, size, p); > } > > /* > @@ -3836,10 +3802,9 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list > gfp_t gfpflags, int node, unsigned long addr, size_t orig_size) > { > void *object; > - struct obj_cgroup *objcg = NULL; > bool init = false; > > - s = slab_pre_alloc_hook(s, lru, &objcg, 1, gfpflags); > + s = slab_pre_alloc_hook(s, gfpflags); > if (unlikely(!s)) > return NULL; > > @@ -3856,8 +3821,10 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list > /* > * When init equals 'true', like for kzalloc() family, only > * @orig_size bytes might be zeroed instead of s->object_size > + * In case this fails due to memcg_slab_post_alloc_hook(), > + * object is set to NULL > */ > - slab_post_alloc_hook(s, objcg, gfpflags, 1, &object, init, orig_size); > + slab_post_alloc_hook(s, lru, gfpflags, 1, &object, init, orig_size); > > return object; > } > @@ -4300,6 +4267,16 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object, > do_slab_free(s, slab, object, object, 1, addr); > } > > +#ifdef CONFIG_MEMCG_KMEM > +/* Do not inline the rare memcg charging failed path into the allocation path */ > +static noinline > +void memcg_alloc_abort_single(struct kmem_cache *s, void *object) > +{ > + if (likely(slab_free_hook(s, object, slab_want_init_on_free(s)))) > + do_slab_free(s, virt_to_slab(object), object, object, 1, _RET_IP_); > +} > +#endif > + > static __fastpath_inline > void slab_free_bulk(struct kmem_cache *s, struct slab *slab, void *head, > void *tail, void **p, int cnt, unsigned long 
addr) > @@ -4635,29 +4612,26 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size, > void **p) > { > int i; > - struct obj_cgroup *objcg = NULL; > > if (!size) > return 0; > > - /* memcg and kmem_cache debug support */ > - s = slab_pre_alloc_hook(s, NULL, &objcg, size, flags); > + s = slab_pre_alloc_hook(s, flags); > if (unlikely(!s)) > return 0; > > i = __kmem_cache_alloc_bulk(s, flags, size, p); > + if (unlikely(i == 0)) > + return 0; > > /* > * memcg and kmem_cache debug support and memory initialization. > * Done outside of the IRQ disabled fastpath loop. > */ > - if (likely(i != 0)) { > - slab_post_alloc_hook(s, objcg, flags, size, p, > - slab_want_init_on_alloc(flags, s), s->object_size); > - } else { > - memcg_slab_alloc_error_hook(s, size, objcg); > + if (unlikely(!slab_post_alloc_hook(s, NULL, flags, size, p, > + slab_want_init_on_alloc(flags, s), s->object_size))) { > + return 0; > } > - > return i; > } > EXPORT_SYMBOL(kmem_cache_alloc_bulk); > ^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH RFC 2/4] mm, slab: move slab_memcg hooks to mm/memcontrol.c 2024-03-01 17:07 [PATCH RFC 0/4] memcg_kmem hooks refactoring and kmem_cache_charge() Vlastimil Babka 2024-03-01 17:07 ` [PATCH RFC 1/4] mm, slab: move memcg charging to post-alloc hook Vlastimil Babka @ 2024-03-01 17:07 ` Vlastimil Babka 2024-03-12 18:56 ` Roman Gushchin 2024-03-01 17:07 ` [PATCH RFC 3/4] mm, slab: introduce kmem_cache_charge() Vlastimil Babka 2024-03-01 17:07 ` [PATCH RFC 4/4] UNFINISHED mm, fs: use kmem_cache_charge() in path_openat() Vlastimil Babka 3 siblings, 1 reply; 21+ messages in thread From: Vlastimil Babka @ 2024-03-01 17:07 UTC (permalink / raw) To: Linus Torvalds, Josh Poimboeuf, Jeff Layton, Chuck Lever, Kees Cook, Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton, Roman Gushchin, Hyeonggon Yoo, Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song, Alexander Viro, Christian Brauner, Jan Kara Cc: linux-mm, linux-kernel, cgroups, linux-fsdevel, Vlastimil Babka The hooks make multiple calls to functions in mm/memcontrol.c, including to th current_obj_cgroup() marked __always_inline. It might be faster to make a single call to the hook in mm/memcontrol.c instead. The hooks also don't use almost anything from mm/slub.c. obj_full_size() can move with the hooks and cache_vmstat_idx() to the internal mm/slab.h Signed-off-by: Vlastimil Babka <vbabka@suse.cz> --- mm/memcontrol.c | 90 ++++++++++++++++++++++++++++++++++++++++++++++++++ mm/slab.h | 10 ++++++ mm/slub.c | 100 -------------------------------------------------------- 3 files changed, 100 insertions(+), 100 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index e4c8735e7c85..37ee9356a26c 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3575,6 +3575,96 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size) refill_obj_stock(objcg, size, true); } +static inline size_t obj_full_size(struct kmem_cache *s) +{ + /* + * For each accounted object there is an extra space which is used + * to store obj_cgroup membership. Charge it too. + */ + return s->size + sizeof(struct obj_cgroup *); +} + +bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru, + gfp_t flags, size_t size, void **p) +{ + struct obj_cgroup *objcg; + struct slab *slab; + unsigned long off; + size_t i; + + /* + * The obtained objcg pointer is safe to use within the current scope, + * defined by current task or set_active_memcg() pair. + * obj_cgroup_get() is used to get a permanent reference. + */ + objcg = current_obj_cgroup(); + if (!objcg) + return true; + + /* + * slab_alloc_node() avoids the NULL check, so we might be called with a + * single NULL object. kmem_cache_alloc_bulk() aborts if it can't fill + * the whole requested size. 
+ * return success as there's nothing to free back + */ + if (unlikely(*p == NULL)) + return true; + + flags &= gfp_allowed_mask; + + if (lru) { + int ret; + struct mem_cgroup *memcg; + + memcg = get_mem_cgroup_from_objcg(objcg); + ret = memcg_list_lru_alloc(memcg, lru, flags); + css_put(&memcg->css); + + if (ret) + return false; + } + + if (obj_cgroup_charge(objcg, flags, size * obj_full_size(s))) + return false; + + for (i = 0; i < size; i++) { + slab = virt_to_slab(p[i]); + + if (!slab_objcgs(slab) && + memcg_alloc_slab_cgroups(slab, s, flags, false)) { + obj_cgroup_uncharge(objcg, obj_full_size(s)); + continue; + } + + off = obj_to_index(s, slab, p[i]); + obj_cgroup_get(objcg); + slab_objcgs(slab)[off] = objcg; + mod_objcg_state(objcg, slab_pgdat(slab), + cache_vmstat_idx(s), obj_full_size(s)); + } + + return true; +} + +void __memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab, + void **p, int objects, struct obj_cgroup **objcgs) +{ + for (int i = 0; i < objects; i++) { + struct obj_cgroup *objcg; + unsigned int off; + + off = obj_to_index(s, slab, p[i]); + objcg = objcgs[off]; + if (!objcg) + continue; + + objcgs[off] = NULL; + obj_cgroup_uncharge(objcg, obj_full_size(s)); + mod_objcg_state(objcg, slab_pgdat(slab), cache_vmstat_idx(s), + -obj_full_size(s)); + obj_cgroup_put(objcg); + } +} #endif /* CONFIG_MEMCG_KMEM */ /* diff --git a/mm/slab.h b/mm/slab.h index 54deeb0428c6..3f170673fa55 100644 --- a/mm/slab.h +++ b/mm/slab.h @@ -541,6 +541,12 @@ static inline bool kmem_cache_debug_flags(struct kmem_cache *s, slab_flags_t fla return false; } +static inline enum node_stat_item cache_vmstat_idx(struct kmem_cache *s) +{ + return (s->flags & SLAB_RECLAIM_ACCOUNT) ? + NR_SLAB_RECLAIMABLE_B : NR_SLAB_UNRECLAIMABLE_B; +} + #ifdef CONFIG_MEMCG_KMEM /* * slab_objcgs - get the object cgroups vector associated with a slab @@ -564,6 +570,10 @@ int memcg_alloc_slab_cgroups(struct slab *slab, struct kmem_cache *s, gfp_t gfp, bool new_slab); void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat, enum node_stat_item idx, int nr); +bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru, + gfp_t flags, size_t size, void **p); +void __memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab, + void **p, int objects, struct obj_cgroup **objcgs); #else /* CONFIG_MEMCG_KMEM */ static inline struct obj_cgroup **slab_objcgs(struct slab *slab) { diff --git a/mm/slub.c b/mm/slub.c index 7022a1246bab..64da169d672a 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1875,12 +1875,6 @@ static bool freelist_corrupted(struct kmem_cache *s, struct slab *slab, #endif #endif /* CONFIG_SLUB_DEBUG */ -static inline enum node_stat_item cache_vmstat_idx(struct kmem_cache *s) -{ - return (s->flags & SLAB_RECLAIM_ACCOUNT) ? - NR_SLAB_RECLAIMABLE_B : NR_SLAB_UNRECLAIMABLE_B; -} - #ifdef CONFIG_MEMCG_KMEM static inline void memcg_free_slab_cgroups(struct slab *slab) { @@ -1888,79 +1882,6 @@ static inline void memcg_free_slab_cgroups(struct slab *slab) slab->memcg_data = 0; } -static inline size_t obj_full_size(struct kmem_cache *s) -{ - /* - * For each accounted object there is an extra space which is used - * to store obj_cgroup membership. Charge it too. 
- */ - return s->size + sizeof(struct obj_cgroup *); -} - -static bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, - struct list_lru *lru, - gfp_t flags, size_t size, - void **p) -{ - struct obj_cgroup *objcg; - struct slab *slab; - unsigned long off; - size_t i; - - /* - * The obtained objcg pointer is safe to use within the current scope, - * defined by current task or set_active_memcg() pair. - * obj_cgroup_get() is used to get a permanent reference. - */ - objcg = current_obj_cgroup(); - if (!objcg) - return true; - - /* - * slab_alloc_node() avoids the NULL check, so we might be called with a - * single NULL object. kmem_cache_alloc_bulk() aborts if it can't fill - * the whole requested size. - * return success as there's nothing to free back - */ - if (unlikely(*p == NULL)) - return true; - - flags &= gfp_allowed_mask; - - if (lru) { - int ret; - struct mem_cgroup *memcg; - - memcg = get_mem_cgroup_from_objcg(objcg); - ret = memcg_list_lru_alloc(memcg, lru, flags); - css_put(&memcg->css); - - if (ret) - return false; - } - - if (obj_cgroup_charge(objcg, flags, size * obj_full_size(s))) - return false; - - for (i = 0; i < size; i++) { - slab = virt_to_slab(p[i]); - - if (!slab_objcgs(slab) && - memcg_alloc_slab_cgroups(slab, s, flags, false)) { - obj_cgroup_uncharge(objcg, obj_full_size(s)); - continue; - } - - off = obj_to_index(s, slab, p[i]); - obj_cgroup_get(objcg); - slab_objcgs(slab)[off] = objcg; - mod_objcg_state(objcg, slab_pgdat(slab), - cache_vmstat_idx(s), obj_full_size(s)); - } - - return true; -} - static void memcg_alloc_abort_single(struct kmem_cache *s, void *object); static __fastpath_inline @@ -1986,27 +1907,6 @@ bool memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru, return false; } -static void __memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab, - void **p, int objects, - struct obj_cgroup **objcgs) -{ - for (int i = 0; i < objects; i++) { - struct obj_cgroup *objcg; - unsigned int off; - - off = obj_to_index(s, slab, p[i]); - objcg = objcgs[off]; - if (!objcg) - continue; - - objcgs[off] = NULL; - obj_cgroup_uncharge(objcg, obj_full_size(s)); - mod_objcg_state(objcg, slab_pgdat(slab), cache_vmstat_idx(s), - -obj_full_size(s)); - obj_cgroup_put(objcg); - } -} - static __fastpath_inline void memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab, void **p, int objects) -- 2.44.0 ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH RFC 2/4] mm, slab: move slab_memcg hooks to mm/memcontrol.c 2024-03-01 17:07 ` [PATCH RFC 2/4] mm, slab: move slab_memcg hooks to mm/memcontrol.c Vlastimil Babka @ 2024-03-12 18:56 ` Roman Gushchin 2024-03-12 19:32 ` Matthew Wilcox 0 siblings, 1 reply; 21+ messages in thread From: Roman Gushchin @ 2024-03-12 18:56 UTC (permalink / raw) To: Vlastimil Babka Cc: Linus Torvalds, Josh Poimboeuf, Jeff Layton, Chuck Lever, Kees Cook, Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton, Hyeonggon Yoo, Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song, Alexander Viro, Christian Brauner, Jan Kara, linux-mm, linux-kernel, cgroups, linux-fsdevel On Fri, Mar 01, 2024 at 06:07:09PM +0100, Vlastimil Babka wrote: > The hooks make multiple calls to functions in mm/memcontrol.c, including > to th current_obj_cgroup() marked __always_inline. It might be faster to > make a single call to the hook in mm/memcontrol.c instead. The hooks > also don't use almost anything from mm/slub.c. obj_full_size() can move > with the hooks and cache_vmstat_idx() to the internal mm/slab.h > > Signed-off-by: Vlastimil Babka <vbabka@suse.cz> > --- > mm/memcontrol.c | 90 ++++++++++++++++++++++++++++++++++++++++++++++++++ > mm/slab.h | 10 ++++++ > mm/slub.c | 100 -------------------------------------------------------- > 3 files changed, 100 insertions(+), 100 deletions(-) Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Btw, even before your change: $ cat mm/memcontrol.c | wc -l 8318 so I wonder if soon we might want to split it into some smaller parts. Thanks! > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index e4c8735e7c85..37ee9356a26c 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -3575,6 +3575,96 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size) > refill_obj_stock(objcg, size, true); > } > > +static inline size_t obj_full_size(struct kmem_cache *s) > +{ > + /* > + * For each accounted object there is an extra space which is used > + * to store obj_cgroup membership. Charge it too. > + */ > + return s->size + sizeof(struct obj_cgroup *); > +} > + > +bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru, > + gfp_t flags, size_t size, void **p) > +{ > + struct obj_cgroup *objcg; > + struct slab *slab; > + unsigned long off; > + size_t i; > + > + /* > + * The obtained objcg pointer is safe to use within the current scope, > + * defined by current task or set_active_memcg() pair. > + * obj_cgroup_get() is used to get a permanent reference. > + */ > + objcg = current_obj_cgroup(); > + if (!objcg) > + return true; > + > + /* > + * slab_alloc_node() avoids the NULL check, so we might be called with a > + * single NULL object. kmem_cache_alloc_bulk() aborts if it can't fill > + * the whole requested size. 
> + * return success as there's nothing to free back > + */ > + if (unlikely(*p == NULL)) > + return true; > + > + flags &= gfp_allowed_mask; > + > + if (lru) { > + int ret; > + struct mem_cgroup *memcg; > + > + memcg = get_mem_cgroup_from_objcg(objcg); > + ret = memcg_list_lru_alloc(memcg, lru, flags); > + css_put(&memcg->css); > + > + if (ret) > + return false; > + } > + > + if (obj_cgroup_charge(objcg, flags, size * obj_full_size(s))) > + return false; > + > + for (i = 0; i < size; i++) { > + slab = virt_to_slab(p[i]); > + > + if (!slab_objcgs(slab) && > + memcg_alloc_slab_cgroups(slab, s, flags, false)) { > + obj_cgroup_uncharge(objcg, obj_full_size(s)); > + continue; > + } > + > + off = obj_to_index(s, slab, p[i]); > + obj_cgroup_get(objcg); > + slab_objcgs(slab)[off] = objcg; > + mod_objcg_state(objcg, slab_pgdat(slab), > + cache_vmstat_idx(s), obj_full_size(s)); > + } > + > + return true; > +} > + > +void __memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab, > + void **p, int objects, struct obj_cgroup **objcgs) > +{ > + for (int i = 0; i < objects; i++) { > + struct obj_cgroup *objcg; > + unsigned int off; > + > + off = obj_to_index(s, slab, p[i]); > + objcg = objcgs[off]; > + if (!objcg) > + continue; > + > + objcgs[off] = NULL; > + obj_cgroup_uncharge(objcg, obj_full_size(s)); > + mod_objcg_state(objcg, slab_pgdat(slab), cache_vmstat_idx(s), > + -obj_full_size(s)); > + obj_cgroup_put(objcg); > + } > +} > #endif /* CONFIG_MEMCG_KMEM */ > > /* > diff --git a/mm/slab.h b/mm/slab.h > index 54deeb0428c6..3f170673fa55 100644 > --- a/mm/slab.h > +++ b/mm/slab.h > @@ -541,6 +541,12 @@ static inline bool kmem_cache_debug_flags(struct kmem_cache *s, slab_flags_t fla > return false; > } > > +static inline enum node_stat_item cache_vmstat_idx(struct kmem_cache *s) > +{ > + return (s->flags & SLAB_RECLAIM_ACCOUNT) ? > + NR_SLAB_RECLAIMABLE_B : NR_SLAB_UNRECLAIMABLE_B; > +} > + > #ifdef CONFIG_MEMCG_KMEM > /* > * slab_objcgs - get the object cgroups vector associated with a slab > @@ -564,6 +570,10 @@ int memcg_alloc_slab_cgroups(struct slab *slab, struct kmem_cache *s, > gfp_t gfp, bool new_slab); > void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat, > enum node_stat_item idx, int nr); > +bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru, > + gfp_t flags, size_t size, void **p); > +void __memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab, > + void **p, int objects, struct obj_cgroup **objcgs); > #else /* CONFIG_MEMCG_KMEM */ > static inline struct obj_cgroup **slab_objcgs(struct slab *slab) > { > diff --git a/mm/slub.c b/mm/slub.c > index 7022a1246bab..64da169d672a 100644 > --- a/mm/slub.c > +++ b/mm/slub.c > @@ -1875,12 +1875,6 @@ static bool freelist_corrupted(struct kmem_cache *s, struct slab *slab, > #endif > #endif /* CONFIG_SLUB_DEBUG */ > > -static inline enum node_stat_item cache_vmstat_idx(struct kmem_cache *s) > -{ > - return (s->flags & SLAB_RECLAIM_ACCOUNT) ? > - NR_SLAB_RECLAIMABLE_B : NR_SLAB_UNRECLAIMABLE_B; > -} > - > #ifdef CONFIG_MEMCG_KMEM > static inline void memcg_free_slab_cgroups(struct slab *slab) > { > @@ -1888,79 +1882,6 @@ static inline void memcg_free_slab_cgroups(struct slab *slab) > slab->memcg_data = 0; > } > > -static inline size_t obj_full_size(struct kmem_cache *s) > -{ > - /* > - * For each accounted object there is an extra space which is used > - * to store obj_cgroup membership. Charge it too. 
> - */ > - return s->size + sizeof(struct obj_cgroup *); > -} > - > -static bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, > - struct list_lru *lru, > - gfp_t flags, size_t size, > - void **p) > -{ > - struct obj_cgroup *objcg; > - struct slab *slab; > - unsigned long off; > - size_t i; > - > - /* > - * The obtained objcg pointer is safe to use within the current scope, > - * defined by current task or set_active_memcg() pair. > - * obj_cgroup_get() is used to get a permanent reference. > - */ > - objcg = current_obj_cgroup(); > - if (!objcg) > - return true; > - > - /* > - * slab_alloc_node() avoids the NULL check, so we might be called with a > - * single NULL object. kmem_cache_alloc_bulk() aborts if it can't fill > - * the whole requested size. > - * return success as there's nothing to free back > - */ > - if (unlikely(*p == NULL)) > - return true; > - > - flags &= gfp_allowed_mask; > - > - if (lru) { > - int ret; > - struct mem_cgroup *memcg; > - > - memcg = get_mem_cgroup_from_objcg(objcg); > - ret = memcg_list_lru_alloc(memcg, lru, flags); > - css_put(&memcg->css); > - > - if (ret) > - return false; > - } > - > - if (obj_cgroup_charge(objcg, flags, size * obj_full_size(s))) > - return false; > - > - for (i = 0; i < size; i++) { > - slab = virt_to_slab(p[i]); > - > - if (!slab_objcgs(slab) && > - memcg_alloc_slab_cgroups(slab, s, flags, false)) { > - obj_cgroup_uncharge(objcg, obj_full_size(s)); > - continue; > - } > - > - off = obj_to_index(s, slab, p[i]); > - obj_cgroup_get(objcg); > - slab_objcgs(slab)[off] = objcg; > - mod_objcg_state(objcg, slab_pgdat(slab), > - cache_vmstat_idx(s), obj_full_size(s)); > - } > - > - return true; > -} > - > static void memcg_alloc_abort_single(struct kmem_cache *s, void *object); > > static __fastpath_inline > @@ -1986,27 +1907,6 @@ bool memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru, > return false; > } > > -static void __memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab, > - void **p, int objects, > - struct obj_cgroup **objcgs) > -{ > - for (int i = 0; i < objects; i++) { > - struct obj_cgroup *objcg; > - unsigned int off; > - > - off = obj_to_index(s, slab, p[i]); > - objcg = objcgs[off]; > - if (!objcg) > - continue; > - > - objcgs[off] = NULL; > - obj_cgroup_uncharge(objcg, obj_full_size(s)); > - mod_objcg_state(objcg, slab_pgdat(slab), cache_vmstat_idx(s), > - -obj_full_size(s)); > - obj_cgroup_put(objcg); > - } > -} > - > static __fastpath_inline > void memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab, void **p, > int objects) > > -- > 2.44.0 > > ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH RFC 2/4] mm, slab: move slab_memcg hooks to mm/memcontrol.c 2024-03-12 18:56 ` Roman Gushchin @ 2024-03-12 19:32 ` Matthew Wilcox 2024-03-12 20:36 ` Roman Gushchin 0 siblings, 1 reply; 21+ messages in thread From: Matthew Wilcox @ 2024-03-12 19:32 UTC (permalink / raw) To: Roman Gushchin Cc: Vlastimil Babka, Linus Torvalds, Josh Poimboeuf, Jeff Layton, Chuck Lever, Kees Cook, Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton, Hyeonggon Yoo, Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song, Alexander Viro, Christian Brauner, Jan Kara, linux-mm, linux-kernel, cgroups, linux-fsdevel On Tue, Mar 12, 2024 at 11:56:31AM -0700, Roman Gushchin wrote: > On Fri, Mar 01, 2024 at 06:07:09PM +0100, Vlastimil Babka wrote: > > The hooks make multiple calls to functions in mm/memcontrol.c, including > > to th current_obj_cgroup() marked __always_inline. It might be faster to > > make a single call to the hook in mm/memcontrol.c instead. The hooks > > also don't use almost anything from mm/slub.c. obj_full_size() can move > > with the hooks and cache_vmstat_idx() to the internal mm/slab.h > > > > Signed-off-by: Vlastimil Babka <vbabka@suse.cz> > > --- > > mm/memcontrol.c | 90 ++++++++++++++++++++++++++++++++++++++++++++++++++ > > mm/slab.h | 10 ++++++ > > mm/slub.c | 100 -------------------------------------------------------- > > 3 files changed, 100 insertions(+), 100 deletions(-) > > Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> > > Btw, even before your change: > $ cat mm/memcontrol.c | wc -l > 8318 > so I wonder if soon we might want to split it into some smaller parts. If we are going to split it, perhaps a mm/memcg-v1.c would make sense, because I certainly don't have a good idea about what's v1 and what's v2. And maybe we could even conditionally compile the v1 file ;-) ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH RFC 2/4] mm, slab: move slab_memcg hooks to mm/memcontrol.c 2024-03-12 19:32 ` Matthew Wilcox @ 2024-03-12 20:36 ` Roman Gushchin 0 siblings, 0 replies; 21+ messages in thread From: Roman Gushchin @ 2024-03-12 20:36 UTC (permalink / raw) To: Matthew Wilcox Cc: Vlastimil Babka, Linus Torvalds, Josh Poimboeuf, Jeff Layton, Chuck Lever, Kees Cook, Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton, Hyeonggon Yoo, Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song, Alexander Viro, Christian Brauner, Jan Kara, linux-mm, linux-kernel, cgroups, linux-fsdevel On Tue, Mar 12, 2024 at 07:32:57PM +0000, Matthew Wilcox wrote: > On Tue, Mar 12, 2024 at 11:56:31AM -0700, Roman Gushchin wrote: > > On Fri, Mar 01, 2024 at 06:07:09PM +0100, Vlastimil Babka wrote: > > > The hooks make multiple calls to functions in mm/memcontrol.c, including > > > to th current_obj_cgroup() marked __always_inline. It might be faster to > > > make a single call to the hook in mm/memcontrol.c instead. The hooks > > > also don't use almost anything from mm/slub.c. obj_full_size() can move > > > with the hooks and cache_vmstat_idx() to the internal mm/slab.h > > > > > > Signed-off-by: Vlastimil Babka <vbabka@suse.cz> > > > --- > > > mm/memcontrol.c | 90 ++++++++++++++++++++++++++++++++++++++++++++++++++ > > > mm/slab.h | 10 ++++++ > > > mm/slub.c | 100 -------------------------------------------------------- > > > 3 files changed, 100 insertions(+), 100 deletions(-) > > > > Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> > > > > Btw, even before your change: > > $ cat mm/memcontrol.c | wc -l > > 8318 > > so I wonder if soon we might want to split it into some smaller parts. > > If we are going to split it, perhaps a mm/memcg-v1.c would make sense, > because I certainly don't have a good idea about what's v1 and what's > v2. And maybe we could even conditionally compile the v1 file ;-) Good call. We already have cgroup/cgroup-v1.c and cgroup/legacy_freezer.c. Thanks! ^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH RFC 3/4] mm, slab: introduce kmem_cache_charge() 2024-03-01 17:07 [PATCH RFC 0/4] memcg_kmem hooks refactoring and kmem_cache_charge() Vlastimil Babka 2024-03-01 17:07 ` [PATCH RFC 1/4] mm, slab: move memcg charging to post-alloc hook Vlastimil Babka 2024-03-01 17:07 ` [PATCH RFC 2/4] mm, slab: move slab_memcg hooks to mm/memcontrol.c Vlastimil Babka @ 2024-03-01 17:07 ` Vlastimil Babka 2024-03-01 17:07 ` [PATCH RFC 4/4] UNFINISHED mm, fs: use kmem_cache_charge() in path_openat() Vlastimil Babka 3 siblings, 0 replies; 21+ messages in thread From: Vlastimil Babka @ 2024-03-01 17:07 UTC (permalink / raw) To: Linus Torvalds, Josh Poimboeuf, Jeff Layton, Chuck Lever, Kees Cook, Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton, Roman Gushchin, Hyeonggon Yoo, Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song, Alexander Viro, Christian Brauner, Jan Kara Cc: linux-mm, linux-kernel, cgroups, linux-fsdevel, Vlastimil Babka As suggested by Linus, introduce a slab API function to memcg-charge a an object that was previously allocated without __GFP_ACCOUNT and from a cache that's not SLAB_ACCOUNT. This may be useful when it's likely the object is to be freed soon, and thus the charging/uncharging overhead can be avoided. In case kmem_cache_charge() is called on an already-charged object, it's a no-op. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/all/CAHk-=whYOOdM7jWy5jdrAm8LxcgCMFyk2bt8fYYvZzM4U-zAQA@mail.gmail.com/ Signed-off-by: Vlastimil Babka <vbabka@suse.cz> --- include/linux/slab.h | 10 ++++++++++ mm/slub.c | 29 +++++++++++++++++++++++++++++ 2 files changed, 39 insertions(+) diff --git a/include/linux/slab.h b/include/linux/slab.h index b5f5ee8308d0..0c3acb2fa3e6 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -491,6 +491,16 @@ void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags) __assume_slab_ali void *kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru, gfp_t gfpflags) __assume_slab_alignment __malloc; void kmem_cache_free(struct kmem_cache *s, void *objp); +#ifdef CONFIG_MEMCG_KMEM +int kmem_cache_charge(struct kmem_cache *s, gfp_t flags, void *objp); +#else +static inline int +kmem_cache_charge(struct kmem_cache *s, gfp_t flags, void *objp) +{ + return 0; +} +#endif + /* * Bulk allocation and freeing operations. These are accelerated in an diff --git a/mm/slub.c b/mm/slub.c index 64da169d672a..72b61b379ba1 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -4241,6 +4241,35 @@ void kmem_cache_free(struct kmem_cache *s, void *x) } EXPORT_SYMBOL(kmem_cache_free); +#ifdef CONFIG_MEMCG_KMEM +int kmem_cache_charge(struct kmem_cache *s, gfp_t flags, void *x) +{ + struct obj_cgroup ** objcg; + struct slab *slab; + + s = cache_from_obj(s, x); + if (!s) + return -EINVAL; + + if (likely(!memcg_kmem_online())) + return 0; + + /* was it already accounted? */ + slab = virt_to_slab(x); + if ((objcg = slab_objcgs(slab))) { + unsigned int off = obj_to_index(s, slab, x); + + if (objcg[off]) + return 0; + } + + if (!memcg_slab_post_alloc_hook(s, NULL, flags, 1, &x)) + return -ENOMEM; + + return 0; +} +#endif + static void free_large_kmalloc(struct folio *folio, void *object) { unsigned int order = folio_order(folio); -- 2.44.0 ^ permalink raw reply related [flat|nested] 21+ messages in thread
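A minimal sketch of how a caller might use the proposed API (struct foo, foo_cachep and foo_init() are invented for illustration and are not part of the series): allocate from a cache that is not SLAB_ACCOUNT, and only charge the object on the path where it is going to stick around.

/* hypothetical caller: all foo_* names are made up for the sketch */
static int foo_create(void)
{
	struct foo *obj;
	int err;

	obj = kmem_cache_zalloc(foo_cachep, GFP_KERNEL);
	if (!obj)
		return -ENOMEM;

	err = foo_init(obj);	/* the step that commonly fails */
	if (err)
		goto free;	/* freed without ever touching memcg */

	/* the object will outlive the syscall, charge it now */
	err = kmem_cache_charge(foo_cachep, GFP_KERNEL_ACCOUNT, obj);
	if (err)
		goto free;

	return 0;
free:
	kmem_cache_free(foo_cachep, obj);
	return err;
}

In the common short-lived case the object is freed back before any per-object objcg bookkeeping is set up, which is the point of splitting allocation from accounting.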
* [PATCH RFC 4/4] UNFINISHED mm, fs: use kmem_cache_charge() in path_openat() 2024-03-01 17:07 [PATCH RFC 0/4] memcg_kmem hooks refactoring and kmem_cache_charge() Vlastimil Babka ` (2 preceding siblings ...) 2024-03-01 17:07 ` [PATCH RFC 3/4] mm, slab: introduce kmem_cache_charge() Vlastimil Babka @ 2024-03-01 17:07 ` Vlastimil Babka 2024-03-01 17:51 ` Linus Torvalds 3 siblings, 1 reply; 21+ messages in thread From: Vlastimil Babka @ 2024-03-01 17:07 UTC (permalink / raw) To: Linus Torvalds, Josh Poimboeuf, Jeff Layton, Chuck Lever, Kees Cook, Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton, Roman Gushchin, Hyeonggon Yoo, Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song, Alexander Viro, Christian Brauner, Jan Kara Cc: linux-mm, linux-kernel, cgroups, linux-fsdevel, Vlastimil Babka This is just an example of using the kmem_cache_charge() API. I think it's placed in a place that's applicable for Linus's example [1] although he mentions do_dentry_open() - I have followed from strace() showing openat(2) to path_openat() doing the alloc_empty_file(). The idea is that filp_cachep stops being SLAB_ACCOUNT. Allocations that want to be accounted immediately can use GFP_KERNEL_ACCOUNT. I did that in alloc_empty_file_noaccount() (despite the contradictory name but the noaccount refers to something else, right?) as IIUC it's about kernel-internal opens. alloc_empty_file() is now not doing the accounting, so I added kmem_account_file() that calls the new kmem_cache_charge() API. Why is this unfinished: - there are other callers of alloc_empty_file() which I didn't adjust so they simply became memcg-unaccounted. I haven't investigated for which ones it would make also sense to separate the allocation and accounting. Maybe alloc_empty_file() would need to get a parameter to control this. - I don't know how to properly unwind the accounting failure case. It seems like a new case because when we succeed the open, there's no further error path at least in path_openat(). Basically it boils down I'm unfamiliar with VFS so this depends if this approach is deemed useful enough to finish it. [1] https://lore.kernel.org/all/CAHk-=whYOOdM7jWy5jdrAm8LxcgCMFyk2bt8fYYvZzM4U-zAQA@mail.gmail.com/ Not-signed-off-by: Vlastimil Babka <vbabka@suse.cz> --- fs/file_table.c | 9 +++++++-- fs/internal.h | 1 + fs/namei.c | 4 +++- 3 files changed, 11 insertions(+), 3 deletions(-) diff --git a/fs/file_table.c b/fs/file_table.c index b991f90571b4..6401b6f175ae 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -223,6 +223,11 @@ struct file *alloc_empty_file(int flags, const struct cred *cred) return ERR_PTR(-ENFILE); } +int kmem_account_file(struct file *f) +{ + return kmem_cache_charge(filp_cachep, GFP_KERNEL_ACCOUNT, f); +} + /* * Variant of alloc_empty_file() that doesn't check and modify nr_files. 
* @@ -234,7 +239,7 @@ struct file *alloc_empty_file_noaccount(int flags, const struct cred *cred) struct file *f; int error; - f = kmem_cache_zalloc(filp_cachep, GFP_KERNEL); + f = kmem_cache_zalloc(filp_cachep, GFP_KERNEL_ACCOUNT); if (unlikely(!f)) return ERR_PTR(-ENOMEM); @@ -468,7 +473,7 @@ void __init files_init(void) { filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0, SLAB_TYPESAFE_BY_RCU | SLAB_HWCACHE_ALIGN | - SLAB_PANIC | SLAB_ACCOUNT, NULL); + SLAB_PANIC, NULL); percpu_counter_init(&nr_files, 0, GFP_KERNEL); } diff --git a/fs/internal.h b/fs/internal.h index b67406435fc0..06ada11b71d0 100644 --- a/fs/internal.h +++ b/fs/internal.h @@ -96,6 +96,7 @@ extern void chroot_fs_refs(const struct path *, const struct path *); struct file *alloc_empty_file(int flags, const struct cred *cred); struct file *alloc_empty_file_noaccount(int flags, const struct cred *cred); struct file *alloc_empty_backing_file(int flags, const struct cred *cred); +int kmem_account_file(struct file *file); static inline void file_put_write_access(struct file *file) { diff --git a/fs/namei.c b/fs/namei.c index 4e0de939fea1..fcf3f3fcd059 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -3799,8 +3799,10 @@ static struct file *path_openat(struct nameidata *nd, terminate_walk(nd); } if (likely(!error)) { - if (likely(file->f_mode & FMODE_OPENED)) + if (likely(file->f_mode & FMODE_OPENED)) { + kmem_account_file(file); return file; + } WARN_ON(1); error = -EINVAL; } -- 2.44.0 ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH RFC 4/4] UNFINISHED mm, fs: use kmem_cache_charge() in path_openat() 2024-03-01 17:07 ` [PATCH RFC 4/4] UNFINISHED mm, fs: use kmem_cache_charge() in path_openat() Vlastimil Babka @ 2024-03-01 17:51 ` Linus Torvalds 2024-03-01 18:53 ` Roman Gushchin ` (2 more replies) 0 siblings, 3 replies; 21+ messages in thread From: Linus Torvalds @ 2024-03-01 17:51 UTC (permalink / raw) To: Vlastimil Babka Cc: Josh Poimboeuf, Jeff Layton, Chuck Lever, Kees Cook, Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton, Roman Gushchin, Hyeonggon Yoo, Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song, Alexander Viro, Christian Brauner, Jan Kara, linux-mm, linux-kernel, cgroups, linux-fsdevel On Fri, 1 Mar 2024 at 09:07, Vlastimil Babka <vbabka@suse.cz> wrote: > > This is just an example of using the kmem_cache_charge() API. I think > it's placed in a place that's applicable for Linus's example [1] > although he mentions do_dentry_open() - I have followed from strace() > showing openat(2) to path_openat() doing the alloc_empty_file(). Thanks. This is not the right patch, but yes, patches 1-3 look very nice to me. > The idea is that filp_cachep stops being SLAB_ACCOUNT. Allocations that > want to be accounted immediately can use GFP_KERNEL_ACCOUNT. I did that > in alloc_empty_file_noaccount() (despite the contradictory name but the > noaccount refers to something else, right?) as IIUC it's about > kernel-internal opens. Yeah, the "noaccount" function is about not accounting it towards nr_files. That said, I don't think it necessarily needs to do the memory accounting either - it's literally for cases where we're never going to install the file descriptor in any user space. Your change to use GFP_KERNEL_ACCOUNT isn't exactly wrong, but I don't think it's really the right thing either, because > Why is this unfinished: > > - there are other callers of alloc_empty_file() which I didn't adjust so > they simply became memcg-unaccounted. I haven't investigated for which > ones it would make also sense to separate the allocation and accounting. > Maybe alloc_empty_file() would need to get a parameter to control > this. Right. I think the natural and logical way to deal with this is to just say "we account when we add the file to the fdtable". IOW, just have fd_install() do it. That's the really natural point, and also makes it very logical why alloc_empty_file_noaccount() wouldn't need to do the GFP_KERNEL_ACCOUNT. > - I don't know how to properly unwind the accounting failure case. It > seems like a new case because when we succeed the open, there's no > further error path at least in path_openat(). Yeah, let me think about this part. Becasue fd_install() is the right point, but that too does not really allow for error handling. Yes, we could close things and fail it, but it really is much too late at this point. What I *think* I'd want for this case is (a) allow the accounting to go over by a bit (b) make sure there's a cheap way to ask (before) about "did we go over the limit" IOW, the accounting never needed to be byte-accurate to begin with, and making it fail (cheaply and early) on the next file allocation is fine. Just make it really cheap. Can we do that? For example, maybe don't bother with the whole "bytes and pages" stuff. Just a simple "are we more than one page over?" kind of question. Without the 'stock_lock' mess for sub-page bytes etc How would that look? 
Would it result in something that can be done cheaply without locking and atomics and without excessive pointer indirection through many levels of memcg data structures? Linus ^ permalink raw reply [flat|nested] 21+ messages in thread
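One possible shape of the cheap question, as a sketch only (no such helper exists in the series; it deliberately ignores the cgroup hierarchy and the byte-accurate stock machinery, and the one-page slack is arbitrary):

/*
 * Illustrative sketch: a page-granular "did the previous charge already
 * push us over?" check, meant to be asked before the next allocation
 * rather than making the charge at install time fail.
 */
static inline bool memcg_over_limit(struct mem_cgroup *memcg)
{
	return page_counter_read(&memcg->memory) >
	       READ_ONCE(memcg->memory.max) + 1;
}

A real version would still have to decide what to do about ancestor cgroups whose limit may be the binding one, which is where keeping it cheap gets harder.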
* Re: [PATCH RFC 4/4] UNFINISHED mm, fs: use kmem_cache_charge() in path_openat() 2024-03-01 17:51 ` Linus Torvalds @ 2024-03-01 18:53 ` Roman Gushchin 2024-03-12 9:22 ` Vlastimil Babka 2024-03-04 12:47 ` Christian Brauner 2024-03-24 2:27 ` Al Viro 2 siblings, 1 reply; 21+ messages in thread From: Roman Gushchin @ 2024-03-01 18:53 UTC (permalink / raw) To: Linus Torvalds Cc: Vlastimil Babka, Josh Poimboeuf, Jeff Layton, Chuck Lever, Kees Cook, Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton, Hyeonggon Yoo, Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song, Alexander Viro, Christian Brauner, Jan Kara, linux-mm, linux-kernel, cgroups, linux-fsdevel On Fri, Mar 01, 2024 at 09:51:18AM -0800, Linus Torvalds wrote: > On Fri, 1 Mar 2024 at 09:07, Vlastimil Babka <vbabka@suse.cz> wrote: > > > > This is just an example of using the kmem_cache_charge() API. I think > > it's placed in a place that's applicable for Linus's example [1] > > although he mentions do_dentry_open() - I have followed from strace() > > showing openat(2) to path_openat() doing the alloc_empty_file(). > > Thanks. This is not the right patch, but yes, patches 1-3 look very nice to me. > > > The idea is that filp_cachep stops being SLAB_ACCOUNT. Allocations that > > want to be accounted immediately can use GFP_KERNEL_ACCOUNT. I did that > > in alloc_empty_file_noaccount() (despite the contradictory name but the > > noaccount refers to something else, right?) as IIUC it's about > > kernel-internal opens. > > Yeah, the "noaccount" function is about not accounting it towards nr_files. > > That said, I don't think it necessarily needs to do the memory > accounting either - it's literally for cases where we're never going > to install the file descriptor in any user space. > > Your change to use GFP_KERNEL_ACCOUNT isn't exactly wrong, but I don't > think it's really the right thing either, because > > > Why is this unfinished: > > > > - there are other callers of alloc_empty_file() which I didn't adjust so > > they simply became memcg-unaccounted. I haven't investigated for which > > ones it would make also sense to separate the allocation and accounting. > > Maybe alloc_empty_file() would need to get a parameter to control > > this. > > Right. I think the natural and logical way to deal with this is to > just say "we account when we add the file to the fdtable". > > IOW, just have fd_install() do it. That's the really natural point, > and also makes it very logical why alloc_empty_file_noaccount() > wouldn't need to do the GFP_KERNEL_ACCOUNT. > > > - I don't know how to properly unwind the accounting failure case. It > > seems like a new case because when we succeed the open, there's no > > further error path at least in path_openat(). > > Yeah, let me think about this part. Becasue fd_install() is the right > point, but that too does not really allow for error handling. > > Yes, we could close things and fail it, but it really is much too late > at this point. > > What I *think* I'd want for this case is > > (a) allow the accounting to go over by a bit > > (b) make sure there's a cheap way to ask (before) about "did we go > over the limit" > > IOW, the accounting never needed to be byte-accurate to begin with, > and making it fail (cheaply and early) on the next file allocation is > fine. > > Just make it really cheap. Can we do that? > > For example, maybe don't bother with the whole "bytes and pages" > stuff. Just a simple "are we more than one page over?" kind of > question. 
Without the 'stock_lock' mess for sub-page bytes etc > > How would that look? Would it result in something that can be done > cheaply without locking and atomics and without excessive pointer > indirection through many levels of memcg data structures? I think it's possible and I'm currently looking into batching charge, objcg refcnt management and vmstats using per-task caching. It should speed up things for the majority of allocations. For allocations from an irq context and targeted allocations (where the target memcg != memcg of the current task) we'd probably need to keep the old scheme. I hope to post some patches relatively soon. I tried to optimize the current implementation but failed to get any significant gains. It seems that the overhead is very evenly spread across objcg pointer access, charge management, objcg refcnt management and vmstats. Thanks! ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH RFC 4/4] UNFINISHED mm, fs: use kmem_cache_charge() in path_openat() 2024-03-01 18:53 ` Roman Gushchin @ 2024-03-12 9:22 ` Vlastimil Babka 2024-03-12 19:05 ` Roman Gushchin 0 siblings, 1 reply; 21+ messages in thread From: Vlastimil Babka @ 2024-03-12 9:22 UTC (permalink / raw) To: Roman Gushchin, Linus Torvalds Cc: Josh Poimboeuf, Jeff Layton, Chuck Lever, Kees Cook, Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton, Hyeonggon Yoo, Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song, Alexander Viro, Christian Brauner, Jan Kara, linux-mm, linux-kernel, cgroups, linux-fsdevel On 3/1/24 19:53, Roman Gushchin wrote: > On Fri, Mar 01, 2024 at 09:51:18AM -0800, Linus Torvalds wrote: >> What I *think* I'd want for this case is >> >> (a) allow the accounting to go over by a bit >> >> (b) make sure there's a cheap way to ask (before) about "did we go >> over the limit" >> >> IOW, the accounting never needed to be byte-accurate to begin with, >> and making it fail (cheaply and early) on the next file allocation is >> fine. >> >> Just make it really cheap. Can we do that? >> >> For example, maybe don't bother with the whole "bytes and pages" >> stuff. Just a simple "are we more than one page over?" kind of >> question. Without the 'stock_lock' mess for sub-page bytes etc >> >> How would that look? Would it result in something that can be done >> cheaply without locking and atomics and without excessive pointer >> indirection through many levels of memcg data structures? > > I think it's possible and I'm currently looking into batching charge, > objcg refcnt management and vmstats using per-task caching. It should > speed up things for the majority of allocations. > For allocations from an irq context and targeted allocations > (where the target memcg != memcg of the current task) we'd probably need to > keep the old scheme. I hope to post some patches relatively soon. Do you think this will work on top of this series, i.e. patches 1+2 could be eventually put to slab/for-next after the merge window, or would it interfere with your changes? > I tried to optimize the current implementation but failed to get any > significant gains. It seems that the overhead is very evenly spread across > objcg pointer access, charge management, objcg refcnt management and vmstats. > > Thanks! ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH RFC 4/4] UNFINISHED mm, fs: use kmem_cache_charge() in path_openat() 2024-03-12 9:22 ` Vlastimil Babka @ 2024-03-12 19:05 ` Roman Gushchin 0 siblings, 0 replies; 21+ messages in thread From: Roman Gushchin @ 2024-03-12 19:05 UTC (permalink / raw) To: Vlastimil Babka Cc: Linus Torvalds, Josh Poimboeuf, Jeff Layton, Chuck Lever, Kees Cook, Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton, Hyeonggon Yoo, Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song, Alexander Viro, Christian Brauner, Jan Kara, linux-mm, linux-kernel, cgroups, linux-fsdevel On Tue, Mar 12, 2024 at 10:22:54AM +0100, Vlastimil Babka wrote: > On 3/1/24 19:53, Roman Gushchin wrote: > > On Fri, Mar 01, 2024 at 09:51:18AM -0800, Linus Torvalds wrote: > >> What I *think* I'd want for this case is > >> > >> (a) allow the accounting to go over by a bit > >> > >> (b) make sure there's a cheap way to ask (before) about "did we go > >> over the limit" > >> > >> IOW, the accounting never needed to be byte-accurate to begin with, > >> and making it fail (cheaply and early) on the next file allocation is > >> fine. > >> > >> Just make it really cheap. Can we do that? > >> > >> For example, maybe don't bother with the whole "bytes and pages" > >> stuff. Just a simple "are we more than one page over?" kind of > >> question. Without the 'stock_lock' mess for sub-page bytes etc > >> > >> How would that look? Would it result in something that can be done > >> cheaply without locking and atomics and without excessive pointer > >> indirection through many levels of memcg data structures? > > > > I think it's possible and I'm currently looking into batching charge, > > objcg refcnt management and vmstats using per-task caching. It should > > speed up things for the majority of allocations. > > For allocations from an irq context and targeted allocations > > (where the target memcg != memcg of the current task) we'd probably need to > > keep the old scheme. I hope to post some patches relatively soon. > > Do you think this will work on top of this series, i.e. patches 1+2 could be > eventually put to slab/for-next after the merge window, or would it > interfere with your changes? Please, go on and merge them, I'll rebase on top of it, it will be even better for my work. I made a couple of comments there, but overall they look very good to me, thank you for doing this work! > > > I tried to optimize the current implementation but failed to get any > > significant gains. It seems that the overhead is very evenly spread across > > objcg pointer access, charge management, objcg refcnt management and vmstats. I started working on the thing, but it's a bit more complicated than I initially thought because: 1) there are allocations made from a !in_task() context, so we need to handle this correctly 2) tasks can be moved between cgroups concurrently to memory allocations. fortunately my recent changes provide a path here, but it adds to the complexity. In alternative world where tasks can't move between cgroups the life would be so much better (and faster too, we could remove a ton of synchronization). 3) we do have per-numa-node per-memcg stats, which are less trivial to cache on struct task I hope to resolve these issues somehow and post patches, but probably will need a bit more time. Thanks! ^ permalink raw reply [flat|nested] 21+ messages in thread
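As a rough illustration of that direction only (every name below is invented, nothing like this has been posted): a small per-task pre-charge that allocations made from task context for the task's own memcg can consume locklessly, with everything else falling back to the existing objcg charge path.

/* Illustrative sketch: not part of any posted series. */
struct task_objcg_cache {
	struct obj_cgroup *objcg;	/* objcg this cache was filled from */
	unsigned int nr_bytes;		/* pre-charged bytes left to consume */
};

static bool task_objcg_try_charge(struct task_objcg_cache *c,
				  struct obj_cgroup *objcg, unsigned int size)
{
	/* irq context or a foreign memcg: take the existing slow path */
	if (!in_task() || c->objcg != objcg)
		return false;

	if (c->nr_bytes < size)
		return false;	/* refill via obj_cgroup_charge() in the slow path */

	c->nr_bytes -= size;
	return true;
}

The hard parts listed above (task migration between cgroups, per-numa-node per-memcg stats) are exactly what such a cache would have to flush or invalidate, which is why this stays a sketch.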
* Re: [PATCH RFC 4/4] UNFINISHED mm, fs: use kmem_cache_charge() in path_openat() 2024-03-01 17:51 ` Linus Torvalds 2024-03-01 18:53 ` Roman Gushchin @ 2024-03-04 12:47 ` Christian Brauner 2024-03-24 2:27 ` Al Viro 2 siblings, 0 replies; 21+ messages in thread From: Christian Brauner @ 2024-03-04 12:47 UTC (permalink / raw) To: Linus Torvalds Cc: Vlastimil Babka, Josh Poimboeuf, Jeff Layton, Chuck Lever, Kees Cook, Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton, Roman Gushchin, Hyeonggon Yoo, Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song, Alexander Viro, Jan Kara, linux-mm, linux-kernel, cgroups, linux-fsdevel On Fri, Mar 01, 2024 at 09:51:18AM -0800, Linus Torvalds wrote: > On Fri, 1 Mar 2024 at 09:07, Vlastimil Babka <vbabka@suse.cz> wrote: > > > > This is just an example of using the kmem_cache_charge() API. I think > > it's placed in a place that's applicable for Linus's example [1] > > although he mentions do_dentry_open() - I have followed from strace() > > showing openat(2) to path_openat() doing the alloc_empty_file(). > > Thanks. This is not the right patch, but yes, patches 1-3 look very nice to me. > > > The idea is that filp_cachep stops being SLAB_ACCOUNT. Allocations that > > want to be accounted immediately can use GFP_KERNEL_ACCOUNT. I did that > > in alloc_empty_file_noaccount() (despite the contradictory name but the > > noaccount refers to something else, right?) as IIUC it's about > > kernel-internal opens. > > Yeah, the "noaccount" function is about not accounting it towards nr_files. > That said, I don't think it necessarily needs to do the memory > accounting either - it's literally for cases where we're never going > to install the file descriptor in any user space. Exactly. > Your change to use GFP_KERNEL_ACCOUNT isn't exactly wrong, but I don't > think it's really the right thing either, because > > > Why is this unfinished: > > > > - there are other callers of alloc_empty_file() which I didn't adjust so > > they simply became memcg-unaccounted. I haven't investigated for which > > ones it would make also sense to separate the allocation and accounting. > > Maybe alloc_empty_file() would need to get a parameter to control > > this. > > Right. I think the natural and logical way to deal with this is to > just say "we account when we add the file to the fdtable". > IOW, just have fd_install() do it. That's the really natural point, > and also makes it very logical why alloc_empty_file_noaccount() > wouldn't need to do the GFP_KERNEL_ACCOUNT. > > > - I don't know how to properly unwind the accounting failure case. It > > seems like a new case because when we succeed the open, there's no > > further error path at least in path_openat(). > > Yeah, let me think about this part. Becasue fd_install() is the right > point, but that too does not really allow for error handling. > > Yes, we could close things and fail it, but it really is much too late > at this point. It would also mean massaging 100+ callsites. And having a non-subsystems specific failure step between file allocation, fd reservation and fd_install() would be awkward and an invitation for bugs. > What I *think* I'd want for this case is > > (a) allow the accounting to go over by a bit > > (b) make sure there's a cheap way to ask (before) about "did we go > over the limit" > > IOW, the accounting never needed to be byte-accurate to begin with, > and making it fail (cheaply and early) on the next file allocation is > fine. I think that's a good idea. 
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH RFC 4/4] UNFINISHED mm, fs: use kmem_cache_charge() in path_openat() 2024-03-01 17:51 ` Linus Torvalds 2024-03-01 18:53 ` Roman Gushchin 2024-03-04 12:47 ` Christian Brauner @ 2024-03-24 2:27 ` Al Viro 2024-03-24 17:44 ` Linus Torvalds 2 siblings, 1 reply; 21+ messages in thread From: Al Viro @ 2024-03-24 2:27 UTC (permalink / raw) To: Linus Torvalds Cc: Vlastimil Babka, Josh Poimboeuf, Jeff Layton, Chuck Lever, Kees Cook, Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton, Roman Gushchin, Hyeonggon Yoo, Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song, Christian Brauner, Jan Kara, linux-mm, linux-kernel, cgroups, linux-fsdevel On Fri, Mar 01, 2024 at 09:51:18AM -0800, Linus Torvalds wrote: > Right. I think the natural and logical way to deal with this is to > just say "we account when we add the file to the fdtable". > > IOW, just have fd_install() do it. That's the really natural point, > and also makes it very logical why alloc_empty_file_noaccount() > wouldn't need to do the GFP_KERNEL_ACCOUNT. We can have the same file occuring in many slots of many descriptor tables, obviously. So it would have to be a flag (in ->f_mode?) set by it, for "someone's already charged for it", or you'll end up with really insane crap on each fork(), dup(), etc. But there's also MAP_ANON with its setup_shmem_file(), with the resulting file not going into descriptor tables at all, and that's not a rare thing. > > - I don't know how to properly unwind the accounting failure case. It > > seems like a new case because when we succeed the open, there's no > > further error path at least in path_openat(). > > Yeah, let me think about this part. Becasue fd_install() is the right > point, but that too does not really allow for error handling. > > Yes, we could close things and fail it, but it really is much too late > at this point. That as well. For things like O_CREAT even do_dentry_open() would be too late for unrolls. > What I *think* I'd want for this case is > > (a) allow the accounting to go over by a bit > > (b) make sure there's a cheap way to ask (before) about "did we go > over the limit" > > IOW, the accounting never needed to be byte-accurate to begin with, > and making it fail (cheaply and early) on the next file allocation is > fine. > > Just make it really cheap. Can we do that? That might be reasonable, but TBH I would rather combine that with do_dentry_open()/alloc_file() (i.e. the places where we set FMODE_OPENED) as places to do that, rather than messing with fd_install(). How does the following sound? * those who allocate empty files mark them if they are intended to be kernel-internal (see below for how to get the information there) * memcg charge happens when we set FMODE_OPENED, provided that struct file instance is not marked kernel-internal. * exceeding the limit => pretend we'd succeeded and fail the next allocation. As for how to get the information down there... We have 6 functions where "allocate" and "mark it opened" callchains converge - alloc_file() (pipe(2) et.al., mostly), path_openat() (normal opens, but also filp_open() et.al.), dentry_open(), kernel_file_open(), kernel_tmpfile_open(), dentry_create(). The last 3 are all kernel-internal; dentry_open() might or might not be. For path_openat() we can add a bit somewhere in struct open_flags; the places where we set struct open_flags up would be the ones that might need to be annotated. 
That's file_open_name() file_open_root() do_sys_openat2() (definitely userland) io_openat2() (ditto) sys_uselib() (ditto) do_open_execat() (IMO can be considered userland in all cases) For alloc_file() it's almost always userland. IMO things like dma_buf_export() and setup_shmem_file() should be charged. So it's a matter of propagating the information to dentry_open(), file_open_name() and file_open_root(). That's about 70 callers to annotate, including filp_open() and file_open_root_mnt() into the mix. <greps> 61, actually, and from the quick look it seems that most of them are really obvious... Comments? ^ permalink raw reply [flat|nested] 21+ messages in thread
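For reference, a sketch of what that could look like, assuming roughly the current fields of struct open_flags plus the proposed bit (the new field name is invented):

struct open_flags {
	int open_flag;
	umode_t mode;
	int acc_mode;
	int intent;
	int lookup_flags;
	bool kernel_internal;	/* kernel-internal open: skip the memcg charge */
};

The callers counted above would then only convey that information down to where the open_flags is built (file_open_name()/file_open_root() and their wrappers) rather than learn about memcg themselves.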
* Re: [PATCH RFC 4/4] UNFINISHED mm, fs: use kmem_cache_charge() in path_openat() 2024-03-24 2:27 ` Al Viro @ 2024-03-24 17:44 ` Linus Torvalds 0 siblings, 0 replies; 21+ messages in thread From: Linus Torvalds @ 2024-03-24 17:44 UTC (permalink / raw) To: Al Viro Cc: Vlastimil Babka, Josh Poimboeuf, Jeff Layton, Chuck Lever, Kees Cook, Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton, Roman Gushchin, Hyeonggon Yoo, Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song, Christian Brauner, Jan Kara, linux-mm, linux-kernel, cgroups, linux-fsdevel [ Al, I hope your email works now ] On Sat, 23 Mar 2024 at 19:27, Al Viro <viro@zeniv.linux.org.uk> wrote: > > We can have the same file occuring in many slots of many descriptor tables, > obviously. So it would have to be a flag (in ->f_mode?) set by it, for > "someone's already charged for it", or you'll end up with really insane > crap on each fork(), dup(), etc. Nope. That flag already exists in the slab code itself with this patch. The kmem_cache_charge() thing itself just sets the "I'm charged" bit in the slab header, and you're done. Any subsequent fd_install (with dup, or fork or whatever) simply is irrelevant. In fact, dup and fork and friends won't need to worry about this, because they only work on files that have already been installed, so they know the file is already accounted. So it's only the initial open() case that needs to do the kmem_cache_charge() as it does the fd_install. > But there's also MAP_ANON with its setup_shmem_file(), with the resulting > file not going into descriptor tables at all, and that's not a rare thing. Just making alloc_file_pseudo() do a SLAB_ACOUNT should take care of all the normal case. For once, the core allocator is not exposed very much, so we can literally just look at "who does alloc_file*()" and it turns out it's all pretty well abstracted out. So I think it's mainly the three cases of 'alloc_empty_file()' that would be affected and need to check that they actually do the fd_install() (or release it). Everything else should either not account at all (if they know they are doing temporary kernel things), or always account (eg alloc_file_pseudo()). Linus ^ permalink raw reply [flat|nested] 21+ messages in thread
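Put together, the direction described above could look roughly like this at the install point (a sketch only, reusing the kmem_account_file() helper from patch 4; what to do when the charge fails is still the open question from earlier in the thread):

/*
 * Sketch: charge when the descriptor is actually handed to userspace.
 * kmem_cache_charge() returns early if the object's objcg slot is
 * already set, so charging the same file again (dup(), fork()) stays a
 * no-op, and alloc_empty_file_noaccount() users never get here at all.
 */
static void fd_install_charged(unsigned int fd, struct file *file)
{
	kmem_account_file(file);	/* result ignored: see (a)/(b) above */
	fd_install(fd, file);
}

alloc_file_pseudo() users (MAP_ANON shmem files, dma-buf exports and the like) would instead keep immediate accounting, e.g. by charging at allocation time as Linus suggests above.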