The Linux Kernel Mailing List
From: "Vlastimil Babka (SUSE)" <vbabka@kernel.org>
To: "Harry Yoo (Oracle)" <harry@kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Hao Li <hao.li@linux.dev>, Christoph Lameter <cl@gentwo.org>,
	David Rientjes <rientjes@google.com>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Suren Baghdasaryan <surenb@google.com>, Hao Ge <hao.ge@linux.dev>,
	Kees Cook <kees@kernel.org>, Pedro Falcato <pfalcato@suse.de>,
	Shakeel Butt <shakeel.butt@linux.dev>,
	Danielle Costantino <dcostantino@meta.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH RFC hotfixes 2/2] mm/slab: prevent unbounded recursion in free path with new kmalloc type
Date: Thu, 2 Jul 2026 14:57:34 +0200
Message-ID: <de2e0d60-6954-4b54-bbff-ae78dbf9f789@kernel.org>
In-Reply-To: <20260702-kmalloc-no-objext-v1-2-167175008538@kernel.org>

On 7/2/26 06:09, Harry Yoo (Oracle) wrote:
> Commit 280ea9c3154b ("mm/slab: avoid allocating slabobj_ext array from
> its own slab") avoided recursive allocation of obj_exts from kmalloc
> caches of the same size, by bumping the obj_exts array's allocation
> size whenever the array size equals the size of the object being
> allocated.
> 
> However, as reported by Danielle Costantino and Shakeel Butt,
> even slabs from kmalloc caches of different sizes can form a cycle
> by allocating obj_exts arrays from each other [1]:
> 
>   What happened: a KMALLOC_NORMAL slab's obj_exts array (used by
>   allocation profiling / memcg accounting) is itself kmalloc()'d from a
>   KMALLOC_NORMAL cache, so the "slab holds another slab's obj_exts array"
>   relation can form cycles. With sizeof(struct slabobj_ext) == 16 and
>   the host's geometry:
> 
>   - kmalloc-512 has 64 objects/slab -> array is 64*16 == 1024 bytes,
>     served from kmalloc-1k;
>   - kmalloc-1k  has 32 objects/slab -> array is 32*16 ==  512 bytes,
>     served from kmalloc-512.
> 
>   A kmalloc-512 slab and a kmalloc-1k slab therefore hold each other's
>   obj_exts array.  Discarding one frees the other's array, which empties
>   and discards that slab, which frees the first's array, and so on:
>   __free_slab() -> free_slab_obj_exts() -> kfree() -> discard_slab() ->
>   __free_slab() recurses along the cycle until the stack is exhausted.
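
The arithmetic in the report can be checked with a minimal userspace
sketch. This is only a model, not kernel code: kmalloc_bucket() and
forms_two_cache_cycle() are hypothetical helpers, the bucket rounding
ignores the kmalloc-96 and kmalloc-192 caches, and the 16-byte
sizeof(struct slabobj_ext) is taken from the report above.

```c
#include <stddef.h>

/* Assumption from the report: sizeof(struct slabobj_ext) == 16. */
#define SLABOBJ_EXT_SIZE 16

/* Bytes needed for the obj_exts array of a slab holding this many objects. */
static size_t obj_exts_array_size(unsigned int objects_per_slab)
{
	return (size_t)SLABOBJ_EXT_SIZE * objects_per_slab;
}

/*
 * Hypothetical helper: round a request up to a power-of-two kmalloc bucket.
 * Simplified model; the real size index also has kmalloc-96 and kmalloc-192.
 */
static size_t kmalloc_bucket(size_t size)
{
	size_t bucket = 8;

	while (bucket < size)
		bucket <<= 1;
	return bucket;
}

/*
 * Returns 1 if cache A's obj_exts array is served from cache B and
 * cache B's obj_exts array is served from cache A.
 */
static int forms_two_cache_cycle(size_t size_a, unsigned int objs_a,
				 size_t size_b, unsigned int objs_b)
{
	return kmalloc_bucket(obj_exts_array_size(objs_a)) == size_b &&
	       kmalloc_bucket(obj_exts_array_size(objs_b)) == size_a;
}
```

With the geometry from the report, forms_two_cache_cycle(512, 64, 1024, 32)
is true, which matches the kmalloc-512 <-> kmalloc-1k cycle described above.
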
> 
> With memory allocation profiling, this allows unbounded recursion
> in the free path and led to a stack overflow on a production host in
> the Meta fleet [1]:
> 
>   BUG: TASK stack guard page was hit
>   Oops: stack guard page
>   RIP: 0010:kfree+0x8/0x5d0
>   Call Trace:
>    __free_slab+0x66/0xc0
>    kfree+0x3f0/0x5d0
>    ... ( ~125x __free_slab <-> kfree ) ...
>    <kernel driver freeing a resource>
>    do_syscall_64
> 
> It is proposed [1] to resolve this issue by always serving the obj_exts
> array allocation from kmalloc caches (or large kmalloc) of sizes larger
> than the object size. However, as pointed out by Vlastimil Babka [2],
> this can waste an excessive amount of memory as slabs from large
> kmalloc sizes (e.g. kmalloc-8k) generally need obj_exts arrays much
> smaller than the object size.
> 
> Therefore, rather than bumping the size, let us take a different
> approach; disallow formation of cycles between kmalloc types when
> allocating obj_exts arrays. Currently, all obj_exts arrays are served
> from normal kmalloc caches. Cycles cannot be created if obj_exts arrays
> of normal kmalloc caches are served from a special kmalloc type that can
> never have obj_exts arrays.
> 
> To achieve this, create a new kmalloc type called KMALLOC_NO_OBJ_EXT.
> KMALLOC_NO_OBJ_EXT caches are created when CONFIG_SLAB_OBJ_EXT is
> enabled, and they have SLAB_NO_OBJ_EXT flag to prevent allocation
> of obj_exts arrays. They remain unused until allocation of obj_exts
> arrays for normal kmalloc caches happens.

I wonder if we should just use them always (not just for kmalloc_normal) if
we already have them. Would there be any downside?
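
To make the question concrete, here is a rough type-level sketch in
userspace C. Everything in it is a model and an assumption: the
model_type and model_policy enums and both helpers are hypothetical, and
"always" is just one reading of the idea above. It only shows that a type
which never has an obj_exts array cannot be part of a cycle; it says
nothing about memory overhead or accounting, which is what the question
is really about.

```c
/* Hypothetical model of kmalloc cache types; not the kernel enum. */
enum model_type {
	MODEL_NORMAL,
	MODEL_CGROUP,
	MODEL_RECLAIM,
	MODEL_NO_OBJ_EXT,
	MODEL_NR_TYPES
};

enum model_policy {
	POLICY_BEFORE_PATCH,	/* every array comes from normal kmalloc */
	POLICY_AS_POSTED,	/* only normal kmalloc uses NO_OBJ_EXT caches */
	POLICY_ALWAYS		/* every array comes from NO_OBJ_EXT caches */
};

#define MODEL_NONE (-1)

/* Which type serves the obj_exts array of a slab of type 'owner'? */
static int obj_exts_source(int owner, enum model_policy policy)
{
	/* SLAB_NO_OBJ_EXT caches never get an obj_exts array. */
	if (owner == MODEL_NO_OBJ_EXT)
		return MODEL_NONE;

	switch (policy) {
	case POLICY_BEFORE_PATCH:
		return MODEL_NORMAL;
	case POLICY_AS_POSTED:
		return owner == MODEL_NORMAL ? MODEL_NO_OBJ_EXT : MODEL_NORMAL;
	case POLICY_ALWAYS:
		return MODEL_NO_OBJ_EXT;
	}
	return MODEL_NONE;
}

/* Returns 1 if following "array is served from" edges can loop back. */
static int type_graph_has_cycle(enum model_policy policy)
{
	for (int start = 0; start < MODEL_NR_TYPES; start++) {
		int cur = start;

		for (int step = 0; step < MODEL_NR_TYPES; step++) {
			cur = obj_exts_source(cur, policy);
			if (cur == MODEL_NONE)
				break;
			if (cur == start)
				return 1;
		}
	}
	return 0;
}
```

In this model both POLICY_AS_POSTED and POLICY_ALWAYS are acyclic, while
POLICY_BEFORE_PATCH has the normal -> normal self-edge.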

> Sheaf bootstrapping for KMALLOC_NO_OBJ_EXT caches now must be deferred
> because allocation of a barn can trigger obj_exts array allocation of
> normal kmalloc caches when the KMALLOC_NO_OBJ_EXT cache for that size
> is not ready yet. For simplicity, perform bootstrapping of sheaves for
> all kmalloc caches later.
> 
> Introduce a new slab alloc flag, SLAB_ALLOC_NO_OBJ_EXT, to prevent
> allocation of obj_exts arrays, and let kmalloc_slab() override the type
> to KMALLOC_NO_OBJ_EXT when specified. Note that kmalloc_type() remains
> unchanged because kmalloc_flags() bypasses the kmalloc fastpath.
> 
> Do not pass SLAB_ALLOC_NO_RECURSE to kmalloc_flags() in
> alloc_slab_obj_exts() and instead use SLAB_ALLOC_NO_OBJ_EXT only when
> the objects are allocated from normal kmalloc caches. While this
> prevents unbounded recursive allocation of obj_exts, it allows
> KMALLOC_NO_OBJ_EXT caches to have sheaves.
> 
> Since sheaf allocations specify SLAB_ALLOC_NO_RECURSE that prevents
> allocation of both sheaves and obj_exts arrays, the recursion depth
> is bounded.
> 
> Reported-by: Danielle Costantino <dcostantino@meta.com>
> Reported-by: Shakeel Butt <shakeel.butt@linux.dev>
> Closes: https://lore.kernel.org/linux-mm/20260625230029.703750-1-shakeel.butt@linux.dev [1]
> Fixes: 4b8736964640 ("mm/slab: add allocation accounting into slab allocation and free paths")
> Cc: stable@vger.kernel.org
> Link: https://lore.kernel.org/linux-mm/c5c4208d-a6f0-413e-bad9-49be12f12d55@kernel.org [2]
> Signed-off-by: Harry Yoo (Oracle) <harry@kernel.org>
> ---
>  include/linux/slab.h |  3 ++
>  mm/slab.h            | 17 +++++++++--
>  mm/slab_common.c     | 18 +++++++++++-
>  mm/slub.c            | 83 +++++++++++++++++++++-------------------------------
>  4 files changed, 68 insertions(+), 53 deletions(-)
> 
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 08d7b6c9c4d6..0c1d13773523 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -721,6 +721,9 @@ enum kmalloc_cache_type {
>  #endif
>  #ifdef CONFIG_MEMCG
>  	KMALLOC_CGROUP,
> +#endif
> +#ifdef CONFIG_SLAB_OBJ_EXT
> +	KMALLOC_NO_OBJ_EXT,
>  #endif
>  	NR_KMALLOC_TYPES
>  };
> diff --git a/mm/slab.h b/mm/slab.h
> index 281a65233795..0428cd495191 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -22,6 +22,7 @@
>  #define SLAB_ALLOC_NOLOCK	0x01 /* a kmalloc_nolock() allocation */
>  #define SLAB_ALLOC_NEW_SLAB	0x02 /* a flag for alloc_slab_obj_exts() */
>  #define SLAB_ALLOC_NO_RECURSE	0x04 /* prevent kmalloc() recursion */
> +#define SLAB_ALLOC_NO_OBJ_EXT	0x08 /* prevent obj_exts array allocation */
>  
>  static inline bool alloc_flags_allow_spinning(const unsigned int alloc_flags)
>  {
> @@ -386,12 +387,19 @@ static inline unsigned int size_index_elem(unsigned int bytes)
>   * KMALLOC_MAX_CACHE_SIZE and the caller must check that.
>   */
>  static inline struct kmem_cache *
> -kmalloc_slab(size_t size, kmem_buckets *b, gfp_t flags, kmalloc_token_t token)
> +kmalloc_slab(size_t size, kmem_buckets *b, gfp_t flags, kmalloc_token_t token,
> +	     unsigned int alloc_flags)
>  {
>  	unsigned int index;
> +	enum kmalloc_cache_type type = kmalloc_type(flags, token);
> +
> +#ifdef CONFIG_SLAB_OBJ_EXT
> +	if (alloc_flags & SLAB_ALLOC_NO_OBJ_EXT)
> +		type = KMALLOC_NO_OBJ_EXT;
> +#endif
>  
>  	if (!b)
> -		b = &kmalloc_caches[kmalloc_type(flags, token)];
> +		b = &kmalloc_caches[type];
>  	if (size <= 192)
>  		index = kmalloc_size_index[size_index_elem(size)];
>  	else
> @@ -426,6 +434,11 @@ static inline bool is_kmalloc_normal(struct kmem_cache *s)
>  {
>  	if (!is_kmalloc_cache(s))
>  		return false;
> +
> +	/* KMALLOC_NO_OBJ_EXT is not normal kmalloc */
> +	if (s->flags & SLAB_NO_OBJ_EXT)
> +		return false;

Could it just go into the test below?
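
Roughly what I mean, as a userspace sketch rather than a tested kernel
change. The MODEL_* flag values and struct model_cache are made up for
illustration (the real SLAB_* bits differ), and variants_agree() is a
hypothetical helper that only exists to compare the two forms.

```c
#include <stdbool.h>

/* Hypothetical bit values for this model; the kernel's SLAB_* bits differ. */
#define MODEL_SLAB_KMALLOC		0x01u
#define MODEL_SLAB_CACHE_DMA		0x02u
#define MODEL_SLAB_ACCOUNT		0x04u
#define MODEL_SLAB_RECLAIM_ACCOUNT	0x08u
#define MODEL_SLAB_NO_OBJ_EXT		0x10u

struct model_cache {
	unsigned int flags;
};

/* The check as posted: a separate early return for SLAB_NO_OBJ_EXT. */
static bool is_kmalloc_normal_as_posted(const struct model_cache *s)
{
	if (!(s->flags & MODEL_SLAB_KMALLOC))
		return false;

	if (s->flags & MODEL_SLAB_NO_OBJ_EXT)
		return false;

	return !(s->flags & (MODEL_SLAB_CACHE_DMA | MODEL_SLAB_ACCOUNT |
			     MODEL_SLAB_RECLAIM_ACCOUNT));
}

/* The same check with the flag folded into the existing mask. */
static bool is_kmalloc_normal_folded(const struct model_cache *s)
{
	if (!(s->flags & MODEL_SLAB_KMALLOC))
		return false;

	return !(s->flags & (MODEL_SLAB_CACHE_DMA | MODEL_SLAB_ACCOUNT |
			     MODEL_SLAB_RECLAIM_ACCOUNT |
			     MODEL_SLAB_NO_OBJ_EXT));
}

/* Exhaustively compare both variants over all 32 flag combinations. */
static bool variants_agree(void)
{
	for (unsigned int flags = 0; flags < 0x20u; flags++) {
		struct model_cache c = { .flags = flags };

		if (is_kmalloc_normal_as_posted(&c) !=
		    is_kmalloc_normal_folded(&c))
			return false;
	}
	return true;
}
```

The two variants agree for every flag combination in the model, so the
folded form should be a pure simplification.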

> +
>  	return !(s->flags & (SLAB_CACHE_DMA|SLAB_ACCOUNT|SLAB_RECLAIM_ACCOUNT));
>  }
>  
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index b6426d7ceec9..7f262134d0f2 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -783,11 +783,15 @@ u8 kmalloc_size_index[24] __ro_after_init = {
>  size_t kmalloc_size_roundup(size_t size)
>  {
>  	if (size && size <= KMALLOC_MAX_CACHE_SIZE) {
> +		struct kmem_cache *s;
> +
>  		/*
>  		 * The flags don't matter since size_index is common to all.
>  		 * Neither does the caller for just getting ->object_size.
>  		 */
> -		return kmalloc_slab(size, NULL, GFP_KERNEL, __kmalloc_token(0))->object_size;
> +		s = kmalloc_slab(size, NULL, GFP_KERNEL, __kmalloc_token(0),
> +				 SLAB_ALLOC_DEFAULT);
> +		return s->object_size;
>  	}
>  
>  	/* Above the smaller buckets, size is a multiple of page size. */
> @@ -843,6 +847,12 @@ EXPORT_SYMBOL(kmalloc_size_roundup);
>  #define KMALLOC_PARTITION_NAME(N, sz)
>  #endif
>  
> +#ifdef CONFIG_SLAB_OBJ_EXT
> +#define KMALLOC_NO_OBJ_EXT_NAME(sz) .name[KMALLOC_NO_OBJ_EXT] = "kmalloc-no-objext-" #sz,
> +#else
> +#define KMALLOC_NO_OBJ_EXT_NAME(sz)
> +#endif
> +
>  #define INIT_KMALLOC_INFO(__size, __short_size)			\
>  {								\
>  	.name[KMALLOC_NORMAL]  = "kmalloc-" #__short_size,	\
> @@ -850,6 +860,7 @@ EXPORT_SYMBOL(kmalloc_size_roundup);
>  	KMALLOC_CGROUP_NAME(__short_size)			\
>  	KMALLOC_DMA_NAME(__short_size)				\
>  	KMALLOC_PARTITION_NAME(KMALLOC_PARTITION_CACHES_NR, __short_size)	\
> +	KMALLOC_NO_OBJ_EXT_NAME(__short_size)			\
>  	.size = __size,						\
>  }
>  
> @@ -966,6 +977,11 @@ new_kmalloc_cache(int idx, enum kmalloc_cache_type type)
>  		flags |= SLAB_NO_MERGE;
>  #endif
>  
> +#ifdef CONFIG_SLAB_OBJ_EXT
> +	if (type == KMALLOC_NO_OBJ_EXT)
> +		flags |= SLAB_NO_OBJ_EXT | SLAB_NO_MERGE;
> +#endif
> +
>  	/*
>  	 * If CONFIG_MEMCG is enabled, disable cache merging for
>  	 * KMALLOC_NORMAL caches.
> diff --git a/mm/slub.c b/mm/slub.c
> index efc85053ae84..8428b8308856 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2123,42 +2123,6 @@ static inline void init_slab_obj_exts(struct slab *slab)
>  	slab->obj_exts = 0;
>  }
>  
> -/*
> - * Calculate the allocation size for slabobj_ext array.
> - *
> - * When memory allocation profiling is enabled, the obj_exts array
> - * could be allocated from the same slab cache it's being allocated for.
> - * This would prevent the slab from ever being freed because it would
> - * always contain at least one allocated object (its own obj_exts array).
> - *
> - * To avoid this, increase the allocation size when we detect the array
> - * may come from the same cache, forcing it to use a different cache.
> - */
> -static inline size_t obj_exts_alloc_size(struct kmem_cache *s,
> -					 struct slab *slab, gfp_t gfp)
> -{
> -	size_t sz = sizeof(struct slabobj_ext) * slab->objects;
> -	struct kmem_cache *obj_exts_cache;
> -
> -	if (sz > KMALLOC_MAX_CACHE_SIZE)
> -		return sz;
> -
> -	if (!is_kmalloc_normal(s))
> -		return sz;
> -
> -	obj_exts_cache = kmalloc_slab(sz, NULL, gfp, __kmalloc_token(0));
> -	/*
> -	 * We can't simply compare s with obj_exts_cache, because partitioned kmalloc
> -	 * caches have multiple caches per size, selected by caller address or type.
> -	 * Since caller address or type may differ between kmalloc_slab() and actual
> -	 * allocation, bump size when sizes are equal.
> -	 */
> -	if (s->object_size == obj_exts_cache->object_size)
> -		return obj_exts_cache->object_size + 1;
> -
> -	return sz;
> -}
> -
>  int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
>  			gfp_t gfp, unsigned int alloc_flags)
>  {
> @@ -2168,14 +2132,18 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
>  	unsigned long new_exts;
>  	unsigned long old_exts;
>  	struct slabobj_ext *vec;
> -	size_t sz;
> +	size_t sz = sizeof(struct slabobj_ext) * slab->objects;
>  
>  	gfp &= ~OBJCGS_CLEAR_MASK;
> -	/* Prevent recursive extension vector allocation */
> -	alloc_flags |= SLAB_ALLOC_NO_RECURSE;
> -	alloc_flags &= ~SLAB_ALLOC_NEW_SLAB;
> +	/*
> +	 * In most cases, obj_exts arrays are allocated from normal kmalloc.
> +	 * However, normal kmalloc caches must allocate them from
> +	 * KMALLOC_NO_OBJ_EXT caches to prevent recursion.
> +	 */
> +	if (is_kmalloc_normal(s))
> +		alloc_flags |= SLAB_ALLOC_NO_OBJ_EXT;
>  
> -	sz = obj_exts_alloc_size(s, slab, gfp);
> +	alloc_flags &= ~SLAB_ALLOC_NEW_SLAB;
>  
>  	/* This will use kmalloc_nolock() if alloc_flags say so */
>  	vec = kmalloc_flags(sz, gfp | __GFP_ZERO, alloc_flags, slab_nid(slab));
> @@ -2193,8 +2161,21 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
>  		return -ENOMEM;
>  	}
>  
> -	VM_WARN_ON_ONCE(virt_to_slab(vec) != NULL &&
> -			virt_to_slab(vec)->slab_cache == s);
> +	if (IS_ENABLED(CONFIG_DEBUG_VM)) {
> +		struct kmem_cache *exts_cache;
> +		struct slab *exts_slab;
> +
> +		exts_slab = virt_to_slab(vec);
> +		if (exts_slab) {
> +			/*
> +			 * The vector must be allocated from either normal or
> +			 * KMALLOC_NO_OBJ_EXT kmalloc caches to avoid cycles.
> +			 */
> +			exts_cache = virt_to_slab(vec)->slab_cache;
> +			WARN_ON_ONCE(!is_kmalloc_normal(exts_cache) &&
> +					!(exts_cache->flags & SLAB_NO_OBJ_EXT));
> +		}
> +	}
>  
>  	new_exts = (unsigned long)vec;
>  #ifdef CONFIG_MEMCG
> @@ -2254,7 +2235,7 @@ static inline void free_slab_obj_exts(struct slab *slab, bool allow_spin)
>  	}
>  
>  	/*
> -	 * obj_exts was created with SLAB_ALLOC_NO_RECURSE flag, therefore its
> +	 * obj_exts was created with SLAB_ALLOC_NO_OBJ_EXT flag, therefore its
>  	 * corresponding extension will be NULL. alloc_tag_sub() will throw a
>  	 * warning if slab has extensions but the extension of an object is
>  	 * NULL, therefore replace NULL with CODETAG_EMPTY to indicate that
> @@ -5330,7 +5311,7 @@ void *__do_kmalloc_node(kmem_buckets *b, gfp_t flags, int node,
>  	if (unlikely(!size))
>  		return ZERO_SIZE_PTR;
>  
> -	s = kmalloc_slab(size, b, flags, token);
> +	s = kmalloc_slab(size, b, flags, token, ac->alloc_flags);
>  
>  	ret = slab_alloc_node(s, flags, node, ac);
>  	ret = kasan_kmalloc(s, ret, size, flags);
> @@ -5395,7 +5376,9 @@ static void *__kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_f
>  retry:
>  	if (unlikely(size > KMALLOC_MAX_CACHE_SIZE))
>  		return NULL;
> -	s = kmalloc_slab(size, NULL, gfp_flags, PASS_TOKEN_PARAM(token));
> +
> +	s = kmalloc_slab(size, NULL, gfp_flags, PASS_TOKEN_PARAM(token),
> +			 ac->alloc_flags);
>  
>  	if (!(s->flags & __CMPXCHG_DOUBLE) && !kmem_cache_debug(s))
>  		/*
> @@ -7957,10 +7940,10 @@ static int calculate_sizes(struct kmem_cache_args *args, struct kmem_cache *s)
>  		s->allocflags |= __GFP_RECLAIMABLE;
>  
>  	/*
> -	 * For KMALLOC_NORMAL caches we enable sheaves later by
> -	 * bootstrap_kmalloc_sheaves() to avoid recursion
> +	 * For kmalloc caches we enable sheaves later by
> +	 * bootstrap_kmalloc_sheaves() to avoid recursion.
>  	 */
> -	if (!is_kmalloc_normal(s))
> +	if (!(s->flags & SLAB_KMALLOC))

is_kmalloc_cache()?

>  		s->sheaf_capacity = calculate_sheaf_capacity(s, args);
>  
>  	/*
> @@ -8524,7 +8507,7 @@ static void __init bootstrap_kmalloc_sheaves(void)
>  {
>  	enum kmalloc_cache_type type;
>  
> -	for (type = KMALLOC_NORMAL; type <= KMALLOC_PARTITION_END; type++) {
> +	for (type = KMALLOC_NORMAL; type < NR_KMALLOC_TYPES; type++) {
>  		for (int idx = 0; idx < KMALLOC_SHIFT_HIGH + 1; idx++) {
>  			if (kmalloc_caches[type][idx])
>  				bootstrap_cache_sheaves(kmalloc_caches[type][idx]);
> 

