[PATCH] mm/slub: serve slabobj_ext array from a strictly larger kmalloc cache

Linux-mm Archive on lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] mm/slub: serve slabobj_ext array from a strictly larger kmalloc cache
@ 2026-06-25 23:00 Shakeel Butt
  2026-06-26  4:22 ` Harry Yoo
  0 siblings, 1 reply; 4+ messages in thread
From: Shakeel Butt @ 2026-06-25 23:00 UTC (permalink / raw)
  To: Vlastimil Babka, Andrew Morton
  Cc: Harry Yoo, Roman Gushchin, Hao Li, Christoph Lameter,
	David Rientjes, Suren Baghdasaryan, Usama Arif, Meta kernel team,
	linux-mm, linux-kernel, Danielle Costantino

A production host in the Meta fleet (6.16 kernel, memory allocation
profiling enabled) panicked with a kernel stack overflow while a kernel
driver was freeing a resource:

  BUG: TASK stack guard page was hit
  Oops: stack guard page
  RIP: 0010:kfree+0x8/0x5d0
  Call Trace:
   __free_slab+0x66/0xc0
   kfree+0x3f0/0x5d0
   ... ( ~125x __free_slab <-> kfree ) ...
   <kernel driver freeing a resource>
   do_syscall_64

The crash dump shows a 125-deep __free_slab<->kfree recursion that
overflowed the 16 KiB kernel stack.

What happened: a KMALLOC_NORMAL slab's obj_exts array (used by allocation
profiling / memcg accounting) is itself kmalloc()'d from a KMALLOC_NORMAL
cache, so the "slab holds another slab's obj_exts array" relation can form
cycles.  With sizeof(struct slabobj_ext) == 16 and the host's geometry:

  - kmalloc-512 has 64 objects/slab -> array is 64*16 == 1024 bytes,
    served from kmalloc-1k;
  - kmalloc-1k  has 32 objects/slab -> array is 32*16 ==  512 bytes,
    served from kmalloc-512.

A kmalloc-512 slab and a kmalloc-1k slab therefore hold each other's
obj_exts array.  Discarding one frees the other's array, which empties and
discards that slab, which frees the first's array, and so on:
__free_slab() -> free_slab_obj_exts() -> kfree() -> discard_slab() ->
__free_slab() recurses along the cycle until the stack is exhausted.  The
dump confirms it: the recursion's slabs strictly alternate kmalloc-512
(obj_exts in kmalloc-1k) and kmalloc-1k (obj_exts in kmalloc-512), and
mem_alloc_profiling_key was enabled.

Commit 280ea9c3154b ("mm/slab: avoid allocating slabobj_ext array from
its own slab") is not sufficient: it bumps the allocation size only when
the array would come from the *same* cache (object_size ==).  At the
geometry above neither cache is self-referential (512 != 1024 and
1024 != 512), so the bump never triggers and the kmalloc-512 <-> kmalloc-1k
cross cycle remains.

Fix it structurally by removing cycles of every shape: serve the array
from a cache strictly larger than the one it describes whenever it would
otherwise come from the same or a smaller cache.  Every reference edge
then points from a smaller to a larger cache (here kmalloc-1k's array
moves to kmalloc-2k), so the relation is a DAG and cannot contain a cycle.
No slab can be self- or cross-pinned, the tear-down recursion is bounded
by the number of kmalloc size classes (it terminates at the large-kmalloc
path, which carries no obj_exts), and profiling/accounting coverage is
unchanged - the array is still allocated, only relocated.

Reproduced on next-20260623 at the same geometry: churning
kmalloc-512/kmalloc-1k under vm.mem_profiling and then shrinking leaves
kmalloc-512 with thousands of unreclaimable objects without this patch
(8056) and at baseline with it (847).

Fixes: 4b8736964640 ("mm/slab: add allocation accounting into slab allocation and free paths")
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Reported-by: Danielle Costantino <dcostantino@meta.com>
---
 mm/slub.c | 26 ++++++++++----------------
 1 file changed, 10 insertions(+), 16 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 9ec774dc7009..48e54d340865 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2124,15 +2124,14 @@ static inline void init_slab_obj_exts(struct slab *slab)
 }

 /*
- * Calculate the allocation size for slabobj_ext array.
+ * Size of the slabobj_ext array for @slab.
  *
- * When memory allocation profiling is enabled, the obj_exts array
- * could be allocated from the same slab cache it's being allocated for.
- * This would prevent the slab from ever being freed because it would
- * always contain at least one allocated object (its own obj_exts array).
- *
- * To avoid this, increase the allocation size when we detect the array
- * may come from the same cache, forcing it to use a different cache.
+ * The array is itself kmalloc()'d. If it came from the same or a smaller
+ * kmalloc cache than @s, the "slab holds another slab's array" relation could
+ * form a cycle (self, or e.g. kmalloc-512 <-> kmalloc-1k) that pins the slabs
+ * forever and recurses via free_slab_obj_exts() -> kfree() -> discard_slab()
+ * at teardown. Force it into a strictly larger cache to keep that relation a
+ * DAG (acyclic).
  */
 static inline size_t obj_exts_alloc_size(struct kmem_cache *s,
 					 struct slab *slab, gfp_t gfp)
@@ -2147,14 +2146,9 @@ static inline size_t obj_exts_alloc_size(struct kmem_cache *s,
 		return sz;

 	obj_exts_cache = kmalloc_slab(sz, NULL, gfp, __kmalloc_token(0));
-	/*
-	 * We can't simply compare s with obj_exts_cache, because partitioned kmalloc
-	 * caches have multiple caches per size, selected by caller address or type.
-	 * Since caller address or type may differ between kmalloc_slab() and actual
-	 * allocation, bump size when sizes are equal.
-	 */
-	if (s->object_size == obj_exts_cache->object_size)
-		return obj_exts_cache->object_size + 1;
+	/* compare object_size, not the cache pointer (partitioned kmalloc caches) */
+	if (obj_exts_cache->object_size <= s->object_size)
+		return s->object_size + 1;

 	return sz;
 }
-- 
2.53.0-Meta

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH] mm/slub: serve slabobj_ext array from a strictly larger kmalloc cache
  2026-06-25 23:00 [PATCH] mm/slub: serve slabobj_ext array from a strictly larger kmalloc cache Shakeel Butt
@ 2026-06-26  4:22 ` Harry Yoo
  2026-06-26 16:49   ` Shakeel Butt
  0 siblings, 1 reply; 4+ messages in thread
From: Harry Yoo @ 2026-06-26  4:22 UTC (permalink / raw)
  To: Shakeel Butt, Vlastimil Babka, Andrew Morton
  Cc: Roman Gushchin, Hao Li, Christoph Lameter, David Rientjes,
	Suren Baghdasaryan, Usama Arif, Meta kernel team, linux-mm,
	linux-kernel, Danielle Costantino


[-- Attachment #1.1: Type: text/plain, Size: 6242 bytes --]


Hi Shakeel,

On 6/26/26 8:00 AM, Shakeel Butt wrote:
> A production host in the Meta fleet (6.16 kernel, memory allocation
> profiling enabled) panicked with a kernel stack overflow while a kernel
> driver was freeing a resource:
> 
>   BUG: TASK stack guard page was hit
>   Oops: stack guard page
>   RIP: 0010:kfree+0x8/0x5d0
>   Call Trace:
>    __free_slab+0x66/0xc0
>    kfree+0x3f0/0x5d0
>    ... ( ~125x __free_slab <-> kfree ) ...
>    <kernel driver freeing a resource>
>    do_syscall_64
> 
> The crash dump shows a 125-deep __free_slab<->kfree recursion that
> overflowed the 16 KiB kernel stack.

Ouch!

> What happened: a KMALLOC_NORMAL slab's obj_exts array (used by allocation
> profiling / memcg accounting) is itself kmalloc()'d from a KMALLOC_NORMAL
> cache,

Usually KMALLOC_NORMAL caches don't need obj_exts array, but yes,
this could happen if memory allocation profiling is enabled.

> so the "slab holds another slab's obj_exts array" relation can form
> cycles.  With sizeof(struct slabobj_ext) == 16 and the host's geometry:
>
>   - kmalloc-512 has 64 objects/slab -> array is 64*16 == 1024 bytes,
>     served from kmalloc-1k;
>   - kmalloc-1k  has 32 objects/slab -> array is 32*16 ==  512 bytes,
>     served from kmalloc-512.

Right.

> A kmalloc-512 slab and a kmalloc-1k slab therefore hold each other's
> obj_exts array.  Discarding one frees the other's array, which empties and
> discards that slab, which frees the first's array, and so on:
> __free_slab() -> free_slab_obj_exts() -> kfree() -> discard_slab() ->
> __free_slab() recurses along the cycle until the stack is exhausted.

Right.

> The
> dump confirms it: the recursion's slabs strictly alternate kmalloc-512
> (obj_exts in kmalloc-1k) and kmalloc-1k (obj_exts in kmalloc-512), and
> mem_alloc_profiling_key was enabled.
> 
> Commit 280ea9c3154b ("mm/slab: avoid allocating slabobj_ext array from
> its own slab") is not sufficient: it bumps the allocation size only when
> the array would come from the *same* cache (object_size ==).  At the
> geometry above neither cache is self-referential (512 != 1024 and
> 1024 != 512), so the bump never triggers and the kmalloc-512 <-> kmalloc-1k
> cross cycle remains.

Right.

> Fix it structurally by removing cycles of every shape: serve the array
> from a cache strictly larger than the one it describes whenever it would
> otherwise come from the same or a smaller cache.  Every reference edge
> then points from a smaller to a larger cache (here kmalloc-1k's array
> moves to kmalloc-2k), so the relation is a DAG and cannot contain a cycle.

This will fix the problem.

But this will waste memory as we need smaller obj_exts array
as the size gets larger.

We should probably create a new kmalloc type to avoid cycles instead?
(needed only when memory profiling is enabled, though)

That would also prevent recursion even further.

> No slab can be self- or cross-pinned, the tear-down recursion is bounded
> by the number of kmalloc size classes (it terminates at the large-kmalloc
> path, which carries no obj_exts), and profiling/accounting coverage is
> unchanged - the array is still allocated, only relocated.
> 
> Reproduced on next-20260623 at the same geometry: churning
> kmalloc-512/kmalloc-1k under vm.mem_profiling and then shrinking leaves
> kmalloc-512 with thousands of unreclaimable objects without this patch
> (8056) and at baseline with it (847).
> 
> Fixes: 4b8736964640 ("mm/slab: add allocation accounting into slab allocation and free paths")

Perhaps Cc: stable? v6.12 and v6.18 are affected.

> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> Reported-by: Danielle Costantino <dcostantino@meta.com>
> ---
>  mm/slub.c | 26 ++++++++++----------------
>  1 file changed, 10 insertions(+), 16 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 9ec774dc7009..48e54d340865 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2124,15 +2124,14 @@ static inline void init_slab_obj_exts(struct slab *slab)
>  }
>  
>  /*
> - * Calculate the allocation size for slabobj_ext array.
> + * Size of the slabobj_ext array for @slab.
>   *
> - * When memory allocation profiling is enabled, the obj_exts array
> - * could be allocated from the same slab cache it's being allocated for.
> - * This would prevent the slab from ever being freed because it would
> - * always contain at least one allocated object (its own obj_exts array).
> - *
> - * To avoid this, increase the allocation size when we detect the array
> - * may come from the same cache, forcing it to use a different cache.
> + * The array is itself kmalloc()'d. If it came from the same or a smaller
> + * kmalloc cache than @s, the "slab holds another slab's array" relation could
> + * form a cycle (self, or e.g. kmalloc-512 <-> kmalloc-1k) that pins the slabs
> + * forever and recurses via free_slab_obj_exts() -> kfree() -> discard_slab()
> + * at teardown. Force it into a strictly larger cache to keep that relation a
> + * DAG (acyclic).
>   */
>  static inline size_t obj_exts_alloc_size(struct kmem_cache *s,
>  					 struct slab *slab, gfp_t gfp)
> @@ -2147,14 +2146,9 @@ static inline size_t obj_exts_alloc_size(struct kmem_cache *s,
>  		return sz;
>  
>  	obj_exts_cache = kmalloc_slab(sz, NULL, gfp, __kmalloc_token(0));
> -	/*
> -	 * We can't simply compare s with obj_exts_cache, because partitioned kmalloc
> -	 * caches have multiple caches per size, selected by caller address or type.
> -	 * Since caller address or type may differ between kmalloc_slab() and actual
> -	 * allocation, bump size when sizes are equal.
> -	 */
> -	if (s->object_size == obj_exts_cache->object_size)
> -		return obj_exts_cache->object_size + 1;
> +	/* compare object_size, not the cache pointer (partitioned kmalloc caches) */

This comment is no longer relevant, by the way.

"compare object_size instead of cache pointers because there can be
 multiple caches of the same size" doesn't apply anymore.

> +	if (obj_exts_cache->object_size <= s->object_size)
> +		return s->object_size + 1;
>  
>  	return sz;
>  }

-- 
Cheers,
Harry / Hyeonggon

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] mm/slub: serve slabobj_ext array from a strictly larger kmalloc cache
  2026-06-26  4:22 ` Harry Yoo
@ 2026-06-26 16:49   ` Shakeel Butt
  2026-06-26 17:11     ` Vlastimil Babka (SUSE)
  0 siblings, 1 reply; 4+ messages in thread
From: Shakeel Butt @ 2026-06-26 16:49 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Vlastimil Babka, Andrew Morton, Roman Gushchin, Hao Li,
	Christoph Lameter, David Rientjes, Suren Baghdasaryan, Usama Arif,
	Meta kernel team, linux-mm, linux-kernel, Danielle Costantino

On Fri, Jun 26, 2026 at 01:22:09PM +0900, Harry Yoo wrote:
> 
> Hi Shakeel,
> 

[...]

> > What happened: a KMALLOC_NORMAL slab's obj_exts array (used by allocation
> > profiling / memcg accounting) is itself kmalloc()'d from a KMALLOC_NORMAL
> > cache,
> 
> Usually KMALLOC_NORMAL caches don't need obj_exts array, but yes,
> this could happen if memory allocation profiling is enabled.

Yes, we have enabled memory allocation profiling fleet wide.

[...]

> 
> > Fix it structurally by removing cycles of every shape: serve the array
> > from a cache strictly larger than the one it describes whenever it would
> > otherwise come from the same or a smaller cache.  Every reference edge
> > then points from a smaller to a larger cache (here kmalloc-1k's array
> > moves to kmalloc-2k), so the relation is a DAG and cannot contain a cycle.
> 
> This will fix the problem.
> 
> But this will waste memory as we need smaller obj_exts array
> as the size gets larger.
> 
> We should probably create a new kmalloc type to avoid cycles instead?
> (needed only when memory profiling is enabled, though)
> 
> That would also prevent recursion even further.

Yes but I assume that would add kmem caches even for users not using memory
profiling. Anyways, I think that is a separate discussion. Am I understanding
correctly that you don't have any concerns with this approach?

> 
> > No slab can be self- or cross-pinned, the tear-down recursion is bounded
> > by the number of kmalloc size classes (it terminates at the large-kmalloc
> > path, which carries no obj_exts), and profiling/accounting coverage is
> > unchanged - the array is still allocated, only relocated.
> > 
> > Reproduced on next-20260623 at the same geometry: churning
> > kmalloc-512/kmalloc-1k under vm.mem_profiling and then shrinking leaves
> > kmalloc-512 with thousands of unreclaimable objects without this patch
> > (8056) and at baseline with it (847).
> > 
> > Fixes: 4b8736964640 ("mm/slab: add allocation accounting into slab allocation and free paths")
> 
> Perhaps Cc: stable? v6.12 and v6.18 are affected.

Ack.

[...]

> > -	if (s->object_size == obj_exts_cache->object_size)
> > -		return obj_exts_cache->object_size + 1;
> > +	/* compare object_size, not the cache pointer (partitioned kmalloc caches) */
> 
> This comment is no longer relevant, by the way.
> 
> "compare object_size instead of cache pointers because there can be
>  multiple caches of the same size" doesn't apply anymore.
> 

I will remove the comment in next version.

Thanks for the review.

> > +	if (obj_exts_cache->object_size <= s->object_size)
> > +		return s->object_size + 1;
> >  
> >  	return sz;
> >  }
> 
> -- 
> Cheers,
> Harry / Hyeonggon





^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] mm/slub: serve slabobj_ext array from a strictly larger kmalloc cache
  2026-06-26 16:49   ` Shakeel Butt
@ 2026-06-26 17:11     ` Vlastimil Babka (SUSE)
  0 siblings, 0 replies; 4+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-06-26 17:11 UTC (permalink / raw)
  To: Shakeel Butt, Harry Yoo
  Cc: Andrew Morton, Roman Gushchin, Hao Li, Christoph Lameter,
	David Rientjes, Suren Baghdasaryan, Usama Arif, Meta kernel team,
	linux-mm, linux-kernel, Danielle Costantino

On 6/26/26 18:49, Shakeel Butt wrote:
> On Fri, Jun 26, 2026 at 01:22:09PM +0900, Harry Yoo wrote:
>> 
>> Hi Shakeel,
>> 
> 
> [...]
> 
>> > What happened: a KMALLOC_NORMAL slab's obj_exts array (used by allocation
>> > profiling / memcg accounting) is itself kmalloc()'d from a KMALLOC_NORMAL
>> > cache,
>> 
>> Usually KMALLOC_NORMAL caches don't need obj_exts array, but yes,
>> this could happen if memory allocation profiling is enabled.
> 
> Yes, we have enabled memory allocation profiling fleet wide.
> 
> [...]
> 
>> 
>> > Fix it structurally by removing cycles of every shape: serve the array
>> > from a cache strictly larger than the one it describes whenever it would
>> > otherwise come from the same or a smaller cache.  Every reference edge
>> > then points from a smaller to a larger cache (here kmalloc-1k's array
>> > moves to kmalloc-2k), so the relation is a DAG and cannot contain a cycle.
>> 
>> This will fix the problem.
>> 
>> But this will waste memory as we need smaller obj_exts array
>> as the size gets larger.
>> 
>> We should probably create a new kmalloc type to avoid cycles instead?
>> (needed only when memory profiling is enabled, though)
>> 
>> That would also prevent recursion even further.
> 
> Yes but I assume that would add kmem caches even for users not using memory
> profiling. Anyways, I think that is a separate discussion. Am I understanding
> correctly that you don't have any concerns with this approach?

Umm, the memory waste is a concern?

Minimally I'd now want to only do that size bumping when allocation
profiling is enabled. Ideally that means both configured in and not booted
with "never".

We probably should have done that already in 280ea9c3154b2. Because AFAIU
memcg-only obj_exts array don't have this issue (or maybe they do have the
[1] issue? Harry?). But if memcg-only should keep avoiding the same size
bucket, it can keep what it was doing and only memalloc profiling would do
the strictly larger thing.

Suren's input would be also nice to have.

Thanks!

[1] https://lore.kernel.org/oe-lkp/202601231457.f7b31e09-lkp@intel.com

>> 
>> > No slab can be self- or cross-pinned, the tear-down recursion is bounded
>> > by the number of kmalloc size classes (it terminates at the large-kmalloc
>> > path, which carries no obj_exts), and profiling/accounting coverage is
>> > unchanged - the array is still allocated, only relocated.
>> > 
>> > Reproduced on next-20260623 at the same geometry: churning
>> > kmalloc-512/kmalloc-1k under vm.mem_profiling and then shrinking leaves
>> > kmalloc-512 with thousands of unreclaimable objects without this patch
>> > (8056) and at baseline with it (847).
>> > 
>> > Fixes: 4b8736964640 ("mm/slab: add allocation accounting into slab allocation and free paths")
>> 
>> Perhaps Cc: stable? v6.12 and v6.18 are affected.
> 
> Ack.
> 
> [...]
> 
>> > -	if (s->object_size == obj_exts_cache->object_size)
>> > -		return obj_exts_cache->object_size + 1;
>> > +	/* compare object_size, not the cache pointer (partitioned kmalloc caches) */
>> 
>> This comment is no longer relevant, by the way.
>> 
>> "compare object_size instead of cache pointers because there can be
>>  multiple caches of the same size" doesn't apply anymore.
>> 
> 
> I will remove the comment in next version.
> 
> Thanks for the review.
> 
>> > +	if (obj_exts_cache->object_size <= s->object_size)
>> > +		return s->object_size + 1;
>> >  
>> >  	return sz;
>> >  }
>> 
>> -- 
>> Cheers,
>> Harry / Hyeonggon
> 
> 
> 



^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2026-06-26 17:11 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-25 23:00 [PATCH] mm/slub: serve slabobj_ext array from a strictly larger kmalloc cache Shakeel Butt
2026-06-26  4:22 ` Harry Yoo
2026-06-26 16:49   ` Shakeel Butt
2026-06-26 17:11     ` Vlastimil Babka (SUSE)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox