Linux-mm Archive on lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] mm/slub: serve slabobj_ext array from a strictly larger kmalloc cache
@ 2026-06-25 23:00 Shakeel Butt
  2026-06-26  4:22 ` Harry Yoo
  0 siblings, 1 reply; 2+ messages in thread
From: Shakeel Butt @ 2026-06-25 23:00 UTC (permalink / raw)
  To: Vlastimil Babka, Andrew Morton
  Cc: Harry Yoo, Roman Gushchin, Hao Li, Christoph Lameter,
	David Rientjes, Suren Baghdasaryan, Usama Arif, Meta kernel team,
	linux-mm, linux-kernel, Danielle Costantino

A production host in the Meta fleet (6.16 kernel, memory allocation
profiling enabled) panicked with a kernel stack overflow while a kernel
driver was freeing a resource:

  BUG: TASK stack guard page was hit
  Oops: stack guard page
  RIP: 0010:kfree+0x8/0x5d0
  Call Trace:
   __free_slab+0x66/0xc0
   kfree+0x3f0/0x5d0
   ... ( ~125x __free_slab <-> kfree ) ...
   <kernel driver freeing a resource>
   do_syscall_64

The crash dump shows a 125-deep __free_slab<->kfree recursion that
overflowed the 16 KiB kernel stack.

What happened: a KMALLOC_NORMAL slab's obj_exts array (used by allocation
profiling / memcg accounting) is itself kmalloc()'d from a KMALLOC_NORMAL
cache, so the "slab holds another slab's obj_exts array" relation can form
cycles.  With sizeof(struct slabobj_ext) == 16 and the host's geometry:

  - kmalloc-512 has 64 objects/slab -> array is 64*16 == 1024 bytes,
    served from kmalloc-1k;
  - kmalloc-1k  has 32 objects/slab -> array is 32*16 ==  512 bytes,
    served from kmalloc-512.

A kmalloc-512 slab and a kmalloc-1k slab therefore hold each other's
obj_exts array.  Discarding one frees the other's array, which empties and
discards that slab, which frees the first's array, and so on:
__free_slab() -> free_slab_obj_exts() -> kfree() -> discard_slab() ->
__free_slab() recurses along the cycle until the stack is exhausted.  The
dump confirms it: the recursion's slabs strictly alternate kmalloc-512
(obj_exts in kmalloc-1k) and kmalloc-1k (obj_exts in kmalloc-512), and
mem_alloc_profiling_key was enabled.

Commit 280ea9c3154b ("mm/slab: avoid allocating slabobj_ext array from
its own slab") is not sufficient: it bumps the allocation size only when
the array would come from the *same* cache (object_size ==).  At the
geometry above neither cache is self-referential (512 != 1024 and
1024 != 512), so the bump never triggers and the kmalloc-512 <-> kmalloc-1k
cross cycle remains.

Fix it structurally by removing cycles of every shape: serve the array
from a cache strictly larger than the one it describes whenever it would
otherwise come from the same or a smaller cache.  Every reference edge
then points from a smaller to a larger cache (here kmalloc-1k's array
moves to kmalloc-2k), so the relation is a DAG and cannot contain a cycle.
No slab can be self- or cross-pinned, the tear-down recursion is bounded
by the number of kmalloc size classes (it terminates at the large-kmalloc
path, which carries no obj_exts), and profiling/accounting coverage is
unchanged - the array is still allocated, only relocated.

Reproduced on next-20260623 at the same geometry: churning
kmalloc-512/kmalloc-1k under vm.mem_profiling and then shrinking leaves
kmalloc-512 with thousands of unreclaimable objects without this patch
(8056) and at baseline with it (847).

Fixes: 4b8736964640 ("mm/slab: add allocation accounting into slab allocation and free paths")
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Reported-by: Danielle Costantino <dcostantino@meta.com>
---
 mm/slub.c | 26 ++++++++++----------------
 1 file changed, 10 insertions(+), 16 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 9ec774dc7009..48e54d340865 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2124,15 +2124,14 @@ static inline void init_slab_obj_exts(struct slab *slab)
 }
 
 /*
- * Calculate the allocation size for slabobj_ext array.
+ * Size of the slabobj_ext array for @slab.
  *
- * When memory allocation profiling is enabled, the obj_exts array
- * could be allocated from the same slab cache it's being allocated for.
- * This would prevent the slab from ever being freed because it would
- * always contain at least one allocated object (its own obj_exts array).
- *
- * To avoid this, increase the allocation size when we detect the array
- * may come from the same cache, forcing it to use a different cache.
+ * The array is itself kmalloc()'d. If it came from the same or a smaller
+ * kmalloc cache than @s, the "slab holds another slab's array" relation could
+ * form a cycle (self, or e.g. kmalloc-512 <-> kmalloc-1k) that pins the slabs
+ * forever and recurses via free_slab_obj_exts() -> kfree() -> discard_slab()
+ * at teardown. Force it into a strictly larger cache to keep that relation a
+ * DAG (acyclic).
  */
 static inline size_t obj_exts_alloc_size(struct kmem_cache *s,
 					 struct slab *slab, gfp_t gfp)
@@ -2147,14 +2146,9 @@ static inline size_t obj_exts_alloc_size(struct kmem_cache *s,
 		return sz;
 
 	obj_exts_cache = kmalloc_slab(sz, NULL, gfp, __kmalloc_token(0));
-	/*
-	 * We can't simply compare s with obj_exts_cache, because partitioned kmalloc
-	 * caches have multiple caches per size, selected by caller address or type.
-	 * Since caller address or type may differ between kmalloc_slab() and actual
-	 * allocation, bump size when sizes are equal.
-	 */
-	if (s->object_size == obj_exts_cache->object_size)
-		return obj_exts_cache->object_size + 1;
+	/* compare object_size, not the cache pointer (partitioned kmalloc caches) */
+	if (obj_exts_cache->object_size <= s->object_size)
+		return s->object_size + 1;
 
 	return sz;
 }
-- 
2.53.0-Meta



^ permalink raw reply related	[flat|nested] 2+ messages in thread

* Re: [PATCH] mm/slub: serve slabobj_ext array from a strictly larger kmalloc cache
  2026-06-25 23:00 [PATCH] mm/slub: serve slabobj_ext array from a strictly larger kmalloc cache Shakeel Butt
@ 2026-06-26  4:22 ` Harry Yoo
  0 siblings, 0 replies; 2+ messages in thread
From: Harry Yoo @ 2026-06-26  4:22 UTC (permalink / raw)
  To: Shakeel Butt, Vlastimil Babka, Andrew Morton
  Cc: Roman Gushchin, Hao Li, Christoph Lameter, David Rientjes,
	Suren Baghdasaryan, Usama Arif, Meta kernel team, linux-mm,
	linux-kernel, Danielle Costantino


[-- Attachment #1.1: Type: text/plain, Size: 6242 bytes --]


Hi Shakeel,

On 6/26/26 8:00 AM, Shakeel Butt wrote:
> A production host in the Meta fleet (6.16 kernel, memory allocation
> profiling enabled) panicked with a kernel stack overflow while a kernel
> driver was freeing a resource:
> 
>   BUG: TASK stack guard page was hit
>   Oops: stack guard page
>   RIP: 0010:kfree+0x8/0x5d0
>   Call Trace:
>    __free_slab+0x66/0xc0
>    kfree+0x3f0/0x5d0
>    ... ( ~125x __free_slab <-> kfree ) ...
>    <kernel driver freeing a resource>
>    do_syscall_64
> 
> The crash dump shows a 125-deep __free_slab<->kfree recursion that
> overflowed the 16 KiB kernel stack.

Ouch!

> What happened: a KMALLOC_NORMAL slab's obj_exts array (used by allocation
> profiling / memcg accounting) is itself kmalloc()'d from a KMALLOC_NORMAL
> cache,

Usually KMALLOC_NORMAL caches don't need obj_exts array, but yes,
this could happen if memory allocation profiling is enabled.

> so the "slab holds another slab's obj_exts array" relation can form
> cycles.  With sizeof(struct slabobj_ext) == 16 and the host's geometry:
>
>   - kmalloc-512 has 64 objects/slab -> array is 64*16 == 1024 bytes,
>     served from kmalloc-1k;
>   - kmalloc-1k  has 32 objects/slab -> array is 32*16 ==  512 bytes,
>     served from kmalloc-512.

Right.

> A kmalloc-512 slab and a kmalloc-1k slab therefore hold each other's
> obj_exts array.  Discarding one frees the other's array, which empties and
> discards that slab, which frees the first's array, and so on:
> __free_slab() -> free_slab_obj_exts() -> kfree() -> discard_slab() ->
> __free_slab() recurses along the cycle until the stack is exhausted.

Right.

> The
> dump confirms it: the recursion's slabs strictly alternate kmalloc-512
> (obj_exts in kmalloc-1k) and kmalloc-1k (obj_exts in kmalloc-512), and
> mem_alloc_profiling_key was enabled.
> 
> Commit 280ea9c3154b ("mm/slab: avoid allocating slabobj_ext array from
> its own slab") is not sufficient: it bumps the allocation size only when
> the array would come from the *same* cache (object_size ==).  At the
> geometry above neither cache is self-referential (512 != 1024 and
> 1024 != 512), so the bump never triggers and the kmalloc-512 <-> kmalloc-1k
> cross cycle remains.

Right.

> Fix it structurally by removing cycles of every shape: serve the array
> from a cache strictly larger than the one it describes whenever it would
> otherwise come from the same or a smaller cache.  Every reference edge
> then points from a smaller to a larger cache (here kmalloc-1k's array
> moves to kmalloc-2k), so the relation is a DAG and cannot contain a cycle.

This will fix the problem.

But this will waste memory as we need smaller obj_exts array
as the size gets larger.

We should probably create a new kmalloc type to avoid cycles instead?
(needed only when memory profiling is enabled, though)

That would also prevent recursion even further.

> No slab can be self- or cross-pinned, the tear-down recursion is bounded
> by the number of kmalloc size classes (it terminates at the large-kmalloc
> path, which carries no obj_exts), and profiling/accounting coverage is
> unchanged - the array is still allocated, only relocated.
> 
> Reproduced on next-20260623 at the same geometry: churning
> kmalloc-512/kmalloc-1k under vm.mem_profiling and then shrinking leaves
> kmalloc-512 with thousands of unreclaimable objects without this patch
> (8056) and at baseline with it (847).
> 
> Fixes: 4b8736964640 ("mm/slab: add allocation accounting into slab allocation and free paths")

Perhaps Cc: stable? v6.12 and v6.18 are affected.

> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> Reported-by: Danielle Costantino <dcostantino@meta.com>
> ---
>  mm/slub.c | 26 ++++++++++----------------
>  1 file changed, 10 insertions(+), 16 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 9ec774dc7009..48e54d340865 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2124,15 +2124,14 @@ static inline void init_slab_obj_exts(struct slab *slab)
>  }
>  
>  /*
> - * Calculate the allocation size for slabobj_ext array.
> + * Size of the slabobj_ext array for @slab.
>   *
> - * When memory allocation profiling is enabled, the obj_exts array
> - * could be allocated from the same slab cache it's being allocated for.
> - * This would prevent the slab from ever being freed because it would
> - * always contain at least one allocated object (its own obj_exts array).
> - *
> - * To avoid this, increase the allocation size when we detect the array
> - * may come from the same cache, forcing it to use a different cache.
> + * The array is itself kmalloc()'d. If it came from the same or a smaller
> + * kmalloc cache than @s, the "slab holds another slab's array" relation could
> + * form a cycle (self, or e.g. kmalloc-512 <-> kmalloc-1k) that pins the slabs
> + * forever and recurses via free_slab_obj_exts() -> kfree() -> discard_slab()
> + * at teardown. Force it into a strictly larger cache to keep that relation a
> + * DAG (acyclic).
>   */
>  static inline size_t obj_exts_alloc_size(struct kmem_cache *s,
>  					 struct slab *slab, gfp_t gfp)
> @@ -2147,14 +2146,9 @@ static inline size_t obj_exts_alloc_size(struct kmem_cache *s,
>  		return sz;
>  
>  	obj_exts_cache = kmalloc_slab(sz, NULL, gfp, __kmalloc_token(0));
> -	/*
> -	 * We can't simply compare s with obj_exts_cache, because partitioned kmalloc
> -	 * caches have multiple caches per size, selected by caller address or type.
> -	 * Since caller address or type may differ between kmalloc_slab() and actual
> -	 * allocation, bump size when sizes are equal.
> -	 */
> -	if (s->object_size == obj_exts_cache->object_size)
> -		return obj_exts_cache->object_size + 1;
> +	/* compare object_size, not the cache pointer (partitioned kmalloc caches) */

This comment is no longer relevant, by the way.

"compare object_size instead of cache pointers because there can be
 multiple caches of the same size" doesn't apply anymore.

> +	if (obj_exts_cache->object_size <= s->object_size)
> +		return s->object_size + 1;
>  
>  	return sz;
>  }

-- 
Cheers,
Harry / Hyeonggon

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2026-06-26  4:22 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-25 23:00 [PATCH] mm/slub: serve slabobj_ext array from a strictly larger kmalloc cache Shakeel Butt
2026-06-26  4:22 ` Harry Yoo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox