From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-alma10-1.taild15c8.ts.net [100.103.45.18]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3138C337B81 for ; Thu, 2 Jul 2026 04:09:46 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=100.103.45.18 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1782965388; cv=none; b=qcwmm85xGOkislKcXZSIgBQYBK7jG8v7jpjZGIRz/50RzsT9owmNzq58V51262wXeFjEz40Udl5j4Usazg1EWwhz5Ku6ssFQrdmQxynKnlSQVPCwb9Zznb8pSJQHgrxOtyj2fcZA4gt06pzKjQh542qoYBdf5g3dw2P2zKNqthA= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1782965388; c=relaxed/simple; bh=wggKRQwKIEKkQunsecY0vJ3p55AoySMAxsABmd8+b64=; h=From:Date:Subject:MIME-Version:Content-Type:Message-Id:References: In-Reply-To:To:Cc; b=Vp8pI6874Cv8f7HYH0cCrx+fGIxjHcQIvj5KxfNGevPYfKUKzEns6lz89ohSzQRn0nOMOjQsF5TmSE/W8W3EB8FBkL6yYi7p9naEppU97CmJrjpw+ywA1RgTBX3WPoPK+cEVTxJeyiFzYDDGItZRLHNNqdbB1MCPz3YZl11Pm+s= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=cX3zbDu2; arc=none smtp.client-ip=100.103.45.18 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="cX3zbDu2" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 7EF501F00A3E; Thu, 2 Jul 2026 04:09:44 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel.org; s=k20260515; t=1782965386; bh=/FyYPBeoRMwikF8B0iK/hsVJVI6E7EzXwfe1U9arQtI=; h=From:Date:Subject:References:In-Reply-To:To:Cc; b=cX3zbDu2tQtmByEWCG05CXIQSIGQun/iEk97kgB4a/9dI1yBhKle1jC29m8mOf/ds f55sZBxajr6Hq/WYoI/NFOiyIfQVwzejxzu90vuzaobK9AM8W2JSb46sly3aXVR/ul YVLJR3DbGDiY9dnXIpyGCXBe4CJxTb4QfnGZ7fwReoQA6GQTce0URnjBaALv8VPlZX hG1ryku4I+Jrg9CMm9+oVtC/+NkatiEo0wBJQhDbFs7C3N+ldJMkxKDM4XhMJu9lEa PCkrtMsk2iPxNdhVxMeUpJloEbADI5Utnt1GZVdiKMMGQnj50h8DuwyFbXDIeiCS9p e0w2/sjo3IMfQ== From: "Harry Yoo (Oracle)" Date: Thu, 02 Jul 2026 13:09:25 +0900 Subject: [PATCH RFC hotfixes 2/2] mm/slab: prevent unbounded recursion in free path with new kmalloc type Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Message-Id: <20260702-kmalloc-no-objext-v1-2-167175008538@kernel.org> References: <20260702-kmalloc-no-objext-v1-0-167175008538@kernel.org> In-Reply-To: <20260702-kmalloc-no-objext-v1-0-167175008538@kernel.org> To: Vlastimil Babka , Andrew Morton , Hao Li , Christoph Lameter , David Rientjes , Roman Gushchin , Suren Baghdasaryan , Hao Ge , Kees Cook , Pedro Falcato , Shakeel Butt , Danielle Constantino Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, "Harry Yoo (Oracle)" X-Mailer: b4 0.14.3 Commit 280ea9c3154b ("mm/slab: avoid allocating slabobj_ext array from its own slab") avoided recursive allocation of obj_exts from kmalloc caches of the same size, by bumping the obj_exts array's allocation size whenever the array size equals the size of the object being allocated. However, as reported by Danielle Costantino and Shakeel Butt, even slabs from kmalloc caches of different sizes can form a cycle by allocating obj_exts arrays from each other [1]: What happened: a KMALLOC_NORMAL slab's obj_exts array (used by allocation profiling / memcg accounting) is itself kmalloc()'d from a KMALLOC_NORMAL cache, so the "slab holds another slab's obj_exts array" relation can form cycles. With sizeof(struct slabobj_ext) == 16 and the host's geometry: - kmalloc-512 has 64 objects/slab -> array is 64*16 == 1024 bytes, served from kmalloc-1k; - kmalloc-1k has 32 objects/slab -> array is 32*16 == 512 bytes, served from kmalloc-512. A kmalloc-512 slab and a kmalloc-1k slab therefore hold each other's obj_exts array. Discarding one frees the other's array, which empties and discards that slab, which frees the first's array, and so on: __free_slab() -> free_slab_obj_exts() -> kfree() -> discard_slab() -> __free_slab() recurses along the cycle until the stack is exhausted. With memory allocation profiling, this allows unbounded recursion in the free path and led to a stack overflow on a production host in the Meta fleet [1]: BUG: TASK stack guard page was hit Oops: stack guard page RIP: 0010:kfree+0x8/0x5d0 Call Trace: __free_slab+0x66/0xc0 kfree+0x3f0/0x5d0 ... ( ~125x __free_slab <-> kfree ) ... do_syscall_64 It is proposed [1] to resolve this issue by always serving the obj_exts array allocation from kmalloc caches (or large kmalloc) of sizes larger than the object size. However, as pointed out by Vlastimil Babka [2], this can waste an excessive amount of memory as slabs from large kmalloc sizes (e.g. kmalloc-8k) generally need obj_exts arrays much smaller than the object size. Therefore, rather than bumping the size, let us take a different approach; disallow formation of cycles between kmalloc types when allocating obj_exts arrays. Currently, all obj_exts arrays are served from normal kmalloc caches. Cycles cannot be created if obj_exts arrays of normal kmalloc caches are served from a special kmalloc type that can never have obj_exts arrays. To achieve this, create a new kmalloc type called KMALLOC_NO_OBJ_EXT. KMALLOC_NO_OBJ_EXT caches are created when CONFIG_SLAB_OBJ_EXT is enabled, and they have SLAB_NO_OBJ_EXT flag to prevent allocation of obj_exts arrays. They remain unused until allocation of obj_exts arrays for normal kmalloc caches happens. Sheaf boostrapping for KMALLOC_NO_OBJ_EXT caches now must be deferred because allocation of a barn can trigger obj_exts array allocation of normal kmalloc caches when the KMALLOC_NO_OBJ_EXT cache for that size is not ready yet. For simplicity, perform bootstrapping of sheaves for all kmalloc caches later. Introduce a new slab alloc flag, SLAB_ALLOC_NO_OBJ_EXT, to prevent allocation of obj_exts arrays, and let kmalloc_slab() override the type to KMALLOC_NO_OBJ_EXT when specified. Note that kmalloc_type() remains unchanged because kmalloc_flags() bypasses the kmalloc fastpath. Do not pass SLAB_ALLOC_NO_RECURSE to kmalloc_flags() in alloc_slab_obj_exts() and instead use SLAB_ALLOC_NO_OBJ_EXT only when the objects are allocated from normal kmalloc caches. While this prevents unbounded recursive allocation of obj_exts, it allows KMALLOC_NO_OBJ_EXT caches to have sheaves. Since sheaf allocations specify SLAB_ALLOC_NO_RECURSE that prevents allocation of both sheaves and obj_exts arrays, the recursion depth is bounded. Reported-by: Danielle Costantino Reported-by: Shakeel Butt Closes: https://lore.kernel.org/linux-mm/20260625230029.703750-1-shakeel.butt@linux.dev [1] Fixes: 4b8736964640 ("mm/slab: add allocation accounting into slab allocation and free paths") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/linux-mm/c5c4208d-a6f0-413e-bad9-49be12f12d55@kernel.org [2] Signed-off-by: Harry Yoo (Oracle) --- include/linux/slab.h | 3 ++ mm/slab.h | 17 +++++++++-- mm/slab_common.c | 18 +++++++++++- mm/slub.c | 83 +++++++++++++++++++++------------------------------- 4 files changed, 68 insertions(+), 53 deletions(-) diff --git a/include/linux/slab.h b/include/linux/slab.h index 08d7b6c9c4d6..0c1d13773523 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -721,6 +721,9 @@ enum kmalloc_cache_type { #endif #ifdef CONFIG_MEMCG KMALLOC_CGROUP, +#endif +#ifdef CONFIG_SLAB_OBJ_EXT + KMALLOC_NO_OBJ_EXT, #endif NR_KMALLOC_TYPES }; diff --git a/mm/slab.h b/mm/slab.h index 281a65233795..0428cd495191 100644 --- a/mm/slab.h +++ b/mm/slab.h @@ -22,6 +22,7 @@ #define SLAB_ALLOC_NOLOCK 0x01 /* a kmalloc_nolock() allocation */ #define SLAB_ALLOC_NEW_SLAB 0x02 /* a flag for alloc_slab_obj_exts() */ #define SLAB_ALLOC_NO_RECURSE 0x04 /* prevent kmalloc() recursion */ +#define SLAB_ALLOC_NO_OBJ_EXT 0x08 /* prevent obj_exts array allocation */ static inline bool alloc_flags_allow_spinning(const unsigned int alloc_flags) { @@ -386,12 +387,19 @@ static inline unsigned int size_index_elem(unsigned int bytes) * KMALLOC_MAX_CACHE_SIZE and the caller must check that. */ static inline struct kmem_cache * -kmalloc_slab(size_t size, kmem_buckets *b, gfp_t flags, kmalloc_token_t token) +kmalloc_slab(size_t size, kmem_buckets *b, gfp_t flags, kmalloc_token_t token, + unsigned int alloc_flags) { unsigned int index; + enum kmalloc_cache_type type = kmalloc_type(flags, token); + +#ifdef CONFIG_SLAB_OBJ_EXT + if (alloc_flags & SLAB_ALLOC_NO_OBJ_EXT) + type = KMALLOC_NO_OBJ_EXT; +#endif if (!b) - b = &kmalloc_caches[kmalloc_type(flags, token)]; + b = &kmalloc_caches[type]; if (size <= 192) index = kmalloc_size_index[size_index_elem(size)]; else @@ -426,6 +434,11 @@ static inline bool is_kmalloc_normal(struct kmem_cache *s) { if (!is_kmalloc_cache(s)) return false; + + /* KMALLOC_NO_OBJ_EXT is not normal kmalloc */ + if (s->flags & SLAB_NO_OBJ_EXT) + return false; + return !(s->flags & (SLAB_CACHE_DMA|SLAB_ACCOUNT|SLAB_RECLAIM_ACCOUNT)); } diff --git a/mm/slab_common.c b/mm/slab_common.c index b6426d7ceec9..7f262134d0f2 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -783,11 +783,15 @@ u8 kmalloc_size_index[24] __ro_after_init = { size_t kmalloc_size_roundup(size_t size) { if (size && size <= KMALLOC_MAX_CACHE_SIZE) { + struct kmem_cache *s; + /* * The flags don't matter since size_index is common to all. * Neither does the caller for just getting ->object_size. */ - return kmalloc_slab(size, NULL, GFP_KERNEL, __kmalloc_token(0))->object_size; + s = kmalloc_slab(size, NULL, GFP_KERNEL, __kmalloc_token(0), + SLAB_ALLOC_DEFAULT); + return s->object_size; } /* Above the smaller buckets, size is a multiple of page size. */ @@ -843,6 +847,12 @@ EXPORT_SYMBOL(kmalloc_size_roundup); #define KMALLOC_PARTITION_NAME(N, sz) #endif +#ifdef CONFIG_SLAB_OBJ_EXT +#define KMALLOC_NO_OBJ_EXT_NAME(sz) .name[KMALLOC_NO_OBJ_EXT] = "kmalloc-no-objext-" #sz, +#else +#define KMALLOC_NO_OBJ_EXT_NAME(sz) +#endif + #define INIT_KMALLOC_INFO(__size, __short_size) \ { \ .name[KMALLOC_NORMAL] = "kmalloc-" #__short_size, \ @@ -850,6 +860,7 @@ EXPORT_SYMBOL(kmalloc_size_roundup); KMALLOC_CGROUP_NAME(__short_size) \ KMALLOC_DMA_NAME(__short_size) \ KMALLOC_PARTITION_NAME(KMALLOC_PARTITION_CACHES_NR, __short_size) \ + KMALLOC_NO_OBJ_EXT_NAME(__short_size) \ .size = __size, \ } @@ -966,6 +977,11 @@ new_kmalloc_cache(int idx, enum kmalloc_cache_type type) flags |= SLAB_NO_MERGE; #endif +#ifdef CONFIG_SLAB_OBJ_EXT + if (type == KMALLOC_NO_OBJ_EXT) + flags |= SLAB_NO_OBJ_EXT | SLAB_NO_MERGE; +#endif + /* * If CONFIG_MEMCG is enabled, disable cache merging for * KMALLOC_NORMAL caches. diff --git a/mm/slub.c b/mm/slub.c index efc85053ae84..8428b8308856 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -2123,42 +2123,6 @@ static inline void init_slab_obj_exts(struct slab *slab) slab->obj_exts = 0; } -/* - * Calculate the allocation size for slabobj_ext array. - * - * When memory allocation profiling is enabled, the obj_exts array - * could be allocated from the same slab cache it's being allocated for. - * This would prevent the slab from ever being freed because it would - * always contain at least one allocated object (its own obj_exts array). - * - * To avoid this, increase the allocation size when we detect the array - * may come from the same cache, forcing it to use a different cache. - */ -static inline size_t obj_exts_alloc_size(struct kmem_cache *s, - struct slab *slab, gfp_t gfp) -{ - size_t sz = sizeof(struct slabobj_ext) * slab->objects; - struct kmem_cache *obj_exts_cache; - - if (sz > KMALLOC_MAX_CACHE_SIZE) - return sz; - - if (!is_kmalloc_normal(s)) - return sz; - - obj_exts_cache = kmalloc_slab(sz, NULL, gfp, __kmalloc_token(0)); - /* - * We can't simply compare s with obj_exts_cache, because partitioned kmalloc - * caches have multiple caches per size, selected by caller address or type. - * Since caller address or type may differ between kmalloc_slab() and actual - * allocation, bump size when sizes are equal. - */ - if (s->object_size == obj_exts_cache->object_size) - return obj_exts_cache->object_size + 1; - - return sz; -} - int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s, gfp_t gfp, unsigned int alloc_flags) { @@ -2168,14 +2132,18 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s, unsigned long new_exts; unsigned long old_exts; struct slabobj_ext *vec; - size_t sz; + size_t sz = sizeof(struct slabobj_ext) * slab->objects; gfp &= ~OBJCGS_CLEAR_MASK; - /* Prevent recursive extension vector allocation */ - alloc_flags |= SLAB_ALLOC_NO_RECURSE; - alloc_flags &= ~SLAB_ALLOC_NEW_SLAB; + /* + * In most cases, obj_exts arrays are allocated from normal kmalloc. + * However, normal kmalloc caches must allocate them from + * KMALLOC_NO_OBJ_EXT to caches to prevent recursion. + */ + if (is_kmalloc_normal(s)) + alloc_flags |= SLAB_ALLOC_NO_OBJ_EXT; - sz = obj_exts_alloc_size(s, slab, gfp); + alloc_flags &= ~SLAB_ALLOC_NEW_SLAB; /* This will use kmalloc_nolock() if alloc_flags say so */ vec = kmalloc_flags(sz, gfp | __GFP_ZERO, alloc_flags, slab_nid(slab)); @@ -2193,8 +2161,21 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s, return -ENOMEM; } - VM_WARN_ON_ONCE(virt_to_slab(vec) != NULL && - virt_to_slab(vec)->slab_cache == s); + if (IS_ENABLED(CONFIG_DEBUG_VM)) { + struct kmem_cache *exts_cache; + struct slab *exts_slab; + + exts_slab = virt_to_slab(vec); + if (exts_slab) { + /* + * The vector must be allocated from either normal or + * KMALLOC_NO_OBJ_EXT kmalloc caches to avoid cycles. + */ + exts_cache = virt_to_slab(vec)->slab_cache; + WARN_ON_ONCE(!is_kmalloc_normal(exts_cache) && + !(exts_cache->flags & SLAB_NO_OBJ_EXT)); + } + } new_exts = (unsigned long)vec; #ifdef CONFIG_MEMCG @@ -2254,7 +2235,7 @@ static inline void free_slab_obj_exts(struct slab *slab, bool allow_spin) } /* - * obj_exts was created with SLAB_ALLOC_NO_RECURSE flag, therefore its + * obj_exts was created with SLAB_ALLOC_NO_OBJ_EXT flag, therefore its * corresponding extension will be NULL. alloc_tag_sub() will throw a * warning if slab has extensions but the extension of an object is * NULL, therefore replace NULL with CODETAG_EMPTY to indicate that @@ -5330,7 +5311,7 @@ void *__do_kmalloc_node(kmem_buckets *b, gfp_t flags, int node, if (unlikely(!size)) return ZERO_SIZE_PTR; - s = kmalloc_slab(size, b, flags, token); + s = kmalloc_slab(size, b, flags, token, ac->alloc_flags); ret = slab_alloc_node(s, flags, node, ac); ret = kasan_kmalloc(s, ret, size, flags); @@ -5395,7 +5376,9 @@ static void *__kmalloc_nolock_noprof(DECL_TOKEN_PARAMS(size, token), gfp_t gfp_f retry: if (unlikely(size > KMALLOC_MAX_CACHE_SIZE)) return NULL; - s = kmalloc_slab(size, NULL, gfp_flags, PASS_TOKEN_PARAM(token)); + + s = kmalloc_slab(size, NULL, gfp_flags, PASS_TOKEN_PARAM(token), + ac->alloc_flags); if (!(s->flags & __CMPXCHG_DOUBLE) && !kmem_cache_debug(s)) /* @@ -7957,10 +7940,10 @@ static int calculate_sizes(struct kmem_cache_args *args, struct kmem_cache *s) s->allocflags |= __GFP_RECLAIMABLE; /* - * For KMALLOC_NORMAL caches we enable sheaves later by - * bootstrap_kmalloc_sheaves() to avoid recursion + * For kmalloc caches we enable sheaves later by + * bootstrap_kmalloc_sheaves() to avoid recursion. */ - if (!is_kmalloc_normal(s)) + if (!(s->flags & SLAB_KMALLOC)) s->sheaf_capacity = calculate_sheaf_capacity(s, args); /* @@ -8524,7 +8507,7 @@ static void __init bootstrap_kmalloc_sheaves(void) { enum kmalloc_cache_type type; - for (type = KMALLOC_NORMAL; type <= KMALLOC_PARTITION_END; type++) { + for (type = KMALLOC_NORMAL; type < NR_KMALLOC_TYPES; type++) { for (int idx = 0; idx < KMALLOC_SHIFT_HIGH + 1; idx++) { if (kmalloc_caches[type][idx]) bootstrap_cache_sheaves(kmalloc_caches[type][idx]); -- 2.53.0