Linux-mm Archive on lore.kernel.org


* Re: [PATCH v2] slab: support for compiler-assisted type-based slab cache partitioning
From: Harry Yoo (Oracle) @ 2026-04-20  7:25 UTC
  To: Marco Elver
  Cc: Vlastimil Babka, Andrew Morton, Nathan Chancellor, Nicolas Schier,
	Dennis Zhou, Tejun Heo, Christoph Lameter, Hao Li, David Rientjes,
	Roman Gushchin, Kees Cook, Gustavo A. R. Silva, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Alexander Potapenko,
	Dmitry Vyukov, Nick Desaulniers, Bill Wendling, Justin Stitt,
	linux-kbuild, linux-kernel, linux-mm, linux-hardening, kasan-dev,
	llvm, Andrey Konovalov, Florent Revest, Jann Horn, KP Singh,
	Matteo Rizzo, GONG Ruiqi, Danilo Krummrich, Uladzislau Rezki,
	rust-for-linux
In-Reply-To: <20260415143735.2974230-1-elver@google.com>

[CC'ing RUST ALLOC folks for rust bindings]

On Wed, Apr 15, 2026 at 04:37:05PM +0200, Marco Elver wrote:
> Rework the general infrastructure around RANDOM_KMALLOC_CACHES into more
> flexible PARTITION_KMALLOC_CACHES, with the former being a partitioning
> mode of the latter.
> 
> Introduce a new mode, TYPED_KMALLOC_CACHES, which leverages a feature
> available in Clang 22 and later, called "allocation tokens" via
> __builtin_infer_alloc_token [1]. Unlike RANDOM_KMALLOC_CACHES, this mode
> deterministically assigns a slab cache to an allocation of type T,
> regardless of allocation site.
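
To make the contrast between the two modes concrete, here is a minimal
userspace sketch of the cache-index computation. It assumes the kernel's
multiplicative hash_64() with the 64-bit golden-ratio constant and 16
caches per size class. hash_64_sketch(), random_index() and typed_index()
are hypothetical stand-ins for illustration, not code from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CACHES	16u	/* PARTITION_KMALLOC_CACHES_NR + 1 */
#define INDEX_BITS	4u	/* ilog2(NR_CACHES) */
#define GOLDEN_RATIO_64	0x61C8864680B583EBull

/* Userspace stand-in for hash_64(): multiply, then keep the top 'bits' bits. */
uint64_t hash_64_sketch(uint64_t val, unsigned int bits)
{
	return (val * GOLDEN_RATIO_64) >> (64 - bits);
}

/* RANDOM mode: the index depends on the call-site address and a per-boot seed. */
unsigned int random_index(uint64_t caller_ip, uint64_t seed)
{
	return (unsigned int)hash_64_sketch(caller_ip ^ seed, INDEX_BITS);
}

/* TYPED mode: the compiler-provided token is used as the index directly. */
unsigned int typed_index(uint64_t token)
{
	assert(token < NR_CACHES);	/* assumed bounded by -falloc-token-max=16 */
	return (unsigned int)token;
}
```

In this model the random index changes whenever the seed or the call site
changes, while the typed index stays the same for a given token no matter
where the allocation happens.
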
> 
> The builtin __builtin_infer_alloc_token(<malloc-args>, ...) instructs
> the compiler to infer an allocation type from arguments commonly passed
> to memory-allocating functions and returns a type-derived token ID. The
> implementation passes kmalloc-args to the builtin: the compiler performs
> best-effort type inference, and then recognizes common patterns such as
> `kmalloc(sizeof(T), ...)`, `kmalloc(sizeof(T) * n, ...)`, but also
> `(T *)kmalloc(...)`. Where the compiler fails to infer a type the
> fallback token (default: 0) is chosen.
> 
> Note: kmalloc_obj(..) APIs fix the pattern in which size and result type
> are expressed, and therefore ensure there's not much drift in which
> patterns the compiler needs to recognize. Specifically, kmalloc_obj()
> and friends expand to `(TYPE *)KMALLOC(__obj_size, GFP)`, which the
> compiler recognizes via the cast to TYPE*.
> 
> Clang's default token ID calculation is described as [1]:
> 
>    typehashpointersplit: This mode assigns a token ID based on the hash
>    of the allocated type's name, where the top half ID-space is reserved
>    for types that contain pointers and the bottom half for types that do
>    not contain pointers.
> 
> Separating pointer-containing objects from pointerless objects and data
> allocations can help mitigate certain classes of memory corruption
> exploits [2]: an attacker who gains a buffer overflow on a primitive
> buffer cannot use it to directly corrupt pointers or other critical
> metadata in an object residing in a different, isolated heap region.
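
A rough sketch of how I read the pointer-split scheme with 16 tokens:
pointerless types land in tokens 0-7, pointer-containing types in 8-15.
FNV-1a is used purely as a stand-in hash (Clang's actual hash may well
differ), and fnv1a() / split_token() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOKEN_MAX 16u	/* matches -falloc-token-max=16 in the patch */

/* FNV-1a over the type name; a stand-in only, not Clang's real hash. */
uint64_t fnv1a(const char *s)
{
	uint64_t h = 0xcbf29ce484222325ull;

	while (*s) {
		h ^= (unsigned char)*s++;
		h *= 0x100000001b3ull;
	}
	return h;
}

/* Hypothetical model: bottom half for pointerless types, top half otherwise. */
unsigned int split_token(const char *type_name, bool contains_pointer)
{
	unsigned int half = TOKEN_MAX / 2;
	unsigned int slot = (unsigned int)(fnv1a(type_name) % half);

	return contains_pointer ? half + slot : slot;
}
```

Under this model an overflow out of a pointerless buffer (token below 8)
can never be adjacent, within the same cache, to an object whose type
contains pointers (token 8 or above).
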
> 
> It is important to note that heap isolation strategies offer a
> best-effort approach, and do not provide a 100% security guarantee,
> albeit achievable at relatively low performance cost. Note that this
> also does not prevent cross-cache attacks: while waiting for future
> features like SLAB_VIRTUAL [3] to provide physical page isolation, this
> feature should be deployed alongside SHUFFLE_PAGE_ALLOCATOR and
> init_on_free=1 to mitigate cross-cache attacks and page-reuse attacks as
> much as possible today.
> 
> With all that, my kernel (x86 defconfig) shows me a histogram of slab
> cache object distribution per /proc/slabinfo (after boot):
> 
>   <slab cache>      <objs> <hist>
>   kmalloc-part-15    1465  ++++++++++++++
>   kmalloc-part-14    2988  +++++++++++++++++++++++++++++
>   kmalloc-part-13    1656  ++++++++++++++++
>   kmalloc-part-12    1045  ++++++++++
>   kmalloc-part-11    1697  ++++++++++++++++
>   kmalloc-part-10    1489  ++++++++++++++
>   kmalloc-part-09     965  +++++++++
>   kmalloc-part-08     710  +++++++
>   kmalloc-part-07     100  +
>   kmalloc-part-06     217  ++
>   kmalloc-part-05     105  +
>   kmalloc-part-04    4047  ++++++++++++++++++++++++++++++++++++++++
>   kmalloc-part-03     183  +
>   kmalloc-part-02     283  ++
>   kmalloc-part-01     316  +++
>   kmalloc            1422  ++++++++++++++
> 
> The above /proc/slabinfo snapshot shows me there are 6673 allocated
> objects (slabs 00 - 07) that the compiler claims contain no pointers or
> whose type it was unable to infer, and 12015 objects that contain
> pointers (slabs 08 - 15). On the whole, this looks relatively sane.
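
The two totals can be re-derived from the histogram rows. A small sketch,
treating the plain "kmalloc" row as partition 00 (which is how I read the
"slabs 00 - 07" range); objs[] and sum_range() are names made up for this
check.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Object counts copied from the histogram, indexed by partition number.
 * Index 0 is the plain "kmalloc" row.
 */
const unsigned int objs[16] = {
	1422,  316,  283,  183, 4047,  105,  217,  100,	/* 00 - 07 */
	 710,  965, 1489, 1697, 1045, 1656, 2988, 1465,	/* 08 - 15 */
};

/* Sum the counts for partitions in the half-open range [first, last). */
unsigned int sum_range(size_t first, size_t last)
{
	unsigned int total = 0;

	for (size_t i = first; i < last; i++)
		total += objs[i];
	return total;
}
```

sum_range(0, 8) gives 6673 and sum_range(8, 16) gives 12015, matching the
numbers quoted.
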
> 
> Additionally, when I compile my kernel with -Rpass=alloc-token, which
> provides diagnostics where (after dead-code elimination) type inference
> failed, I see 186 allocation sites where the compiler failed to identify
> a type (down from 966 when I sent the RFC [4]). Some initial review
> confirms these are mostly variable-sized buffers, but also include
> structs with trailing flexible-length arrays.
> 
> Link: https://clang.llvm.org/docs/AllocToken.html [1]
> Link: https://blog.dfsec.com/ios/2025/05/30/blasting-past-ios-18/ [2]
> Link: https://lwn.net/Articles/944647/ [3]
> Link: https://lore.kernel.org/all/20250825154505.1558444-1-elver@google.com/ [4]
> Link: https://discourse.llvm.org/t/rfc-a-framework-for-allocator-partitioning-hints/87434
> Acked-by: GONG Ruiqi <gongruiqi1@huawei.com>
> Co-developed-by: Harry Yoo (Oracle) <harry@kernel.org>
> Signed-off-by: Harry Yoo (Oracle) <harry@kernel.org>
> Signed-off-by: Marco Elver <elver@google.com>
> ---
> v2:
> * Avoid empty function argument if !PARTITION_KMALLOC_CACHES
>   (co-developed-by Harry). While Clang does optimize out the empty
>   struct argument (and generated code is identical to before if
>   PARTITION_KMALLOC_CACHES is disabled), GCC doesn't do so. So we need
>   to fully remove the argument if not actually required.
> * Cover krealloc() which was missed before, resulting in ~100 additional
>   objects in the pointer-containing caches in above histogram.
> * Unify kmalloc_token_t definition.

> * Expand Kconfig help text.

Thanks. I find the help text much more useful.

> 
> v1: https://lore.kernel.org/all/20260331111240.153913-1-elver@google.com/
> * Rebase and switch to builtin name that was released in Clang 22.
> * Keep RANDOM_KMALLOC_CACHES the default.
> 
> RFC: https://lore.kernel.org/all/20250825154505.1558444-1-elver@google.com/
> ---

A few comments on V2:

# comment 1

I'm not a big fan of how k[v]realloc_node_align()
and kmalloc_nolock() define and pass the token parameter.

IMHO it'll be fine to use {DECL,PASS}_KMALLOC_PARAMS() in those
functions, since SLAB_BUCKETS users already pass a NULL bucket
to most __kmalloc*() calls anyway.

# comment 2

This breaks Documentation/.

Problems:

- The document generator doesn't handle DECL_KMALLOC_PARAMS() well.

- The signature of the function that users call (krealloc_node_align())
  and the function that has kerneldoc (krealloc_node_align_noprof())
  don't match.

- Even worse, moving the kerneldoc to the macro doesn't work because
  the macro takes variable arguments (...).
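
To illustrate the mismatch, a stripped-down userspace sketch of the same
macro pattern. All names here (token_t, make_token(), resize(),
resize_noprof(), PARTITION_ENABLED) are hypothetical stand-ins, and the
token value is a fixed dummy.

```c
#include <assert.h>
#include <stddef.h>

#define PARTITION_ENABLED 1	/* stand-in for CONFIG_PARTITION_KMALLOC_CACHES */

#if PARTITION_ENABLED
typedef struct { unsigned long v; } token_t;
#define DECL_TOKEN_PARAM(_token)	, token_t (_token)
#define _PASS_TOKEN_PARAM(_token)	, (_token)
#define make_token(...)			((token_t){ .v = 7 })	/* dummy token */
#else
#define DECL_TOKEN_PARAM(_token)
#define _PASS_TOKEN_PARAM(_token)
#define make_token(...)
#endif

/* The documented function: five named parameters plus a macro in the list. */
size_t resize_noprof(const void *p, size_t new_size, unsigned long align,
		     unsigned int flags, int nid DECL_TOKEN_PARAM(token))
{
	(void)p; (void)align; (void)flags; (void)nid;
#if PARTITION_ENABLED
	return new_size + token.v;	/* make the hidden argument observable */
#else
	return new_size;
#endif
}

/* What callers actually use: variadic, so there are no named parameters. */
#define resize(...) \
	resize_noprof(__VA_ARGS__ _PASS_TOKEN_PARAM(make_token(__VA_ARGS__)))
```

A caller writes resize(p, 16, 1, 0, -1) with five arguments, but the
function it reaches takes six when partitioning is enabled, and neither
form gives the documentation tooling a plain prototype to describe.
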

# comment 3

Looking at how Rust generates helper functions,
in rust/helpers/slab.c:
| // SPDX-License-Identifier: GPL-2.0
| 
| #include <linux/slab.h>
| 
| __rust_helper void *__must_check __realloc_size(2)
| rust_helper_krealloc_node_align(const void *objp, size_t new_size, unsigned long align,
| 				gfp_t flags, int node)
| {
| 	return krealloc_node_align(objp, new_size, align, flags, node);
| }
| 
| __rust_helper void *__must_check __realloc_size(2)
| rust_helper_kvrealloc_node_align(const void *p, size_t size, unsigned long align,
| 				 gfp_t flags, int node)
| {
| 	return kvrealloc_node_align(p, size, align, flags, node);
| }

Rust code probably won't pass any meaningful token?
(something you may want to address in the future)
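
A toy model of why I expect that, assuming the token is inferred from what
is spelled at the call site. This is not how the builtin works internally
(it inspects types in the compiler, not source text); toy_infer() and
TOY_TOKEN() are made up for illustration only.

```c
#include <assert.h>
#include <string.h>

/*
 * Toy model only: "inference" succeeds when the size expression written at
 * the call site mentions sizeof, and otherwise yields the fallback token 0.
 */
unsigned int toy_infer(const char *size_expr)
{
	return strstr(size_expr, "sizeof") ? 1u : 0u;
}
#define TOY_TOKEN(size_expr) toy_infer(#size_expr)

struct foo { int a; void *b; };

/* Direct C call site: the allocated type is visible in the expression. */
unsigned int token_at_c_call_site(void)
{
	return TOY_TOKEN(sizeof(struct foo));
}

/* Helper in the style of rust/helpers/slab.c: only an opaque size arrives. */
unsigned int token_in_helper(size_t new_size)
{
	(void)new_size;
	return TOY_TOKEN(new_size);
}
```

In this model every allocation routed through the helper gets the fallback
token, regardless of which Rust type is being allocated.
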

-- 
Cheers,
Harry / Hyeonggon

[Not trimming the rest of the patch for those added to Cc]

>  Makefile                        |   5 ++
>  include/linux/percpu.h          |   2 +-
>  include/linux/slab.h            | 127 ++++++++++++++++++++------------
>  kernel/configs/hardening.config |   2 +-
>  mm/Kconfig                      |  65 +++++++++++++---
>  mm/kfence/kfence_test.c         |   4 +-
>  mm/slab.h                       |   4 +-
>  mm/slab_common.c                |  48 ++++++------
>  mm/slub.c                       |  54 +++++++-------
>  9 files changed, 200 insertions(+), 111 deletions(-)
> 
> diff --git a/Makefile b/Makefile
> index 54e1ae602000..f70170ed1522 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -989,6 +989,11 @@ KBUILD_CFLAGS	+= $(CC_AUTO_VAR_INIT_ZERO_ENABLER)
>  endif
>  endif
>  
> +ifdef CONFIG_TYPED_KMALLOC_CACHES
> +# PARTITION_KMALLOC_CACHES_NR + 1
> +KBUILD_CFLAGS	+= -falloc-token-max=16
> +endif
> +
>  ifdef CONFIG_CC_IS_CLANG
>  ifdef CONFIG_CC_HAS_COUNTED_BY_PTR
>  KBUILD_CFLAGS	+= -fexperimental-late-parse-attributes
> diff --git a/include/linux/percpu.h b/include/linux/percpu.h
> index 85bf8dd9f087..271b41be314d 100644
> --- a/include/linux/percpu.h
> +++ b/include/linux/percpu.h
> @@ -36,7 +36,7 @@
>  #define PCPU_BITMAP_BLOCK_BITS		(PCPU_BITMAP_BLOCK_SIZE >>	\
>  					 PCPU_MIN_ALLOC_SHIFT)
>  
> -#ifdef CONFIG_RANDOM_KMALLOC_CACHES
> +#ifdef CONFIG_PARTITION_KMALLOC_CACHES
>  # if defined(CONFIG_LOCKDEP) && !defined(CONFIG_PAGE_SIZE_4KB)
>  # define PERCPU_DYNAMIC_SIZE_SHIFT      13
>  # else
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 15a60b501b95..3bd0db0ec95f 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -499,14 +499,33 @@ int kmem_cache_shrink(struct kmem_cache *s);
>  				.usersize	= sizeof_field(struct __struct, __field),	\
>  			}, (__flags))
>  
> +#ifdef CONFIG_PARTITION_KMALLOC_CACHES
> +typedef struct { unsigned long v; } kmalloc_token_t;
> +#ifdef CONFIG_RANDOM_KMALLOC_CACHES
> +extern unsigned long random_kmalloc_seed;
> +#define __kmalloc_token(...) ((kmalloc_token_t){ .v = _RET_IP_ })
> +#elif defined(CONFIG_TYPED_KMALLOC_CACHES)
> +#define __kmalloc_token(...) ((kmalloc_token_t){ .v = __builtin_infer_alloc_token(__VA_ARGS__) })
> +#endif
> +#define DECL_TOKEN_PARAM(_token)	, kmalloc_token_t (_token)
> +#define _PASS_TOKEN_PARAM(_token)	, (_token)
> +#define PASS_TOKEN_PARAM(_token)	(_token)
> +#else /* !CONFIG_PARTITION_KMALLOC_CACHES */
> +typedef struct {} kmalloc_token_t;
> +#define __kmalloc_token(...) ((kmalloc_token_t){}) /* no-op */
> +#define DECL_TOKEN_PARAM(_token)
> +#define _PASS_TOKEN_PARAM(_token)
> +#define PASS_TOKEN_PARAM(_token)	((kmalloc_token_t){})
> +#endif /* CONFIG_PARTITION_KMALLOC_CACHES */
> +
>  /*
>   * Common kmalloc functions provided by all allocators
>   */
>  void * __must_check krealloc_node_align_noprof(const void *objp, size_t new_size,
>  					       unsigned long align,
> -					       gfp_t flags, int nid) __realloc_size(2);
> -#define krealloc_noprof(_o, _s, _f)	krealloc_node_align_noprof(_o, _s, 1, _f, NUMA_NO_NODE)
> -#define krealloc_node_align(...)	alloc_hooks(krealloc_node_align_noprof(__VA_ARGS__))
> +					       gfp_t flags, int nid DECL_TOKEN_PARAM(token)) __realloc_size(2);
> +#define krealloc_noprof(_o, _s, _f)	krealloc_node_align_noprof(_o, _s, 1, _f, NUMA_NO_NODE _PASS_TOKEN_PARAM(__kmalloc_token(_o, _s, _f)))
> +#define krealloc_node_align(...)	alloc_hooks(krealloc_node_align_noprof(__VA_ARGS__ _PASS_TOKEN_PARAM(__kmalloc_token(__VA_ARGS__))))
>  #define krealloc_node(_o, _s, _f, _n)	krealloc_node_align(_o, _s, 1, _f, _n)
>  #define krealloc(...)			krealloc_node(__VA_ARGS__, NUMA_NO_NODE)
>  
> @@ -612,10 +631,10 @@ static inline unsigned int arch_slab_minalign(void)
>  #define SLAB_OBJ_MIN_SIZE      (KMALLOC_MIN_SIZE < 16 ? \
>                                 (KMALLOC_MIN_SIZE) : 16)
>  
> -#ifdef CONFIG_RANDOM_KMALLOC_CACHES
> -#define RANDOM_KMALLOC_CACHES_NR	15 // # of cache copies
> +#ifdef CONFIG_PARTITION_KMALLOC_CACHES
> +#define PARTITION_KMALLOC_CACHES_NR	15 // # of cache copies
>  #else
> -#define RANDOM_KMALLOC_CACHES_NR	0
> +#define PARTITION_KMALLOC_CACHES_NR	0
>  #endif
>  
>  /*
> @@ -634,8 +653,8 @@ enum kmalloc_cache_type {
>  #ifndef CONFIG_MEMCG
>  	KMALLOC_CGROUP = KMALLOC_NORMAL,
>  #endif
> -	KMALLOC_RANDOM_START = KMALLOC_NORMAL,
> -	KMALLOC_RANDOM_END = KMALLOC_RANDOM_START + RANDOM_KMALLOC_CACHES_NR,
> +	KMALLOC_PARTITION_START = KMALLOC_NORMAL,
> +	KMALLOC_PARTITION_END = KMALLOC_PARTITION_START + PARTITION_KMALLOC_CACHES_NR,
>  #ifdef CONFIG_SLUB_TINY
>  	KMALLOC_RECLAIM = KMALLOC_NORMAL,
>  #else
> @@ -662,9 +681,7 @@ extern kmem_buckets kmalloc_caches[NR_KMALLOC_TYPES];
>  	(IS_ENABLED(CONFIG_ZONE_DMA)   ? __GFP_DMA : 0) |	\
>  	(IS_ENABLED(CONFIG_MEMCG) ? __GFP_ACCOUNT : 0))
>  
> -extern unsigned long random_kmalloc_seed;
> -
> -static __always_inline enum kmalloc_cache_type kmalloc_type(gfp_t flags, unsigned long caller)
> +static __always_inline enum kmalloc_cache_type kmalloc_type(gfp_t flags, kmalloc_token_t token)
>  {
>  	/*
>  	 * The most common case is KMALLOC_NORMAL, so test for it
> @@ -672,9 +689,11 @@ static __always_inline enum kmalloc_cache_type kmalloc_type(gfp_t flags, unsigne
>  	 */
>  	if (likely((flags & KMALLOC_NOT_NORMAL_BITS) == 0))
>  #ifdef CONFIG_RANDOM_KMALLOC_CACHES
> -		/* RANDOM_KMALLOC_CACHES_NR (=15) copies + the KMALLOC_NORMAL */
> -		return KMALLOC_RANDOM_START + hash_64(caller ^ random_kmalloc_seed,
> -						      ilog2(RANDOM_KMALLOC_CACHES_NR + 1));
> +		/* PARTITION_KMALLOC_CACHES_NR (=15) copies + the KMALLOC_NORMAL */
> +		return KMALLOC_PARTITION_START + hash_64(token.v ^ random_kmalloc_seed,
> +							 ilog2(PARTITION_KMALLOC_CACHES_NR + 1));
> +#elif defined(CONFIG_TYPED_KMALLOC_CACHES)
> +		return KMALLOC_PARTITION_START + token.v;
>  #else
>  		return KMALLOC_NORMAL;
>  #endif
> @@ -858,16 +877,22 @@ unsigned int kmem_cache_sheaf_size(struct slab_sheaf *sheaf);
>  #define PASS_BUCKET_PARAM(_b)		NULL
>  #endif
>  
> +#define DECL_KMALLOC_PARAMS(_size, _b, _token) DECL_BUCKET_PARAMS(_size, _b) \
> +					       DECL_TOKEN_PARAM(_token)
> +
> +#define PASS_KMALLOC_PARAMS(_size, _b, _token) PASS_BUCKET_PARAMS(_size, _b) \
> +					       _PASS_TOKEN_PARAM(_token)
> +
>  /*
>   * The following functions are not to be used directly and are intended only
>   * for internal use from kmalloc() and kmalloc_node()
>   * with the exception of kunit tests
>   */
>  
> -void *__kmalloc_noprof(size_t size, gfp_t flags)
> +void *__kmalloc_noprof(DECL_KMALLOC_PARAMS(size, b, token), gfp_t flags)
>  				__assume_kmalloc_alignment __alloc_size(1);
>  
> -void *__kmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node)
> +void *__kmalloc_node_noprof(DECL_KMALLOC_PARAMS(size, b, token), gfp_t flags, int node)
>  				__assume_kmalloc_alignment __alloc_size(1);
>  
>  void *__kmalloc_cache_noprof(struct kmem_cache *s, gfp_t flags, size_t size)
> @@ -938,7 +963,7 @@ void *__kmalloc_large_node_noprof(size_t size, gfp_t flags, int node)
>   *	Try really hard to succeed the allocation but fail
>   *	eventually.
>   */
> -static __always_inline __alloc_size(1) void *kmalloc_noprof(size_t size, gfp_t flags)
> +static __always_inline __alloc_size(1) void *_kmalloc_noprof(size_t size, gfp_t flags, kmalloc_token_t token)
>  {
>  	if (__builtin_constant_p(size) && size) {
>  		unsigned int index;
> @@ -948,14 +973,16 @@ static __always_inline __alloc_size(1) void *kmalloc_noprof(size_t size, gfp_t f
>  
>  		index = kmalloc_index(size);
>  		return __kmalloc_cache_noprof(
> -				kmalloc_caches[kmalloc_type(flags, _RET_IP_)][index],
> +				kmalloc_caches[kmalloc_type(flags, token)][index],
>  				flags, size);
>  	}
> -	return __kmalloc_noprof(size, flags);
> +	return __kmalloc_noprof(PASS_KMALLOC_PARAMS(size, NULL, token), flags);
>  }
> +#define kmalloc_noprof(...)			_kmalloc_noprof(__VA_ARGS__, __kmalloc_token(__VA_ARGS__))
>  #define kmalloc(...)				alloc_hooks(kmalloc_noprof(__VA_ARGS__))
>  
> -void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node);
> +void *_kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node DECL_TOKEN_PARAM(token));
> +#define kmalloc_nolock_noprof(...)		_kmalloc_nolock_noprof(__VA_ARGS__ _PASS_TOKEN_PARAM(__kmalloc_token(__VA_ARGS__)))
>  #define kmalloc_nolock(...)			alloc_hooks(kmalloc_nolock_noprof(__VA_ARGS__))
>  
>  /**
> @@ -1060,12 +1087,12 @@ void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node);
>  	__alloc_flex(kvzalloc, default_gfp(__VA_ARGS__), typeof(P), FAM, COUNT)
>  
>  #define kmem_buckets_alloc(_b, _size, _flags)	\
> -	alloc_hooks(__kmalloc_node_noprof(PASS_BUCKET_PARAMS(_size, _b), _flags, NUMA_NO_NODE))
> +	alloc_hooks(__kmalloc_node_noprof(PASS_KMALLOC_PARAMS(_size, _b, __kmalloc_token(_size)), _flags, NUMA_NO_NODE))
>  
>  #define kmem_buckets_alloc_track_caller(_b, _size, _flags)	\
> -	alloc_hooks(__kmalloc_node_track_caller_noprof(PASS_BUCKET_PARAMS(_size, _b), _flags, NUMA_NO_NODE, _RET_IP_))
> +	alloc_hooks(__kmalloc_node_track_caller_noprof(PASS_KMALLOC_PARAMS(_size, _b, __kmalloc_token(_size)), _flags, NUMA_NO_NODE, _RET_IP_))
>  
> -static __always_inline __alloc_size(1) void *kmalloc_node_noprof(size_t size, gfp_t flags, int node)
> +static __always_inline __alloc_size(1) void *_kmalloc_node_noprof(size_t size, gfp_t flags, int node, kmalloc_token_t token)
>  {
>  	if (__builtin_constant_p(size) && size) {
>  		unsigned int index;
> @@ -1075,11 +1102,12 @@ static __always_inline __alloc_size(1) void *kmalloc_node_noprof(size_t size, gf
>  
>  		index = kmalloc_index(size);
>  		return __kmalloc_cache_node_noprof(
> -				kmalloc_caches[kmalloc_type(flags, _RET_IP_)][index],
> +				kmalloc_caches[kmalloc_type(flags, token)][index],
>  				flags, node, size);
>  	}
> -	return __kmalloc_node_noprof(PASS_BUCKET_PARAMS(size, NULL), flags, node);
> +	return __kmalloc_node_noprof(PASS_KMALLOC_PARAMS(size, NULL, token), flags, node);
>  }
> +#define kmalloc_node_noprof(...)		_kmalloc_node_noprof(__VA_ARGS__, __kmalloc_token(__VA_ARGS__))
>  #define kmalloc_node(...)			alloc_hooks(kmalloc_node_noprof(__VA_ARGS__))
>  
>  /**
> @@ -1088,14 +1116,15 @@ static __always_inline __alloc_size(1) void *kmalloc_node_noprof(size_t size, gf
>   * @size: element size.
>   * @flags: the type of memory to allocate (see kmalloc).
>   */
> -static inline __alloc_size(1, 2) void *kmalloc_array_noprof(size_t n, size_t size, gfp_t flags)
> +static inline __alloc_size(1, 2) void *_kmalloc_array_noprof(size_t n, size_t size, gfp_t flags, kmalloc_token_t token)
>  {
>  	size_t bytes;
>  
>  	if (unlikely(check_mul_overflow(n, size, &bytes)))
>  		return NULL;
> -	return kmalloc_noprof(bytes, flags);
> +	return _kmalloc_noprof(bytes, flags, token);
>  }
> +#define kmalloc_array_noprof(...)		_kmalloc_array_noprof(__VA_ARGS__, __kmalloc_token(__VA_ARGS__))
>  #define kmalloc_array(...)			alloc_hooks(kmalloc_array_noprof(__VA_ARGS__))
>  
>  /**
> @@ -1115,18 +1144,19 @@ static inline __alloc_size(1, 2) void *kmalloc_array_noprof(size_t n, size_t siz
>   * In any case, the contents of the object pointed to are preserved up to the
>   * lesser of the new and old sizes.
>   */
> -static inline __realloc_size(2, 3) void * __must_check krealloc_array_noprof(void *p,
> +static inline __realloc_size(2, 3) void * __must_check _krealloc_array_noprof(void *p,
>  								       size_t new_n,
>  								       size_t new_size,
> -								       gfp_t flags)
> +								       gfp_t flags, kmalloc_token_t token)
>  {
>  	size_t bytes;
>  
>  	if (unlikely(check_mul_overflow(new_n, new_size, &bytes)))
>  		return NULL;
>  
> -	return krealloc_noprof(p, bytes, flags);
> +	return krealloc_node_align_noprof(p, bytes, 1, flags, NUMA_NO_NODE _PASS_TOKEN_PARAM(token));
>  }
> +#define krealloc_array_noprof(...)		_krealloc_array_noprof(__VA_ARGS__, __kmalloc_token(__VA_ARGS__))
>  #define krealloc_array(...)			alloc_hooks(krealloc_array_noprof(__VA_ARGS__))
>  
>  /**
> @@ -1137,10 +1167,10 @@ static inline __realloc_size(2, 3) void * __must_check krealloc_array_noprof(voi
>   */
>  #define kcalloc(n, size, flags)		kmalloc_array(n, size, (flags) | __GFP_ZERO)
>  
> -void *__kmalloc_node_track_caller_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node,
> +void *__kmalloc_node_track_caller_noprof(DECL_KMALLOC_PARAMS(size, b, token), gfp_t flags, int node,
>  					 unsigned long caller) __alloc_size(1);
>  #define kmalloc_node_track_caller_noprof(size, flags, node, caller) \
> -	__kmalloc_node_track_caller_noprof(PASS_BUCKET_PARAMS(size, NULL), flags, node, caller)
> +	__kmalloc_node_track_caller_noprof(PASS_KMALLOC_PARAMS(size, NULL, __kmalloc_token(size)), flags, node, caller)
>  #define kmalloc_node_track_caller(...)		\
>  	alloc_hooks(kmalloc_node_track_caller_noprof(__VA_ARGS__, _RET_IP_))
>  
> @@ -1157,17 +1187,18 @@ void *__kmalloc_node_track_caller_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flag
>  #define kmalloc_track_caller_noprof(...)	\
>  		kmalloc_node_track_caller_noprof(__VA_ARGS__, NUMA_NO_NODE, _RET_IP_)
>  
> -static inline __alloc_size(1, 2) void *kmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags,
> -							  int node)
> +static inline __alloc_size(1, 2) void *_kmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags,
> +								  int node, kmalloc_token_t token)
>  {
>  	size_t bytes;
>  
>  	if (unlikely(check_mul_overflow(n, size, &bytes)))
>  		return NULL;
>  	if (__builtin_constant_p(n) && __builtin_constant_p(size))
> -		return kmalloc_node_noprof(bytes, flags, node);
> -	return __kmalloc_node_noprof(PASS_BUCKET_PARAMS(bytes, NULL), flags, node);
> +		return _kmalloc_node_noprof(bytes, flags, node, token);
> +	return __kmalloc_node_noprof(PASS_KMALLOC_PARAMS(bytes, NULL, token), flags, node);
>  }
> +#define kmalloc_array_node_noprof(...)		_kmalloc_array_node_noprof(__VA_ARGS__, __kmalloc_token(__VA_ARGS__))
>  #define kmalloc_array_node(...)			alloc_hooks(kmalloc_array_node_noprof(__VA_ARGS__))
>  
>  #define kcalloc_node(_n, _size, _flags, _node)	\
> @@ -1183,39 +1214,43 @@ static inline __alloc_size(1, 2) void *kmalloc_array_node_noprof(size_t n, size_
>   * @size: how many bytes of memory are required.
>   * @flags: the type of memory to allocate (see kmalloc).
>   */
> -static inline __alloc_size(1) void *kzalloc_noprof(size_t size, gfp_t flags)
> +static inline __alloc_size(1) void *_kzalloc_noprof(size_t size, gfp_t flags, kmalloc_token_t token)
>  {
> -	return kmalloc_noprof(size, flags | __GFP_ZERO);
> +	return _kmalloc_noprof(size, flags | __GFP_ZERO, token);
>  }
> +#define kzalloc_noprof(...)			_kzalloc_noprof(__VA_ARGS__, __kmalloc_token(__VA_ARGS__))
>  #define kzalloc(...)				alloc_hooks(kzalloc_noprof(__VA_ARGS__))
>  #define kzalloc_node(_size, _flags, _node)	kmalloc_node(_size, (_flags)|__GFP_ZERO, _node)
>  
> -void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), unsigned long align,
> +void *__kvmalloc_node_noprof(DECL_KMALLOC_PARAMS(size, b, token), unsigned long align,
>  			     gfp_t flags, int node) __alloc_size(1);
>  #define kvmalloc_node_align_noprof(_size, _align, _flags, _node)	\
> -	__kvmalloc_node_noprof(PASS_BUCKET_PARAMS(_size, NULL), _align, _flags, _node)
> +	__kvmalloc_node_noprof(PASS_KMALLOC_PARAMS(_size, NULL, __kmalloc_token(_size)), _align, _flags, _node)
>  #define kvmalloc_node_align(...)		\
>  	alloc_hooks(kvmalloc_node_align_noprof(__VA_ARGS__))
>  #define kvmalloc_node(_s, _f, _n)		kvmalloc_node_align(_s, 1, _f, _n)
> +#define kvmalloc_node_noprof(size, flags, node)	\
> +	kvmalloc_node_align_noprof(size, 1, flags, node)
>  #define kvmalloc(...)				kvmalloc_node(__VA_ARGS__, NUMA_NO_NODE)
> +#define kvmalloc_noprof(_size, _flags)		kvmalloc_node_noprof(_size, _flags, NUMA_NO_NODE)
>  #define kvzalloc(_size, _flags)			kvmalloc(_size, (_flags)|__GFP_ZERO)
>  
>  #define kvzalloc_node(_size, _flags, _node)	kvmalloc_node(_size, (_flags)|__GFP_ZERO, _node)
>  
>  #define kmem_buckets_valloc(_b, _size, _flags)	\
> -	alloc_hooks(__kvmalloc_node_noprof(PASS_BUCKET_PARAMS(_size, _b), 1, _flags, NUMA_NO_NODE))
> +	alloc_hooks(__kvmalloc_node_noprof(PASS_KMALLOC_PARAMS(_size, _b, __kmalloc_token(_size)), 1, _flags, NUMA_NO_NODE))
>  
>  static inline __alloc_size(1, 2) void *
> -kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node)
> +_kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node, kmalloc_token_t token)
>  {
>  	size_t bytes;
>  
>  	if (unlikely(check_mul_overflow(n, size, &bytes)))
>  		return NULL;
>  
> -	return kvmalloc_node_align_noprof(bytes, 1, flags, node);
> +	return __kvmalloc_node_noprof(PASS_KMALLOC_PARAMS(bytes, NULL, token), 1, flags, node);
>  }
> -
> +#define kvmalloc_array_node_noprof(...)		_kvmalloc_array_node_noprof(__VA_ARGS__, __kmalloc_token(__VA_ARGS__))
>  #define kvmalloc_array_noprof(...)		kvmalloc_array_node_noprof(__VA_ARGS__, NUMA_NO_NODE)
>  #define kvcalloc_node_noprof(_n,_s,_f,_node)	kvmalloc_array_node_noprof(_n,_s,(_f)|__GFP_ZERO,_node)
>  #define kvcalloc_noprof(...)			kvcalloc_node_noprof(__VA_ARGS__, NUMA_NO_NODE)
> @@ -1225,9 +1260,9 @@ kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node)
>  #define kvcalloc(...)				alloc_hooks(kvcalloc_noprof(__VA_ARGS__))
>  
>  void *kvrealloc_node_align_noprof(const void *p, size_t size, unsigned long align,
> -				  gfp_t flags, int nid) __realloc_size(2);
> +				  gfp_t flags, int nid DECL_TOKEN_PARAM(token)) __realloc_size(2);
>  #define kvrealloc_node_align(...)		\
> -	alloc_hooks(kvrealloc_node_align_noprof(__VA_ARGS__))
> +	alloc_hooks(kvrealloc_node_align_noprof(__VA_ARGS__ _PASS_TOKEN_PARAM(__kmalloc_token(__VA_ARGS__))))
>  #define kvrealloc_node(_p, _s, _f, _n)		kvrealloc_node_align(_p, _s, 1, _f, _n)
>  #define kvrealloc(...)				kvrealloc_node(__VA_ARGS__, NUMA_NO_NODE)
>  
> diff --git a/kernel/configs/hardening.config b/kernel/configs/hardening.config
> index 7c3924614e01..2963b6bd890f 100644
> --- a/kernel/configs/hardening.config
> +++ b/kernel/configs/hardening.config
> @@ -22,7 +22,7 @@ CONFIG_SLAB_FREELIST_RANDOM=y
>  CONFIG_SLAB_FREELIST_HARDENED=y
>  CONFIG_SLAB_BUCKETS=y
>  CONFIG_SHUFFLE_PAGE_ALLOCATOR=y
> -CONFIG_RANDOM_KMALLOC_CACHES=y
> +CONFIG_PARTITION_KMALLOC_CACHES=y
>  
>  # Sanity check userspace page table mappings.
>  CONFIG_PAGE_TABLE_CHECK=y
> diff --git a/mm/Kconfig b/mm/Kconfig
> index ebd8ea353687..47f323816de7 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -247,22 +247,67 @@ config SLUB_STATS
>  	  out which slabs are relevant to a particular load.
>  	  Try running: slabinfo -DA
>  
> -config RANDOM_KMALLOC_CACHES
> -	default n
> +config PARTITION_KMALLOC_CACHES
>  	depends on !SLUB_TINY
> -	bool "Randomize slab caches for normal kmalloc"
> +	bool "Partitioned slab caches for normal kmalloc"
>  	help
> -	  A hardening feature that creates multiple copies of slab caches for
> -	  normal kmalloc allocation and makes kmalloc randomly pick one based
> -	  on code address, which makes the attackers more difficult to spray
> -	  vulnerable memory objects on the heap for the purpose of exploiting
> -	  memory vulnerabilities.
> +	  A hardening feature that creates multiple isolated copies of slab
> +	  caches for normal kmalloc allocations. This makes it more difficult
> +	  to exploit memory-safety vulnerabilities by attacking vulnerable
> +	  co-located memory objects. Several modes are provided.
>  
>  	  Currently the number of copies is set to 16, a reasonably large value
>  	  that effectively diverges the memory objects allocated for different
>  	  subsystems or modules into different caches, at the expense of a
> -	  limited degree of memory and CPU overhead that relates to hardware and
> -	  system workload.
> +	  limited degree of memory and CPU overhead that relates to hardware
> +	  and system workload.
> +
> +choice
> +	prompt "Partitioned slab cache mode"
> +	depends on PARTITION_KMALLOC_CACHES
> +	default RANDOM_KMALLOC_CACHES
> +	help
> +	  Selects the slab cache partitioning mode.
> +
> +config RANDOM_KMALLOC_CACHES
> +	bool "Randomize slab caches for normal kmalloc"
> +	help
> +	  Randomly pick a slab cache based on code address and a per-boot
> +	  random seed.
> +
> +	  This makes it harder for attackers to predict object co-location.
> +	  The placement is random: while attackers don't know which kmalloc
> +	  cache an object will be allocated from, they might circumvent
> +	  the randomization by retrying attacks across multiple machines until
> +	  the target objects are co-located.
> +
> +config TYPED_KMALLOC_CACHES
> +	bool "Type based slab cache selection for normal kmalloc"
> +	depends on $(cc-option,-falloc-token-max=123)
> +	help
> +	  Rely on Clang's allocation tokens to choose a slab cache, where token
> +	  IDs are derived from the allocated type.
> +
> +	  Unlike RANDOM_KMALLOC_CACHES, cache assignment is deterministic based
> +	  on type, which guarantees that objects of certain types are not
> +	  placed in the same cache. This effectively mitigates certain classes
> +	  of exploits that probabilistic defenses like RANDOM_KMALLOC_CACHES
> +	  only make harder but not impossible. However, this also means the
> +	  cache assignment is predictable.
> +
> +	  Clang's default token ID calculation returns a bounded hash with
> +	  disjoint ranges for pointer-containing and pointerless objects: when
> +	  used as the slab cache index, this prevents buffer overflows on
> +	  primitive buffers from directly corrupting pointer-containing
> +	  objects.
> +
> +	  The current effectiveness of Clang's type inference can be judged by
> +	  -Rpass=alloc-token, which provides diagnostics where (after dead-code
> +	  elimination) type inference failed.
> +
> +	  Requires Clang 22 or later.
> +
> +endchoice
>  
>  endmenu # Slab allocator options
>  
> diff --git a/mm/kfence/kfence_test.c b/mm/kfence/kfence_test.c
> index 5725a367246d..8807ea8ed0d3 100644
> --- a/mm/kfence/kfence_test.c
> +++ b/mm/kfence/kfence_test.c
> @@ -214,7 +214,7 @@ static void test_cache_destroy(void)
>  static inline size_t kmalloc_cache_alignment(size_t size)
>  {
>  	/* just to get ->align so no need to pass in the real caller */
> -	enum kmalloc_cache_type type = kmalloc_type(GFP_KERNEL, 0);
> +	enum kmalloc_cache_type type = kmalloc_type(GFP_KERNEL, __kmalloc_token(0));
>  	return kmalloc_caches[type][__kmalloc_index(size, false)]->align;
>  }
>  
> @@ -285,7 +285,7 @@ static void *test_alloc(struct kunit *test, size_t size, gfp_t gfp, enum allocat
>  
>  		if (is_kfence_address(alloc)) {
>  			struct slab *slab = virt_to_slab(alloc);
> -			enum kmalloc_cache_type type = kmalloc_type(GFP_KERNEL, _RET_IP_);
> +			enum kmalloc_cache_type type = kmalloc_type(GFP_KERNEL, __kmalloc_token(size));
>  			struct kmem_cache *s = test_cache ?:
>  					kmalloc_caches[type][__kmalloc_index(size, false)];
>  
> diff --git a/mm/slab.h b/mm/slab.h
> index e9ab292acd22..dd49d37e253d 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -361,12 +361,12 @@ static inline unsigned int size_index_elem(unsigned int bytes)
>   * KMALLOC_MAX_CACHE_SIZE and the caller must check that.
>   */
>  static inline struct kmem_cache *
> -kmalloc_slab(size_t size, kmem_buckets *b, gfp_t flags, unsigned long caller)
> +kmalloc_slab(size_t size, kmem_buckets *b, gfp_t flags, kmalloc_token_t token)
>  {
>  	unsigned int index;
>  
>  	if (!b)
> -		b = &kmalloc_caches[kmalloc_type(flags, caller)];
> +		b = &kmalloc_caches[kmalloc_type(flags, token)];
>  	if (size <= 192)
>  		index = kmalloc_size_index[size_index_elem(size)];
>  	else
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index d5a70a831a2a..21ab7dd79b5e 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -787,7 +787,7 @@ size_t kmalloc_size_roundup(size_t size)
>  		 * The flags don't matter since size_index is common to all.
>  		 * Neither does the caller for just getting ->object_size.
>  		 */
> -		return kmalloc_slab(size, NULL, GFP_KERNEL, 0)->object_size;
> +		return kmalloc_slab(size, NULL, GFP_KERNEL, __kmalloc_token(0))->object_size;
>  	}
>  
>  	/* Above the smaller buckets, size is a multiple of page size. */
> @@ -821,26 +821,26 @@ EXPORT_SYMBOL(kmalloc_size_roundup);
>  #define KMALLOC_RCL_NAME(sz)
>  #endif
>  
> -#ifdef CONFIG_RANDOM_KMALLOC_CACHES
> -#define __KMALLOC_RANDOM_CONCAT(a, b) a ## b
> -#define KMALLOC_RANDOM_NAME(N, sz) __KMALLOC_RANDOM_CONCAT(KMA_RAND_, N)(sz)
> -#define KMA_RAND_1(sz)                  .name[KMALLOC_RANDOM_START +  1] = "kmalloc-rnd-01-" #sz,
> -#define KMA_RAND_2(sz)  KMA_RAND_1(sz)  .name[KMALLOC_RANDOM_START +  2] = "kmalloc-rnd-02-" #sz,
> -#define KMA_RAND_3(sz)  KMA_RAND_2(sz)  .name[KMALLOC_RANDOM_START +  3] = "kmalloc-rnd-03-" #sz,
> -#define KMA_RAND_4(sz)  KMA_RAND_3(sz)  .name[KMALLOC_RANDOM_START +  4] = "kmalloc-rnd-04-" #sz,
> -#define KMA_RAND_5(sz)  KMA_RAND_4(sz)  .name[KMALLOC_RANDOM_START +  5] = "kmalloc-rnd-05-" #sz,
> -#define KMA_RAND_6(sz)  KMA_RAND_5(sz)  .name[KMALLOC_RANDOM_START +  6] = "kmalloc-rnd-06-" #sz,
> -#define KMA_RAND_7(sz)  KMA_RAND_6(sz)  .name[KMALLOC_RANDOM_START +  7] = "kmalloc-rnd-07-" #sz,
> -#define KMA_RAND_8(sz)  KMA_RAND_7(sz)  .name[KMALLOC_RANDOM_START +  8] = "kmalloc-rnd-08-" #sz,
> -#define KMA_RAND_9(sz)  KMA_RAND_8(sz)  .name[KMALLOC_RANDOM_START +  9] = "kmalloc-rnd-09-" #sz,
> -#define KMA_RAND_10(sz) KMA_RAND_9(sz)  .name[KMALLOC_RANDOM_START + 10] = "kmalloc-rnd-10-" #sz,
> -#define KMA_RAND_11(sz) KMA_RAND_10(sz) .name[KMALLOC_RANDOM_START + 11] = "kmalloc-rnd-11-" #sz,
> -#define KMA_RAND_12(sz) KMA_RAND_11(sz) .name[KMALLOC_RANDOM_START + 12] = "kmalloc-rnd-12-" #sz,
> -#define KMA_RAND_13(sz) KMA_RAND_12(sz) .name[KMALLOC_RANDOM_START + 13] = "kmalloc-rnd-13-" #sz,
> -#define KMA_RAND_14(sz) KMA_RAND_13(sz) .name[KMALLOC_RANDOM_START + 14] = "kmalloc-rnd-14-" #sz,
> -#define KMA_RAND_15(sz) KMA_RAND_14(sz) .name[KMALLOC_RANDOM_START + 15] = "kmalloc-rnd-15-" #sz,
> -#else // CONFIG_RANDOM_KMALLOC_CACHES
> -#define KMALLOC_RANDOM_NAME(N, sz)
> +#ifdef CONFIG_PARTITION_KMALLOC_CACHES
> +#define __KMALLOC_PARTITION_CONCAT(a, b) a ## b
> +#define KMALLOC_PARTITION_NAME(N, sz) __KMALLOC_PARTITION_CONCAT(KMA_PART_, N)(sz)
> +#define KMA_PART_1(sz)                  .name[KMALLOC_PARTITION_START +  1] = "kmalloc-part-01-" #sz,
> +#define KMA_PART_2(sz)  KMA_PART_1(sz)  .name[KMALLOC_PARTITION_START +  2] = "kmalloc-part-02-" #sz,
> +#define KMA_PART_3(sz)  KMA_PART_2(sz)  .name[KMALLOC_PARTITION_START +  3] = "kmalloc-part-03-" #sz,
> +#define KMA_PART_4(sz)  KMA_PART_3(sz)  .name[KMALLOC_PARTITION_START +  4] = "kmalloc-part-04-" #sz,
> +#define KMA_PART_5(sz)  KMA_PART_4(sz)  .name[KMALLOC_PARTITION_START +  5] = "kmalloc-part-05-" #sz,
> +#define KMA_PART_6(sz)  KMA_PART_5(sz)  .name[KMALLOC_PARTITION_START +  6] = "kmalloc-part-06-" #sz,
> +#define KMA_PART_7(sz)  KMA_PART_6(sz)  .name[KMALLOC_PARTITION_START +  7] = "kmalloc-part-07-" #sz,
> +#define KMA_PART_8(sz)  KMA_PART_7(sz)  .name[KMALLOC_PARTITION_START +  8] = "kmalloc-part-08-" #sz,
> +#define KMA_PART_9(sz)  KMA_PART_8(sz)  .name[KMALLOC_PARTITION_START +  9] = "kmalloc-part-09-" #sz,
> +#define KMA_PART_10(sz) KMA_PART_9(sz)  .name[KMALLOC_PARTITION_START + 10] = "kmalloc-part-10-" #sz,
> +#define KMA_PART_11(sz) KMA_PART_10(sz) .name[KMALLOC_PARTITION_START + 11] = "kmalloc-part-11-" #sz,
> +#define KMA_PART_12(sz) KMA_PART_11(sz) .name[KMALLOC_PARTITION_START + 12] = "kmalloc-part-12-" #sz,
> +#define KMA_PART_13(sz) KMA_PART_12(sz) .name[KMALLOC_PARTITION_START + 13] = "kmalloc-part-13-" #sz,
> +#define KMA_PART_14(sz) KMA_PART_13(sz) .name[KMALLOC_PARTITION_START + 14] = "kmalloc-part-14-" #sz,
> +#define KMA_PART_15(sz) KMA_PART_14(sz) .name[KMALLOC_PARTITION_START + 15] = "kmalloc-part-15-" #sz,
> +#else // CONFIG_PARTITION_KMALLOC_CACHES
> +#define KMALLOC_PARTITION_NAME(N, sz)
>  #endif
>  
>  #define INIT_KMALLOC_INFO(__size, __short_size)			\
> @@ -849,7 +849,7 @@ EXPORT_SYMBOL(kmalloc_size_roundup);
>  	KMALLOC_RCL_NAME(__short_size)				\
>  	KMALLOC_CGROUP_NAME(__short_size)			\
>  	KMALLOC_DMA_NAME(__short_size)				\
> -	KMALLOC_RANDOM_NAME(RANDOM_KMALLOC_CACHES_NR, __short_size)	\
> +	KMALLOC_PARTITION_NAME(PARTITION_KMALLOC_CACHES_NR, __short_size)	\
>  	.size = __size,						\
>  }
>  
> @@ -961,8 +961,8 @@ new_kmalloc_cache(int idx, enum kmalloc_cache_type type)
>  		flags |= SLAB_CACHE_DMA;
>  	}
>  
> -#ifdef CONFIG_RANDOM_KMALLOC_CACHES
> -	if (type >= KMALLOC_RANDOM_START && type <= KMALLOC_RANDOM_END)
> +#ifdef CONFIG_PARTITION_KMALLOC_CACHES
> +	if (type >= KMALLOC_PARTITION_START && type <= KMALLOC_PARTITION_END)
>  		flags |= SLAB_NO_MERGE;
>  #endif
>  
> diff --git a/mm/slub.c b/mm/slub.c
> index 2b2d33cc735c..f033aae90bdc 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2125,11 +2125,11 @@ static inline size_t obj_exts_alloc_size(struct kmem_cache *s,
>  	if (!is_kmalloc_normal(s))
>  		return sz;
>  
> -	obj_exts_cache = kmalloc_slab(sz, NULL, gfp, 0);
> +	obj_exts_cache = kmalloc_slab(sz, NULL, gfp, __kmalloc_token(0));
>  	/*
> -	 * We can't simply compare s with obj_exts_cache, because random kmalloc
> -	 * caches have multiple caches per size, selected by caller address.
> -	 * Since caller address may differ between kmalloc_slab() and actual
> +	 * We can't simply compare s with obj_exts_cache, because partitioned kmalloc
> +	 * caches have multiple caches per size, selected by caller address or type.
> +	 * Since caller address or type may differ between kmalloc_slab() and actual
>  	 * allocation, bump size when sizes are equal.
>  	 */
>  	if (s->object_size == obj_exts_cache->object_size)
> @@ -5239,7 +5239,7 @@ EXPORT_SYMBOL(__kmalloc_large_node_noprof);
>  
>  static __always_inline
>  void *__do_kmalloc_node(size_t size, kmem_buckets *b, gfp_t flags, int node,
> -			unsigned long caller)
> +			unsigned long caller, kmalloc_token_t token)
>  {
>  	struct kmem_cache *s;
>  	void *ret;
> @@ -5254,22 +5254,24 @@ void *__do_kmalloc_node(size_t size, kmem_buckets *b, gfp_t flags, int node,
>  	if (unlikely(!size))
>  		return ZERO_SIZE_PTR;
>  
> -	s = kmalloc_slab(size, b, flags, caller);
> +	s = kmalloc_slab(size, b, flags, token);
>  
>  	ret = slab_alloc_node(s, NULL, flags, node, caller, size);
>  	ret = kasan_kmalloc(s, ret, size, flags);
>  	trace_kmalloc(caller, ret, size, s->size, flags, node);
>  	return ret;
>  }
> -void *__kmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node)
> +void *__kmalloc_node_noprof(DECL_KMALLOC_PARAMS(size, b, token), gfp_t flags, int node)
>  {
> -	return __do_kmalloc_node(size, PASS_BUCKET_PARAM(b), flags, node, _RET_IP_);
> +	return __do_kmalloc_node(size, PASS_BUCKET_PARAM(b), flags, node,
> +				 _RET_IP_, PASS_TOKEN_PARAM(token));
>  }
>  EXPORT_SYMBOL(__kmalloc_node_noprof);
>  
> -void *__kmalloc_noprof(size_t size, gfp_t flags)
> +void *__kmalloc_noprof(DECL_KMALLOC_PARAMS(size, b, token), gfp_t flags)
>  {
> -	return __do_kmalloc_node(size, NULL, flags, NUMA_NO_NODE, _RET_IP_);
> +	return __do_kmalloc_node(size, PASS_BUCKET_PARAM(b), flags,
> +				 NUMA_NO_NODE, _RET_IP_, PASS_TOKEN_PARAM(token));
>  }
>  EXPORT_SYMBOL(__kmalloc_noprof);
>  
> @@ -5284,7 +5286,7 @@ EXPORT_SYMBOL(__kmalloc_noprof);
>   * NULL does not mean EBUSY or EAGAIN. It means ENOMEM.
>   * There is no reason to call it again and expect !NULL.
>   */
> -void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node)
> +void *_kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node DECL_TOKEN_PARAM(token))
>  {
>  	gfp_t alloc_gfp = __GFP_NOWARN | __GFP_NOMEMALLOC | gfp_flags;
>  	struct kmem_cache *s;
> @@ -5307,7 +5309,7 @@ void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node)
>  retry:
>  	if (unlikely(size > KMALLOC_MAX_CACHE_SIZE))
>  		return NULL;
> -	s = kmalloc_slab(size, NULL, alloc_gfp, _RET_IP_);
> +	s = kmalloc_slab(size, NULL, alloc_gfp, PASS_TOKEN_PARAM(token));
>  
>  	if (!(s->flags & __CMPXCHG_DOUBLE) && !kmem_cache_debug(s))
>  		/*
> @@ -5360,12 +5362,13 @@ void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node)
>  	ret = kasan_kmalloc(s, ret, size, alloc_gfp);
>  	return ret;
>  }
> -EXPORT_SYMBOL_GPL(kmalloc_nolock_noprof);
> +EXPORT_SYMBOL_GPL(_kmalloc_nolock_noprof);
>  
> -void *__kmalloc_node_track_caller_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags,
> +void *__kmalloc_node_track_caller_noprof(DECL_KMALLOC_PARAMS(size, b, token), gfp_t flags,
>  					 int node, unsigned long caller)
>  {
> -	return __do_kmalloc_node(size, PASS_BUCKET_PARAM(b), flags, node, caller);
> +	return __do_kmalloc_node(size, PASS_BUCKET_PARAM(b), flags, node,
> +				 caller, PASS_TOKEN_PARAM(token));
>  
>  }
>  EXPORT_SYMBOL(__kmalloc_node_track_caller_noprof);
> @@ -6555,7 +6558,7 @@ void kfree_nolock(const void *object)
>  EXPORT_SYMBOL_GPL(kfree_nolock);
>  
>  static __always_inline __realloc_size(2) void *
> -__do_krealloc(const void *p, size_t new_size, unsigned long align, gfp_t flags, int nid)
> +__do_krealloc(const void *p, size_t new_size, unsigned long align, gfp_t flags, int nid, kmalloc_token_t token)
>  {
>  	void *ret;
>  	size_t ks = 0;
> @@ -6627,7 +6630,7 @@ __do_krealloc(const void *p, size_t new_size, unsigned long align, gfp_t flags,
>  	return (void *)p;
>  
>  alloc_new:
> -	ret = kmalloc_node_track_caller_noprof(new_size, flags, nid, _RET_IP_);
> +	ret = __kmalloc_node_track_caller_noprof(PASS_KMALLOC_PARAMS(new_size, NULL, token), flags, nid, _RET_IP_);
>  	if (ret && p) {
>  		/* Disable KASAN checks as the object's redzone is accessed. */
>  		kasan_disable_current();
> @@ -6677,7 +6680,7 @@ __do_krealloc(const void *p, size_t new_size, unsigned long align, gfp_t flags,
>   * Return: pointer to the allocated memory or %NULL in case of error
>   */
>  void *krealloc_node_align_noprof(const void *p, size_t new_size, unsigned long align,
> -				 gfp_t flags, int nid)
> +				 gfp_t flags, int nid DECL_TOKEN_PARAM(token))
>  {
>  	void *ret;
>  
> @@ -6686,7 +6689,7 @@ void *krealloc_node_align_noprof(const void *p, size_t new_size, unsigned long a
>  		return ZERO_SIZE_PTR;
>  	}
>  
> -	ret = __do_krealloc(p, new_size, align, flags, nid);
> +	ret = __do_krealloc(p, new_size, align, flags, nid, PASS_TOKEN_PARAM(token));
>  	if (ret && kasan_reset_tag(p) != kasan_reset_tag(ret))
>  		kfree(p);
>  
> @@ -6723,6 +6726,7 @@ static gfp_t kmalloc_gfp_adjust(gfp_t flags, size_t size)
>   * failure, fall back to non-contiguous (vmalloc) allocation.
>   * @size: size of the request.
>   * @b: which set of kmalloc buckets to allocate from.
> + * @token: allocation token.
>   * @align: desired alignment.
>   * @flags: gfp mask for the allocation - must be compatible (superset) with GFP_KERNEL.
>   * @node: numa node to allocate from
> @@ -6739,7 +6743,7 @@ static gfp_t kmalloc_gfp_adjust(gfp_t flags, size_t size)
>   *
>   * Return: pointer to the allocated memory of %NULL in case of failure
>   */
> -void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), unsigned long align,
> +void *__kvmalloc_node_noprof(DECL_KMALLOC_PARAMS(size, b, token), unsigned long align,
>  			     gfp_t flags, int node)
>  {
>  	bool allow_block;
> @@ -6751,7 +6755,7 @@ void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), unsigned long align,
>  	 */
>  	ret = __do_kmalloc_node(size, PASS_BUCKET_PARAM(b),
>  				kmalloc_gfp_adjust(flags, size),
> -				node, _RET_IP_);
> +				node, _RET_IP_, PASS_TOKEN_PARAM(token));
>  	if (ret || size <= PAGE_SIZE)
>  		return ret;
>  
> @@ -6848,17 +6852,17 @@ EXPORT_SYMBOL(kvfree_sensitive);
>   * Return: pointer to the allocated memory or %NULL in case of error
>   */
>  void *kvrealloc_node_align_noprof(const void *p, size_t size, unsigned long align,
> -				  gfp_t flags, int nid)
> +				  gfp_t flags, int nid DECL_TOKEN_PARAM(token))
>  {
>  	void *n;
>  
>  	if (is_vmalloc_addr(p))
>  		return vrealloc_node_align_noprof(p, size, align, flags, nid);
>  
> -	n = krealloc_node_align_noprof(p, size, align, kmalloc_gfp_adjust(flags, size), nid);
> +	n = krealloc_node_align_noprof(p, size, align, kmalloc_gfp_adjust(flags, size), nid _PASS_TOKEN_PARAM(token));
>  	if (!n) {
>  		/* We failed to krealloc(), fall back to kvmalloc(). */
> -		n = kvmalloc_node_align_noprof(size, align, flags, nid);
> +		n = __kvmalloc_node_noprof(PASS_KMALLOC_PARAMS(size, NULL, token), align, flags, nid);
>  		if (!n)
>  			return NULL;
>  
> @@ -8351,7 +8355,7 @@ static void __init bootstrap_kmalloc_sheaves(void)
>  {
>  	enum kmalloc_cache_type type;
>  
> -	for (type = KMALLOC_NORMAL; type <= KMALLOC_RANDOM_END; type++) {
> +	for (type = KMALLOC_NORMAL; type <= KMALLOC_PARTITION_END; type++) {
>  		for (int idx = 0; idx < KMALLOC_SHIFT_HIGH + 1; idx++) {
>  			if (kmalloc_caches[type][idx])
>  				bootstrap_cache_sheaves(kmalloc_caches[type][idx]);
> -- 
> 2.54.0.rc1.513.gad8abe7a5a-goog
> 
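
The KMA_PART_N chain in the quoted mm/slab_common.c hunk can be tried in
isolation. Below is a minimal userspace sketch, scaled down to three
partitions; struct kmalloc_info_sketch, PART_NAME(), PARTITION_NR and
partition_name() are made up for illustration, and KMALLOC_PARTITION_START
being equal to KMALLOC_NORMAL is an assumption, not something the hunk shows.

```c
#include <assert.h>
#include <string.h>

/* Assumed, scaled-down layout: slot 0 is the normal cache, 1..3 are partitions. */
enum { KMALLOC_NORMAL = 0, KMALLOC_PARTITION_START = 0, NR_TYPES_SKETCH = 4 };

#define PARTITION_NR 3	/* hypothetical stand-in for PARTITION_KMALLOC_CACHES_NR */

struct kmalloc_info_sketch {
	const char *name[NR_TYPES_SKETCH];
	unsigned int size;
};

/* Each KMA_PART_N emits the initializers for 1..N-1 and then its own. */
#define PART_CONCAT(a, b) a ## b
#define PART_NAME(N, sz) PART_CONCAT(KMA_PART_, N)(sz)
#define KMA_PART_1(sz)                .name[KMALLOC_PARTITION_START + 1] = "kmalloc-part-01-" #sz,
#define KMA_PART_2(sz) KMA_PART_1(sz) .name[KMALLOC_PARTITION_START + 2] = "kmalloc-part-02-" #sz,
#define KMA_PART_3(sz) KMA_PART_2(sz) .name[KMALLOC_PARTITION_START + 3] = "kmalloc-part-03-" #sz,

static const struct kmalloc_info_sketch info64 = {
	.name[KMALLOC_NORMAL] = "kmalloc-64",
	PART_NAME(PARTITION_NR, 64)	/* expands to three designated initializers */
	.size = 64,
};

/* Look up the name of partition n (1-based) for a given size class. */
static const char *partition_name(const struct kmalloc_info_sketch *info, int n)
{
	assert(n >= 1 && n <= PARTITION_NR);
	return info->name[KMALLOC_PARTITION_START + n];
}
```

PARTITION_NR is expanded before the token paste because PART_NAME() hands
it to PART_CONCAT() as an ordinary argument, which is why the two-level
concat macro is needed.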


^ permalink raw reply

* Re: [PATCH 1/5] liveupdate: Remove limit on the number of sessions
From: Mike Rapoport @ 2026-04-20  7:13 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: linux-kselftest, shuah, akpm, linux-mm, linux-kernel, dmatlack,
	kexec, pratyush, skhawaja, graf
In-Reply-To: <20260414200237.444170-2-pasha.tatashin@soleen.com>

On Tue, Apr 14, 2026 at 08:02:33PM +0000, Pasha Tatashin wrote:
> Currently, the number of LUO sessions is limited by a fixed number of
> pre-allocated pages for serialization (16 pages, allowing for ~819
> sessions).
> 
> This limitation is problematic if LUO is used to support things such as
> systemd file descriptor store, and would be used not just as VM memory
> but to save other states on the machine.
> 
> Remove this limit by transitioning to a linked-block approach for
> session metadata serialization. Instead of a single contiguous block,
> session metadata is now stored in a chain of 16-page blocks. Each block
> starts with a header containing the physical address of the next block
> and the number of session entries in the current block.
> 
> - Bump session ABI version to v3.
> - Update struct luo_session_header_ser to include a 'next' pointer.
> - Implement dynamic block allocation in luo_session_insert().
> - Update setup, serialization, and deserialization logic to traverse
>   the block chain.
> - Remove LUO_SESSION_MAX limit.
> 
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> ---
>  include/linux/kho/abi/luo.h      |  19 +--
>  kernel/liveupdate/luo_internal.h |  12 +-
>  kernel/liveupdate/luo_session.c  | 237 +++++++++++++++++++++++--------
>  3 files changed, 197 insertions(+), 71 deletions(-)
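
To make the linked-block layout from the commit message concrete, here is
a rough userspace sketch of walking such a chain. The struct and function
names are hypothetical, and a plain pointer value stands in for the
physical address that the real code would have to translate (e.g. with
phys_to_virt()).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-block header; the real struct luo_session_header_ser may differ. */
struct luo_block_sketch {
	uint64_t next;	/* address of the next block, 0 terminates the chain */
	uint64_t count;	/* number of session entries stored in this block */
};

/* Append 'next' after 'prev' in the chain. */
static void block_link(struct luo_block_sketch *prev, struct luo_block_sketch *next)
{
	assert(prev != NULL);
	prev->next = (uint64_t)(uintptr_t)next;
}

/* Walk the chain and sum the per-block entry counts. */
static uint64_t chain_total_sessions(const struct luo_block_sketch *first)
{
	uint64_t total = 0;
	const struct luo_block_sketch *b;

	for (b = first; b; b = (const struct luo_block_sketch *)(uintptr_t)b->next)
		total += b->count;
	return total;
}
```

Deserialization in the next kernel would follow the same 'next' links until
it reaches a zero address.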

...

> +/**
> + * struct luo_session_block - Internal representation of a session serialization block.
> + * @list: List head for linking blocks in memory.
> + * @ser:  Pointer to the serialized header in preserved memory.
> + */
> +struct luo_session_block {
> +	struct list_head list;
> +	struct luo_session_header_ser *ser;
> +};
> +
>  /**
>   * struct luo_session_header - Header struct for managing LUO sessions.
>   * @count:      The number of sessions currently tracked in the @list.
> + * @nblocks:    The number of allocated serialization blocks.
>   * @list:       The head of the linked list of `struct luo_session` instances.
>   * @rwsem:      A read-write semaphore providing synchronized access to the
>   *              session list and other fields in this structure.
> - * @header_ser: The header data of serialization array.
> - * @ser:        The serialized session data (an array of
> - *              `struct luo_session_ser`).
> + * @blocks:     The list of serialization blocks (struct luo_session_block).
>   * @active:     Set to true when first initialized. If previous kernel did not
>   *              send session data, active stays false for incoming.
>   */
>  struct luo_session_header {
>  	long count;
> +	long nblocks;
>  	struct list_head list;
>  	struct rw_semaphore rwsem;
> -	struct luo_session_header_ser *header_ser;
> -	struct luo_session_ser *ser;
> +	struct list_head blocks;

Don't we need some sort of locking for blocks?

>  	bool active;
>  };
  
> @@ -147,15 +222,6 @@ static int luo_session_insert(struct luo_session_header *sh,
>  
>  	guard(rwsem_write)(&sh->rwsem);
>  
> -	/*
> -	 * For outgoing we should make sure there is room in serialization array
> -	 * for new session.
> -	 */
> -	if (sh == &luo_session_global.outgoing) {
> -		if (sh->count == LUO_SESSION_MAX)
> -			return -ENOMEM;
> -	}
> -
>  	/*
>  	 * For small number of sessions this loop won't hurt performance
>  	 * but if we ever start using a lot of sessions, this might

For ~8.1 million sessions this comment does not seem valid anymore ;-)

-- 
Sincerely yours,
Mike.


^ permalink raw reply

* Re: [PATCH v3 02/10] mm: introduce pgtable_has_pmd_leaves()
From: Lance Yang @ 2026-04-20  6:55 UTC (permalink / raw)
  To: david
  Cc: lance.yang, luizcap, linux-kernel, linux-mm, baolin.wang,
	ryan.roberts, akpm, ljs, ziy, Liam.Howlett, npache, dev.jain,
	baohua
In-Reply-To: <8a0b6d67-121a-4ac8-a74b-737d52d8b7ac@kernel.org>


On Fri, Apr 17, 2026 at 11:57:54AM +0200, David Hildenbrand (Arm) wrote:
>On 4/10/26 10:19, Lance Yang wrote:
[...]
>>>
>>> #if !defined(MAX_POSSIBLE_PHYSMEM_BITS) && !defined(CONFIG_64BIT)
>>> diff --git a/init/main.c b/init/main.c
>>> index 1cb395dd94e4..07f2ddbf9677 100644
>>> --- a/init/main.c
>>> +++ b/init/main.c
>>> @@ -1044,6 +1044,7 @@ void start_kernel(void)
>>> 	print_kernel_cmdline(saved_command_line);
>>> 	/* parameters may set static keys */
>>> 	parse_early_param();
>>> +	init_arch_has_pmd_leaves();
>> 
>> One more thought here: I don't see why we need boot-time caching.
>> 
>> has_transparent_hugepage() does *not* look expensive on the common
>> archs. On x86, it is just a CPU feature check. MIPS is different, yes,
>> only the first call there is more involved ...
>> 
>> But if we *really* want caching, couldn't we just do it lazily instead
>> of adding another early boot init step?
>> 
>> Something like:
>> 
>> bool pgtable_has_pmd_leaves(void)
>> {
>> 	static int __arch_has_pmd_leaves = -1;
>> 
>> 	if (READ_ONCE(__arch_has_pmd_leaves) < 0)
>> 		WRITE_ONCE(__arch_has_pmd_leaves, has_transparent_hugepage());
>> 
>> 	return READ_ONCE(__arch_has_pmd_leaves);
>
>Would look easier when just using two booleans: "has_pmd_leaves" and
>"probed".
>
>I do like static keys, though. And wonder whether just putting that into
>hugepage_init() would be sufficient?

Probably sufficient for the current callers, yes, since they are all
THP-related :)

But then the helper may return false simply because THP is not built,
rather than because the CPU lacks PMD leaf support, which does not match
its semantics very well.

To be clear, I like static keys too, as I suggested earlier ;)

Another boot-time init step that depends on specific init ordering seems
a bit shaky to me... At the very least, we'd need to document that
ordering very clearly.
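
For reference, the two-boolean lazy variant David mentions could look
roughly like the userspace sketch below. fake_has_transparent_hugepage()
and probe_calls are hypothetical stand-ins so that the probe can be
observed; the sketch is single-threaded, and a kernel version would still
need READ_ONCE()/WRITE_ONCE() or similar for concurrent callers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for has_transparent_hugepage(); counts its calls. */
static int probe_calls;

static bool fake_has_transparent_hugepage(void)
{
	probe_calls++;
	return true;
}

/* Probe on first use, then serve the cached answer. */
static bool pgtable_has_pmd_leaves_lazy(void)
{
	static bool probed;
	static bool has_pmd_leaves;

	if (!probed) {
		has_pmd_leaves = fake_has_transparent_hugepage();
		probed = true;
	}
	assert(probed);
	return has_pmd_leaves;
}
```

No init ordering is involved here: whoever calls first pays for the probe.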

Cheers,
Lance


^ permalink raw reply

* Re: [PATCH 7.2 v3 04/12] mm/khugepaged: remove READ_ONLY_THP_FOR_FS check in hugepage_pmd_enabled()
From: Baolin Wang @ 2026-04-20  6:55 UTC (permalink / raw)
  To: Zi Yan, Matthew Wilcox (Oracle), Song Liu
  Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
	Jan Kara, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Lance Yang, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Shuah Khan, linux-btrfs, linux-kernel,
	linux-fsdevel, linux-mm, linux-kselftest
In-Reply-To: <20260418024429.4055056-5-ziy@nvidia.com>



On 4/18/26 10:44 AM, Zi Yan wrote:
> Remove READ_ONLY_THP_FOR_FS and khugepaged for file-backed pmd-sized
> hugepages are enabled by the global transparent hugepage control.
> khugepaged can still be enabled by per-size control for anon and shmem when
> the global control is off.
> 
> Add shmem_hpage_pmd_enabled() stub for !CONFIG_SHMEM to remove
> IS_ENABLED(SHMEM) in hugepage_pmd_enabled().
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>   include/linux/shmem_fs.h |  2 +-
>   mm/khugepaged.c          | 28 ++++++++++++++++------------
>   2 files changed, 17 insertions(+), 13 deletions(-)
> 
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index 1a345142af7d..dff8fb6ddac0 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -127,7 +127,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
>   void shmem_truncate_range(struct inode *inode, loff_t start, uoff_t end);
>   int shmem_unuse(unsigned int type);
>   
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && defined(CONFIG_SHMEM)
>   unsigned long shmem_allowable_huge_orders(struct inode *inode,
>   				struct vm_area_struct *vma, pgoff_t index,
>   				loff_t write_end, bool shmem_huge_force);
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 1c0fdc81d276..718a2d06d1e6 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -406,18 +406,8 @@ static inline int collapse_test_exit_or_disable(struct mm_struct *mm)
>   		mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm);
>   }
>   
> -static bool hugepage_pmd_enabled(void)
> +static inline bool anon_hpage_pmd_enabled(void)
>   {
> -	/*
> -	 * We cover the anon, shmem and the file-backed case here; file-backed
> -	 * hugepages, when configured in, are determined by the global control.
> -	 * Anon pmd-sized hugepages are determined by the pmd-size control.
> -	 * Shmem pmd-sized hugepages are also determined by its pmd-size control,
> -	 * except when the global shmem_huge is set to SHMEM_HUGE_DENY.
> -	 */
> -	if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
> -	    hugepage_global_enabled())
> -		return true;
>   	if (test_bit(PMD_ORDER, &huge_anon_orders_always))
>   		return true;
>   	if (test_bit(PMD_ORDER, &huge_anon_orders_madvise))
> @@ -425,7 +415,21 @@ static bool hugepage_pmd_enabled(void)
>   	if (test_bit(PMD_ORDER, &huge_anon_orders_inherit) &&
>   	    hugepage_global_enabled())
>   		return true;
> -	if (IS_ENABLED(CONFIG_SHMEM) && shmem_hpage_pmd_enabled())
> +	return false;
> +}
> +
> +static bool hugepage_pmd_enabled(void)
> +{
> +	/*
> +	 * Anon, shmem and file-backed pmd-size hugepages are all determined by
> +	 * the global control. If the global control is off, anon and shmem
> +	 * pmd-sized hugepages are also determined by its per-size control.
> +	 */

Personally, I found the previous comment clearer. The statement
"Anon, shmem pmd-size hugepages are all determined by the global
control" seems somewhat confusing. For example, if
hugepage_global_enabled() returns true but the pmd-sized sub-control is
set to 'never', anon pmd-size hugepages are not allowed.

The code changes LGTM.

> +	if (hugepage_global_enabled())
> +		return true;
> +	if (anon_hpage_pmd_enabled())
> +		return true;
> +	if (shmem_hpage_pmd_enabled())
>   		return true;
>   	return false;
>   }
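
To illustrate the distinction, here is a small userspace model of the two
helpers in the quoted hunk. struct thp_controls and its fields are
hypothetical stand-ins for the sysfs state behind hugepage_global_enabled(),
the huge_anon_orders_* bitmaps and shmem_hpage_pmd_enabled().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of the THP controls relevant to PMD order. */
struct thp_controls {
	bool global_enabled;	/* models hugepage_global_enabled() */
	bool anon_pmd_always;	/* PMD_ORDER set in huge_anon_orders_always */
	bool anon_pmd_madvise;	/* PMD_ORDER set in huge_anon_orders_madvise */
	bool anon_pmd_inherit;	/* PMD_ORDER set in huge_anon_orders_inherit */
	bool shmem_pmd_enabled;	/* models shmem_hpage_pmd_enabled() */
};

/* Mirrors anon_hpage_pmd_enabled() from the hunk. */
static bool anon_hpage_pmd_enabled(const struct thp_controls *c)
{
	assert(c != NULL);
	if (c->anon_pmd_always)
		return true;
	if (c->anon_pmd_madvise)
		return true;
	if (c->anon_pmd_inherit && c->global_enabled)
		return true;
	return false;
}

/* Mirrors the new hugepage_pmd_enabled(): should khugepaged run at all? */
static bool hugepage_pmd_enabled(const struct thp_controls *c)
{
	assert(c != NULL);
	return c->global_enabled || anon_hpage_pmd_enabled(c) ||
	       c->shmem_pmd_enabled;
}
```

With only global_enabled set (pmd-sized anon control at 'never'),
hugepage_pmd_enabled() is true while anon_hpage_pmd_enabled() is false,
which is the case the comment wording glosses over.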



^ permalink raw reply

* Re: [PATCH 02/53] selftests/mm: khugepaged: enable collapse_single_pte_entry_compound for shmem
From: Sarthak Sharma @ 2026-04-20  6:50 UTC (permalink / raw)
  To: Mike Rapoport, Andrew Morton, David Hildenbrand
  Cc: Baolin Wang, Barry Song, Dev Jain, Jason Gunthorpe, John Hubbard,
	Liam R. Howlett, Lance Yang, Leon Romanovsky, Lorenzo Stoakes,
	Mark Brown, Michal Hocko, Nico Pache, Peter Xu, Ryan Roberts,
	Shuah Khan, Suren Baghdasaryan, Vlastimil Babka, Zi Yan,
	linux-kernel, linux-kselftest, linux-mm
In-Reply-To: <20260406141735.2179309-3-rppt@kernel.org>

Hi Mike

On 4/6/26 7:46 PM, Mike Rapoport wrote:
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> 
> A comment in collapse_single_pte_entry_compound() says it can't run on
> shmem because "MADV_DONTNEED can't evict tmpfs pages".
> But MADV_REMOVE can!
> 
> Use MADV_REMOVE for tmpfs to evict pages and enable
> collapse_single_pte_entry_compound() test for shmem.
> 
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
>  tools/testing/selftests/mm/khugepaged.c | 14 ++++++--------
>  1 file changed, 6 insertions(+), 8 deletions(-)
> 
> diff --git a/tools/testing/selftests/mm/khugepaged.c b/tools/testing/selftests/mm/khugepaged.c
> index 3fe7ef04ac62..e6fb01ca44ed 100644
> --- a/tools/testing/selftests/mm/khugepaged.c
> +++ b/tools/testing/selftests/mm/khugepaged.c
> @@ -783,20 +783,17 @@ static void collapse_max_ptes_swap(struct collapse_context *c, struct mem_ops *o
>  
>  static void collapse_single_pte_entry_compound(struct collapse_context *c, struct mem_ops *ops)
>  {
> +	int advise = MADV_DONTNEED;
>  	void *p;
>  
>  	p = alloc_hpage(ops);
>  
> -	if (is_tmpfs(ops)) {
> -		/* MADV_DONTNEED won't evict tmpfs pages */
> -		printf("tmpfs...");
> -		skip("Skip");
> -		goto skip;
> -	}
> +	if (is_tmpfs(ops))
> +		advise = MADV_REMOVE;

is_tmpfs(ops) will always return false for shmem_ops, since the function
definition does not handle the shmem_ops case. Therefore, this advise
will always remain as MADV_DONTNEED.

Also, I am able to run the shmem tests using MADV_DONTNEED and the tests
succeed. So, perhaps we don't need to use MADV_REMOVE in this case?

>  
>  	madvise(p, hpage_pmd_size, MADV_NOHUGEPAGE);
>  	printf("Split huge page leaving single PTE mapping compound page...");
> -	madvise(p + page_size, hpage_pmd_size - page_size, MADV_DONTNEED);
> +	madvise(p + page_size, hpage_pmd_size - page_size, advise);
>  	if (ops->check_huge(p, 0))
>  		success("OK");
>  	else
> @@ -805,7 +802,6 @@ static void collapse_single_pte_entry_compound(struct collapse_context *c, struc
>  	c->collapse("Collapse PTE table with single PTE mapping compound page",
>  		    p, 1, ops, true);
>  	validate_memory(p, 0, page_size);
> -skip:
>  	ops->cleanup_area(p, hpage_pmd_size);
>  }
>  
> @@ -1251,8 +1247,10 @@ int main(int argc, char **argv)
>  
>  	TEST(collapse_single_pte_entry_compound, khugepaged_context, anon_ops);
>  	TEST(collapse_single_pte_entry_compound, khugepaged_context, file_ops);
> +	TEST(collapse_single_pte_entry_compound, khugepaged_context, shmem_ops);
>  	TEST(collapse_single_pte_entry_compound, madvise_context, anon_ops);
>  	TEST(collapse_single_pte_entry_compound, madvise_context, file_ops);
> +	TEST(collapse_single_pte_entry_compound, madvise_context, shmem_ops);
>  
>  	TEST(collapse_full_of_compound, khugepaged_context, anon_ops);
>  	TEST(collapse_full_of_compound, khugepaged_context, file_ops);



^ permalink raw reply

* Re: [PATCH] mm/sparse: remove unnecessary NULL check before allocating mem_section
From: Mike Rapoport @ 2026-04-20  6:49 UTC (permalink / raw)
  To: Sang-Heon Jeon
  Cc: akpm, david, ljs, Liam.Howlett, vbabka, surenb, mhocko, linux-mm
In-Reply-To: <20260419144225.2875654-1-ekffu200098@gmail.com>

Hi,

On Sun, Apr 19, 2026 at 11:42:25PM +0900, Sang-Heon Jeon wrote:
> Commit 850ed20539a4 ("mm: move array mem_section init code out
> of memory_present()") moved mem_section allocation logic
> into memblocks_present().
> 
> Before that move, memory_present() could be called multiple times, so
> unlikely() matched the common case, where most calls found mem_section
> already allocated.
> 
> After that move, memblocks_present() is called exactly once from
> sparse_init(). Under CONFIG_SPARSEMEM_EXTREME, mem_section is always
> NULL when it is called.
> 
> So remove unnecessary NULL check before allocating mem_section. No
> functional change.
> 
> Signed-off-by: Sang-Heon Jeon <ekffu200098@gmail.com>

Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

> ---
> Hello, 
> 
> While looking into boot information, I found a minor enhancement point.
> If I misunderstood anything, please feel free to let me know.

I would say it's more a cleanup than enhancement :)
 
> Thank you for taking valuable time to review this work.
> 
> Best Regards,
> Sang-Heon Jeon
> ---
>  mm/sparse.c | 10 ++++------
>  1 file changed, 4 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/sparse.c b/mm/sparse.c
> index effdac6b0ab1..e13f9f5fa090 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -201,13 +201,11 @@ static void __init memblocks_present(void)
>  	int i, nid;
>  
>  #ifdef CONFIG_SPARSEMEM_EXTREME
> -	if (unlikely(!mem_section)) {
> -		unsigned long size, align;
> +	unsigned long size, align;
>  
> -		size = sizeof(struct mem_section *) * NR_SECTION_ROOTS;
> -		align = 1 << (INTERNODE_CACHE_SHIFT);
> -		mem_section = memblock_alloc_or_panic(size, align);
> -	}
> +	size = sizeof(struct mem_section *) * NR_SECTION_ROOTS;
> +	align = 1 << (INTERNODE_CACHE_SHIFT);
> +	mem_section = memblock_alloc_or_panic(size, align);
>  #endif
>  
>  	for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, &nid)
> -- 
> 2.43.0
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply

* Re: [PATCH v6 01/30] mm: Introduce kpkeys
From: Kevin Brodsky @ 2026-04-20  6:46 UTC (permalink / raw)
  To: David Hildenbrand (Arm), linux-hardening
  Cc: linux-kernel, Andrew Morton, Andy Lutomirski, Catalin Marinas,
	Dave Hansen, Ira Weiny, Jann Horn, Jeff Xu, Joey Gouly, Kees Cook,
	Linus Walleij, Lorenzo Stoakes, Marc Zyngier, Mark Brown,
	Matthew Wilcox, Maxwell Bland, Mike Rapoport (IBM),
	Peter Zijlstra, Pierre Langlois, Quentin Perret, Rick Edgecombe,
	Ryan Roberts, Thomas Gleixner, Vlastimil Babka, Will Deacon,
	Yang Shi, Yeoreum Yun, linux-arm-kernel, linux-mm, x86
In-Reply-To: <cd2bcc09-2507-4ed4-bb92-2d53baedaf04@kernel.org>

On 17/04/2026 19:38, David Hildenbrand (Arm) wrote:
> On 4/17/26 17:59, Kevin Brodsky wrote:
>> On 17/04/2026 16:37, David Hildenbrand (Arm) wrote:
>>> On 2/27/26 18:54, Kevin Brodsky wrote:
>>>> kpkeys is a simple framework to enable the use of protection keys
>>>> (pkeys) to harden the kernel itself. This patch introduces the basic
>>>> API in <linux/kpkeys.h>: a couple of functions to set and restore
>>>> the pkey register and macros to define guard objects.
>>>>
>>>> kpkeys introduces a new concept on top of pkeys: the kpkeys level.
>>>> Each level is associated to a set of permissions for the pkeys
>>>> managed by the kpkeys framework. kpkeys_set_level(lvl) sets those
>>>> permissions according to lvl, and returns the original pkey
>>>> register, to be later restored by kpkeys_restore_pkey_reg(). To
>>>> start with, only KPKEYS_LVL_DEFAULT is available, which is meant
>>>> to grant RW access to KPKEYS_PKEY_DEFAULT (i.e. all memory since
>>>> this is the only available pkey for now).
>>>>
>>>> Because each architecture implementing pkeys uses a different
>>>> representation for the pkey register, and may reserve certain pkeys
>>>> for specific uses, support for kpkeys must be explicitly indicated
>>>> by selecting ARCH_HAS_KPKEYS and defining the following functions in
>>>> <asm/kpkeys.h>, in addition to the macros provided in
>>>> <asm-generic/kpkeys.h>:
>>>>
>>>> - arch_kpkeys_set_level()
>>>> - arch_kpkeys_restore_pkey_reg()
>>>> - arch_kpkeys_enabled()
>>> Another thing: why not simply drop the "arch_" stuff from these helpers?
>> The first two are not meant to be directly called, they're the
>> arch-specific implementation of kpkeys_set_level() and
>> kpkeys_restore_pkey_reg(), and those generic functions handle some
>> generic logic.
>>
>> arch_kpkeys_enabled() is directly used in generic code, so I suppose it
>> could be renamed to kpkeys_enabled()? It's actually implemented in an
>> arch header so I wasn't too sure about it.
> I was skimming over patch #13 and spotted:
>
> +void __init kpkeys_hardened_pgtables_init(void)
> +{
> +	if (!arch_kpkeys_enabled())
> +		return;
> +
> +	static_branch_enable(&kpkeys_hardened_pgtables_key);
> +}
>
> The arch_* there can just go IMHO.
>
> I'd also do it for the two ones used by the GUARD macros. If we don't
> expect common code wrappers (arch_kpkeys_enabled() vs. kpkeys_enabled),
> then the arch_ is unnecessary information -- IMHO

Makes sense. I could just rename arch_kpkeys_enabled() to
kpkeys_enabled(), but I'm thinking having an arch abstraction could be
clearer, after looking into protecting sparse-vmemmap page tables. The
new version would look like this:

* <asm/kpkeys.h>:
    - arch_supports_kpkeys()
    - arch_supports_kpkeys_early() [can be called before features have
      been detected]

* <linux/kpkeys.h> defines:
    - kpkeys_enabled() -> arch_supports_kpkeys()
    - kpkeys_hardened_pgtables_enabled() -> static key
    - kpkeys_hardened_pgtables_early_enabled() ->
      arch_supports_kpkeys_early() [called when setting up sparse-vmemmap,
      linear map, etc.]
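
A rough userspace sketch of the layering listed above, for illustration
only. The function names come from the list; everything else (the two
bool flags standing in for CPU feature detection, the plain bool standing
in for the static key, the locally defined CONFIG_ARCH_HAS_KPKEYS) is an
assumption made so that the snippet compiles on its own.

```c
#include <stdbool.h>

#define CONFIG_ARCH_HAS_KPKEYS 1	/* assumption: defined here only for the sketch */

/* ---- stand-in for <asm/kpkeys.h> (hypothetical arch side) ---- */
static bool features_detected;		/* assumption: set once CPU features are probed */
static bool cpu_has_pkeys = true;	/* assumption: hardware supports pkeys */

static inline bool arch_supports_kpkeys(void)
{
	return features_detected && cpu_has_pkeys;
}

/* May be called before feature detection has completed. */
static inline bool arch_supports_kpkeys_early(void)
{
	return cpu_has_pkeys;
}

/* ---- stand-in for <linux/kpkeys.h> (generic side) ---- */
static bool kpkeys_hardened_pgtables_key;	/* plain bool instead of a static key */

static inline bool kpkeys_enabled(void)
{
#ifdef CONFIG_ARCH_HAS_KPKEYS
	return arch_supports_kpkeys();
#else
	return false;
#endif
}

static inline bool kpkeys_hardened_pgtables_enabled(void)
{
	return kpkeys_hardened_pgtables_key;
}

static inline bool kpkeys_hardened_pgtables_early_enabled(void)
{
#ifdef CONFIG_ARCH_HAS_KPKEYS
	return arch_supports_kpkeys_early();
#else
	return false;
#endif
}

/* Generic init code only ever calls the kpkeys_*() wrappers. */
static void kpkeys_hardened_pgtables_init(void)
{
	if (!kpkeys_enabled())
		return;

	kpkeys_hardened_pgtables_key = true;
}
```

In this sketch the arch header only answers "is it supported?", while
all #ifdef handling and the interface functions live on the generic side.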

There is extra #ifdef'ing going on in <linux/kpkeys.h>, but
<asm/kpkeys.h> doesn't need to worry about it. I think this might be
easier to follow; I'm not keen on having an interface function like
kpkeys_enabled() defined in an arch header (not great for kernel-doc
comments either). Any thoughts?

- Kevin


^ permalink raw reply

* Re: [PATCH] docs: Add overview and SLUB allocator sections to slab documentation
From: Jonathan Corbet @ 2026-04-20  6:43 UTC (permalink / raw)
  To: Nick Huang, Lorenzo Stoakes
  Cc: David Hildenbrand (Arm), Matthew Wilcox, Vlastimil Babka,
	Harry Yoo, Andrew Morton, Hao Li, Christoph Lameter,
	David Rientjes, Roman Gushchin, Liam R . Howlett, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Shuah Khan, linux-mm, linux-doc,
	linux-kernel
In-Reply-To: <CABZAGRHXtjzGJrgR1NAmVHFMP9eL5zZr3DaTAtAvywv_1sOHdw@mail.gmail.com>

Nick Huang <sef1548@gmail.com> writes:

> I am really sorry for causing trouble for everyone. I would like to
> ask which aspect of mine was disrespectful, so that I can be more
> careful next time.
>
> If I want to make this kind of change, should I send an [RFC patch] to
> ask for everyone's opinion?

If you want to be respectful, start by reading what has been sent to
you; the problems with your submission were well explained, more than
once.

To reiterate:

- Do not send LLM-generated material without marking it as such as
  described in our developer documentation.

- Do not attempt to document systems that you do not, yourself,
  understand; you have no hope of getting it right, and you will only
  succeed in wasting the time of the people who have to review your
  changes.

The point of documentation is to be informative, accurate, and useful;
simply filling in a bunch of words is not helpful to anybody.

Thanks,

jon


^ permalink raw reply

* Re: [PATCH] docs: Add overview and SLUB allocator sections to slab documentation
From: Mike Rapoport @ 2026-04-20  6:42 UTC (permalink / raw)
  To: Nick Huang
  Cc: Lorenzo Stoakes, David Hildenbrand (Arm), Matthew Wilcox,
	Vlastimil Babka, Harry Yoo, Andrew Morton, Jonathan Corbet,
	Hao Li, Christoph Lameter, David Rientjes, Roman Gushchin,
	Liam R . Howlett, Suren Baghdasaryan, Michal Hocko, Shuah Khan,
	linux-mm, linux-doc, linux-kernel
In-Reply-To: <CABZAGRHXtjzGJrgR1NAmVHFMP9eL5zZr3DaTAtAvywv_1sOHdw@mail.gmail.com>

Hi Nick,

On Mon, Apr 20, 2026 at 12:52:25PM +0800, Nick Huang wrote:
> 
> I am really sorry for causing trouble for everyone. I would like to
> ask which aspect of mine was disrespectful, so that I can be more
> careful next time.

Maintainers' time is valuable, and sending an LLM-generated patch without
understanding what it is about is disrespectful: it wastes the
maintainers' time.
 
> If I want to make this kind of change, should I send an [RFC patch] to
> ask for everyone's opinion?

If you want to make that kind of change, you should start by researching
and understanding for yourself what the code is doing, then double-check
and verify the LLM output rather than just taking LLM slop and sending it.
 
> Sorry, I really am not very clear about the process.

Start by reading the kernel process documentation:
https://docs.kernel.org/process/development-process.html

> -- 
> Regards,
> Nick Huang

-- 
Sincerely yours,
Mike.


^ permalink raw reply

* Re: [PATCH 7.2 v3 03/12] mm/huge_memory: remove READ_ONLY_THP_FOR_FS from file_thp_enabled()
From: Baolin Wang @ 2026-04-20  6:31 UTC (permalink / raw)
  To: Zi Yan, Matthew Wilcox (Oracle), Song Liu
  Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
	Jan Kara, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Lance Yang, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Shuah Khan, linux-btrfs, linux-kernel,
	linux-fsdevel, linux-mm, linux-kselftest
In-Reply-To: <20260418024429.4055056-4-ziy@nvidia.com>



On 4/18/26 10:44 AM, Zi Yan wrote:
> Replace it with a check on the max folio order of the file's address space
> mapping, making sure PMD THP is supported. Also remove the read-only fd
> check, since collapse_file() now makes sure all to-be-collapsed folios are
> clean and the created PMD file THP can be handled by FSes properly.
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---

LGTM.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>


^ permalink raw reply

* Re: [PATCH v2] mm/fake-numa: fix under-allocation detection in uniform split
From: Mike Rapoport @ 2026-04-20  6:31 UTC (permalink / raw)
  To: Sang-Heon Jeon; +Cc: akpm, djbw, mingo, linux-mm, Donghyeon Lee, Munhui Chae
In-Reply-To: <20260417135805.1758378-1-ekffu200098@gmail.com>

Hi,

On Fri, Apr 17, 2026 at 10:58:05PM +0900, Sang-Heon Jeon wrote:
> When split NUMA node uniformly, split_nodes_size_interleave_uniform()
> returns the next absolute node ID, not the number of nodes created.
> 
> The existing under-allocation detection logic compares next absolute node
> ID (ret) and request count (n), which only works when nid starts at 0.
> 
> For example, on a system with 2 physical NUMA nodes (node 0: 2GB, node
> 1: 128MB) and numa=fake=8U, 8 fake nodes are successfully created from
> node 0 and split_nodes_size_interleave_uniform() returns 8. For node 1,
> fake node nid starts at 8, but only 4 fake nodes are created due to
> current FAKE_NODE_MIN_SIZE being 32MB, and
> split_nodes_size_interleave_uniform() returns 12. By existing
> under-allocation detection logic, "ret < n" (12 < 8) is false, so the

In this example it would be 11, wouldn't it?
I'll update when applying.

> under-allocation will not be detected.
> 
> Fix under-allocation detection logic to compare the number of actually
> created nodes (ret - nid) against the request count (n). Also skip
> under-allocation detection logic for memoryless physical nodes where no
> fake nodes are created.
> 
> Also, fix the outdated comment to match the actual return value.
> 
> Signed-off-by: Sang-Heon Jeon <ekffu200098@gmail.com>
> Reported-by: Donghyeon Lee <asd142513@gmail.com>
> Reported-by: Munhui Chae <mochae@student.42seoul.kr>
> Fixes: cc9aec03e58f ("x86/numa_emulation: Introduce uniform split capability") # 4.19

...

> @@ -416,9 +416,18 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
>  					n, &pi.blk[0], nid);
>  			if (ret < 0)
>  				break;
> -			if (ret < n) {
> +
> +			/*
> +			 * If no memory was found for this physical node,
> +			 * skip the under-allocation check. 

checkpatch complains about trailing white space here.
I'll fix it up when applying.

> +			 */
> +			if (ret == nid)
> +				continue;
> +
> +			nr_created = ret - nid;
> +			if (nr_created < n) {
>  				pr_info("%s: phys: %d only got %d of %ld nodes, failing\n",
> -						__func__, i, ret, n);
> +						__func__, i, nr_created, n);
>  				ret = -1;
>  				break;
>  			}
> -- 
> 2.43.0
> 
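
For illustration, the old and new checks from the hunk above can be
sketched as two small predicates. The helper names are hypothetical and
this is a standalone model, not the kernel code; the values 12 (or 11),
8 and 8 are the ones discussed in the example.

```c
#include <stdbool.h>

/*
 * Old check: compares the returned absolute node ID (ret) directly with
 * the requested count (n). Only meaningful when nid starts at 0.
 */
static bool old_check_detects(int ret, long n)
{
	return ret < n;
}

/*
 * New check, following the patch: compare the number of nodes actually
 * created for this physical node (ret - nid) with the requested count.
 * A memoryless physical node (ret == nid) is skipped.
 */
static bool new_check_detects(int ret, int nid, long n)
{
	int nr_created;

	if (ret == nid)
		return false;

	nr_created = ret - nid;
	return nr_created < n;
}
```

With nid = 8 and n = 8, the old predicate misses the under-allocation
whether ret is 12 or 11, while the new one reports it in both cases.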

-- 
Sincerely yours,
Mike.


^ permalink raw reply

* Re: [PATCH 7.2 v3 02/12] mm/khugepaged: add folio dirty check after try_to_unmap()
From: Baolin Wang @ 2026-04-20  6:28 UTC (permalink / raw)
  To: Zi Yan, Matthew Wilcox (Oracle), Song Liu
  Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
	Jan Kara, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Lance Yang, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Shuah Khan, linux-btrfs, linux-kernel,
	linux-fsdevel, linux-mm, linux-kselftest
In-Reply-To: <20260418024429.4055056-3-ziy@nvidia.com>



On 4/18/26 10:44 AM, Zi Yan wrote:
> This check ensures the correctness of collapsing read-only THPs for FSes
> after READ_ONLY_THP_FOR_FS is enabled by default for all FSes supporting
> PMD THP pagecache.
> 
> READ_ONLY_THP_FOR_FS only supports read-only fd and uses mapping->nr_thps
> and inode->i_writecount to prevent any write to read-only to-be-collapsed
> folios. In upcoming commits, READ_ONLY_THP_FOR_FS will be removed and the
> aforementioned mechanism will go away too. To ensure khugepaged functions
> as expected after the changes, skip if any folio is dirty after
> try_to_unmap(), since a dirty folio means this read-only folio got some
> writes via mmap, which can happen between try_to_unmap() and
> try_to_unmap_flush() via cached TLB entries, and khugepaged does not
> support writable pagecache folio collapse yet.
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---

LGTM.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>


^ permalink raw reply

* Re: [PATCH 7.2 v3 01/12] mm/khugepaged: remove READ_ONLY_THP_FOR_FS check
From: Baolin Wang @ 2026-04-20  6:07 UTC (permalink / raw)
  To: Zi Yan, Matthew Wilcox (Oracle), Song Liu
  Cc: Chris Mason, David Sterba, Alexander Viro, Christian Brauner,
	Jan Kara, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Lance Yang, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Shuah Khan, linux-btrfs, linux-kernel,
	linux-fsdevel, linux-mm, linux-kselftest
In-Reply-To: <20260418024429.4055056-2-ziy@nvidia.com>



On 4/18/26 10:44 AM, Zi Yan wrote:
> collapse_file() requires FSes supporting large folio with at least
> PMD_ORDER, so replace the READ_ONLY_THP_FOR_FS check with that.
> MADV_COLLAPSE ignores shmem huge config, so exclude the check for shmem.
> 
> While at it, replace VM_BUG_ON with VM_WARN_ON_ONCE.
> 
> Add a helper function mapping_pmd_thp_support() for FSes supporting large
> folio with at least PMD_ORDER.
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---

LGTM.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>


^ permalink raw reply

* Re: [PATCH] docs: Add overview and SLUB allocator sections to slab documentation
From: Nick Huang @ 2026-04-20  4:52 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: David Hildenbrand (Arm), Matthew Wilcox, Vlastimil Babka,
	Harry Yoo, Andrew Morton, Jonathan Corbet, Hao Li,
	Christoph Lameter, David Rientjes, Roman Gushchin,
	Liam R . Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, linux-mm, linux-doc, linux-kernel
In-Reply-To: <aeTTw4gziJigaNbU@lucifer>

On Sun, Apr 19, 2026 at 9:17 PM Lorenzo Stoakes <ljs@kernel.org> wrote:
>
> On Sun, Apr 19, 2026 at 10:35:44AM +0200, David Hildenbrand (Arm) wrote:
> > On 4/18/26 18:15, Matthew Wilcox wrote:
> > > On Sat, Apr 18, 2026 at 10:07:22AM +0100, Lorenzo Stoakes wrote:
> > >> On Sat, Apr 18, 2026 at 12:06:19AM +0000, Nick Huang wrote:
> > >>> - Add "Overview" section explaining the slab allocator's role and purpose
> > >>> - Document the three main slab allocator implementations (SLAB, SLUB, SLOB)
> > >>
> > >> The fact you're insanely wrong about the current state of slab only makes this
> > >> worse.
> > >
> > > This is actually a new low.  We've always had to contend with people
> > > putting up outdated or just wrong information on web pages, and there's
> > > little we can do about it.  Witness all the outdated information about
> > > THP that's based on code that's been deleted for over a decade.
> > >
> > > But now we've got AI trained on all this wrong/ out of date information,
> > > and, er, "enthusiasts" who are trying to change the correct information
> > > in the kernel to match what the deluded AI "thinks" should be true.
> > >
> > > Let that sink in.
>
> Ugh ye gawds. My attitude is nip this in the bud early.
>
> I'm very harsh in response to these things for a reason - firstly, it's rude,
> obnoxious + disrespectful, so a negative response is wholly appropriate.
>
> But more importantly, I want to SET A PRECEDENT that if you send this crap
> you'll get a VERY negative response.
>
> Clueless but good faith or bad faith - it's straight up plagiarism and that's
> totally unacceptable.
>
> > >
> >
> > I think we should make it very clear that we don't want doc updates from someone
> > that is not a renowned expert in that area or wants to become an expert in that
> > area (and already discussed working on the docs with maintainers/experts).
> >
> > Otherwise we'll have this same discussion over and over again.
> >
> > diff --git a/Documentation/mm/index.rst b/Documentation/mm/index.rst
> > index 7aa2a88869083..8c5721001c8bb 100644
> > --- a/Documentation/mm/index.rst
> > +++ b/Documentation/mm/index.rst
> > @@ -7,6 +7,11 @@ of Linux.  If you are looking for advice on simply allocating
> > memory,
> >   see the :ref:`memory_allocation`.  For controlling and tuning guides,
> >   see the :doc:`admin guide <../admin-guide/mm/index>`.
> >
> > +A lot of documentation in this guide is still incomplete. If you are not
> > +a renowned expert in the specific area, but you want to contribute bigger
> > +chunks of documentation, talk to the respective MM experts first. LLM
> > +generated slop from non-experts will be rejected without further comments.
> > +
> >   .. toctree::
> >      :maxdepth: 1
> >
> >
> >
> > LLMs are just the tip of the iceberg. It will all be development-by-review with
> > inexperienced contributors. And we are only willing to put in the effort to
> > teach contributors if the contributors are actually worth our time: i.e.,
> > not LLM kiddies, but people who will actually stick around and help the
> > subsystem in the long run.
> >
> >
> > The whole doc update stuff is similar to people just grepping for TODOs in the
> > kernel and then using an LLM to produce code they have no idea about.
> >
> > It's the evolution of typo fixes: review load without any benefit.
>
> Agree with all of that!
>
> Let's do that, happy to give tags on a patch for the above :)
>
> >
> > --
> > Cheers,
> >
> > David
> >
>
> Cheers, Lorenzo
Hi Lorenzo Stoakes


I am really sorry for causing trouble for everyone. I would like to
ask which aspect of mine was disrespectful, so that I can be more
careful next time.

If I want to make this kind of change, should I send an [RFC patch] to
ask for everyone's opinion?

Sorry, I really am not very clear about the process.
-- 
Regards,
Nick Huang


^ permalink raw reply

* Re: [PATCH 1/5] liveupdate: Remove limit on the number of sessions
From: Pasha Tatashin @ 2026-04-20  4:45 UTC (permalink / raw)
  To: Zhu Yanjun
  Cc: Pasha Tatashin, linux-kselftest, rppt, shuah, akpm, linux-mm,
	linux-kernel, dmatlack, kexec, pratyush, skhawaja, graf
In-Reply-To: <cea19b89-22d4-4df3-a770-d75eff22e556@linux.dev>

On 04-19 21:32, Zhu Yanjun wrote:
> 
> On 2026/4/14 17:14, Pasha Tatashin wrote:
> > On Tue, Apr 14, 2026 at 8:06 PM yanjun.zhu <yanjun.zhu@linux.dev> wrote:
> > > On 4/14/26 1:02 PM, Pasha Tatashin wrote:
> > > > Currently, the number of LUO sessions is limited by a fixed number of
> > > > pre-allocated pages for serialization (16 pages, allowing for ~819
> > > > sessions).
> > > > 
> > > > This limitation is problematic if LUO is used to support things such as
> > > > systemd file descriptor store, and would be used not just as VM memory
> > > > but to save other states on the machine.
> > > > 
> > > > Remove this limit by transitioning to a linked-block approach for
> > > > session metadata serialization. Instead of a single contiguous block,
> > > > session metadata is now stored in a chain of 16-page blocks. Each block
> > > > starts with a header containing the physical address of the next block
> > > > and the number of session entries in the current block.
> > > > 
> > > > - Bump session ABI version to v3.
> > > > - Update struct luo_session_header_ser to include a 'next' pointer.
> > > > - Implement dynamic block allocation in luo_session_insert().
> > > > - Update setup, serialization, and deserialization logic to traverse
> > > >     the block chain.
> > > > - Remove LUO_SESSION_MAX limit.
> > > > 
> > > > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> > > > ---
> > > >    include/linux/kho/abi/luo.h      |  19 +--
> > > >    kernel/liveupdate/luo_internal.h |  12 +-
> > > >    kernel/liveupdate/luo_session.c  | 237 +++++++++++++++++++++++--------
> > > >    3 files changed, 197 insertions(+), 71 deletions(-)
> > > > 
> > > > diff --git a/include/linux/kho/abi/luo.h b/include/linux/kho/abi/luo.h
> > > > index 46750a0ddf88..f5732958545e 100644
> > > > --- a/include/linux/kho/abi/luo.h
> > > > +++ b/include/linux/kho/abi/luo.h
> > > > @@ -57,9 +57,10 @@
> > > >     *   - compatible: "luo-session-v1"
> > > >     *     Identifies the session ABI version.
> > > >     *   - luo-session-header: u64
> > > > - *     The physical address of a `struct luo_session_header_ser`. This structure
> > > > - *     is the header for a contiguous block of memory containing an array of
> > > > - *     `struct luo_session_ser`, one for each preserved session.
> > > > + *     The physical address of the first `struct luo_session_header_ser`.
> > > > + *     This structure is the header for a block of memory containing an array
> > > > + *     of `struct luo_session_ser` entries. Multiple blocks are linked via
> > > > + *     the `next` field in the header.
> > > >     *
> > > >     * File-Lifecycle-Bound Node (luo-flb):
> > > >     *   This node describes all preserved global objects whose lifecycle is bound
> > > > @@ -77,9 +78,9 @@
> > > >     *   `__packed` structures. These structures contain the actual preserved state.
> > > >     *
> > > >     *   - struct luo_session_header_ser:
> > > > - *     Header for the session array. Contains the total page count of the
> > > > - *     preserved memory block and the number of `struct luo_session_ser`
> > > > - *     entries that follow.
> > > > + *     Header for the session data block. Contains the physical address of the
> > > > + *     next session data block and the number of `struct luo_session_ser`
> > > > + *     entries that follow this header in the current block.
> > > >     *
> > > >     *   - struct luo_session_ser:
> > > >     *     Metadata for a single session, including its name and a physical pointer
> > > > @@ -153,21 +154,23 @@ struct luo_file_set_ser {
> > > >     *                          luo_session_header_ser
> > > >     */
> > > >    #define LUO_FDT_SESSION_NODE_NAME   "luo-session"
> > > > -#define LUO_FDT_SESSION_COMPATIBLE   "luo-session-v2"
> > > > +#define LUO_FDT_SESSION_COMPATIBLE   "luo-session-v3"
> > > >    #define LUO_FDT_SESSION_HEADER              "luo-session-header"
> > > > 
> > > >    /**
> > > >     * struct luo_session_header_ser - Header for the serialized session data block.
> > > > + * @next:  Physical address of the next struct luo_session_header_ser.
> > > >     * @count: The number of `struct luo_session_ser` entries that immediately
> > > >     *         follow this header in the memory block.
> > > >     *
> > > > - * This structure is located at the beginning of a contiguous block of
> > > > + * This structure is located at the beginning of a block of
> > > >     * physical memory preserved across the kexec. It provides the necessary
> > > >     * metadata to interpret the array of session entries that follow.
> > > >     *
> > > >     * If this structure is modified, `LUO_FDT_SESSION_COMPATIBLE` must be updated.
> > > >     */
> > > >    struct luo_session_header_ser {
> > > > +     u64 next;
> > > >        u64 count;
> > > >    } __packed;
> > > > 
> > > > diff --git a/kernel/liveupdate/luo_internal.h b/kernel/liveupdate/luo_internal.h
> > > > index 875844d7a41d..a73f42069301 100644
> > > > --- a/kernel/liveupdate/luo_internal.h
> > > > +++ b/kernel/liveupdate/luo_internal.h
> > > > @@ -11,6 +11,16 @@
> > > >    #include <linux/liveupdate.h>
> > > >    #include <linux/uaccess.h>
> > > > 
> > > > +/*
> > > > + * Safeguard limit for the number of serialization blocks. This is used to
> > > > + * prevent infinite loops and excessive memory allocation in case of memory
> > > > + * corruption in the preserved state.
> > > > + *
> > > > + * This limit allows for ~8.1 million sessions and ~1.2 million files per
> > > > + * session, which is more than enough for all realistic use cases.
> > > > + */
> > > > +#define LUO_MAX_BLOCKS 10000
> > > > +
> > > >    struct luo_ucmd {
> > > >        void __user *ubuffer;
> > > >        u32 user_size;
> > > > @@ -59,7 +69,6 @@ struct luo_file_set {
> > > >     * struct luo_session - Represents an active or incoming Live Update session.
> > > >     * @name:       A unique name for this session, used for identification and
> > > >     *              retrieval.
> > > > - * @ser:        Pointer to the serialized data for this session.
> > > >     * @list:       A list_head member used to link this session into a global list
> > > >     *              of either outgoing (to be preserved) or incoming (restored from
> > > >     *              previous kernel) sessions.
> > > > @@ -70,7 +79,6 @@ struct luo_file_set {
> > > >     */
> > > >    struct luo_session {
> > > >        char name[LIVEUPDATE_SESSION_NAME_LENGTH];
> > > > -     struct luo_session_ser *ser;
> > > >        struct list_head list;
> > > >        bool retrieved;
> > > >        struct luo_file_set file_set;
> > > > diff --git a/kernel/liveupdate/luo_session.c b/kernel/liveupdate/luo_session.c
> > > > index 92b1af791889..007ca34eba79 100644
> > > > --- a/kernel/liveupdate/luo_session.c
> > > > +++ b/kernel/liveupdate/luo_session.c
> > > > @@ -69,30 +69,39 @@
> > > >    #include <uapi/linux/liveupdate.h>
> > > >    #include "luo_internal.h"
> > > > 
> > > > -/* 16 4K pages, give space for 744 sessions */
> > > > +/* 16 4K pages, give space for 819 sessions per block */
> > > >    #define LUO_SESSION_PGCNT   16ul
> > > > -#define LUO_SESSION_MAX              (((LUO_SESSION_PGCNT << PAGE_SHIFT) -   \
> > > > +#define LUO_SESSION_BLOCK_MAX                (((LUO_SESSION_PGCNT << PAGE_SHIFT) -   \
> > > >                sizeof(struct luo_session_header_ser)) /                \
> > > >                sizeof(struct luo_session_ser))
> > > > 
> > > > +/**
> > > > + * struct luo_session_block - Internal representation of a session serialization block.
> > > > + * @list: List head for linking blocks in memory.
> > > > + * @ser:  Pointer to the serialized header in preserved memory.
> > > > + */
> > > > +struct luo_session_block {
> > > > +     struct list_head list;
> > > > +     struct luo_session_header_ser *ser;
> > > > +};
> > > > +
> > > >    /**
> > > >     * struct luo_session_header - Header struct for managing LUO sessions.
> > > >     * @count:      The number of sessions currently tracked in the @list.
> > > > + * @nblocks:    The number of allocated serialization blocks.
> > > >     * @list:       The head of the linked list of `struct luo_session` instances.
> > > >     * @rwsem:      A read-write semaphore providing synchronized access to the
> > > >     *              session list and other fields in this structure.
> > > > - * @header_ser: The header data of serialization array.
> > > > - * @ser:        The serialized session data (an array of
> > > > - *              `struct luo_session_ser`).
> > > > + * @blocks:     The list of serialization blocks (struct luo_session_block).
> > > >     * @active:     Set to true when first initialized. If previous kernel did not
> > > >     *              send session data, active stays false for incoming.
> > > >     */
> > > >    struct luo_session_header {
> > > >        long count;
> > > > +     long nblocks;
> > > >        struct list_head list;
> > > >        struct rw_semaphore rwsem;
> > > > -     struct luo_session_header_ser *header_ser;
> > > > -     struct luo_session_ser *ser;
> > > > +     struct list_head blocks;
> > > >        bool active;
> > > >    };
> > > > 
> > > > @@ -110,10 +119,12 @@ static struct luo_session_global luo_session_global = {
> > > >        .incoming = {
> > > >                .list = LIST_HEAD_INIT(luo_session_global.incoming.list),
> > > >                .rwsem = __RWSEM_INITIALIZER(luo_session_global.incoming.rwsem),
> > > > +             .blocks = LIST_HEAD_INIT(luo_session_global.incoming.blocks),
> > > >        },
> > > >        .outgoing = {
> > > >                .list = LIST_HEAD_INIT(luo_session_global.outgoing.list),
> > > >                .rwsem = __RWSEM_INITIALIZER(luo_session_global.outgoing.rwsem),
> > > > +             .blocks = LIST_HEAD_INIT(luo_session_global.outgoing.blocks),
> > > >        },
> > > >    };
> > > > 
> > > > @@ -140,6 +151,70 @@ static void luo_session_free(struct luo_session *session)
> > > >        kfree(session);
> > > >    }
> > > > 
> > > > +static int luo_session_add_block(struct luo_session_header *sh,
> > > > +                              struct luo_session_header_ser *ser)
> > > > +{
> > > > +     struct luo_session_block *block;
> > > > +
> > > > +     if (sh->nblocks >= LUO_MAX_BLOCKS)
> > > > +             return -ENOSPC;
> > > > +
> > > > +     block = kzalloc_obj(*block);
> > > > +     if (!block)
> > > > +             return -ENOMEM;
> > > > +
> > > > +     block->ser = ser;
> > > > +     list_add_tail(&block->list, &sh->blocks);
> > > > +     sh->nblocks++;
> > > > +
> > > > +     return 0;
> > > > +}
> > > > +
> > > > +static int luo_session_create_ser_block(struct luo_session_header *sh)
> > > > +{
> > > > +     struct luo_session_block *last = NULL;
> > > > +     struct luo_session_header_ser *ser;
> > > > +     int err;
> > > > +
> > > > +     ser = kho_alloc_preserve(LUO_SESSION_PGCNT << PAGE_SHIFT);
> > > > +     if (IS_ERR(ser))
> > > > +             return PTR_ERR(ser);
> > > > +
> > > > +     if (!list_empty(&sh->blocks))
> > > > +             last = list_last_entry(&sh->blocks, struct luo_session_block, list);
> > > > +
> > > > +     err = luo_session_add_block(sh, ser);
> > > > +     if (err)
> > > > +             goto err_unpreserve;
> > > > +
> > > > +     if (last)
> > > > +             last->ser->next = virt_to_phys(ser);
> > > > +
> > > > +     return 0;
> > > > +
> > > > +err_unpreserve:
> > > > +     kho_unpreserve_free(ser);
> > > > +     return err;
> > > > +}
> > > > +
> > > > +static void luo_session_destroy_ser_blocks(struct luo_session_header *sh,
> > > > +                                        bool unpreserve)
> > > > +{
> > > > +     struct luo_session_block *block, *tmp;
> > > > +
> > > > +     list_for_each_entry_safe(block, tmp, &sh->blocks, list) {
> > > > +             if (block->ser) {
> > > > +                     if (unpreserve)
> > > > +                             kho_unpreserve_free(block->ser);
> > > > +                     else
> > > > +                             kho_restore_free(block->ser);
> > > > +             }
> > > > +             list_del(&block->list);
> > > > +             kfree(block);
> > > > +             sh->nblocks--;
> > > > +     }
> > > > +}
> > > > +
> > > >    static int luo_session_insert(struct luo_session_header *sh,
> > > >                              struct luo_session *session)
> > > >    {
> > > > @@ -147,15 +222,6 @@ static int luo_session_insert(struct luo_session_header *sh,
> > > > 
> > > >        guard(rwsem_write)(&sh->rwsem);
> > > > 
> > > > -     /*
> > > > -      * For outgoing we should make sure there is room in serialization array
> > > > -      * for new session.
> > > > -      */
> > > > -     if (sh == &luo_session_global.outgoing) {
> > > > -             if (sh->count == LUO_SESSION_MAX)
> > > > -                     return -ENOMEM;
> > > > -     }
> > > > -
> > > >        /*
> > > >         * For small number of sessions this loop won't hurt performance
> > > >         * but if we ever start using a lot of sessions, this might
> > > > @@ -166,6 +232,20 @@ static int luo_session_insert(struct luo_session_header *sh,
> > > >                if (!strncmp(it->name, session->name, sizeof(it->name)))
> > > >                        return -EEXIST;
> > > >        }
> > > > +
> > > > +     /*
> > > > +      * For outgoing we should make sure there is room in serialization array
> > > > +      * for new session. If not, allocate a new block.
> > > > +      */
> > > > +     if (sh == &luo_session_global.outgoing) {
> > > > +             if (sh->count == sh->nblocks * LUO_SESSION_BLOCK_MAX) {
> > > I am not sure if "LUO_SESSION_BLOCK_MAX 10000" has special meaning or not.
> > LUO_SESSION_BLOCK_MAX is calculated based on the number of times
> > "struct luo_session_ser" fits into a page-aligned block minus the
> > luo_session_header_ser. It is not a power of two, so there is no pretty
> > way to change this to avoid multiplication.
> 
> Maybe we can use
> 
> if (!(sh->count % LUO_SESSION_BLOCK_MASK) && sh->count > 0)

I considered this approach as well, but it will not work as expected. 
If a user repeatedly preserves and closes a session in a loop at the 
block boundary, it will cause a new block to be created every time.

Also, multiplication is significantly more efficient than modulo: modulo
requires a division and a multiplication and takes about 15 or so cycles,
while the typical latency of a multiplication is 3-4 cycles.
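
To illustrate the block-boundary problem, here is a small standalone
model of the two checks. It is only a sketch: struct sim and the helper
names are hypothetical, 819 is the per-block figure from the comment in
the patch, and it assumes that closing a session decrements the count
without freeing any block.

```c
#include <stdbool.h>

#define BLOCK_MAX 819L	/* sessions per block, per the comment in the patch */

/* Hypothetical model of the two counters in struct luo_session_header. */
struct sim {
	long count;	/* sessions currently tracked */
	long nblocks;	/* serialization blocks allocated so far */
};

/* Check used in the patch: allocate only when all blocks are full. */
static bool needs_block_mul(const struct sim *s)
{
	return s->count == s->nblocks * BLOCK_MAX;
}

/* Suggested alternative: allocate whenever count is a multiple of BLOCK_MAX. */
static bool needs_block_mod(const struct sim *s)
{
	return s->count > 0 && (s->count % BLOCK_MAX) == 0;
}

/*
 * Start with one full block, then repeatedly preserve and close one
 * session. Returns the number of blocks allocated at the end.
 */
static long churn_at_boundary(bool (*needs_block)(const struct sim *),
			      int rounds)
{
	struct sim s = { .count = BLOCK_MAX, .nblocks = 1 };

	for (int i = 0; i < rounds; i++) {
		if (needs_block(&s))
			s.nblocks++;
		s.count++;	/* preserve a new session */
		s.count--;	/* close it again; blocks stay allocated (assumption) */
	}

	return s.nblocks;
}
```

In this model the multiplication-based check allocates one extra block
on the first round and then stops, while the modulo-based check adds a
block on every round.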

> 
> This can avoid multiplication.
> 
> Just 2 cent suggestion.
> 
> Zhu Yanjun
> 
> > 
> > > If not, we can use the following:
> > > 
> > > #define LUO_SESSION_BLOCK_SHIFT 13
> > > #define LUO_SESSION_BLOCK_MAX   (1UL << LUO_SESSION_BLOCK_SHIFT) // 8192
> > > #define LUO_SESSION_BLOCK_MASK  (LUO_SESSION_BLOCK_MAX - 1)      // 0x1FFF
> > > 
> > > if (!(sh->count & LUO_SESSION_BLOCK_MASK) && sh->count > 0) {
> > > ...
> > > }
> > > 
> > > Zhu Yanjun
> > > 
> > > > +                     int err = luo_session_create_ser_block(sh);
> > > > +
> > > > +                     if (err)
> > > > +                             return err;
> > > > +             }
> > > > +     }
> > > > +
> > > >        list_add_tail(&session->list, &sh->list);
> > > >        sh->count++;
> > > > 
> > > > @@ -444,9 +524,12 @@ int __init luo_session_setup_outgoing(void *fdt_out)
> > > >        u64 header_ser_pa;
> > > >        int err;
> > > > 
> > > > -     header_ser = kho_alloc_preserve(LUO_SESSION_PGCNT << PAGE_SHIFT);
> > > > -     if (IS_ERR(header_ser))
> > > > -             return PTR_ERR(header_ser);
> > > > +     err = luo_session_create_ser_block(&luo_session_global.outgoing);
> > > > +     if (err)
> > > > +             return err;
> > > > +
> > > > +     header_ser = list_first_entry(&luo_session_global.outgoing.blocks,
> > > > +                                   struct luo_session_block, list)->ser;
> > > >        header_ser_pa = virt_to_phys(header_ser);
> > > > 
> > > >        err = fdt_begin_node(fdt_out, LUO_FDT_SESSION_NODE_NAME);
> > > > @@ -459,19 +542,18 @@ int __init luo_session_setup_outgoing(void *fdt_out)
> > > >        if (err)
> > > >                goto err_unpreserve;
> > > > 
> > > > -     luo_session_global.outgoing.header_ser = header_ser;
> > > > -     luo_session_global.outgoing.ser = (void *)(header_ser + 1);
> > > >        luo_session_global.outgoing.active = true;
> > > > 
> > > >        return 0;
> > > > 
> > > >    err_unpreserve:
> > > > -     kho_unpreserve_free(header_ser);
> > > > +     luo_session_destroy_ser_blocks(&luo_session_global.outgoing, true);
> > > >        return err;
> > > >    }
> > > > 
> > > >    int __init luo_session_setup_incoming(void *fdt_in)
> > > >    {
> > > > +     struct luo_session_header *sh = &luo_session_global.incoming;
> > > >        struct luo_session_header_ser *header_ser;
> > > >        int err, header_size, offset;
> > > >        u64 header_ser_pa;
> > > > @@ -501,11 +583,14 @@ int __init luo_session_setup_incoming(void *fdt_in)
> > > >        }
> > > > 
> > > >        header_ser_pa = get_unaligned((u64 *)ptr);
> > > > -     header_ser = phys_to_virt(header_ser_pa);
> > > > -
> > > > -     luo_session_global.incoming.header_ser = header_ser;
> > > > -     luo_session_global.incoming.ser = (void *)(header_ser + 1);
> > > > -     luo_session_global.incoming.active = true;
> > > > +     while (header_ser_pa) {
> > > > +             header_ser = phys_to_virt(header_ser_pa);
> > > > +             err = luo_session_add_block(sh, header_ser);
> > > > +             if (err)
> > > > +                     return err;
> > > > +             header_ser_pa = header_ser->next;
> > > > +     }
> > > > +     sh->active = true;
> > > > 
> > > >        return 0;
> > > >    }
> > > > @@ -513,6 +598,7 @@ int __init luo_session_setup_incoming(void *fdt_in)
> > > >    int luo_session_deserialize(void)
> > > >    {
> > > >        struct luo_session_header *sh = &luo_session_global.incoming;
> > > > +     struct luo_session_block *block;
> > > >        static bool is_deserialized;
> > > >        static int err;
> > > > 
> > > > @@ -539,40 +625,49 @@ int luo_session_deserialize(void)
> > > >         * userspace to detect the failure and trigger a reboot, which will
> > > >         * reliably reset devices and reclaim memory.
> > > >         */
> > > > -     for (int i = 0; i < sh->header_ser->count; i++) {
> > > > -             struct luo_session *session;
> > > > -
> > > > -             session = luo_session_alloc(sh->ser[i].name);
> > > > -             if (IS_ERR(session)) {
> > > > -                     pr_warn("Failed to allocate session [%.*s] during deserialization %pe\n",
> > > > -                             (int)sizeof(sh->ser[i].name),
> > > > -                             sh->ser[i].name, session);
> > > > -                     err = PTR_ERR(session);
> > > > -                     return err;
> > > > -             }
> > > > +     list_for_each_entry(block, &sh->blocks, list) {
> > > > +             struct luo_session_ser *ser = (void *)(block->ser + 1);
> > > > 
> > > > -             err = luo_session_insert(sh, session);
> > > > -             if (err) {
> > > > -                     pr_warn("Failed to insert session [%s] %pe\n",
> > > > -                             session->name, ERR_PTR(err));
> > > > -                     luo_session_free(session);
> > > > +             if (block->ser->count > LUO_SESSION_BLOCK_MAX) {
> > > > +                     pr_warn("Session block contains too many entries: %llu\n",
> > > > +                             block->ser->count);
> > > > +                     err = -EINVAL;
> > > >                        return err;
> > > >                }
> > > > 
> > > > -             scoped_guard(mutex, &session->mutex) {
> > > > -                     err = luo_file_deserialize(&session->file_set,
> > > > -                                                &sh->ser[i].file_set_ser);
> > > > -             }
> > > > -             if (err) {
> > > > -                     pr_warn("Failed to deserialize files for session [%s] %pe\n",
> > > > -                             session->name, ERR_PTR(err));
> > > > -                     return err;
> > > > +             for (int i = 0; i < block->ser->count; i++) {
> > > > +                     struct luo_session *session;
> > > > +
> > > > +                     session = luo_session_alloc(ser[i].name);
> > > > +                     if (IS_ERR(session)) {
> > > > +                             pr_warn("Failed to allocate session [%.*s] during deserialization %pe\n",
> > > > +                                     (int)sizeof(ser[i].name),
> > > > +                                     ser[i].name, session);
> > > > +                             err = PTR_ERR(session);
> > > > +                             return err;
> > > > +                     }
> > > > +
> > > > +                     err = luo_session_insert(sh, session);
> > > > +                     if (err) {
> > > > +                             pr_warn("Failed to insert session [%s] %pe\n",
> > > > +                                     session->name, ERR_PTR(err));
> > > > +                             luo_session_free(session);
> > > > +                             return err;
> > > > +                     }
> > > > +
> > > > +                     scoped_guard(mutex, &session->mutex) {
> > > > +                             err = luo_file_deserialize(&session->file_set,
> > > > +                                                        &ser[i].file_set_ser);
> > > > +                     }
> > > > +                     if (err) {
> > > > +                             pr_warn("Failed to deserialize files for session [%s] %pe\n",
> > > > +                                     session->name, ERR_PTR(err));
> > > > +                             return err;
> > > > +                     }
> > > >                }
> > > >        }
> > > > 
> > > > -     kho_restore_free(sh->header_ser);
> > > > -     sh->header_ser = NULL;
> > > > -     sh->ser = NULL;
> > > > +     luo_session_destroy_ser_blocks(sh, false);
> > > > 
> > > >        return 0;
> > > >    }
> > > > @@ -580,31 +675,51 @@ int luo_session_deserialize(void)
> > > >    int luo_session_serialize(void)
> > > >    {
> > > >        struct luo_session_header *sh = &luo_session_global.outgoing;
> > > > +     struct luo_session_block *block;
> > > >        struct luo_session *session;
> > > > +     struct luo_session_ser *ser;
> > > >        int i = 0;
> > > >        int err;
> > > > 
> > > >        guard(rwsem_write)(&sh->rwsem);
> > > > +
> > > > +     if (list_empty(&sh->blocks))
> > > > +             return 0;
> > > > +
> > > > +     block = list_first_entry(&sh->blocks, struct luo_session_block, list);
> > > > +     ser = (void *)(block->ser + 1);
> > > > +
> > > >        list_for_each_entry(session, &sh->list, list) {
> > > > -             err = luo_session_freeze_one(session, &sh->ser[i]);
> > > > +             if (i == LUO_SESSION_BLOCK_MAX) {
> > > > +                     block->ser->count = i;
> > > > +                     block = list_next_entry(block, list);
> > > > +                     ser = (void *)(block->ser + 1);
> > > > +                     i = 0;
> > > > +             }
> > > > +
> > > > +             err = luo_session_freeze_one(session, &ser[i]);
> > > >                if (err)
> > > >                        goto err_undo;
> > > > 
> > > > -             strscpy(sh->ser[i].name, session->name,
> > > > -                     sizeof(sh->ser[i].name));
> > > > +             strscpy(ser[i].name, session->name,
> > > > +                     sizeof(ser[i].name));
> > > >                i++;
> > > >        }
> > > > -     sh->header_ser->count = sh->count;
> > > > +     block->ser->count = i;
> > > > 
> > > >        return 0;
> > > > 
> > > >    err_undo:
> > > >        list_for_each_entry_continue_reverse(session, &sh->list, list) {
> > > >                i--;
> > > > -             luo_session_unfreeze_one(session, &sh->ser[i]);
> > > > -             memset(sh->ser[i].name, 0, sizeof(sh->ser[i].name));
> > > > +             if (i < 0) {
> > > > +                     block = list_prev_entry(block, list);
> > > > +                     ser = (void *)(block->ser + 1);
> > > > +                     i = LUO_SESSION_BLOCK_MAX - 1;
> > > > +             }
> > > > +             luo_session_unfreeze_one(session, &ser[i]);
> > > > +             memset(ser[i].name, 0, sizeof(ser[i].name));
> > > >        }
> > > > 
> > > >        return err;
> > > >    }
> > > > -
> > > 
> -- 
> Best Regards,
> Yanjun.Zhu
> 



* Re: [PATCH 1/5] liveupdate: Remove limit on the number of sessions
From: Zhu Yanjun @ 2026-04-20  4:32 UTC (permalink / raw)
  To: Pasha Tatashin, yanjun.zhu@linux.dev
  Cc: linux-kselftest, rppt, shuah, akpm, linux-mm, linux-kernel,
	dmatlack, kexec, pratyush, skhawaja, graf
In-Reply-To: <CA+CK2bDvFQUYCsD+jKBWw1KfRNJ4DOopAu29iq7SWYrPWV+dzQ@mail.gmail.com>


On 2026/4/14 17:14, Pasha Tatashin wrote:
> On Tue, Apr 14, 2026 at 8:06 PM yanjun.zhu <yanjun.zhu@linux.dev> wrote:
>> On 4/14/26 1:02 PM, Pasha Tatashin wrote:
>>> Currently, the number of LUO sessions is limited by a fixed number of
>>> pre-allocated pages for serialization (16 pages, allowing for ~819
>>> sessions).
>>>
>>> This limitation is problematic if LUO is used to support things such as
>>> systemd file descriptor store, and would be used not just as VM memory
>>> but to save other states on the machine.
>>>
>>> Remove this limit by transitioning to a linked-block approach for
>>> session metadata serialization. Instead of a single contiguous block,
>>> session metadata is now stored in a chain of 16-page blocks. Each block
>>> starts with a header containing the physical address of the next block
>>> and the number of session entries in the current block.
>>>
>>> - Bump session ABI version to v3.
>>> - Update struct luo_session_header_ser to include a 'next' pointer.
>>> - Implement dynamic block allocation in luo_session_insert().
>>> - Update setup, serialization, and deserialization logic to traverse
>>>     the block chain.
>>> - Remove LUO_SESSION_MAX limit.
>>>
>>> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
>>> ---
>>>    include/linux/kho/abi/luo.h      |  19 +--
>>>    kernel/liveupdate/luo_internal.h |  12 +-
>>>    kernel/liveupdate/luo_session.c  | 237 +++++++++++++++++++++++--------
>>>    3 files changed, 197 insertions(+), 71 deletions(-)
>>>
>>> diff --git a/include/linux/kho/abi/luo.h b/include/linux/kho/abi/luo.h
>>> index 46750a0ddf88..f5732958545e 100644
>>> --- a/include/linux/kho/abi/luo.h
>>> +++ b/include/linux/kho/abi/luo.h
>>> @@ -57,9 +57,10 @@
>>>     *   - compatible: "luo-session-v1"
>>>     *     Identifies the session ABI version.
>>>     *   - luo-session-header: u64
>>> - *     The physical address of a `struct luo_session_header_ser`. This structure
>>> - *     is the header for a contiguous block of memory containing an array of
>>> - *     `struct luo_session_ser`, one for each preserved session.
>>> + *     The physical address of the first `struct luo_session_header_ser`.
>>> + *     This structure is the header for a block of memory containing an array
>>> + *     of `struct luo_session_ser` entries. Multiple blocks are linked via
>>> + *     the `next` field in the header.
>>>     *
>>>     * File-Lifecycle-Bound Node (luo-flb):
>>>     *   This node describes all preserved global objects whose lifecycle is bound
>>> @@ -77,9 +78,9 @@
>>>     *   `__packed` structures. These structures contain the actual preserved state.
>>>     *
>>>     *   - struct luo_session_header_ser:
>>> - *     Header for the session array. Contains the total page count of the
>>> - *     preserved memory block and the number of `struct luo_session_ser`
>>> - *     entries that follow.
>>> + *     Header for the session data block. Contains the physical address of the
>>> + *     next session data block and the number of `struct luo_session_ser`
>>> + *     entries that follow this header in the current block.
>>>     *
>>>     *   - struct luo_session_ser:
>>>     *     Metadata for a single session, including its name and a physical pointer
>>> @@ -153,21 +154,23 @@ struct luo_file_set_ser {
>>>     *                          luo_session_header_ser
>>>     */
>>>    #define LUO_FDT_SESSION_NODE_NAME   "luo-session"
>>> -#define LUO_FDT_SESSION_COMPATIBLE   "luo-session-v2"
>>> +#define LUO_FDT_SESSION_COMPATIBLE   "luo-session-v3"
>>>    #define LUO_FDT_SESSION_HEADER              "luo-session-header"
>>>
>>>    /**
>>>     * struct luo_session_header_ser - Header for the serialized session data block.
>>> + * @next:  Physical address of the next struct luo_session_header_ser.
>>>     * @count: The number of `struct luo_session_ser` entries that immediately
>>>     *         follow this header in the memory block.
>>>     *
>>> - * This structure is located at the beginning of a contiguous block of
>>> + * This structure is located at the beginning of a block of
>>>     * physical memory preserved across the kexec. It provides the necessary
>>>     * metadata to interpret the array of session entries that follow.
>>>     *
>>>     * If this structure is modified, `LUO_FDT_SESSION_COMPATIBLE` must be updated.
>>>     */
>>>    struct luo_session_header_ser {
>>> +     u64 next;
>>>        u64 count;
>>>    } __packed;
>>>
>>> diff --git a/kernel/liveupdate/luo_internal.h b/kernel/liveupdate/luo_internal.h
>>> index 875844d7a41d..a73f42069301 100644
>>> --- a/kernel/liveupdate/luo_internal.h
>>> +++ b/kernel/liveupdate/luo_internal.h
>>> @@ -11,6 +11,16 @@
>>>    #include <linux/liveupdate.h>
>>>    #include <linux/uaccess.h>
>>>
>>> +/*
>>> + * Safeguard limit for the number of serialization blocks. This is used to
>>> + * prevent infinite loops and excessive memory allocation in case of memory
>>> + * corruption in the preserved state.
>>> + *
>>> + * This limit allows for ~8.1 million sessions and ~1.2 million files per
>>> + * session, which is more than enough for all realistic use cases.
>>> + */
>>> +#define LUO_MAX_BLOCKS 10000
>>> +
>>>    struct luo_ucmd {
>>>        void __user *ubuffer;
>>>        u32 user_size;
>>> @@ -59,7 +69,6 @@ struct luo_file_set {
>>>     * struct luo_session - Represents an active or incoming Live Update session.
>>>     * @name:       A unique name for this session, used for identification and
>>>     *              retrieval.
>>> - * @ser:        Pointer to the serialized data for this session.
>>>     * @list:       A list_head member used to link this session into a global list
>>>     *              of either outgoing (to be preserved) or incoming (restored from
>>>     *              previous kernel) sessions.
>>> @@ -70,7 +79,6 @@ struct luo_file_set {
>>>     */
>>>    struct luo_session {
>>>        char name[LIVEUPDATE_SESSION_NAME_LENGTH];
>>> -     struct luo_session_ser *ser;
>>>        struct list_head list;
>>>        bool retrieved;
>>>        struct luo_file_set file_set;
>>> diff --git a/kernel/liveupdate/luo_session.c b/kernel/liveupdate/luo_session.c
>>> index 92b1af791889..007ca34eba79 100644
>>> --- a/kernel/liveupdate/luo_session.c
>>> +++ b/kernel/liveupdate/luo_session.c
>>> @@ -69,30 +69,39 @@
>>>    #include <uapi/linux/liveupdate.h>
>>>    #include "luo_internal.h"
>>>
>>> -/* 16 4K pages, give space for 744 sessions */
>>> +/* 16 4K pages, give space for 819 sessions per block */
>>>    #define LUO_SESSION_PGCNT   16ul
>>> -#define LUO_SESSION_MAX              (((LUO_SESSION_PGCNT << PAGE_SHIFT) -   \
>>> +#define LUO_SESSION_BLOCK_MAX                (((LUO_SESSION_PGCNT << PAGE_SHIFT) -   \
>>>                sizeof(struct luo_session_header_ser)) /                \
>>>                sizeof(struct luo_session_ser))
>>>
>>> +/**
>>> + * struct luo_session_block - Internal representation of a session serialization block.
>>> + * @list: List head for linking blocks in memory.
>>> + * @ser:  Pointer to the serialized header in preserved memory.
>>> + */
>>> +struct luo_session_block {
>>> +     struct list_head list;
>>> +     struct luo_session_header_ser *ser;
>>> +};
>>> +
>>>    /**
>>>     * struct luo_session_header - Header struct for managing LUO sessions.
>>>     * @count:      The number of sessions currently tracked in the @list.
>>> + * @nblocks:    The number of allocated serialization blocks.
>>>     * @list:       The head of the linked list of `struct luo_session` instances.
>>>     * @rwsem:      A read-write semaphore providing synchronized access to the
>>>     *              session list and other fields in this structure.
>>> - * @header_ser: The header data of serialization array.
>>> - * @ser:        The serialized session data (an array of
>>> - *              `struct luo_session_ser`).
>>> + * @blocks:     The list of serialization blocks (struct luo_session_block).
>>>     * @active:     Set to true when first initialized. If previous kernel did not
>>>     *              send session data, active stays false for incoming.
>>>     */
>>>    struct luo_session_header {
>>>        long count;
>>> +     long nblocks;
>>>        struct list_head list;
>>>        struct rw_semaphore rwsem;
>>> -     struct luo_session_header_ser *header_ser;
>>> -     struct luo_session_ser *ser;
>>> +     struct list_head blocks;
>>>        bool active;
>>>    };
>>>
>>> @@ -110,10 +119,12 @@ static struct luo_session_global luo_session_global = {
>>>        .incoming = {
>>>                .list = LIST_HEAD_INIT(luo_session_global.incoming.list),
>>>                .rwsem = __RWSEM_INITIALIZER(luo_session_global.incoming.rwsem),
>>> +             .blocks = LIST_HEAD_INIT(luo_session_global.incoming.blocks),
>>>        },
>>>        .outgoing = {
>>>                .list = LIST_HEAD_INIT(luo_session_global.outgoing.list),
>>>                .rwsem = __RWSEM_INITIALIZER(luo_session_global.outgoing.rwsem),
>>> +             .blocks = LIST_HEAD_INIT(luo_session_global.outgoing.blocks),
>>>        },
>>>    };
>>>
>>> @@ -140,6 +151,70 @@ static void luo_session_free(struct luo_session *session)
>>>        kfree(session);
>>>    }
>>>
>>> +static int luo_session_add_block(struct luo_session_header *sh,
>>> +                              struct luo_session_header_ser *ser)
>>> +{
>>> +     struct luo_session_block *block;
>>> +
>>> +     if (sh->nblocks >= LUO_MAX_BLOCKS)
>>> +             return -ENOSPC;
>>> +
>>> +     block = kzalloc_obj(*block);
>>> +     if (!block)
>>> +             return -ENOMEM;
>>> +
>>> +     block->ser = ser;
>>> +     list_add_tail(&block->list, &sh->blocks);
>>> +     sh->nblocks++;
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static int luo_session_create_ser_block(struct luo_session_header *sh)
>>> +{
>>> +     struct luo_session_block *last = NULL;
>>> +     struct luo_session_header_ser *ser;
>>> +     int err;
>>> +
>>> +     ser = kho_alloc_preserve(LUO_SESSION_PGCNT << PAGE_SHIFT);
>>> +     if (IS_ERR(ser))
>>> +             return PTR_ERR(ser);
>>> +
>>> +     if (!list_empty(&sh->blocks))
>>> +             last = list_last_entry(&sh->blocks, struct luo_session_block, list);
>>> +
>>> +     err = luo_session_add_block(sh, ser);
>>> +     if (err)
>>> +             goto err_unpreserve;
>>> +
>>> +     if (last)
>>> +             last->ser->next = virt_to_phys(ser);
>>> +
>>> +     return 0;
>>> +
>>> +err_unpreserve:
>>> +     kho_unpreserve_free(ser);
>>> +     return err;
>>> +}
>>> +
>>> +static void luo_session_destroy_ser_blocks(struct luo_session_header *sh,
>>> +                                        bool unpreserve)
>>> +{
>>> +     struct luo_session_block *block, *tmp;
>>> +
>>> +     list_for_each_entry_safe(block, tmp, &sh->blocks, list) {
>>> +             if (block->ser) {
>>> +                     if (unpreserve)
>>> +                             kho_unpreserve_free(block->ser);
>>> +                     else
>>> +                             kho_restore_free(block->ser);
>>> +             }
>>> +             list_del(&block->list);
>>> +             kfree(block);
>>> +             sh->nblocks--;
>>> +     }
>>> +}
>>> +
>>>    static int luo_session_insert(struct luo_session_header *sh,
>>>                              struct luo_session *session)
>>>    {
>>> @@ -147,15 +222,6 @@ static int luo_session_insert(struct luo_session_header *sh,
>>>
>>>        guard(rwsem_write)(&sh->rwsem);
>>>
>>> -     /*
>>> -      * For outgoing we should make sure there is room in serialization array
>>> -      * for new session.
>>> -      */
>>> -     if (sh == &luo_session_global.outgoing) {
>>> -             if (sh->count == LUO_SESSION_MAX)
>>> -                     return -ENOMEM;
>>> -     }
>>> -
>>>        /*
>>>         * For small number of sessions this loop won't hurt performance
>>>         * but if we ever start using a lot of sessions, this might
>>> @@ -166,6 +232,20 @@ static int luo_session_insert(struct luo_session_header *sh,
>>>                if (!strncmp(it->name, session->name, sizeof(it->name)))
>>>                        return -EEXIST;
>>>        }
>>> +
>>> +     /*
>>> +      * For outgoing we should make sure there is room in serialization array
>>> +      * for new session. If not, allocate a new block.
>>> +      */
>>> +     if (sh == &luo_session_global.outgoing) {
>>> +             if (sh->count == sh->nblocks * LUO_SESSION_BLOCK_MAX) {
>> I am not sure if "LUO_SESSION_BLOCK_MAX 10000" has special meaning or not.
> LUO_SESSION_BLOCK_MAX is calculated based on the number of times
> "struct luo_session_ser" fits into a page-aligned block minus the
> luo_session_header_ser. It is not a power of two, so there is no pretty
> way to change this to avoid multiplication.
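
For reference, the per-block capacity described above can be reproduced in
userspace. This is only a sketch: the struct layouts and NAME_LEN below are
assumptions (a 64-byte name plus two u64 fields per session entry), and the
names are shortened stand-ins for the ones in the patch. Only the 16-page
block size, the two-u64 header and the 10000-block limit come from the patch
itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT_4K 12      /* assumes 4K pages, as in the patch comment */
#define SESSION_PGCNT 16UL    /* LUO_SESSION_PGCNT in the patch */
#define MAX_BLOCKS    10000UL /* LUO_MAX_BLOCKS in the patch */
#define NAME_LEN      64      /* assumed LIVEUPDATE_SESSION_NAME_LENGTH */

/* Header as changed by the patch: next pointer plus entry count. */
struct header_ser {
	uint64_t next;
	uint64_t count;
} __attribute__((packed));

/* Assumed layout of the per-session file set record. */
struct file_set_ser {
	uint64_t files;
	uint64_t count;
} __attribute__((packed));

/* Assumed layout of one serialized session entry. */
struct session_ser {
	char name[NAME_LEN];
	struct file_set_ser file_set_ser;
} __attribute__((packed));

static_assert(sizeof(struct header_ser) == 16, "header is two u64 fields");
static_assert(sizeof(struct session_ser) == 80, "assumed entry size");

/* Size of one serialization block in bytes. */
static size_t block_bytes(void)
{
	return SESSION_PGCNT << PAGE_SHIFT_4K;
}

/* Mirrors the LUO_SESSION_BLOCK_MAX computation under the assumed sizes. */
static size_t sessions_per_block(void)
{
	return (block_bytes() - sizeof(struct header_ser)) /
	       sizeof(struct session_ser);
}

/* Upper bound on sessions once the block limit is applied. */
static size_t max_sessions(void)
{
	return sessions_per_block() * MAX_BLOCKS;
}
```

Under these assumed sizes the capacity is 819 entries per block, which agrees
with the updated comment in the patch, and 10000 blocks give 8,190,000
sessions, in line with the "~8.1 million" figure in luo_internal.h.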

Maybe we can use

if (!(sh->count % LUO_SESSION_BLOCK_MASK) && sh->count > 0)

This avoids the multiplication.

Just my 2 cents.
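
To compare the two checks, here is a small userspace model. It is a sketch
under stated assumptions, not the kernel code: BLOCK_MAX is a hypothetical
capacity of 819, struct hdr and the helper names are made up for
illustration, and the modulo is taken by the block capacity rather than by a
mask. It also assumes that removing a session decrements count while nblocks
stays unchanged.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-block capacity; the patch comment mentions 819. */
#define BLOCK_MAX 819L

/* Minimal stand-in for the two counters in struct luo_session_header. */
struct hdr {
	long count;   /* sessions currently tracked */
	long nblocks; /* serialization blocks allocated */
};

/* Check used in the patch: every allocated block is full. */
static bool need_block_mul(const struct hdr *h)
{
	return h->count == h->nblocks * BLOCK_MAX;
}

/* Modulo variant: count sits exactly on a block boundary. */
static bool need_block_mod(const struct hdr *h)
{
	return h->count > 0 && (h->count % BLOCK_MAX) == 0;
}

/* Insert one session, growing the chain when the multiply check fires. */
static void insert(struct hdr *h)
{
	assert(h->nblocks >= 1); /* setup is assumed to create the first block */
	if (need_block_mul(h))
		h->nblocks++;
	h->count++;
}

/* Number of insert steps on which the two checks disagree. */
static long mismatches_insert_only(long n)
{
	struct hdr h = { .count = 0, .nblocks = 1 };
	long bad = 0;

	for (long i = 0; i < n; i++) {
		if (need_block_mul(&h) != need_block_mod(&h))
			bad++;
		insert(&h);
	}
	return bad;
}

/* Fill one block plus one entry, then remove a single session. */
static struct hdr state_after_remove(void)
{
	struct hdr h = { .count = 0, .nblocks = 1 };

	for (long i = 0; i < BLOCK_MAX + 1; i++)
		insert(&h);
	h.count--; /* assumed removal path: count drops, nblocks does not */
	return h;
}
```

In this model the two checks agree on every step of an insert-only sequence.
They diverge after a removal: with count back at 819 and two blocks
allocated, the multiply check reports that there is still room, while the
modulo check would ask for a third block.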

Zhu Yanjun

>
>> If not, we can use the following:
>>
>> #define LUO_SESSION_BLOCK_SHIFT 13
>> #define LUO_SESSION_BLOCK_MAX   (1UL << LUO_SESSION_BLOCK_SHIFT) // 8192
>> #define LUO_SESSION_BLOCK_MASK  (LUO_SESSION_BLOCK_MAX - 1)      // 0x1FFF
>>
>> if (!(sh->count & LUO_SESSION_BLOCK_MASK) && sh->count > 0) {
>> ...
>> }
>>
>> Zhu Yanjun
>>
>>> +                     int err = luo_session_create_ser_block(sh);
>>> +
>>> +                     if (err)
>>> +                             return err;
>>> +             }
>>> +     }
>>> +
>>>        list_add_tail(&session->list, &sh->list);
>>>        sh->count++;
>>>
>>> @@ -444,9 +524,12 @@ int __init luo_session_setup_outgoing(void *fdt_out)
>>>        u64 header_ser_pa;
>>>        int err;
>>>
>>> -     header_ser = kho_alloc_preserve(LUO_SESSION_PGCNT << PAGE_SHIFT);
>>> -     if (IS_ERR(header_ser))
>>> -             return PTR_ERR(header_ser);
>>> +     err = luo_session_create_ser_block(&luo_session_global.outgoing);
>>> +     if (err)
>>> +             return err;
>>> +
>>> +     header_ser = list_first_entry(&luo_session_global.outgoing.blocks,
>>> +                                   struct luo_session_block, list)->ser;
>>>        header_ser_pa = virt_to_phys(header_ser);
>>>
>>>        err = fdt_begin_node(fdt_out, LUO_FDT_SESSION_NODE_NAME);
>>> @@ -459,19 +542,18 @@ int __init luo_session_setup_outgoing(void *fdt_out)
>>>        if (err)
>>>                goto err_unpreserve;
>>>
>>> -     luo_session_global.outgoing.header_ser = header_ser;
>>> -     luo_session_global.outgoing.ser = (void *)(header_ser + 1);
>>>        luo_session_global.outgoing.active = true;
>>>
>>>        return 0;
>>>
>>>    err_unpreserve:
>>> -     kho_unpreserve_free(header_ser);
>>> +     luo_session_destroy_ser_blocks(&luo_session_global.outgoing, true);
>>>        return err;
>>>    }
>>>
>>>    int __init luo_session_setup_incoming(void *fdt_in)
>>>    {
>>> +     struct luo_session_header *sh = &luo_session_global.incoming;
>>>        struct luo_session_header_ser *header_ser;
>>>        int err, header_size, offset;
>>>        u64 header_ser_pa;
>>> @@ -501,11 +583,14 @@ int __init luo_session_setup_incoming(void *fdt_in)
>>>        }
>>>
>>>        header_ser_pa = get_unaligned((u64 *)ptr);
>>> -     header_ser = phys_to_virt(header_ser_pa);
>>> -
>>> -     luo_session_global.incoming.header_ser = header_ser;
>>> -     luo_session_global.incoming.ser = (void *)(header_ser + 1);
>>> -     luo_session_global.incoming.active = true;
>>> +     while (header_ser_pa) {
>>> +             header_ser = phys_to_virt(header_ser_pa);
>>> +             err = luo_session_add_block(sh, header_ser);
>>> +             if (err)
>>> +                     return err;
>>> +             header_ser_pa = header_ser->next;
>>> +     }
>>> +     sh->active = true;
>>>
>>>        return 0;
>>>    }
>>> @@ -513,6 +598,7 @@ int __init luo_session_setup_incoming(void *fdt_in)
>>>    int luo_session_deserialize(void)
>>>    {
>>>        struct luo_session_header *sh = &luo_session_global.incoming;
>>> +     struct luo_session_block *block;
>>>        static bool is_deserialized;
>>>        static int err;
>>>
>>> @@ -539,40 +625,49 @@ int luo_session_deserialize(void)
>>>         * userspace to detect the failure and trigger a reboot, which will
>>>         * reliably reset devices and reclaim memory.
>>>         */
>>> -     for (int i = 0; i < sh->header_ser->count; i++) {
>>> -             struct luo_session *session;
>>> -
>>> -             session = luo_session_alloc(sh->ser[i].name);
>>> -             if (IS_ERR(session)) {
>>> -                     pr_warn("Failed to allocate session [%.*s] during deserialization %pe\n",
>>> -                             (int)sizeof(sh->ser[i].name),
>>> -                             sh->ser[i].name, session);
>>> -                     err = PTR_ERR(session);
>>> -                     return err;
>>> -             }
>>> +     list_for_each_entry(block, &sh->blocks, list) {
>>> +             struct luo_session_ser *ser = (void *)(block->ser + 1);
>>>
>>> -             err = luo_session_insert(sh, session);
>>> -             if (err) {
>>> -                     pr_warn("Failed to insert session [%s] %pe\n",
>>> -                             session->name, ERR_PTR(err));
>>> -                     luo_session_free(session);
>>> +             if (block->ser->count > LUO_SESSION_BLOCK_MAX) {
>>> +                     pr_warn("Session block contains too many entries: %llu\n",
>>> +                             block->ser->count);
>>> +                     err = -EINVAL;
>>>                        return err;
>>>                }
>>>
>>> -             scoped_guard(mutex, &session->mutex) {
>>> -                     err = luo_file_deserialize(&session->file_set,
>>> -                                                &sh->ser[i].file_set_ser);
>>> -             }
>>> -             if (err) {
>>> -                     pr_warn("Failed to deserialize files for session [%s] %pe\n",
>>> -                             session->name, ERR_PTR(err));
>>> -                     return err;
>>> +             for (int i = 0; i < block->ser->count; i++) {
>>> +                     struct luo_session *session;
>>> +
>>> +                     session = luo_session_alloc(ser[i].name);
>>> +                     if (IS_ERR(session)) {
>>> +                             pr_warn("Failed to allocate session [%.*s] during deserialization %pe\n",
>>> +                                     (int)sizeof(ser[i].name),
>>> +                                     ser[i].name, session);
>>> +                             err = PTR_ERR(session);
>>> +                             return err;
>>> +                     }
>>> +
>>> +                     err = luo_session_insert(sh, session);
>>> +                     if (err) {
>>> +                             pr_warn("Failed to insert session [%s] %pe\n",
>>> +                                     session->name, ERR_PTR(err));
>>> +                             luo_session_free(session);
>>> +                             return err;
>>> +                     }
>>> +
>>> +                     scoped_guard(mutex, &session->mutex) {
>>> +                             err = luo_file_deserialize(&session->file_set,
>>> +                                                        &ser[i].file_set_ser);
>>> +                     }
>>> +                     if (err) {
>>> +                             pr_warn("Failed to deserialize files for session [%s] %pe\n",
>>> +                                     session->name, ERR_PTR(err));
>>> +                             return err;
>>> +                     }
>>>                }
>>>        }
>>>
>>> -     kho_restore_free(sh->header_ser);
>>> -     sh->header_ser = NULL;
>>> -     sh->ser = NULL;
>>> +     luo_session_destroy_ser_blocks(sh, false);
>>>
>>>        return 0;
>>>    }
>>> @@ -580,31 +675,51 @@ int luo_session_deserialize(void)
>>>    int luo_session_serialize(void)
>>>    {
>>>        struct luo_session_header *sh = &luo_session_global.outgoing;
>>> +     struct luo_session_block *block;
>>>        struct luo_session *session;
>>> +     struct luo_session_ser *ser;
>>>        int i = 0;
>>>        int err;
>>>
>>>        guard(rwsem_write)(&sh->rwsem);
>>> +
>>> +     if (list_empty(&sh->blocks))
>>> +             return 0;
>>> +
>>> +     block = list_first_entry(&sh->blocks, struct luo_session_block, list);
>>> +     ser = (void *)(block->ser + 1);
>>> +
>>>        list_for_each_entry(session, &sh->list, list) {
>>> -             err = luo_session_freeze_one(session, &sh->ser[i]);
>>> +             if (i == LUO_SESSION_BLOCK_MAX) {
>>> +                     block->ser->count = i;
>>> +                     block = list_next_entry(block, list);
>>> +                     ser = (void *)(block->ser + 1);
>>> +                     i = 0;
>>> +             }
>>> +
>>> +             err = luo_session_freeze_one(session, &ser[i]);
>>>                if (err)
>>>                        goto err_undo;
>>>
>>> -             strscpy(sh->ser[i].name, session->name,
>>> -                     sizeof(sh->ser[i].name));
>>> +             strscpy(ser[i].name, session->name,
>>> +                     sizeof(ser[i].name));
>>>                i++;
>>>        }
>>> -     sh->header_ser->count = sh->count;
>>> +     block->ser->count = i;
>>>
>>>        return 0;
>>>
>>>    err_undo:
>>>        list_for_each_entry_continue_reverse(session, &sh->list, list) {
>>>                i--;
>>> -             luo_session_unfreeze_one(session, &sh->ser[i]);
>>> -             memset(sh->ser[i].name, 0, sizeof(sh->ser[i].name));
>>> +             if (i < 0) {
>>> +                     block = list_prev_entry(block, list);
>>> +                     ser = (void *)(block->ser + 1);
>>> +                     i = LUO_SESSION_BLOCK_MAX - 1;
>>> +             }
>>> +             luo_session_unfreeze_one(session, &ser[i]);
>>> +             memset(ser[i].name, 0, sizeof(ser[i].name));
>>>        }
>>>
>>>        return err;
>>>    }
>>> -
>>
-- 
Best Regards,
Yanjun.Zhu




* Re: [RFC PATCH] slub: spill refill leftover objects into percpu sheaves
From: Hao Li @ 2026-04-20  3:18 UTC (permalink / raw)
  To: Vinicius Costa Gomes
  Cc: vbabka, harry, akpm, cl, rientjes, roman.gushchin, linux-mm,
	linux-kernel
In-Reply-To: <87a4v47xk5.fsf@intel.com>

On Wed, Apr 15, 2026 at 01:55:54PM -0700, Vinicius Costa Gomes wrote:
> I was also looking at these regressions, but I went from a different
> direction, and ended up with 3 patches:
> 
> 1. the regressions showed a lot of increase in the cache misses,
>    which gave me the idea that a cache would help (and it seemed to help)
> 
> 2. Allowing smaller refills (but potentially more frequent);
> 
> 3. A cute (but with small impact) use of prefetch();
> 
> The numbers are here (the commentary from the bot are very hit or miss,
> so don't pay too much attention to them):
> 
> https://github.com/vcgomes/linux/commit/c898c39ee8def5252942281353eda6acdd83d4ea
> 
> I am re-running the tests against a more recent tree, but if you
> want to take a look:
> 
> https://github.com/vcgomes/linux/tree/mm-sheaves-regression-timerfd
> 
> Also, if you feel it's useful, I can send a RFC.
> 

Hi Vinicius,

I tested the three patches in your GitHub repository. Under a 96-process stress
workload, mmap2 achieved about a 3% performance improvement.

For the slub stats, I observed some differences. Here are the results:

(baseline vs 3 patches)
alloc_fastpath +4.6%
alloc_slowpath +0% (no slowpath hits in either test)
free_fastpath +0%
free_slowpath +189%
alloc_slab +247%
free_slab +247%
barn_get +8%
barn_put +8%
barn_get_fail +16%
barn_put_fail +0%
free_add_partial -37%
free_remove_partial +247%
sheaf_refill +3.88%

One thing that seems consistent with my approach is the churn in
alloc_slab and free_slab. My impression is that this may be a common issue with
solutions designed at the per-CPU level...

-- 
Thanks,
Hao



* [PATCH 7.2 v9 2/2] x86/tlb: skip redundant sync IPIs for native TLB flush
From: Lance Yang @ 2026-04-20  3:08 UTC (permalink / raw)
  To: akpm
  Cc: peterz, david, dave.hansen, dave.hansen, ypodemsk, hughd, will,
	aneesh.kumar, npiggin, tglx, mingo, bp, x86, hpa, arnd, ljs, ziy,
	baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua,
	shy828301, riel, jannh, jgross, seanjc, pbonzini, boris.ostrovsky,
	virtualization, kvm, linux-arch, linux-mm, linux-kernel,
	ioworker0, Lance Yang
In-Reply-To: <20260420030851.6735-1-lance.yang@linux.dev>

From: Lance Yang <lance.yang@linux.dev>

Some page table operations need to synchronize with software/lockless
walkers after a TLB flush by calling tlb_remove_table_sync_{one,rcu}().
On x86, that extra synchronization is redundant when the preceding TLB
flush already broadcast IPIs to all relevant CPUs.

native_pv_tlb_init() checks whether native_flush_tlb_multi() is in use.
On CONFIG_PARAVIRT systems, it checks pv_ops; on non-PARAVIRT, native
flush is always in use.

It decides once at boot whether to enable the optimization: if using
native TLB flush and INVLPGB is not supported, we know IPIs were sent
and can skip the redundant sync. The decision is fixed via a static
key as Peter suggested[1].

PV backends (KVM, Xen, Hyper-V) typically have their own implementations
and don't call native_flush_tlb_multi() directly, so they cannot be trusted
to provide the IPI guarantees we need.
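
For illustration only, the boot-time decision described above can be
modelled in plain userspace C. This is a sketch, not the kernel code:
struct model_boot_facts and model_flush_implies_ipis() are hypothetical
names, and the two booleans stand in for the pv_ops comparison and the
X86_FEATURE_INVLPGB check.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the facts consulted once at boot. */
struct model_boot_facts {
	bool pv_flush_overridden; /* a PV backend replaced the native flush */
	bool has_invlpgb;         /* broadcast invalidation without IPIs */
};

/*
 * Mirrors the order of checks in the changelog: only a native flush on
 * hardware without INVLPGB is known to have sent IPIs.
 */
static bool model_flush_implies_ipis(const struct model_boot_facts *f)
{
	if (f->pv_flush_overridden)
		return false; /* PV backends are not trusted to send IPIs */
	if (f->has_invlpgb)
		return false; /* flush may complete without any IPI */
	return true;
}
```

In the actual patch the result is latched into a static key, so the
check costs a patched branch rather than a function call; the sketch
only shows the decision itself.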

Also treat unshared_tables like freed_tables when issuing the TLB flush,
so lazy-TLB CPUs receive IPIs during unsharing of page tables as well.
This allows us to safely implement tlb_table_flush_implies_ipi_broadcast().

Two-step plan as David suggested[2]:

Step 1 (this patch): Skip redundant sync when we're 100% certain the TLB
flush sent IPIs. INVLPGB is excluded because when supported, we cannot
guarantee IPIs were sent, keeping it clean and simple.

Step 2 (future work): Send targeted IPIs only to CPUs actually doing
software/lockless page table walks, benefiting all architectures.

Regarding Step 2, it obviously only applies to setups where Step 1 does
not apply: like x86 with INVLPGB or arm64.

[1] https://lore.kernel.org/linux-mm/20260302145652.GH1395266@noisy.programming.kicks-ass.net/
[2] https://lore.kernel.org/linux-mm/bbfdf226-4660-4949-b17b-0d209ee4ef8c@kernel.org/

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: David Hildenbrand (Arm) <david@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Signed-off-by: Lance Yang <lance.yang@linux.dev>
---
 arch/x86/include/asm/tlb.h      | 18 +++++++++++++++++-
 arch/x86/include/asm/tlbflush.h |  2 ++
 arch/x86/kernel/smpboot.c       |  1 +
 arch/x86/mm/tlb.c               | 15 +++++++++++++++
 4 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 866ea78ba156..fc586ec8e768 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -5,11 +5,21 @@
 #define tlb_flush tlb_flush
 static inline void tlb_flush(struct mmu_gather *tlb);
 
+#define tlb_table_flush_implies_ipi_broadcast tlb_table_flush_implies_ipi_broadcast
+static inline bool tlb_table_flush_implies_ipi_broadcast(void);
+
 #include <asm-generic/tlb.h>
 #include <linux/kernel.h>
 #include <vdso/bits.h>
 #include <vdso/page.h>
 
+DECLARE_STATIC_KEY_FALSE(tlb_ipi_broadcast_key);
+
+static inline bool tlb_table_flush_implies_ipi_broadcast(void)
+{
+	return static_branch_likely(&tlb_ipi_broadcast_key);
+}
+
 static inline void tlb_flush(struct mmu_gather *tlb)
 {
 	unsigned long start = 0UL, end = TLB_FLUSH_ALL;
@@ -20,7 +30,13 @@ static inline void tlb_flush(struct mmu_gather *tlb)
 		end = tlb->end;
 	}
 
-	flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb->freed_tables);
+	/*
+	 * Treat unshared_tables just like freed_tables, such that lazy-TLB
+	 * CPUs also receive IPIs during unsharing of page tables, allowing
+	 * us to safely implement tlb_table_flush_implies_ipi_broadcast().
+	 */
+	flush_tlb_mm_range(tlb->mm, start, end, stride_shift,
+			   tlb->freed_tables || tlb->unshared_tables);
 }
 
 static inline void invlpg(unsigned long addr)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 5a3cdc439e38..8ba853154b46 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -18,6 +18,8 @@
 
 DECLARE_PER_CPU(u64, tlbstate_untag_mask);
 
+void __init native_pv_tlb_init(void);
+
 void __flush_tlb_all(void);
 
 #define TLB_FLUSH_ALL	-1UL
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 294a8ea60298..df776b645a9c 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1256,6 +1256,7 @@ void __init native_smp_prepare_boot_cpu(void)
 		switch_gdt_and_percpu_base(me);
 
 	native_pv_lock_init();
+	native_pv_tlb_init();
 }
 
 void __init native_smp_cpus_done(unsigned int max_cpus)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 621e09d049cb..8f5585ebaf09 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -26,6 +26,8 @@
 
 #include "mm_internal.h"
 
+DEFINE_STATIC_KEY_FALSE(tlb_ipi_broadcast_key);
+
 #ifdef CONFIG_PARAVIRT
 # define STATIC_NOPV
 #else
@@ -1834,3 +1836,16 @@ static int __init create_tlb_single_page_flush_ceiling(void)
 	return 0;
 }
 late_initcall(create_tlb_single_page_flush_ceiling);
+
+void __init native_pv_tlb_init(void)
+{
+#ifdef CONFIG_PARAVIRT
+	if (pv_ops.mmu.flush_tlb_multi != native_flush_tlb_multi)
+		return;
+#endif
+
+	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		return;
+
+	static_branch_enable(&tlb_ipi_broadcast_key);
+}
-- 
2.49.0



^ permalink raw reply related

* [PATCH 7.2 v9 1/2] mm/mmu_gather: prepare to skip redundant sync IPIs
From: Lance Yang @ 2026-04-20  3:08 UTC (permalink / raw)
  To: akpm
  Cc: peterz, david, dave.hansen, dave.hansen, ypodemsk, hughd, will,
	aneesh.kumar, npiggin, tglx, mingo, bp, x86, hpa, arnd, ljs, ziy,
	baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua,
	shy828301, riel, jannh, jgross, seanjc, pbonzini, boris.ostrovsky,
	virtualization, kvm, linux-arch, linux-mm, linux-kernel,
	ioworker0, Lance Yang
In-Reply-To: <20260420030851.6735-1-lance.yang@linux.dev>

From: Lance Yang <lance.yang@linux.dev>

When page table operations require synchronization with software/lockless
walkers, they call tlb_remove_table_sync_{one,rcu}() after flushing the
TLB (tlb->freed_tables or tlb->unshared_tables).

On architectures where the TLB flush already sends IPIs to all target CPUs,
the subsequent sync IPI broadcast is redundant. This is not only costly on
large systems where it disrupts all CPUs even for single-process page table
operations, but has also been reported to hurt RT workloads[1].

Introduce tlb_table_flush_implies_ipi_broadcast() to check if the prior TLB
flush already provided the necessary synchronization. When true, the sync
calls can early-return.

A few cases rely on this synchronization:

1) hugetlb PMD unshare[2]: The problem is not the freeing but the reuse
   of the PMD table for other purposes in the last remaining user after
   unsharing.

2) khugepaged collapse[3]: Ensure no concurrent GUP-fast before collapsing
   and (possibly) freeing the page table / re-depositing it.

Currently always returns false (no behavior change). The follow-up patch
will enable the optimization for x86.
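
As a rough userspace sketch of the override pattern used here (a
generic default that an architecture header can replace by defining a
macro of the same name), assuming hypothetical model_* names and a
counter in place of smp_call_function():

```c
#include <stdbool.h>

/*
 * An "architecture header" would opt in before this point with:
 *
 *   #define model_flush_implies_ipi_broadcast model_flush_implies_ipi_broadcast
 *   static inline bool model_flush_implies_ipi_broadcast(void) { return true; }
 */

/* "Generic header": supply the default only if nobody overrode it. */
#ifndef model_flush_implies_ipi_broadcast
static inline bool model_flush_implies_ipi_broadcast(void)
{
	return false;
}
#endif

static int model_sync_broadcasts; /* counts modelled sync IPI broadcasts */

/* Models the sync helper: return early when the flush already did the job. */
static void model_sync_one(void)
{
	if (model_flush_implies_ipi_broadcast())
		return;
	model_sync_broadcasts++; /* stands in for the real IPI broadcast */
}
```

With the default in place, every call to model_sync_one() still counts
one broadcast, which matches the "no behavior change" claim for this
patch on its own.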

[1] https://lore.kernel.org/linux-mm/1b27a3fa-359a-43d0-bdeb-c31341749367@kernel.org/
[2] https://lore.kernel.org/linux-mm/6a364356-5fea-4a6c-b959-ba3b22ce9c88@kernel.org/
[3] https://lore.kernel.org/linux-mm/2cb4503d-3a3f-4f6c-8038-7b3d1c74b3c2@kernel.org/

Suggested-by: David Hildenbrand (Arm) <david@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Signed-off-by: Lance Yang <lance.yang@linux.dev>
---
 include/asm-generic/tlb.h | 17 +++++++++++++++++
 mm/mmu_gather.c           | 15 +++++++++++++++
 2 files changed, 32 insertions(+)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index bdcc2778ac64..cb41cc6a0024 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -240,6 +240,23 @@ static inline void tlb_remove_table(struct mmu_gather *tlb, void *table)
 }
 #endif /* CONFIG_MMU_GATHER_TABLE_FREE */
 
+/**
+ * tlb_table_flush_implies_ipi_broadcast - does TLB flush imply IPI sync
+ *
+ * When page table operations require synchronization with software/lockless
+ * walkers, they flush the TLB (tlb->freed_tables or tlb->unshared_tables)
+ * then call tlb_remove_table_sync_{one,rcu}(). If the flush already sent
+ * IPIs to all CPUs, the sync call is redundant.
+ *
+ * Returns false by default. Architectures can override by defining this.
+ */
+#ifndef tlb_table_flush_implies_ipi_broadcast
+static inline bool tlb_table_flush_implies_ipi_broadcast(void)
+{
+	return false;
+}
+#endif
+
 #ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
 /*
  * This allows an architecture that does not use the linux page-tables for
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 3985d856de7f..37a6a711c37e 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -283,6 +283,14 @@ void tlb_remove_table_sync_one(void)
 	 * It is however sufficient for software page-table walkers that rely on
 	 * IRQ disabling.
 	 */
+
+	/*
+	 * Skip IPI if the preceding TLB flush already synchronized with
+	 * all CPUs that could be doing software/lockless page table walks.
+	 */
+	if (tlb_table_flush_implies_ipi_broadcast())
+		return;
+
 	smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
 }
 
@@ -312,6 +320,13 @@ static void tlb_remove_table_free(struct mmu_table_batch *batch)
  */
 void tlb_remove_table_sync_rcu(void)
 {
+	/*
+	 * Skip RCU wait if the preceding TLB flush already synchronized
+	 * with all CPUs that could be doing software/lockless page table walks.
+	 */
+	if (tlb_table_flush_implies_ipi_broadcast())
+		return;
+
 	synchronize_rcu();
 }
 
-- 
2.49.0



^ permalink raw reply related

* [PATCH 7.2 v9 0/2] skip redundant sync IPIs when TLB flush sent them
From: Lance Yang @ 2026-04-20  3:08 UTC (permalink / raw)
  To: akpm
  Cc: peterz, david, dave.hansen, dave.hansen, ypodemsk, hughd, will,
	aneesh.kumar, npiggin, tglx, mingo, bp, x86, hpa, arnd, ljs, ziy,
	baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua,
	shy828301, riel, jannh, jgross, seanjc, pbonzini, boris.ostrovsky,
	virtualization, kvm, linux-arch, linux-mm, linux-kernel,
	ioworker0

Hi all,

When page table operations require synchronization with software/lockless
walkers, they call tlb_remove_table_sync_{one,rcu}() after flushing the
TLB (tlb->freed_tables or tlb->unshared_tables).

On architectures where the TLB flush already sends IPIs to all target CPUs,
the subsequent sync IPI broadcast is redundant. This is not only costly on
large systems where it disrupts all CPUs even for single-process page table
operations, but has also been reported to hurt RT workloads[1].

This series introduces tlb_table_flush_implies_ipi_broadcast() to check if
the prior TLB flush already provided the necessary synchronization. When
true, the sync calls can early-return.

A few cases rely on this synchronization:

1) hugetlb PMD unshare[2]: The problem is not the freeing but the reuse
   of the PMD table for other purposes in the last remaining user after
   unsharing.

2) khugepaged collapse[3]: Ensure no concurrent GUP-fast before collapsing
   and (possibly) freeing the page table / re-depositing it.

Two-step plan as David suggested[4]:

Step 1 (this series): Skip redundant sync when we're 100% certain the TLB
flush sent IPIs. INVLPGB is excluded because when supported, we cannot
guarantee IPIs were sent, keeping it clean and simple.

Step 2 (future work): Send targeted IPIs only to CPUs actually doing
software/lockless page table walks, benefiting all architectures.

Regarding Step 2, it obviously only applies to setups where Step 1 does not
apply: like x86 with INVLPGB or arm64. Step 2 work is ongoing; early
attempts showed ~3% GUP-fast overhead. Reducing the overhead requires more
work and tuning; it will be submitted separately once ready.
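
To make the Step 2 idea concrete, here is a toy userspace model of
targeted IPIs. It is only a sketch under assumptions: the flag array,
MODEL_NR_CPUS and the model_* helpers are hypothetical, and real code
would need per-CPU storage and memory ordering that are ignored here.

```c
#include <stdbool.h>

#define MODEL_NR_CPUS 8 /* hypothetical CPU count for the sketch */

/* Set while a CPU is inside a lockless page table walk (e.g. GUP-fast). */
static bool model_in_walk[MODEL_NR_CPUS];

static void model_walk_begin(int cpu) { model_in_walk[cpu] = true; }
static void model_walk_end(int cpu)   { model_in_walk[cpu] = false; }

/* Number of CPUs other than 'self' that a targeted sync would interrupt. */
static int model_targeted_sync(int self)
{
	int cpu, ipis = 0;

	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		if (cpu != self && model_in_walk[cpu])
			ipis++; /* stands in for sending one IPI */
	}
	return ipis;
}

/* A broadcast sync interrupts every CPU except the caller. */
static int model_broadcast_sync(void)
{
	return MODEL_NR_CPUS - 1;
}
```

The stores in model_walk_begin()/model_walk_end() sit on the walker's
fast path, which is presumably where the GUP-fast overhead mentioned
above comes from.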

On a 64-core Intel x86 server, the CAL interrupt count in
/proc/interrupts dropped from 646,316 to 785 when collapsing a 20 GiB
range with this series applied.
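
Put differently, that is roughly 99.88% fewer CAL interrupts, or about
823 times fewer. A small sketch of the arithmetic, using only the two
counts quoted above (the function names are made up for illustration):

```c
/* CAL interrupt counts quoted in the cover letter. */
static const double model_cal_before = 646316.0;
static const double model_cal_after  = 785.0;

/* Percentage of function-call interrupts eliminated. */
static double model_cal_reduction_percent(void)
{
	return 100.0 * (model_cal_before - model_cal_after) / model_cal_before;
}

/* How many times fewer interrupts were observed. */
static double model_cal_reduction_factor(void)
{
	return model_cal_before / model_cal_after;
}
```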

David Hildenbrand did the initial implementation. I built on his work and
relied on off-list discussions to push it further - thanks a lot David!

[1] https://lore.kernel.org/linux-mm/1b27a3fa-359a-43d0-bdeb-c31341749367@kernel.org/
[2] https://lore.kernel.org/linux-mm/6a364356-5fea-4a6c-b959-ba3b22ce9c88@kernel.org/
[3] https://lore.kernel.org/linux-mm/2cb4503d-3a3f-4f6c-8038-7b3d1c74b3c2@kernel.org/
[4] https://lore.kernel.org/linux-mm/bbfdf226-4660-4949-b17b-0d209ee4ef8c@kernel.org/

v8 -> v9:
- Rebase on mm-new; re-tested, no code changes.
- https://lore.kernel.org/linux-mm/20260324085238.44477-1-lance.yang@linux.dev/

v7 -> v8:
- Pick up Acked-by tags from David, thanks!
- Add CAL interrupt numbers to the cover letter (per Andrew, thanks!)
- Rewrite the [2/2] changelog and reword the comment (per David, thanks!)
- https://lore.kernel.org/linux-mm/20260309020711.20831-1-lance.yang@linux.dev/

v6 -> v7:
- Simplify init logic and eliminate duplicated X86_FEATURE_INVLPGB checks
  (per Dave, thanks!)
- Remove flush_tlb_multi_implies_ipi_broadcast property because no PV
  backend sets it today.
- https://lore.kernel.org/linux-mm/20260304021046.18550-1-lance.yang@linux.dev/

v5 -> v6:
- Use static_branch to eliminate the branch overhead (per Peter, thanks!)
- https://lore.kernel.org/linux-mm/20260302063048.9479-1-lance.yang@linux.dev/

v4 -> v5:
- Drop per-CPU tracking (active_lockless_pt_walk_mm) from this series;
  defer to Step 2 as it adds ~3% GUP-fast overhead
- Keep pv_ops property false for PV backends like KVM: preempted vCPUs
  cannot be assumed safe (per Sean, thanks!)
  https://lore.kernel.org/linux-mm/aaCP95l-m8ISXF78@google.com/
- https://lore.kernel.org/linux-mm/20260202074557.16544-1-lance.yang@linux.dev/ 

v3 -> v4:
- Rework based on David's two-step direction and per-CPU idea:
  1) Targeted IPIs: per-CPU variable when entering/leaving lockless page
     table walk; tlb_remove_table_sync_mm() IPIs only those CPUs.
  2) On x86, pv_mmu_ops property set at init to skip the extra sync when
     flush_tlb_multi() already sends IPIs.
  https://lore.kernel.org/linux-mm/bbfdf226-4660-4949-b17b-0d209ee4ef8c@kernel.org/
- https://lore.kernel.org/linux-mm/20260106120303.38124-1-lance.yang@linux.dev/

v2 -> v3:
- Complete rewrite: use dynamic IPI tracking instead of static checks
  (per Dave Hansen, thanks!)
- Track IPIs via mmu_gather: native_flush_tlb_multi() sets flag when
  actually sending IPIs
- Motivation for skipping redundant IPIs explained by David:
  https://lore.kernel.org/linux-mm/1b27a3fa-359a-43d0-bdeb-c31341749367@kernel.org/
- https://lore.kernel.org/linux-mm/20251229145245.85452-1-lance.yang@linux.dev/

v1 -> v2:
- Fix cover letter encoding to resolve send-email issues. Apologies for
  any email flood caused by the failed send attempts :(

RFC -> v1:
- Use a callback function in pv_mmu_ops instead of comparing function
  pointers (per David)
- Embed the check directly in tlb_remove_table_sync_one() instead of
  requiring every caller to check explicitly (per David)
- Move tlb_table_flush_implies_ipi_broadcast() outside of
  CONFIG_MMU_GATHER_RCU_TABLE_FREE to fix build error on architectures
  that don't enable this config.
  https://lore.kernel.org/oe-kbuild-all/202512142156.cShiu6PU-lkp@intel.com/
- https://lore.kernel.org/linux-mm/20251213080038.10917-1-lance.yang@linux.dev/

Lance Yang (2):
  mm/mmu_gather: prepare to skip redundant sync IPIs
  x86/tlb: skip redundant sync IPIs for native TLB flush

 arch/x86/include/asm/tlb.h      | 18 +++++++++++++++++-
 arch/x86/include/asm/tlbflush.h |  2 ++
 arch/x86/kernel/smpboot.c       |  1 +
 arch/x86/mm/tlb.c               | 15 +++++++++++++++
 include/asm-generic/tlb.h       | 17 +++++++++++++++++
 mm/mmu_gather.c                 | 15 +++++++++++++++
 6 files changed, 67 insertions(+), 1 deletion(-)

-- 
2.49.0

^ permalink raw reply

* Re: [LSF/MM/BPF TOPIC][RFC PATCH v4 00/27] Private Memory Nodes (w/ Compressed RAM)
From: Gregory Price @ 2026-04-20  2:56 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: lsf-pc, linux-kernel, linux-cxl, cgroups, linux-mm,
	linux-trace-kernel, damon, kernel-team, gregkh, rafael, dakr,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, ira.weiny, dan.j.williams, longman, akpm,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	osalvador, ziy, matthew.brost, joshua.hahnjy, rakie.kim,
	byungchul, ying.huang, apopple, axelrasmussen, yuanchu, weixugc,
	yury.norov, linux, mhiramat, mathieu.desnoyers, tj, hannes,
	mkoutny, jackmanb, sj, baolin.wang, npache, ryan.roberts,
	dev.jain, baohua, lance.yang, muchun.song, xu.xin16,
	chengming.zhou, jannh, linmiaohe, nao.horiguchi, pfalcato,
	rientjes, shakeel.butt, riel, harry.yoo, cl, roman.gushchin,
	chrisl, kasong, shikemeng, nphamcs, bhe, zhengqi.arch,
	terry.bowman
In-Reply-To: <46837cea-5d90-49d8-be67-7306e0e89aa3@kernel.org>

On Fri, Apr 17, 2026 at 11:37:36AM +0200, David Hildenbrand (Arm) wrote:
> On 4/15/26 17:17, Gregory Price wrote:
> 
> >> Needs a second thought regarding fallback logic I raised above.
> >>
> >> What I think would have to be audited is the usage of __GFP_THISNODE by
> >> kernel allocations, where we would not actually want to allocate from
> >> this private node.
> >>
> > 
> > This is fair, and a re-visit is absolutely warranted.
> > 
> > Re-examining the quick audit from my last response suggests - I should
> > never have seen leakage in those cases, but the fallbacks are needed.
> > 
> > So yes, this all requires a second look (and a third, and a ninth).
> > 
> > I'm not married to __GFP_PRIVATE, but it has been reliable for me.
> 
> Yes, we should carefully describe which semantics we want to achieve, to
> then figure out how we could achieve them.
> 

Ah, I finally dug up my notes on this.

If we overload __GFP_THISNODE - then we have to audit all gfp_mask's
with THISNODE against the use of any of the following *forever*:

#define node_online_map         node_states[N_ONLINE]
#define node_possible_map       node_states[N_POSSIBLE]
#define for_each_node(node)        for_each_node_state(node, N_POSSIBLE)
#define for_each_online_node(node) for_each_node_state(node, N_ONLINE)

  or

cgroup.cpuset.mems_allowed / mems_effective


Anyone that attempts to do:

    for_each_online_node(node):
        buf = alloc_pages_node(node, __GFP_THISNODE, NULL)

*will* get incidental access to private node memory, and it won't be
obvious to existing tooling that this should be considered a bug.


rate of occurrence in the current code:
-----------------
node_online_map       -  21 instances
node_possible_map     -  25 instances
for_each_node         -  346 instances
for_each_online_node  -  67 instances
GFP_THISNODE          -  58 instances
(notes don't have mems_allowed/mems_effective instances)


But it's not always going to be obvious - since nodemasks and gfp_masks
get passed around as variables all throughout the kernel.

I ultimately determined that auditing this in-tree is already a fool's
errand - and suggesting we try to validate this never occurs for all
future code moving forward is just not realistic in any sense.

I could not come up with a way to remove private nodes from
node_online/possible_map - and private nodes must be added to
cpuset.mems_allowed to allow cpuset control (otherwise all userland
access is blanket denied).

So I moved back to __GFP_PRIVATE.

=== TL;DR:

The core premise of private nodes is isolation first.

So we want this code:

   for node in cpuset.mems_allowed / online_map
       buf = alloc_pages_node(node, __GFP_THISNODE, NULL)

To explicitly fail - so that the caller knows they can't use these
masks this way anymore (it was already potentially a bug, but could
have been masked if all online nodes had memory).
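
A toy userspace model of that opt-in behaviour, as a sketch only: the
flag values, struct model_node and the model_* helpers are hypothetical
and do not correspond to the real page allocator API.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_GFP_THISNODE 0x1u /* hypothetical bit values for the sketch */
#define MODEL_GFP_PRIVATE  0x2u

struct model_node {
	bool online;
	bool private_node; /* memory that is meant to stay isolated */
};

static char model_page; /* any non-NULL address stands for a page */

/* Opt-in policy: a private node only serves callers passing MODEL_GFP_PRIVATE. */
static void *model_alloc_on_node(const struct model_node *n, unsigned int gfp)
{
	if (!n->online)
		return NULL;
	if (n->private_node && !(gfp & MODEL_GFP_PRIVATE))
		return NULL; /* explicit failure instead of silent leakage */
	return &model_page;
}

/* Counts the nodes a naive "for each online node" loop allocates from. */
static int model_naive_loop(const struct model_node *nodes, size_t nr,
			    unsigned int gfp)
{
	size_t i;
	int served = 0;

	for (i = 0; i < nr; i++)
		if (nodes[i].online && model_alloc_on_node(&nodes[i], gfp))
			served++;
	return served;
}
```

In this model a loop that passes only MODEL_GFP_THISNODE is refused by
every private node, so the caller sees the failure instead of quietly
consuming private memory.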

~Gregory


^ permalink raw reply

* Re: [PATCH 0/3] mm: split the file's i_mmap tree for NUMA
From: Huang Shijie @ 2026-04-20  2:10 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: akpm, viro, brauner, linux-mm, linux-kernel, linux-arm-kernel,
	linux-fsdevel, muchun.song, osalvador, linux-trace-kernel,
	linux-perf-users, linux-parisc, nvdimm, zhongyuan, fangbaoshun,
	yingzhiwei
In-Reply-To: <76pfiwabdgsej6q2yxfh3efuqvsyg7mt7rvl5itzzjyhdrto5r@53viaxsackzv>

On Mon, Apr 13, 2026 at 05:33:21PM +0200, Mateusz Guzik wrote:
> On Mon, Apr 13, 2026 at 02:20:39PM +0800, Huang Shijie wrote:
> >   A NUMA system may have many NUMA nodes and many CPUs.
> > For example, a Hygon server has 12 NUMA nodes and 384 CPUs.
> > In the UnixBench tests, there is a test "execl" which tests
> > the execve system call.
> > 
> >   When we test our server with "./Run -c 384 execl",
> > the test result is not good enough. The i_mmap locks are heavily contended on
> > "libc.so" and "ld.so". For example, the i_mmap tree for "libc.so" can have
> > over 6000 VMAs, and the VMAs can be on different NUMA nodes.
> > The insert/remove operations do not run quickly enough.
> > 
> > Patch 1 and patch 2 hide the direct accesses to i_mmap.
> > Patch 3 splits the i_mmap into sibling trees, and we get better
> > performance with this patch set:
> >     a 77% performance improvement (average of 10 runs)
> > 
> 
> To my reading you kept the lock as-is and only distributed the protected
> state.
> 
> While I don't doubt the improvement, I'm confident should you take a
> look at the profile you are going to find this still does not scale with
> rwsem being one of the problems (there are other global locks, some of
> which have experimental patches for).
> 
> Apart from that this does nothing to help high core systems which are
> all one node, which imo puts another question mark on this specific
> proposal.
> 
> Of course one may question whether a RB tree is the right choice here,
> it may be the lock-protected cost can go way down with merely a better
> data structure.
> 
> Regardless of that, for actual scalability, there will be no way around
> decentralizing locking around this and partitioning per some core count
> (not just by numa awareness).
> 
> Decentralizing locking is definitely possible, but I have not looked
> into specifics of how problematic it is. Best case scenario it will
> merely work with separate locks. Worst case scenario something needs a fully
> stabilized state for traversal, in that case another rw lock can be
> slapped around this, creating locking order read lock -> per-subset
> write lock -- this will suffer scalability due to the read locking, but
> it will still scale drastically better as apart from that there will be
> no serialization. In this setting the problematic consumer will write
> lock the new thing to stabilize the state.
> 
I thought it over again.
I can change this patch set to support the non-NUMA case by:
  1.) Still using one rw lock.
  2.) For NUMA, keeping the patch set as it is.
  3.) For the non-NUMA case, splitting the i_mmap tree into several subtrees.
      For example, if a machine has 192 CPUs, use one subtree per 32 CPUs.

This extends the patch set to support both NUMA and non-NUMA machines.
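
A minimal sketch of the grouping in the 192-CPU example, assuming
subtrees are selected by CPU number (the email does not say how VMAs
would be assigned, so the mapping and the model_* names below are
assumptions):

```c
#define MODEL_CPUS_PER_TREE 32u /* group size from the example above */

/* Number of subtrees needed to cover nr_cpus, rounding up. */
static unsigned int model_nr_subtrees(unsigned int nr_cpus)
{
	return (nr_cpus + MODEL_CPUS_PER_TREE - 1) / MODEL_CPUS_PER_TREE;
}

/* Subtree that a given CPU would insert into or remove from. */
static unsigned int model_cpu_to_subtree(unsigned int cpu)
{
	return cpu / MODEL_CPUS_PER_TREE;
}
```

For 192 CPUs this gives 6 subtrees, with CPUs 0-31 sharing subtree 0
and CPUs 160-191 sharing subtree 5.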

Thanks
Huang Shijie



^ permalink raw reply

* Re: [RFC 1/2] mm: page_alloc: replace pageblock_flags bitmap with struct pageblock_data
From: Zi Yan @ 2026-04-20  1:40 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Vlastimil Babka, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Rik van Riel, linux-kernel, Johannes Weiner
In-Reply-To: <20260403194526.477775-2-hannes@cmpxchg.org>

On 3 Apr 2026, at 15:40, Johannes Weiner wrote:

> From: Johannes Weiner <jweiner@meta.com>
>
> Replace the packed pageblock_flags bitmap with a per-pageblock struct
> containing its own flags word. This changes the storage from
> NR_PAGEBLOCK_BITS bits per pageblock packed into shared unsigned longs,
> to a dedicated unsigned long per pageblock.
>
> The free path looks up migratetype (from pageblock flags) immediately
> followed by looking up pageblock ownership. Colocating them in a struct
> means this hot path touches one cache line instead of two.
>
> The per-pageblock struct also eliminates all the bit-packing indexing
> (pfn_to_bitidx, word selection, intra-word shifts), simplifying the
> accessor code.
>
> Memory overhead: 8 bytes per pageblock (one unsigned long). With 2MB
> pageblocks on x86_64, that's 4KB per GB -- up from ~0.5-1 bytes per
> pageblock with the packed bitmap, but still negligible in absolute terms.
>
> No functional change.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/mmzone.h | 15 ++++----
>  mm/internal.h          | 17 +++++++++
>  mm/mm_init.c           | 25 ++++++-------
>  mm/page_alloc.c        | 81 ++++++------------------------------------
>  mm/sparse.c            |  3 +-
>  5 files changed, 48 insertions(+), 93 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 3e51190a55e4..2f202bda5ec6 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -916,7 +916,7 @@ struct zone {
>  	 * Flags for a pageblock_nr_pages block. See pageblock-flags.h.
>  	 * In SPARSEMEM, this map is stored in struct mem_section
>  	 */
> -	unsigned long		*pageblock_flags;
> +	struct pageblock_data	*pageblock_data;
>  #endif /* CONFIG_SPARSEMEM */
>
>  	/* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
> @@ -1866,9 +1866,6 @@ static inline bool movable_only_nodes(nodemask_t *nodes)
>  #define PAGES_PER_SECTION       (1UL << PFN_SECTION_SHIFT)
>  #define PAGE_SECTION_MASK	(~(PAGES_PER_SECTION-1))
>
> -#define SECTION_BLOCKFLAGS_BITS \
> -	((1UL << (PFN_SECTION_SHIFT - pageblock_order)) * NR_PAGEBLOCK_BITS)
> -
>  #if (MAX_PAGE_ORDER + PAGE_SHIFT) > SECTION_SIZE_BITS
>  #error Allocator MAX_PAGE_ORDER exceeds SECTION_SIZE
>  #endif
> @@ -1901,13 +1898,17 @@ static inline unsigned long section_nr_to_pfn(unsigned long sec)
>  #define SUBSECTION_ALIGN_UP(pfn) ALIGN((pfn), PAGES_PER_SUBSECTION)
>  #define SUBSECTION_ALIGN_DOWN(pfn) ((pfn) & PAGE_SUBSECTION_MASK)
>
> +struct pageblock_data {
> +	unsigned long flags;

Would it be better to make this uint32_t if !CONFIG_MEMORY_ISOLATION
and uint64_t otherwise? MIGRATE_ISOLATE is the only reason to have
an 8-byte pageblock flag.
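
For scale, a back-of-the-envelope sketch of the two options, assuming
2MB pageblocks as in the changelog (the macro and function names are
made up for illustration):

```c
#include <stddef.h>

#define MODEL_PAGEBLOCK_BYTES (2ul << 20) /* 2MB pageblocks, as on x86_64 */
#define MODEL_GIB             (1ul << 30)

/* Bytes of per-pageblock metadata needed to describe one GiB of memory. */
static size_t model_overhead_per_gib(size_t bytes_per_pageblock)
{
	return (MODEL_GIB / MODEL_PAGEBLOCK_BYTES) * bytes_per_pageblock;
}
```

An 8-byte flags word costs 4KB per GB, matching the changelog; a 4-byte
word would cost 2KB per GB.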

<snip>

> -#ifdef CONFIG_MEMORY_ISOLATION
> -	BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 8);
> -#else
> -	BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 4);
> -#endif

We probably still need

	BUILD_BUG_ON(NR_PAGEBLOCK_BITS > BITS_PER_BYTE * sizeof(struct pageblock_data));

just in case in the future we add too many pageblock bits.
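
A userspace sketch of this kind of guard, comparing the number of flag
bits with the width of the flags word in bits (sizeof() yields bytes,
so it is scaled by CHAR_BIT). The MODEL_* names and the value 8 are
stand-ins taken from the removed check, not the kernel definitions:

```c
#include <limits.h>

#define MODEL_NR_PAGEBLOCK_BITS 8 /* value checked with CONFIG_MEMORY_ISOLATION */

struct model_pageblock_data {
	unsigned long flags;
};

/* Width of the flags word in bits: sizeof() is in bytes, so scale by CHAR_BIT. */
#define MODEL_FLAGS_BITS \
	(sizeof(((struct model_pageblock_data *)0)->flags) * CHAR_BIT)

/* Fails the build if more flag bits are defined than the word can hold. */
_Static_assert(MODEL_NR_PAGEBLOCK_BITS <= MODEL_FLAGS_BITS,
	       "pageblock flag bits exceed the flags word");
```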


Otherwise, this patch can be sent and merged separately.


--
Best Regards,
Yan, Zi


^ permalink raw reply

* Re: [PATCH RFC bpf-next 1/8] kasan: expose generic kasan helpers
From: Alexei Starovoitov @ 2026-04-19 22:51 UTC (permalink / raw)
  To: Andrey Konovalov
  Cc: Alexis Lothoré, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman,
	Kumar Kartikeya Dwivedi, Song Liu, Yonghong Song, Jiri Olsa,
	John Fastabend, David S. Miller, David Ahern, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, X86 ML, H. Peter Anvin,
	Shuah Khan, Maxime Coquelin, Alexandre Torgue, Andrey Ryabinin,
	Alexander Potapenko, Dmitry Vyukov, Vincenzo Frascino,
	Andrew Morton, ebpf, Bastien Curutchet, Thomas Petazzoni,
	Xu Kuohai, bpf, LKML, Network Development,
	open list:KERNEL SELFTEST FRAMEWORK, linux-stm32,
	linux-arm-kernel, kasan-dev, linux-mm
In-Reply-To: <CA+fCnZe-b0Qqbo5gGv3HN20twquQETDfYYkE1r9tPr9zUFbW9Q@mail.gmail.com>

On Sun, Apr 19, 2026 at 2:49 PM Andrey Konovalov <andreyknvl@gmail.com> wrote:
>
> On Tue, Apr 14, 2026 at 5:58 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > I think we're talking past each other.
> > We're not interested in KASAN_SW_TAGS or KASAN_HW_TAGS.
> > We're not going to modify arm64 JIT at all.
> >
> > This is purely KASAN_GENERIC and only on x86-64.
> > JIT will emit exactly what compilers emit for generic
> > which is __asan_load/store. This is as stable ABI as it can get
> > and we don't want to deviate from it.
>
> OK, I suppose that's fair. You did throw me off point with your
> performance comment. But if you decide to add SW_TAGS support at some
> point, I think this discussion needs to be revisited.
>
> But please add a comment saying that those functions are only exposed
> for BPF JIT and they are not supposed to be used by other parts of the
> kernel. And in case you do end up adding a new config option, guard
> the public declarations by a corresponding ifdef.

I feel concerns of misuse are overblown.
Being in include/linux/kasan.h doesn't make them free-for-all
all of a sudden, but if you prefer we can just copy paste:
+void __asan_load1(void *p);
+void __asan_store1(void *p);
into bpf_jit_comp.c
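
For readers unfamiliar with what such a 1-byte check does in generic
KASAN, here is a toy userspace model of the shadow lookup. It is a
sketch under assumptions: offsets into a small arena replace the real
address-to-shadow translation, and all model_* names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_GRANULE_SHIFT 3 /* 8 bytes of memory per shadow byte */
#define MODEL_GRANULE_MASK  ((1u << MODEL_GRANULE_SHIFT) - 1)
#define MODEL_ARENA_BYTES   64

/* One signed shadow byte per 8-byte granule of a toy arena. */
static int8_t model_shadow[MODEL_ARENA_BYTES >> MODEL_GRANULE_SHIFT];

/*
 * Returns true if a 1-byte access at arena offset 'off' hits poisoned memory.
 * Shadow 0: whole granule accessible; 1..7: only the first N bytes are;
 * negative: the whole granule is inaccessible.
 */
static bool model_byte_is_poisoned(size_t off)
{
	int8_t sv = model_shadow[off >> MODEL_GRANULE_SHIFT];

	if (sv == 0)
		return false;
	return (int8_t)(off & MODEL_GRANULE_MASK) >= sv;
}
```

A JIT that emits a call before each load or store gets this check for
free from the existing runtime, which is the appeal of reusing the
compiler-facing entry points rather than inventing new ones.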

> > The goal here is to find bugs in the verifier.
> > If something got past it, that shouldn't have,
> > kasan generic on x86-64 is enough.
>
> FWIW, I suspect HW_TAGS KASAN already just works with JITed BPF code.

Ohh. Good point. Looks like modern arm64 cpus in public clouds
don't have that enabled, so one would need a Pixel phone to
catch verifier bugs via hw_tags.
So we still need this x86-specific jit kasan.
I guess eventually it can be removed when hw_tags support is widespread.


^ permalink raw reply

* Re: [GIT PULL] hotfixes for 7.1-rc1
From: pr-tracker-bot @ 2026-04-19 21:54 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linus Torvalds, linux-mm, mm-commits, linux-kernel
In-Reply-To: <20260419115803.3114199cdc346ba9654958f9@linux-foundation.org>

The pull request you sent on Sun, 19 Apr 2026 11:58:03 -0700:

> git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm tags/mm-hotfixes-stable-2026-04-19-00-14

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/c1f49dea2b8f335813d3b348fd39117fb8efb428

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html


^ permalink raw reply
