From: Harry Yoo <harry.yoo@oracle.com>
To: Vlastimil Babka <vbabka@suse.cz>
Cc: Suren Baghdasaryan <surenb@google.com>,
	"Liam R. Howlett" <Liam.Howlett@oracle.com>,
	Christoph Lameter <cl@gentwo.org>,
	David Rientjes <rientjes@google.com>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Uladzislau Rezki <urezki@gmail.com>,
	Sidhartha Kumar <sidhartha.kumar@oracle.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	rcu@vger.kernel.org, maple-tree@lists.infradead.org,
	Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Subject: Re: [PATCH v7 03/21] slab: add opt-in caching layer of percpu sheaves
Date: Mon, 8 Sep 2025 20:19:52 +0900
Message-ID: <aL672Jeqi99atefN@hyeyoo>
In-Reply-To: <20250903-slub-percpu-caches-v7-3-71c114cdefef@suse.cz>

On Wed, Sep 03, 2025 at 02:59:45PM +0200, Vlastimil Babka wrote:
> Specifying a non-zero value for a new struct kmem_cache_args field
> sheaf_capacity will set up a caching layer of percpu arrays called
> sheaves of given capacity for the created cache.
> 
> Allocations from the cache will allocate via the percpu sheaves (main or
> spare) as long as they have no NUMA node preference. Frees will also
> put the object back into one of the sheaves.
> 
> When both percpu sheaves are found empty during an allocation, an empty
> sheaf may be replaced with a full one from the per-node barn. If none
> are available and the allocation is allowed to block, an empty sheaf is
> refilled from slab(s) by an internal bulk alloc operation. When both
> percpu sheaves are full during freeing, the barn can replace a full one
> with an empty one, unless over a full sheaves limit. In that case a
> sheaf is flushed to slab(s) by an internal bulk free operation. Flushing
> sheaves and barns is also wired to the existing cpu flushing and cache
> shrinking operations.
> 
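
To check my reading of the allocation flow described above, here is a rough
userspace model in C. It is only a sketch under my own assumptions: every
model_* name and struct layout is hypothetical, SHEAF_CAPACITY stands in for
sheaf_capacity, and locking, NUMA and the blocking refill path are left out.

```c
#include <assert.h>
#include <stddef.h>

#define SHEAF_CAPACITY 4	/* stand-in for sheaf_capacity */
#define BARN_MAX_FULL  2	/* hypothetical limit, for the model only */

/* A sheaf is a fixed-size array of cached object pointers. */
struct model_sheaf {
	size_t size;
	void *objects[SHEAF_CAPACITY];
};

/* Per-node barn: a small stack of full sheaves plus a count of empties. */
struct model_barn {
	size_t nr_full;
	size_t nr_empty;
	struct model_sheaf *full[BARN_MAX_FULL];
};

/* Per-cpu pair of sheaves. */
struct model_pcs {
	struct model_sheaf *main;	/* always present */
	struct model_sheaf *spare;	/* may be NULL */
};

static int model_backing[SHEAF_CAPACITY];

/* Put n dummy objects into a sheaf. */
static void model_sheaf_fill(struct model_sheaf *sheaf, size_t n)
{
	assert(n <= SHEAF_CAPACITY);
	for (sheaf->size = 0; sheaf->size < n; sheaf->size++)
		sheaf->objects[sheaf->size] = &model_backing[sheaf->size];
}

/* Hand an empty sheaf to the barn and take a full one, if available. */
static struct model_sheaf *model_barn_replace_empty(struct model_barn *barn)
{
	if (barn->nr_full == 0)
		return NULL;
	barn->nr_empty++;
	return barn->full[--barn->nr_full];
}

/* Allocate one object: main first, then spare, then the barn. */
static void *model_alloc(struct model_pcs *pcs, struct model_barn *barn)
{
	if (pcs->main->size == 0) {
		if (pcs->spare && pcs->spare->size > 0) {
			struct model_sheaf *tmp = pcs->main;

			pcs->main = pcs->spare;
			pcs->spare = tmp;
		} else {
			struct model_sheaf *full = model_barn_replace_empty(barn);

			/* The patch would try a refill or the slab path here. */
			if (!full)
				return NULL;
			pcs->main = full;
		}
	}
	return pcs->main->objects[--pcs->main->size];
}
```

With an empty main, a partially filled spare and one full sheaf in the barn,
the model first swaps main and spare, then takes the full sheaf from the barn
once both percpu sheaves are empty, and finally returns NULL.
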
> The sheaves do not distinguish NUMA locality of the cached objects. If
> an allocation is requested with kmem_cache_alloc_node() (or a mempolicy
> with strict_numa mode enabled) with a specific node (not NUMA_NO_NODE),
> the sheaves are bypassed.
> 
> The bulk operations exposed to slab users also try to utilize the
> sheaves as long as the necessary (full or empty) sheaves are available
> on the cpu or in the barn. Once depleted, they will fall back to bulk
> alloc/free to slabs directly to avoid double copying.
> 
> The sheaf_capacity value is exported in sysfs for observability.
> 
> Sysfs CONFIG_SLUB_STATS counters alloc_cpu_sheaf and free_cpu_sheaf
> count objects allocated or freed using the sheaves (and thus not
> counting towards the other alloc/free path counters). Counters
> sheaf_refill and sheaf_flush count objects filled or flushed from or to
> slab pages, and can be used to assess how effective the caching is. The
> refill and flush operations will also count towards the usual
> alloc_fastpath/slowpath, free_fastpath/slowpath and other counters for
> the backing slabs. For barn operations, barn_get and barn_put count how
> many full sheaves were taken from or put into the barn; the _fail variants
> count how many such requests could not be satisfied, mainly because the
> barn was either empty or full. While the barn also holds empty sheaves
> to make some operations easier, these are not critical enough to warrant
> their own counters. Finally, there are sheaf_alloc/sheaf_free counters.
> 
> Access to the percpu sheaves is protected by local_trylock() when
> potential callers include irq context, and local_lock() otherwise (such
> as when we already know the gfp flags allow blocking). The trylock
> failures should be rare and we can easily fall back. Each per-NUMA-node
> barn has a spin_lock.
> 
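
As a rough illustration of the trylock-then-fall-back pattern described
above, here is a userspace sketch in C. It is not the kernel code: a C11
atomic_flag stands in for the local_trylock_t, and all model_* names and the
fast/slow counters are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct model_locked_pcs {
	atomic_flag lock;	/* stand-in for the percpu local_trylock_t */
	int cached;		/* objects currently held in the sheaf */
	int fast_allocs;	/* served from the sheaf */
	int slow_allocs;	/* fell back to the regular path */
};

static void model_pcs_init(struct model_locked_pcs *p, int cached)
{
	atomic_flag_clear(&p->lock);
	p->cached = cached;
	p->fast_allocs = 0;
	p->slow_allocs = 0;
}

/* Returns true if the lock was taken, false if it was already held. */
static bool model_trylock(struct model_locked_pcs *p)
{
	return !atomic_flag_test_and_set_explicit(&p->lock,
						  memory_order_acquire);
}

static void model_unlock(struct model_locked_pcs *p)
{
	atomic_flag_clear_explicit(&p->lock, memory_order_release);
}

/* Returns true if the object came from the sheaf, false on fallback. */
static bool model_alloc_obj(struct model_locked_pcs *p)
{
	if (model_trylock(p)) {
		if (p->cached > 0) {
			p->cached--;
			p->fast_allocs++;
			model_unlock(p);
			return true;
		}
		model_unlock(p);
	}

	/* Lock contended (e.g. we interrupted the holder) or sheaf empty. */
	p->slow_allocs++;
	return false;
}
```

Taking the lock by hand before calling model_alloc_obj() mimics an irq
arriving inside the locked section: the trylock fails and the allocation is
counted as a fallback, without touching the cached objects.
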
> When slub_debug is enabled for a cache with sheaf_capacity also
> specified, the latter is ignored so that allocations and frees reach the
> slow path where debugging hooks are processed. Similarly, we ignore it
> with CONFIG_SLUB_TINY which prefers low memory usage to performance.
> 
> [boot failure: https://lore.kernel.org/all/583eacf5-c971-451a-9f76-fed0e341b815@linux.ibm.com/ ]
> Reported-and-tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  include/linux/slab.h |   31 ++
>  mm/slab.h            |    2 +
>  mm/slab_common.c     |    5 +-
>  mm/slub.c            | 1164 +++++++++++++++++++++++++++++++++++++++++++++++---
>  4 files changed, 1143 insertions(+), 59 deletions(-)
> 
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index bfe7c40eeee1a01c175766935c1e3c0304434a53..e2b197e47866c30acdbd1fee4159f262a751c5a7 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -163,6 +163,9 @@ int slab_unmergeable(struct kmem_cache *s)
>  		return 1;
>  #endif
>  
> +	if (s->cpu_sheaves)
> +		return 1;
> +
>  	/*
>  	 * We may have set a slab to be unmergeable during bootstrap.
>  	 */
> @@ -321,7 +324,7 @@ struct kmem_cache *__kmem_cache_create_args(const char *name,
>  		    object_size - args->usersize < args->useroffset))
>  		args->usersize = args->useroffset = 0;
>  
> -	if (!args->usersize)
> +	if (!args->usersize && !args->sheaf_capacity)
>  		s = __kmem_cache_alias(name, object_size, args->align, flags,
>  				       args->ctor);

Can we merge caches that use sheaves in the future if the capacity
is the same, or are there any restrictions for merging that I overlooked?

>  /*
>   * Slab allocation and freeing
>   */
> @@ -3344,11 +3748,42 @@ static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
>  	put_partials_cpu(s, c);
>  }
>  
> -struct slub_flush_work {
> -	struct work_struct work;
> -	struct kmem_cache *s;
> -	bool skip;
> -};
> +static inline void flush_this_cpu_slab(struct kmem_cache *s)
> +{
> +	struct kmem_cache_cpu *c = this_cpu_ptr(s->cpu_slab);
> +
> +	if (c->slab)
> +		flush_slab(s, c);
> +
> +	put_partials(s);
> +}
> +
> +static bool has_cpu_slab(int cpu, struct kmem_cache *s)
> +{
> +	struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
> +
> +	return c->slab || slub_percpu_partial(c);
> +}
> +
> +#else /* CONFIG_SLUB_TINY */
> +static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu) { }
> +static inline bool has_cpu_slab(int cpu, struct kmem_cache *s) { return false; }
> +static inline void flush_this_cpu_slab(struct kmem_cache *s) { }
> +#endif /* CONFIG_SLUB_TINY */
> +
> +static bool has_pcs_used(int cpu, struct kmem_cache *s)
> +{
> +	struct slub_percpu_sheaves *pcs;
> +
> +	if (!s->cpu_sheaves)
> +		return false;
> +
> +	pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
> +
> +	return (pcs->spare || pcs->main->size);
> +}
> +
> +static void pcs_flush_all(struct kmem_cache *s);

nit: do we still need these sheaf-flushing functions when CONFIG_SLUB_TINY=y,
given that we no longer create sheaves for SLUB_TINY?

>  /*
>   * Flush cpu slab.
> @@ -3358,30 +3793,18 @@ struct slub_flush_work {
>  static void flush_cpu_slab(struct work_struct *w)
>  {
>  	struct kmem_cache *s;
> -	struct kmem_cache_cpu *c;
>  	struct slub_flush_work *sfw;
>  
>  	sfw = container_of(w, struct slub_flush_work, work);
>  
>  	s = sfw->s;
> -	c = this_cpu_ptr(s->cpu_slab);
> -
> -	if (c->slab)
> -		flush_slab(s, c);
> -
> -	put_partials(s);
> -}
>  
> -static bool has_cpu_slab(int cpu, struct kmem_cache *s)
> -{
> -	struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
> +	if (s->cpu_sheaves)
> +		pcs_flush_all(s);
>  
> -	return c->slab || slub_percpu_partial(c);
> +	flush_this_cpu_slab(s);
>  } 
> -#else /* CONFIG_SLUB_TINY */
> -static inline void flush_all_cpus_locked(struct kmem_cache *s) { }
> -static inline void flush_all(struct kmem_cache *s) { }
> -static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu) { }
> -static inline int slub_cpu_dead(unsigned int cpu) { return 0; }
> -#endif /* CONFIG_SLUB_TINY */
> -
>  /*
>   * Check if the objects in a per cpu structure fit numa
>   * locality expectations.
> @@ -4191,30 +4610,240 @@ bool slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
>  }
>  
>  /*
> - * Inlined fastpath so that allocation functions (kmalloc, kmem_cache_alloc)
> - * have the fastpath folded into their functions. So no function call
> - * overhead for requests that can be satisfied on the fastpath.
> - *
> - * The fastpath works by first checking if the lockless freelist can be used.
> - * If not then __slab_alloc is called for slow processing.
> + * Replace the empty main sheaf with a (at least partially) full sheaf.
>   *
> - * Otherwise we can simply pick the next object from the lockless free list.
> + * Must be called with the cpu_sheaves local lock locked. If successful, returns
> + * the pcs pointer and the local lock locked (possibly on a different cpu than
> + * initially called). If not successful, returns NULL and the local lock
> + * unlocked.
>   */
> -static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list_lru *lru,
> -		gfp_t gfpflags, int node, unsigned long addr, size_t orig_size)
> +static struct slub_percpu_sheaves *
> +__pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs, gfp_t gfp)
>  {
> -	void *object;
> -	bool init = false;
> +	struct slab_sheaf *empty = NULL;
> +	struct slab_sheaf *full;
> +	struct node_barn *barn;
> +	bool can_alloc;
>  
> -	s = slab_pre_alloc_hook(s, gfpflags);
> -	if (unlikely(!s))
> +	lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
> +
> +	if (pcs->spare && pcs->spare->size > 0) {
> +		swap(pcs->main, pcs->spare);
> +		return pcs;
> +	}
> +
> +	barn = get_barn(s);
> +
> +	full = barn_replace_empty_sheaf(barn, pcs->main);
> +
> +	if (full) {
> +		stat(s, BARN_GET);
> +		pcs->main = full;
> +		return pcs;
> +	}
> +
> +	stat(s, BARN_GET_FAIL);
> +
> +	can_alloc = gfpflags_allow_blocking(gfp);
> +
> +	if (can_alloc) {
> +		if (pcs->spare) {
> +			empty = pcs->spare;
> +			pcs->spare = NULL;
> +		} else {
> +			empty = barn_get_empty_sheaf(barn);
> +		}
> +	}
> +
> +	local_unlock(&s->cpu_sheaves->lock);
> +
> +	if (!can_alloc)
> +		return NULL;
> +
> +	if (empty) {
> +		if (!refill_sheaf(s, empty, gfp)) {
> +			full = empty;
> +		} else {
> +			/*
> +			 * we must be very low on memory so don't bother
> +			 * with the barn
> +			 */
> +			free_empty_sheaf(s, empty);
> +		}
> +	} else {
> +		full = alloc_full_sheaf(s, gfp);
> +	}
> +
> +	if (!full)
> +		return NULL;
> +
> +	/*
> +	 * we can reach here only when gfpflags_allow_blocking
> +	 * so this must not be an irq
> +	 */
> +	local_lock(&s->cpu_sheaves->lock);
> +	pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +	/*
> +	 * If we are returning empty sheaf, we either got it from the
> +	 * barn or had to allocate one. If we are returning a full
> +	 * sheaf, it's due to racing or being migrated to a different
> +	 * cpu. Breaching the barn's sheaf limits should be thus rare
> +	 * enough so just ignore them to simplify the recovery.
> +	 */
> +
> +	if (pcs->main->size == 0) {
> +		barn_put_empty_sheaf(barn, pcs->main);

This should be very rare, but shouldn't we do barn = get_barn(s); again
after taking s->cpu_sheaves->lock? We might have been migrated to a
different node while the lock was dropped.
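
To make the concern concrete, here is a small userspace model in C. It
assumes that get_barn() picks the barn of the node the current cpu belongs
to; the cpu-to-node map, the model_* helpers and the refetch flag are all
hypothetical and only illustrate the stale pointer.

```c
#include <assert.h>

#define MODEL_NR_NODES 2
#define MODEL_NR_CPUS  4

struct model_node_barn {
	int nr_empty;
};

static struct model_node_barn model_barns[MODEL_NR_NODES];
static const int model_cpu_to_node[MODEL_NR_CPUS] = { 0, 0, 1, 1 };
static int model_current_cpu;

/* Assumed behaviour: the barn of the node the current cpu belongs to. */
static struct model_node_barn *model_get_barn(void)
{
	return &model_barns[model_cpu_to_node[model_current_cpu]];
}

/*
 * Look up the barn, drop the lock (the task may migrate meanwhile),
 * retake it and put an empty sheaf into the barn. Returns the barn
 * that received the sheaf.
 */
static struct model_node_barn *model_put_empty_after_relock(int migrate_to,
							     int refetch)
{
	struct model_node_barn *barn = model_get_barn();

	/* local_unlock() ... local_lock(): we may resume on another cpu */
	model_current_cpu = migrate_to;

	if (refetch)
		barn = model_get_barn();

	barn->nr_empty++;
	return barn;
}
```

Without the refetch, a task that started on cpu 0 (node 0) and resumed on
cpu 2 (node 1) puts the empty sheaf into node 0's barn. With the refetch it
ends up in the barn of the node it is now running on.
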

> +		pcs->main = full;
> +		return pcs;
> +	}
> +
> +	if (!pcs->spare) {
> +		pcs->spare = full;
> +		return pcs;
> +	}
> +
> +	if (pcs->spare->size == 0) {
> +		barn_put_empty_sheaf(barn, pcs->spare);
> +		pcs->spare = full;
> +		return pcs;
> +	}
> +
> +	barn_put_full_sheaf(barn, full);
> +	stat(s, BARN_PUT);
> +
> +	return pcs;
> +}
> @@ -4591,6 +5220,295 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>  	discard_slab(s, slab);
>  }
>  
> +/*
> + * Replace the full main sheaf with a (at least partially) empty sheaf.
> + *
> + * Must be called with the cpu_sheaves local lock locked. If successful, returns
> + * the pcs pointer and the local lock locked (possibly on a different cpu than
> + * initially called). If not successful, returns NULL and the local lock
> + * unlocked.
> + */
> +static struct slub_percpu_sheaves *
> +__pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs)
> +{
> +	struct slab_sheaf *empty;
> +	struct node_barn *barn;
> +	bool put_fail;
> +
> +restart:
> +	lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
> +
> +	barn = get_barn(s);
> +	put_fail = false;
> +
> +	if (!pcs->spare) {
> +		empty = barn_get_empty_sheaf(barn);
> +		if (empty) {
> +			pcs->spare = pcs->main;
> +			pcs->main = empty;
> +			return pcs;
> +		}
> +		goto alloc_empty;
> +	}
> +
> +	if (pcs->spare->size < s->sheaf_capacity) {
> +		swap(pcs->main, pcs->spare);
> +		return pcs;
> +	}
> +
> +	empty = barn_replace_full_sheaf(barn, pcs->main);
> +
> +	if (!IS_ERR(empty)) {
> +		stat(s, BARN_PUT);
> +		pcs->main = empty;
> +		return pcs;
> +	}
> +
> +	if (PTR_ERR(empty) == -E2BIG) {
> +		/* Since we got here, spare exists and is full */
> +		struct slab_sheaf *to_flush = pcs->spare;
> +
> +		stat(s, BARN_PUT_FAIL);
> +
> +		pcs->spare = NULL;
> +		local_unlock(&s->cpu_sheaves->lock);
> +
> +		sheaf_flush_unused(s, to_flush);
> +		empty = to_flush;
> +		goto got_empty;
> +	}
> +
> +	/*
> +	 * We could not replace full sheaf because barn had no empty
> +	 * sheaves. We can still allocate it and put the full sheaf in
> +	 * __pcs_install_empty_sheaf(), but if we fail to allocate it,
> +	 * make sure to count the fail.
> +	 */
> +	put_fail = true;
> +
> +alloc_empty:
> +	local_unlock(&s->cpu_sheaves->lock);
> +
> +	empty = alloc_empty_sheaf(s, GFP_NOWAIT);
> +	if (empty)
> +		goto got_empty;
> +
> +	if (put_fail)
> +		 stat(s, BARN_PUT_FAIL);
> +
> +	if (!sheaf_flush_main(s))
> +		return NULL;
> +
> +	if (!local_trylock(&s->cpu_sheaves->lock))
> +		return NULL;
> +
> +	pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +	/*
> +	 * we flushed the main sheaf so it should be empty now,
> +	 * but in case we got preempted or migrated, we need to
> +	 * check again
> +	 */
> +	if (pcs->main->size == s->sheaf_capacity)
> +		goto restart;
> +
> +	return pcs;
> +
> +got_empty:
> +	if (!local_trylock(&s->cpu_sheaves->lock)) {
> +		barn_put_empty_sheaf(barn, empty);

Same here, we might have gotten migrated to a different node.

> +		return NULL;
> +	}
> +
> +	pcs = this_cpu_ptr(s->cpu_sheaves);
> +	__pcs_install_empty_sheaf(s, pcs, empty);
> +
> +	return pcs;
> +}

Otherwise looks good to me!

-- 
Cheers,
Harry / Hyeonggon

