From: Harry Yoo <harry.yoo@oracle.com>
To: Vlastimil Babka <vbabka@suse.cz>
Cc: Suren Baghdasaryan <surenb@google.com>,
	"Liam R. Howlett" <Liam.Howlett@oracle.com>,
	Christoph Lameter <cl@linux.com>,
	David Rientjes <rientjes@google.com>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Uladzislau Rezki <urezki@gmail.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	rcu@vger.kernel.org, maple-tree@lists.infradead.org
Subject: Re: [PATCH v4 1/9] slab: add opt-in caching layer of percpu sheaves
Date: Tue, 29 Apr 2025 10:08:27 +0900
Message-ID: <aBAmi38oWka6ckjk@harry>
In-Reply-To: <20250425-slub-percpu-caches-v4-1-8a636982b4a4@suse.cz>

On Fri, Apr 25, 2025 at 10:27:21AM +0200, Vlastimil Babka wrote:
> Specifying a non-zero value for a new struct kmem_cache_args field
> sheaf_capacity will set up a caching layer of percpu arrays called
> sheaves of given capacity for the created cache.
> 
> Allocations from the cache will allocate via the percpu sheaves (main or
> spare) as long as they have no NUMA node preference. Frees will also
> put the object back into one of the sheaves.
> 
> When both percpu sheaves are found empty during an allocation, an empty
> sheaf may be replaced with a full one from the per-node barn. If none
> are available and the allocation is allowed to block, an empty sheaf is
> refilled from slab(s) by an internal bulk alloc operation. When both
> percpu sheaves are full during freeing, the barn can replace a full one
> with an empty one, unless the barn is over its full-sheaf limit. In that case a
> sheaf is flushed to slab(s) by an internal bulk free operation. Flushing
> sheaves and barns is also wired to the existing cpu flushing and cache
> shrinking operations.
> 
> The sheaves do not distinguish NUMA locality of the cached objects. If
> an allocation is requested with kmem_cache_alloc_node() (or a mempolicy
> with strict_numa mode enabled) with a specific node (not NUMA_NO_NODE),
> the sheaves are bypassed.
> 
> The bulk operations exposed to slab users also try to utilize the
> sheaves as long as the necessary (full or empty) sheaves are available
> on the cpu or in the barn. Once depleted, they will fall back to bulk
> alloc/free to slabs directly to avoid double copying.
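[ Editor's note: the bulk API referred to here is the existing
kmem_cache_alloc_bulk()/kmem_cache_free_bulk() interface. A minimal usage
sketch, with a hypothetical cache pointer:

	void *objs[16];
	int n;

	/* allocates up to 16 objects; returns how many were allocated */
	n = kmem_cache_alloc_bulk(cache, GFP_KERNEL, ARRAY_SIZE(objs), objs);
	if (n) {
		/* ... use the objects ... */
		kmem_cache_free_bulk(cache, n, objs);
	}
]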
> 
> The sheaf_capacity value is exported in sysfs for observability.
> 
> Sysfs CONFIG_SLUB_STATS counters alloc_cpu_sheaf and free_cpu_sheaf
> count objects allocated or freed using the sheaves (and thus not
> counting towards the other alloc/free path counters). Counters
> sheaf_refill and sheaf_flush count objects filled or flushed from or to
> slab pages, and can be used to assess how effective the caching is. The
> refill and flush operations will also count towards the usual
> alloc_fastpath/slowpath, free_fastpath/slowpath and other counters for
> the backing slabs.  For barn operations, barn_get and barn_put count how
> many full sheaves were taken from or put to the barn; the _fail variants
> count how many such requests could not be satisfied, mainly because the
> barn was either empty or full.

> While the barn also holds empty sheaves
> to make some operations easier, these are not critical enough to mandate
> their own counters. Finally, there are sheaf_alloc/sheaf_free counters.

I initially thought we needed counters for empty sheaves, to see how many
times an empty sheaf is grabbed from the barn, but it looks like barn_put
("put full sheaves to the barn") is effectively a proxy for that, right?

> Access to the percpu sheaves is protected by local_trylock() when
> potential callers include irq context, and local_lock() otherwise (such
> as when we already know the gfp flags allow blocking). The trylock
> failures should be rare and we can easily fall back. Each per-NUMA-node
> barn has a spin_lock.
> 
> When slub_debug is enabled for a cache with sheaf_capacity also
> specified, the latter is ignored so that allocations and frees reach the
> slow path where debugging hooks are processed.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---

Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

LGTM, with a few nits:

>  include/linux/slab.h |   31 ++
>  mm/slab.h            |    2 +
>  mm/slab_common.c     |    5 +-
>  mm/slub.c            | 1053 +++++++++++++++++++++++++++++++++++++++++++++++---
>  4 files changed, 1044 insertions(+), 47 deletions(-)
> 
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index d5a8ab98035cf3e3d9043e3b038e1bebeff05b52..4cb495d55fc58c70a992ee4782d7990ce1c55dc6 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -335,6 +335,37 @@ struct kmem_cache_args {
> 	 * %NULL means no constructor.
> 	 */
> 	void (*ctor)(void *);
>	/**
>	 * @sheaf_capacity: Enable sheaves of given capacity for the cache.
>	 *
>	 * With a non-zero value, allocations from the cache go through caching
>	 * arrays called sheaves. Each cpu has a main sheaf that's always
>	 * present, and a spare sheaf that may not be present. When both become
>	 * empty, there's an attempt to replace an empty sheaf with a full sheaf
>	 * from the per-node barn.
>	 *
>	 * When no full sheaf is available, and gfp flags allow blocking, a
>	 * sheaf is allocated and filled from slab(s) using bulk allocation.
>	 * Otherwise the allocation falls back to the normal operation
>	 * allocating a single object from a slab.
>	 *
>	 * Analogously, when freeing and both percpu sheaves are full, the barn
>	 * may replace the main sheaf with an empty one, unless the barn is over
>	 * capacity. In that case a sheaf is bulk freed to slab pages.
>	 *
>	 * The sheaves do not enforce NUMA placement of objects, so allocations
>	 * via kmem_cache_alloc_node() with a node specified other than
>	 * NUMA_NO_NODE will bypass them.
>	 *
>	 * Bulk allocation and free operations also try to use the cpu sheaves
>	 * and barn, but fall back to using slab pages directly.
>	 *
>	 * When slub_debug is enabled for the cache, the sheaf_capacity argument
>	 * is ignored.
>	 *
>	 * %0 means no sheaves will be created

nit: "created" -> "created." (add a full stop)

>	 */
>	unsigned int sheaf_capacity;
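[ Editor's note: a minimal usage sketch of this new argument, modeled on
the later patches in this series that enable sheaves for maple_node_cache
and vm_area_struct; the args-based kmem_cache_create() variant is assumed,
and the struct name and capacity value are made up for illustration:

	struct kmem_cache_args args = {
		.sheaf_capacity = 32,	/* arbitrary example value */
	};

	/* my_obj / my_obj_cache are hypothetical names */
	cache = kmem_cache_create("my_obj_cache", sizeof(struct my_obj),
				  &args, SLAB_HWCACHE_ALIGN);
]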

> diff --git a/mm/slub.c b/mm/slub.c
> index dc9e729e1d269b5d362cb5bc44f824640ffd00f3..ae3e80ad9926ca15601eef2f2aa016ca059498f8 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> +static void pcs_destroy(struct kmem_cache *s)
> +{
> +	int cpu;
> +
> +	for_each_possible_cpu(cpu) {
> +		struct slub_percpu_sheaves *pcs;
> +
> +		pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
> +
> +		/* can happen when unwinding failed create */
> +		if (!pcs->main)
> +			continue;
> +
> +		/*
> +		 * We have already passed __kmem_cache_shutdown() so everything
> +		 * was flushed and there should be no objects allocated from
> +		 * slabs, otherwise kmem_cache_destroy() would have aborted.
> +		 * Therefore something would have to be really wrong if the
> +		 * warnings here trigger, and we should rather leave bojects and

nit: bojects -> objects

> +		 * sheaves to leak in that case.
> +		 */
> +
> +		WARN_ON(pcs->spare);
> +
> +		if (!WARN_ON(pcs->main->size)) {
> +			free_empty_sheaf(s, pcs->main);
> +			pcs->main = NULL;
> +		}
> +	}
> +
> +	free_percpu(s->cpu_sheaves);
> +	s->cpu_sheaves = NULL;
> +}
> +
> +/*
> + * If a empty sheaf is available, return it and put the supplied full one to

nit: a empty sheaf -> an empty sheaf

> + * barn. But if there are too many full sheaves, reject this with -E2BIG.
> + */
>
> +static struct slab_sheaf *
> +barn_replace_full_sheaf(struct node_barn *barn, struct slab_sheaf *full)
> @@ -4567,6 +5169,234 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>  	discard_slab(s, slab);
>  }
>  
> +/*
> + * Free an object to the percpu sheaves.
> + * The object is expected to have passed slab_free_hook() already.
> + */
> +static __fastpath_inline
> +bool free_to_pcs(struct kmem_cache *s, void *object)
> +{
> +	struct slub_percpu_sheaves *pcs;
> +
> +restart:
> +	if (!local_trylock(&s->cpu_sheaves->lock))
> +		return false;
> +
> +	pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +	if (unlikely(pcs->main->size == s->sheaf_capacity)) {
> +
> +		struct slab_sheaf *empty;
> +
> +		if (!pcs->spare) {
> +			empty = barn_get_empty_sheaf(pcs->barn);
> +			if (empty) {
> +				pcs->spare = pcs->main;
> +				pcs->main = empty;
> +				goto do_free;
> +			}
> +			goto alloc_empty;
> +		}
> +
> +		if (pcs->spare->size < s->sheaf_capacity) {
> +			swap(pcs->main, pcs->spare);
> +			goto do_free;
> +		}
> +
> +		empty = barn_replace_full_sheaf(pcs->barn, pcs->main);
> +
> +		if (!IS_ERR(empty)) {
> +			stat(s, BARN_PUT);
> +			pcs->main = empty;
> +			goto do_free;
> +		}

nit: stat(s, BARN_PUT_FAIL); should probably be here instead?
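i.e. something like this (illustrative placement only), so that all
barn_replace_full_sheaf() failures are counted, not just the -E2BIG case:

		empty = barn_replace_full_sheaf(pcs->barn, pcs->main);

		if (!IS_ERR(empty)) {
			stat(s, BARN_PUT);
			pcs->main = empty;
			goto do_free;
		}

		stat(s, BARN_PUT_FAIL);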

> +
> +		if (PTR_ERR(empty) == -E2BIG) {
> +			/* Since we got here, spare exists and is full */
> +			struct slab_sheaf *to_flush = pcs->spare;
> +
> +			stat(s, BARN_PUT_FAIL);
> +
> +			pcs->spare = NULL;
> +			local_unlock(&s->cpu_sheaves->lock);
> +
> +			sheaf_flush_unused(s, to_flush);
> +			empty = to_flush;
> +			goto got_empty;
> +		}

> @@ -6455,6 +7374,16 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
>  
>  	set_cpu_partial(s);
>  
> +	if (args->sheaf_capacity && !(s->flags & SLAB_DEBUG_FLAGS)) {
> +		s->cpu_sheaves = alloc_percpu(struct slub_percpu_sheaves);

nit: Probably you want to disable sheaves on CONFIG_SLUB_TINY=y too?
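e.g. (untested sketch):

	if (args->sheaf_capacity && !IS_ENABLED(CONFIG_SLUB_TINY) &&
	    !(s->flags & SLAB_DEBUG_FLAGS)) {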

> +		if (!s->cpu_sheaves) {
> +			err = -ENOMEM;
> +			goto out;
> +		}
> +		// TODO: increase capacity to grow slab_sheaf up to next kmalloc size?
> +		s->sheaf_capacity = args->sheaf_capacity;
> +	}
> +

-- 
Cheers,
Harry / Hyeonggon

