From: Harry Yoo <harry.yoo@oracle.com>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: bpf@vger.kernel.org, linux-mm@kvack.org, vbabka@suse.cz,
shakeel.butt@linux.dev, mhocko@suse.com, bigeasy@linutronix.de,
andrii@kernel.org, memxor@gmail.com, akpm@linux-foundation.org,
peterz@infradead.org, rostedt@goodmis.org, hannes@cmpxchg.org
Subject: Re: [PATCH v2 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
Date: Fri, 11 Jul 2025 16:36:30 +0900
Message-ID: <aHC-_upDSW_Twplc@hyeyoo>
In-Reply-To: <20250709015303.8107-7-alexei.starovoitov@gmail.com>

On Tue, Jul 08, 2025 at 06:53:03PM -0700, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> kmalloc_nolock() relies on the ability of local_lock to detect the
> situation when it's locked.
> In !PREEMPT_RT, local_lock_is_locked() is true only when an NMI happened
> in the irq-saved region that protects _that specific_ per-cpu
> kmem_cache_cpu. In that case retry the operation in a different kmalloc
> bucket. The second attempt will likely succeed, since this cpu locked a
> different kmem_cache_cpu.
>
> Similarly, in PREEMPT_RT local_lock_is_locked() returns true when the
> per-cpu rt_spin_lock is locked by the current task. In this case re-entrance
> into the same kmalloc bucket is unsafe, and kmalloc_nolock() tries
> a different bucket that is most likely not locked by the current
> task. Though it may be locked by a different task, it's safe to
> rt_spin_lock() on it.
>
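If I read the two cases above right, the retry amounts to roughly the
following (bucket_index() and try_alloc_from() are made-up names for
illustration, not the patch's actual code):

        for (i = bucket_index(size); i <= KMALLOC_SHIFT_HIGH; i++) {
                struct kmem_cache *s = kmalloc_caches[KMALLOC_NORMAL][i];

                /*
                 * Skip a bucket whose per-cpu lock is already held on this
                 * CPU (NMI in the irq-saved region on !PREEMPT_RT, current
                 * task on PREEMPT_RT); the next bucket uses a different
                 * kmem_cache_cpu.
                 */
                if (local_lock_is_locked(&s->cpu_slab->lock))
                        continue;
                return try_alloc_from(s, size, node);
        }
        return NULL;
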
> Similar to alloc_pages_nolock(), kmalloc_nolock() returns NULL
> immediately if called from hard irq or NMI context in PREEMPT_RT.
>
> kfree_nolock() defers freeing to irq_work when local_lock_is_locked()
> and in_nmi() or in PREEMPT_RT.
Does that mean that SLUB now needs to select IRQ_WORK in mm/Kconfig?
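If so, I'd expect something roughly like this (just a sketch, assuming the
SLUB entry in mm/Kconfig still looks like this):

        config SLUB
                def_bool y
                select IRQ_WORK

so that irq_work_queue() is always available to the deferred-free path.
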
> SLUB_TINY config doesn't use local_lock_is_locked() and relies on
> spin_trylock_irqsave(&n->list_lock) to allocate, while kfree_nolock()
> always defers to irq_work.
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>  include/linux/kasan.h |  13 +-
>  include/linux/slab.h  |   4 +
>  mm/kasan/common.c     |   5 +-
>  mm/slub.c             | 330 ++++++++++++++++++++++++++++++++++++++----
>  4 files changed, 319 insertions(+), 33 deletions(-)
>
> diff --git a/include/linux/kasan.h b/include/linux/kasan.h
> index 890011071f2b..acdc8cb0152e 100644
> --- a/include/linux/kasan.h
> +++ b/include/linux/kasan.h
> @@ -200,7 +200,7 @@ static __always_inline bool kasan_slab_pre_free(struct kmem_cache *s,
>  }
> 
>  bool __kasan_slab_free(struct kmem_cache *s, void *object, bool init,
> -                       bool still_accessible);
> +                       bool still_accessible, bool no_quarantine);
>  /**
>   * kasan_slab_free - Poison, initialize, and quarantine a slab object.
>   * @object: Object to be freed.
> @@ -1982,6 +1983,7 @@ static inline void init_slab_obj_exts(struct slab *slab)
>  int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
>                          gfp_t gfp, bool new_slab)
>  {
> +        bool allow_spin = gfpflags_allow_spinning(gfp);
>          unsigned int objects = objs_per_slab(s, slab);
>          unsigned long new_exts;
>          unsigned long old_exts;
> @@ -1990,8 +1992,14 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
>          gfp &= ~OBJCGS_CLEAR_MASK;
>          /* Prevent recursive extension vector allocation */
>          gfp |= __GFP_NO_OBJ_EXT;
> -        vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
> -                           slab_nid(slab));
> +        if (unlikely(!allow_spin)) {
> +                size_t sz = objects * sizeof(struct slabobj_ext);
> +
> +                vec = kmalloc_nolock(sz, __GFP_ZERO, slab_nid(slab));
Missing memset()? As there is no kcalloc_nolock(), is __GFP_ZERO enough to
get a zeroed vector here, or does it need an explicit clear?
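If __GFP_ZERO is not honored on this path, I'd expect the hunk above to need
something like (untested):

                vec = kmalloc_nolock(sz, 0, slab_nid(slab));
                if (vec)
                        memset(vec, 0, sz);
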
> +        } else {
> +                vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
> +                                   slab_nid(slab));
> +        }
>          if (!vec) {
>                  /* Mark vectors which failed to allocate */
>                  if (new_slab)
> @@ -3911,6 +3953,12 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>                  void *flush_freelist = c->freelist;
>                  struct slab *flush_slab = c->slab;
> 
> +                if (unlikely(!allow_spin))
> +                        /*
> +                         * Reentrant slub cannot take locks
> +                         * necessary for deactivate_slab()
> +                         */
> +                        return NULL;
>                  c->slab = NULL;
>                  c->freelist = NULL;
>                  c->tid = next_tid(c->tid);
> @@ -4555,6 +4707,53 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>          discard_slab(s, slab);
>  }
> 
> +static DEFINE_PER_CPU(struct llist_head, defer_free_objects);
> +static DEFINE_PER_CPU(struct irq_work, defer_free_work);
> +
> +static void free_deferred_objects(struct irq_work *work)
> +{
> +        struct llist_head *llhead = this_cpu_ptr(&defer_free_objects);
> +        struct llist_node *llnode, *pos, *t;
> +
> +        if (llist_empty(llhead))
> +                return;
> +
> +        llnode = llist_del_all(llhead);
> +        llist_for_each_safe(pos, t, llnode) {
> +                struct kmem_cache *s;
> +                struct slab *slab;
> +                void *x = pos;
> +
> +                slab = virt_to_slab(x);
> +                s = slab->slab_cache;
> +
> +                /*
> +                 * memcg, kasan_slab_pre are already done for 'x'.
> +                 * The only thing left is kasan_poison.
> +                 */
> +                kasan_slab_free(s, x, false, false, true);
> +                __slab_free(s, slab, x, x, 1, _THIS_IP_);
> +        }
> +}
> +
> +static void defer_free(void *head)
> +{
> +        if (llist_add(head, this_cpu_ptr(&defer_free_objects)))
> +                irq_work_queue(this_cpu_ptr(&defer_free_work));
By adding the object to the lockless list, defer_free() overwrites the first
word of the freed object, and that is not always safe.
Looking at calculate_sizes():
        if (((flags & SLAB_TYPESAFE_BY_RCU) && !args->use_freeptr_offset) ||
            (flags & SLAB_POISON) || s->ctor ||
            ((flags & SLAB_RED_ZONE) &&
             (s->object_size < sizeof(void *) || slub_debug_orig_size(s)))) {
                /*
                 * Relocate free pointer after the object if it is not
                 * permitted to overwrite the first word of the object on
                 * kmem_cache_free.
                 *
                 * This is the case if we do RCU, have a constructor or
                 * destructor, are poisoning the objects, or are
                 * redzoning an object smaller than sizeof(void *) or are
                 * redzoning an object with slub_debug_orig_size() enabled,
                 * in which case the right redzone may be extended.
                 *
                 * The assumption that s->offset >= s->inuse means free
                 * pointer is outside of the object is used in the
                 * freeptr_outside_object() function. If that is no
                 * longer true, the function needs to be modified.
                 */
                s->offset = size;
                size += sizeof(void *);
Only the sizeof(void *) bytes at object + s->offset are always safe to
overwrite. So either 1) teach defer_free() to link objects through
object + s->offset instead of offset zero (and teach free_deferred_objects()
that the list can hold objects from different caches), or 2) introduce
per-cache per-CPU lockless lists?
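For 1), a rough and untested sketch of what I have in mind:

        static void defer_free(struct kmem_cache *s, void *head)
        {
                /* the free pointer slot is the only word we may clobber */
                struct llist_node *node = head + s->offset;

                if (llist_add(node, this_cpu_ptr(&defer_free_objects)))
                        irq_work_queue(this_cpu_ptr(&defer_free_work));
        }

and on the drain side in free_deferred_objects():

        llist_for_each_safe(pos, t, llnode) {
                struct slab *slab = virt_to_slab(pos);
                struct kmem_cache *s = slab->slab_cache;
                /* 'pos' points at the free pointer slot, not the object */
                void *x = (void *)pos - s->offset;

                kasan_slab_free(s, x, false, false, true);
                __slab_free(s, slab, x, x, 1, _THIS_IP_);
        }

virt_to_slab() still works on 'pos' because the free pointer slot lies within
the same slab as the object itself.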
--
Cheers,
Harry / Hyeonggon