From: Harry Yoo <harry.yoo@oracle.com>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: bpf@vger.kernel.org, linux-mm@kvack.org, vbabka@suse.cz,
shakeel.butt@linux.dev, mhocko@suse.com, bigeasy@linutronix.de,
andrii@kernel.org, memxor@gmail.com, akpm@linux-foundation.org,
peterz@infradead.org, rostedt@goodmis.org, hannes@cmpxchg.org
Subject: Re: [PATCH v2 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
Date: Fri, 11 Jul 2025 16:36:30 +0900 [thread overview]
Message-ID: <aHC-_upDSW_Twplc@hyeyoo> (raw)
In-Reply-To: <20250709015303.8107-7-alexei.starovoitov@gmail.com>
On Tue, Jul 08, 2025 at 06:53:03PM -0700, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> kmalloc_nolock() relies on ability of local_lock to detect the situation
> when it's locked.
> In !PREEMPT_RT local_lock_is_locked() is true only when NMI happened in
> irq saved region that protects _that specific_ per-cpu kmem_cache_cpu.
> In that case retry the operation in a different kmalloc bucket.
> The second attempt will likely succeed, since this cpu locked
> different kmem_cache_cpu.
>
> Similarly, in PREEMPT_RT local_lock_is_locked() returns true when
> per-cpu rt_spin_lock is locked by current task. In this case re-entrance
> into the same kmalloc bucket is unsafe, and kmalloc_nolock() tries
> a different bucket that is most likely is not locked by the current
> task. Though it may be locked by a different task it's safe to
> rt_spin_lock() on it.
>
> Similar to alloc_pages_nolock() the kmalloc_nolock() returns NULL
> immediately if called from hard irq or NMI in PREEMPT_RT.
>
> kfree_nolock() defers freeing to irq_work when local_lock_is_locked()
> and in_nmi() or in PREEMPT_RT.
Does that mean in mm/Kconfig SLUB now needs to select IRQ_WORK?
> SLUB_TINY config doesn't use local_lock_is_locked() and relies on
> spin_trylock_irqsave(&n->list_lock) to allocate while kfree_nolock()
> always defers to irq_work.
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
> include/linux/kasan.h | 13 +-
> include/linux/slab.h | 4 +
> mm/kasan/common.c | 5 +-
> mm/slub.c | 330 ++++++++++++++++++++++++++++++++++++++----
> 4 files changed, 319 insertions(+), 33 deletions(-)
>
> diff --git a/include/linux/kasan.h b/include/linux/kasan.h
> index 890011071f2b..acdc8cb0152e 100644
> --- a/include/linux/kasan.h
> +++ b/include/linux/kasan.h
> @@ -200,7 +200,7 @@ static __always_inline bool kasan_slab_pre_free(struct kmem_cache *s,
> }
>
> bool __kasan_slab_free(struct kmem_cache *s, void *object, bool init,
> - bool still_accessible);
> + bool still_accessible, bool no_quarantine);
> /**
> * kasan_slab_free - Poison, initialize, and quarantine a slab object.
> * @object: Object to be freed.
> @@ -1982,6 +1983,7 @@ static inline void init_slab_obj_exts(struct slab *slab)
> int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
> gfp_t gfp, bool new_slab)
> {
> + bool allow_spin = gfpflags_allow_spinning(gfp);
> unsigned int objects = objs_per_slab(s, slab);
> unsigned long new_exts;
> unsigned long old_exts;
> @@ -1990,8 +1992,14 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
> gfp &= ~OBJCGS_CLEAR_MASK;
> /* Prevent recursive extension vector allocation */
> gfp |= __GFP_NO_OBJ_EXT;
> - vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
> - slab_nid(slab));
> + if (unlikely(!allow_spin)) {
> + size_t sz = objects * sizeof(struct slabobj_ext);
> +
> + vec = kmalloc_nolock(sz, __GFP_ZERO, slab_nid(slab));
Missing memset()?
as there is no kcalloc_nolock()...
> + } else {
> + vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
> + slab_nid(slab));
> + }
> if (!vec) {
> /* Mark vectors which failed to allocate */
> if (new_slab)
> @@ -3911,6 +3953,12 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> void *flush_freelist = c->freelist;
> struct slab *flush_slab = c->slab;
>
> + if (unlikely(!allow_spin))
> + /*
> + * Reentrant slub cannot take locks
> + * necessary for deactivate_slab()
> + */
> + return NULL;
> c->slab = NULL;
> c->freelist = NULL;
> c->tid = next_tid(c->tid);
> @@ -4555,6 +4707,53 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
> discard_slab(s, slab);
> }
>
> +static DEFINE_PER_CPU(struct llist_head, defer_free_objects);
> +static DEFINE_PER_CPU(struct irq_work, defer_free_work);
> +
> +static void free_deferred_objects(struct irq_work *work)
> +{
> + struct llist_head *llhead = this_cpu_ptr(&defer_free_objects);
> + struct llist_node *llnode, *pos, *t;
> +
> + if (llist_empty(llhead))
> + return;
> +
> + llnode = llist_del_all(llhead);
> + llist_for_each_safe(pos, t, llnode) {
> + struct kmem_cache *s;
> + struct slab *slab;
> + void *x = pos;
> +
> + slab = virt_to_slab(x);
> + s = slab->slab_cache;
> +
> + /*
> + * memcg, kasan_slab_pre are already done for 'x'.
> + * The only thing left is kasan_poison.
> + */
> + kasan_slab_free(s, x, false, false, true);
> + __slab_free(s, slab, x, x, 1, _THIS_IP_);
> + }
> +}
> +
> +static void defer_free(void *head)
> +{
> + if (llist_add(head, this_cpu_ptr(&defer_free_objects)))
> + irq_work_queue(this_cpu_ptr(&defer_free_work));
By adding it to the lockless list, it's overwriting freed objects,
and it's not always safe.
Looking at calculate_sizes():
if (((flags & SLAB_TYPESAFE_BY_RCU) && !args->use_freeptr_offset) ||
(flags & SLAB_POISON) || s->ctor ||
((flags & SLAB_RED_ZONE) &&
(s->object_size < sizeof(void *) || slub_debug_orig_size(s)))) {
/*
* Relocate free pointer after the object if it is not
* permitted to overwrite the first word of the object on
* kmem_cache_free.
*
* This is the case if we do RCU, have a constructor or
* destructor, are poisoning the objects, or are
* redzoning an object smaller than sizeof(void *) or are
* redzoning an object with slub_debug_orig_size() enabled,
* in which case the right redzone may be extended.
*
* The assumption that s->offset >= s->inuse means free
* pointer is outside of the object is used in the
* freeptr_outside_object() function. If that is no
* longer true, the function needs to be modified.
*/
s->offset = size;
size += sizeof(void *);
Only sizeof(void *) bytes from object + s->offset is always safe to overwrite.
So either 1) teach defer_free() that it needs to use s->offset for each
object, instead of zero (and that the list can have objects from
different caches), or 2) introduce per-cache per-CPU lockless lists?
--
Cheers,
Harry / Hyeonggon
next prev parent reply other threads:[~2025-07-11 7:37 UTC|newest]
Thread overview: 38+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-07-09 1:52 [PATCH v2 0/6] slab: Re-entrant kmalloc_nolock() Alexei Starovoitov
2025-07-09 1:52 ` [PATCH v2 1/6] locking/local_lock: Expose dep_map in local_trylock_t Alexei Starovoitov
2025-07-11 8:02 ` Sebastian Andrzej Siewior
2025-07-09 1:52 ` [PATCH v2 2/6] locking/local_lock: Introduce local_lock_is_locked() Alexei Starovoitov
2025-07-11 7:52 ` Sebastian Andrzej Siewior
2025-07-09 1:53 ` [PATCH v2 3/6] locking/local_lock: Introduce local_lock_lockdep_start/end() Alexei Starovoitov
2025-07-11 7:50 ` Sebastian Andrzej Siewior
2025-07-11 9:55 ` Vlastimil Babka
2025-07-11 15:17 ` Sebastian Andrzej Siewior
2025-07-11 15:23 ` Vlastimil Babka
2025-07-12 2:19 ` Alexei Starovoitov
2025-07-14 11:06 ` Sebastian Andrzej Siewior
2025-07-14 15:35 ` Vlastimil Babka
2025-07-14 15:54 ` Sebastian Andrzej Siewior
2025-07-14 17:52 ` Alexei Starovoitov
2025-07-14 18:33 ` Vlastimil Babka
2025-07-14 18:46 ` Alexei Starovoitov
2025-07-15 6:56 ` Vlastimil Babka
2025-07-15 17:29 ` Alexei Starovoitov
2025-07-15 17:48 ` Vlastimil Babka
2025-07-15 21:00 ` Alexei Starovoitov
2025-07-09 1:53 ` [PATCH v2 4/6] mm: Allow GFP_ACCOUNT to be used in alloc_pages_nolock() Alexei Starovoitov
2025-07-09 14:20 ` Vlastimil Babka
2025-07-09 1:53 ` [PATCH v2 5/6] mm: Introduce alloc_frozen_pages_nolock() Alexei Starovoitov
2025-07-09 14:21 ` Vlastimil Babka
2025-07-09 1:53 ` [PATCH v2 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock() Alexei Starovoitov
2025-07-10 9:36 ` Vlastimil Babka
2025-07-10 10:21 ` Harry Yoo
2025-07-10 15:05 ` Vlastimil Babka
2025-07-10 19:13 ` Alexei Starovoitov
2025-07-11 6:06 ` Harry Yoo
2025-07-11 10:30 ` Vlastimil Babka
2025-07-12 1:55 ` Alexei Starovoitov
2025-07-10 19:21 ` Alexei Starovoitov
2025-07-11 7:26 ` Sebastian Andrzej Siewior
2025-07-11 7:36 ` Harry Yoo [this message]
2025-07-11 7:40 ` Harry Yoo
2025-07-11 10:48 ` Vlastimil Babka
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aHC-_upDSW_Twplc@hyeyoo \
--to=harry.yoo@oracle.com \
--cc=akpm@linux-foundation.org \
--cc=alexei.starovoitov@gmail.com \
--cc=andrii@kernel.org \
--cc=bigeasy@linutronix.de \
--cc=bpf@vger.kernel.org \
--cc=hannes@cmpxchg.org \
--cc=linux-mm@kvack.org \
--cc=memxor@gmail.com \
--cc=mhocko@suse.com \
--cc=peterz@infradead.org \
--cc=rostedt@goodmis.org \
--cc=shakeel.butt@linux.dev \
--cc=vbabka@suse.cz \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.