Date: Thu, 16 Apr 2026 12:05:24 +0200
From: "Vlastimil Babka (SUSE)"
Subject: Re: [RFC] making nested spin_trylock() work on UP?
To: "Harry Yoo (Oracle)", Matthew Wilcox
Cc: Vlastimil Babka, Peter Zijlstra, Ingo Molnar, Will Deacon, Sebastian Andrzej Siewior, LKML, linux-mm@kvack.org, Linus Torvalds, Waiman Long, Mel Gorman, Steven Rostedt, Alexei Starovoitov, Hao Li, Andrew Morton, Suren Baghdasaryan, Michal Hocko, Brendan Jackman, Johannes Weiner, Zi Yan, Christoph Lameter, David Rientjes, Roman Gushchin
X-Mailing-List: linux-kernel@vger.kernel.org

On 4/15/26 20:44, Harry Yoo (Oracle) wrote:
> [+Cc Alexei for _nolock() APIs]
> [+Cc SLAB ALLOCATOR and PAGE ALLOCATOR folks]
>
> I was testing kmalloc_nolock() on UP and I think
> I've run into a similar issue...
>
> On Sat, Feb 14, 2026 at 06:28:43AM +0000, Matthew Wilcox wrote:
>> On Fri, Feb 13, 2026 at 12:57:43PM +0100, Vlastimil Babka wrote:
>> > The page allocator has been using a locking scheme for its percpu page
>> > caches (pcp) for years now, based on spin_trylock() with no _irqsave() part.
>> > The point is that if we interrupt the locked section, we fail the trylock
>> > and just fall back to something that's more expensive, but it's rare so we
>> > don't need to pay the irqsave cost all the time in the fastpaths.
>> >
>> > It's similar to but not exactly local_trylock_t (which is also newer anyway)
>> > because in some cases we do lock the pcp of a non-local cpu to flush it, in
>> > a way that's cheaper than IPI or queue_work_on().
>> >
>> > The complication of this scheme has been the UP non-debug spinlock
>> > implementation, which assumes spin_trylock() can't fail on UP and has no
>> > state to track it. It just doesn't anticipate this usage scenario.
>
> This is not the only scenario that doesn't work.
>
> I was testing the "calling {kmalloc,kfree}_nolock() in an NMI handler
> while the CPU is calling kmalloc() & kfree()" [1] scenario.
>
> Weirdly it's broken (dmesg at the end of the email) on UP since v6.18,
> where the {kmalloc,kfree}_nolock() APIs were introduced.
>
> [1] https://lore.kernel.org/linux-mm/20260406090907.11710-3-harry@kernel.org
>
>> > So to
>> > work around that we disable IRQs on UP, complicating the implementation.
>> > Also recently we found a years-old bug in the implementation - see
>> > 038a102535eb ("mm/page_alloc: prevent pcp corruption with SMP=n").
>
> In the case mentioned above, disabling IRQs doesn't work as the handler
> can be called in an NMI context.

IIRC for the BPF usecases of kmalloc_nolock() I think there could also be
some kprobe context somewhere in the locked section.

> {kmalloc,kfree}_nolock()->spin_trylock_irqsave() can succeed on UP
> when the CPU already acquired the spinlock w/ IRQs disabled.
>
>> > So my question is if we could have a spinlock implementation supporting this
>> > nested spin_trylock() usage, or if the UP optimization is still considered
>> > too important to lose it. I was thinking:
>> >
>> > - remove the UP implementation completely - would it increase the overhead
>> >   on SMP=n systems too much and do we still care?
>> >
>> > - make the non-debug implementation a bit like the debug one so we do have
>> >   the 'locked' state (see include/linux/spinlock_up.h and lock->slock). This
>> >   also adds some overhead but not as much as the full SMP implementation?
>>
>> What if we use an atomic_t on UP to simulate there being a spinlock,
>> but only for pcp?
>> Your demo shows pcp_spin_trylock() continuing to
>> exist, so how about doing something like:
>>
>> #ifdef CONFIG_SMP
>> #define pcp_spin_trylock(ptr)						\
>> ({									\
>> 	struct per_cpu_pages *__ret;					\
>> 	__ret = pcpu_spin_trylock(struct per_cpu_pages, lock, ptr);	\
>> 	__ret;								\
>> })
>> #else
>> static atomic_t pcp_UP_lock = ATOMIC_INIT(0);
>> #define pcp_spin_trylock(ptr)						\
>> ({									\
>> 	int __old = 0;							\
>> 	struct per_cpu_pages *__ret = NULL;				\
>> 	if (atomic_try_cmpxchg(&pcp_UP_lock, &__old, 1))		\
>> 		__ret = (void *)&pcp_UP_lock;				\
>> 	__ret;								\
>> })
>> #endif
>>
>> (obviously you need pcp_spin_lock/pcp_spin_unlock also defined)
>>
>> That only costs us 4 extra bytes on UP, rather than 4 bytes per spinlock.
>> And some people still use routers with tiny amounts of memory and a
>> single CPU, or retrocomputers with single CPUs.
>
> I think we need a special spinlock type that wraps something like this
> and use it when spinlocks can be trylock'd in an unknown context:
> the pcp lock, zone lock, per-node partial slab list lock,
> per-node barn lock, etc.

Sounds like a lot of hassle for a niche config (SMP=n) where nobody would
use e.g. bpf tracing anyway. We already have this in kmalloc_nolock():

	/*
	 * See the comment for the same check in
	 * alloc_frozen_pages_nolock_noprof()
	 */
	if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq()))
		return NULL;

It would be trivial to extend this to !SMP. However it wouldn't cover the
kprobe context. Any idea, Alexei?

> dmesg here, HEAD is a commit that adds the test case, on top of
> commit af92793e52c3a ("slab: Introduce kmalloc_nolock() and
> kfree_nolock()."):
>
>> [    3.658916] ------------[ cut here ]------------
>> [    3.659492] perf: interrupt took too long (5015 > 5005), lowering kernel.perf_event_max_sample_rate to 39000
>> [    3.660800] kernel BUG at mm/slub.c:4382!
>
> This is the BUG_ON(new.frozen) in freeze_slab(), which implies that
> somebody else has already taken the slab off the list and frozen it
> (which should have been prevented by the spinlock).
>
>> [    3.661674] Oops: invalid opcode: 0000 [#1] NOPTI
>> [    3.662427] CPU: 0 UID: 0 PID: 256 Comm: kunit_try_catch Tainted: G            E    N   6.17.0-rc3+ #24 PREEMPT_LAZY
>> [    3.663270] Tainted: [E]=UNSIGNED_MODULE, [N]=TEST
>> [    3.663658] Hardware name: QEMU Ubuntu 24.04 PC v2 (i440FX + PIIX, arch_caps fix, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
>> [    3.664571] RIP: 0010:___slab_alloc (mm/slub.c:4382 (discriminator 1) mm/slub.c:4599 (discriminator 1))
>> [    3.664949] Code: 4c 24 78 e8 32 cc ff ff 84 c0 0f 85 09 fa ff ff 49 8b 4c 24 28 4d 8b 6c 24 20 48 89 c8 48 89 4c 24 78 48 c1 e8 18 84 c0 79 b3 <0f> 0b 41 8b 46 10 a9 87 04 00 00 74 a1 a8 80 75 24 49 89 dd e9 09