* [PATCH v2 1/8] dma-debug: Allow multiple invocations of overlapping entries
From: Leon Romanovsky @ 2026-03-11 19:08 UTC (permalink / raw)
To: Marek Szyprowski, Robin Murphy, Michael S. Tsirkin, Petr Tesarik,
Jonathan Corbet, Shuah Khan, Jason Wang, Xuan Zhuo,
Eugenio Pérez, Jason Gunthorpe, Leon Romanovsky,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Joerg Roedel,
Will Deacon, Andrew Morton
Cc: iommu, linux-kernel, linux-doc, virtualization, linux-rdma,
linux-trace-kernel, linux-mm
In-Reply-To: <20260311-dma-debug-overlap-v2-0-e00bc2ca346d@nvidia.com>
From: Leon Romanovsky <leonro@nvidia.com>
Repeated DMA mappings with DMA_ATTR_CPU_CACHE_CLEAN trigger the
following splat. This prevents using the attribute in cases where a DMA
region is shared and reused more than seven times.
------------[ cut here ]------------
DMA-API: exceeded 7 overlapping mappings of cacheline 0x000000000438c440
WARNING: kernel/dma/debug.c:467 at add_dma_entry+0x219/0x280, CPU#4: ibv_rc_pingpong/1644
Modules linked in: xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink iptable_nat nf_nat xt_addrtype br_netfilter rpcsec_gss_krb5 auth_rpcgss oid_registry overlay mlx5_fwctl zram zsmalloc mlx5_ib fuse rpcrdma rdma_ucm ib_uverbs ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm mlx5_core ib_core
CPU: 4 UID: 2733 PID: 1644 Comm: ibv_rc_pingpong Not tainted 6.19.0+ #129 PREEMPT
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:add_dma_entry+0x221/0x280
Code: c0 0f 84 f2 fe ff ff 83 e8 01 89 05 6d 99 11 01 e9 e4 fe ff ff 0f 8e 1f ff ff ff 48 8d 3d 07 ef 2d 01 be 07 00 00 00 48 89 e2 <67> 48 0f b9 3a e9 06 ff ff ff 48 c7 c7 98 05 2b 82 c6 05 72 92 28
RSP: 0018:ff1100010e657970 EFLAGS: 00010002
RAX: 0000000000000007 RBX: ff1100010234eb00 RCX: 0000000000000000
RDX: ff1100010e657970 RSI: 0000000000000007 RDI: ffffffff82678660
RBP: 000000000438c440 R08: 0000000000000228 R09: 0000000000000000
R10: 00000000000001be R11: 000000000000089d R12: 0000000000000800
R13: 00000000ffffffef R14: 0000000000000202 R15: ff1100010234eb00
FS: 00007fb15f3f6740(0000) GS:ff110008dcc19000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fb15f32d3a0 CR3: 0000000116f59001 CR4: 0000000000373eb0
Call Trace:
<TASK>
debug_dma_map_sg+0x1b4/0x390
__dma_map_sg_attrs+0x6d/0x1a0
dma_map_sgtable+0x19/0x30
ib_umem_get+0x284/0x3b0 [ib_uverbs]
mlx5_ib_reg_user_mr+0x68/0x2a0 [mlx5_ib]
ib_uverbs_reg_mr+0x17f/0x2a0 [ib_uverbs]
ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xc2/0x130 [ib_uverbs]
ib_uverbs_cmd_verbs+0xa0b/0xae0 [ib_uverbs]
? ib_uverbs_handler_UVERBS_METHOD_QUERY_PORT_SPEED+0xe0/0xe0 [ib_uverbs]
? mmap_region+0x7a/0xb0
? do_mmap+0x3b8/0x5c0
ib_uverbs_ioctl+0xa7/0x110 [ib_uverbs]
__x64_sys_ioctl+0x14f/0x8b0
? ksys_mmap_pgoff+0xc5/0x190
do_syscall_64+0x8c/0xbf0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7fb15f5e4eed
Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
RSP: 002b:00007ffe09a5c540 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007ffe09a5c5d0 RCX: 00007fb15f5e4eed
RDX: 00007ffe09a5c5f0 RSI: 00000000c0181b01 RDI: 0000000000000003
RBP: 00007ffe09a5c590 R08: 0000000000000028 R09: 00007ffe09a5c794
R10: 0000000000000001 R11: 0000000000000246 R12: 00007ffe09a5c794
R13: 000000000000000c R14: 0000000025a49170 R15: 000000000000000c
</TASK>
---[ end trace 0000000000000000 ]---
Fixes: 61868dc55a11 ("dma-mapping: add DMA_ATTR_CPU_CACHE_CLEAN")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
kernel/dma/debug.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/kernel/dma/debug.c b/kernel/dma/debug.c
index 86f87e43438c3..be207be749968 100644
--- a/kernel/dma/debug.c
+++ b/kernel/dma/debug.c
@@ -453,7 +453,7 @@ static int active_cacheline_set_overlap(phys_addr_t cln, int overlap)
return overlap;
}
-static void active_cacheline_inc_overlap(phys_addr_t cln)
+static void active_cacheline_inc_overlap(phys_addr_t cln, bool is_cache_clean)
{
int overlap = active_cacheline_read_overlap(cln);
@@ -462,7 +462,7 @@ static void active_cacheline_inc_overlap(phys_addr_t cln)
/* If we overflowed the overlap counter then we're potentially
* leaking dma-mappings.
*/
- WARN_ONCE(overlap > ACTIVE_CACHELINE_MAX_OVERLAP,
+ WARN_ONCE(!is_cache_clean && overlap > ACTIVE_CACHELINE_MAX_OVERLAP,
pr_fmt("exceeded %d overlapping mappings of cacheline %pa\n"),
ACTIVE_CACHELINE_MAX_OVERLAP, &cln);
}
@@ -495,7 +495,7 @@ static int active_cacheline_insert(struct dma_debug_entry *entry,
if (rc == -EEXIST) {
struct dma_debug_entry *existing;
- active_cacheline_inc_overlap(cln);
+ active_cacheline_inc_overlap(cln, entry->is_cache_clean);
existing = radix_tree_lookup(&dma_active_cacheline, cln);
/* A lookup failure here after we got -EEXIST is unexpected. */
WARN_ON(!existing);
--
2.53.0
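For readers who want to see the behavioural change in isolation, here is a minimal user-space sketch, not the kernel code. It assumes the per-cacheline counter stops being stored once it would pass the limit of 7 shown in the splat, and that the only change is that the warning is skipped for cache-clean mappings. The names cacheline_model, model_set_overlap() and model_inc_overlap(), and the warned flag standing in for WARN_ONCE(), are hypothetical.

```c
#include <stdbool.h>
#include <stdio.h>

#define MODEL_MAX_OVERLAP 7	/* mirrors the "7" printed in the splat */

/* Hypothetical stand-in for the per-cacheline tracking state. */
struct cacheline_model {
	int overlap;	/* extra mappings currently recorded */
	bool warned;	/* stands in for WARN_ONCE() having fired */
};

/* Store the new count only when it fits; always return the request. */
static int model_set_overlap(struct cacheline_model *cl, int overlap)
{
	if (overlap > MODEL_MAX_OVERLAP || overlap < 0)
		return overlap;
	cl->overlap = overlap;
	return overlap;
}

/* With is_cache_clean set, exceeding the limit is silently tolerated. */
static void model_inc_overlap(struct cacheline_model *cl, bool is_cache_clean)
{
	int overlap = model_set_overlap(cl, cl->overlap + 1);

	if (!is_cache_clean && overlap > MODEL_MAX_OVERLAP && !cl->warned) {
		cl->warned = true;
		fprintf(stderr, "exceeded %d overlapping mappings\n",
			MODEL_MAX_OVERLAP);
	}
}
```

Calling model_inc_overlap() ten times with is_cache_clean set leaves warned false, while the same sequence without the flag sets warned on the eighth call.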
* [PATCH v2 0/8] RDMA: Enable operation with DMA debug enabled
From: Leon Romanovsky @ 2026-03-11 19:08 UTC (permalink / raw)
To: Marek Szyprowski, Robin Murphy, Michael S. Tsirkin, Petr Tesarik,
Jonathan Corbet, Shuah Khan, Jason Wang, Xuan Zhuo,
Eugenio Pérez, Jason Gunthorpe, Leon Romanovsky,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Joerg Roedel,
Will Deacon, Andrew Morton
Cc: iommu, linux-kernel, linux-doc, virtualization, linux-rdma,
linux-trace-kernel, linux-mm
Add a new DMA_ATTR_REQUIRE_COHERENT attribute to the DMA API to mark
mappings that must run on a DMA-coherent system. Such buffers cannot
use the SWIOTLB path, may overlap with CPU caches, and do not depend on
explicit cache flushing.
Mappings using this attribute are rejected on systems where cache
side-effects could lead to data corruption, and therefore do not need
the cache-overlap debugging logic. This series also includes fixes for
DMA_ATTR_CPU_CACHE_CLEAN handling.
Thanks.
---
Changes in v2:
- Added DMA_ATTR_REQUIRE_COHERENT attribute
- Added HMM patch which needs this attribute as well
- Renamed DMA_ATTR_CPU_CACHE_CLEAN to be DMA_ATTR_DEBUGGING_IGNORE_CACHELINES
- Link to v1: https://patch.msgid.link/20260307-dma-debug-overlap-v1-0-c034c38872af@nvidia.com
---
Leon Romanovsky (8):
dma-debug: Allow multiple invocations of overlapping entries
dma-mapping: handle DMA_ATTR_CPU_CACHE_CLEAN in trace output
dma-mapping: Clarify valid conditions for CPU cache line overlap
dma-mapping: Introduce DMA require coherency attribute
dma-direct: prevent SWIOTLB path when DMA_ATTR_REQUIRE_COHERENT is set
iommu/dma: add support for DMA_ATTR_REQUIRE_COHERENT attribute
RDMA/umem: Tell DMA mapping that UMEM requires coherency
mm/hmm: Indicate that HMM requires DMA coherency
Documentation/core-api/dma-attributes.rst | 34 +++++++++++++++++++++++--------
drivers/infiniband/core/umem.c | 5 +++--
drivers/iommu/dma-iommu.c | 21 +++++++++++++++----
drivers/virtio/virtio_ring.c | 10 ++++-----
include/linux/dma-mapping.h | 15 ++++++++++----
include/trace/events/dma.h | 4 +++-
kernel/dma/debug.c | 9 ++++----
kernel/dma/direct.h | 7 ++++---
kernel/dma/mapping.c | 6 ++++++
mm/hmm.c | 4 ++--
10 files changed, 82 insertions(+), 33 deletions(-)
---
base-commit: 11439c4635edd669ae435eec308f4ab8a0804808
change-id: 20260305-dma-debug-overlap-21487c3fa02c
Best regards,
--
Leon Romanovsky <leonro@nvidia.com>
* Re: [RFC PATCH v2 0/3] disable optimistic spinning for ftrace_lock
From: David Laight @ 2026-03-11 19:05 UTC (permalink / raw)
To: Steven Rostedt
Cc: Yafang Shao, Peter Zijlstra, mingo, will, boqun, longman,
mhiramat, mark.rutland, mathieu.desnoyers, linux-kernel,
linux-trace-kernel
In-Reply-To: <20260311130743.63c997ec@gandalf.local.home>
On Wed, 11 Mar 2026 13:07:43 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> On Wed, 11 Mar 2026 21:40:32 +0800
> Yafang Shao <laoar.shao@gmail.com> wrote:
>
> > > The code needs to drop the ftrace_lock across t_show.
> >
> > It's unclear whether we can safely release ftrace_lock within t_show —
> > doing so would probably necessitate a major redesign of the current
> > implementation.
>
> The issue isn't t_show, it's the calls between t_start and t_next and
> subsequent t_next calls, which need to keep a consistent state. t_show
> just happens to be called in between them.
>
> >
> > >
> > > Although there is a bigger issue of why on earth the code is reading the
> > > list of filter functions at all - never mind all the time.
> >
> > bpftrace reads the complete list of available functions into
> > userspace, then performs matching against the target function to
> > determine if it is traceable.
>
> Could it parse it in smaller bits? That is, the lock is held only during an
> individual read system call. If it reads the available_filter_functions
> file via smaller buffers, it would not hold the lock for as long.
But the expensive part is probably looking up the symbol name.
Shorter reads would hand the lock off to the other process, but overall
the lock would still be held for the same length of time.
How does the code work out where to start from for each read system call?
Couldn't you (effectively) do one symbol at a time the same way?
Another option would be to put a 'generation number' on the list of functions
(after all it doesn't change that often).
Then you can release the lock, generate the data, re-acquire the lock and
check the generation number hasn't changed.
If it hasn't changed you can carry on processing the list using the
same pointer.
If the generation number has changed terminate the read and worry about
locating the correct start position for the next read.
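A rough user-space sketch of the generation-number idea, using a pthread mutex in place of ftrace_lock. Everything here (struct func_list, func_list_walk(), count_emit()) is hypothetical and only illustrates the locking pattern. It also assumes the entry being emitted stays valid while the lock is dropped, which real code would have to guarantee separately.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical list protected by a lock plus a change counter. */
struct func_list {
	pthread_mutex_t lock;
	unsigned long generation;	/* bumped on every modification */
	const char **names;
	size_t count;
};

/* Writers call this, with the lock held, after changing the list. */
static void func_list_mark_changed(struct func_list *fl)
{
	fl->generation++;
}

/* Example emit callback: just counts the entries it is handed. */
static void count_emit(const char *name, void *ctx)
{
	(void)name;
	(*(int *)ctx)++;
}

/*
 * Walk entries from *pos. The lock is dropped around emit() (the slow
 * part) and the generation is re-checked once the lock is re-taken.
 * Returns true when the end was reached, false when the list changed
 * and the caller has to re-locate its position before the next read.
 */
static bool func_list_walk(struct func_list *fl, size_t *pos,
			   void (*emit)(const char *name, void *ctx),
			   void *ctx)
{
	pthread_mutex_lock(&fl->lock);
	unsigned long gen = fl->generation;

	while (*pos < fl->count) {
		const char *name = fl->names[*pos];

		pthread_mutex_unlock(&fl->lock);
		emit(name, ctx);		/* lock not held here */
		pthread_mutex_lock(&fl->lock);

		if (fl->generation != gen) {
			pthread_mutex_unlock(&fl->lock);
			return false;
		}
		(*pos)++;
	}
	pthread_mutex_unlock(&fl->lock);
	return true;
}
```

The lock is still taken once per entry, but it is not held across the expensive step, so other waiters are blocked only for the short bookkeeping sections.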
David
>
> -- Steve
>
* Re: [RFC PATCH v2 0/3] disable optimistic spinning for ftrace_lock
From: Steven Rostedt @ 2026-03-11 17:39 UTC (permalink / raw)
To: Yafang Shao
Cc: David Laight, Peter Zijlstra, mingo, will, boqun, longman,
mhiramat, mark.rutland, mathieu.desnoyers, linux-kernel,
linux-trace-kernel
In-Reply-To: <20260311130743.63c997ec@gandalf.local.home>
On Wed, 11 Mar 2026 13:07:43 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> > bpftrace reads the complete list of available functions into
> > userspace, then performs matching against the target function to
> > determine if it is traceable.
>
> Could it parse it in smaller bits? That is, the lock is held only during an
> individual read system call. If it reads the available_filter_functions
> file via smaller buffers, it would not hold the lock for as long
Hmm, I guess this wouldn't help much. I ran:
trace-cmd record -e sys_enter_read -e sys_exit_read -F cat /sys/kernel/tracing/available_filter_functions > /dev/null
And trace-cmd report shows:
[..]
cat-1208 [001] ..... 142.025582: sys_enter_read: fd: 0x00000003, buf: 0x7fa9daf21000, count: 0x00040000
cat-1208 [001] ..... 142.025995: sys_exit_read: 0xfee
cat-1208 [001] ..... 142.026000: sys_enter_read: fd: 0x00000003, buf: 0x7fa9daf21000, count: 0x00040000
cat-1208 [001] ..... 142.026392: sys_exit_read: 0xff8
cat-1208 [001] ..... 142.026396: sys_enter_read: fd: 0x00000003, buf: 0x7fa9daf21000, count: 0x00040000
cat-1208 [001] ..... 142.026766: sys_exit_read: 0xfed
cat-1208 [001] ..... 142.026770: sys_enter_read: fd: 0x00000003, buf: 0x7fa9daf21000, count: 0x00040000
cat-1208 [001] ..... 142.027113: sys_exit_read: 0xfe0
cat-1208 [001] ..... 142.027117: sys_enter_read: fd: 0x00000003, buf: 0x7fa9daf21000, count: 0x00040000
cat-1208 [001] ..... 142.027502: sys_exit_read: 0xfec
cat-1208 [001] ..... 142.027506: sys_enter_read: fd: 0x00000003, buf: 0x7fa9daf21000, count: 0x00040000
[..]
which shows that even though the read buffer size is 0x40000, the size read
is just 0xff8. So the buffer being read and returned is never more than a page.
Unless you are running on powerpc, where the page is likely 64K.
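The same measurement can be approximated from user space without trace-cmd. The helper below is an illustrative sketch (largest_read() is a made-up name): it reads a file to EOF with a caller-chosen buffer size and reports the largest count any single read() returned. Pointed at available_filter_functions with a 0x40000 buffer, it would be expected to report values like the ones in the trace above, but that depends on the kernel and architecture.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Read `path` to EOF using a buffer of `bufsize` bytes and return the
 * largest byte count any single read() returned, or -1 on error.
 * If `verbose` is non-zero, print each read's return value in hex.
 */
static ssize_t largest_read(const char *path, size_t bufsize, int verbose)
{
	char *buf = malloc(bufsize);
	ssize_t n, max = 0;
	int fd;

	if (!buf)
		return -1;

	fd = open(path, O_RDONLY);
	if (fd < 0) {
		free(buf);
		return -1;
	}

	while ((n = read(fd, buf, bufsize)) > 0) {
		if (verbose)
			printf("read() = 0x%zx\n", (size_t)n);
		if (n > max)
			max = n;
	}

	close(fd);
	free(buf);
	return n < 0 ? -1 : max;
}
```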
-- Steve
* Re: [RFC PATCH v2 0/3] disable optimistic spinning for ftrace_lock
From: Steven Rostedt @ 2026-03-11 17:07 UTC (permalink / raw)
To: Yafang Shao
Cc: David Laight, Peter Zijlstra, mingo, will, boqun, longman,
mhiramat, mark.rutland, mathieu.desnoyers, linux-kernel,
linux-trace-kernel
In-Reply-To: <CALOAHbDqYjJngQmmOaPRA=k4Bb8Or39YNp5R98f_op4dti2_TQ@mail.gmail.com>
On Wed, 11 Mar 2026 21:40:32 +0800
Yafang Shao <laoar.shao@gmail.com> wrote:
> > The code needs to drop the ftrace_lock across t_show.
>
> It's unclear whether we can safely release ftrace_lock within t_show —
> doing so would probably necessitate a major redesign of the current
> implementation.
The issue isn't t_show, it's the calls between t_start and t_next and
subsequent t_next calls, which need to keep a consistent state. t_show
just happens to be called in between them.
>
> >
> > Although there is a bigger issue of why on earth the code is reading the
> > list of filter functions at all - never mind all the time.
>
> bpftrace reads the complete list of available functions into
> userspace, then performs matching against the target function to
> determine if it is traceable.
Could it parse it in smaller bits? That is, the lock is held only during an
individual read system call. If it reads the available_filter_functions
file via smaller buffers, it would not hold the lock for as long.
-- Steve
* Re: [net-next,1/2] mptcp: better mptcp-level RTT estimator
From: Matthieu Baerts @ 2026-03-11 16:27 UTC (permalink / raw)
To: Jakub Kicinski
Cc: martineau, davem, netdev, pabeni, fw, horms, edumazet,
linux-kernel, mptcp, geliang, mhiramat, linux-trace-kernel,
rostedt, mathieu.desnoyers
In-Reply-To: <20260311024547.361027-1-kuba@kernel.org>
Hi Jakub, Claude,
On 11/03/2026 03:45, Jakub Kicinski wrote:
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
Thank you for forwarding this. The review is indeed valid, so a v2 is
required.
Cheers,
Matt
--
Sponsored by the NGI0 Core fund.
* Re: [PATCH v13 03/32] ring-buffer: Introduce ring-buffer remotes
From: Vincent Donnefort @ 2026-03-11 15:23 UTC (permalink / raw)
To: Markus Elfring
Cc: linux-trace-kernel, kernel-team, kvmarm, linux-arm-kernel,
Joey Gouly, Marc Zyngier, Masami Hiramatsu, Mathieu Desnoyers,
Oliver Upton, Steven Rostedt, Suzuki Poulouse, Zenghui Yu, LKML,
Aneesh Kumar K.V, John Stultz, Quentin Perret, Will Deacon
In-Reply-To: <677c7ad6-6e67-4011-b2d0-03d0d58547ce@web.de>
On Fri, Mar 06, 2026 at 05:37:35PM +0100, Markus Elfring wrote:
> …
> > It is expected from the remote to keep the meta-page updated.
>
> See also once more:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v7.0-rc2#n94
>
>
> …
> > +++ b/kernel/trace/ring_buffer.c
> …
> > +int ring_buffer_poll_remote(struct trace_buffer *buffer, int cpu)
> > +{
> …
> > + cpus_read_lock();
> > +
> > + /*
> > + * Make sure all the ring buffers are up to date before we start reading
> > + * them.
> > + */
> > + for_each_buffer_cpu(buffer, cpu) {
> …
> > + }
> > +
> > + cpus_read_unlock();
> > +
> > + return 0;
> > +}
> …
>
> What do you think about using another lock guard here?
> https://elixir.bootlin.com/linux/v7.0-rc1/source/include/linux/cpuhplock.h#L48
Sorry, I forgot to reply to you. I had to respin a new version so I have made
the changes you've suggested.
Thanks,
Vincent
>
> Regards,
> Markus
* Re: [PATCH 01/61] Coccinelle: Prefer IS_ERR_OR_NULL over manual NULL check
From: Markus Elfring @ 2026-03-11 15:12 UTC (permalink / raw)
To: Philipp Hahn, cocci, Julia Lawall, Nicolas Palix
Cc: amd-gfx, apparmor, bpf, ceph-devel, dm-devel, dri-devel, gfs2,
intel-gfx, intel-wired-lan, iommu, kvm, linux-arm-kernel,
linux-block, linux-bluetooth, linux-btrfs, linux-cifs, linux-clk,
linux-erofs, linux-ext4, linux-fsdevel, linux-gpio, linux-hyperv,
linux-input, linux-leds, linux-media, linux-mips, linux-mm,
linux-modules, linux-mtd, linux-nfs, linux-omap, linux-phy,
linux-pm, linux-rockchip, linux-s390, linux-scsi, linux-sctp,
linux-security-module, linux-sh, linux-sound, linux-stm32,
linux-trace-kernel, linux-usb, linux-wireless, netdev, ntfs3,
samba-technical, sched-ext, target-devel, tipc-discussion, v9fs,
LKML
In-Reply-To: <20260310-b4-is_err_or_null-v1-1-bd63b656022d@avm.de>
…
> +// Confidence: High
Some contributors raised discerning comments about this change approach.
I am therefore curious how far these comments can be better taken into account
by means of the semantic patch language (Coccinelle software).
…
> +@p1 depends on patch@
> +expression E;
> +@@
> +(
> +- E != NULL && !IS_ERR(E)
> ++ !IS_ERR_OR_NULL(E)
> +|
> +- E == NULL || IS_ERR(E)
> ++ IS_ERR_OR_NULL(E)
> +|
> +- !IS_ERR(E) && E != NULL
> ++ !IS_ERR_OR_NULL(E)
> +|
> +- IS_ERR(E) || E == NULL
> ++ IS_ERR_OR_NULL(E)
> +)
Several detected expressions should refer to return values from function calls.
https://en.wikipedia.org/wiki/Return_statement
* Do any development challenges still hinder the determination of the
corresponding failure predicates?
* How will interests evolve to improve data processing any further for such
use cases?
Regards,
Markus
* Re: [PATCH v2 1/3] tracing: Have futex syscall trace event show specific user data
From: Steven Rostedt @ 2026-03-11 14:16 UTC (permalink / raw)
To: Masami Hiramatsu (Google)
Cc: linux-kernel, linux-trace-kernel, Mark Rutland, Mathieu Desnoyers,
Andrew Morton, Thomas Gleixner, Peter Zijlstra, Brian Geffon,
John Stultz, Ian Rogers, Suleiman Souhlal
In-Reply-To: <20260311180325.488f724d97204d1a3c66f071@kernel.org>
On Wed, 11 Mar 2026 18:03:25 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> > + pos += snprintf(buf + pos, LEN_OR_ZERO,
> > + " __print_symbolic(REC->op & 0x%x, ", FUTEX_CMD_MASK);
> > +
> > + pos += snprintf(buf + pos, LEN_OR_ZERO,
> > + "{%d, \"FUTEX_WAIT\"}, ", FUTEX_WAIT);
> > + pos += snprintf(buf + pos, LEN_OR_ZERO,
> > + "{%d, \"FUTEX_WAKE\"}, ", FUTEX_WAKE);
> > + pos += snprintf(buf + pos, LEN_OR_ZERO,
> > + "{%d, \"FUTEX_FD\"}, ", FUTEX_FD);
> > + pos += snprintf(buf + pos, LEN_OR_ZERO,
> > + "{%d, \"FUTEX_REQUEUE\"}, ", FUTEX_REQUEUE);
> > + pos += snprintf(buf + pos, LEN_OR_ZERO,
> > + "{%d, \"FUTEX_CMP_REQUEUE\"}, ", FUTEX_CMP_REQUEUE);
> > + pos += snprintf(buf + pos, LEN_OR_ZERO,
> > + "{%d, \"FUTEX_WAKE_OP\"}, ", FUTEX_WAKE_OP);
> > + pos += snprintf(buf + pos, LEN_OR_ZERO,
> > + "{%d, \"FUTEX_LOCK_PI\"}, ", FUTEX_LOCK_PI);
> > + pos += snprintf(buf + pos, LEN_OR_ZERO,
> > + "{%d, \"FUTEX_UNLOCK_PI\"}, ", FUTEX_UNLOCK_PI);
> > + pos += snprintf(buf + pos, LEN_OR_ZERO,
> > + "{%d, \"FUTEX_TRYLOCK_PI\"}, ", FUTEX_TRYLOCK_PI);
> > + pos += snprintf(buf + pos, LEN_OR_ZERO,
> > + "{%d, \"FUTEX_WAIT_BITSET\"}, ", FUTEX_WAIT_BITSET);
> > + pos += snprintf(buf + pos, LEN_OR_ZERO,
> > + "{%d, \"FUTEX_WAKE_BITSET\"}, ", FUTEX_WAKE_BITSET);
> > + pos += snprintf(buf + pos, LEN_OR_ZERO,
> > + "{%d, \"FUTEX_WAIT_REQUEUE_PI\"}, ", FUTEX_WAIT_REQUEUE_PI);
> > + pos += snprintf(buf + pos, LEN_OR_ZERO,
> > + "{%d, \"FUTEX_CMP_REQUEUE_PI\"}, ", FUTEX_CMP_REQUEUE_PI);
> > + pos += snprintf(buf + pos, LEN_OR_ZERO,
> > + "{%d, \"FUTEX_LOCK_PI2\"}),", FUTEX_LOCK_PI2);
>
> Hmm can we share __futex_cmds[] with kernel/futex/syscalls.c?
> Then these could be
>
> for (i = 0; i <= FUTEX_LOCK_PI2; i++)
> pos += snprintf(buf + pos, LEN_OR_ZERO,
> "{%d, \"%s\"}%s", i, __futex_cmds[i],
> i == FUTEX_LOCK_PI2 ? ")," : ", ");
Hmm, I created the above *before* creating the __futex_cmds. But yes, that
makes sense. Thanks for the suggestion.
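As a user-space illustration of the table-driven variant, the sketch below builds only the symbolic-name list (the leading "__print_symbolic(REC->op & ..." part is left out). futex_cmd_names[] is a hypothetical stand-in for the kernel's __futex_cmds[], whose exact definition is not shown in this thread, and futex_symbolic_fmt() is a made-up name. The LEN_OR_ZERO two-pass idiom is kept: a first call with len == 0 only measures, and the caller is assumed to size the buffer from that result.

```c
#include <stdio.h>

/* Hypothetical stand-in for the kernel's __futex_cmds[] table. */
static const char * const futex_cmd_names[] = {
	"FUTEX_WAIT", "FUTEX_WAKE", "FUTEX_FD", "FUTEX_REQUEUE",
	"FUTEX_CMP_REQUEUE", "FUTEX_WAKE_OP", "FUTEX_LOCK_PI",
	"FUTEX_UNLOCK_PI", "FUTEX_TRYLOCK_PI", "FUTEX_WAIT_BITSET",
	"FUTEX_WAKE_BITSET", "FUTEX_WAIT_REQUEUE_PI",
	"FUTEX_CMP_REQUEUE_PI", "FUTEX_LOCK_PI2",
};
#define NR_FUTEX_CMDS (sizeof(futex_cmd_names) / sizeof(futex_cmd_names[0]))

/* Remaining space, or 0 when only measuring (len == 0). */
#define LEN_OR_ZERO (len ? len - pos : 0)

/*
 * Emit {value, "NAME"} pairs for every command; the last pair closes
 * the argument list with ")," instead of ", ". Returns the number of
 * characters needed, not counting the terminating NUL.
 */
static int futex_symbolic_fmt(char *buf, int len)
{
	int pos = 0;

	for (size_t i = 0; i < NR_FUTEX_CMDS; i++)
		pos += snprintf(len ? buf + pos : NULL, LEN_OR_ZERO,
				"{%zu, \"%s\"}%s", i, futex_cmd_names[i],
				i == NR_FUTEX_CMDS - 1 ? ")," : ", ");
	return pos;
}
```

Adding a new command then only requires extending the table, rather than adding another snprintf() call.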
>
>
> > +
> > + pos += snprintf(buf + pos, LEN_OR_ZERO,
> > + " (REC->op & %d) ? \"|FUTEX_PRIVATE_FLAG\" : \"\",",
> > + FUTEX_PRIVATE_FLAG);
> > + pos += snprintf(buf + pos, LEN_OR_ZERO,
> > + " (REC->op & %d) ? \"|FUTEX_CLOCK_REALTIME\" : \"\",",
> > + FUTEX_CLOCK_REALTIME);
> > +
> > + pos += snprintf(buf + pos, LEN_OR_ZERO,
> > + " REC->val, REC->utime,");
> > +
> > + pos += snprintf(buf + pos, LEN_OR_ZERO,
> > + " REC->uaddr, REC->val3");
> > + return pos;
> > +}
> > +
> > static int __init
> > __set_enter_print_fmt(struct syscall_metadata *entry, char *buf, int len)
> > {
> [...]
> > @@ -689,6 +799,48 @@ static int syscall_copy_user_array(char *buf, const char __user *ptr,
> > return 0;
> > }
> >
> > +static int
> > +syscall_get_futex(unsigned long *args, char **buffer, int *size, int buf_size)
> > +{
> > + struct syscall_user_buffer *sbuf;
> > + const char __user *ptr;
> > + char *buf;
> > +
> > + /* buf_size of zero means user doesn't want user space read */
> > + if (!buf_size)
> > + return -1;
> > +
> > + /* If the syscall_buffer is NULL, tracing is being shutdown */
> > + sbuf = READ_ONCE(syscall_buffer);
> > + if (!sbuf)
> > + return -1;
> > +
> > + ptr = (char __user *)args[0];
> > +
> > + *buffer = trace_user_fault_read(&sbuf->buf, ptr, 4, NULL, NULL);
> > + if (!*buffer)
> > + return -1;
> > +
> > + /* Add room for the value */
> > + *size += 4;
> > +
> > + buf = *buffer;
>
> As kernel test bot says, this does nothing. (*buffer is already assigned)
>
Oops, I think this was a cut-and-paste error :-p
-- Steve
* Re: [PATCH 15/61] trace: Prefer IS_ERR_OR_NULL over manual NULL check
From: Geert Uytterhoeven @ 2026-03-11 14:06 UTC (permalink / raw)
To: Steven Rostedt
Cc: Masami Hiramatsu (Google), Philipp Hahn, amd-gfx, apparmor, bpf,
ceph-devel, cocci, dm-devel, dri-devel, gfs2, intel-gfx,
intel-wired-lan, iommu, kvm, linux-arm-kernel, linux-block,
linux-bluetooth, linux-btrfs, linux-cifs, linux-clk, linux-erofs,
linux-ext4, linux-fsdevel, linux-gpio, linux-hyperv, linux-input,
linux-kernel, linux-leds, linux-media, linux-mips, linux-mm,
linux-modules, linux-mtd, linux-nfs, linux-omap, linux-phy,
linux-pm, linux-rockchip, linux-s390, linux-scsi, linux-sctp,
linux-security-module, linux-sh, linux-sound, linux-stm32,
linux-trace-kernel, linux-usb, linux-wireless, netdev, ntfs3,
samba-technical, sched-ext, target-devel, tipc-discussion, v9fs,
Mathieu Desnoyers
In-Reply-To: <20260311100332.6a2ce4b1@gandalf.local.home>
Hi Steven,
On Wed, 11 Mar 2026 at 15:03, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Wed, 11 Mar 2026 14:13:32 +0900
> Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
>
> > Hmm, now IS_ERR_OR_NULL() is an inline function, so it is safe.
> > But if you want to use IS_ERR_OR_NULL() here, it would be better to write something like
> >
> > node = rhashtable_walk_next(&iter);
> > while (!IS_ERR_OR_NULL(node)) {
> > fprobe_remove_node_in_module(mod, node, &alist);
> > node = rhashtable_walk_next(&iter);
> > }
>
> But now you need to have duplicate code in order to acquire "node"
>
> I think the patch just makes the code worse.
Obviously we need a new for_each_*() helper hiding all the gory internals?
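Such a helper might look roughly like the user-space sketch below. None of these names exist in the kernel: toy_iter and toy_walk_next() stand in for the rhashtable walker, model_is_err_or_null() imitates IS_ERR_OR_NULL(), and for_each_toy_node() is the hypothetical helper. The "next" call still appears twice, but only inside the macro.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_ERRNO 4095

/* User-space imitation of the kernel's IS_ERR_OR_NULL(). */
static inline bool model_is_err_or_null(const void *p)
{
	return !p || (uintptr_t)p >= (uintptr_t)-MODEL_MAX_ERRNO;
}

/* Toy iterator over an int array, standing in for a hash-table walk. */
struct toy_iter {
	int *items;
	size_t count;
	size_t next;
};

/* Return the next element, or NULL once the walk is finished. */
static int *toy_walk_next(struct toy_iter *it)
{
	return it->next < it->count ? &it->items[it->next++] : NULL;
}

/* Hypothetical helper: hides both the initial and the repeated call. */
#define for_each_toy_node(node, it)			\
	for ((node) = toy_walk_next(it);		\
	     !model_is_err_or_null(node);		\
	     (node) = toy_walk_next(it))

/* Example user: the loop body never touches the iterator directly. */
static int toy_sum(struct toy_iter *it)
{
	int *node, sum = 0;

	for_each_toy_node(node, it)
		sum += *node;
	return sum;
}
```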
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
* Re: [PATCH 15/61] trace: Prefer IS_ERR_OR_NULL over manual NULL check
From: Steven Rostedt @ 2026-03-11 14:03 UTC (permalink / raw)
To: Masami Hiramatsu (Google)
Cc: Philipp Hahn, amd-gfx, apparmor, bpf, ceph-devel, cocci, dm-devel,
dri-devel, gfs2, intel-gfx, intel-wired-lan, iommu, kvm,
linux-arm-kernel, linux-block, linux-bluetooth, linux-btrfs,
linux-cifs, linux-clk, linux-erofs, linux-ext4, linux-fsdevel,
linux-gpio, linux-hyperv, linux-input, linux-kernel, linux-leds,
linux-media, linux-mips, linux-mm, linux-modules, linux-mtd,
linux-nfs, linux-omap, linux-phy, linux-pm, linux-rockchip,
linux-s390, linux-scsi, linux-sctp, linux-security-module,
linux-sh, linux-sound, linux-stm32, linux-trace-kernel, linux-usb,
linux-wireless, netdev, ntfs3, samba-technical, sched-ext,
target-devel, tipc-discussion, v9fs, Mathieu Desnoyers
In-Reply-To: <20260311141332.b611237d36b61b2409e66cb3@kernel.org>
On Wed, 11 Mar 2026 14:13:32 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> Hmm, now IS_ERR_OR_NULL() is an inline function, so it is safe.
> But if you want to use IS_ERR_OR_NULL() here, it would be better to write something like
>
> node = rhashtable_walk_next(&iter);
> while (!IS_ERR_OR_NULL(node)) {
> fprobe_remove_node_in_module(mod, node, &alist);
> node = rhashtable_walk_next(&iter);
> }
But now you need to have duplicate code in order to acquire "node"
I think the patch just makes the code worse.
-- Steve
* Re: [RFC PATCH v2 0/3] disable optimistic spinning for ftrace_lock
From: Yafang Shao @ 2026-03-11 13:40 UTC (permalink / raw)
To: David Laight
Cc: Peter Zijlstra, mingo, will, boqun, longman, rostedt, mhiramat,
mark.rutland, mathieu.desnoyers, linux-kernel, linux-trace-kernel
In-Reply-To: <20260311125350.1d89341f@pumpkin>
On Wed, Mar 11, 2026 at 8:53 PM David Laight
<david.laight.linux@gmail.com> wrote:
>
> On Wed, 11 Mar 2026 12:54:26 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> > On Wed, Mar 11, 2026 at 07:52:47PM +0800, Yafang Shao wrote:
> > > Recently, we resolved a latency spike issue caused by concurrently running
> > > bpftrace processes. The root cause was high contention on the ftrace_lock
> > > due to optimistic spinning. We can optimize this by disabling optimistic
> > > spinning for ftrace_lock.
> > >
> > > While semaphores may present similar challenges, I'm not currently aware of
> > > specific instances that exhibit this exact issue. Should we encounter
> > > problematic semaphores in production workloads, we can address them at that
> > > time.
> > >
> > > PATCH #1: introduce slow_mutex_[un]lock to disable optimistic spinning
> > > PATCH #2: add variant for rtmutex
> > > PATCH #3: disable optimistic spinning for ftrace_lock
> > >
> >
> > So I really utterly hate this.
>
> Yep...
> Adding the extra parameter is likely to have a measurable impact
> on everything else.
>
> The problematic path is obvious: find_kallsyms_symbol+142
> module_address_lookup+104
> kallsyms_lookup_buildid+203
> kallsyms_lookup+20
> print_rec+64
> t_show+67
> seq_read_iter+709
> seq_read+165
> vfs_read+165
> ksys_read+103
> __x64_sys_read+25
> do_syscall_64+56
> entry_SYSCALL_64_after_hwframe+100
>
> The code needs to drop the ftrace_lock across t_show.
It's unclear whether we can safely release ftrace_lock within t_show —
doing so would probably necessitate a major redesign of the current
implementation.
>
> Although there is a bigger issue of why on earth the code is reading the
> list of filter functions at all - never mind all the time.
bpftrace reads the complete list of available functions into
userspace, then performs matching against the target function to
determine if it is traceable.
> I'll do it by hand when debugging, but I'd have thought anything using bpf
> will know exactly where to add its hooks.
--
Regards
Yafang
* Re: [PATCH 36/61] arch/sh: Prefer IS_ERR_OR_NULL over manual NULL check
From: Geert Uytterhoeven @ 2026-03-11 13:15 UTC (permalink / raw)
To: Philipp Hahn
Cc: amd-gfx, apparmor, bpf, ceph-devel, cocci, dm-devel, dri-devel,
gfs2, intel-gfx, intel-wired-lan, iommu, kvm, linux-arm-kernel,
linux-block, linux-bluetooth, linux-btrfs, linux-cifs, linux-clk,
linux-erofs, linux-ext4, linux-fsdevel, linux-gpio, linux-hyperv,
linux-input, linux-kernel, linux-leds, linux-media, linux-mips,
linux-mm, linux-modules, linux-mtd, linux-nfs, linux-omap,
linux-phy, linux-pm, linux-rockchip, linux-s390, linux-scsi,
linux-sctp, linux-security-module, linux-sh, linux-sound,
linux-stm32, linux-trace-kernel, linux-usb, linux-wireless,
netdev, ntfs3, samba-technical, sched-ext, target-devel,
tipc-discussion, v9fs, Yoshinori Sato, Rich Felker,
John Paul Adrian Glaubitz
In-Reply-To: <20260310-b4-is_err_or_null-v1-36-bd63b656022d@avm.de>
On Tue, 10 Mar 2026 at 12:56, Philipp Hahn <phahn-oss@avm.de> wrote:
> Prefer using IS_ERR_OR_NULL() over using IS_ERR() and a manual NULL
> check.
>
> Change generated with coccinelle.
>
> To: Yoshinori Sato <ysato@users.sourceforge.jp>
> To: Rich Felker <dalias@libc.org>
> To: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
> Cc: linux-sh@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Philipp Hahn <phahn-oss@avm.de>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
* [PATCH v3 4/4] trace/preemptirq: Implement trace_irqflags hooks
From: Wander Lairson Costa @ 2026-03-11 12:50 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Masami Hiramatsu, Mathieu Desnoyers,
Andrew Morton, Wander Lairson Costa, open list, open list:TRACING
Cc: acme, williams, gmonaco
In-Reply-To: <20260311125021.197638-1-wander@redhat.com>
The previous commit introduced the CONFIG_TRACE_IRQFLAGS_TOGGLE symbol.
This patch implements the actual infrastructure to allow tracing
irq_disable and irq_enable events without pulling in the heavy
CONFIG_TRACE_IRQFLAGS dependencies like lockdep or the irqsoff tracer.
The implementation hooks into the local_irq_* macros in irqflags.h.
Instead of using the heavy trace_hardirqs_on/off calls, it uses
lightweight tracepoint_enabled() checks. If the tracepoint is enabled,
it calls into specific wrapper functions in trace_preemptirq.c.
These wrappers check the raw hardware state via raw_irqs_disabled() to
filter out redundant events, such as disabling interrupts when they
are already disabled. This approach is simpler than the full
TRACE_IRQFLAGS method which requires maintaining a per-cpu software
state variable.
To support this, the tracepoint definitions are exposed under the new
configuration. Additionally, a circular header dependency involving
irqflags.h, tracepoint-defs.h, and atomic.h is resolved by moving the
atomic.h inclusion to tracepoint.h, allowing irqflags.h to include
tracepoint-defs.h safely.
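The filtering idea can be modelled in a few lines of user-space C. This is only a guess at the logic, based on the description above, since the wrapper bodies in trace_preemptirq.c are not part of this excerpt. All names are hypothetical: model_irqs_off stands in for raw_irqs_disabled(), model_tp_enabled for tracepoint_enabled(), and the two counters for emitted events.

```c
#include <stdbool.h>

static bool model_irqs_off;		/* stands in for raw_irqs_disabled() */
static bool model_tp_enabled;		/* stands in for tracepoint_enabled() */
static int model_disable_events;	/* irq_disable events emitted */
static int model_enable_events;		/* irq_enable events emitted */

/* Emit an event only when interrupts actually go from on to off. */
static void model_local_irq_disable(void)
{
	if (model_tp_enabled && !model_irqs_off)
		model_disable_events++;
	model_irqs_off = true;
}

/* Emit an event only when interrupts actually go from off to on. */
static void model_local_irq_enable(void)
{
	if (model_tp_enabled && model_irqs_off)
		model_enable_events++;
	model_irqs_off = false;
}
```

Two back-to-back model_local_irq_disable() calls produce a single event here, and no separate per-CPU software state is needed because the current hardware state is consulted directly.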
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
---
include/linux/irqflags.h | 62 ++++++++++++++++++++++++++++++-
include/linux/tracepoint-defs.h | 1 -
include/linux/tracepoint.h | 1 +
include/trace/events/preemptirq.h | 2 +-
kernel/trace/trace_preemptirq.c | 49 ++++++++++++++++++++++++
5 files changed, 112 insertions(+), 3 deletions(-)
diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h
index 57b074e0cfbbb..f40557bebd325 100644
--- a/include/linux/irqflags.h
+++ b/include/linux/irqflags.h
@@ -18,6 +18,19 @@
#include <asm/irqflags.h>
#include <asm/percpu.h>
+/*
+ * Avoid the circular dependency
+ * irqflags.h <-----------------+
+ * tracepoint-defs.h |
+ * static_key.h |
+ * jump_label.h |
+ * atomic.h |
+ * cmpxchg.h ---------+
+ */
+#ifdef CONFIG_TRACE_IRQFLAGS_TOGGLE
+#include <linux/tracepoint-defs.h>
+#endif
+
struct task_struct;
/* Currently lockdep_softirqs_on/off is used only by lockdep */
@@ -232,7 +245,54 @@ extern void warn_bogus_irq_restore(void);
} while (0)
-#else /* !CONFIG_TRACE_IRQFLAGS */
+#elif defined(CONFIG_TRACE_IRQFLAGS_TOGGLE) /* !CONFIG_TRACE_IRQFLAGS */
+
+DECLARE_TRACEPOINT(irq_enable);
+DECLARE_TRACEPOINT(irq_disable);
+
+void trace_local_irq_enable(void);
+void trace_local_irq_disable(void);
+void trace_local_irq_save(unsigned long flags);
+void trace_local_irq_restore(unsigned long flags);
+void trace_safe_halt(void);
+
+#define local_irq_enable() \
+ do { \
+ if (tracepoint_enabled(irq_enable)) \
+ trace_local_irq_enable(); \
+ raw_local_irq_enable(); \
+ } while (0)
+
+#define local_irq_disable() \
+ do { \
+ if (tracepoint_enabled(irq_disable)) \
+ trace_local_irq_disable(); \
+ else \
+ raw_local_irq_disable(); \
+ } while (0)
+
+#define local_irq_save(flags) \
+ do { \
+ raw_local_irq_save(flags); \
+ if (tracepoint_enabled(irq_disable)) \
+ trace_local_irq_save(flags); \
+ } while (0)
+
+#define local_irq_restore(flags) \
+ do { \
+ if (tracepoint_enabled(irq_enable)) \
+ trace_local_irq_restore(flags); \
+ raw_local_irq_restore(flags); \
+ } while (0)
+
+#define safe_halt() \
+ do { \
+ if (tracepoint_enabled(irq_enable)) \
+ trace_safe_halt(); \
+ raw_safe_halt(); \
+ } while (0)
+
+#else /* !CONFIG_TRACE_IRQFLAGS_TOGGLE */
#define local_irq_enable() do { raw_local_irq_enable(); } while (0)
#define local_irq_disable() do { raw_local_irq_disable(); } while (0)
diff --git a/include/linux/tracepoint-defs.h b/include/linux/tracepoint-defs.h
index aebf0571c736e..cb1f15a4e43f0 100644
--- a/include/linux/tracepoint-defs.h
+++ b/include/linux/tracepoint-defs.h
@@ -8,7 +8,6 @@
* trace_print_flags{_u64}. Otherwise linux/tracepoint.h should be used.
*/
-#include <linux/atomic.h>
#include <linux/static_key.h>
struct static_call_key;
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 22ca1c8b54f32..e7d8c5ca00c79 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -20,6 +20,7 @@
#include <linux/rcupdate_trace.h>
#include <linux/tracepoint-defs.h>
#include <linux/static_call.h>
+#include <linux/atomic.h>
struct module;
struct tracepoint;
diff --git a/include/trace/events/preemptirq.h b/include/trace/events/preemptirq.h
index f99562d2b496b..a607a6f4e29ca 100644
--- a/include/trace/events/preemptirq.h
+++ b/include/trace/events/preemptirq.h
@@ -32,7 +32,7 @@ DECLARE_EVENT_CLASS(preemptirq_template,
(void *)((unsigned long)(_stext) + __entry->parent_offs))
);
-#ifdef CONFIG_TRACE_IRQFLAGS
+#if defined(CONFIG_TRACE_IRQFLAGS) || defined(CONFIG_TRACE_IRQFLAGS_TOGGLE)
DEFINE_EVENT(preemptirq_template, irq_disable,
TP_PROTO(unsigned long ip, unsigned long parent_ip),
TP_ARGS(ip, parent_ip));
diff --git a/kernel/trace/trace_preemptirq.c b/kernel/trace/trace_preemptirq.c
index 9f098fcb28012..0f32da96d2f01 100644
--- a/kernel/trace/trace_preemptirq.c
+++ b/kernel/trace/trace_preemptirq.c
@@ -111,8 +111,57 @@ void trace_hardirqs_off(void)
}
EXPORT_SYMBOL(trace_hardirqs_off);
NOKPROBE_SYMBOL(trace_hardirqs_off);
+
#endif /* CONFIG_TRACE_IRQFLAGS */
+#ifdef CONFIG_TRACE_IRQFLAGS_TOGGLE
+EXPORT_TRACEPOINT_SYMBOL(irq_disable);
+EXPORT_TRACEPOINT_SYMBOL(irq_enable);
+
+void trace_local_irq_enable(void)
+{
+ if (raw_irqs_disabled())
+ trace(irq_enable, TP_ARGS(CALLER_ADDR0, CALLER_ADDR1));
+}
+EXPORT_SYMBOL(trace_local_irq_enable);
+NOKPROBE_SYMBOL(trace_local_irq_enable);
+
+void trace_local_irq_disable(void)
+{
+ const bool was_disabled = raw_irqs_disabled();
+
+ raw_local_irq_disable();
+ if (!was_disabled)
+ trace(irq_disable, TP_ARGS(CALLER_ADDR0, CALLER_ADDR1));
+}
+EXPORT_SYMBOL(trace_local_irq_disable);
+NOKPROBE_SYMBOL(trace_local_irq_disable);
+
+void trace_local_irq_save(unsigned long flags)
+{
+ if (!raw_irqs_disabled_flags(flags))
+ trace(irq_disable, TP_ARGS(CALLER_ADDR0, CALLER_ADDR1));
+}
+EXPORT_SYMBOL(trace_local_irq_save);
+NOKPROBE_SYMBOL(trace_local_irq_save);
+
+void trace_local_irq_restore(unsigned long flags)
+{
+ if (!raw_irqs_disabled_flags(flags) && raw_irqs_disabled())
+ trace(irq_enable, TP_ARGS(CALLER_ADDR0, CALLER_ADDR1));
+}
+EXPORT_SYMBOL(trace_local_irq_restore);
+NOKPROBE_SYMBOL(trace_local_irq_restore);
+
+void trace_safe_halt(void)
+{
+ if (raw_irqs_disabled())
+ trace(irq_enable, TP_ARGS(CALLER_ADDR0, CALLER_ADDR1));
+}
+EXPORT_SYMBOL(trace_safe_halt);
+NOKPROBE_SYMBOL(trace_safe_halt);
+#endif /* CONFIG_TRACE_IRQFLAGS_TOGGLE */
+
#ifdef CONFIG_TRACE_PREEMPT_TOGGLE
#if !defined(CONFIG_DEBUG_PREEMPT) && !defined(CONFIG_PREEMPT_TRACER)
--
2.53.0
* [PATCH v3 3/4] trace/preemptirq: add TRACE_IRQFLAGS_TOGGLE
From: Wander Lairson Costa @ 2026-03-11 12:50 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Masami Hiramatsu, Mathieu Desnoyers,
Andrew Morton, open list:SCHEDULER, open list:TRACING
Cc: acme, williams, gmonaco, Wander Lairson Costa
In-Reply-To: <20260311125021.197638-1-wander@redhat.com>
The IRQ disable/enable tracepoints are currently gated behind
TRACE_IRQFLAGS, a hidden config that cannot be selected directly by
users. It is only enabled when selected by PROVE_LOCKING or
IRQSOFF_TRACER, both of which carry overhead beyond what is needed
for just the tracepoints.
Introduce TRACE_IRQFLAGS_TOGGLE, a user-selectable config that enables
the irq_disable and irq_enable tracepoints independently. It is
mutually exclusive with TRACE_IRQFLAGS and mirrors how
TRACE_PREEMPT_TOGGLE works for preemption tracepoints.
Make this option depend on CONFIG_JUMP_LABEL to avoid a circular header
dependency. Without TRACE_IRQFLAGS, irqflags.h must check the static
key before invoking the tracepoint. Using tracepoint_enabled() for this
check pulls in tracepoint-defs.h, which eventually includes atomic.h
and cmpxchg.h, circling back to irqflags.h. Enforcing CONFIG_JUMP_LABEL
allows the use of static_key_false() directly, avoiding the inclusion
of the full tracepoint header chain and preventing the cycle.
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
---
kernel/trace/Kconfig | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index e007459ecf361..8bea77b5f1200 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -162,7 +162,7 @@ config RING_BUFFER_ALLOW_SWAP
config PREEMPTIRQ_TRACEPOINTS
bool
- depends on TRACE_PREEMPT_TOGGLE || TRACE_IRQFLAGS
+ depends on TRACE_PREEMPT_TOGGLE || TRACE_IRQFLAGS || TRACE_IRQFLAGS_TOGGLE
select TRACING
default y
help
@@ -418,6 +418,17 @@ config TRACE_PREEMPT_TOGGLE
Enables hooks into preemption disable and enable paths for
tracing or latency measurement.
+config TRACE_IRQFLAGS_TOGGLE
+ bool "IRQ disable/enable tracing hooks"
+ default n
+ depends on TRACE_IRQFLAGS_SUPPORT && JUMP_LABEL && !TRACE_IRQFLAGS
+ help
+ Enables hooks into IRQ disable and enable paths for tracing.
+
+ This provides the irq_disable and irq_enable tracepoints
+ without pulling in the full TRACE_IRQFLAGS infrastructure
+ used by lockdep and the irqsoff latency tracer.
+
config IRQSOFF_TRACER
bool "Interrupts-off Latency Tracer"
default n
--
2.53.0
* [PATCH v3 2/4] trace/preemptirq: make TRACE_PREEMPT_TOGGLE user-selectable
From: Wander Lairson Costa @ 2026-03-11 12:50 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Masami Hiramatsu, Mathieu Desnoyers,
Andrew Morton, open list:SCHEDULER, open list:TRACING
Cc: acme, williams, gmonaco, Wander Lairson Costa
In-Reply-To: <20260311125021.197638-1-wander@redhat.com>
Make TRACE_PREEMPT_TOGGLE directly selectable so that
preempt_enable/preempt_disable tracepoints can be enabled in
production kernels without requiring the preemptoff latency tracer
overhead.
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
---
kernel/trace/Kconfig | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 49de13cae4288..e007459ecf361 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -413,10 +413,10 @@ config STACK_TRACER
Say N if unsure.
config TRACE_PREEMPT_TOGGLE
- bool
+ bool "Preempt disable/enable tracing hooks"
help
- Enables hooks which will be called when preemption is first disabled,
- and last enabled.
+ Enables hooks into preemption disable and enable paths for
+ tracing or latency measurement.
config IRQSOFF_TRACER
bool "Interrupts-off Latency Tracer"
--
2.53.0
* [PATCH v3 1/4] tracing/preemptirq: Optimize preempt_disable/enable() tracepoint overhead
From: Wander Lairson Costa @ 2026-03-11 12:50 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Masami Hiramatsu, Mathieu Desnoyers,
Andrew Morton, open list:SCHEDULER, open list:TRACING
Cc: acme, williams, gmonaco, Wander Lairson Costa
In-Reply-To: <20260311125021.197638-1-wander@redhat.com>
When CONFIG_TRACE_PREEMPT_TOGGLE is enabled, preempt_count_add() and
preempt_count_sub() become external function calls (defined in
kernel/sched/core.c) rather than inlined operations. These functions
also perform preempt_count() checks and call trace_preempt_on/off()
unconditionally, even when no tracing consumer is active.
Reduce this overhead by splitting the #if logic in preempt.h into
three cases. When CONFIG_DEBUG_PREEMPT or CONFIG_PREEMPT_TRACER is
set, keep external function calls because DEBUG_PREEMPT needs runtime
validation checks, and PREEMPT_TRACER needs the preemptoff latency
tracer hooks (tracer_preempt_on/off, called via trace_preempt_on/off).
When CONFIG_TRACE_PREEMPT_TOGGLE alone is set, provide new inline
versions of preempt_count_add/sub() that check the tracepoint static
key via the __preempt_trace_enabled() macro before calling into the
tracing path. The macro evaluates to true when the preempt_enable or
preempt_disable tracepoint has subscribers AND the preempt count
equals val (indicating the first preempt disable or last preempt
enable), preserving the original preempt_latency_start/stop semantics.
When none of the above are set, use pure inline macros with no tracing
overhead.
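
The first-disable/last-enable check described above can be sketched in
userspace as follows. This is only a model under stated assumptions: the
sim_* names are hypothetical stand-ins, a plain bool replaces the
tracepoint static key, and the counter holds nothing but the preempt
depth (the real preempt_count() also carries softirq and hardirq bits).

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-ins (hypothetical names, not the kernel's definitions). */
static int sim_preempt_count;       /* models preempt_count() */
static bool sim_tracepoint_enabled; /* models the tracepoint static key */
static int sim_off_events;          /* times the "preempt off" hook fired */
static int sim_on_events;           /* times the "preempt on" hook fired */

static void sim_preempt_count_add(int val)
{
	sim_preempt_count += val;
	/* After the add, count == val only if it was zero before: first disable. */
	if (sim_tracepoint_enabled && sim_preempt_count == val)
		sim_off_events++;
}

static void sim_preempt_count_sub(int val)
{
	/* Before the sub, count == val only if this drops it to zero: last enable. */
	if (sim_tracepoint_enabled && sim_preempt_count == val)
		sim_on_events++;
	sim_preempt_count -= val;
}

/* Reset the model, then run a nested disable/disable/enable/enable sequence. */
static void sim_run_nested(bool enabled)
{
	sim_preempt_count = 0;
	sim_off_events = 0;
	sim_on_events = 0;
	sim_tracepoint_enabled = enabled;

	sim_preempt_count_add(1);
	sim_preempt_count_add(1);
	sim_preempt_count_sub(1);
	sim_preempt_count_sub(1);
	assert(sim_preempt_count == 0);
}
```

With the flag set, the nested sequence fires each hook once, for the
outermost disable and the outermost enable only. With the flag clear,
neither hook fires.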
The preempt_count_dec_and_test() macro is refactored out of the
three-way #if into a separate block shared by the first two cases,
since both need it to call the (potentially inline)
preempt_count_sub() before checking should_resched().
The inline path calls thin __trace_preempt_on/off() wrappers (added
in trace_preemptirq.c) that invoke trace_preempt_on/off(), keeping
the full tracepoint machinery out of the inline code.
The #include <linux/tracepoint-defs.h> is placed inside the
CONFIG_TRACE_PREEMPT_TOGGLE block rather than at the top of the file
to avoid a circular include dependency on architectures where
asm/irqflags.h includes linux/preempt.h (e.g. m68k):
preempt.h -> tracepoint-defs.h -> static_key.h -> jump_label.h ->
atomic.h -> irqflags.h -> asm/irqflags.h -> preempt.h (guarded)
If the include were at the top, this chain would be traversed before
hardirq_count() is defined (at line 110), causing a build failure on
m68k. Placing it inside the #elif block ensures it is only pulled in
when CONFIG_TRACE_PREEMPT_TOGGLE is enabled and avoids the cycle for
configurations that do not select it.
In core.c, narrow the compilation guard for the external
preempt_count_add/sub() from CONFIG_DEBUG_PREEMPT ||
CONFIG_TRACE_PREEMPT_TOGGLE to CONFIG_DEBUG_PREEMPT ||
CONFIG_PREEMPT_TRACER, since CONFIG_TRACE_PREEMPT_TOGGLE is now
handled inline.
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
---
include/linux/preempt.h | 49 +++++++++++++++++++++++++++++++--
kernel/sched/core.c | 2 +-
kernel/trace/trace_preemptirq.c | 19 +++++++++++++
3 files changed, 66 insertions(+), 4 deletions(-)
diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index d964f965c8ffc..f59a92f930d81 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -189,17 +189,60 @@ static __always_inline unsigned char interrupt_context_level(void)
*/
#define in_atomic_preempt_off() (preempt_count() != PREEMPT_DISABLE_OFFSET)
-#if defined(CONFIG_DEBUG_PREEMPT) || defined(CONFIG_TRACE_PREEMPT_TOGGLE)
+#if defined(CONFIG_DEBUG_PREEMPT) || defined(CONFIG_PREEMPT_TRACER)
extern void preempt_count_add(int val);
extern void preempt_count_sub(int val);
-#define preempt_count_dec_and_test() \
- ({ preempt_count_sub(1); should_resched(0); })
+#elif defined(CONFIG_TRACE_PREEMPT_TOGGLE)
+/*
+ * Avoid the circular dependency on architectures where asm/irqflags.h
+ * includes linux/preempt.h (e.g. m68k):
+ *
+ * preempt.h <--------------------+
+ * tracepoint-defs.h |
+ * static_key.h |
+ * jump_label.h |
+ * atomic.h |
+ * irqflags.h |
+ * asm/irqflags.h |
+ * preempt.h --------------+
+ */
+#include <linux/tracepoint-defs.h>
+
+extern void __trace_preempt_on(void);
+extern void __trace_preempt_off(void);
+
+DECLARE_TRACEPOINT(preempt_enable);
+DECLARE_TRACEPOINT(preempt_disable);
+
+#define __preempt_trace_enabled(type, val) \
+ (tracepoint_enabled(preempt_##type) && preempt_count() == (val))
+
+static __always_inline void preempt_count_add(int val)
+{
+ __preempt_count_add(val);
+
+ if (__preempt_trace_enabled(disable, val))
+ __trace_preempt_off();
+}
+
+static __always_inline void preempt_count_sub(int val)
+{
+ if (__preempt_trace_enabled(enable, val))
+ __trace_preempt_on();
+
+ __preempt_count_sub(val);
+}
#else
#define preempt_count_add(val) __preempt_count_add(val)
#define preempt_count_sub(val) __preempt_count_sub(val)
#define preempt_count_dec_and_test() __preempt_count_dec_and_test()
#endif
+#if defined(CONFIG_DEBUG_PREEMPT) || defined(CONFIG_TRACE_PREEMPT_TOGGLE)
+#define preempt_count_dec_and_test() \
+ ({ preempt_count_sub(1); should_resched(0); })
+#endif
+
#define __preempt_count_inc() __preempt_count_add(1)
#define __preempt_count_dec() __preempt_count_sub(1)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b7f77c165a6e0..125e5d71d1bd3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5733,7 +5733,7 @@ static inline void sched_tick_stop(int cpu) { }
#endif /* !CONFIG_NO_HZ_FULL */
#if defined(CONFIG_PREEMPTION) && (defined(CONFIG_DEBUG_PREEMPT) || \
- defined(CONFIG_TRACE_PREEMPT_TOGGLE))
+ defined(CONFIG_PREEMPT_TRACER))
/*
* If the value passed in is equal to the current preempt count
* then we just disabled preemption. Start timing the latency.
diff --git a/kernel/trace/trace_preemptirq.c b/kernel/trace/trace_preemptirq.c
index 0c42b15c38004..9f098fcb28012 100644
--- a/kernel/trace/trace_preemptirq.c
+++ b/kernel/trace/trace_preemptirq.c
@@ -115,6 +115,25 @@ NOKPROBE_SYMBOL(trace_hardirqs_off);
#ifdef CONFIG_TRACE_PREEMPT_TOGGLE
+#if !defined(CONFIG_DEBUG_PREEMPT) && !defined(CONFIG_PREEMPT_TRACER)
+EXPORT_TRACEPOINT_SYMBOL(preempt_disable);
+EXPORT_TRACEPOINT_SYMBOL(preempt_enable);
+
+void __trace_preempt_on(void)
+{
+ trace_preempt_on(CALLER_ADDR0, get_lock_parent_ip());
+}
+EXPORT_SYMBOL(__trace_preempt_on);
+NOKPROBE_SYMBOL(__trace_preempt_on);
+
+void __trace_preempt_off(void)
+{
+ trace_preempt_off(CALLER_ADDR0, get_lock_parent_ip());
+}
+EXPORT_SYMBOL(__trace_preempt_off);
+NOKPROBE_SYMBOL(__trace_preempt_off);
+#endif /* !CONFIG_DEBUG_PREEMPT && !CONFIG_PREEMPT_TRACER */
+
void trace_preempt_on(unsigned long a0, unsigned long a1)
{
trace(preempt_enable, TP_ARGS(a0, a1));
--
2.53.0
* [PATCH v3 0/4] tracing/preemptirq: Optimize disabled tracepoint overhead
From: Wander Lairson Costa @ 2026-03-11 12:50 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Masami Hiramatsu, Mathieu Desnoyers,
Andrew Morton, open list:SCHEDULER, open list:TRACING
Cc: acme, williams, gmonaco, Wander Lairson Costa
The preempt and IRQ tracepoints currently impose measurable overhead
even when they are compiled in but not actively enabled. This overhead
stems from external function calls and unconditional tracepoint checks
in highly active code paths.
The v2 series optimized within the existing CONFIG_TRACE_IRQFLAGS
path, which still required the heavy lockdep and irqsoff
infrastructure to be enabled. This v3 takes a different approach by
providing independent, user-selectable configurations
(CONFIG_TRACE_PREEMPT_TOGGLE and CONFIG_TRACE_IRQFLAGS_TOGGLE) that
expose the tracepoints without pulling in the heavier infrastructure.
The preempt optimization uses inline static key checks, while the IRQ
optimization employs lightweight wrapper functions that check raw
hardware state. Making both configurations explicitly user-selectable
addresses upstream feedback regarding the impact on code generation,
ensuring that this optimization remains strictly opt-in.
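
As a rough illustration of what "check raw hardware state" means for the
IRQ wrappers, here is a userspace model. It rests on assumptions: the
sim_* names are hypothetical, a bool stands in for the CPU interrupt
flag, the saved flags value is simply "were IRQs disabled", and the
tracepoints are treated as always enabled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these are kernel symbols. */
static bool sim_irqs_disabled;  /* stands in for the hardware IRQ state */
static int sim_disable_events;  /* irq_disable events emitted */
static int sim_enable_events;   /* irq_enable events emitted */

static void sim_local_irq_disable(void)
{
	bool was_disabled = sim_irqs_disabled;

	sim_irqs_disabled = true;
	if (!was_disabled)      /* only an enabled -> disabled edge is traced */
		sim_disable_events++;
}

static void sim_local_irq_enable(void)
{
	if (sim_irqs_disabled)  /* only a disabled -> enabled edge is traced */
		sim_enable_events++;
	sim_irqs_disabled = false;
}

/* Returns the previous state: nonzero means IRQs were already disabled. */
static unsigned long sim_local_irq_save(void)
{
	unsigned long flags = sim_irqs_disabled;

	sim_irqs_disabled = true;
	if (!flags)
		sim_disable_events++;
	return flags;
}

static void sim_local_irq_restore(unsigned long flags)
{
	/* Trace only when the restore really re-enables interrupts. */
	if (!flags && sim_irqs_disabled)
		sim_enable_events++;
	sim_irqs_disabled = flags != 0;
}
```

Redundant calls (disabling twice, or a save/restore pair inside an
already disabled region) emit nothing in this model, so no separate
per-CPU state variable is needed to suppress duplicate events.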
---
Performance Measurements
Measurements were taken using the tracer-benchmark kernel module [1].
The module creates one kthread per online CPU. Each thread performs
a configurable number of iterations of
local_irq_disable()/local_irq_enable() and
preempt_disable()/preempt_enable() pairs, timing each pair with
ktime_get_ns(). All threads start simultaneously via a completion
to maximize contention. Per-CPU results (average, median) are
aggregated across CPUs. The 99th percentile is measured separately
on a single pinned CPU. The kernel used was version 7.0.0. All
values are in nanoseconds. Each run collected 10^7 samples.
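
For readers who want to reproduce the three metrics from raw samples, a
minimal sketch follows. It is not taken from the tracer-benchmark
module: the struct and function names are hypothetical, the median is
the upper median for even sample counts, and the 99th percentile uses
the nearest-rank definition, which may differ from what the module does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical summary type for one set of latency samples (ns). */
struct lat_summary {
	uint64_t average;
	uint64_t median;
	uint64_t p99;
};

static int cmp_u64(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a;
	uint64_t y = *(const uint64_t *)b;

	return (x > y) - (x < y);
}

/* Sorts samples in place; n must be greater than zero. */
static struct lat_summary summarize(uint64_t *samples, size_t n)
{
	struct lat_summary s;
	uint64_t sum = 0;
	size_t i;

	assert(n > 0);
	for (i = 0; i < n; i++)
		sum += samples[i];
	qsort(samples, n, sizeof(*samples), cmp_u64);

	s.average = sum / n;           /* integer mean, truncated */
	s.median = samples[n / 2];     /* upper median when n is even */
	/* Nearest rank: index of ceil(0.99 * n) - 1 in the sorted array. */
	s.p99 = samples[(n * 99 + 99) / 100 - 1];
	return s;
}
```

For the values 1..100 this yields an average of 50, a median of 51 and
a 99th percentile of 99.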
Configurations compared:
- 7.0.0: stock kernel
- irqsoff: stock kernel with CONFIG_IRQSOFF_TRACER=y and
CONFIG_PREEMPT_TRACER=y
- preemptirq: patched kernel with CONFIG_TRACE_PREEMPT_TOGGLE=y
and CONFIG_TRACE_IRQFLAGS_TOGGLE=y
The '+' suffix indicates the test ran with tracepoints enabled.
IRQ Metrics
Metric          7.0.0   irqsoff   irqsoff+   preemptirq   preemptirq+
average            19        27        175           19           166
median             19        27        172           19           164
99 percentile      21        29        234           21           221

Preempt Metrics
Metric          7.0.0   irqsoff   irqsoff+   preemptirq   preemptirq+
average            16        21        169           16           160
median             16        21        165           17           159
99 percentile      18        23        236           18           217
The preemptirq configuration matches the stock kernel performance
when tracepoints are disabled, while the irqsoff configuration adds
roughly 30-40% average overhead even when inactive (27 vs. 19 ns for
IRQ, 21 vs. 16 ns for preempt). When tracepoints are enabled,
preemptirq is also slightly faster than irqsoff (about 5% on average).
Binary size impact (stripped vmlinux, defconfig):
7.0.0: 43404576 bytes
preemptirq: 43429152 bytes (+24576 bytes, +0.057%)
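
The relative figures can be rechecked from the numbers reported in this
cover letter with a small helper. The function name is hypothetical and
the inputs are the reported averages and sizes, not new measurements.

```c
#include <assert.h>

/* Relative increase of `measured` over `base`, in percent. */
static double pct_increase(double base, double measured)
{
	assert(base > 0.0);
	return (measured - base) / base * 100.0;
}
```

pct_increase(19, 27) gives about 42.1 for the inactive irqsoff IRQ
average, pct_increase(16, 21) gives 31.25 for the preempt average, and
pct_increase(43404576, 43429152) gives about 0.057 for the vmlinux size.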
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
---
References:
[1] https://github.com/walac/tracer-benchmark
Changes in v3:
- Reworked series from 2 to 4 patches
- IRQ tracing rearchitected: instead of optimizing within
CONFIG_TRACE_IRQFLAGS, introduced independent
CONFIG_TRACE_IRQFLAGS_TOGGLE that provides irq_disable/irq_enable
tracepoints without pulling in lockdep or irqsoff infrastructure
- Made TRACE_PREEMPT_TOGGLE user-selectable in Kconfig, addressing
upstream feedback about code generation impact
- Preempt optimization now handles CONFIG_PREEMPT_TRACER alongside
CONFIG_DEBUG_PREEMPT in the three-way #if split
- Fixed __preempt_trace_enabled() macro to accept val as parameter
- Resolved circular header dependency on m68k by placing
tracepoint-defs.h include inside conditional blocks instead of
at the top of preempt.h and irqflags.h
- Moved atomic.h include from tracepoint-defs.h to tracepoint.h
to break circular dependency chain on ARM32
- Used EXPORT_TRACEPOINT_SYMBOL() instead of raw
EXPORT_SYMBOL(__tracepoint_*) for proper tracepoint registration
- Narrowed core.c compilation guard to CONFIG_DEBUG_PREEMPT ||
CONFIG_PREEMPT_TRACER since TRACE_PREEMPT_TOGGLE is now handled
inline
- Updated performance benchmarks on 7.0.0, including
tracepoint-enabled measurements and binary size impact
Changes in v2:
- Fixed build failure on arm32 (circular dependency:
atomic.h -> cmpxchg.h -> irqflags.h -> tracepoint.h -> atomic.h)
Wander Lairson Costa (4):
tracing/preemptirq: Optimize preempt_disable/enable() tracepoint
overhead
trace/preemptirq: make TRACE_PREEMPT_TOGGLE user-selectable
trace/preemptirq: add TRACE_IRQFLAGS_TOGGLE
trace/preemptirq: Implement trace_irqflags hooks
include/linux/irqflags.h | 62 +++++++++++++++++++++++++++-
include/linux/preempt.h | 49 ++++++++++++++++++++--
include/linux/tracepoint-defs.h | 1 -
include/linux/tracepoint.h | 1 +
include/trace/events/preemptirq.h | 2 +-
kernel/sched/core.c | 2 +-
kernel/trace/Kconfig | 19 +++++++--
kernel/trace/trace_preemptirq.c | 68 +++++++++++++++++++++++++++++++
8 files changed, 193 insertions(+), 11 deletions(-)
--
2.53.0
* Re: [RFC PATCH v2 0/3] disable optimistic spinning for ftrace_lock
From: David Laight @ 2026-03-11 12:53 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Yafang Shao, mingo, will, boqun, longman, rostedt, mhiramat,
mark.rutland, mathieu.desnoyers, linux-kernel, linux-trace-kernel
In-Reply-To: <20260311115426.GN606826@noisy.programming.kicks-ass.net>
On Wed, 11 Mar 2026 12:54:26 +0100
Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Mar 11, 2026 at 07:52:47PM +0800, Yafang Shao wrote:
> > Recently, we resolved a latency spike issue caused by concurrently running
> > bpftrace processes. The root cause was high contention on the ftrace_lock
> > due to optimistic spinning. We can optimize this by disabling optimistic
> > spinning for ftrace_lock.
> >
> > While semaphores may present similar challenges, I'm not currently aware of
> > specific instances that exhibit this exact issue. Should we encounter
> > problematic semaphores in production workloads, we can address them at that
> > time.
> >
> > PATCH #1: introduce slow_mutex_[un]lock to disable optimistic spinning
> > PATCH #2: add variant for rtmutex
> > PATCH #3: disable optimistic spinning for ftrace_lock
> >
>
> So I really utterly hate this.
Yep...
Adding the extra parameter is likely to have a measurable impact
on everything else.
The problematic path is obvious:
    find_kallsyms_symbol+142
    module_address_lookup+104
    kallsyms_lookup_buildid+203
    kallsyms_lookup+20
    print_rec+64
    t_show+67
    seq_read_iter+709
    seq_read+165
    vfs_read+165
    ksys_read+103
    __x64_sys_read+25
    do_syscall_64+56
    entry_SYSCALL_64_after_hwframe+100
The code needs to drop the ftrace_lock across t_show.
Although there is a bigger issue of why on earth the code is reading the
list of filter functions at all - never mind all the time.
I'll do it by hand when debugging, but I'd have thought anything using bpf
will know exactly where to add its hooks.
David
* Re: [RFC PATCH v2 0/3] disable optimistic spinning for ftrace_lock
From: Yafang Shao @ 2026-03-11 11:55 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, will, boqun, longman, rostedt, mhiramat, mark.rutland,
mathieu.desnoyers, david.laight.linux, linux-kernel,
linux-trace-kernel
In-Reply-To: <20260311115426.GN606826@noisy.programming.kicks-ass.net>
On Wed, Mar 11, 2026 at 7:54 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Mar 11, 2026 at 07:52:47PM +0800, Yafang Shao wrote:
> > Recently, we resolved a latency spike issue caused by concurrently running
> > bpftrace processes. The root cause was high contention on the ftrace_lock
> > due to optimistic spinning. We can optimize this by disabling optimistic
> > spinning for ftrace_lock.
> >
> > While semaphores may present similar challenges, I'm not currently aware of
> > specific instances that exhibit this exact issue. Should we encounter
> > problematic semaphores in production workloads, we can address them at that
> > time.
> >
> > PATCH #1: introduce slow_mutex_[un]lock to disable optimistic spinning
> > PATCH #2: add variant for rtmutex
> > PATCH #3: disable optimistic spinning for ftrace_lock
> >
>
> So I really utterly hate this.
Do you have any other suggestions for optimizing this?
--
Regards
Yafang
* Re: [RFC PATCH v2 0/3] disable optimistic spinning for ftrace_lock
From: Peter Zijlstra @ 2026-03-11 11:54 UTC (permalink / raw)
To: Yafang Shao
Cc: mingo, will, boqun, longman, rostedt, mhiramat, mark.rutland,
mathieu.desnoyers, david.laight.linux, linux-kernel,
linux-trace-kernel
In-Reply-To: <20260311115250.78488-1-laoar.shao@gmail.com>
On Wed, Mar 11, 2026 at 07:52:47PM +0800, Yafang Shao wrote:
> Recently, we resolved a latency spike issue caused by concurrently running
> bpftrace processes. The root cause was high contention on the ftrace_lock
> due to optimistic spinning. We can optimize this by disabling optimistic
> spinning for ftrace_lock.
>
> While semaphores may present similar challenges, I'm not currently aware of
> specific instances that exhibit this exact issue. Should we encounter
> problematic semaphores in production workloads, we can address them at that
> time.
>
> PATCH #1: introduce slow_mutex_[un]lock to disable optimistic spinning
> PATCH #2: add variant for rtmutex
> PATCH #3: disable optimistic spinning for ftrace_lock
>
So I really utterly hate this.
* [RFC PATCH v2 3/3] ftrace: Disable optimistic spinning for ftrace_lock
From: Yafang Shao @ 2026-03-11 11:52 UTC (permalink / raw)
To: peterz, mingo, will, boqun, longman, rostedt, mhiramat,
mark.rutland, mathieu.desnoyers, david.laight.linux
Cc: linux-kernel, linux-trace-kernel, Yafang Shao
In-Reply-To: <20260311115250.78488-1-laoar.shao@gmail.com>
The ftrace_lock may be held for a relatively long duration. For example,
reading the available_filter_functions file takes considerable time:
$ time cat /sys/kernel/tracing/available_filter_functions &> /dev/null
real 0m0.457s user 0m0.001s sys 0m0.455s
When the lock owner is continuously running, other tasks waiting for the
lock will spin repeatedly until they hit a cond_resched() point, wasting
CPU cycles.
ftrace_lock is currently used in the following scenarios:
- Debugging
- Live patching
Neither of these scenarios is in the application critical path.
Therefore, it is reasonable to make tasks sleep when they cannot acquire
the lock immediately, rather than spinning and consuming CPU resources.
Performance Comparison
======================
- Before this change
  - Single task reading available_filter_functions:
      real 0m0.457s user 0m0.001s sys 0m0.455s
  - Six concurrent processes:
      real 0m2.666s user 0m0.001s sys 0m2.557s
      real 0m2.718s user 0m0.000s sys 0m2.655s
      real 0m2.718s user 0m0.001s sys 0m2.600s
      real 0m2.733s user 0m0.001s sys 0m2.554s
      real 0m2.735s user 0m0.000s sys 0m2.573s
      real 0m2.738s user 0m0.000s sys 0m2.664s
- After this change
  - Single task:
      real 0m0.454s user 0m0.002s sys 0m0.453s
  - Six concurrent processes:
      real 0m2.691s user 0m0.001s sys 0m0.458s
      real 0m2.785s user 0m0.001s sys 0m0.467s
      real 0m2.787s user 0m0.000s sys 0m0.469s
      real 0m2.787s user 0m0.000s sys 0m0.466s
      real 0m2.788s user 0m0.001s sys 0m0.468s
      real 0m2.789s user 0m0.000s sys 0m0.471s
The system time significantly decreases in the concurrent case, as tasks
now sleep while waiting for the lock instead of busy-spinning.
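
A back-of-the-envelope model reproduces the shape of these numbers. It
is a sketch under simplifying assumptions, not a measurement: the
readers are taken to be fully serialized by the lock, each needs the
single-reader time of work, and a spinning waiter is assumed to stay on
a CPU for the whole run. The struct and function names are hypothetical.

```c
#include <assert.h>

/* Hypothetical estimate for N readers contending on one lock. */
struct contention_estimate {
	double real_s;      /* wall-clock time until the last reader finishes */
	double sys_spin_s;  /* per-reader CPU time if waiters spin */
	double sys_sleep_s; /* per-reader CPU time if waiters sleep */
};

static struct contention_estimate estimate_contention(int readers,
						       double single_s)
{
	struct contention_estimate e;

	assert(readers > 0 && single_s >= 0.0);
	e.real_s = readers * single_s; /* the work is serialized by the lock */
	e.sys_spin_s = e.real_s;       /* spinners burn CPU while they wait */
	e.sys_sleep_s = single_s;      /* sleepers only pay for their own work */
	return e;
}
```

For six readers at 0.455 s each, the model gives about 2.73 s of real
time in both cases, about 2.73 s of sys time per reader when spinning
and 0.455 s when sleeping, which is in line with the roughly 2.6 s and
0.46 s reported above.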
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
kernel/trace/ftrace.c | 106 +++++++++++++++++++++---------------------
1 file changed, 53 insertions(+), 53 deletions(-)
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 827fb9a0bf0d..00c195e280c5 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -1284,10 +1284,10 @@ static void clear_ftrace_mod_list(struct list_head *head)
if (!head)
return;
- mutex_lock(&ftrace_lock);
+ slow_mutex_lock(&ftrace_lock);
list_for_each_entry_safe(p, n, head, list)
free_ftrace_mod(p);
- mutex_unlock(&ftrace_lock);
+ slow_mutex_unlock(&ftrace_lock);
}
void free_ftrace_hash(struct ftrace_hash *hash)
@@ -4254,7 +4254,7 @@ static void *t_start(struct seq_file *m, loff_t *pos)
void *p = NULL;
loff_t l;
- mutex_lock(&ftrace_lock);
+ slow_mutex_lock(&ftrace_lock);
if (unlikely(ftrace_disabled))
return NULL;
@@ -4305,7 +4305,7 @@ static void *t_start(struct seq_file *m, loff_t *pos)
static void t_stop(struct seq_file *m, void *p)
{
- mutex_unlock(&ftrace_lock);
+ slow_mutex_unlock(&ftrace_lock);
}
void * __weak
@@ -4362,11 +4362,11 @@ static __init void ftrace_check_work_func(struct work_struct *work)
struct ftrace_page *pg;
struct dyn_ftrace *rec;
- mutex_lock(&ftrace_lock);
+ slow_mutex_lock(&ftrace_lock);
do_for_each_ftrace_rec(pg, rec) {
test_for_valid_rec(rec);
} while_for_each_ftrace_rec();
- mutex_unlock(&ftrace_lock);
+ slow_mutex_unlock(&ftrace_lock);
}
static int __init ftrace_check_for_weak_functions(void)
@@ -5123,7 +5123,7 @@ static void process_mod_list(struct list_head *head, struct ftrace_ops *ops,
if (!new_hash)
goto out; /* warn? */
- mutex_lock(&ftrace_lock);
+ slow_mutex_lock(&ftrace_lock);
list_for_each_entry_safe(ftrace_mod, n, head, list) {
@@ -5145,7 +5145,7 @@ static void process_mod_list(struct list_head *head, struct ftrace_ops *ops,
ftrace_mod->func = func;
}
- mutex_unlock(&ftrace_lock);
+ slow_mutex_unlock(&ftrace_lock);
list_for_each_entry_safe(ftrace_mod, n, &process_mods, list) {
@@ -5159,11 +5159,11 @@ static void process_mod_list(struct list_head *head, struct ftrace_ops *ops,
if (enable && list_empty(head))
new_hash->flags &= ~FTRACE_HASH_FL_MOD;
- mutex_lock(&ftrace_lock);
+ slow_mutex_lock(&ftrace_lock);
ftrace_hash_move_and_update_ops(ops, orig_hash,
new_hash, enable);
- mutex_unlock(&ftrace_lock);
+ slow_mutex_unlock(&ftrace_lock);
out:
mutex_unlock(&ops->func_hash->regex_lock);
@@ -5465,7 +5465,7 @@ register_ftrace_function_probe(char *glob, struct trace_array *tr,
return -EINVAL;
- mutex_lock(&ftrace_lock);
+ slow_mutex_lock(&ftrace_lock);
/* Check if the probe_ops is already registered */
list_for_each_entry(iter, &tr->func_probes, list) {
if (iter->probe_ops == probe_ops) {
@@ -5476,7 +5476,7 @@ register_ftrace_function_probe(char *glob, struct trace_array *tr,
if (!probe) {
probe = kzalloc_obj(*probe);
if (!probe) {
- mutex_unlock(&ftrace_lock);
+ slow_mutex_unlock(&ftrace_lock);
return -ENOMEM;
}
probe->probe_ops = probe_ops;
@@ -5488,7 +5488,7 @@ register_ftrace_function_probe(char *glob, struct trace_array *tr,
acquire_probe_locked(probe);
- mutex_unlock(&ftrace_lock);
+ slow_mutex_unlock(&ftrace_lock);
/*
* Note, there's a small window here that the func_hash->filter_hash
@@ -5540,7 +5540,7 @@ register_ftrace_function_probe(char *glob, struct trace_array *tr,
}
}
- mutex_lock(&ftrace_lock);
+ slow_mutex_lock(&ftrace_lock);
if (!count) {
/* Nothing was added? */
@@ -5560,7 +5560,7 @@ register_ftrace_function_probe(char *glob, struct trace_array *tr,
ret = ftrace_startup(&probe->ops, 0);
out_unlock:
- mutex_unlock(&ftrace_lock);
+ slow_mutex_unlock(&ftrace_lock);
if (!ret)
ret = count;
@@ -5619,7 +5619,7 @@ unregister_ftrace_function_probe_func(char *glob, struct trace_array *tr,
return -EINVAL;
}
- mutex_lock(&ftrace_lock);
+ slow_mutex_lock(&ftrace_lock);
/* Check if the probe_ops is already registered */
list_for_each_entry(iter, &tr->func_probes, list) {
if (iter->probe_ops == probe_ops) {
@@ -5636,7 +5636,7 @@ unregister_ftrace_function_probe_func(char *glob, struct trace_array *tr,
acquire_probe_locked(probe);
- mutex_unlock(&ftrace_lock);
+ slow_mutex_unlock(&ftrace_lock);
mutex_lock(&probe->ops.func_hash->regex_lock);
@@ -5679,7 +5679,7 @@ unregister_ftrace_function_probe_func(char *glob, struct trace_array *tr,
goto out_unlock;
}
- mutex_lock(&ftrace_lock);
+ slow_mutex_lock(&ftrace_lock);
WARN_ON(probe->ref < count);
@@ -5703,7 +5703,7 @@ unregister_ftrace_function_probe_func(char *glob, struct trace_array *tr,
probe_ops->free(probe_ops, tr, entry->ip, probe->data);
kfree(entry);
}
- mutex_unlock(&ftrace_lock);
+ slow_mutex_unlock(&ftrace_lock);
out_unlock:
mutex_unlock(&probe->ops.func_hash->regex_lock);
@@ -5714,7 +5714,7 @@ unregister_ftrace_function_probe_func(char *glob, struct trace_array *tr,
return ret;
err_unlock_ftrace:
- mutex_unlock(&ftrace_lock);
+ slow_mutex_unlock(&ftrace_lock);
return ret;
}
@@ -5943,9 +5943,9 @@ ftrace_set_hash(struct ftrace_ops *ops, unsigned char *buf, int len,
goto out_regex_unlock;
}
- mutex_lock(&ftrace_lock);
+ slow_mutex_lock(&ftrace_lock);
ret = ftrace_hash_move_and_update_ops(ops, orig_hash, hash, enable);
- mutex_unlock(&ftrace_lock);
+ slow_mutex_unlock(&ftrace_lock);
out_regex_unlock:
mutex_unlock(&ops->func_hash->regex_lock);
@@ -6205,7 +6205,7 @@ __modify_ftrace_direct(struct ftrace_ops *ops, unsigned long addr)
* Now the ftrace_ops_list_func() is called to do the direct callers.
* We can safely change the direct functions attached to each entry.
*/
- mutex_lock(&ftrace_lock);
+ slow_mutex_lock(&ftrace_lock);
size = 1 << hash->size_bits;
for (i = 0; i < size; i++) {
@@ -6219,7 +6219,7 @@ __modify_ftrace_direct(struct ftrace_ops *ops, unsigned long addr)
/* Prevent store tearing if a trampoline concurrently accesses the value */
WRITE_ONCE(ops->direct_call, addr);
- mutex_unlock(&ftrace_lock);
+ slow_mutex_unlock(&ftrace_lock);
out:
/* Removing the tmp_ops will add the updated direct callers to the functions */
@@ -6625,7 +6625,7 @@ int update_ftrace_direct_mod(struct ftrace_ops *ops, struct ftrace_hash *hash, b
* Now the ftrace_ops_list_func() is called to do the direct callers.
* We can safely change the direct functions attached to each entry.
*/
- mutex_lock(&ftrace_lock);
+ slow_mutex_lock(&ftrace_lock);
size = 1 << hash->size_bits;
for (i = 0; i < size; i++) {
@@ -6637,7 +6637,7 @@ int update_ftrace_direct_mod(struct ftrace_ops *ops, struct ftrace_hash *hash, b
}
}
- mutex_unlock(&ftrace_lock);
+ slow_mutex_unlock(&ftrace_lock);
out:
/* Removing the tmp_ops will add the updated direct callers to the functions */
@@ -6980,10 +6980,10 @@ int ftrace_regex_release(struct inode *inode, struct file *file)
} else
orig_hash = &iter->ops->func_hash->notrace_hash;
- mutex_lock(&ftrace_lock);
+ slow_mutex_lock(&ftrace_lock);
ftrace_hash_move_and_update_ops(iter->ops, orig_hash,
iter->hash, filter_hash);
- mutex_unlock(&ftrace_lock);
+ slow_mutex_unlock(&ftrace_lock);
}
mutex_unlock(&iter->ops->func_hash->regex_lock);
@@ -7464,12 +7464,12 @@ void ftrace_create_filter_files(struct ftrace_ops *ops,
*/
void ftrace_destroy_filter_files(struct ftrace_ops *ops)
{
- mutex_lock(&ftrace_lock);
+ slow_mutex_lock(&ftrace_lock);
if (ops->flags & FTRACE_OPS_FL_ENABLED)
ftrace_shutdown(ops, 0);
ops->flags |= FTRACE_OPS_FL_DELETED;
ftrace_free_filter(ops);
- mutex_unlock(&ftrace_lock);
+ slow_mutex_unlock(&ftrace_lock);
}
static __init int ftrace_init_dyn_tracefs(struct dentry *d_tracer)
@@ -7571,7 +7571,7 @@ static int ftrace_process_locs(struct module *mod,
if (!start_pg)
return -ENOMEM;
- mutex_lock(&ftrace_lock);
+ slow_mutex_lock(&ftrace_lock);
/*
* Core and each module needs their own pages, as
@@ -7661,7 +7661,7 @@ static int ftrace_process_locs(struct module *mod,
local_irq_restore(flags);
ret = 0;
out:
- mutex_unlock(&ftrace_lock);
+ slow_mutex_unlock(&ftrace_lock);
/* We should have used all pages unless we skipped some */
if (pg_unuse) {
@@ -7868,7 +7868,7 @@ void ftrace_release_mod(struct module *mod)
struct ftrace_page *tmp_page = NULL;
struct ftrace_page *pg;
- mutex_lock(&ftrace_lock);
+ slow_mutex_lock(&ftrace_lock);
/*
* To avoid the UAF problem after the module is unloaded, the
@@ -7913,7 +7913,7 @@ void ftrace_release_mod(struct module *mod)
last_pg = &pg->next;
}
out_unlock:
- mutex_unlock(&ftrace_lock);
+ slow_mutex_unlock(&ftrace_lock);
/* Need to synchronize with ftrace_location_range() */
if (tmp_page)
@@ -7938,7 +7938,7 @@ void ftrace_module_enable(struct module *mod)
struct dyn_ftrace *rec;
struct ftrace_page *pg;
- mutex_lock(&ftrace_lock);
+ slow_mutex_lock(&ftrace_lock);
if (ftrace_disabled)
goto out_unlock;
@@ -8008,7 +8008,7 @@ void ftrace_module_enable(struct module *mod)
ftrace_arch_code_modify_post_process();
out_unlock:
- mutex_unlock(&ftrace_lock);
+ slow_mutex_unlock(&ftrace_lock);
process_cached_mods(mod->name);
}
@@ -8267,7 +8267,7 @@ void ftrace_free_mem(struct module *mod, void *start_ptr, void *end_ptr)
key.ip = start;
key.flags = end; /* overload flags, as it is unsigned long */
- mutex_lock(&ftrace_lock);
+ slow_mutex_lock(&ftrace_lock);
/*
* If we are freeing module init memory, then check if
@@ -8310,7 +8310,7 @@ void ftrace_free_mem(struct module *mod, void *start_ptr, void *end_ptr)
/* More than one function may be in this block */
goto again;
}
- mutex_unlock(&ftrace_lock);
+ slow_mutex_unlock(&ftrace_lock);
list_for_each_entry_safe(func, func_next, &clear_hash, list) {
clear_func_from_hashes(func);
@@ -8686,22 +8686,22 @@ static void clear_ftrace_pids(struct trace_array *tr, int type)
void ftrace_clear_pids(struct trace_array *tr)
{
- mutex_lock(&ftrace_lock);
+ slow_mutex_lock(&ftrace_lock);
clear_ftrace_pids(tr, TRACE_PIDS | TRACE_NO_PIDS);
- mutex_unlock(&ftrace_lock);
+ slow_mutex_unlock(&ftrace_lock);
}
static void ftrace_pid_reset(struct trace_array *tr, int type)
{
- mutex_lock(&ftrace_lock);
+ slow_mutex_lock(&ftrace_lock);
clear_ftrace_pids(tr, type);
ftrace_update_pid_func();
ftrace_startup_all(0);
- mutex_unlock(&ftrace_lock);
+ slow_mutex_unlock(&ftrace_lock);
}
/* Greater than any max PID */
@@ -8713,7 +8713,7 @@ static void *fpid_start(struct seq_file *m, loff_t *pos)
struct trace_pid_list *pid_list;
struct trace_array *tr = m->private;
- mutex_lock(&ftrace_lock);
+ slow_mutex_lock(&ftrace_lock);
rcu_read_lock_sched();
pid_list = rcu_dereference_sched(tr->function_pids);
@@ -8740,7 +8740,7 @@ static void fpid_stop(struct seq_file *m, void *p)
__releases(RCU)
{
rcu_read_unlock_sched();
- mutex_unlock(&ftrace_lock);
+ slow_mutex_unlock(&ftrace_lock);
}
static int fpid_show(struct seq_file *m, void *v)
@@ -8766,7 +8766,7 @@ static void *fnpid_start(struct seq_file *m, loff_t *pos)
struct trace_pid_list *pid_list;
struct trace_array *tr = m->private;
- mutex_lock(&ftrace_lock);
+ slow_mutex_lock(&ftrace_lock);
rcu_read_lock_sched();
pid_list = rcu_dereference_sched(tr->function_no_pids);
@@ -9057,7 +9057,7 @@ static int prepare_direct_functions_for_ipmodify(struct ftrace_ops *ops)
unsigned long ip = entry->ip;
bool found_op = false;
- mutex_lock(&ftrace_lock);
+ slow_mutex_lock(&ftrace_lock);
do_for_each_ftrace_op(op, ftrace_ops_list) {
if (!(op->flags & FTRACE_OPS_FL_DIRECT))
continue;
@@ -9066,7 +9066,7 @@ static int prepare_direct_functions_for_ipmodify(struct ftrace_ops *ops)
break;
}
} while_for_each_ftrace_op(op);
- mutex_unlock(&ftrace_lock);
+ slow_mutex_unlock(&ftrace_lock);
if (found_op) {
if (!op->ops_func)
@@ -9106,7 +9106,7 @@ static void cleanup_direct_functions_after_ipmodify(struct ftrace_ops *ops)
unsigned long ip = entry->ip;
bool found_op = false;
- mutex_lock(&ftrace_lock);
+ slow_mutex_lock(&ftrace_lock);
do_for_each_ftrace_op(op, ftrace_ops_list) {
if (!(op->flags & FTRACE_OPS_FL_DIRECT))
continue;
@@ -9115,7 +9115,7 @@ static void cleanup_direct_functions_after_ipmodify(struct ftrace_ops *ops)
break;
}
} while_for_each_ftrace_op(op);
- mutex_unlock(&ftrace_lock);
+ slow_mutex_unlock(&ftrace_lock);
/* The cleanup is optional, ignore any errors */
if (found_op && op->ops_func)
@@ -9153,11 +9153,11 @@ static int register_ftrace_function_nolock(struct ftrace_ops *ops)
ftrace_ops_init(ops);
- mutex_lock(&ftrace_lock);
+ slow_mutex_lock(&ftrace_lock);
ret = ftrace_startup(ops, 0);
- mutex_unlock(&ftrace_lock);
+ slow_mutex_unlock(&ftrace_lock);
return ret;
}
@@ -9200,9 +9200,9 @@ int unregister_ftrace_function(struct ftrace_ops *ops)
{
int ret;
- mutex_lock(&ftrace_lock);
+ slow_mutex_lock(&ftrace_lock);
ret = ftrace_shutdown(ops, 0);
- mutex_unlock(&ftrace_lock);
+ slow_mutex_unlock(&ftrace_lock);
cleanup_direct_functions_after_ipmodify(ops);
return ret;
--
2.47.3
* [RFC PATCH v2 2/3] locking/rtmutex: Add slow path variants for lock/unlock
From: Yafang Shao @ 2026-03-11 11:52 UTC (permalink / raw)
To: peterz, mingo, will, boqun, longman, rostedt, mhiramat,
mark.rutland, mathieu.desnoyers, david.laight.linux
Cc: linux-kernel, linux-trace-kernel, Yafang Shao
In-Reply-To: <20260311115250.78488-1-laoar.shao@gmail.com>
Add slow mutex APIs for rtmutex:
slow_rt_mutex_lock: lock a rtmutex without optimistic spinning
slow_rt_mutex_unlock: unlock the slow rtmutex
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
include/linux/rtmutex.h | 3 +++
kernel/locking/rtmutex.c | 37 +++++++++++++++++-----------
kernel/locking/rtmutex_api.c | 47 ++++++++++++++++++++++++++++++------
3 files changed, 66 insertions(+), 21 deletions(-)
diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index ede4c6bf6f22..22294a916ddc 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -109,6 +109,7 @@ extern void __rt_mutex_init(struct rt_mutex *lock, const char *name, struct lock
#ifdef CONFIG_DEBUG_LOCK_ALLOC
extern void rt_mutex_lock_nested(struct rt_mutex *lock, unsigned int subclass);
+extern void slow_rt_mutex_lock_nested(struct rt_mutex *lock, unsigned int subclass);
extern void _rt_mutex_lock_nest_lock(struct rt_mutex *lock, struct lockdep_map *nest_lock);
#define rt_mutex_lock(lock) rt_mutex_lock_nested(lock, 0)
#define rt_mutex_lock_nest_lock(lock, nest_lock) \
@@ -116,9 +117,11 @@ extern void _rt_mutex_lock_nest_lock(struct rt_mutex *lock, struct lockdep_map *
typecheck(struct lockdep_map *, &(nest_lock)->dep_map); \
_rt_mutex_lock_nest_lock(lock, &(nest_lock)->dep_map); \
} while (0)
+#define slow_rt_mutex_lock(lock) slow_rt_mutex_lock_nested(lock, 0)
#else
extern void rt_mutex_lock(struct rt_mutex *lock);
+extern void slow_rt_mutex_lock(struct rt_mutex *lock);
#define rt_mutex_lock_nested(lock, subclass) rt_mutex_lock(lock)
#define rt_mutex_lock_nest_lock(lock, nest_lock) rt_mutex_lock(lock)
#endif
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index c80902eacd79..663ff96cb1be 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1480,10 +1480,13 @@ static __always_inline void __rt_mutex_unlock(struct rt_mutex_base *lock)
#ifdef CONFIG_SMP
static bool rtmutex_spin_on_owner(struct rt_mutex_base *lock,
struct rt_mutex_waiter *waiter,
- struct task_struct *owner)
+ struct task_struct *owner,
+ const bool slow)
{
bool res = true;
+ if (slow)
+ return false;
rcu_read_lock();
for (;;) {
/* If owner changed, trylock again. */
@@ -1517,7 +1520,8 @@ static bool rtmutex_spin_on_owner(struct rt_mutex_base *lock,
#else
static bool rtmutex_spin_on_owner(struct rt_mutex_base *lock,
struct rt_mutex_waiter *waiter,
- struct task_struct *owner)
+ struct task_struct *owner,
+ const bool slow)
{
return false;
}
@@ -1606,7 +1610,8 @@ static int __sched rt_mutex_slowlock_block(struct rt_mutex_base *lock,
unsigned int state,
struct hrtimer_sleeper *timeout,
struct rt_mutex_waiter *waiter,
- struct wake_q_head *wake_q)
+ struct wake_q_head *wake_q,
+ const bool slow)
__releases(&lock->wait_lock) __acquires(&lock->wait_lock)
{
struct rt_mutex *rtm = container_of(lock, struct rt_mutex, rtmutex);
@@ -1642,7 +1647,7 @@ static int __sched rt_mutex_slowlock_block(struct rt_mutex_base *lock,
owner = NULL;
raw_spin_unlock_irq_wake(&lock->wait_lock, wake_q);
- if (!owner || !rtmutex_spin_on_owner(lock, waiter, owner)) {
+ if (!owner || !rtmutex_spin_on_owner(lock, waiter, owner, slow)) {
lockevent_inc(rtmutex_slow_sleep);
rt_mutex_schedule();
}
@@ -1693,7 +1698,8 @@ static int __sched __rt_mutex_slowlock(struct rt_mutex_base *lock,
unsigned int state,
enum rtmutex_chainwalk chwalk,
struct rt_mutex_waiter *waiter,
- struct wake_q_head *wake_q)
+ struct wake_q_head *wake_q,
+ const bool slow)
{
struct rt_mutex *rtm = container_of(lock, struct rt_mutex, rtmutex);
struct ww_mutex *ww = ww_container_of(rtm);
@@ -1718,7 +1724,7 @@ static int __sched __rt_mutex_slowlock(struct rt_mutex_base *lock,
ret = task_blocks_on_rt_mutex(lock, waiter, current, ww_ctx, chwalk, wake_q);
if (likely(!ret))
- ret = rt_mutex_slowlock_block(lock, ww_ctx, state, NULL, waiter, wake_q);
+ ret = rt_mutex_slowlock_block(lock, ww_ctx, state, NULL, waiter, wake_q, slow);
if (likely(!ret)) {
/* acquired the lock */
@@ -1749,7 +1755,8 @@ static int __sched __rt_mutex_slowlock(struct rt_mutex_base *lock,
static inline int __rt_mutex_slowlock_locked(struct rt_mutex_base *lock,
struct ww_acquire_ctx *ww_ctx,
unsigned int state,
- struct wake_q_head *wake_q)
+ struct wake_q_head *wake_q,
+ const bool slow)
{
struct rt_mutex_waiter waiter;
int ret;
@@ -1758,7 +1765,7 @@ static inline int __rt_mutex_slowlock_locked(struct rt_mutex_base *lock,
waiter.ww_ctx = ww_ctx;
ret = __rt_mutex_slowlock(lock, ww_ctx, state, RT_MUTEX_MIN_CHAINWALK,
- &waiter, wake_q);
+ &waiter, wake_q, slow);
debug_rt_mutex_free_waiter(&waiter);
lockevent_cond_inc(rtmutex_slow_wake, !wake_q_empty(wake_q));
@@ -1773,7 +1780,8 @@ static inline int __rt_mutex_slowlock_locked(struct rt_mutex_base *lock,
*/
static int __sched rt_mutex_slowlock(struct rt_mutex_base *lock,
struct ww_acquire_ctx *ww_ctx,
- unsigned int state)
+ unsigned int state,
+ const bool slow)
{
DEFINE_WAKE_Q(wake_q);
unsigned long flags;
@@ -1797,7 +1805,7 @@ static int __sched rt_mutex_slowlock(struct rt_mutex_base *lock,
* irqsave/restore variants.
*/
raw_spin_lock_irqsave(&lock->wait_lock, flags);
- ret = __rt_mutex_slowlock_locked(lock, ww_ctx, state, &wake_q);
+ ret = __rt_mutex_slowlock_locked(lock, ww_ctx, state, &wake_q, slow);
raw_spin_unlock_irqrestore_wake(&lock->wait_lock, flags, &wake_q);
rt_mutex_post_schedule();
@@ -1805,14 +1813,14 @@ static int __sched rt_mutex_slowlock(struct rt_mutex_base *lock,
}
static __always_inline int __rt_mutex_lock(struct rt_mutex_base *lock,
- unsigned int state)
+ unsigned int state, const bool slow)
{
lockdep_assert(!current->pi_blocked_on);
if (likely(rt_mutex_try_acquire(lock)))
return 0;
- return rt_mutex_slowlock(lock, NULL, state);
+ return rt_mutex_slowlock(lock, NULL, state, slow);
}
#endif /* RT_MUTEX_BUILD_MUTEX */
@@ -1827,7 +1835,8 @@ static __always_inline int __rt_mutex_lock(struct rt_mutex_base *lock,
* @wake_q: The wake_q to wake tasks after we release the wait_lock
*/
static void __sched rtlock_slowlock_locked(struct rt_mutex_base *lock,
- struct wake_q_head *wake_q)
+ struct wake_q_head *wake_q,
+ const bool slow)
__releases(&lock->wait_lock) __acquires(&lock->wait_lock)
{
struct rt_mutex_waiter waiter;
@@ -1863,7 +1872,7 @@ static void __sched rtlock_slowlock_locked(struct rt_mutex_base *lock,
owner = NULL;
raw_spin_unlock_irq_wake(&lock->wait_lock, wake_q);
- if (!owner || !rtmutex_spin_on_owner(lock, &waiter, owner)) {
+ if (!owner || !rtmutex_spin_on_owner(lock, &waiter, owner, slow)) {
lockevent_inc(rtlock_slow_sleep);
schedule_rtlock();
}
diff --git a/kernel/locking/rtmutex_api.c b/kernel/locking/rtmutex_api.c
index 59dbd29cb219..b196cdd35ff1 100644
--- a/kernel/locking/rtmutex_api.c
+++ b/kernel/locking/rtmutex_api.c
@@ -37,21 +37,29 @@ subsys_initcall(init_rtmutex_sysctl);
* The atomic acquire/release ops are compiled away, when either the
* architecture does not support cmpxchg or when debugging is enabled.
*/
-static __always_inline int __rt_mutex_lock_common(struct rt_mutex *lock,
+static __always_inline int ___rt_mutex_lock_common(struct rt_mutex *lock,
unsigned int state,
struct lockdep_map *nest_lock,
- unsigned int subclass)
+ unsigned int subclass,
+ const bool slow)
{
int ret;
might_sleep();
mutex_acquire_nest(&lock->dep_map, subclass, 0, nest_lock, _RET_IP_);
- ret = __rt_mutex_lock(&lock->rtmutex, state);
+ ret = __rt_mutex_lock(&lock->rtmutex, state, slow);
if (ret)
mutex_release(&lock->dep_map, _RET_IP_);
return ret;
}
+static __always_inline int __rt_mutex_lock_common(struct rt_mutex *lock,
+ unsigned int state,
+ struct lockdep_map *nest_lock,
+ unsigned int subclass)
+{
+ return ___rt_mutex_lock_common(lock, state, nest_lock, subclass, false);
+}
void rt_mutex_base_init(struct rt_mutex_base *rtb)
{
__rt_mutex_base_init(rtb);
@@ -77,6 +85,11 @@ void __sched _rt_mutex_lock_nest_lock(struct rt_mutex *lock, struct lockdep_map
}
EXPORT_SYMBOL_GPL(_rt_mutex_lock_nest_lock);
+void __sched slow_rt_mutex_lock_nested(struct rt_mutex *lock, unsigned int subclass)
+{
+ ___rt_mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, NULL, subclass, true);
+}
+
#else /* !CONFIG_DEBUG_LOCK_ALLOC */
/**
@@ -89,6 +102,11 @@ void __sched rt_mutex_lock(struct rt_mutex *lock)
__rt_mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, NULL, 0);
}
EXPORT_SYMBOL_GPL(rt_mutex_lock);
+
+void __sched slow_rt_mutex_lock(struct rt_mutex *lock)
+{
+ ___rt_mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, NULL, 0, true);
+}
#endif
/**
@@ -401,7 +419,7 @@ int __sched rt_mutex_wait_proxy_lock(struct rt_mutex_base *lock,
raw_spin_lock_irq(&lock->wait_lock);
/* sleep on the mutex */
set_current_state(TASK_INTERRUPTIBLE);
- ret = rt_mutex_slowlock_block(lock, NULL, TASK_INTERRUPTIBLE, to, waiter, NULL);
+ ret = rt_mutex_slowlock_block(lock, NULL, TASK_INTERRUPTIBLE, to, waiter, NULL, false);
/*
* try_to_take_rt_mutex() sets the waiter bit unconditionally. We might
* have to fix that up.
@@ -521,17 +539,18 @@ static void __mutex_rt_init_generic(struct mutex *mutex)
debug_check_no_locks_freed((void *)mutex, sizeof(*mutex));
}
-static __always_inline int __mutex_lock_common(struct mutex *lock,
+static __always_inline int ___mutex_lock_common(struct mutex *lock,
unsigned int state,
unsigned int subclass,
struct lockdep_map *nest_lock,
- unsigned long ip)
+ unsigned long ip,
+ const bool slow)
{
int ret;
might_sleep();
mutex_acquire_nest(&lock->dep_map, subclass, 0, nest_lock, ip);
- ret = __rt_mutex_lock(&lock->rtmutex, state);
+ ret = __rt_mutex_lock(&lock->rtmutex, state, slow);
if (ret)
mutex_release(&lock->dep_map, ip);
else
@@ -539,6 +558,15 @@ static __always_inline int __mutex_lock_common(struct mutex *lock,
return ret;
}
+static __always_inline int __mutex_lock_common(struct mutex *lock,
+ unsigned int state,
+ unsigned int subclass,
+ struct lockdep_map *nest_lock,
+ unsigned long ip)
+{
+ return ___mutex_lock_common(lock, state, subclass, nest_lock, ip, false);
+}
+
#ifdef CONFIG_DEBUG_LOCK_ALLOC
void mutex_rt_init_lockdep(struct mutex *mutex, const char *name, struct lock_class_key *key)
{
@@ -644,6 +672,11 @@ int __sched mutex_trylock(struct mutex *lock)
return __rt_mutex_trylock(&lock->rtmutex);
}
EXPORT_SYMBOL(mutex_trylock);
+
+void __sched slow_mutex_lock(struct mutex *lock)
+{
+ ___mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, 0, NULL, _RET_IP_, true);
+}
#endif /* !CONFIG_DEBUG_LOCK_ALLOC */
void __sched mutex_unlock(struct mutex *lock)
--
2.47.3
* [RFC PATCH v2 1/3] locking/mutex: Add slow path variants for lock/unlock
From: Yafang Shao @ 2026-03-11 11:52 UTC (permalink / raw)
To: peterz, mingo, will, boqun, longman, rostedt, mhiramat,
mark.rutland, mathieu.desnoyers, david.laight.linux
Cc: linux-kernel, linux-trace-kernel, Yafang Shao
In-Reply-To: <20260311115250.78488-1-laoar.shao@gmail.com>
Background
==========
One of our latency-sensitive services reported random CPU pressure spikes.
After a thorough investigation, we identified the root cause. The key
kernel stacks are as follows:
- Task A
2026-02-14-16:53:40.938243: [CPU198] 2156302(bpftrace) cgrp:4019437 pod:4019253
find_kallsyms_symbol+142
module_address_lookup+104
kallsyms_lookup_buildid+203
kallsyms_lookup+20
print_rec+64
t_show+67
seq_read_iter+709
seq_read+165
vfs_read+165
ksys_read+103
__x64_sys_read+25
do_syscall_64+56
entry_SYSCALL_64_after_hwframe+100
This task (2156302, bpftrace) is reading
/sys/kernel/tracing/available_filter_functions to check whether a function
is traceable:
https://github.com/bpftrace/bpftrace/blob/master/src/tracefs/tracefs.h#L21
Reading the available_filter_functions file is time-consuming, as it
contains tens of thousands of functions:
$ cat /sys/kernel/tracing/available_filter_functions | wc -l
59221
$ time cat /sys/kernel/tracing/available_filter_functions > /dev/null
real 0m0.458s user 0m0.001s sys 0m0.457s
Consequently, the ftrace_lock is held by this task for an extended period.
- Other Tasks
2026-02-14-16:53:41.437094: [CPU79] 2156308(bpftrace) cgrp:4019437 pod:4019253
mutex_spin_on_owner+108
__mutex_lock.constprop.0+1132
__mutex_lock_slowpath+19
mutex_lock+56
t_start+51
seq_read_iter+250
seq_read+165
vfs_read+165
ksys_read+103
__x64_sys_read+25
do_syscall_64+56
entry_SYSCALL_64_after_hwframe+100
Since ftrace_lock is held by Task-A and Task-A is actively running on a
CPU, all other tasks waiting for the same lock will spin on their
respective CPUs. This leads to increased CPU pressure.
Reproduction
============
This issue can be reproduced simply by running
`cat available_filter_functions`.
- Single process reading available_filter_functions:
$ time cat /sys/kernel/tracing/available_filter_functions > /dev/null
real 0m0.458s user 0m0.001s sys 0m0.457s
- Six processes reading available_filter_functions simultaneously:
for i in `seq 0 5`; do
time cat /sys/kernel/tracing/available_filter_functions > /dev/null &
done
The results are as follows:
real 0m2.666s user 0m0.001s sys 0m2.557s
real 0m2.718s user 0m0.000s sys 0m2.655s
real 0m2.718s user 0m0.001s sys 0m2.600s
real 0m2.733s user 0m0.001s sys 0m2.554s
real 0m2.735s user 0m0.000s sys 0m2.573s
real 0m2.738s user 0m0.000s sys 0m2.664s
As more processes are added, the system time increases correspondingly.
Solution
========
One approach is to make reading available_filter_functions as fast as
possible. However, that only shortens the hold time; the underlying risk,
contention caused by optimistic spinning, remains.
Therefore, we need to consider an alternative solution that avoids
optimistic spinning for heavy mutexes that may be held for long durations.
Note that we do not want to disable CONFIG_MUTEX_SPIN_ON_OWNER entirely, as
that could lead to unexpected performance regressions.
In this patch, two new APIs are introduced to allow heavy locks to
selectively disable optimistic spinning.
slow_mutex_lock() - lock a mutex without optimistic spinning
slow_mutex_unlock() - unlock the slow mutex
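To illustrate the idea outside the kernel, here is a minimal userspace
sketch, assuming a toy lock built on a C11 atomic_flag. Every name in it
(struct demo_lock, demo_lock_common(), DEMO_SPIN_LIMIT, and so on) is
hypothetical and is not part of this patch, and sched_yield() is only a
stand-in for the kernel's sleeping path. It shows the same shape as the
patch: one common routine takes a constant "slow" flag, and the slow
variant skips the optimistic phase.

```c
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_SPIN_LIMIT 1000	/* arbitrary bound chosen for this sketch */

/* Hypothetical toy lock, only for illustration. */
struct demo_lock {
	atomic_flag held;
};

static bool demo_trylock(struct demo_lock *l)
{
	/* test_and_set returns the previous value: false means we took it. */
	return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

static void demo_unlock(struct demo_lock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Returns how many busy-wait iterations were spent before acquiring. */
static int demo_lock_common(struct demo_lock *l, const bool slow)
{
	int spins = 0;

	/* Fast path: identical for both variants. */
	if (demo_trylock(l))
		return 0;

	/* Optimistic phase: skipped entirely when slow is true. */
	while (!slow && spins < DEMO_SPIN_LIMIT) {
		spins++;
		if (demo_trylock(l))
			return spins;
	}

	/* Stand-in for sleeping: give up the CPU between attempts. */
	while (!demo_trylock(l))
		sched_yield();
	return spins;
}

static int demo_lock(struct demo_lock *l)
{
	return demo_lock_common(l, false);
}

static int demo_slow_lock(struct demo_lock *l)
{
	return demo_lock_common(l, true);
}
```

Both variants share the uncontended fast path, so the slow variant only
changes behavior under contention, and unlocking is the same in either
case.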
- The result of this optimization
After applying this slow mutex to ftrace_lock and concurrently running six
processes, the results are as follows:
real 0m2.691s user 0m0.001s sys 0m0.458s
real 0m2.785s user 0m0.001s sys 0m0.467s
real 0m2.787s user 0m0.000s sys 0m0.469s
real 0m2.787s user 0m0.000s sys 0m0.466s
real 0m2.788s user 0m0.001s sys 0m0.468s
real 0m2.789s user 0m0.000s sys 0m0.471s
The system time remains similar to that of running a single process.
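As a rough cross-check of the two sets of numbers, the following sketch
encodes a simple model. It assumes the readers fully serialize on the
lock, that a spinning waiter stays on a CPU for its whole wait, and that
a sleeping waiter consumes no CPU while waiting. The helper
estimate_reader() and struct estimate are hypothetical names used only
for this illustration, and the model ignores scheduling overhead.

```c
/* Hypothetical helper, not part of the patch: rough per-reader estimate. */
struct estimate {
	double real_s;	/* wall-clock time per reader, in seconds */
	double sys_s;	/* system CPU time per reader, in seconds */
};

static struct estimate estimate_reader(int nr_readers, double single_s,
				       int spinning)
{
	struct estimate e;

	/* Readers serialize on the lock, so each finishes after about N * T. */
	e.real_s = nr_readers * single_s;

	/* A spinning waiter burns CPU for the whole wait; a sleeper does not. */
	e.sys_s = spinning ? e.real_s : single_s;

	return e;
}
```

With the figures from this message, estimate_reader(6, 0.458, 1) gives
about 2.75s for both real and sys, and estimate_reader(6, 0.458, 0) gives
about 2.75s real and 0.458s sys. That is in line with the measured
2.67s to 2.79s real, 2.55s to 2.66s sys with spinning, and 0.46s to 0.47s
sys without it.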
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
include/linux/mutex.h | 4 ++++
kernel/locking/mutex.c | 41 ++++++++++++++++++++++++++++++++++-------
2 files changed, 38 insertions(+), 7 deletions(-)
diff --git a/include/linux/mutex.h b/include/linux/mutex.h
index ecaa0440f6ec..eed0e87c084c 100644
--- a/include/linux/mutex.h
+++ b/include/linux/mutex.h
@@ -189,11 +189,13 @@ extern int __must_check mutex_lock_interruptible_nested(struct mutex *lock,
extern int __must_check _mutex_lock_killable(struct mutex *lock,
unsigned int subclass, struct lockdep_map *nest_lock) __cond_acquires(0, lock);
extern void mutex_lock_io_nested(struct mutex *lock, unsigned int subclass) __acquires(lock);
+extern void slow_mutex_lock_nested(struct mutex *lock, unsigned int subclass);
#define mutex_lock(lock) mutex_lock_nested(lock, 0)
#define mutex_lock_interruptible(lock) mutex_lock_interruptible_nested(lock, 0)
#define mutex_lock_killable(lock) _mutex_lock_killable(lock, 0, NULL)
#define mutex_lock_io(lock) mutex_lock_io_nested(lock, 0)
+#define slow_mutex_lock(lock) slow_mutex_lock_nested(lock, 0)
#define mutex_lock_nest_lock(lock, nest_lock) \
do { \
@@ -215,6 +217,7 @@ extern void mutex_lock(struct mutex *lock) __acquires(lock);
extern int __must_check mutex_lock_interruptible(struct mutex *lock) __cond_acquires(0, lock);
extern int __must_check mutex_lock_killable(struct mutex *lock) __cond_acquires(0, lock);
extern void mutex_lock_io(struct mutex *lock) __acquires(lock);
+extern void slow_mutex_lock(struct mutex *lock) __acquires(lock);
# define mutex_lock_nested(lock, subclass) mutex_lock(lock)
# define mutex_lock_interruptible_nested(lock, subclass) mutex_lock_interruptible(lock)
@@ -247,6 +250,7 @@ extern int mutex_trylock(struct mutex *lock) __cond_acquires(true, lock);
#endif
extern void mutex_unlock(struct mutex *lock) __releases(lock);
+#define slow_mutex_unlock(lock) mutex_unlock(lock)
extern int atomic_dec_and_mutex_lock(atomic_t *cnt, struct mutex *lock) __cond_acquires(true, lock);
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 2a1d165b3167..5766d824b3fe 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -443,8 +443,11 @@ static inline int mutex_can_spin_on_owner(struct mutex *lock)
*/
static __always_inline bool
mutex_optimistic_spin(struct mutex *lock, struct ww_acquire_ctx *ww_ctx,
- struct mutex_waiter *waiter)
+ struct mutex_waiter *waiter, const bool slow)
{
+ if (slow)
+ return false;
+
if (!waiter) {
/*
* The purpose of the mutex_can_spin_on_owner() function is
@@ -577,7 +580,8 @@ EXPORT_SYMBOL(ww_mutex_unlock);
static __always_inline int __sched
__mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclass,
struct lockdep_map *nest_lock, unsigned long ip,
- struct ww_acquire_ctx *ww_ctx, const bool use_ww_ctx)
+ struct ww_acquire_ctx *ww_ctx, const bool use_ww_ctx,
+ const bool slow)
{
DEFINE_WAKE_Q(wake_q);
struct mutex_waiter waiter;
@@ -615,7 +619,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN);
if (__mutex_trylock(lock) ||
- mutex_optimistic_spin(lock, ww_ctx, NULL)) {
+ mutex_optimistic_spin(lock, ww_ctx, NULL, slow)) {
/* got the lock, yay! */
lock_acquired(&lock->dep_map, ip);
if (ww_ctx)
@@ -716,7 +720,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
* to run.
*/
clear_task_blocked_on(current, lock);
- if (mutex_optimistic_spin(lock, ww_ctx, &waiter))
+ if (mutex_optimistic_spin(lock, ww_ctx, &waiter, slow))
break;
set_task_blocked_on(current, lock);
trace_contention_begin(lock, LCB_F_MUTEX);
@@ -773,14 +777,21 @@ static int __sched
__mutex_lock(struct mutex *lock, unsigned int state, unsigned int subclass,
struct lockdep_map *nest_lock, unsigned long ip)
{
- return __mutex_lock_common(lock, state, subclass, nest_lock, ip, NULL, false);
+ return __mutex_lock_common(lock, state, subclass, nest_lock, ip, NULL, false, false);
+}
+
+static int __sched
+__slow_mutex_lock(struct mutex *lock, unsigned int state, unsigned int subclass,
+ struct lockdep_map *nest_lock, unsigned long ip)
+{
+ return __mutex_lock_common(lock, state, subclass, nest_lock, ip, NULL, false, true);
}
static int __sched
__ww_mutex_lock(struct mutex *lock, unsigned int state, unsigned int subclass,
unsigned long ip, struct ww_acquire_ctx *ww_ctx)
{
- return __mutex_lock_common(lock, state, subclass, NULL, ip, ww_ctx, true);
+ return __mutex_lock_common(lock, state, subclass, NULL, ip, ww_ctx, true, false);
}
/**
@@ -861,11 +872,17 @@ mutex_lock_io_nested(struct mutex *lock, unsigned int subclass)
token = io_schedule_prepare();
__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE,
- subclass, NULL, _RET_IP_, NULL, 0);
+ subclass, NULL, _RET_IP_, NULL, 0, false);
io_schedule_finish(token);
}
EXPORT_SYMBOL_GPL(mutex_lock_io_nested);
+void __sched
+slow_mutex_lock_nested(struct mutex *lock, unsigned int subclass)
+{
+ __slow_mutex_lock(lock, TASK_UNINTERRUPTIBLE, subclass, NULL, _RET_IP_);
+}
+
static inline int
ww_mutex_deadlock_injection(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
{
@@ -923,6 +940,16 @@ ww_mutex_lock_interruptible(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
}
EXPORT_SYMBOL_GPL(ww_mutex_lock_interruptible);
+#else
+
+void __sched slow_mutex_lock(struct mutex *lock)
+{
+ might_sleep();
+
+ if (!__mutex_trylock_fast(lock))
+ __slow_mutex_lock(lock, TASK_UNINTERRUPTIBLE, 0, NULL, _RET_IP_);
+}
+
#endif
/*
--
2.47.3
* [RFC PATCH v2 0/3] disable optimistic spinning for ftrace_lock
From: Yafang Shao @ 2026-03-11 11:52 UTC (permalink / raw)
To: peterz, mingo, will, boqun, longman, rostedt, mhiramat,
mark.rutland, mathieu.desnoyers, david.laight.linux
Cc: linux-kernel, linux-trace-kernel, Yafang Shao
Recently, we resolved a latency spike issue caused by concurrently running
bpftrace processes. The root cause was high contention on the ftrace_lock
due to optimistic spinning. We can optimize this by disabling optimistic
spinning for ftrace_lock.
While semaphores may present similar challenges, I'm not currently aware of
specific instances that exhibit this exact issue. Should we encounter
problematic semaphores in production workloads, we can address them at that
time.
PATCH #1: introduce slow_mutex_[un]lock to disable optimistic spinning
PATCH #2: add variant for rtmutex
PATCH #3: disable optimistic spinning for ftrace_lock
v1->v2:
- add slow_mutex_[un]lock (Steven)
- add variant for rtmutex (Waiman)
- revise commit log for clarity and accuracy (Waiman, Peter)
- note that semaphores may present similar challenges (David)
RFC v1: https://lore.kernel.org/bpf/20260304074650.58165-1-laoar.shao@gmail.com/
Yafang Shao (3):
locking/mutex: Add slow path variants for lock/unlock
locking/rtmutex: Add slow path variants for lock/unlock
ftrace: Disable optimistic spinning for ftrace_lock
include/linux/mutex.h | 4 ++
include/linux/rtmutex.h | 3 +
kernel/locking/mutex.c | 41 +++++++++++---
kernel/locking/rtmutex.c | 37 +++++++-----
kernel/locking/rtmutex_api.c | 47 +++++++++++++---
kernel/trace/ftrace.c | 106 +++++++++++++++++------------------
6 files changed, 157 insertions(+), 81 deletions(-)
--
2.47.3