public inbox for linux-trace-kernel@vger.kernel.org
* [RFC PATCH v2 0/3] disable optimistic spinning for ftrace_lock
@ 2026-03-11 11:52 Yafang Shao
  2026-03-11 11:52 ` [RFC PATCH v2 1/3] locking/mutex: Add slow path variants for lock/unlock Yafang Shao
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Yafang Shao @ 2026-03-11 11:52 UTC (permalink / raw)
  To: peterz, mingo, will, boqun, longman, rostedt, mhiramat,
	mark.rutland, mathieu.desnoyers, david.laight.linux
  Cc: linux-kernel, linux-trace-kernel, Yafang Shao

Recently, we resolved a latency-spike issue caused by concurrently running
bpftrace processes. The root cause was high contention on ftrace_lock,
amplified by optimistic spinning. This series addresses the problem by
disabling optimistic spinning for ftrace_lock.

While semaphores may present similar challenges, I'm not currently aware of
specific instances that exhibit this exact issue. Should we encounter
problematic semaphores in production workloads, we can address them at that
time.

PATCH #1: introduce slow_mutex_[un]lock to disable optimistic spinning
PATCH #2: add variant for rtmutex
PATCH #3: disable optimistic spinning for ftrace_lock

v1->v2:
- add slow_mutex_[un]lock (Steven)
- add variant for rtmutex (Waiman)
- revise commit log for clarity and accuracy (Waiman, Peter)
- note that semaphores may present similar challenges (David)

RFC v1: https://lore.kernel.org/bpf/20260304074650.58165-1-laoar.shao@gmail.com/

Yafang Shao (3):
  locking/mutex: Add slow path variants for lock/unlock
  locking/rtmutex: Add slow path variants for lock/unlock
  ftrace: Disable optimistic spinning for ftrace_lock

 include/linux/mutex.h        |   4 ++
 include/linux/rtmutex.h      |   3 +
 kernel/locking/mutex.c       |  41 +++++++++++---
 kernel/locking/rtmutex.c     |  37 +++++++-----
 kernel/locking/rtmutex_api.c |  47 +++++++++++++---
 kernel/trace/ftrace.c        | 106 +++++++++++++++++------------------
 6 files changed, 157 insertions(+), 81 deletions(-)

-- 
2.47.3


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [RFC PATCH v2 1/3] locking/mutex: Add slow path variants for lock/unlock
  2026-03-11 11:52 [RFC PATCH v2 0/3] disable optimistic spinning for ftrace_lock Yafang Shao
@ 2026-03-11 11:52 ` Yafang Shao
  2026-03-11 11:52 ` [RFC PATCH v2 2/3] locking/rtmutex: " Yafang Shao
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: Yafang Shao @ 2026-03-11 11:52 UTC (permalink / raw)
  To: peterz, mingo, will, boqun, longman, rostedt, mhiramat,
	mark.rutland, mathieu.desnoyers, david.laight.linux
  Cc: linux-kernel, linux-trace-kernel, Yafang Shao

Background
==========

One of our latency-sensitive services reported random CPU pressure spikes.
A thorough investigation identified the root cause. The key kernel stacks
are as follows:

- Task A

2026-02-14-16:53:40.938243: [CPU198] 2156302(bpftrace) cgrp:4019437 pod:4019253

        find_kallsyms_symbol+142
        module_address_lookup+104
        kallsyms_lookup_buildid+203
        kallsyms_lookup+20
        print_rec+64
        t_show+67
        seq_read_iter+709
        seq_read+165
        vfs_read+165
        ksys_read+103
        __x64_sys_read+25
        do_syscall_64+56
        entry_SYSCALL_64_after_hwframe+100

This task (2156302, bpftrace) is reading
/sys/kernel/tracing/available_filter_functions to check whether a function
is traceable:

  https://github.com/bpftrace/bpftrace/blob/master/src/tracefs/tracefs.h#L21

Reading the available_filter_functions file is time-consuming, as it
contains tens of thousands of functions:

  $ cat /sys/kernel/tracing/available_filter_functions | wc -l
  59221

  $ time cat /sys/kernel/tracing/available_filter_functions > /dev/null
  real 0m0.458s user 0m0.001s sys 0m0.457s

Consequently, the ftrace_lock is held by this task for an extended period.

- Other Tasks

2026-02-14-16:53:41.437094: [CPU79] 2156308(bpftrace) cgrp:4019437 pod:4019253

        mutex_spin_on_owner+108
        __mutex_lock.constprop.0+1132
        __mutex_lock_slowpath+19
        mutex_lock+56
        t_start+51
        seq_read_iter+250
        seq_read+165
        vfs_read+165
        ksys_read+103
        __x64_sys_read+25
        do_syscall_64+56
        entry_SYSCALL_64_after_hwframe+100

Since ftrace_lock is held by Task-A and Task-A is actively running on a
CPU, all other tasks waiting for the same lock will spin on their
respective CPUs. This leads to increased CPU pressure.

Reproduction
============

This issue can be reproduced simply by running several
`cat available_filter_functions` processes concurrently.

- Single process reading available_filter_functions:

  $ time cat /sys/kernel/tracing/available_filter_functions > /dev/null
  real 0m0.458s user 0m0.001s sys 0m0.457s

- Six processes reading available_filter_functions simultaneously:

  for i in `seq 0 5`; do
      time cat /sys/kernel/tracing/available_filter_functions > /dev/null &
  done

  The results are as follows:

  real 0m2.666s user 0m0.001s sys 0m2.557s
  real 0m2.718s user 0m0.000s sys 0m2.655s
  real 0m2.718s user 0m0.001s sys 0m2.600s
  real 0m2.733s user 0m0.001s sys 0m2.554s
  real 0m2.735s user 0m0.000s sys 0m2.573s
  real 0m2.738s user 0m0.000s sys 0m2.664s

  As more processes are added, the system time increases correspondingly.

Solution
========

One approach would be to optimize the reading of available_filter_functions
itself to make it as fast as possible. However, that does not remove the
underlying risk: contention caused by optimistic spinning remains whenever
the lock is held for a long time.

Therefore, we need an alternative solution that avoids optimistic spinning
for heavy mutexes that may be held for long durations. Note that we do not
want to disable CONFIG_MUTEX_SPIN_ON_OWNER entirely, as that could lead to
unexpected performance regressions elsewhere.

In this patch, two new APIs are introduced to allow heavy locks to
selectively disable optimistic spinning.

  slow_mutex_lock() - lock a mutex without optimistic spinning
  slow_mutex_unlock() - unlock the slow mutex

- The result of this optimization

After applying this slow mutex to ftrace_lock and concurrently running six
processes, the results are as follows:

  real 0m2.691s user 0m0.001s sys 0m0.458s
  real 0m2.785s user 0m0.001s sys 0m0.467s
  real 0m2.787s user 0m0.000s sys 0m0.469s
  real 0m2.787s user 0m0.000s sys 0m0.466s
  real 0m2.788s user 0m0.001s sys 0m0.468s
  real 0m2.789s user 0m0.000s sys 0m0.471s

The system time remains similar to that of running a single process.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/mutex.h  |  4 ++++
 kernel/locking/mutex.c | 41 ++++++++++++++++++++++++++++++++++-------
 2 files changed, 38 insertions(+), 7 deletions(-)

diff --git a/include/linux/mutex.h b/include/linux/mutex.h
index ecaa0440f6ec..eed0e87c084c 100644
--- a/include/linux/mutex.h
+++ b/include/linux/mutex.h
@@ -189,11 +189,13 @@ extern int __must_check mutex_lock_interruptible_nested(struct mutex *lock,
 extern int __must_check _mutex_lock_killable(struct mutex *lock,
 		unsigned int subclass, struct lockdep_map *nest_lock) __cond_acquires(0, lock);
 extern void mutex_lock_io_nested(struct mutex *lock, unsigned int subclass) __acquires(lock);
+extern void slow_mutex_lock_nested(struct mutex *lock, unsigned int subclass);
 
 #define mutex_lock(lock) mutex_lock_nested(lock, 0)
 #define mutex_lock_interruptible(lock) mutex_lock_interruptible_nested(lock, 0)
 #define mutex_lock_killable(lock) _mutex_lock_killable(lock, 0, NULL)
 #define mutex_lock_io(lock) mutex_lock_io_nested(lock, 0)
+#define slow_mutex_lock(lock) slow_mutex_lock_nested(lock, 0)
 
 #define mutex_lock_nest_lock(lock, nest_lock)				\
 do {									\
@@ -215,6 +217,7 @@ extern void mutex_lock(struct mutex *lock) __acquires(lock);
 extern int __must_check mutex_lock_interruptible(struct mutex *lock) __cond_acquires(0, lock);
 extern int __must_check mutex_lock_killable(struct mutex *lock) __cond_acquires(0, lock);
 extern void mutex_lock_io(struct mutex *lock) __acquires(lock);
+extern void slow_mutex_lock(struct mutex *lock) __acquires(lock);
 
 # define mutex_lock_nested(lock, subclass) mutex_lock(lock)
 # define mutex_lock_interruptible_nested(lock, subclass) mutex_lock_interruptible(lock)
@@ -247,6 +250,7 @@ extern int mutex_trylock(struct mutex *lock) __cond_acquires(true, lock);
 #endif
 
 extern void mutex_unlock(struct mutex *lock) __releases(lock);
+#define slow_mutex_unlock(lock) mutex_unlock(lock)
 
 extern int atomic_dec_and_mutex_lock(atomic_t *cnt, struct mutex *lock) __cond_acquires(true, lock);
 
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 2a1d165b3167..5766d824b3fe 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -443,8 +443,11 @@ static inline int mutex_can_spin_on_owner(struct mutex *lock)
  */
 static __always_inline bool
 mutex_optimistic_spin(struct mutex *lock, struct ww_acquire_ctx *ww_ctx,
-		      struct mutex_waiter *waiter)
+		      struct mutex_waiter *waiter, const bool slow)
 {
+	if (slow)
+		return false;
+
 	if (!waiter) {
 		/*
 		 * The purpose of the mutex_can_spin_on_owner() function is
@@ -577,7 +580,8 @@ EXPORT_SYMBOL(ww_mutex_unlock);
 static __always_inline int __sched
 __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclass,
 		    struct lockdep_map *nest_lock, unsigned long ip,
-		    struct ww_acquire_ctx *ww_ctx, const bool use_ww_ctx)
+		    struct ww_acquire_ctx *ww_ctx, const bool use_ww_ctx,
+		    const bool slow)
 {
 	DEFINE_WAKE_Q(wake_q);
 	struct mutex_waiter waiter;
@@ -615,7 +619,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 
 	trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN);
 	if (__mutex_trylock(lock) ||
-	    mutex_optimistic_spin(lock, ww_ctx, NULL)) {
+	    mutex_optimistic_spin(lock, ww_ctx, NULL, slow)) {
 		/* got the lock, yay! */
 		lock_acquired(&lock->dep_map, ip);
 		if (ww_ctx)
@@ -716,7 +720,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 			 * to run.
 			 */
 			clear_task_blocked_on(current, lock);
-			if (mutex_optimistic_spin(lock, ww_ctx, &waiter))
+			if (mutex_optimistic_spin(lock, ww_ctx, &waiter, slow))
 				break;
 			set_task_blocked_on(current, lock);
 			trace_contention_begin(lock, LCB_F_MUTEX);
@@ -773,14 +777,21 @@ static int __sched
 __mutex_lock(struct mutex *lock, unsigned int state, unsigned int subclass,
 	     struct lockdep_map *nest_lock, unsigned long ip)
 {
-	return __mutex_lock_common(lock, state, subclass, nest_lock, ip, NULL, false);
+	return __mutex_lock_common(lock, state, subclass, nest_lock, ip, NULL, false, false);
+}
+
+static int __sched
+__slow_mutex_lock(struct mutex *lock, unsigned int state, unsigned int subclass,
+		    struct lockdep_map *nest_lock, unsigned long ip)
+{
+	return __mutex_lock_common(lock, state, subclass, nest_lock, ip, NULL, false, true);
 }
 
 static int __sched
 __ww_mutex_lock(struct mutex *lock, unsigned int state, unsigned int subclass,
 		unsigned long ip, struct ww_acquire_ctx *ww_ctx)
 {
-	return __mutex_lock_common(lock, state, subclass, NULL, ip, ww_ctx, true);
+	return __mutex_lock_common(lock, state, subclass, NULL, ip, ww_ctx, true, false);
 }
 
 /**
@@ -861,11 +872,17 @@ mutex_lock_io_nested(struct mutex *lock, unsigned int subclass)
 
 	token = io_schedule_prepare();
 	__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE,
-			    subclass, NULL, _RET_IP_, NULL, 0);
+			    subclass, NULL, _RET_IP_, NULL, 0, false);
 	io_schedule_finish(token);
 }
 EXPORT_SYMBOL_GPL(mutex_lock_io_nested);
 
+void __sched
+slow_mutex_lock_nested(struct mutex *lock, unsigned int subclass)
+{
+	__slow_mutex_lock(lock, TASK_UNINTERRUPTIBLE, subclass, NULL, _RET_IP_);
+}
+
 static inline int
 ww_mutex_deadlock_injection(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
 {
@@ -923,6 +940,16 @@ ww_mutex_lock_interruptible(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
 }
 EXPORT_SYMBOL_GPL(ww_mutex_lock_interruptible);
 
+#else
+
+void __sched slow_mutex_lock(struct mutex *lock)
+{
+	might_sleep();
+
+	if (!__mutex_trylock_fast(lock))
+		__slow_mutex_lock(lock, TASK_UNINTERRUPTIBLE, 0, NULL, _RET_IP_);
+}
+
 #endif
 
 /*
-- 
2.47.3



* [RFC PATCH v2 2/3] locking/rtmutex: Add slow path variants for lock/unlock
  2026-03-11 11:52 [RFC PATCH v2 0/3] disable optimistic spinning for ftrace_lock Yafang Shao
  2026-03-11 11:52 ` [RFC PATCH v2 1/3] locking/mutex: Add slow path variants for lock/unlock Yafang Shao
@ 2026-03-11 11:52 ` Yafang Shao
  2026-03-11 11:52 ` [RFC PATCH v2 3/3] ftrace: Disable optimistic spinning for ftrace_lock Yafang Shao
  2026-03-11 11:54 ` [RFC PATCH v2 0/3] disable " Peter Zijlstra
  3 siblings, 0 replies; 13+ messages in thread
From: Yafang Shao @ 2026-03-11 11:52 UTC (permalink / raw)
  To: peterz, mingo, will, boqun, longman, rostedt, mhiramat,
	mark.rutland, mathieu.desnoyers, david.laight.linux
  Cc: linux-kernel, linux-trace-kernel, Yafang Shao

Add slow mutex APIs for rtmutex:

 slow_rt_mutex_lock: lock a rtmutex without optimistic spinning
 slow_rt_mutex_unlock: unlock the slow rtmutex

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/rtmutex.h      |  3 +++
 kernel/locking/rtmutex.c     | 37 +++++++++++++++++-----------
 kernel/locking/rtmutex_api.c | 47 ++++++++++++++++++++++++++++++------
 3 files changed, 66 insertions(+), 21 deletions(-)

diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index ede4c6bf6f22..22294a916ddc 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -109,6 +109,7 @@ extern void __rt_mutex_init(struct rt_mutex *lock, const char *name, struct lock
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 extern void rt_mutex_lock_nested(struct rt_mutex *lock, unsigned int subclass);
+extern void slow_rt_mutex_lock_nested(struct rt_mutex *lock, unsigned int subclass);
 extern void _rt_mutex_lock_nest_lock(struct rt_mutex *lock, struct lockdep_map *nest_lock);
 #define rt_mutex_lock(lock) rt_mutex_lock_nested(lock, 0)
 #define rt_mutex_lock_nest_lock(lock, nest_lock)			\
@@ -116,9 +117,11 @@ extern void _rt_mutex_lock_nest_lock(struct rt_mutex *lock, struct lockdep_map *
 		typecheck(struct lockdep_map *, &(nest_lock)->dep_map);	\
 		_rt_mutex_lock_nest_lock(lock, &(nest_lock)->dep_map);	\
 	} while (0)
+#define slow_rt_mutex_lock(lock) slow_rt_mutex_lock_nested(lock, 0)
 
 #else
 extern void rt_mutex_lock(struct rt_mutex *lock);
+extern void slow_rt_mutex_lock(struct rt_mutex *lock);
 #define rt_mutex_lock_nested(lock, subclass) rt_mutex_lock(lock)
 #define rt_mutex_lock_nest_lock(lock, nest_lock) rt_mutex_lock(lock)
 #endif
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index c80902eacd79..663ff96cb1be 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1480,10 +1480,13 @@ static __always_inline void __rt_mutex_unlock(struct rt_mutex_base *lock)
 #ifdef CONFIG_SMP
 static bool rtmutex_spin_on_owner(struct rt_mutex_base *lock,
 				  struct rt_mutex_waiter *waiter,
-				  struct task_struct *owner)
+				  struct task_struct *owner,
+				  const bool slow)
 {
 	bool res = true;
 
+	if (slow)
+		return false;
 	rcu_read_lock();
 	for (;;) {
 		/* If owner changed, trylock again. */
@@ -1517,7 +1520,8 @@ static bool rtmutex_spin_on_owner(struct rt_mutex_base *lock,
 #else
 static bool rtmutex_spin_on_owner(struct rt_mutex_base *lock,
 				  struct rt_mutex_waiter *waiter,
-				  struct task_struct *owner)
+				  struct task_struct *owner,
+				  const bool slow)
 {
 	return false;
 }
@@ -1606,7 +1610,8 @@ static int __sched rt_mutex_slowlock_block(struct rt_mutex_base *lock,
 					   unsigned int state,
 					   struct hrtimer_sleeper *timeout,
 					   struct rt_mutex_waiter *waiter,
-					   struct wake_q_head *wake_q)
+					   struct wake_q_head *wake_q,
+					   const bool slow)
 	__releases(&lock->wait_lock) __acquires(&lock->wait_lock)
 {
 	struct rt_mutex *rtm = container_of(lock, struct rt_mutex, rtmutex);
@@ -1642,7 +1647,7 @@ static int __sched rt_mutex_slowlock_block(struct rt_mutex_base *lock,
 			owner = NULL;
 		raw_spin_unlock_irq_wake(&lock->wait_lock, wake_q);
 
-		if (!owner || !rtmutex_spin_on_owner(lock, waiter, owner)) {
+		if (!owner || !rtmutex_spin_on_owner(lock, waiter, owner, slow)) {
 			lockevent_inc(rtmutex_slow_sleep);
 			rt_mutex_schedule();
 		}
@@ -1693,7 +1698,8 @@ static int __sched __rt_mutex_slowlock(struct rt_mutex_base *lock,
 				       unsigned int state,
 				       enum rtmutex_chainwalk chwalk,
 				       struct rt_mutex_waiter *waiter,
-				       struct wake_q_head *wake_q)
+				       struct wake_q_head *wake_q,
+				       const bool slow)
 {
 	struct rt_mutex *rtm = container_of(lock, struct rt_mutex, rtmutex);
 	struct ww_mutex *ww = ww_container_of(rtm);
@@ -1718,7 +1724,7 @@ static int __sched __rt_mutex_slowlock(struct rt_mutex_base *lock,
 
 	ret = task_blocks_on_rt_mutex(lock, waiter, current, ww_ctx, chwalk, wake_q);
 	if (likely(!ret))
-		ret = rt_mutex_slowlock_block(lock, ww_ctx, state, NULL, waiter, wake_q);
+		ret = rt_mutex_slowlock_block(lock, ww_ctx, state, NULL, waiter, wake_q, slow);
 
 	if (likely(!ret)) {
 		/* acquired the lock */
@@ -1749,7 +1755,8 @@ static int __sched __rt_mutex_slowlock(struct rt_mutex_base *lock,
 static inline int __rt_mutex_slowlock_locked(struct rt_mutex_base *lock,
 					     struct ww_acquire_ctx *ww_ctx,
 					     unsigned int state,
-					     struct wake_q_head *wake_q)
+					     struct wake_q_head *wake_q,
+					     const bool slow)
 {
 	struct rt_mutex_waiter waiter;
 	int ret;
@@ -1758,7 +1765,7 @@ static inline int __rt_mutex_slowlock_locked(struct rt_mutex_base *lock,
 	waiter.ww_ctx = ww_ctx;
 
 	ret = __rt_mutex_slowlock(lock, ww_ctx, state, RT_MUTEX_MIN_CHAINWALK,
-				  &waiter, wake_q);
+				  &waiter, wake_q, slow);
 
 	debug_rt_mutex_free_waiter(&waiter);
 	lockevent_cond_inc(rtmutex_slow_wake, !wake_q_empty(wake_q));
@@ -1773,7 +1780,8 @@ static inline int __rt_mutex_slowlock_locked(struct rt_mutex_base *lock,
  */
 static int __sched rt_mutex_slowlock(struct rt_mutex_base *lock,
 				     struct ww_acquire_ctx *ww_ctx,
-				     unsigned int state)
+				     unsigned int state,
+				     const bool slow)
 {
 	DEFINE_WAKE_Q(wake_q);
 	unsigned long flags;
@@ -1797,7 +1805,7 @@ static int __sched rt_mutex_slowlock(struct rt_mutex_base *lock,
 	 * irqsave/restore variants.
 	 */
 	raw_spin_lock_irqsave(&lock->wait_lock, flags);
-	ret = __rt_mutex_slowlock_locked(lock, ww_ctx, state, &wake_q);
+	ret = __rt_mutex_slowlock_locked(lock, ww_ctx, state, &wake_q, slow);
 	raw_spin_unlock_irqrestore_wake(&lock->wait_lock, flags, &wake_q);
 	rt_mutex_post_schedule();
 
@@ -1805,14 +1813,14 @@ static int __sched rt_mutex_slowlock(struct rt_mutex_base *lock,
 }
 
 static __always_inline int __rt_mutex_lock(struct rt_mutex_base *lock,
-					   unsigned int state)
+					   unsigned int state, const bool slow)
 {
 	lockdep_assert(!current->pi_blocked_on);
 
 	if (likely(rt_mutex_try_acquire(lock)))
 		return 0;
 
-	return rt_mutex_slowlock(lock, NULL, state);
+	return rt_mutex_slowlock(lock, NULL, state, slow);
 }
 #endif /* RT_MUTEX_BUILD_MUTEX */
 
@@ -1827,7 +1835,8 @@ static __always_inline int __rt_mutex_lock(struct rt_mutex_base *lock,
  * @wake_q:	The wake_q to wake tasks after we release the wait_lock
  */
 static void __sched rtlock_slowlock_locked(struct rt_mutex_base *lock,
-					   struct wake_q_head *wake_q)
+					   struct wake_q_head *wake_q,
+					   const bool slow)
 	__releases(&lock->wait_lock) __acquires(&lock->wait_lock)
 {
 	struct rt_mutex_waiter waiter;
@@ -1863,7 +1872,7 @@ static void __sched rtlock_slowlock_locked(struct rt_mutex_base *lock,
 			owner = NULL;
 		raw_spin_unlock_irq_wake(&lock->wait_lock, wake_q);
 
-		if (!owner || !rtmutex_spin_on_owner(lock, &waiter, owner)) {
+		if (!owner || !rtmutex_spin_on_owner(lock, &waiter, owner, slow)) {
 			lockevent_inc(rtlock_slow_sleep);
 			schedule_rtlock();
 		}
diff --git a/kernel/locking/rtmutex_api.c b/kernel/locking/rtmutex_api.c
index 59dbd29cb219..b196cdd35ff1 100644
--- a/kernel/locking/rtmutex_api.c
+++ b/kernel/locking/rtmutex_api.c
@@ -37,21 +37,29 @@ subsys_initcall(init_rtmutex_sysctl);
  * The atomic acquire/release ops are compiled away, when either the
  * architecture does not support cmpxchg or when debugging is enabled.
  */
-static __always_inline int __rt_mutex_lock_common(struct rt_mutex *lock,
+static __always_inline int ___rt_mutex_lock_common(struct rt_mutex *lock,
 						  unsigned int state,
 						  struct lockdep_map *nest_lock,
-						  unsigned int subclass)
+						  unsigned int subclass,
+						  const bool slow)
 {
 	int ret;
 
 	might_sleep();
 	mutex_acquire_nest(&lock->dep_map, subclass, 0, nest_lock, _RET_IP_);
-	ret = __rt_mutex_lock(&lock->rtmutex, state);
+	ret = __rt_mutex_lock(&lock->rtmutex, state, slow);
 	if (ret)
 		mutex_release(&lock->dep_map, _RET_IP_);
 	return ret;
 }
 
+static __always_inline int __rt_mutex_lock_common(struct rt_mutex *lock,
+						  unsigned int state,
+						  struct lockdep_map *nest_lock,
+						  unsigned int subclass)
+{
+	return ___rt_mutex_lock_common(lock, state, nest_lock, subclass, false);
+}
 void rt_mutex_base_init(struct rt_mutex_base *rtb)
 {
 	__rt_mutex_base_init(rtb);
@@ -77,6 +85,11 @@ void __sched _rt_mutex_lock_nest_lock(struct rt_mutex *lock, struct lockdep_map
 }
 EXPORT_SYMBOL_GPL(_rt_mutex_lock_nest_lock);
 
+void __sched slow_rt_mutex_lock_nested(struct rt_mutex *lock, unsigned int subclass)
+{
+	___rt_mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, NULL, subclass, true);
+}
+
 #else /* !CONFIG_DEBUG_LOCK_ALLOC */
 
 /**
@@ -89,6 +102,11 @@ void __sched rt_mutex_lock(struct rt_mutex *lock)
 	__rt_mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, NULL, 0);
 }
 EXPORT_SYMBOL_GPL(rt_mutex_lock);
+
+void __sched slow_rt_mutex_lock(struct rt_mutex *lock)
+{
+	___rt_mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, NULL, 0, true);
+}
 #endif
 
 /**
@@ -401,7 +419,7 @@ int __sched rt_mutex_wait_proxy_lock(struct rt_mutex_base *lock,
 	raw_spin_lock_irq(&lock->wait_lock);
 	/* sleep on the mutex */
 	set_current_state(TASK_INTERRUPTIBLE);
-	ret = rt_mutex_slowlock_block(lock, NULL, TASK_INTERRUPTIBLE, to, waiter, NULL);
+	ret = rt_mutex_slowlock_block(lock, NULL, TASK_INTERRUPTIBLE, to, waiter, NULL, false);
 	/*
 	 * try_to_take_rt_mutex() sets the waiter bit unconditionally. We might
 	 * have to fix that up.
@@ -521,17 +539,18 @@ static void __mutex_rt_init_generic(struct mutex *mutex)
 	debug_check_no_locks_freed((void *)mutex, sizeof(*mutex));
 }
 
-static __always_inline int __mutex_lock_common(struct mutex *lock,
+static __always_inline int ___mutex_lock_common(struct mutex *lock,
 					       unsigned int state,
 					       unsigned int subclass,
 					       struct lockdep_map *nest_lock,
-					       unsigned long ip)
+					       unsigned long ip,
+					       const bool slow)
 {
 	int ret;
 
 	might_sleep();
 	mutex_acquire_nest(&lock->dep_map, subclass, 0, nest_lock, ip);
-	ret = __rt_mutex_lock(&lock->rtmutex, state);
+	ret = __rt_mutex_lock(&lock->rtmutex, state, slow);
 	if (ret)
 		mutex_release(&lock->dep_map, ip);
 	else
@@ -539,6 +558,15 @@ static __always_inline int __mutex_lock_common(struct mutex *lock,
 	return ret;
 }
 
+static __always_inline int __mutex_lock_common(struct mutex *lock,
+					       unsigned int state,
+					       unsigned int subclass,
+					       struct lockdep_map *nest_lock,
+					       unsigned long ip)
+{
+	return ___mutex_lock_common(lock, state, subclass, nest_lock, ip, false);
+}
+
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 void mutex_rt_init_lockdep(struct mutex *mutex, const char *name, struct lock_class_key *key)
 {
@@ -644,6 +672,11 @@ int __sched mutex_trylock(struct mutex *lock)
 	return __rt_mutex_trylock(&lock->rtmutex);
 }
 EXPORT_SYMBOL(mutex_trylock);
+
+void __sched slow_mutex_lock(struct mutex *lock)
+{
+	___mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, 0, NULL, _RET_IP_, true);
+}
 #endif /* !CONFIG_DEBUG_LOCK_ALLOC */
 
 void __sched mutex_unlock(struct mutex *lock)
-- 
2.47.3



* [RFC PATCH v2 3/3] ftrace: Disable optimistic spinning for ftrace_lock
  2026-03-11 11:52 [RFC PATCH v2 0/3] disable optimistic spinning for ftrace_lock Yafang Shao
  2026-03-11 11:52 ` [RFC PATCH v2 1/3] locking/mutex: Add slow path variants for lock/unlock Yafang Shao
  2026-03-11 11:52 ` [RFC PATCH v2 2/3] locking/rtmutex: " Yafang Shao
@ 2026-03-11 11:52 ` Yafang Shao
  2026-03-11 11:54 ` [RFC PATCH v2 0/3] disable " Peter Zijlstra
  3 siblings, 0 replies; 13+ messages in thread
From: Yafang Shao @ 2026-03-11 11:52 UTC (permalink / raw)
  To: peterz, mingo, will, boqun, longman, rostedt, mhiramat,
	mark.rutland, mathieu.desnoyers, david.laight.linux
  Cc: linux-kernel, linux-trace-kernel, Yafang Shao

The ftrace_lock may be held for a relatively long duration. For example,
reading the available_filter_functions file takes considerable time:

  $ time cat /sys/kernel/tracing/available_filter_functions &> /dev/null
  real 0m0.457s user 0m0.001s sys 0m0.455s

When the lock owner is continuously running, other tasks waiting for the
lock will spin repeatedly until they hit a cond_resched() point, wasting
CPU cycles.

ftrace_lock is currently used in the following scenarios:
- Debugging
- Live patching

Neither of these scenarios is in the application's critical path.
Therefore, it is reasonable to make tasks sleep when they cannot acquire
the lock immediately, rather than spin and consume CPU resources.

Performance Comparison
======================

- Before this change

  - Single task reading available_filter_functions:

    real 0m0.457s user 0m0.001s sys 0m0.455s

  - Six concurrent processes:

    real 0m2.666s user 0m0.001s sys 0m2.557s
    real 0m2.718s user 0m0.000s sys 0m2.655s
    real 0m2.718s user 0m0.001s sys 0m2.600s
    real 0m2.733s user 0m0.001s sys 0m2.554s
    real 0m2.735s user 0m0.000s sys 0m2.573s
    real 0m2.738s user 0m0.000s sys 0m2.664s

- After this change

  - Single task:

    real 0m0.454s user 0m0.002s sys 0m0.453s

  - Six concurrent processes:

    real 0m2.691s user 0m0.001s sys 0m0.458s
    real 0m2.785s user 0m0.001s sys 0m0.467s
    real 0m2.787s user 0m0.000s sys 0m0.469s
    real 0m2.787s user 0m0.000s sys 0m0.466s
    real 0m2.788s user 0m0.001s sys 0m0.468s
    real 0m2.789s user 0m0.000s sys 0m0.471s

The system time significantly decreases in the concurrent case, as tasks
now sleep while waiting for the lock instead of busy-spinning.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/trace/ftrace.c | 106 +++++++++++++++++++++---------------------
 1 file changed, 53 insertions(+), 53 deletions(-)

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 827fb9a0bf0d..00c195e280c5 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -1284,10 +1284,10 @@ static void clear_ftrace_mod_list(struct list_head *head)
 	if (!head)
 		return;
 
-	mutex_lock(&ftrace_lock);
+	slow_mutex_lock(&ftrace_lock);
 	list_for_each_entry_safe(p, n, head, list)
 		free_ftrace_mod(p);
-	mutex_unlock(&ftrace_lock);
+	slow_mutex_unlock(&ftrace_lock);
 }
 
 void free_ftrace_hash(struct ftrace_hash *hash)
@@ -4254,7 +4254,7 @@ static void *t_start(struct seq_file *m, loff_t *pos)
 	void *p = NULL;
 	loff_t l;
 
-	mutex_lock(&ftrace_lock);
+	slow_mutex_lock(&ftrace_lock);
 
 	if (unlikely(ftrace_disabled))
 		return NULL;
@@ -4305,7 +4305,7 @@ static void *t_start(struct seq_file *m, loff_t *pos)
 
 static void t_stop(struct seq_file *m, void *p)
 {
-	mutex_unlock(&ftrace_lock);
+	slow_mutex_unlock(&ftrace_lock);
 }
 
 void * __weak
@@ -4362,11 +4362,11 @@ static __init void ftrace_check_work_func(struct work_struct *work)
 	struct ftrace_page *pg;
 	struct dyn_ftrace *rec;
 
-	mutex_lock(&ftrace_lock);
+	slow_mutex_lock(&ftrace_lock);
 	do_for_each_ftrace_rec(pg, rec) {
 		test_for_valid_rec(rec);
 	} while_for_each_ftrace_rec();
-	mutex_unlock(&ftrace_lock);
+	slow_mutex_unlock(&ftrace_lock);
 }
 
 static int __init ftrace_check_for_weak_functions(void)
@@ -5123,7 +5123,7 @@ static void process_mod_list(struct list_head *head, struct ftrace_ops *ops,
 	if (!new_hash)
 		goto out; /* warn? */
 
-	mutex_lock(&ftrace_lock);
+	slow_mutex_lock(&ftrace_lock);
 
 	list_for_each_entry_safe(ftrace_mod, n, head, list) {
 
@@ -5145,7 +5145,7 @@ static void process_mod_list(struct list_head *head, struct ftrace_ops *ops,
 		ftrace_mod->func = func;
 	}
 
-	mutex_unlock(&ftrace_lock);
+	slow_mutex_unlock(&ftrace_lock);
 
 	list_for_each_entry_safe(ftrace_mod, n, &process_mods, list) {
 
@@ -5159,11 +5159,11 @@ static void process_mod_list(struct list_head *head, struct ftrace_ops *ops,
 	if (enable && list_empty(head))
 		new_hash->flags &= ~FTRACE_HASH_FL_MOD;
 
-	mutex_lock(&ftrace_lock);
+	slow_mutex_lock(&ftrace_lock);
 
 	ftrace_hash_move_and_update_ops(ops, orig_hash,
 					      new_hash, enable);
-	mutex_unlock(&ftrace_lock);
+	slow_mutex_unlock(&ftrace_lock);
 
  out:
 	mutex_unlock(&ops->func_hash->regex_lock);
@@ -5465,7 +5465,7 @@ register_ftrace_function_probe(char *glob, struct trace_array *tr,
 		return -EINVAL;
 
 
-	mutex_lock(&ftrace_lock);
+	slow_mutex_lock(&ftrace_lock);
 	/* Check if the probe_ops is already registered */
 	list_for_each_entry(iter, &tr->func_probes, list) {
 		if (iter->probe_ops == probe_ops) {
@@ -5476,7 +5476,7 @@ register_ftrace_function_probe(char *glob, struct trace_array *tr,
 	if (!probe) {
 		probe = kzalloc_obj(*probe);
 		if (!probe) {
-			mutex_unlock(&ftrace_lock);
+			slow_mutex_unlock(&ftrace_lock);
 			return -ENOMEM;
 		}
 		probe->probe_ops = probe_ops;
@@ -5488,7 +5488,7 @@ register_ftrace_function_probe(char *glob, struct trace_array *tr,
 
 	acquire_probe_locked(probe);
 
-	mutex_unlock(&ftrace_lock);
+	slow_mutex_unlock(&ftrace_lock);
 
 	/*
 	 * Note, there's a small window here that the func_hash->filter_hash
@@ -5540,7 +5540,7 @@ register_ftrace_function_probe(char *glob, struct trace_array *tr,
 		}
 	}
 
-	mutex_lock(&ftrace_lock);
+	slow_mutex_lock(&ftrace_lock);
 
 	if (!count) {
 		/* Nothing was added? */
@@ -5560,7 +5560,7 @@ register_ftrace_function_probe(char *glob, struct trace_array *tr,
 		ret = ftrace_startup(&probe->ops, 0);
 
  out_unlock:
-	mutex_unlock(&ftrace_lock);
+	slow_mutex_unlock(&ftrace_lock);
 
 	if (!ret)
 		ret = count;
@@ -5619,7 +5619,7 @@ unregister_ftrace_function_probe_func(char *glob, struct trace_array *tr,
 			return -EINVAL;
 	}
 
-	mutex_lock(&ftrace_lock);
+	slow_mutex_lock(&ftrace_lock);
 	/* Check if the probe_ops is already registered */
 	list_for_each_entry(iter, &tr->func_probes, list) {
 		if (iter->probe_ops == probe_ops) {
@@ -5636,7 +5636,7 @@ unregister_ftrace_function_probe_func(char *glob, struct trace_array *tr,
 
 	acquire_probe_locked(probe);
 
-	mutex_unlock(&ftrace_lock);
+	slow_mutex_unlock(&ftrace_lock);
 
 	mutex_lock(&probe->ops.func_hash->regex_lock);
 
@@ -5679,7 +5679,7 @@ unregister_ftrace_function_probe_func(char *glob, struct trace_array *tr,
 		goto out_unlock;
 	}
 
-	mutex_lock(&ftrace_lock);
+	slow_mutex_lock(&ftrace_lock);
 
 	WARN_ON(probe->ref < count);
 
@@ -5703,7 +5703,7 @@ unregister_ftrace_function_probe_func(char *glob, struct trace_array *tr,
 			probe_ops->free(probe_ops, tr, entry->ip, probe->data);
 		kfree(entry);
 	}
-	mutex_unlock(&ftrace_lock);
+	slow_mutex_unlock(&ftrace_lock);
 
  out_unlock:
 	mutex_unlock(&probe->ops.func_hash->regex_lock);
@@ -5714,7 +5714,7 @@ unregister_ftrace_function_probe_func(char *glob, struct trace_array *tr,
 	return ret;
 
  err_unlock_ftrace:
-	mutex_unlock(&ftrace_lock);
+	slow_mutex_unlock(&ftrace_lock);
 	return ret;
 }
 
@@ -5943,9 +5943,9 @@ ftrace_set_hash(struct ftrace_ops *ops, unsigned char *buf, int len,
 			goto out_regex_unlock;
 	}
 
-	mutex_lock(&ftrace_lock);
+	slow_mutex_lock(&ftrace_lock);
 	ret = ftrace_hash_move_and_update_ops(ops, orig_hash, hash, enable);
-	mutex_unlock(&ftrace_lock);
+	slow_mutex_unlock(&ftrace_lock);
 
  out_regex_unlock:
 	mutex_unlock(&ops->func_hash->regex_lock);
@@ -6205,7 +6205,7 @@ __modify_ftrace_direct(struct ftrace_ops *ops, unsigned long addr)
 	 * Now the ftrace_ops_list_func() is called to do the direct callers.
 	 * We can safely change the direct functions attached to each entry.
 	 */
-	mutex_lock(&ftrace_lock);
+	slow_mutex_lock(&ftrace_lock);
 
 	size = 1 << hash->size_bits;
 	for (i = 0; i < size; i++) {
@@ -6219,7 +6219,7 @@ __modify_ftrace_direct(struct ftrace_ops *ops, unsigned long addr)
 	/* Prevent store tearing if a trampoline concurrently accesses the value */
 	WRITE_ONCE(ops->direct_call, addr);
 
-	mutex_unlock(&ftrace_lock);
+	slow_mutex_unlock(&ftrace_lock);
 
 out:
 	/* Removing the tmp_ops will add the updated direct callers to the functions */
@@ -6625,7 +6625,7 @@ int update_ftrace_direct_mod(struct ftrace_ops *ops, struct ftrace_hash *hash, b
 	 * Now the ftrace_ops_list_func() is called to do the direct callers.
 	 * We can safely change the direct functions attached to each entry.
 	 */
-	mutex_lock(&ftrace_lock);
+	slow_mutex_lock(&ftrace_lock);
 
 	size = 1 << hash->size_bits;
 	for (i = 0; i < size; i++) {
@@ -6637,7 +6637,7 @@ int update_ftrace_direct_mod(struct ftrace_ops *ops, struct ftrace_hash *hash, b
 		}
 	}
 
-	mutex_unlock(&ftrace_lock);
+	slow_mutex_unlock(&ftrace_lock);
 
 out:
 	/* Removing the tmp_ops will add the updated direct callers to the functions */
@@ -6980,10 +6980,10 @@ int ftrace_regex_release(struct inode *inode, struct file *file)
 		} else
 			orig_hash = &iter->ops->func_hash->notrace_hash;
 
-		mutex_lock(&ftrace_lock);
+		slow_mutex_lock(&ftrace_lock);
 		ftrace_hash_move_and_update_ops(iter->ops, orig_hash,
 						      iter->hash, filter_hash);
-		mutex_unlock(&ftrace_lock);
+		slow_mutex_unlock(&ftrace_lock);
 	}
 
 	mutex_unlock(&iter->ops->func_hash->regex_lock);
@@ -7464,12 +7464,12 @@ void ftrace_create_filter_files(struct ftrace_ops *ops,
  */
 void ftrace_destroy_filter_files(struct ftrace_ops *ops)
 {
-	mutex_lock(&ftrace_lock);
+	slow_mutex_lock(&ftrace_lock);
 	if (ops->flags & FTRACE_OPS_FL_ENABLED)
 		ftrace_shutdown(ops, 0);
 	ops->flags |= FTRACE_OPS_FL_DELETED;
 	ftrace_free_filter(ops);
-	mutex_unlock(&ftrace_lock);
+	slow_mutex_unlock(&ftrace_lock);
 }
 
 static __init int ftrace_init_dyn_tracefs(struct dentry *d_tracer)
@@ -7571,7 +7571,7 @@ static int ftrace_process_locs(struct module *mod,
 	if (!start_pg)
 		return -ENOMEM;
 
-	mutex_lock(&ftrace_lock);
+	slow_mutex_lock(&ftrace_lock);
 
 	/*
 	 * Core and each module needs their own pages, as
@@ -7661,7 +7661,7 @@ static int ftrace_process_locs(struct module *mod,
 		local_irq_restore(flags);
 	ret = 0;
  out:
-	mutex_unlock(&ftrace_lock);
+	slow_mutex_unlock(&ftrace_lock);
 
 	/* We should have used all pages unless we skipped some */
 	if (pg_unuse) {
@@ -7868,7 +7868,7 @@ void ftrace_release_mod(struct module *mod)
 	struct ftrace_page *tmp_page = NULL;
 	struct ftrace_page *pg;
 
-	mutex_lock(&ftrace_lock);
+	slow_mutex_lock(&ftrace_lock);
 
 	/*
 	 * To avoid the UAF problem after the module is unloaded, the
@@ -7913,7 +7913,7 @@ void ftrace_release_mod(struct module *mod)
 			last_pg = &pg->next;
 	}
  out_unlock:
-	mutex_unlock(&ftrace_lock);
+	slow_mutex_unlock(&ftrace_lock);
 
 	/* Need to synchronize with ftrace_location_range() */
 	if (tmp_page)
@@ -7938,7 +7938,7 @@ void ftrace_module_enable(struct module *mod)
 	struct dyn_ftrace *rec;
 	struct ftrace_page *pg;
 
-	mutex_lock(&ftrace_lock);
+	slow_mutex_lock(&ftrace_lock);
 
 	if (ftrace_disabled)
 		goto out_unlock;
@@ -8008,7 +8008,7 @@ void ftrace_module_enable(struct module *mod)
 		ftrace_arch_code_modify_post_process();
 
  out_unlock:
-	mutex_unlock(&ftrace_lock);
+	slow_mutex_unlock(&ftrace_lock);
 
 	process_cached_mods(mod->name);
 }
@@ -8267,7 +8267,7 @@ void ftrace_free_mem(struct module *mod, void *start_ptr, void *end_ptr)
 	key.ip = start;
 	key.flags = end;	/* overload flags, as it is unsigned long */
 
-	mutex_lock(&ftrace_lock);
+	slow_mutex_lock(&ftrace_lock);
 
 	/*
 	 * If we are freeing module init memory, then check if
@@ -8310,7 +8310,7 @@ void ftrace_free_mem(struct module *mod, void *start_ptr, void *end_ptr)
 		/* More than one function may be in this block */
 		goto again;
 	}
-	mutex_unlock(&ftrace_lock);
+	slow_mutex_unlock(&ftrace_lock);
 
 	list_for_each_entry_safe(func, func_next, &clear_hash, list) {
 		clear_func_from_hashes(func);
@@ -8686,22 +8686,22 @@ static void clear_ftrace_pids(struct trace_array *tr, int type)
 
 void ftrace_clear_pids(struct trace_array *tr)
 {
-	mutex_lock(&ftrace_lock);
+	slow_mutex_lock(&ftrace_lock);
 
 	clear_ftrace_pids(tr, TRACE_PIDS | TRACE_NO_PIDS);
 
-	mutex_unlock(&ftrace_lock);
+	slow_mutex_unlock(&ftrace_lock);
 }
 
 static void ftrace_pid_reset(struct trace_array *tr, int type)
 {
-	mutex_lock(&ftrace_lock);
+	slow_mutex_lock(&ftrace_lock);
 	clear_ftrace_pids(tr, type);
 
 	ftrace_update_pid_func();
 	ftrace_startup_all(0);
 
-	mutex_unlock(&ftrace_lock);
+	slow_mutex_unlock(&ftrace_lock);
 }
 
 /* Greater than any max PID */
@@ -8713,7 +8713,7 @@ static void *fpid_start(struct seq_file *m, loff_t *pos)
 	struct trace_pid_list *pid_list;
 	struct trace_array *tr = m->private;
 
-	mutex_lock(&ftrace_lock);
+	slow_mutex_lock(&ftrace_lock);
 	rcu_read_lock_sched();
 
 	pid_list = rcu_dereference_sched(tr->function_pids);
@@ -8740,7 +8740,7 @@ static void fpid_stop(struct seq_file *m, void *p)
 	__releases(RCU)
 {
 	rcu_read_unlock_sched();
-	mutex_unlock(&ftrace_lock);
+	slow_mutex_unlock(&ftrace_lock);
 }
 
 static int fpid_show(struct seq_file *m, void *v)
@@ -8766,7 +8766,7 @@ static void *fnpid_start(struct seq_file *m, loff_t *pos)
 	struct trace_pid_list *pid_list;
 	struct trace_array *tr = m->private;
 
-	mutex_lock(&ftrace_lock);
+	slow_mutex_lock(&ftrace_lock);
 	rcu_read_lock_sched();
 
 	pid_list = rcu_dereference_sched(tr->function_no_pids);
@@ -9057,7 +9057,7 @@ static int prepare_direct_functions_for_ipmodify(struct ftrace_ops *ops)
 			unsigned long ip = entry->ip;
 			bool found_op = false;
 
-			mutex_lock(&ftrace_lock);
+			slow_mutex_lock(&ftrace_lock);
 			do_for_each_ftrace_op(op, ftrace_ops_list) {
 				if (!(op->flags & FTRACE_OPS_FL_DIRECT))
 					continue;
@@ -9066,7 +9066,7 @@ static int prepare_direct_functions_for_ipmodify(struct ftrace_ops *ops)
 					break;
 				}
 			} while_for_each_ftrace_op(op);
-			mutex_unlock(&ftrace_lock);
+			slow_mutex_unlock(&ftrace_lock);
 
 			if (found_op) {
 				if (!op->ops_func)
@@ -9106,7 +9106,7 @@ static void cleanup_direct_functions_after_ipmodify(struct ftrace_ops *ops)
 			unsigned long ip = entry->ip;
 			bool found_op = false;
 
-			mutex_lock(&ftrace_lock);
+			slow_mutex_lock(&ftrace_lock);
 			do_for_each_ftrace_op(op, ftrace_ops_list) {
 				if (!(op->flags & FTRACE_OPS_FL_DIRECT))
 					continue;
@@ -9115,7 +9115,7 @@ static void cleanup_direct_functions_after_ipmodify(struct ftrace_ops *ops)
 					break;
 				}
 			} while_for_each_ftrace_op(op);
-			mutex_unlock(&ftrace_lock);
+			slow_mutex_unlock(&ftrace_lock);
 
 			/* The cleanup is optional, ignore any errors */
 			if (found_op && op->ops_func)
@@ -9153,11 +9153,11 @@ static int register_ftrace_function_nolock(struct ftrace_ops *ops)
 
 	ftrace_ops_init(ops);
 
-	mutex_lock(&ftrace_lock);
+	slow_mutex_lock(&ftrace_lock);
 
 	ret = ftrace_startup(ops, 0);
 
-	mutex_unlock(&ftrace_lock);
+	slow_mutex_unlock(&ftrace_lock);
 
 	return ret;
 }
@@ -9200,9 +9200,9 @@ int unregister_ftrace_function(struct ftrace_ops *ops)
 {
 	int ret;
 
-	mutex_lock(&ftrace_lock);
+	slow_mutex_lock(&ftrace_lock);
 	ret = ftrace_shutdown(ops, 0);
-	mutex_unlock(&ftrace_lock);
+	slow_mutex_unlock(&ftrace_lock);
 
 	cleanup_direct_functions_after_ipmodify(ops);
 	return ret;
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH v2 0/3] disable optimistic spinning for ftrace_lock
  2026-03-11 11:52 [RFC PATCH v2 0/3] disable optimistic spinning for ftrace_lock Yafang Shao
                   ` (2 preceding siblings ...)
  2026-03-11 11:52 ` [RFC PATCH v2 3/3] ftrace: Disable optimistic spinning for ftrace_lock Yafang Shao
@ 2026-03-11 11:54 ` Peter Zijlstra
  2026-03-11 11:55   ` Yafang Shao
  2026-03-11 12:53   ` David Laight
  3 siblings, 2 replies; 13+ messages in thread
From: Peter Zijlstra @ 2026-03-11 11:54 UTC (permalink / raw)
  To: Yafang Shao
  Cc: mingo, will, boqun, longman, rostedt, mhiramat, mark.rutland,
	mathieu.desnoyers, david.laight.linux, linux-kernel,
	linux-trace-kernel

On Wed, Mar 11, 2026 at 07:52:47PM +0800, Yafang Shao wrote:
> Recently, we resolved a latency spike issue caused by concurrently running
> bpftrace processes. The root cause was high contention on the ftrace_lock
> due to optimistic spinning. We can optimize this by disabling optimistic
> spinning for ftrace_lock.
> 
> While semaphores may present similar challenges, I'm not currently aware of
> specific instances that exhibit this exact issue. Should we encounter
> problematic semaphores in production workloads, we can address them at that
> time.
> 
> PATCH #1: introduce slow_mutex_[un]lock to disable optimistic spinning
> PATCH #2: add variant for rtmutex
> PATCH #3: disable optimistic spinning for ftrace_lock
> 

So I really utterly hate this.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH v2 0/3] disable optimistic spinning for ftrace_lock
  2026-03-11 11:54 ` [RFC PATCH v2 0/3] disable " Peter Zijlstra
@ 2026-03-11 11:55   ` Yafang Shao
  2026-03-11 12:53   ` David Laight
  1 sibling, 0 replies; 13+ messages in thread
From: Yafang Shao @ 2026-03-11 11:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, will, boqun, longman, rostedt, mhiramat, mark.rutland,
	mathieu.desnoyers, david.laight.linux, linux-kernel,
	linux-trace-kernel

On Wed, Mar 11, 2026 at 7:54 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Mar 11, 2026 at 07:52:47PM +0800, Yafang Shao wrote:
> > Recently, we resolved a latency spike issue caused by concurrently running
> > bpftrace processes. The root cause was high contention on the ftrace_lock
> > due to optimistic spinning. We can optimize this by disabling optimistic
> > spinning for ftrace_lock.
> >
> > While semaphores may present similar challenges, I'm not currently aware of
> > specific instances that exhibit this exact issue. Should we encounter
> > problematic semaphores in production workloads, we can address them at that
> > time.
> >
> > PATCH #1: introduce slow_mutex_[un]lock to disable optimistic spinning
> > PATCH #2: add variant for rtmutex
> > PATCH #3: disable optimistic spinning for ftrace_lock
> >
>
> So I really utterly hate this.

Do you have any other suggestions for optimizing this?

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH v2 0/3] disable optimistic spinning for ftrace_lock
  2026-03-11 11:54 ` [RFC PATCH v2 0/3] disable " Peter Zijlstra
  2026-03-11 11:55   ` Yafang Shao
@ 2026-03-11 12:53   ` David Laight
  2026-03-11 13:40     ` Yafang Shao
  1 sibling, 1 reply; 13+ messages in thread
From: David Laight @ 2026-03-11 12:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yafang Shao, mingo, will, boqun, longman, rostedt, mhiramat,
	mark.rutland, mathieu.desnoyers, linux-kernel, linux-trace-kernel

On Wed, 11 Mar 2026 12:54:26 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Mar 11, 2026 at 07:52:47PM +0800, Yafang Shao wrote:
> > Recently, we resolved a latency spike issue caused by concurrently running
> > bpftrace processes. The root cause was high contention on the ftrace_lock
> > due to optimistic spinning. We can optimize this by disabling optimistic
> > spinning for ftrace_lock.
> > 
> > While semaphores may present similar challenges, I'm not currently aware of
> > specific instances that exhibit this exact issue. Should we encounter
> > problematic semaphores in production workloads, we can address them at that
> > time.
> > 
> > PATCH #1: introduce slow_mutex_[un]lock to disable optimistic spinning
> > PATCH #2: add variant for rtmutex
> > PATCH #3: disable optimistic spinning for ftrace_lock
> >   
> 
> So I really utterly hate this.

Yep...
Adding the extra parameter is likely to have a measurable impact
on everything else.

The problematic path is obvious:
        find_kallsyms_symbol+142
        module_address_lookup+104
        kallsyms_lookup_buildid+203
        kallsyms_lookup+20
        print_rec+64
        t_show+67
        seq_read_iter+709
        seq_read+165
        vfs_read+165
        ksys_read+103
        __x64_sys_read+25
        do_syscall_64+56
        entry_SYSCALL_64_after_hwframe+100

The code needs to drop the ftrace_lock across t_show.

Although there is a bigger issue of why on earth the code is reading the
list of filter functions at all - never mind all the time.
I'll do it by hand when debugging, but I'd have thought anything using bpf
will know exactly where to add its hooks.

	David





^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH v2 0/3] disable optimistic spinning for ftrace_lock
  2026-03-11 12:53   ` David Laight
@ 2026-03-11 13:40     ` Yafang Shao
  2026-03-11 17:07       ` Steven Rostedt
  2026-03-12  8:06       ` Masami Hiramatsu
  0 siblings, 2 replies; 13+ messages in thread
From: Yafang Shao @ 2026-03-11 13:40 UTC (permalink / raw)
  To: David Laight
  Cc: Peter Zijlstra, mingo, will, boqun, longman, rostedt, mhiramat,
	mark.rutland, mathieu.desnoyers, linux-kernel, linux-trace-kernel

On Wed, Mar 11, 2026 at 8:53 PM David Laight
<david.laight.linux@gmail.com> wrote:
>
> On Wed, 11 Mar 2026 12:54:26 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> > On Wed, Mar 11, 2026 at 07:52:47PM +0800, Yafang Shao wrote:
> > > Recently, we resolved a latency spike issue caused by concurrently running
> > > bpftrace processes. The root cause was high contention on the ftrace_lock
> > > due to optimistic spinning. We can optimize this by disabling optimistic
> > > spinning for ftrace_lock.
> > >
> > > While semaphores may present similar challenges, I'm not currently aware of
> > > specific instances that exhibit this exact issue. Should we encounter
> > > problematic semaphores in production workloads, we can address them at that
> > > time.
> > >
> > > PATCH #1: introduce slow_mutex_[un]lock to disable optimistic spinning
> > > PATCH #2: add variant for rtmutex
> > > PATCH #3: disable optimistic spinning for ftrace_lock
> > >
> >
> > So I really utterly hate this.
>
> Yep...
> Adding the extra parameter is likely to have a measurable impact
> on everything else.
>
> The problematic path is obvious:        find_kallsyms_symbol+142
>         module_address_lookup+104
>         kallsyms_lookup_buildid+203
>         kallsyms_lookup+20
>         print_rec+64
>         t_show+67
>         seq_read_iter+709
>         seq_read+165
>         vfs_read+165
>         ksys_read+103
>         __x64_sys_read+25
>         do_syscall_64+56
>         entry_SYSCALL_64_after_hwframe+100
>
> The code needs to drop the ftrace_lock across t_show.

It's unclear whether we can safely release ftrace_lock within t_show —
doing so would probably necessitate a major redesign of the current
implementation.

>
> Although there is a bigger issue of why on earth the code is reading the
> list of filter functions at all - never mind all the time.

bpftrace reads the complete list of available functions into
userspace, then performs matching against the target function to
determine if it is traceable.

> I'll do it by hand when debugging, but I'd have thought anything using bpf
> will know exactly where to add its hooks.


-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH v2 0/3] disable optimistic spinning for ftrace_lock
  2026-03-11 13:40     ` Yafang Shao
@ 2026-03-11 17:07       ` Steven Rostedt
  2026-03-11 17:39         ` Steven Rostedt
  2026-03-11 19:05         ` David Laight
  2026-03-12  8:06       ` Masami Hiramatsu
  1 sibling, 2 replies; 13+ messages in thread
From: Steven Rostedt @ 2026-03-11 17:07 UTC (permalink / raw)
  To: Yafang Shao
  Cc: David Laight, Peter Zijlstra, mingo, will, boqun, longman,
	mhiramat, mark.rutland, mathieu.desnoyers, linux-kernel,
	linux-trace-kernel

On Wed, 11 Mar 2026 21:40:32 +0800
Yafang Shao <laoar.shao@gmail.com> wrote:

> > The code needs to drop the ftrace_lock across t_show.  
> 
> It's unclear whether we can safely release ftrace_lock within t_show —
> doing so would probably necessitate a major redesign of the current
> implementation.

The issue isn't t_show, it's the calls between t_start and t_next and
subsequent t_next calls, which needs to keep a consistent state. t_show
just happens to be called in between them.

> 
> >
> > Although there is a bigger issue of why on earth the code is reading the
> > list of filter functions at all - never mind all the time.  
> 
> bpftrace reads the complete list of available functions into
> userspace, then performs matching against the target function to
> determine if it is traceable.

Could it parse it in smaller bits? That is, the lock is held only during an
individual read system call. If it reads the available_filter_functions
file via smaller buffers, it would not hold the lock for as long.

-- Steve


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH v2 0/3] disable optimistic spinning for ftrace_lock
  2026-03-11 17:07       ` Steven Rostedt
@ 2026-03-11 17:39         ` Steven Rostedt
  2026-03-11 19:05         ` David Laight
  1 sibling, 0 replies; 13+ messages in thread
From: Steven Rostedt @ 2026-03-11 17:39 UTC (permalink / raw)
  To: Yafang Shao
  Cc: David Laight, Peter Zijlstra, mingo, will, boqun, longman,
	mhiramat, mark.rutland, mathieu.desnoyers, linux-kernel,
	linux-trace-kernel

On Wed, 11 Mar 2026 13:07:43 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> > bpftrace reads the complete list of available functions into
> > userspace, then performs matching against the target function to
> > determine if it is traceable.  
> 
> Could it parse it in smaller bits? That is, the lock is held only during an
> individual read system call. If it reads the available_filter_functions
> file via smaller buffers, it would not hold the lock for as long.

Hmm, I guess this wouldn't help much. I ran:

 trace-cmd record -e sys_enter_read -e sys_exit_read -F cat /sys/kernel/tracing/available_filter_functions > /dev/null

And trace-cmd report shows:

[..]
             cat-1208  [001] .....   142.025582: sys_enter_read:       fd: 0x00000003, buf: 0x7fa9daf21000, count: 0x00040000
             cat-1208  [001] .....   142.025995: sys_exit_read:        0xfee
             cat-1208  [001] .....   142.026000: sys_enter_read:       fd: 0x00000003, buf: 0x7fa9daf21000, count: 0x00040000
             cat-1208  [001] .....   142.026392: sys_exit_read:        0xff8
             cat-1208  [001] .....   142.026396: sys_enter_read:       fd: 0x00000003, buf: 0x7fa9daf21000, count: 0x00040000
             cat-1208  [001] .....   142.026766: sys_exit_read:        0xfed
             cat-1208  [001] .....   142.026770: sys_enter_read:       fd: 0x00000003, buf: 0x7fa9daf21000, count: 0x00040000
             cat-1208  [001] .....   142.027113: sys_exit_read:        0xfe0
             cat-1208  [001] .....   142.027117: sys_enter_read:       fd: 0x00000003, buf: 0x7fa9daf21000, count: 0x00040000
             cat-1208  [001] .....   142.027502: sys_exit_read:        0xfec
             cat-1208  [001] .....   142.027506: sys_enter_read:       fd: 0x00000003, buf: 0x7fa9daf21000, count: 0x00040000
[..]

which shows that even though the read buffer size is 0x40000, the size read
is just 0xff8. So the buffer being read and returned is never more than a page.

Unless you are running on powerpc, where the page is likely 64K.

-- Steve

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH v2 0/3] disable optimistic spinning for ftrace_lock
  2026-03-11 17:07       ` Steven Rostedt
  2026-03-11 17:39         ` Steven Rostedt
@ 2026-03-11 19:05         ` David Laight
  1 sibling, 0 replies; 13+ messages in thread
From: David Laight @ 2026-03-11 19:05 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Yafang Shao, Peter Zijlstra, mingo, will, boqun, longman,
	mhiramat, mark.rutland, mathieu.desnoyers, linux-kernel,
	linux-trace-kernel

On Wed, 11 Mar 2026 13:07:43 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Wed, 11 Mar 2026 21:40:32 +0800
> Yafang Shao <laoar.shao@gmail.com> wrote:
> 
> > > The code needs to drop the ftrace_lock across t_show.    
> > 
> > It's unclear whether we can safely release ftrace_lock within t_show —
> > doing so would probably necessitate a major redesign of the current
> > implementation.  
> 
> The issue isn't t_show, it's the calls between t_start and t_next and
> subsequent t_next calls, which needs to keep a consistent state. t_show
> just happens to be called in between them.
> 
> >   
> > >
> > > Although there is a bigger issue of why on earth the code is reading the
> > > list of filter functions at all - never mind all the time.    
> > 
> > bpftrace reads the complete list of available functions into
> > userspace, then performs matching against the target function to
> > determine if it is traceable.  
> 
> Could it parse it in smaller bits? That is, the lock is held only during an
> individual read system call. If it reads the available_filter_functions
> file via smaller buffers, it would not hold the lock for as long.

But the expensive part is probably looking up the symbol name.
Shorter reads would hand the lock off to the other process, but overall
the lock would still be held for the same length of time.

How does the code work out where to start from for each read system call?
Couldn't you (effectively) do one symbol at a time the same way?

Another option would be to put a 'generation number' on the list of functions
(after all it doesn't change that often).
Then you can release the lock, generate the data, re-acquire the lock and
check the generation number hasn't changed.
If it hasn't changed you can carry on processing the list using the
same pointer.
If the generation number has changed terminate the read and worry about
locating the correct start position for the next read.
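As a rough illustration, the scheme might look like this toy user-space model. Everything here is invented for the sketch (sym_table, sym_gen, read_chunk); in the real code the table would be the ftrace records, and the lock, elided below, would be ftrace_lock:

```c
/*
 * Toy model of the generation-number scheme described above.  All
 * names are invented for illustration; this is not the ftrace code.
 */
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NSYMS 4

static const char *sym_table[NSYMS] = {
	"vfs_read", "vfs_write", "ksys_read", "schedule",
};
static unsigned long sym_gen;	/* bumped whenever sym_table changes */

/*
 * Emit one entry.  The (elided) lock would be taken here just long
 * enough to re-check the generation and copy the entry out.
 */
static int read_chunk(int idx, unsigned long gen, char *buf, size_t len)
{
	if (gen != sym_gen)
		return -1;	/* list changed; terminate this read */
	snprintf(buf, len, "%s", sym_table[idx]);
	return 0;
}

/* A writer changing the list bumps the generation. */
static void update_table(int idx, const char *name)
{
	sym_table[idx] = name;
	sym_gen++;
}
```

A stale generation makes read_chunk() fail, so the reader terminates the
read, re-reads the generation, and relocates its start position, which is
the terminate-and-restart step described above.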

	David


> 
> -- Steve
> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH v2 0/3] disable optimistic spinning for ftrace_lock
  2026-03-11 13:40     ` Yafang Shao
  2026-03-11 17:07       ` Steven Rostedt
@ 2026-03-12  8:06       ` Masami Hiramatsu
  2026-03-12  8:57         ` Yafang Shao
  1 sibling, 1 reply; 13+ messages in thread
From: Masami Hiramatsu @ 2026-03-12  8:06 UTC (permalink / raw)
  To: Yafang Shao
  Cc: David Laight, Peter Zijlstra, mingo, will, boqun, longman,
	rostedt, mhiramat, mark.rutland, mathieu.desnoyers, linux-kernel,
	linux-trace-kernel

Hi,

On Wed, 11 Mar 2026 21:40:32 +0800
Yafang Shao <laoar.shao@gmail.com> wrote:

> > Although there is a bigger issue of why on earth the code is reading the
> > list of filter functions at all - never mind all the time.
> 
> bpftrace reads the complete list of available functions into
> userspace, then performs matching against the target function to
> determine if it is traceable.

What about changing the bpftrace userspace tool to cache the list of available
functions? (or just add an option to pass the available function list?)
Then, you can just copy the function list somewhere and use it.
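A minimal sketch of that userspace-side caching; the cache contents are
hard-coded here (a real tool would fill them from a single pass over
available_filter_functions), and cached_funcs/is_traceable are invented names:

```c
/*
 * Sketch of userspace caching: read available_filter_functions once,
 * keep the copy, and answer "is this function traceable?" from the
 * copy instead of re-reading tracefs (so no ftrace_lock is taken on
 * each lookup).  The cache is hard-coded for illustration.
 */
#include <stdbool.h>
#include <string.h>

static const char *cached_funcs[] = {
	"vfs_read", "vfs_write", "ksys_read", "schedule",
};
static const size_t ncached = sizeof(cached_funcs) / sizeof(cached_funcs[0]);

/* Match against the cached copy only. */
static bool is_traceable(const char *name)
{
	for (size_t i = 0; i < ncached; i++)
		if (strcmp(cached_funcs[i], name) == 0)
			return true;
	return false;
}
```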

Of course we can do the same thing in the kernel, but I don't think
there is any reason to do it in the kernel instead of user space.

Thank you,

> 
> > I'll do it by hand when debugging, but I'd have thought anything using bpf
> > will know exactly where to add its hooks.
> 
> 
> -- 
> Regards
> Yafang


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH v2 0/3] disable optimistic spinning for ftrace_lock
  2026-03-12  8:06       ` Masami Hiramatsu
@ 2026-03-12  8:57         ` Yafang Shao
  0 siblings, 0 replies; 13+ messages in thread
From: Yafang Shao @ 2026-03-12  8:57 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: David Laight, Peter Zijlstra, mingo, will, boqun, longman,
	rostedt, mark.rutland, mathieu.desnoyers, linux-kernel,
	linux-trace-kernel

On Thu, Mar 12, 2026 at 4:06 PM Masami Hiramatsu <mhiramat@kernel.org> wrote:
>
> Hi,
>
> On Wed, 11 Mar 2026 21:40:32 +0800
> Yafang Shao <laoar.shao@gmail.com> wrote:
>
> > > Although there is a bigger issue of why on earth the code is reading the
> > > list of filter functions at all - never mind all the time.
> >
> > bpftrace reads the complete list of available functions into
> > userspace, then performs matching against the target function to
> > determine if it is traceable.
>
> What about changing bpftrace userspace tool to cache the list of available
> functions? (or just add an option to pass available function list?)
> Then, you can just copy the function list somewhere and uses it.

Thanks for the suggestion.
Steven also mentioned this optimization, and I believe it could indeed
help address the performance issue.

>
> Of course we can do the same thing in the kernel, but I don't think
> there is any reason to do it in the kernel instead of user space.

Implementing this in the kernel provides a more generic solution, instead
of patching each tool individually (bpftrace, perf, trace-cmd, etc.).


--
Regards
Yafang

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2026-03-12  8:58 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-11 11:52 [RFC PATCH v2 0/3] disable optimistic spinning for ftrace_lock Yafang Shao
2026-03-11 11:52 ` [RFC PATCH v2 1/3] locking/mutex: Add slow path variants for lock/unlock Yafang Shao
2026-03-11 11:52 ` [RFC PATCH v2 2/3] locking/rtmutex: " Yafang Shao
2026-03-11 11:52 ` [RFC PATCH v2 3/3] ftrace: Disable optimistic spinning for ftrace_lock Yafang Shao
2026-03-11 11:54 ` [RFC PATCH v2 0/3] disable " Peter Zijlstra
2026-03-11 11:55   ` Yafang Shao
2026-03-11 12:53   ` David Laight
2026-03-11 13:40     ` Yafang Shao
2026-03-11 17:07       ` Steven Rostedt
2026-03-11 17:39         ` Steven Rostedt
2026-03-11 19:05         ` David Laight
2026-03-12  8:06       ` Masami Hiramatsu
2026-03-12  8:57         ` Yafang Shao

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox