* [PATCH v3 0/7] locking/rtmutex: Avoid PI state recursion through sched_submit_work()
@ 2023-09-08 16:22 Sebastian Andrzej Siewior
  2023-09-08 16:22 ` [PATCH v3 1/7] sched: Constrain locks in sched_submit_work() Sebastian Andrzej Siewior
                   ` (6 more replies)
  0 siblings, 7 replies; 36+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-09-08 16:22 UTC (permalink / raw)
  To: Peter Zijlstra, linux-kernel
  Cc: bigeasy, tglx, boqun.feng, bristot, bsegall, dietmar.eggemann,
	jstultz, juri.lelli, longman, mgorman, mingo, rostedt, swood,
	vincent.guittot, vschneid, will

Hi!

This is the third iteration of the patch series that PeterZ sent.

PeterZ v1: https://lore.kernel.org/all/20230815110121.117752409@infradead.org/
My v2: https://lore.kernel.org/all/20230825181033.504534-1-bigeasy@linutronix.de/
It contains a range-diff of what changed; in brief:
- Added the linux/sched/rt.h include so it compiles.
- Use rt_mutex_pre_schedule()/post() for rwbase_write_lock().

Now the v3:
- Sultan Alsawaf reported a deadlock in write_lock(). A brown paper bag
  later, I replaced rt_mutex_pre_schedule()/post() with
  rwbase_pre_schedule() in rwbase_write_lock().

- The reported futex splat in v2 has been addressed with a trylock. This
  is 7/7.

Sebastian




* [PATCH v3 1/7] sched: Constrain locks in sched_submit_work()
  2023-09-08 16:22 [PATCH v3 0/7] locking/rtmutex: Avoid PI state recursion through sched_submit_work() Sebastian Andrzej Siewior
@ 2023-09-08 16:22 ` Sebastian Andrzej Siewior
  2023-09-20  7:36   ` [tip: locking/core] " tip-bot2 for Peter Zijlstra
  2023-09-08 16:22 ` [PATCH v3 2/7] locking/rtmutex: Avoid unconditional slowpath for DEBUG_RT_MUTEXES Sebastian Andrzej Siewior
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 36+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-09-08 16:22 UTC (permalink / raw)
  To: Peter Zijlstra, linux-kernel
  Cc: bigeasy, tglx, boqun.feng, bristot, bsegall, dietmar.eggemann,
	jstultz, juri.lelli, longman, mgorman, mingo, rostedt, swood,
	vincent.guittot, vschneid, will

From: Peter Zijlstra <peterz@infradead.org>

Even though sched_submit_work() is run from preemptible context,
it is discouraged to have it use blocking locks due to the recursion
potential.

Enforce this.
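
For reference, a condensed sketch of the annotation added by the diff
below (nothing here beyond what the patch itself introduces): the
wait-type override map establishes an LD_WAIT_CONFIG context, within
which lockdep reports any blocking primitive being taken.

	static DEFINE_WAIT_OVERRIDE_MAP(sched_map, LD_WAIT_CONFIG);

	lock_map_acquire_try(&sched_map);
	/* Taking a mutex/rwsem/... in here is flagged by lockdep. */
	lock_map_release(&sched_map);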

Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230815111430.154558666@infradead.org
Link: https://lore.kernel.org/r/20230825181033.504534-2-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/sched/core.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2299a5cfbfb9e..d55564097bd86 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6720,11 +6720,18 @@ void __noreturn do_task_dead(void)
 
 static inline void sched_submit_work(struct task_struct *tsk)
 {
+	static DEFINE_WAIT_OVERRIDE_MAP(sched_map, LD_WAIT_CONFIG);
 	unsigned int task_flags;
 
 	if (task_is_running(tsk))
 		return;
 
+	/*
+	 * Establish LD_WAIT_CONFIG context to ensure none of the code called
+	 * will use a blocking primitive -- which would lead to recursion.
+	 */
+	lock_map_acquire_try(&sched_map);
+
 	task_flags = tsk->flags;
 	/*
 	 * If a worker goes to sleep, notify and ask workqueue whether it
@@ -6749,6 +6756,8 @@ static inline void sched_submit_work(struct task_struct *tsk)
 	 * make sure to submit it to avoid deadlocks.
 	 */
 	blk_flush_plug(tsk->plug, true);
+
+	lock_map_release(&sched_map);
 }
 
 static void sched_update_worker(struct task_struct *tsk)
-- 
2.40.1



* [PATCH v3 2/7] locking/rtmutex: Avoid unconditional slowpath for DEBUG_RT_MUTEXES
  2023-09-08 16:22 [PATCH v3 0/7] locking/rtmutex: Avoid PI state recursion through sched_submit_work() Sebastian Andrzej Siewior
  2023-09-08 16:22 ` [PATCH v3 1/7] sched: Constrain locks in sched_submit_work() Sebastian Andrzej Siewior
@ 2023-09-08 16:22 ` Sebastian Andrzej Siewior
  2023-09-20  7:36   ` [tip: locking/core] " tip-bot2 for Sebastian Andrzej Siewior
  2023-09-08 16:22 ` [PATCH v3 3/7] sched: Extract __schedule_loop() Sebastian Andrzej Siewior
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 36+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-09-08 16:22 UTC (permalink / raw)
  To: Peter Zijlstra, linux-kernel
  Cc: bigeasy, tglx, boqun.feng, bristot, bsegall, dietmar.eggemann,
	jstultz, juri.lelli, longman, mgorman, mingo, rostedt, swood,
	vincent.guittot, vschneid, will

With DEBUG_RT_MUTEXES enabled the fast-path rt_mutex_cmpxchg_acquire()
always fails and all lock operations take the slow path.

Provide a new inline helper rt_mutex_try_acquire() which maps to
rt_mutex_cmpxchg_acquire() in the non-debug case. For the debug case
it invokes rt_mutex_slowtrylock() which can acquire a non-contended
rtmutex under full debug coverage.
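
Condensed, the two variants added below look like this (sketch only; the
#ifndef/#else split is the existing CONFIG_DEBUG_RT_MUTEXES split in
rtmutex.c):

	#ifndef CONFIG_DEBUG_RT_MUTEXES
	static __always_inline bool rt_mutex_try_acquire(struct rt_mutex_base *lock)
	{
		/* Fast path: a plain cmpxchg, no debug instrumentation. */
		return rt_mutex_cmpxchg_acquire(lock, NULL, current);
	}
	#else
	static __always_inline bool rt_mutex_try_acquire(struct rt_mutex_base *lock)
	{
		/* Debug: the cmpxchg helper always fails, use the covered trylock. */
		return rt_mutex_slowtrylock(lock);
	}
	#endif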

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230427111937.2745231-4-bigeasy@linutronix.de
Link: https://lore.kernel.org/r/20230815111430.220899937@infradead.org
Link: https://lore.kernel.org/r/20230825181033.504534-3-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/locking/rtmutex.c     | 21 ++++++++++++++++++++-
 kernel/locking/ww_rt_mutex.c |  2 +-
 2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 21db0df0eb000..bcec0533a0cc0 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -218,6 +218,11 @@ static __always_inline bool rt_mutex_cmpxchg_acquire(struct rt_mutex_base *lock,
 	return try_cmpxchg_acquire(&lock->owner, &old, new);
 }
 
+static __always_inline bool rt_mutex_try_acquire(struct rt_mutex_base *lock)
+{
+	return rt_mutex_cmpxchg_acquire(lock, NULL, current);
+}
+
 static __always_inline bool rt_mutex_cmpxchg_release(struct rt_mutex_base *lock,
 						     struct task_struct *old,
 						     struct task_struct *new)
@@ -297,6 +302,20 @@ static __always_inline bool rt_mutex_cmpxchg_acquire(struct rt_mutex_base *lock,
 
 }
 
+static int __sched rt_mutex_slowtrylock(struct rt_mutex_base *lock);
+
+static __always_inline bool rt_mutex_try_acquire(struct rt_mutex_base *lock)
+{
+	/*
+	 * With debug enabled rt_mutex_cmpxchg trylock() will always fail.
+	 *
+	 * Avoid unconditionally taking the slow path by using
+	 * rt_mutex_slow_trylock() which is covered by the debug code and can
+	 * acquire a non-contended rtmutex.
+	 */
+	return rt_mutex_slowtrylock(lock);
+}
+
 static __always_inline bool rt_mutex_cmpxchg_release(struct rt_mutex_base *lock,
 						     struct task_struct *old,
 						     struct task_struct *new)
@@ -1755,7 +1774,7 @@ static int __sched rt_mutex_slowlock(struct rt_mutex_base *lock,
 static __always_inline int __rt_mutex_lock(struct rt_mutex_base *lock,
 					   unsigned int state)
 {
-	if (likely(rt_mutex_cmpxchg_acquire(lock, NULL, current)))
+	if (likely(rt_mutex_try_acquire(lock)))
 		return 0;
 
 	return rt_mutex_slowlock(lock, NULL, state);
diff --git a/kernel/locking/ww_rt_mutex.c b/kernel/locking/ww_rt_mutex.c
index d1473c624105c..c7196de838edc 100644
--- a/kernel/locking/ww_rt_mutex.c
+++ b/kernel/locking/ww_rt_mutex.c
@@ -62,7 +62,7 @@ __ww_rt_mutex_lock(struct ww_mutex *lock, struct ww_acquire_ctx *ww_ctx,
 	}
 	mutex_acquire_nest(&rtm->dep_map, 0, 0, nest_lock, ip);
 
-	if (likely(rt_mutex_cmpxchg_acquire(&rtm->rtmutex, NULL, current))) {
+	if (likely(rt_mutex_try_acquire(&rtm->rtmutex))) {
 		if (ww_ctx)
 			ww_mutex_set_context_fastpath(lock, ww_ctx);
 		return 0;
-- 
2.40.1



* [PATCH v3 3/7] sched: Extract __schedule_loop()
  2023-09-08 16:22 [PATCH v3 0/7] locking/rtmutex: Avoid PI state recursion through sched_submit_work() Sebastian Andrzej Siewior
  2023-09-08 16:22 ` [PATCH v3 1/7] sched: Constrain locks in sched_submit_work() Sebastian Andrzej Siewior
  2023-09-08 16:22 ` [PATCH v3 2/7] locking/rtmutex: Avoid unconditional slowpath for DEBUG_RT_MUTEXES Sebastian Andrzej Siewior
@ 2023-09-08 16:22 ` Sebastian Andrzej Siewior
  2023-09-20  7:36   ` [tip: locking/core] " tip-bot2 for Thomas Gleixner
  2023-09-08 16:22 ` [PATCH v3 4/7] sched: Provide rt_mutex specific scheduler helpers Sebastian Andrzej Siewior
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 36+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-09-08 16:22 UTC (permalink / raw)
  To: Peter Zijlstra, linux-kernel
  Cc: bigeasy, tglx, boqun.feng, bristot, bsegall, dietmar.eggemann,
	jstultz, juri.lelli, longman, mgorman, mingo, rostedt, swood,
	vincent.guittot, vschneid, will

From: Thomas Gleixner <tglx@linutronix.de>

There are currently two implementations of this basic __schedule()
loop, and there is soon to be a third.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230427111937.2745231-2-bigeasy@linutronix.de
Link: https://lore.kernel.org/r/20230815111430.288063671@infradead.org
Link: https://lore.kernel.org/r/20230825181033.504534-4-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/sched/core.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d55564097bd86..1ea7ba53aad24 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6770,16 +6770,21 @@ static void sched_update_worker(struct task_struct *tsk)
 	}
 }
 
+static __always_inline void __schedule_loop(unsigned int sched_mode)
+{
+	do {
+		preempt_disable();
+		__schedule(sched_mode);
+		sched_preempt_enable_no_resched();
+	} while (need_resched());
+}
+
 asmlinkage __visible void __sched schedule(void)
 {
 	struct task_struct *tsk = current;
 
 	sched_submit_work(tsk);
-	do {
-		preempt_disable();
-		__schedule(SM_NONE);
-		sched_preempt_enable_no_resched();
-	} while (need_resched());
+	__schedule_loop(SM_NONE);
 	sched_update_worker(tsk);
 }
 EXPORT_SYMBOL(schedule);
@@ -6843,11 +6848,7 @@ void __sched schedule_preempt_disabled(void)
 #ifdef CONFIG_PREEMPT_RT
 void __sched notrace schedule_rtlock(void)
 {
-	do {
-		preempt_disable();
-		__schedule(SM_RTLOCK_WAIT);
-		sched_preempt_enable_no_resched();
-	} while (need_resched());
+	__schedule_loop(SM_RTLOCK_WAIT);
 }
 NOKPROBE_SYMBOL(schedule_rtlock);
 #endif
-- 
2.40.1



* [PATCH v3 4/7] sched: Provide rt_mutex specific scheduler helpers
  2023-09-08 16:22 [PATCH v3 0/7] locking/rtmutex: Avoid PI state recursion through sched_submit_work() Sebastian Andrzej Siewior
                   ` (2 preceding siblings ...)
  2023-09-08 16:22 ` [PATCH v3 3/7] sched: Extract __schedule_loop() Sebastian Andrzej Siewior
@ 2023-09-08 16:22 ` Sebastian Andrzej Siewior
  2023-09-20  7:36   ` [tip: locking/core] " tip-bot2 for Peter Zijlstra
  2023-09-08 16:22 ` [PATCH v3 5/7] locking/rtmutex: Use " Sebastian Andrzej Siewior
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 36+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-09-08 16:22 UTC (permalink / raw)
  To: Peter Zijlstra, linux-kernel
  Cc: bigeasy, tglx, boqun.feng, bristot, bsegall, dietmar.eggemann,
	jstultz, juri.lelli, longman, mgorman, mingo, rostedt, swood,
	vincent.guittot, vschneid, will

From: Peter Zijlstra <peterz@infradead.org>

With PREEMPT_RT there is a rt_mutex recursion problem where
sched_submit_work() can use an rtlock (aka spinlock_t). More
specifically what happens is:

  mutex_lock() /* really rt_mutex */
    ...
      __rt_mutex_slowlock_locked()
	task_blocks_on_rt_mutex()
          // enqueue current task as waiter
          // do PI chain walk
        rt_mutex_slowlock_block()
          schedule()
            sched_submit_work()
              ...
              spin_lock() /* really rtlock */
                ...
                  __rt_mutex_slowlock_locked()
                    task_blocks_on_rt_mutex()
                      // enqueue current task as waiter *AGAIN*
                      // *CONFUSION*

Fix this by making rt_mutex do the sched_submit_work() early, before
it enqueues itself as a waiter -- before it even knows *if* it will
wait.

[[ basically Thomas' patch but with different naming and a few asserts
   added ]]
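
As a rough sketch, the intended calling pattern of the new helpers is
(simplified; the real slow path also manages the task state and the
wait_lock, and the actual call sites are wired up later in this series):

	rt_mutex_pre_schedule();		/* sched_submit_work(), once, early */

	/* ... enqueue current as waiter, do the PI chain walk ... */
	while (!try_to_take_rt_mutex(lock, current, waiter))
		rt_mutex_schedule();		/* bare __schedule_loop(SM_NONE) */

	rt_mutex_post_schedule();		/* sched_update_worker() */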

Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230815111430.355375399@infradead.org
Link: https://lore.kernel.org/r/20230825181033.504534-5-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/sched.h    |  3 +++
 include/linux/sched/rt.h |  4 ++++
 kernel/sched/core.c      | 36 ++++++++++++++++++++++++++++++++----
 3 files changed, 39 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 77f01ac385f7a..67623ffd4a8ea 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -911,6 +911,9 @@ struct task_struct {
 	 * ->sched_remote_wakeup gets used, so it can be in this word.
 	 */
 	unsigned			sched_remote_wakeup:1;
+#ifdef CONFIG_RT_MUTEXES
+	unsigned			sched_rt_mutex:1;
+#endif
 
 	/* Bit to tell LSMs we're in execve(): */
 	unsigned			in_execve:1;
diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h
index 994c25640e156..b2b9e6eb96830 100644
--- a/include/linux/sched/rt.h
+++ b/include/linux/sched/rt.h
@@ -30,6 +30,10 @@ static inline bool task_is_realtime(struct task_struct *tsk)
 }
 
 #ifdef CONFIG_RT_MUTEXES
+extern void rt_mutex_pre_schedule(void);
+extern void rt_mutex_schedule(void);
+extern void rt_mutex_post_schedule(void);
+
 /*
  * Must hold either p->pi_lock or task_rq(p)->lock.
  */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1ea7ba53aad24..58d0346d1bb3b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6723,9 +6723,6 @@ static inline void sched_submit_work(struct task_struct *tsk)
 	static DEFINE_WAIT_OVERRIDE_MAP(sched_map, LD_WAIT_CONFIG);
 	unsigned int task_flags;
 
-	if (task_is_running(tsk))
-		return;
-
 	/*
 	 * Establish LD_WAIT_CONFIG context to ensure none of the code called
 	 * will use a blocking primitive -- which would lead to recursion.
@@ -6783,7 +6780,12 @@ asmlinkage __visible void __sched schedule(void)
 {
 	struct task_struct *tsk = current;
 
-	sched_submit_work(tsk);
+#ifdef CONFIG_RT_MUTEXES
+	lockdep_assert(!tsk->sched_rt_mutex);
+#endif
+
+	if (!task_is_running(tsk))
+		sched_submit_work(tsk);
 	__schedule_loop(SM_NONE);
 	sched_update_worker(tsk);
 }
@@ -7044,6 +7046,32 @@ static void __setscheduler_prio(struct task_struct *p, int prio)
 
 #ifdef CONFIG_RT_MUTEXES
 
+/*
+ * Would be more useful with typeof()/auto_type but they don't mix with
+ * bit-fields. Since it's a local thing, use int. Keep the generic sounding
+ * name such that if someone were to implement this function we get to compare
+ * notes.
+ */
+#define fetch_and_set(x, v) ({ int _x = (x); (x) = (v); _x; })
+
+void rt_mutex_pre_schedule(void)
+{
+	lockdep_assert(!fetch_and_set(current->sched_rt_mutex, 1));
+	sched_submit_work(current);
+}
+
+void rt_mutex_schedule(void)
+{
+	lockdep_assert(current->sched_rt_mutex);
+	__schedule_loop(SM_NONE);
+}
+
+void rt_mutex_post_schedule(void)
+{
+	sched_update_worker(current);
+	lockdep_assert(fetch_and_set(current->sched_rt_mutex, 0));
+}
+
 static inline int __rt_effective_prio(struct task_struct *pi_task, int prio)
 {
 	if (pi_task)
-- 
2.40.1



* [PATCH v3 5/7] locking/rtmutex: Use rt_mutex specific scheduler helpers
  2023-09-08 16:22 [PATCH v3 0/7] locking/rtmutex: Avoid PI state recursion through sched_submit_work() Sebastian Andrzej Siewior
                   ` (3 preceding siblings ...)
  2023-09-08 16:22 ` [PATCH v3 4/7] sched: Provide rt_mutex specific scheduler helpers Sebastian Andrzej Siewior
@ 2023-09-08 16:22 ` Sebastian Andrzej Siewior
  2023-09-20  7:36   ` [tip: locking/core] " tip-bot2 for Sebastian Andrzej Siewior
  2023-09-08 16:22 ` [PATCH v3 6/7] locking/rtmutex: Add a lockdep assert to catch potential nested blocking Sebastian Andrzej Siewior
  2023-09-08 16:22 ` [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock Sebastian Andrzej Siewior
  6 siblings, 1 reply; 36+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-09-08 16:22 UTC (permalink / raw)
  To: Peter Zijlstra, linux-kernel
  Cc: bigeasy, tglx, boqun.feng, bristot, bsegall, dietmar.eggemann,
	jstultz, juri.lelli, longman, mgorman, mingo, rostedt, swood,
	vincent.guittot, vschneid, will

Have rt_mutex use the rt_mutex specific scheduler helpers to avoid
recursion vs rtlock on the PI state.

[[ peterz: adapted to new names ]]

Reported-by: Crystal Wood <swood@redhat.com>
Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230815111430.421408298@infradead.org
Link: https://lore.kernel.org/r/20230825181033.504534-6-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/futex/pi.c            | 11 +++++++++++
 kernel/locking/rtmutex.c     | 14 ++++++++++++--
 kernel/locking/rwbase_rt.c   |  6 ++++++
 kernel/locking/rwsem.c       |  8 +++++++-
 kernel/locking/spinlock_rt.c |  4 ++++
 5 files changed, 40 insertions(+), 3 deletions(-)

diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
index ce2889f123755..f8e65b27d9d6b 100644
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0-or-later
 
 #include <linux/slab.h>
+#include <linux/sched/rt.h>
 #include <linux/sched/task.h>
 
 #include "futex.h"
@@ -1002,6 +1003,12 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
 		goto no_block;
 	}
 
+	/*
+	 * Must be done before we enqueue the waiter, here is unfortunately
+	 * under the hb lock, but that *should* work because it does nothing.
+	 */
+	rt_mutex_pre_schedule();
+
 	rt_mutex_init_waiter(&rt_waiter);
 
 	/*
@@ -1052,6 +1059,10 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
 	if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter))
 		ret = 0;
 
+	/*
+	 * Waiter is unqueued.
+	 */
+	rt_mutex_post_schedule();
 no_block:
 	/*
 	 * Fixup the pi_state owner and possibly acquire the lock if we
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index bcec0533a0cc0..a3fe05dfd0d8f 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1632,7 +1632,7 @@ static int __sched rt_mutex_slowlock_block(struct rt_mutex_base *lock,
 		raw_spin_unlock_irq(&lock->wait_lock);
 
 		if (!owner || !rtmutex_spin_on_owner(lock, waiter, owner))
-			schedule();
+			rt_mutex_schedule();
 
 		raw_spin_lock_irq(&lock->wait_lock);
 		set_current_state(state);
@@ -1661,7 +1661,7 @@ static void __sched rt_mutex_handle_deadlock(int res, int detect_deadlock,
 	WARN(1, "rtmutex deadlock detected\n");
 	while (1) {
 		set_current_state(TASK_INTERRUPTIBLE);
-		schedule();
+		rt_mutex_schedule();
 	}
 }
 
@@ -1756,6 +1756,15 @@ static int __sched rt_mutex_slowlock(struct rt_mutex_base *lock,
 	unsigned long flags;
 	int ret;
 
+	/*
+	 * Do all pre-schedule work here, before we queue a waiter and invoke
+	 * PI -- any such work that trips on rtlock (PREEMPT_RT spinlock) would
+	 * otherwise recurse back into task_blocks_on_rt_mutex() through
+	 * rtlock_slowlock() and will then enqueue a second waiter for this
+	 * same task and things get really confusing real fast.
+	 */
+	rt_mutex_pre_schedule();
+
 	/*
 	 * Technically we could use raw_spin_[un]lock_irq() here, but this can
 	 * be called in early boot if the cmpxchg() fast path is disabled
@@ -1767,6 +1776,7 @@ static int __sched rt_mutex_slowlock(struct rt_mutex_base *lock,
 	raw_spin_lock_irqsave(&lock->wait_lock, flags);
 	ret = __rt_mutex_slowlock_locked(lock, ww_ctx, state);
 	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
+	rt_mutex_post_schedule();
 
 	return ret;
 }
diff --git a/kernel/locking/rwbase_rt.c b/kernel/locking/rwbase_rt.c
index 25ec0239477c2..c7258cb32d91b 100644
--- a/kernel/locking/rwbase_rt.c
+++ b/kernel/locking/rwbase_rt.c
@@ -71,6 +71,7 @@ static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
 	struct rt_mutex_base *rtm = &rwb->rtmutex;
 	int ret;
 
+	rwbase_pre_schedule();
 	raw_spin_lock_irq(&rtm->wait_lock);
 
 	/*
@@ -125,6 +126,7 @@ static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
 		rwbase_rtmutex_unlock(rtm);
 
 	trace_contention_end(rwb, ret);
+	rwbase_post_schedule();
 	return ret;
 }
 
@@ -237,6 +239,8 @@ static int __sched rwbase_write_lock(struct rwbase_rt *rwb,
 	/* Force readers into slow path */
 	atomic_sub(READER_BIAS, &rwb->readers);
 
+	rwbase_pre_schedule();
+
 	raw_spin_lock_irqsave(&rtm->wait_lock, flags);
 	if (__rwbase_write_trylock(rwb))
 		goto out_unlock;
@@ -248,6 +252,7 @@ static int __sched rwbase_write_lock(struct rwbase_rt *rwb,
 		if (rwbase_signal_pending_state(state, current)) {
 			rwbase_restore_current_state();
 			__rwbase_write_unlock(rwb, 0, flags);
+			rwbase_post_schedule();
 			trace_contention_end(rwb, -EINTR);
 			return -EINTR;
 		}
@@ -266,6 +271,7 @@ static int __sched rwbase_write_lock(struct rwbase_rt *rwb,
 
 out_unlock:
 	raw_spin_unlock_irqrestore(&rtm->wait_lock, flags);
+	rwbase_post_schedule();
 	return 0;
 }
 
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 9eabd585ce7af..2340b6d90ec6f 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1427,8 +1427,14 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
 #define rwbase_signal_pending_state(state, current)	\
 	signal_pending_state(state, current)
 
+#define rwbase_pre_schedule()				\
+	rt_mutex_pre_schedule()
+
 #define rwbase_schedule()				\
-	schedule()
+	rt_mutex_schedule()
+
+#define rwbase_post_schedule()				\
+	rt_mutex_post_schedule()
 
 #include "rwbase_rt.c"
 
diff --git a/kernel/locking/spinlock_rt.c b/kernel/locking/spinlock_rt.c
index 48a19ed8486d8..842037b2ba548 100644
--- a/kernel/locking/spinlock_rt.c
+++ b/kernel/locking/spinlock_rt.c
@@ -184,9 +184,13 @@ static __always_inline int  rwbase_rtmutex_trylock(struct rt_mutex_base *rtm)
 
 #define rwbase_signal_pending_state(state, current)	(0)
 
+#define rwbase_pre_schedule()
+
 #define rwbase_schedule()				\
 	schedule_rtlock()
 
+#define rwbase_post_schedule()
+
 #include "rwbase_rt.c"
 /*
  * The common functions which get wrapped into the rwlock API.
-- 
2.40.1



* [PATCH v3 6/7] locking/rtmutex: Add a lockdep assert to catch potential nested blocking
  2023-09-08 16:22 [PATCH v3 0/7] locking/rtmutex: Avoid PI state recursion through sched_submit_work() Sebastian Andrzej Siewior
                   ` (4 preceding siblings ...)
  2023-09-08 16:22 ` [PATCH v3 5/7] locking/rtmutex: Use " Sebastian Andrzej Siewior
@ 2023-09-08 16:22 ` Sebastian Andrzej Siewior
  2023-09-20  7:36   ` [tip: locking/core] " tip-bot2 for Thomas Gleixner
  2023-09-08 16:22 ` [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock Sebastian Andrzej Siewior
  6 siblings, 1 reply; 36+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-09-08 16:22 UTC (permalink / raw)
  To: Peter Zijlstra, linux-kernel
  Cc: bigeasy, tglx, boqun.feng, bristot, bsegall, dietmar.eggemann,
	jstultz, juri.lelli, longman, mgorman, mingo, rostedt, swood,
	vincent.guittot, vschneid, will

From: Thomas Gleixner <tglx@linutronix.de>

There used to be a BUG_ON(current->pi_blocked_on) in the lock acquisition
functions, but that vanished in one of the rtmutex overhauls.

Bring it back in the form of a lockdep assert to catch code paths which take
rtmutex based locks with current::pi_blocked_on != NULL.

Reported-by: Crystal Wood <swood@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230427111937.2745231-5-bigeasy@linutronix.de
Link: https://lore.kernel.org/r/20230815111430.488430699@infradead.org
Link: https://lore.kernel.org/r/20230825181033.504534-7-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/locking/rtmutex.c     | 2 ++
 kernel/locking/rwbase_rt.c   | 2 ++
 kernel/locking/spinlock_rt.c | 2 ++
 3 files changed, 6 insertions(+)

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index a3fe05dfd0d8f..4a10e8c16fd2b 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1784,6 +1784,8 @@ static int __sched rt_mutex_slowlock(struct rt_mutex_base *lock,
 static __always_inline int __rt_mutex_lock(struct rt_mutex_base *lock,
 					   unsigned int state)
 {
+	lockdep_assert(!current->pi_blocked_on);
+
 	if (likely(rt_mutex_try_acquire(lock)))
 		return 0;
 
diff --git a/kernel/locking/rwbase_rt.c b/kernel/locking/rwbase_rt.c
index c7258cb32d91b..34a59569db6be 100644
--- a/kernel/locking/rwbase_rt.c
+++ b/kernel/locking/rwbase_rt.c
@@ -133,6 +133,8 @@ static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
 static __always_inline int rwbase_read_lock(struct rwbase_rt *rwb,
 					    unsigned int state)
 {
+	lockdep_assert(!current->pi_blocked_on);
+
 	if (rwbase_read_trylock(rwb))
 		return 0;
 
diff --git a/kernel/locking/spinlock_rt.c b/kernel/locking/spinlock_rt.c
index 842037b2ba548..38e292454fccb 100644
--- a/kernel/locking/spinlock_rt.c
+++ b/kernel/locking/spinlock_rt.c
@@ -37,6 +37,8 @@
 
 static __always_inline void rtlock_lock(struct rt_mutex_base *rtm)
 {
+	lockdep_assert(!current->pi_blocked_on);
+
 	if (unlikely(!rt_mutex_cmpxchg_acquire(rtm, NULL, current)))
 		rtlock_slowlock(rtm);
 }
-- 
2.40.1



* [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock.
  2023-09-08 16:22 [PATCH v3 0/7] locking/rtmutex: Avoid PI state recursion through sched_submit_work() Sebastian Andrzej Siewior
                   ` (5 preceding siblings ...)
  2023-09-08 16:22 ` [PATCH v3 6/7] locking/rtmutex: Add a lockdep assert to catch potential nested blocking Sebastian Andrzej Siewior
@ 2023-09-08 16:22 ` Sebastian Andrzej Siewior
  2023-09-09 18:29   ` Peter Zijlstra
  2023-09-11 14:11   ` Peter Zijlstra
  6 siblings, 2 replies; 36+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-09-08 16:22 UTC (permalink / raw)
  To: Peter Zijlstra, linux-kernel
  Cc: bigeasy, tglx, boqun.feng, bristot, bsegall, dietmar.eggemann,
	jstultz, juri.lelli, longman, mgorman, mingo, rostedt, swood,
	vincent.guittot, vschneid, will

After rt_mutex_wait_proxy_lock() task_struct::pi_blocked_on is cleared
if current owns the lock. If the operation has been interrupted by a
signal or timeout then pi_blocked_on can be set. This means spin_lock()
*can* overwrite pi_blocked_on on PREEMPT_RT. This has been noticed by
the recently added lockdep-asserts…

The rt_mutex_cleanup_proxy_lock() operation will clear pi_blocked_on
(and update pending waiters as expected) but it must happen under the hb
lock to ensure the same state in rtmutex and userland.

Given all the possibilities it is probably the simplest option to
try-lock the hb lock. In case the lock is occupied a quick nap is
needed. A busy loop can lock up the system if performed by a task with
high priority, preventing the owner from running.

The rt_mutex_post_schedule() needs to be put before the try-lock loop
because otherwise the schedule() in schedule_hrtimeout() will trip over
the !sched_rt_mutex assert.

Introduce futex_trylock_hblock() to try-lock the hb lock and sleep until
the try-lock operation succeeds. Use it after rt_mutex_wait_proxy_lock()
to acquire the lock.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230831095314.fTliy0Bh@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/futex/futex.h     | 23 +++++++++++++++++++++++
 kernel/futex/pi.c        | 10 ++++------
 kernel/futex/requeue.c   |  4 ++--
 kernel/locking/rtmutex.c |  7 +++++--
 4 files changed, 34 insertions(+), 10 deletions(-)

diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index b5379c0e6d6d1..58fbc34a59811 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -254,6 +254,29 @@ double_unlock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
 		spin_unlock(&hb2->lock);
 }
 
+static inline void futex_trylock_hblock(spinlock_t *lock)
+{
+	do {
+		ktime_t chill_time;
+
+		/*
+		 * Current is no longer pi_blocked_on if it owns the lock. It
+		 * can still have pi_blocked_on set if the lock acquisition was
+		 * interrupted by a signal or timeout. The trylock operation does
+		 * not clobber pi_blocked_on so it is the only option.
+		 * Should the try-lock operation fail then it needs to leave the CPU
+		 * to avoid a busy loop in case it is the task with the highest
+		 * priority.
+		 */
+		if (spin_trylock(lock))
+			return;
+
+		chill_time = ktime_set(0, NSEC_PER_MSEC);
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		schedule_hrtimeout(&chill_time, HRTIMER_MODE_REL_HARD);
+	} while (1);
+}
+
 /* syscalls */
 
 extern int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags, u32
diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
index f8e65b27d9d6b..1440fdcdbfd8c 100644
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -1046,7 +1046,10 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
 	ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter);
 
 cleanup:
-	spin_lock(q.lock_ptr);
+	rt_mutex_post_schedule();
+
+	futex_trylock_hblock(q.lock_ptr);
+
 	/*
 	 * If we failed to acquire the lock (deadlock/signal/timeout), we must
 	 * first acquire the hb->lock before removing the lock from the
@@ -1058,11 +1061,6 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
 	 */
 	if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter))
 		ret = 0;
-
-	/*
-	 * Waiter is unqueued.
-	 */
-	rt_mutex_post_schedule();
 no_block:
 	/*
 	 * Fixup the pi_state owner and possibly acquire the lock if we
diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
index cba8b1a6a4cc2..26888cfa74449 100644
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -850,8 +850,8 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 		pi_mutex = &q.pi_state->pi_mutex;
 		ret = rt_mutex_wait_proxy_lock(pi_mutex, to, &rt_waiter);
 
-		/* Current is not longer pi_blocked_on */
-		spin_lock(q.lock_ptr);
+		futex_trylock_hblock(q.lock_ptr);
+
 		if (ret && !rt_mutex_cleanup_proxy_lock(pi_mutex, &rt_waiter))
 			ret = 0;
 
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 4a10e8c16fd2b..e56585ef489c8 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1166,10 +1166,13 @@ try_to_take_rt_mutex(struct rt_mutex_base *lock, struct task_struct *task,
 	 * Clear @task->pi_blocked_on. Requires protection by
 	 * @task->pi_lock. Redundant operation for the @waiter == NULL
 	 * case, but conditionals are more expensive than a redundant
-	 * store.
+	 * store. But then there is FUTEX and if rt_mutex_wait_proxy_lock()
+	 * did not acquire the lock it try-locks another lock before it clears
+	 * @task->pi_blocked_on so we mustn't clear it here prematurely.
 	 */
 	raw_spin_lock(&task->pi_lock);
-	task->pi_blocked_on = NULL;
+	if (waiter)
+		task->pi_blocked_on = NULL;
 	/*
 	 * Finish the lock acquisition. @task is the new owner. If
 	 * other waiters exist we have to insert the highest priority
-- 
2.40.1



* Re: [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock.
  2023-09-08 16:22 ` [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock Sebastian Andrzej Siewior
@ 2023-09-09 18:29   ` Peter Zijlstra
  2023-09-11 14:11   ` Peter Zijlstra
  1 sibling, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2023-09-09 18:29 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-kernel, tglx, boqun.feng, bristot, bsegall,
	dietmar.eggemann, jstultz, juri.lelli, longman, mgorman, mingo,
	rostedt, swood, vincent.guittot, vschneid, will

On Fri, Sep 08, 2023 at 06:22:54PM +0200, Sebastian Andrzej Siewior wrote:
> After rt_mutex_wait_proxy_lock() task_struct::pi_blocked_on is cleared
> if current owns the lock. If the operation has been interrupted by a
> signal or timeout then pi_blocked_on can be set. This means spin_lock()
> *can* overwrite pi_blocked_on on PREEMPT_RT. This has been noticed by
> the recently added lockdep-asserts…
> 
> The rt_mutex_cleanup_proxy_lock() operation will clear pi_blocked_on
> (and update pending waiters as expected) but it must happen under the hb
> lock to ensure the same state in rtmutex and userland.
> 
> Given all the possibilities it is probably the simplest option to
> try-lock the hb lock. In case the lock is occupied a quick nap is
> needed. A busy loop can lock up the system if performed by a task with
> high priority, preventing the owner from running.
> 
> The rt_mutex_post_schedule() needs to be put before the try-lock loop
> because otherwise the schedule() in schedule_hrtimeout() will trip over
> the !sched_rt_mutex assert.
> 
> Introduce futex_trylock_hblock() to try-lock the hb lock and sleep until
> the try-lock operation succeeds. Use it after rt_mutex_wait_proxy_lock()
> to acquire the lock.
> 

Arggggh... this just keeps getting better :/



* Re: [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock.
  2023-09-08 16:22 ` [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock Sebastian Andrzej Siewior
  2023-09-09 18:29   ` Peter Zijlstra
@ 2023-09-11 14:11   ` Peter Zijlstra
  2023-09-12 10:57     ` Peter Zijlstra
                       ` (2 more replies)
  1 sibling, 3 replies; 36+ messages in thread
From: Peter Zijlstra @ 2023-09-11 14:11 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-kernel, tglx, boqun.feng, bristot, bsegall,
	dietmar.eggemann, jstultz, juri.lelli, longman, mgorman, mingo,
	rostedt, swood, vincent.guittot, vschneid, will

On Fri, Sep 08, 2023 at 06:22:54PM +0200, Sebastian Andrzej Siewior wrote:
> After rt_mutex_wait_proxy_lock() task_struct::pi_blocked_on is cleared
> if current owns the lock. If the operation has been interrupted by a
> signal or timeout then pi_blocked_on can be set. This means spin_lock()
> *can* overwrite pi_blocked_on on PREEMPT_RT. This has been noticed by
> the recently added lockdep-asserts…
> 
> The rt_mutex_cleanup_proxy_lock() operation will clear pi_blocked_on
> (and update pending waiters as expected) but it must happen under the hb
> lock to ensure the same state in rtmutex and userland.
> 
> Given all the possibilities it is probably the simplest option to
> try-lock the hb lock. In case the lock is occupied a quick nap is
> needed. A busy loop can lock up the system if performed by a task with
> high priority, preventing the owner from running.
> 
> The rt_mutex_post_schedule() needs to be put before the try-lock loop
> because otherwise the schedule() in schedule_hrtimeout() will trip over
> the !sched_rt_mutex assert.
> 
> Introduce futex_trylock_hblock() to try-lock the hb lock and sleep until
> the try-lock operation succeeds. Use it after rt_mutex_wait_proxy_lock()
> to acquire the lock.

Aside from this being just plain gross, this also destroys determinism
of futex_pi, which completely defeats the purpose of the whole thing.


Now.. the reason we need hb->lock at this point is to avoid the
wake_futex_pi() -EAGAIN case.

This happens when futex_top_waiter() and rt_mutex_top_waiter() state
becomes inconsistent. The current rules are such that this inconsistency
will not be observed.

Notably the case that needs to be avoided is where futex_lock_pi() and
futex_unlock_pi() interleave such that unlock will fail to observe a new
waiter.

*However* the case at hand is where a waiter is leaving, in this case
what happens is that we'll fail to observe a waiter that is already
gone, which is harmless afaict.

Would not something like the below work instead?

---
diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
index ce2889f12375..8c76a52da9bd 100644
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -610,29 +610,16 @@ int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
 /*
  * Caller must hold a reference on @pi_state.
  */
-static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_pi_state *pi_state)
+static int wake_futex_pi(u32 __user *uaddr, u32 uval,
+			 struct futex_pi_state *pi_state,
+			 rt_mutex_waiter *top_waiter)
 {
-	struct rt_mutex_waiter *top_waiter;
 	struct task_struct *new_owner;
 	bool postunlock = false;
 	DEFINE_RT_WAKE_Q(wqh);
 	u32 curval, newval;
 	int ret = 0;
 
-	top_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
-	if (WARN_ON_ONCE(!top_waiter)) {
-		/*
-		 * As per the comment in futex_unlock_pi() this should not happen.
-		 *
-		 * When this happens, give up our locks and try again, giving
-		 * the futex_lock_pi() instance time to complete, either by
-		 * waiting on the rtmutex or removing itself from the futex
-		 * queue.
-		 */
-		ret = -EAGAIN;
-		goto out_unlock;
-	}
-
 	new_owner = top_waiter->task;
 
 	/*
@@ -1039,19 +1026,27 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
 	ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter);
 
 cleanup:
-	spin_lock(q.lock_ptr);
 	/*
 	 * If we failed to acquire the lock (deadlock/signal/timeout), we must
 	 * first acquire the hb->lock before removing the lock from the
 	 * rt_mutex waitqueue, such that we can keep the hb and rt_mutex wait
 	 * lists consistent.
 	 *
-	 * In particular; it is important that futex_unlock_pi() can not
-	 * observe this inconsistency.
+	 * Cannot hold hb->lock because rt_mutex already has a waiter enqueued
+	 * and hb->lock can itself try and enqueue an rt_waiter through rtlock.
+	 *
+	 * Doing the cleanup without holding hb->lock can cause inconsistent
+	 * state between hb and pi_state, but only in the direction of seeing a
+	 * waiter that is leaving.
 	 */
 	if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter))
 		ret = 0;
 
+	/*
+	 * Now that the rt_waiter has been dequeued, it is safe to use
+	 * spinlock/rtlock, which will enqueue a new rt_waiter.
+	 */
+	spin_lock(q.lock_ptr);
 no_block:
 	/*
 	 * Fixup the pi_state owner and possibly acquire the lock if we
@@ -1132,6 +1127,7 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
 	top_waiter = futex_top_waiter(hb, &key);
 	if (top_waiter) {
 		struct futex_pi_state *pi_state = top_waiter->pi_state;
+		struct rt_mutex_waiter *rt_waiter;
 
 		ret = -EINVAL;
 		if (!pi_state)
@@ -1147,19 +1143,34 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
 		get_pi_state(pi_state);
 		/*
 		 * By taking wait_lock while still holding hb->lock, we ensure
-		 * there is no point where we hold neither; and therefore
-		 * wake_futex_p() must observe a state consistent with what we
-		 * observed.
+		 * there is no point where we hold neither; and thereby
+		 * wake_futex_pi() must observe any new waiters.
+		 *
+		 * Since the cleanup: case in futex_lock_pi() removes the
+		 * rt_waiter without holding hb->lock, it is possible for
+		 * wake_futex_pi() to not find a waiter while the above does,
+		 * in this case the waiter is on the way out and it can be
+		 * ignored.
 		 *
 		 * In particular; this forces __rt_mutex_start_proxy() to
 		 * complete such that we're guaranteed to observe the
-		 * rt_waiter. Also see the WARN in wake_futex_pi().
+		 * rt_waiter.
 		 */
 		raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
+
+		/*
+		 * Futex vs rt_mutex waiter state -- if there are no rt_mutex
+		 * waiters even though futex thinks there are, then the waiter
+		 * is leaving and the uncontended path is safe to take.
+		 */
+		rt_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
+		if (!rt_waiter)
+			goto do_uncontended;
+
 		spin_unlock(&hb->lock);
 
 		/* drops pi_state->pi_mutex.wait_lock */
-		ret = wake_futex_pi(uaddr, uval, pi_state);
+		ret = wake_futex_pi(uaddr, uval, pi_state, rt_waiter);
 
 		put_pi_state(pi_state);
 
@@ -1187,6 +1198,7 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
 		return ret;
 	}
 
+do_uncontended:
 	/*
 	 * We have no kernel internal state, i.e. no waiters in the
 	 * kernel. Waiters which are about to queue themselves are stuck



* Re: [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock.
  2023-09-11 14:11   ` Peter Zijlstra
@ 2023-09-12 10:57     ` Peter Zijlstra
  2023-09-12 11:26       ` Sebastian Andrzej Siewior
  2023-09-12 11:17     ` Sebastian Andrzej Siewior
  2023-09-15 12:58     ` Thomas Gleixner
  2 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2023-09-12 10:57 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-kernel, tglx, boqun.feng, bristot, bsegall,
	dietmar.eggemann, jstultz, juri.lelli, longman, mgorman, mingo,
	rostedt, swood, vincent.guittot, vschneid, will

On Mon, Sep 11, 2023 at 04:11:35PM +0200, Peter Zijlstra wrote:

> Now.. the reason we need hb->lock at this point is to avoid the
> wake_futex_pi() -EAGAIN case.
> 
> This happens when futex_top_waiter() and rt_mutex_top_waiter() state
> becomes inconsistent. The current rules are such that this inconsistency
> will not be observed.
> 
> Notably the case that needs to be avoided is where futex_lock_pi() and
> futex_unlock_pi() interleave such that unlock will fail to observe a new
> waiter.
> 
> *However* the case at hand is where a waiter is leaving, in this case
> what happens is that we'll fail to observe a waiter that is already
> gone, which is harmless afaict.
> 
> Would not something like the below work instead?

I forgot to say this is completely untested and hasn't even seen a
compiler up close...

> ---
> diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
> index ce2889f12375..8c76a52da9bd 100644
> --- a/kernel/futex/pi.c
> +++ b/kernel/futex/pi.c
> @@ -610,29 +610,16 @@ int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
>  /*
>   * Caller must hold a reference on @pi_state.
>   */
> -static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_pi_state *pi_state)
> +static int wake_futex_pi(u32 __user *uaddr, u32 uval,
> +			 struct futex_pi_state *pi_state,
> +			 rt_mutex_waiter *top_waiter)
>  {
> -	struct rt_mutex_waiter *top_waiter;
>  	struct task_struct *new_owner;
>  	bool postunlock = false;
>  	DEFINE_RT_WAKE_Q(wqh);
>  	u32 curval, newval;
>  	int ret = 0;
>  
> -	top_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
> -	if (WARN_ON_ONCE(!top_waiter)) {
> -		/*
> -		 * As per the comment in futex_unlock_pi() this should not happen.
> -		 *
> -		 * When this happens, give up our locks and try again, giving
> -		 * the futex_lock_pi() instance time to complete, either by
> -		 * waiting on the rtmutex or removing itself from the futex
> -		 * queue.
> -		 */
> -		ret = -EAGAIN;
> -		goto out_unlock;
> -	}
> -
>  	new_owner = top_waiter->task;
>  
>  	/*
> @@ -1039,19 +1026,27 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
>  	ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter);
>  
>  cleanup:
> -	spin_lock(q.lock_ptr);
>  	/*
>  	 * If we failed to acquire the lock (deadlock/signal/timeout), we must
>  	 * first acquire the hb->lock before removing the lock from the
>  	 * rt_mutex waitqueue, such that we can keep the hb and rt_mutex wait
>  	 * lists consistent.
>  	 *
> -	 * In particular; it is important that futex_unlock_pi() can not
> -	 * observe this inconsistency.
> +	 * Cannot hold hb->lock because rt_mutex already has a waiter enqueued
> +	 * and hb->lock can itself try and enqueue an rt_waiter through rtlock.
> +	 *
> +	 * Doing the cleanup without holding hb->lock can cause inconsistent
> +	 * state between hb and pi_state, but only in the direction of seeing a
> +	 * waiter that is leaving.
>  	 */
>  	if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter))
>  		ret = 0;
>  
> +	/*
> +	 * Now that the rt_waiter has been dequeued, it is safe to use
> +	 * spinlock/rtlock, which will enqueue a new rt_waiter.
> +	 */
> +	spin_lock(q.lock_ptr);
>  no_block:
>  	/*
>  	 * Fixup the pi_state owner and possibly acquire the lock if we
> @@ -1132,6 +1127,7 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
>  	top_waiter = futex_top_waiter(hb, &key);
>  	if (top_waiter) {
>  		struct futex_pi_state *pi_state = top_waiter->pi_state;
> +		struct rt_mutex_waiter *rt_waiter;
>  
>  		ret = -EINVAL;
>  		if (!pi_state)
> @@ -1147,19 +1143,34 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
>  		get_pi_state(pi_state);
>  		/*
>  		 * By taking wait_lock while still holding hb->lock, we ensure
> -		 * there is no point where we hold neither; and therefore
> -		 * wake_futex_p() must observe a state consistent with what we
> -		 * observed.
> +		 * there is no point where we hold neither; and thereby
> +		 * wake_futex_pi() must observe any new waiters.
> +		 *
> +		 * Since the cleanup: case in futex_lock_pi() removes the
> +		 * rt_waiter without holding hb->lock, it is possible for
> +		 * wake_futex_pi() to not find a waiter while the above does,
> +		 * in this case the waiter is on the way out and it can be
> +		 * ignored.
>  		 *
>  		 * In particular; this forces __rt_mutex_start_proxy() to
>  		 * complete such that we're guaranteed to observe the
> -		 * rt_waiter. Also see the WARN in wake_futex_pi().
> +		 * rt_waiter.
>  		 */
>  		raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
> +
> +		/*
> +		 * Futex vs rt_mutex waiter state -- if there are no rt_mutex
> +		 * waiters even though futex thinks there are, then the waiter
> +		 * is leaving and the uncontended path is safe to take.
> +		 */
> +		rt_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
> +		if (!rt_waiter)
> +			goto do_uncontended;

That ^ needs to drop wait_lock before the goto.
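
I.e. something along these lines (sketch):

	rt_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
	if (!rt_waiter) {
		raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
		goto do_uncontended;
	}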

> +
>  		spin_unlock(&hb->lock);
>  
>  		/* drops pi_state->pi_mutex.wait_lock */
> -		ret = wake_futex_pi(uaddr, uval, pi_state);
> +		ret = wake_futex_pi(uaddr, uval, pi_state, rt_waiter);
>  
>  		put_pi_state(pi_state);
>  
> @@ -1187,6 +1198,7 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
>  		return ret;
>  	}
>  
> +do_uncontended:
>  	/*
>  	 * We have no kernel internal state, i.e. no waiters in the
>  	 * kernel. Waiters which are about to queue themselves are stuck
> 


* Re: [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock.
  2023-09-11 14:11   ` Peter Zijlstra
  2023-09-12 10:57     ` Peter Zijlstra
@ 2023-09-12 11:17     ` Sebastian Andrzej Siewior
  2023-09-12 11:24       ` Peter Zijlstra
                         ` (2 more replies)
  2023-09-15 12:58     ` Thomas Gleixner
  2 siblings, 3 replies; 36+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-09-12 11:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, tglx, boqun.feng, bristot, bsegall,
	dietmar.eggemann, jstultz, juri.lelli, longman, mgorman, mingo,
	rostedt, swood, vincent.guittot, vschneid, will

On 2023-09-11 16:11:35 [+0200], Peter Zijlstra wrote:
> Aside from this being just plain gross, this also destroys determinism
> of futex_pi, which completely defeats the purpose of the whole thing.

No objections here.

> Now.. the reason we need hb->lock at this point is to avoid the
> wake_futex_pi() -EAGAIN case.
> 
> This happens when futex_top_waiter() and rt_mutex_top_waiter() state
> becomes inconsistent. The current rules are such that this inconsistency
> will not be observed.
> 
> Notably the case that needs to be avoided is where futex_lock_pi() and
> futex_unlock_pi() interleave such that unlock will fail to observe a new
> waiter.
> 
> *However* the case at hand is where a waiter is leaving, in this case
> what happens is that we'll fail to observe a waiter that is already
> gone, which is harmless afaict.

Sounds harmless. I'm sure something will pop up in a few years and we
will look back at this ;)

> Would not something like the below work instead?

The following is what I ended up with. It adds the `struct' keyword to
wake_futex_pi() and accounts for rt_mutex_post_schedule() in
futex_lock_pi() (assuming you stuff this at the end).

Comments follow…

diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
index f8e65b27d9d6b..d8866278e92ff 100644
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -611,29 +611,16 @@ int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
 /*
  * Caller must hold a reference on @pi_state.
  */
-static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_pi_state *pi_state)
+static int wake_futex_pi(u32 __user *uaddr, u32 uval,
+			 struct futex_pi_state *pi_state,
+			 struct rt_mutex_waiter *top_waiter)
 {
-	struct rt_mutex_waiter *top_waiter;
 	struct task_struct *new_owner;
 	bool postunlock = false;
 	DEFINE_RT_WAKE_Q(wqh);
 	u32 curval, newval;
 	int ret = 0;
 
-	top_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
-	if (WARN_ON_ONCE(!top_waiter)) {
-		/*
-		 * As per the comment in futex_unlock_pi() this should not happen.
-		 *
-		 * When this happens, give up our locks and try again, giving
-		 * the futex_lock_pi() instance time to complete, either by
-		 * waiting on the rtmutex or removing itself from the futex
-		 * queue.
-		 */
-		ret = -EAGAIN;
-		goto out_unlock;
-	}
-
 	new_owner = top_waiter->task;
 
 	/*
@@ -1046,15 +1033,18 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
 	ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter);
 
 cleanup:
-	spin_lock(q.lock_ptr);
 	/*
 	 * If we failed to acquire the lock (deadlock/signal/timeout), we must
 	 * first acquire the hb->lock before removing the lock from the
 	 * rt_mutex waitqueue, such that we can keep the hb and rt_mutex wait
 	 * lists consistent.
 	 *
-	 * In particular; it is important that futex_unlock_pi() can not
-	 * observe this inconsistency.
+	 * Cannot hold hb->lock because rt_mutex already has a waiter enqueued
+	 * and hb->lock can itself try and enqueue an rt_waiter through rtlock.
+	 *
+	 * Doing the cleanup without holding hb->lock can cause inconsistent
+	 * state between hb and pi_state, but only in the direction of seeing a
+	 * waiter that is leaving.
 	 */
 	if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter))
 		ret = 0;
@@ -1063,6 +1053,12 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
 	 * Waiter is unqueued.
 	 */
 	rt_mutex_post_schedule();
+
+	/*
+	 * Now that the rt_waiter has been dequeued, it is safe to use
+	 * spinlock/rtlock, which will enqueue a new rt_waiter.
+	 */
+	spin_lock(q.lock_ptr);
 no_block:
 	/*
 	 * Fixup the pi_state owner and possibly acquire the lock if we
@@ -1143,6 +1139,7 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
 	top_waiter = futex_top_waiter(hb, &key);
 	if (top_waiter) {
 		struct futex_pi_state *pi_state = top_waiter->pi_state;
+		struct rt_mutex_waiter *rt_waiter;
 
 		ret = -EINVAL;
 		if (!pi_state)
@@ -1158,19 +1155,34 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
 		get_pi_state(pi_state);
 		/*
 		 * By taking wait_lock while still holding hb->lock, we ensure
-		 * there is no point where we hold neither; and therefore
-		 * wake_futex_p() must observe a state consistent with what we
-		 * observed.
+		 * there is no point where we hold neither; and thereby
+		 * wake_futex_pi() must observe any new waiters.
+		 *
+		 * Since the cleanup: case in futex_lock_pi() removes the
+		 * rt_waiter without holding hb->lock, it is possible for
+		 * wake_futex_pi() to not find a waiter while the above does,
+		 * in this case the waiter is on the way out and it can be
+		 * ignored.
 		 *
 		 * In particular; this forces __rt_mutex_start_proxy() to
 		 * complete such that we're guaranteed to observe the
-		 * rt_waiter. Also see the WARN in wake_futex_pi().
+		 * rt_waiter.
 		 */
 		raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
+
+		/*
+		 * Futex vs rt_mutex waiter state -- if there are no rt_mutex
+		 * waiters even though futex thinks there are, then the waiter
+		 * is leaving and the uncontended path is safe to take.
+		 */
+		rt_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
+		if (!rt_waiter)
+			goto do_uncontended;
+
 		spin_unlock(&hb->lock);
 
 		/* drops pi_state->pi_mutex.wait_lock */
-		ret = wake_futex_pi(uaddr, uval, pi_state);
+		ret = wake_futex_pi(uaddr, uval, pi_state, rt_waiter);
 
 		put_pi_state(pi_state);
 
@@ -1198,6 +1210,7 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
 		return ret;
 	}
 
+do_uncontended:
 	/*
 	 * We have no kernel internal state, i.e. no waiters in the
 	 * kernel. Waiters which are about to queue themselves are stuck

Sebastian


* Re: [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock.
  2023-09-12 11:17     ` Sebastian Andrzej Siewior
@ 2023-09-12 11:24       ` Peter Zijlstra
  2023-09-12 11:25       ` Sebastian Andrzej Siewior
  2023-09-12 11:25       ` Peter Zijlstra
  2 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2023-09-12 11:24 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-kernel, tglx, boqun.feng, bristot, bsegall,
	dietmar.eggemann, jstultz, juri.lelli, longman, mgorman, mingo,
	rostedt, swood, vincent.guittot, vschneid, will

On Tue, Sep 12, 2023 at 01:17:11PM +0200, Sebastian Andrzej Siewior wrote:
> On 2023-09-11 16:11:35 [+0200], Peter Zijlstra wrote:
> > Aside from this being just plain gross, this also destroys determinism
> > of futex_pi, which completely defeats the purpose of the whole thing.
> 
> No objections here.
> 
> > Now.. the reason we need hb->lock at this point is to avoid the
> > wake_futex_pi() -EAGAIN case.
> > 
> > This happens when futex_top_waiter() and rt_mutex_top_waiter() state
> > becomes inconsistent. The current rules are such that this inconsistency
> > will not be observed.
> > 
> > Notably the case that needs to be avoided is where futex_lock_pi() and
> > futex_unlock_pi() interleave such that unlock will fail to observe a new
> > waiter.
> > 
> > *However* the case at hand is where a waiter is leaving, in this case
> > what happens is that we'll fail to observe a waiter that is already
> > gone, which is harmless afaict.
> 
> Sounds harmless. I'm sure something will pop up in a few years and we
> will look back at this ;)

Oh absolutely, I already hate this... It's dangerous as heck, because
if we ever do trip the race the other way around it will silently
misbehave.

However, compared to the other hack ... :-(



* Re: [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock.
  2023-09-12 11:17     ` Sebastian Andrzej Siewior
  2023-09-12 11:24       ` Peter Zijlstra
@ 2023-09-12 11:25       ` Sebastian Andrzej Siewior
  2023-09-12 11:25       ` Peter Zijlstra
  2 siblings, 0 replies; 36+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-09-12 11:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, tglx, boqun.feng, bristot, bsegall,
	dietmar.eggemann, jstultz, juri.lelli, longman, mgorman, mingo,
	rostedt, swood, vincent.guittot, vschneid, will

On 2023-09-12 13:17:16 [+0200], To Peter Zijlstra wrote:
> Comments follow…
> 
> diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
> index f8e65b27d9d6b..d8866278e92ff 100644
> --- a/kernel/futex/pi.c
> +++ b/kernel/futex/pi.c
> @@ -611,29 +611,16 @@ int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
>  /*
>   * Caller must hold a reference on @pi_state.
>   */
> -static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_pi_state *pi_state)
> +static int wake_futex_pi(u32 __user *uaddr, u32 uval,
> +			 struct futex_pi_state *pi_state,
> +			 struct rt_mutex_waiter *top_waiter)
>  {
> -	struct rt_mutex_waiter *top_waiter;
>  	struct task_struct *new_owner;
>  	bool postunlock = false;
>  	DEFINE_RT_WAKE_Q(wqh);
>  	u32 curval, newval;
>  	int ret = 0;
>  
> -	top_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
> -	if (WARN_ON_ONCE(!top_waiter)) {
> -		/*
> -		 * As per the comment in futex_unlock_pi() this should not happen.
> -		 *
> -		 * When this happens, give up our locks and try again, giving
> -		 * the futex_lock_pi() instance time to complete, either by
> -		 * waiting on the rtmutex or removing itself from the futex
> -		 * queue.
> -		 */
> -		ret = -EAGAIN;
> -		goto out_unlock;
> -	}
> -
>  	new_owner = top_waiter->task;
>  
>  	/*
> @@ -1046,15 +1033,18 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
>  	ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter);
>  
>  cleanup:
> -	spin_lock(q.lock_ptr);
>  	/*
>  	 * If we failed to acquire the lock (deadlock/signal/timeout), we must
>  	 * first acquire the hb->lock before removing the lock from the
>  	 * rt_mutex waitqueue, such that we can keep the hb and rt_mutex wait
>  	 * lists consistent.
>  	 *
> -	 * In particular; it is important that futex_unlock_pi() can not
> -	 * observe this inconsistency.
> +	 * Cannot hold hb->lock because rt_mutex already has a waiter enqueued
> +	 * and hb->lock can itself try and enqueue an rt_waiter through rtlock.
> +	 *
> +	 * Doing the cleanup without holding hb->lock can cause inconsistent
> +	 * state between hb and pi_state, but only in the direction of seeing a
> +	 * waiter that is leaving.
>  	 */
>  	if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter))
>  		ret = 0;
> @@ -1063,6 +1053,12 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
>  	 * Waiter is unqueued.
>  	 */
>  	rt_mutex_post_schedule();

So no new lock-task can enter but one can give up and vanish.

> +
> +	/*
> +	 * Now that the rt_waiter has been dequeued, it is safe to use
> +	 * spinlock/rtlock, which will enqueue a new rt_waiter.
> +	 */
> +	spin_lock(q.lock_ptr);
>  no_block:
>  	/*
>  	 * Fixup the pi_state owner and possibly acquire the lock if we
> @@ -1143,6 +1139,7 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
>  	top_waiter = futex_top_waiter(hb, &key);
>  	if (top_waiter) {
>  		struct futex_pi_state *pi_state = top_waiter->pi_state;
> +		struct rt_mutex_waiter *rt_waiter;
>  
>  		ret = -EINVAL;
>  		if (!pi_state)
> @@ -1158,19 +1155,34 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
>  		get_pi_state(pi_state);
>  		/*
>  		 * By taking wait_lock while still holding hb->lock, we ensure
> -		 * there is no point where we hold neither; and therefore
> -		 * wake_futex_p() must observe a state consistent with what we
> -		 * observed.
> +		 * there is no point where we hold neither; and thereby
> +		 * wake_futex_pi() must observe any new waiters.
> +		 *
> +		 * Since the cleanup: case in futex_lock_pi() removes the
> +		 * rt_waiter without holding hb->lock, it is possible for
> +		 * wake_futex_pi() to not find a waiter while the above does,
> +		 * in this case the waiter is on the way out and it can be
> +		 * ignored.
>  		 *
>  		 * In particular; this forces __rt_mutex_start_proxy() to
>  		 * complete such that we're guaranteed to observe the
> -		 * rt_waiter. Also see the WARN in wake_futex_pi().
> +		 * rt_waiter.
>  		 */
>  		raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
> +
> +		/*
> +		 * Futex vs rt_mutex waiter state -- if there are on rt_mutex
> +		 * waiters even though futex thinkgs there are, then the waiter
> +		 * is leaving and the uncontended path is safe to take.
> +		 */
> +		rt_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
> +		if (!rt_waiter)
> +			goto do_uncontended;

So I spent some time coming up with a test case for this and then
hacked the kernel to get into exactly this situation. The wait_lock
taken above needs to be released here, since in the contended case that
is done by wake_futex_pi().

We also do get_pi_state(pi_state) earlier, so I think a put is in
order here, right?

We own the hb lock here, and it is released later under the
do_uncontended label, so that looks okay.

The userland state is modified under the hb-lock so that should work. 
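
For reference, a very rough userspace sketch of the kind of stress that
pokes at this path. This is not the actual test case used here (that one
is not shown); it only hammers the PI futex lock/unlock paths with a tiny
FUTEX_LOCK_PI timeout, so waiters keep leaving through the cleanup path
while another thread runs futex_unlock_pi():

/* Very rough sketch, not the test case referenced in this thread.
 * Build: gcc -O2 -pthread futex_pi_stress.c
 */
#include <linux/futex.h>
#include <sys/syscall.h>
#include <pthread.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>

static uint32_t futex_word;	/* 0 == unowned, otherwise owner TID (+flags) */

static long sys_futex(uint32_t *uaddr, int op, uint32_t val,
		      const struct timespec *ts)
{
	return syscall(SYS_futex, uaddr, op, val, ts, NULL, 0);
}

static void *hammer(void *arg)
{
	uint32_t tid = syscall(SYS_gettid);

	(void)arg;
	for (int i = 0; i < 1000000; i++) {
		uint32_t zero = 0;

		if (!__atomic_compare_exchange_n(&futex_word, &zero, tid, 0,
						 __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
			/*
			 * Contended: block in futex_lock_pi() with a tiny
			 * absolute CLOCK_REALTIME timeout so the rt_mutex
			 * wait regularly times out and this waiter leaves
			 * through the cleanup path.
			 */
			struct timespec to;

			clock_gettime(CLOCK_REALTIME, &to);
			to.tv_nsec += 100 * 1000;
			if (to.tv_nsec >= 1000000000L) {
				to.tv_sec++;
				to.tv_nsec -= 1000000000L;
			}
			if (sys_futex(&futex_word, FUTEX_LOCK_PI | FUTEX_PRIVATE_FLAG, 0, &to))
				continue;	/* timed out: not the owner */
		}

		/* Owner: always unlock via the kernel to hit futex_unlock_pi(). */
		sys_futex(&futex_word, FUTEX_UNLOCK_PI | FUTEX_PRIVATE_FLAG, 0, NULL);
	}
	return NULL;
}

int main(void)
{
	pthread_t thr[4];

	for (int i = 0; i < 4; i++)
		pthread_create(&thr[i], NULL, hammer, NULL);
	for (int i = 0; i < 4; i++)
		pthread_join(thr[i], NULL);
	return 0;
}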

> +
>  		spin_unlock(&hb->lock);
>  
>  		/* drops pi_state->pi_mutex.wait_lock */
> -		ret = wake_futex_pi(uaddr, uval, pi_state);
> +		ret = wake_futex_pi(uaddr, uval, pi_state, rt_waiter);
>  
>  		put_pi_state(pi_state);
>  
> @@ -1198,6 +1210,7 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
>  		return ret;
>  	}
>  
> +do_uncontended:
>  	/*
>  	 * We have no kernel internal state, i.e. no waiters in the
>  	 * kernel. Waiters which are about to queue themselves are stuck

So I *think* this could work. Then we need something similar for
requeue-pi.
 
Sebastian

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock.
  2023-09-12 11:17     ` Sebastian Andrzej Siewior
  2023-09-12 11:24       ` Peter Zijlstra
  2023-09-12 11:25       ` Sebastian Andrzej Siewior
@ 2023-09-12 11:25       ` Peter Zijlstra
  2023-09-12 11:28         ` Sebastian Andrzej Siewior
  2 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2023-09-12 11:25 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-kernel, tglx, boqun.feng, bristot, bsegall,
	dietmar.eggemann, jstultz, juri.lelli, longman, mgorman, mingo,
	rostedt, swood, vincent.guittot, vschneid, will

On Tue, Sep 12, 2023 at 01:17:11PM +0200, Sebastian Andrzej Siewior wrote:

> Comments follow…
> 
> diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
> index f8e65b27d9d6b..d8866278e92ff 100644
> --- a/kernel/futex/pi.c
> +++ b/kernel/futex/pi.c
> @@ -611,29 +611,16 @@ int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
>  /*
>   * Caller must hold a reference on @pi_state.
>   */
> -static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_pi_state *pi_state)
> +static int wake_futex_pi(u32 __user *uaddr, u32 uval,
> +			 struct futex_pi_state *pi_state,
> +			 struct rt_mutex_waiter *top_waiter)
>  {
> -	struct rt_mutex_waiter *top_waiter;
>  	struct task_struct *new_owner;
>  	bool postunlock = false;
>  	DEFINE_RT_WAKE_Q(wqh);
>  	u32 curval, newval;
>  	int ret = 0;
>  
> -	top_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
> -	if (WARN_ON_ONCE(!top_waiter)) {
> -		/*
> -		 * As per the comment in futex_unlock_pi() this should not happen.
> -		 *
> -		 * When this happens, give up our locks and try again, giving
> -		 * the futex_lock_pi() instance time to complete, either by
> -		 * waiting on the rtmutex or removing itself from the futex
> -		 * queue.
> -		 */
> -		ret = -EAGAIN;
> -		goto out_unlock;
> -	}
> -
>  	new_owner = top_waiter->task;
>  
>  	/*
> @@ -1046,15 +1033,18 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
>  	ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter);
>  
>  cleanup:
> -	spin_lock(q.lock_ptr);
>  	/*
>  	 * If we failed to acquire the lock (deadlock/signal/timeout), we must
>  	 * first acquire the hb->lock before removing the lock from the
>  	 * rt_mutex waitqueue, such that we can keep the hb and rt_mutex wait
>  	 * lists consistent.
>  	 *
> -	 * In particular; it is important that futex_unlock_pi() can not
> -	 * observe this inconsistency.
> +	 * Cannot hold hb->lock because rt_mutex already has a waiter enqueued
> +	 * and hb->lock can itself try and enqueue an rt_waiter through rtlock.
> +	 *
> +	 * Doing the cleanup without holding hb->lock can cause inconsistent
> +	 * state between hb and pi_state, but only in the direction of seeing a
> +	 * waiter that is leaving.
>  	 */
>  	if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter))
>  		ret = 0;
> @@ -1063,6 +1053,12 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
>  	 * Waiter is unqueued.
>  	 */
>  	rt_mutex_post_schedule();
> +
> +	/*
> +	 * Now that the rt_waiter has been dequeued, it is safe to use
> +	 * spinlock/rtlock, which will enqueue a new rt_waiter.
> +	 */
> +	spin_lock(q.lock_ptr);
>  no_block:
>  	/*
>  	 * Fixup the pi_state owner and possibly acquire the lock if we
> @@ -1143,6 +1139,7 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
>  	top_waiter = futex_top_waiter(hb, &key);
>  	if (top_waiter) {
>  		struct futex_pi_state *pi_state = top_waiter->pi_state;
> +		struct rt_mutex_waiter *rt_waiter;
>  
>  		ret = -EINVAL;
>  		if (!pi_state)
> @@ -1158,19 +1155,34 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
>  		get_pi_state(pi_state);
>  		/*
>  		 * By taking wait_lock while still holding hb->lock, we ensure
> -		 * there is no point where we hold neither; and therefore
> -		 * wake_futex_p() must observe a state consistent with what we
> -		 * observed.
> +		 * there is no point where we hold neither; and thereby
> +		 * wake_futex_pi() must observe any new waiters.
> +		 *
> +		 * Since the cleanup: case in futex_lock_pi() removes the
> +		 * rt_waiter without holding hb->lock, it is possible for
> +		 * wake_futex_pi() to not find a waiter while the above does,
> +		 * in this case the waiter is on the way out and it can be
> +		 * ignored.
>  		 *
>  		 * In particular; this forces __rt_mutex_start_proxy() to
>  		 * complete such that we're guaranteed to observe the
> -		 * rt_waiter. Also see the WARN in wake_futex_pi().
> +		 * rt_waiter.
>  		 */
>  		raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
> +
> +		/*
> +		 * Futex vs rt_mutex waiter state -- if there are on rt_mutex
> +		 * waiters even though futex thinkgs there are, then the waiter
> +		 * is leaving and the uncontended path is safe to take.
> +		 */
> +		rt_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
> +		if (!rt_waiter)
> +			goto do_uncontended;

That wants to be:

		if (!rt_waiter) {
			raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
			goto do_uncontended;
		}

> +
>  		spin_unlock(&hb->lock);
>  
>  		/* drops pi_state->pi_mutex.wait_lock */
> -		ret = wake_futex_pi(uaddr, uval, pi_state);
> +		ret = wake_futex_pi(uaddr, uval, pi_state, rt_waiter);
>  
>  		put_pi_state(pi_state);
>  
> @@ -1198,6 +1210,7 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
>  		return ret;
>  	}
>  
> +do_uncontended:
>  	/*
>  	 * We have no kernel internal state, i.e. no waiters in the
>  	 * kernel. Waiters which are about to queue themselves are stuck
> 

And I think there was a requeue site that wants updating too... the above
was more or less a starting point for discussion :-)

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock.
  2023-09-12 10:57     ` Peter Zijlstra
@ 2023-09-12 11:26       ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 36+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-09-12 11:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, tglx, boqun.feng, bristot, bsegall,
	dietmar.eggemann, jstultz, juri.lelli, longman, mgorman, mingo,
	rostedt, swood, vincent.guittot, vschneid, will

On 2023-09-12 12:57:45 [+0200], Peter Zijlstra wrote:
> > --- a/kernel/futex/pi.c
> > +++ b/kernel/futex/pi.c
> > @@ -1147,19 +1143,34 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
> > +		rt_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
> > +		if (!rt_waiter)
> > +			goto do_uncontended;
> 
> That ^ needs to drop wait_lock before the goto.

Ach, you noticed that one. Wrote a reply a few minutes after that one.

Sebastian

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock.
  2023-09-12 11:25       ` Peter Zijlstra
@ 2023-09-12 11:28         ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 36+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-09-12 11:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, tglx, boqun.feng, bristot, bsegall,
	dietmar.eggemann, jstultz, juri.lelli, longman, mgorman, mingo,
	rostedt, swood, vincent.guittot, vschneid, will

On 2023-09-12 13:25:51 [+0200], Peter Zijlstra wrote:
> 
> And I think there was a requeue site that wants updating too.. the above
> was more or less a starting point for discussion :-)

Yes. I spent the morning testing it so I can contribute to the
discussion ;)

Sebastian

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock.
  2023-09-11 14:11   ` Peter Zijlstra
  2023-09-12 10:57     ` Peter Zijlstra
  2023-09-12 11:17     ` Sebastian Andrzej Siewior
@ 2023-09-15 12:58     ` Thomas Gleixner
  2023-09-15 13:15       ` Sebastian Andrzej Siewior
  2023-09-15 15:19       ` Peter Zijlstra
  2 siblings, 2 replies; 36+ messages in thread
From: Thomas Gleixner @ 2023-09-15 12:58 UTC (permalink / raw)
  To: Peter Zijlstra, Sebastian Andrzej Siewior
  Cc: linux-kernel, boqun.feng, bristot, bsegall, dietmar.eggemann,
	jstultz, juri.lelli, longman, mgorman, mingo, rostedt, swood,
	vincent.guittot, vschneid, will

On Mon, Sep 11 2023 at 16:11, Peter Zijlstra wrote:
> On Fri, Sep 08, 2023 at 06:22:54PM +0200, Sebastian Andrzej Siewior wrote:
> Aside from this being just plain gross, this also destroys determinism
> of futex_pi, which completely defeats the purpose of the whole thing.

It served the purpose of forcing the two dudes who wrote most of this
horrorshow to actually look at it. It worked :)

> Now.. the reason we need hb->lock at this point is to avoid the
> wake_futex_pi() -EAGAIN case.
>
> This happens when futex_top_waiter() and rt_mutex_top_waiter() state
> becomes inconsistent. The current rules are such that this inconsistency
> will not be observed.
>
> Notably the case that needs to be avoided is where futex_lock_pi() and
> futex_unlock_pi() interleave such that unlock will fail to observe a new
> waiter.

That's correct.

> *However* the case at hand is where a waiter is leaving, in this case
> what happens is that we'll fail to observe a waiter that is already
> gone, which is harmless afaict.

It is harmless as long as the wakee handles the inconsistency between the
futex waiter list and the rtmutex waiter list correctly.

> @@ -1039,19 +1026,27 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
>  	ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter);
>  
>  cleanup:
> -	spin_lock(q.lock_ptr);
>  	/*
>  	 * If we failed to acquire the lock (deadlock/signal/timeout), we must
>  	 * first acquire the hb->lock before removing the lock from the
>  	 * rt_mutex waitqueue, such that we can keep the hb and rt_mutex wait
>  	 * lists consistent.

This comment does not make sense anymore.

> -	 * In particular; it is important that futex_unlock_pi() can not
> -	 * observe this inconsistency.
> +	 * Cannot hold hb->lock because rt_mutex already has a waiter enqueued
> +	 * and hb->lock can itself try and enqueue an rt_waiter through rtlock.
> +	 *
> +	 * Doing the cleanup without holding hb->lock can cause inconsistent
> +	 * state between hb and pi_state, but only in the direction of seeing a
> +	 * waiter that is leaving.

+	* See futex_unlock_pi()

>  	 */
>  	if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter))
>  		ret = 0;
>  
> +	/*
> +	 * Now that the rt_waiter has been dequeued, it is safe to use
> +	 * spinlock/rtlock, which will enqueue a new rt_waiter.
> +	 */
> +	spin_lock(q.lock_ptr);
>  no_block:
>  	/*
>  	 * Fixup the pi_state owner and possibly acquire the lock if we
> @@ -1132,6 +1127,7 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
>  	top_waiter = futex_top_waiter(hb, &key);
>  	if (top_waiter) {
>  		struct futex_pi_state *pi_state = top_waiter->pi_state;
> +		struct rt_mutex_waiter *rt_waiter;
>  
>  		ret = -EINVAL;
>  		if (!pi_state)
> @@ -1147,19 +1143,34 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
>  		get_pi_state(pi_state);
>  		/*
>  		 * By taking wait_lock while still holding hb->lock, we ensure
> -		 * there is no point where we hold neither; and therefore
> -		 * wake_futex_p() must observe a state consistent with what we
> -		 * observed.
> +		 * there is no point where we hold neither; and thereby
> +		 * wake_futex_pi() must observe any new waiters.
> +		 *
> +		 * Since the cleanup: case in futex_lock_pi() removes the
> +		 * rt_waiter without holding hb->lock, it is possible for
> +		 * wake_futex_pi() to not find a waiter while the above does,
> +		 * in this case the waiter is on the way out and it can be
> +		 * ignored.
>  		 *
>  		 * In particular; this forces __rt_mutex_start_proxy() to
>  		 * complete such that we're guaranteed to observe the
> -		 * rt_waiter. Also see the WARN in wake_futex_pi().
> +		 * rt_waiter.
>  		 */
>  		raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
> +
> +		/*
> +		 * Futex vs rt_mutex waiter state -- if there are on rt_mutex

s/on/no/

> +		 * waiters even though futex thinkgs there are, then the waiter

thinkgs :)

> +		 * is leaving and the uncontended path is safe to take.
> +		 */
> +		rt_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
> +		if (!rt_waiter)
> +			goto do_uncontended;

Leaks pi_mutex.wait_lock

> +
>  		spin_unlock(&hb->lock);
>  
>  		/* drops pi_state->pi_mutex.wait_lock */
> -		ret = wake_futex_pi(uaddr, uval, pi_state);
> +		ret = wake_futex_pi(uaddr, uval, pi_state, rt_waiter);
>  
>  		put_pi_state(pi_state);
>  
> @@ -1187,6 +1198,7 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
>  		return ret;
>  	}
>  
> +do_uncontended:
>  	/*
>  	 * We have no kernel internal state, i.e. no waiters in the
>  	 * kernel. Waiters which are about to queue themselves are stuck

Plus you need:

diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
index cba8b1a6a4cc..af1427689414 100644
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -850,11 +850,11 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 		pi_mutex = &q.pi_state->pi_mutex;
 		ret = rt_mutex_wait_proxy_lock(pi_mutex, to, &rt_waiter);
 
-		/* Current is not longer pi_blocked_on */
-		spin_lock(q.lock_ptr);
+		/* Add a proper comment */
 		if (ret && !rt_mutex_cleanup_proxy_lock(pi_mutex, &rt_waiter))
 			ret = 0;
 
+		spin_lock(q.lock_ptr);
 		debug_rt_mutex_free_waiter(&rt_waiter);
 		/*
 		 * Fixup the pi_state owner and possibly acquire the lock if we


I spent quite some time to convince myself that this is correct. I was
not able to poke a hole into it. So that really should be safe to
do. Famous last words ...

Thanks,

        tglx

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock.
  2023-09-15 12:58     ` Thomas Gleixner
@ 2023-09-15 13:15       ` Sebastian Andrzej Siewior
  2023-09-15 15:19       ` Peter Zijlstra
  1 sibling, 0 replies; 36+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-09-15 13:15 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, linux-kernel, boqun.feng, bristot, bsegall,
	dietmar.eggemann, jstultz, juri.lelli, longman, mgorman, mingo,
	rostedt, swood, vincent.guittot, vschneid, will

On 2023-09-15 14:58:35 [+0200], Thomas Gleixner wrote:
> > +		 * is leaving and the uncontended path is safe to take.
> > +		 */
> > +		rt_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
> > +		if (!rt_waiter)
> > +			goto do_uncontended;
> 
> Leaks pi_mutex.wait_lock

and pi_state. 

> Plus you need:
> 
> diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
> index cba8b1a6a4cc..af1427689414 100644
> --- a/kernel/futex/requeue.c
> +++ b/kernel/futex/requeue.c
> @@ -850,11 +850,11 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
>  		pi_mutex = &q.pi_state->pi_mutex;
>  		ret = rt_mutex_wait_proxy_lock(pi_mutex, to, &rt_waiter);
>  
> -		/* Current is not longer pi_blocked_on */
> -		spin_lock(q.lock_ptr);
> +		/* Add a proper comment */
>  		if (ret && !rt_mutex_cleanup_proxy_lock(pi_mutex, &rt_waiter))
>  			ret = 0;
>  
> +		spin_lock(q.lock_ptr);
>  		debug_rt_mutex_free_waiter(&rt_waiter);
>  		/*
>  		 * Fixup the pi_state owner and possibly acquire the lock if we

Yes, if we do this.

> 
> I spent quite some time to convince myself that this is correct. I was
> not able to poke a hole into it. So that really should be safe to
> do. Famous last words ...

Okay. Then let me collect the pieces and post a new round.

> Thanks,
> 
>         tglx

Sebastian

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock.
  2023-09-15 12:58     ` Thomas Gleixner
  2023-09-15 13:15       ` Sebastian Andrzej Siewior
@ 2023-09-15 15:19       ` Peter Zijlstra
  2023-09-15 18:59         ` Thomas Gleixner
                           ` (3 more replies)
  1 sibling, 4 replies; 36+ messages in thread
From: Peter Zijlstra @ 2023-09-15 15:19 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Sebastian Andrzej Siewior, linux-kernel, boqun.feng, bristot,
	bsegall, dietmar.eggemann, jstultz, juri.lelli, longman, mgorman,
	mingo, rostedt, swood, vincent.guittot, vschneid, will

On Fri, Sep 15, 2023 at 02:58:35PM +0200, Thomas Gleixner wrote:

> I spent quite some time to convince myself that this is correct. I was
> not able to poke a hole into it. So that really should be safe to
> do. Famous last words ...

IKR :-/

Something like so then...

---
Subject: futex/pi: Fix recursive rt_mutex waiter state
From: Peter Zijlstra <peterz@infradead.org>
Date: Tue, 12 Sep 2023 13:17:11 +0200

Some new assertions pointed out that the existing code has nested rt_mutex wait
state in the futex code.

Specifically, the futex_lock_pi() cancel case uses spin_lock() while there
still is a rt_waiter enqueued for this task, resulting in a state where there
are two waiters for the same task (and task_struct::pi_blocked_on gets
scrambled).

The reason to take hb->lock at this point is to avoid the wake_futex_pi()
EAGAIN case.

This happens when futex_top_waiter() and rt_mutex_top_waiter() state becomes
inconsistent. The current rules are such that this inconsistency will not be
observed.

Notably the case that needs to be avoided is where futex_lock_pi() and
futex_unlock_pi() interleave such that unlock will fail to observe a new
waiter.

*However* the case at hand is where a waiter is leaving, in this case the race
means a waiter that is going away is not observed -- which is harmless,
provided this race is explicitly handled.

This is a somewhat dangerous proposition because the converse race is not
observing a new waiter, which must absolutely not happen. But since the race is
valid this cannot be asserted.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 pi.c      |   76 +++++++++++++++++++++++++++++++++++++++-----------------------
 requeue.c |    6 +++-
 2 files changed, 52 insertions(+), 30 deletions(-)

Index: linux-2.6/kernel/futex/pi.c
===================================================================
--- linux-2.6.orig/kernel/futex/pi.c
+++ linux-2.6/kernel/futex/pi.c
@@ -610,29 +610,16 @@ int futex_lock_pi_atomic(u32 __user *uad
 /*
  * Caller must hold a reference on @pi_state.
  */
-static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_pi_state *pi_state)
+static int wake_futex_pi(u32 __user *uaddr, u32 uval,
+			 struct futex_pi_state *pi_state,
+			 struct rt_mutex_waiter *top_waiter)
 {
-	struct rt_mutex_waiter *top_waiter;
 	struct task_struct *new_owner;
 	bool postunlock = false;
 	DEFINE_RT_WAKE_Q(wqh);
 	u32 curval, newval;
 	int ret = 0;
 
-	top_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
-	if (WARN_ON_ONCE(!top_waiter)) {
-		/*
-		 * As per the comment in futex_unlock_pi() this should not happen.
-		 *
-		 * When this happens, give up our locks and try again, giving
-		 * the futex_lock_pi() instance time to complete, either by
-		 * waiting on the rtmutex or removing itself from the futex
-		 * queue.
-		 */
-		ret = -EAGAIN;
-		goto out_unlock;
-	}
-
 	new_owner = top_waiter->task;
 
 	/*
@@ -1039,19 +1026,33 @@ retry_private:
 	ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter);
 
 cleanup:
-	spin_lock(q.lock_ptr);
 	/*
 	 * If we failed to acquire the lock (deadlock/signal/timeout), we must
-	 * first acquire the hb->lock before removing the lock from the
-	 * rt_mutex waitqueue, such that we can keep the hb and rt_mutex wait
-	 * lists consistent.
+	 * must unwind the above, however we canont lock hb->lock because
+	 * rt_mutex already has a waiter enqueued and hb->lock can itself try
+	 * and enqueue an rt_waiter through rtlock.
+	 *
+	 * Doing the cleanup without holding hb->lock can cause inconsistent
+	 * state between hb and pi_state, but only in the direction of not
+	 * seeing a waiter that is leaving.
+	 *
+	 * See futex_unlock_pi(), it deals with this inconsistency.
+	 *
+	 * There be dragons here, since we must deal with the inconsistency on
+	 * the way out (here), it is impossible to detect/warn about the race
+	 * the other way around (missing an incoming waiter).
 	 *
-	 * In particular; it is important that futex_unlock_pi() can not
-	 * observe this inconsistency.
+	 * What could possibly go wrong...
 	 */
 	if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter))
 		ret = 0;
 
+	/*
+	 * Now that the rt_waiter has been dequeued, it is safe to use
+	 * spinlock/rtlock (which might enqueue its own rt_waiter) and fix up
+	 * the
+	 */
+	spin_lock(q.lock_ptr);
 no_block:
 	/*
 	 * Fixup the pi_state owner and possibly acquire the lock if we
@@ -1132,6 +1133,7 @@ retry:
 	top_waiter = futex_top_waiter(hb, &key);
 	if (top_waiter) {
 		struct futex_pi_state *pi_state = top_waiter->pi_state;
+		struct rt_mutex_waiter *rt_waiter;
 
 		ret = -EINVAL;
 		if (!pi_state)
@@ -1144,22 +1146,39 @@ retry:
 		if (pi_state->owner != current)
 			goto out_unlock;
 
-		get_pi_state(pi_state);
 		/*
 		 * By taking wait_lock while still holding hb->lock, we ensure
-		 * there is no point where we hold neither; and therefore
-		 * wake_futex_p() must observe a state consistent with what we
-		 * observed.
+		 * there is no point where we hold neither; and thereby
+		 * wake_futex_pi() must observe any new waiters.
+		 *
+		 * Since the cleanup: case in futex_lock_pi() removes the
+		 * rt_waiter without holding hb->lock, it is possible for
+		 * wake_futex_pi() to not find a waiter while the above does,
+		 * in this case the waiter is on the way out and it can be
+		 * ignored.
 		 *
 		 * In particular; this forces __rt_mutex_start_proxy() to
 		 * complete such that we're guaranteed to observe the
-		 * rt_waiter. Also see the WARN in wake_futex_pi().
+		 * rt_waiter.
 		 */
 		raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
+
+		/*
+		 * Futex vs rt_mutex waiter state -- if there are no rt_mutex
+		 * waiters even though futex thinks there are, then the waiter
+		 * is leaving and the uncontended path is safe to take.
+		 */
+		rt_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
+		if (!rt_waiter) {
+			raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
+			goto do_uncontended;
+		}
+
+		get_pi_state(pi_state);
 		spin_unlock(&hb->lock);
 
 		/* drops pi_state->pi_mutex.wait_lock */
-		ret = wake_futex_pi(uaddr, uval, pi_state);
+		ret = wake_futex_pi(uaddr, uval, pi_state, rt_waiter);
 
 		put_pi_state(pi_state);
 
@@ -1187,6 +1206,7 @@ retry:
 		return ret;
 	}
 
+do_uncontended:
 	/*
 	 * We have no kernel internal state, i.e. no waiters in the
 	 * kernel. Waiters which are about to queue themselves are stuck
Index: linux-2.6/kernel/futex/requeue.c
===================================================================
--- linux-2.6.orig/kernel/futex/requeue.c
+++ linux-2.6/kernel/futex/requeue.c
@@ -850,11 +850,13 @@ int futex_wait_requeue_pi(u32 __user *ua
 		pi_mutex = &q.pi_state->pi_mutex;
 		ret = rt_mutex_wait_proxy_lock(pi_mutex, to, &rt_waiter);
 
-		/* Current is not longer pi_blocked_on */
-		spin_lock(q.lock_ptr);
+		/*
+		 * See futex_unlock_pi()'s cleanup: comment.
+		 */
 		if (ret && !rt_mutex_cleanup_proxy_lock(pi_mutex, &rt_waiter))
 			ret = 0;
 
+		spin_lock(q.lock_ptr);
 		debug_rt_mutex_free_waiter(&rt_waiter);
 		/*
 		 * Fixup the pi_state owner and possibly acquire the lock if we

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock.
  2023-09-15 15:19       ` Peter Zijlstra
@ 2023-09-15 18:59         ` Thomas Gleixner
  2023-09-19 11:03         ` Sebastian Andrzej Siewior
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 36+ messages in thread
From: Thomas Gleixner @ 2023-09-15 18:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sebastian Andrzej Siewior, linux-kernel, boqun.feng, bristot,
	bsegall, dietmar.eggemann, jstultz, juri.lelli, longman, mgorman,
	mingo, rostedt, swood, vincent.guittot, vschneid, will

On Fri, Sep 15 2023 at 17:19, Peter Zijlstra wrote:
> On Fri, Sep 15, 2023 at 02:58:35PM +0200, Thomas Gleixner wrote:
> *However* the case at hand is where a waiter is leaving, in this case the race
> means a waiter that is going away is not observed -- which is harmless,
> provided this race is explicitly handled.
>
> This is a somewhat dangerous proposition because the converse race is not
> observing a new waiter, which must absolutely not happen. But since the race is
> valid this cannot be asserted.

Correct. But adding a new waiter requires holding hb::lock, which _IS_
held by the unlocking code when it deals with the outgoing race.

So I'm not too worried about it. 

> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>  	/*
>  	 * If we failed to acquire the lock (deadlock/signal/timeout), we must
> -	 * first acquire the hb->lock before removing the lock from the
> -	 * rt_mutex waitqueue, such that we can keep the hb and rt_mutex wait
> -	 * lists consistent.
> +	 * must unwind the above, however we canont lock hb->lock because
> +	 * rt_mutex already has a waiter enqueued and hb->lock can itself try
> +	 * and enqueue an rt_waiter through rtlock.
> +	 *
> +	 * Doing the cleanup without holding hb->lock can cause inconsistent
> +	 * state between hb and pi_state, but only in the direction of not
> +	 * seeing a waiter that is leaving.
> +	 *
> +	 * See futex_unlock_pi(), it deals with this inconsistency.
> +	 *
> +	 * There be dragons here, since we must deal with the inconsistency on
> +	 * the way out (here), it is impossible to detect/warn about the race
> +	 * the other way around (missing an incoming waiter).
>  	 *
> -	 * In particular; it is important that futex_unlock_pi() can not
> -	 * observe this inconsistency.
> +	 * What could possibly go wrong...

If some code in the future tries to enqueue a waiter w/o holding
hb::lock then this corner case will be the least of our worries. There
are tons of other things which will insta go south.

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock.
  2023-09-15 15:19       ` Peter Zijlstra
  2023-09-15 18:59         ` Thomas Gleixner
@ 2023-09-19 11:03         ` Sebastian Andrzej Siewior
  2023-09-20  7:36         ` [tip: locking/core] futex/pi: Fix recursive rt_mutex waiter state tip-bot2 for Peter Zijlstra
  2024-01-15 11:40         ` [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock Jiri Slaby
  3 siblings, 0 replies; 36+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-09-19 11:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, linux-kernel, boqun.feng, bristot, bsegall,
	dietmar.eggemann, jstultz, juri.lelli, longman, mgorman, mingo,
	rostedt, swood, vincent.guittot, vschneid, will

On 2023-09-15 17:19:43 [+0200], Peter Zijlstra wrote:
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Looked at it, poked at it, and I can't find anything wrong with it.

Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Sebastian

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [tip: locking/core] futex/pi: Fix recursive rt_mutex waiter state
  2023-09-15 15:19       ` Peter Zijlstra
  2023-09-15 18:59         ` Thomas Gleixner
  2023-09-19 11:03         ` Sebastian Andrzej Siewior
@ 2023-09-20  7:36         ` tip-bot2 for Peter Zijlstra
  2024-01-15 11:40         ` [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock Jiri Slaby
  3 siblings, 0 replies; 36+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2023-09-20  7:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel), Thomas Gleixner,
	Sebastian Andrzej Siewior, x86, linux-kernel

The following commit has been merged into the locking/core branch of tip:

Commit-ID:     fbeb558b0dd0d6348e0872bbbbe96e30c65867b7
Gitweb:        https://git.kernel.org/tip/fbeb558b0dd0d6348e0872bbbbe96e30c65867b7
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Fri, 15 Sep 2023 17:19:44 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 20 Sep 2023 09:31:14 +02:00

futex/pi: Fix recursive rt_mutex waiter state

Some new assertions pointed out that the existing code has nested rt_mutex wait
state in the futex code.

Specifically, the futex_lock_pi() cancel case uses spin_lock() while there
still is a rt_waiter enqueued for this task, resulting in a state where there
are two waiters for the same task (and task_struct::pi_blocked_on gets
scrambled).

The reason to take hb->lock at this point is to avoid the wake_futex_pi()
EAGAIN case.

This happens when futex_top_waiter() and rt_mutex_top_waiter() state becomes
inconsistent. The current rules are such that this inconsistency will not be
observed.

Notably the case that needs to be avoided is where futex_lock_pi() and
futex_unlock_pi() interleave such that unlock will fail to observe a new
waiter.

*However* the case at hand is where a waiter is leaving, in this case the race
means a waiter that is going away is not observed -- which is harmless,
provided this race is explicitly handled.

This is a somewhat dangerous proposition because the converse race is not
observing a new waiter, which must absolutely not happen. But since the race is
valid this cannot be asserted.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lkml.kernel.org/r/20230915151943.GD6743@noisy.programming.kicks-ass.net
---
 kernel/futex/pi.c      | 76 +++++++++++++++++++++++++----------------
 kernel/futex/requeue.c |  6 ++-
 2 files changed, 52 insertions(+), 30 deletions(-)

diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
index f8e65b2..d636a1b 100644
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -611,29 +611,16 @@ int futex_lock_pi_atomic(u32 __user *uaddr, struct futex_hash_bucket *hb,
 /*
  * Caller must hold a reference on @pi_state.
  */
-static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_pi_state *pi_state)
+static int wake_futex_pi(u32 __user *uaddr, u32 uval,
+			 struct futex_pi_state *pi_state,
+			 struct rt_mutex_waiter *top_waiter)
 {
-	struct rt_mutex_waiter *top_waiter;
 	struct task_struct *new_owner;
 	bool postunlock = false;
 	DEFINE_RT_WAKE_Q(wqh);
 	u32 curval, newval;
 	int ret = 0;
 
-	top_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
-	if (WARN_ON_ONCE(!top_waiter)) {
-		/*
-		 * As per the comment in futex_unlock_pi() this should not happen.
-		 *
-		 * When this happens, give up our locks and try again, giving
-		 * the futex_lock_pi() instance time to complete, either by
-		 * waiting on the rtmutex or removing itself from the futex
-		 * queue.
-		 */
-		ret = -EAGAIN;
-		goto out_unlock;
-	}
-
 	new_owner = top_waiter->task;
 
 	/*
@@ -1046,20 +1033,34 @@ retry_private:
 	ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter);
 
 cleanup:
-	spin_lock(q.lock_ptr);
 	/*
 	 * If we failed to acquire the lock (deadlock/signal/timeout), we must
-	 * first acquire the hb->lock before removing the lock from the
-	 * rt_mutex waitqueue, such that we can keep the hb and rt_mutex wait
-	 * lists consistent.
+	 * must unwind the above, however we canont lock hb->lock because
+	 * rt_mutex already has a waiter enqueued and hb->lock can itself try
+	 * and enqueue an rt_waiter through rtlock.
+	 *
+	 * Doing the cleanup without holding hb->lock can cause inconsistent
+	 * state between hb and pi_state, but only in the direction of not
+	 * seeing a waiter that is leaving.
+	 *
+	 * See futex_unlock_pi(), it deals with this inconsistency.
 	 *
-	 * In particular; it is important that futex_unlock_pi() can not
-	 * observe this inconsistency.
+	 * There be dragons here, since we must deal with the inconsistency on
+	 * the way out (here), it is impossible to detect/warn about the race
+	 * the other way around (missing an incoming waiter).
+	 *
+	 * What could possibly go wrong...
 	 */
 	if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter))
 		ret = 0;
 
 	/*
+	 * Now that the rt_waiter has been dequeued, it is safe to use
+	 * spinlock/rtlock (which might enqueue its own rt_waiter) and fix up
+	 * the
+	 */
+	spin_lock(q.lock_ptr);
+	/*
 	 * Waiter is unqueued.
 	 */
 	rt_mutex_post_schedule();
@@ -1143,6 +1144,7 @@ retry:
 	top_waiter = futex_top_waiter(hb, &key);
 	if (top_waiter) {
 		struct futex_pi_state *pi_state = top_waiter->pi_state;
+		struct rt_mutex_waiter *rt_waiter;
 
 		ret = -EINVAL;
 		if (!pi_state)
@@ -1155,22 +1157,39 @@ retry:
 		if (pi_state->owner != current)
 			goto out_unlock;
 
-		get_pi_state(pi_state);
 		/*
 		 * By taking wait_lock while still holding hb->lock, we ensure
-		 * there is no point where we hold neither; and therefore
-		 * wake_futex_p() must observe a state consistent with what we
-		 * observed.
+		 * there is no point where we hold neither; and thereby
+		 * wake_futex_pi() must observe any new waiters.
+		 *
+		 * Since the cleanup: case in futex_lock_pi() removes the
+		 * rt_waiter without holding hb->lock, it is possible for
+		 * wake_futex_pi() to not find a waiter while the above does,
+		 * in this case the waiter is on the way out and it can be
+		 * ignored.
 		 *
 		 * In particular; this forces __rt_mutex_start_proxy() to
 		 * complete such that we're guaranteed to observe the
-		 * rt_waiter. Also see the WARN in wake_futex_pi().
+		 * rt_waiter.
 		 */
 		raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
+
+		/*
+		 * Futex vs rt_mutex waiter state -- if there are no rt_mutex
+		 * waiters even though futex thinks there are, then the waiter
+		 * is leaving and the uncontended path is safe to take.
+		 */
+		rt_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
+		if (!rt_waiter) {
+			raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
+			goto do_uncontended;
+		}
+
+		get_pi_state(pi_state);
 		spin_unlock(&hb->lock);
 
 		/* drops pi_state->pi_mutex.wait_lock */
-		ret = wake_futex_pi(uaddr, uval, pi_state);
+		ret = wake_futex_pi(uaddr, uval, pi_state, rt_waiter);
 
 		put_pi_state(pi_state);
 
@@ -1198,6 +1217,7 @@ retry:
 		return ret;
 	}
 
+do_uncontended:
 	/*
 	 * We have no kernel internal state, i.e. no waiters in the
 	 * kernel. Waiters which are about to queue themselves are stuck
diff --git a/kernel/futex/requeue.c b/kernel/futex/requeue.c
index cba8b1a..4c73e0b 100644
--- a/kernel/futex/requeue.c
+++ b/kernel/futex/requeue.c
@@ -850,11 +850,13 @@ int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 		pi_mutex = &q.pi_state->pi_mutex;
 		ret = rt_mutex_wait_proxy_lock(pi_mutex, to, &rt_waiter);
 
-		/* Current is not longer pi_blocked_on */
-		spin_lock(q.lock_ptr);
+		/*
+		 * See futex_unlock_pi()'s cleanup: comment.
+		 */
 		if (ret && !rt_mutex_cleanup_proxy_lock(pi_mutex, &rt_waiter))
 			ret = 0;
 
+		spin_lock(q.lock_ptr);
 		debug_rt_mutex_free_waiter(&rt_waiter);
 		/*
 		 * Fixup the pi_state owner and possibly acquire the lock if we

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [tip: locking/core] locking/rtmutex: Add a lockdep assert to catch potential nested blocking
  2023-09-08 16:22 ` [PATCH v3 6/7] locking/rtmutex: Add a lockdep assert to catch potential nested blocking Sebastian Andrzej Siewior
@ 2023-09-20  7:36   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 36+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2023-09-20  7:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Crystal Wood, Thomas Gleixner, Peter Zijlstra (Intel),
	Sebastian Andrzej Siewior, x86, linux-kernel

The following commit has been merged into the locking/core branch of tip:

Commit-ID:     45f67f30a22f264bc7a0a61255c2ee1a838e9403
Gitweb:        https://git.kernel.org/tip/45f67f30a22f264bc7a0a61255c2ee1a838e9403
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Fri, 08 Sep 2023 18:22:53 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 20 Sep 2023 09:31:14 +02:00

locking/rtmutex: Add a lockdep assert to catch potential nested blocking

There used to be a BUG_ON(current->pi_blocked_on) in the lock acquisition
functions, but that vanished in one of the rtmutex overhauls.

Bring it back in form of a lockdep assert to catch code paths which take
rtmutex based locks with current::pi_blocked_on != NULL.

Reported-by: Crystal Wood <swood@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230908162254.999499-7-bigeasy@linutronix.de
---
 kernel/locking/rtmutex.c     | 2 ++
 kernel/locking/rwbase_rt.c   | 2 ++
 kernel/locking/spinlock_rt.c | 2 ++
 3 files changed, 6 insertions(+)

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index a3fe05d..4a10e8c 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1784,6 +1784,8 @@ static int __sched rt_mutex_slowlock(struct rt_mutex_base *lock,
 static __always_inline int __rt_mutex_lock(struct rt_mutex_base *lock,
 					   unsigned int state)
 {
+	lockdep_assert(!current->pi_blocked_on);
+
 	if (likely(rt_mutex_try_acquire(lock)))
 		return 0;
 
diff --git a/kernel/locking/rwbase_rt.c b/kernel/locking/rwbase_rt.c
index c7258cb..34a5956 100644
--- a/kernel/locking/rwbase_rt.c
+++ b/kernel/locking/rwbase_rt.c
@@ -133,6 +133,8 @@ static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
 static __always_inline int rwbase_read_lock(struct rwbase_rt *rwb,
 					    unsigned int state)
 {
+	lockdep_assert(!current->pi_blocked_on);
+
 	if (rwbase_read_trylock(rwb))
 		return 0;
 
diff --git a/kernel/locking/spinlock_rt.c b/kernel/locking/spinlock_rt.c
index 842037b..38e2924 100644
--- a/kernel/locking/spinlock_rt.c
+++ b/kernel/locking/spinlock_rt.c
@@ -37,6 +37,8 @@
 
 static __always_inline void rtlock_lock(struct rt_mutex_base *rtm)
 {
+	lockdep_assert(!current->pi_blocked_on);
+
 	if (unlikely(!rt_mutex_cmpxchg_acquire(rtm, NULL, current)))
 		rtlock_slowlock(rtm);
 }

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [tip: locking/core] locking/rtmutex: Use rt_mutex specific scheduler helpers
  2023-09-08 16:22 ` [PATCH v3 5/7] locking/rtmutex: Use " Sebastian Andrzej Siewior
@ 2023-09-20  7:36   ` tip-bot2 for Sebastian Andrzej Siewior
  0 siblings, 0 replies; 36+ messages in thread
From: tip-bot2 for Sebastian Andrzej Siewior @ 2023-09-20  7:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Crystal Wood, Sebastian Andrzej Siewior, Peter Zijlstra (Intel),
	x86, linux-kernel

The following commit has been merged into the locking/core branch of tip:

Commit-ID:     d14f9e930b9073de264c106bf04968286ef9b3a4
Gitweb:        https://git.kernel.org/tip/d14f9e930b9073de264c106bf04968286ef9b3a4
Author:        Sebastian Andrzej Siewior <bigeasy@linutronix.de>
AuthorDate:    Fri, 08 Sep 2023 18:22:52 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 20 Sep 2023 09:31:13 +02:00

locking/rtmutex: Use rt_mutex specific scheduler helpers

Have rt_mutex use the rt_mutex specific scheduler helpers to avoid
recursion vs rtlock on the PI state.

[[ peterz: adapted to new names ]]

Reported-by: Crystal Wood <swood@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230908162254.999499-6-bigeasy@linutronix.de
---
 kernel/futex/pi.c            | 11 +++++++++++
 kernel/locking/rtmutex.c     | 14 ++++++++++++--
 kernel/locking/rwbase_rt.c   |  6 ++++++
 kernel/locking/rwsem.c       |  8 +++++++-
 kernel/locking/spinlock_rt.c |  4 ++++
 5 files changed, 40 insertions(+), 3 deletions(-)

diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
index ce2889f..f8e65b2 100644
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0-or-later
 
 #include <linux/slab.h>
+#include <linux/sched/rt.h>
 #include <linux/sched/task.h>
 
 #include "futex.h"
@@ -1002,6 +1003,12 @@ retry_private:
 		goto no_block;
 	}
 
+	/*
+	 * Must be done before we enqueue the waiter, here is unfortunately
+	 * under the hb lock, but that *should* work because it does nothing.
+	 */
+	rt_mutex_pre_schedule();
+
 	rt_mutex_init_waiter(&rt_waiter);
 
 	/*
@@ -1052,6 +1059,10 @@ cleanup:
 	if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter))
 		ret = 0;
 
+	/*
+	 * Waiter is unqueued.
+	 */
+	rt_mutex_post_schedule();
 no_block:
 	/*
 	 * Fixup the pi_state owner and possibly acquire the lock if we
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index bcec053..a3fe05d 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1632,7 +1632,7 @@ static int __sched rt_mutex_slowlock_block(struct rt_mutex_base *lock,
 		raw_spin_unlock_irq(&lock->wait_lock);
 
 		if (!owner || !rtmutex_spin_on_owner(lock, waiter, owner))
-			schedule();
+			rt_mutex_schedule();
 
 		raw_spin_lock_irq(&lock->wait_lock);
 		set_current_state(state);
@@ -1661,7 +1661,7 @@ static void __sched rt_mutex_handle_deadlock(int res, int detect_deadlock,
 	WARN(1, "rtmutex deadlock detected\n");
 	while (1) {
 		set_current_state(TASK_INTERRUPTIBLE);
-		schedule();
+		rt_mutex_schedule();
 	}
 }
 
@@ -1757,6 +1757,15 @@ static int __sched rt_mutex_slowlock(struct rt_mutex_base *lock,
 	int ret;
 
 	/*
+	 * Do all pre-schedule work here, before we queue a waiter and invoke
+	 * PI -- any such work that trips on rtlock (PREEMPT_RT spinlock) would
+	 * otherwise recurse back into task_blocks_on_rt_mutex() through
+	 * rtlock_slowlock() and will then enqueue a second waiter for this
+	 * same task and things get really confusing real fast.
+	 */
+	rt_mutex_pre_schedule();
+
+	/*
 	 * Technically we could use raw_spin_[un]lock_irq() here, but this can
 	 * be called in early boot if the cmpxchg() fast path is disabled
 	 * (debug, no architecture support). In this case we will acquire the
@@ -1767,6 +1776,7 @@ static int __sched rt_mutex_slowlock(struct rt_mutex_base *lock,
 	raw_spin_lock_irqsave(&lock->wait_lock, flags);
 	ret = __rt_mutex_slowlock_locked(lock, ww_ctx, state);
 	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
+	rt_mutex_post_schedule();
 
 	return ret;
 }
diff --git a/kernel/locking/rwbase_rt.c b/kernel/locking/rwbase_rt.c
index 25ec023..c7258cb 100644
--- a/kernel/locking/rwbase_rt.c
+++ b/kernel/locking/rwbase_rt.c
@@ -71,6 +71,7 @@ static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
 	struct rt_mutex_base *rtm = &rwb->rtmutex;
 	int ret;
 
+	rwbase_pre_schedule();
 	raw_spin_lock_irq(&rtm->wait_lock);
 
 	/*
@@ -125,6 +126,7 @@ static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
 		rwbase_rtmutex_unlock(rtm);
 
 	trace_contention_end(rwb, ret);
+	rwbase_post_schedule();
 	return ret;
 }
 
@@ -237,6 +239,8 @@ static int __sched rwbase_write_lock(struct rwbase_rt *rwb,
 	/* Force readers into slow path */
 	atomic_sub(READER_BIAS, &rwb->readers);
 
+	rwbase_pre_schedule();
+
 	raw_spin_lock_irqsave(&rtm->wait_lock, flags);
 	if (__rwbase_write_trylock(rwb))
 		goto out_unlock;
@@ -248,6 +252,7 @@ static int __sched rwbase_write_lock(struct rwbase_rt *rwb,
 		if (rwbase_signal_pending_state(state, current)) {
 			rwbase_restore_current_state();
 			__rwbase_write_unlock(rwb, 0, flags);
+			rwbase_post_schedule();
 			trace_contention_end(rwb, -EINTR);
 			return -EINTR;
 		}
@@ -266,6 +271,7 @@ static int __sched rwbase_write_lock(struct rwbase_rt *rwb,
 
 out_unlock:
 	raw_spin_unlock_irqrestore(&rtm->wait_lock, flags);
+	rwbase_post_schedule();
 	return 0;
 }
 
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 9eabd58..2340b6d 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1427,8 +1427,14 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
 #define rwbase_signal_pending_state(state, current)	\
 	signal_pending_state(state, current)
 
+#define rwbase_pre_schedule()				\
+	rt_mutex_pre_schedule()
+
 #define rwbase_schedule()				\
-	schedule()
+	rt_mutex_schedule()
+
+#define rwbase_post_schedule()				\
+	rt_mutex_post_schedule()
 
 #include "rwbase_rt.c"
 
diff --git a/kernel/locking/spinlock_rt.c b/kernel/locking/spinlock_rt.c
index 48a19ed..842037b 100644
--- a/kernel/locking/spinlock_rt.c
+++ b/kernel/locking/spinlock_rt.c
@@ -184,9 +184,13 @@ static __always_inline int  rwbase_rtmutex_trylock(struct rt_mutex_base *rtm)
 
 #define rwbase_signal_pending_state(state, current)	(0)
 
+#define rwbase_pre_schedule()
+
 #define rwbase_schedule()				\
 	schedule_rtlock()
 
+#define rwbase_post_schedule()
+
 #include "rwbase_rt.c"
 /*
  * The common functions which get wrapped into the rwlock API.

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [tip: locking/core] sched: Provide rt_mutex specific scheduler helpers
  2023-09-08 16:22 ` [PATCH v3 4/7] sched: Provide rt_mutex specific scheduler helpers Sebastian Andrzej Siewior
@ 2023-09-20  7:36   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 36+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2023-09-20  7:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Peter Zijlstra (Intel),
	Sebastian Andrzej Siewior, x86, linux-kernel

The following commit has been merged into the locking/core branch of tip:

Commit-ID:     6b596e62ed9f90c4a97e68ae1f7b1af5beeb3c05
Gitweb:        https://git.kernel.org/tip/6b596e62ed9f90c4a97e68ae1f7b1af5beeb3c05
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Fri, 08 Sep 2023 18:22:51 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 20 Sep 2023 09:31:12 +02:00

sched: Provide rt_mutex specific scheduler helpers

With PREEMPT_RT there is a rt_mutex recursion problem where
sched_submit_work() can use an rtlock (aka spinlock_t). More
specifically what happens is:

  mutex_lock() /* really rt_mutex */
    ...
      __rt_mutex_slowlock_locked()
	task_blocks_on_rt_mutex()
          // enqueue current task as waiter
          // do PI chain walk
        rt_mutex_slowlock_block()
          schedule()
            sched_submit_work()
              ...
              spin_lock() /* really rtlock */
                ...
                  __rt_mutex_slowlock_locked()
                    task_blocks_on_rt_mutex()
                      // enqueue current task as waiter *AGAIN*
                      // *CONFUSION*

Fix this by making rt_mutex do the sched_submit_work() early, before
it enqueues itself as a waiter -- before it even knows *if* it will
wait.
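
Condensed, the rt_mutex slow path ends up with the shape below. This is
just a sketch assembled from the companion "locking/rtmutex: Use rt_mutex
specific scheduler helpers" patch, with error handling and details elided:

static int __sched rt_mutex_slowlock(struct rt_mutex_base *lock,
				     struct ww_acquire_ctx *ww_ctx,
				     unsigned int state)
{
	unsigned long flags;
	int ret;

	/*
	 * sched_submit_work() runs here, before this task has queued itself
	 * as an rt_mutex waiter, so an rtlock taken from it can no longer
	 * enqueue a second waiter for the same task.
	 */
	rt_mutex_pre_schedule();

	raw_spin_lock_irqsave(&lock->wait_lock, flags);
	/* blocks via rt_mutex_schedule() instead of schedule() */
	ret = __rt_mutex_slowlock_locked(lock, ww_ctx, state);
	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);

	/* sched_update_worker() once the waiter is gone again */
	rt_mutex_post_schedule();

	return ret;
}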

[[ basically Thomas' patch but with different naming and a few asserts
   added ]]

Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230908162254.999499-5-bigeasy@linutronix.de
---
 include/linux/sched.h    |  3 +++
 include/linux/sched/rt.h |  4 ++++
 kernel/sched/core.c      | 36 ++++++++++++++++++++++++++++++++----
 3 files changed, 39 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 77f01ac..67623ff 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -911,6 +911,9 @@ struct task_struct {
 	 * ->sched_remote_wakeup gets used, so it can be in this word.
 	 */
 	unsigned			sched_remote_wakeup:1;
+#ifdef CONFIG_RT_MUTEXES
+	unsigned			sched_rt_mutex:1;
+#endif
 
 	/* Bit to tell LSMs we're in execve(): */
 	unsigned			in_execve:1;
diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h
index 994c256..b2b9e6e 100644
--- a/include/linux/sched/rt.h
+++ b/include/linux/sched/rt.h
@@ -30,6 +30,10 @@ static inline bool task_is_realtime(struct task_struct *tsk)
 }
 
 #ifdef CONFIG_RT_MUTEXES
+extern void rt_mutex_pre_schedule(void);
+extern void rt_mutex_schedule(void);
+extern void rt_mutex_post_schedule(void);
+
 /*
  * Must hold either p->pi_lock or task_rq(p)->lock.
  */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1ea7ba5..58d0346 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6723,9 +6723,6 @@ static inline void sched_submit_work(struct task_struct *tsk)
 	static DEFINE_WAIT_OVERRIDE_MAP(sched_map, LD_WAIT_CONFIG);
 	unsigned int task_flags;
 
-	if (task_is_running(tsk))
-		return;
-
 	/*
 	 * Establish LD_WAIT_CONFIG context to ensure none of the code called
 	 * will use a blocking primitive -- which would lead to recursion.
@@ -6783,7 +6780,12 @@ asmlinkage __visible void __sched schedule(void)
 {
 	struct task_struct *tsk = current;
 
-	sched_submit_work(tsk);
+#ifdef CONFIG_RT_MUTEXES
+	lockdep_assert(!tsk->sched_rt_mutex);
+#endif
+
+	if (!task_is_running(tsk))
+		sched_submit_work(tsk);
 	__schedule_loop(SM_NONE);
 	sched_update_worker(tsk);
 }
@@ -7044,6 +7046,32 @@ static void __setscheduler_prio(struct task_struct *p, int prio)
 
 #ifdef CONFIG_RT_MUTEXES
 
+/*
+ * Would be more useful with typeof()/auto_type but they don't mix with
+ * bit-fields. Since it's a local thing, use int. Keep the generic sounding
+ * name such that if someone were to implement this function we get to compare
+ * notes.
+ */
+#define fetch_and_set(x, v) ({ int _x = (x); (x) = (v); _x; })
+
+void rt_mutex_pre_schedule(void)
+{
+	lockdep_assert(!fetch_and_set(current->sched_rt_mutex, 1));
+	sched_submit_work(current);
+}
+
+void rt_mutex_schedule(void)
+{
+	lockdep_assert(current->sched_rt_mutex);
+	__schedule_loop(SM_NONE);
+}
+
+void rt_mutex_post_schedule(void)
+{
+	sched_update_worker(current);
+	lockdep_assert(fetch_and_set(current->sched_rt_mutex, 0));
+}
+
 static inline int __rt_effective_prio(struct task_struct *pi_task, int prio)
 {
 	if (pi_task)

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [tip: locking/core] sched: Extract __schedule_loop()
  2023-09-08 16:22 ` [PATCH v3 3/7] sched: Extract __schedule_loop() Sebastian Andrzej Siewior
@ 2023-09-20  7:36   ` tip-bot2 for Thomas Gleixner
  0 siblings, 0 replies; 36+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2023-09-20  7:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Sebastian Andrzej Siewior,
	Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the locking/core branch of tip:

Commit-ID:     de1474b46d889ee0367f6e71d9adfeb0711e4a8d
Gitweb:        https://git.kernel.org/tip/de1474b46d889ee0367f6e71d9adfeb0711e4a8d
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Fri, 08 Sep 2023 18:22:50 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 20 Sep 2023 09:31:12 +02:00

sched: Extract __schedule_loop()

There are currently two implementations of this basic __schedule()
loop, and there is soon to be a third.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230908162254.999499-4-bigeasy@linutronix.de
---
 kernel/sched/core.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d555640..1ea7ba5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6770,16 +6770,21 @@ static void sched_update_worker(struct task_struct *tsk)
 	}
 }
 
-asmlinkage __visible void __sched schedule(void)
+static __always_inline void __schedule_loop(unsigned int sched_mode)
 {
-	struct task_struct *tsk = current;
-
-	sched_submit_work(tsk);
 	do {
 		preempt_disable();
-		__schedule(SM_NONE);
+		__schedule(sched_mode);
 		sched_preempt_enable_no_resched();
 	} while (need_resched());
+}
+
+asmlinkage __visible void __sched schedule(void)
+{
+	struct task_struct *tsk = current;
+
+	sched_submit_work(tsk);
+	__schedule_loop(SM_NONE);
 	sched_update_worker(tsk);
 }
 EXPORT_SYMBOL(schedule);
@@ -6843,11 +6848,7 @@ void __sched schedule_preempt_disabled(void)
 #ifdef CONFIG_PREEMPT_RT
 void __sched notrace schedule_rtlock(void)
 {
-	do {
-		preempt_disable();
-		__schedule(SM_RTLOCK_WAIT);
-		sched_preempt_enable_no_resched();
-	} while (need_resched());
+	__schedule_loop(SM_RTLOCK_WAIT);
 }
 NOKPROBE_SYMBOL(schedule_rtlock);
 #endif

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [tip: locking/core] locking/rtmutex: Avoid unconditional slowpath for DEBUG_RT_MUTEXES
  2023-09-08 16:22 ` [PATCH v3 2/7] locking/rtmutex: Avoid unconditional slowpath for DEBUG_RT_MUTEXES Sebastian Andrzej Siewior
@ 2023-09-20  7:36   ` tip-bot2 for Sebastian Andrzej Siewior
  0 siblings, 0 replies; 36+ messages in thread
From: tip-bot2 for Sebastian Andrzej Siewior @ 2023-09-20  7:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Sebastian Andrzej Siewior, Thomas Gleixner,
	Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the locking/core branch of tip:

Commit-ID:     af9f006393b53409be0ca83ae234bef840cdef4a
Gitweb:        https://git.kernel.org/tip/af9f006393b53409be0ca83ae234bef840cdef4a
Author:        Sebastian Andrzej Siewior <bigeasy@linutronix.de>
AuthorDate:    Fri, 08 Sep 2023 18:22:49 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 20 Sep 2023 09:31:11 +02:00

locking/rtmutex: Avoid unconditional slowpath for DEBUG_RT_MUTEXES

With DEBUG_RT_MUTEXES enabled the fast-path rt_mutex_cmpxchg_acquire()
always fails and all lock operations take the slow path.

Provide a new inline helper rt_mutex_try_acquire() which maps to
rt_mutex_cmpxchg_acquire() in the non-debug case. For the debug case
it invokes rt_mutex_slowtrylock() which can acquire a non-contended
rtmutex under full debug coverage.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230908162254.999499-3-bigeasy@linutronix.de
---
 kernel/locking/rtmutex.c     | 21 ++++++++++++++++++++-
 kernel/locking/ww_rt_mutex.c |  2 +-
 2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 21db0df..bcec053 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -218,6 +218,11 @@ static __always_inline bool rt_mutex_cmpxchg_acquire(struct rt_mutex_base *lock,
 	return try_cmpxchg_acquire(&lock->owner, &old, new);
 }
 
+static __always_inline bool rt_mutex_try_acquire(struct rt_mutex_base *lock)
+{
+	return rt_mutex_cmpxchg_acquire(lock, NULL, current);
+}
+
 static __always_inline bool rt_mutex_cmpxchg_release(struct rt_mutex_base *lock,
 						     struct task_struct *old,
 						     struct task_struct *new)
@@ -297,6 +302,20 @@ static __always_inline bool rt_mutex_cmpxchg_acquire(struct rt_mutex_base *lock,
 
 }
 
+static int __sched rt_mutex_slowtrylock(struct rt_mutex_base *lock);
+
+static __always_inline bool rt_mutex_try_acquire(struct rt_mutex_base *lock)
+{
+	/*
+	 * With debug enabled rt_mutex_cmpxchg trylock() will always fail.
+	 *
+	 * Avoid unconditionally taking the slow path by using
+	 * rt_mutex_slowtrylock() which is covered by the debug code and can
+	 * acquire a non-contended rtmutex.
+	 */
+	return rt_mutex_slowtrylock(lock);
+}
+
 static __always_inline bool rt_mutex_cmpxchg_release(struct rt_mutex_base *lock,
 						     struct task_struct *old,
 						     struct task_struct *new)
@@ -1755,7 +1774,7 @@ static int __sched rt_mutex_slowlock(struct rt_mutex_base *lock,
 static __always_inline int __rt_mutex_lock(struct rt_mutex_base *lock,
 					   unsigned int state)
 {
-	if (likely(rt_mutex_cmpxchg_acquire(lock, NULL, current)))
+	if (likely(rt_mutex_try_acquire(lock)))
 		return 0;
 
 	return rt_mutex_slowlock(lock, NULL, state);
diff --git a/kernel/locking/ww_rt_mutex.c b/kernel/locking/ww_rt_mutex.c
index d1473c6..c7196de 100644
--- a/kernel/locking/ww_rt_mutex.c
+++ b/kernel/locking/ww_rt_mutex.c
@@ -62,7 +62,7 @@ __ww_rt_mutex_lock(struct ww_mutex *lock, struct ww_acquire_ctx *ww_ctx,
 	}
 	mutex_acquire_nest(&rtm->dep_map, 0, 0, nest_lock, ip);
 
-	if (likely(rt_mutex_cmpxchg_acquire(&rtm->rtmutex, NULL, current))) {
+	if (likely(rt_mutex_try_acquire(&rtm->rtmutex))) {
 		if (ww_ctx)
 			ww_mutex_set_context_fastpath(lock, ww_ctx);
 		return 0;

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [tip: locking/core] sched: Constrain locks in sched_submit_work()
  2023-09-08 16:22 ` [PATCH v3 1/7] sched: Constrain locks in sched_submit_work() Sebastian Andrzej Siewior
@ 2023-09-20  7:36   ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 36+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2023-09-20  7:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel), Sebastian Andrzej Siewior, x86,
	linux-kernel

The following commit has been merged into the locking/core branch of tip:

Commit-ID:     28bc55f654de49f6122c7475b01b5d5ef4bdf0d4
Gitweb:        https://git.kernel.org/tip/28bc55f654de49f6122c7475b01b5d5ef4bdf0d4
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Fri, 08 Sep 2023 18:22:48 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 20 Sep 2023 09:31:11 +02:00

sched: Constrain locks in sched_submit_work()

Even though sched_submit_work() is run from preemptible context,
it is discouraged to have it use blocking locks due to the recursion
potential.

Enforce this.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230908162254.999499-2-bigeasy@linutronix.de
---
 kernel/sched/core.c |  9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2299a5c..d555640 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6720,11 +6720,18 @@ void __noreturn do_task_dead(void)
 
 static inline void sched_submit_work(struct task_struct *tsk)
 {
+	static DEFINE_WAIT_OVERRIDE_MAP(sched_map, LD_WAIT_CONFIG);
 	unsigned int task_flags;
 
 	if (task_is_running(tsk))
 		return;
 
+	/*
+	 * Establish LD_WAIT_CONFIG context to ensure none of the code called
+	 * will use a blocking primitive -- which would lead to recursion.
+	 */
+	lock_map_acquire_try(&sched_map);
+
 	task_flags = tsk->flags;
 	/*
 	 * If a worker goes to sleep, notify and ask workqueue whether it
@@ -6749,6 +6756,8 @@ static inline void sched_submit_work(struct task_struct *tsk)
 	 * make sure to submit it to avoid deadlocks.
 	 */
 	blk_flush_plug(tsk->plug, true);
+
+	lock_map_release(&sched_map);
 }
 
 static void sched_update_worker(struct task_struct *tsk)
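
The annotation above is generic lockdep machinery rather than anything
scheduler specific. As a minimal sketch (not part of the patch) of the
same pattern applied elsewhere: my_nonblocking_path() and
do_nonblocking_stuff() are made up, the lockdep calls are the ones used
above.

static void do_nonblocking_stuff(void);		/* hypothetical */

static void my_nonblocking_path(void)
{
	/* LD_WAIT_CONFIG: spinning locks are fine, sleeping locks are not. */
	static DEFINE_WAIT_OVERRIDE_MAP(my_map, LD_WAIT_CONFIG);

	lock_map_acquire_try(&my_map);

	/*
	 * With lockdep enabled, acquiring a mutex or rwsem anywhere below
	 * this point now triggers a wait-type warning.
	 */
	do_nonblocking_stuff();

	lock_map_release(&my_map);
}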

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock.
  2023-09-15 15:19       ` Peter Zijlstra
                           ` (2 preceding siblings ...)
  2023-09-20  7:36         ` [tip: locking/core] futex/pi: Fix recursive rt_mutex waiter state tip-bot2 for Peter Zijlstra
@ 2024-01-15 11:40         ` Jiri Slaby
  2024-01-15 11:52           ` Jiri Slaby
  3 siblings, 1 reply; 36+ messages in thread
From: Jiri Slaby @ 2024-01-15 11:40 UTC (permalink / raw)
  To: Peter Zijlstra, Thomas Gleixner
  Cc: Sebastian Andrzej Siewior, linux-kernel, boqun.feng, bristot,
	bsegall, dietmar.eggemann, jstultz, juri.lelli, longman, mgorman,
	mingo, rostedt, swood, vincent.guittot, vschneid, will

On 15. 09. 23, 17:19, Peter Zijlstra wrote:
> On Fri, Sep 15, 2023 at 02:58:35PM +0200, Thomas Gleixner wrote:
> 
>> I spent quite some time to convince myself that this is correct. I was
>> not able to poke a hole into it. So that really should be safe to
>> do. Famous last words ...
> 
> IKR :-/
> 
> Something like so then...
> 
> ---
> Subject: futex/pi: Fix recursive rt_mutex waiter state

So this breaks some random test in APR:

 From 
https://build.opensuse.org/package/live_build_log/openSUSE:Factory:Staging:G/apr/standard/x86_64:
testprocmutex       :  Line 122: child did not terminate with success

The child in fact terminates on 
https://github.com/apache/apr/blob/trunk/test/testprocmutex.c#L93:
                 while ((rv = apr_proc_mutex_timedlock(proc_lock, 1))) {
                     if (!APR_STATUS_IS_TIMEUP(rv))
                         exit(1); <----- here

The test creates 6 children and does pthread_mutex_timedlock/unlock() 
repeatedly (200 times) in parallel while sleeping 1 us inside the lock. 
The timeout is the 1 us above. And the test expects all of them to fail 
(to time out). But the timeout does not always happen in 6.7 (it's racy, 
so the failure is semi-random: like 1 of 1000 attempts is bad).

If I revert this patch (commit fbeb558b0dd0d), the test works.

I know, the test could be broken too, but I have no idea, really. The 
testsuite is sort of hairy and I could not come up with a simple repro.

Note APR sets up PTHREAD_PROCESS_SHARED, _ROBUST, and _PRIO_INHERIT 
attrs for the mutex.
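
For reference, a minimal sketch of such a setup -- not APR's code, it
just mirrors what the reproducers attached later in this thread do (the
function name init_shared_pi_mutex() is made up; proc_lock is assumed to
point into MAP_SHARED memory):

#include <pthread.h>

static void init_shared_pi_mutex(pthread_mutex_t *proc_lock)
{
	pthread_mutexattr_t mattr;

	pthread_mutexattr_init(&mattr);
	pthread_mutexattr_setpshared(&mattr, PTHREAD_PROCESS_SHARED);
	pthread_mutexattr_setrobust(&mattr, PTHREAD_MUTEX_ROBUST);
	/* PRIO_INHERIT makes glibc use FUTEX_LOCK_PI/FUTEX_UNLOCK_PI. */
	pthread_mutexattr_setprotocol(&mattr, PTHREAD_PRIO_INHERIT);
	pthread_mutex_init(proc_lock, &mattr);
	pthread_mutexattr_destroy(&mattr);
}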

Anyway:
downstream report: https://bugzilla.suse.com/show_bug.cgi?id=1218801
APR report: https://bz.apache.org/bugzilla/show_bug.cgi?id=68481

Any idea if this patch should cause the above (or even is a desired 
behavior)?

Thanks.

> From: Peter Zijlstra <peterz@infradead.org>
> Date: Tue, 12 Sep 2023 13:17:11 +0200
> 
> Some new assertions pointed out that the existing code has nested rt_mutex wait
> state in the futex code.
> 
> Specifically, the futex_lock_pi() cancel case uses spin_lock() while there
> still is a rt_waiter enqueued for this task, resulting in a state where there
> are two waiters for the same task (and task_struct::pi_blocked_on gets
> scrambled).
> 
> The reason to take hb->lock at this point is to avoid the wake_futex_pi()
> EAGAIN case.
> 
> This happens when futex_top_waiter() and rt_mutex_top_waiter() state becomes
> inconsistent. The current rules are such that this inconsistency will not be
> observed.
> 
> Notably the case that needs to be avoided is where futex_lock_pi() and
> futex_unlock_pi() interleave such that unlock will fail to observe a new
> waiter.
> 
> *However* the case at hand is where a waiter is leaving, in this case the race
> means a waiter that is going away is not observed -- which is harmless,
> provided this race is explicitly handled.
> 
> This is a somewhat dangerous proposition because the converse race is not
> observing a new waiter, which must absolutely not happen. But since the race is
> valid this cannot be asserted.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>   pi.c      |   76 +++++++++++++++++++++++++++++++++++++++-----------------------
>   requeue.c |    6 +++-
>   2 files changed, 52 insertions(+), 30 deletions(-)
> 
> Index: linux-2.6/kernel/futex/pi.c
> ===================================================================
> --- linux-2.6.orig/kernel/futex/pi.c
> +++ linux-2.6/kernel/futex/pi.c
> @@ -610,29 +610,16 @@ int futex_lock_pi_atomic(u32 __user *uad
>   /*
>    * Caller must hold a reference on @pi_state.
>    */
> -static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_pi_state *pi_state)
> +static int wake_futex_pi(u32 __user *uaddr, u32 uval,
> +			 struct futex_pi_state *pi_state,
> +			 struct rt_mutex_waiter *top_waiter)
>   {
> -	struct rt_mutex_waiter *top_waiter;
>   	struct task_struct *new_owner;
>   	bool postunlock = false;
>   	DEFINE_RT_WAKE_Q(wqh);
>   	u32 curval, newval;
>   	int ret = 0;
>   
> -	top_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
> -	if (WARN_ON_ONCE(!top_waiter)) {
> -		/*
> -		 * As per the comment in futex_unlock_pi() this should not happen.
> -		 *
> -		 * When this happens, give up our locks and try again, giving
> -		 * the futex_lock_pi() instance time to complete, either by
> -		 * waiting on the rtmutex or removing itself from the futex
> -		 * queue.
> -		 */
> -		ret = -EAGAIN;
> -		goto out_unlock;
> -	}
> -
>   	new_owner = top_waiter->task;
>   
>   	/*
> @@ -1039,19 +1026,33 @@ retry_private:
>   	ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, to, &rt_waiter);
>   
>   cleanup:
> -	spin_lock(q.lock_ptr);
>   	/*
>   	 * If we failed to acquire the lock (deadlock/signal/timeout), we must
> -	 * first acquire the hb->lock before removing the lock from the
> -	 * rt_mutex waitqueue, such that we can keep the hb and rt_mutex wait
> -	 * lists consistent.
> +	 * unwind the above, however we cannot lock hb->lock because
> +	 * rt_mutex already has a waiter enqueued and hb->lock can itself try
> +	 * and enqueue an rt_waiter through rtlock.
> +	 *
> +	 * Doing the cleanup without holding hb->lock can cause inconsistent
> +	 * state between hb and pi_state, but only in the direction of not
> +	 * seeing a waiter that is leaving.
> +	 *
> +	 * See futex_unlock_pi(), it deals with this inconsistency.
> +	 *
> +	 * There be dragons here, since we must deal with the inconsistency on
> +	 * the way out (here), it is impossible to detect/warn about the race
> +	 * the other way around (missing an incoming waiter).
>   	 *
> -	 * In particular; it is important that futex_unlock_pi() can not
> -	 * observe this inconsistency.
> +	 * What could possibly go wrong...
>   	 */
>   	if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter))
>   		ret = 0;
>   
> +	/*
> +	 * Now that the rt_waiter has been dequeued, it is safe to use
> +	 * spinlock/rtlock (which might enqueue its own rt_waiter) and fix up
> +	 * the pi_state.
> +	 */
> +	spin_lock(q.lock_ptr);
>   no_block:
>   	/*
>   	 * Fixup the pi_state owner and possibly acquire the lock if we
> @@ -1132,6 +1133,7 @@ retry:
>   	top_waiter = futex_top_waiter(hb, &key);
>   	if (top_waiter) {
>   		struct futex_pi_state *pi_state = top_waiter->pi_state;
> +		struct rt_mutex_waiter *rt_waiter;
>   
>   		ret = -EINVAL;
>   		if (!pi_state)
> @@ -1144,22 +1146,39 @@ retry:
>   		if (pi_state->owner != current)
>   			goto out_unlock;
>   
> -		get_pi_state(pi_state);
>   		/*
>   		 * By taking wait_lock while still holding hb->lock, we ensure
> -		 * there is no point where we hold neither; and therefore
> -		 * wake_futex_p() must observe a state consistent with what we
> -		 * observed.
> +		 * there is no point where we hold neither; and thereby
> +		 * wake_futex_pi() must observe any new waiters.
> +		 *
> +		 * Since the cleanup: case in futex_lock_pi() removes the
> +		 * rt_waiter without holding hb->lock, it is possible for
> +		 * wake_futex_pi() to not find a waiter while the above does,
> +		 * in this case the waiter is on the way out and it can be
> +		 * ignored.
>   		 *
>   		 * In particular; this forces __rt_mutex_start_proxy() to
>   		 * complete such that we're guaranteed to observe the
> -		 * rt_waiter. Also see the WARN in wake_futex_pi().
> +		 * rt_waiter.
>   		 */
>   		raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
> +
> +		/*
> +		 * Futex vs rt_mutex waiter state -- if there are no rt_mutex
> +		 * waiters even though futex thinks there are, then the waiter
> +		 * is leaving and the uncontended path is safe to take.
> +		 */
> +		rt_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
> +		if (!rt_waiter) {
> +			raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
> +			goto do_uncontended;
> +		}
> +
> +		get_pi_state(pi_state);
>   		spin_unlock(&hb->lock);
>   
>   		/* drops pi_state->pi_mutex.wait_lock */
> -		ret = wake_futex_pi(uaddr, uval, pi_state);
> +		ret = wake_futex_pi(uaddr, uval, pi_state, rt_waiter);
>   
>   		put_pi_state(pi_state);
>   
> @@ -1187,6 +1206,7 @@ retry:
>   		return ret;
>   	}
>   
> +do_uncontended:
>   	/*
>   	 * We have no kernel internal state, i.e. no waiters in the
>   	 * kernel. Waiters which are about to queue themselves are stuck
> Index: linux-2.6/kernel/futex/requeue.c
> ===================================================================
> --- linux-2.6.orig/kernel/futex/requeue.c
> +++ linux-2.6/kernel/futex/requeue.c
> @@ -850,11 +850,13 @@ int futex_wait_requeue_pi(u32 __user *ua
>   		pi_mutex = &q.pi_state->pi_mutex;
>   		ret = rt_mutex_wait_proxy_lock(pi_mutex, to, &rt_waiter);
>   
> -		/* Current is not longer pi_blocked_on */
> -		spin_lock(q.lock_ptr);
> +		/*
> +		 * See futex_unlock_pi()'s cleanup: comment.
> +		 */
>   		if (ret && !rt_mutex_cleanup_proxy_lock(pi_mutex, &rt_waiter))
>   			ret = 0;
>   
> +		spin_lock(q.lock_ptr);
>   		debug_rt_mutex_free_waiter(&rt_waiter);
>   		/*
>   		 * Fixup the pi_state owner and possibly acquire the lock if we
> 

-- 
js
suse labs


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock.
  2024-01-15 11:40         ` [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock Jiri Slaby
@ 2024-01-15 11:52           ` Jiri Slaby
  2024-01-15 12:16             ` Sebastian Andrzej Siewior
  2024-01-15 12:54             ` Jiri Slaby
  0 siblings, 2 replies; 36+ messages in thread
From: Jiri Slaby @ 2024-01-15 11:52 UTC (permalink / raw)
  To: Peter Zijlstra, Thomas Gleixner
  Cc: Sebastian Andrzej Siewior, linux-kernel, boqun.feng, bristot,
	bsegall, dietmar.eggemann, jstultz, juri.lelli, longman, mgorman,
	mingo, rostedt, swood, vincent.guittot, vschneid, will

On 15. 01. 24, 12:40, Jiri Slaby wrote:
> On 15. 09. 23, 17:19, Peter Zijlstra wrote:
>> On Fri, Sep 15, 2023 at 02:58:35PM +0200, Thomas Gleixner wrote:
>>
>>> I spent quite some time to convince myself that this is correct. I was
>>> not able to poke a hole into it. So that really should be safe to
>>> do. Famous last words ...
>>
>> IKR :-/
>>
>> Something like so then...
>>
>> ---
>> Subject: futex/pi: Fix recursive rt_mutex waiter state
> 
> So this breaks some random test in APR:
> 
>  From 
> https://build.opensuse.org/package/live_build_log/openSUSE:Factory:Staging:G/apr/standard/x86_64:
> testprocmutex       :  Line 122: child did not terminate with success
> 
> The child in fact terminates on 
> https://github.com/apache/apr/blob/trunk/test/testprocmutex.c#L93:
>                  while ((rv = apr_proc_mutex_timedlock(proc_lock, 1))) {
>                      if (!APR_STATUS_IS_TIMEUP(rv))
>                          exit(1); <----- here
> 
> The test creates 6 children and does some 
> pthread_mutex_timedlock/unlock() repeatedly (200 times) in parallel 
> while sleeping 1 us inside the lock. The timeout is 1 us above. And the 
> test expects all them to fail (to time out). But the time out does not 
> always happen in 6.7 (it's racy, so the failure is semi-random: like 1 
> of 1000 attempts is bad).

This is not precise as I misinterpreted. The test is: either it succeeds 
or times out.

But since the commit, futex() yields 22/EINVAL, i.e. fails.

> If I revert this patch (commit fbeb558b0dd0d), the test works.
> 
> I know, the test could be broken too, but I have no idea, really. The 
> testsuite is sort of hairy and I could not come up with a simple repro.
> 
> Note APR sets up PTHREAD_PROCESS_SHARED, _ROBUST, and _PRIO_INHERIT 
> attrs for the mutex.
> 
> Anyway:
> downstream report: https://bugzilla.suse.com/show_bug.cgi?id=1218801
> APR report: https://bz.apache.org/bugzilla/show_bug.cgi?id=68481
> 
> Any idea if this patch should cause the above (or even is a desired 
> behavior)?
> 
> Thanks.



-- 
js


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock.
  2024-01-15 11:52           ` Jiri Slaby
@ 2024-01-15 12:16             ` Sebastian Andrzej Siewior
  2024-01-15 12:54             ` Jiri Slaby
  1 sibling, 0 replies; 36+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-01-15 12:16 UTC (permalink / raw)
  To: Jiri Slaby
  Cc: Peter Zijlstra, Thomas Gleixner, linux-kernel, boqun.feng,
	bristot, bsegall, dietmar.eggemann, jstultz, juri.lelli, longman,
	mgorman, mingo, rostedt, swood, vincent.guittot, vschneid, will

On 2024-01-15 12:52:49 [+0100], Jiri Slaby wrote:
…
> > 
> > The child in fact terminates on
> > https://github.com/apache/apr/blob/trunk/test/testprocmutex.c#L93:
> >                  while ((rv = apr_proc_mutex_timedlock(proc_lock, 1))) {
> >                      if (!APR_STATUS_IS_TIMEUP(rv))
> >                          exit(1); <----- here
> > 
> > The test creates 6 children and does some
> > pthread_mutex_timedlock/unlock() repeatedly (200 times) in parallel
> > while sleeping 1 us inside the lock. The timeout is 1 us above. And the
> > test expects all them to fail (to time out). But the time out does not
> > always happen in 6.7 (it's racy, so the failure is semi-random: like 1
> > of 1000 attempts is bad).
> 
> This is not precise as I misinterpreted. The test is: either it succeeds or
> times out.
> 
> But since the commit, futex() yields 22/EINVAL, i.e. fails.

Let me see if I can reproduce it here…

Sebastian

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock.
  2024-01-15 11:52           ` Jiri Slaby
  2024-01-15 12:16             ` Sebastian Andrzej Siewior
@ 2024-01-15 12:54             ` Jiri Slaby
  2024-01-15 15:32               ` Yann Ylavic
  2024-01-15 18:33               ` Sebastian Andrzej Siewior
  1 sibling, 2 replies; 36+ messages in thread
From: Jiri Slaby @ 2024-01-15 12:54 UTC (permalink / raw)
  To: Peter Zijlstra, Thomas Gleixner
  Cc: Sebastian Andrzej Siewior, linux-kernel, boqun.feng, bristot,
	bsegall, dietmar.eggemann, jstultz, juri.lelli, longman, mgorman,
	mingo, rostedt, swood, vincent.guittot, vschneid, will

[-- Attachment #1: Type: text/plain, Size: 1789 bytes --]

On 15. 01. 24, 12:52, Jiri Slaby wrote:
> On 15. 01. 24, 12:40, Jiri Slaby wrote:
>> On 15. 09. 23, 17:19, Peter Zijlstra wrote:
>>> On Fri, Sep 15, 2023 at 02:58:35PM +0200, Thomas Gleixner wrote:
>>>
>>>> I spent quite some time to convince myself that this is correct. I was
>>>> not able to poke a hole into it. So that really should be safe to
>>>> do. Famous last words ...
>>>
>>> IKR :-/
>>>
>>> Something like so then...
>>>
>>> ---
>>> Subject: futex/pi: Fix recursive rt_mutex waiter state
>>
>> So this breaks some random test in APR:
>>
>>  From 
>> https://build.opensuse.org/package/live_build_log/openSUSE:Factory:Staging:G/apr/standard/x86_64:
>> testprocmutex       :  Line 122: child did not terminate with success
>>
>> The child in fact terminates on 
>> https://github.com/apache/apr/blob/trunk/test/testprocmutex.c#L93:
>>                  while ((rv = apr_proc_mutex_timedlock(proc_lock, 1))) {
>>                      if (!APR_STATUS_IS_TIMEUP(rv))
>>                          exit(1); <----- here
>>
>> The test creates 6 children and does some 
>> pthread_mutex_timedlock/unlock() repeatedly (200 times) in parallel 
>> while sleeping 1 us inside the lock. The timeout is 1 us above. And 
>> the test expects all them to fail (to time out). But the time out does 
>> not always happen in 6.7 (it's racy, so the failure is semi-random: 
>> like 1 of 1000 attempts is bad).
> 
> This is not precise as I misinterpreted. The test is: either it succeeds 
> or times out.
> 
> But since the commit, futex() yields 22/EINVAL, i.e. fails.

A simplified reproducer attached (in particular, no APR anymore). Build 
with -pthread, obviously. If you see
BADx rv=22

that's bad.

regards,
-- 
js
suse labs

[-- Attachment #2: pokus2.c --]
[-- Type: text/x-csrc, Size: 1744 bytes --]

#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#include <sys/mman.h>
#include <sys/time.h>
#include <sys/wait.h>

#define MAX_WAIT_USEC (1000*1000)
#define CHILDREN 16
#define MAX_ITER 200

#define NS_PER_S 1000000000

static pthread_mutex_t *proc_lock;

static void child()
{
        int rv, i = 0;

        do {
                int wait_usec = 0;
		struct timespec abstime;

		clock_gettime(CLOCK_REALTIME, &abstime);

		abstime.tv_nsec += 1000;
		if (abstime.tv_nsec >= NS_PER_S) {
			abstime.tv_sec++;
			abstime.tv_nsec -= NS_PER_S;
		}

                while ((rv = pthread_mutex_timedlock(proc_lock, &abstime))) {
                    if (rv != ETIMEDOUT) {
			    fprintf(stderr, "BADx rv=%d\n", rv);
			    abort();
		    }
                    if (++wait_usec >= MAX_WAIT_USEC)
                        abort();
                }
		//fprintf(stderr, "[%d] rv=%d\n", getpid(), rv);

            i++;
	    usleep(1);
            if (pthread_mutex_unlock(proc_lock))
                abort();
        } while (i < MAX_ITER);

	exit(0);
}

int main(int argc, char **argv)
{
	proc_lock = mmap(NULL, sizeof(*proc_lock),
			 PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_SHARED,
			 -1, 0);

	pthread_mutexattr_t mattr;

	pthread_mutexattr_init(&mattr);
	pthread_mutexattr_setpshared(&mattr, PTHREAD_PROCESS_SHARED);
	pthread_mutexattr_setrobust(&mattr, PTHREAD_MUTEX_ROBUST);
	pthread_mutexattr_setprotocol(&mattr, PTHREAD_PRIO_INHERIT);

	pthread_mutex_init(proc_lock, &mattr);

	pthread_mutexattr_destroy(&mattr);

	for (unsigned a = 0; a < CHILDREN; a++)
		if (!fork())
			child();

	for (unsigned a = 0; a < CHILDREN; a++)
		wait(NULL);

	pthread_mutex_destroy(proc_lock);


	return 0;
}


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock.
  2024-01-15 12:54             ` Jiri Slaby
@ 2024-01-15 15:32               ` Yann Ylavic
  2024-01-15 18:33               ` Sebastian Andrzej Siewior
  1 sibling, 0 replies; 36+ messages in thread
From: Yann Ylavic @ 2024-01-15 15:32 UTC (permalink / raw)
  To: Jiri Slaby
  Cc: Peter Zijlstra, Thomas Gleixner, Sebastian Andrzej Siewior,
	linux-kernel, boqun.feng, bristot, bsegall, dietmar.eggemann,
	jstultz, juri.lelli, longman, mgorman, mingo, rostedt, swood,
	vincent.guittot, vschneid, will

[-- Attachment #1: Type: text/plain, Size: 432 bytes --]

On Mon, Jan 15, 2024 at 1:55 PM Jiri Slaby <jirislaby@kernel.org> wrote:
>
> A simplified reproducer attached (in particular, no APR anymore). Build
> with -pthread, obviously.

I think "abstime" should be updated in the pthread_mutex_timedlock()
loop or it might get in the past after the first ETIMEDOUT?
Maybe something like the attached pokus3.c reproducer is more in line
with the original code.


Regards;
Yann.

[-- Attachment #2: pokus3.c --]
[-- Type: text/x-csrc, Size: 1823 bytes --]

#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#include <sys/mman.h>
#include <sys/time.h>
#include <sys/wait.h>

#define MAX_WAIT_USEC (1000*1000)
#define CHILDREN 16
#define MAX_ITER 200

#define NS_PER_S 1000000000

static pthread_mutex_t *proc_lock;

static void child()
{
        int rv, i = 0;

        do {
                int wait_usec = 0;
		struct timespec abstime;

                for (;;) {
		    clock_gettime(CLOCK_REALTIME, &abstime);

		    abstime.tv_nsec += 1000;
		    if (abstime.tv_nsec >= NS_PER_S) {
			abstime.tv_sec++;
			abstime.tv_nsec -= NS_PER_S;
		    }

                    rv = pthread_mutex_timedlock(proc_lock, &abstime);
                    if (rv == 0)
			break;
                    if (rv != ETIMEDOUT) {
			    fprintf(stderr, "BADx rv=%d\n", rv);
			    abort();
		    }
                    if (++wait_usec >= MAX_WAIT_USEC)
                        abort();
                }
		//fprintf(stderr, "[%d] rv=%d\n", getpid(), rv);

            i++;
	    usleep(1);
            if (pthread_mutex_unlock(proc_lock))
                abort();
        } while (i < MAX_ITER);

	exit(0);
}

int main(int argc, char **argv)
{
	proc_lock = mmap(NULL, sizeof(*proc_lock),
			 PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_SHARED,
			 -1, 0);

	pthread_mutexattr_t mattr;

	pthread_mutexattr_init(&mattr);
	pthread_mutexattr_setpshared(&mattr, PTHREAD_PROCESS_SHARED);
	pthread_mutexattr_setrobust(&mattr, PTHREAD_MUTEX_ROBUST);
	pthread_mutexattr_setprotocol(&mattr, PTHREAD_PRIO_INHERIT);

	pthread_mutex_init(proc_lock, &mattr);

	pthread_mutexattr_destroy(&mattr);

	for (unsigned a = 0; a < CHILDREN; a++)
		if (!fork())
			child();

	for (unsigned a = 0; a < CHILDREN; a++)
		wait(NULL);

	pthread_mutex_destroy(proc_lock);


	return 0;
}


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock.
  2024-01-15 12:54             ` Jiri Slaby
  2024-01-15 15:32               ` Yann Ylavic
@ 2024-01-15 18:33               ` Sebastian Andrzej Siewior
  2024-01-16  7:07                 ` Jiri Slaby
  1 sibling, 1 reply; 36+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-01-15 18:33 UTC (permalink / raw)
  To: Jiri Slaby
  Cc: Peter Zijlstra, Thomas Gleixner, linux-kernel, boqun.feng,
	bristot, bsegall, dietmar.eggemann, jstultz, juri.lelli, longman,
	mgorman, mingo, rostedt, swood, vincent.guittot, vschneid, will

On 2024-01-15 13:54:32 [+0100], Jiri Slaby wrote:
> A simplified reproducer attached (in particular, no APR anymore). Build with
> -pthread, obviously. If you see
> BADx rv=22
> 
> that's bad.

So task1 owns the futex, task2 does futex_lock_pi(). task2 sets up the pi
state and everything for task1. Then task2 times out on the lock and
no longer wants it. And we break the user vs kernel state via:


task1			task2

futex_unlock_pi();	futex_lock_pi();
			rt_mutex_wait_proxy_lock() // ETIMEDOUT
spin_lock(&hb->lock);
			rt_mutex_cleanup_proxy_lock()
			/* now the lock is still owned by task1, and
			 * task2 removed itself as the waiter but its
			 * futex_q is still queued
			 */
			spin_lock(&hb->lock); /* block */

top_waiter = futex_top_waiter(hb, &key);
/* top_wait is task2's */
rt_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
/* rt_waiter is NULL and the 
   futex is unlocked in userland via uaddr 
*/

and now

task 3			task4
locks in userland
			futex_lock_pi();
			futex_lock_pi_atomic();
			   -EINVAL = attach_to_pi_state()
			   /*
			    * because pi_state says owner
			    * is task1 but uaddr says task3.
			    */


This is due to the new lock ordering and the guarantees we no longer
have since the cited commit. The pi-state is cleaned up/removed by the
last one that wants the lock, so in the unlock path there is either a
pi-state with a waiter or nothing.

This duct tape at the end waits until the pi-state goes away or we get
a waiter. So this works, but it is not a fix.

diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
index 90e5197f4e56..f504ae864cc9 100644
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -1182,6 +1182,9 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
 		rt_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
 		if (!rt_waiter) {
 			raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
+			spin_unlock(&hb->lock);
+			cpu_relax();
+			goto retry;
 			goto do_uncontended;
 		}
 
-- 
2.43.0

> regards,

Sebastian

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock.
  2024-01-15 18:33               ` Sebastian Andrzej Siewior
@ 2024-01-16  7:07                 ` Jiri Slaby
  0 siblings, 0 replies; 36+ messages in thread
From: Jiri Slaby @ 2024-01-16  7:07 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Peter Zijlstra, Thomas Gleixner, linux-kernel, boqun.feng,
	bristot, bsegall, dietmar.eggemann, jstultz, juri.lelli, longman,
	mgorman, mingo, rostedt, swood, vincent.guittot, vschneid, will

On 15. 01. 24, 19:33, Sebastian Andrzej Siewior wrote:
> This duct tape at the end waits until the pi-state leaves or we get a
> waiter. So this works but is not a fix.

FWIW, it indeed avoids the problem.

> --- a/kernel/futex/pi.c
> +++ b/kernel/futex/pi.c
> @@ -1182,6 +1182,9 @@ int futex_unlock_pi(u32 __user *uaddr, unsigned int flags)
>   		rt_waiter = rt_mutex_top_waiter(&pi_state->pi_mutex);
>   		if (!rt_waiter) {
>   			raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
> +			spin_unlock(&hb->lock);
> +			cpu_relax();
> +			goto retry;
>   			goto do_uncontended;
>   		}

thanks,
-- 
js
suse labs


^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2024-01-16  7:07 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-08 16:22 [PATCH v3 0/7] locking/rtmutex: Avoid PI state recursion through sched_submit_work() Sebastian Andrzej Siewior
2023-09-08 16:22 ` [PATCH v3 1/7] sched: Constrain locks in sched_submit_work() Sebastian Andrzej Siewior
2023-09-20  7:36   ` [tip: locking/core] " tip-bot2 for Peter Zijlstra
2023-09-08 16:22 ` [PATCH v3 2/7] locking/rtmutex: Avoid unconditional slowpath for DEBUG_RT_MUTEXES Sebastian Andrzej Siewior
2023-09-20  7:36   ` [tip: locking/core] " tip-bot2 for Sebastian Andrzej Siewior
2023-09-08 16:22 ` [PATCH v3 3/7] sched: Extract __schedule_loop() Sebastian Andrzej Siewior
2023-09-20  7:36   ` [tip: locking/core] " tip-bot2 for Thomas Gleixner
2023-09-08 16:22 ` [PATCH v3 4/7] sched: Provide rt_mutex specific scheduler helpers Sebastian Andrzej Siewior
2023-09-20  7:36   ` [tip: locking/core] " tip-bot2 for Peter Zijlstra
2023-09-08 16:22 ` [PATCH v3 5/7] locking/rtmutex: Use " Sebastian Andrzej Siewior
2023-09-20  7:36   ` [tip: locking/core] " tip-bot2 for Sebastian Andrzej Siewior
2023-09-08 16:22 ` [PATCH v3 6/7] locking/rtmutex: Add a lockdep assert to catch potential nested blocking Sebastian Andrzej Siewior
2023-09-20  7:36   ` [tip: locking/core] " tip-bot2 for Thomas Gleixner
2023-09-08 16:22 ` [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock Sebastian Andrzej Siewior
2023-09-09 18:29   ` Peter Zijlstra
2023-09-11 14:11   ` Peter Zijlstra
2023-09-12 10:57     ` Peter Zijlstra
2023-09-12 11:26       ` Sebastian Andrzej Siewior
2023-09-12 11:17     ` Sebastian Andrzej Siewior
2023-09-12 11:24       ` Peter Zijlstra
2023-09-12 11:25       ` Sebastian Andrzej Siewior
2023-09-12 11:25       ` Peter Zijlstra
2023-09-12 11:28         ` Sebastian Andrzej Siewior
2023-09-15 12:58     ` Thomas Gleixner
2023-09-15 13:15       ` Sebastian Andrzej Siewior
2023-09-15 15:19       ` Peter Zijlstra
2023-09-15 18:59         ` Thomas Gleixner
2023-09-19 11:03         ` Sebastian Andrzej Siewior
2023-09-20  7:36         ` [tip: locking/core] futex/pi: Fix recursive rt_mutex waiter state tip-bot2 for Peter Zijlstra
2024-01-15 11:40         ` [PATCH v3 7/7] locking/rtmutex: Acquire the hb lock via trylock after wait-proxylock Jiri Slaby
2024-01-15 11:52           ` Jiri Slaby
2024-01-15 12:16             ` Sebastian Andrzej Siewior
2024-01-15 12:54             ` Jiri Slaby
2024-01-15 15:32               ` Yann Ylavic
2024-01-15 18:33               ` Sebastian Andrzej Siewior
2024-01-16  7:07                 ` Jiri Slaby

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).