public inbox for linux-kernel@vger.kernel.org
* [PATCH 0/6] locking/rtmutex: Avoid PI state recursion through sched_submit_work()
@ 2023-08-15 11:01 Peter Zijlstra
  2023-08-15 11:01 ` [PATCH 1/6] sched: Constrain locks in sched_submit_work() Peter Zijlstra
                   ` (6 more replies)
  0 siblings, 7 replies; 25+ messages in thread
From: Peter Zijlstra @ 2023-08-15 11:01 UTC (permalink / raw)
  To: bigeasy, tglx
  Cc: linux-kernel, peterz, bsegall, boqun.feng, swood, bristot,
	dietmar.eggemann, mingo, jstultz, juri.lelli, mgorman, rostedt,
	vschneid, vincent.guittot, longman, will

Hi!

These are basically the 'same' patches as sent earlier by Sebastian here:

  https://lkml.kernel.org/r/20230427111937.2745231-1-bigeasy@linutronix.de

I spent a number of days trying to invert rtmutex, only to make a giant mess of
things, and finally conceded that this is the least horrible approach.

There's a bunch of naming differences and I added some asserts that should
hopefully keep things from going sideways without notice. I've also updated
the changelogs to highlight the actual problem. The whole pi_blocked_on
'corruption' is a mere consequence of the more fundamental problem that the
whole PI state recurses.

I've not tested this with the rest of the RT patches applied on top, so only
very limited testing has happened.

If anybody could please confirm stuff still works as advertised, I'll go queue
this mess and we can hopefully forget all about it.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 1/6] sched: Constrain locks in sched_submit_work()
  2023-08-15 11:01 [PATCH 0/6] locking/rtmutex: Avoid PI state recursion through sched_submit_work() Peter Zijlstra
@ 2023-08-15 11:01 ` Peter Zijlstra
  2023-08-15 11:01 ` [PATCH 2/6] locking/rtmutex: Avoid unconditional slowpath for DEBUG_RT_MUTEXES Peter Zijlstra
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2023-08-15 11:01 UTC (permalink / raw)
  To: bigeasy, tglx
  Cc: linux-kernel, peterz, bsegall, boqun.feng, swood, bristot,
	dietmar.eggemann, mingo, jstultz, juri.lelli, mgorman, rostedt,
	vschneid, vincent.guittot, longman, will

Even though sched_submit_work() is run from preemptible context, it is
discouraged from using blocking locks due to the recursion potential.

Enforce this.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c |    9 +++++++++
 1 file changed, 9 insertions(+)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6737,11 +6737,18 @@ void __noreturn do_task_dead(void)
 
 static inline void sched_submit_work(struct task_struct *tsk)
 {
+	static DEFINE_WAIT_OVERRIDE_MAP(sched_map, LD_WAIT_CONFIG);
 	unsigned int task_flags;
 
 	if (task_is_running(tsk))
 		return;
 
+	/*
+	 * Establish LD_WAIT_CONFIG context to ensure none of the code called
+	 * will use a blocking primitive -- which would lead to recursion.
+	 */
+	lock_map_acquire_try(&sched_map);
+
 	task_flags = tsk->flags;
 	/*
 	 * If a worker goes to sleep, notify and ask workqueue whether it
@@ -6766,6 +6773,8 @@ static inline void sched_submit_work(str
 	 * make sure to submit it to avoid deadlocks.
 	 */
 	blk_flush_plug(tsk->plug, true);
+
+	lock_map_release(&sched_map);
 }
 
 static void sched_update_worker(struct task_struct *tsk)




* [PATCH 2/6] locking/rtmutex: Avoid unconditional slowpath for DEBUG_RT_MUTEXES
  2023-08-15 11:01 [PATCH 0/6] locking/rtmutex: Avoid PI state recursion through sched_submit_work() Peter Zijlstra
  2023-08-15 11:01 ` [PATCH 1/6] sched: Constrain locks in sched_submit_work() Peter Zijlstra
@ 2023-08-15 11:01 ` Peter Zijlstra
  2023-08-15 11:01 ` [PATCH 3/6] sched: Extract __schedule_loop() Peter Zijlstra
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2023-08-15 11:01 UTC (permalink / raw)
  To: bigeasy, tglx
  Cc: linux-kernel, peterz, bsegall, boqun.feng, swood, bristot,
	dietmar.eggemann, mingo, jstultz, juri.lelli, mgorman, rostedt,
	vschneid, vincent.guittot, longman, will

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

With DEBUG_RT_MUTEXES enabled the fast-path rt_mutex_cmpxchg_acquire()
always fails and all lock operations take the slow path.

Provide a new inline helper, rt_mutex_try_acquire(), which maps to
rt_mutex_cmpxchg_acquire() in the non-debug case. In the debug case
it invokes rt_mutex_slowtrylock(), which can acquire a non-contended
rtmutex under full debug coverage.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230427111937.2745231-4-bigeasy@linutronix.de
---
 kernel/locking/rtmutex.c     |   21 ++++++++++++++++++++-
 kernel/locking/ww_rt_mutex.c |    2 +-
 2 files changed, 21 insertions(+), 2 deletions(-)

--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -218,6 +218,11 @@ static __always_inline bool rt_mutex_cmp
 	return try_cmpxchg_acquire(&lock->owner, &old, new);
 }
 
+static __always_inline bool rt_mutex_try_acquire(struct rt_mutex_base *lock)
+{
+	return rt_mutex_cmpxchg_acquire(lock, NULL, current);
+}
+
 static __always_inline bool rt_mutex_cmpxchg_release(struct rt_mutex_base *lock,
 						     struct task_struct *old,
 						     struct task_struct *new)
@@ -297,6 +302,20 @@ static __always_inline bool rt_mutex_cmp
 
 }
 
+static int __sched rt_mutex_slowtrylock(struct rt_mutex_base *lock);
+
+static __always_inline bool rt_mutex_try_acquire(struct rt_mutex_base *lock)
+{
+	/*
+	 * With debug enabled the rt_mutex_cmpxchg_acquire() based trylock
+	 * will always fail.
+	 *
+	 * Avoid unconditionally taking the slow path by using
+	 * rt_mutex_slowtrylock() which is covered by the debug code and can
+	 * acquire a non-contended rtmutex.
+	 */
+	return rt_mutex_slowtrylock(lock);
+}
+
 static __always_inline bool rt_mutex_cmpxchg_release(struct rt_mutex_base *lock,
 						     struct task_struct *old,
 						     struct task_struct *new)
@@ -1755,7 +1774,7 @@ static int __sched rt_mutex_slowlock(str
 static __always_inline int __rt_mutex_lock(struct rt_mutex_base *lock,
 					   unsigned int state)
 {
-	if (likely(rt_mutex_cmpxchg_acquire(lock, NULL, current)))
+	if (likely(rt_mutex_try_acquire(lock)))
 		return 0;
 
 	return rt_mutex_slowlock(lock, NULL, state);
--- a/kernel/locking/ww_rt_mutex.c
+++ b/kernel/locking/ww_rt_mutex.c
@@ -62,7 +62,7 @@ __ww_rt_mutex_lock(struct ww_mutex *lock
 	}
 	mutex_acquire_nest(&rtm->dep_map, 0, 0, nest_lock, ip);
 
-	if (likely(rt_mutex_cmpxchg_acquire(&rtm->rtmutex, NULL, current))) {
+	if (likely(rt_mutex_try_acquire(&rtm->rtmutex))) {
 		if (ww_ctx)
 			ww_mutex_set_context_fastpath(lock, ww_ctx);
 		return 0;




* [PATCH 3/6] sched: Extract __schedule_loop()
  2023-08-15 11:01 [PATCH 0/6] locking/rtmutex: Avoid PI state recursion through sched_submit_work() Peter Zijlstra
  2023-08-15 11:01 ` [PATCH 1/6] sched: Constrain locks in sched_submit_work() Peter Zijlstra
  2023-08-15 11:01 ` [PATCH 2/6] locking/rtmutex: Avoid unconditional slowpath for DEBUG_RT_MUTEXES Peter Zijlstra
@ 2023-08-15 11:01 ` Peter Zijlstra
  2023-08-15 22:33   ` Phil Auld
  2023-08-15 11:01 ` [PATCH 4/6] sched: Provide rt_mutex specific scheduler helpers Peter Zijlstra
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2023-08-15 11:01 UTC (permalink / raw)
  To: bigeasy, tglx
  Cc: linux-kernel, peterz, bsegall, boqun.feng, swood, bristot,
	dietmar.eggemann, mingo, jstultz, juri.lelli, mgorman, rostedt,
	vschneid, vincent.guittot, longman, will

From: Thomas Gleixner <tglx@linutronix.de>

There are currently two implementations of this basic __schedule()
loop, and there is soon to be a third.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230427111937.2745231-2-bigeasy@linutronix.de
---
 kernel/sched/core.c |   21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6787,16 +6787,21 @@ static void sched_update_worker(struct t
 	}
 }
 
-asmlinkage __visible void __sched schedule(void)
+static __always_inline void __schedule_loop(unsigned int sched_mode)
 {
-	struct task_struct *tsk = current;
-
-	sched_submit_work(tsk);
 	do {
 		preempt_disable();
-		__schedule(SM_NONE);
+		__schedule(sched_mode);
 		sched_preempt_enable_no_resched();
 	} while (need_resched());
+}
+
+asmlinkage __visible void __sched schedule(void)
+{
+	struct task_struct *tsk = current;
+
+	sched_submit_work(tsk);
+	__schedule_loop(SM_NONE);
 	sched_update_worker(tsk);
 }
 EXPORT_SYMBOL(schedule);
@@ -6860,11 +6865,7 @@ void __sched schedule_preempt_disabled(v
 #ifdef CONFIG_PREEMPT_RT
 void __sched notrace schedule_rtlock(void)
 {
-	do {
-		preempt_disable();
-		__schedule(SM_RTLOCK_WAIT);
-		sched_preempt_enable_no_resched();
-	} while (need_resched());
+	__schedule_loop(SM_RTLOCK_WAIT);
 }
 NOKPROBE_SYMBOL(schedule_rtlock);
 #endif




* [PATCH 4/6] sched: Provide rt_mutex specific scheduler helpers
  2023-08-15 11:01 [PATCH 0/6] locking/rtmutex: Avoid PI state recursion through sched_submit_work() Peter Zijlstra
                   ` (2 preceding siblings ...)
  2023-08-15 11:01 ` [PATCH 3/6] sched: Extract __schedule_loop() Peter Zijlstra
@ 2023-08-15 11:01 ` Peter Zijlstra
  2023-08-15 11:01 ` [PATCH 5/6] locking/rtmutex: Use " Peter Zijlstra
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2023-08-15 11:01 UTC (permalink / raw)
  To: bigeasy, tglx
  Cc: linux-kernel, peterz, bsegall, boqun.feng, swood, bristot,
	dietmar.eggemann, mingo, jstultz, juri.lelli, mgorman, rostedt,
	vschneid, vincent.guittot, longman, will

With PREEMPT_RT there is an rt_mutex recursion problem where
sched_submit_work() can use an rtlock (aka spinlock_t). More
specifically, what happens is:

  mutex_lock() /* really rt_mutex */
    ...
      __rt_mutex_slowlock_locked()
	task_blocks_on_rt_mutex()
          // enqueue current task as waiter
          // do PI chain walk
        rt_mutex_slowlock_block()
          schedule()
            sched_submit_work()
              ...
              spin_lock() /* really rtlock */
                ...
                  __rt_mutex_slowlock_locked()
                    task_blocks_on_rt_mutex()
                      // enqueue current task as waiter *AGAIN*
                      // *CONFUSION*

Fix this by making rt_mutex do the sched_submit_work() early, before
it enqueues itself as a waiter -- before it even knows *if* it will
wait.

[[ basically Thomas' patch but with different naming and a few asserts
   added ]]

Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h    |    3 +++
 include/linux/sched/rt.h |    4 ++++
 kernel/sched/core.c      |   36 ++++++++++++++++++++++++++++++++----
 3 files changed, 39 insertions(+), 4 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -906,6 +906,9 @@ struct task_struct {
 	 * ->sched_remote_wakeup gets used, so it can be in this word.
 	 */
 	unsigned			sched_remote_wakeup:1;
+#ifdef CONFIG_RT_MUTEXES
+	unsigned			sched_rt_mutex:1;
+#endif
 
 	/* Bit to tell LSMs we're in execve(): */
 	unsigned			in_execve:1;
--- a/include/linux/sched/rt.h
+++ b/include/linux/sched/rt.h
@@ -30,6 +30,10 @@ static inline bool task_is_realtime(stru
 }
 
 #ifdef CONFIG_RT_MUTEXES
+extern void rt_mutex_pre_schedule(void);
+extern void rt_mutex_schedule(void);
+extern void rt_mutex_post_schedule(void);
+
 /*
  * Must hold either p->pi_lock or task_rq(p)->lock.
  */
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6738,9 +6738,6 @@ static inline void sched_submit_work(str
 	static DEFINE_WAIT_OVERRIDE_MAP(sched_map, LD_WAIT_CONFIG);
 	unsigned int task_flags;
 
-	if (task_is_running(tsk))
-		return;
-
 	/*
 	 * Establish LD_WAIT_CONFIG context to ensure none of the code called
 	 * will use a blocking primitive -- which would lead to recursion.
@@ -6798,7 +6795,12 @@ asmlinkage __visible void __sched schedu
 {
 	struct task_struct *tsk = current;
 
-	sched_submit_work(tsk);
+#ifdef CONFIG_RT_MUTEXES
+	lockdep_assert(!tsk->sched_rt_mutex);
+#endif
+
+	if (!task_is_running(tsk))
+		sched_submit_work(tsk);
 	__schedule_loop(SM_NONE);
 	sched_update_worker(tsk);
 }
@@ -7059,6 +7061,32 @@ static void __setscheduler_prio(struct t
 
 #ifdef CONFIG_RT_MUTEXES
 
+/*
+ * Would be more useful with typeof()/auto_type but they don't mix with
+ * bit-fields. Since it's a local thing, use int. Keep the generic sounding
+ * name such that if someone were to implement this function we get to compare
+ * notes.
+ */
+#define fetch_and_set(x, v) ({ int _x = (x); (x) = (v); _x; })
+
+void rt_mutex_pre_schedule(void)
+{
+	lockdep_assert(!fetch_and_set(current->sched_rt_mutex, 1));
+	sched_submit_work(current);
+}
+
+void rt_mutex_schedule(void)
+{
+	lockdep_assert(current->sched_rt_mutex);
+	__schedule_loop(SM_NONE);
+}
+
+void rt_mutex_post_schedule(void)
+{
+	sched_update_worker(current);
+	lockdep_assert(fetch_and_set(current->sched_rt_mutex, 0));
+}
+
 static inline int __rt_effective_prio(struct task_struct *pi_task, int prio)
 {
 	if (pi_task)




* [PATCH 5/6] locking/rtmutex: Use rt_mutex specific scheduler helpers
  2023-08-15 11:01 [PATCH 0/6] locking/rtmutex: Avoid PI state recursion through sched_submit_work() Peter Zijlstra
                   ` (3 preceding siblings ...)
  2023-08-15 11:01 ` [PATCH 4/6] sched: Provide rt_mutex specific scheduler helpers Peter Zijlstra
@ 2023-08-15 11:01 ` Peter Zijlstra
  2023-08-15 11:01 ` [PATCH 6/6] locking/rtmutex: Add a lockdep assert to catch potential nested blocking Peter Zijlstra
  2023-08-15 16:15 ` [PATCH 0/6] locking/rtmutex: Avoid PI state recursion through sched_submit_work() Peter Zijlstra
  6 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2023-08-15 11:01 UTC (permalink / raw)
  To: bigeasy, tglx
  Cc: linux-kernel, peterz, bsegall, boqun.feng, swood, bristot,
	dietmar.eggemann, mingo, jstultz, juri.lelli, mgorman, rostedt,
	vschneid, vincent.guittot, longman, will

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Have rt_mutex use the rt_mutex specific scheduler helpers to avoid
recursion vs rtlock on the PI state.

[[ peterz: adapted to new names ]]

Reported-by: Crystal Wood <swood@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/locking/rtmutex.c     |   14 ++++++++++++--
 kernel/locking/rwbase_rt.c   |    2 ++
 kernel/locking/rwsem.c       |    8 +++++++-
 kernel/locking/spinlock_rt.c |    4 ++++
 4 files changed, 25 insertions(+), 3 deletions(-)

--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1636,7 +1636,7 @@ static int __sched rt_mutex_slowlock_blo
 		raw_spin_unlock_irq(&lock->wait_lock);
 
 		if (!owner || !rtmutex_spin_on_owner(lock, waiter, owner))
-			schedule();
+			rt_mutex_schedule();
 
 		raw_spin_lock_irq(&lock->wait_lock);
 		set_current_state(state);
@@ -1665,7 +1665,7 @@ static void __sched rt_mutex_handle_dead
 	WARN(1, "rtmutex deadlock detected\n");
 	while (1) {
 		set_current_state(TASK_INTERRUPTIBLE);
-		schedule();
+		rt_mutex_schedule();
 	}
 }
 
@@ -1761,6 +1761,15 @@ static int __sched rt_mutex_slowlock(str
 	int ret;
 
 	/*
+	 * Do all pre-schedule work here, before we queue a waiter and invoke
+	 * PI -- any such work that trips on rtlock (PREEMPT_RT spinlock) would
+	 * otherwise recurse back into task_blocks_on_rt_mutex() through
+	 * rtlock_slowlock() and will then enqueue a second waiter for this
+	 * same task and things get really confusing real fast.
+	 */
+	rt_mutex_pre_schedule();
+
+	/*
 	 * Technically we could use raw_spin_[un]lock_irq() here, but this can
 	 * be called in early boot if the cmpxchg() fast path is disabled
 	 * (debug, no architecture support). In this case we will acquire the
@@ -1771,6 +1780,7 @@ static int __sched rt_mutex_slowlock(str
 	raw_spin_lock_irqsave(&lock->wait_lock, flags);
 	ret = __rt_mutex_slowlock_locked(lock, ww_ctx, state);
 	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
+	rt_mutex_post_schedule();
 
 	return ret;
 }
--- a/kernel/locking/rwbase_rt.c
+++ b/kernel/locking/rwbase_rt.c
@@ -71,6 +71,7 @@ static int __sched __rwbase_read_lock(st
 	struct rt_mutex_base *rtm = &rwb->rtmutex;
 	int ret;
 
+	rwbase_pre_schedule();
 	raw_spin_lock_irq(&rtm->wait_lock);
 
 	/*
@@ -125,6 +126,7 @@ static int __sched __rwbase_read_lock(st
 		rwbase_rtmutex_unlock(rtm);
 
 	trace_contention_end(rwb, ret);
+	rwbase_post_schedule();
 	return ret;
 }
 
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1427,8 +1427,14 @@ static inline void __downgrade_write(str
 #define rwbase_signal_pending_state(state, current)	\
 	signal_pending_state(state, current)
 
+#define rwbase_pre_schedule()				\
+	rt_mutex_pre_schedule()
+
 #define rwbase_schedule()				\
-	schedule()
+	rt_mutex_schedule()
+
+#define rwbase_post_schedule()				\
+	rt_mutex_post_schedule()
 
 #include "rwbase_rt.c"
 
--- a/kernel/locking/spinlock_rt.c
+++ b/kernel/locking/spinlock_rt.c
@@ -184,9 +184,13 @@ static __always_inline int  rwbase_rtmut
 
 #define rwbase_signal_pending_state(state, current)	(0)
 
+#define rwbase_pre_schedule()
+
 #define rwbase_schedule()				\
 	schedule_rtlock()
 
+#define rwbase_post_schedule()
+
 #include "rwbase_rt.c"
 /*
  * The common functions which get wrapped into the rwlock API.




* [PATCH 6/6] locking/rtmutex: Add a lockdep assert to catch potential nested blocking
  2023-08-15 11:01 [PATCH 0/6] locking/rtmutex: Avoid PI state recursion through sched_submit_work() Peter Zijlstra
                   ` (4 preceding siblings ...)
  2023-08-15 11:01 ` [PATCH 5/6] locking/rtmutex: Use " Peter Zijlstra
@ 2023-08-15 11:01 ` Peter Zijlstra
  2023-08-15 16:15 ` [PATCH 0/6] locking/rtmutex: Avoid PI state recursion through sched_submit_work() Peter Zijlstra
  6 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2023-08-15 11:01 UTC (permalink / raw)
  To: bigeasy, tglx
  Cc: linux-kernel, peterz, bsegall, boqun.feng, swood, bristot,
	dietmar.eggemann, mingo, jstultz, juri.lelli, mgorman, rostedt,
	vschneid, vincent.guittot, longman, will

From: Thomas Gleixner <tglx@linutronix.de>

There used to be a BUG_ON(current->pi_blocked_on) in the lock acquisition
functions, but that vanished in one of the rtmutex overhauls.

Bring it back in the form of a lockdep assert to catch code paths which take
rtmutex based locks with current::pi_blocked_on != NULL.

Reported-by: Crystal Wood <swood@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230427111937.2745231-5-bigeasy@linutronix.de
---
 kernel/locking/rtmutex.c     |    2 ++
 kernel/locking/rwbase_rt.c   |    2 ++
 kernel/locking/spinlock_rt.c |    2 ++
 3 files changed, 6 insertions(+)

--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1774,6 +1774,8 @@ static int __sched rt_mutex_slowlock(str
 static __always_inline int __rt_mutex_lock(struct rt_mutex_base *lock,
 					   unsigned int state)
 {
+	lockdep_assert(!current->pi_blocked_on);
+
 	if (likely(rt_mutex_try_acquire(lock)))
 		return 0;
 
--- a/kernel/locking/rwbase_rt.c
+++ b/kernel/locking/rwbase_rt.c
@@ -131,6 +131,8 @@ static int __sched __rwbase_read_lock(st
 static __always_inline int rwbase_read_lock(struct rwbase_rt *rwb,
 					    unsigned int state)
 {
+	lockdep_assert(!current->pi_blocked_on);
+
 	if (rwbase_read_trylock(rwb))
 		return 0;
 
--- a/kernel/locking/spinlock_rt.c
+++ b/kernel/locking/spinlock_rt.c
@@ -37,6 +37,8 @@
 
 static __always_inline void rtlock_lock(struct rt_mutex_base *rtm)
 {
+	lockdep_assert(!current->pi_blocked_on);
+
 	if (unlikely(!rt_mutex_cmpxchg_acquire(rtm, NULL, current)))
 		rtlock_slowlock(rtm);
 }




* Re: [PATCH 0/6] locking/rtmutex: Avoid PI state recursion through sched_submit_work()
  2023-08-15 11:01 [PATCH 0/6] locking/rtmutex: Avoid PI state recursion through sched_submit_work() Peter Zijlstra
                   ` (5 preceding siblings ...)
  2023-08-15 11:01 ` [PATCH 6/6] locking/rtmutex: Add a lockdep assert to catch potential nested blocking Peter Zijlstra
@ 2023-08-15 16:15 ` Peter Zijlstra
  2023-08-16  8:58   ` Sebastian Andrzej Siewior
  6 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2023-08-15 16:15 UTC (permalink / raw)
  To: bigeasy, tglx
  Cc: linux-kernel, bsegall, boqun.feng, swood, bristot,
	dietmar.eggemann, mingo, jstultz, juri.lelli, mgorman, rostedt,
	vschneid, vincent.guittot, longman, will

On Tue, Aug 15, 2023 at 01:01:21PM +0200, Peter Zijlstra wrote:
> Hi!
> 
> These are basically the 'same' patches as sent earlier by Sebastian here:
> 
>   https://lkml.kernel.org/r/20230427111937.2745231-1-bigeasy@linutronix.de
> 
> I spent a number of days trying to invert rtmutex, only to make a giant mess of
> things, and finally conceded that this is the least horrible approach.
> 
> There's a bunch of naming differences and I added some asserts that should
> hopefully keep things from going sideways without notice. I've also updated
> the changelogs to highlight the actual problem. The whole pi_blocked_on
> 'corruption' is a mere consequence of the more fundamental problem that the
> whole PI state recurses.
> 
> I've not tested this with the rest of the RT patches applied on top, so only
> very limited testing has happened.
> 
> If anybody could please confirm stuff still works as advertised, I'll go queue
> this mess and we can hopefully forget all about it.
> 

N/m - 0day found a problem. Futex-PI trips the rt_mutex_schedule()
assertion for not passing through rt_mutex_pre_schedule().

I'll go try and untangle that...


* Re: [PATCH 3/6] sched: Extract __schedule_loop()
  2023-08-15 11:01 ` [PATCH 3/6] sched: Extract __schedule_loop() Peter Zijlstra
@ 2023-08-15 22:33   ` Phil Auld
  2023-08-15 22:39     ` Peter Zijlstra
                       ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Phil Auld @ 2023-08-15 22:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: bigeasy, tglx, linux-kernel, bsegall, boqun.feng, swood, bristot,
	dietmar.eggemann, mingo, jstultz, juri.lelli, mgorman, rostedt,
	vschneid, vincent.guittot, longman, will

Hi Peter,

On Tue, Aug 15, 2023 at 01:01:24PM +0200 Peter Zijlstra wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> There are currently two implementations of this basic __schedule()
> loop, and there is soon to be a third.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: https://lkml.kernel.org/r/20230427111937.2745231-2-bigeasy@linutronix.de
> ---
>  kernel/sched/core.c |   21 +++++++++++----------
>  1 file changed, 11 insertions(+), 10 deletions(-)
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6787,16 +6787,21 @@ static void sched_update_worker(struct t
>  	}
>  }
>  
> -asmlinkage __visible void __sched schedule(void)
> +static __always_inline void __schedule_loop(unsigned int sched_mode)

I think this needs __sched or it's the only thing that ever shows up
in wchan. E.g.

  16995 0 bash     S __schedule_loop.constprop.0
  17036 1 kworker/ I __schedule_loop.constprop.0
  17151 1 kworker/ I __schedule_loop.constprop.0
  17235 0 sleep    S __schedule_loop.constprop.0
  17236 4 ps       R -

vs

  10417 1 kworker/ I worker_thread
  10665 1 kworker/ I worker_thread
  10670 4 sshd     D -            
  10674 4 bash     S do_wait      
  10701 0 sleep    S hrtimer_nanosleep
  10702 4 ps       R -

Otherwise, looking forward to getting this and the rest of RT in...

Cheers,
Phil

>  {
> -	struct task_struct *tsk = current;
> -
> -	sched_submit_work(tsk);
>  	do {
>  		preempt_disable();
> -		__schedule(SM_NONE);
> +		__schedule(sched_mode);
>  		sched_preempt_enable_no_resched();
>  	} while (need_resched());
> +}
> +
> +asmlinkage __visible void __sched schedule(void)
> +{
> +	struct task_struct *tsk = current;
> +
> +	sched_submit_work(tsk);
> +	__schedule_loop(SM_NONE);
>  	sched_update_worker(tsk);
>  }
>  EXPORT_SYMBOL(schedule);
> @@ -6860,11 +6865,7 @@ void __sched schedule_preempt_disabled(v
>  #ifdef CONFIG_PREEMPT_RT
>  void __sched notrace schedule_rtlock(void)
>  {
> -	do {
> -		preempt_disable();
> -		__schedule(SM_RTLOCK_WAIT);
> -		sched_preempt_enable_no_resched();
> -	} while (need_resched());
> +	__schedule_loop(SM_RTLOCK_WAIT);
>  }
>  NOKPROBE_SYMBOL(schedule_rtlock);
>  #endif
> 
> 


* Re: [PATCH 3/6] sched: Extract __schedule_loop()
  2023-08-15 22:33   ` Phil Auld
@ 2023-08-15 22:39     ` Peter Zijlstra
  2023-08-16 14:14       ` Phil Auld
  2023-08-15 22:42     ` Phil Auld
  2023-08-16 10:01     ` Sebastian Andrzej Siewior
  2 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2023-08-15 22:39 UTC (permalink / raw)
  To: Phil Auld
  Cc: bigeasy, tglx, linux-kernel, bsegall, boqun.feng, swood, bristot,
	dietmar.eggemann, mingo, jstultz, juri.lelli, mgorman, rostedt,
	vschneid, vincent.guittot, longman, will

On Tue, Aug 15, 2023 at 06:33:01PM -0400, Phil Auld wrote:
> Hi Peter,
> 
> On Tue, Aug 15, 2023 at 01:01:24PM +0200 Peter Zijlstra wrote:
> > From: Thomas Gleixner <tglx@linutronix.de>
> > 
> > There are currently two implementations of this basic __schedule()
> > loop, and there is soon to be a third.
> > 
> > Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > Link: https://lkml.kernel.org/r/20230427111937.2745231-2-bigeasy@linutronix.de
> > ---
> >  kernel/sched/core.c |   21 +++++++++++----------
> >  1 file changed, 11 insertions(+), 10 deletions(-)
> > 
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -6787,16 +6787,21 @@ static void sched_update_worker(struct t
> >  	}
> >  }
> >  
> > -asmlinkage __visible void __sched schedule(void)
> > +static __always_inline void __schedule_loop(unsigned int sched_mode)
> 
> I think this needs __sched or it's the only thing that ever shows up
> in wchan. E.g.
> 
>   16995 0 bash     S __schedule_loop.constprop.0
>   17036 1 kworker/ I __schedule_loop.constprop.0
>   17151 1 kworker/ I __schedule_loop.constprop.0
>   17235 0 sleep    S __schedule_loop.constprop.0
>   17236 4 ps       R -

But but but __always_inline... ?!?


* Re: [PATCH 3/6] sched: Extract __schedule_loop()
  2023-08-15 22:33   ` Phil Auld
  2023-08-15 22:39     ` Peter Zijlstra
@ 2023-08-15 22:42     ` Phil Auld
  2023-08-16 10:01     ` Sebastian Andrzej Siewior
  2 siblings, 0 replies; 25+ messages in thread
From: Phil Auld @ 2023-08-15 22:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: bigeasy, tglx, linux-kernel, bsegall, boqun.feng, swood, bristot,
	dietmar.eggemann, mingo, jstultz, juri.lelli, mgorman, rostedt,
	vschneid, vincent.guittot, longman, will

On Tue, Aug 15, 2023 at 06:33:01PM -0400 Phil Auld wrote:
> Hi Peter,
> 
> On Tue, Aug 15, 2023 at 01:01:24PM +0200 Peter Zijlstra wrote:
> > From: Thomas Gleixner <tglx@linutronix.de>
> > 
> > There are currently two implementations of this basic __schedule()
> > loop, and there is soon to be a third.
> > 
> > Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > Link: https://lkml.kernel.org/r/20230427111937.2745231-2-bigeasy@linutronix.de
> > ---
> >  kernel/sched/core.c |   21 +++++++++++----------
> >  1 file changed, 11 insertions(+), 10 deletions(-)
> > 
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -6787,16 +6787,21 @@ static void sched_update_worker(struct t
> >  	}
> >  }
> >  
> > -asmlinkage __visible void __sched schedule(void)
> > +static __always_inline void __schedule_loop(unsigned int sched_mode)
> 
> I think this needs __sched or it's the only thing that ever shows up
> in wchan. E.g.
>

Using this command:

    ps -Ao pid,f,fname,s,wchan 


>   16995 0 bash     S __schedule_loop.constprop.0
>   17036 1 kworker/ I __schedule_loop.constprop.0
>   17151 1 kworker/ I __schedule_loop.constprop.0
>   17235 0 sleep    S __schedule_loop.constprop.0
>   17236 4 ps       R -
> 
> vs
> 
>   10417 1 kworker/ I worker_thread
>   10665 1 kworker/ I worker_thread
>   10670 4 sshd     D -            
>   10674 4 bash     S do_wait      
>   10701 0 sleep    S hrtimer_nanosleep
>   10702 4 ps       R -
> 
> Otherwise, looking forward to getting this and the rest of RT in...
> 
> Cheers,
> Phil
> 
> >  {
> > -	struct task_struct *tsk = current;
> > -
> > -	sched_submit_work(tsk);
> >  	do {
> >  		preempt_disable();
> > -		__schedule(SM_NONE);
> > +		__schedule(sched_mode);
> >  		sched_preempt_enable_no_resched();
> >  	} while (need_resched());
> > +}
> > +
> > +asmlinkage __visible void __sched schedule(void)
> > +{
> > +	struct task_struct *tsk = current;
> > +
> > +	sched_submit_work(tsk);
> > +	__schedule_loop(SM_NONE);
> >  	sched_update_worker(tsk);
> >  }
> >  EXPORT_SYMBOL(schedule);
> > @@ -6860,11 +6865,7 @@ void __sched schedule_preempt_disabled(v
> >  #ifdef CONFIG_PREEMPT_RT
> >  void __sched notrace schedule_rtlock(void)
> >  {
> > -	do {
> > -		preempt_disable();
> > -		__schedule(SM_RTLOCK_WAIT);
> > -		sched_preempt_enable_no_resched();
> > -	} while (need_resched());
> > +	__schedule_loop(SM_RTLOCK_WAIT);
> >  }
> >  NOKPROBE_SYMBOL(schedule_rtlock);
> >  #endif
> > 
> > 
> 

* Re: [PATCH 0/6] locking/rtmutex: Avoid PI state recursion through sched_submit_work()
  2023-08-15 16:15 ` [PATCH 0/6] locking/rtmutex: Avoid PI state recursion through sched_submit_work() Peter Zijlstra
@ 2023-08-16  8:58   ` Sebastian Andrzej Siewior
  2023-08-16  9:42     ` Peter Zijlstra
  0 siblings, 1 reply; 25+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-08-16  8:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, linux-kernel, bsegall, boqun.feng, swood, bristot,
	dietmar.eggemann, mingo, jstultz, juri.lelli, mgorman, rostedt,
	vschneid, vincent.guittot, longman, will

On 2023-08-15 18:15:57 [+0200], Peter Zijlstra wrote:
> N/m - 0day found a problem. Futex-PI trips the rt_mutex_schedule()
> assertion for not passing through rt_mutex_pre_schedule().
> 
> I'll go try and untangle that...
Is this the same as

| ------------[ cut here ]------------
| WARNING: CPU: 3 PID: 995 at kernel/sched/core.c:7155 rt_mutex_schedule+0x52/0x60
| Modules linked in:
| CPU: 3 PID: 995 Comm: mount Tainted: G        W          6.5.0-rc6-rt2+ #228
| Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
| RIP: 0010:rt_mutex_schedule+0x52/0x60
| Code: 00 00 00 e8 b0 80 ff ff 31 ff e8 a9 99 cb 00 bf 01 00 00 00 e8 3f 6d ff ff 48 8b 03 a9 08 00 08 00 75 db 5b e9 6f 82 cc 00 90 <0f> 0b 90 eb c6 66 0f 1f 84 00 00 0
| RSP: 0018:ffffc90001043e28 EFLAGS: 00010246
| RAX: ffff888107ab8040 RBX: 0000000000000202 RCX: 0000000000000000
| RDX: 0000000000000000 RSI: ffffffff82460bfd RDI: 00000000ffffffff
| RBP: ffffffff827e8fe0 R08: 0000000000000001 R09: 0000000000000001
| R10: 0000000000000000 R11: 0000000000000001 R12: ffffffff827e8fe8
| R13: 0000000000000002 R14: 0000000000000000 R15: ffff888107ab8040
| FS:  00007fb36281e840(0000) GS:ffff88817b4c0000(0000) knlGS:0000000000000000
| CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
| CR2: 00007fb3629702a0 CR3: 0000000108210000 CR4: 00000000003506e0
| Call Trace:
|  <TASK>
|  ? __warn+0x90/0x200
|  ? rt_mutex_schedule+0x52/0x60
|  ? report_bug+0x1c5/0x1f0
|  ? handle_bug+0x3b/0x70
|  ? exc_invalid_op+0x13/0x60
|  ? asm_exc_invalid_op+0x16/0x20
|  ? rt_mutex_schedule+0x52/0x60
|  rwbase_write_lock+0xc1/0x230
|  do_lock_mount+0x4c/0x200
|  ? __x86_return_thunk+0x5/0x10
|  path_mount+0x8a7/0xb00
|  __x64_sys_mount+0xf1/0x140
|  do_syscall_64+0x44/0x90
|  entry_SYSCALL_64_after_hwframe+0x74/0xde

Sebastian

* Re: [PATCH 0/6] locking/rtmutex: Avoid PI state recursion through sched_submit_work()
  2023-08-16  8:58   ` Sebastian Andrzej Siewior
@ 2023-08-16  9:42     ` Peter Zijlstra
  2023-08-16 10:19       ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2023-08-16  9:42 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: tglx, linux-kernel, bsegall, boqun.feng, swood, bristot,
	dietmar.eggemann, mingo, jstultz, juri.lelli, mgorman, rostedt,
	vschneid, vincent.guittot, longman, will

On Wed, Aug 16, 2023 at 10:58:26AM +0200, Sebastian Andrzej Siewior wrote:
> On 2023-08-15 18:15:57 [+0200], Peter Zijlstra wrote:
> > N/m - 0day found a problem. Futex-PI trips the rt_mutex_schedule()
> > assertion for not passing through rt_mutex_pre_schedule().
> > 
> > I'll go try and untangle that...
> Is this the same as

Not the same -- this is namespace_lock(), right? That's a regular rwsem
afaict and that *should* be good. Clearly I messed something up.

Thanks!

> | ------------[ cut here ]------------
> | WARNING: CPU: 3 PID: 995 at kernel/sched/core.c:7155 rt_mutex_schedule+0x52/0x60
> | Modules linked in:
> | CPU: 3 PID: 995 Comm: mount Tainted: G        W          6.5.0-rc6-rt2+ #228
> | Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> | RIP: 0010:rt_mutex_schedule+0x52/0x60
> | Code: 00 00 00 e8 b0 80 ff ff 31 ff e8 a9 99 cb 00 bf 01 00 00 00 e8 3f 6d ff ff 48 8b 03 a9 08 00 08 00 75 db 5b e9 6f 82 cc 00 90 <0f> 0b 90 eb c6 66 0f 1f 84 00 00 0
> | RSP: 0018:ffffc90001043e28 EFLAGS: 00010246
> | RAX: ffff888107ab8040 RBX: 0000000000000202 RCX: 0000000000000000
> | RDX: 0000000000000000 RSI: ffffffff82460bfd RDI: 00000000ffffffff
> | RBP: ffffffff827e8fe0 R08: 0000000000000001 R09: 0000000000000001
> | R10: 0000000000000000 R11: 0000000000000001 R12: ffffffff827e8fe8
> | R13: 0000000000000002 R14: 0000000000000000 R15: ffff888107ab8040
> | FS:  00007fb36281e840(0000) GS:ffff88817b4c0000(0000) knlGS:0000000000000000
> | CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> | CR2: 00007fb3629702a0 CR3: 0000000108210000 CR4: 00000000003506e0
> | Call Trace:
> |  <TASK>
> |  ? __warn+0x90/0x200
> |  ? rt_mutex_schedule+0x52/0x60
> |  ? report_bug+0x1c5/0x1f0
> |  ? handle_bug+0x3b/0x70
> |  ? exc_invalid_op+0x13/0x60
> |  ? asm_exc_invalid_op+0x16/0x20
> |  ? rt_mutex_schedule+0x52/0x60
> |  rwbase_write_lock+0xc1/0x230
> |  do_lock_mount+0x4c/0x200
> |  ? __x86_return_thunk+0x5/0x10
> |  path_mount+0x8a7/0xb00
> |  __x64_sys_mount+0xf1/0x140
> |  do_syscall_64+0x44/0x90
> |  entry_SYSCALL_64_after_hwframe+0x74/0xde
> 
> Sebastian

* Re: [PATCH 3/6] sched: Extract __schedule_loop()
  2023-08-15 22:33   ` Phil Auld
  2023-08-15 22:39     ` Peter Zijlstra
  2023-08-15 22:42     ` Phil Auld
@ 2023-08-16 10:01     ` Sebastian Andrzej Siewior
  2023-08-16 11:39       ` Phil Auld
  2 siblings, 1 reply; 25+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-08-16 10:01 UTC (permalink / raw)
  To: Phil Auld
  Cc: Peter Zijlstra, tglx, linux-kernel, bsegall, boqun.feng, swood,
	bristot, dietmar.eggemann, mingo, jstultz, juri.lelli, mgorman,
	rostedt, vschneid, vincent.guittot, longman, will

On 2023-08-15 18:33:01 [-0400], Phil Auld wrote:
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -6787,16 +6787,21 @@ static void sched_update_worker(struct t
> >  	}
> >  }
> >  
> > -asmlinkage __visible void __sched schedule(void)
> > +static __always_inline void __schedule_loop(unsigned int sched_mode)
> 
> I think this needs __sched or it's the only thing that ever shows up
> in wchan. E.g.
> 
>   16995 0 bash     S __schedule_loop.constprop.0

I don't see __schedule_loop in my RT and !RT build. I tried gcc and
clang.

Sebastian

* Re: [PATCH 0/6] locking/rtmutex: Avoid PI state recursion through sched_submit_work()
  2023-08-16  9:42     ` Peter Zijlstra
@ 2023-08-16 10:19       ` Sebastian Andrzej Siewior
  2023-08-16 13:46         ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 25+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-08-16 10:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, linux-kernel, bsegall, boqun.feng, swood, bristot,
	dietmar.eggemann, mingo, jstultz, juri.lelli, mgorman, rostedt,
	vschneid, vincent.guittot, longman, will

On 2023-08-16 11:42:57 [+0200], Peter Zijlstra wrote:
> Not the same -- this is namespace_lock(), right? That's a regular rwsem
> afaict and that *should* be good. Clearly I messed something up.

Most likely. I also see it from inode_lock(), which does down_write().
I only see it originate from rwbase_write_lock().

> Thanks!

Sebastian

* Re: [PATCH 3/6] sched: Extract __schedule_loop()
  2023-08-16 10:01     ` Sebastian Andrzej Siewior
@ 2023-08-16 11:39       ` Phil Auld
  2023-08-16 12:20         ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 25+ messages in thread
From: Phil Auld @ 2023-08-16 11:39 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Peter Zijlstra, tglx, linux-kernel, bsegall, boqun.feng, swood,
	bristot, dietmar.eggemann, mingo, jstultz, juri.lelli, mgorman,
	rostedt, vschneid, vincent.guittot, longman, will

On Wed, Aug 16, 2023 at 12:01:54PM +0200 Sebastian Andrzej Siewior wrote:
> On 2023-08-15 18:33:01 [-0400], Phil Auld wrote:
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -6787,16 +6787,21 @@ static void sched_update_worker(struct t
> > >  	}
> > >  }
> > >  
> > > -asmlinkage __visible void __sched schedule(void)
> > > +static __always_inline void __schedule_loop(unsigned int sched_mode)
> > 
> > I think this needs __sched or it's the only thing that ever shows up
> > in wchan. E.g.
> > 
> >   16995 0 bash     S __schedule_loop.constprop.0
> 
> I don't see __schedule_loop in my RT and !RT build. I tried gcc and
> clang.
>

I do.  Admittedly I'm not an expert in how the wchan unwinding works but
we have a slightly older version of this patch in our kernel (schedule_loop
not __schedule_loop). When I added __sched it fixed it.   Maybe there
is something else but that seemed pretty obvious. 


/* Attach to any functions which should be ignored in wchan output. */
#define __sched		__section(".sched.text")

I can't explain why you are not seeing it.


Cheers,
Phil


> Sebastian
> 


* Re: [PATCH 3/6] sched: Extract __schedule_loop()
  2023-08-16 11:39       ` Phil Auld
@ 2023-08-16 12:20         ` Sebastian Andrzej Siewior
  2023-08-16 12:48           ` Phil Auld
  0 siblings, 1 reply; 25+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-08-16 12:20 UTC (permalink / raw)
  To: Phil Auld
  Cc: Peter Zijlstra, tglx, linux-kernel, bsegall, boqun.feng, swood,
	bristot, dietmar.eggemann, mingo, jstultz, juri.lelli, mgorman,
	rostedt, vschneid, vincent.guittot, longman, will

On 2023-08-16 07:39:45 [-0400], Phil Auld wrote:
> I do.  Admittedly I'm not an expert in how the wchan unwinding works but
> we have a slightly older version of this patch in our kernel (schedule_loop
> not __schedule_loop). When I added __sched it fixed it.   Maybe there
> is something else but that seemed pretty obvious. 
> 
> 
> /* Attach to any functions which should be ignored in wchan output. */
> #define __sched		__section(".sched.text")
> 
> I can't explain why you are not seeing it.

as peterz pointed out, it is marked __always_inline so the compiler
shouldn't make a separate function out of it.
Could you check with _this_ series? The schedule_loop variant is in RT
and does not have this inline thingy. So it would be good if the issue
you report actually exists in the series that has been posted.

> Cheers,
> Phil

Sebastian

* Re: [PATCH 3/6] sched: Extract __schedule_loop()
  2023-08-16 12:20         ` Sebastian Andrzej Siewior
@ 2023-08-16 12:48           ` Phil Auld
  0 siblings, 0 replies; 25+ messages in thread
From: Phil Auld @ 2023-08-16 12:48 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Peter Zijlstra, tglx, linux-kernel, bsegall, boqun.feng, swood,
	bristot, dietmar.eggemann, mingo, jstultz, juri.lelli, mgorman,
	rostedt, vschneid, vincent.guittot, longman, will

On Wed, Aug 16, 2023 at 02:20:07PM +0200 Sebastian Andrzej Siewior wrote:
> On 2023-08-16 07:39:45 [-0400], Phil Auld wrote:
> > I do.  Admittedly I'm not an expert in how the wchan unwinding works but
> > we have a slightly older version of this patch in our kernel (schedule_loop
> > not __schedule_loop). When I added __sched it fixed it.   Maybe there
> > is something else but that seemed pretty obvious. 
> > 
> > 
> > /* Attach to any functions which should be ignored in wchan output. */
> > #define __sched		__section(".sched.text")
> > 
> > I can't explain why you are not seeing it.
> 
> as peterz pointed out, it is marked __always_inline so the compiler
> shouldn't make a separate function out of it.
> Could you check with _this_ series? The schedule_loop variant is in RT
> and does not have this inline thingy. So it would be good if the issue
> you report actually exists in the series that has been posted.
>

Hmm, yes. I was looking at the issue in our tree when these patches
came by. Sorry... I seem to have glossed over the __always_inline.
That would certainly work as well and, of course, does explain why you
aren't seeing it.

read more. talk less :)


Cheers,
Phil

> > Cheers,
> > Phil
> 
> Sebastian
> 


* Re: [PATCH 0/6] locking/rtmutex: Avoid PI state recursion through sched_submit_work()
  2023-08-16 10:19       ` Sebastian Andrzej Siewior
@ 2023-08-16 13:46         ` Sebastian Andrzej Siewior
  2023-08-16 14:58           ` Peter Zijlstra
  0 siblings, 1 reply; 25+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-08-16 13:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, linux-kernel, bsegall, boqun.feng, swood, bristot,
	dietmar.eggemann, mingo, jstultz, juri.lelli, mgorman, rostedt,
	vschneid, vincent.guittot, longman, will

On 2023-08-16 12:19:04 [+0200], To Peter Zijlstra wrote:
> On 2023-08-16 11:42:57 [+0200], Peter Zijlstra wrote:
> > Not the same -- this is namespace_lock(), right? That's a regular rwsem
> > afaict and that *should* be good. Clearly I messed something up.
> 
> Most likely. I also see it from inode_lock(), which does down_write().
> I only see it originate from rwbase_write_lock().

I've been looking at what you did and what we had.
I'm not sure if your additional debug/assert code figured it out or me
looking at it, but in rwbase_write_lock() for down_write(), we had this
beauty with a comment that you made go away:

|        * Take the rtmutex as a first step. For rwsem this will also
|        * invoke sched_submit_work() to flush IO and workers.
|	 */
|         if (rwbase_rtmutex_lock_state(rtm, state))

for rw_semaphore we don't have any explicit rwbase_sched_submit_work()
but relied on this one. Now that I look at it again,
rwbase_rtmutex_lock_state() can succeed in the fast path, so we don't
flush / invoke rwbase_pre_schedule().
So you rightfully removed the comment as it was misleading but we do
need that rwbase_pre_schedule() thingy before
raw_spin_lock_irqsave(&rtm->wait_lock).

Sebastian

* Re: [PATCH 3/6] sched: Extract __schedule_loop()
  2023-08-15 22:39     ` Peter Zijlstra
@ 2023-08-16 14:14       ` Phil Auld
  0 siblings, 0 replies; 25+ messages in thread
From: Phil Auld @ 2023-08-16 14:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: bigeasy, tglx, linux-kernel, bsegall, boqun.feng, swood, bristot,
	dietmar.eggemann, mingo, jstultz, juri.lelli, mgorman, rostedt,
	vschneid, vincent.guittot, longman, will

On Wed, Aug 16, 2023 at 12:39:26AM +0200 Peter Zijlstra wrote:
> On Tue, Aug 15, 2023 at 06:33:01PM -0400, Phil Auld wrote:
> > Hi Peter,
> > 
> > On Tue, Aug 15, 2023 at 01:01:24PM +0200 Peter Zijlstra wrote:
> > > From: Thomas Gleixner <tglx@linutronix.de>
> > > 
> > > There are currently two implementations of this basic __schedule()
> > > loop, and there is soon to be a third.
> > > 
> > > Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > Link: https://lkml.kernel.org/r/20230427111937.2745231-2-bigeasy@linutronix.de
> > > ---
> > >  kernel/sched/core.c |   21 +++++++++++----------
> > >  1 file changed, 11 insertions(+), 10 deletions(-)
> > > 
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -6787,16 +6787,21 @@ static void sched_update_worker(struct t
> > >  	}
> > >  }
> > >  
> > > -asmlinkage __visible void __sched schedule(void)
> > > +static __always_inline void __schedule_loop(unsigned int sched_mode)
> > 
> > I think this needs __sched or it's the only thing that ever shows up
> > in wchan. E.g.
> > 
> >   16995 0 bash     S __schedule_loop.constprop.0
> >   17036 1 kworker/ I __schedule_loop.constprop.0
> >   17151 1 kworker/ I __schedule_loop.constprop.0
> >   17235 0 sleep    S __schedule_loop.constprop.0
> >   17236 4 ps       R -
> 
> But but but __always_inline... ?!?
> 

Yep, sorry. Totally missed those characters (and this reply of yours).

The older version from the rt tree had neither. 


Cheers,
Phil

* Re: [PATCH 0/6] locking/rtmutex: Avoid PI state recursion through sched_submit_work()
  2023-08-16 13:46         ` Sebastian Andrzej Siewior
@ 2023-08-16 14:58           ` Peter Zijlstra
  2023-08-16 15:22             ` Peter Zijlstra
  2023-08-17  6:59             ` Sebastian Andrzej Siewior
  0 siblings, 2 replies; 25+ messages in thread
From: Peter Zijlstra @ 2023-08-16 14:58 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: tglx, linux-kernel, bsegall, boqun.feng, swood, bristot,
	dietmar.eggemann, mingo, jstultz, juri.lelli, mgorman, rostedt,
	vschneid, vincent.guittot, longman, will

On Wed, Aug 16, 2023 at 03:46:30PM +0200, Sebastian Andrzej Siewior wrote:
> On 2023-08-16 12:19:04 [+0200], To Peter Zijlstra wrote:
> > On 2023-08-16 11:42:57 [+0200], Peter Zijlstra wrote:
> > > Not the same -- this is namespace_lock(), right? That's a regular rwsem
> > > afaict and that *should* be good. Clearly I messed something up.
> > 
> > Most likely. I also see it from inode_lock(), which does down_write().
> > I only see it originate from rwbase_write_lock().
> 
> I've been looking at what you did and what we had.
> I'm not sure if your additional debug/assert code figured it out or me
> looking at it, but in rwbase_write_lock() for down_write(), we had this
> beauty with a comment that you made go away:
> 
> |        * Take the rtmutex as a first step. For rwsem this will also
> |        * invoke sched_submit_work() to flush IO and workers.
> |	 */
> |         if (rwbase_rtmutex_lock_state(rtm, state))
> 

Yeah, I can't quite remember why I killed that comment, I think because
it was either 'obvious' or confusing at the time. Or perhaps I was too
lazy to type, ... :-)

> for rw_semaphore we don't have any explicit rwbase_sched_submit_work()
> but relied on this one. Now that I look at it again,
> rwbase_rtmutex_lock_state() can succeed in the fast path, so we don't
> flush / invoke rwbase_pre_schedule().
> So you rightfully removed the comment as it was misleading but we do
> need that rwbase_pre_schedule() thingy before
> raw_spin_lock_irqsave(&rtm->wait_lock).

Right, it's both the fast-path and the fact that rt_mutex_slowlock()
will also do post_schedule() and reset the flag.

I've ended up with the below, but it is quite horrible.. but let me go
stare at the futex wreckage before trying to clean things up.

--- a/kernel/locking/rwbase_rt.c
+++ b/kernel/locking/rwbase_rt.c
@@ -270,6 +270,7 @@ static int __sched rwbase_write_lock(str
 
 out_unlock:
 	raw_spin_unlock_irqrestore(&rtm->wait_lock, flags);
+	rt_mutex_post_schedule();
 	return 0;
 }
 
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1412,8 +1412,30 @@ static inline void __downgrade_write(str
 #define rwbase_restore_current_state()			\
 	__set_current_state(TASK_RUNNING)
 
-#define rwbase_rtmutex_lock_state(rtm, state)		\
-	__rt_mutex_lock(rtm, state)
+/*
+ * Variant of __rt_mutex_lock() that unconditionally does
+ * rt_mutex_pre_schedule() and keeps it on success.
+ */
+static __always_inline int
+rwbase_rtmutex_lock_state(struct rt_mutex_base *lock, unsigned int state)
+{
+	unsigned long flags;
+	int ret;
+
+	rt_mutex_pre_schedule();
+
+	if (likely(rt_mutex_try_acquire(lock)))
+		return 0;
+
+	raw_spin_lock_irqsave(&lock->wait_lock, flags);
+	ret = __rt_mutex_slowlock_locked(lock, NULL, state);
+	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
+
+	if (ret)
+		rt_mutex_post_schedule();
+
+	return ret;
+}
 
 #define rwbase_rtmutex_slowlock_locked(rtm, state)	\
 	__rt_mutex_slowlock_locked(rtm, NULL, state)

* Re: [PATCH 0/6] locking/rtmutex: Avoid PI state recursion through sched_submit_work()
  2023-08-16 14:58           ` Peter Zijlstra
@ 2023-08-16 15:22             ` Peter Zijlstra
  2023-08-16 15:25               ` Sebastian Andrzej Siewior
  2023-08-17  6:59             ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2023-08-16 15:22 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: tglx, linux-kernel, bsegall, boqun.feng, swood, bristot,
	dietmar.eggemann, mingo, jstultz, juri.lelli, mgorman, rostedt,
	vschneid, vincent.guittot, longman, will

On Wed, Aug 16, 2023 at 04:58:18PM +0200, Peter Zijlstra wrote:

> I've ended up with the below, but it is quite horrible.. but let me go
> stare at the futex wreckage before trying to clean things up.

OK, I think the below covers the simple case; now let's see if I can make
sense of futex_wait_requeue_pi()... :/

---
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -1002,6 +1002,12 @@ int futex_lock_pi(u32 __user *uaddr, uns
 		goto no_block;
 	}
 
+	/*
+	 * Must be done before we enqueue the waiter, here is unfortunately
+	 * under the hb lock, but that *should* work.
+	 */
+	rt_mutex_pre_schedule();
+
 	rt_mutex_init_waiter(&rt_waiter);
 
 	/*
@@ -1052,6 +1058,10 @@ int futex_lock_pi(u32 __user *uaddr, uns
 	if (ret && !rt_mutex_cleanup_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter))
 		ret = 0;
 
+	/*
+	 * Waiter is unqueued.
+	 */
+	rt_mutex_post_schedule();
 no_block:
 	/*
 	 * Fixup the pi_state owner and possibly acquire the lock if we

* Re: [PATCH 0/6] locking/rtmutex: Avoid PI state recursion through sched_submit_work()
  2023-08-16 15:22             ` Peter Zijlstra
@ 2023-08-16 15:25               ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 25+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-08-16 15:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, linux-kernel, bsegall, boqun.feng, swood, bristot,
	dietmar.eggemann, mingo, jstultz, juri.lelli, mgorman, rostedt,
	vschneid, vincent.guittot, longman, will

On 2023-08-16 17:22:31 [+0200], Peter Zijlstra wrote:
> On Wed, Aug 16, 2023 at 04:58:18PM +0200, Peter Zijlstra wrote:
> 
> > I've ended up with the below, but it is quite horrible.. but let me go
> > stare at the futex wreckage before trying to clean things up.
> 
> OK, I think the below covers the simple case; now let's see if I can make
> sense of futex_wait_requeue_pi()... :/
> 
> ---
> --- a/kernel/futex/pi.c
> +++ b/kernel/futex/pi.c
> @@ -1002,6 +1002,12 @@ int futex_lock_pi(u32 __user *uaddr, uns
>  		goto no_block;
>  	}
>  
> +	/*
> +	 * Must be done before we enqueue the waiter, here is unfortunately
> +	 * under the hb lock, but that *should* work.
> +	 */
> +	rt_mutex_pre_schedule();

but this will do sched_submit_work() which you don't need for futex at
all. It should be a nop (as in nothing will happen) but still.

>  	rt_mutex_init_waiter(&rt_waiter);
>  
>  	/*

Sebastian

* Re: [PATCH 0/6] locking/rtmutex: Avoid PI state recursion through sched_submit_work()
  2023-08-16 14:58           ` Peter Zijlstra
  2023-08-16 15:22             ` Peter Zijlstra
@ 2023-08-17  6:59             ` Sebastian Andrzej Siewior
  2023-08-17  8:26               ` Peter Zijlstra
  1 sibling, 1 reply; 25+ messages in thread
From: Sebastian Andrzej Siewior @ 2023-08-17  6:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, linux-kernel, bsegall, boqun.feng, swood, bristot,
	dietmar.eggemann, mingo, jstultz, juri.lelli, mgorman, rostedt,
	vschneid, vincent.guittot, longman, will

On 2023-08-16 16:58:18 [+0200], Peter Zijlstra wrote:
> I've ended up with the below, but it is quite horrible.. but let me go
> stare at the futex wreckage before trying to clean things up.

What about

diff --git a/kernel/locking/rwbase_rt.c b/kernel/locking/rwbase_rt.c
index f8a194e7ec9e9..b5e881250fec5 100644
--- a/kernel/locking/rwbase_rt.c
+++ b/kernel/locking/rwbase_rt.c
@@ -241,6 +241,8 @@ static int __sched rwbase_write_lock(struct rwbase_rt *rwb,
 	/* Force readers into slow path */
 	atomic_sub(READER_BIAS, &rwb->readers);
 
+	rt_mutex_pre_schedule();
+
 	raw_spin_lock_irqsave(&rtm->wait_lock, flags);
 	if (__rwbase_write_trylock(rwb))
 		goto out_unlock;
@@ -252,6 +254,7 @@ static int __sched rwbase_write_lock(struct rwbase_rt *rwb,
 		if (rwbase_signal_pending_state(state, current)) {
 			rwbase_restore_current_state();
 			__rwbase_write_unlock(rwb, 0, flags);
+			rt_mutex_post_schedule();
 			trace_contention_end(rwb, -EINTR);
 			return -EINTR;
 		}
@@ -270,6 +273,7 @@ static int __sched rwbase_write_lock(struct rwbase_rt *rwb,
 
 out_unlock:
 	raw_spin_unlock_irqrestore(&rtm->wait_lock, flags);
+	rt_mutex_post_schedule();
 	return 0;
 }
 
Sebastian

* Re: [PATCH 0/6] locking/rtmutex: Avoid PI state recursion through sched_submit_work()
  2023-08-17  6:59             ` Sebastian Andrzej Siewior
@ 2023-08-17  8:26               ` Peter Zijlstra
  0 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2023-08-17  8:26 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: tglx, linux-kernel, bsegall, boqun.feng, swood, bristot,
	dietmar.eggemann, mingo, jstultz, juri.lelli, mgorman, rostedt,
	vschneid, vincent.guittot, longman, will

On Thu, Aug 17, 2023 at 08:59:50AM +0200, Sebastian Andrzej Siewior wrote:
> On 2023-08-16 16:58:18 [+0200], Peter Zijlstra wrote:
> > I've ended up with the below, but it is quite horrible.. but let me go
> > stare at the futex wreckage before trying to clean things up.
> 
> What about

Ah, of course, that's much nicer. I got hung up on that
rwbase_rtmutex_lock_state() thing :/

> diff --git a/kernel/locking/rwbase_rt.c b/kernel/locking/rwbase_rt.c
> index f8a194e7ec9e9..b5e881250fec5 100644
> --- a/kernel/locking/rwbase_rt.c
> +++ b/kernel/locking/rwbase_rt.c
> @@ -241,6 +241,8 @@ static int __sched rwbase_write_lock(struct rwbase_rt *rwb,
>  	/* Force readers into slow path */
>  	atomic_sub(READER_BIAS, &rwb->readers);
>  
> +	rt_mutex_pre_schedule();
> +
>  	raw_spin_lock_irqsave(&rtm->wait_lock, flags);
>  	if (__rwbase_write_trylock(rwb))
>  		goto out_unlock;
> @@ -252,6 +254,7 @@ static int __sched rwbase_write_lock(struct rwbase_rt *rwb,
>  		if (rwbase_signal_pending_state(state, current)) {
>  			rwbase_restore_current_state();
>  			__rwbase_write_unlock(rwb, 0, flags);
> +			rt_mutex_post_schedule();
>  			trace_contention_end(rwb, -EINTR);
>  			return -EINTR;
>  		}
> @@ -270,6 +273,7 @@ static int __sched rwbase_write_lock(struct rwbase_rt *rwb,
>  
>  out_unlock:
>  	raw_spin_unlock_irqrestore(&rtm->wait_lock, flags);
> +	rt_mutex_post_schedule();
>  	return 0;
>  }
>  
> Sebastian

Thread overview: 25+ messages
2023-08-15 11:01 [PATCH 0/6] locking/rtmutex: Avoid PI state recursion through sched_submit_work() Peter Zijlstra
2023-08-15 11:01 ` [PATCH 1/6] sched: Constrain locks in sched_submit_work() Peter Zijlstra
2023-08-15 11:01 ` [PATCH 2/6] locking/rtmutex: Avoid unconditional slowpath for DEBUG_RT_MUTEXES Peter Zijlstra
2023-08-15 11:01 ` [PATCH 3/6] sched: Extract __schedule_loop() Peter Zijlstra
2023-08-15 22:33   ` Phil Auld
2023-08-15 22:39     ` Peter Zijlstra
2023-08-16 14:14       ` Phil Auld
2023-08-15 22:42     ` Phil Auld
2023-08-16 10:01     ` Sebastian Andrzej Siewior
2023-08-16 11:39       ` Phil Auld
2023-08-16 12:20         ` Sebastian Andrzej Siewior
2023-08-16 12:48           ` Phil Auld
2023-08-15 11:01 ` [PATCH 4/6] sched: Provide rt_mutex specific scheduler helpers Peter Zijlstra
2023-08-15 11:01 ` [PATCH 5/6] locking/rtmutex: Use " Peter Zijlstra
2023-08-15 11:01 ` [PATCH 6/6] locking/rtmutex: Add a lockdep assert to catch potential nested blocking Peter Zijlstra
2023-08-15 16:15 ` [PATCH 0/6] locking/rtmutex: Avoid PI state recursion through sched_submit_work() Peter Zijlstra
2023-08-16  8:58   ` Sebastian Andrzej Siewior
2023-08-16  9:42     ` Peter Zijlstra
2023-08-16 10:19       ` Sebastian Andrzej Siewior
2023-08-16 13:46         ` Sebastian Andrzej Siewior
2023-08-16 14:58           ` Peter Zijlstra
2023-08-16 15:22             ` Peter Zijlstra
2023-08-16 15:25               ` Sebastian Andrzej Siewior
2023-08-17  6:59             ` Sebastian Andrzej Siewior
2023-08-17  8:26               ` Peter Zijlstra
