public inbox for linux-kernel@vger.kernel.org
* [PATCH v11 0/7] Preparatory changes for Proxy Execution v11
@ 2024-07-09 20:31 John Stultz
  2024-07-09 20:31 ` [PATCH v11 1/7] locking/mutex: Remove wakeups from under mutex::wait_lock John Stultz
                   ` (7 more replies)
  0 siblings, 8 replies; 24+ messages in thread
From: John Stultz @ 2024-07-09 20:31 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Youssef Esmat, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, kernel-team

Hey All,

I wanted to send out v11 of the preparatory patches for Proxy
Execution - an approach for a generalized form of priority
inheritance. Here again, I’m only submitting the early /
preparatory changes for review, in the hope that we can move
these more straightforward patches along and then iteratively
move through the more interesting patches in the Proxy Execution
series. That said, I haven’t gotten much feedback on this
approach, so I’m open to other suggestions.

There have been some changes to the preparatory patches in v11:
* Qais Yousef suggested a few other spots where the
  move_queued_task_locked() helper could be used.
* Simplified the task_is_pushable() helper to return a bool, as
  suggested by Metin Kaya and others. It will eventually become
  a tri-state return, but that can wait until later in the
  series when it is actually used.
* Re-arranged logic in a few spots to reduce indentation and
  simplify things, as suggested by Qais and Metin.
* Metin pointed out some spots in the split scheduler and
  execution contexts patch where variables could be more clearly
  named.
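For readers following along, the bool-returning shape of the
helper can be illustrated with a small userspace mock (this is
only a sketch; the struct fields, helper name semantics, and mask
handling here are invented for illustration, not the kernel's
actual definitions):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace mock of a bool-returning task_is_pushable() style
 * check. Field names and semantics are illustrative only.
 */
struct mock_task {
	bool on_cpu;             /* currently executing on a CPU */
	unsigned int cpus_mask;  /* bit n set => CPU n is allowed */
	int nr_cpus_allowed;     /* weight of the affinity mask */
};

/* A task is pushable to @cpu only if it is not currently running,
 * @cpu is in its affinity mask, and it is not pinned to one CPU. */
static bool task_is_pushable(const struct mock_task *p, int cpu)
{
	return !p->on_cpu &&
	       (p->cpus_mask & (1u << cpu)) &&
	       p->nr_cpus_allowed > 1;
}
```

The later tri-state version would extend this to distinguish more
than yes/no, which is why the series defers it until a user exists.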

Many thanks to Metin and Qais for their detailed feedback here!

I’ve also continued working on the rest of the series, which you
can find here:
 https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v11-6.10-rc7
 https://github.com/johnstultz-work/linux-dev.git proxy-exec-v11-6.10-rc7

New changes in the full series include:
* Got rid of recursion in activate_blocked_waiter logic
* Added more detail to the new trace events, as well as
  additional trace events for validating behavior
* Fixes for edge case where wake_cpu used for return migration
  ended up outside the affinity mask
* Fix for case where we weren’t preserving need_resched when
  find_proxy_task() returns the idle task
* Lots of small detail cleanups suggested by Metin

Issues still to address with the full series:
* K Prateek Nayak did some testing with an earlier version of
  the series and saw ~3-5% regressions in some cases. I’m hoping
  to look into this soon to see if we can reduce those further.
* The chain migration functionality needs further iterations and
  better validation to ensure it truly maintains the RT/DL load
  balancing invariants (despite this being broken in vanilla
  upstream with RT_PUSH_IPI currently)
* At OSPM, Juri Lelli and the (very, very sadly) late Daniel
  Bristot de Oliveira raised the point that Proxy Exec may not
  actually be generalizable to SCHED_DEADLINE tasks, as one
  cannot always correctly donate the resources of the waiter to
  an owner on a different cpu. If one were to reverse the
  proxy-migration direction, migrating the owner to the waiter’s
  cpu, this would preserve the SCHED_DEADLINE bandwidth
  calculations, but would break down if the owner's cpu affinity
  disallowed it. To my understanding, this constraint makes most
  forms of priority inheritance infeasible with SCHED_DEADLINE,
  but I’ll have to leave that to the folks/academics who know it
  well. After talking with Juri, my current plan is simply to
  special-case find_proxy_task() to not proxy with
  SCHED_DEADLINE (falling back to the current behavior, where we
  deactivate the waiting task). But SCHED_NORMAL waiter tasks
  would still be able to benefit from Proxy Exec.
* Also at OSPM, Thomas Gleixner mentioned we might consider
  including Proxy Exec in the PREEMPT_RT patch series, however
  for this to be useful I need to take a stab at deprecating
  rt_mutexes for proxy mutexes, as everything is an rt_mutex
  with PREEMPT_RT.
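As a purely hypothetical userspace sketch of the SCHED_DEADLINE
special-casing plan mentioned above (find_proxy_task() is from the
out-of-tree series, not mainline, and every name and field below is
invented for illustration), returning NULL stands in for "fall back
to the current behavior and deactivate the waiter":

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mock only; not the actual series code. */
enum mock_policy { MOCK_NORMAL, MOCK_DEADLINE };

struct mock_task {
	enum mock_policy policy;
	struct mock_task *blocked_on_owner; /* mutex owner we wait on */
};

static struct mock_task *find_proxy_task(struct mock_task *donor)
{
	/* Donating a SCHED_DEADLINE waiter's bandwidth to an owner
	 * on another cpu would break per-cpu bandwidth accounting,
	 * so do not proxy for deadline tasks. */
	if (donor->policy == MOCK_DEADLINE)
		return NULL;

	/* Otherwise, run the lock owner on the donor's behalf. */
	return donor->blocked_on_owner;
}
```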


Credit/Disclaimer:
--------------------
As mentioned previously, this Proxy Execution series has a long
history: 

First described in a paper[1] by Watkins, Straub, and Niehaus, it
was then prototyped in patches from Peter Zijlstra, and extended
with lots of work by Juri Lelli, Valentin Schneider, and Connor
O'Brien. (And thank you to Steven Rostedt for providing additional
details here!)

So again, many thanks to those above, as all the credit for this
series really is due to them - while the mistakes are likely
mine.


As always, feedback and review would be greatly appreciated!

Thanks so much!
-john

[1] https://static.lwn.net/images/conf/rtlws11/papers/proc/p38.pdf

Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Youssef Esmat <youssefesmat@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: kernel-team@android.com

Connor O'Brien (2):
  sched: Add move_queued_task_locked helper
  sched: Consolidate pick_*_task to task_is_pushable helper

John Stultz (1):
  sched: Split out __schedule() deactivate task logic into a helper

Juri Lelli (2):
  locking/mutex: Make mutex::wait_lock irq safe
  locking/mutex: Expose __mutex_owner()

Peter Zijlstra (2):
  locking/mutex: Remove wakeups from under mutex::wait_lock
  sched: Split scheduler and execution contexts

 kernel/locking/mutex.c       |  60 +++++++---------
 kernel/locking/mutex.h       |  27 ++++++++
 kernel/locking/rtmutex.c     |  30 +++++---
 kernel/locking/rwbase_rt.c   |   8 ++-
 kernel/locking/rwsem.c       |   4 +-
 kernel/locking/spinlock_rt.c |   3 +-
 kernel/locking/ww_mutex.h    |  49 +++++++------
 kernel/sched/core.c          | 130 ++++++++++++++++++++---------------
 kernel/sched/deadline.c      |  57 +++++++--------
 kernel/sched/fair.c          |  32 ++++-----
 kernel/sched/rt.c            |  67 ++++++++----------
 kernel/sched/sched.h         |  48 ++++++++++++-
 12 files changed, 295 insertions(+), 220 deletions(-)

-- 
2.45.2.993.g49e7a77208-goog



* [PATCH v11 1/7] locking/mutex: Remove wakeups from under mutex::wait_lock
  2024-07-09 20:31 [PATCH v11 0/7] Preparatory changes for Proxy Execution v11 John Stultz
@ 2024-07-09 20:31 ` John Stultz
  2024-07-10 17:41   ` K Prateek Nayak
  2024-07-12 14:52   ` Peter Zijlstra
  2024-07-09 20:31 ` [PATCH v11 2/7] locking/mutex: Make mutex::wait_lock irq safe John Stultz
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 24+ messages in thread
From: John Stultz @ 2024-07-09 20:31 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Youssef Esmat,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, kernel-team, John Stultz,
	Metin Kaya, Davidlohr Bueso

From: Peter Zijlstra <peterz@infradead.org>

In preparation for nesting mutex::wait_lock under rq::lock, we need
to remove wakeups from under it.
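As a rough userspace illustration of the pattern this patch applies
(this is not kernel code: the API names loosely mirror the kernel's
wake_q helpers, but the types and bookkeeping below are mocked up),
wakeups are only *queued* while wait_lock is held, and issued after
the lock is dropped:

```c
#include <assert.h>

#define MAX_WAKE 8

struct mock_wake_q {
	int tasks[MAX_WAKE]; /* task ids queued for wakeup */
	int n;
};

static int woken[MAX_WAKE];
static int nr_woken;

static void wake_q_init(struct mock_wake_q *q) { q->n = 0; }

/* Called with wait_lock held: just record the task. */
static void wake_q_add(struct mock_wake_q *q, int task)
{
	q->tasks[q->n++] = task;
}

/* Called after wait_lock is dropped: perform deferred wakeups. */
static void wake_up_q(struct mock_wake_q *q)
{
	for (int i = 0; i < q->n; i++)
		woken[nr_woken++] = q->tasks[i];
	q->n = 0;
}
```

In the real patch, the wake_q is an on-stack DEFINE_WAKE_Q() in the
lock slowpaths, and wake_up_q() runs only once wait_lock has been
released, so no wakeup ever happens with wait_lock held.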

Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Youssef Esmat <youssefesmat@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: kernel-team@android.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[Heavily changed after 55f036ca7e74 ("locking: WW mutex cleanup") and
08295b3b5bee ("locking: Implement an algorithm choice for Wound-Wait
mutexes")]
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
[jstultz: rebased to mainline, added extra wake_up_q & init
 to avoid hangs, similar to Connor's rework of this patch]
Signed-off-by: John Stultz <jstultz@google.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Metin Kaya <metin.kaya@arm.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Reviewed-by: Metin Kaya <metin.kaya@arm.com>
---
v5:
* Reverted back to an earlier version of this patch, undoing
  the change that kept the wake_q in the ctx structure, as that
  broke the rule that the wake_q must always be on the stack
  (it is not safe for concurrent use).
v6:
* Made tweaks suggested by Waiman Long
v7:
* Fixups to pass wake_qs down for PREEMPT_RT logic
v10:
* Moved the preempt_enable down, closer to the unlock, as
  suggested by Valentin
* Added additional preempt_disable coverage around the wake_q
  calls as again noted by Valentin
---
 kernel/locking/mutex.c       | 17 +++++++++++++----
 kernel/locking/rtmutex.c     | 30 +++++++++++++++++++++---------
 kernel/locking/rwbase_rt.c   |  8 +++++++-
 kernel/locking/rwsem.c       |  4 ++--
 kernel/locking/spinlock_rt.c |  3 ++-
 kernel/locking/ww_mutex.h    | 29 ++++++++++++++++++-----------
 6 files changed, 63 insertions(+), 28 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index cbae8c0b89ab..4269da1f3ef5 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -575,6 +575,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 		    struct lockdep_map *nest_lock, unsigned long ip,
 		    struct ww_acquire_ctx *ww_ctx, const bool use_ww_ctx)
 {
+	DEFINE_WAKE_Q(wake_q);
 	struct mutex_waiter waiter;
 	struct ww_mutex *ww;
 	int ret;
@@ -625,7 +626,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 	 */
 	if (__mutex_trylock(lock)) {
 		if (ww_ctx)
-			__ww_mutex_check_waiters(lock, ww_ctx);
+			__ww_mutex_check_waiters(lock, ww_ctx, &wake_q);
 
 		goto skip_wait;
 	}
@@ -645,7 +646,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 		 * Add in stamp order, waking up waiters that must kill
 		 * themselves.
 		 */
-		ret = __ww_mutex_add_waiter(&waiter, lock, ww_ctx);
+		ret = __ww_mutex_add_waiter(&waiter, lock, ww_ctx, &wake_q);
 		if (ret)
 			goto err_early_kill;
 	}
@@ -681,6 +682,11 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 		}
 
 		raw_spin_unlock(&lock->wait_lock);
+		/* Make sure we do wakeups before calling schedule */
+		if (!wake_q_empty(&wake_q)) {
+			wake_up_q(&wake_q);
+			wake_q_init(&wake_q);
+		}
 		schedule_preempt_disabled();
 
 		first = __mutex_waiter_is_first(lock, &waiter);
@@ -714,7 +720,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 		 */
 		if (!ww_ctx->is_wait_die &&
 		    !__mutex_waiter_is_first(lock, &waiter))
-			__ww_mutex_check_waiters(lock, ww_ctx);
+			__ww_mutex_check_waiters(lock, ww_ctx, &wake_q);
 	}
 
 	__mutex_remove_waiter(lock, &waiter);
@@ -730,6 +736,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 		ww_mutex_lock_acquired(ww, ww_ctx);
 
 	raw_spin_unlock(&lock->wait_lock);
+	wake_up_q(&wake_q);
 	preempt_enable();
 	return 0;
 
@@ -741,6 +748,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 	raw_spin_unlock(&lock->wait_lock);
 	debug_mutex_free_waiter(&waiter);
 	mutex_release(&lock->dep_map, ip);
+	wake_up_q(&wake_q);
 	preempt_enable();
 	return ret;
 }
@@ -951,9 +959,10 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 	if (owner & MUTEX_FLAG_HANDOFF)
 		__mutex_handoff(lock, next);
 
+	preempt_disable();
 	raw_spin_unlock(&lock->wait_lock);
-
 	wake_up_q(&wake_q);
+	preempt_enable();
 }
 
 #ifndef CONFIG_DEBUG_LOCK_ALLOC
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 88d08eeb8bc0..7a85d9bfa972 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -34,13 +34,15 @@
 
 static inline int __ww_mutex_add_waiter(struct rt_mutex_waiter *waiter,
 					struct rt_mutex *lock,
-					struct ww_acquire_ctx *ww_ctx)
+					struct ww_acquire_ctx *ww_ctx,
+					struct wake_q_head *wake_q)
 {
 	return 0;
 }
 
 static inline void __ww_mutex_check_waiters(struct rt_mutex *lock,
-					    struct ww_acquire_ctx *ww_ctx)
+					    struct ww_acquire_ctx *ww_ctx,
+					    struct wake_q_head *wake_q)
 {
 }
 
@@ -1207,6 +1209,7 @@ static int __sched task_blocks_on_rt_mutex(struct rt_mutex_base *lock,
 	struct rt_mutex_waiter *top_waiter = waiter;
 	struct rt_mutex_base *next_lock;
 	int chain_walk = 0, res;
+	DEFINE_WAKE_Q(wake_q);
 
 	lockdep_assert_held(&lock->wait_lock);
 
@@ -1245,7 +1248,10 @@ static int __sched task_blocks_on_rt_mutex(struct rt_mutex_base *lock,
 
 		/* Check whether the waiter should back out immediately */
 		rtm = container_of(lock, struct rt_mutex, rtmutex);
-		res = __ww_mutex_add_waiter(waiter, rtm, ww_ctx);
+		preempt_disable();
+		res = __ww_mutex_add_waiter(waiter, rtm, ww_ctx, &wake_q);
+		wake_up_q(&wake_q);
+		preempt_enable();
 		if (res) {
 			raw_spin_lock(&task->pi_lock);
 			rt_mutex_dequeue(lock, waiter);
@@ -1678,7 +1684,8 @@ static int __sched __rt_mutex_slowlock(struct rt_mutex_base *lock,
 				       struct ww_acquire_ctx *ww_ctx,
 				       unsigned int state,
 				       enum rtmutex_chainwalk chwalk,
-				       struct rt_mutex_waiter *waiter)
+				       struct rt_mutex_waiter *waiter,
+				       struct wake_q_head *wake_q)
 {
 	struct rt_mutex *rtm = container_of(lock, struct rt_mutex, rtmutex);
 	struct ww_mutex *ww = ww_container_of(rtm);
@@ -1689,7 +1696,7 @@ static int __sched __rt_mutex_slowlock(struct rt_mutex_base *lock,
 	/* Try to acquire the lock again: */
 	if (try_to_take_rt_mutex(lock, current, NULL)) {
 		if (build_ww_mutex() && ww_ctx) {
-			__ww_mutex_check_waiters(rtm, ww_ctx);
+			__ww_mutex_check_waiters(rtm, ww_ctx, wake_q);
 			ww_mutex_lock_acquired(ww, ww_ctx);
 		}
 		return 0;
@@ -1707,7 +1714,7 @@ static int __sched __rt_mutex_slowlock(struct rt_mutex_base *lock,
 		/* acquired the lock */
 		if (build_ww_mutex() && ww_ctx) {
 			if (!ww_ctx->is_wait_die)
-				__ww_mutex_check_waiters(rtm, ww_ctx);
+				__ww_mutex_check_waiters(rtm, ww_ctx, wake_q);
 			ww_mutex_lock_acquired(ww, ww_ctx);
 		}
 	} else {
@@ -1729,7 +1736,8 @@ static int __sched __rt_mutex_slowlock(struct rt_mutex_base *lock,
 
 static inline int __rt_mutex_slowlock_locked(struct rt_mutex_base *lock,
 					     struct ww_acquire_ctx *ww_ctx,
-					     unsigned int state)
+					     unsigned int state,
+					     struct wake_q_head *wake_q)
 {
 	struct rt_mutex_waiter waiter;
 	int ret;
@@ -1738,7 +1746,7 @@ static inline int __rt_mutex_slowlock_locked(struct rt_mutex_base *lock,
 	waiter.ww_ctx = ww_ctx;
 
 	ret = __rt_mutex_slowlock(lock, ww_ctx, state, RT_MUTEX_MIN_CHAINWALK,
-				  &waiter);
+				  &waiter, wake_q);
 
 	debug_rt_mutex_free_waiter(&waiter);
 	return ret;
@@ -1754,6 +1762,7 @@ static int __sched rt_mutex_slowlock(struct rt_mutex_base *lock,
 				     struct ww_acquire_ctx *ww_ctx,
 				     unsigned int state)
 {
+	DEFINE_WAKE_Q(wake_q);
 	unsigned long flags;
 	int ret;
 
@@ -1775,8 +1784,11 @@ static int __sched rt_mutex_slowlock(struct rt_mutex_base *lock,
 	 * irqsave/restore variants.
 	 */
 	raw_spin_lock_irqsave(&lock->wait_lock, flags);
-	ret = __rt_mutex_slowlock_locked(lock, ww_ctx, state);
+	ret = __rt_mutex_slowlock_locked(lock, ww_ctx, state, &wake_q);
+	preempt_disable();
 	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
+	wake_up_q(&wake_q);
+	preempt_enable();
 	rt_mutex_post_schedule();
 
 	return ret;
diff --git a/kernel/locking/rwbase_rt.c b/kernel/locking/rwbase_rt.c
index 34a59569db6b..9f4322c07486 100644
--- a/kernel/locking/rwbase_rt.c
+++ b/kernel/locking/rwbase_rt.c
@@ -69,6 +69,7 @@ static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
 				      unsigned int state)
 {
 	struct rt_mutex_base *rtm = &rwb->rtmutex;
+	DEFINE_WAKE_Q(wake_q);
 	int ret;
 
 	rwbase_pre_schedule();
@@ -110,7 +111,7 @@ static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
 	 * For rwlocks this returns 0 unconditionally, so the below
 	 * !ret conditionals are optimized out.
 	 */
-	ret = rwbase_rtmutex_slowlock_locked(rtm, state);
+	ret = rwbase_rtmutex_slowlock_locked(rtm, state, &wake_q);
 
 	/*
 	 * On success the rtmutex is held, so there can't be a writer
@@ -121,7 +122,12 @@ static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
 	 */
 	if (!ret)
 		atomic_inc(&rwb->readers);
+
+	preempt_disable();
 	raw_spin_unlock_irq(&rtm->wait_lock);
+	wake_up_q(&wake_q);
+	preempt_enable();
+
 	if (!ret)
 		rwbase_rtmutex_unlock(rtm);
 
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index c6d17aee4209..79ab7b8df5c1 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1415,8 +1415,8 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
 #define rwbase_rtmutex_lock_state(rtm, state)		\
 	__rt_mutex_lock(rtm, state)
 
-#define rwbase_rtmutex_slowlock_locked(rtm, state)	\
-	__rt_mutex_slowlock_locked(rtm, NULL, state)
+#define rwbase_rtmutex_slowlock_locked(rtm, state, wq)	\
+	__rt_mutex_slowlock_locked(rtm, NULL, state, wq)
 
 #define rwbase_rtmutex_unlock(rtm)			\
 	__rt_mutex_unlock(rtm)
diff --git a/kernel/locking/spinlock_rt.c b/kernel/locking/spinlock_rt.c
index 38e292454fcc..fb1810a14c9d 100644
--- a/kernel/locking/spinlock_rt.c
+++ b/kernel/locking/spinlock_rt.c
@@ -162,7 +162,8 @@ rwbase_rtmutex_lock_state(struct rt_mutex_base *rtm, unsigned int state)
 }
 
 static __always_inline int
-rwbase_rtmutex_slowlock_locked(struct rt_mutex_base *rtm, unsigned int state)
+rwbase_rtmutex_slowlock_locked(struct rt_mutex_base *rtm, unsigned int state,
+			       struct wake_q_head *wake_q)
 {
 	rtlock_slowlock_locked(rtm);
 	return 0;
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 3ad2cc4823e5..7189c6631d90 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -275,7 +275,7 @@ __ww_ctx_less(struct ww_acquire_ctx *a, struct ww_acquire_ctx *b)
  */
 static bool
 __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
-	       struct ww_acquire_ctx *ww_ctx)
+	       struct ww_acquire_ctx *ww_ctx, struct wake_q_head *wake_q)
 {
 	if (!ww_ctx->is_wait_die)
 		return false;
@@ -284,7 +284,7 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
 #ifndef WW_RT
 		debug_mutex_wake_waiter(lock, waiter);
 #endif
-		wake_up_process(waiter->task);
+		wake_q_add(wake_q, waiter->task);
 	}
 
 	return true;
@@ -299,7 +299,8 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
  */
 static bool __ww_mutex_wound(struct MUTEX *lock,
 			     struct ww_acquire_ctx *ww_ctx,
-			     struct ww_acquire_ctx *hold_ctx)
+			     struct ww_acquire_ctx *hold_ctx,
+			     struct wake_q_head *wake_q)
 {
 	struct task_struct *owner = __ww_mutex_owner(lock);
 
@@ -331,7 +332,7 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
 		 * wakeup pending to re-read the wounded state.
 		 */
 		if (owner != current)
-			wake_up_process(owner);
+			wake_q_add(wake_q, owner);
 
 		return true;
 	}
@@ -352,7 +353,8 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
  * The current task must not be on the wait list.
  */
 static void
-__ww_mutex_check_waiters(struct MUTEX *lock, struct ww_acquire_ctx *ww_ctx)
+__ww_mutex_check_waiters(struct MUTEX *lock, struct ww_acquire_ctx *ww_ctx,
+			 struct wake_q_head *wake_q)
 {
 	struct MUTEX_WAITER *cur;
 
@@ -364,8 +366,8 @@ __ww_mutex_check_waiters(struct MUTEX *lock, struct ww_acquire_ctx *ww_ctx)
 		if (!cur->ww_ctx)
 			continue;
 
-		if (__ww_mutex_die(lock, cur, ww_ctx) ||
-		    __ww_mutex_wound(lock, cur->ww_ctx, ww_ctx))
+		if (__ww_mutex_die(lock, cur, ww_ctx, wake_q) ||
+		    __ww_mutex_wound(lock, cur->ww_ctx, ww_ctx, wake_q))
 			break;
 	}
 }
@@ -377,6 +379,8 @@ __ww_mutex_check_waiters(struct MUTEX *lock, struct ww_acquire_ctx *ww_ctx)
 static __always_inline void
 ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
 {
+	DEFINE_WAKE_Q(wake_q);
+
 	ww_mutex_lock_acquired(lock, ctx);
 
 	/*
@@ -405,8 +409,10 @@ ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
 	 * die or wound us.
 	 */
 	lock_wait_lock(&lock->base);
-	__ww_mutex_check_waiters(&lock->base, ctx);
+	__ww_mutex_check_waiters(&lock->base, ctx, &wake_q);
 	unlock_wait_lock(&lock->base);
+
+	wake_up_q(&wake_q);
 }
 
 static __always_inline int
@@ -488,7 +494,8 @@ __ww_mutex_check_kill(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
 static inline int
 __ww_mutex_add_waiter(struct MUTEX_WAITER *waiter,
 		      struct MUTEX *lock,
-		      struct ww_acquire_ctx *ww_ctx)
+		      struct ww_acquire_ctx *ww_ctx,
+		      struct wake_q_head *wake_q)
 {
 	struct MUTEX_WAITER *cur, *pos = NULL;
 	bool is_wait_die;
@@ -532,7 +539,7 @@ __ww_mutex_add_waiter(struct MUTEX_WAITER *waiter,
 		pos = cur;
 
 		/* Wait-Die: ensure younger waiters die. */
-		__ww_mutex_die(lock, cur, ww_ctx);
+		__ww_mutex_die(lock, cur, ww_ctx, wake_q);
 	}
 
 	__ww_waiter_add(lock, waiter, pos);
@@ -550,7 +557,7 @@ __ww_mutex_add_waiter(struct MUTEX_WAITER *waiter,
 		 * such that either we or the fastpath will wound @ww->ctx.
 		 */
 		smp_mb();
-		__ww_mutex_wound(lock, ww_ctx, ww->ctx);
+		__ww_mutex_wound(lock, ww_ctx, ww->ctx, wake_q);
 	}
 
 	return 0;
-- 
2.45.2.993.g49e7a77208-goog



* [PATCH v11 2/7] locking/mutex: Make mutex::wait_lock irq safe
  2024-07-09 20:31 [PATCH v11 0/7] Preparatory changes for Proxy Execution v11 John Stultz
  2024-07-09 20:31 ` [PATCH v11 1/7] locking/mutex: Remove wakeups from under mutex::wait_lock John Stultz
@ 2024-07-09 20:31 ` John Stultz
  2024-07-09 20:31 ` [PATCH v11 3/7] locking/mutex: Expose __mutex_owner() John Stultz
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: John Stultz @ 2024-07-09 20:31 UTC (permalink / raw)
  To: LKML
  Cc: Juri Lelli, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Youssef Esmat, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, kernel-team, Connor O'Brien,
	John Stultz, Metin Kaya

From: Juri Lelli <juri.lelli@redhat.com>

mutex::wait_lock might be nested under rq->lock.

Make it irq safe then.

Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Youssef Esmat <youssefesmat@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: kernel-team@android.com
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[rebase & fix {un,}lock_wait_lock helpers in ww_mutex.h]
Signed-off-by: Connor O'Brien <connoro@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Metin Kaya <metin.kaya@arm.com>
Reviewed-by: Metin Kaya <metin.kaya@arm.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
---
v3:
* Re-added this patch after it was dropped in v2 which
  caused lockdep warnings to trip.
v7:
* Fix function definition for PREEMPT_RT case, as pointed out
  by Metin Kaya.
* Fix incorrect flags handling in PREEMPT_RT case as found by
  Metin Kaya
---
 kernel/locking/mutex.c    | 18 ++++++++++--------
 kernel/locking/ww_mutex.h | 22 +++++++++++-----------
 2 files changed, 21 insertions(+), 19 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 4269da1f3ef5..6d843a0978a5 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -578,6 +578,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 	DEFINE_WAKE_Q(wake_q);
 	struct mutex_waiter waiter;
 	struct ww_mutex *ww;
+	unsigned long flags;
 	int ret;
 
 	if (!use_ww_ctx)
@@ -620,7 +621,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 		return 0;
 	}
 
-	raw_spin_lock(&lock->wait_lock);
+	raw_spin_lock_irqsave(&lock->wait_lock, flags);
 	/*
 	 * After waiting to acquire the wait_lock, try again.
 	 */
@@ -681,7 +682,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 				goto err;
 		}
 
-		raw_spin_unlock(&lock->wait_lock);
+		raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 		/* Make sure we do wakeups before calling schedule */
 		if (!wake_q_empty(&wake_q)) {
 			wake_up_q(&wake_q);
@@ -707,9 +708,9 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 			trace_contention_begin(lock, LCB_F_MUTEX);
 		}
 
-		raw_spin_lock(&lock->wait_lock);
+		raw_spin_lock_irqsave(&lock->wait_lock, flags);
 	}
-	raw_spin_lock(&lock->wait_lock);
+	raw_spin_lock_irqsave(&lock->wait_lock, flags);
 acquired:
 	__set_current_state(TASK_RUNNING);
 
@@ -735,7 +736,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 	if (ww_ctx)
 		ww_mutex_lock_acquired(ww, ww_ctx);
 
-	raw_spin_unlock(&lock->wait_lock);
+	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 	wake_up_q(&wake_q);
 	preempt_enable();
 	return 0;
@@ -745,7 +746,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 	__mutex_remove_waiter(lock, &waiter);
 err_early_kill:
 	trace_contention_end(lock, ret);
-	raw_spin_unlock(&lock->wait_lock);
+	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 	debug_mutex_free_waiter(&waiter);
 	mutex_release(&lock->dep_map, ip);
 	wake_up_q(&wake_q);
@@ -916,6 +917,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 	struct task_struct *next = NULL;
 	DEFINE_WAKE_Q(wake_q);
 	unsigned long owner;
+	unsigned long flags;
 
 	mutex_release(&lock->dep_map, ip);
 
@@ -942,7 +944,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 		}
 	}
 
-	raw_spin_lock(&lock->wait_lock);
+	raw_spin_lock_irqsave(&lock->wait_lock, flags);
 	debug_mutex_unlock(lock);
 	if (!list_empty(&lock->wait_list)) {
 		/* get the first entry from the wait-list: */
@@ -960,7 +962,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 		__mutex_handoff(lock, next);
 
 	preempt_disable();
-	raw_spin_unlock(&lock->wait_lock);
+	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 	wake_up_q(&wake_q);
 	preempt_enable();
 }
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 7189c6631d90..9facc0ddfdd3 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -70,14 +70,14 @@ __ww_mutex_has_waiters(struct mutex *lock)
 	return atomic_long_read(&lock->owner) & MUTEX_FLAG_WAITERS;
 }
 
-static inline void lock_wait_lock(struct mutex *lock)
+static inline void lock_wait_lock(struct mutex *lock, unsigned long *flags)
 {
-	raw_spin_lock(&lock->wait_lock);
+	raw_spin_lock_irqsave(&lock->wait_lock, *flags);
 }
 
-static inline void unlock_wait_lock(struct mutex *lock)
+static inline void unlock_wait_lock(struct mutex *lock, unsigned long *flags)
 {
-	raw_spin_unlock(&lock->wait_lock);
+	raw_spin_unlock_irqrestore(&lock->wait_lock, *flags);
 }
 
 static inline void lockdep_assert_wait_lock_held(struct mutex *lock)
@@ -144,14 +144,14 @@ __ww_mutex_has_waiters(struct rt_mutex *lock)
 	return rt_mutex_has_waiters(&lock->rtmutex);
 }
 
-static inline void lock_wait_lock(struct rt_mutex *lock)
+static inline void lock_wait_lock(struct rt_mutex *lock, unsigned long *flags)
 {
-	raw_spin_lock(&lock->rtmutex.wait_lock);
+	raw_spin_lock_irqsave(&lock->rtmutex.wait_lock, *flags);
 }
 
-static inline void unlock_wait_lock(struct rt_mutex *lock)
+static inline void unlock_wait_lock(struct rt_mutex *lock, unsigned long *flags)
 {
-	raw_spin_unlock(&lock->rtmutex.wait_lock);
+	raw_spin_unlock_irqrestore(&lock->rtmutex.wait_lock, *flags);
 }
 
 static inline void lockdep_assert_wait_lock_held(struct rt_mutex *lock)
@@ -380,6 +380,7 @@ static __always_inline void
 ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
 {
 	DEFINE_WAKE_Q(wake_q);
+	unsigned long flags;
 
 	ww_mutex_lock_acquired(lock, ctx);
 
@@ -408,10 +409,9 @@ ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
 	 * Uh oh, we raced in fastpath, check if any of the waiters need to
 	 * die or wound us.
 	 */
-	lock_wait_lock(&lock->base);
+	lock_wait_lock(&lock->base, &flags);
 	__ww_mutex_check_waiters(&lock->base, ctx, &wake_q);
-	unlock_wait_lock(&lock->base);
-
+	unlock_wait_lock(&lock->base, &flags);
 	wake_up_q(&wake_q);
 }
 
-- 
2.45.2.993.g49e7a77208-goog



* [PATCH v11 3/7] locking/mutex: Expose __mutex_owner()
  2024-07-09 20:31 [PATCH v11 0/7] Preparatory changes for Proxy Execution v11 John Stultz
  2024-07-09 20:31 ` [PATCH v11 1/7] locking/mutex: Remove wakeups from under mutex::wait_lock John Stultz
  2024-07-09 20:31 ` [PATCH v11 2/7] locking/mutex: Make mutex::wait_lock irq safe John Stultz
@ 2024-07-09 20:31 ` John Stultz
  2024-07-09 20:31 ` [PATCH v11 4/7] sched: Add move_queued_task_locked helper John Stultz
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: John Stultz @ 2024-07-09 20:31 UTC (permalink / raw)
  To: LKML
  Cc: Juri Lelli, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Youssef Esmat, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, kernel-team, Valentin Schneider,
	Connor O'Brien, John Stultz, Metin Kaya

From: Juri Lelli <juri.lelli@redhat.com>

Implementing proxy execution requires that scheduler code be able to
identify the current owner of a mutex. Expose __mutex_owner() for
this purpose (and this purpose alone!). A NULL mutex check is
included so that users of the function can be simplified.

Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Youssef Esmat <youssefesmat@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: kernel-team@android.com
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
[Removed the EXPORT_SYMBOL]
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Connor O'Brien <connoro@google.com>
[jstultz: Reworked per Peter's suggestions]
Signed-off-by: John Stultz <jstultz@google.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Metin Kaya <metin.kaya@arm.com>
Reviewed-by: Metin Kaya <metin.kaya@arm.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
---
v4:
* Move __mutex_owner() to kernel/locking/mutex.h instead of
  adding a new globally available accessor function to keep
  the exposure of this low, along with keeping it an inline
  function, as suggested by PeterZ
v10:
* Handle null lock ptr, to simplify later code, as suggested
  by Metin Kaya
v11:
* Tweak commit message suggested by Metin Kaya
---
 kernel/locking/mutex.c | 25 -------------------------
 kernel/locking/mutex.h | 27 +++++++++++++++++++++++++++
 2 files changed, 27 insertions(+), 25 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 6d843a0978a5..4b7193fd3be9 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -56,31 +56,6 @@ __mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key)
 }
 EXPORT_SYMBOL(__mutex_init);
 
-/*
- * @owner: contains: 'struct task_struct *' to the current lock owner,
- * NULL means not owned. Since task_struct pointers are aligned at
- * at least L1_CACHE_BYTES, we have low bits to store extra state.
- *
- * Bit0 indicates a non-empty waiter list; unlock must issue a wakeup.
- * Bit1 indicates unlock needs to hand the lock to the top-waiter
- * Bit2 indicates handoff has been done and we're waiting for pickup.
- */
-#define MUTEX_FLAG_WAITERS	0x01
-#define MUTEX_FLAG_HANDOFF	0x02
-#define MUTEX_FLAG_PICKUP	0x04
-
-#define MUTEX_FLAGS		0x07
-
-/*
- * Internal helper function; C doesn't allow us to hide it :/
- *
- * DO NOT USE (outside of mutex code).
- */
-static inline struct task_struct *__mutex_owner(struct mutex *lock)
-{
-	return (struct task_struct *)(atomic_long_read(&lock->owner) & ~MUTEX_FLAGS);
-}
-
 static inline struct task_struct *__owner_task(unsigned long owner)
 {
 	return (struct task_struct *)(owner & ~MUTEX_FLAGS);
diff --git a/kernel/locking/mutex.h b/kernel/locking/mutex.h
index 0b2a79c4013b..cbff35b9b7ae 100644
--- a/kernel/locking/mutex.h
+++ b/kernel/locking/mutex.h
@@ -20,6 +20,33 @@ struct mutex_waiter {
 #endif
 };
 
+/*
+ * @owner: contains: 'struct task_struct *' to the current lock owner,
+ * NULL means not owned. Since task_struct pointers are aligned at
+ * at least L1_CACHE_BYTES, we have low bits to store extra state.
+ *
+ * Bit0 indicates a non-empty waiter list; unlock must issue a wakeup.
+ * Bit1 indicates unlock needs to hand the lock to the top-waiter
+ * Bit2 indicates handoff has been done and we're waiting for pickup.
+ */
+#define MUTEX_FLAG_WAITERS	0x01
+#define MUTEX_FLAG_HANDOFF	0x02
+#define MUTEX_FLAG_PICKUP	0x04
+
+#define MUTEX_FLAGS		0x07
+
+/*
+ * Internal helper function; C doesn't allow us to hide it :/
+ *
+ * DO NOT USE (outside of mutex & scheduler code).
+ */
+static inline struct task_struct *__mutex_owner(struct mutex *lock)
+{
+	if (!lock)
+		return NULL;
+	return (struct task_struct *)(atomic_long_read(&lock->owner) & ~MUTEX_FLAGS);
+}
+
 #ifdef CONFIG_DEBUG_MUTEXES
 extern void debug_mutex_lock_common(struct mutex *lock,
 				    struct mutex_waiter *waiter);
-- 
2.45.2.993.g49e7a77208-goog



* [PATCH v11 4/7] sched: Add move_queued_task_locked helper
  2024-07-09 20:31 [PATCH v11 0/7] Preparatory changes for Proxy Execution v11 John Stultz
                   ` (2 preceding siblings ...)
  2024-07-09 20:31 ` [PATCH v11 3/7] locking/mutex: Expose __mutex_owner() John Stultz
@ 2024-07-09 20:31 ` John Stultz
  2024-07-09 20:31 ` [PATCH v11 5/7] sched: Consolidate pick_*_task to task_is_pushable helper John Stultz
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: John Stultz @ 2024-07-09 20:31 UTC (permalink / raw)
  To: LKML
  Cc: Connor O'Brien, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Youssef Esmat, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, kernel-team, John Stultz,
	Metin Kaya

From: Connor O'Brien <connoro@google.com>

Switch the logic that deactivates a task, sets its CPU, and
reactivates it on a different rq over to a helper that will
later be extended to push entire blocked task chains.

This patch was broken out from a larger chain migration
patch originally by Connor O'Brien.

Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Youssef Esmat <youssefesmat@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: kernel-team@android.com
Signed-off-by: Connor O'Brien <connoro@google.com>
[jstultz: split out from larger chain migration patch]
Signed-off-by: John Stultz <jstultz@google.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Metin Kaya <metin.kaya@arm.com>
Reviewed-by: Metin Kaya <metin.kaya@arm.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Qais Yousef <qyousef@layalina.io>
---
v8:
* Renamed from push_task_chain to do_push_task so it makes
  more sense without proxy-execution
v10:
* Changed name to move_queued_task_locked as suggested by Valentin
v11:
* Also use new helper in __migrate_swap_task() and
  try_steal_cookie() as suggested by Qais Yousef
* Nit cleanups suggested by Metin
---
 kernel/sched/core.c     | 13 +++----------
 kernel/sched/deadline.c |  8 ++------
 kernel/sched/rt.c       |  8 ++------
 kernel/sched/sched.h    | 12 ++++++++++++
 4 files changed, 19 insertions(+), 22 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bcf2c4cc0522..5e63dbcbc1f0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2712,9 +2712,7 @@ int push_cpu_stop(void *arg)
 
 	// XXX validate p is still the highest prio task
 	if (task_rq(p) == rq) {
-		deactivate_task(rq, p, 0);
-		set_task_cpu(p, lowest_rq->cpu);
-		activate_task(lowest_rq, p, 0);
+		move_queued_task_locked(rq, lowest_rq, p);
 		resched_curr(lowest_rq);
 	}
 
@@ -3414,9 +3412,7 @@ static void __migrate_swap_task(struct task_struct *p, int cpu)
 		rq_pin_lock(src_rq, &srf);
 		rq_pin_lock(dst_rq, &drf);
 
-		deactivate_task(src_rq, p, 0);
-		set_task_cpu(p, cpu);
-		activate_task(dst_rq, p, 0);
+		move_queued_task_locked(src_rq, dst_rq, p);
 		wakeup_preempt(dst_rq, p, 0);
 
 		rq_unpin_lock(dst_rq, &drf);
@@ -6369,10 +6365,7 @@ static bool try_steal_cookie(int this, int that)
 		if (sched_task_is_throttled(p, this))
 			goto next;
 
-		deactivate_task(src, p, 0);
-		set_task_cpu(p, this);
-		activate_task(dst, p, 0);
-
+		move_queued_task_locked(src, dst, p);
 		resched_curr(dst);
 
 		success = true;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index c75d1307d86d..3e05e239f5f6 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2442,9 +2442,7 @@ static int push_dl_task(struct rq *rq)
 		goto retry;
 	}
 
-	deactivate_task(rq, next_task, 0);
-	set_task_cpu(next_task, later_rq->cpu);
-	activate_task(later_rq, next_task, 0);
+	move_queued_task_locked(rq, later_rq, next_task);
 	ret = 1;
 
 	resched_curr(later_rq);
@@ -2530,9 +2528,7 @@ static void pull_dl_task(struct rq *this_rq)
 			if (is_migration_disabled(p)) {
 				push_task = get_push_task(src_rq);
 			} else {
-				deactivate_task(src_rq, p, 0);
-				set_task_cpu(p, this_cpu);
-				activate_task(this_rq, p, 0);
+				move_queued_task_locked(src_rq, this_rq, p);
 				dmin = p->dl.deadline;
 				resched = true;
 			}
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index aa4c1c874fa4..55bb33f367ad 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2105,9 +2105,7 @@ static int push_rt_task(struct rq *rq, bool pull)
 		goto retry;
 	}
 
-	deactivate_task(rq, next_task, 0);
-	set_task_cpu(next_task, lowest_rq->cpu);
-	activate_task(lowest_rq, next_task, 0);
+	move_queued_task_locked(rq, lowest_rq, next_task);
 	resched_curr(lowest_rq);
 	ret = 1;
 
@@ -2378,9 +2376,7 @@ static void pull_rt_task(struct rq *this_rq)
 			if (is_migration_disabled(p)) {
 				push_task = get_push_task(src_rq);
 			} else {
-				deactivate_task(src_rq, p, 0);
-				set_task_cpu(p, this_cpu);
-				activate_task(this_rq, p, 0);
+				move_queued_task_locked(src_rq, this_rq, p);
 				resched = true;
 			}
 			/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a831af102070..a24286059a6a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3479,5 +3479,17 @@ static inline void init_sched_mm_cid(struct task_struct *t) { }
 
 extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
 extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);
+#ifdef CONFIG_SMP
+static inline
+void move_queued_task_locked(struct rq *src_rq, struct rq *dst_rq, struct task_struct *task)
+{
+	lockdep_assert_rq_held(src_rq);
+	lockdep_assert_rq_held(dst_rq);
+
+	deactivate_task(src_rq, task, 0);
+	set_task_cpu(task, dst_rq->cpu);
+	activate_task(dst_rq, task, 0);
+}
+#endif
 
 #endif /* _KERNEL_SCHED_SCHED_H */
-- 
2.45.2.993.g49e7a77208-goog



* [PATCH v11 5/7] sched: Consolidate pick_*_task to task_is_pushable helper
  2024-07-09 20:31 [PATCH v11 0/7] Preparatory changes for Proxy Execution v11 John Stultz
                   ` (3 preceding siblings ...)
  2024-07-09 20:31 ` [PATCH v11 4/7] sched: Add move_queued_task_locked helper John Stultz
@ 2024-07-09 20:31 ` John Stultz
  2024-07-09 20:31 ` [PATCH v11 6/7] sched: Split out __schedule() deactivate task logic into a helper John Stultz
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: John Stultz @ 2024-07-09 20:31 UTC (permalink / raw)
  To: LKML
  Cc: Connor O'Brien, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Youssef Esmat, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, kernel-team, John Stultz,
	Metin Kaya

From: Connor O'Brien <connoro@google.com>

Consolidate the rt and deadline pick_*_task() functions into a
single task_is_pushable() helper.

This patch was broken out from a larger chain migration
patch originally by Connor O'Brien.

Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Youssef Esmat <youssefesmat@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: kernel-team@android.com
Signed-off-by: Connor O'Brien <connoro@google.com>
[jstultz: split out from larger chain migration patch,
 renamed helper function]
Signed-off-by: John Stultz <jstultz@google.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Metin Kaya <metin.kaya@arm.com>
Reviewed-by: Metin Kaya <metin.kaya@arm.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
---
v7:
* Split from chain migration patch
* Renamed function
v11:
* Switched to a bool return for now to simplify review; it will
  become a tri-state return later in the series when that is
  actually needed. Suggested by Metin and others.
---
 kernel/sched/deadline.c | 10 +---------
 kernel/sched/rt.c       | 11 +----------
 kernel/sched/sched.h    |  9 +++++++++
 3 files changed, 11 insertions(+), 19 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 3e05e239f5f6..ef135776e068 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2178,14 +2178,6 @@ static void task_fork_dl(struct task_struct *p)
 /* Only try algorithms three times */
 #define DL_MAX_TRIES 3
 
-static int pick_dl_task(struct rq *rq, struct task_struct *p, int cpu)
-{
-	if (!task_on_cpu(rq, p) &&
-	    cpumask_test_cpu(cpu, &p->cpus_mask))
-		return 1;
-	return 0;
-}
-
 /*
  * Return the earliest pushable rq's task, which is suitable to be executed
  * on the CPU, NULL otherwise:
@@ -2204,7 +2196,7 @@ static struct task_struct *pick_earliest_pushable_dl_task(struct rq *rq, int cpu
 	if (next_node) {
 		p = __node_2_pdl(next_node);
 
-		if (pick_dl_task(rq, p, cpu))
+		if (task_is_pushable(rq, p, cpu))
 			return p;
 
 		next_node = rb_next(next_node);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 55bb33f367ad..56363e18949a 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1790,15 +1790,6 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
 /* Only try algorithms three times */
 #define RT_MAX_TRIES 3
 
-static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu)
-{
-	if (!task_on_cpu(rq, p) &&
-	    cpumask_test_cpu(cpu, &p->cpus_mask))
-		return 1;
-
-	return 0;
-}
-
 /*
  * Return the highest pushable rq's task, which is suitable to be executed
  * on the CPU, NULL otherwise
@@ -1812,7 +1803,7 @@ static struct task_struct *pick_highest_pushable_task(struct rq *rq, int cpu)
 		return NULL;
 
 	plist_for_each_entry(p, head, pushable_tasks) {
-		if (pick_rt_task(rq, p, cpu))
+		if (task_is_pushable(rq, p, cpu))
 			return p;
 	}
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a24286059a6a..493de4cc320a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3490,6 +3490,15 @@ void move_queued_task_locked(struct rq *src_rq, struct rq *dst_rq, struct task_s
 	set_task_cpu(task, dst_rq->cpu);
 	activate_task(dst_rq, task, 0);
 }
+
+static inline
+bool task_is_pushable(struct rq *rq, struct task_struct *p, int cpu)
+{
+	if (!task_on_cpu(rq, p) &&
+	    cpumask_test_cpu(cpu, &p->cpus_mask))
+		return true;
+	return false;
+}
 #endif
 
 #endif /* _KERNEL_SCHED_SCHED_H */
-- 
2.45.2.993.g49e7a77208-goog



* [PATCH v11 6/7] sched: Split out __schedule() deactivate task logic into a helper
  2024-07-09 20:31 [PATCH v11 0/7] Preparatory changes for Proxy Execution v11 John Stultz
                   ` (4 preceding siblings ...)
  2024-07-09 20:31 ` [PATCH v11 5/7] sched: Consolidate pick_*_task to task_is_pushable helper John Stultz
@ 2024-07-09 20:31 ` John Stultz
  2024-07-09 20:31 ` [PATCH v11 7/7] sched: Split scheduler and execution contexts John Stultz
  2024-07-30 14:46 ` [PATCH v11 0/7] Preparatory changes for Proxy Execution v11 Juri Lelli
  7 siblings, 0 replies; 24+ messages in thread
From: John Stultz @ 2024-07-09 20:31 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Youssef Esmat, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, kernel-team, Metin Kaya

As we're going to re-use the deactivation logic,
split it into a helper.

Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Youssef Esmat <youssefesmat@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: kernel-team@android.com
Signed-off-by: John Stultz <jstultz@google.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Metin Kaya <metin.kaya@arm.com>
Reviewed-by: Metin Kaya <metin.kaya@arm.com>
Reviewed-by: Qais Yousef <qyousef@layalina.io>
---
v6:
* Define function as static to avoid "no previous prototype"
  warnings as Reported-by: kernel test robot <lkp@intel.com>
v7:
* Rename state task_state to be more clear, as suggested by
  Metin Kaya
v11:
* Return early to simplify indentation, and drop the unused bool
  return (it will be reintroduced later when it's needed), as
  suggested by Qais.
---
 kernel/sched/core.c | 71 +++++++++++++++++++++++++++------------------
 1 file changed, 42 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5e63dbcbc1f0..029e7ecf5ea9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6566,6 +6566,47 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 # define SM_MASK_PREEMPT	SM_PREEMPT
 #endif
 
+/*
+ * Helper function for __schedule()
+ *
+ * If the task has no signals pending, deactivate it.
+ * Otherwise, mark the task's __state as TASK_RUNNING.
+ */
+static void try_to_deactivate_task(struct rq *rq, struct task_struct *p,
+				   unsigned long task_state)
+{
+	if (signal_pending_state(task_state, p)) {
+		WRITE_ONCE(p->__state, TASK_RUNNING);
+		return;
+	}
+
+	p->sched_contributes_to_load =
+		(task_state & TASK_UNINTERRUPTIBLE) &&
+		!(task_state & TASK_NOLOAD) &&
+		!(task_state & TASK_FROZEN);
+
+	if (p->sched_contributes_to_load)
+		rq->nr_uninterruptible++;
+
+	/*
+	 * __schedule()			ttwu()
+	 *   prev_state = prev->state;    if (p->on_rq && ...)
+	 *   if (prev_state)		    goto out;
+	 *     p->on_rq = 0;		  smp_acquire__after_ctrl_dep();
+	 *				  p->state = TASK_WAKING
+	 *
+	 * Where __schedule() and ttwu() have matching control dependencies.
+	 *
+	 * After this, schedule() must not care about p->state any more.
+	 */
+	deactivate_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
+
+	if (p->in_iowait) {
+		atomic_inc(&rq->nr_iowait);
+		delayacct_blkio_start();
+	}
+}
+
 /*
  * __schedule() is the main scheduler function.
  *
@@ -6659,35 +6700,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
 	 */
 	prev_state = READ_ONCE(prev->__state);
 	if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) {
-		if (signal_pending_state(prev_state, prev)) {
-			WRITE_ONCE(prev->__state, TASK_RUNNING);
-		} else {
-			prev->sched_contributes_to_load =
-				(prev_state & TASK_UNINTERRUPTIBLE) &&
-				!(prev_state & TASK_NOLOAD) &&
-				!(prev_state & TASK_FROZEN);
-
-			if (prev->sched_contributes_to_load)
-				rq->nr_uninterruptible++;
-
-			/*
-			 * __schedule()			ttwu()
-			 *   prev_state = prev->state;    if (p->on_rq && ...)
-			 *   if (prev_state)		    goto out;
-			 *     p->on_rq = 0;		  smp_acquire__after_ctrl_dep();
-			 *				  p->state = TASK_WAKING
-			 *
-			 * Where __schedule() and ttwu() have matching control dependencies.
-			 *
-			 * After this, schedule() must not care about p->state any more.
-			 */
-			deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
-
-			if (prev->in_iowait) {
-				atomic_inc(&rq->nr_iowait);
-				delayacct_blkio_start();
-			}
-		}
+		try_to_deactivate_task(rq, prev, prev_state);
 		switch_count = &prev->nvcsw;
 	}
 
-- 
2.45.2.993.g49e7a77208-goog



* [PATCH v11 7/7] sched: Split scheduler and execution contexts
  2024-07-09 20:31 [PATCH v11 0/7] Preparatory changes for Proxy Execution v11 John Stultz
                   ` (5 preceding siblings ...)
  2024-07-09 20:31 ` [PATCH v11 6/7] sched: Split out __schedule() deactivate task logic into a helper John Stultz
@ 2024-07-09 20:31 ` John Stultz
  2024-07-12 15:01   ` Peter Zijlstra
  2024-07-30 14:46 ` [PATCH v11 0/7] Preparatory changes for Proxy Execution v11 Juri Lelli
  7 siblings, 1 reply; 24+ messages in thread
From: John Stultz @ 2024-07-09 20:31 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Youssef Esmat,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Xuewen Yan, K Prateek Nayak, Metin Kaya,
	Thomas Gleixner, Daniel Lezcano, kernel-team, Connor O'Brien,
	John Stultz, Metin Kaya

From: Peter Zijlstra <peterz@infradead.org>

Let's define the scheduling context as all the scheduler state
in task_struct for the task selected to run, and the execution
context as all state required to actually run the task.

Currently both are intertwined in task_struct. We want to
logically split these such that we can use the scheduling
context of the task selected to be scheduled, but use the
execution context of a different task to actually be run.

To this end, introduce the rq_selected() macro, which points to
the task_struct selected from the runqueue by the scheduler and
will be used for scheduler state, and preserve rq->curr to
indicate the execution context of the task that will actually be
run.

Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Youssef Esmat <youssefesmat@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: kernel-team@android.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/20181009092434.26221-5-juri.lelli@redhat.com
[add additional comments and update more sched_class code to use
 rq::proxy]
Signed-off-by: Connor O'Brien <connoro@google.com>
[jstultz: Rebased and resolved minor collisions, reworked to use
 accessors, tweaked update_curr_common to use rq_proxy fixing rt
 scheduling issues]
Signed-off-by: John Stultz <jstultz@google.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Metin Kaya <metin.kaya@arm.com>
Reviewed-by: Metin Kaya <metin.kaya@arm.com>
---
v2:
* Reworked to use accessors
* Fixed update_curr_common to use proxy instead of curr
v3:
* Tweaked wrapper names
* Swapped proxy for selected for clarity
v4:
* Minor variable name tweaks for readability
* Use a macro instead of a inline function and drop
  other helper functions as suggested by Peter.
* Remove verbose comments/questions to avoid review
  distractions, as suggested by Dietmar
v5:
* Add CONFIG_PROXY_EXEC option to this patch so the
  new logic can be tested with this change
* Minor fix to grab rq_selected when holding the rq lock
v7:
* Minor spelling fix and unused argument fixes suggested by
  Metin Kaya
* Switch to curr_selected for consistency, and minor rewording
  of commit message for clarity
* Rename variables selected instead of curr when we're using
  rq_selected()
* Reduce macros in CONFIG_SCHED_PROXY_EXEC ifdef sections,
  as suggested by Metin Kaya
v8:
* Use rq->curr, not rq_selected with task_tick, as suggested by
  Valentin
* Minor rework to reorder this with CONFIG_SCHED_PROXY_EXEC patch
v10:
* Use rq_selected in push_rt_task & get_push_task
v11:
* Rework to use selected instead of curr in a few cases we were
  previously assigning curr = rq_selected() to minimize lines of
  change. Suggested by Metin.
---
 kernel/sched/core.c     | 46 ++++++++++++++++++++++++---------------
 kernel/sched/deadline.c | 39 +++++++++++++++++----------------
 kernel/sched/fair.c     | 32 +++++++++++++--------------
 kernel/sched/rt.c       | 48 ++++++++++++++++++++---------------------
 kernel/sched/sched.h    | 27 ++++++++++++++++++++---
 5 files changed, 113 insertions(+), 79 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 029e7ecf5ea9..17036bae4a27 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -794,7 +794,7 @@ static enum hrtimer_restart hrtick(struct hrtimer *timer)
 
 	rq_lock(rq, &rf);
 	update_rq_clock(rq);
-	rq->curr->sched_class->task_tick(rq, rq->curr, 1);
+	rq_selected(rq)->sched_class->task_tick(rq, rq->curr, 1);
 	rq_unlock(rq, &rf);
 
 	return HRTIMER_NORESTART;
@@ -2236,16 +2236,18 @@ static inline void check_class_changed(struct rq *rq, struct task_struct *p,
 
 void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags)
 {
-	if (p->sched_class == rq->curr->sched_class)
-		rq->curr->sched_class->wakeup_preempt(rq, p, flags);
-	else if (sched_class_above(p->sched_class, rq->curr->sched_class))
+	struct task_struct *selected = rq_selected(rq);
+
+	if (p->sched_class == selected->sched_class)
+		selected->sched_class->wakeup_preempt(rq, p, flags);
+	else if (sched_class_above(p->sched_class, selected->sched_class))
 		resched_curr(rq);
 
 	/*
 	 * A queue event has occurred, and we're going to schedule.  In
 	 * this case, we can save a useless back to back clock update.
 	 */
-	if (task_on_rq_queued(rq->curr) && test_tsk_need_resched(rq->curr))
+	if (task_on_rq_queued(selected) && test_tsk_need_resched(rq->curr))
 		rq_clock_skip_update(rq);
 }
 
@@ -2772,7 +2774,7 @@ __do_set_cpus_allowed(struct task_struct *p, struct affinity_context *ctx)
 		lockdep_assert_held(&p->pi_lock);
 
 	queued = task_on_rq_queued(p);
-	running = task_current(rq, p);
+	running = task_current_selected(rq, p);
 
 	if (queued) {
 		/*
@@ -5593,7 +5595,7 @@ unsigned long long task_sched_runtime(struct task_struct *p)
 	 * project cycles that may never be accounted to this
 	 * thread, breaking clock_gettime().
 	 */
-	if (task_current(rq, p) && task_on_rq_queued(p)) {
+	if (task_current_selected(rq, p) && task_on_rq_queued(p)) {
 		prefetch_curr_exec_start(p);
 		update_rq_clock(rq);
 		p->sched_class->update_curr(rq);
@@ -5661,7 +5663,8 @@ void sched_tick(void)
 {
 	int cpu = smp_processor_id();
 	struct rq *rq = cpu_rq(cpu);
-	struct task_struct *curr = rq->curr;
+	/* accounting goes to the selected task */
+	struct task_struct *selected;
 	struct rq_flags rf;
 	unsigned long hw_pressure;
 	u64 resched_latency;
@@ -5672,16 +5675,17 @@ void sched_tick(void)
 	sched_clock_tick();
 
 	rq_lock(rq, &rf);
+	selected = rq_selected(rq);
 
 	update_rq_clock(rq);
 	hw_pressure = arch_scale_hw_pressure(cpu_of(rq));
 	update_hw_load_avg(rq_clock_task(rq), rq, hw_pressure);
-	curr->sched_class->task_tick(rq, curr, 0);
+	selected->sched_class->task_tick(rq, selected, 0);
 	if (sched_feat(LATENCY_WARN))
 		resched_latency = cpu_resched_latency(rq);
 	calc_global_load_tick(rq);
 	sched_core_tick(rq);
-	task_tick_mm_cid(rq, curr);
+	task_tick_mm_cid(rq, selected);
 
 	rq_unlock(rq, &rf);
 
@@ -5690,8 +5694,8 @@ void sched_tick(void)
 
 	perf_event_task_tick();
 
-	if (curr->flags & PF_WQ_WORKER)
-		wq_worker_tick(curr);
+	if (selected->flags & PF_WQ_WORKER)
+		wq_worker_tick(selected);
 
 #ifdef CONFIG_SMP
 	rq->idle_balance = idle_cpu(cpu);
@@ -5756,6 +5760,12 @@ static void sched_tick_remote(struct work_struct *work)
 		struct task_struct *curr = rq->curr;
 
 		if (cpu_online(cpu)) {
+			/*
+			 * Since this is a remote tick for full dynticks mode,
+			 * we are always sure that there is no proxy (only a
+			 * single task is running).
+			 */
+			SCHED_WARN_ON(rq->curr != rq_selected(rq));
 			update_rq_clock(rq);
 
 			if (!is_idle_task(curr)) {
@@ -6705,6 +6715,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
 	}
 
 	next = pick_next_task(rq, prev, &rf);
+	rq_set_selected(rq, next);
 	clear_tsk_need_resched(prev);
 	clear_preempt_need_resched();
 #ifdef CONFIG_SCHED_DEBUG
@@ -7215,7 +7226,7 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
 
 	prev_class = p->sched_class;
 	queued = task_on_rq_queued(p);
-	running = task_current(rq, p);
+	running = task_current_selected(rq, p);
 	if (queued)
 		dequeue_task(rq, p, queue_flag);
 	if (running)
@@ -7305,7 +7316,7 @@ void set_user_nice(struct task_struct *p, long nice)
 	}
 
 	queued = task_on_rq_queued(p);
-	running = task_current(rq, p);
+	running = task_current_selected(rq, p);
 	if (queued)
 		dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
 	if (running)
@@ -7884,7 +7895,7 @@ static int __sched_setscheduler(struct task_struct *p,
 	}
 
 	queued = task_on_rq_queued(p);
-	running = task_current(rq, p);
+	running = task_current_selected(rq, p);
 	if (queued)
 		dequeue_task(rq, p, queue_flags);
 	if (running)
@@ -9311,6 +9322,7 @@ void __init init_idle(struct task_struct *idle, int cpu)
 	rcu_read_unlock();
 
 	rq->idle = idle;
+	rq_set_selected(rq, idle);
 	rcu_assign_pointer(rq->curr, idle);
 	idle->on_rq = TASK_ON_RQ_QUEUED;
 #ifdef CONFIG_SMP
@@ -9400,7 +9412,7 @@ void sched_setnuma(struct task_struct *p, int nid)
 
 	rq = task_rq_lock(p, &rf);
 	queued = task_on_rq_queued(p);
-	running = task_current(rq, p);
+	running = task_current_selected(rq, p);
 
 	if (queued)
 		dequeue_task(rq, p, DEQUEUE_SAVE);
@@ -10505,7 +10517,7 @@ void sched_move_task(struct task_struct *tsk)
 
 	update_rq_clock(rq);
 
-	running = task_current(rq, tsk);
+	running = task_current_selected(rq, tsk);
 	queued = task_on_rq_queued(tsk);
 
 	if (queued)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index ef135776e068..dbfa14ff16ed 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1217,7 +1217,7 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 #endif
 
 	enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);
-	if (dl_task(rq->curr))
+	if (dl_task(rq_selected(rq)))
 		wakeup_preempt_dl(rq, p, 0);
 	else
 		resched_curr(rq);
@@ -1441,11 +1441,11 @@ void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
  */
 static void update_curr_dl(struct rq *rq)
 {
-	struct task_struct *curr = rq->curr;
-	struct sched_dl_entity *dl_se = &curr->dl;
+	struct task_struct *selected = rq_selected(rq);
+	struct sched_dl_entity *dl_se = &selected->dl;
 	s64 delta_exec;
 
-	if (!dl_task(curr) || !on_dl_rq(dl_se))
+	if (!dl_task(selected) || !on_dl_rq(dl_se))
 		return;
 
 	/*
@@ -1898,7 +1898,7 @@ static int find_later_rq(struct task_struct *task);
 static int
 select_task_rq_dl(struct task_struct *p, int cpu, int flags)
 {
-	struct task_struct *curr;
+	struct task_struct *curr, *selected;
 	bool select_rq;
 	struct rq *rq;
 
@@ -1909,6 +1909,7 @@ select_task_rq_dl(struct task_struct *p, int cpu, int flags)
 
 	rcu_read_lock();
 	curr = READ_ONCE(rq->curr); /* unlocked access */
+	selected = READ_ONCE(rq_selected(rq));
 
 	/*
 	 * If we are dealing with a -deadline task, we must
@@ -1919,9 +1920,9 @@ select_task_rq_dl(struct task_struct *p, int cpu, int flags)
 	 * other hand, if it has a shorter deadline, we
 	 * try to make it stay here, it might be important.
 	 */
-	select_rq = unlikely(dl_task(curr)) &&
+	select_rq = unlikely(dl_task(selected)) &&
 		    (curr->nr_cpus_allowed < 2 ||
-		     !dl_entity_preempt(&p->dl, &curr->dl)) &&
+		     !dl_entity_preempt(&p->dl, &selected->dl)) &&
 		    p->nr_cpus_allowed > 1;
 
 	/*
@@ -1984,7 +1985,7 @@ static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
 	 * let's hope p can move out.
 	 */
 	if (rq->curr->nr_cpus_allowed == 1 ||
-	    !cpudl_find(&rq->rd->cpudl, rq->curr, NULL))
+	    !cpudl_find(&rq->rd->cpudl, rq_selected(rq), NULL))
 		return;
 
 	/*
@@ -2023,7 +2024,7 @@ static int balance_dl(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
 static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p,
 				  int flags)
 {
-	if (dl_entity_preempt(&p->dl, &rq->curr->dl)) {
+	if (dl_entity_preempt(&p->dl, &rq_selected(rq)->dl)) {
 		resched_curr(rq);
 		return;
 	}
@@ -2033,7 +2034,7 @@ static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p,
 	 * In the unlikely case current and p have the same deadline
 	 * let us try to decide what's the best thing to do...
 	 */
-	if ((p->dl.deadline == rq->curr->dl.deadline) &&
+	if ((p->dl.deadline == rq_selected(rq)->dl.deadline) &&
 	    !test_tsk_need_resched(rq->curr))
 		check_preempt_equal_dl(rq, p);
 #endif /* CONFIG_SMP */
@@ -2065,7 +2066,7 @@ static void set_next_task_dl(struct rq *rq, struct task_struct *p, bool first)
 	if (!first)
 		return;
 
-	if (rq->curr->sched_class != &dl_sched_class)
+	if (rq_selected(rq)->sched_class != &dl_sched_class)
 		update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);
 
 	deadline_queue_push_tasks(rq);
@@ -2390,8 +2391,8 @@ static int push_dl_task(struct rq *rq)
 	 * can move away, it makes sense to just reschedule
 	 * without going further in pushing next_task.
 	 */
-	if (dl_task(rq->curr) &&
-	    dl_time_before(next_task->dl.deadline, rq->curr->dl.deadline) &&
+	if (dl_task(rq_selected(rq)) &&
+	    dl_time_before(next_task->dl.deadline, rq_selected(rq)->dl.deadline) &&
 	    rq->curr->nr_cpus_allowed > 1) {
 		resched_curr(rq);
 		return 0;
@@ -2514,7 +2515,7 @@ static void pull_dl_task(struct rq *this_rq)
 			 * deadline than the current task of its runqueue.
 			 */
 			if (dl_time_before(p->dl.deadline,
-					   src_rq->curr->dl.deadline))
+					   rq_selected(src_rq)->dl.deadline))
 				goto skip;
 
 			if (is_migration_disabled(p)) {
@@ -2553,9 +2554,9 @@ static void task_woken_dl(struct rq *rq, struct task_struct *p)
 	if (!task_on_cpu(rq, p) &&
 	    !test_tsk_need_resched(rq->curr) &&
 	    p->nr_cpus_allowed > 1 &&
-	    dl_task(rq->curr) &&
+	    dl_task(rq_selected(rq)) &&
 	    (rq->curr->nr_cpus_allowed < 2 ||
-	     !dl_entity_preempt(&p->dl, &rq->curr->dl))) {
+	     !dl_entity_preempt(&p->dl, &rq_selected(rq)->dl))) {
 		push_dl_tasks(rq);
 	}
 }
@@ -2730,12 +2731,12 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p)
 		return;
 	}
 
-	if (rq->curr != p) {
+	if (rq_selected(rq) != p) {
 #ifdef CONFIG_SMP
 		if (p->nr_cpus_allowed > 1 && rq->dl.overloaded)
 			deadline_queue_push_tasks(rq);
 #endif
-		if (dl_task(rq->curr))
+		if (dl_task(rq_selected(rq)))
 			wakeup_preempt_dl(rq, p, 0);
 		else
 			resched_curr(rq);
@@ -2764,7 +2765,7 @@ static void prio_changed_dl(struct rq *rq, struct task_struct *p,
 	if (!rq->dl.overloaded)
 		deadline_queue_pull_task(rq);
 
-	if (task_current(rq, p)) {
+	if (task_current_selected(rq, p)) {
 		/*
 		 * If we now have a earlier deadline task than p,
 		 * then reschedule, provided p is still on this
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8a5b1ae0aa55..4d0d3b423220 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1140,12 +1140,12 @@ static inline void update_curr_task(struct task_struct *p, s64 delta_exec)
  */
 s64 update_curr_common(struct rq *rq)
 {
-	struct task_struct *curr = rq->curr;
+	struct task_struct *selected = rq_selected(rq);
 	s64 delta_exec;
 
-	delta_exec = update_curr_se(rq, &curr->se);
+	delta_exec = update_curr_se(rq, &selected->se);
 	if (likely(delta_exec > 0))
-		update_curr_task(curr, delta_exec);
+		update_curr_task(selected, delta_exec);
 
 	return delta_exec;
 }
@@ -1177,7 +1177,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
 
 static void update_curr_fair(struct rq *rq)
 {
-	update_curr(cfs_rq_of(&rq->curr->se));
+	update_curr(cfs_rq_of(&rq_selected(rq)->se));
 }
 
 static inline void
@@ -6646,7 +6646,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
 		s64 delta = slice - ran;
 
 		if (delta < 0) {
-			if (task_current(rq, p))
+			if (task_current_selected(rq, p))
 				resched_curr(rq);
 			return;
 		}
@@ -6661,12 +6661,12 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
  */
 static void hrtick_update(struct rq *rq)
 {
-	struct task_struct *curr = rq->curr;
+	struct task_struct *selected = rq_selected(rq);
 
-	if (!hrtick_enabled_fair(rq) || curr->sched_class != &fair_sched_class)
+	if (!hrtick_enabled_fair(rq) || selected->sched_class != &fair_sched_class)
 		return;
 
-	hrtick_start_fair(rq, curr);
+	hrtick_start_fair(rq, selected);
 }
 #else /* !CONFIG_SCHED_HRTICK */
 static inline void
@@ -8348,9 +8348,9 @@ static void set_next_buddy(struct sched_entity *se)
  */
 static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
 {
-	struct task_struct *curr = rq->curr;
-	struct sched_entity *se = &curr->se, *pse = &p->se;
-	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
+	struct task_struct *selected = rq_selected(rq);
+	struct sched_entity *se = &selected->se, *pse = &p->se;
+	struct cfs_rq *cfs_rq = task_cfs_rq(selected);
 	int cse_is_idle, pse_is_idle;
 
 	if (unlikely(se == pse))
@@ -8379,11 +8379,11 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
 	 * prevents us from potentially nominating it as a false LAST_BUDDY
 	 * below.
 	 */
-	if (test_tsk_need_resched(curr))
+	if (test_tsk_need_resched(rq->curr))
 		return;
 
 	/* Idle tasks are by definition preempted by non-idle tasks. */
-	if (unlikely(task_has_idle_policy(curr)) &&
+	if (unlikely(task_has_idle_policy(selected)) &&
 	    likely(!task_has_idle_policy(p)))
 		goto preempt;
 
@@ -9361,7 +9361,7 @@ static bool __update_blocked_others(struct rq *rq, bool *done)
 	 * update_load_avg() can call cpufreq_update_util(). Make sure that RT,
 	 * DL and IRQ signals have been updated before updating CFS.
 	 */
-	curr_class = rq->curr->sched_class;
+	curr_class = rq_selected(rq)->sched_class;
 
 	hw_pressure = arch_scale_hw_pressure(cpu_of(rq));
 
@@ -12738,7 +12738,7 @@ prio_changed_fair(struct rq *rq, struct task_struct *p, int oldprio)
 	 * our priority decreased, or if we are not currently running on
 	 * this runqueue and our priority is higher than the current's
 	 */
-	if (task_current(rq, p)) {
+	if (task_current_selected(rq, p)) {
 		if (p->prio > oldprio)
 			resched_curr(rq);
 	} else
@@ -12843,7 +12843,7 @@ static void switched_to_fair(struct rq *rq, struct task_struct *p)
 		 * kick off the schedule if running, otherwise just see
 		 * if we can still preempt the current task.
 		 */
-		if (task_current(rq, p))
+		if (task_current_selected(rq, p))
 			resched_curr(rq);
 		else
 			wakeup_preempt(rq, p, 0);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 56363e18949a..da4cbd744fe6 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -529,7 +529,7 @@ static void dequeue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
 
 static void sched_rt_rq_enqueue(struct rt_rq *rt_rq)
 {
-	struct task_struct *curr = rq_of_rt_rq(rt_rq)->curr;
+	struct task_struct *selected = rq_selected(rq_of_rt_rq(rt_rq));
 	struct rq *rq = rq_of_rt_rq(rt_rq);
 	struct sched_rt_entity *rt_se;
 
@@ -543,7 +543,7 @@ static void sched_rt_rq_enqueue(struct rt_rq *rt_rq)
 		else if (!on_rt_rq(rt_se))
 			enqueue_rt_entity(rt_se, 0);
 
-		if (rt_rq->highest_prio.curr < curr->prio)
+		if (rt_rq->highest_prio.curr < selected->prio)
 			resched_curr(rq);
 	}
 }
@@ -999,11 +999,11 @@ static int sched_rt_runtime_exceeded(struct rt_rq *rt_rq)
  */
 static void update_curr_rt(struct rq *rq)
 {
-	struct task_struct *curr = rq->curr;
-	struct sched_rt_entity *rt_se = &curr->rt;
+	struct task_struct *selected = rq_selected(rq);
+	struct sched_rt_entity *rt_se = &selected->rt;
 	s64 delta_exec;
 
-	if (curr->sched_class != &rt_sched_class)
+	if (selected->sched_class != &rt_sched_class)
 		return;
 
 	delta_exec = update_curr_common(rq);
@@ -1542,7 +1542,7 @@ static int find_lowest_rq(struct task_struct *task);
 static int
 select_task_rq_rt(struct task_struct *p, int cpu, int flags)
 {
-	struct task_struct *curr;
+	struct task_struct *curr, *selected;
 	struct rq *rq;
 	bool test;
 
@@ -1554,6 +1554,7 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
 
 	rcu_read_lock();
 	curr = READ_ONCE(rq->curr); /* unlocked access */
+	selected = READ_ONCE(rq_selected(rq));
 
 	/*
 	 * If the current task on @p's runqueue is an RT task, then
@@ -1582,8 +1583,8 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
 	 * systems like big.LITTLE.
 	 */
 	test = curr &&
-	       unlikely(rt_task(curr)) &&
-	       (curr->nr_cpus_allowed < 2 || curr->prio <= p->prio);
+	       unlikely(rt_task(selected)) &&
+	       (curr->nr_cpus_allowed < 2 || selected->prio <= p->prio);
 
 	if (test || !rt_task_fits_capacity(p, cpu)) {
 		int target = find_lowest_rq(p);
@@ -1613,12 +1614,8 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
 
 static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
 {
-	/*
-	 * Current can't be migrated, useless to reschedule,
-	 * let's hope p can move out.
-	 */
 	if (rq->curr->nr_cpus_allowed == 1 ||
-	    !cpupri_find(&rq->rd->cpupri, rq->curr, NULL))
+	    !cpupri_find(&rq->rd->cpupri, rq_selected(rq), NULL))
 		return;
 
 	/*
@@ -1661,7 +1658,9 @@ static int balance_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
  */
 static void wakeup_preempt_rt(struct rq *rq, struct task_struct *p, int flags)
 {
-	if (p->prio < rq->curr->prio) {
+	struct task_struct *selected = rq_selected(rq);
+
+	if (p->prio < selected->prio) {
 		resched_curr(rq);
 		return;
 	}
@@ -1679,7 +1678,7 @@ static void wakeup_preempt_rt(struct rq *rq, struct task_struct *p, int flags)
 	 * to move current somewhere else, making room for our non-migratable
 	 * task.
 	 */
-	if (p->prio == rq->curr->prio && !test_tsk_need_resched(rq->curr))
+	if (p->prio == selected->prio && !test_tsk_need_resched(rq->curr))
 		check_preempt_equal_prio(rq, p);
 #endif
 }
@@ -1704,7 +1703,7 @@ static inline void set_next_task_rt(struct rq *rq, struct task_struct *p, bool f
 	 * utilization. We only care of the case where we start to schedule a
 	 * rt task
 	 */
-	if (rq->curr->sched_class != &rt_sched_class)
+	if (rq_selected(rq)->sched_class != &rt_sched_class)
 		update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 0);
 
 	rt_queue_push_tasks(rq);
@@ -1976,6 +1975,7 @@ static struct task_struct *pick_next_pushable_task(struct rq *rq)
 
 	BUG_ON(rq->cpu != task_cpu(p));
 	BUG_ON(task_current(rq, p));
+	BUG_ON(task_current_selected(rq, p));
 	BUG_ON(p->nr_cpus_allowed <= 1);
 
 	BUG_ON(!task_on_rq_queued(p));
@@ -2008,7 +2008,7 @@ static int push_rt_task(struct rq *rq, bool pull)
 	 * higher priority than current. If that's the case
 	 * just reschedule current.
 	 */
-	if (unlikely(next_task->prio < rq->curr->prio)) {
+	if (unlikely(next_task->prio < rq_selected(rq)->prio)) {
 		resched_curr(rq);
 		return 0;
 	}
@@ -2029,7 +2029,7 @@ static int push_rt_task(struct rq *rq, bool pull)
 		 * Note that the stoppers are masqueraded as SCHED_FIFO
 		 * (cf. sched_set_stop_task()), so we can't rely on rt_task().
 		 */
-		if (rq->curr->sched_class != &rt_sched_class)
+		if (rq_selected(rq)->sched_class != &rt_sched_class)
 			return 0;
 
 		cpu = find_lowest_rq(rq->curr);
@@ -2361,7 +2361,7 @@ static void pull_rt_task(struct rq *this_rq)
 			 * p if it is lower in priority than the
 			 * current task on the run queue
 			 */
-			if (p->prio < src_rq->curr->prio)
+			if (p->prio < rq_selected(src_rq)->prio)
 				goto skip;
 
 			if (is_migration_disabled(p)) {
@@ -2403,9 +2403,9 @@ static void task_woken_rt(struct rq *rq, struct task_struct *p)
 	bool need_to_push = !task_on_cpu(rq, p) &&
 			    !test_tsk_need_resched(rq->curr) &&
 			    p->nr_cpus_allowed > 1 &&
-			    (dl_task(rq->curr) || rt_task(rq->curr)) &&
+			    (dl_task(rq_selected(rq)) || rt_task(rq_selected(rq))) &&
 			    (rq->curr->nr_cpus_allowed < 2 ||
-			     rq->curr->prio <= p->prio);
+			     rq_selected(rq)->prio <= p->prio);
 
 	if (need_to_push)
 		push_rt_tasks(rq);
@@ -2489,7 +2489,7 @@ static void switched_to_rt(struct rq *rq, struct task_struct *p)
 		if (p->nr_cpus_allowed > 1 && rq->rt.overloaded)
 			rt_queue_push_tasks(rq);
 #endif /* CONFIG_SMP */
-		if (p->prio < rq->curr->prio && cpu_online(cpu_of(rq)))
+		if (p->prio < rq_selected(rq)->prio && cpu_online(cpu_of(rq)))
 			resched_curr(rq);
 	}
 }
@@ -2504,7 +2504,7 @@ prio_changed_rt(struct rq *rq, struct task_struct *p, int oldprio)
 	if (!task_on_rq_queued(p))
 		return;
 
-	if (task_current(rq, p)) {
+	if (task_current_selected(rq, p)) {
 #ifdef CONFIG_SMP
 		/*
 		 * If our priority decreases while running, we
@@ -2530,7 +2530,7 @@ prio_changed_rt(struct rq *rq, struct task_struct *p, int oldprio)
 		 * greater than the current running task
 		 * then reschedule.
 		 */
-		if (p->prio < rq->curr->prio)
+		if (p->prio < rq_selected(rq)->prio)
 			resched_curr(rq);
 	}
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 493de4cc320a..7ee8c7fa0ae8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1051,7 +1051,7 @@ struct rq {
 	 */
 	unsigned int		nr_uninterruptible;
 
-	struct task_struct __rcu	*curr;
+	struct task_struct __rcu	*curr;       /* Execution context */
 	struct task_struct	*idle;
 	struct task_struct	*stop;
 	unsigned long		next_balance;
@@ -1246,6 +1246,13 @@ DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
 #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
 #define raw_rq()		raw_cpu_ptr(&runqueues)
 
+/* For now, rq_selected == rq->curr */
+#define rq_selected(rq)		((rq)->curr)
+static inline void rq_set_selected(struct rq *rq, struct task_struct *t)
+{
+	/* Do nothing */
+}
+
 struct sched_group;
 #ifdef CONFIG_SCHED_CORE
 static inline struct cpumask *sched_group_span(struct sched_group *sg);
@@ -2151,11 +2158,25 @@ static inline u64 global_rt_runtime(void)
 	return (u64)sysctl_sched_rt_runtime * NSEC_PER_USEC;
 }
 
+/*
+ * Is p the current execution context?
+ */
 static inline int task_current(struct rq *rq, struct task_struct *p)
 {
 	return rq->curr == p;
 }
 
+/*
+ * Is p the current scheduling context?
+ *
+ * Note that it might be the current execution context at the same time if
+ * rq->curr == rq_selected() == p.
+ */
+static inline int task_current_selected(struct rq *rq, struct task_struct *p)
+{
+	return rq_selected(rq) == p;
+}
+
 static inline int task_on_cpu(struct rq *rq, struct task_struct *p)
 {
 #ifdef CONFIG_SMP
@@ -2325,7 +2346,7 @@ struct sched_class {
 
 static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
 {
-	WARN_ON_ONCE(rq->curr != prev);
+	WARN_ON_ONCE(rq_selected(rq) != prev);
 	prev->sched_class->put_prev_task(rq, prev);
 }
 
@@ -2406,7 +2427,7 @@ extern void set_cpus_allowed_common(struct task_struct *p, struct affinity_conte
 
 static inline struct task_struct *get_push_task(struct rq *rq)
 {
-	struct task_struct *p = rq->curr;
+	struct task_struct *p = rq_selected(rq);
 
 	lockdep_assert_rq_held(rq);
 
-- 
2.45.2.993.g49e7a77208-goog


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v11 1/7] locking/mutex: Remove wakeups from under mutex::wait_lock
  2024-07-09 20:31 ` [PATCH v11 1/7] locking/mutex: Remove wakeups from under mutex::wait_lock John Stultz
@ 2024-07-10 17:41   ` K Prateek Nayak
  2024-07-12 14:48     ` Peter Zijlstra
  2024-07-12 19:53     ` John Stultz
  2024-07-12 14:52   ` Peter Zijlstra
  1 sibling, 2 replies; 24+ messages in thread
From: K Prateek Nayak @ 2024-07-10 17:41 UTC (permalink / raw)
  To: John Stultz, LKML
  Cc: Peter Zijlstra, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Youssef Esmat,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner,
	Daniel Lezcano, kernel-team, Davidlohr Bueso

Hello John,

On 7/10/2024 2:01 AM, John Stultz wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> In preparation to nest mutex::wait_lock under rq::lock we need to remove
> wakeups from under it.
> 
> Cc: Joel Fernandes <joelaf@google.com>
> Cc: Qais Yousef <qyousef@layalina.io>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Valentin Schneider <vschneid@redhat.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Zimuzo Ezeozue <zezeozue@google.com>
> Cc: Youssef Esmat <youssefesmat@google.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Will Deacon <will@kernel.org>
> Cc: Waiman Long <longman@redhat.com>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Metin Kaya <Metin.Kaya@arm.com>
> Cc: Xuewen Yan <xuewen.yan94@gmail.com>
> Cc: K Prateek Nayak <kprateek.nayak@amd.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> Cc: kernel-team@android.com
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> [Heavily changed after 55f036ca7e74 ("locking: WW mutex cleanup") and
> 08295b3b5bee ("locking: Implement an algorithm choice for Wound-Wait
> mutexes")]
> Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
> [jstultz: rebased to mainline, added extra wake_up_q & init
>   to avoid hangs, similar to Connor's rework of this patch]
> Signed-off-by: John Stultz <jstultz@google.com>
> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Tested-by: Metin Kaya <metin.kaya@arm.com>
> Acked-by: Davidlohr Bueso <dave@stgolabs.net>
> Reviewed-by: Metin Kaya <metin.kaya@arm.com>
> ---
> v5:
> * Reverted back to an earlier version of this patch to undo
>    the change that kept the wake_q in the ctx structure, as
>    that broke the rule that the wake_q must always be on the
>    stack, as its not safe for concurrency.
> v6:
> * Made tweaks suggested by Waiman Long
> v7:
> * Fixups to pass wake_qs down for PREEMPT_RT logic
> v10:
> * Switched preempt_enable to be lower close to the unlock as
>    suggested by Valentin
> * Added additional preempt_disable coverage around the wake_q
>    calls as again noted by Valentin
> ---
>   kernel/locking/mutex.c       | 17 +++++++++++++----
>   kernel/locking/rtmutex.c     | 30 +++++++++++++++++++++---------
>   kernel/locking/rwbase_rt.c   |  8 +++++++-
>   kernel/locking/rwsem.c       |  4 ++--
>   kernel/locking/spinlock_rt.c |  3 ++-
>   kernel/locking/ww_mutex.h    | 29 ++++++++++++++++++-----------
>   6 files changed, 63 insertions(+), 28 deletions(-)
> 
> diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
> index cbae8c0b89ab..4269da1f3ef5 100644
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -575,6 +575,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
>   		    struct lockdep_map *nest_lock, unsigned long ip,
>   		    struct ww_acquire_ctx *ww_ctx, const bool use_ww_ctx)
>   {
> +	DEFINE_WAKE_Q(wake_q);
>   	struct mutex_waiter waiter;
>   	struct ww_mutex *ww;
>   	int ret;
> @@ -625,7 +626,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
>   	 */
>   	if (__mutex_trylock(lock)) {
>   		if (ww_ctx)
> -			__ww_mutex_check_waiters(lock, ww_ctx);
> +			__ww_mutex_check_waiters(lock, ww_ctx, &wake_q);
>   
>   		goto skip_wait;
>   	}
> @@ -645,7 +646,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
>   		 * Add in stamp order, waking up waiters that must kill
>   		 * themselves.
>   		 */
> -		ret = __ww_mutex_add_waiter(&waiter, lock, ww_ctx);
> +		ret = __ww_mutex_add_waiter(&waiter, lock, ww_ctx, &wake_q);
>   		if (ret)
>   			goto err_early_kill;
>   	}
> @@ -681,6 +682,11 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
>   		}
>   
>   		raw_spin_unlock(&lock->wait_lock);
> +		/* Make sure we do wakeups before calling schedule */
> +		if (!wake_q_empty(&wake_q)) {

nit.

This check seems unnecessary (to my untrained eye). Any harm in
skipping it and simply doing a wake_up_q() followed by wake_q_init()
unconditionally?

> +			wake_up_q(&wake_q);
> +			wake_q_init(&wake_q);
> +		}
>   		schedule_preempt_disabled();
>   
>   		first = __mutex_waiter_is_first(lock, &waiter);
> @@ -714,7 +720,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
>   		 */
>   		if (!ww_ctx->is_wait_die &&
>   		    !__mutex_waiter_is_first(lock, &waiter))
> -			__ww_mutex_check_waiters(lock, ww_ctx);
> +			__ww_mutex_check_waiters(lock, ww_ctx, &wake_q);
>   	}
>   
>   	__mutex_remove_waiter(lock, &waiter);
> @@ -730,6 +736,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
>   		ww_mutex_lock_acquired(ww, ww_ctx);
>   
>   	raw_spin_unlock(&lock->wait_lock);
> +	wake_up_q(&wake_q);
>   	preempt_enable();
>   	return 0;
>   
> @@ -741,6 +748,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
>   	raw_spin_unlock(&lock->wait_lock);
>   	debug_mutex_free_waiter(&waiter);
>   	mutex_release(&lock->dep_map, ip);
> +	wake_up_q(&wake_q);
>   	preempt_enable();
>   	return ret;
>   }
> @@ -951,9 +959,10 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
>   	if (owner & MUTEX_FLAG_HANDOFF)
>   		__mutex_handoff(lock, next);
>   
> +	preempt_disable();
>   	raw_spin_unlock(&lock->wait_lock);
> -
>   	wake_up_q(&wake_q);
> +	preempt_enable();
>   }
>   
>   #ifndef CONFIG_DEBUG_LOCK_ALLOC
> diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
> index 88d08eeb8bc0..7a85d9bfa972 100644
> --- a/kernel/locking/rtmutex.c
> +++ b/kernel/locking/rtmutex.c
> @@ -34,13 +34,15 @@
>   
>   static inline int __ww_mutex_add_waiter(struct rt_mutex_waiter *waiter,
>   					struct rt_mutex *lock,
> -					struct ww_acquire_ctx *ww_ctx)
> +					struct ww_acquire_ctx *ww_ctx,
> +					struct wake_q_head *wake_q)
>   {
>   	return 0;
>   }
>   
>   static inline void __ww_mutex_check_waiters(struct rt_mutex *lock,
> -					    struct ww_acquire_ctx *ww_ctx)
> +					    struct ww_acquire_ctx *ww_ctx,
> +					    struct wake_q_head *wake_q)
>   {
>   }
>   
> @@ -1207,6 +1209,7 @@ static int __sched task_blocks_on_rt_mutex(struct rt_mutex_base *lock,
>   	struct rt_mutex_waiter *top_waiter = waiter;
>   	struct rt_mutex_base *next_lock;
>   	int chain_walk = 0, res;
> +	DEFINE_WAKE_Q(wake_q);
>   
>   	lockdep_assert_held(&lock->wait_lock);
>   
> @@ -1245,7 +1248,10 @@ static int __sched task_blocks_on_rt_mutex(struct rt_mutex_base *lock,
>   
>   		/* Check whether the waiter should back out immediately */
>   		rtm = container_of(lock, struct rt_mutex, rtmutex);
> -		res = __ww_mutex_add_waiter(waiter, rtm, ww_ctx);
> +		preempt_disable();
> +		res = __ww_mutex_add_waiter(waiter, rtm, ww_ctx, &wake_q);
> +		wake_up_q(&wake_q);
> +		preempt_enable();

I'm trying to understand this - we enter task_blocks_on_rt_mutex() with
"wait_lock" held (I believe the lockdep_assert_held() in the previous
hunk checks for the same). I walked down the call chain (although
briefly) and could only spot "task->pi_lock" being locked and unlocked
before this call to "wake_up_q()" but the "wait_lock" seems to be held
throughout, only being unlocked and locked again for
"rt_mutex_adjust_prio_chain()" later down.

Did I miss something or is disabling preemption for this specific hunk
enough to enable safe nesting?
--
Thanks and Regards,
Prateek

>   		if (res) {
>   			raw_spin_lock(&task->pi_lock);
>   			rt_mutex_dequeue(lock, waiter);
> [..snip..]


* Re: [PATCH v11 1/7] locking/mutex: Remove wakeups from under mutex::wait_lock
  2024-07-10 17:41   ` K Prateek Nayak
@ 2024-07-12 14:48     ` Peter Zijlstra
  2024-07-12 19:53     ` John Stultz
  1 sibling, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2024-07-12 14:48 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: John Stultz, LKML, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Youssef Esmat,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner,
	Daniel Lezcano, kernel-team, Davidlohr Bueso

On Wed, Jul 10, 2024 at 11:11:51PM +0530, K Prateek Nayak wrote:

> > @@ -681,6 +682,11 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
> >   		}
> >   		raw_spin_unlock(&lock->wait_lock);
> > +		/* Make sure we do wakeups before calling schedule */
> > +		if (!wake_q_empty(&wake_q)) {
> 
> nit.
> 
> This checks seems unnecessary (to my untrained eye). Any harm in
> skipping it and simply doing a wake_up_q() followed by wake_q_init()
> unconditionally?

Yeah, that doesn't really save anything.

> > +			wake_up_q(&wake_q);
> > +			wake_q_init(&wake_q);
> > +		}
> >   		schedule_preempt_disabled();
> >   		first = __mutex_waiter_is_first(lock, &waiter);

> > @@ -1207,6 +1209,7 @@ static int __sched task_blocks_on_rt_mutex(struct rt_mutex_base *lock,
> >   	struct rt_mutex_waiter *top_waiter = waiter;
> >   	struct rt_mutex_base *next_lock;
> >   	int chain_walk = 0, res;
> > +	DEFINE_WAKE_Q(wake_q);
> >   	lockdep_assert_held(&lock->wait_lock);
> > @@ -1245,7 +1248,10 @@ static int __sched task_blocks_on_rt_mutex(struct rt_mutex_base *lock,
> >   		/* Check whether the waiter should back out immediately */
> >   		rtm = container_of(lock, struct rt_mutex, rtmutex);
> > -		res = __ww_mutex_add_waiter(waiter, rtm, ww_ctx);
> > +		preempt_disable();
> > +		res = __ww_mutex_add_waiter(waiter, rtm, ww_ctx, &wake_q);
> > +		wake_up_q(&wake_q);
> > +		preempt_enable();
> 
> I'm trying to understand this - we enter task_blocks_on_rt_mutex() with
> "wait_lock" held (I believe the lockdep_assert_held() in the previous
> hunk checks for the same). I walked down the call chain (although
> briefly) and could only spot "task->pi_lock" being locked and unlocked
> before this call to "wake_up_q()" but the "wait_lock" seems to be held
> throughout, only being unlocked and locked again for
> "rt_mutex_adjust_prio_chain()" later down.
> 
> Did I miss something or is disabling preemption for this specific hunk
> enough to enable safe nesting?

So the whole ww-mutex stuff got significantly reworked since this patch
was started, but I think you're right, this wake_up_q() needs to be
outside wait_lock, while currently it sits inside.

One thing that helps, though, is that I think when
__ww_mutex_add_waiter() returns !0, the wake_q should be empty here.


* Re: [PATCH v11 1/7] locking/mutex: Remove wakeups from under mutex::wait_lock
  2024-07-09 20:31 ` [PATCH v11 1/7] locking/mutex: Remove wakeups from under mutex::wait_lock John Stultz
  2024-07-10 17:41   ` K Prateek Nayak
@ 2024-07-12 14:52   ` Peter Zijlstra
  2024-07-12 19:54     ` John Stultz
  1 sibling, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2024-07-12 14:52 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Youssef Esmat,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, kernel-team, Davidlohr Bueso

On Tue, Jul 09, 2024 at 01:31:44PM -0700, John Stultz wrote:
> @@ -405,8 +409,10 @@ ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
>  	 * die or wound us.
>  	 */
>  	lock_wait_lock(&lock->base);
> -	__ww_mutex_check_waiters(&lock->base, ctx);
> +	__ww_mutex_check_waiters(&lock->base, ctx, &wake_q);

  preempt_disable()

>  	unlock_wait_lock(&lock->base);
> +
> +	wake_up_q(&wake_q);

  preempt_enable();

>  }
>  
>  static __always_inline int


* Re: [PATCH v11 7/7] sched: Split scheduler and execution contexts
  2024-07-09 20:31 ` [PATCH v11 7/7] sched: Split scheduler and execution contexts John Stultz
@ 2024-07-12 15:01   ` Peter Zijlstra
  2024-07-12 19:10     ` John Stultz
  2024-07-31  9:53     ` Juri Lelli
  0 siblings, 2 replies; 24+ messages in thread
From: Peter Zijlstra @ 2024-07-12 15:01 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Youssef Esmat,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Xuewen Yan, K Prateek Nayak, Metin Kaya,
	Thomas Gleixner, Daniel Lezcano, kernel-team, Connor O'Brien

On Tue, Jul 09, 2024 at 01:31:50PM -0700, John Stultz wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> Let's define the scheduling context as all the scheduler state
> in task_struct for the task selected to run, and the execution
> context as all state required to actually run the task.
> 
> Currently both are intertwined in task_struct. We want to
> logically split these such that we can use the scheduling
> context of the task selected to be scheduled, but use the
> execution context of a different task to actually be run.
> 
> To this purpose, introduce rq_selected() macro to point to the
> task_struct selected from the runqueue by the scheduler, and
> will be used for scheduler state, and preserve rq->curr to
> indicate the execution context of the task that will actually be
> run.

> * Swapped proxy for selected for clarity

I'm not loving this naming...  what does selected even mean? What was
wrong with proxy? -- (did we have this conversation before?)

> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 493de4cc320a..7ee8c7fa0ae8 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1051,7 +1051,7 @@ struct rq {
>  	 */
>  	unsigned int		nr_uninterruptible;
>  
> -	struct task_struct __rcu	*curr;
> +	struct task_struct __rcu	*curr;       /* Execution context */
>  	struct task_struct	*idle;
>  	struct task_struct	*stop;
>  	unsigned long		next_balance;
> @@ -1246,6 +1246,13 @@ DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
>  #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
>  #define raw_rq()		raw_cpu_ptr(&runqueues)
>  
> +/* For now, rq_selected == rq->curr */
> +#define rq_selected(rq)		((rq)->curr)
> +static inline void rq_set_selected(struct rq *rq, struct task_struct *t)
> +{
> +	/* Do nothing */
> +}
> +
>  struct sched_group;
>  #ifdef CONFIG_SCHED_CORE
>  static inline struct cpumask *sched_group_span(struct sched_group *sg);
> @@ -2151,11 +2158,25 @@ static inline u64 global_rt_runtime(void)
>  	return (u64)sysctl_sched_rt_runtime * NSEC_PER_USEC;
>  }
>  
> +/*
> + * Is p the current execution context?
> + */
>  static inline int task_current(struct rq *rq, struct task_struct *p)
>  {
>  	return rq->curr == p;
>  }
>  
> +/*
> + * Is p the current scheduling context?
> + *
> + * Note that it might be the current execution context at the same time if
> + * rq->curr == rq_selected() == p.
> + */
> +static inline int task_current_selected(struct rq *rq, struct task_struct *p)
> +{
> +	return rq_selected(rq) == p;
> +}


And I think I hated on the macros before, and you said you needed them
to allow !PROXY builds.

But what about something like:

#ifdef CONFIG_PROXY_EXEC
	struct task_struct __rcu *proxy;
	struct task_struct __rcu *curr;
#else
	union {
		struct task_struct __rcu *proxy;
		struct task_struct __rcu *curr;
	};
#endif


And then we can use rq->proxy and rq->curr like the good old days?


I realize this is going to be a lot of typing to fix up, and perhaps
there's a reason to not do this. But...

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v11 7/7] sched: Split scheduler and execution contexts
  2024-07-12 15:01   ` Peter Zijlstra
@ 2024-07-12 19:10     ` John Stultz
  2024-07-29 23:54       ` John Stultz
  2024-07-31  9:11       ` Juri Lelli
  2024-07-31  9:53     ` Juri Lelli
  1 sibling, 2 replies; 24+ messages in thread
From: John Stultz @ 2024-07-12 19:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Youssef Esmat,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Xuewen Yan, K Prateek Nayak, Metin Kaya,
	Thomas Gleixner, Daniel Lezcano, kernel-team, Connor O'Brien

On Fri, Jul 12, 2024 at 8:02 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Jul 09, 2024 at 01:31:50PM -0700, John Stultz wrote:
> > From: Peter Zijlstra <peterz@infradead.org>
> >
> > Let's define the scheduling context as all the scheduler state
> > in task_struct for the task selected to run, and the execution
> > context as all state required to actually run the task.
> >
> > Currently both are intertwined in task_struct. We want to
> > logically split these such that we can use the scheduling
> > context of the task selected to be scheduled, but use the
> > execution context of a different task to actually be run.
> >
> > To this purpose, introduce rq_selected() macro to point to the
> > task_struct selected from the runqueue by the scheduler, and
> > will be used for scheduler state, and preserve rq->curr to
> > indicate the execution context of the task that will actually be
> > run.
>
> > * Swapped proxy for selected for clarity
>
> I'm not loving this naming...  what does selected even mean? What was
> wrong with proxy? -- (did we have this conversation before?)

So yeah, this came up earlier:
https://lore.kernel.org/lkml/CANDhNCr3acrEpBYd2LVkY3At=HCDZxGWqbMMwzVJ-Mn--dv3DA@mail.gmail.com/

My big concern is that the way proxy was used early in the series
seemed to be inverted from how the term is commonly used.

A proxy is one who takes an action on behalf of someone else.

In this case we have a blocked task that was picked to run, but then
we run another task in its place. Intuitively, this makes the proxy
the one that actually runs, not the one that was picked. But the
earliest versions of the patch had this flipped, and caused lots of
conceptual confusion in the discussions I had with folks about what
the patch was doing (as well as my own confusion initially working on
the patch).

So to avoid this, I've minimized the use of the term proxy in the
code, trying to use less confusing terms.

To me, selected is a more straightforward term, as it's the task the
scheduler chose to run (but not necessarily the one that actually
runs).
And curr is fine enough, as it is the currently running task.

But I'm not wedded to any particular name: "selected", "chosen", or
"picked" (which Valentin suggested earlier and I'd be ok with, though I
wanted some sense of consensus before going through to rework it all).
I do think "proxy" for the scheduler context would make the code
unnecessarily difficult for others to understand.


> > +static inline int task_current_selected(struct rq *rq, struct task_struct *p)
> > +{
> > +     return rq_selected(rq) == p;
> > +}
>
>
> And I think I hated on the macros before, and you said you needed them
> to allow !PROXY builds.
>
> But what about something like:
>
> #ifdef CONFIG_PROXY_EXEC
>         struct task_struct __rcu *proxy;
>         struct task_struct __rcu *curr;
> #else
>         union {
>                 struct task_struct __rcu *proxy;
>                 struct task_struct __rcu *curr;
>         };
> #endif
>
>
> And then we can use rq->proxy and rq->curr like the good old days?

Ok, yeah, we can use a union for that. I tend to think of unions as a
bit overly subtle for this sort of trick (the wrapper functions make
it more clear there's indirection going on), but that's a taste issue
and I'm not picky.

> I realize this is going to be a lot of typing to fix up, and perhaps
> there's a reason to not do this. But...

I have no problems reworking this if it helps get this series unstuck!

I might strongly push back against "proxy" as the name for the
scheduler context, but if it really is your favorite, I'll hold my
nose to move this along.

Do let me know your final preference and I'll go rework it all asap.

Thanks again so much for providing some feedback here! I really appreciate it!
-john

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v11 1/7] locking/mutex: Remove wakeups from under mutex::wait_lock
  2024-07-10 17:41   ` K Prateek Nayak
  2024-07-12 14:48     ` Peter Zijlstra
@ 2024-07-12 19:53     ` John Stultz
  1 sibling, 0 replies; 24+ messages in thread
From: John Stultz @ 2024-07-12 19:53 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: LKML, Peter Zijlstra, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Youssef Esmat,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner,
	Daniel Lezcano, kernel-team, Davidlohr Bueso

On Wed, Jul 10, 2024 at 10:42 AM 'K Prateek Nayak' via kernel-team
<kernel-team@android.com> wrote:
> On 7/10/2024 2:01 AM, John Stultz wrote:
> > @@ -681,6 +682,11 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
> >               }
> >
> >               raw_spin_unlock(&lock->wait_lock);
> > +             /* Make sure we do wakeups before calling schedule */
> > +             if (!wake_q_empty(&wake_q)) {
>
> nit.
>
> This checks seems unnecessary (to my untrained eye). Any harm in
> skipping it and simply doing a wake_up_q() followed by wake_q_init()
> unconditionally?
>
> > +                     wake_up_q(&wake_q);
> > +                     wake_q_init(&wake_q);
> > +             }
> >               schedule_preempt_disabled();

Ah, thank you for the suggestion. I've reworked this in my tree!

> > @@ -1207,6 +1209,7 @@ static int __sched task_blocks_on_rt_mutex(struct rt_mutex_base *lock,
> >       struct rt_mutex_waiter *top_waiter = waiter;
> >       struct rt_mutex_base *next_lock;
> >       int chain_walk = 0, res;
> > +     DEFINE_WAKE_Q(wake_q);
> >
> >       lockdep_assert_held(&lock->wait_lock);
> >
> > @@ -1245,7 +1248,10 @@ static int __sched task_blocks_on_rt_mutex(struct rt_mutex_base *lock,
> >
> >               /* Check whether the waiter should back out immediately */
> >               rtm = container_of(lock, struct rt_mutex, rtmutex);
> > -             res = __ww_mutex_add_waiter(waiter, rtm, ww_ctx);
> > +             preempt_disable();
> > +             res = __ww_mutex_add_waiter(waiter, rtm, ww_ctx, &wake_q);
> > +             wake_up_q(&wake_q);
> > +             preempt_enable();
>
> I'm trying to understand this - we enter task_blocks_on_rt_mutex() with
> "wait_lock" held (I believe the lockdep_assert_held() in the previous
> hunk checks for the same). I walked down the call chain (although
> briefly) and could only spot "task->pi_lock" being locked and unlocked
> before this call to "wake_up_q()" but the "wait_lock" seems to be held
> throughout, only being unlocked and locked again for
> "rt_mutex_adjust_prio_chain()" later down.
>
> Did I miss something or is disabling preemption for this specific hunk
> enough to enable safe nesting?

Thank you for pointing this out! It looks like I need to pipe the
wake_q down through task_blocks_on_rt_mutex() and
rtlock_slowlock_locked(), and define one in rtlock_slowlock().

Really appreciate the review and feedback here!
-john

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v11 1/7] locking/mutex: Remove wakeups from under mutex::wait_lock
  2024-07-12 14:52   ` Peter Zijlstra
@ 2024-07-12 19:54     ` John Stultz
  0 siblings, 0 replies; 24+ messages in thread
From: John Stultz @ 2024-07-12 19:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Youssef Esmat,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, kernel-team, Davidlohr Bueso

On Fri, Jul 12, 2024 at 7:52 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Jul 09, 2024 at 01:31:44PM -0700, John Stultz wrote:
> > @@ -405,8 +409,10 @@ ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
> >        * die or wound us.
> >        */
> >       lock_wait_lock(&lock->base);
> > -     __ww_mutex_check_waiters(&lock->base, ctx);
> > +     __ww_mutex_check_waiters(&lock->base, ctx, &wake_q);
>
>   preempt_disable()
>
> >       unlock_wait_lock(&lock->base);
> > +
> > +     wake_up_q(&wake_q);
>
>   preempt_enable();

Ah. Will fix for the next version. Thank you for the review and
pointing this out!
-john

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v11 7/7] sched: Split scheduler and execution contexts
  2024-07-12 19:10     ` John Stultz
@ 2024-07-29 23:54       ` John Stultz
  2024-07-31  9:11       ` Juri Lelli
  1 sibling, 0 replies; 24+ messages in thread
From: John Stultz @ 2024-07-29 23:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Youssef Esmat,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Xuewen Yan, K Prateek Nayak, Metin Kaya,
	Thomas Gleixner, Daniel Lezcano, kernel-team

On Fri, Jul 12, 2024 at 12:10 PM John Stultz <jstultz@google.com> wrote:
> On Fri, Jul 12, 2024 at 8:02 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Tue, Jul 09, 2024 at 01:31:50PM -0700, John Stultz wrote:
> > > From: Peter Zijlstra <peterz@infradead.org>
> > >
> > > Let's define the scheduling context as all the scheduler state
> > > in task_struct for the task selected to run, and the execution
> > > context as all state required to actually run the task.
> > >
> > > Currently both are intertwined in task_struct. We want to
> > > logically split these such that we can use the scheduling
> > > context of the task selected to be scheduled, but use the
> > > execution context of a different task to actually be run.
> > >
> > > To this purpose, introduce rq_selected() macro to point to the
> > > task_struct selected from the runqueue by the scheduler, and
> > > will be used for scheduler state, and preserve rq->curr to
> > > indicate the execution context of the task that will actually be
> > > run.
> >
> > > * Swapped proxy for selected for clarity
> >
> > I'm not loving this naming...  what does selected even mean? What was
> > wrong with proxy? -- (did we have this conversation before?)
>
> So yeah, this came up earlier:
> https://lore.kernel.org/lkml/CANDhNCr3acrEpBYd2LVkY3At=HCDZxGWqbMMwzVJ-Mn--dv3DA@mail.gmail.com/
>
> My big concern is that the way proxy was used early in the series
> seemed to be inverted from how the term is commonly used.
>
> A proxy is one who takes an action on behalf of someone else.
>
> In this case we have a blocked task that was picked to run, but then
> we run another task in its place. Intuitively, this makes the proxy
> the one that actually runs, not the one that was picked. But the
> earliest versions of the patch had this flipped, and caused lots of
> conceptual confusion in the discussions I had with folks about what
> the patch was doing (as well as my own confusion initially working on
> the patch).
>
> So to avoid this, I've minimized the use of the term proxy in the
> code, trying to use less confusing terms.
>
> To me, selected is a more straightforward term, as it's the task the
> scheduler chose to run (but not necessarily the one that actually
> runs).
> And curr is fine enough, as it is the currently running task.
>
> But I'm not wedded to any particular name: "selected", "chosen", or
> "picked" (which Valentin suggested earlier and I'd be ok with, though I
> wanted some sense of consensus before going through to rework it all).
> I do think "proxy" for the scheduler context would make the code
> unnecessarily difficult for others to understand.
...

> > I realize this is going to be a lot of typing to fix up, and perhaps
> > there's a reason to not do this. But...
>
> I have no problems reworking this if it helps get this series unstuck!
>
> I might strongly push back against "proxy" as the name for the
> scheduler context, but if it really is your favorite, I'll hold my
> nose to move this along.
>
> Do let me know your final preference and I'll go rework it all asap.

Hey Peter,
  Just wanted to check in on this in case the mail slipped by. I'm
eager to rework the naming and get this re-sent, but I just wanted to
confirm your preference here.

thanks
-john

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v11 0/7] Preparatory changes for Proxy Execution v11
  2024-07-09 20:31 [PATCH v11 0/7] Preparatory changes for Proxy Execution v11 John Stultz
                   ` (6 preceding siblings ...)
  2024-07-09 20:31 ` [PATCH v11 7/7] sched: Split scheduler and execution contexts John Stultz
@ 2024-07-30 14:46 ` Juri Lelli
  7 siblings, 0 replies; 24+ messages in thread
From: Juri Lelli @ 2024-07-30 14:46 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Youssef Esmat,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, kernel-team, Luca Abeni

Hi John,

On 09/07/24 13:31, John Stultz wrote:
> Hey All,
> 
> I wanted to send out v11 of the preparatory patches for Proxy
> Execution - an approach for a generalized form of priority
> inheritance. Here again, I’m only submitting the early /
> preparatory changes for review, in the hope that we can move
> these more straightforward patches along and then iteratively
> move through the more interesting patches in the Proxy Execution
> series. That said, I’ve not gotten a ton of feedback with this
> approach, so I’m open to other suggestions.

I'd actually have some additional thoughts on what we discussed at
OSPM24. Hope it's OK if I use this cover letter as a starting point to
discuss that further. Please don't hesitate to say if you would rather
we have that discussion separately, after we agree on this first split
of the series (I don't think - or at least I hope - whatever we decide
about the migration logic will need changes in this set).

...

> Issues still to address with the full series:

...

> * The chain migration functionality needs further iterations and
>   better validation to ensure it truly maintains the RT/DL load
>   balancing invariants (despite this being broken in vanilla
>   upstream with RT_PUSH_IPI currently)
> * At OSPM, Juri Lelli and the (very very sadly) late Daniel
>   Bristot de Oliveira raised the point that Proxy Exec may not
>   actually be generalizable for SCHED_DEADLINE tasks, as one
>   cannot always correctly donate the resources of the waiter to
>   an owner on a different cpu. If one was to reverse the
>   proxy-migration direction, migrating the owner to the waiter
>   cpu, this would preserve the SCHED_DEADLINE bandwidth
>   calculations, but would break down if the owner's cpu affinity
>   disallowed it. To my understanding this constraint seems to
>   make most forms of priority inheritance infeasible with
>   SCHED_DEADLINE, but I’ll have to leave that to the
>   folks/academics who know it well. After talking with Juri, my
>   current plan is just to special case find_proxy_task() to not
>   proxy with SCHED_DEADLINE (falling back to the current behavior
>   where we deactivate the waiting task). But SCHED_NORMAL waiter
>   tasks would still be able to benefit from Proxy Exec.

So, I've been discussing this a bit with Luca (now cc-ed), Tommaso and
Enrico (whom I think you met at OSPM24 and/or at some previous
editions). Please consider that I am essentially thinking out loud, so
I'm pretty sure I'm missing details and may just be wrong. But, tl;dr,
it looks like we could somewhat reconcile the current implementation
(i.e. donors move to the owner's CPU) with what the SCHED_DEADLINE
proxy execution theory (M-BWI [1]) wants if we only migrate the
top waiter (the donor, a single task) to the owner's CPU, possibly
swapping it with the next highest priority task enqueued on the
owner's CPU so that global invariants are respected. In this case we
would leave other potential donors on their CPUs and either ignore
them when picking tasks for execution or do slightly fancier things
for DEADLINE (that can come at a later stage, but we would need to
consume the runtime of DEADLINE entities even if the owner is running
some place else; let's try to ignore this detail for now, I suggest).

Not sure if it makes any sense at all to you/others, but here it is. :)
Hope we can consider the alternative and discuss about it. I actually
wonder if it wouldn't also simplify blocking chains management a bit (no
need to migrate chains around anymore), but I'd guess it might
complicate local scheduling "a bit".

Please let me know what you think and/or if you would like to leave this
for a later stage.

Best,
Juri

1 - https://retis.santannapisa.it/~tommaso/publications/ECRTS-2010.pdf


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v11 7/7] sched: Split scheduler and execution contexts
  2024-07-12 19:10     ` John Stultz
  2024-07-29 23:54       ` John Stultz
@ 2024-07-31  9:11       ` Juri Lelli
  2024-07-31 11:37         ` Peter Zijlstra
  1 sibling, 1 reply; 24+ messages in thread
From: Juri Lelli @ 2024-07-31  9:11 UTC (permalink / raw)
  To: John Stultz
  Cc: Peter Zijlstra, LKML, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Youssef Esmat,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Xuewen Yan, K Prateek Nayak, Metin Kaya,
	Thomas Gleixner, Daniel Lezcano, kernel-team, Connor O'Brien

Hi John,

On 12/07/24 12:10, John Stultz wrote:
> On Fri, Jul 12, 2024 at 8:02 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Tue, Jul 09, 2024 at 01:31:50PM -0700, John Stultz wrote:
> > > From: Peter Zijlstra <peterz@infradead.org>
> > >
> > > Let's define the scheduling context as all the scheduler state
> > > in task_struct for the task selected to run, and the execution
> > > context as all state required to actually run the task.
> > >
> > > Currently both are intertwined in task_struct. We want to
> > > logically split these such that we can use the scheduling
> > > context of the task selected to be scheduled, but use the
> > > execution context of a different task to actually be run.
> > >
> > > To this purpose, introduce rq_selected() macro to point to the
> > > task_struct selected from the runqueue by the scheduler, and
> > > will be used for scheduler state, and preserve rq->curr to
> > > indicate the execution context of the task that will actually be
> > > run.
> >
> > > * Swapped proxy for selected for clarity
> >
> > I'm not loving this naming...  what does selected even mean? What was
> > wrong with proxy? -- (did we have this conversation before?)
> 
> So yeah, this came up earlier:
> https://lore.kernel.org/lkml/CANDhNCr3acrEpBYd2LVkY3At=HCDZxGWqbMMwzVJ-Mn--dv3DA@mail.gmail.com/
> 
> My big concern is that the way proxy was used early in the series
> seemed to be inverted from how the term is commonly used.
> 
> A proxy is one who takes an action on behalf of someone else.
> 
> In this case we have a blocked task that was picked to run, but then
> we run another task in its place. Intuitively, this makes the proxy
> the one that actually runs, not the one that was picked. But the
> earliest versions of the patch had this flipped, and caused lots of
> conceptual confusion in the discussions I had with folks about what
> the patch was doing (as well as my own confusion initially working on
> the patch).

I don't think I have strong preferences either way, but I actually
considered the proxy to be the blocked donor (the one picked by the
scheduler to run), as it makes the owner use its properties, acting as a
proxy for the owner.

I think that in this case find_proxy_task() might have a confusing name,
as it doesn't actually try to find a proxy, but rather the owner of a
(or a chain of) mutex. We could rename that find_owner_task() and it
seems it might make sense, as we would have

pick_again:
        next = pick_next_task(rq, rq_proxy(rq), &rf);
        rq_set_proxy(rq, next);
        if (unlikely(task_is_blocked(next))) {
                next = find_owner_task(rq, next, &rf);

where, if next was blocked, we search for the owner it is blocked on.

Anyway, just wanted to tell you the way I understood this so far. Happy
to go with what you and Peter decide to go with. :)

Best,
Juri


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v11 7/7] sched: Split scheduler and execution contexts
  2024-07-12 15:01   ` Peter Zijlstra
  2024-07-12 19:10     ` John Stultz
@ 2024-07-31  9:53     ` Juri Lelli
  1 sibling, 0 replies; 24+ messages in thread
From: Juri Lelli @ 2024-07-31  9:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: John Stultz, LKML, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Youssef Esmat,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Xuewen Yan, K Prateek Nayak, Metin Kaya,
	Thomas Gleixner, Daniel Lezcano, kernel-team, Connor O'Brien

On 12/07/24 17:01, Peter Zijlstra wrote:
> On Tue, Jul 09, 2024 at 01:31:50PM -0700, John Stultz wrote:
> > From: Peter Zijlstra <peterz@infradead.org>
> > 
> > Let's define the scheduling context as all the scheduler state
> > in task_struct for the task selected to run, and the execution
> > context as all state required to actually run the task.
> > 
> > Currently both are intertwined in task_struct. We want to
> > logically split these such that we can use the scheduling
> > context of the task selected to be scheduled, but use the
> > execution context of a different task to actually be run.
> > 
> > To this purpose, introduce rq_selected() macro to point to the
> > task_struct selected from the runqueue by the scheduler, and
> > will be used for scheduler state, and preserve rq->curr to
> > indicate the execution context of the task that will actually be
> > run.
> 
> > * Swapped proxy for selected for clarity
> 
> I'm not loving this naming...  what does selected even mean? What was
> wrong with proxy? -- (did we have this conversation before?)

Or maybe curr and sched_ctx as an alternative (if proxy confuses people
:), so that it's more direct/straightforward (even though it's not
symmetric)?


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v11 7/7] sched: Split scheduler and execution contexts
  2024-07-31  9:11       ` Juri Lelli
@ 2024-07-31 11:37         ` Peter Zijlstra
  2024-07-31 14:32           ` Steven Rostedt
                             ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Peter Zijlstra @ 2024-07-31 11:37 UTC (permalink / raw)
  To: Juri Lelli
  Cc: John Stultz, LKML, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Youssef Esmat,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Xuewen Yan, K Prateek Nayak, Metin Kaya,
	Thomas Gleixner, Daniel Lezcano, kernel-team, Connor O'Brien


Sorry for the delay, I need the earth to stop spinning so goddamn fast
:-) 36 hour days ftw or so... Oh wait, that'd mean other people also
increase the amount of crap they send my way, don't it?

Damn..

On Wed, Jul 31, 2024 at 11:11:45AM +0200, Juri Lelli wrote:
> Hi John,
> 
> On 12/07/24 12:10, John Stultz wrote:
> > On Fri, Jul 12, 2024 at 8:02 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Tue, Jul 09, 2024 at 01:31:50PM -0700, John Stultz wrote:
> > > > From: Peter Zijlstra <peterz@infradead.org>
> > > >
> > > > Let's define the scheduling context as all the scheduler state
> > > > in task_struct for the task selected to run, and the execution
> > > > context as all state required to actually run the task.
> > > >
> > > > Currently both are intertwined in task_struct. We want to
> > > > logically split these such that we can use the scheduling
> > > > context of the task selected to be scheduled, but use the
> > > > execution context of a different task to actually be run.
> > > >
> > > > To this purpose, introduce rq_selected() macro to point to the
> > > > task_struct selected from the runqueue by the scheduler, and
> > > > will be used for scheduler state, and preserve rq->curr to
> > > > indicate the execution context of the task that will actually be
> > > > run.
> > >
> > > > * Swapped proxy for selected for clarity
> > >
> > > I'm not loving this naming...  what does selected even mean? What was
> > > wrong with proxy? -- (did we have this conversation before?)
> > 
> > So yeah, this came up earlier:
> > https://lore.kernel.org/lkml/CANDhNCr3acrEpBYd2LVkY3At=HCDZxGWqbMMwzVJ-Mn--dv3DA@mail.gmail.com/
> > 
> > My big concern is that the way proxy was used early in the series
> > seemed to be inverted from how the term is commonly used.
> > 
> > A proxy is one who takes an action on behalf of someone else.

Ah, I see your confusion.

> > In this case we have a blocked task that was picked to run, but then
> > we run another task in its place. Intuitively, this makes the proxy
> > the one that actually runs, not the one that was picked. But the
> > earliest versions of the patch had this flipped, and caused lots of
> > conceptual confusion in the discussions I had with folks about what
> > the patch was doing (as well as my own confusion initially working on
> > the patch).
> 
> I don't think I have strong preferences either way, but I actually
> considered the proxy to be the blocked donor (the one picked by the
> scheduler to run), as it makes the owner use its properties, acting as a
> proxy for the owner.

This. But I suspect we both suffer from not being native English
speakers.

Would 'donor' work in this case?

Then the donor gets all the accounting done on it, while we execute
curr.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v11 7/7] sched: Split scheduler and execution contexts
  2024-07-31 11:37         ` Peter Zijlstra
@ 2024-07-31 14:32           ` Steven Rostedt
  2024-07-31 14:36           ` Phil Auld
  2024-07-31 23:32           ` John Stultz
  2 siblings, 0 replies; 24+ messages in thread
From: Steven Rostedt @ 2024-07-31 14:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Juri Lelli, John Stultz, LKML, Joel Fernandes, Qais Yousef,
	Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Ben Segall, Zimuzo Ezeozue, Youssef Esmat,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Xuewen Yan, K Prateek Nayak, Metin Kaya,
	Thomas Gleixner, Daniel Lezcano, kernel-team, Connor O'Brien

On Wed, 31 Jul 2024 13:37:20 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> > I don't think I have strong preferences either way, but I actually
> > considered the proxy to be the blocked donor (the one picked by the
> > scheduler to run), as it makes the owner use its properties, acting as a
> > proxy for the owner.  
> 
> This. But I suspect we both suffer from not being native English
> speakers.

Has nothing to do with being a native English speaker or not. "proxy" is
not a commonly used word and can easily be misused.

> 
> Would 'donor' work in this case?
> 
> Then the donor gets all the accounting done on it, while we execute
> curr.

I agree that "donor" is a bit easier to understand.

-- Steve

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v11 7/7] sched: Split scheduler and execution contexts
  2024-07-31 11:37         ` Peter Zijlstra
  2024-07-31 14:32           ` Steven Rostedt
@ 2024-07-31 14:36           ` Phil Auld
  2024-07-31 23:32           ` John Stultz
  2 siblings, 0 replies; 24+ messages in thread
From: Phil Auld @ 2024-07-31 14:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Juri Lelli, John Stultz, LKML, Joel Fernandes, Qais Yousef,
	Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Youssef Esmat, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Xuewen Yan, K Prateek Nayak, Metin Kaya,
	Thomas Gleixner, Daniel Lezcano, kernel-team, Connor O'Brien

On Wed, Jul 31, 2024 at 01:37:20PM +0200 Peter Zijlstra wrote:
> 
> Sorry for the delay, I need the earth to stop spinning so goddamn fast
> :-) 36 hours days ftw or so... Oh wait, that'd mean other people also
> increase the amount of crap they send my way, don't it?
> 
> Damn..
> 
> On Wed, Jul 31, 2024 at 11:11:45AM +0200, Juri Lelli wrote:
> > Hi John,
> > 
> > On 12/07/24 12:10, John Stultz wrote:
> > > On Fri, Jul 12, 2024 at 8:02 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > > >
> > > > On Tue, Jul 09, 2024 at 01:31:50PM -0700, John Stultz wrote:
> > > > > From: Peter Zijlstra <peterz@infradead.org>
> > > > >
> > > > > Let's define the scheduling context as all the scheduler state
> > > > > in task_struct for the task selected to run, and the execution
> > > > > context as all state required to actually run the task.
> > > > >
> > > > > Currently both are intertwined in task_struct. We want to
> > > > > logically split these such that we can use the scheduling
> > > > > context of the task selected to be scheduled, but use the
> > > > > execution context of a different task to actually be run.
> > > > >
> > > > > To this purpose, introduce rq_selected() macro to point to the
> > > > > task_struct selected from the runqueue by the scheduler, and
> > > > > will be used for scheduler state, and preserve rq->curr to
> > > > > indicate the execution context of the task that will actually be
> > > > > run.
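[The split the quoted commit message describes can be sketched as a toy
user-space model. The rq_selected() name follows the patch description;
the struct layout and the account_tick() helper below are hypothetical
stand-ins for illustration, not the kernel's actual struct rq or code.]

```c
#include <assert.h>

/* Scheduling context: state used to decide who runs and who is charged.
 * Execution context: state of the task that actually runs on the CPU. */
struct task {
	const char *name;
	unsigned long runtime;	/* accounting state (scheduling context) */
};

struct rq {
	struct task *curr;          /* execution context: task actually run */
	struct task *curr_selected; /* scheduling context: task the scheduler picked */
};

#define rq_selected(rq) ((rq)->curr_selected)

/* Charge runtime to the scheduling context (the selected/donor task),
 * even though rq->curr is the task whose state is executing. */
static void account_tick(struct rq *rq, unsigned long delta)
{
	rq_selected(rq)->runtime += delta;
}
```

[With this split, a blocked donor picked by the scheduler accrues the
accounting while the lock owner runs in its place as rq->curr.]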
> > > >
> > > > > * Swapped proxy for selected for clarity
> > > >
> > > > I'm not loving this naming...  what does selected even mean? What was
> > > > wrong with proxy? -- (did we have this conversation before?)
> > > 
> > > So yeah, this came up earlier:
> > > https://lore.kernel.org/lkml/CANDhNCr3acrEpBYd2LVkY3At=HCDZxGWqbMMwzVJ-Mn--dv3DA@mail.gmail.com/
> > > 
> > > My big concern is that the way proxy was used early in the series
> > > seemed to be inverted from how the term is commonly used.
> > > 
> > > A proxy is one who takes an action on behalf of someone else.
> 
> Ah, I see your confusion.
> 
> > > In this case we have a blocked task that was picked to run, but then
> > > we run another task in its place. Intuitively, this makes the proxy
> > > the one that actually runs, not the one that was picked. But the
> > > earliest versions of the patch had this flipped, and caused lots of
> > > conceptual confusion in the discussions I had with folks about what
> > > the patch was doing (as well as my own confusion initially working on
> > > the patch).
> > 
> > I don't think I have strong preferences either way, but I actually
> > considered the proxy to be the blocked donor (the one picked by the
> > scheduler to run), as it makes the owner use its properties, acting as a
> > proxy for the owner.
> 
> This. But I suspect we both suffer from not being native English
> speakers.
> 
> Would 'donor' work in this case?
>
> Then the donor gets all the accounting done on it, while we execute
> curr.
> 

"donor" works for me and is more expressive (and shorter) than
"selected". Fwiw.



Cheers,
Phil

-- 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v11 7/7] sched: Split scheduler and execution contexts
  2024-07-31 11:37         ` Peter Zijlstra
  2024-07-31 14:32           ` Steven Rostedt
  2024-07-31 14:36           ` Phil Auld
@ 2024-07-31 23:32           ` John Stultz
  2024-08-01 13:30             ` Phil Auld
  2 siblings, 1 reply; 24+ messages in thread
From: John Stultz @ 2024-07-31 23:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Juri Lelli, LKML, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Youssef Esmat,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Xuewen Yan, K Prateek Nayak, Metin Kaya,
	Thomas Gleixner, Daniel Lezcano, kernel-team, Connor O'Brien

On Wed, Jul 31, 2024 at 4:37 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Sorry for the delay, I need the earth to stop spinning so goddamn fast
> :-) 36 hours days ftw or so... Oh wait, that'd mean other people also
> increase the amount of crap they send my way, don't it?
>
> Damn..

Yeah. My sympathies! I do really appreciate your taking time to
provide feedback here (amongst all the other eevdf patches and
sched_ext stuff you're reviewing).

> Would 'donor' work in this case?
>
> Then the donor gets all the accounting done on it, while we execute
> curr.

'Donor' is great, I'll switch to that.

Thanks again for getting back to me here!
-john

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v11 7/7] sched: Split scheduler and execution contexts
  2024-07-31 23:32           ` John Stultz
@ 2024-08-01 13:30             ` Phil Auld
  0 siblings, 0 replies; 24+ messages in thread
From: Phil Auld @ 2024-08-01 13:30 UTC (permalink / raw)
  To: John Stultz
  Cc: Peter Zijlstra, Juri Lelli, LKML, Joel Fernandes, Qais Yousef,
	Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Youssef Esmat, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Xuewen Yan, K Prateek Nayak, Metin Kaya,
	Thomas Gleixner, Daniel Lezcano, kernel-team, Connor O'Brien

On Wed, Jul 31, 2024 at 04:32:33PM -0700 John Stultz wrote:
> On Wed, Jul 31, 2024 at 4:37 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > Sorry for the delay, I need the earth to stop spinning so goddamn fast
> > :-) 36 hours days ftw or so... Oh wait, that'd mean other people also
> > increase the amount of crap they send my way, don't it?
> >
> > Damn..
> 
> Yeah. My sympathies! I do really appreciate your taking time to
> provide feedback here (amongst all the other eevdf patches and
> sched_ext stuff you're reviewing).
>

Not to mention way too many rust patches...


> > Would 'donor' work in this case?
> >
> > Then the donor gets all the accounting done on it, while we execute
> > curr.
> 
> 'Donor' is great, I'll switch to that.
> 
> Thanks again for getting back to me here!
> -john
> 

-- 


^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2024-08-01 13:30 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-07-09 20:31 [PATCH v11 0/7] Preparatory changes for Proxy Execution v11 John Stultz
2024-07-09 20:31 ` [PATCH v11 1/7] locking/mutex: Remove wakeups from under mutex::wait_lock John Stultz
2024-07-10 17:41   ` K Prateek Nayak
2024-07-12 14:48     ` Peter Zijlstra
2024-07-12 19:53     ` John Stultz
2024-07-12 14:52   ` Peter Zijlstra
2024-07-12 19:54     ` John Stultz
2024-07-09 20:31 ` [PATCH v11 2/7] locking/mutex: Make mutex::wait_lock irq safe John Stultz
2024-07-09 20:31 ` [PATCH v11 3/7] locking/mutex: Expose __mutex_owner() John Stultz
2024-07-09 20:31 ` [PATCH v11 4/7] sched: Add move_queued_task_locked helper John Stultz
2024-07-09 20:31 ` [PATCH v11 5/7] sched: Consolidate pick_*_task to task_is_pushable helper John Stultz
2024-07-09 20:31 ` [PATCH v11 6/7] sched: Split out __schedule() deactivate task logic into a helper John Stultz
2024-07-09 20:31 ` [PATCH v11 7/7] sched: Split scheduler and execution contexts John Stultz
2024-07-12 15:01   ` Peter Zijlstra
2024-07-12 19:10     ` John Stultz
2024-07-29 23:54       ` John Stultz
2024-07-31  9:11       ` Juri Lelli
2024-07-31 11:37         ` Peter Zijlstra
2024-07-31 14:32           ` Steven Rostedt
2024-07-31 14:36           ` Phil Auld
2024-07-31 23:32           ` John Stultz
2024-08-01 13:30             ` Phil Auld
2024-07-31  9:53     ` Juri Lelli
2024-07-30 14:46 ` [PATCH v11 0/7] Preparatory changes for Proxy Execution v11 Juri Lelli

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox