public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH v15 0/7] Single RunQueue Proxy Execution (v15)
@ 2025-03-12 22:11 John Stultz
  2025-03-12 22:11 ` [RFC PATCH v15 1/7] sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable John Stultz
                   ` (6 more replies)
  0 siblings, 7 replies; 36+ messages in thread
From: John Stultz @ 2025-03-12 22:11 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kernel-team

Hey All,

After sending out the previous version of this series and
getting some great feedback from Peter, I was pulled into a few
other directions for a bit. But I’ve been able to get back to
the proxy work the last few weeks and wanted to send this
iteration out in preparation for discussions at OSPM next week.

So here is v15 of the Proxy Execution series, a generalized form
of priority inheritance.
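
For anyone new to the idea, here is a toy userspace sketch of
the core concept (purely illustrative; all names and structures
here are invented and do not match the kernel implementation):
when the highest-priority runnable task is blocked on a mutex,
the scheduler keeps that task's scheduling context but lets the
mutex owner execute in its place, following the blocked-on
chain until it reaches a task that can actually run.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of proxy execution; not kernel code. */
struct task;

struct mutex_t {
	struct task *owner;
};

struct task {
	int prio;                   /* higher value = higher priority */
	struct mutex_t *blocked_on; /* mutex this task is waiting on */
};

/*
 * Pick the highest-priority task as the scheduling-context
 * "donor", then, if it is blocked, walk the blocked-on chain
 * to find the lock owner that should execute on its behalf.
 */
static struct task *pick_execution_ctx(struct task **rq, size_t n)
{
	struct task *donor = NULL;
	struct task *p;

	for (size_t i = 0; i < n; i++)
		if (!donor || rq[i]->prio > donor->prio)
			donor = rq[i];

	/* Follow task -> mutex -> owner until we hit a runnable task. */
	p = donor;
	while (p && p->blocked_on)
		p = p->blocked_on->owner;
	return p;
}
```

In this sketch the low-priority owner ends up running with the
high-priority waiter's "slot", which is the priority-inheritance
effect the series generalizes.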

As I’m trying to submit this work in small, digestible pieces,
this series only submits for review the logic that allows us
to do the proxying when the lock owner is on the same runqueue
as the blocked waiter. It introduces the
CONFIG_SCHED_PROXY_EXEC option and boot argument, reworks the
task_struct::blocked_on pointer and its wrapper functions,
adds an initial sketch of the find_proxy_task() logic,
includes some fixes for using split contexts, and finally
implements same-runqueue proxying.

With v15, I’ve tried to address some of Peter’s feedback,
splitting apart some patches so they are easier to review, and
breaking out some functionality that is not yet needed for
single-runqueue proxying, so that it can be introduced later,
closer to where it is necessary.

I’ve also continued working on the rest of the series, which
you can find here:
  https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v15-6.14-rc6/
  https://github.com/johnstultz-work/linux-dev.git proxy-exec-v15-6.14-rc6

New changes in the full series include:
* Having CONFIG_SCHED_PROXY_EXEC depend on EXPERT for now, as
  its use has pretty narrow value until we get to multi-runqueue
  proxying.
* Improved naming consistency and using the guard macro where
  appropriate
* Moving the blocked_on_state logic to later in the series
* Improved comments
* Build fixes for !CONFIG_SMP
* Moving the zap_balance_callback() logic to later in the series
* Fixes for when sched_proxy_exec() is disabled

Issues still to address with the full series:
* Peter suggested that instead of using (blocked_on_state ==
  BO_WAKING) as a guard against running proxy-migrated tasks
  on CPUs they are not affined to when they become unblocked,
  we could dequeue tasks first and then wake them. This does
  look to be cleaner in many ways, but the locking rework is
  significant and I’ve not worked out all the kinks with it
  yet.
* In the full series with proxy migration (again, for clarity,
  not in this same-rq proxying series I’m sending out here),
  I am still using some workarounds to avoid hitting some rare
  cases of what seem to be lost wakeups, where a task was
  marked as BO_WAKING, and the mutex it is blocked on has no
  owner, but the wakeup on the waiter never managed to
  transition it to BO_RUNNABLE. The workarounds handle doing
  the return migration from within find_proxy_task(), but I
  still feel those fixups shouldn’t be necessary, so I suspect
  the mutex unlock or ttwu logic has a race somewhere I’m
  missing.
* One new issue I found with the workarounds mentioned in the
  previous bullet is that they can cause warnings during cpu
  hotplug if we try to do manual return-migration to
  task->wake_cpu and that cpu is offline.
* K Prateek Nayak did some testing a bit over a year ago with
  an earlier version of the series and saw ~3-5% regressions
  in some cases. I’m hoping to look into this soon to see if
  we can reduce those further.
* The chain migration functionality needs further iterations and
  better validation to ensure it truly maintains the RT/DL load
  balancing invariants (despite this being broken in vanilla
  upstream with RT_PUSH_IPI currently)

I’d really appreciate any feedback or review thoughts on this
series. I’m trying to keep the chunks small, reviewable and
iteratively testable, but if you have any suggestions on how to
improve the series, I’m all ears.

Credit/Disclaimer:
---------------------
As mentioned previously, this Proxy Execution series has a long
history:

First described in a paper[1] by Watkins, Straub, and Niehaus,
it was then prototyped in patches from Peter Zijlstra and
extended with lots of work by Juri Lelli, Valentin Schneider,
and Connor O'Brien. (And thank you to Steven Rostedt for
providing additional details here!)

So again, many thanks to those above, as all the credit for this
series really is due to them - while the mistakes are likely mine.

Thanks so much!
-john

[1] https://static.lwn.net/images/conf/rtlws11/papers/proc/p38.pdf


Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kernel-team@android.com

John Stultz (3):
  sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable
  sched: Fix runtime accounting w/ split exec & sched contexts
  sched: Add an initial sketch of the find_proxy_task() function

Peter Zijlstra (2):
  locking/mutex: Rework task_struct::blocked_on
  sched: Start blocked_on chain processing in find_proxy_task()

Valentin Schneider (2):
  locking/mutex: Add p->blocked_on wrappers for correctness checks
  sched: Fix proxy/current (push,pull)ability

 .../admin-guide/kernel-parameters.txt         |   5 +
 include/linux/sched.h                         |  62 +++-
 init/Kconfig                                  |  10 +
 kernel/fork.c                                 |   3 +-
 kernel/locking/mutex-debug.c                  |   9 +-
 kernel/locking/mutex.c                        |  11 +
 kernel/locking/mutex.h                        |   3 +-
 kernel/locking/ww_mutex.h                     |  16 +-
 kernel/sched/core.c                           | 266 +++++++++++++++++-
 kernel/sched/fair.c                           |  31 +-
 kernel/sched/rt.c                             |  15 +-
 kernel/sched/sched.h                          |  22 +-
 12 files changed, 423 insertions(+), 30 deletions(-)

-- 
2.49.0.rc0.332.g42c0ae87b1-goog


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [RFC PATCH v15 1/7] sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable
  2025-03-12 22:11 [RFC PATCH v15 0/7] Single RunQueue Proxy Execution (v15) John Stultz
@ 2025-03-12 22:11 ` John Stultz
  2025-03-13 10:09   ` Steven Rostedt
  2025-03-17 14:33   ` Peter Zijlstra
  2025-03-12 22:11 ` [RFC PATCH v15 2/7] locking/mutex: Rework task_struct::blocked_on John Stultz
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 36+ messages in thread
From: John Stultz @ 2025-03-12 22:11 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kernel-team

Add a CONFIG_SCHED_PROXY_EXEC option, along with a boot argument
sched_proxy_exec= that can be used to disable the feature at
boot time if CONFIG_SCHED_PROXY_EXEC is enabled.

Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kernel-team@android.com
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: John Stultz <jstultz@google.com>
---
v7:
* Switch to CONFIG_SCHED_PROXY_EXEC/sched_proxy_exec= as
  suggested by Metin Kaya.
* Switch boot arg from =disable/enable to use kstrtobool(),
  which supports =yes|no|1|0|true|false|on|off, as also
  suggested by Metin Kaya, and print a message when a boot
  argument is used.
v8:
* Move CONFIG_SCHED_PROXY_EXEC under Scheduler Features as
  suggested by Metin
* Minor rework reordering with split sched contexts patch
v12:
* Rework for selected -> donor renaming
v14:
* Depend on !PREEMPT_RT to avoid build issues for now
v15:
* Depend on EXPERT while patch series upstreaming is
  in progress.
---
 .../admin-guide/kernel-parameters.txt         |  5 ++++
 include/linux/sched.h                         | 13 +++++++++
 init/Kconfig                                  | 10 +++++++
 kernel/sched/core.c                           | 29 +++++++++++++++++++
 kernel/sched/sched.h                          | 12 ++++++++
 5 files changed, 69 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index fb8752b42ec85..dcc2443078d00 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6262,6 +6262,11 @@
 	sa1100ir	[NET]
 			See drivers/net/irda/sa1100_ir.c.
 
+	sched_proxy_exec= [KNL]
+			Enables or disables the "proxy execution" style
+			solution to mutex-based priority inversion.
+			Format: <bool>
+
 	sched_verbose	[KNL,EARLY] Enables verbose scheduler debug messages.
 
 	schedstats=	[KNL,X86] Enable or disable scheduled statistics.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9c15365a30c08..1462f2c70aefc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1636,6 +1636,19 @@ struct task_struct {
 	 */
 };
 
+#ifdef CONFIG_SCHED_PROXY_EXEC
+DECLARE_STATIC_KEY_TRUE(__sched_proxy_exec);
+static inline bool sched_proxy_exec(void)
+{
+	return static_branch_likely(&__sched_proxy_exec);
+}
+#else
+static inline bool sched_proxy_exec(void)
+{
+	return false;
+}
+#endif
+
 #define TASK_REPORT_IDLE	(TASK_REPORT + 1)
 #define TASK_REPORT_MAX		(TASK_REPORT_IDLE << 1)
 
diff --git a/init/Kconfig b/init/Kconfig
index d0d021b3fa3b3..b989ddc27444e 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -875,6 +875,16 @@ config UCLAMP_BUCKETS_COUNT
 
 	  If in doubt, use the default value.
 
+config SCHED_PROXY_EXEC
+	bool "Proxy Execution"
+	default n
+	# Avoid some build failures w/ PREEMPT_RT until it can be fixed
+	depends on !PREEMPT_RT
+	depends on EXPERT
+	help
+	  This option enables proxy execution, a mechanism for mutex-owning
+	  tasks to inherit the scheduling context of higher priority waiters.
+
 endmenu
 
 #
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 67189907214d3..3968c3967ec38 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -119,6 +119,35 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(sched_compute_energy_tp);
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
 
+#ifdef CONFIG_SCHED_PROXY_EXEC
+DEFINE_STATIC_KEY_TRUE(__sched_proxy_exec);
+static int __init setup_proxy_exec(char *str)
+{
+	bool proxy_enable;
+
+	if (kstrtobool(str, &proxy_enable)) {
+		pr_warn("Unable to parse sched_proxy_exec=\n");
+		return 0;
+	}
+
+	if (proxy_enable) {
+		pr_info("sched_proxy_exec enabled via boot arg\n");
+		static_branch_enable(&__sched_proxy_exec);
+	} else {
+		pr_info("sched_proxy_exec disabled via boot arg\n");
+		static_branch_disable(&__sched_proxy_exec);
+	}
+	return 1;
+}
+#else
+static int __init setup_proxy_exec(char *str)
+{
+	pr_warn("CONFIG_SCHED_PROXY_EXEC=n, so it cannot be enabled or disabled at boot time\n");
+	return 0;
+}
+#endif
+__setup("sched_proxy_exec=", setup_proxy_exec);
+
 #ifdef CONFIG_SCHED_DEBUG
 /*
  * Debugging: various feature bits
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c8512a9fb0229..05d2122533619 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1155,10 +1155,15 @@ struct rq {
 	 */
 	unsigned int		nr_uninterruptible;
 
+#ifdef CONFIG_SCHED_PROXY_EXEC
+	struct task_struct __rcu	*donor;  /* Scheduling context */
+	struct task_struct __rcu	*curr;   /* Execution context */
+#else
 	union {
 		struct task_struct __rcu *donor; /* Scheduler context */
 		struct task_struct __rcu *curr;  /* Execution context */
 	};
+#endif
 	struct sched_dl_entity	*dl_server;
 	struct task_struct	*idle;
 	struct task_struct	*stop;
@@ -1355,10 +1360,17 @@ DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
 #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
 #define raw_rq()		raw_cpu_ptr(&runqueues)
 
+#ifdef CONFIG_SCHED_PROXY_EXEC
+static inline void rq_set_donor(struct rq *rq, struct task_struct *t)
+{
+	rcu_assign_pointer(rq->donor, t);
+}
+#else
 static inline void rq_set_donor(struct rq *rq, struct task_struct *t)
 {
 	/* Do nothing */
 }
+#endif
 
 #ifdef CONFIG_SCHED_CORE
 static inline struct cpumask *sched_group_span(struct sched_group *sg);
-- 
2.49.0.rc0.332.g42c0ae87b1-goog


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [RFC PATCH v15 2/7] locking/mutex: Rework task_struct::blocked_on
  2025-03-12 22:11 [RFC PATCH v15 0/7] Single RunQueue Proxy Execution (v15) John Stultz
  2025-03-12 22:11 ` [RFC PATCH v15 1/7] sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable John Stultz
@ 2025-03-12 22:11 ` John Stultz
  2025-03-13 10:13   ` Steven Rostedt
  2025-03-12 22:11 ` [RFC PATCH v15 3/7] locking/mutex: Add p->blocked_on wrappers for correctness checks John Stultz
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 36+ messages in thread
From: John Stultz @ 2025-03-12 22:11 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kernel-team, Connor O'Brien,
	John Stultz

From: Peter Zijlstra <peterz@infradead.org>

Track the blocked-on relation for mutexes, to allow following this
relation at schedule time.

   task
     | blocked-on
     v
   mutex
     | owner
     v
   task

This all will be used for tracking blocked-task/mutex chains
with the proxy-execution patch in a similar fashion to how
priority inheritance is done with rt_mutexes.

For serialization, blocked-on is only set by the task itself
(current), and both setting and clearing (the latter
potentially done by others) are done while holding the
mutex::wait_lock.
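
That serialization rule can be sketched as a toy userspace
model (a hedged illustration only; the names below are invented
stand-ins, and the real kernel helpers arrive later in the
series): setting is restricted to the task itself, clearing may
be done by others, and both require the mutex's wait_lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: invented names, not the kernel implementation. */
struct toy_mutex;

struct toy_task {
	struct toy_mutex *blocked_on;
};

struct toy_mutex {
	struct toy_task *owner;
	bool wait_lock_held;        /* stand-in for mutex::wait_lock */
};

static struct toy_task *current_task;  /* stand-in for "current" */

/* Only the task itself may mark itself blocked, under wait_lock. */
static void set_blocked_on(struct toy_task *p, struct toy_mutex *m)
{
	assert(p == current_task);
	assert(m->wait_lock_held);
	/* never silently switch to a different mutex */
	assert(!p->blocked_on || p->blocked_on == m);
	p->blocked_on = m;
}

/*
 * Clearing may be done by others (e.g. the unlocking owner),
 * but still only while holding wait_lock.
 */
static void clear_blocked_on(struct toy_task *p, struct toy_mutex *m)
{
	assert(m->wait_lock_held);
	assert(!p->blocked_on || p->blocked_on == m);
	p->blocked_on = NULL;
}
```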

Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kernel-team@android.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[minor changes while rebasing]
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Connor O'Brien <connoro@google.com>
[jstultz: Fix blocked_on tracking in __mutex_lock_common in error paths]
Signed-off-by: John Stultz <jstultz@google.com>
---
v2:
* Fixed blocked_on tracking in error paths that was causing crashes
v4:
* Ensure we clear blocked_on when waking ww_mutexes to die or wound.
  This is critical so we don't get circular blocked_on relationships
  that can't be resolved.
v5:
* Fix potential bug where the skip_wait path might clear blocked_on
  when that path never set it
* Slight tweaks to where we set blocked_on to make it consistent,
  along with extra WARN_ON correctness checking
* Minor comment changes
v7:
* Minor commit message change suggested by Metin Kaya
* Fix WARN_ON conditionals in unlock path (as blocked_on might already
  be cleared), found while looking at issue Metin Kaya raised.
* Minor tweaks to be consistent in what we do under the
  blocked_on lock, also tweaked variable name to avoid confusion
  with label, and comment typos, as suggested by Metin Kaya
* Minor tweak for CONFIG_SCHED_PROXY_EXEC name change
* Moved unused block of code to later in the series, as suggested
  by Metin Kaya
* Switch to a tri-state to be able to distinguish between
  waking and runnable so we can later safely do return
  migration from ttwu
* Folded together with related blocked_on changes
v8:
* Fix issue leaving task BO_BLOCKED when calling into optimistic
  spinning path.
* Include helper to better handle BO_BLOCKED->BO_WAKING transitions
v9:
* Typo fixup pointed out by Metin
* Cleanup BO_WAKING->BO_RUNNABLE transitions for the !proxy case
* Many cleanups and simplifications suggested by Metin
v11:
* Whitespace fixup pointed out by Metin
v13:
* Refactor set_blocked_on helpers to clean things up a bit
v14:
* Small build fixup with PREEMPT_RT
v15:
* Improve consistency of names for functions that assume blocked_lock
  is held, as suggested by Peter
* Use guard instead of separate spinlock/unlock calls, also suggested
  by Peter
* Drop blocked_on_state tri-state for now, as it's not needed
  later in the series, when we get to proxy-migration and return-
  migration.
---
 include/linux/sched.h        |  5 +----
 kernel/fork.c                |  3 +--
 kernel/locking/mutex-debug.c |  9 +++++----
 kernel/locking/mutex.c       | 19 +++++++++++++++++++
 kernel/locking/ww_mutex.h    | 18 ++++++++++++++++--
 5 files changed, 42 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1462f2c70aefc..03775c44b7073 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1212,10 +1212,7 @@ struct task_struct {
 	struct rt_mutex_waiter		*pi_blocked_on;
 #endif
 
-#ifdef CONFIG_DEBUG_MUTEXES
-	/* Mutex deadlock detection: */
-	struct mutex_waiter		*blocked_on;
-#endif
+	struct mutex			*blocked_on;	/* lock we're blocked on */
 
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
 	int				non_block_count;
diff --git a/kernel/fork.c b/kernel/fork.c
index 735405a9c5f32..38f055082d07a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2357,9 +2357,8 @@ __latent_entropy struct task_struct *copy_process(
 	lockdep_init_task(p);
 #endif
 
-#ifdef CONFIG_DEBUG_MUTEXES
 	p->blocked_on = NULL; /* not blocked yet */
-#endif
+
 #ifdef CONFIG_BCACHE
 	p->sequential_io	= 0;
 	p->sequential_io_avg	= 0;
diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
index 6e6f6071cfa27..758b7a6792b0c 100644
--- a/kernel/locking/mutex-debug.c
+++ b/kernel/locking/mutex-debug.c
@@ -53,17 +53,18 @@ void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter,
 {
 	lockdep_assert_held(&lock->wait_lock);
 
-	/* Mark the current thread as blocked on the lock: */
-	task->blocked_on = waiter;
+	/* Current thread can't be already blocked (since it's executing!) */
+	DEBUG_LOCKS_WARN_ON(task->blocked_on);
 }
 
 void debug_mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
 			 struct task_struct *task)
 {
+	struct mutex *blocked_on = READ_ONCE(task->blocked_on);
+
 	DEBUG_LOCKS_WARN_ON(list_empty(&waiter->list));
 	DEBUG_LOCKS_WARN_ON(waiter->task != task);
-	DEBUG_LOCKS_WARN_ON(task->blocked_on != waiter);
-	task->blocked_on = NULL;
+	DEBUG_LOCKS_WARN_ON(blocked_on && blocked_on != lock);
 
 	INIT_LIST_HEAD(&waiter->list);
 	waiter->task = NULL;
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index b36f23de48f1b..37d1966970617 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -627,6 +627,8 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 			goto err_early_kill;
 	}
 
+	WARN_ON(current->blocked_on);
+	current->blocked_on = lock;
 	set_current_state(state);
 	trace_contention_begin(lock, LCB_F_MUTEX);
 	for (;;) {
@@ -663,6 +665,12 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 
 		first = __mutex_waiter_is_first(lock, &waiter);
 
+		/*
+		 * As we likely have been woken up by a task
+		 * that has cleared our blocked_on state, re-set
+		 * it to the lock we are trying to acquire.
+		 */
+		current->blocked_on = lock;
 		set_current_state(state);
 		/*
 		 * Here we order against unlock; we must either see it change
@@ -683,6 +691,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 	}
 	raw_spin_lock_irqsave(&lock->wait_lock, flags);
 acquired:
+	current->blocked_on = NULL;
 	__set_current_state(TASK_RUNNING);
 
 	if (ww_ctx) {
@@ -712,9 +721,11 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 	return 0;
 
 err:
+	current->blocked_on = NULL;
 	__set_current_state(TASK_RUNNING);
 	__mutex_remove_waiter(lock, &waiter);
 err_early_kill:
+	WARN_ON(current->blocked_on);
 	trace_contention_end(lock, ret);
 	raw_spin_unlock_irqrestore_wake(&lock->wait_lock, flags, &wake_q);
 	debug_mutex_free_waiter(&waiter);
@@ -924,6 +935,14 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 		next = waiter->task;
 
 		debug_mutex_wake_waiter(lock, waiter);
+		/*
+		 * Unlock wakeups can be happening in parallel
+		 * (when optimistic spinners steal and release
+		 * the lock), so blocked_on may already be
+		 * cleared here.
+		 */
+		WARN_ON(next->blocked_on && next->blocked_on != lock);
+		next->blocked_on = NULL;
 		wake_q_add(&wake_q, next);
 	}
 
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 37f025a096c9d..00db40946328e 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -284,6 +284,14 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
 #ifndef WW_RT
 		debug_mutex_wake_waiter(lock, waiter);
 #endif
+		/*
+		 * When waking up the task to die, be sure to clear the
+		 * blocked_on pointer. Otherwise we can see circular
+		 * blocked_on relationships that can't resolve.
+		 */
+		WARN_ON(waiter->task->blocked_on &&
+			waiter->task->blocked_on != lock);
+		waiter->task->blocked_on = NULL;
 		wake_q_add(wake_q, waiter->task);
 	}
 
@@ -331,9 +339,15 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
 		 * it's wounded in __ww_mutex_check_kill() or has a
 		 * wakeup pending to re-read the wounded state.
 		 */
-		if (owner != current)
+		if (owner != current) {
+			/*
+			 * When waking up the task to wound, be sure to clear the
+			 * blocked_on pointer. Otherwise we can see circular
+			 * blocked_on relationships that can't resolve.
+			 */
+			owner->blocked_on = NULL;
 			wake_q_add(wake_q, owner);
-
+		}
 		return true;
 	}
 
-- 
2.49.0.rc0.332.g42c0ae87b1-goog


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [RFC PATCH v15 3/7] locking/mutex: Add p->blocked_on wrappers for correctness checks
  2025-03-12 22:11 [RFC PATCH v15 0/7] Single RunQueue Proxy Execution (v15) John Stultz
  2025-03-12 22:11 ` [RFC PATCH v15 1/7] sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable John Stultz
  2025-03-12 22:11 ` [RFC PATCH v15 2/7] locking/mutex: Rework task_struct::blocked_on John Stultz
@ 2025-03-12 22:11 ` John Stultz
  2025-03-12 22:11 ` [RFC PATCH v15 4/7] sched: Fix runtime accounting w/ split exec & sched contexts John Stultz
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 36+ messages in thread
From: John Stultz @ 2025-03-12 22:11 UTC (permalink / raw)
  To: LKML
  Cc: Valentin Schneider, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kernel-team,
	Connor O'Brien, John Stultz

From: Valentin Schneider <valentin.schneider@arm.com>

This lets us assert mutex::wait_lock is held whenever we access
p->blocked_on, as well as warn on unexpected state changes.

Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kernel-team@android.com
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
[fix conflicts, call in more places]
Signed-off-by: Connor O'Brien <connoro@google.com>
[jstultz: tweaked commit subject, reworked a good bit]
Signed-off-by: John Stultz <jstultz@google.com>
---
v2:
* Added get_task_blocked_on() accessor
v4:
* Address READ_ONCE usage that was dropped in v2
* Reordered to be a later add on to the main patch series as
  Peter was unhappy with similar wrappers in other patches.
v5:
* Added some extra correctness checking in wrappers
v7:
* Tweaks to reorder this change in the patch series
* Minor cleanup to set_task_blocked_on() suggested by Metin Kaya
v15:
* Split out into its own patch again.
* Further improve assumption checks in helpers.
---
 include/linux/sched.h        | 44 ++++++++++++++++++++++++++++++++++--
 kernel/locking/mutex-debug.c |  4 ++--
 kernel/locking/mutex.c       | 20 +++++-----------
 kernel/locking/ww_mutex.h    |  6 ++---
 4 files changed, 52 insertions(+), 22 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 03775c44b7073..62870077379a6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -34,6 +34,7 @@
 #include <linux/sched/prio.h>
 #include <linux/sched/types.h>
 #include <linux/signal_types.h>
+#include <linux/spinlock.h>
 #include <linux/syscall_user_dispatch_types.h>
 #include <linux/mm_types_task.h>
 #include <linux/netdevice_xmit.h>
@@ -2154,6 +2155,47 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock);
 	__cond_resched_rwlock_write(lock);					\
 })
 
+static inline void __set_task_blocked_on(struct task_struct *p, struct mutex *m)
+{
+	WARN_ON_ONCE(!m);
+	/* The task should only be setting itself as blocked */
+	WARN_ON_ONCE(p != current);
+	/* Currently we serialize blocked_on under the mutex::wait_lock */
+	lockdep_assert_held_once(&m->wait_lock);
+	/*
+	 * Check that we don't overwrite an existing mutex value
+	 * with a different mutex. Note, setting it to the same
+	 * lock repeatedly is ok.
+	 */
+	WARN_ON_ONCE(p->blocked_on && p->blocked_on != m);
+	p->blocked_on = m;
+}
+
+static inline void set_task_blocked_on(struct task_struct *p, struct mutex *m)
+{
+	guard(raw_spinlock_irqsave)(&m->wait_lock);
+	__set_task_blocked_on(p, m);
+}
+
+static inline void __clear_task_blocked_on(struct task_struct *p, struct mutex *m)
+{
+	WARN_ON_ONCE(!m);
+	/* Currently we serialize blocked_on under the mutex::wait_lock */
+	lockdep_assert_held_once(&m->wait_lock);
+	/*
+	 * There may be cases where we re-clear already cleared
+	 * blocked_on relationships, but make sure we are not
+	 * clearing the relationship with a different lock.
+	 */
+	WARN_ON_ONCE(m && p->blocked_on && p->blocked_on != m);
+	p->blocked_on = NULL;
+}
+
+static inline struct mutex *__get_task_blocked_on(struct task_struct *p)
+{
+	return READ_ONCE(p->blocked_on);
+}
+
 static __always_inline bool need_resched(void)
 {
 	return unlikely(tif_need_resched());
@@ -2193,8 +2235,6 @@ extern bool sched_task_on_rq(struct task_struct *p);
 extern unsigned long get_wchan(struct task_struct *p);
 extern struct task_struct *cpu_curr_snapshot(int cpu);
 
-#include <linux/spinlock.h>
-
 /*
  * In order to reduce various lock holder preemption latencies provide an
  * interface to see if a vCPU is currently running or not.
diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
index 758b7a6792b0c..949103fd8e9b5 100644
--- a/kernel/locking/mutex-debug.c
+++ b/kernel/locking/mutex-debug.c
@@ -54,13 +54,13 @@ void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter,
 	lockdep_assert_held(&lock->wait_lock);
 
 	/* Current thread can't be already blocked (since it's executing!) */
-	DEBUG_LOCKS_WARN_ON(task->blocked_on);
+	DEBUG_LOCKS_WARN_ON(__get_task_blocked_on(task));
 }
 
 void debug_mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
 			 struct task_struct *task)
 {
-	struct mutex *blocked_on = READ_ONCE(task->blocked_on);
+	struct mutex *blocked_on = __get_task_blocked_on(task);
 
 	DEBUG_LOCKS_WARN_ON(list_empty(&waiter->list));
 	DEBUG_LOCKS_WARN_ON(waiter->task != task);
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 37d1966970617..351500cf50ece 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -627,8 +627,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 			goto err_early_kill;
 	}
 
-	WARN_ON(current->blocked_on);
-	current->blocked_on = lock;
+	__set_task_blocked_on(current, lock);
 	set_current_state(state);
 	trace_contention_begin(lock, LCB_F_MUTEX);
 	for (;;) {
@@ -670,7 +669,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 		 * that has cleared our blocked_on state, re-set
		 * it to the lock we are trying to acquire.
 		 */
-		current->blocked_on = lock;
+		set_task_blocked_on(current, lock);
 		set_current_state(state);
 		/*
 		 * Here we order against unlock; we must either see it change
@@ -691,7 +690,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 	}
 	raw_spin_lock_irqsave(&lock->wait_lock, flags);
 acquired:
-	current->blocked_on = NULL;
+	__clear_task_blocked_on(current, lock);
 	__set_current_state(TASK_RUNNING);
 
 	if (ww_ctx) {
@@ -721,11 +720,11 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 	return 0;
 
 err:
-	current->blocked_on = NULL;
+	__clear_task_blocked_on(current, lock);
 	__set_current_state(TASK_RUNNING);
 	__mutex_remove_waiter(lock, &waiter);
 err_early_kill:
-	WARN_ON(current->blocked_on);
+	WARN_ON(__get_task_blocked_on(current));
 	trace_contention_end(lock, ret);
 	raw_spin_unlock_irqrestore_wake(&lock->wait_lock, flags, &wake_q);
 	debug_mutex_free_waiter(&waiter);
@@ -935,14 +934,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 		next = waiter->task;
 
 		debug_mutex_wake_waiter(lock, waiter);
-		/*
-		 * Unlock wakeups can be happening in parallel
-		 * (when optimistic spinners steal and release
-		 * the lock), so blocked_on may already be
-		 * cleared here.
-		 */
-		WARN_ON(next->blocked_on && next->blocked_on != lock);
-		next->blocked_on = NULL;
+		__clear_task_blocked_on(next, lock);
 		wake_q_add(&wake_q, next);
 	}
 
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 00db40946328e..086fd5487ca77 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -289,9 +289,7 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
 		 * blocked_on pointer. Otherwise we can see circular
 		 * blocked_on relationships that can't resolve.
 		 */
-		WARN_ON(waiter->task->blocked_on &&
-			waiter->task->blocked_on != lock);
-		waiter->task->blocked_on = NULL;
+		__clear_task_blocked_on(waiter->task, lock);
 		wake_q_add(wake_q, waiter->task);
 	}
 
@@ -345,7 +343,7 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
 			 * blocked_on pointer. Otherwise we can see circular
 			 * blocked_on relationships that can't resolve.
 			 */
-			owner->blocked_on = NULL;
+			__clear_task_blocked_on(owner, lock);
 			wake_q_add(wake_q, owner);
 		}
 		return true;
-- 
2.49.0.rc0.332.g42c0ae87b1-goog


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [RFC PATCH v15 4/7] sched: Fix runtime accounting w/ split exec & sched contexts
  2025-03-12 22:11 [RFC PATCH v15 0/7] Single RunQueue Proxy Execution (v15) John Stultz
                   ` (2 preceding siblings ...)
  2025-03-12 22:11 ` [RFC PATCH v15 3/7] locking/mutex: Add p->blocked_on wrappers for correctness checks John Stultz
@ 2025-03-12 22:11 ` John Stultz
  2025-03-13 10:26   ` Steven Rostedt
  2025-03-13 17:24   ` K Prateek Nayak
  2025-03-12 22:11 ` [RFC PATCH v15 5/7] sched: Add an initial sketch of the find_proxy_task() function John Stultz
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 36+ messages in thread
From: John Stultz @ 2025-03-12 22:11 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kernel-team

The idea here is that we want to charge vruntime to the scheduler-
context task, but sum_exec_runtime to the execution-context task.

This way cputime accounting goes against the task actually running
but vruntime accounting goes against the rq->donor task so we get
proper fairness.

Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kernel-team@android.com
Signed-off-by: John Stultz <jstultz@google.com>
---
 kernel/sched/fair.c | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c798d27952431..f8ad3a44b3771 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1129,22 +1129,33 @@ static void update_tg_load_avg(struct cfs_rq *cfs_rq)
 }
 #endif /* CONFIG_SMP */
 
-static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
+static s64 update_curr_se(struct rq *rq, struct sched_entity *se)
 {
 	u64 now = rq_clock_task(rq);
 	s64 delta_exec;
 
-	delta_exec = now - curr->exec_start;
+	delta_exec = now - se->exec_start;
 	if (unlikely(delta_exec <= 0))
 		return delta_exec;
 
-	curr->exec_start = now;
-	curr->sum_exec_runtime += delta_exec;
+	se->exec_start = now;
+	if (entity_is_task(se)) {
+		struct task_struct *running = rq->curr;
+		/*
+		 * If se is a task, we account the time against the running
+		 * task, as w/ proxy-exec they may not be the same.
+		 */
+		running->se.exec_start = now;
+		running->se.sum_exec_runtime += delta_exec;
+	} else {
+		/* If not task, account the time against se */
+		se->sum_exec_runtime += delta_exec;
+	}
 
 	if (schedstat_enabled()) {
 		struct sched_statistics *stats;
 
-		stats = __schedstats_from_se(curr);
+		stats = __schedstats_from_se(se);
 		__schedstat_set(stats->exec_max,
 				max(delta_exec, stats->exec_max));
 	}
-- 
2.49.0.rc0.332.g42c0ae87b1-goog


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [RFC PATCH v15 5/7] sched: Add an initial sketch of the find_proxy_task() function
  2025-03-12 22:11 [RFC PATCH v15 0/7] Single RunQueue Proxy Execution (v15) John Stultz
                   ` (3 preceding siblings ...)
  2025-03-12 22:11 ` [RFC PATCH v15 4/7] sched: Fix runtime accounting w/ split exec & sched contexts John Stultz
@ 2025-03-12 22:11 ` John Stultz
  2025-03-15 16:35   ` K Prateek Nayak
  2025-03-17 13:48   ` Peter Zijlstra
  2025-03-12 22:11 ` [RFC PATCH v15 6/7] sched: Fix proxy/current (push,pull)ability John Stultz
  2025-03-12 22:11 ` [RFC PATCH v15 7/7] sched: Start blocked_on chain processing in find_proxy_task() John Stultz
  6 siblings, 2 replies; 36+ messages in thread
From: John Stultz @ 2025-03-12 22:11 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kernel-team

Add a find_proxy_task() function which doesn't do much.

When we select a blocked task to run, we will just deactivate it
and pick again. The exception is if it has become unblocked
after find_proxy_task() was called.
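
As a rough model (hypothetical helper names, not actual kernel APIs),
the behavior reduces to: pick a task; if it is still blocked on a
mutex, deactivate it and pick again:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature model of the pick_again loop in __schedule(). */
struct mini_task {
	int blocked_on; /* nonzero: blocked on a mutex            */
	int on_rq;      /* still enqueued on the runqueue         */
};

/* Pick the first enqueued task; NULL when nothing is queued. */
static struct mini_task *pick_next(struct mini_task **rq, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (rq[i]->on_rq)
			return rq[i];
	return NULL;
}

/* Deactivate blocked picks and retry, mirroring the pick_again loop. */
static struct mini_task *schedule_pick(struct mini_task **rq, size_t n)
{
	struct mini_task *next;

	while ((next = pick_next(rq, n))) {
		if (!next->blocked_on)
			return next; /* unblocked (or never blocked): run it */
		next->on_rq = 0;     /* deactivate, then pick again          */
	}
	return NULL; /* nothing runnable */
}
```

The real find_proxy_task() additionally re-checks the blocked_on state
under mutex->wait_lock before deactivating.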

Greatly simplified from patch by:
  Peter Zijlstra (Intel) <peterz@infradead.org>
  Juri Lelli <juri.lelli@redhat.com>
  Valentin Schneider <valentin.schneider@arm.com>
  Connor O'Brien <connoro@google.com>

Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kernel-team@android.com
[jstultz: Split out from larger proxy patch and simplified
 for review and testing.]
Signed-off-by: John Stultz <jstultz@google.com>
---
v5:
* Split out from larger proxy patch
v7:
* Fixed unused function arguments, spelling nits, and tweaks for
  clarity, pointed out by Metin Kaya
* Fix build warning Reported-by: kernel test robot <lkp@intel.com>
  Closes: https://lore.kernel.org/oe-kbuild-all/202311081028.yDLmCWgr-lkp@intel.com/
v8:
* Fixed case where we might return a blocked task from find_proxy_task()
* Continued tweaks to handle avoiding returning blocked tasks
v9:
* Add zap_balance_callbacks helper to unwind balance_callbacks
  when we will re-call pick_next_task() again.
* Add extra comment suggested by Metin
* Typo fixes from Metin
* Moved adding proxy_resched_idle earlier in the series, as suggested
  by Metin
* Fix to call proxy_resched_idle() *prior* to deactivating next, to avoid
  crashes caused by stale references to next
* s/PROXY/SCHED_PROXY_EXEC/ as suggested by Metin
* Number of tweaks and cleanups suggested by Metin
* Simplify proxy_deactivate as suggested by Metin
v11:
* Tweaks for earlier simplification in try_to_deactivate_task
v13:
* Rename "next" to "donor" in find_proxy_task() for clarity
* Similarly use "donor" instead of next in proxy_deactivate
* Refactor/simplify proxy_resched_idle
* Moved up a needed fix from later in the series
v15:
* Tweaked some comments to better explain the initial sketch of
  find_proxy_task(), suggested by Qais
* Build fixes for !CONFIG_SMP
* Slight rework for blocked_on_state being added later in the
  series.
* Move the zap_balance_callbacks to later in the patch series
---
 kernel/sched/core.c  | 103 +++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/rt.c    |  15 ++++++-
 kernel/sched/sched.h |  10 ++++-
 3 files changed, 122 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3968c3967ec38..b4f7b14f62a24 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6600,7 +6600,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
  * Otherwise marks the task's __state as RUNNING
  */
 static bool try_to_block_task(struct rq *rq, struct task_struct *p,
-			      unsigned long task_state)
+			      unsigned long task_state, bool deactivate_cond)
 {
 	int flags = DEQUEUE_NOCLOCK;
 
@@ -6609,6 +6609,9 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
 		return false;
 	}
 
+	if (!deactivate_cond)
+		return false;
+
 	p->sched_contributes_to_load =
 		(task_state & TASK_UNINTERRUPTIBLE) &&
 		!(task_state & TASK_NOLOAD) &&
@@ -6632,6 +6635,93 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
 	return true;
 }
 
+#ifdef CONFIG_SCHED_PROXY_EXEC
+
+static inline struct task_struct *
+proxy_resched_idle(struct rq *rq)
+{
+	put_prev_task(rq, rq->donor);
+	rq_set_donor(rq, rq->idle);
+	set_next_task(rq, rq->idle);
+	set_tsk_need_resched(rq->idle);
+	return rq->idle;
+}
+
+static bool proxy_deactivate(struct rq *rq, struct task_struct *donor)
+{
+	unsigned long state = READ_ONCE(donor->__state);
+
+	/* Don't deactivate if the state has been changed to TASK_RUNNING */
+	if (state == TASK_RUNNING)
+		return false;
+	/*
+	 * Because we got donor from pick_next_task, it is *crucial*
+	 * that we call proxy_resched_idle before we deactivate it.
+	 * As once we deactivate donor, donor->on_rq is set to zero,
+	 * which allows ttwu to immediately try to wake the task on
+	 * another rq. So we cannot use *any* references to donor
+	 * after that point. So things like cfs_rq->curr or rq->donor
+	 * need to be changed from next *before* we deactivate.
+	 */
+	proxy_resched_idle(rq);
+	return try_to_block_task(rq, donor, state, true);
+}
+
+/*
+ * Initial simple sketch that just deactivates the blocked task
+ * chosen by pick_next_task() so we can then pick something that
+ * isn't blocked.
+ */
+static struct task_struct *
+find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
+{
+	struct task_struct *p = donor;
+	struct mutex *mutex;
+
+	mutex = p->blocked_on;
+	/* Something changed in the chain, so pick again */
+	if (!mutex)
+		return NULL;
+	/*
+	 * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
+	 * and ensure @owner sticks around.
+	 */
+	raw_spin_lock(&mutex->wait_lock);
+
+	/* Check again that p is blocked with blocked_lock held */
+	if (!task_is_blocked(p) || mutex != __get_task_blocked_on(p)) {
+		/*
+		 * Something changed in the blocked_on chain and
+		 * we don't know if only at this level. So, let's
+		 * just bail out completely and let __schedule
+		 * figure things out (pick_again loop).
+		 */
+		goto out;
+	}
+
+	if (!proxy_deactivate(rq, donor)) {
+		/*
+		 * XXX: For now, if deactivation failed, set donor
+		 * as not blocked, as we aren't doing proxy-migrations
+		 * yet (more logic will be needed then).
+		 */
+		__clear_task_blocked_on(donor, mutex);
+		raw_spin_unlock(&mutex->wait_lock);
+		return NULL;
+	}
+out:
+	raw_spin_unlock(&mutex->wait_lock);
+	return NULL; /* do pick_next_task again */
+}
+#else /* SCHED_PROXY_EXEC */
+static struct task_struct *
+find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
+{
+	WARN_ONCE(1, "This should never be called in the !SCHED_PROXY_EXEC case\n");
+	return donor;
+}
+#endif /* SCHED_PROXY_EXEC */
+
 /*
  * __schedule() is the main scheduler function.
  *
@@ -6739,12 +6829,19 @@ static void __sched notrace __schedule(int sched_mode)
 			goto picked;
 		}
 	} else if (!preempt && prev_state) {
-		try_to_block_task(rq, prev, prev_state);
+		try_to_block_task(rq, prev, prev_state,
+				  !task_is_blocked(prev));
 		switch_count = &prev->nvcsw;
 	}
 
-	next = pick_next_task(rq, prev, &rf);
+pick_again:
+	next = pick_next_task(rq, rq->donor, &rf);
 	rq_set_donor(rq, next);
+	if (unlikely(task_is_blocked(next))) {
+		next = find_proxy_task(rq, next, &rf);
+		if (!next)
+			goto pick_again;
+	}
 picked:
 	clear_tsk_need_resched(prev);
 	clear_preempt_need_resched();
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 4b8e33c615b12..2d418e0efecc5 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1479,8 +1479,19 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
 
 	enqueue_rt_entity(rt_se, flags);
 
-	if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
-		enqueue_pushable_task(rq, p);
+	/*
+	 * Current can't be pushed away. Selected is tied to current,
+	 * so don't push it either.
+	 */
+	if (task_current(rq, p) || task_current_donor(rq, p))
+		return;
+	/*
+	 * Pinned tasks can't be pushed.
+	 */
+	if (p->nr_cpus_allowed == 1)
+		return;
+
+	enqueue_pushable_task(rq, p);
 }
 
 static bool dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 05d2122533619..3e49d77ce2cdd 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2311,6 +2311,14 @@ static inline int task_current_donor(struct rq *rq, struct task_struct *p)
 	return rq->donor == p;
 }
 
+static inline bool task_is_blocked(struct task_struct *p)
+{
+	if (!sched_proxy_exec())
+		return false;
+
+	return !!p->blocked_on;
+}
+
 static inline int task_on_cpu(struct rq *rq, struct task_struct *p)
 {
 #ifdef CONFIG_SMP
@@ -2520,7 +2528,7 @@ static inline void put_prev_set_next_task(struct rq *rq,
 					  struct task_struct *prev,
 					  struct task_struct *next)
 {
-	WARN_ON_ONCE(rq->curr != prev);
+	WARN_ON_ONCE(rq->donor != prev);
 
 	__put_prev_set_next_dl_server(rq, prev, next);
 
-- 
2.49.0.rc0.332.g42c0ae87b1-goog


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [RFC PATCH v15 6/7] sched: Fix proxy/current (push,pull)ability
  2025-03-12 22:11 [RFC PATCH v15 0/7] Single RunQueue Proxy Execution (v15) John Stultz
                   ` (4 preceding siblings ...)
  2025-03-12 22:11 ` [RFC PATCH v15 5/7] sched: Add an initial sketch of the find_proxy_task() function John Stultz
@ 2025-03-12 22:11 ` John Stultz
  2025-03-14  8:40   ` K Prateek Nayak
  2025-03-17 14:07   ` Peter Zijlstra
  2025-03-12 22:11 ` [RFC PATCH v15 7/7] sched: Start blocked_on chain processing in find_proxy_task() John Stultz
  6 siblings, 2 replies; 36+ messages in thread
From: John Stultz @ 2025-03-12 22:11 UTC (permalink / raw)
  To: LKML
  Cc: Valentin Schneider, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kernel-team,
	Connor O'Brien, John Stultz

From: Valentin Schneider <valentin.schneider@arm.com>

Proxy execution forms atomic pairs of tasks: The waiting donor
task (scheduling context) and a proxy (execution context). The
donor task, along with the rest of the blocked chain, follows
the proxy wrt CPU placement.

They can be the same task, in which case push/pull doesn't need any
modification. When they are different, however, consider this
scenario with two FIFO tasks, FIFO1 & FIFO42:

	      ,->  RT42
	      |     | blocked-on
	      |     v
blocked_donor |   mutex
	      |     | owner
	      |     v
	      `--  RT1

   RT1
   RT42

  CPU0            CPU1
   ^                ^
   |                |
  overloaded    !overloaded
  rq prio = 42  rq prio = 0

RT1 is eligible to be pushed to CPU1, but should that happen it will
"carry" RT42 along. Clearly here neither RT1 nor RT42 must be seen as
push/pullable.

Unfortunately, only the donor task is usually dequeued from the rq,
and the proxy'ed execution context (rq->curr) remains on the rq.
This can cause RT1 to be selected for migration from logic like the
rt pushable_list.

Thus, add a dequeue/enqueue cycle on the proxy task before __schedule
returns, which allows the sched class logic to avoid adding the
now-current task to the pushable_list.

Furthermore, tasks becoming blocked on a mutex don't need an explicit
dequeue/enqueue cycle to be made (push/pull)able: they have to be running
to block on a mutex, thus they will eventually hit put_prev_task().
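
The rule can be sketched as a predicate (hypothetical simplified types,
not the kernel's actual helpers): a task may go on the pushable list
only if it is neither the execution context (curr) nor the scheduler
context (donor), and is not pinned to a single CPU.

```c
#include <assert.h>

/* Hypothetical model of the "atomic pair" push/pull rule. */
struct rt_task {
	int nr_cpus_allowed;
};

struct rt_rq {
	struct rt_task *curr;  /* execution context (proxy) */
	struct rt_task *donor; /* scheduler context         */
};

static int task_pushable(struct rt_rq *rq, struct rt_task *p)
{
	/* curr and donor form an atomic pair wrt push/pull. */
	if (p == rq->curr || p == rq->donor)
		return 0;
	/* Pinned tasks can't be pushed. */
	if (p->nr_cpus_allowed == 1)
		return 0;
	return 1;
}
```

In the RT1/RT42 example above, RT1 is curr and RT42 is donor, so
neither is pushable, as required.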

Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kernel-team@android.com
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Connor O'Brien <connoro@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
---
v3:
* Tweaked comments & commit message
v5:
* Minor simplifications to utilize the fix earlier
  in the patch series.
* Rework the wording of the commit message to match selected/
  proxy terminology and expand a bit to make it more clear how
  it works.
v6:
* Dropped now-unused proxied value, to be re-added later in the
  series when it is used, as caught by Dietmar
v7:
* Unused function argument fixup
* Commit message nit pointed out by Metin Kaya
* Dropped unproven unlikely() and use sched_proxy_exec()
  in proxy_tag_curr, suggested by Metin Kaya
v8:
* More cleanups and typo fixes suggested by Metin Kaya
v11:
* Cleanup of commit message suggested by Metin
v12:
* Rework for rq_selected -> rq->donor renaming
---
 kernel/sched/core.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b4f7b14f62a24..3596244f613f8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6722,6 +6722,23 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 }
 #endif /* SCHED_PROXY_EXEC */
 
+static inline void proxy_tag_curr(struct rq *rq, struct task_struct *owner)
+{
+	if (!sched_proxy_exec())
+		return;
+	/*
+	 * pick_next_task() calls set_next_task() on the chosen task
+	 * at some point, which ensures it is not push/pullable.
+	 * However, the chosen/donor task *and* the mutex owner form an
+	 * atomic pair wrt push/pull.
+	 *
+	 * Make sure owner we run is not pushable. Unfortunately we can
+	 * only deal with that by means of a dequeue/enqueue cycle. :-/
+	 */
+	dequeue_task(rq, owner, DEQUEUE_NOCLOCK | DEQUEUE_SAVE);
+	enqueue_task(rq, owner, ENQUEUE_NOCLOCK | ENQUEUE_RESTORE);
+}
+
 /*
  * __schedule() is the main scheduler function.
  *
@@ -6856,6 +6873,10 @@ static void __sched notrace __schedule(int sched_mode)
 		 * changes to task_struct made by pick_next_task().
 		 */
 		RCU_INIT_POINTER(rq->curr, next);
+
+		if (!task_current_donor(rq, next))
+			proxy_tag_curr(rq, next);
+
 		/*
 		 * The membarrier system call requires each architecture
 		 * to have a full memory barrier after updating
@@ -6890,6 +6911,10 @@ static void __sched notrace __schedule(int sched_mode)
 		/* Also unlocks the rq: */
 		rq = context_switch(rq, prev, next, &rf);
 	} else {
+		/* In case next was already curr but just got blocked_donor */
+		if (!task_current_donor(rq, next))
+			proxy_tag_curr(rq, next);
+
 		rq_unpin_lock(rq, &rf);
 		__balance_callbacks(rq);
 		raw_spin_rq_unlock_irq(rq);
-- 
2.49.0.rc0.332.g42c0ae87b1-goog


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [RFC PATCH v15 7/7] sched: Start blocked_on chain processing in find_proxy_task()
  2025-03-12 22:11 [RFC PATCH v15 0/7] Single RunQueue Proxy Execution (v15) John Stultz
                   ` (5 preceding siblings ...)
  2025-03-12 22:11 ` [RFC PATCH v15 6/7] sched: Fix proxy/current (push,pull)ability John Stultz
@ 2025-03-12 22:11 ` John Stultz
  2025-03-17 16:43   ` Peter Zijlstra
                     ` (2 more replies)
  6 siblings, 3 replies; 36+ messages in thread
From: John Stultz @ 2025-03-12 22:11 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kernel-team, Valentin Schneider,
	Connor O'Brien, John Stultz

From: Peter Zijlstra <peterz@infradead.org>

Start to flesh out the real find_proxy_task() implementation,
but avoid the migration cases for now; in those cases, just
deactivate the donor task and pick again.

To ensure the donor task or other blocked tasks in the chain
aren't migrated away while we're running the proxy, also tweak
the fair class logic to avoid migrating donor or mutex blocked
tasks.
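
The chain walk itself can be modeled as follows (hypothetical
simplified types; the real code additionally takes mutex->wait_lock at
each step and bails out to the pick_again loop for the migration
cases):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of following task->blocked_on -> mutex->owner. */
struct chain_task;

struct chain_mutex {
	struct chain_task *owner;
};

struct chain_task {
	struct chain_mutex *blocked_on;
};

/* Walk the blocked-on chain to find the task to actually run. */
static struct chain_task *follow_chain(struct chain_task *donor)
{
	struct chain_task *p = donor;

	while (p->blocked_on) {
		struct chain_task *owner = p->blocked_on->owner;

		if (!owner)
			return p; /* lock released: p can retry acquiring it */
		p = owner;
	}
	return p; /* runnable owner: the execution context (proxy) */
}
```

An unblocked donor trivially returns itself, matching the case where
curr and donor are the same task.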

Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kernel-team@android.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Connor O'Brien <connoro@google.com>
[jstultz: This change was split out from the larger proxy patch]
Signed-off-by: John Stultz <jstultz@google.com>
---
v5:
* Split this out from larger proxy patch
v7:
* Minor refactoring of core find_proxy_task() function
* Minor spelling and corrections suggested by Metin Kaya
* Dropped an added BUG_ON that was frequently tripped
v8:
* Fix issue if proxy_deactivate fails, we don't leave task
  BO_BLOCKED
* Switch to WARN_ON from BUG_ON checks
v9:
* Improve comments suggested by Metin
* Minor cleanups
v11:
* Previously we checked next==rq->idle && prev==rq->idle, but I
  think we only really care if next==rq->idle from find_proxy_task,
  as we will still want to resched regardless of what prev was.
v12:
* Commit message rework for selected -> donor rewording
v13:
* Address new delayed dequeue condition (deactivate donor for now)
* Next to donor renaming in find_proxy_task
* Improved comments for find_proxy_task
* Rework for proxy_deactivate cleanup
v14:
* Fix build error from __mutex_owner() with CONFIG_PREEMPT_RT
v15:
* Reworks for moving blocked_on_state to later in the series
---
 kernel/locking/mutex.h |   3 +-
 kernel/sched/core.c    | 165 +++++++++++++++++++++++++++++++++--------
 kernel/sched/fair.c    |  10 ++-
 3 files changed, 145 insertions(+), 33 deletions(-)

diff --git a/kernel/locking/mutex.h b/kernel/locking/mutex.h
index cbff35b9b7ae3..2e8080a9bee37 100644
--- a/kernel/locking/mutex.h
+++ b/kernel/locking/mutex.h
@@ -6,7 +6,7 @@
  *
  *  Copyright (C) 2004, 2005, 2006 Red Hat, Inc., Ingo Molnar <mingo@redhat.com>
  */
-
+#ifndef CONFIG_PREEMPT_RT
 /*
  * This is the control structure for tasks blocked on mutex, which resides
  * on the blocked task's kernel stack:
@@ -70,3 +70,4 @@ extern void debug_mutex_init(struct mutex *lock, const char *name,
 # define debug_mutex_unlock(lock)			do { } while (0)
 # define debug_mutex_init(lock, name, key)		do { } while (0)
 #endif /* !CONFIG_DEBUG_MUTEXES */
+#endif /* CONFIG_PREEMPT_RT */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3596244f613f8..28ac71dfc7e66 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -96,6 +96,7 @@
 #include "../workqueue_internal.h"
 #include "../../io_uring/io-wq.h"
 #include "../smpboot.h"
+#include "../locking/mutex.h"
 
 EXPORT_TRACEPOINT_SYMBOL_GPL(ipi_send_cpu);
 EXPORT_TRACEPOINT_SYMBOL_GPL(ipi_send_cpumask);
@@ -2950,8 +2951,15 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
 	struct set_affinity_pending my_pending = { }, *pending = NULL;
 	bool stop_pending, complete = false;
 
-	/* Can the task run on the task's current CPU? If so, we're done */
-	if (cpumask_test_cpu(task_cpu(p), &p->cpus_mask)) {
+	/*
+	 * Can the task run on the task's current CPU? If so, we're done
+	 *
+	 * We are also done if the task is the current donor, boosting a lock-
+	 * holding proxy, (and potentially has been migrated outside its
+	 * current or previous affinity mask)
+	 */
+	if (cpumask_test_cpu(task_cpu(p), &p->cpus_mask) ||
+	    (task_current_donor(rq, p) && !task_current(rq, p))) {
 		struct task_struct *push_task = NULL;
 
 		if ((flags & SCA_MIGRATE_ENABLE) &&
@@ -6668,47 +6676,138 @@ static bool proxy_deactivate(struct rq *rq, struct task_struct *donor)
 }
 
 /*
- * Initial simple sketch that just deactivates the blocked task
- * chosen by pick_next_task() so we can then pick something that
- * isn't blocked.
+ * Find runnable lock owner to proxy for mutex blocked donor
+ *
+ * Follow the blocked-on relation:
+ *   task->blocked_on -> mutex->owner -> task...
+ *
+ * Lock order:
+ *
+ *   p->pi_lock
+ *     rq->lock
+ *       mutex->wait_lock
+ *
+ * Returns the task that is going to be used as execution context (the one
+ * that is actually going to be run on cpu_of(rq)).
  */
 static struct task_struct *
 find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 {
-	struct task_struct *p = donor;
+	struct task_struct *owner = NULL;
+	struct task_struct *ret = NULL;
+	int this_cpu = cpu_of(rq);
+	struct task_struct *p;
 	struct mutex *mutex;
 
-	mutex = p->blocked_on;
-	/* Something changed in the chain, so pick again */
-	if (!mutex)
-		return NULL;
-	/*
-	 * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
-	 * and ensure @owner sticks around.
-	 */
-	raw_spin_lock(&mutex->wait_lock);
-
-	/* Check again that p is blocked with blocked_lock held */
-	if (!task_is_blocked(p) || mutex != __get_task_blocked_on(p)) {
+	/* Follow blocked_on chain. */
+	for (p = donor; task_is_blocked(p); p = owner) {
+		mutex = p->blocked_on;
+		/* Something changed in the chain, so pick again */
+		if (!mutex)
+			return NULL;
 		/*
-		 * Something changed in the blocked_on chain and
-		 * we don't know if only at this level. So, let's
-		 * just bail out completely and let __schedule
-		 * figure things out (pick_again loop).
+		 * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
+		 * and ensure @owner sticks around.
 		 */
-		goto out;
-	}
+		raw_spin_lock(&mutex->wait_lock);
+
+		/* Check again that p is blocked with wait_lock held */
+		if (mutex != __get_task_blocked_on(p)) {
+			/*
+			 * Something changed in the blocked_on chain and
+			 * we don't know if only at this level. So, let's
+			 * just bail out completely and let __schedule
+			 * figure things out (pick_again loop).
+			 */
+			goto out;
+		}
+
+		owner = __mutex_owner(mutex);
+		if (!owner) {
+			__clear_task_blocked_on(p, mutex);
+			ret = p;
+			goto out;
+		}
+
+		if (task_cpu(owner) != this_cpu) {
+			/* XXX Don't handle migrations yet */
+			if (!proxy_deactivate(rq, donor))
+				goto deactivate_failed;
+			goto out;
+		}
+
+		if (task_on_rq_migrating(owner)) {
+			/*
+			 * One of the chain of mutex owners is currently migrating to this
+			 * CPU, but has not yet been enqueued because we are holding the
+			 * rq lock. As a simple solution, just schedule rq->idle to give
+			 * the migration a chance to complete. Much like the migrate_task
+			 * case we should end up back in find_proxy_task(), this time
+			 * hopefully with all relevant tasks already enqueued.
+			 */
+			raw_spin_unlock(&mutex->wait_lock);
+			return proxy_resched_idle(rq);
+		}
+
+		if (!owner->on_rq) {
+			/* XXX Don't handle blocked owners yet */
+			if (!proxy_deactivate(rq, donor))
+				goto deactivate_failed;
+			goto out;
+		}
+
+		if (owner->se.sched_delayed) {
+			/* XXX Don't handle delayed dequeue yet */
+			if (!proxy_deactivate(rq, donor))
+				goto deactivate_failed;
+			goto out;
+		}
+
+		if (owner == p) {
+			/*
+			 * It's possible we interleave with mutex_unlock like:
+			 *
+			 *				lock(&rq->lock);
+			 *				  find_proxy_task()
+			 * mutex_unlock()
+			 *   lock(&wait_lock);
+			 *   donor(owner) = current->blocked_donor;
+			 *   unlock(&wait_lock);
+			 *
+			 *   wake_up_q();
+			 *     ...
+			 *       ttwu_runnable()
+			 *         __task_rq_lock()
+			 *				  lock(&wait_lock);
+			 *				  owner == p
+			 *
+			 * Which leaves us to finish the ttwu_runnable() and make it go.
+			 *
+			 * So schedule rq->idle so that ttwu_runnable can get the rq lock
+			 * and mark owner as running.
+			 */
+			raw_spin_unlock(&mutex->wait_lock);
+			return proxy_resched_idle(rq);
+		}
 
-	if (!proxy_deactivate(rq, donor)) {
 		/*
-		 * XXX: For now, if deactivation failed, set donor
-		 * as not blocked, as we aren't doing proxy-migrations
-		 * yet (more logic will be needed then).
+		 * OK, now we're absolutely sure @owner is on this
+		 * rq, therefore holding @rq->lock is sufficient to
+		 * guarantee its existence, as per ttwu_remote().
 		 */
-		__clear_task_blocked_on(donor, mutex);
 		raw_spin_unlock(&mutex->wait_lock);
-		return NULL;
 	}
+
+	WARN_ON_ONCE(owner && !owner->on_rq);
+	return owner;
+
+deactivate_failed:
+	/*
+	 * XXX: For now, if deactivation failed, set donor
+	 * as unblocked, as we aren't doing proxy-migrations
+	 * yet (more logic will be needed then).
+	 */
+	donor->blocked_on = NULL; /* XXX not following locking rules :( */
 out:
 	raw_spin_unlock(&mutex->wait_lock);
 	return NULL; /* do pick_next_task again */
@@ -6791,6 +6890,7 @@ static void __sched notrace __schedule(int sched_mode)
 	struct rq_flags rf;
 	struct rq *rq;
 	int cpu;
+	bool preserve_need_resched = false;
 
 	cpu = smp_processor_id();
 	rq = cpu_rq(cpu);
@@ -6858,9 +6958,12 @@ static void __sched notrace __schedule(int sched_mode)
 		next = find_proxy_task(rq, next, &rf);
 		if (!next)
 			goto pick_again;
+		if (next == rq->idle)
+			preserve_need_resched = true;
 	}
 picked:
-	clear_tsk_need_resched(prev);
+	if (!preserve_need_resched)
+		clear_tsk_need_resched(prev);
 	clear_preempt_need_resched();
 #ifdef CONFIG_SCHED_DEBUG
 	rq->last_seen_need_resched_ns = 0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f8ad3a44b3771..091f1a01b3327 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9385,6 +9385,7 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	 * 3) cannot be migrated to this CPU due to cpus_ptr, or
 	 * 4) running (obviously), or
 	 * 5) are cache-hot on their current CPU.
+	 * 6) are blocked on mutexes (if SCHED_PROXY_EXEC is enabled)
 	 */
 	if ((p->se.sched_delayed) && (env->migration_type != migrate_load))
 		return 0;
@@ -9406,6 +9407,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	if (kthread_is_per_cpu(p))
 		return 0;
 
+	if (task_is_blocked(p))
+		return 0;
+
 	if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
 		int cpu;
 
@@ -9442,7 +9446,8 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	/* Record that we found at least one task that could run on dst_cpu */
 	env->flags &= ~LBF_ALL_PINNED;
 
-	if (task_on_cpu(env->src_rq, p)) {
+	if (task_on_cpu(env->src_rq, p) ||
+	    task_current_donor(env->src_rq, p)) {
 		schedstat_inc(p->stats.nr_failed_migrations_running);
 		return 0;
 	}
@@ -9486,6 +9491,9 @@ static void detach_task(struct task_struct *p, struct lb_env *env)
 		schedstat_inc(p->stats.nr_forced_migrations);
 	}
 
+	WARN_ON(task_current(env->src_rq, p));
+	WARN_ON(task_current_donor(env->src_rq, p));
+
 	deactivate_task(env->src_rq, p, DEQUEUE_NOCLOCK);
 	set_task_cpu(p, env->dst_cpu);
 }
-- 
2.49.0.rc0.332.g42c0ae87b1-goog


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v15 1/7] sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable
  2025-03-12 22:11 ` [RFC PATCH v15 1/7] sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable John Stultz
@ 2025-03-13 10:09   ` Steven Rostedt
  2025-03-14  0:48     ` John Stultz
  2025-03-17 14:33   ` Peter Zijlstra
  1 sibling, 1 reply; 36+ messages in thread
From: Steven Rostedt @ 2025-03-13 10:09 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Ben Segall, Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long,
	Boqun Feng, Paul E. McKenney, Metin Kaya, Xuewen Yan,
	K Prateek Nayak, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kernel-team

On Wed, 12 Mar 2025 15:11:31 -0700
John Stultz <jstultz@google.com> wrote:
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index fb8752b42ec85..dcc2443078d00 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -6262,6 +6262,11 @@
>  	sa1100ir	[NET]
>  			See drivers/net/irda/sa1100_ir.c.
>  
> +	sched_proxy_exec= [KNL]
> +			Enables or disables "proxy execution" style
> +			solution to mutex-based priority inversion.
> +			Format: <bool>

To enable, does this require: sched_proxy_exec=true

Could we just allow it to be:

		sched_proxy_exec

Also mean true? That is, both of the above would be true, but to
disable it, you would need: sched_proxy_exec=false.

> +
>  	sched_verbose	[KNL,EARLY] Enables verbose scheduler debug messages.
>  
>  	schedstats=	[KNL,X86] Enable or disable scheduled statistics.
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 9c15365a30c08..1462f2c70aefc 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1636,6 +1636,19 @@ struct task_struct {
>  	 */
>  };
>  
> +#ifdef CONFIG_SCHED_PROXY_EXEC
> +DECLARE_STATIC_KEY_TRUE(__sched_proxy_exec);
> +static inline bool sched_proxy_exec(void)
> +{
> +	return static_branch_likely(&__sched_proxy_exec);
> +}
> +#else
> +static inline bool sched_proxy_exec(void)
> +{
> +	return false;
> +}
> +#endif
> +
>  #define TASK_REPORT_IDLE	(TASK_REPORT + 1)
>  #define TASK_REPORT_MAX		(TASK_REPORT_IDLE << 1)
>  
> diff --git a/init/Kconfig b/init/Kconfig
> index d0d021b3fa3b3..b989ddc27444e 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -875,6 +875,16 @@ config UCLAMP_BUCKETS_COUNT
>  
>  	  If in doubt, use the default value.
>  
> +config SCHED_PROXY_EXEC
> +	bool "Proxy Execution"
> +	default n

Nit, you don't need "default n" because "default n" is the default ;-)

> +	# Avoid some build failures w/ PREEMPT_RT until it can be fixed
> +	depends on !PREEMPT_RT
> +	depends on EXPERT
> +	help
> +	  This option enables proxy execution, a mechanism for mutex-owning
> +	  tasks to inherit the scheduling context of higher priority waiters.
> +
>  endmenu
>  
>  #
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 67189907214d3..3968c3967ec38 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -119,6 +119,35 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(sched_compute_energy_tp);
>  
>  DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
>  
> +#ifdef CONFIG_SCHED_PROXY_EXEC
> +DEFINE_STATIC_KEY_TRUE(__sched_proxy_exec);
> +static int __init setup_proxy_exec(char *str)
> +{
> +	bool proxy_enable;
> +
> +	if (kstrtobool(str, &proxy_enable)) {

To make it work without adding =true, the above could be:

	bool proxy_enable = true;

	if (*str && kstrtobool(str, &proxy_enable)) {

> +		pr_warn("Unable to parse sched_proxy_exec=\n");
> +		return 0;
> +	}
> +
> +	if (proxy_enable) {
> +		pr_info("sched_proxy_exec enabled via boot arg\n");
> +		static_branch_enable(&__sched_proxy_exec);
> +	} else {
> +		pr_info("sched_proxy_exec disabled via boot arg\n");
> +		static_branch_disable(&__sched_proxy_exec);
> +	}
> +	return 1;
> +}
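
For illustration, the tweak Steve suggests can be modeled in a
self-contained userspace sketch (kstrtobool_stub() and
parse_proxy_exec() are hypothetical stand-ins here, not the kernel
APIs; the real kstrtobool() also accepts "y"/"n", "on"/"off", etc.):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Simplified stand-in for the kernel's kstrtobool(). */
static int kstrtobool_stub(const char *s, bool *res)
{
	if (!strcmp(s, "true") || !strcmp(s, "1")) {
		*res = true;
		return 0;
	}
	if (!strcmp(s, "false") || !strcmp(s, "0")) {
		*res = false;
		return 0;
	}
	return -1;
}

/*
 * Sketch of the suggested parsing: a bare "sched_proxy_exec" (empty
 * value string) means enable, while an explicit "=false" still
 * disables.  Returns true if the argument parsed; *enable holds the
 * resulting setting.
 */
static bool parse_proxy_exec(const char *str, bool *enable)
{
	*enable = true;			/* bare flag defaults to enabled */
	if (*str && kstrtobool_stub(str, enable))
		return false;		/* unparseable value */
	return true;
}
```

With this shape, "sched_proxy_exec", "sched_proxy_exec=true" and
"sched_proxy_exec=1" all enable the feature, and only an explicit
false value disables it.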

-- Steve

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v15 2/7] locking/mutex: Rework task_struct::blocked_on
  2025-03-12 22:11 ` [RFC PATCH v15 2/7] locking/mutex: Rework task_struct::blocked_on John Stultz
@ 2025-03-13 10:13   ` Steven Rostedt
  2025-03-14  6:12     ` John Stultz
  2025-03-17 11:44     ` Peter Zijlstra
  0 siblings, 2 replies; 36+ messages in thread
From: Steven Rostedt @ 2025-03-13 10:13 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Peter Zijlstra, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Ben Segall, Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long,
	Boqun Feng, Paul E. McKenney, Metin Kaya, Xuewen Yan,
	K Prateek Nayak, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kernel-team, Connor O'Brien,
	Masami Hiramatsu


FYI, this is useful for Masami's "hung task" work that will show what
tasks a hung task is blocked on in a crash report.

  https://lore.kernel.org/all/174046694331.2194069.15472952050240807469.stgit@mhiramat.tok.corp.google.com/

-- Steve


On Wed, 12 Mar 2025 15:11:32 -0700
John Stultz <jstultz@google.com> wrote:

> From: Peter Zijlstra <peterz@infradead.org>
> 
> Track the blocked-on relation for mutexes, to allow following this
> relation at schedule time.
> 
>    task
>      | blocked-on
>      v
>    mutex
>      | owner
>      v
>    task
> 
> This all will be used for tracking blocked-task/mutex chains
> with the proxy-execution patch in a similar fashion to how
> priority inheritance is done with rt_mutexes.
> 
> For serialization, blocked-on is only set by the task itself
> (current). And both when setting or clearing (potentially by
> others), is done while holding the mutex::wait_lock.
> 
> Cc: Joel Fernandes <joelagnelf@nvidia.com>
> Cc: Qais Yousef <qyousef@layalina.io>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Valentin Schneider <vschneid@redhat.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Zimuzo Ezeozue <zezeozue@google.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Will Deacon <will@kernel.org>
> Cc: Waiman Long <longman@redhat.com>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Metin Kaya <Metin.Kaya@arm.com>
> Cc: Xuewen Yan <xuewen.yan94@gmail.com>
> Cc: K Prateek Nayak <kprateek.nayak@amd.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> Cc: Suleiman Souhlal <suleiman@google.com>
> Cc: kernel-team@android.com
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> [minor changes while rebasing]
> Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Connor O'Brien <connoro@google.com>
> [jstultz: Fix blocked_on tracking in __mutex_lock_common in error paths]
> Signed-off-by: John Stultz <jstultz@google.com>
> ---
> v2:
> * Fixed blocked_on tracking in error paths that was causing crashes
> v4:
> * Ensure we clear blocked_on when waking ww_mutexes to die or wound.
>   This is critical so we don't get circular blocked_on relationships
>   that can't be resolved.
> v5:
> * Fix potential bug where the skip_wait path might clear blocked_on
>   when that path never set it
> * Slight tweaks to where we set blocked_on to make it consistent,
>   along with extra WARN_ON correctness checking
> * Minor comment changes
> v7:
> * Minor commit message change suggested by Metin Kaya
> * Fix WARN_ON conditionals in unlock path (as blocked_on might already
>   be cleared), found while looking at issue Metin Kaya raised.
> * Minor tweaks to be consistent in what we do under the
>   blocked_on lock, also tweaked variable name to avoid confusion
>   with label, and comment typos, as suggested by Metin Kaya
> * Minor tweak for CONFIG_SCHED_PROXY_EXEC name change
> * Moved unused block of code to later in the series, as suggested
>   by Metin Kaya
> * Switch to a tri-state to be able to distinguish from waking and
>   runnable so we can later safely do return migration from ttwu
> * Folded together with related blocked_on changes
> v8:
> * Fix issue leaving task BO_BLOCKED when calling into optimistic
>   spinning path.
> * Include helper to better handle BO_BLOCKED->BO_WAKING transitions
> v9:
> * Typo fixup pointed out by Metin
> * Cleanup BO_WAKING->BO_RUNNABLE transitions for the !proxy case
> * Many cleanups and simplifications suggested by Metin
> v11:
> * Whitespace fixup pointed out by Metin
> v13:
> * Refactor set_blocked_on helpers to clean things up a bit
> v14:
> * Small build fixup with PREEMPT_RT
> v15:
> * Improve consistency of names for functions that assume blocked_lock
>   is held, as suggested by Peter
> * Use guard instead of separate spinlock/unlock calls, also suggested
>   by Peter
> * Drop blocked_on_state tri-state for now, as it's not needed until
>   later in the series, when we get to proxy-migration and return-
>   migration.
> ---
>  include/linux/sched.h        |  5 +----
>  kernel/fork.c                |  3 +--
>  kernel/locking/mutex-debug.c |  9 +++++----
>  kernel/locking/mutex.c       | 19 +++++++++++++++++++
>  kernel/locking/ww_mutex.h    | 18 ++++++++++++++++--
>  5 files changed, 42 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 1462f2c70aefc..03775c44b7073 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1212,10 +1212,7 @@ struct task_struct {
>  	struct rt_mutex_waiter		*pi_blocked_on;
>  #endif
>  
> -#ifdef CONFIG_DEBUG_MUTEXES
> -	/* Mutex deadlock detection: */
> -	struct mutex_waiter		*blocked_on;
> -#endif
> +	struct mutex			*blocked_on;	/* lock we're blocked on */
>  
>  #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
>  	int				non_block_count;
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 735405a9c5f32..38f055082d07a 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2357,9 +2357,8 @@ __latent_entropy struct task_struct *copy_process(
>  	lockdep_init_task(p);
>  #endif
>  
> -#ifdef CONFIG_DEBUG_MUTEXES
>  	p->blocked_on = NULL; /* not blocked yet */
> -#endif
> +
>  #ifdef CONFIG_BCACHE
>  	p->sequential_io	= 0;
>  	p->sequential_io_avg	= 0;
> diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
> index 6e6f6071cfa27..758b7a6792b0c 100644
> --- a/kernel/locking/mutex-debug.c
> +++ b/kernel/locking/mutex-debug.c
> @@ -53,17 +53,18 @@ void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter,
>  {
>  	lockdep_assert_held(&lock->wait_lock);
>  
> -	/* Mark the current thread as blocked on the lock: */
> -	task->blocked_on = waiter;
> +	/* Current thread can't be already blocked (since it's executing!) */
> +	DEBUG_LOCKS_WARN_ON(task->blocked_on);
>  }
>  
>  void debug_mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
>  			 struct task_struct *task)
>  {
> +	struct mutex *blocked_on = READ_ONCE(task->blocked_on);
> +
>  	DEBUG_LOCKS_WARN_ON(list_empty(&waiter->list));
>  	DEBUG_LOCKS_WARN_ON(waiter->task != task);
> -	DEBUG_LOCKS_WARN_ON(task->blocked_on != waiter);
> -	task->blocked_on = NULL;
> +	DEBUG_LOCKS_WARN_ON(blocked_on && blocked_on != lock);
>  
>  	INIT_LIST_HEAD(&waiter->list);
>  	waiter->task = NULL;
> diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
> index b36f23de48f1b..37d1966970617 100644
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -627,6 +627,8 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
>  			goto err_early_kill;
>  	}
>  
> +	WARN_ON(current->blocked_on);
> +	current->blocked_on = lock;
>  	set_current_state(state);
>  	trace_contention_begin(lock, LCB_F_MUTEX);
>  	for (;;) {
> @@ -663,6 +665,12 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
>  
>  		first = __mutex_waiter_is_first(lock, &waiter);
>  
> +		/*
> +		 * As we likely have been woken up by a task
> +		 * that has cleared our blocked_on state, re-set
> +		 * it to the lock we are trying to acquire.
> +		 */
> +		current->blocked_on = lock;
>  		set_current_state(state);
>  		/*
>  		 * Here we order against unlock; we must either see it change
> @@ -683,6 +691,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
>  	}
>  	raw_spin_lock_irqsave(&lock->wait_lock, flags);
>  acquired:
> +	current->blocked_on = NULL;
>  	__set_current_state(TASK_RUNNING);
>  
>  	if (ww_ctx) {
> @@ -712,9 +721,11 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
>  	return 0;
>  
>  err:
> +	current->blocked_on = NULL;
>  	__set_current_state(TASK_RUNNING);
>  	__mutex_remove_waiter(lock, &waiter);
>  err_early_kill:
> +	WARN_ON(current->blocked_on);
>  	trace_contention_end(lock, ret);
>  	raw_spin_unlock_irqrestore_wake(&lock->wait_lock, flags, &wake_q);
>  	debug_mutex_free_waiter(&waiter);
> @@ -924,6 +935,14 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
>  		next = waiter->task;
>  
>  		debug_mutex_wake_waiter(lock, waiter);
> +		/*
> +		 * Unlock wakeups can be happening in parallel
> +		 * (when optimistic spinners steal and release
> +		 * the lock), so blocked_on may already be
> +		 * cleared here.
> +		 */
> +		WARN_ON(next->blocked_on && next->blocked_on != lock);
> +		next->blocked_on = NULL;
>  		wake_q_add(&wake_q, next);
>  	}
>  
> diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
> index 37f025a096c9d..00db40946328e 100644
> --- a/kernel/locking/ww_mutex.h
> +++ b/kernel/locking/ww_mutex.h
> @@ -284,6 +284,14 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
>  #ifndef WW_RT
>  		debug_mutex_wake_waiter(lock, waiter);
>  #endif
> +		/*
> +		 * When waking up the task to die, be sure to clear the
> +		 * blocked_on pointer. Otherwise we can see circular
> +		 * blocked_on relationships that can't resolve.
> +		 */
> +		WARN_ON(waiter->task->blocked_on &&
> +			waiter->task->blocked_on != lock);
> +		waiter->task->blocked_on = NULL;
>  		wake_q_add(wake_q, waiter->task);
>  	}
>  
> @@ -331,9 +339,15 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
>  		 * it's wounded in __ww_mutex_check_kill() or has a
>  		 * wakeup pending to re-read the wounded state.
>  		 */
> -		if (owner != current)
> +		if (owner != current) {
> +			/*
> +			 * When waking up the task to wound, be sure to clear the
> +			 * blocked_on pointer. Otherwise we can see circular
> +			 * blocked_on relationships that can't resolve.
> +			 */
> +			owner->blocked_on = NULL;
>  			wake_q_add(wake_q, owner);
> -
> +		}
>  		return true;
>  	}
>  


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v15 4/7] sched: Fix runtime accounting w/ split exec & sched contexts
  2025-03-12 22:11 ` [RFC PATCH v15 4/7] sched: Fix runtime accounting w/ split exec & sched contexts John Stultz
@ 2025-03-13 10:26   ` Steven Rostedt
  2025-03-15  6:05     ` John Stultz
  2025-03-13 17:24   ` K Prateek Nayak
  1 sibling, 1 reply; 36+ messages in thread
From: Steven Rostedt @ 2025-03-13 10:26 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Ben Segall, Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long,
	Boqun Feng, Paul E. McKenney, Metin Kaya, Xuewen Yan,
	K Prateek Nayak, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kernel-team

On Wed, 12 Mar 2025 15:11:34 -0700
John Stultz <jstultz@google.com> wrote:

> The idea here is we want to charge the scheduler-context task's
> vruntime but charge the execution-context task's sum_exec_runtime.

The "but" is confusing me. Do you mean, "and also"? The sentence
doesn't make sense with "but" unless it was:

  "The idea here is we DON'T want to charge the scheduler-context
  task's vruntime but charge the execution-context task's
  sum_exec_runtime INSTEAD."

> 
> This way cputime accounting goes against the task actually running
> but vruntime accounting goes against the rq->donor task so we get
> proper fairness.

But this shows that you want to do both, although, I would remove the
"but" here too. Replace it with "while".

Or maybe I'm just confused.
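
The split being discussed, reduced to a toy userspace model (the
structs and function here are simplified stand-ins, not the kernel's
struct rq / struct sched_entity): execution time is charged to the
task physically on the CPU, while vruntime follows the donor.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in types, not the kernel structures. */
struct toy_entity {
	uint64_t sum_exec_runtime;	/* cputime: execution context */
	uint64_t vruntime;		/* fairness: scheduler context */
};

struct toy_rq {
	struct toy_entity *curr;	/* task physically running */
	struct toy_entity *donor;	/* task lending its sched context */
};

static void toy_update_curr(struct toy_rq *rq, uint64_t delta_exec)
{
	/* Charge execution time to whoever actually ran on the CPU. */
	rq->curr->sum_exec_runtime += delta_exec;
	/*
	 * Charge vruntime to the donor, so fairness is accounted
	 * against the scheduler context even when curr != donor.
	 */
	rq->donor->vruntime += delta_exec;
}
```

In the non-proxy case curr == donor and both charges land on the same
task, matching the existing behavior.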


> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c798d27952431..f8ad3a44b3771 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1129,22 +1129,33 @@ static void update_tg_load_avg(struct cfs_rq *cfs_rq)
>  }
>  #endif /* CONFIG_SMP */
>  
> -static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
> +static s64 update_curr_se(struct rq *rq, struct sched_entity *se)

Should this be renamed to "update_se()" as it no longer appears to be
updating "curr_se".

>  {
>  	u64 now = rq_clock_task(rq);
>  	s64 delta_exec;
>  
> -	delta_exec = now - curr->exec_start;
> +	delta_exec = now - se->exec_start;
>  	if (unlikely(delta_exec <= 0))
>  		return delta_exec;
>  
> -	curr->exec_start = now;
> -	curr->sum_exec_runtime += delta_exec;
> +	se->exec_start = now;
> +	if (entity_is_task(se)) {
> +		struct task_struct *running = rq->curr;
> +		/*
> +		 * If se is a task, we account the time against the running
> +		 * task, as w/ proxy-exec they may not be the same.
> +		 */
> +		running->se.exec_start = now;
> +		running->se.sum_exec_runtime += delta_exec;
> +	} else {
> +		/* If not task, account the time against se */
> +		se->sum_exec_runtime += delta_exec;
> +	}

Or maybe: update_proxy_se() ?

-- Steve

>  
>  	if (schedstat_enabled()) {
>  		struct sched_statistics *stats;
>  
> -		stats = __schedstats_from_se(curr);
> +		stats = __schedstats_from_se(se);
>  		__schedstat_set(stats->exec_max,
>  				max(delta_exec, stats->exec_max));
>  	}


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v15 4/7] sched: Fix runtime accounting w/ split exec & sched contexts
  2025-03-12 22:11 ` [RFC PATCH v15 4/7] sched: Fix runtime accounting w/ split exec & sched contexts John Stultz
  2025-03-13 10:26   ` Steven Rostedt
@ 2025-03-13 17:24   ` K Prateek Nayak
  1 sibling, 0 replies; 36+ messages in thread
From: K Prateek Nayak @ 2025-03-13 17:24 UTC (permalink / raw)
  To: John Stultz, LKML
  Cc: Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kernel-team

Hello John,

On 3/13/2025 3:41 AM, John Stultz wrote:
> The idea here is we want to charge the scheduler-context task's
> vruntime but charge the execution-context task's sum_exec_runtime.
> 
> This way cputime accounting goes against the task actually running
> but vruntime accounting goes against the rq->donor task so we get
> proper fairness.
> 
> Cc: Joel Fernandes <joelagnelf@nvidia.com>
> Cc: Qais Yousef <qyousef@layalina.io>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Valentin Schneider <vschneid@redhat.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Zimuzo Ezeozue <zezeozue@google.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Will Deacon <will@kernel.org>
> Cc: Waiman Long <longman@redhat.com>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Metin Kaya <Metin.Kaya@arm.com>
> Cc: Xuewen Yan <xuewen.yan94@gmail.com>
> Cc: K Prateek Nayak <kprateek.nayak@amd.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> Cc: Suleiman Souhlal <suleiman@google.com>
> Cc: kernel-team@android.com
> Signed-off-by: John Stultz <jstultz@google.com>
> ---
>   kernel/sched/fair.c | 21 ++++++++++++++++-----
>   1 file changed, 16 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c798d27952431..f8ad3a44b3771 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1129,22 +1129,33 @@ static void update_tg_load_avg(struct cfs_rq *cfs_rq)
>   }
>   #endif /* CONFIG_SMP */
>   
> -static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
> +static s64 update_curr_se(struct rq *rq, struct sched_entity *se)
>   {
>   	u64 now = rq_clock_task(rq);
>   	s64 delta_exec;
>   
> -	delta_exec = now - curr->exec_start;
> +	delta_exec = now - se->exec_start;
>   	if (unlikely(delta_exec <= 0))
>   		return delta_exec;
>   
> -	curr->exec_start = now;
> -	curr->sum_exec_runtime += delta_exec;
> +	se->exec_start = now;
> +	if (entity_is_task(se)) {
> +		struct task_struct *running = rq->curr;
> +		/*
> +		 * If se is a task, we account the time against the running
> +		 * task, as w/ proxy-exec they may not be the same.
> +		 */
> +		running->se.exec_start = now;
> +		running->se.sum_exec_runtime += delta_exec;
> +	} else {
> +		/* If not task, account the time against se */
> +		se->sum_exec_runtime += delta_exec;
> +	}
>   
>   	if (schedstat_enabled()) {
>   		struct sched_statistics *stats;
>   
> -		stats = __schedstats_from_se(curr);
> +		stats = __schedstats_from_se(se);
>   		__schedstat_set(stats->exec_max,
>   				max(delta_exec, stats->exec_max));

So I'm slightly confused here - For the case of proxy where
entity_is_task(), we charge the delta_exec to the running task's se but
then we go ahead and update the exec_max against the stats of the
donor's se? That seems odd.

Could we just replace the se with &rq->curr->se if entity_is_task()
returns true and keep the rest as is? The calculations will be same as
what it is above (except for the stats bit) and you'll not require
updating "exec_start" for both current task and proxy's se. Thoughts?


>   	}

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v15 1/7] sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable
  2025-03-13 10:09   ` Steven Rostedt
@ 2025-03-14  0:48     ` John Stultz
  0 siblings, 0 replies; 36+ messages in thread
From: John Stultz @ 2025-03-14  0:48 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Ben Segall, Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long,
	Boqun Feng, Paul E. McKenney, Metin Kaya, Xuewen Yan,
	K Prateek Nayak, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kernel-team

On Thu, Mar 13, 2025 at 3:09 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Wed, 12 Mar 2025 15:11:31 -0700
> John Stultz <jstultz@google.com> wrote:
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index fb8752b42ec85..dcc2443078d00 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -6262,6 +6262,11 @@
> >       sa1100ir        [NET]
> >                       See drivers/net/irda/sa1100_ir.c.
> >
> > +     sched_proxy_exec= [KNL]
> > +                     Enables or disables "proxy execution" style
> > +                     solution to mutex-based priority inversion.
> > +                     Format: <bool>
>
> To enable, does this require: sched_proxy_exec=true
>
> Could we just allow it to be:
>
>                 sched_proxy_exec
>
> Also mean true? That is, both of the above would be true, but to
> disable it, you would need: sched_proxy_exec=false.

Currently the flag defaults to true, so I'm not sure if
"sched_proxy_exec" on its own makes as much sense to me.

Though, in the android16-6.12 kernel, I have an additional change that
sets it default to false, which allows "sched_proxy_exec=true" to be
useful.  So I'm open to having the argument alone as an enablement
flag (in addition to the explicit setting), but I've personally always
found the mixed conventions there confusing, preferring the explicit
"=true" or "=1" on boot arguments.


> > diff --git a/init/Kconfig b/init/Kconfig
> > index d0d021b3fa3b3..b989ddc27444e 100644
> > --- a/init/Kconfig
> > +++ b/init/Kconfig
> > @@ -875,6 +875,16 @@ config UCLAMP_BUCKETS_COUNT
> >
> >         If in doubt, use the default value.
> >
> > +config SCHED_PROXY_EXEC
> > +     bool "Proxy Execution"
> > +     default n
>
> Nit, you don't need "default n" because "default n" is the default ;-)

Ah, thanks I'll drop that!

> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 67189907214d3..3968c3967ec38 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -119,6 +119,35 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(sched_compute_energy_tp);
> >
> >  DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
> >
> > +#ifdef CONFIG_SCHED_PROXY_EXEC
> > +DEFINE_STATIC_KEY_TRUE(__sched_proxy_exec);
> > +static int __init setup_proxy_exec(char *str)
> > +{
> > +     bool proxy_enable;
> > +
> > +     if (kstrtobool(str, &proxy_enable)) {
>
> To make it work without adding =true, the above could be:
>
>         bool proxy_enable = true;
>
>         if (*str && kstrtobool(str, &proxy_enable)) {
>

Ok, I'll give this a shot.

Thanks so much for the review and feedback! Really appreciate it!
-john

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v15 2/7] locking/mutex: Rework task_struct::blocked_on
  2025-03-13 10:13   ` Steven Rostedt
@ 2025-03-14  6:12     ` John Stultz
  2025-03-16 16:33       ` Steven Rostedt
  2025-03-18 14:11       ` Masami Hiramatsu
  2025-03-17 11:44     ` Peter Zijlstra
  1 sibling, 2 replies; 36+ messages in thread
From: John Stultz @ 2025-03-14  6:12 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Peter Zijlstra, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Ben Segall, Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long,
	Boqun Feng, Paul E. McKenney, Metin Kaya, Xuewen Yan,
	K Prateek Nayak, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kernel-team, Masami Hiramatsu

On Thu, Mar 13, 2025 at 3:14 AM Steven Rostedt <rostedt@goodmis.org> wrote:
> FYI, this is useful for Masami's "hung task" work that will show what
> tasks a hung task is blocked on in a crash report.
>
>   https://lore.kernel.org/all/174046694331.2194069.15472952050240807469.stgit@mhiramat.tok.corp.google.com/
>

Ah. Indeed, we have similar use cases. There's some slight difference
in when we consider the task blocked, especially in this early patch
(as waking tasks mark us as unblocked so we can be selected to run).
But later on in the series (in the portions I've not yet submitted
here) when the blocked_on_state has been introduced, the blocked_on
value approximates to about the same spot as used here.

So I should be able to unify these. It looks like Masami's patch is
close to being queued, so maybe I'll incorporate it into my series and
rework my set ontop. Any objections to this?

thanks
-john

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v15 6/7] sched: Fix proxy/current (push,pull)ability
  2025-03-12 22:11 ` [RFC PATCH v15 6/7] sched: Fix proxy/current (push,pull)ability John Stultz
@ 2025-03-14  8:40   ` K Prateek Nayak
  2025-03-15  5:10     ` John Stultz
  2025-03-17 14:07   ` Peter Zijlstra
  1 sibling, 1 reply; 36+ messages in thread
From: K Prateek Nayak @ 2025-03-14  8:40 UTC (permalink / raw)
  To: John Stultz, LKML
  Cc: Valentin Schneider, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kernel-team, Connor O'Brien

Hello John,

On 3/13/2025 3:41 AM, John Stultz wrote:

[..snip..]

> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b4f7b14f62a24..3596244f613f8 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6722,6 +6722,23 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
>   }
>   #endif /* SCHED_PROXY_EXEC */
>   
> +static inline void proxy_tag_curr(struct rq *rq, struct task_struct *owner)
> +{
> +	if (!sched_proxy_exec())
> +		return;
> +	/*
> +	 * pick_next_task() calls set_next_task() on the chosen task
> +	 * at some point, which ensures it is not push/pullable.
> +	 * However, the chosen/donor task *and* the mutex owner form an
> +	 * atomic pair wrt push/pull.
> +	 *
> +	 * Make sure owner we run is not pushable. Unfortunately we can
> +	 * only deal with that by means of a dequeue/enqueue cycle. :-/
> +	 */
> +	dequeue_task(rq, owner, DEQUEUE_NOCLOCK | DEQUEUE_SAVE);
> +	enqueue_task(rq, owner, ENQUEUE_NOCLOCK | ENQUEUE_RESTORE);
> +}
> +
>   /*
>    * __schedule() is the main scheduler function.
>    *
> @@ -6856,6 +6873,10 @@ static void __sched notrace __schedule(int sched_mode)
>   		 * changes to task_struct made by pick_next_task().
>   		 */
>   		RCU_INIT_POINTER(rq->curr, next);
> +
> +		if (!task_current_donor(rq, next))
> +			proxy_tag_curr(rq, next);

I don't see any dependency on rq->curr for task_current_donor() check.
Could this check be moved outside of the if-else block to avoid
duplicating in both places since rq_set_donor() was called just after
pick_next_task() or am I missing something?

> +
>   		/*
>   		 * The membarrier system call requires each architecture
>   		 * to have a full memory barrier after updating
> @@ -6890,6 +6911,10 @@ static void __sched notrace __schedule(int sched_mode)
>   		/* Also unlocks the rq: */
>   		rq = context_switch(rq, prev, next, &rf);
>   	} else {
> +		/* In case next was already curr but just got blocked_donor */
> +		if (!task_current_donor(rq, next))
> +			proxy_tag_curr(rq, next);
> +
>   		rq_unpin_lock(rq, &rf);
>   		__balance_callbacks(rq);
>   		raw_spin_rq_unlock_irq(rq);

-- 
Thanks and Regards,
Prateek



* Re: [RFC PATCH v15 6/7] sched: Fix proxy/current (push,pull)ability
  2025-03-14  8:40   ` K Prateek Nayak
@ 2025-03-15  5:10     ` John Stultz
  2025-03-15 16:06       ` K Prateek Nayak
  0 siblings, 1 reply; 36+ messages in thread
From: John Stultz @ 2025-03-15  5:10 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: LKML, Valentin Schneider, Joel Fernandes, Qais Yousef,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall,
	Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kernel-team, Connor O'Brien

On Fri, Mar 14, 2025 at 1:40 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> On 3/13/2025 3:41 AM, John Stultz wrote:
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index b4f7b14f62a24..3596244f613f8 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -6722,6 +6722,23 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
> >   }
> >   #endif /* SCHED_PROXY_EXEC */
> >
> > +static inline void proxy_tag_curr(struct rq *rq, struct task_struct *owner)
> > +{
> > +     if (!sched_proxy_exec())
> > +             return;
> > +     /*
> > +      * pick_next_task() calls set_next_task() on the chosen task
> > +      * at some point, which ensures it is not push/pullable.
> > +      * However, the chosen/donor task *and* the mutex owner form an
> > +      * atomic pair wrt push/pull.
> > +      *
> > +      * Make sure owner we run is not pushable. Unfortunately we can
> > +      * only deal with that by means of a dequeue/enqueue cycle. :-/
> > +      */
> > +     dequeue_task(rq, owner, DEQUEUE_NOCLOCK | DEQUEUE_SAVE);
> > +     enqueue_task(rq, owner, ENQUEUE_NOCLOCK | ENQUEUE_RESTORE);
> > +}
> > +
> >   /*
> >    * __schedule() is the main scheduler function.
> >    *
> > @@ -6856,6 +6873,10 @@ static void __sched notrace __schedule(int sched_mode)
> >                * changes to task_struct made by pick_next_task().
> >                */
> >               RCU_INIT_POINTER(rq->curr, next);
> > +
> > +             if (!task_current_donor(rq, next))
> > +                     proxy_tag_curr(rq, next);
>
> I don't see any dependency on rq->curr for task_current_donor() check.
> Could this check be moved outside of the if-else block to avoid
> duplicating in both places since rq_set_donor() was called just after
> pick_next_task() or am I missing something?

So this check is just looking to see if next is the same as the
selected rq->donor (what pick_next_task() chose).

If so, nothing to do, same as always.

But if not (so we are proxying in this case), we need to call
proxy_tag_curr(), because we have to make sure both the donor and the
proxy are not on a sched-class's pushable list.

This is because the logic around pick_next_task() calls
set_next_task() on the returned donor task, and in the sched-class
code, (for example RT) that logic will remove the chosen donor task
from the pushable list.

But when we find a proxy task to run on behalf of the donor, the
problem is that the proxy might be on the sched-class's pushable list.
So if we are proxying, we do a dequeue and enqueue pair, which lets us
re-evaluate whether the task is rq->curr and thus prevent it from
being added to any such pushable list. This avoids the potential of
the balance callbacks trying to migrate the rq->curr under us.

Thanks so much for the review and the question! Let me know if that
makes any more sense, or if you have suggestions on how I could better
explain it in the commit message to help.

Appreciate it!
-john


* Re: [RFC PATCH v15 4/7] sched: Fix runtime accounting w/ split exec & sched contexts
  2025-03-13 10:26   ` Steven Rostedt
@ 2025-03-15  6:05     ` John Stultz
  0 siblings, 0 replies; 36+ messages in thread
From: John Stultz @ 2025-03-15  6:05 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Ben Segall, Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long,
	Boqun Feng, Paul E. McKenney, Metin Kaya, Xuewen Yan,
	K Prateek Nayak, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kernel-team

On Thu, Mar 13, 2025 at 3:26 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Wed, 12 Mar 2025 15:11:34 -0700
> John Stultz <jstultz@google.com> wrote:
>
> > The idea here is we want to charge the scheduler-context task's
> > vruntime but charge the execution-context task's sum_exec_runtime.
>
> The "but" is confusing me. Do you mean, "and also"? The sentence
> doesn't make sense with "but" unless it was:
>
>   "The idea here is we DON'T want to charge the scheduler-context
>   task's vruntime but charge the execution-context task's
>   sum_exec_runtime INSTEAD."
>
> >
> > This way cputime accounting goes against the task actually running
> > but vruntime accounting goes against the rq->donor task so we get
> > proper fairness.
>
> But this shows that you want to do both, although, I would remove the
> "but" here too. Replace it with "while".
>
> Or maybe I'm just confused.

Apologies for being confusing. I know I can tangle my words sometimes. :)

Hrmm. I think maybe it's a bit more clear if I switch the order. ie:

Without proxy-exec, we normally charge the "current" task for both its
vruntime as well as its sum_exec_runtime.

With proxy, we want to charge the rq->curr (proxy/lock holder) time to
its sum_exec_runtime (so it's clear to userland the rq->curr task *is*
running).

*But*, instead of charging rq->curr for the vruntime, that is charged
against the rq->donor (lock waiter) task, because the donor is the one
providing the scheduler context that is being used.

If the donor and curr tasks are the same, then it's the same as without proxy.

Your suggestion of "while" is good as well. I'll try to reword this.


> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index c798d27952431..f8ad3a44b3771 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1129,22 +1129,33 @@ static void update_tg_load_avg(struct cfs_rq *cfs_rq)
> >  }
> >  #endif /* CONFIG_SMP */
> >
> > -static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
> > +static s64 update_curr_se(struct rq *rq, struct sched_entity *se)
>
> Should this be renamed to "update_se()" as it no longer appears to be
> updating "curr_se".
>
> >  {
> >       u64 now = rq_clock_task(rq);
> >       s64 delta_exec;
> >
> > -     delta_exec = now - curr->exec_start;
> > +     delta_exec = now - se->exec_start;
> >       if (unlikely(delta_exec <= 0))
> >               return delta_exec;
> >
> > -     curr->exec_start = now;
> > -     curr->sum_exec_runtime += delta_exec;
> > +     se->exec_start = now;
> > +     if (entity_is_task(se)) {
> > +             struct task_struct *running = rq->curr;
> > +             /*
> > +              * If se is a task, we account the time against the running
> > +              * task, as w/ proxy-exec they may not be the same.
> > +              */
> > +             running->se.exec_start = now;
> > +             running->se.sum_exec_runtime += delta_exec;
> > +     } else {
> > +             /* If not task, account the time against se */
> > +             se->sum_exec_runtime += delta_exec;
> > +     }
>
> Or maybe: update_proxy_se() ?

I'm hesitant to call it update_proxy_se() since it's still used even
when proxy-exec isn't enabled.

update_se() could work, but also feels a little generic.

Maybe update_se_times()? or account_time_se()?


Appreciate the review and feedback!

thanks
-john


* Re: [RFC PATCH v15 6/7] sched: Fix proxy/current (push,pull)ability
  2025-03-15  5:10     ` John Stultz
@ 2025-03-15 16:06       ` K Prateek Nayak
  0 siblings, 0 replies; 36+ messages in thread
From: K Prateek Nayak @ 2025-03-15 16:06 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Valentin Schneider, Joel Fernandes, Qais Yousef,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall,
	Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kernel-team, Connor O'Brien

Hello John,

On 3/15/2025 10:40 AM, John Stultz wrote:
[..snip..]
>>> @@ -6856,6 +6873,10 @@ static void __sched notrace __schedule(int sched_mode)
>>>                 * changes to task_struct made by pick_next_task().
>>>                 */
>>>                RCU_INIT_POINTER(rq->curr, next);
>>> +
>>> +             if (!task_current_donor(rq, next))
>>> +                     proxy_tag_curr(rq, next);
>>
>> I don't see any dependency on rq->curr for task_current_donor() check.
>> Could this check be moved outside of the if-else block to avoid
>> duplicating in both places since rq_set_donor() was called just after
>> pick_next_task() or am I missing something?
> 
> So this check is just looking to see if next is the same as the
> selected rq->donor (what pick_next_task() chose).
> 
> If so, nothing to do, same as always.
> 
> But If not (so we are proxying in this case), we need to call
> proxy_tag_curr() because we have to make sure both the donor and the
> proxy are not on a sched-classes pushable list.
> 
> This is because the logic around pick_next_task() calls
> set_next_task() on the returned donor task, and in the sched-class
> code, (for example RT) that logic will remove the chosen donor task
> from the pushable list.
> 
> But when we find a proxy task to run on behalf of the donor, the
> problem is that the proxy might be on the sched-class' pushable list.
> So if we are proxying, we do a dequeue and enqueue pair, which allows
> us to re-evaluate if the task is rq->curr, which will prevent it from
> being added to any such pushable list.   This avoids the potential of
> the balance callbacks trying to migrate the rq->curr under us.
> 
> Thanks so much for the review and the question! Let me know if that
> makes any more sense, or if you have suggestions on how I could better
> explain it in the commit message to help.

Thanks a ton for clarifying. I found the enqueue_task_rt() bits from
Patch 5 and then it made sense.

P.S. Could the enqueue_task_rt() bits be moved to this patch since it
fits here better?

I couldn't see the dependency for the enqueue bits in Patch 5: on
finding a "blocked_on" task, the logic simply dequeues it, and since
proxy_resched_idle() will nuke the rq->{curr,donor} reference before
that, it should be safe to move those bits here, unless I missed
something again :)

> 
> Appreciate it!
> -john

-- 
Thanks and Regards,
Prateek



* Re: [RFC PATCH v15 5/7] sched: Add an initial sketch of the find_proxy_task() function
  2025-03-12 22:11 ` [RFC PATCH v15 5/7] sched: Add an initial sketch of the find_proxy_task() function John Stultz
@ 2025-03-15 16:35   ` K Prateek Nayak
  2025-03-17 13:48   ` Peter Zijlstra
  1 sibling, 0 replies; 36+ messages in thread
From: K Prateek Nayak @ 2025-03-15 16:35 UTC (permalink / raw)
  To: John Stultz, LKML
  Cc: Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kernel-team

Hello John,

On 3/13/2025 3:41 AM, John Stultz wrote:
> Add a find_proxy_task() function which doesn't do much.
> 
> When we select a blocked task to run, we will just deactivate it
> and pick again. The exception being if it has become unblocked
> after find_proxy_task() was called.
> 
> Greatly simplified from patch by:
>    Peter Zijlstra (Intel) <peterz@infradead.org>
>    Juri Lelli <juri.lelli@redhat.com>
>    Valentin Schneider <valentin.schneider@arm.com>
>    Connor O'Brien <connoro@google.com>
> 
> Cc: Joel Fernandes <joelagnelf@nvidia.com>
> Cc: Qais Yousef <qyousef@layalina.io>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Valentin Schneider <vschneid@redhat.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Zimuzo Ezeozue <zezeozue@google.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Will Deacon <will@kernel.org>
> Cc: Waiman Long <longman@redhat.com>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Metin Kaya <Metin.Kaya@arm.com>
> Cc: Xuewen Yan <xuewen.yan94@gmail.com>
> Cc: K Prateek Nayak <kprateek.nayak@amd.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> Cc: Suleiman Souhlal <suleiman@google.com>
> Cc: kernel-team@android.com
> [jstultz: Split out from larger proxy patch and simplified
>   for review and testing.]
> Signed-off-by: John Stultz <jstultz@google.com>
> ---
> v5:
> * Split out from larger proxy patch
> v7:
> * Fixed unused function arguments, spelling nits, and tweaks for
>    clarity, pointed out by Metin Kaya
> * Fix build warning Reported-by: kernel test robot <lkp@intel.com>
>    Closes: https://lore.kernel.org/oe-kbuild-all/202311081028.yDLmCWgr-lkp@intel.com/
> v8:
> * Fixed case where we might return a blocked task from find_proxy_task()
> * Continued tweaks to handle avoiding returning blocked tasks
> v9:
> * Add zap_balance_callbacks helper to unwind balance_callbacks
>    when we will re-call pick_next_task() again.
> * Add extra comment suggested by Metin
> * Typo fixes from Metin
> * Moved adding proxy_resched_idle earlier in the series, as suggested
>    by Metin
> * Fix to call proxy_resched_idle() *prior* to deactivating next, to avoid
>    crashes caused by stale references to next
> * s/PROXY/SCHED_PROXY_EXEC/ as suggested by Metin
> * Number of tweaks and cleanups suggested by Metin
> * Simplify proxy_deactivate as suggested by Metin
> v11:
> * Tweaks for earlier simplification in try_to_deactivate_task
> v13:
> * Rename rename "next" to "donor" in find_proxy_task() for clarity
> * Similarly use "donor" instead of next in proxy_deactivate
> * Refactor/simplify proxy_resched_idle
> * Moved up a needed fix from later in the series
> v15:
> * Tweaked some comments to better explain the initial sketch of
>    find_proxy_task(), suggested by Qais
> * Build fixes for !CONFIG_SMP
> * Slight rework for blocked_on_state being added later in the
>    series.
> * Move the zap_balance_callbacks to later in the patch series
> ---
>   kernel/sched/core.c  | 103 +++++++++++++++++++++++++++++++++++++++++--
>   kernel/sched/rt.c    |  15 ++++++-
>   kernel/sched/sched.h |  10 ++++-
>   3 files changed, 122 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 3968c3967ec38..b4f7b14f62a24 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6600,7 +6600,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>    * Otherwise marks the task's __state as RUNNING
>    */
>   static bool try_to_block_task(struct rq *rq, struct task_struct *p,
> -			      unsigned long task_state)
> +			      unsigned long task_state, bool deactivate_cond)
>   {
>   	int flags = DEQUEUE_NOCLOCK;
>   
> @@ -6609,6 +6609,9 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
>   		return false;
>   	}
>   
> +	if (!deactivate_cond)
> +		return false;
> +
>   	p->sched_contributes_to_load =
>   		(task_state & TASK_UNINTERRUPTIBLE) &&
>   		!(task_state & TASK_NOLOAD) &&
> @@ -6632,6 +6635,93 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
>   	return true;
>   }
>   
> +#ifdef CONFIG_SCHED_PROXY_EXEC
> +
> +static inline struct task_struct *
> +proxy_resched_idle(struct rq *rq)

nit. Any reason why this was put in the next line?

> +{
> +	put_prev_task(rq, rq->donor);

Any reason we cannot do a:

     put_prev_set_next_task(rq, rq->donor, rq->idle)

here? I don't see any dependency on rq->donor in set_next_task_idle()
and it should be safe.

> +	rq_set_donor(rq, rq->idle);
> +	set_next_task(rq, rq->idle);
> +	set_tsk_need_resched(rq->idle);
> +	return rq->idle;
> +}
> +
> +static bool proxy_deactivate(struct rq *rq, struct task_struct *donor)
> +{
> +	unsigned long state = READ_ONCE(donor->__state);
> +
> +	/* Don't deactivate if the state has been changed to TASK_RUNNING */
> +	if (state == TASK_RUNNING)
> +		return false;
> +	/*
> +	 * Because we got donor from pick_next_task, it is *crucial*
> +	 * that we call proxy_resched_idle before we deactivate it.
> +	 * As once we deactivate donor, donor->on_rq is set to zero,
> +	 * which allows ttwu to immediately try to wake the task on
> +	 * another rq. So we cannot use *any* references to donor
> +	 * after that point. So things like cfs_rq->curr or rq->donor
> +	 * need to be changed from next *before* we deactivate.
> +	 */
> +	proxy_resched_idle(rq);
> +	return try_to_block_task(rq, donor, state, true);
> +}
> +
> +/*
> + * Initial simple sketch that just deactivates the blocked task
> + * chosen by pick_next_task() so we can then pick something that
> + * isn't blocked.
> + */
> +static struct task_struct *
> +find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
> +{
> +	struct task_struct *p = donor;
> +	struct mutex *mutex;
> +
> +	mutex = p->blocked_on;
> +	/* Something changed in the chain, so pick again */
> +	if (!mutex)
> +		return NULL;
> +	/*
> +	 * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
> +	 * and ensure @owner sticks around.
> +	 */
> +	raw_spin_lock(&mutex->wait_lock);
> +
> +	/* Check again that p is blocked with blocked_lock held */
> +	if (!task_is_blocked(p) || mutex != __get_task_blocked_on(p)) {
> +		/*
> +		 * Something changed in the blocked_on chain and
> +		 * we don't know if only at this level. So, let's
> +		 * just bail out completely and let __schedule
> +		 * figure things out (pick_again loop).
> +		 */
> +		goto out;
> +	}
> +
> +	if (!proxy_deactivate(rq, donor)) {
> +		/*
> +		 * XXX: For now, if deactivation failed, set donor
> +		 * as not blocked, as we aren't doing proxy-migrations
> +		 * yet (more logic will be needed then).
> +		 */
> +		__clear_task_blocked_on(donor, mutex);
> +		raw_spin_unlock(&mutex->wait_lock);
> +		return NULL;
> +	}
> +out:
> +	raw_spin_unlock(&mutex->wait_lock);
> +	return NULL; /* do pick_next_task again */
> +}
> +#else /* SCHED_PROXY_EXEC */
> +static struct task_struct *
> +find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
> +{
> +	WARN_ONCE(1, "This should never be called in the !SCHED_PROXY_EXEC case\n");
> +	return donor;
> +}
> +#endif /* SCHED_PROXY_EXEC */
> +
>   /*
>    * __schedule() is the main scheduler function.
>    *
> @@ -6739,12 +6829,19 @@ static void __sched notrace __schedule(int sched_mode)
>   			goto picked;
>   		}
>   	} else if (!preempt && prev_state) {
> -		try_to_block_task(rq, prev, prev_state);
> +		try_to_block_task(rq, prev, prev_state,
> +				  !task_is_blocked(prev));
>   		switch_count = &prev->nvcsw;
>   	}
>   
> -	next = pick_next_task(rq, prev, &rf);
> +pick_again:
> +	next = pick_next_task(rq, rq->donor, &rf);
>   	rq_set_donor(rq, next);
> +	if (unlikely(task_is_blocked(next))) {
> +		next = find_proxy_task(rq, next, &rf);
> +		if (!next)
> +			goto pick_again;
> +	}
>   picked:
>   	clear_tsk_need_resched(prev);
>   	clear_preempt_need_resched();
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 4b8e33c615b12..2d418e0efecc5 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1479,8 +1479,19 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
>   
>   	enqueue_rt_entity(rt_se, flags);
>   
> -	if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
> -		enqueue_pushable_task(rq, p);
> +	/*
> +	 * Current can't be pushed away. Selected is tied to current,
> +	 * so don't push it either.
> +	 */
> +	if (task_current(rq, p) || task_current_donor(rq, p))
> +		return;
> +	/*
> +	 * Pinned tasks can't be pushed.
> +	 */
> +	if (p->nr_cpus_allowed == 1)
> +		return;
> +
> +	enqueue_pushable_task(rq, p);
>   }
>   
>   static bool dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 05d2122533619..3e49d77ce2cdd 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2311,6 +2311,14 @@ static inline int task_current_donor(struct rq *rq, struct task_struct *p)
>   	return rq->donor == p;
>   }
>   
> +static inline bool task_is_blocked(struct task_struct *p)
> +{
> +	if (!sched_proxy_exec())
> +		return false;
> +
> +	return !!p->blocked_on;
> +}
> +
>   static inline int task_on_cpu(struct rq *rq, struct task_struct *p)
>   {
>   #ifdef CONFIG_SMP
> @@ -2520,7 +2528,7 @@ static inline void put_prev_set_next_task(struct rq *rq,
>   					  struct task_struct *prev,
>   					  struct task_struct *next)
>   {
> -	WARN_ON_ONCE(rq->curr != prev);
> +	WARN_ON_ONCE(rq->donor != prev);
>   
>   	__put_prev_set_next_dl_server(rq, prev, next);
>   

-- 
Thanks and Regards,
Prateek



* Re: [RFC PATCH v15 2/7] locking/mutex: Rework task_struct::blocked_on
  2025-03-14  6:12     ` John Stultz
@ 2025-03-16 16:33       ` Steven Rostedt
  2025-03-18 14:11       ` Masami Hiramatsu
  1 sibling, 0 replies; 36+ messages in thread
From: Steven Rostedt @ 2025-03-16 16:33 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Peter Zijlstra, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Ben Segall, Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long,
	Boqun Feng, Paul E. McKenney, Metin Kaya, Xuewen Yan,
	K Prateek Nayak, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kernel-team, Masami Hiramatsu

On Thu, 13 Mar 2025 23:12:57 -0700
John Stultz <jstultz@google.com> wrote:

> So I should be able to unify these. It looks like Masami's patch is
> close to being queued, so maybe I'll incorporate it into my series and
> rework my set on top. Any objections to this?

None by me.

-- Steve


* Re: [RFC PATCH v15 2/7] locking/mutex: Rework task_struct::blocked_on
  2025-03-13 10:13   ` Steven Rostedt
  2025-03-14  6:12     ` John Stultz
@ 2025-03-17 11:44     ` Peter Zijlstra
  1 sibling, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2025-03-17 11:44 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: John Stultz, LKML, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Ben Segall, Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long,
	Boqun Feng, Paul E. McKenney, Metin Kaya, Xuewen Yan,
	K Prateek Nayak, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kernel-team, Connor O'Brien,
	Masami Hiramatsu

On Thu, Mar 13, 2025 at 06:13:51AM -0400, Steven Rostedt wrote:
> 
> FYI, this is useful for Masami's "hung task" work that will show what
> tasks a hung task is blocked on in a crash report.

Yeah, but I would really rather not have that interfere.


* Re: [RFC PATCH v15 5/7] sched: Add an initial sketch of the find_proxy_task() function
  2025-03-12 22:11 ` [RFC PATCH v15 5/7] sched: Add an initial sketch of the find_proxy_task() function John Stultz
  2025-03-15 16:35   ` K Prateek Nayak
@ 2025-03-17 13:48   ` Peter Zijlstra
  1 sibling, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2025-03-17 13:48 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kernel-team

On Wed, Mar 12, 2025 at 03:11:35PM -0700, John Stultz wrote:
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 4b8e33c615b12..2d418e0efecc5 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1479,8 +1479,19 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
>  
>  	enqueue_rt_entity(rt_se, flags);
>  
> -	if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
> -		enqueue_pushable_task(rq, p);
> +	/*
> +	 * Current can't be pushed away. Selected is tied to current,
> +	 * so don't push it either.
> +	 */
> +	if (task_current(rq, p) || task_current_donor(rq, p))
> +		return;
> +	/*
> +	 * Pinned tasks can't be pushed.
> +	 */
> +	if (p->nr_cpus_allowed == 1)
> +		return;
> +
> +	enqueue_pushable_task(rq, p);
>  }
>  

As per always, deadline.c will have the exact same logic.


* Re: [RFC PATCH v15 6/7] sched: Fix proxy/current (push,pull)ability
  2025-03-12 22:11 ` [RFC PATCH v15 6/7] sched: Fix proxy/current (push,pull)ability John Stultz
  2025-03-14  8:40   ` K Prateek Nayak
@ 2025-03-17 14:07   ` Peter Zijlstra
  2025-03-28  4:45     ` K Prateek Nayak
  1 sibling, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2025-03-17 14:07 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Valentin Schneider, Joel Fernandes, Qais Yousef,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kernel-team,
	Connor O'Brien

On Wed, Mar 12, 2025 at 03:11:36PM -0700, John Stultz wrote:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b4f7b14f62a24..3596244f613f8 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6722,6 +6722,23 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
>  }
>  #endif /* SCHED_PROXY_EXEC */
>  
> +static inline void proxy_tag_curr(struct rq *rq, struct task_struct *owner)
> +{
> +	if (!sched_proxy_exec())
> +		return;
> +	/*
> +	 * pick_next_task() calls set_next_task() on the chosen task
> +	 * at some point, which ensures it is not push/pullable.
> +	 * However, the chosen/donor task *and* the mutex owner form an
> +	 * atomic pair wrt push/pull.
> +	 *
> +	 * Make sure owner we run is not pushable. Unfortunately we can
> +	 * only deal with that by means of a dequeue/enqueue cycle. :-/
> +	 */
> +	dequeue_task(rq, owner, DEQUEUE_NOCLOCK | DEQUEUE_SAVE);
> +	enqueue_task(rq, owner, ENQUEUE_NOCLOCK | ENQUEUE_RESTORE);
> +}

So this is probably fine at this point; but we should eventually look at
fixing this.

We can probably look at (ab)using sched_class::set_cpus_allowed() for
this.


* Re: [RFC PATCH v15 1/7] sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable
  2025-03-12 22:11 ` [RFC PATCH v15 1/7] sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable John Stultz
  2025-03-13 10:09   ` Steven Rostedt
@ 2025-03-17 14:33   ` Peter Zijlstra
  2025-03-17 14:44     ` John Stultz
  1 sibling, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2025-03-17 14:33 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kernel-team

On Wed, Mar 12, 2025 at 03:11:31PM -0700, John Stultz wrote:

> diff --git a/init/Kconfig b/init/Kconfig
> index d0d021b3fa3b3..b989ddc27444e 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -875,6 +875,16 @@ config UCLAMP_BUCKETS_COUNT
>  
>  	  If in doubt, use the default value.
>  
> +config SCHED_PROXY_EXEC
> +	bool "Proxy Execution"
> +	default n
> +	# Avoid some build failures w/ PREEMPT_RT until it can be fixed
> +	depends on !PREEMPT_RT
	depends on !SPM && !SCHED_CLASS_EXT

for now, right?

> +	depends on EXPERT
> +	help
> +	  This option enables proxy execution, a mechanism for mutex-owning
> +	  tasks to inherit the scheduling context of higher priority waiters.
> +


* Re: [RFC PATCH v15 1/7] sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable
  2025-03-17 14:33   ` Peter Zijlstra
@ 2025-03-17 14:44     ` John Stultz
  2025-03-17 14:50       ` Peter Zijlstra
  0 siblings, 1 reply; 36+ messages in thread
From: John Stultz @ 2025-03-17 14:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kernel-team

On Mon, Mar 17, 2025 at 3:33 PM Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Mar 12, 2025 at 03:11:31PM -0700, John Stultz wrote:
>
> > diff --git a/init/Kconfig b/init/Kconfig
> > index d0d021b3fa3b3..b989ddc27444e 100644
> > --- a/init/Kconfig
> > +++ b/init/Kconfig
> > @@ -875,6 +875,16 @@ config UCLAMP_BUCKETS_COUNT
> >
> >         If in doubt, use the default value.
> >
> > +config SCHED_PROXY_EXEC
> > +     bool "Proxy Execution"
> > +     default n
> > +     # Avoid some build failures w/ PREEMPT_RT until it can be fixed
> > +     depends on !PREEMPT_RT
>         depends on !SPM && !SCHED_CLASS_EXT
>
> for now, right?

Did you mean SMP there?  SMP is supported. Even with same rq proxying,
it just blocks the task and lets it sleep if the owner is on another
rq.

!SCHED_CLASS_EXT might be something to add, but mostly because the
SCHED_CLASS_EXT code probably doesn't have a way to understand the
split contexts yet.

I'll dig a bit more there.

Thanks so much for the feedback! Really appreciate it!
-john

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v15 1/7] sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable
  2025-03-17 14:44     ` John Stultz
@ 2025-03-17 14:50       ` Peter Zijlstra
  0 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2025-03-17 14:50 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kernel-team

On Mon, Mar 17, 2025 at 03:44:11PM +0100, John Stultz wrote:
> On Mon, Mar 17, 2025 at 3:33 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > On Wed, Mar 12, 2025 at 03:11:31PM -0700, John Stultz wrote:
> >
> > > diff --git a/init/Kconfig b/init/Kconfig
> > > index d0d021b3fa3b3..b989ddc27444e 100644
> > > --- a/init/Kconfig
> > > +++ b/init/Kconfig
> > > @@ -875,6 +875,16 @@ config UCLAMP_BUCKETS_COUNT
> > >
> > >         If in doubt, use the default value.
> > >
> > > +config SCHED_PROXY_EXEC
> > > +     bool "Proxy Execution"
> > > +     default n
> > > +     # Avoid some build failures w/ PREEMPT_RT until it can be fixed
> > > +     depends on !PREEMPT_RT
> >         depends on !SPM && !SCHED_CLASS_EXT
> >
> > for now, right?
> 
> Did you mean SMP there?  SMP is supported. Even with same rq proxying,
> it just blocks the task and lets it sleep if the owner is on another
> rq.

Ah, yes, typing hard. And yes, I missed it would work rq local. Still
trying to untangle the proxy_deactivate() failure case :-)

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v15 7/7] sched: Start blocked_on chain processing in find_proxy_task()
  2025-03-12 22:11 ` [RFC PATCH v15 7/7] sched: Start blocked_on chain processing in find_proxy_task() John Stultz
@ 2025-03-17 16:43   ` Peter Zijlstra
  2025-03-18  6:09     ` Peter Zijlstra
  2025-03-17 16:47   ` Peter Zijlstra
  2025-03-17 16:49   ` Peter Zijlstra
  2 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2025-03-17 16:43 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kernel-team, Valentin Schneider,
	Connor O'Brien

On Wed, Mar 12, 2025 at 03:11:37PM -0700, John Stultz wrote:
> @@ -6668,47 +6676,138 @@ static bool proxy_deactivate(struct rq *rq, struct task_struct *donor)
>  }
>  
>  /*
> + * Find runnable lock owner to proxy for mutex blocked donor
> + *
> + * Follow the blocked-on relation:
> + *   task->blocked_on -> mutex->owner -> task...
> + *
> + * Lock order:
> + *
> + *   p->pi_lock
> + *     rq->lock
> + *       mutex->wait_lock
> + *
> + * Returns the task that is going to be used as execution context (the one
> + * that is actually going to be run on cpu_of(rq)).
>   */
>  static struct task_struct *
>  find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
>  {
> +	struct task_struct *owner = NULL;
> +	struct task_struct *ret = NULL;
> +	int this_cpu = cpu_of(rq);
> +	struct task_struct *p;
>  	struct mutex *mutex;
>  
> +	/* Follow blocked_on chain. */
> +	for (p = donor; task_is_blocked(p); p = owner) {
> +		mutex = p->blocked_on;
> +		/* Something changed in the chain, so pick again */
> +		if (!mutex)
> +			return NULL;
>  		/*
> +		 * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
> +		 * and ensure @owner sticks around.
>  		 */
> +		raw_spin_lock(&mutex->wait_lock);

This comment -- that is only true if you kill __mutex_unlock_fast(),
which I don't think you did in the previous patches.

> +
> +		/* Check again that p is blocked with wait_lock held */
> +		if (mutex != __get_task_blocked_on(p)) {
> +			/*
> +			 * Something changed in the blocked_on chain and
> +			 * we don't know if only at this level. So, let's
> +			 * just bail out completely and let __schedule
> +			 * figure things out (pick_again loop).
> +			 */
> +			goto out;
> +		}
> +
> +		owner = __mutex_owner(mutex);
> +		if (!owner) {
> +			__clear_task_blocked_on(p, mutex);
> +			ret = p;
> +			goto out;
> +		}
> +
> +		if (task_cpu(owner) != this_cpu) {
> +			/* XXX Don't handle migrations yet */
> +			if (!proxy_deactivate(rq, donor))
> +				goto deactivate_failed;
> +			goto out;
> +		}
> +
> +		if (task_on_rq_migrating(owner)) {
> +			/*
> +			 * One of the chain of mutex owners is currently migrating to this
> +			 * CPU, but has not yet been enqueued because we are holding the
> +			 * rq lock. As a simple solution, just schedule rq->idle to give
> +			 * the migration a chance to complete. Much like the migrate_task
> +			 * case we should end up back in find_proxy_task(), this time
> +			 * hopefully with all relevant tasks already enqueued.
> +			 */
> +			raw_spin_unlock(&mutex->wait_lock);
> +			return proxy_resched_idle(rq);
> +		}
> +
> +		if (!owner->on_rq) {
> +			/* XXX Don't handle blocked owners yet */
> +			if (!proxy_deactivate(rq, donor))
> +				goto deactivate_failed;
> +			goto out;
> +		}
> +
> +		if (owner->se.sched_delayed) {
> +			/* XXX Don't handle delayed dequeue yet */
> +			if (!proxy_deactivate(rq, donor))
> +				goto deactivate_failed;
> +			goto out;
> +		}
> +
> +		if (owner == p) {
> +			/*
> +			 * It's possible we interleave with mutex_unlock like:
> +			 *
> +			 *				lock(&rq->lock);
> +			 *				  find_proxy_task()
> +			 * mutex_unlock()
> +			 *   lock(&wait_lock);
> +			 *   donor(owner) = current->blocked_donor;
> +			 *   unlock(&wait_lock);
> +			 *
> +			 *   wake_up_q();
> +			 *     ...
> +			 *       ttwu_runnable()
> +			 *         __task_rq_lock()
> +			 *				  lock(&wait_lock);
> +			 *				  owner == p
> +			 *
> +			 * Which leaves us to finish the ttwu_runnable() and make it go.
> +			 *
> +			 * So schedule rq->idle so that ttwu_runnable can get the rq lock
> +			 * and mark owner as running.
> +			 */
> +			raw_spin_unlock(&mutex->wait_lock);
> +			return proxy_resched_idle(rq);
> +		}
>  
>  		/*
> +		 * OK, now we're absolutely sure @owner is on this
> +		 * rq, therefore holding @rq->lock is sufficient to
> +		 * guarantee its existence, as per ttwu_remote().
>  		 */
>  		raw_spin_unlock(&mutex->wait_lock);
>  	}
> +
> +	WARN_ON_ONCE(owner && !owner->on_rq);
> +	return owner;
> +
> +deactivate_failed:
> +	/*
> +	 * XXX: For now, if deactivation failed, set donor
> +	 * as unblocked, as we aren't doing proxy-migrations
> +	 * yet (more logic will be needed then).
> +	 */
> +	donor->blocked_on = NULL; /* XXX not following locking rules :( */
>  out:
>  	raw_spin_unlock(&mutex->wait_lock);
>  	return NULL; /* do pick_next_task again */


Also, something like the below might be a cleanup -- I didn't check your
full series.

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6653,7 +6653,7 @@ proxy_resched_idle(struct rq *rq)
 	return rq->idle;
 }
 
-static bool proxy_deactivate(struct rq *rq, struct task_struct *donor)
+static bool __proxy_deactivate(struct rq *rq, struct task_struct *donor)
 {
 	unsigned long state = READ_ONCE(donor->__state);
 
@@ -6673,6 +6673,19 @@ static bool proxy_deactivate(struct rq *
 	return try_to_block_task(rq, donor, state, true);
 }
 
+static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *donor)
+{
+	if (!__proxy_deactivate(rq, donor)) {
+		/*
+		 * XXX: For now, if deactivation failed, set donor
+		 * as unblocked, as we aren't doing proxy-migrations
+		 * yet (more logic will be needed then).
+		 */
+		donor->blocked_on = NULL;
+	}
+	return NULL;
+}
+
 /*
  * Find runnable lock owner to proxy for mutex blocked donor
  *
@@ -6691,23 +6704,17 @@ static bool proxy_deactivate(struct rq *
 static struct task_struct *
 find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 {
-	struct task_struct *owner = NULL;
-	struct task_struct *ret = NULL;
+	struct task_struct *owner, *p;
 	int this_cpu = cpu_of(rq);
-	struct task_struct *p;
 	struct mutex *mutex;
 
 	/* Follow blocked_on chain. */
-	for (p = donor; task_is_blocked(p); p = owner) {
-		mutex = p->blocked_on;
-		/* Something changed in the chain, so pick again */
-		if (!mutex)
-			return NULL;
+	for (p = donor; (mutex = p->blocked_on); p = owner) {
 		/*
 		 * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
 		 * and ensure @owner sticks around.
 		 */
-		raw_spin_lock(&mutex->wait_lock);
+		guard(raw_spinlock)(&mutex->wait_lock);
 
 		/* Check again that p is blocked with wait_lock held */
 		if (mutex != __get_task_blocked_on(p)) {
@@ -6717,22 +6724,17 @@ find_proxy_task(struct rq *rq, struct ta
 			 * just bail out completely and let __schedule
 			 * figure things out (pick_again loop).
 			 */
-			goto out;
+			return NULL;
 		}
 
 		owner = __mutex_owner(mutex);
 		if (!owner) {
 			__clear_task_blocked_on(p, mutex);
-			ret = p;
-			goto out;
+			return p;
 		}
 
-		if (task_cpu(owner) != this_cpu) {
-			/* XXX Don't handle migrations yet */
-			if (!proxy_deactivate(rq, donor))
-				goto deactivate_failed;
-			goto out;
-		}
+		if (task_cpu(owner) != this_cpu)
+			return proxy_deactivate(rq, donor);
 
 		if (task_on_rq_migrating(owner)) {
 			/*
@@ -6743,22 +6745,17 @@ find_proxy_task(struct rq *rq, struct ta
 			 * case we should end up back in find_proxy_task(), this time
 			 * hopefully with all relevant tasks already enqueued.
 			 */
-			raw_spin_unlock(&mutex->wait_lock);
-			return proxy_resched_idle(rq);
+			goto ret_idle;
 		}
 
 		if (!owner->on_rq) {
 			/* XXX Don't handle blocked owners yet */
-			if (!proxy_deactivate(rq, donor))
-				goto deactivate_failed;
-			goto out;
+			return proxy_deactivate(rq, donor);
 		}
 
 		if (owner->se.sched_delayed) {
 			/* XXX Don't handle delayed dequeue yet */
-			if (!proxy_deactivate(rq, donor))
-				goto deactivate_failed;
-			goto out;
+			return proxy_deactivate(rq, donor);
 		}
 
 		if (owner == p) {
@@ -6784,8 +6781,7 @@ find_proxy_task(struct rq *rq, struct ta
 			 * So schedule rq->idle so that ttwu_runnable can get the rq lock
 			 * and mark owner as running.
 			 */
-			raw_spin_unlock(&mutex->wait_lock);
-			return proxy_resched_idle(rq);
+			goto ret_idle;
 		}
 
 		/*
@@ -6793,22 +6789,13 @@ find_proxy_task(struct rq *rq, struct ta
 		 * rq, therefore holding @rq->lock is sufficient to
 		 * guarantee its existence, as per ttwu_remote().
 		 */
-		raw_spin_unlock(&mutex->wait_lock);
 	}
 
 	WARN_ON_ONCE(owner && !owner->on_rq);
 	return owner;
 
-deactivate_failed:
-	/*
-	 * XXX: For now, if deactivation failed, set donor
-	 * as unblocked, as we aren't doing proxy-migrations
-	 * yet (more logic will be needed then).
-	 */
-	donor->blocked_on = NULL; /* XXX not following locking rules :( */
-out:
-	raw_spin_unlock(&mutex->wait_lock);
-	return NULL; /* do pick_next_task again */
+ret_idle:
+	return proxy_resched_idle(rq);
 }
 #else /* SCHED_PROXY_EXEC */
 static struct task_struct *

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v15 7/7] sched: Start blocked_on chain processing in find_proxy_task()
  2025-03-12 22:11 ` [RFC PATCH v15 7/7] sched: Start blocked_on chain processing in find_proxy_task() John Stultz
  2025-03-17 16:43   ` Peter Zijlstra
@ 2025-03-17 16:47   ` Peter Zijlstra
  2025-03-17 16:49   ` Peter Zijlstra
  2 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2025-03-17 16:47 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kernel-team, Valentin Schneider,
	Connor O'Brien

On Wed, Mar 12, 2025 at 03:11:37PM -0700, John Stultz wrote:
> @@ -6791,6 +6890,7 @@ static void __sched notrace __schedule(int sched_mode)
>  	struct rq_flags rf;
>  	struct rq *rq;
>  	int cpu;
> +	bool preserve_need_resched = false;
>  
>  	cpu = smp_processor_id();
>  	rq = cpu_rq(cpu);
> @@ -6858,9 +6958,12 @@ static void __sched notrace __schedule(int sched_mode)
>  		next = find_proxy_task(rq, next, &rf);
>  		if (!next)
>  			goto pick_again;
> +		if (next == rq->idle)
> +			preserve_need_resched = true;
>  	}
>  picked:
> -	clear_tsk_need_resched(prev);
> +	if (!preserve_need_resched)
> +		clear_tsk_need_resched(prev);
>  	clear_preempt_need_resched();
>  #ifdef CONFIG_SCHED_DEBUG
>  	rq->last_seen_need_resched_ns = 0;

I would really rather this was done like so:


--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6957,12 +6944,12 @@ static void __sched notrace __schedule(i
 		if (!next)
 			goto pick_again;
 		if (next == rq->idle)
-			preserve_need_resched = true;
+			goto keep_resched;
 	}
 picked:
-	if (!preserve_need_resched)
-		clear_tsk_need_resched(prev);
+	clear_tsk_need_resched(prev);
 	clear_preempt_need_resched();
+keep_resched:
 #ifdef CONFIG_SCHED_DEBUG
 	rq->last_seen_need_resched_ns = 0;
 #endif

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v15 7/7] sched: Start blocked_on chain processing in find_proxy_task()
  2025-03-12 22:11 ` [RFC PATCH v15 7/7] sched: Start blocked_on chain processing in find_proxy_task() John Stultz
  2025-03-17 16:43   ` Peter Zijlstra
  2025-03-17 16:47   ` Peter Zijlstra
@ 2025-03-17 16:49   ` Peter Zijlstra
  2 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2025-03-17 16:49 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kernel-team, Valentin Schneider,
	Connor O'Brien

On Wed, Mar 12, 2025 at 03:11:37PM -0700, John Stultz wrote:

> @@ -2950,8 +2951,15 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
>  	struct set_affinity_pending my_pending = { }, *pending = NULL;
>  	bool stop_pending, complete = false;
>  
> -	/* Can the task run on the task's current CPU? If so, we're done */
> -	if (cpumask_test_cpu(task_cpu(p), &p->cpus_mask)) {
> +	/*
> +	 * Can the task run on the task's current CPU? If so, we're done
> +	 *
> +	 * We are also done if the task is the current donor, boosting a lock-
> +	 * holding proxy, (and potentially has been migrated outside its
> +	 * current or previous affinity mask)
> +	 */
> +	if (cpumask_test_cpu(task_cpu(p), &p->cpus_mask) ||
> +	    (task_current_donor(rq, p) && !task_current(rq, p))) {
>  		struct task_struct *push_task = NULL;
>  
>  		if ((flags & SCA_MIGRATE_ENABLE) &&

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f8ad3a44b3771..091f1a01b3327 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9385,6 +9385,7 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>  	 * 3) cannot be migrated to this CPU due to cpus_ptr, or
>  	 * 4) running (obviously), or
>  	 * 5) are cache-hot on their current CPU.
> +	 * 6) are blocked on mutexes (if SCHED_PROXY_EXEC is enabled)
>  	 */
>  	if ((p->se.sched_delayed) && (env->migration_type != migrate_load))
>  		return 0;
> @@ -9406,6 +9407,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>  	if (kthread_is_per_cpu(p))
>  		return 0;
>  
> +	if (task_is_blocked(p))
> +		return 0;
> +
>  	if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
>  		int cpu;
>  
> @@ -9442,7 +9446,8 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>  	/* Record that we found at least one task that could run on dst_cpu */
>  	env->flags &= ~LBF_ALL_PINNED;
>  
> -	if (task_on_cpu(env->src_rq, p)) {
> +	if (task_on_cpu(env->src_rq, p) ||
> +	    task_current_donor(env->src_rq, p)) {
>  		schedstat_inc(p->stats.nr_failed_migrations_running);
>  		return 0;
>  	}


Somehow this and the previous patches that touched upon this made me
think that perhaps we can share with migrate_disable(). Specifically, we
seem to be adding those donor checks and hooks to exactly those
locations.

I've not actually tried though.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v15 7/7] sched: Start blocked_on chain processing in find_proxy_task()
  2025-03-17 16:43   ` Peter Zijlstra
@ 2025-03-18  6:09     ` Peter Zijlstra
  0 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2025-03-18  6:09 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kernel-team, Valentin Schneider,
	Connor O'Brien

On Mon, Mar 17, 2025 at 05:43:56PM +0100, Peter Zijlstra wrote:
> On Wed, Mar 12, 2025 at 03:11:37PM -0700, John Stultz wrote:
> > @@ -6668,47 +6676,138 @@ static bool proxy_deactivate(struct rq *rq, struct task_struct *donor)
> >  }
> >  
> >  /*
> > + * Find runnable lock owner to proxy for mutex blocked donor
> > + *
> > + * Follow the blocked-on relation:
> > + *   task->blocked_on -> mutex->owner -> task...
> > + *
> > + * Lock order:
> > + *
> > + *   p->pi_lock
> > + *     rq->lock
> > + *       mutex->wait_lock
> > + *
> > + * Returns the task that is going to be used as execution context (the one
> > + * that is actually going to be run on cpu_of(rq)).
> >   */
> >  static struct task_struct *
> >  find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
> >  {
> > +	struct task_struct *owner = NULL;
> > +	struct task_struct *ret = NULL;
> > +	int this_cpu = cpu_of(rq);
> > +	struct task_struct *p;
> >  	struct mutex *mutex;
> >  
> > +	/* Follow blocked_on chain. */
> > +	for (p = donor; task_is_blocked(p); p = owner) {
> > +		mutex = p->blocked_on;
> > +		/* Something changed in the chain, so pick again */
> > +		if (!mutex)
> > +			return NULL;
> >  		/*
> > +		 * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
> > +		 * and ensure @owner sticks around.
> >  		 */
> > +		raw_spin_lock(&mutex->wait_lock);
> 
> This comment -- that is only true if you kill __mutex_unlock_fast(),
> which I don't think you did in the previous patches.

Ignore this; I got myself confused again. :-)

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v15 2/7] locking/mutex: Rework task_struct::blocked_on
  2025-03-14  6:12     ` John Stultz
  2025-03-16 16:33       ` Steven Rostedt
@ 2025-03-18 14:11       ` Masami Hiramatsu
  2025-03-18 15:33         ` Lance Yang
  2025-03-19  8:54         ` John Stultz
  1 sibling, 2 replies; 36+ messages in thread
From: Masami Hiramatsu @ 2025-03-18 14:11 UTC (permalink / raw)
  To: John Stultz
  Cc: Steven Rostedt, LKML, Peter Zijlstra, Joel Fernandes, Qais Yousef,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kernel-team, Masami Hiramatsu,
	Lance Yang, Suleiman Souhlal, hikalium

On Thu, 13 Mar 2025 23:12:57 -0700
John Stultz <jstultz@google.com> wrote:

> On Thu, Mar 13, 2025 at 3:14 AM Steven Rostedt <rostedt@goodmis.org> wrote:
> > FYI, this is useful for Masami's "hung task" work that will show what
> > tasks a hung task is blocked on in a crash report.
> >
> >   https://lore.kernel.org/all/174046694331.2194069.15472952050240807469.stgit@mhiramat.tok.corp.google.com/
> >
> 
> Ah. Indeed, we have similar use cases. There's some slight difference
> in when we consider the task blocked, especially in this early patch
> (as waking tasks mark us as unblocked so we can be selected to run).
> But later on in the series (in the portions I've not yet submitted
> here) when the blocked_on_state has been introduced, the blocked_on
> value approximates to about the same spot as used here.

Interesting. Can you also track tasks which take other locks like
rwsem/semaphore? Lance is also working on expanding this to
support semaphores.

https://lore.kernel.org/all/20250314144300.32542-1-ioworker0@gmail.com/

Please add them for the next version.

> 
> So I should be able to unify these. It looks like Masami's patch is
> close to being queued, so maybe I'll incorporate it into my series and
> rework my set ontop. Any objections to this?

No :) Please Cc me. 


BTW, I had a chat with Suleiman and he suggested expanding
this idea to record which locks each task takes. Then we can search
for all tasks holding a given lock. Something like,

struct task_struct {
	unsigned long blocking_on;
	unsigned long holding_locks[HOLDING_LOCK_MAX];
	unsigned int holding_idx;
};

lock(lock_addr) {
	if (succeeded_to_lock) {
		current->holding_locks[current->holding_idx++] = lock_addr;
	} else {
		record_blocking_on(current, lock_addr)
		wait_for_lock();
		clear_blocking_on(current, lock_addr)
	}
}

unlock(lock_addr) {
	current->holding_locks[--current->holding_idx] = 0UL;
}

And when we find a hung task, call dump_blocker() like this:

dump_blocker() {
	lock_addr = hung_task->blocking_on;
	for_each_task(task) {
		if (find_lock(task->holding_locks, lock_addr)) {
			dump_task(task);
			/* semaphore, rwsem will need to dump all holders. */
			if (lock is mutex)
				break;
		}
	}
}

This can be too much, but it is an interesting idea for finding semaphore-type blockers.
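For illustration, the bookkeeping sketched above could be modeled in plain userspace C like this. All names here (record_lock_acquired(), find_blocker(), struct task fields) are hypothetical and only mirror the pseudocode; this is a sketch of the search idea, not actual kernel code:

```c
#include <assert.h>
#include <stddef.h>

#define HOLDING_LOCK_MAX 8

/* Userspace stand-in for the per-task lock bookkeeping above. */
struct task {
	const char *name;
	unsigned long blocked_on;	/* lock the task waits for, 0 if none */
	unsigned long holding_locks[HOLDING_LOCK_MAX];
	unsigned int holding_idx;
};

static void record_lock_acquired(struct task *t, unsigned long lock_addr)
{
	if (t->holding_idx < HOLDING_LOCK_MAX)
		t->holding_locks[t->holding_idx++] = lock_addr;
}

static void record_lock_released(struct task *t, unsigned long lock_addr)
{
	/* Pops in LIFO order, as in the unlock() pseudocode above. */
	if (t->holding_idx && t->holding_locks[t->holding_idx - 1] == lock_addr)
		t->holding_locks[--t->holding_idx] = 0UL;
}

/* dump_blocker()-style scan: first task holding what @hung waits on. */
static struct task *find_blocker(struct task *tasks, size_t ntasks,
				 struct task *hung)
{
	for (size_t i = 0; i < ntasks; i++) {
		for (unsigned int j = 0; j < tasks[i].holding_idx; j++) {
			if (tasks[i].holding_locks[j] == hung->blocked_on)
				return &tasks[i];
		}
	}
	return NULL;
}
```

As the email notes, for mutexes the scan can stop at the first holder, while semaphore/rwsem style locks would need to keep iterating to report all holders.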

Thank you,

> 
> thanks
> -john


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v15 2/7] locking/mutex: Rework task_struct::blocked_on
  2025-03-18 14:11       ` Masami Hiramatsu
@ 2025-03-18 15:33         ` Lance Yang
  2025-03-19  9:49           ` John Stultz
  2025-03-19  8:54         ` John Stultz
  1 sibling, 1 reply; 36+ messages in thread
From: Lance Yang @ 2025-03-18 15:33 UTC (permalink / raw)
  To: Masami Hiramatsu, John Stultz
  Cc: Steven Rostedt, LKML, Peter Zijlstra, Joel Fernandes, Qais Yousef,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kernel-team, hikalium, amaindex

On Tue, Mar 18, 2025 at 10:11 PM Masami Hiramatsu <mhiramat@kernel.org> wrote:
>
> On Thu, 13 Mar 2025 23:12:57 -0700
> John Stultz <jstultz@google.com> wrote:
>
> > On Thu, Mar 13, 2025 at 3:14 AM Steven Rostedt <rostedt@goodmis.org> wrote:
> > > FYI, this is useful for Masami's "hung task" work that will show what
> > > tasks a hung task is blocked on in a crash report.
> > >
> > >   https://lore.kernel.org/all/174046694331.2194069.15472952050240807469.stgit@mhiramat.tok.corp.google.com/
> > >
> >
> > Ah. Indeed, we have similar use cases. There's some slight difference
> > in when we consider the task blocked, especially in this early patch
> > (as waking tasks mark us as unblocked so we can be selected to run).
> > But later on in the series (in the portions I've not yet submitted
> > here) when the blocked_on_state has been introduced, the blocked_on
> > value approximates to about the same spot as used here.
>
> Interesting. Can you also track tasks which take other locks like
> rwsem/semaphore? Lance is also working on expanding this to
> support semaphores.
>
> https://lore.kernel.org/all/20250314144300.32542-1-ioworker0@gmail.com/
>
> Please add them for the next version.

Hi John,

When’s the next version expected? I intend to send a new version out
soon, and it’d be great if you could include it in the next version ;)

Also, since we have similar use cases, it might make sense to use
the same field to store the lock, encoding the lock type in the LSB as
Masami suggested.
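As a sketch of what that LSB tagging might look like (assuming blocker structures are at least 4-byte aligned, so the two low bits of the address are free; all macro and function names here are illustrative, not an existing kernel API):

```c
#include <assert.h>

/* Hypothetical encoding: lock type carried in the low bits of the
 * blocker address, which are free because mutex/rwsem/semaphore
 * objects are at least 4-byte aligned. */
#define BLOCKER_TYPE_MUTEX	0UL
#define BLOCKER_TYPE_SEM	1UL
#define BLOCKER_TYPE_RWSEM	2UL
#define BLOCKER_TYPE_MASK	3UL

static unsigned long blocker_encode(unsigned long lock_addr,
				    unsigned long type)
{
	/* Tag the aligned address with the lock type. */
	return lock_addr | type;
}

static unsigned long blocker_lock(unsigned long enc)
{
	/* Strip the type bits to recover the lock address. */
	return enc & ~BLOCKER_TYPE_MASK;
}

static unsigned long blocker_type(unsigned long enc)
{
	return enc & BLOCKER_TYPE_MASK;
}
```

The appeal of this scheme is that a single `unsigned long` field in task_struct can describe both which lock the task is blocked on and what kind of lock it is.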

>
> >
> > So I should be able to unify these. It looks like Masami's patch is
> > close to being queued, so maybe I'll incorporate it into my series and
> > rework my set ontop. Any objections to this?
>
> No :) Please Cc me.
>
>
> BTW, I had a chat with Suleiman and he suggested expanding
> this idea to record which locks each task takes. Then we can search
> for all tasks holding a given lock. Something like,

Hi Masami,

Yeah, that’s a really cool idea - being able to search for all tasks holding a
lock. Maybe we just keep it simple for now and extend it when we need to
handle more complex stuff like rwsem ;)

Thanks,
Lance

>
> struct task_struct {
>         unsigned long blocking_on;
>         unsigned long holding_locks[HOLDING_LOCK_MAX];
>         unsigned int holding_idx;
> };
>
> lock(lock_addr) {
>         if (succeeded_to_lock) {
>                 current->holding_locks[current->holding_idx++] = lock_addr;
>         } else {
>                 record_blocking_on(current, lock_addr)
>                 wait_for_lock();
>                 clear_blocking_on(current, lock_addr)
>         }
> }
>
> unlock(lock_addr) {
>         current->holding_locks[--current->holding_idx] = 0UL;
> }
>
> And when we find a hung task, call dump_blocker() like this:
>
> dump_blocker() {
>         lock_addr = hung_task->blocking_on;
>         for_each_task(task) {
>                 if (find_lock(task->holding_locks, lock_addr)) {
>                         dump_task(task);
>                         /* semaphore, rwsem will need to dump all holders. */
>                         if (lock is mutex)
>                                 break;
>                 }
>         }
> }
>
> This can be too much, but it is an interesting idea for finding semaphore-type blockers.
>
> Thank you,
>
> >
> > thanks
> > -john
>
>
> --
> Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v15 2/7] locking/mutex: Rework task_struct::blocked_on
  2025-03-18 14:11       ` Masami Hiramatsu
  2025-03-18 15:33         ` Lance Yang
@ 2025-03-19  8:54         ` John Stultz
  1 sibling, 0 replies; 36+ messages in thread
From: John Stultz @ 2025-03-19  8:54 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Steven Rostedt, LKML, Peter Zijlstra, Joel Fernandes, Qais Yousef,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kernel-team, Lance Yang,
	hikalium

On Tue, Mar 18, 2025 at 3:11 PM Masami Hiramatsu <mhiramat@kernel.org> wrote:
> On Thu, 13 Mar 2025 23:12:57 -0700
> John Stultz <jstultz@google.com> wrote:
>
> > On Thu, Mar 13, 2025 at 3:14 AM Steven Rostedt <rostedt@goodmis.org> wrote:
> > > FYI, this is useful for Masami's "hung task" work that will show what
> > > tasks a hung task is blocked on in a crash report.
> > >
> > >   https://lore.kernel.org/all/174046694331.2194069.15472952050240807469.stgit@mhiramat.tok.corp.google.com/
> > >
> >
> > Ah. Indeed, we have similar use cases. There's some slight difference
> > in when we consider the task blocked, especially in this early patch
> > (as waking tasks mark us as unblocked so we can be selected to run).
> > But later on in the series (in the portions I've not yet submitted
> > here) when the blocked_on_state has been introduced, the blocked_on
> > value approximates to about the same spot as used here.
>
> Interesting. Can you also track tasks which take other locks like
> rwsem/semaphore? Lance is also working on expanding this to
> support semaphores.

Currently no: proxy-exec is initially focused just on kernel mutexes.
However, I do hope to expand it to be usable with other locking
primitives, so something like what Lance is proposing would be needed
for that; I'm eager to make use of his work.

I've pulled both of your work into my tree and will try to rework my
logic on top.

> BTW, I had a chat with Suleiman and he suggested expanding
> this idea to record which locks each task takes. Then we can search
> for all tasks holding a given lock. Something like,
>
> struct task_struct {
>         unsigned long blocking_on;
>         unsigned long holding_locks[HOLDING_LOCK_MAX];
>         unsigned int holding_idx;
> };
>
> lock(lock_addr) {
>         if (succeeded_to_lock) {
>                 current->holding_locks[current->holding_idx++] = lock_addr;
>         } else {
>                 record_blocking_on(current, lock_addr)
>                 wait_for_lock();
>                 clear_blocking_on(current, lock_addr)
>         }
> }
>
> unlock(lock_addr) {
>         current->holding_locks[--current->holding_idx] = 0UL;
> }
>
> And when we find a hung task, call dump_blocker() like this:
>
> dump_blocker() {
>         lock_addr = hung_task->blocking_on;
>         for_each_task(task) {
>                 if (find_lock(task->holding_locks, lock_addr)) {
>                         dump_task(task);
>                         /* semaphore, rwsem will need to dump all holders. */
>                         if (lock is mutex)
>                                 break;
>                 }
>         }
> }
>
> This can be too much, but it is an interesting idea for finding semaphore-type blockers.

Yeah. I suspect the rw/sem -> owners mapping is a missing piece that
will be needed for proxy-exec, but I've not looked closely yet.

thanks
-john

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v15 2/7] locking/mutex: Rework task_struct::blocked_on
  2025-03-18 15:33         ` Lance Yang
@ 2025-03-19  9:49           ` John Stultz
  2025-03-19 12:05             ` Lance Yang
  0 siblings, 1 reply; 36+ messages in thread
From: John Stultz @ 2025-03-19  9:49 UTC (permalink / raw)
  To: Lance Yang
  Cc: Masami Hiramatsu, Steven Rostedt, LKML, Peter Zijlstra,
	Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider, Ben Segall,
	Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kernel-team,
	hikalium, amaindex

On Tue, Mar 18, 2025 at 4:33 PM Lance Yang <ioworker0@gmail.com> wrote:
>
> When’s the next version expected? I intend to send a new version out
> soon, and it’d be great if you could include it in the next version ;)

Yeah, I've pulled in your old version already, but I'll update my tree
with your new one.  Between conf travel and vacation, it might be
April before I get v16 out.

Most likely I'll be moving much of linux/hung_task.h into
linux/sched.h to get generic accessors for setting and clearing the
task blocked-on relationship (so it's not just tied to the hung task
debug logic). Then for proxy I just need to integrate it with the
additional blocked_on_state that is used to determine if the
(previously blocked, but left on the rq) task is runnable or not.

> Also, since we have similar use cases, it might make sense to use
> the same field to store the lock, encoding the lock type in the LSB as
> Masami suggested.

Yep. Agreed.  As I mentioned earlier, proxy only works with mutexes at
the moment, but I do intend to expand that once we have the core proxy
logic well tested upstream, so the multi-type blocked-on relationship
is really useful to have.

thanks!
-john

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v15 2/7] locking/mutex: Rework task_struct::blocked_on
  2025-03-19  9:49           ` John Stultz
@ 2025-03-19 12:05             ` Lance Yang
  0 siblings, 0 replies; 36+ messages in thread
From: Lance Yang @ 2025-03-19 12:05 UTC (permalink / raw)
  To: John Stultz
  Cc: Masami Hiramatsu, Steven Rostedt, LKML, Peter Zijlstra,
	Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider, Ben Segall,
	Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kernel-team,
	hikalium, amaindex

On Wed, Mar 19, 2025 at 5:49 PM John Stultz <jstultz@google.com> wrote:
>
> On Tue, Mar 18, 2025 at 4:33 PM Lance Yang <ioworker0@gmail.com> wrote:
> >
> > When’s the next version expected? I intend to send a new version out
> > soon, and it’d be great if you could include it in the next version ;)
>
> Yeah, I've pulled in your old version already, but I'll update my tree
> with your new one.  Between conf travel and vacation, it might be
> April before I get v16 out.

Appreciate it! I've sent a new one[1] out today and also Cc'd you. Will
keep you in the loop if anything changes ;)

>
> Most likely I'll be moving much of linux/hung_task.h into
> linux/sched.h to get generic accessors for setting and clearing the
> task blocked-on relationship (so it's not just tied to the hung-task
> debug logic). Then for proxy I just need to integrate that with the
> additional blocked_on_state that is used to determine whether the
> (previously blocked, but left on the rq) task is runnable or not.

Yes, that makes sense to me. Feel free to adjust as needed ;)

>
> > Also, since we have similar use cases, it might make sense to use
> > the same field to store the lock, encoding the lock type in the LSB as
> > Masami suggested.
>
> Yep. Agreed.  As I mentioned earlier, proxy only works with mutexes at
> the moment, but I do intend to expand that once we have the core proxy
> logic well tested upstream, so the multi-type blocked-on relationship
> is really useful to have.

Yeah, let's keep it simple for now and expand it later once the core proxy
logic is good to go ;)

Thanks again for your time!
Lance

>
> thanks!
> -john

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH v15 6/7] sched: Fix proxy/current (push,pull)ability
  2025-03-17 14:07   ` Peter Zijlstra
@ 2025-03-28  4:45     ` K Prateek Nayak
  0 siblings, 0 replies; 36+ messages in thread
From: K Prateek Nayak @ 2025-03-28  4:45 UTC (permalink / raw)
  To: Peter Zijlstra, John Stultz
  Cc: LKML, Valentin Schneider, Joel Fernandes, Qais Yousef,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kernel-team, Connor O'Brien

Hello John, Peter,

On 3/17/2025 7:37 PM, Peter Zijlstra wrote:
> On Wed, Mar 12, 2025 at 03:11:36PM -0700, John Stultz wrote:
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index b4f7b14f62a24..3596244f613f8 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -6722,6 +6722,23 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
>>   }
>>   #endif /* SCHED_PROXY_EXEC */
>>   
>> +static inline void proxy_tag_curr(struct rq *rq, struct task_struct *owner)
>> +{
>> +	if (!sched_proxy_exec())
>> +		return;
>> +	/*
>> +	 * pick_next_task() calls set_next_task() on the chosen task
>> +	 * at some point, which ensures it is not push/pullable.
>> +	 * However, the chosen/donor task *and* the mutex owner form an
>> +	 * atomic pair wrt push/pull.
>> +	 *
>> +	 * Make sure owner we run is not pushable. Unfortunately we can
>> +	 * only deal with that by means of a dequeue/enqueue cycle. :-/
>> +	 */
>> +	dequeue_task(rq, owner, DEQUEUE_NOCLOCK | DEQUEUE_SAVE);
>> +	enqueue_task(rq, owner, ENQUEUE_NOCLOCK | ENQUEUE_RESTORE);
>> +}
> 
> So this is probably fine at this point; but we should eventually look at
> fixing this.
> 
> We can probably look at (ab)using sched_class::set_cpus_allowed() for
> this.

Isn't the net result in this case just "dequeue_pushable_task()"?

Can it be defined as a class callback to avoid a full dequeue + requeue
overhead for the fair class during proxy?

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2025-03-28  4:45 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-03-12 22:11 [RFC PATCH v15 0/7] Single RunQueue Proxy Execution (v15) John Stultz
2025-03-12 22:11 ` [RFC PATCH v15 1/7] sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable John Stultz
2025-03-13 10:09   ` Steven Rostedt
2025-03-14  0:48     ` John Stultz
2025-03-17 14:33   ` Peter Zijlstra
2025-03-17 14:44     ` John Stultz
2025-03-17 14:50       ` Peter Zijlstra
2025-03-12 22:11 ` [RFC PATCH v15 2/7] locking/mutex: Rework task_struct::blocked_on John Stultz
2025-03-13 10:13   ` Steven Rostedt
2025-03-14  6:12     ` John Stultz
2025-03-16 16:33       ` Steven Rostedt
2025-03-18 14:11       ` Masami Hiramatsu
2025-03-18 15:33         ` Lance Yang
2025-03-19  9:49           ` John Stultz
2025-03-19 12:05             ` Lance Yang
2025-03-19  8:54         ` John Stultz
2025-03-17 11:44     ` Peter Zijlstra
2025-03-12 22:11 ` [RFC PATCH v15 3/7] locking/mutex: Add p->blocked_on wrappers for correctness checks John Stultz
2025-03-12 22:11 ` [RFC PATCH v15 4/7] sched: Fix runtime accounting w/ split exec & sched contexts John Stultz
2025-03-13 10:26   ` Steven Rostedt
2025-03-15  6:05     ` John Stultz
2025-03-13 17:24   ` K Prateek Nayak
2025-03-12 22:11 ` [RFC PATCH v15 5/7] sched: Add an initial sketch of the find_proxy_task() function John Stultz
2025-03-15 16:35   ` K Prateek Nayak
2025-03-17 13:48   ` Peter Zijlstra
2025-03-12 22:11 ` [RFC PATCH v15 6/7] sched: Fix proxy/current (push,pull)ability John Stultz
2025-03-14  8:40   ` K Prateek Nayak
2025-03-15  5:10     ` John Stultz
2025-03-15 16:06       ` K Prateek Nayak
2025-03-17 14:07   ` Peter Zijlstra
2025-03-28  4:45     ` K Prateek Nayak
2025-03-12 22:11 ` [RFC PATCH v15 7/7] sched: Start blocked_on chain processing in find_proxy_task() John Stultz
2025-03-17 16:43   ` Peter Zijlstra
2025-03-18  6:09     ` Peter Zijlstra
2025-03-17 16:47   ` Peter Zijlstra
2025-03-17 16:49   ` Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox