public inbox for linux-kernel@vger.kernel.org
* [RFC][PATCH v14 0/7] Single CPU Proxy Execution (v14)
@ 2024-11-25 19:51 John Stultz
  2024-11-25 19:51 ` [RFC][PATCH v14 1/7] sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable John Stultz
                   ` (6 more replies)
  0 siblings, 7 replies; 23+ messages in thread
From: John Stultz @ 2024-11-25 19:51 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, kernel-team

Hey All,

After sending out the last revision, I got some feedback that
pointed out I hadn't done much testing with the now-upstream
PREEMPT_RT option, resulting in a few build issues. So I wanted
to send out the next iteration of Proxy Execution - an approach
to a generalized form of priority inheritance.

In this series, I'm only submitting the logic to support Proxy
Execution as both a build and runtime option, the mutex
blocked_on rework, some small fixes for assumptions that proxy
execution changes, and the initial logic to run lock owners in
place of the waiting task on the same CPU.

With v14 of this set, most of the changes are just build fixups
related to PREEMPT_RT. With PREEMPT_RT, the abstraction around
mutexes means accessing the mutex owner runs into trouble, as
the underlying structure changes to an rt_mutex, so I need to do
some further work abstracting how I access that to get it
working. For now, I've just made the SCHED_PROXY_EXEC option
mutually exclusive with PREEMPT_RT.

I’ve also continued working on the rest of the series, which you
can find here:
  https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v14-6.13-rc1/
  https://github.com/johnstultz-work/linux-dev.git proxy-exec-v14-6.13-rc1

New changes in the full series include:
* Rework of sleeping_owner handling so that we properly deal
  with delayed-dequeued (sched_delayed) tasks (also removes now
  unused proxy_deactivate() logic)
* Improving edge cases in ttwu where we wouldn’t set the task
  as BO_RUNNABLE
* Making sure we call block_task() last in proxy_enqueue_on_owner
  and don't touch the task again, to avoid races where it might
  be activated on another cpu
* Make sure we always activate blocked_entities when we exit
  from ttwu
* Fix to enqueue the last task in the chain (p) on the blocked
  owner instead of the donor, preserving the chain structure so
  that mid-chain wakeups propagate properly

Issues still to address with the full series:
* While I think I’ve now properly handled delayed dequeued tasks,
  I’d still appreciate any input on ways of better generalizing
  these multiple approaches to having un-runnable blocked tasks
  remaining on the runqueue.
* Even with some of the fixes in this version (and again, to be
  clear, not with the same-rq proxying series I'm sending out
  here), I still have to include some workarounds to avoid hitting
  some rare cases of what seem to be lost wakeups, where a task
  was marked as BO_WAKING, but then ttwu never managed to
  transition it to BO_RUNNABLE. The workarounds handle doing the
  return migration from find_proxy_task() but I still feel that
  those fixups shouldn’t be necessary, so I suspect the ttwu
  logic has a race somewhere I’m missing.
* K Prateek Nayak did some testing about a year ago with an
  earlier version of the series and saw ~3-5% regressions in
  some cases. I’m hoping to look into this soon to see if we
  can reduce those further.
* The chain migration functionality needs further iterations
  and better validation to ensure it truly maintains the RT/DL
  load balancing invariants (despite this being broken in
  vanilla upstream with RT_PUSH_IPI currently)

I’d really appreciate any feedback or review thoughts on this
series! I’m trying to keep the chunks small, reviewable and
iteratively testable, but if you have any suggestions on how to
improve the series, I’m all ears.

Credit/Disclaimer:
--------------------
As mentioned previously, this Proxy Execution series has a long
history:

First described in a paper[1] by Watkins, Straub, and Niehaus,
it was then prototyped in patches from Peter Zijlstra, and
extended with lots of work by Juri Lelli, Valentin Schneider,
and Connor O'Brien. (And thank you to Steven Rostedt for
providing additional details here!)

So again, many thanks to those above, as all the credit for this
series really is due to them - while the mistakes are likely
mine.

Thanks so much!
-john

[1] https://static.lwn.net/images/conf/rtlws11/papers/proc/p38.pdf


Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: kernel-team@android.com

John Stultz (4):
  sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable
  sched: Fix runtime accounting w/ split exec & sched contexts
  sched: Fix psi_dequeue for Proxy Execution
  sched: Add an initial sketch of the find_proxy_task() function

Peter Zijlstra (2):
  locking/mutex: Rework task_struct::blocked_on
  sched: Start blocked_on chain processing in find_proxy_task()

Valentin Schneider (1):
  sched: Fix proxy/current (push,pull)ability

 .../admin-guide/kernel-parameters.txt         |   5 +
 include/linux/sched.h                         |  79 ++++-
 init/Kconfig                                  |   9 +
 init/init_task.c                              |   1 +
 kernel/fork.c                                 |   4 +-
 kernel/locking/mutex-debug.c                  |   9 +-
 kernel/locking/mutex.c                        |  40 ++-
 kernel/locking/mutex.h                        |   3 +-
 kernel/locking/ww_mutex.h                     |  24 +-
 kernel/sched/core.c                           | 300 +++++++++++++++++-
 kernel/sched/fair.c                           |  31 +-
 kernel/sched/rt.c                             |  15 +-
 kernel/sched/sched.h                          |  22 +-
 kernel/sched/stats.h                          |   6 +-
 14 files changed, 511 insertions(+), 37 deletions(-)

-- 
2.47.0.371.ga323438b13-goog


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC][PATCH v14 1/7] sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable
  2024-11-25 19:51 [RFC][PATCH v14 0/7] Single CPU Proxy Execution (v14) John Stultz
@ 2024-11-25 19:51 ` John Stultz
  2024-11-25 19:51 ` [RFC][PATCH v14 2/7] locking/mutex: Rework task_struct::blocked_on John Stultz
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 23+ messages in thread
From: John Stultz @ 2024-11-25 19:51 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, kernel-team

Add a CONFIG_SCHED_PROXY_EXEC option, along with a boot argument
sched_proxy_exec= that can be used to disable the feature at boot
time if CONFIG_SCHED_PROXY_EXEC is enabled.

Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: kernel-team@android.com
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: John Stultz <jstultz@google.com>
---
v7:
* Switch to CONFIG_SCHED_PROXY_EXEC/sched_proxy_exec= as
  suggested by Metin Kaya.
* Switch boot arg from =disable/enable to use kstrtobool(),
  which supports =yes|no|1|0|true|false|on|off, as also
  suggested by Metin Kaya, and print a message when a boot
  argument is used.
v8:
* Move CONFIG_SCHED_PROXY_EXEC under Scheduler Features as
  Suggested by Metin
* Minor rework reordering with split sched contexts patch
v12:
* Rework for selected -> donor renaming
v14:
* Depend on !PREEMPT_RT to avoid build issues for now
---
 .../admin-guide/kernel-parameters.txt         |  5 ++++
 include/linux/sched.h                         | 13 +++++++++
 init/Kconfig                                  |  9 ++++++
 kernel/sched/core.c                           | 29 +++++++++++++++++++
 kernel/sched/sched.h                          | 12 ++++++++
 5 files changed, 68 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index a4736ee87b1aa..761e858c62b92 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5955,6 +5955,11 @@
 	sa1100ir	[NET]
 			See drivers/net/irda/sa1100_ir.c.
 
+	sched_proxy_exec= [KNL]
+			Enables or disables "proxy execution" style
+			solution to mutex-based priority inversion.
+			Format: <bool>
+
 	sched_verbose	[KNL,EARLY] Enables verbose scheduler debug messages.
 
 	schedstats=	[KNL,X86] Enable or disable scheduled statistics.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f0e9e00d3cf52..24e338ac34d7b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1610,6 +1610,19 @@ struct task_struct {
 	 */
 };
 
+#ifdef CONFIG_SCHED_PROXY_EXEC
+DECLARE_STATIC_KEY_TRUE(__sched_proxy_exec);
+static inline bool sched_proxy_exec(void)
+{
+	return static_branch_likely(&__sched_proxy_exec);
+}
+#else
+static inline bool sched_proxy_exec(void)
+{
+	return false;
+}
+#endif
+
 #define TASK_REPORT_IDLE	(TASK_REPORT + 1)
 #define TASK_REPORT_MAX		(TASK_REPORT_IDLE << 1)
 
diff --git a/init/Kconfig b/init/Kconfig
index b07f238f3badb..364bd2065b1f1 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -863,6 +863,15 @@ config UCLAMP_BUCKETS_COUNT
 
 	  If in doubt, use the default value.
 
+config SCHED_PROXY_EXEC
+	bool "Proxy Execution"
+	default n
+	# Avoid some build failures w/ PREEMPT_RT until it can be fixed
+	depends on !PREEMPT_RT
+	help
+	  This option enables proxy execution, a mechanism for mutex-owning
+	  tasks to inherit the scheduling context of higher priority waiters.
+
 endmenu
 
 #
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 95e40895a5190..d712e177d3b75 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -119,6 +119,35 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(sched_compute_energy_tp);
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
 
+#ifdef CONFIG_SCHED_PROXY_EXEC
+DEFINE_STATIC_KEY_TRUE(__sched_proxy_exec);
+static int __init setup_proxy_exec(char *str)
+{
+	bool proxy_enable;
+
+	if (kstrtobool(str, &proxy_enable)) {
+		pr_warn("Unable to parse sched_proxy_exec=\n");
+		return 0;
+	}
+
+	if (proxy_enable) {
+		pr_info("sched_proxy_exec enabled via boot arg\n");
+		static_branch_enable(&__sched_proxy_exec);
+	} else {
+		pr_info("sched_proxy_exec disabled via boot arg\n");
+		static_branch_disable(&__sched_proxy_exec);
+	}
+	return 1;
+}
+#else
+static int __init setup_proxy_exec(char *str)
+{
+	pr_warn("CONFIG_SCHED_PROXY_EXEC=n, so it cannot be enabled or disabled at boot time\n");
+	return 0;
+}
+#endif
+__setup("sched_proxy_exec=", setup_proxy_exec);
+
 #ifdef CONFIG_SCHED_DEBUG
 /*
  * Debugging: various feature bits
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 76f5f53a645fc..24eae02ddc7f6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1148,10 +1148,15 @@ struct rq {
 	 */
 	unsigned int		nr_uninterruptible;
 
+#ifdef CONFIG_SCHED_PROXY_EXEC
+	struct task_struct __rcu	*donor;  /* Scheduling context */
+	struct task_struct __rcu	*curr;   /* Execution context */
+#else
 	union {
 		struct task_struct __rcu *donor; /* Scheduler context */
 		struct task_struct __rcu *curr;  /* Execution context */
 	};
+#endif
 	struct sched_dl_entity	*dl_server;
 	struct task_struct	*idle;
 	struct task_struct	*stop;
@@ -1348,10 +1353,17 @@ DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
 #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
 #define raw_rq()		raw_cpu_ptr(&runqueues)
 
+#ifdef CONFIG_SCHED_PROXY_EXEC
+static inline void rq_set_donor(struct rq *rq, struct task_struct *t)
+{
+	rcu_assign_pointer(rq->donor, t);
+}
+#else
 static inline void rq_set_donor(struct rq *rq, struct task_struct *t)
 {
 	/* Do nothing */
 }
+#endif
 
 #ifdef CONFIG_SCHED_CORE
 static inline struct cpumask *sched_group_span(struct sched_group *sg);
-- 
2.47.0.371.ga323438b13-goog


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC][PATCH v14 2/7] locking/mutex: Rework task_struct::blocked_on
  2024-11-25 19:51 [RFC][PATCH v14 0/7] Single CPU Proxy Execution (v14) John Stultz
  2024-11-25 19:51 ` [RFC][PATCH v14 1/7] sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable John Stultz
@ 2024-11-25 19:51 ` John Stultz
  2024-12-13 23:22   ` Peter Zijlstra
  2024-11-25 19:51 ` [RFC][PATCH v14 3/7] sched: Fix runtime accounting w/ split exec & sched contexts John Stultz
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: John Stultz @ 2024-11-25 19:51 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, kernel-team, Connor O'Brien, John Stultz

From: Peter Zijlstra <peterz@infradead.org>

Track the blocked-on relation for mutexes, to allow following this
relation at schedule time.

   task
     | blocked-on
     v
   mutex
     | owner
     v
   task

Also add a blocked_on_state value so we can distinguish when a
task is blocked_on a mutex, but is either blocked, waking up, or
runnable (such that it can try to acquire the lock it's blocked
on).

This avoids some of the subtle and racy games where the
blocked_on state gets cleared, only to have it re-added by the
mutex_lock_slowpath call when it tries to acquire the lock on
wakeup.

Also add blocked_lock to the task_struct so we can safely
serialize the blocked-on state.

Finally add wrappers that are useful to provide correctness
checks. Folded in from a patch by:
  Valentin Schneider <valentin.schneider@arm.com>

All of this will be used for tracking blocked-task/mutex chains
with the proxy-execution patches, in a similar fashion to how
priority inheritance is done with rt_mutexes.

Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: kernel-team@android.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[minor changes while rebasing]
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Connor O'Brien <connoro@google.com>
[jstultz: Fix blocked_on tracking in __mutex_lock_common in error paths]
Signed-off-by: John Stultz <jstultz@google.com>
---
v2:
* Fixed blocked_on tracking in error paths that was causing crashes
v4:
* Ensure we clear blocked_on when waking ww_mutexes to die or wound.
  This is critical so we don't get circular blocked_on relationships
  that can't be resolved.
v5:
* Fix potential bug where the skip_wait path might clear blocked_on
  when that path never set it
* Slight tweaks to where we set blocked_on to make it consistent,
  along with extra WARN_ON correctness checking
* Minor comment changes
v7:
* Minor commit message change suggested by Metin Kaya
* Fix WARN_ON conditionals in unlock path (as blocked_on might already
  be cleared), found while looking at issue Metin Kaya raised.
* Minor tweaks to be consistent in what we do under the
  blocked_on lock, also tweaked variable name to avoid confusion
  with label, and comment typos, as suggested by Metin Kaya
* Minor tweak for CONFIG_SCHED_PROXY_EXEC name change
* Moved unused block of code to later in the series, as suggested
  by Metin Kaya
* Switch to a tri-state to be able to distinguish from waking and
  runnable so we can later safely do return migration from ttwu
* Folded together with related blocked_on changes
v8:
* Fix issue leaving task BO_BLOCKED when calling into optimistic
  spinning path.
* Include helper to better handle BO_BLOCKED->BO_WAKING transitions
v9:
* Typo fixup pointed out by Metin
* Cleanup BO_WAKING->BO_RUNNABLE transitions for the !proxy case
* Many cleanups and simplifications suggested by Metin
v11:
* Whitespace fixup pointed out by Metin
v13:
* Refactor set_blocked_on helpers clean things up a bit
v14:
* Small build fixup with PREEMPT_RT
---
 include/linux/sched.h        | 66 ++++++++++++++++++++++++++++++++----
 init/init_task.c             |  1 +
 kernel/fork.c                |  4 +--
 kernel/locking/mutex-debug.c |  9 ++---
 kernel/locking/mutex.c       | 40 ++++++++++++++++++----
 kernel/locking/ww_mutex.h    | 24 +++++++++++--
 kernel/sched/core.c          |  1 +
 7 files changed, 125 insertions(+), 20 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 24e338ac34d7b..0ad8033f8c2b9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -34,6 +34,7 @@
 #include <linux/sched/prio.h>
 #include <linux/sched/types.h>
 #include <linux/signal_types.h>
+#include <linux/spinlock.h>
 #include <linux/syscall_user_dispatch_types.h>
 #include <linux/mm_types_task.h>
 #include <linux/netdevice_xmit.h>
@@ -775,6 +776,12 @@ struct kmap_ctrl {
 #endif
 };
 
+enum blocked_on_state {
+	BO_RUNNABLE,
+	BO_BLOCKED,
+	BO_WAKING,
+};
+
 struct task_struct {
 #ifdef CONFIG_THREAD_INFO_IN_TASK
 	/*
@@ -1195,10 +1202,9 @@ struct task_struct {
 	struct rt_mutex_waiter		*pi_blocked_on;
 #endif
 
-#ifdef CONFIG_DEBUG_MUTEXES
-	/* Mutex deadlock detection: */
-	struct mutex_waiter		*blocked_on;
-#endif
+	enum blocked_on_state		blocked_on_state;
+	struct mutex			*blocked_on;	/* lock we're blocked on */
+	raw_spinlock_t			blocked_lock;
 
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
 	int				non_block_count;
@@ -2118,6 +2124,56 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock);
 	__cond_resched_rwlock_write(lock);					\
 })
 
+static inline void __set_blocked_on_runnable(struct task_struct *p)
+{
+	lockdep_assert_held(&p->blocked_lock);
+
+	if (p->blocked_on_state == BO_WAKING)
+		p->blocked_on_state = BO_RUNNABLE;
+}
+
+static inline void set_blocked_on_runnable(struct task_struct *p)
+{
+	unsigned long flags;
+
+	if (!sched_proxy_exec())
+		return;
+
+	raw_spin_lock_irqsave(&p->blocked_lock, flags);
+	__set_blocked_on_runnable(p);
+	raw_spin_unlock_irqrestore(&p->blocked_lock, flags);
+}
+
+static inline void set_blocked_on_waking(struct task_struct *p)
+{
+	lockdep_assert_held(&p->blocked_lock);
+
+	if (p->blocked_on_state == BO_BLOCKED)
+		p->blocked_on_state = BO_WAKING;
+}
+
+static inline void set_task_blocked_on(struct task_struct *p, struct mutex *m)
+{
+	lockdep_assert_held(&p->blocked_lock);
+
+	/*
+	 * Check that we are either clearing the value to NULL or
+	 * setting it from NULL, to ensure we don't overwrite an
+	 * existing mutex value or clear an already cleared one.
+	 */
+	WARN_ON((!m && !p->blocked_on) || (m && p->blocked_on));
+
+	p->blocked_on = m;
+	p->blocked_on_state = m ? BO_BLOCKED : BO_RUNNABLE;
+}
+
+static inline struct mutex *get_task_blocked_on(struct task_struct *p)
+{
+	lockdep_assert_held(&p->blocked_lock);
+
+	return p->blocked_on;
+}
+
 static __always_inline bool need_resched(void)
 {
 	return unlikely(tif_need_resched());
@@ -2157,8 +2213,6 @@ extern bool sched_task_on_rq(struct task_struct *p);
 extern unsigned long get_wchan(struct task_struct *p);
 extern struct task_struct *cpu_curr_snapshot(int cpu);
 
-#include <linux/spinlock.h>
-
 /*
  * In order to reduce various lock holder preemption latencies provide an
  * interface to see if a vCPU is currently running or not.
diff --git a/init/init_task.c b/init/init_task.c
index e557f622bd906..7e29d86153d9f 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -140,6 +140,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
 	.journal_info	= NULL,
 	INIT_CPU_TIMERS(init_task)
 	.pi_lock	= __RAW_SPIN_LOCK_UNLOCKED(init_task.pi_lock),
+	.blocked_lock	= __RAW_SPIN_LOCK_UNLOCKED(init_task.blocked_lock),
 	.timer_slack_ns = 50000, /* 50 usec default slack */
 	.thread_pid	= &init_struct_pid,
 	.thread_node	= LIST_HEAD_INIT(init_signals.thread_head),
diff --git a/kernel/fork.c b/kernel/fork.c
index f253e81d0c28e..160bead843afb 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2231,6 +2231,7 @@ __latent_entropy struct task_struct *copy_process(
 	ftrace_graph_init_task(p);
 
 	rt_mutex_init_task(p);
+	raw_spin_lock_init(&p->blocked_lock);
 
 	lockdep_assert_irqs_enabled();
 #ifdef CONFIG_PROVE_LOCKING
@@ -2329,9 +2330,8 @@ __latent_entropy struct task_struct *copy_process(
 	lockdep_init_task(p);
 #endif
 
-#ifdef CONFIG_DEBUG_MUTEXES
+	p->blocked_on_state = BO_RUNNABLE;
 	p->blocked_on = NULL; /* not blocked yet */
-#endif
 #ifdef CONFIG_BCACHE
 	p->sequential_io	= 0;
 	p->sequential_io_avg	= 0;
diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
index 6e6f6071cfa27..1d8cff71f65e1 100644
--- a/kernel/locking/mutex-debug.c
+++ b/kernel/locking/mutex-debug.c
@@ -53,17 +53,18 @@ void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter,
 {
 	lockdep_assert_held(&lock->wait_lock);
 
-	/* Mark the current thread as blocked on the lock: */
-	task->blocked_on = waiter;
+	/* Current thread can't be already blocked (since it's executing!) */
+	DEBUG_LOCKS_WARN_ON(get_task_blocked_on(task));
 }
 
 void debug_mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
 			 struct task_struct *task)
 {
+	struct mutex *blocked_on = get_task_blocked_on(task);
+
 	DEBUG_LOCKS_WARN_ON(list_empty(&waiter->list));
 	DEBUG_LOCKS_WARN_ON(waiter->task != task);
-	DEBUG_LOCKS_WARN_ON(task->blocked_on != waiter);
-	task->blocked_on = NULL;
+	DEBUG_LOCKS_WARN_ON(blocked_on && blocked_on != lock);
 
 	INIT_LIST_HEAD(&waiter->list);
 	waiter->task = NULL;
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 3302e52f0c967..8f5d3fe6c1029 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -597,6 +597,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 	}
 
 	raw_spin_lock_irqsave(&lock->wait_lock, flags);
+	raw_spin_lock(&current->blocked_lock);
 	/*
 	 * After waiting to acquire the wait_lock, try again.
 	 */
@@ -627,6 +628,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 			goto err_early_kill;
 	}
 
+	set_task_blocked_on(current, lock);
 	set_current_state(state);
 	trace_contention_begin(lock, LCB_F_MUTEX);
 	for (;;) {
@@ -639,7 +641,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 		 * the handoff.
 		 */
 		if (__mutex_trylock(lock))
-			goto acquired;
+			break; /* acquired */
 
 		/*
 		 * Check for signals and kill conditions while holding
@@ -657,6 +659,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 				goto err;
 		}
 
+		raw_spin_unlock(&current->blocked_lock);
 		raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 		/* Make sure we do wakeups before calling schedule */
 		wake_up_q(&wake_q);
@@ -666,6 +669,13 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 
 		first = __mutex_waiter_is_first(lock, &waiter);
 
+		raw_spin_lock_irqsave(&lock->wait_lock, flags);
+		raw_spin_lock(&current->blocked_lock);
+
+		/*
+		 * Re-set blocked_on_state as unlock path set it to WAKING/RUNNABLE
+		 */
+		current->blocked_on_state = BO_BLOCKED;
 		set_current_state(state);
 		/*
 		 * Here we order against unlock; we must either see it change
@@ -676,16 +686,26 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 			break;
 
 		if (first) {
+			bool opt_acquired;
+
+			/*
+			 * mutex_optimistic_spin() can schedule, so we need to
+			 * release these locks before calling it.
+			 */
+			current->blocked_on_state = BO_RUNNABLE;
+			raw_spin_unlock(&current->blocked_lock);
+			raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 			trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN);
-			if (mutex_optimistic_spin(lock, ww_ctx, &waiter))
+			opt_acquired = mutex_optimistic_spin(lock, ww_ctx, &waiter);
+			raw_spin_lock_irqsave(&lock->wait_lock, flags);
+			raw_spin_lock(&current->blocked_lock);
+			current->blocked_on_state = BO_BLOCKED;
+			if (opt_acquired)
 				break;
 			trace_contention_begin(lock, LCB_F_MUTEX);
 		}
-
-		raw_spin_lock_irqsave(&lock->wait_lock, flags);
 	}
-	raw_spin_lock_irqsave(&lock->wait_lock, flags);
-acquired:
+	set_task_blocked_on(current, NULL);
 	__set_current_state(TASK_RUNNING);
 
 	if (ww_ctx) {
@@ -710,16 +730,20 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 	if (ww_ctx)
 		ww_mutex_lock_acquired(ww, ww_ctx);
 
+	raw_spin_unlock(&current->blocked_lock);
 	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 	wake_up_q(&wake_q);
 	preempt_enable();
 	return 0;
 
 err:
+	set_task_blocked_on(current, NULL);
 	__set_current_state(TASK_RUNNING);
 	__mutex_remove_waiter(lock, &waiter);
 err_early_kill:
+	WARN_ON(get_task_blocked_on(current));
 	trace_contention_end(lock, ret);
+	raw_spin_unlock(&current->blocked_lock);
 	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 	debug_mutex_free_waiter(&waiter);
 	mutex_release(&lock->dep_map, ip);
@@ -928,8 +952,12 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 
 		next = waiter->task;
 
+		raw_spin_lock(&next->blocked_lock);
 		debug_mutex_wake_waiter(lock, waiter);
+		WARN_ON(get_task_blocked_on(next) != lock);
+		set_blocked_on_waking(next);
 		wake_q_add(&wake_q, next);
+		raw_spin_unlock(&next->blocked_lock);
 	}
 
 	if (owner & MUTEX_FLAG_HANDOFF)
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 37f025a096c9d..d9ff2022eef6f 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -281,10 +281,21 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
 		return false;
 
 	if (waiter->ww_ctx->acquired > 0 && __ww_ctx_less(waiter->ww_ctx, ww_ctx)) {
+		/* nested as we should hold current->blocked_lock already */
+		raw_spin_lock_nested(&waiter->task->blocked_lock, SINGLE_DEPTH_NESTING);
 #ifndef WW_RT
 		debug_mutex_wake_waiter(lock, waiter);
+		/*
+		 * When waking up the task to die, be sure to set the
+		 * blocked_on_state to WAKING. Otherwise we can see
+		 * circular blocked_on relationships that can't
+		 * resolve.
+		 */
+		WARN_ON(get_task_blocked_on(waiter->task) != lock);
 #endif
+		set_blocked_on_waking(waiter->task);
 		wake_q_add(wake_q, waiter->task);
+		raw_spin_unlock(&waiter->task->blocked_lock);
 	}
 
 	return true;
@@ -331,9 +342,18 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
 		 * it's wounded in __ww_mutex_check_kill() or has a
 		 * wakeup pending to re-read the wounded state.
 		 */
-		if (owner != current)
+		if (owner != current) {
+			/* nested as we should hold current->blocked_lock already */
+			raw_spin_lock_nested(&owner->blocked_lock, SINGLE_DEPTH_NESTING);
+			/*
+			 * When waking up the task to wound, be sure to set the
+			 * blocked_on_state flag. Otherwise we can see circular
+			 * blocked_on relationships that can't resolve.
+			 */
+			set_blocked_on_waking(owner);
 			wake_q_add(wake_q, owner);
-
+			raw_spin_unlock(&owner->blocked_lock);
+		}
 		return true;
 	}
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d712e177d3b75..f8714050b6d0d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4350,6 +4350,7 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 		ttwu_queue(p, cpu, wake_flags);
 	}
 out:
+	set_blocked_on_runnable(p);
 	if (success)
 		ttwu_stat(p, task_cpu(p), wake_flags);
 
-- 
2.47.0.371.ga323438b13-goog


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC][PATCH v14 3/7] sched: Fix runtime accounting w/ split exec & sched contexts
  2024-11-25 19:51 [RFC][PATCH v14 0/7] Single CPU Proxy Execution (v14) John Stultz
  2024-11-25 19:51 ` [RFC][PATCH v14 1/7] sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable John Stultz
  2024-11-25 19:51 ` [RFC][PATCH v14 2/7] locking/mutex: Rework task_struct::blocked_on John Stultz
@ 2024-11-25 19:51 ` John Stultz
  2024-12-13 23:37   ` Peter Zijlstra
  2024-11-25 19:51 ` [RFC][PATCH v14 4/7] sched: Fix psi_dequeue for Proxy Execution John Stultz
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: John Stultz @ 2024-11-25 19:51 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, kernel-team

The idea here is we want to charge the scheduler-context task's
vruntime but charge the execution-context task's sum_exec_runtime.

This way cputime accounting goes against the task actually running
but vruntime accounting goes against the rq->donor task so we get
proper fairness.

Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: kernel-team@android.com
Signed-off-by: John Stultz <jstultz@google.com>
---
 kernel/sched/fair.c | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fbdca89c677f4..ebde314e151f1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1131,22 +1131,33 @@ static void update_tg_load_avg(struct cfs_rq *cfs_rq)
 }
 #endif /* CONFIG_SMP */
 
-static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
+static s64 update_curr_se(struct rq *rq, struct sched_entity *se)
 {
 	u64 now = rq_clock_task(rq);
 	s64 delta_exec;
 
-	delta_exec = now - curr->exec_start;
+	delta_exec = now - se->exec_start;
 	if (unlikely(delta_exec <= 0))
 		return delta_exec;
 
-	curr->exec_start = now;
-	curr->sum_exec_runtime += delta_exec;
+	se->exec_start = now;
+	if (entity_is_task(se)) {
+		struct task_struct *running = rq->curr;
+		/*
+		 * If se is a task, we account the time against the running
+		 * task, as w/ proxy-exec they may not be the same.
+		 */
+		running->se.exec_start = now;
+		running->se.sum_exec_runtime += delta_exec;
+	} else {
+		/* If not task, account the time against se */
+		se->sum_exec_runtime += delta_exec;
+	}
 
 	if (schedstat_enabled()) {
 		struct sched_statistics *stats;
 
-		stats = __schedstats_from_se(curr);
+		stats = __schedstats_from_se(se);
 		__schedstat_set(stats->exec_max,
 				max(delta_exec, stats->exec_max));
 	}
-- 
2.47.0.371.ga323438b13-goog



* [RFC][PATCH v14 4/7] sched: Fix psi_dequeue for Proxy Execution
  2024-11-25 19:51 [RFC][PATCH v14 0/7] Single CPU Proxy Execution (v14) John Stultz
                   ` (2 preceding siblings ...)
  2024-11-25 19:51 ` [RFC][PATCH v14 3/7] sched: Fix runtime accounting w/ split exec & sched contexts John Stultz
@ 2024-11-25 19:51 ` John Stultz
  2024-11-25 19:51 ` [RFC][PATCH v14 5/7] sched: Add an initial sketch of the find_proxy_task() function John Stultz
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 23+ messages in thread
From: John Stultz @ 2024-11-25 19:51 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, kernel-team

Currently, if the sleep flag is set, psi_dequeue() doesn't
change any of the psi_flags.

This is because psi_task_switch() will clear TSK_ONCPU as well
as other potential flags (TSK_RUNNING), and the assumption is
that a voluntary sleep always consists of a task being dequeued,
followed shortly thereafter by a psi_task_switch() call.

Proxy Execution changes this expectation, as mutex-blocked tasks
that would normally sleep stay on the runqueue. In the case where
the mutex-owning task goes to sleep, we will then deactivate the
blocked task as well.

However, in that case, the mutex-blocked task will have had its
TSK_ONCPU cleared when it was switched off the cpu, but it will
stay TSK_RUNNING. Then, when we later dequeue it because of a
sleeping owner, psi_dequeue() won't change any state (leaving it
TSK_RUNNING), as it incorrectly expects a psi_task_switch() call
to immediately follow.

Later, when it gets re-enqueued and psi_flags are set to
TSK_RUNNING, we hit an error as the task is already TSK_RUNNING:
  psi: inconsistent task state!

To resolve this, extend the logic in psi_dequeue() so that if
the sleep flag is set, we also check that psi_flags have
TSK_ONCPU set (meaning the psi_task_switch() is imminent) before
we do the shortcut return.

If TSK_ONCPU is not set, that means we've already switched away,
and this psi_dequeue() call needs to clear the flags.
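
The resulting condition can be sketched as a small userspace
predicate. The flag values and the helper name below are
illustrative stand-ins, not the kernel's actual PSI bit layout:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bits, mirroring (not equal to) the kernel's */
#define TSK_RUNNING	(1 << 0)
#define TSK_ONCPU	(1 << 1)
#define DEQUEUE_SLEEP	(1 << 0)

/*
 * Return true when psi_dequeue() may take the shortcut return:
 * only for a sleeping dequeue where the task is still on the CPU,
 * i.e. a psi_task_switch() is imminent and will clear the state.
 */
static bool psi_dequeue_may_shortcut(int dequeue_flags, int psi_flags)
{
	return (dequeue_flags & DEQUEUE_SLEEP) &&
	       (psi_flags & TSK_ONCPU);
}
```

A sleeping-owner dequeue arrives with TSK_ONCPU already cleared,
so it falls through and clears the flags as usual.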

Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: kernel-team@android.com
Signed-off-by: John Stultz <jstultz@google.com>
---
v13:
* Reworked for collision
---
 kernel/sched/stats.h | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 8ee0add5a48a8..c313fe76a7723 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -176,8 +176,12 @@ static inline void psi_dequeue(struct task_struct *p, int flags)
 	 * avoid walking all ancestors twice, psi_task_switch() handles
 	 * TSK_RUNNING and TSK_IOWAIT for us when it moves TSK_ONCPU.
 	 * Do nothing here.
+	 * In the SCHED_PROXY_EXEC case we may do sleeping
+	 * dequeues that are not followed by a task switch, so check
+	 * TSK_ONCPU is set to ensure the task switch is imminent.
+	 * Otherwise clear the flags as usual.
 	 */
-	if (flags & DEQUEUE_SLEEP)
+	if ((flags & DEQUEUE_SLEEP) && (p->psi_flags & TSK_ONCPU))
 		return;
 
 	/*
-- 
2.47.0.371.ga323438b13-goog



* [RFC][PATCH v14 5/7] sched: Add an initial sketch of the find_proxy_task() function
  2024-11-25 19:51 [RFC][PATCH v14 0/7] Single CPU Proxy Execution (v14) John Stultz
                   ` (3 preceding siblings ...)
  2024-11-25 19:51 ` [RFC][PATCH v14 4/7] sched: Fix psi_dequeue for Proxy Execution John Stultz
@ 2024-11-25 19:51 ` John Stultz
  2024-12-14  0:05   ` Peter Zijlstra
  2024-11-25 19:52 ` [RFC][PATCH v14 6/7] sched: Fix proxy/current (push,pull)ability John Stultz
  2024-11-25 19:52 ` [RFC][PATCH v14 7/7] sched: Start blocked_on chain processing in find_proxy_task() John Stultz
  6 siblings, 1 reply; 23+ messages in thread
From: John Stultz @ 2024-11-25 19:51 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, kernel-team

Add a find_proxy_task() function which doesn't do much.

When we select a blocked task to run, we will just deactivate it
and pick again, the exception being if it has become unblocked
after find_proxy_task() was called.
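
Abstractly, the pick loop this introduces into __schedule() has
the following shape. This is a hypothetical userspace sketch
(the list-based runqueue and helper are illustrative only):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct task {
	bool blocked;		/* analogous to task_is_blocked() */
	struct task *next;	/* runqueue simplified to a list */
};

/*
 * Pick the first task; if it is blocked, skip it (the kernel
 * would deactivate it) and pick again, until something runnable
 * is found or we fall back to idle (NULL here).
 */
static struct task *pick_next_runnable(struct task *rq)
{
	for (struct task *p = rq; p; p = p->next) {
		if (!p->blocked)
			return p;
		/* blocked: deactivate, then pick again */
	}
	return NULL;	/* idle */
}
```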

Greatly simplified from patch by:
  Peter Zijlstra (Intel) <peterz@infradead.org>
  Juri Lelli <juri.lelli@redhat.com>
  Valentin Schneider <valentin.schneider@arm.com>
  Connor O'Brien <connoro@google.com>

Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: kernel-team@android.com
[jstultz: Split out from larger proxy patch and simplified
 for review and testing.]
Signed-off-by: John Stultz <jstultz@google.com>
---
v5:
* Split out from larger proxy patch
v7:
* Fixed unused function arguments, spelling nits, and tweaks for
  clarity, pointed out by Metin Kaya
* Fix build warning Reported-by: kernel test robot <lkp@intel.com>
  Closes: https://lore.kernel.org/oe-kbuild-all/202311081028.yDLmCWgr-lkp@intel.com/
v8:
* Fixed case where we might return a blocked task from find_proxy_task()
* Continued tweaks to handle avoiding returning blocked tasks
v9:
* Add zap_balance_callbacks helper to unwind balance_callbacks
  when we will re-call pick_next_task() again.
* Add extra comment suggested by Metin
* Typo fixes from Metin
* Moved adding proxy_resched_idle earlier in the series, as suggested
  by Metin
* Fix to call proxy_resched_idle() *prior* to deactivating next, to avoid
  crashes caused by stale references to next
* s/PROXY/SCHED_PROXY_EXEC/ as suggested by Metin
* Number of tweaks and cleanups suggested by Metin
* Simplify proxy_deactivate as suggested by Metin
v11:
* Tweaks for earlier simplification in try_to_deactivate_task
v13:
* Rename "next" to "donor" in find_proxy_task() for clarity
* Similarly use "donor" instead of next in proxy_deactivate
* Refactor/simplify proxy_resched_idle
* Moved up a needed fix from later in the series
---
 kernel/sched/core.c  | 129 ++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/rt.c    |  15 ++++-
 kernel/sched/sched.h |  10 +++-
 3 files changed, 148 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f8714050b6d0d..b492506d33415 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5052,6 +5052,34 @@ static void do_balance_callbacks(struct rq *rq, struct balance_callback *head)
 	}
 }
 
+/*
+ * Only called from __schedule context
+ *
+ * There are some cases where we are going to re-do the action
+ * that added the balance callbacks. We may not be in a state
+ * where we can run them, so just zap them so they can be
+ * properly re-added on the next time around. This is similar
+ * handling to running the callbacks, except we just don't call
+ * them.
+ */
+static void zap_balance_callbacks(struct rq *rq)
+{
+	struct balance_callback *next, *head;
+	bool found = false;
+
+	lockdep_assert_rq_held(rq);
+
+	head = rq->balance_callback;
+	while (head) {
+		if (head == &balance_push_callback)
+			found = true;
+		next = head->next;
+		head->next = NULL;
+		head = next;
+	}
+	rq->balance_callback = found ? &balance_push_callback : NULL;
+}
+
 static void balance_push(struct rq *rq);
 
 /*
@@ -6592,7 +6620,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
  * Otherwise marks the task's __state as RUNNING
  */
 static bool try_to_block_task(struct rq *rq, struct task_struct *p,
-			      unsigned long task_state)
+			      unsigned long task_state, bool deactivate_cond)
 {
 	int flags = DEQUEUE_NOCLOCK;
 
@@ -6601,6 +6629,9 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
 		return false;
 	}
 
+	if (!deactivate_cond)
+		return false;
+
 	p->sched_contributes_to_load =
 		(task_state & TASK_UNINTERRUPTIBLE) &&
 		!(task_state & TASK_NOLOAD) &&
@@ -6624,6 +6655,88 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
 	return true;
 }
 
+#ifdef CONFIG_SCHED_PROXY_EXEC
+
+static inline struct task_struct *
+proxy_resched_idle(struct rq *rq)
+{
+	put_prev_task(rq, rq->donor);
+	rq_set_donor(rq, rq->idle);
+	set_next_task(rq, rq->idle);
+	set_tsk_need_resched(rq->idle);
+	return rq->idle;
+}
+
+static bool proxy_deactivate(struct rq *rq, struct task_struct *donor)
+{
+	unsigned long state = READ_ONCE(donor->__state);
+
+	/* Don't deactivate if the state has been changed to TASK_RUNNING */
+	if (state == TASK_RUNNING)
+		return false;
+	/*
+	 * Because we got donor from pick_next_task, it is *crucial*
+	 * that we call proxy_resched_idle before we deactivate it.
+	 * As once we deactivate donor, donor->on_rq is set to zero,
+	 * which allows ttwu to immediately try to wake the task on
+	 * another rq. So we cannot use *any* references to donor
+	 * after that point. So things like cfs_rq->curr or rq->donor
+	 * need to be changed from next *before* we deactivate.
+	 */
+	proxy_resched_idle(rq);
+	return try_to_block_task(rq, donor, state, true);
+}
+
+/*
+ * Initial simple proxy that just returns the task if it's waking
+ * or deactivates the blocked task so we can pick something that
+ * isn't blocked.
+ */
+static struct task_struct *
+find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
+{
+	struct task_struct *p = donor;
+	struct mutex *mutex;
+
+	mutex = p->blocked_on;
+	/* Something changed in the chain, so pick again */
+	if (!mutex)
+		return NULL;
+	/*
+	 * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
+	 * and ensure @owner sticks around.
+	 */
+	raw_spin_lock(&mutex->wait_lock);
+	raw_spin_lock(&p->blocked_lock);
+
+	/* Check again that p is blocked with blocked_lock held */
+	if (!task_is_blocked(p) || mutex != get_task_blocked_on(p)) {
+		/*
+		 * Something changed in the blocked_on chain and
+		 * we don't know if only at this level. So, let's
+		 * just bail out completely and let __schedule
+		 * figure things out (pick_again loop).
+		 */
+		goto out;
+	}
+	if (!proxy_deactivate(rq, donor))
+		/* XXX: This hack won't work when we get to migrations */
+		donor->blocked_on_state = BO_RUNNABLE;
+
+out:
+	raw_spin_unlock(&p->blocked_lock);
+	raw_spin_unlock(&mutex->wait_lock);
+	return NULL;
+}
+#else /* SCHED_PROXY_EXEC */
+static struct task_struct *
+find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
+{
+	WARN_ONCE(1, "This should never be called in the !SCHED_PROXY_EXEC case\n");
+	return donor;
+}
+#endif /* SCHED_PROXY_EXEC */
+
 /*
  * __schedule() is the main scheduler function.
  *
@@ -6732,12 +6845,22 @@ static void __sched notrace __schedule(int sched_mode)
 			goto picked;
 		}
 	} else if (!preempt && prev_state) {
-		block = try_to_block_task(rq, prev, prev_state);
+		block = try_to_block_task(rq, prev, prev_state,
+					  !task_is_blocked(prev));
 		switch_count = &prev->nvcsw;
 	}
 
-	next = pick_next_task(rq, prev, &rf);
+pick_again:
+	next = pick_next_task(rq, rq->donor, &rf);
 	rq_set_donor(rq, next);
+	if (unlikely(task_is_blocked(next))) {
+		next = find_proxy_task(rq, next, &rf);
+		if (!next) {
+			/* zap the balance_callbacks before picking again */
+			zap_balance_callbacks(rq);
+			goto pick_again;
+		}
+	}
 picked:
 	clear_tsk_need_resched(prev);
 	clear_preempt_need_resched();
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index bd66a46b06aca..fa4d9bf76ad49 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1479,8 +1479,19 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
 
 	enqueue_rt_entity(rt_se, flags);
 
-	if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
-		enqueue_pushable_task(rq, p);
+	/*
+	 * Current can't be pushed away. Selected is tied to current,
+	 * so don't push it either.
+	 */
+	if (task_current(rq, p) || task_current_donor(rq, p))
+		return;
+	/*
+	 * Pinned tasks can't be pushed.
+	 */
+	if (p->nr_cpus_allowed == 1)
+		return;
+
+	enqueue_pushable_task(rq, p);
 }
 
 static bool dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 24eae02ddc7f6..f560d1d1a7a0c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2272,6 +2272,14 @@ static inline int task_current_donor(struct rq *rq, struct task_struct *p)
 	return rq->donor == p;
 }
 
+static inline bool task_is_blocked(struct task_struct *p)
+{
+	if (!sched_proxy_exec())
+		return false;
+
+	return !!p->blocked_on && p->blocked_on_state != BO_RUNNABLE;
+}
+
 static inline int task_on_cpu(struct rq *rq, struct task_struct *p)
 {
 #ifdef CONFIG_SMP
@@ -2481,7 +2489,7 @@ static inline void put_prev_set_next_task(struct rq *rq,
 					  struct task_struct *prev,
 					  struct task_struct *next)
 {
-	WARN_ON_ONCE(rq->curr != prev);
+	WARN_ON_ONCE(rq->donor != prev);
 
 	__put_prev_set_next_dl_server(rq, prev, next);
 
-- 
2.47.0.371.ga323438b13-goog



* [RFC][PATCH v14 6/7] sched: Fix proxy/current (push,pull)ability
  2024-11-25 19:51 [RFC][PATCH v14 0/7] Single CPU Proxy Execution (v14) John Stultz
                   ` (4 preceding siblings ...)
  2024-11-25 19:51 ` [RFC][PATCH v14 5/7] sched: Add an initial sketch of the find_proxy_task() function John Stultz
@ 2024-11-25 19:52 ` John Stultz
  2024-11-25 19:52 ` [RFC][PATCH v14 7/7] sched: Start blocked_on chain processing in find_proxy_task() John Stultz
  6 siblings, 0 replies; 23+ messages in thread
From: John Stultz @ 2024-11-25 19:52 UTC (permalink / raw)
  To: LKML
  Cc: Valentin Schneider, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, kernel-team, Connor O'Brien,
	John Stultz

From: Valentin Schneider <valentin.schneider@arm.com>

Proxy execution forms atomic pairs of tasks: The waiting donor
task (scheduling context) and a proxy (execution context). The
donor task, along with the rest of the blocked chain, follows
the proxy wrt CPU placement.

They can be the same task, in which case push/pull doesn't need
any modification. When they are different, however, consider
FIFO1 & FIFO42:

	      ,->  RT42
	      |     | blocked-on
	      |     v
blocked_donor |   mutex
	      |     | owner
	      |     v
	      `--  RT1

   RT1
   RT42

  CPU0            CPU1
   ^                ^
   |                |
  overloaded    !overloaded
  rq prio = 42  rq prio = 0

RT1 is eligible to be pushed to CPU1, but should that happen it
will "carry" RT42 along. Clearly, neither RT1 nor RT42 should be
seen as push/pullable here.

Unfortunately, only the donor task is usually dequeued from the
rq, and the proxied execution context (rq->curr) remains on the
rq. This can cause RT1 to be selected for migration by logic like
the rt pushable_list.

Thus, add a dequeue/enqueue cycle on the proxy task before
__schedule returns, which allows the sched class logic to avoid
adding the now current task to the pushable_list.

Furthermore, tasks becoming blocked on a mutex don't need an explicit
dequeue/enqueue cycle to be made (push/pull)able: they have to be running
to block on a mutex, thus they will eventually hit put_prev_task().

Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: kernel-team@android.com
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Connor O'Brien <connoro@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
---
v3:
* Tweaked comments & commit message
v5:
* Minor simplifications to utilize the fix earlier
  in the patch series.
* Rework the wording of the commit message to match selected/
  proxy terminology and expand a bit to make it more clear how
  it works.
v6:
* Dropped now-unused proxied value, to be re-added later in the
  series when it is used, as caught by Dietmar
v7:
* Unused function argument fixup
* Commit message nit pointed out by Metin Kaya
* Dropped unproven unlikely() and use sched_proxy_exec()
  in proxy_tag_curr, suggested by Metin Kaya
v8:
* More cleanups and typo fixes suggested by Metin Kaya
v11:
* Cleanup of commit message suggested by Metin
v12:
* Rework for rq_selected -> rq->donor renaming
---
 kernel/sched/core.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b492506d33415..a18523355fb18 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6737,6 +6737,23 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 }
 #endif /* SCHED_PROXY_EXEC */
 
+static inline void proxy_tag_curr(struct rq *rq, struct task_struct *owner)
+{
+	if (!sched_proxy_exec())
+		return;
+	/*
+	 * pick_next_task() calls set_next_task() on the chosen task
+	 * at some point, which ensures it is not push/pullable.
+	 * However, the chosen/donor task *and* the mutex owner form an
+	 * atomic pair wrt push/pull.
+	 *
+	 * Make sure owner we run is not pushable. Unfortunately we can
+	 * only deal with that by means of a dequeue/enqueue cycle. :-/
+	 */
+	dequeue_task(rq, owner, DEQUEUE_NOCLOCK | DEQUEUE_SAVE);
+	enqueue_task(rq, owner, ENQUEUE_NOCLOCK | ENQUEUE_RESTORE);
+}
+
 /*
  * __schedule() is the main scheduler function.
  *
@@ -6875,6 +6892,10 @@ static void __sched notrace __schedule(int sched_mode)
 		 * changes to task_struct made by pick_next_task().
 		 */
 		RCU_INIT_POINTER(rq->curr, next);
+
+		if (!task_current_donor(rq, next))
+			proxy_tag_curr(rq, next);
+
 		/*
 		 * The membarrier system call requires each architecture
 		 * to have a full memory barrier after updating
@@ -6908,6 +6929,10 @@ static void __sched notrace __schedule(int sched_mode)
 		/* Also unlocks the rq: */
 		rq = context_switch(rq, prev, next, &rf);
 	} else {
+		/* In case next was already curr but just got blocked_donor */
+		if (!task_current_donor(rq, next))
+			proxy_tag_curr(rq, next);
+
 		rq_unpin_lock(rq, &rf);
 		__balance_callbacks(rq);
 		raw_spin_rq_unlock_irq(rq);
-- 
2.47.0.371.ga323438b13-goog



* [RFC][PATCH v14 7/7] sched: Start blocked_on chain processing in find_proxy_task()
  2024-11-25 19:51 [RFC][PATCH v14 0/7] Single CPU Proxy Execution (v14) John Stultz
                   ` (5 preceding siblings ...)
  2024-11-25 19:52 ` [RFC][PATCH v14 6/7] sched: Fix proxy/current (push,pull)ability John Stultz
@ 2024-11-25 19:52 ` John Stultz
  6 siblings, 0 replies; 23+ messages in thread
From: John Stultz @ 2024-11-25 19:52 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, kernel-team, Valentin Schneider,
	Connor O'Brien, John Stultz

From: Peter Zijlstra <peterz@infradead.org>

Start to flesh out the real find_proxy_task() implementation,
but avoid the migration cases for now; in those cases, just
deactivate the donor task and pick again.

To ensure the donor task or other blocked tasks in the chain
aren't migrated away while we're running the proxy, also tweak
the CFS logic to avoid migrating donor or mutex blocked tasks.
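
The core of the chain walk can be sketched with simplified
userspace types. task_s/mutex_s and chain_tail() are hypothetical
stand-ins; the real code also revalidates each link under
mutex->wait_lock and p->blocked_lock and bails out on migrating
or off-rq owners:

```c
#include <assert.h>
#include <stddef.h>

struct mutex_s;

struct task_s {
	struct mutex_s *blocked_on;	/* mutex this task waits on */
};

struct mutex_s {
	struct task_s *owner;		/* current lock holder */
};

/*
 * Follow task->blocked_on->owner->... until we reach a task that
 * is not itself mutex-blocked: the runnable end of the chain,
 * which becomes the execution context.
 */
static struct task_s *chain_tail(struct task_s *donor)
{
	struct task_s *p = donor;

	while (p->blocked_on && p->blocked_on->owner)
		p = p->blocked_on->owner;
	return p;
}
```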

Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: kernel-team@android.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Connor O'Brien <connoro@google.com>
[jstultz: This change was split out from the larger proxy patch]
Signed-off-by: John Stultz <jstultz@google.com>
---
v5:
* Split this out from larger proxy patch
v7:
* Minor refactoring of core find_proxy_task() function
* Minor spelling and corrections suggested by Metin Kaya
* Dropped an added BUG_ON that was frequently tripped
v8:
* Fix issue if proxy_deactivate fails, we don't leave task
  BO_BLOCKED
* Switch to WARN_ON from BUG_ON checks
v9:
* Improve comments suggested by Metin
* Minor cleanups
v11:
* Previously we checked next==rq->idle && prev==rq->idle, but I
  think we only really care if next==rq->idle from find_proxy_task,
  as we will still want to resched regardless of what prev was.
v12:
* Commit message rework for selected -> donor rewording
v13:
* Address new delayed dequeue condition (deactivate donor for now)
* Next to donor renaming in find_proxy_task
* Improved comments for find_proxy_task
* Rework for proxy_deactivate cleanup
v14:
* Fix build error from __mutex_owner() with CONFIG_PREEMPT_RT
---
 kernel/locking/mutex.h |   3 +-
 kernel/sched/core.c    | 164 ++++++++++++++++++++++++++++++++++-------
 kernel/sched/fair.c    |  10 ++-
 3 files changed, 148 insertions(+), 29 deletions(-)

diff --git a/kernel/locking/mutex.h b/kernel/locking/mutex.h
index cbff35b9b7ae3..2e8080a9bee37 100644
--- a/kernel/locking/mutex.h
+++ b/kernel/locking/mutex.h
@@ -6,7 +6,7 @@
  *
  *  Copyright (C) 2004, 2005, 2006 Red Hat, Inc., Ingo Molnar <mingo@redhat.com>
  */
-
+#ifndef CONFIG_PREEMPT_RT
 /*
  * This is the control structure for tasks blocked on mutex, which resides
  * on the blocked task's kernel stack:
@@ -70,3 +70,4 @@ extern void debug_mutex_init(struct mutex *lock, const char *name,
 # define debug_mutex_unlock(lock)			do { } while (0)
 # define debug_mutex_init(lock, name, key)		do { } while (0)
 #endif /* !CONFIG_DEBUG_MUTEXES */
+#endif /* CONFIG_PREEMPT_RT */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a18523355fb18..dec9fabb7e105 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -96,6 +96,7 @@
 #include "../workqueue_internal.h"
 #include "../../io_uring/io-wq.h"
 #include "../smpboot.h"
+#include "../locking/mutex.h"
 
 EXPORT_TRACEPOINT_SYMBOL_GPL(ipi_send_cpu);
 EXPORT_TRACEPOINT_SYMBOL_GPL(ipi_send_cpumask);
@@ -2941,8 +2942,15 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
 	struct set_affinity_pending my_pending = { }, *pending = NULL;
 	bool stop_pending, complete = false;
 
-	/* Can the task run on the task's current CPU? If so, we're done */
-	if (cpumask_test_cpu(task_cpu(p), &p->cpus_mask)) {
+	/*
+	 * Can the task run on the task's current CPU? If so, we're done
+	 *
+	 * We are also done if the task is the current donor, boosting a lock-
+	 * holding proxy, (and potentially has been migrated outside its
+	 * current or previous affinity mask)
+	 */
+	if (cpumask_test_cpu(task_cpu(p), &p->cpus_mask) ||
+	    (task_current_donor(rq, p) && !task_current(rq, p))) {
 		struct task_struct *push_task = NULL;
 
 		if ((flags & SCA_MIGRATE_ENABLE) &&
@@ -6688,41 +6696,139 @@ static bool proxy_deactivate(struct rq *rq, struct task_struct *donor)
 }
 
 /*
- * Initial simple proxy that just returns the task if it's waking
- * or deactivates the blocked task so we can pick something that
- * isn't blocked.
+ * Find runnable lock owner to proxy for mutex blocked donor
+ *
+ * Follow the blocked-on relation:
+ *   task->blocked_on -> mutex->owner -> task...
+ *
+ * Lock order:
+ *
+ *   p->pi_lock
+ *     rq->lock
+ *       mutex->wait_lock
+ *         p->blocked_lock
+ *
+ * Returns the task that is going to be used as execution context (the one
+ * that is actually going to be run on cpu_of(rq)).
  */
 static struct task_struct *
 find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 {
-	struct task_struct *p = donor;
+	struct task_struct *owner = NULL;
+	struct task_struct *ret = NULL;
+	int this_cpu = cpu_of(rq);
+	struct task_struct *p;
 	struct mutex *mutex;
 
-	mutex = p->blocked_on;
-	/* Something changed in the chain, so pick again */
-	if (!mutex)
-		return NULL;
-	/*
-	 * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
-	 * and ensure @owner sticks around.
-	 */
-	raw_spin_lock(&mutex->wait_lock);
-	raw_spin_lock(&p->blocked_lock);
+	/* Follow blocked_on chain. */
+	for (p = donor; task_is_blocked(p); p = owner) {
+		mutex = p->blocked_on;
+		/* Something changed in the chain, so pick again */
+		if (!mutex)
+			return NULL;
+		/*
+		 * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
+		 * and ensure @owner sticks around.
+		 */
+		raw_spin_lock(&mutex->wait_lock);
+		raw_spin_lock(&p->blocked_lock);
+
+		/* Check again that p is blocked with blocked_lock held */
+		if (mutex != get_task_blocked_on(p)) {
+			/*
+			 * Something changed in the blocked_on chain and
+			 * we don't know if only at this level. So, let's
+			 * just bail out completely and let __schedule
+			 * figure things out (pick_again loop).
+			 */
+			goto out;
+		}
+
+		owner = __mutex_owner(mutex);
+		if (!owner) {
+			p->blocked_on_state = BO_RUNNABLE;
+			ret = p;
+			goto out;
+		}
+
+		if (task_cpu(owner) != this_cpu) {
+			/* XXX Don't handle migrations yet */
+			if (!proxy_deactivate(rq, donor))
+				goto deactivate_failed;
+			goto out;
+		}
+
+		if (task_on_rq_migrating(owner)) {
+			/*
+			 * One of the chain of mutex owners is currently migrating to this
+			 * CPU, but has not yet been enqueued because we are holding the
+			 * rq lock. As a simple solution, just schedule rq->idle to give
+			 * the migration a chance to complete. Much like the migrate_task
+			 * case we should end up back in find_proxy_task(), this time
+			 * hopefully with all relevant tasks already enqueued.
+			 */
+			raw_spin_unlock(&p->blocked_lock);
+			raw_spin_unlock(&mutex->wait_lock);
+			return proxy_resched_idle(rq);
+		}
+
+		if (!owner->on_rq) {
+			/* XXX Don't handle blocked owners yet */
+			if (!proxy_deactivate(rq, donor))
+				goto deactivate_failed;
+			goto out;
+		}
+
+		if (owner->se.sched_delayed) {
+			/* XXX Don't handle delayed dequeue yet */
+			if (!proxy_deactivate(rq, donor))
+				goto deactivate_failed;
+			goto out;
+		}
+
+		if (owner == p) {
+			/*
+			 * It's possible we interleave with mutex_unlock like:
+			 *
+			 *				lock(&rq->lock);
+			 *				  find_proxy_task()
+			 * mutex_unlock()
+			 *   lock(&wait_lock);
+			 *   donor(owner) = current->blocked_donor;
+			 *   unlock(&wait_lock);
+			 *
+			 *   wake_up_q();
+			 *     ...
+			 *       ttwu_runnable()
+			 *         __task_rq_lock()
+			 *				  lock(&wait_lock);
+			 *				  owner == p
+			 *
+			 * Which leaves us to finish the ttwu_runnable() and make it go.
+			 *
+			 * So schedule rq->idle so that ttwu_runnable can get the rq lock
+			 * and mark owner as running.
+			 */
+			raw_spin_unlock(&p->blocked_lock);
+			raw_spin_unlock(&mutex->wait_lock);
+			return proxy_resched_idle(rq);
+		}
 
-	/* Check again that p is blocked with blocked_lock held */
-	if (!task_is_blocked(p) || mutex != get_task_blocked_on(p)) {
 		/*
-		 * Something changed in the blocked_on chain and
-		 * we don't know if only at this level. So, let's
-		 * just bail out completely and let __schedule
-		 * figure things out (pick_again loop).
+		 * OK, now we're absolutely sure @owner is on this
+		 * rq, therefore holding @rq->lock is sufficient to
+		 * guarantee its existence, as per ttwu_remote().
 		 */
-		goto out;
+		raw_spin_unlock(&p->blocked_lock);
+		raw_spin_unlock(&mutex->wait_lock);
 	}
-	if (!proxy_deactivate(rq, donor))
-		/* XXX: This hack won't work when we get to migrations */
-		donor->blocked_on_state = BO_RUNNABLE;
 
+	WARN_ON_ONCE(owner && !owner->on_rq);
+	return owner;
+
+deactivate_failed:
+	/* XXX: This hack won't work when we get to migrations */
+	donor->blocked_on_state = BO_RUNNABLE;
 out:
 	raw_spin_unlock(&p->blocked_lock);
 	raw_spin_unlock(&mutex->wait_lock);
@@ -6807,6 +6913,7 @@ static void __sched notrace __schedule(int sched_mode)
 	struct rq_flags rf;
 	struct rq *rq;
 	int cpu;
+	bool preserve_need_resched = false;
 
 	cpu = smp_processor_id();
 	rq = cpu_rq(cpu);
@@ -6877,9 +6984,12 @@ static void __sched notrace __schedule(int sched_mode)
 			zap_balance_callbacks(rq);
 			goto pick_again;
 		}
+		if (next == rq->idle)
+			preserve_need_resched = true;
 	}
 picked:
-	clear_tsk_need_resched(prev);
+	if (!preserve_need_resched)
+		clear_tsk_need_resched(prev);
 	clear_preempt_need_resched();
 #ifdef CONFIG_SCHED_DEBUG
 	rq->last_seen_need_resched_ns = 0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ebde314e151f1..cc126cfcdac06 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9379,6 +9379,7 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	 * 2) cannot be migrated to this CPU due to cpus_ptr, or
 	 * 3) running (obviously), or
 	 * 4) are cache-hot on their current CPU.
+	 * 5) are blocked on mutexes (if SCHED_PROXY_EXEC is enabled)
 	 */
 	if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
 		return 0;
@@ -9387,6 +9388,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	if (kthread_is_per_cpu(p))
 		return 0;
 
+	if (task_is_blocked(p))
+		return 0;
+
 	if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
 		int cpu;
 
@@ -9423,7 +9427,8 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	/* Record that we found at least one task that could run on dst_cpu */
 	env->flags &= ~LBF_ALL_PINNED;
 
-	if (task_on_cpu(env->src_rq, p)) {
+	if (task_on_cpu(env->src_rq, p) ||
+	    task_current_donor(env->src_rq, p)) {
 		schedstat_inc(p->stats.nr_failed_migrations_running);
 		return 0;
 	}
@@ -9462,6 +9467,9 @@ static void detach_task(struct task_struct *p, struct lb_env *env)
 {
 	lockdep_assert_rq_held(env->src_rq);
 
+	WARN_ON(task_current(env->src_rq, p));
+	WARN_ON(task_current_donor(env->src_rq, p));
+
 	deactivate_task(env->src_rq, p, DEQUEUE_NOCLOCK);
 	set_task_cpu(p, env->dst_cpu);
 }
-- 
2.47.0.371.ga323438b13-goog


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH v14 2/7] locking/mutex: Rework task_struct::blocked_on
  2024-11-25 19:51 ` [RFC][PATCH v14 2/7] locking/mutex: Rework task_struct::blocked_on John Stultz
@ 2024-12-13 23:22   ` Peter Zijlstra
  2024-12-14  3:39     ` John Stultz
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2024-12-13 23:22 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, kernel-team, Connor O'Brien

On Mon, Nov 25, 2024 at 11:51:56AM -0800, John Stultz wrote:

> Also add a blocked_on_state value so we can distinguish when a
> task is blocked_on a mutex, but is either blocked, waking up, or
> runnable (such that it can try to acquire the lock it's blocked
> on).
> 
> This avoids some of the subtle & racy games where the blocked_on
> state gets cleared, only to have it re-added by the
> mutex_lock_slowpath call when it tries to acquire the lock on
> wakeup.

If you can remember those subtle cases, I'm sure our future selves
would've loved it if you wrote a comment to go with these states :-)

> +enum blocked_on_state {
> +	BO_RUNNABLE,
> +	BO_BLOCKED,
> +	BO_WAKING,
> +};
> +
>  struct task_struct {
>  #ifdef CONFIG_THREAD_INFO_IN_TASK
>  	/*
> @@ -1195,10 +1202,9 @@ struct task_struct {
>  	struct rt_mutex_waiter		*pi_blocked_on;
>  #endif
>  
> -#ifdef CONFIG_DEBUG_MUTEXES
> -	/* Mutex deadlock detection: */
> -	struct mutex_waiter		*blocked_on;
> -#endif
> +	enum blocked_on_state		blocked_on_state;
> +	struct mutex			*blocked_on;	/* lock we're blocked on */
> +	raw_spinlock_t			blocked_lock;
>  
>  #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
>  	int				non_block_count;
> @@ -2118,6 +2124,56 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock);
>  	__cond_resched_rwlock_write(lock);					\
>  })
>  
> +static inline void __set_blocked_on_runnable(struct task_struct *p)
> +{
> +	lockdep_assert_held(&p->blocked_lock);
> +
> +	if (p->blocked_on_state == BO_WAKING)
> +		p->blocked_on_state = BO_RUNNABLE;
> +}
> +
> +static inline void set_blocked_on_runnable(struct task_struct *p)
> +{
> +	unsigned long flags;
> +
> +	if (!sched_proxy_exec())
> +		return;
> +
> +	raw_spin_lock_irqsave(&p->blocked_lock, flags);

Do we want to make this:

	guard(raw_spinlock_irqsave)(&p->blocked_lock);

?

> +	__set_blocked_on_runnable(p);
> +	raw_spin_unlock_irqrestore(&p->blocked_lock, flags);
> +}
> +
> +static inline void set_blocked_on_waking(struct task_struct *p)

consistent naming would be __set_blocked_on_waking()

> +{
> +	lockdep_assert_held(&p->blocked_lock);
> +
> +	if (p->blocked_on_state == BO_BLOCKED)
> +		p->blocked_on_state = BO_WAKING;
> +}
> +
> +static inline void set_task_blocked_on(struct task_struct *p, struct mutex *m)

__set_task_blocked_on()

> +{
> +	lockdep_assert_held(&p->blocked_lock);
> +
> +	/*
> +	 * Check we are clearing values to NULL or setting NULL
> +	 * to values to ensure we don't overwrite existing mutex
> +	 * values or clear already cleared values
> +	 */
> +	WARN_ON((!m && !p->blocked_on) || (m && p->blocked_on));
> +
> +	p->blocked_on = m;
> +	p->blocked_on_state = m ? BO_BLOCKED : BO_RUNNABLE;
> +}
> +
> +static inline struct mutex *get_task_blocked_on(struct task_struct *p)

__get_task_blocked_on()

> +{
> +	lockdep_assert_held(&p->blocked_lock);
> +
> +	return p->blocked_on;
> +}

That gets us the __ prefix if the caller is assumed to have taken
blocked_lock.

>  static __always_inline bool need_resched(void)
>  {
>  	return unlikely(tif_need_resched());
> @@ -2157,8 +2213,6 @@ extern bool sched_task_on_rq(struct task_struct *p);
>  extern unsigned long get_wchan(struct task_struct *p);
>  extern struct task_struct *cpu_curr_snapshot(int cpu);
>  
> -#include <linux/spinlock.h>
> -
>  /*
>   * In order to reduce various lock holder preemption latencies provide an
>   * interface to see if a vCPU is currently running or not.
> diff --git a/init/init_task.c b/init/init_task.c
> index e557f622bd906..7e29d86153d9f 100644
> --- a/init/init_task.c
> +++ b/init/init_task.c
> @@ -140,6 +140,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
>  	.journal_info	= NULL,
>  	INIT_CPU_TIMERS(init_task)
>  	.pi_lock	= __RAW_SPIN_LOCK_UNLOCKED(init_task.pi_lock),
> +	.blocked_lock	= __RAW_SPIN_LOCK_UNLOCKED(init_task.blocked_lock),
>  	.timer_slack_ns = 50000, /* 50 usec default slack */
>  	.thread_pid	= &init_struct_pid,
>  	.thread_node	= LIST_HEAD_INIT(init_signals.thread_head),
> diff --git a/kernel/fork.c b/kernel/fork.c
> index f253e81d0c28e..160bead843afb 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2231,6 +2231,7 @@ __latent_entropy struct task_struct *copy_process(
>  	ftrace_graph_init_task(p);
>  
>  	rt_mutex_init_task(p);
> +	raw_spin_lock_init(&p->blocked_lock);
>  
>  	lockdep_assert_irqs_enabled();
>  #ifdef CONFIG_PROVE_LOCKING
> @@ -2329,9 +2330,8 @@ __latent_entropy struct task_struct *copy_process(
>  	lockdep_init_task(p);
>  #endif
>  
> -#ifdef CONFIG_DEBUG_MUTEXES
> +	p->blocked_on_state = BO_RUNNABLE;
>  	p->blocked_on = NULL; /* not blocked yet */
> -#endif
>  #ifdef CONFIG_BCACHE
>  	p->sequential_io	= 0;
>  	p->sequential_io_avg	= 0;
> diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
> index 6e6f6071cfa27..1d8cff71f65e1 100644
> --- a/kernel/locking/mutex-debug.c
> +++ b/kernel/locking/mutex-debug.c
> @@ -53,17 +53,18 @@ void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter,
>  {
>  	lockdep_assert_held(&lock->wait_lock);
>  
> -	/* Mark the current thread as blocked on the lock: */
> -	task->blocked_on = waiter;
> +	/* Current thread can't be already blocked (since it's executing!) */
> +	DEBUG_LOCKS_WARN_ON(get_task_blocked_on(task));
>  }
>  
>  void debug_mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
>  			 struct task_struct *task)
>  {
> +	struct mutex *blocked_on = get_task_blocked_on(task);
> +
>  	DEBUG_LOCKS_WARN_ON(list_empty(&waiter->list));
>  	DEBUG_LOCKS_WARN_ON(waiter->task != task);
> -	DEBUG_LOCKS_WARN_ON(task->blocked_on != waiter);
> -	task->blocked_on = NULL;
> +	DEBUG_LOCKS_WARN_ON(blocked_on && blocked_on != lock);
>  
>  	INIT_LIST_HEAD(&waiter->list);
>  	waiter->task = NULL;
> diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
> index 3302e52f0c967..8f5d3fe6c1029 100644
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -597,6 +597,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
>  	}
>  
>  	raw_spin_lock_irqsave(&lock->wait_lock, flags);
> +	raw_spin_lock(&current->blocked_lock);
>  	/*
>  	 * After waiting to acquire the wait_lock, try again.
>  	 */
> @@ -627,6 +628,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
>  			goto err_early_kill;
>  	}
>  
> +	set_task_blocked_on(current, lock);
>  	set_current_state(state);

blocked_on_state mirrors task-state

>  	trace_contention_begin(lock, LCB_F_MUTEX);
>  	for (;;) {
> @@ -639,7 +641,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
>  		 * the handoff.
>  		 */
>  		if (__mutex_trylock(lock))
> -			goto acquired;
> +			break; /* acquired */;
>  
>  		/*
>  		 * Check for signals and kill conditions while holding
> @@ -657,6 +659,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
>  				goto err;
>  		}
>  
> +		raw_spin_unlock(&current->blocked_lock);
>  		raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
>  		/* Make sure we do wakeups before calling schedule */
>  		wake_up_q(&wake_q);
> @@ -666,6 +669,13 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
>  
>  		first = __mutex_waiter_is_first(lock, &waiter);
>  
> +		raw_spin_lock_irqsave(&lock->wait_lock, flags);
> +		raw_spin_lock(&current->blocked_lock);
> +
> +		/*
> +		 * Re-set blocked_on_state as unlock path set it to WAKING/RUNNABLE
> +		 */
> +		current->blocked_on_state = BO_BLOCKED;
>  		set_current_state(state);

And blocked_on_state again mirrors task-state

>  		/*
>  		 * Here we order against unlock; we must either see it change
> @@ -676,16 +686,26 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
>  			break;
>  
>  		if (first) {
> +			bool opt_acquired;
> +
> +			/*
> +			 * mutex_optimistic_spin() can schedule, so  we need to
> +			 * release these locks before calling it.
> +			 */
> +			current->blocked_on_state = BO_RUNNABLE;
> +			raw_spin_unlock(&current->blocked_lock);
> +			raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
>  			trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN);
> -			if (mutex_optimistic_spin(lock, ww_ctx, &waiter))
> +			opt_acquired = mutex_optimistic_spin(lock, ww_ctx, &waiter);

I'm confused -- didn't I kill optimistic spinning for proxy? IIRC it
fundamentally doesn't make sense since we do a hand-off to the donor
thread.

> +			raw_spin_lock_irqsave(&lock->wait_lock, flags);
> +			raw_spin_lock(&current->blocked_lock);
> +			current->blocked_on_state = BO_BLOCKED;
> +			if (opt_acquired)
>  				break;
>  			trace_contention_begin(lock, LCB_F_MUTEX);
>  		}
> -
> -		raw_spin_lock_irqsave(&lock->wait_lock, flags);
>  	}
> -	raw_spin_lock_irqsave(&lock->wait_lock, flags);
> -acquired:
> +	set_task_blocked_on(current, NULL);
>  	__set_current_state(TASK_RUNNING);

And again..

>  
>  	if (ww_ctx) {
> @@ -710,16 +730,20 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
>  	if (ww_ctx)
>  		ww_mutex_lock_acquired(ww, ww_ctx);
>  
> +	raw_spin_unlock(&current->blocked_lock);
>  	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
>  	wake_up_q(&wake_q);
>  	preempt_enable();
>  	return 0;
>  
>  err:
> +	set_task_blocked_on(current, NULL);
>  	__set_current_state(TASK_RUNNING);

and one again..

>  	__mutex_remove_waiter(lock, &waiter);
>  err_early_kill:
> +	WARN_ON(get_task_blocked_on(current));
>  	trace_contention_end(lock, ret);
> +	raw_spin_unlock(&current->blocked_lock);
>  	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
>  	debug_mutex_free_waiter(&waiter);
>  	mutex_release(&lock->dep_map, ip);
> @@ -928,8 +952,12 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
>  
>  		next = waiter->task;
>  
> +		raw_spin_lock(&next->blocked_lock);
>  		debug_mutex_wake_waiter(lock, waiter);
> +		WARN_ON(get_task_blocked_on(next) != lock);
> +		set_blocked_on_waking(next);

and more..

>  		wake_q_add(&wake_q, next);
> +		raw_spin_unlock(&next->blocked_lock);
>  	}
>  
>  	if (owner & MUTEX_FLAG_HANDOFF)
> diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
> index 37f025a096c9d..d9ff2022eef6f 100644
> --- a/kernel/locking/ww_mutex.h
> +++ b/kernel/locking/ww_mutex.h
> @@ -281,10 +281,21 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
>  		return false;
>  
>  	if (waiter->ww_ctx->acquired > 0 && __ww_ctx_less(waiter->ww_ctx, ww_ctx)) {
> +		/* nested as we should hold current->blocked_lock already */
> +		raw_spin_lock_nested(&waiter->task->blocked_lock, SINGLE_DEPTH_NESTING);
>  #ifndef WW_RT
>  		debug_mutex_wake_waiter(lock, waiter);
> +		/*
> +		 * When waking up the task to die, be sure to set the
> +		 * blocked_on_state to WAKING. Otherwise we can see
> +		 * circular blocked_on relationships that can't
> +		 * resolve.
> +		 */
> +		WARN_ON(get_task_blocked_on(waiter->task) != lock);
>  #endif
> +		set_blocked_on_waking(waiter->task);
>  		wake_q_add(wake_q, waiter->task);
> +		raw_spin_unlock(&waiter->task->blocked_lock);
>  	}
>  
>  	return true;
> @@ -331,9 +342,18 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
>  		 * it's wounded in __ww_mutex_check_kill() or has a
>  		 * wakeup pending to re-read the wounded state.
>  		 */
> -		if (owner != current)
> +		if (owner != current) {
> +			/* nested as we should hold current->blocked_lock already */
> +			raw_spin_lock_nested(&owner->blocked_lock, SINGLE_DEPTH_NESTING);
> +			/*
> +			 * When waking up the task to wound, be sure to set the
> +			 * blocked_on_state flag. Otherwise we can see circular
> +			 * blocked_on relationships that can't resolve.
> +			 */
> +			set_blocked_on_waking(owner);
>  			wake_q_add(wake_q, owner);
> -
> +			raw_spin_unlock(&owner->blocked_lock);
> +		}
>  		return true;
>  	}
>  
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index d712e177d3b75..f8714050b6d0d 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4350,6 +4350,7 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  		ttwu_queue(p, cpu, wake_flags);
>  	}
>  out:
> +	set_blocked_on_runnable(p);
>  	if (success)
>  		ttwu_stat(p, task_cpu(p), wake_flags);
>  

All in all I'm properly confused by this patch... please write
more/better comments and changelog.


* Re: [RFC][PATCH v14 3/7] sched: Fix runtime accounting w/ split exec & sched contexts
  2024-11-25 19:51 ` [RFC][PATCH v14 3/7] sched: Fix runtime accounting w/ split exec & sched contexts John Stultz
@ 2024-12-13 23:37   ` Peter Zijlstra
  2024-12-14  0:09     ` Peter Zijlstra
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2024-12-13 23:37 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, kernel-team

On Mon, Nov 25, 2024 at 11:51:57AM -0800, John Stultz wrote:



> -static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
> +static s64 update_curr_se(struct rq *rq, struct sched_entity *se)
>  {
>  	u64 now = rq_clock_task(rq);
>  	s64 delta_exec;
>  
> -	delta_exec = now - curr->exec_start;
> +	delta_exec = now - se->exec_start;
>  	if (unlikely(delta_exec <= 0))
>  		return delta_exec;
>  
> -	curr->exec_start = now;
> -	curr->sum_exec_runtime += delta_exec;
> +	se->exec_start = now;
> +	if (entity_is_task(se)) {
> +		struct task_struct *running = rq->curr;
> +		/*
> +		 * If se is a task, we account the time against the running
> +		 * task, as w/ proxy-exec they may not be the same.
> +		 */
> +		running->se.exec_start = now;
> +		running->se.sum_exec_runtime += delta_exec;
> +	} else {
> +		/* If not task, account the time against se */
> +		se->sum_exec_runtime += delta_exec;
> +	}
>  
>  	if (schedstat_enabled()) {
>  		struct sched_statistics *stats;
>  
> -		stats = __schedstats_from_se(curr);
> +		stats = __schedstats_from_se(se);
>  		__schedstat_set(stats->exec_max,
>  				max(delta_exec, stats->exec_max));
>  	}

Would it not be *much* clearer if we do it like:

static s64 update_curr_se(struct rq *rq, struct sched_entity *donor,
			  struct sched_entity *curr)
{
	...
	donor->exec_start = now;
	curr->exec_start = now;
	curr->sum_exec_runtime += delta_exec;
	...
}

and update the callsites like so:

update_curr_common()
	update_curr_se(rq, &donor->se, &rq->curr.se)

update_curr()
	update_curr_se(rq, &curr->se, &curr->se);


except, now I'm confused about the update_curr() case. That seems to
always update the execution context, rather than the donor ?


* Re: [RFC][PATCH v14 5/7] sched: Add an initial sketch of the find_proxy_task() function
  2024-11-25 19:51 ` [RFC][PATCH v14 5/7] sched: Add an initial sketch of the find_proxy_task() function John Stultz
@ 2024-12-14  0:05   ` Peter Zijlstra
  2024-12-17  5:42     ` John Stultz
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2024-12-14  0:05 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, kernel-team

On Mon, Nov 25, 2024 at 11:51:59AM -0800, John Stultz wrote:

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f8714050b6d0d..b492506d33415 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5052,6 +5052,34 @@ static void do_balance_callbacks(struct rq *rq, struct balance_callback *head)
>  	}
>  }
>  
> +/*
> + * Only called from __schedule context
> + *
> + * There are some cases where we are going to re-do the action
> + * that added the balance callbacks. We may not be in a state
> + * where we can run them, so just zap them so they can be
> + * properly re-added on the next time around. This is similar
> + * handling to running the callbacks, except we just don't call
> + * them.
> + */

Which specific callbacks are these? sched_core_balance()?

In general, shooting down all callbacks like this makes me feel somewhat
uncomfortable.

> +#ifdef CONFIG_SCHED_PROXY_EXEC
> +
> +static inline struct task_struct *
> +proxy_resched_idle(struct rq *rq)
> +{
> +	put_prev_task(rq, rq->donor);
> +	rq_set_donor(rq, rq->idle);
> +	set_next_task(rq, rq->idle);
> +	set_tsk_need_resched(rq->idle);
> +	return rq->idle;
> +}
> +
> +static bool proxy_deactivate(struct rq *rq, struct task_struct *donor)
> +{
> +	unsigned long state = READ_ONCE(donor->__state);
> +
> +	/* Don't deactivate if the state has been changed to TASK_RUNNING */
> +	if (state == TASK_RUNNING)
> +		return false;
> +	/*
> +	 * Because we got donor from pick_next_task, it is *crucial*
> +	 * that we call proxy_resched_idle before we deactivate it.
> +	 * As once we deactivate donor, donor->on_rq is set to zero,
> +	 * which allows ttwu to immediately try to wake the task on
> +	 * another rq. So we cannot use *any* references to donor
> +	 * after that point. So things like cfs_rq->curr or rq->donor
> +	 * need to be changed from next *before* we deactivate.
> +	 */
> +	proxy_resched_idle(rq);
> +	return try_to_block_task(rq, donor, state, true);
> +}
> +
> +/*
> + * Initial simple proxy that just returns the task if it's waking
> + * or deactivates the blocked task so we can pick something that
> + * isn't blocked.
> + */
> +static struct task_struct *
> +find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
> +{
> +	struct task_struct *p = donor;
> +	struct mutex *mutex;
> +
> +	mutex = p->blocked_on;
> +	/* Something changed in the chain, so pick again */
> +	if (!mutex)
> +		return NULL;
> +	/*
> +	 * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
> +	 * and ensure @owner sticks around.
> +	 */
> +	raw_spin_lock(&mutex->wait_lock);
> +	raw_spin_lock(&p->blocked_lock);

I'm still wondering what this blocked_lock does; the previous patch had
it mirror wait_mutex too, and so far I don't see the point.

> +
> +	/* Check again that p is blocked with blocked_lock held */
> +	if (!task_is_blocked(p) || mutex != get_task_blocked_on(p)) {
> +		/*
> +		 * Something changed in the blocked_on chain and
> +		 * we don't know if only at this level. So, let's
> +		 * just bail out completely and let __schedule
> +		 * figure things out (pick_again loop).
> +		 */
> +		goto out;
> +	}
> +	if (!proxy_deactivate(rq, donor))
> +		/* XXX: This hack won't work when we get to migrations */
> +		donor->blocked_on_state = BO_RUNNABLE;
> +
> +out:
> +	raw_spin_unlock(&p->blocked_lock);
> +	raw_spin_unlock(&mutex->wait_lock);
> +	return NULL;
> +}
> +#else /* SCHED_PROXY_EXEC */
> +static struct task_struct *
> +find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
> +{
> +	WARN_ONCE(1, "This should never be called in the !SCHED_PROXY_EXEC case\n");
> +	return donor;
> +}
> +#endif /* SCHED_PROXY_EXEC */
> +
>  /*
>   * __schedule() is the main scheduler function.
>   *
> @@ -6732,12 +6845,22 @@ static void __sched notrace __schedule(int sched_mode)
>  			goto picked;
>  		}
>  	} else if (!preempt && prev_state) {
> -		block = try_to_block_task(rq, prev, prev_state);
> +		block = try_to_block_task(rq, prev, prev_state,
> +					  !task_is_blocked(prev));
>  		switch_count = &prev->nvcsw;
>  	}
>  
> -	next = pick_next_task(rq, prev, &rf);
> +pick_again:
> +	next = pick_next_task(rq, rq->donor, &rf);
>  	rq_set_donor(rq, next);
> +	if (unlikely(task_is_blocked(next))) {
> +		next = find_proxy_task(rq, next, &rf);
> +		if (!next) {
> +			/* zap the balance_callbacks before picking again */
> +			zap_balance_callbacks(rq);
> +			goto pick_again;
> +		}
> +	}
>  picked:
>  	clear_tsk_need_resched(prev);
>  	clear_preempt_need_resched();


* Re: [RFC][PATCH v14 3/7] sched: Fix runtime accounting w/ split exec & sched contexts
  2024-12-13 23:37   ` Peter Zijlstra
@ 2024-12-14  0:09     ` Peter Zijlstra
  2024-12-17  6:09       ` John Stultz
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2024-12-14  0:09 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, kernel-team

On Sat, Dec 14, 2024 at 12:37:40AM +0100, Peter Zijlstra wrote:
> On Mon, Nov 25, 2024 at 11:51:57AM -0800, John Stultz wrote:
> 
> 
> 
> > -static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
> > +static s64 update_curr_se(struct rq *rq, struct sched_entity *se)
> >  {
> >  	u64 now = rq_clock_task(rq);
> >  	s64 delta_exec;
> >  
> > -	delta_exec = now - curr->exec_start;
> > +	delta_exec = now - se->exec_start;
> >  	if (unlikely(delta_exec <= 0))
> >  		return delta_exec;
> >  
> > -	curr->exec_start = now;
> > -	curr->sum_exec_runtime += delta_exec;
> > +	se->exec_start = now;
> > +	if (entity_is_task(se)) {
> > +		struct task_struct *running = rq->curr;
> > +		/*
> > +		 * If se is a task, we account the time against the running
> > +		 * task, as w/ proxy-exec they may not be the same.
> > +		 */
> > +		running->se.exec_start = now;
> > +		running->se.sum_exec_runtime += delta_exec;
> > +	} else {
> > +		/* If not task, account the time against se */
> > +		se->sum_exec_runtime += delta_exec;
> > +	}
> >  
> >  	if (schedstat_enabled()) {
> >  		struct sched_statistics *stats;
> >  
> > -		stats = __schedstats_from_se(curr);
> > +		stats = __schedstats_from_se(se);
> >  		__schedstat_set(stats->exec_max,
> >  				max(delta_exec, stats->exec_max));
> >  	}
> 
> Would it not be *much* clearer if we do it like:
> 
> static s64 update_curr_se(struct rq *rq, struct sched_entity *donor,
> 			  struct sched_entity *curr)
> {
> 	...
> 	donor->exec_start = now;
> 	curr->exec_start = now;
> 	curr->sum_exec_runtime += delta_exec;
> 	...
> }
> 
> and update the callsites like so:
> 
> update_curr_common()
> 	update_curr_se(rq, &donor->se, &rq->curr.se)
> 
> update_curr()
> 	update_curr_se(rq, &curr->se, &curr->se);
> 
> 
> except, now I'm confused about the update_curr() case. That seems to
> always update the execution context, rather than the donor ?

Ah no, cfs_rq->curr is the donor.

I'll try again later, or risk keyboard face... Zzzz




* Re: [RFC][PATCH v14 2/7] locking/mutex: Rework task_struct::blocked_on
  2024-12-13 23:22   ` Peter Zijlstra
@ 2024-12-14  3:39     ` John Stultz
  2024-12-16 16:54       ` Peter Zijlstra
  0 siblings, 1 reply; 23+ messages in thread
From: John Stultz @ 2024-12-14  3:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, kernel-team, Connor O'Brien

On Fri, Dec 13, 2024 at 3:22 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Nov 25, 2024 at 11:51:56AM -0800, John Stultz wrote:
>
> > Also add a blocked_on_state value so we can distinguish when a
> > task is blocked_on a mutex, but is either blocked, waking up, or
> > runnable (such that it can try to acquire the lock it's blocked
> > on).
> >
> > This avoids some of the subtle & racy games where the blocked_on
> > state gets cleared, only to have it re-added by the
> > mutex_lock_slowpath call when it tries to acquire the lock on
> > wakeup.
>
> If you can remember those subtle cases, I'm sure our future selves
> would've loved it if you wrote a comment to go with these states :-)

Thanks so much for the review feedback! I really appreciate it!

So yes, the description can use improvement here. I at one time had
3-4 separate very fine-grained patches (see the top 4 patches here:
https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v7-6.7-rc6-fine-grained/?after=c4cad6e353c00254a2dfbb227ef81d8c3827427d+35)
that I rolled into one when sending out (mostly to avoid overwhelming
folks), but the squished commit description isn't as clear.
So if it's helpful I can split this back out?

I'll also add some better comments as well.


> > +static inline void set_blocked_on_runnable(struct task_struct *p)
> > +{
> > +     unsigned long flags;
> > +
> > +     if (!sched_proxy_exec())
> > +             return;
> > +
> > +     raw_spin_lock_irqsave(&p->blocked_lock, flags);
>
> Do we want to make this:
>
>         guard(raw_spinlock_irqsave)(&p->blocked_lock);
>
> ?

Sure, sorry, while I've seen the guard bits, I've not yet really utilized them.
I'll take a stab at this for v15.

>
> > +     __set_blocked_on_runnable(p);
> > +     raw_spin_unlock_irqrestore(&p->blocked_lock, flags);
> > +}
> > +
> > +static inline void set_blocked_on_waking(struct task_struct *p)
>
> consistent naming would be __set_blocked_on_waking()

Yeah, I started with the unlocked versions and then added the one case
that takes the lock, but that does make it inconsistent. I'll rework
all these. Thanks for pointing this out!


> > @@ -627,6 +628,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
> >                       goto err_early_kill;
> >       }
> >
> > +     set_task_blocked_on(current, lock);
> >       set_current_state(state);
>
> blocked_on_state mirrors task-state

Yeah. I know I always move a little fast in my talks, but at OSPM and
LPC I've raised the point that the blocked_on_state very much seems
like it aliases the task->__state.
(See slide 26: https://lpc.events/event/18/contributions/1887/attachments/1402/3074/LPC_%20Proxy%20Exec%20deep%20dive%20outline.pdf)

I've not quite worked out how to integrate it though.

My initial introduction of the blocked_on_state was mostly to help
detangle the code, as there were a number of cases originally where in
order to let the task be selected to try to acquire the mutex, we
cleared the task->blocked_on pointer.  But this often ran into races
with ttwu, as if it cleared the blocked_on pointer, the task could get
selected to run before the return migration happened. So having the
tri-state BLOCKED->WAKING->RUNNABLE be explicit helped ensure we
enforced the rules properly so we didn't run on the wrong cpu.

Trying to squish this tristate into the task->__state has eluded me a
bit (mostly because the lockfree manipulations are a little subtle:
defaulting to RUNNABLE has usually been safe, as the task will just
loop on the condition, but here we really don't want to lose the
signal that we need to do a return migration before the task runs), so
I'd love your insight here.

One thing I've been thinking about is reworking the blocked_on_state
to instead just be a "needs_return" flag. This might simplify things
as we could only set needs_return when we proxy_migrate away from the
task->wake_cpu, so we wouldn't have to handle the WAKING->RUNNABLE
transition for tasks that were never migrated.  And, once it's down to
a single bit, that might work better as a task->__state flag (but
again, it has to prevent transition to TASK_RUNNING until we migrate
back).

Thoughts?


> > @@ -676,16 +686,26 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
> >                       break;
> >
> >               if (first) {
> > +                     bool opt_acquired;
> > +
> > +                     /*
> > +                      * mutex_optimistic_spin() can schedule, so  we need to
> > +                      * release these locks before calling it.
> > +                      */
> > +                     current->blocked_on_state = BO_RUNNABLE;
> > +                     raw_spin_unlock(&current->blocked_lock);
> > +                     raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
> >                       trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN);
> > -                     if (mutex_optimistic_spin(lock, ww_ctx, &waiter))
> > +                     opt_acquired = mutex_optimistic_spin(lock, ww_ctx, &waiter);
>
> I'm confused -- didn't I kill optimistic spinning for proxy? IIRC it
> fundamentally doesn't make sense since we do a hand-off to the donor
> thread.

So, the trouble was that killing optimistic spinning with proxy-exec
had a sizable performance impact. So around v8 (this time last year, I
think), I worked to re-enable optimistic spinning.

The idea is as long as the lock owner is running, we do the optimistic
spin as there's no benefit to proxy boosting it (the owner is already
running, mission accomplished). But, if it's not on_cpu, then we
consider ourselves blocked, and will go into __schedule() where we can
be selected and boost whomever the chain is waiting on. Then if we are
proxy boosting, on release we can hand it over to the boosting donor.

This feels like it nicely balances the normal non-proxy performance
where possible, only switching into proxy-boosting when things are
going to take longer to release.

> All in all I'm properly confused by this patch... please write
> more/better comments changelog.

Sure, I think it's probably worth splitting it out, so I'll take a
swing at that before v15.

Definitely would appreciate your thoughts on the idea for a
TASK_NEEDS_RETURN __state flag or if you see a more elegant way to put
an intermediate step in place on the wakeup to ensure we don't wake
proxy-migrated tasks on the wrong cpu.

Thanks so much again for the input! Really appreciate the feedback!
-john

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH v14 2/7] locking/mutex: Rework task_struct::blocked_on
  2024-12-14  3:39     ` John Stultz
@ 2024-12-16 16:54       ` Peter Zijlstra
  2024-12-16 17:07         ` Peter Zijlstra
  2024-12-17  5:01         ` John Stultz
  0 siblings, 2 replies; 23+ messages in thread
From: Peter Zijlstra @ 2024-12-16 16:54 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, kernel-team, Connor O'Brien

On Fri, Dec 13, 2024 at 07:39:57PM -0800, John Stultz wrote:
> On Fri, Dec 13, 2024 at 3:22 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Mon, Nov 25, 2024 at 11:51:56AM -0800, John Stultz wrote:
> >
> > > Also add a blocked_on_state value so we can distinguish when a
> > > task is blocked_on a mutex, but is either blocked, waking up, or
> > > runnable (such that it can try to acquire the lock it's blocked
> > > on).
> > >
> > > This avoids some of the subtle & racy games where the blocked_on
> > > state gets cleared, only to have it re-added by the
> > > mutex_lock_slowpath call when it tries to acquire the lock on
> > > wakeup
> >
> > If you can remember those sublte cases, I'm sure our future selves
> > would've loved it if you wrote a comment to go with these states :-)
> 
> Thanks so much for the review feedback! I really appreciate it!
> 
> So yes, the description can use improvement here. I at one time had
> 3-4 separate very fine grained patches (see the top 4 patches here:
> https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v7-6.7-rc6-fine-grained/?after=c4cad6e353c00254a2dfbb227ef81d8c3827427d+35)
> that I rolled into one when sending out (mostly to avoid overwhelming
> folks), but the squished commit description isn't as clear.
> So if it's helpful I can split this back out?
> 
> I'll also add some better comments as well.

Not sure yet about splitting back out -- let me try and figure out what
all is actually done / needed.

So blocked_lock started out as another lock around ttwu(), in order to
serialize the task wakeup vs reading a remote ->blocked_on relation.

Since we do this with rq->lock held, it can't be ->pi_lock, and hence
->blocked_lock was born.

Later patches appear to have moved it into mutex, mirroring the
->wait_lock -- this is probably better.

/me goes chase that state thing for a bit..

So, this all seems to have started with:

  https://github.com/johnstultz-work/linux-dev/commit/c122552e07b75bb39d0297431799801de30f147f


> > > @@ -627,6 +628,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
> > >                       goto err_early_kill;
> > >       }
> > >
> > > +     set_task_blocked_on(current, lock);
> > >       set_current_state(state);
> >
> > blocked_on_state mirrors task-state
> 
> Yea. I know I always move a little fast in my talks, but at OSPM and
> LPC I've raised the point that the blocked_on_state very much
> seems like it aliases the task->__state.
> (See slide 26: https://lpc.events/event/18/contributions/1887/attachments/1402/3074/LPC_%20Proxy%20Exec%20deep%20dive%20outline.pdf)
> 
> I've not quite worked out how to integrate it though.
> 
> My initial introduction of the blocked_on_state was mostly to help
> detangle the code, as there were a number of cases originally where in
> order to let the task be selected to try to acquire the mutex, we
> cleared the task->blocked_on pointer.  But this often ran into races
> with ttwu, as if it cleared the blocked_on pointer, the task could get
> selected to run before the return migration happened. So having the
> tri-state BLOCKED->WAKING->RUNNABLE be explicit helped ensure we
> enforced the rules properly so we didn't run on the wrong cpu.

Right, so we already have a TASK_WAKING state, that is that
intermediate. But let me try and get a feel for how things relate;
TASK_WAKING is after we've determined it is eligible for wakeup, but
before it is enqueued -- eg. it sits on the wakelist.

This BO_WAKING is not quite that. It happens before doing the actual
wakeup.

Also, when I pull your proxy-exec-v7-6.7-rc6-fine-grained and put that
next to your slides, I'm a little confused.

Specifically, slide 27 talks about a modification to try_to_wake_up() in
order to force the task into ttwu_runnable() such that it then hits
proxy_needs_return(). This latter part I can find, but not the former.

/me puzzles

Hmm, since a blocked task is on the runqueue and all that, you should
always walk that path, anything else would be buggy.

So fundamentally the problem is that you want to do a reverse migration
-- it needs to go back to a CPU where it is allowed to run. The way it
does this is to dequeue itself and let ttwu continue and clean up the
mess.

This all makes sense, but why can't you do this on the waking side? That
is, instead of setting BO_WAKING, do this whole if (!is_cpu_allowed())
deactivate_task() thing right there, before issuing the wakeup.

Notably, that sched_proxy_exec() block in __mutex_unlock_slowpath():

 - probably doesn't need current->blocked_lock, current isn't blocked
 
 - probably doesn't need lock->wait_lock, donor is stable under
   preempt_disable().


Something like the completely untested below (also, ttwu path needs
surgery now):

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 2711af8c0052..47cfa01cd066 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -946,32 +946,11 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 	}
 
 	preempt_disable();
+	next = proxy_handoff(lock);
+
 	raw_spin_lock_irqsave(&lock->wait_lock, flags);
 	debug_mutex_unlock(lock);
 
-	if (sched_proxy_exec()) {
-		raw_spin_lock(&current->blocked_lock);
-		/*
-		 * If we have a task boosting current, and that task was boosting
-		 * current through this lock, hand the lock to that task, as that
-		 * is the highest waiter, as selected by the scheduling function.
-		 */
-		donor = current->blocked_donor;
-		if (donor) {
-			struct mutex *next_lock;
-
-			raw_spin_lock_nested(&donor->blocked_lock, SINGLE_DEPTH_NESTING);
-			next_lock = get_task_blocked_on(donor);
-			if (next_lock == lock) {
-				next = donor;
-				donor->blocked_on_state = BO_WAKING;
-				wake_q_add(&wake_q, donor);
-				current->blocked_donor = NULL;
-			}
-			raw_spin_unlock(&donor->blocked_lock);
-		}
-	}
-
 	/*
 	 * Failing that, pick any on the wait list.
 	 */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6eaffa913495..30d7371bb5c4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4035,6 +4035,53 @@ static inline void activate_blocked_entities(struct rq *target_rq,
 }
 #endif /* CONFIG_SCHED_PROXY_EXEC */
 
+struct task_struct *proxy_handoff(struct mutex *lock)
+{
+	struct task_struct *next;
+
+	if (!sched_proxy_exec())
+		return NULL;
+
+	/*
+	 * current->blocked_donor can't change if we can't schedule
+	 * caller needs to do this, since it needs stability of the return value
+	 */
+	lockdep_assert_preemption_disabled();
+	next = current->blocked_donor;
+	if (!next)
+		return NULL;
+
+	scoped_guard (task_rq_lock, next) {
+		/*
+		 * current->blocked_donor had better be on the same CPU as current
+		 */
+		WARN_ON_ONCE(scope.rq != this_rq());
+
+		scoped_guard (raw_spin_lock, next->blocked_lock) {
+			/*
+			 * WARN_ON on this? How can this happen
+			 */
+			if (next->blocked_on != lock)
+				return NULL;
+		}
+
+		/*
+		 * blocked_on relation is stable, since we hold both
+	 * next->pi_lock and its rq->lock
+		 *
+		 * OK -- we have a donor, it is blocked on the lock we're about
+	 * to release and it cannot run on this CPU -- fixes are
+		 * required.
+		 *
+		 * Dequeue the task, such that ttwu() can fix up the placement thing.
+		 */
+		if (!is_cpu_allowed(next, cpu_of(scope.rq)))
+			deactivate_task(scope.rq, next, DEQUEUE_SLEEP);
+	}
+
+	return next;
+}
+
 #ifdef CONFIG_SMP
 static inline bool proxy_needs_return(struct rq *rq, struct task_struct *p)
 {

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH v14 2/7] locking/mutex: Rework task_struct::blocked_on
  2024-12-16 16:54       ` Peter Zijlstra
@ 2024-12-16 17:07         ` Peter Zijlstra
  2024-12-17  5:01         ` John Stultz
  1 sibling, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2024-12-16 17:07 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, kernel-team, Connor O'Brien

On Mon, Dec 16, 2024 at 05:54:19PM +0100, Peter Zijlstra wrote:

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 6eaffa913495..30d7371bb5c4 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4035,6 +4035,53 @@ static inline void activate_blocked_entities(struct rq *target_rq,
>  }
>  #endif /* CONFIG_SCHED_PROXY_EXEC */
>  
> +struct task_struct *proxy_handoff(struct mutex *lock)
> +{
> +	struct task_struct *next;
> +
> +	if (!sched_proxy_exec())
> +		return NULL;
> +
> +	/*
> +	 * current->blocked_donor can't change if we can't schedule
> +	 * caller needs to do this, since it needs stability of the return value
> +	 */
> +	lockdep_assert_preemption_disabled();
> +	next = current->blocked_donor;
> +	if (!next)
> +		return NULL;
> +
> +	scoped_guard (task_rq_lock, next) {
> +		/*
> +		 * current->blocked_donor had better be on the same CPU as current
> +		 */
> +		WARN_ON_ONCE(scope.rq != this_rq());
> +
> +		scoped_guard (raw_spin_lock, next->blocked_lock) {
> +			/*
> +			 * WARN_ON on this? How can this happen
> +			 */
> +			if (next->blocked_on != lock)
> +				return NULL;
> +		}
> +
> +		/*
> +		 * blocked_on relation is stable, since we hold both
> +	 * next->pi_lock and its rq->lock
> +		 *
> +		 * OK -- we have a donor, it is blocked on the lock we're about
> +	 * to release and it cannot run on this CPU -- fixes are
> +		 * required.
> +		 *
> +		 * Dequeue the task, such that ttwu() can fix up the placement thing.
> +		 */
> +		if (!is_cpu_allowed(next, cpu_of(scope.rq)))
> +			deactivate_task(scope.rq, next, DEQUEUE_SLEEP);
> +	}

It is probably better to do:

	scoped_guard (raw_spin_lock_irq, next->pi_lock) {

		int cpu = smp_processor_id();
		WARN_ON_ONCE(task_cpu(next) != cpu);

		...

		if (!is_cpu_allowed(next, cpu)) {
			struct rq_flags rf;
			struct rq *rq;
			rq = __task_rq_lock(next, &rf);
			deactivate_task(rq, next, DEQUEUE_SLEEP);
			__task_rq_unlock(rq, &rf);
		}
	}	

In order to minimize the amount of rq->lock'ing.

> +
> +	return next;
> +}
> +
>  #ifdef CONFIG_SMP
>  static inline bool proxy_needs_return(struct rq *rq, struct task_struct *p)
>  {

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH v14 2/7] locking/mutex: Rework task_struct::blocked_on
  2024-12-16 16:54       ` Peter Zijlstra
  2024-12-16 17:07         ` Peter Zijlstra
@ 2024-12-17  5:01         ` John Stultz
  2024-12-17  8:39           ` Peter Zijlstra
                             ` (2 more replies)
  1 sibling, 3 replies; 23+ messages in thread
From: John Stultz @ 2024-12-17  5:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, kernel-team, Connor O'Brien

On Mon, Dec 16, 2024 at 8:54 AM Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Dec 13, 2024 at 07:39:57PM -0800, John Stultz wrote:
> > On Fri, Dec 13, 2024 at 3:22 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > > On Mon, Nov 25, 2024 at 11:51:56AM -0800, John Stultz wrote:
> > So yes, the description can use improvement here. I at one time had
> > 3-4 separate very fine grained patches (see the top 4 patches here:
> > https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v7-6.7-rc6-fine-grained/?after=c4cad6e353c00254a2dfbb227ef81d8c3827427d+35)
> > that I rolled into one when sending out (mostly to avoid overwhelming
> > folks), but the squished commit description isn't as clear.
> > So if it's helpful I can split this back out?
> >
> > I'll also add some better comments as well.
>
> Not sure yet about splitting back out -- let me try and figure out what
> all is actually done / needed.
>
> So blocked_lock started out as another lock around ttwu(), in order to
> serialize the task wakeup vs reading a remote ->blocked_on relation.

I think of it primarily as serializing the task->blocked* state (there
gets to be quite a bit of it by the end of the proxy series).

> Since we do this with rq->lock held, it can't be ->pi_lock, and hence
> ->blocked_lock was born.

Yeah, we needed to use something other than the task->pi_lock to
serialize it as it has to nest under the mutex->wait_lock.

> Later patches appear to have moved it into mutex, mirroring the
> ->wait_lock -- this is probably better.
>
> /me goes chase that state thing for a bit..

? I'm not sure I followed this.  The blocked_lock continues to
serialize the task->blocked* state through the patch series.


> > > > @@ -627,6 +628,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
> > > >                       goto err_early_kill;
> > > >       }
> > > >
> > > > +     set_task_blocked_on(current, lock);
> > > >       set_current_state(state);
> > >
> > > blocked_on_state mirrors task-state
> >
> > Yea. I know I always move a little fast in my talks, but at OSPM and
> > LPC I've raised the point that the blocked_on_state very much
> > seems like it aliases the task->__state.
> > (See slide 26: https://lpc.events/event/18/contributions/1887/attachments/1402/3074/LPC_%20Proxy%20Exec%20deep%20dive%20outline.pdf)
> >
> > I've not quite worked out how to integrate it though.
> >
> > My initial introduction of the blocked_on_state was mostly to help
> > detangle the code, as there were a number of cases originally where in
> > order to let the task be selected to try to acquire the mutex, we
> > cleared the task->blocked_on pointer.  But this often ran into races
> > with ttwu, as if it cleared the blocked_on pointer, the task could get
> > selected to run before the return migration happened. So having the
> > tri-state BLOCKED->WAKING->RUNNABLE be explicit helped ensure we
> > enforced the rules properly so we didn't run on the wrong cpu.
>
> Right, so we already have a TASK_WAKING state, that is that
> intermediate. But let me try and get a feel for how things relate;
> TASK_WAKING is after we've determined it is eligible for wakeup, but
> before it is enqueued -- eg. it sits on the wakelist.
>
> This BO_WAKING is not quite that. It happens before doing the actual
> wakeup.

Right. So BO_WAKING is set from the point that the mutex unlock (or
ww_mutex_wound/die code) tries to wake up a task, as the task may have
been proxy-migrated, so we need to return it before it runs.
So it's sort of a guard to make sure __schedule() doesn't just run the
task (as we could if it were BO_RUNNABLE), but also that we don't try
to proxy for it (as we would if it were BO_BLOCKED), since it needs to
be migrated.

My suspicion is reframing BO_WAKING as TASK_NEEDS_RETURN might be
doable, but I couldn't be sure we wouldn't hit an edge case where
TASK_RUNNING gets written over everything - effectively clearing the
guard.
Maybe the NEEDS_RETURN could just be a side state similar to the
blocked_on_state, and then BO_BLOCKED vs BO_RUNNABLE would just be
distinguishable by the !!task->blocked_on value.

> Also, when I pull your proxy-exec-v7-6.7-rc6-fine-grained and put that
> next to your slides, I'm a little confused.

So, apologies for the confusion. The link I sent you for the
fine-grained changes is a bit old (v7), since I stopped maintaining
the fine-grained patches around v11 (and the most recently shared
version is v14).
I just shared the v7 link as an example to see if it would be helpful
to split the patches here out again in a similar fashion (and I don't
think I published later versions of the fine-grained series).

> Specifically, slide 27 talks about a modification to try_to_wake_up() in
> order to force the task into ttwu_runnable() such that it then hits
> proxy_needs_return(). This latter part I can find, but not the former.
>
> /me puzzles

So the slides actually have links to the code at that time, which
should be the v12 series:
  https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v12-6.11-rc5

And the try_to_wake_up() change from slide 27 is here:
https://github.com/johnstultz-work/linux-dev/commit/cc828a6bac87dcd5734902021973659fe52ef7e6#diff-cc1a82129952a910fdc4292448c2a097a2ba538bebefcf3c06381e45639ae73eR4158

(and here's the link to the v14 version of the same:
https://github.com/johnstultz-work/linux-dev/commit/602a4197c83b83cff08c7adda31324739c7938c7#diff-cc1a82129952a910fdc4292448c2a097a2ba538bebefcf3c06381e45639ae73eR4314
)

> Hmm, since a blocked task is on the runqueue and all that, you should
> always walk that path, anything else would be buggy.
>
> So fundamentally the problem is that you want to do a reverse migration
> -- it needs to go back to a CPU where it is allowed to run. The way it
> does this is to dequeue itself and let ttwu continue and clean up the
> mess.

Yep.

> This all makes sense, but why can't you do this on the waking side? That
> is, instead of setting BO_WAKING, do this whole if (!is_cpu_allowed())
> deactivate_task() thing right there, before issuing the wakeup.
>
> Notably, that sched_proxy_exec() block in __mutex_unlock_slowpath():
>
>  - probably doesn't need current->blocked_lock, current isn't blocked
>
>  - probably doesn't need lock->wait_lock, donor is stable under
>    preempt_disable().
>
>
> Something like the completely untested below (also, ttwu path needs
> surgery now):
>
> diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
> index 2711af8c0052..47cfa01cd066 100644
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -946,32 +946,11 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
>         }
>
>         preempt_disable();
> +       next = proxy_handoff(lock);
> +
>         raw_spin_lock_irqsave(&lock->wait_lock, flags);
>         debug_mutex_unlock(lock);
>
> -       if (sched_proxy_exec()) {
> -               raw_spin_lock(&current->blocked_lock);
> -               /*
> -                * If we have a task boosting current, and that task was boosting
> -                * current through this lock, hand the lock to that task, as that
> -                * is the highest waiter, as selected by the scheduling function.
> -                */
> -               donor = current->blocked_donor;
> -               if (donor) {
> -                       struct mutex *next_lock;
> -
> -                       raw_spin_lock_nested(&donor->blocked_lock, SINGLE_DEPTH_NESTING);
> -                       next_lock = get_task_blocked_on(donor);
> -                       if (next_lock == lock) {
> -                               next = donor;
> -                               donor->blocked_on_state = BO_WAKING;
> -                               wake_q_add(&wake_q, donor);
> -                               current->blocked_donor = NULL;
> -                       }
> -                       raw_spin_unlock(&donor->blocked_lock);
> -               }
> -       }
> -
>         /*
>          * Failing that, pick any on the wait list.
>          */
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 6eaffa913495..30d7371bb5c4 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4035,6 +4035,53 @@ static inline void activate_blocked_entities(struct rq *target_rq,
>  }
>  #endif /* CONFIG_SCHED_PROXY_EXEC */
>
> +struct task_struct *proxy_handoff(struct mutex *lock)
> +{
> +       struct task_struct *next;
> +
> +       if (!sched_proxy_exec())
> +               return NULL;
> +
> +       /*
> +        * current->blocked_donor can't change if we can't schedule
> +        * caller needs to do this, since it needs stability of the return value
> +        */
> +       lockdep_assert_preemption_disabled();
> +       next = current->blocked_donor;
> +       if (!next)
> +               return NULL;
> +
> +       scoped_guard (task_rq_lock, next) {
> +               /*
> +                * current->blocked_donor had better be on the same CPU as current
> +                */
> +               WARN_ON_ONCE(scope.rq != this_rq());
> +
> +               scoped_guard (raw_spin_lock, next->blocked_lock) {
> +                       /*
> +                        * WARN_ON on this? How can this happen
> +                        */
> +                       if (next->blocked_on != lock)
> +                               return NULL;
> +               }
> +
> +               /*
> +                * blocked_on relation is stable, since we hold both
> +                * next->pi_lock and its rq->lock
> +                *
> +                * OK -- we have a donor, it is blocked on the lock we're about
> +                * to release and it cannot run on this CPU -- fixies are
> +                * required.
> +                *
> +                * Dequeue the task, such that ttwu() can fix up the placement thing.
> +                */
> +               if (!is_cpu_allowed(next, cpu_of(scope.rq)))

nit: we'd want to check it's not the wake_cpu, so we try to return it
and proxy migrations don't upset the task's original placement

> +                       deactivate_task(scope.rq, next, DEQUEUE_SLEEP);
> +       }
> +
> +       return next;
> +}
> +

Ok. I'll stare at all this a bit and see if I can give it a try. I
fret that this doesn't handle the case where wakeups on the task occur
through other code paths. (So we still need BO_WAKING/NEEDS_RETURN to
prevent us from running until we migrate back.) I don't really have a
specific case I can articulate, but my gut is telling me the problem
will be w/ ww_mutexes as that was a major source of problems with the
early versions of the patches that I believe tried to use logic
similar to this.

Again, I really appreciate your insight and suggestions here!

thanks
-john

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH v14 5/7] sched: Add an initial sketch of the find_proxy_task() function
  2024-12-14  0:05   ` Peter Zijlstra
@ 2024-12-17  5:42     ` John Stultz
  2024-12-17  8:52       ` Peter Zijlstra
  0 siblings, 1 reply; 23+ messages in thread
From: John Stultz @ 2024-12-17  5:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, kernel-team

On Fri, Dec 13, 2024 at 4:06 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Nov 25, 2024 at 11:51:59AM -0800, John Stultz wrote:
>
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index f8714050b6d0d..b492506d33415 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -5052,6 +5052,34 @@ static void do_balance_callbacks(struct rq *rq, struct balance_callback *head)
> >       }
> >  }
> >
> > +/*
> > + * Only called from __schedule context
> > + *
> > + * There are some cases where we are going to re-do the action
> > + * that added the balance callbacks. We may not be in a state
> > + * where we can run them, so just zap them so they can be
> > + * properly re-added on the next time around. This is similar
> > + * handling to running the callbacks, except we just don't call
> > + * them.
> > + */
>
> Which specific callbacks are this? sched_core_balance()?
>
> In general, shooting down all callbacks like this makes me feel somewhat
> uncomfortable.

So, if we originally picked an RT task, I believe it would set up the
push_rt_tasks callback, but if it got migrated and we needed to
pick again, we'd end up tripping on
`SCHED_WARN_ON(rq->balance_callback && rq->balance_callback !=
&balance_push_callback);`

For a while I tried to unpin and run the balance callbacks before
calling pick_again if find_proxy_task() failed, but that ran into
trouble with tasks getting unintentionally added to the rt
pushable list (this was back in ~Feb, so my memory is a little fuzzy).

So that's when I figured zapping the callbacks would be best, with the
idea being that we are starting selection over, so we effectively have
to undo any of the state that was set by pick_next_task() before
calling it again.

Let me know if you have concerns with this, or suggestions for other approaches.

> > +/*
> > + * Initial simple proxy that just returns the task if it's waking
> > + * or deactivates the blocked task so we can pick something that
> > + * isn't blocked.
> > + */
> > +static struct task_struct *
> > +find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
> > +{
> > +     struct task_struct *p = donor;
> > +     struct mutex *mutex;
> > +
> > +     mutex = p->blocked_on;
> > +     /* Something changed in the chain, so pick again */
> > +     if (!mutex)
> > +             return NULL;
> > +     /*
> > +      * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
> > +      * and ensure @owner sticks around.
> > +      */
> > +     raw_spin_lock(&mutex->wait_lock);
> > +     raw_spin_lock(&p->blocked_lock);
>
> I'm still wondering what this blocked_lock does, that previous patch had
> it mirror wait_mutex too, so far I don't see the point.

Yeah, early on in the series it's maybe not as useful, but as we start
dealing with sleeping owner enqueuing, it's doing more:
  https://github.com/johnstultz-work/linux-dev/commit/d594ca8df88645aa3b2b9daa105664893818bdb7

But it's possible it's more of a crutch that helps me keep the
locking rules straight, as it's simpler to keep in my head. :)
Happy to think a bit more on whether it can be folded together with another lock.

Thanks again for the review and thoughts here!

thanks
-john

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH v14 3/7] sched: Fix runtime accounting w/ split exec & sched contexts
  2024-12-14  0:09     ` Peter Zijlstra
@ 2024-12-17  6:09       ` John Stultz
  2024-12-17  8:48         ` Peter Zijlstra
  0 siblings, 1 reply; 23+ messages in thread
From: John Stultz @ 2024-12-17  6:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, kernel-team

On Fri, Dec 13, 2024 at 4:10 PM Peter Zijlstra <peterz@infradead.org> wrote:
> On Sat, Dec 14, 2024 at 12:37:40AM +0100, Peter Zijlstra wrote:
> > Would it not be *much* clearer if we do it like:
> >
> > static s64 update_curr_se(struct rq *rq, struct sched_entity *donor,
> >                         struct sched_entity *curr)
> > {
> >       ...
> >       donor->exec_start = now;
> >       curr->exec_start = now;
> >       curr->sum_exec_runtime += delta_exec;
> >       ...
> > }
> >
> > and update the callsites like so:
> >
> > update_curr_common()
> >       update_curr_se(rq, &donor->se, &rq->curr.se)
> >
> > update_curr()
> >       update_curr_se(rq, &curr->se, &curr->se);
> >
> >
> > except, now I'm confused about the update_curr() case. That seems to
> > always update the execution context, rather than the donor ?
>
> Ah no, cfs_rq->curr is the donor.

Yeah. That is one detail in the current series where the naming can be
particularly confusing.

I can go through and rename cfs_rq->curr to cfs_rq->donor (or some
other name) to make it more clear, but it seems like a ton of churn,
so I've been hesitant to do so until there was stronger consensus on
taking the patch series, but maybe we're at that point now?

But maybe a simpler and more isolated fix would be to rework
update_curr_se() to just take the rq* and derive the donor.se and
curr.se from that.

Thoughts?
thanks
-john

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC][PATCH v14 2/7] locking/mutex: Rework task_struct::blocked_on
  2024-12-17  5:01         ` John Stultz
@ 2024-12-17  8:39           ` Peter Zijlstra
  2024-12-17  8:46           ` Peter Zijlstra
  2024-12-17  9:19           ` Peter Zijlstra
  2 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2024-12-17  8:39 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, kernel-team, Connor O'Brien

On Mon, Dec 16, 2024 at 09:01:24PM -0800, John Stultz wrote:
> On Mon, Dec 16, 2024 at 8:54 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > On Fri, Dec 13, 2024 at 07:39:57PM -0800, John Stultz wrote:
> > > On Fri, Dec 13, 2024 at 3:22 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > > > On Mon, Nov 25, 2024 at 11:51:56AM -0800, John Stultz wrote:
> > > So yes, the description can use improvement here. I at one time had
> > > 3-4 separate very fine-grained patches (see the top 4 patches here:
> > > https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v7-6.7-rc6-fine-grained/?after=c4cad6e353c00254a2dfbb227ef81d8c3827427d+35)
> > > that I rolled into one when sending out (mostly to avoid overwhelming
> > > folks), but the squished commit description isn't as clear.
> > > So if it's helpful I can split this back out?
> > >
> > > I'll also add some better comments as well.
> >
> > Not sure yet about splitting back out -- let me try and figure out what
> > all is actually done / needed.
> >
> > So blocked_lock started out as another lock around ttwu(), in order to
> > serialize the task wakeup vs reading a remote ->blocked_on relation.
> 
> I think of it as primarily serializing the task->blocked* state (there
> gets to be quite a bit by the end of the proxy series).
> 
> > Since we do this with rq->lock held, it can't be ->pi_lock, and hence
> > ->blocked_lock was born.
> 
> Yeah, we needed to use something other than the task->pi_lock to
> serialize it as it has to nest under the mutex->wait_lock.

No, the critical bit is nesting under rq->lock -- we need to be able to
walk the blocked relation in the middle of schedule(). You can equally
wrap blocked_lock outside of wait_lock; that doesn't really matter much.

> > Later patches appear to have moved it into mutex, mirroring the
> > ->wait_lock -- this is probably better.
> >
> > /me goes chase that state thing for a bit..
> 
> ? I'm not sure I followed this.  The blocked_lock continues to
> serialize the task->blocked* state through the patch series.

Well, there was only ->blocked_on, and on UP you don't need
serialization beyond disabling preemption.

The tricky bit is SMP, then you need something to stabilize the blocked
relation.


* Re: [RFC][PATCH v14 2/7] locking/mutex: Rework task_struct::blocked_on
  2024-12-17  5:01         ` John Stultz
  2024-12-17  8:39           ` Peter Zijlstra
@ 2024-12-17  8:46           ` Peter Zijlstra
  2024-12-17  9:19           ` Peter Zijlstra
  2 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2024-12-17  8:46 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, kernel-team, Connor O'Brien

On Mon, Dec 16, 2024 at 09:01:24PM -0800, John Stultz wrote:

> > +struct task_struct *proxy_handoff(struct mutex *lock)
> > +{
> > +       struct task_struct *next;
> > +
> > +       if (!sched_proxy_exec())
> > +               return NULL;
> > +
> > +       /*
> > +        * current->blocked_donor can't change if we can't schedule
> > +        * the caller needs to ensure this, since it needs stability of the return value
> > +        */
> > +       lockdep_assert_preemption_disabled();
> > +       next = current->blocked_donor;
> > +       if (!next)
> > +               return NULL;
> > +
> > +       scoped_guard (task_rq_lock, next) {
> > +               /*
> > +                * current->blocked_donor had better be on the same CPU as current
> > +                */
> > +               WARN_ON_ONCE(scope.rq != this_rq());
> > +
> > +               scoped_guard (raw_spin_lock, next->blocked_lock) {
> > +                       /*
> > +                        * WARN_ON on this? How can this happen?
> > +                        */
> > +                       if (next->blocked_on != lock)
> > +                               return NULL;
> > +               }
> > +
> > +               /*
> > +                * blocked_on relation is stable, since we hold both
> > +                * next->pi_lock and its rq->lock
> > +                *
> > +                * OK -- we have a donor, it is blocked on the lock we're about
> > +                * to release and it cannot run on this CPU -- fixes are
> > +                * required.
> > +                *
> > +                * Dequeue the task, such that ttwu() can fix up the placement thing.
> > +                */
> > +               if (!is_cpu_allowed(next, cpu_of(scope.rq)))
> 
> nit: we'd also want to check that it's not the wake_cpu, so we try to
> return the task there and proxy migrations don't upset its original placement

I don't think that really matters, not doing this migration is probably
faster and then load balance will try and fix things again.

The thing is, you want this task to take over the lock and start running
ASAP, and we know *this* CPU is about to release the lock and then the
power of the block-chain is likely to make the next task run.

So less migration is more better, otherwise you then get to pull the
entire block chain over to whatever -- and you know how much 'fun' that
is.

> > +                       deactivate_task(scope.rq, next, DEQUEUE_SLEEP);
> > +       }
> > +
> > +       return next;
> > +}
> > +
> 
> Ok. I'll stare at all this a bit and see if I can give it a try.  I
> fret that this doesn't handle the case where wakeups on the task occur
> through other code paths? (So we still need BO_WAKING/NEEDS_RETURN to
> prevent us from running until we migrate back). I don't really have a
> specific case I can articulate, but my gut is telling me the problem
> will be w/ ww_mutexes as that was a major source of problems with the
> early versions of the patches that I believe tried to use logic
> similar to this.

Right, so I've not looked at ww_mutex specifically yet, but I thought to
have understood it was more or less the same problem there. If you've
got more details, please do share :-)


* Re: [RFC][PATCH v14 3/7] sched: Fix runtime accounting w/ split exec & sched contexts
  2024-12-17  6:09       ` John Stultz
@ 2024-12-17  8:48         ` Peter Zijlstra
  0 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2024-12-17  8:48 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, kernel-team

On Mon, Dec 16, 2024 at 10:09:16PM -0800, John Stultz wrote:
> On Fri, Dec 13, 2024 at 4:10 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > On Sat, Dec 14, 2024 at 12:37:40AM +0100, Peter Zijlstra wrote:
> > > Would it not be *much* clearer if we do it like:
> > >
> > > static s64 update_curr_se(struct rq *rq, struct sched_entity *donor,
> > >                         struct sched_entity *curr)
> > > {
> > >       ...
> > >       donor->exec_start = now;
> > >       curr->exec_start = now;
> > >       curr->sum_exec_runtime += delta_exec;
> > >       ...
> > > }
> > >
> > > and update the callsites like so:
> > >
> > > update_curr_common()
> > >       update_curr_se(rq, &donor->se, &rq->curr.se)
> > >
> > > update_curr()
> > >       update_curr_se(rq, &curr->se, &curr->se);
> > >
> > >
> > > except, now I'm confused about the update_curr() case. That seems to
> > > always update the execution context, rather than the donor ?
> >
> > Ah no, cfs_rq->curr is the donor.
> 
> Yeah. That is one detail in the current series where the naming can be
> particularly confusing.
> 
> I can go through and rename cfs_rq->curr to cfs_rq->donor (or some
> other name) to make it clearer, but it seems like a ton of churn,
> so I've been hesitant to do so until there was stronger consensus on
> taking the patch series, but maybe we're at that point now?

Nah, it was just me being confused, let's keep down the churn for now.

> But maybe a simpler and more isolated fix is that I could just rework
> update_curr_se to take only the rq*, and we could derive donor.se and
> curr.se from that.

You can't; rq only has tasks, while cfs_rq is a hierarchy with many se's
backing a single task :/


* Re: [RFC][PATCH v14 5/7] sched: Add an initial sketch of the find_proxy_task() function
  2024-12-17  5:42     ` John Stultz
@ 2024-12-17  8:52       ` Peter Zijlstra
  0 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2024-12-17  8:52 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, kernel-team

On Mon, Dec 16, 2024 at 09:42:31PM -0800, John Stultz wrote:
> On Fri, Dec 13, 2024 at 4:06 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Mon, Nov 25, 2024 at 11:51:59AM -0800, John Stultz wrote:
> >
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index f8714050b6d0d..b492506d33415 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -5052,6 +5052,34 @@ static void do_balance_callbacks(struct rq *rq, struct balance_callback *head)
> > >       }
> > >  }
> > >
> > > +/*
> > > + * Only called from __schedule context
> > > + *
> > > + * There are some cases where we are going to re-do the action
> > > + * that added the balance callbacks. We may not be in a state
> > > + * where we can run them, so just zap them so they can be
> > > + * properly re-added on the next time around. This is similar
> > > + * handling to running the callbacks, except we just don't call
> > > + * them.
> > > + */
> >
> > Which specific callbacks are this? sched_core_balance()?
> >
> > In general, shooting down all callbacks like this makes me feel somewhat
> > uncomfortable.
> 
> So, if we originally picked a RT task, I believe it would setup the
> push_rt_tasks callback, but if it got migrated and if we needed to
> pick again, we'd end up tripping on
> `SCHED_WARN_ON(rq->balance_callback && rq->balance_callback !=
> &balance_push_callback);`
> 
> For a while I tried to unpin and run the balance callbacks before
> calling pick_again, if find_proxy_task() failed, but that was running
> into trouble with tasks getting unintentionally added to the rt
> pushable list (this was back in ~feb, so my memory is a little fuzzy).
> 
> So that's when I figured zapping the callbacks would be best, with the
> idea being that we are starting selection over, so we effectively have
> to undo any of the state that was set by pick_next_task() before
> calling it again.
> 
> Let me know if you have concerns with this, or suggestions for other approaches.

For now, let's stick a coherent comment on, explaining exactly which
callbacks and why.

> > > +/*
> > > + * Initial simple proxy that just returns the task if it's waking
> > > + * or deactivates the blocked task so we can pick something that
> > > + * isn't blocked.
> > > + */
> > > +static struct task_struct *
> > > +find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
> > > +{
> > > +     struct task_struct *p = donor;
> > > +     struct mutex *mutex;
> > > +
> > > +     mutex = p->blocked_on;
> > > +     /* Something changed in the chain, so pick again */
> > > +     if (!mutex)
> > > +             return NULL;
> > > +     /*
> > > +      * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
> > > +      * and ensure @owner sticks around.
> > > +      */
> > > +     raw_spin_lock(&mutex->wait_lock);
> > > +     raw_spin_lock(&p->blocked_lock);
> >
> > I'm still wondering what this blocked_lock does, that previous patch had
> > it mirror wait_mutex too, so far I don't see the point.
> 
> Yeah, early on in the series it's maybe not as useful, but as we start
> dealing with sleeping owner enqueuing, it's doing more:
>   https://github.com/johnstultz-work/linux-dev/commit/d594ca8df88645aa3b2b9daa105664893818bdb7
> 
> But it's possible it's more of a crutch for me to keep the locking
> rules straight, as it's simpler to keep in my head. :)
> Happy to think a bit more on whether it can be folded together with another lock.

I'm a big believer in only introducing state when we actually need it --
and I don't believe we actually need blocked_lock until we go SMP.

Anyway, I have since figured out the why of blocked_lock again; but
yeah, comments, because I'm sure to forget it again at some point.


* Re: [RFC][PATCH v14 2/7] locking/mutex: Rework task_struct::blocked_on
  2024-12-17  5:01         ` John Stultz
  2024-12-17  8:39           ` Peter Zijlstra
  2024-12-17  8:46           ` Peter Zijlstra
@ 2024-12-17  9:19           ` Peter Zijlstra
  2 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2024-12-17  9:19 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, kernel-team, Connor O'Brien

On Mon, Dec 16, 2024 at 09:01:24PM -0800, John Stultz wrote:
> > Specifically, slide 27 talks about a modification to try_to_wake_up() in
> > order to force the task into ttwu_runnable() such that it then hits
> > proxy_needs_return(). This latter part I can find, but not the former.
> >
> > /me puzzles
> 
> So the slides actually have links to the code at that time, which
> should be the v12 series:
>   https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v12-6.11-rc5

Oh no, the reason I looked at -v7 is that I thought it was the most
recent one.

This github piece of shit lists it as the first 'proxy-exec-v7*' in the
branch pull down with earlier version below it.

Only if you scroll down further (which I didn't, because why would I),
you'll eventually come across 'proxy-exec-v14*'.

I'm looking forward to the day where the web implodes under its own
incompetence, urgh, web sucks.


end of thread, other threads:[~2024-12-17  9:19 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-11-25 19:51 [RFC][PATCH v14 0/7] Single CPU Proxy Execution (v14) John Stultz
2024-11-25 19:51 ` [RFC][PATCH v14 1/7] sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable John Stultz
2024-11-25 19:51 ` [RFC][PATCH v14 2/7] locking/mutex: Rework task_struct::blocked_on John Stultz
2024-12-13 23:22   ` Peter Zijlstra
2024-12-14  3:39     ` John Stultz
2024-12-16 16:54       ` Peter Zijlstra
2024-12-16 17:07         ` Peter Zijlstra
2024-12-17  5:01         ` John Stultz
2024-12-17  8:39           ` Peter Zijlstra
2024-12-17  8:46           ` Peter Zijlstra
2024-12-17  9:19           ` Peter Zijlstra
2024-11-25 19:51 ` [RFC][PATCH v14 3/7] sched: Fix runtime accounting w/ split exec & sched contexts John Stultz
2024-12-13 23:37   ` Peter Zijlstra
2024-12-14  0:09     ` Peter Zijlstra
2024-12-17  6:09       ` John Stultz
2024-12-17  8:48         ` Peter Zijlstra
2024-11-25 19:51 ` [RFC][PATCH v14 4/7] sched: Fix psi_dequeue for Proxy Execution John Stultz
2024-11-25 19:51 ` [RFC][PATCH v14 5/7] sched: Add an initial sketch of the find_proxy_task() function John Stultz
2024-12-14  0:05   ` Peter Zijlstra
2024-12-17  5:42     ` John Stultz
2024-12-17  8:52       ` Peter Zijlstra
2024-11-25 19:52 ` [RFC][PATCH v14 6/7] sched: Fix proxy/current (push,pull)ability John Stultz
2024-11-25 19:52 ` [RFC][PATCH v14 7/7] sched: Start blocked_on chain processing in find_proxy_task() John Stultz

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox