* [PATCH v16 0/7] Single RunQueue Proxy Execution (v16)
@ 2025-04-12 6:02 John Stultz
2025-04-12 6:02 ` [PATCH v16 1/7] sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable John Stultz
From: John Stultz @ 2025-04-12 6:02 UTC (permalink / raw)
To: LKML
Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kernel-team
Hey All,
After sending out v15, I unfortunately realized that in moving
some logic further back in the full patch series, I had
accidentally introduced some difficult-to-trigger bugs in the
subset of the series I was submitting. Sadly it took me a while
to figure out exactly which bits weren’t safe to migrate out,
but I’ve finally gotten it back into stable shape.
Many, many thanks to Peter, Steven and Prateek for their helpful
feedback on the last revision. I have tried to integrate many of
the suggested changes, but I may have missed things amid all the
great feedback, so please let me know if you find anything.
Also, since v15, I presented at OSPM on the current status of
Proxy Execution, which you can watch here:
https://youtu.be/xcV1NtWENbs?feature=shared
So with that out of the way, here is v16 of the Proxy Execution
series, a generalized form of priority inheritance.
As I’m trying to submit this work in smallish digestible pieces,
in this series I’m only submitting for review the logic that
allows us to do the proxying when the lock owner is on the same
runqueue as the blocked waiter. This covers introducing the
CONFIG_SCHED_PROXY_EXEC option and boot argument, reworking the
task_struct::blocked_on pointer and wrapper functions, an
initial sketch of the find_proxy_task() logic, some fixes for
using split contexts, and finally same-runqueue proxying.
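For readers new to the series, the core idea of same-runqueue
proxying can be sketched in a few lines of userspace C. This is
purely illustrative (mini_rq, mini_task, mini_mutex and
proxy_pick() are made-up names, not kernel types or API): when
the task picked by the scheduler is blocked on a mutex whose
owner is on the same runqueue, the owner runs in the waiter's
place, while the waiter is kept around as the scheduling-context
donor.

```c
#include <stddef.h>

struct mini_task;

struct mini_mutex {
	struct mini_task *owner;	/* current lock holder */
};

struct mini_task {
	struct mini_mutex *blocked_on;	/* mutex we're blocked on, if any */
	int on_rq;			/* owner on the same runqueue? */
};

struct mini_rq {
	struct mini_task *donor;	/* scheduling context */
	struct mini_task *curr;		/* execution context */
};

/*
 * If the picked task is blocked and the lock owner is on this
 * runqueue, run the owner on the waiter's behalf; the waiter
 * donates its scheduling context.
 */
static void proxy_pick(struct mini_rq *rq, struct mini_task *picked)
{
	struct mini_task *run = picked;

	if (picked->blocked_on && picked->blocked_on->owner &&
	    picked->blocked_on->owner->on_rq)
		run = picked->blocked_on->owner;

	rq->donor = picked;	/* scheduling decisions follow the waiter */
	rq->curr = run;		/* but the owner actually executes */
}
```

The real series must also handle the owner being on another
runqueue (proxy migration), which is deliberately left out of the
subset posted here.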
So with v16, I’ve obviously tried to stabilize the patch series
each step of the way, as well as address the feedback provided.
Particularly complex has been the reworking of the
find_proxy_task() logic to use guard() to avoid some of the
uglier goto return logic.
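For those unfamiliar with the guard() idiom from the kernel's
<linux/cleanup.h>, here is a rough userspace approximation
(LOCK_GUARD, guard_release and bump_and_read are made-up
illustrative names, not kernel API): the compiler's cleanup
attribute releases the lock automatically when the guard
variable goes out of scope, which is what lets early returns
replace goto-unlock labels.

```c
static int lock_word;	/* toy lock: 0 = unlocked, 1 = locked */

/* Cleanup handler: runs when the guard variable leaves scope. */
static void guard_release(int **l)
{
	**l = 0;	/* drop the lock */
}

/* Take the lock and arrange for it to be dropped at scope exit. */
#define LOCK_GUARD(name, l)						\
	int *name __attribute__((cleanup(guard_release))) =		\
		({ *(l) = 1; (l); })

static int counter;

static int bump_and_read(void)
{
	LOCK_GUARD(g, &lock_word);
	counter++;
	return counter;		/* lock_word is reset to 0 right here */
}
```

The kernel's real guard() wraps actual lock types (raw_spinlock,
mutex, rcu) the same way, so any return path out of the scope is
automatically an unlock path.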
I’ve also continued working on the rest of the series, which you
can find here:
https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v16-6.15-rc1/
https://github.com/johnstultz-work/linux-dev.git proxy-exec-v16-6.15-rc1
New changes in the full series include:
* Allow "sched_proxy_exec" without "=true" to enable
proxy-execution at boot time, in addition to the
"sched_proxy_exec=true" or "sched_proxy_exec=false" options as
suggested by Steven
* Drop the "default n" in Kconfig as suggested by Steven
* Add !SCHED_CLASS_EXT dependency until I can investigate if
sched_ext can understand split contexts, as suggested by Peter
* Undid some changes that I had moved earlier in the series, in
order to avoid hitting bugs (mostly around optimistic
spinning/lock stealing, but also sched_balance migrating
blocked tasks).
* Renamed update_curr_se to update_se_times, as suggested by
Steven Rostedt.
* Move the enqueue_task_rt() changes to a more relevant patch,
as suggested by K Prateek Nayak
* Fixup whitespace error pointed out by K Prateek Nayak
* Use put_prev_set_next_task as suggested by K Prateek Nayak
* Try to rework find_proxy_task() locking to use guard and
proxy_deactivate_task() in the way Peter suggested.
* Simplified changes to enqueue_task_rt to match deadline's
logic, as pointed out by Peter
* Get rid of preserve_need_resched flag and rework per Peter's
suggestion
* Rework find_proxy_task() to use guard to clean up the exit
gotos, as Peter suggested.
* Properly understood the “lost-wakeup” issue I was tripping
over and working around earlier, and reworked the forced
return-migration from find_proxy_task() to use Peter’s
dequeue+wakeup approach, which helps resolve the cpu-hotplug
issues I had also seen, caused by the manual return migration
sending tasks to offline CPUs.
Issues still to address with the full series:
* Peter suggested that, instead of using
(blocked_on_state == BO_WAKING) to protect against running
proxy-migrated tasks on CPUs they are not affined to when they
become unblocked, we could dequeue tasks first and then wake them.
This does look to be cleaner in many ways, but the locking
rework is significant and I’ve not worked out all the kinks
with it yet. I am also a little worried that we may trip other
wakeup paths that might not do the dequeue first. However, I
have adopted this approach for the find_proxy_task() forced
return migration, and it’s working well.
* The new rework using guard() cleans up a lot of things, but
there are some edge cases where we change blocked_on locks, or
need to drop locks to do migration, so there still are some
odd goto exit cases needed to get out of the guard scope.
Ideas for further cleanups would be welcome here.
* Need to sort out what is needed for sched_ext to be ok with
proxy-execution enabled.
* K Prateek Nayak did some testing a bit over a year ago
with an earlier version of the series and saw ~3-5%
regressions in some cases. I’m hoping to look into this soon
to see if we can reduce those further.
* The chain migration functionality needs further iterations and
better validation to ensure it truly maintains the RT/DL load
balancing invariants (despite this being broken in vanilla
upstream with RT_PUSH_IPI currently)
I’d really appreciate any feedback or review thoughts on this
series. I’m trying to keep the chunks small, reviewable and
iteratively testable, but if you have any suggestions on how to
improve the series, I’m all ears.
Credit/Disclaimer:
---------------------
As mentioned previously, this Proxy Execution series has a long
history:
First described in a paper[1] by Watkins, Straub, and Niehaus,
it was then carried forward in patches from Peter Zijlstra and
extended with lots of work by Juri Lelli, Valentin Schneider,
and Connor O'Brien. (And thank you to Steven Rostedt for
providing additional details here!)
So again, many thanks to those above, as all the credit for this
series really is due to them - while the mistakes are likely
mine.
Thanks so much!
-john
[1] https://static.lwn.net/images/conf/rtlws11/papers/proc/p38.pdf
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kernel-team@android.com
John Stultz (3):
sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable
sched: Fix runtime accounting w/ split exec & sched contexts
sched: Add an initial sketch of the find_proxy_task() function
Peter Zijlstra (2):
locking/mutex: Rework task_struct::blocked_on
sched: Start blocked_on chain processing in find_proxy_task()
Valentin Schneider (2):
locking/mutex: Add p->blocked_on wrappers for correctness checks
sched: Fix proxy/current (push,pull)ability
.../admin-guide/kernel-parameters.txt | 5 +
include/linux/sched.h | 68 ++++-
init/Kconfig | 12 +
kernel/fork.c | 3 +-
kernel/locking/mutex-debug.c | 9 +-
kernel/locking/mutex.c | 18 ++
kernel/locking/mutex.h | 3 +-
kernel/locking/ww_mutex.h | 16 +-
kernel/sched/core.c | 258 +++++++++++++++++-
kernel/sched/deadline.c | 3 +
kernel/sched/fair.c | 35 ++-
kernel/sched/rt.c | 5 +
kernel/sched/sched.h | 22 +-
13 files changed, 428 insertions(+), 29 deletions(-)
--
2.49.0.604.gff1f9ca942-goog
^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH v16 1/7] sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable
2025-04-12 6:02 [PATCH v16 0/7] Single RunQueue Proxy Execution (v16) John Stultz
@ 2025-04-12 6:02 ` John Stultz
2025-04-14 8:50 ` Juri Lelli
2025-04-12 6:02 ` [PATCH v16 2/7] locking/mutex: Rework task_struct::blocked_on John Stultz
From: John Stultz @ 2025-04-12 6:02 UTC (permalink / raw)
To: LKML
Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kernel-team
Add a CONFIG_SCHED_PROXY_EXEC option, along with a boot argument
sched_proxy_exec= that can be used to disable the feature at boot
time if CONFIG_SCHED_PROXY_EXEC was enabled.
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kernel-team@android.com
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: John Stultz <jstultz@google.com>
---
v7:
* Switch to CONFIG_SCHED_PROXY_EXEC/sched_proxy_exec= as
suggested by Metin Kaya.
* Switch boot arg from =disable/enable to use kstrtobool(),
which supports =yes|no|1|0|true|false|on|off, as also
suggested by Metin Kaya, and print a message when a boot
argument is used.
v8:
* Move CONFIG_SCHED_PROXY_EXEC under Scheduler Features as
Suggested by Metin
* Minor rework reordering with split sched contexts patch
v12:
* Rework for selected -> donor renaming
v14:
* Depend on !PREEMPT_RT to avoid build issues for now
v15:
* Depend on EXPERT while patch series upstreaming is
in progress.
v16:
* Allow "sched_proxy_exec" without "=true" to enable
proxy-execution at boot time, in addition to the
"sched_proxy_exec=true" or "sched_proxy_exec=false" options
as suggested by Steven
* Drop the "default n" in Kconfig as suggested by Steven
* Add !SCHED_CLASS_EXT dependency until I can investigate if
sched_ext can understand split contexts, as suggested by
Peter
---
.../admin-guide/kernel-parameters.txt | 5 ++++
include/linux/sched.h | 13 +++++++++
init/Kconfig | 12 ++++++++
kernel/sched/core.c | 29 +++++++++++++++++++
kernel/sched/sched.h | 12 ++++++++
5 files changed, 71 insertions(+)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 76e538c77e316..b21cb89a09831 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6307,6 +6307,11 @@
sa1100ir [NET]
See drivers/net/irda/sa1100_ir.c.
+ sched_proxy_exec= [KNL]
+ Enables or disables "proxy execution" style
+ solution to mutex-based priority inversion.
+ Format: <bool>
+
sched_verbose [KNL,EARLY] Enables verbose scheduler debug messages.
schedstats= [KNL,X86] Enable or disable scheduled statistics.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f96ac19828934..3cdd598aaa9aa 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1663,6 +1663,19 @@ struct task_struct {
*/
};
+#ifdef CONFIG_SCHED_PROXY_EXEC
+DECLARE_STATIC_KEY_TRUE(__sched_proxy_exec);
+static inline bool sched_proxy_exec(void)
+{
+ return static_branch_likely(&__sched_proxy_exec);
+}
+#else
+static inline bool sched_proxy_exec(void)
+{
+ return false;
+}
+#endif
+
#define TASK_REPORT_IDLE (TASK_REPORT + 1)
#define TASK_REPORT_MAX (TASK_REPORT_IDLE << 1)
diff --git a/init/Kconfig b/init/Kconfig
index dd2ea3b9a7992..23f6edff481b2 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -883,6 +883,18 @@ config UCLAMP_BUCKETS_COUNT
If in doubt, use the default value.
+config SCHED_PROXY_EXEC
+ bool "Proxy Execution"
+ # Avoid some build failures w/ PREEMPT_RT until it can be fixed
+ depends on !PREEMPT_RT
+ # Need to investigate how to inform sched_ext of split contexts
+ depends on !SCHED_CLASS_EXT
+ # Not particularly useful until we get to multi-rq proxying
+ depends on EXPERT
+ help
+ This option enables proxy execution, a mechanism for mutex-owning
+ tasks to inherit the scheduling context of higher priority waiters.
+
endmenu
#
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c81cf642dba05..82817650a635b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -118,6 +118,35 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(sched_compute_energy_tp);
DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+#ifdef CONFIG_SCHED_PROXY_EXEC
+DEFINE_STATIC_KEY_TRUE(__sched_proxy_exec);
+static int __init setup_proxy_exec(char *str)
+{
+ bool proxy_enable = true;
+
+ if (*str && kstrtobool(str + 1, &proxy_enable)) {
+ pr_warn("Unable to parse sched_proxy_exec=\n");
+ return 0;
+ }
+
+ if (proxy_enable) {
+ pr_info("sched_proxy_exec enabled via boot arg\n");
+ static_branch_enable(&__sched_proxy_exec);
+ } else {
+ pr_info("sched_proxy_exec disabled via boot arg\n");
+ static_branch_disable(&__sched_proxy_exec);
+ }
+ return 1;
+}
+#else
+static int __init setup_proxy_exec(char *str)
+{
+ pr_warn("CONFIG_SCHED_PROXY_EXEC=n, so it cannot be enabled or disabled at boot time\n");
+ return 0;
+}
+#endif
+__setup("sched_proxy_exec", setup_proxy_exec);
+
/*
* Debugging: various feature bits
*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 47972f34ea701..154f0aa0c6322 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1149,10 +1149,15 @@ struct rq {
*/
unsigned int nr_uninterruptible;
+#ifdef CONFIG_SCHED_PROXY_EXEC
+ struct task_struct __rcu *donor; /* Scheduling context */
+ struct task_struct __rcu *curr; /* Execution context */
+#else
union {
struct task_struct __rcu *donor; /* Scheduler context */
struct task_struct __rcu *curr; /* Execution context */
};
+#endif
struct sched_dl_entity *dl_server;
struct task_struct *idle;
struct task_struct *stop;
@@ -1347,10 +1352,17 @@ DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
#define raw_rq() raw_cpu_ptr(&runqueues)
+#ifdef CONFIG_SCHED_PROXY_EXEC
+static inline void rq_set_donor(struct rq *rq, struct task_struct *t)
+{
+ rcu_assign_pointer(rq->donor, t);
+}
+#else
static inline void rq_set_donor(struct rq *rq, struct task_struct *t)
{
/* Do nothing */
}
+#endif
#ifdef CONFIG_SCHED_CORE
static inline struct cpumask *sched_group_span(struct sched_group *sg);
--
2.49.0.604.gff1f9ca942-goog
* [PATCH v16 2/7] locking/mutex: Rework task_struct::blocked_on
2025-04-12 6:02 [PATCH v16 0/7] Single RunQueue Proxy Execution (v16) John Stultz
2025-04-12 6:02 ` [PATCH v16 1/7] sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable John Stultz
@ 2025-04-12 6:02 ` John Stultz
2025-04-14 8:59 ` Juri Lelli
2025-04-12 6:02 ` [PATCH v16 3/7] locking/mutex: Add p->blocked_on wrappers for correctness checks John Stultz
From: John Stultz @ 2025-04-12 6:02 UTC (permalink / raw)
To: LKML
Cc: Peter Zijlstra, Joel Fernandes, Qais Yousef, Ingo Molnar,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
Daniel Lezcano, Suleiman Souhlal, kernel-team, Connor O'Brien,
John Stultz
From: Peter Zijlstra <peterz@infradead.org>
Track the blocked-on relation for mutexes, to allow following this
relation at schedule time.
task
| blocked-on
v
mutex
| owner
v
task
This all will be used for tracking blocked-task/mutex chains
with the proxy-execution patches, in a similar fashion to how
priority inheritance is done with rt_mutexes.
For serialization, blocked-on is only set by the task itself
(current). Both setting and clearing it (the latter potentially
done by other tasks) are done while holding the mutex::wait_lock.
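The relation above is transitive: if the mutex owner is itself
blocked on another mutex, the chain can be followed further. A
tiny self-contained sketch of that walk (mini_task, mini_mutex
and chain_owner are illustrative names, not the kernel's types):

```c
#include <stddef.h>

struct mini_mutex;

struct mini_task {
	struct mini_mutex *blocked_on;	/* lock we're blocked on */
};

struct mini_mutex {
	struct mini_task *owner;	/* current lock holder */
};

/*
 * Follow task -> mutex -> owner links until we reach a task that
 * is not blocked; that task is the one worth running on behalf
 * of the whole chain.
 */
static struct mini_task *chain_owner(struct mini_task *p)
{
	while (p->blocked_on && p->blocked_on->owner)
		p = p->blocked_on->owner;
	return p;
}
```

The real find_proxy_task() does this walk under the appropriate
locks, which is where most of the series' complexity lives.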
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kernel-team@android.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[minor changes while rebasing]
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Connor O'Brien <connoro@google.com>
[jstultz: Fix blocked_on tracking in __mutex_lock_common in error paths]
Signed-off-by: John Stultz <jstultz@google.com>
---
v2:
* Fixed blocked_on tracking in error paths that was causing crashes
v4:
* Ensure we clear blocked_on when waking ww_mutexes to die or wound.
This is critical so we don't get circular blocked_on relationships
that can't be resolved.
v5:
* Fix potential bug where the skip_wait path might clear blocked_on
when that path never set it
* Slight tweaks to where we set blocked_on to make it consistent,
along with extra WARN_ON correctness checking
* Minor comment changes
v7:
* Minor commit message change suggested by Metin Kaya
* Fix WARN_ON conditionals in unlock path (as blocked_on might already
be cleared), found while looking at issue Metin Kaya raised.
* Minor tweaks to be consistent in what we do under the
blocked_on lock, also tweaked variable name to avoid confusion
with label, and comment typos, as suggested by Metin Kaya
* Minor tweak for CONFIG_SCHED_PROXY_EXEC name change
* Moved unused block of code to later in the series, as suggested
by Metin Kaya
* Switch to a tri-state to be able to distinguish from waking and
runnable so we can later safely do return migration from ttwu
* Folded together with related blocked_on changes
v8:
* Fix issue leaving task BO_BLOCKED when calling into optimistic
spinning path.
* Include helper to better handle BO_BLOCKED->BO_WAKING transitions
v9:
* Typo fixup pointed out by Metin
* Cleanup BO_WAKING->BO_RUNNABLE transitions for the !proxy case
* Many cleanups and simplifications suggested by Metin
v11:
* Whitespace fixup pointed out by Metin
v13:
* Refactor set_blocked_on helpers clean things up a bit
v14:
* Small build fixup with PREEMPT_RT
v15:
* Improve consistency of names for functions that assume blocked_lock
is held, as suggested by Peter
* Use guard instead of separate spinlock/unlock calls, also suggested
by Peter
* Drop blocked_on_state tri-state for now, as it's not needed
until later in the series, when we get to proxy-migration and
return-migration.
v16:
* Clear blocked on before optimistic spinning
---
include/linux/sched.h | 5 +----
kernel/fork.c | 3 +--
kernel/locking/mutex-debug.c | 9 +++++----
kernel/locking/mutex.c | 22 ++++++++++++++++++++++
kernel/locking/ww_mutex.h | 18 ++++++++++++++++--
5 files changed, 45 insertions(+), 12 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3cdd598aaa9aa..10be203ddb7e1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1234,10 +1234,7 @@ struct task_struct {
struct rt_mutex_waiter *pi_blocked_on;
#endif
-#ifdef CONFIG_DEBUG_MUTEXES
- /* Mutex deadlock detection: */
- struct mutex_waiter *blocked_on;
-#endif
+ struct mutex *blocked_on; /* lock we're blocked on */
#ifdef CONFIG_DETECT_HUNG_TASK_BLOCKER
struct mutex *blocker_mutex;
diff --git a/kernel/fork.c b/kernel/fork.c
index c4b26cd8998b8..3455ab283482e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2383,9 +2383,8 @@ __latent_entropy struct task_struct *copy_process(
lockdep_init_task(p);
#endif
-#ifdef CONFIG_DEBUG_MUTEXES
p->blocked_on = NULL; /* not blocked yet */
-#endif
+
#ifdef CONFIG_BCACHE
p->sequential_io = 0;
p->sequential_io_avg = 0;
diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
index 6e6f6071cfa27..758b7a6792b0c 100644
--- a/kernel/locking/mutex-debug.c
+++ b/kernel/locking/mutex-debug.c
@@ -53,17 +53,18 @@ void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter,
{
lockdep_assert_held(&lock->wait_lock);
- /* Mark the current thread as blocked on the lock: */
- task->blocked_on = waiter;
+ /* Current thread can't be already blocked (since it's executing!) */
+ DEBUG_LOCKS_WARN_ON(task->blocked_on);
}
void debug_mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
struct task_struct *task)
{
+ struct mutex *blocked_on = READ_ONCE(task->blocked_on);
+
DEBUG_LOCKS_WARN_ON(list_empty(&waiter->list));
DEBUG_LOCKS_WARN_ON(waiter->task != task);
- DEBUG_LOCKS_WARN_ON(task->blocked_on != waiter);
- task->blocked_on = NULL;
+ DEBUG_LOCKS_WARN_ON(blocked_on && blocked_on != lock);
INIT_LIST_HEAD(&waiter->list);
waiter->task = NULL;
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 555e2b3a665a3..5243e59d75f40 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -643,6 +643,8 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
goto err_early_kill;
}
+ WARN_ON(current->blocked_on);
+ current->blocked_on = lock;
set_current_state(state);
trace_contention_begin(lock, LCB_F_MUTEX);
for (;;) {
@@ -679,6 +681,12 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
first = __mutex_waiter_is_first(lock, &waiter);
+ /*
+ * As we have likely been woken up by a task
+ * that has cleared our blocked_on state, re-set
+ * it to the lock we are trying to acquire.
+ */
+ current->blocked_on = lock;
set_current_state(state);
/*
* Here we order against unlock; we must either see it change
@@ -690,8 +698,11 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
if (first) {
trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN);
+ /* clear blocked_on as mutex_optimistic_spin may schedule() */
+ current->blocked_on = NULL;
if (mutex_optimistic_spin(lock, ww_ctx, &waiter))
break;
+ current->blocked_on = lock;
trace_contention_begin(lock, LCB_F_MUTEX);
}
@@ -699,6 +710,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
}
raw_spin_lock_irqsave(&lock->wait_lock, flags);
acquired:
+ current->blocked_on = NULL;
__set_current_state(TASK_RUNNING);
if (ww_ctx) {
@@ -728,9 +740,11 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
return 0;
err:
+ current->blocked_on = NULL;
__set_current_state(TASK_RUNNING);
__mutex_remove_waiter(lock, &waiter);
err_early_kill:
+ WARN_ON(current->blocked_on);
trace_contention_end(lock, ret);
raw_spin_unlock_irqrestore_wake(&lock->wait_lock, flags, &wake_q);
debug_mutex_free_waiter(&waiter);
@@ -940,6 +954,14 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
next = waiter->task;
debug_mutex_wake_waiter(lock, waiter);
+ /*
+ * Unlock wakeups can be happening in parallel
+ * (when optimistic spinners steal and release
+ * the lock), so blocked_on may already be
+ * cleared here.
+ */
+ WARN_ON(next->blocked_on && next->blocked_on != lock);
+ next->blocked_on = NULL;
wake_q_add(&wake_q, next);
}
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 37f025a096c9d..00db40946328e 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -284,6 +284,14 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
#ifndef WW_RT
debug_mutex_wake_waiter(lock, waiter);
#endif
+ /*
+ * When waking up the task to die, be sure to clear the
+ * blocked_on pointer. Otherwise we can see circular
+ * blocked_on relationships that can't resolve.
+ */
+ WARN_ON(waiter->task->blocked_on &&
+ waiter->task->blocked_on != lock);
+ waiter->task->blocked_on = NULL;
wake_q_add(wake_q, waiter->task);
}
@@ -331,9 +339,15 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
* it's wounded in __ww_mutex_check_kill() or has a
* wakeup pending to re-read the wounded state.
*/
- if (owner != current)
+ if (owner != current) {
+ /*
+ * When waking up the task to wound, be sure to clear the
+ * blocked_on pointer. Otherwise we can see circular
+ * blocked_on relationships that can't resolve.
+ */
+ owner->blocked_on = NULL;
wake_q_add(wake_q, owner);
-
+ }
return true;
}
--
2.49.0.604.gff1f9ca942-goog
* [PATCH v16 3/7] locking/mutex: Add p->blocked_on wrappers for correctness checks
2025-04-12 6:02 [PATCH v16 0/7] Single RunQueue Proxy Execution (v16) John Stultz
2025-04-12 6:02 ` [PATCH v16 1/7] sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable John Stultz
2025-04-12 6:02 ` [PATCH v16 2/7] locking/mutex: Rework task_struct::blocked_on John Stultz
@ 2025-04-12 6:02 ` John Stultz
2025-04-14 9:09 ` Juri Lelli
2025-04-12 6:02 ` [PATCH v16 4/7] sched: Fix runtime accounting w/ split exec & sched contexts John Stultz
From: John Stultz @ 2025-04-12 6:02 UTC (permalink / raw)
To: LKML
Cc: Valentin Schneider, Joel Fernandes, Qais Yousef, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kernel-team,
Connor O'Brien, John Stultz
From: Valentin Schneider <valentin.schneider@arm.com>
This lets us assert that mutex::wait_lock is held whenever we
access p->blocked_on, as well as warn us of unexpected state
changes.
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kernel-team@android.com
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
[fix conflicts, call in more places]
Signed-off-by: Connor O'Brien <connoro@google.com>
[jstultz: tweaked commit subject, reworked a good bit]
Signed-off-by: John Stultz <jstultz@google.com>
---
v2:
* Added get_task_blocked_on() accessor
v4:
* Address READ_ONCE usage that was dropped in v2
* Reordered to be a later add on to the main patch series as
Peter was unhappy with similar wrappers in other patches.
v5:
* Added some extra correctness checking in wrappers
v7:
* Tweaks to reorder this change in the patch series
* Minor cleanup to set_task_blocked_on() suggested by Metin Kaya
v15:
* Split out into its own patch again.
* Further improve assumption checks in helpers.
v16:
* Fix optimistic spin case that can call schedule()
---
include/linux/sched.h | 50 ++++++++++++++++++++++++++++++++++--
kernel/locking/mutex-debug.c | 4 +--
kernel/locking/mutex.c | 30 ++++++++++------------
kernel/locking/ww_mutex.h | 6 ++---
4 files changed, 65 insertions(+), 25 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 10be203ddb7e1..8a1f0703caba7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -34,6 +34,7 @@
#include <linux/sched/prio.h>
#include <linux/sched/types.h>
#include <linux/signal_types.h>
+#include <linux/spinlock.h>
#include <linux/syscall_user_dispatch_types.h>
#include <linux/mm_types_task.h>
#include <linux/netdevice_xmit.h>
@@ -2181,6 +2182,53 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock);
__cond_resched_rwlock_write(lock); \
})
+static inline void __set_task_blocked_on(struct task_struct *p, struct mutex *m)
+{
+ WARN_ON_ONCE(!m);
+ /* The task should only be setting itself as blocked */
+ WARN_ON_ONCE(p != current);
+ /* Currently we serialize blocked_on under the mutex::wait_lock */
+ lockdep_assert_held_once(&m->wait_lock);
+ /*
+ * Check to ensure we don't overwrite an existing mutex
+ * value with a different mutex. Note, setting it to the same
+ * lock repeatedly is ok.
+ */
+ WARN_ON_ONCE(p->blocked_on && p->blocked_on != m);
+ p->blocked_on = m;
+}
+
+static inline void set_task_blocked_on(struct task_struct *p, struct mutex *m)
+{
+ guard(raw_spinlock_irqsave)(&m->wait_lock);
+ __set_task_blocked_on(p, m);
+}
+
+static inline void __clear_task_blocked_on(struct task_struct *p, struct mutex *m)
+{
+ WARN_ON_ONCE(!m);
+ /* Currently we serialize blocked_on under the mutex::wait_lock */
+ lockdep_assert_held_once(&m->wait_lock);
+ /*
+ * There may be cases where we re-clear already cleared
+ * blocked_on relationships, but make sure we are not
+ * clearing the relationship with a different lock.
+ */
+ WARN_ON_ONCE(m && p->blocked_on && p->blocked_on != m);
+ p->blocked_on = NULL;
+}
+
+static inline void clear_task_blocked_on(struct task_struct *p, struct mutex *m)
+{
+ guard(raw_spinlock_irqsave)(&m->wait_lock);
+ __clear_task_blocked_on(p, m);
+}
+
+static inline struct mutex *__get_task_blocked_on(struct task_struct *p)
+{
+ return READ_ONCE(p->blocked_on);
+}
+
static __always_inline bool need_resched(void)
{
return unlikely(tif_need_resched());
@@ -2220,8 +2268,6 @@ extern bool sched_task_on_rq(struct task_struct *p);
extern unsigned long get_wchan(struct task_struct *p);
extern struct task_struct *cpu_curr_snapshot(int cpu);
-#include <linux/spinlock.h>
-
/*
* In order to reduce various lock holder preemption latencies provide an
* interface to see if a vCPU is currently running or not.
diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
index 758b7a6792b0c..949103fd8e9b5 100644
--- a/kernel/locking/mutex-debug.c
+++ b/kernel/locking/mutex-debug.c
@@ -54,13 +54,13 @@ void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter,
lockdep_assert_held(&lock->wait_lock);
/* Current thread can't be already blocked (since it's executing!) */
- DEBUG_LOCKS_WARN_ON(task->blocked_on);
+ DEBUG_LOCKS_WARN_ON(__get_task_blocked_on(task));
}
void debug_mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
struct task_struct *task)
{
- struct mutex *blocked_on = READ_ONCE(task->blocked_on);
+ struct mutex *blocked_on = __get_task_blocked_on(task);
DEBUG_LOCKS_WARN_ON(list_empty(&waiter->list));
DEBUG_LOCKS_WARN_ON(waiter->task != task);
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 5243e59d75f40..a34a7974b418e 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -643,8 +643,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
goto err_early_kill;
}
- WARN_ON(current->blocked_on);
- current->blocked_on = lock;
+ __set_task_blocked_on(current, lock);
set_current_state(state);
trace_contention_begin(lock, LCB_F_MUTEX);
for (;;) {
@@ -686,7 +685,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
* that has cleared our blocked_on state, re-set
* it to the lock we are trying to acquire.
*/
- current->blocked_on = lock;
+ set_task_blocked_on(current, lock);
set_current_state(state);
/*
* Here we order against unlock; we must either see it change
@@ -698,11 +697,15 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
if (first) {
trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN);
- /* clear blocked_on as mutex_optimistic_spin may schedule() */
- current->blocked_on = NULL;
+ /*
+ * mutex_optimistic_spin() can call schedule(), so
+ * clear blocked_on so we don't become unselectable
+ * to run.
+ */
+ clear_task_blocked_on(current, lock);
if (mutex_optimistic_spin(lock, ww_ctx, &waiter))
break;
- current->blocked_on = lock;
+ set_task_blocked_on(current, lock);
trace_contention_begin(lock, LCB_F_MUTEX);
}
@@ -710,7 +713,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
}
raw_spin_lock_irqsave(&lock->wait_lock, flags);
acquired:
- current->blocked_on = NULL;
+ __clear_task_blocked_on(current, lock);
__set_current_state(TASK_RUNNING);
if (ww_ctx) {
@@ -740,11 +743,11 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
return 0;
err:
- current->blocked_on = NULL;
+ __clear_task_blocked_on(current, lock);
__set_current_state(TASK_RUNNING);
__mutex_remove_waiter(lock, &waiter);
err_early_kill:
- WARN_ON(current->blocked_on);
+ WARN_ON(__get_task_blocked_on(current));
trace_contention_end(lock, ret);
raw_spin_unlock_irqrestore_wake(&lock->wait_lock, flags, &wake_q);
debug_mutex_free_waiter(&waiter);
@@ -954,14 +957,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
next = waiter->task;
debug_mutex_wake_waiter(lock, waiter);
- /*
- * Unlock wakeups can be happening in parallel
- * (when optimistic spinners steal and release
- * the lock), so blocked_on may already be
- * cleared here.
- */
- WARN_ON(next->blocked_on && next->blocked_on != lock);
- next->blocked_on = NULL;
+ __clear_task_blocked_on(next, lock);
wake_q_add(&wake_q, next);
}
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 00db40946328e..086fd5487ca77 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -289,9 +289,7 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
* blocked_on pointer. Otherwise we can see circular
* blocked_on relationships that can't resolve.
*/
- WARN_ON(waiter->task->blocked_on &&
- waiter->task->blocked_on != lock);
- waiter->task->blocked_on = NULL;
+ __clear_task_blocked_on(waiter->task, lock);
wake_q_add(wake_q, waiter->task);
}
@@ -345,7 +343,7 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
* blocked_on pointer. Otherwise we can see circular
* blocked_on relationships that can't resolve.
*/
- owner->blocked_on = NULL;
+ __clear_task_blocked_on(owner, lock);
wake_q_add(wake_q, owner);
}
return true;
--
2.49.0.604.gff1f9ca942-goog
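[Editor's illustration] The invariant the wrappers above enforce can be sketched as a small userspace model (hypothetical names; the real code uses WARN_ON_ONCE under mutex::wait_lock rather than return codes):

```c
#include <assert.h>
#include <stddef.h>

struct mutex;
struct task {
    struct mutex *blocked_on;
};

/*
 * Sketch of the blocked_on discipline: a task may only ever be
 * blocked on one mutex at a time. Re-setting to the same mutex or
 * re-clearing an already cleared relationship is fine; switching to
 * (or clearing against) a *different* mutex is the bug the kernel
 * warns about. Returns 0 on success, -1 where the kernel would WARN.
 */
static int set_blocked_on(struct task *p, struct mutex *m)
{
    if (p->blocked_on && p->blocked_on != m)
        return -1;              /* would overwrite a different mutex */
    p->blocked_on = m;
    return 0;
}

static int clear_blocked_on(struct task *p, struct mutex *m)
{
    if (p->blocked_on && p->blocked_on != m)
        return -1;              /* clearing against the wrong mutex */
    p->blocked_on = NULL;       /* re-clearing NULL is allowed */
    return 0;
}
```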
^ permalink raw reply related [flat|nested] 28+ messages in thread
* [PATCH v16 4/7] sched: Fix runtime accounting w/ split exec & sched contexts
2025-04-12 6:02 [PATCH v16 0/7] Single RunQueue Proxy Execution (v16) John Stultz
` (2 preceding siblings ...)
2025-04-12 6:02 ` [PATCH v16 3/7] locking/mutex: Add p->blocked_on wrappers for correctness checks John Stultz
@ 2025-04-12 6:02 ` John Stultz
2025-04-14 9:28 ` Juri Lelli
2025-04-17 11:12 ` Peter Zijlstra
2025-04-12 6:02 ` [PATCH v16 5/7] sched: Add an initial sketch of the find_proxy_task() function John Stultz
` (2 subsequent siblings)
6 siblings, 2 replies; 28+ messages in thread
From: John Stultz @ 2025-04-12 6:02 UTC (permalink / raw)
To: LKML
Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kernel-team
Without proxy-exec, we normally charge the "current" task for
both its vruntime and its sum_exec_runtime.
With proxy, however, we have two "current" contexts: the
scheduler context and the execution context. We want to charge
the execution context rq->curr (ie: proxy/lock holder) execution
time to its sum_exec_runtime (so it's clear to userland the
rq->curr task *is* running).
Then instead of charging the execution context (rq->curr) for
the vruntime, we charge the vruntime against the scheduler context
(rq->donor) task, because that is the time it is donating when
it is used as the scheduler-context.
If the donor and curr tasks are the same, then it's the same as
without proxy.
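[Editor's illustration] The accounting split described above can be sketched as a small userspace model (simplified, hypothetical names; load weights, cgroup entities and schedstats are ignored):

```c
#include <assert.h>

/* Minimal stand-in for the sched_entity fields involved. */
struct se {
    long long exec_start;
    long long sum_exec_runtime;
    long long vruntime;
};

/*
 * Model of the split: wall-clock execution time is charged to the
 * execution context (curr), so userland sees the running task as
 * running, while vruntime is charged to the scheduling context
 * (donor), since that is the time it is donating. When curr == donor
 * this degenerates to the usual non-proxy accounting.
 */
static long long charge_exec(struct se *curr, struct se *donor, long long now)
{
    long long delta_exec = now - donor->exec_start;

    if (delta_exec <= 0)
        return delta_exec;

    donor->exec_start = now;
    curr->exec_start = now;
    curr->sum_exec_runtime += delta_exec;   /* execution context */
    donor->vruntime += delta_exec;          /* scheduling context */
    return delta_exec;
}
```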
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kernel-team@android.com
Signed-off-by: John Stultz <jstultz@google.com>
---
v16:
* Renamed update_curr_se to update_se_times, as suggested by
Steven Rostedt.
* Reworded the commit message as suggested by Steven Rostedt
---
kernel/sched/fair.c | 25 ++++++++++++++++++-------
1 file changed, 18 insertions(+), 7 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e43993a4e5807..da8b0970c6655 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1143,22 +1143,33 @@ static void update_tg_load_avg(struct cfs_rq *cfs_rq)
}
#endif /* CONFIG_SMP */
-static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
+static s64 update_se_times(struct rq *rq, struct sched_entity *se)
{
u64 now = rq_clock_task(rq);
s64 delta_exec;
- delta_exec = now - curr->exec_start;
+ delta_exec = now - se->exec_start;
if (unlikely(delta_exec <= 0))
return delta_exec;
- curr->exec_start = now;
- curr->sum_exec_runtime += delta_exec;
+ se->exec_start = now;
+ if (entity_is_task(se)) {
+ struct task_struct *running = rq->curr;
+ /*
+ * If se is a task, we account the time against the running
+ * task, as w/ proxy-exec they may not be the same.
+ */
+ running->se.exec_start = now;
+ running->se.sum_exec_runtime += delta_exec;
+ } else {
+ /* If not task, account the time against se */
+ se->sum_exec_runtime += delta_exec;
+ }
if (schedstat_enabled()) {
struct sched_statistics *stats;
- stats = __schedstats_from_se(curr);
+ stats = __schedstats_from_se(se);
__schedstat_set(stats->exec_max,
max(delta_exec, stats->exec_max));
}
@@ -1213,7 +1224,7 @@ s64 update_curr_common(struct rq *rq)
struct task_struct *donor = rq->donor;
s64 delta_exec;
- delta_exec = update_curr_se(rq, &donor->se);
+ delta_exec = update_se_times(rq, &donor->se);
if (likely(delta_exec > 0))
update_curr_task(donor, delta_exec);
@@ -1233,7 +1244,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
if (unlikely(!curr))
return;
- delta_exec = update_curr_se(rq, curr);
+ delta_exec = update_se_times(rq, curr);
if (unlikely(delta_exec <= 0))
return;
--
2.49.0.604.gff1f9ca942-goog
* [PATCH v16 5/7] sched: Add an initial sketch of the find_proxy_task() function
2025-04-12 6:02 [PATCH v16 0/7] Single RunQueue Proxy Execution (v16) John Stultz
` (3 preceding siblings ...)
2025-04-12 6:02 ` [PATCH v16 4/7] sched: Fix runtime accounting w/ split exec & sched contexts John Stultz
@ 2025-04-12 6:02 ` John Stultz
2025-04-14 9:41 ` Juri Lelli
2025-04-17 11:18 ` Peter Zijlstra
2025-04-12 6:02 ` [PATCH v16 6/7] sched: Fix proxy/current (push,pull)ability John Stultz
2025-04-12 6:02 ` [PATCH v16 7/7] sched: Start blocked_on chain processing in find_proxy_task() John Stultz
6 siblings, 2 replies; 28+ messages in thread
From: John Stultz @ 2025-04-12 6:02 UTC (permalink / raw)
To: LKML
Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kernel-team
Add a find_proxy_task() function which doesn't do much.
When we select a blocked task to run, we will just deactivate it
and pick again. The exception is if it has become unblocked
after find_proxy_task() was called.
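[Editor's illustration] The deactivate-and-pick-again flow can be sketched in userspace (hypothetical simplified types; the real code holds rq and mutex::wait_lock locks and goes through try_to_block_task()):

```c
#include <assert.h>
#include <stddef.h>

struct task {
    int blocked;    /* stand-in for p->blocked_on != NULL */
    int on_rq;
};

/* If the pick is still blocked, deactivate it and ask for a re-pick. */
static struct task *find_proxy_task(struct task *p)
{
    if (!p->blocked)
        return p;       /* became unblocked since the pick: run it */
    p->on_rq = 0;       /* deactivate */
    return NULL;        /* caller does pick_next_task() again */
}

/* Caller-side pick_again loop, as in __schedule() in this patch. */
static struct task *pick_loop(struct task **rq, int nr)
{
    for (int i = 0; i < nr; i++) {
        struct task *next = rq[i];

        if (!next->on_rq)
            continue;
        if (next->blocked) {
            next = find_proxy_task(rq[i]);
            if (!next)
                continue;   /* pick_again */
        }
        return next;
    }
    return NULL;
}
```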
Greatly simplified from patch by:
Peter Zijlstra (Intel) <peterz@infradead.org>
Juri Lelli <juri.lelli@redhat.com>
Valentin Schneider <valentin.schneider@arm.com>
Connor O'Brien <connoro@google.com>
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kernel-team@android.com
[jstultz: Split out from larger proxy patch and simplified
for review and testing.]
Signed-off-by: John Stultz <jstultz@google.com>
---
v5:
* Split out from larger proxy patch
v7:
* Fixed unused function arguments, spelling nits, and tweaks for
clarity, pointed out by Metin Kaya
* Fix build warning Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202311081028.yDLmCWgr-lkp@intel.com/
v8:
* Fixed case where we might return a blocked task from find_proxy_task()
* Continued tweaks to handle avoiding returning blocked tasks
v9:
* Add zap_balance_callbacks helper to unwind balance_callbacks
when we will re-call pick_next_task() again.
* Add extra comment suggested by Metin
* Typo fixes from Metin
* Moved adding proxy_resched_idle earlier in the series, as suggested
by Metin
* Fix to call proxy_resched_idle() *prior* to deactivating next, to avoid
crashes caused by stale references to next
* s/PROXY/SCHED_PROXY_EXEC/ as suggested by Metin
* Number of tweaks and cleanups suggested by Metin
* Simplify proxy_deactivate as suggested by Metin
v11:
* Tweaks for earlier simplification in try_to_deactivate_task
v13:
* Rename "next" to "donor" in find_proxy_task() for clarity
* Similarly use "donor" instead of next in proxy_deactivate
* Refactor/simplify proxy_resched_idle
* Moved up a needed fix from later in the series
v15:
* Tweaked some comments to better explain the initial sketch of
find_proxy_task(), suggested by Qais
* Build fixes for !CONFIG_SMP
* Slight rework for blocked_on_state being added later in the
series.
* Move the zap_balance_callbacks to later in the patch series
v16:
* Move the enqueue_task_rt() out to later in the series, as
suggested by K Prateek Nayak
* Fixup whitespace error pointed out by K Prateek Nayak
* Use put_prev_set_next_task as suggested by K Prateek Nayak
* Try to rework find_proxy_task() locking to use guard and
proxy_deactivate_task() in the way Peter suggested.
---
kernel/sched/core.c | 100 +++++++++++++++++++++++++++++++++++++++++--
kernel/sched/sched.h | 10 ++++-
2 files changed, 106 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 82817650a635b..88acb47f50d0f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6600,7 +6600,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
* Otherwise marks the task's __state as RUNNING
*/
static bool try_to_block_task(struct rq *rq, struct task_struct *p,
- unsigned long task_state)
+ unsigned long task_state, bool deactivate_cond)
{
int flags = DEQUEUE_NOCLOCK;
@@ -6609,6 +6609,9 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
return false;
}
+ if (!deactivate_cond)
+ return false;
+
p->sched_contributes_to_load =
(task_state & TASK_UNINTERRUPTIBLE) &&
!(task_state & TASK_NOLOAD) &&
@@ -6632,6 +6635,90 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
return true;
}
+#ifdef CONFIG_SCHED_PROXY_EXEC
+static inline struct task_struct *proxy_resched_idle(struct rq *rq)
+{
+ put_prev_set_next_task(rq, rq->donor, rq->idle);
+ rq_set_donor(rq, rq->idle);
+ set_tsk_need_resched(rq->idle);
+ return rq->idle;
+}
+
+static bool __proxy_deactivate(struct rq *rq, struct task_struct *donor)
+{
+ unsigned long state = READ_ONCE(donor->__state);
+
+ /* Don't deactivate if the state has been changed to TASK_RUNNING */
+ if (state == TASK_RUNNING)
+ return false;
+ /*
+ * Because we got donor from pick_next_task, it is *crucial*
+ * that we call proxy_resched_idle before we deactivate it.
+ * As once we deactivate donor, donor->on_rq is set to zero,
+ * which allows ttwu to immediately try to wake the task on
+ * another rq. So we cannot use *any* references to donor
+ * after that point. So things like cfs_rq->curr or rq->donor
+ * need to be switched away from donor *before* we deactivate it.
+ */
+ proxy_resched_idle(rq);
+ return try_to_block_task(rq, donor, state, true);
+}
+
+static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *donor)
+{
+ if (!__proxy_deactivate(rq, donor)) {
+ /*
+ * XXX: For now, if deactivation failed, set donor
+ * as unblocked, as we aren't doing proxy-migrations
+ * yet (more logic will be needed then).
+ */
+ donor->blocked_on = NULL;
+ }
+ return NULL;
+}
+
+/*
+ * Initial simple sketch that just deactivates the blocked task
+ * chosen by pick_next_task() so we can then pick something that
+ * isn't blocked.
+ */
+static struct task_struct *
+find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
+{
+ struct task_struct *p = donor;
+ struct mutex *mutex;
+
+ mutex = p->blocked_on;
+ /* Something changed in the chain, so pick again */
+ if (!mutex)
+ return NULL;
+ /*
+ * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
+ * and ensure @owner sticks around.
+ */
+ guard(raw_spinlock)(&mutex->wait_lock);
+
+ /* Check again that p is blocked with wait_lock held */
+ if (!task_is_blocked(p) || mutex != __get_task_blocked_on(p)) {
+ /*
+ * Something changed in the blocked_on chain and
+ * we don't know if only at this level. So, let's
+ * just bail out completely and let __schedule
+ * figure things out (pick_again loop).
+ */
+ return NULL; /* do pick_next_task again */
+ }
+ return proxy_deactivate(rq, donor);
+}
+#else /* SCHED_PROXY_EXEC */
+static struct task_struct *
+find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
+{
+ WARN_ONCE(1, "This should never be called in the !SCHED_PROXY_EXEC case\n");
+ return donor;
+}
+#endif /* SCHED_PROXY_EXEC */
+
/*
* __schedule() is the main scheduler function.
*
@@ -6742,12 +6829,19 @@ static void __sched notrace __schedule(int sched_mode)
goto picked;
}
} else if (!preempt && prev_state) {
- try_to_block_task(rq, prev, prev_state);
+ try_to_block_task(rq, prev, prev_state,
+ !task_is_blocked(prev));
switch_count = &prev->nvcsw;
}
- next = pick_next_task(rq, prev, &rf);
+pick_again:
+ next = pick_next_task(rq, rq->donor, &rf);
rq_set_donor(rq, next);
+ if (unlikely(task_is_blocked(next))) {
+ next = find_proxy_task(rq, next, &rf);
+ if (!next)
+ goto pick_again;
+ }
picked:
clear_tsk_need_resched(prev);
clear_preempt_need_resched();
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 154f0aa0c6322..ea2c987005bc1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2264,6 +2264,14 @@ static inline int task_current_donor(struct rq *rq, struct task_struct *p)
return rq->donor == p;
}
+static inline bool task_is_blocked(struct task_struct *p)
+{
+ if (!sched_proxy_exec())
+ return false;
+
+ return !!p->blocked_on;
+}
+
static inline int task_on_cpu(struct rq *rq, struct task_struct *p)
{
#ifdef CONFIG_SMP
@@ -2473,7 +2481,7 @@ static inline void put_prev_set_next_task(struct rq *rq,
struct task_struct *prev,
struct task_struct *next)
{
- WARN_ON_ONCE(rq->curr != prev);
+ WARN_ON_ONCE(rq->donor != prev);
__put_prev_set_next_dl_server(rq, prev, next);
--
2.49.0.604.gff1f9ca942-goog
* [PATCH v16 6/7] sched: Fix proxy/current (push,pull)ability
2025-04-12 6:02 [PATCH v16 0/7] Single RunQueue Proxy Execution (v16) John Stultz
` (4 preceding siblings ...)
2025-04-12 6:02 ` [PATCH v16 5/7] sched: Add an initial sketch of the find_proxy_task() function John Stultz
@ 2025-04-12 6:02 ` John Stultz
2025-04-14 3:28 ` K Prateek Nayak
2025-04-12 6:02 ` [PATCH v16 7/7] sched: Start blocked_on chain processing in find_proxy_task() John Stultz
6 siblings, 1 reply; 28+ messages in thread
From: John Stultz @ 2025-04-12 6:02 UTC (permalink / raw)
To: LKML
Cc: Valentin Schneider, Joel Fernandes, Qais Yousef, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kernel-team,
Connor O'Brien, John Stultz
From: Valentin Schneider <valentin.schneider@arm.com>
Proxy execution forms atomic pairs of tasks: The waiting donor
task (scheduling context) and a proxy (execution context). The
donor task, along with the rest of the blocked chain, follows
the proxy wrt CPU placement.
They can be the same task, in which case push/pull doesn't need any
modification. When they are different, however,
FIFO1 & FIFO42:

              ,-> RT42
              |    |  blocked-on
              |    v
blocked_donor |  mutex
              |    |  owner
              |    v
              `-- RT1

        RT1
        RT42

    CPU0            CPU1
     ^                ^
     |                |
 overloaded        !overloaded
 rq prio = 42      rq prio = 0
RT1 is eligible to be pushed to CPU1, but should that happen it will
"carry" RT42 along. Clearly, neither RT1 nor RT42 should be seen as
push/pullable.
Unfortunately, only the donor task is usually dequeued from the rq,
and the proxy'ed execution context (rq->curr) remains on the rq.
This can cause RT1 to be selected for migration from logic like the
rt pushable_list.
Thus, add a dequeue/enqueue cycle on the proxy task before __schedule
returns, which allows the sched class logic to avoid adding the now
current task to the pushable_list.
Furthermore, tasks becoming blocked on a mutex don't need an explicit
dequeue/enqueue cycle to be made (push/pull)able: they have to be running
to block on a mutex, thus they will eventually hit put_prev_task().
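[Editor's illustration] The dequeue/enqueue cycle can be modeled with a minimal sketch (hypothetical flags standing in for task_current()/blocked_on and the pushable list; not the kernel data structures):

```c
#include <assert.h>

struct task {
    int blocked;      /* p->blocked_on != NULL */
    int is_current;   /* task_current(rq, p) */
    int pushable;     /* on the class's pushable list */
};

/* Enqueue path: blocked or current tasks are never marked pushable,
 * mirroring the enqueue_task_rt()/enqueue_task_dl() hunks above. */
static void enqueue(struct task *p)
{
    if (p->blocked)
        return;
    if (!p->is_current)
        p->pushable = 1;
}

/* proxy_tag_curr(): a dequeue/enqueue cycle on the new rq->curr, so
 * the sched class logic re-evaluates pushability now that the proxy
 * is the running task. */
static void proxy_tag_curr(struct task *owner)
{
    owner->pushable = 0;   /* dequeue side: off the pushable list */
    enqueue(owner);        /* enqueue side re-checks the conditions */
}
```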
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kernel-team@android.com
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Connor O'Brien <connoro@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
---
v3:
* Tweaked comments & commit message
v5:
* Minor simplifications to utilize the fix earlier
in the patch series.
* Rework the wording of the commit message to match selected/
proxy terminology and expand a bit to make it more clear how
it works.
v6:
* Dropped now-unused proxied value, to be re-added later in the
series when it is used, as caught by Dietmar
v7:
* Unused function argument fixup
* Commit message nit pointed out by Metin Kaya
* Dropped unproven unlikely() and use sched_proxy_exec()
in proxy_tag_curr, suggested by Metin Kaya
v8:
* More cleanups and typo fixes suggested by Metin Kaya
v11:
* Cleanup of commit message suggested by Metin
v12:
* Rework for rq_selected -> rq->donor renaming
v16:
* Pulled logic in from a later patch to avoid sched_balance
migrating blocked tasks.
* Moved enqueue_task_rt() logic from earlier into this patch
as suggested by K Prateek Nayak
* Simplified changes to enqueue_task_rt to match deadline's
logic, as pointed out by Peter
---
kernel/sched/core.c | 25 +++++++++++++++++++++++++
kernel/sched/deadline.c | 3 +++
kernel/sched/rt.c | 5 +++++
3 files changed, 33 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 88acb47f50d0f..33f0260c20609 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6719,6 +6719,23 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
}
#endif /* SCHED_PROXY_EXEC */
+static inline void proxy_tag_curr(struct rq *rq, struct task_struct *owner)
+{
+ if (!sched_proxy_exec())
+ return;
+ /*
+ * pick_next_task() calls set_next_task() on the chosen task
+ * at some point, which ensures it is not push/pullable.
+ * However, the chosen/donor task *and* the mutex owner form an
+ * atomic pair wrt push/pull.
+ *
+ * Make sure owner we run is not pushable. Unfortunately we can
+ * only deal with that by means of a dequeue/enqueue cycle. :-/
+ */
+ dequeue_task(rq, owner, DEQUEUE_NOCLOCK | DEQUEUE_SAVE);
+ enqueue_task(rq, owner, ENQUEUE_NOCLOCK | ENQUEUE_RESTORE);
+}
+
/*
* __schedule() is the main scheduler function.
*
@@ -6855,6 +6872,10 @@ static void __sched notrace __schedule(int sched_mode)
* changes to task_struct made by pick_next_task().
*/
RCU_INIT_POINTER(rq->curr, next);
+
+ if (!task_current_donor(rq, next))
+ proxy_tag_curr(rq, next);
+
/*
* The membarrier system call requires each architecture
* to have a full memory barrier after updating
@@ -6889,6 +6910,10 @@ static void __sched notrace __schedule(int sched_mode)
/* Also unlocks the rq: */
rq = context_switch(rq, prev, next, &rf);
} else {
+ /* In case next was already curr but just got blocked_donor */
+ if (!task_current_donor(rq, next))
+ proxy_tag_curr(rq, next);
+
rq_unpin_lock(rq, &rf);
__balance_callbacks(rq);
raw_spin_rq_unlock_irq(rq);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index ad45a8fea245e..eb07c3a1b8fa4 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2166,6 +2166,9 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
if (dl_server(&p->dl))
return;
+ if (task_is_blocked(p))
+ return;
+
if (!task_current(rq, p) && !p->dl.dl_throttled && p->nr_cpus_allowed > 1)
enqueue_pushable_dl_task(rq, p);
}
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index fa03ec3ed56a2..87ccd5d5375a3 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1477,6 +1477,9 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
enqueue_rt_entity(rt_se, flags);
+ if (task_is_blocked(p))
+ return;
+
if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
enqueue_pushable_task(rq, p);
}
@@ -1757,6 +1760,8 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p, struct task_s
update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 1);
+ if (task_is_blocked(p))
+ return;
/*
* The previous task needs to be made eligible for pushing
* if it is still active
--
2.49.0.604.gff1f9ca942-goog
* [PATCH v16 7/7] sched: Start blocked_on chain processing in find_proxy_task()
2025-04-12 6:02 [PATCH v16 0/7] Single RunQueue Proxy Execution (v16) John Stultz
` (5 preceding siblings ...)
2025-04-12 6:02 ` [PATCH v16 6/7] sched: Fix proxy/current (push,pull)ability John Stultz
@ 2025-04-12 6:02 ` John Stultz
6 siblings, 0 replies; 28+ messages in thread
From: John Stultz @ 2025-04-12 6:02 UTC (permalink / raw)
To: LKML
Cc: Peter Zijlstra, Joel Fernandes, Qais Yousef, Ingo Molnar,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
Daniel Lezcano, Suleiman Souhlal, kernel-team, Valentin Schneider,
Connor O'Brien, John Stultz
From: Peter Zijlstra <peterz@infradead.org>
Start to flesh out the real find_proxy_task() implementation,
but avoid the migration cases for now, in those cases just
deactivate the donor task and pick again.
To ensure the donor task or other blocked tasks in the chain
aren't migrated away while we're running the proxy, also tweak
the fair class logic to avoid migrating donor or mutex blocked
tasks.
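[Editor's illustration] The blocked-on relation being followed here can be sketched in userspace (a lock-free simplification; the real find_proxy_task() takes mutex->wait_lock at each step and bails out to re-pick if the chain mutated underneath it, and handles cross-CPU owners separately):

```c
#include <assert.h>
#include <stddef.h>

struct mutex;
struct task {
    struct mutex *blocked_on;
};
struct mutex {
    struct task *owner;
};

/*
 * Follow task->blocked_on -> mutex->owner -> task... until a task
 * with no blocked_on is found; that runnable owner becomes the
 * execution context run on behalf of the donor.
 */
static struct task *walk_blocked_chain(struct task *donor)
{
    struct task *p = donor;

    while (p->blocked_on) {
        struct task *owner = p->blocked_on->owner;

        if (!owner)
            return p;   /* lock was released: p can acquire and run */
        p = owner;
    }
    return p;           /* runnable owner: use as execution context */
}
```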
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kernel-team@android.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Connor O'Brien <connoro@google.com>
[jstultz: This change was split out from the larger proxy patch]
Signed-off-by: John Stultz <jstultz@google.com>
---
v5:
* Split this out from larger proxy patch
v7:
* Minor refactoring of core find_proxy_task() function
* Minor spelling and corrections suggested by Metin Kaya
* Dropped an added BUG_ON that was frequently tripped
v8:
* Fix issue if proxy_deactivate fails, we don't leave task
BO_BLOCKED
* Switch to WARN_ON from BUG_ON checks
v9:
* Improve comments suggested by Metin
* Minor cleanups
v11:
* Previously we checked next==rq->idle && prev==rq->idle, but I
think we only really care if next==rq->idle from find_proxy_task,
as we will still want to resched regardless of what prev was.
v12:
* Commit message rework for selected -> donor rewording
v13:
* Address new delayed dequeue condition (deactivate donor for now)
* Next to donor renaming in find_proxy_task
* Improved comments for find_proxy_task
* Rework for proxy_deactivate cleanup
v14:
* Fix build error from __mutex_owner() with CONFIG_PREEMPT_RT
v15:
* Reworks for moving blocked_on_state to later in the series
v16:
* Pull down fix from later in the series where a deactivated
task could pass the (task_cpu(owner) == this_cpu) check
then have it be activated on a different cpu, so it passes
the on_rq check. Thus double check the values in the opposite
order to make sure nothing slips by.
* Add resched_idle label to simplify common exit path
* Get rid of preserve_need_resched flag and rework per Peter's
suggestion
* Rework find_proxy_task() to use guard to cleanup the exit gotos
as Peter suggested.
---
kernel/locking/mutex.h | 3 +-
kernel/sched/core.c | 146 ++++++++++++++++++++++++++++++++++-------
kernel/sched/fair.c | 10 ++-
3 files changed, 134 insertions(+), 25 deletions(-)
diff --git a/kernel/locking/mutex.h b/kernel/locking/mutex.h
index cbff35b9b7ae3..2e8080a9bee37 100644
--- a/kernel/locking/mutex.h
+++ b/kernel/locking/mutex.h
@@ -6,7 +6,7 @@
*
* Copyright (C) 2004, 2005, 2006 Red Hat, Inc., Ingo Molnar <mingo@redhat.com>
*/
-
+#ifndef CONFIG_PREEMPT_RT
/*
* This is the control structure for tasks blocked on mutex, which resides
* on the blocked task's kernel stack:
@@ -70,3 +70,4 @@ extern void debug_mutex_init(struct mutex *lock, const char *name,
# define debug_mutex_unlock(lock) do { } while (0)
# define debug_mutex_init(lock, name, key) do { } while (0)
#endif /* !CONFIG_DEBUG_MUTEXES */
+#endif /* CONFIG_PREEMPT_RT */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 33f0260c20609..c58980028fb5f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -95,6 +95,7 @@
#include "../workqueue_internal.h"
#include "../../io_uring/io-wq.h"
#include "../smpboot.h"
+#include "../locking/mutex.h"
EXPORT_TRACEPOINT_SYMBOL_GPL(ipi_send_cpu);
EXPORT_TRACEPOINT_SYMBOL_GPL(ipi_send_cpumask);
@@ -2955,8 +2956,15 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
struct set_affinity_pending my_pending = { }, *pending = NULL;
bool stop_pending, complete = false;
- /* Can the task run on the task's current CPU? If so, we're done */
- if (cpumask_test_cpu(task_cpu(p), &p->cpus_mask)) {
+ /*
+ * Can the task run on the task's current CPU? If so, we're done
+ *
+ * We are also done if the task is the current donor, boosting a lock-
+ * holding proxy, (and potentially has been migrated outside its
+ * current or previous affinity mask)
+ */
+ if (cpumask_test_cpu(task_cpu(p), &p->cpus_mask) ||
+ (task_current_donor(rq, p) && !task_current(rq, p))) {
struct task_struct *push_task = NULL;
if ((flags & SCA_MIGRATE_ENABLE) &&
@@ -6678,37 +6686,126 @@ static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *d
}
/*
- * Initial simple sketch that just deactivates the blocked task
- * chosen by pick_next_task() so we can then pick something that
- * isn't blocked.
+ * Find runnable lock owner to proxy for mutex blocked donor
+ *
+ * Follow the blocked-on relation:
+ * task->blocked_on -> mutex->owner -> task...
+ *
+ * Lock order:
+ *
+ * p->pi_lock
+ * rq->lock
+ * mutex->wait_lock
+ *
+ * Returns the task that is going to be used as execution context (the one
+ * that is actually going to be run on cpu_of(rq)).
*/
static struct task_struct *
find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
{
- struct task_struct *p = donor;
+ struct task_struct *owner = NULL;
+ int this_cpu = cpu_of(rq);
+ struct task_struct *p;
struct mutex *mutex;
- mutex = p->blocked_on;
- /* Something changed in the chain, so pick again */
- if (!mutex)
- return NULL;
- /*
- * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
- * and ensure @owner sticks around.
- */
- guard(raw_spinlock)(&mutex->wait_lock);
+ /* Follow blocked_on chain. */
+ for (p = donor; task_is_blocked(p); p = owner) {
+ mutex = p->blocked_on;
+ /* Something changed in the chain, so pick again */
+ if (!mutex)
+ return NULL;
+ /*
+ * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
+ * and ensure @owner sticks around.
+ */
+ guard(raw_spinlock)(&mutex->wait_lock);
+
+ /* Check again that p is blocked with wait_lock held */
+ if (mutex != __get_task_blocked_on(p)) {
+ /*
+ * Something changed in the blocked_on chain and
+ * we don't know if only at this level. So, let's
+ * just bail out completely and let __schedule
+ * figure things out (pick_again loop).
+ */
+ return NULL;
+ }
+
+ owner = __mutex_owner(mutex);
+ if (!owner) {
+ __clear_task_blocked_on(p, mutex);
+ return p;
+ }
+
+ if (task_cpu(owner) != this_cpu) {
+ /* XXX Don't handle migrations yet */
+ return proxy_deactivate(rq, donor);
+ }
+
+ if (task_on_rq_migrating(owner)) {
+ /*
+ * One of the chain of mutex owners is currently migrating to this
+ * CPU, but has not yet been enqueued because we are holding the
+ * rq lock. As a simple solution, just schedule rq->idle to give
+ * the migration a chance to complete. Much like the migrate_task
+ * case we should end up back in find_proxy_task(), this time
+ * hopefully with all relevant tasks already enqueued.
+ */
+ return proxy_resched_idle(rq);
+ }
+
+ if (!owner->on_rq) {
+ /* XXX Don't handle blocked owners yet */
+ return proxy_deactivate(rq, donor);
+ }
+
+ if (owner->se.sched_delayed) {
+ /* XXX Don't handle delayed dequeue yet */
+ return proxy_deactivate(rq, donor);
+ }
- /* Check again that p is blocked with blocked_lock held */
- if (!task_is_blocked(p) || mutex != __get_task_blocked_on(p)) {
/*
- * Something changed in the blocked_on chain and
- * we don't know if only at this level. So, let's
- * just bail out completely and let __schedule
- * figure things out (pick_again loop).
+ * If owner was !on_rq, the task_cpu() check followed by the on_rq check
+ * could race with a wakeup onto another CPU right in between those checks.
+ * So double-check that owner is both on_rq and on this CPU.
+ */
+ if (!(task_on_rq_queued(owner) && task_cpu(owner) == this_cpu))
+ return NULL;
+
+ if (owner == p) {
+ /*
+ * It's possible we interleave with mutex_unlock like:
+ *
+ * lock(&rq->lock);
+ * find_proxy_task()
+ * mutex_unlock()
+ * lock(&wait_lock);
+ * donor(owner) = current->blocked_donor;
+ * unlock(&wait_lock);
+ *
+ * wake_up_q();
+ * ...
+ * ttwu_runnable()
+ * __task_rq_lock()
+ * lock(&wait_lock);
+ * owner == p
+ *
+ * Which leaves us to finish the ttwu_runnable() and make it go.
+ *
+ * So schedule rq->idle so that ttwu_runnable can get the rq lock
+ * and mark owner as running.
+ */
+ return proxy_resched_idle(rq);
+ }
+ /*
+ * OK, now we're absolutely sure @owner is on this
+ * rq, therefore holding @rq->lock is sufficient to
+ * guarantee its existence, as per ttwu_remote().
*/
- return NULL; /* do pick_next_task again */
}
- return proxy_deactivate(rq, donor);
+
+ WARN_ON_ONCE(owner && !owner->on_rq);
+ return owner;
}
#else /* SCHED_PROXY_EXEC */
static struct task_struct *
@@ -6858,10 +6955,13 @@ static void __sched notrace __schedule(int sched_mode)
next = find_proxy_task(rq, next, &rf);
if (!next)
goto pick_again;
+ if (next == rq->idle)
+ goto keep_resched;
}
picked:
clear_tsk_need_resched(prev);
clear_preempt_need_resched();
+keep_resched:
rq->last_seen_need_resched_ns = 0;
is_switch = prev != next;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da8b0970c6655..b67c3b44c7b4d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9407,6 +9407,7 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
* 3) cannot be migrated to this CPU due to cpus_ptr, or
* 4) running (obviously), or
* 5) are cache-hot on their current CPU.
+ * 6) are blocked on mutexes (if SCHED_PROXY_EXEC is enabled)
*/
if ((p->se.sched_delayed) && (env->migration_type != migrate_load))
return 0;
@@ -9428,6 +9429,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
if (kthread_is_per_cpu(p))
return 0;
+ if (task_is_blocked(p))
+ return 0;
+
if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
int cpu;
@@ -9463,7 +9467,8 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
/* Record that we found at least one task that could run on dst_cpu */
env->flags &= ~LBF_ALL_PINNED;
- if (task_on_cpu(env->src_rq, p)) {
+ if (task_on_cpu(env->src_rq, p) ||
+ task_current_donor(env->src_rq, p)) {
schedstat_inc(p->stats.nr_failed_migrations_running);
return 0;
}
@@ -9507,6 +9512,9 @@ static void detach_task(struct task_struct *p, struct lb_env *env)
schedstat_inc(p->stats.nr_forced_migrations);
}
+ WARN_ON(task_current(env->src_rq, p));
+ WARN_ON(task_current_donor(env->src_rq, p));
+
deactivate_task(env->src_rq, p, DEQUEUE_NOCLOCK);
set_task_cpu(p, env->dst_cpu);
}
--
2.49.0.604.gff1f9ca942-goog
* Re: [PATCH v16 6/7] sched: Fix proxy/current (push,pull)ability
2025-04-12 6:02 ` [PATCH v16 6/7] sched: Fix proxy/current (push,pull)ability John Stultz
@ 2025-04-14 3:28 ` K Prateek Nayak
2025-04-16 21:18 ` John Stultz
0 siblings, 1 reply; 28+ messages in thread
From: K Prateek Nayak @ 2025-04-14 3:28 UTC (permalink / raw)
To: John Stultz, LKML
Cc: Valentin Schneider, Joel Fernandes, Qais Yousef, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner,
Daniel Lezcano, Suleiman Souhlal, kernel-team, Connor O'Brien
Hello John,
On 4/12/2025 11:32 AM, John Stultz wrote:
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index ad45a8fea245e..eb07c3a1b8fa4 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -2166,6 +2166,9 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
> if (dl_server(&p->dl))
> return;
>
> + if (task_is_blocked(p))
> + return;
> +
> if (!task_current(rq, p) && !p->dl.dl_throttled && p->nr_cpus_allowed > 1)
> enqueue_pushable_dl_task(rq, p);
> }
Do we need an early return in put_prev_task_dl() similar to the one in
put_prev_task_rt()?
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index fa03ec3ed56a2..87ccd5d5375a3 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1477,6 +1477,9 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
>
> enqueue_rt_entity(rt_se, flags);
>
> + if (task_is_blocked(p))
> + return;
> +
> if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
> enqueue_pushable_task(rq, p);
> }
> @@ -1757,6 +1760,8 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p, struct task_s
>
> update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 1);
>
> + if (task_is_blocked(p))
> + return;
> /*
> * The previous task needs to be made eligible for pushing
> * if it is still active
--
Thanks and Regards,
Prateek
* Re: [PATCH v16 1/7] sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable
2025-04-12 6:02 ` [PATCH v16 1/7] sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable John Stultz
@ 2025-04-14 8:50 ` Juri Lelli
2025-04-16 21:24 ` John Stultz
0 siblings, 1 reply; 28+ messages in thread
From: Juri Lelli @ 2025-04-14 8:50 UTC (permalink / raw)
To: John Stultz
Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
Daniel Lezcano, Suleiman Souhlal, kernel-team
Hi John,
On 11/04/25 23:02, John Stultz wrote:
> Add a CONFIG_SCHED_PROXY_EXEC option, along with a boot argument
> sched_proxy_exec= that can be used to disable the feature at boot
> time if CONFIG_SCHED_PROXY_EXEC was enabled.
>
> Cc: Joel Fernandes <joelagnelf@nvidia.com>
> Cc: Qais Yousef <qyousef@layalina.io>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Valentin Schneider <vschneid@redhat.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Zimuzo Ezeozue <zezeozue@google.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Will Deacon <will@kernel.org>
> Cc: Waiman Long <longman@redhat.com>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Metin Kaya <Metin.Kaya@arm.com>
> Cc: Xuewen Yan <xuewen.yan94@gmail.com>
> Cc: K Prateek Nayak <kprateek.nayak@amd.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> Cc: Suleiman Souhlal <suleiman@google.com>
> Cc: kernel-team@android.com
> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Signed-off-by: John Stultz <jstultz@google.com>
> ---
...
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 47972f34ea701..154f0aa0c6322 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1149,10 +1149,15 @@ struct rq {
> */
> unsigned int nr_uninterruptible;
>
> +#ifdef CONFIG_SCHED_PROXY_EXEC
> + struct task_struct __rcu *donor; /* Scheduling context */
> + struct task_struct __rcu *curr; /* Execution context */
> +#else
> union {
> struct task_struct __rcu *donor; /* Scheduler context */
> struct task_struct __rcu *curr; /* Execution context */
> };
> +#endif
> struct sched_dl_entity *dl_server;
> struct task_struct *idle;
> struct task_struct *stop;
> @@ -1347,10 +1352,17 @@ DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
> #define cpu_curr(cpu) (cpu_rq(cpu)->curr)
> #define raw_rq() raw_cpu_ptr(&runqueues)
>
> +#ifdef CONFIG_SCHED_PROXY_EXEC
> +static inline void rq_set_donor(struct rq *rq, struct task_struct *t)
> +{
> + rcu_assign_pointer(rq->donor, t);
> +}
> +#else
> static inline void rq_set_donor(struct rq *rq, struct task_struct *t)
> {
> /* Do nothing */
> }
> +#endif
Does this chunk belong here?
Thanks,
Juri
* Re: [PATCH v16 2/7] locking/mutex: Rework task_struct::blocked_on
2025-04-12 6:02 ` [PATCH v16 2/7] locking/mutex: Rework task_struct::blocked_on John Stultz
@ 2025-04-14 8:59 ` Juri Lelli
2025-04-16 21:28 ` John Stultz
0 siblings, 1 reply; 28+ messages in thread
From: Juri Lelli @ 2025-04-14 8:59 UTC (permalink / raw)
To: John Stultz
Cc: LKML, Peter Zijlstra, Joel Fernandes, Qais Yousef, Ingo Molnar,
Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
Daniel Lezcano, Suleiman Souhlal, kernel-team, Connor O'Brien
Hi John,
On 11/04/25 23:02, John Stultz wrote:
> From: Peter Zijlstra <peterz@infradead.org>
>
> Track the blocked-on relation for mutexes, to allow following this
> relation at schedule time.
>
> task
> | blocked-on
> v
> mutex
> | owner
> v
> task
>
> This all will be used for tracking blocked-task/mutex chains
> with the proxy-execution patch in a similar fashion to how
> priority inheritance is done with rt_mutexes.
>
> For serialization, blocked-on is only set by the task itself
> (current). And both when setting or clearing (potentially by
> others), is done while holding the mutex::wait_lock.
>
> Cc: Joel Fernandes <joelagnelf@nvidia.com>
> Cc: Qais Yousef <qyousef@layalina.io>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Valentin Schneider <vschneid@redhat.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Zimuzo Ezeozue <zezeozue@google.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Will Deacon <will@kernel.org>
> Cc: Waiman Long <longman@redhat.com>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Metin Kaya <Metin.Kaya@arm.com>
> Cc: Xuewen Yan <xuewen.yan94@gmail.com>
> Cc: K Prateek Nayak <kprateek.nayak@amd.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> Cc: Suleiman Souhlal <suleiman@google.com>
> Cc: kernel-team@android.com
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> [minor changes while rebasing]
> Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Connor O'Brien <connoro@google.com>
> [jstultz: Fix blocked_on tracking in __mutex_lock_common in error paths]
> Signed-off-by: John Stultz <jstultz@google.com>
> ---
...
> @@ -940,6 +954,14 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
> next = waiter->task;
>
> debug_mutex_wake_waiter(lock, waiter);
> + /*
> + * Unlock wakeups can be happening in parallel
> + * (when optimistic spinners steal and release
> + * the lock), so blocked_on may already be
> + * cleared here.
> + */
> + WARN_ON(next->blocked_on && next->blocked_on != lock);
> + next->blocked_on = NULL;
Here and below, why the WARN_ON() if blocked_on having been cleared
already is an OK situation? Ah, maybe it's catching the more worrying
situation where the lock has changed since the task blocked?
> wake_q_add(&wake_q, next);
> }
>
> diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
> index 37f025a096c9d..00db40946328e 100644
> --- a/kernel/locking/ww_mutex.h
> +++ b/kernel/locking/ww_mutex.h
> @@ -284,6 +284,14 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
> #ifndef WW_RT
> debug_mutex_wake_waiter(lock, waiter);
> #endif
> + /*
> + * When waking up the task to die, be sure to clear the
> + * blocked_on pointer. Otherwise we can see circular
> + * blocked_on relationships that can't resolve.
> + */
> + WARN_ON(waiter->task->blocked_on &&
> + waiter->task->blocked_on != lock);
> + waiter->task->blocked_on = NULL;
> wake_q_add(wake_q, waiter->task);
> }
Thanks,
Juri
* Re: [PATCH v16 3/7] locking/mutex: Add p->blocked_on wrappers for correctness checks
2025-04-12 6:02 ` [PATCH v16 3/7] locking/mutex: Add p->blocked_on wrappers for correctness checks John Stultz
@ 2025-04-14 9:09 ` Juri Lelli
2025-04-16 22:44 ` John Stultz
0 siblings, 1 reply; 28+ messages in thread
From: Juri Lelli @ 2025-04-14 9:09 UTC (permalink / raw)
To: John Stultz
Cc: LKML, Valentin Schneider, Joel Fernandes, Qais Yousef,
Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kernel-team,
Connor O'Brien
Hi John,
On 11/04/25 23:02, John Stultz wrote:
> From: Valentin Schneider <valentin.schneider@arm.com>
>
> This lets us assert mutex::wait_lock is held whenever we access
> p->blocked_on, as well as warn us for unexpected state changes.
>
> Cc: Joel Fernandes <joelagnelf@nvidia.com>
> Cc: Qais Yousef <qyousef@layalina.io>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Valentin Schneider <vschneid@redhat.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Zimuzo Ezeozue <zezeozue@google.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Will Deacon <will@kernel.org>
> Cc: Waiman Long <longman@redhat.com>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Metin Kaya <Metin.Kaya@arm.com>
> Cc: Xuewen Yan <xuewen.yan94@gmail.com>
> Cc: K Prateek Nayak <kprateek.nayak@amd.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> Cc: Suleiman Souhlal <suleiman@google.com>
> Cc: kernel-team@android.com
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> [fix conflicts, call in more places]
> Signed-off-by: Connor O'Brien <connoro@google.com>
> [jstultz: tweaked commit subject, reworked a good bit]
> Signed-off-by: John Stultz <jstultz@google.com>
> ---
...
> +static inline struct mutex *__get_task_blocked_on(struct task_struct *p)
> +{
> + return READ_ONCE(p->blocked_on);
> +}
> +
Shouldn't we check that wait_lock is held also when reading blocked_on?
And, if that's the case, why do we use READ_ONCE?
Thanks,
Juri
* Re: [PATCH v16 4/7] sched: Fix runtime accounting w/ split exec & sched contexts
2025-04-12 6:02 ` [PATCH v16 4/7] sched: Fix runtime accounting w/ split exec & sched contexts John Stultz
@ 2025-04-14 9:28 ` Juri Lelli
2025-04-16 23:30 ` John Stultz
2025-04-17 11:12 ` Peter Zijlstra
1 sibling, 1 reply; 28+ messages in thread
From: Juri Lelli @ 2025-04-14 9:28 UTC (permalink / raw)
To: John Stultz
Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
Daniel Lezcano, Suleiman Souhlal, kernel-team
Hi John,
On 11/04/25 23:02, John Stultz wrote:
> Without proxy-exec, we normally charge the "current" task for
> both its vruntime as well as its sum_exec_runtime.
>
> With proxy, however, we have two "current" contexts: the
> scheduler context and the execution context. We want to charge
> the execution context rq->curr (ie: proxy/lock holder) execution
> time to its sum_exec_runtime (so it's clear to userland the
> rq->curr task *is* running).
>
> Then instead of charging the execution context (rq->curr) for
> the vruntime, we charge the vruntime against the scheduler context
> (rq->donor) task, because that is the time it is donating when
> it is used as the scheduler-context.
>
> If the donor and curr tasks are the same, then it's the same as
> without proxy.
>
> Cc: Joel Fernandes <joelagnelf@nvidia.com>
> Cc: Qais Yousef <qyousef@layalina.io>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Valentin Schneider <vschneid@redhat.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Zimuzo Ezeozue <zezeozue@google.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Will Deacon <will@kernel.org>
> Cc: Waiman Long <longman@redhat.com>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Metin Kaya <Metin.Kaya@arm.com>
> Cc: Xuewen Yan <xuewen.yan94@gmail.com>
> Cc: K Prateek Nayak <kprateek.nayak@amd.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> Cc: Suleiman Souhlal <suleiman@google.com>
> Cc: kernel-team@android.com
> Signed-off-by: John Stultz <jstultz@google.com>
> ---
...
> +static s64 update_se_times(struct rq *rq, struct sched_entity *se)
> {
> u64 now = rq_clock_task(rq);
> s64 delta_exec;
>
> - delta_exec = now - curr->exec_start;
> + delta_exec = now - se->exec_start;
> if (unlikely(delta_exec <= 0))
> return delta_exec;
>
> - curr->exec_start = now;
> - curr->sum_exec_runtime += delta_exec;
> + se->exec_start = now;
> + if (entity_is_task(se)) {
> + struct task_struct *running = rq->curr;
> + /*
> + * If se is a task, we account the time against the running
> + * task, as w/ proxy-exec they may not be the same.
> + */
> + running->se.exec_start = now;
> + running->se.sum_exec_runtime += delta_exec;
> + } else {
> + /* If not task, account the time against se */
> + se->sum_exec_runtime += delta_exec;
> + }
...
> @@ -1213,7 +1224,7 @@ s64 update_curr_common(struct rq *rq)
> struct task_struct *donor = rq->donor;
> s64 delta_exec;
>
> - delta_exec = update_curr_se(rq, &donor->se);
> + delta_exec = update_se_times(rq, &donor->se);
> if (likely(delta_exec > 0))
> update_curr_task(donor, delta_exec);
Considering that we calculate delta_exec in update_se_times using
exec_start of the sched_entity passed as argument, is it correct to use
donor in the above?
Thanks,
Juri
* Re: [PATCH v16 5/7] sched: Add an initial sketch of the find_proxy_task() function
2025-04-12 6:02 ` [PATCH v16 5/7] sched: Add an initial sketch of the find_proxy_task() function John Stultz
@ 2025-04-14 9:41 ` Juri Lelli
2025-04-16 22:55 ` John Stultz
2025-04-17 11:18 ` Peter Zijlstra
1 sibling, 1 reply; 28+ messages in thread
From: Juri Lelli @ 2025-04-14 9:41 UTC (permalink / raw)
To: John Stultz
Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
Daniel Lezcano, Suleiman Souhlal, kernel-team
Hi John,
On 11/04/25 23:02, John Stultz wrote:
> Add a find_proxy_task() function which doesn't do much.
>
> When we select a blocked task to run, we will just deactivate it
> and pick again. The exception being if it has become unblocked
> after find_proxy_task() was called.
>
> Greatly simplified from patch by:
> Peter Zijlstra (Intel) <peterz@infradead.org>
> Juri Lelli <juri.lelli@redhat.com>
> Valentin Schneider <valentin.schneider@arm.com>
> Connor O'Brien <connoro@google.com>
>
> Cc: Joel Fernandes <joelagnelf@nvidia.com>
> Cc: Qais Yousef <qyousef@layalina.io>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Valentin Schneider <vschneid@redhat.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Zimuzo Ezeozue <zezeozue@google.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Will Deacon <will@kernel.org>
> Cc: Waiman Long <longman@redhat.com>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Metin Kaya <Metin.Kaya@arm.com>
> Cc: Xuewen Yan <xuewen.yan94@gmail.com>
> Cc: K Prateek Nayak <kprateek.nayak@amd.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> Cc: Suleiman Souhlal <suleiman@google.com>
> Cc: kernel-team@android.com
> [jstultz: Split out from larger proxy patch and simplified
> for review and testing.]
> Signed-off-by: John Stultz <jstultz@google.com>
> ---
...
> +static struct task_struct *
> +find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
> +{
> + struct task_struct *p = donor;
> + struct mutex *mutex;
> +
> + mutex = p->blocked_on;
> + /* Something changed in the chain, so pick again */
> + if (!mutex)
> + return NULL;
> + /*
> + * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
> + * and ensure @owner sticks around.
> + */
> + guard(raw_spinlock)(&mutex->wait_lock);
> +
> + /* Check again that p is blocked with blocked_lock held */
> + if (!task_is_blocked(p) || mutex != __get_task_blocked_on(p)) {
> + /*
> + * Something changed in the blocked_on chain and
> + * we don't know if only at this level. So, let's
> + * just bail out completely and let __schedule
> + * figure things out (pick_again loop).
> + */
> + return NULL; /* do pick_next_task again */
> + }
> + return proxy_deactivate(rq, donor);
Nitpick. We might either want to use donor (and remove p) or use p
everywhere in the function; I think that makes it clearer.
Best,
Juri
* Re: [PATCH v16 6/7] sched: Fix proxy/current (push,pull)ability
2025-04-14 3:28 ` K Prateek Nayak
@ 2025-04-16 21:18 ` John Stultz
0 siblings, 0 replies; 28+ messages in thread
From: John Stultz @ 2025-04-16 21:18 UTC (permalink / raw)
To: K Prateek Nayak
Cc: LKML, Valentin Schneider, Joel Fernandes, Qais Yousef,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall,
Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner,
Daniel Lezcano, Suleiman Souhlal, kernel-team, Connor O'Brien
On Sun, Apr 13, 2025 at 8:29 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello John,
>
> On 4/12/2025 11:32 AM, John Stultz wrote:
> > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> > index ad45a8fea245e..eb07c3a1b8fa4 100644
> > --- a/kernel/sched/deadline.c
> > +++ b/kernel/sched/deadline.c
> > @@ -2166,6 +2166,9 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
> > if (dl_server(&p->dl))
> > return;
> >
> > + if (task_is_blocked(p))
> > + return;
> > +
> > if (!task_current(rq, p) && !p->dl.dl_throttled && p->nr_cpus_allowed > 1)
> > enqueue_pushable_dl_task(rq, p);
> > }
>
> Do we need an early return in put_prev_task_dl() similar to the one in
> put_prev_task_rt()?
Indeed, that does look like something I've overlooked (and suggests I
need to do more deadline testing). Thank you so much for pointing this
out!
-john
* Re: [PATCH v16 1/7] sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable
2025-04-14 8:50 ` Juri Lelli
@ 2025-04-16 21:24 ` John Stultz
2025-04-29 9:16 ` Juri Lelli
0 siblings, 1 reply; 28+ messages in thread
From: John Stultz @ 2025-04-16 21:24 UTC (permalink / raw)
To: Juri Lelli
Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
Daniel Lezcano, Suleiman Souhlal, kernel-team
On Mon, Apr 14, 2025 at 1:50 AM Juri Lelli <juri.lelli@redhat.com> wrote:
> On 11/04/25 23:02, John Stultz wrote:
> > Add a CONFIG_SCHED_PROXY_EXEC option, along with a boot argument
> > sched_proxy_exec= that can be used to disable the feature at boot
> > time if CONFIG_SCHED_PROXY_EXEC was enabled.
> >
> > Cc: Joel Fernandes <joelagnelf@nvidia.com>
> > Cc: Qais Yousef <qyousef@layalina.io>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Juri Lelli <juri.lelli@redhat.com>
> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Cc: Valentin Schneider <vschneid@redhat.com>
> > Cc: Steven Rostedt <rostedt@goodmis.org>
> > Cc: Ben Segall <bsegall@google.com>
> > Cc: Zimuzo Ezeozue <zezeozue@google.com>
> > Cc: Mel Gorman <mgorman@suse.de>
> > Cc: Will Deacon <will@kernel.org>
> > Cc: Waiman Long <longman@redhat.com>
> > Cc: Boqun Feng <boqun.feng@gmail.com>
> > Cc: "Paul E. McKenney" <paulmck@kernel.org>
> > Cc: Metin Kaya <Metin.Kaya@arm.com>
> > Cc: Xuewen Yan <xuewen.yan94@gmail.com>
> > Cc: K Prateek Nayak <kprateek.nayak@amd.com>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> > Cc: Suleiman Souhlal <suleiman@google.com>
> > Cc: kernel-team@android.com
> > Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> > Signed-off-by: John Stultz <jstultz@google.com>
> > ---
>
> ...
>
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 47972f34ea701..154f0aa0c6322 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -1149,10 +1149,15 @@ struct rq {
> > */
> > unsigned int nr_uninterruptible;
> >
> > +#ifdef CONFIG_SCHED_PROXY_EXEC
> > + struct task_struct __rcu *donor; /* Scheduling context */
> > + struct task_struct __rcu *curr; /* Execution context */
> > +#else
> > union {
> > struct task_struct __rcu *donor; /* Scheduler context */
> > struct task_struct __rcu *curr; /* Execution context */
> > };
> > +#endif
> > struct sched_dl_entity *dl_server;
> > struct task_struct *idle;
> > struct task_struct *stop;
> > @@ -1347,10 +1352,17 @@ DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
> > #define cpu_curr(cpu) (cpu_rq(cpu)->curr)
> > #define raw_rq() raw_cpu_ptr(&runqueues)
> >
> > +#ifdef CONFIG_SCHED_PROXY_EXEC
> > +static inline void rq_set_donor(struct rq *rq, struct task_struct *t)
> > +{
> > + rcu_assign_pointer(rq->donor, t);
> > +}
> > +#else
> > static inline void rq_set_donor(struct rq *rq, struct task_struct *t)
> > {
> > /* Do nothing */
> > }
> > +#endif
>
> Does this chunk belong here?
Uhm. It's there intentionally (with the config option we actually have
the potential for the two contexts to be different), but I'm open to
moving it elsewhere if you think it sticks out oddly here and would
fit better in a later change.
Maybe adding a note to the commit message, clarifying that this patch
also adds the ability for the donor and curr contexts to be different,
would help?
thanks
-john
* Re: [PATCH v16 2/7] locking/mutex: Rework task_struct::blocked_on
2025-04-14 8:59 ` Juri Lelli
@ 2025-04-16 21:28 ` John Stultz
0 siblings, 0 replies; 28+ messages in thread
From: John Stultz @ 2025-04-16 21:28 UTC (permalink / raw)
To: Juri Lelli
Cc: LKML, Peter Zijlstra, Joel Fernandes, Qais Yousef, Ingo Molnar,
Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
Daniel Lezcano, Suleiman Souhlal, kernel-team, Connor O'Brien
On Mon, Apr 14, 2025 at 2:00 AM Juri Lelli <juri.lelli@redhat.com> wrote:
>
> Hi John,
>
> On 11/04/25 23:02, John Stultz wrote:
> > From: Peter Zijlstra <peterz@infradead.org>
> >
> > Track the blocked-on relation for mutexes, to allow following this
> > relation at schedule time.
> >
> > task
> > | blocked-on
> > v
> > mutex
> > | owner
> > v
> > task
> >
> > This all will be used for tracking blocked-task/mutex chains
> > with the proxy-execution patch in a similar fashion to how
> > priority inheritance is done with rt_mutexes.
> >
> > For serialization, blocked-on is only set by the task itself
> > (current). And both when setting or clearing (potentially by
> > others), is done while holding the mutex::wait_lock.
> >
> > Cc: Joel Fernandes <joelagnelf@nvidia.com>
> > Cc: Qais Yousef <qyousef@layalina.io>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Juri Lelli <juri.lelli@redhat.com>
> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Cc: Valentin Schneider <vschneid@redhat.com>
> > Cc: Steven Rostedt <rostedt@goodmis.org>
> > Cc: Ben Segall <bsegall@google.com>
> > Cc: Zimuzo Ezeozue <zezeozue@google.com>
> > Cc: Mel Gorman <mgorman@suse.de>
> > Cc: Will Deacon <will@kernel.org>
> > Cc: Waiman Long <longman@redhat.com>
> > Cc: Boqun Feng <boqun.feng@gmail.com>
> > Cc: "Paul E. McKenney" <paulmck@kernel.org>
> > Cc: Metin Kaya <Metin.Kaya@arm.com>
> > Cc: Xuewen Yan <xuewen.yan94@gmail.com>
> > Cc: K Prateek Nayak <kprateek.nayak@amd.com>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> > Cc: Suleiman Souhlal <suleiman@google.com>
> > Cc: kernel-team@android.com
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > [minor changes while rebasing]
> > Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > Signed-off-by: Connor O'Brien <connoro@google.com>
> > [jstultz: Fix blocked_on tracking in __mutex_lock_common in error paths]
> > Signed-off-by: John Stultz <jstultz@google.com>
> > ---
>
> ...
>
> > @@ -940,6 +954,14 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
> > next = waiter->task;
> >
> > debug_mutex_wake_waiter(lock, waiter);
> > + /*
> > + * Unlock wakeups can be happening in parallel
> > + * (when optimistic spinners steal and release
> > + * the lock), so blocked_on may already be
> > + * cleared here.
> > + */
> > + WARN_ON(next->blocked_on && next->blocked_on != lock);
> > + next->blocked_on = NULL;
>
> Here and below, why the WARN_ON(), if blocked_on having already been
> cleared is an OK situation? Ah, maybe it's catching the more worrying
> case where the lock has changed since the task blocked?
Right. Just trying to make sure we aren't getting into any mismatched
usage of the lock.
If it has already been cleared, that's OK, but otherwise I want to
make sure it is the lock we are expecting.
Appreciate the review. Let me know if you have any further questions
or clarifications!
thanks
-john
* Re: [PATCH v16 3/7] locking/mutex: Add p->blocked_on wrappers for correctness checks
2025-04-14 9:09 ` Juri Lelli
@ 2025-04-16 22:44 ` John Stultz
0 siblings, 0 replies; 28+ messages in thread
From: John Stultz @ 2025-04-16 22:44 UTC (permalink / raw)
To: Juri Lelli
Cc: LKML, Valentin Schneider, Joel Fernandes, Qais Yousef,
Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kernel-team,
Connor O'Brien
On Mon, Apr 14, 2025 at 2:09 AM Juri Lelli <juri.lelli@redhat.com> wrote:
> On 11/04/25 23:02, John Stultz wrote:
> > From: Valentin Schneider <valentin.schneider@arm.com>
> >
> > This lets us assert mutex::wait_lock is held whenever we access
> > p->blocked_on, as well as warn us for unexpected state changes.
> >
>
> ...
>
> > +static inline struct mutex *__get_task_blocked_on(struct task_struct *p)
> > +{
> > + return READ_ONCE(p->blocked_on);
> > +}
> > +
>
> Shouldn't we check that wait_lock is held also when reading blocked_on?
Yeah, later in the series, when I switch to the task->blocked_lock, we do:
https://github.com/johnstultz-work/linux-dev/blob/proxy-exec-v16-6.15-rc1/include/linux/sched.h#L2267
I think the difficulty here is just we don't have the mutex pointer in
this context to assert lock->wait_lock is held.
I guess we could assume blocked_on is the right mutex and assert that
we hold that one if it is non-null. I'll take a shot at that.
> And, if that's the case, why do we use READ_ONCE?
So, I was thinking this was because we pulled blocked_on in some
contexts without holding the lock and wanted to avoid the compiler
rearranging things. But it's not used that way now, so I'll drop this.
Thanks for pointing this out!
-john
* Re: [PATCH v16 5/7] sched: Add an initial sketch of the find_proxy_task() function
2025-04-14 9:41 ` Juri Lelli
@ 2025-04-16 22:55 ` John Stultz
0 siblings, 0 replies; 28+ messages in thread
From: John Stultz @ 2025-04-16 22:55 UTC (permalink / raw)
To: Juri Lelli
Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
Daniel Lezcano, Suleiman Souhlal, kernel-team
On Mon, Apr 14, 2025 at 2:42 AM Juri Lelli <juri.lelli@redhat.com> wrote:
>
> Hi John,
>
> On 11/04/25 23:02, John Stultz wrote:
> > Add a find_proxy_task() function which doesn't do much.
> >
> > When we select a blocked task to run, we will just deactivate it
> > and pick again. The exception is if it has become unblocked
> > after find_proxy_task() was called.
> >
> > Greatly simplified from patch by:
> > Peter Zijlstra (Intel) <peterz@infradead.org>
> > Juri Lelli <juri.lelli@redhat.com>
> > Valentin Schneider <valentin.schneider@arm.com>
> > Connor O'Brien <connoro@google.com>
> >
> > Cc: Joel Fernandes <joelagnelf@nvidia.com>
> > Cc: Qais Yousef <qyousef@layalina.io>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Juri Lelli <juri.lelli@redhat.com>
> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Cc: Valentin Schneider <vschneid@redhat.com>
> > Cc: Steven Rostedt <rostedt@goodmis.org>
> > Cc: Ben Segall <bsegall@google.com>
> > Cc: Zimuzo Ezeozue <zezeozue@google.com>
> > Cc: Mel Gorman <mgorman@suse.de>
> > Cc: Will Deacon <will@kernel.org>
> > Cc: Waiman Long <longman@redhat.com>
> > Cc: Boqun Feng <boqun.feng@gmail.com>
> > Cc: "Paul E. McKenney" <paulmck@kernel.org>
> > Cc: Metin Kaya <Metin.Kaya@arm.com>
> > Cc: Xuewen Yan <xuewen.yan94@gmail.com>
> > Cc: K Prateek Nayak <kprateek.nayak@amd.com>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> > Cc: Suleiman Souhlal <suleiman@google.com>
> > Cc: kernel-team@android.com
> > [jstultz: Split out from larger proxy patch and simplified
> > for review and testing.]
> > Signed-off-by: John Stultz <jstultz@google.com>
> > ---
>
> ...
>
> > +static struct task_struct *
> > +find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
> > +{
> > + struct task_struct *p = donor;
> > + struct mutex *mutex;
> > +
> > + mutex = p->blocked_on;
> > + /* Something changed in the chain, so pick again */
> > + if (!mutex)
> > + return NULL;
> > + /*
> > + * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
> > + * and ensure @owner sticks around.
> > + */
> > + guard(raw_spinlock)(&mutex->wait_lock);
> > +
> > + /* Check again that p is blocked with blocked_lock held */
> > + if (!task_is_blocked(p) || mutex != __get_task_blocked_on(p)) {
> > + /*
> > + * Something changed in the blocked_on chain and
> > + * we don't know if only at this level. So, let's
> > + * just bail out completely and let __schedule
> > + * figure things out (pick_again loop).
> > + */
> > + return NULL; /* do pick_next_task again */
> > + }
> > + return proxy_deactivate(rq, donor);
>
> Nitpick. We might want either to use donor (and remove p) or to use p
> everywhere in the function; I think that makes it clearer.
Ah, yes, in this step it would be clearer. The two separate
variables are important later when we walk the chain, since we use p to
iterate down the chain, but if we cannot find a proxy to run, we need
to deactivate the donor.
I'll rework this chunk to just use donor in this step.
thanks
-john
* Re: [PATCH v16 4/7] sched: Fix runtime accounting w/ split exec & sched contexts
2025-04-14 9:28 ` Juri Lelli
@ 2025-04-16 23:30 ` John Stultz
2025-04-29 9:24 ` Juri Lelli
0 siblings, 1 reply; 28+ messages in thread
From: John Stultz @ 2025-04-16 23:30 UTC (permalink / raw)
To: Juri Lelli
Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
Daniel Lezcano, Suleiman Souhlal, kernel-team
On Mon, Apr 14, 2025 at 2:28 AM Juri Lelli <juri.lelli@redhat.com> wrote:
> On 11/04/25 23:02, John Stultz wrote:
> > +static s64 update_se_times(struct rq *rq, struct sched_entity *se)
> > {
> > u64 now = rq_clock_task(rq);
> > s64 delta_exec;
> >
> > - delta_exec = now - curr->exec_start;
> > + delta_exec = now - se->exec_start;
> > if (unlikely(delta_exec <= 0))
> > return delta_exec;
> >
> > - curr->exec_start = now;
> > - curr->sum_exec_runtime += delta_exec;
> > + se->exec_start = now;
> > + if (entity_is_task(se)) {
> > + struct task_struct *running = rq->curr;
> > + /*
> > + * If se is a task, we account the time against the running
> > + * task, as w/ proxy-exec they may not be the same.
> > + */
> > + running->se.exec_start = now;
> > + running->se.sum_exec_runtime += delta_exec;
> > + } else {
> > + /* If not task, account the time against se */
> > + se->sum_exec_runtime += delta_exec;
> > + }
>
> ...
>
> > @@ -1213,7 +1224,7 @@ s64 update_curr_common(struct rq *rq)
> > struct task_struct *donor = rq->donor;
> > s64 delta_exec;
> >
> > - delta_exec = update_curr_se(rq, &donor->se);
> > + delta_exec = update_se_times(rq, &donor->se);
> > if (likely(delta_exec > 0))
> > update_curr_task(donor, delta_exec);
>
> Considering that we calculate delta_exec in update_se_times() using
> the exec_start of the sched_entity passed as an argument, is it correct
> to use donor in the above?
Sorry, I'm not sure I quite understand your concern here. Why do you
think using donor might be problematic? We're passing donor->se in to
calculate the delta_exec.
I'll grant that "update_curr_common" maybe isn't the best name for the
calling function anymore, as we're really only working on the donor
here. Is that what your concern stems from?
Apologies for not quite understanding right away. I really appreciate
your review and feedback!
thanks
-john
* Re: [PATCH v16 4/7] sched: Fix runtime accounting w/ split exec & sched contexts
2025-04-12 6:02 ` [PATCH v16 4/7] sched: Fix runtime accounting w/ split exec & sched contexts John Stultz
2025-04-14 9:28 ` Juri Lelli
@ 2025-04-17 11:12 ` Peter Zijlstra
2025-04-21 21:00 ` John Stultz
1 sibling, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2025-04-17 11:12 UTC (permalink / raw)
To: John Stultz
Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
Daniel Lezcano, Suleiman Souhlal, kernel-team
On Fri, Apr 11, 2025 at 11:02:38PM -0700, John Stultz wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e43993a4e5807..da8b0970c6655 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1143,22 +1143,33 @@ static void update_tg_load_avg(struct cfs_rq *cfs_rq)
> }
> #endif /* CONFIG_SMP */
>
> -static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
> +static s64 update_se_times(struct rq *rq, struct sched_entity *se)
update_se()
> {
> u64 now = rq_clock_task(rq);
> s64 delta_exec;
>
> - delta_exec = now - curr->exec_start;
> + delta_exec = now - se->exec_start;
> if (unlikely(delta_exec <= 0))
> return delta_exec;
>
> - curr->exec_start = now;
> - curr->sum_exec_runtime += delta_exec;
> + se->exec_start = now;
> + if (entity_is_task(se)) {
> + struct task_struct *running = rq->curr;
> + /*
> + * If se is a task, we account the time against the running
> + * task, as w/ proxy-exec they may not be the same.
> + */
> + running->se.exec_start = now;
> + running->se.sum_exec_runtime += delta_exec;
> + } else {
> + /* If not task, account the time against se */
> + se->sum_exec_runtime += delta_exec;
> + }
So I am confused; you're accounting runtime to the actual running task,
but then accounting the same runtime to the cgroup of the donor.
This seems somewhat irregular.
Please consider all of update_curr_task(), and if they all want to be
against rq->curr, rather than rq->donor then more changes are needed.
> @@ -1213,7 +1224,7 @@ s64 update_curr_common(struct rq *rq)
> struct task_struct *donor = rq->donor;
> s64 delta_exec;
>
> - delta_exec = update_curr_se(rq, &donor->se);
> + delta_exec = update_se_times(rq, &donor->se);
> if (likely(delta_exec > 0))
> update_curr_task(donor, delta_exec);
>
> @@ -1233,7 +1244,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
> if (unlikely(!curr))
> return;
>
> - delta_exec = update_curr_se(rq, curr);
> + delta_exec = update_se_times(rq, curr);
> if (unlikely(delta_exec <= 0))
> return;
I think I've tripped over this before, on how update_curr_common() uses
donor and update_curr() curr. This definitely needs a comment. Because
at first glance they're not the same.
* Re: [PATCH v16 5/7] sched: Add an initial sketch of the find_proxy_task() function
2025-04-12 6:02 ` [PATCH v16 5/7] sched: Add an initial sketch of the find_proxy_task() function John Stultz
2025-04-14 9:41 ` Juri Lelli
@ 2025-04-17 11:18 ` Peter Zijlstra
2025-04-22 21:14 ` John Stultz
1 sibling, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2025-04-17 11:18 UTC (permalink / raw)
To: John Stultz
Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
Daniel Lezcano, Suleiman Souhlal, kernel-team
On Fri, Apr 11, 2025 at 11:02:39PM -0700, John Stultz wrote:
> +#ifdef CONFIG_SCHED_PROXY_EXEC
> +static inline struct task_struct *proxy_resched_idle(struct rq *rq)
> +{
> + put_prev_set_next_task(rq, rq->donor, rq->idle);
> + rq_set_donor(rq, rq->idle);
> + set_tsk_need_resched(rq->idle);
> + return rq->idle;
> +}
> +
> +static bool __proxy_deactivate(struct rq *rq, struct task_struct *donor)
> +{
> + unsigned long state = READ_ONCE(donor->__state);
> +
> + /* Don't deactivate if the state has been changed to TASK_RUNNING */
> + if (state == TASK_RUNNING)
> + return false;
> + /*
> + * Because we got donor from pick_next_task, it is *crucial*
pick_next_task()
> + * that we call proxy_resched_idle before we deactivate it.
proxy_resched_idle()
> + * As once we deactivate donor, donor->on_rq is set to zero,
> + * which allows ttwu to immediately try to wake the task on
ttwu()
> + * another rq. So we cannot use *any* references to donor
> + * after that point. So things like cfs_rq->curr or rq->donor
> + * need to be changed from next *before* we deactivate.
> + */
> + proxy_resched_idle(rq);
> + return try_to_block_task(rq, donor, state, true);
> +}
> +
> +static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *donor)
> +{
> + if (!__proxy_deactivate(rq, donor)) {
> + /*
> + * XXX: For now, if deactivation failed, set donor
> + * as unblocked, as we aren't doing proxy-migrations
> + * yet (more logic will be needed then).
> + */
> + donor->blocked_on = NULL;
> + }
> + return NULL;
> +}
> +
> +/*
> + * Initial simple sketch that just deactivates the blocked task
> + * chosen by pick_next_task() so we can then pick something that
> + * isn't blocked.
> + */
> +static struct task_struct *
> +find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
> +{
> + struct task_struct *p = donor;
> + struct mutex *mutex;
> +
> + mutex = p->blocked_on;
> + /* Something changed in the chain, so pick again */
> + if (!mutex)
> + return NULL;
> + /*
> + * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
> + * and ensure @owner sticks around.
> + */
> + guard(raw_spinlock)(&mutex->wait_lock);
> +
> + /* Check again that p is blocked with blocked_lock held */
> + if (!task_is_blocked(p) || mutex != __get_task_blocked_on(p)) {
> + /*
> + * Something changed in the blocked_on chain and
> + * we don't know if only at this level. So, let's
> + * just bail out completely and let __schedule
__schedule()
> + * figure things out (pick_again loop).
> + */
> + return NULL; /* do pick_next_task again */
pick_next_task()
> + }
> + return proxy_deactivate(rq, donor);
I was expecting a for() loop here, this only follows blocked_on once,
right?
> +}
* Re: [PATCH v16 4/7] sched: Fix runtime accounting w/ split exec & sched contexts
2025-04-17 11:12 ` Peter Zijlstra
@ 2025-04-21 21:00 ` John Stultz
2025-04-24 13:37 ` Peter Zijlstra
0 siblings, 1 reply; 28+ messages in thread
From: John Stultz @ 2025-04-21 21:00 UTC (permalink / raw)
To: Peter Zijlstra
Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
Daniel Lezcano, Suleiman Souhlal, kernel-team
On Thu, Apr 17, 2025 at 4:12 AM Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Apr 11, 2025 at 11:02:38PM -0700, John Stultz wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index e43993a4e5807..da8b0970c6655 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1143,22 +1143,33 @@ static void update_tg_load_avg(struct cfs_rq *cfs_rq)
> > }
> > #endif /* CONFIG_SMP */
> >
> > -static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
> > +static s64 update_se_times(struct rq *rq, struct sched_entity *se)
>
> update_se()
Sure thing!
> > {
> > u64 now = rq_clock_task(rq);
> > s64 delta_exec;
> >
> > - delta_exec = now - curr->exec_start;
> > + delta_exec = now - se->exec_start;
> > if (unlikely(delta_exec <= 0))
> > return delta_exec;
> >
> > - curr->exec_start = now;
> > - curr->sum_exec_runtime += delta_exec;
> > + se->exec_start = now;
> > + if (entity_is_task(se)) {
> > + struct task_struct *running = rq->curr;
> > + /*
> > + * If se is a task, we account the time against the running
> > + * task, as w/ proxy-exec they may not be the same.
> > + */
> > + running->se.exec_start = now;
> > + running->se.sum_exec_runtime += delta_exec;
> > + } else {
> > + /* If not task, account the time against se */
> > + se->sum_exec_runtime += delta_exec;
> > + }
>
>
> So I am confused; you're accounting runtime to the actual running task,
> but then accounting the same runtime to the cgroup of the donor.
>
> This seems somewhat irregular.
So, apologies, as it's been a bit since I've deeply thought on this.
In general we want to charge the donor for everything since it's
donating its time, etc. However, without this change, we got some
strange behavior in top, etc, because the proxy tasks that actually
ran didn't seem to gain any exec_runtime. So the split of charging
everything to the donor except the sum_exec_runtime to the actually
running process (the proxy) made sense.
Now, for cgroup accounting, it seems like we'd still want to charge
the donor's cgroup, so whatever restrictions there are in place apply
to the donor, but it's just when we get to the leaf task we charge the
proxy instead.
Does that sound reasonable? Or am I making a bad assumption here
around the cgroup logic?
> Please consider all of update_curr_task(), and if they all want to be
> against rq->curr, rather than rq->donor then more changes are needed.
So I think we are ok here, but it is confusing... see more below.
> > @@ -1213,7 +1224,7 @@ s64 update_curr_common(struct rq *rq)
> > struct task_struct *donor = rq->donor;
> > s64 delta_exec;
> >
> > - delta_exec = update_curr_se(rq, &donor->se);
> > + delta_exec = update_se_times(rq, &donor->se);
> > if (likely(delta_exec > 0))
> > update_curr_task(donor, delta_exec);
> >
> > @@ -1233,7 +1244,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
> > if (unlikely(!curr))
> > return;
> >
> > - delta_exec = update_curr_se(rq, curr);
> > + delta_exec = update_se_times(rq, curr);
> > if (unlikely(delta_exec <= 0))
> > return;
>
> I think I've tripped over this before, on how update_curr_common() uses
> donor and update_curr() curr. This definitely needs a comment. Because
> at first glance they're not the same.
I suspect part of the incongruity/dissonance comes from the fact that
cfs_rq->curr is actually the rq->donor (where rq->donor and rq->curr
are different), as it's what the sched class picked to run.
Renaming that, I think, might clarify things. I have been hesitant
to cause too much naming churn in the series, but maybe it's the right
time to do it if it's causing confusion.
My other hesitancy is around wanting the proxy logic to be
focused in the core, so the sched-class "curr" can still be what the
class selected to run; it's just that proxy might pick something else to
actually run. But the top-level rq->curr not being the cfs_rq->curr is
prone to confusion, and we already have rq->donor references in
fair.c, so it's not as if it's perfectly encapsulated and layered.
But I'll take a pass at renaming cfs_rq->curr to cfs_rq->donor, unless
you object.
thanks
-john
* Re: [PATCH v16 5/7] sched: Add an initial sketch of the find_proxy_task() function
2025-04-17 11:18 ` Peter Zijlstra
@ 2025-04-22 21:14 ` John Stultz
0 siblings, 0 replies; 28+ messages in thread
From: John Stultz @ 2025-04-22 21:14 UTC (permalink / raw)
To: Peter Zijlstra
Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
Daniel Lezcano, Suleiman Souhlal, kernel-team
On Thu, Apr 17, 2025 at 4:18 AM Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Apr 11, 2025 at 11:02:39PM -0700, John Stultz wrote:
> > + * Because we got donor from pick_next_task, it is *crucial*
>
> pick_next_task()
>
> > + * that we call proxy_resched_idle before we deactivate it.
>
> proxy_resched_idle()
>
> > + * As once we deactivate donor, donor->on_rq is set to zero,
> > + * which allows ttwu to immediately try to wake the task on
>
> ttwu()
>
Thanks for the comment tweaks, I've added them to my next version.
> > + }
> > + return proxy_deactivate(rq, donor);
>
> I was expecting a for() loop here, this only follows blocked_on once,
> right?
Yeah, this patch is only the simplest sketch of the find_proxy_task(),
which just deactivates the blocked donor and picks again (basically it
fails every time).
Initially I split it out to help test that keeping tasks on the
runqueue and deactivating them later would work properly.
It does not yet traverse the blocked on chain and actually find the proxy task.
I'll try to expand a bit more in the commit message so this isn't a surprise.
thanks
-john
* Re: [PATCH v16 4/7] sched: Fix runtime accounting w/ split exec & sched contexts
2025-04-21 21:00 ` John Stultz
@ 2025-04-24 13:37 ` Peter Zijlstra
2025-04-26 3:34 ` John Stultz
0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2025-04-24 13:37 UTC (permalink / raw)
To: John Stultz
Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
Daniel Lezcano, Suleiman Souhlal, kernel-team
On Mon, Apr 21, 2025 at 02:00:34PM -0700, John Stultz wrote:
> On Thu, Apr 17, 2025 at 4:12 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > On Fri, Apr 11, 2025 at 11:02:38PM -0700, John Stultz wrote:
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index e43993a4e5807..da8b0970c6655 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -1143,22 +1143,33 @@ static void update_tg_load_avg(struct cfs_rq *cfs_rq)
> > > }
> > > #endif /* CONFIG_SMP */
> > >
> > > -static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
> > > +static s64 update_se_times(struct rq *rq, struct sched_entity *se)
> >
> > update_se()
>
> Sure thing!
>
> > > {
> > > u64 now = rq_clock_task(rq);
> > > s64 delta_exec;
> > >
> > > - delta_exec = now - curr->exec_start;
> > > + delta_exec = now - se->exec_start;
> > > if (unlikely(delta_exec <= 0))
> > > return delta_exec;
> > >
> > > - curr->exec_start = now;
> > > - curr->sum_exec_runtime += delta_exec;
> > > + se->exec_start = now;
> > > + if (entity_is_task(se)) {
> > > + struct task_struct *running = rq->curr;
> > > + /*
> > > + * If se is a task, we account the time against the running
> > > + * task, as w/ proxy-exec they may not be the same.
> > > + */
> > > + running->se.exec_start = now;
> > > + running->se.sum_exec_runtime += delta_exec;
> > > + } else {
> > > + /* If not task, account the time against se */
> > > + se->sum_exec_runtime += delta_exec;
> > > + }
> >
> >
> > So I am confused; you're accounting runtime to the actual running task,
> > but then accounting the same runtime to the cgroup of the donor.
> >
> > This seems somewhat irregular.
>
> So, apologies, as it's been a bit since I've deeply thought on this.
> In general we want to charge the donor for everything since it's
> donating its time, etc. However, without this change, we got some
> strange behavior in top, etc, because the proxy tasks that actually
> ran didn't seem to gain any exec_runtime. So the split of charging
> everything to the donor except the sum_exec_runtime to the actually
> running process (the proxy) made sense.
>
> Now, for cgroup accounting, it seems like we'd still want to charge
> the donor's cgroup, so whatever restrictions there are in place apply
> to the donor, but it's just when we get to the leaf task we charge the
> proxy instead.
>
> Does that sound reasonable? Or am I making a bad assumption here
> around the cgroup logic?
It's all rather confusing one way or the other, I'm afraid :/
This way when people go add up the task times and compare to cgroups
it doesn't match up.
Also, by adding sum_exec_runtime to curr but doing
account_group_exec_runtime() on the donor, the cputimer information is
inconsistent.
Whatever we do, it should be internally consistent, and this ain't it.
> > Please consider all of update_curr_task(), and if they all want to be
> > against rq->curr, rather than rq->donor then more changes are needed.
>
> So I think we are ok here, but it is confusing... see more below.
Yeah, we are okay. I remembered the discussion we had last time I
tripped over this. I just tripped over it again before remembering :-)
> > > @@ -1213,7 +1224,7 @@ s64 update_curr_common(struct rq *rq)
> > > struct task_struct *donor = rq->donor;
> > > s64 delta_exec;
> > >
> > > - delta_exec = update_curr_se(rq, &donor->se);
> > > + delta_exec = update_se_times(rq, &donor->se);
> > > if (likely(delta_exec > 0))
> > > update_curr_task(donor, delta_exec);
> > >
> > > @@ -1233,7 +1244,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
> > > if (unlikely(!curr))
> > > return;
> > >
> > > - delta_exec = update_curr_se(rq, curr);
> > > + delta_exec = update_se_times(rq, curr);
> > > if (unlikely(delta_exec <= 0))
> > > return;
> >
> > I think I've tripped over this before, on how update_curr_common() uses
> > donor and update_curr() curr. This definitely needs a comment. Because
> > at first glance they're not the same.
>
> I suspect part of the incongruity/dissonance comes from the fact that
> cfs_rq->curr is actually the rq->donor (where rq->donor and rq->curr
> are different), as it's what the sched class picked to run.
>
> Renaming that, I think, might clarify things. I have been hesitant
> to cause too much naming churn in the series, but maybe it's the right
> time to do it if it's causing confusion.
>
> My other hesitancy is around wanting the proxy logic to be
> focused in the core, so the sched-class "curr" can still be what the
> class selected to run; it's just that proxy might pick something else to
> actually run. But the top-level rq->curr not being the cfs_rq->curr is
> prone to confusion, and we already have rq->donor references in
> fair.c, so it's not as if it's perfectly encapsulated and layered.
>
> But I'll take a pass at renaming cfs_rq->curr to cfs_rq->donor, unless
> you object.
I was more thinking of a comment near here to clarify. Not sure
cfs_rq->donor makes much sense.
* Re: [PATCH v16 4/7] sched: Fix runtime accounting w/ split exec & sched contexts
2025-04-24 13:37 ` Peter Zijlstra
@ 2025-04-26 3:34 ` John Stultz
0 siblings, 0 replies; 28+ messages in thread
From: John Stultz @ 2025-04-26 3:34 UTC (permalink / raw)
To: Peter Zijlstra
Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
Daniel Lezcano, Suleiman Souhlal, kernel-team
On Thu, Apr 24, 2025 at 6:37 AM Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Apr 21, 2025 at 02:00:34PM -0700, John Stultz wrote:
> > On Thu, Apr 17, 2025 at 4:12 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > > So I am confused; you're accounting runtime to the actual running task,
> > > but then accounting the same runtime to the cgroup of the donor.
> > >
> > > This seems somewhat irregular.
> >
> > So, apologies, as it's been a bit since I've deeply thought on this.
> > In general we want to charge the donor for everything since it's
> > donating its time, etc. However, without this change, we got some
> > strange behavior in top, etc, because the proxy tasks that actually
> > ran didn't seem to gain any exec_runtime. So the split of charging
> > everything to the donor except the sum_exec_runtime to the actually
> > running process (the proxy) made sense.
> >
> > Now, for cgroup accounting, it seems like we'd still want to charge
> > the donor's cgroup, so whatever restrictions there are in place apply
> > to the donor, but it's just when we get to the leaf task we charge the
> > proxy instead.
> >
> > Does that sound reasonable? Or am I making a bad assumption here
> > around the cgroup logic?
>
> Its all rather confusing one way or the other I'm afraid :/
>
> This way when people go add up the task times and compare to cgroups
> it doesn't match up.
So, inherently the point of proxy is that tasks get to run on borrowed time.
Priority inversions from any restrictions via cgroups are resolved
only by letting the task run using the donor's time. So for proxy to
work, we have to allow those tasks to effectively break through their
restrictions to release the needed locks.
So I'm not sure that, outside of accounting everything onto the donor
(which creates its own oddities, and which motivated this patch), one
could be perfectly consistent here.
I still think the model used in this patch feels like the best
balance. But I'm wondering, would accounting the "borrowed time" and
the "donated time" separately maybe help folks interpret things more
clearly?
> Also, by adding sum_exec_runtime to curr, but
> account_group_exec_runtime() on donor, the cputimer information is
> inconsistent.
Ah! Ok, this is a good point. I misinterpreted
account_group_exec_runtime() as being cgroup-related, not thread-group
related. Thanks for pointing this out!
>
> Whatever we do, it should be internally consistent, and this ain't it.
>
I'll take a swing at fixing the account_group_exec_runtime() usage, but
let me know if you have any more thoughts on how we should charge
cputime between donor cgroups and proxy tasks running on the donor's
time.
> > > I think I've tripped over this before, on how update_curr_common() uses
> > > donor and update_curr() curr. This definitely needs a comment. Because
> > > at first glance they're not the same.
> >
> > I suspect part of the incongruity/dissonance comes from the fact
> > that cfs_rq->curr is actually the rq->donor (where rq->donor and
> > rq->curr are different), as it's what the sched-class picked to run.
> >
> > Renaming that I think might clarify things, but I have been hesitant
> > to cause too much naming churn in the series. Maybe it's the right
> > time to do it if it's causing confusion.
> >
> > My other hesitancy there is around wanting the proxy logic to be
> > focused in the core, so the sched-class "curr" can still be what the
> > class selected to run; it's just that proxy might pick something else
> > to actually run. But the top level rq->curr not being the
> > cfs_rq->curr is prone to confusion, and we already do have rq->donor
> > references in fair.c, so it's not like it's perfectly encapsulated
> > and layered.
> >
> > But I'll take a pass at renaming cfs_rq->curr to cfs_rq->donor, unless
> > you object.
>
> I was more thinking of a comment near here to clarify. Not sure
> cfs_rq->donor makes much sense.
Ok.
/me quietly deletes that patch from his WIP series. :)
Thanks so much again for the review and feedback here! Very much appreciated!
-john
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v16 1/7] sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable
2025-04-16 21:24 ` John Stultz
@ 2025-04-29 9:16 ` Juri Lelli
0 siblings, 0 replies; 28+ messages in thread
From: Juri Lelli @ 2025-04-29 9:16 UTC (permalink / raw)
To: John Stultz
Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
Daniel Lezcano, Suleiman Souhlal, kernel-team
On 16/04/25 14:24, John Stultz wrote:
> On Mon, Apr 14, 2025 at 1:50 AM Juri Lelli <juri.lelli@redhat.com> wrote:
> > On 11/04/25 23:02, John Stultz wrote:
> > > Add a CONFIG_SCHED_PROXY_EXEC option, along with a boot argument
> > > sched_proxy_exec= that can be used to disable the feature at boot
> > > time if CONFIG_SCHED_PROXY_EXEC was enabled.
> > >
> > > Cc: Joel Fernandes <joelagnelf@nvidia.com>
> > > Cc: Qais Yousef <qyousef@layalina.io>
> > > Cc: Ingo Molnar <mingo@redhat.com>
> > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > Cc: Juri Lelli <juri.lelli@redhat.com>
> > > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > > Cc: Valentin Schneider <vschneid@redhat.com>
> > > Cc: Steven Rostedt <rostedt@goodmis.org>
> > > Cc: Ben Segall <bsegall@google.com>
> > > Cc: Zimuzo Ezeozue <zezeozue@google.com>
> > > Cc: Mel Gorman <mgorman@suse.de>
> > > Cc: Will Deacon <will@kernel.org>
> > > Cc: Waiman Long <longman@redhat.com>
> > > Cc: Boqun Feng <boqun.feng@gmail.com>
> > > Cc: "Paul E. McKenney" <paulmck@kernel.org>
> > > Cc: Metin Kaya <Metin.Kaya@arm.com>
> > > Cc: Xuewen Yan <xuewen.yan94@gmail.com>
> > > Cc: K Prateek Nayak <kprateek.nayak@amd.com>
> > > Cc: Thomas Gleixner <tglx@linutronix.de>
> > > Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> > > Cc: Suleiman Souhlal <suleiman@google.com>
> > > Cc: kernel-team@android.com
> > > Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> > > Signed-off-by: John Stultz <jstultz@google.com>
> > > ---
> >
> > ...
> >
> > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > > index 47972f34ea701..154f0aa0c6322 100644
> > > --- a/kernel/sched/sched.h
> > > +++ b/kernel/sched/sched.h
> > > @@ -1149,10 +1149,15 @@ struct rq {
> > > */
> > > unsigned int nr_uninterruptible;
> > >
> > > +#ifdef CONFIG_SCHED_PROXY_EXEC
> > > + struct task_struct __rcu *donor; /* Scheduling context */
> > > + struct task_struct __rcu *curr; /* Execution context */
> > > +#else
> > > union {
> > > struct task_struct __rcu *donor; /* Scheduler context */
> > > struct task_struct __rcu *curr; /* Execution context */
> > > };
> > > +#endif
> > > struct sched_dl_entity *dl_server;
> > > struct task_struct *idle;
> > > struct task_struct *stop;
> > > @@ -1347,10 +1352,17 @@ DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
> > > #define cpu_curr(cpu) (cpu_rq(cpu)->curr)
> > > #define raw_rq() raw_cpu_ptr(&runqueues)
> > >
> > > +#ifdef CONFIG_SCHED_PROXY_EXEC
> > > +static inline void rq_set_donor(struct rq *rq, struct task_struct *t)
> > > +{
> > > + rcu_assign_pointer(rq->donor, t);
> > > +}
> > > +#else
> > > static inline void rq_set_donor(struct rq *rq, struct task_struct *t)
> > > {
> > > /* Do nothing */
> > > }
> > > +#endif
> >
> > Does this chunk belong here?
>
> Uhm. It's there intentionally (as with the config option we actually
> have the potential for the two contexts to be different), but I'm open
> to move it elsewhere if you think it sticks out oddly here and would
> fit better in a later change.
>
> Maybe, would adding a note in the commit message to clarify that this
> patch also adds the ability for the donor and curr contexts to be
> different, help?
Guess it might, yes. But, it's a nitpick anyway, so feel free to ignore.
:)
Thanks!
Juri
* Re: [PATCH v16 4/7] sched: Fix runtime accounting w/ split exec & sched contexts
2025-04-16 23:30 ` John Stultz
@ 2025-04-29 9:24 ` Juri Lelli
0 siblings, 0 replies; 28+ messages in thread
From: Juri Lelli @ 2025-04-29 9:24 UTC (permalink / raw)
To: John Stultz
Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
Daniel Lezcano, Suleiman Souhlal, kernel-team
On 16/04/25 16:30, John Stultz wrote:
> On Mon, Apr 14, 2025 at 2:28 AM Juri Lelli <juri.lelli@redhat.com> wrote:
> > On 11/04/25 23:02, John Stultz wrote:
> > > +static s64 update_se_times(struct rq *rq, struct sched_entity *se)
> > > {
> > > u64 now = rq_clock_task(rq);
> > > s64 delta_exec;
> > >
> > > - delta_exec = now - curr->exec_start;
> > > + delta_exec = now - se->exec_start;
> > > if (unlikely(delta_exec <= 0))
> > > return delta_exec;
> > >
> > > - curr->exec_start = now;
> > > - curr->sum_exec_runtime += delta_exec;
> > > + se->exec_start = now;
> > > + if (entity_is_task(se)) {
> > > + struct task_struct *running = rq->curr;
> > > + /*
> > > + * If se is a task, we account the time against the running
> > > + * task, as w/ proxy-exec they may not be the same.
> > > + */
> > > + running->se.exec_start = now;
> > > + running->se.sum_exec_runtime += delta_exec;
> > > + } else {
> > > + /* If not task, account the time against se */
> > > + se->sum_exec_runtime += delta_exec;
> > > + }
> >
> > ...
> >
> > > @@ -1213,7 +1224,7 @@ s64 update_curr_common(struct rq *rq)
> > > struct task_struct *donor = rq->donor;
> > > s64 delta_exec;
> > >
> > > - delta_exec = update_curr_se(rq, &donor->se);
> > > + delta_exec = update_se_times(rq, &donor->se);
> > > if (likely(delta_exec > 0))
> > > update_curr_task(donor, delta_exec);
> >
> > Considering that we calculate delta_exec in update_se_times() using
> > the exec_start of the sched_entity passed as argument, is it correct
> > to use donor in the above?
>
> Sorry, I'm not sure I quite understand your concern here. Why are you
> thinking using donor might be problematic here? We're passing the
> donor->se in to calculate the delta_exec.
>
> I'll grant that "update_curr_common" maybe isn't the best name for the
> calling function anymore, as we're really only working on the donor
> here. Is that what your concern stems from?
Ah, I think I was just confused similarly to Peter. A comment explaining
this point might help (I believe you were going to add that based on
the discussion with Peter).
Thanks,
Juri
end of thread, other threads:[~2025-04-29 9:24 UTC | newest]
Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-04-12 6:02 [PATCH v16 0/7] Single RunQueue Proxy Execution (v16) John Stultz
2025-04-12 6:02 ` [PATCH v16 1/7] sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable John Stultz
2025-04-14 8:50 ` Juri Lelli
2025-04-16 21:24 ` John Stultz
2025-04-29 9:16 ` Juri Lelli
2025-04-12 6:02 ` [PATCH v16 2/7] locking/mutex: Rework task_struct::blocked_on John Stultz
2025-04-14 8:59 ` Juri Lelli
2025-04-16 21:28 ` John Stultz
2025-04-12 6:02 ` [PATCH v16 3/7] locking/mutex: Add p->blocked_on wrappers for correctness checks John Stultz
2025-04-14 9:09 ` Juri Lelli
2025-04-16 22:44 ` John Stultz
2025-04-12 6:02 ` [PATCH v16 4/7] sched: Fix runtime accounting w/ split exec & sched contexts John Stultz
2025-04-14 9:28 ` Juri Lelli
2025-04-16 23:30 ` John Stultz
2025-04-29 9:24 ` Juri Lelli
2025-04-17 11:12 ` Peter Zijlstra
2025-04-21 21:00 ` John Stultz
2025-04-24 13:37 ` Peter Zijlstra
2025-04-26 3:34 ` John Stultz
2025-04-12 6:02 ` [PATCH v16 5/7] sched: Add an initial sketch of the find_proxy_task() function John Stultz
2025-04-14 9:41 ` Juri Lelli
2025-04-16 22:55 ` John Stultz
2025-04-17 11:18 ` Peter Zijlstra
2025-04-22 21:14 ` John Stultz
2025-04-12 6:02 ` [PATCH v16 6/7] sched: Fix proxy/current (push,pull)ability John Stultz
2025-04-14 3:28 ` K Prateek Nayak
2025-04-16 21:18 ` John Stultz
2025-04-12 6:02 ` [PATCH v16 7/7] sched: Start blocked_on chain processing in find_proxy_task() John Stultz