[PATCH v5 00/19] Proxy Execution: A generalized form of Priority Inheritance v5

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH v5 00/19] Proxy Execution: A generalized form of Priority Inheritance v5
@ 2023-08-19  6:08 John Stultz
  2023-08-19  6:08 ` [PATCH v5 01/19] sched: Unify runtime accounting across classes John Stultz
                   ` (18 more replies)
  0 siblings, 19 replies; 29+ messages in thread
From: John Stultz @ 2023-08-19  6:08 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Youssef Esmat, Mel Gorman, Daniel Bristot de Oliveira,
	Will Deacon, Waiman Long, Boqun Feng, Paul E . McKenney,
	kernel-team

Getting this iteration of the patch series out to the list has
taken way longer than I had hoped! Right before I sent out the
v4 release, I had found that while the patch series was fairly
stable with lower cpu counts, I could very easily run into
crashes with higher cpus during bootup. Not being able to
quickly resolve the crashes I was seeing, I went ahead and sent
out v4 for review, and figured I’d debug the issue and send out
v5 as a quick follow on. 

But unfortunately trying to diagnose the failures and fix them
started uncovering other crashes. Additionally, the v4 patch
series wasn’t properly bisectable as earlier changes were
missing fixes from later in the series, which made debugging
issues quite difficult. So after playing constant wack-a-mole
with bugs, I took the entire engine apart and laid all the bits
on the floor, and started slowly re-assembling things, testing
each step as I went.

This has been very laborious, as cutting small chunks from the
larger patches off would run into self-caused issues that I
spent much time debugging, due to my missing needed logic still
in the larger patch. But slowly I’ve managed to get almost all
the fine-grained patches to boot and run.

In fact, the patch series here is coarser than what I’ve
created, as there are a number of small “test” steps to help
validate behavior I changed, which would then be replaced by the
real logic afterwards. Including those here would just cause
more work for reviewers, but if you’re interested you can find
the fine-grained tree here: 
  https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v5-6.5-rc6-fine-grained
  https://github.com/johnstultz-work/linux-dev.git proxy-exec-v5-6.5-rc6-fine-grained

Having this fine grained tree really helped me chase down and
resolve a large number of bugs in the logic. In some cases I’ve
significantly reworked parts of the changes, so for instance, I
found the previous versions of the patch had a fair amount of
logic changing the blocked_on state in the try_to_wake_up paths
which were racy particularly with the return-migration. I’ve
moved the return-migration logic into __sched so we make sure we
don’t run proxy-migrated blocked tasks on an incorrect cpu when
they are unblocked. This change has its own complexity, so
feedback/suggestions for improvements would be appreciated. But
as a result, the patch series is much more stable (particularly
the earlier components).

Now, I have run into a few issues still, particularly around the
enqueuing of tasks on deactivated/sleeping owners, particularly
with the test-ww_mutex logic (ww_mutexes have been difficult as
they break assumptions that tasks are unblocked along the
blocked_on chain in an orderly fashion - Instead a task in the
middle of the chain may become suddenly runnable because it got
an EDEADLK from its ww_mutex). Additionally I haven’t been able
to test and debug chain migration. So I’ve left those patches out
for now, but will re-include them on the next revision. Given the
number of patches here, I suspect reviewers won’t mind. :) 

As mentioned previously, this Proxy Execution series has a long
history: First described in a paper[2] by Watkins, Straub,
Niehaus, then from patches from Peter Zijlstra, extended with
lots of work by Juri Lelli, Valentin Schneider, and Connor
O'Brien. (and thank you to Steven Rostedt for providing
additional details here!)

So again, many thanks to those above, as all the credit for this
series really is due to them - while the mistakes are likely
mine.

Overview:
---------
Proxy Execution is a generalized form of priority inheritance.
Classic priority inheritance works well for real-time tasks where
there is a straight forward priority order to how things are run.
But it breaks down when used between CFS or DEADLINE tasks, as
there are lots of parameters involved outside of just the task’s
nice value when selecting the next task to run (via
pick_next_task()). So ideally we want to imbue the mutex holder
with all the scheduler attributes of the blocked waiting task.

Proxy Execution does this via a few changes:
* Keeping tasks that are blocked on a mutex *on* the runqueue
* Keeping additional tracking of which mutex a task is blocked
  on, and which task holds a specific mutex.
* Special handling for when we select a blocked task to run, so
  that we instead run the mutex holder. 

The first of these is the most difficult to grasp (I do get the
mental friction here: blocked tasks on the *run*queue sounds like
nonsense! Personally I like to think of the runqueue in this
model more like a “task-selection queue”).

By leaving blocked tasks on the runqueue, we allow
pick_next_task() to choose the task that should run next (even if
it’s blocked waiting on a mutex). If we do select a blocked task,
we look at the task’s blocked_on mutex and from there look at the
mutex’s owner task. And in the simple case, the task which owns
the mutex is what we then choose to run, allowing it to release
the mutex.

This means that instead of just tracking “curr”, the scheduler
needs to track both the scheduler context (what was picked and
all the state used for scheduling decisions), and the execution
context (what we’re actually running).

In this way, the mutex owner is run “on behalf” of the blocked
task that was picked to run, essentially inheriting the scheduler
context of the waiting blocked task.

As Connor outlined in a previous submission of this patch series,
this raises a number of complicated situations: The mutex owner
might itself be blocked on another mutex, or it could be
sleeping, running on a different CPU, in the process of migrating
between CPUs, etc.

But the functionality provided by Proxy Execution is useful, as
in Android we have a large number of cases where we are seeing
priority inversion (not unbounded, but longer than we’d like)
between “foreground” and “background” SCHED_NORMAL applications,
so having a generalized solution would be very useful.

New in v5:
---------
* Broke the patch series up into fine grained changes

* Heavily reworked the return-migration handling by moving it
  out of try_to_wake_up() and into __schedule().

* Reworked blocked_on tracking logic (so it is handled
  completely in mutex code), and added a blocked_on_waking
  value, so we can distinguish when tasks have not acquired the
  mutex, but need to wake up to try to do so.

* Resolved lost-wakeup issues caused by the wake_qs being
  managed in the ctx structures instead of being on the stack. I
  went back to an earlier version of that patch from Juri though
  re-adding some fixes from Connor’s version.

* Resolved null se pointer crashes seen at bootup, caused by
  incorrect put_prev_task() calls

* Fixes to the test-ww_mutex test logic that was causing
  livelocks & hangs. (Sent as a separate series)

* And more I’ve likely forgotten

Also, I know Peter didn’t like the blocked_on wrappers, so I
dropped them last time, but I found them (and the debug checks
within) crucial to working out issues in this patch series. I’m
fine to consider dropping them later if they are still
objectionable, but their utility was too high at this point to
get rid of them.

Issues still to address:
—----------
* As mentioned above, the deactivated owner handling (where we
  deactivate waiting tasks and enqueue them onto a list, then
  reactivate them when the owner wakes up) has some major issues
  with ww_mutexes. Additionally I think there are other races
  possible, since everyone is using a single blocked_entry
  list_head, anyone who wakes up may think it’s the front of the
  chain and wake everyone else up(not just those waiting on it).
  I believe if two processes are woken at the same time, they
  could fight trying to consume and activate tasks onto different
  cpus.

* Still need to validate and re-add the chain migration patches.

* Seen some rare crashes around rt runqueue accounting. Will be
  investigating this.

* “rq_selected()” naming. Peter doesn’t like it, but I’ve not
  thought of a better name. Open to suggestions.

* As discussed at OSPM[1], I want to split pick_next_task() up
  into two phases selecting and setting the next tasks, as
  currently pick_next_task() assumes the returned task will be
  run which results in various side-effects in sched class logic
  when it’s run. 

* CFS load balancing. Blocked tasks may carry forward load (PELT)
  to the lock owner's CPU, so CPU may look like it is overloaded.

* I still want to push down the split scheduler and execution
  context awareness further through the scheduling code, as lots
  of logic still assumes there’s only a single “rq->curr” task.

* Optimization to avoid migrating blocked tasks (allowing for
  optimistic spinning) if the runnable lock-owner at the end of
  the blocked_on chain is already running (though this is
  difficult as locking rules to traverse the blocked on chain
  across run queues isn’t trivial).

Performance:
—----------
This patch series switches mutexes to use handoff mode rather
than optimistic spinning. This is a potential concern where locks
are under high contention. However, earlier performance analysis
(on both x86 and mobile devices) did not see major regressions.
That said, Chenyu did report a regression[3], which I’ll need to
look further into. I also briefly re-tested with this v5 series
and saw some average latencies grow suggesting the changes to
return-migration (and extra locking) have some impact. I’ll be
digging more there. As mentioned above, there may be some
additional optimizations that can help here, but my focus is on
getting the code working well before I spend time optimizing.

Review and feedback would be greatly appreciated!

If folks find it easier to test/tinker with, this patch series can also be found here:
  https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v5-6.5-rc6
  https://github.com/johnstultz-work/linux-dev.git proxy-exec-v5-6.5-rc6

Thanks so much!
-john

[1] https://youtu.be/QEWqRhVS3lI (video of my OSPM talk)
[2] https://static.lwn.net/images/conf/rtlws11/papers/proc/p38.pdf
[3] https://lore.kernel.org/lkml/Y7vVqE0M%2FAoDoVbj@chenyu5-mobl1/

Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Youssef Esmat <youssefesmat@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: kernel-team@android.com

John Stultz (8):
  locking/mutex: Removes wakeups from under mutex::wait_lock
  locking/mutex: Split blocked_on logic into two states (blocked_on and
    blocked_on_waking)
  sched: Fix runtime accounting w/ split exec & sched contexts
  sched: Unnest ttwu_runnable in prep for proxy-execution
  sched: Split out __sched() deactivate task logic into a helper
  sched: Add a very simple proxy() function
  sched: Add proxy deactivate helper
  sched: Handle blocked-waiter migration (and return migration)

Juri Lelli (2):
  locking/mutex: make mutex::wait_lock irq safe
  locking/mutex: Expose __mutex_owner()

Peter Zijlstra (7):
  sched: Unify runtime accounting across classes
  locking/mutex: Rework task_struct::blocked_on
  locking/mutex: Add task_struct::blocked_lock to serialize changes to
    the blocked_on state
  locking/mutex: Switch to mutex handoffs for CONFIG_PROXY_EXEC
  sched: Split scheduler execution context
  sched: Start blocked_on chain processing in proxy()
  sched: Add blocked_donor link to task for smarter mutex handoffs

Valentin Schneider (2):
  locking/mutex: Add p->blocked_on wrappers for correctness checks
  sched: Fix proxy/current (push,pull)ability

 include/linux/sched.h        |  33 ++-
 init/Kconfig                 |   7 +
 init/init_task.c             |   1 +
 kernel/Kconfig.locks         |   2 +-
 kernel/fork.c                |   6 +-
 kernel/locking/mutex-debug.c |   9 +-
 kernel/locking/mutex.c       | 128 ++++++---
 kernel/locking/mutex.h       |  25 ++
 kernel/locking/ww_mutex.h    |  72 +++--
 kernel/sched/core.c          | 522 +++++++++++++++++++++++++++++++----
 kernel/sched/deadline.c      |  50 ++--
 kernel/sched/fair.c          | 104 +++++--
 kernel/sched/rt.c            |  78 +++---
 kernel/sched/sched.h         |  61 +++-
 kernel/sched/stop_task.c     |  13 +-
 15 files changed, 868 insertions(+), 243 deletions(-)

-- 
2.42.0.rc1.204.g551eb34607-goog

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v5 01/19] sched: Unify runtime accounting across classes
  2023-08-19  6:08 [PATCH v5 00/19] Proxy Execution: A generalized form of Priority Inheritance v5 John Stultz
@ 2023-08-19  6:08 ` John Stultz
  2023-08-19  6:08 ` [PATCH v5 02/19] locking/mutex: Removes wakeups from under mutex::wait_lock John Stultz
                   ` (17 subsequent siblings)
  18 siblings, 0 replies; 29+ messages in thread
From: John Stultz @ 2023-08-19  6:08 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Youssef Esmat,
	Mel Gorman, Daniel Bristot de Oliveira, Will Deacon, Waiman Long,
	Boqun Feng, Paul E . McKenney, kernel-team, Connor O'Brien,
	John Stultz

From: Peter Zijlstra <peterz@infradead.org>

All classes use sched_entity::exec_start to track runtime and have
copies of the exact same code around to compute runtime.

Collapse all that.

Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Youssef Esmat <youssefesmat@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: kernel-team@android.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[fix conflicts, fold in update_current_exec_runtime]
Signed-off-by: Connor O'Brien <connoro@google.com>
[jstultz: rebased, resovling minor conflicts]
Signed-off-by: John Stultz <jstultz@google.com>
---
NOTE: This patch is a general cleanup and if no one objects
could be merged at this point. If needed, I'll resend separately
if it isn't picked up on its own.
---
 include/linux/sched.h    |  2 +-
 kernel/sched/deadline.c  | 13 +++-------
 kernel/sched/fair.c      | 56 ++++++++++++++++++++++++++++++----------
 kernel/sched/rt.c        | 13 +++-------
 kernel/sched/sched.h     | 12 ++-------
 kernel/sched/stop_task.c | 13 +---------
 6 files changed, 52 insertions(+), 57 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 609bde814cb0..4bc8a4912539 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -520,7 +520,7 @@ struct sched_statistics {
 	u64				block_max;
 	s64				sum_block_runtime;
 
-	u64				exec_max;
+	s64				exec_max;
 	u64				slice_max;
 
 	u64				nr_migrations_cold;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 58b542bf2893..9522e6607754 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1299,9 +1299,8 @@ static void update_curr_dl(struct rq *rq)
 {
 	struct task_struct *curr = rq->curr;
 	struct sched_dl_entity *dl_se = &curr->dl;
-	u64 delta_exec, scaled_delta_exec;
+	s64 delta_exec, scaled_delta_exec;
 	int cpu = cpu_of(rq);
-	u64 now;
 
 	if (!dl_task(curr) || !on_dl_rq(dl_se))
 		return;
@@ -1314,21 +1313,15 @@ static void update_curr_dl(struct rq *rq)
 	 * natural solution, but the full ramifications of this
 	 * approach need further study.
 	 */
-	now = rq_clock_task(rq);
-	delta_exec = now - curr->se.exec_start;
-	if (unlikely((s64)delta_exec <= 0)) {
+	delta_exec = update_curr_common(rq);
+	if (unlikely(delta_exec <= 0)) {
 		if (unlikely(dl_se->dl_yielded))
 			goto throttle;
 		return;
 	}
 
-	schedstat_set(curr->stats.exec_max,
-		      max(curr->stats.exec_max, delta_exec));
-
 	trace_sched_stat_runtime(curr, delta_exec, 0);
 
-	update_current_exec_runtime(curr, now, delta_exec);
-
 	if (dl_entity_is_special(dl_se))
 		return;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b3e25be58e2b..80f74ec20fb8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -891,23 +891,17 @@ static void update_tg_load_avg(struct cfs_rq *cfs_rq)
 }
 #endif /* CONFIG_SMP */
 
-/*
- * Update the current task's runtime statistics.
- */
-static void update_curr(struct cfs_rq *cfs_rq)
+static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
 {
-	struct sched_entity *curr = cfs_rq->curr;
-	u64 now = rq_clock_task(rq_of(cfs_rq));
-	u64 delta_exec;
-
-	if (unlikely(!curr))
-		return;
+	u64 now = rq_clock_task(rq);
+	s64 delta_exec;
 
 	delta_exec = now - curr->exec_start;
-	if (unlikely((s64)delta_exec <= 0))
-		return;
+	if (unlikely(delta_exec <= 0))
+		return delta_exec;
 
 	curr->exec_start = now;
+	curr->sum_exec_runtime += delta_exec;
 
 	if (schedstat_enabled()) {
 		struct sched_statistics *stats;
@@ -917,9 +911,43 @@ static void update_curr(struct cfs_rq *cfs_rq)
 				max(delta_exec, stats->exec_max));
 	}
 
-	curr->sum_exec_runtime += delta_exec;
-	schedstat_add(cfs_rq->exec_clock, delta_exec);
+	return delta_exec;
+}
+
+/*
+ * Used by other classes to account runtime.
+ */
+s64 update_curr_common(struct rq *rq)
+{
+	struct task_struct *curr = rq->curr;
+	s64 delta_exec;
 
+	delta_exec = update_curr_se(rq, &curr->se);
+	if (unlikely(delta_exec <= 0))
+		return delta_exec;
+
+	account_group_exec_runtime(curr, delta_exec);
+	cgroup_account_cputime(curr, delta_exec);
+
+	return delta_exec;
+}
+
+/*
+ * Update the current task's runtime statistics.
+ */
+static void update_curr(struct cfs_rq *cfs_rq)
+{
+	struct sched_entity *curr = cfs_rq->curr;
+	s64 delta_exec;
+
+	if (unlikely(!curr))
+		return;
+
+	delta_exec = update_curr_se(rq_of(cfs_rq), curr);
+	if (unlikely(delta_exec <= 0))
+		return;
+
+	schedstat_add(cfs_rq->exec_clock, delta_exec);
 	curr->vruntime += calc_delta_fair(delta_exec, curr);
 	update_min_vruntime(cfs_rq);
 
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 00e0e5074115..0d0b276c447d 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1046,24 +1046,17 @@ static void update_curr_rt(struct rq *rq)
 {
 	struct task_struct *curr = rq->curr;
 	struct sched_rt_entity *rt_se = &curr->rt;
-	u64 delta_exec;
-	u64 now;
+	s64 delta_exec;
 
 	if (curr->sched_class != &rt_sched_class)
 		return;
 
-	now = rq_clock_task(rq);
-	delta_exec = now - curr->se.exec_start;
-	if (unlikely((s64)delta_exec <= 0))
+	delta_exec = update_curr_common(rq);
+	if (unlikely(delta_exec < 0))
 		return;
 
-	schedstat_set(curr->stats.exec_max,
-		      max(curr->stats.exec_max, delta_exec));
-
 	trace_sched_stat_runtime(curr, delta_exec, 0);
 
-	update_current_exec_runtime(curr, now, delta_exec);
-
 	if (!rt_bandwidth_enabled())
 		return;
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e93e006a942b..33ad47a093ae 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2204,6 +2204,8 @@ struct affinity_context {
 	unsigned int flags;
 };
 
+extern s64 update_curr_common(struct rq *rq);
+
 struct sched_class {
 
 #ifdef CONFIG_UCLAMP_TASK
@@ -3235,16 +3237,6 @@ extern int sched_dynamic_mode(const char *str);
 extern void sched_dynamic_update(int mode);
 #endif
 
-static inline void update_current_exec_runtime(struct task_struct *curr,
-						u64 now, u64 delta_exec)
-{
-	curr->se.sum_exec_runtime += delta_exec;
-	account_group_exec_runtime(curr, delta_exec);
-
-	curr->se.exec_start = now;
-	cgroup_account_cputime(curr, delta_exec);
-}
-
 #ifdef CONFIG_SCHED_MM_CID
 
 #define SCHED_MM_CID_PERIOD_NS	(100ULL * 1000000)	/* 100ms */
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 85590599b4d6..7595494ceb6d 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -70,18 +70,7 @@ static void yield_task_stop(struct rq *rq)
 
 static void put_prev_task_stop(struct rq *rq, struct task_struct *prev)
 {
-	struct task_struct *curr = rq->curr;
-	u64 now, delta_exec;
-
-	now = rq_clock_task(rq);
-	delta_exec = now - curr->se.exec_start;
-	if (unlikely((s64)delta_exec < 0))
-		delta_exec = 0;
-
-	schedstat_set(curr->stats.exec_max,
-		      max(curr->stats.exec_max, delta_exec));
-
-	update_current_exec_runtime(curr, now, delta_exec);
+	update_curr_common(rq);
 }
 
 /*
-- 
2.42.0.rc1.204.g551eb34607-goog


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v5 02/19] locking/mutex: Removes wakeups from under mutex::wait_lock
  2023-08-19  6:08 [PATCH v5 00/19] Proxy Execution: A generalized form of Priority Inheritance v5 John Stultz
  2023-08-19  6:08 ` [PATCH v5 01/19] sched: Unify runtime accounting across classes John Stultz
@ 2023-08-19  6:08 ` John Stultz
  2023-08-22 19:11   ` Waiman Long
  2023-08-19  6:08 ` [PATCH v5 03/19] locking/mutex: make mutex::wait_lock irq safe John Stultz
                   ` (16 subsequent siblings)
  18 siblings, 1 reply; 29+ messages in thread
From: John Stultz @ 2023-08-19  6:08 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Youssef Esmat, Mel Gorman, Daniel Bristot de Oliveira,
	Will Deacon, Waiman Long, Boqun Feng, Paul E . McKenney,
	kernel-team

In preparation to nest mutex::wait_lock under rq::lock we need to remove
wakeups from under it.

Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Youssef Esmat <youssefesmat@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: kernel-team@android.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[Heavily changed after 55f036ca7e74 ("locking: WW mutex cleanup") and
08295b3b5bee ("locking: Implement an algorithm choice for Wound-Wait
mutexes")]
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
[jstultz: rebased to mainline, added extra wake_up_q & init
 to avoid hangs, similar to Connor's rework of this patch]
Signed-off-by: John Stultz <jstultz@google.com>
---
v5:
* Reverted back to an earlier version of this patch to undo
  the change that kept the wake_q in the ctx structure, as
  that broke the rule that the wake_q must always be on the
  stack, as its not safe for concurrency.
---
 kernel/locking/mutex.c    | 15 ++++++++++++---
 kernel/locking/ww_mutex.h | 29 ++++++++++++++++++-----------
 2 files changed, 30 insertions(+), 14 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index d973fe6041bf..118b6412845c 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -570,6 +570,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 		    struct lockdep_map *nest_lock, unsigned long ip,
 		    struct ww_acquire_ctx *ww_ctx, const bool use_ww_ctx)
 {
+	DEFINE_WAKE_Q(wake_q);
 	struct mutex_waiter waiter;
 	struct ww_mutex *ww;
 	int ret;
@@ -620,7 +621,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 	 */
 	if (__mutex_trylock(lock)) {
 		if (ww_ctx)
-			__ww_mutex_check_waiters(lock, ww_ctx);
+			__ww_mutex_check_waiters(lock, ww_ctx, &wake_q);
 
 		goto skip_wait;
 	}
@@ -640,7 +641,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 		 * Add in stamp order, waking up waiters that must kill
 		 * themselves.
 		 */
-		ret = __ww_mutex_add_waiter(&waiter, lock, ww_ctx);
+		ret = __ww_mutex_add_waiter(&waiter, lock, ww_ctx, &wake_q);
 		if (ret)
 			goto err_early_kill;
 	}
@@ -676,6 +677,10 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 		}
 
 		raw_spin_unlock(&lock->wait_lock);
+		/* Make sure we do wakeups before calling schedule */
+		wake_up_q(&wake_q);
+		wake_q_init(&wake_q);
+
 		schedule_preempt_disabled();
 
 		first = __mutex_waiter_is_first(lock, &waiter);
@@ -709,7 +714,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 		 */
 		if (!ww_ctx->is_wait_die &&
 		    !__mutex_waiter_is_first(lock, &waiter))
-			__ww_mutex_check_waiters(lock, ww_ctx);
+			__ww_mutex_check_waiters(lock, ww_ctx, &wake_q);
 	}
 
 	__mutex_remove_waiter(lock, &waiter);
@@ -725,6 +730,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 		ww_mutex_lock_acquired(ww, ww_ctx);
 
 	raw_spin_unlock(&lock->wait_lock);
+	wake_up_q(&wake_q);
 	preempt_enable();
 	return 0;
 
@@ -736,6 +742,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 	raw_spin_unlock(&lock->wait_lock);
 	debug_mutex_free_waiter(&waiter);
 	mutex_release(&lock->dep_map, ip);
+	wake_up_q(&wake_q);
 	preempt_enable();
 	return ret;
 }
@@ -946,9 +953,11 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 	if (owner & MUTEX_FLAG_HANDOFF)
 		__mutex_handoff(lock, next);
 
+	preempt_disable();
 	raw_spin_unlock(&lock->wait_lock);
 
 	wake_up_q(&wake_q);
+	preempt_enable();
 }
 
 #ifndef CONFIG_DEBUG_LOCK_ALLOC
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 3ad2cc4823e5..7189c6631d90 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -275,7 +275,7 @@ __ww_ctx_less(struct ww_acquire_ctx *a, struct ww_acquire_ctx *b)
  */
 static bool
 __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
-	       struct ww_acquire_ctx *ww_ctx)
+	       struct ww_acquire_ctx *ww_ctx, struct wake_q_head *wake_q)
 {
 	if (!ww_ctx->is_wait_die)
 		return false;
@@ -284,7 +284,7 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
 #ifndef WW_RT
 		debug_mutex_wake_waiter(lock, waiter);
 #endif
-		wake_up_process(waiter->task);
+		wake_q_add(wake_q, waiter->task);
 	}
 
 	return true;
@@ -299,7 +299,8 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
  */
 static bool __ww_mutex_wound(struct MUTEX *lock,
 			     struct ww_acquire_ctx *ww_ctx,
-			     struct ww_acquire_ctx *hold_ctx)
+			     struct ww_acquire_ctx *hold_ctx,
+			     struct wake_q_head *wake_q)
 {
 	struct task_struct *owner = __ww_mutex_owner(lock);
 
@@ -331,7 +332,7 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
 		 * wakeup pending to re-read the wounded state.
 		 */
 		if (owner != current)
-			wake_up_process(owner);
+			wake_q_add(wake_q, owner);
 
 		return true;
 	}
@@ -352,7 +353,8 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
  * The current task must not be on the wait list.
  */
 static void
-__ww_mutex_check_waiters(struct MUTEX *lock, struct ww_acquire_ctx *ww_ctx)
+__ww_mutex_check_waiters(struct MUTEX *lock, struct ww_acquire_ctx *ww_ctx,
+			 struct wake_q_head *wake_q)
 {
 	struct MUTEX_WAITER *cur;
 
@@ -364,8 +366,8 @@ __ww_mutex_check_waiters(struct MUTEX *lock, struct ww_acquire_ctx *ww_ctx)
 		if (!cur->ww_ctx)
 			continue;
 
-		if (__ww_mutex_die(lock, cur, ww_ctx) ||
-		    __ww_mutex_wound(lock, cur->ww_ctx, ww_ctx))
+		if (__ww_mutex_die(lock, cur, ww_ctx, wake_q) ||
+		    __ww_mutex_wound(lock, cur->ww_ctx, ww_ctx, wake_q))
 			break;
 	}
 }
@@ -377,6 +379,8 @@ __ww_mutex_check_waiters(struct MUTEX *lock, struct ww_acquire_ctx *ww_ctx)
 static __always_inline void
 ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
 {
+	DEFINE_WAKE_Q(wake_q);
+
 	ww_mutex_lock_acquired(lock, ctx);
 
 	/*
@@ -405,8 +409,10 @@ ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
 	 * die or wound us.
 	 */
 	lock_wait_lock(&lock->base);
-	__ww_mutex_check_waiters(&lock->base, ctx);
+	__ww_mutex_check_waiters(&lock->base, ctx, &wake_q);
 	unlock_wait_lock(&lock->base);
+
+	wake_up_q(&wake_q);
 }
 
 static __always_inline int
@@ -488,7 +494,8 @@ __ww_mutex_check_kill(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
 static inline int
 __ww_mutex_add_waiter(struct MUTEX_WAITER *waiter,
 		      struct MUTEX *lock,
-		      struct ww_acquire_ctx *ww_ctx)
+		      struct ww_acquire_ctx *ww_ctx,
+		      struct wake_q_head *wake_q)
 {
 	struct MUTEX_WAITER *cur, *pos = NULL;
 	bool is_wait_die;
@@ -532,7 +539,7 @@ __ww_mutex_add_waiter(struct MUTEX_WAITER *waiter,
 		pos = cur;
 
 		/* Wait-Die: ensure younger waiters die. */
-		__ww_mutex_die(lock, cur, ww_ctx);
+		__ww_mutex_die(lock, cur, ww_ctx, wake_q);
 	}
 
 	__ww_waiter_add(lock, waiter, pos);
@@ -550,7 +557,7 @@ __ww_mutex_add_waiter(struct MUTEX_WAITER *waiter,
 		 * such that either we or the fastpath will wound @ww->ctx.
 		 */
 		smp_mb();
-		__ww_mutex_wound(lock, ww_ctx, ww->ctx);
+		__ww_mutex_wound(lock, ww_ctx, ww->ctx, wake_q);
 	}
 
 	return 0;
-- 
2.42.0.rc1.204.g551eb34607-goog


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 02/19] locking/mutex: Removes wakeups from under mutex::wait_lock
  2023-08-19  6:08 ` [PATCH v5 02/19] locking/mutex: Removes wakeups from under mutex::wait_lock John Stultz
@ 2023-08-22 19:11   ` Waiman Long
  2023-08-22 19:24     ` John Stultz
  0 siblings, 1 reply; 29+ messages in thread
From: Waiman Long @ 2023-08-22 19:11 UTC (permalink / raw)
  To: John Stultz, LKML
  Cc: Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Youssef Esmat,
	Mel Gorman, Daniel Bristot de Oliveira, Will Deacon, Boqun Feng,
	Paul E . McKenney, kernel-team

On 8/19/23 02:08, John Stultz wrote:
> In preparation to nest mutex::wait_lock under rq::lock we need to remove
> wakeups from under it.
>
> Cc: Joel Fernandes <joelaf@google.com>
> Cc: Qais Yousef <qyousef@google.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Valentin Schneider <vschneid@redhat.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Zimuzo Ezeozue <zezeozue@google.com>
> Cc: Youssef Esmat <youssefesmat@google.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Waiman Long <longman@redhat.com>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: "Paul E . McKenney" <paulmck@kernel.org>
> Cc: kernel-team@android.com
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> [Heavily changed after 55f036ca7e74 ("locking: WW mutex cleanup") and
> 08295b3b5bee ("locking: Implement an algorithm choice for Wound-Wait
> mutexes")]
> Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
> [jstultz: rebased to mainline, added extra wake_up_q & init
>   to avoid hangs, similar to Connor's rework of this patch]
> Signed-off-by: John Stultz <jstultz@google.com>
> ---
> v5:
> * Reverted back to an earlier version of this patch to undo
>    the change that kept the wake_q in the ctx structure, as
>    that broke the rule that the wake_q must always be on the
>    stack, as its not safe for concurrency.
> ---
>   kernel/locking/mutex.c    | 15 ++++++++++++---
>   kernel/locking/ww_mutex.h | 29 ++++++++++++++++++-----------
>   2 files changed, 30 insertions(+), 14 deletions(-)
>
> diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
> index d973fe6041bf..118b6412845c 100644
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -570,6 +570,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
>   		    struct lockdep_map *nest_lock, unsigned long ip,
>   		    struct ww_acquire_ctx *ww_ctx, const bool use_ww_ctx)
>   {
> +	DEFINE_WAKE_Q(wake_q);
>   	struct mutex_waiter waiter;
>   	struct ww_mutex *ww;
>   	int ret;
> @@ -620,7 +621,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
>   	 */
>   	if (__mutex_trylock(lock)) {
>   		if (ww_ctx)
> -			__ww_mutex_check_waiters(lock, ww_ctx);
> +			__ww_mutex_check_waiters(lock, ww_ctx, &wake_q);
>   
>   		goto skip_wait;
>   	}
> @@ -640,7 +641,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
>   		 * Add in stamp order, waking up waiters that must kill
>   		 * themselves.
>   		 */
> -		ret = __ww_mutex_add_waiter(&waiter, lock, ww_ctx);
> +		ret = __ww_mutex_add_waiter(&waiter, lock, ww_ctx, &wake_q);
>   		if (ret)
>   			goto err_early_kill;
>   	}
> @@ -676,6 +677,10 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
>   		}
>   
>   		raw_spin_unlock(&lock->wait_lock);
> +		/* Make sure we do wakeups before calling schedule */
> +		wake_up_q(&wake_q);
> +		wake_q_init(&wake_q);
> +

The wake_q may have task to wake up only in the case of ww_mutex which 
is a minority in the kernel. IOW, wake_up_q() which is a function call 
will do nothing in most cases. From an optimization point of view, it is 
better to do a "!wake_q_empty(&wake_q)" check before calling wake_up_q().


>   		schedule_preempt_disabled();
>   
>   		first = __mutex_waiter_is_first(lock, &waiter);
> @@ -709,7 +714,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
>   		 */
>   		if (!ww_ctx->is_wait_die &&
>   		    !__mutex_waiter_is_first(lock, &waiter))
> -			__ww_mutex_check_waiters(lock, ww_ctx);
> +			__ww_mutex_check_waiters(lock, ww_ctx, &wake_q);
>   	}
>   
>   	__mutex_remove_waiter(lock, &waiter);
> @@ -725,6 +730,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
>   		ww_mutex_lock_acquired(ww, ww_ctx);
>   
>   	raw_spin_unlock(&lock->wait_lock);
> +	wake_up_q(&wake_q);
>   	preempt_enable();
>   	return 0;
>   
> @@ -736,6 +742,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
>   	raw_spin_unlock(&lock->wait_lock);
>   	debug_mutex_free_waiter(&waiter);
>   	mutex_release(&lock->dep_map, ip);
> +	wake_up_q(&wake_q);
>   	preempt_enable();
>   	return ret;
>   }
> @@ -946,9 +953,11 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
>   	if (owner & MUTEX_FLAG_HANDOFF)
>   		__mutex_handoff(lock, next);
>   
> +	preempt_disable();
>   	raw_spin_unlock(&lock->wait_lock);
>   
>   	wake_up_q(&wake_q);
> +	preempt_enable();
>   }

I think it looks better to put the preempt_disable() right before 
raw_spin_lock() for proper nesting.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 02/19] locking/mutex: Removes wakeups from under mutex::wait_lock
  2023-08-22 19:11   ` Waiman Long
@ 2023-08-22 19:24     ` John Stultz
  0 siblings, 0 replies; 29+ messages in thread
From: John Stultz @ 2023-08-22 19:24 UTC (permalink / raw)
  To: Waiman Long
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Youssef Esmat,
	Mel Gorman, Daniel Bristot de Oliveira, Will Deacon, Boqun Feng,
	Paul E . McKenney, kernel-team

On Tue, Aug 22, 2023 at 12:11 PM Waiman Long <longman@redhat.com> wrote:
> On 8/19/23 02:08, John Stultz wrote:
> > @@ -676,6 +677,10 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
> >               }
> >
> >               raw_spin_unlock(&lock->wait_lock);
> > +             /* Make sure we do wakeups before calling schedule */
> > +             wake_up_q(&wake_q);
> > +             wake_q_init(&wake_q);
> > +
>
> The wake_q may have task to wake up only in the case of ww_mutex which
> is a minority in the kernel. IOW, wake_up_q() which is a function call
> will do nothing in most cases. From an optimization point of view, it is
> better to do a "!wake_q_empty(&wake_q)" check before calling wake_up_q().

Thanks for the suggestion!  Updated for the next version!

> > @@ -946,9 +953,11 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
> >       if (owner & MUTEX_FLAG_HANDOFF)
> >               __mutex_handoff(lock, next);
> >
> > +     preempt_disable();
> >       raw_spin_unlock(&lock->wait_lock);
> >
> >       wake_up_q(&wake_q);
> > +     preempt_enable();
> >   }
>
> I think it looks better to put the preempt_disable() right before
> raw_spin_lock() for proper nesting.

Agreed.

Thanks so much for the review and feedback! I really appreciate it!
-john

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v5 03/19] locking/mutex: make mutex::wait_lock irq safe
  2023-08-19  6:08 [PATCH v5 00/19] Proxy Execution: A generalized form of Priority Inheritance v5 John Stultz
  2023-08-19  6:08 ` [PATCH v5 01/19] sched: Unify runtime accounting across classes John Stultz
  2023-08-19  6:08 ` [PATCH v5 02/19] locking/mutex: Removes wakeups from under mutex::wait_lock John Stultz
@ 2023-08-19  6:08 ` John Stultz
  2023-08-19  6:08 ` [PATCH v5 04/19] locking/mutex: Expose __mutex_owner() John Stultz
                   ` (15 subsequent siblings)
  18 siblings, 0 replies; 29+ messages in thread
From: John Stultz @ 2023-08-19  6:08 UTC (permalink / raw)
  To: LKML
  Cc: Juri Lelli, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Youssef Esmat, Mel Gorman, Daniel Bristot de Oliveira,
	Will Deacon, Waiman Long, Boqun Feng, Paul E . McKenney,
	kernel-team, Connor O'Brien, John Stultz

From: Juri Lelli <juri.lelli@redhat.com>

mutex::wait_lock might be nested under rq->lock.

Make it irq safe then.

Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Youssef Esmat <youssefesmat@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: kernel-team@android.com
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[rebase & fix {un,}lock_wait_lock helpers in ww_mutex.h]
Signed-off-by: Connor O'Brien <connoro@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
---
v3:
* Re-added this patch after it was dropped in v2 which
  caused lockdep warnings to trip.
---
 kernel/locking/mutex.c    | 18 ++++++++++--------
 kernel/locking/ww_mutex.h | 22 +++++++++++-----------
 2 files changed, 21 insertions(+), 19 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 118b6412845c..ef56556d6ab9 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -573,6 +573,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 	DEFINE_WAKE_Q(wake_q);
 	struct mutex_waiter waiter;
 	struct ww_mutex *ww;
+	unsigned long flags;
 	int ret;
 
 	if (!use_ww_ctx)
@@ -615,7 +616,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 		return 0;
 	}
 
-	raw_spin_lock(&lock->wait_lock);
+	raw_spin_lock_irqsave(&lock->wait_lock, flags);
 	/*
 	 * After waiting to acquire the wait_lock, try again.
 	 */
@@ -676,7 +677,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 				goto err;
 		}
 
-		raw_spin_unlock(&lock->wait_lock);
+		raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 		/* Make sure we do wakeups before calling schedule */
 		wake_up_q(&wake_q);
 		wake_q_init(&wake_q);
@@ -701,9 +702,9 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 			trace_contention_begin(lock, LCB_F_MUTEX);
 		}
 
-		raw_spin_lock(&lock->wait_lock);
+		raw_spin_lock_irqsave(&lock->wait_lock, flags);
 	}
-	raw_spin_lock(&lock->wait_lock);
+	raw_spin_lock_irqsave(&lock->wait_lock, flags);
 acquired:
 	__set_current_state(TASK_RUNNING);
 
@@ -729,7 +730,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 	if (ww_ctx)
 		ww_mutex_lock_acquired(ww, ww_ctx);
 
-	raw_spin_unlock(&lock->wait_lock);
+	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 	wake_up_q(&wake_q);
 	preempt_enable();
 	return 0;
@@ -739,7 +740,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 	__mutex_remove_waiter(lock, &waiter);
 err_early_kill:
 	trace_contention_end(lock, ret);
-	raw_spin_unlock(&lock->wait_lock);
+	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 	debug_mutex_free_waiter(&waiter);
 	mutex_release(&lock->dep_map, ip);
 	wake_up_q(&wake_q);
@@ -910,6 +911,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 	struct task_struct *next = NULL;
 	DEFINE_WAKE_Q(wake_q);
 	unsigned long owner;
+	unsigned long flags;
 
 	mutex_release(&lock->dep_map, ip);
 
@@ -936,7 +938,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 		}
 	}
 
-	raw_spin_lock(&lock->wait_lock);
+	raw_spin_lock_irqsave(&lock->wait_lock, flags);
 	debug_mutex_unlock(lock);
 	if (!list_empty(&lock->wait_list)) {
 		/* get the first entry from the wait-list: */
@@ -954,7 +956,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 		__mutex_handoff(lock, next);
 
 	preempt_disable();
-	raw_spin_unlock(&lock->wait_lock);
+	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 
 	wake_up_q(&wake_q);
 	preempt_enable();
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 7189c6631d90..8b94f4b89e74 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -70,14 +70,14 @@ __ww_mutex_has_waiters(struct mutex *lock)
 	return atomic_long_read(&lock->owner) & MUTEX_FLAG_WAITERS;
 }
 
-static inline void lock_wait_lock(struct mutex *lock)
+static inline void lock_wait_lock(struct mutex *lock, unsigned long *flags)
 {
-	raw_spin_lock(&lock->wait_lock);
+	raw_spin_lock_irqsave(&lock->wait_lock, *flags);
 }
 
-static inline void unlock_wait_lock(struct mutex *lock)
+static inline void unlock_wait_lock(struct mutex *lock, unsigned long flags)
 {
-	raw_spin_unlock(&lock->wait_lock);
+	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 }
 
 static inline void lockdep_assert_wait_lock_held(struct mutex *lock)
@@ -144,14 +144,14 @@ __ww_mutex_has_waiters(struct rt_mutex *lock)
 	return rt_mutex_has_waiters(&lock->rtmutex);
 }
 
-static inline void lock_wait_lock(struct rt_mutex *lock)
+static inline void lock_wait_lock(struct rt_mutex *lock, unsigned long *flags)
 {
-	raw_spin_lock(&lock->rtmutex.wait_lock);
+	raw_spin_lock_irqsave(&lock->rtmutex.wait_lock, *flags);
 }
 
-static inline void unlock_wait_lock(struct rt_mutex *lock)
+static inline void unlock_wait_lock(struct rt_mutex *lock, flags)
 {
-	raw_spin_unlock(&lock->rtmutex.wait_lock);
+	raw_spin_unlock_irqrestore(&lock->rtmutex.wait_lock, flags);
 }
 
 static inline void lockdep_assert_wait_lock_held(struct rt_mutex *lock)
@@ -380,6 +380,7 @@ static __always_inline void
 ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
 {
 	DEFINE_WAKE_Q(wake_q);
+	unsigned long flags;
 
 	ww_mutex_lock_acquired(lock, ctx);
 
@@ -408,10 +409,9 @@ ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
 	 * Uh oh, we raced in fastpath, check if any of the waiters need to
 	 * die or wound us.
 	 */
-	lock_wait_lock(&lock->base);
+	lock_wait_lock(&lock->base, &flags);
 	__ww_mutex_check_waiters(&lock->base, ctx, &wake_q);
-	unlock_wait_lock(&lock->base);
-
+	unlock_wait_lock(&lock->base, flags);
 	wake_up_q(&wake_q);
 }
 
-- 
2.42.0.rc1.204.g551eb34607-goog


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v5 04/19] locking/mutex: Expose __mutex_owner()
  2023-08-19  6:08 [PATCH v5 00/19] Proxy Execution: A generalized form of Priority Inheritance v5 John Stultz
                   ` (2 preceding siblings ...)
  2023-08-19  6:08 ` [PATCH v5 03/19] locking/mutex: make mutex::wait_lock irq safe John Stultz
@ 2023-08-19  6:08 ` John Stultz
  2023-08-19  6:08 ` [PATCH v5 05/19] locking/mutex: Rework task_struct::blocked_on John Stultz
                   ` (14 subsequent siblings)
  18 siblings, 0 replies; 29+ messages in thread
From: John Stultz @ 2023-08-19  6:08 UTC (permalink / raw)
  To: LKML
  Cc: Juri Lelli, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Youssef Esmat, Mel Gorman, Daniel Bristot de Oliveira,
	Will Deacon, Waiman Long, Boqun Feng, Paul E . McKenney,
	kernel-team, Valentin Schneider, Connor O'Brien, John Stultz

From: Juri Lelli <juri.lelli@redhat.com>

Implementing proxy execution requires that scheduler code be able to
identify the current owner of a mutex. Expose __mutex_owner() for
this purpose (alone!).

Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Youssef Esmat <youssefesmat@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: kernel-team@android.com
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
[Removed the EXPORT_SYMBOL]
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Connor O'Brien <connoro@google.com>
[jstultz: Reworked per Peter's suggestions]
Signed-off-by: John Stultz <jstultz@google.com>
---
v4:
* Move __mutex_owner() to kernel/locking/mutex.h instead of
  adding a new globally available accessor function to keep
  the exposure of this low, along with keeping it an inline
  function, as suggested by PeterZ
---
 kernel/locking/mutex.c | 25 -------------------------
 kernel/locking/mutex.h | 25 +++++++++++++++++++++++++
 2 files changed, 25 insertions(+), 25 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index ef56556d6ab9..67ff08beebd6 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -56,31 +56,6 @@ __mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key)
 }
 EXPORT_SYMBOL(__mutex_init);
 
-/*
- * @owner: contains: 'struct task_struct *' to the current lock owner,
- * NULL means not owned. Since task_struct pointers are aligned at
- * at least L1_CACHE_BYTES, we have low bits to store extra state.
- *
- * Bit0 indicates a non-empty waiter list; unlock must issue a wakeup.
- * Bit1 indicates unlock needs to hand the lock to the top-waiter
- * Bit2 indicates handoff has been done and we're waiting for pickup.
- */
-#define MUTEX_FLAG_WAITERS	0x01
-#define MUTEX_FLAG_HANDOFF	0x02
-#define MUTEX_FLAG_PICKUP	0x04
-
-#define MUTEX_FLAGS		0x07
-
-/*
- * Internal helper function; C doesn't allow us to hide it :/
- *
- * DO NOT USE (outside of mutex code).
- */
-static inline struct task_struct *__mutex_owner(struct mutex *lock)
-{
-	return (struct task_struct *)(atomic_long_read(&lock->owner) & ~MUTEX_FLAGS);
-}
-
 static inline struct task_struct *__owner_task(unsigned long owner)
 {
 	return (struct task_struct *)(owner & ~MUTEX_FLAGS);
diff --git a/kernel/locking/mutex.h b/kernel/locking/mutex.h
index 0b2a79c4013b..1c7d3d32def8 100644
--- a/kernel/locking/mutex.h
+++ b/kernel/locking/mutex.h
@@ -20,6 +20,31 @@ struct mutex_waiter {
 #endif
 };
 
+/*
+ * @owner: contains: 'struct task_struct *' to the current lock owner,
+ * NULL means not owned. Since task_struct pointers are aligned at
+ * at least L1_CACHE_BYTES, we have low bits to store extra state.
+ *
+ * Bit0 indicates a non-empty waiter list; unlock must issue a wakeup.
+ * Bit1 indicates unlock needs to hand the lock to the top-waiter
+ * Bit2 indicates handoff has been done and we're waiting for pickup.
+ */
+#define MUTEX_FLAG_WAITERS	0x01
+#define MUTEX_FLAG_HANDOFF	0x02
+#define MUTEX_FLAG_PICKUP	0x04
+
+#define MUTEX_FLAGS		0x07
+
+/*
+ * Internal helper function; C doesn't allow us to hide it :/
+ *
+ * DO NOT USE (outside of mutex & scheduler code).
+ */
+static inline struct task_struct *__mutex_owner(struct mutex *lock)
+{
+	return (struct task_struct *)(atomic_long_read(&lock->owner) & ~MUTEX_FLAGS);
+}
+
 #ifdef CONFIG_DEBUG_MUTEXES
 extern void debug_mutex_lock_common(struct mutex *lock,
 				    struct mutex_waiter *waiter);
-- 
2.42.0.rc1.204.g551eb34607-goog


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v5 05/19] locking/mutex: Rework task_struct::blocked_on
  2023-08-19  6:08 [PATCH v5 00/19] Proxy Execution: A generalized form of Priority Inheritance v5 John Stultz
                   ` (3 preceding siblings ...)
  2023-08-19  6:08 ` [PATCH v5 04/19] locking/mutex: Expose __mutex_owner() John Stultz
@ 2023-08-19  6:08 ` John Stultz
  2023-08-19  6:08 ` [PATCH v5 06/19] locking/mutex: Add task_struct::blocked_lock to serialize changes to the blocked_on state John Stultz
                   ` (13 subsequent siblings)
  18 siblings, 0 replies; 29+ messages in thread
From: John Stultz @ 2023-08-19  6:08 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Youssef Esmat,
	Mel Gorman, Daniel Bristot de Oliveira, Will Deacon, Waiman Long,
	Boqun Feng, Paul E . McKenney, kernel-team, Connor O'Brien,
	John Stultz

From: Peter Zijlstra <peterz@infradead.org>

Track the blocked-on relation for mutexes, this allows following this
relation at schedule time.

   task
     | blocked-on
     v
   mutex
     | owner
     v
   task

Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Youssef Esmat <youssefesmat@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: kernel-team@android.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[minor changes while rebasing]
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Connor O'Brien <connoro@google.com>
[jstultz: Fix blocked_on tracking in __mutex_lock_common in error paths]
Signed-off-by: John Stultz <jstultz@google.com>
---
v2:
* Fixed blocked_on tracking in error paths that was causing crashes
v4:
* Ensure we clear blocked_on when waking ww_mutexes to die or wound.
  This is critical so we don't get ciruclar blocked_on relationships
  that can't be resolved.
v5:
* Fix potential bug where the skip_wait path might clear blocked_on
  when that path never set it
* Slight tweaks to where we set blocked_on to make it consistent,
  along with extra WARN_ON correctness checking
* Minor comment changes
---
 include/linux/sched.h        |  5 +----
 kernel/fork.c                |  3 +--
 kernel/locking/mutex-debug.c |  9 +++++----
 kernel/locking/mutex.c       | 10 ++++++++++
 kernel/locking/ww_mutex.h    | 17 +++++++++++++++--
 5 files changed, 32 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4bc8a4912539..8b98e3933bd9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1140,10 +1140,7 @@ struct task_struct {
 	struct rt_mutex_waiter		*pi_blocked_on;
 #endif
 
-#ifdef CONFIG_DEBUG_MUTEXES
-	/* Mutex deadlock detection: */
-	struct mutex_waiter		*blocked_on;
-#endif
+	struct mutex			*blocked_on;	/* lock we're blocked on */
 
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
 	int				non_block_count;
diff --git a/kernel/fork.c b/kernel/fork.c
index d2e12b6d2b18..42b022fc3c4c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2458,9 +2458,8 @@ __latent_entropy struct task_struct *copy_process(
 	lockdep_init_task(p);
 #endif
 
-#ifdef CONFIG_DEBUG_MUTEXES
 	p->blocked_on = NULL; /* not blocked yet */
-#endif
+
 #ifdef CONFIG_BCACHE
 	p->sequential_io	= 0;
 	p->sequential_io_avg	= 0;
diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
index bc8abb8549d2..7228909c3e62 100644
--- a/kernel/locking/mutex-debug.c
+++ b/kernel/locking/mutex-debug.c
@@ -52,17 +52,18 @@ void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter,
 {
 	lockdep_assert_held(&lock->wait_lock);
 
-	/* Mark the current thread as blocked on the lock: */
-	task->blocked_on = waiter;
+	/* Current thread can't be already blocked (since it's executing!) */
+	DEBUG_LOCKS_WARN_ON(task->blocked_on);
 }
 
 void debug_mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
 			 struct task_struct *task)
 {
+	struct mutex *blocked_on = READ_ONCE(task->blocked_on);
+
 	DEBUG_LOCKS_WARN_ON(list_empty(&waiter->list));
 	DEBUG_LOCKS_WARN_ON(waiter->task != task);
-	DEBUG_LOCKS_WARN_ON(task->blocked_on != waiter);
-	task->blocked_on = NULL;
+	DEBUG_LOCKS_WARN_ON(blocked_on && blocked_on != lock);
 
 	INIT_LIST_HEAD(&waiter->list);
 	waiter->task = NULL;
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 67ff08beebd6..8af9269ce8d9 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -622,6 +622,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 			goto err_early_kill;
 	}
 
+	current->blocked_on = lock;
 	set_current_state(state);
 	trace_contention_begin(lock, LCB_F_MUTEX);
 	for (;;) {
@@ -661,6 +662,10 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 
 		first = __mutex_waiter_is_first(lock, &waiter);
 
+		/*
+		 * Gets reset by unlock path().
+		 */
+		current->blocked_on = lock;
 		set_current_state(state);
 		/*
 		 * Here we order against unlock; we must either see it change
@@ -681,6 +686,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 	}
 	raw_spin_lock_irqsave(&lock->wait_lock, flags);
 acquired:
+	current->blocked_on = NULL;
 	__set_current_state(TASK_RUNNING);
 
 	if (ww_ctx) {
@@ -711,9 +717,11 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 	return 0;
 
 err:
+	current->blocked_on = NULL;
 	__set_current_state(TASK_RUNNING);
 	__mutex_remove_waiter(lock, &waiter);
 err_early_kill:
+	WARN_ON(current->blocked_on);
 	trace_contention_end(lock, ret);
 	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 	debug_mutex_free_waiter(&waiter);
@@ -924,6 +932,8 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 		next = waiter->task;
 
 		debug_mutex_wake_waiter(lock, waiter);
+		WARN_ON(next->blocked_on != lock);
+		next->blocked_on = NULL;
 		wake_q_add(&wake_q, next);
 	}
 
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 8b94f4b89e74..8bb334491732 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -284,6 +284,13 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
 #ifndef WW_RT
 		debug_mutex_wake_waiter(lock, waiter);
 #endif
+		/*
+		 * When waking up the task to die, be sure to clear the
+		 * blocked_on pointer. Otherwise we can see circular
+		 * blocked_on relationships that can't resolve.
+		 */
+		WARN_ON(waiter->task->blocked_on != lock);
+		waiter->task->blocked_on = NULL;
 		wake_q_add(wake_q, waiter->task);
 	}
 
@@ -331,9 +338,15 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
 		 * it's wounded in __ww_mutex_check_kill() or has a
 		 * wakeup pending to re-read the wounded state.
 		 */
-		if (owner != current)
+		if (owner != current) {
+			/*
+			 * When waking up the task to wound, be sure to clear the
+			 * blocked_on pointer. Otherwise we can see circular
+			 * blocked_on relationships that can't resolve.
+			 */
+			owner->blocked_on = NULL;
 			wake_q_add(wake_q, owner);
-
+		}
 		return true;
 	}
 
-- 
2.42.0.rc1.204.g551eb34607-goog


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v5 06/19] locking/mutex: Add task_struct::blocked_lock to serialize changes to the blocked_on state
  2023-08-19  6:08 [PATCH v5 00/19] Proxy Execution: A generalized form of Priority Inheritance v5 John Stultz
                   ` (4 preceding siblings ...)
  2023-08-19  6:08 ` [PATCH v5 05/19] locking/mutex: Rework task_struct::blocked_on John Stultz
@ 2023-08-19  6:08 ` John Stultz
  2023-08-19  6:08 ` [PATCH v5 07/19] locking/mutex: Add p->blocked_on wrappers for correctness checks John Stultz
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 29+ messages in thread
From: John Stultz @ 2023-08-19  6:08 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Youssef Esmat,
	Mel Gorman, Daniel Bristot de Oliveira, Will Deacon, Waiman Long,
	Boqun Feng, Paul E . McKenney, kernel-team, Valentin Schneider,
	Connor O'Brien, John Stultz

From: Peter Zijlstra <peterz@infradead.org>

This patch was split out from the later "sched: Add proxy
execution" patch.

Adds blocked_lock to the task_struct so we can safely keep track
of which tasks are blocked on us.

This will be used for tracking blocked-task/mutex chains with
the prox-execution patch in a similar fashion to how priority
inheritence is done with rt_mutexes.

Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Youssef Esmat <youssefesmat@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: kernel-team@android.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[rebased, added comments and changelog]
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
[Fixed rebase conflicts]
[squashed sched: Ensure blocked_on is always guarded by blocked_lock]
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
[fix rebase conflicts, various fixes & tweaks commented inline]
[squashed sched: Use rq->curr vs rq->proxy checks]
Signed-off-by: Connor O'Brien <connoro@google.com>
[jstultz: Split out from bigger patch]
Signed-off-by: John Stultz <jstultz@google.com>
---
v2:
* Split out into its own patch
v4:
* Remove verbose comments/questions to avoid review
  distractions, as suggested by Dietmar
* Fixed nested block_on locking for ww_mutex access
---
 include/linux/sched.h     |  1 +
 init/init_task.c          |  1 +
 kernel/fork.c             |  1 +
 kernel/locking/mutex.c    | 24 ++++++++++++++++++++----
 kernel/locking/ww_mutex.h |  6 ++++++
 5 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8b98e3933bd9..7dded1c09efa 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1141,6 +1141,7 @@ struct task_struct {
 #endif
 
 	struct mutex			*blocked_on;	/* lock we're blocked on */
+	raw_spinlock_t			blocked_lock;
 
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
 	int				non_block_count;
diff --git a/init/init_task.c b/init/init_task.c
index ff6c4b9bfe6b..189ce67e9704 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -130,6 +130,7 @@ struct task_struct init_task
 	.journal_info	= NULL,
 	INIT_CPU_TIMERS(init_task)
 	.pi_lock	= __RAW_SPIN_LOCK_UNLOCKED(init_task.pi_lock),
+	.blocked_lock	= __RAW_SPIN_LOCK_UNLOCKED(init_task.blocked_lock),
 	.timer_slack_ns = 50000, /* 50 usec default slack */
 	.thread_pid	= &init_struct_pid,
 	.thread_group	= LIST_HEAD_INIT(init_task.thread_group),
diff --git a/kernel/fork.c b/kernel/fork.c
index 42b022fc3c4c..8bad899b6c6e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2356,6 +2356,7 @@ __latent_entropy struct task_struct *copy_process(
 	ftrace_graph_init_task(p);
 
 	rt_mutex_init_task(p);
+	raw_spin_lock_init(&p->blocked_lock);
 
 	lockdep_assert_irqs_enabled();
 #ifdef CONFIG_PROVE_LOCKING
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 8af9269ce8d9..b5e5ed1a5eef 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -592,6 +592,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 	}
 
 	raw_spin_lock_irqsave(&lock->wait_lock, flags);
+	raw_spin_lock(&current->blocked_lock);
 	/*
 	 * After waiting to acquire the wait_lock, try again.
 	 */
@@ -653,6 +654,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 				goto err;
 		}
 
+		raw_spin_unlock(&current->blocked_lock);
 		raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 		/* Make sure we do wakeups before calling schedule */
 		wake_up_q(&wake_q);
@@ -662,6 +664,8 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 
 		first = __mutex_waiter_is_first(lock, &waiter);
 
+		raw_spin_lock_irqsave(&lock->wait_lock, flags);
+		raw_spin_lock(&current->blocked_lock);
 		/*
 		 * Gets reset by unlock path().
 		 */
@@ -676,15 +680,23 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 			break;
 
 		if (first) {
+			bool acquired;
+
+			/*
+			 * mutex_optimistic_spin() can schedule, so  we need to
+			 * release these locks before calling it.
+			 */
+			raw_spin_unlock(&current->blocked_lock);
+			raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 			trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN);
-			if (mutex_optimistic_spin(lock, ww_ctx, &waiter))
+			acquired = mutex_optimistic_spin(lock, ww_ctx, &waiter);
+			raw_spin_lock_irqsave(&lock->wait_lock, flags);
+			raw_spin_lock(&current->blocked_lock);
+			if (acquired)
 				break;
 			trace_contention_begin(lock, LCB_F_MUTEX);
 		}
-
-		raw_spin_lock_irqsave(&lock->wait_lock, flags);
 	}
-	raw_spin_lock_irqsave(&lock->wait_lock, flags);
 acquired:
 	current->blocked_on = NULL;
 	__set_current_state(TASK_RUNNING);
@@ -711,6 +723,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 	if (ww_ctx)
 		ww_mutex_lock_acquired(ww, ww_ctx);
 
+	raw_spin_unlock(&current->blocked_lock);
 	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 	wake_up_q(&wake_q);
 	preempt_enable();
@@ -723,6 +736,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 err_early_kill:
 	WARN_ON(current->blocked_on);
 	trace_contention_end(lock, ret);
+	raw_spin_unlock(&current->blocked_lock);
 	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 	debug_mutex_free_waiter(&waiter);
 	mutex_release(&lock->dep_map, ip);
@@ -932,8 +946,10 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 		next = waiter->task;
 
 		debug_mutex_wake_waiter(lock, waiter);
+		raw_spin_lock(&next->blocked_lock);
 		WARN_ON(next->blocked_on != lock);
 		next->blocked_on = NULL;
+		raw_spin_unlock(&next->blocked_lock);
 		wake_q_add(&wake_q, next);
 	}
 
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 8bb334491732..2929a95b4272 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -281,6 +281,8 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
 		return false;
 
 	if (waiter->ww_ctx->acquired > 0 && __ww_ctx_less(waiter->ww_ctx, ww_ctx)) {
+		/* nested as we should hold current->blocked_lock already */
+		raw_spin_lock_nested(&waiter->task->blocked_lock, SINGLE_DEPTH_NESTING);
 #ifndef WW_RT
 		debug_mutex_wake_waiter(lock, waiter);
 #endif
@@ -292,6 +294,7 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
 		WARN_ON(waiter->task->blocked_on != lock);
 		waiter->task->blocked_on = NULL;
 		wake_q_add(wake_q, waiter->task);
+		raw_spin_unlock(&waiter->task->blocked_lock);
 	}
 
 	return true;
@@ -339,6 +342,8 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
 		 * wakeup pending to re-read the wounded state.
 		 */
 		if (owner != current) {
+			/* nested as we should hold current->blocked_lock already */
+			raw_spin_lock_nested(&owner->blocked_lock, SINGLE_DEPTH_NESTING);
 			/*
 			 * When waking up the task to wound, be sure to clear the
 			 * blocked_on pointer. Otherwise we can see circular
@@ -346,6 +351,7 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
 			 */
 			owner->blocked_on = NULL;
 			wake_q_add(wake_q, owner);
+			raw_spin_unlock(&owner->blocked_lock);
 		}
 		return true;
 	}
-- 
2.42.0.rc1.204.g551eb34607-goog


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v5 07/19] locking/mutex: Add p->blocked_on wrappers for correctness checks
  2023-08-19  6:08 [PATCH v5 00/19] Proxy Execution: A generalized form of Priority Inheritance v5 John Stultz
                   ` (5 preceding siblings ...)
  2023-08-19  6:08 ` [PATCH v5 06/19] locking/mutex: Add task_struct::blocked_lock to serialize changes to the blocked_on state John Stultz
@ 2023-08-19  6:08 ` John Stultz
  2023-08-19  6:08 ` [PATCH v5 08/19] locking/mutex: Split blocked_on logic into two states (blocked_on and blocked_on_waking) John Stultz
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 29+ messages in thread
From: John Stultz @ 2023-08-19  6:08 UTC (permalink / raw)
  To: LKML
  Cc: Valentin Schneider, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Youssef Esmat, Mel Gorman, Daniel Bristot de Oliveira,
	Will Deacon, Waiman Long, Boqun Feng, Paul E . McKenney,
	kernel-team, Connor O'Brien, John Stultz

From: Valentin Schneider <valentin.schneider@arm.com>

This lets us assert p->blocked_lock is held whenever we access
p->blocked_on, as well as warn us for unexpected state changes.

Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Youssef Esmat <youssefesmat@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: kernel-team@android.com
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
[fix conflicts, call in more places]
Signed-off-by: Connor O'Brien <connoro@google.com>
[jstultz: tweaked commit subject, added get_task_blocked_on() as well]
Signed-off-by: John Stultz <jstultz@google.com>
---
v2:
* Added get_task_blocked_on() accessor
v4:
* Address READ_ONCE usage that was dropped in v2
* Reordered to be a later add on to the main patch series as
  Peter was unhappy with similar wrappers in other patches.
v5:
* Added some extra correctness checking in wrappers
---
 include/linux/sched.h        | 22 ++++++++++++++++++++++
 kernel/locking/mutex-debug.c |  4 ++--
 kernel/locking/mutex.c       | 10 +++++-----
 kernel/locking/ww_mutex.h    |  4 ++--
 4 files changed, 31 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7dded1c09efa..0f32bea47e5e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2233,6 +2233,28 @@ static inline int rwlock_needbreak(rwlock_t *lock)
 #endif
 }
 
+static inline void set_task_blocked_on(struct task_struct *p, struct mutex *m)
+{
+	lockdep_assert_held(&p->blocked_lock);
+
+	/* We should be setting values to NULL or NULL to values */
+	WARN_ON((!m && !p->blocked_on) || (m && p->blocked_on));
+
+	p->blocked_on = m;
+}
+
+static inline struct mutex *get_task_blocked_on(struct task_struct *p)
+{
+	lockdep_assert_held(&p->blocked_lock);
+
+	return p->blocked_on;
+}
+
+static inline struct mutex *get_task_blocked_on_once(struct task_struct *p)
+{
+	return READ_ONCE(p->blocked_on);
+}
+
 static __always_inline bool need_resched(void)
 {
 	return unlikely(tif_need_resched());
diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
index 7228909c3e62..1eedf7c60c00 100644
--- a/kernel/locking/mutex-debug.c
+++ b/kernel/locking/mutex-debug.c
@@ -53,13 +53,13 @@ void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter,
 	lockdep_assert_held(&lock->wait_lock);
 
 	/* Current thread can't be already blocked (since it's executing!) */
-	DEBUG_LOCKS_WARN_ON(task->blocked_on);
+	DEBUG_LOCKS_WARN_ON(get_task_blocked_on(task));
 }
 
 void debug_mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
 			 struct task_struct *task)
 {
-	struct mutex *blocked_on = READ_ONCE(task->blocked_on);
+	struct mutex *blocked_on = get_task_blocked_on_once(task);
 
 	DEBUG_LOCKS_WARN_ON(list_empty(&waiter->list));
 	DEBUG_LOCKS_WARN_ON(waiter->task != task);
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index b5e5ed1a5eef..04b0ea45cc01 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -623,7 +623,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 			goto err_early_kill;
 	}
 
-	current->blocked_on = lock;
+	set_task_blocked_on(current, lock);
 	set_current_state(state);
 	trace_contention_begin(lock, LCB_F_MUTEX);
 	for (;;) {
@@ -669,7 +669,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 		/*
 		 * Gets reset by unlock path().
 		 */
-		current->blocked_on = lock;
+		set_task_blocked_on(current, lock);
 		set_current_state(state);
 		/*
 		 * Here we order against unlock; we must either see it change
@@ -698,7 +698,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 		}
 	}
 acquired:
-	current->blocked_on = NULL;
+	set_task_blocked_on(current, NULL);
 	__set_current_state(TASK_RUNNING);
 
 	if (ww_ctx) {
@@ -730,7 +730,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 	return 0;
 
 err:
-	current->blocked_on = NULL;
+	set_task_blocked_on(current, NULL);
 	__set_current_state(TASK_RUNNING);
 	__mutex_remove_waiter(lock, &waiter);
 err_early_kill:
@@ -948,7 +948,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 		debug_mutex_wake_waiter(lock, waiter);
 		raw_spin_lock(&next->blocked_lock);
 		WARN_ON(next->blocked_on != lock);
-		next->blocked_on = NULL;
+		set_task_blocked_on(current, NULL);
 		raw_spin_unlock(&next->blocked_lock);
 		wake_q_add(&wake_q, next);
 	}
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 2929a95b4272..44a532dda927 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -292,7 +292,7 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
 		 * blocked_on relationships that can't resolve.
 		 */
 		WARN_ON(waiter->task->blocked_on != lock);
-		waiter->task->blocked_on = NULL;
+		set_task_blocked_on(waiter->task, NULL);
 		wake_q_add(wake_q, waiter->task);
 		raw_spin_unlock(&waiter->task->blocked_lock);
 	}
@@ -349,7 +349,7 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
 			 * blocked_on pointer. Otherwise we can see circular
 			 * blocked_on relationships that can't resolve.
 			 */
-			owner->blocked_on = NULL;
+			set_task_blocked_on(owner, NULL);
 			wake_q_add(wake_q, owner);
 			raw_spin_unlock(&owner->blocked_lock);
 		}
-- 
2.42.0.rc1.204.g551eb34607-goog


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v5 08/19] locking/mutex: Split blocked_on logic into two states (blocked_on and blocked_on_waking)
  2023-08-19  6:08 [PATCH v5 00/19] Proxy Execution: A generalized form of Priority Inheritance v5 John Stultz
                   ` (6 preceding siblings ...)
  2023-08-19  6:08 ` [PATCH v5 07/19] locking/mutex: Add p->blocked_on wrappers for correctness checks John Stultz
@ 2023-08-19  6:08 ` John Stultz
  2023-08-19  6:08 ` [PATCH v5 09/19] locking/mutex: Switch to mutex handoffs for CONFIG_PROXY_EXEC John Stultz
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 29+ messages in thread
From: John Stultz @ 2023-08-19  6:08 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Youssef Esmat, Mel Gorman, Daniel Bristot de Oliveira,
	Will Deacon, Waiman Long, Boqun Feng, Paul E . McKenney,
	kernel-team

This patch adds blocked_on_waking so we can track separately if
the task should be able to try to aquire the lock separately
from the lock it is blocked on.

This avoids some of the subtle magic where the blocked_on state
gets cleared, only to have it re-added by the
mutex_lock_slowpath call when it tries to aquire the lock on
wakeup

This should make dealing with the ww_mutex issue cleaner.

Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Youssef Esmat <youssefesmat@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: kernel-team@android.com
Signed-off-by: John Stultz <jstultz@google.com>
---
 include/linux/sched.h     |  2 ++
 kernel/fork.c             |  1 +
 kernel/locking/mutex.c    |  7 ++++---
 kernel/locking/ww_mutex.h | 12 ++++++------
 kernel/sched/sched.h      | 12 ++++++++++++
 5 files changed, 25 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0f32bea47e5e..3b7f26df2496 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1141,6 +1141,7 @@ struct task_struct {
 #endif
 
 	struct mutex			*blocked_on;	/* lock we're blocked on */
+	bool				blocked_on_waking; /* blocked on, but waking */
 	raw_spinlock_t			blocked_lock;
 
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
@@ -2241,6 +2242,7 @@ static inline void set_task_blocked_on(struct task_struct *p, struct mutex *m)
 	WARN_ON((!m && !p->blocked_on) || (m && p->blocked_on));
 
 	p->blocked_on = m;
+	p->blocked_on_waking = false;
 }
 
 static inline struct mutex *get_task_blocked_on(struct task_struct *p)
diff --git a/kernel/fork.c b/kernel/fork.c
index 8bad899b6c6e..5b11ead90b12 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2460,6 +2460,7 @@ __latent_entropy struct task_struct *copy_process(
 #endif
 
 	p->blocked_on = NULL; /* not blocked yet */
+	p->blocked_on_waking = false; /* not blocked yet */
 
 #ifdef CONFIG_BCACHE
 	p->sequential_io	= 0;
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 04b0ea45cc01..687009eca2d1 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -666,10 +666,11 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 
 		raw_spin_lock_irqsave(&lock->wait_lock, flags);
 		raw_spin_lock(&current->blocked_lock);
+
 		/*
-		 * Gets reset by unlock path().
+		 * Clear blocked_on_waking flag set by the unlock path().
 		 */
-		set_task_blocked_on(current, lock);
+		current->blocked_on_waking = false;
 		set_current_state(state);
 		/*
 		 * Here we order against unlock; we must either see it change
@@ -948,7 +949,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 		debug_mutex_wake_waiter(lock, waiter);
 		raw_spin_lock(&next->blocked_lock);
 		WARN_ON(next->blocked_on != lock);
-		set_task_blocked_on(current, NULL);
+		next->blocked_on_waking = true;
 		raw_spin_unlock(&next->blocked_lock);
 		wake_q_add(&wake_q, next);
 	}
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 44a532dda927..3b0a68d7e308 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -287,12 +287,12 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
 		debug_mutex_wake_waiter(lock, waiter);
 #endif
 		/*
-		 * When waking up the task to die, be sure to clear the
-		 * blocked_on pointer. Otherwise we can see circular
+		 * When waking up the task to die, be sure to set the
+		 * blocked_on_waking flag. Otherwise we can see circular
 		 * blocked_on relationships that can't resolve.
 		 */
 		WARN_ON(waiter->task->blocked_on != lock);
-		set_task_blocked_on(waiter->task, NULL);
+		waiter->task->blocked_on_waking = true;
 		wake_q_add(wake_q, waiter->task);
 		raw_spin_unlock(&waiter->task->blocked_lock);
 	}
@@ -345,11 +345,11 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
 			/* nested as we should hold current->blocked_lock already */
 			raw_spin_lock_nested(&owner->blocked_lock, SINGLE_DEPTH_NESTING);
 			/*
-			 * When waking up the task to wound, be sure to clear the
-			 * blocked_on pointer. Otherwise we can see circular
+			 * When waking up the task to wound, be sure to set the
+			 * blocked_on_waking flag. Otherwise we can see circular
 			 * blocked_on relationships that can't resolve.
 			 */
-			set_task_blocked_on(owner, NULL);
+			owner->blocked_on_waking = true;
 			wake_q_add(wake_q, owner);
 			raw_spin_unlock(&owner->blocked_lock);
 		}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 33ad47a093ae..95900ccaaf82 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2111,6 +2111,18 @@ static inline int task_current(struct rq *rq, struct task_struct *p)
 	return rq->curr == p;
 }
 
+#ifdef CONFIG_PROXY_EXEC
+static inline bool task_is_blocked(struct task_struct *p)
+{
+	return !!p->blocked_on && !p->blocked_on_waking;
+}
+#else /* !PROXY_EXEC */
+static inline bool task_is_blocked(struct task_struct *p)
+{
+	return false;
+}
+#endif /* PROXY_EXEC */
+
 static inline int task_on_cpu(struct rq *rq, struct task_struct *p)
 {
 #ifdef CONFIG_SMP
-- 
2.42.0.rc1.204.g551eb34607-goog


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v5 09/19] locking/mutex: Switch to mutex handoffs for CONFIG_PROXY_EXEC
  2023-08-19  6:08 [PATCH v5 00/19] Proxy Execution: A generalized form of Priority Inheritance v5 John Stultz
                   ` (7 preceding siblings ...)
  2023-08-19  6:08 ` [PATCH v5 08/19] locking/mutex: Split blocked_on logic into two states (blocked_on and blocked_on_waking) John Stultz
@ 2023-08-19  6:08 ` John Stultz
  2023-08-19  6:08 ` [PATCH v5 10/19] sched: Split scheduler execution context John Stultz
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 29+ messages in thread
From: John Stultz @ 2023-08-19  6:08 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Youssef Esmat,
	Mel Gorman, Daniel Bristot de Oliveira, Will Deacon, Waiman Long,
	Boqun Feng, Paul E . McKenney, kernel-team, Valentin Schneider,
	Connor O'Brien, John Stultz

From: Peter Zijlstra <peterz@infradead.org>

Just the very earliest framework logic to provide
the config option and switch to using mutex
handoffs.

Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Youssef Esmat <youssefesmat@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: kernel-team@android.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[rebased, added comments and changelog]
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
[Fixed rebase conflicts]
[squashed sched: Ensure blocked_on is always guarded by blocked_lock]
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
[fix rebase conflicts, various fixes & tweaks commented inline]
[squashed sched: Use rq->curr vs rq->proxy checks]
Signed-off-by: Connor O'Brien <connoro@google.com>
[jstultz: Split out only the very basic initial framework
 for proxy logic from a larger patch.]
Signed-off-by: John Stultz <jstultz@google.com>
---
v5:
Split out from core proxy patch
---
 kernel/Kconfig.locks   | 2 +-
 kernel/locking/mutex.c | 5 +++++
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index 4198f0273ecd..791c98f1d329 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -226,7 +226,7 @@ config ARCH_SUPPORTS_ATOMIC_RMW
 
 config MUTEX_SPIN_ON_OWNER
 	def_bool y
-	depends on SMP && ARCH_SUPPORTS_ATOMIC_RMW
+	depends on SMP && ARCH_SUPPORTS_ATOMIC_RMW && !PROXY_EXEC
 
 config RWSEM_SPIN_ON_OWNER
        def_bool y
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 687009eca2d1..5125bbab4119 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -913,6 +913,10 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 
 	mutex_release(&lock->dep_map, ip);
 
+#ifdef CONFIG_PROXY_EXEC
+	/* Always force HANDOFF for Proxy Exec for now. Revisit. */
+	owner = MUTEX_FLAG_HANDOFF;
+#else /* CONFIG_PROXY_EXEC */
 	/*
 	 * Release the lock before (potentially) taking the spinlock such that
 	 * other contenders can get on with things ASAP.
@@ -935,6 +939,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 			return;
 		}
 	}
+#endif /* CONFIG_PROXY_EXEC */
 
 	raw_spin_lock_irqsave(&lock->wait_lock, flags);
 	debug_mutex_unlock(lock);
-- 
2.42.0.rc1.204.g551eb34607-goog


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v5 10/19] sched: Split scheduler execution context
  2023-08-19  6:08 [PATCH v5 00/19] Proxy Execution: A generalized form of Priority Inheritance v5 John Stultz
                   ` (8 preceding siblings ...)
  2023-08-19  6:08 ` [PATCH v5 09/19] locking/mutex: Switch to mutex handoffs for CONFIG_PROXY_EXEC John Stultz
@ 2023-08-19  6:08 ` John Stultz
  2023-08-19  6:08 ` [PATCH v5 11/19] sched: Fix runtime accounting w/ split exec & sched contexts John Stultz
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 29+ messages in thread
From: John Stultz @ 2023-08-19  6:08 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Youssef Esmat,
	Mel Gorman, Daniel Bristot de Oliveira, Will Deacon, Waiman Long,
	Boqun Feng, Paul E . McKenney, kernel-team, Connor O'Brien,
	John Stultz

From: Peter Zijlstra <peterz@infradead.org>

Lets define the scheduling context as all the scheduler state in
task_struct and the execution context as all state required to run
the task.

Currently both are intertwined in task_struct. We want to logically
split these such that we can run the execution context of one task
with the scheduling context of another.

To this purpose introduce rq_selected() macro to point to the
task_struct used for scheduler state and preserve rq->curr to
denote the execution context.

NOTE: Peter previously mentioned he didn't like the name
"rq_selected()", but I've not come up with a better alternative.
I'm very open to other name proposals.

Question for Peter: Dietmar suggested you'd prefer I drop the
conditionalization of the scheduler context pointer on the rq
(so rq_selected() would be open coded as rq->curr_sched or
whatever we agree on for a name), but I'd think in the
!CONFIG_PROXY_EXEC case we'd want to avoid the wasted pointer
and its use (since it curr_sched would always be == curr)?
If I'm wrong I'm fine switching this, but would appreciate
clarification.

Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Youssef Esmat <youssefesmat@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: kernel-team@android.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20181009092434.26221-5-juri.lelli@redhat.com
[add additional comments and update more sched_class code to use
 rq::proxy]
Signed-off-by: Connor O'Brien <connoro@google.com>
[jstultz: Rebased and resolved minor collisions, reworked to use
 accessors, tweaked update_curr_common to use rq_proxy fixing rt
 scheduling issues]
Signed-off-by: John Stultz <jstultz@google.com>
---
v2:
* Reworked to use accessors
* Fixed update_curr_common to use proxy instead of curr
v3:
* Tweaked wrapper names
* Swapped proxy for selected for clarity
v4:
* Minor variable name tweaks for readability
* Use a macro instead of a inline function and drop
  other helper functions as suggested by Peter.
* Remove verbose comments/questions to avoid review
  distractions, as suggested by Dietmar
v5:
* Add CONFIG_PROXY_EXEC option to this patch so the
  new logic can be tested with this change
* Minor fix to grab rq_selected when holding the rq lock
---
 init/Kconfig            |  7 +++++++
 kernel/sched/core.c     | 39 ++++++++++++++++++++++++++-------------
 kernel/sched/deadline.c | 35 ++++++++++++++++++-----------------
 kernel/sched/fair.c     | 18 +++++++++---------
 kernel/sched/rt.c       | 40 ++++++++++++++++++++--------------------
 kernel/sched/sched.h    | 37 +++++++++++++++++++++++++++++++++++--
 6 files changed, 115 insertions(+), 61 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index f7f65af4ee12..a6c83931a9a9 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -907,6 +907,13 @@ config NUMA_BALANCING_DEFAULT_ENABLED
 	  If set, automatic NUMA balancing will be enabled if running on a NUMA
 	  machine.
 
+config PROXY_EXEC
+	bool "Proxy Execution"
+	default n
+	help
+	  This option enables proxy execution, a mechanism for mutex owning
+	  tasks to inherit the scheduling context of higher priority waiters.
+
 menuconfig CGROUPS
 	bool "Control Group support"
 	select KERNFS
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c52c2eba7c73..ea0a7fb144fd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -793,7 +793,7 @@ static enum hrtimer_restart hrtick(struct hrtimer *timer)
 
 	rq_lock(rq, &rf);
 	update_rq_clock(rq);
-	rq->curr->sched_class->task_tick(rq, rq->curr, 1);
+	rq_selected(rq)->sched_class->task_tick(rq, rq_selected(rq), 1);
 	rq_unlock(rq, &rf);
 
 	return HRTIMER_NORESTART;
@@ -2200,16 +2200,18 @@ static inline void check_class_changed(struct rq *rq, struct task_struct *p,
 
 void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
 {
-	if (p->sched_class == rq->curr->sched_class)
-		rq->curr->sched_class->check_preempt_curr(rq, p, flags);
-	else if (sched_class_above(p->sched_class, rq->curr->sched_class))
+	struct task_struct *curr = rq_selected(rq);
+
+	if (p->sched_class == curr->sched_class)
+		curr->sched_class->check_preempt_curr(rq, p, flags);
+	else if (sched_class_above(p->sched_class, curr->sched_class))
 		resched_curr(rq);
 
 	/*
 	 * A queue event has occurred, and we're going to schedule.  In
 	 * this case, we can save a useless back to back clock update.
 	 */
-	if (task_on_rq_queued(rq->curr) && test_tsk_need_resched(rq->curr))
+	if (task_on_rq_queued(curr) && test_tsk_need_resched(rq->curr))
 		rq_clock_skip_update(rq);
 }
 
@@ -2748,7 +2750,7 @@ __do_set_cpus_allowed(struct task_struct *p, struct affinity_context *ctx)
 		lockdep_assert_held(&p->pi_lock);
 
 	queued = task_on_rq_queued(p);
-	running = task_current(rq, p);
+	running = task_current_selected(rq, p);
 
 	if (queued) {
 		/*
@@ -5573,7 +5575,7 @@ unsigned long long task_sched_runtime(struct task_struct *p)
 	 * project cycles that may never be accounted to this
 	 * thread, breaking clock_gettime().
 	 */
-	if (task_current(rq, p) && task_on_rq_queued(p)) {
+	if (task_current_selected(rq, p) && task_on_rq_queued(p)) {
 		prefetch_curr_exec_start(p);
 		update_rq_clock(rq);
 		p->sched_class->update_curr(rq);
@@ -5641,7 +5643,8 @@ void scheduler_tick(void)
 {
 	int cpu = smp_processor_id();
 	struct rq *rq = cpu_rq(cpu);
-	struct task_struct *curr = rq->curr;
+	/* accounting goes to the selected task */
+	struct task_struct *curr;
 	struct rq_flags rf;
 	unsigned long thermal_pressure;
 	u64 resched_latency;
@@ -5652,6 +5655,7 @@ void scheduler_tick(void)
 	sched_clock_tick();
 
 	rq_lock(rq, &rf);
+	curr = rq_selected(rq);
 
 	update_rq_clock(rq);
 	thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
@@ -5742,6 +5746,13 @@ static void sched_tick_remote(struct work_struct *work)
 	if (cpu_is_offline(cpu))
 		goto out_unlock;
 
+	/*
+	 * Since this is a remote tick for full dynticks mode, we are
+	 * always sure that there is no proxy (only a single task is
+	 * running).
+	 */
+	SCHED_WARN_ON(rq->curr != rq_selected(rq));
+
 	update_rq_clock(rq);
 
 	if (!is_idle_task(curr)) {
@@ -6672,6 +6683,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
 	}
 
 	next = pick_next_task(rq, prev, &rf);
+	rq_set_selected(rq, next);
 	clear_tsk_need_resched(prev);
 	clear_preempt_need_resched();
 #ifdef CONFIG_SCHED_DEBUG
@@ -7138,7 +7150,7 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
 
 	prev_class = p->sched_class;
 	queued = task_on_rq_queued(p);
-	running = task_current(rq, p);
+	running = task_current_selected(rq, p);
 	if (queued)
 		dequeue_task(rq, p, queue_flag);
 	if (running)
@@ -7226,7 +7238,7 @@ void set_user_nice(struct task_struct *p, long nice)
 		goto out_unlock;
 	}
 	queued = task_on_rq_queued(p);
-	running = task_current(rq, p);
+	running = task_current_selected(rq, p);
 	if (queued)
 		dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
 	if (running)
@@ -7797,7 +7809,7 @@ static int __sched_setscheduler(struct task_struct *p,
 	}
 
 	queued = task_on_rq_queued(p);
-	running = task_current(rq, p);
+	running = task_current_selected(rq, p);
 	if (queued)
 		dequeue_task(rq, p, queue_flags);
 	if (running)
@@ -9298,6 +9310,7 @@ void __init init_idle(struct task_struct *idle, int cpu)
 	rcu_read_unlock();
 
 	rq->idle = idle;
+	rq_set_selected(rq, idle);
 	rcu_assign_pointer(rq->curr, idle);
 	idle->on_rq = TASK_ON_RQ_QUEUED;
 #ifdef CONFIG_SMP
@@ -9387,7 +9400,7 @@ void sched_setnuma(struct task_struct *p, int nid)
 
 	rq = task_rq_lock(p, &rf);
 	queued = task_on_rq_queued(p);
-	running = task_current(rq, p);
+	running = task_current_selected(rq, p);
 
 	if (queued)
 		dequeue_task(rq, p, DEQUEUE_SAVE);
@@ -10514,7 +10527,7 @@ void sched_move_task(struct task_struct *tsk)
 
 	update_rq_clock(rq);
 
-	running = task_current(rq, tsk);
+	running = task_current_selected(rq, tsk);
 	queued = task_on_rq_queued(tsk);
 
 	if (queued)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 9522e6607754..e8bca6b8da6f 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1174,7 +1174,7 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 #endif
 
 	enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);
-	if (dl_task(rq->curr))
+	if (dl_task(rq_selected(rq)))
 		check_preempt_curr_dl(rq, p, 0);
 	else
 		resched_curr(rq);
@@ -1297,7 +1297,7 @@ static u64 grub_reclaim(u64 delta, struct rq *rq, struct sched_dl_entity *dl_se)
  */
 static void update_curr_dl(struct rq *rq)
 {
-	struct task_struct *curr = rq->curr;
+	struct task_struct *curr = rq_selected(rq);
 	struct sched_dl_entity *dl_se = &curr->dl;
 	s64 delta_exec, scaled_delta_exec;
 	int cpu = cpu_of(rq);
@@ -1810,7 +1810,7 @@ static int find_later_rq(struct task_struct *task);
 static int
 select_task_rq_dl(struct task_struct *p, int cpu, int flags)
 {
-	struct task_struct *curr;
+	struct task_struct *curr, *selected;
 	bool select_rq;
 	struct rq *rq;
 
@@ -1821,6 +1821,7 @@ select_task_rq_dl(struct task_struct *p, int cpu, int flags)
 
 	rcu_read_lock();
 	curr = READ_ONCE(rq->curr); /* unlocked access */
+	selected = READ_ONCE(rq_selected(rq));
 
 	/*
 	 * If we are dealing with a -deadline task, we must
@@ -1831,9 +1832,9 @@ select_task_rq_dl(struct task_struct *p, int cpu, int flags)
 	 * other hand, if it has a shorter deadline, we
 	 * try to make it stay here, it might be important.
 	 */
-	select_rq = unlikely(dl_task(curr)) &&
+	select_rq = unlikely(dl_task(selected)) &&
 		    (curr->nr_cpus_allowed < 2 ||
-		     !dl_entity_preempt(&p->dl, &curr->dl)) &&
+		     !dl_entity_preempt(&p->dl, &selected->dl)) &&
 		    p->nr_cpus_allowed > 1;
 
 	/*
@@ -1896,7 +1897,7 @@ static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
 	 * let's hope p can move out.
 	 */
 	if (rq->curr->nr_cpus_allowed == 1 ||
-	    !cpudl_find(&rq->rd->cpudl, rq->curr, NULL))
+	    !cpudl_find(&rq->rd->cpudl, rq_selected(rq), NULL))
 		return;
 
 	/*
@@ -1935,7 +1936,7 @@ static int balance_dl(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
 static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
 				  int flags)
 {
-	if (dl_entity_preempt(&p->dl, &rq->curr->dl)) {
+	if (dl_entity_preempt(&p->dl, &rq_selected(rq)->dl)) {
 		resched_curr(rq);
 		return;
 	}
@@ -1945,7 +1946,7 @@ static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p,
 	 * In the unlikely case current and p have the same deadline
 	 * let us try to decide what's the best thing to do...
 	 */
-	if ((p->dl.deadline == rq->curr->dl.deadline) &&
+	if ((p->dl.deadline == rq_selected(rq)->dl.deadline) &&
 	    !test_tsk_need_resched(rq->curr))
 		check_preempt_equal_dl(rq, p);
 #endif /* CONFIG_SMP */
@@ -1980,7 +1981,7 @@ static void set_next_task_dl(struct rq *rq, struct task_struct *p, bool first)
 	if (hrtick_enabled_dl(rq))
 		start_hrtick_dl(rq, p);
 
-	if (rq->curr->sched_class != &dl_sched_class)
+	if (rq_selected(rq)->sched_class != &dl_sched_class)
 		update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);
 
 	deadline_queue_push_tasks(rq);
@@ -2297,8 +2298,8 @@ static int push_dl_task(struct rq *rq)
 	 * can move away, it makes sense to just reschedule
 	 * without going further in pushing next_task.
 	 */
-	if (dl_task(rq->curr) &&
-	    dl_time_before(next_task->dl.deadline, rq->curr->dl.deadline) &&
+	if (dl_task(rq_selected(rq)) &&
+	    dl_time_before(next_task->dl.deadline, rq_selected(rq)->dl.deadline) &&
 	    rq->curr->nr_cpus_allowed > 1) {
 		resched_curr(rq);
 		return 0;
@@ -2423,7 +2424,7 @@ static void pull_dl_task(struct rq *this_rq)
 			 * deadline than the current task of its runqueue.
 			 */
 			if (dl_time_before(p->dl.deadline,
-					   src_rq->curr->dl.deadline))
+					   rq_selected(src_rq)->dl.deadline))
 				goto skip;
 
 			if (is_migration_disabled(p)) {
@@ -2462,9 +2463,9 @@ static void task_woken_dl(struct rq *rq, struct task_struct *p)
 	if (!task_on_cpu(rq, p) &&
 	    !test_tsk_need_resched(rq->curr) &&
 	    p->nr_cpus_allowed > 1 &&
-	    dl_task(rq->curr) &&
+	    dl_task(rq_selected(rq)) &&
 	    (rq->curr->nr_cpus_allowed < 2 ||
-	     !dl_entity_preempt(&p->dl, &rq->curr->dl))) {
+	     !dl_entity_preempt(&p->dl, &rq_selected(rq)->dl))) {
 		push_dl_tasks(rq);
 	}
 }
@@ -2639,12 +2640,12 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p)
 		return;
 	}
 
-	if (rq->curr != p) {
+	if (rq_selected(rq) != p) {
 #ifdef CONFIG_SMP
 		if (p->nr_cpus_allowed > 1 && rq->dl.overloaded)
 			deadline_queue_push_tasks(rq);
 #endif
-		if (dl_task(rq->curr))
+		if (dl_task(rq_selected(rq)))
 			check_preempt_curr_dl(rq, p, 0);
 		else
 			resched_curr(rq);
@@ -2673,7 +2674,7 @@ static void prio_changed_dl(struct rq *rq, struct task_struct *p,
 	if (!rq->dl.overloaded)
 		deadline_queue_pull_task(rq);
 
-	if (task_current(rq, p)) {
+	if (task_current_selected(rq, p)) {
 		/*
 		 * If we now have a earlier deadline task than p,
 		 * then reschedule, provided p is still on this
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 80f74ec20fb8..808a91ba6a5e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -919,7 +919,7 @@ static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
  */
 s64 update_curr_common(struct rq *rq)
 {
-	struct task_struct *curr = rq->curr;
+	struct task_struct *curr = rq_selected(rq);
 	s64 delta_exec;
 
 	delta_exec = update_curr_se(rq, &curr->se);
@@ -964,7 +964,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
 
 static void update_curr_fair(struct rq *rq)
 {
-	update_curr(cfs_rq_of(&rq->curr->se));
+	update_curr(cfs_rq_of(&rq_selected(rq)->se));
 }
 
 static inline void
@@ -6248,7 +6248,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
 		s64 delta = slice - ran;
 
 		if (delta < 0) {
-			if (task_current(rq, p))
+			if (task_current_selected(rq, p))
 				resched_curr(rq);
 			return;
 		}
@@ -6263,7 +6263,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
  */
 static void hrtick_update(struct rq *rq)
 {
-	struct task_struct *curr = rq->curr;
+	struct task_struct *curr = rq_selected(rq);
 
 	if (!hrtick_enabled_fair(rq) || curr->sched_class != &fair_sched_class)
 		return;
@@ -7957,7 +7957,7 @@ static void set_skip_buddy(struct sched_entity *se)
  */
 static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
 {
-	struct task_struct *curr = rq->curr;
+	struct task_struct *curr = rq_selected(rq);
 	struct sched_entity *se = &curr->se, *pse = &p->se;
 	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
 	int scale = cfs_rq->nr_running >= sched_nr_latency;
@@ -7991,7 +7991,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	 * prevents us from potentially nominating it as a false LAST_BUDDY
 	 * below.
 	 */
-	if (test_tsk_need_resched(curr))
+	if (test_tsk_need_resched(rq->curr))
 		return;
 
 	/* Idle tasks are by definition preempted by non-idle tasks. */
@@ -8990,7 +8990,7 @@ static bool __update_blocked_others(struct rq *rq, bool *done)
 	 * update_load_avg() can call cpufreq_update_util(). Make sure that RT,
 	 * DL and IRQ signals have been updated before updating CFS.
 	 */
-	curr_class = rq->curr->sched_class;
+	curr_class = rq_selected(rq)->sched_class;
 
 	thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
 
@@ -12239,7 +12239,7 @@ prio_changed_fair(struct rq *rq, struct task_struct *p, int oldprio)
 	 * our priority decreased, or if we are not currently running on
 	 * this runqueue and our priority is higher than the current's
 	 */
-	if (task_current(rq, p)) {
+	if (task_current_selected(rq, p)) {
 		if (p->prio > oldprio)
 			resched_curr(rq);
 	} else
@@ -12384,7 +12384,7 @@ static void switched_to_fair(struct rq *rq, struct task_struct *p)
 		 * kick off the schedule if running, otherwise just see
 		 * if we can still preempt the current task.
 		 */
-		if (task_current(rq, p))
+		if (task_current_selected(rq, p))
 			resched_curr(rq);
 		else
 			check_preempt_curr(rq, p, 0);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 0d0b276c447d..6d9036547c1d 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -574,7 +574,7 @@ static void dequeue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
 
 static void sched_rt_rq_enqueue(struct rt_rq *rt_rq)
 {
-	struct task_struct *curr = rq_of_rt_rq(rt_rq)->curr;
+	struct task_struct *curr = rq_selected(rq_of_rt_rq(rt_rq));
 	struct rq *rq = rq_of_rt_rq(rt_rq);
 	struct sched_rt_entity *rt_se;
 
@@ -1044,7 +1044,7 @@ static int sched_rt_runtime_exceeded(struct rt_rq *rt_rq)
  */
 static void update_curr_rt(struct rq *rq)
 {
-	struct task_struct *curr = rq->curr;
+	struct task_struct *curr = rq_selected(rq);
 	struct sched_rt_entity *rt_se = &curr->rt;
 	s64 delta_exec;
 
@@ -1591,7 +1591,7 @@ static int find_lowest_rq(struct task_struct *task);
 static int
 select_task_rq_rt(struct task_struct *p, int cpu, int flags)
 {
-	struct task_struct *curr;
+	struct task_struct *curr, *selected;
 	struct rq *rq;
 	bool test;
 
@@ -1603,6 +1603,7 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
 
 	rcu_read_lock();
 	curr = READ_ONCE(rq->curr); /* unlocked access */
+	selected = READ_ONCE(rq_selected(rq));
 
 	/*
 	 * If the current task on @p's runqueue is an RT task, then
@@ -1631,8 +1632,8 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
 	 * systems like big.LITTLE.
 	 */
 	test = curr &&
-	       unlikely(rt_task(curr)) &&
-	       (curr->nr_cpus_allowed < 2 || curr->prio <= p->prio);
+	       unlikely(rt_task(selected)) &&
+	       (curr->nr_cpus_allowed < 2 || selected->prio <= p->prio);
 
 	if (test || !rt_task_fits_capacity(p, cpu)) {
 		int target = find_lowest_rq(p);
@@ -1662,12 +1663,8 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
 
 static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
 {
-	/*
-	 * Current can't be migrated, useless to reschedule,
-	 * let's hope p can move out.
-	 */
 	if (rq->curr->nr_cpus_allowed == 1 ||
-	    !cpupri_find(&rq->rd->cpupri, rq->curr, NULL))
+	    !cpupri_find(&rq->rd->cpupri, rq_selected(rq), NULL))
 		return;
 
 	/*
@@ -1710,7 +1707,9 @@ static int balance_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
  */
 static void check_preempt_curr_rt(struct rq *rq, struct task_struct *p, int flags)
 {
-	if (p->prio < rq->curr->prio) {
+	struct task_struct *curr = rq_selected(rq);
+
+	if (p->prio < curr->prio) {
 		resched_curr(rq);
 		return;
 	}
@@ -1728,7 +1727,7 @@ static void check_preempt_curr_rt(struct rq *rq, struct task_struct *p, int flag
 	 * to move current somewhere else, making room for our non-migratable
 	 * task.
 	 */
-	if (p->prio == rq->curr->prio && !test_tsk_need_resched(rq->curr))
+	if (p->prio == curr->prio && !test_tsk_need_resched(rq->curr))
 		check_preempt_equal_prio(rq, p);
 #endif
 }
@@ -1753,7 +1752,7 @@ static inline void set_next_task_rt(struct rq *rq, struct task_struct *p, bool f
 	 * utilization. We only care of the case where we start to schedule a
 	 * rt task
 	 */
-	if (rq->curr->sched_class != &rt_sched_class)
+	if (rq_selected(rq)->sched_class != &rt_sched_class)
 		update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 0);
 
 	rt_queue_push_tasks(rq);
@@ -2034,6 +2033,7 @@ static struct task_struct *pick_next_pushable_task(struct rq *rq)
 
 	BUG_ON(rq->cpu != task_cpu(p));
 	BUG_ON(task_current(rq, p));
+	BUG_ON(task_current_selected(rq, p));
 	BUG_ON(p->nr_cpus_allowed <= 1);
 
 	BUG_ON(!task_on_rq_queued(p));
@@ -2066,7 +2066,7 @@ static int push_rt_task(struct rq *rq, bool pull)
 	 * higher priority than current. If that's the case
 	 * just reschedule current.
 	 */
-	if (unlikely(next_task->prio < rq->curr->prio)) {
+	if (unlikely(next_task->prio < rq_selected(rq)->prio)) {
 		resched_curr(rq);
 		return 0;
 	}
@@ -2419,7 +2419,7 @@ static void pull_rt_task(struct rq *this_rq)
 			 * p if it is lower in priority than the
 			 * current task on the run queue
 			 */
-			if (p->prio < src_rq->curr->prio)
+			if (p->prio < rq_selected(src_rq)->prio)
 				goto skip;
 
 			if (is_migration_disabled(p)) {
@@ -2461,9 +2461,9 @@ static void task_woken_rt(struct rq *rq, struct task_struct *p)
 	bool need_to_push = !task_on_cpu(rq, p) &&
 			    !test_tsk_need_resched(rq->curr) &&
 			    p->nr_cpus_allowed > 1 &&
-			    (dl_task(rq->curr) || rt_task(rq->curr)) &&
+			    (dl_task(rq_selected(rq)) || rt_task(rq_selected(rq))) &&
 			    (rq->curr->nr_cpus_allowed < 2 ||
-			     rq->curr->prio <= p->prio);
+			     rq_selected(rq)->prio <= p->prio);
 
 	if (need_to_push)
 		push_rt_tasks(rq);
@@ -2547,7 +2547,7 @@ static void switched_to_rt(struct rq *rq, struct task_struct *p)
 		if (p->nr_cpus_allowed > 1 && rq->rt.overloaded)
 			rt_queue_push_tasks(rq);
 #endif /* CONFIG_SMP */
-		if (p->prio < rq->curr->prio && cpu_online(cpu_of(rq)))
+		if (p->prio < rq_selected(rq)->prio && cpu_online(cpu_of(rq)))
 			resched_curr(rq);
 	}
 }
@@ -2562,7 +2562,7 @@ prio_changed_rt(struct rq *rq, struct task_struct *p, int oldprio)
 	if (!task_on_rq_queued(p))
 		return;
 
-	if (task_current(rq, p)) {
+	if (task_current_selected(rq, p)) {
 #ifdef CONFIG_SMP
 		/*
 		 * If our priority decreases while running, we
@@ -2588,7 +2588,7 @@ prio_changed_rt(struct rq *rq, struct task_struct *p, int oldprio)
 		 * greater than the current running task
 		 * then reschedule.
 		 */
-		if (p->prio < rq->curr->prio)
+		if (p->prio < rq_selected(rq)->prio)
 			resched_curr(rq);
 	}
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 95900ccaaf82..71d92e75cee0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1008,7 +1008,10 @@ struct rq {
 	 */
 	unsigned int		nr_uninterruptible;
 
-	struct task_struct __rcu	*curr;
+	struct task_struct __rcu	*curr;       /* Execution context */
+#ifdef CONFIG_PROXY_EXEC
+	struct task_struct __rcu	*curr_sched; /* Scheduling context (policy) */
+#endif
 	struct task_struct	*idle;
 	struct task_struct	*stop;
 	unsigned long		next_balance;
@@ -1207,6 +1210,22 @@ DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
 #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
 #define raw_rq()		raw_cpu_ptr(&runqueues)
 
+#ifdef CONFIG_PROXY_EXEC
+#define rq_selected(rq)		((rq)->curr_sched)
+#define cpu_curr_selected(cpu)	(cpu_rq(cpu)->curr_sched)
+static inline void rq_set_selected(struct rq *rq, struct task_struct *t)
+{
+	rcu_assign_pointer(rq->curr_sched, t);
+}
+#else
+#define rq_selected(rq)		((rq)->curr)
+#define cpu_curr_selected(cpu)	(cpu_rq(cpu)->curr)
+static inline void rq_set_selected(struct rq *rq, struct task_struct *t)
+{
+	/* Do nothing */
+}
+#endif
+
 struct sched_group;
 #ifdef CONFIG_SCHED_CORE
 static inline struct cpumask *sched_group_span(struct sched_group *sg);
@@ -2106,11 +2125,25 @@ static inline u64 global_rt_runtime(void)
 	return (u64)sysctl_sched_rt_runtime * NSEC_PER_USEC;
 }
 
+/*
+ * Is p the current execution context?
+ */
 static inline int task_current(struct rq *rq, struct task_struct *p)
 {
 	return rq->curr == p;
 }
 
+/*
+ * Is p the current scheduling context?
+ *
+ * Note that it might be the current execution context at the same time if
+ * rq->curr == rq_selected() == p.
+ */
+static inline int task_current_selected(struct rq *rq, struct task_struct *p)
+{
+	return rq_selected(rq) == p;
+}
+
 #ifdef CONFIG_PROXY_EXEC
 static inline bool task_is_blocked(struct task_struct *p)
 {
@@ -2284,7 +2317,7 @@ struct sched_class {
 
 static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
 {
-	WARN_ON_ONCE(rq->curr != prev);
+	WARN_ON_ONCE(rq_selected(rq) != prev);
 	prev->sched_class->put_prev_task(rq, prev);
 }
 
-- 
2.42.0.rc1.204.g551eb34607-goog


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v5 11/19] sched: Fix runtime accounting w/ split exec & sched contexts
  2023-08-19  6:08 [PATCH v5 00/19] Proxy Execution: A generalized form of Priority Inheritance v5 John Stultz
                   ` (9 preceding siblings ...)
  2023-08-19  6:08 ` [PATCH v5 10/19] sched: Split scheduler execution context John Stultz
@ 2023-08-19  6:08 ` John Stultz
  2023-08-19  6:08 ` [PATCH v5 12/19] sched: Unnest ttwu_runnable in prep for proxy-execution John Stultz
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 29+ messages in thread
From: John Stultz @ 2023-08-19  6:08 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Youssef Esmat, Mel Gorman, Daniel Bristot de Oliveira,
	Will Deacon, Waiman Long, Boqun Feng, Paul E . McKenney,
	kernel-team

The idea here is we want to charge the scheduler-context task's
vruntime but charge the execution-context task's sum_exec_runtime.

This way cputime accounting goes against the task actually running
but vruntime accounting goes against the selected task so we get
proper fairness.

Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Youssef Esmat <youssefesmat@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: kernel-team@android.com
Signed-off-by: John Stultz <jstultz@google.com>
---
 kernel/sched/fair.c | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 808a91ba6a5e..6385693f1da6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -891,22 +891,36 @@ static void update_tg_load_avg(struct cfs_rq *cfs_rq)
 }
 #endif /* CONFIG_SMP */
 
-static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
+static s64 update_curr_se(struct rq *rq, struct sched_entity *se)
 {
 	u64 now = rq_clock_task(rq);
 	s64 delta_exec;
 
-	delta_exec = now - curr->exec_start;
+	/* Calculate the delta from selected se */
+	delta_exec = now - se->exec_start;
 	if (unlikely(delta_exec <= 0))
 		return delta_exec;
 
-	curr->exec_start = now;
-	curr->sum_exec_runtime += delta_exec;
+	/* Update selected se's exec_start */
+	se->exec_start = now;
+	if (entity_is_task(se)) {
+		struct task_struct *running = rq->curr;
+		/*
+		 * If se is a task, we account the time
+		 * against the running task, as w/ proxy-exec
+		 * they may not be the same.
+		 */
+		running->se.exec_start = now;
+		running->se.sum_exec_runtime += delta_exec;
+	} else {
+		/* If not task, account the time against se */
+		se->sum_exec_runtime += delta_exec;
+	}
 
 	if (schedstat_enabled()) {
 		struct sched_statistics *stats;
 
-		stats = __schedstats_from_se(curr);
+		stats = __schedstats_from_se(se);
 		__schedstat_set(stats->exec_max,
 				max(delta_exec, stats->exec_max));
 	}
-- 
2.42.0.rc1.204.g551eb34607-goog


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v5 12/19] sched: Unnest ttwu_runnable in prep for proxy-execution
  2023-08-19  6:08 [PATCH v5 00/19] Proxy Execution: A generalized form of Priority Inheritance v5 John Stultz
                   ` (10 preceding siblings ...)
  2023-08-19  6:08 ` [PATCH v5 11/19] sched: Fix runtime accounting w/ split exec & sched contexts John Stultz
@ 2023-08-19  6:08 ` John Stultz
  2023-08-19  6:08 ` [PATCH v5 13/19] sched: Split out __sched() deactivate task logic into a helper John Stultz
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 29+ messages in thread
From: John Stultz @ 2023-08-19  6:08 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Youssef Esmat, Mel Gorman, Daniel Bristot de Oliveira,
	Will Deacon, Waiman Long, Boqun Feng, Paul E . McKenney,
	kernel-team

Slightly rework ttwu_runnable to minimize the nesting to help
make the proxy-execution changes easier to read.

Should be no logical change here.

Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Youssef Esmat <youssefesmat@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: kernel-team@android.com
Signed-off-by: John Stultz <jstultz@google.com>
---
 kernel/sched/core.c | 24 +++++++++++++-----------
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ea0a7fb144fd..2d957a14104a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3840,18 +3840,20 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
 	int ret = 0;
 
 	rq = __task_rq_lock(p, &rf);
-	if (task_on_rq_queued(p)) {
-		if (!task_on_cpu(rq, p)) {
-			/*
-			 * When on_rq && !on_cpu the task is preempted, see if
-			 * it should preempt the task that is current now.
-			 */
-			update_rq_clock(rq);
-			check_preempt_curr(rq, p, wake_flags);
-		}
-		ttwu_do_wakeup(p);
-		ret = 1;
+	if (!task_on_rq_queued(p))
+		goto out_unlock;
+
+	if (!task_on_cpu(rq, p)) {
+		/*
+		 * When on_rq && !on_cpu the task is preempted, see if
+		 * it should preempt the task that is current now.
+		 */
+		update_rq_clock(rq);
+		check_preempt_curr(rq, p, wake_flags);
 	}
+	ttwu_do_wakeup(p);
+	ret = 1;
+out_unlock:
 	__task_rq_unlock(rq, &rf);
 
 	return ret;
-- 
2.42.0.rc1.204.g551eb34607-goog


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v5 13/19] sched: Split out __sched() deactivate task logic into a helper
  2023-08-19  6:08 [PATCH v5 00/19] Proxy Execution: A generalized form of Priority Inheritance v5 John Stultz
                   ` (11 preceding siblings ...)
  2023-08-19  6:08 ` [PATCH v5 12/19] sched: Unnest ttwu_runnable in prep for proxy-execution John Stultz
@ 2023-08-19  6:08 ` John Stultz
  2023-08-23 21:12   ` kernel test robot
                     ` (2 more replies)
  2023-08-19  6:08 ` [PATCH v5 14/19] sched: Add a very simple proxy() function John Stultz
                   ` (5 subsequent siblings)
  18 siblings, 3 replies; 29+ messages in thread
From: John Stultz @ 2023-08-19  6:08 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Youssef Esmat, Mel Gorman, Daniel Bristot de Oliveira,
	Will Deacon, Waiman Long, Boqun Feng, Paul E . McKenney,
	kernel-team

As we're going to re-use the deactivation logic,
split it into a helper.

Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Youssef Esmat <youssefesmat@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: kernel-team@android.com
Signed-off-by: John Stultz <jstultz@google.com>
---
 kernel/sched/core.c | 65 +++++++++++++++++++++++++--------------------
 1 file changed, 36 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d957a14104a..76a42f21dda7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6562,6 +6562,41 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 # define SM_MASK_PREEMPT	SM_PREEMPT
 #endif
 
+bool try_to_deactivate_task(struct rq *rq, struct task_struct *p, unsigned long state)
+{
+	if (signal_pending_state(state, p)) {
+		WRITE_ONCE(p->__state, TASK_RUNNING);
+	} else {
+		p->sched_contributes_to_load =
+			(state & TASK_UNINTERRUPTIBLE) &&
+			!(state & TASK_NOLOAD) &&
+			!(state & TASK_FROZEN);
+
+		if (p->sched_contributes_to_load)
+			rq->nr_uninterruptible++;
+
+		/*
+		 * __schedule()			ttwu()
+		 *   prev_state = prev->state;    if (p->on_rq && ...)
+		 *   if (prev_state)		    goto out;
+		 *     p->on_rq = 0;		  smp_acquire__after_ctrl_dep();
+		 *				  p->state = TASK_WAKING
+		 *
+		 * Where __schedule() and ttwu() have matching control dependencies.
+		 *
+		 * After this, schedule() must not care about p->state any more.
+		 */
+		deactivate_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
+
+		if (p->in_iowait) {
+			atomic_inc(&rq->nr_iowait);
+			delayacct_blkio_start();
+		}
+		return true;
+	}
+	return false;
+}
+
 /*
  * __schedule() is the main scheduler function.
  *
@@ -6652,35 +6687,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
 	 */
 	prev_state = READ_ONCE(prev->__state);
 	if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) {
-		if (signal_pending_state(prev_state, prev)) {
-			WRITE_ONCE(prev->__state, TASK_RUNNING);
-		} else {
-			prev->sched_contributes_to_load =
-				(prev_state & TASK_UNINTERRUPTIBLE) &&
-				!(prev_state & TASK_NOLOAD) &&
-				!(prev_state & TASK_FROZEN);
-
-			if (prev->sched_contributes_to_load)
-				rq->nr_uninterruptible++;
-
-			/*
-			 * __schedule()			ttwu()
-			 *   prev_state = prev->state;    if (p->on_rq && ...)
-			 *   if (prev_state)		    goto out;
-			 *     p->on_rq = 0;		  smp_acquire__after_ctrl_dep();
-			 *				  p->state = TASK_WAKING
-			 *
-			 * Where __schedule() and ttwu() have matching control dependencies.
-			 *
-			 * After this, schedule() must not care about p->state any more.
-			 */
-			deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
-
-			if (prev->in_iowait) {
-				atomic_inc(&rq->nr_iowait);
-				delayacct_blkio_start();
-			}
-		}
+		try_to_deactivate_task(rq, prev, prev_state);
 		switch_count = &prev->nvcsw;
 	}
 
-- 
2.42.0.rc1.204.g551eb34607-goog


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 13/19] sched: Split out __sched() deactivate task logic into a helper
  2023-08-19  6:08 ` [PATCH v5 13/19] sched: Split out __sched() deactivate task logic into a helper John Stultz
@ 2023-08-23 21:12   ` kernel test robot
  2023-08-23 21:25     ` John Stultz
  2023-08-24  0:00   ` kernel test robot
  2023-08-24  0:34   ` kernel test robot
  2 siblings, 1 reply; 29+ messages in thread
From: kernel test robot @ 2023-08-23 21:12 UTC (permalink / raw)
  To: John Stultz, LKML
  Cc: oe-kbuild-all, John Stultz, Joel Fernandes, Qais Yousef,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Youssef Esmat, Mel Gorman, Daniel Bristot de Oliveira,
	Will Deacon, Waiman Long, Boqun Feng, Paul E . McKenney,
	kernel-team

Hi John,

kernel test robot noticed the following build warnings:

[auto build test WARNING on tip/locking/core]
[also build test WARNING on linus/master tip/auto-latest v6.5-rc7]
[cannot apply to tip/sched/core tip/master next-20230823]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/John-Stultz/sched-Unify-runtime-accounting-across-classes/20230821-121604
base:   tip/locking/core
patch link:    https://lore.kernel.org/r/20230819060915.3001568-14-jstultz%40google.com
patch subject: [PATCH v5 13/19] sched: Split out __sched() deactivate task logic into a helper
config: riscv-allnoconfig (https://download.01.org/0day-ci/archive/20230824/202308240439.2aDXO6Ks-lkp@intel.com/config)
compiler: riscv64-linux-gcc (GCC) 13.2.0
reproduce: (https://download.01.org/0day-ci/archive/20230824/202308240439.2aDXO6Ks-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202308240439.2aDXO6Ks-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> kernel/sched/core.c:6565:6: warning: no previous prototype for 'try_to_deactivate_task' [-Wmissing-prototypes]
    6565 | bool try_to_deactivate_task(struct rq *rq, struct task_struct *p, unsigned long state)
         |      ^~~~~~~~~~~~~~~~~~~~~~


vim +/try_to_deactivate_task +6565 kernel/sched/core.c

  6564	
> 6565	bool try_to_deactivate_task(struct rq *rq, struct task_struct *p, unsigned long state)
  6566	{
  6567		if (signal_pending_state(state, p)) {
  6568			WRITE_ONCE(p->__state, TASK_RUNNING);
  6569		} else {
  6570			p->sched_contributes_to_load =
  6571				(state & TASK_UNINTERRUPTIBLE) &&
  6572				!(state & TASK_NOLOAD) &&
  6573				!(state & TASK_FROZEN);
  6574	
  6575			if (p->sched_contributes_to_load)
  6576				rq->nr_uninterruptible++;
  6577	
  6578			/*
  6579			 * __schedule()			ttwu()
  6580			 *   prev_state = prev->state;    if (p->on_rq && ...)
  6581			 *   if (prev_state)		    goto out;
  6582			 *     p->on_rq = 0;		  smp_acquire__after_ctrl_dep();
  6583			 *				  p->state = TASK_WAKING
  6584			 *
  6585			 * Where __schedule() and ttwu() have matching control dependencies.
  6586			 *
  6587			 * After this, schedule() must not care about p->state any more.
  6588			 */
  6589			deactivate_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
  6590	
  6591			if (p->in_iowait) {
  6592				atomic_inc(&rq->nr_iowait);
  6593				delayacct_blkio_start();
  6594			}
  6595			return true;
  6596		}
  6597		return false;
  6598	}
  6599	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 13/19] sched: Split out __sched() deactivate task logic into a helper
  2023-08-23 21:12   ` kernel test robot
@ 2023-08-23 21:25     ` John Stultz
  0 siblings, 0 replies; 29+ messages in thread
From: John Stultz @ 2023-08-23 21:25 UTC (permalink / raw)
  To: kernel test robot
  Cc: LKML, oe-kbuild-all, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Youssef Esmat,
	Mel Gorman, Daniel Bristot de Oliveira, Will Deacon, Waiman Long,
	Boqun Feng, Paul E . McKenney, kernel-team

On Wed, Aug 23, 2023 at 2:13 PM kernel test robot <lkp@intel.com> wrote:
>
> Hi John,
>
> kernel test robot noticed the following build warnings:
>
> [auto build test WARNING on tip/locking/core]
> [also build test WARNING on linus/master tip/auto-latest v6.5-rc7]
> [cannot apply to tip/sched/core tip/master next-20230823]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>
> url:    https://github.com/intel-lab-lkp/linux/commits/John-Stultz/sched-Unify-runtime-accounting-across-classes/20230821-121604
> base:   tip/locking/core
> patch link:    https://lore.kernel.org/r/20230819060915.3001568-14-jstultz%40google.com
> patch subject: [PATCH v5 13/19] sched: Split out __sched() deactivate task logic into a helper
> config: riscv-allnoconfig (https://download.01.org/0day-ci/archive/20230824/202308240439.2aDXO6Ks-lkp@intel.com/config)
> compiler: riscv64-linux-gcc (GCC) 13.2.0
> reproduce: (https://download.01.org/0day-ci/archive/20230824/202308240439.2aDXO6Ks-lkp@intel.com/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202308240439.2aDXO6Ks-lkp@intel.com/
>
> All warnings (new ones prefixed by >>):
>
> >> kernel/sched/core.c:6565:6: warning: no previous prototype for 'try_to_deactivate_task' [-Wmissing-prototypes]
>     6565 | bool try_to_deactivate_task(struct rq *rq, struct task_struct *p, unsigned long state)
>          |      ^~~~~~~~~~~~~~~~~~~~~~

Ah, thanks for pointing this out. Switched it to being static.

thanks
-john

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 13/19] sched: Split out __sched() deactivate task logic into a helper
  2023-08-19  6:08 ` [PATCH v5 13/19] sched: Split out __sched() deactivate task logic into a helper John Stultz
  2023-08-23 21:12   ` kernel test robot
@ 2023-08-24  0:00   ` kernel test robot
  2023-08-24  0:34   ` kernel test robot
  2 siblings, 0 replies; 29+ messages in thread
From: kernel test robot @ 2023-08-24  0:00 UTC (permalink / raw)
  To: John Stultz, LKML
  Cc: llvm, oe-kbuild-all, John Stultz, Joel Fernandes, Qais Yousef,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Youssef Esmat, Mel Gorman, Daniel Bristot de Oliveira,
	Will Deacon, Waiman Long, Boqun Feng, Paul E . McKenney,
	kernel-team

Hi John,

kernel test robot noticed the following build warnings:

[auto build test WARNING on tip/locking/core]
[also build test WARNING on linus/master tip/auto-latest v6.5-rc7]
[cannot apply to tip/sched/core tip/master next-20230823]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/John-Stultz/sched-Unify-runtime-accounting-across-classes/20230821-121604
base:   tip/locking/core
patch link:    https://lore.kernel.org/r/20230819060915.3001568-14-jstultz%40google.com
patch subject: [PATCH v5 13/19] sched: Split out __sched() deactivate task logic into a helper
config: um-allnoconfig (https://download.01.org/0day-ci/archive/20230824/202308240712.RzqTiO81-lkp@intel.com/config)
compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project.git 4a5ac14ee968ff0ad5d2cc1ffa0299048db4c88a)
reproduce: (https://download.01.org/0day-ci/archive/20230824/202308240712.RzqTiO81-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202308240712.RzqTiO81-lkp@intel.com/

All warnings (new ones prefixed by >>):

   In file included from kernel/sched/core.c:9:
   In file included from include/linux/highmem.h:12:
   In file included from include/linux/hardirq.h:11:
   In file included from arch/um/include/asm/hardirq.h:5:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/um/include/asm/io.h:24:
   include/asm-generic/io.h:547:31: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     547 |         val = __raw_readb(PCI_IOBASE + addr);
         |                           ~~~~~~~~~~ ^
   include/asm-generic/io.h:560:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     560 |         val = __le16_to_cpu((__le16 __force)__raw_readw(PCI_IOBASE + addr));
         |                                                         ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/little_endian.h:37:51: note: expanded from macro '__le16_to_cpu'
      37 | #define __le16_to_cpu(x) ((__force __u16)(__le16)(x))
         |                                                   ^
   In file included from kernel/sched/core.c:9:
   In file included from include/linux/highmem.h:12:
   In file included from include/linux/hardirq.h:11:
   In file included from arch/um/include/asm/hardirq.h:5:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/um/include/asm/io.h:24:
   include/asm-generic/io.h:573:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     573 |         val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + addr));
         |                                                         ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/little_endian.h:35:51: note: expanded from macro '__le32_to_cpu'
      35 | #define __le32_to_cpu(x) ((__force __u32)(__le32)(x))
         |                                                   ^
   In file included from kernel/sched/core.c:9:
   In file included from include/linux/highmem.h:12:
   In file included from include/linux/hardirq.h:11:
   In file included from arch/um/include/asm/hardirq.h:5:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/um/include/asm/io.h:24:
   include/asm-generic/io.h:584:33: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     584 |         __raw_writeb(value, PCI_IOBASE + addr);
         |                             ~~~~~~~~~~ ^
   include/asm-generic/io.h:594:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     594 |         __raw_writew((u16 __force)cpu_to_le16(value), PCI_IOBASE + addr);
         |                                                       ~~~~~~~~~~ ^
   include/asm-generic/io.h:604:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     604 |         __raw_writel((u32 __force)cpu_to_le32(value), PCI_IOBASE + addr);
         |                                                       ~~~~~~~~~~ ^
   include/asm-generic/io.h:692:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     692 |         readsb(PCI_IOBASE + addr, buffer, count);
         |                ~~~~~~~~~~ ^
   include/asm-generic/io.h:700:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     700 |         readsw(PCI_IOBASE + addr, buffer, count);
         |                ~~~~~~~~~~ ^
   include/asm-generic/io.h:708:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     708 |         readsl(PCI_IOBASE + addr, buffer, count);
         |                ~~~~~~~~~~ ^
   include/asm-generic/io.h:717:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     717 |         writesb(PCI_IOBASE + addr, buffer, count);
         |                 ~~~~~~~~~~ ^
   include/asm-generic/io.h:726:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     726 |         writesw(PCI_IOBASE + addr, buffer, count);
         |                 ~~~~~~~~~~ ^
   include/asm-generic/io.h:735:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     735 |         writesl(PCI_IOBASE + addr, buffer, count);
         |                 ~~~~~~~~~~ ^
>> kernel/sched/core.c:6565:6: warning: no previous prototype for function 'try_to_deactivate_task' [-Wmissing-prototypes]
    6565 | bool try_to_deactivate_task(struct rq *rq, struct task_struct *p, unsigned long state)
         |      ^
   kernel/sched/core.c:6565:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
    6565 | bool try_to_deactivate_task(struct rq *rq, struct task_struct *p, unsigned long state)
         | ^
         | static 
   kernel/sched/core.c:3697:20: warning: unused function 'rq_has_pinned_tasks' [-Wunused-function]
    3697 | static inline bool rq_has_pinned_tasks(struct rq *rq)
         |                    ^
   kernel/sched/core.c:5834:20: warning: unused function 'sched_tick_start' [-Wunused-function]
    5834 | static inline void sched_tick_start(int cpu) { }
         |                    ^
   kernel/sched/core.c:5835:20: warning: unused function 'sched_tick_stop' [-Wunused-function]
    5835 | static inline void sched_tick_stop(int cpu) { }
         |                    ^
   kernel/sched/core.c:6535:20: warning: unused function 'sched_core_cpu_starting' [-Wunused-function]
    6535 | static inline void sched_core_cpu_starting(unsigned int cpu) {}
         |                    ^
   kernel/sched/core.c:6536:20: warning: unused function 'sched_core_cpu_deactivate' [-Wunused-function]
    6536 | static inline void sched_core_cpu_deactivate(unsigned int cpu) {}
         |                    ^
   kernel/sched/core.c:6537:20: warning: unused function 'sched_core_cpu_dying' [-Wunused-function]
    6537 | static inline void sched_core_cpu_dying(unsigned int cpu) {}
         |                    ^
   19 warnings generated.


vim +/try_to_deactivate_task +6565 kernel/sched/core.c

  6564	
> 6565	bool try_to_deactivate_task(struct rq *rq, struct task_struct *p, unsigned long state)
  6566	{
  6567		if (signal_pending_state(state, p)) {
  6568			WRITE_ONCE(p->__state, TASK_RUNNING);
  6569		} else {
  6570			p->sched_contributes_to_load =
  6571				(state & TASK_UNINTERRUPTIBLE) &&
  6572				!(state & TASK_NOLOAD) &&
  6573				!(state & TASK_FROZEN);
  6574	
  6575			if (p->sched_contributes_to_load)
  6576				rq->nr_uninterruptible++;
  6577	
  6578			/*
  6579			 * __schedule()			ttwu()
  6580			 *   prev_state = prev->state;    if (p->on_rq && ...)
  6581			 *   if (prev_state)		    goto out;
  6582			 *     p->on_rq = 0;		  smp_acquire__after_ctrl_dep();
  6583			 *				  p->state = TASK_WAKING
  6584			 *
  6585			 * Where __schedule() and ttwu() have matching control dependencies.
  6586			 *
  6587			 * After this, schedule() must not care about p->state any more.
  6588			 */
  6589			deactivate_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
  6590	
  6591			if (p->in_iowait) {
  6592				atomic_inc(&rq->nr_iowait);
  6593				delayacct_blkio_start();
  6594			}
  6595			return true;
  6596		}
  6597		return false;
  6598	}
  6599	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 13/19] sched: Split out __sched() deactivate task logic into a helper
  2023-08-19  6:08 ` [PATCH v5 13/19] sched: Split out __sched() deactivate task logic into a helper John Stultz
  2023-08-23 21:12   ` kernel test robot
  2023-08-24  0:00   ` kernel test robot
@ 2023-08-24  0:34   ` kernel test robot
  2 siblings, 0 replies; 29+ messages in thread
From: kernel test robot @ 2023-08-24  0:34 UTC (permalink / raw)
  To: John Stultz, LKML
  Cc: oe-kbuild-all, John Stultz, Joel Fernandes, Qais Yousef,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Youssef Esmat, Mel Gorman, Daniel Bristot de Oliveira,
	Will Deacon, Waiman Long, Boqun Feng, Paul E . McKenney,
	kernel-team

Hi John,

kernel test robot noticed the following build warnings:

[auto build test WARNING on tip/locking/core]
[also build test WARNING on linus/master tip/auto-latest v6.5-rc7]
[cannot apply to tip/sched/core tip/master next-20230823]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/John-Stultz/sched-Unify-runtime-accounting-across-classes/20230821-121604
base:   tip/locking/core
patch link:    https://lore.kernel.org/r/20230819060915.3001568-14-jstultz%40google.com
patch subject: [PATCH v5 13/19] sched: Split out __sched() deactivate task logic into a helper
config: um-i386_defconfig (https://download.01.org/0day-ci/archive/20230824/202308240824.ugWLq8LH-lkp@intel.com/config)
compiler: gcc-7 (Ubuntu 7.5.0-6ubuntu2) 7.5.0
reproduce: (https://download.01.org/0day-ci/archive/20230824/202308240824.ugWLq8LH-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202308240824.ugWLq8LH-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> kernel/sched/core.c:6565:6: warning: no previous declaration for 'try_to_deactivate_task' [-Wmissing-declarations]
    bool try_to_deactivate_task(struct rq *rq, struct task_struct *p, unsigned long state)
         ^~~~~~~~~~~~~~~~~~~~~~


vim +/try_to_deactivate_task +6565 kernel/sched/core.c

  6564	
> 6565	bool try_to_deactivate_task(struct rq *rq, struct task_struct *p, unsigned long state)
  6566	{
  6567		if (signal_pending_state(state, p)) {
  6568			WRITE_ONCE(p->__state, TASK_RUNNING);
  6569		} else {
  6570			p->sched_contributes_to_load =
  6571				(state & TASK_UNINTERRUPTIBLE) &&
  6572				!(state & TASK_NOLOAD) &&
  6573				!(state & TASK_FROZEN);
  6574	
  6575			if (p->sched_contributes_to_load)
  6576				rq->nr_uninterruptible++;
  6577	
  6578			/*
  6579			 * __schedule()			ttwu()
  6580			 *   prev_state = prev->state;    if (p->on_rq && ...)
  6581			 *   if (prev_state)		    goto out;
  6582			 *     p->on_rq = 0;		  smp_acquire__after_ctrl_dep();
  6583			 *				  p->state = TASK_WAKING
  6584			 *
  6585			 * Where __schedule() and ttwu() have matching control dependencies.
  6586			 *
  6587			 * After this, schedule() must not care about p->state any more.
  6588			 */
  6589			deactivate_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
  6590	
  6591			if (p->in_iowait) {
  6592				atomic_inc(&rq->nr_iowait);
  6593				delayacct_blkio_start();
  6594			}
  6595			return true;
  6596		}
  6597		return false;
  6598	}
  6599	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v5 14/19] sched: Add a very simple proxy() function
  2023-08-19  6:08 [PATCH v5 00/19] Proxy Execution: A generalized form of Priority Inheritance v5 John Stultz
                   ` (12 preceding siblings ...)
  2023-08-19  6:08 ` [PATCH v5 13/19] sched: Split out __sched() deactivate task logic into a helper John Stultz
@ 2023-08-19  6:08 ` John Stultz
  2023-08-19  6:08 ` [PATCH v5 15/19] sched: Add proxy deactivate helper John Stultz
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 29+ messages in thread
From: John Stultz @ 2023-08-19  6:08 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Youssef Esmat, Mel Gorman, Daniel Bristot de Oliveira,
	Will Deacon, Waiman Long, Boqun Feng, Paul E . McKenney,
	kernel-team

This adds a very simple proxy() function so if
we select a blocked task to run, we will deactivate it
and pick again. The exception being if it has become
unblocked after proxy() was called.

Greatly simplified from patch by:
  Peter Zijlstra (Intel) <peterz@infradead.org>
  Juri Lelli <juri.lelli@redhat.com>
  Valentin Schneider <valentin.schneider@arm.com>
  Connor O'Brien <connoro@google.com>

Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Youssef Esmat <youssefesmat@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: kernel-team@android.com
[jstultz: Split out from larger proxy patch and simplified
 for review and testing.]
Signed-off-by: John Stultz <jstultz@google.com>
---
v5:
* Split out from larger proxy patch
---
 kernel/sched/core.c | 89 +++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/rt.c   | 19 +++++++++-
 2 files changed, 102 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 76a42f21dda7..72d0803c7d47 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6562,11 +6562,12 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 # define SM_MASK_PREEMPT	SM_PREEMPT
 #endif
 
-bool try_to_deactivate_task(struct rq *rq, struct task_struct *p, unsigned long state)
+bool try_to_deactivate_task(struct rq *rq, struct task_struct *p,
+			    unsigned long state, bool deactivate_cond)
 {
 	if (signal_pending_state(state, p)) {
 		WRITE_ONCE(p->__state, TASK_RUNNING);
-	} else {
+	} else if (deactivate_cond) {
 		p->sched_contributes_to_load =
 			(state & TASK_UNINTERRUPTIBLE) &&
 			!(state & TASK_NOLOAD) &&
@@ -6597,6 +6598,74 @@ bool try_to_deactivate_task(struct rq *rq, struct task_struct *p, unsigned long
 	return false;
 }
 
+#ifdef CONFIG_PROXY_EXEC
+/*
+ * Initial simple proxy that just returns the task if its waking
+ * or deactivates the blocked task so we can pick something that
+ * isn't blocked.
+ */
+static struct task_struct *
+proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
+{
+	struct task_struct *p = next;
+	struct mutex *mutex;
+	unsigned long state;
+
+	mutex = p->blocked_on;
+	/* Something changed in the chain, pick_again */
+	if (!mutex)
+		return NULL;
+	/*
+	 * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
+	 * and ensure @owner sticks around.
+	 */
+	raw_spin_lock(&mutex->wait_lock);
+	raw_spin_lock(&p->blocked_lock);
+
+	/* Check again that p is blocked with blocked_lock held */
+	if (!task_is_blocked(p) || mutex != p->blocked_on) {
+		/*
+		 * Something changed in the blocked_on chain and
+		 * we don't know if only at this level. So, let's
+		 * just bail out completely and let __schedule
+		 * figure things out (pick_again loop).
+		 */
+		raw_spin_unlock(&p->blocked_lock);
+		raw_spin_unlock(&mutex->wait_lock);
+		return NULL;
+	}
+
+	state = READ_ONCE(p->__state);
+	/* Don't deactivate if the state has been changed to TASK_RUNNING */
+	if (!state) {
+		raw_spin_unlock(&p->blocked_lock);
+		raw_spin_unlock(&mutex->wait_lock);
+		return p;
+	}
+
+	try_to_deactivate_task(rq, next, state, true);
+
+	/*
+	 * If next is the selected task, then remove lingering
+	 * references to it from rq and sched_class structs after
+	 * dequeueing.
+	 */
+	put_prev_task(rq, next);
+	rq_set_selected(rq, rq->idle);
+	resched_curr(rq);
+	raw_spin_unlock(&p->blocked_lock);
+	raw_spin_unlock(&mutex->wait_lock);
+	return NULL;
+}
+#else /* PROXY_EXEC */
+static struct task_struct *
+proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
+{
+	BUG(); // This should never be called in the !PROXY case
+	return next;
+}
+#endif /* PROXY_EXEC */
+
 /*
  * __schedule() is the main scheduler function.
  *
@@ -6687,12 +6756,24 @@ static void __sched notrace __schedule(unsigned int sched_mode)
 	 */
 	prev_state = READ_ONCE(prev->__state);
 	if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) {
-		try_to_deactivate_task(rq, prev, prev_state);
+		try_to_deactivate_task(rq, prev, prev_state,
+				       !task_is_blocked(prev));
 		switch_count = &prev->nvcsw;
 	}
 
-	next = pick_next_task(rq, prev, &rf);
+pick_again:
+	next = pick_next_task(rq, rq_selected(rq), &rf);
 	rq_set_selected(rq, next);
+	if (unlikely(task_is_blocked(next))) {
+		next = proxy(rq, next, &rf);
+		if (!next) {
+			rq_unpin_lock(rq, &rf);
+			__balance_callbacks(rq);
+			rq_repin_lock(rq, &rf);
+			goto pick_again;
+		}
+	}
+
 	clear_tsk_need_resched(prev);
 	clear_preempt_need_resched();
 #ifdef CONFIG_SCHED_DEBUG
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 6d9036547c1d..cfbf3925e595 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1537,8 +1537,19 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
 
 	enqueue_rt_entity(rt_se, flags);
 
-	if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
-		enqueue_pushable_task(rq, p);
+	/*
+	 * Current can't be pushed away. Selected is tied to current,
+	 * so don't push it either.
+	 */
+	if (task_current(rq, p) || task_current_selected(rq, p))
+		return;
+	/*
+	 * Pinned tasks can't be pushed.
+	 */
+	if (p->nr_cpus_allowed == 1)
+		return;
+
+	enqueue_pushable_task(rq, p);
 }
 
 static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
@@ -1825,6 +1836,10 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
 
 	update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 1);
 
+	/* Avoid marking selected as pushable */
+	if (task_current_selected(rq, p))
+		return;
+
 	/*
 	 * The previous task needs to be made eligible for pushing
 	 * if it is still active
-- 
2.42.0.rc1.204.g551eb34607-goog


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v5 15/19] sched: Add proxy deactivate helper
  2023-08-19  6:08 [PATCH v5 00/19] Proxy Execution: A generalized form of Priority Inheritance v5 John Stultz
                   ` (13 preceding siblings ...)
  2023-08-19  6:08 ` [PATCH v5 14/19] sched: Add a very simple proxy() function John Stultz
@ 2023-08-19  6:08 ` John Stultz
  2023-08-24 11:34   ` kernel test robot
  2023-08-19  6:08 ` [PATCH v5 16/19] sched: Fix proxy/current (push,pull)ability John Stultz
                   ` (3 subsequent siblings)
  18 siblings, 1 reply; 29+ messages in thread
From: John Stultz @ 2023-08-19  6:08 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Youssef Esmat, Mel Gorman, Daniel Bristot de Oliveira,
	Will Deacon, Waiman Long, Boqun Feng, Paul E . McKenney,
	kernel-team

Add small helper for deactivating the selected task

Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Youssef Esmat <youssefesmat@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: kernel-team@android.com
Signed-off-by: John Stultz <jstultz@google.com>
---
 kernel/sched/core.c | 43 +++++++++++++++++++++----------------------
 1 file changed, 21 insertions(+), 22 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 72d0803c7d47..bee7082b294f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6599,6 +6599,22 @@ bool try_to_deactivate_task(struct rq *rq, struct task_struct *p,
 }
 
 #ifdef CONFIG_PROXY_EXEC
+
+bool proxy_deactivate(struct rq *rq, struct task_struct *next)
+{
+	unsigned long state = READ_ONCE(next->__state);
+
+	/* Don't deactivate if the state has been changed to TASK_RUNNING */
+	if (!state)
+		return false;
+	if (!try_to_deactivate_task(rq, next, state, true))
+		return false;
+	put_prev_task(rq, next);
+	rq_set_selected(rq, rq->idle);
+	resched_curr(rq);
+	return true;
+}
+
 /*
  * Initial simple proxy that just returns the task if its waking
  * or deactivates the blocked task so we can pick something that
@@ -6607,10 +6623,9 @@ bool try_to_deactivate_task(struct rq *rq, struct task_struct *p,
 static struct task_struct *
 proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
 {
+	struct task_struct *ret = NULL;
 	struct task_struct *p = next;
 	struct mutex *mutex;
-	unsigned long state;
-
 	mutex = p->blocked_on;
 	/* Something changed in the chain, pick_again */
 	if (!mutex)
@@ -6632,30 +6647,14 @@ proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
 		 */
 		raw_spin_unlock(&p->blocked_lock);
 		raw_spin_unlock(&mutex->wait_lock);
-		return NULL;
-	}
-
-	state = READ_ONCE(p->__state);
-	/* Don't deactivate if the state has been changed to TASK_RUNNING */
-	if (!state) {
-		raw_spin_unlock(&p->blocked_lock);
-		raw_spin_unlock(&mutex->wait_lock);
-		return p;
+		return ret;
 	}
 
-	try_to_deactivate_task(rq, next, state, true);
-
-	/*
-	 * If next is the selected task, then remove lingering
-	 * references to it from rq and sched_class structs after
-	 * dequeueing.
-	 */
-	put_prev_task(rq, next);
-	rq_set_selected(rq, rq->idle);
-	resched_curr(rq);
+	if (!proxy_deactivate(rq, next))
+		ret = p;
 	raw_spin_unlock(&p->blocked_lock);
 	raw_spin_unlock(&mutex->wait_lock);
-	return NULL;
+	return ret;
 }
 #else /* PROXY_EXEC */
 static struct task_struct *
-- 
2.42.0.rc1.204.g551eb34607-goog


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 15/19] sched: Add proxy deactivate helper
  2023-08-19  6:08 ` [PATCH v5 15/19] sched: Add proxy deactivate helper John Stultz
@ 2023-08-24 11:34   ` kernel test robot
  0 siblings, 0 replies; 29+ messages in thread
From: kernel test robot @ 2023-08-24 11:34 UTC (permalink / raw)
  To: John Stultz, LKML
  Cc: oe-kbuild-all, John Stultz, Joel Fernandes, Qais Yousef,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Youssef Esmat, Mel Gorman, Daniel Bristot de Oliveira,
	Will Deacon, Waiman Long, Boqun Feng, Paul E . McKenney,
	kernel-team

Hi John,

kernel test robot noticed the following build warnings:

[auto build test WARNING on tip/locking/core]
[also build test WARNING on linus/master tip/auto-latest v6.5-rc7]
[cannot apply to tip/sched/core tip/master next-20230824]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/John-Stultz/sched-Unify-runtime-accounting-across-classes/20230821-121604
base:   tip/locking/core
patch link:    https://lore.kernel.org/r/20230819060915.3001568-16-jstultz%40google.com
patch subject: [PATCH v5 15/19] sched: Add proxy deactivate helper
config: loongarch-allmodconfig (https://download.01.org/0day-ci/archive/20230824/202308241905.goxohVDL-lkp@intel.com/config)
compiler: loongarch64-linux-gcc (GCC) 13.2.0
reproduce: (https://download.01.org/0day-ci/archive/20230824/202308241905.goxohVDL-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202308241905.goxohVDL-lkp@intel.com/

All warnings (new ones prefixed by >>):

   kernel/sched/core.c:6565:6: warning: no previous prototype for 'try_to_deactivate_task' [-Wmissing-prototypes]
    6565 | bool try_to_deactivate_task(struct rq *rq, struct task_struct *p,
         |      ^~~~~~~~~~~~~~~~~~~~~~
>> kernel/sched/core.c:6603:6: warning: no previous prototype for 'proxy_deactivate' [-Wmissing-prototypes]
    6603 | bool proxy_deactivate(struct rq *rq, struct task_struct *next)
         |      ^~~~~~~~~~~~~~~~


vim +/proxy_deactivate +6603 kernel/sched/core.c

  6564	
> 6565	bool try_to_deactivate_task(struct rq *rq, struct task_struct *p,
  6566				    unsigned long state, bool deactivate_cond)
  6567	{
  6568		if (signal_pending_state(state, p)) {
  6569			WRITE_ONCE(p->__state, TASK_RUNNING);
  6570		} else if (deactivate_cond) {
  6571			p->sched_contributes_to_load =
  6572				(state & TASK_UNINTERRUPTIBLE) &&
  6573				!(state & TASK_NOLOAD) &&
  6574				!(state & TASK_FROZEN);
  6575	
  6576			if (p->sched_contributes_to_load)
  6577				rq->nr_uninterruptible++;
  6578	
  6579			/*
  6580			 * __schedule()			ttwu()
  6581			 *   prev_state = prev->state;    if (p->on_rq && ...)
  6582			 *   if (prev_state)		    goto out;
  6583			 *     p->on_rq = 0;		  smp_acquire__after_ctrl_dep();
  6584			 *				  p->state = TASK_WAKING
  6585			 *
  6586			 * Where __schedule() and ttwu() have matching control dependencies.
  6587			 *
  6588			 * After this, schedule() must not care about p->state any more.
  6589			 */
  6590			deactivate_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
  6591	
  6592			if (p->in_iowait) {
  6593				atomic_inc(&rq->nr_iowait);
  6594				delayacct_blkio_start();
  6595			}
  6596			return true;
  6597		}
  6598		return false;
  6599	}
  6600	
  6601	#ifdef CONFIG_PROXY_EXEC
  6602	
> 6603	bool proxy_deactivate(struct rq *rq, struct task_struct *next)
  6604	{
  6605		unsigned long state = READ_ONCE(next->__state);
  6606	
  6607		/* Don't deactivate if the state has been changed to TASK_RUNNING */
  6608		if (!state)
  6609			return false;
  6610		if (!try_to_deactivate_task(rq, next, state, true))
  6611			return false;
  6612		put_prev_task(rq, next);
  6613		rq_set_selected(rq, rq->idle);
  6614		resched_curr(rq);
  6615		return true;
  6616	}
  6617	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v5 16/19] sched: Fix proxy/current (push,pull)ability
  2023-08-19  6:08 [PATCH v5 00/19] Proxy Execution: A generalized form of Priority Inheritance v5 John Stultz
                   ` (14 preceding siblings ...)
  2023-08-19  6:08 ` [PATCH v5 15/19] sched: Add proxy deactivate helper John Stultz
@ 2023-08-19  6:08 ` John Stultz
  2023-08-22 15:20   ` Dietmar Eggemann
  2023-08-19  6:08 ` [PATCH v5 17/19] sched: Start blocked_on chain processing in proxy() John Stultz
                   ` (2 subsequent siblings)
  18 siblings, 1 reply; 29+ messages in thread
From: John Stultz @ 2023-08-19  6:08 UTC (permalink / raw)
  To: LKML
  Cc: Valentin Schneider, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Youssef Esmat, Mel Gorman, Daniel Bristot de Oliveira,
	Will Deacon, Waiman Long, Boqun Feng, Paul E . McKenney,
	kernel-team, Connor O'Brien, John Stultz

From: Valentin Schneider <valentin.schneider@arm.com>

Proxy execution forms atomic pairs of tasks: The selected task
(scheduling context) and an proxy (execution context). The
selected task, along with the rest of the blocked chain,
follows the proxy wrt CPU placement.

They can be the same task, in which case push/pull doesn't need any
modification. When they are different, however,
FIFO1 & FIFO42:

	      ,->  RT42
	      |     | blocked-on
	      |     v
blocked_donor |   mutex
	      |     | owner
	      |     v
	      `--  RT1

   RT1
   RT42

  CPU0            CPU1
   ^                ^
   |                |
  overloaded    !overloaded
  rq prio = 42  rq prio = 0

RT1 is eligible to be pushed to CPU1, but should that happen it will
"carry" RT42 along. Clearly here neither RT1 nor RT42 must be seen as
push/pullable.

Unfortunately, only the selected task is usually dequeued from the
rq, and the proxy'ed execution context (rq->curr) remains on the rq.
This can cause RT1 to be selected for migration from logic like the
rt pushable_list.

This patch adds a dequeue/enqueue cycle on the proxy task before
__schedule returns, which allows the sched class logic to avoid
adding the now current task to the pushable_list.

Furthermore, tasks becoming blocked on a mutex don't need an explicit
dequeue/enqueue cycle to be made (push/pull)able: they have to be running
to block on a mutex, thus they will eventually hit put_prev_task().

XXX: pinned tasks becoming unblocked should be removed from the push/pull
lists, but those don't get to see __schedule() straight away.

Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Youssef Esmat <youssefesmat@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: kernel-team@android.com
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Connor O'Brien <connoro@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
---
v3:
* Tweaked comments & commit message
v5:
* Minor simplifications to utilize the fix earlier
  in the patch series.
* Rework the wording of the commit message to match selected/proxy
  terminology and expand a bit to make it more clear how it works.
---
 kernel/sched/core.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bee7082b294f..e8065fc5c894 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6656,6 +6656,21 @@ proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
 	raw_spin_unlock(&mutex->wait_lock);
 	return ret;
 }
+
+static inline void proxy_tag_curr(struct rq *rq, struct task_struct *next)
+{
+	/*
+	 * pick_next_task() calls set_next_task() on the selected task
+	 * at some point, which ensures it is not push/pullable.
+	 * However, the selected task *and* the ,mutex owner form an
+	 * atomic pair wrt push/pull.
+	 *
+	 * Make sure owner is not pushable. Unfortunately we can only
+	 * deal with that by means of a dequeue/enqueue cycle. :-/
+	 */
+	dequeue_task(rq, next, DEQUEUE_NOCLOCK | DEQUEUE_SAVE);
+	enqueue_task(rq, next, ENQUEUE_NOCLOCK | ENQUEUE_RESTORE);
+}
 #else /* PROXY_EXEC */
 static struct task_struct *
 proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
@@ -6663,6 +6678,8 @@ proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
 	BUG(); // This should never be called in the !PROXY case
 	return next;
 }
+
+static inline void proxy_tag_curr(struct rq *rq, struct task_struct *next) { }
 #endif /* PROXY_EXEC */
 
 /*
@@ -6711,6 +6728,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
 	unsigned long prev_state;
 	struct rq_flags rf;
 	struct rq *rq;
+	bool proxied;
 	int cpu;
 
 	cpu = smp_processor_id();
@@ -6760,6 +6778,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
 		switch_count = &prev->nvcsw;
 	}
 
+	proxied = (rq_selected(rq) != prev);
 pick_again:
 	next = pick_next_task(rq, rq_selected(rq), &rf);
 	rq_set_selected(rq, next);
@@ -6786,6 +6805,10 @@ static void __sched notrace __schedule(unsigned int sched_mode)
 		 * changes to task_struct made by pick_next_task().
 		 */
 		RCU_INIT_POINTER(rq->curr, next);
+
+		if (unlikely(!task_current_selected(rq, next)))
+			proxy_tag_curr(rq, next);
+
 		/*
 		 * The membarrier system call requires each architecture
 		 * to have a full memory barrier after updating
@@ -6810,6 +6833,10 @@ static void __sched notrace __schedule(unsigned int sched_mode)
 		/* Also unlocks the rq: */
 		rq = context_switch(rq, prev, next, &rf);
 	} else {
+		/* In case next was already curr but just got blocked_donor*/
+		if (unlikely(!task_current_selected(rq, next)))
+			proxy_tag_curr(rq, next);
+
 		rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
 
 		rq_unpin_lock(rq, &rf);
-- 
2.42.0.rc1.204.g551eb34607-goog


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 16/19] sched: Fix proxy/current (push,pull)ability
  2023-08-19  6:08 ` [PATCH v5 16/19] sched: Fix proxy/current (push,pull)ability John Stultz
@ 2023-08-22 15:20   ` Dietmar Eggemann
  2023-08-22 16:19     ` John Stultz
  0 siblings, 1 reply; 29+ messages in thread
From: Dietmar Eggemann @ 2023-08-22 15:20 UTC (permalink / raw)
  To: John Stultz, LKML
  Cc: Valentin Schneider, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Youssef Esmat,
	Mel Gorman, Daniel Bristot de Oliveira, Will Deacon, Waiman Long,
	Boqun Feng, Paul E . McKenney, kernel-team, Connor O'Brien

On 19/08/2023 08:08, John Stultz wrote:
> From: Valentin Schneider <valentin.schneider@arm.com>

[...]

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index bee7082b294f..e8065fc5c894 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6656,6 +6656,21 @@ proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
>  	raw_spin_unlock(&mutex->wait_lock);
>  	return ret;
>  }
> +
> +static inline void proxy_tag_curr(struct rq *rq, struct task_struct *next)
> +{
> +	/*
> +	 * pick_next_task() calls set_next_task() on the selected task
> +	 * at some point, which ensures it is not push/pullable.
> +	 * However, the selected task *and* the ,mutex owner form an
> +	 * atomic pair wrt push/pull.
> +	 *
> +	 * Make sure owner is not pushable. Unfortunately we can only
> +	 * deal with that by means of a dequeue/enqueue cycle. :-/
> +	 */
> +	dequeue_task(rq, next, DEQUEUE_NOCLOCK | DEQUEUE_SAVE);
> +	enqueue_task(rq, next, ENQUEUE_NOCLOCK | ENQUEUE_RESTORE);
> +}
>  #else /* PROXY_EXEC */
>  static struct task_struct *
>  proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
> @@ -6663,6 +6678,8 @@ proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
>  	BUG(); // This should never be called in the !PROXY case
>  	return next;
>  }
> +
> +static inline void proxy_tag_curr(struct rq *rq, struct task_struct *next) { }
>  #endif /* PROXY_EXEC */
>  
>  /*
> @@ -6711,6 +6728,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
>  	unsigned long prev_state;
>  	struct rq_flags rf;
>  	struct rq *rq;
> +	bool proxied;
>  	int cpu;
>  
>  	cpu = smp_processor_id();
> @@ -6760,6 +6778,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
>  		switch_count = &prev->nvcsw;
>  	}
>  
> +	proxied = (rq_selected(rq) != prev);

Looks like proxied isn't used here. (*)

>  pick_again:
>  	next = pick_next_task(rq, rq_selected(rq), &rf);
>  	rq_set_selected(rq, next);
> @@ -6786,6 +6805,10 @@ static void __sched notrace __schedule(unsigned int sched_mode)
>  		 * changes to task_struct made by pick_next_task().
>  		 */
>  		RCU_INIT_POINTER(rq->curr, next);
> +
> +		if (unlikely(!task_current_selected(rq, next)))
> +			proxy_tag_curr(rq, next);
> +
>  		/*
>  		 * The membarrier system call requires each architecture
>  		 * to have a full memory barrier after updating
> @@ -6810,6 +6833,10 @@ static void __sched notrace __schedule(unsigned int sched_mode)
>  		/* Also unlocks the rq: */
>  		rq = context_switch(rq, prev, next, &rf);
>  	} else {
> +		/* In case next was already curr but just got blocked_donor*/
> +		if (unlikely(!task_current_selected(rq, next)))
> +			proxy_tag_curr(rq, next);

(*) v4 had:

+          /* In case next was already curr but just got blocked_donor*/
+             if (unlikely(!proxied && next->blocked_donor))

> +
>  		rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
>  
>  		rq_unpin_lock(rq, &rf);

I miss changes in enqueue_task_rt() and put_prev_task_rt() related to
'affinity of blocked tasks doesn't matter' from v4.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 16/19] sched: Fix proxy/current (push,pull)ability
  2023-08-22 15:20   ` Dietmar Eggemann
@ 2023-08-22 16:19     ` John Stultz
  0 siblings, 0 replies; 29+ messages in thread
From: John Stultz @ 2023-08-22 16:19 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: LKML, Valentin Schneider, Joel Fernandes, Qais Yousef,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Youssef Esmat, Mel Gorman, Daniel Bristot de Oliveira,
	Will Deacon, Waiman Long, Boqun Feng, Paul E . McKenney,
	kernel-team, Connor O'Brien

On Tue, Aug 22, 2023 at 8:20 AM Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
>
> On 19/08/2023 08:08, John Stultz wrote:
> > From: Valentin Schneider <valentin.schneider@arm.com>
>
> [...]
>
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index bee7082b294f..e8065fc5c894 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -6656,6 +6656,21 @@ proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
> >       raw_spin_unlock(&mutex->wait_lock);
> >       return ret;
> >  }
> > +
> > +static inline void proxy_tag_curr(struct rq *rq, struct task_struct *next)
> > +{
> > +     /*
> > +      * pick_next_task() calls set_next_task() on the selected task
> > +      * at some point, which ensures it is not push/pullable.
> > +      * However, the selected task *and* the ,mutex owner form an
> > +      * atomic pair wrt push/pull.
> > +      *
> > +      * Make sure owner is not pushable. Unfortunately we can only
> > +      * deal with that by means of a dequeue/enqueue cycle. :-/
> > +      */
> > +     dequeue_task(rq, next, DEQUEUE_NOCLOCK | DEQUEUE_SAVE);
> > +     enqueue_task(rq, next, ENQUEUE_NOCLOCK | ENQUEUE_RESTORE);
> > +}
> >  #else /* PROXY_EXEC */
> >  static struct task_struct *
> >  proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
> > @@ -6663,6 +6678,8 @@ proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
> >       BUG(); // This should never be called in the !PROXY case
> >       return next;
> >  }
> > +
> > +static inline void proxy_tag_curr(struct rq *rq, struct task_struct *next) { }
> >  #endif /* PROXY_EXEC */
> >
> >  /*
> > @@ -6711,6 +6728,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
> >       unsigned long prev_state;
> >       struct rq_flags rf;
> >       struct rq *rq;
> > +     bool proxied;
> >       int cpu;
> >
> >       cpu = smp_processor_id();
> > @@ -6760,6 +6778,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
> >               switch_count = &prev->nvcsw;
> >       }
> >
> > +     proxied = (rq_selected(rq) != prev);
>
> Looks like proxied isn't used here. (*)

Ah. Looks like I should have split that off into a later change.
Thanks for pointing that out.

> >  pick_again:
> >       next = pick_next_task(rq, rq_selected(rq), &rf);
> >       rq_set_selected(rq, next);
> > @@ -6786,6 +6805,10 @@ static void __sched notrace __schedule(unsigned int sched_mode)
> >                * changes to task_struct made by pick_next_task().
> >                */
> >               RCU_INIT_POINTER(rq->curr, next);
> > +
> > +             if (unlikely(!task_current_selected(rq, next)))
> > +                     proxy_tag_curr(rq, next);
> > +
> >               /*
> >                * The membarrier system call requires each architecture
> >                * to have a full memory barrier after updating
> > @@ -6810,6 +6833,10 @@ static void __sched notrace __schedule(unsigned int sched_mode)
> >               /* Also unlocks the rq: */
> >               rq = context_switch(rq, prev, next, &rf);
> >       } else {
> > +             /* In case next was already curr but just got blocked_donor*/
> > +             if (unlikely(!task_current_selected(rq, next)))
> > +                     proxy_tag_curr(rq, next);
>
> (*) v4 had:
>
> +          /* In case next was already curr but just got blocked_donor*/
> +             if (unlikely(!proxied && next->blocked_donor))
>

That gets re-added later in the series when we add the blocked_donor tracking:
https://github.com/johnstultz-work/linux-dev/commit/00d33fb7d94bba6d230a833d775f83f4d90e4661#diff-cc1a82129952a910fdc4292448c2a097a2ba538bebefcf3c06381e45639ae73eR7115

> > +
> >               rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
> >
> >               rq_unpin_lock(rq, &rf);
>
> I miss changes in enqueue_task_rt() and put_prev_task_rt() related to
> 'affinity of blocked tasks doesn't matter' from v4.

So, yeah. As I was focusing on the correctness of the proxy migration
and return paths, I realized having the push/pull logic migrate
blocked tasks independently really doesn't help things (as proxy will
push them back).

So in those call paths, we now skip trying to migrate blocked tasks,
leaving the core proxy logic to be the only one to move blocked tasks,
and the __sched check to be the one to return them to a runnable cpu
when they are unblocked.

thanks
-john

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v5 17/19] sched: Start blocked_on chain processing in proxy()
  2023-08-19  6:08 [PATCH v5 00/19] Proxy Execution: A generalized form of Priority Inheritance v5 John Stultz
                   ` (15 preceding siblings ...)
  2023-08-19  6:08 ` [PATCH v5 16/19] sched: Fix proxy/current (push,pull)ability John Stultz
@ 2023-08-19  6:08 ` John Stultz
  2023-08-19  6:08 ` [PATCH v5 18/19] sched: Handle blocked-waiter migration (and return migration) John Stultz
  2023-08-19  6:08 ` [PATCH v5 19/19] sched: Add blocked_donor link to task for smarter mutex handoffs John Stultz
  18 siblings, 0 replies; 29+ messages in thread
From: John Stultz @ 2023-08-19  6:08 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Youssef Esmat,
	Mel Gorman, Daniel Bristot de Oliveira, Will Deacon, Waiman Long,
	Boqun Feng, Paul E . McKenney, kernel-team, Valentin Schneider,
	Connor O'Brien, John Stultz

From: Peter Zijlstra <peterz@infradead.org>

Start to flesh out the real proxy() implementation, but
avoid the migration cases for now, in those cases just
deactivate the selected task and pick again.

To ensure the selected task or other blocked tasks in
the chain aren't migrated away while we're running the
proxy, this patch also tweaks CFS logic to avoid migrating
selected or mutex blocked tasks.

Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Youssef Esmat <youssefesmat@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: kernel-team@android.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Connor O'Brien <connoro@google.com>
[jstultz: This change was split out from the larger proxy patch]
Signed-off-by: John Stultz <jstultz@google.com>
---
v5:
* Split this out from larger proxy patch
---
 kernel/sched/core.c | 162 ++++++++++++++++++++++++++++++++++++--------
 kernel/sched/fair.c |  10 ++-
 2 files changed, 143 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e8065fc5c894..fd3494503be7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -95,6 +95,7 @@
 #include "../workqueue_internal.h"
 #include "../../io_uring/io-wq.h"
 #include "../smpboot.h"
+#include "../locking/mutex.h"
 
 EXPORT_TRACEPOINT_SYMBOL_GPL(ipi_send_cpu);
 EXPORT_TRACEPOINT_SYMBOL_GPL(ipi_send_cpumask);
@@ -6600,6 +6601,15 @@ bool try_to_deactivate_task(struct rq *rq, struct task_struct *p,
 
 #ifdef CONFIG_PROXY_EXEC
 
+static inline struct task_struct *
+proxy_resched_idle(struct rq *rq, struct task_struct *next)
+{
+	put_prev_task(rq, next);
+	rq_set_selected(rq, rq->idle);
+	set_tsk_need_resched(rq->idle);
+	return rq->idle;
+}
+
 bool proxy_deactivate(struct rq *rq, struct task_struct *next)
 {
 	unsigned long state = READ_ONCE(next->__state);
@@ -6609,52 +6619,146 @@ bool proxy_deactivate(struct rq *rq, struct task_struct *next)
 		return false;
 	if (!try_to_deactivate_task(rq, next, state, true))
 		return false;
-	put_prev_task(rq, next);
-	rq_set_selected(rq, rq->idle);
-	resched_curr(rq);
+	proxy_resched_idle(rq, next);
 	return true;
 }
 
 /*
- * Initial simple proxy that just returns the task if its waking
- * or deactivates the blocked task so we can pick something that
- * isn't blocked.
+ * Find who @next (currently blocked on a mutex) can proxy for.
+ *
+ * Follow the blocked-on relation:
+ *   task->blocked_on -> mutex->owner -> task...
+ *
+ * Lock order:
+ *
+ *   p->pi_lock
+ *     rq->lock
+ *       mutex->wait_lock
+ *         p->blocked_lock
+ *
+ * Returns the task that is going to be used as execution context (the one
+ * that is actually going to be put to run on cpu_of(rq)).
  */
 static struct task_struct *
 proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
 {
 	struct task_struct *ret = NULL;
 	struct task_struct *p = next;
+	struct task_struct *owner = NULL;
+	int this_cpu;
 	struct mutex *mutex;
-	mutex = p->blocked_on;
-	/* Something changed in the chain, pick_again */
-	if (!mutex)
-		return NULL;
+
+	this_cpu = cpu_of(rq);
+
 	/*
-	 * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
-	 * and ensure @owner sticks around.
+	 * Follow blocked_on chain.
+	 *
+	 * TODO: deadlock detection
 	 */
-	raw_spin_lock(&mutex->wait_lock);
-	raw_spin_lock(&p->blocked_lock);
+	for (p = next; task_is_blocked(p); p = owner) {
+		mutex = p->blocked_on;
+		/* Something changed in the chain, pick_again */
+		if (!mutex)
+			return NULL;
 
-	/* Check again that p is blocked with blocked_lock held */
-	if (!task_is_blocked(p) || mutex != p->blocked_on) {
 		/*
-		 * Something changed in the blocked_on chain and
-		 * we don't know if only at this level. So, let's
-		 * just bail out completely and let __schedule
-		 * figure things out (pick_again loop).
+		 * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
+		 * and ensure @owner sticks around.
+		 */
+		raw_spin_lock(&mutex->wait_lock);
+		raw_spin_lock(&p->blocked_lock);
+
+		/* Check again that p is blocked with blocked_lock held */
+		if (mutex != p->blocked_on) {
+			/*
+			 * Something changed in the blocked_on chain and
+			 * we don't know if only at this level. So, let's
+			 * just bail out completely and let __schedule
+			 * figure things out (pick_again loop).
+			 */
+			raw_spin_unlock(&p->blocked_lock);
+			raw_spin_unlock(&mutex->wait_lock);
+			return NULL;
+		}
+
+		owner = __mutex_owner(mutex);
+		if (!owner) {
+			raw_spin_unlock(&p->blocked_lock);
+			raw_spin_unlock(&mutex->wait_lock);
+			return p;
+		}
+
+		if (task_cpu(owner) != this_cpu) {
+			/* XXX Don't handle migrations yet */
+			if (!proxy_deactivate(rq, next))
+				ret = next;
+			raw_spin_unlock(&p->blocked_lock);
+			raw_spin_unlock(&mutex->wait_lock);
+			return ret;
+		}
+
+		if (task_on_rq_migrating(owner)) {
+			/*
+			 * One of the chain of mutex owners is currently migrating to this
+			 * CPU, but has not yet been enqueued because we are holding the
+			 * rq lock. As a simple solution, just schedule rq->idle to give
+			 * the migration a chance to complete. Much like the migrate_task
+			 * case we should end up back in proxy(), this time hopefully with
+			 * all relevant tasks already enqueued.
+			 */
+			raw_spin_unlock(&p->blocked_lock);
+			raw_spin_unlock(&mutex->wait_lock);
+			return proxy_resched_idle(rq, next);
+		}
+
+		if (!owner->on_rq) {
+			/* XXX Don't handle blocked owners yet */
+			if (!proxy_deactivate(rq, next))
+				ret = next;
+			raw_spin_unlock(&p->blocked_lock);
+			raw_spin_unlock(&mutex->wait_lock);
+			return ret;
+		}
+
+		if (owner == p) {
+			/*
+			 * Its possible we interleave with mutex_unlock like:
+			 *
+			 *				lock(&rq->lock);
+			 *				  proxy()
+			 * mutex_unlock()
+			 *   lock(&wait_lock);
+			 *   next(owner) = current->blocked_donor;
+			 *   unlock(&wait_lock);
+			 *
+			 *   wake_up_q();
+			 *     ...
+			 *       ttwu_runnable()
+			 *         __task_rq_lock()
+			 *				  lock(&wait_lock);
+			 *				  owner == p
+			 *
+			 * Which leaves us to finish the ttwu_runnable() and make it go.
+			 *
+			 * So schedule rq->idle so that ttwu_runnable can get the rq lock
+			 * and mark owner as running.
+			 */
+			raw_spin_unlock(&p->blocked_lock);
+			raw_spin_unlock(&mutex->wait_lock);
+			return proxy_resched_idle(rq, next);
+		}
+
+		/*
+		 * OK, now we're absolutely sure @owner is not blocked _and_
+		 * on this rq, therefore holding @rq->lock is sufficient to
+		 * guarantee its existence, as per ttwu_remote().
 		 */
 		raw_spin_unlock(&p->blocked_lock);
 		raw_spin_unlock(&mutex->wait_lock);
-		return ret;
 	}
 
-	if (!proxy_deactivate(rq, next))
-		ret = p;
-	raw_spin_unlock(&p->blocked_lock);
-	raw_spin_unlock(&mutex->wait_lock);
-	return ret;
+	WARN_ON_ONCE(owner && !owner->on_rq);
+	return owner;
 }
 
 static inline void proxy_tag_curr(struct rq *rq, struct task_struct *next)
@@ -6730,6 +6834,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
 	struct rq *rq;
 	bool proxied;
 	int cpu;
+	bool preserve_need_resched = false;
 
 	cpu = smp_processor_id();
 	rq = cpu_rq(cpu);
@@ -6790,9 +6895,12 @@ static void __sched notrace __schedule(unsigned int sched_mode)
 			rq_repin_lock(rq, &rf);
 			goto pick_again;
 		}
+		if (next == rq->idle && prev == rq->idle)
+			preserve_need_resched = true;
 	}
 
-	clear_tsk_need_resched(prev);
+	if (!preserve_need_resched)
+		clear_tsk_need_resched(prev);
 	clear_preempt_need_resched();
 #ifdef CONFIG_SCHED_DEBUG
 	rq->last_seen_need_resched_ns = 0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6385693f1da6..1d0c2cc4606a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8643,7 +8643,8 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	/* Disregard pcpu kthreads; they are where they need to be. */
 	if (kthread_is_per_cpu(p))
 		return 0;
-
+	if (task_is_blocked(p))
+		return 0;
 	if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
 		int cpu;
 
@@ -8680,7 +8681,8 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	/* Record that we found at least one task that could run on dst_cpu */
 	env->flags &= ~LBF_ALL_PINNED;
 
-	if (task_on_cpu(env->src_rq, p)) {
+	if (task_on_cpu(env->src_rq, p) ||
+	    task_current_selected(env->src_rq, p)) {
 		schedstat_inc(p->stats.nr_failed_migrations_running);
 		return 0;
 	}
@@ -8719,6 +8721,10 @@ static void detach_task(struct task_struct *p, struct lb_env *env)
 {
 	lockdep_assert_rq_held(env->src_rq);
 
+	BUG_ON(task_is_blocked(p));
+	BUG_ON(task_current(env->src_rq, p));
+	BUG_ON(task_current_selected(env->src_rq, p));
+
 	deactivate_task(env->src_rq, p, DEQUEUE_NOCLOCK);
 	set_task_cpu(p, env->dst_cpu);
 }
-- 
2.42.0.rc1.204.g551eb34607-goog


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v5 18/19] sched: Handle blocked-waiter migration (and return migration)
  2023-08-19  6:08 [PATCH v5 00/19] Proxy Execution: A generalized form of Priority Inheritance v5 John Stultz
                   ` (16 preceding siblings ...)
  2023-08-19  6:08 ` [PATCH v5 17/19] sched: Start blocked_on chain processing in proxy() John Stultz
@ 2023-08-19  6:08 ` John Stultz
  2023-08-19  6:08 ` [PATCH v5 19/19] sched: Add blocked_donor link to task for smarter mutex handoffs John Stultz
  18 siblings, 0 replies; 29+ messages in thread
From: John Stultz @ 2023-08-19  6:08 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Youssef Esmat, Mel Gorman, Daniel Bristot de Oliveira,
	Will Deacon, Waiman Long, Boqun Feng, Paul E . McKenney,
	kernel-team

Add logic to handle migrating a blocked waiter to a remote
cpu where the lock owner is runnable.

Additionally, as the blocked task may not be able to run
on the remote cpu, add logic to handle return migration once
the waiting task is given the mutex.

Because tasks may get migrated to where they cannot run,
this patch also modifies the scheduling classes to avoid
sched class migrations on mutex blocked tasks, leaving
proxy() to do the migrations and return migrations.

This was split out from the larger proxy patch, and
significantly reworked to avoid changes to the try_to_wakeup()
call path.

Credits for the original patch go to:
  Peter Zijlstra (Intel) <peterz@infradead.org>
  Juri Lelli <juri.lelli@redhat.com>
  Valentin Schneider <valentin.schneider@arm.com>
  Connor O'Brien <connoro@google.com>

NOTE: The return migration is further complicated in that we
need to take the pi_lock in order to decide which cpu we should
migrate back to. This requires dropping the current rq lock,
grabbing the pi_lock re-taking the current rq lock, picking a
cpu, deactivating the task, switching its cpu, dropping the
current rq lock, grabbing the target rq, activating the task
and then dropping the target rq and reaquiring the current
rq. This seems overly complex, so suggestions for a better
approach would be welcome!

Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Youssef Esmat <youssefesmat@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: kernel-team@android.com
Signed-off-by: John Stultz <jstultz@google.com>
---
 kernel/sched/core.c     | 184 ++++++++++++++++++++++++++++++++++++++--
 kernel/sched/deadline.c |   2 +-
 kernel/sched/fair.c     |   4 +-
 kernel/sched/rt.c       |  14 +--
 4 files changed, 190 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fd3494503be7..ab3b3a783ee9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2949,8 +2949,15 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
 	struct set_affinity_pending my_pending = { }, *pending = NULL;
 	bool stop_pending, complete = false;
 
-	/* Can the task run on the task's current CPU? If so, we're done */
-	if (cpumask_test_cpu(task_cpu(p), &p->cpus_mask)) {
+	/*
+	 * Can the task run on the task's current CPU? If so, we're done
+	 *
+	 * We are also done if the task is selected, boosting a lock-
+	 * holding proxy, (and potentially has been migrated outside its
+	 * current or previous affinity mask)
+	 */
+	if (cpumask_test_cpu(task_cpu(p), &p->cpus_mask) ||
+	    (task_current_selected(rq, p) && !task_current(rq, p))) {
 		struct task_struct *push_task = NULL;
 
 		if ((flags & SCA_MIGRATE_ENABLE) &&
@@ -6623,6 +6630,142 @@ bool proxy_deactivate(struct rq *rq, struct task_struct *next)
 	return true;
 }
 
+static struct task_struct *
+proxy_migrate_task(struct rq *rq, struct task_struct *next,
+		   struct rq_flags *rf, struct task_struct *p,
+		   int that_cpu)
+{
+	struct rq *that_rq;
+	int wake_cpu;
+
+	/*
+	 * If the blocked-on relationship crosses CPUs, migrate @p to the
+	 * @owner's CPU.
+	 *
+	 * This is because we must respect the CPU affinity of execution
+	 * contexts (@owner) but we can ignore affinity for scheduling
+	 * contexts (@p). So we have to move scheduling contexts towards
+	 * potential execution contexts.
+	 */
+	that_rq = cpu_rq(that_cpu);
+
+	/*
+	 * @owner can disappear, simply migrate to @that_cpu and leave that CPU
+	 * to sort things out.
+	 */
+
+	/*
+	 * Since we're going to drop @rq, we have to put(@next) first,
+	 * otherwise we have a reference that no longer belongs to us.  Use
+	 * @fake_task to fill the void and make the next pick_next_task()
+	 * invocation happy.
+	 *
+	 * CPU0				CPU1
+	 *
+	 *				B mutex_lock(X)
+	 *
+	 * A mutex_lock(X) <- B
+	 * A __schedule()
+	 * A pick->A
+	 * A proxy->B
+	 * A migrate A to CPU1
+	 *				B mutex_unlock(X) -> A
+	 *				B __schedule()
+	 *				B pick->A
+	 *				B switch_to (A)
+	 *				A ... does stuff
+	 * A ... is still running here
+	 *
+	 *		* BOOM *
+	 */
+	put_prev_task(rq, rq_selected(rq));
+	rq_set_selected(rq, rq->idle);
+	set_next_task(rq, rq_selected(rq));
+
+	wake_cpu = next->wake_cpu;
+
+	WARN_ON(next == rq->curr);
+
+	deactivate_task(rq, next, 0);
+	set_task_cpu(next, that_cpu);
+	/*
+	 * Preserve p->wake_cpu, such that we can tell where it
+	 * used to run later.
+	 */
+	next->wake_cpu = wake_cpu;
+
+	if (rq->balance_callback)
+		__balance_callbacks(rq);
+
+	rq_unpin_lock(rq, rf);
+	raw_spin_rq_unlock(rq);
+	raw_spin_rq_lock(that_rq);
+
+	activate_task(that_rq, next, 0);
+	check_preempt_curr(that_rq, next, 0);
+
+	raw_spin_rq_unlock(that_rq);
+	raw_spin_rq_lock(rq);
+	rq_repin_lock(rq, rf);
+
+	return NULL; /* Retry task selection on _this_ CPU. */
+}
+
+static inline bool proxy_return_migration(struct rq *rq, struct rq_flags *rf,
+					  struct task_struct *next)
+{
+	if (next->blocked_on && next->blocked_on_waking) {
+		if (!is_cpu_allowed(next, cpu_of(rq))) {
+			struct rq *that_rq;
+			int cpu;
+
+			if (next == rq->curr) {
+				/* can't migrate curr, so return and let caller sort it */
+				return true;
+			}
+
+/*?*/			put_prev_task(rq, rq_selected(rq));
+			rq_set_selected(rq, rq->idle);
+
+			/* First unpin & run balance callbacks */
+			rq_unpin_lock(rq, rf);
+			__balance_callbacks(rq);
+			/*
+			 * Drop the rq lock so we can get pi_lock,
+			 * then reaquire it again to figure out
+			 * where to send it.
+			 */
+			raw_spin_rq_unlock(rq);
+			raw_spin_lock(&next->pi_lock);
+			rq_lock(rq, rf);
+
+			cpu = select_task_rq(next, next->wake_cpu, WF_TTWU);
+
+			deactivate_task(rq, next, 0);
+			set_task_cpu(next, cpu);
+			that_rq = cpu_rq(cpu);
+
+			/* drop this rq lock and grab that_rq's */
+			rq_unpin_lock(rq, rf);
+			raw_spin_rq_unlock(rq);
+			raw_spin_rq_lock(that_rq);
+
+			activate_task(that_rq, next, 0);
+			check_preempt_curr(that_rq, next, 0);
+
+			/* drop that_rq's lock and re-grab this' */
+			raw_spin_rq_unlock(that_rq);
+			raw_spin_rq_lock(rq);
+			rq_repin_lock(rq, rf);
+
+			raw_spin_unlock(&next->pi_lock);
+
+			return true;
+		}
+	}
+	return false;
+}
+
 /*
  * Find who @next (currently blocked on a mutex) can proxy for.
  *
@@ -6645,7 +6788,8 @@ proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
 	struct task_struct *ret = NULL;
 	struct task_struct *p = next;
 	struct task_struct *owner = NULL;
-	int this_cpu;
+	bool curr_in_chain = false;
+	int this_cpu, that_cpu;
 	struct mutex *mutex;
 
 	this_cpu = cpu_of(rq);
@@ -6681,6 +6825,9 @@ proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
 			return NULL;
 		}
 
+		if (task_current(rq, p))
+			curr_in_chain = true;
+
 		owner = __mutex_owner(mutex);
 		if (!owner) {
 			raw_spin_unlock(&p->blocked_lock);
@@ -6689,12 +6836,17 @@ proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
 		}
 
 		if (task_cpu(owner) != this_cpu) {
-			/* XXX Don't handle migrations yet */
-			if (!proxy_deactivate(rq, next))
-				ret = next;
+			that_cpu = task_cpu(owner);
+			/*
+			 * @owner can disappear, simply migrate to @that_cpu and leave that CPU
+			 * to sort things out.
+			 */
 			raw_spin_unlock(&p->blocked_lock);
 			raw_spin_unlock(&mutex->wait_lock);
-			return ret;
+			if (curr_in_chain)
+				return proxy_resched_idle(rq, next);
+
+			return proxy_migrate_task(rq, next, rf, p, that_cpu);
 		}
 
 		if (task_on_rq_migrating(owner)) {
@@ -6775,7 +6927,14 @@ static inline void proxy_tag_curr(struct rq *rq, struct task_struct *next)
 	dequeue_task(rq, next, DEQUEUE_NOCLOCK | DEQUEUE_SAVE);
 	enqueue_task(rq, next, ENQUEUE_NOCLOCK | ENQUEUE_RESTORE);
 }
+
 #else /* PROXY_EXEC */
+static inline bool proxy_return_migration(struct rq *rq, struct rq_flags *rf,
+					  struct task_struct *next)
+{
+	return false;
+}
+
 static struct task_struct *
 proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
 {
@@ -6898,6 +7057,14 @@ static void __sched notrace __schedule(unsigned int sched_mode)
 		if (next == rq->idle && prev == rq->idle)
 			preserve_need_resched = true;
 	}
+	if (unlikely(proxy_return_migration(rq, &rf, next))) {
+		if (next != rq->curr)
+			goto pick_again;
+
+		rq_set_selected(rq, rq->idle);
+		set_tsk_need_resched(rq->idle);
+		next = rq->idle;
+	}
 
 	if (!preserve_need_resched)
 		clear_tsk_need_resched(prev);
@@ -6995,6 +7162,9 @@ static inline void sched_submit_work(struct task_struct *tsk)
 	 */
 	SCHED_WARN_ON(current->__state & TASK_RTLOCK_WAIT);
 
+	if (task_is_blocked(tsk))
+		return;
+
 	/*
 	 * If we are going to sleep and we have plugged IO queued,
 	 * make sure to submit it to avoid deadlocks.
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index e8bca6b8da6f..99788cfd8835 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1731,7 +1731,7 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 
 	enqueue_dl_entity(&p->dl, flags);
 
-	if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
+	if (!task_current(rq, p) && p->nr_cpus_allowed > 1 && !task_is_blocked(p))
 		enqueue_pushable_dl_task(rq, p);
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1d0c2cc4606a..00130a917da4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8112,7 +8112,9 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 		goto idle;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-	if (!prev || prev->sched_class != &fair_sched_class)
+	if (!prev ||
+	    prev->sched_class != &fair_sched_class ||
+	    rq->curr != rq_selected(rq))
 		goto simple;
 
 	/*
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index cfbf3925e595..7053b81580ca 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1538,8 +1538,8 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
 	enqueue_rt_entity(rt_se, flags);
 
 	/*
-	 * Current can't be pushed away. Selected is tied to current,
-	 * so don't push it either.
+	 * Current can't be pushed away. Proxy is tied to current, so don't
+	 * push it either.
 	 */
 	if (task_current(rq, p) || task_current_selected(rq, p))
 		return;
@@ -1549,6 +1549,9 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
 	if (p->nr_cpus_allowed == 1)
 		return;
 
+	if (task_is_blocked(p))
+		return;
+
 	enqueue_pushable_task(rq, p);
 }
 
@@ -1836,13 +1839,14 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
 
 	update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 1);
 
-	/* Avoid marking selected as pushable */
-	if (task_current_selected(rq, p))
+	if (task_current(rq, p) || task_current_selected(rq, p))
 		return;
 
+	if (task_is_blocked(p))
+		return;
 	/*
 	 * The previous task needs to be made eligible for pushing
-	 * if it is still active
+	 * if it is still active.
 	 */
 	if (on_rt_rq(&p->rt) && p->nr_cpus_allowed > 1)
 		enqueue_pushable_task(rq, p);
-- 
2.42.0.rc1.204.g551eb34607-goog


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v5 19/19] sched: Add blocked_donor link to task for smarter mutex handoffs
  2023-08-19  6:08 [PATCH v5 00/19] Proxy Execution: A generalized form of Priority Inheritance v5 John Stultz
                   ` (17 preceding siblings ...)
  2023-08-19  6:08 ` [PATCH v5 18/19] sched: Handle blocked-waiter migration (and return migration) John Stultz
@ 2023-08-19  6:08 ` John Stultz
  18 siblings, 0 replies; 29+ messages in thread
From: John Stultz @ 2023-08-19  6:08 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Youssef Esmat,
	Mel Gorman, Daniel Bristot de Oliveira, Will Deacon, Waiman Long,
	Boqun Feng, Paul E . McKenney, kernel-team, Valentin Schneider,
	Connor O'Brien, John Stultz

From: Peter Zijlstra <peterz@infradead.org>

Add link to the task this task is proxying for, and use it so we
do intellegent hand-off of the owned mutex to the task we're
running on behalf.

TODO:
* Cleanup ifdef mess

Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Youssef Esmat <youssefesmat@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: kernel-team@android.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Connor O'Brien <connoro@google.com>
[jstultz: This patch was split out from larger proxy patch]
Signed-off-by: John Stultz <jstultz@google.com>
---
v5:
* Split out from larger proxy patch
---
 include/linux/sched.h  |  1 +
 kernel/fork.c          |  1 +
 kernel/locking/mutex.c | 36 +++++++++++++++++++++++++++++++++---
 kernel/sched/core.c    |  7 +++++--
 4 files changed, 40 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3b7f26df2496..c3aa05c27816 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1140,6 +1140,7 @@ struct task_struct {
 	struct rt_mutex_waiter		*pi_blocked_on;
 #endif
 
+	struct task_struct		*blocked_donor;	/* task that is boosting us */
 	struct mutex			*blocked_on;	/* lock we're blocked on */
 	bool				blocked_on_waking; /* blocked on, but waking */
 	raw_spinlock_t			blocked_lock;
diff --git a/kernel/fork.c b/kernel/fork.c
index 5b11ead90b12..a746ac03c533 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2459,6 +2459,7 @@ __latent_entropy struct task_struct *copy_process(
 	lockdep_init_task(p);
 #endif
 
+	p->blocked_donor = NULL; /* nobody is boosting us yet */
 	p->blocked_on = NULL; /* not blocked yet */
 	p->blocked_on_waking = false; /* not blocked yet */
 
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 5125bbab4119..cf6c325515c2 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -906,7 +906,7 @@ EXPORT_SYMBOL_GPL(ww_mutex_lock_interruptible);
  */
 static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigned long ip)
 {
-	struct task_struct *next = NULL;
+	struct task_struct *donor, *next = NULL;
 	DEFINE_WAKE_Q(wake_q);
 	unsigned long owner;
 	unsigned long flags;
@@ -943,7 +943,34 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 
 	raw_spin_lock_irqsave(&lock->wait_lock, flags);
 	debug_mutex_unlock(lock);
-	if (!list_empty(&lock->wait_list)) {
+
+#ifdef CONFIG_PROXY_EXEC /* TODO: Clean up this ifdef mess */
+	raw_spin_lock(&current->blocked_lock);
+	/*
+	 * If we have a task boosting us, and that task was boosting us through
+	 * this lock, hand the lock to that task, as that is the highest
+	 * waiter, as selected by the scheduling function.
+	 */
+	donor = current->blocked_donor;
+	if (donor) {
+		struct mutex *next_lock;
+
+		raw_spin_lock_nested(&donor->blocked_lock, SINGLE_DEPTH_NESTING);
+		next_lock = get_task_blocked_on(donor);
+		if (next_lock == lock) {
+			next = donor;
+			donor->blocked_on_waking = true;
+			wake_q_add(&wake_q, donor);
+			current->blocked_donor = NULL;
+		}
+		raw_spin_unlock(&donor->blocked_lock);
+	}
+#endif
+
+	/*
+	 * Failing that, pick any on the wait list.
+	 */
+	if (!next && !list_empty(&lock->wait_list)) {
 		/* get the first entry from the wait-list: */
 		struct mutex_waiter *waiter =
 			list_first_entry(&lock->wait_list,
@@ -952,7 +979,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 		next = waiter->task;
 
 		debug_mutex_wake_waiter(lock, waiter);
-		raw_spin_lock(&next->blocked_lock);
+		raw_spin_lock_nested(&next->blocked_lock, SINGLE_DEPTH_NESTING);
 		WARN_ON(next->blocked_on != lock);
 		next->blocked_on_waking = true;
 		raw_spin_unlock(&next->blocked_lock);
@@ -963,6 +990,9 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 		__mutex_handoff(lock, next);
 
 	preempt_disable();
+#ifdef CONFIG_PROXY_EXEC
+	raw_spin_unlock(&current->blocked_lock);
+#endif
 	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 
 	wake_up_q(&wake_q);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ab3b3a783ee9..59ec06988783 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6907,6 +6907,8 @@ proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
 		 */
 		raw_spin_unlock(&p->blocked_lock);
 		raw_spin_unlock(&mutex->wait_lock);
+
+		owner->blocked_donor = p;
 	}
 
 	WARN_ON_ONCE(owner && !owner->on_rq);
@@ -7042,10 +7044,11 @@ static void __sched notrace __schedule(unsigned int sched_mode)
 		switch_count = &prev->nvcsw;
 	}
 
-	proxied = (rq_selected(rq) != prev);
+	proxied = !!prev->blocked_donor;
 pick_again:
 	next = pick_next_task(rq, rq_selected(rq), &rf);
 	rq_set_selected(rq, next);
+	next->blocked_donor = NULL;
 	if (unlikely(task_is_blocked(next))) {
 		next = proxy(rq, next, &rf);
 		if (!next) {
@@ -7109,7 +7112,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
 		rq = context_switch(rq, prev, next, &rf);
 	} else {
 		/* In case next was already curr but just got blocked_donor*/
-		if (unlikely(!task_current_selected(rq, next)))
+		if (unlikely(!proxied && next->blocked_donor))
 			proxy_tag_curr(rq, next);
 
 		rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
-- 
2.42.0.rc1.204.g551eb34607-goog


^ permalink raw reply related	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2023-08-24 11:36 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-08-19  6:08 [PATCH v5 00/19] Proxy Execution: A generalized form of Priority Inheritance v5 John Stultz
2023-08-19  6:08 ` [PATCH v5 01/19] sched: Unify runtime accounting across classes John Stultz
2023-08-19  6:08 ` [PATCH v5 02/19] locking/mutex: Removes wakeups from under mutex::wait_lock John Stultz
2023-08-22 19:11   ` Waiman Long
2023-08-22 19:24     ` John Stultz
2023-08-19  6:08 ` [PATCH v5 03/19] locking/mutex: make mutex::wait_lock irq safe John Stultz
2023-08-19  6:08 ` [PATCH v5 04/19] locking/mutex: Expose __mutex_owner() John Stultz
2023-08-19  6:08 ` [PATCH v5 05/19] locking/mutex: Rework task_struct::blocked_on John Stultz
2023-08-19  6:08 ` [PATCH v5 06/19] locking/mutex: Add task_struct::blocked_lock to serialize changes to the blocked_on state John Stultz
2023-08-19  6:08 ` [PATCH v5 07/19] locking/mutex: Add p->blocked_on wrappers for correctness checks John Stultz
2023-08-19  6:08 ` [PATCH v5 08/19] locking/mutex: Split blocked_on logic into two states (blocked_on and blocked_on_waking) John Stultz
2023-08-19  6:08 ` [PATCH v5 09/19] locking/mutex: Switch to mutex handoffs for CONFIG_PROXY_EXEC John Stultz
2023-08-19  6:08 ` [PATCH v5 10/19] sched: Split scheduler execution context John Stultz
2023-08-19  6:08 ` [PATCH v5 11/19] sched: Fix runtime accounting w/ split exec & sched contexts John Stultz
2023-08-19  6:08 ` [PATCH v5 12/19] sched: Unnest ttwu_runnable in prep for proxy-execution John Stultz
2023-08-19  6:08 ` [PATCH v5 13/19] sched: Split out __sched() deactivate task logic into a helper John Stultz
2023-08-23 21:12   ` kernel test robot
2023-08-23 21:25     ` John Stultz
2023-08-24  0:00   ` kernel test robot
2023-08-24  0:34   ` kernel test robot
2023-08-19  6:08 ` [PATCH v5 14/19] sched: Add a very simple proxy() function John Stultz
2023-08-19  6:08 ` [PATCH v5 15/19] sched: Add proxy deactivate helper John Stultz
2023-08-24 11:34   ` kernel test robot
2023-08-19  6:08 ` [PATCH v5 16/19] sched: Fix proxy/current (push,pull)ability John Stultz
2023-08-22 15:20   ` Dietmar Eggemann
2023-08-22 16:19     ` John Stultz
2023-08-19  6:08 ` [PATCH v5 17/19] sched: Start blocked_on chain processing in proxy() John Stultz
2023-08-19  6:08 ` [PATCH v5 18/19] sched: Handle blocked-waiter migration (and return migration) John Stultz
2023-08-19  6:08 ` [PATCH v5 19/19] sched: Add blocked_donor link to task for smarter mutex handoffs John Stultz

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox