From: John Stultz <jstultz@google.com>
To: LKML <linux-kernel@vger.kernel.org>
Cc: John Stultz <jstultz@google.com>,
	Joel Fernandes <joelaf@google.com>,
	Qais Yousef <qyousef@google.com>, Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Valentin Schneider <vschneid@redhat.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>,
	Zimuzo Ezeozue <zezeozue@google.com>,
	Youssef Esmat <youssefesmat@google.com>,
	Mel Gorman <mgorman@suse.de>,
	Daniel Bristot de Oliveira <bristot@redhat.com>,
	Will Deacon <will@kernel.org>, Waiman Long <longman@redhat.com>,
	Boqun Feng <boqun.feng@gmail.com>,
	"Paul E . McKenney" <paulmck@kernel.org>,
	kernel-team@android.com
Subject: [PATCH v5 18/19] sched: Handle blocked-waiter migration (and return migration)
Date: Sat, 19 Aug 2023 06:08:52 +0000	[thread overview]
Message-ID: <20230819060915.3001568-19-jstultz@google.com> (raw)
In-Reply-To: <20230819060915.3001568-1-jstultz@google.com>

Add logic to handle migrating a blocked waiter to a remote
CPU where the lock owner is runnable.

Additionally, as the blocked task may not be able to run
on the remote CPU, add logic to handle return migration once
the waiting task is given the mutex.
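
At a high level, the flow looks something like the following
simplified pseudo-C sketch (not the exact code from the diff
below; locking, rq pinning and error handling are elided):

  /* in proxy(): the owner is runnable on another CPU, follow it */
  owner = __mutex_owner(mutex);
  if (task_cpu(owner) != this_cpu)
          return proxy_migrate_task(rq, next, rf, p, task_cpu(owner));

  /*
   * in __schedule(): once the waiter has been woken holding the
   * mutex, return-migrate it if this CPU is outside its mask
   */
  if (proxy_return_migration(rq, &rf, next))
          goto pick_again;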

Because tasks may get migrated to CPUs where they cannot run,
this patch also modifies the scheduling classes to avoid
class-driven migrations (e.g. push/pull balancing) of
mutex-blocked tasks, leaving proxy() to do the migrations and
return migrations.
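
The class-side changes all follow the same pattern: skip
marking a mutex-blocked task as pushable, so the class
balancers leave it alone. A representative sketch of that
guard (matching the rt.c and deadline.c hunks below):

  /* mutex-blocked tasks are migrated only by proxy() */
  if (task_is_blocked(p))
          return;

  enqueue_pushable_task(rq, p);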

This was split out from the larger proxy patch and
significantly reworked to avoid changes to the try_to_wake_up()
call path.

Credits for the original patch go to:
  Peter Zijlstra (Intel) <peterz@infradead.org>
  Juri Lelli <juri.lelli@redhat.com>
  Valentin Schneider <valentin.schneider@arm.com>
  Connor O'Brien <connoro@google.com>

NOTE: The return migration is further complicated in that we
need to take the pi_lock in order to decide which CPU we should
migrate back to. This requires dropping the current rq lock,
grabbing the pi_lock, re-taking the current rq lock, picking a
CPU, deactivating the task, switching its CPU, dropping the
current rq lock, grabbing the target rq lock, activating the
task, and then dropping the target rq lock and reacquiring the
current rq lock. This seems overly complex, so suggestions for
a better approach would be welcome!
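
For reference, a condensed sketch of that lock dance as
implemented in proxy_return_migration() below (rq pinning and
balance callbacks elided; only existing lock helpers are used):

  raw_spin_rq_unlock(rq);               /* drop current rq lock */
  raw_spin_lock(&next->pi_lock);        /* needed by select_task_rq() */
  raw_spin_rq_lock(rq);                 /* re-take current rq lock */
  cpu = select_task_rq(next, next->wake_cpu, WF_TTWU);
  deactivate_task(rq, next, 0);
  set_task_cpu(next, cpu);
  raw_spin_rq_unlock(rq);               /* drop current rq lock */
  raw_spin_rq_lock(cpu_rq(cpu));        /* grab target rq lock */
  activate_task(cpu_rq(cpu), next, 0);
  check_preempt_curr(cpu_rq(cpu), next, 0);
  raw_spin_rq_unlock(cpu_rq(cpu));      /* drop target rq lock */
  raw_spin_rq_lock(rq);                 /* reacquire current rq lock */
  raw_spin_unlock(&next->pi_lock);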

Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Youssef Esmat <youssefesmat@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: kernel-team@android.com
Signed-off-by: John Stultz <jstultz@google.com>
---
 kernel/sched/core.c     | 184 ++++++++++++++++++++++++++++++++++++++--
 kernel/sched/deadline.c |   2 +-
 kernel/sched/fair.c     |   4 +-
 kernel/sched/rt.c       |  14 +--
 4 files changed, 190 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fd3494503be7..ab3b3a783ee9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2949,8 +2949,15 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
 	struct set_affinity_pending my_pending = { }, *pending = NULL;
 	bool stop_pending, complete = false;
 
-	/* Can the task run on the task's current CPU? If so, we're done */
-	if (cpumask_test_cpu(task_cpu(p), &p->cpus_mask)) {
+	/*
+	 * Can the task run on the task's current CPU? If so, we're done
+	 *
+	 * We are also done if the task is the currently selected task
+	 * (i.e. it is boosting a lock-holding proxy) and so may have
+	 * been migrated outside its current or previous affinity mask.
+	 */
+	if (cpumask_test_cpu(task_cpu(p), &p->cpus_mask) ||
+	    (task_current_selected(rq, p) && !task_current(rq, p))) {
 		struct task_struct *push_task = NULL;
 
 		if ((flags & SCA_MIGRATE_ENABLE) &&
@@ -6623,6 +6630,142 @@ bool proxy_deactivate(struct rq *rq, struct task_struct *next)
 	return true;
 }
 
+static struct task_struct *
+proxy_migrate_task(struct rq *rq, struct task_struct *next,
+		   struct rq_flags *rf, struct task_struct *p,
+		   int that_cpu)
+{
+	struct rq *that_rq;
+	int wake_cpu;
+
+	/*
+	 * If the blocked-on relationship crosses CPUs, migrate @p to the
+	 * @owner's CPU.
+	 *
+	 * This is because we must respect the CPU affinity of execution
+	 * contexts (@owner) but we can ignore affinity for scheduling
+	 * contexts (@p). So we have to move scheduling contexts towards
+	 * potential execution contexts.
+	 */
+	that_rq = cpu_rq(that_cpu);
+
+	/*
+	 * @owner can disappear; simply migrate to @that_cpu and leave that CPU
+	 * to sort things out.
+	 */
+
+	/*
+	 * Since we're going to drop @rq, we have to put(@next) first,
+	 * otherwise we have a reference that no longer belongs to us.  Set
+	 * the selected task to the idle task to fill the void and make the
+	 * next pick_next_task() invocation happy.
+	 *
+	 * CPU0				CPU1
+	 *
+	 *				B mutex_lock(X)
+	 *
+	 * A mutex_lock(X) <- B
+	 * A __schedule()
+	 * A pick->A
+	 * A proxy->B
+	 * A migrate A to CPU1
+	 *				B mutex_unlock(X) -> A
+	 *				B __schedule()
+	 *				B pick->A
+	 *				B switch_to (A)
+	 *				A ... does stuff
+	 * A ... is still running here
+	 *
+	 *		* BOOM *
+	 */
+	put_prev_task(rq, rq_selected(rq));
+	rq_set_selected(rq, rq->idle);
+	set_next_task(rq, rq_selected(rq));
+
+	wake_cpu = next->wake_cpu;
+
+	WARN_ON(next == rq->curr);
+
+	deactivate_task(rq, next, 0);
+	set_task_cpu(next, that_cpu);
+	/*
+	 * Preserve @next->wake_cpu, so we can tell later where it
+	 * used to run.
+	 */
+	next->wake_cpu = wake_cpu;
+
+	if (rq->balance_callback)
+		__balance_callbacks(rq);
+
+	rq_unpin_lock(rq, rf);
+	raw_spin_rq_unlock(rq);
+	raw_spin_rq_lock(that_rq);
+
+	activate_task(that_rq, next, 0);
+	check_preempt_curr(that_rq, next, 0);
+
+	raw_spin_rq_unlock(that_rq);
+	raw_spin_rq_lock(rq);
+	rq_repin_lock(rq, rf);
+
+	return NULL; /* Retry task selection on _this_ CPU. */
+}
+
+static inline bool proxy_return_migration(struct rq *rq, struct rq_flags *rf,
+					  struct task_struct *next)
+{
+	if (next->blocked_on && next->blocked_on_waking) {
+		if (!is_cpu_allowed(next, cpu_of(rq))) {
+			struct rq *that_rq;
+			int cpu;
+
+			if (next == rq->curr) {
+				/* can't migrate curr, so return and let caller sort it */
+				return true;
+			}
+
+			put_prev_task(rq, rq_selected(rq));
+			rq_set_selected(rq, rq->idle);
+
+			/* First unpin & run balance callbacks */
+			rq_unpin_lock(rq, rf);
+			__balance_callbacks(rq);
+			/*
+			 * Drop the rq lock so we can get pi_lock,
+			 * then re-acquire it to figure out
+			 * where to send it.
+			 */
+			raw_spin_rq_unlock(rq);
+			raw_spin_lock(&next->pi_lock);
+			rq_lock(rq, rf);
+
+			cpu = select_task_rq(next, next->wake_cpu, WF_TTWU);
+
+			deactivate_task(rq, next, 0);
+			set_task_cpu(next, cpu);
+			that_rq = cpu_rq(cpu);
+
+			/* drop this rq lock and grab that_rq's */
+			rq_unpin_lock(rq, rf);
+			raw_spin_rq_unlock(rq);
+			raw_spin_rq_lock(that_rq);
+
+			activate_task(that_rq, next, 0);
+			check_preempt_curr(that_rq, next, 0);
+
+			/* drop that_rq's lock and re-grab this rq's */
+			raw_spin_rq_unlock(that_rq);
+			raw_spin_rq_lock(rq);
+			rq_repin_lock(rq, rf);
+
+			raw_spin_unlock(&next->pi_lock);
+
+			return true;
+		}
+	}
+	return false;
+}
+
 /*
  * Find who @next (currently blocked on a mutex) can proxy for.
  *
@@ -6645,7 +6788,8 @@ proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
 	struct task_struct *ret = NULL;
 	struct task_struct *p = next;
 	struct task_struct *owner = NULL;
-	int this_cpu;
+	bool curr_in_chain = false;
+	int this_cpu, that_cpu;
 	struct mutex *mutex;
 
 	this_cpu = cpu_of(rq);
@@ -6681,6 +6825,9 @@ proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
 			return NULL;
 		}
 
+		if (task_current(rq, p))
+			curr_in_chain = true;
+
 		owner = __mutex_owner(mutex);
 		if (!owner) {
 			raw_spin_unlock(&p->blocked_lock);
@@ -6689,12 +6836,17 @@ proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
 		}
 
 		if (task_cpu(owner) != this_cpu) {
-			/* XXX Don't handle migrations yet */
-			if (!proxy_deactivate(rq, next))
-				ret = next;
+			that_cpu = task_cpu(owner);
+			/*
+			 * @owner can disappear; simply migrate to @that_cpu and leave that CPU
+			 * to sort things out.
+			 */
 			raw_spin_unlock(&p->blocked_lock);
 			raw_spin_unlock(&mutex->wait_lock);
-			return ret;
+			if (curr_in_chain)
+				return proxy_resched_idle(rq, next);
+
+			return proxy_migrate_task(rq, next, rf, p, that_cpu);
 		}
 
 		if (task_on_rq_migrating(owner)) {
@@ -6775,7 +6927,14 @@ static inline void proxy_tag_curr(struct rq *rq, struct task_struct *next)
 	dequeue_task(rq, next, DEQUEUE_NOCLOCK | DEQUEUE_SAVE);
 	enqueue_task(rq, next, ENQUEUE_NOCLOCK | ENQUEUE_RESTORE);
 }
+
 #else /* PROXY_EXEC */
+static inline bool proxy_return_migration(struct rq *rq, struct rq_flags *rf,
+					  struct task_struct *next)
+{
+	return false;
+}
+
 static struct task_struct *
 proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
 {
@@ -6898,6 +7057,14 @@ static void __sched notrace __schedule(unsigned int sched_mode)
 		if (next == rq->idle && prev == rq->idle)
 			preserve_need_resched = true;
 	}
+	if (unlikely(proxy_return_migration(rq, &rf, next))) {
+		if (next != rq->curr)
+			goto pick_again;
+
+		rq_set_selected(rq, rq->idle);
+		set_tsk_need_resched(rq->idle);
+		next = rq->idle;
+	}
 
 	if (!preserve_need_resched)
 		clear_tsk_need_resched(prev);
@@ -6995,6 +7162,9 @@ static inline void sched_submit_work(struct task_struct *tsk)
 	 */
 	SCHED_WARN_ON(current->__state & TASK_RTLOCK_WAIT);
 
+	if (task_is_blocked(tsk))
+		return;
+
 	/*
 	 * If we are going to sleep and we have plugged IO queued,
 	 * make sure to submit it to avoid deadlocks.
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index e8bca6b8da6f..99788cfd8835 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1731,7 +1731,7 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 
 	enqueue_dl_entity(&p->dl, flags);
 
-	if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
+	if (!task_current(rq, p) && p->nr_cpus_allowed > 1 && !task_is_blocked(p))
 		enqueue_pushable_dl_task(rq, p);
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1d0c2cc4606a..00130a917da4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8112,7 +8112,9 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 		goto idle;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-	if (!prev || prev->sched_class != &fair_sched_class)
+	if (!prev ||
+	    prev->sched_class != &fair_sched_class ||
+	    rq->curr != rq_selected(rq))
 		goto simple;
 
 	/*
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index cfbf3925e595..7053b81580ca 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1538,8 +1538,8 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
 	enqueue_rt_entity(rt_se, flags);
 
 	/*
-	 * Current can't be pushed away. Selected is tied to current,
-	 * so don't push it either.
+	 * Current can't be pushed away. Proxy is tied to current, so don't
+	 * push it either.
 	 */
 	if (task_current(rq, p) || task_current_selected(rq, p))
 		return;
@@ -1549,6 +1549,9 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
 	if (p->nr_cpus_allowed == 1)
 		return;
 
+	if (task_is_blocked(p))
+		return;
+
 	enqueue_pushable_task(rq, p);
 }
 
@@ -1836,13 +1839,14 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
 
 	update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 1);
 
-	/* Avoid marking selected as pushable */
-	if (task_current_selected(rq, p))
+	if (task_current(rq, p) || task_current_selected(rq, p))
 		return;
 
+	if (task_is_blocked(p))
+		return;
 	/*
 	 * The previous task needs to be made eligible for pushing
-	 * if it is still active
+	 * if it is still active.
 	 */
 	if (on_rt_rq(&p->rt) && p->nr_cpus_allowed > 1)
 		enqueue_pushable_task(rq, p);
-- 
2.42.0.rc1.204.g551eb34607-goog

