Re: [PATCH v25 9/9] sched: Handle blocked-waiter migration (and return migration)

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Peter Zijlstra <peterz@infradead.org>
To: John Stultz <jstultz@google.com>
Cc: LKML <linux-kernel@vger.kernel.org>,
	Joel Fernandes <joelagnelf@nvidia.com>,
	Qais Yousef <qyousef@layalina.io>, Ingo Molnar <mingo@redhat.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Valentin Schneider <vschneid@redhat.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>,
	Zimuzo Ezeozue <zezeozue@google.com>,
	Mel Gorman <mgorman@suse.de>, Will Deacon <will@kernel.org>,
	Waiman Long <longman@redhat.com>,
	Boqun Feng <boqun.feng@gmail.com>,
	"Paul E. McKenney" <paulmck@kernel.org>,
	Metin Kaya <Metin.Kaya@arm.com>,
	Xuewen Yan <xuewen.yan94@gmail.com>,
	K Prateek Nayak <kprateek.nayak@amd.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Daniel Lezcano <daniel.lezcano@linaro.org>,
	Suleiman Souhlal <suleiman@google.com>,
	kuyo chang <kuyo.chang@mediatek.com>, hupu <hupu.gm@gmail.com>,
	kernel-team@android.com
Subject: Re: [PATCH v25 9/9] sched: Handle blocked-waiter migration (and return migration)
Date: Thu, 19 Mar 2026 13:49:56 +0100	[thread overview]
Message-ID: <20260319124956.GV3738010@noisy.programming.kicks-ass.net> (raw)
In-Reply-To: <20260313023022.2902479-10-jstultz@google.com>

On Fri, Mar 13, 2026 at 02:30:10AM +0000, John Stultz wrote:
> +/*
> + * If the blocked-on relationship crosses CPUs, migrate @p to the
> + * owner's CPU.
> + *
> + * This is because we must respect the CPU affinity of execution
> + * contexts (owner) but we can ignore affinity for scheduling
> + * contexts (@p). So we have to move scheduling contexts towards
> + * potential execution contexts.
> + *
> + * Note: The owner can disappear, but simply migrate to @target_cpu
> + * and leave that CPU to sort things out.
> + */
> +static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
> +			       struct task_struct *p, int target_cpu)
>  {
> +	struct rq *target_rq = cpu_rq(target_cpu);
> +
> +	lockdep_assert_rq_held(rq);
> +
> +	/*
> +	 * Since we're going to drop @rq, we have to put(@rq->donor) first,
> +	 * otherwise we have a reference that no longer belongs to us.
> +	 *
> +	 * Additionally, as we put_prev_task(prev) earlier, its possible that
> +	 * prev will migrate away as soon as we drop the rq lock, however we
> +	 * still have it marked as rq->curr, as we've not yet switched tasks.
> +	 *
> +	 * So call proxy_resched_idle() to let go of the references before
> +	 * we release the lock.
> +	 */
> +	proxy_resched_idle(rq);

This comment confuses the heck out of me. It seems to imply we need to
schedule before dropping rq->lock.

> +
> +	WARN_ON(p == rq->curr);
> +
> +	deactivate_task(rq, p, DEQUEUE_NOCLOCK);
> +	proxy_set_task_cpu(p, target_cpu);
> +
> +	/*
> +	 * We have to zap callbacks before unlocking the rq
> +	 * as another CPU may jump in and call sched_balance_rq
> +	 * which can trip the warning in rq_pin_lock() if we
> +	 * leave callbacks set.
> +	 */

It might be good to explain where these callbacks come from.

> +	zap_balance_callbacks(rq);
> +	rq_unpin_lock(rq, rf);
> +	raw_spin_rq_unlock(rq);
> +
> +	attach_one_task(target_rq, p);
> +
> +	raw_spin_rq_lock(rq);
> +	rq_repin_lock(rq, rf);
> +	update_rq_clock(rq);
> +}
> +
> +static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
> +			       struct task_struct *p)
> +{
> +	struct rq *this_rq, *target_rq;
> +	struct rq_flags this_rf;
> +	int cpu, wake_flag = WF_TTWU;
> +
> +	lockdep_assert_rq_held(rq);
> +	WARN_ON(p == rq->curr);
> +
> +	/*
> +	 * We have to zap callbacks before unlocking the rq
> +	 * as another CPU may jump in and call sched_balance_rq
> +	 * which can trip the warning in rq_pin_lock() if we
> +	 * leave callbacks set.
> +	 */

idem

> +	zap_balance_callbacks(rq);
> +	rq_unpin_lock(rq, rf);
> +	raw_spin_rq_unlock(rq);

This is in fact the very same sequence as above.

> +
> +	/*
> +	 * We drop the rq lock, and re-grab task_rq_lock to get
> +	 * the pi_lock (needed for select_task_rq) as well.
> +	 */
> +	this_rq = task_rq_lock(p, &this_rf);
> +
> +	/*
> +	 * Since we let go of the rq lock, the task may have been
> +	 * woken or migrated to another rq before we  got the
> +	 * task_rq_lock. So re-check we're on the same RQ. If
> +	 * not, the task has already been migrated and that CPU
> +	 * will handle any futher migrations.
> +	 */
> +	if (this_rq != rq)
> +		goto err_out;
> +
> +	/* Similarly, if we've been dequeued, someone else will wake us */
> +	if (!task_on_rq_queued(p))
> +		goto err_out;
> +
> +	/*
> +	 * Since we should only be calling here from __schedule()
> +	 * -> find_proxy_task(), no one else should have
> +	 * assigned current out from under us. But check and warn
> +	 * if we see this, then bail.
> +	 */
> +	if (task_current(this_rq, p) || task_on_cpu(this_rq, p)) {
> +		WARN_ONCE(1, "%s rq: %i current/on_cpu task %s %d  on_cpu: %i\n",
> +			  __func__, cpu_of(this_rq),
> +			  p->comm, p->pid, p->on_cpu);
> +		goto err_out;
>  	}
> -	return NULL;
> +
> +	update_rq_clock(this_rq);
> +	proxy_resched_idle(this_rq);
> +	deactivate_task(this_rq, p, DEQUEUE_NOCLOCK);
> +	cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
> +	set_task_cpu(p, cpu);
> +	target_rq = cpu_rq(cpu);
> +	clear_task_blocked_on(p, NULL);
> +	task_rq_unlock(this_rq, p, &this_rf);
> +
> +	attach_one_task(target_rq, p);
> +
> +	/* Finally, re-grab the origianl rq lock and return to pick-again */
> +	raw_spin_rq_lock(rq);
> +	rq_repin_lock(rq, rf);
> +	update_rq_clock(rq);
> +	return;
> +
> +err_out:
> +	task_rq_unlock(this_rq, p, &this_rf);
> +	raw_spin_rq_lock(rq);
> +	rq_repin_lock(rq, rf);
> +	update_rq_clock(rq);
>  }

Hurm... how about something like so?

---
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6662,6 +6662,28 @@ static bool proxy_deactivate(struct rq *
 	return try_to_block_task(rq, donor, &state, true);
 }
 
+static inline void proxy_release_rq_lock(struct rq *rq, struct rq_flags *rf)
+	__releases(__rq_lockp(rq))
+{
+	/*
+	 * We have to zap callbacks before unlocking the rq
+	 * as another CPU may jump in and call sched_balance_rq
+	 * which can trip the warning in rq_pin_lock() if we
+	 * leave callbacks set.
+	 */
+	zap_balance_callbacks(rq);
+	rq_unpin_lock(rq, rf);
+	raw_spin_rq_unlock(rq);
+}
+
+static inline void proxy_reacquire_rq_lock(struct rq *rq, struct rq_flags *rf)
+	__acquires(__rq_lockp(rq))
+{
+	raw_spin_rq_lock(rq);
+	rq_repin_lock(rq, rf);
+	update_rq_clock(rq);
+}
+
 /*
  * If the blocked-on relationship crosses CPUs, migrate @p to the
  * owner's CPU.
@@ -6676,6 +6698,7 @@ static bool proxy_deactivate(struct rq *
  */
 static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
 			       struct task_struct *p, int target_cpu)
+	__must_hold(__rq_lockp(rq))
 {
 	struct rq *target_rq = cpu_rq(target_cpu);
 
@@ -6699,98 +6722,72 @@ static void proxy_migrate_task(struct rq
 	deactivate_task(rq, p, DEQUEUE_NOCLOCK);
 	proxy_set_task_cpu(p, target_cpu);
 
-	/*
-	 * We have to zap callbacks before unlocking the rq
-	 * as another CPU may jump in and call sched_balance_rq
-	 * which can trip the warning in rq_pin_lock() if we
-	 * leave callbacks set.
-	 */
-	zap_balance_callbacks(rq);
-	rq_unpin_lock(rq, rf);
-	raw_spin_rq_unlock(rq);
+	proxy_release_rq_lock(rq, rf);
 
 	attach_one_task(target_rq, p);
 
-	raw_spin_rq_lock(rq);
-	rq_repin_lock(rq, rf);
-	update_rq_clock(rq);
+	proxy_reacquire_rq_lock(rq, rf);
 }
 
 static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
 			       struct task_struct *p)
+	__must_hold(__rq_lockp(rq))
 {
-	struct rq *this_rq, *target_rq;
-	struct rq_flags this_rf;
+	struct rq *task_rq, *target_rq = NULL;
 	int cpu, wake_flag = WF_TTWU;
 
 	lockdep_assert_rq_held(rq);
 	WARN_ON(p == rq->curr);
 
-	/*
-	 * We have to zap callbacks before unlocking the rq
-	 * as another CPU may jump in and call sched_balance_rq
-	 * which can trip the warning in rq_pin_lock() if we
-	 * leave callbacks set.
-	 */
-	zap_balance_callbacks(rq);
-	rq_unpin_lock(rq, rf);
-	raw_spin_rq_unlock(rq);
+	proxy_release_rq_lock(rq, rf);
 
 	/*
 	 * We drop the rq lock, and re-grab task_rq_lock to get
 	 * the pi_lock (needed for select_task_rq) as well.
 	 */
-	this_rq = task_rq_lock(p, &this_rf);
+	scoped_guard (task_rq_lock, p) {
+		task_rq = scope.rq;
 
-	/*
-	 * Since we let go of the rq lock, the task may have been
-	 * woken or migrated to another rq before we  got the
-	 * task_rq_lock. So re-check we're on the same RQ. If
-	 * not, the task has already been migrated and that CPU
-	 * will handle any futher migrations.
-	 */
-	if (this_rq != rq)
-		goto err_out;
-
-	/* Similarly, if we've been dequeued, someone else will wake us */
-	if (!task_on_rq_queued(p))
-		goto err_out;
-
-	/*
-	 * Since we should only be calling here from __schedule()
-	 * -> find_proxy_task(), no one else should have
-	 * assigned current out from under us. But check and warn
-	 * if we see this, then bail.
-	 */
-	if (task_current(this_rq, p) || task_on_cpu(this_rq, p)) {
-		WARN_ONCE(1, "%s rq: %i current/on_cpu task %s %d  on_cpu: %i\n",
-			  __func__, cpu_of(this_rq),
-			  p->comm, p->pid, p->on_cpu);
-		goto err_out;
+		/*
+		 * Since we let go of the rq lock, the task may have been
+		 * woken or migrated to another rq before we  got the
+		 * task_rq_lock. So re-check we're on the same RQ. If
+		 * not, the task has already been migrated and that CPU
+		 * will handle any futher migrations.
+		 */
+		if (task_rq != rq)
+			break;
+
+		/* Similarly, if we've been dequeued, someone else will wake us */
+		if (!task_on_rq_queued(p))
+			break;
+
+		/*
+		 * Since we should only be calling here from __schedule()
+		 * -> find_proxy_task(), no one else should have
+		 * assigned current out from under us. But check and warn
+		 * if we see this, then bail.
+		 */
+		if (task_current(task_rq, p) || task_on_cpu(task_rq, p)) {
+			WARN_ONCE(1, "%s rq: %i current/on_cpu task %s %d  on_cpu: %i\n",
+				  __func__, cpu_of(task_rq),
+				  p->comm, p->pid, p->on_cpu);
+			break;
+		}
+
+		update_rq_clock(task_rq);
+		proxy_resched_idle(task_rq);
+		deactivate_task(task_rq, p, DEQUEUE_NOCLOCK);
+		cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
+		set_task_cpu(p, cpu);
+		target_rq = cpu_rq(cpu);
+		clear_task_blocked_on(p, NULL);
 	}
 
-	update_rq_clock(this_rq);
-	proxy_resched_idle(this_rq);
-	deactivate_task(this_rq, p, DEQUEUE_NOCLOCK);
-	cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
-	set_task_cpu(p, cpu);
-	target_rq = cpu_rq(cpu);
-	clear_task_blocked_on(p, NULL);
-	task_rq_unlock(this_rq, p, &this_rf);
+	if (target_rq)
+		attach_one_task(target_rq, p);
 
-	attach_one_task(target_rq, p);
-
-	/* Finally, re-grab the origianl rq lock and return to pick-again */
-	raw_spin_rq_lock(rq);
-	rq_repin_lock(rq, rf);
-	update_rq_clock(rq);
-	return;
-
-err_out:
-	task_rq_unlock(this_rq, p, &this_rf);
-	raw_spin_rq_lock(rq);
-	rq_repin_lock(rq, rf);
-	update_rq_clock(rq);
+	proxy_reacquire_rq_lock(rq, rf);
 }
 
 /*
@@ -6811,6 +6808,7 @@ static void proxy_force_return(struct rq
  */
 static struct task_struct *
 find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
+	__must_hold(__rq_lockp(rq))
 {
 	enum { FOUND, DEACTIVATE_DONOR, MIGRATE, NEEDS_RETURN } action = FOUND;
 	struct task_struct *owner = NULL;

next prev parent reply	other threads:[~2026-03-19 12:50 UTC|newest]

Thread overview: 38+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-13  2:30 [PATCH v25 0/9] Simple Donor Migration for Proxy Execution John Stultz
2026-03-13  2:30 ` [PATCH v25 1/9] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr() John Stultz
2026-03-13 13:48   ` Juri Lelli
2026-03-13 17:53     ` John Stultz
2026-03-15 16:26   ` K Prateek Nayak
2026-03-17  4:49     ` John Stultz
2026-03-17  5:41       ` K Prateek Nayak
2026-03-17  6:04         ` John Stultz
2026-03-17  7:52           ` K Prateek Nayak
2026-03-17 18:35             ` John Stultz
2026-03-18 13:36           ` Peter Zijlstra
2026-03-18 13:52             ` Peter Zijlstra
2026-03-18 17:55               ` K Prateek Nayak
2026-03-18 20:30             ` John Stultz
2026-03-18 20:34               ` Peter Zijlstra
2026-03-18 20:35                 ` John Stultz
2026-03-18 12:55         ` Peter Zijlstra
2026-03-18 18:01           ` K Prateek Nayak
2026-03-13  2:30 ` [PATCH v25 2/9] sched: Minimise repeated sched_proxy_exec() checking John Stultz
2026-03-15 17:01   ` K Prateek Nayak
2026-03-13  2:30 ` [PATCH v25 3/9] locking: Add task::blocked_lock to serialize blocked_on state John Stultz
2026-03-13  2:30 ` [PATCH v25 4/9] sched: Fix modifying donor->blocked on without proper locking John Stultz
2026-03-13  2:30 ` [PATCH v25 5/9] sched/locking: Add special p->blocked_on==PROXY_WAKING value for proxy return-migration John Stultz
2026-03-13  2:30 ` [PATCH v25 6/9] sched: Add assert_balance_callbacks_empty helper John Stultz
2026-03-13  2:30 ` [PATCH v25 7/9] sched: Add logic to zap balance callbacks if we pick again John Stultz
2026-03-13  2:30 ` [PATCH v25 8/9] sched: Move attach_one_task and attach_task helpers to sched.h John Stultz
2026-03-15 16:34   ` K Prateek Nayak
2026-03-16 23:34     ` John Stultz
2026-03-17  2:29       ` K Prateek Nayak
2026-03-13  2:30 ` [PATCH v25 9/9] sched: Handle blocked-waiter migration (and return migration) John Stultz
2026-03-15 17:38   ` K Prateek Nayak
2026-03-18 19:07     ` John Stultz
2026-03-18  6:35   ` Juri Lelli
2026-03-18  6:56     ` K Prateek Nayak
2026-03-18 10:16       ` Juri Lelli
2026-03-18 12:59   ` Peter Zijlstra
2026-03-19 12:49   ` Peter Zijlstra [this message]
2026-03-19 21:26     ` John Stultz

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260319124956.GV3738010@noisy.programming.kicks-ass.net \
    --to=peterz@infradead.org \
    --cc=Metin.Kaya@arm.com \
    --cc=boqun.feng@gmail.com \
    --cc=bsegall@google.com \
    --cc=daniel.lezcano@linaro.org \
    --cc=dietmar.eggemann@arm.com \
    --cc=hupu.gm@gmail.com \
    --cc=joelagnelf@nvidia.com \
    --cc=jstultz@google.com \
    --cc=juri.lelli@redhat.com \
    --cc=kernel-team@android.com \
    --cc=kprateek.nayak@amd.com \
    --cc=kuyo.chang@mediatek.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=longman@redhat.com \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=paulmck@kernel.org \
    --cc=qyousef@layalina.io \
    --cc=rostedt@goodmis.org \
    --cc=suleiman@google.com \
    --cc=tglx@linutronix.de \
    --cc=vincent.guittot@linaro.org \
    --cc=vschneid@redhat.com \
    --cc=will@kernel.org \
    --cc=xuewen.yan94@gmail.com \
    --cc=zezeozue@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.