public inbox for linux-kernel@vger.kernel.org
From: K Prateek Nayak <kprateek.nayak@amd.com>
To: John Stultz <jstultz@google.com>, Peter Zijlstra <peterz@infradead.org>
Cc: LKML <linux-kernel@vger.kernel.org>,
	Joel Fernandes <joelagnelf@nvidia.com>,
	Qais Yousef <qyousef@layalina.io>, Ingo Molnar <mingo@redhat.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Valentin Schneider <vschneid@redhat.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>,
	Zimuzo Ezeozue <zezeozue@google.com>,
	Mel Gorman <mgorman@suse.de>, Will Deacon <will@kernel.org>,
	Waiman Long <longman@redhat.com>,
	Boqun Feng <boqun.feng@gmail.com>,
	"Paul E. McKenney" <paulmck@kernel.org>,
	Metin Kaya <Metin.Kaya@arm.com>,
	Xuewen Yan <xuewen.yan94@gmail.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Daniel Lezcano <daniel.lezcano@linaro.org>,
	Suleiman Souhlal <suleiman@google.com>,
	kuyo chang <kuyo.chang@mediatek.com>, hupu <hupu.gm@gmail.com>,
	<kernel-team@android.com>
Subject: Re: [PATCH v26 00/10] Simple Donor Migration for Proxy Execution
Date: Fri, 3 Apr 2026 11:39:25 +0530	[thread overview]
Message-ID: <e6dfce6a-ee6f-441b-9f21-e2036d0a3e5d@amd.com> (raw)
In-Reply-To: <CANDhNCoAp8MeSbyO_Q3PanZDZGLS04U0i==Uv_9Ev4eX9VvUVA@mail.gmail.com>

Hello Peter, John,

On 4/3/2026 12:01 AM, John Stultz wrote:
>> So, I've not gone through all the cases yet, and it is *COMPLETELY*
>> untested, but something like the below perhaps?
>>
> 
> So, just to clarify, this suggestion is as an alternative to my
> return-migration via ttwu logic (not included in the v26
> simple-donor-migration chunk you've queued)?
>    https://github.com/johnstultz-work/linux-dev/commit/dfaa472f2a0b2f6a0c73083aaba5c55e256fdb56
> 
> I'm working on prepping that next chunk of the series (which includes
> that logic) to send out here shortly (integrating the few bits from
> Prateek that I've managed to get my head around).

Thanks and I agree that it might be more digestible that way.

> 
> There's still a bunch of other changes tied into waking a rq->donor
> outside of __schedule() in that chunk, so I'm not sure if this
> discussion would be easier to have in context once those are on the
> list?
> 
> So some of my (certainly confused) thoughts below...
> 
>> ---
>>  include/linux/sched.h |    2
>>  kernel/sched/core.c   |  173 ++++++++++++++++----------------------------------
>>  2 files changed, 58 insertions(+), 117 deletions(-)
>>
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -161,7 +161,7 @@ struct user_event_mm;
>>   */
>>  #define is_special_task_state(state)                                   \
>>         ((state) & (__TASK_STOPPED | __TASK_TRACED | TASK_PARKED |      \
>> -                   TASK_DEAD | TASK_FROZEN))
>> +                   TASK_DEAD | TASK_WAKING | TASK_FROZEN))
>>
>>  #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
>>  # define debug_normal_state_change(state_value)                                \
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
> ...
>> @@ -3702,28 +3723,39 @@ ttwu_do_activate(struct rq *rq, struct t
>>   */
>>  static int ttwu_runnable(struct task_struct *p, int wake_flags)
>>  {
>> -       struct rq_flags rf;
>> -       struct rq *rq;
>> -       int ret = 0;
>> +       ACQUIRE(__task_rq_lock, guard)(p);
>> +       struct rq *rq = guard.rq;
>>
>> -       rq = __task_rq_lock(p, &rf);
>> -       if (task_on_rq_queued(p)) {
>> -               update_rq_clock(rq);
>> -               if (p->se.sched_delayed)
>> -                       enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
>> -               if (!task_on_cpu(rq, p)) {
>> +       if (!task_on_rq_queued(p))
>> +               return 0;
>> +
>> +       if (sched_proxy_exec() && p->blocked_on) {
>> +               guard(raw_spinlock)(&p->blocked_lock);
>> +               struct mutex *lock = p->blocked_on;
>> +               if (lock) {
>>                         /*
>> -                        * When on_rq && !on_cpu the task is preempted, see if
>> -                        * it should preempt the task that is current now.
>> +                        * TASK_WAKING is a special state and results in
>> +                        * DEQUEUE_SPECIAL such that the task will actually be
>> +                        * forced from the runqueue.
>>                          */
>> -                       wakeup_preempt(rq, p, wake_flags);
>> +                       block_task(rq, p, TASK_WAKING);

This needs to reset the rq->donor if the task getting woken up is the
current donor.

>> +                       p->blocked_on = NULL;
>> +                       return 0;
>>                 }
>> -               ttwu_do_wakeup(p);
>> -               ret = 1;
>>         }
>> -       __task_rq_unlock(rq, p, &rf);
>>
>> -       return ret;
>> +       update_rq_clock(rq);

nit: Since block_task() now adds DEQUEUE_NOCLOCK, we need to move that
clock update before the block.

>> +       if (p->se.sched_delayed)
>> +               enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
> 
> I can't precisely remember the details now, but I believe we need to
> handle enqueueing sched_delayed tasks before handling blocked_on
> tasks.

So proxy_deactivate() can still delay the task, leading to
task_on_rq_queued() being true and the wakeup coming to ttwu_runnable(),
so either we dequeue it fully in proxy_deactivate() or we teach
block_task() to add a DEQUEUE_DELAYED flag when task_is_blocked().

I think the former is cleaner, but then we don't decay lag for fair
tasks :-(

We can't simply re-enqueue it either, since proxy migration might have
put it on a CPU outside its affinity mask, so we need to do a full
dequeue + wakeup in ttwu_runnable().

> 
> 
>> +       if (!task_on_cpu(rq, p)) {
>> +               /*
>> +                * When on_rq && !on_cpu the task is preempted, see if
>> +                * it should preempt the task that is current now.
>> +                */
>> +               wakeup_preempt(rq, p, wake_flags);
>> +       }
>> +       ttwu_do_wakeup(p);
>> +       return 1;
>>  }
>>
>>  void sched_ttwu_pending(void *arg)
> ...
>> @@ -6586,13 +6598,12 @@ static inline struct task_struct *proxy_
>>         return rq->idle;
>>  }
>>
>> -static bool proxy_deactivate(struct rq *rq, struct task_struct *donor)
>> +static void proxy_deactivate(struct rq *rq, struct task_struct *donor)
>>  {
>>         unsigned long state = READ_ONCE(donor->__state);
>>
>> -       /* Don't deactivate if the state has been changed to TASK_RUNNING */
>> -       if (state == TASK_RUNNING)
>> -               return false;
>> +       WARN_ON_ONCE(state == TASK_RUNNING);
>> +
>>         /*
>>          * Because we got donor from pick_next_task(), it is *crucial*
>>          * that we call proxy_resched_idle() before we deactivate it.
>> @@ -6603,7 +6614,7 @@ static bool proxy_deactivate(struct rq *
>>          * need to be changed from next *before* we deactivate.
>>          */
>>         proxy_resched_idle(rq);
>> -       return try_to_block_task(rq, donor, &state, true);
>> +       block_task(rq, donor, state);
>>  }
>>
>>  static inline void proxy_release_rq_lock(struct rq *rq, struct rq_flags *rf)
>> @@ -6677,71 +6688,6 @@ static void proxy_migrate_task(struct rq
>>         proxy_reacquire_rq_lock(rq, rf);
>>  }
>>
>> -static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
>> -                              struct task_struct *p)
>> -       __must_hold(__rq_lockp(rq))
>> -{
>> -       struct rq *task_rq, *target_rq = NULL;
>> -       int cpu, wake_flag = WF_TTWU;
>> -
>> -       lockdep_assert_rq_held(rq);
>> -       WARN_ON(p == rq->curr);
>> -
>> -       if (p == rq->donor)
>> -               proxy_resched_idle(rq);
>> -
>> -       proxy_release_rq_lock(rq, rf);
>> -       /*
>> -        * We drop the rq lock, and re-grab task_rq_lock to get
>> -        * the pi_lock (needed for select_task_rq) as well.
>> -        */
>> -       scoped_guard (task_rq_lock, p) {
>> -               task_rq = scope.rq;
>> -
>> -               /*
>> -                * Since we let go of the rq lock, the task may have been
>> -                * woken or migrated to another rq before we  got the
>> -                * task_rq_lock. So re-check we're on the same RQ. If
>> -                * not, the task has already been migrated and that CPU
>> -                * will handle any futher migrations.
>> -                */
>> -               if (task_rq != rq)
>> -                       break;
>> -
>> -               /*
>> -                * Similarly, if we've been dequeued, someone else will
>> -                * wake us
>> -                */
>> -               if (!task_on_rq_queued(p))
>> -                       break;
>> -
>> -               /*
>> -                * Since we should only be calling here from __schedule()
>> -                * -> find_proxy_task(), no one else should have
>> -                * assigned current out from under us. But check and warn
>> -                * if we see this, then bail.
>> -                */
>> -               if (task_current(task_rq, p) || task_on_cpu(task_rq, p)) {
>> -                       WARN_ONCE(1, "%s rq: %i current/on_cpu task %s %d  on_cpu: %i\n",
>> -                                 __func__, cpu_of(task_rq),
>> -                                 p->comm, p->pid, p->on_cpu);
>> -                       break;
>> -               }
>> -
>> -               update_rq_clock(task_rq);
>> -               deactivate_task(task_rq, p, DEQUEUE_NOCLOCK);
>> -               cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
>> -               set_task_cpu(p, cpu);
>> -               target_rq = cpu_rq(cpu);
>> -               clear_task_blocked_on(p, NULL);
>> -       }
>> -
>> -       if (target_rq)
>> -               attach_one_task(target_rq, p);
>> -
>> -       proxy_reacquire_rq_lock(rq, rf);
>> -}
>> -

Went a little heavy on the delete there, did you? :-)

>>  /*
>>   * Find runnable lock owner to proxy for mutex blocked donor
>>   *
>> @@ -6777,7 +6723,7 @@ find_proxy_task(struct rq *rq, struct ta
>>                                 clear_task_blocked_on(p, PROXY_WAKING);
>>                                 return p;
>>                         }
>> -                       goto force_return;
>> +                       goto deactivate;
>>                 }

This makes sense if we preserve the !TASK_RUNNING + p->blocked_on
invariant, since we'll definitely get a wakeup here.

>>
>>                 /*
>> @@ -6812,7 +6758,7 @@ find_proxy_task(struct rq *rq, struct ta
>>                                 __clear_task_blocked_on(p, NULL);
>>                                 return p;
>>                         }
>> -                       goto force_return;
>> +                       goto deactivate;

This too makes sense, considering !owner implies some task will be woken
up, but ... if we take this task off and another task steals the mutex,
this task will no longer be able to proxy for it since it is completely
blocked now.

Probably not desired. We should at least let it run and see if it can
get the mutex, and re-evaluate p->blocked_on, since !owner is a limbo
state.

>>                 }
>>
>>                 if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {
>> @@ -6891,12 +6837,7 @@ find_proxy_task(struct rq *rq, struct ta
>>         return owner;
>>
>>  deactivate:
>> -       if (proxy_deactivate(rq, donor))
>> -               return NULL;
>> -       /* If deactivate fails, force return */
>> -       p = donor;
>> -force_return:
>> -       proxy_force_return(rq, rf, p);
>> +       proxy_deactivate(rq, donor);
>>         return NULL;
>>  migrate_task:
>>         proxy_migrate_task(rq, rf, p, owner_cpu);
> 
> So I like getting rid of proxy_force_return(), but its not clear to me
> that proxy_deactivate() is what we want to do in these
> find_proxy_task() edge cases.
> 
> It feels like if we are already racing with ttwu, deactivating the
> task seems like it might open more windows where we might lose the
> wakeup.
> 
> In fact, the whole reason we have proxy_force_return() is that earlier
> in the proxy-exec development, when we hit those edge cases we usually
> would return proxy_reschedule_idle() just to drop the rq lock and let
> ttwu do its thing, but there kept on being cases where we would end up
> with lost wakeups.
> 
> But I'll give this a shot (and will integrate your ttwu_runnable
> cleanups regardless) and see how it does.

So I added the following on top of Peter's diff on top of
queue:sched/core, and it hasn't crashed and burnt yet when running a
handful of instances of sched-messaging with a mix of fair and SCHED_RR
priorities:

  (Includes John's findings from the parallel thread)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5b2b2451720a..e845e3a8ae65 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2160,7 +2160,7 @@ void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
 	dequeue_task(rq, p, flags);
 }
 
-static bool block_task(struct rq *rq, struct task_struct *p, unsigned long task_state)
+static void block_task(struct rq *rq, struct task_struct *p, unsigned long task_state)
 {
 	int flags = DEQUEUE_NOCLOCK;
 
@@ -3696,6 +3696,20 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
 	}
 }
 
+static void zap_balance_callbacks(struct rq *rq);
+
+static inline void proxy_reset_donor(struct rq *rq)
+{
+#ifdef CONFIG_SCHED_PROXY_EXEC
+	WARN_ON_ONCE(rq->curr == rq->donor);
+
+	put_prev_set_next_task(rq, rq->donor, rq->curr);
+	rq_set_donor(rq, rq->curr);
+	zap_balance_callbacks(rq);
+	resched_curr(rq);
+#endif
+}
+
 /*
  * Consider @p being inside a wait loop:
  *
@@ -3730,6 +3744,8 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
 		return 0;
 
 	update_rq_clock(rq);
+	if (p->se.sched_delayed)
+		enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
 	if (sched_proxy_exec() && p->blocked_on) {
 		guard(raw_spinlock)(&p->blocked_lock);
 		struct mutex *lock = p->blocked_on;
@@ -3738,15 +3754,20 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
 			 * TASK_WAKING is a special state and results in
 			 * DEQUEUE_SPECIAL such that the task will actually be
 			 * forced from the runqueue.
+			 *
+			 * XXX: All of this is now equivalent of
+			 * proxy_needs_return() from John's series :-)
 			 */
-			block_task(rq, p, TASK_WAKING);
 			p->blocked_on = NULL;
+			if (task_current(rq, p))
+				goto out;
+			if (task_current_donor(rq, p))
+				proxy_reset_donor(rq);
+			block_task(rq, p, TASK_WAKING);
 			return 0;
 		}
 	}
-
-	if (p->se.sched_delayed)
-		enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
+out:
 	if (!task_on_cpu(rq, p)) {
 		/*
 		 * When on_rq && !on_cpu the task is preempted, see if
@@ -4256,6 +4277,15 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 		 */
 		smp_cond_load_acquire(&p->on_cpu, !VAL);
 
+		/*
+		 * We never clear the blocked_on relation on proxy_deactivate.
+		 * If we don't clear it here, we have TASK_RUNNING + p->blocked_on
+		 * when waking up. Since this is a fully blocked, off CPU task
+		 * waking up, it should be safe to clear the blocked_on relation.
+		 */
+		if (task_is_blocked(p))
+			clear_task_blocked_on(p, NULL);
+
 		cpu = select_task_rq(p, p->wake_cpu, &wake_flags);
 		if (task_cpu(p) != cpu) {
 			if (p->in_iowait) {
@@ -6977,6 +7007,10 @@ static void __sched notrace __schedule(int sched_mode)
 		switch_count = &prev->nvcsw;
 	}
 
+	/* See: https://github.com/kudureranganath/linux/commit/0d6a01bb19db39f045d6f0f5fb4d196500091637 */
+	if (!prev_state && task_is_blocked(prev))
+		clear_task_blocked_on(prev, NULL);
+
 pick_again:
 	assert_balance_callbacks_empty(rq);
 	next = pick_next_task(rq, rq->donor, &rf);
---

Now, there are obviously some sharp edges, highlighted above, that may
affect performance and correctness to some extent, but once we have the
sleeping-owner bits, it should all go away.

Anyway, I'll let you bash me now on why that try_to_wake_up() hunk
might be totally wrong and dangerous :-)
 
-- 
Thanks and Regards,
Prateek


Thread overview: 56+ messages
2026-03-24 19:13 [PATCH v26 00/10] Simple Donor Migration for Proxy Execution John Stultz
2026-03-24 19:13 ` [PATCH v26 01/10] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr() John Stultz
2026-04-03 12:30   ` [tip: sched/core] " tip-bot2 for John Stultz
2026-03-24 19:13 ` [PATCH v26 02/10] sched: Minimise repeated sched_proxy_exec() checking John Stultz
2026-04-03 12:30   ` [tip: sched/core] " tip-bot2 for John Stultz
2026-03-24 19:13 ` [PATCH v26 03/10] sched: Fix potentially missing balancing with Proxy Exec John Stultz
2026-04-03 12:30   ` [tip: sched/core] " tip-bot2 for John Stultz
2026-03-24 19:13 ` [PATCH v26 04/10] locking: Add task::blocked_lock to serialize blocked_on state John Stultz
2026-04-03 12:30   ` [tip: sched/core] " tip-bot2 for John Stultz
2026-03-24 19:13 ` [PATCH v26 05/10] sched: Fix modifying donor->blocked on without proper locking John Stultz
2026-03-26 21:45   ` Steven Rostedt
2026-04-03 12:30   ` [tip: sched/core] " tip-bot2 for John Stultz
2026-03-24 19:13 ` [PATCH v26 06/10] sched/locking: Add special p->blocked_on==PROXY_WAKING value for proxy return-migration John Stultz
2026-04-03 12:30   ` [tip: sched/core] " tip-bot2 for John Stultz
2026-03-24 19:13 ` [PATCH v26 07/10] sched: Add assert_balance_callbacks_empty helper John Stultz
2026-04-03 12:30   ` [tip: sched/core] " tip-bot2 for John Stultz
2026-03-24 19:13 ` [PATCH v26 08/10] sched: Add logic to zap balance callbacks if we pick again John Stultz
2026-04-03 12:30   ` [tip: sched/core] " tip-bot2 for John Stultz
2026-03-24 19:13 ` [PATCH v26 09/10] sched: Move attach_one_task and attach_task helpers to sched.h John Stultz
2026-04-03 12:30   ` [tip: sched/core] " tip-bot2 for John Stultz
2026-03-24 19:13 ` [PATCH v26 10/10] sched: Handle blocked-waiter migration (and return migration) John Stultz
2026-03-26 22:52   ` Steven Rostedt
2026-03-27  4:47     ` K Prateek Nayak
2026-03-27 12:47       ` Peter Zijlstra
2026-04-02 14:43   ` Peter Zijlstra
2026-04-02 15:08     ` Peter Zijlstra
2026-04-02 17:43       ` John Stultz
2026-04-02 17:34     ` John Stultz
2026-04-03 12:30   ` [tip: sched/core] " tip-bot2 for John Stultz
2026-03-25 10:52 ` [PATCH v26 00/10] Simple Donor Migration for Proxy Execution K Prateek Nayak
2026-03-27 11:48   ` Peter Zijlstra
2026-03-27 13:33     ` K Prateek Nayak
2026-03-27 15:20       ` Peter Zijlstra
2026-03-27 15:41         ` Peter Zijlstra
2026-03-27 16:00       ` Peter Zijlstra
2026-03-27 16:57         ` K Prateek Nayak
2026-04-02 15:50           ` Peter Zijlstra
2026-04-02 18:31             ` John Stultz
2026-04-02 21:04               ` John Stultz
2026-04-03  6:09               ` K Prateek Nayak [this message]
2026-04-03  9:52                 ` Peter Zijlstra
2026-04-03 10:25                   ` K Prateek Nayak
2026-04-03 11:28                     ` Peter Zijlstra
2026-04-03 13:43                       ` K Prateek Nayak
2026-04-03 14:38                         ` Peter Zijlstra
2026-04-03 15:39                           ` K Prateek Nayak
2026-04-03 21:08                             ` Peter Zijlstra
2026-04-04  0:26                             ` John Stultz
2026-04-04  5:49                               ` K Prateek Nayak
2026-04-04  6:07                                 ` John Stultz
2026-04-06  2:40                                   ` K Prateek Nayak
2026-04-03 12:54                     ` Peter Zijlstra
2026-04-03  9:18               ` Peter Zijlstra
2026-03-27 19:15     ` John Stultz
2026-03-27 19:10   ` John Stultz
2026-03-28  4:53     ` K Prateek Nayak
