Date: Fri, 3 Apr 2026 11:52:25 +0200
From: Peter Zijlstra
To: K Prateek Nayak
Cc: John Stultz, LKML, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman, Will Deacon,
	Waiman Long, Boqun Feng, "Paul E. McKenney", Metin Kaya, Xuewen Yan,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu,
	kernel-team@android.com
McKenney" , Metin Kaya , Xuewen Yan , Thomas Gleixner , Daniel Lezcano , Suleiman Souhlal , kuyo chang , hupu , kernel-team@android.com Subject: Re: [PATCH v26 00/10] Simple Donor Migration for Proxy Execution Message-ID: <20260403095225.GY3738010@noisy.programming.kicks-ass.net> References: <20260324191337.1841376-1-jstultz@google.com> <36e96f87-a682-436e-aefc-13e2e5810019@amd.com> <20260327114844.GQ2872@noisy.programming.kicks-ass.net> <33e60181-1809-44e1-bc4c-8ac7f79d49d6@amd.com> <20260327160017.GK3738010@noisy.programming.kicks-ass.net> <1515d405-62fc-4952-842f-b69e2bf192c0@amd.com> <20260402155055.GV3738010@noisy.programming.kicks-ass.net> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: On Fri, Apr 03, 2026 at 11:39:25AM +0530, K Prateek Nayak wrote: > >> @@ -3702,28 +3723,39 @@ ttwu_do_activate(struct rq *rq, struct t > >> */ > >> static int ttwu_runnable(struct task_struct *p, int wake_flags) > >> { > >> - struct rq_flags rf; > >> - struct rq *rq; > >> - int ret = 0; > >> + ACQUIRE(__task_rq_lock, guard)(p); > >> + struct rq *rq = guard.rq; > >> > >> - rq = __task_rq_lock(p, &rf); > >> - if (task_on_rq_queued(p)) { > >> - update_rq_clock(rq); > >> - if (p->se.sched_delayed) > >> - enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED); > >> - if (!task_on_cpu(rq, p)) { > >> + if (!task_on_rq_queued(p)) > >> + return 0; > >> + > >> + if (sched_proxy_exec() && p->blocked_on) { > >> + guard(raw_spinlock)(&p->blocked_lock); > >> + struct mutex *lock = p->blocked_on; > >> + if (lock) { > >> /* > >> - * When on_rq && !on_cpu the task is preempted, see if > >> - * it should preempt the task that is current now. > >> + * TASK_WAKING is a special state and results in > >> + * DEQUEUE_SPECIAL such that the task will actually be > >> + * forced from the runqueue. > >> */ > >> - wakeup_preempt(rq, p, wake_flags); > >> + block_task(rq, p, TASK_WAKING); > > This needs to reset the rq->donor if the task getting woken up is the > current donor. *groan*, that is a fun case. I'll ponder that. > >> + p->blocked_on = NULL; > >> + return 0; > >> } > >> - ttwu_do_wakeup(p); > >> - ret = 1; > >> } > >> - __task_rq_unlock(rq, p, &rf); > >> > >> - return ret; > >> + update_rq_clock(rq); > > nit. Since block_task() adds a DEQUEUE_NOCLOCK now we need to move that > clock update before the block. D'0h :-) > >> + if (p->se.sched_delayed) > >> + enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED); > > > > I can't precisely remember the details now, but I believe we need to > > handle enqueueing sched_delayed tasks before handling blocked_on > > tasks. > > So proxy_deactivate() can still delay the task leading to > task_on_rq_queued() and the wakeup coming to ttwu_runnable() so either > we can dequeue it fully in proxy_deactivate() or we need to teach > block_task() to add a DEQUEUE_DELAYED flag when task_is_blocked(). > > I think the former is cleaner but we don't decay lag for fair task :-( > > We can't simply re-enqueue it either since proxy migration might have > put it on a CPU outside its affinity mask so we need to take a full > dequeue + wakeup in ttwu_runnable(). Right, sanest option is to have ttwu_runnable() deal with this. > >> -static void proxy_force_return(struct rq *rq, struct rq_flags *rf, > >> - struct task_struct *p) > >> - __must_hold(__rq_lockp(rq)) > >> -{ > >> -} > >> - > > Went a little heavy of the delete there did you? 

> >> -static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
> >> -			       struct task_struct *p)
> >> -	__must_hold(__rq_lockp(rq))
> >> -{
> >> -}
> >> -
>
> Went a little heavy on the delete there, did you? :-)

Well, I thought that was the whole idea, have ttwu() handle this :-)

> >>  /*
> >>   * Find runnable lock owner to proxy for mutex blocked donor
> >>   *
> >> @@ -6777,7 +6723,7 @@ find_proxy_task(struct rq *rq, struct ta
> >>  			clear_task_blocked_on(p, PROXY_WAKING);
> >>  			return p;
> >>  		}
> >> -		goto force_return;
> >> +		goto deactivate;
> >>  	}
>
> This makes sense if we preserve the !TASK_RUNNING + p->blocked_on
> invariant since we'll definitely get a wakeup here.

Right, so TASK_RUNNING must imply !->blocked_on.

> >>
> >>  	/*
> >> @@ -6812,7 +6758,7 @@ find_proxy_task(struct rq *rq, struct ta
> >>  			__clear_task_blocked_on(p, NULL);
> >>  			return p;
> >>  		}
> >> -		goto force_return;
> >> +		goto deactivate;
>
> This too makes sense considering !owner implies some task will be woken
> up but ... if we take this task off and another task steals the mutex,
> this task will no longer be able to proxy it since it is completely
> blocked now.
>
> Probably not desired. We should at least let it run and see if it can
> get the mutex and evaluate the "p->blocked_on" again since !owner is
> a limbo state.

I need to go re-read the mutex side of things, but doesn't that do
hand-off way more aggressively?

Anyway, one thing that is completely missing is a fast path for when
the task is still inside its valid mask. I suspect adding that back
will cure some of these issues.
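
Something along these lines, perhaps; completely untested, and where
exactly it sits in ttwu_runnable() is a guess (early in the blocked_on
branch, reusing the 'out' label from the diff below):

	/*
	 * Fast path sketch: if the waking CPU is still inside the
	 * task's affinity mask, no return migration is needed and
	 * the normal wakeup path can run.
	 */
	if (cpumask_test_cpu(cpu_of(rq), p->cpus_ptr)) {
		p->blocked_on = NULL;
		goto out;
	}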

> So I added the following on top of Peter's diff on top of
> queue:sched/core and it hasn't crashed and burnt yet when running a
> handful of instances of sched-messaging with a mix of fair and SCHED_RR
> priority:
>
> (Includes John's findings from the parallel thread)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 5b2b2451720a..e845e3a8ae65 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2160,7 +2160,7 @@ void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
>  		dequeue_task(rq, p, flags);
>  }
>
> -static bool block_task(struct rq *rq, struct task_struct *p, unsigned long task_state)
> +static void block_task(struct rq *rq, struct task_struct *p, unsigned long task_state)
>  {
>  	int flags = DEQUEUE_NOCLOCK;
>
> @@ -3696,6 +3696,20 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
>  	}
>  }
>
> +static void zap_balance_callbacks(struct rq *rq);
> +
> +static inline void proxy_reset_donor(struct rq *rq)
> +{
> +#ifdef CONFIG_SCHED_PROXY_EXEC
> +	WARN_ON_ONCE(rq->curr == rq->donor);
> +
> +	put_prev_set_next_task(rq, rq->donor, rq->curr);
> +	rq_set_donor(rq, rq->curr);
> +	zap_balance_callbacks(rq);
> +	resched_curr(rq);
> +#endif
> +}

This one hurts my brain :-)

>  /*
>   * Consider @p being inside a wait loop:
>   *
> @@ -3730,6 +3744,8 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
>  		return 0;
>
>  	update_rq_clock(rq);
> +	if (p->se.sched_delayed)
> +		enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);

Right, this works but seems wasteful; might be better to add
DEQUEUE_DELAYED in the blocked_on case.

>  	if (sched_proxy_exec() && p->blocked_on) {

So I had doubts about this lockless test of ->blocked_on; I still
cannot convince myself it is correct.

>  		guard(raw_spinlock)(&p->blocked_lock);
>  		struct mutex *lock = p->blocked_on;
> @@ -3738,15 +3754,20 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
>  			 * TASK_WAKING is a special state and results in
>  			 * DEQUEUE_SPECIAL such that the task will actually be
>  			 * forced from the runqueue.
> +			 *
> +			 * XXX: All of this is now the equivalent of
> +			 * proxy_needs_return() from John's series :-)
>  			 */
> -			block_task(rq, p, TASK_WAKING);
>  			p->blocked_on = NULL;
> +			if (task_current(rq, p))
> +				goto out;

Right, fair enough :-)

This could also be done when rq->cpu is inside the p->cpus_ptr mask,
because in that case we don't strictly need a migration. Thinking about
that was on the todo list.

> +			if (task_current_donor(rq, p))
> +				proxy_reset_donor(rq);

Fun fun fun :-)

> +			block_task(rq, p, TASK_WAKING);
>  			return 0;
>  		}
>  	}
> -
> -	if (p->se.sched_delayed)
> -		enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
> +out:
>  	if (!task_on_cpu(rq, p)) {
>  		/*
>  		 * When on_rq && !on_cpu the task is preempted, see if
> @@ -4256,6 +4277,15 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  		 */
>  		smp_cond_load_acquire(&p->on_cpu, !VAL);
>
> +		/*
> +		 * We never clear the blocked_on relation on proxy_deactivate.
> +		 * If we don't clear it here, we have TASK_RUNNING + p->blocked_on
> +		 * when waking up. Since this is a fully blocked, off-CPU task
> +		 * waking up, it should be safe to clear the blocked_on relation.
> +		 */
> +		if (task_is_blocked(p))
> +			clear_task_blocked_on(p, NULL);
> +

Aah, yes! This is when find_proxy_task() hits deactivate() for us and we
skip ttwu_runnable(). We still need to clear ->blocked_on.

I am once again not sure about the lockless nature of accessing
->blocked_on.

>  		cpu = select_task_rq(p, p->wake_cpu, &wake_flags);
>  		if (task_cpu(p) != cpu) {
>  			if (p->in_iowait) {
> @@ -6977,6 +7007,10 @@ static void __sched notrace __schedule(int sched_mode)
>  		switch_count = &prev->nvcsw;
>  	}
>
> +	/* See: https://github.com/kudureranganath/linux/commit/0d6a01bb19db39f045d6f0f5fb4d196500091637 */
> +	if (!prev_state && task_is_blocked(prev))
> +		clear_task_blocked_on(prev, NULL);
> +

This one confuses me; ttwu() should never result in ->blocked_on being
set.

>  pick_again:
>  	assert_balance_callbacks_empty(rq);
>  	next = pick_next_task(rq, rq->donor, &rf);
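
If that really is not supposed to happen, maybe make it scream rather
than silently fix it up; an untested debug sketch, enforcing the
invariant from above (TASK_RUNNING must imply !->blocked_on):

	/* Debug sketch: find whoever sets ->blocked_on on a running task. */
	if (WARN_ON_ONCE(!prev_state && task_is_blocked(prev)))
		clear_task_blocked_on(prev, NULL);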