Date: Thu, 2 Apr 2026 17:50:55 +0200
From: Peter Zijlstra
To: K Prateek Nayak
Cc: John Stultz, LKML, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman, Will Deacon,
	Waiman Long, Boqun Feng, "Paul E. McKenney", Metin Kaya, Xuewen Yan,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu,
	kernel-team@android.com
Subject: Re: [PATCH v26 00/10] Simple Donor Migration for Proxy Execution
Message-ID: <20260402155055.GV3738010@noisy.programming.kicks-ass.net>
References: <20260324191337.1841376-1-jstultz@google.com>
	<36e96f87-a682-436e-aefc-13e2e5810019@amd.com>
	<20260327114844.GQ2872@noisy.programming.kicks-ass.net>
	<33e60181-1809-44e1-bc4c-8ac7f79d49d6@amd.com>
	<20260327160017.GK3738010@noisy.programming.kicks-ass.net>
	<1515d405-62fc-4952-842f-b69e2bf192c0@amd.com>
In-Reply-To: <1515d405-62fc-4952-842f-b69e2bf192c0@amd.com>

On Fri, Mar 27, 2026 at 10:27:08PM +0530, K Prateek Nayak wrote:
> So taking a step back, this is what we have today (at least the
> common scenario):
>
>   CPU0 (donor - A)                        CPU1 (owner - B)
>   ================                        ================
>
>   mutex_lock()
>     __set_current_state(TASK_INTERRUPTIBLE)
>     __set_task_blocked_on(M)
>     schedule()
>       /* Retained for proxy */
>       proxy_migrate_task()
>       ==================================> /* Migrates to CPU1 */
>                                           ...
>                                           send_sig(B)
>                                             signal_wake_up_state()
>                                               wake_up_state()
>                                                 try_to_wake_up()
>                                                   ttwu_runnable()
>                                                     ttwu_do_wakeup() =============> /* A->__state = TASK_RUNNING */
>
>                                           /*
>                                            * After this point ttwu_state_match()
>                                            * will fail for A so a mutex_unlock()
>                                            * will have to go through __schedule()
>                                            * for return migration.
>                                            */
>
>                                           __schedule()
>                                             find_proxy_task()
>
>                                             /* Scenario 1 - B sleeps */
>                                             __clear_task_blocked_on()
>                                             proxy_deactivate(A)
>                                             /* A->__state == TASK_RUNNING */
>                                             /* fallthrough */
>
>                                             /* Scenario 2 - return migration after unlock() */
>                                             __clear_task_blocked_on()
>                                             /*
>                                              * At this point proxy stops.
>                                              * Much later after signal.
>                                              */
>                                             proxy_force_return()
>   schedule() <==================================
>   signal_pending_state()
>
>   clear_task_blocked_on()
>   __set_current_state(TASK_RUNNING)
>
>   ... /* return with -EINTR */
>
>
> Basically, a blocked donor has to wait for a mutex_unlock() before it
> can go process the signal and bail out on the mutex_lock_interruptible(),
> which seems counterproductive - but it is still okay from a correctness
> perspective.
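[ Illustration only, not from either tree: the donor side in the diagram
  is a caller like the sketch below (foo() is hypothetical), and the
  -EINTR return at the bottom of the diagram is what ends up waiting for
  a full mutex_unlock() cycle: ]

	int foo(struct mutex *m)
	{
		int ret;

		ret = mutex_lock_interruptible(m);
		if (ret)
			return ret;	/* -EINTR after a signal; this
					 * return is what gets delayed */
		/* ... critical section ... */
		mutex_unlock(m);
		return 0;
	}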
McKenney" , Metin Kaya , Xuewen Yan , Thomas Gleixner , Daniel Lezcano , Suleiman Souhlal , kuyo chang , hupu , kernel-team@android.com Subject: Re: [PATCH v26 00/10] Simple Donor Migration for Proxy Execution Message-ID: <20260402155055.GV3738010@noisy.programming.kicks-ass.net> References: <20260324191337.1841376-1-jstultz@google.com> <36e96f87-a682-436e-aefc-13e2e5810019@amd.com> <20260327114844.GQ2872@noisy.programming.kicks-ass.net> <33e60181-1809-44e1-bc4c-8ac7f79d49d6@amd.com> <20260327160017.GK3738010@noisy.programming.kicks-ass.net> <1515d405-62fc-4952-842f-b69e2bf192c0@amd.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1515d405-62fc-4952-842f-b69e2bf192c0@amd.com> On Fri, Mar 27, 2026 at 10:27:08PM +0530, K Prateek Nayak wrote: > So taking a step back, this is what we have today (at least the > common scenario): > > CPU0 (donor - A) CPU1 (owner - B) > ================ ================ > > mutex_lock() > __set_current_state(TASK_INTERRUPTIBLE) > __set_task_blocked_on(M) > schedule() > /* Retained for proxy */ > proxy_migrate_task() > ==================================> /* Migrates to CPU1 */ > ... > send_sig(B) > signal_wake_up_state() > wake_up_state() > try_to_wake_up() > ttwu_runnable() > ttwu_do_wakeup() =============> /* A->__state = TASK_RUNNING */ > > /* > * After this point ttwu_state_match() > * will fail for A so a mutex_unlock() > * will have to go through __schedule() > * for return migration. > */ > > __schedule() > find_proxy_task() > > /* Scenario 1 - B sleeps */ > __clear_task_blocked_on() > proxy_deactivate(A) > /* A->__state == TASK_RUNNING */ > /* fallthrough */ > > /* Scenario 2 - return migration after unlock() */ > __clear_task_blocked_on() > /* > * At this point proxy stops. > * Much later after signal. > */ > proxy_force_return() > schedule() <================================== > signal_pending_state() > > clear_task_blocked_on() > __set_current_state(TASK_RUNNING) > > ... /* return with -EINR */ > > > Basically, a blocked donor has to wait for a mutex_unlock() before it > can go process the signal and bail out on the mutex_lock_interruptible() > which seems counter productive - but it is still okay from correctness > perspective. > > > > > One thing you *can* do it frob ttwu_runnable() to 'refuse' to wake the > > task, and then it goes into the normal path and will do the migration. > > I've done things like that before. > > > > Does that fix all the return-migration cases? > > Yes it does! If we handle the return via ttwu_runnable(), which is what > proxy_needs_return() in the next chunk of changes aims to do and we can > build the invariant that TASK_RUNNING + task_is_blocked() is an illegal > state outside of __schedule() which works well with ttwu_state_match(). > > > > >> 2. Why does proxy_needs_return() (this comes later in John's tree but I > >> moved it up ahead) need the proxy_task_runnable_but_waking() override > >> of the ttwu_state_mach() machinery? > >> (https://github.com/johnstultz-work/linux-dev/commit/28ad4d3fa847b90713ca18a623d1ee7f73b648d9) > > > > Since it comes later, I've not seen it and not given it thought ;-) > > > > (I mean, I've probably seen it at some point, but being the gold-fish > > that I am, I have no recollection, so I might as well not have seen it). > > > > A brief look now makes me confused. 
So, I've not gone through all the cases yet, and it is *COMPLETELY*
untested, but something like the below perhaps?

---
 include/linux/sched.h |    2 
 kernel/sched/core.c   |  173 ++++++++++++++++----------------------------------
 2 files changed, 58 insertions(+), 117 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -161,7 +161,7 @@ struct user_event_mm;
  */
 #define is_special_task_state(state)				\
 	((state) & (__TASK_STOPPED | __TASK_TRACED | TASK_PARKED |	\
-		    TASK_DEAD | TASK_FROZEN))
+		    TASK_DEAD | TASK_WAKING | TASK_FROZEN))
 
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
 # define debug_normal_state_change(state_value)			\
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2160,8 +2160,29 @@ void deactivate_task(struct rq *rq, stru
 	dequeue_task(rq, p, flags);
 }
 
-static void block_task(struct rq *rq, struct task_struct *p, int flags)
+static bool block_task(struct rq *rq, struct task_struct *p, unsigned long task_state)
 {
+	int flags = DEQUEUE_NOCLOCK;
+
+	p->sched_contributes_to_load =
+		(task_state & TASK_UNINTERRUPTIBLE) &&
+		!(task_state & TASK_NOLOAD) &&
+		!(task_state & TASK_FROZEN);
+
+	if (unlikely(is_special_task_state(task_state)))
+		flags |= DEQUEUE_SPECIAL;
+
+	/*
+	 * __schedule()			ttwu()
+	 *   prev_state = prev->state;	  if (p->on_rq && ...)
+	 *   if (prev_state)		    goto out;
+	 *     p->on_rq = 0;		  smp_acquire__after_ctrl_dep();
+	 *				  p->state = TASK_WAKING
+	 *
+	 * Where __schedule() and ttwu() have matching control dependencies.
+	 *
+	 * After this, schedule() must not care about p->state any more.
+	 */
 	if (dequeue_task(rq, p, DEQUEUE_SLEEP | flags))
 		__block_task(rq, p);
 }
 
@@ -3702,28 +3723,39 @@ ttwu_do_activate(struct rq *rq, struct t
  */
 static int ttwu_runnable(struct task_struct *p, int wake_flags)
 {
-	struct rq_flags rf;
-	struct rq *rq;
-	int ret = 0;
+	ACQUIRE(__task_rq_lock, guard)(p);
+	struct rq *rq = guard.rq;
 
-	rq = __task_rq_lock(p, &rf);
-	if (task_on_rq_queued(p)) {
-		update_rq_clock(rq);
-		if (p->se.sched_delayed)
-			enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
-		if (!task_on_cpu(rq, p)) {
+	if (!task_on_rq_queued(p))
+		return 0;
+
+	if (sched_proxy_exec() && p->blocked_on) {
+		guard(raw_spinlock)(&p->blocked_lock);
+		struct mutex *lock = p->blocked_on;
+		if (lock) {
 			/*
-			 * When on_rq && !on_cpu the task is preempted, see if
-			 * it should preempt the task that is current now.
+			 * TASK_WAKING is a special state and results in
+			 * DEQUEUE_SPECIAL such that the task will actually be
+			 * forced from the runqueue.
 			 */
-			wakeup_preempt(rq, p, wake_flags);
+			block_task(rq, p, TASK_WAKING);
+			p->blocked_on = NULL;
+			return 0;
 		}
-		ttwu_do_wakeup(p);
-		ret = 1;
 	}
-	__task_rq_unlock(rq, p, &rf);
-	return ret;
+	update_rq_clock(rq);
+	if (p->se.sched_delayed)
+		enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
+	if (!task_on_cpu(rq, p)) {
+		/*
+		 * When on_rq && !on_cpu the task is preempted, see if
+		 * it should preempt the task that is current now.
+		 */
+		wakeup_preempt(rq, p, wake_flags);
+	}
+	ttwu_do_wakeup(p);
+	return 1;
 }
 
 void sched_ttwu_pending(void *arg)
@@ -6519,7 +6551,6 @@ static bool try_to_block_task(struct rq
 				unsigned long *task_state_p, bool should_block)
 {
 	unsigned long task_state = *task_state_p;
-	int flags = DEQUEUE_NOCLOCK;
 
 	if (signal_pending_state(task_state, p)) {
 		WRITE_ONCE(p->__state, TASK_RUNNING);
@@ -6539,26 +6570,7 @@ static bool try_to_block_task(struct rq
 	if (!should_block)
 		return false;
 
-	p->sched_contributes_to_load =
-		(task_state & TASK_UNINTERRUPTIBLE) &&
-		!(task_state & TASK_NOLOAD) &&
-		!(task_state & TASK_FROZEN);
-
-	if (unlikely(is_special_task_state(task_state)))
-		flags |= DEQUEUE_SPECIAL;
-
-	/*
-	 * __schedule()			ttwu()
-	 *   prev_state = prev->state;	  if (p->on_rq && ...)
-	 *   if (prev_state)		    goto out;
-	 *     p->on_rq = 0;		  smp_acquire__after_ctrl_dep();
-	 *				  p->state = TASK_WAKING
-	 *
-	 * Where __schedule() and ttwu() have matching control dependencies.
-	 *
-	 * After this, schedule() must not care about p->state any more.
-	 */
-	block_task(rq, p, flags);
+	block_task(rq, p, task_state);
 
 	return true;
 }
@@ -6586,13 +6598,12 @@ static inline struct task_struct *proxy_
 	return rq->idle;
 }
 
-static bool proxy_deactivate(struct rq *rq, struct task_struct *donor)
+static void proxy_deactivate(struct rq *rq, struct task_struct *donor)
 {
 	unsigned long state = READ_ONCE(donor->__state);
 
-	/* Don't deactivate if the state has been changed to TASK_RUNNING */
-	if (state == TASK_RUNNING)
-		return false;
+	WARN_ON_ONCE(state == TASK_RUNNING);
+
 	/*
 	 * Because we got donor from pick_next_task(), it is *crucial*
@@ -6603,7 +6614,7 @@ static bool proxy_deactivate(struct rq *
 	 * need to be changed from next *before* we deactivate.
 	 */
 	proxy_resched_idle(rq);
-	return try_to_block_task(rq, donor, &state, true);
+	block_task(rq, donor, state);
 }
 
 static inline void proxy_release_rq_lock(struct rq *rq, struct rq_flags *rf)
@@ -6677,71 +6688,6 @@ static void proxy_migrate_task(struct rq
 	proxy_reacquire_rq_lock(rq, rf);
 }
 
-static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
-			       struct task_struct *p)
-	__must_hold(__rq_lockp(rq))
-{
-	struct rq *task_rq, *target_rq = NULL;
-	int cpu, wake_flag = WF_TTWU;
-
-	lockdep_assert_rq_held(rq);
-	WARN_ON(p == rq->curr);
-
-	if (p == rq->donor)
-		proxy_resched_idle(rq);
-
-	proxy_release_rq_lock(rq, rf);
-	/*
-	 * We drop the rq lock, and re-grab task_rq_lock to get
-	 * the pi_lock (needed for select_task_rq) as well.
-	 */
-	scoped_guard (task_rq_lock, p) {
-		task_rq = scope.rq;
-
-		/*
-		 * Since we let go of the rq lock, the task may have been
-		 * woken or migrated to another rq before we got the
-		 * task_rq_lock. So re-check we're on the same RQ. If
-		 * not, the task has already been migrated and that CPU
-		 * will handle any futher migrations.
-		 */
-		if (task_rq != rq)
-			break;
-
-		/*
-		 * Similarly, if we've been dequeued, someone else will
-		 * wake us
-		 */
-		if (!task_on_rq_queued(p))
-			break;
-
-		/*
-		 * Since we should only be calling here from __schedule()
-		 * -> find_proxy_task(), no one else should have
-		 * assigned current out from under us. But check and warn
-		 * if we see this, then bail.
-		 */
-		if (task_current(task_rq, p) || task_on_cpu(task_rq, p)) {
-			WARN_ONCE(1, "%s rq: %i current/on_cpu task %s %d on_cpu: %i\n",
-				  __func__, cpu_of(task_rq),
-				  p->comm, p->pid, p->on_cpu);
-			break;
-		}
-
-		update_rq_clock(task_rq);
-		deactivate_task(task_rq, p, DEQUEUE_NOCLOCK);
-		cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
-		set_task_cpu(p, cpu);
-		target_rq = cpu_rq(cpu);
-		clear_task_blocked_on(p, NULL);
-	}
-
-	if (target_rq)
-		attach_one_task(target_rq, p);
-
-	proxy_reacquire_rq_lock(rq, rf);
-}
-
 /*
  * Find runnable lock owner to proxy for mutex blocked donor
  *
@@ -6777,7 +6723,7 @@ find_proxy_task(struct rq *rq, struct ta
 			clear_task_blocked_on(p, PROXY_WAKING);
 			return p;
 		}
-		goto force_return;
+		goto deactivate;
 	}
 
 	/*
@@ -6812,7 +6758,7 @@ find_proxy_task(struct rq *rq, struct ta
 			__clear_task_blocked_on(p, NULL);
 			return p;
 		}
-		goto force_return;
+		goto deactivate;
 	}
 
 	if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {
@@ -6891,12 +6837,7 @@ find_proxy_task(struct rq *rq, struct ta
 	return owner;
 
 deactivate:
-	if (proxy_deactivate(rq, donor))
-		return NULL;
-	/* If deactivate fails, force return */
-	p = donor;
-force_return:
-	proxy_force_return(rq, rf, p);
+	proxy_deactivate(rq, donor);
 	return NULL;
 migrate_task:
 	proxy_migrate_task(rq, rf, p, owner_cpu);
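[ For reference, the resulting block_task() call pattern - a sketch of
  the three call sites after the above, not additional patch content: ]

	block_task(rq, p, task_state);	/* try_to_block_task(): normal sleep */
	block_task(rq, donor, state);	/* proxy_deactivate(): park the donor */
	block_task(rq, p, TASK_WAKING);	/* ttwu_runnable(): TASK_WAKING is now
					 * a special state, so DEQUEUE_SPECIAL
					 * forces the task off the rq for
					 * return migration */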