From: K Prateek Nayak <kprateek.nayak@amd.com>
To: John Stultz <jstultz@google.com>, Vasily Gorbik <gor@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, "Mel Gorman" <mgorman@suse.de>,
	Valentin Schneider <vschneid@redhat.com>,
	"Vineeth Pillai" <vineethrp@google.com>,
	Joel Fernandes <joelagnelf@nvidia.com>,
	"Heiko Carstens" <hca@linux.ibm.com>,
	<linux-s390@vger.kernel.org>, <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v2 2/2] sched/core: Don't proxy-exec unmatched cookie lock owners
Date: Thu, 14 May 2026 15:24:47 +0530	[thread overview]
Message-ID: <9ceb2af0-33c6-40ca-b855-5167a9c5ae0f@amd.com> (raw)
In-Reply-To: <CANDhNCqrh28F=omb7ftuXpbMssA0+5paZTkq5Zr5zee6zZ7=Tg@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 17699 bytes --]

Hello John, Vasily,

On 5/13/2026 3:46 AM, John Stultz wrote:
> On Thu, May 7, 2026 at 3:42 AM Vasily Gorbik <gor@linux.ibm.com> wrote:
>>
>> Core scheduling chooses a core-wide cookie before __schedule()
>> installs the next task. With proxy-exec enabled, that task becomes the
>> donor/scheduling context, and find_proxy_task() may then replace the
>> execution context with the runnable mutex owner. If its cookie differs
>> from the selected core cookie, running it would bypass core scheduling's
>> cookie selection.
>>
>> When the final mutex owner found by find_proxy_task() does not match the
>> selected core cookie, stop proxying the donor. If the current execution
>> context is already in the blocked chain, fall back to idle like the
>> existing proxy-exec retry paths do. Otherwise deactivate the donor and
>> let __schedule() pick again. The mutex owner can be picked later under
>> its own cookie.
>>
>> Fixes: 7de9d4f94638 ("sched: Start blocked_on chain processing in find_proxy_task()")
>> Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
>> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
>> ---
>>  kernel/sched/core.c | 6 ++++++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 8aed55592ca9..d338fb714ce8 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -6960,6 +6960,12 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
>>                  */
>>         }
>>         WARN_ON_ONCE(owner && !owner->on_rq);
>> +
>> +       if (owner && !sched_cpu_cookie_match(rq, owner)) {
>> +               if (curr_in_chain)
>> +                       return proxy_resched_idle(rq);
>> +               goto deactivate;
>> +       }
> 
> 
> Hrm. This is less pretty.
> 
> My previous (admittedly shallow) thinking on the core-scheduler was
> that it wouldn't be an issue for proxy because the donor wasn't going
> to actually run on the cpu, so whatever isolation is done on the core,
> the donor migration wouldn't be a problem.
> 
> But I'm seeing now the donor won't be *chosen* until it has the right
> core_cookie, and then that may be different from the owner's cookie.
> 
> It seems like ideally we want the donor's effective cookie to be the
> same as the runnable-owner's in the chain.  The downside to this is
> you have to walk the blocked_on chain to evaluate this, and the whole
> core_tree rbtree sorts by cookie, so it's not trivial to rework
> selection this way.   And since the runnable-owner of the chain-tree
> changes over time, we can't just set the inherited cookie when we set
> blocked_on.
> 
> So I will need to think a bit more on this.

Sorry it took me a little while to wrap my head around the core-pick
bits (and an embarrassingly long time to spot the rq_lockp() trick for
core-wide locking), but with my limited knowledge, this is what I've
come up with to make proxy execution work with core scheduling. The
basic principle is:

o If the lock owner's cookie matches the donor's, we don't have
  to do anything - just run the owner on behalf of the donor.

o If there is a mismatch, then we have to first see if the rq of the
  blocked donor was responsible for core-pick (the rq that has the "max"
  priority task queued in pick_next_task()) and if it was, we can retry
  the pick by overriding the core cookie with that of the lock owner
  (spoofing lock owner as max).

  In case we find a blocked donor on a CPU that is not influencing the
  core cookie, we either swap to a different task with the same cookie
  or force-idle the core (see the condensed sketch after this list).
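
To make that flow concrete, the decision proxy_steal_cookie() makes in
the diff below boils down to roughly the following (a condensed,
hypothetical sketch only; it omits the force-idle accounting and the
donor/next-task bookkeeping that the real diff does):

  /* owner was just returned by find_proxy_task() in __schedule() */
  if (!sched_core_enabled(rq) || cookie_match(donor, owner))
          return owner;           /* run owner in the donor's context */

  if (rq->core_pick_leader) {
          /* This rq dictated the core-wide cookie: flag a re-pick so
           * the owner's cookie can be spoofed as the "max" cookie. */
          rq->core_proxy_pick = true;
          return NULL;            /* __schedule() goes to pick_again */
  }

  /* A remote CPU dictated the cookie: run a compatible local task if
   * one exists, otherwise force-idle until the next core-wide pick. */
  next = sched_core_find(rq, rq->core->core_cookie);
  return next ? next : rq->idle;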

This is the diff I have been testing so far:

  (On top of tip:sched/core at 4ac4d6549a656 + Vasily's Patch 1
   from this series)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8aed55592ca96..abddb958e10b5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -341,6 +341,18 @@ void sched_core_dequeue(struct rq *rq, struct task_struct *p, int flags)
 
 static int sched_task_is_throttled(struct task_struct *p, int cpu)
 {
+	/*
+	 * Don't move / select blocked tasks during cookie stealing.
+	 * Proxy execution + core scheduling uses __schedule()
+	 * for migration, return-migration, and selecting the
+	 * correct core-cookie based on the donor context.
+	 *
+	 * Simplest way to achieve this is by spoofing throttle
+	 * status for blocked donors.
+	 */
+	if (task_is_blocked(p))
+		return 1;
+
 	if (p->sched_class->task_is_throttled)
 		return p->sched_class->task_is_throttled(p, cpu);
 
@@ -6061,6 +6073,9 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	BUG(); /* The idle class should always have a runnable task. */
 }
 
+static struct task_struct *
+find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf);
+
 #ifdef CONFIG_SCHED_CORE
 static inline bool is_task_rq_idle(struct task_struct *t)
 {
@@ -6104,6 +6119,96 @@ extern void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_f
 
 static void queue_core_balance(struct rq *rq);
 
+static inline bool proxy_should_set_donor(struct rq *rq)
+{
+	bool donor_already_set = rq->core_proxy_pick;
+
+	if (!sched_core_enabled(rq))
+		return true;
+
+	/* The core_proxy_pick cycle has ended. */
+	rq->core_proxy_pick = false;
+	return !donor_already_set;
+}
+
+static struct task_struct *
+proxy_steal_cookie(struct rq *rq, struct task_struct *donor, struct task_struct *owner)
+{
+	if (!sched_core_enabled(rq))
+		return owner;
+
+	/* owner can safely run in place of donor */
+	if (cookie_match(donor, owner))
+		return owner;
+
+	/*
+	 * Another CPU on this core dictated the pick! Try to
+	 * find a task with matching cookie on this rq,
+	 * otherwise resort to force-idling.
+	 */
+	if (!rq->core_pick_leader) {
+		unsigned long cookie = rq->core->core_cookie;
+		struct task_struct *next;
+		bool fi_before = false;
+
+		next = sched_core_find(rq, cookie);
+		if (next == donor)
+			next = sched_core_next(next, cookie);
+
+		/*
+		 * We have a compatible next that is not a blocked task.
+		 * Switch context and run that instead.
+		 */
+		if (next)
+			goto found;
+
+		/*
+		 * rq doesn't have any compatible task and the core-pick
+		 * was dictated by a remote CPU. Transition to force-idle.
+		 *
+		 * XXX: Set &rf to NULL and use the idle class pick to
+		 * notify sched-ext core of idling although proxy execution
+		 * and sched-ext are mutually exclusive at the moment.
+		 */
+		next = idle_sched_class.pick_task(rq, NULL);
+		if (schedstat_enabled() && rq->core->core_forceidle_count) {
+			rq->core->core_forceidle_start = rq_clock(rq->core);
+			rq->core->core_forceidle_occupation--;
+		}
+
+		fi_before = !!rq->core->core_forceidle_count;
+		rq->core->core_forceidle_count++;
+		if (!fi_before) {
+			rq->core->core_forceidle_seq++;
+			task_vruntime_update(rq, donor, !!rq->core->core_forceidle_count);
+		}
+
+		rq->core_dl_server = NULL;
+		queue_core_balance(rq);
+found:
+		WARN_ON_ONCE(!cookie_match(next, donor));
+		put_prev_set_next_task(rq, donor, next);
+		rq_set_donor(rq, next);
+		return next;
+	}
+
+	/*
+	 * This CPU dictated the core-wide pick. Since this CPU is running
+	 * the highest priority donor on the core, it is possible to
+	 * dictate the core-cookie and explore proxy execution.
+	 *
+	 * Set rq->core_proxy_pick and explore a re-pick.
+	 * Returning NULL here would go through another cycle of
+	 * pick_next_task() which will update the cookie to owner's cookie
+	 * and retry.
+	 *
+	 * __schedule() runs with IRQs disabled and the owner will remain
+	 * valid until the re-pick even when the core lock is dropped.
+	 */
+	rq->core_proxy_pick = true;
+	return NULL;
+}
+
 static struct task_struct *
 pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	__must_hold(__rq_lockp(rq))
@@ -6131,6 +6236,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 		 */
 		rq->core_pick = NULL;
 		rq->core_dl_server = NULL;
+		rq->core_pick_leader = true;
 		return __pick_next_task(rq, prev, rf);
 	}
 
@@ -6150,12 +6256,18 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 		next = rq->core_pick;
 		rq->dl_server = rq->core_dl_server;
+		WARN_ON_ONCE(rq->core_proxy_pick);
 		rq->core_pick = NULL;
 		rq->core_dl_server = NULL;
 		goto out_set_next;
 	}
 
-	prev_balance(rq, prev, rf);
+	/*
+	 * If the CPU is re-trying the pick for the sake of proxy, skip executing
+	 * ->balance() calls since the pick on this CPU was already stabilized.
+	 */
+	if (likely(!rq->core_proxy_pick))
+		prev_balance(rq, prev, rf);
 
 	smt_mask = cpu_smt_mask(cpu);
 	need_sync = !!rq->core->core_cookie;
@@ -6188,6 +6300,15 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	 */
 	rq->core->core_task_seq++;
 
+	/*
+	 * If the CPU is here to re-evaluate the cookie for proxy-execution,
+	 * skip the no-cookie fast path since either the selected donor or the
+	 * lock owner has a valid core-cookie and the core-wide pick is
+	 * necessary in either case.
+	 */
+	if (unlikely(rq->core_proxy_pick))
+		goto restart_multi;
+
 	/*
 	 * Optimize for common case where this CPU has no cookies
 	 * and there are no cookied tasks running on siblings.
@@ -6200,6 +6321,12 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 		if (!next->core_cookie) {
 			rq->core_pick = NULL;
 			rq->core_dl_server = NULL;
+			/*
+			 * If no cookie is set, this CPU is free to trigger
+			 * a re-pick for proxy execution if the selection
+			 * turns out to be a blocked donor.
+			 */
+			rq->core_pick_leader = true;
 			/*
 			 * For robustness, update the min_vruntime_fi for
 			 * unconstrained picks as well.
@@ -6229,6 +6356,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 		if (i != cpu && (rq_i != rq->core || !core_clock_updated))
 			update_rq_clock(rq_i);
 
+		rq_i->core_pick_leader = false;
 		p = pick_task(rq_i, rf);
 		if (unlikely(p == RETRY_TASK))
 			goto restart_multi;
@@ -6236,10 +6364,72 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 		rq_i->core_pick = p;
 		rq_i->core_dl_server = rq_i->dl_server;
 
-		if (!max || prio_less(max, p, fi_before))
+		if (!max || prio_less(max, p, fi_before)) {
+			rq_i->core_pick_leader = true;
 			max = p;
+		}
 	}
 
+	if (unlikely(rq->core_proxy_pick)) {
+		struct task_struct *owner, *donor = rq->donor;
+
+		/*
+		 * Reset indicator; For all the failure cases, the
+		 * "core_proxy_pick" should be considered ended.
+		 */
+		rq->core_proxy_pick = false;
+
+		/*
+		 * The core_proxy_pick should no longer be considered if:
+		 *
+		 * 1) rq is no longer the pick leader (running the highest priority task).
+		 * 2) rq->donor is no longer the rq->core_pick. (Different pick).
+		 * 3) rq->core_pick is no longer blocked. (Donor was woken up)
+		 * 4) find_proxy_task(rq->donor) no longer points to core_proxy_pick.
+		 *    (change in the wait chain)
+		 */
+		if (!rq->core_pick_leader ||
+		    donor != rq->core_pick ||
+		    !task_is_blocked(donor))
+			goto continue_pick;
+
+		/* Check the wait-chain again. */
+		owner = find_proxy_task(rq, donor, rf);
+		/*
+		 * Wait chain has changed and requires a re-pick :-(
+		 * Since there is no dependency on rq->curr, do the
+		 * repick now.
+		 *
+		 * The "core_proxy_pick" has ended; Continue with the
+		 * default path. __schedule() will trigger a repick
+		 * if it finds it appropriate.
+		 */
+		if (!owner)
+			goto restart_multi;
+		/*
+		 * wait-chain has a new dependency on rq->curr!
+		 *
+		 * Continue with the core_pick as is for now even if
+		 * it is blocked. Effect of proxy_resched_idle() will
+		 * be undone at "out_set_next" on the way out and
+		 * find_proxy_task() in __schedule() will redo these
+		 * changes.
+		 */
+		if (owner == rq->idle) {
+			prev = rq->idle; /* New donor */
+			goto continue_pick;
+		}
+		/*
+		 * If the wait-chain changed but find_proxy_task()
+		 * returned a valid owner task, use it as the new max
+		 * task and continue with the pick.
+		 */
+		rq->core_proxy_pick = true;
+		rq->core_pick = owner;
+		next = owner;
+		max = owner;
+	}
+continue_pick:
 	cookie = rq->core->core_cookie = max->core_cookie;
 
 	/*
@@ -6335,8 +6525,17 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 		resched_curr(rq_i);
 	}
-
+	/*
+	 * If this was a proxy driven pick, skip switching the donor context.
+	 * "core_proxy_pick" indicator will be consumed in the caller since
+	 * "rq->donor" can be overriden by the caller.
+	 */
+	if (unlikely(rq->core_proxy_pick)) {
+		WARN_ON_ONCE(rq->donor != prev);
+		return next;
+	}
 out_set_next:
+	WARN_ON_ONCE(rq->core_proxy_pick);
 	put_prev_set_next_task(rq, prev, next);
 	if (rq->core->core_forceidle_count && next == rq->idle)
 		queue_core_balance(rq);
@@ -6464,6 +6663,7 @@ static void sched_core_cpu_starting(unsigned int cpu)
 	guard(core_lock)(&cpu);
 
 	WARN_ON_ONCE(rq->core != rq);
+	rq->core_proxy_pick = false;
 
 	/* if we're the first, we'll be our own leader */
 	if (cpumask_weight(smt_mask) == 1)
@@ -6558,6 +6758,13 @@ static inline void sched_core_cpu_dying(unsigned int cpu)
 static inline void sched_core_cpu_starting(unsigned int cpu) {}
 static inline void sched_core_cpu_deactivate(unsigned int cpu) {}
 static inline void sched_core_cpu_dying(unsigned int cpu) {}
+static inline bool proxy_should_set_donor(struct rq *rq) { return true; }
+
+static struct task_struct *
+proxy_steal_cookie(struct rq *rq, struct task_struct *donor, struct task_struct *next)
+{
+	return next;
+}
 
 static struct task_struct *
 pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
@@ -6653,6 +6860,7 @@ static inline void proxy_set_task_cpu(struct task_struct *p, int cpu)
 static inline struct task_struct *proxy_resched_idle(struct rq *rq)
 {
 	put_prev_set_next_task(rq, rq->donor, rq->idle);
+	rq->next_class = &idle_sched_class;
 	rq_set_donor(rq, rq->idle);
 	set_tsk_need_resched(rq->idle);
 	return rq->idle;
@@ -7115,7 +7323,9 @@ static void __sched notrace __schedule(int sched_mode)
 	if (sched_proxy_exec()) {
 		struct task_struct *prev_donor = rq->donor;
 
-		rq_set_donor(rq, next);
+		if (likely(proxy_should_set_donor(rq)))
+			rq_set_donor(rq, next);
+
 		if (unlikely(next->blocked_on)) {
 			next = find_proxy_task(rq, next, &rf);
 			if (!next) {
@@ -7126,6 +7336,21 @@ static void __sched notrace __schedule(int sched_mode)
 				zap_balance_callbacks(rq);
 				goto keep_resched;
 			}
+			/*
+			 * Check if the cookie matches. next can be:
+			 *
+			 * - The lock owner found by find_proxy_task()
+			 *   if it has a compatible cookie.
+			 *
+			 * - rq->idle if the core-wide pick was dictated
+			 *   by a task on a remote CPU and this CPU
+			 *   should now force-idle until the next pick.
+			 *
+			 * - NULL if the CPU should retry the pick.
+			 */
+			next = proxy_steal_cookie(rq, rq->donor, next);
+			if (!next)
+				goto pick_again;
 		}
 		if (rq->donor == prev_donor && prev != next) {
 			struct task_struct *donor = rq->donor;
@@ -9034,6 +9259,7 @@ void __init sched_init(void)
 		rq->core_forceidle_start = 0;
 
 		rq->core_cookie = 0UL;
+		rq->core_proxy_pick = false;
 #endif
 		zalloc_cpumask_var_node(&rq->scratch_mask, GFP_KERNEL, cpu_to_node(i));
 	}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9f63b15d309d1..cd8acc85f5158 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1333,14 +1333,38 @@ struct rq {
 	unsigned int		core_enabled;
 	unsigned int		core_sched_seq;
 	struct rb_root		core_tree;
+	/*
+	 * Proxy Execution bits:
+	 *
+	 * core_proxy_pick: The CPU is redoing
+	 * pick_next_task() for proxy execution when
+	 * the lock owner has a different core cookie
+	 * compared to the donor.
+	 *
+	 * core_pick_leader: Indicator on the runqueue
+	 * that has the highest priority task from the
+	 * core-wide pick. If a !core_pick_leader finds
+	 * a blocked donor, it should go into a force
+	 * idle since the core_cookie is being dictated
+	 * by a task on a different CPU.
+	 *
+	 * XXX: Intentionally placed outside
+	 * CONFIG_SCHED_PROXY_EXEC for now for the sake
+	 * of this RFC. Do we really need to complicate
+	 * the dependency for 4-bytes that fit perfectly
+	 * in a hole?
+	 */
+	bool			core_proxy_pick;
+	bool			core_pick_leader;
+				/* Hole */
 
 	/* shared state -- careful with sched_core_cpu_deactivate() */
 	unsigned int		core_task_seq;
 	unsigned int		core_pick_seq;
-	unsigned long		core_cookie;
 	unsigned int		core_forceidle_count;
 	unsigned int		core_forceidle_seq;
 	unsigned int		core_forceidle_occupation;
+	unsigned long		core_cookie;
 	u64			core_forceidle_start;
 #endif /* CONFIG_SCHED_CORE */
 
---

So far it has survived 10 runs of the modified priority-inversion-demo
with the coresched prefix, and attached is a chart comparing
sched_proxy_exec=0 (no-proxy*) vs sched_proxy_exec=1 (proxy*) runs.
There is still much room for optimization.

Full disclaimer: I see a:

    NOHZ tick-stop error: local softirq work is pending, handler #200!!!

being logged once in a while during the test but I see it even with
sched_proxy_exec=0 so I'm not too sure if this is something I've
introduced or not. I'll chase it a little while later on tip:sched/core
if it is reproducible there.

I have only tested CONFIG_SCHED_PROXY_EXEC + CONFIG_SCHED_CORE and
disabling just one might fail the build (or crash and burn).

Early feedback is welcome. I hope I have left enough comments in there
to justify the rationale (or how I might be breaking core-scheduling)
:-)

-- 
Thanks and Regards,
Prateek

[-- Attachment #2: chart-split.png --]
[-- Type: image/png, Size: 186991 bytes --]

Thread overview: 9+ messages
2026-05-07 10:41 [PATCH v2 0/2] sched/core: Fix proxy-exec/core-sched interactions Vasily Gorbik
2026-05-07 10:41 ` [PATCH v2 1/2] sched/core: Don't steal a proxy-exec donor Vasily Gorbik
2026-05-12 21:35   ` John Stultz
2026-05-07 10:41 ` [PATCH v2 2/2] sched/core: Don't proxy-exec unmatched cookie lock owners Vasily Gorbik
2026-05-12 22:16   ` John Stultz
2026-05-14  9:54     ` K Prateek Nayak [this message]
2026-05-12 21:17 ` [PATCH v2 0/2] sched/core: Fix proxy-exec/core-sched interactions John Stultz
2026-05-13  0:48   ` John Stultz
2026-05-15 16:38     ` Vasily Gorbik
