Re: [PATCH] sched_ext: Invalidate dispatch decisions on CPU affinity changes


From: Kuba Piecuch <jpiecuch@google.com>
To: Andrea Righi <arighi@nvidia.com>, Kuba Piecuch <jpiecuch@google.com>
Cc: Tejun Heo <tj@kernel.org>, David Vernet <void@manifault.com>,
	 Changwoo Min <changwoo@igalia.com>,
	Christian Loehle <christian.loehle@arm.com>,
	 Emil Tsalapatis <emil@etsalapatis.com>,
	Daniel Hodges <hodgesd@meta.com>, <sched-ext@lists.linux.dev>,
	 <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] sched_ext: Invalidate dispatch decisions on CPU affinity changes
Date: Wed, 04 Feb 2026 16:58:47 +0000
Message-ID: <DG6C5HB3PHH3.2JRZX83QMLK2X@google.com>
In-Reply-To: <aYNnmI26oS7YNuMP@gpd4>

On Wed Feb 4, 2026 at 3:36 PM UTC, Andrea Righi wrote:
>> >
>> > When finish_dispatch() detects a qseq mismatch, the dispatch is dropped
>> > and the task is returned to the SCX_OPSS_QUEUED state, allowing it to be
>> > re-dispatched using up-to-date affinity information.
>> 
>> How will the scheduler know that the dispatch was dropped? Is the scheduler
>> expected to infer it from the ops.enqueue() that follows set_cpus_allowed_scx()
>> on CPU1?
>
> The idea was that, if the dispatch is dropped, we'll see another
> ops.enqueue() for the task, so at least the task is not "lost" and the
> BPF scheduler gets another chance to decide what to do with it. In this
> case it'd be useful to set SCX_ENQ_REENQ (or a dedicated special flag) to
> indicate that the enqueue resulted from a dropped dispatch.

I think SCX_ENQ_REENQ is enough for now; we can always add a dedicated flag
if the need arises.

I still worry about the scenario you described. In particular, I think it can
lead to tasks being forgotten (i.e. not re-enqueued) after a failed dispatch.

  CPU0                                      CPU1
  ----                                      ----
  if (cpumask_test_cpu(cpu, p->cpus_ptr))
                                            task_rq_lock(p)
                                            dequeue_task_scx(p, ...)
                                              (remove p from internal queues)
                                            set_cpus_allowed_scx(p, new_mask)
                                            enqueue_task_scx(p, ...)
                                              (add p to internal queues)
                                            task_rq_unlock(p)
      (remove p from internal queues)
      scx_bpf_dsq_insert(p,
              SCX_DSQ_LOCAL_ON | cpu, 0)

In this scenario, the ops.enqueue() that is supposed to notify the BPF
scheduler about the failed dispatch happens _before_ the dispatch itself,
so once the dispatch fails, the task won't be re-enqueued.

There are two problems here:

1. CPU0 makes a scheduling decision based on stale data and it isn't detected.
2. Even if it is detected and the dispatch aborted, the task won't be
   re-enqueued.

The way we deal with the first problem in ghOSt (Google's equivalent of
sched_ext) is to expose the per-task sequence number to the BPF scheduler.
On the dispatch path, when the BPF scheduler has a candidate task,
it retrieves the task's seqnum, re-checks the task state to ensure that it's
still eligible for dispatch, and passes the seqnum to the kernel's dispatch
helper for verification. If the kernel detects that the seqnum has already
changed, it synchronously fails the dispatch attempt (dispatch always happens
synchronously in ghOSt). In sched_ext, we could do the synchronous check, but
we would also need to repeat the check later in finish_dispatch(), comparing
the current qseq against the qseq passed by the BPF scheduler.
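
To make the handshake concrete, here is a minimal userspace sketch in C of
the "read seqnum, re-check, pass seqnum back" pattern. It is only a model:
struct seq_task and every function name below are hypothetical and are not
the ghOSt or sched_ext API, and the affinity mask is a plain 64-bit bitmask.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical task model, not a kernel structure. */
struct seq_task {
	atomic_uint_fast64_t seq;	/* bumped on every state change */
	uint64_t cpus_allowed;		/* bit n set => CPU n is allowed */
};

/* "Kernel" side: change affinity and invalidate in-flight decisions. */
static void seq_task_set_affinity(struct seq_task *t, uint64_t new_mask)
{
	t->cpus_allowed = new_mask;
	atomic_fetch_add(&t->seq, 1);
}

/* Scheduler side: snapshot the seqnum before inspecting task state. */
static uint64_t seq_task_read_seq(struct seq_task *t)
{
	return atomic_load(&t->seq);
}

/* Scheduler side: eligibility re-check done after the snapshot. */
static bool seq_task_allows_cpu(const struct seq_task *t, int cpu)
{
	return (t->cpus_allowed >> cpu) & 1;
}

/*
 * "Kernel" dispatch helper: fail synchronously if the task changed
 * after the scheduler took its snapshot.
 */
static bool seq_task_dispatch(struct seq_task *t, int cpu, uint64_t seen_seq)
{
	if (atomic_load(&t->seq) != seen_seq)
		return false;	/* stale decision, caller must retry */
	(void)cpu;		/* a real helper would queue t on cpu here */
	return true;
}
```

The scheduler would call seq_task_read_seq(), then seq_task_allows_cpu(),
then seq_task_dispatch() with the snapshot. The model leaves out the
locking a real implementation needs around cpus_allowed; it only shows
that a change between snapshot and dispatch is caught by the comparison.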

To fix the second problem, we would need to explicitly call ops.enqueue()
from finish_dispatch() and the other places where we abort dispatch if the
qseq is out of date.
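
As a rough illustration of that second fix, the following C model aborts a
dispatch on a qseq mismatch and notifies the scheduler right away instead
of relying on an earlier enqueue. The names (struct model_task,
model_finish_dispatch, model_enqueue) are made up for this sketch and do
not correspond to the actual kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical task model, not a kernel structure. */
struct model_task {
	uint64_t qseq;		/* bumped on each dequeue/enqueue cycle */
	int enqueue_calls;	/* times the scheduler was notified */
};

/* Stand-in for the BPF scheduler's ops.enqueue() callback. */
typedef void (*model_enqueue_fn)(struct model_task *t);

/* Trivial callback that only counts notifications. */
static void model_enqueue(struct model_task *t)
{
	t->enqueue_calls++;
}

/*
 * Model of deferred dispatch completion: if the task was requeued
 * since the scheduler made its decision, drop the dispatch and hand
 * the task back to the scheduler explicitly.
 */
static bool model_finish_dispatch(struct model_task *t,
				  uint64_t qseq_at_dispatch,
				  model_enqueue_fn enqueue)
{
	if (t->qseq != qseq_at_dispatch) {
		enqueue(t);	/* scheduler learns about the abort */
		return false;
	}
	return true;		/* dispatch would proceed here */
}
```

With this shape the notification always comes after the aborted dispatch,
which is the ordering the race above breaks.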

Either that, or just add locking to the BPF scheduler to prevent the race from
happening in the first place.

Thanks,
Kuba
