Date: Wed, 04 Feb 2026 16:58:47 +0000
Subject: Re: [PATCH] sched_ext: Invalidate dispatch decisions on CPU affinity changes
From: Kuba Piecuch
To: Andrea Righi, Kuba Piecuch
Cc: Tejun Heo, David Vernet, Changwoo Min, Christian Loehle, Emil Tsalapatis, Daniel Hodges
References: <20260203230639.1259869-1-arighi@nvidia.com>

On Wed Feb 4, 2026 at 3:36 PM UTC, Andrea Righi wrote:
>> >
>> > When finish_dispatch() detects a qseq mismatch, the dispatch is dropped
>> > and the task is returned to the SCX_OPSS_QUEUED state, allowing it to be
>> > re-dispatched using up-to-date affinity information.
>>
>> How will the scheduler know that the dispatch was dropped? Is the scheduler
>> expected to infer it from the ops.enqueue() that follows set_cpus_allowed_scx()
>> on CPU1?
>
> The idea was that, if the dispatch is dropped, we'll see another
> ops.enqueue() for the task, so at least the task is not "lost" and the
> BPF scheduler gets another chance what to do with it. In this case it'd be
> useful to set SCX_ENQ_REENQ (or a dedicated special flag) to indicate that
> the enqueue resulted from a dropped dispatch.

I think SCX_ENQ_REENQ is enough for now; we can always add a dedicated flag
if a need for it arises.

I still worry about the scenario you described. In particular, I think it
can lead to tasks being forgotten (i.e. not re-enqueued) after a failed
dispatch:

  CPU0                                        CPU1
  ----                                        ----
  if (cpumask_test_cpu(cpu, p->cpus_ptr))
                                              task_rq_lock(p)
                                              dequeue_task_scx(p, ...)
                                                (remove p from internal queues)
                                              set_cpus_allowed_scx(p, new_mask)
                                              enqueue_task_scx(p, ...)
                                                (add p to internal queues)
                                              task_rq_unlock(p)
  (remove p from internal queues)
  scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | cpu, 0)

In this scenario, the ops.enqueue() which is supposed to notify the BPF
scheduler about the failed dispatch actually happens _before_ the actual
dispatch, so once the dispatch fails, the task won't be re-enqueued.

There are two problems here:

1. CPU0 makes a scheduling decision based on stale data and it isn't
   detected.
2. Even if it is detected and the dispatch aborted, the task won't be
   re-enqueued.

The way we deal with the first problem in ghOSt (Google's equivalent of
sched_ext) is that we expose the per-task sequence number to the BPF
scheduler. On the dispatch path, when the BPF scheduler has a candidate
task, it retrieves the task's seqnum, re-checks the task state to ensure
that it's still eligible for dispatch, and passes the seqnum to the
kernel's dispatch helper for verification. If the kernel detects that the
seqnum has already changed, it synchronously fails the dispatch attempt
(dispatch always happens synchronously in ghOSt).

In sched_ext, we could do the synchronous check, but we would also need to
do the same check later in finish_dispatch(), comparing the current qseq
against the qseq passed in by the BPF scheduler.

To fix the second problem, we would need to explicitly call ops.enqueue()
from finish_dispatch() and the other places where we abort a dispatch
because the qseq is out of date.
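To make the seqnum idea above a bit more concrete, here's a rough sketch of
what the dispatch path could look like on the BPF side if the qseq were
exposed. This is illustration only: scx_bpf_task_qseq() and
scx_bpf_dsq_insert_seq() don't exist today, and pick_candidate() /
requeue_candidate() stand in for whatever internal queue the scheduler
maintains (the usual scx BPF boilerplate is omitted).

void BPF_STRUCT_OPS(example_dispatch, s32 cpu, struct task_struct *prev)
{
	struct task_struct *p;
	u64 seq;

	/* pop a candidate from the scheduler's internal queue (elided) */
	p = pick_candidate();
	if (!p)
		return;

	/* sample the task's sequence number before re-validating its state */
	seq = scx_bpf_task_qseq(p);	/* hypothetical helper */

	/* re-check eligibility against the task's current affinity */
	if (!bpf_cpumask_test_cpu(cpu, p->cpus_ptr)) {
		requeue_candidate(p);
		return;
	}

	/*
	 * Hand the sampled seqnum back to the kernel. If it no longer
	 * matches (e.g. a concurrent set_cpus_allowed_scx() bumped it),
	 * the helper fails the dispatch synchronously.
	 */
	scx_bpf_dsq_insert_seq(p, SCX_DSQ_LOCAL_ON | cpu, SCX_SLICE_DFL,
			       seq, 0);
}

finish_dispatch() would then repeat the same seqnum comparison to catch an
affinity change that lands between the re-check and the insert.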
Alternatively, instead of re-enqueueing from finish_dispatch(), we could
just add locking to the BPF scheduler to prevent the race from happening in
the first place.

Thanks,
Kuba