* [PATCH 1/3] sched_ext: Rename scx_dsq_insert_commit() to scx_dsq_insert_stage()
2026-05-15 17:27 [PATCH v4 sched_ext/for-7.2 0/3] sched_ext: Add dispatch transaction API Cheng-Yang Chou
@ 2026-05-15 17:27 ` Cheng-Yang Chou
2026-05-15 17:27 ` [PATCH 2/3] sched_ext: Add dispatch transaction API Cheng-Yang Chou
2026-05-15 17:27 ` [PATCH 3/3] selftests/sched_ext: Add dispatch_cookie test Cheng-Yang Chou
2 siblings, 0 replies; 6+ messages in thread
From: Cheng-Yang Chou @ 2026-05-15 17:27 UTC (permalink / raw)
To: sched-ext, Tejun Heo, David Vernet, Andrea Righi, Changwoo Min
Cc: Kuba Piecuch, Ching-Chun Huang, Chia-Ping Tsai, yphbchou0911
Rename the internal static helper to scx_dsq_insert_stage() to better
reflect its role: the function stages the task in the per-CPU dispatch
buffer before the actual DSQ insertion, which happens later in
finish_dispatch().
No functional change.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
---
kernel/sched/ext.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 3ba3771ae02e..9e59b068f02c 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -8422,7 +8422,7 @@ static bool scx_dsq_insert_preamble(struct scx_sched *sch, struct task_struct *p
return true;
}
-static void scx_dsq_insert_commit(struct scx_sched *sch, struct task_struct *p,
+static void scx_dsq_insert_stage(struct scx_sched *sch, struct task_struct *p,
u64 dsq_id, u64 enq_flags)
{
struct scx_dsp_ctx *dspc = &this_cpu_ptr(sch->pcpu)->dsp_ctx;
@@ -8508,7 +8508,7 @@ __bpf_kfunc bool scx_bpf_dsq_insert___v2(struct task_struct *p, u64 dsq_id,
else
p->scx.slice = p->scx.slice ?: 1;
- scx_dsq_insert_commit(sch, p, dsq_id, enq_flags);
+ scx_dsq_insert_stage(sch, p, dsq_id, enq_flags);
return true;
}
@@ -8536,7 +8536,7 @@ static bool scx_dsq_insert_vtime(struct scx_sched *sch, struct task_struct *p,
p->scx.dsq_vtime = vtime;
- scx_dsq_insert_commit(sch, p, dsq_id, enq_flags | SCX_ENQ_DSQ_PRIQ);
+ scx_dsq_insert_stage(sch, p, dsq_id, enq_flags | SCX_ENQ_DSQ_PRIQ);
return true;
}
--
2.48.1
* [PATCH 2/3] sched_ext: Add dispatch transaction API
2026-05-15 17:27 [PATCH v4 sched_ext/for-7.2 0/3] sched_ext: Add dispatch transaction API Cheng-Yang Chou
2026-05-15 17:27 ` [PATCH 1/3] sched_ext: Rename scx_dsq_insert_commit() to scx_dsq_insert_stage() Cheng-Yang Chou
@ 2026-05-15 17:27 ` Cheng-Yang Chou
2026-05-15 17:59 ` sashiko-bot
2026-05-15 17:27 ` [PATCH 3/3] selftests/sched_ext: Add dispatch_cookie test Cheng-Yang Chou
2 siblings, 1 reply; 6+ messages in thread
From: Cheng-Yang Chou @ 2026-05-15 17:27 UTC (permalink / raw)
To: sched-ext, Tejun Heo, David Vernet, Andrea Righi, Changwoo Min
Cc: Kuba Piecuch, Ching-Chun Huang, Chia-Ping Tsai, yphbchou0911
scx_bpf_dsq_insert() captures the task's dispatch token at insert time.
Any BPF-side validity checks performed before the insert fall outside
the race detection window: a dequeue/re-enqueue occurring between the
check and the insert goes undetected, and finish_dispatch() proceeds
with stale assumptions.
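Concretely, the undetected interleaving is:

  ops.dispatch()                          elsewhere
  --------------                          ---------
  validate @p, pick a DSQ
                                          @p dequeued, re-enqueued
                                          (qseq incremented)
  scx_bpf_dsq_insert(@p, ...)
    captures the already-new qseq
  finish_dispatch()
    qseq matches, dispatch proceeds on
    a decision made against stale state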
Introduce two new kfuncs to extend the detection window via a dispatch
transaction:
- scx_bpf_dsq_insert_begin(p) captures the sequence number before any
pre-dispatch checks, opening the transaction.
- scx_bpf_dsq_insert_commit(p, args, token) closes it by inserting @p
using the early-captured token. Only callable from ops.dispatch() and
only targeting local DSQs. If the token has gone stale,
finish_dispatch() silently discards the buffered entry.
scx_bpf_dsq_insert_commit() packs dsq_id, slice and enq_flags into a
struct to stay within the BPF calling convention's five-argument limit,
one slot of which is consumed by the implicit aux pointer that
KF_IMPLICIT_ARGS appends.
This mechanism is intended for schedulers that do not implement properly
synchronized dequeue. A scheduler whose ops.dequeue() synchronizes
atomically with the dispatch path does not need this API.
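For illustration, a minimal ops.dispatch() sketch (the queue map, its
population from ops.enqueue() and the drop-on-stale policy are
illustrative assumptions, not prescribed by this patch):

	struct {
		__uint(type, BPF_MAP_TYPE_QUEUE);
		__uint(max_entries, 4096);
		__type(value, s32);
	} queue SEC(".maps");

	void BPF_STRUCT_OPS(example_dispatch, s32 cpu, struct task_struct *prev)
	{
		struct scx_bpf_dsq_insert_commit_args args = {
			.dsq_id		= SCX_DSQ_LOCAL,
			.slice		= SCX_SLICE_DFL,
			.enq_flags	= 0,
		};
		struct task_struct *p;
		u64 token;
		s32 pid;

		if (bpf_map_pop_elem(&queue, &pid))
			return;
		p = bpf_task_from_pid(pid);
		if (!p)
			return;

		/* capture the token before any per-task checks */
		token = scx_bpf_dsq_insert_begin(p);

		/* ... per-task validation / pre-dispatch work ... */

		if (!scx_bpf_dsq_insert_commit(p, &args, token))
			scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
		/*
		 * A stale token is discarded asynchronously in
		 * finish_dispatch(); the re-enqueued task comes back
		 * through ops.enqueue().
		 */
		bpf_task_release(p);
	}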
Suggested-by: Tejun Heo <tj@kernel.org>
Suggested-by: Kuba Piecuch <jpiecuch@google.com>
Suggested-by: Andrea Righi <arighi@nvidia.com>
Reported-by: Andrea Righi <arighi@nvidia.com>
Link: https://lore.kernel.org/r/20260203230639.1259869-1-arighi@nvidia.com/
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
---
kernel/sched/ext.c | 110 +++++++++++++++++++++--
kernel/sched/ext_internal.h | 2 +-
tools/sched_ext/include/scx/common.bpf.h | 2 +
3 files changed, 108 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 9e59b068f02c..1c538329a202 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1895,6 +1895,16 @@ static void mark_direct_dispatch(struct scx_sched *sch,
return;
}
+ /*
+ * No qseq validation is needed here. Direct dispatch from ops.enqueue()
+ * happens while @p is in SCX_OPSS_QUEUEING state under its rq lock,
+ * preventing concurrent dequeue or re-enqueue. Direct dispatch from
+ * ops.select_cpu() happens before @p has been handed to the BPF
+ * scheduler, so no concurrent dequeue is possible either.
+ * finish_dispatch() validates qseq only on the deferred (ops.dispatch())
+ * path, where @p may have left SCX_OPSS_QUEUED state between
+ * scx_bpf_dsq_insert() and the actual dispatch.
+ */
WARN_ON_ONCE(p->scx.ddsp_dsq_id != SCX_DSQ_INVALID);
WARN_ON_ONCE(p->scx.ddsp_enq_flags);
@@ -2775,7 +2785,7 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
*/
static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
struct task_struct *p,
- unsigned long qseq_at_dispatch,
+ u64 qseq_at_dispatch,
u64 dsq_id, u64 enq_flags)
{
struct scx_dispatch_q *dsq;
@@ -8423,7 +8433,7 @@ static bool scx_dsq_insert_preamble(struct scx_sched *sch, struct task_struct *p
}
static void scx_dsq_insert_stage(struct scx_sched *sch, struct task_struct *p,
- u64 dsq_id, u64 enq_flags)
+ u64 dsq_id, u64 enq_flags, u64 qseq)
{
struct scx_dsp_ctx *dspc = &this_cpu_ptr(sch->pcpu)->dsp_ctx;
struct task_struct *ddsp_task;
@@ -8441,7 +8451,7 @@ static void scx_dsq_insert_stage(struct scx_sched *sch, struct task_struct *p,
dspc->buf[dspc->cursor++] = (struct scx_dsp_buf_ent){
.task = p,
- .qseq = atomic_long_read(&p->scx.ops_state) & SCX_OPSS_QSEQ_MASK,
+ .qseq = qseq,
.dsq_id = dsq_id,
.enq_flags = enq_flags,
};
@@ -8508,7 +8518,8 @@ __bpf_kfunc bool scx_bpf_dsq_insert___v2(struct task_struct *p, u64 dsq_id,
else
p->scx.slice = p->scx.slice ?: 1;
- scx_dsq_insert_stage(sch, p, dsq_id, enq_flags);
+ scx_dsq_insert_stage(sch, p, dsq_id, enq_flags,
+ atomic_long_read(&p->scx.ops_state) & SCX_OPSS_QSEQ_MASK);
return true;
}
@@ -8536,7 +8547,8 @@ static bool scx_dsq_insert_vtime(struct scx_sched *sch, struct task_struct *p,
p->scx.dsq_vtime = vtime;
- scx_dsq_insert_stage(sch, p, dsq_id, enq_flags | SCX_ENQ_DSQ_PRIQ);
+ scx_dsq_insert_stage(sch, p, dsq_id, enq_flags | SCX_ENQ_DSQ_PRIQ,
+ atomic_long_read(&p->scx.ops_state) & SCX_OPSS_QSEQ_MASK);
return true;
}
@@ -8625,6 +8637,92 @@ __bpf_kfunc void scx_bpf_dsq_insert_vtime(struct task_struct *p, u64 dsq_id,
scx_dsq_insert_vtime(sch, p, dsq_id, slice, vtime, enq_flags);
}
+/**
+ * scx_bpf_dsq_insert_begin - Begin a dispatch transaction for a task
+ * @p: task_struct to dispatch
+ *
+ * Returns an opaque u64 token encoding @p's current scheduling state
+ * sequence number. Pass it to scx_bpf_dsq_insert_commit() in ops.dispatch().
+ *
+ * This function addresses a race in schedulers that queue tasks in
+ * ops.enqueue() and dispatch them later in ops.dispatch() without
+ * implementing ops.dequeue(). Between queuing and dispatch, @p may be
+ * dequeued, migrated, or re-enqueued on another CPU. Without token
+ * validation a stale dispatch would silently succeed, running @p from the
+ * wrong queue context.
+ *
+ * Capture the token before any per-task validation or pre-dispatch work.
+ * A commit with a stale token (one where @p was dequeued or re-enqueued
+ * after begin()) is detected asynchronously by finish_dispatch() and
+ * discarded.
+ *
+ * Schedulers that implement ops.dequeue() with proper synchronization do not
+ * need this API.
+ */
+__bpf_kfunc u64 scx_bpf_dsq_insert_begin(struct task_struct *p)
+{
+ return atomic_long_read(&p->scx.ops_state) & SCX_OPSS_QSEQ_MASK;
+}
+
+struct scx_bpf_dsq_insert_commit_args {
+ /* @p can't be packed together as KF_RCU is not transitive */
+ u64 dsq_id;
+ u64 slice;
+ u64 enq_flags;
+};
+
+/**
+ * scx_bpf_dsq_insert_commit - Commit a dispatch transaction
+ * @p: task_struct to insert
+ * @args: pointer to struct scx_bpf_dsq_insert_commit_args
+ * @token: token from scx_bpf_dsq_insert_begin()
+ * @aux: implicit BPF argument
+ *
+ * May only be called from ops.dispatch(). Inserts @p into a local DSQ after
+ * validating @token against @p's current scheduling state. If @p was
+ * dequeued or re-enqueued between scx_bpf_dsq_insert_begin() and this call,
+ * the dispatch is silently discarded; stale-token detection fires
+ * asynchronously in finish_dispatch() after ops.dispatch() returns.
+ *
+ * Only local DSQs (SCX_DSQ_LOCAL or SCX_DSQ_LOCAL_ON | cpu) are valid
+ * targets. Attempting to commit to a non-local DSQ aborts the scheduler.
+ *
+ * Returns %true if @p was staged for dispatch, %false if @p is not owned by
+ * this scheduler. A %true return does not guarantee the task was actually
+ * dispatched: a stale token is detected after ops.dispatch() returns.
+ */
+__bpf_kfunc bool scx_bpf_dsq_insert_commit(struct task_struct *p,
+ struct scx_bpf_dsq_insert_commit_args *args,
+ u64 token,
+ const struct bpf_prog_aux *aux)
+{
+ struct scx_sched *sch;
+ u64 enq_flags = args->enq_flags;
+
+ guard(rcu)();
+ sch = scx_prog_sched(aux);
+ if (unlikely(!sch))
+ return false;
+
+ if (args->dsq_id != SCX_DSQ_LOCAL &&
+ (args->dsq_id & SCX_DSQ_LOCAL_ON) != SCX_DSQ_LOCAL_ON) {
+ scx_error(sch, "scx_bpf_dsq_insert_commit() may only target local DSQs");
+ return false;
+ }
+
+ if (!scx_dsq_insert_preamble(sch, p, args->dsq_id, &enq_flags))
+ return false;
+
+ if (args->slice)
+ p->scx.slice = args->slice;
+ else
+ p->scx.slice = p->scx.slice ?: 1;
+
+ scx_dsq_insert_stage(sch, p, args->dsq_id, enq_flags, token);
+
+ return true;
+}
+
__bpf_kfunc_end_defs();
BTF_KFUNCS_START(scx_kfunc_ids_enqueue_dispatch)
@@ -8993,6 +9091,7 @@ __bpf_kfunc bool scx_bpf_sub_dispatch(u64 cgroup_id, const struct bpf_prog_aux *
__bpf_kfunc_end_defs();
BTF_KFUNCS_START(scx_kfunc_ids_dispatch)
+BTF_ID_FLAGS(func, scx_bpf_dsq_insert_commit, KF_IMPLICIT_ARGS | KF_RCU)
BTF_ID_FLAGS(func, scx_bpf_dispatch_nr_slots, KF_IMPLICIT_ARGS)
BTF_ID_FLAGS(func, scx_bpf_dispatch_cancel, KF_IMPLICIT_ARGS)
BTF_ID_FLAGS(func, scx_bpf_dsq_move_to_local, KF_IMPLICIT_ARGS)
@@ -10301,6 +10400,7 @@ BTF_ID_FLAGS(func, scx_bpf_put_cpumask, KF_RELEASE)
BTF_ID_FLAGS(func, scx_bpf_task_running, KF_RCU)
BTF_ID_FLAGS(func, scx_bpf_task_cpu, KF_RCU)
BTF_ID_FLAGS(func, scx_bpf_task_cid, KF_RCU)
+BTF_ID_FLAGS(func, scx_bpf_dsq_insert_begin, KF_RCU)
BTF_ID_FLAGS(func, scx_bpf_cpu_rq, KF_IMPLICIT_ARGS)
BTF_ID_FLAGS(func, scx_bpf_locked_rq, KF_IMPLICIT_ARGS | KF_RET_NULL)
BTF_ID_FLAGS(func, scx_bpf_cpu_curr, KF_IMPLICIT_ARGS | KF_RET_NULL | KF_RCU_PROTECTED)
diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index 7258aea94b9f..73cced7cc0a1 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -1054,7 +1054,7 @@ enum scx_sched_pcpu_flags {
/* dispatch buf */
struct scx_dsp_buf_ent {
struct task_struct *task;
- unsigned long qseq;
+ u64 qseq;
u64 dsq_id;
u64 enq_flags;
};
diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h
index 5f715d69cde6..bc2083f485fd 100644
--- a/tools/sched_ext/include/scx/common.bpf.h
+++ b/tools/sched_ext/include/scx/common.bpf.h
@@ -63,6 +63,8 @@ s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags,
s32 __scx_bpf_select_cpu_and(struct task_struct *p, const struct cpumask *cpus_allowed,
struct scx_bpf_select_cpu_and_args *args) __ksym __weak;
bool __scx_bpf_dsq_insert_vtime(struct task_struct *p, struct scx_bpf_dsq_insert_vtime_args *args) __ksym __weak;
+u64 scx_bpf_dsq_insert_begin(struct task_struct *p) __ksym __weak;
+bool scx_bpf_dsq_insert_commit(struct task_struct *p, struct scx_bpf_dsq_insert_commit_args *args, u64 token) __ksym __weak;
u32 scx_bpf_dispatch_nr_slots(void) __ksym;
void scx_bpf_dispatch_cancel(void) __ksym;
void scx_bpf_kick_cpu(s32 cpu, u64 flags) __ksym;
--
2.48.1
* Re: [PATCH 2/3] sched_ext: Add dispatch transaction API
2026-05-15 17:27 ` [PATCH 2/3] sched_ext: Add dispatch transaction API Cheng-Yang Chou
@ 2026-05-15 17:59 ` sashiko-bot
0 siblings, 0 replies; 6+ messages in thread
From: sashiko-bot @ 2026-05-15 17:59 UTC (permalink / raw)
To: Cheng-Yang Chou; +Cc: sched-ext
Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [High] Double Fetch / TOCTOU vulnerability in `scx_bpf_dsq_insert_commit()` when accessing `args` fields.
- [High] Unprotected state modification (`p->scx.slice`) causing a data race on actively running or migrated tasks.
--
commit c413431466d2cd2f8ef609dabb82ff0b92ea46d2
Author: Cheng-Yang Chou <yphbchou0911@gmail.com>
sched_ext: Add dispatch transaction API
This commit introduces scx_bpf_dsq_insert_begin() and
scx_bpf_dsq_insert_commit() kfuncs to allow a dispatch transaction. This
mechanism extends the race detection window for schedulers that queue tasks
in ops.enqueue() but do not implement properly synchronized ops.dequeue().
Link: https://lore.kernel.org/r/20260203230639.1259869-1-arighi@nvidia.com/
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 0dadda97ef57c..2dfcfc3c885bb 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
[ ... ]
> @@ -8518,6 +8530,92 @@ __bpf_kfunc void scx_bpf_dsq_insert_vtime(struct task_struct *p, u64 dsq_id,
[ ... ]
> +__bpf_kfunc bool scx_bpf_dsq_insert_commit(struct task_struct *p,
> + struct scx_bpf_dsq_insert_commit_args *args,
> + u64 token,
> + const struct bpf_prog_aux *aux)
> +{
> + struct scx_sched *sch;
> + u64 enq_flags = args->enq_flags;
> +
> + guard(rcu)();
> + sch = scx_prog_sched(aux);
> + if (unlikely(!sch))
> + return false;
> +
> + if (args->dsq_id != SCX_DSQ_LOCAL &&
> + (args->dsq_id & SCX_DSQ_LOCAL_ON) != SCX_DSQ_LOCAL_ON) {
Since pointer arguments to BPF kfuncs like args can point to shared memory
such as BPF map values, which can be concurrently modified by userspace or
other BPF programs, is it possible for args->dsq_id to change between reads?
If args->dsq_id changes to a global DSQ ID after this local-only check passes,
could it bypass the constraint during the subsequent scx_dsq_insert_preamble()
and scx_dsq_insert_stage() calls?
> + scx_error(sch, "scx_bpf_dsq_insert_commit() may only target local DSQs");
> + return false;
> + }
> +
> + if (!scx_dsq_insert_preamble(sch, p, args->dsq_id, &enq_flags))
> + return false;
> +
> + if (args->slice)
> + p->scx.slice = args->slice;
> + else
> + p->scx.slice = p->scx.slice ?: 1;
Could a concurrent modification to args->slice between the condition check
and the assignment result in setting the task's slice to 0, bypassing the
fallback logic?
Would it be safer to fetch args->dsq_id and args->slice exactly once into
local variables at the beginning of the function, similar to how
args->enq_flags is handled?
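For reference, a sketch of that single-read shape (illustrative;
READ_ONCE() pins each value to one load):

	u64 dsq_id = READ_ONCE(args->dsq_id);
	u64 slice = READ_ONCE(args->slice);
	u64 enq_flags = READ_ONCE(args->enq_flags);

	if (dsq_id != SCX_DSQ_LOCAL &&
	    (dsq_id & SCX_DSQ_LOCAL_ON) != SCX_DSQ_LOCAL_ON) {
		scx_error(sch, "scx_bpf_dsq_insert_commit() may only target local DSQs");
		return false;
	}

	if (!scx_dsq_insert_preamble(sch, p, dsq_id, &enq_flags))
		return false;

	if (slice)
		p->scx.slice = slice;
	else
		p->scx.slice = p->scx.slice ?: 1;

	scx_dsq_insert_stage(sch, p, dsq_id, enq_flags, token);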
Additionally, since scx_bpf_dsq_insert_commit() is intended for schedulers
where p may have been dequeued or migrated to another CPU after the dispatch
token was captured, the current CPU might not hold p's runqueue lock here.
Could modifying p->scx.slice without synchronization cause a data race if
the task is actively running or queued on another CPU?
Even if the stale token is later detected and discarded by finish_dispatch(),
it appears the corruption to p->scx.slice would already have occurred. Is
there a way to defer this write or synchronize it so the active state is not
modified?
> +
> + scx_dsq_insert_stage(sch, p, args->dsq_id, enq_flags, token);
> +
> + return true;
> +}
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260515172807.254804-1-yphbchou0911@gmail.com?part=2
* [PATCH 3/3] selftests/sched_ext: Add dispatch_cookie test
2026-05-15 17:27 [PATCH v4 sched_ext/for-7.2 0/3] sched_ext: Add dispatch transaction API Cheng-Yang Chou
2026-05-15 17:27 ` [PATCH 1/3] sched_ext: Rename scx_dsq_insert_commit() to scx_dsq_insert_stage() Cheng-Yang Chou
2026-05-15 17:27 ` [PATCH 2/3] sched_ext: Add dispatch transaction API Cheng-Yang Chou
@ 2026-05-15 17:27 ` Cheng-Yang Chou
2026-05-15 18:19 ` sashiko-bot
2 siblings, 1 reply; 6+ messages in thread
From: Cheng-Yang Chou @ 2026-05-15 17:27 UTC (permalink / raw)
To: sched-ext, Tejun Heo, David Vernet, Andrea Righi, Changwoo Min
Cc: Kuba Piecuch, Ching-Chun Huang, Chia-Ping Tsai, yphbchou0911
Test scx_bpf_dsq_insert_begin() and scx_bpf_dsq_insert_commit().
The BPF scheduler enqueues tasks into a BPF queue map in ops.enqueue()
and dispatches them via the begin/commit transaction API in ops.dispatch().
After a successful dispatch, the token is stored in a BPF hash map. On
the task's next dispatch (after dequeue/re-enqueue increments qseq), the
stored token is stale and finish_dispatch() silently drops the buffered
entry; a fallback scx_bpf_dsq_insert() then dispatches the task.
Userspace forks spinning children and repeatedly flips their CPU affinity
to the first online CPU and back, triggering dequeue/re-enqueue cycles.
The test asserts that both nr_tx_dispatched (fresh commits) and
nr_tx_stale (attempted stale commits) are positive, exercising both the
happy path and the stale-token path. The test skips on kernels that
predate the dispatch transaction API.
Suggested-by: Tejun Heo <tj@kernel.org>
Suggested-by: Kuba Piecuch <jpiecuch@google.com>
Suggested-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
---
tools/testing/selftests/sched_ext/Makefile | 1 +
.../selftests/sched_ext/dispatch_cookie.bpf.c | 120 +++++++++++++++
.../selftests/sched_ext/dispatch_cookie.c | 139 ++++++++++++++++++
3 files changed, 260 insertions(+)
create mode 100644 tools/testing/selftests/sched_ext/dispatch_cookie.bpf.c
create mode 100644 tools/testing/selftests/sched_ext/dispatch_cookie.c
diff --git a/tools/testing/selftests/sched_ext/Makefile b/tools/testing/selftests/sched_ext/Makefile
index 5d2dffca0e91..ae3dc0913378 100644
--- a/tools/testing/selftests/sched_ext/Makefile
+++ b/tools/testing/selftests/sched_ext/Makefile
@@ -164,6 +164,7 @@ all_test_bpfprogs := $(foreach prog,$(wildcard *.bpf.c),$(INCLUDE_DIR)/$(patsubs
auto-test-targets := \
create_dsq \
dequeue \
+ dispatch_cookie \
enq_last_no_enq_fails \
ddsp_bogus_dsq_fail \
ddsp_vtimelocal_fail \
diff --git a/tools/testing/selftests/sched_ext/dispatch_cookie.bpf.c b/tools/testing/selftests/sched_ext/dispatch_cookie.bpf.c
new file mode 100644
index 000000000000..160a77ed9000
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/dispatch_cookie.bpf.c
@@ -0,0 +1,120 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Test scx_bpf_dsq_insert_begin() and scx_bpf_dsq_insert_commit().
+ *
+ * Exercises both the happy path (fresh token committed successfully) and
+ * the stale-token path (stored token reused after the task was dequeued and
+ * re-enqueued, incrementing qseq and making the token stale).
+ *
+ * Copyright (C) 2026 Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw>
+ * Copyright (C) 2026 Cheng-Yang Chou <yphbchou0911@gmail.com>
+ */
+#include <scx/common.bpf.h>
+
+char _license[] SEC("license") = "GPL";
+
+UEI_DEFINE(uei);
+
+struct {
+ __uint(type, BPF_MAP_TYPE_QUEUE);
+ __uint(max_entries, 8192);
+ __type(value, s32);
+} queue SEC(".maps");
+
+struct {
+ __uint(type, BPF_MAP_TYPE_HASH);
+ __uint(max_entries, 4096);
+ __type(key, s32);
+ __type(value, u64);
+} last_token SEC(".maps");
+
+long nr_tx_dispatched;
+long nr_tx_stale;
+
+void BPF_STRUCT_OPS(dispatch_cookie_enqueue, struct task_struct *p,
+ u64 enq_flags)
+{
+ s32 pid = p->pid;
+
+ if (bpf_map_push_elem(&queue, &pid, 0))
+ scx_bpf_error("Failed to enqueue %s[%d]", p->comm, p->pid);
+}
+
+void BPF_STRUCT_OPS(dispatch_cookie_dispatch, s32 cpu,
+ struct task_struct *prev)
+{
+ s32 pid;
+ struct task_struct *p;
+ u64 *stored, token;
+
+ if (bpf_map_pop_elem(&queue, &pid))
+ return;
+
+ p = bpf_task_from_pid(pid);
+ if (!p)
+ return;
+
+ /*
+ * Tasks pinned to a different CPU (e.g. per-CPU kworkers) cannot be
+ * inserted into this CPU's local DSQ. Skip the transaction path and
+ * fall back to the global DSQ so the scheduler does not abort.
+ */
+ if (!bpf_cpumask_test_cpu(cpu, p->cpus_ptr)) {
+ scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, 0);
+ bpf_task_release(p);
+ return;
+ }
+
+ /*
+ * After a successful fresh dispatch, store the token. On the task's
+ * next dispatch (after re-enqueue increments qseq), the stored token
+ * exercises the stale path in finish_dispatch().
+ *
+ * scx_bpf_dsq_insert_commit() always returns %true when the preamble
+ * passes; stale detection fires asynchronously in finish_dispatch()
+ * with no BPF-observable signal. Always pair the commit() call with a
+ * fallback scx_bpf_dsq_insert(): if the token is stale,
+ * finish_dispatch() drops the buffered entry and the fallback
+ * dispatches the task. If the token is still fresh, finish_dispatch()
+ * dispatches it and the fallback's CAS is a no-op.
+ */
+ struct scx_bpf_dsq_insert_commit_args commit_args = {
+ .dsq_id = SCX_DSQ_LOCAL,
+ .slice = 0,
+ .enq_flags = 0,
+ };
+
+ stored = bpf_map_lookup_elem(&last_token, &pid);
+ if (stored) {
+ token = *stored;
+ bpf_map_delete_elem(&last_token, &pid);
+ scx_bpf_dsq_insert_commit(p, &commit_args, token);
+ scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
+ /* counts attempted stale commits, not finish_dispatch() rejections */
+ __sync_fetch_and_add(&nr_tx_stale, 1);
+ } else {
+ token = scx_bpf_dsq_insert_begin(p);
+ if (scx_bpf_dsq_insert_commit(p, &commit_args, token)) {
+ __sync_fetch_and_add(&nr_tx_dispatched, 1);
+ bpf_map_update_elem(&last_token, &pid, &token, BPF_ANY);
+ } else {
+ scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
+ }
+ }
+
+ bpf_task_release(p);
+}
+
+void BPF_STRUCT_OPS(dispatch_cookie_exit, struct scx_exit_info *ei)
+{
+ UEI_RECORD(uei, ei);
+}
+
+SEC(".struct_ops.link")
+struct sched_ext_ops dispatch_cookie_ops = {
+ .enqueue = (void *) dispatch_cookie_enqueue,
+ .dispatch = (void *) dispatch_cookie_dispatch,
+ .exit = (void *) dispatch_cookie_exit,
+ .name = "dispatch_cookie",
+ .timeout_ms = 5000U,
+};
diff --git a/tools/testing/selftests/sched_ext/dispatch_cookie.c b/tools/testing/selftests/sched_ext/dispatch_cookie.c
new file mode 100644
index 000000000000..047d9ea1822b
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/dispatch_cookie.c
@@ -0,0 +1,139 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Test scx_bpf_dsq_insert_begin() and scx_bpf_dsq_insert_commit().
+ *
+ * Copyright (C) 2026 Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw>
+ * Copyright (C) 2026 Cheng-Yang Chou <yphbchou0911@gmail.com>
+ */
+#define _GNU_SOURCE
+#include <bpf/bpf.h>
+#include <errno.h>
+#include <sched.h>
+#include <scx/common.h>
+#include <signal.h>
+#include <stdlib.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include "dispatch_cookie.bpf.skel.h"
+#include "scx_test.h"
+
+#define NUM_CHILDREN 32
+#define NUM_FLIP_ITERS 100
+
+struct dispatch_cookie_ctx {
+ struct dispatch_cookie *skel;
+ struct bpf_link *link;
+};
+
+static enum scx_test_status setup(void **ctx)
+{
+ struct dispatch_cookie_ctx *tctx;
+
+ if (!__COMPAT_has_ksym("scx_bpf_dsq_insert_begin")) {
+ fprintf(stderr, "SKIP: dispatch transaction API not supported\n");
+ return SCX_TEST_SKIP;
+ }
+
+ tctx = malloc(sizeof(*tctx));
+ SCX_FAIL_IF(!tctx, "Failed to allocate test context");
+ tctx->link = NULL;
+
+ tctx->skel = dispatch_cookie__open();
+ if (!tctx->skel) {
+ free(tctx);
+ SCX_FAIL("Failed to open skel");
+ }
+ SCX_ENUM_INIT(tctx->skel);
+ if (dispatch_cookie__load(tctx->skel)) {
+ dispatch_cookie__destroy(tctx->skel);
+ free(tctx);
+ SCX_FAIL("Failed to load skel");
+ }
+
+ *ctx = tctx;
+
+ return SCX_TEST_PASS;
+}
+
+static enum scx_test_status run(void *ctx)
+{
+ struct dispatch_cookie_ctx *tctx = ctx;
+ cpu_set_t cpuset_one, cpuset_all;
+ pid_t pids[NUM_CHILDREN];
+ int i, j, nforked = 0, status, first_cpu;
+
+ tctx->link = bpf_map__attach_struct_ops(tctx->skel->maps.dispatch_cookie_ops);
+ SCX_FAIL_IF(!tctx->link, "Failed to attach scheduler");
+
+ SCX_FAIL_IF(sched_getaffinity(0, sizeof(cpuset_all), &cpuset_all),
+ "Failed to get CPU affinity (%d)", errno);
+
+ first_cpu = -1;
+ for (i = 0; i < CPU_SETSIZE; i++) {
+ if (CPU_ISSET(i, &cpuset_all)) {
+ first_cpu = i;
+ break;
+ }
+ }
+ SCX_FAIL_IF(first_cpu < 0, "No online CPUs found");
+
+ CPU_ZERO(&cpuset_one);
+ CPU_SET(first_cpu, &cpuset_one);
+
+ for (i = 0; i < NUM_CHILDREN; i++) {
+ pids[i] = fork();
+ if (pids[i] == 0) {
+ while (1)
+ sched_yield();
+ }
+ if (pids[i] > 0)
+ nforked++;
+ }
+
+ /*
+ * Flip affinity to trigger dequeue/re-enqueue, which increments qseq
+ * and makes previously captured tokens stale.
+ */
+ for (i = 0; i < NUM_FLIP_ITERS; i++) {
+ for (j = 0; j < NUM_CHILDREN; j++) {
+ if (pids[j] <= 0)
+ continue;
+ sched_setaffinity(pids[j], sizeof(cpuset_one), &cpuset_one);
+ sched_setaffinity(pids[j], sizeof(cpuset_all), &cpuset_all);
+ }
+ usleep(1000);
+ }
+
+ for (i = 0; i < NUM_CHILDREN; i++) {
+ if (pids[i] <= 0)
+ continue;
+ kill(pids[i], SIGKILL);
+ waitpid(pids[i], &status, 0);
+ }
+
+ SCX_GT(nforked, 0);
+ SCX_GT(tctx->skel->bss->nr_tx_dispatched, 0);
+ SCX_GT(tctx->skel->bss->nr_tx_stale, 0);
+
+ return SCX_TEST_PASS;
+}
+
+static void cleanup(void *ctx)
+{
+ struct dispatch_cookie_ctx *tctx = ctx;
+
+ if (tctx->link)
+ bpf_link__destroy(tctx->link);
+ dispatch_cookie__destroy(tctx->skel);
+ free(tctx);
+}
+
+struct scx_test dispatch_cookie = {
+ .name = "dispatch_cookie",
+ .description = "Verify scx_bpf_dsq_insert_begin() and "
+ "scx_bpf_dsq_insert_commit() dispatch tasks correctly",
+ .setup = setup,
+ .run = run,
+ .cleanup = cleanup,
+};
+REGISTER_SCX_TEST(&dispatch_cookie)
--
2.48.1
* Re: [PATCH 3/3] selftests/sched_ext: Add dispatch_cookie test
2026-05-15 17:27 ` [PATCH 3/3] selftests/sched_ext: Add dispatch_cookie test Cheng-Yang Chou
@ 2026-05-15 18:19 ` sashiko-bot
0 siblings, 0 replies; 6+ messages in thread
From: sashiko-bot @ 2026-05-15 18:19 UTC (permalink / raw)
To: Cheng-Yang Chou; +Cc: sched-ext
Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [Medium] Child processes spin indefinitely and will leak if the parent terminates unexpectedly.
- [Medium] The dispatch operation returns immediately without attempting to dispatch another task if `bpf_task_from_pid()` fails, potentially leaving runnable tasks stranded.
--
commit 91ea259621d3c40168f6eb29f9526e50de76f0b2
Author: Cheng-Yang Chou <yphbchou0911@gmail.com>
selftests/sched_ext: Add dispatch_cookie test
This commit adds a new sched_ext selftest to exercise the dispatch
transaction API (scx_bpf_dsq_insert_begin and scx_bpf_dsq_insert_commit).
It uses a BPF scheduler with a queue map and a userspace program that forks
multiple spinning children, flipping their CPU affinities to trigger dequeue
and re-enqueue cycles.
> diff --git a/tools/testing/selftests/sched_ext/dispatch_cookie.bpf.c b/tools/testing/selftests/sched_ext/dispatch_cookie.bpf.c
> new file mode 100644
> index 0000000000000..160a77ed9000f
> --- /dev/null
> +++ b/tools/testing/selftests/sched_ext/dispatch_cookie.bpf.c
[ ... ]
> +void BPF_STRUCT_OPS(dispatch_cookie_dispatch, s32 cpu,
> + struct task_struct *prev)
> +{
> + s32 pid;
> + struct task_struct *p;
> + u64 *stored, token;
> +
> + if (bpf_map_pop_elem(&queue, &pid))
> + return;
> +
> + p = bpf_task_from_pid(pid);
> + if (!p)
> + return;
If a task exits and bpf_task_from_pid(pid) returns NULL, does returning
early here prevent other valid tasks in the queue from being dispatched?
Since ops.dispatch() expects a task to be inserted into a DSQ, would it be
better to loop and try popping another PID instead of returning immediately
and leaving the CPU idle?
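For instance (an illustrative sketch; the constant bound keeps the loop
verifier-friendly):

	struct task_struct *p = NULL;
	s32 pid;
	int i;

	for (i = 0; i < 8; i++) {
		if (bpf_map_pop_elem(&queue, &pid))
			return;
		p = bpf_task_from_pid(pid);
		if (p)
			break;
	}
	if (!p)
		return;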
[ ... ]
> diff --git a/tools/testing/selftests/sched_ext/dispatch_cookie.c b/tools/testing/selftests/sched_ext/dispatch_cookie.c
> new file mode 100644
> index 0000000000000..047d9ea1822b9
> --- /dev/null
> +++ b/tools/testing/selftests/sched_ext/dispatch_cookie.c
[ ... ]
> + for (i = 0; i < NUM_CHILDREN; i++) {
> + pids[i] = fork();
> + if (pids[i] == 0) {
> + while (1)
> + sched_yield();
> + }
> + if (pids[i] > 0)
> + nforked++;
> + }
If the test is interrupted (e.g., via SIGINT, SIGTERM, or a test framework
timeout) before reaching the cleanup code, will these spinning children be
leaked and consume 100% of the CPUs indefinitely?
Could this be prevented by calling prctl(PR_SET_PDEATHSIG, SIGKILL) in the
child processes immediately after fork()?
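For instance (illustrative; needs <sys/prctl.h>):

	if (pids[i] == 0) {
		/* die with the parent even if cleanup never runs */
		prctl(PR_SET_PDEATHSIG, SIGKILL);
		/* parent may already be gone before prctl() took effect */
		if (getppid() == 1)
			exit(0);
		while (1)
			sched_yield();
	}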
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260515172807.254804-1-yphbchou0911@gmail.com?part=3