public inbox for linux-kernel@vger.kernel.org
* [PATCHSET sched_ext/for-6.19] sched_ext: Deprecate ops.cpu_acquire/release()
@ 2025-10-25  0:18 Tejun Heo
  2025-10-25  0:18 ` [PATCH 1/3] sched_ext: Split schedule_deferred() into locked and unlocked variants Tejun Heo
                   ` (3 more replies)
  0 siblings, 4 replies; 28+ messages in thread
From: Tejun Heo @ 2025-10-25  0:18 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: linux-kernel, sched-ext, Peter Zijlstra, Wen-Fang Liu

The ops.cpu_acquire/release() callbacks are broken. They miss events under
multiple conditions and can't be fixed without adding global sched core hooks
that sched maintainers don't want. They also aren't necessary as BPF schedulers
can use generic BPF mechanisms like tracepoints to achieve the same goals.

The main use case for cpu_release() was calling scx_bpf_reenqueue_local() when
a CPU is preempted by a higher-priority scheduling class. This patchset makes
scx_bpf_reenqueue_local() callable from any context by adding deferred
execution support, which completely eliminates the need for cpu_acquire/release()
callbacks.

This patchset contains the following three patches:

  0001-sched_ext-Split-schedule_deferred-into-locked-and-un.patch
  0002-sched_ext-Factor-out-reenq_local-from-scx_bpf_reenqu.patch
  0003-sched_ext-Allow-scx_bpf_reenqueue_local-to-be-called.patch

Based on sched_ext/for-6.19 (dcb938c45328).

Git tree: git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git scx-reenq-anywhere

 kernel/sched/ext.c                       | 114 +++++++++++++++++++++++--------
 kernel/sched/sched.h                     |   1 +
 tools/sched_ext/include/scx/common.bpf.h |   1 -
 tools/sched_ext/include/scx/compat.bpf.h |  23 +++++++
 tools/sched_ext/scx_qmap.bpf.c           |  38 ++++++++---
 5 files changed, 136 insertions(+), 41 deletions(-)

--
tejun

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 1/3] sched_ext: Split schedule_deferred() into locked and unlocked variants
  2025-10-25  0:18 [PATCHSET sched_ext/for-6.19] sched_ext: Deprecate ops.cpu_acquire/release() Tejun Heo
@ 2025-10-25  0:18 ` Tejun Heo
  2025-10-25 23:17   ` Emil Tsalapatis
  2025-10-25  0:18 ` [PATCH 2/3] sched_ext: Factor out reenq_local() from scx_bpf_reenqueue_local() Tejun Heo
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 28+ messages in thread
From: Tejun Heo @ 2025-10-25  0:18 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: linux-kernel, sched-ext, Peter Zijlstra, Wen-Fang Liu, Tejun Heo

schedule_deferred() currently requires the rq lock to be held so that it can
use scheduler hooks for efficiency when available. However, there are cases
where deferred actions need to be scheduled from contexts that don't hold the
rq lock.

Split it into schedule_deferred(), which can be called from any context and
simply queues an irq_work, and schedule_deferred_locked(), which requires the
rq lock and can use scheduler hooks for efficiency when available. Update the
existing call site to use the _locked variant.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext.c | 33 ++++++++++++++++++++++++---------
 1 file changed, 24 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 000000000000..111111111111 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -775,12 +775,28 @@ static void deferred_irq_workfn(struct irq_work *irq_work)
  * schedule_deferred - Schedule execution of deferred actions on an rq
  * @rq: target rq
  *
- * Schedule execution of deferred actions on @rq. Must be called with @rq
- * locked. Deferred actions are executed with @rq locked but unpinned, and thus
- * can unlock @rq to e.g. migrate tasks to other rqs.
+ * Schedule execution of deferred actions on @rq. Deferred actions are executed
+ * with @rq locked but unpinned, and thus can unlock @rq to e.g. migrate tasks
+ * to other rqs.
  */
 static void schedule_deferred(struct rq *rq)
 {
+	/*
+	 * Queue an irq work. They are executed on IRQ re-enable which may take
+	 * a bit longer than the scheduler hook in schedule_deferred_locked().
+	 */
+	irq_work_queue(&rq->scx.deferred_irq_work);
+}
+
+/**
+ * schedule_deferred_locked - Schedule execution of deferred actions on an rq
+ * @rq: target rq
+ *
+ * Schedule execution of deferred actions on @rq. Equivalent to
+ * schedule_deferred() but requires @rq to be locked and can be more efficient.
+ */
+static void schedule_deferred_locked(struct rq *rq)
+{
 	lockdep_assert_rq_held(rq);

 	/*
@@ -812,12 +828,11 @@ static void schedule_deferred(struct rq *rq)
 	}

 	/*
-	 * No scheduler hooks available. Queue an irq work. They are executed on
-	 * IRQ re-enable which may take a bit longer than the scheduler hooks.
-	 * The above WAKEUP and BALANCE paths should cover most of the cases and
-	 * the time to IRQ re-enable shouldn't be long.
+	 * No scheduler hooks available. Use the generic irq_work path. The
+	 * above WAKEUP and BALANCE paths should cover most of the cases and the
+	 * time to IRQ re-enable shouldn't be long.
 	 */
-	irq_work_queue(&rq->scx.deferred_irq_work);
+	schedule_deferred(rq);
 }

 /**
@@ -1211,7 +1226,7 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
 		WARN_ON_ONCE(p->scx.dsq || !list_empty(&p->scx.dsq_list.node));
 		list_add_tail(&p->scx.dsq_list.node,
 			      &rq->scx.ddsp_deferred_locals);
-		schedule_deferred(rq);
+		schedule_deferred_locked(rq);
 		return;
 	}

--
2.47.1


* [PATCH 2/3] sched_ext: Factor out reenq_local() from scx_bpf_reenqueue_local()
  2025-10-25  0:18 [PATCHSET sched_ext/for-6.19] sched_ext: Deprecate ops.cpu_acquire/release() Tejun Heo
  2025-10-25  0:18 ` [PATCH 1/3] sched_ext: Split schedule_deferred() into locked and unlocked variants Tejun Heo
@ 2025-10-25  0:18 ` Tejun Heo
  2025-10-25 23:19   ` Emil Tsalapatis
  2025-10-25  0:18 ` [PATCH 3/3] sched_ext: Allow scx_bpf_reenqueue_local() to be called from anywhere Tejun Heo
  2025-10-29 15:31 ` [PATCHSET sched_ext/for-6.19] sched_ext: Deprecate ops.cpu_acquire/release() Tejun Heo
  3 siblings, 1 reply; 28+ messages in thread
From: Tejun Heo @ 2025-10-25  0:18 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: linux-kernel, sched-ext, Peter Zijlstra, Wen-Fang Liu, Tejun Heo

Factor out the core re-enqueue logic from scx_bpf_reenqueue_local() into a
new reenq_local() helper function. scx_bpf_reenqueue_local() now handles the
BPF kfunc checks and calls reenq_local() to perform the actual work.

This is a prep patch to allow reenq_local() to be called from other contexts.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext.c | 50 +++++++++++++++++++++++++++++---------------------
 1 file changed, 29 insertions(+), 21 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 111111111111..222222222222 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -5881,32 +5881,12 @@ static const struct btf_kfunc_id_set scx_kfunc_set_dispatch = {
 	.set			= &scx_kfunc_ids_dispatch,
 };

-__bpf_kfunc_start_defs();
-
-/**
- * scx_bpf_reenqueue_local - Re-enqueue tasks on a local DSQ
- *
- * Iterate over all of the tasks currently enqueued on the local DSQ of the
- * caller's CPU, and re-enqueue them in the BPF scheduler. Returns the number of
- * processed tasks. Can only be called from ops.cpu_release().
- */
-__bpf_kfunc u32 scx_bpf_reenqueue_local(void)
+static u32 reenq_local(struct rq *rq)
 {
-	struct scx_sched *sch;
 	LIST_HEAD(tasks);
 	u32 nr_enqueued = 0;
-	struct rq *rq;
 	struct task_struct *p, *n;

-	guard(rcu)();
-	sch = rcu_dereference(scx_root);
-	if (unlikely(!sch))
-		return 0;
-
-	if (!scx_kf_allowed(sch, SCX_KF_CPU_RELEASE))
-		return 0;
-
-	rq = cpu_rq(smp_processor_id());
 	lockdep_assert_rq_held(rq);

 	/*
@@ -5943,6 +5923,34 @@ __bpf_kfunc u32 scx_bpf_reenqueue_local(void)
 	return nr_enqueued;
 }

+__bpf_kfunc_start_defs();
+
+/**
+ * scx_bpf_reenqueue_local - Re-enqueue tasks on a local DSQ
+ *
+ * Iterate over all of the tasks currently enqueued on the local DSQ of the
+ * caller's CPU, and re-enqueue them in the BPF scheduler. Returns the number of
+ * processed tasks. Can only be called from ops.cpu_release().
+ */
+__bpf_kfunc u32 scx_bpf_reenqueue_local(void)
+{
+	struct scx_sched *sch;
+	struct rq *rq;
+
+	guard(rcu)();
+	sch = rcu_dereference(scx_root);
+	if (unlikely(!sch))
+		return 0;
+
+	if (!scx_kf_allowed(sch, SCX_KF_CPU_RELEASE))
+		return 0;
+
+	rq = cpu_rq(smp_processor_id());
+	lockdep_assert_rq_held(rq);
+
+	return reenq_local(rq);
+}
+
 __bpf_kfunc_end_defs();

 BTF_KFUNCS_START(scx_kfunc_ids_cpu_release)
--
2.47.1


* [PATCH 3/3] sched_ext: Allow scx_bpf_reenqueue_local() to be called from anywhere
  2025-10-25  0:18 [PATCHSET sched_ext/for-6.19] sched_ext: Deprecate ops.cpu_acquire/release() Tejun Heo
  2025-10-25  0:18 ` [PATCH 1/3] sched_ext: Split schedule_deferred() into locked and unlocked variants Tejun Heo
  2025-10-25  0:18 ` [PATCH 2/3] sched_ext: Factor out reenq_local() from scx_bpf_reenqueue_local() Tejun Heo
@ 2025-10-25  0:18 ` Tejun Heo
  2025-10-25 23:21   ` Emil Tsalapatis
                     ` (2 more replies)
  2025-10-29 15:31 ` [PATCHSET sched_ext/for-6.19] sched_ext: Deprecate ops.cpu_acquire/release() Tejun Heo
  3 siblings, 3 replies; 28+ messages in thread
From: Tejun Heo @ 2025-10-25  0:18 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: linux-kernel, sched-ext, Peter Zijlstra, Wen-Fang Liu, Tejun Heo

The ops.cpu_acquire/release() callbacks are broken - they miss events under
multiple conditions and can't be fixed without adding global sched core hooks
that sched maintainers don't want. They also aren't necessary as BPF schedulers
can use generic BPF mechanisms like tracepoints to achieve the same goals.

The main use case for cpu_release() was calling scx_bpf_reenqueue_local() when
a CPU is preempted by a higher-priority scheduling class. However, the old
scx_bpf_reenqueue_local() could only be called from cpu_release() context.

Add a new version of scx_bpf_reenqueue_local() that can be called from any
context by deferring the actual re-enqueue operation. This eliminates the need
for cpu_acquire/release() ops entirely. Schedulers can now use standard BPF
mechanisms like the sched_switch tracepoint to detect and handle CPU preemption.

Update scx_qmap to demonstrate the new approach using sched_switch instead of
cpu_release, with compat support for older kernels. Mark cpu_acquire/release()
as deprecated. The old scx_bpf_reenqueue_local() variant will be removed in
v6.23.

Reported-by: Wen-Fang Liu <liuwenfang@honor.com>
Link: https://lore.kernel.org/all/8d64c74118c6440f81bcf5a4ac6b9f00@honor.com/
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext.c                       | 31 ++++++++++++++++++++
 kernel/sched/sched.h                     |  1 +
 tools/sched_ext/include/scx/common.bpf.h |  1 -
 tools/sched_ext/include/scx/compat.bpf.h | 23 +++++++++++++++
 tools/sched_ext/scx_qmap.bpf.c           | 38 +++++++++++++++++-------
 5 files changed, 83 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 222222222222..333333333333 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -147,6 +147,7 @@ static struct kset *scx_kset;
 #include <trace/events/sched_ext.h>

 static void process_ddsp_deferred_locals(struct rq *rq);
+static u32 reenq_local(struct rq *rq);
 static void scx_kick_cpu(struct scx_sched *sch, s32 cpu, u64 flags);
 static void scx_vexit(struct scx_sched *sch, enum scx_exit_kind kind,
 		      s64 exit_code, const char *fmt, va_list args);
@@ -755,6 +756,11 @@ static int ops_sanitize_err(struct scx_sched *sch, s32 ret, s32 ops_err)
 static void run_deferred(struct rq *rq)
 {
 	process_ddsp_deferred_locals(rq);
+
+	if (local_read(&rq->scx.reenq_local_deferred)) {
+		local_set(&rq->scx.reenq_local_deferred, 0);
+		reenq_local(rq);
+	}
 }

 static void deferred_bal_cb_workfn(struct rq *rq)
@@ -4569,6 +4575,9 @@ static int validate_ops(struct scx_sched *sch)
 	if (ops->flags & SCX_OPS_HAS_CGROUP_WEIGHT)
 		pr_warn("SCX_OPS_HAS_CGROUP_WEIGHT is deprecated and a noop\n");

+	if (ops->cpu_acquire || ops->cpu_release)
+		pr_warn("ops->cpu_acquire/release() are deprecated, use sched_switch TP instead\n");
+
 	return 0;
 }

@@ -5931,6 +5940,9 @@ __bpf_kfunc_start_defs();
  * Iterate over all of the tasks currently enqueued on the local DSQ of the
  * caller's CPU, and re-enqueue them in the BPF scheduler. Returns the number of
  * processed tasks. Can only be called from ops.cpu_release().
+ *
+ * COMPAT: Will be removed in v6.23 along with the ___v2 suffix on the void
+ * returning variant that can be called from anywhere.
  */
 __bpf_kfunc u32 scx_bpf_reenqueue_local(void)
 {
@@ -6490,6 +6502,24 @@ __bpf_kfunc void scx_bpf_dump_bstr(char *data__str, u32 data__sz)
 }

 /**
+ * scx_bpf_reenqueue_local - Re-enqueue tasks on a local DSQ
+ *
+ * Iterate over all of the tasks currently enqueued on the local DSQ of the
+ * caller's CPU, and re-enqueue them in the BPF scheduler. Can be called from
+ * anywhere.
+ */
+__bpf_kfunc void scx_bpf_reenqueue_local___v2(void)
+{
+	struct rq *rq;
+
+	guard(preempt)();
+
+	rq = this_rq();
+	local_set(&rq->scx.reenq_local_deferred, 1);
+	schedule_deferred(rq);
+}
+
+/**
  * scx_bpf_cpuperf_cap - Query the maximum relative capacity of a CPU
  * @cpu: CPU of interest
  *
@@ -6902,6 +6932,7 @@ BTF_ID_FLAGS(func, bpf_iter_scx_dsq_destroy, KF_ITER_DESTRUCTOR)
 BTF_ID_FLAGS(func, scx_bpf_exit_bstr, KF_TRUSTED_ARGS)
 BTF_ID_FLAGS(func, scx_bpf_error_bstr, KF_TRUSTED_ARGS)
 BTF_ID_FLAGS(func, scx_bpf_dump_bstr, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, scx_bpf_reenqueue_local___v2)
 BTF_ID_FLAGS(func, scx_bpf_cpuperf_cap)
 BTF_ID_FLAGS(func, scx_bpf_cpuperf_cur)
 BTF_ID_FLAGS(func, scx_bpf_cpuperf_set)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 444444444444..555555555555 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -804,6 +804,7 @@ struct scx_rq {
 	cpumask_var_t		cpus_to_preempt;
 	cpumask_var_t		cpus_to_wait;
 	unsigned long		kick_sync;
+	local_t			reenq_local_deferred;
 	struct balance_callback	deferred_bal_cb;
 	struct irq_work		deferred_irq_work;
 	struct irq_work		kick_cpus_irq_work;
diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h
index 666666666666..777777777777 100644
--- a/tools/sched_ext/include/scx/common.bpf.h
+++ b/tools/sched_ext/include/scx/common.bpf.h
@@ -70,7 +70,6 @@ void scx_bpf_dsq_move_set_slice(struct bpf_iter_scx_dsq *it__iter, u64 slice) _
 void scx_bpf_dsq_move_set_vtime(struct bpf_iter_scx_dsq *it__iter, u64 vtime) __ksym __weak;
 bool scx_bpf_dsq_move(struct bpf_iter_scx_dsq *it__iter, struct task_struct *p, u64 dsq_id, u64 enq_flags) __ksym __weak;
 bool scx_bpf_dsq_move_vtime(struct bpf_iter_scx_dsq *it__iter, struct task_struct *p, u64 dsq_id, u64 enq_flags) __ksym __weak;
-u32 scx_bpf_reenqueue_local(void) __ksym;
 void scx_bpf_kick_cpu(s32 cpu, u64 flags) __ksym;
 s32 scx_bpf_dsq_nr_queued(u64 dsq_id) __ksym;
 void scx_bpf_destroy_dsq(u64 dsq_id) __ksym;
diff --git a/tools/sched_ext/include/scx/compat.bpf.h b/tools/sched_ext/include/scx/compat.bpf.h
index 888888888888..999999999999 100644
--- a/tools/sched_ext/include/scx/compat.bpf.h
+++ b/tools/sched_ext/include/scx/compat.bpf.h
@@ -279,6 +279,29 @@ static inline void scx_bpf_task_set_dsq_weight(struct task_struct *p, u32 weigh
 }

 /*
+ * v6.19: The new void variant can be called from anywhere while the older v1
+ * variant can only be called from ops.cpu_release(). The double ___ prefixes on
+ * the v2 variant need to be removed once libbpf is updated to ignore ___ prefix
+ * on kernel side. Drop the wrapper and move the decl to common.bpf.h after
+ * v6.22.
+ */
+u32 scx_bpf_reenqueue_local___v1(void) __ksym __weak;
+void scx_bpf_reenqueue_local___v2___compat(void) __ksym __weak;
+
+static inline bool __COMPAT_scx_bpf_reenqueue_local_from_anywhere(void)
+{
+	return bpf_ksym_exists(scx_bpf_reenqueue_local___v2___compat);
+}
+
+static inline void scx_bpf_reenqueue_local(void)
+{
+	if (__COMPAT_scx_bpf_reenqueue_local_from_anywhere())
+		scx_bpf_reenqueue_local___v2___compat();
+	else
+		scx_bpf_reenqueue_local___v1();
+}
+
+/*
  * Define sched_ext_ops. This may be expanded to define multiple variants for
  * backward compatibility. See compat.h::SCX_OPS_LOAD/ATTACH().
  */
diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
index aaaaaaaaaaaa..bbbbbbbbbbbb 100644
--- a/tools/sched_ext/scx_qmap.bpf.c
+++ b/tools/sched_ext/scx_qmap.bpf.c
@@ -202,6 +202,9 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
 	void *ring;
 	s32 cpu;

+	if (enq_flags & SCX_ENQ_REENQ)
+		__sync_fetch_and_add(&nr_reenqueued, 1);
+
 	if (p->flags & PF_KTHREAD) {
 		if (stall_kernel_nth && !(++kernel_cnt % stall_kernel_nth))
 			return;
@@ -529,20 +532,35 @@ bool BPF_STRUCT_OPS(qmap_core_sched_before, struct task_struct *a,
 	return task_qdist(a) > task_qdist(b);
 }

-void BPF_STRUCT_OPS(qmap_cpu_release, s32 cpu, struct scx_cpu_release_args *args)
+SEC("tp_btf/sched_switch")
+int BPF_PROG(qmap_sched_switch, bool preempt, struct task_struct *prev,
+	     struct task_struct *next, unsigned long prev_state)
 {
-	u32 cnt;
+	if (!__COMPAT_scx_bpf_reenqueue_local_from_anywhere())
+		return 0;

 	/*
-	 * Called when @cpu is taken by a higher priority scheduling class. This
-	 * makes @cpu no longer available for executing sched_ext tasks. As we
-	 * don't want the tasks in @cpu's local dsq to sit there until @cpu
-	 * becomes available again, re-enqueue them into the global dsq. See
-	 * %SCX_ENQ_REENQ handling in qmap_enqueue().
+	 * If @cpu is taken by a higher priority scheduling class, it is no
+	 * longer available for executing sched_ext tasks. As we don't want the
+	 * tasks in @cpu's local dsq to sit there until @cpu becomes available
+	 * again, re-enqueue them into the global dsq. See %SCX_ENQ_REENQ
+	 * handling in qmap_enqueue().
 	 */
-	cnt = scx_bpf_reenqueue_local();
-	if (cnt)
-		__sync_fetch_and_add(&nr_reenqueued, cnt);
+	switch (next->policy) {
+	case 1: /* SCHED_FIFO */
+	case 2: /* SCHED_RR */
+	case 6: /* SCHED_DEADLINE */
+		scx_bpf_reenqueue_local();
+	}
+
+	return 0;
+}
+
+void BPF_STRUCT_OPS(qmap_cpu_release, s32 cpu, struct scx_cpu_release_args *args)
+{
+	/* see qmap_sched_switch() to learn how to do this on newer kernels */
+	if (!__COMPAT_scx_bpf_reenqueue_local_from_anywhere())
+		scx_bpf_reenqueue_local();
 }

 s32 BPF_STRUCT_OPS(qmap_init_task, struct task_struct *p,
--
2.47.1


* Re: [PATCH 1/3] sched_ext: Split schedule_deferred() into locked and unlocked variants
  2025-10-25  0:18 ` [PATCH 1/3] sched_ext: Split schedule_deferred() into locked and unlocked variants Tejun Heo
@ 2025-10-25 23:17   ` Emil Tsalapatis
  0 siblings, 0 replies; 28+ messages in thread
From: Emil Tsalapatis @ 2025-10-25 23:17 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Andrea Righi, Changwoo Min, linux-kernel, sched-ext,
	Peter Zijlstra, Wen-Fang Liu

On Fri, Oct 24, 2025 at 8:18 PM Tejun Heo <tj@kernel.org> wrote:
>
> schedule_deferred() currently requires the rq lock to be held so that it can
> use scheduler hooks for efficiency when available. However, there are cases
> where deferred actions need to be scheduled from contexts that don't hold the
> rq lock.
>
> Split it into schedule_deferred(), which can be called from any context and
> simply queues an irq_work, and schedule_deferred_locked(), which requires the
> rq lock and can use scheduler hooks for efficiency when available. Update the
> existing call site to use the _locked variant.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---

Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>

I assume we don't really care about ba

>  kernel/sched/ext.c | 33 ++++++++++++++++++++++++---------
>  1 file changed, 24 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 000000000000..111111111111 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -775,12 +775,28 @@ static void deferred_irq_workfn(struct irq_work *irq_work)
>   * schedule_deferred - Schedule execution of deferred actions on an rq
>   * @rq: target rq
>   *
> - * Schedule execution of deferred actions on @rq. Must be called with @rq
> - * locked. Deferred actions are executed with @rq locked but unpinned, and thus
> - * can unlock @rq to e.g. migrate tasks to other rqs.
> + * Schedule execution of deferred actions on @rq. Deferred actions are executed
> + * with @rq locked but unpinned, and thus can unlock @rq to e.g. migrate tasks
> + * to other rqs.
>   */
>  static void schedule_deferred(struct rq *rq)
>  {
> +       /*
> +        * Queue an irq work. They are executed on IRQ re-enable which may take
> +        * a bit longer than the scheduler hook in schedule_deferred_locked().
> +        */
> +       irq_work_queue(&rq->scx.deferred_irq_work);
> +}
> +
> +/**
> + * schedule_deferred_locked - Schedule execution of deferred actions on an rq
> + * @rq: target rq
> + *
> + * Schedule execution of deferred actions on @rq. Equivalent to
> + * schedule_deferred() but requires @rq to be locked and can be more efficient.
> + */
> +static void schedule_deferred_locked(struct rq *rq)
> +{
>         lockdep_assert_rq_held(rq);
>
>         /*
> @@ -812,12 +828,11 @@ static void schedule_deferred(struct rq *rq)
>         }
>
>         /*
> -        * No scheduler hooks available. Queue an irq work. They are executed on
> -        * IRQ re-enable which may take a bit longer than the scheduler hooks.
> -        * The above WAKEUP and BALANCE paths should cover most of the cases and
> -        * the time to IRQ re-enable shouldn't be long.
> +        * No scheduler hooks available. Use the generic irq_work path. The
> +        * above WAKEUP and BALANCE paths should cover most of the cases and the
> +        * time to IRQ re-enable shouldn't be long.
>          */
> -       irq_work_queue(&rq->scx.deferred_irq_work);
> +       schedule_deferred(rq);
>  }
>
>  /**
> @@ -1211,7 +1226,7 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
>                 WARN_ON_ONCE(p->scx.dsq || !list_empty(&p->scx.dsq_list.node));
>                 list_add_tail(&p->scx.dsq_list.node,
>                               &rq->scx.ddsp_deferred_locals);
> -               schedule_deferred(rq);
> +               schedule_deferred_locked(rq);
>                 return;
>         }
>
> --
> 2.47.1
>


* Re: [PATCH 2/3] sched_ext: Factor out reenq_local() from scx_bpf_reenqueue_local()
  2025-10-25  0:18 ` [PATCH 2/3] sched_ext: Factor out reenq_local() from scx_bpf_reenqueue_local() Tejun Heo
@ 2025-10-25 23:19   ` Emil Tsalapatis
  0 siblings, 0 replies; 28+ messages in thread
From: Emil Tsalapatis @ 2025-10-25 23:19 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Andrea Righi, Changwoo Min, linux-kernel, sched-ext,
	Peter Zijlstra, Wen-Fang Liu

On Fri, Oct 24, 2025 at 8:18 PM Tejun Heo <tj@kernel.org> wrote:
>
> Factor out the core re-enqueue logic from scx_bpf_reenqueue_local() into a
> new reenq_local() helper function. scx_bpf_reenqueue_local() now handles the
> BPF kfunc checks and calls reenq_local() to perform the actual work.
>
> This is a prep patch to allow reenq_local() to be called from other contexts.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---

Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>

>  kernel/sched/ext.c | 50 +++++++++++++++++++++++++++++---------------------
>  1 file changed, 29 insertions(+), 21 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 111111111111..222222222222 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -5881,32 +5881,12 @@ static const struct btf_kfunc_id_set scx_kfunc_set_dispatch = {
>         .set                    = &scx_kfunc_ids_dispatch,
>  };
>
> -__bpf_kfunc_start_defs();
> -
> -/**
> - * scx_bpf_reenqueue_local - Re-enqueue tasks on a local DSQ
> - *
> - * Iterate over all of the tasks currently enqueued on the local DSQ of the
> - * caller's CPU, and re-enqueue them in the BPF scheduler. Returns the number of
> - * processed tasks. Can only be called from ops.cpu_release().
> - */
> -__bpf_kfunc u32 scx_bpf_reenqueue_local(void)
> +static u32 reenq_local(struct rq *rq)
>  {
> -       struct scx_sched *sch;
>         LIST_HEAD(tasks);
>         u32 nr_enqueued = 0;
> -       struct rq *rq;
>         struct task_struct *p, *n;
>
> -       guard(rcu)();
> -       sch = rcu_dereference(scx_root);
> -       if (unlikely(!sch))
> -               return 0;
> -
> -       if (!scx_kf_allowed(sch, SCX_KF_CPU_RELEASE))
> -               return 0;
> -
> -       rq = cpu_rq(smp_processor_id());
>         lockdep_assert_rq_held(rq);
>
>         /*
> @@ -5943,6 +5923,34 @@ __bpf_kfunc u32 scx_bpf_reenqueue_local(void)
>         return nr_enqueued;
>  }
>
> +__bpf_kfunc_start_defs();
> +
> +/**
> + * scx_bpf_reenqueue_local - Re-enqueue tasks on a local DSQ
> + *
> + * Iterate over all of the tasks currently enqueued on the local DSQ of the
> + * caller's CPU, and re-enqueue them in the BPF scheduler. Returns the number of
> + * processed tasks. Can only be called from ops.cpu_release().
> + */
> +__bpf_kfunc u32 scx_bpf_reenqueue_local(void)
> +{
> +       struct scx_sched *sch;
> +       struct rq *rq;
> +
> +       guard(rcu)();
> +       sch = rcu_dereference(scx_root);
> +       if (unlikely(!sch))
> +               return 0;
> +
> +       if (!scx_kf_allowed(sch, SCX_KF_CPU_RELEASE))
> +               return 0;
> +
> +       rq = cpu_rq(smp_processor_id());
> +       lockdep_assert_rq_held(rq);
> +
> +       return reenq_local(rq);
> +}
> +
>  __bpf_kfunc_end_defs();
>
>  BTF_KFUNCS_START(scx_kfunc_ids_cpu_release)
> --
> 2.47.1
>


* Re: [PATCH 3/3] sched_ext: Allow scx_bpf_reenqueue_local() to be called from anywhere
  2025-10-25  0:18 ` [PATCH 3/3] sched_ext: Allow scx_bpf_reenqueue_local() to be called from anywhere Tejun Heo
@ 2025-10-25 23:21   ` Emil Tsalapatis
  2025-10-27  9:18   ` Peter Zijlstra
  2025-10-27 18:19   ` [PATCH v2 " Tejun Heo
  2 siblings, 0 replies; 28+ messages in thread
From: Emil Tsalapatis @ 2025-10-25 23:21 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Andrea Righi, Changwoo Min, linux-kernel, sched-ext,
	Peter Zijlstra, Wen-Fang Liu

On Fri, Oct 24, 2025 at 8:18 PM Tejun Heo <tj@kernel.org> wrote:
>
> The ops.cpu_acquire/release() callbacks are broken - they miss events under
> multiple conditions and can't be fixed without adding global sched core hooks
> that sched maintainers don't want. They also aren't necessary as BPF schedulers
> can use generic BPF mechanisms like tracepoints to achieve the same goals.
>
> The main use case for cpu_release() was calling scx_bpf_reenqueue_local() when
> a CPU is preempted by a higher-priority scheduling class. However, the old
> scx_bpf_reenqueue_local() could only be called from cpu_release() context.
>
> Add a new version of scx_bpf_reenqueue_local() that can be called from any
> context by deferring the actual re-enqueue operation. This eliminates the need
> for cpu_acquire/release() ops entirely. Schedulers can now use standard BPF
> mechanisms like the sched_switch tracepoint to detect and handle CPU preemption.
>
> Update scx_qmap to demonstrate the new approach using sched_switch instead of
> cpu_release, with compat support for older kernels. Mark cpu_acquire/release()
> as deprecated. The old scx_bpf_reenqueue_local() variant will be removed in
> v6.23.
>
> Reported-by: Wen-Fang Liu <liuwenfang@honor.com>
> Link: https://lore.kernel.org/all/8d64c74118c6440f81bcf5a4ac6b9f00@honor.com/
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---

Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>

>  kernel/sched/ext.c                       | 31 ++++++++++++++++++++
>  kernel/sched/sched.h                     |  1 +
>  tools/sched_ext/include/scx/common.bpf.h |  1 -
>  tools/sched_ext/include/scx/compat.bpf.h | 23 +++++++++++++++
>  tools/sched_ext/scx_qmap.bpf.c           | 38 +++++++++++++++++-------
>  5 files changed, 83 insertions(+), 11 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 222222222222..333333333333 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -147,6 +147,7 @@ static struct kset *scx_kset;
>  #include <trace/events/sched_ext.h>
>
>  static void process_ddsp_deferred_locals(struct rq *rq);
> +static u32 reenq_local(struct rq *rq);
>  static void scx_kick_cpu(struct scx_sched *sch, s32 cpu, u64 flags);
>  static void scx_vexit(struct scx_sched *sch, enum scx_exit_kind kind,
>                       s64 exit_code, const char *fmt, va_list args);
> @@ -755,6 +756,11 @@ static int ops_sanitize_err(struct scx_sched *sch, s32 ret, s32 ops_err)
>  static void run_deferred(struct rq *rq)
>  {
>         process_ddsp_deferred_locals(rq);
> +
> +       if (local_read(&rq->scx.reenq_local_deferred)) {
> +               local_set(&rq->scx.reenq_local_deferred, 0);
> +               reenq_local(rq);
> +       }
>  }
>
>  static void deferred_bal_cb_workfn(struct rq *rq)
> @@ -4569,6 +4575,9 @@ static int validate_ops(struct scx_sched *sch)
>         if (ops->flags & SCX_OPS_HAS_CGROUP_WEIGHT)
>                 pr_warn("SCX_OPS_HAS_CGROUP_WEIGHT is deprecated and a noop\n");
>
> +       if (ops->cpu_acquire || ops->cpu_release)
> +               pr_warn("ops->cpu_acquire/release() are deprecated, use sched_switch TP instead\n");
> +
>         return 0;
>  }
>
> @@ -5931,6 +5940,9 @@ __bpf_kfunc_start_defs();
>   * Iterate over all of the tasks currently enqueued on the local DSQ of the
>   * caller's CPU, and re-enqueue them in the BPF scheduler. Returns the number of
>   * processed tasks. Can only be called from ops.cpu_release().
> + *
> + * COMPAT: Will be removed in v6.23 along with the ___v2 suffix on the void
> + * returning variant that can be called from anywhere.
>   */
>  __bpf_kfunc u32 scx_bpf_reenqueue_local(void)
>  {
> @@ -6490,6 +6502,24 @@ __bpf_kfunc void scx_bpf_dump_bstr(char *data__str, u32 data__sz)
>  }
>
>  /**
> + * scx_bpf_reenqueue_local - Re-enqueue tasks on a local DSQ
> + *
> + * Iterate over all of the tasks currently enqueued on the local DSQ of the
> + * caller's CPU, and re-enqueue them in the BPF scheduler. Can be called from
> + * anywhere.
> + */
> +__bpf_kfunc void scx_bpf_reenqueue_local___v2(void)
> +{
> +       struct rq *rq;
> +
> +       guard(preempt)();
> +
> +       rq = this_rq();
> +       local_set(&rq->scx.reenq_local_deferred, 1);
> +       schedule_deferred(rq);
> +}
> +
> +/**
>   * scx_bpf_cpuperf_cap - Query the maximum relative capacity of a CPU
>   * @cpu: CPU of interest
>   *
> @@ -6902,6 +6932,7 @@ BTF_ID_FLAGS(func, bpf_iter_scx_dsq_destroy, KF_ITER_DESTRUCTOR)
>  BTF_ID_FLAGS(func, scx_bpf_exit_bstr, KF_TRUSTED_ARGS)
>  BTF_ID_FLAGS(func, scx_bpf_error_bstr, KF_TRUSTED_ARGS)
>  BTF_ID_FLAGS(func, scx_bpf_dump_bstr, KF_TRUSTED_ARGS)
> +BTF_ID_FLAGS(func, scx_bpf_reenqueue_local___v2)
>  BTF_ID_FLAGS(func, scx_bpf_cpuperf_cap)
>  BTF_ID_FLAGS(func, scx_bpf_cpuperf_cur)
>  BTF_ID_FLAGS(func, scx_bpf_cpuperf_set)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 444444444444..555555555555 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -804,6 +804,7 @@ struct scx_rq {
>         cpumask_var_t           cpus_to_preempt;
>         cpumask_var_t           cpus_to_wait;
>         unsigned long           kick_sync;
> +       local_t                 reenq_local_deferred;
>         struct balance_callback deferred_bal_cb;
>         struct irq_work         deferred_irq_work;
>         struct irq_work         kick_cpus_irq_work;
> diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h
> index 666666666666..777777777777 100644
> --- a/tools/sched_ext/include/scx/common.bpf.h
> +++ b/tools/sched_ext/include/scx/common.bpf.h
> @@ -70,7 +70,6 @@ void scx_bpf_dsq_move_set_slice(struct bpf_iter_scx_dsq *it__iter, u64 slice) _
>  void scx_bpf_dsq_move_set_vtime(struct bpf_iter_scx_dsq *it__iter, u64 vtime) __ksym __weak;
>  bool scx_bpf_dsq_move(struct bpf_iter_scx_dsq *it__iter, struct task_struct *p, u64 dsq_id, u64 enq_flags) __ksym __weak;
>  bool scx_bpf_dsq_move_vtime(struct bpf_iter_scx_dsq *it__iter, struct task_struct *p, u64 dsq_id, u64 enq_flags) __ksym __weak;
> -u32 scx_bpf_reenqueue_local(void) __ksym;
>  void scx_bpf_kick_cpu(s32 cpu, u64 flags) __ksym;
>  s32 scx_bpf_dsq_nr_queued(u64 dsq_id) __ksym;
>  void scx_bpf_destroy_dsq(u64 dsq_id) __ksym;
> diff --git a/tools/sched_ext/include/scx/compat.bpf.h b/tools/sched_ext/include/scx/compat.bpf.h
> index 888888888888..999999999999 100644
> --- a/tools/sched_ext/include/scx/compat.bpf.h
> +++ b/tools/sched_ext/include/scx/compat.bpf.h
> @@ -279,6 +279,29 @@ static inline void scx_bpf_task_set_dsq_weight(struct task_struct *p, u32 weigh
>  }
>
>  /*
> + * v6.19: The new void variant can be called from anywhere while the older v1
> + * variant can only be called from ops.cpu_release(). The double ___ suffixes
> + * on the v2 variant need to be removed once libbpf is updated to ignore ___
> + * suffixes on kernel-side names. Drop the wrapper and move the decl to
> + * common.bpf.h after v6.22.
> + */
> +u32 scx_bpf_reenqueue_local___v1(void) __ksym __weak;
> +void scx_bpf_reenqueue_local___v2___compat(void) __ksym __weak;
> +
> +static inline bool __COMPAT_scx_bpf_reenqueue_local_from_anywhere(void)
> +{
> +       return bpf_ksym_exists(scx_bpf_reenqueue_local___v2___compat);
> +}
> +
> +static inline void scx_bpf_reenqueue_local(void)
> +{
> +       if (__COMPAT_scx_bpf_reenqueue_local_from_anywhere())
> +               scx_bpf_reenqueue_local___v2___compat();
> +       else
> +               scx_bpf_reenqueue_local___v1();
> +}
> +
> +/*
>   * Define sched_ext_ops. This may be expanded to define multiple variants for
>   * backward compatibility. See compat.h::SCX_OPS_LOAD/ATTACH().
>   */
> diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
> index aaaaaaaaaaaa..bbbbbbbbbbbb 100644
> --- a/tools/sched_ext/scx_qmap.bpf.c
> +++ b/tools/sched_ext/scx_qmap.bpf.c
> @@ -202,6 +202,9 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
>         void *ring;
>         s32 cpu;
>
> +       if (enq_flags & SCX_ENQ_REENQ)
> +               __sync_fetch_and_add(&nr_reenqueued, 1);
> +
>         if (p->flags & PF_KTHREAD) {
>                 if (stall_kernel_nth && !(++kernel_cnt % stall_kernel_nth))
>                         return;
> @@ -529,20 +532,35 @@ bool BPF_STRUCT_OPS(qmap_core_sched_before, struct task_struct *a,
>         return task_qdist(a) > task_qdist(b);
>  }
>
> -void BPF_STRUCT_OPS(qmap_cpu_release, s32 cpu, struct scx_cpu_release_args *args)
> +SEC("tp_btf/sched_switch")
> +int BPF_PROG(qmap_sched_switch, bool preempt, struct task_struct *prev,
> +            struct task_struct *next, unsigned long prev_state)
>  {
> -       u32 cnt;
> +       if (!__COMPAT_scx_bpf_reenqueue_local_from_anywhere())
> +               return 0;
>
>         /*
> -        * Called when @cpu is taken by a higher priority scheduling class. This
> -        * makes @cpu no longer available for executing sched_ext tasks. As we
> -        * don't want the tasks in @cpu's local dsq to sit there until @cpu
> -        * becomes available again, re-enqueue them into the global dsq. See
> -        * %SCX_ENQ_REENQ handling in qmap_enqueue().
> +        * If the CPU is taken by a higher priority scheduling class, it is no
> +        * longer available for executing sched_ext tasks. As we don't want the
> +        * tasks in its local dsq to sit there until the CPU becomes available
> +        * again, re-enqueue them into the global dsq. See %SCX_ENQ_REENQ
> +        * handling in qmap_enqueue().
>          */
> -       cnt = scx_bpf_reenqueue_local();
> -       if (cnt)
> -               __sync_fetch_and_add(&nr_reenqueued, cnt);
> +       switch (next->policy) {
> +       case 1: /* SCHED_FIFO */
> +       case 2: /* SCHED_RR */
> +       case 6: /* SCHED_DEADLINE */
> +               scx_bpf_reenqueue_local();
> +       }
> +
> +       return 0;
> +}
> +
> +void BPF_STRUCT_OPS(qmap_cpu_release, s32 cpu, struct scx_cpu_release_args *args)
> +{
> +       /* see qmap_sched_switch() to learn how to do this on newer kernels */
> +       if (!__COMPAT_scx_bpf_reenqueue_local_from_anywhere())
> +               scx_bpf_reenqueue_local();
>  }
>
>  s32 BPF_STRUCT_OPS(qmap_init_task, struct task_struct *p,
> --
> 2.47.1
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/3] sched_ext: Allow scx_bpf_reenqueue_local() to be called from anywhere
  2025-10-25  0:18 ` [PATCH 3/3] sched_ext: Allow scx_bpf_reenqueue_local() to be called from anywhere Tejun Heo
  2025-10-25 23:21   ` Emil Tsalapatis
@ 2025-10-27  9:18   ` Peter Zijlstra
  2025-10-27 16:00     ` Tejun Heo
  2025-10-27 18:19   ` [PATCH v2 " Tejun Heo
  2 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2025-10-27  9:18 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Andrea Righi, Changwoo Min, linux-kernel, sched-ext,
	Wen-Fang Liu

On Fri, Oct 24, 2025 at 02:18:49PM -1000, Tejun Heo wrote:
> The ops.cpu_acquire/release() callbacks are broken - they miss events under
> multiple conditions and can't be fixed without adding global sched core hooks
> that sched maintainers don't want. They also aren't necessary as BPF schedulers
> can use generic BPF mechanisms like tracepoints to achieve the same goals.
> 
> The main use case for cpu_release() was calling scx_bpf_reenqueue_local() when
> a CPU gets preempted by a higher priority scheduling class. However, the old
> scx_bpf_reenqueue_local() could only be called from cpu_release() context.

I'm a little confused. Isn't this the problem where balance_one()
migrates a task to the local rq and we end up having to RETRY_TASK
because another (higher) rq gets modified?

Why can't we simply re-queue the task in the RETRY_TASK branch --
effectively undoing balance_one()?


Relying on hooking into tracepoints seems like a gruesome hack.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/3] sched_ext: Allow scx_bpf_reenqueue_local() to be called from anywhere
  2025-10-27  9:18   ` Peter Zijlstra
@ 2025-10-27 16:00     ` Tejun Heo
  2025-10-27 17:49       ` Peter Zijlstra
  2025-10-27 18:10       ` Peter Zijlstra
  0 siblings, 2 replies; 28+ messages in thread
From: Tejun Heo @ 2025-10-27 16:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: David Vernet, Andrea Righi, Changwoo Min, linux-kernel, sched-ext,
	Wen-Fang Liu

Hello,

On Mon, Oct 27, 2025 at 10:18:22AM +0100, Peter Zijlstra wrote:
...
> > The main use case for cpu_release() was calling scx_bpf_reenqueue_local() when
> > a CPU gets preempted by a higher priority scheduling class. However, the old
> > scx_bpf_reenqueue_local() could only be called from cpu_release() context.
> 
> I'm a little confused. Isn't this the problem where balance_one()
> migrates a task to the local rq and we end up having to RETRY_TASK
> because another (higher) rq gets modified?

That's what I thought too and the gap between balance() and pick_task() can
be closed that way. However, while plugging that, I realized there's another
bigger gap between ttwu() and pick_task() because ttwu() can directly
dispatch a task into the local DSQ of a CPU. That one, there's no way to
close without a global hook.

> Why can't we simply re-queue the task in the RETRY_TASK branch --
> effectively undoing balance_one()?
> 
> Relying on hooking into tracepoints seems like a gruesome hack.

From a BPF scheduler's POV, it's just using a more generic mechanism.
Multiple schedulers already make use of other BPF attach points - timers,
TPs, fentry/fexit's, so this doesn't make things less congruent.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/3] sched_ext: Allow scx_bpf_reenqueue_local() to be called from anywhere
  2025-10-27 16:00     ` Tejun Heo
@ 2025-10-27 17:49       ` Peter Zijlstra
  2025-10-27 18:05         ` Tejun Heo
  2025-10-27 18:10       ` Peter Zijlstra
  1 sibling, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2025-10-27 17:49 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Andrea Righi, Changwoo Min, linux-kernel, sched-ext,
	Wen-Fang Liu

On Mon, Oct 27, 2025 at 06:00:00AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Mon, Oct 27, 2025 at 10:18:22AM +0100, Peter Zijlstra wrote:
> ...
> > > The main use case for cpu_release() was calling scx_bpf_reenqueue_local() when
> > > a CPU gets preempted by a higher priority scheduling class. However, the old
> > > scx_bpf_reenqueue_local() could only be called from cpu_release() context.
> > 
> > I'm a little confused. Isn't this the problem where balance_one()
> > migrates a task to the local rq and we end up having to RETRY_TASK
> > because another (higher) rq gets modified?
> 
> That's what I thought too and the gap between balance() and pick_task() can
> be closed that way. However, while plugging that, I realized there's another
> bigger gap between ttwu() and pick_task() because ttwu() can directly
> dispatch a task into the local DSQ of a CPU. That one, there's no way to
> close without a global hook.

This would've been prime Changelog material. As is, the Changelog was so
vague I wasn't even sure it was that particular problem.

Please update the changelog to be clearer.

Also, why is this patch already in a pull request to Linus? what's the
hurry.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/3] sched_ext: Allow scx_bpf_reenqueue_local() to be called from anywhere
  2025-10-27 17:49       ` Peter Zijlstra
@ 2025-10-27 18:05         ` Tejun Heo
  2025-10-27 18:07           ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread
From: Tejun Heo @ 2025-10-27 18:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: David Vernet, Andrea Righi, Changwoo Min, linux-kernel, sched-ext,
	Wen-Fang Liu

On Mon, Oct 27, 2025 at 06:49:53PM +0100, Peter Zijlstra wrote:
> > That's what I thought too and the gap between balance() and pick_task() can
> > be closed that way. However, while plugging that, I realized there's another
> > bigger gap between ttwu() and pick_task() because ttwu() can directly
> > dispatch a task into the local DSQ of a CPU. That one, there's no way to
> > close without a global hook.
> 
> This would've been prime Changelog material. As is the Changelog was so
> vague I wasn't even sure it was that particular problem.
> 
> Please update the changelog to be clearer.

Oh yeah, good point.

> Also, why is this patch already in a pull request to Linus? what's the
> hurry.

Hmmm? It shouldn't be. Let me check again. No, it isn't. What are you
looking at?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/3] sched_ext: Allow scx_bpf_reenqueue_local() to be called from anywhere
  2025-10-27 18:05         ` Tejun Heo
@ 2025-10-27 18:07           ` Peter Zijlstra
  0 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2025-10-27 18:07 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Andrea Righi, Changwoo Min, linux-kernel, sched-ext,
	Wen-Fang Liu

On Mon, Oct 27, 2025 at 08:05:42AM -1000, Tejun Heo wrote:
> > Also, why is this patch already in a pull request to Linus? what's the
> > hurry.
> 
> Hmmm? It shouldn't be. Let me check again. No, it isn't. What are you
> looking at?

Hmm, my bad, moron-in-a-hurry-can't-read and all that. I thought it was
included in this one:

  https://lkml.kernel.org/r/36450bebbc782be498d762fcbcd99451@kernel.org

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/3] sched_ext: Allow scx_bpf_reenqueue_local() to be called from anywhere
  2025-10-27 16:00     ` Tejun Heo
  2025-10-27 17:49       ` Peter Zijlstra
@ 2025-10-27 18:10       ` Peter Zijlstra
  2025-10-27 18:17         ` Tejun Heo
  1 sibling, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2025-10-27 18:10 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Andrea Righi, Changwoo Min, linux-kernel, sched-ext,
	Wen-Fang Liu

On Mon, Oct 27, 2025 at 06:00:00AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Mon, Oct 27, 2025 at 10:18:22AM +0100, Peter Zijlstra wrote:
> ...
> > > The main use case for cpu_release() was calling scx_bpf_reenqueue_local() when
> > > a CPU gets preempted by a higher priority scheduling class. However, the old
> > > scx_bpf_reenqueue_local() could only be called from cpu_release() context.
> > 
> > I'm a little confused. Isn't this the problem where balance_one()
> > migrates a task to the local rq and we end up having to RETRY_TASK
> > because another (higher) rq gets modified?
> 
> That's what I thought too and the gap between balance() and pick_task() can
> be closed that way. However, while plugging that, I realized there's another
> bigger gap between ttwu() and pick_task() because ttwu() can directly
> dispatch a task into the local DSQ of a CPU. That one, there's no way to
> close without a global hook.

Just for my elucidation and such... This is when ttwu() happens and the
CPU is idle and you dispatch directly to it, expecting it to then go run
that task. After which another wakeup/balance movement happens which
places/moves a task from a higher priority class to that CPU, such that
your initial (ext) task doesn't get to run after all. Right?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/3] sched_ext: Allow scx_bpf_reenqueue_local() to be called from anywhere
  2025-10-27 18:10       ` Peter Zijlstra
@ 2025-10-27 18:17         ` Tejun Heo
  2025-10-28 11:01           ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread
From: Tejun Heo @ 2025-10-27 18:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: David Vernet, Andrea Righi, Changwoo Min, linux-kernel, sched-ext,
	Wen-Fang Liu

Hello,

On Mon, Oct 27, 2025 at 07:10:28PM +0100, Peter Zijlstra wrote:
...
> Just for my elucidation and such.. This is when ttwu() happens and the
> CPU is idle and you dispatch directly to it, expecting it to then go run
> that task. After which another wakeup/balance movement happens which
> places/moves a task from a higher priority class to that CPU, such that
> your initial (ext) task doesn't get to run after all. Right?

Yes, that's the scenario that I was thinking.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v2 3/3] sched_ext: Allow scx_bpf_reenqueue_local() to be called from anywhere
  2025-10-25  0:18 ` [PATCH 3/3] sched_ext: Allow scx_bpf_reenqueue_local() to be called from anywhere Tejun Heo
  2025-10-25 23:21   ` Emil Tsalapatis
  2025-10-27  9:18   ` Peter Zijlstra
@ 2025-10-27 18:19   ` Tejun Heo
  2025-10-29 10:45     ` Peter Zijlstra
  2025-10-29 15:49     ` [PATCH v3 " Tejun Heo
  2 siblings, 2 replies; 28+ messages in thread
From: Tejun Heo @ 2025-10-27 18:19 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: linux-kernel, sched-ext, Peter Zijlstra, Wen-Fang Liu

The ops.cpu_acquire/release() callbacks are broken - they miss events under
multiple conditions and can't be fixed without adding global sched core hooks
that sched maintainers don't want.

There are two distinct task dispatch gaps that can cause cpu_released flag
desynchronization:

1. balance-to-pick_task gap: This is what was originally reported. balance_scx()
   can enqueue a task, but during consume_remote_task() when the rq lock is
   released, a higher priority task can be enqueued and ultimately picked while
   cpu_released remains false. This gap is closeable via RETRY_TASK handling.

2. ttwu-to-pick_task gap: ttwu() can directly dispatch a task to a CPU's local
   DSQ. By the time the sched path runs on the target CPU, higher class tasks may
   already be queued. In such cases, nothing on sched_ext side will be invoked,
   and the only solution would be a hook invoked regardless of sched class, which
   isn't desirable.

Rather than adding invasive core hooks, BPF schedulers can use generic BPF
mechanisms like tracepoints. From an SCX scheduler's perspective, this is congruent
with other mechanisms it already uses and doesn't add further friction.

The main use case for cpu_release() was calling scx_bpf_reenqueue_local() when
a CPU gets preempted by a higher priority scheduling class. However, the old
scx_bpf_reenqueue_local() could only be called from cpu_release() context.

Add a new version of scx_bpf_reenqueue_local() that can be called from any
context by deferring the actual re-enqueue operation. This eliminates the need
for cpu_acquire/release() ops entirely. Schedulers can now use standard BPF
mechanisms like the sched_switch tracepoint to detect and handle CPU preemption.

Update scx_qmap to demonstrate the new approach using sched_switch instead of
cpu_release, with compat support for older kernels. Mark cpu_acquire/release()
as deprecated. The old scx_bpf_reenqueue_local() variant will be removed in
v6.23.

Reported-by: Wen-Fang Liu <liuwenfang@honor.com>
Link: https://lore.kernel.org/all/8d64c74118c6440f81bcf5a4ac6b9f00@honor.com/
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
v2: Description updated w/ justifications on taking this approach instead of
    fixing ops.cpu_acquire/release().

 kernel/sched/ext.c                       |   31 +++++++++++++++++++++++++
 kernel/sched/sched.h                     |    1 
 tools/sched_ext/include/scx/common.bpf.h |    1 
 tools/sched_ext/include/scx/compat.bpf.h |   23 ++++++++++++++++++
 tools/sched_ext/scx_qmap.bpf.c           |   38 ++++++++++++++++++++++---------
 5 files changed, 83 insertions(+), 11 deletions(-)

--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -147,6 +147,7 @@ static struct kset *scx_kset;
 #include <trace/events/sched_ext.h>
 
 static void process_ddsp_deferred_locals(struct rq *rq);
+static u32 reenq_local(struct rq *rq);
 static void scx_kick_cpu(struct scx_sched *sch, s32 cpu, u64 flags);
 static void scx_vexit(struct scx_sched *sch, enum scx_exit_kind kind,
 		      s64 exit_code, const char *fmt, va_list args);
@@ -755,6 +756,11 @@ static int ops_sanitize_err(struct scx_s
 static void run_deferred(struct rq *rq)
 {
 	process_ddsp_deferred_locals(rq);
+
+	if (local_read(&rq->scx.reenq_local_deferred)) {
+		local_set(&rq->scx.reenq_local_deferred, 0);
+		reenq_local(rq);
+	}
 }
 
 static void deferred_bal_cb_workfn(struct rq *rq)
@@ -4569,6 +4575,9 @@ static int validate_ops(struct scx_sched
 	if (ops->flags & SCX_OPS_HAS_CGROUP_WEIGHT)
 		pr_warn("SCX_OPS_HAS_CGROUP_WEIGHT is deprecated and a noop\n");
 
+	if (ops->cpu_acquire || ops->cpu_release)
+		pr_warn("ops->cpu_acquire/release() are deprecated, use sched_switch TP instead\n");
+
 	return 0;
 }
 
@@ -5931,6 +5940,9 @@ __bpf_kfunc_start_defs();
  * Iterate over all of the tasks currently enqueued on the local DSQ of the
  * caller's CPU, and re-enqueue them in the BPF scheduler. Returns the number of
  * processed tasks. Can only be called from ops.cpu_release().
+ *
+ * COMPAT: Will be removed in v6.23 along with the ___v2 suffix on the void
+ * returning variant that can be called from anywhere.
  */
 __bpf_kfunc u32 scx_bpf_reenqueue_local(void)
 {
@@ -6490,6 +6502,24 @@ __bpf_kfunc void scx_bpf_dump_bstr(char
 }
 
 /**
+ * scx_bpf_reenqueue_local - Re-enqueue tasks on a local DSQ
+ *
+ * Iterate over all of the tasks currently enqueued on the local DSQ of the
+ * caller's CPU, and re-enqueue them in the BPF scheduler. Can be called from
+ * anywhere.
+ */
+__bpf_kfunc void scx_bpf_reenqueue_local___v2(void)
+{
+	struct rq *rq;
+
+	guard(preempt)();
+
+	rq = this_rq();
+	local_set(&rq->scx.reenq_local_deferred, 1);
+	schedule_deferred(rq);
+}
+
+/**
  * scx_bpf_cpuperf_cap - Query the maximum relative capacity of a CPU
  * @cpu: CPU of interest
  *
@@ -6902,6 +6932,7 @@ BTF_ID_FLAGS(func, bpf_iter_scx_dsq_dest
 BTF_ID_FLAGS(func, scx_bpf_exit_bstr, KF_TRUSTED_ARGS)
 BTF_ID_FLAGS(func, scx_bpf_error_bstr, KF_TRUSTED_ARGS)
 BTF_ID_FLAGS(func, scx_bpf_dump_bstr, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, scx_bpf_reenqueue_local___v2)
 BTF_ID_FLAGS(func, scx_bpf_cpuperf_cap)
 BTF_ID_FLAGS(func, scx_bpf_cpuperf_cur)
 BTF_ID_FLAGS(func, scx_bpf_cpuperf_set)
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -804,6 +804,7 @@ struct scx_rq {
 	cpumask_var_t		cpus_to_preempt;
 	cpumask_var_t		cpus_to_wait;
 	unsigned long		kick_sync;
+	local_t			reenq_local_deferred;
 	struct balance_callback	deferred_bal_cb;
 	struct irq_work		deferred_irq_work;
 	struct irq_work		kick_cpus_irq_work;
--- a/tools/sched_ext/include/scx/common.bpf.h
+++ b/tools/sched_ext/include/scx/common.bpf.h
@@ -70,7 +70,6 @@ void scx_bpf_dsq_move_set_slice(struct b
 void scx_bpf_dsq_move_set_vtime(struct bpf_iter_scx_dsq *it__iter, u64 vtime) __ksym __weak;
 bool scx_bpf_dsq_move(struct bpf_iter_scx_dsq *it__iter, struct task_struct *p, u64 dsq_id, u64 enq_flags) __ksym __weak;
 bool scx_bpf_dsq_move_vtime(struct bpf_iter_scx_dsq *it__iter, struct task_struct *p, u64 dsq_id, u64 enq_flags) __ksym __weak;
-u32 scx_bpf_reenqueue_local(void) __ksym;
 void scx_bpf_kick_cpu(s32 cpu, u64 flags) __ksym;
 s32 scx_bpf_dsq_nr_queued(u64 dsq_id) __ksym;
 void scx_bpf_destroy_dsq(u64 dsq_id) __ksym;
--- a/tools/sched_ext/include/scx/compat.bpf.h
+++ b/tools/sched_ext/include/scx/compat.bpf.h
@@ -279,6 +279,29 @@ static inline void scx_bpf_task_set_dsq_
 }
 
 /*
+ * v6.19: The new void variant can be called from anywhere while the older v1
+ * variant can only be called from ops.cpu_release(). The double ___ suffixes
+ * on the v2 variant need to be removed once libbpf is updated to ignore ___
+ * suffixes on kernel-side names. Drop the wrapper and move the decl to
+ * common.bpf.h after v6.22.
+ */
+u32 scx_bpf_reenqueue_local___v1(void) __ksym __weak;
+void scx_bpf_reenqueue_local___v2___compat(void) __ksym __weak;
+
+static inline bool __COMPAT_scx_bpf_reenqueue_local_from_anywhere(void)
+{
+	return bpf_ksym_exists(scx_bpf_reenqueue_local___v2___compat);
+}
+
+static inline void scx_bpf_reenqueue_local(void)
+{
+	if (__COMPAT_scx_bpf_reenqueue_local_from_anywhere())
+		scx_bpf_reenqueue_local___v2___compat();
+	else
+		scx_bpf_reenqueue_local___v1();
+}
+
+/*
  * Define sched_ext_ops. This may be expanded to define multiple variants for
  * backward compatibility. See compat.h::SCX_OPS_LOAD/ATTACH().
  */
--- a/tools/sched_ext/scx_qmap.bpf.c
+++ b/tools/sched_ext/scx_qmap.bpf.c
@@ -202,6 +202,9 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct
 	void *ring;
 	s32 cpu;
 
+	if (enq_flags & SCX_ENQ_REENQ)
+		__sync_fetch_and_add(&nr_reenqueued, 1);
+
 	if (p->flags & PF_KTHREAD) {
 		if (stall_kernel_nth && !(++kernel_cnt % stall_kernel_nth))
 			return;
@@ -529,20 +532,35 @@ bool BPF_STRUCT_OPS(qmap_core_sched_befo
 	return task_qdist(a) > task_qdist(b);
 }
 
-void BPF_STRUCT_OPS(qmap_cpu_release, s32 cpu, struct scx_cpu_release_args *args)
+SEC("tp_btf/sched_switch")
+int BPF_PROG(qmap_sched_switch, bool preempt, struct task_struct *prev,
+	     struct task_struct *next, unsigned long prev_state)
 {
-	u32 cnt;
+	if (!__COMPAT_scx_bpf_reenqueue_local_from_anywhere())
+		return 0;
 
 	/*
-	 * Called when @cpu is taken by a higher priority scheduling class. This
-	 * makes @cpu no longer available for executing sched_ext tasks. As we
-	 * don't want the tasks in @cpu's local dsq to sit there until @cpu
-	 * becomes available again, re-enqueue them into the global dsq. See
-	 * %SCX_ENQ_REENQ handling in qmap_enqueue().
+	 * If the CPU is taken by a higher priority scheduling class, it is no
+	 * longer available for executing sched_ext tasks. As we don't want the
+	 * tasks in its local dsq to sit there until the CPU becomes available
+	 * again, re-enqueue them into the global dsq. See %SCX_ENQ_REENQ
+	 * handling in qmap_enqueue().
 	 */
-	cnt = scx_bpf_reenqueue_local();
-	if (cnt)
-		__sync_fetch_and_add(&nr_reenqueued, cnt);
+	switch (next->policy) {
+	case 1: /* SCHED_FIFO */
+	case 2: /* SCHED_RR */
+	case 6: /* SCHED_DEADLINE */
+		scx_bpf_reenqueue_local();
+	}
+
+	return 0;
+}
+
+void BPF_STRUCT_OPS(qmap_cpu_release, s32 cpu, struct scx_cpu_release_args *args)
+{
+	/* see qmap_sched_switch() to learn how to do this on newer kernels */
+	if (!__COMPAT_scx_bpf_reenqueue_local_from_anywhere())
+		scx_bpf_reenqueue_local();
 }
 
 s32 BPF_STRUCT_OPS(qmap_init_task, struct task_struct *p,

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/3] sched_ext: Allow scx_bpf_reenqueue_local() to be called from anywhere
  2025-10-27 18:17         ` Tejun Heo
@ 2025-10-28 11:01           ` Peter Zijlstra
  2025-10-28 17:07             ` Tejun Heo
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2025-10-28 11:01 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Andrea Righi, Changwoo Min, linux-kernel, sched-ext,
	Wen-Fang Liu

On Mon, Oct 27, 2025 at 08:17:38AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Mon, Oct 27, 2025 at 07:10:28PM +0100, Peter Zijlstra wrote:
> ...
> > Just for my elucidation and such.. This is when ttwu() happens and the
> > CPU is idle and you dispatch directly to it, expecting it to then go run
> > that task. After which another wakeup/balance movement happens which
> > places/moves a task from a higher priority class to that CPU, such that
> > your initial (ext) task doesn't get to run after all. Right?
> 
> Yes, that's the scenario that I was thinking.

So I've been pondering this a bit, and came up with the below. I'm not
quite happy with it; I meant to share that new queue_mask variable, but
this came out.


---
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2175,10 +2175,14 @@ void wakeup_preempt(struct rq *rq, struc
 {
 	struct task_struct *donor = rq->donor;
 
-	if (p->sched_class == donor->sched_class)
-		donor->sched_class->wakeup_preempt(rq, p, flags);
-	else if (sched_class_above(p->sched_class, donor->sched_class))
+	if (p->sched_class == rq->next_class) {
+		rq->next_class->wakeup_preempt(rq, p, flags);
+
+	} else if (sched_class_above(p->sched_class, rq->next_class)) {
+		rq->next_class->wakeup_preempt(rq, p, flags);
 		resched_curr(rq);
+		rq->next_class = p->sched_class;
+	}
 
 	/*
 	 * A queue event has occurred, and we're going to schedule.  In
@@ -6814,6 +6818,7 @@ static void __sched notrace __schedule(i
 	clear_tsk_need_resched(prev);
 	clear_preempt_need_resched();
 keep_resched:
+	rq->next_class = next->sched_class;
 	rq->last_seen_need_resched_ns = 0;
 
 	is_switch = prev != next;
@@ -8653,6 +8658,8 @@ void __init sched_init(void)
 		rq->rt.rt_runtime = global_rt_runtime();
 		init_tg_rt_entry(&root_task_group, &rq->rt, NULL, i, NULL);
 #endif
+		rq->next_class = &idle_sched_class;
+
 		rq->sd = NULL;
 		rq->rd = NULL;
 		rq->cpu_capacity = SCHED_CAPACITY_SCALE;
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2289,9 +2289,16 @@ static int balance_dl(struct rq *rq, str
  * Only called when both the current and waking task are -deadline
  * tasks.
  */
-static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p,
-				  int flags)
+static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p, int flags)
 {
+	/*
+	 * Can only get preempted by stop-class, and those should be
+	 * few and short lived, doesn't really make sense to push
+	 * anything away for that.
+	 */
+	if (p->sched_class != &dl_sched_class)
+		return;
+
 	if (dl_entity_preempt(&p->dl, &rq->donor->dl)) {
 		resched_curr(rq);
 		return;
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2966,7 +2966,12 @@ static void switched_from_scx(struct rq
 	scx_disable_task(p);
 }
 
-static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p,int wake_flags) {}
+static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p, int wake_flags)
+{
+	if (p->sched_class != &ext_sched_class)
+		switch_class(rq, p);
+}
+
 static void switched_to_scx(struct rq *rq, struct task_struct *p) {}
 
 int scx_check_setscheduler(struct task_struct *p, int policy)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8729,7 +8729,7 @@ static void set_next_buddy(struct sched_
 /*
  * Preempt the current task with a newly woken task if needed:
  */
-static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
+static void wakeup_preempt_fair(struct rq *rq, struct task_struct *p, int wake_flags)
 {
 	struct task_struct *donor = rq->donor;
 	struct sched_entity *se = &donor->se, *pse = &p->se;
@@ -8737,6 +8737,12 @@ static void check_preempt_wakeup_fair(st
 	int cse_is_idle, pse_is_idle;
 	bool do_preempt_short = false;
 
+	/*
+	 * XXX Getting preempted by higher class, try and find idle CPU?
+	 */
+	if (p->sched_class != &fair_sched_class)
+		return;
+
 	if (unlikely(se == pse))
 		return;
 
@@ -13640,7 +13646,7 @@ DEFINE_SCHED_CLASS(fair) = {
 	.yield_task		= yield_task_fair,
 	.yield_to_task		= yield_to_task_fair,
 
-	.wakeup_preempt		= check_preempt_wakeup_fair,
+	.wakeup_preempt		= wakeup_preempt_fair,
 
 	.pick_task		= pick_task_fair,
 	.pick_next_task		= pick_next_task_fair,
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1615,6 +1615,12 @@ static void wakeup_preempt_rt(struct rq
 {
 	struct task_struct *donor = rq->donor;
 
+	/*
+	 * XXX If we're preempted by DL, queue a push?
+	 */
+	if (p->sched_class != &rt_sched_class)
+		return;
+
 	if (p->prio < donor->prio) {
 		resched_curr(rq);
 		return;
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1179,6 +1179,7 @@ struct rq {
 	struct sched_dl_entity	*dl_server;
 	struct task_struct	*idle;
 	struct task_struct	*stop;
+	const struct sched_class *next_class;
 	unsigned long		next_balance;
 	struct mm_struct	*prev_mm;
 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/3] sched_ext: Allow scx_bpf_reenqueue_local() to be called from anywhere
  2025-10-28 11:01           ` Peter Zijlstra
@ 2025-10-28 17:07             ` Tejun Heo
  0 siblings, 0 replies; 28+ messages in thread
From: Tejun Heo @ 2025-10-28 17:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: David Vernet, Andrea Righi, Changwoo Min, linux-kernel, sched-ext,
	Wen-Fang Liu

Hello, Peter.

On Tue, Oct 28, 2025 at 12:01:53PM +0100, Peter Zijlstra wrote:
> On Mon, Oct 27, 2025 at 08:17:38AM -1000, Tejun Heo wrote:
> > Hello,
> > 
> > On Mon, Oct 27, 2025 at 07:10:28PM +0100, Peter Zijlstra wrote:
> > ...
> > > Just for my elucidation and such.. This is when ttwu() happens and the
> > > CPU is idle and you dispatch directly to it, expecting it to then go run
> > > that task. After which another wakeup/balance movement happens which
> > > places/moves a task from a higher priority class to that CPU, such that
> > > your initial (ext) task doesn't get to run after all. Right?
> > 
> > Yes, that's the scenario that I was thinking.
> 
> So I've been pondering this a bit, and came up with the below. I'm not
> quite happy with it, I meant to share that new queue_mask variable, but
> this came out.

Yeah, something like this that creates global state tracking from wakeup to
dispatch would work. However, from the sched_ext POV, I think the TP route is
probably better, at least for now. Once reenqueue_local is allowed from
anywhere, which is useful no matter what, there just aren't good reasons to
maintain ops.cpu_acquire/release(). It doesn't enable anything more or make
things noticeably more performant or easier. It's always nice to be able to
reduce API surface, after all.

Thanks.

-- 
tejun


* Re: [PATCH v2 3/3] sched_ext: Allow scx_bpf_reenqueue_local() to be called from anywhere
  2025-10-27 18:19   ` [PATCH v2 " Tejun Heo
@ 2025-10-29 10:45     ` Peter Zijlstra
  2025-10-29 15:11       ` Tejun Heo
  2025-10-29 15:49     ` [PATCH v3 " Tejun Heo
  1 sibling, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2025-10-29 10:45 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Andrea Righi, Changwoo Min, linux-kernel, sched-ext,
	Wen-Fang Liu

On Mon, Oct 27, 2025 at 08:19:40AM -1000, Tejun Heo wrote:
> The ops.cpu_acquire/release() callbacks are broken - they miss events under
> multiple conditions and can't be fixed without adding global sched core hooks
> that sched maintainers don't want.

I think I'll object to that statement just a wee bit. I think we can
make it work -- just not with the things proposed earlier.

Anyway, if you want to reduce the sched_ext interface and remove
cpu_acquire/release entirely, this is fine too.

I might still do that wakeup_preempt() change if I can merge / replace
the queue_mask RETRY_TASK logic -- I have vague memories the RT people
also wanted something like this a while ago and it isn't that big of a
change.

> There are two distinct task dispatch gaps that can cause cpu_released flag
> desynchronization:
> 
> 1. balance-to-pick_task gap: This is what was originally reported. balance_scx()
>    can enqueue a task, but during consume_remote_task() when the rq lock is
>    released, a higher priority task can be enqueued and ultimately picked while
>    cpu_released remains false. This gap is closeable via RETRY_TASK handling.
> 
> 2. ttwu-to-pick_task gap: ttwu() can directly dispatch a task to a CPU's local
>    DSQ. By the time the sched path runs on the target CPU, higher class tasks may
>    already be queued. In such cases, nothing on sched_ext side will be invoked,
>    and the only solution would be a hook invoked regardless of sched class, which
>    isn't desirable.
> 
> Rather than adding invasive core hooks, BPF schedulers can use generic BPF
> mechanisms like tracepoints. From SCX scheduler's perspective, this is congruent
> with other mechanisms it already uses and doesn't add further friction.
> 
> The main use case for cpu_release() was calling scx_bpf_reenqueue_local() when
> a CPU gets preempted by a higher priority scheduling class. However, the old
> scx_bpf_reenqueue_local() could only be called from cpu_release() context.
> 
> Add a new version of scx_bpf_reenqueue_local() that can be called from any
> context by deferring the actual re-enqueue operation. This eliminates the need
> for cpu_acquire/release() ops entirely. Schedulers can now use standard BPF
> mechanisms like the sched_switch tracepoint to detect and handle CPU preemption.
> 
> Update scx_qmap to demonstrate the new approach using sched_switch instead of
> cpu_release, with compat support for older kernels. Mark cpu_acquire/release()
> as deprecated. The old scx_bpf_reenqueue_local() variant will be removed in
> v6.23.
> 
> Reported-by: Wen-Fang Liu <liuwenfang@honor.com>
> Link: https://lore.kernel.org/all/8d64c74118c6440f81bcf5a4ac6b9f00@honor.com/
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Tejun Heo <tj@kernel.org>

Yeah, this Changelog is much better, thanks!

6.23 is a long time, can't we throw this out quicker? This thing wasn't
supposed to be an ABI after all. One release cycle seems fine to me ;-)


* Re: [PATCH v2 3/3] sched_ext: Allow scx_bpf_reenqueue_local() to be called from anywhere
  2025-10-29 10:45     ` Peter Zijlstra
@ 2025-10-29 15:11       ` Tejun Heo
  0 siblings, 0 replies; 28+ messages in thread
From: Tejun Heo @ 2025-10-29 15:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: David Vernet, Andrea Righi, Changwoo Min, linux-kernel, sched-ext,
	Wen-Fang Liu

Hello,

On Wed, Oct 29, 2025 at 11:45:46AM +0100, Peter Zijlstra wrote:
> On Mon, Oct 27, 2025 at 08:19:40AM -1000, Tejun Heo wrote:
> > The ops.cpu_acquire/release() callbacks are broken - they miss events under
> > multiple conditions and can't be fixed without adding global sched core hooks
> > that sched maintainers don't want.
> 
> I think I'll object to that statement just a wee bit. I think we can
> make it work -- just not with the things proposed earlier.

Sure, I'll massage it a bit before committing.

> Anyway, if you want to reduce the sched_ext interface and remove
> cpu_acquire/release entirely, this is fine too.
> 
> I might still do that wakeup_preempt() change if I can merge / replace
> the queue_mask RETRY_TASK logic -- I have vague memories the RT people
> also wanted something like this a while ago and it isn't that big of a
> change.

Yeah, being able to create some kind of interlocking from ttwu to pick_task
is something generally useful, I think, even if I don't use it right now.

> 6.23 is a long time, can't we throw this out quicker? This thing wasn't
> supposed to be an ABI after all. One release cycle seems fine to me ;-)

We've been discussing compat policy lately, and I think where we landed was
maintaining compatibility, when reasonably possible, over one LTS release
plus a couple of non-LTS releases, which comes out to ~1.5 to 2 years. That
seems to give most people enough sliding room without choking us with too
much compat overhead. It's a bit more work, but surprisingly not that painful
with all the BPF compat features, and it seems to strike a reasonable
balance.

Thanks.

-- 
tejun


* Re: [PATCHSET sched_ext/for-6.19] sched_ext: Deprecate ops.cpu_acquire/release()
  2025-10-25  0:18 [PATCHSET sched_ext/for-6.19] sched_ext: Deprecate ops.cpu_acquire/release() Tejun Heo
                   ` (2 preceding siblings ...)
  2025-10-25  0:18 ` [PATCH 3/3] sched_ext: Allow scx_bpf_reenqueue_local() to be called from anywhere Tejun Heo
@ 2025-10-29 15:31 ` Tejun Heo
  3 siblings, 0 replies; 28+ messages in thread
From: Tejun Heo @ 2025-10-29 15:31 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: linux-kernel, sched-ext, Peter Zijlstra

> Tejun Heo (3):
>   sched_ext: Split schedule_deferred() into locked and unlocked variants
>   sched_ext: Factor out reenq_local() from scx_bpf_reenqueue_local()
>   sched_ext: Allow scx_bpf_reenqueue_local() to be called from anywhere

Applied 1-3 to sched_ext/for-6.19 with minor description adjustment in #3
responding to Peter's feedback.

Thanks.
--
tejun


* [PATCH v3 3/3] sched_ext: Allow scx_bpf_reenqueue_local() to be called from anywhere
  2025-10-27 18:19   ` [PATCH v2 " Tejun Heo
  2025-10-29 10:45     ` Peter Zijlstra
@ 2025-10-29 15:49     ` Tejun Heo
  2025-11-27 10:39       ` Kuba Piecuch
  2025-12-11 14:24       ` Kuba Piecuch
  1 sibling, 2 replies; 28+ messages in thread
From: Tejun Heo @ 2025-10-29 15:49 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: linux-kernel, sched-ext, Peter Zijlstra, Wen-Fang Liu

The ops.cpu_acquire/release() callbacks miss events under multiple conditions.
There are two distinct task dispatch gaps that can cause cpu_released flag
desynchronization:

1. balance-to-pick_task gap: This is what was originally reported. balance_scx()
   can enqueue a task, but during consume_remote_task() when the rq lock is
   released, a higher priority task can be enqueued and ultimately picked while
   cpu_released remains false. This gap is closeable via RETRY_TASK handling.

2. ttwu-to-pick_task gap: ttwu() can directly dispatch a task to a CPU's local
   DSQ. By the time the sched path runs on the target CPU, higher class tasks may
   already be queued. In such cases, nothing on sched_ext side will be invoked,
   and the only solution would be a hook invoked regardless of sched class, which
   isn't desirable.

Rather than adding invasive core hooks, BPF schedulers can use generic BPF
mechanisms like tracepoints. From an SCX scheduler's perspective, this is
congruent with other mechanisms it already uses and doesn't add further friction.

The main use case for cpu_release() was calling scx_bpf_reenqueue_local() when
a CPU gets preempted by a higher priority scheduling class. However, the old
scx_bpf_reenqueue_local() could only be called from cpu_release() context.

Add a new version of scx_bpf_reenqueue_local() that can be called from any
context by deferring the actual re-enqueue operation. This eliminates the need
for cpu_acquire/release() ops entirely. Schedulers can now use standard BPF
mechanisms like the sched_switch tracepoint to detect and handle CPU preemption.

Update scx_qmap to demonstrate the new approach using sched_switch instead of
cpu_release, with compat support for older kernels. Mark cpu_acquire/release()
as deprecated. The old scx_bpf_reenqueue_local() variant will be removed in
v6.23.

Reported-by: Wen-Fang Liu <liuwenfang@honor.com>
Link: https://lore.kernel.org/all/8d64c74118c6440f81bcf5a4ac6b9f00@honor.com/
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
v2: Description updated w/ justifications on taking this approach instead of
    fixing ops.cpu_acquire/release().

v3: Dropped "can't be fixed without adding global sched core hooks that sched
    maintainers don't want" from description per Peter's feedback.

 kernel/sched/ext.c                       | 31 +++++++++++++++++++
 kernel/sched/sched.h                     |  1 +
 tools/sched_ext/include/scx/common.bpf.h |  1 -
 tools/sched_ext/include/scx/compat.bpf.h | 23 ++++++++++++++
 tools/sched_ext/scx_qmap.bpf.c           | 38 +++++++++++++++++-------
 5 files changed, 83 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index d13ce92c3f01..d1ef5bda95ae 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -147,6 +147,7 @@ static struct kset *scx_kset;
 #include <trace/events/sched_ext.h>
 
 static void process_ddsp_deferred_locals(struct rq *rq);
+static u32 reenq_local(struct rq *rq);
 static void scx_kick_cpu(struct scx_sched *sch, s32 cpu, u64 flags);
 static void scx_vexit(struct scx_sched *sch, enum scx_exit_kind kind,
 		      s64 exit_code, const char *fmt, va_list args);
@@ -755,6 +756,11 @@ static int ops_sanitize_err(struct scx_sched *sch, const char *ops_name, s32 err
 static void run_deferred(struct rq *rq)
 {
 	process_ddsp_deferred_locals(rq);
+
+	if (local_read(&rq->scx.reenq_local_deferred)) {
+		local_set(&rq->scx.reenq_local_deferred, 0);
+		reenq_local(rq);
+	}
 }
 
 static void deferred_bal_cb_workfn(struct rq *rq)
@@ -4569,6 +4575,9 @@ static int validate_ops(struct scx_sched *sch, const struct sched_ext_ops *ops)
 	if (ops->flags & SCX_OPS_HAS_CGROUP_WEIGHT)
 		pr_warn("SCX_OPS_HAS_CGROUP_WEIGHT is deprecated and a noop\n");
 
+	if (ops->cpu_acquire || ops->cpu_release)
+		pr_warn("ops->cpu_acquire/release() are deprecated, use sched_switch TP instead\n");
+
 	return 0;
 }
 
@@ -5929,6 +5938,9 @@ __bpf_kfunc_start_defs();
  * Iterate over all of the tasks currently enqueued on the local DSQ of the
  * caller's CPU, and re-enqueue them in the BPF scheduler. Returns the number of
  * processed tasks. Can only be called from ops.cpu_release().
+ *
+ * COMPAT: Will be removed in v6.23 along with the ___v2 suffix on the void
+ * returning variant that can be called from anywhere.
  */
 __bpf_kfunc u32 scx_bpf_reenqueue_local(void)
 {
@@ -6487,6 +6499,24 @@ __bpf_kfunc void scx_bpf_dump_bstr(char *fmt, unsigned long long *data,
 		ops_dump_flush();
 }
 
+/**
+ * scx_bpf_reenqueue_local - Re-enqueue tasks on a local DSQ
+ *
+ * Iterate over all of the tasks currently enqueued on the local DSQ of the
+ * caller's CPU, and re-enqueue them in the BPF scheduler. Can be called from
+ * anywhere.
+ */
+__bpf_kfunc void scx_bpf_reenqueue_local___v2(void)
+{
+	struct rq *rq;
+
+	guard(preempt)();
+
+	rq = this_rq();
+	local_set(&rq->scx.reenq_local_deferred, 1);
+	schedule_deferred(rq);
+}
+
 /**
  * scx_bpf_cpuperf_cap - Query the maximum relative capacity of a CPU
  * @cpu: CPU of interest
@@ -6900,6 +6930,7 @@ BTF_ID_FLAGS(func, bpf_iter_scx_dsq_destroy, KF_ITER_DESTROY)
 BTF_ID_FLAGS(func, scx_bpf_exit_bstr, KF_TRUSTED_ARGS)
 BTF_ID_FLAGS(func, scx_bpf_error_bstr, KF_TRUSTED_ARGS)
 BTF_ID_FLAGS(func, scx_bpf_dump_bstr, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, scx_bpf_reenqueue_local___v2)
 BTF_ID_FLAGS(func, scx_bpf_cpuperf_cap)
 BTF_ID_FLAGS(func, scx_bpf_cpuperf_cur)
 BTF_ID_FLAGS(func, scx_bpf_cpuperf_set)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 909e94794f8a..27aae2a298f8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -804,6 +804,7 @@ struct scx_rq {
 	cpumask_var_t		cpus_to_preempt;
 	cpumask_var_t		cpus_to_wait;
 	unsigned long		kick_sync;
+	local_t			reenq_local_deferred;
 	struct balance_callback	deferred_bal_cb;
 	struct irq_work		deferred_irq_work;
 	struct irq_work		kick_cpus_irq_work;
diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h
index e65b1eb668ea..82a798c3fb22 100644
--- a/tools/sched_ext/include/scx/common.bpf.h
+++ b/tools/sched_ext/include/scx/common.bpf.h
@@ -70,7 +70,6 @@ void scx_bpf_dsq_move_set_slice(struct bpf_iter_scx_dsq *it__iter, u64 slice) __
 void scx_bpf_dsq_move_set_vtime(struct bpf_iter_scx_dsq *it__iter, u64 vtime) __ksym __weak;
 bool scx_bpf_dsq_move(struct bpf_iter_scx_dsq *it__iter, struct task_struct *p, u64 dsq_id, u64 enq_flags) __ksym __weak;
 bool scx_bpf_dsq_move_vtime(struct bpf_iter_scx_dsq *it__iter, struct task_struct *p, u64 dsq_id, u64 enq_flags) __ksym __weak;
-u32 scx_bpf_reenqueue_local(void) __ksym;
 void scx_bpf_kick_cpu(s32 cpu, u64 flags) __ksym;
 s32 scx_bpf_dsq_nr_queued(u64 dsq_id) __ksym;
 void scx_bpf_destroy_dsq(u64 dsq_id) __ksym;
diff --git a/tools/sched_ext/include/scx/compat.bpf.h b/tools/sched_ext/include/scx/compat.bpf.h
index 26bead92fa04..0bfb8abe2a46 100644
--- a/tools/sched_ext/include/scx/compat.bpf.h
+++ b/tools/sched_ext/include/scx/compat.bpf.h
@@ -278,6 +278,29 @@ static inline void scx_bpf_task_set_dsq_vtime(struct task_struct *p, u64 vtime)
 		p->scx.dsq_vtime = vtime;
 }
 
+/*
+ * v6.19: The new void variant can be called from anywhere while the older v1
+ * variant can only be called from ops.cpu_release(). The double ___ prefixes on
+ * the v2 variant need to be removed once libbpf is updated to ignore ___ prefix
+ * on kernel side. Drop the wrapper and move the decl to common.bpf.h after
+ * v6.22.
+ */
+u32 scx_bpf_reenqueue_local___v1(void) __ksym __weak;
+void scx_bpf_reenqueue_local___v2___compat(void) __ksym __weak;
+
+static inline bool __COMPAT_scx_bpf_reenqueue_local_from_anywhere(void)
+{
+	return bpf_ksym_exists(scx_bpf_reenqueue_local___v2___compat);
+}
+
+static inline void scx_bpf_reenqueue_local(void)
+{
+	if (__COMPAT_scx_bpf_reenqueue_local_from_anywhere())
+		scx_bpf_reenqueue_local___v2___compat();
+	else
+		scx_bpf_reenqueue_local___v1();
+}
+
 /*
  * Define sched_ext_ops. This may be expanded to define multiple variants for
  * backward compatibility. See compat.h::SCX_OPS_LOAD/ATTACH().
diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
index c67dac78a4c6..df21fad0c438 100644
--- a/tools/sched_ext/scx_qmap.bpf.c
+++ b/tools/sched_ext/scx_qmap.bpf.c
@@ -202,6 +202,9 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
 	void *ring;
 	s32 cpu;
 
+	if (enq_flags & SCX_ENQ_REENQ)
+		__sync_fetch_and_add(&nr_reenqueued, 1);
+
 	if (p->flags & PF_KTHREAD) {
 		if (stall_kernel_nth && !(++kernel_cnt % stall_kernel_nth))
 			return;
@@ -529,20 +532,35 @@ bool BPF_STRUCT_OPS(qmap_core_sched_before,
 	return task_qdist(a) > task_qdist(b);
 }
 
-void BPF_STRUCT_OPS(qmap_cpu_release, s32 cpu, struct scx_cpu_release_args *args)
+SEC("tp_btf/sched_switch")
+int BPF_PROG(qmap_sched_switch, bool preempt, struct task_struct *prev,
+	     struct task_struct *next, unsigned long prev_state)
 {
-	u32 cnt;
+	if (!__COMPAT_scx_bpf_reenqueue_local_from_anywhere())
+		return 0;
 
 	/*
-	 * Called when @cpu is taken by a higher priority scheduling class. This
-	 * makes @cpu no longer available for executing sched_ext tasks. As we
-	 * don't want the tasks in @cpu's local dsq to sit there until @cpu
-	 * becomes available again, re-enqueue them into the global dsq. See
-	 * %SCX_ENQ_REENQ handling in qmap_enqueue().
+	 * If @cpu is taken by a higher priority scheduling class, it is no
+	 * longer available for executing sched_ext tasks. As we don't want the
+	 * tasks in @cpu's local dsq to sit there until @cpu becomes available
+	 * again, re-enqueue them into the global dsq. See %SCX_ENQ_REENQ
+	 * handling in qmap_enqueue().
 	 */
-	cnt = scx_bpf_reenqueue_local();
-	if (cnt)
-		__sync_fetch_and_add(&nr_reenqueued, cnt);
+	switch (next->policy) {
+	case 1: /* SCHED_FIFO */
+	case 2: /* SCHED_RR */
+	case 6: /* SCHED_DEADLINE */
+		scx_bpf_reenqueue_local();
+	}
+
+	return 0;
+}
+
+void BPF_STRUCT_OPS(qmap_cpu_release, s32 cpu, struct scx_cpu_release_args *args)
+{
+	/* see qmap_sched_switch() to learn how to do this on newer kernels */
+	if (!__COMPAT_scx_bpf_reenqueue_local_from_anywhere())
+		scx_bpf_reenqueue_local();
 }
 
 s32 BPF_STRUCT_OPS(qmap_init_task, struct task_struct *p,
-- 
2.51.1



* Re: [PATCH v3 3/3] sched_ext: Allow scx_bpf_reenqueue_local() to be called from anywhere
  2025-10-29 15:49     ` [PATCH v3 " Tejun Heo
@ 2025-11-27 10:39       ` Kuba Piecuch
  2025-12-02 23:05         ` Tejun Heo
  2025-12-11 14:24       ` Kuba Piecuch
  1 sibling, 1 reply; 28+ messages in thread
From: Kuba Piecuch @ 2025-11-27 10:39 UTC (permalink / raw)
  To: Tejun Heo, David Vernet, Andrea Righi, Changwoo Min
  Cc: linux-kernel, sched-ext, Peter Zijlstra, Wen-Fang Liu

On Wed Oct 29, 2025 at 3:49 PM UTC, Tejun Heo wrote:
> Schedulers can now use standard BPF mechanisms like the sched_switch tracepoint
> to detect and handle CPU preemption.

Correct me if I'm wrong, but I think using the sched_switch tracepoint still
leaves us with no preemption notification in the following scenario:

1. An RT task is running on the CPU and blocks.

2. pick_task_scx() briefly drops the rq lock in balance_one() and the RT task
   is woken up.

3. SCX sees the enqueue and returns RETRY_TASK from pick_task_scx().

4. The RT task is picked.

5. Since prev == next, we don't enter the is_switch branch in __schedule()
   and the sched_switch tracepoint isn't reached.

The BPF scheduler could hook into trace_sched_exit_tp() to work around this,
but that tracepoint seems to be for testing and debugging purposes only.



* Re: [PATCH v3 3/3] sched_ext: Allow scx_bpf_reenqueue_local() to be called from anywhere
  2025-11-27 10:39       ` Kuba Piecuch
@ 2025-12-02 23:05         ` Tejun Heo
  0 siblings, 0 replies; 28+ messages in thread
From: Tejun Heo @ 2025-12-02 23:05 UTC (permalink / raw)
  To: Kuba Piecuch
  Cc: David Vernet, Andrea Righi, Changwoo Min, linux-kernel, sched-ext,
	Peter Zijlstra, Wen-Fang Liu

Hello,

Sorry about the late response.

On Thu, Nov 27, 2025 at 10:39:35AM +0000, Kuba Piecuch wrote:
> On Wed Oct 29, 2025 at 3:49 PM UTC, Tejun Heo wrote:
> > Schedulers can now use standard BPF mechanisms like the sched_switch tracepoint
> > to detect and handle CPU preemption.
> 
> Correct me if I'm wrong, but I think using the sched_switch tracepoint still
> leaves us with no preemption notification in the following scenario:
> 
> 1. An RT task is running on the CPU and blocks.
> 
> 2. pick_task_scx() briefly drops the rq lock in balance_one() and the RT task
>    is woken up.
> 
> 3. SCX sees the enqueue and returns RETRY_TASK from pick_task_scx().
> 
> 4. The RT task is picked.
> 
> 5. Since prev == next, we don't enter the is_switch branch in __schedule()
>    and the sched_switch tracepoint isn't reached.

You're right.

> The BPF scheduler could hook into trace_sched_exit_tp() to work around this,
> but that tracepoint seems to be for testing and debugging purposes only.

Yes, that looks usable for now. Maybe we can add another hook point from the
sched_ext side after the proposed core change. Will think more about that.

Thanks.

-- 
tejun


* Re: [PATCH v3 3/3] sched_ext: Allow scx_bpf_reenqueue_local() to be called from anywhere
  2025-10-29 15:49     ` [PATCH v3 " Tejun Heo
  2025-11-27 10:39       ` Kuba Piecuch
@ 2025-12-11 14:24       ` Kuba Piecuch
  2025-12-11 16:17         ` Tejun Heo
  1 sibling, 1 reply; 28+ messages in thread
From: Kuba Piecuch @ 2025-12-11 14:24 UTC (permalink / raw)
  To: Tejun Heo, David Vernet, Andrea Righi, Changwoo Min
  Cc: linux-kernel, sched-ext, Peter Zijlstra, Wen-Fang Liu

Hi Tejun,

I think with the proposed implementation, using scx_bpf_reenqueue_local()
from arbitrary contexts can have highly non-intuitive effects.

For example, consider ops.enqueue() for a hypothetical userspace scheduler:

void BPF_STRUCT_OPS(example_enqueue, struct task_struct *p, u64 enq_flags)
{
	if (p->pid == user_scheduler_pid()) {
		/*
		 * Remove existing tasks from the local DSQ so that
		 * the userspace scheduler can schedule different tasks
		 * before them.
		 */
		scx_bpf_reenqueue_local();
		/*
		 * Dispatch the user scheduler directly to the local DSQ.
		 */
		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
	}
	...
}

I'm not arguing this is the way it should be written, but AFAIK it's perfectly
legal.

Since we're doing a direct dispatch, the user scheduler task will be
inserted into the dispatch queue in enable_task_scx(), without dropping the rq
lock between example_enqueue() and the insertion, which means reenq_local()
will run afterwards (since it's deferred using irq_work), removing all tasks
from the DSQ, including the userspace scheduler.

A similar problem arises even if we don't do direct dispatch and drop the rq
lock after example_enqueue(): since dispatching and reenq_local() are deferred
using different irq_work entries, and irq_work_run() processes entries from
newest to oldest, dispatching will be handled before reenq_local(), yielding
the same result.
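
To make the ordering concrete, here is a toy userspace model of the
newest-to-oldest processing (purely illustrative; none of these names
correspond to actual kernel interfaces):

```python
# Toy model: deferred work items are pushed onto a LIFO list and run
# newest-to-oldest, like irq_work entries on the per-CPU list. A dispatch
# deferred *after* the reenqueue request therefore runs *before* it, and
# the sweep removes the freshly dispatched task as well.

local_dsq = []      # tasks sitting on the CPU's local DSQ
deferred = []       # LIFO work list; newest entry is processed first
reenqueued = []     # tasks swept back to the BPF scheduler

def defer(fn):
    deferred.insert(0, fn)          # newest entry goes to the head

def run_deferred():
    while deferred:
        deferred.pop(0)()           # newest first

def reenq_local():
    reenqueued.extend(local_dsq)
    local_dsq.clear()

# The reenqueue is requested first, but only deferred ...
defer(reenq_local)
# ... then the direct dispatch of the user scheduler task is deferred too:
defer(lambda: local_dsq.append("user_sched"))

run_deferred()
# local_dsq is now empty; "user_sched" ended up in reenqueued
```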

The user may be unaware of this behavior (it's not mentioned anywhere) and
expect the reenqueue to happen before dispatching the new task.

I think at the very least we should make users aware of this in the comment
for scx_bpf_reenqueue_local___v2().

Best,
Kuba


* Re: [PATCH v3 3/3] sched_ext: Allow scx_bpf_reenqueue_local() to be called from anywhere
  2025-12-11 14:24       ` Kuba Piecuch
@ 2025-12-11 16:17         ` Tejun Heo
  2025-12-11 16:20           ` Tejun Heo
  0 siblings, 1 reply; 28+ messages in thread
From: Tejun Heo @ 2025-12-11 16:17 UTC (permalink / raw)
  To: Kuba Piecuch
  Cc: David Vernet, Andrea Righi, Changwoo Min, linux-kernel, sched-ext,
	Peter Zijlstra, Wen-Fang Liu

Hello, Kuba.

On Thu, Dec 11, 2025 at 02:24:04PM +0000, Kuba Piecuch wrote:
> Since we're doing a direct dispatch, the user scheduler task will be
> inserted into the dispatch queue in enable_task_scx(), without dropping the rq
> lock between example_enqueue() and the insertion, which means reenq_local()
> will run afterwards (since it's deferred using irq_work), removing all tasks
> from the DSQ, including the userspace scheduler.
> 
> A similar problem arises even if we don't do direct dispatch and drop the rq
> lock after example_enqueue(): since dispatching and reenq_local() are deferred
> using different irq_work entries, and irq_work_run() processes entries from
> newest to oldest, dispatching will be handled before reenq_local(), yielding
> the same result.

Oh yeah, the asynchronicity can become pretty confusing.

> The user may be unaware of this behavior (it's not mentioned anywhere) and
> expect the reenqueue to happen before dispatching the new task.
> 
> I think at the very least we should make users aware of this in the comment
> for scx_bpf_reenqueue_local___v2().

Documentation is always helpful but I wonder whether this can be improved by
the reenqueue function capturing the dsq seq number and re-enqueueing only
the ones that were enqueued before the reenqueue was called.

Thanks.

-- 
tejun


* Re: [PATCH v3 3/3] sched_ext: Allow scx_bpf_reenqueue_local() to be called from anywhere
  2025-12-11 16:17         ` Tejun Heo
@ 2025-12-11 16:20           ` Tejun Heo
  2025-12-13  1:16             ` Andrea Righi
  0 siblings, 1 reply; 28+ messages in thread
From: Tejun Heo @ 2025-12-11 16:20 UTC (permalink / raw)
  To: Kuba Piecuch
  Cc: David Vernet, Andrea Righi, Changwoo Min, linux-kernel, sched-ext,
	Peter Zijlstra, Wen-Fang Liu

On Thu, Dec 11, 2025 at 06:17:03AM -1000, Tejun Heo wrote:
> Hello, Kuba.
> 
> On Thu, Dec 11, 2025 at 02:24:04PM +0000, Kuba Piecuch wrote:
> > Since we're doing a direct dispatch, the user scheduler task will be
> > inserted into the dispatch queue in enable_task_scx(), without dropping the rq
> > lock between example_enqueue() and the insertion, which means reenq_local()
> > will run afterwards (since it's deferred using irq_work), removing all tasks
> > from the DSQ, including the userspace scheduler.
> > 
> > A similar problem arises even if we don't do direct dispatch and drop the rq
> > lock after example_enqueue(): since dispatching and reenq_local() are deferred
> > using different irq_work entries, and irq_work_run() processes entries from
> > newest to oldest, dispatching will be handled before reenq_local(), yielding
> > the same result.
> 
> Oh yeah, the asynchronicity can become pretty confusing.
> 
> > The user may be unaware of this behavior (it's not mentioned anywhere) and
> > expect the reenqueue to happen before dispatching the new task.
> > 
> > I think at the very least we should make users aware of this in the comment
> > for scx_bpf_reenqueue_local___v2().
> 
> Documentation is always helpful but I wonder whether this can be improved by
> the reenqueue function capturing the dsq seq number and re-enqueueing only
> the ones that were enqueued before the reenqueue was called.

That doesn't fix the ordering problem in the other direction, but the
reordering in the other direction seems inherently less useful at least.
Maybe this can also be solved with seq, i.e. make enqueue record the current
seq, mark the DSQ with the latest reenqueue seq, and at enqueue commit time,
if the captured seq is already reenqueued, trigger reenqueue.
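
Something like this toy userspace sketch, where every name is hypothetical
and nothing corresponds to actual kernel fields:

```python
# Sketch of the seq idea: each task records the DSQ sequence number at
# enqueue time, and a deferred reenqueue only sweeps tasks enqueued before
# the reenqueue was requested. All names here are made up for illustration.

local_dsq = []          # list of (task, enq_seq)
reenqueued = []
dsq_seq = 0             # bumped on every enqueue
captured_seq = None     # seq captured when a reenqueue is requested

def dispatch(task):
    global dsq_seq
    dsq_seq += 1
    local_dsq.append((task, dsq_seq))

def request_reenq():
    global captured_seq
    captured_seq = dsq_seq          # sweep only up to this point

def run_deferred_reenq():
    global captured_seq
    if captured_seq is None:
        return
    reenqueued.extend(t for t, s in local_dsq if s <= captured_seq)
    local_dsq[:] = [(t, s) for t, s in local_dsq if s > captured_seq]
    captured_seq = None

dispatch("old_task")
request_reenq()         # scx_bpf_reenqueue_local() called here
dispatch("user_sched")  # direct dispatch after the reenqueue request
run_deferred_reenq()
# old_task is swept, user_sched stays on the local DSQ
```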

Thanks.

-- 
tejun


* Re: [PATCH v3 3/3] sched_ext: Allow scx_bpf_reenqueue_local() to be called from anywhere
  2025-12-11 16:20           ` Tejun Heo
@ 2025-12-13  1:16             ` Andrea Righi
  2025-12-13  1:18               ` Tejun Heo
  0 siblings, 1 reply; 28+ messages in thread
From: Andrea Righi @ 2025-12-13  1:16 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Kuba Piecuch, David Vernet, Changwoo Min, linux-kernel, sched-ext,
	Peter Zijlstra, Wen-Fang Liu

On Thu, Dec 11, 2025 at 06:20:22AM -1000, Tejun Heo wrote:
> On Thu, Dec 11, 2025 at 06:17:03AM -1000, Tejun Heo wrote:
> > Hello, Kuba.
> > 
> > On Thu, Dec 11, 2025 at 02:24:04PM +0000, Kuba Piecuch wrote:
> > > Since we're doing a direct dispatch, the user scheduler task will be
> > > inserted into the dispatch queue in enable_task_scx(), without dropping the rq
> > > lock between example_enqueue() and the insertion, which means reenq_local()
> > > will run afterwards (since it's deferred using irq_work), removing all tasks
> > > from the DSQ, including the userspace scheduler.
> > > 
> > > A similar problem arises even if we don't do direct dispatch and drop the rq
> > > lock after example_enqueue(): since dispatching and reenq_local() are deferred
> > > using different irq_work entries, and irq_work_run() processes entries from
> > > newest to oldest, dispatching will be handled before reenq_local(), yielding
> > > the same result.
> > 
> > Oh yeah, the asynchronicity can become pretty confusing.
> > 
> > > The user may be unaware of this behavior (it's not mentioned anywhere) and
> > > expect the reenqueue to happen before dispatching the new task.
> > > 
> > > I think at the very least we should make users aware of this in the comment
> > > for scx_bpf_reenqueue_local___v2().
> > 
> > Documentation is always helpful but I wonder whether this can be improved by
> > the reenqueue function capturing the dsq seq number and re-enqueueing only
> > the ones that were enqueued before the reenqueue was called.
> 
> That doesn't fix the ordering problem in the other direction, but the
> reordering in the other direction seems inherently less useful at least.
> Maybe this can also be solved with seq, i.e. make enqueue record the current
> seq, mark the DSQ with the latest reenqueue seq, and at enqueue commit time,
> if the captured seq is already reenqueued, trigger reenqueue.

How about making it even more explicit by renaming the kfunc to something
like scx_bpf_async_reenqueue_local() (and documenting it)?

Thanks,
-Andrea


* Re: [PATCH v3 3/3] sched_ext: Allow scx_bpf_reenqueue_local() to be called from anywhere
  2025-12-13  1:16             ` Andrea Righi
@ 2025-12-13  1:18               ` Tejun Heo
  0 siblings, 0 replies; 28+ messages in thread
From: Tejun Heo @ 2025-12-13  1:18 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Kuba Piecuch, David Vernet, Changwoo Min, linux-kernel, sched-ext,
	Peter Zijlstra, Wen-Fang Liu

On Sat, Dec 13, 2025 at 02:16:44AM +0100, Andrea Righi wrote:
> > That doesn't fix the ordering problem in the other direction, but the
> > reordering in the other direction seems inherently less useful at least.
> > Maybe this can also be solved with seq, i.e. make enqueue record the current
> > seq, mark the DSQ with the latest reenqueue seq, and at enqueue commit time,
> > if the captured seq is already reenqueued, trigger reenqueue.
> 
> How about making it even more explicit by renaming the kfunc to something
> like scx_bpf_async_reenqueue_local() (and documenting it)?

I don't know. It's an implementation detail that can change in the future
and we don't want to be adding _async to e.g. scx_bpf_dsq_insert() too.

Thanks.

-- 
tejun


end of thread, other threads:[~2025-12-13  1:18 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-10-25  0:18 [PATCHSET sched_ext/for-6.19] sched_ext: Deprecate ops.cpu_acquire/release() Tejun Heo
2025-10-25  0:18 ` [PATCH 1/3] sched_ext: Split schedule_deferred() into locked and unlocked variants Tejun Heo
2025-10-25 23:17   ` Emil Tsalapatis
2025-10-25  0:18 ` [PATCH 2/3] sched_ext: Factor out reenq_local() from scx_bpf_reenqueue_local() Tejun Heo
2025-10-25 23:19   ` Emil Tsalapatis
2025-10-25  0:18 ` [PATCH 3/3] sched_ext: Allow scx_bpf_reenqueue_local() to be called from anywhere Tejun Heo
2025-10-25 23:21   ` Emil Tsalapatis
2025-10-27  9:18   ` Peter Zijlstra
2025-10-27 16:00     ` Tejun Heo
2025-10-27 17:49       ` Peter Zijlstra
2025-10-27 18:05         ` Tejun Heo
2025-10-27 18:07           ` Peter Zijlstra
2025-10-27 18:10       ` Peter Zijlstra
2025-10-27 18:17         ` Tejun Heo
2025-10-28 11:01           ` Peter Zijlstra
2025-10-28 17:07             ` Tejun Heo
2025-10-27 18:19   ` [PATCH v2 " Tejun Heo
2025-10-29 10:45     ` Peter Zijlstra
2025-10-29 15:11       ` Tejun Heo
2025-10-29 15:49     ` [PATCH v3 " Tejun Heo
2025-11-27 10:39       ` Kuba Piecuch
2025-12-02 23:05         ` Tejun Heo
2025-12-11 14:24       ` Kuba Piecuch
2025-12-11 16:17         ` Tejun Heo
2025-12-11 16:20           ` Tejun Heo
2025-12-13  1:16             ` Andrea Righi
2025-12-13  1:18               ` Tejun Heo
2025-10-29 15:31 ` [PATCHSET sched_ext/for-6.19] sched_ext: Deprecate ops.cpu_acquire/release() Tejun Heo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox