* [PATCHSET sched_ext/for-7.1] sched_ext: Overhaul DSQ reenqueue infrastructure
@ 2026-03-06 19:06 Tejun Heo
2026-03-06 19:06 ` [PATCH 01/15] sched_ext: Relocate scx_bpf_task_cgroup() and its BTF_ID to the end of kfunc section Tejun Heo
` (16 more replies)
0 siblings, 17 replies; 38+ messages in thread
From: Tejun Heo @ 2026-03-06 19:06 UTC (permalink / raw)
To: linux-kernel, sched-ext; +Cc: void, arighi, changwoo, emil, Tejun Heo
scx_bpf_reenqueue_local() currently only supports the current CPU's local
DSQ and must be called from ops.cpu_release(). This patchset overhauls the
reenqueue infrastructure to support flexible, remote, and user DSQ reenqueue
operations through the new scx_bpf_dsq_reenq() kfunc.
The patchset:
- Refactors per-node data structures and DSQ lookup helpers as preparation.
- Converts the deferred reenqueue mechanism from lockless llist to a
spinlock-protected regular list to support more complex list operations.
- Introduces scx_bpf_dsq_reenq() which can target any local DSQ including
remote CPUs via SCX_DSQ_LOCAL_ON | cpu, and user-defined DSQs.
- Adds per-CPU data to DSQs and cursor-based iteration helpers to support
user DSQ reenqueue with proper multi-rq locking.
- Adds reenqueue flags plumbing and a lockless fast-path optimization.
- Simplifies task state handling and adds SCX_TASK_REENQ_REASON flags so BPF
schedulers can distinguish why a task is being reenqueued.
scx_bpf_reenqueue_local() is reimplemented as a wrapper around
scx_bpf_dsq_reenq(SCX_DSQ_LOCAL, 0) and may be deprecated in the future.
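For reference, the DSQ id encoding that scx_bpf_dsq_reenq() operates on can be
modeled in plain C. The constants below are assumed to mirror the SCX_DSQ_*
definitions in include/linux/sched/ext.h and are reproduced here purely for
illustration; only the bit-level idea matters: a builtin flag, a local-on flag,
and the target CPU number in the low bits.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror the SCX_DSQ_* encoding in include/linux/sched/ext.h. */
#define SCX_DSQ_FLAG_BUILTIN	(1ULL << 63)
#define SCX_DSQ_FLAG_LOCAL_ON	(1ULL << 62)
#define SCX_DSQ_LOCAL		(SCX_DSQ_FLAG_BUILTIN | 2)
#define SCX_DSQ_LOCAL_ON	(SCX_DSQ_FLAG_BUILTIN | SCX_DSQ_FLAG_LOCAL_ON)
#define SCX_DSQ_LOCAL_CPU_MASK	0xffffffffULL

/* True iff @dsq_id targets a specific CPU's local DSQ. */
static inline bool dsq_id_is_local_on(uint64_t dsq_id)
{
	return (dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON;
}

/* Extract the target CPU from a SCX_DSQ_LOCAL_ON | cpu id. */
static inline int32_t dsq_id_local_on_cpu(uint64_t dsq_id)
{
	return (int32_t)(dsq_id & SCX_DSQ_LOCAL_CPU_MASK);
}
```

With this encoding, a call like scx_bpf_dsq_reenq(SCX_DSQ_LOCAL_ON | 5, 0)
would target CPU 5's local DSQ, while SCX_DSQ_LOCAL keeps targeting the
caller's own.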
Based on sched_ext/for-7.1 (4f8b122848db).
Tejun Heo (15):
sched_ext: Relocate scx_bpf_task_cgroup() and its BTF_ID to the end of kfunc section
sched_ext: Wrap global DSQs in per-node structure
sched_ext: Factor out pnode allocation and deallocation into helpers
sched_ext: Change find_global_dsq() to take CPU number instead of task
sched_ext: Relocate reenq_local() and run_deferred()
sched_ext: Convert deferred_reenq_locals from llist to regular list
sched_ext: Wrap deferred_reenq_local_node into a struct
sched_ext: Introduce scx_bpf_dsq_reenq() for remote local DSQ reenqueue
sched_ext: Add reenq_flags plumbing to scx_bpf_dsq_reenq()
sched_ext: Add per-CPU data to DSQs
sched_ext: Factor out nldsq_cursor_next_task() and nldsq_cursor_lost_task()
sched_ext: Implement scx_bpf_dsq_reenq() for user DSQs
sched_ext: Optimize schedule_dsq_reenq() with lockless fast path
sched_ext: Simplify task state handling
sched_ext: Add SCX_TASK_REENQ_REASON flags
Git tree:
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git scx-reenq
include/linux/sched/ext.h | 58 +-
kernel/sched/ext.c | 882 ++++++++++++++++++++++---------
kernel/sched/ext_internal.h | 32 +-
kernel/sched/sched.h | 4 +-
tools/sched_ext/include/scx/compat.bpf.h | 21 +
tools/sched_ext/scx_qmap.bpf.c | 68 ++-
tools/sched_ext/scx_qmap.c | 5 +-
7 files changed, 777 insertions(+), 293 deletions(-)
--
tejun
^ permalink raw reply [flat|nested] 38+ messages in thread
* [PATCH 01/15] sched_ext: Relocate scx_bpf_task_cgroup() and its BTF_ID to the end of kfunc section
2026-03-06 19:06 [PATCHSET sched_ext/for-7.1] sched_ext: Overhaul DSQ reenqueue infrastructure Tejun Heo
@ 2026-03-06 19:06 ` Tejun Heo
2026-03-06 20:45 ` Emil Tsalapatis
2026-03-06 23:20 ` Daniel Jordan
2026-03-06 19:06 ` [PATCH 02/15] sched_ext: Wrap global DSQs in per-node structure Tejun Heo
` (15 subsequent siblings)
16 siblings, 2 replies; 38+ messages in thread
From: Tejun Heo @ 2026-03-06 19:06 UTC (permalink / raw)
To: linux-kernel, sched-ext; +Cc: void, arighi, changwoo, emil, Tejun Heo
Move the scx_bpf_task_cgroup() kfunc definition and its BTF_ID entry to the
end of the kfunc section, right before __bpf_kfunc_end_defs(), for cleaner
code organization.
No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 78 +++++++++++++++++++++++-----------------------
1 file changed, 39 insertions(+), 39 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index e25b3593dd30..fe222df1d494 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -8628,43 +8628,6 @@ __bpf_kfunc struct task_struct *scx_bpf_cpu_curr(s32 cpu, const struct bpf_prog_
return rcu_dereference(cpu_rq(cpu)->curr);
}
-/**
- * scx_bpf_task_cgroup - Return the sched cgroup of a task
- * @p: task of interest
- * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
- *
- * @p->sched_task_group->css.cgroup represents the cgroup @p is associated with
- * from the scheduler's POV. SCX operations should use this function to
- * determine @p's current cgroup as, unlike following @p->cgroups,
- * @p->sched_task_group is protected by @p's rq lock and thus atomic w.r.t. all
- * rq-locked operations. Can be called on the parameter tasks of rq-locked
- * operations. The restriction guarantees that @p's rq is locked by the caller.
- */
-#ifdef CONFIG_CGROUP_SCHED
-__bpf_kfunc struct cgroup *scx_bpf_task_cgroup(struct task_struct *p,
- const struct bpf_prog_aux *aux)
-{
- struct task_group *tg = p->sched_task_group;
- struct cgroup *cgrp = &cgrp_dfl_root.cgrp;
- struct scx_sched *sch;
-
- guard(rcu)();
-
- sch = scx_prog_sched(aux);
- if (unlikely(!sch))
- goto out;
-
- if (!scx_kf_allowed_on_arg_tasks(sch, __SCX_KF_RQ_LOCKED, p))
- goto out;
-
- cgrp = tg_cgrp(tg);
-
-out:
- cgroup_get(cgrp);
- return cgrp;
-}
-#endif
-
/**
* scx_bpf_now - Returns a high-performance monotonically non-decreasing
* clock for the current CPU. The clock returned is in nanoseconds.
@@ -8779,6 +8742,43 @@ __bpf_kfunc void scx_bpf_events(struct scx_event_stats *events,
memcpy(events, &e_sys, events__sz);
}
+#ifdef CONFIG_CGROUP_SCHED
+/**
+ * scx_bpf_task_cgroup - Return the sched cgroup of a task
+ * @p: task of interest
+ * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
+ *
+ * @p->sched_task_group->css.cgroup represents the cgroup @p is associated with
+ * from the scheduler's POV. SCX operations should use this function to
+ * determine @p's current cgroup as, unlike following @p->cgroups,
+ * @p->sched_task_group is protected by @p's rq lock and thus atomic w.r.t. all
+ * rq-locked operations. Can be called on the parameter tasks of rq-locked
+ * operations. The restriction guarantees that @p's rq is locked by the caller.
+ */
+__bpf_kfunc struct cgroup *scx_bpf_task_cgroup(struct task_struct *p,
+ const struct bpf_prog_aux *aux)
+{
+ struct task_group *tg = p->sched_task_group;
+ struct cgroup *cgrp = &cgrp_dfl_root.cgrp;
+ struct scx_sched *sch;
+
+ guard(rcu)();
+
+ sch = scx_prog_sched(aux);
+ if (unlikely(!sch))
+ goto out;
+
+ if (!scx_kf_allowed_on_arg_tasks(sch, __SCX_KF_RQ_LOCKED, p))
+ goto out;
+
+ cgrp = tg_cgrp(tg);
+
+out:
+ cgroup_get(cgrp);
+ return cgrp;
+}
+#endif /* CONFIG_CGROUP_SCHED */
+
__bpf_kfunc_end_defs();
BTF_KFUNCS_START(scx_kfunc_ids_any)
@@ -8808,11 +8808,11 @@ BTF_ID_FLAGS(func, scx_bpf_task_cpu, KF_RCU)
BTF_ID_FLAGS(func, scx_bpf_cpu_rq, KF_IMPLICIT_ARGS)
BTF_ID_FLAGS(func, scx_bpf_locked_rq, KF_IMPLICIT_ARGS | KF_RET_NULL)
BTF_ID_FLAGS(func, scx_bpf_cpu_curr, KF_IMPLICIT_ARGS | KF_RET_NULL | KF_RCU_PROTECTED)
+BTF_ID_FLAGS(func, scx_bpf_now)
+BTF_ID_FLAGS(func, scx_bpf_events)
#ifdef CONFIG_CGROUP_SCHED
BTF_ID_FLAGS(func, scx_bpf_task_cgroup, KF_IMPLICIT_ARGS | KF_RCU | KF_ACQUIRE)
#endif
-BTF_ID_FLAGS(func, scx_bpf_now)
-BTF_ID_FLAGS(func, scx_bpf_events)
BTF_KFUNCS_END(scx_kfunc_ids_any)
static const struct btf_kfunc_id_set scx_kfunc_set_any = {
--
2.53.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH 02/15] sched_ext: Wrap global DSQs in per-node structure
2026-03-06 19:06 [PATCHSET sched_ext/for-7.1] sched_ext: Overhaul DSQ reenqueue infrastructure Tejun Heo
2026-03-06 19:06 ` [PATCH 01/15] sched_ext: Relocate scx_bpf_task_cgroup() and its BTF_ID to the end of kfunc section Tejun Heo
@ 2026-03-06 19:06 ` Tejun Heo
2026-03-06 20:52 ` Emil Tsalapatis
2026-03-06 23:20 ` Daniel Jordan
2026-03-06 19:06 ` [PATCH 03/15] sched_ext: Factor out pnode allocation and deallocation into helpers Tejun Heo
` (14 subsequent siblings)
16 siblings, 2 replies; 38+ messages in thread
From: Tejun Heo @ 2026-03-06 19:06 UTC (permalink / raw)
To: linux-kernel, sched-ext; +Cc: void, arighi, changwoo, emil, Tejun Heo
Global DSQs are currently stored as an array of scx_dispatch_q pointers,
one per NUMA node. To allow adding more per-node data structures, wrap each
global DSQ in struct scx_sched_pnode and replace global_dsqs with a pnode array.
NUMA-aware allocation is maintained. No functional changes.
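The shape of the change can be sketched as a standalone userspace model. All
names here (toy_*) are hypothetical simplifications; the point is that the
global DSQ becomes an embedded member of a per-node wrapper struct, so further
per-node fields can be added later without another allocation or pointer array:

```c
#include <stdlib.h>

/* Hypothetical simplified stand-ins for the kernel structures. */
struct toy_dsq {
	unsigned long long id;
};

/* Per-node wrapper; more per-node fields can be added here later. */
struct toy_pnode {
	struct toy_dsq global_dsq;
};

#define TOY_DSQ_GLOBAL 1ULL

/* Allocate one toy_pnode per node, each embedding its global DSQ. */
static struct toy_pnode **toy_alloc_pnodes(int nr_nodes)
{
	struct toy_pnode **pnode = calloc(nr_nodes, sizeof(*pnode));
	int node;

	if (!pnode)
		return NULL;

	for (node = 0; node < nr_nodes; node++) {
		/* the kernel uses kzalloc_node() for NUMA-local placement */
		pnode[node] = calloc(1, sizeof(*pnode[node]));
		if (!pnode[node])
			goto err;
		pnode[node]->global_dsq.id = TOY_DSQ_GLOBAL;
	}
	return pnode;

err:
	while (node--)
		free(pnode[node]);
	free(pnode);
	return NULL;
}

/* Lookup changes from pnode[node] itself to a member of it. */
static struct toy_dsq *toy_global_dsq(struct toy_pnode **pnode, int node)
{
	return &pnode[node]->global_dsq;
}
```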
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 32 ++++++++++++++++----------------
kernel/sched/ext_internal.h | 6 +++++-
2 files changed, 21 insertions(+), 17 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index fe222df1d494..9232abea4f22 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -344,7 +344,7 @@ static bool scx_is_descendant(struct scx_sched *sch, struct scx_sched *ancestor)
static struct scx_dispatch_q *find_global_dsq(struct scx_sched *sch,
struct task_struct *p)
{
- return sch->global_dsqs[cpu_to_node(task_cpu(p))];
+ return &sch->pnode[cpu_to_node(task_cpu(p))]->global_dsq;
}
static struct scx_dispatch_q *find_user_dsq(struct scx_sched *sch, u64 dsq_id)
@@ -2229,7 +2229,7 @@ static bool consume_global_dsq(struct scx_sched *sch, struct rq *rq)
{
int node = cpu_to_node(cpu_of(rq));
- return consume_dispatch_q(sch, rq, sch->global_dsqs[node]);
+ return consume_dispatch_q(sch, rq, &sch->pnode[node]->global_dsq);
}
/**
@@ -4148,8 +4148,8 @@ static void scx_sched_free_rcu_work(struct work_struct *work)
free_percpu(sch->pcpu);
for_each_node_state(node, N_POSSIBLE)
- kfree(sch->global_dsqs[node]);
- kfree(sch->global_dsqs);
+ kfree(sch->pnode[node]);
+ kfree(sch->pnode);
rhashtable_walk_enter(&sch->dsq_hash, &rht_iter);
do {
@@ -5707,23 +5707,23 @@ static struct scx_sched *scx_alloc_and_add_sched(struct sched_ext_ops *ops,
if (ret < 0)
goto err_free_ei;
- sch->global_dsqs = kzalloc_objs(sch->global_dsqs[0], nr_node_ids);
- if (!sch->global_dsqs) {
+ sch->pnode = kzalloc_objs(sch->pnode[0], nr_node_ids);
+ if (!sch->pnode) {
ret = -ENOMEM;
goto err_free_hash;
}
for_each_node_state(node, N_POSSIBLE) {
- struct scx_dispatch_q *dsq;
+ struct scx_sched_pnode *pnode;
- dsq = kzalloc_node(sizeof(*dsq), GFP_KERNEL, node);
- if (!dsq) {
+ pnode = kzalloc_node(sizeof(*pnode), GFP_KERNEL, node);
+ if (!pnode) {
ret = -ENOMEM;
- goto err_free_gdsqs;
+ goto err_free_pnode;
}
- init_dsq(dsq, SCX_DSQ_GLOBAL, sch);
- sch->global_dsqs[node] = dsq;
+ init_dsq(&pnode->global_dsq, SCX_DSQ_GLOBAL, sch);
+ sch->pnode[node] = pnode;
}
sch->dsp_max_batch = ops->dispatch_max_batch ?: SCX_DSP_DFL_MAX_BATCH;
@@ -5732,7 +5732,7 @@ static struct scx_sched *scx_alloc_and_add_sched(struct sched_ext_ops *ops,
__alignof__(struct scx_sched_pcpu));
if (!sch->pcpu) {
ret = -ENOMEM;
- goto err_free_gdsqs;
+ goto err_free_pnode;
}
for_each_possible_cpu(cpu)
@@ -5819,10 +5819,10 @@ static struct scx_sched *scx_alloc_and_add_sched(struct sched_ext_ops *ops,
kthread_destroy_worker(sch->helper);
err_free_pcpu:
free_percpu(sch->pcpu);
-err_free_gdsqs:
+err_free_pnode:
for_each_node_state(node, N_POSSIBLE)
- kfree(sch->global_dsqs[node]);
- kfree(sch->global_dsqs);
+ kfree(sch->pnode[node]);
+ kfree(sch->pnode);
err_free_hash:
rhashtable_free_and_destroy(&sch->dsq_hash, NULL, NULL);
err_free_ei:
diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index 4cb97093b872..9e5ebd00ea0c 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -975,6 +975,10 @@ struct scx_sched_pcpu {
struct scx_dsp_ctx dsp_ctx;
};
+struct scx_sched_pnode {
+ struct scx_dispatch_q global_dsq;
+};
+
struct scx_sched {
struct sched_ext_ops ops;
DECLARE_BITMAP(has_op, SCX_OPI_END);
@@ -988,7 +992,7 @@ struct scx_sched {
* per-node split isn't sufficient, it can be further split.
*/
struct rhashtable dsq_hash;
- struct scx_dispatch_q **global_dsqs;
+ struct scx_sched_pnode **pnode;
struct scx_sched_pcpu __percpu *pcpu;
u64 slice_dfl;
--
2.53.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH 03/15] sched_ext: Factor out pnode allocation and deallocation into helpers
2026-03-06 19:06 [PATCHSET sched_ext/for-7.1] sched_ext: Overhaul DSQ reenqueue infrastructure Tejun Heo
2026-03-06 19:06 ` [PATCH 01/15] sched_ext: Relocate scx_bpf_task_cgroup() and its BTF_ID to the end of kfunc section Tejun Heo
2026-03-06 19:06 ` [PATCH 02/15] sched_ext: Wrap global DSQs in per-node structure Tejun Heo
@ 2026-03-06 19:06 ` Tejun Heo
2026-03-06 20:54 ` Emil Tsalapatis
2026-03-06 23:21 ` Daniel Jordan
2026-03-06 19:06 ` [PATCH 04/15] sched_ext: Change find_global_dsq() to take CPU number instead of task Tejun Heo
` (13 subsequent siblings)
16 siblings, 2 replies; 38+ messages in thread
From: Tejun Heo @ 2026-03-06 19:06 UTC (permalink / raw)
To: linux-kernel, sched-ext; +Cc: void, arighi, changwoo, emil, Tejun Heo
Extract pnode allocation and deallocation logic into alloc_pnode() and
free_pnode() helpers. This simplifies scx_alloc_and_add_sched() and prepares
for adding more per-node initialization and cleanup in subsequent patches.
No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 32 +++++++++++++++++++++++---------
1 file changed, 23 insertions(+), 9 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 9232abea4f22..c36d399bca3e 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -4114,6 +4114,7 @@ static const struct attribute_group scx_global_attr_group = {
.attrs = scx_global_attrs,
};
+static void free_pnode(struct scx_sched_pnode *pnode);
static void free_exit_info(struct scx_exit_info *ei);
static void scx_sched_free_rcu_work(struct work_struct *work)
@@ -4148,7 +4149,7 @@ static void scx_sched_free_rcu_work(struct work_struct *work)
free_percpu(sch->pcpu);
for_each_node_state(node, N_POSSIBLE)
- kfree(sch->pnode[node]);
+ free_pnode(sch->pnode[node]);
kfree(sch->pnode);
rhashtable_walk_enter(&sch->dsq_hash, &rht_iter);
@@ -5685,6 +5686,24 @@ static int alloc_kick_syncs(void)
return 0;
}
+static void free_pnode(struct scx_sched_pnode *pnode)
+{
+ kfree(pnode);
+}
+
+static struct scx_sched_pnode *alloc_pnode(struct scx_sched *sch, int node)
+{
+ struct scx_sched_pnode *pnode;
+
+ pnode = kzalloc_node(sizeof(*pnode), GFP_KERNEL, node);
+ if (!pnode)
+ return NULL;
+
+ init_dsq(&pnode->global_dsq, SCX_DSQ_GLOBAL, sch);
+
+ return pnode;
+}
+
static struct scx_sched *scx_alloc_and_add_sched(struct sched_ext_ops *ops,
struct cgroup *cgrp,
struct scx_sched *parent)
@@ -5714,16 +5733,11 @@ static struct scx_sched *scx_alloc_and_add_sched(struct sched_ext_ops *ops,
}
for_each_node_state(node, N_POSSIBLE) {
- struct scx_sched_pnode *pnode;
-
- pnode = kzalloc_node(sizeof(*pnode), GFP_KERNEL, node);
- if (!pnode) {
+ sch->pnode[node] = alloc_pnode(sch, node);
+ if (!sch->pnode[node]) {
ret = -ENOMEM;
goto err_free_pnode;
}
-
- init_dsq(&pnode->global_dsq, SCX_DSQ_GLOBAL, sch);
- sch->pnode[node] = pnode;
}
sch->dsp_max_batch = ops->dispatch_max_batch ?: SCX_DSP_DFL_MAX_BATCH;
@@ -5821,7 +5835,7 @@ static struct scx_sched *scx_alloc_and_add_sched(struct sched_ext_ops *ops,
free_percpu(sch->pcpu);
err_free_pnode:
for_each_node_state(node, N_POSSIBLE)
- kfree(sch->pnode[node]);
+ free_pnode(sch->pnode[node]);
kfree(sch->pnode);
err_free_hash:
rhashtable_free_and_destroy(&sch->dsq_hash, NULL, NULL);
--
2.53.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH 04/15] sched_ext: Change find_global_dsq() to take CPU number instead of task
2026-03-06 19:06 [PATCHSET sched_ext/for-7.1] sched_ext: Overhaul DSQ reenqueue infrastructure Tejun Heo
` (2 preceding siblings ...)
2026-03-06 19:06 ` [PATCH 03/15] sched_ext: Factor out pnode allocation and deallocation into helpers Tejun Heo
@ 2026-03-06 19:06 ` Tejun Heo
2026-03-06 21:06 ` Emil Tsalapatis
` (2 more replies)
2026-03-06 19:06 ` [PATCH 05/15] sched_ext: Relocate reenq_local() and run_deferred() Tejun Heo
` (12 subsequent siblings)
16 siblings, 3 replies; 38+ messages in thread
From: Tejun Heo @ 2026-03-06 19:06 UTC (permalink / raw)
To: linux-kernel, sched-ext; +Cc: void, arighi, changwoo, emil, Tejun Heo
Change find_global_dsq() to take a CPU number directly instead of a task
pointer. This prepares for callers where the CPU is available but the task is
not.
No functional changes except that the error message for a non-existent DSQ no
longer includes the offending task's comm and PID.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 32 +++++++++++++++-----------------
1 file changed, 15 insertions(+), 17 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index c36d399bca3e..c44893878878 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -341,10 +341,9 @@ static bool scx_is_descendant(struct scx_sched *sch, struct scx_sched *ancestor)
for ((pos) = scx_next_descendant_pre(NULL, (root)); (pos); \
(pos) = scx_next_descendant_pre((pos), (root)))
-static struct scx_dispatch_q *find_global_dsq(struct scx_sched *sch,
- struct task_struct *p)
+static struct scx_dispatch_q *find_global_dsq(struct scx_sched *sch, s32 tcpu)
{
- return &sch->pnode[cpu_to_node(task_cpu(p))]->global_dsq;
+ return &sch->pnode[cpu_to_node(tcpu)]->global_dsq;
}
static struct scx_dispatch_q *find_user_dsq(struct scx_sched *sch, u64 dsq_id)
@@ -1266,7 +1265,7 @@ static void dispatch_enqueue(struct scx_sched *sch, struct rq *rq,
scx_error(sch, "attempting to dispatch to a destroyed dsq");
/* fall back to the global dsq */
raw_spin_unlock(&dsq->lock);
- dsq = find_global_dsq(sch, p);
+ dsq = find_global_dsq(sch, task_cpu(p));
raw_spin_lock(&dsq->lock);
}
}
@@ -1474,7 +1473,7 @@ static void dispatch_dequeue_locked(struct task_struct *p,
static struct scx_dispatch_q *find_dsq_for_dispatch(struct scx_sched *sch,
struct rq *rq, u64 dsq_id,
- struct task_struct *p)
+ s32 tcpu)
{
struct scx_dispatch_q *dsq;
@@ -1485,20 +1484,19 @@ static struct scx_dispatch_q *find_dsq_for_dispatch(struct scx_sched *sch,
s32 cpu = dsq_id & SCX_DSQ_LOCAL_CPU_MASK;
if (!ops_cpu_valid(sch, cpu, "in SCX_DSQ_LOCAL_ON dispatch verdict"))
- return find_global_dsq(sch, p);
+ return find_global_dsq(sch, tcpu);
return &cpu_rq(cpu)->scx.local_dsq;
}
if (dsq_id == SCX_DSQ_GLOBAL)
- dsq = find_global_dsq(sch, p);
+ dsq = find_global_dsq(sch, tcpu);
else
dsq = find_user_dsq(sch, dsq_id);
if (unlikely(!dsq)) {
- scx_error(sch, "non-existent DSQ 0x%llx for %s[%d]",
- dsq_id, p->comm, p->pid);
- return find_global_dsq(sch, p);
+ scx_error(sch, "non-existent DSQ 0x%llx", dsq_id);
+ return find_global_dsq(sch, tcpu);
}
return dsq;
@@ -1540,7 +1538,7 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
{
struct rq *rq = task_rq(p);
struct scx_dispatch_q *dsq =
- find_dsq_for_dispatch(sch, rq, p->scx.ddsp_dsq_id, p);
+ find_dsq_for_dispatch(sch, rq, p->scx.ddsp_dsq_id, task_cpu(p));
touch_core_sched_dispatch(rq, p);
@@ -1683,7 +1681,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
dsq = &rq->scx.local_dsq;
goto enqueue;
global:
- dsq = find_global_dsq(sch, p);
+ dsq = find_global_dsq(sch, task_cpu(p));
goto enqueue;
bypass:
dsq = bypass_enq_target_dsq(sch, task_cpu(p));
@@ -2140,7 +2138,7 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch,
dst_rq = container_of(dst_dsq, struct rq, scx.local_dsq);
if (src_rq != dst_rq &&
unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
- dst_dsq = find_global_dsq(sch, p);
+ dst_dsq = find_global_dsq(sch, task_cpu(p));
dst_rq = src_rq;
}
} else {
@@ -2269,7 +2267,7 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
if (src_rq != dst_rq &&
unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
- dispatch_enqueue(sch, rq, find_global_dsq(sch, p), p,
+ dispatch_enqueue(sch, rq, find_global_dsq(sch, task_cpu(p)), p,
enq_flags | SCX_ENQ_CLEAR_OPSS);
return;
}
@@ -2407,7 +2405,7 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
BUG_ON(!(p->scx.flags & SCX_TASK_QUEUED));
- dsq = find_dsq_for_dispatch(sch, this_rq(), dsq_id, p);
+ dsq = find_dsq_for_dispatch(sch, this_rq(), dsq_id, task_cpu(p));
if (dsq->id == SCX_DSQ_LOCAL)
dispatch_to_local_dsq(sch, rq, dsq, p, enq_flags);
@@ -2647,7 +2645,7 @@ static void process_ddsp_deferred_locals(struct rq *rq)
list_del_init(&p->scx.dsq_list.node);
- dsq = find_dsq_for_dispatch(sch, rq, p->scx.ddsp_dsq_id, p);
+ dsq = find_dsq_for_dispatch(sch, rq, p->scx.ddsp_dsq_id, task_cpu(p));
if (!WARN_ON_ONCE(dsq->id != SCX_DSQ_LOCAL))
dispatch_to_local_dsq(sch, rq, dsq, p,
p->scx.ddsp_enq_flags);
@@ -7410,7 +7408,7 @@ static bool scx_dsq_move(struct bpf_iter_scx_dsq_kern *kit,
}
/* @p is still on $src_dsq and stable, determine the destination */
- dst_dsq = find_dsq_for_dispatch(sch, this_rq, dsq_id, p);
+ dst_dsq = find_dsq_for_dispatch(sch, this_rq, dsq_id, task_cpu(p));
/*
* Apply vtime and slice updates before moving so that the new time is
--
2.53.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH 05/15] sched_ext: Relocate reenq_local() and run_deferred()
2026-03-06 19:06 [PATCHSET sched_ext/for-7.1] sched_ext: Overhaul DSQ reenqueue infrastructure Tejun Heo
` (3 preceding siblings ...)
2026-03-06 19:06 ` [PATCH 04/15] sched_ext: Change find_global_dsq() to take CPU number instead of task Tejun Heo
@ 2026-03-06 19:06 ` Tejun Heo
2026-03-06 21:09 ` Emil Tsalapatis
` (2 more replies)
2026-03-06 19:06 ` [PATCH 06/15] sched_ext: Convert deferred_reenq_locals from llist to regular list Tejun Heo
` (11 subsequent siblings)
16 siblings, 3 replies; 38+ messages in thread
From: Tejun Heo @ 2026-03-06 19:06 UTC (permalink / raw)
To: linux-kernel, sched-ext; +Cc: void, arighi, changwoo, emil, Tejun Heo
Previously, both process_ddsp_deferred_locals() and reenq_local() required
forward declarations. Reorganize so that only run_deferred() needs to be
declared. This reduces forward declaration clutter and will make it easier to
add more work to the run_deferred() path.
No functional changes.
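As context for the relocated reenq_local(), its two-phase structure (detach all
candidates onto a private list first, then enqueue only from that list) can be
sketched as a standalone userspace model. All names here are hypothetical and
the list is a toy singly linked one; the point is that enqueueing only from the
private list prevents tasks that the BPF scheduler dispatches straight back to
the local DSQ from being processed again:

```c
#include <stddef.h>

/* Hypothetical minimal task/queue model, not kernel code. */
struct toy_task {
	struct toy_task *next;
	int skip;		/* stands in for p->migration_pending */
};

struct toy_queue {
	struct toy_task *head;
};

static void toy_push(struct toy_queue *q, struct toy_task *p)
{
	p->next = q->head;
	q->head = p;
}

/*
 * Phase 1: move every eligible task off @local onto a private list.
 * Phase 2: re-enqueue from the private list only, so tasks that land
 * back on @local during phase 2 are not picked up a second time.
 */
static int toy_reenq_local(struct toy_queue *local, struct toy_queue *target)
{
	struct toy_queue priv = { .head = NULL };
	struct toy_task *p, *n;
	int nr = 0;

	for (p = local->head, local->head = NULL; p; p = n) {
		n = p->next;
		if (p->skip) {		/* e.g. migration pending: leave it */
			toy_push(local, p);
			continue;
		}
		toy_push(&priv, p);
	}

	for (p = priv.head; p; p = n) {
		n = p->next;
		toy_push(target, p);	/* stands in for do_enqueue_task() */
		nr++;
	}

	return nr;
}
```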
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 132 ++++++++++++++++++++++-----------------------
1 file changed, 65 insertions(+), 67 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index c44893878878..1b6cd1e4f8b9 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -193,9 +193,8 @@ MODULE_PARM_DESC(bypass_lb_intv_us, "bypass load balance interval in microsecond
#define CREATE_TRACE_POINTS
#include <trace/events/sched_ext.h>
-static void process_ddsp_deferred_locals(struct rq *rq);
+static void run_deferred(struct rq *rq);
static bool task_dead_and_done(struct task_struct *p);
-static u32 reenq_local(struct scx_sched *sch, struct rq *rq);
static void scx_kick_cpu(struct scx_sched *sch, s32 cpu, u64 flags);
static void scx_disable(struct scx_sched *sch, enum scx_exit_kind kind);
static bool scx_vexit(struct scx_sched *sch, enum scx_exit_kind kind,
@@ -1003,23 +1002,6 @@ static int ops_sanitize_err(struct scx_sched *sch, const char *ops_name, s32 err
return -EPROTO;
}
-static void run_deferred(struct rq *rq)
-{
- process_ddsp_deferred_locals(rq);
-
- if (!llist_empty(&rq->scx.deferred_reenq_locals)) {
- struct llist_node *llist =
- llist_del_all(&rq->scx.deferred_reenq_locals);
- struct scx_sched_pcpu *pos, *next;
-
- llist_for_each_entry_safe(pos, next, llist,
- deferred_reenq_locals_node) {
- init_llist_node(&pos->deferred_reenq_locals_node);
- reenq_local(pos->sch, rq);
- }
- }
-}
-
static void deferred_bal_cb_workfn(struct rq *rq)
{
run_deferred(rq);
@@ -3072,7 +3054,6 @@ static void rq_offline_scx(struct rq *rq)
rq->scx.flags &= ~SCX_RQ_ONLINE;
}
-
static bool check_rq_for_timeouts(struct rq *rq)
{
struct scx_sched *sch;
@@ -3612,6 +3593,70 @@ int scx_check_setscheduler(struct task_struct *p, int policy)
return 0;
}
+static u32 reenq_local(struct scx_sched *sch, struct rq *rq)
+{
+ LIST_HEAD(tasks);
+ u32 nr_enqueued = 0;
+ struct task_struct *p, *n;
+
+ lockdep_assert_rq_held(rq);
+
+ /*
+ * The BPF scheduler may choose to dispatch tasks back to
+ * @rq->scx.local_dsq. Move all candidate tasks off to a private list
+ * first to avoid processing the same tasks repeatedly.
+ */
+ list_for_each_entry_safe(p, n, &rq->scx.local_dsq.list,
+ scx.dsq_list.node) {
+ struct scx_sched *task_sch = scx_task_sched(p);
+
+ /*
+ * If @p is being migrated, @p's current CPU may not agree with
+ * its allowed CPUs and the migration_cpu_stop is about to
+ * deactivate and re-activate @p anyway. Skip re-enqueueing.
+ *
+ * While racing sched property changes may also dequeue and
+ * re-enqueue a migrating task while its current CPU and allowed
+ * CPUs disagree, they use %ENQUEUE_RESTORE which is bypassed to
+ * the current local DSQ for running tasks and thus are not
+ * visible to the BPF scheduler.
+ */
+ if (p->migration_pending)
+ continue;
+
+ if (!scx_is_descendant(task_sch, sch))
+ continue;
+
+ dispatch_dequeue(rq, p);
+ list_add_tail(&p->scx.dsq_list.node, &tasks);
+ }
+
+ list_for_each_entry_safe(p, n, &tasks, scx.dsq_list.node) {
+ list_del_init(&p->scx.dsq_list.node);
+ do_enqueue_task(rq, p, SCX_ENQ_REENQ, -1);
+ nr_enqueued++;
+ }
+
+ return nr_enqueued;
+}
+
+static void run_deferred(struct rq *rq)
+{
+ process_ddsp_deferred_locals(rq);
+
+ if (!llist_empty(&rq->scx.deferred_reenq_locals)) {
+ struct llist_node *llist =
+ llist_del_all(&rq->scx.deferred_reenq_locals);
+ struct scx_sched_pcpu *pos, *next;
+
+ llist_for_each_entry_safe(pos, next, llist,
+ deferred_reenq_locals_node) {
+ init_llist_node(&pos->deferred_reenq_locals_node);
+ reenq_local(pos->sch, rq);
+ }
+ }
+}
+
#ifdef CONFIG_NO_HZ_FULL
bool scx_can_stop_tick(struct rq *rq)
{
@@ -7702,53 +7747,6 @@ static const struct btf_kfunc_id_set scx_kfunc_set_dispatch = {
.set = &scx_kfunc_ids_dispatch,
};
-static u32 reenq_local(struct scx_sched *sch, struct rq *rq)
-{
- LIST_HEAD(tasks);
- u32 nr_enqueued = 0;
- struct task_struct *p, *n;
-
- lockdep_assert_rq_held(rq);
-
- /*
- * The BPF scheduler may choose to dispatch tasks back to
- * @rq->scx.local_dsq. Move all candidate tasks off to a private list
- * first to avoid processing the same tasks repeatedly.
- */
- list_for_each_entry_safe(p, n, &rq->scx.local_dsq.list,
- scx.dsq_list.node) {
- struct scx_sched *task_sch = scx_task_sched(p);
-
- /*
- * If @p is being migrated, @p's current CPU may not agree with
- * its allowed CPUs and the migration_cpu_stop is about to
- * deactivate and re-activate @p anyway. Skip re-enqueueing.
- *
- * While racing sched property changes may also dequeue and
- * re-enqueue a migrating task while its current CPU and allowed
- * CPUs disagree, they use %ENQUEUE_RESTORE which is bypassed to
- * the current local DSQ for running tasks and thus are not
- * visible to the BPF scheduler.
- */
- if (p->migration_pending)
- continue;
-
- if (!scx_is_descendant(task_sch, sch))
- continue;
-
- dispatch_dequeue(rq, p);
- list_add_tail(&p->scx.dsq_list.node, &tasks);
- }
-
- list_for_each_entry_safe(p, n, &tasks, scx.dsq_list.node) {
- list_del_init(&p->scx.dsq_list.node);
- do_enqueue_task(rq, p, SCX_ENQ_REENQ, -1);
- nr_enqueued++;
- }
-
- return nr_enqueued;
-}
-
__bpf_kfunc_start_defs();
/**
--
2.53.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH 06/15] sched_ext: Convert deferred_reenq_locals from llist to regular list
2026-03-06 19:06 [PATCHSET sched_ext/for-7.1] sched_ext: Overhaul DSQ reenqueue infrastructure Tejun Heo
` (4 preceding siblings ...)
2026-03-06 19:06 ` [PATCH 05/15] sched_ext: Relocate reenq_local() and run_deferred() Tejun Heo
@ 2026-03-06 19:06 ` Tejun Heo
2026-03-09 17:12 ` Emil Tsalapatis
2026-03-06 19:06 ` [PATCH 07/15] sched_ext: Wrap deferred_reenq_local_node into a struct Tejun Heo
` (10 subsequent siblings)
16 siblings, 1 reply; 38+ messages in thread
From: Tejun Heo @ 2026-03-06 19:06 UTC (permalink / raw)
To: linux-kernel, sched-ext; +Cc: void, arighi, changwoo, emil, Tejun Heo
The deferred local-reenqueue mechanism uses an llist (lockless list) to
collect schedulers that need their local DSQs re-enqueued. Convert it to a
regular list protected by a raw_spinlock.
The llist was used for its lockless properties, but the upcoming changes to
support remote reenqueue require more complex list operations that are
difficult to implement correctly with lockless data structures. A spinlock-
protected regular list provides the necessary flexibility.
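The "enqueue at most once" idiom the conversion relies on, where an off-list
node is kept empty so that list_empty() doubles as a membership test, can be
modeled in userspace. This is a simplified sketch with hypothetical toy_*
names, using a minimal circular doubly linked list in place of the kernel's
list.h and ignoring the locking:

```c
#include <stdbool.h>

/* Minimal circular doubly linked list mimicking kernel list.h semantics. */
struct toy_node {
	struct toy_node *prev, *next;
};

static void toy_init(struct toy_node *n)
{
	n->prev = n->next = n;	/* an empty node points at itself */
}

static bool toy_empty(const struct toy_node *n)
{
	return n->next == n;
}

/* Unlink @n and re-initialize it so toy_empty() reports off-list again. */
static void toy_del_init(struct toy_node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	toy_init(n);
}

static void toy_add_tail(struct toy_node *n, struct toy_node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Queue @n on @head only if it is not already queued somewhere. */
static bool toy_queue_once(struct toy_node *n, struct toy_node *head)
{
	if (!toy_empty(n))
		return false;	/* already pending, nothing to do */
	toy_add_tail(n, head);
	return true;
}
```

This mirrors the patch's pattern of checking list_empty() on
deferred_reenq_local_node before list_move_tail() under deferred_reenq_lock:
the consumer's list_del_init() re-empties the node so it can be queued again.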
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 57 ++++++++++++++++++++++++-------------
kernel/sched/ext_internal.h | 2 +-
kernel/sched/sched.h | 3 +-
3 files changed, 41 insertions(+), 21 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 1b6cd1e4f8b9..ffccaf04e34d 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3640,23 +3640,37 @@ static u32 reenq_local(struct scx_sched *sch, struct rq *rq)
return nr_enqueued;
}
-static void run_deferred(struct rq *rq)
+static void process_deferred_reenq_locals(struct rq *rq)
{
- process_ddsp_deferred_locals(rq);
-
- if (!llist_empty(&rq->scx.deferred_reenq_locals)) {
- struct llist_node *llist =
- llist_del_all(&rq->scx.deferred_reenq_locals);
- struct scx_sched_pcpu *pos, *next;
+ lockdep_assert_rq_held(rq);
- llist_for_each_entry_safe(pos, next, llist,
- deferred_reenq_locals_node) {
- init_llist_node(&pos->deferred_reenq_locals_node);
- reenq_local(pos->sch, rq);
+ while (true) {
+ struct scx_sched *sch;
+
+ scoped_guard (raw_spinlock, &rq->scx.deferred_reenq_lock) {
+ struct scx_sched_pcpu *sch_pcpu =
+ list_first_entry_or_null(&rq->scx.deferred_reenq_locals,
+ struct scx_sched_pcpu,
+ deferred_reenq_local_node);
+ if (!sch_pcpu)
+ return;
+
+ sch = sch_pcpu->sch;
+ list_del_init(&sch_pcpu->deferred_reenq_local_node);
}
+
+ reenq_local(sch, rq);
}
}
+static void run_deferred(struct rq *rq)
+{
+ process_ddsp_deferred_locals(rq);
+
+ if (!list_empty(&rq->scx.deferred_reenq_locals))
+ process_deferred_reenq_locals(rq);
+}
+
#ifdef CONFIG_NO_HZ_FULL
bool scx_can_stop_tick(struct rq *rq)
{
@@ -4180,13 +4194,13 @@ static void scx_sched_free_rcu_work(struct work_struct *work)
/*
* $sch would have entered bypass mode before the RCU grace period. As
- * that blocks new deferrals, all deferred_reenq_locals_node's must be
+ * that blocks new deferrals, all deferred_reenq_local_node's must be
* off-list by now.
*/
for_each_possible_cpu(cpu) {
struct scx_sched_pcpu *pcpu = per_cpu_ptr(sch->pcpu, cpu);
- WARN_ON_ONCE(llist_on_list(&pcpu->deferred_reenq_locals_node));
+ WARN_ON_ONCE(!list_empty(&pcpu->deferred_reenq_local_node));
}
free_percpu(sch->pcpu);
@@ -5799,7 +5813,7 @@ static struct scx_sched *scx_alloc_and_add_sched(struct sched_ext_ops *ops,
struct scx_sched_pcpu *pcpu = per_cpu_ptr(sch->pcpu, cpu);
pcpu->sch = sch;
- init_llist_node(&pcpu->deferred_reenq_locals_node);
+ INIT_LIST_HEAD(&pcpu->deferred_reenq_local_node);
}
sch->helper = kthread_run_worker(0, "sched_ext_helper");
@@ -7126,7 +7140,8 @@ void __init init_sched_ext_class(void)
BUG_ON(!zalloc_cpumask_var_node(&rq->scx.cpus_to_kick_if_idle, GFP_KERNEL, n));
BUG_ON(!zalloc_cpumask_var_node(&rq->scx.cpus_to_preempt, GFP_KERNEL, n));
BUG_ON(!zalloc_cpumask_var_node(&rq->scx.cpus_to_wait, GFP_KERNEL, n));
- init_llist_head(&rq->scx.deferred_reenq_locals);
+ raw_spin_lock_init(&rq->scx.deferred_reenq_lock);
+ INIT_LIST_HEAD(&rq->scx.deferred_reenq_locals);
rq->scx.deferred_irq_work = IRQ_WORK_INIT_HARD(deferred_irq_workfn);
rq->scx.kick_cpus_irq_work = IRQ_WORK_INIT_HARD(kick_cpus_irq_workfn);
@@ -8358,7 +8373,6 @@ __bpf_kfunc void scx_bpf_reenqueue_local___v2(const struct bpf_prog_aux *aux)
unsigned long flags;
struct scx_sched *sch;
struct rq *rq;
- struct llist_node *lnode;
raw_local_irq_save(flags);
@@ -8374,9 +8388,14 @@ __bpf_kfunc void scx_bpf_reenqueue_local___v2(const struct bpf_prog_aux *aux)
goto out_irq_restore;
rq = this_rq();
- lnode = &this_cpu_ptr(sch->pcpu)->deferred_reenq_locals_node;
- if (!llist_on_list(lnode))
- llist_add(lnode, &rq->scx.deferred_reenq_locals);
+ scoped_guard (raw_spinlock, &rq->scx.deferred_reenq_lock) {
+ struct scx_sched_pcpu *pcpu = this_cpu_ptr(sch->pcpu);
+
+ if (list_empty(&pcpu->deferred_reenq_local_node))
+ list_move_tail(&pcpu->deferred_reenq_local_node,
+ &rq->scx.deferred_reenq_locals);
+ }
+
schedule_deferred(rq);
out_irq_restore:
raw_local_irq_restore(flags);
diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index 9e5ebd00ea0c..80d40a9c5ad9 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -965,7 +965,7 @@ struct scx_sched_pcpu {
*/
struct scx_event_stats event_stats;
- struct llist_node deferred_reenq_locals_node;
+ struct list_head deferred_reenq_local_node;
struct scx_dispatch_q bypass_dsq;
#ifdef CONFIG_EXT_SUB_SCHED
u32 bypass_host_seq;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ebe971d12cb8..0794852524e7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -808,7 +808,8 @@ struct scx_rq {
struct task_struct *sub_dispatch_prev;
- struct llist_head deferred_reenq_locals;
+ raw_spinlock_t deferred_reenq_lock;
+ struct list_head deferred_reenq_locals; /* scheds requesting reenq of local DSQ */
struct balance_callback deferred_bal_cb;
struct irq_work deferred_irq_work;
struct irq_work kick_cpus_irq_work;
--
2.53.0
* [PATCH 07/15] sched_ext: Wrap deferred_reenq_local_node into a struct
2026-03-06 19:06 [PATCHSET sched_ext/for-7.1] sched_ext: Overhaul DSQ reenqueue infrastructure Tejun Heo
` (5 preceding siblings ...)
2026-03-06 19:06 ` [PATCH 06/15] sched_ext: Convert deferred_reenq_locals from llist to regular list Tejun Heo
@ 2026-03-06 19:06 ` Tejun Heo
2026-03-09 17:16 ` Emil Tsalapatis
2026-03-06 19:06 ` [PATCH 08/15] sched_ext: Introduce scx_bpf_dsq_reenq() for remote local DSQ reenqueue Tejun Heo
` (9 subsequent siblings)
16 siblings, 1 reply; 38+ messages in thread
From: Tejun Heo @ 2026-03-06 19:06 UTC (permalink / raw)
To: linux-kernel, sched-ext; +Cc: void, arighi, changwoo, emil, Tejun Heo
Wrap the deferred_reenq_local_node list_head in a new struct
scx_deferred_reenq_local. More fields will be added later, and the wrapper
allows accessing them through a shorthand pointer.
No functional change.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 22 +++++++++++++---------
kernel/sched/ext_internal.h | 6 +++++-
2 files changed, 18 insertions(+), 10 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index ffccaf04e34d..80d1e6ccc326 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3648,15 +3648,19 @@ static void process_deferred_reenq_locals(struct rq *rq)
struct scx_sched *sch;
scoped_guard (raw_spinlock, &rq->scx.deferred_reenq_lock) {
- struct scx_sched_pcpu *sch_pcpu =
+ struct scx_deferred_reenq_local *drl =
list_first_entry_or_null(&rq->scx.deferred_reenq_locals,
- struct scx_sched_pcpu,
- deferred_reenq_local_node);
- if (!sch_pcpu)
+ struct scx_deferred_reenq_local,
+ node);
+ struct scx_sched_pcpu *sch_pcpu;
+
+ if (!drl)
return;
+ sch_pcpu = container_of(drl, struct scx_sched_pcpu,
+ deferred_reenq_local);
sch = sch_pcpu->sch;
- list_del_init(&sch_pcpu->deferred_reenq_local_node);
+ list_del_init(&drl->node);
}
reenq_local(sch, rq);
@@ -4200,7 +4204,7 @@ static void scx_sched_free_rcu_work(struct work_struct *work)
for_each_possible_cpu(cpu) {
struct scx_sched_pcpu *pcpu = per_cpu_ptr(sch->pcpu, cpu);
- WARN_ON_ONCE(!list_empty(&pcpu->deferred_reenq_local_node));
+ WARN_ON_ONCE(!list_empty(&pcpu->deferred_reenq_local.node));
}
free_percpu(sch->pcpu);
@@ -5813,7 +5817,7 @@ static struct scx_sched *scx_alloc_and_add_sched(struct sched_ext_ops *ops,
struct scx_sched_pcpu *pcpu = per_cpu_ptr(sch->pcpu, cpu);
pcpu->sch = sch;
- INIT_LIST_HEAD(&pcpu->deferred_reenq_local_node);
+ INIT_LIST_HEAD(&pcpu->deferred_reenq_local.node);
}
sch->helper = kthread_run_worker(0, "sched_ext_helper");
@@ -8391,8 +8395,8 @@ __bpf_kfunc void scx_bpf_reenqueue_local___v2(const struct bpf_prog_aux *aux)
scoped_guard (raw_spinlock, &rq->scx.deferred_reenq_lock) {
struct scx_sched_pcpu *pcpu = this_cpu_ptr(sch->pcpu);
- if (list_empty(&pcpu->deferred_reenq_local_node))
- list_move_tail(&pcpu->deferred_reenq_local_node,
+ if (list_empty(&pcpu->deferred_reenq_local.node))
+ list_move_tail(&pcpu->deferred_reenq_local.node,
&rq->scx.deferred_reenq_locals);
}
diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index 80d40a9c5ad9..1a8d61097cab 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -954,6 +954,10 @@ struct scx_dsp_ctx {
struct scx_dsp_buf_ent buf[];
};
+struct scx_deferred_reenq_local {
+ struct list_head node;
+};
+
struct scx_sched_pcpu {
struct scx_sched *sch;
u64 flags; /* protected by rq lock */
@@ -965,7 +969,7 @@ struct scx_sched_pcpu {
*/
struct scx_event_stats event_stats;
- struct list_head deferred_reenq_local_node;
+ struct scx_deferred_reenq_local deferred_reenq_local;
struct scx_dispatch_q bypass_dsq;
#ifdef CONFIG_EXT_SUB_SCHED
u32 bypass_host_seq;
--
2.53.0
* [PATCH 08/15] sched_ext: Introduce scx_bpf_dsq_reenq() for remote local DSQ reenqueue
2026-03-06 19:06 [PATCHSET sched_ext/for-7.1] sched_ext: Overhaul DSQ reenqueue infrastructure Tejun Heo
` (6 preceding siblings ...)
2026-03-06 19:06 ` [PATCH 07/15] sched_ext: Wrap deferred_reenq_local_node into a struct Tejun Heo
@ 2026-03-06 19:06 ` Tejun Heo
2026-03-09 17:33 ` Emil Tsalapatis
2026-03-06 19:06 ` [PATCH 09/15] sched_ext: Add reenq_flags plumbing to scx_bpf_dsq_reenq() Tejun Heo
` (8 subsequent siblings)
16 siblings, 1 reply; 38+ messages in thread
From: Tejun Heo @ 2026-03-06 19:06 UTC (permalink / raw)
To: linux-kernel, sched-ext; +Cc: void, arighi, changwoo, emil, Tejun Heo
scx_bpf_reenqueue_local() can only trigger re-enqueue of the current CPU's
local DSQ. Introduce scx_bpf_dsq_reenq() which takes a DSQ ID and can target
any local DSQ including remote CPUs via SCX_DSQ_LOCAL_ON | cpu. This will be
expanded to support user DSQs by future changes.
scx_bpf_reenqueue_local() is reimplemented as a simple wrapper around
scx_bpf_dsq_reenq(SCX_DSQ_LOCAL, 0) and may be deprecated in the future.
Update compat.bpf.h with a compatibility shim and scx_qmap to test the new
functionality.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 118 ++++++++++++++---------
tools/sched_ext/include/scx/compat.bpf.h | 21 ++++
tools/sched_ext/scx_qmap.bpf.c | 11 ++-
tools/sched_ext/scx_qmap.c | 5 +-
4 files changed, 106 insertions(+), 49 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 80d1e6ccc326..b02143b10f0f 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1080,6 +1080,31 @@ static void schedule_deferred_locked(struct rq *rq)
schedule_deferred(rq);
}
+static void schedule_dsq_reenq(struct scx_sched *sch, struct scx_dispatch_q *dsq)
+{
+ /*
+	 * Allowing reenqueues doesn't make sense while bypassing. This also
+	 * prevents new reenqueues from being scheduled on dead scheds.
+ */
+ if (unlikely(READ_ONCE(sch->bypass_depth)))
+ return;
+
+ if (dsq->id == SCX_DSQ_LOCAL) {
+ struct rq *rq = container_of(dsq, struct rq, scx.local_dsq);
+ struct scx_sched_pcpu *sch_pcpu = per_cpu_ptr(sch->pcpu, cpu_of(rq));
+ struct scx_deferred_reenq_local *drl = &sch_pcpu->deferred_reenq_local;
+
+ scoped_guard (raw_spinlock_irqsave, &rq->scx.deferred_reenq_lock) {
+ if (list_empty(&drl->node))
+ list_move_tail(&drl->node, &rq->scx.deferred_reenq_locals);
+ }
+
+ schedule_deferred(rq);
+ } else {
+ scx_error(sch, "DSQ 0x%llx not allowed for reenq", dsq->id);
+ }
+}
+
/**
* touch_core_sched - Update timestamp used for core-sched task ordering
* @rq: rq to read clock from, must be locked
@@ -7775,9 +7800,6 @@ __bpf_kfunc_start_defs();
* Iterate over all of the tasks currently enqueued on the local DSQ of the
* caller's CPU, and re-enqueue them in the BPF scheduler. Returns the number of
* processed tasks. Can only be called from ops.cpu_release().
- *
- * COMPAT: Will be removed in v6.23 along with the ___v2 suffix on the void
- * returning variant that can be called from anywhere.
*/
__bpf_kfunc u32 scx_bpf_reenqueue_local(const struct bpf_prog_aux *aux)
{
@@ -8207,6 +8229,52 @@ __bpf_kfunc struct task_struct *scx_bpf_dsq_peek(u64 dsq_id,
return rcu_dereference(dsq->first_task);
}
+/**
+ * scx_bpf_dsq_reenq - Re-enqueue tasks on a DSQ
+ * @dsq_id: DSQ to re-enqueue
+ * @reenq_flags: %SCX_REENQ_*
+ * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
+ *
+ * Iterate over all of the tasks currently enqueued on the DSQ identified by
+ * @dsq_id, and re-enqueue them in the BPF scheduler. The following DSQs are
+ * supported:
+ *
+ * - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON | $cpu)
+ *
+ * Re-enqueues are performed asynchronously. Can be called from anywhere.
+ */
+__bpf_kfunc void scx_bpf_dsq_reenq(u64 dsq_id, u64 reenq_flags,
+ const struct bpf_prog_aux *aux)
+{
+ struct scx_sched *sch;
+ struct scx_dispatch_q *dsq;
+
+ guard(preempt)();
+
+ sch = scx_prog_sched(aux);
+ if (unlikely(!sch))
+ return;
+
+ dsq = find_dsq_for_dispatch(sch, this_rq(), dsq_id, smp_processor_id());
+ schedule_dsq_reenq(sch, dsq);
+}
+
+/**
+ * scx_bpf_reenqueue_local - Re-enqueue tasks on a local DSQ
+ * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
+ *
+ * Iterate over all of the tasks currently enqueued on the local DSQ of the
+ * caller's CPU, and re-enqueue them in the BPF scheduler. Can be called from
+ * anywhere.
+ *
+ * This is now a special case of scx_bpf_dsq_reenq() and may be removed in the
+ * future.
+ */
+__bpf_kfunc void scx_bpf_reenqueue_local___v2(const struct bpf_prog_aux *aux)
+{
+ scx_bpf_dsq_reenq(SCX_DSQ_LOCAL, 0, aux);
+}
+
__bpf_kfunc_end_defs();
static s32 __bstr_format(struct scx_sched *sch, u64 *data_buf, char *line_buf,
@@ -8364,47 +8432,6 @@ __bpf_kfunc void scx_bpf_dump_bstr(char *fmt, unsigned long long *data,
ops_dump_flush();
}
-/**
- * scx_bpf_reenqueue_local - Re-enqueue tasks on a local DSQ
- * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
- *
- * Iterate over all of the tasks currently enqueued on the local DSQ of the
- * caller's CPU, and re-enqueue them in the BPF scheduler. Can be called from
- * anywhere.
- */
-__bpf_kfunc void scx_bpf_reenqueue_local___v2(const struct bpf_prog_aux *aux)
-{
- unsigned long flags;
- struct scx_sched *sch;
- struct rq *rq;
-
- raw_local_irq_save(flags);
-
- sch = scx_prog_sched(aux);
- if (unlikely(!sch))
- goto out_irq_restore;
-
- /*
- * Allowing reenqueue-locals doesn't make sense while bypassing. This
- * also blocks from new reenqueues to be scheduled on dead scheds.
- */
- if (unlikely(sch->bypass_depth))
- goto out_irq_restore;
-
- rq = this_rq();
- scoped_guard (raw_spinlock, &rq->scx.deferred_reenq_lock) {
- struct scx_sched_pcpu *pcpu = this_cpu_ptr(sch->pcpu);
-
- if (list_empty(&pcpu->deferred_reenq_local.node))
- list_move_tail(&pcpu->deferred_reenq_local.node,
- &rq->scx.deferred_reenq_locals);
- }
-
- schedule_deferred(rq);
-out_irq_restore:
- raw_local_irq_restore(flags);
-}
-
/**
* scx_bpf_cpuperf_cap - Query the maximum relative capacity of a CPU
* @cpu: CPU of interest
@@ -8821,13 +8848,14 @@ BTF_ID_FLAGS(func, scx_bpf_kick_cpu, KF_IMPLICIT_ARGS)
BTF_ID_FLAGS(func, scx_bpf_dsq_nr_queued)
BTF_ID_FLAGS(func, scx_bpf_destroy_dsq)
BTF_ID_FLAGS(func, scx_bpf_dsq_peek, KF_IMPLICIT_ARGS | KF_RCU_PROTECTED | KF_RET_NULL)
+BTF_ID_FLAGS(func, scx_bpf_dsq_reenq, KF_IMPLICIT_ARGS)
+BTF_ID_FLAGS(func, scx_bpf_reenqueue_local___v2, KF_IMPLICIT_ARGS)
BTF_ID_FLAGS(func, bpf_iter_scx_dsq_new, KF_IMPLICIT_ARGS | KF_ITER_NEW | KF_RCU_PROTECTED)
BTF_ID_FLAGS(func, bpf_iter_scx_dsq_next, KF_ITER_NEXT | KF_RET_NULL)
BTF_ID_FLAGS(func, bpf_iter_scx_dsq_destroy, KF_ITER_DESTROY)
BTF_ID_FLAGS(func, scx_bpf_exit_bstr, KF_IMPLICIT_ARGS)
BTF_ID_FLAGS(func, scx_bpf_error_bstr, KF_IMPLICIT_ARGS)
BTF_ID_FLAGS(func, scx_bpf_dump_bstr, KF_IMPLICIT_ARGS)
-BTF_ID_FLAGS(func, scx_bpf_reenqueue_local___v2, KF_IMPLICIT_ARGS)
BTF_ID_FLAGS(func, scx_bpf_cpuperf_cap, KF_IMPLICIT_ARGS)
BTF_ID_FLAGS(func, scx_bpf_cpuperf_cur, KF_IMPLICIT_ARGS)
BTF_ID_FLAGS(func, scx_bpf_cpuperf_set, KF_IMPLICIT_ARGS)
diff --git a/tools/sched_ext/include/scx/compat.bpf.h b/tools/sched_ext/include/scx/compat.bpf.h
index f2969c3061a7..2d3985be7e2c 100644
--- a/tools/sched_ext/include/scx/compat.bpf.h
+++ b/tools/sched_ext/include/scx/compat.bpf.h
@@ -375,6 +375,27 @@ static inline void scx_bpf_reenqueue_local(void)
scx_bpf_reenqueue_local___v1();
}
+/*
+ * v6.20: New scx_bpf_dsq_reenq() that allows re-enqueues on more DSQs. This
+ * will eventually deprecate scx_bpf_reenqueue_local().
+ */
+void scx_bpf_dsq_reenq___compat(u64 dsq_id, u64 reenq_flags, const struct bpf_prog_aux *aux__prog) __ksym __weak;
+
+static inline bool __COMPAT_has_generic_reenq(void)
+{
+ return bpf_ksym_exists(scx_bpf_dsq_reenq___compat);
+}
+
+static inline void scx_bpf_dsq_reenq(u64 dsq_id, u64 reenq_flags)
+{
+ if (bpf_ksym_exists(scx_bpf_dsq_reenq___compat))
+ scx_bpf_dsq_reenq___compat(dsq_id, reenq_flags, NULL);
+ else if (dsq_id == SCX_DSQ_LOCAL && reenq_flags == 0)
+ scx_bpf_reenqueue_local();
+ else
+ scx_bpf_error("kernel too old to reenqueue foreign local or user DSQs");
+}
+
/*
* Define sched_ext_ops. This may be expanded to define multiple variants for
* backward compatibility. See compat.h::SCX_OPS_LOAD/ATTACH().
diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
index 91b8eac83f52..83e8289e8c0c 100644
--- a/tools/sched_ext/scx_qmap.bpf.c
+++ b/tools/sched_ext/scx_qmap.bpf.c
@@ -131,7 +131,7 @@ struct {
} cpu_ctx_stor SEC(".maps");
/* Statistics */
-u64 nr_enqueued, nr_dispatched, nr_reenqueued, nr_dequeued, nr_ddsp_from_enq;
+u64 nr_enqueued, nr_dispatched, nr_reenqueued, nr_reenqueued_cpu0, nr_dequeued, nr_ddsp_from_enq;
u64 nr_core_sched_execed;
u64 nr_expedited_local, nr_expedited_remote, nr_expedited_lost, nr_expedited_from_timer;
u32 cpuperf_min, cpuperf_avg, cpuperf_max;
@@ -206,8 +206,11 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
void *ring;
s32 cpu;
- if (enq_flags & SCX_ENQ_REENQ)
+ if (enq_flags & SCX_ENQ_REENQ) {
__sync_fetch_and_add(&nr_reenqueued, 1);
+ if (scx_bpf_task_cpu(p) == 0)
+ __sync_fetch_and_add(&nr_reenqueued_cpu0, 1);
+ }
if (p->flags & PF_KTHREAD) {
if (stall_kernel_nth && !(++kernel_cnt % stall_kernel_nth))
@@ -561,6 +564,10 @@ int BPF_PROG(qmap_sched_switch, bool preempt, struct task_struct *prev,
case 2: /* SCHED_RR */
case 6: /* SCHED_DEADLINE */
scx_bpf_reenqueue_local();
+
+ /* trigger re-enqueue on CPU0 just to exercise LOCAL_ON */
+ if (__COMPAT_has_generic_reenq())
+ scx_bpf_dsq_reenq(SCX_DSQ_LOCAL_ON | 0, 0);
}
return 0;
diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c
index 5d762d10f4db..9252037284d3 100644
--- a/tools/sched_ext/scx_qmap.c
+++ b/tools/sched_ext/scx_qmap.c
@@ -137,9 +137,10 @@ int main(int argc, char **argv)
long nr_enqueued = skel->bss->nr_enqueued;
long nr_dispatched = skel->bss->nr_dispatched;
- printf("stats : enq=%lu dsp=%lu delta=%ld reenq=%"PRIu64" deq=%"PRIu64" core=%"PRIu64" enq_ddsp=%"PRIu64"\n",
+ printf("stats : enq=%lu dsp=%lu delta=%ld reenq/cpu0=%"PRIu64"/%"PRIu64" deq=%"PRIu64" core=%"PRIu64" enq_ddsp=%"PRIu64"\n",
nr_enqueued, nr_dispatched, nr_enqueued - nr_dispatched,
- skel->bss->nr_reenqueued, skel->bss->nr_dequeued,
+ skel->bss->nr_reenqueued, skel->bss->nr_reenqueued_cpu0,
+ skel->bss->nr_dequeued,
skel->bss->nr_core_sched_execed,
skel->bss->nr_ddsp_from_enq);
printf(" exp_local=%"PRIu64" exp_remote=%"PRIu64" exp_timer=%"PRIu64" exp_lost=%"PRIu64"\n",
--
2.53.0
* [PATCH 09/15] sched_ext: Add reenq_flags plumbing to scx_bpf_dsq_reenq()
2026-03-06 19:06 [PATCHSET sched_ext/for-7.1] sched_ext: Overhaul DSQ reenqueue infrastructure Tejun Heo
` (7 preceding siblings ...)
2026-03-06 19:06 ` [PATCH 08/15] sched_ext: Introduce scx_bpf_dsq_reenq() for remote local DSQ reenqueue Tejun Heo
@ 2026-03-06 19:06 ` Tejun Heo
2026-03-09 17:47 ` Emil Tsalapatis
2026-03-06 19:06 ` [PATCH 10/15] sched_ext: Add per-CPU data to DSQs Tejun Heo
` (7 subsequent siblings)
16 siblings, 1 reply; 38+ messages in thread
From: Tejun Heo @ 2026-03-06 19:06 UTC (permalink / raw)
To: linux-kernel, sched-ext; +Cc: void, arighi, changwoo, emil, Tejun Heo
Add infrastructure to pass flags through the deferred reenqueue path.
reenq_local() now takes a reenq_flags parameter, and struct
scx_deferred_reenq_local gains a flags field to accumulate flags from
multiple scx_bpf_dsq_reenq() calls before processing. The only flag defined
so far is SCX_REENQ_ANY, which matches all tasks and is the implied default
when no filter bits are specified, so behavior is unchanged.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 33 ++++++++++++++++++++++++++++-----
kernel/sched/ext_internal.h | 10 ++++++++++
2 files changed, 38 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index b02143b10f0f..c9b0e94d59bd 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1080,7 +1080,8 @@ static void schedule_deferred_locked(struct rq *rq)
schedule_deferred(rq);
}
-static void schedule_dsq_reenq(struct scx_sched *sch, struct scx_dispatch_q *dsq)
+static void schedule_dsq_reenq(struct scx_sched *sch, struct scx_dispatch_q *dsq,
+ u64 reenq_flags)
{
/*
* Allowing reenqueues doesn't make sense while bypassing. This also
@@ -1097,6 +1098,7 @@ static void schedule_dsq_reenq(struct scx_sched *sch, struct scx_dispatch_q *dsq
scoped_guard (raw_spinlock_irqsave, &rq->scx.deferred_reenq_lock) {
if (list_empty(&drl->node))
list_move_tail(&drl->node, &rq->scx.deferred_reenq_locals);
+ drl->flags |= reenq_flags;
}
schedule_deferred(rq);
@@ -3618,7 +3620,14 @@ int scx_check_setscheduler(struct task_struct *p, int policy)
return 0;
}
-static u32 reenq_local(struct scx_sched *sch, struct rq *rq)
+static bool task_should_reenq(struct task_struct *p, u64 reenq_flags)
+{
+ if (reenq_flags & SCX_REENQ_ANY)
+ return true;
+ return false;
+}
+
+static u32 reenq_local(struct scx_sched *sch, struct rq *rq, u64 reenq_flags)
{
LIST_HEAD(tasks);
u32 nr_enqueued = 0;
@@ -3652,6 +3661,9 @@ static u32 reenq_local(struct scx_sched *sch, struct rq *rq)
if (!scx_is_descendant(task_sch, sch))
continue;
+ if (!task_should_reenq(p, reenq_flags))
+ continue;
+
dispatch_dequeue(rq, p);
list_add_tail(&p->scx.dsq_list.node, &tasks);
}
@@ -3671,6 +3683,7 @@ static void process_deferred_reenq_locals(struct rq *rq)
while (true) {
struct scx_sched *sch;
+ u64 reenq_flags = 0;
scoped_guard (raw_spinlock, &rq->scx.deferred_reenq_lock) {
struct scx_deferred_reenq_local *drl =
@@ -3685,10 +3698,11 @@ static void process_deferred_reenq_locals(struct rq *rq)
sch_pcpu = container_of(drl, struct scx_sched_pcpu,
deferred_reenq_local);
sch = sch_pcpu->sch;
+ swap(drl->flags, reenq_flags);
list_del_init(&drl->node);
}
- reenq_local(sch, rq);
+ reenq_local(sch, rq, reenq_flags);
}
}
@@ -7817,7 +7831,7 @@ __bpf_kfunc u32 scx_bpf_reenqueue_local(const struct bpf_prog_aux *aux)
rq = cpu_rq(smp_processor_id());
lockdep_assert_rq_held(rq);
- return reenq_local(sch, rq);
+ return reenq_local(sch, rq, 0);
}
__bpf_kfunc_end_defs();
@@ -8255,8 +8269,17 @@ __bpf_kfunc void scx_bpf_dsq_reenq(u64 dsq_id, u64 reenq_flags,
if (unlikely(!sch))
return;
+ if (unlikely(reenq_flags & ~__SCX_REENQ_USER_MASK)) {
+ scx_error(sch, "invalid SCX_REENQ flags 0x%llx", reenq_flags);
+ return;
+ }
+
+ /* not specifying any filter bits is the same as %SCX_REENQ_ANY */
+ if (!(reenq_flags & __SCX_REENQ_FILTER_MASK))
+ reenq_flags |= SCX_REENQ_ANY;
+
dsq = find_dsq_for_dispatch(sch, this_rq(), dsq_id, smp_processor_id());
- schedule_dsq_reenq(sch, dsq);
+ schedule_dsq_reenq(sch, dsq, reenq_flags);
}
/**
diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index 1a8d61097cab..d9eda2e8701c 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -956,6 +956,7 @@ struct scx_dsp_ctx {
struct scx_deferred_reenq_local {
struct list_head node;
+ u64 flags;
};
struct scx_sched_pcpu {
@@ -1128,6 +1129,15 @@ enum scx_deq_flags {
SCX_DEQ_SCHED_CHANGE = 1LLU << 33,
};
+enum scx_reenq_flags {
+ /* low 16bits determine which tasks should be reenqueued */
+ SCX_REENQ_ANY = 1LLU << 0, /* all tasks */
+
+ __SCX_REENQ_FILTER_MASK = 0xffffLLU,
+
+ __SCX_REENQ_USER_MASK = SCX_REENQ_ANY,
+};
+
enum scx_pick_idle_cpu_flags {
SCX_PICK_IDLE_CORE = 1LLU << 0, /* pick a CPU whose SMT siblings are also idle */
SCX_PICK_IDLE_IN_NODE = 1LLU << 1, /* pick a CPU in the same target NUMA node */
--
2.53.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH 10/15] sched_ext: Add per-CPU data to DSQs
2026-03-06 19:06 [PATCHSET sched_ext/for-7.1] sched_ext: Overhaul DSQ reenqueue infrastructure Tejun Heo
` (8 preceding siblings ...)
2026-03-06 19:06 ` [PATCH 09/15] sched_ext: Add reenq_flags plumbing to scx_bpf_dsq_reenq() Tejun Heo
@ 2026-03-06 19:06 ` Tejun Heo
2026-03-06 22:54 ` Andrea Righi
2026-03-06 23:09 ` [PATCH v2 " Tejun Heo
2026-03-06 19:06 ` [PATCH 11/15] sched_ext: Factor out nldsq_cursor_next_task() and nldsq_cursor_lost_task() Tejun Heo
` (6 subsequent siblings)
16 siblings, 2 replies; 38+ messages in thread
From: Tejun Heo @ 2026-03-06 19:06 UTC (permalink / raw)
To: linux-kernel, sched-ext; +Cc: void, arighi, changwoo, emil, Tejun Heo
Add a per-CPU data structure to dispatch queues. Each DSQ now has a percpu
scx_dsq_pcpu which contains a back-pointer to the DSQ. This will be used by
future changes to implement per-CPU reenqueue tracking for user DSQs.
init_dsq() now allocates the percpu data and can fail, so it returns an
error code. All callers are updated to handle failures. exit_dsq() is added
to free the percpu data and is called from all DSQ cleanup paths.
In scx_bpf_create_dsq(), init_dsq() is called before rcu_read_lock() since
alloc_percpu() requires GFP_KERNEL context, and dsq->sched is set
afterwards.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
include/linux/sched/ext.h | 5 +++
kernel/sched/ext.c | 82 ++++++++++++++++++++++++++++++++-------
2 files changed, 73 insertions(+), 14 deletions(-)
diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index f354d7d34306..98cc1f41b91e 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -62,6 +62,10 @@ enum scx_dsq_id_flags {
SCX_DSQ_LOCAL_CPU_MASK = 0xffffffffLLU,
};
+struct scx_dsq_pcpu {
+ struct scx_dispatch_q *dsq;
+};
+
/*
* A dispatch queue (DSQ) can be either a FIFO or p->scx.dsq_vtime ordered
* queue. A built-in DSQ is always a FIFO. The built-in local DSQs are used to
@@ -79,6 +83,7 @@ struct scx_dispatch_q {
struct rhash_head hash_node;
struct llist_node free_node;
struct scx_sched *sched;
+ struct scx_dsq_pcpu __percpu *pcpu;
struct rcu_head rcu;
};
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index c9b0e94d59bd..996c410cc892 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -4021,15 +4021,42 @@ DEFINE_SCHED_CLASS(ext) = {
#endif
};
-static void init_dsq(struct scx_dispatch_q *dsq, u64 dsq_id,
- struct scx_sched *sch)
+static s32 init_dsq(struct scx_dispatch_q *dsq, u64 dsq_id,
+ struct scx_sched *sch)
{
+ s32 cpu;
+
memset(dsq, 0, sizeof(*dsq));
raw_spin_lock_init(&dsq->lock);
INIT_LIST_HEAD(&dsq->list);
dsq->id = dsq_id;
dsq->sched = sch;
+
+ dsq->pcpu = alloc_percpu(struct scx_dsq_pcpu);
+ if (!dsq->pcpu)
+ return -ENOMEM;
+
+ for_each_possible_cpu(cpu) {
+ struct scx_dsq_pcpu *pcpu = per_cpu_ptr(dsq->pcpu, cpu);
+
+ pcpu->dsq = dsq;
+ }
+
+ return 0;
+}
+
+static void exit_dsq(struct scx_dispatch_q *dsq)
+{
+ free_percpu(dsq->pcpu);
+}
+
+static void free_dsq_rcufn(struct rcu_head *rcu)
+{
+ struct scx_dispatch_q *dsq = container_of(rcu, struct scx_dispatch_q, rcu);
+
+ exit_dsq(dsq);
+ kfree(dsq);
}
static void free_dsq_irq_workfn(struct irq_work *irq_work)
@@ -4038,7 +4065,7 @@ static void free_dsq_irq_workfn(struct irq_work *irq_work)
struct scx_dispatch_q *dsq, *tmp_dsq;
llist_for_each_entry_safe(dsq, tmp_dsq, to_free, free_node)
- kfree_rcu(dsq, rcu);
+ call_rcu(&dsq->rcu, free_dsq_rcufn);
}
static DEFINE_IRQ_WORK(free_dsq_irq_work, free_dsq_irq_workfn);
@@ -4235,15 +4262,17 @@ static void scx_sched_free_rcu_work(struct work_struct *work)
cgroup_put(sch_cgroup(sch));
#endif /* CONFIG_EXT_SUB_SCHED */
- /*
- * $sch would have entered bypass mode before the RCU grace period. As
- * that blocks new deferrals, all deferred_reenq_local_node's must be
- * off-list by now.
- */
for_each_possible_cpu(cpu) {
struct scx_sched_pcpu *pcpu = per_cpu_ptr(sch->pcpu, cpu);
+ /*
+ * $sch would have entered bypass mode before the RCU grace
+ * period. As that blocks new deferrals, all
+ * deferred_reenq_local_node's must be off-list by now.
+ */
WARN_ON_ONCE(!list_empty(&pcpu->deferred_reenq_local.node));
+
+ exit_dsq(bypass_dsq(sch, cpu));
}
free_percpu(sch->pcpu);
@@ -5788,6 +5817,9 @@ static int alloc_kick_syncs(void)
static void free_pnode(struct scx_sched_pnode *pnode)
{
+ if (!pnode)
+ return;
+ exit_dsq(&pnode->global_dsq);
kfree(pnode);
}
@@ -5799,7 +5831,10 @@ static struct scx_sched_pnode *alloc_pnode(struct scx_sched *sch, int node)
if (!pnode)
return NULL;
- init_dsq(&pnode->global_dsq, SCX_DSQ_GLOBAL, sch);
+ if (init_dsq(&pnode->global_dsq, SCX_DSQ_GLOBAL, sch)) {
+ kfree(pnode);
+ return NULL;
+ }
return pnode;
}
@@ -5849,8 +5884,11 @@ static struct scx_sched *scx_alloc_and_add_sched(struct sched_ext_ops *ops,
goto err_free_pnode;
}
- for_each_possible_cpu(cpu)
- init_dsq(bypass_dsq(sch, cpu), SCX_DSQ_BYPASS, sch);
+ for_each_possible_cpu(cpu) {
+ ret = init_dsq(bypass_dsq(sch, cpu), SCX_DSQ_BYPASS, sch);
+ if (ret)
+ goto err_free_pcpu;
+ }
for_each_possible_cpu(cpu) {
struct scx_sched_pcpu *pcpu = per_cpu_ptr(sch->pcpu, cpu);
@@ -5932,6 +5970,10 @@ static struct scx_sched *scx_alloc_and_add_sched(struct sched_ext_ops *ops,
err_stop_helper:
kthread_destroy_worker(sch->helper);
err_free_pcpu:
+ for_each_possible_cpu(cpu) {
+ if (bypass_dsq(sch, cpu))
+ exit_dsq(bypass_dsq(sch, cpu));
+ }
free_percpu(sch->pcpu);
err_free_pnode:
for_each_node_state(node, N_POSSIBLE)
@@ -7174,7 +7216,7 @@ void __init init_sched_ext_class(void)
int n = cpu_to_node(cpu);
/* local_dsq's sch will be set during scx_root_enable() */
- init_dsq(&rq->scx.local_dsq, SCX_DSQ_LOCAL, NULL);
+ BUG_ON(init_dsq(&rq->scx.local_dsq, SCX_DSQ_LOCAL, NULL));
INIT_LIST_HEAD(&rq->scx.runnable_list);
INIT_LIST_HEAD(&rq->scx.ddsp_deferred_locals);
@@ -7873,11 +7915,21 @@ __bpf_kfunc s32 scx_bpf_create_dsq(u64 dsq_id, s32 node, const struct bpf_prog_a
if (!dsq)
return -ENOMEM;
+ /*
+ * init_dsq() must be called in GFP_KERNEL context. Init it with NULL
+ * @sch and update afterwards.
+ */
+ ret = init_dsq(dsq, dsq_id, NULL);
+ if (ret) {
+ kfree(dsq);
+ return ret;
+ }
+
rcu_read_lock();
sch = scx_prog_sched(aux);
if (sch) {
- init_dsq(dsq, dsq_id, sch);
+ dsq->sched = sch;
ret = rhashtable_lookup_insert_fast(&sch->dsq_hash, &dsq->hash_node,
dsq_hash_params);
} else {
@@ -7885,8 +7937,10 @@ __bpf_kfunc s32 scx_bpf_create_dsq(u64 dsq_id, s32 node, const struct bpf_prog_a
}
rcu_read_unlock();
- if (ret)
+ if (ret) {
+ exit_dsq(dsq);
kfree(dsq);
+ }
return ret;
}
--
2.53.0
* [PATCH 11/15] sched_ext: Factor out nldsq_cursor_next_task() and nldsq_cursor_lost_task()
2026-03-06 19:06 [PATCHSET sched_ext/for-7.1] sched_ext: Overhaul DSQ reenqueue infrastructure Tejun Heo
` (9 preceding siblings ...)
2026-03-06 19:06 ` [PATCH 10/15] sched_ext: Add per-CPU data to DSQs Tejun Heo
@ 2026-03-06 19:06 ` Tejun Heo
2026-03-06 19:06 ` [PATCH 12/15] sched_ext: Implement scx_bpf_dsq_reenq() for user DSQs Tejun Heo
` (5 subsequent siblings)
16 siblings, 0 replies; 38+ messages in thread
From: Tejun Heo @ 2026-03-06 19:06 UTC (permalink / raw)
To: linux-kernel, sched-ext; +Cc: void, arighi, changwoo, emil, Tejun Heo
Factor out cursor-based DSQ iteration from bpf_iter_scx_dsq_next() into
nldsq_cursor_next_task() and the task-lost check from scx_dsq_move() into
nldsq_cursor_lost_task() to prepare for reuse.
As ->priv is only used to record dsq->seq for cursors, update
INIT_DSQ_LIST_CURSOR() to take the DSQ pointer and set ->priv from dsq->seq
so that users don't have to read it manually. Move scx_dsq_iter_flags enum
earlier so nldsq_cursor_next_task() can use SCX_DSQ_ITER_REV.
bypass_lb_cpu() now sets cursor.priv to dsq->seq but doesn't use it.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
include/linux/sched/ext.h | 6 +-
kernel/sched/ext.c | 154 ++++++++++++++++++++++++--------------
2 files changed, 102 insertions(+), 58 deletions(-)
diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index 98cc1f41b91e..303f57dfb947 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -157,11 +157,11 @@ struct scx_dsq_list_node {
u32 priv; /* can be used by iter cursor */
};
-#define INIT_DSQ_LIST_CURSOR(__node, __flags, __priv) \
+#define INIT_DSQ_LIST_CURSOR(__cursor, __dsq, __flags) \
(struct scx_dsq_list_node) { \
- .node = LIST_HEAD_INIT((__node).node), \
+ .node = LIST_HEAD_INIT((__cursor).node), \
.flags = SCX_DSQ_LNODE_ITER_CURSOR | (__flags), \
- .priv = (__priv), \
+ .priv = READ_ONCE((__dsq)->seq), \
}
struct scx_sched;
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 996c410cc892..6e4f84d3c407 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -570,9 +570,22 @@ static __always_inline bool scx_kf_allowed_on_arg_tasks(struct scx_sched *sch,
return true;
}
+enum scx_dsq_iter_flags {
+ /* iterate in the reverse dispatch order */
+ SCX_DSQ_ITER_REV = 1U << 16,
+
+ __SCX_DSQ_ITER_HAS_SLICE = 1U << 30,
+ __SCX_DSQ_ITER_HAS_VTIME = 1U << 31,
+
+ __SCX_DSQ_ITER_USER_FLAGS = SCX_DSQ_ITER_REV,
+ __SCX_DSQ_ITER_ALL_FLAGS = __SCX_DSQ_ITER_USER_FLAGS |
+ __SCX_DSQ_ITER_HAS_SLICE |
+ __SCX_DSQ_ITER_HAS_VTIME,
+};
+
/**
* nldsq_next_task - Iterate to the next task in a non-local DSQ
- * @dsq: user dsq being iterated
+ * @dsq: non-local dsq being iterated
* @cur: current position, %NULL to start iteration
* @rev: walk backwards
*
@@ -612,6 +625,85 @@ static struct task_struct *nldsq_next_task(struct scx_dispatch_q *dsq,
for ((p) = nldsq_next_task((dsq), NULL, false); (p); \
(p) = nldsq_next_task((dsq), (p), false))
+/**
+ * nldsq_cursor_next_task - Iterate to the next task given a cursor in a non-local DSQ
+ * @cursor: scx_dsq_list_node initialized with INIT_DSQ_LIST_CURSOR()
+ * @dsq: non-local dsq being iterated
+ *
+ * Find the next task in a cursor based iteration. The caller must have
+ * initialized @cursor using INIT_DSQ_LIST_CURSOR() and can release the DSQ lock
+ * between the iteration steps.
+ *
+ * Only tasks which were queued before @cursor was initialized are visible. This
+ * bounds the iteration and guarantees that vtime never jumps in the other
+ * direction while iterating.
+ */
+static struct task_struct *nldsq_cursor_next_task(struct scx_dsq_list_node *cursor,
+ struct scx_dispatch_q *dsq)
+{
+ bool rev = cursor->flags & SCX_DSQ_ITER_REV;
+ struct task_struct *p;
+
+ lockdep_assert_held(&dsq->lock);
+ BUG_ON(!(cursor->flags & SCX_DSQ_LNODE_ITER_CURSOR));
+
+ if (list_empty(&cursor->node))
+ p = NULL;
+ else
+ p = container_of(cursor, struct task_struct, scx.dsq_list);
+
+ /* skip cursors and tasks that were queued after @cursor init */
+ do {
+ p = nldsq_next_task(dsq, p, rev);
+ } while (p && unlikely(u32_before(cursor->priv, p->scx.dsq_seq)));
+
+ if (p) {
+ if (rev)
+ list_move_tail(&cursor->node, &p->scx.dsq_list.node);
+ else
+ list_move(&cursor->node, &p->scx.dsq_list.node);
+ } else {
+ list_del_init(&cursor->node);
+ }
+
+ return p;
+}
+
+/**
+ * nldsq_cursor_lost_task - Test whether someone else took the task since iteration
+ * @cursor: scx_dsq_list_node initialized with INIT_DSQ_LIST_CURSOR()
+ * @rq: rq @p was on
+ * @dsq: dsq @p was on
+ * @p: target task
+ *
+ * @p is a task returned by nldsq_cursor_next_task(). The locks may have been
+ * dropped and re-acquired in between. Verify that no one else took or is in the
+ * process of taking @p from @dsq.
+ *
+ * On %false return, the caller can assume full ownership of @p.
+ */
+static bool nldsq_cursor_lost_task(struct scx_dsq_list_node *cursor,
+ struct rq *rq, struct scx_dispatch_q *dsq,
+ struct task_struct *p)
+{
+ lockdep_assert_rq_held(rq);
+ lockdep_assert_held(&dsq->lock);
+
+ /*
+	 * @p could have already left @dsq, got re-enqueued, or be in the
+ * process of being consumed by someone else.
+ */
+ if (unlikely(p->scx.dsq != dsq ||
+ u32_before(cursor->priv, p->scx.dsq_seq) ||
+ p->scx.holding_cpu >= 0))
+ return true;
+
+ /* if @p has stayed on @dsq, its rq couldn't have changed */
+ if (WARN_ON_ONCE(rq != task_rq(p)))
+ return true;
+
+ return false;
+}
/*
* BPF DSQ iterator. Tasks in a non-local DSQ can be iterated in [reverse]
@@ -619,19 +711,6 @@ static struct task_struct *nldsq_next_task(struct scx_dispatch_q *dsq,
* changes without breaking backward compatibility. Can be used with
* bpf_for_each(). See bpf_iter_scx_dsq_*().
*/
-enum scx_dsq_iter_flags {
- /* iterate in the reverse dispatch order */
- SCX_DSQ_ITER_REV = 1U << 16,
-
- __SCX_DSQ_ITER_HAS_SLICE = 1U << 30,
- __SCX_DSQ_ITER_HAS_VTIME = 1U << 31,
-
- __SCX_DSQ_ITER_USER_FLAGS = SCX_DSQ_ITER_REV,
- __SCX_DSQ_ITER_ALL_FLAGS = __SCX_DSQ_ITER_USER_FLAGS |
- __SCX_DSQ_ITER_HAS_SLICE |
- __SCX_DSQ_ITER_HAS_VTIME,
-};
-
struct bpf_iter_scx_dsq_kern {
struct scx_dsq_list_node cursor;
struct scx_dispatch_q *dsq;
@@ -4498,7 +4577,7 @@ static u32 bypass_lb_cpu(struct scx_sched *sch, s32 donor,
struct rq *donor_rq = cpu_rq(donor);
struct scx_dispatch_q *donor_dsq = bypass_dsq(sch, donor);
struct task_struct *p, *n;
- struct scx_dsq_list_node cursor = INIT_DSQ_LIST_CURSOR(cursor, 0, 0);
+ struct scx_dsq_list_node cursor = INIT_DSQ_LIST_CURSOR(cursor, donor_dsq, 0);
s32 delta = READ_ONCE(donor_dsq->nr) - nr_donor_target;
u32 nr_balanced = 0, min_delta_us;
@@ -7540,14 +7619,8 @@ static bool scx_dsq_move(struct bpf_iter_scx_dsq_kern *kit,
locked_rq = src_rq;
raw_spin_lock(&src_dsq->lock);
- /*
- * Did someone else get to it? @p could have already left $src_dsq, got
- * re-enqueud, or be in the process of being consumed by someone else.
- */
- if (unlikely(p->scx.dsq != src_dsq ||
- u32_before(kit->cursor.priv, p->scx.dsq_seq) ||
- p->scx.holding_cpu >= 0) ||
- WARN_ON_ONCE(src_rq != task_rq(p))) {
+ /* did someone else get to it while we dropped the locks? */
+ if (nldsq_cursor_lost_task(&kit->cursor, src_rq, src_dsq, p)) {
raw_spin_unlock(&src_dsq->lock);
goto out;
}
@@ -8186,8 +8259,7 @@ __bpf_kfunc int bpf_iter_scx_dsq_new(struct bpf_iter_scx_dsq *it, u64 dsq_id,
if (!kit->dsq)
return -ENOENT;
- kit->cursor = INIT_DSQ_LIST_CURSOR(kit->cursor, flags,
- READ_ONCE(kit->dsq->seq));
+ kit->cursor = INIT_DSQ_LIST_CURSOR(kit->cursor, kit->dsq, flags);
return 0;
}
@@ -8201,41 +8273,13 @@ __bpf_kfunc int bpf_iter_scx_dsq_new(struct bpf_iter_scx_dsq *it, u64 dsq_id,
__bpf_kfunc struct task_struct *bpf_iter_scx_dsq_next(struct bpf_iter_scx_dsq *it)
{
struct bpf_iter_scx_dsq_kern *kit = (void *)it;
- bool rev = kit->cursor.flags & SCX_DSQ_ITER_REV;
- struct task_struct *p;
- unsigned long flags;
if (!kit->dsq)
return NULL;
- raw_spin_lock_irqsave(&kit->dsq->lock, flags);
+ guard(raw_spinlock_irqsave)(&kit->dsq->lock);
- if (list_empty(&kit->cursor.node))
- p = NULL;
- else
- p = container_of(&kit->cursor, struct task_struct, scx.dsq_list);
-
- /*
- * Only tasks which were queued before the iteration started are
- * visible. This bounds BPF iterations and guarantees that vtime never
- * jumps in the other direction while iterating.
- */
- do {
- p = nldsq_next_task(kit->dsq, p, rev);
- } while (p && unlikely(u32_before(kit->cursor.priv, p->scx.dsq_seq)));
-
- if (p) {
- if (rev)
- list_move_tail(&kit->cursor.node, &p->scx.dsq_list.node);
- else
- list_move(&kit->cursor.node, &p->scx.dsq_list.node);
- } else {
- list_del_init(&kit->cursor.node);
- }
-
- raw_spin_unlock_irqrestore(&kit->dsq->lock, flags);
-
- return p;
+ return nldsq_cursor_next_task(&kit->cursor, kit->dsq);
}
/**
--
2.53.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH 12/15] sched_ext: Implement scx_bpf_dsq_reenq() for user DSQs
2026-03-06 19:06 [PATCHSET sched_ext/for-7.1] sched_ext: Overhaul DSQ reenqueue infrastructure Tejun Heo
` (10 preceding siblings ...)
2026-03-06 19:06 ` [PATCH 11/15] sched_ext: Factor out nldsq_cursor_next_task() and nldsq_cursor_lost_task() Tejun Heo
@ 2026-03-06 19:06 ` Tejun Heo
2026-03-06 19:06 ` [PATCH 13/15] sched_ext: Optimize schedule_dsq_reenq() with lockless fast path Tejun Heo
` (4 subsequent siblings)
16 siblings, 0 replies; 38+ messages in thread
From: Tejun Heo @ 2026-03-06 19:06 UTC (permalink / raw)
To: linux-kernel, sched-ext; +Cc: void, arighi, changwoo, emil, Tejun Heo
scx_bpf_dsq_reenq() currently only supports local DSQs. Extend it to support
user-defined DSQs by adding a deferred re-enqueue mechanism similar to the
local DSQ handling.
Add a per-CPU scx_deferred_reenq_user (node + flags) to scx_dsq_pcpu and a
deferred_reenq_users list to scx_rq. When scx_bpf_dsq_reenq() is called on a
user DSQ, the DSQ's per-CPU node is added to the current rq's deferred list.
process_deferred_reenq_users() then iterates the DSQ using the cursor helpers
and re-enqueues each task.
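For illustration, the queue-once/accumulate-flags behavior of the per-CPU
deferred list can be modeled in plain userspace C. This is a minimal sketch,
not kernel code: the list helpers are simplified stand-ins, the names
(fake_rq, deferred_reenq, process_one) are made up, and the kernel takes
rq->scx.deferred_reenq_lock around both steps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* minimal circular doubly-linked list, mimicking <linux/list.h> */
struct list_head { struct list_head *prev, *next; };

static void INIT_LIST_HEAD(struct list_head *h) { h->prev = h->next = h; }
static bool list_empty(const struct list_head *h) { return h->next == h; }

static void list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev; n->next = h;
	h->prev->next = n; h->prev = n;
}

static void list_del_init(struct list_head *n)
{
	n->prev->next = n->next; n->next->prev = n->prev;
	INIT_LIST_HEAD(n);
}

struct deferred_reenq { struct list_head node; uint64_t flags; };
struct fake_rq { struct list_head deferred_reenq_users; };

/* schedule side: queue once, OR in flags across repeat calls */
static void schedule_reenq(struct fake_rq *rq, struct deferred_reenq *dru,
			   uint64_t reenq_flags)
{
	if (list_empty(&dru->node))
		list_add_tail(&dru->node, &rq->deferred_reenq_users);
	dru->flags |= reenq_flags;
}

/* process side: pop one entry, taking ownership of its accumulated flags */
static uint64_t process_one(struct fake_rq *rq)
{
	struct list_head *first;
	struct deferred_reenq *dru;
	uint64_t flags;

	if (list_empty(&rq->deferred_reenq_users))
		return 0;
	first = rq->deferred_reenq_users.next;
	dru = (struct deferred_reenq *)
		((char *)first - offsetof(struct deferred_reenq, node));
	flags = dru->flags;
	dru->flags = 0;
	list_del_init(first);
	return flags;
}
```

Two scx_bpf_dsq_reenq() calls before the deferred work runs thus coalesce
into a single list entry whose flags are the union of both requests.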
Signed-off-by: Tejun Heo <tj@kernel.org>
---
include/linux/sched/ext.h | 6 ++
kernel/sched/ext.c | 128 +++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 1 +
tools/sched_ext/scx_qmap.bpf.c | 57 ++++++++++++++-
4 files changed, 190 insertions(+), 2 deletions(-)
diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index 303f57dfb947..e77504faa0bc 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -62,8 +62,14 @@ enum scx_dsq_id_flags {
SCX_DSQ_LOCAL_CPU_MASK = 0xffffffffLLU,
};
+struct scx_deferred_reenq_user {
+ struct list_head node;
+ u64 flags;
+};
+
struct scx_dsq_pcpu {
struct scx_dispatch_q *dsq;
+ struct scx_deferred_reenq_user deferred_reenq_user;
};
/*
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 6e4f84d3c407..f58afbc69cb4 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1180,6 +1180,18 @@ static void schedule_dsq_reenq(struct scx_sched *sch, struct scx_dispatch_q *dsq
drl->flags |= reenq_flags;
}
+ schedule_deferred(rq);
+ } else if (!(dsq->id & SCX_DSQ_FLAG_BUILTIN)) {
+ struct rq *rq = this_rq();
+ struct scx_dsq_pcpu *dsq_pcpu = per_cpu_ptr(dsq->pcpu, cpu_of(rq));
+ struct scx_deferred_reenq_user *dru = &dsq_pcpu->deferred_reenq_user;
+
+ scoped_guard (raw_spinlock_irqsave, &rq->scx.deferred_reenq_lock) {
+ if (list_empty(&dru->node))
+ list_move_tail(&dru->node, &rq->scx.deferred_reenq_users);
+ dru->flags |= reenq_flags;
+ }
+
schedule_deferred(rq);
} else {
scx_error(sch, "DSQ 0x%llx not allowed for reenq", dsq->id);
@@ -3785,12 +3797,108 @@ static void process_deferred_reenq_locals(struct rq *rq)
}
}
+static void reenq_user(struct rq *rq, struct scx_dispatch_q *dsq, u64 reenq_flags)
+{
+ struct rq *locked_rq = rq;
+ struct scx_sched *sch = dsq->sched;
+ struct scx_dsq_list_node cursor = INIT_DSQ_LIST_CURSOR(cursor, dsq, 0);
+ struct task_struct *p;
+ s32 nr_enqueued = 0;
+
+ lockdep_assert_rq_held(rq);
+
+ raw_spin_lock(&dsq->lock);
+
+ while (likely(!READ_ONCE(sch->bypass_depth))) {
+ struct rq *task_rq;
+
+ p = nldsq_cursor_next_task(&cursor, dsq);
+ if (!p)
+ break;
+
+ if (!task_should_reenq(p, reenq_flags))
+ continue;
+
+ task_rq = task_rq(p);
+
+ if (locked_rq != task_rq) {
+ if (locked_rq)
+ raw_spin_rq_unlock(locked_rq);
+ if (unlikely(!raw_spin_rq_trylock(task_rq))) {
+ raw_spin_unlock(&dsq->lock);
+ raw_spin_rq_lock(task_rq);
+ raw_spin_lock(&dsq->lock);
+ }
+ locked_rq = task_rq;
+
+ /* did we lose @p while switching locks? */
+ if (nldsq_cursor_lost_task(&cursor, task_rq, dsq, p))
+ continue;
+ }
+
+ /* @p is on @dsq, its rq and @dsq are locked */
+ dispatch_dequeue_locked(p, dsq);
+ raw_spin_unlock(&dsq->lock);
+ do_enqueue_task(task_rq, p, SCX_ENQ_REENQ, -1);
+
+ if (!(++nr_enqueued % SCX_TASK_ITER_BATCH)) {
+ raw_spin_rq_unlock(locked_rq);
+ locked_rq = NULL;
+ cpu_relax();
+ }
+
+ raw_spin_lock(&dsq->lock);
+ }
+
+ list_del_init(&cursor.node);
+ raw_spin_unlock(&dsq->lock);
+
+ if (locked_rq != rq) {
+ if (locked_rq)
+ raw_spin_rq_unlock(locked_rq);
+ raw_spin_rq_lock(rq);
+ }
+}
+
+static void process_deferred_reenq_users(struct rq *rq)
+{
+ lockdep_assert_rq_held(rq);
+
+ while (true) {
+ struct scx_dispatch_q *dsq;
+ u64 reenq_flags = 0;
+
+ scoped_guard (raw_spinlock, &rq->scx.deferred_reenq_lock) {
+ struct scx_deferred_reenq_user *dru =
+ list_first_entry_or_null(&rq->scx.deferred_reenq_users,
+ struct scx_deferred_reenq_user,
+ node);
+ struct scx_dsq_pcpu *dsq_pcpu;
+
+ if (!dru)
+ return;
+
+ dsq_pcpu = container_of(dru, struct scx_dsq_pcpu,
+ deferred_reenq_user);
+ dsq = dsq_pcpu->dsq;
+ swap(dru->flags, reenq_flags);
+ list_del_init(&dru->node);
+ }
+
+ BUG_ON(dsq->id & SCX_DSQ_FLAG_BUILTIN);
+ reenq_user(rq, dsq, reenq_flags);
+ }
+}
+
static void run_deferred(struct rq *rq)
{
process_ddsp_deferred_locals(rq);
if (!list_empty(&rq->scx.deferred_reenq_locals))
process_deferred_reenq_locals(rq);
+
+ if (!list_empty(&rq->scx.deferred_reenq_users))
+ process_deferred_reenq_users(rq);
}
#ifdef CONFIG_NO_HZ_FULL
@@ -4120,6 +4228,7 @@ static s32 init_dsq(struct scx_dispatch_q *dsq, u64 dsq_id,
struct scx_dsq_pcpu *pcpu = per_cpu_ptr(dsq->pcpu, cpu);
pcpu->dsq = dsq;
+ INIT_LIST_HEAD(&pcpu->deferred_reenq_user.node);
}
return 0;
@@ -4127,6 +4236,23 @@ static s32 init_dsq(struct scx_dispatch_q *dsq, u64 dsq_id,
static void exit_dsq(struct scx_dispatch_q *dsq)
{
+ s32 cpu;
+
+ for_each_possible_cpu(cpu) {
+ struct scx_dsq_pcpu *pcpu = per_cpu_ptr(dsq->pcpu, cpu);
+ struct scx_deferred_reenq_user *dru = &pcpu->deferred_reenq_user;
+ struct rq *rq = cpu_rq(cpu);
+
+ /*
+	 * There must have been an RCU grace period since the last
+ * insertion and @dsq should be off the deferred list by now.
+ */
+ if (WARN_ON_ONCE(!list_empty(&dru->node))) {
+ guard(raw_spinlock_irqsave)(&rq->scx.deferred_reenq_lock);
+ list_del_init(&dru->node);
+ }
+ }
+
free_percpu(dsq->pcpu);
}
@@ -7306,6 +7432,7 @@ void __init init_sched_ext_class(void)
BUG_ON(!zalloc_cpumask_var_node(&rq->scx.cpus_to_wait, GFP_KERNEL, n));
raw_spin_lock_init(&rq->scx.deferred_reenq_lock);
INIT_LIST_HEAD(&rq->scx.deferred_reenq_locals);
+ INIT_LIST_HEAD(&rq->scx.deferred_reenq_users);
rq->scx.deferred_irq_work = IRQ_WORK_INIT_HARD(deferred_irq_workfn);
rq->scx.kick_cpus_irq_work = IRQ_WORK_INIT_HARD(kick_cpus_irq_workfn);
@@ -8352,6 +8479,7 @@ __bpf_kfunc struct task_struct *scx_bpf_dsq_peek(u64 dsq_id,
* supported:
*
* - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON | $cpu)
+ * - User DSQs
*
* Re-enqueues are performed asynchronously. Can be called from anywhere.
*/
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0794852524e7..893f89ce2a77 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -810,6 +810,7 @@ struct scx_rq {
raw_spinlock_t deferred_reenq_lock;
struct list_head deferred_reenq_locals; /* scheds requesting reenq of local DSQ */
+ struct list_head deferred_reenq_users; /* user DSQs requesting reenq */
struct balance_callback deferred_bal_cb;
struct irq_work deferred_irq_work;
struct irq_work kick_cpus_irq_work;
diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
index 83e8289e8c0c..a4a1b84fe359 100644
--- a/tools/sched_ext/scx_qmap.bpf.c
+++ b/tools/sched_ext/scx_qmap.bpf.c
@@ -26,8 +26,11 @@
enum consts {
ONE_SEC_IN_NS = 1000000000,
+ ONE_MSEC_IN_NS = 1000000,
+ LOWPRI_INTV_NS = 10 * ONE_MSEC_IN_NS,
SHARED_DSQ = 0,
HIGHPRI_DSQ = 1,
+ LOWPRI_DSQ = 2,
HIGHPRI_WEIGHT = 8668, /* this is what -20 maps to */
};
@@ -172,6 +175,9 @@ s32 BPF_STRUCT_OPS(qmap_select_cpu, struct task_struct *p,
if (!(tctx = lookup_task_ctx(p)))
return -ESRCH;
+ if (p->scx.weight < 2 && !(p->flags & PF_KTHREAD))
+ return prev_cpu;
+
cpu = pick_direct_dispatch_cpu(p, prev_cpu);
if (cpu >= 0) {
@@ -242,6 +248,13 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
return;
}
+ /* see lowpri_timerfn() */
+ if (__COMPAT_has_generic_reenq() &&
+ p->scx.weight < 2 && !(p->flags & PF_KTHREAD) && !(enq_flags & SCX_ENQ_REENQ)) {
+ scx_bpf_dsq_insert(p, LOWPRI_DSQ, slice_ns, enq_flags);
+ return;
+ }
+
/* if select_cpu() wasn't called, try direct dispatch */
if (!__COMPAT_is_enq_cpu_selected(enq_flags) &&
(cpu = pick_direct_dispatch_cpu(p, scx_bpf_task_cpu(p))) >= 0) {
@@ -873,6 +886,28 @@ static int monitor_timerfn(void *map, int *key, struct bpf_timer *timer)
return 0;
}
+struct lowpri_timer {
+ struct bpf_timer timer;
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __uint(max_entries, 1);
+ __type(key, u32);
+ __type(value, struct lowpri_timer);
+} lowpri_timer SEC(".maps");
+
+/*
+ * Nice 19 tasks are put into the lowpri DSQ. Every 10ms, reenq is triggered and
+ * the tasks are transferred to SHARED_DSQ.
+ */
+static int lowpri_timerfn(void *map, int *key, struct bpf_timer *timer)
+{
+ scx_bpf_dsq_reenq(LOWPRI_DSQ, 0);
+ bpf_timer_start(timer, LOWPRI_INTV_NS, 0);
+ return 0;
+}
+
s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init)
{
u32 key = 0;
@@ -894,14 +929,32 @@ s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init)
return ret;
}
+ ret = scx_bpf_create_dsq(LOWPRI_DSQ, -1);
+ if (ret)
+ return ret;
+
timer = bpf_map_lookup_elem(&monitor_timer, &key);
if (!timer)
return -ESRCH;
-
bpf_timer_init(timer, &monitor_timer, CLOCK_MONOTONIC);
bpf_timer_set_callback(timer, monitor_timerfn);
+ ret = bpf_timer_start(timer, ONE_SEC_IN_NS, 0);
+ if (ret)
+ return ret;
- return bpf_timer_start(timer, ONE_SEC_IN_NS, 0);
+ if (__COMPAT_has_generic_reenq()) {
+ /* see lowpri_timerfn() */
+ timer = bpf_map_lookup_elem(&lowpri_timer, &key);
+ if (!timer)
+ return -ESRCH;
+ bpf_timer_init(timer, &lowpri_timer, CLOCK_MONOTONIC);
+ bpf_timer_set_callback(timer, lowpri_timerfn);
+ ret = bpf_timer_start(timer, LOWPRI_INTV_NS, 0);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
}
void BPF_STRUCT_OPS(qmap_exit, struct scx_exit_info *ei)
--
2.53.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH 13/15] sched_ext: Optimize schedule_dsq_reenq() with lockless fast path
2026-03-06 19:06 [PATCHSET sched_ext/for-7.1] sched_ext: Overhaul DSQ reenqueue infrastructure Tejun Heo
` (11 preceding siblings ...)
2026-03-06 19:06 ` [PATCH 12/15] sched_ext: Implement scx_bpf_dsq_reenq() for user DSQs Tejun Heo
@ 2026-03-06 19:06 ` Tejun Heo
2026-03-06 19:06 ` [PATCH 14/15] sched_ext: Simplify task state handling Tejun Heo
` (3 subsequent siblings)
16 siblings, 0 replies; 38+ messages in thread
From: Tejun Heo @ 2026-03-06 19:06 UTC (permalink / raw)
To: linux-kernel, sched-ext; +Cc: void, arighi, changwoo, emil, Tejun Heo
schedule_dsq_reenq() always acquires deferred_reenq_lock to queue a reenqueue
request. Add a lockless fast path to skip lock acquisition when the request is
already pending with the required flags set.
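The fast-path check can be sketched as a single-entry userspace model. This
is an illustration only, with made-up names (reenq_req, locked_ops): 'queued'
stands in for !list_empty(&drl->node), locked_ops counts how often the slow
path would take deferred_reenq_lock, and the smp_mb() pairing from the patch
is omitted since the model is single-threaded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct reenq_req {
	bool queued;		/* already on the rq's deferred list? */
	uint64_t flags;		/* accumulated reenq flags */
};

static int locked_ops;		/* times the slow path would lock */

static void schedule_reenq(struct reenq_req *req, uint64_t reenq_flags)
{
	/* lockless check: skip the lock when the request is already
	 * pending with all the required flag bits set */
	if (!req->queued || (req->flags & reenq_flags) != reenq_flags) {
		locked_ops++;	/* slow path: lock taken here */
		req->queued = true;
		req->flags |= reenq_flags;
	}
}
```

Repeat requests with an already-set flag mask hit the fast path and never
touch the lock; only a new flag bit or an empty node falls through.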
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 44 ++++++++++++++++++++++++++++++++++++--------
1 file changed, 36 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index f58afbc69cb4..8e7dffe4094c 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1174,10 +1174,20 @@ static void schedule_dsq_reenq(struct scx_sched *sch, struct scx_dispatch_q *dsq
struct scx_sched_pcpu *sch_pcpu = per_cpu_ptr(sch->pcpu, cpu_of(rq));
struct scx_deferred_reenq_local *drl = &sch_pcpu->deferred_reenq_local;
- scoped_guard (raw_spinlock_irqsave, &rq->scx.deferred_reenq_lock) {
+ /*
+ * Pairs with smp_mb() in process_deferred_reenq_locals() and
+ * guarantees that there is a reenq_local() afterwards.
+ */
+ smp_mb();
+
+ if (list_empty(&drl->node) ||
+ (READ_ONCE(drl->flags) & reenq_flags) != reenq_flags) {
+
+ guard(raw_spinlock_irqsave)(&rq->scx.deferred_reenq_lock);
+
if (list_empty(&drl->node))
list_move_tail(&drl->node, &rq->scx.deferred_reenq_locals);
- drl->flags |= reenq_flags;
+ WRITE_ONCE(drl->flags, drl->flags | reenq_flags);
}
schedule_deferred(rq);
@@ -1186,10 +1196,20 @@ static void schedule_dsq_reenq(struct scx_sched *sch, struct scx_dispatch_q *dsq
struct scx_dsq_pcpu *dsq_pcpu = per_cpu_ptr(dsq->pcpu, cpu_of(rq));
struct scx_deferred_reenq_user *dru = &dsq_pcpu->deferred_reenq_user;
- scoped_guard (raw_spinlock_irqsave, &rq->scx.deferred_reenq_lock) {
+ /*
+ * Pairs with smp_mb() in process_deferred_reenq_users() and
+ * guarantees that there is a reenq_user() afterwards.
+ */
+ smp_mb();
+
+ if (list_empty(&dru->node) ||
+ (READ_ONCE(dru->flags) & reenq_flags) != reenq_flags) {
+
+ guard(raw_spinlock_irqsave)(&rq->scx.deferred_reenq_lock);
+
if (list_empty(&dru->node))
list_move_tail(&dru->node, &rq->scx.deferred_reenq_users);
- dru->flags |= reenq_flags;
+ WRITE_ONCE(dru->flags, dru->flags | reenq_flags);
}
schedule_deferred(rq);
@@ -3774,7 +3794,7 @@ static void process_deferred_reenq_locals(struct rq *rq)
while (true) {
struct scx_sched *sch;
- u64 reenq_flags = 0;
+ u64 reenq_flags;
scoped_guard (raw_spinlock, &rq->scx.deferred_reenq_lock) {
struct scx_deferred_reenq_local *drl =
@@ -3789,10 +3809,14 @@ static void process_deferred_reenq_locals(struct rq *rq)
sch_pcpu = container_of(drl, struct scx_sched_pcpu,
deferred_reenq_local);
sch = sch_pcpu->sch;
- swap(drl->flags, reenq_flags);
+ reenq_flags = drl->flags;
+ WRITE_ONCE(drl->flags, 0);
list_del_init(&drl->node);
}
+ /* see schedule_dsq_reenq() */
+ smp_mb();
+
reenq_local(sch, rq, reenq_flags);
}
}
@@ -3866,7 +3890,7 @@ static void process_deferred_reenq_users(struct rq *rq)
while (true) {
struct scx_dispatch_q *dsq;
- u64 reenq_flags = 0;
+ u64 reenq_flags;
scoped_guard (raw_spinlock, &rq->scx.deferred_reenq_lock) {
struct scx_deferred_reenq_user *dru =
@@ -3881,10 +3905,14 @@ static void process_deferred_reenq_users(struct rq *rq)
dsq_pcpu = container_of(dru, struct scx_dsq_pcpu,
deferred_reenq_user);
dsq = dsq_pcpu->dsq;
- swap(dru->flags, reenq_flags);
+ reenq_flags = dru->flags;
+ WRITE_ONCE(dru->flags, 0);
list_del_init(&dru->node);
}
+ /* see schedule_dsq_reenq() */
+ smp_mb();
+
BUG_ON(dsq->id & SCX_DSQ_FLAG_BUILTIN);
reenq_user(rq, dsq, reenq_flags);
}
--
2.53.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH 14/15] sched_ext: Simplify task state handling
2026-03-06 19:06 [PATCHSET sched_ext/for-7.1] sched_ext: Overhaul DSQ reenqueue infrastructure Tejun Heo
` (12 preceding siblings ...)
2026-03-06 19:06 ` [PATCH 13/15] sched_ext: Optimize schedule_dsq_reenq() with lockless fast path Tejun Heo
@ 2026-03-06 19:06 ` Tejun Heo
2026-03-06 19:06 ` [PATCH 15/15] sched_ext: Add SCX_TASK_REENQ_REASON flags Tejun Heo
` (2 subsequent siblings)
16 siblings, 0 replies; 38+ messages in thread
From: Tejun Heo @ 2026-03-06 19:06 UTC (permalink / raw)
To: linux-kernel, sched-ext; +Cc: void, arighi, changwoo, emil, Tejun Heo
Task states (NONE, INIT, READY, ENABLED) were defined in a separate enum with
unshifted values and then shifted when stored in sched_ext_entity.flags. Simplify by
defining them as pre-shifted values directly in scx_ent_flags and removing the
separate scx_task_state enum. This removes the need for shifting when
reading/writing state values.
scx_get_task_state() now returns the masked flags value directly.
scx_set_task_state() accepts the pre-shifted state value. scx_dump_task()
shifts down for display to maintain readable output.
No functional changes.
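A minimal sketch of the pre-shifted encoding, compiled in userspace. The
state constants mirror the patch; the DEMO_QUEUED bit is made up to show
that non-state flag bits survive a state transition.

```c
#include <assert.h>
#include <stdint.h>

enum {
	DEMO_QUEUED	= 1u << 0,	/* stand-in for a non-state flag */

	STATE_SHIFT	= 8,		/* bits 8 and 9 carry the state */
	STATE_MASK	= 3u << STATE_SHIFT,

	TASK_NONE	= 0u << STATE_SHIFT,
	TASK_INIT	= 1u << STATE_SHIFT,
	TASK_READY	= 2u << STATE_SHIFT,
	TASK_ENABLED	= 3u << STATE_SHIFT,
};

/* no shifting needed on read or write, only masking */
static uint32_t get_task_state(uint32_t flags)
{
	return flags & STATE_MASK;
}

static uint32_t set_task_state(uint32_t flags, uint32_t state)
{
	return (flags & ~STATE_MASK) | state;
}
```

Only the dump path still shifts, turning e.g. TASK_ENABLED back into the
small integer 3 for readable output.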
Signed-off-by: Tejun Heo <tj@kernel.org>
---
include/linux/sched/ext.h | 28 ++++++++++++++++------------
kernel/sched/ext.c | 19 +++++++++----------
2 files changed, 25 insertions(+), 22 deletions(-)
diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index e77504faa0bc..e822b374b17f 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -93,7 +93,7 @@ struct scx_dispatch_q {
struct rcu_head rcu;
};
-/* scx_entity.flags */
+/* sched_ext_entity.flags */
enum scx_ent_flags {
SCX_TASK_QUEUED = 1 << 0, /* on ext runqueue */
SCX_TASK_IN_CUSTODY = 1 << 1, /* in custody, needs ops.dequeue() when leaving */
@@ -101,21 +101,25 @@ enum scx_ent_flags {
SCX_TASK_DEQD_FOR_SLEEP = 1 << 3, /* last dequeue was for SLEEP */
SCX_TASK_SUB_INIT = 1 << 4, /* task being initialized for a sub sched */
- SCX_TASK_STATE_SHIFT = 8, /* bit 8 and 9 are used to carry scx_task_state */
+ /*
+ * Bits 8 and 9 are used to carry task state:
+ *
+ * NONE ops.init_task() not called yet
+ * INIT ops.init_task() succeeded, but task can be cancelled
+ * READY fully initialized, but not in sched_ext
+ * ENABLED fully initialized and in sched_ext
+ */
+ SCX_TASK_STATE_SHIFT = 8, /* bits 8 and 9 are used to carry task state */
SCX_TASK_STATE_BITS = 2,
SCX_TASK_STATE_MASK = ((1 << SCX_TASK_STATE_BITS) - 1) << SCX_TASK_STATE_SHIFT,
- SCX_TASK_CURSOR = 1 << 31, /* iteration cursor, not a task */
-};
-
-/* scx_entity.flags & SCX_TASK_STATE_MASK */
-enum scx_task_state {
- SCX_TASK_NONE, /* ops.init_task() not called yet */
- SCX_TASK_INIT, /* ops.init_task() succeeded, but task can be cancelled */
- SCX_TASK_READY, /* fully initialized, but not in sched_ext */
- SCX_TASK_ENABLED, /* fully initialized and in sched_ext */
+ SCX_TASK_NONE = 0 << SCX_TASK_STATE_SHIFT,
+ SCX_TASK_INIT = 1 << SCX_TASK_STATE_SHIFT,
+ SCX_TASK_READY = 2 << SCX_TASK_STATE_SHIFT,
+ SCX_TASK_ENABLED = 3 << SCX_TASK_STATE_SHIFT,
- SCX_TASK_NR_STATES,
+ /* iteration cursor, not a task */
+ SCX_TASK_CURSOR = 1 << 31,
};
/* scx_entity.dsq_flags */
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 8e7dffe4094c..df659e51bd8a 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3312,18 +3312,16 @@ static struct cgroup *tg_cgrp(struct task_group *tg)
#endif /* CONFIG_EXT_GROUP_SCHED */
-static enum scx_task_state scx_get_task_state(const struct task_struct *p)
+static u32 scx_get_task_state(const struct task_struct *p)
{
- return (p->scx.flags & SCX_TASK_STATE_MASK) >> SCX_TASK_STATE_SHIFT;
+ return p->scx.flags & SCX_TASK_STATE_MASK;
}
-static void scx_set_task_state(struct task_struct *p, enum scx_task_state state)
+static void scx_set_task_state(struct task_struct *p, u32 state)
{
- enum scx_task_state prev_state = scx_get_task_state(p);
+ u32 prev_state = scx_get_task_state(p);
bool warn = false;
- BUILD_BUG_ON(SCX_TASK_NR_STATES > (1 << SCX_TASK_STATE_BITS));
-
switch (state) {
case SCX_TASK_NONE:
break;
@@ -3341,11 +3339,11 @@ static void scx_set_task_state(struct task_struct *p, enum scx_task_state state)
return;
}
- WARN_ONCE(warn, "sched_ext: Invalid task state transition %d -> %d for %s[%d]",
+ WARN_ONCE(warn, "sched_ext: Invalid task state transition 0x%x -> 0x%x for %s[%d]",
prev_state, state, p->comm, p->pid);
p->scx.flags &= ~SCX_TASK_STATE_MASK;
- p->scx.flags |= state << SCX_TASK_STATE_SHIFT;
+ p->scx.flags |= state;
}
static int __scx_init_task(struct scx_sched *sch, struct task_struct *p, bool fork)
@@ -5795,7 +5793,8 @@ static void scx_dump_task(struct scx_sched *sch,
own_marker, sch_id_buf,
jiffies_delta_msecs(p->scx.runnable_at, dctx->at_jiffies));
dump_line(s, " scx_state/flags=%u/0x%x dsq_flags=0x%x ops_state/qseq=%lu/%lu",
- scx_get_task_state(p), p->scx.flags & ~SCX_TASK_STATE_MASK,
+ scx_get_task_state(p) >> SCX_TASK_STATE_SHIFT,
+ p->scx.flags & ~SCX_TASK_STATE_MASK,
p->scx.dsq_flags, ops_state & SCX_OPSS_STATE_MASK,
ops_state >> SCX_OPSS_QSEQ_SHIFT);
dump_line(s, " sticky/holding_cpu=%d/%d dsq_id=%s",
@@ -6556,7 +6555,7 @@ static struct scx_sched *find_parent_sched(struct cgroup *cgrp)
static bool assert_task_ready_or_enabled(struct task_struct *p)
{
- enum scx_task_state state = scx_get_task_state(p);
+ u32 state = scx_get_task_state(p);
switch (state) {
case SCX_TASK_READY:
--
2.53.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH 15/15] sched_ext: Add SCX_TASK_REENQ_REASON flags
2026-03-06 19:06 [PATCHSET sched_ext/for-7.1] sched_ext: Overhaul DSQ reenqueue infrastructure Tejun Heo
` (13 preceding siblings ...)
2026-03-06 19:06 ` [PATCH 14/15] sched_ext: Simplify task state handling Tejun Heo
@ 2026-03-06 19:06 ` Tejun Heo
2026-03-06 23:14 ` [PATCHSET sched_ext/for-7.1] sched_ext: Overhaul DSQ reenqueue infrastructure Andrea Righi
2026-03-07 15:38 ` Tejun Heo
16 siblings, 0 replies; 38+ messages in thread
From: Tejun Heo @ 2026-03-06 19:06 UTC (permalink / raw)
To: linux-kernel, sched-ext; +Cc: void, arighi, changwoo, emil, Tejun Heo
SCX_ENQ_REENQ indicates that a task is being re-enqueued but doesn't tell the
BPF scheduler why. Add SCX_TASK_REENQ_REASON flags using bits 12-13 of
p->scx.flags to communicate the reason during ops.enqueue():
- NONE: Not being reenqueued
- KFUNC: Reenqueued by scx_bpf_dsq_reenq() and friends
More reasons will be added.
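The set-before/clear-after plumbing around do_enqueue_task() can be sketched
in userspace. This is an illustrative model, not the kernel code: the
constants mirror the patch, while demo_ops_enqueue() and seen_reason are
made-up stand-ins for what a BPF scheduler's ops.enqueue() would observe.

```c
#include <assert.h>
#include <stdint.h>

enum {
	REENQ_REASON_SHIFT	= 12,	/* bits 12 and 13 */
	REENQ_REASON_MASK	= 3u << REENQ_REASON_SHIFT,

	REENQ_NONE		= 0u << REENQ_REASON_SHIFT,
	REENQ_KFUNC		= 1u << REENQ_REASON_SHIFT,
};

static uint32_t seen_reason;	/* what ops.enqueue() would see */

static void demo_ops_enqueue(uint32_t task_flags)
{
	seen_reason = task_flags & REENQ_REASON_MASK;
}

static uint32_t reenq_with_reason(uint32_t task_flags, uint32_t reason)
{
	/* publish the reason for the duration of the enqueue callback */
	task_flags = (task_flags & ~REENQ_REASON_MASK) | reason;
	demo_ops_enqueue(task_flags);		/* reason visible here */
	task_flags &= ~REENQ_REASON_MASK;	/* cleared afterwards */
	return task_flags;
}
```

Outside the enqueue window the reason bits always read REENQ_NONE, so a
scheduler that samples them elsewhere sees "not being reenqueued".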
Signed-off-by: Tejun Heo <tj@kernel.org>
---
include/linux/sched/ext.h | 15 +++++++++++++++
kernel/sched/ext.c | 25 ++++++++++++++++++++++---
kernel/sched/ext_internal.h | 10 +++-------
3 files changed, 40 insertions(+), 10 deletions(-)
diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index e822b374b17f..60a4f65d0174 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -118,6 +118,21 @@ enum scx_ent_flags {
SCX_TASK_READY = 2 << SCX_TASK_STATE_SHIFT,
SCX_TASK_ENABLED = 3 << SCX_TASK_STATE_SHIFT,
+ /*
+	 * Bits 12 and 13 are used to carry the reenqueue reason. In addition
+	 * to the %SCX_ENQ_REENQ flag, ops.enqueue() can test these bits
+	 * against %SCX_TASK_REENQ_NONE to distinguish reenqueues.
+ *
+ * NONE not being reenqueued
+ * KFUNC reenqueued by scx_bpf_dsq_reenq() and friends
+ */
+ SCX_TASK_REENQ_REASON_SHIFT = 12,
+ SCX_TASK_REENQ_REASON_BITS = 2,
+ SCX_TASK_REENQ_REASON_MASK = ((1 << SCX_TASK_REENQ_REASON_BITS) - 1) << SCX_TASK_REENQ_REASON_SHIFT,
+
+ SCX_TASK_REENQ_NONE = 0 << SCX_TASK_REENQ_REASON_SHIFT,
+ SCX_TASK_REENQ_KFUNC = 1 << SCX_TASK_REENQ_REASON_SHIFT,
+
/* iteration cursor, not a task */
SCX_TASK_CURSOR = 1 << 31,
};
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index df659e51bd8a..66af7a83bb1e 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3729,8 +3729,10 @@ int scx_check_setscheduler(struct task_struct *p, int policy)
return 0;
}
-static bool task_should_reenq(struct task_struct *p, u64 reenq_flags)
+static bool task_should_reenq(struct task_struct *p, u64 reenq_flags, u32 *reason)
{
+ *reason = SCX_TASK_REENQ_KFUNC;
+
if (reenq_flags & SCX_REENQ_ANY)
return true;
return false;
@@ -3752,6 +3754,7 @@ static u32 reenq_local(struct scx_sched *sch, struct rq *rq, u64 reenq_flags)
list_for_each_entry_safe(p, n, &rq->scx.local_dsq.list,
scx.dsq_list.node) {
struct scx_sched *task_sch = scx_task_sched(p);
+ u32 reason;
/*
* If @p is being migrated, @p's current CPU may not agree with
@@ -3770,16 +3773,24 @@ static u32 reenq_local(struct scx_sched *sch, struct rq *rq, u64 reenq_flags)
if (!scx_is_descendant(task_sch, sch))
continue;
- if (!task_should_reenq(p, reenq_flags))
+ if (!task_should_reenq(p, reenq_flags, &reason))
continue;
dispatch_dequeue(rq, p);
+
+ if (WARN_ON_ONCE(p->scx.flags & SCX_TASK_REENQ_REASON_MASK))
+ p->scx.flags &= ~SCX_TASK_REENQ_REASON_MASK;
+ p->scx.flags |= reason;
+
list_add_tail(&p->scx.dsq_list.node, &tasks);
}
list_for_each_entry_safe(p, n, &tasks, scx.dsq_list.node) {
list_del_init(&p->scx.dsq_list.node);
+
do_enqueue_task(rq, p, SCX_ENQ_REENQ, -1);
+
+ p->scx.flags &= ~SCX_TASK_REENQ_REASON_MASK;
nr_enqueued++;
}
@@ -3833,12 +3844,13 @@ static void reenq_user(struct rq *rq, struct scx_dispatch_q *dsq, u64 reenq_flag
while (likely(!READ_ONCE(sch->bypass_depth))) {
struct rq *task_rq;
+ u32 reason;
p = nldsq_cursor_next_task(&cursor, dsq);
if (!p)
break;
- if (!task_should_reenq(p, reenq_flags))
+ if (!task_should_reenq(p, reenq_flags, &reason))
continue;
task_rq = task_rq(p);
@@ -3861,8 +3873,15 @@ static void reenq_user(struct rq *rq, struct scx_dispatch_q *dsq, u64 reenq_flag
/* @p is on @dsq, its rq and @dsq are locked */
dispatch_dequeue_locked(p, dsq);
raw_spin_unlock(&dsq->lock);
+
+ if (WARN_ON_ONCE(p->scx.flags & SCX_TASK_REENQ_REASON_MASK))
+ p->scx.flags &= ~SCX_TASK_REENQ_REASON_MASK;
+ p->scx.flags |= reason;
+
do_enqueue_task(task_rq, p, SCX_ENQ_REENQ, -1);
+ p->scx.flags &= ~SCX_TASK_REENQ_REASON_MASK;
+
if (!(++nr_enqueued % SCX_TASK_ITER_BATCH)) {
raw_spin_rq_unlock(locked_rq);
locked_rq = NULL;
diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index d9eda2e8701c..f8df73044515 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -1080,13 +1080,9 @@ enum scx_enq_flags {
SCX_ENQ_PREEMPT = 1LLU << 32,
/*
- * The task being enqueued was previously enqueued on the current CPU's
- * %SCX_DSQ_LOCAL, but was removed from it in a call to the
- * scx_bpf_reenqueue_local() kfunc. If scx_bpf_reenqueue_local() was
- * invoked in a ->cpu_release() callback, and the task is again
- * dispatched back to %SCX_LOCAL_DSQ by this current ->enqueue(), the
- * task will not be scheduled on the CPU until at least the next invocation
- * of the ->cpu_acquire() callback.
+ * The task being enqueued was previously enqueued on a DSQ, but was
+ * removed and is being re-enqueued. See SCX_TASK_REENQ_* flags to find
+ * out why a given task is being reenqueued.
*/
SCX_ENQ_REENQ = 1LLU << 40,
--
2.53.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH 01/15] sched_ext: Relocate scx_bpf_task_cgroup() and its BTF_ID to the end of kfunc section
2026-03-06 19:06 ` [PATCH 01/15] sched_ext: Relocate scx_bpf_task_cgroup() and its BTF_ID to the end of kfunc section Tejun Heo
@ 2026-03-06 20:45 ` Emil Tsalapatis
2026-03-06 23:20 ` Daniel Jordan
1 sibling, 0 replies; 38+ messages in thread
From: Emil Tsalapatis @ 2026-03-06 20:45 UTC (permalink / raw)
To: Tejun Heo, linux-kernel, sched-ext; +Cc: void, arighi, changwoo
On Fri Mar 6, 2026 at 2:06 PM EST, Tejun Heo wrote:
> Move scx_bpf_task_cgroup() kfunc definition and its BTF_ID entry to the end
> of the kfunc section before __bpf_kfunc_end_defs() for cleaner code
> organization.
>
> No functional changes.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
> ---
> kernel/sched/ext.c | 78 +++++++++++++++++++++++-----------------------
> 1 file changed, 39 insertions(+), 39 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index e25b3593dd30..fe222df1d494 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -8628,43 +8628,6 @@ __bpf_kfunc struct task_struct *scx_bpf_cpu_curr(s32 cpu, const struct bpf_prog_
> return rcu_dereference(cpu_rq(cpu)->curr);
> }
>
> -/**
> - * scx_bpf_task_cgroup - Return the sched cgroup of a task
> - * @p: task of interest
> - * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
> - *
> - * @p->sched_task_group->css.cgroup represents the cgroup @p is associated with
> - * from the scheduler's POV. SCX operations should use this function to
> - * determine @p's current cgroup as, unlike following @p->cgroups,
> - * @p->sched_task_group is protected by @p's rq lock and thus atomic w.r.t. all
> - * rq-locked operations. Can be called on the parameter tasks of rq-locked
> - * operations. The restriction guarantees that @p's rq is locked by the caller.
> - */
> -#ifdef CONFIG_CGROUP_SCHED
> -__bpf_kfunc struct cgroup *scx_bpf_task_cgroup(struct task_struct *p,
> - const struct bpf_prog_aux *aux)
> -{
> - struct task_group *tg = p->sched_task_group;
> - struct cgroup *cgrp = &cgrp_dfl_root.cgrp;
> - struct scx_sched *sch;
> -
> - guard(rcu)();
> -
> - sch = scx_prog_sched(aux);
> - if (unlikely(!sch))
> - goto out;
> -
> - if (!scx_kf_allowed_on_arg_tasks(sch, __SCX_KF_RQ_LOCKED, p))
> - goto out;
> -
> - cgrp = tg_cgrp(tg);
> -
> -out:
> - cgroup_get(cgrp);
> - return cgrp;
> -}
> -#endif
> -
> /**
> * scx_bpf_now - Returns a high-performance monotonically non-decreasing
> * clock for the current CPU. The clock returned is in nanoseconds.
> @@ -8779,6 +8742,43 @@ __bpf_kfunc void scx_bpf_events(struct scx_event_stats *events,
> memcpy(events, &e_sys, events__sz);
> }
>
> +#ifdef CONFIG_CGROUP_SCHED
> +/**
> + * scx_bpf_task_cgroup - Return the sched cgroup of a task
> + * @p: task of interest
> + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
> + *
> + * @p->sched_task_group->css.cgroup represents the cgroup @p is associated with
> + * from the scheduler's POV. SCX operations should use this function to
> + * determine @p's current cgroup as, unlike following @p->cgroups,
> + * @p->sched_task_group is protected by @p's rq lock and thus atomic w.r.t. all
> + * rq-locked operations. Can be called on the parameter tasks of rq-locked
> + * operations. The restriction guarantees that @p's rq is locked by the caller.
> + */
> +__bpf_kfunc struct cgroup *scx_bpf_task_cgroup(struct task_struct *p,
> + const struct bpf_prog_aux *aux)
> +{
> + struct task_group *tg = p->sched_task_group;
> + struct cgroup *cgrp = &cgrp_dfl_root.cgrp;
> + struct scx_sched *sch;
> +
> + guard(rcu)();
> +
> + sch = scx_prog_sched(aux);
> + if (unlikely(!sch))
> + goto out;
> +
> + if (!scx_kf_allowed_on_arg_tasks(sch, __SCX_KF_RQ_LOCKED, p))
> + goto out;
> +
> + cgrp = tg_cgrp(tg);
> +
> +out:
> + cgroup_get(cgrp);
> + return cgrp;
> +}
> +#endif /* CONFIG_CGROUP_SCHED */
> +
> __bpf_kfunc_end_defs();
>
> BTF_KFUNCS_START(scx_kfunc_ids_any)
> @@ -8808,11 +8808,11 @@ BTF_ID_FLAGS(func, scx_bpf_task_cpu, KF_RCU)
> BTF_ID_FLAGS(func, scx_bpf_cpu_rq, KF_IMPLICIT_ARGS)
> BTF_ID_FLAGS(func, scx_bpf_locked_rq, KF_IMPLICIT_ARGS | KF_RET_NULL)
> BTF_ID_FLAGS(func, scx_bpf_cpu_curr, KF_IMPLICIT_ARGS | KF_RET_NULL | KF_RCU_PROTECTED)
> +BTF_ID_FLAGS(func, scx_bpf_now)
> +BTF_ID_FLAGS(func, scx_bpf_events)
> #ifdef CONFIG_CGROUP_SCHED
> BTF_ID_FLAGS(func, scx_bpf_task_cgroup, KF_IMPLICIT_ARGS | KF_RCU | KF_ACQUIRE)
> #endif
> -BTF_ID_FLAGS(func, scx_bpf_now)
> -BTF_ID_FLAGS(func, scx_bpf_events)
> BTF_KFUNCS_END(scx_kfunc_ids_any)
>
> static const struct btf_kfunc_id_set scx_kfunc_set_any = {
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 02/15] sched_ext: Wrap global DSQs in per-node structure
2026-03-06 19:06 ` [PATCH 02/15] sched_ext: Wrap global DSQs in per-node structure Tejun Heo
@ 2026-03-06 20:52 ` Emil Tsalapatis
2026-03-06 23:20 ` Daniel Jordan
1 sibling, 0 replies; 38+ messages in thread
From: Emil Tsalapatis @ 2026-03-06 20:52 UTC (permalink / raw)
To: Tejun Heo, linux-kernel, sched-ext; +Cc: void, arighi, changwoo
On Fri Mar 6, 2026 at 2:06 PM EST, Tejun Heo wrote:
> Global DSQs are currently stored as an array of scx_dispatch_q pointers,
> one per NUMA node. To allow adding more per-node data structures, wrap the
> global DSQ in scx_sched_pnode and replace global_dsqs with pnode array.
>
> NUMA-aware allocation is maintained. No functional changes.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
> ---
> kernel/sched/ext.c | 32 ++++++++++++++++----------------
> kernel/sched/ext_internal.h | 6 +++++-
> 2 files changed, 21 insertions(+), 17 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index fe222df1d494..9232abea4f22 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -344,7 +344,7 @@ static bool scx_is_descendant(struct scx_sched *sch, struct scx_sched *ancestor)
> static struct scx_dispatch_q *find_global_dsq(struct scx_sched *sch,
> struct task_struct *p)
> {
> - return sch->global_dsqs[cpu_to_node(task_cpu(p))];
> + return &sch->pnode[cpu_to_node(task_cpu(p))]->global_dsq;
> }
>
> static struct scx_dispatch_q *find_user_dsq(struct scx_sched *sch, u64 dsq_id)
> @@ -2229,7 +2229,7 @@ static bool consume_global_dsq(struct scx_sched *sch, struct rq *rq)
> {
> int node = cpu_to_node(cpu_of(rq));
>
> - return consume_dispatch_q(sch, rq, sch->global_dsqs[node]);
> + return consume_dispatch_q(sch, rq, &sch->pnode[node]->global_dsq);
> }
>
> /**
> @@ -4148,8 +4148,8 @@ static void scx_sched_free_rcu_work(struct work_struct *work)
> free_percpu(sch->pcpu);
>
> for_each_node_state(node, N_POSSIBLE)
> - kfree(sch->global_dsqs[node]);
> - kfree(sch->global_dsqs);
> + kfree(sch->pnode[node]);
> + kfree(sch->pnode);
>
> rhashtable_walk_enter(&sch->dsq_hash, &rht_iter);
> do {
> @@ -5707,23 +5707,23 @@ static struct scx_sched *scx_alloc_and_add_sched(struct sched_ext_ops *ops,
> if (ret < 0)
> goto err_free_ei;
>
> - sch->global_dsqs = kzalloc_objs(sch->global_dsqs[0], nr_node_ids);
> - if (!sch->global_dsqs) {
> + sch->pnode = kzalloc_objs(sch->pnode[0], nr_node_ids);
> + if (!sch->pnode) {
> ret = -ENOMEM;
> goto err_free_hash;
> }
>
> for_each_node_state(node, N_POSSIBLE) {
> - struct scx_dispatch_q *dsq;
> + struct scx_sched_pnode *pnode;
>
> - dsq = kzalloc_node(sizeof(*dsq), GFP_KERNEL, node);
> - if (!dsq) {
> + pnode = kzalloc_node(sizeof(*pnode), GFP_KERNEL, node);
> + if (!pnode) {
> ret = -ENOMEM;
> - goto err_free_gdsqs;
> + goto err_free_pnode;
> }
>
> - init_dsq(dsq, SCX_DSQ_GLOBAL, sch);
> - sch->global_dsqs[node] = dsq;
> + init_dsq(&pnode->global_dsq, SCX_DSQ_GLOBAL, sch);
> + sch->pnode[node] = pnode;
> }
>
> sch->dsp_max_batch = ops->dispatch_max_batch ?: SCX_DSP_DFL_MAX_BATCH;
> @@ -5732,7 +5732,7 @@ static struct scx_sched *scx_alloc_and_add_sched(struct sched_ext_ops *ops,
> __alignof__(struct scx_sched_pcpu));
> if (!sch->pcpu) {
> ret = -ENOMEM;
> - goto err_free_gdsqs;
> + goto err_free_pnode;
> }
>
> for_each_possible_cpu(cpu)
> @@ -5819,10 +5819,10 @@ static struct scx_sched *scx_alloc_and_add_sched(struct sched_ext_ops *ops,
> kthread_destroy_worker(sch->helper);
> err_free_pcpu:
> free_percpu(sch->pcpu);
> -err_free_gdsqs:
> +err_free_pnode:
> for_each_node_state(node, N_POSSIBLE)
> - kfree(sch->global_dsqs[node]);
> - kfree(sch->global_dsqs);
> + kfree(sch->pnode[node]);
> + kfree(sch->pnode);
> err_free_hash:
> rhashtable_free_and_destroy(&sch->dsq_hash, NULL, NULL);
> err_free_ei:
> diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
> index 4cb97093b872..9e5ebd00ea0c 100644
> --- a/kernel/sched/ext_internal.h
> +++ b/kernel/sched/ext_internal.h
> @@ -975,6 +975,10 @@ struct scx_sched_pcpu {
> struct scx_dsp_ctx dsp_ctx;
> };
>
> +struct scx_sched_pnode {
> + struct scx_dispatch_q global_dsq;
> +};
> +
> struct scx_sched {
> struct sched_ext_ops ops;
> DECLARE_BITMAP(has_op, SCX_OPI_END);
> @@ -988,7 +992,7 @@ struct scx_sched {
> * per-node split isn't sufficient, it can be further split.
> */
> struct rhashtable dsq_hash;
> - struct scx_dispatch_q **global_dsqs;
> + struct scx_sched_pnode **pnode;
> struct scx_sched_pcpu __percpu *pcpu;
>
> u64 slice_dfl;
* Re: [PATCH 03/15] sched_ext: Factor out pnode allocation and deallocation into helpers
2026-03-06 19:06 ` [PATCH 03/15] sched_ext: Factor out pnode allocation and deallocation into helpers Tejun Heo
@ 2026-03-06 20:54 ` Emil Tsalapatis
2026-03-06 23:21 ` Daniel Jordan
1 sibling, 0 replies; 38+ messages in thread
From: Emil Tsalapatis @ 2026-03-06 20:54 UTC (permalink / raw)
To: Tejun Heo, linux-kernel, sched-ext; +Cc: void, arighi, changwoo
On Fri Mar 6, 2026 at 2:06 PM EST, Tejun Heo wrote:
> Extract pnode allocation and deallocation logic into alloc_pnode() and
> free_pnode() helpers. This simplifies scx_alloc_and_add_sched() and prepares
> for adding more per-node initialization and cleanup in subsequent patches.
>
> No functional changes.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
> ---
> kernel/sched/ext.c | 32 +++++++++++++++++++++++---------
> 1 file changed, 23 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 9232abea4f22..c36d399bca3e 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -4114,6 +4114,7 @@ static const struct attribute_group scx_global_attr_group = {
> .attrs = scx_global_attrs,
> };
>
> +static void free_pnode(struct scx_sched_pnode *pnode);
> static void free_exit_info(struct scx_exit_info *ei);
>
> static void scx_sched_free_rcu_work(struct work_struct *work)
> @@ -4148,7 +4149,7 @@ static void scx_sched_free_rcu_work(struct work_struct *work)
> free_percpu(sch->pcpu);
>
> for_each_node_state(node, N_POSSIBLE)
> - kfree(sch->pnode[node]);
> + free_pnode(sch->pnode[node]);
> kfree(sch->pnode);
>
> rhashtable_walk_enter(&sch->dsq_hash, &rht_iter);
> @@ -5685,6 +5686,24 @@ static int alloc_kick_syncs(void)
> return 0;
> }
>
> +static void free_pnode(struct scx_sched_pnode *pnode)
> +{
> + kfree(pnode);
> +}
> +
> +static struct scx_sched_pnode *alloc_pnode(struct scx_sched *sch, int node)
> +{
> + struct scx_sched_pnode *pnode;
> +
> + pnode = kzalloc_node(sizeof(*pnode), GFP_KERNEL, node);
> + if (!pnode)
> + return NULL;
> +
> + init_dsq(&pnode->global_dsq, SCX_DSQ_GLOBAL, sch);
> +
> + return pnode;
> +}
> +
> static struct scx_sched *scx_alloc_and_add_sched(struct sched_ext_ops *ops,
> struct cgroup *cgrp,
> struct scx_sched *parent)
> @@ -5714,16 +5733,11 @@ static struct scx_sched *scx_alloc_and_add_sched(struct sched_ext_ops *ops,
> }
>
> for_each_node_state(node, N_POSSIBLE) {
> - struct scx_sched_pnode *pnode;
> -
> - pnode = kzalloc_node(sizeof(*pnode), GFP_KERNEL, node);
> - if (!pnode) {
> + sch->pnode[node] = alloc_pnode(sch, node);
> + if (!sch->pnode[node]) {
> ret = -ENOMEM;
> goto err_free_pnode;
> }
> -
> - init_dsq(&pnode->global_dsq, SCX_DSQ_GLOBAL, sch);
> - sch->pnode[node] = pnode;
> }
>
> sch->dsp_max_batch = ops->dispatch_max_batch ?: SCX_DSP_DFL_MAX_BATCH;
> @@ -5821,7 +5835,7 @@ static struct scx_sched *scx_alloc_and_add_sched(struct sched_ext_ops *ops,
> free_percpu(sch->pcpu);
> err_free_pnode:
> for_each_node_state(node, N_POSSIBLE)
> - kfree(sch->pnode[node]);
> + free_pnode(sch->pnode[node]);
> kfree(sch->pnode);
> err_free_hash:
> rhashtable_free_and_destroy(&sch->dsq_hash, NULL, NULL);
* Re: [PATCH 04/15] sched_ext: Change find_global_dsq() to take CPU number instead of task
2026-03-06 19:06 ` [PATCH 04/15] sched_ext: Change find_global_dsq() to take CPU number instead of task Tejun Heo
@ 2026-03-06 21:06 ` Emil Tsalapatis
2026-03-06 22:33 ` [PATCH v2 " Tejun Heo
2026-03-06 23:21 ` [PATCH " Daniel Jordan
2 siblings, 0 replies; 38+ messages in thread
From: Emil Tsalapatis @ 2026-03-06 21:06 UTC (permalink / raw)
To: Tejun Heo, linux-kernel, sched-ext; +Cc: void, arighi, changwoo
On Fri Mar 6, 2026 at 2:06 PM EST, Tejun Heo wrote:
> Change find_global_dsq() to take a CPU number directly instead of a task
> pointer. This prepares for callers where the CPU is available but the task is
> not.
>
> No functional changes.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
> ---
> kernel/sched/ext.c | 32 +++++++++++++++-----------------
> 1 file changed, 15 insertions(+), 17 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index c36d399bca3e..c44893878878 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -341,10 +341,9 @@ static bool scx_is_descendant(struct scx_sched *sch, struct scx_sched *ancestor)
> for ((pos) = scx_next_descendant_pre(NULL, (root)); (pos); \
> (pos) = scx_next_descendant_pre((pos), (root)))
>
> -static struct scx_dispatch_q *find_global_dsq(struct scx_sched *sch,
> - struct task_struct *p)
> +static struct scx_dispatch_q *find_global_dsq(struct scx_sched *sch, s32 tcpu)
Nit: maybe rename tcpu to cpu? It's a task's CPU only because the current
callers happen to pass one, and the name is no longer obvious.
> {
> - return &sch->pnode[cpu_to_node(task_cpu(p))]->global_dsq;
> + return &sch->pnode[cpu_to_node(tcpu)]->global_dsq;
> }
>
> static struct scx_dispatch_q *find_user_dsq(struct scx_sched *sch, u64 dsq_id)
> @@ -1266,7 +1265,7 @@ static void dispatch_enqueue(struct scx_sched *sch, struct rq *rq,
> scx_error(sch, "attempting to dispatch to a destroyed dsq");
> /* fall back to the global dsq */
> raw_spin_unlock(&dsq->lock);
> - dsq = find_global_dsq(sch, p);
> + dsq = find_global_dsq(sch, task_cpu(p));
> raw_spin_lock(&dsq->lock);
> }
> }
> @@ -1474,7 +1473,7 @@ static void dispatch_dequeue_locked(struct task_struct *p,
>
> static struct scx_dispatch_q *find_dsq_for_dispatch(struct scx_sched *sch,
> struct rq *rq, u64 dsq_id,
> - struct task_struct *p)
> + s32 tcpu)
> {
> struct scx_dispatch_q *dsq;
>
> @@ -1485,20 +1484,19 @@ static struct scx_dispatch_q *find_dsq_for_dispatch(struct scx_sched *sch,
> s32 cpu = dsq_id & SCX_DSQ_LOCAL_CPU_MASK;
>
> if (!ops_cpu_valid(sch, cpu, "in SCX_DSQ_LOCAL_ON dispatch verdict"))
> - return find_global_dsq(sch, p);
> + return find_global_dsq(sch, tcpu);
>
> return &cpu_rq(cpu)->scx.local_dsq;
> }
>
> if (dsq_id == SCX_DSQ_GLOBAL)
> - dsq = find_global_dsq(sch, p);
> + dsq = find_global_dsq(sch, tcpu);
> else
> dsq = find_user_dsq(sch, dsq_id);
>
> if (unlikely(!dsq)) {
> - scx_error(sch, "non-existent DSQ 0x%llx for %s[%d]",
> - dsq_id, p->comm, p->pid);
> - return find_global_dsq(sch, p);
> + scx_error(sch, "non-existent DSQ 0x%llx", dsq_id);
> + return find_global_dsq(sch, tcpu);
> }
>
> return dsq;
> @@ -1540,7 +1538,7 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
> {
> struct rq *rq = task_rq(p);
> struct scx_dispatch_q *dsq =
> - find_dsq_for_dispatch(sch, rq, p->scx.ddsp_dsq_id, p);
> + find_dsq_for_dispatch(sch, rq, p->scx.ddsp_dsq_id, task_cpu(p));
>
> touch_core_sched_dispatch(rq, p);
>
> @@ -1683,7 +1681,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
> dsq = &rq->scx.local_dsq;
> goto enqueue;
> global:
> - dsq = find_global_dsq(sch, p);
> + dsq = find_global_dsq(sch, task_cpu(p));
> goto enqueue;
> bypass:
> dsq = bypass_enq_target_dsq(sch, task_cpu(p));
> @@ -2140,7 +2138,7 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch,
> dst_rq = container_of(dst_dsq, struct rq, scx.local_dsq);
> if (src_rq != dst_rq &&
> unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
> - dst_dsq = find_global_dsq(sch, p);
> + dst_dsq = find_global_dsq(sch, task_cpu(p));
> dst_rq = src_rq;
> }
> } else {
> @@ -2269,7 +2267,7 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
>
> if (src_rq != dst_rq &&
> unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
> - dispatch_enqueue(sch, rq, find_global_dsq(sch, p), p,
> + dispatch_enqueue(sch, rq, find_global_dsq(sch, task_cpu(p)), p,
> enq_flags | SCX_ENQ_CLEAR_OPSS);
> return;
> }
> @@ -2407,7 +2405,7 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
>
> BUG_ON(!(p->scx.flags & SCX_TASK_QUEUED));
>
> - dsq = find_dsq_for_dispatch(sch, this_rq(), dsq_id, p);
> + dsq = find_dsq_for_dispatch(sch, this_rq(), dsq_id, task_cpu(p));
>
> if (dsq->id == SCX_DSQ_LOCAL)
> dispatch_to_local_dsq(sch, rq, dsq, p, enq_flags);
> @@ -2647,7 +2645,7 @@ static void process_ddsp_deferred_locals(struct rq *rq)
>
> list_del_init(&p->scx.dsq_list.node);
>
> - dsq = find_dsq_for_dispatch(sch, rq, p->scx.ddsp_dsq_id, p);
> + dsq = find_dsq_for_dispatch(sch, rq, p->scx.ddsp_dsq_id, task_cpu(p));
> if (!WARN_ON_ONCE(dsq->id != SCX_DSQ_LOCAL))
> dispatch_to_local_dsq(sch, rq, dsq, p,
> p->scx.ddsp_enq_flags);
> @@ -7410,7 +7408,7 @@ static bool scx_dsq_move(struct bpf_iter_scx_dsq_kern *kit,
> }
>
> /* @p is still on $src_dsq and stable, determine the destination */
> - dst_dsq = find_dsq_for_dispatch(sch, this_rq, dsq_id, p);
> + dst_dsq = find_dsq_for_dispatch(sch, this_rq, dsq_id, task_cpu(p));
>
> /*
> * Apply vtime and slice updates before moving so that the new time is
* Re: [PATCH 05/15] sched_ext: Relocate reenq_local() and run_deferred()
2026-03-06 19:06 ` [PATCH 05/15] sched_ext: Relocate reenq_local() and run_deferred() Tejun Heo
@ 2026-03-06 21:09 ` Emil Tsalapatis
2026-03-06 23:34 ` Daniel Jordan
2026-03-07 0:12 ` [PATCH v2 05/15] sched_ext: Relocate run_deferred() and its callees Tejun Heo
2 siblings, 0 replies; 38+ messages in thread
From: Emil Tsalapatis @ 2026-03-06 21:09 UTC (permalink / raw)
To: Tejun Heo, linux-kernel, sched-ext; +Cc: void, arighi, changwoo
On Fri Mar 6, 2026 at 2:06 PM EST, Tejun Heo wrote:
> Previously, both process_ddsp_deferred_locals() and reenq_local() required
> forward declarations. Reorganize so that only run_deferred() needs to be
> declared. This reduces forward declaration clutter and will ease adding more
> to the run_deferred() path.
>
> No functional changes.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
> ---
> kernel/sched/ext.c | 132 ++++++++++++++++++++++-----------------------
> 1 file changed, 65 insertions(+), 67 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index c44893878878..1b6cd1e4f8b9 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -193,9 +193,8 @@ MODULE_PARM_DESC(bypass_lb_intv_us, "bypass load balance interval in microsecond
> #define CREATE_TRACE_POINTS
> #include <trace/events/sched_ext.h>
>
> -static void process_ddsp_deferred_locals(struct rq *rq);
> +static void run_deferred(struct rq *rq);
> static bool task_dead_and_done(struct task_struct *p);
> -static u32 reenq_local(struct scx_sched *sch, struct rq *rq);
> static void scx_kick_cpu(struct scx_sched *sch, s32 cpu, u64 flags);
> static void scx_disable(struct scx_sched *sch, enum scx_exit_kind kind);
> static bool scx_vexit(struct scx_sched *sch, enum scx_exit_kind kind,
> @@ -1003,23 +1002,6 @@ static int ops_sanitize_err(struct scx_sched *sch, const char *ops_name, s32 err
> return -EPROTO;
> }
>
> -static void run_deferred(struct rq *rq)
> -{
> - process_ddsp_deferred_locals(rq);
> -
> - if (!llist_empty(&rq->scx.deferred_reenq_locals)) {
> - struct llist_node *llist =
> - llist_del_all(&rq->scx.deferred_reenq_locals);
> - struct scx_sched_pcpu *pos, *next;
> -
> - llist_for_each_entry_safe(pos, next, llist,
> - deferred_reenq_locals_node) {
> - init_llist_node(&pos->deferred_reenq_locals_node);
> - reenq_local(pos->sch, rq);
> - }
> - }
> -}
> -
> static void deferred_bal_cb_workfn(struct rq *rq)
> {
> run_deferred(rq);
> @@ -3072,7 +3054,6 @@ static void rq_offline_scx(struct rq *rq)
> rq->scx.flags &= ~SCX_RQ_ONLINE;
> }
>
> -
> static bool check_rq_for_timeouts(struct rq *rq)
> {
> struct scx_sched *sch;
> @@ -3612,6 +3593,70 @@ int scx_check_setscheduler(struct task_struct *p, int policy)
> return 0;
> }
>
> +static u32 reenq_local(struct scx_sched *sch, struct rq *rq)
> +{
> + LIST_HEAD(tasks);
> + u32 nr_enqueued = 0;
> + struct task_struct *p, *n;
> +
> + lockdep_assert_rq_held(rq);
> +
> + /*
> + * The BPF scheduler may choose to dispatch tasks back to
> + * @rq->scx.local_dsq. Move all candidate tasks off to a private list
> + * first to avoid processing the same tasks repeatedly.
> + */
> + list_for_each_entry_safe(p, n, &rq->scx.local_dsq.list,
> + scx.dsq_list.node) {
> + struct scx_sched *task_sch = scx_task_sched(p);
> +
> + /*
> + * If @p is being migrated, @p's current CPU may not agree with
> + * its allowed CPUs and the migration_cpu_stop is about to
> + * deactivate and re-activate @p anyway. Skip re-enqueueing.
> + *
> + * While racing sched property changes may also dequeue and
> + * re-enqueue a migrating task while its current CPU and allowed
> + * CPUs disagree, they use %ENQUEUE_RESTORE which is bypassed to
> + * the current local DSQ for running tasks and thus are not
> + * visible to the BPF scheduler.
> + */
> + if (p->migration_pending)
> + continue;
> +
> + if (!scx_is_descendant(task_sch, sch))
> + continue;
> +
> + dispatch_dequeue(rq, p);
> + list_add_tail(&p->scx.dsq_list.node, &tasks);
> + }
> +
> + list_for_each_entry_safe(p, n, &tasks, scx.dsq_list.node) {
> + list_del_init(&p->scx.dsq_list.node);
> + do_enqueue_task(rq, p, SCX_ENQ_REENQ, -1);
> + nr_enqueued++;
> + }
> +
> + return nr_enqueued;
> +}
> +
> +static void run_deferred(struct rq *rq)
> +{
> + process_ddsp_deferred_locals(rq);
> +
> + if (!llist_empty(&rq->scx.deferred_reenq_locals)) {
> + struct llist_node *llist =
> + llist_del_all(&rq->scx.deferred_reenq_locals);
> + struct scx_sched_pcpu *pos, *next;
> +
> + llist_for_each_entry_safe(pos, next, llist,
> + deferred_reenq_locals_node) {
> + init_llist_node(&pos->deferred_reenq_locals_node);
> + reenq_local(pos->sch, rq);
> + }
> + }
> +}
> +
> #ifdef CONFIG_NO_HZ_FULL
> bool scx_can_stop_tick(struct rq *rq)
> {
> @@ -7702,53 +7747,6 @@ static const struct btf_kfunc_id_set scx_kfunc_set_dispatch = {
> .set = &scx_kfunc_ids_dispatch,
> };
>
> -static u32 reenq_local(struct scx_sched *sch, struct rq *rq)
> -{
> - LIST_HEAD(tasks);
> - u32 nr_enqueued = 0;
> - struct task_struct *p, *n;
> -
> - lockdep_assert_rq_held(rq);
> -
> - /*
> - * The BPF scheduler may choose to dispatch tasks back to
> - * @rq->scx.local_dsq. Move all candidate tasks off to a private list
> - * first to avoid processing the same tasks repeatedly.
> - */
> - list_for_each_entry_safe(p, n, &rq->scx.local_dsq.list,
> - scx.dsq_list.node) {
> - struct scx_sched *task_sch = scx_task_sched(p);
> -
> - /*
> - * If @p is being migrated, @p's current CPU may not agree with
> - * its allowed CPUs and the migration_cpu_stop is about to
> - * deactivate and re-activate @p anyway. Skip re-enqueueing.
> - *
> - * While racing sched property changes may also dequeue and
> - * re-enqueue a migrating task while its current CPU and allowed
> - * CPUs disagree, they use %ENQUEUE_RESTORE which is bypassed to
> - * the current local DSQ for running tasks and thus are not
> - * visible to the BPF scheduler.
> - */
> - if (p->migration_pending)
> - continue;
> -
> - if (!scx_is_descendant(task_sch, sch))
> - continue;
> -
> - dispatch_dequeue(rq, p);
> - list_add_tail(&p->scx.dsq_list.node, &tasks);
> - }
> -
> - list_for_each_entry_safe(p, n, &tasks, scx.dsq_list.node) {
> - list_del_init(&p->scx.dsq_list.node);
> - do_enqueue_task(rq, p, SCX_ENQ_REENQ, -1);
> - nr_enqueued++;
> - }
> -
> - return nr_enqueued;
> -}
> -
> __bpf_kfunc_start_defs();
>
> /**
* [PATCH v2 04/15] sched_ext: Change find_global_dsq() to take CPU number instead of task
2026-03-06 19:06 ` [PATCH 04/15] sched_ext: Change find_global_dsq() to take CPU number instead of task Tejun Heo
2026-03-06 21:06 ` Emil Tsalapatis
@ 2026-03-06 22:33 ` Tejun Heo
2026-03-06 23:21 ` [PATCH " Daniel Jordan
2 siblings, 0 replies; 38+ messages in thread
From: Tejun Heo @ 2026-03-06 22:33 UTC (permalink / raw)
To: linux-kernel, sched-ext; +Cc: void, arighi, changwoo, emil
Change find_global_dsq() to take a CPU number directly instead of a task
pointer. This prepares for callers where the CPU is available but the task is
not.
No functional changes.
v2: Rename tcpu to cpu in find_global_dsq() (Emil Tsalapatis).
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
---
kernel/sched/ext.c | 32 +++++++++++++++-----------------
1 file changed, 15 insertions(+), 17 deletions(-)
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -341,10 +341,9 @@ static bool scx_is_descendant(struct scx
for ((pos) = scx_next_descendant_pre(NULL, (root)); (pos); \
(pos) = scx_next_descendant_pre((pos), (root)))
-static struct scx_dispatch_q *find_global_dsq(struct scx_sched *sch,
- struct task_struct *p)
+static struct scx_dispatch_q *find_global_dsq(struct scx_sched *sch, s32 cpu)
{
- return &sch->pnode[cpu_to_node(task_cpu(p))]->global_dsq;
+ return &sch->pnode[cpu_to_node(cpu)]->global_dsq;
}
static struct scx_dispatch_q *find_user_dsq(struct scx_sched *sch, u64 dsq_id)
@@ -1266,7 +1265,7 @@ static void dispatch_enqueue(struct scx_
scx_error(sch, "attempting to dispatch to a destroyed dsq");
/* fall back to the global dsq */
raw_spin_unlock(&dsq->lock);
- dsq = find_global_dsq(sch, p);
+ dsq = find_global_dsq(sch, task_cpu(p));
raw_spin_lock(&dsq->lock);
}
}
@@ -1474,7 +1473,7 @@ static void dispatch_dequeue_locked(stru
static struct scx_dispatch_q *find_dsq_for_dispatch(struct scx_sched *sch,
struct rq *rq, u64 dsq_id,
- struct task_struct *p)
+ s32 tcpu)
{
struct scx_dispatch_q *dsq;
@@ -1485,20 +1484,19 @@ static struct scx_dispatch_q *find_dsq_f
s32 cpu = dsq_id & SCX_DSQ_LOCAL_CPU_MASK;
if (!ops_cpu_valid(sch, cpu, "in SCX_DSQ_LOCAL_ON dispatch verdict"))
- return find_global_dsq(sch, p);
+ return find_global_dsq(sch, tcpu);
return &cpu_rq(cpu)->scx.local_dsq;
}
if (dsq_id == SCX_DSQ_GLOBAL)
- dsq = find_global_dsq(sch, p);
+ dsq = find_global_dsq(sch, tcpu);
else
dsq = find_user_dsq(sch, dsq_id);
if (unlikely(!dsq)) {
- scx_error(sch, "non-existent DSQ 0x%llx for %s[%d]",
- dsq_id, p->comm, p->pid);
- return find_global_dsq(sch, p);
+ scx_error(sch, "non-existent DSQ 0x%llx", dsq_id);
+ return find_global_dsq(sch, tcpu);
}
return dsq;
@@ -1540,7 +1538,7 @@ static void direct_dispatch(struct scx_s
{
struct rq *rq = task_rq(p);
struct scx_dispatch_q *dsq =
- find_dsq_for_dispatch(sch, rq, p->scx.ddsp_dsq_id, p);
+ find_dsq_for_dispatch(sch, rq, p->scx.ddsp_dsq_id, task_cpu(p));
touch_core_sched_dispatch(rq, p);
@@ -1683,7 +1681,7 @@ local:
dsq = &rq->scx.local_dsq;
goto enqueue;
global:
- dsq = find_global_dsq(sch, p);
+ dsq = find_global_dsq(sch, task_cpu(p));
goto enqueue;
bypass:
dsq = bypass_enq_target_dsq(sch, task_cpu(p));
@@ -2140,7 +2138,7 @@ static struct rq *move_task_between_dsqs
dst_rq = container_of(dst_dsq, struct rq, scx.local_dsq);
if (src_rq != dst_rq &&
unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
- dst_dsq = find_global_dsq(sch, p);
+ dst_dsq = find_global_dsq(sch, task_cpu(p));
dst_rq = src_rq;
}
} else {
@@ -2269,7 +2267,7 @@ static void dispatch_to_local_dsq(struct
if (src_rq != dst_rq &&
unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
- dispatch_enqueue(sch, rq, find_global_dsq(sch, p), p,
+ dispatch_enqueue(sch, rq, find_global_dsq(sch, task_cpu(p)), p,
enq_flags | SCX_ENQ_CLEAR_OPSS);
return;
}
@@ -2407,7 +2405,7 @@ retry:
BUG_ON(!(p->scx.flags & SCX_TASK_QUEUED));
- dsq = find_dsq_for_dispatch(sch, this_rq(), dsq_id, p);
+ dsq = find_dsq_for_dispatch(sch, this_rq(), dsq_id, task_cpu(p));
if (dsq->id == SCX_DSQ_LOCAL)
dispatch_to_local_dsq(sch, rq, dsq, p, enq_flags);
@@ -2647,7 +2645,7 @@ static void process_ddsp_deferred_locals
list_del_init(&p->scx.dsq_list.node);
- dsq = find_dsq_for_dispatch(sch, rq, p->scx.ddsp_dsq_id, p);
+ dsq = find_dsq_for_dispatch(sch, rq, p->scx.ddsp_dsq_id, task_cpu(p));
if (!WARN_ON_ONCE(dsq->id != SCX_DSQ_LOCAL))
dispatch_to_local_dsq(sch, rq, dsq, p,
p->scx.ddsp_enq_flags);
@@ -7410,7 +7408,7 @@ static bool scx_dsq_move(struct bpf_iter
}
/* @p is still on $src_dsq and stable, determine the destination */
- dst_dsq = find_dsq_for_dispatch(sch, this_rq, dsq_id, p);
+ dst_dsq = find_dsq_for_dispatch(sch, this_rq, dsq_id, task_cpu(p));
/*
* Apply vtime and slice updates before moving so that the new time is
* Re: [PATCH 10/15] sched_ext: Add per-CPU data to DSQs
2026-03-06 19:06 ` [PATCH 10/15] sched_ext: Add per-CPU data to DSQs Tejun Heo
@ 2026-03-06 22:54 ` Andrea Righi
2026-03-06 22:56 ` Andrea Righi
2026-03-06 23:09 ` [PATCH v2 " Tejun Heo
1 sibling, 1 reply; 38+ messages in thread
From: Andrea Righi @ 2026-03-06 22:54 UTC (permalink / raw)
To: Tejun Heo; +Cc: linux-kernel, sched-ext, void, changwoo, emil
Hi Tejun,
On Fri, Mar 06, 2026 at 09:06:18AM -1000, Tejun Heo wrote:
> Add per-CPU data structure to dispatch queues. Each DSQ now has a percpu
> scx_dsq_pcpu which contains a back-pointer to the DSQ. This will be used by
> future changes to implement per-CPU reenqueue tracking for user DSQs.
>
> init_dsq() now allocates the percpu data and can fail, so it returns an
> error code. All callers are updated to handle failures. exit_dsq() is added
> to free the percpu data and is called from all DSQ cleanup paths.
>
> In scx_bpf_create_dsq(), init_dsq() is called before rcu_read_lock() since
> alloc_percpu() requires GFP_KERNEL context, and dsq->sched is set
> afterwards.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
...
> @@ -5849,8 +5884,11 @@ static struct scx_sched *scx_alloc_and_add_sched(struct sched_ext_ops *ops,
> goto err_free_pnode;
> }
>
> - for_each_possible_cpu(cpu)
> - init_dsq(bypass_dsq(sch, cpu), SCX_DSQ_BYPASS, sch);
> + for_each_possible_cpu(cpu) {
> + ret = init_dsq(bypass_dsq(sch, cpu), SCX_DSQ_BYPASS, sch);
> + if (ret)
> + goto err_free_pcpu;
If we fail here...
> + }
>
> for_each_possible_cpu(cpu) {
> struct scx_sched_pcpu *pcpu = per_cpu_ptr(sch->pcpu, cpu);
> @@ -5932,6 +5970,10 @@ static struct scx_sched *scx_alloc_and_add_sched(struct sched_ext_ops *ops,
> err_stop_helper:
> kthread_destroy_worker(sch->helper);
> err_free_pcpu:
> + for_each_possible_cpu(cpu) {
> + if (bypass_dsq(sch, cpu))
> + exit_dsq(bypass_dsq(sch, cpu));
> + }
...we are still calling exit_dsq() for all the CPUs. We should probably
call it only on the initialized ones, or we may touch uninitialized mem.
Thanks,
-Andrea
* Re: [PATCH 10/15] sched_ext: Add per-CPU data to DSQs
2026-03-06 22:54 ` Andrea Righi
@ 2026-03-06 22:56 ` Andrea Righi
0 siblings, 0 replies; 38+ messages in thread
From: Andrea Righi @ 2026-03-06 22:56 UTC (permalink / raw)
To: Tejun Heo; +Cc: linux-kernel, sched-ext, void, changwoo, emil
On Fri, Mar 06, 2026 at 11:54:15PM +0100, Andrea Righi wrote:
> Hi Tejun,
>
> On Fri, Mar 06, 2026 at 09:06:18AM -1000, Tejun Heo wrote:
> > Add per-CPU data structure to dispatch queues. Each DSQ now has a percpu
> > scx_dsq_pcpu which contains a back-pointer to the DSQ. This will be used by
> > future changes to implement per-CPU reenqueue tracking for user DSQs.
> >
> > init_dsq() now allocates the percpu data and can fail, so it returns an
> > error code. All callers are updated to handle failures. exit_dsq() is added
> > to free the percpu data and is called from all DSQ cleanup paths.
> >
> > In scx_bpf_create_dsq(), init_dsq() is called before rcu_read_lock() since
> > alloc_percpu() requires GFP_KERNEL context, and dsq->sched is set
> > afterwards.
> >
> > Signed-off-by: Tejun Heo <tj@kernel.org>
> > ---
> ...
> > @@ -5849,8 +5884,11 @@ static struct scx_sched *scx_alloc_and_add_sched(struct sched_ext_ops *ops,
> > goto err_free_pnode;
> > }
> >
> > - for_each_possible_cpu(cpu)
> > - init_dsq(bypass_dsq(sch, cpu), SCX_DSQ_BYPASS, sch);
> > + for_each_possible_cpu(cpu) {
> > + ret = init_dsq(bypass_dsq(sch, cpu), SCX_DSQ_BYPASS, sch);
> > + if (ret)
> > + goto err_free_pcpu;
>
> If we fail here...
>
> > + }
> >
> > for_each_possible_cpu(cpu) {
> > struct scx_sched_pcpu *pcpu = per_cpu_ptr(sch->pcpu, cpu);
> > @@ -5932,6 +5970,10 @@ static struct scx_sched *scx_alloc_and_add_sched(struct sched_ext_ops *ops,
> > err_stop_helper:
> > kthread_destroy_worker(sch->helper);
> > err_free_pcpu:
> > + for_each_possible_cpu(cpu) {
> > + if (bypass_dsq(sch, cpu))
> > + exit_dsq(bypass_dsq(sch, cpu));
> > + }
>
> ...we are still calling exit_dsq() for all the CPUs. We should probably
> call it only on the initialized ones, or we may touch uninitialized mem.
Moreover, bypass_dsq() always returns non-NULL, so the redundant
"if (bypass_dsq(sch, cpu))" check can be dropped.
Thanks,
-Andrea
* [PATCH v2 10/15] sched_ext: Add per-CPU data to DSQs
2026-03-06 19:06 ` [PATCH 10/15] sched_ext: Add per-CPU data to DSQs Tejun Heo
2026-03-06 22:54 ` Andrea Righi
@ 2026-03-06 23:09 ` Tejun Heo
1 sibling, 0 replies; 38+ messages in thread
From: Tejun Heo @ 2026-03-06 23:09 UTC (permalink / raw)
To: linux-kernel, sched-ext; +Cc: void, arighi, changwoo, emil
Add per-CPU data structure to dispatch queues. Each DSQ now has a percpu
scx_dsq_pcpu which contains a back-pointer to the DSQ. This will be used by
future changes to implement per-CPU reenqueue tracking for user DSQs.
init_dsq() now allocates the percpu data and can fail, so it returns an
error code. All callers are updated to handle failures. exit_dsq() is added
to free the percpu data and is called from all DSQ cleanup paths.
In scx_bpf_create_dsq(), init_dsq() is called before rcu_read_lock() since
alloc_percpu() requires GFP_KERNEL context, and dsq->sched is set
afterwards.
v2: Fix err_free_pcpu to only exit_dsq() initialized bypass DSQs (Andrea
Righi).
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrea Righi <arighi@nvidia.com>
---
include/linux/sched/ext.h | 5 ++
kernel/sched/ext.c | 87 ++++++++++++++++++++++++++++++++++++++--------
2 files changed, 77 insertions(+), 15 deletions(-)
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -62,6 +62,10 @@ enum scx_dsq_id_flags {
SCX_DSQ_LOCAL_CPU_MASK = 0xffffffffLLU,
};
+struct scx_dsq_pcpu {
+ struct scx_dispatch_q *dsq;
+};
+
/*
* A dispatch queue (DSQ) can be either a FIFO or p->scx.dsq_vtime ordered
* queue. A built-in DSQ is always a FIFO. The built-in local DSQs are used to
@@ -79,6 +83,7 @@ struct scx_dispatch_q {
struct rhash_head hash_node;
struct llist_node free_node;
struct scx_sched *sched;
+ struct scx_dsq_pcpu __percpu *pcpu;
struct rcu_head rcu;
};
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -4021,15 +4021,42 @@ DEFINE_SCHED_CLASS(ext) = {
#endif
};
-static void init_dsq(struct scx_dispatch_q *dsq, u64 dsq_id,
- struct scx_sched *sch)
+static s32 init_dsq(struct scx_dispatch_q *dsq, u64 dsq_id,
+ struct scx_sched *sch)
{
+ s32 cpu;
+
memset(dsq, 0, sizeof(*dsq));
raw_spin_lock_init(&dsq->lock);
INIT_LIST_HEAD(&dsq->list);
dsq->id = dsq_id;
dsq->sched = sch;
+
+ dsq->pcpu = alloc_percpu(struct scx_dsq_pcpu);
+ if (!dsq->pcpu)
+ return -ENOMEM;
+
+ for_each_possible_cpu(cpu) {
+ struct scx_dsq_pcpu *pcpu = per_cpu_ptr(dsq->pcpu, cpu);
+
+ pcpu->dsq = dsq;
+ }
+
+ return 0;
+}
+
+static void exit_dsq(struct scx_dispatch_q *dsq)
+{
+ free_percpu(dsq->pcpu);
+}
+
+static void free_dsq_rcufn(struct rcu_head *rcu)
+{
+ struct scx_dispatch_q *dsq = container_of(rcu, struct scx_dispatch_q, rcu);
+
+ exit_dsq(dsq);
+ kfree(dsq);
}
static void free_dsq_irq_workfn(struct irq_work *irq_work)
@@ -4038,7 +4065,7 @@ static void free_dsq_irq_workfn(struct i
struct scx_dispatch_q *dsq, *tmp_dsq;
llist_for_each_entry_safe(dsq, tmp_dsq, to_free, free_node)
- kfree_rcu(dsq, rcu);
+ call_rcu(&dsq->rcu, free_dsq_rcufn);
}
static DEFINE_IRQ_WORK(free_dsq_irq_work, free_dsq_irq_workfn);
@@ -4235,15 +4262,17 @@ static void scx_sched_free_rcu_work(stru
cgroup_put(sch_cgroup(sch));
#endif /* CONFIG_EXT_SUB_SCHED */
- /*
- * $sch would have entered bypass mode before the RCU grace period. As
- * that blocks new deferrals, all deferred_reenq_local_node's must be
- * off-list by now.
- */
for_each_possible_cpu(cpu) {
struct scx_sched_pcpu *pcpu = per_cpu_ptr(sch->pcpu, cpu);
+ /*
+ * $sch would have entered bypass mode before the RCU grace
+ * period. As that blocks new deferrals, all
+ * deferred_reenq_local_node's must be off-list by now.
+ */
WARN_ON_ONCE(!list_empty(&pcpu->deferred_reenq_local.node));
+
+ exit_dsq(bypass_dsq(sch, cpu));
}
free_percpu(sch->pcpu);
@@ -5788,6 +5817,9 @@ static int alloc_kick_syncs(void)
static void free_pnode(struct scx_sched_pnode *pnode)
{
+ if (!pnode)
+ return;
+ exit_dsq(&pnode->global_dsq);
kfree(pnode);
}
@@ -5799,7 +5831,10 @@ static struct scx_sched_pnode *alloc_pno
if (!pnode)
return NULL;
- init_dsq(&pnode->global_dsq, SCX_DSQ_GLOBAL, sch);
+ if (init_dsq(&pnode->global_dsq, SCX_DSQ_GLOBAL, sch)) {
+ kfree(pnode);
+ return NULL;
+ }
return pnode;
}
@@ -5810,7 +5845,7 @@ static struct scx_sched *scx_alloc_and_a
{
struct scx_sched *sch;
s32 level = parent ? parent->level + 1 : 0;
- s32 node, cpu, ret;
+ s32 node, cpu, ret, bypass_fail_cpu = nr_cpu_ids;
sch = kzalloc_flex(*sch, ancestors, level);
if (!sch)
@@ -5849,8 +5884,13 @@ static struct scx_sched *scx_alloc_and_a
goto err_free_pnode;
}
- for_each_possible_cpu(cpu)
- init_dsq(bypass_dsq(sch, cpu), SCX_DSQ_BYPASS, sch);
+ for_each_possible_cpu(cpu) {
+ ret = init_dsq(bypass_dsq(sch, cpu), SCX_DSQ_BYPASS, sch);
+ if (ret) {
+ bypass_fail_cpu = cpu;
+ goto err_free_pcpu;
+ }
+ }
for_each_possible_cpu(cpu) {
struct scx_sched_pcpu *pcpu = per_cpu_ptr(sch->pcpu, cpu);
@@ -5932,6 +5972,11 @@ static struct scx_sched *scx_alloc_and_a
err_stop_helper:
kthread_destroy_worker(sch->helper);
err_free_pcpu:
+ for_each_possible_cpu(cpu) {
+ if (cpu == bypass_fail_cpu)
+ break;
+ exit_dsq(bypass_dsq(sch, cpu));
+ }
free_percpu(sch->pcpu);
err_free_pnode:
for_each_node_state(node, N_POSSIBLE)
@@ -7174,7 +7219,7 @@ void __init init_sched_ext_class(void)
int n = cpu_to_node(cpu);
/* local_dsq's sch will be set during scx_root_enable() */
- init_dsq(&rq->scx.local_dsq, SCX_DSQ_LOCAL, NULL);
+ BUG_ON(init_dsq(&rq->scx.local_dsq, SCX_DSQ_LOCAL, NULL));
INIT_LIST_HEAD(&rq->scx.runnable_list);
INIT_LIST_HEAD(&rq->scx.ddsp_deferred_locals);
@@ -7873,11 +7918,21 @@ __bpf_kfunc s32 scx_bpf_create_dsq(u64 d
if (!dsq)
return -ENOMEM;
+ /*
+ * init_dsq() must be called in GFP_KERNEL context. Init it with NULL
+ * @sch and update afterwards.
+ */
+ ret = init_dsq(dsq, dsq_id, NULL);
+ if (ret) {
+ kfree(dsq);
+ return ret;
+ }
+
rcu_read_lock();
sch = scx_prog_sched(aux);
if (sch) {
- init_dsq(dsq, dsq_id, sch);
+ dsq->sched = sch;
ret = rhashtable_lookup_insert_fast(&sch->dsq_hash, &dsq->hash_node,
dsq_hash_params);
} else {
@@ -7885,8 +7940,10 @@ __bpf_kfunc s32 scx_bpf_create_dsq(u64 d
}
rcu_read_unlock();
- if (ret)
+ if (ret) {
+ exit_dsq(dsq);
kfree(dsq);
+ }
return ret;
}
* Re: [PATCHSET sched_ext/for-7.1] sched_ext: Overhaul DSQ reenqueue infrastructure
2026-03-06 19:06 [PATCHSET sched_ext/for-7.1] sched_ext: Overhaul DSQ reenqueue infrastructure Tejun Heo
` (14 preceding siblings ...)
2026-03-06 19:06 ` [PATCH 15/15] sched_ext: Add SCX_TASK_REENQ_REASON flags Tejun Heo
@ 2026-03-06 23:14 ` Andrea Righi
2026-03-07 15:38 ` Tejun Heo
16 siblings, 0 replies; 38+ messages in thread
From: Andrea Righi @ 2026-03-06 23:14 UTC (permalink / raw)
To: Tejun Heo; +Cc: linux-kernel, sched-ext, void, changwoo, emil
On Fri, Mar 06, 2026 at 09:06:08AM -1000, Tejun Heo wrote:
> scx_bpf_reenqueue_local() currently only supports the current CPU's local
> DSQ and must be called from ops.cpu_release(). This patchset overhauls the
> reenqueue infrastructure to support flexible, remote, and user DSQ reenqueue
> operations through the new scx_bpf_dsq_reenq() kfunc.
>
> The patchset:
>
> - Refactors per-node data structures and DSQ lookup helpers as preparation.
>
> - Converts the deferred reenqueue mechanism from lockless llist to a
> spinlock-protected regular list to support more complex list operations.
>
> - Introduces scx_bpf_dsq_reenq() which can target any local DSQ including
> remote CPUs via SCX_DSQ_LOCAL_ON | cpu, and user-defined DSQs.
>
> - Adds per-CPU data to DSQs and cursor-based iteration helpers to support
> user DSQ reenqueue with proper multi-rq locking.
>
> - Adds reenqueue flags plumbing and a lockless fast-path optimization.
>
> - Simplifies task state handling and adds SCX_TASK_REENQ_REASON flags so BPF
> schedulers can distinguish why a task is being reenqueued.
>
> scx_bpf_reenqueue_local() is reimplemented as a wrapper around
> scx_bpf_dsq_reenq(SCX_DSQ_LOCAL, 0) and may be deprecated in the future.
This is really useful! Small issue on patch 10/15 (that you've already
fixed). Everything else looks good to me.
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Thanks.
-Andrea
* Re: [PATCH 01/15] sched_ext: Relocate scx_bpf_task_cgroup() and its BTF_ID to the end of kfunc section
2026-03-06 19:06 ` [PATCH 01/15] sched_ext: Relocate scx_bpf_task_cgroup() and its BTF_ID to the end of kfunc section Tejun Heo
2026-03-06 20:45 ` Emil Tsalapatis
@ 2026-03-06 23:20 ` Daniel Jordan
1 sibling, 0 replies; 38+ messages in thread
From: Daniel Jordan @ 2026-03-06 23:20 UTC (permalink / raw)
To: Tejun Heo; +Cc: linux-kernel, sched-ext, void, arighi, changwoo, emil
On Fri, Mar 06, 2026 at 09:06:09AM -1000, Tejun Heo wrote:
> Move scx_bpf_task_cgroup() kfunc definition and its BTF_ID entry to the end
> of the kfunc section before __bpf_kfunc_end_defs() for cleaner code
> organization.
>
> No functional changes.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
* Re: [PATCH 02/15] sched_ext: Wrap global DSQs in per-node structure
2026-03-06 19:06 ` [PATCH 02/15] sched_ext: Wrap global DSQs in per-node structure Tejun Heo
2026-03-06 20:52 ` Emil Tsalapatis
@ 2026-03-06 23:20 ` Daniel Jordan
1 sibling, 0 replies; 38+ messages in thread
From: Daniel Jordan @ 2026-03-06 23:20 UTC (permalink / raw)
To: Tejun Heo; +Cc: linux-kernel, sched-ext, void, arighi, changwoo, emil
On Fri, Mar 06, 2026 at 09:06:10AM -1000, Tejun Heo wrote:
> Global DSQs are currently stored as an array of scx_dispatch_q pointers,
> one per NUMA node. To allow adding more per-node data structures, wrap the
> global DSQ in scx_sched_pnode and replace global_dsqs with pnode array.
>
> NUMA-aware allocation is maintained. No functional changes.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
* Re: [PATCH 03/15] sched_ext: Factor out pnode allocation and deallocation into helpers
2026-03-06 19:06 ` [PATCH 03/15] sched_ext: Factor out pnode allocation and deallocation into helpers Tejun Heo
2026-03-06 20:54 ` Emil Tsalapatis
@ 2026-03-06 23:21 ` Daniel Jordan
1 sibling, 0 replies; 38+ messages in thread
From: Daniel Jordan @ 2026-03-06 23:21 UTC (permalink / raw)
To: Tejun Heo; +Cc: linux-kernel, sched-ext, void, arighi, changwoo, emil
On Fri, Mar 06, 2026 at 09:06:11AM -1000, Tejun Heo wrote:
> Extract pnode allocation and deallocation logic into alloc_pnode() and
> free_pnode() helpers. This simplifies scx_alloc_and_add_sched() and prepares
> for adding more per-node initialization and cleanup in subsequent patches.
>
> No functional changes.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
* Re: [PATCH 04/15] sched_ext: Change find_global_dsq() to take CPU number instead of task
2026-03-06 19:06 ` [PATCH 04/15] sched_ext: Change find_global_dsq() to take CPU number instead of task Tejun Heo
2026-03-06 21:06 ` Emil Tsalapatis
2026-03-06 22:33 ` [PATCH v2 " Tejun Heo
@ 2026-03-06 23:21 ` Daniel Jordan
2 siblings, 0 replies; 38+ messages in thread
From: Daniel Jordan @ 2026-03-06 23:21 UTC (permalink / raw)
To: Tejun Heo; +Cc: linux-kernel, sched-ext, void, arighi, changwoo, emil
On Fri, Mar 06, 2026 at 09:06:12AM -1000, Tejun Heo wrote:
> Change find_global_dsq() to take a CPU number directly instead of a task
> pointer. This prepares for callers where the CPU is available but the task is
> not.
>
> No functional changes.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
* Re: [PATCH 05/15] sched_ext: Relocate reenq_local() and run_deferred()
2026-03-06 19:06 ` [PATCH 05/15] sched_ext: Relocate reenq_local() and run_deferred() Tejun Heo
2026-03-06 21:09 ` Emil Tsalapatis
@ 2026-03-06 23:34 ` Daniel Jordan
2026-03-07 0:12 ` [PATCH v2 05/15] sched_ext: Relocate run_deferred() and its callees Tejun Heo
2 siblings, 0 replies; 38+ messages in thread
From: Daniel Jordan @ 2026-03-06 23:34 UTC (permalink / raw)
To: Tejun Heo; +Cc: linux-kernel, sched-ext, void, arighi, changwoo, emil
On Fri, Mar 06, 2026 at 09:06:13AM -1000, Tejun Heo wrote:
> Previously, both process_ddsp_deferred_locals() and reenq_local() required
> forward declarations. Reorganize so that only run_deferred() needs to be
> declared. This reduces forward declaration clutter and will ease adding more
> to the run_deferred() path.
>
> No functional changes.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
process_ddsp_deferred_locals() could be moved closer to run_deferred()
like reenq_local() is.
* [PATCH v2 05/15] sched_ext: Relocate run_deferred() and its callees
2026-03-06 19:06 ` [PATCH 05/15] sched_ext: Relocate reenq_local() and run_deferred() Tejun Heo
2026-03-06 21:09 ` Emil Tsalapatis
2026-03-06 23:34 ` Daniel Jordan
@ 2026-03-07 0:12 ` Tejun Heo
2 siblings, 0 replies; 38+ messages in thread
From: Tejun Heo @ 2026-03-07 0:12 UTC (permalink / raw)
To: sched-ext, linux-kernel
Cc: David Vernet, Andrea Righi, Changwoo Min, Emil Tsalapatis,
Daniel Jordan
Previously, both process_ddsp_deferred_locals() and reenq_local() required
forward declarations. Reorganize so that only run_deferred() needs to be
declared. Both callees are grouped right before run_deferred() for better
locality. This reduces forward declaration clutter and will ease adding more
to the run_deferred() path.
No functional changes.
v2: Also relocate process_ddsp_deferred_locals() next to run_deferred()
(Daniel Jordan).
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---
kernel/sched/ext.c | 186 ++++++++++++++++++++++++++---------------------------
1 file changed, 92 insertions(+), 94 deletions(-)
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -193,9 +193,8 @@ MODULE_PARM_DESC(bypass_lb_intv_us, "byp
#define CREATE_TRACE_POINTS
#include <trace/events/sched_ext.h>
-static void process_ddsp_deferred_locals(struct rq *rq);
+static void run_deferred(struct rq *rq);
static bool task_dead_and_done(struct task_struct *p);
-static u32 reenq_local(struct scx_sched *sch, struct rq *rq);
static void scx_kick_cpu(struct scx_sched *sch, s32 cpu, u64 flags);
static void scx_disable(struct scx_sched *sch, enum scx_exit_kind kind);
static bool scx_vexit(struct scx_sched *sch, enum scx_exit_kind kind,
@@ -1003,23 +1002,6 @@ static int ops_sanitize_err(struct scx_s
return -EPROTO;
}
-static void run_deferred(struct rq *rq)
-{
- process_ddsp_deferred_locals(rq);
-
- if (!llist_empty(&rq->scx.deferred_reenq_locals)) {
- struct llist_node *llist =
- llist_del_all(&rq->scx.deferred_reenq_locals);
- struct scx_sched_pcpu *pos, *next;
-
- llist_for_each_entry_safe(pos, next, llist,
- deferred_reenq_locals_node) {
- init_llist_node(&pos->deferred_reenq_locals_node);
- reenq_local(pos->sch, rq);
- }
- }
-}
-
static void deferred_bal_cb_workfn(struct rq *rq)
{
run_deferred(rq);
@@ -2625,33 +2607,6 @@ has_tasks:
return true;
}
-static void process_ddsp_deferred_locals(struct rq *rq)
-{
- struct task_struct *p;
-
- lockdep_assert_rq_held(rq);
-
- /*
- * Now that @rq can be unlocked, execute the deferred enqueueing of
- * tasks directly dispatched to the local DSQs of other CPUs. See
- * direct_dispatch(). Keep popping from the head instead of using
- * list_for_each_entry_safe() as dispatch_local_dsq() may unlock @rq
- * temporarily.
- */
- while ((p = list_first_entry_or_null(&rq->scx.ddsp_deferred_locals,
- struct task_struct, scx.dsq_list.node))) {
- struct scx_sched *sch = scx_task_sched(p);
- struct scx_dispatch_q *dsq;
-
- list_del_init(&p->scx.dsq_list.node);
-
- dsq = find_dsq_for_dispatch(sch, rq, p->scx.ddsp_dsq_id, task_cpu(p));
- if (!WARN_ON_ONCE(dsq->id != SCX_DSQ_LOCAL))
- dispatch_to_local_dsq(sch, rq, dsq, p,
- p->scx.ddsp_enq_flags);
- }
-}
-
static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first)
{
struct scx_sched *sch = scx_task_sched(p);
@@ -3072,7 +3027,6 @@ static void rq_offline_scx(struct rq *rq
rq->scx.flags &= ~SCX_RQ_ONLINE;
}
-
static bool check_rq_for_timeouts(struct rq *rq)
{
struct scx_sched *sch;
@@ -3612,6 +3566,97 @@ int scx_check_setscheduler(struct task_s
return 0;
}
+static void process_ddsp_deferred_locals(struct rq *rq)
+{
+ struct task_struct *p;
+
+ lockdep_assert_rq_held(rq);
+
+ /*
+ * Now that @rq can be unlocked, execute the deferred enqueueing of
+ * tasks directly dispatched to the local DSQs of other CPUs. See
+ * direct_dispatch(). Keep popping from the head instead of using
+ * list_for_each_entry_safe() as dispatch_local_dsq() may unlock @rq
+ * temporarily.
+ */
+ while ((p = list_first_entry_or_null(&rq->scx.ddsp_deferred_locals,
+ struct task_struct, scx.dsq_list.node))) {
+ struct scx_sched *sch = scx_task_sched(p);
+ struct scx_dispatch_q *dsq;
+
+ list_del_init(&p->scx.dsq_list.node);
+
+ dsq = find_dsq_for_dispatch(sch, rq, p->scx.ddsp_dsq_id, task_cpu(p));
+ if (!WARN_ON_ONCE(dsq->id != SCX_DSQ_LOCAL))
+ dispatch_to_local_dsq(sch, rq, dsq, p,
+ p->scx.ddsp_enq_flags);
+ }
+}
+
+static u32 reenq_local(struct scx_sched *sch, struct rq *rq)
+{
+ LIST_HEAD(tasks);
+ u32 nr_enqueued = 0;
+ struct task_struct *p, *n;
+
+ lockdep_assert_rq_held(rq);
+
+ /*
+ * The BPF scheduler may choose to dispatch tasks back to
+ * @rq->scx.local_dsq. Move all candidate tasks off to a private list
+ * first to avoid processing the same tasks repeatedly.
+ */
+ list_for_each_entry_safe(p, n, &rq->scx.local_dsq.list,
+ scx.dsq_list.node) {
+ struct scx_sched *task_sch = scx_task_sched(p);
+
+ /*
+ * If @p is being migrated, @p's current CPU may not agree with
+ * its allowed CPUs and the migration_cpu_stop is about to
+ * deactivate and re-activate @p anyway. Skip re-enqueueing.
+ *
+ * While racing sched property changes may also dequeue and
+ * re-enqueue a migrating task while its current CPU and allowed
+ * CPUs disagree, they use %ENQUEUE_RESTORE which is bypassed to
+ * the current local DSQ for running tasks and thus are not
+ * visible to the BPF scheduler.
+ */
+ if (p->migration_pending)
+ continue;
+
+ if (!scx_is_descendant(task_sch, sch))
+ continue;
+
+ dispatch_dequeue(rq, p);
+ list_add_tail(&p->scx.dsq_list.node, &tasks);
+ }
+
+ list_for_each_entry_safe(p, n, &tasks, scx.dsq_list.node) {
+ list_del_init(&p->scx.dsq_list.node);
+ do_enqueue_task(rq, p, SCX_ENQ_REENQ, -1);
+ nr_enqueued++;
+ }
+
+ return nr_enqueued;
+}
+
+static void run_deferred(struct rq *rq)
+{
+ process_ddsp_deferred_locals(rq);
+
+ if (!llist_empty(&rq->scx.deferred_reenq_locals)) {
+ struct llist_node *llist =
+ llist_del_all(&rq->scx.deferred_reenq_locals);
+ struct scx_sched_pcpu *pos, *next;
+
+ llist_for_each_entry_safe(pos, next, llist,
+ deferred_reenq_locals_node) {
+ init_llist_node(&pos->deferred_reenq_locals_node);
+ reenq_local(pos->sch, rq);
+ }
+ }
+}
+
#ifdef CONFIG_NO_HZ_FULL
bool scx_can_stop_tick(struct rq *rq)
{
@@ -7702,53 +7747,6 @@ static const struct btf_kfunc_id_set scx
.set = &scx_kfunc_ids_dispatch,
};
-static u32 reenq_local(struct scx_sched *sch, struct rq *rq)
-{
- LIST_HEAD(tasks);
- u32 nr_enqueued = 0;
- struct task_struct *p, *n;
-
- lockdep_assert_rq_held(rq);
-
- /*
- * The BPF scheduler may choose to dispatch tasks back to
- * @rq->scx.local_dsq. Move all candidate tasks off to a private list
- * first to avoid processing the same tasks repeatedly.
- */
- list_for_each_entry_safe(p, n, &rq->scx.local_dsq.list,
- scx.dsq_list.node) {
- struct scx_sched *task_sch = scx_task_sched(p);
-
- /*
- * If @p is being migrated, @p's current CPU may not agree with
- * its allowed CPUs and the migration_cpu_stop is about to
- * deactivate and re-activate @p anyway. Skip re-enqueueing.
- *
- * While racing sched property changes may also dequeue and
- * re-enqueue a migrating task while its current CPU and allowed
- * CPUs disagree, they use %ENQUEUE_RESTORE which is bypassed to
- * the current local DSQ for running tasks and thus are not
- * visible to the BPF scheduler.
- */
- if (p->migration_pending)
- continue;
-
- if (!scx_is_descendant(task_sch, sch))
- continue;
-
- dispatch_dequeue(rq, p);
- list_add_tail(&p->scx.dsq_list.node, &tasks);
- }
-
- list_for_each_entry_safe(p, n, &tasks, scx.dsq_list.node) {
- list_del_init(&p->scx.dsq_list.node);
- do_enqueue_task(rq, p, SCX_ENQ_REENQ, -1);
- nr_enqueued++;
- }
-
- return nr_enqueued;
-}
-
__bpf_kfunc_start_defs();
/**
--
tejun
* Re: [PATCHSET sched_ext/for-7.1] sched_ext: Overhaul DSQ reenqueue infrastructure
2026-03-06 19:06 [PATCHSET sched_ext/for-7.1] sched_ext: Overhaul DSQ reenqueue infrastructure Tejun Heo
` (15 preceding siblings ...)
2026-03-06 23:14 ` [PATCHSET sched_ext/for-7.1] sched_ext: Overhaul DSQ reenqueue infrastructure Andrea Righi
@ 2026-03-07 15:38 ` Tejun Heo
16 siblings, 0 replies; 38+ messages in thread
From: Tejun Heo @ 2026-03-07 15:38 UTC (permalink / raw)
To: linux-kernel, sched-ext; +Cc: void, arighi, changwoo, emil, daniel.m.jordan
> Tejun (15):
> sched_ext: Relocate scx_bpf_task_cgroup() and its BTF_ID to the end of kfunc section
> sched_ext: Wrap global DSQs in per-node structure
> sched_ext: Factor out pnode allocation and deallocation into helpers
> sched_ext: Change find_global_dsq() to take CPU number instead of task
> sched_ext: Relocate run_deferred() and its callees
> sched_ext: Convert deferred_reenq_locals from llist to regular list
> sched_ext: Wrap deferred_reenq_local_node into a struct
> sched_ext: Introduce scx_bpf_dsq_reenq() for remote local DSQ reenqueue
> sched_ext: Add reenq_flags plumbing to scx_bpf_dsq_reenq()
> sched_ext: Add per-CPU data to DSQs
> sched_ext: Factor out nldsq_cursor_next_task() and nldsq_cursor_lost_task()
> sched_ext: Implement scx_bpf_dsq_reenq() for user DSQs
> sched_ext: Optimize schedule_dsq_reenq() with lockless fast path
> sched_ext: Simplify task state handling
> sched_ext: Add SCX_TASK_REENQ_REASON flags
Applied 1-15 to sched_ext/for-7.1.
Thanks.
--
tejun
* Re: [PATCH 06/15] sched_ext: Convert deferred_reenq_locals from llist to regular list
2026-03-06 19:06 ` [PATCH 06/15] sched_ext: Convert deferred_reenq_locals from llist to regular list Tejun Heo
@ 2026-03-09 17:12 ` Emil Tsalapatis
2026-03-09 17:16 ` Emil Tsalapatis
0 siblings, 1 reply; 38+ messages in thread
From: Emil Tsalapatis @ 2026-03-09 17:12 UTC (permalink / raw)
To: Tejun Heo, linux-kernel, sched-ext; +Cc: void, arighi, changwoo
On Fri Mar 6, 2026 at 2:06 PM EST, Tejun Heo wrote:
> The deferred reenqueue local mechanism uses an llist (lockless list) for
> collecting schedulers that need their local DSQs re-enqueued. Convert to a
> regular list protected by a raw_spinlock.
>
> The llist was used for its lockless properties, but the upcoming changes to
> support remote reenqueue require more complex list operations that are
> difficult to implement correctly with lockless data structures. A spinlock-
> protected regular list provides the necessary flexibility.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
> ---
> kernel/sched/ext.c | 57 ++++++++++++++++++++++++-------------
> kernel/sched/ext_internal.h | 2 +-
> kernel/sched/sched.h | 3 +-
> 3 files changed, 41 insertions(+), 21 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 1b6cd1e4f8b9..ffccaf04e34d 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -3640,23 +3640,37 @@ static u32 reenq_local(struct scx_sched *sch, struct rq *rq)
> return nr_enqueued;
> }
>
> -static void run_deferred(struct rq *rq)
> +static void process_deferred_reenq_locals(struct rq *rq)
> {
> - process_ddsp_deferred_locals(rq);
> -
> - if (!llist_empty(&rq->scx.deferred_reenq_locals)) {
> - struct llist_node *llist =
> - llist_del_all(&rq->scx.deferred_reenq_locals);
> - struct scx_sched_pcpu *pos, *next;
> + lockdep_assert_rq_held(rq);
>
> - llist_for_each_entry_safe(pos, next, llist,
> - deferred_reenq_locals_node) {
> - init_llist_node(&pos->deferred_reenq_locals_node);
> - reenq_local(pos->sch, rq);
> + while (true) {
> + struct scx_sched *sch;
> +
> + scoped_guard (raw_spinlock, &rq->scx.deferred_reenq_lock) {
> + struct scx_sched_pcpu *sch_pcpu =
> + list_first_entry_or_null(&rq->scx.deferred_reenq_locals,
> + struct scx_sched_pcpu,
> + deferred_reenq_local_node);
> + if (!sch_pcpu)
> + return;
> +
> + sch = sch_pcpu->sch;
While both sch and sch_pcpu aren't used in this patch, they are useful
for subsequent patches.
> + list_del_init(&sch_pcpu->deferred_reenq_local_node);
> }
> +
> + reenq_local(sch, rq);
> }
> }
>
> +static void run_deferred(struct rq *rq)
> +{
> + process_ddsp_deferred_locals(rq);
> +
> + if (!list_empty(&rq->scx.deferred_reenq_locals))
> + process_deferred_reenq_locals(rq);
> +}
> +
> #ifdef CONFIG_NO_HZ_FULL
> bool scx_can_stop_tick(struct rq *rq)
> {
> @@ -4180,13 +4194,13 @@ static void scx_sched_free_rcu_work(struct work_struct *work)
>
> /*
> * $sch would have entered bypass mode before the RCU grace period. As
> - * that blocks new deferrals, all deferred_reenq_locals_node's must be
> + * that blocks new deferrals, all deferred_reenq_local_node's must be
> * off-list by now.
> */
> for_each_possible_cpu(cpu) {
> struct scx_sched_pcpu *pcpu = per_cpu_ptr(sch->pcpu, cpu);
>
> - WARN_ON_ONCE(llist_on_list(&pcpu->deferred_reenq_locals_node));
> + WARN_ON_ONCE(!list_empty(&pcpu->deferred_reenq_local_node));
> }
>
> free_percpu(sch->pcpu);
> @@ -5799,7 +5813,7 @@ static struct scx_sched *scx_alloc_and_add_sched(struct sched_ext_ops *ops,
> struct scx_sched_pcpu *pcpu = per_cpu_ptr(sch->pcpu, cpu);
>
> pcpu->sch = sch;
> - init_llist_node(&pcpu->deferred_reenq_locals_node);
> + INIT_LIST_HEAD(&pcpu->deferred_reenq_local_node);
> }
>
> sch->helper = kthread_run_worker(0, "sched_ext_helper");
> @@ -7126,7 +7140,8 @@ void __init init_sched_ext_class(void)
> BUG_ON(!zalloc_cpumask_var_node(&rq->scx.cpus_to_kick_if_idle, GFP_KERNEL, n));
> BUG_ON(!zalloc_cpumask_var_node(&rq->scx.cpus_to_preempt, GFP_KERNEL, n));
> BUG_ON(!zalloc_cpumask_var_node(&rq->scx.cpus_to_wait, GFP_KERNEL, n));
> - init_llist_head(&rq->scx.deferred_reenq_locals);
> + raw_spin_lock_init(&rq->scx.deferred_reenq_lock);
> + INIT_LIST_HEAD(&rq->scx.deferred_reenq_locals);
> rq->scx.deferred_irq_work = IRQ_WORK_INIT_HARD(deferred_irq_workfn);
> rq->scx.kick_cpus_irq_work = IRQ_WORK_INIT_HARD(kick_cpus_irq_workfn);
>
> @@ -8358,7 +8373,6 @@ __bpf_kfunc void scx_bpf_reenqueue_local___v2(const struct bpf_prog_aux *aux)
> unsigned long flags;
> struct scx_sched *sch;
> struct rq *rq;
> - struct llist_node *lnode;
>
> raw_local_irq_save(flags);
>
> @@ -8374,9 +8388,14 @@ __bpf_kfunc void scx_bpf_reenqueue_local___v2(const struct bpf_prog_aux *aux)
> goto out_irq_restore;
>
> rq = this_rq();
> - lnode = &this_cpu_ptr(sch->pcpu)->deferred_reenq_locals_node;
> - if (!llist_on_list(lnode))
> - llist_add(lnode, &rq->scx.deferred_reenq_locals);
> + scoped_guard (raw_spinlock, &rq->scx.deferred_reenq_lock) {
> + struct scx_sched_pcpu *pcpu = this_cpu_ptr(sch->pcpu);
> +
> + if (list_empty(&pcpu->deferred_reenq_local_node))
> + list_move_tail(&pcpu->deferred_reenq_local_node,
> + &rq->scx.deferred_reenq_locals);
> + }
> +
> schedule_deferred(rq);
> out_irq_restore:
> raw_local_irq_restore(flags);
> diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
> index 9e5ebd00ea0c..80d40a9c5ad9 100644
> --- a/kernel/sched/ext_internal.h
> +++ b/kernel/sched/ext_internal.h
> @@ -965,7 +965,7 @@ struct scx_sched_pcpu {
> */
> struct scx_event_stats event_stats;
>
> - struct llist_node deferred_reenq_locals_node;
> + struct list_head deferred_reenq_local_node;
> struct scx_dispatch_q bypass_dsq;
> #ifdef CONFIG_EXT_SUB_SCHED
> u32 bypass_host_seq;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index ebe971d12cb8..0794852524e7 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -808,7 +808,8 @@ struct scx_rq {
>
> struct task_struct *sub_dispatch_prev;
>
> - struct llist_head deferred_reenq_locals;
> + raw_spinlock_t deferred_reenq_lock;
> + struct list_head deferred_reenq_locals; /* scheds requesting reenq of local DSQ */
> struct balance_callback deferred_bal_cb;
> struct irq_work deferred_irq_work;
> struct irq_work kick_cpus_irq_work;
* Re: [PATCH 06/15] sched_ext: Convert deferred_reenq_locals from llist to regular list
2026-03-09 17:12 ` Emil Tsalapatis
@ 2026-03-09 17:16 ` Emil Tsalapatis
0 siblings, 0 replies; 38+ messages in thread
From: Emil Tsalapatis @ 2026-03-09 17:16 UTC (permalink / raw)
To: Emil Tsalapatis, Tejun Heo, linux-kernel, sched-ext
Cc: void, arighi, changwoo
On Mon Mar 9, 2026 at 1:12 PM EDT, Emil Tsalapatis wrote:
> On Fri Mar 6, 2026 at 2:06 PM EST, Tejun Heo wrote:
>> The deferred reenqueue local mechanism uses an llist (lockless list) for
>> collecting schedulers that need their local DSQs re-enqueued. Convert to a
>> regular list protected by a raw_spinlock.
>>
>> The llist was used for its lockless properties, but the upcoming changes to
>> support remote reenqueue require more complex list operations that are
>> difficult to implement correctly with lockless data structures. A spinlock-
>> protected regular list provides the necessary flexibility.
>>
>> Signed-off-by: Tejun Heo <tj@kernel.org>
>
>
> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
>
>> ---
>> kernel/sched/ext.c | 57 ++++++++++++++++++++++++-------------
>> kernel/sched/ext_internal.h | 2 +-
>> kernel/sched/sched.h | 3 +-
>> 3 files changed, 41 insertions(+), 21 deletions(-)
>>
>> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
>> index 1b6cd1e4f8b9..ffccaf04e34d 100644
>> --- a/kernel/sched/ext.c
>> +++ b/kernel/sched/ext.c
>> @@ -3640,23 +3640,37 @@ static u32 reenq_local(struct scx_sched *sch, struct rq *rq)
>> return nr_enqueued;
>> }
>>
>> -static void run_deferred(struct rq *rq)
>> +static void process_deferred_reenq_locals(struct rq *rq)
>> {
>> - process_ddsp_deferred_locals(rq);
>> -
>> - if (!llist_empty(&rq->scx.deferred_reenq_locals)) {
>> - struct llist_node *llist =
>> - llist_del_all(&rq->scx.deferred_reenq_locals);
>> - struct scx_sched_pcpu *pos, *next;
>> + lockdep_assert_rq_held(rq);
>>
>> - llist_for_each_entry_safe(pos, next, llist,
>> - deferred_reenq_locals_node) {
>> - init_llist_node(&pos->deferred_reenq_locals_node);
>> - reenq_local(pos->sch, rq);
>> + while (true) {
>> + struct scx_sched *sch;
>> +
>> + scoped_guard (raw_spinlock, &rq->scx.deferred_reenq_lock) {
>> + struct scx_sched_pcpu *sch_pcpu =
>> + list_first_entry_or_null(&rq->scx.deferred_reenq_locals,
>> + struct scx_sched_pcpu,
>> + deferred_reenq_local_node);
>> + if (!sch_pcpu)
>> + return;
>> +
>> + sch = sch_pcpu->sch;
>
> While both scx and sch_pcpu aren't used in this patch, they are useful
> for subsequent patches.
>
This comment was meant for the next patch in the series, sorry about
that. The review tag still applies.
>> + list_del_init(&sch_pcpu->deferred_reenq_local_node);
>> }
>> +
>> + reenq_local(sch, rq);
>> }
>> }
>>
>> +static void run_deferred(struct rq *rq)
>> +{
>> + process_ddsp_deferred_locals(rq);
>> +
>> + if (!list_empty(&rq->scx.deferred_reenq_locals))
>> + process_deferred_reenq_locals(rq);
>> +}
>> +
>> #ifdef CONFIG_NO_HZ_FULL
>> bool scx_can_stop_tick(struct rq *rq)
>> {
>> @@ -4180,13 +4194,13 @@ static void scx_sched_free_rcu_work(struct work_struct *work)
>>
>> /*
>> * $sch would have entered bypass mode before the RCU grace period. As
>> - * that blocks new deferrals, all deferred_reenq_locals_node's must be
>> + * that blocks new deferrals, all deferred_reenq_local_node's must be
>> * off-list by now.
>> */
>> for_each_possible_cpu(cpu) {
>> struct scx_sched_pcpu *pcpu = per_cpu_ptr(sch->pcpu, cpu);
>>
>> - WARN_ON_ONCE(llist_on_list(&pcpu->deferred_reenq_locals_node));
>> + WARN_ON_ONCE(!list_empty(&pcpu->deferred_reenq_local_node));
>> }
>>
>> free_percpu(sch->pcpu);
>> @@ -5799,7 +5813,7 @@ static struct scx_sched *scx_alloc_and_add_sched(struct sched_ext_ops *ops,
>> struct scx_sched_pcpu *pcpu = per_cpu_ptr(sch->pcpu, cpu);
>>
>> pcpu->sch = sch;
>> - init_llist_node(&pcpu->deferred_reenq_locals_node);
>> + INIT_LIST_HEAD(&pcpu->deferred_reenq_local_node);
>> }
>>
>> sch->helper = kthread_run_worker(0, "sched_ext_helper");
>> @@ -7126,7 +7140,8 @@ void __init init_sched_ext_class(void)
>> BUG_ON(!zalloc_cpumask_var_node(&rq->scx.cpus_to_kick_if_idle, GFP_KERNEL, n));
>> BUG_ON(!zalloc_cpumask_var_node(&rq->scx.cpus_to_preempt, GFP_KERNEL, n));
>> BUG_ON(!zalloc_cpumask_var_node(&rq->scx.cpus_to_wait, GFP_KERNEL, n));
>> - init_llist_head(&rq->scx.deferred_reenq_locals);
>> + raw_spin_lock_init(&rq->scx.deferred_reenq_lock);
>> + INIT_LIST_HEAD(&rq->scx.deferred_reenq_locals);
>> rq->scx.deferred_irq_work = IRQ_WORK_INIT_HARD(deferred_irq_workfn);
>> rq->scx.kick_cpus_irq_work = IRQ_WORK_INIT_HARD(kick_cpus_irq_workfn);
>>
>> @@ -8358,7 +8373,6 @@ __bpf_kfunc void scx_bpf_reenqueue_local___v2(const struct bpf_prog_aux *aux)
>> unsigned long flags;
>> struct scx_sched *sch;
>> struct rq *rq;
>> - struct llist_node *lnode;
>>
>> raw_local_irq_save(flags);
>>
>> @@ -8374,9 +8388,14 @@ __bpf_kfunc void scx_bpf_reenqueue_local___v2(const struct bpf_prog_aux *aux)
>> goto out_irq_restore;
>>
>> rq = this_rq();
>> - lnode = &this_cpu_ptr(sch->pcpu)->deferred_reenq_locals_node;
>> - if (!llist_on_list(lnode))
>> - llist_add(lnode, &rq->scx.deferred_reenq_locals);
>> + scoped_guard (raw_spinlock, &rq->scx.deferred_reenq_lock) {
>> + struct scx_sched_pcpu *pcpu = this_cpu_ptr(sch->pcpu);
>> +
>> + if (list_empty(&pcpu->deferred_reenq_local_node))
>> + list_move_tail(&pcpu->deferred_reenq_local_node,
>> + &rq->scx.deferred_reenq_locals);
>> + }
>> +
>> schedule_deferred(rq);
>> out_irq_restore:
>> raw_local_irq_restore(flags);
>> diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
>> index 9e5ebd00ea0c..80d40a9c5ad9 100644
>> --- a/kernel/sched/ext_internal.h
>> +++ b/kernel/sched/ext_internal.h
>> @@ -965,7 +965,7 @@ struct scx_sched_pcpu {
>> */
>> struct scx_event_stats event_stats;
>>
>> - struct llist_node deferred_reenq_locals_node;
>> + struct list_head deferred_reenq_local_node;
>> struct scx_dispatch_q bypass_dsq;
>> #ifdef CONFIG_EXT_SUB_SCHED
>> u32 bypass_host_seq;
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index ebe971d12cb8..0794852524e7 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -808,7 +808,8 @@ struct scx_rq {
>>
>> struct task_struct *sub_dispatch_prev;
>>
>> - struct llist_head deferred_reenq_locals;
>> + raw_spinlock_t deferred_reenq_lock;
>> + struct list_head deferred_reenq_locals; /* scheds requesting reenq of local DSQ */
>> struct balance_callback deferred_bal_cb;
>> struct irq_work deferred_irq_work;
>> struct irq_work kick_cpus_irq_work;
* Re: [PATCH 07/15] sched_ext: Wrap deferred_reenq_local_node into a struct
2026-03-06 19:06 ` [PATCH 07/15] sched_ext: Wrap deferred_reenq_local_node into a struct Tejun Heo
@ 2026-03-09 17:16 ` Emil Tsalapatis
0 siblings, 0 replies; 38+ messages in thread
From: Emil Tsalapatis @ 2026-03-09 17:16 UTC (permalink / raw)
To: Tejun Heo, linux-kernel, sched-ext; +Cc: void, arighi, changwoo
On Fri Mar 6, 2026 at 2:06 PM EST, Tejun Heo wrote:
> Wrap the deferred_reenq_local_node list_head into struct
> scx_deferred_reenq_local. More fields will be added and this allows using a
> shorthand pointer to access them.
>
> No functional change.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
> ---
> kernel/sched/ext.c | 22 +++++++++++++---------
> kernel/sched/ext_internal.h | 6 +++++-
> 2 files changed, 18 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index ffccaf04e34d..80d1e6ccc326 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -3648,15 +3648,19 @@ static void process_deferred_reenq_locals(struct rq *rq)
> struct scx_sched *sch;
>
> scoped_guard (raw_spinlock, &rq->scx.deferred_reenq_lock) {
> - struct scx_sched_pcpu *sch_pcpu =
> + struct scx_deferred_reenq_local *drl =
> list_first_entry_or_null(&rq->scx.deferred_reenq_locals,
> - struct scx_sched_pcpu,
> - deferred_reenq_local_node);
> - if (!sch_pcpu)
> + struct scx_deferred_reenq_local,
> + node);
> + struct scx_sched_pcpu *sch_pcpu;
> +
> + if (!drl)
> return;
>
> + sch_pcpu = container_of(drl, struct scx_sched_pcpu,
> + deferred_reenq_local);
> sch = sch_pcpu->sch;
> - list_del_init(&sch_pcpu->deferred_reenq_local_node);
> + list_del_init(&drl->node);
> }
>
> reenq_local(sch, rq);
> @@ -4200,7 +4204,7 @@ static void scx_sched_free_rcu_work(struct work_struct *work)
> for_each_possible_cpu(cpu) {
> struct scx_sched_pcpu *pcpu = per_cpu_ptr(sch->pcpu, cpu);
>
> - WARN_ON_ONCE(!list_empty(&pcpu->deferred_reenq_local_node));
> + WARN_ON_ONCE(!list_empty(&pcpu->deferred_reenq_local.node));
> }
>
> free_percpu(sch->pcpu);
> @@ -5813,7 +5817,7 @@ static struct scx_sched *scx_alloc_and_add_sched(struct sched_ext_ops *ops,
> struct scx_sched_pcpu *pcpu = per_cpu_ptr(sch->pcpu, cpu);
>
> pcpu->sch = sch;
> - INIT_LIST_HEAD(&pcpu->deferred_reenq_local_node);
> + INIT_LIST_HEAD(&pcpu->deferred_reenq_local.node);
> }
>
> sch->helper = kthread_run_worker(0, "sched_ext_helper");
> @@ -8391,8 +8395,8 @@ __bpf_kfunc void scx_bpf_reenqueue_local___v2(const struct bpf_prog_aux *aux)
> scoped_guard (raw_spinlock, &rq->scx.deferred_reenq_lock) {
> struct scx_sched_pcpu *pcpu = this_cpu_ptr(sch->pcpu);
>
> - if (list_empty(&pcpu->deferred_reenq_local_node))
> - list_move_tail(&pcpu->deferred_reenq_local_node,
> + if (list_empty(&pcpu->deferred_reenq_local.node))
> + list_move_tail(&pcpu->deferred_reenq_local.node,
> &rq->scx.deferred_reenq_locals);
> }
>
> diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
> index 80d40a9c5ad9..1a8d61097cab 100644
> --- a/kernel/sched/ext_internal.h
> +++ b/kernel/sched/ext_internal.h
> @@ -954,6 +954,10 @@ struct scx_dsp_ctx {
> struct scx_dsp_buf_ent buf[];
> };
>
> +struct scx_deferred_reenq_local {
> + struct list_head node;
> +};
> +
> struct scx_sched_pcpu {
> struct scx_sched *sch;
> u64 flags; /* protected by rq lock */
> @@ -965,7 +969,7 @@ struct scx_sched_pcpu {
> */
> struct scx_event_stats event_stats;
>
> - struct list_head deferred_reenq_local_node;
> + struct scx_deferred_reenq_local deferred_reenq_local;
> struct scx_dispatch_q bypass_dsq;
> #ifdef CONFIG_EXT_SUB_SCHED
> u32 bypass_host_seq;
* Re: [PATCH 08/15] sched_ext: Introduce scx_bpf_dsq_reenq() for remote local DSQ reenqueue
2026-03-06 19:06 ` [PATCH 08/15] sched_ext: Introduce scx_bpf_dsq_reenq() for remote local DSQ reenqueue Tejun Heo
@ 2026-03-09 17:33 ` Emil Tsalapatis
0 siblings, 0 replies; 38+ messages in thread
From: Emil Tsalapatis @ 2026-03-09 17:33 UTC (permalink / raw)
To: Tejun Heo, linux-kernel, sched-ext; +Cc: void, arighi, changwoo
On Fri Mar 6, 2026 at 2:06 PM EST, Tejun Heo wrote:
> scx_bpf_reenqueue_local() can only trigger re-enqueue of the current CPU's
> local DSQ. Introduce scx_bpf_dsq_reenq() which takes a DSQ ID and can target
> any local DSQ including remote CPUs via SCX_DSQ_LOCAL_ON | cpu. This will be
> expanded to support user DSQs by future changes.
>
> scx_bpf_reenqueue_local() is reimplemented as a simple wrapper around
> scx_bpf_dsq_reenq(SCX_DSQ_LOCAL, 0) and may be deprecated in the future.
>
> Update compat.bpf.h with a compatibility shim and scx_qmap to test the new
> functionality.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
> ---
> kernel/sched/ext.c | 118 ++++++++++++++---------
> tools/sched_ext/include/scx/compat.bpf.h | 21 ++++
> tools/sched_ext/scx_qmap.bpf.c | 11 ++-
> tools/sched_ext/scx_qmap.c | 5 +-
> 4 files changed, 106 insertions(+), 49 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 80d1e6ccc326..b02143b10f0f 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -1080,6 +1080,31 @@ static void schedule_deferred_locked(struct rq *rq)
> schedule_deferred(rq);
> }
>
> +static void schedule_dsq_reenq(struct scx_sched *sch, struct scx_dispatch_q *dsq)
> +{
> + /*
> + * Allowing reenqueues doesn't make sense while bypassing. This also
> + * prevents new reenqueues from being scheduled on dead scheds.
> + */
> + if (unlikely(READ_ONCE(sch->bypass_depth)))
> + return;
> +
> + if (dsq->id == SCX_DSQ_LOCAL) {
> + struct rq *rq = container_of(dsq, struct rq, scx.local_dsq);
> + struct scx_sched_pcpu *sch_pcpu = per_cpu_ptr(sch->pcpu, cpu_of(rq));
> + struct scx_deferred_reenq_local *drl = &sch_pcpu->deferred_reenq_local;
> +
> + scoped_guard (raw_spinlock_irqsave, &rq->scx.deferred_reenq_lock) {
> + if (list_empty(&drl->node))
> + list_move_tail(&drl->node, &rq->scx.deferred_reenq_locals);
> + }
> +
> + schedule_deferred(rq);
> + } else {
> + scx_error(sch, "DSQ 0x%llx not allowed for reenq", dsq->id);
> + }
> +}
> +
> /**
> * touch_core_sched - Update timestamp used for core-sched task ordering
> * @rq: rq to read clock from, must be locked
> @@ -7775,9 +7800,6 @@ __bpf_kfunc_start_defs();
> * Iterate over all of the tasks currently enqueued on the local DSQ of the
> * caller's CPU, and re-enqueue them in the BPF scheduler. Returns the number of
> * processed tasks. Can only be called from ops.cpu_release().
> - *
> - * COMPAT: Will be removed in v6.23 along with the ___v2 suffix on the void
> - * returning variant that can be called from anywhere.
> */
> __bpf_kfunc u32 scx_bpf_reenqueue_local(const struct bpf_prog_aux *aux)
> {
> @@ -8207,6 +8229,52 @@ __bpf_kfunc struct task_struct *scx_bpf_dsq_peek(u64 dsq_id,
> return rcu_dereference(dsq->first_task);
> }
>
> +/**
> + * scx_bpf_dsq_reenq - Re-enqueue tasks on a DSQ
> + * @dsq_id: DSQ to re-enqueue
> + * @reenq_flags: %SCX_RENQ_*
> + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
> + *
> + * Iterate over all of the tasks currently enqueued on the DSQ identified by
> + * @dsq_id, and re-enqueue them in the BPF scheduler. The following DSQs are
> + * supported:
> + *
> + * - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON | $cpu)
> + *
> + * Re-enqueues are performed asynchronously. Can be called from anywhere.
> + */
> +__bpf_kfunc void scx_bpf_dsq_reenq(u64 dsq_id, u64 reenq_flags,
> + const struct bpf_prog_aux *aux)
> +{
> + struct scx_sched *sch;
> + struct scx_dispatch_q *dsq;
> +
> + guard(preempt)();
> +
> + sch = scx_prog_sched(aux);
> + if (unlikely(!sch))
> + return;
> +
> + dsq = find_dsq_for_dispatch(sch, this_rq(), dsq_id, smp_processor_id());
> + schedule_dsq_reenq(sch, dsq);
> +}
> +
> +/**
> + * scx_bpf_reenqueue_local - Re-enqueue tasks on a local DSQ
> + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
> + *
> + * Iterate over all of the tasks currently enqueued on the local DSQ of the
> + * caller's CPU, and re-enqueue them in the BPF scheduler. Can be called from
> + * anywhere.
> + *
> + * This is now a special case of scx_bpf_dsq_reenq() and may be removed in the
> + * future.
> + */
> +__bpf_kfunc void scx_bpf_reenqueue_local___v2(const struct bpf_prog_aux *aux)
> +{
> + scx_bpf_dsq_reenq(SCX_DSQ_LOCAL, 0, aux);
> +}
> +
> __bpf_kfunc_end_defs();
>
> static s32 __bstr_format(struct scx_sched *sch, u64 *data_buf, char *line_buf,
> @@ -8364,47 +8432,6 @@ __bpf_kfunc void scx_bpf_dump_bstr(char *fmt, unsigned long long *data,
> ops_dump_flush();
> }
>
> -/**
> - * scx_bpf_reenqueue_local - Re-enqueue tasks on a local DSQ
> - * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
> - *
> - * Iterate over all of the tasks currently enqueued on the local DSQ of the
> - * caller's CPU, and re-enqueue them in the BPF scheduler. Can be called from
> - * anywhere.
> - */
> -__bpf_kfunc void scx_bpf_reenqueue_local___v2(const struct bpf_prog_aux *aux)
> -{
> - unsigned long flags;
> - struct scx_sched *sch;
> - struct rq *rq;
> -
> - raw_local_irq_save(flags);
> -
> - sch = scx_prog_sched(aux);
> - if (unlikely(!sch))
> - goto out_irq_restore;
> -
> - /*
> - * Allowing reenqueue-locals doesn't make sense while bypassing. This
> - * also blocks from new reenqueues to be scheduled on dead scheds.
> - */
> - if (unlikely(sch->bypass_depth))
> - goto out_irq_restore;
> -
> - rq = this_rq();
> - scoped_guard (raw_spinlock, &rq->scx.deferred_reenq_lock) {
> - struct scx_sched_pcpu *pcpu = this_cpu_ptr(sch->pcpu);
> -
> - if (list_empty(&pcpu->deferred_reenq_local.node))
> - list_move_tail(&pcpu->deferred_reenq_local.node,
> - &rq->scx.deferred_reenq_locals);
> - }
> -
> - schedule_deferred(rq);
> -out_irq_restore:
> - raw_local_irq_restore(flags);
> -}
> -
> /**
> * scx_bpf_cpuperf_cap - Query the maximum relative capacity of a CPU
> * @cpu: CPU of interest
> @@ -8821,13 +8848,14 @@ BTF_ID_FLAGS(func, scx_bpf_kick_cpu, KF_IMPLICIT_ARGS)
> BTF_ID_FLAGS(func, scx_bpf_dsq_nr_queued)
> BTF_ID_FLAGS(func, scx_bpf_destroy_dsq)
> BTF_ID_FLAGS(func, scx_bpf_dsq_peek, KF_IMPLICIT_ARGS | KF_RCU_PROTECTED | KF_RET_NULL)
> +BTF_ID_FLAGS(func, scx_bpf_dsq_reenq, KF_IMPLICIT_ARGS)
> +BTF_ID_FLAGS(func, scx_bpf_reenqueue_local___v2, KF_IMPLICIT_ARGS)
> BTF_ID_FLAGS(func, bpf_iter_scx_dsq_new, KF_IMPLICIT_ARGS | KF_ITER_NEW | KF_RCU_PROTECTED)
> BTF_ID_FLAGS(func, bpf_iter_scx_dsq_next, KF_ITER_NEXT | KF_RET_NULL)
> BTF_ID_FLAGS(func, bpf_iter_scx_dsq_destroy, KF_ITER_DESTROY)
> BTF_ID_FLAGS(func, scx_bpf_exit_bstr, KF_IMPLICIT_ARGS)
> BTF_ID_FLAGS(func, scx_bpf_error_bstr, KF_IMPLICIT_ARGS)
> BTF_ID_FLAGS(func, scx_bpf_dump_bstr, KF_IMPLICIT_ARGS)
> -BTF_ID_FLAGS(func, scx_bpf_reenqueue_local___v2, KF_IMPLICIT_ARGS)
> BTF_ID_FLAGS(func, scx_bpf_cpuperf_cap, KF_IMPLICIT_ARGS)
> BTF_ID_FLAGS(func, scx_bpf_cpuperf_cur, KF_IMPLICIT_ARGS)
> BTF_ID_FLAGS(func, scx_bpf_cpuperf_set, KF_IMPLICIT_ARGS)
> diff --git a/tools/sched_ext/include/scx/compat.bpf.h b/tools/sched_ext/include/scx/compat.bpf.h
> index f2969c3061a7..2d3985be7e2c 100644
> --- a/tools/sched_ext/include/scx/compat.bpf.h
> +++ b/tools/sched_ext/include/scx/compat.bpf.h
> @@ -375,6 +375,27 @@ static inline void scx_bpf_reenqueue_local(void)
> scx_bpf_reenqueue_local___v1();
> }
>
> +/*
> + * v6.20: New scx_bpf_dsq_reenq() that allows re-enqueues on more DSQs. This
> + * will eventually deprecate scx_bpf_reenqueue_local().
> + */
> +void scx_bpf_dsq_reenq___compat(u64 dsq_id, u64 reenq_flags, const struct bpf_prog_aux *aux__prog) __ksym __weak;
> +
> +static inline bool __COMPAT_has_generic_reenq(void)
> +{
> + return bpf_ksym_exists(scx_bpf_dsq_reenq___compat);
> +}
> +
> +static inline void scx_bpf_dsq_reenq(u64 dsq_id, u64 reenq_flags)
> +{
> + if (bpf_ksym_exists(scx_bpf_dsq_reenq___compat))
> + scx_bpf_dsq_reenq___compat(dsq_id, reenq_flags, NULL);
> + else if (dsq_id == SCX_DSQ_LOCAL && reenq_flags == 0)
> + scx_bpf_reenqueue_local();
> + else
> + scx_bpf_error("kernel too old to reenqueue foreign local or user DSQs");
> +}
> +
> /*
> * Define sched_ext_ops. This may be expanded to define multiple variants for
> * backward compatibility. See compat.h::SCX_OPS_LOAD/ATTACH().
> diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
> index 91b8eac83f52..83e8289e8c0c 100644
> --- a/tools/sched_ext/scx_qmap.bpf.c
> +++ b/tools/sched_ext/scx_qmap.bpf.c
> @@ -131,7 +131,7 @@ struct {
> } cpu_ctx_stor SEC(".maps");
>
> /* Statistics */
> -u64 nr_enqueued, nr_dispatched, nr_reenqueued, nr_dequeued, nr_ddsp_from_enq;
> +u64 nr_enqueued, nr_dispatched, nr_reenqueued, nr_reenqueued_cpu0, nr_dequeued, nr_ddsp_from_enq;
> u64 nr_core_sched_execed;
> u64 nr_expedited_local, nr_expedited_remote, nr_expedited_lost, nr_expedited_from_timer;
> u32 cpuperf_min, cpuperf_avg, cpuperf_max;
> @@ -206,8 +206,11 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
> void *ring;
> s32 cpu;
>
> - if (enq_flags & SCX_ENQ_REENQ)
> + if (enq_flags & SCX_ENQ_REENQ) {
> __sync_fetch_and_add(&nr_reenqueued, 1);
> + if (scx_bpf_task_cpu(p) == 0)
> + __sync_fetch_and_add(&nr_reenqueued_cpu0, 1);
> + }
>
> if (p->flags & PF_KTHREAD) {
> if (stall_kernel_nth && !(++kernel_cnt % stall_kernel_nth))
> @@ -561,6 +564,10 @@ int BPF_PROG(qmap_sched_switch, bool preempt, struct task_struct *prev,
> case 2: /* SCHED_RR */
> case 6: /* SCHED_DEADLINE */
> scx_bpf_reenqueue_local();
> +
> + /* trigger re-enqueue on CPU0 just to exercise LOCAL_ON */
> + if (__COMPAT_has_generic_reenq())
> + scx_bpf_dsq_reenq(SCX_DSQ_LOCAL_ON | 0, 0);
> }
>
> return 0;
> diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c
> index 5d762d10f4db..9252037284d3 100644
> --- a/tools/sched_ext/scx_qmap.c
> +++ b/tools/sched_ext/scx_qmap.c
> @@ -137,9 +137,10 @@ int main(int argc, char **argv)
> long nr_enqueued = skel->bss->nr_enqueued;
> long nr_dispatched = skel->bss->nr_dispatched;
>
> - printf("stats : enq=%lu dsp=%lu delta=%ld reenq=%"PRIu64" deq=%"PRIu64" core=%"PRIu64" enq_ddsp=%"PRIu64"\n",
> + printf("stats : enq=%lu dsp=%lu delta=%ld reenq/cpu0=%"PRIu64"/%"PRIu64" deq=%"PRIu64" core=%"PRIu64" enq_ddsp=%"PRIu64"\n",
> nr_enqueued, nr_dispatched, nr_enqueued - nr_dispatched,
> - skel->bss->nr_reenqueued, skel->bss->nr_dequeued,
> + skel->bss->nr_reenqueued, skel->bss->nr_reenqueued_cpu0,
> + skel->bss->nr_dequeued,
> skel->bss->nr_core_sched_execed,
> skel->bss->nr_ddsp_from_enq);
> printf(" exp_local=%"PRIu64" exp_remote=%"PRIu64" exp_timer=%"PRIu64" exp_lost=%"PRIu64"\n",
* Re: [PATCH 09/15] sched_ext: Add reenq_flags plumbing to scx_bpf_dsq_reenq()
2026-03-06 19:06 ` [PATCH 09/15] sched_ext: Add reenq_flags plumbing to scx_bpf_dsq_reenq() Tejun Heo
@ 2026-03-09 17:47 ` Emil Tsalapatis
0 siblings, 0 replies; 38+ messages in thread
From: Emil Tsalapatis @ 2026-03-09 17:47 UTC (permalink / raw)
To: Tejun Heo, linux-kernel, sched-ext; +Cc: void, arighi, changwoo
On Fri Mar 6, 2026 at 2:06 PM EST, Tejun Heo wrote:
> Add infrastructure to pass flags through the deferred reenqueue path.
> reenq_local() now takes a reenq_flags parameter, and scx_sched_pcpu gains a
> deferred_reenq_local_flags field to accumulate flags from multiple
> scx_bpf_dsq_reenq() calls before processing. No flags are defined yet.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
> ---
> kernel/sched/ext.c | 33 ++++++++++++++++++++++++++++-----
> kernel/sched/ext_internal.h | 10 ++++++++++
> 2 files changed, 38 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index b02143b10f0f..c9b0e94d59bd 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -1080,7 +1080,8 @@ static void schedule_deferred_locked(struct rq *rq)
> schedule_deferred(rq);
> }
>
> -static void schedule_dsq_reenq(struct scx_sched *sch, struct scx_dispatch_q *dsq)
> +static void schedule_dsq_reenq(struct scx_sched *sch, struct scx_dispatch_q *dsq,
> + u64 reenq_flags)
> {
> /*
> * Allowing reenqueues doesn't make sense while bypassing. This also
> @@ -1097,6 +1098,7 @@ static void schedule_dsq_reenq(struct scx_sched *sch, struct scx_dispatch_q *dsq
> scoped_guard (raw_spinlock_irqsave, &rq->scx.deferred_reenq_lock) {
> if (list_empty(&drl->node))
> list_move_tail(&drl->node, &rq->scx.deferred_reenq_locals);
> + drl->flags |= reenq_flags;
> }
>
> schedule_deferred(rq);
> @@ -3618,7 +3620,14 @@ int scx_check_setscheduler(struct task_struct *p, int policy)
> return 0;
> }
>
> -static u32 reenq_local(struct scx_sched *sch, struct rq *rq)
> +static bool task_should_reenq(struct task_struct *p, u64 reenq_flags)
> +{
> + if (reenq_flags & SCX_REENQ_ANY)
> + return true;
> + return false;
Nit: (reenq_flags & SCX_REENQ_ANY) != 0?
> +}
> +
> +static u32 reenq_local(struct scx_sched *sch, struct rq *rq, u64 reenq_flags)
> {
> LIST_HEAD(tasks);
> u32 nr_enqueued = 0;
> @@ -3652,6 +3661,9 @@ static u32 reenq_local(struct scx_sched *sch, struct rq *rq)
> if (!scx_is_descendant(task_sch, sch))
> continue;
>
> + if (!task_should_reenq(p, reenq_flags))
> + continue;
> +
> dispatch_dequeue(rq, p);
> list_add_tail(&p->scx.dsq_list.node, &tasks);
> }
> @@ -3671,6 +3683,7 @@ static void process_deferred_reenq_locals(struct rq *rq)
>
> while (true) {
> struct scx_sched *sch;
> + u64 reenq_flags = 0;
>
> scoped_guard (raw_spinlock, &rq->scx.deferred_reenq_lock) {
> struct scx_deferred_reenq_local *drl =
> @@ -3685,10 +3698,11 @@ static void process_deferred_reenq_locals(struct rq *rq)
> sch_pcpu = container_of(drl, struct scx_sched_pcpu,
> deferred_reenq_local);
> sch = sch_pcpu->sch;
> + swap(drl->flags, reenq_flags);
> list_del_init(&drl->node);
> }
>
> - reenq_local(sch, rq);
> + reenq_local(sch, rq, reenq_flags);
> }
> }
>
> @@ -7817,7 +7831,7 @@ __bpf_kfunc u32 scx_bpf_reenqueue_local(const struct bpf_prog_aux *aux)
> rq = cpu_rq(smp_processor_id());
> lockdep_assert_rq_held(rq);
>
> - return reenq_local(sch, rq);
> + return reenq_local(sch, rq, 0);
> }
>
> __bpf_kfunc_end_defs();
> @@ -8255,8 +8269,17 @@ __bpf_kfunc void scx_bpf_dsq_reenq(u64 dsq_id, u64 reenq_flags,
> if (unlikely(!sch))
> return;
>
> + if (unlikely(reenq_flags & ~__SCX_REENQ_USER_MASK)) {
> + scx_error(sch, "invalid SCX_REENQ flags 0x%llx", reenq_flags);
> + return;
> + }
> +
> + /* not specifying any filter bits is the same as %SCX_REENQ_ANY */
> + if (!(reenq_flags & __SCX_REENQ_FILTER_MASK))
> + reenq_flags |= SCX_REENQ_ANY;
> +
> dsq = find_dsq_for_dispatch(sch, this_rq(), dsq_id, smp_processor_id());
> - schedule_dsq_reenq(sch, dsq);
> + schedule_dsq_reenq(sch, dsq, reenq_flags);
> }
>
> /**
> diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
> index 1a8d61097cab..d9eda2e8701c 100644
> --- a/kernel/sched/ext_internal.h
> +++ b/kernel/sched/ext_internal.h
> @@ -956,6 +956,7 @@ struct scx_dsp_ctx {
>
> struct scx_deferred_reenq_local {
> struct list_head node;
> + u64 flags;
> };
>
> struct scx_sched_pcpu {
> @@ -1128,6 +1129,15 @@ enum scx_deq_flags {
> SCX_DEQ_SCHED_CHANGE = 1LLU << 33,
> };
>
> +enum scx_reenq_flags {
> + /* low 16bits determine which tasks should be reenqueued */
> + SCX_REENQ_ANY = 1LLU << 0, /* all tasks */
> +
> + __SCX_REENQ_FILTER_MASK = 0xffffLLU,
> +
> + __SCX_REENQ_USER_MASK = SCX_REENQ_ANY,
> +};
> +
> enum scx_pick_idle_cpu_flags {
> SCX_PICK_IDLE_CORE = 1LLU << 0, /* pick a CPU whose SMT siblings are also idle */
> SCX_PICK_IDLE_IN_NODE = 1LLU << 1, /* pick a CPU in the same target NUMA node */
Thread overview: 38+ messages
2026-03-06 19:06 [PATCHSET sched_ext/for-7.1] sched_ext: Overhaul DSQ reenqueue infrastructure Tejun Heo
2026-03-06 19:06 ` [PATCH 01/15] sched_ext: Relocate scx_bpf_task_cgroup() and its BTF_ID to the end of kfunc section Tejun Heo
2026-03-06 20:45 ` Emil Tsalapatis
2026-03-06 23:20 ` Daniel Jordan
2026-03-06 19:06 ` [PATCH 02/15] sched_ext: Wrap global DSQs in per-node structure Tejun Heo
2026-03-06 20:52 ` Emil Tsalapatis
2026-03-06 23:20 ` Daniel Jordan
2026-03-06 19:06 ` [PATCH 03/15] sched_ext: Factor out pnode allocation and deallocation into helpers Tejun Heo
2026-03-06 20:54 ` Emil Tsalapatis
2026-03-06 23:21 ` Daniel Jordan
2026-03-06 19:06 ` [PATCH 04/15] sched_ext: Change find_global_dsq() to take CPU number instead of task Tejun Heo
2026-03-06 21:06 ` Emil Tsalapatis
2026-03-06 22:33 ` [PATCH v2 " Tejun Heo
2026-03-06 23:21 ` [PATCH " Daniel Jordan
2026-03-06 19:06 ` [PATCH 05/15] sched_ext: Relocate reenq_local() and run_deferred() Tejun Heo
2026-03-06 21:09 ` Emil Tsalapatis
2026-03-06 23:34 ` Daniel Jordan
2026-03-07 0:12 ` [PATCH v2 05/15] sched_ext: Relocate run_deferred() and its callees Tejun Heo
2026-03-06 19:06 ` [PATCH 06/15] sched_ext: Convert deferred_reenq_locals from llist to regular list Tejun Heo
2026-03-09 17:12 ` Emil Tsalapatis
2026-03-09 17:16 ` Emil Tsalapatis
2026-03-06 19:06 ` [PATCH 07/15] sched_ext: Wrap deferred_reenq_local_node into a struct Tejun Heo
2026-03-09 17:16 ` Emil Tsalapatis
2026-03-06 19:06 ` [PATCH 08/15] sched_ext: Introduce scx_bpf_dsq_reenq() for remote local DSQ reenqueue Tejun Heo
2026-03-09 17:33 ` Emil Tsalapatis
2026-03-06 19:06 ` [PATCH 09/15] sched_ext: Add reenq_flags plumbing to scx_bpf_dsq_reenq() Tejun Heo
2026-03-09 17:47 ` Emil Tsalapatis
2026-03-06 19:06 ` [PATCH 10/15] sched_ext: Add per-CPU data to DSQs Tejun Heo
2026-03-06 22:54 ` Andrea Righi
2026-03-06 22:56 ` Andrea Righi
2026-03-06 23:09 ` [PATCH v2 " Tejun Heo
2026-03-06 19:06 ` [PATCH 11/15] sched_ext: Factor out nldsq_cursor_next_task() and nldsq_cursor_lost_task() Tejun Heo
2026-03-06 19:06 ` [PATCH 12/15] sched_ext: Implement scx_bpf_dsq_reenq() for user DSQs Tejun Heo
2026-03-06 19:06 ` [PATCH 13/15] sched_ext: Optimize schedule_dsq_reenq() with lockless fast path Tejun Heo
2026-03-06 19:06 ` [PATCH 14/15] sched_ext: Simplify task state handling Tejun Heo
2026-03-06 19:06 ` [PATCH 15/15] sched_ext: Add SCX_TASK_REENQ_REASON flags Tejun Heo
2026-03-06 23:14 ` [PATCHSET sched_ext/for-7.1] sched_ext: Overhaul DSQ reenqueue infrastructure Andrea Righi
2026-03-07 15:38 ` Tejun Heo