* [PATCHSET sched_ext/for-6.12] sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()
@ 2024-08-30 11:03 Tejun Heo
2024-08-30 11:03 ` [PATCH 01/11] sched_ext: Rename scx_kfunc_set_sleepable to unlocked and relocate Tejun Heo
` (11 more replies)
0 siblings, 12 replies; 35+ messages in thread
From: Tejun Heo @ 2024-08-30 11:03 UTC (permalink / raw)
To: void; +Cc: kernel-team, linux-kernel
Once a task is put into a DSQ, the allowed operations are fairly limited.
Tasks in the built-in local and global DSQs are executed automatically and,
ignoring dequeue, there is only one way a task in a user DSQ can be
manipulated - scx_bpf_consume() moves the first task to the dispatching
local DSQ. This inflexibility sometimes gets in the way and is an area where
multiple feature requests have been made.
Implement scx_bpf_dispatch[_vtime]_from_dsq(), which can be called during
DSQ iteration and can move the task to any DSQ - local DSQs, global DSQ and
user DSQs. The kfuncs can be called from ops.dispatch() and any BPF context
which doesn't hold an rq lock, including BPF timers and SYSCALL programs.
This patchset is on top of:
sched_ext/for-6.12 bf934bed5e2f ("sched_ext: Add missing cfi stub for ops.tick")
+ two fix patches (http://lkml.kernel.org/r/ZtGkPKgoE5BeI7fN@slm.duckdns.org)
and is also available at:
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git scx-dispatch_from_dsq
This patchset contains the following patches:
0001-sched_ext-Rename-scx_kfunc_set_sleepable-to-unlocked.patch
0002-sched_ext-Refactor-consume_remote_task.patch
0003-sched_ext-Make-find_dsq_for_dispatch-handle-SCX_DSQ_.patch
0004-sched_ext-Make-dispatch_to_local_dsq-return-void.patch
0005-sched_ext-Restructure-dispatch_to_local_dsq.patch
0006-sched_ext-Reorder-args-for-consume_local-remote_task.patch
0007-sched_ext-Move-sanity-check-and-dsq_mod_nr-into-task.patch
0008-sched_ext-Move-consume_local_task-upward.patch
0009-sched_ext-Replace-consume_local_task-with-move_local.patch
0010-sched_ext-Implement-scx_bpf_dispatch-_vtime-_from_ds.patch
0011-scx_qmap-Implement-highpri-boosting.patch
0001-0009 are prep patches. The logic to bounce tasks across DSQs and CPUs
is rather complicated due to synchronization. The prep patches do quite a
bit of refactoring so that the helpers are more composable and can be used
for the new kfuncs.
0010 implements scx_bpf_dispatch[_vtime]_from_dsq().
0011 adds demo usages to scx_qmap.
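As a rough illustration of the end result, an ops.dispatch() using the new
kfunc might look like the sketch below. This is a hypothetical fragment:
MY_DSQ and task_is_eligible() are made up for illustration, and the scx_qmap
changes in 0011 are the real demo usage.

```c
void BPF_STRUCT_OPS(mysched_dispatch, s32 cpu, struct task_struct *prev)
{
	struct task_struct *p;

	/* walk MY_DSQ and pull an eligible task into this CPU's local DSQ */
	bpf_for_each(scx_dsq, p, MY_DSQ, 0) {
		if (!task_is_eligible(p))	/* made-up predicate */
			continue;
		if (scx_bpf_dispatch_from_dsq(BPF_FOR_EACH_ITER, p,
					      SCX_DSQ_LOCAL_ON | cpu, 0))
			break;	/* moved; can fail if @p was dequeued */
	}
}
```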
diffstat follows. Thanks.
kernel/sched/ext.c | 592 ++++++++++++++++++++++++++++++++++++----------------------
tools/sched_ext/include/scx/common.bpf.h | 8
tools/sched_ext/scx_qmap.bpf.c | 142 ++++++++++++-
tools/sched_ext/scx_qmap.c | 11 -
4 files changed, 519 insertions(+), 234 deletions(-)
--
tejun
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH 01/11] sched_ext: Rename scx_kfunc_set_sleepable to unlocked and relocate
2024-08-30 11:03 [PATCHSET sched_ext/for-6.12] sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq() Tejun Heo
@ 2024-08-30 11:03 ` Tejun Heo
2024-08-30 17:45 ` David Vernet
2024-08-30 11:03 ` [PATCH 02/11] sched_ext: Refactor consume_remote_task() Tejun Heo
` (10 subsequent siblings)
11 siblings, 1 reply; 35+ messages in thread
From: Tejun Heo @ 2024-08-30 11:03 UTC (permalink / raw)
To: void; +Cc: kernel-team, linux-kernel, Tejun Heo
Sleepable kfuncs don't need their own kfunc set as each is tagged with
KF_SLEEPABLE. Rename the set to scx_kfunc_set_unlocked, indicating that the rq
lock is not held, and relocate it right above the "any" set. This will be used to add
kfuncs that are allowed to be called from SYSCALL but not TRACING.
No functional changes intended.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 66 +++++++++++++++++++++++-----------------------
1 file changed, 33 insertions(+), 33 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 140a4612d379..5423554a11af 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -5395,35 +5395,6 @@ void __init init_sched_ext_class(void)
__bpf_kfunc_start_defs();
-/**
- * scx_bpf_create_dsq - Create a custom DSQ
- * @dsq_id: DSQ to create
- * @node: NUMA node to allocate from
- *
- * Create a custom DSQ identified by @dsq_id. Can be called from any sleepable
- * scx callback, and any BPF_PROG_TYPE_SYSCALL prog.
- */
-__bpf_kfunc s32 scx_bpf_create_dsq(u64 dsq_id, s32 node)
-{
- if (unlikely(node >= (int)nr_node_ids ||
- (node < 0 && node != NUMA_NO_NODE)))
- return -EINVAL;
- return PTR_ERR_OR_ZERO(create_dsq(dsq_id, node));
-}
-
-__bpf_kfunc_end_defs();
-
-BTF_KFUNCS_START(scx_kfunc_ids_sleepable)
-BTF_ID_FLAGS(func, scx_bpf_create_dsq, KF_SLEEPABLE)
-BTF_KFUNCS_END(scx_kfunc_ids_sleepable)
-
-static const struct btf_kfunc_id_set scx_kfunc_set_sleepable = {
- .owner = THIS_MODULE,
- .set = &scx_kfunc_ids_sleepable,
-};
-
-__bpf_kfunc_start_defs();
-
/**
* scx_bpf_select_cpu_dfl - The default implementation of ops.select_cpu()
* @p: task_struct to select a CPU for
@@ -5766,6 +5737,35 @@ static const struct btf_kfunc_id_set scx_kfunc_set_cpu_release = {
__bpf_kfunc_start_defs();
+/**
+ * scx_bpf_create_dsq - Create a custom DSQ
+ * @dsq_id: DSQ to create
+ * @node: NUMA node to allocate from
+ *
+ * Create a custom DSQ identified by @dsq_id. Can be called from any sleepable
+ * scx callback, and any BPF_PROG_TYPE_SYSCALL prog.
+ */
+__bpf_kfunc s32 scx_bpf_create_dsq(u64 dsq_id, s32 node)
+{
+ if (unlikely(node >= (int)nr_node_ids ||
+ (node < 0 && node != NUMA_NO_NODE)))
+ return -EINVAL;
+ return PTR_ERR_OR_ZERO(create_dsq(dsq_id, node));
+}
+
+__bpf_kfunc_end_defs();
+
+BTF_KFUNCS_START(scx_kfunc_ids_unlocked)
+BTF_ID_FLAGS(func, scx_bpf_create_dsq, KF_SLEEPABLE)
+BTF_KFUNCS_END(scx_kfunc_ids_unlocked)
+
+static const struct btf_kfunc_id_set scx_kfunc_set_unlocked = {
+ .owner = THIS_MODULE,
+ .set = &scx_kfunc_ids_unlocked,
+};
+
+__bpf_kfunc_start_defs();
+
/**
* scx_bpf_kick_cpu - Trigger reschedule on a CPU
* @cpu: cpu to kick
@@ -6462,10 +6462,6 @@ static int __init scx_init(void)
* check using scx_kf_allowed().
*/
if ((ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
- &scx_kfunc_set_sleepable)) ||
- (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL,
- &scx_kfunc_set_sleepable)) ||
- (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
&scx_kfunc_set_select_cpu)) ||
(ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
&scx_kfunc_set_enqueue_dispatch)) ||
@@ -6473,6 +6469,10 @@ static int __init scx_init(void)
&scx_kfunc_set_dispatch)) ||
(ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
&scx_kfunc_set_cpu_release)) ||
+ (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
+ &scx_kfunc_set_unlocked)) ||
+ (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL,
+ &scx_kfunc_set_unlocked)) ||
(ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
&scx_kfunc_set_any)) ||
(ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING,
--
2.46.0
* [PATCH 02/11] sched_ext: Refactor consume_remote_task()
2024-08-30 11:03 [PATCHSET sched_ext/for-6.12] sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq() Tejun Heo
2024-08-30 11:03 ` [PATCH 01/11] sched_ext: Rename scx_kfunc_set_sleepable to unlocked and relocate Tejun Heo
@ 2024-08-30 11:03 ` Tejun Heo
2024-08-31 4:05 ` David Vernet
2024-08-30 11:03 ` [PATCH 03/11] sched_ext: Make find_dsq_for_dispatch() handle SCX_DSQ_LOCAL_ON Tejun Heo
` (9 subsequent siblings)
11 siblings, 1 reply; 35+ messages in thread
From: Tejun Heo @ 2024-08-30 11:03 UTC (permalink / raw)
To: void; +Cc: kernel-team, linux-kernel, Tejun Heo
The tricky p->scx.holding_cpu handling was split across
consume_remote_task() body and move_task_to_local_dsq(). Refactor such that:
- All the tricky part is now in the new unlink_dsq_and_lock_task_rq() with
consolidated documentation.
- move_task_to_local_dsq() now implements straightforward task migration
making it easier to use in other places.
- dispatch_to_local_dsq() is another user of move_task_to_local_dsq(). Its
usage is updated accordingly. This makes the local and remote cases more
symmetric.
No functional changes intended.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 145 ++++++++++++++++++++++++---------------------
1 file changed, 76 insertions(+), 69 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 5423554a11af..3facfca73337 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2109,49 +2109,13 @@ static bool yield_to_task_scx(struct rq *rq, struct task_struct *to)
* @src_rq: rq to move the task from, locked on entry, released on return
* @dst_rq: rq to move the task into, locked on return
*
- * Move @p which is currently on @src_rq to @dst_rq's local DSQ. The caller
- * must:
- *
- * 1. Start with exclusive access to @p either through its DSQ lock or
- * %SCX_OPSS_DISPATCHING flag.
- *
- * 2. Set @p->scx.holding_cpu to raw_smp_processor_id().
- *
- * 3. Remember task_rq(@p) as @src_rq. Release the exclusive access so that we
- * don't deadlock with dequeue.
- *
- * 4. Lock @src_rq from #3.
- *
- * 5. Call this function.
- *
- * Returns %true if @p was successfully moved. %false after racing dequeue and
- * losing. On return, @src_rq is unlocked and @dst_rq is locked.
+ * Move @p which is currently on @src_rq to @dst_rq's local DSQ.
*/
-static bool move_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
+static void move_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
struct rq *src_rq, struct rq *dst_rq)
{
lockdep_assert_rq_held(src_rq);
- /*
- * If dequeue got to @p while we were trying to lock @src_rq, it'd have
- * cleared @p->scx.holding_cpu to -1. While other cpus may have updated
- * it to different values afterwards, as this operation can't be
- * preempted or recurse, @p->scx.holding_cpu can never become
- * raw_smp_processor_id() again before we're done. Thus, we can tell
- * whether we lost to dequeue by testing whether @p->scx.holding_cpu is
- * still raw_smp_processor_id().
- *
- * @p->rq couldn't have changed if we're still the holding cpu.
- *
- * See dispatch_dequeue() for the counterpart.
- */
- if (unlikely(p->scx.holding_cpu != raw_smp_processor_id()) ||
- WARN_ON_ONCE(src_rq != task_rq(p))) {
- raw_spin_rq_unlock(src_rq);
- raw_spin_rq_lock(dst_rq);
- return false;
- }
-
/* the following marks @p MIGRATING which excludes dequeue */
deactivate_task(src_rq, p, 0);
set_task_cpu(p, cpu_of(dst_rq));
@@ -2170,8 +2134,6 @@ static bool move_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
dst_rq->scx.extra_enq_flags = enq_flags;
activate_task(dst_rq, p, 0);
dst_rq->scx.extra_enq_flags = 0;
-
- return true;
}
#endif /* CONFIG_SMP */
@@ -2236,28 +2198,69 @@ static bool task_can_run_on_remote_rq(struct task_struct *p, struct rq *rq,
return true;
}
-static bool consume_remote_task(struct rq *rq, struct scx_dispatch_q *dsq,
- struct task_struct *p, struct rq *task_rq)
+/**
+ * unlink_dsq_and_lock_task_rq() - Unlink task from its DSQ and lock its task_rq
+ * @p: target task
+ * @dsq: locked DSQ @p is currently on
+ * @task_rq: @p's task_rq, stable with @dsq locked
+ *
+ * Called with @dsq locked but no rq's locked. We want to move @p to a different
+ * DSQ, including any local DSQ, but are not locking @task_rq. Locking @task_rq
+ * is required when transferring into a local DSQ. Even when transferring into a
+ * non-local DSQ, it's better to use the same mechanism to protect against
+ * dequeues and maintain the invariant that @p->scx.dsq can only change while
+ * @task_rq is locked, which e.g. scx_dump_task() depends on.
+ *
+ * We want to grab @task_rq but that can deadlock if we try while locking @dsq,
+ * so we want to unlink @p from @dsq, drop its lock and then lock @task_rq. As
+ * this may race with dequeue, which can't drop the rq lock or fail, do a little
+ * dancing from our side.
+ *
+ * @p->scx.holding_cpu is set to this CPU before @dsq is unlocked. If @p gets
+ * dequeued after we unlock @dsq but before locking @task_rq, the holding_cpu
+ * would be cleared to -1. While other cpus may have updated it to different
+ * values afterwards, as this operation can't be preempted or recurse, the
+ * holding_cpu can never become this CPU again before we're done. Thus, we can
+ * tell whether we lost to dequeue by testing whether the holding_cpu still
+ * points to this CPU. See dispatch_dequeue() for the counterpart.
+ *
+ * On return, @dsq is unlocked and @task_rq is locked. Returns %true if @p is
+ * still valid. %false if lost to dequeue.
+ */
+static bool unlink_dsq_and_lock_task_rq(struct task_struct *p,
+ struct scx_dispatch_q *dsq,
+ struct rq *task_rq)
{
- lockdep_assert_held(&dsq->lock); /* released on return */
+ s32 cpu = raw_smp_processor_id();
+
+ lockdep_assert_held(&dsq->lock);
- /*
- * @dsq is locked and @p is on a remote rq. @p is currently protected by
- * @dsq->lock. We want to pull @p to @rq but may deadlock if we grab
- * @task_rq while holding @dsq and @rq locks. As dequeue can't drop the
- * rq lock or fail, do a little dancing from our side. See
- * move_task_to_local_dsq().
- */
WARN_ON_ONCE(p->scx.holding_cpu >= 0);
task_unlink_from_dsq(p, dsq);
dsq_mod_nr(dsq, -1);
- p->scx.holding_cpu = raw_smp_processor_id();
- raw_spin_unlock(&dsq->lock);
+ p->scx.holding_cpu = cpu;
- raw_spin_rq_unlock(rq);
+ raw_spin_unlock(&dsq->lock);
raw_spin_rq_lock(task_rq);
- return move_task_to_local_dsq(p, 0, task_rq, rq);
+ /* task_rq couldn't have changed if we're still the holding cpu */
+ return likely(p->scx.holding_cpu == cpu) &&
+ !WARN_ON_ONCE(task_rq != task_rq(p));
+}
+
+static bool consume_remote_task(struct rq *this_rq, struct scx_dispatch_q *dsq,
+ struct task_struct *p, struct rq *task_rq)
+{
+ raw_spin_rq_unlock(this_rq);
+
+ if (unlink_dsq_and_lock_task_rq(p, dsq, task_rq)) {
+ move_task_to_local_dsq(p, 0, task_rq, this_rq);
+ return true;
+ } else {
+ raw_spin_rq_unlock(task_rq);
+ raw_spin_rq_lock(this_rq);
+ return false;
+ }
}
#else /* CONFIG_SMP */
static inline bool task_can_run_on_remote_rq(struct task_struct *p, struct rq *rq, bool trigger_error) { return false; }
@@ -2361,7 +2364,8 @@ dispatch_to_local_dsq(struct rq *rq, u64 dsq_id, struct task_struct *p,
* As DISPATCHING guarantees that @p is wholly ours, we can
* pretend that we're moving from a DSQ and use the same
* mechanism - mark the task under transfer with holding_cpu,
- * release DISPATCHING and then follow the same protocol.
+ * release DISPATCHING and then follow the same protocol. See
+ * unlink_dsq_and_lock_task_rq().
*/
p->scx.holding_cpu = raw_smp_processor_id();
@@ -2374,28 +2378,31 @@ dispatch_to_local_dsq(struct rq *rq, u64 dsq_id, struct task_struct *p,
raw_spin_rq_lock(src_rq);
}
- if (src_rq == dst_rq) {
+ /* task_rq couldn't have changed if we're still the holding cpu */
+ dsp = p->scx.holding_cpu == raw_smp_processor_id() &&
+ !WARN_ON_ONCE(src_rq != task_rq(p));
+
+ if (likely(dsp)) {
/*
- * As @p is staying on the same rq, there's no need to
+ * If @p is staying on the same rq, there's no need to
* go through the full deactivate/activate cycle.
* Optimize by abbreviating the operations in
* move_task_to_local_dsq().
*/
- dsp = p->scx.holding_cpu == raw_smp_processor_id();
- if (likely(dsp)) {
+ if (src_rq == dst_rq) {
p->scx.holding_cpu = -1;
- dispatch_enqueue(&dst_rq->scx.local_dsq, p,
- enq_flags);
+ dispatch_enqueue(&dst_rq->scx.local_dsq,
+ p, enq_flags);
+ } else {
+ move_task_to_local_dsq(p, enq_flags,
+ src_rq, dst_rq);
}
- } else {
- dsp = move_task_to_local_dsq(p, enq_flags,
- src_rq, dst_rq);
- }
- /* if the destination CPU is idle, wake it up */
- if (dsp && sched_class_above(p->sched_class,
- dst_rq->curr->sched_class))
- resched_curr(dst_rq);
+ /* if the destination CPU is idle, wake it up */
+ if (sched_class_above(p->sched_class,
+ dst_rq->curr->sched_class))
+ resched_curr(dst_rq);
+ }
/* switch back to @rq lock */
if (rq != dst_rq) {
--
2.46.0
* [PATCH 03/11] sched_ext: Make find_dsq_for_dispatch() handle SCX_DSQ_LOCAL_ON
2024-08-30 11:03 [PATCHSET sched_ext/for-6.12] sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq() Tejun Heo
2024-08-30 11:03 ` [PATCH 01/11] sched_ext: Rename scx_kfunc_set_sleepable to unlocked and relocate Tejun Heo
2024-08-30 11:03 ` [PATCH 02/11] sched_ext: Refactor consume_remote_task() Tejun Heo
@ 2024-08-30 11:03 ` Tejun Heo
2024-09-01 0:11 ` David Vernet
2024-08-30 11:03 ` [PATCH 04/11] sched_ext: Make dispatch_to_local_dsq() return void Tejun Heo
` (8 subsequent siblings)
11 siblings, 1 reply; 35+ messages in thread
From: Tejun Heo @ 2024-08-30 11:03 UTC (permalink / raw)
To: void; +Cc: kernel-team, linux-kernel, Tejun Heo
find_dsq_for_dispatch() handles all DSQ IDs except SCX_DSQ_LOCAL_ON.
Instead, each caller handles SCX_DSQ_LOCAL_ON before calling it. Move
SCX_DSQ_LOCAL_ON lookup into find_dsq_for_dispatch() to remove duplicate
code in direct_dispatch() and dispatch_to_local_dsq().
No functional changes intended.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 90 +++++++++++++++++++++-------------------------
1 file changed, 40 insertions(+), 50 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 3facfca73337..80387cd9b8c3 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1726,6 +1726,15 @@ static struct scx_dispatch_q *find_dsq_for_dispatch(struct rq *rq, u64 dsq_id,
if (dsq_id == SCX_DSQ_LOCAL)
return &rq->scx.local_dsq;
+ if ((dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON) {
+ s32 cpu = dsq_id & SCX_DSQ_LOCAL_CPU_MASK;
+
+ if (!ops_cpu_valid(cpu, "in SCX_DSQ_LOCAL_ON dispatch verdict"))
+ return &scx_dsq_global;
+
+ return &cpu_rq(cpu)->scx.local_dsq;
+ }
+
dsq = find_non_local_dsq(dsq_id);
if (unlikely(!dsq)) {
scx_ops_error("non-existent DSQ 0x%llx for %s[%d]",
@@ -1769,8 +1778,8 @@ static void mark_direct_dispatch(struct task_struct *ddsp_task,
static void direct_dispatch(struct task_struct *p, u64 enq_flags)
{
struct rq *rq = task_rq(p);
- struct scx_dispatch_q *dsq;
- u64 dsq_id = p->scx.ddsp_dsq_id;
+ struct scx_dispatch_q *dsq =
+ find_dsq_for_dispatch(rq, p->scx.ddsp_dsq_id, p);
touch_core_sched_dispatch(rq, p);
@@ -1782,15 +1791,9 @@ static void direct_dispatch(struct task_struct *p, u64 enq_flags)
* DSQ_LOCAL_ON verdicts targeting the local DSQ of a remote CPU, defer
* the enqueue so that it's executed when @rq can be unlocked.
*/
- if ((dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON) {
- s32 cpu = dsq_id & SCX_DSQ_LOCAL_CPU_MASK;
+ if (dsq->id == SCX_DSQ_LOCAL && dsq != &rq->scx.local_dsq) {
unsigned long opss;
- if (cpu == cpu_of(rq)) {
- dsq_id = SCX_DSQ_LOCAL;
- goto dispatch;
- }
-
opss = atomic_long_read(&p->scx.ops_state) & SCX_OPSS_STATE_MASK;
switch (opss & SCX_OPSS_STATE_MASK) {
@@ -1817,8 +1820,6 @@ static void direct_dispatch(struct task_struct *p, u64 enq_flags)
return;
}
-dispatch:
- dsq = find_dsq_for_dispatch(rq, dsq_id, p);
dispatch_enqueue(dsq, p, p->scx.ddsp_enq_flags | SCX_ENQ_CLEAR_OPSS);
}
@@ -2303,51 +2304,38 @@ static bool consume_dispatch_q(struct rq *rq, struct scx_dispatch_q *dsq)
enum dispatch_to_local_dsq_ret {
DTL_DISPATCHED, /* successfully dispatched */
DTL_LOST, /* lost race to dequeue */
- DTL_NOT_LOCAL, /* destination is not a local DSQ */
DTL_INVALID, /* invalid local dsq_id */
};
/**
* dispatch_to_local_dsq - Dispatch a task to a local dsq
* @rq: current rq which is locked
- * @dsq_id: destination dsq ID
+ * @dst_dsq: destination DSQ
* @p: task to dispatch
* @enq_flags: %SCX_ENQ_*
*
- * We're holding @rq lock and want to dispatch @p to the local DSQ identified by
- * @dsq_id. This function performs all the synchronization dancing needed
- * because local DSQs are protected with rq locks.
+ * We're holding @rq lock and want to dispatch @p to @dst_dsq which is a local
+ * DSQ. This function performs all the synchronization dancing needed because
+ * local DSQs are protected with rq locks.
*
* The caller must have exclusive ownership of @p (e.g. through
* %SCX_OPSS_DISPATCHING).
*/
static enum dispatch_to_local_dsq_ret
-dispatch_to_local_dsq(struct rq *rq, u64 dsq_id, struct task_struct *p,
- u64 enq_flags)
+dispatch_to_local_dsq(struct rq *rq, struct scx_dispatch_q *dst_dsq,
+ struct task_struct *p, u64 enq_flags)
{
struct rq *src_rq = task_rq(p);
- struct rq *dst_rq;
+ struct rq *dst_rq = container_of(dst_dsq, struct rq, scx.local_dsq);
/*
* We're synchronized against dequeue through DISPATCHING. As @p can't
* be dequeued, its task_rq and cpus_allowed are stable too.
+ *
+ * If dispatching to @rq that @p is already on, no lock dancing needed.
*/
- if (dsq_id == SCX_DSQ_LOCAL) {
- dst_rq = rq;
- } else if ((dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON) {
- s32 cpu = dsq_id & SCX_DSQ_LOCAL_CPU_MASK;
-
- if (!ops_cpu_valid(cpu, "in SCX_DSQ_LOCAL_ON dispatch verdict"))
- return DTL_INVALID;
- dst_rq = cpu_rq(cpu);
- } else {
- return DTL_NOT_LOCAL;
- }
-
- /* if dispatching to @rq that @p is already on, no lock dancing needed */
if (rq == src_rq && rq == dst_rq) {
- dispatch_enqueue(&dst_rq->scx.local_dsq, p,
- enq_flags | SCX_ENQ_CLEAR_OPSS);
+ dispatch_enqueue(dst_dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
return DTL_DISPATCHED;
}
@@ -2489,19 +2477,21 @@ static void finish_dispatch(struct rq *rq, struct task_struct *p,
BUG_ON(!(p->scx.flags & SCX_TASK_QUEUED));
- switch (dispatch_to_local_dsq(rq, dsq_id, p, enq_flags)) {
- case DTL_DISPATCHED:
- break;
- case DTL_LOST:
- break;
- case DTL_INVALID:
- dsq_id = SCX_DSQ_GLOBAL;
- fallthrough;
- case DTL_NOT_LOCAL:
- dsq = find_dsq_for_dispatch(cpu_rq(raw_smp_processor_id()),
- dsq_id, p);
+ dsq = find_dsq_for_dispatch(this_rq(), dsq_id, p);
+
+ if (dsq->id == SCX_DSQ_LOCAL) {
+ switch (dispatch_to_local_dsq(rq, dsq, p, enq_flags)) {
+ case DTL_DISPATCHED:
+ break;
+ case DTL_LOST:
+ break;
+ case DTL_INVALID:
+ dispatch_enqueue(&scx_dsq_global, p,
+ enq_flags | SCX_ENQ_CLEAR_OPSS);
+ break;
+ }
+ } else {
dispatch_enqueue(dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
- break;
}
}
@@ -2718,13 +2708,13 @@ static void process_ddsp_deferred_locals(struct rq *rq)
*/
while ((p = list_first_entry_or_null(&rq->scx.ddsp_deferred_locals,
struct task_struct, scx.dsq_list.node))) {
- s32 ret;
+ struct scx_dispatch_q *dsq;
list_del_init(&p->scx.dsq_list.node);
- ret = dispatch_to_local_dsq(rq, p->scx.ddsp_dsq_id, p,
- p->scx.ddsp_enq_flags);
- WARN_ON_ONCE(ret == DTL_NOT_LOCAL);
+ dsq = find_dsq_for_dispatch(rq, p->scx.ddsp_dsq_id, p);
+ if (!WARN_ON_ONCE(dsq->id != SCX_DSQ_LOCAL))
+ dispatch_to_local_dsq(rq, dsq, p, p->scx.ddsp_enq_flags);
}
}
--
2.46.0
* [PATCH 04/11] sched_ext: Make dispatch_to_local_dsq() return void
2024-08-30 11:03 [PATCHSET sched_ext/for-6.12] sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq() Tejun Heo
` (2 preceding siblings ...)
2024-08-30 11:03 ` [PATCH 03/11] sched_ext: Make find_dsq_for_dispatch() handle SCX_DSQ_LOCAL_ON Tejun Heo
@ 2024-08-30 11:03 ` Tejun Heo
2024-08-30 17:44 ` [PATCH 04/11] sched_ext: Fix processs_ddsp_deferred_locals() by unifying DTL_INVALID handling Tejun Heo
2024-08-30 11:03 ` [PATCH 05/11] sched_ext: Restructure dispatch_to_local_dsq() Tejun Heo
` (7 subsequent siblings)
11 siblings, 1 reply; 35+ messages in thread
From: Tejun Heo @ 2024-08-30 11:03 UTC (permalink / raw)
To: void; +Cc: kernel-team, linux-kernel, Tejun Heo
With the preceding update, the only return value which makes a meaningful
difference is DTL_INVALID, for which the caller dispatches the task to the
global DSQ. Move the global DSQ fallback into dispatch_to_local_dsq() and
remove the return value.
No functional changes intended.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 41 ++++++++++-------------------------------
1 file changed, 10 insertions(+), 31 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 80387cd9b8c3..34b4e63850c1 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2301,12 +2301,6 @@ static bool consume_dispatch_q(struct rq *rq, struct scx_dispatch_q *dsq)
return false;
}
-enum dispatch_to_local_dsq_ret {
- DTL_DISPATCHED, /* successfully dispatched */
- DTL_LOST, /* lost race to dequeue */
- DTL_INVALID, /* invalid local dsq_id */
-};
-
/**
* dispatch_to_local_dsq - Dispatch a task to a local dsq
* @rq: current rq which is locked
@@ -2321,9 +2315,8 @@ enum dispatch_to_local_dsq_ret {
* The caller must have exclusive ownership of @p (e.g. through
* %SCX_OPSS_DISPATCHING).
*/
-static enum dispatch_to_local_dsq_ret
-dispatch_to_local_dsq(struct rq *rq, struct scx_dispatch_q *dst_dsq,
- struct task_struct *p, u64 enq_flags)
+static void dispatch_to_local_dsq(struct rq *rq, struct scx_dispatch_q *dst_dsq,
+ struct task_struct *p, u64 enq_flags)
{
struct rq *src_rq = task_rq(p);
struct rq *dst_rq = container_of(dst_dsq, struct rq, scx.local_dsq);
@@ -2336,13 +2329,11 @@ dispatch_to_local_dsq(struct rq *rq, struct scx_dispatch_q *dst_dsq,
*/
if (rq == src_rq && rq == dst_rq) {
dispatch_enqueue(dst_dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
- return DTL_DISPATCHED;
+ return;
}
#ifdef CONFIG_SMP
if (likely(task_can_run_on_remote_rq(p, dst_rq, true))) {
- bool dsp;
-
/*
* @p is on a possibly remote @src_rq which we need to lock to
* move the task. If dequeue is in progress, it'd be locking
@@ -2367,10 +2358,8 @@ dispatch_to_local_dsq(struct rq *rq, struct scx_dispatch_q *dst_dsq,
}
/* task_rq couldn't have changed if we're still the holding cpu */
- dsp = p->scx.holding_cpu == raw_smp_processor_id() &&
- !WARN_ON_ONCE(src_rq != task_rq(p));
-
- if (likely(dsp)) {
+ if (likely(p->scx.holding_cpu == raw_smp_processor_id()) &&
+ !WARN_ON_ONCE(src_rq != task_rq(p))) {
/*
* If @p is staying on the same rq, there's no need to
* go through the full deactivate/activate cycle.
@@ -2398,11 +2387,11 @@ dispatch_to_local_dsq(struct rq *rq, struct scx_dispatch_q *dst_dsq,
raw_spin_rq_lock(rq);
}
- return dsp ? DTL_DISPATCHED : DTL_LOST;
+ return;
}
#endif /* CONFIG_SMP */
- return DTL_INVALID;
+ dispatch_enqueue(&scx_dsq_global, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
}
/**
@@ -2479,20 +2468,10 @@ static void finish_dispatch(struct rq *rq, struct task_struct *p,
dsq = find_dsq_for_dispatch(this_rq(), dsq_id, p);
- if (dsq->id == SCX_DSQ_LOCAL) {
- switch (dispatch_to_local_dsq(rq, dsq, p, enq_flags)) {
- case DTL_DISPATCHED:
- break;
- case DTL_LOST:
- break;
- case DTL_INVALID:
- dispatch_enqueue(&scx_dsq_global, p,
- enq_flags | SCX_ENQ_CLEAR_OPSS);
- break;
- }
- } else {
+ if (dsq->id == SCX_DSQ_LOCAL)
+ dispatch_to_local_dsq(rq, dsq, p, enq_flags);
+ else
dispatch_enqueue(dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
- }
}
static void flush_dispatch_buf(struct rq *rq)
--
2.46.0
* [PATCH 05/11] sched_ext: Restructure dispatch_to_local_dsq()
2024-08-30 11:03 [PATCHSET sched_ext/for-6.12] sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq() Tejun Heo
` (3 preceding siblings ...)
2024-08-30 11:03 ` [PATCH 04/11] sched_ext: Make dispatch_to_local_dsq() return void Tejun Heo
@ 2024-08-30 11:03 ` Tejun Heo
2024-09-01 1:09 ` David Vernet
2024-08-30 11:03 ` [PATCH 06/11] sched_ext: Reorder args for consume_local/remote_task() Tejun Heo
` (6 subsequent siblings)
11 siblings, 1 reply; 35+ messages in thread
From: Tejun Heo @ 2024-08-30 11:03 UTC (permalink / raw)
To: void; +Cc: kernel-team, linux-kernel, Tejun Heo
Now that there's nothing left after the big if block, flip the if condition
and unindent the body.
No functional changes intended.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 94 ++++++++++++++++++++++------------------------
1 file changed, 44 insertions(+), 50 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 34b4e63850c1..add267f31396 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2333,65 +2333,59 @@ static void dispatch_to_local_dsq(struct rq *rq, struct scx_dispatch_q *dst_dsq,
}
#ifdef CONFIG_SMP
- if (likely(task_can_run_on_remote_rq(p, dst_rq, true))) {
- /*
- * @p is on a possibly remote @src_rq which we need to lock to
- * move the task. If dequeue is in progress, it'd be locking
- * @src_rq and waiting on DISPATCHING, so we can't grab @src_rq
- * lock while holding DISPATCHING.
- *
- * As DISPATCHING guarantees that @p is wholly ours, we can
- * pretend that we're moving from a DSQ and use the same
- * mechanism - mark the task under transfer with holding_cpu,
- * release DISPATCHING and then follow the same protocol. See
- * unlink_dsq_and_lock_task_rq().
- */
- p->scx.holding_cpu = raw_smp_processor_id();
+ if (unlikely(!task_can_run_on_remote_rq(p, dst_rq, true))) {
+ dispatch_enqueue(&scx_dsq_global, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
+ return;
+ }
- /* store_release ensures that dequeue sees the above */
- atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_NONE);
+ /*
+ * @p is on a possibly remote @src_rq which we need to lock to move the
+ * task. If dequeue is in progress, it'd be locking @src_rq and waiting
+ * on DISPATCHING, so we can't grab @src_rq lock while holding
+ * DISPATCHING.
+ *
+ * As DISPATCHING guarantees that @p is wholly ours, we can pretend that
+ * we're moving from a DSQ and use the same mechanism - mark the task
+ * under transfer with holding_cpu, release DISPATCHING and then follow
+ * the same protocol. See unlink_dsq_and_lock_task_rq().
+ */
+ p->scx.holding_cpu = raw_smp_processor_id();
- /* switch to @src_rq lock */
- if (rq != src_rq) {
- raw_spin_rq_unlock(rq);
- raw_spin_rq_lock(src_rq);
- }
+ /* store_release ensures that dequeue sees the above */
+ atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_NONE);
- /* task_rq couldn't have changed if we're still the holding cpu */
- if (likely(p->scx.holding_cpu == raw_smp_processor_id()) &&
- !WARN_ON_ONCE(src_rq != task_rq(p))) {
- /*
- * If @p is staying on the same rq, there's no need to
- * go through the full deactivate/activate cycle.
- * Optimize by abbreviating the operations in
- * move_task_to_local_dsq().
- */
- if (src_rq == dst_rq) {
- p->scx.holding_cpu = -1;
- dispatch_enqueue(&dst_rq->scx.local_dsq,
- p, enq_flags);
- } else {
- move_task_to_local_dsq(p, enq_flags,
- src_rq, dst_rq);
- }
+ /* switch to @src_rq lock */
+ if (rq != src_rq) {
+ raw_spin_rq_unlock(rq);
+ raw_spin_rq_lock(src_rq);
+ }
- /* if the destination CPU is idle, wake it up */
- if (sched_class_above(p->sched_class,
- dst_rq->curr->sched_class))
- resched_curr(dst_rq);
+ /* task_rq couldn't have changed if we're still the holding cpu */
+ if (likely(p->scx.holding_cpu == raw_smp_processor_id()) &&
+ !WARN_ON_ONCE(src_rq != task_rq(p))) {
+ /*
+ * If @p is staying on the same rq, there's no need to go
+ * through the full deactivate/activate cycle. Optimize by
+ * abbreviating the operations in move_task_to_local_dsq().
+ */
+ if (src_rq == dst_rq) {
+ p->scx.holding_cpu = -1;
+ dispatch_enqueue(&dst_rq->scx.local_dsq, p, enq_flags);
+ } else {
+ move_task_to_local_dsq(p, enq_flags, src_rq, dst_rq);
}
- /* switch back to @rq lock */
- if (rq != dst_rq) {
- raw_spin_rq_unlock(dst_rq);
- raw_spin_rq_lock(rq);
- }
+ /* if the destination CPU is idle, wake it up */
+ if (sched_class_above(p->sched_class, dst_rq->curr->sched_class))
+ resched_curr(dst_rq);
+ }
- return;
+ /* switch back to @rq lock */
+ if (rq != dst_rq) {
+ raw_spin_rq_unlock(dst_rq);
+ raw_spin_rq_lock(rq);
}
#endif /* CONFIG_SMP */
-
- dispatch_enqueue(&scx_dsq_global, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
}
/**
--
2.46.0
* [PATCH 06/11] sched_ext: Reorder args for consume_local/remote_task()
2024-08-30 11:03 [PATCHSET sched_ext/for-6.12] sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq() Tejun Heo
` (4 preceding siblings ...)
2024-08-30 11:03 ` [PATCH 05/11] sched_ext: Restructure dispatch_to_local_dsq() Tejun Heo
@ 2024-08-30 11:03 ` Tejun Heo
2024-09-01 1:40 ` David Vernet
2024-08-30 11:03 ` [PATCH 07/11] sched_ext: Move sanity check and dsq_mod_nr() into task_unlink_from_dsq() Tejun Heo
11 siblings, 1 reply; 35+ messages in thread
From: Tejun Heo @ 2024-08-30 11:03 UTC (permalink / raw)
To: void; +Cc: kernel-team, linux-kernel, Tejun Heo
Reorder args for consistency in the order of:
current_rq, p, src_[rq|dsq], dst_[rq|dsq].
No functional changes intended.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index add267f31396..620cc0586c4b 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2139,8 +2139,8 @@ static void move_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
#endif /* CONFIG_SMP */
-static void consume_local_task(struct rq *rq, struct scx_dispatch_q *dsq,
- struct task_struct *p)
+static void consume_local_task(struct task_struct *p,
+ struct scx_dispatch_q *dsq, struct rq *rq)
{
lockdep_assert_held(&dsq->lock); /* released on return */
@@ -2249,8 +2249,8 @@ static bool unlink_dsq_and_lock_task_rq(struct task_struct *p,
!WARN_ON_ONCE(task_rq != task_rq(p));
}
-static bool consume_remote_task(struct rq *this_rq, struct scx_dispatch_q *dsq,
- struct task_struct *p, struct rq *task_rq)
+static bool consume_remote_task(struct rq *this_rq, struct task_struct *p,
+ struct scx_dispatch_q *dsq, struct rq *task_rq)
{
raw_spin_rq_unlock(this_rq);
@@ -2265,7 +2265,7 @@ static bool consume_remote_task(struct rq *this_rq, struct scx_dispatch_q *dsq,
}
#else /* CONFIG_SMP */
static inline bool task_can_run_on_remote_rq(struct task_struct *p, struct rq *rq, bool trigger_error) { return false; }
-static inline bool consume_remote_task(struct rq *rq, struct scx_dispatch_q *dsq, struct task_struct *p, struct rq *task_rq) { return false; }
+static inline bool consume_remote_task(struct rq *this_rq, struct task_struct *p, struct scx_dispatch_q *dsq, struct rq *task_rq) { return false; }
#endif /* CONFIG_SMP */
static bool consume_dispatch_q(struct rq *rq, struct scx_dispatch_q *dsq)
@@ -2286,12 +2286,12 @@ static bool consume_dispatch_q(struct rq *rq, struct scx_dispatch_q *dsq)
struct rq *task_rq = task_rq(p);
if (rq == task_rq) {
- consume_local_task(rq, dsq, p);
+ consume_local_task(p, dsq, rq);
return true;
}
if (task_can_run_on_remote_rq(p, rq, false)) {
- if (likely(consume_remote_task(rq, dsq, p, task_rq)))
+ if (likely(consume_remote_task(rq, p, dsq, task_rq)))
return true;
goto retry;
}
--
2.46.0
* [PATCH 07/11] sched_ext: Move sanity check and dsq_mod_nr() into task_unlink_from_dsq()
2024-08-30 11:03 [PATCHSET sched_ext/for-6.12] sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq() Tejun Heo
2024-08-30 11:03 ` [PATCH 06/11] sched_ext: Reorder args for consume_local/remote_task() Tejun Heo
@ 2024-08-30 11:03 ` Tejun Heo
2024-09-01 1:42 ` David Vernet
2024-08-30 11:03 ` [PATCH 08/11] sched_ext: Move consume_local_task() upward Tejun Heo
11 siblings, 1 reply; 35+ messages in thread
From: Tejun Heo @ 2024-08-30 11:03 UTC (permalink / raw)
To: void; +Cc: kernel-team, linux-kernel, Tejun Heo
All task_unlink_from_dsq() users also do dsq_mod_nr(dsq, -1). Move it into
task_unlink_from_dsq() along with the list-empty sanity check.
No functional changes intended.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 620cc0586c4b..9befa2bd1eb0 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1641,6 +1641,8 @@ static void dispatch_enqueue(struct scx_dispatch_q *dsq, struct task_struct *p,
static void task_unlink_from_dsq(struct task_struct *p,
struct scx_dispatch_q *dsq)
{
+ WARN_ON_ONCE(list_empty(&p->scx.dsq_list.node));
+
if (p->scx.dsq_flags & SCX_TASK_DSQ_ON_PRIQ) {
rb_erase(&p->scx.dsq_priq, &dsq->priq);
RB_CLEAR_NODE(&p->scx.dsq_priq);
@@ -1648,6 +1650,7 @@ static void task_unlink_from_dsq(struct task_struct *p,
}
list_del_init(&p->scx.dsq_list.node);
+ dsq_mod_nr(dsq, -1);
}
static void dispatch_dequeue(struct rq *rq, struct task_struct *p)
@@ -1684,9 +1687,7 @@ static void dispatch_dequeue(struct rq *rq, struct task_struct *p)
*/
if (p->scx.holding_cpu < 0) {
/* @p must still be on @dsq, dequeue */
- WARN_ON_ONCE(list_empty(&p->scx.dsq_list.node));
task_unlink_from_dsq(p, dsq);
- dsq_mod_nr(dsq, -1);
} else {
/*
* We're racing against dispatch_to_local_dsq() which already
@@ -2148,7 +2149,6 @@ static void consume_local_task(struct task_struct *p,
WARN_ON_ONCE(p->scx.holding_cpu >= 0);
task_unlink_from_dsq(p, dsq);
list_add_tail(&p->scx.dsq_list.node, &rq->scx.local_dsq.list);
- dsq_mod_nr(dsq, -1);
dsq_mod_nr(&rq->scx.local_dsq, 1);
p->scx.dsq = &rq->scx.local_dsq;
raw_spin_unlock(&dsq->lock);
@@ -2238,7 +2238,6 @@ static bool unlink_dsq_and_lock_task_rq(struct task_struct *p,
WARN_ON_ONCE(p->scx.holding_cpu >= 0);
task_unlink_from_dsq(p, dsq);
- dsq_mod_nr(dsq, -1);
p->scx.holding_cpu = cpu;
raw_spin_unlock(&dsq->lock);
--
2.46.0
* [PATCH 08/11] sched_ext: Move consume_local_task() upward
2024-08-30 11:03 [PATCHSET sched_ext/for-6.12] sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq() Tejun Heo
2024-08-30 11:03 ` [PATCH 07/11] sched_ext: Move sanity check and dsq_mod_nr() into task_unlink_from_dsq() Tejun Heo
@ 2024-08-30 11:03 ` Tejun Heo
2024-09-01 1:43 ` David Vernet
2024-08-30 11:03 ` [PATCH 09/11] sched_ext: Replace consume_local_task() with move_local_task_to_local_dsq() Tejun Heo
11 siblings, 1 reply; 35+ messages in thread
From: Tejun Heo @ 2024-08-30 11:03 UTC (permalink / raw)
To: void; +Cc: kernel-team, linux-kernel, Tejun Heo
Move consume_local_task() upward so that the local case comes first and the
two CONFIG_SMP blocks can be merged.
No functional changes intended.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 31 ++++++++++++++-----------------
1 file changed, 14 insertions(+), 17 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 9befa2bd1eb0..51d141602a11 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2103,6 +2103,20 @@ static bool yield_to_task_scx(struct rq *rq, struct task_struct *to)
return false;
}
+static void consume_local_task(struct task_struct *p,
+ struct scx_dispatch_q *dsq, struct rq *rq)
+{
+ lockdep_assert_held(&dsq->lock); /* released on return */
+
+ /* @dsq is locked and @p is on this rq */
+ WARN_ON_ONCE(p->scx.holding_cpu >= 0);
+ task_unlink_from_dsq(p, dsq);
+ list_add_tail(&p->scx.dsq_list.node, &rq->scx.local_dsq.list);
+ dsq_mod_nr(&rq->scx.local_dsq, 1);
+ p->scx.dsq = &rq->scx.local_dsq;
+ raw_spin_unlock(&dsq->lock);
+}
+
#ifdef CONFIG_SMP
/**
* move_task_to_local_dsq - Move a task from a different rq to a local DSQ
@@ -2138,23 +2152,6 @@ static void move_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
dst_rq->scx.extra_enq_flags = 0;
}
-#endif /* CONFIG_SMP */
-
-static void consume_local_task(struct task_struct *p,
- struct scx_dispatch_q *dsq, struct rq *rq)
-{
- lockdep_assert_held(&dsq->lock); /* released on return */
-
- /* @dsq is locked and @p is on this rq */
- WARN_ON_ONCE(p->scx.holding_cpu >= 0);
- task_unlink_from_dsq(p, dsq);
- list_add_tail(&p->scx.dsq_list.node, &rq->scx.local_dsq.list);
- dsq_mod_nr(&rq->scx.local_dsq, 1);
- p->scx.dsq = &rq->scx.local_dsq;
- raw_spin_unlock(&dsq->lock);
-}
-
-#ifdef CONFIG_SMP
/*
* Similar to kernel/sched/core.c::is_cpu_allowed(). However, there are two
* differences:
--
2.46.0
* [PATCH 09/11] sched_ext: Replace consume_local_task() with move_local_task_to_local_dsq()
2024-08-30 11:03 [PATCHSET sched_ext/for-6.12] sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq() Tejun Heo
2024-08-30 11:03 ` [PATCH 08/11] sched_ext: Move consume_local_task() upward Tejun Heo
@ 2024-08-30 11:03 ` Tejun Heo
2024-09-01 1:55 ` David Vernet
2024-08-30 11:03 ` [PATCH 10/11] sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq() Tejun Heo
11 siblings, 1 reply; 35+ messages in thread
From: Tejun Heo @ 2024-08-30 11:03 UTC (permalink / raw)
To: void; +Cc: kernel-team, linux-kernel, Tejun Heo
- Rename move_task_to_local_dsq() to move_remote_task_to_local_dsq().
- Rename consume_local_task() to move_local_task_to_local_dsq(), moving
  task_unlink_from_dsq() and the source DSQ unlocking out to its callers.
This is to make the migration code easier to reuse.
No functional changes intended.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 42 ++++++++++++++++++++++++++----------------
1 file changed, 26 insertions(+), 16 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 51d141602a11..df33524d68f3 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2103,23 +2103,30 @@ static bool yield_to_task_scx(struct rq *rq, struct task_struct *to)
return false;
}
-static void consume_local_task(struct task_struct *p,
- struct scx_dispatch_q *dsq, struct rq *rq)
+static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
+ struct scx_dispatch_q *src_dsq,
+ struct rq *dst_rq)
{
- lockdep_assert_held(&dsq->lock); /* released on return */
+ struct scx_dispatch_q *dst_dsq = &dst_rq->scx.local_dsq;
+
+ /* @dsq is locked and @p is on @dst_rq */
+ lockdep_assert_held(&src_dsq->lock);
+ lockdep_assert_rq_held(dst_rq);
- /* @dsq is locked and @p is on this rq */
WARN_ON_ONCE(p->scx.holding_cpu >= 0);
- task_unlink_from_dsq(p, dsq);
- list_add_tail(&p->scx.dsq_list.node, &rq->scx.local_dsq.list);
- dsq_mod_nr(&rq->scx.local_dsq, 1);
- p->scx.dsq = &rq->scx.local_dsq;
- raw_spin_unlock(&dsq->lock);
+
+ if (enq_flags & (SCX_ENQ_HEAD | SCX_ENQ_PREEMPT))
+ list_add(&p->scx.dsq_list.node, &dst_dsq->list);
+ else
+ list_add_tail(&p->scx.dsq_list.node, &dst_dsq->list);
+
+ dsq_mod_nr(dst_dsq, 1);
+ p->scx.dsq = dst_dsq;
}
#ifdef CONFIG_SMP
/**
- * move_task_to_local_dsq - Move a task from a different rq to a local DSQ
+ * move_remote_task_to_local_dsq - Move a task from a foreign rq to a local DSQ
* @p: task to move
* @enq_flags: %SCX_ENQ_*
* @src_rq: rq to move the task from, locked on entry, released on return
@@ -2127,8 +2134,8 @@ static void consume_local_task(struct task_struct *p,
*
* Move @p which is currently on @src_rq to @dst_rq's local DSQ.
*/
-static void move_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
- struct rq *src_rq, struct rq *dst_rq)
+static void move_remote_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
+ struct rq *src_rq, struct rq *dst_rq)
{
lockdep_assert_rq_held(src_rq);
@@ -2251,7 +2258,7 @@ static bool consume_remote_task(struct rq *this_rq, struct task_struct *p,
raw_spin_rq_unlock(this_rq);
if (unlink_dsq_and_lock_task_rq(p, dsq, task_rq)) {
- move_task_to_local_dsq(p, 0, task_rq, this_rq);
+ move_remote_task_to_local_dsq(p, 0, task_rq, this_rq);
return true;
} else {
raw_spin_rq_unlock(task_rq);
@@ -2282,7 +2289,9 @@ static bool consume_dispatch_q(struct rq *rq, struct scx_dispatch_q *dsq)
struct rq *task_rq = task_rq(p);
if (rq == task_rq) {
- consume_local_task(p, dsq, rq);
+ task_unlink_from_dsq(p, dsq);
+ move_local_task_to_local_dsq(p, 0, dsq, rq);
+ raw_spin_unlock(&dsq->lock);
return true;
}
@@ -2362,13 +2371,14 @@ static void dispatch_to_local_dsq(struct rq *rq, struct scx_dispatch_q *dst_dsq,
/*
* If @p is staying on the same rq, there's no need to go
* through the full deactivate/activate cycle. Optimize by
- * abbreviating the operations in move_task_to_local_dsq().
+ * abbreviating move_remote_task_to_local_dsq().
*/
if (src_rq == dst_rq) {
p->scx.holding_cpu = -1;
dispatch_enqueue(&dst_rq->scx.local_dsq, p, enq_flags);
} else {
- move_task_to_local_dsq(p, enq_flags, src_rq, dst_rq);
+ move_remote_task_to_local_dsq(p, enq_flags,
+ src_rq, dst_rq);
}
/* if the destination CPU is idle, wake it up */
--
2.46.0
* [PATCH 10/11] sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()
2024-08-30 11:03 [PATCHSET sched_ext/for-6.12] sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq() Tejun Heo
2024-08-30 11:03 ` [PATCH 09/11] sched_ext: Replace consume_local_task() with move_local_task_to_local_dsq() Tejun Heo
@ 2024-08-30 11:03 ` Tejun Heo
2024-08-31 14:30 ` Andrea Righi
2024-08-30 11:03 ` [PATCH 11/11] scx_qmap: Implement highpri boosting Tejun Heo
2024-08-30 17:31 ` [PATCHSET sched_ext/for-6.12] sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq() Tejun Heo
11 siblings, 1 reply; 35+ messages in thread
From: Tejun Heo @ 2024-08-30 11:03 UTC (permalink / raw)
To: void
Cc: kernel-team, linux-kernel, Tejun Heo, Daniel Hodges, Changwoo Min,
Andrea Righi, Dan Schatzberg
Once a task is put into a DSQ, the allowed operations are fairly limited.
Tasks in the built-in local and global DSQs are executed automatically and,
ignoring dequeue, there is only one way a task in a user DSQ can be
manipulated - scx_bpf_consume() moves the first task to the dispatching
local DSQ. This inflexibility sometimes gets in the way and is an area where
multiple feature requests have been made.
Implement scx_bpf_dispatch[_vtime]_from_dsq(), which can be called during
DSQ iteration and can move the task to any DSQ - local DSQs, global DSQ and
user DSQs. The kfuncs can be called from ops.dispatch() and any BPF context
which doesn't hold an rq lock, including BPF timers and SYSCALL programs.
This is an expansion of an earlier patch which only allowed moving into the
dispatching local DSQ:
http://lkml.kernel.org/r/Zn4Cw4FDTmvXnhaf@slm.duckdns.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Daniel Hodges <hodges.daniel.scott@gmail.com>
Cc: David Vernet <void@manifault.com>
Cc: Changwoo Min <multics69@gmail.com>
Cc: Andrea Righi <andrea.righi@linux.dev>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
---
kernel/sched/ext.c | 180 ++++++++++++++++++++++-
tools/sched_ext/include/scx/common.bpf.h | 8 +
2 files changed, 186 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index df33524d68f3..96b8cc490841 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1067,6 +1067,11 @@ static __always_inline bool scx_kf_allowed_on_arg_tasks(u32 mask,
return true;
}
+static bool scx_kf_allowed_if_unlocked(void)
+{
+ return !current->scx.kf_mask;
+}
+
/**
* nldsq_next_task - Iterate to the next task in a non-local DSQ
* @dsq: user dsq being interated
@@ -5461,7 +5466,7 @@ __bpf_kfunc_start_defs();
* scx_bpf_dispatch - Dispatch a task into the FIFO queue of a DSQ
* @p: task_struct to dispatch
* @dsq_id: DSQ to dispatch to
- * @slice: duration @p can run for in nsecs
+ * @slice: duration @p can run for in nsecs, 0 to keep the current value
* @enq_flags: SCX_ENQ_*
*
* Dispatch @p into the FIFO queue of the DSQ identified by @dsq_id. It is safe
@@ -5511,7 +5516,7 @@ __bpf_kfunc void scx_bpf_dispatch(struct task_struct *p, u64 dsq_id, u64 slice,
* scx_bpf_dispatch_vtime - Dispatch a task into the vtime priority queue of a DSQ
* @p: task_struct to dispatch
* @dsq_id: DSQ to dispatch to
- * @slice: duration @p can run for in nsecs
+ * @slice: duration @p can run for in nsecs, 0 to keep the current value
* @vtime: @p's ordering inside the vtime-sorted queue of the target DSQ
* @enq_flags: SCX_ENQ_*
*
@@ -5552,6 +5557,117 @@ static const struct btf_kfunc_id_set scx_kfunc_set_enqueue_dispatch = {
.set = &scx_kfunc_ids_enqueue_dispatch,
};
+static bool scx_dispatch_from_dsq(struct bpf_iter_scx_dsq_kern *kit,
+ struct task_struct *p, u64 dsq_id,
+ u64 slice, u64 vtime, u64 enq_flags)
+{
+ struct scx_dispatch_q *src_dsq = kit->dsq, *dst_dsq;
+ struct rq *this_rq, *src_rq, *dst_rq, *locked_rq;
+ bool dispatched = false;
+ bool in_balance;
+ unsigned long flags;
+
+ if (!scx_kf_allowed_if_unlocked() && !scx_kf_allowed(SCX_KF_DISPATCH))
+ return false;
+
+ /*
+ * Can be called from either ops.dispatch() locking this_rq() or any
+ * context where no rq lock is held. If latter, lock @p's task_rq which
+ * we'll likely need anyway.
+ */
+ src_rq = task_rq(p);
+
+ local_irq_save(flags);
+ this_rq = this_rq();
+ in_balance = this_rq->scx.flags & SCX_RQ_IN_BALANCE;
+
+ if (in_balance) {
+ if (this_rq != src_rq) {
+ raw_spin_rq_unlock(this_rq);
+ raw_spin_rq_lock(src_rq);
+ }
+ } else {
+ raw_spin_rq_lock(src_rq);
+ }
+
+ locked_rq = src_rq;
+ raw_spin_lock(&src_dsq->lock);
+
+ /*
+ * Did someone else get to it? @p could have already left $src_dsq, got
+ * re-enqueued, or be in the process of being consumed by someone else.
+ */
+ if (unlikely(p->scx.dsq != src_dsq ||
+ u32_before(kit->dsq_seq, p->scx.dsq_seq) ||
+ p->scx.holding_cpu >= 0) ||
+ WARN_ON_ONCE(src_rq != task_rq(p))) {
+ raw_spin_unlock(&src_dsq->lock);
+ goto out;
+ }
+
+ /* @p is still on $src_dsq and stable, determine the destination */
+ dst_dsq = find_dsq_for_dispatch(this_rq, dsq_id, p);
+
+ if (dst_dsq->id == SCX_DSQ_LOCAL) {
+ dst_rq = container_of(dst_dsq, struct rq, scx.local_dsq);
+ if (!task_can_run_on_remote_rq(p, dst_rq, true)) {
+ dst_dsq = &scx_dsq_global;
+ dst_rq = src_rq;
+ }
+ } else {
+ /* no need to migrate if destination is a non-local DSQ */
+ dst_rq = src_rq;
+ }
+
+ /*
+ * Move @p into $dst_dsq. If $dst_dsq is the local DSQ of a different
+ * CPU, @p will be migrated.
+ */
+ if (dst_dsq->id == SCX_DSQ_LOCAL) {
+ /* @p is going from a non-local DSQ to a local DSQ */
+ if (src_rq == dst_rq) {
+ task_unlink_from_dsq(p, src_dsq);
+ move_local_task_to_local_dsq(p, enq_flags,
+ src_dsq, dst_rq);
+ raw_spin_unlock(&src_dsq->lock);
+ } else {
+ raw_spin_unlock(&src_dsq->lock);
+ move_remote_task_to_local_dsq(p, enq_flags,
+ src_rq, dst_rq);
+ locked_rq = dst_rq;
+ }
+ } else {
+ /*
+ * @p is going from a non-local DSQ to a non-local DSQ. As
+ * $src_dsq is already locked, do an abbreviated dequeue.
+ */
+ task_unlink_from_dsq(p, src_dsq);
+ p->scx.dsq = NULL;
+ raw_spin_unlock(&src_dsq->lock);
+
+ p->scx.dsq_vtime = vtime;
+ dispatch_enqueue(dst_dsq, p, enq_flags);
+ }
+
+ if (slice)
+ p->scx.slice = slice;
+ else
+ p->scx.slice = p->scx.slice ?: 1;
+
+ dispatched = true;
+out:
+ if (in_balance) {
+ if (this_rq != locked_rq) {
+ raw_spin_rq_unlock(locked_rq);
+ raw_spin_rq_lock(this_rq);
+ }
+ } else {
+ raw_spin_rq_unlock_irqrestore(locked_rq, flags);
+ }
+
+ return dispatched;
+}
+
__bpf_kfunc_start_defs();
/**
@@ -5631,12 +5747,70 @@ __bpf_kfunc bool scx_bpf_consume(u64 dsq_id)
}
}
+/**
+ * scx_bpf_dispatch_from_dsq - Move a task from DSQ iteration to a DSQ
+ * @it__iter: DSQ iterator in progress
+ * @p: task to transfer
+ * @dsq_id: DSQ to move @p to
+ * @slice: duration @p can run for in nsecs, 0 to keep the current value
+ * @enq_flags: SCX_ENQ_*
+ *
+ * Transfer @p which is on the DSQ currently iterated by @it__iter to the DSQ
+ * specified by @dsq_id. All DSQs - local DSQs, global DSQ and user DSQs - can
+ * be the destination.
+ *
+ * For the transfer to be successful, @p must still be on the DSQ and have been
+ * queued before the DSQ iteration started. This function doesn't care whether
+ * @p was obtained from the DSQ iteration. @p just has to be on the DSQ and have
+ * been queued before the iteration started.
+ *
+ * Can be called from ops.dispatch() or any BPF context which doesn't hold a rq
+ * lock (e.g. BPF timers or SYSCALL programs).
+ *
+ * Returns %true if @p has been consumed, %false if @p had already been consumed
+ * or dequeued.
+ */
+__bpf_kfunc bool scx_bpf_dispatch_from_dsq(struct bpf_iter_scx_dsq *it__iter,
+ struct task_struct *p, u64 dsq_id,
+ u64 slice, u64 enq_flags)
+{
+ return scx_dispatch_from_dsq((struct bpf_iter_scx_dsq_kern *)it__iter,
+ p, dsq_id, slice, 0, enq_flags);
+}
+
+/**
+ * scx_bpf_dispatch_vtime_from_dsq - Move a task from DSQ iteration to a PRIQ DSQ
+ * @it__iter: DSQ iterator in progress
+ * @p: task to transfer
+ * @dsq_id: DSQ to move @p to
+ * @slice: duration @p can run for in nsecs, 0 to keep the current value
+ * @vtime: @p's ordering inside the vtime-sorted queue of the target DSQ
+ * @enq_flags: SCX_ENQ_*
+ *
+ * Transfer @p which is on the DSQ currently iterated by @it__iter to the
+ * priority queue of the DSQ specified by @dsq_id. The destination must be a
+ * user DSQ as only user DSQs support priority queue.
+ *
+ * All other aspects are identical to scx_bpf_dispatch_from_dsq(). See
+ * scx_bpf_dispatch_vtime() for more information on @vtime.
+ */
+__bpf_kfunc bool scx_bpf_dispatch_vtime_from_dsq(struct bpf_iter_scx_dsq *it__iter,
+ struct task_struct *p, u64 dsq_id,
+ u64 slice, u64 vtime, u64 enq_flags)
+{
+ return scx_dispatch_from_dsq((struct bpf_iter_scx_dsq_kern *)it__iter,
+ p, dsq_id, slice, vtime,
+ enq_flags | SCX_ENQ_DSQ_PRIQ);
+}
+
__bpf_kfunc_end_defs();
BTF_KFUNCS_START(scx_kfunc_ids_dispatch)
BTF_ID_FLAGS(func, scx_bpf_dispatch_nr_slots)
BTF_ID_FLAGS(func, scx_bpf_dispatch_cancel)
BTF_ID_FLAGS(func, scx_bpf_consume)
+BTF_ID_FLAGS(func, scx_bpf_dispatch_from_dsq, KF_RCU)
+BTF_ID_FLAGS(func, scx_bpf_dispatch_vtime_from_dsq, KF_RCU)
BTF_KFUNCS_END(scx_kfunc_ids_dispatch)
static const struct btf_kfunc_id_set scx_kfunc_set_dispatch = {
@@ -5733,6 +5907,8 @@ __bpf_kfunc_end_defs();
BTF_KFUNCS_START(scx_kfunc_ids_unlocked)
BTF_ID_FLAGS(func, scx_bpf_create_dsq, KF_SLEEPABLE)
+BTF_ID_FLAGS(func, scx_bpf_dispatch_from_dsq, KF_RCU)
+BTF_ID_FLAGS(func, scx_bpf_dispatch_vtime_from_dsq, KF_RCU)
BTF_KFUNCS_END(scx_kfunc_ids_unlocked)
static const struct btf_kfunc_id_set scx_kfunc_set_unlocked = {
diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h
index 20280df62857..ef018071da31 100644
--- a/tools/sched_ext/include/scx/common.bpf.h
+++ b/tools/sched_ext/include/scx/common.bpf.h
@@ -35,6 +35,8 @@ void scx_bpf_dispatch_vtime(struct task_struct *p, u64 dsq_id, u64 slice, u64 vt
u32 scx_bpf_dispatch_nr_slots(void) __ksym;
void scx_bpf_dispatch_cancel(void) __ksym;
bool scx_bpf_consume(u64 dsq_id) __ksym;
+bool scx_bpf_dispatch_from_dsq(struct bpf_iter_scx_dsq *it__iter, struct task_struct *p, u64 dsq_id, u64 slice, u64 enq_flags) __ksym __weak;
+bool scx_bpf_dispatch_vtime_from_dsq(struct bpf_iter_scx_dsq *it__iter, struct task_struct *p, u64 dsq_id, u64 slice, u64 vtime, u64 enq_flags) __ksym __weak;
u32 scx_bpf_reenqueue_local(void) __ksym;
void scx_bpf_kick_cpu(s32 cpu, u64 flags) __ksym;
s32 scx_bpf_dsq_nr_queued(u64 dsq_id) __ksym;
@@ -62,6 +64,12 @@ bool scx_bpf_task_running(const struct task_struct *p) __ksym;
s32 scx_bpf_task_cpu(const struct task_struct *p) __ksym;
struct rq *scx_bpf_cpu_rq(s32 cpu) __ksym;
+/*
+ * Use the following as @it__iter when calling
+ * scx_bpf_dispatch[_vtime]_from_dsq() from within bpf_for_each() loops.
+ */
+#define BPF_FOR_EACH_ITER (&___it)
+
static inline __attribute__((format(printf, 1, 2)))
void ___scx_bpf_bstr_format_checker(const char *fmt, ...) {}
--
2.46.0
* [PATCH 11/11] scx_qmap: Implement highpri boosting
2024-08-30 11:03 [PATCHSET sched_ext/for-6.12] sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq() Tejun Heo
2024-08-30 11:03 ` [PATCH 10/11] sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq() Tejun Heo
@ 2024-08-30 11:03 ` Tejun Heo
2024-08-30 20:59 ` [PATCH v2 " Tejun Heo
2024-08-30 17:31 ` [PATCHSET sched_ext/for-6.12] sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq() Tejun Heo
11 siblings, 1 reply; 35+ messages in thread
From: Tejun Heo @ 2024-08-30 11:03 UTC (permalink / raw)
To: void
Cc: kernel-team, linux-kernel, Tejun Heo, Daniel Hodges, Changwoo Min,
Andrea Righi, Dan Schatzberg
Implement a silly boosting mechanism for nice -20 tasks. Its only purpose is
demonstrating and testing scx_bpf_dispatch_from_dsq(). The boosting only
works within SHARED_DSQ and makes a minor difference only when the dispatch
batch size (-b) is increased.
This exercises moving tasks to a user DSQ and all local DSQs from
ops.dispatch() and BPF timerfn.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Daniel Hodges <hodges.daniel.scott@gmail.com>
Cc: David Vernet <void@manifault.com>
Cc: Changwoo Min <multics69@gmail.com>
Cc: Andrea Righi <andrea.righi@linux.dev>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
---
tools/sched_ext/scx_qmap.bpf.c | 142 ++++++++++++++++++++++++++++++---
tools/sched_ext/scx_qmap.c | 11 ++-
2 files changed, 139 insertions(+), 14 deletions(-)
diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
index 892278f12dce..11b8e22128e1 100644
--- a/tools/sched_ext/scx_qmap.bpf.c
+++ b/tools/sched_ext/scx_qmap.bpf.c
@@ -27,6 +27,8 @@
enum consts {
ONE_SEC_IN_NS = 1000000000,
SHARED_DSQ = 0,
+ HIGHPRI_DSQ = 1,
+ HIGHPRI_WEIGHT = 8668, /* this is what -20 maps to */
};
char _license[] SEC("license") = "GPL";
@@ -36,10 +38,12 @@ const volatile u32 stall_user_nth;
const volatile u32 stall_kernel_nth;
const volatile u32 dsp_inf_loop_after;
const volatile u32 dsp_batch;
+const volatile bool highpri_boosting;
const volatile bool print_shared_dsq;
const volatile s32 disallow_tgid;
const volatile bool suppress_dump;
+u64 nr_highpri_queued;
u32 test_error_cnt;
UEI_DEFINE(uei);
@@ -95,6 +99,7 @@ static u64 core_sched_tail_seqs[5];
/* Per-task scheduling context */
struct task_ctx {
bool force_local; /* Dispatch directly to local_dsq */
+ bool highpri;
u64 core_sched_seq;
};
@@ -122,6 +127,7 @@ struct {
/* Statistics */
u64 nr_enqueued, nr_dispatched, nr_reenqueued, nr_dequeued, nr_ddsp_from_enq;
u64 nr_core_sched_execed;
+u64 nr_expedited_local, nr_expedited_remote, nr_expedited_lost, nr_expedited_from_timer;
u32 cpuperf_min, cpuperf_avg, cpuperf_max;
u32 cpuperf_target_min, cpuperf_target_avg, cpuperf_target_max;
@@ -140,17 +146,25 @@ static s32 pick_direct_dispatch_cpu(struct task_struct *p, s32 prev_cpu)
return -1;
}
+static struct task_ctx *lookup_task_ctx(struct task_struct *p)
+{
+ struct task_ctx *tctx;
+
+ if (!(tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0))) {
+ scx_bpf_error("task_ctx lookup failed");
+ return NULL;
+ }
+ return tctx;
+}
+
s32 BPF_STRUCT_OPS(qmap_select_cpu, struct task_struct *p,
s32 prev_cpu, u64 wake_flags)
{
struct task_ctx *tctx;
s32 cpu;
- tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
- if (!tctx) {
- scx_bpf_error("task_ctx lookup failed");
+ if (!(tctx = lookup_task_ctx(p)))
return -ESRCH;
- }
cpu = pick_direct_dispatch_cpu(p, prev_cpu);
@@ -197,11 +211,8 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
if (test_error_cnt && !--test_error_cnt)
scx_bpf_error("test triggering error");
- tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
- if (!tctx) {
- scx_bpf_error("task_ctx lookup failed");
+ if (!(tctx = lookup_task_ctx(p)))
return;
- }
/*
* All enqueued tasks must have their core_sched_seq updated for correct
@@ -256,6 +267,10 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
return;
}
+ if (highpri_boosting && p->scx.weight >= HIGHPRI_WEIGHT) {
+ tctx->highpri = true;
+ __sync_fetch_and_add(&nr_highpri_queued, 1);
+ }
__sync_fetch_and_add(&nr_enqueued, 1);
}
@@ -272,13 +287,89 @@ void BPF_STRUCT_OPS(qmap_dequeue, struct task_struct *p, u64 deq_flags)
static void update_core_sched_head_seq(struct task_struct *p)
{
- struct task_ctx *tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
int idx = weight_to_idx(p->scx.weight);
+ struct task_ctx *tctx;
- if (tctx)
+ if ((tctx = lookup_task_ctx(p)))
core_sched_head_seqs[idx] = tctx->core_sched_seq;
- else
- scx_bpf_error("task_ctx lookup failed");
+}
+
+/*
+ * To demonstrate the use of scx_bpf_dispatch_from_dsq(), implement silly
+ * selective priority boosting mechanism by scanning SHARED_DSQ looking for
+ * highpri tasks, moving them to HIGHPRI_DSQ and then consuming them first. This
+ * makes minor difference only when dsp_batch is larger than 1.
+ *
+ * scx_bpf_dispatch[_vtime]_from_dsq() are allowed both from ops.dispatch() and
+ * non-rq-lock holding BPF programs. As demonstration, this function is called
+ * from qmap_dispatch() and monitor_timerfn().
+ */
+static bool dispatch_highpri(bool from_timer)
+{
+ struct task_struct *p;
+ s32 this_cpu = bpf_get_smp_processor_id();
+
+ /* scan SHARED_DSQ and move highpri tasks to HIGHPRI_DSQ */
+ bpf_for_each(scx_dsq, p, SHARED_DSQ, 0) {
+ struct task_ctx *tctx;
+
+ /* BPF workaround: iterated task is not trusted, look up again */
+ p = bpf_task_from_pid(p->pid);
+ if (!p)
+ continue;
+
+ if (!(tctx = lookup_task_ctx(p))) {
+ bpf_task_release(p);
+ return false;
+ }
+
+ if (tctx->highpri)
+ scx_bpf_dispatch_from_dsq(BPF_FOR_EACH_ITER, p,
+ HIGHPRI_DSQ, 0, 0);
+
+ bpf_task_release(p);
+ }
+
+ /*
+ * Scan HIGHPRI_DSQ and dispatch until a task that can run on this CPU
+ * is found.
+ */
+ bpf_for_each(scx_dsq, p, HIGHPRI_DSQ, 0) {
+ bool dispatched = false;
+ s32 cpu;
+
+ /* BPF workaround: iterated task is not trusted, look up again */
+ p = bpf_task_from_pid(p->pid);
+ if (!p)
+ continue;
+
+ if (bpf_cpumask_test_cpu(this_cpu, p->cpus_ptr))
+ cpu = this_cpu;
+ else
+ cpu = scx_bpf_pick_any_cpu(p->cpus_ptr, 0);
+
+ if (scx_bpf_dispatch_from_dsq(BPF_FOR_EACH_ITER, p,
+ SCX_DSQ_LOCAL_ON | cpu, 0,
+ SCX_ENQ_PREEMPT)) {
+ if (cpu == this_cpu) {
+ dispatched = true;
+ __sync_fetch_and_add(&nr_expedited_local, 1);
+ } else {
+ __sync_fetch_and_add(&nr_expedited_remote, 1);
+ }
+ if (from_timer)
+ __sync_fetch_and_add(&nr_expedited_from_timer, 1);
+ } else {
+ __sync_fetch_and_add(&nr_expedited_lost, 1);
+ }
+
+ bpf_task_release(p);
+
+ if (dispatched)
+ return true;
+ }
+
+ return false;
}
void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
@@ -289,7 +380,10 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
void *fifo;
s32 i, pid;
- if (scx_bpf_consume(SHARED_DSQ))
+ if (dispatch_highpri(false))
+ return;
+
+ if (!nr_highpri_queued && scx_bpf_consume(SHARED_DSQ))
return;
if (dsp_inf_loop_after && nr_dispatched > dsp_inf_loop_after) {
@@ -326,6 +420,8 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
/* Dispatch or advance. */
bpf_repeat(BPF_MAX_LOOPS) {
+ struct task_ctx *tctx;
+
if (bpf_map_pop_elem(fifo, &pid))
break;
@@ -333,13 +429,25 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
if (!p)
continue;
+ if (!(tctx = lookup_task_ctx(p))) {
+ bpf_task_release(p);
+ return;
+ }
+
+ if (tctx->highpri)
+ __sync_fetch_and_sub(&nr_highpri_queued, 1);
+
update_core_sched_head_seq(p);
__sync_fetch_and_add(&nr_dispatched, 1);
+
scx_bpf_dispatch(p, SHARED_DSQ, slice_ns, 0);
bpf_task_release(p);
+
batch--;
cpuc->dsp_cnt--;
if (!batch || !scx_bpf_dispatch_nr_slots()) {
+ if (dispatch_highpri(false))
+ return;
scx_bpf_consume(SHARED_DSQ);
return;
}
@@ -649,6 +757,10 @@ static void dump_shared_dsq(void)
static int monitor_timerfn(void *map, int *key, struct bpf_timer *timer)
{
+ bpf_rcu_read_lock();
+ dispatch_highpri(true);
+ bpf_rcu_read_unlock();
+
monitor_cpuperf();
if (print_shared_dsq)
@@ -670,6 +782,10 @@ s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init)
if (ret)
return ret;
+ ret = scx_bpf_create_dsq(HIGHPRI_DSQ, -1);
+ if (ret)
+ return ret;
+
timer = bpf_map_lookup_elem(&monitor_timer, &key);
if (!timer)
return -ESRCH;
diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c
index c9ca30d62b2b..ac45a02b4055 100644
--- a/tools/sched_ext/scx_qmap.c
+++ b/tools/sched_ext/scx_qmap.c
@@ -29,6 +29,7 @@ const char help_fmt[] =
" -l COUNT Trigger dispatch infinite looping after COUNT dispatches\n"
" -b COUNT Dispatch upto COUNT tasks together\n"
" -P Print out DSQ content to trace_pipe every second, use with -b\n"
+" -H Boost nice -20 tasks in SHARED_DSQ, use with -b\n"
" -d PID Disallow a process from switching into SCHED_EXT (-1 for self)\n"
" -D LEN Set scx_exit_info.dump buffer length\n"
" -S Suppress qmap-specific debug dump\n"
@@ -63,7 +64,7 @@ int main(int argc, char **argv)
skel = SCX_OPS_OPEN(qmap_ops, scx_qmap);
- while ((opt = getopt(argc, argv, "s:e:t:T:l:b:Pd:D:Spvh")) != -1) {
+ while ((opt = getopt(argc, argv, "s:e:t:T:l:b:PHd:D:Spvh")) != -1) {
switch (opt) {
case 's':
skel->rodata->slice_ns = strtoull(optarg, NULL, 0) * 1000;
@@ -86,6 +87,9 @@ int main(int argc, char **argv)
case 'P':
skel->rodata->print_shared_dsq = true;
break;
+ case 'H':
+ skel->rodata->highpri_boosting = true;
+ break;
case 'd':
skel->rodata->disallow_tgid = strtol(optarg, NULL, 0);
if (skel->rodata->disallow_tgid < 0)
@@ -121,6 +125,11 @@ int main(int argc, char **argv)
skel->bss->nr_reenqueued, skel->bss->nr_dequeued,
skel->bss->nr_core_sched_execed,
skel->bss->nr_ddsp_from_enq);
+ printf(" exp_local=%"PRIu64" exp_remote=%"PRIu64" exp_timer=%"PRIu64" exp_lost=%"PRIu64"\n",
+ skel->bss->nr_expedited_local,
+ skel->bss->nr_expedited_remote,
+ skel->bss->nr_expedited_from_timer,
+ skel->bss->nr_expedited_lost);
if (__COMPAT_has_ksym("scx_bpf_cpuperf_cur"))
printf("cpuperf: cur min/avg/max=%u/%u/%u target min/avg/max=%u/%u/%u\n",
skel->bss->cpuperf_min,
--
2.46.0
^ permalink raw reply related [flat|nested] 35+ messages in thread
* Re: [PATCHSET sched_ext/for-6.12] sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()
2024-08-30 11:03 [PATCHSET sched_ext/for-6.12] sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq() Tejun Heo
` (10 preceding siblings ...)
2024-08-30 11:03 ` [PATCH 11/11] scx_qmap: Implement highpri boosting Tejun Heo
@ 2024-08-30 17:31 ` Tejun Heo
11 siblings, 0 replies; 35+ messages in thread
From: Tejun Heo @ 2024-08-30 17:31 UTC (permalink / raw)
To: void; +Cc: kernel-team, linux-kernel
On Fri, Aug 30, 2024 at 01:03:44AM -1000, Tejun Heo wrote:
> This patchset is on top of:
>
> sched_ext/for-6.12 bf934bed5e2f ("sched_ext: Add missing cfi stub for ops.tick")
> + two fix patches (http://lkml.kernel.org/r/ZtGkPKgoE5BeI7fN@slm.duckdns.org)
Oops, it also needs
+ bpf/for-next 6af48162ac09 ("Merge branch 'bpf-next/net' into bpf-next/for-next")
to allow passing BPF iterators into kfuncs.
Thanks.
--
tejun
* [PATCH 04/11] sched_ext: Fix process_ddsp_deferred_locals() by unifying DTL_INVALID handling
2024-08-30 11:03 ` [PATCH 04/11] sched_ext: Make dispatch_to_local_dsq() return void Tejun Heo
@ 2024-08-30 17:44 ` Tejun Heo
2024-09-01 0:53 ` David Vernet
0 siblings, 1 reply; 35+ messages in thread
From: Tejun Heo @ 2024-08-30 17:44 UTC (permalink / raw)
To: void; +Cc: kernel-team, linux-kernel
With the preceding update, the only return value that makes a meaningful
difference is DTL_INVALID, for which one caller, finish_dispatch(), falls
back to the global DSQ while the other, process_ddsp_deferred_locals(),
doesn't do anything.
It should always fall back to the global DSQ. Move the global DSQ fallback
into dispatch_to_local_dsq() and remove the return value.
v2: Patch title and description updated to reflect the behavior fix for
process_ddsp_deferred_locals().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>
---
kernel/sched/ext.c | 41 ++++++++++-------------------------------
1 file changed, 10 insertions(+), 31 deletions(-)
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2301,12 +2301,6 @@ retry:
return false;
}
-enum dispatch_to_local_dsq_ret {
- DTL_DISPATCHED, /* successfully dispatched */
- DTL_LOST, /* lost race to dequeue */
- DTL_INVALID, /* invalid local dsq_id */
-};
-
/**
* dispatch_to_local_dsq - Dispatch a task to a local dsq
* @rq: current rq which is locked
@@ -2321,9 +2315,8 @@ enum dispatch_to_local_dsq_ret {
* The caller must have exclusive ownership of @p (e.g. through
* %SCX_OPSS_DISPATCHING).
*/
-static enum dispatch_to_local_dsq_ret
-dispatch_to_local_dsq(struct rq *rq, struct scx_dispatch_q *dst_dsq,
- struct task_struct *p, u64 enq_flags)
+static void dispatch_to_local_dsq(struct rq *rq, struct scx_dispatch_q *dst_dsq,
+ struct task_struct *p, u64 enq_flags)
{
struct rq *src_rq = task_rq(p);
struct rq *dst_rq = container_of(dst_dsq, struct rq, scx.local_dsq);
@@ -2336,13 +2329,11 @@ dispatch_to_local_dsq(struct rq *rq, str
*/
if (rq == src_rq && rq == dst_rq) {
dispatch_enqueue(dst_dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
- return DTL_DISPATCHED;
+ return;
}
#ifdef CONFIG_SMP
if (likely(task_can_run_on_remote_rq(p, dst_rq, true))) {
- bool dsp;
-
/*
* @p is on a possibly remote @src_rq which we need to lock to
* move the task. If dequeue is in progress, it'd be locking
@@ -2367,10 +2358,8 @@ dispatch_to_local_dsq(struct rq *rq, str
}
/* task_rq couldn't have changed if we're still the holding cpu */
- dsp = p->scx.holding_cpu == raw_smp_processor_id() &&
- !WARN_ON_ONCE(src_rq != task_rq(p));
-
- if (likely(dsp)) {
+ if (likely(p->scx.holding_cpu == raw_smp_processor_id()) &&
+ !WARN_ON_ONCE(src_rq != task_rq(p))) {
/*
* If @p is staying on the same rq, there's no need to
* go through the full deactivate/activate cycle.
@@ -2398,11 +2387,11 @@ dispatch_to_local_dsq(struct rq *rq, str
raw_spin_rq_lock(rq);
}
- return dsp ? DTL_DISPATCHED : DTL_LOST;
+ return;
}
#endif /* CONFIG_SMP */
- return DTL_INVALID;
+ dispatch_enqueue(&scx_dsq_global, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
}
/**
@@ -2479,20 +2468,10 @@ retry:
dsq = find_dsq_for_dispatch(this_rq(), dsq_id, p);
- if (dsq->id == SCX_DSQ_LOCAL) {
- switch (dispatch_to_local_dsq(rq, dsq, p, enq_flags)) {
- case DTL_DISPATCHED:
- break;
- case DTL_LOST:
- break;
- case DTL_INVALID:
- dispatch_enqueue(&scx_dsq_global, p,
- enq_flags | SCX_ENQ_CLEAR_OPSS);
- break;
- }
- } else {
+ if (dsq->id == SCX_DSQ_LOCAL)
+ dispatch_to_local_dsq(rq, dsq, p, enq_flags);
+ else
dispatch_enqueue(dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
- }
}
static void flush_dispatch_buf(struct rq *rq)
* Re: [PATCH 01/11] sched_ext: Rename scx_kfunc_set_sleepable to unlocked and relocate
2024-08-30 11:03 ` [PATCH 01/11] sched_ext: Rename scx_kfunc_set_sleepable to unlocked and relocate Tejun Heo
@ 2024-08-30 17:45 ` David Vernet
0 siblings, 0 replies; 35+ messages in thread
From: David Vernet @ 2024-08-30 17:45 UTC (permalink / raw)
To: Tejun Heo; +Cc: kernel-team, linux-kernel
On Fri, Aug 30, 2024 at 01:03:45AM -1000, Tejun Heo wrote:
> Sleepables don't need to be in their own kfunc set as each is tagged with
> KF_SLEEPABLE. Rename to scx_kfunc_set_unlocked, indicating that the rq lock
> is not held, and relocate right above the "any" set. This will be used to add
> kfuncs that are allowed to be called from SYSCALL but not TRACING.
>
> No functional changes intended.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
> ---
> kernel/sched/ext.c | 66 +++++++++++++++++++++++-----------------------
> 1 file changed, 33 insertions(+), 33 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 140a4612d379..5423554a11af 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -5395,35 +5395,6 @@ void __init init_sched_ext_class(void)
>
> __bpf_kfunc_start_defs();
>
> -/**
> - * scx_bpf_create_dsq - Create a custom DSQ
> - * @dsq_id: DSQ to create
> - * @node: NUMA node to allocate from
> - *
> - * Create a custom DSQ identified by @dsq_id. Can be called from any sleepable
> - * scx callback, and any BPF_PROG_TYPE_SYSCALL prog.
> - */
> -__bpf_kfunc s32 scx_bpf_create_dsq(u64 dsq_id, s32 node)
> -{
> - if (unlikely(node >= (int)nr_node_ids ||
> - (node < 0 && node != NUMA_NO_NODE)))
> - return -EINVAL;
> - return PTR_ERR_OR_ZERO(create_dsq(dsq_id, node));
> -}
> -
> -__bpf_kfunc_end_defs();
> -
> -BTF_KFUNCS_START(scx_kfunc_ids_sleepable)
> -BTF_ID_FLAGS(func, scx_bpf_create_dsq, KF_SLEEPABLE)
> -BTF_KFUNCS_END(scx_kfunc_ids_sleepable)
> -
> -static const struct btf_kfunc_id_set scx_kfunc_set_sleepable = {
> - .owner = THIS_MODULE,
> - .set = &scx_kfunc_ids_sleepable,
> -};
> -
> -__bpf_kfunc_start_defs();
> -
> /**
> * scx_bpf_select_cpu_dfl - The default implementation of ops.select_cpu()
> * @p: task_struct to select a CPU for
> @@ -5766,6 +5737,35 @@ static const struct btf_kfunc_id_set scx_kfunc_set_cpu_release = {
>
> __bpf_kfunc_start_defs();
>
> +/**
> + * scx_bpf_create_dsq - Create a custom DSQ
> + * @dsq_id: DSQ to create
> + * @node: NUMA node to allocate from
> + *
> + * Create a custom DSQ identified by @dsq_id. Can be called from any sleepable
> + * scx callback, and any BPF_PROG_TYPE_SYSCALL prog.
> + */
> +__bpf_kfunc s32 scx_bpf_create_dsq(u64 dsq_id, s32 node)
> +{
> + if (unlikely(node >= (int)nr_node_ids ||
> + (node < 0 && node != NUMA_NO_NODE)))
> + return -EINVAL;
> + return PTR_ERR_OR_ZERO(create_dsq(dsq_id, node));
> +}
> +
> +__bpf_kfunc_end_defs();
> +
> +BTF_KFUNCS_START(scx_kfunc_ids_unlocked)
> +BTF_ID_FLAGS(func, scx_bpf_create_dsq, KF_SLEEPABLE)
> +BTF_KFUNCS_END(scx_kfunc_ids_unlocked)
> +
> +static const struct btf_kfunc_id_set scx_kfunc_set_unlocked = {
> + .owner = THIS_MODULE,
> + .set = &scx_kfunc_ids_unlocked,
> +};
> +
> +__bpf_kfunc_start_defs();
> +
> /**
> * scx_bpf_kick_cpu - Trigger reschedule on a CPU
> * @cpu: cpu to kick
> @@ -6462,10 +6462,6 @@ static int __init scx_init(void)
> * check using scx_kf_allowed().
> */
> if ((ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
> - &scx_kfunc_set_sleepable)) ||
> - (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL,
> - &scx_kfunc_set_sleepable)) ||
> - (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
> &scx_kfunc_set_select_cpu)) ||
> (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
> &scx_kfunc_set_enqueue_dispatch)) ||
> @@ -6473,6 +6469,10 @@ static int __init scx_init(void)
> &scx_kfunc_set_dispatch)) ||
> (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
> &scx_kfunc_set_cpu_release)) ||
> + (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
> + &scx_kfunc_set_unlocked)) ||
> + (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL,
> + &scx_kfunc_set_unlocked)) ||
> (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
> &scx_kfunc_set_any)) ||
> (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING,
> --
> 2.46.0
>
* [PATCH v2 11/11] scx_qmap: Implement highpri boosting
2024-08-30 11:03 ` [PATCH 11/11] scx_qmap: Implement highpri boosting Tejun Heo
@ 2024-08-30 20:59 ` Tejun Heo
0 siblings, 0 replies; 35+ messages in thread
From: Tejun Heo @ 2024-08-30 20:59 UTC (permalink / raw)
To: void
Cc: kernel-team, linux-kernel, Daniel Hodges, Changwoo Min,
Andrea Righi, Dan Schatzberg
Implement a silly boosting mechanism for nice -20 tasks. The only purpose is
demonstrating and testing scx_bpf_dispatch_from_dsq(). The boosting only
works within SHARED_DSQ and makes only minor differences with increased
dispatch batch (-b).
This exercises moving tasks to a user DSQ and all local DSQs from
ops.dispatch() and BPF timerfn.
v2: Drop the workaround for the iterated tasks not being trusted by the
verifier. The issue is fixed on the BPF side.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Daniel Hodges <hodges.daniel.scott@gmail.com>
Cc: David Vernet <void@manifault.com>
Cc: Changwoo Min <multics69@gmail.com>
Cc: Andrea Righi <andrea.righi@linux.dev>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
---
tools/sched_ext/scx_qmap.bpf.c | 126 ++++++++++++++++++++++++++++++++++++-----
tools/sched_ext/scx_qmap.c | 11 +++
2 files changed, 123 insertions(+), 14 deletions(-)
--- a/tools/sched_ext/scx_qmap.bpf.c
+++ b/tools/sched_ext/scx_qmap.bpf.c
@@ -27,6 +27,8 @@
enum consts {
ONE_SEC_IN_NS = 1000000000,
SHARED_DSQ = 0,
+ HIGHPRI_DSQ = 1,
+ HIGHPRI_WEIGHT = 8668, /* this is what -20 maps to */
};
char _license[] SEC("license") = "GPL";
@@ -36,10 +38,12 @@ const volatile u32 stall_user_nth;
const volatile u32 stall_kernel_nth;
const volatile u32 dsp_inf_loop_after;
const volatile u32 dsp_batch;
+const volatile bool highpri_boosting;
const volatile bool print_shared_dsq;
const volatile s32 disallow_tgid;
const volatile bool suppress_dump;
+u64 nr_highpri_queued;
u32 test_error_cnt;
UEI_DEFINE(uei);
@@ -95,6 +99,7 @@ static u64 core_sched_tail_seqs[5];
/* Per-task scheduling context */
struct task_ctx {
bool force_local; /* Dispatch directly to local_dsq */
+ bool highpri;
u64 core_sched_seq;
};
@@ -122,6 +127,7 @@ struct {
/* Statistics */
u64 nr_enqueued, nr_dispatched, nr_reenqueued, nr_dequeued, nr_ddsp_from_enq;
u64 nr_core_sched_execed;
+u64 nr_expedited_local, nr_expedited_remote, nr_expedited_lost, nr_expedited_from_timer;
u32 cpuperf_min, cpuperf_avg, cpuperf_max;
u32 cpuperf_target_min, cpuperf_target_avg, cpuperf_target_max;
@@ -140,17 +146,25 @@ static s32 pick_direct_dispatch_cpu(stru
return -1;
}
+static struct task_ctx *lookup_task_ctx(struct task_struct *p)
+{
+ struct task_ctx *tctx;
+
+ if (!(tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0))) {
+ scx_bpf_error("task_ctx lookup failed");
+ return NULL;
+ }
+ return tctx;
+}
+
s32 BPF_STRUCT_OPS(qmap_select_cpu, struct task_struct *p,
s32 prev_cpu, u64 wake_flags)
{
struct task_ctx *tctx;
s32 cpu;
- tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
- if (!tctx) {
- scx_bpf_error("task_ctx lookup failed");
+ if (!(tctx = lookup_task_ctx(p)))
return -ESRCH;
- }
cpu = pick_direct_dispatch_cpu(p, prev_cpu);
@@ -197,11 +211,8 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct
if (test_error_cnt && !--test_error_cnt)
scx_bpf_error("test triggering error");
- tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
- if (!tctx) {
- scx_bpf_error("task_ctx lookup failed");
+ if (!(tctx = lookup_task_ctx(p)))
return;
- }
/*
* All enqueued tasks must have their core_sched_seq updated for correct
@@ -256,6 +267,10 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct
return;
}
+ if (highpri_boosting && p->scx.weight >= HIGHPRI_WEIGHT) {
+ tctx->highpri = true;
+ __sync_fetch_and_add(&nr_highpri_queued, 1);
+ }
__sync_fetch_and_add(&nr_enqueued, 1);
}
@@ -272,13 +287,73 @@ void BPF_STRUCT_OPS(qmap_dequeue, struct
static void update_core_sched_head_seq(struct task_struct *p)
{
- struct task_ctx *tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
int idx = weight_to_idx(p->scx.weight);
+ struct task_ctx *tctx;
- if (tctx)
+ if ((tctx = lookup_task_ctx(p)))
core_sched_head_seqs[idx] = tctx->core_sched_seq;
- else
- scx_bpf_error("task_ctx lookup failed");
+}
+
+/*
+ * To demonstrate the use of scx_bpf_dispatch_from_dsq(), implement silly
+ * selective priority boosting mechanism by scanning SHARED_DSQ looking for
+ * highpri tasks, moving them to HIGHPRI_DSQ and then consuming them first. This
+ * makes minor difference only when dsp_batch is larger than 1.
+ *
+ * scx_bpf_dispatch[_vtime]_from_dsq() are allowed both from ops.dispatch() and
+ * non-rq-lock holding BPF programs. As demonstration, this function is called
+ * from qmap_dispatch() and monitor_timerfn().
+ */
+static bool dispatch_highpri(bool from_timer)
+{
+ struct task_struct *p;
+ s32 this_cpu = bpf_get_smp_processor_id();
+
+ /* scan SHARED_DSQ and move highpri tasks to HIGHPRI_DSQ */
+ bpf_for_each(scx_dsq, p, SHARED_DSQ, 0) {
+ struct task_ctx *tctx;
+
+ if (!(tctx = lookup_task_ctx(p)))
+ return false;
+
+ if (tctx->highpri)
+ scx_bpf_dispatch_from_dsq(BPF_FOR_EACH_ITER, p,
+ HIGHPRI_DSQ, 0, 0);
+ }
+
+ /*
+ * Scan HIGHPRI_DSQ and dispatch until a task that can run on this CPU
+ * is found.
+ */
+ bpf_for_each(scx_dsq, p, HIGHPRI_DSQ, 0) {
+ bool dispatched = false;
+ s32 cpu;
+
+ if (bpf_cpumask_test_cpu(this_cpu, p->cpus_ptr))
+ cpu = this_cpu;
+ else
+ cpu = scx_bpf_pick_any_cpu(p->cpus_ptr, 0);
+
+ if (scx_bpf_dispatch_from_dsq(BPF_FOR_EACH_ITER, p,
+ SCX_DSQ_LOCAL_ON | cpu, 0,
+ SCX_ENQ_PREEMPT)) {
+ if (cpu == this_cpu) {
+ dispatched = true;
+ __sync_fetch_and_add(&nr_expedited_local, 1);
+ } else {
+ __sync_fetch_and_add(&nr_expedited_remote, 1);
+ }
+ if (from_timer)
+ __sync_fetch_and_add(&nr_expedited_from_timer, 1);
+ } else {
+ __sync_fetch_and_add(&nr_expedited_lost, 1);
+ }
+
+ if (dispatched)
+ return true;
+ }
+
+ return false;
}
void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
@@ -289,7 +364,10 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 c
void *fifo;
s32 i, pid;
- if (scx_bpf_consume(SHARED_DSQ))
+ if (dispatch_highpri(false))
+ return;
+
+ if (!nr_highpri_queued && scx_bpf_consume(SHARED_DSQ))
return;
if (dsp_inf_loop_after && nr_dispatched > dsp_inf_loop_after) {
@@ -326,6 +404,8 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 c
/* Dispatch or advance. */
bpf_repeat(BPF_MAX_LOOPS) {
+ struct task_ctx *tctx;
+
if (bpf_map_pop_elem(fifo, &pid))
break;
@@ -333,13 +413,25 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 c
if (!p)
continue;
+ if (!(tctx = lookup_task_ctx(p))) {
+ bpf_task_release(p);
+ return;
+ }
+
+ if (tctx->highpri)
+ __sync_fetch_and_sub(&nr_highpri_queued, 1);
+
update_core_sched_head_seq(p);
__sync_fetch_and_add(&nr_dispatched, 1);
+
scx_bpf_dispatch(p, SHARED_DSQ, slice_ns, 0);
bpf_task_release(p);
+
batch--;
cpuc->dsp_cnt--;
if (!batch || !scx_bpf_dispatch_nr_slots()) {
+ if (dispatch_highpri(false))
+ return;
scx_bpf_consume(SHARED_DSQ);
return;
}
@@ -649,6 +741,10 @@ static void dump_shared_dsq(void)
static int monitor_timerfn(void *map, int *key, struct bpf_timer *timer)
{
+ bpf_rcu_read_lock();
+ dispatch_highpri(true);
+ bpf_rcu_read_unlock();
+
monitor_cpuperf();
if (print_shared_dsq)
@@ -670,6 +766,10 @@ s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init)
if (ret)
return ret;
+ ret = scx_bpf_create_dsq(HIGHPRI_DSQ, -1);
+ if (ret)
+ return ret;
+
timer = bpf_map_lookup_elem(&monitor_timer, &key);
if (!timer)
return -ESRCH;
--- a/tools/sched_ext/scx_qmap.c
+++ b/tools/sched_ext/scx_qmap.c
@@ -29,6 +29,7 @@ const char help_fmt[] =
" -l COUNT Trigger dispatch infinite looping after COUNT dispatches\n"
" -b COUNT Dispatch upto COUNT tasks together\n"
" -P Print out DSQ content to trace_pipe every second, use with -b\n"
+" -H Boost nice -20 tasks in SHARED_DSQ, use with -b\n"
" -d PID Disallow a process from switching into SCHED_EXT (-1 for self)\n"
" -D LEN Set scx_exit_info.dump buffer length\n"
" -S Suppress qmap-specific debug dump\n"
@@ -63,7 +64,7 @@ int main(int argc, char **argv)
skel = SCX_OPS_OPEN(qmap_ops, scx_qmap);
- while ((opt = getopt(argc, argv, "s:e:t:T:l:b:Pd:D:Spvh")) != -1) {
+ while ((opt = getopt(argc, argv, "s:e:t:T:l:b:PHd:D:Spvh")) != -1) {
switch (opt) {
case 's':
skel->rodata->slice_ns = strtoull(optarg, NULL, 0) * 1000;
@@ -86,6 +87,9 @@ int main(int argc, char **argv)
case 'P':
skel->rodata->print_shared_dsq = true;
break;
+ case 'H':
+ skel->rodata->highpri_boosting = true;
+ break;
case 'd':
skel->rodata->disallow_tgid = strtol(optarg, NULL, 0);
if (skel->rodata->disallow_tgid < 0)
@@ -121,6 +125,11 @@ int main(int argc, char **argv)
skel->bss->nr_reenqueued, skel->bss->nr_dequeued,
skel->bss->nr_core_sched_execed,
skel->bss->nr_ddsp_from_enq);
+ printf(" exp_local=%"PRIu64" exp_remote=%"PRIu64" exp_timer=%"PRIu64" exp_lost=%"PRIu64"\n",
+ skel->bss->nr_expedited_local,
+ skel->bss->nr_expedited_remote,
+ skel->bss->nr_expedited_from_timer,
+ skel->bss->nr_expedited_lost);
if (__COMPAT_has_ksym("scx_bpf_cpuperf_cur"))
printf("cpuperf: cur min/avg/max=%u/%u/%u target min/avg/max=%u/%u/%u\n",
skel->bss->cpuperf_min,
* Re: [PATCH 02/11] sched_ext: Refactor consume_remote_task()
2024-08-30 11:03 ` [PATCH 02/11] sched_ext: Refactor consume_remote_task() Tejun Heo
@ 2024-08-31 4:05 ` David Vernet
2024-08-31 5:33 ` Tejun Heo
0 siblings, 1 reply; 35+ messages in thread
From: David Vernet @ 2024-08-31 4:05 UTC (permalink / raw)
To: Tejun Heo; +Cc: kernel-team, linux-kernel
On Fri, Aug 30, 2024 at 01:03:46AM -1000, Tejun Heo wrote:
> The tricky p->scx.holding_cpu handling was split across
> consume_remote_task() body and move_task_to_local_dsq(). Refactor such that:
>
> - All the tricky part is now in the new unlink_dsq_and_lock_task_rq() with
> consolidated documentation.
>
> - move_task_to_local_dsq() now implements straightforward task migration
> making it easier to use in other places.
>
> - dispatch_to_local_dsq() is another user of move_task_to_local_dsq(). The
> usage is updated accordingly. This makes the local and remote cases more
> symmetric.
>
> No functional changes intended.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
> kernel/sched/ext.c | 145 ++++++++++++++++++++++++---------------------
> 1 file changed, 76 insertions(+), 69 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 5423554a11af..3facfca73337 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -2109,49 +2109,13 @@ static bool yield_to_task_scx(struct rq *rq, struct task_struct *to)
> * @src_rq: rq to move the task from, locked on entry, released on return
> * @dst_rq: rq to move the task into, locked on return
> *
> - * Move @p which is currently on @src_rq to @dst_rq's local DSQ. The caller
> - * must:
> - *
> - * 1. Start with exclusive access to @p either through its DSQ lock or
> - * %SCX_OPSS_DISPATCHING flag.
> - *
> - * 2. Set @p->scx.holding_cpu to raw_smp_processor_id().
> - *
> - * 3. Remember task_rq(@p) as @src_rq. Release the exclusive access so that we
> - * don't deadlock with dequeue.
> - *
> - * 4. Lock @src_rq from #3.
> - *
> - * 5. Call this function.
> - *
> - * Returns %true if @p was successfully moved. %false after racing dequeue and
> - * losing. On return, @src_rq is unlocked and @dst_rq is locked.
> + * Move @p which is currently on @src_rq to @dst_rq's local DSQ.
> */
> -static bool move_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
> +static void move_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
> struct rq *src_rq, struct rq *dst_rq)
> {
> lockdep_assert_rq_held(src_rq);
>
> - /*
> - * If dequeue got to @p while we were trying to lock @src_rq, it'd have
> - * cleared @p->scx.holding_cpu to -1. While other cpus may have updated
> - * it to different values afterwards, as this operation can't be
> - * preempted or recurse, @p->scx.holding_cpu can never become
> - * raw_smp_processor_id() again before we're done. Thus, we can tell
> - * whether we lost to dequeue by testing whether @p->scx.holding_cpu is
> - * still raw_smp_processor_id().
> - *
> - * @p->rq couldn't have changed if we're still the holding cpu.
> - *
> - * See dispatch_dequeue() for the counterpart.
> - */
> - if (unlikely(p->scx.holding_cpu != raw_smp_processor_id()) ||
> - WARN_ON_ONCE(src_rq != task_rq(p))) {
> - raw_spin_rq_unlock(src_rq);
> - raw_spin_rq_lock(dst_rq);
> - return false;
> - }
> -
> /* the following marks @p MIGRATING which excludes dequeue */
> deactivate_task(src_rq, p, 0);
Not a functional change from the prior patch, but it occurred to me that
if we just deactivate like this then we'll also fire the ops.quiescent()
callback in dequeue_task_scx(). Should we add a check to skip the
dequeue callbacks if p->scx.holding_cpu >= 0?
Cleanup looks great otherwise.
Thanks,
David
> set_task_cpu(p, cpu_of(dst_rq));
> @@ -2170,8 +2134,6 @@ static bool move_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
> dst_rq->scx.extra_enq_flags = enq_flags;
> activate_task(dst_rq, p, 0);
> dst_rq->scx.extra_enq_flags = 0;
> -
> - return true;
> }
>
> #endif /* CONFIG_SMP */
> @@ -2236,28 +2198,69 @@ static bool task_can_run_on_remote_rq(struct task_struct *p, struct rq *rq,
> return true;
> }
>
> -static bool consume_remote_task(struct rq *rq, struct scx_dispatch_q *dsq,
> - struct task_struct *p, struct rq *task_rq)
> +/**
> + * unlink_dsq_and_lock_task_rq() - Unlink task from its DSQ and lock its task_rq
> + * @p: target task
> + * @dsq: locked DSQ @p is currently on
> + * @task_rq: @p's task_rq, stable with @dsq locked
> + *
> + * Called with @dsq locked but no rq's locked. We want to move @p to a different
> + * DSQ, including any local DSQ, but are not locking @task_rq. Locking @task_rq
> + * is required when transferring into a local DSQ. Even when transferring into a
> + * non-local DSQ, it's better to use the same mechanism to protect against
> + * dequeues and maintain the invariant that @p->scx.dsq can only change while
> + * @task_rq is locked, which e.g. scx_dump_task() depends on.
> + *
> + * We want to grab @task_rq but that can deadlock if we try while locking @dsq,
> + * so we want to unlink @p from @dsq, drop its lock and then lock @task_rq. As
> + * this may race with dequeue, which can't drop the rq lock or fail, do a little
> + * dancing from our side.
> + *
> + * @p->scx.holding_cpu is set to this CPU before @dsq is unlocked. If @p gets
> + * dequeued after we unlock @dsq but before locking @task_rq, the holding_cpu
> + * would be cleared to -1. While other cpus may have updated it to different
> + * values afterwards, as this operation can't be preempted or recurse, the
> + * holding_cpu can never become this CPU again before we're done. Thus, we can
> + * tell whether we lost to dequeue by testing whether the holding_cpu still
> + * points to this CPU. See dispatch_dequeue() for the counterpart.
> + *
> + * On return, @dsq is unlocked and @task_rq is locked. Returns %true if @p is
> + * still valid. %false if lost to dequeue.
> + */
> +static bool unlink_dsq_and_lock_task_rq(struct task_struct *p,
> + struct scx_dispatch_q *dsq,
> + struct rq *task_rq)
> {
> - lockdep_assert_held(&dsq->lock); /* released on return */
> + s32 cpu = raw_smp_processor_id();
> +
> + lockdep_assert_held(&dsq->lock);
>
> - /*
> - * @dsq is locked and @p is on a remote rq. @p is currently protected by
> - * @dsq->lock. We want to pull @p to @rq but may deadlock if we grab
> - * @task_rq while holding @dsq and @rq locks. As dequeue can't drop the
> - * rq lock or fail, do a little dancing from our side. See
> - * move_task_to_local_dsq().
> - */
> WARN_ON_ONCE(p->scx.holding_cpu >= 0);
> task_unlink_from_dsq(p, dsq);
> dsq_mod_nr(dsq, -1);
> - p->scx.holding_cpu = raw_smp_processor_id();
> - raw_spin_unlock(&dsq->lock);
> + p->scx.holding_cpu = cpu;
>
> - raw_spin_rq_unlock(rq);
> + raw_spin_unlock(&dsq->lock);
> raw_spin_rq_lock(task_rq);
>
> - return move_task_to_local_dsq(p, 0, task_rq, rq);
> + /* task_rq couldn't have changed if we're still the holding cpu */
> + return likely(p->scx.holding_cpu == cpu) &&
> + !WARN_ON_ONCE(task_rq != task_rq(p));
> +}
> +
> +static bool consume_remote_task(struct rq *this_rq, struct scx_dispatch_q *dsq,
> + struct task_struct *p, struct rq *task_rq)
> +{
> + raw_spin_rq_unlock(this_rq);
> +
> + if (unlink_dsq_and_lock_task_rq(p, dsq, task_rq)) {
> + move_task_to_local_dsq(p, 0, task_rq, this_rq);
> + return true;
> + } else {
> + raw_spin_rq_unlock(task_rq);
> + raw_spin_rq_lock(this_rq);
> + return false;
> + }
> }
> #else /* CONFIG_SMP */
> static inline bool task_can_run_on_remote_rq(struct task_struct *p, struct rq *rq, bool trigger_error) { return false; }
> @@ -2361,7 +2364,8 @@ dispatch_to_local_dsq(struct rq *rq, u64 dsq_id, struct task_struct *p,
> * As DISPATCHING guarantees that @p is wholly ours, we can
> * pretend that we're moving from a DSQ and use the same
> * mechanism - mark the task under transfer with holding_cpu,
> - * release DISPATCHING and then follow the same protocol.
> + * release DISPATCHING and then follow the same protocol. See
> + * unlink_dsq_and_lock_task_rq().
> */
> p->scx.holding_cpu = raw_smp_processor_id();
>
> @@ -2374,28 +2378,31 @@ dispatch_to_local_dsq(struct rq *rq, u64 dsq_id, struct task_struct *p,
> raw_spin_rq_lock(src_rq);
> }
>
> - if (src_rq == dst_rq) {
> + /* task_rq couldn't have changed if we're still the holding cpu */
> + dsp = p->scx.holding_cpu == raw_smp_processor_id() &&
> + !WARN_ON_ONCE(src_rq != task_rq(p));
> +
> + if (likely(dsp)) {
> /*
> - * As @p is staying on the same rq, there's no need to
> + * If @p is staying on the same rq, there's no need to
> * go through the full deactivate/activate cycle.
> * Optimize by abbreviating the operations in
> * move_task_to_local_dsq().
> */
> - dsp = p->scx.holding_cpu == raw_smp_processor_id();
> - if (likely(dsp)) {
> + if (src_rq == dst_rq) {
> p->scx.holding_cpu = -1;
> - dispatch_enqueue(&dst_rq->scx.local_dsq, p,
> - enq_flags);
> + dispatch_enqueue(&dst_rq->scx.local_dsq,
> + p, enq_flags);
> + } else {
> + move_task_to_local_dsq(p, enq_flags,
> + src_rq, dst_rq);
> }
> - } else {
> - dsp = move_task_to_local_dsq(p, enq_flags,
> - src_rq, dst_rq);
> - }
>
> - /* if the destination CPU is idle, wake it up */
> - if (dsp && sched_class_above(p->sched_class,
> - dst_rq->curr->sched_class))
> - resched_curr(dst_rq);
> + /* if the destination CPU is idle, wake it up */
> + if (sched_class_above(p->sched_class,
> + dst_rq->curr->sched_class))
> + resched_curr(dst_rq);
> + }
>
> /* switch back to @rq lock */
> if (rq != dst_rq) {
> --
> 2.46.0
>
* Re: [PATCH 02/11] sched_ext: Refactor consume_remote_task()
2024-08-31 4:05 ` David Vernet
@ 2024-08-31 5:33 ` Tejun Heo
2024-08-31 23:40 ` David Vernet
0 siblings, 1 reply; 35+ messages in thread
From: Tejun Heo @ 2024-08-31 5:33 UTC (permalink / raw)
To: David Vernet; +Cc: kernel-team, linux-kernel
Hello,
On Fri, Aug 30, 2024 at 11:05:16PM -0500, David Vernet wrote:
...
> Not a functional change from the prior patch, but it occurred to me that
> if we just deactivate like this then we'll also fire the ops.quiescent()
> callback in dequeue_task_scx(). Should we add a check to skip the
> dequeue callbacks if p->scx.holding_cpu >= 0?
Right, migrations shouldn't trigger quiescent / runnable events. We should
be able to suppress them based on holding_cpu and sticky_cpu. Will look into
that.
Thanks.
--
tejun
* Re: [PATCH 10/11] sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()
2024-08-30 11:03 ` [PATCH 10/11] sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq() Tejun Heo
@ 2024-08-31 14:30 ` Andrea Righi
2024-08-31 16:20 ` Tejun Heo
0 siblings, 1 reply; 35+ messages in thread
From: Andrea Righi @ 2024-08-31 14:30 UTC (permalink / raw)
To: Tejun Heo
Cc: void, kernel-team, linux-kernel, Daniel Hodges, Changwoo Min,
Dan Schatzberg
On Fri, Aug 30, 2024 at 01:03:54AM -1000, Tejun Heo wrote:
> Once a task is put into a DSQ, the allowed operations are fairly limited.
> Tasks in the built-in local and global DSQs are executed automatically and,
> ignoring dequeue, there is only one way a task in a user DSQ can be
> manipulated - scx_bpf_consume() moves the first task to the dispatching
> local DSQ. This inflexibility sometimes gets in the way and is an area where
> multiple feature requests have been made.
>
> Implement scx_bpf_dispatch[_vtime]_from_dsq(), which can be called during
> DSQ iteration and can move the task to any DSQ - local DSQs, global DSQ and
> user DSQs. The kfuncs can be called from ops.dispatch() and any BPF context
> which doesn't hold a rq lock, including BPF timers and SYSCALL programs.
>
> This is an expansion of an earlier patch which only allowed moving into the
> dispatching local DSQ:
>
> http://lkml.kernel.org/r/Zn4Cw4FDTmvXnhaf@slm.duckdns.org
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Daniel Hodges <hodges.daniel.scott@gmail.com>
> Cc: David Vernet <void@manifault.com>
> Cc: Changwoo Min <multics69@gmail.com>
> Cc: Andrea Righi <andrea.righi@linux.dev>
> Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
> ---
> kernel/sched/ext.c | 180 ++++++++++++++++++++++-
> tools/sched_ext/include/scx/common.bpf.h | 8 +
> 2 files changed, 186 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index df33524d68f3..96b8cc490841 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -1067,6 +1067,11 @@ static __always_inline bool scx_kf_allowed_on_arg_tasks(u32 mask,
> return true;
> }
>
> +static bool scx_kf_allowed_if_unlocked(void)
> +{
> + return !current->scx.kf_mask;
> +}
> +
> /**
> * nldsq_next_task - Iterate to the next task in a non-local DSQ
> * @dsq: user dsq being iterated
> @@ -5461,7 +5466,7 @@ __bpf_kfunc_start_defs();
> * scx_bpf_dispatch - Dispatch a task into the FIFO queue of a DSQ
> * @p: task_struct to dispatch
> * @dsq_id: DSQ to dispatch to
> - * @slice: duration @p can run for in nsecs
> + * @slice: duration @p can run for in nsecs, 0 to keep the current value
> * @enq_flags: SCX_ENQ_*
> *
> * Dispatch @p into the FIFO queue of the DSQ identified by @dsq_id. It is safe
> @@ -5511,7 +5516,7 @@ __bpf_kfunc void scx_bpf_dispatch(struct task_struct *p, u64 dsq_id, u64 slice,
> * scx_bpf_dispatch_vtime - Dispatch a task into the vtime priority queue of a DSQ
> * @p: task_struct to dispatch
> * @dsq_id: DSQ to dispatch to
> - * @slice: duration @p can run for in nsecs
> + * @slice: duration @p can run for in nsecs, 0 to keep the current value
> * @vtime: @p's ordering inside the vtime-sorted queue of the target DSQ
Maybe allow to keep the current vtime if 0 is passed, similar to slice?
> * @enq_flags: SCX_ENQ_*
> *
> @@ -5552,6 +5557,117 @@ static const struct btf_kfunc_id_set scx_kfunc_set_enqueue_dispatch = {
> .set = &scx_kfunc_ids_enqueue_dispatch,
> };
>
> +static bool scx_dispatch_from_dsq(struct bpf_iter_scx_dsq_kern *kit,
> + struct task_struct *p, u64 dsq_id,
> + u64 slice, u64 vtime, u64 enq_flags)
> +{
> + struct scx_dispatch_q *src_dsq = kit->dsq, *dst_dsq;
> + struct rq *this_rq, *src_rq, *dst_rq, *locked_rq;
> + bool dispatched = false;
> + bool in_balance;
> + unsigned long flags;
> +
> + if (!scx_kf_allowed_if_unlocked() && !scx_kf_allowed(SCX_KF_DISPATCH))
> + return false;
> +
> + /*
> + * Can be called from either ops.dispatch() locking this_rq() or any
> + * context where no rq lock is held. If the latter, lock @p's task_rq,
> + * which we'll likely need anyway.
> + */
About locking, I was wondering if we could provide a similar API
(scx_bpf_dispatch_lock()?) to use scx_bpf_dispatch() from any context
and not necessarily from ops.select_cpu() / ops.enqueue() or
ops.dispatch().
This would be really useful for user-space schedulers, since we could
use scx_bpf_dispatch() directly and get rid of the
BPF_MAP_TYPE_RINGBUF complexity.
> + src_rq = task_rq(p);
> +
> + local_irq_save(flags);
> + this_rq = this_rq();
> + in_balance = this_rq->scx.flags & SCX_RQ_IN_BALANCE;
> +
> + if (in_balance) {
> + if (this_rq != src_rq) {
> + raw_spin_rq_unlock(this_rq);
> + raw_spin_rq_lock(src_rq);
> + }
> + } else {
> + raw_spin_rq_lock(src_rq);
> + }
> +
> + locked_rq = src_rq;
> + raw_spin_lock(&src_dsq->lock);
> +
> + /*
> + * Did someone else get to it? @p could have already left $src_dsq, got
> + * re-enqueued, or be in the process of being consumed by someone else.
> + */
> + if (unlikely(p->scx.dsq != src_dsq ||
> + u32_before(kit->dsq_seq, p->scx.dsq_seq) ||
> + p->scx.holding_cpu >= 0) ||
> + WARN_ON_ONCE(src_rq != task_rq(p))) {
> + raw_spin_unlock(&src_dsq->lock);
> + goto out;
> + }
> +
> + /* @p is still on $src_dsq and stable, determine the destination */
> + dst_dsq = find_dsq_for_dispatch(this_rq, dsq_id, p);
> +
> + if (dst_dsq->id == SCX_DSQ_LOCAL) {
> + dst_rq = container_of(dst_dsq, struct rq, scx.local_dsq);
> + if (!task_can_run_on_remote_rq(p, dst_rq, true)) {
> + dst_dsq = &scx_dsq_global;
> + dst_rq = src_rq;
> + }
> + } else {
> + /* no need to migrate if destination is a non-local DSQ */
> + dst_rq = src_rq;
> + }
> +
> + /*
> + * Move @p into $dst_dsq. If $dst_dsq is the local DSQ of a different
> + * CPU, @p will be migrated.
> + */
> + if (dst_dsq->id == SCX_DSQ_LOCAL) {
> + /* @p is going from a non-local DSQ to a local DSQ */
> + if (src_rq == dst_rq) {
> + task_unlink_from_dsq(p, src_dsq);
> + move_local_task_to_local_dsq(p, enq_flags,
> + src_dsq, dst_rq);
> + raw_spin_unlock(&src_dsq->lock);
> + } else {
> + raw_spin_unlock(&src_dsq->lock);
> + move_remote_task_to_local_dsq(p, enq_flags,
> + src_rq, dst_rq);
> + locked_rq = dst_rq;
> + }
> + } else {
> + /*
> + * @p is going from a non-local DSQ to a non-local DSQ. As
> + * $src_dsq is already locked, do an abbreviated dequeue.
> + */
> + task_unlink_from_dsq(p, src_dsq);
> + p->scx.dsq = NULL;
> + raw_spin_unlock(&src_dsq->lock);
> +
> + p->scx.dsq_vtime = vtime;
This would be like:
if (vtime)
p->scx.dsq_vtime = vtime;
> + dispatch_enqueue(dst_dsq, p, enq_flags);
> + }
> +
> + if (slice)
> + p->scx.slice = slice;
> + else
> + p->scx.slice = p->scx.slice ?: 1;
> +
> + dispatched = true;
> +out:
> + if (in_balance) {
> + if (this_rq != locked_rq) {
> + raw_spin_rq_unlock(locked_rq);
> + raw_spin_rq_lock(this_rq);
> + }
> + } else {
> + raw_spin_rq_unlock_irqrestore(locked_rq, flags);
> + }
> +
> + return dispatched;
> +}
> +
> __bpf_kfunc_start_defs();
>
> /**
> @@ -5631,12 +5747,70 @@ __bpf_kfunc bool scx_bpf_consume(u64 dsq_id)
> }
> }
>
> +/**
> + * scx_bpf_dispatch_from_dsq - Move a task from DSQ iteration to a DSQ
> + * @it__iter: DSQ iterator in progress
> + * @p: task to transfer
> + * @dsq_id: DSQ to move @p to
> + * @slice: duration @p can run for in nsecs, 0 to keep the current value
> + * @enq_flags: SCX_ENQ_*
> + *
> + * Transfer @p which is on the DSQ currently iterated by @it__iter to the DSQ
> + * specified by @dsq_id. All DSQs - local DSQs, global DSQ and user DSQs - can
> + * be the destination.
> + *
> + * For the transfer to be successful, @p must still be on the DSQ and have been
> + * queued before the DSQ iteration started. This function doesn't care whether
> + * @p was obtained from the DSQ iteration. @p just has to be on the DSQ and have
> + * been queued before the iteration started.
> + *
> + * Can be called from ops.dispatch() or any BPF context which doesn't hold a rq
> + * lock (e.g. BPF timers or SYSCALL programs).
> + *
> + * Returns %true if @p has been consumed, %false if @p had already been consumed
> + * or dequeued.
> + */
> +__bpf_kfunc bool scx_bpf_dispatch_from_dsq(struct bpf_iter_scx_dsq *it__iter,
> + struct task_struct *p, u64 dsq_id,
> + u64 slice, u64 enq_flags)
> +{
> + return scx_dispatch_from_dsq((struct bpf_iter_scx_dsq_kern *)it__iter,
> + p, dsq_id, slice, 0, enq_flags);
> +}
> +
> +/**
> + * scx_bpf_dispatch_vtime_from_dsq - Move a task from DSQ iteration to a PRIQ DSQ
> + * @it__iter: DSQ iterator in progress
> + * @p: task to transfer
> + * @dsq_id: DSQ to move @p to
> + * @slice: duration @p can run for in nsecs, 0 to keep the current value
> + * @vtime: @p's ordering inside the vtime-sorted queue of the target DSQ
> + * @enq_flags: SCX_ENQ_*
Hm... can we pass 6 arguments to a kfunc? I think we're limited to 5,
unless I'm missing something here.
> + *
> + * Transfer @p which is on the DSQ currently iterated by @it__iter to the
> + * priority queue of the DSQ specified by @dsq_id. The destination must be a
> + * user DSQ as only user DSQs support priority queue.
> + *
> + * All other aspects are identical to scx_bpf_dispatch_from_dsq(). See
> + * scx_bpf_dispatch_vtime() for more information on @vtime.
> + */
> +__bpf_kfunc bool scx_bpf_dispatch_vtime_from_dsq(struct bpf_iter_scx_dsq *it__iter,
> + struct task_struct *p, u64 dsq_id,
> + u64 slice, u64 vtime, u64 enq_flags)
> +{
> + return scx_dispatch_from_dsq((struct bpf_iter_scx_dsq_kern *)it__iter,
> + p, dsq_id, slice, vtime,
> + enq_flags | SCX_ENQ_DSQ_PRIQ);
> +}
> +
> __bpf_kfunc_end_defs();
>
> BTF_KFUNCS_START(scx_kfunc_ids_dispatch)
> BTF_ID_FLAGS(func, scx_bpf_dispatch_nr_slots)
> BTF_ID_FLAGS(func, scx_bpf_dispatch_cancel)
> BTF_ID_FLAGS(func, scx_bpf_consume)
> +BTF_ID_FLAGS(func, scx_bpf_dispatch_from_dsq, KF_RCU)
> +BTF_ID_FLAGS(func, scx_bpf_dispatch_vtime_from_dsq, KF_RCU)
> BTF_KFUNCS_END(scx_kfunc_ids_dispatch)
>
> static const struct btf_kfunc_id_set scx_kfunc_set_dispatch = {
> @@ -5733,6 +5907,8 @@ __bpf_kfunc_end_defs();
>
> BTF_KFUNCS_START(scx_kfunc_ids_unlocked)
> BTF_ID_FLAGS(func, scx_bpf_create_dsq, KF_SLEEPABLE)
> +BTF_ID_FLAGS(func, scx_bpf_dispatch_from_dsq, KF_RCU)
> +BTF_ID_FLAGS(func, scx_bpf_dispatch_vtime_from_dsq, KF_RCU)
> BTF_KFUNCS_END(scx_kfunc_ids_unlocked)
>
> static const struct btf_kfunc_id_set scx_kfunc_set_unlocked = {
> diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h
> index 20280df62857..ef018071da31 100644
> --- a/tools/sched_ext/include/scx/common.bpf.h
> +++ b/tools/sched_ext/include/scx/common.bpf.h
> @@ -35,6 +35,8 @@ void scx_bpf_dispatch_vtime(struct task_struct *p, u64 dsq_id, u64 slice, u64 vt
> u32 scx_bpf_dispatch_nr_slots(void) __ksym;
> void scx_bpf_dispatch_cancel(void) __ksym;
> bool scx_bpf_consume(u64 dsq_id) __ksym;
> +bool scx_bpf_dispatch_from_dsq(struct bpf_iter_scx_dsq *it__iter, struct task_struct *p, u64 dsq_id, u64 slice, u64 enq_flags) __ksym __weak;
> +bool scx_bpf_dispatch_vtime_from_dsq(struct bpf_iter_scx_dsq *it__iter, struct task_struct *p, u64 dsq_id, u64 slice, u64 vtime, u64 enq_flags) __ksym __weak;
> u32 scx_bpf_reenqueue_local(void) __ksym;
> void scx_bpf_kick_cpu(s32 cpu, u64 flags) __ksym;
> s32 scx_bpf_dsq_nr_queued(u64 dsq_id) __ksym;
> @@ -62,6 +64,12 @@ bool scx_bpf_task_running(const struct task_struct *p) __ksym;
> s32 scx_bpf_task_cpu(const struct task_struct *p) __ksym;
> struct rq *scx_bpf_cpu_rq(s32 cpu) __ksym;
>
> +/*
> + * Use the following as @it__iter when calling
> + * scx_bpf_dispatch[_vtime]_from_dsq() from within bpf_for_each() loops.
> + */
> +#define BPF_FOR_EACH_ITER (&___it)
> +
> static inline __attribute__((format(printf, 1, 2)))
> void ___scx_bpf_bstr_format_checker(const char *fmt, ...) {}
>
> --
> 2.46.0
>
-Andrea
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 10/11] sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()
2024-08-31 14:30 ` Andrea Righi
@ 2024-08-31 16:20 ` Tejun Heo
2024-08-31 21:15 ` Andrea Righi
0 siblings, 1 reply; 35+ messages in thread
From: Tejun Heo @ 2024-08-31 16:20 UTC (permalink / raw)
To: Andrea Righi
Cc: void, kernel-team, linux-kernel, Daniel Hodges, Changwoo Min,
Dan Schatzberg
Hello,
On Sat, Aug 31, 2024 at 04:30:57PM +0200, Andrea Righi wrote:
...
> > @@ -5511,7 +5516,7 @@ __bpf_kfunc void scx_bpf_dispatch(struct task_struct *p, u64 dsq_id, u64 slice,
> > * scx_bpf_dispatch_vtime - Dispatch a task into the vtime priority queue of a DSQ
> > * @p: task_struct to dispatch
> > * @dsq_id: DSQ to dispatch to
> > - * @slice: duration @p can run for in nsecs
> > + * @slice: duration @p can run for in nsecs, 0 to keep the current value
> > * @vtime: @p's ordering inside the vtime-sorted queue of the target DSQ
>
> Maybe allow to keep the current vtime if 0 is passed, similar to slice?
It's tricky as 0 is a valid vtime. It's unlikely but depending on how vtime
is defined, it may wrap in a practical amount of time. More on this below.
...
> > + /*
> > + * Can be called from either ops.dispatch() locking this_rq() or any
> > + * context where no rq lock is held. If latter, lock @p's task_rq which
> > + * we'll likely need anyway.
> > + */
>
> About locking, I was wondering if we could provide a similar API
> (scx_bpf_dispatch_lock()?) to use scx_bpf_dispatch() from any context
> and not necessarily from ops.select_cpu() / ops.enqueue() or
> ops.dispatch().
>
> This would be really useful for user-space schedulers, since we could
> use scx_bpf_dispatch() directly and get rid of the
> BPF_MAP_TYPE_RINGBUFFER complexity.
One difference between scx_bpf_dispatch() and scx_bpf_dispatch_from_dsq() is
that the former is designed to be safe to call from any context under any
locks by doing the actual dispatches asynchronously. This is primarily to
allow scx_bpf_dispatch() to be called under BPF locks as they are used to
transfer the ownership of tasks from the BPF side to the kernel side. This
makes it more difficult to make scx_bpf_dispatch() more flexible. The way
BPF locks are currently developing, we might not have to worry about killing
the system through deadlocks but it'd still be very prone to soft deadlocks
that kill the BPF scheduler if implemented synchronously. Maybe the solution
here is bouncing to an irq_work or something. I'll think more on it.
...
> > +__bpf_kfunc bool scx_bpf_dispatch_from_dsq(struct bpf_iter_scx_dsq *it__iter,
> > + struct task_struct *p, u64 dsq_id,
> > + u64 slice, u64 enq_flags)
> > +{
> > + return scx_dispatch_from_dsq((struct bpf_iter_scx_dsq_kern *)it__iter,
> > + p, dsq_id, slice, 0, enq_flags);
> > +}
> > +
> > +/**
> > + * scx_bpf_dispatch_vtime_from_dsq - Move a task from DSQ iteration to a PRIQ DSQ
> > + * @it__iter: DSQ iterator in progress
> > + * @p: task to transfer
> > + * @dsq_id: DSQ to move @p to
> > + * @slice: duration @p can run for in nsecs, 0 to keep the current value
> > + * @vtime: @p's ordering inside the vtime-sorted queue of the target DSQ
> > + * @enq_flags: SCX_ENQ_*
>
> Hm... can we pass 6 arguments to a kfunc? I think we're limited to 5,
> unless I'm missing something here.
Hah, I actually don't know and didn't test the vtime variant. Maybe I should
just drop the @slice and @vtime. They can be set by the caller explicitly
before calling these kfuncs anyway although there are some concerns around
ownership (ie. the caller can't be sure that the task has already been
dispatched by someone else before scx_bpf_dispatch_from_dsq() commits). Or
maybe I should pack the optional arguments into a struct. I'll think more
about it.
Thanks.
--
tejun
* Re: [PATCH 10/11] sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()
2024-08-31 16:20 ` Tejun Heo
@ 2024-08-31 21:15 ` Andrea Righi
2024-09-02 1:53 ` Changwoo Min
0 siblings, 1 reply; 35+ messages in thread
From: Andrea Righi @ 2024-08-31 21:15 UTC (permalink / raw)
To: Tejun Heo
Cc: void, kernel-team, linux-kernel, Daniel Hodges, Changwoo Min,
Dan Schatzberg
On Sat, Aug 31, 2024 at 06:20:58AM -1000, Tejun Heo wrote:
> Hello,
>
> On Sat, Aug 31, 2024 at 04:30:57PM +0200, Andrea Righi wrote:
> ...
> > > @@ -5511,7 +5516,7 @@ __bpf_kfunc void scx_bpf_dispatch(struct task_struct *p, u64 dsq_id, u64 slice,
> > > * scx_bpf_dispatch_vtime - Dispatch a task into the vtime priority queue of a DSQ
> > > * @p: task_struct to dispatch
> > > * @dsq_id: DSQ to dispatch to
> > > - * @slice: duration @p can run for in nsecs
> > > + * @slice: duration @p can run for in nsecs, 0 to keep the current value
> > > * @vtime: @p's ordering inside the vtime-sorted queue of the target DSQ
> >
> > Maybe allow to keep the current vtime if 0 is passed, similar to slice?
>
> It's tricky as 0 is a valid vtime. It's unlikely but depending on how vtime
> is defined, it may wrap in a practical amount of time. More on this below.
Ok.
>
> ...
> > > + /*
> > > + * Can be called from either ops.dispatch() locking this_rq() or any
> > > + * context where no rq lock is held. If latter, lock @p's task_rq which
> > > + * we'll likely need anyway.
> > > + */
> >
> > About locking, I was wondering if we could provide a similar API
> > (scx_bpf_dispatch_lock()?) to use scx_bpf_dispatch() from any context
> > and not necessarily from ops.select_cpu() / ops.enqueue() or
> > ops.dispatch().
> >
> > This would be really useful for user-space schedulers, since we could
> > use scx_bpf_dispatch() directly and get rid of the
> > BPF_MAP_TYPE_RINGBUF complexity.
>
> One difference between scx_bpf_dispatch() and scx_bpf_dispatch_from_dsq() is
> that the former is designed to be safe to call from any context under any
> locks by doing the actual dispatches asynchronously. This is primarily to
> allow scx_bpf_dispatch() to be called under BPF locks as they are used to
> transfer the ownership of tasks from the BPF side to the kernel side. This
> makes it more difficult to make scx_bpf_dispatch() more flexible. The way
> BPF locks are currently developing, we might not have to worry about killing
> the system through deadlocks but it'd still be very prone to soft deadlocks
> that kill the BPF scheduler if implemented synchronously. Maybe the solution
> here is bouncing to an irq_work or something. I'll think more on it.
Got it. Well, the idea was to reduce complexity in the user-space
schedulers, but if we need to increase complexity in the kernel to do
so, probably it's not a good idea.
Moreover, using BPF_MAP_TYPE_RINGBUF is really fast now, the
overhead is pretty close to zero, so maybe we can keep this as a low
priority todo.
>
> ...
> > > +__bpf_kfunc bool scx_bpf_dispatch_from_dsq(struct bpf_iter_scx_dsq *it__iter,
> > > + struct task_struct *p, u64 dsq_id,
> > > + u64 slice, u64 enq_flags)
> > > +{
> > > + return scx_dispatch_from_dsq((struct bpf_iter_scx_dsq_kern *)it__iter,
> > > + p, dsq_id, slice, 0, enq_flags);
> > > +}
> > > +
> > > +/**
> > > + * scx_bpf_dispatch_vtime_from_dsq - Move a task from DSQ iteration to a PRIQ DSQ
> > > + * @it__iter: DSQ iterator in progress
> > > + * @p: task to transfer
> > > + * @dsq_id: DSQ to move @p to
> > > + * @slice: duration @p can run for in nsecs, 0 to keep the current value
> > > + * @vtime: @p's ordering inside the vtime-sorted queue of the target DSQ
> > > + * @enq_flags: SCX_ENQ_*
> >
> > Hm... can we pass 6 arguments to a kfunc? I think we're limited to 5,
> > unless I'm missing something here.
>
> Hah, I actually don't know and didn't test the vtime variant. Maybe I should
> just drop the @slice and @vtime. They can be set by the caller explicitly
> before calling these kfuncs anyway although there are some concerns around
> ownership (ie. the caller can't be sure that the task has already been
> dispatched by someone else before scx_bpf_dispatch_from_dsq() commits). Or
> maybe I should pack the optional arguments into a struct. I'll think more
> about it.
IMHO we can simply drop them, introducing a separate struct makes the
API a bit inconsistent with scx_bpf_dispatch() (and I don't think we
want to change also scx_bpf_dispatch() for that).
About the ownership, true... maybe we can accept a bit of fuzziness
in this case, also considering that this race can happen only when using
scx_bpf_dispatch_from_dsq().
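For context on how the new kfuncs and the BPF_FOR_EACH_ITER helper fit together, here is a rough fragment of what a scheduler's dispatch path could look like. It is illustrative only and not compilable in isolation; the usual vmlinux.h/common.bpf.h environment is assumed, and MY_DSQ_ID is a made-up constant.

```c
/* Illustrative sketch: scan a user DSQ and move the first eligible
 * task to the local DSQ of the current CPU.  BPF_FOR_EACH_ITER
 * expands to the hidden iterator of the enclosing bpf_for_each()
 * loop; slice 0 keeps the task's current slice. */
void BPF_STRUCT_OPS(example_dispatch, s32 cpu, struct task_struct *prev)
{
	struct task_struct *p;

	bpf_for_each(scx_dsq, p, MY_DSQ_ID, 0) {
		if (p->nr_cpus_allowed == 1)
			continue;
		if (scx_bpf_dispatch_from_dsq(BPF_FOR_EACH_ITER, p,
					      SCX_DSQ_LOCAL, 0, 0))
			break;
	}
}
```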
Thanks,
-Andrea
* Re: [PATCH 02/11] sched_ext: Refactor consume_remote_task()
2024-08-31 5:33 ` Tejun Heo
@ 2024-08-31 23:40 ` David Vernet
0 siblings, 0 replies; 35+ messages in thread
From: David Vernet @ 2024-08-31 23:40 UTC (permalink / raw)
To: Tejun Heo; +Cc: kernel-team, linux-kernel
On Fri, Aug 30, 2024 at 07:33:21PM -1000, Tejun Heo wrote:
> Hello,
>
> On Fri, Aug 30, 2024 at 11:05:16PM -0500, David Vernet wrote:
> ...
> > Not a functional change from the prior patch, but it occurred to me that
> > if we just deactivate like this then we'll also fire the ops.quiescent()
> > callback in dequeue_task_scx(). Should we add a check to skip the
> > dequeue callbacks if p->scx.holding_cpu >= 0?
>
> Right, migrations shouldn't trigger quiescent / runnable events. We should
> be able to suppress based on holding_cpu and sticky_cpu. Will look into
> that.
Ah right, holding_cpu on its own is of course not enough because its
whole point is to avoid racing with actual, non-migratory dequeues.
Thanks for taking a look at that -- in the meantime, this patch LG:
Acked-by: David Vernet <void@manifault.com>
>
> Thanks.
>
> --
> tejun
* Re: [PATCH 03/11] sched_ext: Make find_dsq_for_dispatch() handle SCX_DSQ_LOCAL_ON
2024-08-30 11:03 ` [PATCH 03/11] sched_ext: Make find_dsq_for_dispatch() handle SCX_DSQ_LOCAL_ON Tejun Heo
@ 2024-09-01 0:11 ` David Vernet
0 siblings, 0 replies; 35+ messages in thread
From: David Vernet @ 2024-09-01 0:11 UTC (permalink / raw)
To: Tejun Heo; +Cc: kernel-team, linux-kernel
On Fri, Aug 30, 2024 at 01:03:47AM -1000, Tejun Heo wrote:
> find_dsq_for_dispatch() handles all DSQ IDs except SCX_DSQ_LOCAL_ON.
> Instead, each caller is handling SCX_DSQ_LOCAL_ON before calling it. Move
> SCX_DSQ_LOCAL_ON lookup into find_dsq_for_dispatch() to remove duplicate
> code in direct_dispatch() and dispatch_to_local_dsq().
>
> No functional changes intended.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
* Re: [PATCH 04/11] sched_ext: Fix process_ddsp_deferred_locals() by unifying DTL_INVALID handling
2024-08-30 17:44 ` [PATCH 04/11] sched_ext: Fix process_ddsp_deferred_locals() by unifying DTL_INVALID handling Tejun Heo
@ 2024-09-01 0:53 ` David Vernet
2024-09-01 0:56 ` David Vernet
0 siblings, 1 reply; 35+ messages in thread
From: David Vernet @ 2024-09-01 0:53 UTC (permalink / raw)
To: Tejun Heo; +Cc: kernel-team, linux-kernel
On Fri, Aug 30, 2024 at 07:44:46AM -1000, Tejun Heo wrote:
> With the preceding update, the only return value which makes meaningful
> difference is DTL_INVALID, for which one caller, finish_dispatch(), falls
> back to the global DSQ and the other, process_ddsp_deferred_locals(),
> doesn't do anything.
>
> It should always fallback to the global DSQ. Move the global DSQ fallback
> into dispatch_to_local_dsq() and remove the return value.
>
> v2: Patch title and description updated to reflect the behavior fix for
> process_ddsp_deferred_locals().
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: David Vernet <void@manifault.com>
Acked-by: David Vernet <void@manifault.com>
FYI, I was able to trigger what I think is a benign pr_warn() by testing
the failure path for DSQ_LOCAL_ON in the direct dispatch path. If I
write a testcase that dispatches an instance of a ksoftirqd task to a
CPU that it's not allowed to run on, I get the following:
[ 14.799935] sched_ext: BPF scheduler "ddsp_local_on_invalid" enabled
[ 14.813738] NOHZ tick-stop error: local softirq work is pending, handler #200!!!
[ 14.829536] sched_ext: BPF scheduler "ddsp_local_on_invalid" disabled (runtime error)
[ 14.829613] sched_ext: ddsp_local_on_invalid: SCX_DSQ_LOCAL[_ON] verdict target cpu 1 not allowed for kworker/0:2[113]
[ 14.829707] dispatch_to_local_dsq+0x13b/0x1e0
[ 14.829760] run_deferred+0x9d/0xe0
[ 14.829797] ttwu_do_activate+0x10a/0x250
[ 14.829834] try_to_wake_up+0x28c/0x530
[ 14.829882] kick_pool+0xd6/0x160
[ 14.829923] __queue_work+0x3f7/0x530
[ 14.829962] call_timer_fn+0xcf/0x2a0
[ 14.830001] __run_timer_base+0x227/0x250
[ 14.830039] run_timer_softirq+0x45/0x60
[ 14.830078] handle_softirqs+0x120/0x400
[ 14.830121] __irq_exit_rcu+0x67/0xd0
[ 14.830157] irq_exit_rcu+0xe/0x20
[ 14.830188] sysvec_apic_timer_interrupt+0x76/0x90
[ 14.830218] asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 14.830254] default_idle+0x13/0x20
[ 14.830282] default_idle_call+0x74/0xb0
[ 14.830306] do_idle+0xef/0x280
[ 14.830335] cpu_startup_entry+0x2a/0x30
[ 14.830363] __pfx_kernel_init+0x0/0x10
[ 14.830386] start_kernel+0x37c/0x3d0
[ 14.830414] x86_64_start_reservations+0x24/0x30
[ 14.830445] x86_64_start_kernel+0xb1/0xc0
[ 14.830480] common_startup_64+0x13e/0x147
This is from report_idle_softirq() in time/tick-sched.c, which has a
baked-in assumption that if there's a pending softirq and we're not in
an IRQ, then ksoftirqd must have been scheduled and we should therefore
expect to see a need_resched() on the current core if we're on the
can_stop_idle_tick() path.
I think this is safe because we're going to be reverting back to fair.c
at this point anyways, but we should figure out how to avoid confusing
the idle tick path regardless.
Thanks,
David
> ---
> kernel/sched/ext.c | 41 ++++++++++-------------------------------
> 1 file changed, 10 insertions(+), 31 deletions(-)
>
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -2301,12 +2301,6 @@ retry:
> return false;
> }
>
> -enum dispatch_to_local_dsq_ret {
> - DTL_DISPATCHED, /* successfully dispatched */
> - DTL_LOST, /* lost race to dequeue */
> - DTL_INVALID, /* invalid local dsq_id */
> -};
> -
> /**
> * dispatch_to_local_dsq - Dispatch a task to a local dsq
> * @rq: current rq which is locked
> @@ -2321,9 +2315,8 @@ enum dispatch_to_local_dsq_ret {
> * The caller must have exclusive ownership of @p (e.g. through
> * %SCX_OPSS_DISPATCHING).
> */
> -static enum dispatch_to_local_dsq_ret
> -dispatch_to_local_dsq(struct rq *rq, struct scx_dispatch_q *dst_dsq,
> - struct task_struct *p, u64 enq_flags)
> +static void dispatch_to_local_dsq(struct rq *rq, struct scx_dispatch_q *dst_dsq,
> + struct task_struct *p, u64 enq_flags)
> {
> struct rq *src_rq = task_rq(p);
> struct rq *dst_rq = container_of(dst_dsq, struct rq, scx.local_dsq);
> @@ -2336,13 +2329,11 @@ dispatch_to_local_dsq(struct rq *rq, str
> */
> if (rq == src_rq && rq == dst_rq) {
> dispatch_enqueue(dst_dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
> - return DTL_DISPATCHED;
> + return;
> }
>
> #ifdef CONFIG_SMP
> if (likely(task_can_run_on_remote_rq(p, dst_rq, true))) {
> - bool dsp;
> -
> /*
> * @p is on a possibly remote @src_rq which we need to lock to
> * move the task. If dequeue is in progress, it'd be locking
> @@ -2367,10 +2358,8 @@ dispatch_to_local_dsq(struct rq *rq, str
> }
>
> /* task_rq couldn't have changed if we're still the holding cpu */
> - dsp = p->scx.holding_cpu == raw_smp_processor_id() &&
> - !WARN_ON_ONCE(src_rq != task_rq(p));
> -
> - if (likely(dsp)) {
> + if (likely(p->scx.holding_cpu == raw_smp_processor_id()) &&
> + !WARN_ON_ONCE(src_rq != task_rq(p))) {
> /*
> * If @p is staying on the same rq, there's no need to
> * go through the full deactivate/activate cycle.
> @@ -2398,11 +2387,11 @@ dispatch_to_local_dsq(struct rq *rq, str
> raw_spin_rq_lock(rq);
> }
>
> - return dsp ? DTL_DISPATCHED : DTL_LOST;
> + return;
> }
> #endif /* CONFIG_SMP */
>
> - return DTL_INVALID;
> + dispatch_enqueue(&scx_dsq_global, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
> }
>
> /**
> @@ -2479,20 +2468,10 @@ retry:
>
> dsq = find_dsq_for_dispatch(this_rq(), dsq_id, p);
>
> - if (dsq->id == SCX_DSQ_LOCAL) {
> - switch (dispatch_to_local_dsq(rq, dsq, p, enq_flags)) {
> - case DTL_DISPATCHED:
> - break;
> - case DTL_LOST:
> - break;
> - case DTL_INVALID:
> - dispatch_enqueue(&scx_dsq_global, p,
> - enq_flags | SCX_ENQ_CLEAR_OPSS);
> - break;
> - }
> - } else {
> + if (dsq->id == SCX_DSQ_LOCAL)
> + dispatch_to_local_dsq(rq, dsq, p, enq_flags);
> + else
> dispatch_enqueue(dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
> - }
> }
>
> static void flush_dispatch_buf(struct rq *rq)
* Re: [PATCH 04/11] sched_ext: Fix process_ddsp_deferred_locals() by unifying DTL_INVALID handling
2024-09-01 0:53 ` David Vernet
@ 2024-09-01 0:56 ` David Vernet
2024-09-01 8:03 ` Tejun Heo
0 siblings, 1 reply; 35+ messages in thread
From: David Vernet @ 2024-09-01 0:56 UTC (permalink / raw)
To: Tejun Heo; +Cc: kernel-team, linux-kernel
On Sat, Aug 31, 2024 at 07:53:37PM -0500, David Vernet wrote:
> On Fri, Aug 30, 2024 at 07:44:46AM -1000, Tejun Heo wrote:
> > With the preceding update, the only return value which makes meaningful
> > difference is DTL_INVALID, for which one caller, finish_dispatch(), falls
> > back to the global DSQ and the other, process_ddsp_deferred_locals(),
> > doesn't do anything.
> >
> > It should always fallback to the global DSQ. Move the global DSQ fallback
> > into dispatch_to_local_dsq() and remove the return value.
> >
> > v2: Patch title and description updated to reflect the behavior fix for
> > process_ddsp_deferred_locals().
> >
> > Signed-off-by: Tejun Heo <tj@kernel.org>
> > Cc: David Vernet <void@manifault.com>
>
> Acked-by: David Vernet <void@manifault.com>
>
> FYI, I was able to trigger what I think is a benign pr_warn() by testing
> the failure path for DSQ_LOCAL_ON in the direct dispatch path. If I
> write a testcase that dispatches an instance of a ksoftirqd task to a
Sorry, should have been more clear: the testcase dispatched all tasks to
the wrong CPU, which is why it's a kworker in the print output below. I
believe that ksoftirqd hit the same path as well and just wasn't printed
in the output because it lost the race to scx_bpf_error(). Let me know
if you want the testcase to repro and I can send it, or send a separate
patch to add it to selftests.
Thanks,
David
> CPU that it's not allowed to run on, I get the following:
>
> [ 14.799935] sched_ext: BPF scheduler "ddsp_local_on_invalid" enabled
> [ 14.813738] NOHZ tick-stop error: local softirq work is pending, handler #200!!!
> [ 14.829536] sched_ext: BPF scheduler "ddsp_local_on_invalid" disabled (runtime error)
> [ 14.829613] sched_ext: ddsp_local_on_invalid: SCX_DSQ_LOCAL[_ON] verdict target cpu 1 not allowed for kworker/0:2[113]
> [ 14.829707] dispatch_to_local_dsq+0x13b/0x1e0
> [ 14.829760] run_deferred+0x9d/0xe0
> [ 14.829797] ttwu_do_activate+0x10a/0x250
> [ 14.829834] try_to_wake_up+0x28c/0x530
> [ 14.829882] kick_pool+0xd6/0x160
> [ 14.829923] __queue_work+0x3f7/0x530
> [ 14.829962] call_timer_fn+0xcf/0x2a0
> [ 14.830001] __run_timer_base+0x227/0x250
> [ 14.830039] run_timer_softirq+0x45/0x60
> [ 14.830078] handle_softirqs+0x120/0x400
> [ 14.830121] __irq_exit_rcu+0x67/0xd0
> [ 14.830157] irq_exit_rcu+0xe/0x20
> [ 14.830188] sysvec_apic_timer_interrupt+0x76/0x90
> [ 14.830218] asm_sysvec_apic_timer_interrupt+0x1a/0x20
> [ 14.830254] default_idle+0x13/0x20
> [ 14.830282] default_idle_call+0x74/0xb0
> [ 14.830306] do_idle+0xef/0x280
> [ 14.830335] cpu_startup_entry+0x2a/0x30
> [ 14.830363] __pfx_kernel_init+0x0/0x10
> [ 14.830386] start_kernel+0x37c/0x3d0
> [ 14.830414] x86_64_start_reservations+0x24/0x30
> [ 14.830445] x86_64_start_kernel+0xb1/0xc0
> [ 14.830480] common_startup_64+0x13e/0x147
>
> This is from report_idle_softirq() in time/tick-sched.c, which has a
> baked-in assumption that if there's a pending softirq and we're not in
> an IRQ, then ksoftirqd must have been scheduled and we should therefore
> expect to see a need_resched() on the current core if we're on the
> can_stop_idle_tick() path.
>
> I think this is safe because we're going to be reverting back to fair.c
> at this point anyways, but we should figure out how to avoid confusing
> the idle tick path regardless.
>
> Thanks,
> David
>
> > ---
> > kernel/sched/ext.c | 41 ++++++++++-------------------------------
> > 1 file changed, 10 insertions(+), 31 deletions(-)
> >
> > --- a/kernel/sched/ext.c
> > +++ b/kernel/sched/ext.c
> > @@ -2301,12 +2301,6 @@ retry:
> > return false;
> > }
> >
> > -enum dispatch_to_local_dsq_ret {
> > - DTL_DISPATCHED, /* successfully dispatched */
> > - DTL_LOST, /* lost race to dequeue */
> > - DTL_INVALID, /* invalid local dsq_id */
> > -};
> > -
> > /**
> > * dispatch_to_local_dsq - Dispatch a task to a local dsq
> > * @rq: current rq which is locked
> > @@ -2321,9 +2315,8 @@ enum dispatch_to_local_dsq_ret {
> > * The caller must have exclusive ownership of @p (e.g. through
> > * %SCX_OPSS_DISPATCHING).
> > */
> > -static enum dispatch_to_local_dsq_ret
> > -dispatch_to_local_dsq(struct rq *rq, struct scx_dispatch_q *dst_dsq,
> > - struct task_struct *p, u64 enq_flags)
> > +static void dispatch_to_local_dsq(struct rq *rq, struct scx_dispatch_q *dst_dsq,
> > + struct task_struct *p, u64 enq_flags)
> > {
> > struct rq *src_rq = task_rq(p);
> > struct rq *dst_rq = container_of(dst_dsq, struct rq, scx.local_dsq);
> > @@ -2336,13 +2329,11 @@ dispatch_to_local_dsq(struct rq *rq, str
> > */
> > if (rq == src_rq && rq == dst_rq) {
> > dispatch_enqueue(dst_dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
> > - return DTL_DISPATCHED;
> > + return;
> > }
> >
> > #ifdef CONFIG_SMP
> > if (likely(task_can_run_on_remote_rq(p, dst_rq, true))) {
> > - bool dsp;
> > -
> > /*
> > * @p is on a possibly remote @src_rq which we need to lock to
> > * move the task. If dequeue is in progress, it'd be locking
> > @@ -2367,10 +2358,8 @@ dispatch_to_local_dsq(struct rq *rq, str
> > }
> >
> > /* task_rq couldn't have changed if we're still the holding cpu */
> > - dsp = p->scx.holding_cpu == raw_smp_processor_id() &&
> > - !WARN_ON_ONCE(src_rq != task_rq(p));
> > -
> > - if (likely(dsp)) {
> > + if (likely(p->scx.holding_cpu == raw_smp_processor_id()) &&
> > + !WARN_ON_ONCE(src_rq != task_rq(p))) {
> > /*
> > * If @p is staying on the same rq, there's no need to
> > * go through the full deactivate/activate cycle.
> > @@ -2398,11 +2387,11 @@ dispatch_to_local_dsq(struct rq *rq, str
> > raw_spin_rq_lock(rq);
> > }
> >
> > - return dsp ? DTL_DISPATCHED : DTL_LOST;
> > + return;
> > }
> > #endif /* CONFIG_SMP */
> >
> > - return DTL_INVALID;
> > + dispatch_enqueue(&scx_dsq_global, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
> > }
> >
> > /**
> > @@ -2479,20 +2468,10 @@ retry:
> >
> > dsq = find_dsq_for_dispatch(this_rq(), dsq_id, p);
> >
> > - if (dsq->id == SCX_DSQ_LOCAL) {
> > - switch (dispatch_to_local_dsq(rq, dsq, p, enq_flags)) {
> > - case DTL_DISPATCHED:
> > - break;
> > - case DTL_LOST:
> > - break;
> > - case DTL_INVALID:
> > - dispatch_enqueue(&scx_dsq_global, p,
> > - enq_flags | SCX_ENQ_CLEAR_OPSS);
> > - break;
> > - }
> > - } else {
> > + if (dsq->id == SCX_DSQ_LOCAL)
> > + dispatch_to_local_dsq(rq, dsq, p, enq_flags);
> > + else
> > dispatch_enqueue(dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
> > - }
> > }
> >
> > static void flush_dispatch_buf(struct rq *rq)
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 05/11] sched_ext: Restructure dispatch_to_local_dsq()
2024-08-30 11:03 ` [PATCH 05/11] sched_ext: Restructure dispatch_to_local_dsq() Tejun Heo
@ 2024-09-01 1:09 ` David Vernet
0 siblings, 0 replies; 35+ messages in thread
From: David Vernet @ 2024-09-01 1:09 UTC (permalink / raw)
To: Tejun Heo; +Cc: kernel-team, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 2413 bytes --]
On Fri, Aug 30, 2024 at 01:03:49AM -1000, Tejun Heo wrote:
> Now that there's nothing left after the big if block, flip the if condition
> and unindent the body.
>
> No functional changes intended.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
LGTM, just one comment below.
Acked-by: David Vernet <void@manifault.com>
> ---
> kernel/sched/ext.c | 94 ++++++++++++++++++++++------------------------
> 1 file changed, 44 insertions(+), 50 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 34b4e63850c1..add267f31396 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -2333,65 +2333,59 @@ static void dispatch_to_local_dsq(struct rq *rq, struct scx_dispatch_q *dst_dsq,
> }
>
> #ifdef CONFIG_SMP
> - if (likely(task_can_run_on_remote_rq(p, dst_rq, true))) {
> - /*
> - * @p is on a possibly remote @src_rq which we need to lock to
> - * move the task. If dequeue is in progress, it'd be locking
> - * @src_rq and waiting on DISPATCHING, so we can't grab @src_rq
> - * lock while holding DISPATCHING.
> - *
> - * As DISPATCHING guarantees that @p is wholly ours, we can
> - * pretend that we're moving from a DSQ and use the same
> - * mechanism - mark the task under transfer with holding_cpu,
> - * release DISPATCHING and then follow the same protocol. See
> - * unlink_dsq_and_lock_task_rq().
> - */
> - p->scx.holding_cpu = raw_smp_processor_id();
> + if (unlikely(!task_can_run_on_remote_rq(p, dst_rq, true))) {
> + dispatch_enqueue(&scx_dsq_global, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
Given that UP systems will always take the path above where rq == src_rq
&& rq == dst_rq, this isn't really a functional change, but technically
we are now moving the backup global DSQ dispatch_enqueue call into the
CONFIG_SMP block. Should we maybe add a BUG() in an #else block below?
It's UP so it's obviously not super critical, but it might help
readability? Feel free to ignore -- the fact that all the rest of the
logic is in a CONFIG_SMP block is pretty clear, and this arguably
already improves readability by moving an unreachable piece of code on
UP into the SMP block.
Thanks,
David
> + return;
> + }
>
[... snip ...]
> #endif /* CONFIG_SMP */
> -
> - dispatch_enqueue(&scx_dsq_global, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
> }
>
> /**
> --
> 2.46.0
>
* Re: [PATCH 06/11] sched_ext: Reorder args for consume_local/remote_task()
2024-08-30 11:03 ` [PATCH 06/11] sched_ext: Reorder args for consume_local/remote_task() Tejun Heo
@ 2024-09-01 1:40 ` David Vernet
2024-09-01 6:37 ` Tejun Heo
0 siblings, 1 reply; 35+ messages in thread
From: David Vernet @ 2024-09-01 1:40 UTC (permalink / raw)
To: Tejun Heo; +Cc: kernel-team, linux-kernel
On Fri, Aug 30, 2024 at 01:03:50AM -1000, Tejun Heo wrote:
> Reorder args for consistency in the order of:
>
> current_rq, p, src_[rq|dsq], dst_[rq|dsq].
>
> No functional changes intended.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
> kernel/sched/ext.c | 14 +++++++-------
> 1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index add267f31396..620cc0586c4b 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -2139,8 +2139,8 @@ static void move_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
>
> #endif /* CONFIG_SMP */
>
> -static void consume_local_task(struct rq *rq, struct scx_dispatch_q *dsq,
> - struct task_struct *p)
> +static void consume_local_task(struct task_struct *p,
> + struct scx_dispatch_q *dsq, struct rq *rq)
> {
> lockdep_assert_held(&dsq->lock); /* released on return */
>
> @@ -2249,8 +2249,8 @@ static bool unlink_dsq_and_lock_task_rq(struct task_struct *p,
> !WARN_ON_ONCE(task_rq != task_rq(p));
> }
>
> -static bool consume_remote_task(struct rq *this_rq, struct scx_dispatch_q *dsq,
> - struct task_struct *p, struct rq *task_rq)
> +static bool consume_remote_task(struct rq *this_rq, struct task_struct *p,
> + struct scx_dispatch_q *dsq, struct rq *task_rq)
> {
> raw_spin_rq_unlock(this_rq);
>
> @@ -2265,7 +2265,7 @@ static bool consume_remote_task(struct rq *this_rq, struct scx_dispatch_q *dsq,
> }
> #else /* CONFIG_SMP */
> static inline bool task_can_run_on_remote_rq(struct task_struct *p, struct rq *rq, bool trigger_error) { return false; }
> -static inline bool consume_remote_task(struct rq *rq, struct scx_dispatch_q *dsq, struct task_struct *p, struct rq *task_rq) { return false; }
> +static inline bool consume_remote_task(struct rq *this_rq, struct task_struct *p, struct scx_dispatch_q *dsq, struct rq *task_rq) { return false; }
> #endif /* CONFIG_SMP */
>
> static bool consume_dispatch_q(struct rq *rq, struct scx_dispatch_q *dsq)
> @@ -2286,12 +2286,12 @@ static bool consume_dispatch_q(struct rq *rq, struct scx_dispatch_q *dsq)
> struct rq *task_rq = task_rq(p);
>
> if (rq == task_rq) {
> - consume_local_task(rq, dsq, p);
> + consume_local_task(p, dsq, rq);
> return true;
> }
>
> if (task_can_run_on_remote_rq(p, rq, false)) {
How do you feel about always prefixing src_ and dst_ for any arguments
that refer to either (with any @rq before @p implying current as this
patch proposes)? In this case it's a bit confusing to read because
technically according to the convention proposed in this patch, @rq
could be either curr_rq or src_rq in consume_dispatch_q() (there's no
@p to disambiguate), and @rq could be either src_rq or dst_rq in
task_can_run_on_remote_rq() (they both come after @p).
It's pretty obvious from context that @rq is referring to a dst_rq in
task_can_run_on_remote_rq(), but it might still be a bit easier on the
eyes to be explicit. And for functions like consume_remote_task() which
take both a src_dsq and a src_rq, I think it will be easier to follow
than the convention.
Thanks,
David
> - if (likely(consume_remote_task(rq, dsq, p, task_rq)))
> + if (likely(consume_remote_task(rq, p, dsq, task_rq)))
> return true;
> goto retry;
> }
> --
> 2.46.0
>
* Re: [PATCH 07/11] sched_ext: Move sanity check and dsq_mod_nr() into task_unlink_from_dsq()
2024-08-30 11:03 ` [PATCH 07/11] sched_ext: Move sanity check and dsq_mod_nr() into task_unlink_from_dsq() Tejun Heo
@ 2024-09-01 1:42 ` David Vernet
0 siblings, 0 replies; 35+ messages in thread
From: David Vernet @ 2024-09-01 1:42 UTC (permalink / raw)
To: Tejun Heo; +Cc: kernel-team, linux-kernel
On Fri, Aug 30, 2024 at 01:03:51AM -1000, Tejun Heo wrote:
> All task_unlink_from_dsq() users are doing dsq_mod_nr(dsq, -1). Move it into
> task_unlink_from_dsq(). Also move sanity check into it.
>
> No functional changes intended.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
* Re: [PATCH 08/11] sched_ext: Move consume_local_task() upward
2024-08-30 11:03 ` [PATCH 08/11] sched_ext: Move consume_local_task() upward Tejun Heo
@ 2024-09-01 1:43 ` David Vernet
0 siblings, 0 replies; 35+ messages in thread
From: David Vernet @ 2024-09-01 1:43 UTC (permalink / raw)
To: Tejun Heo; +Cc: kernel-team, linux-kernel
On Fri, Aug 30, 2024 at 01:03:52AM -1000, Tejun Heo wrote:
> So that the local case comes first and two CONFIG_SMP blocks can be merged.
>
> No functional changes intended.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
* Re: [PATCH 09/11] sched_ext: Replace consume_local_task() with move_local_task_to_local_dsq()
2024-08-30 11:03 ` [PATCH 09/11] sched_ext: Replace consume_local_task() with move_local_task_to_local_dsq() Tejun Heo
@ 2024-09-01 1:55 ` David Vernet
0 siblings, 0 replies; 35+ messages in thread
From: David Vernet @ 2024-09-01 1:55 UTC (permalink / raw)
To: Tejun Heo; +Cc: kernel-team, linux-kernel
On Fri, Aug 30, 2024 at 01:03:53AM -1000, Tejun Heo wrote:
> - Rename move_task_to_local_dsq() to move_remote_task_to_local_dsq().
>
> - Rename consume_local_task() to move_local_task_to_local_dsq() and remove
> task_unlink_from_dsq() and source DSQ unlocking from it.
>
> This is to make the migration code easier to reuse.
>
> No functional changes intended.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
> ---
> kernel/sched/ext.c | 42 ++++++++++++++++++++++++++----------------
> 1 file changed, 26 insertions(+), 16 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 51d141602a11..df33524d68f3 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -2103,23 +2103,30 @@ static bool yield_to_task_scx(struct rq *rq, struct task_struct *to)
> return false;
> }
>
> -static void consume_local_task(struct task_struct *p,
> - struct scx_dispatch_q *dsq, struct rq *rq)
> +static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
> + struct scx_dispatch_q *src_dsq,
> + struct rq *dst_rq)
> {
> - lockdep_assert_held(&dsq->lock); /* released on return */
> + struct scx_dispatch_q *dst_dsq = &dst_rq->scx.local_dsq;
> +
> + /* @dsq is locked and @p is on @dst_rq */
> + lockdep_assert_held(&src_dsq->lock);
> + lockdep_assert_rq_held(dst_rq);
>
> - /* @dsq is locked and @p is on this rq */
> WARN_ON_ONCE(p->scx.holding_cpu >= 0);
> - task_unlink_from_dsq(p, dsq);
> - list_add_tail(&p->scx.dsq_list.node, &rq->scx.local_dsq.list);
> - dsq_mod_nr(&rq->scx.local_dsq, 1);
> - p->scx.dsq = &rq->scx.local_dsq;
> - raw_spin_unlock(&dsq->lock);
> +
> + if (enq_flags & (SCX_ENQ_HEAD | SCX_ENQ_PREEMPT))
> + list_add(&p->scx.dsq_list.node, &dst_dsq->list);
> + else
> + list_add_tail(&p->scx.dsq_list.node, &dst_dsq->list);
> +
> + dsq_mod_nr(dst_dsq, 1);
> + p->scx.dsq = dst_dsq;
> }
>
> #ifdef CONFIG_SMP
> /**
> - * move_task_to_local_dsq - Move a task from a different rq to a local DSQ
> + * move_remote_task_to_local_dsq - Move a task from a foreign rq to a local DSQ
> * @p: task to move
> * @enq_flags: %SCX_ENQ_*
> * @src_rq: rq to move the task from, locked on entry, released on return
> @@ -2127,8 +2134,8 @@ static void consume_local_task(struct task_struct *p,
> *
> * Move @p which is currently on @src_rq to @dst_rq's local DSQ.
> */
> -static void move_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
> - struct rq *src_rq, struct rq *dst_rq)
> +static void move_remote_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
> + struct rq *src_rq, struct rq *dst_rq)
> {
> lockdep_assert_rq_held(src_rq);
>
> @@ -2251,7 +2258,7 @@ static bool consume_remote_task(struct rq *this_rq, struct task_struct *p,
> raw_spin_rq_unlock(this_rq);
>
> if (unlink_dsq_and_lock_task_rq(p, dsq, task_rq)) {
> - move_task_to_local_dsq(p, 0, task_rq, this_rq);
> + move_remote_task_to_local_dsq(p, 0, task_rq, this_rq);
> return true;
> } else {
> raw_spin_rq_unlock(task_rq);
> @@ -2282,7 +2289,9 @@ static bool consume_dispatch_q(struct rq *rq, struct scx_dispatch_q *dsq)
> struct rq *task_rq = task_rq(p);
>
> if (rq == task_rq) {
> - consume_local_task(p, dsq, rq);
> + task_unlink_from_dsq(p, dsq);
> + move_local_task_to_local_dsq(p, 0, dsq, rq);
> + raw_spin_unlock(&dsq->lock);
> return true;
> }
>
> @@ -2362,13 +2371,14 @@ static void dispatch_to_local_dsq(struct rq *rq, struct scx_dispatch_q *dst_dsq,
> /*
> * If @p is staying on the same rq, there's no need to go
> * through the full deactivate/activate cycle. Optimize by
> - * abbreviating the operations in move_task_to_local_dsq().
> + * abbreviating move_remote_task_to_local_dsq().
> */
> if (src_rq == dst_rq) {
> p->scx.holding_cpu = -1;
> dispatch_enqueue(&dst_rq->scx.local_dsq, p, enq_flags);
> } else {
> - move_task_to_local_dsq(p, enq_flags, src_rq, dst_rq);
> + move_remote_task_to_local_dsq(p, enq_flags,
> + src_rq, dst_rq);
> }
>
> /* if the destination CPU is idle, wake it up */
> --
> 2.46.0
>
* Re: [PATCH 06/11] sched_ext: Reorder args for consume_local/remote_task()
2024-09-01 1:40 ` David Vernet
@ 2024-09-01 6:37 ` Tejun Heo
0 siblings, 0 replies; 35+ messages in thread
From: Tejun Heo @ 2024-09-01 6:37 UTC (permalink / raw)
To: David Vernet; +Cc: kernel-team, linux-kernel
Hello,
On Sat, Aug 31, 2024 at 08:40:00PM -0500, David Vernet wrote:
...
> > @@ -2265,7 +2265,7 @@ static bool consume_remote_task(struct rq *this_rq, struct scx_dispatch_q *dsq,
> > }
> > #else /* CONFIG_SMP */
> > static inline bool task_can_run_on_remote_rq(struct task_struct *p, struct rq *rq, bool trigger_error) { return false; }
> > -static inline bool consume_remote_task(struct rq *rq, struct scx_dispatch_q *dsq, struct task_struct *p, struct rq *task_rq) { return false; }
> > +static inline bool consume_remote_task(struct rq *this_rq, struct task_struct *p, struct scx_dispatch_q *dsq, struct rq *task_rq) { return false; }
> > #endif /* CONFIG_SMP */
> >
> > static bool consume_dispatch_q(struct rq *rq, struct scx_dispatch_q *dsq)
> > @@ -2286,12 +2286,12 @@ static bool consume_dispatch_q(struct rq *rq, struct scx_dispatch_q *dsq)
> > struct rq *task_rq = task_rq(p);
> >
> > if (rq == task_rq) {
> > - consume_local_task(rq, dsq, p);
> > + consume_local_task(p, dsq, rq);
> > return true;
> > }
> >
> > if (task_can_run_on_remote_rq(p, rq, false)) {
>
> How do you feel about always prefixing src_ and dst_ for any arguments
> that refer to either (with any @rq before @p implying current as this
> patch proposes)? In this case it's a bit confusing to read because
> technically according to the convention proposed in this patch, @rq
> could be either curr_rq or src_rq in consume_dispatch_q() (there's no
> @p to disambiguate), and @rq could be either src_rq or dst_rq in
> task_can_run_on_remote_rq() (they both come after @p).
>
> It's pretty obvious from context that @rq is referring to a dst_rq in
> task_can_run_on_remote_rq(), but it might still be a bit easier on the
> eyes to be explicit. And for functions like consume_remote_task() which
> take both a src_dsq and a src_rq, I think it will be easier to follow
> then the convention.
re. these arguments, there are largely two schemes - one coming from sched
core and shared with other scheds where the leading [this_]rq denotes the
current / local rq, and the other internal to SCX where more parties are
involved - this_rq, src_rq and dst_rq. While the latter situation may not be
unique to SCX, it is a lot more pronounced in SCX than other scheduling
classes as migrations are integral to how it works.
I'm not sure about applying the latter naming scheme to everything. Where
SCX is interfacing with sched core, I think there are more benefits in
following the existing naming scheme. This means that there are going to be
places where the two naming schemes interact, which is what this patch is
showing - consume*() functions are following the sched core scheme as
they're always used to pull tasks to the "current" rq for ops.balance() -
it's still doing what balances do in other sched classes. However, the
helpers that they use are more generic and flexible ones which are going to
be used for SCX specific things such as moving tasks between arbitrary DSQs.
That said, the use of @task_rq instead of @src_rq here is just confusing. I
will rename it to @src_rq.
Thanks.
--
tejun
* Re: [PATCH 04/11] sched_ext: Fix processs_ddsp_deferred_locals() by unifying DTL_INVALID handling
2024-09-01 0:56 ` David Vernet
@ 2024-09-01 8:03 ` Tejun Heo
2024-09-01 15:35 ` David Vernet
0 siblings, 1 reply; 35+ messages in thread
From: Tejun Heo @ 2024-09-01 8:03 UTC (permalink / raw)
To: David Vernet; +Cc: kernel-team, linux-kernel
Hello,
On Sat, Aug 31, 2024 at 07:56:39PM -0500, David Vernet wrote:
...
> Sorry, should have been more clear: the testcase dispatched all tasks to
> the wrong CPU, which is why it's a kworker in the print output below. I
> believe that ksoftirqd hit the same path as well and just wasn't printed
> in the output because it lost the race to scx_bpf_error(). Let me know
> if you want the testcase to repro and I can send it, or send a separate
> patch to add it to selftests.
Yeah, please share the repro.
Thanks.
--
tejun
* Re: [PATCH 04/11] sched_ext: Fix processs_ddsp_deferred_locals() by unifying DTL_INVALID handling
2024-09-01 8:03 ` Tejun Heo
@ 2024-09-01 15:35 ` David Vernet
0 siblings, 0 replies; 35+ messages in thread
From: David Vernet @ 2024-09-01 15:35 UTC (permalink / raw)
To: Tejun Heo; +Cc: kernel-team, linux-kernel
On Sat, Aug 31, 2024 at 10:03:58PM -1000, Tejun Heo wrote:
> Hello,
>
> On Sat, Aug 31, 2024 at 07:56:39PM -0500, David Vernet wrote:
> ...
> > Sorry, should have been more clear: the testcase dispatched all tasks to
> > the wrong CPU, which is why it's a kworker in the print output below. I
> > believe that ksoftirqd hit the same path as well and just wasn't printed
> > in the output because it lost the race to scx_bpf_error(). Let me know
> > if you want the testcase to repro and I can send it, or send a separate
> > patch to add it to selftests.
>
> Yeah, please share the repro.
See the attached patch. You can run the test as follows after rebuilding
selftests:
./runner -t ddsp_local_on_invalid
You may have to run the test a few times to see the repro. Worth noting
is that the repro doesn't seem to hit if we don't explicitly set the
fallback target to 0 in ddsp_local_on_invalid_enqueue().
Thanks,
David
[-- Attachment #1.2: 0001-scx-Add-test-validating-direct-dispatch-with-LOCAL_O.patch --]
[-- Type: text/x-patch, Size: 4635 bytes --]
From 5e1a850b7db989429b94ea5d9cdf786faf3bcd4f Mon Sep 17 00:00:00 2001
From: David Vernet <void@manifault.com>
Date: Sat, 31 Aug 2024 19:16:22 -0500
Subject: [PATCH] scx: Add test validating direct dispatch with LOCAL_ON
SCX_DSQ_LOCAL_ON | cpu can now be invoked on the direct dispatch path.
Let's ensure we're gracefully handling it being dispatched to a CPU
that it's not allowed to run on.
Signed-off-by: David Vernet <void@manifault.com>
---
tools/testing/selftests/sched_ext/Makefile | 1 +
.../sched_ext/ddsp_local_on_invalid.bpf.c | 42 ++++++++++++++
.../sched_ext/ddsp_local_on_invalid.c | 58 +++++++++++++++++++
3 files changed, 101 insertions(+)
create mode 100644 tools/testing/selftests/sched_ext/ddsp_local_on_invalid.bpf.c
create mode 100644 tools/testing/selftests/sched_ext/ddsp_local_on_invalid.c
diff --git a/tools/testing/selftests/sched_ext/Makefile b/tools/testing/selftests/sched_ext/Makefile
index 0754a2c110a1..4823a67e6854 100644
--- a/tools/testing/selftests/sched_ext/Makefile
+++ b/tools/testing/selftests/sched_ext/Makefile
@@ -165,6 +165,7 @@ auto-test-targets := \
enq_last_no_enq_fails \
enq_select_cpu_fails \
ddsp_bogus_dsq_fail \
+ ddsp_local_on_invalid \
ddsp_vtimelocal_fail \
dsp_local_on \
exit \
diff --git a/tools/testing/selftests/sched_ext/ddsp_local_on_invalid.bpf.c b/tools/testing/selftests/sched_ext/ddsp_local_on_invalid.bpf.c
new file mode 100644
index 000000000000..e4512d7cc4b5
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/ddsp_local_on_invalid.bpf.c
@@ -0,0 +1,42 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2024 David Vernet <dvernet@meta.com>
+ */
+#include <scx/common.bpf.h>
+
+char _license[] SEC("license") = "GPL";
+const volatile s32 nr_cpus;
+
+UEI_DEFINE(uei);
+
+s32 BPF_STRUCT_OPS(ddsp_local_on_invalid_select_cpu, struct task_struct *p,
+ s32 prev_cpu, u64 wake_flags)
+{
+ return prev_cpu;
+}
+
+void BPF_STRUCT_OPS(ddsp_local_on_invalid_enqueue, struct task_struct *p,
+ u64 enq_flags)
+{
+ int target = bpf_cpumask_first_zero(p->cpus_ptr);
+
+ if (target >= nr_cpus)
+ target = 0;
+
+ scx_bpf_dispatch(p, SCX_DSQ_LOCAL_ON | target, SCX_SLICE_DFL, enq_flags);
+}
+
+void BPF_STRUCT_OPS(ddsp_local_on_invalid_exit, struct scx_exit_info *ei)
+{
+ UEI_RECORD(uei, ei);
+}
+
+SEC(".struct_ops.link")
+struct sched_ext_ops ddsp_local_on_invalid_ops = {
+ .select_cpu = ddsp_local_on_invalid_select_cpu,
+ .enqueue = ddsp_local_on_invalid_enqueue,
+ .exit = ddsp_local_on_invalid_exit,
+ .name = "ddsp_local_on_invalid",
+ .timeout_ms = 2000U,
+};
diff --git a/tools/testing/selftests/sched_ext/ddsp_local_on_invalid.c b/tools/testing/selftests/sched_ext/ddsp_local_on_invalid.c
new file mode 100644
index 000000000000..7bc49df06ee0
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/ddsp_local_on_invalid.c
@@ -0,0 +1,58 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2024 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2024 David Vernet <dvernet@meta.com>
+ */
+#include <bpf/bpf.h>
+#include <scx/common.h>
+#include <unistd.h>
+#include "ddsp_local_on_invalid.bpf.skel.h"
+#include "scx_test.h"
+
+static enum scx_test_status setup(void **ctx)
+{
+ struct ddsp_local_on_invalid *skel;
+
+ skel = ddsp_local_on_invalid__open();
+ SCX_FAIL_IF(!skel, "Failed to open");
+
+ skel->rodata->nr_cpus = libbpf_num_possible_cpus();
+ SCX_FAIL_IF(ddsp_local_on_invalid__load(skel), "Failed to load skel");
+ *ctx = skel;
+
+ return SCX_TEST_PASS;
+}
+
+static enum scx_test_status run(void *ctx)
+{
+ struct ddsp_local_on_invalid *skel = ctx;
+ struct bpf_link *link;
+
+ link = bpf_map__attach_struct_ops(skel->maps.ddsp_local_on_invalid_ops);
+ SCX_FAIL_IF(!link, "Failed to attach struct_ops");
+
+ /* Just sleeping is fine, plenty of scheduling events happening */
+ sleep(1);
+
+ SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_ERROR));
+ bpf_link__destroy(link);
+
+ return SCX_TEST_PASS;
+}
+
+static void cleanup(void *ctx)
+{
+ struct ddsp_local_on_invalid *skel = ctx;
+
+ ddsp_local_on_invalid__destroy(skel);
+}
+
+struct scx_test ddsp_local_on_invalid = {
+ .name = "ddsp_local_on_invalid",
+ .description = "Verify we can gracefully handle direct dispatch "
+		       "of tasks to an invalid local DSQ from ops.dispatch()",
+ .setup = setup,
+ .run = run,
+ .cleanup = cleanup,
+};
+REGISTER_SCX_TEST(&ddsp_local_on_invalid)
--
2.45.2
* Re: [PATCH 10/11] sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()
2024-08-31 21:15 ` Andrea Righi
@ 2024-09-02 1:53 ` Changwoo Min
2024-09-02 5:59 ` Tejun Heo
0 siblings, 1 reply; 35+ messages in thread
From: Changwoo Min @ 2024-09-02 1:53 UTC (permalink / raw)
To: Andrea Righi, Tejun Heo
Cc: void, kernel-team, linux-kernel, Daniel Hodges, Dan Schatzberg,
kernel-dev@igalia.com
On 24. 9. 1. 06:15, Andrea Righi wrote:
>>>> +__bpf_kfunc bool scx_bpf_dispatch_from_dsq(struct bpf_iter_scx_dsq *it__iter,
>>>> + struct task_struct *p, u64 dsq_id,
>>>> + u64 slice, u64 enq_flags)
>>>> +{
>>>> + return scx_dispatch_from_dsq((struct bpf_iter_scx_dsq_kern *)it__iter,
>>>> + p, dsq_id, slice, 0, enq_flags);
>>>> +}
>>>> +
>>>> +/**
>>>> + * scx_bpf_dispatch_vtime_from_dsq - Move a task from DSQ iteration to a PRIQ DSQ
>>>> + * @it__iter: DSQ iterator in progress
>>>> + * @p: task to transfer
>>>> + * @dsq_id: DSQ to move @p to
>>>> + * @slice: duration @p can run for in nsecs, 0 to keep the current value
>>>> + * @vtime: @p's ordering inside the vtime-sorted queue of the target DSQ
>>>> + * @enq_flags: SCX_ENQ_*
>>>
>>> Hm... can we pass 6 arguments to a kfunc? I think we're limited to 5,
>>> unless I'm missing something here.
>>
>> Hah, I actually don't know and didn't test the vtime variant. Maybe I should
>> just drop the @slice and @vtime. They can be set by the caller explicitly
>> before calling these kfuncs anyway although there are some concerns around
>> ownership (ie. the caller can't be sure that the task has already been
>> dispatched by someone else before scx_bpf_dispatch_from_dsq() commits). Or
>> maybe I should pack the optional arguments into a struct. I'll think more
>> about it.
>
> IMHO we can simply drop them, introducing a separate struct makes the
> API a bit inconsistent with scx_bpf_dispatch() (and I don't think we
> want to change also scx_bpf_dispatch() for that).
Dropping @slice and @vtime would be cleaner in terms of the API
interface. Some use cases simply move a task from one DSQ to
another (e.g., from a shared DSQ to a per-domain DSQ).
* Re: [PATCH 10/11] sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()
2024-09-02 1:53 ` Changwoo Min
@ 2024-09-02 5:59 ` Tejun Heo
0 siblings, 0 replies; 35+ messages in thread
From: Tejun Heo @ 2024-09-02 5:59 UTC (permalink / raw)
To: Changwoo Min
Cc: Andrea Righi, void, kernel-team, linux-kernel, Daniel Hodges,
Dan Schatzberg, kernel-dev@igalia.com
Hello,
On Mon, Sep 02, 2024 at 10:53:33AM +0900, Changwoo Min wrote:
..
> > IMHO we can simply drop them, introducing a separate struct makes the
> > API a bit inconsistent with scx_bpf_dispatch() (and I don't think we
> > want to change also scx_bpf_dispatch() for that).
>
> Dropping @slice and @vtime would be cleaner in terms of the API
> interface. Some use cases simply move a task from one DSQ to
> another (e.g., from a shared DSQ to a per-domain DSQ).
Yeap, I sent an updated version where both parameters are dropped. There are
separate kfunc calls that can precede the dispatch call to override
slice/vtime if necessary.
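The resulting BPF-side calling pattern looks roughly like the following sketch (non-compilable pseudocode; the kfunc names are assumed from this discussion -- the v2 posting has the authoritative signatures):

```c
bpf_for_each(scx_dsq, p, SHARED_DSQ, 0) {
	/* optional overrides, staged on the iterator before the move */
	scx_bpf_dispatch_from_dsq_set_slice(BPF_FOR_EACH_ITER, slice_ns);
	scx_bpf_dispatch_from_dsq_set_vtime(BPF_FOR_EACH_ITER, vtime);

	/* returns false if @p was dispatched/dequeued by someone else */
	if (scx_bpf_dispatch_vtime_from_dsq(BPF_FOR_EACH_ITER, p,
					    domain_dsq_id, 0))
		break;
}
```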
Thanks.
--
tejun
Thread overview: 35+ messages
2024-08-30 11:03 [PATCHSET sched_ext/for-6.12] sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq() Tejun Heo
2024-08-30 11:03 ` [PATCH 01/11] sched_ext: Rename scx_kfunc_set_sleepable to unlocked and relocate Tejun Heo
2024-08-30 17:45 ` David Vernet
2024-08-30 11:03 ` [PATCH 02/11] sched_ext: Refactor consume_remote_task() Tejun Heo
2024-08-31 4:05 ` David Vernet
2024-08-31 5:33 ` Tejun Heo
2024-08-31 23:40 ` David Vernet
2024-08-30 11:03 ` [PATCH 03/11] sched_ext: Make find_dsq_for_dispatch() handle SCX_DSQ_LOCAL_ON Tejun Heo
2024-09-01 0:11 ` David Vernet
2024-08-30 11:03 ` [PATCH 04/11] sched_ext: Make dispatch_to_local_dsq() return void Tejun Heo
2024-08-30 17:44 ` [PATCH 04/11] sched_ext: Fix processs_ddsp_deferred_locals() by unifying DTL_INVALID handling Tejun Heo
2024-09-01 0:53 ` David Vernet
2024-09-01 0:56 ` David Vernet
2024-09-01 8:03 ` Tejun Heo
2024-09-01 15:35 ` David Vernet
2024-08-30 11:03 ` [PATCH 05/11] sched_ext: Restructure dispatch_to_local_dsq() Tejun Heo
2024-09-01 1:09 ` David Vernet
2024-08-30 11:03 ` [PATCH 06/11] sched_ext: Reorder args for consume_local/remote_task() Tejun Heo
2024-09-01 1:40 ` David Vernet
2024-09-01 6:37 ` Tejun Heo
2024-08-30 11:03 ` [PATCH 07/11] sched_ext: Move sanity check and dsq_mod_nr() into task_unlink_from_dsq() Tejun Heo
2024-09-01 1:42 ` David Vernet
2024-08-30 11:03 ` [PATCH 08/11] sched_ext: Move consume_local_task() upward Tejun Heo
2024-09-01 1:43 ` David Vernet
2024-08-30 11:03 ` [PATCH 09/11] sched_ext: Replace consume_local_task() with move_local_task_to_local_dsq() Tejun Heo
2024-09-01 1:55 ` David Vernet
2024-08-30 11:03 ` [PATCH 10/11] sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq() Tejun Heo
2024-08-31 14:30 ` Andrea Righi
2024-08-31 16:20 ` Tejun Heo
2024-08-31 21:15 ` Andrea Righi
2024-09-02 1:53 ` Changwoo Min
2024-09-02 5:59 ` Tejun Heo
2024-08-30 11:03 ` [PATCH 11/11] scx_qmap: Implement highpri boosting Tejun Heo
2024-08-30 20:59 ` [PATCH v2 " Tejun Heo
2024-08-30 17:31 ` [PATCHSET sched_ext/for-6.12] sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq() Tejun Heo